Analysis of Variance

To begin our foray into statistics in R, we will start with the most basic and useful analysis, Analysis of Variance (ANOVA). An ANOVA is used to test the effect of one or more categorical explanatory variables (X) on a continuous response variable (Y). The ANOVA compares the factor's variance (how far the group means lie from the grand mean) with the error variance (how far individual observations lie from their own group mean). Each deviation is squared and the squares are added together to generate a sum of squares; dividing each sum of squares by its degrees of freedom gives a mean square, and the ratio of the factor mean square to the error mean square is the F ratio. The F ratio is then compared to a critical value from the F distribution (based on the degrees of freedom) to determine the level of significance, or P value.
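
As a rough illustration of the calculation described above, the sketch below builds the F ratio by hand and cross-checks it against aov(). It uses the built-in PlantGrowth dataset as a stand-in (an assumption; it is not one of the tutorial datasets), and all variable names are illustrative.

```r
# PlantGrowth ships with base R: 'weight' (continuous) by 'group' (3-level factor).
# It is a stand-in here, not one of the tutorial's own datasets.
data(PlantGrowth)
y <- PlantGrowth$weight
g <- PlantGrowth$group

grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)    # one mean per group
group_n     <- tapply(y, g, length)  # number of replicates per group

# Factor (between-group) sum of squares: group means vs. the grand mean
ss_factor <- sum(group_n * (group_means - grand_mean)^2)
# Error (within-group) sum of squares: observations vs. their own group mean
ss_error  <- sum((y - group_means[as.character(g)])^2)

df_factor <- nlevels(g) - 1
df_error  <- length(y) - nlevels(g)

# F ratio = factor mean square / error mean square
f_ratio <- (ss_factor / df_factor) / (ss_error / df_error)
# P value from the F distribution (this replaces the printed-table lookup)
p_value <- pf(f_ratio, df_factor, df_error, lower.tail = FALSE)

# Cross-check against R's built-in one-factor fit
fit <- aov(weight ~ group, data = PlantGrowth)
tab <- summary(fit)[[1]]
print(c(by_hand = f_ratio, from_aov = tab[["F value"]][1]))
```

The two printed values should agree, which is a quick way to confirm that aov() performs the same sum-of-squares arithmetic.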

In these tutorials, we will be using a variety of datasets to test each analysis. These will be specified within an “instruction” block before the requisite code chunk.

Content:

  • Analysis of Variance
  • Analyse the effect of multiple categorical predictor variables on a continuous response variable

  • Assumptions
  • All statistical tests make assumptions about your data. This is because the algorithms at work assume that your data fit a specific distribution, have equal variances, and that your replicates are independent of one another (for example, not spatially autocorrelated). Violations of these assumptions can cause the test to produce a false positive, as an analysis of variance is sensitive to violations of the assumptions of normality and homogeneity of variance (also called homoscedasticity).

  • Viewing results
  • Once we know our data is normal and we have our aov() object, we can use one of two commands on this object to generate our statistical result. The usual way to do so is to use the anova() command.

      anova(weeds.aov) # run an anova on the object
      ## Analysis of Variance Table
      ##
      ## Response: flowers
      ##         Df Sum Sq Mean Sq F value Pr(>F)
      ## species  2 2368.6 1184.

  • Two-factor ANOVAs
  • Conducting a two-factor ANOVA is pretty straightforward.

      weeds.aov2 <- aov(flowers ~ species + soil, data = weeds) # two-factor anova (without interaction)
      summary(weeds.aov2)
      ##             Df Sum Sq Mean Sq F value   Pr(>F)
      ## species      2   2369  1184.3   9.272 0.000436 ***
      ## soil         1    239   238.5   1.867 0.178720
      ## Residuals   44   5620   127.7
      ## ---
      ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    This example constructs an ANOVA with two factors, but does not include the interaction term.

  • Tukey's HSD
  • All of our analyses so far have shown us that species has an influence on flower abundance. But without conducting an extra test, we cannot be certain which species differ significantly from each other in their effect on flower abundance.

      TukeyHSD(weeds.aov)
      ## Tukey multiple comparisons of means
      ## 95% family-wise confidence level
      ##
      ## Fit: aov(formula = flowers ~ species, data = weeds)
      ##
      ## $species
      ## diff lwr upr p adj
      ## Olearia-Coprosma 12.
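
The steps previewed in the list above can be sketched end to end as follows. Because the weeds dataset is not included here, the sketch uses the built-in warpbreaks data as a stand-in (an assumption), with breaks as the response and wool and tension as the two factors. The choice of shapiro.test() and bartlett.test() for the assumption checks is also an assumption; the tutorials may use other diagnostics.

```r
# warpbreaks ships with base R: 'breaks' (response), 'wool' and 'tension' (factors).
# It stands in for the tutorial's 'weeds' data, which is not available here.
data(warpbreaks)

# One-factor fit
wb.aov <- aov(breaks ~ tension, data = warpbreaks)

# Assumption checks (one possible choice of tests):
# normality of the residuals ...
sw <- shapiro.test(residuals(wb.aov))
# ... and homogeneity of variance across the groups
bt <- bartlett.test(breaks ~ tension, data = warpbreaks)
print(c(shapiro_p = sw$p.value, bartlett_p = bt$p.value))

# Two ways to view the result of the same fit
print(anova(wb.aov))
print(summary(wb.aov))

# Two-factor ANOVA including the interaction term:
# 'wool * tension' expands to wool + tension + wool:tension
wb.aov2 <- aov(breaks ~ wool * tension, data = warpbreaks)
print(summary(wb.aov2))

# Post hoc pairwise comparisons between the levels of 'tension'
tk <- TukeyHSD(wb.aov)
print(tk$tension)
```

Small P values from the two assumption checks would suggest that the data depart from normality or equal variances, in which case the ANOVA results should be read with caution.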