ANOVA in R ¶
Comparing Means of Three or More Groups
Example: Comparing the average income of employees across three different industries
ANOVA is used when the independent variable (factor) has three or more categories (levels).
Example: Analyzing how different brands of fertilizer (Brand A, Brand B, Brand C) affect crop yield.
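Before loading real data, here is a minimal sketch of a one-way ANOVA on simulated data for the hypothetical fertilizer example (the group means 50/55/60 are made-up assumptions for illustration):

# Minimal one-way ANOVA sketch on simulated data (hypothetical fertilizer example)
set.seed(42)
yield = c(rnorm(10, mean = 50), rnorm(10, mean = 55), rnorm(10, mean = 60))
brand = factor(rep(c("A", "B", "C"), each = 10))
summary(aov(yield ~ brand)) # a small p-value suggests the mean yields differ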
# read data from external sources
dataurl="https://raw.githubusercontent.com/madhuko/madhuko.github.io/refs/heads/main/datasets/R/employee.csv"
employee=read.csv(dataurl)
head(employee)
| | id | gender | educ | jobcat | salary | salbegin | jobtime | prevexp | minority |
|---|---|---|---|---|---|---|---|---|---|
| | <int> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <int> |
| 1 | 1 | m | 15 | 3 | 57000 | 27000 | 98 | 144 | 0 |
| 2 | 2 | m | 16 | 1 | 40200 | 18750 | 98 | 36 | 0 |
| 3 | 3 | f | 12 | 1 | 21450 | 12000 | 98 | 381 | 0 |
| 4 | 4 | f | 8 | 1 | 21900 | 13200 | 98 | 190 | 0 |
| 5 | 5 | m | 15 | 1 | 45000 | 21000 | 98 | 138 | 0 |
| 6 | 6 | m | 15 | 1 | 32100 | 13500 | 98 | 67 | 0 |
# aggregate is like a pivot table in Excel
aggregate(employee$salary, list(employee$jobcat), FUN = mean)
aggregate(employee$salary, list(employee$jobcat, employee$gender), FUN = mean)
# If we provide arguments with proper names, the order of the arguments does not matter
# x should be numerical, by should be categorical
aggregate(FUN = mean, by = list(employee$jobcat), x = employee$salary)
Mean salary by job category:

| Group.1 | x |
|---|---|
| <int> | <dbl> |
| 1 | 27838.54 |
| 2 | 30938.89 |
| 3 | 63977.80 |
Mean salary by job category and gender (there are no females in job category 2, so no such row appears):

| Group.1 | Group.2 | x |
|---|---|---|
| <int> | <chr> | <dbl> |
| 1 | f | 25003.69 |
| 3 | f | 47213.50 |
| 1 | m | 31558.15 |
| 2 | m | 30938.89 |
| 3 | m | 66243.24 |
Output of the named-argument call (identical to the first table):

| Group.1 | x |
|---|---|
| <int> | <dbl> |
| 1 | 27838.54 |
| 2 | 30938.89 |
| 3 | 63977.80 |
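As a side note, aggregate also accepts a formula interface, which many find more readable; a sketch that should reproduce the tables above:

# Equivalent formula interface (same results as the calls above)
aggregate(salary ~ jobcat, data = employee, FUN = mean)
aggregate(salary ~ jobcat + gender, data = employee, FUN = mean)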
emp_anova=aov(employee$salary~as.factor(employee$jobcat)) # as.factor forces a variable to be a factor; particularly useful when numeric codes are used as categorical predictors
summary(emp_anova)
                            Df    Sum Sq   Mean Sq F value Pr(>F)
as.factor(employee$jobcat)   2 8.944e+10 4.472e+10   434.5 <2e-16 ***
Residuals                  471 4.848e+10 1.029e+08
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
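ANOVA assumes roughly equal variances across groups. As an optional check (not part of the original analysis), bartlett.test can be run before trusting the F-test:

# Optional assumption check: homogeneity of variances across job categories
bartlett.test(salary ~ as.factor(jobcat), data = employee)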
Two-way ANOVA ¶
two_way_anova=aov(salary ~ as.factor(jobcat) + as.factor(gender) + as.factor(jobcat):as.factor(gender), data = employee)
summary(two_way_anova)
                                     Df    Sum Sq   Mean Sq F value   Pr(>F)
as.factor(jobcat)                     2 8.944e+10 4.472e+10  505.87  < 2e-16 ***
as.factor(gender)                     1 5.770e+09 5.770e+09   65.27 5.57e-15 ***
as.factor(jobcat):as.factor(gender)   1 1.248e+09 1.248e+09   14.11 0.000194 ***
Residuals                           469 4.146e+10 8.840e+07
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
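A note on the formula: in R, A * B is shorthand for A + B + A:B, so the same model can be fit more compactly:

# Equivalent, more compact formula: * expands to main effects plus interaction
two_way_anova = aov(salary ~ as.factor(jobcat) * as.factor(gender), data = employee)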
Interpretation of two-way ANOVA ¶
Job category has a highly significant impact on the dependent variable.
Gender has a statistically significant effect on the dependent variable.
There is a significant interaction effect between job category and gender.
This means that the effect of job category depends on gender (e.g., gender influences salary differently across job categories)
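One way to visualize this dependence is an interaction plot; roughly parallel lines would indicate little interaction. A sketch using base R's interaction.plot:

# Interaction plot: mean salary by job category, one line per gender
interaction.plot(x.factor = as.factor(employee$jobcat),
                 trace.factor = employee$gender,
                 response = employee$salary,
                 fun = mean, xlab = "Job category", ylab = "Mean salary",
                 trace.label = "Gender")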
# Post-hoc analysis: to check which groups differ
TukeyHSD(aov(salary ~ as.factor(jobcat) + as.factor(gender) + as.factor(jobcat):as.factor(gender), data = employee))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = salary ~ as.factor(jobcat) + as.factor(gender) + as.factor(jobcat):as.factor(gender), data = employee)

$`as.factor(jobcat)`
         diff       lwr       upr     p adj
2-1  3100.349 -1309.375  7510.072 0.2246099
3-1 36139.258 33462.710 38815.806 0.0000000
3-2 33038.909 28148.398 37929.419 0.0000000

$`as.factor(gender)`
        diff      lwr      upr p adj
m-f 6392.997 4689.067 8096.927     0

$`as.factor(jobcat):as.factor(gender)`
               diff         lwr       upr     p adj
2:f-1:f          NA          NA        NA        NA
3:f-1:f   22209.811  13497.7211 30921.900 0.0000000
1:m-1:f    6554.464   3704.1086  9404.818 0.0000000
2:m-1:f    5935.200    428.4981 11441.901 0.0261600
3:m-1:f   41239.554  37593.2017 44885.906 0.0000000
3:f-2:f          NA          NA        NA        NA
1:m-2:f          NA          NA        NA        NA
2:m-2:f          NA          NA        NA        NA
3:m-2:f          NA          NA        NA        NA
1:m-3:f  -15655.347 -24430.1522 -6880.542 0.0000071
2:m-3:f  -16274.611 -26234.3522 -6314.870 0.0000561
3:m-3:f   19029.743   9965.0554 28094.431 0.0000001
2:m-1:m    -619.264  -6224.6596  4986.132 0.9995786
3:m-1:m   34685.090  30891.3346 38478.846 0.0000000
3:m-2:m   35304.354  29255.2370 41353.472 0.0000000
Job category 3 earns significantly more than job categories 1 and 2.
No significant difference between job categories 1 and 2.
Males earn significantly more than females overall. (The NA rows arise because there are no females in job category 2, so comparisons involving the 2:f cell cannot be computed.)
Meaning of the interaction effect:
Within job categories 1 and 3, males earn significantly more than females (category 2 contains no females).
Job category 3 males (3:m) earn the highest, significantly more than every other group.
No significant salary difference between job categories 1 and 2 for males (2:m - 1:m).
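The pairwise differences and their confidence intervals can also be inspected graphically via the plot method for TukeyHSD objects; a quick sketch:

# Confidence-interval plot of the pairwise differences
tukey = TukeyHSD(two_way_anova)
plot(tukey, las = 1) # las = 1 keeps axis labels horizontal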
# Graphical presentation
# ggplot2 will be discussed later
library(ggplot2)
ggplot(employee, aes(x = as.factor(jobcat), y = salary, fill = gender)) + # jobcat is numeric, so it must be converted to a factor for a grouped boxplot
geom_boxplot() +
theme_minimal()
Chi-Square Test ¶
It is used for:
1. Chi-Square Test for Independence: to check if two categorical variables are independent (unrelated) or dependent (related).
2. Chi-Square Goodness-of-Fit Test: to check if observed frequencies match expected frequencies in one categorical variable (a sketch follows below).
A t-test is used with one numerical variable and one categorical variable, where the categorical variable has two groups.
A chi-square test requires both variables to be categorical.
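The independence test is demonstrated below on gender and jobcat. For the goodness-of-fit use case, a hedged sketch (the 50/50 expected split for gender is purely an illustrative assumption):

# Goodness-of-fit sketch: does gender follow an assumed 50/50 split?
chisq.test(table(employee$gender), p = c(0.5, 0.5))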
# Crosstab operations
table2 = table(employee$gender, employee$jobcat)
table2 # raw counts are risky to interpret; it is advisable to interpret in percentage terms
      1   2   3
  f 206   0  10
  m 157  27  74
# this gives the joint probabilities
prop.table(table2) # the sum of all cells should be 1
             1          2          3
  f 0.43459916 0.00000000 0.02109705
  m 0.33122363 0.05696203 0.15611814
# Row percentages (most important)
# Rule of thumb: when making a crosstab, place the independent variable in the rows and the dependent variable in the columns; only then does margin = 1 give a sensible result
prop.table(table2, margin = 1) # each row must sum to 1
            1         2         3
  f 0.9537037 0.0000000 0.0462963
  m 0.6085271 0.1046512 0.2868217
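For comparison, margin = 2 gives column percentages, i.e., the distribution of gender within each job category; each column then sums to 1:

# Column percentages: gender distribution within each job category
prop.table(table2, margin = 2)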
chisq.test(table2)
	Pearson's Chi-squared test

data:  table2
X-squared = 79.277, df = 2, p-value < 2.2e-16
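The very small p-value indicates that gender and job category are not independent. To see which cells drive the association, the components returned by chisq.test can be inspected:

# Inspect the fitted test object
chi = chisq.test(table2)
chi$expected  # expected counts under independence
chi$residuals # Pearson residuals; large absolute values flag influential cells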