ANOVA in R ¶
Comparing Means of Three or More Groups
Example: Comparing the average income of employees across three different industries
ANOVA is used when the independent variable (factor) has three or more categories (levels).
Example: Analyzing how different brands of fertilizer (Brand A, Brand B, Brand C) affect crop yield.
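Before loading real data, here is a minimal sketch of a one-way ANOVA on simulated data for the hypothetical fertilizer example (the group means 50/55/60 are made-up assumptions for illustration):

# Minimal one-way ANOVA sketch on simulated data (hypothetical fertilizer example)
set.seed(42)
yield = c(rnorm(10, mean = 50), rnorm(10, mean = 55), rnorm(10, mean = 60))
brand = factor(rep(c("A", "B", "C"), each = 10))
summary(aov(yield ~ brand)) # a small p-value suggests the mean yields differ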
# read data from external sources
dataurl="https://raw.githubusercontent.com/madhuko/madhuko.github.io/refs/heads/main/datasets/R/employee.csv"
employee=read.csv(dataurl)
head(employee)
| | id | gender | educ | jobcat | salary | salbegin | jobtime | prevexp | minority |
|---|---|---|---|---|---|---|---|---|---|
| | <int> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <int> |
| 1 | 1 | m | 15 | 3 | 57000 | 27000 | 98 | 144 | 0 |
| 2 | 2 | m | 16 | 1 | 40200 | 18750 | 98 | 36 | 0 |
| 3 | 3 | f | 12 | 1 | 21450 | 12000 | 98 | 381 | 0 |
| 4 | 4 | f | 8 | 1 | 21900 | 13200 | 98 | 190 | 0 |
| 5 | 5 | m | 15 | 1 | 45000 | 21000 | 98 | 138 | 0 |
| 6 | 6 | m | 15 | 1 | 32100 | 13500 | 98 | 67 | 0 |
# aggregate is like a pivot table in Excel
aggregate(employee$salary, list(employee$jobcat), FUN = mean)
aggregate(employee$salary, list(employee$jobcat, employee$gender), FUN = mean)
# If we provide arguments with proper names, the order of the arguments does not matter
# x should be numerical, by should be categorical
aggregate(FUN = mean, by = list(employee$jobcat), x = employee$salary)
Mean salary by job category:

| Group.1 | x |
|---|---|
| <int> | <dbl> |
| 1 | 27838.54 |
| 2 | 30938.89 |
| 3 | 63977.80 |
Mean salary by job category and gender (there are no females in job category 2, so no such row appears):

| Group.1 | Group.2 | x |
|---|---|---|
| <int> | <chr> | <dbl> |
| 1 | f | 25003.69 |
| 3 | f | 47213.50 |
| 1 | m | 31558.15 |
| 2 | m | 30938.89 |
| 3 | m | 66243.24 |
Output of the named-argument call (identical to the first table):

| Group.1 | x |
|---|---|
| <int> | <dbl> |
| 1 | 27838.54 |
| 2 | 30938.89 |
| 3 | 63977.80 |
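As a side note, aggregate also accepts a formula interface, which many find more readable; a sketch that should reproduce the tables above:

# Equivalent formula interface (same results as the calls above)
aggregate(salary ~ jobcat, data = employee, FUN = mean)
aggregate(salary ~ jobcat + gender, data = employee, FUN = mean)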
emp_anova=aov(employee$salary~as.factor(employee$jobcat)) # as.factor forces a variable to be a factor; particularly useful when numeric codes are used as categorical predictors
summary(emp_anova)
                            Df    Sum Sq   Mean Sq F value Pr(>F)
as.factor(employee$jobcat)   2 8.944e+10 4.472e+10   434.5 <2e-16 ***
Residuals                  471 4.848e+10 1.029e+08
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
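ANOVA assumes roughly equal variances across groups. As an optional check (not part of the original analysis), bartlett.test can be run before trusting the F-test:

# Optional assumption check: homogeneity of variances across job categories
bartlett.test(salary ~ as.factor(jobcat), data = employee)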
Two-way ANOVA ¶
two_way_anova=aov(salary ~ as.factor(jobcat) + as.factor(gender) + as.factor(jobcat):as.factor(gender), data = employee)
summary(two_way_anova)
                                     Df    Sum Sq   Mean Sq F value   Pr(>F)
as.factor(jobcat)                     2 8.944e+10 4.472e+10  505.87  < 2e-16 ***
as.factor(gender)                     1 5.770e+09 5.770e+09   65.27 5.57e-15 ***
as.factor(jobcat):as.factor(gender)   1 1.248e+09 1.248e+09   14.11 0.000194 ***
Residuals                           469 4.146e+10 8.840e+07
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
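A note on the formula: in R, A * B is shorthand for A + B + A:B, so the same model can be fit more compactly:

# Equivalent, more compact formula: * expands to main effects plus interaction
two_way_anova = aov(salary ~ as.factor(jobcat) * as.factor(gender), data = employee)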
Interpretation of two-way ANOVA ¶
Job category has a highly significant impact on the dependent variable.
Gender has a statistically significant effect on the dependent variable.
There is a significant interaction effect between job category and gender.
This means that the effect of job category depends on gender (e.g., gender influences salary differently across job categories)
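One way to visualize this dependence is an interaction plot; roughly parallel lines would indicate little interaction. A sketch using base R's interaction.plot:

# Interaction plot: mean salary by job category, one line per gender
interaction.plot(x.factor = as.factor(employee$jobcat),
                 trace.factor = employee$gender,
                 response = employee$salary,
                 fun = mean, xlab = "Job category", ylab = "Mean salary",
                 trace.label = "Gender")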
# Post-hoc analysis: to check which groups differ
TukeyHSD(aov(salary ~ as.factor(jobcat) + as.factor(gender) + as.factor(jobcat):as.factor(gender), data = employee))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = salary ~ as.factor(jobcat) + as.factor(gender) + as.factor(jobcat):as.factor(gender), data = employee)

$`as.factor(jobcat)`
         diff       lwr       upr     p adj
2-1  3100.349 -1309.375  7510.072 0.2246099
3-1 36139.258 33462.710 38815.806 0.0000000
3-2 33038.909 28148.398 37929.419 0.0000000

$`as.factor(gender)`
        diff      lwr      upr p adj
m-f 6392.997 4689.067 8096.927     0

$`as.factor(jobcat):as.factor(gender)`
               diff         lwr       upr     p adj
2:f-1:f          NA          NA        NA        NA
3:f-1:f   22209.811  13497.7211 30921.900 0.0000000
1:m-1:f    6554.464   3704.1086  9404.818 0.0000000
2:m-1:f    5935.200    428.4981 11441.901 0.0261600
3:m-1:f   41239.554  37593.2017 44885.906 0.0000000
3:f-2:f          NA          NA        NA        NA
1:m-2:f          NA          NA        NA        NA
2:m-2:f          NA          NA        NA        NA
3:m-2:f          NA          NA        NA        NA
1:m-3:f  -15655.347 -24430.1522 -6880.542 0.0000071
2:m-3:f  -16274.611 -26234.3522 -6314.870 0.0000561
3:m-3:f   19029.743   9965.0554 28094.431 0.0000001
2:m-1:m    -619.264  -6224.6596  4986.132 0.9995786
3:m-1:m   34685.090  30891.3346 38478.846 0.0000000
3:m-2:m   35304.354  29255.2370 41353.472 0.0000000
Job category 3 earns significantly more than job categories 1 and 2.
No significant difference between job categories 1 and 2.
Males earn significantly more than females overall. (The NA rows arise because there are no females in job category 2, so comparisons involving the 2:f cell cannot be computed.)
Meaning of the interaction effect:
Within job categories 1 and 3, males earn significantly more than females (category 2 contains no females).
Job category 3 males (3:m) earn the highest, significantly more than every other group.
No significant salary difference between job categories 1 and 2 for males (2:m - 1:m).
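The pairwise differences and their confidence intervals can also be inspected graphically via the plot method for TukeyHSD objects; a quick sketch:

# Confidence-interval plot of the pairwise differences
tukey = TukeyHSD(two_way_anova)
plot(tukey, las = 1) # las = 1 keeps axis labels horizontal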
# Graphical presentation
# ggplot2 will be discussed later
library(ggplot2)
ggplot(employee, aes(x = as.factor(jobcat), y = salary, fill = gender)) + # jobcat is numeric, so it must be converted to a factor for a grouped boxplot
geom_boxplot() +
theme_minimal()
Chi-Square Test ¶
It is used for:
1. Chi-Square Test for Independence: to check if two categorical variables are independent (unrelated) or dependent (related).
2. Chi-Square Goodness-of-Fit Test: to check if observed frequencies match expected frequencies in one categorical variable (a sketch follows below).
A t-test is used with one numerical variable and one categorical variable, where the categorical variable has two groups.
A chi-square test requires both variables to be categorical.
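The independence test is demonstrated below on gender and jobcat. For the goodness-of-fit use case, a hedged sketch (the 50/50 expected split for gender is purely an illustrative assumption):

# Goodness-of-fit sketch: does gender follow an assumed 50/50 split?
chisq.test(table(employee$gender), p = c(0.5, 0.5))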
# Crosstab operations
table2 = table(employee$gender, employee$jobcat)
table2 # raw counts are risky to interpret; it is advisable to interpret in percentage terms
      1   2   3
  f 206   0  10
  m 157  27  74
# this gives the joint probabilities
prop.table(table2) # the sum of all cells should be 1
             1          2          3
  f 0.43459916 0.00000000 0.02109705
  m 0.33122363 0.05696203 0.15611814
# Row percentages (most important)
# Rule of thumb: when making a crosstab, place the independent variable in the rows and the dependent variable in the columns; only then does margin = 1 give a sensible result
prop.table(table2, margin = 1) # each row must sum to 1
            1         2         3
  f 0.9537037 0.0000000 0.0462963
  m 0.6085271 0.1046512 0.2868217
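For comparison, margin = 2 gives column percentages, i.e., the distribution of gender within each job category; each column then sums to 1:

# Column percentages: gender distribution within each job category
prop.table(table2, margin = 2)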
chisq.test(table2)
	Pearson's Chi-squared test

data:  table2
X-squared = 79.277, df = 2, p-value < 2.2e-16
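The very small p-value indicates that gender and job category are not independent. To see which cells drive the association, the components returned by chisq.test can be inspected:

# Inspect the fitted test object
chi = chisq.test(table2)
chi$expected  # expected counts under independence
chi$residuals # Pearson residuals; large absolute values flag influential cells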