Functions, packages and libraries ¶
function: A function in programming is a reusable block of code designed to perform a specific task. It takes inputs (optional), processes them, and returns an output (optional).
For example, in the expression mean(c(2,3,4)), mean() is a function that takes the vector c(2,3,4) as input, calculates the mean of the vector elements, and returns the calculated mean.
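As a minimal sketch, a function can also be user-defined; add_two below is just an illustrative name:

# a user-defined function: takes one input, processes it, returns an output
add_two <- function(x) {
  x + 2
}
add_two(5)   # returns 7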
package: A collection of multiple functions, data sets, and documentation, usually designed for a specific purpose (e.g., data analysis, machine learning, visualization).
R supports package installation from several repositories, primarily CRAN (the official repository maintained by the R community). Installing a new package adds new functions, datasets and documentation to the existing R installation. It can be done with install.packages("packagename").
library: Installing a package in R is similar to installing Microsoft Office on a PC: you only need to do it once unless updates are required. However, when working in R, you must load the necessary packages into the working environment each time you need them, just as you open Excel when you need to work on spreadsheets. If you don't require spreadsheet tasks on a given day, you simply don't open Excel. A package is loaded with library(packagename).
mean(c(2,3,4))
#installing a package
install.packages("jtools")
Installing package into 'C:/Users/Dell/AppData/Local/R/win-library/4.4' (as 'lib' is unspecified)
package 'jtools' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Dell\AppData\Local\Temp\RtmpoDkbIt\downloaded_packages
# Now if I need jtools in this session of R, I need to load it using library()
library(jtools)
Now, apart from the built-in functions of R, the new functionality defined by the "jtools" package has been installed and loaded in this R session.
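For example, summ() is one of the functions jtools provides; a minimal sketch using the built-in mtcars data (output not shown):

# summ() comes from jtools and prints a tidy summary of a fitted model
model <- lm(mpg ~ wt, data = mtcars)
summ(model)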
Some common packages to install are ¶
- tidyverse (automatically installs a whole collection of packages)
- lmtest
- jtools
- car
library(tidyverse) #installation of each package is not shown
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The conflict ✖ dplyr::filter() masks stats::filter() indicates that ¶
There was already a function called filter() defined under stats (stats is a package pre-installed with R), and now the dplyr package also provides a filter() function.
So R reports the conflict.
In this situation it is advisable to use the packagename::function() form instead, for example stats::filter().
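As a minimal illustration, both masked functions can be called explicitly through their namespaces (mtcars is a built-in dataset used here only as an example):

stats::filter(1:10, rep(1/3, 3))   # moving-average filter from the stats package
dplyr::filter(mtcars, cyl == 6)    # row-filtering verb from the dplyr package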
T-Test ¶
# read data from external sources
dataurl="https://raw.githubusercontent.com/madhuko/madhuko.github.io/refs/heads/main/datasets/R/employee.csv"
employee=read.csv(dataurl)
head(employee)
 | id | gender | educ | jobcat | salary | salbegin | jobtime | prevexp | minority |
---|---|---|---|---|---|---|---|---|---|
 | <int> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <int> |
1 | 1 | m | 15 | 3 | 57000 | 27000 | 98 | 144 | 0 |
2 | 2 | m | 16 | 1 | 40200 | 18750 | 98 | 36 | 0 |
3 | 3 | f | 12 | 1 | 21450 | 12000 | 98 | 381 | 0 |
4 | 4 | f | 8 | 1 | 21900 | 13200 | 98 | 190 | 0 |
5 | 5 | m | 15 | 1 | 45000 | 21000 | 98 | 138 | 0 |
6 | 6 | m | 15 | 1 | 32100 | 13500 | 98 | 67 | 0 |
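Before testing, it can be useful to glance at the structure of the imported data frame; a minimal sketch using base R (output not shown):

str(employee)              # column names and types
summary(employee$salary)   # five-number summary of the variable we will test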
#one sample t-test
t_test_result=t.test(employee$salary,mu=35000)
t_test_result
	One Sample t-test

data:  employee$salary
t = -0.74005, df = 473, p-value = 0.4596
alternative hypothesis: true mean is not equal to 35000
95 percent confidence interval:
 32878.40 35960.73
sample estimates:
mean of x 
 34419.57
The test was done to check whether the true mean equals 35000. As per the result, the p-value is 0.4596, which means we fail to reject the null hypothesis: the data are consistent with a true mean of 35000.
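The object returned by t.test() is a list, so the individual pieces of the result can be pulled out directly; a minimal sketch:

t_test_result$p.value    # 0.4596, the p-value reported above
t_test_result$conf.int   # 95 percent confidence interval
t_test_result$estimate   # sample mean of x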
#one sample t-test
t_test_result=t.test(employee$salary,mu=45000)
t_test_result
	One Sample t-test

data:  employee$salary
t = -13.49, df = 473, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 45000
95 percent confidence interval:
 32878.40 35960.73
sample estimates:
mean of x 
 34419.57
In this case the p-value is close to zero, indicating that the null hypothesis is rejected; hence, the true mean is significantly different from 45000.
# to test at a different confidence level
t.test(employee$salary,mu = 55000,conf.level = .90)
	One Sample t-test

data:  employee$salary
t = -26.24, df = 473, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 55000
90 percent confidence interval:
 33126.96 35712.18
sample estimates:
mean of x 
 34419.57
#difference between two means
aggregate(employee$salary, list(employee$gender),FUN=mean)
aggregate(employee$salary, list(employee$gender),FUN=sd)
aggregate(employee$salary, list(employee$gender),FUN=min)
aggregate(employee$salary, list(employee$gender),FUN=max)
Mean:

Group.1 | x |
---|---|
<chr> | <dbl> |
f | 26031.92 |
m | 41441.78 |

Standard deviation:

Group.1 | x |
---|---|
<chr> | <dbl> |
f | 7558.021 |
m | 19499.214 |

Minimum:

Group.1 | x |
---|---|
<chr> | <int> |
f | 15750 |
m | 19650 |

Maximum:

Group.1 | x |
---|---|
<chr> | <int> |
f | 58125 |
m | 135000 |
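Since the tidyverse was loaded earlier, the same group-wise summaries can also be produced with dplyr; a minimal sketch (output not shown):

employee %>%
  group_by(gender) %>%
  summarise(mean_salary = mean(salary),
            sd_salary   = sd(salary),
            min_salary  = min(salary),
            max_salary  = max(salary))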
# While testing two means with the formula interface, always put the numerical variable first and the categorical (grouping) variable second, irrespective of which is dependent or independent
t_output=t.test(employee$salary~employee$gender,alt="two.sided")
t_output
	Welch Two Sample t-test

data:  employee$salary by employee$gender
t = -11.688, df = 344.26, p-value < 2.2e-16
alternative hypothesis: true difference in means between group f and group m is not equal to 0
95 percent confidence interval:
 -18003.00 -12816.73
sample estimates:
mean in group f mean in group m 
       26031.92        41441.78
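For comparison, a minimal sketch of the same Welch test written with two separate vectors instead of the formula interface (the result matches the output above):

# split salaries by gender and pass the two vectors directly
f_salary <- employee$salary[employee$gender == "f"]
m_salary <- employee$salary[employee$gender == "m"]
t.test(f_salary, m_salary, alternative = "two.sided")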
# In the context of the t.test() function in R, the var.equal argument specifies whether you assume the variances of the two groups being compared are equal.
# This is important for determining which version of the t-test to use.
var_e_result=t.test(employee$salary~employee$gender,alt="two.sided", var.equal=TRUE)
var_e_result
	Two Sample t-test

data:  employee$salary by employee$gender
t = -10.945, df = 472, p-value < 2.2e-16
alternative hypothesis: true difference in means between group f and group m is not equal to 0
95 percent confidence interval:
 -18176.40 -12643.32
sample estimates:
mean in group f mean in group m 
       26031.92        41441.78
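The equal-variance assumption behind var.equal = TRUE can itself be checked; a minimal sketch using the F test var.test() from the pre-installed stats package (output not shown; the group standard deviations above already suggest the variances differ):

var.test(employee$salary ~ employee$gender)   # H0: the two group variances are equal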
# one-tailed t-test instead of two-tailed
t.test(employee$salary~employee$gender,alt="less")
	Welch Two Sample t-test

data:  employee$salary by employee$gender
t = -11.688, df = 344.26, p-value < 2.2e-16
alternative hypothesis: true difference in means between group f and group m is less than 0
95 percent confidence interval:
     -Inf -13235.43
sample estimates:
mean in group f mean in group m 
       26031.92        41441.78
Paired T-test ¶
before=c(70,75,80,76,67,59,60)
after=c(60,63,75,70,62,55,55)
t.test(before,after,paired=TRUE,alt="two.sided")
	Paired t-test

data:  before and after
t = 5.8446, df = 6, p-value = 0.001106
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 3.903288 9.525284
sample estimates:
mean difference 
       6.714286
# similar functionality can be achieved via
t.test(before-after, mu=0, alternative = "t")   # "t" partially matches "two.sided"
	One Sample t-test

data:  before - after
t = 5.8446, df = 6, p-value = 0.001106
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 3.903288 9.525284
sample estimates:
mean of x 
 6.714286
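To make the equivalence explicit, the paired t statistic can be recomputed by hand from the differences; a minimal sketch:

d <- before - after
t_stat <- mean(d) / (sd(d) / sqrt(length(d)))   # t = mean difference / standard error
t_stat                                          # 5.8446, matching the output above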
# a t-test can be done with a different number of observations in each series as well
female=c(60,63,75,70,62,55,55)
male=c(70,75,80,76,67,59,60,70)
t.test(male,female)
	Welch Two Sample t-test

data:  male and female
t = 1.7569, df = 12.753, p-value = 0.1029
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.570741 15.106455
sample estimates:
mean of x mean of y 
 69.62500  62.85714