Functions, packages and libraries ¶
function: A function in programming is a reusable block of code designed to perform a specific task. It takes inputs (optional), processes them, and returns an output (optional).
For example, in the expression mean(c(2,3,4)), mean() is a function that takes the vector c(2,3,4) as input, calculates the mean of the vector elements, and returns the calculated mean.
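As a minimal sketch, a function can also be user-defined; add_two below is just an illustrative name:

# a user-defined function: takes one input, processes it, returns an output
add_two <- function(x) {
  x + 2
}
add_two(5)   # returns 7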
package: A collection of multiple functions, data sets, and documentation, usually designed for a specific purpose (e.g., data analysis, machine learning, visualization).
R supports package installation from several repositories, primarily CRAN (the official repository maintained by the R community). Installing a new package adds new functions, datasets and documentation to the existing R installation. It can be done with install.packages("packagename").
library: Installing a package in R is similar to installing Microsoft Office on a PC: you only need to do it once unless updates are required. However, when working in R, you must load the necessary packages into the working environment each time you need them, just as you open Excel when you need to work on spreadsheets. If you don't require spreadsheet tasks on a given day, you simply don't open Excel. A package is loaded with library(packagename).
mean(c(2,3,4))
#installing a package
install.packages("jtools")
Installing package into 'C:/Users/Dell/AppData/Local/R/win-library/4.4' (as 'lib' is unspecified)
package 'jtools' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Dell\AppData\Local\Temp\RtmpoDkbIt\downloaded_packages
# Now if I need jtools in this session of R, I need to load it using library()
library(jtools)
Now, apart from the built-in functions of R, the new functionality defined by the "jtools" package has been installed and loaded in this R session.
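For example, summ() is one of the functions jtools provides; a minimal sketch using the built-in mtcars data (output not shown):

# summ() comes from jtools and prints a tidy summary of a fitted model
model <- lm(mpg ~ wt, data = mtcars)
summ(model)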
Some common packages to install are ¶
- tidyverse (automatically installs a whole collection of packages)
- lmtest
- jtools
- car
library(tidyverse) #installation of each package is not shown
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The conflict ✖ dplyr::filter() masks stats::filter() indicates that ¶
There was already a function called filter() defined under stats (stats is a package pre-installed with R), and now the dplyr package also provides a filter() function.
So R reports the conflict.
In this situation it is advisable to use the packagename::function() form instead, for example stats::filter().
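As a minimal illustration, both masked functions can be called explicitly through their namespaces (mtcars is a built-in dataset used here only as an example):

stats::filter(1:10, rep(1/3, 3))   # moving-average filter from the stats package
dplyr::filter(mtcars, cyl == 6)    # row-filtering verb from the dplyr package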
T-Test ¶
# read data from external sources
dataurl="https://raw.githubusercontent.com/madhuko/madhuko.github.io/refs/heads/main/datasets/R/employee.csv"
employee=read.csv(dataurl)
head(employee)
 | id | gender | educ | jobcat | salary | salbegin | jobtime | prevexp | minority |
---|---|---|---|---|---|---|---|---|---|
 | <int> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <int> |
1 | 1 | m | 15 | 3 | 57000 | 27000 | 98 | 144 | 0 |
2 | 2 | m | 16 | 1 | 40200 | 18750 | 98 | 36 | 0 |
3 | 3 | f | 12 | 1 | 21450 | 12000 | 98 | 381 | 0 |
4 | 4 | f | 8 | 1 | 21900 | 13200 | 98 | 190 | 0 |
5 | 5 | m | 15 | 1 | 45000 | 21000 | 98 | 138 | 0 |
6 | 6 | m | 15 | 1 | 32100 | 13500 | 98 | 67 | 0 |
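Before testing, it can be useful to glance at the structure of the imported data frame; a minimal sketch using base R (output not shown):

str(employee)              # column names and types
summary(employee$salary)   # five-number summary of the variable we will test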
#one sample t-test
t_test_result=t.test(employee$salary,mu=35000)
t_test_result
	One Sample t-test

data:  employee$salary
t = -0.74005, df = 473, p-value = 0.4596
alternative hypothesis: true mean is not equal to 35000
95 percent confidence interval:
 32878.40 35960.73
sample estimates:
mean of x 
 34419.57
The test was done to check whether the true mean equals 35000. As per the result, the p-value is 0.4596, which means we fail to reject the null hypothesis: the data are consistent with a true mean of 35000.
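The object returned by t.test() is a list, so the individual pieces of the result can be pulled out directly; a minimal sketch:

t_test_result$p.value    # 0.4596, the p-value reported above
t_test_result$conf.int   # 95 percent confidence interval
t_test_result$estimate   # sample mean of x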
#one sample t-test
t_test_result=t.test(employee$salary,mu=45000)
t_test_result
	One Sample t-test

data:  employee$salary
t = -13.49, df = 473, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 45000
95 percent confidence interval:
 32878.40 35960.73
sample estimates:
mean of x 
 34419.57
In this case the p-value is close to zero, indicating that the null hypothesis is rejected; hence, the true mean is significantly different from 45000.
# to test at a different confidence level
t.test(employee$salary,mu = 55000,conf.level = .90)
	One Sample t-test

data:  employee$salary
t = -26.24, df = 473, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 55000
90 percent confidence interval:
 33126.96 35712.18
sample estimates:
mean of x 
 34419.57
#difference between two means
aggregate(employee$salary, list(employee$gender),FUN=mean)
aggregate(employee$salary, list(employee$gender),FUN=sd)
aggregate(employee$salary, list(employee$gender),FUN=min)
aggregate(employee$salary, list(employee$gender),FUN=max)
Mean:

Group.1 | x |
---|---|
<chr> | <dbl> |
f | 26031.92 |
m | 41441.78 |

Standard deviation:

Group.1 | x |
---|---|
<chr> | <dbl> |
f | 7558.021 |
m | 19499.214 |

Minimum:

Group.1 | x |
---|---|
<chr> | <int> |
f | 15750 |
m | 19650 |

Maximum:

Group.1 | x |
---|---|
<chr> | <int> |
f | 58125 |
m | 135000 |
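Since the tidyverse was loaded earlier, the same group-wise summaries can also be produced with dplyr; a minimal sketch (output not shown):

employee %>%
  group_by(gender) %>%
  summarise(mean_salary = mean(salary),
            sd_salary   = sd(salary),
            min_salary  = min(salary),
            max_salary  = max(salary))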
# While testing two means with the formula interface, always put the numerical variable first and the categorical (grouping) variable second, irrespective of which is dependent or independent
t_output=t.test(employee$salary~employee$gender,alt="two.sided")
t_output
	Welch Two Sample t-test

data:  employee$salary by employee$gender
t = -11.688, df = 344.26, p-value < 2.2e-16
alternative hypothesis: true difference in means between group f and group m is not equal to 0
95 percent confidence interval:
 -18003.00 -12816.73
sample estimates:
mean in group f mean in group m 
       26031.92        41441.78
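For comparison, a minimal sketch of the same Welch test written with two separate vectors instead of the formula interface (the result matches the output above):

# split salaries by gender and pass the two vectors directly
f_salary <- employee$salary[employee$gender == "f"]
m_salary <- employee$salary[employee$gender == "m"]
t.test(f_salary, m_salary, alternative = "two.sided")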
# In the context of the t.test() function in R, the var.equal argument specifies whether you assume the variances of the two groups being compared are equal.
# This is important for determining which version of the t-test to use.
var_e_result=t.test(employee$salary~employee$gender,alt="two.sided", var.equal=TRUE)
var_e_result
	Two Sample t-test

data:  employee$salary by employee$gender
t = -10.945, df = 472, p-value < 2.2e-16
alternative hypothesis: true difference in means between group f and group m is not equal to 0
95 percent confidence interval:
 -18176.40 -12643.32
sample estimates:
mean in group f mean in group m 
       26031.92        41441.78
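The equal-variance assumption behind var.equal = TRUE can itself be checked; a minimal sketch using the F test var.test() from the pre-installed stats package (output not shown; the group standard deviations above already suggest the variances differ):

var.test(employee$salary ~ employee$gender)   # H0: the two group variances are equal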
# one-tailed t-test instead of two-tailed
t.test(employee$salary~employee$gender,alt="less")
	Welch Two Sample t-test

data:  employee$salary by employee$gender
t = -11.688, df = 344.26, p-value < 2.2e-16
alternative hypothesis: true difference in means between group f and group m is less than 0
95 percent confidence interval:
     -Inf -13235.43
sample estimates:
mean in group f mean in group m 
       26031.92        41441.78
Paired T-test ¶
before=c(70,75,80,76,67,59,60)
after=c(60,63,75,70,62,55,55)
t.test(before,after,paired=TRUE,alt="two.sided")
	Paired t-test

data:  before and after
t = 5.8446, df = 6, p-value = 0.001106
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 3.903288 9.525284
sample estimates:
mean difference 
       6.714286
# similar functionality can be achieved via
t.test(before-after, mu=0, alternative = "t")   # "t" partially matches "two.sided"
	One Sample t-test

data:  before - after
t = 5.8446, df = 6, p-value = 0.001106
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 3.903288 9.525284
sample estimates:
mean of x 
 6.714286
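To make the equivalence explicit, the paired t statistic can be recomputed by hand from the differences; a minimal sketch:

d <- before - after
t_stat <- mean(d) / (sd(d) / sqrt(length(d)))   # t = mean difference / standard error
t_stat                                          # 5.8446, matching the output above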
# a t-test can be done with a different number of observations in each series as well
female=c(60,63,75,70,62,55,55)
male=c(70,75,80,76,67,59,60,70)
t.test(male,female)
	Welch Two Sample t-test

data:  male and female
t = 1.7569, df = 12.753, p-value = 0.1029
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.570741 15.106455
sample estimates:
mean of x mean of y 
 69.62500  62.85714