Basics of R
R is an open-source programming language used primarily for statistical analysis.
R can be downloaded free of charge from https://www.r-project.org/
Its most popular IDE (integrated development environment) is RStudio (free download: https://posit.co/download/rstudio-desktop/).
VS Code (free download: https://code.visualstudio.com/download) can also be used to run Jupyter notebooks with an R kernel.
This document was prepared as a Jupyter notebook with an R kernel in VS Code.
Data Types in R
Basic Data Types
- Numeric: Represents real numbers (e.g., `10.5`, `3.14`).
- Integer: Represents whole numbers (e.g., `10L`, `-3L`). The `L` suffix explicitly defines an integer.
- Logical: Represents Boolean values (`TRUE` or `FALSE`).
- Character: Represents text or strings (e.g., `"Hello"`, `"R"`).
- Complex: Represents complex numbers (e.g., `3 + 2i`).
- NULL: Represents an empty or null object.
- NA: Represents missing or undefined data.
- NaN: Represents "Not a Number" (e.g., `0/0`).
- Inf: Represents infinity (e.g., `1/0`).
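The types listed above can be checked directly in R. The following is a minimal sketch using the base functions `class()` and `typeof()` together with the `is.*()` test functions; the literal values are just the examples from the list.

```r
# class() reports the high-level type, typeof() the internal storage type
class(10.5)       # "numeric"
typeof(10.5)      # "double"
class(10L)        # "integer"
class(TRUE)       # "logical"
class("Hello")    # "character"
class(3 + 2i)     # "complex"

# Special values have their own test functions
is.null(NULL)     # TRUE
is.na(NA)         # TRUE
is.nan(0/0)       # TRUE
is.infinite(1/0)  # TRUE
```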
Data Structures
- Vector: A sequence of elements of the same data type → `c(1, 2, 3)`
- List: A collection of elements of different data types → `list(1, "a", TRUE)`
- Matrix: A two-dimensional array of elements of the same data type → `matrix(1:6, nrow=2)`
- Array: A multi-dimensional extension of a matrix → `array(1:12, dim=c(2, 3, 2))`
- Data Frame: A table-like structure where columns can be of different data types → `data.frame(name=c("Alice", "Bob"), age=c(25, 30))`
- Factor: Represents categorical data with fixed levels → `factor(c("Male", "Female", "Male"))`
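As a rough illustration of how these structures differ, the sketch below builds each one from the examples in the list and inspects its size and shape. The variable names (`nums`, `items`, `mat`, `arr`, `people`, `sex`) are made up for this example.

```r
nums   = c(1, 2, 3)
items  = list(1, "a", TRUE)
mat    = matrix(1:6, nrow = 2)
arr    = array(1:12, dim = c(2, 3, 2))
people = data.frame(name = c("Alice", "Bob"), age = c(25, 30))
sex    = factor(c("Male", "Female", "Male"))

length(nums)    # 3
length(items)   # 3
dim(mat)        # 2 3  (rows, columns)
dim(arr)        # 2 3 2
str(people)     # compact overview of columns and their types
levels(sex)     # "Female" "Male"  (levels are sorted alphabetically)

# A vector holds a single type, so mixed input is coerced to character
class(c(1, "a", TRUE))  # "character"
```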
Assignment operators: = vs <-
By convention, <- is the standard assignment operator in R.
= can also be used for assignment, but it is mainly used for naming arguments in function calls.
In everyday use, both = and <- assign a value to a variable in the current environment, creating the variable if it does not exist. <<- instead assigns to an existing variable in an enclosing environment and, if none is found, creates the variable in the global environment.
Although using = for assignment is generally discouraged to avoid confusion, it does not necessarily cause a problem. I find it rather straightforward to use.
x <- 10
y = 20
print(x)
print(y)
[1] 10
[1] 20
During this session, = is used as the assignment operator unless <- or <<- is explicitly required.
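The difference between ordinary assignment and <<- only shows up inside functions. The sketch below is one way to see it; the names `counter`, `bump_local` and `bump_global` are hypothetical and chosen only for this illustration.

```r
counter = 0                  # created in the global environment

bump_local = function() {
  counter = counter + 1      # creates a local variable; the global counter is untouched
  counter
}

bump_global = function() {
  counter <<- counter + 1    # updates the counter found in the enclosing environment
  counter
}

bump_local()    # 1
counter         # 0  (unchanged)
bump_global()   # 1
counter         # 1  (changed by <<-)
```

Inside `bump_local()`, replacing = with <- would behave identically; only <<- reaches outside the function.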
Variable Names and Assignment
If you are using a Jupyter notebook, press Shift+Enter to run the active cell.
If you are using RStudio, press Ctrl+Enter to run the current line, or select some code and press Ctrl+Enter to run the selection.
a=1
b="ram"
c=1.0
print(a)
print(b)
print(c)
[1] 1
[1] "ram"
[1] 1
A convenience of interactive programming is that print() is not required; typing a variable name on its own displays its value.
In RStudio, the output appears in the Console pane.
a
b
c
Let's get started!
#creating vectors and operating on them (the most common task)
jp=c(3.8,4.5,4.6,4.2,3.9,4.7,4.9)
mt=c(3.7,4.2,4.5,4.0,3.4,4.5,4.7)
It should be noted that any content on a single line after # is treated as a comment and will not be considered part of the code.
#calculation of mean
mean(jp)
mean(mt)
#calculation of standard deviation
sd(jp)
#correlation
cor(jp,mt)
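To make clear what these functions compute, the sketch below reproduces them from their textbook formulas. It assumes the sample (n - 1) definition of standard deviation, which is what `sd()` uses, and Pearson correlation, which is the default for `cor()`. The `*_manual` names are made up for this example.

```r
jp = c(3.8, 4.5, 4.6, 4.2, 3.9, 4.7, 4.9)
mt = c(3.7, 4.2, 4.5, 4.0, 3.4, 4.5, 4.7)
n  = length(jp)

mean_manual = sum(jp) / n                              # arithmetic mean
sd_manual   = sqrt(sum((jp - mean(jp))^2) / (n - 1))   # sample standard deviation
cor_manual  = cov(jp, mt) / (sd(jp) * sd(mt))          # Pearson correlation

all.equal(mean_manual, mean(jp))    # TRUE
all.equal(sd_manual, sd(jp))        # TRUE
all.equal(cor_manual, cor(jp, mt))  # TRUE
```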
#creation of a data frame from vectors
js=data.frame(jp,mt)
js
jp | mt |
---|---|
<dbl> | <dbl> |
3.8 | 3.7 |
4.5 | 4.2 |
4.6 | 4.5 |
4.2 | 4.0 |
3.9 | 3.4 |
4.7 | 4.5 |
4.9 | 4.7 |
#summary() summarizes the data. It can be used with vectors as well as data frames
summary(js)
       jp              mt
 Min.   :3.800   Min.   :3.400
 1st Qu.:4.050   1st Qu.:3.850
 Median :4.500   Median :4.200
 Mean   :4.371   Mean   :4.143
 3rd Qu.:4.650   3rd Qu.:4.500
 Max.   :4.900   Max.   :4.700
#individual columns of dataframe can be accessed using dataframename$columnname
js$jp
- 3.8
- 4.5
- 4.6
- 4.2
- 3.9
- 4.7
- 4.9
#individual element of vector can be accessed with vectorname[position]
jp[2] #provides second element of jp
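Square-bracket indexing accepts more than a single position. A short sketch of the common variants, using the same `jp` vector (note that R positions start at 1):

```r
jp = c(3.8, 4.5, 4.6, 4.2, 3.9, 4.7, 4.9)

jp[2]          # 4.5          single position
jp[c(1, 3)]    # 3.8 4.6      several positions
jp[2:4]        # 4.5 4.6 4.2  a range of positions
jp[-1]         # all elements except the first
jp[jp > 4.5]   # 4.6 4.7 4.9  elements satisfying a condition
```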
#head(dataframe) will give first 6 rows of the dataframe
head(js)
 | jp | mt |
---|---|---|
 | <dbl> | <dbl> |
1 | 3.8 | 3.7 |
2 | 4.5 | 4.2 |
3 | 4.6 | 4.5 |
4 | 4.2 | 4.0 |
5 | 3.9 | 3.4 |
6 | 4.7 | 4.5 |
#head(dataframe, n) will fetch first n rows instead
head(js,2)
 | jp | mt |
---|---|---|
 | <dbl> | <dbl> |
1 | 3.8 | 3.7 |
2 | 4.5 | 4.2 |
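A few related base functions are useful alongside head(). The sketch below rebuilds the same `js` data frame so that it runs on its own, then looks at the last rows, the dimensions, a single cell, and a filtered subset.

```r
js = data.frame(jp = c(3.8, 4.5, 4.6, 4.2, 3.9, 4.7, 4.9),
                mt = c(3.7, 4.2, 4.5, 4.0, 3.4, 4.5, 4.7))

tail(js, 2)          # last two rows (rows 6 and 7)
nrow(js)             # 7
ncol(js)             # 2
js[2, "mt"]          # 4.2, the value in row 2 of column mt
js[js$jp > 4.5, ]    # only the rows where jp exceeds 4.5
```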
Logical operators
syntax | operator name | description |
---|---|---|
& | and | Returns TRUE if both conditions are TRUE, otherwise FALSE |
\| | or | Returns TRUE if at least one condition is TRUE |
! | not | Returns TRUE if the condition is FALSE |
== | equal | Returns TRUE if both values are equal |
x=5
is.numeric(x) #TRUE
is.numeric(x)& x<0 #FALSE
is.numeric(x)& x<10 #TRUE
is.numeric(x)| x<0 #TRUE
!is.character(x) #TRUE
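The operators & and | work element by element on whole vectors, which the single-value examples above do not show. The sketch below illustrates this, along with `any()`, `all()` and the scalar form `&&`; the names `scores` and `value` are made up for this example.

```r
scores = c(3, 8, 12)

scores > 5 & scores < 10    # FALSE  TRUE FALSE
scores < 5 | scores > 10    #  TRUE FALSE  TRUE
any(scores > 10)            # TRUE, at least one element qualifies
all(scores > 0)             # TRUE, every element qualifies

# && and || compare single values only and are the usual choice inside if()
value = 5
is.numeric(value) && value < 10   # TRUE
```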
Tip: pressing Ctrl+Space while writing code forces the IDE to show suggestions if they are not already showing. For example:
isTRUE(FALSE) #FALSE, obviously
# Basic Plotting in R
hhno=c(1,2,3,4,5,6)
age=c(25,32,43,34,52,29)
salary=c(18000,23400,54321,34000,65000,21000)
gender=c('M','F','F','M','F','M')
surveydata=data.frame(hhno,age,salary,gender) #creation of dataframe from vector
#Basic plot
plot(age,salary)
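The basic plot can be made easier to read with a title and axis labels. The following is a minimal sketch using only base graphics; the title text, label text and colour are arbitrary choices for illustration, not part of the original data.

```r
age    = c(25, 32, 43, 34, 52, 29)
salary = c(18000, 23400, 54321, 34000, 65000, 21000)
gender = factor(c('M', 'F', 'F', 'M', 'F', 'M'))

# Scatter plot with a title, axis labels and filled points
plot(age, salary,
     main = "Salary by age",
     xlab = "Age",
     ylab = "Salary",
     pch  = 19,
     col  = "steelblue")

# Box plot comparing the salary distribution of each gender
boxplot(salary ~ gender,
        main = "Salary by gender",
        xlab = "Gender",
        ylab = "Salary")
```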
#read data from external sources
dataurl="https://raw.githubusercontent.com/madhuko/madhuko.github.io/refs/heads/main/datasets/R/employee.csv"
employee=read.csv(dataurl)
head(employee)
 | id | gender | educ | jobcat | salary | salbegin | jobtime | prevexp | minority |
---|---|---|---|---|---|---|---|---|---|
 | <int> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <int> |
1 | 1 | m | 15 | 3 | 57000 | 27000 | 98 | 144 | 0 |
2 | 2 | m | 16 | 1 | 40200 | 18750 | 98 | 36 | 0 |
3 | 3 | f | 12 | 1 | 21450 | 12000 | 98 | 381 | 0 |
4 | 4 | f | 8 | 1 | 21900 | 13200 | 98 | 190 | 0 |
5 | 5 | m | 15 | 1 | 45000 | 21000 | 98 | 138 | 0 |
6 | 6 | m | 15 | 1 | 32100 | 13500 | 98 | 67 | 0 |
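read.csv() works the same way for local files as for URLs. The sketch below avoids any network access by writing a small made-up data frame to a temporary file and reading it back; `sample_df`, `tmp` and `loaded` are hypothetical names, and the three rows are illustrative rather than the full employee dataset.

```r
# A tiny made-up data frame standing in for a real CSV file
sample_df = data.frame(id     = 1:3,
                       gender = c("m", "f", "m"),
                       salary = c(57000, 21450, 45000))

tmp = tempfile(fileext = ".csv")
write.csv(sample_df, tmp, row.names = FALSE)   # save without row numbers

loaded = read.csv(tmp)   # read it back, just like reading from a URL
str(loaded)              # column names and types
dim(loaded)              # 3 3
names(loaded)            # "id" "gender" "salary"

unlink(tmp)              # remove the temporary file
```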
#table() creates frequency tables from vectors or factors. It counts the occurrences of each unique value.
table(employee$gender)
  f   m
216 258
# prop.table() gives the proportion of the table
table1=table(employee$gender)
prop.table(table1)
        f         m
0.4556962 0.5443038
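Proportions are often easier to read as percentages. The sketch below shows one way to do that, using the small `gender` vector from the survey example rather than the employee file so that it runs on its own; `counts` and `shares` are made-up names.

```r
gender = c('M', 'F', 'F', 'M', 'F', 'M')

counts = table(gender)        # F: 3, M: 3
shares = prop.table(counts)   # proportions that sum to 1

round(100 * shares, 1)        # percentages: F 50, M 50
sum(shares)                   # 1
```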
min(employee$salary)
max(employee$salary)
mean(employee$salary)
summary(employee$salary)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  15750   24000   28875   34420   36938  135000
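The individual pieces of summary() are also available as separate functions, and tapply() can compute a statistic per group. The sketch below uses the small survey vectors from earlier so that it runs without the employee file; `by_gender` is a made-up name.

```r
salary = c(18000, 23400, 54321, 34000, 65000, 21000)
gender = c('M', 'F', 'F', 'M', 'F', 'M')

range(salary)                       # 18000 65000 (minimum and maximum)
median(salary)                      # 28700
mean(salary)                        # 35953.5
quantile(salary, c(0.25, 0.75))     # first and third quartiles

# Mean salary within each gender group
by_gender = tapply(salary, gender, mean)
by_gender
```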