R Basics
Configuration
Before starting the course make sure you install the library with the relevant datasets included
install.packages("dslabs")
library(dslabs)
From the dslabs library you can use the data sets as needed
Objects
To store a value in a variable we use the assignment operator <-
a <- 25
To display this object we can use
print(a)
Functions
A Data Analysis process is typically a series of functions applied to data
We can define a function in R using:
myfunction <- function(arg1, arg2, ... ){
statements
return(object)
}
To evaluate (call) a function we write its name followed by parentheses containing the arguments:
log(a)
We can nest function calls inside of function arguments, for example:
log(exp(a))
To get help for a function we can use the following:
?log
help("log")
We can enter function arguments in an order different to the default by using named arguments:
log(x=5, base=3)
Otherwise the arguments are evaluated in order
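As a small illustration of positional versus named matching, the sketch below calls log in both styles. The values 8 and 2 and the variable names are chosen only for this example.

```r
# Positional matching: the first argument is x, the second is base
by_position <- log(8, 2)

# Named arguments can be given in any order
by_name <- log(base = 2, x = 8)

# Both calls compute the logarithm of 8 to base 2
by_position
by_name
identical(by_position, by_name)
```

Naming arguments costs a little typing but makes calls with several arguments easier to read.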
Comments
Comments in R start with the # symbol:
# This is a comment
Data Types
R makes use of different data types. We typically use DataFrames (class data.frame) to store data, these can hold a mixture of data types, one type per column
DataFrames are part of base R, the example datasets used in these notes come from the dslabs library:
library(dslabs)
To check the type of an object we can use
class(object)
In order to view the structure of an object we can use
str(object)
If we want to view the data in a DataFrame, we can use:
head(dataFrame)
To access a variable (column) in a DataFrame we use the $ accessor, the result keeps the values in the same order as the rows of the DataFrame
data$names
will list the names column of the DataFrame
Each column of a DataFrame is a Vector holding the data points for one variable
We can use == as the logical operator to test for equality
Factors allow us to store categorical data, we can view the different categories with the following:
> class(dataFrame$gender)
[1] "factor"
> levels(dataFrame$gender)
[1] "Male" "Female"
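To see these calls on something concrete, here is a minimal sketch that builds a factor by hand. The gender values are invented for the example and do not come from any dslabs dataset.

```r
# Hypothetical categorical data, invented for this example.
# The levels argument fixes the order of the categories; without it,
# R sorts the levels alphabetically.
gender <- factor(c("Male", "Female", "Female", "Male"),
                 levels = c("Male", "Female"))

class(gender)    # the class of the object
levels(gender)   # the distinct categories, in the order given above
table(gender)    # how many observations fall in each category
```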
Vectors
The most basic data unit in R is a Vector
To create a vector we can use the concatenate function c:
codes <- c(380, 124, 818)
If we want to name the values we can do so as follows:
codes <- c(italy=380, canada=124, egypt=818)
codes <- c("italy"=380, "canada"=124, "egypt"=818)
To get a sequence of numbers we can use seq, an optional third argument sets the step size:
> seq(1, 5)
[1] 1 2 3 4 5
> seq(1, 10, 2)
[1] 1 3 5 7 9
We can access elements of a vector by position or by name, passing either a single index or a vector of indices:
> codes[3]
egypt
  818
> codes["canada"]
canada
   124
> codes[c("canada", "egypt")]
canada  egypt
   124    818
> codes[1:2]
 italy canada
   380    124
Vector Coercion
Coercion is R's attempt to convert values to a common type when a vector contains mixed types, here the numbers become character strings
> x <- c(1, "hello", 3)
> x
[1] "1"     "hello" "3"
If we want to force a coercion we can use the as.character function or the as.numeric function as follows:
> x <- 1:5
> y <- as.character(1:5)
> y
[1] "1" "2" "3" "4" "5"
> as.numeric(y)
[1] 1 2 3 4 5
If R is unable to coerce a value the result is NA, which is very common in data sets because it also represents missing data
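A short sketch of a failed coercion follows. The vector raw is made up for the example, and suppressWarnings is used only to keep the output quiet, since as.numeric normally warns when it introduces NA.

```r
# A character vector where one entry is not a number (made-up values)
raw <- c("1", "b", "3")

# as.numeric() converts what it can; "b" becomes NA
converted <- suppressWarnings(as.numeric(raw))
converted

# is.na() flags the missing entries
missing <- is.na(converted)
missing

# Many summary functions return NA unless told to skip missing values
mean(converted)
mean(converted, na.rm = TRUE)
```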
Sorting
The sort function sorts a vector in increasing order, however the result tells us nothing about where each value came from. We can use the order function to return the indices that would put the values in sorted order
> x
[1] 31  4 15 92 65
> sort(x)
[1]  4 15 31 65 92
> order(x)
[1] 2 3 1 5 4
The columns of a DataFrame share the same row positions, therefore we can use the ordering of one column to reorder another
index <- order(data$total)
data$name[index]
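The sketch below runs the same idea on a tiny DataFrame. The names and totals are hypothetical, invented so the example is self-contained.

```r
# Hypothetical data, invented for illustration
data <- data.frame(name  = c("A", "B", "C", "D"),
                   total = c(310, 45, 120, 990))

# order() returns the row positions that sort total in increasing order
index <- order(data$total)
index

# Use those positions to list the names from smallest to largest total
sorted_names <- as.character(data$name[index])
sorted_names

# The same index reorders whole rows of the DataFrame
data[index, ]
```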
To get the max or min value we can use:
max(data$total) # maximum value
which.max(data$total) # index of maximum value
min(data$total) # minimum value
which.min(data$total) # index of minimum value
The rank function returns, for each element of a vector, its rank, that is the position it would take if the vector were sorted in increasing order
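Because sort, order and rank are easy to mix up, this sketch applies all three to the vector used in the sorting example above. The variable names are chosen only for the illustration.

```r
x <- c(31, 4, 15, 92, 65)

sorted  <- sort(x)   # the values in increasing order
ordered <- order(x)  # positions of the values, smallest first
ranked  <- rank(x)   # for each original value, its place in the sorted order

sorted
ordered
ranked

# Indexing with order() reproduces sort()
identical(x[ordered], sorted)
```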
Vector Arithmetic
Arithmetic operations occur element-wise
If we operate on a vector with a single value, the operation is applied to each element. If we operate on two vectors of the same length, the operation is applied element by element, so v3 <- v1 + v2
means v3[1] <- v1[1] + v2[1], v3[2] <- v1[2] + v2[2]
and so on
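A minimal sketch of both cases, using two made-up vectors of equal length:

```r
v1 <- c(1, 2, 3)
v2 <- c(10, 20, 30)

# A single value is applied to every element
doubled <- v1 * 2
doubled

# Two vectors of the same length are combined element by element
v3 <- v1 + v2
v3

# The first element of the result is the sum of the first elements
v3[1] == v1[1] + v2[1]
```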
Indexing
R lets us index a vector using a logical vector derived from another vector, so we can subset with logical comparisons. To combine conditions element-wise we use & (the && operator only works on single values)
> large_tots <- data$total > 200
> large_tots
[1]  TRUE  TRUE FALSE  TRUE FALSE
> small_size <- data$size < 20
> small_size
[1] FALSE  TRUE  TRUE  TRUE FALSE
> index <- large_tots & small_size
> index
[1] FALSE  TRUE FALSE  TRUE FALSE
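To show what the logical index is used for, here is a self-contained sketch. The DataFrame is hypothetical, with values picked so the comparisons give the same TRUE/FALSE patterns as above.

```r
# Hypothetical data, invented for illustration
data <- data.frame(name  = c("A", "B", "C", "D", "E"),
                   total = c(250, 300, 150, 400, 100),
                   size  = c(25, 10, 15, 5, 30))

large_tots <- data$total > 200
small_size <- data$size < 20

# Element-wise AND: TRUE only where both conditions hold
index <- large_tots & small_size

# Use the logical vector to pick out the matching names
selected <- as.character(data$name[index])
selected

# TRUE counts as 1, so sum() counts the matches
sum(index)
```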
Indexing Functions
which
gives us the indices of the elements that are TRUE
which(data$total > 200)
returns only the positions where the condition is TRUE
match
returns, for each value in the first vector, the position of its first occurrence in the second vector
match(c(20, 14, 5), data$size)
returns the positions in data$size of the values 20, 14 and 5, with NA for any value that is not found
%in%
checks whether each element of one vector is in another vector, for example:
> x <- c("a", "b", "c", "d", "e")
> y <- c("a", "d", "f")
> y %in% x
[1]  TRUE  TRUE FALSE
These functions are very useful for subsetting datasets
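The three functions can be compared side by side on one small vector. The vector size below is made up for the example.

```r
# Hypothetical values, invented for illustration
size <- c(14, 30, 20, 8, 5)

# which(): positions where the condition is TRUE
big <- which(size > 10)
big

# match(): position in size of each value we look up
pos <- match(c(20, 14, 5), size)
pos

# A value that is not present gives NA
match(99, size)

# %in%: TRUE/FALSE for each element on the left-hand side
found <- c(20, 99) %in% size
found

# Subsetting with %in% keeps the elements that appear in the lookup set
kept <- size[size %in% c(20, 14, 5)]
kept
```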
Data Wrangling
The dplyr package is useful for manipulating tables of data
- Add or change a column with mutate
- Filter data by rows with filter
- Select columns with select
library(dplyr)
data <- mutate(data, rate = total/size) # Add a rate column based on two other columns
select(data, name, rate) # Will create a new table with only the name and rate columns
filter(data, rate <= 0.7) # Will keep only the rows where the rate condition is true
We can combine functions using the pipe operator:
data %>% select(name, rate) %>% filter(rate <= 0.7)
Creating Data Frames
We can create a data frame with the data.frame function as follows:
data <- data.frame(names = c("John","James", "Jenny"),
exam_1 = c(90, 29, 45),
exam_2 = c(30, 10, 95))
However, in R versions before 4.0.0 data.frame converts strings to Factors by default, to prevent this we use the stringsAsFactors argument (from R 4.0.0 onwards FALSE is already the default):
data <- data.frame(names = c("John","James", "Jenny"),
exam_1 = c(90, 29, 45),
exam_2 = c(30, 10, 95),
stringsAsFactors = FALSE)
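As a quick check of what the argument does, the sketch below rebuilds the same DataFrame and inspects the column types. The derived average column is an addition for illustration only.

```r
data <- data.frame(names = c("John", "James", "Jenny"),
                   exam_1 = c(90, 29, 45),
                   exam_2 = c(30, 10, 95),
                   stringsAsFactors = FALSE)

# With stringsAsFactors = FALSE the names column stays a character vector
names_class <- class(data$names)
names_class

# str() summarises the type of every column at once
str(data)

# New columns can be added with $ (average is a made-up column name)
data$average <- (data$exam_1 + data$exam_2) / 2
data
```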
Basic Plots
We can make simple plots very easily with the following functions:
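plot(dataFrame$size, dataFrame$rate) # scatter plot
lines(dataFrame$size, dataFrame$rate) # adds connecting lines to the existing plot
hist(dataFrame$size) # histogram
boxplot(rate~category, data=dataFrame) # box plot of rate for each category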
Programming Basics
Conditionals
# Evaluates a single TRUE/FALSE condition
if (test_expression) {
statement1
} else {
statement2
}
# Vectorised: evaluates every element of a vector and returns a vector of results
ifelse(comparison, trueReturn, falseReturn)
# Will return true if any value in vector meets condition
any(condition)
# Will return true if all values meet condition
all(condition)
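The templates above can be filled in as follows. This is a minimal sketch with made-up values, and the message strings are only for illustration.

```r
# if/else works on a single condition
a <- 0
if (a != 0) {
  msg <- "reciprocal exists"
} else {
  msg <- "no reciprocal for 0"
}
msg

# ifelse() tests every element and returns one result per element
x <- c(-2, 5, 0, 7)
labels <- ifelse(x > 0, "positive", "not positive")
labels

# any() and all() reduce a logical vector to a single TRUE/FALSE
any_large    <- any(x > 6)
all_positive <- all(x > 0)
any_large
all_positive
```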
Functions
Functions in R are objects. To write our own function we use the following:
myfunction <- function(arg1, arg2, optional=TRUE ){
statements
return(object)
}
R uses lexical scoping, so variables created inside a function are local to it and do not overwrite variables of the same name outside
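As a concrete instance of the template, the sketch below defines a hypothetical helper called avg with an optional argument. The name, the arithmetic flag and the test values are all assumptions made for the example.

```r
# avg is a hypothetical helper: arithmetic mean by default,
# geometric mean when arithmetic = FALSE
avg <- function(x, arithmetic = TRUE) {
  n <- length(x)   # n is local to the function
  if (arithmetic) {
    return(sum(x) / n)
  } else {
    return(prod(x)^(1 / n))
  }
}

n <- 100   # a variable with the same name outside the function

arith <- avg(c(2, 8))                      # default: arithmetic mean
geom  <- avg(c(2, 8), arithmetic = FALSE)  # geometric mean
arith
geom

# The n defined outside is untouched by the n inside avg
n
```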
For Loops
for (i in sequence) {
statements
}
At the end of the loop the index variable keeps its last value
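A small worked loop, assuming we want the sums 1, 1+2, 1+2+3 and so on. The names n and s are chosen only for this sketch.

```r
n <- 5
s <- numeric(n)        # pre-allocate a vector for the results

for (i in 1:n) {
  s[i] <- sum(1:i)     # sum of the integers from 1 to i
}

s

# After the loop, i still holds the last value it took
i
```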
Other Functions
In R we rarely use for-loops. Instead we can apply a function to every element with functions like the following:
- apply
- sapply
- tapply
- mapply
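To give a feel for how these replace a loop, here is a brief sketch using sapply and mapply. The helper compute_s is hypothetical and computes the sum of the integers from 1 to n.

```r
# Hypothetical helper: sum of the integers from 1 to n
compute_s <- function(n) sum(1:n)

# sapply() calls the function on each element and collects the results
s <- sapply(1:5, compute_s)
s

# mapply() walks over several vectors in parallel
pair_sums <- mapply(function(a, b) a + b, 1:3, 4:6)
pair_sums
```

sapply returns a vector here because every call returns a single number.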
Other functions that are widely used are:
- split
- cut
- quantile
- Reduce (base R, the purrr package provides reduce)
- identical
- unique