R Basics

Based on this EdX Course

Configuration

Before starting the course, make sure you install the dslabs package, which includes the relevant datasets

install.packages("dslabs")
library(dslabs)

From the dslabs library you can use the data sets as needed

Objects

To store a value in a variable we use the assignment operator <-

a <- 25

To display this object we can use

print(a)

Functions

A Data Analysis process is typically a series of functions applied to data

We can define a function in R using:

myfunction <- function(arg1, arg2, ... ){
    statements
    return(object)
}

To evaluate a function we write its name followed by the arguments in parentheses:

log(a)

We can nest function calls inside of function arguments, for example:

log(exp(a))

To get help for a function we can use the following:

?log
help("log")

We can enter function arguments in an order different to the default by using named arguments:

log(x=5, base=3)

Otherwise the arguments are evaluated in order
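
As a small illustration of positional versus named arguments, the sketch below computes the same logarithm three ways. The values 8 and 2 and the variable names are arbitrary choices made for this example.

```r
# Positional: the first argument is x, the second is base
by_position <- log(8, 2)

# Named, in the default order
by_name <- log(x = 8, base = 2)

# Named, in reversed order: the names decide which value goes where
reversed <- log(base = 2, x = 8)

# All three calls compute the base-2 logarithm of 8
print(c(by_position, by_name, reversed))
```

Naming the arguments makes a call easier to read and protects against passing values in the wrong order.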

Comments

Commenting in R is done with the # symbol:

# This is a comment

Data Types

R makes use of different data types. For tabular data we typically use DataFrames, which can store a mixture of different data types in a single collection

DataFrames are part of base R; we load the dslabs library to get example DataFrames to work with:

library(dslabs)

To check the type of an object we can use

class(object)

In order to view the structure of an object we can use

str(object)

If we want to view the data in a DataFrame, we can use:

head(dataFrame)

To access a variable (column) in a DataFrame we use the $ accessor; the values are returned in the same order as the rows of the DataFrame

data$names will list the names column of the DataFrame

Each column we extract from a DataFrame this way is a Vector

We can use == as the logical operator to test for equality

Factors allow us to store categorical data. We can view the different categories (levels) with the following:

> class(dataFrame$gender)
[1] "factor"
> levels(dataFrame$gender)
[1] "Female" "Male"
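
To see how a factor is built from raw values, here is a minimal sketch. The gender vector is made-up data for illustration and is not taken from dslabs.

```r
# Hypothetical categorical data stored as a character vector
gender <- c("Male", "Female", "Female", "Male", "Female")

# Convert to a factor; by default the levels are sorted alphabetically
gender_factor <- factor(gender)

class(gender_factor)   # the class is "factor"
levels(gender_factor)  # the distinct categories: "Female" "Male"

# Count how many observations fall in each category
counts <- table(gender_factor)
print(counts)
```

Internally a factor stores integer codes plus the level labels, which is why it is more memory-efficient than repeating the strings.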

Vectors

The most basic data unit in R is a Vector

To create a vector we can use the concatenate function c:

codes <- c(380, 124, 818)

If we want to name the values we can do so as follows:

codes <- c(italy=380, canada=124, egypt=818)
codes <- c("italy"=380, "canada"=124, "egypt"=818)

To get a sequence of numbers we can use:

> seq(1, 5)
[1] 1 2 3 4 5

> seq(1, 10, 2)
[1] 1 3 5 7 9

We can access elements of a vector by position or by name, using either a single index or a vector of indices:

> codes[3]
egypt 
  818 

> codes["canada"]
canada 
   124 

> codes[c("canada", "egypt")]
canada  egypt 
   124    818 

> codes[1:2]
 italy canada 
   380    124 

Vector Coercion

All elements of a vector must be of the same type. If we mix types, R does not raise an error; instead it coerces the values to a common type that can represent all of them, here character:

> x <- c(1, "hello", 3)
> x
[1] "1"     "hello" "3"

If we want to force a coercion we can use the as.character function or as.numeric function as follows:

> x <- 1:5
> y <- as.character(1:5)
> y
[1] "1" "2" "3" "4" "5"
> as.numeric(y)
[1] 1 2 3 4 5

If R is unable to coerce a value, the result is NA (with a warning). NA is very common in data sets, where it denotes missing data
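
A short sketch of a failed coercion, assuming a made-up character vector raw in which one entry is not a number:

```r
# Hypothetical input where the second entry is not numeric
raw <- c("1", "b", "3")

# "b" cannot be converted, so R returns NA for it and issues a warning;
# suppressWarnings() is used here only to keep the output quiet
converted <- suppressWarnings(as.numeric(raw))

# is.na() flags which entries are missing
missing <- is.na(converted)

# Many summary functions accept na.rm = TRUE to skip missing values
total <- sum(converted, na.rm = TRUE)

print(converted)
print(missing)
print(total)
```

Without na.rm = TRUE, sum(converted) would itself be NA, because any arithmetic involving a missing value is missing.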

Sorting

The sort function sorts a vector in increasing order, but the result tells us nothing about where each value was originally positioned. The order function instead returns the indices that would put the vector in sorted order

> x
[1] 31  4 15 92 65

> sort(x)
[1]  4 15 31 65 92

> order(x)
[1] 2 3 1 5 4

The entries of each column correspond to the rows of the DataFrame, therefore we can order one column by another

index <- order(data$total)
data$name[index]

To get the max or min value we can use:

max(data$total) # maximum value
which.max(data$total) # index of maximum value

min(data$total) # minimum value
which.min(data$total) # index of minimum value

The rank function returns, for each element of a vector, its rank in the sorted order, so the smallest value gets rank 1
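
To compare sort, order and rank side by side, the sketch below reuses the five values from the sorting example above.

```r
x <- c(31, 4, 15, 92, 65)

sorted  <- sort(x)   # the values in increasing order: 4 15 31 65 92
indices <- order(x)  # positions of those values in x: 2 3 1 5 4
ranks   <- rank(x)   # rank of each original element:  3 1 2 5 4

# Indexing x by order(x) reproduces sort(x)
same <- identical(x[indices], sorted)

# which.max gives the position of the largest value, max gives the value
largest_at <- which.max(x)
print(x[largest_at] == max(x))
```

order answers "where is the k-th smallest value?", while rank answers "how does this element compare to the others?".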

Vector Arithmetic

Arithmetic operations occur element-wise

If we operate on a vector with a single value, the operation is applied to each element. If we operate on two vectors of the same length, the operation is applied element by element, so v3 <- v1 + v2 means v3[1] is v1[1] + v2[1], and so on
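
A minimal sketch of both cases, using made-up values (the heights and the vectors v1 and v2 are assumptions for illustration):

```r
# Hypothetical heights in inches
heights_in <- c(69, 62, 66)

# A single value is applied to every element
heights_cm <- heights_in * 2.54

# Two vectors of the same length are combined element by element
v1 <- c(1, 2, 3)
v2 <- c(10, 20, 30)
v3 <- v1 + v2   # 11 22 33

print(heights_cm)
print(v3)
```

This is why a per-row rate such as total / size can be computed for a whole column in a single expression, without a loop.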

Indexing

R provides ways to index vectors based on properties of another vector, which allows us to make use of logical comparisons. Note that the element-wise AND operator is &; the && operator only works on single values

> large_tots <- data$total > 200
> large_tots
[1]  TRUE  TRUE FALSE  TRUE FALSE

> small_size <- data$size < 20
> small_size
[1] FALSE  TRUE  TRUE  TRUE FALSE

> index <- large_tots & small_size
> index
[1] FALSE  TRUE FALSE  TRUE FALSE

Indexing Functions

  • which returns the indices of the elements that are TRUE: which(data$total > 200) returns only the positions where the condition holds
  • match returns, for each value in the first vector, the position of its first occurrence in the second vector (NA if it does not occur): match(c(20, 14, 5), data$size) returns the positions at which data$size is 20, 14 and 5, in that order
  • %in% checks whether each element of one vector is present in another vector, for example:
> x <- c("a", "b", "c", "d", "e")
> y <- c("a", "d", "f")
> y %in% x
[1]  TRUE  TRUE FALSE

These functions are very useful for subsetting datasets
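
The three functions can be tried together on a small table. The data frame below is invented for this sketch; its total and size columns are chosen to match the logical vectors shown in the indexing example.

```r
# Hypothetical data frame used only for illustration
data <- data.frame(name  = c("A", "B", "C", "D", "E"),
                   total = c(250, 300, 100, 220, 50),
                   size  = c(30, 14, 5, 10, 20))

# which: positions where the condition is TRUE
big <- which(data$total > 200)            # 1 2 4

# match: position in data$size of each requested value, in request order
pos <- match(c(20, 14, 5), data$size)     # 5 2 3
matched_names <- data$name[pos]           # "E" "B" "C"

# %in%: is each value on the left present on the right?
present <- c(20, 99) %in% data$size       # TRUE FALSE

# Subset rows using a logical index built with the element-wise &
subset_rows <- data[data$total > 200 & data$size < 20, ]
print(subset_rows)
```

Here subset_rows keeps rows 2 and 4, the only rows that satisfy both conditions.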

Data Wrangling

The dplyr package is useful for manipulating tables of data

  • Add or change a column with mutate
  • Filter data by rows with filter
  • Select columns with select

library(dplyr) # Load dplyr before using these functions

data <- mutate(data, rate=total/size) # Add a rate column based on two other columns and save the result

select(data, name, rate) # Will create a new table with only the name and rate columns

filter(data, rate <= 0.7) # Will keep only the rows where the rate expression is true

We can combine functions using the pipe operator:

data %>% select(name, rate) %>% filter(rate <= 0.7)

Creating Data Frames

We can create a data frame with the data.frame function as follows:

data <- data.frame(names = c("John","James", "Jenny"), 
                   exam_1 = c(90, 29, 45),
                   exam_2 = c(30, 10, 95))

However, in versions of R before 4.0.0, data.frame converts strings to Factors by default. To prevent this we use the stringsAsFactors argument (since R 4.0.0, FALSE is already the default):

data <- data.frame(names = c("John","James", "Jenny"), 
                   exam_1 = c(90, 29, 45),
                   exam_2 = c(30, 10, 95),
                   stringsAsFactors = FALSE)

Basic Plots

We can make simple plots very easily with the following functions:

  • plot(data$size, data$rate)
  • lines(data$size, data$rate)
  • hist(data$size)
  • boxplot(rate~category, data=data)

Programming Basics

Conditionals

# Evaluates a single condition (not vectorized)
if (test_expression) {
    statement1
} else {
    statement2
}

# Vectorized: evaluates every element of a vector and returns a result for each
ifelse(comparison, trueReturn, falseReturn)

# Will return true if any value in vector meets condition
any(condition)

# Will return true if all values meet condition
all(condition)
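
The following sketch contrasts the four constructs on a made-up vector a; the variable names are assumptions for illustration.

```r
a <- c(0, 1, 2, -4, 5)

# if/else tests one condition, here on a single element
if (a[1] != 0) {
  msg <- "first element is non-zero"
} else {
  msg <- "first element is zero"
}

# ifelse works element-wise: reciprocal for positive entries, NA otherwise
inverted <- ifelse(a > 0, 1 / a, NA)

# any/all summarise a logical vector into a single TRUE or FALSE
has_negative <- any(a < 0)   # TRUE, because of -4
all_negative <- all(a < 0)   # FALSE

print(msg)
print(inverted)
```

Use if when a decision depends on one value, and ifelse when the same decision must be made for every element of a vector.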

Functions

Functions in R are objects. If we need to write a function in R we can do this with the following:

myfunction <- function(arg1, arg2, optional=TRUE ){
    statements
    return(object)
}

This will make use of the usual lexical scoping
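
As a concrete instance of the template, here is a sketch of a function with an optional argument. The name avg and its arithmetic flag are hypothetical choices for this example; the global n is included only to show that variables assigned inside a function do not leak out.

```r
n <- 100  # a global variable that shares its name with a local one below

# Arithmetic mean by default, geometric mean when arithmetic = FALSE
avg <- function(x, arithmetic = TRUE) {
  n <- length(x)        # local n; the global n is untouched
  if (arithmetic) {
    return(sum(x) / n)
  }
  prod(x)^(1 / n)       # the last evaluated expression is returned
}

arith <- avg(c(1, 4, 16))                      # (1 + 4 + 16) / 3
geom  <- avg(c(1, 4, 16), arithmetic = FALSE)  # cube root of 64

print(c(arith, geom))
print(n)  # still 100
```

Because the argument has a default value, callers only need to supply it when they want the non-default behaviour.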

For Loops

for (i in sequence) {
    statements
}

At the end of our loop the index variable will hold its last value
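
A minimal sketch of a loop that sums the integers from 1 to n; the names n and total are arbitrary.

```r
n <- 5
total <- 0

# Add each integer from 1 to n to the running total
for (i in 1:n) {
  total <- total + i
}

print(total)  # 1 + 2 + 3 + 4 + 5
print(i)      # the index keeps its last value after the loop
```

For a calculation this simple, sum(1:n) gives the same result without a loop.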

Other Functions

In R we rarely use for-loops. Instead we can use functions like the following:

  • apply
  • sapply
  • tapply
  • mapply
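
As a sketch of how these replace an explicit loop, the example below uses sapply and mapply. The helper compute_s_n is a hypothetical function written for this illustration.

```r
# Hypothetical helper: the sum of the integers from 1 to n
compute_s_n <- function(n) {
  sum(1:n)
}

# sapply calls the function once per element and collects the results
s_n <- sapply(1:5, compute_s_n)   # 1 3 6 10 15

# mapply walks over several vectors in parallel
pair_sums <- mapply(function(x, y) x + y, 1:3, 4:6)   # 5 7 9

print(s_n)
print(pair_sums)
```

sapply returns a plain vector here because every call yields a single number.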

Other functions that are widely used are:

  • split
  • cut
  • quantile
  • Reduce
  • identical
  • unique