Probability
Configuration
Before starting the course, make sure you install the library that includes the relevant data sets
install.packages("dslabs")
library(dslabs)
From the dslabs library you can then load the data sets as needed
Discrete Probability
Discrete probability deals with distinct possible outcomes, i.e. categorical data. We can express the probability of an event A as Pr(A) = the proportion of times A occurs when we repeat the experiment independently many times
Monte Carlo Simulations
Computers allow us to mimic randomness; in R we can use the sample function to do this. We first create a sample set and then select a random element from it. We use the rep function to generate our elements
> beads <- rep(c("red","blue"), times = c(2,3))
> beads
[1] "red" "red" "blue" "blue" "blue"
> sample(beads, 1)
[1] "red"
Next we can run this simulation repeatedly using the replicate function
> B <- 1000
> events <- replicate(B, sample(beads, 1))
> table(events)
events
blue red
628 372
This method is useful for estimating the probability when we are unable to calculate it directly
By default the sample function works without replacement. If we want to sample with replacement we can set the replace argument to TRUE; this allows us to make the selection without the use of replicate
> events <- sample(beads, 1000, replace = TRUE)
> table(events)
events
blue red
588 412
Probability Distributions
For categorical data, the probability distribution is simply the proportion of each value in the data set
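For example, a minimal sketch computing this distribution for the beads vector defined above:
> prop.table(table(beads))
beads
blue  red
 0.6  0.4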
Independence
Events are independent if the outcome of one event does not affect the outcome of the other
Hence the probability of A given B is the same as the probability of A for independent events
If A and B are independent:
Pr(A | B) = Pr(A)
Pr(A and B) = Pr(A) x Pr(B)
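As a small illustration using the beads from earlier: sampling with replacement gives independent draws, while sampling without replacement does not, because the first draw changes the composition of the remaining beads
> # With replacement the two draws are independent
> sample(beads, 2, replace = TRUE)
> # Without replacement they are not: if the first bead is blue,
> # only 2 of the remaining 4 beads are blue
> sample(beads, 2)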
Combinations and Permutations
To carry out more complex calculations we can use R. This can make use of the expand.grid function, which outputs all the combinations of the elements of two lists, and the paste function, which combines two vectors element-wise
We can make use of this to define a deck of cards as follows
> suits <- c("Diamonds", "Clubs", "Hearts", "Spades")
> numbers <- c("Ace", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten", "Jack", "Queen", "King")
> grid <- expand.grid(number = numbers, suit = suits)
> grid
number suit
1 Ace Diamonds
2 Two Diamonds
3 Three Diamonds
...
51 Queen Spades
52 King Spades
> deck <- paste(grid$number, grid$suit)
> deck
[1] "Ace Diamonds" "Two Diamonds" "Three Diamonds"
[4] "Four Diamonds" "Five Diamonds" "Six Diamonds"
...
[49] "Ten Spades" "Jack Spades" "Queen Spades"
[52] "King Spades"
Now that we have a deck we can check the probability that the first card drawn is a certain value. We can do this by first defining a vector of kings and then checking what proportion of the deck it makes up
> kings <- paste("King", suits)
> mean(deck %in% kings)
[1] 0.07692308 # = 4 / 52
What if we want to find the likelihood of a specific permutation, e.g. two consecutive kings? We can do this with the permutations() and combinations() functions from the gtools package
permutations()
This function computes, for any list of size n, all the different ways we can select r items
> install.packages("gtools")
> library(gtools)
> permutations(5,2)
[,1] [,2]
[1,] 1 2
[2,] 1 3
[3,] 1 4
...
[19,] 5 3
[20,] 5 4
If we want to select our permutations from a specific vector, we can do this with the v argument
> hands <- permutations(52, 2, v = deck)
> hands
[,1] [,2]
[1,] "Ace Clubs" "Ace Diamonds"
[2,] "Ace Clubs" "Ace Hearts"
[3,] "Ace Clubs" "Ace Spades"
...
Thereafter we can compute the probability that the second card is a king given that the first card is a king
> first_card <- hands[,1]
> second_card <- hands[,2]
> mean(first_card %in% kings & second_card %in% kings) / mean(first_card %in% kings)
[1] 0.05882353 # = 3/51
combinations()
This function does not take the order of the two events into consideration, only the overall result
> combinations(52, 2, v = deck)
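As a quick illustrative use of combinations(), we can compute the probability that a two-card hand is a pair of kings; since order does not matter, each hand appears exactly once
> hands <- combinations(52, 2, v = deck)
> # Probability a random two-card hand is a pair of kings
> mean(hands[, 1] %in% kings & hands[, 2] %in% kings)
[1] 0.004524887 # = 6 / 1326, i.e. choose(4, 2) / choose(52, 2)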
The Birthday Problem
Suppose we have 50 people and we want to find the probability of at least two people sharing a birthday. We can do this using the duplicated function, which returns TRUE for an element if that element has occurred previously
> birthdays <- sample(1:365, 50, replace = TRUE)
> any(duplicated(birthdays))
[1] TRUE
We can simulate this many times with a Monte Carlo Simulation to find the probability numerically
> results <- replicate(10000,
any(duplicated(sample(1:365, 50, replace = TRUE))))
> mean(results)
[1] 0.9703 # The probability of there being two people who share a birthday
Note that the replicate function can take multiple statements/lines in its function argument by wrapping them in {...}
sapply
What if we want to apply the above to a variety of different n values? We can define the above as a function and then apply it to a different set of inputs
compute <- function(n, B = 1000){
  # Estimate the probability that at least two of n people share a birthday
  same_day <- replicate(B, {
    bdays <- sample(1:365, n, replace = TRUE)
    any(duplicated(bdays))
  })
  mean(same_day)
}
We can then use the sapply function to apply this function element-wise (essentially turning our function into something like an array map)
> count <- 1:50
> probabilities <- sapply(count, compute)
> probabilities
[1] 0.000 0.001 0.009 0.013 0.026 ...
> plot(probabilities)
We can also compute this mathematically with the following:
Pr(1 is unique) = 1
Pr(2 is unique | 1 is unique) = 364/365
Pr(3 is unique | 1 and 2 are unique) = 363/365
...
We can then use the multiplicative rule to compute the final probability which is
1 - (1 x 364/365 x 363/365 x ... x (365 - n + 1)/365)
We can define a function that will compute this exactly for a given problem
exact <- function(n){
  # Exact probability that at least two of n people share a birthday
  unique_prob <- seq(365, 365 - n + 1) / 365
  1 - prod(unique_prob)
}
Thereafter we can use this and plot the two results together
> exact_probability <- sapply(count, exact)
> plot(probabilities)
> lines(exact_probability, col = "red")
Sample size
How many experiments are enough? Practically, when our estimate starts to stabilize as we add more experiments, we can assume that we have a large enough number of experiments
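One way to check this is to track the estimate as the number of experiments B grows; a sketch reusing the compute function from above (the choice of n = 22 people is arbitrary):
> B_values <- 10^seq(1, 5, len = 100) # from 10 to 100,000 experiments
> estimates <- sapply(B_values, function(B) compute(22, B))
> plot(log10(B_values), estimates, type = "l") # the curve flattens once B is large enough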
Addition Rule
The addition rule states that:
Pr(A or B) = Pr(A) + Pr(B) - Pr(A and B)
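For example, using the deck from earlier, the probability of drawing a king or a diamond is 4/52 + 13/52 - 1/52 = 16/52, since the king of diamonds would otherwise be counted twice. We can confirm this in R (matching the suit with grepl is just one way to do it):
> mean(deck %in% kings | grepl("Diamonds", deck))
[1] 0.3076923 # = 16 / 52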
Continuous Probability
We make use of the cumulative distribution function (CDF) to work with continuous probabilities. This is because we need to ask whether a value falls within a certain range, not whether it equals a specific value
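For an observed list of values x, the empirical CDF F(a) is just the proportion of values at or below a. A minimal sketch using the heights data set from dslabs:
> library(dslabs)
> x <- heights$height[heights$sex == "Male"]
> F <- function(a) mean(x <= a)
> F(70) # proportion of males 70 inches or shorter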
Theoretical Distribution
The cumulative distribution function for the normal distribution in R is given by the pnorm function. We say that a random quantity is normally distributed with average avg and standard deviation s when its cumulative distribution is
F(a) = pnorm(a, avg, s)
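Continuing the sketch above with the male heights x, we can use the theoretical distribution instead of the data (assuming the heights are approximately normal):
> # Probability that a randomly chosen male is taller than 70.5 inches
> 1 - pnorm(70.5, mean(x), sd(x))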
Because probability is continuous, we work with intervals instead of exact values. We can however run into the issue of discretization: although our measurement of interest is continuous, our data set is still discrete
Probability Density
We can get the probability density function for a normal distribution with the dnorm function
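For example, a minimal sketch plotting the standard normal density curve:
> z <- seq(-4, 4, length = 100)
> plot(z, dnorm(z), type = "l") # the familiar bell curve, mean 0 and sd 1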
Normal Distributions
We can run Monte Carlo simulations on normally distributed data. We can use the rnorm function to generate a normally distributed set of random values
rnorm(n, avg, sd)
This is useful as it allows us to generate normally distributed data that mimics naturally occurring data
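For example, a sketch generating 1000 simulated heights (the mean of 69 inches and sd of 3 inches are illustrative values):
> simulated_heights <- rnorm(1000, 69, 3)
> hist(simulated_heights) # roughly bell-shaped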
Other Continuous Distributions
R has analogous functions available for other distribution types. These are prefixed with the letters:
d for density
q for quantile
p for probability (the cumulative distribution function)
r for random generation
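For example, the four-letter pattern applied to the normal distribution:
> dnorm(0) # density at 0
> qnorm(0.975) # the 97.5th percentile, about 1.96
> pnorm(1.96) # cumulative probability up to 1.96, about 0.975
> rnorm(3) # three random draws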
Random Variables
Random Variables are numeric outcomes resulting from a random process
Sampling Models
We model the behavior of a real random process by using a simplified version of that system, such as drawing values from an urn
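For example, a minimal urn-style sampling model (the specific values are illustrative): each draw is +1 with probability 10/19 and -1 with probability 9/19, repeated 1000 times:
> X <- sample(c(-1, 1), 1000, replace = TRUE, prob = c(9/19, 10/19))
> S <- sum(X) # the total is a random variable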
Notation
We use capital letters (e.g. X) to denote random variables and lowercase letters (e.g. x) to denote observed values
Standard Error
The standard error tells us the typical size of the difference between a random variable and its expected value
The variance is the square of the standard error
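For the two-outcome sampling model above, where each of n independent draws is a with probability p and b with probability 1 - p, the standard results are:
expected value of the sum = n x (a x p + b x (1 - p))
standard error of the sum = sqrt(n) x |b - a| x sqrt(p x (1 - p))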
Central Limit Theorem
When the number of independent draws (the sample size) is large, the probability distribution of their sum is approximately normal
Based on this we need only the expected value and the standard error to describe the probability distribution
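A quick sketch checking this with the sampling model from earlier:
> S <- replicate(10000, sum(sample(c(-1, 1), 1000, replace = TRUE, prob = c(9/19, 10/19))))
> hist(S) # approximately normal
> mean(S) # close to the expected value 1000 x 1/19, about 52.6
> sd(S) # close to the standard error sqrt(1000) x 2 x sqrt(90)/19, about 31.6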
Law of Large Numbers
The standard error of the average decreases as the sample size increases, so the average of the draws converges to the expected value; this is also known as the Law of Averages
How Large is Large
The CLT can be useful even with relatively small sample sizes, but this is not guaranteed
In general, the larger the probability of success the smaller our sample size needs to be; for very rare events we need a very large number of draws before the approximation holds