Inference and Modeling
Inference
Inference is the process of using information from a sample to draw conclusions about the population it represents
Parameters and Estimates
We can plot the results of a random 'election poll' draw with the following
library(ggplot2)   # plotting (also loaded by the tidyverse)
library(tidyverse)
library(dslabs)    # provides take_poll() and the polling datasets
ds_theme_set()     # use the dslabs plot theme
take_poll(1000)    # plot a random sample of 1000 beads from the urn
The goal of statistical inference is to estimate the parameter using the observed data in the sample
We would like to estimate the proportion of blue beads, which is $$p$$. Based on this we can also identify the proportion of red beads and the spread
Proportion of Red Beads

$$1 - p$$

Spread

$$p - (1 - p) = 2p - 1$$
The Sample Average
The sample average is our estimate of the parameter of interest (the proportion $$p$$) and is calculated as follows

$$\bar{X} = \frac{X_1 + X_2 + \dots + X_N}{N}$$

In this case, the value of an individual draw $$X_i$$ is 1 if it is our outcome of interest (a blue bead) and 0 if not
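As a minimal sketch (not course code), we can simulate drawing beads from an urn with an assumed true proportion and compute the sample average as our estimate:

set.seed(1)
p <- 0.45   # assumed true proportion of blue beads, for illustration only
N <- 1000   # sample size
x <- sample(c(1, 0), size = N, replace = TRUE, prob = c(p, 1 - p))  # 1 = blue, 0 = red
mean(x)     # the sample average, our estimate of p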
Polling vs Forecasting
A poll gives an estimate at a specific point in time, whereas forecasting takes into account that opinion can change and therefore aims to predict the probability of some event at a future time, such as election day
Properties of an Estimate
The expected value of our estimate is the same as the parameter of interest

Expected Value

$$\mathrm{E}(\bar{X}) = p$$

We can decrease our standard error by increasing our sample size, as can be seen from

Standard Error

$$\mathrm{SE}(\bar{X}) = \sqrt{\frac{p(1 - p)}{N}}$$
Due to the Law of Large Numbers, the standard error shrinks as we increase the sample size, so our estimate converges to $$p$$
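For example, a small sketch (with an assumed $$p$$ of 0.45) showing how the standard error shrinks as $$N$$ grows:

p <- 0.45                      # assumed true proportion
N <- c(10, 100, 1000, 10000)   # increasing sample sizes
data.frame(N = N, se = sqrt(p * (1 - p) / N))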
Central Limit Theorem in Practice
Suppose we want to know how accurate our estimate is (i.e. its standard error), but we do not know the actual value of $$p$$. We can estimate the standard error by plugging in $$\bar{X}$$ for $$p$$ as follows

Estimate of the Standard Error

$$\hat{\mathrm{SE}}(\bar{X}) = \sqrt{\frac{\bar{X}(1 - \bar{X})}{N}}$$
Using this, we can estimate the probability that $$\bar{X}$$ is within 1% (0.01) of $$p$$ by
se <- sqrt(x_hat * (1 - x_hat) / N)   # plug-in estimate of the standard error, with x_hat and N from the poll
pnorm(0.01 / se) - pnorm(-0.01 / se)
Margin of Error
The margin of error is two times the standard error: $$2\hat{\mathrm{SE}}(\bar{X})$$. Because $$\bar{X}$$ is approximately normal, there is roughly a 95% chance that it falls within two standard errors of $$p$$
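A minimal Monte Carlo sketch (with assumed values for p and N, not course code) confirming that the estimate falls within two standard errors of $$p$$ roughly 95% of the time:

set.seed(1)
p <- 0.45     # assumed true proportion
N <- 1000     # assumed poll size
B <- 10000    # number of simulated polls
inside <- replicate(B, {
  x_hat <- mean(sample(c(1, 0), N, replace = TRUE, prob = c(p, 1 - p)))
  se_hat <- sqrt(x_hat * (1 - x_hat) / N)
  abs(x_hat - p) <= 2 * se_hat
})
mean(inside)   # proportion of simulations within two standard errors, close to 0.95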
The Spread
Because we only have two parties, we know the spread is $$p - (1 - p) = 2p - 1$$ and can be estimated by

$$2\bar{X} - 1$$

Since we are multiplying a random variable by two, the standard error of this new variable is also multiplied by two: $$2\hat{\mathrm{SE}}(\bar{X})$$
Bias
Polling is more complex than simple random selection because we do not necessarily know if we are reaching all groups equally. For example, an internet poll may be biased because it excludes people without access to the internet
Intervals and P-Values
Confidence intervals are random intervals, such as $$\left[\bar{X} - 2\,\hat{\mathrm{SE}}(\bar{X}),\ \bar{X} + 2\,\hat{\mathrm{SE}}(\bar{X})\right]$$, that have a 95% chance of containing $$p$$

It is the interval that is random, not $$p$$. The 95% relates to the probability that the random interval we construct contains $$p$$
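As an illustration, assuming an observed proportion x_hat of 0.48 from a poll of size 1000, a 95% confidence interval can be computed with qnorm(0.975) (about 1.96 standard errors):

x_hat <- 0.48   # assumed observed proportion
N <- 1000       # assumed poll size
se_hat <- sqrt(x_hat * (1 - x_hat) / N)
c(x_hat - qnorm(0.975) * se_hat, x_hat + qnorm(0.975) * se_hat)   # 95% confidence interval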
Power
Power can be thought of as the probability of detecting a spread different from zero, i.e. the probability that our confidence interval for the spread does not include zero
P-Values
These are closely related to the confidence interval. The p-value is the probability of observing an estimate as extreme as, or more extreme than, the one we saw if the null hypothesis (for example, a spread of zero) were true.
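For example, a sketch (with made-up numbers) of a p-value for the null hypothesis that the spread is zero, i.e. $$p = 0.5$$, after observing 52 blue beads out of 100:

N <- 100
x_hat <- 52 / 100
z <- sqrt(N) * (x_hat - 0.5) / 0.5   # standardised estimate under the null p = 0.5
2 * (1 - pnorm(z))                   # two-sided p-value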
Poll Aggregation
Poll aggregation is the task of combining the results of multiple polls to obtain an overall estimate that is more precise than any individual poll
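A minimal sketch with simulated polls (the true spread and sample sizes below are assumed, not real data), aggregating by weighting each poll's spread by its sample size:

set.seed(1)
d <- 0.021                           # assumed true spread
Ns <- c(1298, 533, 1342, 897, 774)   # assumed poll sample sizes
p <- (d + 1) / 2
spreads <- sapply(Ns, function(N) {
  2 * mean(sample(c(1, 0), N, replace = TRUE, prob = c(p, 1 - p))) - 1
})
sum(spreads * Ns) / sum(Ns)          # aggregated estimate of the spread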
Poll Data and Pollster Bias
We can run into differences between polls whose expected values do not appear to be aligned, i.e. variability beyond what sampling alone would predict. This is known as pollster bias
Data Driven Models
If we make use of a random selection of the different poll data, our standard error now includes pollster-to-pollster variability, and this standard deviation is an unknown parameter. Because we are still using independent random variables, the CLT still works.
We can, however, estimate this unknown standard deviation with the sample standard deviation

Sample Standard Deviation

$$s = \sqrt{\frac{1}{N - 1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2}$$
Using the sd function in R we can calculate the sample standard deviation of the observed spreads with

sd(polls$spread)
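One way such a polls data frame with a spread column might be built from the dslabs 2016 polling data (the filter choices here are purely illustrative):

library(dslabs)
library(dplyr)
# illustrative filter: national polls ending in the final week before the 2016 election
polls <- polls_us_election_2016 %>%
  filter(state == "U.S." & enddate >= "2016-10-31") %>%
  mutate(spread = rawpoll_clinton / 100 - rawpoll_trump / 100)
sd(polls$spread)     # sample standard deviation of the observed spreads
mean(polls$spread)   # data-driven estimate of the expected spread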
Bayesian Statistics
In Bayesian statistics we treat the parameter as a random quantity rather than a fixed value, which allows us to speak about the probability of it taking different values.
Bayes' Theorem
$$\Pr(A \mid B) = \frac{\Pr(A \text{ and } B)}{\Pr(B)}$$

Or rather

$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\Pr(A)}{\Pr(B)}$$
The Hierarchical Model
Provides a mathematical description for why results may not match what we expect. It uses two levels: a first level describing variability between individuals, like an individual's ability to play a game, and a second level describing the associated randomness, or luck, around that ability
Posterior Distribution
The posterior distribution is the probability distribution of $$p$$ given that we have observed the data $$y$$

Posterior Distribution

$$\mathrm{E}(p \mid y) = B\mu + (1 - B)y = \mu + (1 - B)(y - \mu) \quad \text{where} \quad B = \frac{\sigma^2}{\sigma^2 + \tau^2}$$

From this we can see that $$B$$ is close to one when $$\sigma$$ is large, in which case the posterior mean shrinks toward the prior expected value $$\mu$$

Standard Error for Posterior Distribution

$$\mathrm{SE}(p \mid y)^2 = \frac{1}{\frac{1}{\sigma^2} + \frac{1}{\tau^2}}$$
This is known as an empirical Bayes approach, in which the prior is based on observed data. It delivers a better interval, known as the Bayesian credible interval
Note that the posterior distribution is normally distributed
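A minimal sketch of the calculation, where the prior parameters mu and tau and the observed average y with standard error sigma are all assumed values:

mu <- 0          # prior expected spread (assumed)
tau <- 0.035     # prior standard deviation (assumed)
y <- 0.021       # observed average spread (assumed)
sigma <- 0.004   # standard error of the observed average (assumed)
B <- sigma^2 / (sigma^2 + tau^2)
posterior_mean <- B * mu + (1 - B) * y
posterior_se <- sqrt(1 / (1 / sigma^2 + 1 / tau^2))
posterior_mean + c(-1, 1) * qnorm(0.975) * posterior_se   # 95% credible interval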
Mathematical Representation of Models
Given a collection of polls from which we observe the reported spreads, we can describe the variability of the $$j$$-th measurement with

$$X_j = d + \epsilon_j$$

Where $$\epsilon_j$$ is an associated error term capturing sampling variability
In order to adjust for pollster-to-pollster variability, we can add an adjustment based on each pollster's house bias

House Bias Adjusted Sampling

$$X_{i,j} = d + h_i + \epsilon_{i,j}$$

Where $$h_i$$ is the house effect of the $$i$$-th pollster and $$X_{i,j}$$ is the $$j$$-th poll from that pollster
In order to compensate for a general bias $$b$$ that may exist in all polls, we add another term

General Bias Adjusted

$$X_{i,j} = d + b + h_i + \epsilon_{i,j}$$
The reason we add this bias term is that, although it is unknown, it has a significant effect on the standard deviation of our estimate

Adjusted Average Value

$$\bar{X} = d + b + \frac{1}{N}\sum_{i=1}^{N}\epsilon_i$$

Adjusted Standard Deviation

$$\sqrt{\frac{\sigma^2}{N} + \sigma_b^2}$$
Note that because the $$b$$ value is the same in every poll, averaging more polls does not reduce the variance it contributes: the $$\sigma_b^2$$ term does not shrink as $$N$$ grows
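A small sketch with assumed values showing that the general-bias term keeps the standard error from shrinking to zero as the number of polls grows:

sigma <- 0.04     # assumed poll-to-poll standard deviation of the spread
N <- 25           # assumed number of polls
sigma_b <- 0.025  # assumed standard deviation of the general bias
sqrt(sigma^2 / N)               # standard error ignoring the general bias
sqrt(sigma^2 / N + sigma_b^2)   # standard error including the general bias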
Forecasting
Forecasting is about making predictions based on the variability of poll results over time.
Time Variation in Model

$$X_{i,j,t} = d + b + h_i + b_t + \epsilon_{i,j,t}$$

Where $$b_t$$ is the time-specific bias of a poll taken at time $$t$$

Model Trend

$$X_{i,j,t} = d + b + h_i + b_t + f(t) + \epsilon_{i,j,t}$$

Where $$f(t)$$ models the underlying trend of public opinion over time
The T Distribution
Because we introduce additional variability when estimating $$\sigma$$ with the sample standard deviation $$s$$, we end up with over-confident confidence intervals that are not sufficiently wide to account for this additional variability
Confidence Interval

$$Z = \frac{\bar{X} - d}{\sigma / \sqrt{N}}$$

Confidence Interval with $$s$$ instead of $$\sigma$$

$$Z = \frac{\bar{X} - d}{s / \sqrt{N}}$$
The theory tells us that $$Z$$ follows a t-distribution with $$N-1$$ degrees of freedom, which controls the additional variability introduced by estimating $$\sigma$$. This result holds reasonably well even for data that differ somewhat from a normal distribution
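For example, with an assumed average spread, sample standard deviation s, and a small number of polls N, the t-based interval is wider than the normal-based one:

avg <- 0.021   # assumed average spread
s <- 0.045     # assumed sample standard deviation
N <- 15        # assumed number of polls
avg + c(-1, 1) * qnorm(0.975) * s / sqrt(N)       # interval using the normal distribution
avg + c(-1, 1) * qt(0.975, N - 1) * s / sqrt(N)   # wider interval using the t-distribution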
Chi Squared Test
Aims to calculate how likely it is to see a deviation as large as, or larger than, the one observed just by chance, in the case of categorical or binary data
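A minimal sketch using a made-up two-by-two table of counts; chisq.test() reports how likely a deviation at least as large as the one observed would be under independence:

two_by_two <- matrix(c(35, 15, 25, 25), nrow = 2,
                     dimnames = list(outcome = c("yes", "no"), group = c("A", "B")))
chisq.test(two_by_two)   # p-value for a deviation this large arising by chance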