Time Series Stationarity


Components of Time Series data

  • Trend
  • Seasonality
  • Irregularity
  • Cyclicality
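
To make the four components concrete, here is a minimal sketch that builds a synthetic monthly series by adding them together. Every number in it (slope, amplitudes, periods, noise scale) is a made-up assumption for illustration and is not taken from the AirPassengers data used later.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed so the sketch is repeatable
t = np.arange(120)                # 120 hypothetical monthly time steps

trend = 0.5 * t                                     # steady long-run increase
seasonality = 10 * np.sin(2 * np.pi * t / 12)       # pattern that repeats every 12 steps
cyclicality = 5 * np.sin(2 * np.pi * t / 60)        # slower swing (a fixed 60-step period is a simplification)
irregularity = rng.normal(scale=2.0, size=t.size)   # random noise

# Additive combination of the four components
series = trend + seasonality + cyclicality + irregularity
print(series[:5])
```

Real data rarely splits this cleanly, and the components may combine multiplicatively rather than additively.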

When not to use Time Series Analysis

  • Values are constant - it's pointless
  • Values are in the form of functions - just use the function

Stationarity

  • Constant mean
  • Constant variance
  • Autocovariance that depends only on the lag, not on time

Because these statistical properties do not change over time, a stationary series is likely to follow the same pattern in the future
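
Written out, the three conditions above are the usual definition of weak (covariance) stationarity. The symbols X_t, \mu, \sigma^2 and \gamma(h) are notation introduced here for illustration and do not appear in the notebook:

```latex
\begin{aligned}
\mathbb{E}[X_t] &= \mu && \text{for all } t \\
\operatorname{Var}(X_t) &= \sigma^2 < \infty && \text{for all } t \\
\operatorname{Cov}(X_t, X_{t+h}) &= \gamma(h) && \text{for all } t \text{ and each lag } h
\end{aligned}
```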

Stationarity Tests

  • Rolling Statistics - moving average, moving variance, visualization
  • Augmented Dickey-Fuller (ADF) Test

ARIMA

ARIMA (AutoRegressive Integrated Moving Average) is a common model for time series analysis and forecasting

The ARIMA model has the following parameters:

  • p - order of the Auto Regressive (AR) part
  • d - degree of differencing, the Integration (I) part
  • q - order of the Moving Average (MA) part

Applying the Above

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import seaborn as sns
df = pd.read_csv('/kaggle/input/air-passengers/AirPassengers.csv')

df.head()
Month #Passengers
0 1949-01 112
1 1949-02 118
2 1949-03 132
3 1949-04 129
4 1949-05 121
# Parse the 'YYYY-MM' strings into timestamps using an explicit format
df['Month'] = pd.to_datetime(df['Month'], format='%Y-%m')
df = df.set_index(['Month'])

df.head()
#Passengers
Month
1949-01-01 112
1949-02-01 118
1949-03-01 132
1949-04-01 129
1949-05-01 121
sns.lineplot(data=df)
<AxesSubplot:xlabel='Month'>

In the above we can see that there is an upward trend as well as some seasonality

Next, we can check some summary statistics using a rolling mean approach

Rolling Averages

Note that the rolling functions use a window of 12 because the data has a seasonal period of 12 months
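
As a small illustration of how a 12-step window behaves, the sketch below applies `rolling(window=12).mean()` to a hypothetical series of 24 values (the numbers 1 to 24, chosen only because the averages are easy to check by hand). The first 11 results are NaN because a full window is not yet available.

```python
import numpy as np
import pandas as pd

# 24 hypothetical monthly values: 1.0, 2.0, ..., 24.0
values = pd.Series(np.arange(1, 25, dtype=float))

rolling_mean_demo = values.rolling(window=12).mean()

print(rolling_mean_demo.isna().sum())   # number of leading positions without a full window
print(rolling_mean_demo.iloc[11])       # mean of the first 12 values
print(rolling_mean_demo.iloc[23])       # mean of the last 12 values
```

This is why the rolling mean and standard deviation lines start one year later than the raw data in the plots.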

rolling_mean = df.rolling(window=12).mean()
rolling_std = df.rolling(window=12).std()

df_summary = df.assign(Mean=rolling_mean)
df_summary = df_summary.assign(Std=rolling_std)

sns.lineplot(data=df_summary)
<AxesSubplot:xlabel='Month'>

Since the rolling mean and rolling standard deviation both change over time, we can conclude that the data is not stationary

ADF Test

The null hypothesis of the test is that the series is non-stationary (it has a unit root). We reject it if the resulting p-value is below 0.05 (or some other chosen threshold)

from statsmodels.tsa.stattools import adfuller
def print_adf(adf):
    print('ADF test statistic', adf[0])
    print('p-value', adf[1])
    print('Lags used', adf[2])
    print('Observations used', adf[3])
    print('Critical values', adf[4])
adf = adfuller(df['#Passengers'])

print_adf(adf)
ADF test statistic 0.8153688792060472
p-value 0.991880243437641
Lags used 13
Observations used 130
Critical values {'1%': -3.4816817173418295, '5%': -2.8840418343195267, '10%': -2.578770059171598}

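In the result of the ADF test we can see that the p-value is much higher than 0.05, so we fail to reject the null hypothesis and treat the data as non-stationary. The test statistic is also above all three critical values, which points the same way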
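
The decision rule can be sketched as a small helper. `interpret_adf` is a hypothetical name introduced here, not part of statsmodels; it only assumes the tuple layout already used by `print_adf` above (statistic, p-value, lags, observations, critical values). The input is the ADF output printed above for the original series.

```python
def interpret_adf(adf_result, alpha=0.05):
    """Return (reject_null, below_critical) for an adfuller-style result tuple."""
    statistic, p_value, _lags, _nobs, critical_values = adf_result[:5]
    # Reject the null hypothesis of non-stationarity when the p-value is small
    reject_null = p_value < alpha
    # Equivalent view: is the statistic more negative than each critical value?
    below_critical = {level: statistic < cv for level, cv in critical_values.items()}
    return reject_null, below_critical

# Values copied from the ADF output for the original series
adf_original = (
    0.8153688792060472,
    0.991880243437641,
    13,
    130,
    {'1%': -3.4816817173418295, '5%': -2.8840418343195267, '10%': -2.578770059171598},
)

reject_null, below_critical = interpret_adf(adf_original)
print(reject_null)      # False: we fail to reject non-stationarity
print(below_critical)   # False at every level
```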

Because the data is non-stationary, the next thing we need to do is estimate the trend

df_log = np.log(df)

sns.lineplot(data=df_log)
<AxesSubplot:xlabel='Month'>
rolling_mean_log = df_log.rolling(window=12).mean()

df_summary = df_log.assign(Mean=rolling_mean_log)

sns.lineplot(data=df_summary)
<AxesSubplot:xlabel='Month'>
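
A minimal sketch of why a log transform is a common first step for a series like this one: when growth is multiplicative, the raw step sizes keep increasing, while the steps of the logged series stay constant. The 2% growth rate and starting value of 100 are assumptions chosen for illustration, not estimates from the data.

```python
import numpy as np

t = np.arange(48)
growth = 100.0 * 1.02 ** t        # hypothetical series growing 2% per step

raw_steps = np.diff(growth)               # step sizes get larger over time
log_steps = np.diff(np.log(growth))       # step sizes are constant after the log

print(raw_steps[0], raw_steps[-1])
print(log_steps[0], log_steps[-1])
```

The log does not remove the trend, it only makes it closer to linear and evens out the size of the swings, which is why a further detrending step is still needed.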

Even with the log transform the trend is still visible. We can try removing it by subtracting the rolling mean from the original data:

df_diff = df - rolling_mean

sns.lineplot(data=df_diff)
<AxesSubplot:xlabel='Month'>
rolling_mean_diff = df_diff.rolling(window=12).mean()
rolling_std_diff = df_diff.rolling(window=12).std()

df_summary = df_diff.assign(Mean=rolling_mean_diff)
df_summary = df_summary.assign(Std=rolling_std_diff)

sns.lineplot(data=df_summary)
<AxesSubplot:xlabel='Month'>
adf_diff = adfuller(df_diff.dropna())

print_adf(adf_diff)
ADF test statistic -3.1649681299551427
p-value 0.022104139473878973
Lags used 13
Observations used 119
Critical values {'1%': -3.4865346059036564, '5%': -2.8861509858476264, '10%': -2.579896092790057}
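
To see why subtracting the rolling mean removes a trend, consider a purely linear series. A trailing 12-step mean lags the current value by 5.5 steps on average, so the difference is the constant slope × 5.5. The intercept of 50 and slope of 2 below are hypothetical values picked for the sketch.

```python
import numpy as np
import pandas as pd

t = np.arange(36)
linear = pd.Series(50.0 + 2.0 * t)    # hypothetical series with slope 2 and no seasonality

# Same operation as df - rolling_mean, applied to the toy series
detrended = linear - linear.rolling(window=12).mean()

print(detrended.dropna().head())
```

With slope 2 the remaining values should all be close to 2 × 5.5 = 11. The AirPassengers trend is not exactly linear, so its detrended series still wanders somewhat.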

We can do the same with the log:

df_diff_log = df_log - rolling_mean_log

sns.lineplot(data=df_diff_log)
<AxesSubplot:xlabel='Month'>
rolling_mean_diff_log = df_diff_log.rolling(window=12).mean()
rolling_std_diff_log = df_diff_log.rolling(window=12).std()

df_summary = df_diff_log.assign(Mean=rolling_mean_diff_log)
df_summary = df_summary.assign(Std=rolling_std_diff_log)

sns.lineplot(data=df_summary)
<AxesSubplot:xlabel='Month'>
adf_diff_log = adfuller(df_diff_log.dropna())

print_adf(adf_diff_log)
ADF test statistic -3.162907991300889
p-value 0.02223463000124189
Lags used 13
Observations used 119
Critical values {'1%': -3.4865346059036564, '5%': -2.8861509858476264, '10%': -2.579896092790057}

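The ADF p-value for the log series minus its rolling mean is below 0.05, so we reject the null hypothesis and treat the result as stationary at the 5% level (but not at the 1% level, since the test statistic is above the 1% critical value)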

We can also try dividing the original data by the rolling mean:

df_div = df / rolling_mean

sns.lineplot(data=df_div)
<AxesSubplot:xlabel='Month'>
rolling_mean_div = df_div.rolling(window=12).mean()
rolling_std_div = df_div.rolling(window=12).std()

df_summary = df_div.assign(Mean=rolling_mean_div)
df_summary = df_summary.assign(Std=rolling_std_div)

sns.lineplot(data=df_summary)
<AxesSubplot:xlabel='Month'>
adf_div = adfuller(df_div.dropna())

print_adf(adf_div)
ADF test statistic -3.034425217431025
p-value 0.03180220053455359
Lags used 13
Observations used 119
Critical values {'1%': -3.4865346059036564, '5%': -2.8861509858476264, '10%': -2.579896092790057}

The ADF p-value for the division is below 0.05, so we can treat the result as stationary at the 5% level

Next we can try to do a decomposition on the above series since it is stationary:

from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(df_div.dropna())
trend = decomposition.trend

sns.lineplot(data=trend.dropna())
<AxesSubplot:xlabel='Month', ylabel='trend'>
seasonal = decomposition.seasonal

sns.lineplot(data=seasonal.dropna())
<AxesSubplot:xlabel='Month', ylabel='seasonal'>
resid = decomposition.resid

sns.lineplot(data=resid.dropna())
<AxesSubplot:xlabel='Month', ylabel='resid'>
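
For intuition about what the trend, seasonal and residual parts represent, here is a hand-rolled sketch of a classical additive decomposition using only NumPy. It is an illustration under simplifying assumptions (even period, additive model, synthetic input), not a replica of the statsmodels implementation. `classical_decompose` is a hypothetical helper name.

```python
import numpy as np

def classical_decompose(y, period=12):
    """Split y into trend, seasonal and residual parts (additive, even period)."""
    y = np.asarray(y, dtype=float)
    half = period // 2
    # Centered moving average for an even period: half weight on the two end points
    weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    trend = np.full(len(y), np.nan)
    trend[half:len(y) - half] = np.convolve(y, weights, mode="valid")
    detrended = y - trend
    # Average the detrended values at each position within the cycle
    seasonal_means = np.array(
        [np.nanmean(detrended[i::period]) for i in range(period)]
    )
    seasonal_means -= seasonal_means.mean()   # force the seasonal part to sum to zero
    seasonal = np.tile(seasonal_means, len(y) // period + 1)[:len(y)]
    resid = y - trend - seasonal
    return trend, seasonal, resid

# Synthetic check: linear trend plus a known 12-step seasonal pattern, no noise
t = np.arange(48)
true_seasonal = 10 * np.sin(2 * np.pi * t / 12)
y = 100 + 2 * t + true_seasonal

trend_est, seasonal_est, resid_est = classical_decompose(y, period=12)
print(np.nanmax(np.abs(resid_est)))   # close to zero for this noise-free input
```

As in the plots above, the first and last half-period of the trend and residual are NaN because the centered window does not fit there.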