Data Analysis

Data Analysis With Python

Based on this Cognitive Class Course

Labs

The labs for the course are located in the Labs folder; they are from CognitiveClass and are licensed under MIT

Introduction

The data being analysed is a used-car dataset, and the goal is to estimate the price of a used car based on its characteristics

  • Dataset to be analyzed in Python
  • Overview of Python Packages
  • Importing and Exporting Data
  • Basic Insights from the Data

Understanding the Data

The data being used is the Autos (imports-85) dataset from the UCI Machine Learning Repository at archive.ics.uci.edu

The first few lines of the file are as follows

imports-85.data

3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,13495
3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,16500
1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.50,171.20,65.50,52.40,2823,ohcv,six,152,mpfi,2.68,3.47,9.00,154,5000,19,26,16500
2,164,audi,gas,std,four,sedan,fwd,front,99.80,176.60,66.20,54.30,2337,ohc,four,109,mpfi,3.19,3.40,10.00,102,5500,24,30,13950
2,164,audi,gas,std,four,sedan,4wd,front,99.40,176.60,66.40,54.30,2824,ohc,five,136,mpfi,3.19,3.40,8.00,115,5500,18,22,17450
2,?,audi,gas,std,two,sedan,fwd,front,99.80,177.30,66.30,53.10,2507,ohc,five,136,mpfi,3.19,3.40,8.50,110,5500,19,25,15250
1,158,audi,gas,std,four,sedan,fwd,front,105.80,192.70,71.40,55.70,2844,ohc,five,136,mpfi,3.19,3.40,8.50,110,5500,19,25,17710
1,?,audi,gas,std,four,wagon,fwd,front,105.80,192.70,71.40,55.70,2954,ohc,five,136,mpfi,3.19,3.40,8.50,110,5500,19,25,18920

The description of the data can be found in the imports-85.names file. The attributes, by column number, and their ranges are as follows

1. symboling: -3, -2, -1, 0, 1, 2, 3.
2. normalized-losses: continuous from 65 to 256.
3. make: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo
4. fuel-type: diesel, gas.
5. aspiration: std, turbo.
6. num-of-doors: four, two.
7. body-style: hardtop, wagon, sedan, hatchback, convertible.
8. drive-wheels: 4wd, fwd, rwd.
9. engine-location: front, rear.
10. wheel-base: continuous from 86.6 to 120.9.
11. length: continuous from 141.1 to 208.1.
12. width: continuous from 60.3 to 72.3.
13. height: continuous from 47.8 to 59.8.
14. curb-weight: continuous from 1488 to 4066.
15. engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
16. num-of-cylinders: eight, five, four, six, three, twelve, two.
17. engine-size: continuous from 61 to 326.
18. fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
19. bore: continuous from 2.54 to 3.94.
20. stroke: continuous from 2.07 to 4.17.
21. compression-ratio: continuous from 7 to 23.
22. horsepower: continuous from 48 to 288.
23. peak-rpm: continuous from 4150 to 6600.
24. city-mpg: continuous from 13 to 49.
25. highway-mpg: continuous from 16 to 54.
26. price: continuous from 5118 to 45400.

Missing Attribute Values: (denoted by "?")

The symboling value is an indicator of the vehicle's risk level; the lower the value, the lower the risk (from an insurance perspective)

The normalized-losses value is the relative average insurance loss payment per insured vehicle year, normalized for the vehicle's size class

The price is the value that we would like to predict given the other features

Note that this dataset is from 1985, and therefore the prices may seem a little low

Python Libraries

Scientific Computing Libraries

Pandas

Pandas is a library for working with Data Structures, primarily DataFrames
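
A minimal sketch of creating a DataFrame (the values here are purely illustrative):

import pandas as pd

# build a small DataFrame from a dictionary of columns
cars = pd.DataFrame({'make': ['audi', 'bmw'], 'price': [13950, 16500]})
print(cars.head())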

NumPy

NumPy is a library for working with Arrays and Matrices
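
A minimal sketch of working with a NumPy array:

import numpy as np

# element-wise operations on an array
a = np.array([1, 2, 3])
print(a * 2)     # [2 4 6]
print(a.mean())  # 2.0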

SciPy

SciPy includes functions for assisting in mathematical analysis as well as some basic visualizations

Visualization

Matplotlib

Matplotlib is the most commonly used Python library for data visualization

Seaborn

Seaborn is based on Matplotlib and provides functionality such as heat maps, time series, and violin plots

Algorithmic Libraries

Scikit-learn

Machine learning: regression, classification, etc.

Statsmodels

Explore data, estimate statistical models, and perform statistical tests

Import and Export Data

Importing

When importing data we need to take a few things into consideration, such as

  • Format
  • File source/path

In our case the data is CSV (.data) and is located at a remote source

We can import our data as follows

import pandas as pd
import numpy as np
# read the online file from the URL provided above, and assign it to variable "df"
path="https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"

df = pd.read_csv(path,header=None)

print("import done")
import done
df.head(2)
0 1 2 3 4 5 6 7 8 9 ... 16 17 18 19 20 21 22 23 24 25
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500

2 rows × 26 columns

We can see that our data comes in without headers. We can assign headers to our data based on the information in the imports-85.names file

headers = ['symboling','normalized-losses','make','fuel-type','aspiration', 'num-of-doors','body-style',
         'drive-wheels','engine-location','wheel-base', 'length','width','height','curb-weight','engine-type',
         'num-of-cylinders', 'engine-size','fuel-system','bore','stroke','compression-ratio','horsepower',
         'peak-rpm','city-mpg','highway-mpg','price']
df.columns = headers
df.head(10)
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450
5 2 ? audi gas std two sedan fwd front 99.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 15250
6 1 158 audi gas std four sedan fwd front 105.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 17710
7 1 ? audi gas std four wagon fwd front 105.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 18920
8 1 158 audi gas turbo four sedan fwd front 105.8 ... 131 mpfi 3.13 3.40 8.3 140 5500 17 20 23875
9 0 ? audi gas turbo two hatchback 4wd front 99.5 ... 131 mpfi 3.13 3.40 7.0 160 5500 16 22 ?

10 rows × 26 columns

Later on we will remove the rows with missing values in the price column

Analyzing Data

Pandas has a few different methods to understand the data

We need to do some basic checks such as

  • Data Types
    • Pandas automatically assigns data types which may not necessarily be correct
    • This will further allow us to understand what functions we can apply to what columns
df.dtypes
symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                  object
stroke                object
compression-ratio    float64
horsepower            object
peak-rpm              object
city-mpg               int64
highway-mpg            int64
price                 object
dtype: object

We can also view the statistical summary as follows

df.describe()
symboling wheel-base length width height curb-weight engine-size compression-ratio city-mpg highway-mpg
count 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000
mean 0.834146 98.756585 174.049268 65.907805 53.724878 2555.565854 126.907317 10.142537 25.219512 30.751220
std 1.245307 6.021776 12.337289 2.145204 2.443522 520.680204 41.642693 3.972040 6.542142 6.886443
min -2.000000 86.600000 141.100000 60.300000 47.800000 1488.000000 61.000000 7.000000 13.000000 16.000000
25% 0.000000 94.500000 166.300000 64.100000 52.000000 2145.000000 97.000000 8.600000 19.000000 25.000000
50% 1.000000 97.000000 173.200000 65.500000 54.100000 2414.000000 120.000000 9.000000 24.000000 30.000000
75% 2.000000 102.400000 183.100000 66.900000 55.500000 2935.000000 141.000000 9.400000 30.000000 34.000000
max 3.000000 120.900000 208.100000 72.300000 59.800000 4066.000000 326.000000 23.000000 49.000000 54.000000

However, if we want to view a summary including fields that are of type object, we can do the following

df.describe(include = "all")
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
count 205.000000 205 205 205 205 205 205 205 205 205.000000 ... 205.000000 205 205 205 205.000000 205 205 205.000000 205.000000 205
unique NaN 52 22 2 2 3 5 3 2 NaN ... NaN 8 39 37 NaN 60 24 NaN NaN 187
top NaN ? toyota gas std four sedan fwd front NaN ... NaN mpfi 3.62 3.40 NaN 68 5500 NaN NaN ?
freq NaN 41 32 185 168 114 96 120 202 NaN ... NaN 94 23 20 NaN 19 37 NaN NaN 4
mean 0.834146 NaN NaN NaN NaN NaN NaN NaN NaN 98.756585 ... 126.907317 NaN NaN NaN 10.142537 NaN NaN 25.219512 30.751220 NaN
std 1.245307 NaN NaN NaN NaN NaN NaN NaN NaN 6.021776 ... 41.642693 NaN NaN NaN 3.972040 NaN NaN 6.542142 6.886443 NaN
min -2.000000 NaN NaN NaN NaN NaN NaN NaN NaN 86.600000 ... 61.000000 NaN NaN NaN 7.000000 NaN NaN 13.000000 16.000000 NaN
25% 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN 94.500000 ... 97.000000 NaN NaN NaN 8.600000 NaN NaN 19.000000 25.000000 NaN
50% 1.000000 NaN NaN NaN NaN NaN NaN NaN NaN 97.000000 ... 120.000000 NaN NaN NaN 9.000000 NaN NaN 24.000000 30.000000 NaN
75% 2.000000 NaN NaN NaN NaN NaN NaN NaN NaN 102.400000 ... 141.000000 NaN NaN NaN 9.400000 NaN NaN 30.000000 34.000000 NaN
max 3.000000 NaN NaN NaN NaN NaN NaN NaN NaN 120.900000 ... 326.000000 NaN NaN NaN 23.000000 NaN NaN 49.000000 54.000000 NaN

11 rows × 26 columns

We can also use df.info() to see a concise summary of the dataframe, including the column data types and non-null counts
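
For example (a minimal sketch, assuming df has been loaded as above):

df.info()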

If we would like to get a more complete report of the dataset, however, we can make use of the pandas_profiling library, which will output a full overall data visualisation

import pandas as pd
import pandas_profiling
pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/planets.csv').profile_report()

Preprocessing Data

Data preprocessing is the process of cleaning the data in order to get data from its raw form to a more usable format

This can include different stages such as

  • Missing value handling
  • Data formatting
  • Data normalization
  • Data binning
  • Turning categorical values into numerical values

Missing Values

A missing value occurs when no data value is stored for a specific feature in an observation

Missing data can be represented in many different ways such as ?, N/A, 0 or blank among other ways

In our dataset, missing values (such as in normalized-losses) are represented as "?"

We can deal with missing values in a variety of ways

  • Check if the correct data can be found
  • Drop the feature
  • Drop the data entry
  • Replace with average or similar datapoints
  • Replace with most common value
  • Replace based on knowledge about data
  • Leave the data as missing data

Dropping missing data can be done with the df.dropna() function; since we have missing values in the price column, those rows will need to be dropped

We also need to make use of the inplace=True parameter to modify the actual dataframe

The df.replace() function allows us to replace missing values in the data

df.replace("?", np.nan, inplace = True)
df.head(10)
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 NaN alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 NaN alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 NaN alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450
5 2 NaN audi gas std two sedan fwd front 99.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 15250
6 1 158 audi gas std four sedan fwd front 105.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 17710
7 1 NaN audi gas std four wagon fwd front 105.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 18920
8 1 158 audi gas turbo four sedan fwd front 105.8 ... 131 mpfi 3.13 3.40 8.3 140 5500 17 20 23875
9 0 NaN audi gas turbo two hatchback 4wd front 99.5 ... 131 mpfi 3.13 3.40 7.0 160 5500 16 22 NaN

10 rows × 26 columns

Next, we can count the missing values in each column

missing_data = df.isnull()
missing_data.head(5)
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 False True False False False False False False False False ... False False False False False False False False False False
1 False True False False False False False False False False ... False False False False False False False False False False
2 False True False False False False False False False False ... False False False False False False False False False False
3 False False False False False False False False False False ... False False False False False False False False False False
4 False False False False False False False False False False ... False False False False False False False False False False

5 rows × 26 columns

for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print('')
symboling
False    205
Name: symboling, dtype: int64

normalized-losses
False    164
True      41
Name: normalized-losses, dtype: int64

make
False    205
Name: make, dtype: int64

fuel-type
False    205
Name: fuel-type, dtype: int64

aspiration
False    205
Name: aspiration, dtype: int64

num-of-doors
False    203
True       2
Name: num-of-doors, dtype: int64

body-style
False    205
Name: body-style, dtype: int64

drive-wheels
False    205
Name: drive-wheels, dtype: int64

engine-location
False    205
Name: engine-location, dtype: int64

wheel-base
False    205
Name: wheel-base, dtype: int64

length
False    205
Name: length, dtype: int64

width
False    205
Name: width, dtype: int64

height
False    205
Name: height, dtype: int64

curb-weight
False    205
Name: curb-weight, dtype: int64

engine-type
False    205
Name: engine-type, dtype: int64

num-of-cylinders
False    205
Name: num-of-cylinders, dtype: int64

engine-size
False    205
Name: engine-size, dtype: int64

fuel-system
False    205
Name: fuel-system, dtype: int64

bore
False    201
True       4
Name: bore, dtype: int64

stroke
False    201
True       4
Name: stroke, dtype: int64

compression-ratio
False    205
Name: compression-ratio, dtype: int64

horsepower
False    203
True       2
Name: horsepower, dtype: int64

peak-rpm
False    203
True       2
Name: peak-rpm, dtype: int64

city-mpg
False    205
Name: city-mpg, dtype: int64

highway-mpg
False    205
Name: highway-mpg, dtype: int64

price
False    201
True       4
Name: price, dtype: int64

We'll replace the missing data as follows

  • Replace by Mean
    • "normalized-losses": 41 missing data
    • "stroke": 4 missing data
    • "bore": 4 missing data
    • "horsepower": 2 missing data
    • "peak-rpm": 2 missing data
  • Replace by Frequency
    • "num-of-doors": 2 missing data
      • Because most cars have four doors
  • Drop the Row
    • "price": 4 missing data
      • Because price is what we want to predict, it does not help if it is not there

Normalized Losses

avg_1 = df['normalized-losses'].astype('float').mean(axis = 0)
df['normalized-losses'].replace(np.nan, avg_1, inplace = True)

Bore

avg_2=df['bore'].astype('float').mean(axis=0)
df['bore'].replace(np.nan, avg_2, inplace= True)

Stroke

avg_3 = df["stroke"].astype("float").mean(axis = 0)
df["stroke"].replace(np.nan, avg_3, inplace = True)

Horsepower

avg_4=df['horsepower'].astype('float').mean(axis=0)
df['horsepower'].replace(np.nan, avg_4, inplace= True)

Peak RPM

avg_5=df['peak-rpm'].astype('float').mean(axis=0)
df['peak-rpm'].replace(np.nan, avg_5, inplace= True)

Number of Doors

First we need to check which door count is the most common

df['num-of-doors'].value_counts()
four    114
two      89
Name: num-of-doors, dtype: int64

And then replace invalid values with that

df["num-of-doors"].replace(np.nan, "four", inplace = True)

Price

And then lastly drop the rows with a missing price

df.dropna(subset=["price"], axis=0, inplace = True)

# reset index, because we dropped four rows
df.reset_index(drop = True, inplace = True)

Clean Data

Then we can view our dataset, which has no missing values

df.head()
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 122 alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 122 alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 122 alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450

5 rows × 26 columns

Formatting Data

Now we need to ensure that our data is correctly formatted

  • Data is usually collected in different places and stored in different formats
  • Bringing data into a common standard allows for more meaningful comparison

In order to format data, Pandas comes with some tools for us to use

We can also apply mathematical operations to our data, as well as type conversions and column renames

Sometimes an incorrect data type may be assigned by default, and it may be necessary for us to convert our data to the correct type for analysis. We can convert datatypes using the df.astype() function, and check types with df.dtypes

Data Types

df.dtypes
symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                  object
stroke                object
compression-ratio    float64
horsepower            object
peak-rpm              object
city-mpg               int64
highway-mpg            int64
price                 object
dtype: object

Convert types

We can easily convert our data to the correct types with

df[['bore', 'stroke']] = df[['bore', 'stroke']].astype('float')
df[['normalized-losses']] = df[['normalized-losses']].astype('int')
df[['price']] = df[['price']].astype('float')
df[['peak-rpm']] = df[['peak-rpm']].astype('float')

And view the corrected types

df.dtypes
symboling              int64
normalized-losses      int64
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower            object
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                float64
dtype: object

Standardization

Standardization is the process of taking data from one format to another that may be more meaningful for us to use, such as converting our fuel consumption

# transform mpg to L/100km by mathematical operation (235 divided by mpg)
df['city-L/100km'] = 235/df['city-mpg']
df[['city-mpg','city-L/100km']].head()
city-mpg city-L/100km
0 21 11.190476
1 21 11.190476
2 19 12.368421
3 24 9.791667
4 18 13.055556
df.head()
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price city-L/100km
0 3 122 alfa-romero gas std two convertible rwd front 88.6 ... mpfi 3.47 2.68 9.0 111 5000.0 21 27 13495.0 11.190476
1 3 122 alfa-romero gas std two convertible rwd front 88.6 ... mpfi 3.47 2.68 9.0 111 5000.0 21 27 16500.0 11.190476
2 1 122 alfa-romero gas std two hatchback rwd front 94.5 ... mpfi 2.68 3.47 9.0 154 5000.0 19 26 16500.0 12.368421
3 2 164 audi gas std four sedan fwd front 99.8 ... mpfi 3.19 3.40 10.0 102 5500.0 24 30 13950.0 9.791667
4 2 164 audi gas std four sedan 4wd front 99.4 ... mpfi 3.19 3.40 8.0 115 5500.0 18 22 17450.0 13.055556

5 rows × 27 columns

df['highway-L/100km'] = 235/df['highway-mpg']
df[['highway-mpg','highway-L/100km']].head()
highway-mpg highway-L/100km
0 27 8.703704
1 27 8.703704
2 26 9.038462
3 30 7.833333
4 22 10.681818
df.head()
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price city-L/100km highway-L/100km
0 3 122 alfa-romero gas std two convertible rwd front 88.6 ... 3.47 2.68 9.0 111 5000.0 21 27 13495.0 11.190476 8.703704
1 3 122 alfa-romero gas std two convertible rwd front 88.6 ... 3.47 2.68 9.0 111 5000.0 21 27 16500.0 11.190476 8.703704
2 1 122 alfa-romero gas std two hatchback rwd front 94.5 ... 2.68 3.47 9.0 154 5000.0 19 26 16500.0 12.368421 9.038462
3 2 164 audi gas std four sedan fwd front 99.8 ... 3.19 3.40 10.0 102 5500.0 24 30 13950.0 9.791667 7.833333
4 2 164 audi gas std four sedan 4wd front 99.4 ... 3.19 3.40 8.0 115 5500.0 18 22 17450.0 13.055556 10.681818

5 rows × 28 columns

Renaming Columns

If there is a need for us to rename columns we can do so with df.rename(), for example

df.rename(columns={'highway-mpg':'highway-consumption'}, inplace=True)

Data Normalization

This is an important part of data preprocessing

We may need to normalize our variables such that the range of our data is more consistent, allowing us to manage the way that different values will impact our analysis

There are several different ways to normalize data

Simple Feature Scaling

x_{new}=\frac{x_{old}}{x_{max}}

df['length'] = df['length']/df['length'].max()

Min-Max

x_{new}=\frac{x_{old}-x_{min}}{x_{max}-x_{min}}

df['length'] = (df['length']-df['length'].min()) / (df['length'].max()-df['length'].min())

Z-Score

The resulting values hover around zero, and are in terms of their standard deviation from the mean

x_{new}=\frac{x_{old}-\mu}{\sigma}

df['length'] = (df['length']-df['length'].mean()) / df['length'].std()

Normalization

Next we will normalize our values using the Simple Feature Scaling Method

We will apply this to the following features

  • 'length'
  • 'width'
  • 'height'
df['length'] = df['length']/df['length'].max()
df['width'] = df['width']/df['width'].max()
df['height'] = df['height']/df['height'].max()
df[["length","width","height"]].head()
length width height
0 0.811148 0.890278 0.816054
1 0.811148 0.890278 0.816054
2 0.822681 0.909722 0.876254
3 0.848630 0.919444 0.908027
4 0.848630 0.922222 0.908027

Binning

Binning is grouping values together into bins; this can sometimes improve the accuracy of predictive models and help us to better understand the data distribution

We'll arrange our 'horsepower' column into bins such that we can label cars as having a 'Low', 'Medium', or 'High' horsepower as follows

df["horsepower"]=df["horsepower"].astype(float, copy=True)
binwidth = (max(df["horsepower"])-min(df["horsepower"]))/4
bins = np.arange(min(df["horsepower"]), max(df["horsepower"]), binwidth)
bins
array([  48. ,  101.5,  155. ,  208.5])

We can then use pd.cut() to determine what bin each value belongs in

group_names = ['Low', 'Medium', 'High']
df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names,include_lowest=True )
df[['horsepower','horsepower-binned']].head(20)
horsepower horsepower-binned
0 111.0 Medium
1 111.0 Medium
2 154.0 Medium
3 102.0 Medium
4 115.0 Medium
5 110.0 Medium
6 110.0 Medium
7 110.0 Medium
8 140.0 Medium
9 101.0 Low
10 101.0 Low
11 121.0 Medium
12 121.0 Medium
13 121.0 Medium
14 182.0 High
15 182.0 High
16 182.0 High
17 48.0 Low
18 70.0 Low
19 70.0 Low

Visualization

We can use matplotlib to visualize the number of vehicles in each of our bins with the following

%matplotlib inline
from matplotlib import pyplot as plt

# draw histogram of attribute "horsepower" with bins = 3
plt.hist(df["horsepower"], bins = 3)

# set x/y labels and plot title
plt.xlabel("Horsepower")
plt.ylabel("Count")
plt.title("Horsepower Bins")
plt.show()

png

Categorical to Quantitative

Most statistical models cannot take in strings as inputs. In order to handle categorical string values, we provide dummy variables that indicate whether or not an entry takes on a certain value; we can do this in pandas with pd.get_dummies()

This method will automatically generate a dummy vector for the provided dataframe, and is used as follows

dummy_value = pd.get_dummies(df['value'])

Fuel

We will do this for the 'fuel-type' column in our data

dummy_fuel = pd.get_dummies(df["fuel-type"])
dummy_fuel.head()
diesel gas
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1

Then we will rename our columns for clarity

dummy_fuel.rename(columns={'gas':'fuel-type-gas', 'diesel':'fuel-type-diesel'}, inplace=True)
dummy_fuel.head()
fuel-type-diesel fuel-type-gas
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1

Then we will add this column to our dataset

# merge data frame "df" and "dummy_fuel" 
df = pd.concat([df, dummy_fuel], axis=1)

# drop original column "fuel-type" from "df"
df.drop("fuel-type", axis = 1, inplace=True)
df.head()
symboling normalized-losses make aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... horsepower peak-rpm city-mpg highway-mpg price city-L/100km highway-L/100km horsepower-binned fuel-type-diesel fuel-type-gas
0 3 122 alfa-romero std two convertible rwd front 88.6 0.811148 ... 111.0 5000.0 21 27 13495.0 11.190476 8.703704 Medium 0 1
1 3 122 alfa-romero std two convertible rwd front 88.6 0.811148 ... 111.0 5000.0 21 27 16500.0 11.190476 8.703704 Medium 0 1
2 1 122 alfa-romero std two hatchback rwd front 94.5 0.822681 ... 154.0 5000.0 19 26 16500.0 12.368421 9.038462 Medium 0 1
3 2 164 audi std four sedan fwd front 99.8 0.848630 ... 102.0 5500.0 24 30 13950.0 9.791667 7.833333 Medium 0 1
4 2 164 audi std four sedan 4wd front 99.4 0.848630 ... 115.0 5500.0 18 22 17450.0 13.055556 10.681818 Medium 0 1

5 rows × 30 columns

Aspiration

We will do the same as above with our aspiration variable

# get indicator variables for aspiration and assign them to data frame "dummy_aspiration"
dummy_aspiration = pd.get_dummies(df['aspiration'])

# change column names for clarity
dummy_aspiration.rename(columns={'std':'aspiration-std', 'turbo': 'aspiration-turbo'}, inplace=True)

# show first 5 instances of data frame "dummy_aspiration"
dummy_aspiration.head()
aspiration-std aspiration-turbo
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
# merge data frame "df" and "dummy_aspiration"
df = pd.concat([df, dummy_aspiration], axis=1)

# drop original column "aspiration" from "df"
df.drop('aspiration', axis = 1, inplace=True)
df.head()
symboling normalized-losses make num-of-doors body-style drive-wheels engine-location wheel-base length width ... city-mpg highway-mpg price city-L/100km highway-L/100km horsepower-binned fuel-type-diesel fuel-type-gas aspiration-std aspiration-turbo
0 3 122 alfa-romero two convertible rwd front 88.6 0.811148 0.890278 ... 21 27 13495.0 11.190476 8.703704 Medium 0 1 1 0
1 3 122 alfa-romero two convertible rwd front 88.6 0.811148 0.890278 ... 21 27 16500.0 11.190476 8.703704 Medium 0 1 1 0
2 1 122 alfa-romero two hatchback rwd front 94.5 0.822681 0.909722 ... 19 26 16500.0 12.368421 9.038462 Medium 0 1 1 0
3 2 164 audi four sedan fwd front 99.8 0.848630 0.919444 ... 24 30 13950.0 9.791667 7.833333 Medium 0 1 1 0
4 2 164 audi four sedan 4wd front 99.4 0.848630 0.922222 ... 18 22 17450.0 13.055556 10.681818 Medium 0 1 1 0

5 rows × 31 columns

Exporting

Then we can export our clean data to a CSV as follows

df.to_csv('clean_df.csv')

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an approach to analyze data in order to

  • Summarize main characteristics of data
  • Gain better understanding of a dataset
  • Uncover relationships between varables
  • Identify important variables that impact our problem

Descriptive Statistics

When analyzing data it is important to describe the data by giving a few short summaries. We have a few different ways to do this, such as

  • df.describe() to get general statistical information about numeric data
  • df['column'].value_counts() to get a count of the different values of categorical data
  • sns.boxplot() to generate box plots
  • plt.scatter() to generate scatter plots, which show the relationship between the predictor and target variables (see the sketch below)
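
A minimal scatter plot sketch, assuming df has been cleaned as above (the chosen columns are just one example pairing):

import matplotlib.pyplot as plt

# scatter plot of engine size against price
plt.scatter(df['engine-size'], df['price'])
plt.xlabel('Engine Size')
plt.ylabel('Price')
plt.title('Engine Size vs Price')
plt.show()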

Correlation

Correlation is a measure of the extent to which different variables are interdependent

Correlation does not imply causation

Pearson Correlation

We can make use of Pearson correlation to measure the strength of correlation between two features; this gives us two values

  • Correlation Coeffient
    • Close to 1: Positive relationship
    • Close to -1: Negative relationship
    • Close to 0: No relationship
  • P-Value
    • P < 0.001: Strong certainty in the result
    • P < 0.05 : Moderate certainty in result
    • P < 0.1 : Weak certainty in result
    • P > 0.1 : No certainty in result

We would define a strong correlation when the Correlation Coefficient is around 1 or -1, and the P-Value is less than 0.001

We can find our correlation values for two variables with

from scipy import stats
Wheel Base and Price
pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)  
The Pearson Correlation Coefficient is 0.584641822266  with a P-value of P = 8.07648827073e-20
Horsepower and Price
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)  
The Pearson Correlation Coefficient is 0.809574567004  with a P-value of P = 6.36905742826e-48

Positive Correlation

We can visualize our data using some of the visualizations we described above such as by comparing engine size to price

import seaborn as sns

sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)
plt.show()

png

We can also view the correlation matrix for these two variables as follows

df[["engine-size", "price"]].corr()
engine-size price
engine-size 1.000000 0.872335
price 0.872335 1.000000

From this we can see that engine size is a fairly good predictor of price

Negative Correlation

We can also look at the correlation between highway mileage and price

sns.regplot(x="highway-mpg", y="price", data=df)
plt.show()
df[['highway-mpg', 'price']].corr()

png

highway-mpg price
highway-mpg 1.000000 -0.704692
price -0.704692 1.000000

Where we can note a Negative Linear Relationship

Weak Correlation

We can then compare Peak RPM and price

sns.regplot(x="peak-rpm", y="price", data=df)
plt.show()
df[['peak-rpm','price']].corr()

png

peak-rpm price
peak-rpm 1.000000 -0.101616
price -0.101616 1.000000

Where we can see a weak linear relationship

We can also observe the relationships between stroke and price

sns.regplot(x="stroke", y="price", data=df)
plt.show()
df[["stroke","price"]].corr()

png

stroke price
stroke 1.000000 0.082269
price 0.082269 1.000000

Next we can look at boxplots of our different categorical values in order to look at their distribution

sns.boxplot(x="body-style", y="price", data=df)
plt.show()

png

sns.boxplot(x="engine-location", y="price", data=df)
plt.show()

png

sns.boxplot(x="drive-wheels", y="price", data=df)
plt.show()

png

Descriptive Statistics

Next we can find some descriptive statistics about our data by the following

df.describe()
symboling normalized-losses wheel-base length width height curb-weight engine-size bore stroke ... peak-rpm city-mpg highway-mpg price city-L/100km highway-L/100km fuel-type-diesel fuel-type-gas aspiration-std aspiration-turbo
count 201.000000 201.00000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 ... 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000
mean 0.840796 122.00000 98.797015 0.837102 0.915126 0.899108 2555.666667 126.875622 3.330692 3.256874 ... 5117.665368 25.179104 30.686567 13207.129353 9.944145 8.044957 0.099502 0.900498 0.820896 0.179104
std 1.254802 31.99625 6.066366 0.059213 0.029187 0.040933 517.296727 41.546834 0.268072 0.316048 ... 478.113805 6.423220 6.815150 7947.066342 2.534599 1.840739 0.300083 0.300083 0.384397 0.384397
min -2.000000 65.00000 86.600000 0.678039 0.837500 0.799331 1488.000000 61.000000 2.540000 2.070000 ... 4150.000000 13.000000 16.000000 5118.000000 4.795918 4.351852 0.000000 0.000000 0.000000 0.000000
25% 0.000000 101.00000 94.500000 0.801538 0.890278 0.869565 2169.000000 98.000000 3.150000 3.110000 ... 4800.000000 19.000000 25.000000 7775.000000 7.833333 6.911765 0.000000 1.000000 1.000000 0.000000
50% 1.000000 122.00000 97.000000 0.832292 0.909722 0.904682 2414.000000 120.000000 3.310000 3.290000 ... 5125.369458 24.000000 30.000000 10295.000000 9.791667 7.833333 0.000000 1.000000 1.000000 0.000000
75% 2.000000 137.00000 102.400000 0.881788 0.925000 0.928094 2926.000000 141.000000 3.580000 3.410000 ... 5500.000000 30.000000 34.000000 16500.000000 12.368421 9.400000 0.000000 1.000000 1.000000 0.000000
max 3.000000 256.00000 120.900000 1.000000 1.000000 1.000000 4066.000000 326.000000 3.940000 4.170000 ... 6600.000000 49.000000 54.000000 45400.000000 18.076923 14.687500 1.000000 1.000000 1.000000 1.000000

8 rows × 22 columns

df.describe(include=['object'])
make num-of-doors body-style drive-wheels engine-location engine-type num-of-cylinders fuel-system
count 201 201 201 201 201 201 201 201
unique 22 2 5 3 2 6 7 8
top toyota four sedan fwd front ohc four mpfi
freq 32 115 94 118 198 145 157 92
Value Counts

We can get an indication as to the frequencies of our categorical values with the following

# drive-wheel counts as a variable
drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()
drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'}, inplace=True)
drive_wheels_counts
drive_wheels_counts.index.name = 'drive-wheels'
drive_wheels_counts
value_counts
drive-wheels
fwd 118
rwd 75
4wd 8
# engine-location as variable
engine_loc_counts = df['engine-location'].value_counts().to_frame()
engine_loc_counts.rename(columns={'engine-location': 'value_counts'}, inplace=True)
engine_loc_counts.index.name = 'engine-location'
engine_loc_counts.head(10)
value_counts
engine-location
front 198
rear 3

Grouping Data

It can be helpful to group data in order to see if there is a correlation between a certain categorical variable and some other variable

We can get an array of all unique 'drive-wheels' categories with

df['drive-wheels'].unique()
array(['rwd', 'fwd', '4wd'], dtype=object)

In Pandas we can use df.groupby() to do this, as such

df_wheels = df[['drive-wheels','body-style','price']]
df_wheels = df_wheels.groupby(['drive-wheels'],as_index= False).mean()
df_wheels
drive-wheels price
0 4wd 10241.000000
1 fwd 9244.779661
2 rwd 19757.613333

Next we can look at the wheels compared to the body style

df_external = df[['drive-wheels', 'body-style', 'price']]
df_group = df_external.groupby(['drive-wheels','body-style'], as_index=False).mean().round()
df_group
drive-wheels body-style price
0 4wd hatchback 7603.0
1 4wd sedan 12647.0
2 4wd wagon 9096.0
3 fwd convertible 11595.0
4 fwd hardtop 8249.0
5 fwd hatchback 8396.0
6 fwd sedan 9812.0
7 fwd wagon 9997.0
8 rwd convertible 23950.0
9 rwd hardtop 24203.0
10 rwd hatchback 14338.0
11 rwd sedan 21712.0
12 rwd wagon 16994.0

We can view this data a bit more conveniently as a pivot table; we can do this with df.pivot() so as to display the data as below

grouped_pivot = df_group.pivot(index='drive-wheels',columns='body-style')
grouped_pivot
price
body-style convertible hardtop hatchback sedan wagon
drive-wheels
4wd NaN NaN 7603.0 12647.0 9096.0
fwd 11595.0 8249.0 8396.0 9812.0 9997.0
rwd 23950.0 24203.0 14338.0 21712.0 16994.0

Next we can use a heatmap to visualize the above relationship; the heatmap plots the target variable as the colour with respect to the drive-wheels and body-style variables

#use the grouped results
fig, ax = plt.subplots()
im=ax.pcolor(grouped_pivot, cmap='RdBu')

#label names
row_labels=grouped_pivot.columns.levels[1]
col_labels=grouped_pivot.index

#move ticks and labels to the center
ax.set_xticks(np.arange(grouped_pivot.shape[1])+0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0])+0.5, minor=False)

#insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)

#rotate label if too long
plt.xticks(rotation=90)

fig.colorbar(im)
plt.show()

png

Analysis of Variance (ANOVA)

ANOVA is useful when we want to analyze a categorical variable and see the correlation between its different categories

  • Statistical comparison of groups

ANOVA is a statistical test that helps find the correlation between different groups of a categorical variable; the ANOVA test returns two values

  • F-Test score is the variation between sample group means divided by variation within the sample group
  • P-Value is a measure of confidence and shows whether the obtained result is statistically significant
Drive Wheels
grouped_test2=df[['drive-wheels','price']].groupby(['drive-wheels'])
grouped_test2.head(2)
drive-wheels price
0 rwd 13495.0
1 rwd 16500.0
3 fwd 13950.0
4 4wd 17450.0
5 fwd 15250.0
136 4wd 7603.0
grouped_test2.get_group('4wd')['price']
4      17450.0
136     7603.0
140     9233.0
141    11259.0
144     8013.0
145    11694.0
150     7898.0
151     8778.0
Name: price, dtype: float64

To obtain the ANOVA values, we use the stats.f_oneway function from scipy for the correlation between drive wheels and price

Drive Wheels and Price
# ANOVA
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price'])  
 
print( "ANOVA results: F=", f_val, ", P =", p_val)   
ANOVA results: F= 67.9540650078 , P = 3.39454435772e-23

This shows a large F and a small P, meaning that there is a strong correlation between the three drive types and the price; however, does this apply for each individual comparison of drive types?

FWD and RWD
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'])  
 
print( "ANOVA results: F=", f_val, ", P =", p_val )
ANOVA results: F= 130.553316096 , P = 2.23553063557e-23
4WD and RWD
f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('rwd')['price'])  
   
print( "ANOVA results: F=", f_val, ", P =", p_val)   
ANOVA results: F= 8.58068136892 , P = 0.00441149221123

4WD and FWD

f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('fwd')['price'])  
 
print("ANOVA results: F=", f_val, ", P =", p_val)   
ANOVA results: F= 0.665465750252 , P = 0.416201166978

Model Development

A model can be thought of as a mathematical equation used to predict a value given a specific input

More relevant data will allow us to more accurately predict an outcome

Linear Regression

Simple Linear Regression

SLR helps us to identify a relationship between a single independent (predictor) variable and a dependent (target) variable, in the form of

y=b_0 + b_1x

We make use of training datapoints to help us find the b_0 and b_1 values

Our measured datapoints will have some noise, causing our data to deviate from the model

We make use of our training data to fit a model, and then use the model to make predictions. We denote our predicted values as an estimate with

\hat y=b_0+b_1x

If the linear model is correct, we can assume that the offsets are noise, otherwise it may be other factors that we have not taken into consideration

We can create a linear regression model by importing it from sklearn and creating a new LinearRegression object as follows

from sklearn.linear_model import LinearRegression

lm = LinearRegression()

Then we define the predictor and target variables and use lm.fit() to find the parameters b_0 and b_1

X = df[['predictor']]
Y = df[['target']]

lm.fit(X,Y)

We can then obtain a prediction using lm.predict()

yhat = lm.predict(X)

The intercept and slope are attributes of the LinearRegression object and can be found with lm.intercept_ and lm.coef_ respectively

We can train an SLR model for our data as follows

df.head()
symboling normalized-losses make num-of-doors body-style drive-wheels engine-location wheel-base length width ... city-mpg highway-mpg price city-L/100km highway-L/100km horsepower-binned fuel-type-diesel fuel-type-gas aspiration-std aspiration-turbo
0 3 122 alfa-romero two convertible rwd front 88.6 0.811148 0.890278 ... 21 27 13495.0 11.190476 8.703704 Medium 0 1 1 0
1 3 122 alfa-romero two convertible rwd front 88.6 0.811148 0.890278 ... 21 27 16500.0 11.190476 8.703704 Medium 0 1 1 0
2 1 122 alfa-romero two hatchback rwd front 94.5 0.822681 0.909722 ... 19 26 16500.0 12.368421 9.038462 Medium 0 1 1 0
3 2 164 audi four sedan fwd front 99.8 0.848630 0.919444 ... 24 30 13950.0 9.791667 7.833333 Medium 0 1 1 0
4 2 164 audi four sedan 4wd front 99.4 0.848630 0.922222 ... 18 22 17450.0 13.055556 10.681818 Medium 0 1 1 0

5 rows × 31 columns

from sklearn.linear_model import LinearRegression
X = df[['highway-mpg']]
Y = df['price']
lm = LinearRegression()
lm.fit(X,Y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
Yhat=lm.predict(X)
Yhat[0:5]
array([ 16236.50464347,  16236.50464347,  17058.23802179,  13771.3045085 ,
        20345.17153508])
lm_eq = 'yhat = ' + str(lm.intercept_) + ' + ' + str(lm.coef_[0]) + 'x'
print(lm_eq)
yhat = 38423.3058582 + -821.733378322x

Multiple Linear Regression

MLR is used to explain the relationship between one continuous target variable and two or more predictor variables

The model will then be defined by the function

\hat y=b_0 + b_1x_1 + b_2x_2 + b_3x_3 + ... + b_nx_n

The resulting equation represents a surface in n-dimensional space that will define y given each component of x

We can train this model just as before as follows

X = df[['predictor1','predictor2','predictor3',...,'predictorN']]
Y = df[['target']]

lm.fit(X,Y)

And then predict as before with

yhat = lm.predict(X)

Where X is in the form of the training data

The intercept and coefficients of the model can once again be found with lm.intercept_ and lm.coef_ respectively

We can develop a MLR model as follows

Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
lm.fit(Z, df['price'])
print('intercept: ' + str(lm.intercept_), '\ncoefficients: ' + str(lm.coef_))
intercept: -15806.6246263 
coefficients: [ 53.49574423   4.70770099  81.53026382  36.05748882]

Model Evaluation using Visualization

Regression Plot

Regression plots give us a good estimate of:

  • The relationship between two variables
  • The strength of correlation
  • The direction of the relationship

A regression plot is a combination of a scatter plot, and a linear regression line

We create a regression plot using the sns.regplot() function in seaborn

import seaborn as sns

sns.regplot(x='predictor', y='target', data=df)
plt.ylim(0,)
plt.show()
Simple Linear Regression

We can visualize SLR as follows

width = 12
height = 10
plt.figure(figsize=(width, height))
sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)
plt.show()

png

plt.figure(figsize=(width, height))
sns.regplot(x="peak-rpm", y="price", data=df)
plt.ylim(0,)
(0, 47422.919330307624)

png

Residual Plot

A residual plot represents the error between the actual and predicted values; the residuals should have a mean of 0 if the linear assumption is applicable

If the residual plot is not equally distributed around the mean of 0 throughout, we know that the linear model is not correct for the data we have

We can create a residual plot with sns.residplot()

sns.residplot(df['predictor'], df['target'])
width = 12
height = 10
plt.figure(figsize=(width, height))
sns.residplot(df['highway-mpg'], df['price'])
plt.show()

png

Distribution Plot

We use this to look at the distribution of the actual vs predicted values for our model

A distribution plot can be made with sns.distplot()

ax1 = sns.distplot(df['target'], hist=False, label='Actual Value')
sns.distplot(yhat, hist=False, label='Fitted Values', ax=ax1)
plt.show()
Multiple Linear Regression

Using a distribution plot is more valuable when looking at MLR model performance

Y_hat = lm.predict(Z)
plt.figure(figsize=(width, height))

ax1 = sns.distplot(df['price'], hist=False, color="r", label="Actual Value")
sns.distplot(Y_hat, hist=False, color="b", label="Fitted Values", ax=ax1)


plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')

plt.show()
plt.close()

png

Polynomial Regression and Pipelines

Polynomial Regression is a form of regression used to describe curvilinear relationships, in which the relationship between the predictor variable and the target is not linear

  • Quadratic \hat y=b_0 + b_1x_1 + b_2x_1^2
  • Cubic \hat y=b_0 + b_1x_1 + b_2x_1^2 + b_3x_1^3
  • Higher Order \hat y=b_0 + b_1x_1 + b_2x_1^2 + ... + b_nx_1^n

Picking the correct order can make our model more accurate

We can fit a polynomial to our data using np.polyfit(), and can print out the resulting model with np.poly1d()

For an n^{th} order polynomial we can create and view a model as follows

f = np.polyfit(X, Y, n)
p = np.poly1d(f)
print(p)

Visualization

Visualizing Polynomial regression plots is a bit more work but can be done with the function below

def PlotPolly(model, independent_variable, dependent_variable, Name):
    x_new = np.linspace(15, 55, 100)
    y_new = model(x_new)

    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Price ~ Length')
    ax = plt.gca()
    fig = plt.gcf()
    plt.xlabel(Name)
    plt.ylabel('Price of Cars')

    plt.show()
    plt.close()

Next we can do the polynomial fit for our data

x = df['highway-mpg']
y = df['price']

f = np.polyfit(x, y, 3)
p = np.poly1d(f)
print(p)
        3         2
-1.557 x + 204.8 x - 8965 x + 1.379e+05
PlotPolly(p,x,y, 'highway-mpg')

png

Multi Dimension Polynomial Regression

Polynomial regression can also be done in multiple dimensions, for example a second order approximation would look like the following

\hat y=b_0 + b_1x_1 + b_2x_2 + b_3x_1x_2 + b_4x_1^2 + b_5x_2^2

If we want to do multi-dimensional polynomial fits, we will need to use PolynomialFeatures from sklearn.preprocessing to preprocess our features as follows

X = df[['predictor1','predictor2','predictor3',...,'predictorN']]
Y = df[['target']]

pr = PolynomialFeatures(degree=2, include_bias=False)
x_poly = pr.fit_transform(X)
from sklearn.preprocessing import PolynomialFeatures
pr = PolynomialFeatures(degree=2)
pr
PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)
Z_pr = pr.fit_transform(Z)

Before the transformation we have 4 features

Z.shape
(201, 4)

After the transformation we have 15 features

Z_pr.shape
(201, 15)

Pre-Processing

sklearn has some preprocessing functionality such as

Normalization

We can train a scaler and normalize our data based on that as follows

from sklearn.preprocessing import StandardScaler

scale = StandardScaler()
scale.fit(X[['feature1','feature2',..., 'featureN']])

X_scaled = scale.transform(X[['feature1','feature2',..., 'featureN']])

There are other normalization functions available with which we can preprocess our data
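
For example, a minimal sketch using MinMaxScaler (the feature names here are placeholders):

from sklearn.preprocessing import MinMaxScaler

# scale each feature to the [0, 1] range
minmax = MinMaxScaler()
X_minmax = minmax.fit_transform(X[['feature1','feature2']])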

Pipelines

There are many steps to getting a prediction, such as Normalization, Polynomial Transformation, and training a Linear Regression Model

We can use a Pipeline library to help simplify the process

First we import all the libraries we will need as well as the pipeline library

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

from sklearn.pipeline import Pipeline

Then we construct our pipeline using a list of tuples defined as ('name of estimator model', ModelConstructor()) and create our pipeline object

Input = [('scale', StandardScaler()),
         ('polynomial', PolynomialFeatures(degree=n)),
         ('model', LinearRegression())]
pipe = Pipeline(Input)

We can then train our pipeline on the data by using the pipe.fit function

pipe.fit(X[['feature1','feature2',..., 'featureN']], Y)
yhat = pipe.predict(X[['feature1','feature2',..., 'featureN']])

The above method will normalize the data, then perform a polynomial transform and output a prediction

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
Input=[('scale',StandardScaler()),
       ('polynomial', PolynomialFeatures(include_bias=False)),
       ('model',LinearRegression())]
pipe=Pipeline(Input)
pipe
Pipeline(memory=None,
     steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('polynomial', PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)), ('model', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))])
pipe.fit(Z,y)
Pipeline(memory=None,
     steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('polynomial', PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)), ('model', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))])
ypipe=pipe.predict(Z)
ypipe[0:4]
array([ 13102.74784201,  13102.74784201,  18225.54572197,  10390.29636555])

In-Sample Evaluation

Two important measures that determine how well a model fits a specific dataset are

  • Mean Squared Error (MSE)
  • R-Squared (R^2)

Mean Squared Error

We find the average of the squared differences between our predicted values and the actual data values

To get the MSE in Python we can do the following

from sklearn.metrics import mean_squared_error

mean_squared_error(Y, Y_predicted)

R-Squared

R^2 is the coefficient of determination

  • Measure of how close the data is to the fitted regression line
  • The percentage of variation in Y that is explained by the model
  • Like comparing our model to the mean of the data in approximating the data

R^2 is usually between 0 and 1

R^2=1-\frac{\text{MSE of regression line}}{\text{MSE of } \bar{y}}

We can get the R^2 value with lm.score()

A negative R^2 can be a sign of overfitting

In-Sample Evaluation of Models

We can evaluate our models with the following

#highway_mpg_fit
lm.fit(X, Y)
# Find the R^2
lm.score(X, Y)
0.49659118843391747
Yhat = lm.predict(X)
Yhat[0:4]
array([ 16236.50464347,  16236.50464347,  17058.23802179,  13771.3045085 ])
from sklearn.metrics import mean_squared_error
#mean_squared_error(Y_true, Y_predict)
mean_squared_error(df['price'], Yhat)
31635042.944639895
# fit the model 
lm.fit(Z, df['price'])
# Find the R^2
lm.score(Z, df['price'])
0.80935628065774567
Y_predict_multifit = lm.predict(Z)
mean_squared_error(df['price'], Y_predict_multifit)
11980366.87072649

Now we can check what the R^2 value for our model is

from sklearn.metrics import r2_score
r_squared = r2_score(y, p(x))
r_squared
0.67419466639065173
mean_squared_error(df['price'], p(x))
20474146.426361222

Prediction and Decision Making

We should use visualization, numerical evaluation, and model comparison in order to see if the model values make sense

To compare our model to the data we can simply plot the output of our model over the range of our data

new_input=np.arange(1,100,1).reshape(-1,1)
lm.fit(X, Y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
yhat=lm.predict(new_input)
yhat[0:5]
array([ 37601.57247984,  36779.83910151,  35958.10572319,  35136.37234487,
        34314.63896655])
plt.plot(new_input,yhat)
plt.show()

png

Conclusion

From the results above (yes, they're a mess but it's all pretty much just from the CC Lab file) we can note the R^2 and MSE values are as follows

Simple Linear Regression: Using Highway-mpg as a Predictor Variable of Price.

  • R-squared: 0.49659118843391759
  • MSE: 3.16 x10^7

Multiple Linear Regression: Using Horsepower, Curb-weight, Engine-size, and Highway-mpg as Predictor Variables of Price.

  • R-squared: 0.80896354913783497
  • MSE: 1.2 x10^7

Polynomial Fit: Using Highway-mpg as a Predictor Variable of Price.

  • R-squared: 0.6741946663906514
  • MSE: 2.05 x 10^7

SLR vs MLR

Usually having more variables helps the prediction, however if you do not have enough data you can run into trouble or many of the variables may just be noise and not be very useful

The MSE for MLR is smaller than for SLR, and the R^2 for MLR is higher as well; both MSE and R^2 seem to indicate that MLR is a better fit than SLR

SLR vs Polynomial

The MSE for the Polynomial Fit is less than for the SLR and the R^2 is higher, meaning that the Polynomial Fit is a better predictor based on 'highway-mpg' than the SLR

MLR vs Polynomial

The MSE for the MLR is smaller than for the Polynomial Fit, and the MLR also has a higher R^2, therefore the MLR is a better fit than the Polynomial in this case

Overall

The MLR has the highest R^2 and the lowest MSE, meaning that it is the best fit of the three models that have been evaluated

Model Evaluation

Model Evaluation tells us how our model works in the real world. In-sample evaluation does not give us an indication as to how our model performs under real life circumstances

Training and Test Sets

We typically split our data into a training set and a testing set, and use them to build and evaluate our model respectively

  • Split into
    • 70% Training
    • 30% Testing
  • Build and Train with Training Set
  • Use Testing Set to evaluate model performance

sklearn gives us a function to split out data into a train and test set

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(x_data,y_data,test_size=0.3,random_state=0)

Generalization Performance

Generalization Error is a measure of how well our data does at predicting previously unseen data

The error we obtain using our testing data is an approximation of this error

Cross Validation

Cross validation involves splitting the data into folds, using one fold for testing and the remainder for training, and making use of each combination of training and evaluation folds

We can use cross_val_score() to perform this out-of-sample evaluation

from sklearn.model_selection import cross_val_score
scores = cross_val_score(lr,X,Y,cv=n)

Where X is our predictor matrix, Y is our target, and n is our number of folds
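
As a minimal sketch on our data (assuming lm and Z are defined as above, and using an arbitrary 4 folds):

from sklearn.model_selection import cross_val_score

# R^2 score for each of the 4 folds, and their average
scores = cross_val_score(lm, Z, df['price'], cv=4)
print(scores)
print('mean:', scores.mean(), 'std:', scores.std())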

If we want to get the actual predicted values from our model we can do the following

from sklearn.model_selection import cross_val_predict
yhat = cross_val_predict(lr,X,Y,cv=n)

Overfitting, Underfitting, and Model Selection

The goal of model selection is to select the best function to fit our model. If our model is too simple, it will not appropriately fit our data, whereas if the model is too complex, it will perfectly fit our training data but will not be good at approximating new data

We need to be sure to fit the data, not the noise

It is also possible that the data we are trying to approximate cannot be fitted by a polynomial at all, for example in the case of cyclic data

We can make use of the following code to look at the effect of the polynomial order on the R^2 value for a model

rsq_test = []
order = [1, 2, 3, 4]

for n in order:
    pr = PolynomialFeatures(degree=n)
    x_train_pr = pr.fit_transform(x_train[['feature']])
    x_test_pr = pr.fit_transform(x_test[['feature']])
    lm.fit(x_train_pr, y_train)
    rsq_test.append(lm.score(x_test_pr, y_test))

Ridge Regression

Ridge Regression prevents overfitting

Ridge regression controls the higher order parameters in our model by using a factor α; increasing the α value helps us avoid overfitting, however increasing it too far can lead to us underfitting the data

To use ridge regression we can do the following

from sklearn.linear_model import Ridge

ridge_model = Ridge(alpha=0.1)
ridge_model.fit(X,Y)
yhat = ridge_model.predict(X)

We will usually start off with a small value of α such as 0.0001 and increase our value in orders of magnitude

Furthermore we can use Cross Validation to identify an optimal α

Grid Search allows us to scan through multiple free parameters

Parameters such as α are not part of the training or fitting process. These parameters are called Hyperparameters

sklearn has a means of automatically iterating over these parameters called Grid Search

This allows us to use different hyperparameters to train our model, and select the model that provides the lowest MSE

The values for a grid search are simply a list which contains a dictionary for the parameters we want to modify

params = [{'param name':[value1, value2, value3, ...]}]

We can use Grid Search as follows

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

params = [{'alpha':[0.0001, 0.01, 1, 10, 100]}, {'normalize':[True,False]}]

rr = Ridge()

grid = GridSearchCV(rr, params, cv=4)
grid.fit(X_train[['feature1','feature2',...]], Y_train)

grid.best_estimator_

scores = grid.cv_results_

The scores dictionary will store the results for each hyperparameter combination given our inputs
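
A minimal sketch of inspecting those results (assuming grid has been fitted as above):

# print the mean cross-validated test score for each parameter combination
for params, mean_score in zip(scores['params'], scores['mean_test_score']):
    print(params, mean_score)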

Conclusion

For more specific examples of the different sections look at the appropriate Lab