Deep Learning with Keras
Deep Learning and Neural Networks with Keras
Notes from this YouTube series. Full course notes can be found on GitHub
Overview of Neural Networks
A Neural Network can take in many kinds of data and is able to handle and process data that other ML models are not really able to process
In a normal model you would pass in a 1D vector such as a list of predictors; with an NN you can pass in more complex data, and the model will place weight on the position as well as the value of a respective data point, which is something other models can't necessarily handle
Some examples of higher order data can be:
- 1D Vector - Normal input, like a row in a spreadsheet
- 2D Matrix - Grayscale image
- 3D Matrix - Colour image
- nD Matrix - Any higher order data
With traditional models we speak about regression or classification.
A regression network could have a single numerical output, while a classification network could have a set of binary outputs for each class (like one-hot) or a probability of the result being each of the possible outputs
Neural Networks are also capable of more complex outputs or even combinations of outputs
In general an NN consists of an Input Layer which takes in the input data, a few hidden layers which process the data, and an output layer which is our target outcome. Each layer passes weighted data to the next
There are usually these types of neurons:
- Input - get the input data
- Hidden - between input and output and abstract processing
- Output - the output that's calculated
- Context - hold state between calls to the network
- Bias Neurons - similar to a y-intercept, allow us to offset the data going into a neuron
Neural networks pass data between nodes using Activation functions; some common ones are:
- Rectified Linear Unit (ReLU) - used for hidden layers
- Softmax - output for classification
- Linear - for regression
The Bias Neuron, along with a Weight, allows us to move and scale our activation functions
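As a rough sketch (plain numpy, not part of the course code), a weight scales the input and a bias shifts it before the activation function is applied:

import numpy as np

def relu(z):
    # ReLU passes positive values through and clamps negatives to zero
    return np.maximum(0, z)

x = np.array([-2.0, -0.5, 0.0, 1.5])
w, b = 2.0, 1.0          # the weight scales the input, the bias shifts it

print(relu(x))           # [0.  0.  0.  1.5]
print(relu(w * x + b))   # [0. 0. 1. 4.]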
Tensorflow and Keras
TensorFlow is the low-level library for Neural Networks, and Keras is an API that sits on top of TF and allows you to interact with it at a higher level
The current version of TF requires Python 3.7, so just align with that
TensorBoard is a way to visualize Neural Networks
Using Tensorflow Directly
Simple Matrix Multiplication
import tensorflow as tf
matrix1 = tf.constant([[3., 3.]])
matrix2 = tf.constant([[2.], [2.]])
product = tf.matmul(matrix1, matrix2)
print(product)
float(product)
tf.Tensor([[12.]], shape=(1, 1), dtype=float32)
12.0
Using Variables
Variables can be created, used, reassigned, and recalculated with:
x = tf.Variable([1., 2.])
a = tf.constant([3., 3.])
print(tf.subtract(x, a).numpy())
[-2. -1.]
x.assign([4., 6.])
print(tf.subtract(x, a).numpy())
[1. 3.]
Using Keras with MPG Dataset
Keras enables us to think about the Layers in an NN. We'll use the Miles Per Gallon (Auto MPG) dataset, which contains attributes of cars that we can use to predict their fuel efficiency (mpg)
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from sklearn import metrics
DATA_URL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
COLUMN_NAMES = [
'mpg',
'cylinders',
'displacement',
'horsepower',
'weight',
'acceleration',
'model year',
'origin',
'car name'
]
df = pd.read_fwf(
DATA_URL,
names=COLUMN_NAMES,
na_values=['NA', '?']
)
# fill missing
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())
df.head()
|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
|---|-----|-----------|--------------|------------|--------|--------------|------------|--------|----------|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | "chevrolet chevelle malibu" |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | "buick skylark 320" |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | "plymouth satellite" |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | "amc rebel sst" |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | "ford torino" |
X = df.drop(['mpg', 'car name'], axis=1).values
y = df[['mpg']].values
Build Regression Model with Keras
When building a Neural Network we take the following steps:
- Create a Sequential
- Define the Hidden Layers
- Define the Output Layer
- Compile and Train the Model
1. Create Sequential
model = Sequential()
2. Define Hidden Layers
Define the first hidden layer with the input_dim set to the shape of our input data set (the number of columns in X in this case)
A dense layer is one where each neuron is connected to every neuron in the next layer
model.add(Dense(25, input_dim=X.shape[1], activation='relu'))
model.add(Dense(10, activation='relu'))
3. Define the Output Layer
This depends on the dimensionality of the output, similar to the input. For this case it is one dimensional
model.add(Dense(1))
4. Compile and Train the Model
We specify a loss function and an optimizer for the model, and then give it the X and y values to train on as well as how many epochs we want it to train for
For a Regression NN you usually use MSE as the loss
We can also make use of methods to increase the model's effectiveness and identify the optimal number of epochs
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X, y, verbose=0, epochs=400)
<tensorflow.python.keras.callbacks.History at 0x2041a5ff688>
Test the Model
y_pred = model.predict(X)
score = np.sqrt(metrics.mean_squared_error(y_pred, y))
'RMSE: ' + str(score)
'RMSE: 3.4881811444565303'
Build a Classification Model with Keras
Building a Classification Model is much the same, however we need to ensure that we one-hot encode our categorical values, and in this case we'll have a categorical output, which means more than one potential result
For this we're making use of the Iris Dataset
For a Multi-Class classification we use softmax as the output activation and categorical_crossentropy as the loss
For a Binary classification we can instead use an applicable loss and activation
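As a hedged sketch (not from the course), a binary classifier would typically swap the output layer for a single sigmoid neuron and use binary_crossentropy, assuming a hypothetical feature matrix X and a 0/1 target y:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# hypothetical binary classifier: one sigmoid output with binary cross-entropy loss
model = Sequential()
model.add(Dense(25, input_dim=X.shape[1], activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')

Continuing with the multi-class Iris example: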
DATA_URL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
COLUMN_NAMES = [
'sepal length',
'sepal width',
'petal length',
'petal width',
'class'
]
df = pd.read_csv(DATA_URL, names=COLUMN_NAMES)
df.head()
|   | sepal length | sepal width | petal length | petal width | class |
|---|--------------|-------------|--------------|-------------|-------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
X = df.drop('class', axis=1).values
dummies = pd.get_dummies(df['class'])
species = dummies.columns
y = dummies.values
model = Sequential()
model.add(Dense(50, input_dim=X.shape[1], activation='relu'))
model.add(Dense(25, activation='relu'))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(X, y, verbose=0, epochs=100)
<tensorflow.python.keras.callbacks.History at 0x2041b8c9c88>
y_pred = model.predict(X)
predict_classes = np.argmax(y_pred,axis=1)
expected_classes = np.argmax(y,axis=1)
print(f"Predictions: {predict_classes}")
print(f"Expected: {expected_classes}")
print(species[predict_classes[1:10]])
score = metrics.accuracy_score(expected_classes,predict_classes)
'Accuracy: ' + str(score)
Predictions: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
Expected: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
Index(['Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa'], dtype='object')
'Accuracy: 0.9733333333333334'
Saving and Loading Neural Networks
We can store a model in a few different formats; the ideal one is the HDF5 format, which stores the structure and weights for the network
Save Model
We can save the model we just trained with:
MODEL_SAVE_PATH = './exported-models/iris-model.h5'
model.save(MODEL_SAVE_PATH)
Load Model
from tensorflow.keras.models import load_model
loaded_model = load_model(MODEL_SAVE_PATH)
loaded_model
<tensorflow.python.keras.engine.sequential.Sequential at 0x2041b8141c8>
Early Stopping to prevent Overfitting
We can make use of test/train sets to help us prevent overfitting; this is done by helping us identify when to stop training the network
It's important that we save the model at a point where it is well fitted
Data is usually split into the following sets:
- Test
- Train
- Holdout
If we have a lot of data we can even try to have multiple test and train sets
To train the model we'll do the normal preprocessing and model definition as before, and then we'll implement EarlyStopping from Keras when doing the model.fit portion
Categorical
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping
Preprocessing and Model Definition
DATA_URL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
COLUMN_NAMES = [
'sepal length',
'sepal width',
'petal length',
'petal width',
'class'
]
df = pd.read_csv(DATA_URL, names=COLUMN_NAMES)
X = df.drop('class', axis=1).values
dummies = pd.get_dummies(df['class'])
species = dummies.columns
y = dummies.values
model = Sequential()
model.add(Dense(50, input_dim=X.shape[1], activation='relu'))
model.add(Dense(25, activation='relu'))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
Train/Test Split
X_train, X_test, y_train, y_test = train_test_split (
X, y,
test_size=0.25,
random_state=0
)
Train the Model
The below applies to both categorical and regression models
We can train the model using an EarlyStopping callback, in which we specify:
- The metric we want to monitor for change; val_loss uses the validation loss we defined for the model as the metric
- The minimum change (min_delta) we want for stability; making this smaller does not have much of an impact
- The number of rounds (patience) we want the delta to stay small for before stopping
- The mode, usually kept at auto, which is whether to minimize or maximize the monitored metric
- Whether to restore the best weights automatically; always keep this at True
monitor = EarlyStopping(
monitor='val_loss',
min_delta=1e-3,
patience=50,
verbose=1,
mode='auto',
restore_best_weights=True
)
model.fit(
X_train, y_train,
validation_data=(X_test, y_test),
callbacks=[monitor],
verbose=0,
epochs=1000
)
Restoring model weights from the end of the best epoch. Epoch 00075: early stopping
<tensorflow.python.keras.callbacks.History at 0x204254a7108>
Measure the Accuracy
y_pred = model.predict(X_test)
predicted_classes = np.argmax(y_pred, axis=1)
expected_classes = np.argmax(y_test, axis=1)
score = metrics.accuracy_score(expected_classes, predicted_classes)
'Accuracy: ' + str(score)
'Accuracy: 0.9736842105263158'
Feature Vectors and Tabular Data
All data that comes into a Neural Network must be numerical
Some of the processing steps we will typically do are:
- Convert categorical values to dummies (features and target)
- Drop any columns like ID, etc.
- Get all the different numerical data to be in the same range
- Center numerical data around a mean of zero
- Fill missing values as appropriate for the relevant data
- If we have missing data in the target column we should drop those rows
We can use a Z-score to handle points 3 and 4 above (scaling and centering), as shown below
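For example, a Z-score standardization (a minimal sketch, assuming the horsepower column from the MPG dataframe loaded earlier) brings the values into the same range and centers them around zero:

# Z-score: subtract the mean and divide by the standard deviation
df['horsepower'] = (df['horsepower'] - df['horsepower'].mean()) / df['horsepower'].std()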
Classification Metrics
Sometimes we care about factors beyond just the accuracy, such as the counts of false positives or negatives, etc.
ROC Values
- False Positives
- False Negatives
- True Positives
- True Negatives
These can also be described as Type-1 and Type-2 Errors, as well as Test Sensitivity and Specificity
A sensitive NN will lead to more false positives, and a more specific NN will lead towards fewer false positives
A ROC chart compares our model to random predictions; the higher up our line is, the more accurate our model. We measure the area under this curve to get the AUC value; if our model falls below the 0.5 mark (below the random line) it means our model is doing worse than a random guess (which is really bad)
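A minimal sketch of computing the AUC with scikit-learn (toy binary labels and probabilities, not from the course notes):

from sklearn.metrics import roc_auc_score

# toy example: true binary labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

# 1.0 is a perfect ranking, 0.5 is no better than a random guess
print(roc_auc_score(y_true, y_prob))  # 0.75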
Log Loss
With a Log Loss calculation we can get a sort of accuracy score that's more harsh on overconfidence
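A minimal sketch with scikit-learn (toy labels and probabilities, not from the course notes) showing that confidently wrong predictions are punished much more than cautious ones:

from sklearn.metrics import log_loss

y_true = [0, 0, 1, 1]

# reasonably calibrated predictions vs. confidently wrong ones
print(log_loss(y_true, [0.1, 0.3, 0.7, 0.9]))  # ~0.23
print(log_loss(y_true, [0.1, 0.9, 0.1, 0.9]))  # ~1.20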
Confusion Matrix
This compares our predicted values to the actual values, in this we would ideally want to see a strong diagonal correlation
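A hedged sketch using scikit-learn, assuming the expected_classes and predict_classes arrays computed for the Iris model earlier:

from sklearn.metrics import confusion_matrix

# rows are the actual classes, columns the predicted classes;
# a strong diagonal means most predictions match the actual class
print(confusion_matrix(expected_classes, predict_classes))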
Regression Metrics
When working with regression models there are different metrics that we can use in order to evaluate how well the model performs
Mean Squared Error and Root Mean Squared Error
We usually work with the MSE value, which is sort of relative to our dataset; square rooting this gives us the RMSE, which tells us how close we are to the actual value in the same units as our target data
Lift Chart
A Lift chart is a way to compare our model output to the actual test data in order to see how our model performs over specific value ranges in the target vector
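A minimal sketch of a lift chart (assuming matplotlib is available and reusing the y and y_pred arrays from the MPG regression model earlier): sort by the actual value and plot the predictions alongside it:

import numpy as np
import matplotlib.pyplot as plt

# sort the actual values and reorder the predictions the same way
order = np.argsort(y.flatten())
plt.plot(y.flatten()[order], label='actual')
plt.plot(y_pred.flatten()[order], label='predicted')
plt.xlabel('samples sorted by actual value')
plt.ylabel('mpg')
plt.legend()
plt.show()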
Backpropagation
There are a few types of backpropagation which we use when training a model:
- Classic - using gradient descent (e.g. 0.1, 0.01, 0.001)
- Momentum - pushes weights in order to avoid local minimums (e.g. 0.9)
- Batch and Online - update weights in batches instead of every iteration
- Stochastic Gradient Descent - Often used with batching, network trained on differing sets of the data and decreases overfitting by focusing on a smaller number of weights
Additionally we have a few methods that can help us to automate certain hyperparameters (a few of these are sketched in Keras after this list):
- Resilient Propagation - uses only the gradient magnitude and allows each neuron its own learning rate
- Nesterov accelerated gradient - helps mitigate the risk of choosing a bad batch
- Adagrad - allows an automatically decaying learning rate and momentum per weight
- Adadelta - based on Adagrad, monotonically decreasing learning rate
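In Keras most of these come down to the choice of optimizer passed to model.compile. A rough sketch of a few of the options (the parameter values are illustrative rather than tuned, and the model is assumed to be defined as in the earlier examples):

from tensorflow.keras.optimizers import SGD, Adagrad, Adadelta, Adam

sgd = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)  # gradient descent with (Nesterov) momentum
adagrad = Adagrad(learning_rate=0.01)                       # per-weight decaying learning rate
adadelta = Adadelta()                                       # Adagrad variant with a decaying rate
adam = Adam()                                               # the default used throughout these notes

model.compile(loss='categorical_crossentropy', optimizer=sgd)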
There are also other non gradient methods such as:
- Simulated Annealing
- Genetic Algorithm
- Particle Swarm Optimization
- Nelder Mead
Some interesting diagrams comparing different algorithms
Regularization
Regularization is used to combat overfitting. The two types of Regularization we have are Lasso (L1) and Ridge (L2) regularization
L1 regularization can help a network focus on the important factors
The alpha value lets us say how important the regularization is to our model; in general a higher alpha will cause the model to have a lower accuracy but prevent overfitting
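The scikit-learn examples below use linear models to illustrate this; in a Keras network the equivalent is to attach a regularizer to a layer. A rough sketch (the 1e-4 strength is just an illustrative value):

from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense

# L1 (lasso-style) penalty on the layer weights; regularizers.l2 and l1_l2 work the same way
layer = Dense(25, activation='relu', kernel_regularizer=regularizers.l1(1e-4))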
Lasso (L1)
from sklearn.linear_model import Lasso
model = Lasso(random_state=0, alpha=0.1)
model.fit(X_train, y_train)
Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000, normalize=False, positive=False, precompute=False, random_state=0, selection='cyclic', tol=0.0001, warm_start=False)
Ridge (L2)
L2 Regularization (Ridge) lets us focus a bit less on the weightings than the L1 method and penalizes the model less for large weights
from sklearn.linear_model import Ridge
model = Ridge(random_state=0, alpha=0.1)
model.fit(X_train, y_train)
Ridge(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=0, solver='auto', tol=0.001)
ElasticNet
ElasticNet uses a combination of L1 and L2 regularization
from sklearn.linear_model import ElasticNet
model = ElasticNet(random_state=0, alpha=0.1)
model.fit(X_train, y_train)
ElasticNet(alpha=0.1, copy_X=True, fit_intercept=True, l1_ratio=0.5, max_iter=1000, normalize=False, positive=False, precompute=False, random_state=0, selection='cyclic', tol=0.0001, warm_start=False)
Dropout
Dropout is another method of regularization and is applied during training
When using dropout we disable random neurons in each epoch to prevent them from becoming too specialized. This helps us to prevent overfitting as well as reduce the variance in the overall trained network
The dropped neurons are re-added once the training is complete
In order to use dropout in Keras we can add a Dropout layer with a value for what fraction of neurons we want to be dropped out
The suggestion is usually not to use a dropout after the final hidden layer
DATA_URL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
COLUMN_NAMES = [
'sepal length',
'sepal width',
'petal length',
'petal width',
'class'
]
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from tensorflow.keras import regularizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
df = pd.read_csv(DATA_URL, names=COLUMN_NAMES)
X = df.drop('class', axis=1).values
dummies = pd.get_dummies(df['class'])
species = dummies.columns
y = dummies.values
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.25,
random_state=0
)
Below we will train the model, which has the following:
- 1st Dense Layer with 50 neurons and ReLU activation
- A dropout of 50%
- 2nd Dense Layer with 25 neurons, ReLU, and an L1 Regularization
- An Output Layer with the categories and Softmax activation
model = Sequential()
model.add(Dense(50, input_dim=X.shape[1], activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(
25,
activation='relu',
activity_regularizer=regularizers.l1(1e-4)
))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(
X_train,
y_train,
validation_data=(X_test, y_test),
verbose=0,
epochs=100
)
<tensorflow.python.keras.callbacks.History at 0x1804e5a00c8>
y_pred = model.predict(X_test)
predicted_classes = np.argmax(y_pred, axis=1)
expected_classes = np.argmax(y_test, axis=1)
print(predicted_classes)
print(expected_classes)
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0 2]
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0 1]
score = metrics.accuracy_score(expected_classes, predicted_classes)
'Accuracy: ' + str(score)
'Accuracy: 0.9736842105263158'
Benchmarking and Regularization
So far we've seen that a network is based on the following:
- Number of layers
- How many neurons per layer
- Activation functions for each layer
- Dropout per layer
- L1 and L2 Regularization
There are additional parameters that can also influence the network
Due to the different parameters and the random nature of a network it can be difficult to see if our change in hyperparameters is actually impacting the output of a network
We can do something called Bootstrapping, which is similar to cross validation but with replacement, combined with early stopping, to see where our network's accuracy converges on average and how many epochs this takes
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.utils import resample
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit
from tensorflow.keras import regularizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.callbacks import EarlyStopping
SPLITS = 15
Bootstrap
For a Regression model we can use:
boot = ShuffleSplit(n_splits=SPLITS, test_size=0.1)
and then:
for train, test in boot.split(X):
    # train model
However, for a Categorical classification we want to ensure that we have a class balance; we can do this with the StratifiedShuffleSplit, which works like so:
Note that the EarlyStopping monitor's stopped_epoch is 0 if the training was not stopped early (e.g. it trained till the end)
boot = StratifiedShuffleSplit(n_splits=SPLITS, test_size=0.2)
Progress Tracking
accuracy_tracker = []
epoch_tracker = []
for train, test in boot.split(X, y):  # using the data from the last import
    X_train = X[train]
    X_test = X[test]
    y_train = y[train]
    y_test = y[test]

    model = Sequential()
    model.add(Dense(50, input_dim=X.shape[1], activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(
        25,
        activation='relu'
    ))
    model.add(Dense(y.shape[1], activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')

    monitor = EarlyStopping(
        monitor='val_loss',
        min_delta=1e-3,
        patience=50,
        verbose=0,
        mode='auto',
        restore_best_weights=True
    )

    model.fit(
        X_train, y_train,
        validation_data=(X_test, y_test),
        callbacks=[monitor],
        verbose=0,
        epochs=1000
    )

    epoch_tracker.append(
        monitor.stopped_epoch if monitor.stopped_epoch > 0 else 1000
    )

    y_pred = model.predict(X_test)
    predicted_classes = np.argmax(y_pred, axis=1)
    expected_classes = np.argmax(y_test, axis=1)

    score = metrics.accuracy_score(expected_classes, predicted_classes)
    accuracy_tracker.append(score)
pd.DataFrame({
"Score": accuracy_tracker,
"Epochs": epoch_tracker
}).describe()
|       | Score | Epochs |
|-------|-------|--------|
| count | 15.000000 | 15.000000 |
| mean | 0.993333 | 350.200000 |
| std | 0.013801 | 76.766064 |
| min | 0.966667 | 226.000000 |
| 25% | 1.000000 | 308.000000 |
| 50% | 1.000000 | 321.000000 |
| 75% | 1.000000 | 397.000000 |
| max | 1.000000 | 528.000000 |