Lesson 4: Under the Hood: Training a Digit Classifier

28-09-2020

This notebook goes over some of the practical material from lesson 4 of the fastai 2020 course, namely different ways of training a digit classifier on the MNIST data set. The lesson 4 video is an extension of the lesson 3 video. There is a lot to cover...

In the last notebook we looked at some simple examples of using SGD to optimise a model. In this notebook we will apply those concepts to the MNIST problem from scratch; later, we will refactor the code using PyTorch and fastai modules.

# imports and things we need from previous notebooks

from fastai.vision.all import *

# data 
path = untar_data(URLs.MNIST_SAMPLE)

threes = (path/'train'/'3').ls().sorted()
sevens = (path/'train'/'7').ls().sorted()

seven_tensors = [tensor(Image.open(o)) for o in sevens]
three_tensors = [tensor(Image.open(o)) for o in threes]

stacked_sevens = torch.stack(seven_tensors).float()/255
stacked_threes = torch.stack(three_tensors).float()/255

valid_3_tens = torch.stack([tensor(Image.open(o)) 
                            for o in (path/'valid'/'3').ls()])
valid_3_tens = valid_3_tens.float()/255
valid_7_tens = torch.stack([tensor(Image.open(o)) 
                            for o in (path/'valid'/'7').ls()])
valid_7_tens = valid_7_tens.float()/255

def plot_function(f, tx=None, ty=None, title=None, min=-2, max=2, figsize=(6,4)):
    x = torch.linspace(min,max)
    fig,ax = plt.subplots(figsize=figsize)
    ax.plot(x,f(x))
    if tx is not None: ax.set_xlabel(tx)
    if ty is not None: ax.set_ylabel(ty)
    if title is not None: ax.set_title(title)

MNIST Loss function

Our x values will be the pixels, so we need to reshape the data using view. We want to concatenate our x's into a single tensor, then change them from a list of matrices (a rank-3 tensor) to a list of vectors (a rank-2 tensor). Why? Because the simple model in this example treats each image as one flat vector of 28*28 = 784 pixel values.

view will return a new tensor with the same data as the original tensor but with a different shape that we define.
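To get a feel for what view does, here is a minimal sketch on a tiny made-up tensor (three 2x2 "images" standing in for the 28x28 digits; the names images and flat are just for illustration).

```python
import torch

# Three fake 2x2 "images" stacked into a rank-3 tensor
images = torch.arange(12).view(3, 2, 2)

# -1 asks PyTorch to work out that dimension; each image becomes one row of 4 values
flat = images.view(-1, 2 * 2)

print(images.shape)  # torch.Size([3, 2, 2])
print(flat.shape)    # torch.Size([3, 4])
print(flat[1])       # tensor([4, 5, 6, 7])

# view shares the underlying data, so a change through one shows up in the other
flat[0, 0] = 99
print(images[0, 0, 0])  # tensor(99)
```

The same idea applies to the digits: (-1, 28*28) turns every 28x28 image into one row of 784 values.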

# concat 3s and 7s, then reshape into a matrix
# so that each row is 1 image, with all rows and columns in a single vector
train_x = torch.cat([stacked_threes, stacked_sevens]).view(-1, 28*28)

# label the data
# 3 == 1
# 7 == 0
# we need this to be a matrix
# unsqueeze will do this for us
train_y = tensor([1]*len(threes) + [0]*len(sevens)).unsqueeze(1)

# check the shape
train_x.shape,train_y.shape

(torch.Size([12396, 784]), torch.Size([12396, 1]))

# in PyTorch we need data to be in a tuple for each row
# zip will help us with this
dset = list(zip(train_x,train_y))

# take a look at the first thing
x,y = dset[0]

x.shape, y

(torch.Size([784]), tensor([1]))

This matches what we would expect: a vector of 784 pixel values and a label of 1 (a 3).

# repeat for validation
valid_x = torch.cat([valid_3_tens, valid_7_tens]).view(-1, 28*28)
valid_y = tensor([1]*len(valid_3_tens) + [0]*len(valid_7_tens)).unsqueeze(1)
valid_dset = list(zip(valid_x,valid_y))

Now we have training and validation data sets

1. Randomly initialise weights for each pixel - use torch.randn to create tensor of randomly initialised weights

def init_params(size, std=1.0): 
    # torch.randn samples from a standard normal; multiplying by std sets the spread
    return (torch.randn(size)*std).requires_grad_()

weights = init_params((28*28,1))

weights.shape

torch.Size([784, 1])

We need to add a bias term because weights*pixels alone is not flexible enough: it will always be equal to zero when the pixels are equal to zero.

bias = init_params(1)

y = w*x+b is the formula for a line, where w are the weights, b is the bias. In neural network jargon, the weights and bias will be our parameters.

This linear equation is one of the two fundamental equations of any neural network. The other is an activation function that we will see shortly.

Let's use this to calculate a prediction for one image. weights.T transposes the weights, which makes sure the rows and columns match up for our multiplication.

(train_x[0]*weights.T).sum() + bias

tensor([13.3326], grad_fn=<AddBackward0>)

Now we need to do this for all images. A for loop will be too slow. In PyTorch we can perform matrix multiplication using the @ operator OR by using torch.matmul().

# define a linear function that will 
# multiply the input by weights then add a bias term

def linear1(xb): return xb@weights + bias

preds = linear1(train_x)
preds

tensor([[13.3326],
        [ 9.1011],
        [ 9.4999],
        ...,
        [-1.0068],
        [15.9130],
        [12.6228]], grad_fn=<AddBackward0>)

Notice that the first result is the same as the one we just saw above. This confirms our function is working, and we can also see that the operation is performed for every image in train_x.
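As a sanity check on the idea that @ does the same job as looping over images, here is a small sketch. The data is random and made up for illustration (xb, w and b are stand-ins, not the notebook's tensors).

```python
import torch

torch.manual_seed(0)
xb = torch.rand(5, 784)    # 5 fake flattened images
w = torch.randn(784, 1)    # one weight per pixel
b = torch.randn(1)         # a single bias

# one image at a time, the same way as the single-image calculation
looped = torch.stack([(x * w.T).sum() + b for x in xb])

# all images at once with matrix multiplication
batched = xb @ w + b

print(looped.shape, batched.shape)
print(torch.allclose(looped, batched, atol=1e-3))
```

Both versions give one prediction per image; the matrix version simply does all the multiply-and-sum steps in one call.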

Checking accuracy:
- if a prediction is above the threshold, i.e. > 0, it is a 3; if it is below 0, it is a 7
- so we check whether each prediction is greater than our threshold of 0, then compare these against the training labels
- this returns True wherever a row is correctly predicted
- we can convert these to floats using .float() and take their mean to get the overall accuracy of our randomly initialised model

threshold = 0.0
accuracy = (preds > threshold).float() == train_y
accuracy

tensor([[ True],
        [ True],
        [ True],
        ...,
        [ True],
        [False],
        [False]])

accuracy.float().mean().item()

0.484188437461853

Let's change one of the weights by a small amount to see how accuracy is affected.

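# weights requires grad, so an in-place change has to happen inside torch.no_grad()
with torch.no_grad(): weights[0] *= 1.0001 # increase the weight a little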
preds = linear1(train_x)
accuracy2 = ((preds > threshold).float() == train_y).float().mean().item()
accuracy2

0.484188437461853

This is exactly the same as before. We have a problem: the gradient of accuracy with respect to the weights is 0 almost everywhere, because changing a single weight by a very small amount will usually not flip any actual prediction.

Because our gradient is 0, our step will be 0, which means our predictions will be unchanged.

So accuracy is not a good loss function. A small change in our weights usually produces no change in accuracy at all, so we end up with zero gradients.

We need a new function that won't have a zero gradient. It needs to be more sensitive to small changes, so that a slightly better prediction gives a slightly better loss.

In other words, when the predictions are close to the targets the loss needs to be small; when they are far away, it needs to be big.
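To make the difference concrete, here is a minimal sketch with made-up numbers. The inputs, labels and the smooth_loss function are assumptions for illustration only; torch.sigmoid is used just to squash the predictions between 0 and 1.

```python
import torch

# Four one-feature inputs and their labels (made-up values)
x = torch.tensor([[0.5], [1.0], [-0.8], [-0.2]])
y = torch.tensor([[1.], [1.], [0.], [1.]])

def accuracy(w):
    # threshold at 0, like the accuracy check above
    return ((x * w > 0).float() == y).float().mean().item()

def smooth_loss(w):
    # stand-in loss: distance between a squashed prediction and its target
    preds = torch.sigmoid(x * w)
    return torch.where(y == 1, 1 - preds, preds).mean().item()

# nudge the single weight slightly and compare
for w in (1.0, 1.01):
    print(w, accuracy(w), smooth_loss(w))
```

Accuracy is identical for both weights because no prediction crosses the threshold, while the smooth loss moves a little, which is exactly the signal a gradient needs.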

So let's create a new function to address this issue.

# MNIST loss

def mnist_loss(preds, targets):
    return torch.where(targets==1., 1.-preds, preds).mean()

# test case

t = torch.tensor([1,0,1])         # targets
p = torch.tensor([0.9, 0.4, 0.2]) # predictions


# this is the same as mnist_loss but before the mean
torch.where(t==1, 1-p, p)

tensor([0.1000, 0.4000, 0.8000])

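torch.where(a, b, c) is like a list comprehension for tensors: for each position it picks the value from b where a is true and the value from c otherwise.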

This function returns a lower loss when predictions are more accurate and a higher loss when they are not.
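The comparison between torch.where and a list comprehension can be spelled out with the same test case. This is only a sketch for intuition; the looped version is much slower on real data.

```python
import torch

t = torch.tensor([1, 0, 1])          # targets
p = torch.tensor([0.9, 0.4, 0.2])    # predictions

vectorised = torch.where(t == 1, 1 - p, p)

# the same selection written element by element in plain Python
looped = torch.stack([1 - pi if ti == 1 else pi for ti, pi in zip(t, p)])

print(vectorised)
print(torch.allclose(vectorised, looped))
```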

But for this to work, we need our predictions to be between 0 and 1; otherwise the loss can go negative, as the next example shows.

p2 = torch.tensor([1.2, -1, 0])   # predictions outside 0, 1 range

torch.where(t==1, 1-p2, p2)

tensor([-0.2000, -1.0000,  1.0000])

The Sigmoid function

This function will constrain our numbers between 0 and 1.
It squashes any input in the range (-inf, inf) to some value in the range (0, 1)

def sigmoid(x): return 1 / (1 + torch.exp(-x))

plot_function(torch.sigmoid, title='Sigmoid', min=-4, max=4)

# MNIST loss with sigmoid

def mnist_loss(predictions, targets):
    preds = predictions.sigmoid()
    return torch.where(targets==1., 1.-preds, preds).mean()
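As a quick check, here is a sketch that feeds the out-of-range predictions from earlier through this version of the loss. The name raw is just for illustration.

```python
import torch

def mnist_loss(predictions, targets):
    preds = predictions.sigmoid()
    return torch.where(targets == 1., 1. - preds, preds).mean()

t = torch.tensor([1, 0, 1])             # targets
raw = torch.tensor([1.2, -1.0, 0.0])    # predictions outside the 0, 1 range

print(raw.sigmoid())        # every value now sits between 0 and 1
print(mnist_loss(raw, t))   # so the loss does too
```

With the sigmoid in place the individual losses can no longer be negative, whatever the raw predictions are.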

SGD and Mini-batches

Batching images and running computations over one batch at a time is a compromise between calculating the loss over the whole data set (accurate but slow) and over a single image (fast but noisy).

The size of the batch affects how accurate your gradient estimates are as well as the speed at which you are able to run computations, so batch size is something to consider during training.

The DataLoader class in PyTorch helps with batching. It returns an iterator which we can loop through.

coll = range(15)

dl = DataLoader(coll, batch_size=5, shuffle=True)
list(dl)

[tensor([ 4, 12,  5,  6,  3]),
 tensor([10,  9,  2,  0, 14]),
 tensor([ 7, 13,  8, 11,  1])]

Putting it together

# re-initialise weights and bias
weights = init_params((28*28,1))
bias = init_params(1)

# create a data loader
dl = DataLoader(dset, batch_size=256)

# grab the first x and y
xb, yb = first(dl)

# check the shape
xb.shape, yb.shape

(torch.Size([256, 784]), torch.Size([256, 1]))

# repeat for validation set
valid_dl = DataLoader(valid_dset, batch_size=256)

# grab a mini batch to test on
batch = train_x[:4]
batch.shape

torch.Size([4, 784])

# make some predictions
preds = linear1(batch)
preds

tensor([[-1.1306],
        [-3.9293],
        [-0.6736],
        [-6.9805]], grad_fn=<AddBackward0>)

loss = mnist_loss(preds, train_y[:4])
loss

tensor(0.8495, grad_fn=<MeanBackward0>)

# calculate gradients
loss.backward()

weights.grad.shape, weights.grad.mean(), bias.grad

(torch.Size([784, 1]), tensor(-0.0153), tensor([-0.1070]))

# take those 3 steps and put them in a function

def calc_grad(xb, yb, model):
    preds = model(xb)
    loss = mnist_loss(preds, yb)
    loss.backward()

# test it

calc_grad(batch, train_y[:4], linear1)

weights.grad.shape, weights.grad.mean(), bias.grad

(torch.Size([784, 1]), tensor(-0.0306), tensor([-0.2140]))

# zero the gradients
weights.grad.zero_()
bias.grad.zero_()

tensor([0.])
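Notice that the gradients doubled the second time we called backward. That is because PyTorch adds new gradients to whatever is already stored in .grad, which is why we zero them. A minimal sketch on a made-up one-parameter example (w and loss_fn are hypothetical names):

```python
import torch

w = torch.tensor([2.0], requires_grad=True)

def loss_fn():
    return (w ** 2).sum()   # gradient is 2*w = 4

loss_fn().backward()
g1 = w.grad.clone()
print(g1)        # tensor([4.])

loss_fn().backward()        # the second call adds to the stored gradient
g2 = w.grad.clone()
print(g2)        # tensor([8.])

w.grad.zero_()              # reset before the next step
print(w.grad)    # tensor([0.])
```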

The last step is to work out how to update the weights and bias based on the gradient and learning rate.

train_epoch loops through the data loader: grab an x batch and a y batch, make a prediction, calculate the loss, then calculate the gradients. Then go through each parameter (weights and bias), update it by gradient * lr, and zero its gradient in prep for the next loop.

p.data is used because PyTorch keeps track of all operations so it can calculate the gradients, but we do not want the gradient descent step itself to be tracked.

def train_epoch(model, lr, params):
    for xb, yb in dl:
        calc_grad(xb, yb, model) 
        for p in params:
            p.data -= p.grad*lr
            p.grad.zero_()
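An alternative to p.data that does the same job is to wrap the update in torch.no_grad(). This is a sketch of a single step on a made-up parameter, not the version used in this notebook.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
lr = 0.1

loss = (w ** 2).sum()    # gradient is 2*w
loss.backward()

with torch.no_grad():    # the update itself is not recorded by autograd
    w -= w.grad * lr
w.grad.zero_()           # get ready for the next step

print(w)   # tensor([ 0.8000, -1.6000], requires_grad=True)
```

Either way, the point is the same: the parameter changes, but the change is not added to the computation graph.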

batch_accuracy is similar to the previous accuracy check, but since we use a sigmoid, which constrains our preds between 0 and 1, we need to check whether preds > 0.5 instead of > 0.

def batch_accuracy(xb, yb):
    preds = xb.sigmoid()
    correct = (preds>0.5) == yb # check predictions against target
    return correct.float().mean()

batch_accuracy(linear1(train_x[:4]), train_y[:4])

tensor(0.)

# check accuracy for every batch in the validation set
# stack converts the list of items into tensor

def validate_epoch(model):
    accs = [batch_accuracy(model(xb), yb) for xb, yb in valid_dl]
    return round(torch.stack(accs).mean().item(), 4)

validate_epoch(linear1)

0.407

This is a starting point. Let's train for one epoch and see if accuracy improves.

As a reminder, the linear1 function was: def linear1(xb): return xb@weights + bias

lr = 1.
params = weights, bias
train_epoch(linear1, lr, params)
validate_epoch(linear1)

0.6932

for i in range(20):
    train_epoch(linear1, lr, params)
    print(validate_epoch(linear1), end=' ')

0.8242 0.9042 0.9355 0.9501 0.9555 0.9614 0.9638 0.9677 0.9736 0.9751 0.9751 0.976 0.977 0.9775 0.9775 0.978 0.9785 0.979 0.9795 0.979

Accuracy has indeed improved! We have built an SGD training loop, and the model it trains reaches about 98% accuracy on the validation set.

Refactor and clean up

create an optimiser
use PyTorch modules and functions where available
- like nn.Linear
- which "Applies a linear transformation to the incoming data: $y = xA^T + b$"

nn.Linear?

# replace our linear function
# with the torch module

# creates a weight matrix of shape (1, 28*28)
# and a bias of size 1

linear_model = nn.Linear(28*28,1)

# check model params

w,b = linear_model.parameters()

w.shape, b.shape

(torch.Size([1, 784]), torch.Size([1]))
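Note that the weights are stored as (1, 784), the transpose of our hand-made (784, 1) weights. A small sketch to check that nn.Linear really computes the quoted formula, using random made-up inputs (layer, xb and manual are illustrative names):

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(28 * 28, 1)
w, b = layer.parameters()      # w: (1, 784), b: (1,)

xb = torch.rand(4, 28 * 28)    # 4 fake flattened images

manual = xb @ w.T + b          # y = x A^T + b
print(manual.shape)
print(torch.allclose(layer(xb), manual, atol=1e-5))
```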

Create a basic optimiser

pass in params to optimise and lr
store these away
step through each param (weights and bias) and for each, update with gradient * lr
zero the gradients in prep for the next step

class BasicOptim:
    def __init__(self, params, lr): 
        self.params, self.lr = list(params), lr

    def step(self, *args, **kwargs):
        for p in self.params: 
            p.data -= p.grad.data * self.lr

    def zero_grad(self, *args, **kwargs):
        for p in self.params: 
            p.grad = None

# create an optimiser by passing in parameters from model
opt = BasicOptim(linear_model.parameters(), lr)

# simplify the training loop
def train_epoch(model):
    for xb, yb in dl:
        calc_grad(xb, yb, model)
        opt.step()
        opt.zero_grad()

validate_epoch(linear_model)

0.5075

Now create a function train_model that will call train_epoch on our model for the specified number of epochs

def train_model(model, epochs):
    for i in range(epochs):
        train_epoch(model)
        print(validate_epoch(model), end=' ')

train_model(linear_model, 20)

0.4932 0.7876 0.852 0.916 0.9345 0.9497 0.957 0.9638 0.9658 0.9677 0.9697 0.9721 0.9731 0.9751 0.9755 0.9765 0.9775 0.9775 0.978 0.9785

The results are very similar to what we have seen before.

fastai provides an SGD class that we can use instead of writing our own; again the results are very similar.

linear_model = nn.Linear(28*28, 1)
opt = SGD(linear_model.parameters(), lr)
train_model(linear_model, 20)

0.4932 0.9091 0.8056 0.9043 0.9316 0.9443 0.9546 0.9619 0.9648 0.9668 0.9692 0.9707 0.9731 0.9746 0.976 0.976 0.9775 0.9775 0.9785 0.979

Let's refactor some more, using some fastai classes. The Learner implements everything we have implemented manually.

# Previously we used DataLoader not DataLoaders

dls = DataLoaders(dl, valid_dl)

learn = Learner(dls, nn.Linear(28*28,1), opt_func=SGD,
                loss_func=mnist_loss, metrics=batch_accuracy)
learn.fit(10)

epoch	train_loss	valid_loss	batch_accuracy	time
0	0.480832	0.461122	0.843474	00:00
1	0.466804	0.441299	0.908734	00:00
2	0.450758	0.422136	0.934249	00:00
3	0.433590	0.403750	0.943572	00:00
4	0.416052	0.386223	0.949460	00:00
5	0.398675	0.369610	0.952404	00:00
6	0.381799	0.353943	0.956820	00:00
7	0.365630	0.339228	0.957311	00:00
8	0.350289	0.325455	0.959764	00:00
9	0.335835	0.312599	0.960255	00:00

The results again are very similar, but with some additional functionality (like printing out results in a pretty table!).

Non-Linearity

To create a simple neural net, using a linear function like we did before is not enough. We need to add in a non-linearity between two linear functions.

This is the basic definition for a neural net.
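Why is the non-linearity needed? Two linear functions applied one after the other are still just one linear function. Here is a sketch with small random tensors (all shapes and names are made up for illustration):

```python
import torch

torch.manual_seed(0)
x = torch.rand(4, 6)
w1, b1 = torch.randn(6, 3), torch.randn(3)
w2, b2 = torch.randn(3, 1), torch.randn(1)

# two linear layers stacked with nothing in between
stacked = (x @ w1 + b1) @ w2 + b2

# one linear layer with combined weights and bias gives the same result
w, b = w1 @ w2, b1 @ w2 + b2
single = x @ w + b

print(torch.allclose(stacked, single, atol=1e-4))
```

So without something non-linear in the middle, adding more linear layers does not make the model any more expressive.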

The universal approximation theorem says that any arbitrarily complex continuous function can be approximated by a neural network. I found this useful for visualising how this works. This is what we are trying to do.

In our basic_net, each line represents a layer in our network. The first and third layers are known as linear layers; the second is a nonlinearity, or activation.

res.max(tensor(0.0)) takes the result of our linear function and sets any negative value to 0.0 while maintaining any positive values.

def basic_net(xb):
    res = xb@w1 + b1
    res = res.max(tensor(0.0))
    res = res@w2 + b2
    return res

plot_function(F.relu)
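The res.max(tensor(0.0)) line and F.relu do the same thing, which a small sketch on made-up values shows (res and clipped are illustrative names):

```python
import torch
import torch.nn.functional as F

res = torch.tensor([-2.0, -0.5, 0.0, 1.5])

clipped = res.max(torch.tensor(0.0))   # negatives become 0, positives are kept
print(clipped)       # tensor([0.0000, 0.0000, 0.0000, 1.5000])
print(F.relu(res))   # same values
```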

Like we have seen previously:
- w1 and w2 are weight tensors
- b1 and b2 are bias tensors

We can initialise these the same way as we have done previously.

w1 has 30 output activations, so in order to match, w2 requires 30 input activations.

w1 = init_params((28*28,30))
b1 = init_params(30)
w2 = init_params((30,1))
b2 = init_params(1)

We can simplify further using PyTorch...

What we did in basic_net was called function composition, where we passed the results of one function into another function and then into another function. This is what neural nets are doing with linear layers and activation functions. nn.Sequential() will do this for us...

simple_net = nn.Sequential(
    nn.Linear(28*28, 30), # 28*28 in, 30 out
    nn.ReLU(),
    nn.Linear(30,1)       # 30 in 1 out
)
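To see that nn.Sequential is just function composition, here is a sketch that builds the same three layers and calls them by hand. The input is random made-up data, and lin1, act, lin2 and net are illustrative names.

```python
import torch
from torch import nn

torch.manual_seed(0)
lin1, act, lin2 = nn.Linear(28 * 28, 30), nn.ReLU(), nn.Linear(30, 1)
net = nn.Sequential(lin1, act, lin2)

xb = torch.rand(4, 28 * 28)       # 4 fake flattened images

manual = lin2(act(lin1(xb)))      # pass the result of each function into the next
print(net(xb).shape)
print(torch.allclose(net(xb), manual))
```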

learn = Learner(dls, simple_net, opt_func=SGD,
                loss_func=mnist_loss, metrics=batch_accuracy)

learn.fit(40,0.1)

epoch	train_loss	valid_loss	batch_accuracy	time
0	0.301185	0.414520	0.506379	00:00
1	0.142869	0.223012	0.814033	00:00
2	0.079959	0.114103	0.916094	00:00
3	0.053115	0.077652	0.939156	00:00
4	0.040578	0.060868	0.953876	00:00
5	0.034118	0.051373	0.963690	00:00
6	0.030368	0.045362	0.965653	00:00
7	0.027905	0.041246	0.965162	00:00
8	0.026117	0.038246	0.968597	00:00
9	0.024726	0.035950	0.969087	00:00
10	0.023596	0.034124	0.971050	00:00
11	0.022651	0.032629	0.972031	00:00
12	0.021847	0.031376	0.973503	00:00
13	0.021151	0.030301	0.974485	00:00
14	0.020542	0.029363	0.974485	00:00
15	0.020002	0.028535	0.975957	00:00
16	0.019519	0.027797	0.976448	00:00
17	0.019083	0.027134	0.976938	00:00
18	0.018687	0.026535	0.977920	00:00
19	0.018325	0.025991	0.978901	00:00
20	0.017992	0.025495	0.978901	00:00
21	0.017684	0.025040	0.978901	00:00
22	0.017398	0.024621	0.979392	00:00
23	0.017131	0.024233	0.979392	00:00
24	0.016881	0.023874	0.980373	00:00
25	0.016645	0.023541	0.980373	00:00
26	0.016424	0.023232	0.980373	00:00
27	0.016214	0.022943	0.980864	00:00
28	0.016015	0.022673	0.980864	00:00
29	0.015827	0.022421	0.981354	00:00
30	0.015648	0.022185	0.981845	00:00
31	0.015476	0.021963	0.982336	00:00
32	0.015313	0.021756	0.982336	00:00
33	0.015156	0.021561	0.982336	00:00
34	0.015006	0.021378	0.982826	00:00
35	0.014863	0.021204	0.982826	00:00
36	0.014725	0.021041	0.982336	00:00
37	0.014592	0.020886	0.982336	00:00
38	0.014464	0.020740	0.982336	00:00
39	0.014341	0.020601	0.982336	00:00

# this is what our model now looks like
learn.model

Sequential(
  (0): Linear(in_features=784, out_features=30, bias=True)
  (1): ReLU()
  (2): Linear(in_features=30, out_features=1, bias=True)
)

# plot the loss
learn.recorder.plot_loss()

# learn.recorder.values holds the table values above
# let's plot the accuracy

plt.plot(L(learn.recorder.values).itemgot(2));

Looking inside...

# let's visualise some of the parameters

# 1. grab your model
m = learn.model

# 2. look inside and grab the weights and biases of the first layer
#    m[0] is Linear(in_features=784, out_features=30, bias=True)
w,b = m[0].parameters()

# 3. grab first (or any) row, reshape, and plot
show_image(w[0].view(28,28), figsize=(4,4))

<AxesSubplot:>

fastai in full

from fastai.vision.all import *
from pathlib import Path

path = Path.cwd()/'datasets/fastai/mnist_sample'

dls = ImageDataLoaders.from_folder(path)

learn = cnn_learner(dls, resnet18, pretrained=False,
                   loss_func=F.cross_entropy, metrics=accuracy)

learn.fit_one_cycle(1, 0.1)

epoch	train_loss	valid_loss	accuracy	time
0	0.086805	0.025215	0.994603	00:11

Summary

We have gone over creating and training a neural network from scratch using the simple example of a digit classifier. The key idea for the last few notebooks was to start with planning out the problem and identifying a way to solve it using a simple common sense solution - the pixel similarity model.

This proved successful but it was not really robust beyond the straightforward example we chose - identifying 3s and 7s. We then implemented a more complex solution that could be applied to more complicated problems.

After each step or concept had been implemented manually, we refactored the code to use convenient PyTorch functions and modules, eventually ending up with fastai's implementation, which abstracts away all of the underlying heavy lifting. This is done for convenience and, in my opinion, to help lower the barrier to entry into deep learning.

Ultimately I believe it is fundamentally important to understand the concepts and implementation if your goal (and this is my goal) is to implement deep learning solutions to solve business problems within your industry.