Lesson 8: Deep Learning for Coders
NLP
This notebook takes a dive into Natural Language Processing (NLP) and attempts to train an NLP classifier: a binary classification task using movie review sentiment.
The pretrained model
In lesson 1, we achieved over 90% accuracy because we were using a pre-trained model that we fine-tuned further. So what is a pre-trained language model?
A language model is one where we try to predict the next word in a sentence. In lesson 1, this was a neural net pre-trained on Wikipedia articles (Wikitext 103).
How does this help with sentiment analysis? Like pre-trained image models, language models contain a lot of information that can be leveraged rather than training from scratch. Fine-tuning throws away the last layer(s) and retrains those rather than the entire model.
Through transfer learning, we will create an IMDB language model using the Wikitext model as a base.
Text preprocessing
- Tokenization: Convert the text into a list of words (or characters, or substrings, depending on the granularity of your model)
- Numericalization: Make a list of all of the unique words that appear (the vocab), convert each word into a number, by looking up its index in the vocab.
- Language model data loader creation: fastai provides an LMDataLoader class which automatically handles creating a dependent variable that is offset from the independent variable by one token. It also handles some important details, such as how to shuffle the training data in such a way that the dependent and independent variables maintain their structure as required
- Language model creation: We need a special kind of model that does something we haven't seen before: it handles input lists which could be arbitrarily big or small. There are a number of ways to do this; in this chapter we will be using a recurrent neural network (RNN).
Tokenisation
There are different approaches to tokenisation (a toy illustration of the word and character approaches follows below):
- Word-based: splits a sentence on spaces
- Subword-based: splits words into smaller parts based on the most commonly occurring substrings
- Character-based: splits a sentence into individual characters
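As a rough sketch of the difference (a toy illustration only, not what fastai's tokenisers actually do), the word and character approaches can be mimicked with plain Python; subword tokenisation needs a vocab learned from a corpus, which we will see later with SubwordTokenizer.
# Toy illustration only: naive word and character splits on a made-up sentence
sentence = "This movie was great"
word_tokens = sentence.split(' ')   # word-based: ['This', 'movie', 'was', 'great']
char_tokens = list(sentence)        # character-based: ['T', 'h', 'i', 's', ' ', ...]
word_tokens, char_tokens[:6]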
Word tokenisation with fastai
- There are a number of tokenisers out there; fastai makes it easy to switch between them.
- Currently fastai's default tokeniser comes from the spaCy library.
Data
- The IMDB Large Movie Review dataset contains 25,000 highly polar movie reviews for training and 25,000 for testing, plus a folder of unlabelled reviews (unsup). It is very large!
from fastai.text.all import *
path = untar_data(URLs.IMDB)
get_text_files gets all the text files in a path. We can also optionally pass folders to restrict the search to a particular list of subfolders:
# only using 50k sample due to size of dataset
# results may vary from fastai book
files = get_text_files(path, folders = ['train', 'test', 'unsup'])[:50000]
# print out slice of first review
txt = files[0].open().read()
txt[:75]
- first(x): first element of x, or None if missing
- coll_repr(c, max_n): string representation of up to max_n items of the (possibly lazy) collection c
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks,30))
fastai provides additional functionality on top of tokenisers, such as adding special tokens like xxbos (beginning of string), or lowercasing all words and adding the xxmaj token before a word that was originally capitalised. This is done to preserve important information while reducing some complexity.
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt),31))
You can explore the rules like so
defaults.text_proc_rules
then check the source code for each rule using ??, e.g. fix_html??
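For instance, a quick way to list just the names of the default rules (assuming, as here, that each rule is a plain Python function with a __name__):
# Print the name of each default text-processing rule; use ?? on any of them for its source
for rule in defaults.text_proc_rules: print(rule.__name__)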
Subword Tokenisation
Word tokenisation relies on spaces within the document.
Subword tokenisation does two things:
1. Analyse a corpus of documents to find the most commonly occurring groups of letters; these become the vocab
2. Tokenise the corpus using this vocab of subword units
txts = L(o.open().read() for o in files[:2000])
We instantiate our tokeniser by defining the size of the vocab, then train it: meaning, have the tokeniser read the documents, find the common sequences of characters, and then create the vocab. In fastai, this is done with setup.
def subword(sz):
sp = SubwordTokenizer(vocab_sz=sz)
sp.setup(txts)
return ' '.join(first(sp([txt]))[:40])
subword(1000)
The special character ▁ represents a space character in the original text.
Using a smaller vocab means each token represents fewer characters, so more tokens are needed to represent a sentence:
subword(200)
Using a larger vocab means most common English words end up in the vocab, so fewer tokens are needed to represent a sentence:
subword(10000)
There are trade-offs to be made here: a larger vocab means fewer tokens per sentence, which means faster training and less memory and state for the model; the downside is larger embedding matrices, which need more data to learn.
Subword tokenisation provides an easy way to scale between character and word tokenisation, and it is also useful for languages other than English. A rough comparison of token counts at different vocab sizes is sketched below.
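To make the trade-off concrete, here is a rough sketch (my own helper, not from the book) that counts how many tokens the same review needs at each vocab size. It re-trains the tokeniser each time, so it is slow, and the exact counts will vary with the sample of texts used.
# Count tokens produced for the same text at different vocab sizes
def subword_len(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return len(first(sp([txt])))

[(sz, subword_len(sz)) for sz in (200, 1000, 10000)]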
Numericalisation
This is the process of mapping tokens to integers. It is nearly identical to the steps necessary to create a Category variable:
- Make a list of all possible levels of that categorical variable (the vocab)
- Replace each level with its index in the vocab
toks = tkn(txt)
print(coll_repr(tkn(txt), 32))
# a small example
toks200 = txts[:200].map(tkn)
toks200[0]
num = Numericalize()
num.setup(toks200)
coll_repr(num.vocab,20)
This is our vocab, starting with the special tokens, then English words in order of frequency (most frequent first).
We can now use the Numericalize object as a function and apply it to our tokens to see the integers they now represent:
nums = num(toks)[:20]
nums
Create batches for language model
Batches are split based on the sequence length and batch size.
Batches are created by concatenating individual texts into a stream. The order of the inputs is randomised, meaning the order of the documents (not the order of the words within them) is shuffled.
The stream is then divided into batches.
This is done at every epoch:
- shuffle the collection of documents
- concatenate them together into a stream of tokens
- cut the stream into a batch of fixed-size consecutive mini-streams
This is all done in fastai using LMDataLoader. For example:
nums200 = toks200.map(num)
dl = LMDataLoader(nums200)
x,y = first(dl)
x.shape, y.shape
By default, LMDataLoader uses a batch size of 64 and a sequence length of 72, so x and y both have shape (64, 72).
The first row of the independent variable contains the start of the first text:
' '.join(num.vocab[o] for o in x[0][:20])
The dependent variable is the same but offset by one token:
' '.join(num.vocab[o] for o in y[0][:20])
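As a quick sanity check (not in the book): within a row, the two tensors should line up shifted by one token, so this should print True.
# y is x shifted one token to the left within each mini-stream
torch.equal(x[0][1:], y[0][:-1])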
Pt 1: Training a Text Classifier using fastai
The reason that TextBlock is special is that setting up the numericalizer's vocab can take a long time (we have to read and tokenize every document to get the vocab). To be as efficient as possible, TextBlock performs a few optimizations:
- It saves the tokenized documents in a temporary folder, so it doesn't have to tokenize them more than once
- It runs multiple tokenization processes in parallel, to take advantage of your computer's CPUs
We need to tell TextBlock how to access the texts so that it can do this initial preprocessing; that's what from_folder does.
show_batch then works in the usual way:
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])
dls_lm = DataBlock(
blocks=TextBlock.from_folder(path, is_lm=True),
get_items=get_imdb,
splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
dls_lm.show_batch(max_n=2)
Fine tuning
To convert the integer word indices into activations for the neural net, we will use embeddings. These are then fed into the RNN using an architecture called AWD_LSTM
Cross-entropy loss is suitable here since this is a classification problem (we are effectively picking the next word from the vocab). A metric called perplexity is often used in NLP; it is the exponential of the loss (torch.exp(cross_entropy)). We will also add accuracy, to see how often the model predicts the next word correctly.
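For example, with a hypothetical cross-entropy loss of 4.0, the perplexity would be about 54.6:
loss = tensor(4.0)   # hypothetical cross-entropy value
torch.exp(loss)      # perplexity ~= 54.6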
to_fp16 enables mixed-precision training, which uses less GPU memory and trains faster:
learn = language_model_learner(
dls_lm, AWD_LSTM, drop_mult=0.3,
metrics=[accuracy, Perplexity()]).to_fp16()
learn.fit_one_cycle(1, 2e-2)
Saving and Loading models
learn.save('1epoch')
learn = learn.load('1epoch')
learn.unfreeze()
learn.fit_one_cycle(5, 2e-3)
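As an optional check of what the fine-tuned language model has learned (not run here; the prompt and word count are arbitrary choices for illustration), we could ask it to generate some text with learn.predict:
# Sample ~40 words of text starting from an arbitrary prompt
TEXT = "I liked this movie because"
print(learn.predict(TEXT, 40, temperature=0.75))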
Pt 2: A Language Model from Scratch
Data
- The Human Numbers dataset contains the first 10,000 numbers written out in English. It was created by Jeremy for experimentation.
from fastai.text.all import *
path = untar_data(URLs.HUMAN_NUMBERS)
path.ls()
Take a look at some of the data:
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())
lines
Concatenate everything into one big stream, using "." as a separator:
text = ' . '.join([l.strip() for l in lines])
text[:100]
Use word tokenisation by splitting on spaces:
tokens = L(text.split(' '))
tokens[100:110]
For numericalisation, we need to create a list of all unique words (the vocab). We can then convert each token into a number by looking up its index in the vocab:
vocab = L(tokens).unique()
vocab
word2idx = {w:i for i,w in enumerate(vocab)}
nums = L(word2idx[i] for i in tokens)
tokens, nums
We now have a small dataset that we can use for language modelling.
Creating a Language Model
For this simple example, we will predict the next word based on the previous 3 words.
To do this, create a list with the independent variable being the first 3 words, and dependent variable being the 4th word.
L((tokens[i:i+3], tokens[i+3]) for i in range(0, len(tokens)-4,3))[0]
We can see from looking at the first item that ['one','.','two'] is the independent variable and '.' is the dependent variable.
What the model will actually use are tensors of the numericalised values.
seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0, len(nums)-4,3))
seqs
Create a DataLoader
- batch size of 64
- use the first 80% of the sequences for training and the last 20% for validation
bs = 64
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=64, shuffle=False)
The model in PyTorch
A simple linear model takes an input of size (batch size x #inputs). A single hidden layer computes a matrix product followed by a ReLU, producing activations of size (batch size x #activations). This is followed by more computation: another matrix product followed by a softmax, giving a final output of size (batch size x #classes).
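As a minimal sketch of that plain model in PyTorch (the sizes here are arbitrary and just for illustration; it relies on the fastai star import above for torch and nn):
# A plain one-hidden-layer net: Linear -> ReLU -> Linear (softmax is left to the loss function)
n_inputs, n_hidden_units, n_classes = 10, 64, 5
simple_net = nn.Sequential(
    nn.Linear(n_inputs, n_hidden_units),
    nn.ReLU(),
    nn.Linear(n_hidden_units, n_classes))
xb = torch.randn(32, n_inputs)   # (batch size x #inputs)
simple_net(xb).shape             # (batch size x #classes) -> torch.Size([32, 5])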
We will take this approach and modify it for our model.
Our model will be a neural net with 3 layers:
- The embedding layer (input to hidden, i_h)
- The linear layer (hidden to hidden, h_h)
  - this layer creates the activations for the next word
  - this layer is used for words 1-3
- A final linear layer to predict the fourth word (hidden to output, h_o)
In the diagram below, the arrows represent the computational steps (a linear layer followed by non-linearity (ReLU))
To start, take the word 1 input and put it through the linear layer and ReLU to get first set of activations.
Then put that through another linear layer and non-linearity. These activations are added to (concatenation would also be fine) the resulting activations of word 2, which is also run through a linear layer and non-linearity.
Again the results are run through another linear layer and non-linearity while also adding in the result of putting word 3 through a computation layer as we did with word 2.
These activations then go through a final linear layer and softmax to create the output activations.
What is interesting about this model is that inputs enter at later layers and are added into the network. Also, arrows of the same colour mean that the same weight matrix is being used.
In code we can represent this like so...
To go from the input to hidden layer we use an embedding. We create one embedding which subsequent words will also go through, and each time we add this to the current set of activations.
Why use the same embedding layer? Conceptually, the words all represent English spellings of numbers, so they have the same meaning and therefore don't need separate embeddings.
Once we have the embedding, we send this through the linear layer, then through relu. As with embeddings, we can use the same Linear layer because we are doing the same kind of computation.
The computation happens from the innermost brackets outwards, so F.relu(self.h_h(self.i_h(x[:,0])))
- starts by sending word 1, x[:,0], through the embedding layer: self.i_h(x[:,0])
- then through a linear layer: self.h_h(self.i_h(x[:,0]))
- and finally through the ReLU: F.relu(self.h_h(self.i_h(x[:,0])))
class LMModel1(Module):
    # vocab_sz == vocab size
    def __init__(self, vocab_sz, n_hidden):
        # the embedding layer
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        # the linear layer
        self.h_h = nn.Linear(n_hidden, n_hidden)
        # final linear layer
        self.h_o = nn.Linear(n_hidden, vocab_sz)

    def forward(self, x):
        # h is the hidden state
        # word 1 to embedding
        h = F.relu(self.h_h(self.i_h(x[:,0])))
        # word 2 to same embedding
        h = h + self.i_h(x[:,1])
        h = F.relu(self.h_h(h))
        h = h + self.i_h(x[:,2]) # word 3 to same embedding
        h = F.relu(self.h_h(h))
        # hidden to output
        return self.h_o(h)
the activations in the model are known as the "hidden state"
class LMModel1(Module):
def __init__(self, vocab_sz, n_hidden):
self.i_h = nn.Embedding(vocab_sz, n_hidden)
self.h_h = nn.Linear(n_hidden, n_hidden)
self.h_o = nn.Linear(n_hidden, vocab_sz)
def forward(self, x):
h = F.relu(self.h_h(self.i_h(x[:,0])))
h = h + self.i_h(x[:,1])
h = F.relu(self.h_h(h))
h = h + self.i_h(x[:,2])
h = F.relu(self.h_h(h))
return self.h_o(h)
learn = Learner(dls, LMModel1(len(vocab), 64),
loss_func=F.cross_entropy,
metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)
So far our accuracy is just under 50%. Not bad. We can improve by first refactoring: LMModel1 has a few repeated steps, which we can remove by adding a for loop.
Our first Recurrent Neural Net
class LMModel2(Module):
def __init__(self, vocab_sz, n_hidden):
self.i_h = nn.Embedding(vocab_sz, n_hidden)
self.h_h = nn.Linear(n_hidden, n_hidden)
self.h_o = nn.Linear(n_hidden, vocab_sz)
def forward(self,x):
# initialise h as 0.
        # this gets broadcast to a tensor in the loop
h = 0.
for i in range(3):
h = h + self.i_h(x[:,i])
h = F.relu(self.h_h(h))
return self.h_o(h)
# check we get the same results
learn = Learner(dls, LMModel2(len(vocab), 64),
loss_func=F.cross_entropy,
metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)
We have actually just created a Recurrent Neural Net (RNN).
Reminder - Hidden State represents the activations that are occurring inside the neural net.
Maintaining the Hidden State
We can do this by storing the hidden state and updating it after each batch. Calling detach throws away the gradient history, so gradients only flow back through the current sequence; this is known as truncated backpropagation through time (BPTT).
class LMModel3(Module):
def __init__(self, vocab_sz, n_hidden):
self.i_h = nn.Embedding(vocab_sz, n_hidden)
self.h_h = nn.Linear(n_hidden, n_hidden)
self.h_o = nn.Linear(n_hidden, vocab_sz)
self.h = 0.
def forward(self,x):
for i in range(3):
self.h = self.h + self.i_h(x[:,i])
self.h = F.relu(self.h_h(self.h))
out = self.h_o(self.h)
self.h = self.h.detach()
return out
def reset(self): self.h = 0.
For the hidden state to carry over usefully, batch i+1 must continue the documents seen in batch i. To arrange this, the dataset is split into bs chunks of length m = len(seqs)//bs and reordered so that the first batch holds the first sample of each chunk, the second batch the second sample of each chunk, and so on.
m = len(seqs)//bs
m,bs,len(seqs)
def group_chunks(ds, bs):
    m = len(ds)//bs   # length of each chunk
    new_ds = L()
    # batch i will contain the i-th element of each of the bs chunks
    for i in range(m): new_ds += L(ds[i + m*j] for j in range(bs))
    return new_ds
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(
group_chunks(seqs[:cut], bs),
group_chunks(seqs[cut:], bs),
bs=bs, drop_last=True, shuffle=False)
Callbacks
ModelResetter is a fastai callback that calls the model's reset method at the start of each training and validation phase, so the hidden state is zeroed before each pass over the data.
learn = Learner(dls, LMModel3(len(vocab), 64), loss_func=F.cross_entropy,
metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(10, 3e-3)
This RNN keeps the state from batch to batch and the results show the uplift from this change.
By only predicting every 4th word, we are throwing away signal, which seems wasteful. By moving the output stage inside the loop (i.e. making a prediction after every hidden state is created), we can predict the next word after every single word, rather than only every 3 words.
To do this, we have to change our data so that the dependent variable has each of the next words after each of our input words.
sl = 16 # sequence length
seqs = L((tensor(nums[i:i+sl]), tensor(nums[i+1:i+sl+1]))
for i in range(0,len(nums)-sl-1,sl))
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(group_chunks(seqs[:cut], bs),
group_chunks(seqs[cut:], bs),
bs=bs, drop_last=True, shuffle=False)
We can see from the first item in seqs that the input and target are the same length, with the target offset by one token:
[L(vocab[o] for o in s) for s in seqs[0]]
Update the model by creating a list to store the outputs, then append to it for each element in the loop:
# Modify the model to output a prediction after every word
class LMModel4(Module):
def __init__(self, vocab_sz, n_hidden):
self.i_h = nn.Embedding(vocab_sz, n_hidden)
self.h_h = nn.Linear(n_hidden, n_hidden)
self.h_o = nn.Linear(n_hidden,vocab_sz)
self.h = 0
def forward(self, x):
outs = []
for i in range(sl):
self.h = self.h + self.i_h(x[:,i])
self.h = F.relu(self.h_h(self.h))
outs.append(self.h_o(self.h))
self.h = self.h.detach()
return torch.stack(outs, dim=1)
def reset(self): self.h = 0
# flatten the predictions and targets so they fit F.cross_entropy
def loss_func(inp, targ):
return F.cross_entropy(inp.view(-1, len(vocab)), targ.view(-1))
learn = Learner(dls, LMModel4(len(vocab), 64), loss_func=loss_func,
metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)
Multilayer RNN
Our model is deep, but every hidden-to-hidden step uses the same weight matrix, so in effect it isn't that deep at all.
Let's refactor again to pass the activations of our current RNN into a second recurrent neural network. This is called a stacked or multilayer RNN.
Using PyTorch's nn.RNN module lets us define the number of layers (n_layers). We can also remove the loop and just call self.rnn:
class LMModel5(Module):
def __init__(self, vocab_sz, n_hidden, n_layers):
self.i_h = nn.Embedding(vocab_sz, n_hidden)
self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True)
self.h_o = nn.Linear(n_hidden, vocab_sz)
self.h = torch.zeros(n_layers, bs, n_hidden)
def forward(self, x):
res,h = self.rnn(self.i_h(x), self.h)
self.h = h.detach()
return self.h_o(res)
def reset(self): self.h.zero_()
# using 2 layers
learn = Learner(dls, LMModel5(len(vocab), 64,2), loss_func=loss_func,
metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)
Our results are worse!
Why? Deep models are hard to train. This can be due to exploding or vanishing activations: repeated matrix multiplications make the numbers involved either very, very large or very, very small. Exploding values can overflow, and vanishing values lose floating-point precision, so training struggles. A toy illustration is sketched below.
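To see why this happens, here is a toy example (nothing to do with the model itself): repeatedly multiplying by a number slightly above or below 1 quickly explodes or vanishes.
# Repeated multiplication by a constant slightly above / below 1
a = torch.ones(1)
for _ in range(100): a = a * 1.2
b = torch.ones(1)
for _ in range(100): b = b * 0.8
a, b   # a explodes (~8.3e7), b vanishes (~2e-10)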
We can avoid this in a number of ways...
LSTM
Replacing the plain matrix multiplication in an RNN with the LSTM architecture means the model can decide, via gates, how much of an update to do at each step. This helps the model avoid updating its state too much or too little.
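For intuition, here is a minimal sketch of an LSTM cell following the standard formulation (an illustration only, not the internals of PyTorch's nn.LSTM); each gate is a linear layer followed by a sigmoid that scales how much information flows.
class LSTMCellSketch(Module):
    def __init__(self, ni, nh):
        self.forget_gate = nn.Linear(ni + nh, nh)
        self.input_gate  = nn.Linear(ni + nh, nh)
        self.cell_gate   = nn.Linear(ni + nh, nh)
        self.output_gate = nn.Linear(ni + nh, nh)
    def forward(self, inp, state):
        h,c = state
        hx = torch.cat([h, inp], dim=1)
        c = c * torch.sigmoid(self.forget_gate(hx))   # how much of the old cell state to keep
        c = c + torch.sigmoid(self.input_gate(hx)) * torch.tanh(self.cell_gate(hx))  # how much new info to add
        h = torch.sigmoid(self.output_gate(hx)) * torch.tanh(c)  # how much of the state to expose
        return h, (h,c)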
Training a Language Model Using LSTMs
This is the same network, but with the RNN replaced by an LSTM. The hidden state now holds two tensors per layer, because an LSTM keeps both a hidden state and a cell state.
# Training with LSTM
class LMModel6(Module):
def __init__(self, vocab_sz, n_hidden, n_layers):
self.i_h = nn.Embedding(vocab_sz, n_hidden)
self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
self.h_o = nn.Linear(n_hidden, vocab_sz)
self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]
def forward(self, x):
res,h = self.rnn(self.i_h(x), self.h)
self.h = [h_.detach() for h_ in h]
return self.h_o(res)
def reset(self):
for h in self.h: h.zero_()
learn = Learner(dls, LMModel6(len(vocab), 64, 2),
loss_func=CrossEntropyLossFlat(),
metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 1e-2)
Results are much much better!
Regularising an LSTM
Dropout
- Dropout improves neural net training by randomly zeroing (dropping out) activations during training, which helps prevent the model from overfitting.
- Dropout helps the model generalise by ensuring particular activations don't over-specialise during the learning process.
class Dropout
- p: the probability that an activation gets deleted
- only perform dropout in training
- mask: a tensor with random zeros (with probability p) and ones (with probability 1-p)
class Dropout(Module):
    def __init__(self, p): self.p = p
    def forward(self, x):
        if not self.training: return x
        # random mask of 0s (probability p) and 1s (probability 1-p)
        mask = x.new(*x.shape).bernoulli_(1-self.p)
        # rescale by 1/(1-p) so the expected activations are unchanged
        return x * mask.div_(1-self.p)
A simple example
p = 0.3
B = torch.ones((3,3)).bernoulli_(1-p)
B
In this example, each element of the 3x3 matrix is drawn as a 1 with probability 1-p = 0.7 and as a 0 with probability p = 0.3, so on average about 3 of the 9 entries are zeroed. As we saw earlier with one-hot encodings, a matrix of zeros and ones like this can select elements of another matrix.
In the context of what we are doing, performing this element-wise multiplication randomly prunes elements of the other matrix:
A = tensor([[1., 2., 3.],
[4., 5., 6.],
[7., 8., 9.]])
A*B
Corresponding elements of A are returned only where there is a 1 in the same position in matrix B.
AR and TAR regularisation
AR (activation regularisation) and TAR (temporal activation regularisation) are very similar to weight decay but are applied to activations instead of weights.
TAR is linked to the fact that we are trying to predict a sequence of tokens, so it looks at the difference between activations at consecutive time steps: it penalises large changes in the activations from one time step to the next. A rough sketch of both penalties follows.
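As a rough sketch (my own illustration; in practice fastai's RNNRegularizer callback applies these for you, with alpha and beta as the weights), the two penalties added to the loss look roughly like this:
# AR: penalise large activations; TAR: penalise large changes between time steps
def ar_tar_penalty(acts, alpha=2., beta=1.):
    ar  = alpha * acts.pow(2).mean()
    tar = beta  * (acts[:,1:] - acts[:,:-1]).pow(2).mean()
    return ar + tar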
Weight Tying
Weight tying sets the hidden-to-output weights equal to the input-to-hidden weights. The idea is that converting words to activations and converting activations back to words are conceptually the same mapping, so the same weight matrix can be used for both.
class LMModel7(Module):
def __init__(self, vocab_sz, n_hidden, n_layers, p):
self.i_h = nn.Embedding(vocab_sz, n_hidden)
self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
self.drop = nn.Dropout(p)
self.h_o = nn.Linear(n_hidden, vocab_sz)
self.h_o.weight = self.i_h.weight
self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]
def forward(self, x):
raw,h = self.rnn(self.i_h(x), self.h)
out = self.drop(raw)
self.h = [h_.detach() for h_ in h]
return self.h_o(out),raw,out
def reset(self):
for h in self.h: h.zero_()
learn = Learner(dls, LMModel7(len(vocab), 64, 2, 0.5),
loss_func=CrossEntropyLossFlat(), metrics=accuracy,
cbs=[ModelResetter, RNNRegularizer(alpha=2, beta=1)])
This is the same as above, but TextLearner adds the additional pieces (such as ModelResetter and RNNRegularizer) for you:
learn = TextLearner(dls, LMModel7(len(vocab), 64, 2, 0.4),
loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(15, 1e-2, wd=0.1)
Almost 85% accuracy!