Lesson 8: Deep Learning for Coders
NLP
This notebook takes a dive into Natural Language Processing (NLP) and attempts to train an NLP classifier: a binary classification task using movie review sentiment.
The pretrained model
In lesson 1, we achieved over 90% accuracy because we were using a pre-trained model that we fine-tuned further. So what is a pre-trained language model?
A language model is one where we try to predict the next word in a sentence. In lesson 1, this was a neural net pre-trained on Wikipedia articles (Wikitext 103).
How does this help with sentiment analysis? Like pre-trained image models, language models contain a lot of information that can be leveraged rather than training from scratch. Fine-tuning throws away the last layer(s) and retrains those rather than the entire model.
Through transfer learning, we will create an IMDB language model using the Wikitext model as a base.
Text preprocessing
- Tokenization: Convert the text into a list of words (or characters, or substrings, depending on the granularity of your model)
- Numericalization: Make a list of all of the unique words that appear (the vocab), convert each word into a number, by looking up its index in the vocab.
- Language model data loader creation: fastai provides an LMDataLoader class which automatically handles creating a dependent variable that is offset from the independent variable by one token. It also handles some important details, such as how to shuffle the training data in such a way that the dependent and independent variables maintain their structure as required
- Language model creation: We need a special kind of model that does something we haven't seen before: it handles input lists which could be arbitrarily big or small. There are a number of ways to do this; in this chapter we will be using a recurrent neural network (RNN).
Tokenisation
There are different approaches to tokenisation (a toy illustration of the word and character approaches follows below):
- Word-based: splits a sentence on spaces
- Subword-based: splits words into smaller parts based on the most commonly occurring substrings
- Character-based: splits a sentence into individual characters
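As a rough sketch of the difference (a toy illustration only, not what fastai's tokenisers actually do), the word and character approaches can be mimicked with plain Python; subword tokenisation needs a vocab learned from a corpus, which we will see later with SubwordTokenizer.
# Toy illustration only: naive word and character splits on a made-up sentence
sentence = "This movie was great"
word_tokens = sentence.split(' ')   # word-based: ['This', 'movie', 'was', 'great']
char_tokens = list(sentence)        # character-based: ['T', 'h', 'i', 's', ' ', ...]
word_tokens, char_tokens[:6]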
Word tokenisation with fastai
- There are a number of tokenisers out there; fastai makes it easy to switch between them.
- Currently fastai's default tokeniser comes from the spaCy library.
Data
- The IMDB Large Movie Review dataset contains 25,000 highly polar movie reviews for training and 25,000 for testing, plus a folder of unlabelled reviews (unsup). It is very large!
from fastai.text.all import *
path = untar_data(URLs.IMDB)
get_text_files gets all the text files in a path. We can also optionally pass folders to restrict the search to a particular list of subfolders:
# only using 50k sample due to size of dataset
# results may vary from fastai book
files = get_text_files(path, folders = ['train', 'test', 'unsup'])[:50000]
# print out slice of first review
txt = files[0].open().read()
txt[:75]
- first(x): first element of x, or None if missing
- coll_repr(c, max_n): string representation of up to max_n items of the (possibly lazy) collection c
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks,30))
fastai provides additional functionality on top of tokenisers, such as adding special tokens like xxbos (beginning of string), or lowercasing all words and adding the xxmaj token before a word that was originally capitalised. This is done to preserve important information while reducing some complexity.
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt),31))
You can explore the rules like so
defaults.text_proc_rules
then check the source code for each rule using ??, e.g. fix_html??
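For instance, a quick way to list just the names of the default rules (assuming, as here, that each rule is a plain Python function with a __name__):
# Print the name of each default text-processing rule; use ?? on any of them for its source
for rule in defaults.text_proc_rules: print(rule.__name__)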
Subword Tokenisation
Word tokenisation relies on spaces within the document.
Subword tokenisation does two things:
1. Analyse a corpus of documents to find the most commonly occurring groups of letters; these become the vocab
2. Tokenise the corpus using this vocab of subword units
txts = L(o.open().read() for o in files[:2000])
We instantiate our tokeniser by defining the size of the vocab, then train it: meaning, have the tokeniser read the documents, find the common sequences of characters, and then create the vocab. In fastai, this is done with setup.
def subword(sz):
sp = SubwordTokenizer(vocab_sz=sz)
sp.setup(txts)
return ' '.join(first(sp([txt]))[:40])
subword(1000)
The special character ▁ represents a space character in the original text.
Using a smaller vocab means each token represents fewer characters, so more tokens are needed to represent a sentence:
subword(200)
Using a larger vocab means most common English words end up in the vocab, so fewer tokens are needed to represent a sentence:
subword(10000)
There are trade-offs to be made here: a larger vocab means fewer tokens per sentence, which means faster training and less memory and state for the model; the downside is larger embedding matrices, which need more data to learn.
Subword tokenisation provides an easy way to scale between character and word tokenisation, and it is also useful for languages other than English. A rough comparison of token counts at different vocab sizes is sketched below.
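To make the trade-off concrete, here is a rough sketch (my own helper, not from the book) that counts how many tokens the same review needs at each vocab size. It re-trains the tokeniser each time, so it is slow, and the exact counts will vary with the sample of texts used.
# Count tokens produced for the same text at different vocab sizes
def subword_len(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return len(first(sp([txt])))

[(sz, subword_len(sz)) for sz in (200, 1000, 10000)]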
Numericalisation
This is the process of mapping tokens to integers. It is nearly identical to the steps necessary to create a Category variable:
- Make a list of all possible levels of that categorical variable (the vocab)
- Replace each level with its index in the vocab
toks = tkn(txt)
print(coll_repr(tkn(txt), 32))
# a small example
toks200 = txts[:200].map(tkn)
toks200[0]
num = Numericalize()
num.setup(toks200)
coll_repr(num.vocab,20)
This is our vocab, starting with the special tokens, then English words in order of frequency (most frequent first).
We can now use the Numericalize object as a function and apply it to our tokens to see the integers they now represent:
nums = num(toks)[:20]
nums
Create batches for language model
Batches are split based on the sequence length and batch size.
Batches are created by concatenating individual texts into a stream. The order of the inputs is randomised, meaning the order of the documents (not the order of the words within them) is shuffled.
The stream is then divided into batches.
This is done at every epoch:
- shuffle the collection of documents
- concatenate them together into a stream of tokens
- cut the stream into a batch of fixed-size consecutive mini-streams
This is all done in fastai using LMDataLoader. For example:
nums200 = toks200.map(num)
dl = LMDataLoader(nums200)
x,y = first(dl)
x.shape, y.shape
By default, LMDataLoader uses a batch size of 64 and a sequence length of 72, so x and y both have shape (64, 72).
The first row of the independent variable contains the start of the first text:
' '.join(num.vocab[o] for o in x[0][:20])
The dependent variable is the same but offset by one token:
' '.join(num.vocab[o] for o in y[0][:20])
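As a quick sanity check (not in the book): within a row, the two tensors should line up shifted by one token, so this should print True.
# y is x shifted one token to the left within each mini-stream
torch.equal(x[0][1:], y[0][:-1])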
Pt 1: Training a Text Classifier using fastai
The reason that TextBlock is special is that setting up the numericalizer's vocab can take a long time (we have to read and tokenize every document to get the vocab). To be as efficient as possible, TextBlock performs a few optimizations:
- It saves the tokenized documents in a temporary folder, so it doesn't have to tokenize them more than once
- It runs multiple tokenization processes in parallel, to take advantage of your computer's CPUs
We need to tell TextBlock how to access the texts so that it can do this initial preprocessing; that's what from_folder does.
show_batch then works in the usual way:
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])
dls_lm = DataBlock(
blocks=TextBlock.from_folder(path, is_lm=True),
get_items=get_imdb,
splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
dls_lm.show_batch(max_n=2)
Fine tuning
To convert the integer word indices into activations for the neural net, we will use embeddings. These are then fed into the RNN using an architecture called AWD_LSTM
Cross-entropy loss is suitable here since this is a classification problem (we are effectively picking the next word from the vocab). A metric called perplexity is often used in NLP; it is the exponential of the loss (torch.exp(cross_entropy)). We will also add accuracy, to see how often the model predicts the next word correctly.
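For example, with a hypothetical cross-entropy loss of 4.0, the perplexity would be about 54.6:
loss = tensor(4.0)   # hypothetical cross-entropy value
torch.exp(loss)      # perplexity ~= 54.6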
to_fp16 enables mixed-precision training, which uses less GPU memory and trains faster:
learn = language_model_learner(
dls_lm, AWD_LSTM, drop_mult=0.3,
metrics=[accuracy, Perplexity()]).to_fp16()
learn.fit_one_cycle(1, 2e-2)
Saving and Loading models
learn.save('1epoch')
learn = learn.load('1epoch')
learn.unfreeze()
learn.fit_one_cycle(5, 2e-3)
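As an optional check of what the fine-tuned language model has learned (not run here; the prompt and word count are arbitrary choices for illustration), we could ask it to generate some text with learn.predict:
# Sample ~40 words of text starting from an arbitrary prompt
TEXT = "I liked this movie because"
print(learn.predict(TEXT, 40, temperature=0.75))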
Pt 2: A Language Model from Scratch
Data
- The Human Numbers dataset contains the first 10,000 numbers written out in English. It was created by Jeremy for experimentation.
from fastai.text.all import *
path = untar_data(URLs.HUMAN_NUMBERS)
path.ls()
Take a look at some of the data:
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())
lines
Concatenate everything into one big stream, using "." as a separator:
text = ' . '.join([l.strip() for l in lines])
text[:100]
Use word tokenisation by splitting on spaces:
tokens = L(text.split(' '))
tokens[100:110]
For numericalisation, we need to create a list of all unique words (the vocab). We can then convert each token into a number by looking up its index in the vocab:
vocab = L(tokens).unique()
vocab
word2idx = {w:i for i,w in enumerate(vocab)}
nums = L(word2idx[i] for i in tokens)
tokens, nums
We now have a small dataset that we can use for language modelling.
Creating a Language Model
For this simple example, we will predict the next word based on the previous 3 words.
To do this, create a list with the independent variable being the first 3 words, and dependent variable being the 4th word.
L((tokens[i:i+3], tokens[i+3]) for i in range(0, len(tokens)-4,3))[0]
We can see from looking at the first item that ['one','.','two'] is the independent variable and '.' is the dependent variable.
What the model will actually use are tensors of the numericalised values.
seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0, len(nums)-4,3))
seqs
Create a DataLoader
- batch size of 64
- use the first 80% of the sequences for training and the last 20% for validation
bs = 64
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=64, shuffle=False)
The model in PyTorch
A simple linear model takes an input of size (batch size x #inputs). A single hidden layer computes a matrix product followed by a ReLU, producing activations of size (batch size x #activations). This is followed by more computation: another matrix product followed by a softmax, giving a final output of size (batch size x #classes).
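As a minimal sketch of that plain model in PyTorch (the sizes here are arbitrary and just for illustration; it relies on the fastai star import above for torch and nn):
# A plain one-hidden-layer net: Linear -> ReLU -> Linear (softmax is left to the loss function)
n_inputs, n_hidden_units, n_classes = 10, 64, 5
simple_net = nn.Sequential(
    nn.Linear(n_inputs, n_hidden_units),
    nn.ReLU(),
    nn.Linear(n_hidden_units, n_classes))
xb = torch.randn(32, n_inputs)   # (batch size x #inputs)
simple_net(xb).shape             # (batch size x #classes) -> torch.Size([32, 5])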
We will take this approach and modify it for our model.
Our model will be a neural net with 3 layers:
- The embedding layer (input to hidden, i_h)
- The linear layer (hidden to hidden, h_h)
  - this layer creates the activations for the next word
  - this layer is used for words 1-3
- A final linear layer to predict the fourth word (hidden to output, h_o)
In the diagram below, the arrows represent the computational steps (a linear layer followed by non-linearity (ReLU))
To start, take the word 1 input and put it through the linear layer and ReLU to get first set of activations.
Then put that through another linear layer and non-linearity. These activations are added to (concatenation would also be fine) the resulting activations of word 2, which is also run through a linear layer and non-linearity.
Again the results are run through another linear layer and non-linearity while also adding in the result of putting word 3 through a computation layer as we did with word 2.
These activations then go through a final linear layer and softmax to create the output activations.
What is interesting about this model is that inputs enter at later layers and are added into the network. Also, arrows of the same colour mean that the same weight matrix is being used.
In code we can represent this like so...
To go from the input to hidden layer we use an embedding. We create one embedding which subsequent words will also go through, and each time we add this to the current set of activations.
Why use the same embedding layer? Conceptually, the words all represent English spellings of numbers, so they have the same meaning and therefore don't need separate embeddings.
Once we have the embedding, we send this through the linear layer, then through relu. As with embeddings, we can use the same Linear layer because we are doing the same kind of computation.
The computation happens from the innermost brackets outwards, so F.relu(self.h_h(self.i_h(x[:,0])))
- starts by sending word 1, x[:,0], through the embedding layer: self.i_h(x[:,0])
- then through a linear layer: self.h_h(self.i_h(x[:,0]))
- and finally through the ReLU: F.relu(self.h_h(self.i_h(x[:,0])))
class LMModel1(Module):
    # vocab_sz == vocab size
    def __init__(self, vocab_sz, n_hidden):
        # the embedding layer
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        # the linear layer
        self.h_h = nn.Linear(n_hidden, n_hidden)
        # final linear layer
        self.h_o = nn.Linear(n_hidden, vocab_sz)

    def forward(self, x):
        # h is the hidden state
        # word 1 to embedding
        h = F.relu(self.h_h(self.i_h(x[:,0])))
        # word 2 to same embedding
        h = h + self.i_h(x[:,1])
        h = F.relu(self.h_h(h))
        h = h + self.i_h(x[:,2]) # word 3 to same embedding
        h = F.relu(self.h_h(h))
        # hidden to output
        return self.h_o(h)
the activations in the model are known as the "hidden state"
class LMModel1(Module):
def __init__(self, vocab_sz, n_hidden):
self.i_h = nn.Embedding(vocab_sz, n_hidden)
self.h_h = nn.Linear(n_hidden, n_hidden)
self.h_o = nn.Linear(n_hidden, vocab_sz)
def forward(self, x):
h = F.relu(self.h_h(self.i_h(x[:,0])))
h = h + self.i_h(x[:,1])
h = F.relu(self.h_h(h))
h = h + self.i_h(x[:,2])
h = F.relu(self.h_h(h))
return self.h_o(h)
learn = Learner(dls, LMModel1(len(vocab), 64),
loss_func=F.cross_entropy,
metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)
So far our accuracy is just under 50%. Not bad. We can improve by first refactoring: LMModel1 has a few repeated steps, which we can remove by adding a for loop.
Our first Recurrent Neural Net
class LMModel2(Module):
def __init__(self, vocab_sz, n_hidden):
self.i_h = nn.Embedding(vocab_sz, n_hidden)
self.h_h = nn.Linear(n_hidden, n_hidden)
self.h_o = nn.Linear(n_hidden, vocab_sz)
def forward(self,x):
# initialise h as 0.
        # this gets broadcast to a tensor in the loop
h = 0.
for i in range(3):
h = h + self.i_h(x[:,i])
h = F.relu(self.h_h(h))
return self.h_o(h)
# check we get the same results
learn = Learner(dls, LMModel2(len(vocab), 64),
loss_func=F.cross_entropy,
metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)
We have actually just created a Recurrent Neural Net (RNN).
Reminder - Hidden State represents the activations that are occurring inside the neural net.
Maintaining the Hidden State
We can do this by storing the hidden state and updating it after each batch. Calling detach throws away the gradient history, so gradients only flow back through the current sequence; this is known as truncated backpropagation through time (BPTT).
class LMModel3(Module):
def __init__(self, vocab_sz, n_hidden):
self.i_h = nn.Embedding(vocab_sz, n_hidden)
self.h_h = nn.Linear(n_hidden, n_hidden)
self.h_o = nn.Linear(n_hidden, vocab_sz)
self.h = 0.
def forward(self,x):
for i in range(3):
self.h = self.h + self.i_h(x[:,i])
self.h = F.relu(self.h_h(self.h))
out = self.h_o(self.h)
self.h = self.h.detach()
return out
def reset(self): self.h = 0.
For the hidden state to carry over usefully, batch i+1 must continue the documents seen in batch i. To arrange this, the dataset is split into bs chunks of length m = len(seqs)//bs and reordered so that the first batch holds the first sample of each chunk, the second batch the second sample of each chunk, and so on.
m = len(seqs)//bs
m,bs,len(seqs)
def group_chunks(ds, bs):
    m = len(ds)//bs   # length of each chunk
    new_ds = L()
    # batch i will contain the i-th element of each of the bs chunks
    for i in range(m): new_ds += L(ds[i + m*j] for j in range(bs))
    return new_ds
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(
group_chunks(seqs[:cut], bs),
group_chunks(seqs[cut:], bs),
bs=bs, drop_last=True, shuffle=False)
Callbacks
ModelResetter is a fastai callback that calls the model's reset method at the start of each training and validation phase, so the hidden state is zeroed before each pass over the data.
learn = Learner(dls, LMModel3(len(vocab), 64), loss_func=F.cross_entropy,
metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(10, 3e-3)
This RNN keeps the state from batch to batch and the results show the uplift from this change.
By only predicting every 4th word, we are throwing away signal, which seems wasteful. By moving the output stage inside the loop (i.e. making a prediction after every hidden state is created), we can predict the next word after every single word, rather than only every 3 words.
To do this, we have to change our data so that the dependent variable has each of the next words after each of our input words.
sl = 16 # sequence length
seqs = L((tensor(nums[i:i+sl]), tensor(nums[i+1:i+sl+1]))
for i in range(0,len(nums)-sl-1,sl))
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(group_chunks(seqs[:cut], bs),
group_chunks(seqs[cut:], bs),
bs=bs, drop_last=True, shuffle=False)
We can see from the first item in seqs that the input and target are the same length, with the target offset by one token:
[L(vocab[o] for o in s) for s in seqs[0]]
Update the model by creating a list to store the outputs, then append to it for each element in the loop:
# Modify the model to output a prediction after every word
class LMModel4(Module):
def __init__(self, vocab_sz, n_hidden):
self.i_h = nn.Embedding(vocab_sz, n_hidden)
self.h_h = nn.Linear(n_hidden, n_hidden)
self.h_o = nn.Linear(n_hidden,vocab_sz)
self.h = 0
def forward(self, x):
outs = []
for i in range(sl):
self.h = self.h + self.i_h(x[:,i])
self.h = F.relu(self.h_h(self.h))
outs.append(self.h_o(self.h))
self.h = self.h.detach()
return torch.stack(outs, dim=1)
def reset(self): self.h = 0
# flatten the predictions and targets so they fit F.cross_entropy
def loss_func(inp, targ):
return F.cross_entropy(inp.view(-1, len(vocab)), targ.view(-1))
learn = Learner(dls, LMModel4(len(vocab), 64), loss_func=loss_func,
metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)
Multilayer RNN
Our model is deep, but every hidden-to-hidden step uses the same weight matrix, so in effect it isn't that deep at all.
Let's refactor again to pass the activations of our current RNN into a second recurrent neural network. This is called a stacked or multilayer RNN.
Using PyTorch's nn.RNN module lets us define the number of layers (n_layers). We can also remove the loop and just call self.rnn:
class LMModel5(Module):
def __init__(self, vocab_sz, n_hidden, n_layers):
self.i_h = nn.Embedding(vocab_sz, n_hidden)
self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True)
self.h_o = nn.Linear(n_hidden, vocab_sz)
self.h = torch.zeros(n_layers, bs, n_hidden)
def forward(self, x):
res,h = self.rnn(self.i_h(x), self.h)
self.h = h.detach()
return self.h_o(res)
def reset(self): self.h.zero_()
# using 2 layers
learn = Learner(dls, LMModel5(len(vocab), 64,2), loss_func=loss_func,
metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)
Our results are worse!
Why? Deep models are hard to train. This can be due to exploding or vanishing activations: repeated matrix multiplications make the numbers involved either very, very large or very, very small. Exploding values can overflow, and vanishing values lose floating-point precision, so training struggles. A toy illustration is sketched below.
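To see why this happens, here is a toy example (nothing to do with the model itself): repeatedly multiplying by a number slightly above or below 1 quickly explodes or vanishes.
# Repeated multiplication by a constant slightly above / below 1
a = torch.ones(1)
for _ in range(100): a = a * 1.2
b = torch.ones(1)
for _ in range(100): b = b * 0.8
a, b   # a explodes (~8.3e7), b vanishes (~2e-10)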
We can avoid this in a number of ways...
LSTM
Replacing the plain matrix multiplication in an RNN with the LSTM architecture means the model can decide, via gates, how much of an update to do at each step. This helps the model avoid updating its state too much or too little.
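For intuition, here is a minimal sketch of an LSTM cell following the standard formulation (an illustration only, not the internals of PyTorch's nn.LSTM); each gate is a linear layer followed by a sigmoid that scales how much information flows.
class LSTMCellSketch(Module):
    def __init__(self, ni, nh):
        self.forget_gate = nn.Linear(ni + nh, nh)
        self.input_gate  = nn.Linear(ni + nh, nh)
        self.cell_gate   = nn.Linear(ni + nh, nh)
        self.output_gate = nn.Linear(ni + nh, nh)
    def forward(self, inp, state):
        h,c = state
        hx = torch.cat([h, inp], dim=1)
        c = c * torch.sigmoid(self.forget_gate(hx))   # how much of the old cell state to keep
        c = c + torch.sigmoid(self.input_gate(hx)) * torch.tanh(self.cell_gate(hx))  # how much new info to add
        h = torch.sigmoid(self.output_gate(hx)) * torch.tanh(c)  # how much of the state to expose
        return h, (h,c)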
Training a Language Model Using LSTMs
This is the same network, but with the RNN replaced by an LSTM. The hidden state now holds two tensors per layer, because an LSTM keeps both a hidden state and a cell state.
# Training with LSTM
class LMModel6(Module):
def __init__(self, vocab_sz, n_hidden, n_layers):
self.i_h = nn.Embedding(vocab_sz, n_hidden)
self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
self.h_o = nn.Linear(n_hidden, vocab_sz)
self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]
def forward(self, x):
res,h = self.rnn(self.i_h(x), self.h)
self.h = [h_.detach() for h_ in h]
return self.h_o(res)
def reset(self):
for h in self.h: h.zero_()
learn = Learner(dls, LMModel6(len(vocab), 64, 2),
loss_func=CrossEntropyLossFlat(),
metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 1e-2)
Results are much much better!
Regularising an LSTM
Dropout
- Dropout improves neural net training by randomly zeroing (dropping out) activations during training, which helps prevent the model from overfitting.
- Dropout helps the model generalise by ensuring particular activations don't over-specialise during the learning process.
class Dropout
- p: the probability that an activation gets deleted
- only perform dropout in training
- mask: a tensor with random zeros (with probability p) and ones (with probability 1-p)
class Dropout(Module):
    def __init__(self, p): self.p = p
    def forward(self, x):
        if not self.training: return x
        # random mask of 0s (probability p) and 1s (probability 1-p)
        mask = x.new(*x.shape).bernoulli_(1-self.p)
        # rescale by 1/(1-p) so the expected activations are unchanged
        return x * mask.div_(1-self.p)
A simple example
p = 0.3
B = torch.ones((3,3)).bernoulli_(1-p)
B
In this example, each element of the 3x3 matrix is drawn as a 1 with probability 1-p = 0.7 and as a 0 with probability p = 0.3, so on average about 3 of the 9 entries are zeroed. As we saw earlier with one-hot encodings, a matrix of zeros and ones like this can select elements of another matrix.
In the context of what we are doing, performing this element-wise multiplication randomly prunes elements of the other matrix:
A = tensor([[1., 2., 3.],
[4., 5., 6.],
[7., 8., 9.]])
A*B
Corresponding elements of A are returned only where there is a 1 in the same position in matrix B.
AR and TAR regularisation
AR (activation regularisation) and TAR (temporal activation regularisation) are very similar to weight decay but are applied to activations instead of weights.
TAR is linked to the fact that we are trying to predict a sequence of tokens, so it looks at the difference between activations at consecutive time steps: it penalises large changes in the activations from one time step to the next. A rough sketch of both penalties follows.
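As a rough sketch (my own illustration; in practice fastai's RNNRegularizer callback applies these for you, with alpha and beta as the weights), the two penalties added to the loss look roughly like this:
# AR: penalise large activations; TAR: penalise large changes between time steps
def ar_tar_penalty(acts, alpha=2., beta=1.):
    ar  = alpha * acts.pow(2).mean()
    tar = beta  * (acts[:,1:] - acts[:,:-1]).pow(2).mean()
    return ar + tar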
Weight Tying
Weight tying sets the hidden-to-output weights equal to the input-to-hidden weights. The idea is that converting words to activations and converting activations back to words are conceptually the same mapping, so the same weight matrix can be used for both.
class LMModel7(Module):
def __init__(self, vocab_sz, n_hidden, n_layers, p):
self.i_h = nn.Embedding(vocab_sz, n_hidden)
self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
self.drop = nn.Dropout(p)
self.h_o = nn.Linear(n_hidden, vocab_sz)
self.h_o.weight = self.i_h.weight
self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]
def forward(self, x):
raw,h = self.rnn(self.i_h(x), self.h)
out = self.drop(raw)
self.h = [h_.detach() for h_ in h]
return self.h_o(out),raw,out
def reset(self):
for h in self.h: h.zero_()
learn = Learner(dls, LMModel7(len(vocab), 64, 2, 0.5),
loss_func=CrossEntropyLossFlat(), metrics=accuracy,
cbs=[ModelResetter, RNNRegularizer(alpha=2, beta=1)])
This is the same as above, but TextLearner adds the additional pieces (such as ModelResetter and RNNRegularizer) for you:
learn = TextLearner(dls, LMModel7(len(vocab), 64, 2, 0.4),
loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(15, 1e-2, wd=0.1)
Almost 85% accuracy!