Lesson 6: Multi-Label Classification
06-10-2020
This notebook will go over some of the practical material discussed in lesson 6 of the fastai 2020 course. Lesson 5 was an extension of the pet classifier we built, as well as a discussion on data ethics.
# imports
from fastai.vision.all import *
# dataset
path = untar_data(URLs.PASCAL_2007)
Dataset: PASCAL
Our dataset contains images along with a CSV file containing the labels. There are multiple labels per image, and these are space separated.
df = pd.read_csv(path/'train.csv')
df.head()
Let's create a dataset from scratch using fastai's suggested methods.
We need to grab the appropriate fields from the data frame, that is...
- The independent variable will be the images
- The labels will be extracted from the space-separated strings
# convenience: set the base path so that paths display relative to it
Path.BASE_PATH = path
DataBlock
We need to tell our datablock:
- where to get the data from
- where to get the labels from
- what kind of data we are working with: this is a multi-label image classification problem, so we use an `ImageBlock` for the images and a `MultiCategoryBlock` for the labels (`MultiCategoryBlock` expects a list of category labels per item)
- how to split our data
# helper functions
# create a function to grab x (the image path) from the training folder
# create a function to split the labels for y
def get_x(r): return path/'train'/r['fname']
def get_y(r): return r['labels'].split(' ')
# training/valid splitter
def splitter(df):
    train = df.index[~df['is_valid']].tolist()
    valid = df.index[df['is_valid']].tolist()
    return train,valid
dblock = DataBlock(
    blocks=(ImageBlock, MultiCategoryBlock),
    splitter=splitter,
    get_x=get_x,
    get_y=get_y,
    item_tfms=RandomResizedCrop(128, min_scale=0.35))
dsets = dblock.datasets(df)
dsets.train[0]
dsets.train[0][0].to_thumb(192)
# for our image, check where the one-hot target == 1
# filter the vocab by these indices to check that our car is a car
idx = torch.where(dsets.train[0][1]==1.)[0]
dsets.train.vocab[idx]
`TensorMultiCategory` is a one-hot encoded vector: instead of having a label or a list of labels, each image has a tensor with a 1 in the positions of that image's labels and a 0 for all other labels. The `vocab` is useful for seeing which label classes are available.
dsets.train.vocab.o2i
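To see the encoding direction as well, here is a small sketch (not in the original notebook) that rebuilds a one-hot target by hand using `o2i`; the label list `['car']` is just an assumed example.
# hypothetical check: build a one-hot target by hand using the vocab's o2i mapping
vocab = dsets.train.vocab
labels = ['car']                                   # assumed example labels
target = torch.zeros(len(vocab))
target[[vocab.o2i[l] for l in labels]] = 1.
target                                             # 1. at the 'car' position, 0. elsewhere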
Putting it all together
- We now have a complete data block, so we will swap out `dsets = dblock.datasets(df)` for `dls = dblock.dataloaders(df)`.
dblock = DataBlock(
    blocks=(ImageBlock, MultiCategoryBlock),
    splitter=splitter,
    get_x=get_x,
    get_y=get_y,
    item_tfms=RandomResizedCrop(128, min_scale=0.35))
dls = dblock.dataloaders(df)
dls.show_batch(nrows=1, ncols=3)
Binary cross entropy loss
- Binary cross entropy loss is used for classification problems. For multi-label classification we do not need the probabilities of the different classes to add up to one, so we will not need softmax here.
- Why? Because we might be confident that several objects appear in an image, so using softmax to force the probabilities to sum to one is not a good idea for this problem.
- We may also want the sum to be less than one, if the model is not confident that any of the categories appear in the image.
- Each activation will be compared to its target for each column, so we don't have to do anything special to make this loss work for multiple columns.
Some explanations I found useful...
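To make this concrete, here is a minimal hand-written version of binary cross entropy, a sketch in the spirit of the fastai book rather than the exact library implementation: apply a sigmoid to each activation independently, then take the negative log of the probability assigned to the correct outcome for each label, and average.
def binary_cross_entropy(inputs, targets):
    # sigmoid each activation independently (no softmax across classes)
    inputs = inputs.sigmoid()
    # probability assigned to the correct outcome for each label, then -log and average
    return -torch.where(targets==1, inputs, 1-inputs).log().mean()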
# Create a CNN learner
learn = cnn_learner(dls, resnet18)
# we can check the activations of our model
# by passing in a mini-batch of the independent variable
x,y = to_cpu(dls.train.one_batch())
activs = learn.model(x)
activs.shape
Why this shape?
- batch size = 64
- number of categories = 20
# check the 20 activations
# this is just to see what they look like
activs[0]
These activations are not between 0 and 1. We need them to represent probabilities, so we will need to run them through a sigmoid.
Remember, here we do not need them to sum to one.
We will use `nn.BCEWithLogitsLoss` (the module equivalent of `F.binary_cross_entropy_with_logits`), because it applies the sigmoid for us.
loss_func = nn.BCEWithLogitsLoss()
loss = loss_func(activs, y)
loss
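As a quick sanity check (an extra step not in the original notebook, reusing `activs` and `y` from above), applying the sigmoid ourselves and then plain binary cross entropy should give essentially the same value, since the "with logits" version bundles the sigmoid in.
manual = F.binary_cross_entropy(torch.sigmoid(activs), y)
combined = F.binary_cross_entropy_with_logits(activs, y)
manual, combined    # the two should match up to floating point error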
Defining a metric
- `accuracy` will only work for single-label classification problems. The reason is that it takes the input (the final layer activations), performs `argmax` on them to find the index of the largest activation, compares that to the target, and takes the mean. So it only makes sense when there is a single maximum we are looking for.
- `accuracy_multi` will be used instead, since we need something that works for multiple labels. It compares each sigmoid-ed final layer activation to a threshold (0.5 by default): if the value is greater than the threshold we assume that category is present, otherwise it is not. We then compare this list of trues and falses to the target and take the mean. A minimal sketch of this logic is shown after this list.
- We might not want to use 0.5 for our threshold. We can change it with `partial` when we create our learner, passing in the required argument `thresh=0.2`.
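Here is roughly what that logic looks like in code, a sketch that mirrors fastai's `accuracy_multi` (named `accuracy_multi_sketch` here so it doesn't shadow the real function; see the fastai source for the exact definition):
def accuracy_multi_sketch(inp, targ, thresh=0.5, sigmoid=True):
    # optionally squash the raw activations into (0,1)
    if sigmoid: inp = inp.sigmoid()
    # predict "present" where the probability clears the threshold,
    # compare with the one-hot targets, and average over every label of every item
    return ((inp > thresh) == targ.bool()).float().mean()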
fastai will know by default that we are doing a multi-label classification problem, so we don't need to specify the loss.
learn = cnn_learner(dls, resnet50,
                    metrics=partial(accuracy_multi, thresh=0.2))
learn.fine_tune(3, base_lr=3e-3, freeze_epochs=4)
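As a quick check (an extra step not in the original notebook), we can confirm which loss fastai picked for us:
learn.loss_func    # should be a (flattened) binary cross entropy with logits loss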
How do we pick a good threshold? By trial and error!
preds,targs = learn.get_preds()
xs = torch.linspace(0.05, 0.95, 29)
accs = [accuracy_multi(preds, targs, thresh=i, sigmoid=False) for i in xs]
plt.plot(xs,accs); # somewhere just above 0.5
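If we want a single number out of this sweep, we can also read off the threshold that maximises the accuracy (a small assumed follow-up, not in the original notebook):
best_thresh = xs[torch.stack(accs).argmax()]    # accs is a list of scalar tensors
best_thresh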
Image Regression
path = untar_data(URLs.BIWI_HEAD_POSE)
Inspect the data: there are 24 directories, corresponding to the 24 different people photographed.
Path.BASE_PATH = path
path.ls().sorted()
# take a look inside one directory
(path/'01').ls().sorted()
Each directory contains image files along with pose files, which give the location of the centre of the head. We can write a function that returns the coordinates of the head centre point.
`get_image_files` will recursively get all image files, and `img2pose` will convert an image filename to its associated pose file.
img_files = get_image_files(path)
def img2pose(x): return Path(f'{str(x)[:-7]}pose.txt')
img2pose(img_files[0])
# take a look at the shape of an image and a sample image
im = PILImage.create(img_files[0])
im.shape
im.to_thumb(160)
# this function is supplied by BIWI dataset website
# returns the coordinates as a tensor
cal = np.genfromtxt(path/'01'/'rgb.cal', skip_footer=6)
def get_ctr(f):
    ctr = np.genfromtxt(img2pose(f), skip_header=3)
    c1 = ctr[0] * cal[0][0]/ctr[2] + cal[0][2]
    c2 = ctr[1] * cal[1][1]/ctr[2] + cal[1][2]
    return tensor([c1,c2])
# test it
get_ctr(img_files[0])
Create a DataBlock
We will not use a random splitter because there are multiple images of each person in the data set. We want the model to generalise well to people it has not yet seen, so we will hold back one person for validation.
Our independent variable is an `ImageBlock`, and the dependent variable is a point described by two continuous values (the coordinates of the head centre); `PointBlock` specifies this for us. `get_ctr` will return our y values.
We will also halve the size of our images with `aug_transforms(size=(240,320))`.
biwi = DataBlock(
    blocks=(ImageBlock, PointBlock),
    get_items=get_image_files,
    get_y=get_ctr,
    splitter=FuncSplitter(lambda o: o.parent.name=='13'),
    batch_tfms=[*aug_transforms(size=(240,320)),
                Normalize.from_stats(*imagenet_stats)]
)
# check some data
dls = biwi.dataloaders(path)
dls.show_batch(max_n=9, figsize=(8,6))
# check the shape of one batch
xb,yb = dls.one_batch()
xb.shape,yb.shape
Understanding torch.Size([64, 3, 240, 320])
- mini batch is 64 items
- there are 3 channels R,G,B
- image size is 240x320
torch.Size([64, 1, 2])
- 64 items in the mini batch
- each item is one point (the 1), represented by 2 coordinates (the 2)
Training the model
We create a learner in the standard way with `cnn_learner`. We use `y_range=(-1,1)` to tell fastai what range of values we expect to see in the dependent variable. `y_range` is implemented with a sigmoid function rescaled to the low and high values you supply.
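For reference, here is a minimal sketch of what that mapping looks like (this mirrors fastai's `sigmoid_range` helper): squash the activation with a sigmoid, then rescale it to the supplied range.
def sigmoid_range(x, lo, hi):
    # sigmoid gives (0,1); rescale and shift it into (lo, hi)
    return torch.sigmoid(x) * (hi - lo) + lo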
learn = cnn_learner(dls, resnet18, y_range=(-1,1))
# we didn't specify the loss
# what has fastai picked?
dls.loss_func
MSE will be suitable for this problem since we are trying to predict something as close as possible to the given coordinates.
lr_min, lr_steep = learn.lr_find()
learn.recorder.plot_lr_find()
plt.axvline(x=lr_min, color='orange')
plt.axvline(x=lr_steep, color='r');
lr=lr_min
learn.fine_tune(3,lr)
We got a loss of 0.000083. Since this is a mean squared error, it corresponds to an average coordinate prediction error (in the rescaled (-1,1) coordinate space) of roughly...
math.sqrt(0.000083)
# check results against actuals
learn.show_results(ds_idx=1, nrows=3, figsize=(6,8))
Summary
For the image regression problem, we were able to use `fine_tune` rather than training from scratch, because our pre-trained model has enough information about faces that retro-fitting it to a different problem is somewhat trivial for it. This is a pretty powerful idea: these models are flexible enough that they can be trained further and utilised on other problems.