Lesson 6: Multi-Label Classification
06-10-2020
This notebook will go over some of the practical material discussed in lesson 6 of the fastai 2020 course. Lesson 5 was an extension of the pet classifier we built, as well as a discussion on data ethics.
# imports
from fastai.vision.all import *
# dataset
path = untar_data(URLs.PASCAL_2007)
Dataset: PASCAL
Our dataset contains images along with a CSV file containing the labels. There are multiple labels per image, and these are space separated.
df = pd.read_csv(path/'train.csv')
df.head()
Let's create a dataset from scratch using fastai's suggested methods.
We need to grab the appropriate fields from the data frame, that is...
- The independent variable will be the images
- The labels will be extracted from the space-separated strings
# convenience: set the base path so that paths display relative to it
Path.BASE_PATH = path
DataBlock
We need to tell our datablock:
- where to get the data from
- where to get the labels from
- what kind of data we are working with: this is a multi-label image classification problem, so we use an `ImageBlock` for the images and a `MultiCategoryBlock` for the labels (`MultiCategoryBlock` expects a list of category labels per item)
- how to split our data
# helper functions
# create a function to grab x (the image path) from the training folder
# create a function to split the labels for y
def get_x(r): return path/'train'/r['fname']
def get_y(r): return r['labels'].split(' ')
# training/valid splitter
def splitter(df):
    train = df.index[~df['is_valid']].tolist()
    valid = df.index[df['is_valid']].tolist()
    return train,valid
dblock = DataBlock(
    blocks=(ImageBlock, MultiCategoryBlock),
    splitter=splitter,
    get_x=get_x,
    get_y=get_y,
    item_tfms=RandomResizedCrop(128, min_scale=0.35))
dsets = dblock.datasets(df)
dsets.train[0]
dsets.train[0][0].to_thumb(192)
# for our image, check where the one-hot target == 1
# filter the vocab by these indices to check that our car is a car
idx = torch.where(dsets.train[0][1]==1.)[0]
dsets.train.vocab[idx]
`TensorMultiCategory` is a one-hot encoded vector: instead of having a label or a list of labels, each image has a tensor with a 1 in the positions of that image's labels and a 0 for all other labels. The `vocab` is useful for seeing which label classes are available.
dsets.train.vocab.o2i
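To see the encoding direction as well, here is a small sketch (not in the original notebook) that rebuilds a one-hot target by hand using `o2i`; the label list `['car']` is just an assumed example.
# hypothetical check: build a one-hot target by hand using the vocab's o2i mapping
vocab = dsets.train.vocab
labels = ['car']                                   # assumed example labels
target = torch.zeros(len(vocab))
target[[vocab.o2i[l] for l in labels]] = 1.
target                                             # 1. at the 'car' position, 0. elsewhere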
Putting it all together
- We now have a complete data block, so we will swap out `dsets = dblock.datasets(df)` for `dls = dblock.dataloaders(df)`.
dblock = DataBlock(
    blocks=(ImageBlock, MultiCategoryBlock),
    splitter=splitter,
    get_x=get_x,
    get_y=get_y,
    item_tfms=RandomResizedCrop(128, min_scale=0.35))
dls = dblock.dataloaders(df)
dls.show_batch(nrows=1, ncols=3)
Binary cross entropy loss
- Binary cross entropy loss is used for classification problems. For multi-label classification we do not need the probabilities of the different classes to add up to one, so we will not need softmax here.
- Why? Because we might be confident that several objects appear in an image, so using softmax to force the probabilities to sum to one is not a good idea for this problem.
- We may also want the sum to be less than one, if the model is not confident that any of the categories appear in the image.
- Each activation will be compared to its target for each column, so we don't have to do anything special to make this loss work for multiple columns.
Some explanations I found useful...
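To make this concrete, here is a minimal hand-written version of binary cross entropy, a sketch in the spirit of the fastai book rather than the exact library implementation: apply a sigmoid to each activation independently, then take the negative log of the probability assigned to the correct outcome for each label, and average.
def binary_cross_entropy(inputs, targets):
    # sigmoid each activation independently (no softmax across classes)
    inputs = inputs.sigmoid()
    # probability assigned to the correct outcome for each label, then -log and average
    return -torch.where(targets==1, inputs, 1-inputs).log().mean()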
# Create a CNN learner
learn = cnn_learner(dls, resnet18)
# we can check the activations of our model
# by passing in a mini-batch of the independent variable
x,y = to_cpu(dls.train.one_batch())
activs = learn.model(x)
activs.shape
Why this shape?
- batch size = 64
- number of categories = 20
# check the 20 activations
# this is just to see what they look like
activs[0]
These activations are not between 0 and 1. We need them to represent probabilities, so we will need to run them through a sigmoid.
Remember, here we do not need them to sum to one.
We will use `nn.BCEWithLogitsLoss` (the module equivalent of `F.binary_cross_entropy_with_logits`), because it applies the sigmoid for us.
loss_func = nn.BCEWithLogitsLoss()
loss = loss_func(activs, y)
loss
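As a quick sanity check (an extra step not in the original notebook, reusing `activs` and `y` from above), applying the sigmoid ourselves and then plain binary cross entropy should give essentially the same value, since the "with logits" version bundles the sigmoid in.
manual = F.binary_cross_entropy(torch.sigmoid(activs), y)
combined = F.binary_cross_entropy_with_logits(activs, y)
manual, combined    # the two should match up to floating point error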
Defining a metric
- `accuracy` will only work for single-label classification problems. The reason is that it takes the input (the final layer activations), performs `argmax` on them to find the index of the largest activation, compares that to the target, and takes the mean. So it only makes sense when there is a single maximum we are looking for.
- `accuracy_multi` will be used instead, since we need something that works for multiple labels. It compares each sigmoid-ed final layer activation to a threshold (0.5 by default): if the value is greater than the threshold we assume that category is present, otherwise it is not. We then compare this list of trues and falses to the target and take the mean. A minimal sketch of this logic is shown after this list.
- We might not want to use 0.5 for our threshold. We can change it with `partial` when we create our learner, passing in the required argument `thresh=0.2`.
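Here is roughly what that logic looks like in code, a sketch that mirrors fastai's `accuracy_multi` (named `accuracy_multi_sketch` here so it doesn't shadow the real function; see the fastai source for the exact definition):
def accuracy_multi_sketch(inp, targ, thresh=0.5, sigmoid=True):
    # optionally squash the raw activations into (0,1)
    if sigmoid: inp = inp.sigmoid()
    # predict "present" where the probability clears the threshold,
    # compare with the one-hot targets, and average over every label of every item
    return ((inp > thresh) == targ.bool()).float().mean()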
fastai will know by default that we are doing a multi-label classification problem, so we don't need to specify the loss.
learn = cnn_learner(dls, resnet50,
                    metrics=partial(accuracy_multi, thresh=0.2))
learn.fine_tune(3, base_lr=3e-3, freeze_epochs=4)
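As a quick check (an extra step not in the original notebook), we can confirm which loss fastai picked for us:
learn.loss_func    # should be a (flattened) binary cross entropy with logits loss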
How do we pick a good threshold? By trial and error!
preds,targs = learn.get_preds()
xs = torch.linspace(0.05, 0.95, 29)
accs = [accuracy_multi(preds, targs, thresh=i, sigmoid=False) for i in xs]
plt.plot(xs,accs); # somewhere just above 0.5
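If we want a single number out of this sweep, we can also read off the threshold that maximises the accuracy (a small assumed follow-up, not in the original notebook):
best_thresh = xs[torch.stack(accs).argmax()]    # accs is a list of scalar tensors
best_thresh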
Image Regression
path = untar_data(URLs.BIWI_HEAD_POSE)
Inspect the data: there are 24 directories, corresponding to the 24 different people photographed.
Path.BASE_PATH = path
path.ls().sorted()
# take a look inside one directory
(path/'01').ls().sorted()
Each directory contains image files along with pose files, which give the location of the centre of the head. We can write a function that returns the coordinates of the head centre point.
`get_image_files` will recursively get all image files, and `img2pose` will convert an image filename to its associated pose file.
img_files = get_image_files(path)
def img2pose(x): return Path(f'{str(x)[:-7]}pose.txt')
img2pose(img_files[0])
# take a look at the shape of an image and a sample image
im = PILImage.create(img_files[0])
im.shape
im.to_thumb(160)
# this function is supplied by BIWI dataset website
# returns the coordinates as a tensor
cal = np.genfromtxt(path/'01'/'rgb.cal', skip_footer=6)
def get_ctr(f):
    ctr = np.genfromtxt(img2pose(f), skip_header=3)
    c1 = ctr[0] * cal[0][0]/ctr[2] + cal[0][2]
    c2 = ctr[1] * cal[1][1]/ctr[2] + cal[1][2]
    return tensor([c1,c2])
# test it
get_ctr(img_files[0])
Create a DataBlock
We will not use a random splitter because there are multiple images of each person in the data set. We want the model to generalise well to people it has not yet seen, so we will hold back one person for validation.
Our independent variable is an `ImageBlock`, and the dependent variable is a point described by two continuous values (the coordinates of the head centre); `PointBlock` specifies this for us. `get_ctr` will return our y values.
We will also halve the size of our images with `aug_transforms(size=(240,320))`.
biwi = DataBlock(
    blocks=(ImageBlock, PointBlock),
    get_items=get_image_files,
    get_y=get_ctr,
    splitter=FuncSplitter(lambda o: o.parent.name=='13'),
    batch_tfms=[*aug_transforms(size=(240,320)),
                Normalize.from_stats(*imagenet_stats)]
)
# check some data
dls = biwi.dataloaders(path)
dls.show_batch(max_n=9, figsize=(8,6))
# check the shape of one batch
xb,yb = dls.one_batch()
xb.shape,yb.shape
Understanding torch.Size([64, 3, 240, 320])
- mini batch is 64 items
- there are 3 channels R,G,B
- image size is 240x320
torch.Size([64, 1, 2])
- 64 items in the mini batch
- each item is one point (the 1), represented by 2 coordinates (the 2)
Training the model
We create a learner in the standard way with `cnn_learner`. We use `y_range=(-1,1)` to tell fastai what range of values we expect to see in the dependent variable. `y_range` is implemented with a sigmoid function rescaled to the low and high values you supply.
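For reference, here is a minimal sketch of what that mapping looks like (this mirrors fastai's `sigmoid_range` helper): squash the activation with a sigmoid, then rescale it to the supplied range.
def sigmoid_range(x, lo, hi):
    # sigmoid gives (0,1); rescale and shift it into (lo, hi)
    return torch.sigmoid(x) * (hi - lo) + lo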
learn = cnn_learner(dls, resnet18, y_range=(-1,1))
# we didn't specify the loss
# what has fastai picked?
dls.loss_func
MSE will be suitable for this problem since we are trying to predict something as close as possible to the given coordinates.
lr_min, lr_steep = learn.lr_find()
learn.recorder.plot_lr_find()
plt.axvline(x=lr_min, color='orange')
plt.axvline(x=lr_steep, color='r');
lr=lr_min
learn.fine_tune(3,lr)
We got a loss of 0.000083. Since this is a mean squared error, it corresponds to an average coordinate prediction error (in the rescaled (-1,1) coordinate space) of roughly...
math.sqrt(0.000083)
# check results against actuals
learn.show_results(ds_idx=1, nrows=3, figsize=(6,8))
Summary
For the image regression problem, we were able to use `fine_tune` rather than training from scratch, because our pre-trained model has enough information about faces that retro-fitting it to a different problem is somewhat trivial for it. This is a pretty powerful idea: these models are flexible enough that they can be trained further and utilised on other problems.