Skip to content

Lesson 1: Deep Learning for Coders

06-09-2020

This notebook will go over some of the practical material discussed in lesson 1 of the fastai 2020 course.

I am going to cover 2 examples here - classification from image data and tabular data

Example 1: Computer Vision

from fastai.vision.all import *
from pathlib import Path

download one of the standard datasets provided by fasta, the Oxford-IIIT Pet Dataset which is a 37 category pet dataset with roughly 200 images for each class.

path = untar_data(URLs.PETS)/'images'
path
Path('/storage/data/oxford-iiit-pet/images')

Create an ImageDataLoader

  • Fastai needs to know where to get the image labels from. Normally these labels are part of the filenames or folder structure. In this case the filenames contain the animal breeds.

  • american_bulldog_146.jpg and Siamese_56.jpg for example

    • it so happens that cat breeds start with an uppercase letter.
  • For this example, we will not classify all 37 breeds. We will instead classify whether the images are of dogs or cats.

First define a function is_cat that checks whether the first letter in the image label is uppercase. is_cat returns a boolean value that will be used as the new image label. - from_name_func applies the function to our data to create the labels we need.

  • valid_pct=0.2: hold 20% of the data aside for the validation set, 80% will be used for the training set

  • item_tfms=Resize(224): resize images to 224x224

    • fastai provides item transforms (applied to each image in this case) and batch transform which are applied to a batch of items at a time.
# check a few image names to confirm that 
# dog images start with lowercase filenames
# cat images start with uppercase filenames

files = get_image_files(path)
files[0],files[6]
(Path('/storage/data/oxford-iiit-pet/images/american_bulldog_146.jpg'),
 Path('/storage/data/oxford-iiit-pet/images/Siamese_56.jpg'))
def is_cat(x): return x[0].isupper()

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))
# check our function works!
is_cat(files[0].name), is_cat(files[6].name)
(False, True)
# take a look at some of the data
dls.show_batch(max_n=6)
# check number of items in training and test datasets
len(dls.train_ds), len(dls.valid_ds)
(5912, 1478)

Create a cnn_learner

  • using the resnet34 architecture
  • this is a pretrained learner, which means when we fit the model, we will not need to train from scratch, rather, we will only fine tune the model
  • by default, freeze_epochs is set to 1
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
epoch train_loss valid_loss error_rate time
0 0.158638 0.023677 0.008119 00:45
epoch train_loss valid_loss error_rate time
0 0.061309 0.013070 0.004736 01:01
learn.show_results()

Testing the model

  • Lets load in a picture of a cat and a dog to check the model
img_01 = Path.cwd()/'lesson1_assets/img_1.PNG'
img_02 = Path.cwd()/'lesson1_assets/img_2.PNG'
im1 = PILImage.create(img_01)
im2 = PILImage.create(img_02)

im1.to_thumb(192)
im2.to_thumb(192)

learn.predict() returns 3 things, the label (True/False in our case), the class that scored highest (1 or 0) and then the probabilities of each class.

As a reminder, let's use learn.dls.vocab.o2i to check how the classes are mapped to our labels

# show how our labels map to our vocab
learn.dls.vocab.o2i
{False: 0, True: 1}
is_cat, clas, probs = learn.predict(im1)

is_cat, clas, probs
('True', tensor(1), tensor([2.7169e-10, 1.0000e+00]))

Let's check both images...

images = [im1, im2]

for i in images:
    is_cat,_,probs = learn.predict(i)

    print(f"Is this a cat?: {is_cat}.")
    print(f"Probability it's a cat: {probs[1].item():.5f}")
Is this a cat?: True.
Probability it's a cat: 1.00000

Is this a cat?: False.
Probability it's a cat: 0.00000

Example 2: Tabular

For this example we will use the Adults data set. Our goal is to predict if a person is earning above or below $50k per year using information such as age, working class, education and occupation. There are about 32K rows in the dataset.

from fastai.tabular.all import *
path = untar_data(URLs.ADULT_SAMPLE)
path
Path('/storage/data/adult_sample')
df = pd.read_csv(path/'adult.csv')

df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse NaN Wife White Female 0 1902 40 United-States >=50k
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States >=50k
2 38 Private 96185 HS-grad NaN Divorced NaN Unmarried Black Female 0 0 32 United-States <50k
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States >=50k
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States <50k
len(df)
32561

Create an TabularDataLoader

Again we create data loader using the path. We need to specify some information such as the y variable (the value we want to predict), and we also need to specify which columns contain categorical values and which contain continuous variables. Do this using cat_names and cont_names.

Some data processing needs to occur.. - we need to specify how to handle missing data. Info below from the docs - FillMissing by default sets fill_strategy=median - Normalize will normalize the continuous variables (substract the mean and divide by the std) - Categorify transform the categorical variables to something similar to pd.Categorical

This is another classification problem. Our goal is to predict whether a persons salary was below 50k (0) or above (1).

dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [Categorify, FillMissing, Normalize])

I'm going to keep some of the data at the end of the set aside for testing. df[:32500] will select from row 0 to 32500, the remaining rows will not be seen by the model

splits = RandomSplitter(valid_pct=0.2)(range_of(df[:32500]))

to = TabularPandas(df, procs=[Categorify, FillMissing,Normalize],
                   cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
                   cont_names = ['age', 'fnlwgt', 'education-num'],
                   y_names='salary',
                   splits=splits)
dls = to.dataloaders(bs=64)
dls.show_batch()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary
0 Private HS-grad Married-spouse-absent Other-service Unmarried White False 32.000000 128016.002920 9.0 <50k
1 Private 7th-8th Married-civ-spouse Exec-managerial Wife White False 52.000000 194259.000001 4.0 <50k
2 Private Some-college Widowed Exec-managerial Unmarried White False 31.000000 73796.004491 10.0 <50k
3 Private Some-college Separated Other-service Not-in-family White False 64.000001 114993.998143 10.0 <50k
4 Self-emp-not-inc Assoc-voc Married-civ-spouse Prof-specialty Husband White False 68.000000 116902.996854 11.0 <50k
5 Private Bachelors Married-civ-spouse Prof-specialty Husband White False 42.000000 190178.999991 13.0 >=50k
6 Self-emp-not-inc Prof-school Married-civ-spouse Prof-specialty Husband White False 66.000000 291362.001320 15.0 <50k
7 Self-emp-not-inc Bachelors Married-civ-spouse Sales Husband White False 63.000001 298249.000475 13.0 >=50k
8 Private Masters Divorced Tech-support Not-in-family White False 47.000000 606752.001736 14.0 <50k
9 State-gov Bachelors Married-civ-spouse Exec-managerial Husband White False 42.000000 345969.005416 13.0 >=50k

We can see that our y values have been turned into the categories 0 and 1.

dls.y.value_counts()
0    19756
1     6244
Name: salary, dtype: int64
learn = tabular_learner(dls, metrics=accuracy)
learn.fit_one_cycle(3)
epoch train_loss valid_loss accuracy time
0 0.366288 0.354235 0.834769 00:06
1 0.367247 0.348617 0.839538 00:05
2 0.358275 0.345206 0.839077 00:06
learn.show_results()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary salary_pred
0 5.0 11.0 3.0 11.0 1.0 5.0 1.0 1.494630 1.838917 2.322299 0.0 1.0
1 5.0 12.0 3.0 8.0 1.0 5.0 1.0 -0.558852 -0.690051 -0.421488 0.0 0.0
2 3.0 10.0 3.0 11.0 6.0 3.0 1.0 0.174535 0.000144 1.146390 1.0 1.0
3 5.0 10.0 3.0 5.0 1.0 5.0 1.0 0.467889 -1.014015 1.146390 1.0 1.0
4 5.0 16.0 5.0 9.0 4.0 5.0 1.0 -1.365576 4.387854 -0.029518 0.0 0.0
5 5.0 10.0 1.0 5.0 2.0 5.0 1.0 0.174535 0.616141 1.146390 0.0 0.0
6 5.0 10.0 3.0 2.0 6.0 5.0 1.0 1.494630 0.898075 1.146390 0.0 1.0
7 5.0 12.0 3.0 5.0 6.0 5.0 1.0 0.101196 -0.713219 -0.421488 1.0 1.0
8 7.0 2.0 3.0 4.0 1.0 5.0 1.0 -0.338836 0.932638 -1.205427 0.0 0.0

Check the model by making predictions on the dataset

using the data that was held aside which the model has not yet seen.

# pick some random rows of the df
sample_df = df.iloc[[32513,32542,32553]]

sample_df
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
32513 23 Private 209955 HS-grad 9.0 Never-married Craft-repair Not-in-family White Male 0 0 40 United-States <50k
32542 34 Private 98283 Prof-school 15.0 Never-married Tech-support Not-in-family Asian-Pac-Islander Male 0 1564 40 India >=50k
32553 35 Self-emp-inc 135436 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband White Male 0 0 50 United-States >=50k

Lets loop through these rows and make predictions, printing out the predicted class, the probabilities and the actual class.

for i, r in sample_df.iterrows():
    row, clas, probs = learn.predict(r)
    print(f'the predicted class is {clas}')
    print(f'with a probability of {probs}')
    print(f'the actual class was {r.salary}')
the predicted class is 0
with a probability of tensor([0.9911, 0.0089])
the actual class was <50k

the predicted class is 0
with a probability of tensor([0.6258, 0.3742])
the actual class was >=50k

the predicted class is 1
with a probability of tensor([0.0919, 0.9081])
the actual class was >=50k