Lesson 1: Deep Learning for Coders
06-09-2020
This notebook will go over some of the practical material discussed in lesson 1 of the fastai 2020 course.
I am going to cover two examples here: classification from image data and classification from tabular data.
Example 1: Computer Vision
from fastai.vision.all import *
from pathlib import Path
Download one of the standard datasets provided by fastai, the Oxford-IIIT Pet Dataset, which is a 37-category pet dataset with roughly 200 images for each class.
path = untar_data(URLs.PETS)/'images'
path
Create an ImageDataLoaders
- Fastai needs to know where to get the image labels from. Normally these labels are part of the filenames or folder structure. In this case the filenames contain the animal breeds, for example american_bulldog_146.jpg and Siamese_56.jpg. It so happens that cat breeds start with an uppercase letter.
- For this example, we will not classify all 37 breeds. We will instead classify whether the images are of dogs or cats.
- First define a function is_cat that checks whether the first letter in the image label is uppercase. is_cat returns a boolean value that will be used as the new image label. from_name_func applies the function to our data to create the labels we need.
- valid_pct=0.2: hold 20% of the data aside for the validation set; 80% will be used for the training set.
- item_tfms=Resize(224): resize images to 224x224. Fastai provides item transforms (applied to each item, here each image) and batch transforms, which are applied to a batch of items at a time (a sketch using batch transforms follows the data-loading code below).
# check a few image names to confirm that
# dog images start with lowercase filenames
# cat images start with uppercase filenames
files = get_image_files(path)
files[0],files[6]
def is_cat(x): return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
path, get_image_files(path), valid_pct=0.2, seed=42,
label_func=is_cat, item_tfms=Resize(224))
# check our function works!
is_cat(files[0].name), is_cat(files[6].name)
# take a look at some of the data
dls.show_batch(max_n=6)
# check number of items in the training and validation datasets
len(dls.train_ds), len(dls.valid_ds)
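As mentioned above, fastai also supports batch transforms. Here is a minimal sketch of the same DataLoaders with batch-level augmentation added; aug_transforms is fastai's standard augmentation set, and dls_aug is just a name introduced here (this variant is not used in the rest of the lesson).
# sketch: same DataLoaders as above, but with batch-level augmentation added
# aug_transforms() applies fastai's default augmentations (flips, rotations, zoom, etc.) per batch
dls_aug = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224),
    batch_tfms=aug_transforms(size=224))
dls_aug.show_batch(max_n=6)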
Create a cnn_learner
- using the resnet34 architecture - this is a pretrained learner, which means that when we fit the model we will not need to train from scratch; instead, we will only fine-tune the model
- by default, freeze_epochs is set to 1 (a rough sketch of what fine_tune does is shown after the training code below)
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
learn.show_results()
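Roughly, fine_tune first trains only the newly added head with the pretrained body frozen, then unfreezes and trains the whole network. A minimal sketch of the equivalent steps (not the exact fastai implementation, which also adjusts the learning rates between phases):
# rough equivalent of learn.fine_tune(1) with the default freeze_epochs=1
learn.freeze()           # train only the new head, pretrained body stays fixed
learn.fit_one_cycle(1)   # the freeze_epochs phase
learn.unfreeze()         # now train all layers
learn.fit_one_cycle(1)   # the epochs passed to fine_tune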
Testing the model
- Let's load in a picture of a cat and a dog to check the model
img_01 = Path.cwd()/'lesson1_assets/img_1.PNG'
img_02 = Path.cwd()/'lesson1_assets/img_2.PNG'
im1 = PILImage.create(img_01)
im2 = PILImage.create(img_02)
im1.to_thumb(192)
im2.to_thumb(192)
learn.predict() returns 3 things: the label (True/False in our case), the index of the class that scored highest (1 or 0), and the probabilities of each class.
As a reminder, let's use learn.dls.vocab.o2i to check how the classes are mapped to our labels.
# show how our labels map to our vocab
learn.dls.vocab.o2i
is_cat, clas, probs = learn.predict(im1)
is_cat, clas, probs
Let's check both images...
images = [im1, im2]
for i in images:
    is_cat, _, probs = learn.predict(i)
    print(f"Is this a cat?: {is_cat}.")
    print(f"Probability it's a cat: {probs[1].item():.5f}")
Example 2: Tabular
For this example we will use the Adult dataset. Our goal is to predict if a person is earning above or below $50k per year using information such as age, working class, education and occupation. There are about 32K rows in the dataset.
from fastai.tabular.all import *
path = untar_data(URLs.ADULT_SAMPLE)
path
df = pd.read_csv(path/'adult.csv')
df.head()
len(df)
Create a TabularDataLoaders
Again we create a data loader using the path. We need to specify some information such as the y variable (the value we want to predict), and we also need to specify which columns contain categorical values and which contain continuous variables. Do this using cat_names and cont_names.
Some data processing needs to occur - we need to specify how to handle missing data. Info below from the docs (a rough pandas sketch of these procs follows the code below):
- FillMissing by default sets fill_strategy=median
- Normalize will normalize the continuous variables (subtract the mean and divide by the std)
- Categorify transforms the categorical variables to something similar to pd.Categorical
This is another classification problem. Our goal is to predict whether a person's salary was below $50k (0) or above (1).
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [Categorify, FillMissing, Normalize])
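Roughly, these procs correspond to the pandas operations below. This is only a sketch to build intuition, not fastai's actual implementation (fastai also stores the statistics so the same transforms can be applied to new data later); df_sketch, cont and cat are names introduced here.
# rough pandas equivalent of the three procs (illustration only)
cont = ['age', 'fnlwgt', 'education-num']
cat = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
df_sketch = df.copy()
# FillMissing: fill missing continuous values with each column's median
df_sketch[cont] = df_sketch[cont].fillna(df_sketch[cont].median())
# Normalize: subtract the mean and divide by the std
df_sketch[cont] = (df_sketch[cont] - df_sketch[cont].mean()) / df_sketch[cont].std()
# Categorify: map each categorical column to integer codes
for c in cat:
    df_sketch[c] = pd.Categorical(df_sketch[c]).codes
df_sketch.head()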
I'm going to keep some of the data at the end of the set aside for testing. df[:32500] will select the first 32,500 rows (rows 0 through 32499); the remaining rows will not be seen by the model.
splits = RandomSplitter(valid_pct=0.2)(range_of(df[:32500]))
to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    y_names='salary',
    splits=splits)
dls = to.dataloaders(bs=64)
dls.show_batch()
We can see that our y values have been turned into the categories 0 and 1.
dls.y.value_counts()
learn = tabular_learner(dls, metrics=accuracy)
learn.fit_one_cycle(3)
learn.show_results()
Check the model by making predictions on the data that was held aside, which the model has not yet seen.
# pick a few rows from the held-out portion of the df
sample_df = df.iloc[[32513,32542,32553]]
sample_df
Let's loop through these rows and make predictions, printing out the predicted class, the probabilities and the actual class.
for i, r in sample_df.iterrows():
    row, clas, probs = learn.predict(r)
    print(f'the predicted class is {clas}')
    print(f'with a probability of {probs}')
    print(f'the actual class was {r.salary}')
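As an alternative to predicting row by row, fastai can build a test DataLoader from the held-out rows and score them in one pass. A minimal sketch (test_df and test_dl are names introduced here, not part of the lesson):
# sketch: batch predictions on all of the held-out rows at once
test_df = df[32500:]                    # rows the model never saw during training
test_dl = learn.dls.test_dl(test_df)    # applies the same procs used for training
preds, _ = learn.get_preds(dl=test_dl)  # class probabilities for each row
preds[:3]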