Lesson 2: Deep Learning for Coders
12-09-2020
Lesson 2 goes a little deeper into computer vision and the fastai library, taking a closer look at DataBlocks and DataLoaders.
The course notebook uses Bing Images to download image data, the idea being that we curate our own data set for this exercise. Fastai provides some methods and instructions for doing this; you can see the details in the notebook.
I have taken a different route to gathering data. My goal for this notebook is to build a model that is able to classify musical pitches.
Audio Data
- generate audio samples using MIDI.
- I will not be worrying about sharps/flats, simply to reduce complexity.
- use librosa to process audio signals and generate chromagrams using the Constant-Q Transform.
- The Constant-Q Transform does a good job of isolating pitch but is not sensitive to octaves, thus all audio samples are in the same octave. - stackexchange
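For reference, here is a minimal sketch of how each wav file could be rendered into a labelled chromagram image; the raw_audio folder and the save_chromagram helper are hypothetical, and the MIDI rendering step itself is not shown.
import librosa
import librosa.display
import matplotlib.pyplot as plt
from pathlib import Path

def save_chromagram(wav_path, out_dir):
    # load the audio, compute the Constant-Q chromagram and save it
    # as an image named after the pitch encoded in the filename
    y, sr = librosa.load(wav_path, mono=True)
    C = librosa.feature.chroma_cqt(y=y, sr=sr)
    fig, ax = plt.subplots(figsize=(5, 5))
    librosa.display.specshow(C, ax=ax)  # no note names on the axis, so no label leakage
    ax.set_axis_off()
    out_path = Path(out_dir)/f'{Path(wav_path).stem}.png'
    fig.savefig(out_path, bbox_inches='tight', pad_inches=0)
    plt.close(fig)
    return out_path

# hypothetical usage: one wav file per note, e.g. 'A3_1.wav', 'B3_1.wav', ...
# for wav in Path('raw_audio').glob('*.wav'):
#     save_chromagram(wav, 'lesson2_assets/cqt_data')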
# !conda install -c conda-forge librosa -y
from fastai.vision.all import *
from fastai.vision.data import *
import matplotlib.pyplot as plt
# hi-res plots
%config InlineBackend.figure_format = 'retina'
from pathlib import Path
# sound library & widget to play audio
import librosa
import librosa.display
import IPython.display as ipd
Load in Data
path = Path.cwd()
data_path = path/'lesson2_assets/cqt_data'
audio_file = path/'lesson2_assets/A3_1.wav'
# take a look at the filenames
data_path.ls()[1]
Sidebar: generating a chromagram
Below is a sample note (A3 on the piano) followed by a demonstration of how to generate a Constant-Q chromagram. The y-axis displays the note names for convenience; these axis labels were removed when creating the training data set.
y, sr = librosa.load(audio_file, mono=True)
ipd.Audio(y, rate=sr)
doc(librosa.feature.chroma_cqt)
plt.figure(figsize=(5,5))
C = librosa.feature.chroma_cqt(y=y, sr=sr)
librosa.display.specshow(C, y_axis='chroma');
For comparison, here is the same note visualised using a spectrogram...
- "A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time." wiki
- There is a lot more information within this plot (such as the fundamental frequency and the harmonics above it); however, using these images would make our classification task much harder.
plt.figure(figsize=(5,5))
D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
librosa.display.specshow(D, y_axis='log');
Image Data
The filenames contain the classes that we are trying to predict. We need to define a function that will grab the first 2 characters of each file name to use as labels. This will look familiar from lesson 1, except we have more classes to predict.
fnames = get_image_files(data_path)
# label = first two characters of the filename, e.g. 'A3'
def label_func(fname): return fname.name[:2]
label_func(fnames[37]) # verify the function works
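As a quick sanity check (my own addition, not from the lesson), the set of labels produced by label_func across every file should cover just the natural notes of a single octave:
# collect the unique labels produced by label_func
labels = sorted({label_func(f) for f in fnames})
print(labels)  # expecting the seven natural notes of one octave, e.g. ['A3', 'B3', ..., 'G3']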
From Data to DataLoader
What is a DataBlock? - "The data block API takes its name from the way it's designed: every bit needed to build the DataLoaders object (type of inputs, targets, how to label, split...) is encapsulated in a block, and you can mix and match those blocks" - docs
Breaking down the Block - the tutorial in the docs does a good job of stepping through building a block from scratch..
Steps
- Start with an empty DataBlock: dblock = DataBlock(). (A rough sketch of inspecting these first steps follows this list.)
- Tell the block how you want to assemble your items using a get_items function.
  - we will use get_image_files as we did in lesson 1.
- Let the block know how/where to get our labels from in get_y.
  - the lesson notebook uses parent_label, which takes the label from the parent folder. We need to use the label_func we created for this task.
- Specify the types of our data (images and labels): blocks=(ImageBlock, CategoryBlock).
- Decide how we want to split our data into training and validation datasets.
  - we will randomly split (80% training, 20% validation).
- Specify any item transforms or batch transforms.
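Here is a rough sketch of inspecting those first steps incrementally, in the spirit of the docs tutorial (the outputs in the comments are indicative only):
# step 1: an empty DataBlock just pairs each item with itself
dblock = DataBlock()
dblock.datasets(fnames).train[0]   # (Path('...A3_1.png'), Path('...A3_1.png'))

# step 2: add get_items and get_y so each image file is paired with its label
dblock = DataBlock(get_items=get_image_files, get_y=label_func)
dblock.datasets(data_path).train[0]   # (Path('...A3_1.png'), 'A3')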
audio = DataBlock(
blocks=(ImageBlock, CategoryBlock),
get_items=get_image_files,
splitter=RandomSplitter(valid_pct=0.2, seed=42),
get_y=label_func,
item_tfms=Resize(128)
)
dls = audio.dataloaders(data_path, bs=32)
dls.show_batch()
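If the pipeline ever misbehaves, fastai can walk through how a single sample is assembled; these calls are optional checks rather than part of the lesson:
# step-by-step report of how one item moves through the DataBlock pipeline
audio.summary(data_path)
# the class labels the DataLoaders inferred from our label_func
dls.vocab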
Sidebar: Data Augmentation and Transforms
fastai provides a number of transforms that can be applied to data. In the case of computer vision, augmentation is useful for introducing variety into the dataset. Consider facial recognition: in production you may not always be dealing with decent portraits; camera angle, lighting, and perspective may vary. Augmentation introduces some of this variation into our training and validation sets.
In the context of the data I am working with, not all transformations may be useful. I would not expect these images to suffer from perspective warping or rotation; however, mirroring the image on the vertical could be useful, as could increasing and decreasing brightness and contrast.
Here is a quick example of how to apply some transforms to a batch of images at a time using aug_transforms.
I am not going to apply any of these for training in this notebook.
# no transformation applied
dls.valid.show_batch(max_n=4, nrows=1)
Here is the full signature of aug_transforms, showing the available parameters..
aug_transforms(
mult=1.0,
do_flip=True,
flip_vert=False,
max_rotate=10.0,
min_zoom=1.0,
max_zoom=1.1,
max_lighting=0.2,
max_warp=0.2,
p_affine=0.75,
p_lighting=0.75,
xtra_tfms=None,
size=None,
mode='bilinear',
pad_mode='reflection',
align_corners=True,
batch=False,
min_scale=1.0,
)
aug_tfms = aug_transforms(max_lighting=0.8, do_flip=True, flip_vert=True, max_rotate=0)
audio = audio.new(item_tfms=Resize(128), batch_tfms=aug_tfms)
dls = audio.dataloaders(data_path)
dls.train.show_batch(max_n=4, nrows=1)
Create a cnn_learner and Train the Model
learn = cnn_learner(dls, resnet34, metrics=error_rate)
Before training, let's use learn.lr_find to help find a good learning rate. Running lr_find returns two suggested values:
- lr_min: one tenth of the learning rate at the minimum loss, before the loss diverges
- lr_steep: the learning rate at which the slope of the loss curve is steepest
lr_min, lr_steep = learn.lr_find()
# plot the values returned by lr_find
learn.recorder.plot_lr_find()
plt.axvline(x=lr_min, color='red')
plt.axvline(x=3e-3, color='green')
plt.axvline(x=lr_steep, color='red');
I'm going to pick a value in between the suggested learning rates for training.
lr_max = 3e-3
learn.fit_one_cycle(n_epoch=5, lr_max=lr_max)
learn.recorder.plot_loss()
Interpretation
- the model is performing quite well, only one note was incorrectly predicted (G3 predicted for F3 actual)
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_top_losses(k=4, figsize=(6,6))
interp.plot_confusion_matrix()
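most_confused summarises the same information as a list of (actual, predicted, count) tuples, which should line up with the single F3/G3 mix-up noted above:
# pairs of (actual, predicted, occurrences), most frequent first
interp.most_confused(min_val=1)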
learn.save('base_cqt_model')
Fine tune to try to improve accuracy..
learn.fine_tune(1)
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
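As a quick sketch of using the trained model (not part of the lesson notebook), predict returns the class, its index and the full probability vector for a single image; here I'm just reusing one of the training files:
# predict a single chromagram image with the fine-tuned model
pred_class, pred_idx, probs = learn.predict(fnames[0])
pred_class, probs[pred_idx]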
Summary
This wasn't the hardest problem in the world; the Constant-Q transform really simplifies pitch detection. I think this is an interesting problem space because there are opportunities to build on these examples: I'd like to try classifying all 12 notes (by adding sharps/flats), then try multi-label classification using a phrase of notes, and hopefully then address the issue of identifying notes across octaves.