Lesson 6: Collaborative Filtering

This notebook will cover collaborative filtering using the MovieLens data set.

from fastai.collab import *
from fastai.tabular.all import *
path = untar_data(URLs.ML_100k)

Data

"MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota.

This data set consists of:

  • 100,000 ratings (1-5) from 943 users on 1682 movies.
  • Each user has rated at least 20 movies.
  • Simple demographic info for the users (age, gender, occupation, zip)"

Additional Info

  • u.data contains the full data set, 100,000 ratings by 943 users on 1,682 items.
  • Each user has rated at least 20 movies.
  • Users and items are numbered consecutively from 1.
  • The data is randomly ordered.
  • This is a tab separated list of
    • user id | item id | rating | timestamp
    • The time stamps are unix seconds since 1/1/1970 UTC

source

ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user','movie','rating','timestamp'])
ratings.head()
user movie rating timestamp
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596
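
As a quick aside, the timestamp column holds unix seconds since 1/1/1970; pandas can convert these to readable dates if you ever need them (pd.to_datetime with unit='s' is standard pandas, nothing fastai-specific):

# convert the unix-second timestamps to readable dates
pd.to_datetime(ratings.timestamp.head(), unit='s')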

The goal is to predict or guess which films users might like to watch. You could easily imagine that a user has a preference for certain genres, and that based on the films they have already seen from a particular genre you might be able to say something like: user 123 likes action movies, so it would be safe to suggest an action movie to them.

Given that we have minimal information in our data set (user id, movie id, rating and timestamp), collaborative filtering seeks to solve this problem by extracting latent features from the data.

For example, assume that these features range between -1 and +1, with positive numbers indicating stronger matches to certain factors.

We can use a simple example to illustrate the point. Take the following three dummy factors: science-fiction, action, and old movies. We can compare a user's preferences against these for two different movies and see how each scores.

import numpy as np

last_skywalker = np.array([0.98,0.9,-0.9])
casablanca = np.array([-0.99,-0.3,0.8])

user1 = np.array([0.9,0.8,-0.6])

We can compute the dot product and arrive at a match score for each movie:

m1 = (user1*last_skywalker).sum()
m2 = (user1*casablanca).sum()

print(f'last skywalker match: {m1.round(2)} \n casablanca match: {m2.round(2)}')
last skywalker match: 2.14 
 casablanca match: -1.61

Voila! Based on this, we might want to recommend Last Skywalker, but not Casablanca, to this user.

So how do we find these latent factors? They can be learned.

Step 1: Randomly Initialize Parameters

Randomly assign parameters to represent our latent factors for each user and each movie. We get to decide how many of these factors we want to use.

Step 2: Calculate Predictions

Calculate predictions. This is done as we have just seen, by computing the dot product.

Step 3: Improve Predictions

Then improve the predictions using gradient descent on these latent factors.
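
Here is a minimal sketch of one such update for a single (user, movie, rating) example. The factor size, learning rate and rating are purely illustrative; the real training loop operates on batches via a Learner.

import torch

# step 1: randomly initialize latent factors for one user and one movie
user_factors = torch.randn(3, requires_grad=True)
movie_factors = torch.randn(3, requires_grad=True)
true_rating = torch.tensor(4.0)
lr = 0.1

# step 2: predict the rating with a dot product
pred = (user_factors * movie_factors).sum()

# step 3: compare to the true rating and nudge the factors with gradient descent
loss = (pred - true_rating) ** 2
loss.backward()
with torch.no_grad():
    user_factors -= lr * user_factors.grad
    movie_factors -= lr * movie_factors.grad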

First, let's add the movie titles to our data set for readability

movies = pd.read_csv(path/'u.item',  delimiter='|', encoding='latin-1',
                     usecols=(0,1), names=('movie','title'), header=None)
movies.head()
movie title
0 1 Toy Story (1995)
1 2 GoldenEye (1995)
2 3 Four Rooms (1995)
3 4 Get Shorty (1995)
4 5 Copycat (1995)
# join on movie titles
ratings = ratings.merge(movies)
ratings.head()
user movie rating timestamp title
0 196 242 3 881250949 Kolya (1996)
1 63 242 3 875747190 Kolya (1996)
2 226 242 5 883888671 Kolya (1996)
3 154 242 3 879138235 Kolya (1996)
4 306 242 5 876503793 Kolya (1996)

Create a DataLoader

dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()
user title rating
0 494 Shawshank Redemption, The (1994) 5
1 806 Wrong Trousers, The (1993) 5
2 91 Glory (1989) 5
3 497 Lawnmower Man 2: Beyond Cyberspace (1996) 2
4 630 Rainmaker, The (1997) 3
5 89 That Thing You Do! (1996) 2
6 442 Brothers McMullen, The (1995) 3
7 37 Braveheart (1995) 5
8 159 Kansas City (1996) 1
9 585 Cinema Paradiso (1988) 5
dls.classes
{'user': (#944) ['#na#',1,2,3,4,5,6,7,8,9...],
 'title': (#1665) ['#na#',"'Til There Was You (1997)",'1-900 (1994)','101 Dalmatians (1996)','12 Angry Men (1957)','187 (1997)','2 Days in the Valley (1996)','20,000 Leagues Under the Sea (1954)','2001: A Space Odyssey (1968)','3 Ninjas: High Noon At Mega Mountain (1998)'...]}

Randomly initialize parameters

  • create user_factors and movie_factors of size n_users x n_factors and n_movies x n_factors
n_users  = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_factors = 5

user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)

user_factors.size()
torch.Size([944, 5])

One Hot Encoding

We need to prepare our data in a specific way before we can pass it to our model. Some algorithms can work directly with category labels, but many cannot.

One hot encoding is a way of representing categorical data by transforming categorical labels into vectors of 0s and 1s.

To get a sense of how one-hot-encoding is operating, here is a simple example...

df = pd.DataFrame(['adam', 'beatrix', 'cam'])

df
0
0 adam
1 beatrix
2 cam
# the categorical variables are represented by 1s and 0s
pd.get_dummies(df)
0_adam 0_beatrix 0_cam
0 1 0 0
1 0 1 0
2 0 0 1

In order to calculate a result for a particular user and movie combination, we will need to look up the index of the movie and the index of the user in the respective latent factor matrices, then perform a dot product.

But an index lookup is not something that our model is capable of doing. It is capable of performing dot products though...

Here is another example of how we can use dot products to return elements from a matrix...

Say the rows in matrix w represent latent movie factors and matrix v is our one-hot-encoded movie ids.

w = torch.randn((3,3))
w
tensor([[-1.7224, -0.4789,  0.3553],
        [-1.3465, -0.3057,  0.6882],
        [ 0.4594,  0.7893,  0.0150]])
v = torch.tensor([[1,0,0],[0,1,0],[0,0,1]])
v
tensor([[1, 0, 0],
        [0, 1, 0],
        [0, 0, 1]])

We can retrieve the factors for movie 2 (row 2 of w) by performing a matrix product with the corresponding row of v:

v[1].float() @ w
tensor([-1.3465, -0.3057,  0.6882])

This is the same as simply indexing into w:

w[1]
tensor([-1.3465, -0.3057,  0.6882])

One-hot encoding is basically performing an index lookup on our data. This is, however, memory intensive, since we now have huge matrices to deal with in which most of the values are 0.

Enter embeddings. Embeddings are a computational shortcut for doing the matrix multiplication by one-hot-encoded vectors.
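
As a quick check of that equivalence, an embedding lookup, a one-hot matrix product and plain indexing all return the same row. This sketch uses PyTorch's nn.Embedding directly (fastai's Embedding is a thin wrapper around it):

import torch
from torch import nn

emb = nn.Embedding(3, 4)     # 3 "movies", 4 latent factors
idx = torch.tensor([1])

one_hot = torch.zeros(1, 3)
one_hot[0, 1] = 1.           # one-hot vector for movie index 1

emb(idx)                     # embedding lookup
one_hot @ emb.weight         # one-hot matrix multiplication
emb.weight[1]                # plain indexing -- all three give the same row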

Collaborative filtering from Scratch

forward is a very important method name in PyTorch: it is the method that handles the computation when the model is called.

DotProduct

The forward method here will get passed the users and movies as the two columns of x. We grab the factors from each embedding by calling user_factors and movie_factors just like functions, assign these to users and movies, then perform the dot product using (users * movies).sum(dim=1). dim=1 is used because we want to sum over the second axis (the factors), leaving one prediction per row of the batch.

class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return (users * movies).sum(dim=1)
x,y = dls.one_batch()

# check the shape of x
x.shape
torch.Size([64, 2])
# check the first 5 things in x
x[:5]

# user ids and movie ids
tensor([[ 396, 1021],
        [ 118,  887],
        [ 206,  380],
        [ 207,  938],
        [ 923,  861]], device='cuda:0')

What does torch.Size([64, 2]) tell us?

  • the batch size is 64
  • then we have 2 items: the user ids and the movie ids
  • check with x[:,0] and x[:,1], as shown below
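
For example, we can pull the two columns apart:

# user ids are in the first column, movie ids in the second
x[:,0][:5], x[:,1][:5]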

# check the first 5 things of y
y[:5]

# these are the ratings
tensor([[4],
        [5],
        [1],
        [3],
        [4]], device='cuda:0', dtype=torch.int8)

Create a learner

  • our model will be the DotProduct class with 50 latent factors
  • loss function will be MSE
    • we are predicting a continuous value (the rating), so this is a regression problem
model = DotProduct(n_users, n_movies, n_factors=50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
epoch train_loss valid_loss time
0 1.362978 1.286954 00:08
1 1.051144 1.094442 00:08
2 0.963542 0.975438 00:08
3 0.841074 0.893598 00:08
4 0.788252 0.873474 00:08

Not bad, but we can make some improvements. We can use a sigmoid to constrain our predictions to the range 0 to 5, matching what we see in our original data set. We will actually use 0 to 5.5, because the sigmoid only approaches its upper bound asymptotically, so a range of 0 to 5 would prevent us ever predicting a 5.

# check rating scores/categories
ratings.rating.value_counts()
4    34174
3    27145
5    21201
2    11370
1     6110
Name: rating, dtype: int64
help(sigmoid_range)
Help on function sigmoid_range in module fastai.layers:

sigmoid_range(x, low, high)
    Sigmoid function with range `(low, high)`
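
Under the hood this is just a rescaled sigmoid. Here is a sketch of the idea (my own re-implementation for illustration, not fastai's source):

import torch

def sigmoid_range_sketch(x, low, high):
    # squash x with a sigmoid, then rescale it into (low, high)
    return torch.sigmoid(x) * (high - low) + low

sigmoid_range_sketch(torch.tensor([-10., 0., 10.]), 0, 5.5)
# very negative scores approach 0, very positive scores approach 5.5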


class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return sigmoid_range((users * movies).sum(dim=1), *self.y_range)
model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
epoch train_loss valid_loss time
0 1.024472 0.989683 00:08
1 0.882552 0.901929 00:08
2 0.684762 0.860367 00:08
3 0.479302 0.864543 00:08
4 0.347512 0.870277 00:08

Not much better....

Adding in a Bias term

We can make further improvements by adding a bias term to our model.

Why would we do this? Some movies may have a high rating because they are genuinely better movies, and some users may skew towards being more positive, so their ratings will generally be higher. The bias terms give the model a way to represent this missing piece of information.

# Add in a bias term for each user and each movie. 

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        return sigmoid_range(res, *self.y_range)
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
epoch train_loss valid_loss time
0 0.969996 0.988517 00:08
1 0.900296 0.907242 00:08
2 0.691395 0.859838 00:08
3 0.491507 0.861104 00:08
4 0.365541 0.864733 00:08

Our final result is slightly better but we are actually overfitting! How can we stop this and train for longer?

Regularisation

Regularisation is a set of techniques that help to reduce the capacity of the model and prevent overfitting. Rather than reducing the number of parameters, we can try to force the parameters to be smaller, unless they really need to be big.

Weight Decay

Also known as L2 regularisation, it consists of adding the sum of all the parameters squared (multiplied by some hyperparameter) to the loss function.

Why does this work, and why would it prevent overfitting?

  • one way to decrease the loss is now to decrease the weights
  • limiting the weights is going to hinder training (the model won't fit the training set as well) but will help it to generalise better

In practice, what we are doing is equivalent to adding the weights (multiplied by some hyperparameter) onto the gradients, as sketched below.
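
Here is a minimal, self-contained sketch of that penalty term, using toy tensors rather than our model's parameters:

import torch

# toy parameters and a stand-in loss, just to illustrate the weight-decay term
parameters = torch.randn(10, requires_grad=True)
wd = 0.1

loss = (parameters.sum() - 1.) ** 2                  # pretend this is the model's loss
loss_with_wd = loss + wd * (parameters ** 2).sum()   # add the L2 penalty

loss_with_wd.backward()
# the penalty contributes 2 * wd * parameter to each gradient, which is why
# weight decay can equivalently be applied directly to the gradients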

model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch train_loss valid_loss time
0 0.953578 0.935531 00:08
1 0.846484 0.874122 00:07
2 0.718206 0.829690 00:07
3 0.593549 0.817133 00:07
4 0.474108 0.818135 00:08

Much better!

Creating our own Embedding module

Let's recreate the DotProductBias without using the Embedding class.

To recap: an embedding layer is a computational shortcut for performing a matrix multiplication by a one-hot-encoded matrix, which is the same as indexing into an array.

# create a tensor as a parameter, with random initialization

def create_params(size):
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))
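
A quick check that this returns a trainable parameter of the requested shape (nn.Parameter sets requires_grad=True by default):

p = create_params([3, 2])
p.shape, p.requires_grad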

Below is DotProductBias refactored to use create_params in place of Embedding.

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors[x[:,0]]
        movies = self.movie_factors[x[:,1]]
        res = (users*movies).sum(dim=1)
        res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
        return sigmoid_range(res, *self.y_range)
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch train_loss valid_loss time
0 0.958299 0.949737 00:08
1 0.890685 0.870207 00:08
2 0.742867 0.825684 00:08
3 0.575261 0.817917 00:08
4 0.467277 0.818017 00:09

Interpreting Embeddings and Biases

Let's take a look at some of the films with the smallest bias. These are movies that users liked a lot less than others, even after accounting for their latent factors.

We can then do the opposite to see the most liked movies (sorting by bias)

The goal is to see what the model has learnt and to gain some information about how it is operating.

Then using PCA we can reduce the number of latent factors and plot these to view the "space". Again, this is a way we can interpret what the model has learnt.

movie_bias = learn.model.movie_bias.squeeze()
idxs = movie_bias.argsort()[:5]
[dls.classes['title'][i] for i in idxs]
['Children of the Corn: The Gathering (1996)',
 'Lawnmower Man 2: Beyond Cyberspace (1996)',
 'Robocop 3 (1993)',
 'Leave It to Beaver (1997)',
 'Vampire in Brooklyn (1995)']
# most liked films
idxs = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in idxs]
['Titanic (1997)',
 "Schindler's List (1993)",
 'As Good As It Gets (1997)',
 'L.A. Confidential (1997)',
 'Apt Pupil (1998)']
g = ratings.groupby('title')['rating'].count()
top_movies = g.sort_values(ascending=False).index.values[:1000]
top_idxs = tensor([learn.dls.classes['title'].o2i[m] for m in top_movies])
movie_w = learn.model.movie_factors[top_idxs].cpu().detach()
movie_pca = movie_w.pca(3)
fac0,fac1,fac2 = movie_pca.t()
idxs = list(range(50))
X = fac0[idxs]
Y = fac2[idxs]
plt.figure(figsize=(12,12))
plt.scatter(X, Y)
for i, x, y in zip(top_movies[idxs], X, Y):
    plt.text(x,y,i, color=np.random.rand(3)*0.7, fontsize=11)
plt.show()

The most interesting cluster I can see is on the mid-right hand side: Conspiracy Theory, Mission Impossible, Air Force One, etc. Aside from Liar Liar, these seem like the kinds of movies that someone who likes action films would likely enjoy.

The fastai way

learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch train_loss valid_loss time
0 0.941142 0.945376 00:08
1 0.861970 0.873035 00:08
2 0.724418 0.828043 00:08
3 0.617956 0.815366 00:08
4 0.488715 0.815252 00:08
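
If you want to repeat the bias interpretation with this learner, the fastai model stores the item biases and factors under attribute names like i_bias and i_weight. Treat those names as an assumption and confirm them by printing learn.model first:

# assumed attribute names on the fastai collab model -- verify with learn.model
movie_bias = learn.model.i_bias.weight.squeeze()
idxs = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in idxs]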

Summary

We have just implemented a simple collaborative filtering model from scratch. The idea with this lesson, as with most of them so far, is to dig into the theory, code a model from scratch (mostly), improve the model, then use the fastai implementation.