Lesson 6: Collaborative Filtering
This notebook will cover collaborative filtering using the MovieLens data set.
from fastai.collab import *
from fastai.tabular.all import *
path = untar_data(URLs.ML_100k)
Data
"MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota.
This data set consists of:
- 100,000 ratings (1-5) from 943 users on 1682 movies.
- Each user has rated at least 20 movies.
- Simple demographic info for the users (age, gender, occupation, zip)"
Additional Info
u.data contains the full data set: 100,000 ratings by 943 users on 1,682 items.
- Each user has rated at least 20 movies.
- Users and items are numbered consecutively from 1.
- The data is randomly ordered.
- Each record is a tab-separated list of user id | item id | rating | timestamp.
- The timestamps are Unix seconds since 1/1/1970 UTC.
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
names=['user','movie','rating','timestamp'])
ratings.head()
The goal is to predict which films users might like to watch. You could easily imagine that a user has a preference for certain genres, and based on the films they have already seen from a particular genre, you might be able to say something like: user 123 likes action movies, therefore it would be safe to suggest another action movie to them.
Given that we have minimal information in our data set (user id, movie id, rating and timestamp), collaborative filtering seeks to solve this problem by extracting latent factors from the data.
For example, assume that these factors range between -1 and +1, with positive numbers indicating stronger matches to certain factors.
We can use a simple example to illustrate the point. Take the following three dummy factors: science-fiction, action, and old movies. We can compare a user's preferences against these for two different movies and see how they score.
import numpy as np
last_skywalker = np.array([0.98,0.9,-0.9])
casablanca = np.array([-0.99,-0.3,0.8])
user1 = np.array([0.9,0.8,-0.6])
We can compute the dot product and arrive at a match score:
m1 = (user1*last_skywalker).sum()
m2 = (user1*casablanca).sum()
print(f'last skywalker match: {m1.round(2)} \n casablanca match: {m2.round(2)}')
Voila! Based on this we might want to recommend Last Skywalker but not Casablanca to this user.
So how do we find these latent factors? They can be learned.
Step 1: Randomly Initialize Parameters
Randomly assign parameters to represent our latent factors for each user and each movie. We get to decide how many of these factors we want to use.
Step 2: Calculate Predictions
Calculate predictions. This is done as we have just seen, by computing the dot product.
Step 3: Improve Predictions
Then improve the predictions using gradient descent on these latent factors, as sketched below.
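To make these three steps concrete, here is a minimal sketch in plain PyTorch on made-up data (the shapes, learning rate and toy ratings are illustrative assumptions, not part of the MovieLens pipeline we build below):
import torch

# Step 1: randomly initialise latent factors (sizes here are just for illustration)
n_users, n_movies, n_factors = 10, 15, 5
user_f = torch.randn(n_users, n_factors, requires_grad=True)
movie_f = torch.randn(n_movies, n_factors, requires_grad=True)

# a handful of made-up (user, movie, rating) triples
users = torch.tensor([0, 1, 2, 3])
movies = torch.tensor([4, 2, 7, 1])
targets = torch.tensor([4.0, 3.0, 5.0, 2.0])

lr = 0.05
for step in range(100):
    # Step 2: predictions are the dot products of the matching user and movie factors
    preds = (user_f[users] * movie_f[movies]).sum(dim=1)
    loss = ((preds - targets) ** 2).mean()
    # Step 3: improve the factors with gradient descent
    loss.backward()
    with torch.no_grad():
        user_f -= lr * user_f.grad
        movie_f -= lr * movie_f.grad
        user_f.grad.zero_()
        movie_f.grad.zero_()

print(loss.item())  # the loss should have shrunk towards 0 on this toy data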
First, let's add the movie titles to our data set for readability
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1',
usecols=(0,1), names=('movie','title'), header=None)
movies.head()
# merge in the movie titles
ratings = ratings.merge(movies)
ratings.head()
Create a DataLoader
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()
dls.classes
Randomly initialize parameters
- create user_factors and movie_factors of size n_users x n_factors and n_movies x n_factors
n_users = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_factors = 5
user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)
user_factors.size()
One Hot Encoding
We need to prepare our data in a specific way before we can pass it to our model. Some algorithms can work directly with category labels, but many cannot.
One hot encoding is a way of representing categorical data by transforming categorical labels into vectors of 0s and 1s.
To get a sense of how one-hot encoding works, here is a simple example...
df = pd.DataFrame(['adam', 'beatrix', 'cam'])
df
# the categorical variables are represented by 1s and 0s
pd.get_dummies(df)
In order to calculate a result for a particular user and movie combination, we will need to look up the index of the movie and the index of the user in the respective latent factor matrices, then perform a dot product.
But index lookup is not something our model knows how to do; it does know how to perform dot products (matrix products) though.
Here is another example of how we can use dot products to return elements from a matrix...
Say the rows of matrix w represent latent movie factors and the matrix v holds our one-hot-encoded movie ids.
w = torch.randn((3,3))
w
v = torch.tensor([[1,0,0],[0,1,0],[0,0,1]])
v
We can retrieve the factors for movie 2 (the second row of w) by performing a matrix product with the corresponding row of v:
v[1].float() @ w
This is the same as simply indexing:
w[1]
One-hot encoding is basically performing an index lookup on our data. This is, however, memory intensive, since we now have huge matrices to deal with and most of the values are 0.
Enter embeddings. Embeddings are a computational shortcut for doing matrix multiplication with one-hot-encoded vectors.
I found this link useful
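A quick sanity check (an illustrative sketch, not from the lesson): a PyTorch Embedding lookup gives the same answer as multiplying a one-hot vector by the embedding's weight matrix.
import torch
import torch.nn.functional as F
from torch import nn

emb = nn.Embedding(5, 3)                       # 5 "movies", 3 latent factors
idx = torch.tensor([2])                        # look up the movie at index 2

one_hot = F.one_hot(idx, num_classes=5).float()
via_matmul = one_hot @ emb.weight              # matrix product with the one-hot vector
via_lookup = emb(idx)                          # embedding lookup (indexing into the weights)

print(torch.allclose(via_matmul, via_lookup))  # True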
Collaborative filtering from Scratch
forward is a very important method name in PyTorch. forward is the method that handles the computation.
DotProduct
The forward method here gets passed the users and movies as the two columns of x. We grab the factors from an embedding by calling user_factors just like a function, assign these to users and movies, then perform the dot product using (users * movies).sum(dim=1). dim=1 is used because we want to sum over the second axis (the factors), giving one prediction per row of the batch.
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return (users * movies).sum(dim=1)
x,y = dls.one_batch()
# check the shape of x
x.shape
# check the first 5 things in x
x[:5]
# user ids and movie ids
What does torch.Size([64, 2]) tell us?
- the batch size is 64
- each row then has 2 items, a user id and a movie id
- check with x[:,0] and x[:,1], as shown below
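For example, a quick check of those two columns (the actual ids depend on the batch you drew):
# first 5 user ids and first 5 movie (title) ids from the batch
x[:,0][:5], x[:,1][:5]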
# check the first 5 elements of y
y[:5]
# these are the ratings
Create a learner
- our model will be the DotProduct class with 50 latent factors
- the loss function will be MSE
- this is a regression problem with a continuous target
model = DotProduct(n_users, n_movies, n_factors=50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
Not bad, but we can make some improvements. We can use a sigmoid to constrain our predictions to between 0 and 5, matching the ratings in our original data set. We will actually use 0 to 5.5, because a sigmoid never quite reaches its upper bound, so capping at exactly 5 would prevent the model from ever predicting a 5.
# check rating scores/categories
ratings.rating.value_counts()
help(sigmoid_range)
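As an illustrative check (not part of the original lesson), sigmoid_range squashes its input with a sigmoid and rescales it to the given range, so the endpoints are only ever approached asymptotically:
# large negative/positive activations map close to, but never exactly onto, 0 and 5.5
sigmoid_range(tensor([-10., 0., 10.]), 0, 5.5)
# roughly tensor([0.0002, 2.7500, 5.4998])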
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return sigmoid_range((users * movies).sum(dim=1), *self.y_range)
model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
Not much better....
Adding in a Bias term
We can make further improvements by adding a bias term to our model.
Why would we do this? Some movies may have a high rating because they are genuinely better movies, and some users may skew towards being more positive and therefore their rating could generally be more positive. The idea of the bias term is that we now have a way to represent this missing piece of information.
# Add in a bias term for each user and each movie.
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        return sigmoid_range(res, *self.y_range)
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
Our final result is slightly better but we are actually overfitting! How can we stop this and train for longer?
Regularisation
Regularisation is a set of techniques that help to reduce the capacity of the model, which in turn helps to prevent overfitting. Rather than reducing the number of parameters, we can try to force the parameters to be smaller, unless they really need to be big.
Weight Decay
Also known as L2 regularisation. It consists of adding to the loss function the sum of all the parameters squared.
Why does this work, and why would it prevent overfitting?
- one way to decrease the loss is now also to decrease the weights
- limiting the weights hinders training (the model won't fit the training set as well) but helps it generalise better
In practice, rather than literally adding the sum of squares to the loss, we do something equivalent: we add the weights multiplied by some hyperparameter (the weight decay) onto the gradients, as sketched below.
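A minimal runnable sketch of that equivalence, on toy tensors (the names parameters and wd and the stand-in loss are illustrative assumptions, not fastai internals):
import torch

# toy parameters and a stand-in "loss", just to show the equivalence
parameters = torch.randn(4, requires_grad=True)
wd = 0.1

# option 1: add the L2 penalty to the loss and backprop through it
loss_with_wd = (parameters * 2).sum() + wd * (parameters ** 2).sum()
loss_with_wd.backward()
grad_via_loss = parameters.grad.clone()

# option 2: backprop the plain loss, then add wd * 2 * parameters to the gradients
parameters.grad.zero_()
(parameters * 2).sum().backward()
parameters.grad += wd * 2 * parameters.detach()

print(torch.allclose(grad_via_loss, parameters.grad))  # True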
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
much better!
Creating our own Embedding module
Let's recreate DotProductBias without using the Embedding class.
To recap: an embedding layer is a computational shortcut for performing a matrix multiplication by a one-hot-encoded matrix, which is the same as indexing into an array.
# create a tensor as a parameter, with random initialization
def create_params(size):
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))
Below is DotProductBias refactored to use create_params:
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors[x[:,0]]
        movies = self.movie_factors[x[:,1]]
        res = (users * movies).sum(dim=1)
        res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
        return sigmoid_range(res, *self.y_range)
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
Interpreting Embeddings and Biases
Let's take a look at some of the films with the smallest bias. These would be movies that were liked a lot less than others.
We can then do the opposite to see the most liked movies (sorting by bias)
The goal is to see what the model has learnt and to gain some information about how it is operating.
Then using PCA we can reduce the number of latent factors and plot these to view the "space". Again, this is a way we can interpret what the model has learnt.
movie_bias = learn.model.movie_bias.squeeze()
idxs = movie_bias.argsort()[:5]
[dls.classes['title'][i] for i in idxs]
# most liked films
idxs = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in idxs]
g = ratings.groupby('title')['rating'].count()
top_movies = g.sort_values(ascending=False).index.values[:1000]
top_idxs = tensor([learn.dls.classes['title'].o2i[m] for m in top_movies])
movie_w = learn.model.movie_factors[top_idxs].cpu().detach()
movie_pca = movie_w.pca(3)
fac0,fac1,fac2 = movie_pca.t()
idxs = list(range(50))
X = fac0[idxs]
Y = fac2[idxs]
plt.figure(figsize=(12,12))
plt.scatter(X, Y)
for i, x, y in zip(top_movies[idxs], X, Y):
    plt.text(x, y, i, color=np.random.rand(3)*0.7, fontsize=11)
plt.show()
The most interesting cluster I can see is on the mid-right-hand side: Conspiracy Theory, Mission Impossible, Air Force One, etc. Aside from Liar Liar, these seem like the kinds of movies that someone who likes action films would likely enjoy.
The fastai way
learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
learn.fit_one_cycle(5, 5e-3, wd=0.1)
Summary
We have just implemented a simple collaborative filtering model from scratch. The idea with this lesson, as with most of them so far, is to dig into the theory, code a model from scratch (mostly), improve the model, then use the fastai implementation.