W2 - Recommender Systems
Learning Objectives
- Implement collaborative filtering recommender systems in TensorFlow
- Implement deep learning content based filtering using a neural network in TensorFlow
- Understand ethical considerations in building recommender systems
Collaborative filtering
Making recommendations
Recommender systems are used by online shopping websites like Amazon and movie streaming sites like Netflix. For many companies, the economic value driven by recommender systems is very large.
With this framework for recommender systems, one possible way to approach the problem is to look at the movies that users have not rated and try to predict how users would rate them.
Using per-item features
In a first stage, we assume we have access to features, or extra information, about the movies, such as which movies are romance movies and which are action movies.
This is a lot like linear regression, except that we fit a different linear regression model for each of the 4 users in the dataset.
So the algorithm is very similar to linear regression, but when calculating the cost function:
- we focus on a single user j, so we write out the cost function for learning the parameters $w^{(j)}$ and $b^{(j)}$ for that user
- we use only movies rated by that user, so we sum only over the movies i that user j has actually rated, i.e. all pairs where $r(i,j)=1$
- we drop the division by $m^{(j)}$: since $m^{(j)}$ is just a constant in this expression, removing it does not change the minimizing parameters (it is only a convenience)
If we do the same for all users, we sum the cost function over all $n_u$ users.
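Written out, the cost for a single user j (using only the movies that user has rated) and the cost summed over all users are:

$$
J\big(w^{(j)}, b^{(j)}\big) = \frac{1}{2} \sum_{i:\, r(i,j)=1} \Big( w^{(j)} \cdot x^{(i)} + b^{(j)} - y^{(i,j)} \Big)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} \big( w_k^{(j)} \big)^2
$$

$$
J\big(w^{(1)}, \dots, w^{(n_u)}, b^{(1)}, \dots, b^{(n_u)}\big) = \sum_{j=1}^{n_u} J\big(w^{(j)}, b^{(j)}\big)
$$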
Collaborative filtering algorithm
We now assume we no longer have any features describing the movies (romance, action, etc.).
Now, just for the purposes of illustration, let's say we had somehow already learned parameters $w^{(1)}, \dots, w^{(4)}$ for the 4 users, and $b^{(1)}, \dots, b^{(4)}$ that we set to 0.
Given the parameters for all four users and the four ratings of movie 1, you can take a reasonable guess at a feature vector $(x_1, x_2)$ for movie 1 that would make good predictions for those four ratings.
By the way, notice that this works only because we have parameters for four users.
That’s what allows us to try to guess appropriate features, x1.
This is why, in a typical linear regression application, a single user would not give you enough information to figure out the features $x_1$ and $x_2$; in the linear regression context (course 1), you cannot come up with features $x_1$ and $x_2$ from scratch.
In collaborative filtering, however, you have ratings from multiple users of the same item (the same movie): each rating gives one more constraint on that movie's features, and together they make it possible to guess reasonable values for them.
If you want to learn the features $x^{(i)}$ for a specific movie i:
- we define a cost function $J(x^{(i)})$ for the features of movie i
- and then we minimize it as a function of $x^{(i)}$
So if you have the parameters $w$ and $b$ for all the users, then minimizing this cost function as a function of $x^{(1)}$ through $x^{(n_m)}$ using gradient descent (or some other optimization algorithm) actually allows you to take a pretty good guess at learning good features for the movies.
If we combine the two cost functions:
- the one used to find $w^{(j)}$ and $b^{(j)}$
- the one used to find $x^{(i)}$
we can minimize the resulting cost function $J(w, b, x)$ using gradient descent.
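The combined cost function sums the squared errors over every pair (i, j) that has a rating and regularizes both the user parameters and the movie features:

$$
J(w, b, x) = \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \Big( w^{(j)} \cdot x^{(i)} + b^{(j)} - y^{(i,j)} \Big)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \big(w_k^{(j)}\big)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \big(x_k^{(i)}\big)^2
$$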
Binary labels: favs, likes and clicks
Many applications of recommender systems involve binary labels, where instead of giving a rating (e.g. stars), the user signals whether they engaged with an item: favs, likes, clicks.
We can generalize the rating algorithm to binary labels. The process is quite similar to the approach used to go from linear regression to logistic regression.
A few examples:
- the prediction goes from the linear expression to the same expression passed through the logistic function, as in binary classification
- the cost goes from the squared-error cost to the binary cross-entropy cost
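Concretely, the prediction and the per-example loss become:

$$
f(i, j) = g\big(w^{(j)} \cdot x^{(i)} + b^{(j)}\big), \qquad g(z) = \frac{1}{1 + e^{-z}}
$$

$$
L\big(f, y^{(i,j)}\big) = -\,y^{(i,j)} \log f - \big(1 - y^{(i,j)}\big) \log\big(1 - f\big)
$$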
Recommender systems implementation detail
Mean normalization
For linear regression, feature normalization can help the algorithm run faster (scaling features so they have similar magnitudes).
A problem with the collaborative filtering algorithm is that it predicts 0 stars for every movie for a new user who has not yet rated anything (regularization drives that user's parameters towards 0).
To correct this, we compute the mean rating of each row (each movie, over the users who rated it) and subtract that mean from every rating in the row.
With this modification, the algorithm by default predicts, for a new user, the mean rating given to each movie by the other users.
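A minimal numpy sketch of this row-wise mean normalization, assuming `Y` is the $n_m \times n_u$ rating matrix and `R` is the 0/1 matrix indicating which entries were actually rated (these names are assumptions):

```python
import numpy as np

def normalize_ratings(Y, R):
    """Subtract each movie's mean rating, computed over rated entries only."""
    # Mean of each row, counting only entries where r(i, j) = 1
    Y_mean = (np.sum(Y * R, axis=1) / (np.sum(R, axis=1) + 1e-12)).reshape(-1, 1)
    Y_norm = (Y - Y_mean) * R   # keep unrated entries at 0
    return Y_norm, Y_mean

# After training on Y_norm, a prediction for movie i and user j adds the mean back:
# prediction(i, j) = w_j . x_i + b_j + Y_mean[i]
```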
TensorFlow implementation of collaborative filtering
TensorFlow is a tool for building neural networks, but it can also be very helpful for building other types of learning algorithms, such as the collaborative filtering algorithm.
TensorFlow is helpful for implementing gradient descent because it can automatically figure out the derivatives of the cost function. All you have to do is implement the cost function; without needing to know any calculus or take derivatives yourself, TensorFlow can, with just a few lines of code, compute the derivative terms used to optimize the cost function. Computing these derivatives or partial derivatives by hand can be difficult, and TensorFlow takes care of it.
For simplification, we take :
- b = 0, so $f_w(x) = wx$
- only one training example, so cost function has only one term
So this procedure allows you to implement gradient descent without computing the derivative term $\frac{dJ}{dw}$ yourself.
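A minimal sketch of this simplified setting with TensorFlow's automatic differentiation (the values of x, y, the learning rate, and the iteration count are illustrative):

```python
import tensorflow as tf

w = tf.Variable(3.0)      # parameter to optimize
x, y = 1.0, 1.0           # a single training example
alpha = 0.01              # learning rate

for iteration in range(30):
    # Record the computation of the cost so TensorFlow can differentiate it
    with tf.GradientTape() as tape:
        f_wx = w * x                 # b fixed to 0 for simplification
        cost = (f_wx - y) ** 2       # squared-error cost for one example
    dJdw = tape.gradient(cost, w)    # derivative computed automatically
    w.assign_sub(alpha * dJdw)       # gradient descent update: w := w - alpha * dJ/dw
```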
This is a very powerful feature of TensorFlow called Auto Diff, and some other machine learning packages such as PyTorch also support it.
Sometimes you hear people call this Auto Grad. The technically correct term is Auto Diff; Autograd is actually the name of a specific software package for automatic differentiation.
You can also use a more powerful optimization algorithm, like the Adam optimizer.
To implement the collaborative filtering algorithm in TensorFlow, this is the kind of syntax you can use.
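A sketch of such a custom training loop with Adam, using a vectorized cost function (the matrix sizes, the placeholder `Ynorm` and `R` tensors, and the helper name `cofi_cost_func` are illustrative assumptions):

```python
import tensorflow as tf

def cofi_cost_func(X, W, b, Ynorm, R, lambda_):
    """Regularized collaborative filtering cost, vectorized over all (i, j) pairs."""
    err = (tf.linalg.matmul(X, tf.transpose(W)) + b - Ynorm) * R   # errors on rated entries only
    return 0.5 * tf.reduce_sum(err ** 2) + (lambda_ / 2) * (
        tf.reduce_sum(W ** 2) + tf.reduce_sum(X ** 2))

num_movies, num_users, num_features = 100, 40, 10
X = tf.Variable(tf.random.normal((num_movies, num_features)))   # movie features
W = tf.Variable(tf.random.normal((num_users, num_features)))    # user parameters
b = tf.Variable(tf.random.normal((1, num_users)))               # user biases
Ynorm = tf.zeros((num_movies, num_users))   # mean-normalized ratings (placeholder)
R = tf.zeros((num_movies, num_users))       # r(i, j) indicator matrix (placeholder)

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-1)
for iteration in range(200):
    with tf.GradientTape() as tape:
        cost = cofi_cost_func(X, W, b, Ynorm, R, lambda_=1.0)
    grads = tape.gradient(cost, [X, W, b])
    optimizer.apply_gradients(zip(grads, [X, W, b]))
```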
Finding related items
It is quite hard to interpret the learned features $x_1^{(i)}, \dots, x_n^{(i)}$ individually, e.g. to say that $x_1$ means "action movie" and $x_2$ means "foreign film" and so on. Nonetheless, collectively $x_1$, $x_2$, $x_3$, … do convey something about what that movie is like.
So to find movies related to movie i, you look for the items k whose features $x^{(k)}$ are most similar to $x^{(i)}$, i.e. with the smallest squared distance $\lVert x^{(k)} - x^{(i)} \rVert^2$.
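A small numpy sketch of that search (the function and argument names are illustrative):

```python
import numpy as np

def related_items(X, i, k=5):
    """Return the indices of the k movies whose learned features are closest to movie i."""
    X = np.asarray(X, dtype=float)
    d = np.sum((X - X[i]) ** 2, axis=1)   # squared distance ||x^(k) - x^(i)||^2 for every movie
    d[i] = np.inf                         # exclude the movie itself
    return np.argsort(d)[:k]
```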
Limitations
Content-based filtering
Collaborative filtering vs Content-based filtering
Content-based filtering differs from collaborative filtering: the algorithm matches user features with movie features.
The process starts by building a vector of user features and another of movie features.
The problem is that the user and movie vectors do not have the same size. We need a method that produces two vectors of the same size, so that the prediction for user j and movie i can be computed with a dot product playing the role of the linear expression $w^{(j)} \cdot x^{(i)} + b$.
Deep learning for content-based filtering
The first neural network is the user network: it takes as input the list of user features $x_u$ and, using a few layers, outputs a vector $v_u$ that describes the user. The output layer has 32 units.
Similarly, a movie network computes a vector $v_m$ for a movie.
In practice, we combine the two networks into a single model, using the dot product to combine the two outputs.
Finding similar movies can be pre-computed ahead of time: you can run a compute server overnight to go through the list of all your movies and, for each one, find the movies most similar to it. Then, when a user comes to the website and browses a specific movie, the 10 or 20 most similar movies are already pre-computed and ready to show (important for scalability).
It may be worth spending some time engineering good features for this application, because the algorithm can be computationally very expensive to run if you have a large catalogue with many different movies you may want to recommend.
Recommending from a large catalogue
Step 1 : retrieve plausible movies
- For each of the last 10 movies that the user has watched, find the 10 most similar movies (can be pre-computed)
- top movies by genres
- top movies by countries…
Step 2 : rank the movies in the plausible list using the learned model (one additional optimization: $v_m$ can be computed in advance for all movies, so only $v_u$ needs to be computed when the user shows up); a small sketch follows below.
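A numpy sketch of that ranking step, assuming `Vm` holds the pre-computed movie vectors (one row per movie) and `v_u` is the current user's vector (all names are illustrative):

```python
import numpy as np

def rank_candidates(v_u, Vm, candidate_ids, top_n=10):
    """Score only the retrieved (plausible) movies and return the best ones."""
    scores = Vm[candidate_ids] @ v_u    # dot product v_u . v_m for each candidate
    order = np.argsort(-scores)         # highest score first
    return [candidate_ids[i] for i in order[:top_n]]
```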
Retrieval step trade-off: retrieving more items tends to give more relevant recommendations but makes the system slower.
Ethical use of recommender systems
Many websites do not show you the most relevant products but the products that will generate the largest profit for the company.
Example: a payday loan is a short-term, high-interest loan; a profit-maximizing recommender can end up promoting this kind of exploitative business.
Other examples exist as well.
TensorFlow implementation of content-based filtering
We use two sequential models:
- one for the user
- another for the movies
Next, we need to tell TensorFlow/Keras how to feed in the user features and the item features, then compute the vector $v_u$ and normalize it to unit length (norm = 1) to make the algorithm work better. The same is done for the vector $v_m$.
Finally, we take the dot product of $v_u$ and $v_m$ as the model's output.
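A minimal Keras sketch of this two-tower model (the input sizes and hidden-layer sizes are illustrative assumptions; the 32-unit output follows the notes above):

```python
import tensorflow as tf

num_user_features, num_item_features = 14, 16   # illustrative sizes
num_outputs = 32

user_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_outputs),
])
item_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_outputs),
])

# Feed the user features, compute v_u and normalize it to unit length
input_user = tf.keras.layers.Input(shape=(num_user_features,))
vu = tf.keras.layers.UnitNormalization(axis=1)(user_NN(input_user))

# Same for the item (movie) features
input_item = tf.keras.layers.Input(shape=(num_item_features,))
vm = tf.keras.layers.UnitNormalization(axis=1)(item_NN(input_item))

# The prediction is the dot product of the two vectors
output = tf.keras.layers.Dot(axes=1)([vu, vm])

model = tf.keras.Model([input_user, input_item], output)
model.compile(optimizer='adam', loss=tf.keras.losses.MeanSquaredError())
```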
Principal component analysis
Reducing the number of features (optional)
This is an algorithm that is commonly used by data scientists to visualize the data, to figure out what might be going on.
In both the examples we saw, only one of the two features seemed to have a meaningful degree of variation.
Or we find a new axis that combines two different features.
We can, for example, reduce from 3D to 2D (in that case, the data is roughly aligned in a common plane).
We can reduce from 50 features to only 2 features by combining features together.
You might find, for example, that $z_1$ loosely corresponds to how big the country is and its total GDP, and $z_2$ corresponds roughly to per-person GDP (Gross Domestic Product).
PCA algorithm (optional)
Preprocessing:
- normalize to have zero-mean
- feature scaling
See feature scaling
These five points are quite spread apart, so you are still capturing a lot of the variance in the original dataset: the projections of the data onto the z-axis are decently spread out.
Example of a bad axis: with this choice of z, you capture much less of the information in the original dataset because the five examples are partially squished together.
In the PCA algorithm, this axis is called the principal component.
We find each example's value on the new axis using the dot product with a unit vector along that axis.
If you need a second axis, you choose it perpendicular to the principal component; the same goes for a third axis.
PCA (principal component analysis) is not linear regression: linear regression minimizes vertical distances to the line (errors in predicting y), whereas PCA projects the points perpendicularly onto the new axis and treats all features equally.
An extreme case where PCA differs from linear regression
If you need to find an approximate point that represents the original coordinates, you use what is called "reconstruction".
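A small worked example (the numbers are illustrative): projecting the point $x = (2, 3)$ onto the unit-length axis direction $u \approx (0.71, 0.71)$ gives

$$
z = x \cdot u = 2 \times 0.71 + 3 \times 0.71 \approx 3.55
$$

and the reconstruction of the original point from z is

$$
\hat{x} = z\, u \approx (2.52,\ 2.52)
$$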
PCA in code (optional)
Here I’m assuming you want two or three axes if you want to visualize the data in 2D or 3D
In scikit-learn, you will use the fit function.
The fit function in PCA automatically carries out mean normalization (it subtracts out the mean of each feature), so you don't need to perform mean normalization separately.
Optionally, look at the explained_variance_ratio_ attribute to see how much of the variance in your data each of the new axes (principal components) explains.
Finally, transform the data onto the new principal components (projections).
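A minimal scikit-learn sketch of these three steps (the toy dataset is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 1], [2, 1], [3, 2], [-1, -1], [-2, -1], [-3, -2]])  # toy 2D data

pca = PCA(n_components=1)              # reduce 2D -> 1D
pca.fit(X)                             # fit also subtracts the mean of each feature
print(pca.explained_variance_ratio_)   # fraction of variance captured by each axis
Z = pca.transform(X)                   # project the data onto the principal component(s)
print(Z)                               # one coordinate per example
```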
Example : 2D -> 1D
- explained_variance_ratio_ = 0.992 means PCA captures 99.2 percent of the variability, or of the information, in the original dataset.
- you obtain a 1-dimensional array (one coordinate per example)
Example : 2D -> 2D
- explained_variance_ratio_: $z_1$ = 0.992 for the first axis and $z_2$ = 0.008 for the second; $z_1 + z_2 = 1$ because the data is two-dimensional, so the two axes $z_1$ and $z_2$ together explain 100 percent of the variance in the data.
PCA usage examples:
- visualization (the most common use today)
- data compression (used a few years ago, not so much now)
- speeding up training of a learning model (used 10 years ago; it doesn't help much with modern machine learning)