08 Collaborative Filtering Deep Dive - A First Look at the Data - 《The fastai book》

The dataset is available through the usual fastai function:

In [ ]:

path = untar_data(URLs.ML_100k)

According to the README, the main table is in the file u.data. It is tab-separated and the columns are, respectively, user, movie, rating, and timestamp. Since those names are not encoded in the file, we need to supply them when reading it with Pandas. Here is a way to open this table and take a look:

In [ ]:


ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user','movie','rating','timestamp'])

Out[ ]:
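
As a self-contained sketch of the same kind of read, the snippet below parses a few rows in the u.data layout from an in-memory string instead of the downloaded file. The sample rows and the name `ratings_sample` are hypothetical; they only illustrate how `header=None` and `names` supply the missing column names.

```python
import io
import pandas as pd

# Hypothetical rows in the u.data layout: user, movie, rating, timestamp.
# The fields are tab-separated and there is no header row.
sample = "1\t10\t4\t881250949\n2\t10\t5\t891717742\n1\t20\t3\t878887116\n"

ratings_sample = pd.read_csv(io.StringIO(sample), delimiter='\t', header=None,
                             names=['user', 'movie', 'rating', 'timestamp'])

# Show the first rows, as one would with the real table.
print(ratings_sample.head())
```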

Although this has all the information we need, it is not a particularly helpful way for humans to look at this data. Cross-tabulating the same data by user and movie gives a more human-friendly table.
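
One way to build such a cross-tabulation is with `DataFrame.pivot`. This is a minimal sketch on a small made-up table, not the actual MovieLens data; the values and the names `ratings_sample` and `xtab` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical ratings in long form: one row per (user, movie) pair.
ratings_sample = pd.DataFrame({
    'user':   [1, 1, 2, 3],
    'movie':  [10, 20, 10, 20],
    'rating': [4, 3, 5, 2],
})

# One row per user, one column per movie.
# Pairs with no rating show up as NaN: these are the cells we would like to fill in.
xtab = ratings_sample.pivot(index='user', columns='movie', values='rating')
print(xtab)
```

The empty cells are the interesting part: predicting them is what a collaborative filtering model is for.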

If we knew, for each user, to what degree they liked each important category that a movie might fall into (such as genre, age, preferred directors and actors, and so forth), and we knew the same information about each movie, then a simple way to fill in this table would be to multiply the user's value by the movie's value for each category and add up the results. For instance, assuming these factors range between -1 and +1, with positive numbers indicating stronger matches and negative numbers weaker ones, and the categories are science-fiction, action, and old movies, then we could represent the movie The Last Skywalker as:

In [ ]:

last_skywalker = np.array([0.98,0.9,-0.9])

Here, for instance, we are scoring the movie as very science-fiction (0.98), very action (0.9), and very much not old (-0.9). We could represent a user who likes modern sci-fi action movies as:

In [ ]:

We can now calculate the match between this user and this movie:

In [ ]:

Out[ ]:
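
The calculation described above can be sketched as follows. The movie vector is the one given in the text; the user vector `user1` is a hypothetical choice for someone who likes modern sci-fi action movies, so treat its values as an assumption.

```python
import numpy as np

# Factor order: science-fiction, action, old movies.
last_skywalker = np.array([0.98, 0.9, -0.9])  # from the text
user1 = np.array([0.9, 0.8, -0.6])            # hypothetical user: likes sci-fi and action, dislikes old movies

# Multiply factor by factor, then add up the products (a dot product).
match = (user1 * last_skywalker).sum()
print(match)
```

With these values the score comes out at about 2.14, a strong positive match, because every factor the user likes is one the movie scores highly on, and the factor the user dislikes is one the movie scores negatively on.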

On the other hand, we might represent the movie Casablanca as:

In [ ]:

The match between this user and Casablanca is:

In [ ]:

Out[ ]:

-1.611
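
To make the negative score concrete, here is a sketch with assumed vectors. Both `user1` and `casablanca` are illustrative values (not taken from the text above) chosen to be consistent with the -1.611 result: a user who likes modern sci-fi action, and a movie that is old, not science-fiction, and only mildly non-action.

```python
import numpy as np

# Factor order: science-fiction, action, old movies.
user1 = np.array([0.9, 0.8, -0.6])        # hypothetical user who likes modern sci-fi action
casablanca = np.array([-0.99, -0.3, 0.8]) # hypothetical scores: not sci-fi, little action, very old

# Same calculation as before: elementwise product, then sum.
match = (user1 * casablanca).sum()
print(match)
```

Every term in the sum is negative here, since the user and the movie disagree on all three factors, which is why the total is close to -1.611.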

Since we don’t know what the latent factors actually are, and we don’t know how to score them for each user and movie, we should learn them.