The dataset is available through the usual fastai function:

    In [ ]:

    According to the README, the main table is in the file u.data. It is tab-separated, and the columns are, respectively, user, movie, rating, and timestamp. Since the file has no header row containing those names, we need to supply them when reading the file with Pandas. Here is a way to open this table and take a look:

    In [ ]:

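    ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,  # `path` is assumed to be the dataset directory
                          names=['user','movie','rating','timestamp'])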

    Out[ ]:

    Although this has all the information we need, it is not a particularly helpful way for humans to look at this data. The same data can be cross-tabulated by user and movie into a more human-friendly table.
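As a rough illustration of what cross-tabulating means here, the sketch below pivots a tiny, invented ratings table from the long format (one row per rating) into a user-by-movie grid. The sample values and the name `xtab` are assumptions made for this example and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample in the same long format as u.data (values invented for illustration).
ratings = pd.DataFrame({
    'user':   [1, 1, 2, 3, 3],
    'movie':  [10, 20, 10, 20, 30],
    'rating': [5, 3, 4, 2, 5],
})

# One row per user, one column per movie; user/movie pairs with no rating become NaN.
xtab = ratings.pivot(index='user', columns='movie', values='rating')
print(xtab)
```

The NaN cells are the gaps that a recommender would try to fill in.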

    If we knew for each user to what degree they liked each important category that a movie might fall into, such as genre, age, preferred directors and actors, and so forth, and we knew the same information about each movie, then a simple way to fill in this table would be to multiply the user's and the movie's scores for each category and add up the results. For instance, assuming these factors range between -1 and +1, with positive numbers indicating stronger matches and negative numbers weaker ones, and the categories are science-fiction, action, and old movies, then we could represent the movie The Last Skywalker as:

    In [ ]:

    last_skywalker = np.array([0.98,0.9,-0.9])

    Here, for instance, we are scoring very science-fiction as 0.98, very action as 0.9, and very not old as -0.9. We could represent a user who likes modern sci-fi action movies as:

    In [ ]:

    and we can now calculate the match for this user-movie combination:

    In [ ]:

    Out[ ]:
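A minimal sketch of this calculation follows. The movie vector is the one given in the text. The user vector `user1` is an assumption, chosen only to describe someone who likes science-fiction and action and dislikes old movies.

```python
import numpy as np

# Factor order: science-fiction, action, old movies (as described in the text).
last_skywalker = np.array([0.98, 0.9, -0.9])

# Hypothetical user who likes modern sci-fi action movies; these values are an assumption.
user1 = np.array([0.9, 0.8, -0.6])

# Dot product: multiply factor by factor, then sum the products.
match = (user1 * last_skywalker).sum()
print(round(match, 3))
```

With these assumed values the score is clearly positive (about 2.14), because every factor the user likes is one the movie scores highly on, and the shared dislike of "old" multiplies out to a positive contribution as well.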

      On the other hand, we might represent the movie Casablanca as:

      In [ ]:

      The match for this combination is:

      In [ ]:

      Out[ ]:

      -1.611
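The sketch below shows one way such a negative score can arise. Both vectors are assumptions: the text does not give them here, and the values were picked so that their dot product reproduces the -1.611 shown above.

```python
import numpy as np

# Factor order: science-fiction, action, old movies.
# Assumed user vector: likes sci-fi and action, dislikes old movies.
user1 = np.array([0.9, 0.8, -0.6])

# Assumed movie vector for Casablanca: not sci-fi, little action, very old.
casablanca = np.array([-0.99, -0.3, 0.8])

# Every factor has opposite signs for the user and the movie, so each product is negative.
score = (user1 * casablanca).sum()
print(round(score, 3))
```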

      Since we don’t know what the latent factors actually are, and we don’t know how to score them for each user and movie, we should learn them.