We can download, extract, and take a look at our dataset in the usual way:

    from fastai.text.all import *
    path = untar_data(URLs.HUMAN_NUMBERS)

    # Display paths relative to the dataset root
    Path.BASE_PATH = path

    path.ls()

    (#2) [Path('train.txt'),Path('valid.txt')]

Let's open those two files and join all of the texts together, ignoring the train/validation split the dataset provides for now:

    lines = L()
    with open(path/'train.txt') as f: lines += L(*f.readlines())
    with open(path/'valid.txt') as f: lines += L(*f.readlines())
    lines

    (#9998) ['one \n','two \n','three \n','four \n','five \n','six \n','seven \n','eight \n','nine \n','ten \n'...]

We take all those lines and concatenate them in one big stream. To mark when we go from one number to the next, we use a . as a separator:

    text = ' . '.join([l.strip() for l in lines])
    text[:100]

    'one . two . three . four . five . six . seven . eight . nine . ten . eleven . twelve . thirteen . fo'

We can tokenize this dataset by splitting on spaces:

    tokens = text.split(' ')
    tokens[:10]

    ['one', '.', 'two', '.', 'three', '.', 'four', '.', 'five', '.']

To numericalize, we have to create a list of all the unique tokens (our vocab):

    vocab = L(*tokens).unique()
    vocab

    (#30) ['one','.','two','three','four','five','six','seven','eight','nine'...]

Then we can convert our tokens into numbers by looking up the index of each in the vocab:

    word2idx = {w:i for i,w in enumerate(vocab)}
    nums = L(word2idx[w] for w in tokens)
    nums

    (#63095) [0,1,2,1,3,1,4,1,5,1...]
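As a quick sanity check (an extra step, using only the objects defined above), we can decode the first few numbers back into tokens by indexing into the vocab:

    # Added check: invert the numericalization for the first ten ids.
    # Each id should map back to the token it came from.
    ' '.join(vocab[i] for i in nums[:10])

This should give 'one . two . three . four . five .', confirming that numericalization is lossless as long as we keep the vocab around.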

Now that we have a small dataset on which language modeling should be an easy task, we can build our first model.