We can download, extract, and take a look at our dataset in the usual way:
In [ ]:
from fastai.text.all import *
path = untar_data(URLs.HUMAN_NUMBERS)
In [ ]:
Path.BASE_PATH = path
In [ ]:
path.ls()
Out[ ]:
(#2) [Path('train.txt'),Path('valid.txt')]
Let's open those two files and see what's inside. For now we'll simply join all of the texts together, ignoring the train/valid split the dataset provides:
In [ ]:
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())
lines
Out[ ]:
(#9998) ['one \n','two \n','three \n','four \n','five \n','six \n','seven \n','eight \n','nine \n','ten \n'...]
We take all those lines and concatenate them in one big stream. To mark when we go from one number to the next, we use a "." as a separator:
In [ ]:
text = ' . '.join([l.strip() for l in lines])
text[:100]
Out[ ]:
'one . two . three . four . five . six . seven . eight . nine . ten . eleven . twelve . thirteen . fo'
We can tokenize this dataset by splitting on spaces:
In [ ]:
tokens = text.split(' ')
tokens[:10]
Out[ ]:
['one', '.', 'two', '.', 'three', '.', 'four', '.', 'five', '.']
To numericalize, we have to create a list of all the unique tokens (our vocab):
In [ ]:
vocab = L(*tokens).unique()
vocab
Out[ ]:
(#30) ['one','.','two','three','four','five','six','seven','eight','nine'...]
Then we can convert our tokens into numbers by looking up the index of each in the vocab:
In [ ]:
word2idx = {w:i for i,w in enumerate(vocab)}
nums = L(word2idx[w] for w in tokens)
nums
Out[ ]:
(#63095) [0,1,2,1,3,1,4,1,5,1...]
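As a quick sanity check, we can map a few of those indices back through vocab; this is just a sketch using the nums and vocab objects defined above, and the decoded string should match the start of our token stream:
In [ ]:
# Sketch: decode the first ten indices back into tokens to confirm the
# numericalization round-trips (expected: 'one . two . three . four . five .')
' '.join(vocab[i] for i in nums[:10])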
Now that we have a small dataset on which language modeling should be an easy task, we can build our first model.