We can download, extract, and take a look at our dataset in the usual way:
In [ ]:
from fastai.text.all import *
path = untar_data(URLs.HUMAN_NUMBERS)
In [ ]:
Path.BASE_PATH = path
In [ ]:
path.ls()
Out[ ]:
(#2) [Path('train.txt'),Path('valid.txt')]
Let's open those two files and see what's inside. For now we'll simply join all of the texts together, ignoring the train/valid split the dataset provides:
In [ ]:
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())
lines
Out[ ]:
(#9998) ['one \n','two \n','three \n','four \n','five \n','six \n','seven \n','eight \n','nine \n','ten \n'...]
We take all those lines and concatenate them in one big stream. To mark when we go from one number to the next, we use a "." as a separator:
In [ ]:
text = ' . '.join([l.strip() for l in lines])
text[:100]
Out[ ]:
'one . two . three . four . five . six . seven . eight . nine . ten . eleven . twelve . thirteen . fo'
We can tokenize this dataset by splitting on spaces:
In [ ]:
tokens = text.split(' ')
tokens[:10]
Out[ ]:
['one', '.', 'two', '.', 'three', '.', 'four', '.', 'five', '.']
To numericalize, we have to create a list of all the unique tokens (our vocab):
In [ ]:
vocab = L(*tokens).unique()
vocab
Out[ ]:
(#30) ['one','.','two','three','four','five','six','seven','eight','nine'...]
Then we can convert our tokens into numbers by looking up the index of each in the vocab:
In [ ]:
word2idx = {w:i for i,w in enumerate(vocab)}
nums = L(word2idx[w] for w in tokens)
nums
Out[ ]:
(#63095) [0,1,2,1,3,1,4,1,5,1...]
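As a quick sanity check, we can map a few of those indices back through vocab; this is just a sketch using the nums and vocab objects defined above, and the decoded string should match the start of our token stream:
In [ ]:
# Sketch: decode the first ten indices back into tokens to confirm the
# numericalization round-trips (expected: 'one . two . three . four . five .')
' '.join(vocab[i] for i in nums[:10])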
Now that we have a small dataset on which language modeling should be an easy task, we can build our first model.