Here is the short way of doing the transformation we saw in the previous section:

    In [ ]:

    tls = TfmdLists(files, [Tokenizer.from_folder(path), Numericalize])

    At initialization, the TfmdLists automatically calls the setup method of each Transform in order, providing each one not with the raw items but with the items already transformed by all the previous Transforms. We can get the result of our Pipeline on any raw element just by indexing into the TfmdLists:

    In [ ]:

    t = tls[0]; t[:20]

    Out[ ]:

    tensor([ 2, 8, 91, 11, 22, 5793, 22, 37, 4910, 34, 11, 8, 13042, 23, 107, 30, 11, 25, 44, 14])
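
    To make this setup ordering concrete, here is a minimal sketch with two hypothetical Transforms (Lowercase and CharVocab are illustrative names, not part of fastai): the second Transform's setups method receives items that the first has already lowercased.

    In [ ]:

    class Lowercase(Transform):
        def encodes(self, s): return s.lower()

    class CharVocab(Transform):
        def setups(self, items):
            # items arrive already lowercased by the previous Transform
            self.vocab = sorted(set(''.join(items)))
        def encodes(self, s): return [self.vocab.index(c) for c in s]

    tls_demo = TfmdLists(['Hello', 'World'], [Lowercase(), CharVocab()])
    tls_demo[0]  # indices of the characters of 'hello' in the shared vocab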

    And the TfmdLists knows how to decode for show purposes:

    In [ ]:

    tls.decode(t)[:100]

    Out[ ]:

    'xxbos xxmaj well , " cube " ( 1997 ) , xxmaj vincenzo \'s first movie , was one of the most interesti'

    In fact, it even has a show method:

    In [ ]:

    tls.show(t)

    xxbos xxmaj well , " cube " ( 1997 ) , xxmaj vincenzo 's first movie , was one of the most interesting and tricky ideas that xxmaj i 've ever seen when talking about movies . xxmaj they had just one scenery , a bunch of actors and a plot . xxmaj so , what made it so special were all the effective direction , great dialogs and a bizarre condition that characters had to deal like rats in a labyrinth . xxmaj his second movie , " cypher " ( 2002 ) , was all about its story , but it was n't so good as " cube " but here are the characters being tested like rats again .
    " nothing " is something very interesting and gets xxmaj vincenzo coming back to his ' cube days ' , locking the characters once again in a very different space with no time once more playing with the characters like playing with rats in an experience room . xxmaj but instead of a thriller sci - fi ( even some of the promotional teasers and trailers erroneous seemed like that ) , " nothing " is a loose and light comedy that for sure can be called a modern satire about our society and also about the intolerant world we 're living . xxmaj once again xxmaj xxunk amaze us with a great idea into a so small kind of thing . 2 actors and a blinding white scenario , that 's all you got most part of time and you do n't need more than that . xxmaj while " cube " is a claustrophobic experience and " cypher " confusing , " nothing " is completely the opposite but at the same time also desperate .
    xxmaj this movie proves once again that a smart idea means much more than just a millionaire budget . xxmaj of course that the movie fails sometimes , but its prime idea means a lot and offsets any flaws . xxmaj there 's nothing more to be said about this movie because everything is a brilliant surprise and a totally different experience that i had in movies since " cube " .

    The TfmdLists is named with an “s” because it can handle a training and a validation set with a splits argument. You just need to pass the indices of which elements are in the training set, and which are in the validation set:

    In [ ]:

    cut = int(len(files)*0.8)
    splits = [list(range(cut)), list(range(cut,len(files)))]
    tls = TfmdLists(files, [Tokenizer.from_folder(path), Numericalize],
                    splits=splits)

    You can then access them through the train and valid attributes:

    In [ ]:

    tls.valid[0][:20]

    Out[ ]:

    tensor([ 2, 8, 20, 30, 87, 510, 1570, 12, 408, 379, 4196, 10, 8, 20, 30, 16, 13, 12216, 202, 509])

    If you have manually written a Transform that performs all of your preprocessing at once, turning raw items into a tuple with inputs and targets, then TfmdLists is the class you need. You can directly convert it to a DataLoaders object with the dataloaders method. This is what we will do in our Siamese example later in this chapter.
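
    As a minimal sketch of that pattern (the PairWithLabel class below is hypothetical, just to show the shape such a Transform takes), a single Transform can return the (input, target) tuple directly:

    In [ ]:

    class PairWithLabel(Transform):
        # Hypothetical all-in-one Transform: file path -> (text, label) tuple
        def encodes(self, f): return (f.read_text(), parent_label(f))

    tls_pair = TfmdLists(files, [PairWithLabel()], splits=splits)
    # tls_pair.dataloaders(bs=64) would then collate these tuples into batches
    # (real text would still need numericalizing and padding before batching)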

    In general, though, you will have two (or more) parallel pipelines of transforms: one for processing your raw items into inputs and one to process your raw items into targets. For instance, here, the pipeline we defined only processes the raw text into inputs. If we want to do text classification, we also have to process the labels into targets.

    For this we need to do two things. First we take the label name from the parent folder. There is a function, parent_label, for this:

    In [ ]:

    lbls = files.map(parent_label)
    lbls

    Out[ ]:

    (#50000) ['pos','pos','pos','pos','pos','pos','pos','pos','pos','pos'...]
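
    Under the hood, parent_label simply returns the name of a path's immediate parent directory; the path below is a made-up example:

    In [ ]:

    parent_label(Path('imdb/train/pos/0_9.txt'))  # returns 'pos'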

    Then we need a Transform that will grab the unique items and build a vocab with them during setup, then transform the string labels into integers when called. fastai provides this for us; it’s called Categorize:

    In [ ]:

    cat = Categorize()
    cat.setup(lbls)
    cat.vocab, cat(lbls[0])

    Out[ ]:

    ((#2) ['neg','pos'], TensorCategory(1))
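
    Like any Transform, Categorize can also reverse the mapping with decode; a quick round-trip check (not in the original notebook):

    In [ ]:

    cat.decode(cat(lbls[0]))  # TensorCategory(1) decodes back to 'pos'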

    To do the whole setup automatically on our list of files, we can create a TfmdLists as before:

    In [ ]:

    tls_y = TfmdLists(files, [parent_label, Categorize()])
    tls_y[0]

    Out[ ]:

    TensorCategory(1)

    But then we end up with two separate objects for our inputs and targets, which is not what we want. This is where Datasets comes to the rescue.

    Datasets

    Datasets will apply two (or more) pipelines in parallel to the same raw object and build a tuple with the result. Like TfmdLists, it will automatically do the setup for us, and when we index into a Datasets, it will return us a tuple with the results of each pipeline:

    In [ ]:

    x_tfms = [Tokenizer.from_folder(path), Numericalize]
    y_tfms = [parent_label, Categorize()]
    dsets = Datasets(files, [x_tfms, y_tfms], splits=splits)
    x,y = dsets.valid[0]
    x[:20],y

    Out[ ]:

    (tensor([ 2, 8, 20, 30, 87, 510, 1570, 12, 408, 379, 4196, 10, 8, 20, 30, 16, 13, 12216, 202, 509]),
     TensorCategory(0))
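
    The "or more" is literal: a Datasets accepts any number of pipelines. As a hypothetical example, adding a third pipeline that keeps the filename turns each item into a 3-tuple:

    In [ ]:

    dsets3 = Datasets(files, [x_tfms, y_tfms, [lambda f: f.name]], splits=splits)
    # dsets3.valid[0] is now (numericalized text, category, filename)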

    It can also decode any processed tuple or show it directly:

    In [ ]:

    t = dsets.valid[0]
    dsets.decode(t)

    Out[ ]:

    ('xxbos xxmaj this movie had horrible lighting and terrible camera movements . xxmaj this movie is a jumpy horror flick with no meaning at all . xxmaj the slashes are totally fake looking . xxmaj it looks like some 17 year - old idiot wrote this movie and a 10 year old kid shot it . xxmaj with the worst acting you can ever find . xxmaj people are tired of knives . xxmaj at least move on to guns or fire . xxmaj it has almost exact lines from " when a xxmaj stranger xxmaj calls " . xxmaj with gruesome killings , only crazy people would enjoy this movie . xxmaj it is obvious the writer does n\'t have kids or even care for them . i mean at show some mercy . xxmaj just to sum it up , this movie is a " b " movie and it sucked . xxmaj just for your own sake , do n\'t even think about wasting your time watching this crappy movie .',
     'neg')

    The last step is to convert our object to a DataLoaders, which can be done with the dataloaders method. Here we need to pass along a special argument to take care of the padding problem (as we saw in the last chapter). This needs to happen just before we batch the elements, so we pass it to before_batch:

    In [ ]:

    dls = dsets.dataloaders(bs=64, before_batch=pad_input)

    dataloaders directly calls DataLoader on each subset of our Datasets. fastai's DataLoader expands the PyTorch class of the same name and is responsible for collating the items from our datasets into batches. It has many points of customization, but the most important ones that you should know are the following three hooks (sketched in code after the list):

    • after_item: Applied on each item after grabbing it inside the dataset. This is the equivalent of item_tfms in DataBlock.
    • before_batch: Applied on the list of items before they are collated. This is the ideal place to pad items to the same size.
    • after_batch: Applied on the batch as a whole after its construction. This is the equivalent of batch_tfms in DataBlock.
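
    Here is a minimal sketch of where each hook runs; the peek_* functions are hypothetical identity callbacks, added only to mark the three stages:

    In [ ]:

    def peek_item(x):
        # after_item: called on each (input, target) tuple as it leaves the dataset
        return x

    def peek_batch(b):
        # after_batch: called once on the fully collated batch
        return b

    dls_hooked = dsets.dataloaders(bs=64, after_item=peek_item,
                                   before_batch=pad_input, after_batch=peek_batch)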

    To conclude, here is the full code necessary to prepare the data for text classification:

    In [ ]:

    tfms = [[Tokenizer.from_folder(path), Numericalize], [parent_label, Categorize]]
    files = get_text_files(path, folders=['train', 'test'])
    splits = GrandparentSplitter(valid_name='test')(files)
    dsets = Datasets(files, tfms, splits=splits)
    dls = dsets.dataloaders(dl_type=SortedDL, before_batch=pad_input)

    The two differences from the previous code are the use of GrandparentSplitter to split our training and validation data, and the dl_type argument, which tells dataloaders to use the SortedDL class of DataLoader instead of the usual one. SortedDL constructs batches by grouping samples of roughly the same length.
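
    A quick, illustrative way to see the effect (not from the original notebook): because similar-length reviews are batched together, a batch is padded only to its own longest member, not to the longest review in the corpus:

    In [ ]:

    x, y = dls.one_batch()
    x.shape  # sequence dimension matches the longest review in this batch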

    This does the exact same thing as our previous DataBlock:

    In [ ]:

    path = untar_data(URLs.IMDB)
    dls = DataBlock(
        blocks=(TextBlock.from_folder(path),CategoryBlock),
        get_y = parent_label,
        get_items=partial(get_text_files, folders=['train', 'test']),
        splitter=GrandparentSplitter(valid_name='test')
    ).dataloaders(path)
    But now, you know how to customize every single piece of it!