
    The factory method TextDataLoaders.from_folder is very convenient when your data is arranged the exact same way as the IMDb dataset, but in practice, that often won’t be the case. The data block API offers more flexibility. As we saw in the last chapter, we can get the same result with:

    In [ ]:

    path = untar_data(URLs.IMDB)
    dls = DataBlock(
        blocks=(TextBlock.from_folder(path),CategoryBlock),
        get_y = parent_label,
        get_items=partial(get_text_files, folders=['train', 'test']),
        splitter=GrandparentSplitter(valid_name='test')
    ).dataloaders(path)

    But it’s sometimes not flexible enough. For debugging purposes, for instance, we might need to apply just parts of the transforms that come with this data block. Or we might want to create a DataLoaders for some application that isn’t directly supported by fastai. In this section, we’ll dig into the pieces that are used inside fastai to implement the data block API. Understanding these will enable you to leverage the power and flexibility of this mid-tier API.

    When we studied tokenization and numericalization in the last chapter, we started by grabbing a bunch of texts:

    In [ ]:

    files = get_text_files(path, folders = ['train', 'test'])
    txts = L(o.open().read() for o in files[:2000])

    We then showed how to tokenize them with a Tokenizer:

    In [ ]:

    tok = Tokenizer.from_folder(path)
    tok.setup(txts)
    toks = txts.map(tok)
    toks[0]

    Out[ ]:

    (#374) ['xxbos','xxmaj','well',',','"','cube','"','(','1997',')'...]

    and how to numericalize, including automatically creating the vocab for our corpus:

    In [ ]:

    num = Numericalize()
    num.setup(toks)
    nums = toks.map(num)
    nums[0][:10]

    Out[ ]:

    tensor([ 2, 8, 76, 10, 23, 3112, 23, 34, 3113, 33])

    The classes also have a decode method. For instance, Numericalize.decode gives us back the string tokens:

    In [ ]:

    nums_dec = num.decode(nums[0][:10]); nums_dec

    and Tokenizer.decode turns this back into a single string (it may not, however, be exactly the same as the original string; this depends on whether the tokenizer is reversible, which the default word tokenizer is not at the time we’re writing this book):

    In [ ]:

    tok.decode(nums_dec)

    Out[ ]:

    'xxbos xxmaj well , " cube " ( 1997 )'

    decode is used by fastai’s show_batch and show_results, as well as some other inference methods, to convert predictions and mini-batches into a human-understandable representation.

    For each of tok and num in the preceding example, we created an object, called the setup method (which trains the tokenizer if needed for tok and creates the vocab for num), applied it to our raw texts (by calling the object as a function), and then finally decoded the result back to an understandable representation. These steps are needed for most data preprocessing tasks, so fastai provides a class that encapsulates them. This is the Transform class. Both Tokenizer and Numericalize are Transforms.

    In general, a Transform is an object that behaves like a function and has an optional setup method that will initialize some inner state (like the vocab inside num) and an optional decode that will reverse the function (this reversal may not be perfect, as we saw with tok).

    A good example of decode is found in the Normalize transform that we saw in <>: to be able to plot the images, its decode method undoes the normalization (i.e., it multiplies by the standard deviation and adds back the mean). On the other hand, data augmentation transforms do not have a decode method, since we want to show the effects on images to make sure the data augmentation is working as we want.
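    To make this concrete, here is a minimal sketch of the arithmetic involved; the normalize/denormalize helpers and the statistics are made up for illustration and are not fastai's actual Normalize implementation:

    # made-up statistics, purely for illustration (not real dataset stats)
    mean, std = 0.5, 0.25

    def normalize(x):   return (x - mean) / std   # what a Normalize-style encodes does
    def denormalize(x): return x * std + mean     # what its decode does: multiply by std, add the mean back

    denormalize(normalize(0.25))   # -> 0.25, the original value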

    A special behavior of Transforms is that they always get applied over tuples. In general, our data is always a tuple (input,target) (sometimes with more than one input or more than one target). When applying a transform on an item like this, such as Resize, we don’t want to resize the tuple as a whole; instead, we want to resize the input (if applicable) and the target (if applicable) separately. It’s the same for batch transforms that do data augmentation: when the input is an image and the target is a segmentation mask, the transform needs to be applied (the same way) to the input and the target.

    We can see this behavior if we pass a tuple of texts to tok:

    In [ ]:

    tok((txts[0], txts[1]))

    Out[ ]:

    ((#374) ['xxbos','xxmaj','well',',','"','cube','"','(','1997',')'...],
     (#207) ['xxbos','xxmaj','conrad','xxmaj','hall','went','out','with','a','bang'...])

    If you want to write a custom transform to apply to your data, the easiest way is to write a function. As you can see in this example, a Transform will only be applied to a matching type, if a type is provided (otherwise it will always be applied). In the following code, the :int in the function signature means that f only gets applied to ints. That’s why tfm(2.0) returns 2.0, but tfm(2) returns 3 here:

    In [ ]:

    def f(x:int): return x+1
    tfm = Transform(f)
    tfm(2),tfm(2.0)

    Out[ ]:

    (3, 2.0)

    Here, f is converted to a Transform with no setup and no decode method. The same thing can be written more concisely by using Transform as a decorator:

    In [ ]:

    @Transform
    def f(x:int): return x+1
    f(2),f(2.0)

    Out[ ]:

    (3, 2.0)

    If you need either setup or decode, you will need to subclass Transform to implement the actual encoding behavior in encodes, then (optionally) the setup behavior in setups and the decoding behavior in decodes:

    In [ ]:

    class NormalizeMean(Transform):
        def setups(self, items): self.mean = sum(items)/len(items)
        def encodes(self, x): return x-self.mean
        def decodes(self, x): return x+self.mean

    Here, NormalizeMean will initialize some state during the setup (the mean of all elements passed), then the transformation is to subtract that mean. For decoding purposes, we implement the reverse of that transformation by adding the mean. Here is an example of NormalizeMean in action:

    In [ ]:

    tfm = NormalizeMean()
    tfm.setup([1,2,3,4,5])
    start = 2
    y = tfm(start)
    z = tfm.decode(y)
    tfm.mean,y,z

    Out[ ]:

    (3.0, -1.0, 2.0)

    Note that in each case the method you call is different from the method you implement:

    Class               | To call                     | To implement
    nn.Module (PyTorch) | () (i.e., call as function) | forward
    Transform           | ()                          | encodes
    Transform           | decode()                    | decodes
    Transform           | setup()                     | setups

    So, for instance, you would never call setups directly, but instead would call setup. The reason for this is that setup does some work before and after calling setups for you. To learn more about Transforms and how you can use them to implement different behavior depending on the type of the input, be sure to check the tutorials in the fastai docs.
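    As a quick illustration of that type-based dispatch, here is a small sketch; the Scale transform is invented for this example, and shows the pattern of defining several encodes methods whose type annotations determine which one is applied to a given input:

    from fastcore.transform import Transform

    class Scale(Transform):
        # the encodes that runs is chosen by the type annotation of its argument
        def encodes(self, x:int):   return x*10
        def encodes(self, x:float): return x/10

    tfm = Scale()
    tfm(2), tfm(2.0)   # -> (20, 0.2)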

    To compose several transforms together, fastai provides the Pipeline class. We define a Pipeline by passing it a list of Transforms; it will then compose the transforms inside it. When you call Pipeline on an object, it will automatically call the transforms inside, in order:

    In [ ]:

    tfms = Pipeline([tok, num])
    t = tfms(txts[0]); t[:20]

    Out[ ]:

    tensor([ 2, 8, 76, 10, 23, 3112, 23, 34, 3113, 33, 10, 8, 4477, 22, 88, 32, 10, 27, 42, 14])

    And you can call decode on the result of your encoding, to get back something you can display and analyze:

    In [ ]:

    tfms.decode(t)[:100]

    Out[ ]: