Mixup works as follows, for each image:

1. Select another image from your dataset at random.
2. Pick a weight at random.
3. Take a weighted average (using the weight from step 2) of the selected image with your image; this will be your independent variable.
4. Take a weighted average (with the same weight) of the selected image's labels with your image's labels; this will be your dependent variable.

In pseudocode, we’re doing this (where `t` is the weight for our weighted average):
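Here is a sketch of that idea in plain Python; `image1` and `target1` stand for the current image and its target, and `dataset` is assumed to return `(image, target)` pairs when indexed (these names are illustrative, not fastai API):

```python
import random

image2, target2 = dataset[random.randint(0, len(dataset) - 1)]  # 1. pick another image at random
t = random.uniform(0.0, 1.0)                                    # 2. pick a weight at random
new_image  = t * image1 + (1 - t) * image2                      # 3. weighted average of the images
new_target = t * target1 + (1 - t) * target2                    # 4. weighted average of the targets
```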

For this to work, our targets need to be one-hot encoded. The paper describes this using the following equations, where $\lambda$ plays the same role as `t` in our pseudocode:
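In the notation of the paper, with $(x_i, y_i)$ and $(x_j, y_j)$ being the two randomly drawn samples, the mixed input $\tilde{x}$ and target $\tilde{y}$ are built as:

$$\tilde{x} = \lambda x_i + (1 - \lambda)\,x_j$$

$$\tilde{y} = \lambda y_i + (1 - \lambda)\,y_j$$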

Sidebar: Papers and Math

We’re going to be looking at more and more research papers from here on in the book. Now that you have the basic jargon, you might be surprised to discover how much of them you can understand, with a little practice! One issue you’ll notice is that Greek letters, such as $\lambda$, appear in most papers. It’s a very good idea to learn the names of all the Greek letters, since otherwise it’s very hard to read the papers to yourself and remember them (or to read code based on them, since code often uses the names of the Greek letters spelled out, such as `lambda`).

End sidebar

The figure below shows what it looks like when we take a linear combination of images, as done in Mixup.

Figure: a church, a gas station, and their linear combination (0.3 × church + 0.7 × gas station).

The third image is built by adding 0.3 times the first one and 0.7 times the second. In this example, should the model predict “church” or “gas station”? The right answer is 30% church and 70% gas station, since that’s what we’ll get if we take the linear combination of the one-hot-encoded targets. For instance, suppose we have 10 classes, “church” is represented by the index 2, and “gas station” is represented by the index 7; then the one-hot-encoded representations are:

`[0, 0, 1, 0, 0, 0, 0, 0, 0, 0]` and `[0, 0, 0, 0, 0, 0, 0, 1, 0, 0]`

so our final target is `[0, 0, 0.3, 0, 0, 0, 0, 0.7, 0, 0]`.
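If you want to check that arithmetic, here is a small PyTorch sketch; the class indices and the 0.3 weight are the ones from the example above, and the variable names are just illustrative:

```python
import torch
import torch.nn.functional as F

n_classes = 10
church, gas_station = 2, 7   # class indices from the example above
t = 0.3                      # weight on the church image

# One-hot encode each label, then take the same weighted average as the images
y_church = F.one_hot(torch.tensor(church), n_classes).float()
y_gas    = F.one_hot(torch.tensor(gas_station), n_classes).float()

mixed_target = t * y_church + (1 - t) * y_gas
print(mixed_target)   # 0.3 at index 2, 0.7 at index 7, zeros elsewhere
```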

Here is how we train a model with Mixup:
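A minimal sketch of what that can look like with fastai, assuming the `dls` `DataLoaders` built earlier in the chapter; the key piece is passing the `MixUp` callback to the `Learner` (the architecture and hyperparameters here are just illustrative):

```python
from fastai.vision.all import *
from fastai.callback.mixup import MixUp  # usually already exported by the star import

# `dls` is assumed to be the Imagenette DataLoaders created earlier in the chapter
model = xresnet50(n_out=dls.c)                   # train from scratch, as elsewhere in this chapter
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(),
                metrics=accuracy, cbs=MixUp())   # applies Mixup to each training batch
learn.fit_one_cycle(5, 3e-3)
```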

What happens when we train a model with data that’s “mixed up” in this way? Clearly, it’s going to be harder to train, because it’s harder to see what’s in each image. And the model has to predict two labels per image, rather than just one, as well as figure out how much each one is weighted. Overfitting seems less likely to be a problem, however, because we’re not showing the same image in each epoch, but are instead showing a random combination of two images.

Mixup requires far more epochs of training to get better accuracy, compared to the other augmentation approaches we’ve seen. You can try training Imagenette with and without Mixup by using the examples/train_imagenette.py script in the fastai repo. At the time of writing, the leaderboard in the Imagenette repo shows that Mixup is used for all leading results for trainings of >80 epochs, while for fewer epochs Mixup is not being used. This is in line with our experience of using Mixup too.

One of the reasons that Mixup is so exciting is that it can be applied to types of data other than photos. In fact, some people have even shown good results by using Mixup on activations inside their models, not just on inputs; this allows Mixup to be used for NLP and other data types too.

There’s another subtle issue that Mixup deals with for us, which is that it’s not actually possible with the models we’ve seen before for our loss to ever be perfect. The problem is that our labels are 1s and 0s, but the outputs of softmax and sigmoid can never equal 1 or 0. This means training our model pushes our activations ever closer to those values, such that the more epochs we do, the more extreme our activations become.
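To see why those outputs can never be exactly 0 or 1, note that every term in the softmax (and sigmoid) is a positive exponential, so for any finite activations $z$ the result is strictly between 0 and 1:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}} \in (0, 1), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \in (0, 1)$$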

With Mixup, the targets are only exactly 0 or 1 when we happen to mix two images of the same class; the rest of the time they are soft values like the 0.3 and 0.7 from our church and gas station example, so the model is no longer pushed toward ever more extreme activations.

One issue with this, however, is that Mixup is “accidentally” making the labels bigger than 0, or smaller than 1. That is to say, we’re not explicitly telling our model that we want to change the labels in this way. So, if we want to make the labels closer to, or further away from, 0 and 1, we have to change the amount of Mixup, which also changes the amount of data augmentation, and that might not be what we want. There is, however, a way to handle this more directly, which is to use label smoothing.