How do we decide which parameters should have a high learning rate and which should not? We can look at the gradients to get an idea. If a parameter’s gradients have been close to zero for a while, that parameter will need a higher learning rate because the loss is flat. On the other hand, if the gradients are all over the place, we should probably be careful and pick a low learning rate to avoid divergence. We can’t just average the gradients to see if they’re changing a lot, because the average of a large positive and a large negative number is close to zero. Instead, we can use the usual trick of either taking the absolute value or the squared values (and then taking the square root after the mean).
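
    For instance, a quick check in PyTorch (with made-up gradient values) shows how the plain mean can hide large gradients while the root of the mean of the squares does not:

        import torch

        grads = torch.tensor([0.9, -1.1, 1.0, -0.8])
        print(grads.mean())                # ~0.0: the signs cancel out
        print(grads.pow(2).mean().sqrt())  # ~0.96: the gradients are actually large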

    Once again, to determine the general tendency behind the noise, we will use a moving average—specifically the moving average of the gradients squared. Then we will update the corresponding weight by using the current gradient (for the direction) divided by the square root of this moving average (that way if it’s low, the effective learning rate will be higher, and if it’s high, the effective learning rate will be lower):
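
    Translated into code, the update for a single parameter might look like the following sketch (illustrative names, not fastai's actual implementation): alpha is the moving-average coefficient, typically around 0.99, and eps is a small constant, typically 1e-8, that prevents division by zero when the average is tiny.

        import torch

        def rmsprop_update(w, square_avg, lr, alpha=0.99, eps=1e-8):
            # Moving average of the squared gradients
            square_avg = alpha * square_avg + (1 - alpha) * w.grad**2
            # A small average (flat loss) gives a bigger effective step,
            # a large average (noisy gradients) gives a smaller one
            w.data -= lr * w.grad / (square_avg.sqrt() + eps)
            return square_avg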

    We can add this to our optimizer by doing much the same thing we did for the gradient moving average, but squaring the gradients first:

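    Here is a sketch of such a stats function, assuming the callback-based Optimizer class built earlier in the chapter (where each stats function returns the per-parameter state to keep); the names sqr_mom and sqr_avg are illustrative:

        import torch

        def average_sqr_grad(p, sqr_mom, sqr_avg=None, **kwargs):
            # Moving average of the squared gradients for parameter p;
            # sqr_mom plays the role of alpha above
            if sqr_avg is None: sqr_avg = torch.zeros_like(p.grad.data)
            return {'sqr_avg': sqr_mom*sqr_avg + (1-sqr_mom)*p.grad.data**2}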

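    The step itself then divides the gradient by the square root of this average before applying the learning rate. The sketch below again assumes the Optimizer class from earlier in the chapter; the hyperparameter values shown are just common defaults:

        from functools import partial

        def rms_prop_step(p, lr, sqr_avg, eps, **kwargs):
            # Scale each parameter's step by the root of its squared-gradient average
            denom = sqr_avg.sqrt().add_(eps)
            p.data.addcdiv_(p.grad.data, denom, value=-lr)

        opt_func = partial(Optimizer, cbs=[average_sqr_grad, rms_prop_step],
                           sqr_mom=0.99, eps=1e-7)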

    Let’s try it out:
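
    For example, with the get_learner helper used for the earlier optimizers (an assumption here; any Learner built with this opt_func would do):

        learn = get_learner(opt_func=opt_func)
        learn.fit_one_cycle(3, 0.003)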

    Much better! Now we just have to bring these ideas together, and we have Adam, fastai’s default optimizer.