There is one other difference in how Adam calculates moving averages: it takes the unbiased moving average of the gradients.
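Since the moving average starts at zero, its first few values are biased toward zero; Adam divides by a correction factor that depends on the iteration number to undo this. Here is a minimal sketch of that idea; the helper name debiased_avg and its argument names are purely illustrative (beta is the decay rate and i is the zero-based iteration counter):

```python
def debiased_avg(avg, grad, beta, i):
    "One step of an exponential moving average, plus Adam's bias correction."
    avg = beta * avg + (1 - beta) * grad      # ordinary exponential moving average
    unbias_avg = avg / (1 - beta**(i + 1))    # divide out the bias toward zero at the start
    return avg, unbias_avg
```

Since beta is less than 1, the divisor 1 - beta**(i+1) is well below 1 during the first iterations (so the average is scaled back up toward the size of the actual gradients) and approaches 1 as training goes on, at which point the correction effectively disappears.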

Putting everything together, the update step moves each parameter along the debiased moving average of its gradients, divided by the square root of the moving average of its squared gradients.
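Below is a minimal sketch of one such step for a single parameter tensor, assuming PyTorch tensors. The function name adam_step and its argument names are made up for illustration and are not fastai's internal API; the original Adam paper also bias-corrects the squared-gradient average, a refinement omitted here for brevity.

```python
import torch

def adam_step(p, grad, avg, sqr_avg, i, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    "One Adam update for the parameter tensor p; returns the new parameter and state."
    avg     = beta1 * avg     + (1 - beta1) * grad       # moving average of gradients (momentum)
    sqr_avg = beta2 * sqr_avg + (1 - beta2) * grad**2    # moving average of squared gradients
    unbias_avg = avg / (1 - beta1**(i + 1))              # bias-corrected momentum term
    # Adaptive step: each parameter's step is scaled by the size of its recent gradients.
    new_p = p - lr * unbias_avg / (torch.sqrt(sqr_avg) + eps)
    return new_p, avg, sqr_avg
```

The defaults shown (beta1=0.9, beta2=0.999, and eps=1e-8) are the ones from the original Adam paper.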

In fastai, Adam is the default optimizer, since it allows faster training, but we’ve found that beta2=0.99 is better suited to the type of schedule we are using. beta1 is the momentum parameter, which we specify with the moms argument in our call to fit_one_cycle. As for eps, fastai uses a default of 1e-5. eps is not just useful for numerical stability: a higher eps limits the maximum value of the adjusted learning rate. To take an extreme example, if eps is 1, then the adjusted learning rate will never be higher than the base learning rate.
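To see why, note that in the sketch above the divisor sqrt(sqr_avg) + eps can never be smaller than eps, so the adjusted learning rate is bounded by lr / eps. A quick, purely illustrative check of that bound:

```python
import math

lr = 0.01
for eps in (1e-8, 1e-5, 1.0):
    # The divisor sqrt(sqr_avg) + eps is smallest when sqr_avg is ~0,
    # so the adjusted learning rate can never exceed lr / eps.
    max_adjusted_lr = lr / (math.sqrt(0.0) + eps)
    print(f"eps={eps:g}: adjusted learning rate <= {max_adjusted_lr:g}")
```

With eps=1, the bound is exactly the base learning rate, while the tiny default of 1e-8 leaves the adjusted learning rate essentially unbounded for parameters whose gradients have been close to zero.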

One thing that changes when we go from SGD to Adam is the way we apply weight decay, and it can have important consequences.