As we saw when training an MNIST model from scratch, zero_grad just loops through the parameters of the model and sets the gradients to zero. It also calls detach_, which removes any history of gradient computation, since it won’t be needed after zero_grad.
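
In code, that pair of operations can be sketched as a small standalone function (the name and the params argument here are illustrative, not the library’s actual method signature):

    def zero_grad(params):
        # Simplified sketch: `params` is any iterable of tensors that may hold a .grad
        for p in params:
            if p.grad is not None:
                p.grad.detach_()  # drop the gradient's computation history
                p.grad.zero_()    # reset the gradient to zero in place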

The more interesting method is step, which loops through the callbacks (cbs) and calls them to update the parameters (the _update function just calls state.update if there’s anything returned by cb). As you can see, Optimizer doesn’t actually do any SGD steps itself. Let’s see how we can add SGD to Optimizer.

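Here is a minimal sketch of such a callback: a plain SGD step that subtracts the learning rate times the gradient from the parameter in place. The name sgd_cb and the exact signature are assumptions based on the description above, with any extra optimizer state arriving via **kwargs:

    def sgd_cb(p, lr, **kwargs):
        # One vanilla SGD step, in place: p <- p - lr * p.grad
        p.data.add_(p.grad.data, alpha=-lr)

Because this callback returns nothing, _update has no state to record for it.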

We can pass this to Optimizer using the cbs parameter; we’ll need to use partial since Learner will call this function to create our optimizer later.
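A minimal sketch, assuming the sgd_cb callback defined above:

    from functools import partial

    # partial "bakes in" the cbs argument; the Learner can later call opt_func
    # with the model's parameters to construct an Optimizer that does plain SGD.
    opt_func = partial(Optimizer, cbs=[sgd_cb])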

Let’s see if this trains:

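(The snippet below is a sketch: get_learner is assumed to be the convenience function defined earlier in the chapter, and the epoch count and learning rate are only illustrative.)

    learn = get_learner(opt_func=opt_func)  # pass our SGD-only optimizer factory to the Learner
    learn.fit(3, 0.03)                      # a few epochs at an illustrative learning rate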