Here are a few things to remember:

    • A neural net is basically a bunch of matrix multiplications with nonlinearities in between (see the first sketch after this list).
    • Two tensors are broadcastable if, comparing their dimensions from the last one backward, each pair of dimensions matches (they are the same, or one of them is 1). To make tensors broadcastable, we may need to add dimensions of size 1 with unsqueeze or a None index (second sketch below).
    • The backward pass is the chain rule applied multiple times, computing the gradients starting from the output of our model and working backward, one layer at a time (third sketch below).
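
A minimal sketch of the first point, with made-up shapes and weight names (`x`, `w1`, `b1`, `w2`, `b2` are illustrative): a two-layer net is just two matrix multiplications with a ReLU in between.

```python
import torch

x  = torch.randn(64, 10)   # a batch of 64 inputs with 10 features
w1 = torch.randn(10, 50)   # first layer weights
b1 = torch.zeros(50)       # first layer bias
w2 = torch.randn(50, 1)    # second layer weights
b2 = torch.zeros(1)        # second layer bias

h   = x @ w1 + b1          # first matrix multiplication (a linear layer)
h   = h.clamp_min(0.)      # the nonlinearity (ReLU)
out = h @ w2 + b2          # second matrix multiplication
```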
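
A sketch of the broadcasting rule; the tensors `m`, `v`, and `c` here are made up for illustration.

```python
import torch

m = torch.randn(3, 4)       # shape (3, 4)
v = torch.randn(4)          # shape (4,): the last dimensions match, so this broadcasts
print((m + v).shape)        # torch.Size([3, 4])

c = torch.randn(3)          # shape (3,): 3 vs 4 in the last dimension, not broadcastable
c_col = c.unsqueeze(1)      # shape (3, 1): add a trailing dimension of size 1
c_col = c[:, None]          # same thing, using a None index
print((m + c_col).shape)    # torch.Size([3, 4])
```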
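
Finally, a sketch of the backward pass for one linear layer, a ReLU, and an MSE loss, with the chain rule written out by hand; the names (`inp`, `targ`, `w`, `b`) are assumptions made for this example, and in practice PyTorch's autograd computes these gradients for us.

```python
import torch

inp  = torch.randn(64, 10)                 # inputs
targ = torch.randn(64, 1)                  # targets
w, b = torch.randn(10, 1), torch.zeros(1)  # weights and bias

# Forward pass, keeping the intermediate results needed for the gradients
lin  = inp @ w + b                         # linear layer
out  = lin.clamp_min(0.)                   # ReLU
loss = ((out - targ)**2).mean()            # MSE loss

# Backward pass: the chain rule, starting from the output and going back
out_g = 2. * (out - targ) / targ.numel()   # d loss / d out
lin_g = out_g * (lin > 0).float()          # d loss / d lin (through the ReLU)
w_g   = inp.t() @ lin_g                    # d loss / d w
b_g   = lin_g.sum(0)                       # d loss / d b
```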