Visualising SGD with Momentum, Adam and Learning Rate Annealing

[Full code on my github here . To see it from mobile, once you land on github, click on “Desktop Version” ] At the very heart of the model training procedure of almost every modern machine learning or deep learning algorithm applied to big enough data, you’ll find Stochastic Gradient Descent (SGD). The best part of SGD is its simplicity. As Francois Chollet would say, it is made of a small set of high school-level ideas put together. But it does not make it less powerful and beautiful. In this post we’ll implement from scratch SGD and some optimizations around it like Momentum, Adam and learning rate annealing, and we’ll apply it on some very simple generated toy data in order to visually compare them together with some animated graph in python. In the post, we’ll only show some snippets of a subset of the code, check here for the full code.

Vanilla SGD

First we generate some data. We’ll take on purpose the simplest and almost smallest data set ever, by simply generating 20 random points from a linear function ax+b.

np.random.seed(7)
a_real = 1.5
b_real = -28
xlim = [-10,10]
x_gen = np.random.randint(low=xlim[0], high=xlim[1], size=20)
y_real = a_real*x_gen + b_real
plt.plot(x_gen, y_real, 'bo')

So we’ll start with some initial a and b (say a=0 and b=0) and the goal of SGD is to find alone the real a and b (a=1.5 and b=-28 in our example) which were used to generate those few data points. To do so, we simply need to minimize a cost function on the data, in our case

∑(𝑥,𝑦)∈𝑑𝑎𝑡𝑎𝑠𝑒𝑡(𝑎𝑥+𝑏−𝑦)2 . SGD achieves that by simply following the negative of the gradient (negative because the gradient is the direction of the steepest increase of the function and we’re looking for the minimum of the cost function). So basically, the vanilla SGD parameter update is simply:

param += -lr*dx

with lr being the learning rate, and dx being the gradient of the cost function relative to the corresponding param you want to update (in our case, only a or b). How to compute dx ? if our hypothesis function was a deep neural network you could simply apply the chain rule multiple times (a.k.a backpropagation) via e.g. pytorch’s autograd , but in our case we can simply compute the analytical gradient of the cost function w.r.t. a and b, or just use wolfram alpha (like this ) if you’re lazy. This inevitably leads you to a full implementation of SGD (for our example) in less than 10 lines of code:

def sgd(X,Y, a, b, lr, epochs=1):
    for e in range(epochs):
        for x_ , y_ in zip(X,Y):
            a = a - lr*2*x_*(a*x_+b-y_)
            b = b - lr*2*(a*x_+b-y_)        
    return a,b
a,b = gradient_descent(x_gen,y_real,0,0,0.001,epochs=150) 
print("a={:.3f} , b={:.3f}".format(a,b))

which prints:

a=1.501 , b=-27.913

Not that far from our real a=1.5 and b= -28 . Note that we could make this code much more efficient by vectorizing it but we keep it dumb on purpose to observe easily how simple it is (and also add metrics to be able to monitor and visualize the gradients steps, c.f. later section). In the code we treat one data point at a time (batch size = 1) , and going over all the data points multiple times (as many time as specified in the epoch parameter). We can keep track of the a and b updates after each epoch and animate the evolution (see the section named “Simple standalone Animation code” in the notebook to see how to generate such an animation):

Note how a and b are converging slowly but surely to the real values (of a = 1.5 and b = -28).

Momentum

So, the vanilla gradient descent is converging surely towards the optimal, but also rather slowly as we see above, taking the same small step (in the right direction) at every iteration. More than that, here we have a nice and easy convex cost function, but in case of ravines, SGD becomes even more slower by taking hesitant steps toward the optimum. To improve that, the momentum update is taking advantage of the history of previous gradient steps directions in order to make more aggressive steps when the gradient direction seems stable, and slows it down when it is starting to go in multiple directions, inspired from the velocity principle in physics. So, instead of of doing the vanilla update rule of SGD (param += -lr*dx) , the momentum update is actually replacing the dx part by a decaying average of previous gradients. The average is controlled by a parameter beta , and the gradient is replaced by a linear interpolation of previous gradient’s update and current gradient. It gives the simple following code:

def sgd_momentum(X,Y, a, b, lr,  beta, epochs=1):
    avg_ga = 0
    avg_gb = 0
    for e in range(epochs):
        for x_ , y_ in zip(X,Y):
            de_da = 2*x_*(a*x_+b-y_)
            de_db = 2*(a*x_+b-y_) 
            avg_ga = avg_ga*beta + (1.0-beta)*de_da
            avg_gb = avg_gb*beta + (1.0-beta)*de_db
            a = a - lr*avg_ga
            b = b - lr*avg_gb        
    return a,b

What is interesting is to compare the evolution of the gradient per method, to see if we do see the expected smoother evolution of the gradient update each iterations. This is how compares the vanilla SGD v.s momentum gradient updates on the first learned parameter (the a):

And on the second one (the b):

Yeah, i know it looks a bit like a

, but the point is that the momentum update is doing its job: it is not going crazy in all direction like the raw gradient, but it is smoothing it, based on previous iterations. For such a simple convex error function like ours, it does not really matters (and it won’t make a dramatic difference in terms of how fast we converge as we’ll see below), but we can easily understand how in a very bumpy loss function surface, this could be a great advantage to surf around those rather than entering eyes closed inside each small ravine.

Adam, Learning Rate Annealing and other SGD optimisations

In the same spirit as the momentum update, many different methods are exposing multiple variations of how to modify the SGD update rule, in order to converge faster and better to the optimal parameters that the model is trying to learn. We won’t get into details of all those great optimizations because there are already excellent posts/video around that topic, e.g. from Sebastian Rudder here or Andrej Karpathy here or the fantastic video (and corresponding excel file) by Jeremy Howard . But we’ll do mention a couple of important concepts that those methods are using:

Per coordinate adaptive learning rate

The idea is that the size of the steps taken in gradient descent should be adapted for each learned parameter separately. Intuitively, the idea would be to make the learning rate smaller as a function of how much data was observed for the specific corresponding parameter. First popular proposed method is AdaGrad and then Adadelta and RMSProp are some evolution of it, then Adam (Adaptive moment estimation) is combining that idea with momentum, then other methods are improving on top of Adam etc.. Again, c.f. the links above and the excel file, they are the best reference for all those. You can also find an illustration of how to apply and implement (in few lines of code) this concept in the context of logistic regression in my blog post here.

Learning rate annealing (with restarts)

This simple concept is also one the most effective tricks you can find in the deep learning world. The idea is that when you start your search of the optimal parameter, you can afford doing some big jump, but the more you progress towards your minimum, the more you want to make smaller steps to nail it down (and not miss it by too big steps), and thus you progressively reduce your learning rate. You can also combine that idea by reinitialising your learning rate to its highest value from time to time (this is called SGD with restarts) to find more general optimums. More on that in the fantastic fast.ai course and also in that great blog post.

Implementation and Visual comparisons on our simple example

In that notebook, I’ve implemented a few functions allowing to simulate, visualize, debug, investigate, experiment with few variations of SGD: vanilla, momentum, adam and adam with learning rate annealing . The implementation can easily be extended with any function (not only a linear function as in our example), only the derivative needs to be provided as a function (which itself could be automated using e.g. pytorch’s autograd), although the visualisations are adapted only for functions in the domain

ℝ𝑛→ℝ . Below, we’re showing some of the output given when calling this code with:

params,methods,x,y, \
loss_evolution_list,lr_evolution_list = \
compare_methods_and_plot([1.5,-28],[1,1],lin,
    linear_gradients,[-10,10],
    size_gen_data=30,
    epochs = 50,
    methods = ["SGD","Momentum","Adam","AdamAnn"],
    methods_optim_params = [[0.001] , [0.001,0.95] , [1,0.7,0.9] , [1,0.7,0.9] ],
    anim_interval_ms=100,ylim_anim=[-50,15])
fig, ax = plt.subplots(figsize=(12, 6))    
anim = draw_animation(params,methods, lin,x,y,75,ylim=[-50,15])
HTML(anim.to_jshtml())

First let’s observe the animation:

We can note few things:

See how fast Adam and Adam with annealing are converging compared to vanilla SGD or SGD with momentum.
However, Adam without annealing is not stable and suffers from some “parkinson” side effects. Probably because in that case the initial learning rate remains too high at the end for it to stay around, while with annealing, once it converged, it stays there because after enough iterations, the learning rate becomes really tiny.
The Momentum update is not really helping to converge faster in that specific example. This is because our example is using a dead simple convex cost function that is easy to optimize anyway. But in more complex cost functions ( like the ones represented by neural nets) the momentum update can provide much more added value.
By tuning the initial learning rates for each method, we could potentially make them converge faster, but here we took standard initial value for each method for the sake of the comparison.

Let’s observe the loss evolution over iterations between methods:

We can see how fast Adam is converging to a minimal cost compared to vanilla SGD of SGD with momentum. Momentum also seems to converge as fast as (even slightly more slowly than) vanilla SGD, but again, this is due to the dead simple function we used here. In a neural net, it would already proved much more useful in most cases. The code is also generating the evolution of each learned param over iterations. We can e.g. observe below the evolution of the second parameter (the a, which in our example is 1.5). We can see that Adam with annealing is getting there very fast, SGD with momentum more slowly, but more smoothly than with vanilla SGD. And we can observe how Adam without annealing is suffering for high oscillations (which was also observed in the animation).

Again, one should obviously not generalise around the added value of each method based on that simple example. Here we just wanted to illustrate the concepts, and even on such a toy set, we can understand and observe the core ideas behind those simple yet powerful methods. That’s it for now. Hope you enjoyed that post. Feel free to comment/ask questions and/or use the code for your own experiments.