
Vanilla SGD
First we generate some data. We'll deliberately take just about the simplest and smallest data set ever, by generating 20 random points from a linear function ax+b.
```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(7)
a_real = 1.5
b_real = -28
xlim = [-10, 10]
x_gen = np.random.randint(low=xlim[0], high=xlim[1], size=20)
y_real = a_real*x_gen + b_real
plt.plot(x_gen, y_real, 'bo')
```
So we'll start with some initial a and b (say a=0 and b=0), and the goal of SGD is to find, on its own, the real a and b (a=1.5 and b=-28 in our example) which were used to generate those few data points. To do so, we simply need to minimize a cost function on the data, in our case the squared error (a*x + b - y)^2 on each data point, by repeatedly applying the vanilla update rule

param += -lr*dx

where dx is the gradient of the cost with respect to the parameter.
```python
def sgd(X, Y, a, b, lr, epochs=1):
    for e in range(epochs):
        for x_, y_ in zip(X, Y):
            a = a - lr*2*x_*(a*x_+b-y_)  # gradient of the squared error w.r.t. a
            b = b - lr*2*(a*x_+b-y_)     # gradient of the squared error w.r.t. b
    return a, b

a, b = sgd(x_gen, y_real, 0, 0, 0.001, epochs=150)
print("a={:.3f} , b={:.3f}".format(a, b))
```
a=1.501 , b=-27.913 |
Note how a and b converge slowly but surely to the real values (a = 1.5 and b = -28).
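As a quick sanity check of the analytic gradients used above (this snippet is my own addition, not part of the original post), we can compare them against central finite differences at an arbitrary point:

```python
def cost(a, b, x, y):
    return (a*x + b - y)**2

# analytic gradients, exactly as used in the SGD loop above
def grads(a, b, x, y):
    return 2*x*(a*x + b - y), 2*(a*x + b - y)

a, b, x, y = 0.3, -1.0, 4.0, 2.0
eps = 1e-6
# central finite differences of the cost w.r.t. a and b
num_da = (cost(a+eps, b, x, y) - cost(a-eps, b, x, y)) / (2*eps)
num_db = (cost(a, b+eps, x, y) - cost(a, b-eps, x, y)) / (2*eps)
an_da, an_db = grads(a, b, x, y)
```

Both pairs agree to within numerical precision, so the hand-derived gradients are correct.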
Momentum
So, vanilla gradient descent converges surely towards the optimum, but also rather slowly as we saw above, taking the same small step (in the right direction) at every iteration. Moreover, here we have a nice and easy convex cost function, but in the case of ravines, SGD becomes even slower, taking hesitant steps toward the optimum. To improve on that, the momentum update takes advantage of the history of previous gradient directions in order to make more aggressive steps when the gradient direction seems stable, and to slow down when it starts to go in multiple directions, inspired by the velocity principle in physics. So, instead of the vanilla SGD update rule (param += -lr*dx), the momentum update replaces the dx part by a decaying average of previous gradients. The average is controlled by a parameter beta, and the gradient is replaced by a linear interpolation of the previous gradients' average and the current gradient. It gives the following simple code:
```python
def sgd_momentum(X, Y, a, b, lr, beta, epochs=1):
    avg_ga = 0
    avg_gb = 0
    for e in range(epochs):
        for x_, y_ in zip(X, Y):
            de_da = 2*x_*(a*x_+b-y_)
            de_db = 2*(a*x_+b-y_)
            avg_ga = avg_ga*beta + (1.0-beta)*de_da
            avg_gb = avg_gb*beta + (1.0-beta)*de_db
            a = a - lr*avg_ga
            b = b - lr*avg_gb
    return a, b
```
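To see it in action (this usage snippet is mine, not from the original post; beta=0.9 is an assumed value, and the data generation is repeated so the snippet is self-contained):

```python
import numpy as np

def sgd_momentum(X, Y, a, b, lr, beta, epochs=1):
    avg_ga = 0
    avg_gb = 0
    for e in range(epochs):
        for x_, y_ in zip(X, Y):
            de_da = 2*x_*(a*x_+b-y_)
            de_db = 2*(a*x_+b-y_)
            avg_ga = avg_ga*beta + (1.0-beta)*de_da
            avg_gb = avg_gb*beta + (1.0-beta)*de_db
            a = a - lr*avg_ga
            b = b - lr*avg_gb
    return a, b

# same toy data as above
np.random.seed(7)
x_gen = np.random.randint(low=-10, high=10, size=20)
y_real = 1.5*x_gen - 28

a, b = sgd_momentum(x_gen, y_real, 0, 0, lr=0.001, beta=0.9, epochs=150)
print("a={:.3f} , b={:.3f}".format(a, b))
```

On this noiseless convex problem the result ends up close to the same place as vanilla SGD, which is exactly the point made below.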
Adam, Learning Rate Annealing and other SGD optimisations
In the same spirit as the momentum update, many different methods propose variations of the SGD update rule in order to converge faster and better to the optimal parameters the model is trying to learn. We won't get into the details of all those great optimisations because there are already excellent posts/videos on the topic, e.g. from Sebastian Ruder here, Andrej Karpathy here, or the fantastic video (and corresponding excel file) by Jeremy Howard. But we will mention a couple of important concepts that those methods use:

Per coordinate adaptive learning rate
The idea is that the size of the steps taken in gradient descent should be adapted for each learned parameter separately. Intuitively, the learning rate should become smaller as a function of how much data was observed for the specific corresponding parameter. The first popular method proposed was AdaGrad; Adadelta and RMSProp are evolutions of it; Adam (Adaptive Moment Estimation) then combines that idea with momentum, and other methods improve on top of Adam, etc. Again, cf. the links above and the excel file: they are the best references for all of those. You can also find an illustration of how to apply and implement this concept (in a few lines of code) in the context of logistic regression in my blog post here.

Learning rate annealing (with restarts)
This simple concept is also one of the most effective tricks you can find in the deep learning world. The idea is that when you start your search for the optimal parameters, you can afford to make some big jumps, but the more you progress towards your minimum, the smaller the steps you want to take in order to nail it down (and not miss it with steps that are too big), and thus you progressively reduce your learning rate. You can also combine that idea with reinitialising your learning rate to its highest value from time to time (this is called SGD with restarts) to find more general optima. More on that in the fantastic fast.ai course and also in that great blog post.

Implementation and Visual comparisons on our simple example
In that notebook, I've implemented a few functions allowing one to simulate, visualize, debug, and experiment with a few variations of SGD: vanilla, momentum, Adam, and Adam with learning rate annealing. The implementation can easily be extended with any function (not only a linear function as in our example); only the derivative needs to be provided as a function (which itself could be automated using e.g. pytorch's autograd), although the visualisations are adapted only for functions from ℝ to ℝ.
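The "derivative provided as a function" idea can be sketched with a generic loop that only touches the model through a gradient callback; the names quad, quad_gradients and sgd_generic below are my own illustration (with a toy quadratic instead of a line), not the notebook's actual API:

```python
import numpy as np

# model: y = p0*x**2 + p1  (a toy quadratic instead of a line)
def quad(params, x):
    return params[0]*x**2 + params[1]

# gradient of the squared error w.r.t. each parameter
def quad_gradients(params, x, y):
    err = quad(params, x) - y
    return np.array([2*err*x**2, 2*err])

# generic per-sample SGD: knows nothing about the model except grad_fn
def sgd_generic(X, Y, params, grad_fn, lr, epochs=1):
    params = np.array(params, dtype=float)
    for _ in range(epochs):
        for x_, y_ in zip(X, Y):
            params -= lr*grad_fn(params, x_, y_)
    return params

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=30)
y = 0.8*x**2 + 1.5          # noiseless data from known parameters
params = sgd_generic(x, y, [0.0, 0.0], quad_gradients, lr=0.01, epochs=200)
print(params)
```

Swapping in a different model only requires writing a new pair of model/gradient functions.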
```python
params, methods, x, y, \
loss_evolution_list, lr_evolution_list = \
    compare_methods_and_plot([1.5,-28], [1,1], lin, linear_gradients,
                             [-10,10], size_gen_data=30, epochs=50,
                             methods=["SGD","Momentum","Adam","AdamAnn"],
                             methods_optim_params=[[0.001], [0.001,0.95],
                                                   [1,0.7,0.9], [1,0.7,0.9]],
                             anim_interval_ms=100, ylim_anim=[-50,15])
fig, ax = plt.subplots(figsize=(12, 6))
anim = draw_animation(params, methods, lin, x, y, 75, ylim=[-50,15])
HTML(anim.to_jshtml())
```
We can note a few things:
- See how fast Adam and Adam with annealing are converging compared to vanilla SGD or SGD with momentum.
- However, Adam without annealing is not stable and suffers from some "Parkinson" side effects, probably because the initial learning rate remains too high at the end for it to settle, while with annealing, once it has converged, it stays there, because after enough iterations the learning rate becomes really tiny.
- The momentum update is not really helping to converge faster in that specific example. This is because our example uses a dead simple convex cost function that is easy to optimize anyway. But for more complex cost functions (like the ones represented by neural nets), the momentum update can provide much more added value.
- By tuning the initial learning rates for each method, we could potentially make them converge faster, but here we took standard initial values for each method for the sake of the comparison.
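As a rough sketch of the kind of update rules being compared (this is my own minimal Adam-style implementation for our two-parameter line, with an optional per-epoch decay of the learning rate standing in for annealing; the hyperparameters beta1, beta2, eps and lr_decay are assumed defaults, not the notebook's exact values):

```python
import numpy as np

def adam_linear(X, Y, a, b, lr, beta1=0.9, beta2=0.999, eps=1e-8,
                lr_decay=1.0, epochs=1):
    m = np.zeros(2)   # first-moment (momentum-like) estimates for [a, b]
    v = np.zeros(2)   # second-moment estimates, one per parameter
    t = 0
    for _ in range(epochs):
        for x_, y_ in zip(X, Y):
            t += 1
            g = np.array([2*x_*(a*x_+b-y_), 2*(a*x_+b-y_)])
            m = beta1*m + (1-beta1)*g
            v = beta2*v + (1-beta2)*g**2
            m_hat = m/(1-beta1**t)          # bias correction
            v_hat = v/(1-beta2**t)
            a -= lr*m_hat[0]/(np.sqrt(v_hat[0])+eps)
            b -= lr*m_hat[1]/(np.sqrt(v_hat[1])+eps)
        lr *= lr_decay                      # annealing: shrink lr each epoch
    return a, b

np.random.seed(7)
x_gen = np.random.randint(low=-10, high=10, size=20)
y_real = 1.5*x_gen - 28
a, b = adam_linear(x_gen, y_real, 0.0, 0.0, lr=1.0, lr_decay=0.9, epochs=50)
```

With lr_decay=1.0 this reduces to plain Adam; values below 1 shrink the learning rate every epoch, which is what keeps the final parameters from oscillating.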
We can see how fast Adam converges to a minimal cost compared to vanilla SGD or SGD with momentum. Momentum also seems to converge about as fast as (even slightly more slowly than) vanilla SGD, but again, this is due to the dead simple function we used here. In a neural net, it would likely prove much more useful in most cases.
The code also generates the evolution of each learned parameter over the iterations. We can e.g. observe below the evolution of the parameter a (which in our example is 1.5). We can see that Adam with annealing gets there very fast, SGD with momentum more slowly, but more smoothly than vanilla SGD. And we can observe how Adam without annealing suffers from high oscillations (which was also observed in the animation).
Again, one should obviously not generalise about the added value of each method based on that simple example. Here we just wanted to illustrate the concepts, and even on such a toy set, we can understand and observe the core ideas behind those simple yet powerful methods.
That’s it for now. Hope you enjoyed that post. Feel free to comment/ask questions and/or use the code for your own experiments.
