There's some magic on learning rate that you played around with. Yeah. It's quite interesting. Yeah. So this is all work that came from a guy called Leslie Smith. Leslie's a researcher who, like us, cares a lot about just the practicalities of training neural networks quickly and accurately, which you would think is what everybody should care about, but almost nobody does.
And he discovered something very interesting, which he calls superconvergence, which is that certain networks, with certain settings of hyperparameters, could suddenly be trained 10 times faster by using a 10 times higher learning rate. Now, no one would publish that paper, because it's not an area of active research in the academic world.
No academics recognized this as important. And also, deep learning in academia is not considered an experimental science. So unlike in physics, where you could say, "I just saw a subatomic particle do something which the theory doesn't explain," you could publish that without an explanation, and then over the next 60 years people can try to work out how to explain it.
We don't allow this in the deep learning world. So it's literally impossible for Leslie to publish a paper that says, I've just seen something amazing happen. This thing trained 10 times faster than it should have. I don't know why. And so the reviewers were like, well, you can't publish that because you don't know why.
So anyway. That's important to pause on, because there are so many discoveries that would need to start like that. Every other scientific field I know of works that way. I don't know why ours is uniquely uninterested in publishing unexplained experimental results, but there it is. So it wasn't published. Having said that, I read a lot more unpublished papers than published papers, because that's where you find the interesting insights.
So I absolutely read this paper, and I was just like, this is astonishingly mind-blowing and weird and awesome. And like, why isn't this the only thing everybody is talking about? Because if you can train these things 10 times faster, they also generalize better: you're doing fewer epochs, which means you look at the data less, and you get better accuracy.
So I've been studying that ever since. And eventually Leslie figured out a lot of how to get this done, and we added some minor tweaks. A big part of the trick is starting at a very low learning rate and very gradually increasing it. So as you're training your model, you take very small steps at the start and you gradually make them bigger and bigger, until eventually you're taking much bigger steps than anybody thought was possible.
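To make the shape of that schedule concrete, here is a minimal sketch using PyTorch's built-in OneCycleLR scheduler, which implements Leslie Smith's 1cycle policy of warming the learning rate up to a peak and then annealing it back down. The model, data, peak learning rate, and step count below are placeholders for illustration, not the settings used in practice.

```python
# A minimal sketch of a warm-up-then-anneal learning rate schedule, assuming
# PyTorch; the model, data, peak LR, and step count are placeholders.
import torch
from torch import nn

model = nn.Linear(10, 2)                                   # stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.04, momentum=0.9)

# OneCycleLR follows the 1cycle policy: the learning rate ramps up from a
# small value to max_lr over the first part of training, then anneals back down.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1.0, total_steps=1000, pct_start=0.3
)

x, y = torch.randn(64, 10), torch.randn(64, 2)             # dummy data
for step in range(1000):
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                       # advance the LR along the cycle
```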
There are a few other little tricks to make it work, but basically we can reliably get superconvergence. And so for the DAWNBench thing, we were using just much higher learning rates than people expected to work. What do you think the future of, I mean, it makes so much sense for the learning rate to be a critical hyperparameter that you vary.
What do you think the future of learning rate magic looks like? Well, there's been a lot of great work in the last 12 months in this area, and people are increasingly realizing that we just have no idea, really, how optimizers work. The combination of weight decay, which is how we regularize optimizers, the learning rate, and then other things like the epsilon we use in the Adam optimizer, they all work together in weird ways. And different parts of the model, this is another thing we've done a lot of work on: research into how different parts of the model should be trained at different rates and in different ways.
So we do something we call discriminative learning rates, which is really important, particularly for transfer learning. So really, I think in the last 12 months, a lot of people have realized that all this stuff is important. There's been a lot of great work coming out. And we're starting to see algorithms appear that have very, very few dials, if any, that you have to touch.
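As a rough illustration of the discriminative learning rates mentioned above, here is a sketch assuming PyTorch and a recent torchvision, with a ResNet-18 standing in for the pretrained model; the specific learning rates, weight decay, and epsilon values are made up for the example, not fastai's defaults.

```python
# A minimal sketch of discriminative learning rates for transfer learning,
# assuming PyTorch and a recent torchvision; all values here are illustrative.
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")   # pretrained backbone

# Earlier, more general layers get smaller learning rates than the new head;
# the stem (conv1/bn1) is simply left out of the optimizer here, i.e. frozen.
optimizer = torch.optim.AdamW(
    [
        {"params": model.layer1.parameters(), "lr": 1e-5},
        {"params": model.layer2.parameters(), "lr": 1e-5},
        {"params": model.layer3.parameters(), "lr": 1e-4},
        {"params": model.layer4.parameters(), "lr": 1e-4},
        {"params": model.fc.parameters(),     "lr": 1e-3},
    ],
    weight_decay=1e-2,   # the weight decay knob mentioned above
    eps=1e-8,            # the Adam epsilon mentioned above
)
```

Parameter groups are just the generic PyTorch mechanism for this; the idea is that the pretrained early layers get updated much more gently than the freshly initialized head.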
So I think what's going to happen is that the idea of a learning rate will disappear; it almost already has in the latest research. And instead, it's just that we know enough about how to interpret the gradients, and the change of gradients we see, to know how to set every parameter.