Jeremy Howard: Very Fast Training of Neural Networks | AI Podcast Clips
Chapters
0:00 Intro
0:20 Super Convergence
0:45 Why is this important
1:36 Why the paper wasn't published
3:13 The future of learning rates
There's some magic on learning rate that you played around with. 00:00:07.800 |
So this is all work that came from a guy called Leslie Smith. 00:00:08.800 |
Leslie's a researcher who, like us, cares a lot about just the practicalities of training 00:00:17.720 |
neural networks quickly and accurately, which you would think is what everybody should care about. 00:00:25.280 |
And he discovered something very interesting, which he calls superconvergence, which is 00:00:30.160 |
there are certain networks that with certain settings of hyperparameters could suddenly 00:00:34.320 |
be trained 10 times faster by using a 10 times higher learning rate. 00:00:39.640 |
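For readers who want to try this, a minimal sketch in plain PyTorch follows, using the built-in OneCycleLR scheduler, which was motivated by Smith's super-convergence work. The model, fake data, peak learning rate, and step counts are illustrative placeholders, not values from the conversation.

```python
# Sketch: training with a one-cycle schedule in the spirit of super-convergence.
# `model`, the fake batches, and max_lr=1.0 are stand-ins for illustration.
import torch
from torch import nn, optim

model = nn.Linear(784, 10)                      # stand-in model
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
epochs, steps_per_epoch = 5, 100                # stand-in sizes
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1.0,                                 # much higher peak LR than usual
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
)

loss_fn = nn.CrossEntropyLoss()
for epoch in range(epochs):
    for step in range(steps_per_epoch):
        x = torch.randn(32, 784)                # fake batch for the sketch
        y = torch.randint(0, 10, (32,))
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()                        # LR ramps up, then anneals down
```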
Now no one published that paper because it's not an area of kind of active research in academia. 00:00:53.020 |
And also deep learning in academia is not considered an experimental science. 00:01:00.140 |
So unlike in physics where you could say like, I just saw a subatomic particle do something 00:01:05.380 |
which the theory doesn't explain, you could publish that without an explanation. 00:01:10.660 |
And then in the next 60 years people can try to work out how to explain it. 00:01:14.320 |
We don't allow this in the deep learning world. 00:01:16.400 |
So it's literally impossible for Leslie to publish a paper that says, I've just seen 00:01:23.760 |
this thing trained 10 times faster than it should have. 00:01:27.680 |
And so the reviewers were like, well, you can't publish that because you don't know why it works. 00:01:32.160 |
That's important to pause on because there's so many discoveries that would need to start that way. 00:01:36.660 |
Every other scientific field I know of works that way. 00:01:39.320 |
I don't know why ours is uniquely disinterested in publishing unexplained experimental results. 00:01:51.480 |
Having said that, I read a lot more unpublished papers than published papers, because that's where work like this turns up. 00:02:00.280 |
So I absolutely read this paper and I was just like, this is astonishingly mind-blowing. 00:02:10.040 |
And like, why isn't everybody only talking about this? 00:02:12.720 |
Because like, if you can train these things 10 times faster, they also generalize better 00:02:16.800 |
because you're doing fewer epochs, which means you look at the data less, you get better generalization. 00:02:22.560 |
So I've been kind of studying that ever since. 00:02:25.040 |
And eventually Leslie kind of figured out a lot of how to get this done. 00:02:32.480 |
And a big part of the trick is starting at a very low learning rate, very gradually increasing it. 00:02:38.760 |
So as you're training your model, you would take very small steps at the start and you 00:02:42.400 |
gradually make them bigger and bigger until eventually you're taking much bigger steps. 00:02:49.600 |
There's a few other little tricks to make it work, but basically we can reliably get super-convergence. 00:02:55.240 |
And so for the DAWNBench thing, we were using just much higher learning rates than most people were using. 00:03:02.120 |
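As a rough illustration of the warm-up idea described above, here is a hand-rolled linear ramp in plain PyTorch. The start and peak learning rates, step counts, and stand-in model are assumptions for the sketch, not values stated in the episode.

```python
# Sketch: gradual learning-rate warm-up by hand.
# Start with tiny steps and grow the LR linearly toward a (much higher) peak.
import torch
from torch import nn, optim

model = nn.Linear(10, 2)                          # stand-in model
optimizer = optim.SGD(model.parameters(), lr=1e-5)

start_lr, peak_lr, warmup_steps = 1e-5, 1.0, 500  # illustrative values

def warmup_lr(step):
    """Linearly interpolate from start_lr to peak_lr over warmup_steps."""
    t = min(step / warmup_steps, 1.0)
    return start_lr + t * (peak_lr - start_lr)

for step in range(1000):
    for group in optimizer.param_groups:
        group["lr"] = warmup_lr(step)             # bigger and bigger steps
    x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
```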
What do you think the future of, I mean, it makes so much sense for that to be a critical hyperparameter. 00:03:08.320 |
What do you think the future of learning rate magic looks like? 00:03:13.040 |
Well, there's been a lot of great work in the last 12 months in this area. 00:03:17.600 |
And people are increasingly realizing that, like, we just have no idea really how optimizers work. 00:03:23.720 |
And the combination of weight decay, which is how we regularize optimizers, and the learning rate, 00:03:29.440 |
and then other things like the epsilon we use in the Adam optimizer, they all work together in ways we don't really understand. 00:03:38.800 |
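For reference, a minimal sketch of where those interacting knobs live in a typical PyTorch setup; the specific values below are placeholders for illustration, not recommendations from the conversation.

```python
# Sketch: the knobs mentioned above (learning rate, weight decay, Adam's epsilon)
# are all set on the same optimizer and interact with one another.
import torch
from torch import nn, optim

model = nn.Linear(10, 2)                 # stand-in model
optimizer = optim.AdamW(
    model.parameters(),
    lr=3e-4,                             # learning rate
    weight_decay=1e-2,                   # decoupled weight decay (regularization)
    eps=1e-8,                            # the "epsilon" in Adam's denominator
)
```

Changing any one of these in isolation can behave quite differently depending on the others, which is part of what makes them hard to tune.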
This is another thing we've done a lot of work on is research into how different parts 00:03:42.520 |
of the model should be trained at different rates in different ways. 00:03:46.840 |
So we do something we call discriminative learning rates, which is really important. 00:03:53.460 |
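A minimal sketch of what discriminative learning rates can look like in raw PyTorch, using optimizer parameter groups; the layer split and learning rate values are illustrative assumptions, not fastai's actual defaults.

```python
# Sketch: discriminative learning rates via optimizer parameter groups.
# Earlier (more general) layers get a smaller LR; the later (more task-specific)
# head gets a larger one. The split and values are illustrative only.
import torch
from torch import nn, optim

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),      # "early" layers
    nn.Linear(256, 10),                  # "head" layer
)

optimizer = optim.SGD(
    [
        {"params": model[0].parameters(), "lr": 1e-4},  # train slowly
        {"params": model[2].parameters(), "lr": 1e-2},  # train faster
    ],
    momentum=0.9,
)
```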
So really, I think in the last 12 months, a lot of people have realized that all this stuff really matters. 00:04:00.320 |
And we're starting to see algorithms appear, which have very, very few dials, if any, that you have to touch. 00:04:08.160 |
So like, I think what's going to happen is the idea of a learning rate will, it almost 00:04:11.480 |
already has disappeared in the latest research. 00:04:14.660 |
And instead, it's just like, you know, we know enough about how to interpret the gradients 00:04:22.800 |
and the change of gradients we see to know how to set every parameter.