
Jeremy Howard: Very Fast Training of Neural Networks | AI Podcast Clips


Chapters

0:00 Intro
0:20 Super Convergence
0:45 Why is this important
1:36 Why the paper wasn't published
3:13 The future of learning rates

Whisper Transcript

00:00:00.000 | There's some magic on learning rate that you played around with.
00:00:04.800 | Yeah.
00:00:05.800 | It's quite interesting.
00:00:06.800 | Yeah.
00:00:07.800 | So this is all work that came from a guy called Leslie Smith.
00:00:08.800 | Leslie's a researcher who, like us, cares a lot about just the practicalities of training
00:00:17.720 | neural networks quickly and accurately, which you would think is what everybody should care
00:00:21.800 | about, but almost nobody does.
00:00:25.280 | And he discovered something very interesting, which he calls superconvergence, which is
00:00:30.160 | there are certain networks that with certain settings of hyperparameters could suddenly
00:00:34.320 | be trained 10 times faster by using a 10 times higher learning rate.
00:00:39.640 | Now no one would publish that paper because it's not an area of kind of active research in
00:00:49.680 | the academic world.
00:00:50.680 | No academics recognized this as important.
00:00:53.020 | And also deep learning in academia is not considered an experimental science.
00:01:00.140 | So unlike in physics where you could say like, I just saw a subatomic particle do something
00:01:05.380 | which the theory doesn't explain, you could publish that without an explanation.
00:01:10.660 | And then in the next 60 years people can try to work out how to explain it.
00:01:14.320 | We don't allow this in the deep learning world.
00:01:16.400 | So it's literally impossible for Leslie to publish a paper that says, I've just seen
00:01:22.340 | something amazing happen.
00:01:23.760 | This thing trained 10 times faster than it should have.
00:01:25.760 | I don't know why.
00:01:27.680 | And so the reviewers were like, well, you can't publish that because you don't know why.
00:01:31.160 | So anyway.
00:01:32.160 | That's important to pause on because there's so many discoveries that would need to start
00:01:35.660 | like that.
00:01:36.660 | Every other scientific field I know of works that way.
00:01:39.320 | I don't know why ours is uniquely disinterested in publishing unexplained experimental results,
00:01:47.980 | but there it is.
00:01:48.980 | So it wasn't published.
00:01:51.480 | Having said that, I read a lot more unpublished papers than published papers because that's
00:01:57.280 | where you find the interesting insights.
00:02:00.280 | So I absolutely read this paper and I was just like, this is astonishingly mind-blowing
00:02:08.080 | and weird and awesome.
00:02:10.040 | And like, why isn't everybody only talking about this?
00:02:12.720 | Because like, if you can train these things 10 times faster, they also generalize better
00:02:16.800 | because you're doing less epochs, which means you look at the data less, you get better
00:02:20.560 | accuracy.
00:02:22.560 | So I've been kind of studying that ever since.
00:02:25.040 | And eventually Leslie kind of figured out a lot of how to get this done.
00:02:30.440 | And we added minor tweaks.
00:02:32.480 | And a big part of the trick is starting at a very low learning rate, very gradually increasing it.
00:02:38.760 | So as you're training your model, you would take very small steps at the start and you
00:02:42.400 | gradually make them bigger and bigger until eventually you're taking much bigger steps
00:02:46.400 | than anybody thought was possible.
00:02:49.600 | There's a few other little tricks to make it work, but basically we can reliably get
00:02:54.240 | superconvergence.
00:02:55.240 | And so for the DAWNBench thing, we were using just much higher learning rates than people
00:03:01.120 | expected to work.
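
For readers who want to see what that ramp-up looks like in code, here is a minimal sketch using PyTorch's built-in OneCycleLR scheduler, which implements a variant of Leslie Smith's 1cycle policy described above; the tiny model, synthetic data, and max_lr value are hypothetical placeholders, not the actual DAWNBench configuration.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import OneCycleLR

# Hypothetical tiny model and synthetic data, just to make the sketch runnable.
model = nn.Linear(10, 2)
data = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(100)]

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

# OneCycleLR starts well below max_lr, ramps up to it, then anneals back down --
# the "start with very small steps, end up taking much bigger steps than
# anybody thought possible" idea described above.
scheduler = OneCycleLR(optimizer, max_lr=1.0, total_steps=len(data))

for x, y in data:
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the learning-rate schedule after every batch
```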
00:03:02.120 | What do you think the future of, I mean, it makes so much sense for that to be a critical
00:03:06.020 | hyperparameter, learning rate, that you vary.
00:03:08.320 | What do you think the future of learning rate magic looks like?
00:03:13.040 | Well, there's been a lot of great work in the last 12 months in this area.
00:03:17.600 | And people are increasingly realizing that optimizers, like, we just have no idea really
00:03:21.640 | how optimizers work.
00:03:23.720 | And the combination of weight decay, which is how we regularize optimizers, and the learning
00:03:28.320 | rate, and then other things like the epsilon we use in the Adam optimizer, they all work
00:03:35.180 | together in weird ways.
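
As a rough pointer to where those dials live in practice, here is a minimal sketch of configuring PyTorch's AdamW optimizer; the model and the specific values are arbitrary placeholders for illustration, not settings recommended in the conversation.

```python
import torch
from torch import nn

model = nn.Linear(10, 2)  # hypothetical stand-in model

# The knobs mentioned above all sit on the optimizer: the learning rate,
# the weight decay used for regularization, and Adam's epsilon term.
# The values here are arbitrary placeholders, not tuned recommendations.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,            # learning rate
    weight_decay=1e-2,  # decoupled weight decay (regularization)
    eps=1e-8,           # epsilon in the denominator of the Adam update
)
```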
00:03:36.920 | And different parts of the model.
00:03:38.800 | This is another thing we've done a lot of work on is research into how different parts
00:03:42.520 | of the model should be trained at different rates in different ways.
00:03:46.840 | So we do something we call discriminative learning rates, which is really important,
00:03:50.320 | particularly for transfer learning.
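
Here is a minimal sketch of the discriminative-learning-rates idea using plain PyTorch parameter groups, with a stand-in model: the earlier, more general layers are updated gently while the newly added head gets a much larger learning rate. fastai wraps this in its own layer-group machinery, so this illustrates the concept rather than its actual implementation.

```python
import torch
from torch import nn

# Stand-in for a transfer-learning setup: a pretrained "body" plus a new head.
body = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
head = nn.Linear(64, 2)

# Discriminative learning rates: each parameter group gets its own rate --
# small for the (pretrained) body, larger for the freshly initialized head.
optimizer = torch.optim.SGD(
    [
        {"params": body.parameters(), "lr": 1e-4},
        {"params": head.parameters(), "lr": 1e-2},
    ],
    momentum=0.9,
)
```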
00:03:53.460 | So really, I think in the last 12 months, a lot of people have realized that all this
00:03:56.520 | stuff is important.
00:03:57.620 | There's been a lot of great work coming out.
00:04:00.320 | And we're starting to see algorithms appear, which have very, very few dials, if any, that
00:04:07.160 | you have to touch.
00:04:08.160 | So like, I think what's going to happen is the idea of a learning rate will, it almost
00:04:11.480 | already has disappeared in the latest research.
00:04:14.660 | And instead, it's just like, you know, we know enough about how to interpret the gradients
00:04:22.800 | and the change of gradients we see to know how to set every parameter.
00:04:26.360 | [END]