Jeremy Howard: Very Fast Training of Neural Networks

00:00:00.000 | There's some magic on learning rate that you played around with.

00:00:04.800 | Yeah.

00:00:05.800 | It's quite interesting.

00:00:06.800 | Yeah.

00:00:07.800 | So this is all work that came from a guy called Leslie Smith.

00:00:08.800 | Leslie's a researcher who, like us, cares a lot about just the practicalities of training

00:00:17.720 | neural networks quickly and accurately, which you would think is what everybody should care

00:00:21.800 | about, but almost nobody does.

00:00:25.280 | And he discovered something very interesting, which he calls superconvergence, which is

00:00:30.160 | there are certain networks that with certain settings of high parameters could suddenly

00:00:34.320 | be trained 10 times faster by using a 10 times higher learning rate.

00:00:39.640 | Now no one published that paper because it's not an area of kind of active research in

00:00:49.680 | the academic world.

00:00:50.680 | No academics recognize this is important.

00:00:53.020 | And also deep learning in academia is not considered a experimental science.

00:01:00.140 | So unlike in physics where you could say like, I just saw a subatomic particle do something

00:01:05.380 | which the theory doesn't explain, you could publish that without an explanation.

00:01:10.660 | And then in the next 60 years people can try to work out how to explain it.

00:01:14.320 | We don't allow this in the deep learning world.

00:01:16.400 | So it's literally impossible for Leslie to publish a paper that says, I've just seen

00:01:22.340 | something amazing happen.

00:01:23.760 | This thing trained 10 times faster than it should have.

00:01:25.760 | I don't know why.

00:01:27.680 | And so the reviewers were like, well, you can't publish that because you don't know

00:01:30.160 | why.

00:01:31.160 | So anyway.

00:01:32.160 | That's important to pause on because there's so many discoveries that would need to start

00:01:35.660 | like that.

00:01:36.660 | Every other scientific field I know of works that way.

00:01:39.320 | I don't know why ours is uniquely disinterested in publishing unexplained experimental results,

00:01:47.980 | but there it is.

00:01:48.980 | So it wasn't published.

00:01:51.480 | Having said that, I read a lot more unpublished papers than published papers because that's

00:01:57.280 | where you find the interesting insights.

00:02:00.280 | So I absolutely read this paper and I was just like, this is astonishingly mind-blowing

00:02:08.080 | and weird and awesome.

00:02:10.040 | And like, why isn't everybody only talking about this?

00:02:12.720 | Because like, if you can train these things 10 times faster, they also generalize better

00:02:16.800 | because you're doing less epochs, which means you look at the data less, you get better

00:02:20.560 | accuracy.

00:02:22.560 | So I've been kind of studying that ever since.

00:02:25.040 | And eventually Leslie kind of figured out a lot of how to get this done.

00:02:30.440 | And we added minor tweaks.

00:02:32.480 | And a big part of the trick is starting at a very low learning rate, very gradually increasing

00:02:37.760 | it.

00:02:38.760 | So as you're training your model, you would take very small steps at the start and you

00:02:42.400 | gradually make them bigger and bigger until eventually you're taking much bigger steps

00:02:46.400 | than anybody thought was possible.

00:02:49.600 | There's a few other little tricks to make it work, but basically we can reliably get

00:02:54.240 | superconvergence.

00:02:55.240 | And so for the dawn bench thing, we were using just much higher learning rates than people

00:03:01.120 | expected to work.

00:03:02.120 | What do you think the future of, I mean, it makes so much sense for that to be a critical

00:03:06.020 | hyperparameter learning rate that you vary.

00:03:08.320 | What do you think the future of learning rate magic looks like?

00:03:13.040 | Well, there's been a lot of great work in the last 12 months in this area.

00:03:17.600 | And people are increasingly realizing that optimize, like we just have no idea really

00:03:21.640 | how optimizers work.

00:03:23.720 | And the combination of weight decay, which is how we regularize optimizers and the learning

00:03:28.320 | rate.

00:03:29.440 | And then other things like the epsilon we use in the atom optimizer, they all work together

00:03:35.180 | in weird ways.

00:03:36.920 | And different parts of the model.

00:03:38.800 | This is another thing we've done a lot of work on is research into how different parts

00:03:42.520 | of the model should be trained at different rates in different ways.

00:03:46.840 | So we do something we call discriminative learning rates, which is really important,

00:03:50.320 | particularly for transfer learning.

00:03:53.460 | So really, I think in the last 12 months, a lot of people have realized that all this

00:03:56.520 | stuff is important.

00:03:57.620 | There's been a lot of great work coming out.

00:04:00.320 | And we're starting to see algorithms appear, which have very, very few dials, if any, that

00:04:07.160 | you have to touch.

00:04:08.160 | So like, I think what's going to happen is the idea of a learning rate will, it almost

00:04:11.480 | already has disappeared in the latest research.

00:04:14.660 | And instead, it's just like, you know, we know enough about how to interpret the gradients

00:04:22.800 | and the change of gradients we see to know how to set every parameter.

00:04:26.360 | [END]

00:04:28.360 | 1

00:04:29.360 | Page 2 of 9

00:04:29.360 | Page 3 of 9

00:04:30.360 | Page 4 of 9

00:04:31.360 | Page 5 of 9

00:04:32.360 | Page 6 of 9

00:04:33.360 | Page 7 of 9

00:04:34.360 | Page 8 of 9

00:04:35.360 | Page 9 of 9

00:04:36.360 | Page 10 of 10

00:04:37.360 | Page 11 of 10

00:04:38.360 | Page 12 of 10

00:04:39.360 | Page 13 of 10

00:04:40.360 | Page 14 of 10

00:04:41.360 | Page 15 of 15

00:04:42.360 | Page 16 of 15

00:04:43.360 | Page 17 of 15

00:04:44.360 | Page 18 of 15

00:04:45.360 | Page 19 of 15

Jeremy Howard: Very Fast Training of Neural Networks | AI Podcast Clips

Chapters