Deep learning certificate "lesson 0"
Chapters
0:00 Introduction
0:55 What is deep learning
1:15 It changes everything
2:13 Google's autopilot
2:40 Babel fish
3:07 Google Inbox
5:31 Deep learning
6:01 Machine learning
9:00 Google
10:51 Yahoo
11:17 PageRank
20:54 Choreography
21:04 Data Institute
27:40 Feature Engineering
36:49 Optimization
00:00:00.000 |
So, anyway, I'm going to just briefly say my name is David Uminsky, I'm the director 00:00:09.840 |
of the analytics program and director of the Data Institute. 00:00:12.600 |
I'll say a few more words about the Data Institute upstairs, and I hope everyone will come upstairs. 00:00:19.760 |
But right now, it's my great pleasure to introduce Jeremy Howard, who is many things: a serial 00:00:25.840 |
entrepreneur whose most recent venture is Enlitic, which is bringing deep learning to medicine, 00:00:32.840 |
and before that, president and chief scientist at Kaggle. And I think 00:00:40.120 |
I'm going to leave it there because I could keep going. 00:00:42.120 |
But anyway, let's give a warm welcome to Jeremy. 00:00:59.680 |
So, you know, my passion at the moment and for the last few years has been this area of deep learning. 00:01:05.720 |
Who here has kind of come across deep learning at some point? 00:01:09.200 |
Heard of it, knows about it, maybe a little over half of you, two-thirds, okay, great. 00:01:16.520 |
It's one of these things which kind of feels like a great fad or a great marketing thing 00:01:22.680 |
or something kind of like, I don't know, big data or internet of things or, you know, all of those things. 00:01:31.480 |
But it actually reminds me of another fad, which I was really excited about in the early 00:01:37.000 |
90s and I was telling everybody it was going to be huge and that fad was called the internet. 00:01:41.280 |
And so some fads are just fads, but some fads are fads for a reason. 00:01:46.040 |
I think deep learning is going to be more important and more transformational than the internet. 00:01:57.240 |
Every one of you will be deeply impacted by deep learning, and many of us already are. 00:02:06.420 |
So before I talk about that, I want to talk about how people have viewed computers 00:02:12.140 |
for many years; people have really made computers the butt of jokes for many years. 00:02:22.240 |
So you may remember from 2009, this was Google's autopilot, which was their April Fool's joke. 00:02:29.240 |
And the butt of the joke was basically, well, of course, computers can't send email. 00:02:35.260 |
And so that was the April Fool's joke of 2009. 00:02:40.160 |
Going back further, a source of humor for Douglas Adams was the Babel Fish, which was 00:02:46.240 |
basically the idea that technology could never be so advanced as to do something this clever 00:02:53.180 |
So he came up with this idea of this fish called the Babel Fish that could translate 00:02:56.960 |
language, and so improbably useful was this thing that it was used as the proof of the non-existence 00:03:02.760 |
of God in the Hitchhiker's Guide to the Galaxy. 00:03:11.640 |
Basically the joke doesn't work anymore because your computer really can reply to your email. 00:03:17.320 |
These are actual examples of replies that have been automatically generated by Google 00:03:22.560 |
Inbox, which is a mobile app for Android and iOS and you can also access on the web. 00:03:30.400 |
And this is not some carefully curated set of responses for this particular email. 00:03:37.280 |
In fact, 15% of emails sent through Inbox by Google are now automatically created by the deep learning algorithm. 00:03:45.000 |
So it's actually being very widely used already. 00:03:50.080 |
So here's another example of what the system does. 00:03:55.640 |
And indeed the Babel Fish now exists as well. 00:04:00.240 |
You can for free use Skype's translator system to translate voice to voice for any of six languages. 00:04:15.840 |
This is actually not a genuine Van Gogh, but it's fairly impressive. 00:04:22.400 |
In fact on that I'm going to give you a little test which is to figure out which of these 00:04:29.960 |
are real paintings and drawings and which ones were done by a computer. 00:04:37.800 |
Now that you've made your decisions, I will show you. 00:04:41.160 |
The first sketch, I guess pastel sketch on the left, is done by a computer. 00:04:52.480 |
And the third one we all know about extrapolating from past events, but at this level it worked. 00:04:59.000 |
And you can see here the level of nuance that the computer has achieved, kind of 00:05:04.660 |
realizing that this piece uses a lot of lines and arcs and deciding to actually connect 00:05:13.860 |
this lady's eyebrow to her nose to her shoulder as an arc, and also noticing these kind of areas 00:05:20.720 |
of bursts of color and realizing that her hair bun would be a good place to have a burst of color. 00:05:25.720 |
It's quite a sophisticated rendition of both the style and the content. 00:05:31.680 |
So as you might have guessed, the reason that fiction has become reality and computers have 00:05:39.680 |
gone past what was previously a joke and indeed now they're generating art, which is very 00:05:44.680 |
hard to tell from real human art, is because of this thing called deep learning. 00:05:50.840 |
I don't have time today to go into detail about all of the interesting applications, 00:05:55.400 |
but I do have a talk on ted.com that you can watch if you have 18 minutes and want more detail. 00:06:02.320 |
But before we talk more about deep learning, let's talk about machine learning. 00:06:09.000 |
Deep learning is a way of doing machine learning. 00:06:11.760 |
So machine learning was invented by this guy, Arthur Samuel, in 1956. 00:06:16.680 |
This is him playing checkers against an IBM mainframe. 00:06:20.680 |
Rather than programming this IBM mainframe to play checkers, instead he got the computer 00:06:28.000 |
to play against itself thousands of times and figure out how to play effectively. 00:06:33.380 |
And after doing that, this computer beat the creator of the program. 00:06:40.440 |
So machine learning has been around for a long time. 00:06:43.280 |
The thing is, though, that until very recently you needed an Arthur Samuel to write your 00:06:50.720 |
machine learning algorithm for you; to actually get to the point that the machine could learn 00:06:55.600 |
to tackle your task took a lot of programming effort and engineering effort and also a lot 00:07:00.440 |
of domain expertise, mainly to do what's called feature engineering. 00:07:07.880 |
But something very interesting has happened more recently, which is that we have the three 00:07:13.360 |
pieces that at least in theory ought to make machine learning universal. 00:07:18.880 |
So imagine if you could get a computer to learn and it could learn any type of relationship. 00:07:25.440 |
Now when you see the word function as a mathematical function, you might think of something like a line or a parabola. 00:07:31.080 |
But I mean function in the widest possible sense. 00:07:34.120 |
Like the function that translates Russian into Japanese, or the function that recognizes what's in a photo. 00:07:44.200 |
That's what I mean by an infinitely flexible function. 00:07:46.920 |
So imagine if you had that and you had some way to fit the parameters of that function to your data. 00:07:55.760 |
It could model anything that you could come up with such as the two examples I just gave. 00:08:01.560 |
You just need one more piece, which is the ability to do that quickly and at scale. 00:08:07.000 |
And if you had those three things, you would have a totally general learning system. 00:08:17.240 |
Deep learning is a particular algorithm for doing machine learning, which has these three 00:08:24.040 |
The infinitely flexible function is the neural network, which has been around for a long time. 00:08:29.600 |
The all-purpose parameter fitting is backpropagation, which has been around since the, really since 00:08:35.800 |
1974, but was not noticed by the world until 1986. 00:08:40.760 |
Until very recently, though, we didn't have this. 00:08:43.300 |
And the fast and scalable has recently come along for various reasons, including the advances 00:08:48.320 |
in GPUs, which are used mainly to play computer games but also turn out to be perfect for 00:08:53.160 |
deep learning, a wider availability of data, and some vital improvements to the algorithms themselves. 00:09:03.560 |
Jeff Dean presented this from Google last week, showing how often deep learning is now being used across Google. 00:09:15.840 |
And you can see this classic hockey stick shape showing an exponential growth here. 00:09:23.160 |
Google are amongst the first, or maybe the first, at really picking up on using this technology. 00:09:30.360 |
What Google did was they set aside a group of 00:09:35.600 |
people, and they said, go to different parts of Google, tell them about deep learning, and help them use it. 00:09:41.040 |
And from my understanding of the people I know, everywhere they went, it worked. 00:09:48.600 |
And of course, the people that that original team talked to are now talking to other people. 00:09:54.280 |
So when I say deep learning changes everything, I certainly would expect that to be true in your organizations. 00:10:00.400 |
Every aspect of your organization can probably be touched effectively by this. 00:10:05.680 |
An example: when Google wanted to map the location of every residence and business in France, 00:10:14.880 |
they basically grabbed the entire Street View database. 00:10:18.940 |
These are examples of pictures from the Street View database, and they built a deep learning 00:10:22.080 |
system that could identify house numbers and could then read those house numbers. 00:10:27.080 |
And an hour later, they had mapped the entirety of the country of France. 00:10:31.520 |
This is obviously something that previously would have taken hundreds of people many years. 00:10:36.520 |
And this is one of the reasons that, particularly for startups, and here in the Bay Area, this 00:10:40.680 |
is important, deep learning really does change everything, because suddenly a startup can 00:10:45.760 |
do things that previously required huge amounts of resources. 00:10:52.360 |
So we've kind of seen a little bit of this before. 00:10:55.100 |
What happens when an algorithm comes along that makes a big difference? 00:11:06.600 |
Eighty percent of home pages were Yahoo back in the day. 00:11:12.840 |
And Yahoo was manually curated by expert web surfers. 00:11:18.080 |
And then this company came along and replaced the expert web surfers with a machine learning algorithm, PageRank. 00:11:27.320 |
Now this was an algorithm that, compared to deep learning, is incredibly limited and simple 00:11:33.880 |
But if you think about the impact that that algorithm had on Yahoo, well, think about 00:11:39.680 |
the impact that the collaborative filtering algorithm had on Amazon versus Barnes 00:11:44.080 |
and Noble, now that we have really successful recommendation systems. 00:11:47.320 |
You can see how even relatively simple versions of machine learning have had huge commercial impact. 00:12:02.560 |
A paper last year showed that deep learning is able to recognize the content of photos. 00:12:10.600 |
This is something called the ImageNet dataset, which is one and a half million photos. 00:12:17.160 |
And a very patient human had actually spent time trying to classify thousands of these 00:12:23.720 |
photos and tested themselves and found that they had a 5% error rate. 00:12:28.640 |
And last year it was announced by Microsoft Research that they had a system which was more accurate than that human. 00:12:37.960 |
In fact, this number is now down to about 3%. 00:12:44.440 |
So with deep learning, computers can now see. 00:12:49.520 |
And they can see in a range of interesting ways. 00:12:51.560 |
Anybody here from China will probably recognize Baidu Shutu. 00:12:55.480 |
And on Baidu Shutu, which is a part of the popular kind of Google competitor-- well, 00:13:01.840 |
not really a competitor, since Google's not there. 00:13:07.000 |
You can upload a picture, which is what I did here. 00:13:11.040 |
And it has come up with all of these similar images. 00:13:16.000 |
So it figured out the breed of the dog, the composition, the type of the background, the 00:13:20.480 |
fact that it's had its tongue hanging out, and so forth. 00:13:23.200 |
So you can see that image analysis is a lot more than just saying it's a dog (which is 00:13:28.320 |
what the Chinese text at the top says: it's a golden retriever); it's really understanding the image. 00:13:32.920 |
And I'll give you some examples of some of the extraordinary things that this allows us to do. 00:13:38.640 |
Speaking about Baidu, they have now announced that they can recognize speech more accurately 00:13:44.720 |
than humans, in Chinese and English at least. 00:13:49.160 |
So we now have computers at a point where last year they can recognize pictures better 00:13:53.440 |
than us, and now they can recognize speech better than us. 00:14:00.360 |
Microsoft have this amazing system using deep learning where you can take a picture in which 00:14:05.280 |
large bits have been cut off; in this case it was a panorama that was done quite badly. 00:14:10.520 |
And the bottom shows how it has automatically filled in its guess as to what the rest might look like. 00:14:17.000 |
And so this is taking image recognition to the next level, which is to say can I construct 00:14:21.480 |
an image which would be believable to an image recognizer. 00:14:25.160 |
This is part of something called generative models, which is a huge area right now. 00:14:29.000 |
And again, this is a freely available software that you can download off the internet. 00:14:40.600 |
If I had a deep learning system here, I probably could have looked that up. 00:14:46.480 |
So generative models are kind of interesting. 00:14:48.240 |
This is like in some ways more quirky than anything else, but I think it's fascinating. 00:14:52.320 |
These pictures here, the four corners are actual photos. 00:14:56.360 |
The ones in the middle are generated by a deep learning algorithm to try and interpolate between them. 00:15:02.240 |
But what you can do more than that is you can then say to the deep learning algorithm, 00:15:07.700 |
what would this photo look like if the person was feeling differently? 00:15:18.680 |
I mean, the interesting thing here is you can see it's doing a lot more than just plastering a smile on top. 00:15:23.840 |
You know, their eyes are smiling, their faces are moving. 00:15:26.440 |
We can even take some famous paintings and slightly change how they're looking. 00:15:38.000 |
And you can see as she's moving her eyes up and down, again, her whole face is moving. 00:15:43.080 |
One of the interesting things about this Mona Lisa example was that this system 00:15:49.920 |
was originally trained without having any paintings in the training set. 00:15:56.320 |
And one of the interesting things about deep learning is how well it can generalize to things it has never seen. 00:16:02.040 |
In this case, it turns out that it knows how to generate different face movements for paintings. 00:16:11.560 |
A lot of people think that deep learning is just about big data. 00:16:15.280 |
Ilya Sutskever from OpenAI presented last week a new model in which he showed that on 00:16:21.120 |
a very famous data set called MNIST, which we'll learn about more shortly. 00:16:25.680 |
But it's basically trying to recognize digits. 00:16:29.880 |
It's a very old, classic machine learning problem. 00:16:35.200 |
He discovered that with just 50 labeled images of digits, he could train a 99% accurate digit recognizer. 00:16:49.480 |
And so these recent advances that allow us to use small amounts of data are something 00:16:52.920 |
that's really changing what's possible with deep learning. 00:16:58.080 |
It's also turning really anybody into an artist. 00:17:00.720 |
There's a thing called neural doodle that allows you to whip out your stylus and jot 00:17:05.400 |
down some sophisticated imagery like this and then say how you would like it rendered, 00:17:10.280 |
In this case, it was being rendered as impressionism. 00:17:13.680 |
You can see it's done a pretty good job of generating an image which hopefully fits what 00:17:19.160 |
the original artist had in their head with their original doodle. 00:17:25.720 |
And it's not just about images, it's about text as well, or even combining the two. 00:17:32.800 |
These sentences are totally novel sentences constructed from scratch by a deep learning algorithm. 00:17:41.640 |
So you can see that in order to construct this, the deep learning algorithm must have 00:17:45.840 |
understood a lot about not just what the main objects in the picture are, but how they relate to each other. 00:17:58.200 |
I got so excited about this that three years ago, I left my job at Kaggle and spent a year 00:18:04.760 |
researching what are the biggest opportunities for deep learning in the world. 00:18:08.880 |
I came to the conclusion that the number one biggest opportunity at that time was medicine. 00:18:17.960 |
There were four of us, all computer scientists and mathematicians, no medical people on the team. 00:18:26.640 |
And within two months, we had a system for radiology which could predict the malignancy 00:18:33.040 |
of lung cancer more accurately than a panel of four of the world's best radiologists. 00:18:39.960 |
This was kind of very exciting to me because it was everything that I hoped was possible. 00:18:46.040 |
It was also always somehow surprising when you actually run a model and it's classifying 00:18:52.840 |
cancer and you genuinely have no idea how it did it, because of course all you do is 00:18:57.040 |
set up the kind of situation in which it can learn, and then it does that learning. 00:19:04.160 |
So this turned out to be very successful, and Enlitic today has raised $15 million. 00:19:09.280 |
It's a pretty successful company, and one of the things I mentioned earlier, that Baidu 00:19:15.560 |
Shutu example of taking a picture and finding similar pictures, is doing big things in medicine. 00:19:24.320 |
It basically allows radiologists to find previous patients from a database of millions of CT 00:19:30.080 |
scans and MRIs to find the people that have medical imagery just like the patient that 00:19:34.680 |
they're interested in and then they can find out exactly the path of that patient, how 00:19:38.920 |
did they respond to different drugs, so forth. 00:19:42.240 |
So this kind of semantic search of imagery is a really exciting area. 00:19:48.600 |
So one thing interesting about my particular CV when it comes to creating the first deep 00:19:57.120 |
learning medical diagnostic company is not so much what I've done but perhaps what I haven't. 00:20:04.520 |
And so that's the entirety of my actual biology, life sciences and medicine experience. 00:20:11.640 |
And one of the exciting things to those of you who are entrepreneurs or interested in 00:20:15.480 |
being entrepreneurs is that there are no limits as to what you can hope to do. 00:20:21.800 |
You recognize a problem that you want to solve and that you care about, and that hopefully 00:20:26.920 |
maybe hasn't been solved that well before, and have a go; really, you can do a lot. 00:20:34.960 |
In my case, once I kind of showed that we could do some useful stuff in oncology and 00:20:41.560 |
we got covered by CNN on one of the TV shows, suddenly the medical establishment 00:20:47.440 |
kind of came to us, at which point we got a lot of help from the medical establishment 00:20:50.920 |
as well, so you kind of get this nice feedback loop going on. 00:20:56.440 |
So most importantly, deep learning can also do choreography. 00:21:02.120 |
So if you're excited about this and think this all sounds interesting, you might be wondering where you can learn it. 00:21:11.880 |
And the answer you won't be surprised to hear is the Data Institute. 00:21:17.520 |
We haven't previously announced this, but I'm going to announce it now: the first, to 00:21:23.160 |
our knowledge, the first ever in-person university-accredited deep learning certificate will 00:21:30.660 |
be here at the Data Institute, and the second lesson will start in late October. 00:21:40.360 |
You might be wondering when the first lesson is and the answer is it's right now. 00:21:48.700 |
So you came to university, come on, you've got to expect to be studying here. 00:21:54.800 |
So what I'm going to show you now is the first lesson. The certificate course will be seven weeks of two-and-a-half-hour sessions. 00:22:03.080 |
We don't have two and a half hours right now, so this will by necessity be heavily compressed. 00:22:07.360 |
So if this doesn't make as much sense as you might like it to, don't worry; the MSAN students 00:22:12.200 |
will certainly follow along fine, but I'll try and make it as clear as possible. 00:22:19.780 |
One of the things that I strongly believe is that deep learning is easy. 00:22:23.640 |
It is made hard by people who put way more math into it than is necessary and also by 00:22:30.120 |
what I think is a desire for exclusivity amongst the kind of deep learning specialists. 00:22:36.960 |
They make up crazy new jargon about things that are really very simple. 00:22:41.080 |
So I want to kind of show you how simple it can be. 00:22:43.920 |
And specifically we're going to look at MNIST, the data set I told you about, which is about recognizing digits. 00:22:51.080 |
And I'm going to use a system called Jupyter Notebook. 00:22:55.140 |
For those of you that don't code, I hope the fact that this is done in code isn't too off-putting. 00:22:59.400 |
You certainly don't need to use code for everything, but I find it a very good way to kind of show what's going on. 00:23:06.360 |
So I'm going to have to make sure that we actually have this thing running. 00:23:23.360 |
And so the data, the MNIST data has 55,000 28x28 images in it. 00:23:37.080 |
As you can see it is a 28x28 picture, and as well as the image we also have labels, which tell us what number each image is. 00:23:50.640 |
And so you can see that, as is common with pretty much every machine learning dataset, 00:23:57.020 |
you have some information that you're given and then some information that you want to predict. 00:24:01.240 |
So in this case the goal of this dataset is to take a picture of a number and return what number it is. 00:24:11.600 |
So here's the first five pictures and the first five numbers that go with each one. 00:24:17.840 |
So this was originally generated by NIST, and they basically had thousands of people 00:24:24.240 |
draw lots of numbers, and then somebody went through and coded into a computer what each number was. 00:24:29.240 |
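The notebook itself isn't reproduced in this transcript, so here is a minimal sketch of loading MNIST and showing the first five digits with their labels. It assumes scikit-learn's OpenML fetcher; the original notebook used a different loader, with a 55,000-image training split rather than the full 70,000 images.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml

# Fetch MNIST: 70,000 handwritten digits, each a 784-long row of pixel values
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
images = mnist.data.reshape(-1, 28, 28) / 255.0  # each image is 28x28, scaled to 0..1
labels = mnist.target.astype(int)                # the number each image represents

# Show the first five pictures and the numbers that go with each one
fig, axes = plt.subplots(1, 5)
for ax, im, lab in zip(axes, images[:5], labels[:5]):
    ax.imshow(im, cmap='gray')
    ax.set_title(str(lab))
    ax.axis('off')
plt.show()
```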
So I'm going to show you some interesting things we can do with pictures. 00:24:33.320 |
The first thing I'm going to do is I'm going to create a little matrix here that I've called "top". 00:24:37.040 |
And as you can see I've got minus ones at the top of it and then ones and then zeros at the bottom. 00:24:45.720 |
And what I'm going to show you is, in fact I want you to think about something, which 00:24:49.640 |
is: what would happen if I took that matrix and basically shifted it 00:24:57.200 |
over this first image? I'm going to take this three by three and I'm going to put it 00:25:01.240 |
right at the top left, and I'm going to move it right a bit, and move it right 00:25:04.120 |
a bit, and go all the way to the end, then start back at the left and work all the way down. 00:25:08.400 |
And at each point, as it's kind of overlapping a three by three area of pixels, I want 00:25:13.160 |
to take the value of each pixel and multiply it by the equivalent value in the matrix, and add them all up. 00:25:22.240 |
And just to give you a sense of what that looks like, here on the right is a low-res photo. 00:25:30.520 |
Here on the left is how that photo is represented as numbers. 00:25:33.480 |
So you can see here where it's black there are low numbers, in the 20s, and where it's bright there are high numbers. 00:25:41.840 |
So that's how pictures are stored in your computer. 00:25:45.080 |
And then you can see here we've got an example of a particular matrix and basically we can 00:25:53.240 |
multiply every one of these sets of three pixels by the three things in that matrix 00:25:59.480 |
and you get something that comes out on the right. 00:26:04.120 |
So in this case, we're going to take this picture and multiply it by this matrix. 00:26:08.760 |
And so to make life a little bit easier for ourselves, let's try and zoom in to a little section. 00:26:14.400 |
So here's our original picture, the first picture, and let's zoom into the top left-hand corner. 00:26:29.360 |
All right, so let's think about what would happen if we took that three-by-three picture 00:26:34.000 |
and it was over here, or if it was over here, what would happen? 00:26:38.560 |
So I want you to try and have a guess at what you think is going to happen to each one of 00:26:42.120 |
these pixels; we don't actually have very much room here. 00:26:51.400 |
So what I've done here is I've printed out the actual value of each one of those pixels. 00:26:55.400 |
So you can see at the top it's all black, it's all zeros, and in that bit where there's 00:27:00.040 |
a little bit of the seven poking through, there are some numbers that go up to one. 00:27:06.120 |
So let's try, it's called correlating, by the way, let's try correlating my top filter with this image. 00:27:17.360 |
So here's the result, and you can see at the top it's all zeros. 00:27:21.520 |
And up here we've got some high numbers, and down here we've got some low numbers. 00:27:30.480 |
Did you figure out what that was going to look like? 00:27:32.880 |
So you can see basically what it's done, if we look at the whole picture, is it has highlighted the top edges. 00:27:42.400 |
We've taken something incredibly simple, which is this 3x3 matrix. 00:27:47.560 |
We've multiplied it by every 3x3 area in our picture, and each time we've added it up. 00:27:52.360 |
And we've ended up with something that finds top edges. 00:27:56.480 |
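As a sketch, assuming scipy and the `images` array from the loading snippet above, that whole top-edge step is:

```python
import numpy as np
from scipy.ndimage import correlate

# Minus ones at the top, then ones, then zeros: a "top edge" filter
top = np.array([[-1, -1, -1],
                [ 1,  1,  1],
                [ 0,  0,  0]], dtype=float)

# Slide the 3x3 filter over every 3x3 patch, multiplying elementwise and
# summing; places where a bright row sits below a dark row light up
top_edges = correlate(images[0], top)
```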
And so before deep learning, this is part of what we would call feature engineering. 00:28:00.400 |
This is basically where people would say, "How do you figure out what kind of number this is? 00:28:05.360 |
Well, maybe one of the things we should do is find out where the edges of it are. 00:28:11.040 |
So we're going to keep doing this a little bit more. 00:28:13.320 |
So one of the things we could do is look at other kinds of edges. 00:28:19.580 |
You can basically take a matrix and say, "Rotate it by 90 degrees n times." 00:28:24.460 |
So if I rotate it by 90 degrees once, I now have something which looks like this. 00:28:31.960 |
And if I do that for every possible rotation, you can see that basically gives me four different edge filters. 00:28:42.920 |
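In code, the rotation trick is one line (`top` is the filter from the sketch above; `straights` is an assumed name):

```python
# Rotate the top filter by 90 degrees k times to get all four
# straight-edge filters (top, left, bottom, right, in some order)
straights = [np.rot90(top, k) for k in range(4)]
```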
So a word that you're going to hear a lot is convolutional neural networks, because convolutional 00:28:47.280 |
neural networks is basically what all image recognition today uses. 00:28:53.040 |
And the word convolution is one of these overly complex words, in my opinion. 00:28:56.880 |
It actually means the same thing as correlation. 00:28:58.940 |
The only difference is that convolution means that you take the original filter and you flip it around first. 00:29:06.760 |
I've said convolve my image by my top filter rotated by 90 degrees and plot it. 00:29:15.240 |
So when you hear people talk about convolutions, this is actually all they mean. 00:29:18.960 |
They're basically multiplying it by each area and adding it up. 00:29:24.160 |
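A sketch of that equivalence, using scipy's ndimage functions: for an odd-sized kernel, convolution is exactly correlation with the filter rotated 180 degrees.

```python
from scipy.ndimage import convolve, correlate

# Convolution == correlation with the filter flipped (rotated 180 degrees)
assert np.allclose(convolve(images[0], top),
                   correlate(images[0], np.rot90(top, 2)))
```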
So we can do the same thing for diagonal edges. 00:29:27.840 |
So here I've built four different diagonal edges. 00:29:31.520 |
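The exact diagonal matrices aren't shown in the transcript, so the values below are an assumed stand-in, built by the same rotation trick:

```python
# A plausible diagonal-edge filter (assumed values), rotated to get all four
diag = np.array([[ 0,  1,  1],
                 [-1,  0,  1],
                 [-1, -1,  0]], dtype=float)
diagonals = [np.rot90(diag, k) for k in range(4)]

filts = straights + diagonals  # eight filters in total
```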
And then I could try taking our first image and correlating it with every one of those. 00:29:37.960 |
And so here you can see I've got a correlation with the top, with the left, bottom, right, and each of the diagonals. 00:29:46.560 |
Why is this useful? Well, basically, this is a kind of feature engineering. 00:29:50.520 |
We have found eight different ways of thinking about the number seven, or this particular image of a seven. 00:29:57.520 |
And so what we do with that in machine learning is we want to basically create a fingerprint 00:30:02.680 |
of what does a seven tend to look like on average. 00:30:06.280 |
And so in deep learning, to do that, we tend to use something called max pooling. 00:30:11.720 |
And max pooling is another of these complex sounding things that is actually ridiculously 00:30:16.600 |
And as you can see in Python, it's actually a single line of code. 00:30:19.240 |
What we're going to do is we're going to take each seven by seven area, because these are 00:30:22.520 |
28 by 28, so that'll give us a four by four grid of seven by seven areas, and find the value of the brightest pixel in each. 00:30:34.320 |
So you can see that this top edge one, there's some really big numbers here. 00:30:41.040 |
You can see that for the bottom left edge, there's very little that is bright. 00:30:45.760 |
So this is kind of like a fingerprint of this particular image. 00:30:53.320 |
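A sketch of that single line, assuming 28x28 numpy images:

```python
# Max pooling: split the 28x28 image into a 4x4 grid of 7x7 blocks
# and keep only the brightest pixel in each block
def pool(im):
    return im.reshape(4, 7, 4, 7).max(axis=(1, 3))  # -> 4x4 "fingerprint"
```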
So I'm going to use this now to create something really simple. 00:30:55.780 |
It's going to figure out the difference between eights and ones, because that just seems like an easy place to start. 00:31:02.200 |
So I'm going to grab all of the eights out of our MNIST data set, and all of the ones, 00:31:07.200 |
and I'm going to show you a few examples of each of them. 00:31:09.400 |
OK, so there's some eights and there's some ones. 00:31:12.280 |
Hopefully, one of the things you're seeing here is that if you're not somebody who codes 00:31:16.560 |
or maybe you used to and you don't much anymore, it's very quick and easy to code. 00:31:21.600 |
Like these things are generally like one short line, you know, it doesn't take lots of mucking 00:31:26.600 |
around like it used to back in the days of writing C code. 00:31:30.680 |
So what I'm going to do now is I'm going to create this max pooling fingerprint for every one of those images. 00:31:42.560 |
And then what I can do is I'll show you the first five of them. 00:31:46.520 |
So here are the first five eights that are in our data set and what their little fingerprints look like. 00:31:57.680 |
So what I can now do is I can basically say: tell me what the average one of those fingerprints looks like. 00:32:08.880 |
I'm going to take the mean across all of the eights that have been pooled. 00:32:18.880 |
These eight pictures here are the average of the top edge, the left edge, 00:32:25.680 |
bottom edge, right edge, and so forth for all of the eights in our dataset. 00:32:36.080 |
And so, first of all, I'll repeat the exact same process 00:32:40.680 |
for the ones, and hopefully we'll be able to see that there are some differences. 00:32:47.240 |
You can see that the ones basically have no diagonal edges, right? 00:32:51.000 |
So it's all very light gray, but they have very strong vertical edges. 00:32:56.560 |
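Putting the pieces above together, a sketch of building those per-class average fingerprints might look like this (`fingerprint` is an assumed helper name, not the notebook's):

```python
# Correlate each image with all eight filters, pool, then average per class
def fingerprint(im):
    return np.array([pool(correlate(im, f)) for f in filts])  # shape (8, 4, 4)

eights = images[labels == 8]
ones   = images[labels == 1]

avg8 = np.mean([fingerprint(im) for im in eights], axis=0)
avg1 = np.mean([fingerprint(im) for im in ones], axis=0)
```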
So what we're hoping is that we can use this insight now to recognize eights versus ones 00:33:01.900 |
and have our own little digit recognizer. 00:33:05.920 |
So the way we're going to do that is we are going to correlate for every image in our 00:33:14.080 |
data set, we're going to correlate it with each of these parts of the fingerprint, basically. 00:33:20.960 |
That's what this single line of code here does. 00:33:25.320 |
So here's an example of taking the very first one of our eights and seeing how well it correlates with each part of the fingerprint. 00:33:39.200 |
So we're basically at the point where we can now put all this together. 00:33:43.360 |
So what I'm going to do is I'm going to basically say, all right, I'm going to decide whether each image is an eight or a one. 00:33:49.440 |
So I've got this function called "is it an eight?". 00:33:54.660 |
It uses the sum of squared errors, which I won't bother explaining to you, but a lot 00:33:58.120 |
of you probably already know what that is. 00:34:00.720 |
So basically, whichever it's closer to, the filters for being a one or the filters for 00:34:07.680 |
being an eight, that's how I'm going to decide whether it's going to be a one or an eight. 00:34:13.880 |
So I'm just going to test it, and I'm going to say if the error is higher for the ones, then it's an eight. 00:34:31.880 |
And I've tested to see my error for the eight filters and my error for the one filters. 00:34:37.280 |
And you can see my error for the eight filters is lower than my error for the one filters. 00:34:46.600 |
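A sketch of that test; the names `sse` and `is_it_an_8` are assumed, not the notebook's:

```python
def sse(a, b):
    """Sum of squared errors between two fingerprints."""
    return ((a - b) ** 2).sum()

def is_it_an_8(im):
    """1 if the image's fingerprint is closer to the average eight, else 0."""
    fp = fingerprint(im)
    return int(sse(fp, avg8) < sse(fp, avg1))
```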
We can now calculate "is it an eight?" for our entire data set of eights and ones, 00:34:55.280 |
and, for our entire data set of eights and ones, one minus "is it an eight?". 00:35:01.220 |
So as you can see, this is taking now a little while to calculate, because it's basically doing that correlation for thousands of images. 00:35:09.080 |
So for "is it an eight?", 5,200 times it said yes when it was an eight, and 287 times it said yes when it wasn't. 00:35:20.240 |
It has successfully found something that can recognize a difference. 00:35:28.120 |
And 8,900 times when it was a one it said it's a one, and 166 times it said it's a one when it wasn't. 00:35:35.420 |
So these four numbers here are called a classification matrix. 00:35:40.400 |
And when data scientists build these machine learning models, this is basically the thing 00:35:43.800 |
that we tend to look at to decide whether they're any good or not. 00:35:48.040 |
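In sketch form, the four numbers of that matrix are just counts of predictions over each class:

```python
# Tally the classification matrix: correct and incorrect counts per class
preds_for_8s = [is_it_an_8(im) for im in eights]
preds_for_1s = [is_it_an_8(im) for im in ones]

said_8_was_8 = sum(preds_for_8s)                      # eights called eights
said_1_was_8 = len(preds_for_8s) - said_8_was_8       # eights called ones
said_1_was_1 = len(preds_for_1s) - sum(preds_for_1s)  # ones called ones
said_8_was_1 = sum(preds_for_1s)                      # ones called eights
print(said_8_was_8, said_1_was_8, said_1_was_1, said_8_was_1)
```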
That's the entirety of building a simple machine learning approach to image recognition. 00:35:56.600 |
I'm sure you guys have lots of examples of how to make it better. 00:36:00.000 |
And one obvious way to make it better would be to not use the crappy first attempt I had 00:36:05.880 |
at-- this was literally the first eight features I just came up with. 00:36:10.400 |
I'm sure there's a lot of much better features we could be using. 00:36:14.600 |
Specifically, there's a lot of much better three by three matrices. 00:36:21.440 |
So that would be one step would be to make these better. 00:36:23.720 |
Another would be that it doesn't really make sense that we're treating all of the filters as equally important. 00:36:36.400 |
More importantly though, wouldn't we like to be able to say, I don't just want to look 00:36:41.160 |
for a straight edge or a horizontal edge, but I want to look for something more complex, 00:36:46.280 |
I'd love to be able to find corners just here. 00:36:50.080 |
Deep learning is a thing that takes this and does all of those things. 00:36:54.560 |
And the way it does it is by using something called optimization. 00:36:58.960 |
Basically what we do is rather than starting out with eight carefully planned filters like 00:37:03.400 |
these, we actually start out with eight random filters or 100 random filters. 00:37:11.480 |
And we set up something that tries to make those filters better and better and better. 00:37:18.400 |
But rather than optimizing filters, we are going to optimize a simple line. 00:37:23.960 |
A lot of you have probably looked at linear regression some time in your life. 00:37:27.760 |
And we're going to do linear regression, but the deep learning way. 00:37:34.720 |
The definition of a line is something that takes a slope a, an intercept b, and an x value, and returns ax plus b. 00:37:45.720 |
Probably everybody has done that amount of math at the very least. 00:37:51.320 |
Again, it looks like I'm going to have to restart this guy. 00:38:03.560 |
So after we define a line, let's actually set up some data. 00:38:08.760 |
So let's say our actual a is three and our actual b is eight. 00:38:14.440 |
So we're now going to create some random data. 00:38:17.800 |
Okay, and for x, it's just going to be a random number. 00:38:21.040 |
And then for y, it will be the correct value of y based on this line, okay. 00:38:27.640 |
So here is my x values and here is my y values. 00:38:34.880 |
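A minimal sketch of that setup (the values 3 and 8 are from the talk; everything else is assumed naming):

```python
import numpy as np

def lin(a, b, x):
    """The definition of a line: slope a, intercept b."""
    return a * x + b

a_true, b_true = 3.0, 8.0   # the "actual" values we'll pretend to forget
x = np.random.rand(30)      # some random x values
y = lin(a_true, b_true, x)  # the correct y for each x
```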
The machine learning goal, if you were given this data, would be: forget that you ever knew 00:38:39.720 |
that the correct values were a equals three and b equals eight, and figure them out from the data. 00:38:45.500 |
This is the equivalent of figuring out what the optimal set of filters is for my image recognition problem. 00:38:52.840 |
It's basically the same thing, but in that case our filters gave us quite 00:38:57.040 |
a few parameters, whereas here we just have two of them, to make the reasoning simpler. 00:39:02.400 |
But actually it's going to be exactly the same, totally identical as to how this works. 00:39:06.320 |
So once you know how to do this, you'll know how to do that deep learning thing I just 00:39:10.040 |
described of actually optimizing these filters. 00:39:14.120 |
So to do it, we do something very similar to what we had before. 00:39:17.120 |
We basically have to define how do we know whether our prediction is good or not. 00:39:22.280 |
And so we'll basically say our prediction is good if the squared error, again we're 00:39:27.920 |
using the squared error thing, is low rather than high, okay. 00:39:34.120 |
So our loss function, every deep learning algorithm has a loss function, will be the errors based 00:39:39.600 |
on the y values that we actually have versus the result of applying our linear function. 00:39:51.720 |
So I've just decided let's start at guessing that a is minus 1 and guessing that b is positive 1. 00:39:57.240 |
Okay, so if that were the case, what would my average loss be? 00:40:01.540 |
And so this says on average, Jeremy, you would have been out by 8.6, okay. 00:40:11.240 |
I basically figure out, for each 00:40:17.820 |
of my a guess and my b guess: if I make it a little bit higher or a little bit lower, would my loss function go up or down? 00:40:23.640 |
I've actually got a nice little Excel spreadsheet that actually does this. 00:40:28.980 |
I won't go through it in detail now, but basically I've done exactly the same thing. 00:40:37.200 |
I've got my sum of squared errors, and you can see I've literally computed: what's the change in the error 00:40:46.580 |
if I add 0.01 to a, and what's the change if I add 0.01 to b, okay? 00:40:51.180 |
And so if I divide that error by my 0.01, that gives me what's known as the derivative. 00:40:56.400 |
So anybody who's done calculus will, of course, recognize this. 00:40:59.720 |
So all I need to do now is say: okay, what happens to my loss function if I change each parameter by a bit? 00:41:06.720 |
If I increase a by a bit, my loss function goes down. 00:41:09.720 |
If I increase b by a bit, my loss function goes down. 00:41:13.920 |
Therefore, I should increase a and b by a bit. 00:41:29.880 |
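A sketch of the loss and of the spreadsheet's nudge-by-0.01 derivative estimate; the function names are assumed, and the loss here is the mean squared error:

```python
# Loss: average squared error between actual y and our line's prediction
def avg_loss(a, b):
    return np.mean((y - lin(a, b, x)) ** 2)

a_guess, b_guess = -1.0, 1.0
eps = 0.01  # the spreadsheet's small nudge

# Finite-difference estimates of the derivative: nudge each parameter
# by 0.01 and divide the change in the loss by 0.01
dloss_da = (avg_loss(a_guess + eps, b_guess) - avg_loss(a_guess, b_guess)) / eps
dloss_db = (avg_loss(a_guess, b_guess + eps) - avg_loss(a_guess, b_guess)) / eps
```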
So here is the entirety of how to do an optimization from scratch. 00:41:37.400 |
So it's basically saying, okay, calculate my predicted y. 00:41:41.040 |
That's just my linear function with my a guess and my b guess and my x, okay? 00:41:49.160 |
And you'll see in this case, I'm not doing it that slow way of adding 0.01, but everybody 00:41:53.440 |
who's done calculus will know there's always a shortcut in calculus to doing things quickly. 00:41:59.700 |
And in case you're thinking that if you want to do deep learning you're going to have 00:42:02.440 |
to remember all of your rules of calculus: you don't. 00:42:09.620 |
In real life, if you need a derivative, you go to wolframalpha.com, and you type in the 00:42:15.680 |
thing that you want the derivative of, and you press enter. 00:42:25.000 |
You double-click that, copy it, and you paste it, okay? 00:42:32.360 |
And then I pasted that into my code, all right? 00:42:40.440 |
So I bet you're all glad you spent lots of time learning those stupid rules. 00:42:44.880 |
So okay, now that I've done that, don't worry too much about this code, but basically what 00:42:48.800 |
I'm going to do now is I'm going to animate what happens as we call this update function 00:42:53.040 |
40 times, okay, starting with a guess of a of minus 1 and a guess of b of 1. 00:43:01.240 |
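A sketch of that update function, using the calculus shortcut (the analytic derivatives of the mean squared error); the learning rate is an assumed value:

```python
lr = 0.01  # step size (an assumed value)

def update():
    """Move a_guess and b_guess a small step in the downhill direction."""
    global a_guess, b_guess
    err = lin(a_guess, b_guess, x) - y
    a_guess -= lr * 2 * (err * x).mean()  # d(avg_loss)/da
    b_guess -= lr * 2 * err.mean()        # d(avg_loss)/db

for _ in range(40):  # the 40 calls that the animation plots
    update()
```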
And I'm going to plot the original data and my line, and let's see what happens. 00:43:10.800 |
So I'd have to run it for a little bit longer, of course, if it was going to fit exactly. 00:43:14.520 |
But you can see that the line is getting closer and closer to the data. 00:43:18.360 |
So just imagine now taking this idea and doing it for each of these filters, what would happen? 00:43:30.560 |
Imagine if you didn't just have these filters, but imagine if these filters themselves became the inputs to another layer of filters. 00:43:39.080 |
That would allow you to create a corner, because it could say, "Oh, a bit of top edge and a bit of left edge together make a corner." 00:43:44.200 |
Assuming that the thing did actually decide that edges were interesting. 00:43:48.240 |
So it turns out, and this is why the little guys here are celebrating that we've 00:43:55.620 |
just successfully learned about deep learning, it turns out that somebody did this. 00:44:04.600 |
They created lots and lots of layers optimized in exactly this way. 00:44:10.360 |
This is not some super dumbed down version, this is it, right? 00:44:14.200 |
They did this, and here's what they discovered at layer one; one difference is that 00:44:18.920 |
they had color images rather than black and white. 00:44:21.560 |
This is nine out of the 34 examples they had on the first layer, and you can see it decided 00:44:27.180 |
that it wanted to look for edges as well as gradients. 00:44:32.160 |
On the right-hand side, it's showing examples from real photos, they had one and a half 00:44:35.800 |
million photos, real examples of like nine patches of photo that matched this particular filter. 00:44:44.360 |
So then what they did, and this guy's name is Matt Zeiler, 00:44:49.000 |
he said, "Okay, what would happen now if we created a new layer which took these as inputs 00:44:54.240 |
and combined them in exactly the same way as we did with pixels?" 00:45:00.400 |
And so layer two is a little bit harder to draw, so instead he draws nine examples of 00:45:05.320 |
how it gets activated by various images, and you can see in layer two it's learned to find 00:45:11.720 |
lots of horizontal lines, lots of vertical lines, it's learned to find circles. 00:45:16.280 |
And indeed, if you look on the right, it's even already basically got something that can find corners. 00:45:22.940 |
So layer two is finding circular things, stripey things, edges, and as we hoped, corners. 00:45:35.400 |
Layer three is going to do exactly the same thing, but it's going to start with these. 00:45:39.280 |
And this is just 16 out of the probably 60 or so filters he has. 00:45:44.040 |
And so each part is getting exponentially more sophisticated in what it can do. 00:45:48.800 |
And so by layer three, we already have a filter which can find text. 00:45:54.600 |
We have a filter that can find repeating patterns. 00:45:58.180 |
By layer four, we have a filter which can find dog faces. 00:46:03.200 |
By layer five, we have a filter that can find the eyeballs of lizards and birds. 00:46:08.660 |
The most recent deep learning networks have over 1,000 layers. 00:46:14.160 |
So can you imagine each of these exponentially improving levels of kind of semantic richness? 00:46:19.760 |
And that is why these incredibly simple things I showed you, which is convolutions plus optimization 00:46:27.360 |
applied to multiple layers, can let you understand speech better than a human, recognize images better than a human, and more. 00:46:35.360 |
So that's basically the summary of why deep learning changes everything. 00:46:39.160 |
And if you want to have the rest of lesson one and a review of that, and then lesson