Deep learning certificate "lesson 0"

So, anyway, I'm going to just briefly say my name is David Uminsky. I'm the director of the analytics program and director of the Data Institute. I'll say a few more words about the Data Institute upstairs, and I hope everyone will come upstairs and join us for food and drink afterwards.

But right now, it's my great pleasure to introduce Jeremy Howard, who is many things: a serial entrepreneur whose most recent venture is Enlitic, which is bringing deep learning to medicine, and before that, the president and chief data scientist at Kaggle. I think I'm going to leave it there, because I could keep going.

But anyway, let's give a warm welcome to Jeremy.

Thanks very much, David. Everybody can hear okay? Yeah? Great. Okay. So, you know, my passion at the moment and for the last few years has been this area of deep learning. Who here has kind of come across deep learning at some point?

Heard of it, knows about it, maybe a little over half of you, two-thirds, okay, great. It's one of these things which kind of feels like a great fad or a great marketing thing or something kind of like, I don't know, big data or internet of things or, you know, all these various things that we have.

But it actually reminds me of another fad, which I was really excited about in the early 90s and I was telling everybody it was going to be huge, and that fad was called the internet. And so some fads are fads, but they're also fads for a reason. I think deep learning is going to be more important and more transformational than the internet.

So that's hence the title of this talk. It changes everything. Every one of you will be deeply impacted by deep learning, and many of us are already starting to be impacted by deep learning. So before I talk about that, I want to talk about how people have viewed computers for many years. People have really made computers the butt of jokes for many years for all of the things they can't do.

So you may remember from 2009, this was Gmail Autopilot, which was Google's April Fool's joke. And the butt of the joke was basically, well, of course, computers can't reply to your email for you. And so that was the April Fool's joke of 2009. Going back further, a source of humor for Douglas Adams was the Babel Fish, which was basically the idea that technology could never be so advanced as to do something as clever as translating language.

So he came up with this idea of a fish called the Babel Fish that could translate language, and so improbably useful was this thing that it was used as the proof of the non-existence of God in the Hitchhiker's Guide to the Galaxy. But things have changed in the last year.

Basically the joke doesn't work anymore because your computer really can reply to your email. These are actual examples of replies that have been automatically generated by Google Inbox, which is a mobile app for Android and iOS and you can also access on the web. And this is not some carefully curated set of responses for this particular email.

In fact, 15% of emails sent through Inbox by Google are now automatically created by the system. So it's actually being very widely used already. So here's another example of what the system does. And indeed the Babel Fish now exists as well. You can for free use Skype's translator system to translate voice to voice for any of six languages and they're adding more and more.

Even computers are artists now. This is actually not a genuine Van Gogh, but it's fairly impressive. In fact on that I'm going to give you a little test which is to figure out which of these are real paintings and drawings and which ones were done by a computer. So they're all pretty sophisticated.

Now that you've made your decisions, I will show you. The first sketch, I guess pastel sketch, on the left, is done by a computer. The second drawing is done by a computer. And the third one, well, we all know about extrapolating from past events, but this time it worked: it's also done by a computer.

And you can see here the level of nuance the computer has achieved. It has realized that this piece uses a lot of lines and arcs and has decided to connect this lady's eyebrow to her nose to her shoulder as an arc. The piece also has these areas of bursts of color, and it has realized that her hair bun would be a good place to have a burst of color.

It's quite a sophisticated rendition of both the style and the content. So as you might have guessed, the reason that fiction has become reality and computers have gone past what was previously a joke and indeed now they're generating art, which is very hard to tell from real human art, is because of this thing called deep learning.

I don't have time today to go into detail about all of the interesting applications, but I do have a talk on ted.com that you can watch if you have 18 minutes and want more information about it. But before we talk more about deep learning, let's talk about machine learning.

They are not one and the same thing. Deep learning is a way of doing machine learning. So machine learning was invented by this guy, Arthur Samuel, in 1956. This is him playing checkers against an IBM mainframe. Rather than programming this IBM mainframe to play checkers, he instead got the computer to play against itself thousands of times and figure out how to play effectively.

And after doing that, this computer beat the creator of the program. So that was a big step in 1956. So machine learning has been around for a long time. The thing is, though, that until very recently you needed an Arthur Samuel to write your machine learning algorithm for you. To actually get to the point that the machine could learn to tackle your task took a lot of programming effort and engineering effort, and also a lot of domain expertise, mainly to do what's called feature engineering.

But something very interesting has happened more recently, which is that we have the three pieces that at least in theory ought to make machine learning universal. So imagine if you could get a computer to learn and it could learn any type of relationship. Now when you see the word function as a mathematical function, you might think like a line or a quadratic or something.

But I mean function in the most wide possible sense. Like the function that translates Russian into Japanese or the function that allows you to recognize the face of George Clooney. These are all functions. That's what I mean by an infinitely flexible function. So imagine if you had that and you had some way to fit the parameters of that function such that that function could do anything.

It could model anything that you could come up with, such as the two examples I just gave. That would be all very well. You just need one more piece, which is the ability to do that quickly and at scale. And if you had those three things, you would have a totally general learning system, which is what we now have.

That's what deep learning is. Deep learning is a particular algorithm for doing machine learning, which has these three vital characteristics. The infinitely flexible function is the neural network, which has been around for a long time. The all-purpose parameter fitting is backpropagation, which has been around since the, really since 1974, but was not noticed by the world until 1986.

Until very recently, though, we didn't have the third piece. The fast and scalable part has recently come along for various reasons, including the advances in GPUs, which are used mainly to play computer games but also turn out to be perfect for deep learning, wider availability of data, and some vital improvements to the algorithms themselves.

So it's interesting how this is working. Jeff Dean from Google presented this last week, showing how often deep learning is now being used in Google products and services. And you can see this classic hockey stick shape showing an exponential growth here. Google are amongst the first, or maybe the first, at really picking up on using this technology effectively.

What Google did was they set aside a group of people, and they said, go to different parts of Google, tell them about deep learning, and see if they can use it. And from my understanding of the people I know, everywhere they went, the answer was, yes, they can.

And that's why we now have this shape. And of course, the people that the original team talked to were then talking to other people, and that also creates this. So when I say deep learning changes everything, I certainly would expect that in your organizations you would probably find the same thing.

Every aspect of your organization can probably be touched effectively by this. An example, when Google wanted to map the location of every residence and business in France, they did it in less than one hour. They basically grabbed the entire Street View database. These are examples of pictures from the Street View database, and they built a deep learning system that could identify house numbers and could then read those house numbers.

And an hour later, they had mapped the entirety of the country of France. This is obviously something that previously would have taken hundreds of people many years. And this is one of the reasons that, particularly for startups, and here in the Bay Area, this is important, deep learning really does change everything, because suddenly a startup can do things that previously required huge amounts of resources.

So we've kind of seen a little bit of this before. What happens when an algorithm comes along that makes a big difference? And Yahoo discovered what happened. They used to own the web. Eighty percent of home pages were Yahoo back in the day. And Yahoo was manually curated by expert web surfers.

And then this company came along and replaced the expert web surfers with a machine learning algorithm called PageRank. And we all know what happened next. Now this was an algorithm that, compared to deep learning, is incredibly limited and simple in terms of what it can do. But think about the impact that that algorithm had on Yahoo, or think about the impact the collaborative filtering algorithm had on Amazon versus Barnes and Noble, now that we have really successful recommendation systems.

You can see how even relatively simple versions of machine learning have had huge commercial impacts already. So what can deep learning do? I'll just give you a few examples. A paper last year showed that deep learning is able to recognize the content of photos. This is something called the ImageNet dataset, which is one and a half million photos.

And a very patient human had actually spent time trying to classify thousands of these photos and tested themselves and found that they had a 5% error rate. And last year it was announced by Microsoft Research that they had a system which was better than humans at recognizing images. In fact, this number is now down to about 3%.

And so it keeps on dropping quickly. So with deep learning, computers can now see. And they can see in a range of interesting ways. Anybody here from China will probably recognize Baidu Shitu, which is a part of Baidu, the popular Google competitor there. Well, not really a competitor, since Google's not there; it's their Google-like system.

You can upload a picture, which is what I did here. I uploaded the picture in the top left. And it has come up with all of these similar images. I didn't upload any text. So it figured out the breed of the dog, the composition, the type of the background, the fact that it has its tongue hanging out, and so forth.

So you can see that image analysis is a lot more than just saying it's a dog, or even that it's a golden retriever, which is what the Chinese text at the top is saying. It's really understanding what's going on there. And I'll give you some examples of some of the extraordinary things that allows us to do shortly.

Speaking about Baidu, they have now announced that they can recognize speech more accurately than humans, in Chinese and English at least. So we now have computers at a point where last year they could recognize pictures better than us, and now they can recognize speech better than us. Microsoft have this amazing system using deep learning where you can take a picture in which large bits have been cut off. In this case it was a panorama that was done quite badly.

That's the top picture. And the bottom shows how it has automatically filled in its guess as to what the rest might look like. And so this is taking image recognition to the next level, which is to say can I construct an image which would be believable to an image recognizer.

This is part of something called generative models, which is a huge area right now. And again, this is freely available software that you can download off the internet. I think it might be... hmm? That's in Spain. There you go. If I had a deep learning system here, I probably could have looked that up.

So generative models are kind of interesting. This is like in some ways more quirky than anything else, but I think it's fascinating. These pictures here, the four corners are actual photos. The ones in the middle are generated by a deep learning algorithm to try and interpolate between the sets of photos.

But what you can do more than that is you can then say to the deep learning algorithm, what would this photo look like if the person was feeling differently? And then we can animate that. And that is nothing if not creepy. I mean, the interesting thing here is you can see it's doing a lot more than just plastering a smile on their faces.

You know, their eyes are smiling, their faces are moving. We can even take some famous paintings and slightly change how they're looking. Or we can do the same to the queen. Or we can even do the same to Mona Lisa. And you can see as she's moving her eyes up and down again, the whole features are moving as well.

One of the interesting things about this Mona Lisa example was that this system-- she's looking pretty shifty now, isn't she? This system was originally trained without having any paintings in the training set. It was only trained with actual photos. And one of the interesting things about deep learning is how well it can generalize to types of data it hasn't seen before.

In this case, it turns out that it knows how to generate different face movements for paintings. A lot of people think that deep learning is just about big data. It's not. Ilya Sutskever from OpenAI presented a new model last week that he tested on a very famous data set called MNIST, which we'll learn more about shortly.

It's basically trying to recognize digits. It's digit recognition. It's a very old, classic machine learning problem. He discovered that with just 50 labeled images of digits, he could train a 99% accurate number classifier. So we're not talking millions or billions. We're talking 50. And so these recent advances that allow us to use small amounts of data are really changing what's possible with deep learning.

It's also turning really anybody into an artist. There's a thing called neural doodle that allows you to whip out your stylus and jot down some sophisticated imagery like this and then say how you would like it rendered, what style. In this case, it was being rendered as impressionism. You can see it's done a pretty good job of generating an image which hopefully fits what the original artist had in their head with their original doodle.

And it's not just about images, it's about text as well, or even combining the two. This is a field called multimodal learning. These sentences are totally novel sentences constructed from scratch by a deep learning algorithm after looking at the picture. So you can see that in order to construct this, the deep learning algorithm must have understood a lot about not just what the main objects in the picture are, but how they relate to each other and what they're doing.

I got so excited about this that three years ago, I left my job at Kaggle and spent a year researching what the biggest opportunities for deep learning in the world are. I came to the conclusion that the number one biggest opportunity at that time was medicine. I started a new company called Enlitic.

We had four of us, all computer scientists and mathematicians, no medical people on the team. And within two months, we had a system for radiology which could predict the malignancy of lung cancer more accurately than a panel of four of the world's best radiologists. This was kind of very exciting to me because it was everything that I hoped was possible.

It was also always somehow surprising when you actually run a model and it's classifying cancer and you genuinely have no idea how it did it, because of course all you do is set up the kind of situation in which it can learn, and then it does that learning. So this turned out to be very successful, and Enlitic today has raised $15 million.

It's a pretty successful company, and one thing I'll mention is that the earlier Baidu Shitu example of taking a picture and finding similar pictures is doing big things in radiology. It basically allows radiologists to search a database of millions of CT scans and MRIs to find previous patients whose medical imagery is just like that of the patient they're interested in. Then they can find out exactly the path of that patient, how they responded to different drugs, and so forth.

So this kind of semantic search of imagery is a really exciting area. So one thing interesting about my particular CV when it comes to creating the first deep learning medical diagnostic company is not so much what I've done but perhaps what I haven't done. And so that's the entirety of my actual biology, life sciences and medicine experience.

And one of the exciting things for those of you who are entrepreneurs or interested in being entrepreneurs is that there are no limits as to what you can hope to do. You recognize a problem that you want to solve and that you care about and that hopefully hasn't been solved that well before, and you have a go, and really you can do a lot.

In my case, once I showed that we could do some useful stuff in oncology, we got covered by CNN on one of the TV shows, and then suddenly the medical establishment came to us. At that point we got a lot of help from the medical establishment as well, so you get this nice feedback loop going on.

So most importantly, deep learning can also do choreography. So if you're excited about this and think this all sounds interesting, you might be wondering, well, where can you learn more? And the answer, you won't be surprised to hear, is the Data Institute. We haven't previously announced this, but I'm going to announce now that the first, to our knowledge, ever in-person university accredited deep learning certificate will be here at the Data Institute. The second lesson will start in late October.

So I invite you all to join. You might be wondering when the first lesson is, and the answer is it's right now. So let's get started. You came to university, come on, you've got to expect to be studying here. There's no slacking off. So the certificate course will be seven weeks of two and a half hours each.

We don't have two and a half hours right now so this will by necessity be heavily compressed. So if this doesn't make as much sense as you might like it to, don't worry, the MSAM students will certainly follow along fine but I'll try and make it as clear as possible.

One of the things that I strongly believe is that deep learning is easy. It is made hard by people who put way more math into it than is necessary and also by what I think is a desire for exclusivity amongst the kind of deep learning specialists. They make up crazy new jargon about things that are really very simple.

So I want to kind of show you how simple it can be. And specifically we're going to look at MNIST, the data set I told you about which is about recognizing handwritten digits. And I'm going to use a system called Jupyter Notebook. For those of you that don't code, I hope the fact that this is done in code isn't too off-putting.

You certainly don't need to use code for everything, but I find it a very good way to show what's going on. So I'm going to have to make sure that we actually have this thing running, which it was not. Let's try again. OK. Let's try that. OK. OK.

So I'm going to load the data in. And so the data, the MNIST data has 55,000 28x28 images in it. So we're going to take a look at one. So here is the first of those images. As you can see it is a 28x28 picture and as well as the image we also have labels, which is just a list of numbers.

And so you can see that this, as is common with pretty much every machine learning dataset, you have two things. You have something which is information that you're given and then information that you have to derive. So in this case the goal of this dataset is to take a picture of a number and return what the number is.

So let's have a look at a few more. So here's the first five pictures and the first five numbers that go with each one. So this was originally generated by NIST, and they basically had thousands of people draw lots of numbers and then somebody went through and coded into a computer what each one was.

So I'm going to show you some interesting things we can do with pictures. The first thing I'm going to do is I'm going to create a little matrix here that I've called top. And as you can see I've got minus ones at the top of it and then ones and then zeros or maybe you want to see it visually.

And in fact I want you to think about something, which is what would happen if I took that matrix and shifted it over this first image. I'm going to take this three by three and put it right at the top left, then move it right a bit, and right a bit again, all the way to the end of the row. Then I'll start back at the left on the next row and go all the way to the end again.

And at each point, as it's overlapping a three by three area of pixels, I want to take the value of each pixel, multiply it by the equivalent value in this matrix, and add them all together. And just to give you a sense of what that looks like, here on the right is a low res photo.

Here on the left is how that photo is represented as numbers. So you can see here where it's black, there are low numbers in the 20s and where it's white, there are high numbers in the 200s. So that's how pictures are stored in your computer. And then you can see here we've got an example of a particular matrix and basically we can multiply every one of these sets of three pixels by the three things in that matrix and you get something that comes out on the right.
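To make the multiply-and-add step concrete, here is a minimal sketch for a single three by three position. The pixel values in `patch` are made up for illustration (dark in the 20s, bright in the 200s, as on the slide), and the filter follows the "minus ones, then ones, then zeros" description of `top`; NumPy is an assumption, not necessarily what the notebook used for this step.

```python
import numpy as np

# Hypothetical 3x3 patch of pixel values: dark (low) on top, bright (high) below.
patch = np.array([[ 20,  22,  21],
                  [210, 205, 208],
                  [207, 209, 206]])

# The filter as described in the talk: minus ones, then ones, then zeros.
top = np.array([[-1, -1, -1],
                [ 1,  1,  1],
                [ 0,  0,  0]])

# Multiply each pixel by the matching filter value, then add everything up.
response = int(np.sum(patch * top))
print(response)  # 560
```

A large positive number like this means "dark above, bright below", which is what a top edge looks like at that position.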

So this is basically what we're doing. So in this case, we're going to take this picture and multiply it by this matrix. And so to make life a little bit easier for ourself, let's try and zoom in to a little bit of it. So here's our original picture, the first picture, and let's zoom into the top left-hand corner.

So let's move that. There we go, and that looks pretty good. All right, so let's think about what would happen if we took that three-by-three matrix and it was over here, or if it was over here. What would happen? So I want you to try and have a guess at what you think is going to happen to each one of these pixels. I don't actually have very much room here.

So what I've done here is I've printed out the actual value of each one of those pixels. So you can see at the top it's all black, it's all zeros, and in that bit where there's a little bit of the seven poking through, there are some numbers that go up to one.

So let's try, it's called correlating, by the way, let's try correlating my top filter with the picture and see what it looks like. So here's the result, and you can see at the top it's all zeros. And up here we've got some high numbers, and down here we've got some low numbers.

What does that look like? That's what it looks like. So test yourself, how did you go? Did you figure out what that was going to look like? So you can see basically what it's done if we look at the whole picture is it has highlighted the top edges. So it's kind of pretty interesting, right?

We've taken something incredibly simple, which is this 3x3 matrix. We've multiplied it by every 3x3 area in our picture, and each time we've added it up. And we've ended up with something that finds top edges. And so before deep learning, this is part of what we would call feature engineering.
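The whole sliding procedure can be sketched in a few lines. This is an illustrative version written out by hand, not the notebook's own code: `correlate_valid` is a hypothetical helper name, it only visits positions where the filter fits entirely inside the image (library routines usually pad the border instead), and the 6x6 `image` is a toy stand-in for a digit.

```python
import numpy as np

def correlate_valid(image, kernel):
    """Slide the kernel over the image; at each position multiply and sum."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

top = np.array([[-1, -1, -1],
                [ 1,  1,  1],
                [ 0,  0,  0]], dtype=float)

# Toy image: black background with a bright horizontal bar on rows 2 and 3.
image = np.zeros((6, 6))
image[2:4, :] = 1.0

edges = correlate_valid(image, top)
print(edges)
```

In the output, the row where the bar begins comes out as 3s (a top edge), the row where it ends comes out as -3s, and flat regions come out as 0.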

This is basically where people would say, "How do you figure out what kind of number this is?" Well, maybe one of the things we should do is find out where the edges of it are. So we're going to keep doing this a little bit more. So one of the things we could do is look at other kinds of edges.

And it's quite nice in Python. You can basically take a matrix and say, "Rotate it by 90 degrees n times." So if I rotate it by 90 degrees once, I now have something which looks like this. And if I do that for every possible rotation, you can see that basically gives me four different edge filters.
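As a sketch of that rotation step, NumPy's `np.rot90(m, k)` rotates a matrix counterclockwise by 90 degrees `k` times. The list name `straights` is an assumption chosen for illustration.

```python
import numpy as np

top = np.array([[-1, -1, -1],
                [ 1,  1,  1],
                [ 0,  0,  0]])

# One filter per rotation: 0, 90, 180 and 270 degrees.
straights = [np.rot90(top, k) for k in range(4)]

for f in straights:
    print(f, "\n")
```

Rotating once turns the rows of the top filter into columns, so each row reads -1, 1, 0 and the filter responds to left edges instead.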

So a term that you're going to hear a lot is convolutional neural networks, because convolutional neural networks are basically what all image recognition today uses. And the word convolution is one of these overly complex words, in my opinion. It actually means the same thing as correlation. The only difference is that convolution means that you take the original filter and you rotate it by 180 degrees.

I'm going to prove it to you here. I've said convolve my image by my top filter rotated by 180 degrees (that is, by 90 degrees twice) and plot it. And you can see it looks exactly the same. So when you hear people talk about convolutions, this is actually all they mean. They're basically multiplying it by each area and adding it up.
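The equivalence can be checked numerically. The sketch below writes both operations out by hand so the only difference is visible: convolution reads the filter backwards. The helper names `correlate_valid` and `convolve_valid` and the random test image are assumptions for illustration; both helpers skip the border rather than padding it.

```python
import numpy as np

def correlate_valid(image, kernel):
    """Multiply each window by the kernel as-is and sum."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve_valid(image, kernel):
    """Same sliding window, but the kernel is indexed in reverse."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            total = 0.0
            for u in range(kh):
                for v in range(kw):
                    total += image[i + u, j + v] * kernel[kh - 1 - u, kw - 1 - v]
            out[i, j] = total
    return out

top = np.array([[-1, -1, -1], [1, 1, 1], [0, 0, 0]], dtype=float)
image = np.random.default_rng(0).random((8, 8))

# Convolving with the filter rotated 180 degrees undoes the reversal.
same = np.allclose(convolve_valid(image, np.rot90(top, 2)),
                   correlate_valid(image, top))
print(same)  # True
```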

So we can do the same thing for diagonal edges. So here I've built four different diagonal edge filters. And then I could try taking our first image and correlating it with every one of those. And so here you can see I've correlated it with the top, left, bottom, and right filters, and each of the diagonals.

So why have we done that? Well, basically this is because this is a kind of feature engineering. We have found eight different ways of thinking about the number seven, or this particular rendition of the number seven. And so what we do with that in machine learning is we want to basically create a fingerprint of what does a seven tend to look like on average.

And so in deep learning, to do that, we tend to use something called max pooling. And max pooling is another of these complex sounding things that is actually ridiculously easy. And as you can see in Python, it's actually a single line of code. What we're going to do is we're going to take each seven by seven area, because these are 28 by 28, so that'll give us a four by four grid of seven by seven areas, and find the value of the brightest pixel, the max, in each.
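One way to write that single line with plain NumPy is a reshape followed by a max; this is a sketch under that assumption, and the notebook may well use a library helper instead. The function name `pool` and the two bright test pixels are illustrative.

```python
import numpy as np

def pool(im):
    # Split the 28x28 image into a 4x4 grid of 7x7 blocks and
    # keep the brightest pixel (the max) from each block.
    return im.reshape(4, 7, 4, 7).max(axis=(1, 3))

# Toy image: all black except two bright pixels.
im = np.zeros((28, 28))
im[3, 5] = 0.9     # falls in block row 0, block column 0
im[10, 20] = 0.5   # falls in block row 1, block column 2

pooled = pool(im)
print(pooled)
```

The result is a 4x4 array with 0.9 in the top-left cell, 0.5 in row 1, column 2, and zeros elsewhere.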

So this is the result of doing max pooling. So you can see that this top edge one, there's some really big numbers here. You can see that for the bottom left edge, there's very little, which is bright. So this is kind of like a fingerprint of this particular image.

So I'm going to use this now to create something really simple. It's going to figure out the difference between eights and ones, because that just seems like the easiest thing we can do. They're very different numbers. So I'm going to grab all of the eights out of our MNIST data set, and all of the ones, and I'm going to show you a few examples of each of them.

OK, so there's some eights and there's some ones. Hopefully, one of the things you're seeing here is that if you're not somebody who codes or maybe you used to and you don't much anymore, it's very quick and easy to code. Like these things are generally like one short line, you know, it doesn't take lots of mucking around like it used to back in the days of writing C code.

So what I'm going to do now is I'm going to create this max pooling fingerprint basically for every single one of my eights. And then what I can do is I'll show you the first five of them. So here are the first five eights that are in our data set and what their little fingerprints look like for their top edge.

This is just for one of their edges. So what I can now do is I can basically say, tell me what the average one of those fingerprints looks like for all of the eights. So that's what I'm going to do here. I'm going to take the mean across all of the eights that have been pooled.

And here is what that looks like. These eight pictures here are the average of the top side, the left side, bottom side, right side, and so forth for all of the eights in our data set. So this is like our kind of ideal eight.
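Averaging the fingerprints is a mean over the image axis. The sketch below uses random numbers as a stand-in for the real pooled features, and the array layout (images, then filters, then the 4x4 fingerprint) and the names `pooled_eights` and `ideal_eight` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for real data: 100 images x 8 edge filters x 4x4 pooled fingerprint.
pooled_eights = rng.random((100, 8, 4, 4))

# Average over the image axis to get one typical fingerprint per filter.
ideal_eight = pooled_eights.mean(axis=0)
print(ideal_eight.shape)  # (8, 4, 4)
```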

So first of all, I'll repeat the exact same process for the ones, and hopefully we'll be able to see that there are some differences. And you can, right? You can see that the ones basically have no diagonal edges, right? So it's all very light gray, but they have very strong vertical edges.

So what we're hoping is that we can use this insight now to recognize eights versus ones and have our little digit recognizer. So the way we're going to do that is, for every image in our data set, we're going to correlate it with each of these parts of the fingerprint, basically.

That's what this single line of code here does. So it's defining that function. So here's an example of taking the very first one of our eights and seeing how well it correlates with each of our fingerprints. That's just an example. So we're basically at the point where we can now put all this together.

So what I'm going to do is I'm going to basically say, all right, I'm going to decide whether something is an eight or not. So I've got this function called "is it an eight?", and it's going to return a one if it is. This is the sum of squared errors, which I won't bother explaining to you, but a lot of you probably already know what that is.

So basically, whether it's closer to the filters for being a one or to the filters for being an eight is how I'm going to decide whether it's a one or an eight. So I'm just going to test, and I'm going to say if the error is higher for the ones, then it must be an eight, and vice versa.
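A minimal sketch of that decision rule, assuming each image has already been turned into a stack of pooled features with the same shape as the two average fingerprints. The names `sse` and `is_eight` and the constant-valued templates are hypothetical stand-ins, not the notebook's actual data.

```python
import numpy as np

def sse(a, b):
    """Sum of squared errors: how far apart two arrays are."""
    return float(((a - b) ** 2).sum())

def is_eight(features, ideal_eight, ideal_one):
    """Return 1 if the features are closer to the eight template, else 0."""
    return 1 if sse(features, ideal_one) > sse(features, ideal_eight) else 0

# Hypothetical templates: 8 filters, each with a 4x4 fingerprint.
ideal_eight = np.full((8, 4, 4), 0.8)
ideal_one = np.full((8, 4, 4), 0.2)

looks_like_eight = np.full((8, 4, 4), 0.7)
looks_like_one = np.full((8, 4, 4), 0.3)

print(is_eight(looks_like_eight, ideal_eight, ideal_one))  # 1
print(is_eight(looks_like_one, ideal_eight, ideal_one))    # 0
```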

So that's basically my little function. So here's an example. Here is the very first one of our eights. And I've tested to see my error for the eight filters and my error for the one filters. And you can see my error for the eight filters is lower than my error for the one filters.

So that looks very helpful. So let's do it. We can now calculate for our entire data set of eights and ones. Is it an eight? And for our entire data set of eights and ones, one minus is it an eight? So in other words, is it not an eight?

So as you can see, this is taking now a little while to calculate because it's basically running this on all of them. OK. So let's finish the first set. So for is it an eight, 5,200 times it said yes if it was an eight, and 287 times it said yes if it was a one.

So that's great. It has successfully found something that can recognize a difference. And what about the "it's not an eight"? And again, it's done a good job. 8,900 times it said it's a one when it was a one, and 166 times it said it's a one when it was an eight.

So these four numbers here are called a classification matrix, also known as a confusion matrix. And when data scientists build these machine learning models, this is basically the thing that we tend to look at to decide whether they're any good or not. So that's it. That's the entirety of building a simple machine learning approach to image recognition.
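Those four counts can be computed directly from the true labels and the predictions. The eight labels below are invented purely to show the bookkeeping; they are not the talk's results.

```python
import numpy as np

# Made-up example: 1 means "eight", 0 means "one".
actual    = np.array([1, 1, 1, 0, 0, 1, 0, 0])
predicted = np.array([1, 1, 0, 0, 0, 1, 1, 0])

# Rows: what the model said (eight, not eight).
# Columns: what it really was (eight, one).
matrix = np.array([
    [np.sum((predicted == 1) & (actual == 1)), np.sum((predicted == 1) & (actual == 0))],
    [np.sum((predicted == 0) & (actual == 1)), np.sum((predicted == 0) & (actual == 0))],
])
accuracy = float(np.mean(predicted == actual))

print(matrix)
print(accuracy)  # 0.75
```

The diagonal holds the correct answers and the off-diagonal cells hold the two kinds of mistake.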

So how do we make it better? I'm sure you guys have lots of examples of how to make it better. And one obvious way to make it better would be to not use the crappy first attempt I had at-- this was literally the first eight features I just came up with.

I'm sure there are a lot of much better features we could be using. Specifically, there are a lot of much better three by three matrices (we call these filters) we could be using. So one step would be to make these better. Another would be that it doesn't really make sense that we're treating all of the filters as equally important.

We're just averaging out how close they are. Maybe some are more important than others. More importantly though, wouldn't we like to be able to say, I don't just want to look for a straight edge or a horizontal edge, but I want to look for something more complex, which is a corner.

I'd love to be able to find corners just here. Deep learning is a thing that takes this and does all of those things. And the way it does it is by using something called optimization. Basically what we do is rather than starting out with eight carefully planned filters like these, we actually start out with eight random filters or 100 random filters.

And we set up something that tries to make those filters better and better and better. So I'm going to show you how that works. But rather than optimizing filters, we are going to optimize a simple line. A lot of you have probably looked at linear regression some time in your life.

And we're going to do linear regression, but the deep learning way. So again, this is going to be super simple. The definition of a line is something that takes a slope a, a constant b, and an x value, and it gives you ax plus b. Probably everybody has done that amount of math at the very least.

So I have now defined a line. Again, it looks like I'm going to have to restart this guy. There we go, okay. So after we define a line, let's actually set up some data. So let's say our actual a is three and our actual b is eight. Okay, so that's done.

So we're now going to create some random data. We're going to create 30 random points. Okay, and for x, it's just going to be a random number. And then for y, it will be the correct value of y based on this line, okay. So here is my x values and here is my y values.
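The setup so far might look roughly like this in Python with NumPy. The names `lin`, `a_true`, and `b_true`, the fixed seed, and the choice of uniform random numbers between 0 and 1 for x are assumptions made for the sketch.

```python
import numpy as np

def lin(a, b, x):
    """A line: slope a, constant b, evaluated at x."""
    return a * x + b

# The "actual" values we will later pretend not to know.
a_true, b_true = 3.0, 8.0

n = 30
rng = np.random.default_rng(0)   # fixed seed (an assumption) so the sketch is repeatable
x = rng.random(n)                # 30 random x values in [0, 1)
y = lin(a_true, b_true, x)       # the correct y for each x, with no noise added
```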

So now we've generated some data. The machine learning goal, if you were given this data, would be forget that you ever knew that the correct values of a were three and b is eight. You have to figure out what they were. This is the equivalent of figuring out what the optimal set of filters are for my image recognition.

It's basically the same thing, but in this case, my filters, we have, you know, quite a few of them, but here we just have two of them to make the reasoning simpler. But actually it's going to be exactly the same, totally identical as to how this works. So once you know how to do this, you'll know how to do that deep learning thing I just described of actually optimizing these filters.

So to do it, we do something very similar to what we had before. We basically have to define how do we know whether our prediction is good or not. And so we'll basically say our prediction is good or bad depending on whether the squared error, again we're using the squared error thing, is low or high, okay.

So that's the sum of squared errors. So our loss function, every deep learning algorithm has a loss function, will be the errors based on the y values that we actually have versus the result of applying our linear function. And so then I have to start somewhere. I have to start with some random numbers.

So I've just decided let's start at guessing that a is minus 1 and guessing that b is positive 1. Okay, so if that were the case, what would my average loss be? And so this says on average, Jeremy, you would have been out by 8.6, okay. So I want to improve that.
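Here is one way the loss and the "on average you'd be out by" figure could be written. It is a sketch under assumptions: the function names are invented, the "average" is taken to be the root mean squared error, and a two-point data set stands in for the 30 random points so the arithmetic is easy to follow by hand.

```python
import numpy as np

def lin(a, b, x):
    return a * x + b

def sse(y, y_pred):
    """Sum of squared errors."""
    return float(((y - y_pred) ** 2).sum())

def loss(y, a, b, x):
    """Loss: squared error between the real y values and the line's predictions."""
    return sse(y, lin(a, b, x))

def avg_loss(y, a, b, x):
    # Assumption: "average" means root mean squared error.
    return float(np.sqrt(loss(y, a, b, x) / len(x)))

# Tiny stand-in data set generated from the true line a=3, b=8.
x = np.array([0.0, 1.0])
y = lin(3.0, 8.0, x)

# Starting guess from the talk: a = -1, b = 1.
start_loss = loss(y, -1.0, 1.0, x)
start_avg = avg_loss(y, -1.0, 1.0, x)
```

With these two points the guesses are off by 7 and 11, so `start_loss` is 49 + 121 = 170. The exact figure depends on the data, which is why it differs from the 8.6 in the talk.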

And the way I improve that is very simple. For each of my a guess and my b guess, I basically figure out, if I make it a little bit higher or a little bit lower, would my loss function go up or would it go down? I've actually got a nice little Excel spreadsheet that actually does this.

I won't go through it in detail now, but basically I've done exactly the same thing. I've got my random x's and y's. I've got my predictions. That's the linear function. I've got my sum of squared errors, and you can see I've literally taken what's the value of y if I add 0.01 to a?

What's the value of y if I add 0.01 to b? So what's the change in the error if I add 0.01 to a, and if I add 0.01 to b, okay? And so if I divide that change in the error by my 0.01, that gives me what's known as the derivative. So anybody who's done calculus will, of course, recognize this.
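The spreadsheet trick can be sketched in a few lines of Python. This is the slow "nudge it by 0.01" approach, not the calculus shortcut; the function names and the two-point data set are assumptions for illustration.

```python
import numpy as np

def lin(a, b, x):
    return a * x + b

def loss(y, a, b, x):
    return float(((y - lin(a, b, x)) ** 2).sum())

def finite_diff_grads(y, a, b, x, eps=0.01):
    """Approximate the derivative of the loss with respect to a and b by nudging each by eps."""
    base = loss(y, a, b, x)
    dl_da = (loss(y, a + eps, b, x) - base) / eps
    dl_db = (loss(y, a, b + eps, x) - base) / eps
    return dl_da, dl_db

# Tiny stand-in data set from the true line a=3, b=8.
x = np.array([0.0, 1.0])
y = lin(3.0, 8.0, x)

dl_da, dl_db = finite_diff_grads(y, -1.0, 1.0, x)
```

Both values come out negative here, meaning the loss goes down as a and b go up.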

So all I need to do now is say, okay, if I increase a by a bit, my loss function goes down. If I increase b by a bit, my loss function goes down. Therefore, I should increase a and b by a bit.

How do we decide what a bit is? We just make it up. It's called a learning rate, okay? And so I'm picking a learning rate of 0.01. So here is the entirety of how to do an optimization from scratch. It's just this code here, right? So it's basically saying, okay, calculate my predicted y.

That's just my linear function with my a guess and my b guess and my x, okay? Now I calculate my two derivatives. And you'll see in this case, I'm not doing it that slow way of adding 0.01, but everybody who's done calculus will know there's always a shortcut in calculus to doing things quickly.

And in case you're thinking that if you want to do deep learning, you're going to have to remember all of your rules of calculus, you don't. In real life, nobody does that. In real life, if you need a derivative, you go to alpha.wolfram.com, and you type in the thing that you want your derivative of, and you press enter.

You wait three seconds. You then go to plain text. You double-click that, copy it, and you paste it, okay? And so that's what I've done here, okay? And then I pasted that into my code, all right? So that's how we do calculus today. So I bet you're all glad you spent lots of time learning those stupid rules.
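For reference, the shortcut for this particular loss is short enough to write out. Assuming the loss is the sum of squared errors over the points $(x_i, y_i)$, as described earlier, the chain rule gives the following (the notation is ours, not from the slides):

```latex
L(a, b) = \sum_{i} \left( a x_i + b - y_i \right)^2

\frac{\partial L}{\partial a} = 2 \sum_{i} x_i \left( a x_i + b - y_i \right)
\qquad
\frac{\partial L}{\partial b} = 2 \sum_{i} \left( a x_i + b - y_i \right)
```

Both derivatives are negative whenever the line sits below the data at positive x values, which is the "increase a and b by a bit" situation from the starting guess.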

So okay, now that I've done that, don't worry too much about this code, but basically what I'm going to do now is I'm going to animate what happens as we call this update function 40 times, okay, starting with a guess of a of minus 1 and a guess of b of 1.

And I'm going to plot the original data and my line, and let's see what happens. There it is, okay? So I'd have to run it for a little bit longer, of course, if it was going to exactly hit. But you can see that the line is getting closer and closer to the data.
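Without the animation, the same loop might look like the sketch below. It assumes the loss is the sum of squared errors, uses the calculus-shortcut derivatives rather than the 0.01 nudge, and takes 40 steps with a learning rate of 0.01 from the starting guess a = -1, b = 1. The function name `update`, the seed, and the uniform x values are assumptions.

```python
import numpy as np

def lin(a, b, x):
    return a * x + b

def loss(y, a, b, x):
    return float(((y - lin(a, b, x)) ** 2).sum())

def update(a_guess, b_guess, x, y, lr=0.01):
    """One optimization step: move a and b a little way downhill on the loss."""
    y_pred = lin(a_guess, b_guess, x)
    dl_db = 2.0 * (y_pred - y).sum()          # derivative of the loss with respect to b
    dl_da = 2.0 * (x * (y_pred - y)).sum()    # derivative of the loss with respect to a
    return a_guess - lr * dl_da, b_guess - lr * dl_db

# Data from the true line a=3, b=8 (seed and uniform x are assumptions).
rng = np.random.default_rng(0)
x = rng.random(30)
y = lin(3.0, 8.0, x)

a_guess, b_guess = -1.0, 1.0
start_loss = loss(y, a_guess, b_guess, x)
for _ in range(40):
    a_guess, b_guess = update(a_guess, b_guess, x, y)
end_loss = loss(y, a_guess, b_guess, x)
```

After the loop `end_loss` is lower than `start_loss`, and running more steps would move the guesses closer still to 3 and 8.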

So just imagine now taking this idea and doing it for each of these filters, what would happen? And now imagine something further. Imagine if you didn't just have these filters, but imagine if these filters themselves became the inputs to a second set of filters. That would allow you to create a corner, because it could say, "Oh, a bit of top edge and a bit of right edge," right?
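To make the "filters feeding filters" idea concrete, here is a toy sketch. The two first-layer filters and the second-layer weights are hand-picked for illustration (in deep learning they would be found by the optimization just shown), and the helper `conv3x3` and the 10 by 10 test image are invented.

```python
import numpy as np

def conv3x3(img, filt):
    """Slide a 3x3 filter over the image and sum the elementwise products."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = (img[i:i + 3, j:j + 3] * filt).sum()
    return out

# Layer one: hand-picked filters (an assumption; normally these are learned).
top_edge = np.array([[-1, -1, -1],
                     [ 1,  1,  1],
                     [ 0,  0,  0]], dtype=float)
right_edge = np.array([[0, 1, -1],
                       [0, 1, -1],
                       [0, 1, -1]], dtype=float)

# A white square (rows 3-7, columns 2-6) on a black background.
img = np.zeros((10, 10))
img[3:8, 2:7] = 1.0

# Layer-one outputs, keeping only positive responses.
top = np.maximum(conv3x3(img, top_edge), 0)
right = np.maximum(conv3x3(img, right_edge), 0)

# Layer two: combine the two layer-one outputs. It only fires where there is
# a bit of top edge AND a bit of right edge at the same spot.
corner = np.maximum(1.0 * top + 1.0 * right - 3.0, 0)

hits = np.argwhere(corner > 0).tolist()
```

The only place the second layer fires is the window centred on the square's top-right corner.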

Assuming that the thing did actually decide that edges were interesting. So it turns out (this is why the little guys are here, so we can be excited that we've just successfully learned about deep learning) that somebody did this, and two years ago they showed the results.

They created lots and lots of layers optimized in exactly, exactly this way. This is not some super dumbed down version, this is it, right? They did this, and they looked at what layer one discovered. There's one difference, which is that they had color images rather than black and white.

This is nine out of the 34 examples they had on the first layer, and you can see it decided that it wanted to look for edges as well as gradients. This is what that shows. On the right-hand side, it's showing examples from real photos, they had one and a half million photos, real examples of like nine patches of photo that matched this particular patch.

So this is layer one. So then what they did was they said, okay (this guy's name is Matt Zeiler), he said, "Okay, what would happen now if we created a new layer which took these as inputs and combined them in exactly the same way as we did with pixels?" And he called it layer two.

And so layer two is a little bit harder to draw, so instead he draws nine examples of how it gets activated by various images, and you can see in layer two it's learned to find lots of horizontal lines, lots of vertical lines, it's learned to find circles. And indeed, if you look on the right, it's even already basically got something that finds sunsets.

So layer two is finding circular things, stripey things, edges, and as we hoped, corners. So what is layer three going to do? Layer three is going to do exactly the same thing, but it's going to start with these. And this is just 16 out of the probably 60 or so filters he has.

And so each part is getting exponentially more sophisticated in what it can do. And so by layer three, we already have a filter which can find text. We have a filter that can find repeating patterns. By layer four, we have a filter which can find dog faces. By layer five, we have a filter that can find the eyeballs of lizards and birds.

The most recent deep learning networks have over 1,000 layers. So can you imagine each of these exponentially improving levels of kind of semantic richness? And that is why these incredibly simple things I showed you, which is convolutions plus optimization applied to multiple layers, can let you understand speech better than a human, recognize images better than a human.

So that's basically the summary of why deep learning changes everything. And if you want to have the rest of lesson one and a review of that, and then lesson two, come along in late October. Let's thank Jeremy. (audience applauds)
