
Foundations and Challenges of Deep Learning (Yoshua Bengio)


Chapters

0:00
3:34 Bypassing the curse of dimensionality: we need to build compositionality into our ML models, just as human languages exploit compositionality to give representations and meanings to complex ideas
9:56 The need for distributed representations: clustering and other non-distributed methods
12:56 Each feature can be discovered without seeing the exponentially large number of configurations of the other features: consider a network whose hidden units discover features such as "person wears glasses"
24:30 Exponential advantage of depth
35:15 Attention Mechanism for Deep Learning: consider an input or intermediate sequence or image, and an upper-level representation which can choose which parts to focus on
36:48 Attention Mechanisms for Memory Access Enable Reasoning
56:12 Learning Multiple Levels of Abstraction
58:11 Towards Key Principles of Learning for both Machines and Brains

Transcript

Thank you, Sammy. So I'll tell you about some very high level stuff today and no new algorithm. Some of you already know about the book that Ian Goodfellow, Aaron Courville and I have written. And it's now in pre-sale by MIT Press. I think you can find it on Amazon or something.

And the paper copy, the actual shipping is gonna be in December, hopefully in time for NIPS. So we've already heard that story at least, well, from several people here, at least from Andrew I think. But it's good to ponder a little bit some of these ingredients that seem to be important for deep learning to succeed.

But in general for machine learning to succeed, to learn really complicated tasks of the kind we want to reach human level performance. So if a machine is gonna be intelligent, it's going to need to acquire a lot of information about the world. And the big success of machine learning for AI has been to show that we can provide that information through data, through examples.

But really think about it, that machine will need to know a huge amount of information about the world around us. This is not how we are doing it now because we're not able to train such big models, but it will come one day. And so we'll need models that are much bigger than the ones we currently have.

Of course, that means machine learning algorithms that can represent complicated functions, that's one good thing about neural nets. But there are many other machine learning approaches that allow you in principle to represent very flexible forms, like classical non-parametric methods or SVMs. But they're gonna be missing point 4 and potentially point 5 on this list, depending on the methods.

Point 3, of course, says you need enough computing power to train and use these big models. And point 5 just says that it's not enough to be able to train the model, you have to be able to use it in a reasonably efficient way from a computational perspective. This is not always the case with some probabilistic models where inference, in other words, answering questions, having the computer do something, can be intractable and then you need to do some approximations, which could be efficient or not.

Now, the point I really want to talk about is the fourth one, how do we defeat the curse of dimensionality? In other words, if you don't assume much about the world, it's actually impossible to learn about it. And so I'm gonna tell you a bit about the assumptions that are behind a lot of deep learning algorithms which make it possible to work as well as we are seeing in practice in the last few years.

Hmm, something wrong? Microsoft bug. >> >> Okay, so how do we bypass the curse of dimensionality? The curse of dimensionality is about the exponentially large number of configurations of the space of variables that we want to model. The number of values that all of the variables that we observe can take is gonna be exponentially large in general because there's a compositional nature to it.

If each pixel can take two values and you've got a million pixels, then you've got 2 to the 1 million possible images. So the only way to beat an exponential is to use another exponential. So we need to make our models compositional. We need to build our models in such a way that they can represent functions that look very complicated.
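As a worked version of that count (an illustration added here, not a slide from the talk): with d variables that can each take v values, the number of joint configurations is

```latex
v^{d}, \qquad \text{e.g. } 2^{1{,}000{,}000} \text{ for a binary image with } 10^{6} \text{ pixels,}
```

which is why any method that has to see, or allocate parameters for, each configuration separately is hopeless.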

But yet, these models need to have a reasonably small number of parameters. Reasonably small in the sense that compared to the number of configurations of the variables, the number of parameters should be small. And we can achieve that by composing little pieces together, composing layers together, composing units on the same layer together.

And that's essentially what's happening with deep learning. So you actually have two kinds of compositions. There's the compositions happening on the same layer. This is the idea of distributed representations, which I'm gonna try to explain a bit more. This is what you get when you learn embeddings for words or for images, representations in general.

And then there's the idea of having multiple levels of representation. That's the notion of depth. And there, there is another kind of composition that takes place, whereas the first one is a kind of parallel composition. I can choose the values of my different units separately, and then they together represent an exponentially large number of possible configurations.

In the second case, there's a sequential composition where I take the output of one level and I combine them in new ways to build features for the next level and so on and so on. So the reason deep learning is working is because the world around us is better modeled by making these assumptions.

It's not necessarily true that deep learning is gonna work for any machine learning problem. In fact, if we consider the set of all possible distributions that we would like to work from, deep learning is no better than any other. And that's basically what the no free lunch theorem is saying.

It's because we are incredibly lucky that we live in this world, which can be described by using composition, that these algorithms are working so well. This is important to really understand this. So before I go a bit more into distributed representations, let me say a few words about non-distributed representations.

So if you're thinking about things like clustering, N-grams for language modeling, classical nearest neighbors, SVMs with Gaussian kernels, classical non-parametric models with local kernels, decision trees, all these things, the way these algorithms really work is actually pretty straightforward if you cut the crap and hide the math and try to understand what is going on.

They look at the data in the data space and they break that space into regions. And they're gonna use different free parameters for each of those regions to figure out what the right answer should be. The right answer, it doesn't have to be supervised learning. Even in unsupervised learning, there's a right answer.

It might be the density or something like that. Okay, and you might think that that's the only way of solving a problem. We consider all of the cases and we have an answer for each of the cases. And we can maybe interpolate between those cases that we've seen. The problem with this is somebody comes up with a new example which isn't in between two of the examples we've seen, something that requires us to extrapolate.

Something that's a non-trivial generalization. And these algorithms just fail. They don't really have a recipe for saying something meaningful away from the training examples. There's another interesting thing to note here, which I would like you to keep in mind before I show the next slide, and which is in red here: we can do a kind of simple counting to relate the number of free parameters that can be learned and the number of regions in the data space that we can distinguish.

So here, we basically have a linear relationship between these two things. So for each region, I'm gonna need at least something like some kind of center for the region, and maybe if I need to output something, I'll need an extra set of parameters to tell me what the answer should be in that area.

So the number of parameters grows linearly with the number of regions that I'm gonna be able to distinguish. The good news is I can have any kind of function, right? So I can break up the space in any way I want, and then for each of those regions, I can have any kind of output that I need.

So for decision trees, the regions would be splits along the axes and so on, and this picture is more like nearest neighbor or something like that. Now, what's going on? Ah, another bug. >> >> I don't think I will send. >> >> Let's hope it works this time. I have another option.

Sorry about this. Okay, so here's the point of view of distributed representations for solving the same general machine learning problem. We have a data space and we wanna break it down, but we're gonna break it down in a way that's not general. We're gonna break it down in a way that makes assumptions about the data, but it's gonna be compositional and it's going to allow us to be exponentially more efficient.

So how are we gonna do this? So in the picture on the right, what you see is a way to break the input space by the intersection of half planes. And this is the kind of thing you would have with what happens at the first layer of a neural net.

So here, imagine the input is two dimensional, so I can plot it here, and I have three binary hidden units, C1, C2, C3. So because they're binary, you can think of them as little binary classifiers. And because it's only a one layer net, you can think of what they're doing as a linear classification.

And so those colored hyperplanes here are the decision surfaces for each of them. Now, these three bits, they can take eight values, right, corresponding to whether each of them is on or off. And those different configurations of those bits correspond to actually seven regions here, because there's one of the eight regions which is not feasible.

So now you see that we are defining a number of regions which is corresponding to all of the possible intersections of the corresponding half planes. And now we can play the game of how many regions do we get for how many parameters? And what we see is that if we played the game of growing the number of dimensions, of features, and also of inputs, we can get an exponentially large number of regions, which are all of these intersections, right?

There's an exponential number of these intersections corresponding to different binary configurations. Yet the number of parameters grows linearly with the number of units. So it looks like we're able to express a function. And then on top of that, I could imagine you have a linear classifier, right? That's the one hidden layer neural net.

So the number of parameters grows just linearly with the number of features. But the number of regions that the network can really provide a different answer to grows exponentially. So this is very cool. And the reason it's very cool is that it allows those neural nets to generalize. Because while we're learning about each of those features, we can generalize to regions we've never seen because we've learned enough about each of those features separately.
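A tiny numerical sketch of that parameters-versus-regions game (my own illustration with arbitrary sizes, not code from the talk): enumerate the on/off patterns that a single layer of random threshold units actually realizes over a 2-D input space, and compare to the parameter count.

```python
# Sketch: regions carved out by one layer of binary (threshold) units, i.e. the
# intersections of half-planes, versus the number of parameters. Arbitrary sizes.
import numpy as np

rng = np.random.default_rng(0)
d, n_units = 2, 6                       # input dimension, number of binary features
W = rng.normal(size=(n_units, d))       # one hyperplane (weights + bias) per unit
b = rng.normal(size=n_units)

xs = rng.uniform(-3, 3, size=(200_000, d))   # dense sample of the input space
patterns = (xs @ W.T + b > 0)                # each point's on/off pattern

n_regions = len(np.unique(patterns, axis=0))
n_params = W.size + b.size
print(f"{n_units} units, {n_params} parameters, {n_regions} realized regions")
# With a 2-D input the count saturates at O(n_units^2); when the input dimension
# grows along with the number of units, the number of realizable patterns grows
# exponentially while the parameter count stays linear in the number of units.
```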

I'm going to give you an example of this in a couple of slides. Actually, let's do it first. So think about those features, let's say the input is an image of a person. And think of those features as things like, I have a detector that says that the person wears glasses.

And I have another unit that's detecting that the person is a female or male. And I have another unit that detects that the person is a child or not. And you can imagine hundreds or thousands of these things, of course. So the good news is you could imagine learning about each of these feature detectors, these little classifiers, separately.

In fact, you could do better than that. You could share intermediate layers between the input and those features. But let's take even the worst case and imagine we were to train those separately, which is the case in the linear model that I showed before. We have a separate set of parameters for each of these detectors.

So if I have n features, each of them, say, needs order of k parameters. Then I need order of nk parameters, and I need order of nk examples. And one thing you should know from machine learning theory is that if you have order of p parameters, you need order of p examples to do a reasonable job of generalizing.

You can get around that by regularizing and effectively having fewer degrees of freedom. But to keep things simple, you need about the same number of examples, or maybe say 100 times more or 10 times more, as the number of really free parameters. So now the relationship between the number of regions that I can represent and the number of examples I need is quite nice, because the number of regions is going to be 2 to the number of these binary features.

So a person could wear glasses or not, be a female or a male, a child or not, and I could have 100 of these things. And I could probably recognize reasonably well all of these 2 to the 100 configurations of people, even though I've obviously not seen all of those 2 to the 100 configurations.
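Written out as the rough count he is describing (my paraphrase in symbols):

```latex
\#\text{parameters} = O(nk), \qquad
\#\text{examples needed} = O(nk), \qquad
\#\text{distinguishable configurations} = 2^{n},
```

so with n = 100 binary features and a handful of parameters each, the number of examples needed grows only linearly in n while the number of describable people grows like 2 to the 100.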

Why is it that I'm able to do that? I'm able to do that because the models can learn about each of these binary features kind of independently in the sense that I don't need to see every possible configuration of the other features to know about wearing glasses. I can learn about wearing glasses even though I've never seen somebody who was a female and a child and chubby and had yellow shoes.

And if I have seen enough examples of people wearing glasses, I can learn about wearing glasses in general. I don't need to see all of the configurations of the other features to learn about one feature. Okay? And so this is really why this thing works, it's because we're making assumptions about the data, that those features are meaningful by themselves.

And you don't need to actually have data for each of the regions, the exponential number of regions, in order to learn the proper way of detecting or of discovering these intermediate features. Let me add something here. There were some experiments recently actually showing that this kind of thing is really happening.

Because the features I was talking about, not only am I assuming that they exist, but the optimization methods, the training procedures discover them. They can learn them. And this is an experiment that's been done in Antonio Torralba's lab at MIT, where they trained a usual ConvNet to recognize places. So the outputs of the net are just the types of places, like is this a beach scene or an office scene or a street scene and so on?

But then the thing they've done is they asked people to analyze the hidden units to try to figure out what each hidden unit was doing. And they found that there's a large proportion of units for which humans can find a pretty obvious interpretation of what those units like. So they see a bunch of units which like people or different kinds of people or animals or buildings or seating or tables, lighting and so on.

So it's as if those neural nets are indeed discovering semantic features. They're semantic because people can actually give them names, and they serve as the intermediate features on the way to the final goal, which here is classifying scenes. And the reason they're generalizing is because now you can combine those features in an exponentially large number of ways.

You could have a scene that has a table, a different kind of lighting, some people, maybe a pet. And you can say something meaningful about the combinations of these things. Because the network is able to learn all of these features without having to see all of the possible configurations of them.

So I don't know if my explanation makes sense to you, but now's your chance to ask me a question. All clear? Usually it's not. Yeah? >> So with decision trees you can kind of do this the same as well, right? >> With decision trees? >> If you have a set of decision trees.

>> Right, to some extent. So the question is, can't we do the same thing with a set of decision trees? Yeah, in fact, this is one of the reasons why forests work better or bagged trees work better than single trees. Forests, or bagged trees, are actually one level deeper than a single tree.

But they still don't have as much of a sort of distributed aspect as neural nets. And usually they're not trained jointly. I mean, boosted trees are, to some extent, in a greedy way. But yeah, any other question? Yeah? >> Do you find cases that are non-compositional? >> Cases where what?

>> Do you find non-compositional cases? >> Non-conditional? >> Non-compositional. >> Non-computer vision. >> Non-compositional. >> Non-compositional. I don't understand the question. I mean, I don't understand what you mean. What do you mean non-compositional? >> You're talking about compositionality here. >> Yeah, it's everywhere around us. I don't think that there are examples of neural nets that really work well where the data doesn't have some kind of compositional structure in it.

But if you come up with an example, I'd like to hear about it. Yes, yes? >> So in the language of graphical models, do you mean that we're facing a model of this sort? And in the real world, we're trying to look for some independent factors, but we cannot get independence, but it starts somewhere with a very small square.

>> To think about this issue in graphical model terms can be done. But you have to think about not feature detection, like I've been doing here, but about generating an image or something like that. Then it's easier to think about it. So the same kinds of things happen if you think about how I could generate an image.

If you think about underlying factors like which objects, where they are, what's their identity, what's their size, these are all independent factors, which you compose together in funny ways. If you were to build a graphics engine, you can see exactly what those ways are. And it's much, much easier to represent that joint distribution using this compositional structure than if you're trying to work directly in pixel space, which is normally what you would do with a classical non-parametric method, and it wouldn't work.

But if you look at our best deep generative models now for images, for example, like GANs or VAEs, they're really, we're not there yet, but they're amazingly better than anything that people could dream of just a few years ago in machine learning. Okay, let me move on, because I have other things to talk about.

So this is all kind of hand wavy, but some people have done some math around these ideas. And so for example, there's one result from two years ago, right here, where we studied the single layer case. And we consider a network with rectifiers, and we find that the network, of course, computes a piecewise linear function.

And so one way to quantify the richness of the function that it can compute, I was talking about regions here, but well, you can do the same thing here. You can count how many pieces this network has in its input-to-output function. And it turns out that it's exponential in the number of inputs, well, it's the number of units to the power of the number of inputs.

So that's for sort of distributed representation, there's an exponential kicking in. We also studied the depth aspect. So what you need to know about depth is that there's a lot of earlier theory that says that a single layer is sufficient to represent any function. However, that theory doesn't specify how many units you might need.

And in fact, you might need an exponentially large number of units. So what several results show is that there are functions that can be represented very efficiently, with few units, so few parameters, if you allow the network to be deep enough. So out of all the functions, again, it's a luckiness thing, right?

Out of all the functions that exist, there's a very, very small fraction which happen to be very easy to represent with a deep network. And if you try to represent these functions with a shallow network, you're screwed. You're gonna need an exponential number of parameters. And so you're gonna need an exponential number of examples to learn these things.

But again, we're incredibly lucky that the functions we want to learn have this property. But in a sense, it's not surprising. I mean, we use this kind of compositionality and depth everywhere. When we write a computer program, we just don't have a single main. We have functions that call other functions.

And we were able to show similar things as what I was telling you about for the single layer case, that as you increase depth for these deep ReLU networks, the number of pieces in the piecewise linear function grows exponentially with the depth. So it's already exponentially large with a single layer, but it gets exponentially even more with a deeper net.
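A small empirical illustration of what "counting pieces" means (my own sketch with random weights and arbitrary sizes, not the construction from those papers): count the distinct ReLU on/off patterns, each of which corresponds to one linear piece, that a net reaches on a 2-D slice of input space. Keep in mind the theorems are about the best case a given architecture can represent; at random initialization the measured gap between shallow and deep can be modest.

```python
# Sketch: count distinct ReLU activation patterns (a proxy for linear pieces)
# reached on a dense 2-D input grid, for a shallow net and a deeper net.
import numpy as np

rng = np.random.default_rng(0)

def count_patterns(widths, n_points=300):
    """Distinct hidden-unit on/off patterns of a random ReLU net over a 2-D grid."""
    grid = np.stack(np.meshgrid(np.linspace(-2, 2, n_points),
                                np.linspace(-2, 2, n_points)), axis=-1).reshape(-1, 2)
    h, in_dim, patterns = grid, 2, []
    for width in widths:
        W = rng.normal(size=(in_dim, width)) / np.sqrt(in_dim)
        b = rng.normal(size=width) * 0.1
        pre = h @ W + b
        patterns.append(pre > 0)                  # which units are active at each input
        h, in_dim = np.maximum(pre, 0), width
    return len(np.unique(np.concatenate(patterns, axis=1), axis=0))

print("shallow, one layer of 20 units :", count_patterns([20]))
print("deeper, three layers of 8 units:", count_patterns([8, 8, 8]))
```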

Okay, so this was the topic of representation of functions, why deep architectures can be very powerful if we're lucky, and we seem to be lucky. Another topic I wanna mention that's kind of very much in the foundations is, how is it that we're able to train these neural nets in the first place?

In the 90s, many people decided to not do any more research on neural nets, because there were theoretical results showing that there are really an exponentially large number of local minima in the training objective of a neural net. So in other words, the function we wanna learn has many of these holes, and if we start at a random place, well, what's the chance we're gonna find the best one, the one that corresponds to a good cost?

And that was one of the motivations for people to flock into a very large area of research in machine learning in the 90s and 2000s, based on algorithms that require only convex optimization to train. Cuz of course, if we can do convex optimization, we eliminate this problem. If the objective function is convex in the parameters, then we know there's a single global minimum.

Right, so let me show you a picture here, you get a sense of, if you look on the right hand top, this is, if you draw a random function in 1D or 2D or 3D, like here is kind of a random smooth function in 2D, you see that it's gonna have many ups and downs.

These are local minima. But the good news is that in high dimension, it's a totally different story. So what are the dimensions here? We're talking about the parameters of the model, and the vertical axis is the cost that we're trying to minimize. And what happens in high dimension is that instead of having a huge number of local minima on our way when we're trying to optimize, what we encounter instead is a huge number of saddle points.

So a saddle point is like the thing on the bottom right in 2D. So you have two parameters and the y-axis is the cost you wanna minimize. And what you see in a saddle point is that you have dimensions or directions where the objective function has a minimum, so there's a curve that curves up.

And in other directions, it curves down. So a saddle point has both a minimum in some directions and a maximum in other directions. This is interesting because these points, saddle points and minima, are places where you could get stuck. In principle, if you're exactly at the saddle point, you don't move.
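The standard two-dimensional example of what he is describing (my illustration):

```latex
f(x, y) = x^{2} - y^{2}, \qquad \nabla f(0, 0) = (0, 0),
```

the origin is a critical point that is a minimum along the x direction and a maximum along the y direction, so gradient descent started anywhere slightly off the x axis keeps sliding down along y rather than getting stuck.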

But if you move a little bit away from it, you will go down the saddle, right? So what our work and other work from NYU, by Choromanska and collaborators of Yann LeCun, showed is that actually in very high dimension, not only is the issue more about saddle points than local minima.

But the local minima are good. So let me try to explain what I mean by this. So let me show you actually first an experiment from the NYU guys. So they did an experiment where they gradually changed the size of the neural net. And they look at what looks like local minima, but they could be saddle points that are the lowest that they could obtain by training.

And what you're looking at is a distribution of the errors they get from different initializations of their training. And so what happens is that when the network is small, like the pink here on the right, there's a widespread distribution of costs that you can get depending on where you start, and they're pretty high.

And if you increase the size of the network, it's like all of the local minima that you find concentrate around a particular cost. So you don't get any of these bad local minima that you would get with a small network, they're all kind of pretty good. And if you increase even more the size of the network, this is like a single hidden layer network, not very complicated.

This phenomenon increases even more. In other words, they all kind of converge to the same kind of cost. So let me try to explain what's going on. So if we go back to the picture of the saddle point, but instead of being in 2D, imagine you are in a million D.

And in fact, people have billion-D networks these days. I'm sure Andrew has even bigger ones. >> >> But, so what happens in this very high dimensional space of parameters is that, if things are not really bad for you, so if you imagine a little bit of randomness in the way the problem is set up, and it seems to be the case, in order to have a true local minimum, you need to have the curvature going up like this in all the billion directions.

So if there is a certain probability of this event happening, that this particular direction is curving up and this one is curving up, the probability that all of them curve up becomes exponentially small. So we tested that experimentally. What you see in the bottom left is a curve that shows the training error as a function of what's called the index of the critical point, which is just the fraction of the directions which are curving down, right?

So 0% would mean it's a local minimum, 100% would mean it's a local maximum, and anything in between is a saddle point. So what we find is that as training progresses, we're going close to a bunch of saddle points, and none of them are local minima, otherwise we would be stuck.
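A concrete version of that "index" measurement (a toy sketch added here, not the experimental code from those papers; the `hessian` helper below is a finite-difference stand-in, not a library function): the index of a critical point is the fraction of negative eigenvalues of the Hessian of the loss there.

```python
# Sketch: classify a critical point of a loss by the fraction of directions that
# curve down (the "index"). Uses a toy quadratic loss; in the real experiments the
# Hessian would come from a neural network's training objective.
import numpy as np

def hessian(f, x, eps=1e-4):
    """Finite-difference Hessian of a scalar function f at point x."""
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps**2)
    return H

# A toy high-dimensional loss with a saddle at the origin: curves up in most
# directions, down in a few.
d, n_down = 50, 5
signs = np.array([1.0] * (d - n_down) + [-1.0] * n_down)
loss = lambda x: float(np.sum(signs * x**2))

eigvals = np.linalg.eigvalsh(hessian(loss, np.zeros(d)))
index = np.mean(eigvals < 0)
print(f"index = {index:.2f}  (0.0 = local minimum, 1.0 = local maximum)")
# Expected output: index = 0.10, i.e. a saddle point.
```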

And in fact, we never encounter local minima until we reach the lowest possible cost that we're able to get. In addition, there is theory suggesting that the local minima will actually be close in cost to the global minimum. They will be above it, and they will concentrate in a little band above the global minimum.

But that band of local minima will be close to the global minimum. And the larger the dimension, the more this is gonna be true. So to go back to my analogy, right? At some point, of course, you will get local minima, even though it's unlikely when you're in the middle.

When you get close to the bottom, well, you can't go lower. So it has to rise up in all the directions. But it's, yeah. So that's kind of good news. I think, in spite of this, I don't think that the optimization problem of neural nets is solved. There are still many cases where we find ourselves to be stuck.

And we still don't understand what the landscape looks like. There's a set of beautiful experiments by Ian Goodfellow that help us visualize a bit what's going on. But I think one of the open problems of optimization for neural nets is, what does the landscape actually look like? It's hard to visualize, of course, because it's very high dimensional.

But for example, we don't know what those saddle points really look like. When we actually measure the gradient near those, when we are approaching those saddle points, it's not close to zero. So we never go to actually flat places. This may be due to the fact that we're using SGD and it's kind of hovering above things.

There might be conditioning issues where even if you are at a saddle, near a saddle point, you might be stuck, even though it's not a local minimum. Because in many directions, it's still going up, maybe 95% of the directions. And the other directions are hard to reach because simply, there's a lot more curvature in some directions than other directions.

And that's the traditional ill-conditioning problem. We don't know exactly what's making it hard to train some networks. Usually, conv nets are pretty easy to train. But when you go into things like machine translation or, even worse, reasoning tasks with things like neural Turing machines and things like that, it gets really, really hard to train these things.

And people have to use all kinds of tricks like curriculum learning, which are essentially optimization tricks, to make the optimization easier. So I don't want to tell you that, the optimization problem of neural nets is easy, it's done, we don't need to worry about it. But it's much easier and less of a concern than what people thought in the 90s.

Okay, so. So machine learning, I mean, deep learning is moving out of pattern recognition and into more complicated tasks, for example, including reasoning and combining deep learning with reinforcement learning, planning, and things like that. You've heard about attention. That's one of the tools that is really, really useful for many of these tasks.

We've sort of come up with attention mechanisms not as a way to focus on what's going on in the outside world, like we usually think of attention in the visual space, but as internal attention, right? In the space of representations that have been built. So that's what we do here in machine translation.
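A minimal sketch of the internal, soft attention he is describing (my illustration; dot-product scoring is assumed here for brevity, whereas the original machine-translation attention scored positions with a small learned network):

```python
# Minimal content-based soft attention: a query vector attends over a sequence of
# "annotation" vectors (e.g. encoder states) and returns a weighted summary.
import numpy as np

def soft_attention(query, annotations):
    """query: (d,); annotations: (T, d) -> context vector (d,) and weights (T,)."""
    scores = annotations @ query                 # one relevance score per position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over positions
    context = weights @ annotations              # convex combination of annotations
    return context, weights

T, d = 7, 4
rng = np.random.default_rng(0)
context, weights = soft_attention(rng.normal(size=d), rng.normal(size=(T, d)))
print(weights.round(3), context.round(3))
```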

And it's been extremely successful, as Quoc said. So I'm not gonna show you any of these pictures, blah, blah, blah. So I'm getting more now into the domain of challenges. A challenge that I've been working on since I was a baby researcher, as a PhD student, is long-term dependencies in recurrent nets.

And although we've made a lot of progress, this is still something that we haven't completely cracked. And it's connected to the optimization problem that I told you before, but it's a very particular kind of optimization problem. So some of the ideas that we've used to try to make the propagation of information and gradients easier include using skip connections over time, include using multiple time scales.

There's some recent work in this direction from my lab and other groups. And even the attention mechanism itself you can think of as a way to help deal with long-term dependencies. So the way to see this is to think of the place on which we're putting attention as part of the state.

So imagine really you have a recurrent net and it has two kinds of state. It has the usual recurrent net state, but it also has the content of the memory. Quoc told you about memory networks and neural Turing machines. And the full state really includes all of these things. And now we're able to read or write from that memory.

I mean, the little recurrent net is able to do that. So what happens is that there are memory elements which don't change over time, maybe they've been written once. And so the information that has been stored there, it can stay for as much time as they're not gonna be overwritten.

So that means that if you consider the gradients back propagated through those cells, they can go pretty much unhampered and there's no vanishing gradient problem. So this is something that could be, that view of the problem of long term dependencies with memory I think could be very useful. All right, in the last part of my presentation, I wanna tell you about what I think is the biggest challenge ahead of us, which is unsupervised learning.

Any question about attention and memory before I move on to unsupervised learning? Okay, so why do we care about unsupervised learning? It's not working. >> >> Well, well, well. Actually, it's working a lot better than it was, but it's still not something you find in industrial products. At least not in an obvious way.

There are less obvious ways where unsupervised learning is actually already extremely successful. So for example, when you train word embeddings with Word2Vec or any other model and you use that to pre-train, like we did for our machine translation systems or other kinds of NLP tasks, you're exploiting unsupervised learning. Even when you train a language model that you're gonna stick in some other thing, or pre-train something with that, you're also doing unsupervised learning.

But I think the potential of and the importance of unsupervised learning is usually underrated. So why do we care? First of all, the idea of unsupervised learning is that we can train, we can learn something from large quantities of unlabeled data that humans have not curated, and we have lots of that.

Humans are very good at learning from unlabeled data. I have an example that I use often that makes it very, very clear that, for example, children can learn all kinds of things about the world, even though no one, no adult ever tells them anything about it until much later when it's too late.

>> >> Physics. So a two or three year old understands physics. If she has a ball, she knows what's gonna happen when she drops the ball. She knows how liquids behave. She knows all kinds of things about objects and ordinary Newtonian physics, even though she doesn't have explicit equations and a way to describe them with words, but she can predict what's gonna happen next, right?

And the parents don't tell the children, force equals mass times acceleration. >> >> Right, so this is purely unsupervised, and it's very powerful. We don't even have that right now. We don't have computers that can understand the kinds of physics that children can understand. So it looks like it's a skill that humans have, and that's very important for humans to make sense of the world around us, but we haven't really yet succeeded in putting it into machines.

Let me tell you other reasons, connected to this, why unsupervised learning could be useful. When you do supervised learning, essentially the way you train your system is you focus on a particular task. It goes: here are the input variables, and here's an output variable that I would like you to predict given the input.

You're learning P of Y given X. But if you're doing unsupervised learning, essentially you're learning about all the possible questions that could be asked about the data that you observe. So it's not that there's X1, X2, X3, and Y. Everything is an X, and you can predict any of the X given any of the other X, right?

If I give you a picture and I hide a part of it, you can guess what's missing. If I hide the caption, you can generate the caption given the image. If I hide the image and I give you the caption, you can guess what the image would be or draw it or figure out from examples which one is the most appropriate.

So you can answer any questions about the data when you have captured the joint distribution between them, essentially. So that could be useful. Another practical way that unsupervised learning has been used, and in fact this is how the whole deep learning thing started, is as a regularizer.

Because in addition to telling our model that we want to predict Y given X, we're saying find representations of X that both predict Y and somehow capture something about the distribution of X, the leading factors, the explanatory factors of X. And this, again, is making an assumption about the data, so we can use that as a regularizer if the assumption is valid.

Essentially, the assumption is that the factor Y that we're trying to predict is one of the factors that explain X. And that by doing unsupervised learning to discover factors that explain X, we're gonna pick Y among the other factors. And so it's gonna be much easier now to do supervised learning.
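One standard way to write that down (my addition, not a formula from the talk): train the shared representation with a combined objective,

```latex
\min_{\theta}\; \underbrace{-\log p_{\theta}(y \mid x)}_{\text{supervised task}}
\;+\; \lambda\, \underbrace{\bigl(-\log p_{\theta}(x)\bigr)}_{\text{unsupervised model of } x},
```

where the unsupervised term acts as a regularizer that helps exactly when the factors explaining x include the y we care about.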

Of course, this is also the reason why transfer learning works, because there are underlying factors that explain the inputs for a bunch of tasks. And maybe a different subset of factors is relevant for one task, and another subset of factors is relevant for another task. But if these factors overlap, then there's a potential for synergy by doing multitask learning.

So the reason multitask learning is working is because unsupervised learning is working, is because there are representations and factors that explain the data that can be useful for our supervised learning tasks of interest. That could also be used for domain adaptation, for the same reason. The other thing that people don't talk about as much with unsupervised learning, and I think it was part of the initial success that we had with stacked autoencoders and RBMs, is that you can actually make the optimization problem of training deep nets easier.

Cuz if you're gonna, for the most part, if you're gonna train a bunch of RBMs or a bunch of auto encoders, and I'm not saying this is the right way of doing it, but it captures some of the spirit of what unsupervised learning does. A lot of the learning can be done locally.

You're trying to extract some information, you're trying to discover some dependencies, that's a local thing. Once we have a slightly better representation, we can again tweak it to extract better, more independence, or something like that. So there's a sense in which the optimization problem might be easier if you have a very deep net.

Another reason why we should care about unsupervised learning, even if our ultimate goal is to do supervised learning, is because sometimes the output variables are complicated. They are compositional. They have a joint distribution. So in machine translation, which we talked about, the output is a sentence. A sentence is a tuple of words that have a complicated joint distribution given the input in the other language.

And so it turns out that many of the things we discover by exploring unsupervised learning, which is essentially about capturing joint distributions, can be often used to deal with these structured output problems where you have many outputs that form a compositional, complicated distribution. There's another reason why unsupervised learning, I think, is going to be really necessary for AI.

Model-based reinforcement learning. So I think I have another slide just for this. Let's think about self-driving cars. This is a very popular topic these days. How did I learn that I shouldn't do some things with the wheel that would get me killed when I'm driving? Because I haven't experienced these states where I get killed.

And I simply haven't done it like a thousand times to learn how to avoid it. So supervised learning, or rather traditional reinforcement learning, like policy learning kind of things, or actor-critic or things like that, won't work, because I need to generalize about situations that I'm never going to encounter, because otherwise if I did, I would die.

So these are dangerous states, and I need to generalize about these states, but I can't have enough data for them. And I'm sure there are lots of machine learning applications where we would be in that situation. I remember a couple of decades ago, I got some data from a nuclear plant.

And so they wanted to predict when it's going to blow up. >> >> To avoid it. So I asked, how many examples? >> >> They said zero. Right, so you see, sometimes it's hard to do supervised learning, because the data you'd like to have, you can't have. It's data about situations that are very rare.

So how can we possibly solve this problem? Well, the only solution I can see is that we learn enough about the world that we can predict how things would unfold. When I'm driving, I have a kind of mental model of physics and how cars behave that I can figure out if I turn to right at this point, I'm going to end up on the wall and this is going to be very bad for me.

And I don't need to actually experience that to know that it's bad. I can make a mental simulation of what would happen. So I need a kind of generative model of how the world would unfold if I do such and such actions. And unsupervised learning is sort of the ideal thing to do that.

But of course, it's going to be hard because we're going to have to train models that capture a lot of aspects of the world in order to be able to learn to generalize properly in those situations even though they don't see any data of it. So that's one reason why I think reinforcement learning needs to be worked on more.

So I have a little thing here. I think people who have been doing deep learning can collaborate with people who are doing reinforcement learning and not just by providing a black box that they can use in their usual algorithms. I think there are things that we do in supervised deep learning that, or unsupervised deep learning, that can be useful in sort of rethinking our reinforcement learning.

So one example, well, one thing I really like to think about is credit assignment. In other words, how do different machine learning algorithms figure out what the hidden units are supposed to do, what the intermediate computations or the intermediate actions should be? This is what credit assignment is about.

And backprop is the best recipe we currently have for doing credit assignment. It tells the parameters of some intermediate layer how they should change so that the cost much, much later, 100 steps later if it's a recurrent net, gets reduced. So we could probably use some inspiration from backprop and how it's used to improve reinforcement learning.

And one such clue is that when we do supervised backprop, say, we don't predict the expected loss that we're gonna have and then try to minimize it, where the expectation would be over the different realizations of the correct class. That's not what we do. But this is what people do in RL.

They will learn a critic or a Q function, which is learning the expected value of the future reward or future loss. In our case, that might be minus log probability of the correct answer given the input. And then they will backprop through this or use it to estimate the gradient on the actions.

Instead, when we do supervised learning, we're gonna do credit assignment where we use the particular observations of the correct class that actually happened for this x, right? We have x, we have y, and we use the y to figure out how to change our prediction or our action. So it looks like this is something that should be done for RL.
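To make the contrast concrete (my paraphrase in standard notation; the deterministic-policy form of the critic gradient shown here is just one common variant): supervised learning differentiates the loss of the particular observed outcome, while an actor-critic method differentiates a learned estimate of the expected loss,

```latex
\text{supervised:}\quad \nabla_{\theta}\,\bigl[-\log p_{\theta}(y_{\text{observed}} \mid x)\bigr],
\qquad
\text{actor-critic:}\quad \nabla_{\theta}\, \hat{Q}_{\phi}\bigl(x,\, a_{\theta}(x)\bigr),
```

so the supervised update uses the realized outcome directly rather than first learning an expectation and then differentiating through it.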

And in fact, we have a paper on something like this for sequence prediction. This is the kind of work which is at the intersection of dealing with structured outputs, reinforcement learning, and supervised learning. So I think there's a lot of potential benefit in changing the frame of thinking that people in RL have had.

For many decades, people in RL have been not thinking about the world with the same eyes as people doing neural nets. They've been thinking about the world in terms of discrete states that could be enumerated and proving theorems about these algorithms that depend on essentially collecting enough data to fill all the possible configurations of the state and their corresponding effects on the reward.

When you start thinking in terms of neural nets and deep learning, the way to approach problems is very, very different. Okay, let me continue about unsupervised learning and why this is so important. If you look at the kinds of mistakes that our current machine learning algorithms make, you find that our neural nets are just cheating.

They're using the wrong cues to try to produce the answers. And sometimes it works, sometimes it doesn't work. So how can we make our models be smarter, make fewer mistakes? Well, the only solution is to make sure that those models really understand how the world works, at least at the level of humans, to get human level accuracy, human level performance.

It may be not necessary to do this for a particular problem you're trying to solve. So maybe we can get away with doing speech recognition without really understanding the meaning of the words. Probably that's gonna be okay. But for other tasks, especially those involving language, I think having models that actually understand how the world ticks is gonna be very, very important.

So how can we have machines that understand how the world works? Well, one of the ideas that I've been talking a lot about in the last decade is that of disentangling factors of variation. This is related to a very old idea in pattern recognition, computer vision, called invariance. The idea of invariance was that we would like to compute or design, initially design and now learn features, say, of the image that are invariant to the things we don't care about.

Maybe we wanna do object recognition, so we don't care about position or orientation. So we would like to have features that are translation invariant, rotation invariant, scaling invariant, whatever. So this is what invariance is about. But when you're in the business of doing unsupervised learning, of trying to figure out how the world works, it's not good enough to extract invariant features.

What we actually wanna do is to extract all of the factors that explain the data. So if we're doing speech recognition, we want not only to extract the phonemes, but we also want to figure out what kind of voice is that? Maybe who is it? What kind of recording conditions or what kind of microphone?

Is it in a car? Is it outside? All of that information which you're trying to get rid of normally, you actually want to learn about, so that you'll be able to generalize even to new tasks, for example. Maybe the next day I'm not gonna ask you to recognize phonemes, but recognize who's speaking.

More generally, if we're able to disentangle these factors that explain how the data varies, everything becomes easy. Especially if those factors can now be composed in an independent way, we can then generate the data. For example, we can learn to answer a question that only depends on one or two factors, and basically we have eliminated all the other ones because we have separated them.

So a lot of things become much easier. So that's one notion, right? We can disentangle factors. There's another notion, which is the notion of multiple levels of abstraction, which is of course at the heart of what we're trying to do with deep learning. And the idea is that we can have representations of the world, representation of the data as a description that involves factors, or features, and we can do that at multiple levels.

And there are more abstract levels. So if I'm looking at a document, there's the level of the pixels, the level of the strokes, the level of the characters, the level of the words, and maybe the level of the meaning of individual words. And we actually have systems that will recognize from a scanned document all of these levels.

When we go higher up, we're not sure what the right levels are, but clearly there must be representations of the meaning, not just of single words, but of sequences of words and the whole paragraph. What's the story? And why is it important to represent things in that way? Because higher levels of abstraction are representations from which it is much easier to do things, to answer questions.

So the more semantic levels mean basically we can very easily act on the information when it's represented that way. If we think about the level of words, it's much easier to check whether a particular word is in the document if I have the words extracted than if I have to do it from the pixels.

And if I have to answer a complicated question about the intention of the person, working at the level of words is not high enough, it's not abstract enough. I need to work at a more abstract level, in which maybe the same notion could be represented with many different types of words, where many different sentences could express the same meaning, and I wanna be able to capture that meaning.

So the last slide I have is something that I've been working on in the last couple of years, which is connected to unsupervised learning, but more generally to the relationship between how we can build intelligent machines and the intelligence of humans or animals. And as you may know, this was one of the key motivations for doing neural nets in the first place.

The intuition is this, that we are hoping that there are a few simple key principles that explain what allows us to be intelligent. And that if we can discover these principles, of course, we can also build machines that are intelligent. That's why the neural nets were inspired by things we know from the brain in the first place.

We don't know if this is true, but if it is, then it's great. And I mean, this would make it much easier to understand how brains work, as well as building AI. So in trying to bridge this gap, because right now, our best neural nets are very, very different from what's going on in brains, as far as we can tell by talking to neuroscientists.

In particular, backprop, although it's kicking ass from a machine learning point of view, it's not clear at all how something like this would be implemented in brains. So I've been trying to explore that, and also trying to see how we could generalize those credit assignment principles that would come out in order to also do unsupervised learning.

So we've made a little bit of progress. A couple of years ago, I came up with an idea called target prop, which is a way of generalizing backprop to propagating targets for each layer. Of course, this idea has a long history. More recently, we've been looking at ways to implement gradient estimation in deep recurrent networks that perform some computation, and that turn out to end up with parameter updates corresponding to gradient descent on the prediction error, updates that look like something neuroscientists have been observing and don't completely understand, called STDP, spike-timing-dependent plasticity.

So I don't really have time to go into this, but I think this whole area of reconnecting neuroscience with machine learning and neural nets is something that has been kind of forgotten by the machine learning community because we're all so busy building self-driving cars. >> >> But I think over the long term, it's a very exciting prospect.

Thank you very much. >> >> Yes, questions? Yeah? >> To begin with, great talk. My question is regarding the lack of overlap between the results in the study of complex networks, like when they study brain networks, right? There are a lot of publications that talk about the emergence of hubs, and especially a lot of publications on the degree distribution of the inter-neuron network.

>> Right, right. >> But then when you look at the degree distribution of the so-called neurons in deep nets, you don't get to see the emergence of the hub behavior. So why do you think that there's such lack of overlap between the results? >> Because I think the hub story is maybe not that important.

First of all, I really think that in order to understand the brain, you have to understand learning in the brain. And if we look at our experience in machine learning and deep learning, although the architecture does matter, what matters even more is the general principles that allow us to train these things.

So I think the study of the connectivity makes sense. You can't have a fully connected thing and adding a way to have a short number of hubs to go from anywhere to anywhere is a reasonable idea, but I don't think it really explains that much. The central question is, how does the brain learn complicated things?

And it does it better than our current machines, yet we don't even know a simple way of training brains that at least fits the biology reasonably. Yeah? >> Are there any cases of real world examples where the curse of dimensionality is still a problem for neural nets? >> Yeah, anytime it doesn't work.

>> >> I mean, from a generalization point of view. So Andrew told us yesterday that we can just add more data and computing power and for some problems this may work. But sometimes the amount of data you would need is just too large with our current techniques. And we'll need also to develop, how did you call it, the Hail Mary.

All right? We also need to do some research on the algorithms and the architectures to be able to learn about how the world is organized so that we can generalize in much more powerful ways. And that is needed because the kind of task we want to solve involve many, many variables that have an exponentially large number of possible values.

And that's the curse of dimensionality essentially. So it's facing pretty much all of the AI problems around us. >> Hi, I have a question on multi-agent reinforcement learning. >> Yeah. >> If you assume all cars can never predict all possible potential accidents, what about the potential for transfer learning and things like that?

>> Yeah, so I was giving an example of a single human learning how to drive. We might be able to use the millions of people using self-driving cars, correcting, and some of them making accidents to actually make some progress without actually solving the hard problems. And this is probably what you're going to be doing for a while.

But, and we should do it. We should definitely use all the data we have. Currently, if you look at the amount of data we're using for speech recognition or language modeling, it's hugely more than what any human actually sees in their lifetime. So we're doing something wrong. And we could do better with less data.

And babies and kids can do it. >> >> Yes. >> Well, so one thing that strikes me is that most of this training, like for images, it's done on static images. >> Well, there's quite a bit of work on video these days. >> Okay. >> It's mostly a computational bottleneck.

>> I mean, I've seen dogs generated by GANs, right? >> Yeah. Well, keep in mind we were doing MNIST just a couple of years ago. >> Yeah, but if you see a dog >> Yeah, absolutely. Yeah, I don't think it's a fundamental issue. If we're able to do it well on static images, the same principles will allow us to do sequences.

We're already doing sequential things. For example, an interesting project is speech synthesis with recurrent nets and stuff like that, or convolutional nets, whatever. So it's more like we're not sure how to train them well and how to discover these explanatory factors and so on. That's my view. Yeah? >> I have a question, maybe non-technical.

So we have seen the human error rates versus our algorithms error rates for things that we are used to, like image recognition, speech recognition. >> Right, right. >> So has there ever been an experiment where we try to train humans for things that we are not used to? >> Right.

>> And not train the machine at the same time and see. >> Right. >> So how capable are algorithms? >> You're asking if these experiments have been done? >> Yeah, yeah. >> I don't know, but I'm sure the humans would beat the hell out of the machines, for now.

For this kind of thing, humans are able to learn a new task or new concepts from very few examples. And we know that in order for machines to do as well, they just need more sort of common sense, right? More general knowledge of the world. This is what allows humans to learn so quickly on a few examples.

Yeah? >> You presented experimental data where you showed that lots of local minima for these parameters, or maybe saddle points. >> Saddle points. >> Have similar performance. >> Yeah. >> Are these saddle points- >> Well, no, the local minima, that's the local minima, yeah. >> Are these local minima separated widely in parameter space, or are they close by?

>> That's a good question, I could- >> And I guess a related question is, once you've trained the network, if there are lots of local minima, does that suggest that you could compress the network and represent it with far fewer parameters? >> Maybe. So for your first question, we have some experiments dating from 2009, where we try to visualize in 2D the trajectories of training.

So this is a paper whose first author is Dumitru Erhan, a former PhD student with me, where we wanted to see how, depending on where you start, where do you end up? Do different trajectories end up in the same place? Or do they all go in a different place? Turns out they all go in a different place.

And so the number of local minima is much larger than the number of trajectories that we tried, like 500 or 1,000. It's so much larger that no two random initializations end up near each other. So it looks like there's a huge number of local minima, which is in agreement with the theory that there's an exponential number of them.

But the good news is they're all kind of equivalent in terms of cost, if you have a large network. >> Is that just compressibility at all, or how can you? >> I'm not sure. I'm sure there are many ways to compress these networks. There's a lot of redundancy in many ways.

There are redundancies due to the numbering of the units: you could permute them, take that unit, put it here, take that unit, put it here, and so on. But I don't think you're going to gain a lot of bits from that. >> So we've talked about that one of the main advantages of deep learning is that it can work with lots of data.

But you were mentioning before that we need also to capture the ability of humans of working with fewer data. >> Yeah, but the reason we're able to work with fewer data is because we have first learned from a lot of data about the general knowledge of the world. >> Right, so how can we adapt neural networks to bring us to this new few data paradigm?

>> We have to do a lot better at unsupervised learning, and of the kind that really discovers sort of explanations about the world. That's what I think. >> Okay, let's thank Yoshua again. >> >> So before we stop this workshop, first an announcement: you might remember yesterday, Carl invited all the women here for an informal dinner.

It's going to be right outside right now after we close. So before we close, actually, I'd like to thank all the speakers today and yesterday. I think everybody appreciated their talk, so thanks again, all of you. >> >> And thanks to all the attendants, I think it was a very nice weekend.

Hope you enjoyed. >>