Foundations of Unsupervised Deep Learning (Ruslan Salakhutdinov, CMU)
Chapters
0:00 Deep Unsupervised Learning
1:27 Deep Autoencoder Model
4:02 Talk Roadmap
5:01 Learning Feature Representations
5:40 Traditional Approaches
6:02 Computer Vision Features
6:17 Audio Features
8:37 Sparse Coding: Training
10:19 Sparse Coding: Testing Time
10:51 Image Classification Evaluated on Caltech101 object category dataset
12:05 Interpreting Sparse Coding
16:12 Another Autoencoder Model
16:37 Predictive Sparse Decomposition
17:32 Stacked Autoencoders
18:43 Deep Autoencoders
20:20 Information Retrieval
21:05 Semantic Hashing
22:41 Deep Generative Model
28:08 Learning Features
29:04 Model Learning
33:02 Contrastive Divergence
34:45 RBMs for Word Counts
36:33 Collaborative Filtering
37:49 Product of Experts
39:32 Local vs. Distributed Representations
40:55 Deep Boltzmann Machines
41:24 Model Formulation
44:25 Good Generative Model?
45:54 Generative Model of 3-D Objects
46:52 3-D Object Recognition
47:09 Data - Collection of Modalities
47:40 Challenges - 11
48:37 A Simple Multimodal Model
49:23 Text Generated from Images
51:11 Multimodal Linguistic Regularities
53:34 Helmholtz Machines vs. DBMs
54:20 Variational Autoencoders (VAE)
00:00:02.000 |
So I wanted to talk to you about unsupervised learning, and that's the area where there's 00:00:09.020 |
But compared to supervised learning that you've heard about today, like convolutional networks, 00:00:20.560 |
Parts of the talk are going to be a little bit more mathematical. 00:00:23.840 |
I apologize for that, but I'll try to give you a gist of the foundations, the math behind 00:00:28.840 |
these models, as well as try to highlight some of the application areas. 00:00:35.640 |
Well, the motivation is that the space of data that we have today is just growing. 00:00:42.200 |
If you look at the space of images, speech, if you look at social network data, if you 00:00:47.720 |
look at scientific data, I would argue that most of the data that we see today is unlabeled. 00:00:56.440 |
So how can we develop statistical models, models that can discover interesting kinds 00:01:00.480 |
of structure in an unsupervised or semi-supervised way? 00:01:04.640 |
And that's what I'm interested in, as well as how can we apply these models across multiple 00:01:13.360 |
And one particular framework of doing that is the framework of deep learning, where you're 00:01:17.380 |
trying to learn hierarchical representations of data. 00:01:21.240 |
And again, as I go through the talk, I'm going to show you some examples. 00:01:30.320 |
You can take a simple bag-of-words representation of an article or a newspaper. 00:01:36.220 |
You can use something that's called an autoencoder, just multiple levels. 00:01:41.400 |
You extract some latent code, and then you get some representation out of it. 00:01:49.920 |
And if you look at the kind of structure that the model is discovering, it could be useful 00:01:53.280 |
for visualization, for example, to see what kind of structure you see in your data. 00:02:03.760 |
I've tried to kind of cluster together lots of different unsupervised learning techniques, 00:02:15.100 |
But the way that I typically think about these models is that there's a class of what I would 00:02:19.640 |
call non-probabilistic models, models like sparse coding, autoencoders, clustering-based methods. 00:02:27.560 |
And these are all very, very powerful techniques, and I'll cover some of them in this talk as well. 00:02:33.880 |
And then there is sort of a space of probabilistic models. 00:02:38.160 |
And within probabilistic models, you have tractable models, things like fully observed models. 00:02:45.440 |
There's a beautiful class of models called neural autoregressive density estimators. 00:02:50.200 |
More recently, we've seen some successes of so-called pixel recurrent neural network models. 00:03:00.280 |
There is a class of so-called intractable models, where you are looking at models like 00:03:05.640 |
Boltzmann machines and models like variational autoencoders, something that's been quite-- 00:03:10.560 |
there's been a lot of development in our community, in deep learning community in that space. 00:03:15.400 |
Helmholtz machines, I'll tell you a little bit about what these models are, and a whole 00:03:22.920 |
One particular structure within these models is that when you're building these generative 00:03:27.480 |
models of data, you typically have to specify what distributions you're looking at. 00:03:33.160 |
So you have to specify the probability of the data, and you're generally doing some kind 00:03:37.020 |
of approximate maximum likelihood estimation. 00:03:39.680 |
And then more recently, we've seen some very exciting models coming out. 00:03:44.120 |
These are generative adversarial networks, moment matching networks. 00:03:48.640 |
And this is a slightly different class of models, where you don't really have to specify the probability distribution explicitly. 00:03:55.120 |
You just need to be able to sample from those models. 00:03:57.480 |
And I'm going to show you some examples of that. 00:04:04.120 |
I'd like to introduce you to the basic building blocks, models like sparse coding models. 00:04:09.720 |
Because I think that these are very important classes of models, particularly for folks 00:04:13.680 |
who are working in industry and looking for simpler models. 00:04:21.200 |
And then the second part of the talk, I'll focus more on generative models. 00:04:25.440 |
I'll give you an introduction to restricted Boltzmann machines and deep Boltzmann machines. 00:04:29.320 |
These are statistical models that can model complicated data. 00:04:38.280 |
And I'll spend some time showing you some examples, some recent developments in our 00:04:42.480 |
community, specifically in the case of variational autoencoders, which I view as a subclass of Helmholtz machines. 00:04:49.360 |
And I'll finish off by giving you an intuition about a slightly different class of models, 00:04:54.440 |
which would be these generative adversarial networks. 00:05:00.880 |
But before I do that, let me just give you a little bit of motivation. 00:05:05.360 |
I know Andre's done a great job, and Richard alluded to that as well. 00:05:10.640 |
But the idea is, if I'm trying to classify a particular image, and if I say, if I'm looking 00:05:16.760 |
at specific pixel representation, it might be difficult for me to classify what I'm seeing. 00:05:22.040 |
On the other hand, if I can find the right representations, the right representations 00:05:27.500 |
for these images, and then I get the right features, or get the right structure from 00:05:32.640 |
the data, then it might be easier for me to see what's going on with my data. 00:05:40.760 |
And this is one of the traditional approaches that we've seen for a long time, which is that you 00:05:48.120 |
have data, you're creating some features, and then you're running your learning algorithm. 00:05:53.120 |
And for the longest time, in object recognition or in audio classification, you typically 00:05:57.500 |
use some kind of hand-designed features, and then you start classifying what you have. 00:06:04.000 |
And like Andre was saying, in the space of vision, there have been a lot of different hand-designed features, 00:06:11.120 |
designs of what the right structure is that we should see in the data. 00:06:15.600 |
In the space of audio, same thing is happening. 00:06:19.960 |
How can you find these right representations for your data? 00:06:25.380 |
And the idea behind representation learning, in particular in deep learning, is can we 00:06:32.200 |
actually learn these representations automatically? 00:06:35.160 |
And more importantly, can we actually learn these representations in an unsupervised way, 00:06:39.000 |
by just seeing lots and lots of unlabeled data? 00:06:43.480 |
And there's been a lot of work done in that space, but we're not there yet. 00:06:47.960 |
So I wanted to lower your expectations as I show you some of the results. 00:06:56.400 |
This is one of the models that I think that everybody should know what it is. 00:07:01.000 |
It actually has its roots in '96, where it was originally developed to explain early visual processing in the brain. 00:07:14.800 |
Well, if I give you a set of data points, x1 up to xn, you'd want to learn a dictionary 00:07:19.520 |
of bases, phi 1 up to phi k, so that every single data point can be written as a linear combination of these bases. 00:07:29.000 |
There is one constraint in that you'd want your coefficients to be sparse. 00:07:38.780 |
So every data point is represented as a sparse linear combination of bases. 00:07:43.960 |
So if you apply sparse coding to natural images, and a lot of this 00:07:53.040 |
work was originally developed at Stanford in Andrew Ng's group. 00:07:56.120 |
So if you apply sparse coding to little patches of images and learn these bases, this is what they look like. 00:08:04.480 |
And they look really nice in terms of finding edge-like structure. 00:08:09.720 |
So given a new example, I can say, well, this new example can be written as a linear combination of a small subset of these bases. 00:08:18.040 |
And taking that representation, it turns out that particular representation, a sparse representation, 00:08:23.640 |
is quite useful as a feature representation of your data. 00:08:36.760 |
Well, say I give you a whole bunch of image patches, but these don't necessarily have to be image patches. 00:08:42.520 |
This could be little speech signals or any kind of data you're working with. 00:08:53.200 |
So the first term here, you can think of it as a reconstruction error, which is to say, 00:08:57.480 |
well, I take a linear combination of my bases. 00:09:03.920 |
And then there's a second term, which is, you can think of it as a sparse penalty term, 00:09:08.160 |
which essentially says, try to penalize my coefficients so that most of them are zero. 00:09:15.640 |
That way, every single data point can be written as just a sparse linear combination of the bases. 00:09:22.180 |
And it turns out there is an easy optimization for doing that. 00:09:26.720 |
If you fix your dictionary of bases, phi 1 up to phi k, and you solve for the activations, that's a convex problem. 00:09:36.840 |
There are a lot of solvers for solving that particular problem. 00:09:40.720 |
It's a standard lasso problem, which is fairly easy to optimize. 00:09:47.440 |
And then if you fix the activations and you optimize for the dictionary of bases, then it's just a least-squares problem. 00:09:55.440 |
Each problem is convex, so you can alternate between finding coefficients, finding bases, 00:10:00.560 |
and so forth, so you can optimize this function. 00:10:02.840 |
And there's been a lot of recent work in the last 10 years of doing these things online 00:10:13.300 |
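To make the alternating optimization concrete, here is a minimal sketch (my own illustration, not code from the talk; the dictionary size K, the sparsity weight alpha, and the unit-norm constraint on the bases are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_coding(X, K=64, alpha=0.1, n_iters=20, seed=0):
    """Alternate between a lasso step for the codes and a least-squares step for the bases."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Phi = rng.standard_normal((K, d))                  # dictionary: K bases of dimension d
    Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)  # keep the bases unit-norm

    for _ in range(n_iters):
        # 1) Fix the dictionary, solve a lasso problem for the sparse codes A (n x K).
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=2000)
        lasso.fit(Phi.T, X.T)          # reconstruction error plus an L1 penalty on the codes
        A = lasso.coef_                # shape (n, K), mostly zeros

        # 2) Fix the codes, solve least squares for the dictionary, then renormalize.
        Phi = np.linalg.lstsq(A, X, rcond=None)[0]
        Phi /= np.linalg.norm(Phi, axis=1, keepdims=True) + 1e-8
    return Phi, A
```

At test time, encoding a new patch is just the lasso step with the learned dictionary held fixed; scikit-learn's DictionaryLearning and MiniBatchDictionaryLearning wrap up essentially this procedure, including online variants.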
At test time, given a new input or a new image patch, and given a set of learned bases, once 00:10:18.520 |
you have your dictionary, you can then just solve a lasso problem to find the right coefficients. 00:10:25.280 |
So in this case, given a test sample or a test patch, you can find, well, it's written 00:10:30.480 |
as a linear combination of a subset of the bases. 00:10:35.840 |
And it turns out, again, that that particular representation is very useful, particularly 00:10:39.680 |
if you're interested in classifying what you see in images. 00:10:43.180 |
And this is done in a completely unsupervised way. 00:10:46.560 |
There is no specific supervisory signal that's here. 00:10:52.280 |
So back in 2006, there was work done, again, at Stanford that basically showed a very interesting result. 00:11:01.240 |
So if I give you an input like this, and these are my learned bases, remember these little 00:11:05.040 |
edges, what happens is that you just convolve these bases. 00:11:09.480 |
You can get these different feature maps, much like the feature maps that we've seen in convolutional networks. 00:11:15.440 |
And then you take these feature maps, and you can just do a classification. 00:11:20.000 |
This was done on one of the older data sets, the Caltech 101, which is a data set that was widely used at the time. 00:11:27.880 |
And if you look at some of the competing algorithms, if you do a simple logistic regression versus 00:11:35.240 |
if you do PCA and then do logistic regression versus finding these features using sparse 00:11:41.400 |
coding, you can get substantial improvements. 00:11:44.960 |
And you see sparse coding popping up in a lot of different areas, 00:11:51.280 |
not just in deep learning, but also among folks looking at the medical imaging domain, 00:11:57.240 |
in neuroscience, where these are very popular models. 00:12:00.000 |
Because they're easy to fit and easy to deal with. 00:12:05.400 |
So what's the interpretation of the sparse coding? 00:12:11.440 |
And we can think of sparse coding as finding an overcomplete representation of your data. 00:12:17.880 |
Now the encoding function, we can think of this encoding function as, well, I 00:12:23.160 |
give you an input, find me the features or sparse coefficients or bases that make up that input. 00:12:29.900 |
We can think of encoding as an implicit and very nonlinear function of x. 00:12:36.920 |
And the decoder, or the reconstruction, is just a simple linear function. 00:12:44.000 |
You just take your coefficients, multiply them by the corresponding bases, and get back your reconstruction. 00:12:56.420 |
And that sort of flows naturally into the ideas of autoencoders. 00:13:01.200 |
The autoencoder is a general framework where if I give you an input data, let's say it's 00:13:05.720 |
an input image, you encode it, you get some representation, some feature representation, 00:13:11.560 |
and then you have a decoder given that representation. 00:13:16.880 |
So you can think of encoder as a feedforward, bottom-up pass, much like in a convolutional 00:13:23.800 |
neural network, given the image, you're doing a forward pass. 00:13:27.200 |
And then there is also feedback and generative or top-down pass. 00:13:31.920 |
And the features, you're reconstructing back the input image. 00:13:35.880 |
And the details of what's going on inside the encoder and decoder matter a lot. 00:13:40.480 |
And obviously, you need some form of constraints. 00:13:42.320 |
You need some form of constraints to avoid learning the identity. 00:13:45.840 |
Because if you don't put these constraints, what you could do is just take your input, 00:13:50.600 |
copy it to your features, and then reconstruct back. 00:13:55.560 |
So we need to introduce some additional constraints. 00:13:59.740 |
If you're dealing with binary features, if you want to extract binary features, for example, 00:14:05.440 |
I'm going to show you later why you'd want to do that. 00:14:07.920 |
You can pass your encoder through sigmoid nonlinearity, much like in the neural network. 00:14:13.920 |
And then you have a linear decoder that reconstructs back the input. 00:14:17.640 |
And the way we optimize these little building blocks is we can just 00:14:24.720 |
have an encoder, which takes your input, takes a linear combination, and passes it through some nonlinearity. 00:14:36.220 |
And then there is a decoder where you reconstruct back your original input. 00:14:41.020 |
So this is nothing more than a neural network with one hidden layer. 00:14:44.240 |
And typically, that hidden layer would have a smaller dimensionality than the input. 00:14:50.540 |
We can determine the network parameters, the parameters of the encoder and the parameters 00:14:54.500 |
of the decoder by writing down the reconstruction error. 00:14:58.480 |
And that's what the reconstruction error would look like. 00:15:00.860 |
Given the input, encode, decode, and make sure whatever you're decoding is as close as possible to the original input. 00:15:09.060 |
Then we can use the backpropagation algorithm to train. 00:15:14.140 |
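As a minimal sketch of the autoencoder just described (my own PyTorch illustration, not code from the talk; the 784 and 30 dimensions and the optimizer settings are illustrative):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, d_in=784, d_hidden=30):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.Sigmoid())  # bottom-up pass
        self.decoder = nn.Linear(d_hidden, d_in)                               # top-down pass

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(128, 784)                 # stand-in for a batch of input data

for _ in range(100):
    x_hat = model(x)
    loss = ((x - x_hat) ** 2).mean()     # reconstruction error
    opt.zero_grad()
    loss.backward()                      # backpropagation through encoder and decoder
    opt.step()
```

If you remove the sigmoid so the hidden layer is linear, this is exactly the setting of the PCA connection discussed next.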
There is an interesting sort of relationship between autoencoders and principal component analysis. 00:15:22.400 |
As a practitioner, if you're dealing with large data and you want to see what's going 00:15:26.180 |
on, PCA is the first thing to use, much like logistic regression. 00:15:32.460 |
And the idea here is that if the parameters of the encoder and decoder are shared and you 00:15:36.780 |
actually have a hidden layer which is a linear layer, so you don't introduce any nonlinearities, 00:15:42.180 |
then it turns out that the latent space that the model will discover is going to be the same subspace that PCA finds. 00:15:49.780 |
It effectively collapses to principal component analysis, right? 00:15:52.940 |
We're doing PCA, which is sort of a nice connection because it basically says that autoencoders, 00:16:00.100 |
you can think of them as nonlinear extensions of PCA. 00:16:02.700 |
So you can learn a little richer features if you are using autoencoders. 00:16:14.180 |
If you're dealing with binary input, sometimes we're dealing with like MNIST, for example. 00:16:19.140 |
Again, your encoder and decoder could use sigmoid nonlinearities. 00:16:22.900 |
So given an input, you extract some binary features. 00:16:25.020 |
Given the binary features, you reconstruct back the binary input. 00:16:29.100 |
And that actually relates to a model called the restricted Boltzmann machine, something that 00:16:33.860 |
I'm going to tell you about later in the talk. 00:16:37.860 |
There are also other classes of models where you can say, well, I can also introduce some 00:16:42.100 |
sparsity, much like in sparse coding, to say that I need to constrain my latent features to be sparse. 00:16:50.420 |
And that actually allows you to learn quite reasonable features, nice features. 00:16:56.780 |
Here's one particular model called predictive sparse decomposition, where effectively, 00:17:02.300 |
if you look at the first part of the equation here, the decoder part, that pretty much looks like sparse coding. 00:17:09.140 |
But in addition, you have an encoding part that essentially says train an encoder such 00:17:14.380 |
that it actually approximates what my latent code should be. 00:17:19.980 |
So effectively, you can think of this model as there is an encoder, there is a decoder, 00:17:23.700 |
but then you put the sparsity constraint on your latent representation. 00:17:32.400 |
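A rough sketch of the objective being described (my reading of it, not the original formulation; `Phi` is the dictionary, `encoder` is any feedforward network, and the weights `lam` and `gamma` are illustrative):

```python
import torch

def psd_loss(x, z, Phi, encoder, lam=0.1, gamma=1.0):
    recon = ((x - z @ Phi) ** 2).sum()                   # decoder term, as in sparse coding
    sparsity = lam * z.abs().sum()                       # L1 penalty on the latent code z
    prediction = gamma * ((z - encoder(x)) ** 2).sum()   # train the encoder to predict the code
    return recon + sparsity + prediction
```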
And obviously, the other thing that we've been doing in the last seven to ten 00:17:37.340 |
years is, well, you can actually stack these things together. 00:17:42.060 |
So you can learn low-level features, try to learn high-level features, and so forth. 00:17:50.900 |
And perhaps at the top level, if you're trying to solve a classification problem, you can add a classifier. 00:17:57.140 |
And this is sometimes known as a greedy layer-wise learning. 00:18:00.900 |
And this is sometimes useful whenever you have lots and lots of unlabeled data. 00:18:05.460 |
And when you have a little labeled data, a small sample of labeled data, typically these 00:18:10.340 |
models help you find meaningful representations such that you don't need a lot of labeled 00:18:15.620 |
data to solve a particular task that you're trying to solve. 00:18:19.420 |
And again, you can remove the decoding part, and then you end up with a standard feedforward neural network. 00:18:25.660 |
Again, your encoder and decoder could be convolutional. 00:18:30.020 |
And it depends on what problem you're tackling. 00:18:34.060 |
And typically, you can stack these things together and optimize for a particular task 00:18:43.100 |
Here's an example of-- just wanted to show you some examples, some early examples. 00:18:46.940 |
Back in 2006, this was a way of trying to build these nonlinear autoencoders. 00:18:53.900 |
And you can sort of pre-train these models using restricted Boltzmann machines or autoencoders. 00:18:59.700 |
And then you can stitch them together into this deep autoencoder and backpropagate through the whole thing. 00:19:08.460 |
One thing I want to point out is that-- here's one particular example. 00:19:15.780 |
The second row, you're seeing faces reconstructed from a bottleneck of 30-dimensional real-valued codes. 00:19:23.420 |
You can think of it as just a compression mechanism. 00:19:25.580 |
Given the data, high-dimensional data, you're compressing it down to 30-dimensional code. 00:19:30.000 |
And then from that 30-dimensional code, you're reconstructing back the original data. 00:19:34.260 |
So if you look at the first row, this is the data. 00:19:43.060 |
One thing I want to point out is that with the solution here, you get a much sharper reconstruction, 00:19:47.940 |
which means that it's capturing a little bit more structure in the data. 00:19:50.780 |
It's also kind of interesting to see what these models sometimes tend to do. 00:19:58.780 |
For example, if you see this person with glasses, the model removes the glasses. 00:20:02.580 |
And that generally has to do with the fact that there is only one person with glasses. 00:20:05.820 |
So the model just basically says, that's noise. 00:20:13.340 |
And then again, that has to do with the fact that there's only so much capacity. 00:20:16.300 |
So the model might think that that's just noise. 00:20:20.860 |
And if you're dealing with text type of data, this was done using a Reuters data set. 00:20:30.780 |
You take bag of words representation, something very simple. 00:20:33.020 |
You can compress it down to a two-dimensional space. 00:20:37.940 |
And I always like to joke that the model basically discovers that European community economic 00:20:43.180 |
policies are just next to disasters and accidents. 00:20:46.740 |
This was back in-- I think the data was collected in '96. 00:20:50.780 |
I think today it's probably going to become closer. 00:20:55.980 |
But again, typically, autoencoders are a way of doing compression, or trying to find a compact code. 00:21:02.460 |
But we'll see later that they don't have to be. 00:21:05.820 |
There's another class of algorithm called semantic hashing, which is to say, well, what 00:21:09.860 |
if you take your data and compress it down to binary representation? 00:21:16.180 |
Because if you have binary representation, you can search in the binary space very efficiently. 00:21:22.540 |
In fact, if you can compress your data down to a 20-dimensional binary code, you can use the codes as memory addresses. 00:21:33.980 |
And you can just do memory lookups without actually doing any search at all. 00:21:42.020 |
So this sort of representation has sometimes been used successfully in computer vision, 00:21:46.260 |
where you take your images, and then you learn these binary representations, 30-dimensional binary codes. 00:21:55.540 |
And it turns out it's very efficient to search through large volumes of data using binary 00:22:01.380 |
So it takes a fraction of a millisecond to retrieve images from a set of millions of images. 00:22:09.700 |
And again, this is also an active area of research right now, because people are trying 00:22:13.240 |
to figure out how to search these large databases. 00:22:17.220 |
And learning a semantic hashing function that maps your data to a binary representation is a big part of that. 00:22:25.540 |
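As a toy illustration of why binary codes allow retrieval by memory lookup (my own sketch; it assumes 20-bit codes have already been learned and simply treats each code as an address):

```python
import numpy as np
from collections import defaultdict
from itertools import combinations

def pack(bits):                        # 20-dimensional 0/1 vector -> integer memory address
    return int("".join(str(int(b)) for b in bits), 2)

def build_table(codes):                # codes: (N, 20) binary matrix for the database items
    table = defaultdict(list)
    for idx, c in enumerate(codes):
        table[pack(c)].append(idx)
    return table

def query(table, code, radius=1):      # items whose code is within `radius` bit flips
    hits = list(table.get(pack(code), []))
    for r in range(1, radius + 1):
        for flipped in combinations(range(len(code)), r):
            c = code.copy()
            c[list(flipped)] ^= 1
            hits.extend(table.get(pack(c), []))
    return hits

# codes = (np.random.rand(1_000_000, 20) > 0.5).astype(int)   # stand-in for learned codes
# table = build_table(codes)
# neighbors = query(table, codes[0], radius=1)                # lookups, no scan of the database
```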
OK, now let's step back a little bit and say, let's now look at generative models. 00:22:31.500 |
Let's look at probabilistic models and how different they are. 00:22:34.420 |
And I'm going to show you some examples of where they're applicable. 00:22:39.420 |
Here's one example of a simple model trying to learn a distribution over these handwritten characters. 00:22:53.180 |
And now we can build a model that says, well, can you actually generate for me what a Sanskrit character should look like? 00:23:00.220 |
The flickering you see at the top, these are neurons. 00:23:05.300 |
And what you're seeing at the bottom is what the model generates, what it believes the characters should look like. 00:23:11.140 |
So in some sense, when you think about generative models, you think about models that can generate 00:23:15.860 |
or they can sample the distribution or they can sample the data. 00:23:23.180 |
We have about 25,000 characters coming from 50 different alphabets around the world. 00:23:31.860 |
But this is what the model believes Sanskrit should look like. 00:23:35.100 |
And I think that I've asked a couple of people to say that, does that really look like Sanskrit? 00:23:45.020 |
It can mean that the model is actually generalizing or the model is overfitting, meaning that 00:23:50.700 |
it's just memorizing what the training data looks like. 00:23:52.620 |
And I'm just showing you examples from the training data. 00:23:55.380 |
We'll come back to that point as we go through the talk. 00:24:02.180 |
Given half of the image, can you complete the remaining half? 00:24:06.020 |
And more recently, there have been a lot of advances, especially in the last couple of years, in these kinds of models. 00:24:13.580 |
And it's pretty amazing what you can do in terms of in-painting, given half of the image, 00:24:19.020 |
what the other half of the image should look like. 00:24:21.300 |
This is sort of a simple example, but it does show you that it's trying to be consistent with the half it's given. 00:24:32.820 |
In the space of so-called undirected graphical models, of Boltzmann machines, the difficulty is the following. 00:24:38.680 |
If I show you this image, which is a 28 by 28 image, it's a binary image. 00:24:49.620 |
So in fact, there are 2 to the 784 possible configurations. 00:24:56.000 |
So how can you build models that figure out that, in the space of all configurations, there's only a small subspace that corresponds to real characters? 00:25:03.680 |
If you start generating 200 by 200 images, that space is huge. 00:25:11.580 |
In the space of real images, it's really, really tiny. 00:25:18.480 |
That's a very difficult question in general to answer. 00:25:23.640 |
One class of models is so-called fully observed models. 00:25:27.940 |
There's been a stream of learning generative models that are tractable. 00:25:33.000 |
And they have very nice properties, like you can compute the probabilities, you can do exact inference. 00:25:38.760 |
Here's one example: if I try to model the image, I can write it down as taking 00:25:43.840 |
the first pixel, modeling the first pixel, then modeling the second pixel given the 00:25:47.280 |
first pixel, and just writing it down in terms of the conditional probabilities. 00:25:53.960 |
And each conditional probability can take a very complicated form. 00:26:03.200 |
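Written out, the factorization being described looks like this for a 28 by 28 binary image with the pixels taken in some fixed order:

```latex
p(x) \;=\; \prod_{i=1}^{784} p\!\left(x_i \mid x_1, \ldots, x_{i-1}\right)
\;=\; p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots
```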
So there's been a number of successful models. 00:26:08.280 |
One of the early models called Neural Autoregressive Density Estimator, actually developed by Hugo, 00:26:16.520 |
And more recently, we start seeing these flavors of models. 00:26:19.680 |
There were a couple of papers popped up, actually this year, from DeepMind, where they make 00:26:27.680 |
these conditionals to be sophisticated RNNs, LSTMs, or convolutional models. 00:26:32.800 |
And they can actually generate remarkable images. 00:26:35.600 |
And so this is just a pixel CNN generating, I guess, elephants. 00:26:45.520 |
The drawback of these models is that we have yet to see how good the representations these 00:26:49.480 |
models are learning are, so that we can use these representations for other tasks, like classification. 00:26:59.080 |
Now let me jump into a class of models called restricted Boltzmann machines. 00:27:03.680 |
So this is the class of models where we're actually trying to learn some latent structure, 00:27:10.000 |
These models belong to the class of so-called graphical models. 00:27:12.480 |
And graphical models are a very powerful framework for representing dependency structure between random variables. 00:27:19.320 |
Here's an example of this particular kind of model. 00:27:25.600 |
These are stochastic binary, so-called visible variables. 00:27:30.680 |
And you have stochastic binary hidden variables. 00:27:32.480 |
You can think of them as feature detectors, detecting certain patterns that you see in the data. 00:27:40.280 |
You can write down the probability, the joint distribution over all of these variables. 00:27:48.120 |
But it's not really important what they look like. 00:27:49.960 |
The important thing here is that if I look at this conditional probability of the data 00:27:53.920 |
given the features, I can actually write down explicitly what it looks like. 00:27:59.600 |
That basically means that if you tell me what features you see in the image, I can generate 00:28:03.480 |
the data for you, or I can generate the corresponding input. 00:28:08.320 |
In terms of learning features, so what do these models learn? 00:28:11.960 |
They sort of learn something similar that we've seen in sparse coding. 00:28:16.240 |
And so these classes of models are very similar to each other. 00:28:20.120 |
So given a new image, I can say, well, this new image is made up by some combination of 00:28:24.960 |
these learned weights or these learned bases. 00:28:28.480 |
And the numbers here are given by the probabilities that each particular edge is present in the image. 00:28:35.080 |
In terms of how we learn these models, another point I should 00:28:41.200 |
make here is that given an input, I can actually quickly infer what features I'm seeing in the image. 00:28:48.720 |
So that operation is very easy to do, unlike in sparse coding models. 00:28:52.640 |
It's a little bit closer to an autoencoder. 00:28:54.300 |
Given the data, I can actually tell you what features are present in my input, which is 00:28:58.480 |
very important for things like information retrieval or classifying images, because you need that inference to be fast. 00:29:06.280 |
Let me just give you an intuition, maybe a little bit of math behind how we learn these models. 00:29:12.000 |
If I give you a set of training examples, and I want to learn model parameters, I can maximize the likelihood. 00:29:18.840 |
And you've probably seen that in these tutorials, the maximum likelihood objective is essentially 00:29:24.280 |
nothing more than saying, I want to make sure that the probability of observing these images is high. 00:29:31.400 |
So you're finding the parameters such that the probability of observing what I'm seeing is high. 00:29:36.480 |
And that's why you're maximizing the likelihood objective, or the log of the likelihood objective, 00:29:55.140 |
And you basically have this learning rule, which is the difference between two terms. 00:30:01.800 |
The first term, you can think of it as looking at so-called sufficient statistics driven by the data. 00:30:09.700 |
And the second term is the sufficient statistics driven by the model. 00:30:16.240 |
Intuitively, what that means is that you look at the correlations you see in the data. 00:30:21.720 |
And then you look at the correlations that the model is telling you it should be. 00:30:30.560 |
It's trying to match the correlations that you see in the data. 00:30:34.280 |
So the model is actually respecting the statistics that you see in the data. 00:30:38.440 |
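For a restricted Boltzmann machine with weights W, visible units v, and hidden units h, the learning rule being described is the standard maximum likelihood gradient:

```latex
\frac{\partial \log p(v)}{\partial W_{ij}}
\;=\; \mathbb{E}_{P_{\text{data}}}\!\left[v_i h_j\right]
\;-\; \mathbb{E}_{P_{\text{model}}}\!\left[v_i h_j\right]
```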
But it turns out that the second term is very difficult to compute. 00:30:41.380 |
And it's precisely because the space of all possible images is so high dimensional that 00:30:46.640 |
you need to figure out or use some kind of approximate learning algorithms to do that. 00:30:52.080 |
So you have these difference between these two terms. 00:30:54.000 |
The first term is easy to compute, it turns out, because of the particular structure of these models. 00:31:02.560 |
The second term is the difficult one to compute. 00:31:05.320 |
So it sort of requires summing over all possible configurations, all possible images that you could see. 00:31:15.880 |
And what a lot of different algorithms are doing-- and we'll see that over and over again-- 00:31:20.320 |
is using so-called Monte Carlo sampling, or Markov chain Monte Carlo sampling. 00:31:26.680 |
So let me give you an intuition of what this term is doing. 00:31:29.160 |
And that's a general trick for approximating exponential sums. 00:31:33.760 |
There's a whole subfield in statistics that's basically dedicated to how we approximate these exponential sums. 00:31:43.600 |
In fact, if you could do that, if you could solve that problem, you could solve a lot of hard problems. 00:31:52.080 |
The idea is to say, well, you're going to be replacing the average by sampling. 00:31:58.000 |
And there's something that's called Gibbs sampling, a Markov chain Monte Carlo method. 00:32:04.200 |
It basically says, well, start with the data, sample the states of the latent variables, 00:32:10.560 |
then sample the data, then sample the states of the latent variables again, always sampling from these conditional 00:32:13.880 |
distributions, something that you can compute explicitly. 00:32:19.760 |
Much like in sparse coding, where we alternate between optimizing for the bases and optimizing for the coefficients, 00:32:24.680 |
here you're inferring the coefficients, then you're inferring what the data should look like. 00:32:30.760 |
And then you can just run a Markov chain and approximate this exponential sum. 00:32:38.400 |
So you start with the data, you sample the states of the hidden variables, you resample the data, and so on. 00:32:44.280 |
And the only problem with a lot of these methods is that you need to run them up to infinity 00:32:52.120 |
to guarantee that you're getting the right thing. 00:32:55.620 |
And so obviously, you will never run them infinitely long. 00:33:02.120 |
So there's a very clever algorithm, the contrastive divergence algorithm, that was developed by Geoff Hinton. 00:33:10.400 |
It basically said, well, instead of running this thing up to infinity, run it for one step. 00:33:20.360 |
You start with a training vector, you update the hidden units, then you update all the visible units. 00:33:27.960 |
Much like in autoencoder, you reconstruct your data. 00:33:31.120 |
You update the hidden units again, and then you just update the model parameters, which 00:33:34.440 |
is just looking at empirically the statistics between the data and the model. 00:33:39.920 |
Very similar to what the autoencoder is doing, but slight, slight differences. 00:33:44.640 |
And the implementation basically takes about 10 lines of MATLAB code. 00:33:48.840 |
I suspect it's going to be two lines in TensorFlow, although I don't think the TensorFlow folks have implemented it yet. 00:34:00.360 |
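Here is roughly what those few lines look like, written as a NumPy sketch of CD-1 for a binary RBM (my own illustration, not the MATLAB code being referred to; W, b_v, b_h are the weights and biases, lr is the learning rate):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_v, b_h, lr=0.01, rng=None):
    if rng is None:
        rng = np.random.default_rng()

    # Positive phase: infer the hidden units from the data.
    ph0 = sigmoid(v0 @ W + b_h)                       # p(h = 1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)

    # Negative phase: one Gibbs step -- reconstruct the data, then re-infer the hiddens.
    pv1 = sigmoid(h0 @ W.T + b_v)                     # p(v = 1 | h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b_h)

    # Match data-driven statistics against (one-step) model-driven statistics.
    n = v0.shape[0]
    W = W + lr * (v0.T @ ph0 - v1.T @ ph1) / n
    b_v = b_v + lr * (v0 - v1).mean(axis=0)
    b_h = b_h + lr * (ph0 - ph1).mean(axis=0)
    return W, b_v, b_h
```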
But you can extend these models to dealing with real value data. 00:34:04.640 |
So whenever you're dealing with images, for example. 00:34:06.840 |
And it's just a little change to the definition of the model. 00:34:11.400 |
And your conditional probabilities here are just going to be a bunch of Gaussians. 00:34:14.760 |
So that basically means that given the features, I can sample in the space of real-valued images. 00:34:25.840 |
If you train this model on these images, you tend to find edges, something similar, again, 00:34:33.160 |
to what you'd see in sparse coding, in ICA, the independent component analysis model, autoencoders, and so on. 00:34:39.640 |
And again, you can say, well, every single image is made up of some linear combination of these bases. 00:34:45.800 |
You can also extend these models to dealing with count data. 00:34:48.840 |
If you're dealing with documents, in this case, again, a slight change to the model. 00:34:58.560 |
And D here denotes the number of words that you're seeing in your document. 00:35:05.600 |
And the conditional here is given by a so-called softmax distribution, much like what you've 00:35:09.040 |
seen in the previous lectures, giving a distribution over possible words. 00:35:15.920 |
And the parameters here, the Ws, you can think of them as something similar to word embeddings. 00:35:24.200 |
And so if you apply it to, again, some of data sets, you tend to find reasonable features. 00:35:31.760 |
So you tend to find features about Russia, about US, about computers, and so forth. 00:35:37.320 |
So much like you found those representations, little edges, where every image is made up of some combination of them, 00:35:45.000 |
In case of documents or web pages, you're saying it's the same thing. 00:35:49.280 |
It's just made up some linear combination of these learned topics. 00:35:53.200 |
Every single document is made up by some combination of these topics. 00:35:57.080 |
You can also look at one-step reconstruction. 00:35:59.240 |
So you can basically say, well, how can I find similarity between the words? 00:36:03.080 |
So if I show you chocolate cake and infer the states of the hidden units, and then I reconstruct 00:36:07.960 |
back the distribution of possible words, it tells me chocolate cake, cake chocolate sweet 00:36:16.960 |
I particularly like the one about the flower high, and then there is a Japanese sign. 00:36:22.520 |
And the model sort of generates flower, Japan, sakura, blossom, Tokyo. 00:36:27.800 |
So it sort of picks up again on low-level correlations that you see in your data. 00:36:33.720 |
You can also apply these kinds of models to collaborative filtering, where every single 00:36:38.560 |
observed variable can represent a user's rating for a particular movie. 00:36:46.980 |
So every single user would rate a certain subset of movies. 00:36:50.880 |
And so you can represent it as the state of visible vector. 00:36:53.560 |
And your hidden states can represent user preferences, what they are. 00:36:58.680 |
And on the Netflix data set, if you look at the latent space that the model is learning, 00:37:04.240 |
some of these hidden variables are capturing specific movie genre. 00:37:08.960 |
So for example, there is actually one hidden unit dedicated to Michael Moore's movies. 00:37:16.960 |
I think it's sort of either people like it or hate it. 00:37:19.280 |
So there are a few hidden units specifically dedicated to that. 00:37:22.560 |
But it also finds interesting things like action movies and so forth. 00:37:26.080 |
So it finds that particular structure in the data. 00:37:28.840 |
So you can model different kinds of modalities: real-valued data, count data, and so forth. 00:37:36.280 |
And it's very easy to infer the states of the hidden variables. 00:37:39.020 |
That's given by just a product of logistic functions. 00:37:41.400 |
And that's very important in a lot of different applications. 00:37:44.380 |
Given the input, I can quickly tell you what topics I see in the data. 00:37:49.200 |
One thing that I want to point out, and that's an important point, is that a lot of these models are product models. 00:37:56.120 |
Sometimes people call them products of experts. 00:37:58.960 |
And this is because of the following intuition. 00:38:03.680 |
If I write down the joint distribution of my hidden and observed variables, I can write it in this form. 00:38:10.360 |
But if I sum out or integrate out the states of the hidden variables, I have a product of expert terms, one for each hidden variable. 00:38:26.040 |
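One way to see this for a binary restricted Boltzmann machine (a standard derivation, not something shown explicitly here): summing out the hidden units turns the joint into a product with one term per hidden unit,

```latex
p(v) \;=\; \frac{1}{Z}\sum_{h} e^{-E(v,h)}
\;=\; \frac{1}{Z}\; e^{\,b^\top v} \prod_{j} \Big(1 + e^{\,c_j + \sum_i W_{ij} v_i}\Big)
```

Each factor acts like one "expert" that votes on v, and the experts multiply rather than mix.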
Suppose the model finds these specific topics. 00:38:29.800 |
And suppose I'm going to be telling you that the document contains topics like government, corruption, and so on. 00:38:35.540 |
Then the word Silvio Berlusconi will have very high probability. 00:38:50.680 |
And I guess I should add a bunga bunga parties here. 00:38:53.800 |
Then it will become completely clear what I'm talking about. 00:38:57.520 |
But then one point I want to make here is that you can think of these models as a product. 00:39:05.040 |
Each hidden variable defines a distribution of possible words, of possible topics. 00:39:10.800 |
And once you take the intersection of these distributions, you can be very precise about 00:39:17.360 |
So that's unlike general topic models or latent Dirichlet allocation models, models 00:39:22.560 |
where you're actually using a mixture-like approach. 00:39:28.280 |
And typically, these models do perform far better than traditional mixture-based models. 00:39:33.560 |
And this comes to the point of local versus distributed representations. 00:39:39.280 |
In a lot of different algorithms, even unsupervised learning algorithms such as clustering, you 00:39:44.640 |
are typically partitioning the space, and you're finding local prototypes. 00:39:52.120 |
And you basically have parameters for each region, 00:39:55.720 |
so the number of regions typically grows linearly with the number of parameters. 00:40:00.280 |
But in models like factor models, PCA, restricted Boltzmann machines, deep models, you typically have distributed representations. 00:40:10.520 |
The idea here is that each particular neuron can differentiate between two regions, so it partitions the space. 00:40:19.440 |
Given the second one, I can partition it again. 00:40:23.040 |
Given the third hidden variable, you can partition it again. 00:40:25.520 |
So you can see that every single neuron will be affecting lots of different regions. 00:40:31.240 |
And that's the idea behind distributed representations, because every single parameter is affecting 00:40:35.520 |
many, many regions, not just the local region. 00:40:38.080 |
And so the number of regions grows roughly exponentially with the number of parameters. 00:40:42.820 |
So that's the differences between these two classes of models. 00:40:48.880 |
Now let me jump in and quickly tell you a little bit about the inspiration behind what we can build next. 00:40:55.840 |
As we've seen with convolutional networks, in the first layer we typically learn some low-level features. 00:41:04.000 |
If you're working with a word table, typically we'll learn some low-level structure. 00:41:09.600 |
And the hope is that the high-level features will start picking up some high-level structure 00:41:15.960 |
And these kinds of models can be built in a completely unsupervised way, because what 00:41:19.960 |
you're trying to do is you're trying to model the data. 00:41:21.840 |
You're trying to model the distribution of the data. 00:41:25.120 |
You can write down the probability distribution for these models, known as a Boltzmann machine. 00:41:32.760 |
You have dependencies between hidden variables. 00:41:34.560 |
So now introducing some extra layers and dependencies between those layers. 00:41:42.960 |
And if we look at the equation, the first part of the equation is basically the same 00:41:46.560 |
as what we had with the restricted Boltzmann machine. 00:41:49.560 |
And then the second and third part of the equation is essentially modeling dependencies 00:41:53.200 |
between the first and the second hidden layer, and the second hidden layer and the third hidden layer. 00:41:58.160 |
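For concreteness, the energy function being described for a three-hidden-layer deep Boltzmann machine looks like this (bias terms omitted; this is the standard form, with one term per pair of adjacent layers):

```latex
E\!\left(v, h^{(1)}, h^{(2)}, h^{(3)}\right)
\;=\; -\,v^\top W^{(1)} h^{(1)}
\;-\; h^{(1)\top} W^{(2)} h^{(2)}
\;-\; h^{(2)\top} W^{(3)} h^{(3)},
\qquad
p\!\left(v, h^{(1)}, h^{(2)}, h^{(3)}\right) \;\propto\; e^{-E}
```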
There is also a very natural notion of bottom-up and top-down. 00:42:01.560 |
So if I want to see what's the probability of a particular unit taking value 1, it's 00:42:06.840 |
really dependent on what's coming from below and what's coming from above. 00:42:10.760 |
So there has to be some consensus in the model to say, ah, yes, what I'm seeing in the image 00:42:16.080 |
and what my model believes the overall structure should be should be in agreement. 00:42:21.880 |
And in this case, of course, the hidden variables become dependent even when you condition on the data. 00:42:27.200 |
So with these kinds of models, we'll see this a lot: you're introducing more flexibility, you're 00:42:32.440 |
introducing more structure, but then learning becomes much more difficult. 00:42:37.240 |
You have to deal with how you do inference in these models. 00:42:42.680 |
Now let me give you an intuition of how can we learn these models. 00:42:47.200 |
What's the maximum likelihood estimator doing here? 00:42:50.560 |
Well, if I differentiate this model with respect to the parameters, I basically run into the same thing. 00:42:56.040 |
And it's the same learning rule you see whenever you're working with undirected graphical models, 00:43:03.840 |
It really is just trying to look at the statistics driven by the data, correlations that you 00:43:07.840 |
see in the data, and the correlations that the model is telling you it's seeing in the 00:43:11.400 |
data, and you're just trying to match the two. 00:43:13.680 |
That's exactly what's happening in that particular equation. 00:43:18.460 |
But the first term is no longer factorial, so you have to do some approximation there as well. 00:43:23.360 |
But let me give you an intuition what each term is doing. 00:43:26.800 |
Suppose I have some data, and I get to observe these characters. 00:43:30.480 |
Well, what I can do is I really want to tell the model, this is real. 00:43:37.160 |
So I want to put some probability mass around them and say, these are real. 00:43:41.200 |
And then there is some sort of a data point that looks like this, just a bunch of pixels 00:43:47.000 |
And I really want to tell my model that put almost zero probability on this. 00:43:55.440 |
And so the first term is exactly trying to do that. 00:43:57.920 |
The first term is just trying to say, put the probability mass where you see the data. 00:44:02.000 |
And the second term is effectively trying to say, well, look at this entire exponential 00:44:05.280 |
space and just say, no, everything else is not real. 00:44:08.960 |
Just the real thing is what I'm seeing in my data. 00:44:11.960 |
And so you can use advanced techniques for doing that. 00:44:14.480 |
There's a class of algorithms called variational inference. 00:44:17.600 |
There's something that's called stochastic approximation, which is Monte Carlo-based 00:44:22.080 |
And I'm not going to go into these techniques. 00:44:29.080 |
Because there are a lot of approximations that go into these models. 00:44:32.740 |
So what I'm going to do is, if you haven't seen it, I'm going to show you two panels. 00:44:40.640 |
On one panel, you'll see the real data; on the other panel, you'll see data simulated by the model, the fake data. 00:44:47.400 |
So again, these are handwritten characters coming from alphabets around the world. 00:44:51.820 |
How many of you think this is simulated and the other part was real? 00:45:05.160 |
If you look at these images a little bit more carefully, you will see the difference. 00:45:10.840 |
So you will see that this is simulated and this is real. 00:45:16.320 |
Because if you look at the real data, it's much crisper. 00:45:20.600 |
When you're simulating the data, there's a lot of structure in the simulated characters, 00:45:24.200 |
but sometimes they look a little bit fuzzy and there isn't as much diversity. 00:45:29.760 |
And I've learned that trick from my neuroscience friends. 00:45:33.480 |
If I show you quickly enough, you won't see the difference. 00:45:38.840 |
And if you're using these models for classifying, you can do a proper analysis, which is to say, 00:45:45.880 |
given a new character, you infer the states of the latent variables, the hidden variables. 00:45:49.960 |
If I classify based on that, how good are they? 00:45:52.560 |
And they are much better than some of the existing techniques. 00:46:02.120 |
And later on, I'll show you some bigger advances that's been happening in the last few years. 00:46:08.600 |
If you look at the space of generated samples, they sort of, obviously you can see the difference. 00:46:20.400 |
This image looks like a car with wings, don't you think? 00:46:24.280 |
So sometimes it can sort of simulate things that are not necessarily realistic. 00:46:29.820 |
And for some reason, it just doesn't generate donkeys and elephants too often, but it generates 00:46:38.400 |
And that, again, has to do with the fact that you're exploring this exponential space of 00:46:43.480 |
possible images, and sometimes it's very hard to assign the right probabilities to different 00:46:52.200 |
And then obviously you can do things like pattern completion. 00:46:54.360 |
So given half of the image, can you complete the remaining half? 00:46:57.360 |
So the second one shows what the completions look like, and the last one is what the ground truth looks like. 00:47:05.640 |
These are sort of toyish examples, so where else can we use them? 00:47:08.080 |
Let me show you one example where these models can potentially succeed, which is trying to 00:47:13.680 |
model the multimodal space, which is the space of images and text. 00:47:20.240 |
Or generally, if you look at the data, it's not just single sources. 00:47:26.560 |
So how can we take all of these modalities into account? 00:47:30.100 |
And this is really just the idea of given images and text. 00:47:33.360 |
And you actually find a concept that relates these two different sources of data. 00:47:40.120 |
And there are a few challenges, and that's why generative models can sometimes help here. 00:47:46.720 |
In general, one of the biggest challenges we've seen is that typically when you're working 00:47:50.700 |
with images and text, these are very different modalities. 00:47:54.160 |
If you think about images and pixel representation, they're very dense. 00:47:58.260 |
If you're looking at text, it's typically very sparse. 00:48:01.960 |
It's very difficult to learn these cross-modal features from low-level representation. 00:48:06.920 |
Perhaps a bigger challenge is that a lot of times we see data that's very noisy. 00:48:15.880 |
Or if you look at the first image, a lot of the tags are about what kind of camera was 00:48:20.120 |
used to take that particular image, which doesn't really tell us anything about the content. 00:48:26.740 |
And these would be the text generated by a version of a Boltzmann machine model. 00:48:32.120 |
It sort of samples what the text should look like. 00:48:40.040 |
If you just build a simple representation, given images and given text, and you just try 00:48:43.760 |
to find what the common representation is, it's very difficult to learn these cross-modal features. 00:48:49.560 |
But if you actually build a hierarchical model, so you start with that representation, you can 00:48:54.520 |
build a Gaussian model, a replicated softmax model, and you build up that representation, 00:48:58.880 |
then it turns out it gives you a much richer representation. 00:49:04.080 |
There's also a notion of bottom-up and top-down, which means that the tags 00:49:12.480 |
can effectively affect the low-level representation of the images, and the other way around. 00:49:16.480 |
So information flows between images and text and gets into some stable state. 00:49:22.680 |
And this is what the text generated from images looks like, some of the examples. 00:49:28.160 |
A lot of them look reasonable, but more recently, with the advances in ConvNets, this can probably be done much better. 00:49:37.480 |
Here's some examples of the model that's not quite doing the right thing. 00:49:44.560 |
For some reason, it sort of correlates with Barack Obama and such. 00:49:48.760 |
And when we were using this model, we didn't have, at that time, good image features. 00:49:55.320 |
Right now, I don't think we'd be making these mistakes. 00:49:57.320 |
But generally speaking, what we found in a lot of the data is that there are a lot of 00:50:00.800 |
images of animals, which brings us to the next problem, which is that if you don't see many images 00:50:05.200 |
of animals, then the model gets confused, because it sees a lot of Obama signs, and these are 00:50:08.800 |
black, white, and blue signs that appear a lot. 00:50:14.180 |
You can also go from text to images: given text or tags, you can retrieve relevant images. 00:50:21.240 |
This is the dataset itself, about a million images. 00:50:23.240 |
It's a nice dataset, and you have very noisy tags. 00:50:27.960 |
The question is, can you actually learn some representation from those images? 00:50:32.120 |
One thing that I want to highlight here is that there are 25,000 labeled images. 00:50:38.040 |
Somebody went and labeled what's going on in those images, what classes we see in those 00:50:41.440 |
images, and you get some numbers, which is mean average precision. 00:50:45.360 |
What's important here is that we found that if we actually use unlabeled data, and we 00:50:49.560 |
pre-train these channels separately, using a million unlabeled data points, then we can do substantially better. 00:50:58.320 |
At least that was a little bit of a happy sign for us to say that unlabeled data can 00:51:03.280 |
help in the situations where you don't have a lot of labeled examples. 00:51:07.920 |
Here it was helping us a lot. 00:51:11.680 |
And then once you get these representations, dealing with text and images, this is one fun thing you can do. 00:51:18.120 |
I think Richard pointed out what happens in the space of linguistic regularities. 00:51:25.800 |
You can do the same thing with images, which is kind of fun to do. 00:51:29.320 |
They sometimes work, they don't work all the time. 00:51:32.520 |
I take that particular image at the top, and I say get the representation of this image, 00:51:37.680 |
subtract the representation of day, add night, and then find the closest images, and you get images of that kind of scene at night. 00:51:43.800 |
And then you can do some interesting things, like take these kittens and say minus ball 00:51:50.640 |
If you take this particular image and say minus box plus ball, you get kittens in the 00:51:58.840 |
So you can get these interesting representations. 00:52:03.840 |
Of course, these are all fun things to look at, but they don't really mean much, because 00:52:07.800 |
we're not specifically optimizing for those things. 00:52:11.840 |
Now let me spend some time also talking about another class of models. 00:52:18.280 |
These are known as Helmholtz machines and variational autoencoders. 00:52:21.560 |
These are the models that have been popping up in our community in the last two years. 00:52:29.320 |
A Helmholtz machine was developed back in '95, and it was developed by Hinton and Peter Dayan 00:52:42.200 |
You have a generative process, so given some latent state, you just-- it's a neural network, 00:52:48.120 |
it's a stochastic neural network that generates the input data. 00:52:52.520 |
And then you have so-called approximate inference step, which is to say, given the data, infer 00:52:57.800 |
approximately what the latent states should look like. 00:53:05.200 |
There's something that's called wake-sleep algorithm, and it never worked. 00:53:09.200 |
Basically, people just said, it just doesn't work. 00:53:12.800 |
And then we started looking at restricted Boltzmann machines and Boltzmann machines, because they were easier to work with. 00:53:18.440 |
And then two years ago, people figured out how to make them work. 00:53:22.080 |
And so now, 10 years later, I'm going to show you the trick. 00:53:24.560 |
Now, these models are actually working pretty well. 00:53:27.520 |
The difference between Helmholtz machines and deep Boltzmann machines is very subtle. 00:53:33.280 |
The big difference between the two is that in Helmholtz machines, you have a generative 00:53:38.040 |
process that generates the data, and you have a separate recognition model that tries to infer the latent states. 00:53:44.240 |
So you can think of this Q function as a convolutional neural network that, given the data, tries to figure out the features. 00:53:51.400 |
And then there's a generative model, given the features, it generates the data. 00:53:54.840 |
A Boltzmann machine is a sort of similar class of model, but it has undirected connections. 00:53:59.120 |
So you can think of it as generative and recognition connections are the same. 00:54:03.160 |
So it's sort of a system that tries to reach some equilibrium state when you're running 00:54:10.040 |
So the semantics is a little bit different between these two models. 00:54:15.480 |
A variational autoencoder is a Helmholtz machine. 00:54:17.960 |
It defines a generative process in terms of sampling through cascades of stochastic layers. 00:54:24.080 |
And if you look at it, there's just a bunch of conditional probability distributions that 00:54:28.040 |
you're defining, so you can generate the data. 00:54:30.400 |
So theta here will denote the parameters of the variational autoencoder. 00:54:36.960 |
And sampling from these conditional probability distributions is something we're assuming that we can do efficiently. 00:54:45.680 |
But the innovation here is that every single conditional probability can actually be a powerful nonlinear function. 00:54:54.680 |
It can model nonlinear relationships. 00:54:58.000 |
It can be a multilayer nonlinear neural network, a deterministic neural network. 00:55:05.400 |
Here's an example where I have a stochastic layer. 00:55:08.640 |
You have a stochastic layer, and then you generate the data. 00:55:12.040 |
So you can introduce these nonlinearities into these models. 00:55:16.880 |
And this conditional probability would denote a one-layer neural network. 00:55:21.960 |
Now I'll show you some examples, but maybe I can just give you a little intuition behind how we learn these models. 00:55:31.480 |
In a lot of these kinds of models, learning is very hard to do. 00:55:36.800 |
And there's a class of models called variational learning. 00:55:39.000 |
And what the variational learning is trying to do is basically trying to do the following. 00:55:42.440 |
Well, I want to maximize the probability of the data that I observe, but I cannot do it directly. 00:55:48.080 |
So instead, what I'm going to do is I'm going to maximize the so-called variational lower bound. 00:55:54.080 |
And it's effectively saying, well, instead of the log of an expectation, I can push the log inside the expectation, which gives me a lower bound. 00:56:02.240 |
And it turns out, just logistically, working with this representation is much easier than working with the original likelihood. 00:56:08.680 |
If you go a little bit through the math, it turns out that you can actually optimize this 00:56:13.520 |
variational bound, but you can't really optimize this particular likelihood objective. 00:56:19.200 |
It's a little bit surprising for those of you who haven't seen variational learning 00:56:24.840 |
But this one little trick, this so-called Jensen's inequality, is actually what allows you to do learning in these models. 00:56:33.080 |
And the other way to write the lower bound is to say, well, there is a log likelihood 00:56:37.360 |
function and something that's called KL divergence, which is the distance between your approximating 00:56:42.080 |
distribution Q, which is your recognition model, and the truth. 00:56:45.880 |
The truth in these models would be the true posterior according to your model. 00:56:51.000 |
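In generic notation, the bound being described follows from Jensen's inequality, and rewriting it exposes the KL gap to the true posterior:

    \log p_\theta(x) \;=\; \log \mathbb{E}_{q_\phi(h|x)}\!\left[\tfrac{p_\theta(x,h)}{q_\phi(h|x)}\right]
    \;\ge\; \mathbb{E}_{q_\phi(h|x)}\!\left[\log \tfrac{p_\theta(x,h)}{q_\phi(h|x)}\right]
    \;=\; \log p_\theta(x) \;-\; \mathrm{KL}\!\left(q_\phi(h|x)\,\|\,p_\theta(h|x)\right)

Maximizing the bound therefore pushes the recognition model Q toward the true posterior while raising the likelihood.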
And it's hard to optimize these kinds of models in general. 00:56:54.560 |
You're trying to optimize your generative model and your recognition model at the same time. 00:56:58.800 |
And back in '95, Hinton and his students developed this wake-sleep algorithm that 00:57:05.200 |
was a bunch of different things put together, but it was never quite the right algorithm 00:57:09.280 |
because it wasn't really optimizing anything. 00:57:14.560 |
But in 2014, there was a beautiful trick introduced by Kingma and Welling, and there were a few 00:57:19.880 |
other groups that came up with the same idea, the so-called reparameterization trick. 00:57:24.060 |
So let me show you what reparameterization trick does intuitively. 00:57:27.960 |
So let's say your recognition distribution is a Gaussian. 00:57:32.160 |
So a Gaussian, I can write it as a mean and a variance. 00:57:37.160 |
Notice that my mean depends on the layer below. 00:57:43.720 |
The variance also depends on the layer below, so it could also be a nonlinear function. 00:57:49.840 |
But what I can do is I can actually do the following. 00:57:52.080 |
I can express this particular Gaussian in terms of auxiliary variables. 00:57:56.880 |
I can say, well, if I sample this epsilon from a normal(0, 1), a standard Gaussian distribution, 00:58:02.240 |
then I can write this particular h, my state, in a deterministic way. 00:58:09.680 |
It's just the mean plus the standard deviation (the square root of the variance) times that noise epsilon. 00:58:17.440 |
So this is just a simple reparameterization of the Gaussian. 00:58:21.920 |
I'm just pulling out the mean and the variance. 00:58:27.040 |
So I can write my recognition model as this Gaussian, or I can write it in terms of a noise variable plus a deterministic transformation. 00:58:34.860 |
So the recognition distribution can be represented as a deterministic mapping. 00:58:38.400 |
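In symbols, with x denoting the layer below, the reparameterized sample is

    h \;=\; \mu_\phi(x) \;+\; \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),

so all the parameter dependence sits in the deterministic functions mu and sigma.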
And that's the beauty, because it turns out that you can collapse these complicated stochastic models into a deterministic computation. 00:58:47.520 |
We can back propagate through the entire model. 00:58:53.320 |
And the distribution of these auxiliary variables doesn't depend on the parameters. 00:58:58.960 |
So it's almost like taking a stochastic system and separating the stochastic part from the deterministic part. 00:59:05.560 |
In the deterministic part, you can do back propagation, so you can do learning. 00:59:08.420 |
And the stochastic part, you can do sampling. 00:59:10.720 |
So just think of it as a separation between the two pieces. 00:59:16.280 |
So now, if I take the gradient of the variational bound, or the variational objective, with respect 00:59:21.820 |
to parameters, this is something that we couldn't do back in '95, and we couldn't really do it for years afterwards. 00:59:27.800 |
People have tried using the REINFORCE algorithm or some approximations to it. 00:59:32.500 |
But here what we can do is we can do the following. 00:59:34.060 |
We can say, well, I can write this expression, because it's a Gaussian, as sampling a bunch of noise variables from a standard normal. 00:59:41.780 |
And then into this log term, I can just inject the noise. 00:59:48.780 |
You take this gradient here, and you push it inside the expectation. 00:59:55.020 |
So before, you would take the gradient of expectations: you compute a bunch of averages, 01:00:00.940 |
and then you take the gradient of those averages. 01:00:03.460 |
What you're doing now with the reparameterization trick is you're taking the gradients first and then averaging them. 01:00:09.620 |
It turns out that this hugely reduces the variance in your training. 01:00:14.020 |
It actually allows you to learn these models quite efficiently. 01:00:18.140 |
So the mapping h here is completely deterministic. 01:00:21.020 |
And gradients here can be computed by back propagation. 01:00:25.420 |
And you can think of this thing inside as just an autoencoder that you are optimizing. 01:00:33.080 |
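A minimal sketch of that inner autoencoder, assuming a Gaussian recognition model and a Bernoulli decoder (the architecture, sizes, and data here are placeholders, not the models from the talk):

    import torch
    import torch.nn as nn

    class TinyVAE(nn.Module):
        def __init__(self, x_dim=784, h_dim=32):
            super().__init__()
            self.enc = nn.Linear(x_dim, 2 * h_dim)   # recognition model: outputs mean and log-variance
            self.dec = nn.Linear(h_dim, x_dim)       # generative model: outputs Bernoulli logits

        def forward(self, x):
            mu, logvar = self.enc(x).chunk(2, dim=-1)
            eps = torch.randn_like(mu)               # stochastic part: sample the auxiliary noise
            h = mu + torch.exp(0.5 * logvar) * eps   # deterministic part: reparameterized sample
            recon = nn.functional.binary_cross_entropy_with_logits(
                self.dec(h), x, reduction='sum')     # -log p(x | h) for Bernoulli outputs
            kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q || N(0, I)) in closed form
            return recon + kl                        # negative variational bound

    model = TinyVAE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.rand(16, 784).round()                  # fake binary batch, just for illustration
    loss = model(x)
    loss.backward()                                  # back propagation through the reparameterized sample
    opt.step()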
And obviously, there are other extensions of these models that we've looked at, and a 01:00:36.900 |
bunch of other teams have looked at, where you can say, well, maybe we can improve these models by tightening the bound. 01:00:42.180 |
These are the so-called k-sample importance-weighted bounds. 01:00:44.980 |
And so you can make them a little bit better, a little bit more precise. 01:00:48.860 |
You can model somewhat more complicated distributions over the posteriors. 01:00:54.780 |
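For reference, the k-sample importance-weighted bound (in generic notation) is

    \mathcal{L}_k \;=\; \mathbb{E}_{h_1,\dots,h_k \sim q_\phi(h|x)}\!\left[\log \frac{1}{k}\sum_{i=1}^{k} \frac{p_\theta(x, h_i)}{q_\phi(h_i|x)}\right] \;\le\; \log p_\theta(x),

which recovers the standard variational bound at k = 1 and gets tighter as k grows.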
But now, let me step back a little bit and say, why am I telling you about this? 01:01:02.900 |
Why do we need stochastic systems in general? 01:01:09.120 |
We wanted to build a model that, given a caption, generates the image. 01:01:15.140 |
And my student was very ambitious and basically said, I want to be able to take any 01:01:19.820 |
sentence and generate an image from it, kind of like an artificial painter. 01:01:24.540 |
I want to paint what's in my caption in the most general way. 01:01:31.120 |
So this is one example of a Helmholtz machine where you have a generative model, which is stochastic. 01:01:35.500 |
It's just a chained sequence of variational autoencoders. 01:01:38.500 |
And there's a recognition model, which you can think of as a deterministic system, like 01:01:42.260 |
a convolutional network, that tries to approximate what the latent states are. 01:02:01.780 |
Now if you were using a deterministic system, like an autoencoder, you would generate one particular image. 01:02:11.940 |
Once you have a stochastic system, you inject this latent noise, and that allows 01:02:16.880 |
you to actually generate the whole space of possible images. 01:02:19.860 |
So for example, it tends to generate this stop sign and this stop sign. 01:02:33.060 |
Here is "a yellow school bus is flying in blue skies." 01:02:35.660 |
So here we wanted to test the system, to see whether it understands something about what's going on in the caption. 01:02:42.380 |
Here is "a herd of elephants is flying in blue skies." 01:02:45.500 |
Now, we cannot generate elephants, although there are now techniques that are getting better at that. 01:02:52.140 |
And here is "a commercial plane flying in blue skies." 01:02:54.940 |
But this is where we need stochasticity because we want to be able to generate the whole distribution 01:02:58.780 |
of possible outcomes, not necessarily just one particular point. 01:03:04.940 |
We can basically do things like a yellow school bus parked in the parking lot versus a red 01:03:09.380 |
school bus parked in the parking lot versus a green school bus parked in the parking lot. 01:03:17.020 |
We can't quite generate blue school buses, but we've seen blue cars and we've seen other blue objects. 01:03:22.100 |
So it can make an association to draw these different things. 01:03:28.180 |
But in terms of comparing to different models, if I give you a group of people on the beach 01:03:33.860 |
with surfboards, this is what we can generate. 01:03:37.260 |
There is another model called the LAPGAN model, which is a model based on adversarial neural 01:03:41.340 |
networks, something I'll talk about in the last part of this talk. 01:03:45.580 |
And there are these models, convolutional and deconvolutional variational autoencoders, which 01:03:50.140 |
are again convolutional/deconvolutional autoencoders, just with some noise. 01:03:54.460 |
And you can certainly see that, generally, we found it's very hard to generate these kinds of scenes. 01:04:08.980 |
I don't know if you can see the toilet seats here, maybe, but the caption here was a toilet seat sitting open in a grass field. 01:04:18.100 |
And when we put this paper on arXiv, one of the students basically came to me and said, 01:04:26.260 |
"This is really bad, because you can always ask Google." 01:04:30.700 |
If you type that particular query into Google search, it gives you that. 01:04:40.460 |
But now if you actually put this query into Google, this image comes up. 01:04:49.500 |
And generally because what's happening is that people are just clicking on that image 01:04:52.840 |
all the time to figure out what's going on in that image. 01:04:59.560 |
So now I can say that, according to Google, this is a much better representation for that caption. 01:05:07.780 |
Here's another sort of interesting model, which is a model where you're trying to build a story generator. 01:05:15.380 |
Again, it's a generative model, but it's a generative model of text. 01:05:19.400 |
This model was trained on about 7,000 romance novels. 01:05:25.180 |
And you take that model and you hook it up to a caption generation system. 01:05:30.620 |
So you're basically telling the model: here's an image, generate me a story in the style of a romance novel. 01:05:42.220 |
We're barely able to catch the breeze on the beach and so forth. 01:05:45.980 |
She's beautiful, but the truth is I don't know what to do. 01:05:49.040 |
The sun was just starting to fade away, leaving people scattered around the Atlantic Ocean. 01:05:56.580 |
And there are a bunch of different things that you can do. 01:05:58.660 |
Obviously, we're not there yet in terms of generating romantic stories. 01:06:02.960 |
But here's one example where it's a generative model. 01:06:05.900 |
It seems like syntactically we can actually generate reasonable things. 01:06:14.860 |
And actually, that particular work was inspired a little bit by Baidu's system that would produce a poem for an image. 01:06:27.020 |
It was mostly selecting the right poem for the image. 01:06:30.700 |
Here we were actually trying to generate something. 01:06:35.900 |
So there's still a lot of work to do in that space. 01:06:39.900 |
Semantically, we are nowhere near getting the right structure. 01:06:44.580 |
Here's another last example that I want to show you. 01:06:48.300 |
This was done in the case of one-shot learning, which is: can you build a generative model of handwritten characters? 01:06:53.260 |
That's a very well-defined domain. 01:06:57.300 |
It's a very simple domain, but it's also very hard. 01:07:02.140 |
We've shown this example to people and to the algorithm. 01:07:06.300 |
And we can say, well, can you draw me this example? 01:07:09.060 |
And on one panel, humans would draw what they believe this example should look like. 01:07:14.700 |
And then on the other panel, we have machines drawing it. 01:07:17.420 |
So this is really just a generative model based on a single example. 01:07:22.060 |
We're showing you an example, and you're trying to generate what it is. 01:07:26.840 |
How many of you think this was machine-generated and this was human-generated? 01:07:39.260 |
How many of you think this is machine-generated and this is human-generated? 01:07:45.860 |
Well, the truth is I don't really know which one was generated by the machine and which by a human. 01:07:52.660 |
Because that was done by Brendan Lake, who designed the experiments; I should actually ask him. 01:07:58.380 |
But I can tell you that there's been a lot of studies. 01:08:01.580 |
He's done a lot of studies, and it's almost 50/50. 01:08:05.480 |
So in this kind of small, carved-out domain, we can actually compete with people at generating these characters. 01:08:16.220 |
Now let me step back a little bit and tell you about a different class of models. 01:08:21.580 |
These are models known as generative adversarial networks, and they've been gaining a lot of 01:08:28.540 |
traction in our community because they seem to produce remarkable results. 01:08:36.940 |
We're not going to be explicitly defining the density, but we need to be able to sample from it. 01:08:43.780 |
And the interesting thing is that there's no variational learning, there's no maximum 01:08:46.460 |
likelihood estimation, there's no Markov chain Monte Carlo, there's no sampling. 01:08:53.520 |
And it turns out that you can learn these models by playing a game. 01:09:00.660 |
You're going to be setting up a game between two players. 01:09:03.020 |
You're going to have a discriminator, D, think of it as a convolutional neural network, and 01:09:08.620 |
then you're going to have a generator, G. Maybe you can think of it as a variational autoencoder 01:09:12.540 |
or a Helmholtz machine, or something that gives you samples that look like the data. 01:09:17.980 |
The discriminator, D, is going to be discriminating between a sample from the data distribution and a sample from the model. 01:09:26.780 |
So the goal of the discriminator is to basically say, is this a fake sample or is this a real sample? 01:09:33.020 |
A fake sample is a sample generated by the model. 01:09:41.860 |
And the generator is going to be trying to fool the discriminator by trying to generate 01:09:47.100 |
samples that are hard for the discriminator to discriminate. 01:09:50.820 |
So my goal as a generator would be to generate really nice looking digits so that the discriminator 01:09:56.300 |
wouldn't be able to tell the difference between the simulated and the real. 01:10:02.940 |
And so here is intuitively what that looks like. 01:10:08.540 |
Let's say you have some data, so images of faces. 01:10:13.180 |
I give you an image of a face, and now I have a discriminator that basically says, well, 01:10:17.900 |
if I get a real face, I push it through some function, some differentiable function. 01:10:23.000 |
Think of it as a convolutional neural network or another differentiable function. 01:10:29.260 |
So I want to output one if it's a real sample. 01:10:32.820 |
Then you have a generator, and the generator takes some noise as input. 01:10:41.220 |
Given some noise, I go through a differentiable function, which is your generator, and I generate a sample. 01:10:49.980 |
And then on top of it, I take this sample, I put it into my discriminator, and I ask whether it is real or fake. 01:10:59.260 |
Because my discriminator will have to say, well, this is fake, and this is real. 01:11:06.140 |
And the generator basically says, well, how can I get a sample such that my discriminator 01:11:12.260 |
is going to be confused, such that the discriminator always outputs one here, because it believes 01:11:17.260 |
it's a true sample, believes it's coming from the true data. 01:11:29.740 |
It's a very intuitive objective function that has the following structure. 01:11:35.940 |
You have a discriminator term that says, well, this is an expectation with respect to the data distribution. 01:11:41.740 |
So this is basically saying, I want to classify any data points that I get from my data as real. 01:11:48.140 |
So I want this output to be one, because if it's one, the whole term is going to be zero. 01:11:52.340 |
If it's less than one, it's going to be negative. 01:11:57.960 |
And then the discriminator is saying, well, any time I generate a sample, whatever sample 01:12:01.920 |
comes out of my generator, I want to classify it as fake. 01:12:11.600 |
The generator is sort of the other player: you try to minimize this function, which essentially 01:12:15.560 |
says, well, generate samples that the discriminator would classify as real. 01:12:20.920 |
So I really am going to try to change the parameters of my generator such that the discriminator gets fooled. 01:12:36.480 |
And it turns out the optimal strategy for discriminator is this ratio, which is probability 01:12:40.800 |
of the data divided by the probability of the data plus probability of the model. 01:12:44.740 |
And in general, if you succeed in building a good generative model, then probability 01:12:48.940 |
of the data would be the same as probability of the model. 01:12:50.920 |
So the discriminator will always be at one half, completely confused. 01:12:56.720 |
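Written out in generic notation, the game just described and the optimal discriminator strategy are

    \min_G \max_D \;\; \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right] \;+\; \mathbb{E}_{z \sim p(z)}\!\left[\log\left(1 - D(G(z))\right)\right],
    \qquad D^*(x) \;=\; \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{\text{model}}(x)}.

At that optimum, if the model distribution matches the data distribution, D*(x) = 1/2 everywhere, which is the "one half" point just mentioned.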
And here's one particular example: it seems like a simple idea, but it turns out to work remarkably well. 01:13:02.680 |
Here's an architecture called the deep convolutional generative adversarial network (DCGAN) architecture. 01:13:12.320 |
The generator takes the code and passes it through a sequence of deconvolutions. 01:13:16.480 |
So given the code, you sort of deconvolve it back into a high-dimensional image. 01:13:28.760 |
And then there is a discriminator, which is just a convolutional neural network that's trying to tell real images from generated ones. 01:13:34.200 |
And if you train these models on bedrooms (this is the LSUN dataset, a bunch 01:13:39.820 |
of bedrooms), this is what samples from the model look like, which is pretty impressive. 01:13:46.200 |
And in fact, when I look at these samples, I'm also sort of thinking, well, maybe the 01:13:49.840 |
model is memorizing the data, because these samples look remarkably impressive. 01:14:04.040 |
And here you're seeing samples generated from the model, which is, again, very impressive. 01:14:09.400 |
If you look at the structure in these samples, it's quite remarkable that you can generate images like this. 01:14:20.840 |
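Going back to how the game is actually trained, here is a minimal sketch of the two-player update in PyTorch; the tiny fully connected networks and the random tensor standing in for real images are placeholders, not the DCGAN architecture itself:

    import torch
    import torch.nn as nn

    G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())   # generator: noise -> image
    D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))        # discriminator: image -> logit
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCEWithLogitsLoss()

    real = torch.rand(64, 784) * 2 - 1          # placeholder batch of "real" images in [-1, 1]
    z = torch.randn(64, 100)                    # input noise

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    fake = G(z).detach()                        # detach so this step does not update G
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator step: produce samples that the discriminator classifies as real.
    loss_g = bce(D(G(z)), torch.ones(64, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()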
This, again, was done by Tim Salimans and his collaborators. 01:14:26.580 |
If you look at ImageNet, at the training data on ImageNet, and then you look 01:14:31.040 |
at the samples: this is a horse, 01:14:33.800 |
there's some animal, that's an airplane, and so forth. 01:14:41.520 |
When I look at these images, I was very impressed by the quality of these 01:14:47.800 |
images, because generally it's very, very hard to generate realistic-looking images. 01:14:52.120 |
And the last thing I want to point out-- this was picked up by Ian Goodfellow. 01:14:58.100 |
If you cherry pick some of the examples, this is what generated images look like. 01:15:02.920 |
So you can sort of see there is a little bit of interesting structure that you're seeing in these generated samples. 01:15:12.040 |
And one question that still remains with these models is, how can we evaluate them? 01:15:18.600 |
Is the model really learning the space of all possible images, what the distribution over images looks like? 01:15:26.280 |
Or is the model mostly kind of blurring things around and just making some small changes to the training examples? 01:15:33.040 |
So the question that I would really like to get an answer to 01:15:37.200 |
is, if I show you a new example, a new test image, a new kind of horse, would the 01:15:47.120 |
model be able to say, I've seen similar images before, or something like that, or not. 01:15:54.600 |
But again, this is the class of models which steps away from maximum likelihood estimation, 01:15:58.960 |
sort of sets it up in a game theoretic framework, which is a really nice set of work. 01:16:05.800 |
And in the computer vision community, a lot of people are showing a lot of progress in 01:16:08.920 |
using these kinds of models because they tend to generate much more realistic-looking images. 01:16:15.760 |
So let me just summarize to say that I've shown you, hopefully, a set of learning algorithms for learning from unlabeled data. 01:16:23.560 |
There's a lot of space in these models, a lot of excitement in that space. 01:16:27.640 |
And I just wanted to point out that these models, the deep models, they improve upon 01:16:31.120 |
current state of the art in a lot of different application domains. 01:16:34.400 |
And as I mentioned before, there's been a lot of progress in discriminative models, 01:16:38.440 |
convolutional models, and recurrent neural networks for solving action recognition and other tasks. 01:16:45.280 |
And unsupervised learning still remains a field where we've made some progress, but there is still a long way to go. 01:17:21.880 |
So as a Bayesian guy, I'm pretty depressed by the fact that GANs can generate much clearer images than probabilistic models. 01:17:31.160 |
So my question is, do you think there could be an energy-based framework or a probabilistic 01:17:37.640 |
interpretation of why GANs are so successful, other than that it's just a minimax game? 01:17:43.240 |
I think that, generally, I sort of go back and forth between variational 01:17:47.520 |
autoencoders and GANs, because some of my friends at OpenAI are saying that they can actually generate 01:17:52.960 |
really nice-looking images using variational autoencoders. 01:17:59.440 |
But I think that one of the problems with image generation today is that with variational 01:18:06.440 |
autoencoders, there is this notion of Gaussian loss function. 01:18:11.320 |
And what it does is it basically says, well, never produce crystal-clear images, because 01:18:17.760 |
if you're wrong, if you put the edge in the wrong place, you're going to be penalized heavily. 01:18:26.800 |
What the GANs are doing, GANs are basically saying, well, I don't really care where I 01:18:30.720 |
put the edge, as long as it looks realistic so that I can fool my classifier. 01:18:35.000 |
So what tends to happen in practice, a lot of times, if you actually look at the 01:18:38.880 |
images generated by GANs, sometimes they have a lot of artifacts, these very specific patterns. 01:18:47.240 |
Whereas in variational autoencoders, you don't see that. 01:18:49.560 |
But again, the problem with variational autoencoders is they tend to produce images that are much 01:18:52.840 |
more diffuse, not as sharp or as clear as what GANs produce. 01:18:58.080 |
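To make the point about the Gaussian loss concrete: under a Gaussian observation model, the reconstruction term of the bound is, up to constants, a squared-error penalty, which rewards averaging over plausible edge locations rather than committing to a sharp edge:

    -\log \mathcal{N}\!\left(x;\, \hat{x},\, \sigma^2 I\right) \;=\; \frac{\lVert x - \hat{x} \rVert^2}{2\sigma^2} \;+\; \text{const}.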
And there's been some work on trying to sharpen the images, where you're using variational 01:19:02.120 |
autoencoders to generate the globally coherent scene, and then you're using a generative adversarial network to sharpen the details. 01:19:10.120 |
Again, it depends what loss function you're using. 01:19:13.400 |
And GANs seem to be able to deal with that problem implicitly, because they don't really 01:19:18.240 |
care whether you get the edge quite right or not, as long as it fools your classifier. 01:19:27.400 |
Thank you very much for the interesting talk. 01:19:31.440 |
I have a question about the variational autoencoder. 01:19:37.760 |
For more challenging data sets, like Street View House Numbers, I noticed that many 01:19:42.920 |
implementations use PCA to preprocess the data before they train the model. 01:19:48.720 |
What are your thoughts on that preprocessing step? 01:19:59.880 |
My experience has been that we don't really do a lot of preprocessing. 01:20:03.640 |
What you can do is ZCA preprocessing: you subtract the mean and whiten using the 01:20:07.560 |
second-order covariance structure of the data. 01:20:13.000 |
But I don't see any particular reason why you'd want to do PCA preprocessing. 01:20:17.400 |
It's just one of those things we've seen a lot in our field: people do x and y, 01:20:24.760 |
and then later on they figure out that they don't really need x and y. 01:20:30.200 |
Maybe it was working better for their implementation, for their particular task, but generally I 01:20:33.760 |
haven't seen people doing a lot of preprocessing with PCA for training variational autoencoders. 01:20:55.240 |
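For reference, a minimal numpy sketch of the ZCA preprocessing being described, i.e. mean subtraction plus whitening with the data covariance; the small epsilon is an added regularizer for numerical stability:

    import numpy as np

    def zca_whiten(X, eps=1e-5):
        # X: (n_samples, n_features); subtract the mean and whiten with the covariance structure.
        X = X - X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        U, S, _ = np.linalg.svd(cov)
        W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T   # ZCA whitening matrix
        return X @ W

    X_white = zca_whiten(np.random.rand(100, 32))       # placeholder data, just for illustration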
So if you look at the literature for, let us say, estimation of the partition function 01:21:00.780 |
for Ising models, you will see that the literature is a lot richer compared to the variational 01:21:06.520 |
inference literature for restricted Boltzmann machines, especially in the binary context. 01:21:12.200 |
Specifically, for the strictly ferromagnetic case, you have a fully polynomial-time randomized 01:21:18.680 |
approximation scheme (FPRAS) for estimating the log partition function. 01:21:24.080 |
But then I don't see usage of these FPRAS algorithms in the RBM space. 01:21:29.640 |
So when you juxtapose the literature for Ising models with that for binary RBMs, you'll notice a gap. 01:21:38.260 |
Yeah, so the thing about Ising models is that if you're in a ferromagnetic case, or if you 01:21:45.100 |
have certain particular structure to the Ising models, you can use a lot of techniques. 01:21:48.420 |
You can even use techniques like coupling from the past to draw exact samples from the model. 01:21:52.560 |
You can compute the log partition function in polynomial time if you have that specific structure. 01:21:56.460 |
But the problem with RBMs is that generally those assumptions don't apply. 01:22:00.360 |
You cannot restrict yourself to a ferromagnetic model with RBMs, where all your weights are constrained to be positive. 01:22:04.980 |
That's a lot of constraints to put on this class of models. 01:22:08.820 |
So that's why-- and once you get outside of these assumptions, then the problem becomes 01:22:14.740 |
NP-hard for estimating the partition function. 01:22:17.980 |
And obviously, for learning these systems, you need the gradient of the log partition function. 01:22:27.300 |
And unfortunately, variational methods are also not working as well as approximations 01:22:33.180 |
like contrastive divergence, or something based on sampling. 01:22:37.660 |
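For concreteness, a minimal numpy sketch of a CD-1 update for a binary RBM (biases omitted, and all sizes and the learning rate are arbitrary):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(v0, W, lr=0.01):
        # v0: (batch, n_visible) binary data; W: (n_visible, n_hidden) weights.
        h0 = sigmoid(v0 @ W)                                   # hidden probabilities given the data
        h0_sample = (np.random.rand(*h0.shape) < h0).astype(float)
        v1 = sigmoid(h0_sample @ W.T)                          # one-step reconstruction
        h1 = sigmoid(v1 @ W)                                   # hidden probabilities given the reconstruction
        grad = v0.T @ h0 - v1.T @ h1                           # positive phase minus negative phase
        return W + lr * grad / v0.shape[0]

    W = 0.01 * np.random.randn(784, 128)
    batch = (np.random.rand(32, 784) < 0.5).astype(float)      # placeholder binary batch
    W = cd1_update(batch, W)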
People have looked at better approximations and more sophisticated techniques, but it's still a hard problem. 01:22:50.580 |
I'm curious about using autoencoders to get semantic hashes, especially for text. 01:23:00.340 |
Do we need any special text representation, like word2vec, as input to the model? 01:23:11.100 |
So I've talked about the model, which is a very simple model that works on bag-of-words counts. 01:23:17.580 |
You can use word2vec to initialize the model, because it's a way of just taking your words and embedding them in a vector space. 01:23:25.460 |
There have been a lot of recent techniques using, like Richard was mentioning, GRUs: 01:23:31.300 |
if you want to work with sentences, or if you want to embed the entire document 01:23:36.340 |
into the semantic space and then make it binary, you can use GRUs, bidirectional 01:23:41.780 |
GRUs, to get the representation of the document. 01:23:44.580 |
I think that would probably work better than using word2vec and then just adding things up. 01:23:48.300 |
And then based on that, you can learn a hashing function that maps that particular representation 01:23:52.140 |
to the binary space, in which case you can do searching fairly efficiently. 01:23:57.340 |
So as an input representation, there are lots of choices. 01:23:59.700 |
You can use bidirectional GRUs, which is the method of choice right now. 01:24:06.940 |
You can use GloVe, or you can use word2vec and sum up the representations of the words. 01:24:14.100 |
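A rough sketch of that pipeline (a bidirectional GRU encoder followed by a learned hashing layer); all names and sizes are illustrative, and the binarization here is just a hard threshold applied at test time:

    import torch
    import torch.nn as nn

    class SemanticHasher(nn.Module):
        def __init__(self, vocab_size=10000, emb_dim=128, code_bits=32):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.gru = nn.GRU(emb_dim, 64, bidirectional=True, batch_first=True)
            self.hash = nn.Linear(2 * 64, code_bits)        # maps document representation to code logits

        def forward(self, tokens):
            # tokens: (batch, seq_len) word indices.
            _, h = self.gru(self.emb(tokens))               # h: (2, batch, 64) final states of both directions
            doc = torch.cat([h[0], h[1]], dim=-1)           # document representation
            return torch.sigmoid(self.hash(doc))            # soft codes in (0, 1), trainable end to end

    model = SemanticHasher()
    tokens = torch.randint(0, 10000, (4, 50))               # placeholder batch of token ids
    binary_code = (model(tokens) > 0.5).int()               # hard binary hash for retrieval at test time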
Okay, so using only bag-of-words, we use only a normal feedforward network, that is, no recurrent network? 01:24:23.300 |
But again, your representation can be whatever you want, as long as it's differentiable. 01:24:27.820 |
Because in this case, you can back propagate through the bidirectional GRUs and learn whatever representation works best.