
Foundations of Unsupervised Deep Learning (Ruslan Salakhutdinov, CMU)


Chapters

0:00 Deep Unsupervised Learning
1:27 Deep Autoencoder Model
4:02 Talk Roadmap
5:01 Learning Feature Representations
5:40 Traditional Approaches
6:02 Computer Vision Features
6:17 Audio Features
8:37 Sparse Coding: Training
10:19 Sparse Coding: Testing Time
10:51 Image Classification Evaluated on Caltech101 object category dataset
12:05 Interpreting Sparse Coding
16:12 Another Autoencoder Model
16:37 Predictive Sparse Decomposition
17:32 Stacked Autoencoders
18:43 Deep Autoencoders
20:20 Information Retrieval
21:05 Semantic Hashing
22:41 Deep Generative Model
28:08 Learning Features
29:04 Model Learning
33:02 Contrastive Divergence
34:45 RBMs for Word Counts
36:33 Collaborative Filtering
37:49 Product of Experts
39:32 Local vs. Distributed Representations
40:55 Deep Boltzmann Machines
41:24 Model Formulation
44:25 Good Generative Model?
45:54 Generative Model of 3-D Objects
46:52 3-D Object Recognition
47:09 Data - Collection of Modalities
47:40 Challenges
48:37 A Simple Multimodal Model
49:23 Text Generated from Images
51:11 Multimodal Linguistic Regularities
53:34 Helmholtz Machines vs. DBMs
54:20 Variational Autoencoders (VAE)

Transcript

Sound is good? Okay, great. So I wanted to talk to you about unsupervised learning, and that's the area where there's been a lot of research. But compared to supervised learning that you've heard about today, like convolutional networks, unsupervised learning is not there yet. So I'm going to show you lots of areas.

Parts of the talk are going to be a little bit more mathematical. I apologize for that, but I'll try to give you a gist of the foundations, the math behind these models, as well as try to highlight some of the application areas. What's the motivation? Well, the motivation is that the space of data that we have today is just growing.

If you look at the space of images, speech, if you look at social network data, if you look at scientific data, I would argue that most of the data that we see today is unlabeled. So how can we develop statistical models, models that can discover interesting structure in an unsupervised or semi-supervised way?

And that's what I'm interested in, as well as how can we apply these models across multiple different domains. And one particular framework of doing that is the framework of deep learning, where you're trying to learn hierarchical representations of data. And again, as I go through the talk, I'm going to show you some examples.

So here's one example. You can take a simple bag-of-words representation of an article or a newspaper. You can use something that's called an autoencoder, with multiple levels. You extract some latent code, and then you get some representation out of it. This is done completely in an unsupervised way. You don't provide any labels.

And if you look at the kind of structure that the model is discovering, it could be useful for visualization, for example, to see what kind of structure you see in your data. This was done on the Reuters dataset. I've tried to kind of cluster together lots of different unsupervised learning techniques, and I'll touch on some of them.

It's a little bit-- it's not a full set. But the way that I typically think about these models is that there's a class of what I would call non-probabilistic models, models like sparse coding, autoencoders, clustering-based methods. And these are all very, very powerful techniques, and I'll cover some of them in that talk as well.

And then there is sort of a space of probabilistic models. And within probabilistic models, you have tractable models, things like fully observed belief networks. There's a beautiful class of models called neural autoregressive density estimators. More recently, we've seen some successes of so-called pixel recurrent neural network models. And I'll show you some examples of that.

There is a class of so-called intractable models, where you are looking at models like Boltzmann machines and models like variational autoencoders, something where there's been a lot of development in our community, in the deep learning community, in that space. Helmholtz machines, I'll tell you a little bit about what these models are, and a whole bunch of others as well.

One particular structure within these models is that when you're building these generative models of data, you typically have to specify the distributions you're looking at. So you have to specify the probability of the data, and you're generally doing some kind of approximate maximum likelihood estimation. And then more recently, we've seen some very exciting models coming out.

These are generative adversarial networks, moment matching networks. And this is a slightly different class of models, where you don't really have to specify what the density is. You just need to be able to sample from those models. And I'm going to show you some examples of that. So that's how my talk is going to be structured.

I'd like to introduce you to the basic building blocks, models like sparse coding models, because I think that these are very important classes of models, particularly for folks who are working in industry and looking for simpler models. Autoencoders are a beautiful class of models. And then in the second part of the talk, I'll focus more on generative models.

I'll give you an introduction to restricted Boltzmann machines and deep Boltzmann machines. These are statistical models that can model complicated data. And I'll spend some time showing you some examples, some recent developments in our community, specifically in the case of variational autoencoders, which I view as a subclass of Helmholtz machines.

And I'll finish off by giving you an intuition about a slightly different class of models, which would be these generative adversarial networks. OK, so let's jump into the first part. But before I do that, let me just give you a little bit of motivation. I know Andrej's done a great job, and Richard alluded to that as well.

But the idea is, if I'm trying to classify a particular image, and I'm looking at a specific pixel representation, it might be difficult for me to classify what I'm seeing. On the other hand, if I can find the right representations for these images, and I get the right features, the right structure from the data, then it might be easier for me to see what's going on with my data.

So how do I find these representations? And this is one of the traditional approaches that we've seen for a long time: you have data, you create some features, and then you run your learning algorithm. And for the longest time, in object recognition or in audio classification, you would typically use some kind of hand-designed features, and then you'd start classifying what you have.

And like Andrej was saying, in the space of vision, there have been a lot of different hand-designed features, designs of what the right structure in the data should be. In the space of audio, the same thing is happening. How can you find these right representations for your data? And the idea behind representation learning, in particular in deep learning, is can we actually learn these representations automatically?

And more importantly, can we actually learn these representations in an unsupervised way, by just seeing lots and lots of unlabeled data? Can we achieve that? And there's been a lot of work done in that space, but we're not there yet. So I wanted to lower your expectations as I show you some of the results.

OK, sparse coding. This is one of the models that I think everybody should know about. It actually has its roots in '96, and it was originally developed to explain early visual processing in the brain. I think of it as an edge detector. And the objective here is the following.

Well, if I give you a set of data points, x1 up to xn, you'd want to learn a dictionary of bases, phi 1 up to phi k, so that every single data point can be written as a linear combination of the bases. That's fairly simple. There is one constraint in that you'd want your coefficients to be sparse.

You'd want them to be mostly zero. So every data point is represented as a sparse linear combination of bases. So if you apply sparse coding to natural images, and there's been a lot of work on this developed at Stanford in Andrew Ng's group, so if you apply sparse coding to little patches of images and learn these bases, these dictionaries, this is what they look like.

And they look really nice in terms of finding edge-like structure. So if given a new example, I can say, well, this new example can be written as a linear combination of a few of these bases. And taking that representation, it turns out that particular representation, a sparse representation, is quite useful as a feature representation of your data.

So it's quite useful to have it. And in general, how do we fit these models? Well, if I give you a whole bunch of image patches, but these don't necessarily have to be image patches. This could be little speech signals or any kind of data you're working with. You'd want to learn a dictionary of bases.

You have to solve this optimization problem. So the first term here, you can think of it as a reconstruction error, which is to say, well, I take a linear combination of my bases. I want them to match my data. And then there's a second term, which is, you can think of it as a sparse penalty term, which essentially says, try to penalize my coefficients so that most of them are zero.
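In symbols, that objective looks roughly like this (with the usual constraint that each basis vector has bounded norm, so the sparsity penalty can't be cheated by rescaling):

```latex
\min_{\{\phi_k\},\,\{a_n\}} \; \sum_{n=1}^{N} \Big\| x_n - \sum_{k=1}^{K} a_{n,k}\,\phi_k \Big\|_2^2 \;+\; \lambda \sum_{n=1}^{N} \| a_n \|_1
\qquad \text{s.t. } \|\phi_k\|_2 \le 1
```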

That way, every single data point can be written as just a sparse linear combination of the bases. And it turns out there is an easy optimization for doing that. If you fix your dictionary of bases, phi 1 up to phi k, and you solve for the activations, that becomes a standard lasso problem.

There are a lot of solvers for that particular problem. It's a lasso problem, which is fairly easy to optimize. And then if you fix the activations and you optimize for the dictionary of bases, then it's a well-known quadratic programming problem. Each subproblem is convex, so you can alternate between finding coefficients and finding bases, and so forth, and optimize this function.
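Here's a minimal sketch of that alternating scheme, assuming scikit-learn's Lasso solver for the coefficient step and a simple least-squares update with renormalized columns for the dictionary step; it's an illustration of the idea, not the exact solver used in the work discussed here.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_coding(X, n_bases=64, lam=0.1, n_iters=20, seed=0):
    """Alternate between a lasso step (codes) and a least-squares step (dictionary).

    X: (n_samples, n_features) array of patches, one patch per row.
    Returns a dictionary Phi of shape (n_features, n_bases) and the codes A.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    Phi = rng.standard_normal((d, n_bases))
    Phi /= np.linalg.norm(Phi, axis=0)                       # unit-norm bases

    for _ in range(n_iters):
        # (1) fix Phi, solve the lasso for each patch's sparse coefficients
        #     (sklearn scales the penalty by the number of rows, so `lam` is
        #      only proportional to the lambda in the written objective)
        lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=2000)
        A = np.stack([lasso.fit(Phi, x).coef_ for x in X])    # (n_samples, n_bases)

        # (2) fix the codes, solve least squares for the dictionary
        Phi = np.linalg.lstsq(A, X, rcond=None)[0].T           # (d, n_bases)
        Phi /= np.linalg.norm(Phi, axis=0) + 1e-8              # renormalize columns

    return Phi, A
```

The same lasso step, with the dictionary held fixed, is what gets reused at test time, as described next.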

And there's been a lot of recent work in the last 10 years of doing these things online and doing it more efficiently and so forth. At test time, given a new input or a new image patch, and given a set of learned bases, once you have your dictionary, you can then just solve a lasso problem to find the right coefficients.

So in this case, given a test sample or a test patch, you can find, well, it's written as a linear combination of a subset of the bases. And it turns out, again, that that particular representation is very useful, particularly if you're interested in classifying what you see in images.

And this is done in a completely unsupervised way. There are no class labels. There is no specific supervisory signal here. So back in 2006, there was work done, again, at Stanford that basically showed a very interesting result. So if I give you an input like this, and these are my learned bases, remember these little edges, what happens is that you just convolve these bases.

You can get these different feature maps, much like the feature maps that we've seen in convolutional neural networks. And then you take these feature maps, and you can just do a classification. This was done on one of the older data sets, the Caltech 101, which is a data set that predates ImageNet.

And if you look at some of the competing algorithms, if you do a simple logistic regression, versus if you do PCA and then logistic regression, versus finding these features using sparse coding, you can get substantial improvements. And you see sparse coding popping up in a lot of different areas, not just in deep learning; for folks looking at the medical imaging domain, or in neuroscience, these are very popular models.

Because they're easy, they're easy to fit, they're easy to deal with. So what's the interpretation of the sparse coding? Well, let's look at this equation again. And we can think of sparse coding as finding an overcomplete representation of your data. Now the encoding function, we can think of this encoding function, which is, well, I give you an input, find me the features or sparse coefficients or bases that make up my image.

We can think of encoding as an implicit and a very nonlinear function of x. But it's an implicit function. We don't really specify it. And the decoder, or the reconstruction, is just a simple linear function. And it's very explicit. You just take your coefficients and then multiply it by-- find the right basis and get back the image or the data.

And that sort of flows naturally into the ideas of autoencoders. The autoencoder is a general framework where if I give you an input data, let's say it's an input image, you encode it, you get some representation, some feature representation, and then you have a decoder given that representation. You're decoding it back into the image.

So you can think of the encoder as a feedforward, bottom-up pass, much like in a convolutional neural network, given the image, you're doing a forward pass. And then there is also a feedback, generative or top-down pass, where from the features you're reconstructing back the input image. And the details of what's going on inside the encoder and decoder matter a lot.

And obviously, you need some form of constraints. You need some form of constraints to avoid learning the identity function. Because if you don't put these constraints in, what you could do is just take your input, copy it to your features, and then reconstruct back. And that would be a trivial solution. So we need to introduce some additional constraints.

If you're dealing with binary features, if you want to extract binary features, for example, I'm going to show you later why you'd want to do that. You can pass your encoder through sigmoid nonlinearity, much like in the neural network. And then you have a linear decoder that reconstructs back the input.

And the way we optimize these little building blocks is we can just have an encoder, which takes your input, takes a linear combination, and passes it through some nonlinearity, the sigmoid nonlinearity. It could be rectified linear units. It could be a tanh nonlinearity. And then there is a decoder where you reconstruct back your original input.

So this is nothing more than a neural network with one hidden layer. And typically, that hidden layer would have a smaller dimensionality than the input. So we can think of it as a bottleneck layer. We can determine the network parameters, the parameters of the encoder and the parameters of the decoder, by writing down the reconstruction error.

And that's what the reconstruction error would look like. Given the input, encode, decode, and make sure whatever you're decoding is as close as possible to the original input. All right. Then we can use the backpropagation algorithm to train. There is an interesting sort of relationship between autoencoders and principal component analysis.
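Before getting to the PCA connection, here's a minimal numpy sketch of the one-hidden-layer setup just described, assuming a sigmoid encoder, a linear decoder, squared reconstruction error, and plain gradient descent; treat it as an illustration of the objective and the backward pass, not a tuned implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden=30, lr=0.1, n_epochs=200, seed=0):
    """One-hidden-layer autoencoder: sigmoid encoder, linear decoder,
    trained on squared reconstruction error. X: (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = 0.01 * rng.standard_normal((d, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = 0.01 * rng.standard_normal((n_hidden, d)); b2 = np.zeros(d)

    for _ in range(n_epochs):
        # encoder: bottom-up pass to the bottleneck features
        H = sigmoid(X @ W1 + b1)
        # decoder: reconstruct the input from the features
        X_hat = H @ W2 + b2

        # backpropagate the reconstruction error
        dXhat = (X_hat - X) / n
        dW2, db2 = H.T @ dXhat, dXhat.sum(0)
        dH = dXhat @ W2.T
        dZ = dH * H * (1.0 - H)              # sigmoid derivative
        dW1, db1 = X.T @ dZ, dZ.sum(0)

        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

    return W1, b1, W2, b2
```

If you tie the weights and make the hidden layer linear, this reduces to the PCA case discussed next.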

Many of you have probably heard about PCA. As a practitioner, if you're dealing with large data and you want to see what's going on, PCA is the first thing to use, much like logistic regression. And the idea here is that if the parameters of encoder and decoder are shared and you actually have the hidden layer, which is a linear layer, so you don't introduce any nonlinearities, then it turns out that the latent space that the model will discover is going to be the same space as the space discovered by PCA.

It effectively collapses to principal component analysis, right? You're doing PCA, which is sort of a nice connection, because it basically says that you can think of autoencoders as nonlinear extensions of PCA. So you can learn somewhat richer features if you are using autoencoders. So here's another model.

If you're dealing with binary input, like MNIST for example, again, your encoder and decoder could use sigmoid nonlinearities. So given an input, you extract some binary features. Given binary features, you reconstruct back the binary input. And that actually relates to a model called the restricted Boltzmann machine, something that I'm going to tell you about later in the talk.

There's also other classes of models where you can say, well, I can also introduce some sparsity, much like in sparse coding, to say that I need to constrain my latent features or my latent space to be sparse. And that actually allows you to learn quite reasonable features, nice features.

Here's one particular model called predictive sparse decomposition, where you effectively, if you look at the first part of the equation here, the decoder part, that pretty much looks like a sparse coding model. But in addition, you have an encoding part that essentially says train an encoder such that it actually approximates what my latent code should be.
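Written out, the predictive sparse decomposition objective is roughly the following, where z is the latent code, D the dictionary, and F(x; W) the trainable encoder (this is the standard formulation; the exact weighting of the terms varies across papers):

```latex
L(D, W) = \sum_{n} \; \| x_n - D z_n \|_2^2 \;+\; \lambda \| z_n \|_1 \;+\; \| z_n - F(x_n; W) \|_2^2
```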

So effectively, you can think of this model as: there is an encoder, there is a decoder, but then you put the sparsity constraint on your latent representation. And you can optimize for that model. And obviously, the other thing that we've been doing over the last seven to ten years is, well, you can actually stack these things together.

So you can learn low-level features, try to learn high-level features, and so forth. So you're just building these blocks. And perhaps at the top level, if you're trying to solve a classification problem, you can do that. And this is sometimes known as a greedy layer-wise learning. And this is sometimes useful whenever you have lots and lots of unlabeled data.

And when you have a little labeled data, a small sample of labeled data, typically these models help you find meaningful representations such that you don't need a lot of labeled data to solve a particular task that you're trying to solve. And this is, again, you can remove the decoding part, and then you end up with a standard or a convolutional architecture.

Again, your encoder and decoder could be convolutional. And it depends on what problem you're tackling. And typically, you can stack these things together and optimize for a particular task that you're trying to solve. Here's an example of-- just wanted to show you some examples, some early examples. Back in 2006, this was a way of trying to build these nonlinear autoencoders.

And you can sort of pre-train these models using restricted Boltzmann machines, or autoencoders generally. And then you can stitch them together into this deep autoencoder and backpropagate through the reconstruction loss. One thing I want to point out is that-- here's one particular example. The top row, I show you real faces.

The second row, you're seeing faces reconstructed from a 30-dimensional real-valued bottleneck. You can think of it as just a compression mechanism. Given the data, high-dimensional data, you're compressing it down to a 30-dimensional code. And then from that 30-dimensional code, you're reconstructing back the original data. So if you look at the first row, this is the data.

The second row shows you the reconstructed data. And the last row shows you the PCA solution. One thing I want to point out is that with the autoencoder solution you get a much sharper reconstruction, which means that it's capturing a little bit more structure in the data. It's also kind of interesting to see that sometimes these models tend to-- how should I say it?

They tend to regularize your data. For example, if you see this person with glasses, it removes the glasses. And that generally has to do with the fact that there is only one person with glasses. So the model just basically says, that's noise, get rid of it. Or it sort of gets rid of mustaches.

Like if you see this, there's no mustache. And again, that has to do with the fact that there's only so much capacity, so the model might think that's just noise. And if you're dealing with text-type data, this was done using a Reuters dataset. You have about 800,000 stories.

You take a bag-of-words representation, something very simple. You can compress it down to a two-dimensional space. And then you see what that space looks like. And I always like to joke that the model basically discovers that European Community economic policies are just next to disasters and accidents. This was back in-- I think the data was collected in '96.

I think today it's probably going to become closer. But again, this is just a way-- typically, an autoencoder is a way of doing compression or dimensionality reduction. But we'll see later that they don't have to be. There's another class of algorithms called semantic hashing, which is to say, well, what if you take your data and compress it down to a binary representation?

Wouldn't that be nice? Because if you have a binary representation, you can search in the binary space very efficiently. In fact, if you can compress your data down to a 20-dimensional binary code, 2 to the 20 is about a million codes, so you can just store everything in memory. And you can just do memory lookups without actually doing any search at all.
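A small sketch of that lookup idea, assuming you already have binary codes from some encoder: items are bucketed by the integer value of their code, and a query becomes a handful of memory lookups over nearby codes rather than a linear scan.

```python
import numpy as np
from collections import defaultdict

def build_hash_table(codes):
    """codes: (n_items, n_bits) array of 0/1 binary codes from the encoder.
    Buckets item indices by the integer value of their code."""
    n_bits = codes.shape[1]
    weights = 1 << np.arange(n_bits)          # bit weights 1, 2, 4, ...
    table = defaultdict(list)
    for idx, key in enumerate(codes @ weights):
        table[int(key)].append(idx)
    return table, weights

def lookup(table, weights, query_code, radius=1):
    """Return candidate items whose code is within `radius` bit flips of the
    query code -- pure memory lookups, no scan over the database."""
    key = int(query_code @ weights)
    candidates = list(table.get(key, []))
    if radius >= 1:
        for bit in weights:                   # flip one bit at a time
            candidates += table.get(key ^ int(bit), [])
    return candidates
```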

These sorts of representations have sometimes been used successfully in computer vision, where you take your images and then you learn these binary representations, 30-dimensional codes, 200-dimensional codes. And it turns out it's very efficient to search through large volumes of data using a binary representation. It takes a fraction of a millisecond to retrieve images from a set of millions and millions of images.

And again, this is also an active area of research right now, because people are trying to figure out, we have these large databases. How can you search through them efficiently? And learning a semantic hashing function that maps your data to the binary representation turns out to be quite useful.

OK, now let's step back a little bit and say, let's now look at generative models. Let's look at probabilistic models and how different they are. And I'm going to show you some examples of where they're applicable. Here's one example of a simple model trying to learn a distribution of these handwritten characters.

So we have Sanskrit. We have Arabic. We have Cyrillic. And now we can build a model that says, well, can you actually generate me what a Sanskrit should look like? The flickering you see at the top, these are neurons. You can think of them as neurons firing. And what you're seeing at the bottom is you're seeing what the model generates, what it believes Sanskrit should look like.

So in some sense, when you think about generative models, you think about models that can generate or they can sample the distribution or they can sample the data. This is a fairly simple model. We have about 25,000 characters coming from 50 different alphabets around the world. You have about 2 million parameters.

This is one of the older models. But this is what the model believes Sanskrit should look like. And I think that I've asked a couple of people to say that, does that really look like Sanskrit? Yes. OK, great. Which can mean two things. It can mean that the model is actually generalizing or the model is overfitting, meaning that it's just memorizing what the training data looks like.

And I'm just showing you examples from the training data. We'll come back to that point as we go through the talk. You can also do conditional simulation. Given half of the image, can you complete the remaining half? And more recently, there's been a lot of advances, especially in the last couple of years, for the conditional generations.

And it's pretty amazing what you can do in terms of in-painting, given half of the image, what the other half of the image should look like. This is sort of a simple example, but it does show you that it's trying to be consistent with what different strokes look like.

So why is it so difficult? In the space of so-called undirected graphical models, or Boltzmann machines, the difficulty really comes from the following fact. If I show you this image, which is a 28 by 28 image, it's a binary image. So some pixels are on, some pixels are off.

There are 2 to the power of 28 by 28 possible images, so in fact there are 2 to the 784 possible configurations. And that space is exponential. So how can you build models that figure out that, in the space of characters, there's only a tiny subspace within that space? If you start generating 200 by 200 images, that space is huge.

And the space of real images is a really, really tiny part of it. So how do you find that space? How do you generalize to new images? That's a very difficult question in general to answer. One class of models is so-called fully observed models. There's been a stream of work on learning generative models that are tractable.

And they have very nice properties, like you can compute the probabilities, you can do maximum likelihood estimation. Here's one example: if I try to model the image, I can write it down as modeling the first pixel, then modeling the second pixel given the first pixel, and so on, just writing it down in terms of the conditional probabilities.
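For a 28 by 28 binary image, that factorization is just the chain rule:

```latex
p(x) = \prod_{i=1}^{784} p(x_i \mid x_1, \ldots, x_{i-1})
```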

And each conditional probability can take a very complicated form. It could be a complicated neural network. So there have been a number of successful models. One of the early models, called the Neural Autoregressive Density Estimator, was actually developed by Hugo, with real-valued extensions of these models. And more recently, we've started seeing these flavors of models.

A couple of papers popped up, actually this year, from DeepMind, where they make these conditionals sophisticated RNNs, LSTMs, or convolutional models. And they can actually generate remarkable images. And so this is just a PixelCNN generating, I guess, elephants. Yeah. And actually, it looks pretty interesting.

The drawback of these models is that we have yet to see how good the representations they learn are, so that we can use these representations for other tasks, like classifying images or finding similar images and such. Now let me jump into a class of models called restricted Boltzmann machines.

So this is the class of models where we're actually trying to learn some latent structure, some latent representation. These models belong to the class of so-called graphical models. And graphical models are a very powerful framework for representing dependency structure between random variables. This is an example where we have-- you can think of this particular model.

You have some pixels. These are stochastic binary, so-called visible variables. You can think of pixels in your image. And you have stochastic binary hidden variables. You can think of them as feature detectors, so detecting certain patterns that you see in the data, much like sparse coding models. This has a bipartite structure.

You can write down the probability, the joint distribution over all of these variables. You have a pairwise term and a unary term. But it's not really important what they look like. The important thing here is that if I look at this conditional probability of the data given the features, I can actually write down explicitly what it looks like.
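For reference, the joint distribution and the explicit conditionals being referred to look roughly like this, with W the pairwise weights, b and c the unary biases, and sigma the logistic function:

```latex
P(v, h) = \frac{1}{Z} \exp\Big( \sum_{i,j} v_i W_{ij} h_j + \sum_i b_i v_i + \sum_j c_j h_j \Big),
\quad
P(v_i = 1 \mid h) = \sigma\Big( b_i + \sum_j W_{ij} h_j \Big),
\quad
P(h_j = 1 \mid v) = \sigma\Big( c_j + \sum_i W_{ij} v_i \Big)
```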

What does that mean? That basically means that if you tell me what features you see in the image, I can generate the data for you, or I can generate the corresponding input. In terms of learning features, so what do these models learn? They sort of learn something similar that we've seen in sparse coding.

And so these classes of models are very similar to each other. So given a new image, I can say, well, this new image is made up by some combination of these learned weights or these learned bases. And the numbers here are given by the probabilities that each particular edge is present in the data.

In terms of how we learn these models, one thing I want to make-- another point I should make here is that given an input, I can actually quickly infer what features I'm seeing in the image. So that operation is very easy to do, unlike in sparse coding models. It's a little bit more closer to an autoencoder.

Given the data, I can actually tell you what features are present in my input, which is very important for things like information retrieval or classifying images, because you need to do it fast. How do we learn these models? Let me just give you an intuition, maybe a little bit of math behind how we learn these models.

If I give you a set of training examples, and I want to learn model parameters, I can maximize the log-likelihood objective. And you've probably seen that in these tutorials, the maximum likelihood objective is essentially nothing more than saying, I want to make sure that the probability of observing these images is as high as possible.

So you're finding the parameters such that the probability of observing what I'm seeing is high. And that's why you're maximizing the likelihood objective, or the log of the likelihood objective, which turns the product into a sum. You take the derivative. There's a little bit of algebra. I promise you it's not very difficult.

It's like second-year college algebra. You differentiate. And you basically have this learning rule, which is the difference between two terms. The first term, you can think of it as looking at sufficient statistics, so-called sufficient statistics driven by the data. And the second term is the sufficient statistics driven by the model.
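In symbols, for the weights of the binary RBM above, that learning rule is:

```latex
\frac{\partial \log P(v)}{\partial W_{ij}} \;=\; \mathbb{E}_{\text{data}}\big[ v_i h_j \big] \;-\; \mathbb{E}_{\text{model}}\big[ v_i h_j \big]
```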

Maybe I can parse it out. What does that mean? Intuitively, what that means is that you look at the correlations you see in the data. And then you look at the correlations that the model is telling you it should be. And you're trying to match the two. That's what the learning is trying to do.

It's trying to match the correlations that you see in the data. So the model is actually respecting the statistics that you see in the data. But it turns out that the second term is very difficult to compute. And it's precisely because the space of all possible images is so high dimensional that you need to figure out or use some kind of approximate learning algorithms to do that.

So you have these difference between these two terms. The first term is easy to compute, it turns out, because of a particular structure of the model. And we can actually do it explicitly. The second term is the difficult one to compute. So it sort of requires summing over all possible configurations, all possible images that you could possibly see.

So this term is intractable. And what a lot of different algorithms are doing-- and we'll see that over and over again-- is using so-called Monte Carlo sampling, or Markov chain Monte Carlo sampling, or Monte Carlo estimation. So let me give you an intuition of what this term is doing.

And that's a general trick for approximating exponential sums. There's a whole subfield in statistics that's basically dedicated to how do we approximate exponential sums. In fact, if you could do that, if you could solve that problem, you could solve a lot of problems in machine learning. And the idea is very simple, actually.

The idea is to say, well, you're going to be replacing the average by sampling. And there's something that's called Gibbs sampling, a Markov chain Monte Carlo method, which essentially does something very simple. It basically says, well, start with the data, sample the states of the latent variables, then sample the data, then sample the states of the latent variables again, and so forth, all from these conditional distributions, something that you can compute explicitly.

And that's a general trick. Much like in sparse coding, where we alternate between optimizing for the bases and optimizing for the coefficients, here you're inferring the coefficients, then you're inferring what the data should look like, and so forth. And then you can just run a Markov chain and approximate this exponential sum.

So you start with the data, you sample the states of the hidden variables, you resample the data, and so forth. And the only problem with a lot of these methods is that you need to run them to infinity to guarantee that you're getting the right thing. And so obviously, you will never run them infinitely long.

You don't have time to do that. So there's a very clever algorithm, the contrastive divergence algorithm, that was developed by Hinton back in 2002. And it was very clever. It basically said, well, instead of running this thing to infinity, run it for one step. And so you're just running it for one step.

You start with a training vector, you update the hidden units, then you update all the visible units again. So that's your reconstruction. Much like in an autoencoder, you reconstruct your data. You update the hidden units again, and then you just update the model parameters, which is just looking empirically at the statistics between the data and the model.
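Here's a minimal numpy sketch of that one-step recipe (CD-1) for a binary RBM. It follows the common practical variant that uses probabilities rather than samples for the reconstruction, so treat it as an illustration rather than the exact published recipe.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(V, W, b, c, lr=0.05, rng=np.random.default_rng(0)):
    """One contrastive-divergence (CD-1) step for a binary RBM.
    V: (batch, n_visible) binary data; W: (n_visible, n_hidden); b, c: biases."""
    # positive phase: hidden probabilities driven by the data
    ph_data = sigmoid(V @ W + c)
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)

    # one Gibbs step: reconstruct the visibles, then the hiddens again
    pv_recon = sigmoid(h_sample @ W.T + b)
    ph_recon = sigmoid(pv_recon @ W + c)

    # match data-driven and model-driven (one-step) statistics
    n = V.shape[0]
    W += lr * (V.T @ ph_data - pv_recon.T @ ph_recon) / n
    b += lr * (V - pv_recon).mean(axis=0)
    c += lr * (ph_data - ph_recon).mean(axis=0)
    return W, b, c
```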

Very similar to what the autoencoder is doing, but with slight differences. And the implementation basically takes about 10 lines of MATLAB code. I suspect it's going to be two lines in TensorFlow, although I don't think the TensorFlow folks have implemented Boltzmann machines yet. That would be my request. But you can extend these models to deal with real-valued data.

So whenever you're dealing with images, for example. And it's just a little change to the definition of the model. And your conditional probabilities here are just going to be a bunch of Gaussians. So that basically means that given the features, I can sample from the space of images and give you real-valued images.

The structure of the model remains the same. If you train this model on these images, you tend to find edges, something similar, again, to what you'd see in sparse coding, in ICA, independent component analysis model, autoencoders and such. And again, you can say, well, every single image is made up by some linear combination of these basis functions.

You can also extend these models to deal with count data, if you're dealing with documents. In this case, again, it's a slight change to the model. K here denotes your vocabulary size. And D here denotes the number of words that you're seeing in your document. It's a bag-of-words representation.

And the conditional here is given by a so-called softmax distribution, much like what you've seen in the previous classes, over the possible words. And the parameters here, the Ws, you can think of them as something similar to what a word2vec embedding would do. And so if you apply it to, again, some datasets, you tend to find reasonable features.

So you tend to find features about Russia, about US, about computers, and so forth. So much like you found these representations, little edges, every image is made up by some combination of these edges. In case of documents or web pages, you're saying it's the same thing. It's just made up some linear combination of these learned topics.

Every single document is made up of some combination of these topics. You can also look at one-step reconstruction. So you can basically say, well, how can I find similarity between the words? So if I show you "chocolate cake" and infer the states of the hidden units, and then I reconstruct back the distribution of possible words, it tells me cake, chocolate, sweet, dessert, cupcake, food, sugar, and so forth.

I particularly like the one with the flower tag, and then there is a Japanese character. And the model sort of generates flower, Japan, sakura, blossom, Tokyo. So it picks up again on low-level correlations that you see in your data. You can also apply these kinds of models to collaborative filtering, where every single observed variable can represent a user rating for a particular movie.

So every single user would rate a certain subset of movies. And so you can represent it as the state of visible vector. And your hidden states can represent user preferences, what they are. And on the Netflix data set, if you look at the latent space that the model is learning, some of these hidden variables are capturing specific movie genre.

So for example, there is actually one hidden unit dedicated to Michael Moore's movies. So it's sort of very strong. I think it's sort of either people like it or hate it. So there are a few hidden units specifically dedicated to that. But it also finds interesting things like action movies and so forth.

So it finds that particular structure in the data. So you can model different kinds of modalities: real-valued data, count data, multinomials. And it's very easy to infer the states of the hidden variables. That's given by just a product of logistic functions. And that's very important in a lot of different applications.

Given the input, I can quickly tell you what topics I see in the data. One thing that I want to point out, and that's an important point, is a lot of these models can be viewed as product models. Sometimes people call them product of experts. And this is because of the following intuition.

If I write down the joint distribution of my hidden observed variables, I can write it down in this sort of log linear form. But if I sum out or integrate out the states of the hidden variables, I have a product of a whole bunch of functions. So what does it mean?
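Concretely, for the binary RBM from before, summing out the hidden units leaves one factor per hidden unit, one "expert" each:

```latex
P(v) = \frac{1}{Z} \sum_{h} \exp\big( v^\top W h + b^\top v + c^\top h \big)
     = \frac{1}{Z} \, e^{\,b^\top v} \prod_{j} \Big( 1 + e^{\,c_j + \sum_i W_{ij} v_i} \Big)
```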

What's the intuition here? So let me show you an example. Suppose the model finds these specific topics. And suppose I'm going to be telling you that the document contains topic government, corruption, and mafia. Then the word Silvio Berlusconi will have very high probability. I guess, does anybody know? Everybody knows who Silvio is?

Silvio Berlusconi? He was the head of the government. He's connected to the mafia. He's very corrupt, or was corrupt. And I guess I should add bunga bunga parties here. Then it will become completely clear what I'm talking about. But then one point I want to make here is that you can think of these models as a product.

Each hidden variable defines a distribution over possible words, over possible topics. And once you take the intersection of these distributions, you can be very precise about what it is that you're modeling. That's unlike general topic models or latent Dirichlet allocation models, models where you're actually using a mixture-like approach.

And then typically, these models do perform far better than traditional mixture-based models. And this comes to the point of local versus distributed representations. In a lot of different algorithms, even unsupervised learning algorithms such as clustering, you're typically partitioning the space and finding local prototypes. You basically have parameters for each region, and the number of regions typically grows linearly with the number of parameters.

But in models like factor models, PCA, restricted Boltzmann machines, deep models, you typically have distributed representations. And what's the idea here? The idea here is that each particular neuron can differentiate between two parts of the input plane. Given a second one, I can partition it again.

Given a third hidden variable, you can partition it again. So you can see that every single neuron will be affecting lots of different regions. And that's the idea behind distributed representations, because every single parameter is affecting many, many regions, not just a local region. And so the number of regions grows roughly exponentially with the number of parameters.

So those are the differences between these two classes of models. It's important to know about them. Now let me jump in and quickly tell you a little bit about the inspiration behind what we can build with these models. As we've seen with convolutional networks, in the first layer we typically learn some low-level features, like edges.

If you're working with text, typically we'll learn some low-level structure. And the hope is that the high-level features will start picking up some high-level structure as you keep building. And these kinds of models can be built in a completely unsupervised way, because what you're trying to do is model the data.

You're trying to model the distribution of the data. You can write down the probability distribution for these models, known as a Boltzmann machine model. You have dependencies between hidden variables. So now you're introducing some extra layers and dependencies between those layers. And if we look at the equation, the first part of the equation is basically the same as what we had with the restricted Boltzmann machine.

And then the second and third part of the equation is essentially modeling dependencies between the first and the second hidden layer, and the second hidden layer and the third hidden layer. There is also a very natural notion of bottom-up and top-down. So if I want to see what's the probability of a particular unit taking value 1, it's really dependent on what's coming from below and what's coming from above.
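Written out for a three-hidden-layer model, with biases omitted to keep it short, the joint distribution and the bottom-up/top-down conditional look roughly like this:

```latex
P(v, h^1, h^2, h^3) = \frac{1}{Z} \exp\big( v^\top W^1 h^1 + {h^1}^\top W^2 h^2 + {h^2}^\top W^3 h^3 \big),
\qquad
P(h^1_j = 1 \mid v, h^2) = \sigma\Big( \sum_i W^1_{ij} v_i + \sum_k W^2_{jk} h^2_k \Big)
```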

So there has to be some consensus in the model to say, ah, yes, what I'm seeing in the image and what my model believes the overall structure should be have to be in agreement. And in this case, of course, the hidden variables become dependent even when you condition on the data.

With these kinds of models, we'll see this a lot: you're introducing more flexibility, you're introducing more structure, but then learning becomes much more difficult. You have to deal with how you do inference in these models. Now let me give you an intuition of how we can learn these models. What's the maximum likelihood estimator doing here?

Well, if I differentiate this model with respect to parameters, I basically run into the same learning rule. And it's the same learning rule you see whenever you're working with undirected graphical models, factor graphs, conditional random fields. You might have heard about those ones. It really is just trying to look at the statistics driven by the data, correlations that you see in the data, and the correlations that the model is telling you it's seeing in the data, and you're just trying to match the two.

That's exactly what's happening in that particular equation. But the first term is no longer factorial, so you have to do some approximation with these models. But let me give you an intuition what each term is doing. Suppose I have some data, and I get to observe these characters. Well, what I can do is I really want to tell the model, this is real.

These are real characters. So I want to put some probability mass around them and say, these are real. And then there is some sort of a data point that looks like this, just a bunch of pixels on and off. And I really want to tell my model that put almost zero probability on this.

This is not real. And so the first term is exactly trying to do that. The first term is just trying to say, put the probability mass where you see the data. And the second term is effectively trying to say, well, look at this entire exponential space and just say, no, everything else is not real.

Just the real thing is what I'm seeing in my data. And so you can use advanced techniques for doing that. There's a class of algorithms called variational inference. There's something that's called stochastic approximation, which is Monte Carlo-based inference. And I'm not going to go into these techniques. But in general, you can train these models.

So one question is, how good are they? Because there are a lot of approximations that go into these models. So what I'm going to do is, if you haven't seen it, I'm going to show you two panels. On one panel, you will see the real data. On the other panel, you'll see data simulated by the model, or the fake data.

And you have to tell me which one is which. So again, these are handwritten characters coming from alphabets around the world. How many of you think this is simulated and the other part was real? Honestly. OK, some. What about the other way around? I get half and half, which is great.

If you look at these images a little bit more carefully, you will see the difference. So you will see that this is simulated and this is real. Because if you look at the real data, it's much crisper. There's more diversity. When you're simulating the data, there's a lot of structure in the simulated characters, but sometimes they look a little bit fuzzy and there isn't as much diversity.

And I've learned that trick from my neuroscience friends. If I show you quickly enough, you won't see the difference. And if you're using these models for classifying, you can do a proper analysis, which is to say, given a new character, you infer the states of the latent variables, the hidden variables.

If I classify based on that, how good are they? And they are much better than some of the existing techniques. This is another example: trying to generate 3D objects. This is sort of a toy dataset. And later on, I'll show you some bigger advances that have been happening in the last few years.

This was done a few years ago. If you look at the space of generated samples, they sort of, obviously you can see the difference. Look at this particular image. This image looks like a car with wings, don't you think? So sometimes it can sort of simulate things that are not necessarily realistic.

And for some reason, it just doesn't generate donkeys and elephants too often, but it generates people with guns more often. Like if you look at here and here and here. And that, again, has to do with the fact that you're exploring this exponential space of possible images, and sometimes it's very hard to assign the right probabilities to different parts of the space.

And then obviously you can do things like pattern completion. So given half of the image, can you complete the remaining half? So the second one shows what the completions look like, and the last one is what the truth is. So you can do these things. So where else can we use these models?

These are sort of toyish examples, but where else? Let me show you one example where these models can potentially succeed, which is trying to model the space of the multimodal space, which is the space of images and text. Or generally, if you look at the data, it's not just single sources.

It's a collection of different modalities. So how can we take all of these modalities into account? And this is really just the idea of given images and text. And you actually find a concept that relates these two different sources of data. And there are a few challenges, and that's why models like generative models, sometimes probabilistic models, could be useful.

In general, one of the biggest challenges we've seen is that typically when you're working with images and text, these are very different modalities. If you think about images and pixel representation, they're very dense. If you're looking at text, it's typically very sparse. It's very difficult to learn these cross-modal features from low-level representation.

Perhaps a bigger challenge is that a lot of times we see data that's very noisy. Sometimes it's just non-existent. Given an image, there is no text. Or if you look at the first image, a lot of the tags are about what kind of camera was used to take that particular image, which doesn't really tell us anything about the image itself.

And these would be the text generated by a version of a Boltzmann machine model. It sort of samples what the text should look like. And the idea, again, is very simple. If you just build a simple representation, given images and given text, and you just try to find what the common representation is, it's very difficult to learn these cross-modal features.

But if you actually build a hierarchical model, so you start with a representation, you can build a Gaussian model, a replicated softmax model, and you build up that representation, then it turns out it gives you a much richer representation. There's also a notion of bottom-up and top-down, which means that text or tags can effectively affect the low-level representation of images, and the other way around.

So information flows between images and text and gets into some stable state. And this is what the text generated from images looks like, some of the examples. A lot of them look reasonable, but more recently, with the advances of ConvNets, this is probably not that surprising. Here are some examples where the model is not quite doing the right thing.

I particularly like the second one. For some reason, it sort of correlates with Barack Obama and such. And for the features, when we were using this model, we didn't have ImageNet features at that time. Right now, I don't think we'd be making these mistakes. But generally speaking, what we found is that a lot of the data contains images of animals, which brings us to the next problem: if the model doesn't see images of animals like this, it gets confused, because it sees a lot of Obama signs, and these black, white, and blue signs appear a lot.

You can also go from text to images: given text or tags, you can retrieve relevant images. This is the dataset itself, about a million images. It's a nice dataset, and you have very noisy tags. The question is, can you actually learn some representation from those images? One thing that I want to highlight here: there are 25,000 labeled images.

Somebody went and labeled what's going on in those images, what classes we see in those images, and you get some numbers, which is mean average precision. What's important here is that we found that if we actually use unlabeled data, and we pre-train these channels separately, using a million unlabeled data points, then we can actually get some performance improvements.

At least that was a little bit of a happy sign for us to say that unlabeled data can help in the situations where you don't have a lot of labeled examples. Here it was helping us, it was helping us a lot. And then once you get into these representations, dealing with text and images, this is one particular thing you can do.

I think Richard pointed out what happens in the space of linguistic regularities. You can do the same thing with images, which is kind of fun to do. They sometimes work, they don't work all the time. But here's one example. I take that particular image at the top, and I say get the representation of this image, subtract the representation of day, add the night, and then find closest images, you get these images.

And then you can do some interesting things, like take these kittens and say minus ball plus box, you get kittens in the box. If you take this particular image and say minus box plus ball, you get kittens in the ball. Except for this thing, that's a duck. So you can get these interesting representations.
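The mechanics behind those examples are just vector arithmetic in the learned joint space followed by nearest-neighbor retrieval; here's a small sketch, where query_vec, minus_vec, plus_vec, and database stand in for hypothetical embeddings you'd get out of the model.

```python
import numpy as np

def analogy_retrieval(query_vec, minus_vec, plus_vec, database, top_k=5):
    """Embedding arithmetic: take a joint image-text representation, subtract
    one concept vector, add another, and return the indices of the nearest
    database items by cosine similarity."""
    target = query_vec - minus_vec + plus_vec
    target = target / np.linalg.norm(target)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    return np.argsort(-(db @ target))[:top_k]
```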

Of course, these are all fun things to look at, but they don't really mean much, because we're not specifically optimizing for those things. Now let me spend some time also talking about another class of models. These are known as Helmholtz machines and variational autoencoders. These are the models that have been popping up in our community in the last two years.

So what is a Helmholtz machine? A Helmholtz machine was developed back in '95, and it was developed by Hinton and Peter Dayan and Brendan Frey and Radford Neal. And it has this particular architecture. You have a generative process, so given some latent state, you just-- it's a neural network, it's a stochastic neural network that generates the input data.

And then you have a so-called approximate inference step, which is to say, given the data, infer approximately what the latent states should look like. And again, it was developed in '95. There's something that's called the wake-sleep algorithm, and it never worked. Basically, people just said, it just doesn't work. And then we started looking at restricted Boltzmann machines and Boltzmann machines, because they were working a little bit better.

And then two years ago, people figured out how to make them work. And so now, all these years later, I'm going to show you the trick. Now, these models are actually working pretty well. The difference between Helmholtz machines and deep Boltzmann machines is very subtle. They almost look identical. The big difference between the two is that in Helmholtz machines, you have a generative process that generates the data, and you have a separate recognition model that tries to recognize what you're seeing in the data.

So you can think of this Q function as a convolutional neural network, given the data, tries to figure out what the features should look like. And then there's a generative model, given the features, it generates the data. Boltzmann machine is sort of similar class of models, but it has undirected connections.

So you can think of it as the generative and recognition connections being the same. So it's sort of a system that tries to reach some equilibrium state when you're running it. So the semantics is a little bit different between these two models. So what is a variational autoencoder? A variational autoencoder is a Helmholtz machine.

It defines a generative process in terms of sampling through cascades of stochastic layers. And if you look at it, there's just a bunch of conditional probability distributions that you're defining, so you can generate the data. So theta here will denote the parameters of the variational autoencoders. You have a number of stochastic layers.

And sampling from these conditional probability distributions, we're assuming that we can do it. It's tractable. It has to be tractable. But the innovation here is that every single conditional probability can actually be a very complicated function. It can model nonlinear relationships. It can be a multilayer nonlinear neural network, a deterministic neural network.

So it becomes fairly powerful. Here's an example: you have a stochastic layer, then a deterministic layer, then another stochastic layer, and then you generate the data. So you can introduce these nonlinearities into these models. And this conditional probability would denote a one-layer neural network. Now I'll show you some examples, but maybe I can just give you a little intuition behind what these equations do.

In a lot of these kinds of models, learning is very hard to do. And there's a class of models called variational learning. And what the variational learning is trying to do is basically trying to do the following. Well, I want to maximize the probability of the data that I observe, but I cannot do it directly.

So instead, what I'm going to do is maximize the so-called variational lower bound, which is this term here. And it's effectively saying, well, if I take the log of an expectation, I can push the log inside. And it turns out that, just logistically, working with this lower bound is much easier than working with the original objective.

If you go a little bit through the math, it turns out that you can actually optimize this variational bound, but you can't really optimize this particular likelihood objective. It's a little bit surprising for those of you who haven't seen variational learning how it's done. But this one little trick, this one little so-called Jensen's inequality actually allows you to solve a lot of problems.

And the other way to write the lower bound is to say, well, there is a log likelihood function and something that's called KL divergence, which is the distance between your approximating distribution Q, which is your recognition model, and the truth. The truth in these models would be the true posterior according to your model.
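Both forms of the bound in one line, with theta the generative parameters and phi the recognition parameters:

```latex
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(h \mid x)}\big[ \log p_\theta(x, h) - \log q_\phi(h \mid x) \big]
\;=\; \log p_\theta(x) - \mathrm{KL}\big( q_\phi(h \mid x) \,\|\, p_\theta(h \mid x) \big)
```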

And it's hard to optimize these kinds of models in general. You're trying to optimize your generative model, you're trying to optimize your recognition model. And back in '95, Hinton and his students basically, they developed this wake-sleep algorithm that was a bunch of different things put together, but it was never quite the right algorithm because it wasn't really optimizing anything.

It was just a bunch of things alternating. But in 2014, there was a beautiful trick introduced by Kingma and Welling, and there were a few other groups that came up with the same trick, called the reparameterization trick. So let me show you what the reparameterization trick does intuitively. So let's say your recognition distribution is a Gaussian.

So a Gaussian, I can write it as a mean and a variance. So this is the mean, this is the variance. Notice that my mean depends on the layer below. It could be a very nonlinear function. The variance also depends on the layer below, so it could also be a nonlinear function.

But what I can do is I can actually do the following. I can express this particular Gaussian in terms of auxiliary variables. I can say, well, if I sample this epsilon from normal 0, 1, a Gaussian distribution, then I can write this particular h, my state, in a deterministic way.

It's just the mean plus essentially the standard deviation, the square root of the variance, times this epsilon. So this is just a simple parameterization of the Gaussian. I'm just pulling out the mean and the variance. There are no surprises here. So I can write my recognition model as this Gaussian, or I can write it in terms of noise plus the deterministic part.
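
In symbols, writing the layer below as x and the recognition parameters as phi, the trick being described is:

$$
q_\phi(h \mid x) = \mathcal{N}\!\big(h;\ \mu_\phi(x),\ \mathrm{diag}\,\sigma^2_\phi(x)\big)
\quad\Longleftrightarrow\quad
h = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).
$$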

So the recognition distribution can be represented as a deterministic mapping. And that's the beauty, because it turns out that you can collapse these complicated models effectively into autoencoders. And we know how to deal with autoencoders. We can back propagate through the entire model. So we have a deterministic encoder.

And the distribution of these auxiliary variables doesn't depend on the parameters. So it's almost like taking a stochastic system and separating the stochastic part from the deterministic part. In the deterministic part, you can do back propagation, so you can do learning. And in the stochastic part, you can do sampling. So just think of it as a separation between the two pieces.

So now I can take the gradient of the variational bound, the variational objective, with respect to the parameters. This is something that we couldn't do back in '95, and we couldn't do it in the last 10 years. People have tried using the REINFORCE algorithm or approximations to it. It never worked.

But here what we can do is we can do the following. We can say, well, I can write this expression, because it's a Gaussian, as sampling a bunch of these auxiliary variables. And then this log, I can just inject the noise in here. The whole thing here becomes deterministic.

And that's where the beauty comes in. You take this gradient here, and you push it inside the expectation. Before, you would take the gradient of an expectation: you compute a bunch of averages and then take the gradient of that. What you're doing now, with the reparameterization trick, is taking the gradients first and then taking the average.
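
In equation form, because the distribution of the noise does not depend on the parameters, the gradient can be pushed inside the expectation (writing the integrand generically as f):

$$
\nabla_{\theta,\phi}\, \mathbb{E}_{q_\phi(h \mid x)}\big[f_{\theta,\phi}(h)\big]
\;=\; \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\Big[\nabla_{\theta,\phi}\, f_{\theta,\phi}\big(\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon\big)\Big].
$$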

It turns out that hugely reduces the variance in your training. It actually allows you to learn these models quite efficiently. So the mapping h here is completely deterministic, and the gradients here can be computed by back propagation. It's a deterministic system. And you can think of the thing inside as just an autoencoder that you are optimizing.
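
As a concrete illustration of "the thing inside is just an autoencoder you back-propagate through," here is a minimal single-stochastic-layer VAE training step. This is a hedged sketch in PyTorch, not the model from the talk; the Bernoulli likelihood and layer sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.mu, self.logvar = nn.Linear(256, h_dim), nn.Linear(256, h_dim)
        self.dec = nn.Sequential(nn.Linear(h_dim, 256), nn.ReLU(),
                                 nn.Linear(256, x_dim))

    def forward(self, x):
        e = self.enc(x)
        mu, logvar = self.mu(e), self.logvar(e)
        eps = torch.randn_like(mu)                 # stochastic part: sample the noise
        h = mu + torch.exp(0.5 * logvar) * eps     # deterministic part: reparameterized sample
        recon = self.dec(h)
        # negative ELBO = reconstruction term + KL(q(h|x) || N(0, I))
        rec = F.binary_cross_entropy_with_logits(recon, x, reduction='sum')
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return rec + kl

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)            # stand-in batch; real data would go here
loss = model(x)
loss.backward()                    # gradients flow through the deterministic mapping
opt.step()
```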

And obviously, there are other extensions of these models that we've looked at, and a bunch of other teams have looked at, where you can say, well, maybe we can improve these models by drawing multiple samples. These are the so-called k-sample importance-weighted bounds. And so you can make them a little bit better, a little bit more precise.
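
The k-sample importance-weighted bound being referred to has, in the usual notation, the form

$$
\log p_\theta(x) \;\ge\; \mathbb{E}_{h_1,\dots,h_k \sim q_\phi(h \mid x)}\!\left[\log \frac{1}{k}\sum_{i=1}^{k} \frac{p_\theta(x, h_i)}{q_\phi(h_i \mid x)}\right],
$$

which reduces to the standard variational bound at k = 1 and becomes tighter as k grows.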

You can model a little bit more complicated distributions over the posteriors. But now, let me step back a little bit and say, why am I telling you about this? What's the point? There's a bunch of equations. You're injecting noise. Why do we need noise? Why do we need stochastic systems in general?

Here's a motivating example. We wanted to build a model that, given captions, generates images. And my student was very ambitious and basically said, I want to be able to just give you any sentence and generate an image, kind of like an artificial painter.

I want to paint what's in my caption in the most general way. So this is one example of a Helmholtz machine where you have a generative model, which is a stochastic recurrent network. It's just a chain sequence of variational autoencoders. And there's a recognition model, which you can think of as a deterministic system, like a convolutional system that tries to approximate what the latent states are.

But why do I need stochasticity here? Why do I need variational autoencoders here? And the reason is very simple. Suppose I give you the following task. I say a stop sign is flying in blue skies. Now if you were using a deterministic system, like an autoencoder, you would generate one image because it's a deterministic system.

Given input, I give you output. Once you have stochastic system, you inject this noise, this latent noise, that allows you to actually generate the whole space of possible images. So for example, it tends to generate this stop sign and this stop sign. They look very different. And there's a car here.

So maybe it's not really flying; the model just can't draw the pole here. This one looks like there are clouds. Here is "a yellow school bus is flying in blue skies." So here we wanted to test the system to see whether it understands something about what's in the sentence. Here is "a herd of elephants is flying in blue skies."

Now we cannot generate elephants, although there are now techniques that are getting better. But sometimes it generates two of them. And it's a commercial plane flying in blue skies. But this is where we need stochasticity because we want to be able to generate the whole distribution of possible outcomes, not necessarily just one particular point.

We can basically do things like a yellow school bus parked in the parking lot versus a red school bus parked in the parking lot versus a green school bus parked in the parking lot. It's sort of a blue school bus. We can't quite generate blue school buses, but we've seen blue cars and we've seen blue buses.

So it can make an association to draw these different things. They look a little bit fuzzy. But in terms of comparing to different models, if I give you "a group of people on the beach with surfboards," this is what we can generate. There is another model called LAPGAN, which is a model based on adversarial networks, something I'll talk about in the last part of this talk.

And there are these models, convolutional and deconvolutional variational autoencoders, which are again just convolutional and deconvolutional autoencoders with some noise. And you can certainly see that, generally, we found it very hard to generate scenes from arbitrary text input. Here's my favorite one: "A toilet seat sits open in the bathroom."

I don't know if you can see toilet seats here, maybe, but you can see "a toilet seat sits open in the grass field." That was a little bit better; at least the colors were quite right. And when we put this paper on arXiv, one of the students basically came to me and said, "This is really bad, because you can always ask Google." If you type that particular query into Google search, it gives you that.

Which was a little bit disappointing. But now if you actually Google or if you actually put this query into Google, this image comes before this image. And generally because what's happening is that people are just clicking on that image all the time to figure out what's going on in that image.

So we got bumped up before that other image. So now I can say that according to Google, this is a much better representation for that sentence than this one. Here's another sort of interesting model, which is a model where you're trying to build a recurrent neural network. Again, it's a generative model, but it's a generative model of text.

This model was trained on about 7,000 romance novels. And you take a captioning model and hook it up to this generation system. So you're basically telling the model: here's an image, generate, in the style of romance novels, what you'd see here. And it generates something interesting.

We're barely able to catch the breeze on the beach and so forth. She's beautiful, but the truth is I don't know what to do. The sun was just starting to fade away, leaving people scattered around the Atlantic Ocean. And there are a bunch of different things that you can do.

Obviously, we're not there yet in terms of generating romantic stories. But here's one example where it's a generative model. It seems like syntactically we can actually generate reasonable things. Semantically, we're not there yet. And actually, that particular work was inspired a little bit by Baidu's system, where you'd give it an image.

I think it would generate poems. But the poems were predefined. It was mostly selecting the right poem for the image. Here we actually were trying to generate something. So there's still a lot of work to do in that space. Because syntactically we can get there. Semantically, we are nowhere near getting the right structure.

Here's one last example that I want to show you. This was done in the context of one-shot learning: can you build a generative model of characters? That's a very well-defined domain. It's a simple domain, but it's also very hard. Here's one example. We've shown this example to people and to the algorithm.

And we can say, well, can you draw me this example? And on one panel, humans would draw how they believe this example should look. And then on the other panel, we have the machine drawing it. So this is really just a generative model based on a single example: we show you an example, and you try to generate what it is.

And so quick question for you. How many of you think this was machine-generated and this was human-generated? What about the other way around? More, more. So there's a vote. What about this one? How many of you think this is machine-generated and this is human-generated? A few. What about the other way around?

Ah, great, great. Well, the truth is, I don't really know which one was generated by the machine and which by a human. I should actually ask Brendan Lake, who designed the experiments for this particular model. But I can tell you that there have been a lot of studies. He's done a lot of studies, and it's almost 50/50.

So in this kind of small, carved-out domain, we can actually compete with people at generating these characters. Now let me step back a little bit and tell you about a different class of models. These are models known as generative adversarial networks, and they've been gaining a lot of traction in our community because they seem to produce remarkable results.

So here's the idea. We're not going to explicitly define the density, but we need to be able to sample from the model. And the interesting thing is that there's no variational learning, there's no maximum likelihood estimation, there's no Markov chain Monte Carlo, no sampling. How do you do that?

How do you learn these models? And it turns out that you can learn these models by playing a game. And that's a very clever strategy. And the idea is the following. You're going to be setting up a game between two players. You're going to have a discriminator, D, think of it as a convolutional neural network, and then you're going to have a generator, G.

Maybe you can think of it as something like a variational autoencoder, or a Helmholtz machine, or something that gives you samples of data. The discriminator, D, is going to be discriminating between a sample from the data distribution and a sample from the generator. So the goal of the discriminator is to basically say: is this a fake sample or is this a real sample?

A fake sample is a sample generated by the model. A real sample is what you see in your data. Can you tell the difference between the two? And the generator is going to be trying to fool the discriminator by trying to generate samples that are hard for the discriminator to discriminate.

So my goal as a generator would be to generate really nice looking digits so that the discriminator wouldn't be able to tell the difference between the simulated and the real. That's the key idea. And so here is intuitively what that looks like. Let's say you have some data, so images of faces.

I give you an image of a face, and now I have a discriminator that basically says, well, if I get a real face, I push it through some function, some differentiable function. Think of it as a convolutional neural network or another differentiable function. And here I'm outputting one. So I want to output one if it's a real sample.

Then you have a generator. For the generator, you have some noise, input noise. Think of it as a Gaussian distribution; think about Helmholtz machines. Given some noise, I go through a differentiable function, which is your generator, and I generate a sample. This is how my sample might look.

And then on top of it, I take this sample, I put it into my discriminator, and I say, for my discriminator, I want to output zero. Because my discriminator will have to say, well, this is fake, and this is real. That's the goal. And the generator basically says, well, how can I get a sample such that my discriminator is going to be confused, such that the discriminator always outputs one here, because it believes it's a true sample, believes it's coming from the true data.

So now you have these two systems. So what's the objective? The objective is a min-max value function. It's a very intuitive objective function that has the following structure. There's a discriminator term, which is an expectation with respect to the data distribution. So this is basically saying, I want to classify any data points that I get from my data as being real.

So I want this output to be one, because if it's one, the whole thing is going to be zero. If it's less than one, it's going to be negative. And I really want to maximize it. And then the discriminator is saying, well, any time I generate a sample, whatever sample comes out of my generator, I want to classify it as being fake.

That's the goal of the discriminator. And then there's a generator. The generator is sort of the other-- you try to minimize this function, which essentially says, well, generate samples that discriminator would classify as real. So I really am going to try to change the parameters of my generator such that this would produce zero.

Oh, sorry. So the discriminator would produce one. So it's trying to fool the discriminator. And it turns out the optimal strategy for the discriminator is this ratio: the probability of the data divided by the probability of the data plus the probability of the model. And in general, if you succeed in building a good generative model, then the probability of the data would be the same as the probability of the model.
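
Written out in the standard GAN notation (these symbols are the usual ones, not taken from the slides), the min-max objective and the optimal discriminator being described are

$$
\min_G \max_D \;\; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] \;+\; \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big],
\qquad
D^*(x) \;=\; \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{\text{model}}(x)}.
$$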

So the discriminator will always be confused, outputting one half. And here's one particular example. It seems like a simple idea, but it turns out to work remarkably well. Here's an architecture, the deep convolutional generative adversarial network (DCGAN), that takes a code-- a random Gaussian code-- and passes it through a sequence of deconvolutions.

So given the code, you sort of deconvolve it back to a high-dimensional image, and you train it in this adversarial setting. This is your sampling: you generate the image. And then there is a discriminator, which is just a convolutional neural network that's trying to say: is that real or is it fake?
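
For concreteness, here is a minimal sketch of the alternating training loop. Plain fully-connected networks stand in for the convolutional generator and discriminator described above, and all names, sizes, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

z_dim, x_dim = 100, 784
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(x_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):
    b = real.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # Discriminator step: push real samples toward 1, generated samples toward 0.
    fake = G(torch.randn(b, z_dim)).detach()
    d_loss = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make the discriminator output 1 on generated samples.
    fake = G(torch.randn(b, z_dim))
    g_loss = bce(D(fake), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```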

And if you train these models on bedrooms-- this is the LSUN dataset, a bunch of bedroom images-- this is how samples from the model look, which is pretty impressive. And in fact, when I look at these samples, I'm also sort of thinking, well, maybe the model is memorizing the data, because these samples look remarkably impressive.

Then there was follow-up work. These are samples from the CIFAR dataset. So here you're seeing training samples, and here you're seeing samples generated from the model, which is, again, very impressive. If you look at the structure in these samples, it's quite remarkable that you can generate samples that look very realistic.

Again, this was done by Tim Salimans and his collaborators. If you look at ImageNet, at the training data from ImageNet, and then look at the samples: this is a horse, there's some animal, there's an airplane, and so forth.

There's like some kind of a truck and such. So it looks-- when I look at these images, I was very impressed by the quality of these images, because generally, it's very, very hard to generate realistic looking images. And the last thing I want to point out-- this was picked up by Ian Goodfellow.

If you cherry pick some of the examples, this is what generated images look like. So you can sort of see there is a little bit of interesting structure that you're seeing in these samples. And one question still remains with these models is, how can we evaluate these models properly?

Is the model really learning the space of all possible images and what makes those images coherent? Or is the model mostly kind of blurring things around and just making some small changes to the data? So the question that I would really like to get an answer to is: if I show the model a new example, a new test image, a new kind of horse, would the model say, yes, this is a likely image?

These are very probable images, I've seen similar images before, or something like that, or not. So that still remains an open question. But again, this is a class of models that steps away from maximum likelihood estimation and sets things up in a game-theoretic framework, which is a really nice line of work.

And in the computer vision community, a lot of people are showing a lot of progress in using these kinds of models because they tend to generate much more realistic-looking images. So let me just summarize to say that I've shown you, hopefully, a set of learning algorithms for deep unsupervised models.

There's a lot of room in these models, a lot of excitement in that space. And I just wanted to point out that these deep models improve upon the current state of the art in a lot of different application domains. And as I mentioned before, there's been a lot of progress in discriminative models: convolutional models, recurrent neural networks for action recognition, for dealing with videos.

And unsupervised learning still remains a field where we've made some progress, but there's still a lot of progress to be made. And let me stop there. So thank you. Go to the mics. So as a Bayesian guy, I'm pretty depressed by the fact that GAN can generate a clearer image than the variational autoencoder.

So my question is: do you think there could be an energy-based framework or a probabilistic interpretation of why GANs are so successful, other than it's just a minimax game? I think that generally, if you look at-- I sort of go back and forth on variational autoencoders, because some of my friends at OpenAI are saying that they can actually generate really nice-looking images using variational autoencoders.

I'm looking at Peter here. But I think that one of the problems with image generation today is that with variational autoencoders, there is this notion of Gaussian loss function. And what it does is it basically says, well, never produce crystal-clear images, because if you're wrong, if you put the edge in the wrong place, you're going to be penalized a lot because of the L2 loss function.

What the GANs are doing, GANs are basically saying, well, I don't really care where I put the edge, as long as it looks realistic so that I can fool my classifier. So what tends to happen in practice, and a lot of times, if you actually look at the images generated by GANs, sometimes they have a lot of artifacts, like these specific things that pop up.

Whereas with variational autoencoders, you don't see that. But again, the problem with variational autoencoders is they tend to produce images that are more diffuse, not as sharp or as clear as what GANs produce. And there's been some work on trying to sharpen the images, where you use a variational autoencoder to generate the globally coherent scene, and then you use a generative adversarial network to sharpen it.

Again, it depends what loss function you're using. And GANs seem to be able to deal with that problem implicitly, because they don't really care whether you get the edge quite right or not, as long as it fools your classifier. Thank you. Hi. Thank you very much for the interesting talk.

I have a question about the variational autoencoder. For more challenging datasets, like Street View House Numbers, I noticed that many implementations use PCA to preprocess the data before they train the model. What is your thought on that preprocessing step? Why is it necessary to do that?

Why don't we just learn from the raw pixels? I actually don't know. My experience has been that we don't really do a lot of preprocessing. What you can do is ZCA preprocessing, where you take out the mean and use the second-order covariance structure of the data.

That sometimes helps, sometimes it doesn't. But I don't see any particular reason why you'd want to do PCA preprocessing. It's just one of-- just like we've seen a lot in our field, people just doing x, y, and then later on they figure out that they don't really need x and y.
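
For reference, the ZCA preprocessing mentioned above (remove the mean and rescale by the inverse square root of the covariance) can be sketched as follows; the small epsilon is an assumed regularizer for numerical stability.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """X: array of shape (n_samples, n_features). Returns ZCA-whitened data."""
    X = X - X.mean(axis=0)                       # take out the mean
    cov = np.cov(X, rowvar=False)                # second-order covariance structure
    U, S, _ = np.linalg.svd(cov)                 # eigendecomposition of the covariance
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return X @ W                                 # rotate, rescale, rotate back
```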

Maybe it was working better for their implementation, for their particular task, but generally I haven't seen people doing a lot of PCA preprocessing when training variational autoencoders. Any more questions? Yes, there's one. This is regarding binary RBMs. So if you look at the literature for, let's say, estimation of the partition function for Ising models, you will see that the literature is a lot richer compared to the variational inference literature for restricted Boltzmann machines, especially in the binary context.

Is there a cultural reason for this? Because specifically, for the strictly ferromagnetic case, you have a fully polynomial-time approximation scheme for estimating the log partition function. But I don't see usage of these FPRAS algorithms in the RBM space. So when you juxtapose the literature for Ising models with that for binary RBMs, you'll find a very stark asymmetry.

Is there a reason for this? Yeah, so the thing about Ising models is that if you're in a ferromagnetic case, or if you have certain particular structure to the Ising models, you can use a lot of techniques. Even if you use techniques like coupling from the past, you can draw exact samples from the models.

You can compute the log partition function in polynomial time if you have a specific structure. But the problem with RBMs is that generally those assumptions don't apply. You cannot restrict yourself to learning a ferromagnetic model with RBMs, where all your weights are positive; that's a lot of constraints to put on this class of models.

So that's why-- and once you get outside of these assumptions, then the problem becomes NP-hard for estimating the partition function. And obviously, for learning these systems, you need the gradient of the log of the partition function. And that's where all the problems come in. I don't think there is a solution for that.

And unfortunately, variational methods are also not working as well as approximations like contrastive divergence, or something based on sampling. People have looked at better approximations and using more sophisticated techniques, but it hasn't really popped up yet. Practically, it just doesn't work as well. But it's a good question. Hello.

I'm curious about using autoencoders to get semantic hashes, especially for text. Do we need any special text representation, like word2vec, as input for our text sequence? So I've talked about the model, which is a very simple model that models bags of words. Yes. You can use word2vec to initialize the model, because it's a way of just taking your words and projecting them into a semantic space.

There have been a lot of recent techniques using, like Richard was mentioning, GRUs: if you want to work with sentences, or if you want to embed the entire document into the semantic space and then make it binary, you can use bidirectional GRUs to get the representation of the document.

I think that would probably work better than using word2vec and then just adding things up. And then based on that, you can learn a hashing function that maps that particular representation to the binary space, in which case you can do searching fairly efficiently. So as an input representation, there are lots of choices.
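
As a rough sketch of that last step, mapping a document representation to a binary code that can be used for hashing: this is a hypothetical head, not tied to any particular encoder from the talk, and the sizes are assumptions.

```python
import torch
import torch.nn as nn

class HashingHead(nn.Module):
    """Maps a document embedding (e.g. from a bidirectional GRU, or summed
    word2vec/GloVe vectors) to a short binary code for semantic hashing."""
    def __init__(self, emb_dim=300, code_bits=32):
        super().__init__()
        self.proj = nn.Linear(emb_dim, code_bits)

    def forward(self, doc_emb):
        probs = torch.sigmoid(self.proj(doc_emb))   # soft codes, usable during training
        return (probs > 0.5).int()                  # thresholded binary code at search time
```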

You can use bidirectional GRUs, which is the method of choice right now. You can use GloVe, or you can use word2vec and sum up the representations of the words. Okay, so using only bag of words, we use only a normal network, that is, no recurrent network, just a simple network?

Yeah, that's right. But again, your representation can be whatever that representation is, as long as it's differentiable, right? Because in this case, you can back propagate through the bidirectional GRUs and learn what these representations should be. Okay, thank you so much. Okay, let's thank Russ again. (audience applauding)