Foundations of Unsupervised Deep Learning (Ruslan Salakhutdinov, CMU)

00:00:00.000 | Sound is good?

00:00:01.000 | Okay, great.

00:00:02.000 | So I wanted to talk to you about unsupervised learning, and that's the area where there's

00:00:06.360 | been a lot of research.

00:00:09.020 | But compared to supervised learning that you've heard about today, like convolutional networks,

00:00:14.680 | unsupervised learning is not there yet.

00:00:15.920 | So I'm going to show you lots of areas.

00:00:20.560 | Parts of the talk are going to be a little bit more mathematical.

00:00:23.840 | I apologize for that, but I'll try to give you a gist of the foundations, the math behind

00:00:28.840 | these models, as well as try to highlight some of the application areas.

00:00:34.640 | What's the motivation?

00:00:35.640 | Well, the motivation is that the space of data that we have today is just growing.

00:00:42.200 | If you look at the space of images, speech, if you look at social network data, if you

00:00:47.720 | look at scientific data, I would argue that most of the data that we see today is unlabeled.

00:00:56.440 | So how can we develop statistical models, models that can discover interesting kind

00:01:00.480 | of structure in unsupervised way or semi-supervised way?

00:01:04.640 | And that's what I'm interested in, as well as how can we apply these models across multiple

00:01:11.320 | different domains.

00:01:13.360 | And one particular framework of doing that is the framework of deep learning, where you're

00:01:17.380 | trying to learn hierarchical representations of data.

00:01:21.240 | And again, as I go through the talk, I'm going to show you some examples.

00:01:27.480 | So here's one example.

00:01:30.320 | You can take a simple bag-of-words representation of an article or a newspaper.

00:01:36.220 | You can use something that's called an autoencoder, just multiple levels.

00:01:41.400 | You extract some latent code, and then you get some representation out of it.

00:01:46.640 | This is done completely in unsupervised way.

00:01:48.520 | You don't provide any labels.

00:01:49.920 | And if you look at the kind of structure that the model is discovering, it could be useful

00:01:53.280 | for visualization, for example, to see what kind of structure you see in your data.

00:02:00.480 | This was done on the Reuters dataset.

00:02:03.760 | I've tried to kind of cluster together lots of different unsupervised learning techniques,

00:02:09.520 | and I'll touch on some of them.

00:02:11.640 | It's a little bit-- it's not a full set.

00:02:15.100 | But the way that I typically think about these models is that there's a class of what I would

00:02:19.640 | call non-probabilistic models, models like sparse coding, autoencoders, clustering-based

00:02:26.560 | methods.

00:02:27.560 | And these are all very, very powerful techniques, and I'll cover some of them in that talk as

00:02:32.880 | well.

00:02:33.880 | And then there is sort of a space of probabilistic models.

00:02:38.160 | And within probabilistic models, you have tractable models, things like fully observed

00:02:43.440 | belief networks.

00:02:45.440 | There's a beautiful class of models called neural autoregressive density estimators.

00:02:50.200 | More recently, we've seen some successes of so-called pixel recurrent neural network models.

00:02:58.640 | And I'll show you some examples of that.

00:03:00.280 | There is a class of so-called intractable models, where you are looking at models like

00:03:05.640 | Boltzmann machines and models like variational autoencoders, something that's been quite--

00:03:10.560 | there's been a lot of development in our community, in deep learning community in that space.

00:03:15.400 | From Holtz machines, I'll tell you a little bit about what these models are, and a whole

00:03:19.960 | bunch of others as well.

00:03:22.920 | One particular structure within these models is that when you're building these generative

00:03:27.480 | models of data, you typically have to specify what the distributions you're looking at.

00:03:33.160 | So you have to specify what the probability of the data, and generally doing some kind

00:03:37.020 | of approximate maximum likelihood estimation.

00:03:39.680 | And then more recently, we've seen some very exciting models coming out.

00:03:44.120 | These are generative adversarial networks, moment matching networks.

00:03:48.640 | And this is a slightly different class of models, where you don't really have to specify

00:03:53.440 | what the density is.

00:03:55.120 | You just need to be able to sample from those models.

00:03:57.480 | And I'm going to show you some examples of that.

00:04:01.800 | So my talk is going to be structured.

00:04:04.120 | I'd like to introduce you to the basic building blocks, models like sparse coding models.

00:04:09.720 | Because I think that these are very important classes of models, particularly for folks

00:04:13.680 | who are working in industry and looking for simpler models.

00:04:18.840 | Autoencoder is a beautiful class of models.

00:04:21.200 | And then the second part of the talk, I'll focus more on generative models.

00:04:25.440 | I'll give you an introduction into restricted Boltz machines and deep Boltz machines.

00:04:29.320 | These are statistical models that can model complicated data.

00:04:38.280 | And I'll spend some time showing you some examples, some recent developments in our

00:04:42.480 | community, specifically in the case of variational autoencoders, which I view them as a subclass

00:04:47.280 | of Helmholtz machines.

00:04:49.360 | And I'll finish off by giving you an intuition about a slightly different class of models,

00:04:54.440 | which would be these generative adversarial networks.

00:04:57.440 | OK, so let's jump into the first part.

00:05:00.880 | But before I do that, let me just give you a little bit of motivation.

00:05:05.360 | I know Andre's done a great job, and Richard alluded to that as well.

00:05:10.640 | But the idea is, if I'm trying to classify a particular image, and if I say, if I'm looking

00:05:16.760 | at specific pixel representation, it might be difficult for me to classify what I'm seeing.

00:05:22.040 | On the other hand, if I can find the right representations, the right representations

00:05:27.500 | for these images, and then I get the right features, or get the right structure from

00:05:32.640 | the data, then it might be easier for me to see what's going on with my data.

00:05:38.920 | So how do I find these representations?

00:05:40.760 | And this is one of traditional approaches that we've seen for a long time, is that you

00:05:48.120 | have a data, you're creating some features, and then you're running your learning algorithm.

00:05:53.120 | And for the longest time, in object recognition or in audio classification, you typically

00:05:57.500 | use some kind of hand-designed features, and then you start classifying what you have.

00:06:04.000 | And like Andre was saying, in the space of vision, there's been a lot of different features,

00:06:11.120 | designs of what's the right structure we should see in the data.

00:06:15.600 | In the space of audio, same thing is happening.

00:06:19.960 | How can you find these right representations for your data?

00:06:25.380 | And the idea behind representation learning, in particular in deep learning, is can we

00:06:32.200 | actually learn these representations automatically?

00:06:35.160 | And more importantly, can we actually learn these representations in an unsupervised way,

00:06:39.000 | by just seeing lots and lots of unlabeled data?

00:06:41.600 | Can we achieve that?

00:06:43.480 | And there's been a lot of work done in that space, but we're not there yet.

00:06:47.960 | So I wanted to lower your expectations as I show you some of the results.

00:06:53.920 | OK, sparse coding.

00:06:56.400 | This is one of the models that I think that everybody should know what it is.

00:07:01.000 | It was actually first has its roots in '96, and it was originally developed to explain

00:07:08.320 | early visual processing in the brain.

00:07:10.200 | I think of it as an edge detector.

00:07:13.160 | And the objective here is the following.

00:07:14.800 | Well, if I give you a set of data points, x1 up to xn, you'd want to learn a dictionary

00:07:19.520 | of bases, phi 1 up to phi k, so that every single data point can be written as a linear

00:07:25.760 | combination of the bases.

00:07:27.500 | That's fairly simple.

00:07:29.000 | There is one constraint in that you'd want your coefficients to be sparse.

00:07:35.480 | You'd want them to be mostly zero.

00:07:38.780 | So every data point is represented as a sparse linear combination of bases.

00:07:43.960 | So if you apply sparse coding to natural images, and this was originally has been a lot of

00:07:53.040 | work developed at Stanford with Andrew Ng's group.

00:07:56.120 | So if you apply sparse coding to take little patches of images and learn these bases, these

00:08:02.400 | dictionaries, this is how they look like.

00:08:04.480 | And they look really nice in terms of finding edge-like structure.

00:08:09.720 | So if given a new example, I can say, well, this new example can be written as a linear

00:08:14.420 | combination of a few of these bases.

00:08:18.040 | And taking that representation, it turns out that particular representation, a sparse representation,

00:08:23.640 | is quite useful as a feature representation of your data.

00:08:27.760 | So it's quite useful to have it.

00:08:29.680 | And in general, how do we fit these models?

00:08:36.760 | Well, if I give you a whole bunch of image patches, but these don't necessarily have

00:08:41.440 | to be image patches.

00:08:42.520 | This could be little speech signals or any kind of data you're working with.

00:08:47.920 | You'd want to learn a dictionary of bases.

00:08:49.320 | You have to solve this optimization problem.

00:08:53.200 | So the first term here, you can think of it as a reconstruction error, which is to say,

00:08:57.480 | well, I take a linear combination of my bases.

00:09:00.760 | I want them to match my data.

00:09:03.920 | And then there's a second term, which is, you can think of it as a sparse penalty term,

00:09:08.160 | which essentially says, try to penalize my coefficients so that most of them are zero.

00:09:15.640 | That way, every single data point can be written as just a linear combination, sparse linear

00:09:19.400 | combination of the bases.

00:09:22.180 | And it turns out there is an easy optimization for doing that.

00:09:26.720 | If you fix your dictionary of bases, 5, 1 up to 5k, and you solve for the activations,

00:09:34.200 | that becomes a standard lasso problem.

00:09:36.840 | There's a lot of solvers for solving that particular problem.

00:09:40.720 | That's a general, it's a lasso problem, which is fairly easy to optimize.

00:09:47.440 | And then if you fix the activations and you optimize for dictionary of bases, then it's

00:09:51.800 | a well-known quadratic programming problem.

00:09:55.440 | Each problem is convex, so you can alternate between finding coefficients, finding bases,

00:10:00.560 | and so forth, so you can optimize this function.

00:10:02.840 | And there's been a lot of recent work in the last 10 years of doing these things online

00:10:07.800 | and doing it more efficiently and so forth.

00:10:13.300 | At test time, given a new input or a new image patch, and given a set of learned bases, once

00:10:18.520 | you have your dictionary, you can then just solve a lasso problem to find the right coefficients.

00:10:25.280 | So in this case, given a test sample or a test patch, you can find, well, it's written

00:10:30.480 | as a linear combination of a subset of the bases.

00:10:35.840 | And it turns out, again, that that particular representation is very useful, particularly

00:10:39.680 | if you're interested in classifying what you see in images.

00:10:43.180 | And this is done in a completely unsupervised way.

00:10:45.560 | There is no class labels.

00:10:46.560 | There is no specific supervisory signal that's here.

00:10:52.280 | So back in 2006, there was work done, again, at Stanford that basically showed a very interesting

00:11:00.240 | result.

00:11:01.240 | So if I give you an input like this, and these are my learned bases, remember these little

00:11:05.040 | edges, what happens is that you just convolve these bases.

00:11:09.480 | You can get these different feature maps, much like the feature maps that we've seen

00:11:13.220 | in convolutional neural networks.

00:11:15.440 | And then you take these feature maps, and you can just do a classification.

00:11:20.000 | This was done on one of the older data sets, the Caltech 101, which is a data set that

00:11:25.160 | predates ImageNet.

00:11:27.880 | And if you look at some of the competing algorithms, if you do a simple logistic regression versus

00:11:35.240 | if you do PCA and then do logistic regression versus finding these features using sparse

00:11:41.400 | coding, you can get substantial improvements.

00:11:44.960 | So that's, again, that's-- and you see sparse coding popping up in a lot of different areas,

00:11:51.280 | not just in deep learning, but folks who are using-- looking at the medical imaging domain,

00:11:57.240 | in neuroscience, these are very popular models.

00:12:00.000 | Because they're easy, they're easy to fit, they're easy to deal with.

00:12:05.400 | So what's the interpretation of the sparse coding?

00:12:09.280 | Well, let's look at this equation again.

00:12:11.440 | And we can think of sparse coding as finding an overcomplete representation of your data.

00:12:17.880 | Now the encoding function, we can think of this encoding function, which is, well, I

00:12:23.160 | give you an input, find me the features or sparse coefficients or bases that make up

00:12:28.900 | my image.

00:12:29.900 | We can think of encoding as an implicit and a very nonlinear function of x.

00:12:33.720 | But it's an implicit function.

00:12:35.120 | We don't really specify it.

00:12:36.920 | And the decoder, or the reconstruction, is just a simple linear function.

00:12:42.400 | And it's very explicit.

00:12:44.000 | You just take your coefficients and then multiply it by-- find the right basis and get back

00:12:52.240 | the image or the data.

00:12:56.420 | And that sort of flows naturally into the ideas of autoencoders.

00:13:01.200 | The autoencoder is a general framework where if I give you an input data, let's say it's

00:13:05.720 | an input image, you encode it, you get some representation, some feature representation,

00:13:11.560 | and then you have a decoder given that representation.

00:13:14.360 | You're decoding it back into the image.

00:13:16.880 | So you can think of encoder as a feedforward, bottom-up pass, much like in a convolutional

00:13:23.800 | neural network, given the image, you're doing a forward pass.

00:13:27.200 | And then there is also feedback and generative or top-down pass.

00:13:31.920 | And the features, you're reconstructing back the input image.

00:13:35.880 | And the details of what's going inside the encoder, decoder, they matter a lot.

00:13:40.480 | And obviously, you need some form of constraints.

00:13:42.320 | You need some of constraints to avoid learning an identity.

00:13:45.840 | Because if you don't put these constraints, what you could do is just take your input,

00:13:50.600 | copy it to your features, and then reconstruct back.

00:13:53.440 | And that would be a trivial solution.

00:13:55.560 | So we need to introduce some additional constraints.

00:13:59.740 | If you're dealing with binary features, if you want to extract binary features, for example,

00:14:05.440 | I'm going to show you later why you'd want to do that.

00:14:07.920 | You can pass your encoder through sigmoid nonlinearity, much like in the neural network.

00:14:13.920 | And then you have a linear decoder that reconstructs back the input.

00:14:17.640 | And the way we optimize these little building blocks or these little blocks is we can just

00:14:24.720 | have an encoder, which takes your input, takes a linear combination, passes it through some

00:14:30.520 | nonlinearity, the sigmoid nonlinearity.

00:14:32.300 | It could be rectified linear units.

00:14:34.180 | It could be 10H nonlinearity.

00:14:36.220 | And then there is a decoder where you reconstruct back your original input.

00:14:41.020 | So this is nothing more than a neural network with one hidden layer.

00:14:44.240 | And typically, that hidden layer would have a small dimensionality than the input.

00:14:47.960 | So we can think of it as a bottleneck layer.

00:14:50.540 | We can determine the network parameters, the parameters of the encoder and the parameters

00:14:54.500 | of the decoder by writing down the reconstruction error.

00:14:58.480 | And that's what the reconstruction error would look like.

00:15:00.860 | Given the input, encode, decode, and make sure whatever you're decoding is as close

00:15:04.900 | as possible to the original input.

00:15:08.060 | All right.

00:15:09.060 | Then we can use backpropagation algorithm to train.

00:15:14.140 | There is an interesting sort of relationship between autoencoders and principal component

00:15:18.940 | analysis.

00:15:20.140 | Many of you have probably heard about PCA.

00:15:22.400 | As a practitioner, if you're dealing with large data and you want to see what's going

00:15:26.180 | on, PCA is the first thing to use, much like logistic regression.

00:15:32.460 | And the idea here is that if the parameters of encoder and decoder are shared and you

00:15:36.780 | actually have the hidden layer, which is a linear layer, so you don't introduce any nonlinearities,

00:15:42.180 | then it turns out that the latent space that the model will discover is going to be the

00:15:46.940 | same space as the space discovered by PCA.

00:15:49.780 | It effectively will collapse the principal component analysis, right?

00:15:52.940 | We're doing PCA, which is sort of a nice connection because it basically says that autoencoders,

00:16:00.100 | you can think of them as nonlinear extensions of PCA.

00:16:02.700 | So you can learn a little richer features if you are using autoencoders.

00:16:13.180 | So here's another model.

00:16:14.180 | If you're dealing with binary input, sometimes we're dealing with like MNIST, for example.

00:16:19.140 | Again, your encoder and decoder could use sigmoid nonlinearities.

00:16:22.900 | So given an input, you extract some binary features.

00:16:25.020 | Give binary features, you reconstruct back the binary input.

00:16:29.100 | And that actually relates to a model called the restricted Boltz machine, something that

00:16:33.860 | I'm going to tell you about later in the talk.

00:16:37.860 | There's also other classes of models where you can say, well, I can also introduce some

00:16:42.100 | sparsity, much like in sparse coding, to say that I need to constrain my latent features

00:16:47.820 | or my latent space to be sparse.

00:16:50.420 | And that actually allows you to learn quite reasonable features, nice features.

00:16:56.780 | Here's one particular model called predictive sparse decomposition, where you effectively,

00:17:02.300 | if you look at the first part of the equation here, the decoder part, that pretty much looks

00:17:06.580 | like a sparse coding model.

00:17:09.140 | But in addition, you have an encoding part that essentially says train an encoder such

00:17:14.380 | that it actually approximates what my latent code should be.

00:17:19.980 | So effectively, you can think of this model as there is an encoder, there is a decoder,

00:17:23.700 | but then you put the sparsity constraint on your latent representation.

00:17:26.300 | And you can optimize for that model.

00:17:32.400 | And obviously, the other thing that we've been doing in the last seven, eight, and ten

00:17:37.340 | years is, well, what you can do is you can actually stack these things together.

00:17:42.060 | So you can learn low-level features, try to learn high-level features, and so forth.

00:17:47.220 | So you're just building these blocks.

00:17:50.900 | And perhaps at the top level, if you're trying to solve a classification problem, you can

00:17:54.620 | do that.

00:17:57.140 | And this is sometimes known as a greedy layer-wise learning.

00:18:00.900 | And this is sometimes useful whenever you have lots and lots of unlabeled data.

00:18:05.460 | And when you have a little labeled data, a small sample of labeled data, typically these

00:18:10.340 | models help you find meaningful representations such that you don't need a lot of labeled

00:18:15.620 | data to solve a particular task that you're trying to solve.

00:18:19.420 | And this is, again, you can remove the decoding part, and then you end up with a standard

00:18:23.900 | or a convolutional architecture.

00:18:25.660 | Again, your encoder and decoder could be convolutional.

00:18:30.020 | And it depends on what problem you're tackling.

00:18:34.060 | And typically, you can stack these things together and optimize for a particular task

00:18:38.740 | that you're trying to solve.

00:18:43.100 | Here's an example of-- just wanted to show you some examples, some early examples.

00:18:46.940 | Back in 2006, this was a way of trying to build these nonlinear autoencoders.

00:18:53.900 | And you can sort of pre-train these models using restricted-bolt machines or autoencoders

00:18:58.180 | generally.

00:18:59.700 | And then you can stitch them together into this deep autoencoder and backpropagate through

00:19:05.900 | reconstruction loss.

00:19:08.460 | One thing I want to point out is that-- here's one particular example.

00:19:12.740 | The top row, I show you real faces.

00:19:15.780 | The second row, you're seeing faces reconstructed from a bottleneck of 30-dimensional real-value

00:19:22.420 | bottlenecks.

00:19:23.420 | You can think of it as just a compression mechanism.

00:19:25.580 | Given the data, high-dimensional data, you're compressing it down to 30-dimensional code.

00:19:30.000 | And then from that 30-dimensional code, you're reconstructing back the original data.

00:19:34.260 | So if you look at the first row, this is the data.

00:19:36.260 | The second row shows you reconstructed data.

00:19:39.420 | And the last row shows you PCA solution.

00:19:43.060 | One thing I want to point out is that the solution here, you have a much sharper representation,

00:19:47.940 | which means that it's capturing a little bit more structure in the data.

00:19:50.780 | It's also kind of interesting to see that sometimes these models tend to-- how should

00:19:55.820 | I say it?

00:19:56.820 | They tend to regularize your data.

00:19:58.780 | For example, if you see this person with glasses, removes the glasses.

00:20:02.580 | And that generally has to do with the fact that there is only one person with glasses.

00:20:05.820 | So the model just basically says, that's noise.

00:20:07.820 | Get rid of it.

00:20:08.820 | Or it sort of gets rid of mustaches.

00:20:11.020 | Like if you see this, there's no mustache.

00:20:13.340 | And then again, that has to do with the fact that there's enough capacity.

00:20:16.300 | So the model might think that that's just a noise.

00:20:20.860 | And if you're dealing with text type of data, this was done using a Reuters data set.

00:20:27.900 | You have about 800,000 stories.

00:20:30.780 | You take bag of words representation, something very simple.

00:20:33.020 | You can press it down to two-dimensional space.

00:20:34.980 | And then you see what that space looks like.

00:20:37.940 | And I always like to joke that the model basically discovers that European community economic

00:20:43.180 | policies are just next to disasters and accidents.

00:20:46.740 | This was back in-- I think the data was collected in '96.

00:20:50.780 | I think today it's probably going to become closer.

00:20:55.980 | But again, this is just a way-- typically, autoencoder is a way of compression or trying

00:21:00.780 | to do dimensionality reduction.

00:21:02.460 | But we'll see later that they don't have to be.

00:21:05.820 | There's another class of algorithm called semantic hashing, which is to say, well, what

00:21:09.860 | if you take your data and compress it down to binary representation?

00:21:14.540 | Wouldn't that be nice?

00:21:16.180 | Because if you have binary representation, you can search in the binary space very efficiently.

00:21:22.540 | In fact, if you can compress your data down to 20 dimension, 20-dimensional binary code,

00:21:28.460 | 2 to the 20 is about 4 gigabytes.

00:21:30.460 | So you can just store everything in memory.

00:21:33.980 | And you can look at the-- just do memory lookups without actually doing any search at all.

00:21:42.020 | So this sort of representation sometimes have been used successfully in computer vision,

00:21:46.260 | where you take your images, and then you learn these binary representations, 30-dimensional

00:21:52.340 | codes, 200-dimensional codes.

00:21:55.540 | And it turns out it's very efficient to search through large volumes of data using binary

00:22:00.380 | representation.

00:22:01.380 | So you can-- it takes a fraction of a millisecond to retrieve images from a set of millions

00:22:07.420 | and millions of images.

00:22:09.700 | And again, this is also an active area of research right now, because people are trying

00:22:13.240 | to figure out, we have these large databases.

00:22:15.460 | How can you search through them efficiently?

00:22:17.220 | And learning a semantic hashing function that maps your data to the binary representation

00:22:22.520 | turns out to be quite useful.

00:22:25.540 | OK, now let's step back a little bit and say, let's now look at generative models.

00:22:31.500 | Let's look at probabilistic models and how different they are.

00:22:34.420 | And I'm going to show you some examples of where they're applicable.

00:22:39.420 | Here's one example of a simple model trying to learn a distribution of these handwritten

00:22:46.260 | characters.

00:22:47.260 | So we have Sanskrit.

00:22:49.460 | We have Arabic.

00:22:50.460 | We have Cyrillic.

00:22:53.180 | And now we can build a model that says, well, can you actually generate me what a Sanskrit

00:22:58.700 | should look like?

00:23:00.220 | The flickering you see at the top, these are neurons.

00:23:02.780 | You can think of them as neurons firing.

00:23:05.300 | And what you're seeing at the bottom is you're seeing what the model generates, what it believes

00:23:09.660 | Sanskrit should look like.

00:23:11.140 | So in some sense, when you think about generative models, you think about models that can generate

00:23:15.860 | or they can sample the distribution or they can sample the data.

00:23:21.900 | This is a fairly simple model.

00:23:23.180 | We have about 25,000 characters coming from 50 different alphabets around the world.

00:23:28.380 | You have about 2 million parameters.

00:23:29.660 | This is one of the older models.

00:23:31.860 | But this is what the model believes Sanskrit should look like.

00:23:35.100 | And I think that I've asked a couple of people to say that, does that really look like Sanskrit?

00:23:40.380 | Yes.

00:23:41.380 | OK, great.

00:23:43.140 | Which can mean two things.

00:23:45.020 | It can mean that the model is actually generalizing or the model is overfitting, meaning that

00:23:50.700 | it's just memorizing what the training data looks like.

00:23:52.620 | And I'm just showing you examples from the training data.

00:23:55.380 | We'll come back to that point as we go through the talk.

00:23:59.820 | You can also do conditional simulation.

00:24:02.180 | Given half of the image, can you complete the remaining half?

00:24:06.020 | And more recently, there's been a lot of advances, especially in the last couple of years, for

00:24:12.140 | the conditional generations.

00:24:13.580 | And it's pretty amazing what you can do in terms of in-painting, given half of the image,

00:24:19.020 | what the other half of the image should look like.

00:24:21.300 | This is sort of a simple example, but it does show you that it's trying to be consistent

00:24:25.820 | with what different strokes look like.

00:24:29.640 | So why is it so difficult?

00:24:32.820 | In the space of so-called undirected graphical models of Bolsom machines, the difficulty

00:24:36.820 | really comes from the following fact.

00:24:38.680 | If I show you this image, which is a 28 by 28 image, it's a binary image.

00:24:43.560 | So some pixels are on, some pixels are off.

00:24:46.060 | There are 2 to the 28 by 28 possible images.

00:24:49.620 | So in fact, there are 2 to the 784 possible configurations.

00:24:53.580 | And that space is exponential.

00:24:56.000 | So how can you build models that figure out-- in the space of characters, there's only a

00:25:00.620 | little tiny subspace in that space.

00:25:03.680 | If you start generating 200 by 200 images, that space is huge.

00:25:11.580 | In the space of real images, it's really, really tiny.

00:25:15.120 | So how do you find that space?

00:25:16.420 | How do you generalize to new images?

00:25:18.480 | That's a very difficult question in general to answer.

00:25:23.640 | One class of models is so-called fully observed models.

00:25:27.940 | There's been a stream of learning generative models that are tractable.

00:25:33.000 | And they have very nice properties, like you can compute the probabilities, you can do

00:25:36.780 | maximum likelihood estimation.

00:25:38.760 | Here's one example where I can, if I try to model the image, I can write it down as taking

00:25:43.840 | the first pixel, modeling the first pixel, then modeling the second pixel, giving the

00:25:47.280 | first pixel, and just writing it down in terms of the conditional probabilities.

00:25:53.960 | And each conditional probability can take a very complicated form.

00:25:56.880 | It could be a complicated neural network.

00:26:03.200 | So there's been a number of successful models.

00:26:08.280 | One of the early models called Neural Autoregressive Density Estimator, actually developed by Hugo,

00:26:14.880 | real valid extension of these models.

00:26:16.520 | And more recently, we start seeing these flavors of models.

00:26:19.680 | There were a couple of papers popped up, actually this year, from DeepMind, where they make

00:26:27.680 | these conditionals to be sophisticated RNNs, LSTMs, or convolutional models.

00:26:32.800 | And they can actually generate remarkable images.

00:26:35.600 | And so this is just a pixel CNN generating, I guess, elephants.

00:26:39.360 | Yeah.

00:26:40.360 | And actually, it looks pretty interesting.

00:26:45.520 | The drawback of these models is that we yet have to see how good of representations these

00:26:49.480 | models are learning, so that we can use these representations for other tasks, like classifying

00:26:54.320 | images or find similar images and such.

00:26:59.080 | Now let me jump into a class of models called Restricted Bulse Machines.

00:27:03.680 | So this is the class of models where we're actually trying to learn some latent structure,

00:27:07.520 | some latent representation.

00:27:10.000 | These models belong to the class of so-called graphical models.

00:27:12.480 | And graphical models are a very powerful framework for representing dependency structure between

00:27:17.260 | random variables.

00:27:19.320 | This is an example where we have-- you can think of this particular model.

00:27:24.600 | You have some pixels.

00:27:25.600 | These are stochastic binary, so-called visible variables.

00:27:28.480 | You can think of pixels in your image.

00:27:30.680 | And you have stochastic binary hidden variables.

00:27:32.480 | You can think of them as feature detectors, so detecting certain patterns that you see

00:27:35.600 | in the data, much like sparse coding models.

00:27:38.040 | This has a bipartite structure.

00:27:40.280 | You can write down the probability, the joint distribution over all of these variables.

00:27:45.760 | You sort of have pairwise term.

00:27:47.120 | You have unitary term.

00:27:48.120 | But it's not really important what they look like.

00:27:49.960 | The important thing here is that if I look at this conditional probability of the data

00:27:53.920 | given the features, I can actually write down explicitly what it looks like.

00:27:58.600 | What does that mean?

00:27:59.600 | That basically means that if you tell me what features you see in the image, I can generate

00:28:03.480 | the data for you, or I can generate the corresponding input.

00:28:08.320 | In terms of learning features, so what do these models learn?

00:28:11.960 | They sort of learn something similar that we've seen in sparse coding.

00:28:16.240 | And so these classes of models are very similar to each other.

00:28:20.120 | So given a new image, I can say, well, this new image is made up by some combination of

00:28:24.960 | these learned weights or these learned bases.

00:28:28.480 | And the numbers here are given by the probabilities that each particular edge is present in the

00:28:32.920 | data.

00:28:35.080 | In terms of how we learn these models, one thing I want to make-- another point I should

00:28:41.200 | make here is that given an input, I can actually quickly infer what features I'm seeing in

00:28:47.720 | the image.

00:28:48.720 | So that operation is very easy to do, unlike in sparse coding models.

00:28:52.640 | It's a little bit more closer to an autoencoder.

00:28:54.300 | Given the data, I can actually tell you what features are present in my input, which is

00:28:58.480 | very important for things like information retrieval or classifying images, because you

00:29:02.560 | need to do it fast.

00:29:05.280 | How do we learn these models?

00:29:06.280 | Let me just give you an intuition, maybe a little bit of math behind how we learn these

00:29:11.000 | models.

00:29:12.000 | If I give you a set of training examples, and I want to learn model parameters, I can

00:29:16.520 | maximize the log-likelihood objective.

00:29:18.840 | And you've probably seen that in these tutorials, the maximum likelihood objective is essentially

00:29:24.280 | nothing more than saying, I want to make sure that the probability of observing these images

00:29:29.680 | is as high as possible.

00:29:31.400 | So finding the parameters of the probability of observing what I'm seeing is high.

00:29:36.480 | And that's why you're maximizing the likelihood objective for the log of the likelihood objective,

00:29:41.960 | which is take a product into the sum.

00:29:44.960 | You take the derivative.

00:29:45.960 | There's a little bit of algebra.

00:29:47.000 | I promise you it's not very difficult.

00:29:49.640 | It's like second-year college algebra.

00:29:54.140 | You differentiate.

00:29:55.140 | And you basically have this learning rule, which is the difference between two terms.

00:30:01.800 | The first term, you can think of it as looking at sufficient statistics, so-called sufficient

00:30:07.360 | statistics driven by the data.

00:30:09.700 | And the second term is the sufficient statistics driven by the model.

00:30:14.120 | Maybe I can parse it out.

00:30:15.240 | What does that mean?

00:30:16.240 | Intuitively, what that means is that you look at the correlations you see in the data.

00:30:21.720 | And then you look at the correlations that the model is telling you it should be.

00:30:25.800 | And you're trying to match the two.

00:30:28.160 | That's what the learning is trying to do.

00:30:30.560 | It's trying to match the correlations that you see in the data.

00:30:34.280 | So the model is actually respecting the statistics that you see in the data.

00:30:38.440 | But it turns out that the second term is very difficult to compute.

00:30:41.380 | And it's precisely because the space of all possible images is so high dimensional that

00:30:46.640 | you need to figure out or use some kind of approximate learning algorithms to do that.

00:30:52.080 | So you have these difference between these two terms.

00:30:54.000 | The first term is easy to compute, it turns out, because of a particular structure of

00:30:57.280 | the model.

00:30:58.280 | And we can actually do it explicitly.

00:31:02.560 | The second term is the difficult one to compute.

00:31:05.320 | So it sort of requires summing over all possible configurations, all possible images that you

00:31:11.680 | could possibly see.

00:31:13.600 | So this term is intractable.

00:31:15.880 | And what a lot of different algorithms are doing-- and we'll see that over and over again--

00:31:20.320 | is using so-called Monte Carlo sampling, or Markov chain Monte Carlo sampling, or Monte

00:31:24.600 | Carlo estimation.

00:31:26.680 | So let me give you an intuition of what this term is doing.

00:31:29.160 | And that's a general trick for approximating exponential sums.

00:31:33.760 | There's a whole subfield in statistics that's basically dedicated to how do we approximate

00:31:42.520 | exponential sums.

00:31:43.600 | In fact, if you could do that, if you could solve that problem, you could solve a lot

00:31:46.680 | of problems in machine learning.

00:31:50.160 | And the idea is very simple, actually.

00:31:52.080 | The idea is to say, well, you're going to be replacing the average by sampling.

00:31:58.000 | And there's something that's called Gibbs sampling, Markov chain Monte Carlo, which

00:32:02.040 | essentially does something very simple.

00:32:04.200 | It basically says, well, start with the data, sample the states of the latent variables,

00:32:10.560 | sample the data, sample the states of the latent data, sample the data from these conditional

00:32:13.880 | distributions, something that you can compute explicitly.

00:32:17.440 | And that's a general trick.

00:32:19.760 | Much like in sparse coding, we're optimizing for the basis when we're optimizing for the

00:32:23.680 | coefficients.

00:32:24.680 | Here, you're inferring the coefficients, then you're inferring what the data should look

00:32:28.960 | like and so forth.

00:32:30.760 | And then you can just run a Markov chain and approximate this exponential sum.

00:32:38.400 | So you start with the data, you sample the states of the hidden variables, you resample

00:32:42.560 | the data and so forth.

00:32:44.280 | And the only problem with a lot of these methods is that you need to run them up to infinity

00:32:52.120 | to guarantee that you're getting the right thing.

00:32:55.620 | And so obviously, you will never run them infinite.

00:33:00.640 | You don't have time to do that.

00:33:02.120 | So there's a very clever algorithm, a contrastive divergent algorithm that was developed by

00:33:06.840 | Hinton back in 2002.

00:33:09.040 | And it was very clever.

00:33:10.400 | It basically said, well, instead of running this thing up to infinity, run it for one

00:33:14.520 | step.

00:33:18.600 | And so you're just running it for one step.

00:33:20.360 | You start with a training vector, you update the hidden units, you update all the visible

00:33:25.760 | units again.

00:33:26.760 | So that's your reconstruction.

00:33:27.960 | Much like in autoencoder, you reconstruct your data.

00:33:31.120 | You update the hidden units again, and then you just update the model parameters, which

00:33:34.440 | is just looking at empirically the statistics between the data and the model.

00:33:39.920 | Very similar to what the autoencoder is doing, but slight, slight differences.

00:33:44.640 | And implementation is basically takes about 10 lines of MATLAB code.

00:33:48.840 | I suspect it's going to be two lines in TensorFlow, although I don't think TensorFlow folks implemented

00:33:54.360 | Boltzmann machines yet.

00:33:56.320 | That would be my request.

00:34:00.360 | But you can extend these models to dealing with real value data.

00:34:04.640 | So whenever you're dealing with images, for example.

00:34:06.840 | And it's just a little change to the definition of the model.

00:34:11.400 | And your conditional probabilities here are just going to be a bunch of Gaussians.

00:34:14.760 | So that basically means that given the features, sample me the space of images, and I can sample

00:34:20.000 | you, give you real, real valued images.

00:34:23.520 | The structure of the model remains the same.

00:34:25.840 | If you train this model on these images, you tend to find edges, something similar, again,

00:34:33.160 | to what you'd see in sparse coding, in ICA, independent component analysis model, autoencoders

00:34:38.040 | and such.

00:34:39.640 | And again, you can say, well, every single image is made up by some linear combination

00:34:43.980 | of these basis functions.

00:34:45.800 | You can also extend these models to dealing with count data.

00:34:48.840 | If you're dealing with documents, in this case, again, a slight change to the model.

00:34:55.360 | K here denotes your vocabulary size.

00:34:58.560 | And D here denotes number of words that you're seeing in your document.

00:35:03.120 | It's a bag of words representation.

00:35:05.600 | And the conditional here is given by so-called softmax distribution, much like what you've

00:35:09.040 | seen in the previous classes, when the distribution of possible words.

00:35:15.920 | And the parameters here, Ws, you can think of them as something similar to what word

00:35:20.840 | to vacuum bedding would do.

00:35:24.200 | And so if you apply it to, again, some of data sets, you tend to find reasonable features.

00:35:31.760 | So you tend to find features about Russia, about US, about computers, and so forth.

00:35:37.320 | So much like you found these representations, little edges, every image is made up by some

00:35:42.280 | combination of these edges.

00:35:45.000 | In case of documents or web pages, you're saying it's the same thing.

00:35:49.280 | It's just made up some linear combination of these learned topics.

00:35:53.200 | Every single document is made up by some combination of these topics.

00:35:57.080 | You can also look at one-step reconstruction.

00:35:59.240 | So you can basically say, well, how can I find similarity between the words?

00:36:03.080 | So if I show you chocolate cake and further states of hidden units, and then I reconstruct

00:36:07.960 | back the distribution of possible words, it tells me chocolate cake, cake chocolate sweet

00:36:13.760 | dessert cupcake food sugar, and so forth.

00:36:16.960 | I particularly like the one about the flower high, and then there is a Japanese sign.

00:36:22.520 | And the model sort of generates flower, Japan, sakura, blossom, Tokyo.

00:36:27.800 | So it sort of picks up again on low-level correlations that you see in your data.

00:36:33.720 | You can also apply these kinds of models to collaborative filtering, where every single

00:36:38.560 | observed variable you can model, can represent a user rating for a particular movie.

00:36:46.980 | So every single user would rate a certain subset of movies.

00:36:50.880 | And so you can represent it as the state of visible vector.

00:36:53.560 | And your hidden states can represent user preferences, what they are.

00:36:58.680 | And on the Netflix data set, if you look at the latent space that the model is learning,

00:37:04.240 | some of these hidden variables are capturing specific movie genre.

00:37:08.960 | So for example, there is actually one hidden unit dedicated to Michael Moore's movies.

00:37:14.520 | So it's sort of very strong.

00:37:16.960 | I think it's sort of either people like it or hate it.

00:37:19.280 | So there are a few hidden units specifically dedicated to that.

00:37:22.560 | But it also finds interesting things like action movies and so forth.

00:37:26.080 | So it finds that particular structure in the data.

00:37:28.840 | So you can model different kinds of modality, real value data, you can model count data,

00:37:34.880 | multinomials.

00:37:36.280 | And it's very easy to infer the states of the hidden variables.

00:37:39.020 | So that's given just the product of logistic functions.

00:37:41.400 | And that's very important in a lot of different applications.

00:37:44.380 | Given the input, I can quickly tell you what topics I see in the data.

00:37:49.200 | One thing that I want to point out, and that's an important point, is a lot of these models

00:37:53.360 | can be viewed as product models.

00:37:56.120 | Sometimes people call them product of experts.

00:37:58.960 | And this is because of the following intuition.

00:38:03.680 | If I write down the joint distribution of my hidden observed variables, I can write

00:38:07.520 | it down in this sort of log linear form.

00:38:10.360 | But if I sum out or integrate out the states of the hidden variables, I have a product

00:38:17.800 | of a whole bunch of functions.

00:38:20.680 | So what does it mean?

00:38:22.640 | What's the intuition here?

00:38:24.240 | So let me show you an example.

00:38:26.040 | Suppose the model finds these specific topics.

00:38:29.800 | And suppose I'm going to be telling you that the document contains topic government, corruption,

00:38:34.240 | and mafia.

00:38:35.540 | Then the word Silvio Berlusconi will have very high probability.

00:38:39.360 | I guess, does anybody know?

00:38:42.000 | Everybody knows who Silvio is?

00:38:43.000 | Silvio Berlusconi?

00:38:44.200 | He's in head of the government.

00:38:46.480 | He's connected to mafia.

00:38:49.040 | He's very corrupt, was corrupt.

00:38:50.680 | And I guess I should add a bunga bunga parties here.

00:38:53.800 | Then it will become completely clear what I'm talking about.

00:38:57.520 | But then one point I want to make here is that you can think of these models as a product.

00:39:05.040 | Each hidden variable defines a distribution of possible words, of possible topics.

00:39:10.800 | And once you take the intersection of these distributions, you can be very precise about

00:39:15.200 | what is it that you're modeling.

00:39:17.360 | So that's unlike generally topic models or latent directory allocation models, models

00:39:22.560 | where you're actually using mixture-like approach.

00:39:28.280 | And then typically, these models do perform far better than traditional mixture-based

00:39:32.560 | models.

00:39:33.560 | And this comes to the point of local versus distributed representations.

00:39:39.280 | In a lot of different algorithms, even unsupervised learning algorithms, such as clustering, you

00:39:44.640 | typically have some, you're partitioning the space, and you're finding local prototypes.

00:39:52.120 | And the number of parameters for each, you have basically parameters for each region,

00:39:55.720 | the number of regions typically grow linearly with the number of parameters.

00:40:00.280 | But in models like factor models, PCA, restricted Boltzmann machines, deep models, you typically

00:40:07.460 | have distributed representations.

00:40:08.460 | And what's the idea here?

00:40:10.520 | The idea here is that if I show you the two inputs, each particular neuron can differentiate

00:40:17.060 | between two parts of the plane.

00:40:19.440 | Given the second one, I can partition it again.

00:40:23.040 | Given the third hidden variable, you can partition it again.

00:40:25.520 | So you can see that every single neuron will be affecting lots of different regions.

00:40:31.240 | And that's the idea behind distributed representations, because every single parameter is affecting

00:40:35.520 | many, many regions, not just the local region.

00:40:38.080 | And so the number of regions grow roughly exponentially with the number of parameters.

00:40:42.820 | So that's the differences between these two classes of models.

00:40:46.720 | It's important to know about them.

00:40:48.880 | Now let me jump and quickly tell you a little bit of inspiration behind what can we build

00:40:53.800 | with these models.

00:40:55.840 | As we've seen with convolutional networks, the first layer, we typically learn some low-level

00:41:01.080 | features, like edges.

00:41:04.000 | If you're working with a word table, typically we'll learn some low-level structure.

00:41:09.600 | And the hope is that the high-level features will start picking up some high-level structure

00:41:13.520 | as you are building.

00:41:15.960 | And these kinds of models can be built in a completely unsupervised way, because what

00:41:19.960 | you're trying to do is you're trying to model the data.

00:41:21.840 | You're trying to model the distribution of the data.

00:41:25.120 | You can write down the probability distribution for these models, known as a Bolson machine

00:41:31.600 | model.

00:41:32.760 | You have dependencies between hidden variables.

00:41:34.560 | So now introducing some extra layers and dependencies between those layers.

00:41:42.960 | And if we look at the equation, the first part of the equation is basically the same

00:41:46.560 | as what we had with restricted Bolson machine.

00:41:49.560 | And then the second and third part of the equation is essentially modeling dependencies

00:41:53.200 | between the first and the second hidden layer, and the second hidden layer and the third

00:41:57.160 | hidden layer.

00:41:58.160 | There is also a very natural notion of bottom-up and top-down.

00:42:01.560 | So if I want to see what's the probability of a particular unit taking value 1, it's

00:42:06.840 | really dependent on what's coming from below and what's coming from above.

00:42:10.760 | So there has to be some consensus in the model to say, ah, yes, what I'm seeing in the image

00:42:16.080 | and what my model believes the overall structure should be should be in agreement.

00:42:21.880 | And so in this case, of course, in this case, hidden variables become dependent even when

00:42:25.320 | you condition on the data.

00:42:27.200 | So these kinds of models we'll see a lot is you're introducing more flexibility, you're

00:42:32.440 | introducing more structure, but then learning becomes much more difficult.

00:42:37.240 | You have to deal-- how do you inference in these models?

00:42:42.680 | Now let me give you an intuition of how can we learn these models.

00:42:47.200 | What's the maximum likelihood estimator doing here?

00:42:50.560 | Well, if I differentiate this model with respect to parameters, I basically run into the same

00:42:55.040 | learning rule.

00:42:56.040 | And it's the same learning rule you see whenever you're working with undirected graphical models,

00:42:59.640 | factor graphs, conditional random fields.

00:43:01.360 | You might have heard about those ones.

00:43:03.840 | It really is just trying to look at the statistics driven by the data, correlations that you

00:43:07.840 | see in the data, and the correlations that the model is telling you it's seeing in the

00:43:11.400 | data, and you're just trying to match the two.

00:43:13.680 | That's exactly what's happening in that particular equation.

00:43:18.460 | But the first term is no longer factorial, so you have to do some approximation with

00:43:22.200 | these models.

00:43:23.360 | But let me give you an intuition what each term is doing.

00:43:26.800 | Suppose I have some data, and I get to observe these characters.

00:43:30.480 | Well, what I can do is I really want to tell the model, this is real.

00:43:36.160 | These are real characters.

00:43:37.160 | So I want to put some probability mass around them and say, these are real.

00:43:41.200 | And then there is some sort of a data point that looks like this, just a bunch of pixels

00:43:46.000 | on and off.

00:43:47.000 | And I really want to tell my model that put almost zero probability on this.

00:43:52.520 | This is not real.

00:43:55.440 | And so the first term is exactly trying to do that.

00:43:57.920 | The first term is just trying to say, put the probability mass where you see the data.

00:44:02.000 | And the second term is effectively trying to say, well, look at this entire exponential

00:44:05.280 | space and just say, no, everything else is not real.

00:44:08.960 | Just the real thing is what I'm seeing in my data.

00:44:11.960 | And so you can use advanced techniques for doing that.

00:44:14.480 | There's a class of algorithms called variational inference.

00:44:17.600 | There's something that's called stochastic approximation, which is Monte Carlo-based

00:44:21.080 | inference.

00:44:22.080 | And I'm not going to go into these techniques.

00:44:23.080 | But in general, you can train these models.

00:44:26.400 | So one question is, how good are they?

00:44:29.080 | Because a lot of approximations that go into these models.

00:44:32.740 | So what I'm going to do is, if you haven't seen it, I'm going to show you two panels.

00:44:37.720 | On one panel, you will see the real data.

00:44:40.640 | On another panel, you'll see data simulated by the model or the fake data.

00:44:44.240 | And you have to tell me which one is which.

00:44:47.400 | So again, these are handwritten characters coming from alphabets around the world.

00:44:51.820 | How many of you think this is simulated and the other part was real?

00:44:55.640 | Honestly.

00:44:56.640 | OK, some.

00:44:58.040 | What about the other way around?

00:45:01.320 | I get half and half, which is great.

00:45:05.160 | If you look at these images a little bit more carefully, you will see the difference.

00:45:10.840 | So you will see that this is simulated and this is real.

00:45:16.320 | Because if you look at the real data, it's much crisper.

00:45:18.440 | There's more diversity.

00:45:20.600 | When you're simulating the data, there's a lot of structure in the simulated characters,

00:45:24.200 | but sometimes they look a little bit fuzzy and there isn't as much diversity.

00:45:29.760 | And I've learned that trick from my neuroscience friends.

00:45:33.480 | If I show you quickly enough, you won't see the difference.

00:45:38.840 | And if you're using these models for classifying, you can do proper analysis, which is to say,

00:45:45.880 | given a new character, you find further states of the latent variables, hidden variables.

00:45:49.960 | If I classify based on that, how good are they?

00:45:52.560 | And they are much better than some of the existing techniques.

00:45:57.220 | This is another example.

00:45:58.480 | Trying to generate 3D objects.

00:46:00.520 | This is sort of a toy datasets.

00:46:02.120 | And later on, I'll show you some bigger advances that's been happening in the last few years.

00:46:06.720 | This was done a few years ago.

00:46:08.600 | If you look at the space of generated samples, they sort of, obviously you can see the difference.

00:46:18.040 | Look at this particular image.

00:46:20.400 | This image looks like a car with wings, don't you think?

00:46:24.280 | So sometimes it can sort of simulate things that are not necessarily realistic.

00:46:29.820 | And for some reason, it just doesn't generate donkeys and elephants too often, but it generates

00:46:34.120 | people with guns more often.

00:46:35.600 | Like if you look at here and here and here.

00:46:38.400 | And that, again, has to do with the fact that you're exploring this exponential space of

00:46:43.480 | possible images, and sometimes it's very hard to assign the right probabilities to different

00:46:48.880 | parts of the space.

00:46:52.200 | And then obviously you can do things like pattern completion.

00:46:54.360 | So given half of the image, can you complete the remaining half?

00:46:57.360 | So the second one shows what the completions look like, and the last one is what the truth

00:47:00.760 | is.

00:47:01.760 | So you can do these things.

00:47:04.060 | So where else can we use these models?

00:47:05.640 | These are sort of toyish examples, but where else?

00:47:08.080 | Let me show you one example where these models can potentially succeed, which is trying to

00:47:13.680 | model the space of the multimodal space, which is the space of images and text.

00:47:20.240 | Or generally, if you look at the data, it's not just single sources.

00:47:23.600 | It's a collection of different modalities.

00:47:26.560 | So how can we take all of these modalities into account?

00:47:30.100 | And this is really just the idea of given images and text.

00:47:33.360 | And you actually find a concept that relates these two different sources of data.

00:47:40.120 | And there are a few challenges, and that's why models like generative models, sometimes

00:47:44.840 | probabilistic models, could be useful.

00:47:46.720 | In general, one of the biggest challenges we've seen is that typically when you're working

00:47:50.700 | with images and text, these are very different modalities.

00:47:54.160 | If you think about images and pixel representation, they're very dense.

00:47:58.260 | If you're looking at text, it's typically very sparse.

00:48:01.960 | It's very difficult to learn these cross-modal features from low-level representation.

00:48:06.920 | Perhaps a bigger challenge is that a lot of times we see data that's very noisy.

00:48:12.700 | Sometimes it's just non-existent.

00:48:13.700 | Given an image, there is no text.

00:48:15.880 | Or if you look at the first image, a lot of the tags about is what kind of camera was

00:48:20.120 | used to describe that particular image, which doesn't really tell us anything about the

00:48:25.240 | image itself.

00:48:26.740 | And these would be the text generated by a version of a Boltzmann machine model.

00:48:32.120 | It sort of does samples of what the text should look like.

00:48:37.000 | And the idea, again, is very simple.

00:48:40.040 | If you just build a simple representation, given images and given text, you just try

00:48:43.760 | to find what the common representation is, it's very difficult to learn these cross-modal

00:48:48.440 | features.

00:48:49.560 | But if you actually build a hierarchical model, so you start with representation, you can

00:48:54.520 | build a Gaussian model, replicate a softmax model, you can build up that representation,

00:48:58.880 | then it turns out it's much more, it gives you much richer representation.

00:49:04.080 | There's also a notion of bottom-up and top-down, which means that low-level images or tags

00:49:12.480 | can effectively affect low-level representation of images and the other way around.

00:49:16.480 | So information flows between images and text and gets into some stable state.

00:49:22.680 | And this is what the text generated from images looks like, some of the examples.

00:49:28.160 | A lot of them look reasonable, but more recently, with the advances of CovNets, this is probably

00:49:33.840 | not that surprising.

00:49:37.480 | Here's some examples of the model that's not quite doing the right thing.

00:49:42.360 | I particularly like the second one.

00:49:44.560 | For some reason, it sort of correlates with Barack Obama and such.

00:49:48.760 | And the features, when we were using this model, we didn't have, at that time, image

00:49:54.320 | net features.

00:49:55.320 | Right now, I don't think we'd be making these mistakes.

00:49:57.320 | But generally speaking, what we found in a lot of the data is that there are a lot of

00:50:00.800 | images of animals, which brings us to the next problem, is that if you don't see images

00:50:05.200 | of animals, then the model is confused because it sees a lot of Obama signs, and these are

00:50:08.800 | black and white and blue signs that are appearing a lot.

00:50:14.180 | You can also do images from text, given text, or tags can retrieve relevant images.

00:50:21.240 | This is the dataset itself, about a million images.

00:50:23.240 | It's a nice dataset, and you have very noisy tags.

00:50:27.960 | The question is, can you actually learn some representation from those images?

00:50:32.120 | One thing that I want to highlight here is, we've tried, there's 25,000 labeled images.

00:50:38.040 | Somebody went and labeled what's going on in those images, what classes we see in those

00:50:41.440 | images, and you get some numbers, which is mean average precision.

00:50:45.360 | What's important here is that we found that if we actually use unlabeled data, and we

00:50:49.560 | pre-train these channels separately, using a million unlabeled data points, then we can

00:50:56.220 | actually get some performance improvements.

00:50:58.320 | At least that was a little bit of a happy sign for us to say that unlabeled data can

00:51:03.280 | help in the situations where you don't have a lot of labeled examples.

00:51:07.920 | Here it was helping us, it was helping us a lot.

00:51:11.680 | And then once you get into these representations, dealing with text and images, this is one

00:51:16.760 | particular thing you can do.

00:51:18.120 | I think Richard pointed out what happens in the space of linguistic regularities.

00:51:25.800 | You can do the same thing with images, which is kind of fun to do.

00:51:29.320 | They sometimes work, they don't work all the time.

00:51:31.520 | But here's one example.

00:51:32.520 | I take that particular image at the top, and I say get the representation of this image,

00:51:37.680 | subtract the representation of day, add the night, and then find closest images, you get

00:51:41.680 | these images.

00:51:43.800 | And then you can do some interesting things, like take these kittens and say minus ball

00:51:47.520 | plus box, you get kittens in the box.

00:51:50.640 | If you take this particular image and say minus box plus ball, you get kittens in the

00:51:53.760 | ball.

00:51:55.840 | Except for this thing, that's a duck.

00:51:58.840 | So you can get these interesting representations.

00:52:03.840 | Of course, these are all fun things to look at, but they don't really mean much, because

00:52:07.800 | we're not specifically optimizing for those things.

00:52:11.840 | Now let me spend some time also talking about another class of models.

00:52:18.280 | These are known as Helmholtz machines and variational autoencoders.

00:52:21.560 | These are the models that have been popping up in our community in the last two years.

00:52:27.240 | So what is a Helmholtz machine?

00:52:29.320 | A Helmholtz machine was developed back in '95, and it was developed by Hinton and Peter Dayan

00:52:35.280 | and Brendan Frey and Radford Neal.

00:52:39.200 | And it has this particular architecture.

00:52:42.200 | You have a generative process, so given some latent state, you just-- it's a neural network,

00:52:48.120 | it's a stochastic neural network that generates the input data.

00:52:52.520 | And then you have so-called approximate inference step, which is to say, given the data, infer

00:52:57.800 | approximately what the latent states should look like.

00:53:03.120 | And again, it was developed in '95.

00:53:05.200 | There's something that's called wake-sleep algorithm, and it never worked.

00:53:09.200 | Basically, people just said, it just doesn't work.

00:53:12.800 | And then we started looking at restricted-bolt machines and Boltz machines, because they're

00:53:16.840 | working a little bit better.

00:53:18.440 | And then two years ago, people figured out how to make them work.

00:53:22.080 | And so now, 10 years later, I'm going to show you the trick.

00:53:24.560 | Now, these models are actually working pretty well.

00:53:27.520 | The difference between Helmholtz machines and deep-bolt machines is very subtle.

00:53:31.400 | They almost look identical.

00:53:33.280 | The big difference between the two is that in Helmholtz machines, you have a generative

00:53:38.040 | process that generates the data, and you have a separate recognition model that tries to

00:53:42.400 | recognize what you're seeing in the data.

00:53:44.240 | So you can think of this Q function as a convolutional neural network, given the data, tries to figure

00:53:49.200 | out what the features should look like.

00:53:51.400 | And then there's a generative model, given the features, it generates the data.

00:53:54.840 | Boltzmann machine is sort of similar class of models, but it has undirected connections.

00:53:59.120 | So you can think of it as generative and recognition connections are the same.

00:54:03.160 | So it's sort of a system that tries to reach some equilibrium state when you're running

00:54:09.040 | it.

00:54:10.040 | So the semantics is a little bit different between these two models.

00:54:14.120 | So what is a variational autoencoder?

00:54:15.480 | A variational encoder is a Helmholtz machine.

00:54:17.960 | It defines a generative process in terms of sampling through cascades of stochastic layers.

00:54:24.080 | And if you look at it, there's just a bunch of conditional probability distributions that

00:54:28.040 | you're defining, so you can generate the data.

00:54:30.400 | So theta here will denote the parameters of the variational autoencoders.

00:54:33.920 | You have a number of stochastic layers.

00:54:36.960 | And sampling from these conditional probability distributions, we're assuming that we can

00:54:41.800 | do it.

00:54:42.800 | It's attractable.

00:54:43.800 | It has to be attractable.

00:54:45.680 | But the innovation here is that every single conditional probability can actually be a

00:54:53.680 | very complicated function.

00:54:54.680 | It can denote a nonlinear-- you can model nonlinear relationships.

00:54:58.000 | It can be a multilayer nonlinear neural network, deterministic neural network.

00:55:02.240 | So it becomes fairly powerful.

00:55:05.400 | Here's an example of I have a stochastic layer.

00:55:07.640 | You have a deterministic layer.

00:55:08.640 | You have a stochastic layer, and then you generate the data.

00:55:12.040 | So you can introduce these nonlinearities into these models.

00:55:16.880 | And this conditional probability would denote a one-layer neural network.

00:55:21.960 | Now I'll show you some examples, but maybe I can just give you a little intuition behind

00:55:27.600 | what these equations do.

00:55:31.480 | In a lot of these kinds of models, learning is very hard to do.

00:55:36.800 | And there's a class of models called variational learning.

00:55:39.000 | And what the variational learning is trying to do is basically trying to do the following.

00:55:42.440 | Well, I want to maximize the probability of the data that I observe, but I cannot do it

00:55:47.080 | directly.

00:55:48.080 | So instead, what I'm going to do is I'm going to maximize the so-called variational low

00:55:50.920 | bound, which is this term here.

00:55:54.080 | And it's effectively saying, well, if I take the log of expectation, I can take the log

00:55:58.960 | and push it inside.

00:56:02.240 | And it turns out just logistically, working in this representation is much easier than

00:56:06.760 | working in this representation.

00:56:08.680 | If you go a little bit through the math, it turns out that you can actually optimize this

00:56:13.520 | variational bound, but you can't really optimize this particular likelihood objective.

00:56:19.200 | It's a little bit surprising for those of you who haven't seen variational learning

00:56:23.280 | how it's done.

00:56:24.840 | But this one little trick, this one little so-called Jensen's inequality actually allows

00:56:29.100 | you to solve a lot of problems.

00:56:33.080 | And the other way to write the lower bound is to say, well, there is a log likelihood

00:56:37.360 | function and something that's called KL divergence, which is the distance between your approximating

00:56:42.080 | distribution Q, which is your recognition model, and the truth.

00:56:45.880 | The truth in these models would be the true posterior according to your model.

00:56:51.000 | And it's hard to optimize these kinds of models in general.

00:56:54.560 | You're trying to optimize your generative model, you're trying to optimize your recognition

00:56:57.800 | model.

00:56:58.800 | And back in '95, Hinton and his students basically, they developed this wake-sleep algorithm that

00:57:05.200 | was a bunch of different things put together, but it was never quite the right algorithm

00:57:09.280 | because it wasn't really optimizing anything.

00:57:10.840 | It was just a bunch of things alternating.

00:57:14.560 | But in 2014, there was a beautiful trick introduced by Kingman and Welling, and there was a few

00:57:19.880 | other groups that came up with the same trick called reparameterization trick.

00:57:24.060 | So let me show you what reparameterization trick does intuitively.

00:57:27.960 | So let's say your recognition distribution is a Gaussian.

00:57:32.160 | So a Gaussian, I can write it as a mean and a variance.

00:57:35.700 | So this is the mean, this is the variance.

00:57:37.160 | Notice that my mean depends on the layer below.

00:57:41.200 | It could be a very nonlinear function.

00:57:43.720 | The variance also depends on the layer below, so it could also be a nonlinear function.

00:57:49.840 | But what I can do is I can actually do the following.

00:57:52.080 | I can express this particular Gaussian in terms of auxiliary variables.

00:57:56.880 | I can say, well, if I sample this epsilon from normal 0, 1, a Gaussian distribution,

00:58:02.240 | then I can write this particular h, my state, in a deterministic way.

00:58:09.680 | It's just mean plus essentially standard deviation or variance, square root of the variance,

00:58:16.440 | times this epsilon.

00:58:17.440 | So this is just a simple parameterization of the Gaussians.

00:58:21.920 | I'm just pulling out the mean and the variance.

00:58:23.960 | There's no surprises here.

00:58:27.040 | So I can write my recognition model as this Gaussian, or I can write it in terms of noise

00:58:32.720 | plus the deterministic part.

00:58:34.860 | So the recognition distribution can be represented as a deterministic mapping.

00:58:38.400 | And that's the beauty, because it turns out that you can collapse these complicated models

00:58:43.960 | effectively into autoencoders.

00:58:46.160 | And we know how to deal with autoencoders.

00:58:47.520 | We can back propagate through the entire model.

00:58:51.380 | So we have a deterministic encoder.

00:58:53.320 | And then the distribution of these auxiliary variables really don't depend on parameters.

00:58:58.960 | So it's almost like taking a stochastic system and separating the stochastic part and deterministic

00:59:04.560 | part.

00:59:05.560 | In deterministic part, you can do back propagation, so you can do learning.

00:59:08.420 | And the stochastic part, you can do sampling.

00:59:10.720 | So just think of it as a separation between the two pieces.

00:59:16.280 | So now, if I take the gradient of the variational bound or the variational objective with respect

00:59:21.820 | to parameters, this is something that we couldn't do back in '95, and we couldn't do it in the

00:59:25.780 | last 10 years.

00:59:27.800 | People have tried using reinforce algorithm or some approximations to it.

00:59:30.860 | It never worked.

00:59:32.500 | But here what we can do is we can do the following.

00:59:34.060 | We can say, well, I can write this expression, because it's a Gaussian, as sampling a bunch

00:59:39.660 | of these auxiliary variables.

00:59:41.780 | And then this log, I can just inject the noise in here.

00:59:44.520 | The whole thing here becomes deterministic.

00:59:47.340 | And that's where the beauty comes in.

00:59:48.780 | You take this gradient here, and you push it inside the expectation.

00:59:55.020 | So before, if you take the gradient of expectations, like taking the gradient of averages, like

01:00:00.940 | you compute a bunch of averages, and you're taking the gradient.

01:00:03.460 | What you're doing now with reparameterization trick is you're taking the gradients and then

01:00:07.780 | taking the average.

01:00:09.620 | It turns out that hugely reduces the variance in your training.

01:00:14.020 | It actually allows you to learn these models quite efficiently.

01:00:18.140 | So the mapping edge here is completely deterministic.

01:00:21.020 | And gradients here can be computed by back propagation.

01:00:23.580 | It's a deterministic system.

01:00:25.420 | And you can think of this thing inside as just an autoencoder that you are optimizing.

01:00:33.080 | And obviously, there are other extensions of these models that we've looked at and a

01:00:36.900 | bunch of other teams looked at where you can say, well, maybe we can improve these models

01:00:40.220 | by drawing multiple samples.

01:00:42.180 | These are so-called case samples, importance weighting bounds.

01:00:44.980 | And so you can make them a little bit better, a little bit more precise.

01:00:48.860 | You can model a little bit more complicated distributions over the posteriors.

01:00:54.780 | But now, let me step back a little bit and say, why am I telling you about this?

01:00:58.900 | What's the point?

01:00:59.900 | There's a bunch of equations.

01:01:00.900 | You're injecting noise.

01:01:01.900 | Why do we need noise?

01:01:02.900 | Why do we need stochastic systems in general?

01:01:06.380 | Here's a motivating example.

01:01:09.120 | We wanted to build a model that, given captions, we want to generate the image.

01:01:15.140 | And my student was very ambitious and basically said, I want to be able to just give you any

01:01:19.820 | sentence and I want to be able to generate image, kind of like an artificial paint.

01:01:24.540 | I want to paint what's in my caption in the most general way.

01:01:31.120 | So this is one example of a Helmholtz machine where you have a generative model, which is

01:01:34.500 | a stochastic recurrent network.

01:01:35.500 | It's just a chain sequence of variational autoencoders.

01:01:38.500 | And there's a recognition model, which you can think of as a deterministic system, like

01:01:42.260 | a convolutional system that tries to approximate what the latent states are.

01:01:47.060 | But why do I need stochasticity here?

01:01:49.620 | Why do I need variational autoencoders here?

01:01:51.500 | And the reason is very simple.

01:01:53.620 | Suppose I give you the following task.

01:01:57.580 | I say a stop sign is flying in blue skies.

01:02:01.780 | Now if you were using a deterministic system, like an autoencoder, you would generate one

01:02:06.580 | image because it's a deterministic system.

01:02:08.940 | Given input, I give you output.

01:02:11.940 | Once you have stochastic system, you inject this noise, this latent noise, that allows

01:02:16.880 | you to actually generate the whole space of possible images.

01:02:19.860 | So for example, it tends to generate this stop sign and this stop sign.

01:02:22.900 | They look very different.

01:02:23.900 | And there's a car here.

01:02:26.140 | So maybe it's not really flying.

01:02:27.140 | It just can't draw the pole here.

01:02:30.380 | This one looks like there are clouds.

01:02:33.060 | Here is this yellow school bus is flying in blue skies.

01:02:35.660 | So here we wanted to test the system to see does it understand something about what's

01:02:41.380 | in the sentence.

01:02:42.380 | Here is a herd of elephants is flying in blue skies.

01:02:45.500 | Now we cannot generate elephants, although there are now techniques that are getting

01:02:48.820 | better.

01:02:49.820 | But sometimes it generates two of them.

01:02:52.140 | And it's a commercial plane flying in blue skies.

01:02:54.940 | But this is where we need stochasticity because we want to be able to generate the whole distribution

01:02:58.780 | of possible outcomes, not necessarily just one particular point.

01:03:04.940 | We can basically do things like a yellow school bus parked in the parking lot versus a red

01:03:09.380 | school bus parked in the parking lot versus a green school bus parked in the parking lot.

01:03:13.940 | It's sort of a blue school bus.

01:03:17.020 | We can't quite generate blue school buses, but we've seen blue cars and we've seen blue

01:03:21.100 | buses.

01:03:22.100 | So it can make an association to draw these different things.

01:03:25.700 | They look a little bit fuzzy.

01:03:28.180 | But in terms of comparing to different models, if I give you a group of people on the beach

01:03:33.860 | with surfboards, this is what we can generate.

01:03:37.260 | There is another model called LabGap model, which is a model based on adversarial neural

01:03:41.340 | networks, something I'll talk as the last part of this talk.

01:03:45.580 | And there is these models, convolutional and deconvolutional variation autoencoders, which

01:03:50.140 | is again convolutional deconvolutional autoencoders just with some noise.

01:03:54.460 | And you can certainly see that it's generally we found it's very hard to be able to generate

01:03:59.700 | scenes with arbitrary inputs as a text.

01:04:04.660 | Here's my favorite one.

01:04:06.020 | A toilet seat sits open in the bathroom.

01:04:08.980 | I don't know if you can see toilet seats here, maybe, but you can see a toilet seat sits

01:04:13.100 | open in the grass field.

01:04:14.140 | That was a little bit better.

01:04:15.860 | At least the colors were quite right.

01:04:18.100 | And when we put this paper on archive, one of the students basically came to me and said,

01:04:26.260 | "This is really bad because you can always ask Google."

01:04:30.700 | If you type that particular query into Google search, it gives you that.

01:04:37.420 | Which was a little bit disappointing.

01:04:40.460 | But now if you actually Google or if you actually put this query into Google, this image comes

01:04:46.300 | before this image.

01:04:49.500 | And generally because what's happening is that people are just clicking on that image

01:04:52.840 | all the time to figure out what's going on in that image.

01:04:55.760 | So we got bumped up before that other image.

01:04:59.560 | So now I can say that according to Google, this is a much better representation for that

01:05:04.180 | sentence than this one.

01:05:07.780 | Here's another sort of interesting model, which is a model where you're trying to build

01:05:14.380 | a recurrent neural network.

01:05:15.380 | Again, it's a generative model, but it's a generative model of text.

01:05:19.400 | This model was trained on about 7,000 romance novels.

01:05:25.180 | And you take a caption model and you hook it up to the caption generation system.

01:05:30.620 | So you're basically saying the model, here's an image, generate me in the style of romantic

01:05:36.940 | books what you'd see here.

01:05:39.900 | And it generates something interesting.

01:05:42.220 | We're barely able to catch the breeze on the beach and so forth.

01:05:45.980 | She's beautiful, but the truth is I don't know what to do.

01:05:49.040 | The sun was just starting to fade away, leaving people scattered around the Atlantic Ocean.

01:05:56.580 | And there are a bunch of different things that you can do.

01:05:58.660 | Obviously, we're not there yet in terms of generating romantic stories.

01:06:02.960 | But here's one example where it's a generative model.

01:06:05.900 | It seems like syntactically we can actually generate reasonable things.

01:06:11.660 | Semantically, we're not there yet.

01:06:14.860 | And actually, that particular work was inspired a little bit by Baidu's system that would

01:06:20.820 | give an image.

01:06:21.820 | I think it would generate poems.

01:06:25.320 | But the poems were predefined.

01:06:27.020 | It was mostly selecting the right poem for the image.

01:06:30.700 | Here we actually were trying to generate something.

01:06:35.900 | So there's still a lot of work to do in that space.

01:06:37.880 | Because syntactically we can get there.

01:06:39.900 | Semantically, we are nowhere near getting the right structure.

01:06:44.580 | Here's another last example that I want to show you.

01:06:48.300 | This was done in the case of one-shot learning, which is can you build generative model of

01:06:52.260 | characters?

01:06:53.260 | That's a very defined domain, very well-defined domain.

01:06:57.300 | It's a very simple domain, but it's also very hard.

01:07:01.140 | Here's one example.

01:07:02.140 | We've shown this example to people and to the algorithm.

01:07:06.300 | And we can say, well, can you draw me this example?

01:07:09.060 | And on one panel, humans would draw how they believe this example should look like.

01:07:14.700 | And then on the other panel, we have machines drawing it.

01:07:17.420 | So this is really just a generative model based on a single example.

01:07:22.060 | We're showing you an example, and you're trying to generate what it is.

01:07:25.260 | And so quick question for you.

01:07:26.840 | How many of you think this was machine-generated and this was human-generated?

01:07:32.980 | What about the other way around?

01:07:34.660 | More, more.

01:07:36.860 | So there's a vote.

01:07:38.260 | What about this one?

01:07:39.260 | How many of you think this is machine-generated and this is human-generated?

01:07:42.860 | A few.

01:07:43.860 | What about the other way around?

01:07:44.860 | Ah, great, great.

01:07:45.860 | Well, the truth is I don't really know which one was generated by which machine.

01:07:52.660 | Because that was done, I should actually ask Brendan Lake, who designed the experiments

01:07:56.820 | for this particular model.

01:07:58.380 | But I can tell you that there's been a lot of studies.

01:08:01.580 | He's done a lot of studies, and it's almost 50/50.

01:08:05.480 | So in sort of this kind of small, carved domain, we can actually compete with people trying

01:08:11.940 | to generate these characters.

01:08:16.220 | Now let me step back a little bit and tell you about a different class of models.

01:08:21.580 | These are models known as generative adversarial networks, and they've been gaining a lot of

01:08:28.540 | attraction in our community because they seem to produce remarkable results.

01:08:34.220 | So here's the idea.

01:08:36.940 | We're not going to be really defining explicitly the density, but we need to be able to sample

01:08:41.700 | from the model.

01:08:43.780 | And the interesting thing is that there's no variation learning, there's no maximum

01:08:46.460 | likelihood estimation, there's no Markov chain Monte Carlo, there's no sampling.

01:08:50.380 | How do you do that?

01:08:51.660 | How do you learn these models?

01:08:53.520 | And it turns out that you can learn these models by playing a game.

01:08:56.140 | And that's a very clever strategy.

01:08:58.980 | And the idea is the following.

01:09:00.660 | You're going to be setting up a game between two players.

01:09:03.020 | You're going to have a discriminator, D, think of it as a convolutional neural network, and

01:09:08.620 | then you're going to have a generator, G. Maybe you can think of it as a variation learning

01:09:12.540 | core or a Helmholtz machine or something that gives you samples from the data.

01:09:17.980 | The discriminator, D, is going to be discriminating between a sample from the data distribution

01:09:24.380 | and a sample from the generator.

01:09:26.780 | So the goal of the discriminator is to basically say, is this a fake sample or is this a real

01:09:32.020 | sample?

01:09:33.020 | A fake sample is a sample generated by the model.

01:09:36.260 | A real sample is what you see in your data.

01:09:39.060 | Can you tell the difference between the two?

01:09:41.860 | And the generator is going to be trying to fool the discriminator by trying to generate

01:09:47.100 | samples that are hard for the discriminator to discriminate.

01:09:50.820 | So my goal as a generator would be to generate really nice looking digits so that the discriminator

01:09:56.300 | wouldn't be able to tell the difference between the simulated and the real.

01:10:00.540 | That's the key idea.

01:10:02.940 | And so here is intuitively what that looks like.

01:10:08.540 | Let's say you have some data, so images of faces.

01:10:13.180 | I give you an image of a face, and now I have a discriminator that basically says, well,

01:10:17.900 | if I get a real face, I push it through some function, some differentiable function.

01:10:23.000 | Think of it as a convolutional neural network or another differentiable function.

01:10:27.140 | And here I'm outputting one.

01:10:29.260 | So I want to output one if it's a real sample.

01:10:32.820 | Then you have a generator, and generator is you have some noise, so input noise.

01:10:37.460 | Think of it as a Gaussian distribution.

01:10:38.780 | Think about Helmholtz machines.

01:10:41.220 | Given some noise, I go through differentiable function, which is your generator, and I generate

01:10:46.100 | a sample.

01:10:47.100 | This is how my sample might look like.

01:10:49.980 | And then on top of it, I take this sample, I put it into my discriminator, and I say,

01:10:55.260 | for my discriminator, I want to output zero.

01:10:59.260 | Because my discriminator will have to say, well, this is fake, and this is real.

01:11:05.140 | That's the goal.

01:11:06.140 | And the generator basically says, well, how can I get a sample such that my discriminator

01:11:12.260 | is going to be confused, such that the discriminator always outputs one here, because it believes

01:11:17.260 | it's a true sample, believes it's coming from the true data.

01:11:21.960 | So now you have these systems.

01:11:23.960 | So what's the objective?

01:11:26.660 | The objective is a min-max value function.

01:11:29.740 | It's a very intuitive objective function that has the following structure.

01:11:35.940 | You have a discriminator that says, well, this is an expectation with respect to distribution,

01:11:40.740 | data distribution.

01:11:41.740 | So this is basically saying, I want to classify any data points that I get from my data as

01:11:46.420 | being real.

01:11:48.140 | So I want this output to be one, because if it's one, the whole thing is going to be zero.

01:11:52.340 | If it's less than one, it's going to be negative.

01:11:55.780 | And I really want to maximize it.

01:11:57.960 | And then discriminator is saying, well, any time I generate a sample, whatever samples

01:12:01.920 | comes out from my generator, I want to classify it as being fake.

01:12:07.560 | That's the goal of the discriminator.

01:12:10.600 | And then there's a generator.

01:12:11.600 | The generator is sort of the other-- you try to minimize this function, which essentially

01:12:15.560 | says, well, generate samples that discriminator would classify as real.

01:12:20.920 | So I really am going to try to change the parameters of my generator such that this

01:12:26.080 | would produce zero.

01:12:27.880 | Oh, sorry.

01:12:29.240 | So the discriminator would produce one.

01:12:32.280 | So trying to fool the discriminator.

01:12:36.480 | And it turns out the optimal strategy for discriminator is this ratio, which is probability

01:12:40.800 | of the data divided by the probability of the data plus probability of the model.

01:12:44.740 | And in general, if you succeed in building a good generative model, then probability

01:12:48.940 | of the data would be the same as probability of the model.

01:12:50.920 | So discriminator will always be confused to one half.

01:12:56.720 | And here's one particular-- it seems like a simple idea, but it turns out to work remarkably

01:13:01.680 | well.

01:13:02.680 | Here's an architecture called deconvolutional generative adversarial network architecture

01:13:08.480 | that takes the code-- this is a random code.

01:13:10.960 | It's a Gaussian code.

01:13:12.320 | It passes through a sequence of convolutions, a sequence of deconvolutions.

01:13:16.480 | So given the code, you sort of deconvolve it back to high dimensional image.

01:13:23.080 | And you train it using adversarial setting.

01:13:25.900 | This is your sampling.

01:13:27.320 | You generate the image.

01:13:28.760 | And then there is a discriminator, which is just a convolutional neural network that's

01:13:31.720 | trying to say, is that real?

01:13:33.040 | Is that a fake?

01:13:34.200 | And if you train these models on bedrooms-- these are called L-SUN data sets, a bunch

01:13:39.820 | of bedrooms-- this is how samples from the model would look like, which is pretty impressive.

01:13:46.200 | And in fact, when I look at these samples, I'm also sort of thinking, well, maybe the

01:13:49.840 | model is memorizing the data, because these samples look remarkably impressive.

01:13:55.840 | Then there was a follow up work.

01:13:57.900 | These are samples from the CFAR data set.

01:14:01.320 | So here you're seeing training samples.

01:14:04.040 | And here you're seeing samples generated from the model, which is, again, very impressive.

01:14:09.400 | If you look at the structure in these samples, it's quite remarkable that you can generate

01:14:14.000 | samples that look very realistic, actually.

01:14:20.840 | This is what's done-- again, this was done by Tim Salomon and his collaborators.

01:14:26.580 | If you look at the ImageNet and you look at the training data on the ImageNet and looking

01:14:31.040 | at the samples, again, you look-- this is a horse.

01:14:33.800 | This is like-- there's some animal that is an airplane and so forth.

01:14:38.000 | There's like some kind of a truck and such.

01:14:41.520 | So it looks-- when I look at these images, I was very impressed by the quality of these

01:14:47.800 | images, because generally, it's very, very hard to generate realistic looking images.

01:14:52.120 | And the last thing I want to point out-- this was picked up by Ian Goodfellow.

01:14:58.100 | If you cherry pick some of the examples, this is what generated images look like.

01:15:02.920 | So you can sort of see there is a little bit of interesting structure that you're seeing

01:15:09.280 | in these samples.

01:15:12.040 | And one question still remains with these models is, how can we evaluate these models

01:15:17.600 | properly?

01:15:18.600 | Is the model really learning a space of all possible images and how images-- what's the

01:15:24.640 | coherency in those images?

01:15:26.280 | Or is the model mostly kind of like blurring things around and just making some small changes

01:15:32.040 | to the data?

01:15:33.040 | So the question that I would really want-- would like to answer-- to get an answer to

01:15:37.200 | is that, if I show you a new example, a new test image, a new kind of a horse, would the

01:15:42.480 | model say, yes, this is a likely image?

01:15:45.480 | This is very probable images.

01:15:47.120 | I've seen similar images before or something like that or not.

01:15:51.760 | So that still remains an open question.

01:15:54.600 | But again, this is the class of models which steps away from maximum likelihood estimation,

01:15:58.960 | sort of sets it up in a game theoretic framework, which is a really nice set of work.

01:16:05.800 | And in the computer vision community, a lot of people are showing a lot of progress in

01:16:08.920 | using these kinds of models because they tend to generate much more realistic-looking images.

01:16:15.760 | So let me just summarize to say that I've shown you, hopefully, a set of learning algorithms

01:16:21.160 | for deep unsupervised models.

01:16:23.560 | There's a lot of space in these models, a lot of excitement in that space.

01:16:27.640 | And I just wanted to point out that these models, the deep models, they improve upon

01:16:31.120 | current state of the art in a lot of different application domains.

01:16:34.400 | And as I mentioned before, there's been a lot of progress in discriminative models,

01:16:38.440 | convolutional models, using recurrent neural networks for solving action recognition models,

01:16:43.680 | dealing with videos.

01:16:45.280 | And unsupervised learning still remains a field where we've made some progress, but

01:16:53.320 | there's still a lot of progress to be made.

01:16:56.560 | And let me stop there.

01:16:58.400 | So thank you.

01:16:59.400 | [APPLAUSE]

01:16:59.400 | Go to the mics.

01:17:21.880 | So as a Bayesian guy, I'm pretty depressed by the fact that GAN can generate a clearer

01:17:28.760 | image than the variational autoencoder.

01:17:31.160 | So my question is, do you think there could be an energy-based framework or a probabilistic

01:17:37.640 | interpretation of why GAN is so successful, other than it's just a MiMAX game?

01:17:43.240 | I think that generally, if you look at-- I sort of go back and forth between variational

01:17:47.520 | autoencoders, because some of my friends at OpenAI are saying that they can actually generate

01:17:52.960 | really nice-looking images using variational autoencoders.

01:17:57.320 | I'm looking at Peter here.

01:17:59.440 | But I think that one of the problems with image generation today is that with variational

01:18:06.440 | autoencoders, there is this notion of Gaussian loss function.

01:18:11.320 | And what it does is it basically says, well, never produce crystal-clear images, because

01:18:17.760 | if you're wrong, if you put the edge in the wrong place, you're going to be penalized

01:18:22.760 | a lot because of the L2 loss function.

01:18:26.800 | What the GANs are doing, GANs are basically saying, well, I don't really care where I

01:18:30.720 | put the edge, as long as it looks realistic so that I can fool my classifier.

01:18:35.000 | So what tends to happen in practice, and a lot of times, if you actually look at the

01:18:38.880 | images generated by GANs, sometimes they have a lot of artifacts, like these specific things

01:18:43.920 | that pop up.

01:18:47.240 | Whereas in variational autoencoders, you don't see that.

01:18:49.560 | But again, the problem with variational autoencoders is they tend to produce images that are much

01:18:52.840 | more diffused or not as sharp or not as clear as what GANs is doing.

01:18:58.080 | And there's been some work on trying to sharpen the images, which is you're using variational

01:19:02.120 | autoencoders to generate the globally coherent scene, and then you're using generative adversarialness

01:19:07.960 | to maybe sharpen it.

01:19:10.120 | Again, it depends what loss function you're using.

01:19:13.400 | And GANs seem to be able to deal with that problem implicitly, because they don't really

01:19:18.240 | care whether you get the edge quite right or not, as long as it fools your classifier.

01:19:25.400 | Thank you.

01:19:26.400 | Hi.

01:19:27.400 | Thank you very much for the interesting talk.

01:19:31.440 | I have a question about the variational autoencoder.

01:19:37.760 | For the more challenging data set, like the street view house number, I noticed that many

01:19:42.920 | implementation, they use a PCA to preprocess the data before they train the model.

01:19:48.720 | What is your thought on that preprocessing step?

01:19:52.120 | Why is it necessary to do that?

01:19:54.080 | Why don't we just learn from the raw pixel?

01:19:57.040 | I actually don't know.

01:19:59.880 | My experience has been that we don't really do a lot of preprocessing.

01:20:03.640 | What you can do is you can do ZCA preprocessing, and you can take the mean, you can take the

01:20:07.560 | second order covariance structure from the data.

01:20:10.680 | That sometimes helps, sometimes it doesn't.

01:20:13.000 | But I don't see any particular reason why you'd want to do PCA preprocessing.

01:20:17.400 | It's just one of-- just like we've seen a lot in our field, people just doing x, y,

01:20:24.760 | and then later on they figure out that they don't really need x and y.

01:20:30.200 | Maybe it was working better for their implementation, for their particular task, but generally I

01:20:33.760 | haven't seen people doing a lot of preprocessing using PCA for training variational timecodes.

01:20:46.520 | Any more questions?

01:20:47.520 | Yes, there's one.

01:20:49.520 | This is regarding binary RBMs.

01:20:55.240 | So if you look at the literature for, let us say, estimation of the partition function

01:21:00.780 | for Ising models, you will see that the literature is a lot more rich compared to the variational

01:21:06.520 | inference literature for restricted Boltzmann machines, especially in the binary context.

01:21:10.600 | Is there a cultural reason for this?

01:21:12.200 | Because specifically, you have for the strictly ferromagnetic case, you have a fully polynomial

01:21:18.680 | time approximation scheme for estimating the log partition function.

01:21:24.080 | But then I don't see usage of these FPRAS algorithms in the RBM space.

01:21:29.640 | So when you juxtapose the literature for the Ising models compared to binary RBMs, you'll

01:21:35.260 | find a very stark asymmetry.

01:21:36.980 | Is there a reason for this?

01:21:38.260 | Yeah, so the thing about Ising models is that if you're in a ferromagnetic case, or if you

01:21:45.100 | have certain particular structure to the Ising models, you can use a lot of techniques.

01:21:48.420 | Even if you use techniques like coupling from the past, you can draw exact samples from

01:21:51.220 | the models.

01:21:52.560 | You can compute the log partition function of polynomial time if you have a specific

01:21:55.460 | structure.

01:21:56.460 | But the problem with RBMs is that generally those assumptions don't apply.

01:22:00.360 | You cannot learn a model which is a ferromagnetic model with RBMs, just where all your weights

01:22:03.980 | are positive.

01:22:04.980 | That's a lot of constraints to put on these class of models.

01:22:08.820 | So that's why-- and once you get outside of these assumptions, then the problem becomes

01:22:14.740 | NP-hard for estimating the partition function.

01:22:17.980 | And obviously, for learning these systems, you need the gradient of the log of the partition

01:22:20.940 | function.

01:22:22.060 | And that's where all the problems come in.

01:22:23.380 | I don't think there is a solution for that.

01:22:27.300 | And unfortunately, variational methods are also not working as well as approximations

01:22:33.180 | like contrastive divergence, or something based on sampling.

01:22:37.660 | People have looked at better approximations and using more sophisticated techniques, but

01:22:41.340 | it hasn't really popped up yet.

01:22:43.620 | Practically, it just doesn't work as well.

01:22:46.580 | But it's a good question.

01:22:49.580 | Hello.

01:22:50.580 | I'm curious about using auto-coder to get semantic hash, especially in text.

01:23:00.340 | Do we need any special representation, text representation, like word to vector as input

01:23:07.620 | for our text sequence?

01:23:11.100 | So I've talked about the model, which is a very simple model, which is modeling bag of

01:23:15.580 | words.

01:23:16.580 | Yes.

01:23:17.580 | You can use word2vec and initialize the model, because it's a way of just taking your words

01:23:22.100 | and projecting them into the semantic space.

01:23:25.460 | There has been a lot of recent techniques using, like Richard was mentioning, GRUs as

01:23:31.300 | a way, if you want to work with sentences, or if you want to embed the entire document

01:23:36.340 | into the semantic space, and if you want to make it binary, you can use GRUs, bidirectional

01:23:41.780 | GRUs, to get the representation of the document.

01:23:44.580 | I think that would probably work better than using word2vec and then just adding things

01:23:47.300 | up.

01:23:48.300 | And then based on that, you can learn a hashing function that maps that particular representation

01:23:52.140 | to the binary space, in which case you can do searching fairly efficiently.

01:23:57.340 | So as an input representation, there are lots of choices.

01:23:59.700 | You can use bidirectional GRUs, which is the method of choice right now.

01:24:06.940 | You can use GloVe, or you can use word2vec and sum up the representations of the words.

01:24:14.100 | Okay, so using only back of word, we use only normal network, that is no recurrence network,

01:24:20.620 | or just simple network?

01:24:22.300 | Yeah, that's right.

01:24:23.300 | But again, your representation can be whatever that representation is, as long as it's differentiable,

01:24:26.820 | right?

01:24:27.820 | Because in this case, you can back propagate through the bidirectional GRUs and learn what

01:24:34.900 | these representations should be.

01:24:35.900 | Okay, thank you so much.

01:24:36.900 | Okay, let's thank Russ again.

01:24:41.900 | (audience applauding)

Foundations of Unsupervised Deep Learning (Ruslan Salakhutdinov, CMU)

Chapters