Stanford CS224N NLP with Deep Learning | 2023 | Lecture 16 - Multimodal Deep Learning, Douwe Kiela

So today, I'm delighted to introduce our first invited speaker, 00:00:12.800 | 
Douwe, as well as being invited, and I'll tell you his background, 00:00:17.920 | 
has also been an adjunct professor in the Symbolic Systems program and 00:00:23.840 | 
has been involved with some students in that role as well. 00:00:26.680 | 
But in his invited role, he's originally from the Netherlands, 00:00:30.640 | 
where he even learned some logic, among other things, back in the old days. 00:00:36.680 | 
he's been a prominent deep learning researcher. 00:00:41.160 | 
For a number of years, he worked at Facebook, now Meta, in the FAIR unit, 00:00:47.600 | 
and was involved in various ideas, including retrieval augmented generation. 00:00:54.000 | 
After that, he then spent some time at Hugging Face. 00:00:57.760 | 
He's become interested in looking at multimodal models, 00:01:01.560 | 
which is what he's gonna be talking about today. 00:01:20.520 | 
I understand that you get points for being here, so 00:01:27.160 | 
So I'm gonna talk about multimodal deep learning. 00:01:29.680 | 
It's gonna have an NLP focus, of course, as for this course. 00:01:32.800 | 
But it's also because otherwise I would really be talking for 00:01:38.880 | 
So I'll try to really keep it focused on the things that I think will be 00:01:44.760 | 
And so the first thing you should understand is that this whole concept of 00:01:48.760 | 
multimodality is kind of ill-defined, actually. 00:01:52.360 | 
So if you go to the dictionary, you'll see that it means having or 00:01:56.160 | 
involving several modes or modalities or maxima. 00:02:00.720 | 
And so what mode here really means is, so it could be mode in the very generic sense, 00:02:06.080 | 
or it could be a very precise sense of the mode of a statistical distribution. 00:02:11.240 | 
And so depending on the paper you're reading, in some cases, people really mean the mode of a statistical distribution. 00:02:16.200 | 
In other cases, people really mean this sort of very vague concept of a modality, 00:02:20.920 | 
where it really means the type of information that you're getting. 00:02:23.840 | 
So an example of modality in that case is an image or speech signal or 00:02:28.480 | 
audio in general, or even olfaction, so smell or things like that. 00:02:33.160 | 
So in this lecture, we're just gonna focus mostly on text, 00:02:38.640 | 
because this is an NLP course, and we're gonna focus on images mostly as the other modality. 00:02:50.880 | 
And so there are a couple of really good reasons in general for this. 00:02:58.320 | 
So if you look at how we humans understand the world, 00:03:01.200 | 
how we make sense of what happens in the world, that is very multimodal, right? 00:03:06.720 | 
So we perceive the world not just using vision or just audio, but 00:03:11.040 | 
we synthesize information across all of these different modalities, and 00:03:14.600 | 
that's how we understand the world and each other. 00:03:17.800 | 
There's also a very practical argument for doing it. 00:03:20.680 | 
It's because the Internet is multimodal, right? 00:03:23.000 | 
So if you go to, I don't know, Facebook or something like that, 00:03:27.080 | 
it rarely happens that it's just text or just an image. 00:03:29.760 | 
There's usually a combination of multiple modalities. 00:03:33.160 | 
And then the final good reason that we're just starting to hit now, 00:03:37.920 | 
if you're really following where the field is going, 00:03:40.760 | 
we're kind of running out of text data for these large language models. 00:03:44.000 | 
So one interesting way to keep scaling on the data side 00:03:48.200 | 
is to make use of all of these other modalities, right? 00:03:51.000 | 
So if you can have your language model also watch all of the videos of cats in 00:03:55.080 | 
the world, it's gonna understand the concept of cat much better. 00:03:59.160 | 
And that's what we want to have in these models. 00:04:00.760 | 
We want them to understand the world in the same way that humans understand it. 00:04:05.040 | 
So right now, multimodality is really one of the main frontiers of this new 00:04:10.960 | 
foundation model drive that we're all in right now. 00:04:19.640 | 
But so what we'll see when this loads is this guy over here, 00:04:25.240 | 
and we'll have the same audio effect being played. 00:04:28.960 | 
So the audio is exactly the same, and this man is gonna say something like, 00:04:35.920 | 
And so you're hearing a bee there, I think, if you look at my mouth, 00:04:41.360 | 
But if you then change the video to where he says, bah, bah, bah, 00:04:46.200 | 
with exactly the same audio, you're going to hear the other version. 00:04:50.680 | 
So unfortunately, I can't really swap in the different audio here, so 00:04:55.640 | 
We might suddenly start hearing a guy saying, bah, bah, bah, and then. 00:05:02.480 | 
multimodal applications, so when we have multiple modalities, 00:05:11.640 | 
And as I said, most of the use cases we have on the Internet, 00:05:16.320 | 
And there are some really kind of obvious things we would be interested in 00:05:20.440 | 
if we have information from these different data sources, 00:05:28.000 | 
So maybe given a bit of text, we want to find the right image. 00:05:31.560 | 
Or maybe given some image, we want to find the right text for it so 00:05:36.440 | 
Obviously, we can also do this in a generative setting. 00:05:38.600 | 
So then we have image captioning, which you probably heard of. 00:05:43.640 | 
And the other way around, that's image synthesis, so Stable Diffusion. 00:05:46.200 | 
Everybody in the audience here has probably seen that. 00:05:48.880 | 
Then we can do visual question answering, where we have an image and text. 00:05:54.920 | 
We have multimodal classification, where we have image and text, and 00:05:57.560 | 
we need to have a label, for example, whether something is hate speech or not. 00:06:01.160 | 
And then in general, we want to be able to have a richer understanding of 00:06:05.920 | 
information, which means that we combine images and text and then use it for 00:06:09.360 | 
downstream applications that require better understanding or better generation. 00:06:20.000 | 
I predict that this paper is going to do really well in terms of citations, 00:06:25.880 | 
but I think a lot of people are not actually going to read it. 00:06:29.000 | 
And so, I mean, I've been in this field for quite a while now, and 00:06:32.080 | 
people have been saying this for a really long time. 00:06:35.160 | 
So for decades, people have been saying that multimodal is the next big thing. 00:06:44.080 | 
the outline for what we're gonna be talking about. 00:06:46.520 | 
So first, I'm gonna tell you a little bit about early models. 00:06:49.560 | 
Then we're gonna do a bit of a deep dive on some of the specifics. 00:06:53.200 | 
Then we're gonna go over a particular type of fusion, 00:06:58.640 | 
Then we're gonna go through a little bit of the history of 00:07:03.880 | 
Then we're gonna talk a little bit about evaluation, 00:07:07.600 | 
And then I'll make some predictions for the future, and hopefully maybe give you 00:07:10.680 | 
some cool research ideas or things to talk or think about. 00:07:17.080 | 
there's a lot of work that happened before deep learning. 00:07:19.920 | 
But I think if you want to start from the deep learning revolution and 00:07:23.680 | 
what was happening in images and text, then a good starting point is, 00:07:29.080 | 
for example, WSABIE or DeViSE, or Richard Socher, 00:07:33.880 | 
who you've probably heard of, has done some really cool early work in this. 00:07:40.600 | 
And the basic gist of this is that we have a vision model on the one hand, 00:07:48.760 | 
the first lecture of this course I think was about word embeddings, right? 00:07:51.600 | 
So that's just your basic word embedding model. 00:07:54.240 | 
And now we need to figure out how to align them in the same multimodal space. 00:07:58.560 | 
So the way you do that is you get some sort of similarity metric, right? 00:08:03.720 | 
if you're thinking about this from a support vector machine literature perspective. 00:08:07.400 | 
And now you need to figure out, with a max-margin or ranking loss, 00:08:13.160 | 
how you want to align these two points in your embedding space, right? 00:08:16.360 | 
So things that are similar, you want to bring them closer together. 00:08:18.800 | 
Things that are not, you want to bring them further apart. 00:08:21.720 | 
And if you do that in this multimodal embedding space, 00:08:24.960 | 
that means that you can do interesting cross-modal transfer, 00:08:28.840 | 
where you can take the word embedding for something like auto or like horse, 00:08:32.840 | 
and then you can find close images in the embedding space to that thing. 00:08:42.280 | 
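A minimal sketch of that kind of max-margin alignment loss, assuming we already have word embeddings and image features projected to the same dimensionality (all names and sizes here are illustrative):

```python
import torch
import torch.nn.functional as F

def ranking_loss(word_vecs, image_vecs, margin=0.2):
    """Hinge / max-margin loss that pulls matched word-image pairs together
    and pushes mismatched pairs apart (row i of each tensor is a true pair)."""
    # Cosine similarities between every word and every image: (N, N)
    sims = F.normalize(word_vecs, dim=-1) @ F.normalize(image_vecs, dim=-1).T
    pos = sims.diag().unsqueeze(1)                  # similarity of true pairs, (N, 1)
    # For each word, every other image in the batch acts as a negative
    loss = torch.clamp(margin - pos + sims, min=0)  # hinge: want pos > neg + margin
    loss = loss - torch.diag(loss.diag())           # don't penalize the true pair itself
    return loss.mean()

# toy usage: 8 word embeddings and 8 matching image features, both 128-d
loss = ranking_loss(torch.randn(8, 128), torch.randn(8, 128))
```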
And I think a lot of this stuff that I'm going to talk about in the early slides, 00:08:46.400 | 
you're going to see this thing come over and over again. 00:08:49.000 | 
You're going to see it get kind of reinvented with fancier models, 00:08:54.760 | 
So you can do cross-modal transfer where you have images and text, 00:08:59.080 | 
but you can also combine them together so that you get a multimodal word embedding. 00:09:03.640 | 
And so this just gives you a more accurate representation 00:09:10.600 | 
Because when we think about the word moon or cat or something, 00:09:14.200 | 
we can go to Wikipedia and read that a cat is a small carnivorous mammal 00:09:20.080 | 
Or we can just go and look at pictures of cats, 00:09:24.400 | 
And I would argue actually that for a lot of people, 00:09:26.720 | 
the picture of the cat is much closer to the meaning of the concept of cat. 00:09:31.240 | 
So some early work where people were trying to do this 00:09:36.360 | 
is from Bruni et al. where they did multimodal distributional semantics 00:09:39.960 | 
using this very elegant approach called bag of visual words. 00:09:44.160 | 
So just like who has heard of bag of visual words? 00:09:49.480 | 
Okay, so it's surprisingly simple, and so I kind of like it. 00:09:54.480 | 
So you take a picture of a moon in this case. 00:09:57.160 | 
I think you can see it in the back too, right? 00:09:58.720 | 
So we use an algorithm like SIFT to find interesting key points. 00:10:03.760 | 
So it's sort of where the difference between the pixels and the pixels next to it, 00:10:08.040 | 
where that difference is big, those are sort of the spots you want to be looking at. 00:10:13.080 | 
And for each of these key points, you get feature descriptors. 00:10:16.960 | 
So relatively small vectors, like 32-dimensional, 00:10:23.080 | 
And what you can do now with these feature descriptors is cluster them, 00:10:27.840 | 
and then you assign every one of these points to its nearest cluster, 00:10:31.400 | 
so you can count how often they occur, right? 00:10:33.360 | 
So in this picture of the moon, we have like actually the count is... 00:10:36.560 | 
Oh, yeah, so there are three like red dots, right? 00:10:41.520 | 
So what that gives you is an idea of the visual words, 00:10:46.000 | 
very similar to the original bag of words model 00:10:48.280 | 
that you hopefully have heard about maybe in the first lecture. 00:10:52.400 | 
So that's the visual equivalent of the textual thing. 00:10:56.040 | 
And so if you do this and you then concatenate it with your textual word vector, you get a representation 00:11:03.200 | 
that is much more representative of human meaning. 00:11:12.600 | 
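A toy sketch of that bag-of-visual-words pipeline, using OpenCV's SIFT and k-means clustering; the parameters are illustrative, and a real setup would build the codebook over a large image collection:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def bag_of_visual_words(image_paths, num_words=100):
    """Toy bag-of-visual-words: SIFT descriptors -> k-means 'visual words' -> count histograms."""
    sift = cv2.SIFT_create()
    all_desc, per_image_desc = [], []
    for path in image_paths:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(gray, None)   # (num_keypoints, 128) descriptors
        per_image_desc.append(desc)
        all_desc.append(desc)
    # Cluster all descriptors: each cluster centroid is one "visual word"
    codebook = KMeans(n_clusters=num_words, n_init=10).fit(np.vstack(all_desc))
    # Represent each image as a histogram of how often each visual word occurs
    histograms = []
    for desc in per_image_desc:
        words = codebook.predict(desc)
        histograms.append(np.bincount(words, minlength=num_words))
    return np.stack(histograms)   # (num_images, num_words), like a textual bag of words
```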
So after that, there were a couple of people, 00:11:24.760 | 
and then you can transfer the features from your ConvNet 00:11:45.320 | 
So when you see a word like cat in some context, 00:11:50.680 | 
then when you see cat, you also want to predict cat pictures. 00:11:56.400 | 
that this gives you much richer word representations. 00:12:02.880 | 
What we really care about is not words, but sentences. 00:12:08.360 | 
into sentence representations and how can we figure out 00:12:24.360 | 
but now we just have a sentence encoder, right? 00:12:42.120 | 
or some other kind of recurrent neural network, 00:12:44.240 | 
or in the case of this one, recursive neural network, 00:12:47.360 | 
and then we try to align the features together. 00:12:57.680 | 
because we showed here that a grounded sentence representation 00:13:09.840 | 
already gives you a really good sentence representation. 00:13:15.560 | 
you can sort of imagine what things look like, 00:13:17.720 | 
and that gives you a really good meaning representation, 00:13:19.800 | 
which you can then transfer to, I don't know, 00:13:24.360 | 
And then of course, once we have sentence encoders, 00:13:33.040 | 
and so when the sequence-to-sequence architecture came out, 00:13:36.440 | 
which you've probably also heard about in this course, 00:13:39.160 | 
what you can do instead of having a text encoder 00:13:44.800 | 
is you can plug in a ConvNet instead of an LSTM encoder, 00:13:53.840 | 
We used to have all of these fancy diagrams in our papers 00:13:56.400 | 
then where we explained LSTM and how that works. 00:13:59.160 | 
Probably people don't learn that anymore these days. 00:14:04.400 | 
They might make a comeback, I think, you know, 00:14:06.320 | 
at some point, transformers are gonna go away. 00:14:10.200 | 
And so one of the things that people figured out 00:14:19.280 | 
between your source language and your target language, 00:14:21.760 | 
and you can do the same thing actually with images, right? 00:14:24.040 | 
So if you want to align a word in your generated sequence 00:14:33.480 | 
and that approach, of course, is called attention, right? 00:14:35.720 | 
So, you know, you learned a lot about attention 00:14:45.720 | 
and really see that when it has to generate stop 00:14:50.120 | 
that it's really actually looking at the stop sign, right? 00:14:52.720 | 
So there's a really cool alignment going on there 00:15:14.160 | 
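In the spirit of those attention-based captioning models, here is a minimal sketch of one attention step over region features; the shapes and projection names are made up for illustration:

```python
import torch
import torch.nn.functional as F

def attend_to_image(decoder_state, region_feats, W_q, W_k):
    """One attention step in a captioning decoder: the current decoder state
    scores every image region, and the next caption word is generated from a
    weighted sum of the regions it 'looks at' (e.g. the stop sign)."""
    # decoder_state: (hidden,), region_feats: (num_regions, feat_dim)
    query = W_q(decoder_state)                 # (attn_dim,)
    keys = W_k(region_feats)                   # (num_regions, attn_dim)
    scores = keys @ query                      # (num_regions,)
    alphas = F.softmax(scores, dim=0)          # attention weights over regions
    context = alphas @ region_feats            # (feat_dim,) weighted image summary
    return context, alphas

# toy shapes: 196 regions (a 14x14 grid) of 512-d ConvNet features, 512-d decoder state
W_q, W_k = torch.nn.Linear(512, 256), torch.nn.Linear(512, 256)
context, alphas = attend_to_image(torch.randn(512), torch.randn(196, 512), W_q, W_k)
```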
is really that you have this generator and discriminator, 00:15:16.480 | 
and you want to have the generator generate images 00:15:19.400 | 
that the discriminator cannot distinguish from, 00:15:23.120 | 
so it cannot distinguish fake and real images, right? 00:15:26.480 | 
you can actually condition that on the piece of text, 00:15:38.280 | 
of stable diffusion were doing things like this, 00:15:40.560 | 
and it's all a natural progression to that model. 00:15:47.240 | 
Maybe, do people have any burning questions about this, 00:16:02.040 | 
So those are really the kind of core building blocks 00:16:11.640 | 
and sort of useful and doesn't look that difficult, 00:16:15.680 | 
like why aren't we all doing multimodal things? 00:16:30.600 | 
than vision or audio in many use cases, right? 00:16:36.960 | 
and basically learns to ignore the image completely, 00:16:41.240 | 
for visual question answering, we'll get to that. 00:16:43.520 | 
So visual question answering, you could do that 00:16:47.600 | 
Additional modalities can add a lot of noise. 00:16:51.760 | 
So it makes your machine learning problem more difficult. 00:16:58.400 | 
sometimes you have text, sometimes you have pictures, 00:17:01.280 | 
but you don't have a guarantee that you always have both. 00:17:10.080 | 
And also just in general, like how to design your model 00:17:18.040 | 
So in order to maybe drive that point home a little bit, 00:17:23.040 | 
so featurizing text, I guess we all know how to do that 00:17:28.200 | 
by now, especially sort of in the age of transformers 00:17:30.760 | 
and before in LSTMs, where we just have like, 00:17:34.920 | 
So batch size by sequence length by embedding size, right? 00:17:41.320 | 
and that's how you encode your textual information 00:17:48.720 | 
because you can just kind of look at the patches, 00:17:57.840 | 
And in many cases, you don't really want to be this uniform. 00:18:01.160 | 
You want to have something that actually looks 00:18:13.000 | 
that encodes the features for that particular sub image, 00:18:16.520 | 
like this guy's like skateboard or something. 00:18:18.480 | 
It has its own like vector representation, right? 00:18:31.400 | 
is a really good one if you haven't heard of that yet. 00:18:34.280 | 
So we're at YOLO v7 now, I think, or eight, I don't know. 00:18:37.840 | 
So there's a new one coming out every other year 00:18:42.840 | 
But the basic idea is that we get these bounding boxes 00:18:46.320 | 
for things in the images; there are actually segmentations too, 00:18:49.000 | 
but the bounding boxes are what people tend to use 00:18:52.560 | 
So this is labeled like backpack or something. 00:18:54.840 | 
And so you can do this as a pre-processing step 00:18:57.560 | 
on your image to get a much richer representation 00:19:14.680 | 
And so this probably feels like super obvious now, 00:19:18.680 | 
but in 2014, when people were starting to discover this, 00:19:22.040 | 
it was really very surprising that you could just use these pre-trained ConvNet features 00:19:27.080 | 
to really replace the entire computer vision pipeline. 00:19:36.600 | 
And then it was all thrown away and replaced by a ConvNet 00:19:41.920 | 
And so the cool thing you get there is that you can transfer 00:19:49.000 | 
and then use it to all kinds of very specialized things 00:19:52.320 | 
like spotting buildings in Paris, for example, 00:19:57.720 | 
And then of course, in the age of transformers, 00:20:10.720 | 
So vision transformers are what we would use these days 00:20:13.760 | 
to encode the images, where you have these flattened patches 00:20:20.520 | 
fed into a BERT architecture, maybe as you would know it 00:20:22.880 | 
from this course, and then you do classification, right? 00:20:27.200 | 
everything's standard, except now your input here 00:20:29.240 | 
is not words or tokens, it's patches of an image 00:20:34.240 | 
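A minimal sketch of that patchification step, assuming square images and non-overlapping patches in the standard ViT style (the linear projection and position embeddings are only mentioned in the comment):

```python
import torch

def patchify(images, patch_size=16):
    """Split images into non-overlapping patches and flatten each one,
    so the transformer sees a sequence of 'image tokens' instead of words."""
    B, C, H, W = images.shape                           # e.g. (batch, 3, 224, 224)
    p = patch_size
    patches = images.unfold(2, p, p).unfold(3, p, p)    # (B, C, H/p, W/p, p, p)
    patches = patches.permute(0, 2, 3, 1, 4, 5)         # (B, H/p, W/p, C, p, p)
    return patches.reshape(B, -1, C * p * p)            # (B, num_patches, patch_dim)

tokens = patchify(torch.randn(4, 3, 224, 224))          # (4, 196, 768)
# A linear projection of these flattened patches plus position embeddings is
# then fed to a standard transformer encoder, just like word embeddings.
```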
All right, so then we have a bunch of features 00:20:37.960 | 
and now how do we combine the information, right? 00:20:50.520 | 
So I don't think it's really useful to go over 00:20:56.720 | 
So obviously like inner product or similarity 00:20:59.240 | 
is what you would use if you want to do cross-modal things. 00:21:01.600 | 
So if you want to embed things in the same vector space, 00:21:04.840 | 
but you can do sort of fancier projections on top 00:21:08.440 | 
or different combinations that are kind of linear 00:21:13.520 | 
where you multiply the components element-wise 00:21:16.280 | 
or you do some sort of gating over the different features. 00:21:18.680 | 
You can do attention, you can do fancier bilinear things, 00:21:22.320 | 
you can do very fancy compact bilinear things. 00:21:59.960 | 
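A few of those simple fusion operators, as a hedged sketch; the names and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class SimpleFusion(nn.Module):
    """A few basic fusion operators: concatenate, multiply element-wise,
    or gate one modality with the other."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, text_vec, image_vec, mode="concat"):
        if mode == "concat":              # just stack the two feature vectors
            return torch.cat([text_vec, image_vec], dim=-1)
        if mode == "product":             # element-wise (Hadamard) interaction
            return text_vec * image_vec
        if mode == "gate":                # text decides how much of the image to let through
            return torch.sigmoid(self.gate(text_vec)) * image_vec
        raise ValueError(mode)

fusion = SimpleFusion(dim=512)
fused = fusion(torch.randn(8, 512), torch.randn(8, 512), mode="gate")
```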
You can first treat them separately and then combine them 00:22:05.280 | 
and then you only combine the final scores, right? 00:22:08.040 | 
And so that's the difference between what we would call early fusion 00:22:17.640 | 
and late fusion, where you really just combine the scores or the logits 00:22:22.640 | 
between the information from the different modalities. 00:22:25.520 | 
So you could do really fun stuff with multimodal fusion. 00:22:34.120 | 
where you have this sort of very special feature map, 00:22:44.840 | 
So this multiplicative gamma and an additive sort of bias vector, beta, 00:22:56.280 | 
So in this case, are there more cubes than yellow things? 00:22:58.640 | 
So we have some vector representation for that question, 00:23:12.800 | 
and we modulate the feature map of the other one and really try to have them learn together. 00:23:17.400 | 
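This is describing FiLM-style feature-wise modulation; a minimal sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """FiLM-style modulation: the question vector predicts a scale (gamma) and
    shift (beta) that are applied channel-wise to the image feature map."""
    def __init__(self, text_dim, num_channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(text_dim, 2 * num_channels)

    def forward(self, feature_map, text_vec):
        # feature_map: (B, C, H, W), text_vec: (B, text_dim)
        gamma, beta = self.to_gamma_beta(text_vec).chunk(2, dim=-1)   # (B, C) each
        gamma = gamma[:, :, None, None]        # broadcast over spatial positions
        beta = beta[:, :, None, None]
        return gamma * feature_map + beta      # language modulates vision

film = FiLMLayer(text_dim=256, num_channels=128)
out = film(torch.randn(4, 128, 14, 14), torch.randn(4, 256))
```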
All right, so let's talk about late fusion then. 00:23:22.240 | 
So late fusion is what we would now call contrastive models 00:23:26.520 | 
but the basic idea is that we have this similarity score. 00:23:31.000 | 
we process the modalities completely independently 00:23:33.360 | 
and then at the very end, we do some combination. 00:23:35.960 | 
And the most famous instance of that these days is CLIP. 00:23:47.840 | 
So it's again, exactly the same contrastive loss 00:23:51.280 | 
that we've seen in all these early approaches. 00:23:54.200 | 
It does a kind of negative sampling, but then in-batch. 00:24:06.960 | 
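A minimal sketch of that in-batch contrastive loss, where the diagonal of the similarity matrix holds the matched pairs; the temperature and dimensions are illustrative:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_feats, text_feats, temperature=0.07):
    """In-batch contrastive loss: row i of each tensor is a matched pair, and
    every other row in the batch serves as a negative for it."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.T / temperature     # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))                # correct match is the diagonal
    loss_i = F.cross_entropy(logits, targets)             # image -> which text?
    loss_t = F.cross_entropy(logits.T, targets)           # text  -> which image?
    return (loss_i + loss_t) / 2

loss = clip_style_loss(torch.randn(32, 512), torch.randn(32, 512))
```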
And I just wanna make sure that I rank this thing higher 00:24:12.080 | 
And I wanna make sure I rank this thing higher 00:24:18.280 | 
Really, nothing special about this architecture 00:24:22.480 | 
but what made this thing so cool was first of all, 00:24:25.840 | 
it was transformers and it was transformers all the way. 00:24:30.360 | 
and your image encoder would be a ViT image encoder. 00:24:35.880 | 
And it was trained on lots and lots of web data. 00:24:44.440 | 
And he created, I think 300 million image text pairs 00:24:47.720 | 
for this dataset, trained a bigger model on it 00:24:52.760 | 
And then we got this amazing model out of it. 00:25:00.200 | 
to the sort of texts that you would see on the internet. 00:25:07.640 | 
It's gonna say a photo of a cat doing something, something. 00:25:10.720 | 
So that means that you can do kind of zero-shot 00:25:14.800 | 
label predictions, where you have a prompt like "a photo of a ...", 00:25:17.760 | 
and then you need to figure out what the right label 00:25:21.560 | 
is for a given image using this kind of prompt. 00:25:28.520 | 
And so you can prompt vision and language models 00:25:30.880 | 
in very much the same way and do zero-shot generalization. 00:25:42.800 | 
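A hedged sketch of that zero-shot recipe; encode_image and encode_text are placeholders for whatever contrastive encoders you have (e.g. a CLIP-style model), and the prompt template is the "a photo of a ..." idea from the lecture:

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image, class_names, encode_image, encode_text):
    """Zero-shot label prediction: embed the image and one prompt per label,
    then pick the prompt whose embedding is most similar to the image."""
    prompts = [f"a photo of a {name}" for name in class_names]
    image_emb = F.normalize(encode_image(image), dim=-1)        # (1, D)
    text_embs = F.normalize(encode_text(prompts), dim=-1)       # (num_classes, D)
    sims = (image_emb @ text_embs.T).squeeze(0)                 # similarity to each prompt
    return class_names[sims.argmax().item()]

# e.g. zero_shot_classify(img, ["cat", "dog", "guacamole"], encode_image, encode_text)
```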
It's thorough, it's really worth a very close read, 00:25:59.960 | 
But what really made it special was that it generalized 00:26:08.080 | 
at some of these adversarial versions of ImageNet, 00:26:13.160 | 
So it's just a way better image encoder in general. 00:26:19.960 | 
there was this paper from Google called ALIGN, 00:26:30.840 | 
but then you just keep like throwing more data 00:26:32.840 | 
and more compute at it, and it often works much better. 00:26:38.120 | 
1.8 billion image-text pairs instead of 300 million 00:26:49.440 | 
is that there's this organization called LAION, 00:26:51.800 | 
where they've started this open source collective 00:27:11.280 | 
And so now there's a much bigger version of LAION 00:27:14.920 | 
that's even multilingual and it has 5 billion examples. 00:27:18.080 | 
So Stable Diffusion was trained on sort of the image, 00:27:23.760 | 
And that's one of the reasons that it's so awesome, 00:27:29.160 | 
And that really makes your system a lot better. 00:27:31.360 | 
So if you're looking for like the ultimate dataset 00:27:41.360 | 
All right, any questions about up until this point? 00:28:01.680 | 
of what I think a lot of people in the field right now, 00:28:04.520 | 
or if you're interested in getting in this field, 00:28:09.760 | 
like this is what you should really understand. 00:28:12.320 | 
And again, like the ideas sort of stack onto each other. 00:28:18.840 | 
to give you an idea sort of how the scientists 00:28:39.160 | 
Everybody should raise their hand now in this. 00:28:46.560 | 
I think everybody kind of gets how BERT works, right? 00:28:55.800 | 
because I want you to think about if you have a BERT model 00:29:05.080 | 
Right, so there are a bunch of like obvious things 00:29:09.400 | 
you could do given the kind of features I told you about 00:29:28.600 | 
and then just concatenate it to whatever encoder, 00:29:31.440 | 
like maybe an ANN or whatever you're training 00:29:41.160 | 
and the CLS token from BERT, concatenate them, 00:29:43.800 | 
and then classify for like a cat or something like that 00:29:47.400 | 
or whatever the thing is you're interested in, yeah. 00:29:51.240 | 
You could also like take the ConvNet features 00:29:59.520 | 
So I think a lot of people when BERT came out 00:30:04.080 | 
who were working in vision and language processing 00:30:13.440 | 
And so there were a lot of papers all coming out 00:30:25.120 | 
into their own thing because of Hugging Face Transformers 00:30:36.240 | 
and people would do object detection on this. 00:30:39.600 | 
So you get like a hat and a racket and a shirt 00:30:45.200 | 
and then plug them into your transformer model 00:30:49.080 | 
and then you try to like recover the features. 00:30:56.400 | 
And so this is what we call a single stream architecture 00:31:00.920 | 
where you have all of these kind of concatenating 00:31:05.000 | 
and then putting them through the same transformer. 00:31:08.960 | 
and that's something that this model called ViLBERT did 00:31:14.640 | 
So you essentially have these two parallel transformers 00:31:19.880 | 
you kind of give them cross attention, right? 00:31:25.760 | 
so you just make sure you have an attention map 00:31:28.880 | 
and then you just do your full normal transformer layer 00:31:42.120 | 
and here you do sort of some equivalent of that 00:31:44.480 | 
and then you also have your next sentence prediction 00:31:47.320 | 
which you probably remember from your BERT lecture. 00:31:52.560 | 
is this image aligned with this piece of text or not? 00:32:00.000 | 
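A simplified sketch of one such co-attention step between the two streams; a real ViLBERT-style layer also has residual connections, layer norm, and feed-forward sublayers, as noted in the comment:

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Two-stream co-attention: queries come from one stream, keys/values from
    the other, so each modality can 'look at' the other."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_tokens, image_regions):
        # text_tokens: (B, L, D), image_regions: (B, R, D)
        text_out, _ = self.txt_attends_img(text_tokens, image_regions, image_regions)
        image_out, _ = self.img_attends_txt(image_regions, text_tokens, text_tokens)
        # A full layer would add residuals, layer norm, and feed-forward sublayers here.
        return text_out, image_out

block = CoAttentionBlock()
t, v = block(torch.randn(2, 16, 768), torch.randn(2, 36, 768))
```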
There are like a hundred papers that came out 00:32:03.680 | 
So LXMERT had a different cross-modal output encoder, 00:32:08.800 | 
of encoding the positional information, right? 00:32:11.760 | 
I just have a bunch of bounding boxes that are featurized 00:32:14.160 | 
but I don't care about where they are in the image. 00:32:25.600 | 
And that's what you featurize into your network. 00:32:44.640 | 
and you just give those feature maps to BERT. 00:32:50.120 | 
between like your text segment embeddings, right? 00:32:56.480 | 
But so this actually works surprisingly well. 00:32:58.920 | 
You don't have to do any additional training. 00:33:11.640 | 
And now you have a very good multimodal classifier 00:33:17.520 | 
they're doing what they call multimodal pre-training 00:33:19.840 | 
where first you have a BERT model and a ResNet. 00:33:29.960 | 
before you fine tune it on the problem you care about. 00:33:32.920 | 
And what we showed here is that you don't really need that 00:33:39.760 | 
You can also go to the pixel level completely. 00:33:52.400 | 
but here they do do the multimodal pre-training step 00:33:55.200 | 
and show that I think for VQA it helps a little bit. 00:34:08.200 | 
where they added a bunch of different losses. 00:34:10.960 | 
We can really talk about this for a very long time. 00:34:18.520 | 
So this one, I think, is quite interesting, ViLT, 00:34:21.280 | 
because here this is really the first instance 00:34:23.400 | 
where we have completely moved away from ConvNet features. 00:34:27.040 | 
So we don't do any pre-processing on the image, 00:34:37.280 | 
So it's really integrated: we flatten those patches, 00:34:40.240 | 
we just pump them into the transformer straight away. 00:34:43.240 | 
So this really is like sort of BERT and ViT together 00:34:46.000 | 
in one model and this worked really very well. 00:34:54.960 | 
of all of these different models and what they do. 00:35:02.040 | 
So do you use BERT or something fancier or better, 00:35:08.320 | 
So in many cases you have these region features. 00:35:17.560 | 
So either a single or dual stream as we talked about, 00:35:24.240 | 
so masked language modeling, image-text matching. 00:35:28.240 | 
There's a bunch of like funkier ones you can do. 00:35:31.720 | 
So, and then finally you can do multimodal pre-training 00:35:35.320 | 
on all of these different datasets that have aligned data. 00:35:41.600 | 
okay, so what is really the interesting difference 00:35:59.360 | 
So basically they say, if you take all of these 00:36:02.720 | 
little model inventions and you train these different models 00:36:06.320 | 
on exactly the same data in exactly the same way, 00:36:09.520 | 
it turns out that they're all basically the same. 00:36:16.520 | 
on the part of the field because everybody's saying, 00:36:18.480 | 
well, my model is better, but it's actually just 00:36:22.640 | 
There's no real sort of model innovation going on 00:36:29.880 | 
or anything like that, but I think that's why 00:36:33.360 | 
this paper is really nice and really important 00:36:35.360 | 
is because it just shows us what really matters. 00:36:38.880 | 
So this is also work that I did myself, called FLAVA, 00:36:43.880 | 
with my team where we wanted to take these ideas 00:37:02.320 | 
we only care about problems that always involve 00:37:08.280 | 
the basic premise, I think, of foundation models in general 00:37:19.880 | 
different modalities and then do useful things 00:37:24.280 | 
So with FLAVA, that's exactly what we tried to build. 00:37:29.600 | 
that is good at vision and language and computer vision 00:37:47.400 | 
So it's good at the things that you would expect 00:37:49.400 | 
as a kind of basic image model to be good at. 00:38:01.920 | 
if you take all the datasets that were ever created 00:38:04.280 | 
that have image text pairs that are publicly available. 00:38:06.840 | 
So unfortunately, the CLIP data and the Google ALIGN data 00:38:10.200 | 
and all of these datasets, they haven't been open source. 00:38:19.520 | 
if you combine all of these image text pairs, 00:38:30.520 | 
that we know we care about in these different fields. 00:38:41.240 | 
I think if you work at a company like Facebook, 00:38:50.360 | 
That's gonna really make your life a lot easier. 00:38:53.160 | 
So the exact architecture here is that on the one hand, 00:38:57.760 | 
we have this image encoder where we take the image, 00:39:01.720 | 
and we just do what we call masked image modeling, 00:39:11.720 | 
we have the masked language modeling on the language. 00:39:22.960 | 
So we have a masked multimodal modeling loss term 00:39:28.960 | 
So this is like your BERT next sentence prediction thing. 00:39:31.520 | 
And then we also have a global contrastive loss, 00:39:42.080 | 
I think, to combine a lot of this information. 00:39:54.800 | 
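The overall training objective is then just a weighted combination of these loss terms. A minimal sketch, with made-up weights and with each per-task loss assumed to be computed elsewhere (this is not FLAVA's exact recipe):

```python
import torch

def combined_pretraining_loss(loss_mim, loss_mlm, loss_mmm, loss_itm, loss_contrastive,
                              weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the pretraining objectives: masked image modeling,
    masked language modeling, masked multimodal modeling, image-text matching,
    and the global contrastive loss. Weights here are purely illustrative."""
    terms = [loss_mim, loss_mlm, loss_mmm, loss_itm, loss_contrastive]
    return sum(w * t for w, t in zip(weights, terms))

# toy usage with dummy scalar losses standing in for the real ones
total = combined_pretraining_loss(*[torch.rand(()) for _ in range(5)])
```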
we were pretty thorough generating the table here. 00:40:01.640 | 
if you compare FLAVA to all kinds of different ablations 00:40:11.880 | 
of where we're probably gonna go with the field 00:40:20.960 | 
is that everybody cares about generative models. 00:40:23.680 | 
So language models and image generative models, 00:40:27.760 | 
there's just a trend where we want to be generative, 00:40:31.680 | 
discriminative stuff to the more interesting, 00:40:34.880 | 
more richer representations maybe that you get 00:40:41.240 | 
So this SimVLM paper was one of the first ones 00:40:46.160 | 
that was trying to generate or kind of complete captions, 00:40:49.480 | 
which they showed gives you a lot richer representations. 00:40:53.640 | 
I think this is actually the current state of the art now, 00:41:07.040 | 
I think that's also what they were trying to go for, 00:41:14.280 | 
And I think, so it took us a while as a field 00:41:16.560 | 
to really figure out how to do this the right way. 00:41:25.440 | 
And so one of the interesting things you can do 00:41:29.000 | 
with language models is just keep them frozen 00:41:32.080 | 
and then learn how to project into the language models. 00:41:40.960 | 
and we learned to project into the BERT token space. 00:41:56.560 | 
and then you learn to project into the token space 00:42:01.640 | 
And then you can do lots of fun stuff, it turns out. 00:42:11.120 | 
where you can just give it some kind of in-context examples 00:42:14.040 | 
and it's gonna figure out binding kind of on the fly. 00:42:18.480 | 
So it says like, this is a dax and this is a blicket. 00:42:22.760 | 
And then it gives you the answer that it's a dax. 00:42:29.400 | 
which is really kind of solving the grounding problem 00:42:32.480 | 
that a lot of this multimodal stuff started with. 00:42:37.200 | 
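A minimal sketch of that "project into a frozen language model" recipe: only a small projection from image features into the language model's token embedding space is trained, and the frozen model just continues the resulting sequence. All dimensions and names here are illustrative, not any particular paper's exact setup:

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    """Map frozen image features into the (frozen) language model's token
    embedding space, producing a few 'visual tokens' that get prepended to
    the text embeddings. Only this projection is trained."""
    def __init__(self, image_dim=1024, lm_dim=4096, num_prefix_tokens=4):
        super().__init__()
        self.proj = nn.Linear(image_dim, lm_dim * num_prefix_tokens)
        self.num_prefix_tokens = num_prefix_tokens
        self.lm_dim = lm_dim

    def forward(self, image_feats, text_token_embs):
        # image_feats: (B, image_dim), text_token_embs: (B, L, lm_dim)
        prefix = self.proj(image_feats).view(-1, self.num_prefix_tokens, self.lm_dim)
        # Concatenate visual tokens in front of the text; the frozen LM then
        # just sees a slightly longer "sentence" and continues it as usual.
        return torch.cat([prefix, text_token_embs], dim=1)

prefix = VisualPrefix()
lm_input = prefix(torch.randn(2, 1024), torch.randn(2, 10, 4096))  # (2, 14, 4096)
```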
And then probably one of the coolest papers right now 00:42:41.960 | 
or models right now that you might've heard of 00:42:43.880 | 
if you follow the field is Flamingo out of DeepMind, 00:42:50.200 | 
And so this is really a multimodal language model. 00:43:03.160 | 
So what this gets you is just a much more powerful model 00:43:10.720 | 
So it's really like stepwise, you can see it, right? 00:43:32.200 | 
So we wanna make sure that we can compress it 00:43:45.880 | 
this is not my code, this comes from the actual paper. 00:43:48.120 | 
So they just have the diagram together with the code 00:43:50.560 | 
so that you can really understand what it's doing, 00:43:56.400 | 
And so once you have your perceiver resampling step, 00:44:00.840 | 
what you then do is you do a gated cross attention. 00:44:08.440 | 
you do that before your frozen language model layer. 00:44:12.200 | 
So you really just have a frozen Chinchilla language model 00:44:15.480 | 
and you learn to kind of modulate the information 00:44:20.040 | 
You propagate the gradients all the way back, 00:44:24.040 | 
So you're really kind of trying to figure out like, 00:44:28.040 | 
so that my language model can do the most with it, right? 00:44:32.920 | 
So you'll notice that now we do it before the layer, right? 00:44:39.640 | 
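A simplified sketch of that gated cross-attention idea (not the paper's exact code): the tanh gate starts at zero, so at initialization the frozen language model behaves exactly as before, and it gradually learns how much visual information to mix in:

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Text attends over the resampled visual tokens, and a tanh gate
    initialized at zero controls how much of that signal is added back."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0: no visual signal at init

    def forward(self, text_hidden, visual_tokens):
        # text_hidden: (B, L, D) from the frozen LM; visual_tokens: (B, V, D) from the resampler
        attended, _ = self.attn(text_hidden, visual_tokens, visual_tokens)
        return text_hidden + torch.tanh(self.gate) * attended   # gated residual

layer = GatedCrossAttention(dim=512, heads=8)
out = layer(torch.randn(2, 12, 512), torch.randn(2, 64, 512))
```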
So Karpathy, I think more than 10 years ago, had this image 00:44:45.920 | 
with Barack Obama kind of setting his foot here on the scale 00:44:58.920 | 
I think unless it really understands the scene. 00:45:04.680 | 
this would be a really good visual Turing test. 00:45:11.000 | 
And so obviously it's been a bit of a challenge 00:45:14.320 | 
then to get something that actually works on this. 00:45:16.480 | 
And so Flamingo, as it turns out, kind of gets the joke. 00:45:20.040 | 
But yeah, so it's a bit unclear if it really gets the joke, 00:45:37.600 | 
but you can really take this almost to the full extreme 00:45:42.120 | 
And you just want to learn this kind of mapping 00:45:44.520 | 
between your image encoder and your language model, 00:45:47.320 | 
or your image encoder and your encoder decoder architecture. 00:45:57.000 | 
where they experiment with like OPT for the language model 00:46:00.040 | 
and Flan-T5 for the encoder-decoder architecture. 00:46:04.960 | 
It gives you really complex captions and things like that 00:46:09.040 | 
without any real direct supervision on the captions itself, 00:46:23.320 | 
from captioning to reasoning, to visual question answering, 00:46:29.960 | 
So you can have a long conversation with this system. 00:46:32.720 | 
This really is kind of the future where we're going, right? 00:46:36.600 | 
but it's also going to be able to see the world in a way. 00:46:43.440 | 
so you've probably heard of like chain of thought prompting 00:46:46.080 | 
and things like that, where you ask the language model, 00:46:50.520 | 
and you can tell a vision and language model, 00:46:54.080 | 
generate a rationale for why something might be the case. 00:47:03.160 | 
And then after that, you ask it to answer the question. 00:47:05.880 | 
And it turns out that if you do that sort of multimodal 00:47:08.760 | 
chain of thought prompting, then the system gets much better. 00:47:11.960 | 
And so, this was like the new state of the art 00:47:17.800 | 
just because it learns to unpack the information, right? 00:47:23.560 | 
just starting to figure out what the potential is of this. 00:47:26.760 | 
And I think this paper is where they also show 00:47:33.480 | 
And they show very nice results on Raven matrices 00:47:37.040 | 
and like very complicated kind of IQ tests sort of things 00:47:40.960 | 
that humans are supposed to be really good at, 00:47:51.640 | 
And we started off from a very simple BERT model 00:48:33.720 | 
- Yeah, yeah, so I think the history of computer vision 00:48:41.320 | 
where we thought we needed all of this structure 00:48:44.720 | 
And it turns out you can just throw it all away 00:48:46.680 | 
and just have a big transformer over the patches. 00:48:53.040 | 
- Seeing as it's 2:31 and one minute, save time. 00:49:03.800 | 
- Yeah, yeah, sorry, I should have explained that better. 00:49:06.760 | 
So, it just means that we are not updating the weights. 00:49:11.280 | 
So, like if we go to this here, I think is a nice example. 00:49:19.520 | 
So, that just means that when we do a forward pass, 00:49:22.240 | 
we go all the way to whatever we want to predict, 00:49:24.560 | 
we get some gradients, we take them all the way down, 00:49:30.760 | 
So, here the gradients actually do get updated, 00:49:36.280 | 
is because otherwise you're gonna drift way too far. 00:49:41.640 | 
all of the cool stuff your language model has learned, 00:49:44.000 | 
because you're just gonna focus on the small data set 00:49:48.160 | 
So, you wanna preserve the abilities of the language model, 00:49:50.800 | 
but you want it to become good at the thing you care about. 00:50:03.160 | 
is there a benefit to doing like the earlier middle fusion 00:50:08.000 | 
- Yeah, so, I mean, we're gonna talk about evaluation next, 00:50:11.800 | 
but so it really depends on the task that you care about. 00:50:15.400 | 
And so, I would say the earlier is always the better 00:50:20.360 | 
And so, like CLIP is very efficient to train, 00:50:23.920 | 
it's very late fusion, right, at the very end. 00:50:25.800 | 
So, there's no interaction between the different modalities. 00:50:28.800 | 
And so, that's really good if you want to be very efficient 00:50:33.240 | 
and if you wanna be like, for training, it's much nicer. 00:50:37.480 | 
But if you want to have a richer understanding 00:50:47.520 | 
- It seems that images are just a lot more data than text. 00:50:52.520 | 
So, how much more difficult are these to train 00:50:55.600 | 
and how much bigger does like the image processing 00:51:03.520 | 
- Yeah, so, images are more complex in a way, 00:51:08.520 | 
but they're also kind of higher bandwidth representations. 00:51:14.280 | 
just pixels that our brains just abstract away, right? 00:51:17.640 | 
It's really about the scene that you're seeing 00:51:26.680 | 
that language is just a kind of low bandwidth, 00:51:33.560 | 
which is much richer and much higher bandwidth. 00:51:35.800 | 
And like he thinks probably visual, I'm not so sure. 00:51:39.800 | 
But so, yeah, I don't think that there's necessarily 00:51:43.800 | 
a difference between kind of the scaling laws 00:51:47.440 | 
or at least we still have to figure that out. 00:51:50.800 | 
We'll kind of talk about that towards the end as well. 00:51:53.400 | 
- Do these modern models also have certain social 00:51:59.440 | 
and cultural biases, just like natural language models? 00:52:06.800 | 
So, yeah, some people are actually working on this 00:52:24.560 | 
then the model will think that he's playing ping pong 00:52:39.000 | 
that you should be working on if you're a student 00:52:42.720 | 
how do we get these systems to be much better 00:52:54.360 | 
So, we wanna understand the content of a video. 00:53:02.400 | 
you might see like what improvements we can make 00:53:08.840 | 
- Yeah, so, you're asking about the attention mask 00:53:13.520 | 
Yeah, so you can use the same idea for videos 00:53:21.920 | 
you can really track objects kind of real time 00:53:39.160 | 
because you can very often just sub sample images 00:53:43.280 | 
rather than having to deal with the complex video. 00:53:57.560 | 
let's say you only provide a single source of media 00:54:10.520 | 
is that they're really just built for multi-modal stuff. 00:54:13.640 | 
And so, what if I don't have an image, right? 00:54:26.040 | 
so the supervised multi-modal bi-transformer, 00:54:30.440 | 
how robust is this model to missing images or missing text? 00:54:51.320 | 
And so, I think if I'm gonna tell you about multi-modality, 00:54:55.480 | 
I also have to tell you how you're gonna check 00:55:16.280 | 
because you only have limited GPUs anyway, right? 00:55:33.360 | 
So, ImageNet really changed the history of deep learning, 00:55:45.600 | 
where they have just a bunch of main multi-modal tasks. 00:55:57.120 | 
the bounding boxes, the labels of the bounding boxes, 00:56:00.240 | 
they come at like sort of a different pixel granularities. 00:56:07.360 | 
annotated in terms of like the categories that it has, 00:56:10.040 | 
and then you have five captions for each of these images. 00:56:20.440 | 
because you had your picture and you had your caption, 00:56:24.480 | 
okay, how do I give the right caption for this image? 00:56:31.080 | 
the right image or the image for the piece of text? 00:56:34.800 | 
So, there's a bunch of very impactful datasets 00:56:37.680 | 
that do this stuff that we already talked about, LAION, 00:56:44.520 | 
as the canonical instance of this dataset category. 00:56:49.200 | 
And then, the other thing that people really care about 00:56:56.320 | 
And so, there really are a bunch of academic groups 00:57:03.760 | 
that they didn't really care about anything else, 00:57:07.520 | 
that are really optimized just for multimodal 00:57:13.200 | 
in the citation counts as of last night, 3 a.m., 00:57:24.520 | 
And so, what you do here is you just have an image 00:57:30.120 | 
so annotators, right, they ask these simple questions, 00:57:34.400 | 
and now we want to be able to answer these questions 00:57:39.760 | 
one of the kind of embarrassing backstories of this dataset 00:57:45.560 | 
is that the images were actually found to not really matter at all. 00:58:01.240 | 
the right answer for a "how much" or "how many" question was two. 00:58:04.520 | 
So if you just predicted two for every "how much" or "how many" question, 00:58:08.600 | 
you got like 70% accuracy on the counting category. 00:58:12.160 | 
So careful dataset or evaluation benchmark design 00:58:18.040 | 
and you really need to think about what you're doing. 00:58:20.000 | 
You can't just like set some data aside and evaluate it on, 00:58:23.120 | 
you have to really think about what you're doing. 00:58:30.400 | 
a better designed version of this dataset maybe. 00:58:34.960 | 
There are also kind of very targeted datasets 00:58:40.680 | 
that really try to measure one particular thing. 00:58:42.920 | 
And I think one of the things we really want to get at 00:58:46.080 | 
with these models is what we would call compositionality. 00:58:49.320 | 
So we want to be able to really take the parts 00:58:59.320 | 
that was designed really to measure the compositionality 00:59:03.040 | 
both on the language side and on the vision side. 00:59:06.840 | 
between all of these different objects in the images. 00:59:16.480 | 
But a lot of these datasets really had big problems. 00:59:21.320 | 
So one of the problems is, they were too easy. 00:59:31.960 | 
and that's probably gonna make some people's lives better. 00:59:40.640 | 
So obviously, these memes are not actually the real ones 00:59:51.080 | 
which are in the dataset, but that would be less fun. 00:59:54.080 | 
So these are mean meme examples to kind of demonstrate 01:00:02.200 | 
And so one of the problems we had, as I said, 01:00:28.400 | 
But so it turns out that if you just swap out the background 01:00:54.240 | 
suddenly it's like a really nice thing to say, right? 01:01:00.520 | 
if you want to classify this correctly for the meanness, 01:01:04.280 | 
then you have to really understand multimodal reasoning. 01:01:12.480 | 
And so it was really constructed by design to do that. 01:01:19.400 | 
is we use some really highly trained annotators. 01:01:26.240 | 
is that nobody really knows who owns the meme, 01:01:38.200 | 
and they were very afraid of copyright things. 01:01:49.640 | 
so we could show them kind of the actual examples. 01:01:54.960 | 
that were kind of corresponding to the original source image 01:02:00.520 | 
but now with an image that we could buy from Getty. 01:02:07.440 | 
so that we could then release the dataset to the public 01:02:11.000 | 
so that people could do actually research on this 01:02:21.240 | 
sorry, it's a startup world with co-founders. 01:02:39.960 | 
And so this led to a really nice dataset, I think, 01:02:46.120 | 
that I think a lot of people in the field had, 01:02:48.360 | 
which is that multimodal pre-training doesn't really work. 01:02:53.560 | 
So multimodal pre-training doesn't really work. 01:02:58.000 | 
And so all of this stuff that people have been doing 01:03:02.880 | 
actually turned out maybe to not really be that useful 01:03:06.080 | 
anyway and so maybe it got you like one point extra, right? 01:03:09.400 | 
From VisualBERT to like a different VisualBERT, 01:03:16.680 | 
So that means like we still have to figure this stuff out, 01:03:29.800 | 
that does something new, like we're not there yet. 01:03:33.080 | 
And I think that's encouraging, especially for you. 01:03:45.680 | 
to try to see what people could come up with. 01:03:47.920 | 
And so there was a lot of nice work coming out of that 01:03:52.480 | 
and we really kind of managed to crank the numbers up 01:03:56.240 | 
But the solutions were slightly disappointing. 01:04:12.760 | 
where there wasn't really the fundamental breakthrough 01:04:25.840 | 
So the theme sort of of this section is like, 01:04:28.000 | 
if you make a dataset, think about it very carefully 01:04:31.600 | 
because you can really be very creative with this 01:04:33.560 | 
and really measure the things you're trying to get at. 01:04:44.240 | 
and it's way better than things that were previously there, 01:04:47.200 | 
but does it understand compositional relationships 01:04:50.280 | 
in the same way that humans would understand it 01:04:52.240 | 
or is it sort of just fitting onto the data distribution 01:04:55.440 | 
and it can be very good at the head of the distribution, 01:05:00.360 | 
And you can probably already guess where this is going, 01:05:07.560 | 
you would have some plants surrounding a light bulb 01:05:10.960 | 
or you would have a light bulb surrounding some plants. 01:05:13.880 | 
So notice that the words here are exactly the same words, 01:05:19.480 | 
So, and so the visual depiction of these words 01:05:25.960 | 
your contrastive model is actually good at understanding 01:05:29.000 | 
the visual semantic or the visual linguistic compositionality 01:05:43.400 | 
and it just kind of is biased toward what it sees often, 01:05:56.320 | 
"Order Word Matters Pre-training for Little." 01:06:15.600 | 
And so that's probably not what we want to have. 01:06:37.400 | 
Like these are very different pictures, right? 01:06:54.960 | 
State-of-the-art models often perform below random chance. 01:07:03.240 | 
we still have a lot of work to do, which is good. 01:07:38.960 | 
If you don't add that, then it breaks down completely. 01:07:45.120 | 
or sort of tuning on the test set, but okay, you know. 01:07:51.080 | 
So it definitely is better than I think a lot of people 01:07:54.760 | 
would have expected even a couple of years ago. 01:07:57.120 | 
But it's not perfect because people on the internet 01:08:02.120 | 
like to take more pictures of spoons than forks. 01:08:04.800 | 
So if you say there are fewer spoons than forks, 01:08:17.400 | 
You know, and so maybe it's like the matrix or something, 01:08:25.040 | 
So again, what you can see here is that these models 01:08:37.160 | 
like it still can't count fingers and things like that. 01:08:39.840 | 
So again, there's still a lot of cool work to be done. 01:08:57.320 | 
because so we've really just been focused on images 01:09:04.560 | 
And so that makes it sort of an obvious thing to focus on. 01:09:10.760 | 
like vision is a very dominant modality, right? 01:09:13.080 | 
So how we understand the world is very vision driven, 01:09:18.840 | 
So there's all these other interesting problems 01:09:22.920 | 
And so the most obvious one is just speech or audio, right? 01:09:29.120 | 
and really we could do another lecture just like this, 01:09:34.200 | 
And there's lots of interesting stuff to talk about. 01:09:41.000 | 
of how amazing Alec Radford is at creating datasets. 01:09:45.000 | 
So there's this Whisper model that came out of OpenAI 01:09:57.400 | 
and they trained this very fancy thing on there, 01:10:06.600 | 
and then you feed that into a big transformer. 01:10:08.640 | 
So this is sort of your encoder self-attention here, right? 01:10:17.280 | 
So this is encoder decoder, basic transformer model, 01:10:23.280 | 
one-dimensional convolutions over the log-mel spectrogram. 01:10:26.280 | 
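A hedged sketch of that kind of audio frontend, using torchaudio for the log-mel spectrogram; the filter sizes and strides here are illustrative, not Whisper's actual configuration:

```python
import torch
import torch.nn as nn
import torchaudio

# Waveform -> log-mel spectrogram -> 1-D convolutions, ready for a transformer encoder.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=400,
                                           hop_length=160, n_mels=80)
frontend = nn.Sequential(
    nn.Conv1d(80, 512, kernel_size=3, padding=1), nn.GELU(),
    nn.Conv1d(512, 512, kernel_size=3, stride=2, padding=1), nn.GELU(),  # downsample in time
)

waveform = torch.randn(1, 16000 * 5)                  # 5 seconds of fake 16 kHz audio
log_mel = torch.log(mel(waveform) + 1e-6)             # (1, 80, num_frames)
audio_tokens = frontend(log_mel).transpose(1, 2)      # (1, num_frames/2, 512) for the transformer
```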
And so there's lots of papers that do very similar things. 01:10:32.440 | 
that tried to turn the wave signal into vectors, 01:10:34.720 | 
or you can discretize it in lots of different ways. 01:10:40.080 | 
Then I think one of the funny observations actually 01:10:42.760 | 
is that you can just reduce audio to vision anyway, right? 01:11:04.280 | 
So what does the spectrum of the audio file look like? 01:11:08.000 | 
Feed that to a regular ConvNet, like an AlexNet even, 01:11:11.480 | 
and then that gives you amazing auditory features. 01:11:15.520 | 
between violins or guitars and things like that. 01:11:17.960 | 
So maybe you can just reduce all of this to vision. 01:11:23.480 | 
can we also reduce language to vision or vision to language? 01:11:27.280 | 
So that's sort of what people are thinking about. 01:11:35.320 | 
So a lot of these ideas also extend pretty directly 01:11:38.720 | 
to video, but now you just have more data, right? 01:11:46.880 | 
Probably a lot of the images are pretty useless 01:11:50.120 | 
for what you're trying to do with this video model, right? 01:11:53.840 | 
It doesn't really add all that much information. 01:12:08.360 | 
and language transformer encoder thing on top of that. 01:12:16.400 | 
And so there's this MERLOT, which is a nice architecture, 01:12:23.400 | 
kind of a silly name, where they also added audio 01:12:30.120 | 
And so we're going towards this foundation model 01:12:33.400 | 
that can consume all of these different modalities 01:12:37.160 | 
And that's really like a clear trend in the field. 01:12:55.160 | 
But what you can do is you can have simulated environments. 01:13:01.680 | 
where they had this agent walk around in a maze 01:13:03.800 | 
and then it could have natural language instructions. 01:13:06.160 | 
It could also generalize to like daxes and blickets 01:13:08.640 | 
and different sort of groundings and assignments 01:13:16.560 | 
because this is how humans learn language, right? 01:13:21.520 | 
We have all of these different perceptual observations. 01:13:29.200 | 
And that's how we learn everything we know about the world. 01:13:32.160 | 
And so our language is very intricately connected 01:13:45.600 | 
So especially with this kind of conditioning on text 01:13:54.360 | 
And the original GAN we talked about at the beginning. 01:13:59.000 | 
but now you're generating 3D point clouds, right? 01:14:13.720 | 
and it's just gonna design the whole house for you. 01:14:16.600 | 
So you can just like tweak the prompt and things like that. 01:14:23.600 | 
So the final modality I just briefly wanted to talk about 01:14:33.720 | 
And so olfaction means smell if you didn't know. 01:14:39.720 | 
so my PhD thesis was about grounding semantics 01:14:50.080 | 
now audio is sort of the obvious next one, right? 01:14:59.560 | 
And that's gonna give you a richer representation. 01:15:03.640 | 
what's actually very primitive to their meaning 01:15:15.640 | 
if you want to complete all of your perceptual modalities 01:15:19.320 | 
is you can try to build olfactory embeddings. 01:15:32.640 | 
this Sigma Aldrich Fine Flavors and Fragrances catalog, 01:15:36.840 | 
where you can look up words like melon and pineapple 01:15:40.160 | 
and then it's gonna give you all of the chemical compounds 01:15:52.880 | 
to get it to be a bit more of a real embedding model. 01:15:56.360 | 
So now you get smell embeddings, smell vectors, 01:15:59.640 | 
and then you can compute similarity judgments 01:16:12.200 | 
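A toy sketch of how such olfactory embeddings can be built: each word becomes a bag-of-compounds count vector, and similarity is just cosine similarity between those vectors. The compound lists below are made-up placeholders, not entries from the actual catalog:

```python
import numpy as np

# Hypothetical toy data: which aroma compounds a catalog associates with each word.
word_to_compounds = {
    "melon":     ["ethyl butyrate", "cis-6-nonenal"],
    "pineapple": ["ethyl butyrate", "allyl hexanoate"],
    "democracy": [],   # abstract words get little or no olfactory grounding
}

compounds = sorted({c for cs in word_to_compounds.values() for c in cs})
index = {c: i for i, c in enumerate(compounds)}

def smell_vector(word):
    """Bag-of-compounds vector: count each chemical compound linked to the word."""
    vec = np.zeros(len(compounds))
    for c in word_to_compounds[word]:
        vec[index[c]] += 1
    return vec

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

print(cosine(smell_vector("melon"), smell_vector("pineapple")))  # share a compound -> similar
```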
So you get these clusters of different smells 01:16:27.800 | 
so like if you have a word like democracy in there, 01:16:43.120 | 
And then, so the really interesting thing to me 01:16:51.880 | 
than the linguistic vectors we had at the time. 01:17:01.440 | 
And so you can do like skip gram and things like that. 01:17:04.800 | 
But that thing is not going to be as correlated 01:17:14.400 | 
So even something like smell where maybe we think, 01:17:37.000 | 
I'll just, I think I've already said most of this actually. 01:17:39.640 | 
So one foundation model is going to rule them all. 01:17:51.880 | 
and trying to understand really what is the relationship 01:17:55.600 | 
which one do we want more of, that sort of stuff. 01:18:09.480 | 
We need way better evaluation and better measurements.