Stanford CS224N NLP with Deep Learning | 2023 | Lecture 16 - Multimodal Deep Learning, Douwe Kiela
00:00:05.200 |
So today, I'm delighted to introduce our first invited speaker, 00:00:12.800 |
Douwe, as well as being invited (and I'll tell you his background), 00:00:17.920 |
has also been in the Symbolic Systems program as an adjunct professor and 00:00:23.840 |
has been involved with some students in that role as well. 00:00:26.680 |
But in his invited role, he's originally from the Netherlands, 00:00:30.640 |
where he even learned some logic, among other things, back in the old days. 00:00:36.680 |
he's been a prominent deep learning researcher. 00:00:41.160 |
For a number of years, he worked at Facebook, now Meta, in the FAIR unit, 00:00:47.600 |
and was involved in various ideas, including retrieval augmented generation. 00:00:54.000 |
After that, he then spent some time at Hugging Face. 00:00:57.760 |
He's become interested in looking at multimodal models, 00:01:01.560 |
which is what he's gonna be talking about today. 00:01:20.520 |
I understand that you get points for being here, so 00:01:27.160 |
So I'm gonna talk about multimodal deep learning. 00:01:29.680 |
It's gonna have an NLP focus, of course, as for this course. 00:01:32.800 |
But it's also because otherwise I would really be talking for 00:01:38.880 |
So I'll try to really keep it focused on the things that I think will be 00:01:44.760 |
And so the first thing you should understand is that this whole concept of 00:01:48.760 |
multimodality is kind of ill-defined, actually. 00:01:52.360 |
So if you go to the dictionary, you'll see that it means having or 00:01:56.160 |
involving several modes or modalities or maxima. 00:02:00.720 |
And so what mode here really means is, so it could be mode in the very generic sense, 00:02:06.080 |
or it could be a very precise sense of the mode of a statistical distribution. 00:02:11.240 |
And so depending on the paper you're reading, in some cases people mean it in that statistical sense. 00:02:16.200 |
In other cases, people really mean this sort of very vague concept of a modality, 00:02:20.920 |
where it really means the type of information that you're getting. 00:02:23.840 |
So an example of modality in that case is an image or speech signal or 00:02:28.480 |
audio in general, or even olfaction, so smell or things like that. 00:02:33.160 |
So in this lecture, we're just gonna focus mostly on text, 00:02:38.640 |
because this is an NLP course, and we're gonna focus on images mostly as the other modality. 00:02:50.880 |
And so there are a couple of really good reasons in general for this. 00:02:58.320 |
So if you look at how we humans understand the world, 00:03:01.200 |
how we make sense of what happens in the world, that is very multimodal, right? 00:03:06.720 |
So we perceive the world not just using vision or just audio, but 00:03:11.040 |
we synthesize information across all of these different modalities, and 00:03:14.600 |
that's how we understand the world and each other. 00:03:17.800 |
There's also a very practical argument for doing it. 00:03:20.680 |
It's because the Internet is multimodal, right? 00:03:23.000 |
So if you go to, I don't know, Facebook or something like that, 00:03:27.080 |
it rarely happens that it's just text or just an image. 00:03:29.760 |
There's usually a combination of multiple modalities. 00:03:33.160 |
And then the final good reason that we're just starting to hit now, 00:03:37.920 |
if you're really following where the field is going, 00:03:40.760 |
we're kind of running out of text data for these large language models. 00:03:44.000 |
So one interesting way to keep scaling on the data side 00:03:48.200 |
is to make use of all of these other modalities, right? 00:03:51.000 |
So if you can have your language model also watch all of the videos of cats in 00:03:55.080 |
the world, it's gonna understand the concept of cat much better. 00:03:59.160 |
And that's what we want to have in these models. 00:04:00.760 |
We want them to understand the world in the same way that humans understand it. 00:04:05.040 |
So right now, multimodality is really one of the main frontiers of this new 00:04:10.960 |
foundation model drive that we're all in right now. 00:04:19.640 |
But so what we'll see when this loads is this guy over here, 00:04:25.240 |
and we'll have the same audio effect being played. 00:04:28.960 |
So the audio is exactly the same, and this man is gonna say something like, 00:04:35.920 |
And so you're hearing a bee there, I think, if you look at my mouth, 00:04:41.360 |
But if you then change the video to where he says, bah, bah, bah, 00:04:46.200 |
with exactly the same audio, you're going to hear the other version. 00:04:50.680 |
So unfortunately, I can't really swap in the different audio here, so 00:04:55.640 |
We might suddenly start hearing a guy saying, bah, bah, bah, and then. 00:05:02.480 |
multimodal applications, so when we have multiple modalities, 00:05:11.640 |
And as I said, most of the use cases we have on the Internet, 00:05:16.320 |
And there are some really kind of obvious things we would be interested in 00:05:20.440 |
if we have information from these different data sources, 00:05:28.000 |
So maybe given a bit of text, we want to find the right image. 00:05:31.560 |
Or maybe given some image, we want to find the right text for it so 00:05:36.440 |
Obviously, we can also do this in a generative setting. 00:05:38.600 |
So then we have image captioning, which you probably heard of. 00:05:43.640 |
And going the other way, from text to image, that's image synthesis, so stable diffusion. 00:05:46.200 |
Everybody in the audience here has probably seen that. 00:05:48.880 |
Then we can do visual question answering, where we have an image and text. 00:05:54.920 |
We have multimodal classification, where we have image and text, and 00:05:57.560 |
we need to have a label, for example, whether something is hate speech or not. 00:06:01.160 |
And then in general, we want to be able to have a richer understanding of 00:06:05.920 |
information, which means that we combine images and text and then use it for 00:06:09.360 |
downstream applications that require better understanding or better generation. 00:06:20.000 |
I predict that this paper is going to do really well in terms of citations, 00:06:25.880 |
I think a lot of people are not actually going to read it. 00:06:29.000 |
And so, I mean, I've been in this field for quite a while now, and 00:06:32.080 |
people have been saying this for a really long time. 00:06:35.160 |
So for decades, people have been saying that multimodal is the next big thing. 00:06:44.080 |
the outline for what we're gonna be talking about. 00:06:46.520 |
So first, I'm gonna tell you a little bit about early models. 00:06:49.560 |
Then we're gonna do a bit of a deep dive on some of the specifics. 00:06:53.200 |
Then we're gonna go over a particular type of fusion, 00:06:58.640 |
Then we're gonna go through a little bit of the history of 00:07:03.880 |
Then we're gonna talk a little bit about evaluation, 00:07:07.600 |
And then I'll make some predictions for the future, and hopefully maybe give you 00:07:10.680 |
some cool research ideas or things to talk or think about. 00:07:17.080 |
there's a lot of work that happened before deep learning. 00:07:19.920 |
But I think if you want to start from the deep learning revolution and 00:07:23.680 |
what was happening in images and text, then a good starting point is, 00:07:29.080 |
for example, WSABIE or DeViSE, or Richard Socher, 00:07:33.880 |
who you've probably heard of, has done some really cool early work in this. 00:07:40.600 |
And the basic gist of this is that we have a vision model on the one hand and a word embedding model on the other. 00:07:48.760 |
the first lecture of this course I think was about word embeddings, right? 00:07:51.600 |
So that's just your basic word embedding model. 00:07:54.240 |
And now we need to figure out how to align them in the same multimodal space. 00:07:58.560 |
So the way you do that is you get some sort of similarity metric, right? 00:08:03.720 |
if you're thinking about this from a support vector machine literature perspective. 00:08:07.400 |
And now you need to figure out in a max margin or margin loss, 00:08:13.160 |
how you want to align these two points in your embedding space, right? 00:08:16.360 |
So things that are similar, you want to bring them closer together. 00:08:18.800 |
Things that are not, you want to bring them further apart. 00:08:21.720 |
And if you do that in this multimodal embedding space, 00:08:24.960 |
that means that you can do interesting cross-modal transfer, 00:08:28.840 |
where you can take the word embedding for something like auto or like horse, 00:08:32.840 |
and then you can find close images in the embedding space to that thing. 00:08:42.280 |
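As a rough illustration of that idea, here is a minimal sketch (not the actual WSABIE or DeViSE code) of a max-margin loss that pulls matched word and image embeddings together and pushes mismatched pairs apart; the tensor names and the margin value are placeholders.

```python
import torch
import torch.nn.functional as F

def cross_modal_margin_loss(word_vecs, image_vecs, margin=0.2):
    """Toy max-margin alignment loss.

    word_vecs, image_vecs: (batch, dim) tensors where row i of each is a
    matched word-image pair; the other rows in the batch act as negatives.
    """
    w = F.normalize(word_vecs, dim=-1)
    v = F.normalize(image_vecs, dim=-1)
    sims = w @ v.t()                             # (batch, batch) cosine similarities
    pos = sims.diag().unsqueeze(1)               # similarity of each matched pair
    hinge = (margin - pos + sims).clamp(min=0)   # negatives should sit `margin` below the positive
    mask = ~torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    return hinge[mask].mean()
```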
And I think a lot of this stuff that I'm going to talk about in the early slides, 00:08:46.400 |
you're going to see this thing come over and over again. 00:08:49.000 |
You're going to see it get kind of reinvented with fancier models, 00:08:54.760 |
So you can do cross-modal transfer where you have images and text, 00:08:59.080 |
but you can also combine them together so that you get a multimodal word embedding. 00:09:03.640 |
And so this just gives you a more accurate representation 00:09:10.600 |
Because when we think about the word moon or cat or something, 00:09:14.200 |
we can go to Wikipedia and read that a cat is a small carnivorous mammal 00:09:20.080 |
Or we can just go and look at pictures of cats, 00:09:24.400 |
And I would argue actually that for a lot of people, 00:09:26.720 |
the picture of the cat is much closer to the meaning of the concept of cat. 00:09:31.240 |
So some early work where people were trying to do this 00:09:36.360 |
is from Bruni et al. where they did multimodal distributional semantics 00:09:39.960 |
using this very elegant approach called bag of visual words. 00:09:44.160 |
So just like who has heard of bag of visual words? 00:09:49.480 |
Okay, so it's surprisingly simple, and so I kind of like it. 00:09:54.480 |
So you take a picture of a moon in this case. 00:09:57.160 |
I think you can see it in the back too, right? 00:09:58.720 |
So we use an algorithm like SIFT to find interesting key points. 00:10:03.760 |
So it's sort of where the difference between the pixels and the pixels next to it, 00:10:08.040 |
where that difference is big, those are sort of the spots you want to be looking at. 00:10:13.080 |
And for each of these key points, you get feature descriptors. 00:10:16.960 |
So relatively small vectors, something like 32-dimensional, that describe each key point. 00:10:23.080 |
And what you can do now with these feature descriptors is cluster them into a codebook of visual words, 00:10:27.840 |
and then you assign every one of these points to its nearest cluster, 00:10:31.400 |
so you can count how often they occur, right? 00:10:33.360 |
So in this picture of the moon, we have like actually the count is... 00:10:36.560 |
Oh, yeah, so there are three like red dots, right? 00:10:41.520 |
So what that gives you is an idea of the visual words, 00:10:46.000 |
very similar to the original bag of words model 00:10:48.280 |
that you hopefully have heard about maybe in the first lecture. 00:10:52.400 |
So that's the visual equivalent of the textual thing. 00:10:56.040 |
And so if you do this and you then concatenate the visual representation with the textual one, you get something 00:11:03.200 |
that is much more representative of human meaning. 00:11:12.600 |
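For concreteness, a minimal bag-of-visual-words sketch along those lines might look like this, assuming OpenCV for SIFT and scikit-learn for the clustering; the number of visual words is an arbitrary choice here.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def bag_of_visual_words(image_paths, n_visual_words=100):
    """Minimal bag-of-visual-words: SIFT descriptors -> k-means codebook -> count histograms."""
    sift = cv2.SIFT_create()
    all_desc, per_image_desc = [], []
    for path in image_paths:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(gray, None)   # (n_keypoints, descriptor_dim)
        per_image_desc.append(desc)
        all_desc.append(desc)
    # The cluster centroids are the "visual words".
    codebook = KMeans(n_clusters=n_visual_words, random_state=0).fit(np.vstack(all_desc))
    histograms = []
    for desc in per_image_desc:
        words = codebook.predict(desc)                # assign each keypoint to a visual word
        histograms.append(np.bincount(words, minlength=n_visual_words))
    return np.stack(histograms)                       # one visual-word count vector per image
```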
So after that, there were a couple of people, 00:11:24.760 |
and then you can transfer the features from your ConvNet 00:11:45.320 |
So when you see a word like cat in some context, 00:11:50.680 |
then when you see cat, you also want to predict cat pictures. 00:11:56.400 |
that this gives you much richer word representations. 00:12:02.880 |
What we really care about is not words, but sentences. 00:12:08.360 |
into sentence representations and how can we figure out 00:12:24.360 |
but now we just have a sentence encoder, right? 00:12:42.120 |
or some other kind of recurrent neural network, 00:12:44.240 |
or in the case of this one, recursive neural network, 00:12:47.360 |
and then we try to align the features together. 00:12:57.680 |
because we showed here that a grounded sentence representation 00:12:57.680 |
already gives you a really good sentence representation. 00:13:15.560 |
you can sort of imagine what things look like, 00:13:17.720 |
and that gives you a really good meaning representation, 00:13:19.800 |
which you can then transfer to, I don't know, 00:13:24.360 |
And then of course, once we have sentence encoders, 00:13:33.040 |
and so when the sequence-to-sequence architecture came out, 00:13:36.440 |
which you've probably also heard about in this course, 00:13:39.160 |
what you can do instead of having a text encoder 00:13:44.800 |
is you can plug in a ConvNet instead of an LSTM encoder, 00:13:53.840 |
We used to have all of these fancy diagrams in our papers 00:13:56.400 |
back then, where we explained LSTMs and how they work. 00:13:59.160 |
Probably people don't learn that anymore these days. 00:14:04.400 |
They might make a comeback, I think, you know, 00:14:06.320 |
at some point, transformers are gonna go away. 00:14:10.200 |
And so one of the things that people figured out 00:14:19.280 |
between your source language and your target language, 00:14:21.760 |
and you can do the same thing actually with images, right? 00:14:24.040 |
So if you want to align a word in your generated sequence 00:14:33.480 |
and that approach, of course, is called attention, right? 00:14:35.720 |
So, you know, you learned a lot about attention 00:14:45.720 |
and really see that when it has to generate stop 00:14:50.120 |
that it's really actually looking at the stop sign, right? 00:14:52.720 |
So there's a really cool alignment going on there 00:15:14.160 |
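A small sketch of that kind of attention over image regions during caption generation, in the spirit of show-attend-tell; the dimensions and the additive scoring function here are illustrative choices, not the exact model from the lecture.

```python
import torch
import torch.nn as nn

class RegionAttention(nn.Module):
    """Toy additive attention over image region features.

    At each decoding step, the caption decoder's hidden state scores every
    image region; the weights say where the model is looking while it
    generates the next word (e.g. the stop sign when emitting "stop").
    """
    def __init__(self, region_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.proj_region = nn.Linear(region_dim, attn_dim)
        self.proj_hidden = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, hidden):
        # regions: (batch, n_regions, region_dim); hidden: (batch, hidden_dim)
        scores = self.score(torch.tanh(
            self.proj_region(regions) + self.proj_hidden(hidden).unsqueeze(1)
        )).squeeze(-1)                                           # (batch, n_regions)
        weights = scores.softmax(dim=-1)                         # where to look
        context = (weights.unsqueeze(-1) * regions).sum(dim=1)   # weighted image summary
        return context, weights
```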
is really that you have this generator and discriminator, 00:15:16.480 |
and you want to have the generator generate images 00:15:19.400 |
that the discriminator cannot distinguish from, 00:15:23.120 |
so it cannot distinguish fake and real images, right? 00:15:26.480 |
you can actually condition that on the piece of text, 00:15:38.280 |
So the early precursors of stable diffusion were doing things like this, 00:15:38.280 |
and it's all a natural progression to that model. 00:15:47.240 |
Maybe, do people have any burning questions about this, 00:16:02.040 |
So those are really the kind of core building blocks 00:16:11.640 |
and sort of useful and doesn't look that difficult, 00:16:15.680 |
like why aren't we all doing multimodal things? 00:16:30.600 |
One problem is that language is often much more dominant than vision or audio in many use cases, right? 00:16:36.960 |
So the model latches onto the text and basically learns to ignore the image completely, 00:16:41.240 |
which famously happened for visual question answering, we'll get to that. 00:16:43.520 |
So visual question answering, you could do that 00:16:47.600 |
Additional modalities can add a lot of noise. 00:16:51.760 |
So it makes your machine learning problem more difficult. 00:16:58.400 |
sometimes you have text, sometimes you have pictures, 00:17:01.280 |
but you don't have a guarantee that you always have both. 00:17:10.080 |
And also just in general, like how to design your model 00:17:18.040 |
So in order to maybe drive that point home a little bit, 00:17:23.040 |
so featurizing text, I guess we all know how to do that 00:17:28.200 |
by now, especially sort of in the age of transformers 00:17:30.760 |
and before in LSTMs, where we just have like, 00:17:34.920 |
So batch size by sequence length by embedding size, right? 00:17:41.320 |
and that's how you encode your textual information 00:17:48.720 |
because you can just kind of look at the patches, 00:17:57.840 |
And in many cases, you don't really want to be this uniform. 00:18:01.160 |
You want to have something that actually looks 00:18:13.000 |
that encodes the features for that particular sub image, 00:18:16.520 |
like this guy's like skateboard or something. 00:18:18.480 |
It has its own like vector representation, right? 00:18:31.400 |
is a really good one if you haven't heard of that yet. 00:18:34.280 |
So we're at YOLO v7 now, I think, or eight, I don't know. 00:18:37.840 |
So there's a new one coming out every other year 00:18:42.840 |
But the basic idea is that we get these bounding boxes 00:18:46.320 |
for things in the images; there are actually also segmentations, 00:18:49.000 |
but the bounding boxes are what people tend to use 00:18:52.560 |
So this is labeled like backpack or something. 00:18:54.840 |
And so you can do this as a pre-processing step 00:18:57.560 |
on your image to get a much richer representation 00:19:14.680 |
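As a hedged sketch of that pre-processing step, a pretrained detector from torchvision (assuming a recent torchvision; the threshold and the specific detector are illustrative choices, not what any particular paper used) could supply the boxes whose crops you then featurize.

```python
import torch
import torchvision

# Off-the-shelf detector used purely as a region proposer for downstream featurization.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def region_boxes(image_tensor, score_threshold=0.7):
    """Return bounding boxes and labels for confident detections in one image.

    image_tensor: float tensor of shape (3, H, W) with values in [0, 1].
    Each kept box can then be cropped or RoI-pooled into its own feature vector.
    """
    with torch.no_grad():
        out = detector([image_tensor])[0]     # dict with 'boxes', 'labels', 'scores'
    keep = out["scores"] > score_threshold
    return out["boxes"][keep], out["labels"][keep]
```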
And so this probably feels like super obvious now, 00:19:18.680 |
but in 2014, when people were starting to discover this, 00:19:22.040 |
it was really very surprising that you could just use 00:19:27.080 |
to really replace the entire computer vision pipeline. 00:19:36.600 |
And then it was all thrown away and replaced by a ConvNet 00:19:36.600 |
And so the cool thing you get there is that you can transfer 00:19:49.000 |
and then use it to all kinds of very specialized things 00:19:52.320 |
like spotting buildings in Paris, for example, 00:19:57.720 |
And then of course, in the age of transformers, 00:20:10.720 |
So vision transformers are what we would use these days 00:20:13.760 |
to encode the images, where you have these flattened patches 00:20:13.760 |
that go into a BERT-style architecture, maybe as you would know it 00:20:22.880 |
from this course, and then you do classification, right? 00:20:27.200 |
everything's standard, except now your input here 00:20:29.240 |
is not words or tokens, it's patches of an image 00:20:34.240 |
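A minimal sketch of that patching step, just to make the shapes concrete (patch size 16 is the common ViT choice, but it is a free parameter):

```python
import torch

def patchify(images, patch_size=16):
    """Split images into flattened non-overlapping patches, ViT-style.

    images: (batch, 3, H, W) with H and W divisible by patch_size.
    Returns (batch, n_patches, 3 * patch_size * patch_size); each row is one
    "token" that a linear layer would then project to the transformer width.
    """
    b, c, h, w = images.shape
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # (b, c, h//p, w//p, p, p) -> (b, n_patches, c*p*p)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch_size * patch_size)
    return patches
```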
All right, so then we have a bunch of features 00:20:37.960 |
and now how do we combine the information, right? 00:20:50.520 |
So I don't think it's really useful to go over 00:20:56.720 |
So obviously like inner product or similarity 00:20:59.240 |
is what you would use if you want to do cross-modal things. 00:21:01.600 |
So if you want to embed things in the same vector space, 00:21:04.840 |
but you can do sort of fancier projections on top 00:21:08.440 |
or different combinations that are kind of linear 00:21:13.520 |
where you multiply the components element-wise 00:21:16.280 |
or you do some sort of gating over the different features. 00:21:18.680 |
You can do attention, you can do fancier bilinear things, 00:21:22.320 |
you can do very fancy compact bilinear things. 00:21:59.960 |
You can first treat them separately and then combine them 00:22:05.280 |
and then you only combine the final scores, right? 00:22:08.040 |
And so that ranges from what we would call early fusion 00:22:17.640 |
to late fusion, where you really just combine the scores or the logits 00:22:22.640 |
between the information from the different modalities. 00:22:25.520 |
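To make those options concrete, here is a toy sketch of a few common fusion operators over already-extracted text and image feature vectors; the function and argument names are made up for illustration.

```python
import torch

def fuse(text_feat, image_feat, mode="concat"):
    """A few common fusion operators over per-example feature vectors.

    text_feat, image_feat: (batch, dim) tensors, assumed already projected
    to the same dimensionality for everything except 'concat'.
    """
    if mode == "concat":          # early-ish fusion: let later layers mix the modalities
        return torch.cat([text_feat, image_feat], dim=-1)
    if mode == "product":         # element-wise multiplicative interaction
        return text_feat * image_feat
    if mode == "gated":           # text decides how much of the image to let through
        gate = torch.sigmoid(text_feat)
        return gate * image_feat
    if mode == "similarity":      # late fusion: only a scalar score per pair
        return (text_feat * image_feat).sum(dim=-1)
    raise ValueError(mode)
```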
So you could do really fun stuff with multimodal fusion. 00:22:34.120 |
where you have this sort of very special feature map, 00:22:44.840 |
So this multiplicative gamma and an additive sort of bias vector, beta, 00:22:56.280 |
So in this case, are there more cubes than yellow things? 00:22:58.640 |
So we have some vector representation for that 00:23:12.800 |
with the other one and really try to have them learn 00:23:17.400 |
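This description matches feature-wise modulation in the style of FiLM (my reading of the slide, so treat the attribution as an assumption): the text predicts a per-channel scale gamma and shift beta that are applied to the visual feature map. A minimal sketch:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: the question vector predicts a scale and
    shift that modulate every channel of the visual feature map."""
    def __init__(self, text_dim, n_channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(text_dim, 2 * n_channels)

    def forward(self, feature_map, text_vec):
        # feature_map: (batch, channels, H, W); text_vec: (batch, text_dim)
        gamma, beta = self.to_gamma_beta(text_vec).chunk(2, dim=-1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)   # broadcast over H and W
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return gamma * feature_map + beta           # the question modulates what the vision side computes
```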
All right, so let's talk about late fusion then. 00:23:22.240 |
So late fusion is what we would now call contrastive models 00:23:26.520 |
but the basic idea is that we have this similarity score. 00:23:31.000 |
we process the modalities completely independently 00:23:33.360 |
and then at the very end, we do some combination. 00:23:35.960 |
And the most famous instance of that these days is CLIP. 00:23:47.840 |
So it's again, exactly the same contrastive loss 00:23:51.280 |
that we've seen in all these early approaches. 00:23:54.200 |
It does kind of negative sampling, but then in batch. 00:24:06.960 |
And I just wanna make sure that I rank this thing higher 00:24:12.080 |
And I wanna make sure I rank this thing higher 00:24:18.280 |
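A compact sketch of that in-batch contrastive objective, written in the usual CLIP style (symmetric cross-entropy over the similarity matrix); the temperature value here is illustrative, and in the real model it is a learned parameter.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric in-batch contrastive loss: the matched pair on the diagonal
    should score higher than every other row or column in the batch."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)            # image -> correct text
    loss_t = F.cross_entropy(logits.t(), targets)        # text -> correct image
    return (loss_i + loss_t) / 2
```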
Really, nothing special about this architecture 00:24:22.480 |
but what made this thing so cool was first of all, 00:24:25.840 |
it was transformers and it was transformers all the way, 00:24:30.360 |
so your text encoder would be a transformer and your image encoder would be a ViT image encoder. 00:24:35.880 |
And it was trained on lots and lots of web data. 00:24:44.440 |
And he created, I think 300 million image text pairs 00:24:47.720 |
for this dataset, trained a bigger model on it 00:24:52.760 |
And then we got this amazing model out of it. 00:25:00.200 |
to the sort of texts that you would see on the internet. 00:25:07.640 |
It's gonna say a photo of a cat doing something, something. 00:25:10.720 |
So that means that you can do kind of zero shot 00:25:14.800 |
label predictions where you have a photo of the, 00:25:17.760 |
and then you need to figure out what the right label 00:25:21.560 |
is for a given image using this kind of prompt. 00:25:28.520 |
And so you can prompt vision and language models 00:25:30.880 |
in very much the same way and do zero shot generalization. 00:25:42.800 |
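Roughly, zero-shot classification with such a model looks like the sketch below: embed a prompt per label, embed the image, and pick the most similar label. The `encode_text` function is a stand-in for whatever contrastively trained text encoder you have.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, label_names, encode_text):
    """Zero-shot label prediction with a prompt, roughly as described for CLIP.

    image_emb: (dim,) embedding of one image from the image encoder.
    encode_text: any function mapping a list of strings to (n, dim) embeddings
    (an assumption; plug in your own contrastively trained text encoder).
    """
    prompts = [f"a photo of a {name}" for name in label_names]
    text_emb = F.normalize(encode_text(prompts), dim=-1)
    image_emb = F.normalize(image_emb, dim=0)
    scores = text_emb @ image_emb                 # cosine similarity per label
    return label_names[scores.argmax().item()]
```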
It's thorough, it's really worth a very close read, 00:25:59.960 |
But what really made it special was that it generalized 00:26:08.080 |
at some of these adversarial versions of ImageNet, 00:26:13.160 |
So it's just a way better image encoder in general. 00:26:19.960 |
there was this paper from Google called ALIGN, 00:26:30.840 |
but then you just keep like throwing more data 00:26:32.840 |
and more compute at it, and it often works much better. 00:26:38.120 |
1.8 billion image-text pairs instead of 300 million 00:26:49.440 |
is that there's this organization called LAION, 00:26:51.800 |
where they've started this open source collective 00:27:11.280 |
And so now there's a much bigger version of LAION 00:27:14.920 |
that's even multilingual and it has 5 billion examples. 00:27:18.080 |
So stable diffusion was trained on sort of the image, 00:27:23.760 |
And that's one of the reasons that it's so awesome, 00:27:29.160 |
And that really makes your system a lot better. 00:27:31.360 |
So if you're looking for like the ultimate dataset 00:27:41.360 |
All right, any questions about up until this point? 00:28:01.680 |
of what I think a lot of people in the field right now, 00:28:04.520 |
or if you're interested in getting in this field, 00:28:09.760 |
like this is what you should really understand. 00:28:12.320 |
And again, like the ideas sort of stack onto each other. 00:28:18.840 |
to give you an idea sort of how the scientists 00:28:39.160 |
Everybody should raise their hand now in this. 00:28:46.560 |
I think everybody kind of gets how BERT works, right? 00:28:55.800 |
because I want you to think about if you have a BERT model 00:29:05.080 |
Right, so there are a bunch of like obvious things 00:29:09.400 |
you could do given the kind of features I told you about 00:29:28.600 |
and then just concatenate it to whatever encoder, 00:29:31.440 |
like maybe an ANN or whatever you're training 00:29:41.160 |
and classify your token from BERT, concatenate them, 00:29:43.800 |
and then classify for like a cat or something like that 00:29:47.400 |
or whatever the thing is you're interested in, yeah. 00:29:51.240 |
You could also like take the ConvNet features 00:29:59.520 |
So I think a lot of people when BERT came out 00:30:04.080 |
who were working in vision and language processing 00:30:13.440 |
And so there were a lot of papers all coming out 00:30:25.120 |
into their own thing because of Hugging Face Transformers 00:30:36.240 |
and people would do object detection on this. 00:30:39.600 |
So you get like a hat and a racket and a shirt 00:30:45.200 |
and then plug them into your transformer model 00:30:49.080 |
and then you try to like recover the features. 00:30:56.400 |
And so this is what we call a single stream architecture 00:31:00.920 |
where you have all of these kind of concatenating 00:31:05.000 |
and then putting them through the same transformer. 00:31:08.960 |
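A minimal sketch of how such a single-stream input can be assembled (VisualBERT-flavoured, with made-up dimensions): project the detected region features to the text embedding width, add a segment embedding saying which modality each position is, and concatenate everything into one sequence for one transformer.

```python
import torch
import torch.nn as nn

class SingleStreamInput(nn.Module):
    """Build one token sequence out of text tokens and detected image regions."""
    def __init__(self, vocab_size, region_dim, hidden_dim):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden_dim)
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.segment_emb = nn.Embedding(2, hidden_dim)   # 0 = text, 1 = image

    def forward(self, token_ids, region_feats):
        # token_ids: (batch, seq_len); region_feats: (batch, n_regions, region_dim)
        text = self.tok_emb(token_ids) + self.segment_emb.weight[0]
        image = self.region_proj(region_feats) + self.segment_emb.weight[1]
        return torch.cat([text, image], dim=1)           # one stream for one transformer
```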
and that's something that this model called ViLBERT did 00:31:14.640 |
So you essentially have these two parallel transformers 00:31:19.880 |
you kind of give them cross attention, right? 00:31:25.760 |
so you just make sure you have an attention map 00:31:28.880 |
and then you just do your full normal transformer layer 00:31:42.120 |
and here you do sort of some equivalent of that 00:31:44.480 |
and then you also have your next sentence prediction 00:31:47.320 |
which you probably remember from your BERT lecture. 00:31:52.560 |
is this image aligned with this piece of text or not? 00:32:00.000 |
There are like a hundred papers that came out 00:32:03.680 |
So LXMERT had a different cross-modal output encoder, 00:32:08.800 |
of encoding the positional information, right? 00:32:11.760 |
I just have a bunch of bounding boxes that are featurized 00:32:14.160 |
but I don't care about where they are in the image. 00:32:25.600 |
And that's what you featurize into your network. 00:32:44.640 |
and you just give those feature maps to BERT. 00:32:50.120 |
between like your text segment embeddings, right? 00:32:56.480 |
But so this actually works surprisingly well. 00:32:58.920 |
You don't have to do any additional training. 00:33:11.640 |
And now you have a very good multimodal classifier 00:33:17.520 |
they're doing what they call multimodal pre-training 00:33:19.840 |
where first you have a BERT model and a ResNet. 00:33:29.960 |
before you fine tune it on the problem you care about. 00:33:32.920 |
And what we showed here is that you don't really need that 00:33:39.760 |
You can also go to the pixel level completely. 00:33:52.400 |
but here they do do the multimodal pre-training step 00:33:55.200 |
and show that I think for VQA it helps a little bit. 00:34:08.200 |
where they added a bunch of different losses. 00:34:10.960 |
We can really talk about this for a very long time. 00:34:18.520 |
So this one, ViLT, I think is quite interesting, 00:34:21.280 |
because here this is really the first instance 00:34:23.400 |
where we have completely moved away from ConvNet features. 00:34:27.040 |
So we don't do any pre-processing on the image, 00:34:37.280 |
So really integrate, we flatten those patches, 00:34:40.240 |
we just pump them into the transformer straight away. 00:34:43.240 |
So this really is like sort of BERT and ViT together 00:34:46.000 |
in one model and this worked really very well. 00:34:54.960 |
of all of these different models and what they do. 00:35:02.040 |
So do you use BERT or something fancier or better, 00:35:08.320 |
So in many cases you have these region features. 00:35:17.560 |
So either a single or dual stream as we talked about, 00:35:24.240 |
so masked language modeling, image-text matching. 00:35:28.240 |
There's a bunch of like funkier ones you can do. 00:35:31.720 |
So, and then finally you can do multimodal pre-training 00:35:35.320 |
on all of these different datasets that have aligned data. 00:35:41.600 |
okay, so what is really the interesting difference 00:35:59.360 |
So basically they say, if you take all of these 00:36:02.720 |
little model inventions and you train these different models 00:36:06.320 |
on exactly the same data in exactly the same way, 00:36:09.520 |
it turns out that they're all basically the same. 00:36:16.520 |
on the part of the field because everybody's saying, 00:36:18.480 |
well, my model is better, but it's actually just 00:36:22.640 |
There's no real sort of model innovation going on 00:36:29.880 |
or anything like that, but I think that's why 00:36:33.360 |
this paper is really nice and really important 00:36:35.360 |
is because it just shows us what really matters. 00:36:38.880 |
So this is also work that I did myself called Flava 00:36:43.880 |
with my team where we wanted to take these ideas 00:37:02.320 |
we only care about problems that always involve 00:37:08.280 |
the basic premise, I think, of foundation models in general 00:37:19.880 |
different modalities and then do useful things 00:37:24.280 |
So with Flava, that's exactly what we tried to build. 00:37:29.600 |
that is good at vision and language and computer vision 00:37:47.400 |
So it's good at the things that you would expect 00:37:49.400 |
as a kind of basic image model to be good at. 00:38:01.920 |
if you take all the datasets that were ever created 00:38:04.280 |
that have image text pairs that are publicly available. 00:38:06.840 |
So unfortunately, the CLIP data and the Google ALIGN data 00:38:10.200 |
and all of these datasets, they haven't been open source. 00:38:19.520 |
if you combine all of these image text pairs, 00:38:30.520 |
that we know we care about in these different fields. 00:38:41.240 |
I think if you work at a company like Facebook, 00:38:50.360 |
That's gonna really make your life a lot easier. 00:38:53.160 |
So the exact architecture here is that on the one hand, 00:38:57.760 |
we have this image encoder where we take the image, 00:39:01.720 |
and we just do what we call masked image modeling, 00:39:11.720 |
we have the masked language modeling on the language. 00:39:22.960 |
So we have a masked multi-modal modeling loss term 00:39:28.960 |
So this is like your BERT next sentence prediction thing. 00:39:31.520 |
And then we also have a global contrastive loss, 00:39:42.080 |
I think, to combine a lot of this information. 00:39:54.800 |
we were pretty thorough generating the table here. 00:40:01.640 |
if you compare Flava to all kinds of different ablations 00:40:11.880 |
of where we're probably gonna go with the field 00:40:20.960 |
is that everybody cares about generative models. 00:40:23.680 |
So language models and image generative models, 00:40:27.760 |
there's just a trend where we want to be generative, 00:40:31.680 |
discriminative stuff to the more interesting, 00:40:34.880 |
more richer representations maybe that you get 00:40:41.240 |
So this SimVLM paper was one of the first ones 00:40:46.160 |
that was trying to generate or kind of complete captions, 00:40:49.480 |
which they showed gives you a lot richer representations. 00:40:53.640 |
I think this is actually the current state of the art now, 00:41:07.040 |
I think that's also what they were trying to go for, 00:41:14.280 |
And I think, so it took us a while as a field 00:41:16.560 |
to really figure out how to do this the right way. 00:41:25.440 |
And so one of the interesting things you can do 00:41:29.000 |
with language models is just keep them frozen 00:41:32.080 |
and then learn how to project into the language models. 00:41:40.960 |
and we learned to project into the BERT token space. 00:41:56.560 |
and then you learn to project into the token space 00:42:01.640 |
And then you can do lots of fun stuff, it turns out. 00:42:11.120 |
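A toy sketch of that idea: keep the language model frozen and train only a projection that turns an image feature into a few pseudo-token embeddings prepended to the text. The prefix length and names here are arbitrary, and gradients flow back through the frozen model into the projection.

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    """Map image features into a frozen language model's token-embedding space.

    The LM weights stay fixed; only this projection is trained, with gradients
    flowing through the frozen LM back to the projection."""
    def __init__(self, image_dim, lm_embed_dim, prefix_len=4):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_embed_dim = lm_embed_dim
        self.proj = nn.Linear(image_dim, prefix_len * lm_embed_dim)

    def forward(self, image_feat):
        # image_feat: (batch, image_dim) -> (batch, prefix_len, lm_embed_dim)
        prefix = self.proj(image_feat).view(-1, self.prefix_len, self.lm_embed_dim)
        return prefix   # prepend to the text token embeddings before the frozen LM
```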
where you can just give it some kind of in-context examples 00:42:14.040 |
and it's gonna figure out binding kind of on the fly. 00:42:18.480 |
So it says like, this is a dax and this is a blicket. 00:42:22.760 |
And then it gives you the answer that it's a dax. 00:42:29.400 |
which is really kind of solving the grounding problem 00:42:32.480 |
that a lot of this multimodal stuff started with. 00:42:37.200 |
And then probably one of the coolest papers right now 00:42:41.960 |
or models right now that you might've heard of 00:42:43.880 |
if you follow the field is Flamingo out of DeepMind, 00:42:50.200 |
And so this is really a multimodal language model. 00:43:03.160 |
So what this gets you is just a much more powerful model 00:43:10.720 |
So it's really like stepwise, you can see it, right? 00:43:32.200 |
So we wanna make sure that we can compress it 00:43:45.880 |
this is not my code, this comes from the actual paper. 00:43:48.120 |
So they just have the diagram together with the code 00:43:50.560 |
so that you can really understand what it's doing, 00:43:56.400 |
And so once you have your perceiver resampling step, 00:44:00.840 |
what you then do is you do a gated cross attention. 00:44:08.440 |
you do that before your frozen language model layer. 00:44:12.200 |
So you really just have a frozen Chinchilla language model 00:44:15.480 |
and you learn to kind of modulate the information 00:44:20.040 |
You propagate the gradients all the way back, 00:44:24.040 |
So you're really kind of trying to figure out like, 00:44:28.040 |
so that my language model can do the most with it, right? 00:44:32.920 |
So you'll notice that now we do it before the layer, right? 00:44:39.640 |
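A minimal sketch of that gated cross-attention idea, in the spirit of the Flamingo diagram (dimensions and head counts are placeholders): the tanh gate starts at zero, so at initialization the block is a no-op and training starts exactly from the frozen language model's behaviour.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """New cross-attention into visual tokens, inserted in front of a frozen LM layer."""
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0: no effect at init

    def forward(self, text_hidden, visual_tokens):
        # text_hidden: (batch, seq, dim); visual_tokens: (batch, n_vis, dim)
        attended, _ = self.attn(text_hidden, visual_tokens, visual_tokens)
        return text_hidden + torch.tanh(self.gate) * attended
```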
So Karpathy, I think more than 10 years ago, had this image 00:44:45.920 |
with Barack Obama kind of setting his foot here on the scale 00:44:58.920 |
I think unless it really understands the scene. 00:45:04.680 |
this would be a really good visual Turing test. 00:45:11.000 |
And so obviously it's been a bit of a challenge 00:45:14.320 |
then to get something that actually works on this. 00:45:16.480 |
And so Flamingo, as it turns out, kind of gets the joke. 00:45:20.040 |
But yeah, so it's a bit unclear if it really gets the joke, 00:45:37.600 |
but you can really take this almost to the full extreme 00:45:42.120 |
And you just want to learn this kind of mapping 00:45:44.520 |
between your image encoder and your language model, 00:45:47.320 |
or your image encoder and your encoder decoder architecture. 00:45:57.000 |
where they experiment with like OPT for the language model 00:46:00.040 |
and Flan-T5 for the encoder decoder architecture. 00:46:04.960 |
It gives you really complex captions and things like that 00:46:09.040 |
without any real direct supervision on the captions itself, 00:46:23.320 |
from captioning to reasoning, to visual question answering, 00:46:29.960 |
So you can have a long conversation with this system. 00:46:32.720 |
This really is kind of the future where we're going, right? 00:46:36.600 |
but it's also going to be able to see the world in a way. 00:46:43.440 |
so you've probably heard of like chain of thought prompting 00:46:46.080 |
and things like that, where you ask the language model, 00:46:50.520 |
and you can tell a vision and language model, 00:46:54.080 |
generate a rationale for why something might be the case. 00:47:03.160 |
And then after that, you ask it to answer the question. 00:47:05.880 |
And it turns out that if you do that sort of multimodal 00:47:08.760 |
chain of thought prompting, then the system gets much better. 00:47:11.960 |
And so, this was like the new state of the art 00:47:17.800 |
just because it learns to unpack the information, right? 00:47:23.560 |
just starting to figure out what the potential is of this. 00:47:26.760 |
And I think this paper is where they also show 00:47:33.480 |
And they show very nice results on Raven matrices 00:47:37.040 |
and like very complicated kind of IQ tests sort of things 00:47:40.960 |
that humans are supposed to be really good at, 00:47:51.640 |
And we started off from a very simple BERT model 00:48:33.720 |
- Yeah, yeah, so I think the history of computer vision 00:48:41.320 |
where we thought we needed all of this structure 00:48:44.720 |
And it turns out you can just throw it all away 00:48:46.680 |
and just have a big transformer over the patches. 00:48:53.040 |
- Seeing as it's 2:31, to save time... 00:49:03.800 |
- Yeah, yeah, sorry, I should have explained that better. 00:49:06.760 |
So, it just means that we are not updating the weights. 00:49:11.280 |
So, like if we go to this here, I think is a nice example. 00:49:19.520 |
So, that just means that when we do a forward pass, 00:49:22.240 |
we go all the way to whatever we want to predict, 00:49:24.560 |
we get some gradients, we take them all the way down, 00:49:30.760 |
So, here the projection weights actually do get updated, but the reason we keep the language model frozen 00:49:36.280 |
is because otherwise you're gonna drift way too far. 00:49:41.640 |
all of the cool stuff your language model has learned, 00:49:44.000 |
because you're just gonna focus on the small data set 00:49:48.160 |
So, you wanna preserve the abilities of the language model, 00:49:50.800 |
but you want it to become good at the thing you care about. 00:50:03.160 |
is there a benefit to doing like the earlier or middle fusion 00:50:08.000 |
- Yeah, so, I mean, we're gonna talk about evaluation next, 00:50:11.800 |
but so it really depends on the task that you care about. 00:50:15.400 |
And so, I would say the earlier is always the better 00:50:20.360 |
And so, like CLIP is very efficient to train, 00:50:23.920 |
it's very late fusion, right, at the very end. 00:50:25.800 |
So, there's no interaction between the different modalities. 00:50:28.800 |
And so, that's really good if you want to be very efficient 00:50:33.240 |
and if you wanna be like, for training, it's much nicer. 00:50:37.480 |
But if you want to have a richer understanding 00:50:47.520 |
- It seems that images are just a lot more data than text. 00:50:52.520 |
So, how much more difficult are these to train 00:50:55.600 |
and how much bigger does like the image processing 00:51:03.520 |
- Yeah, so, images are more complex in a way, 00:51:08.520 |
but they're also kind of higher bandwidth representations. 00:51:14.280 |
just pixels that our brains just abstract away, right? 00:51:17.640 |
It's really about the scene that you're seeing 00:51:26.680 |
that language is just a kind of low bandwidth, 00:51:33.560 |
which is much richer and much higher bandwidth. 00:51:35.800 |
And like he thinks probably visual, I'm not so sure. 00:51:39.800 |
But so, yeah, I don't think that there's necessarily 00:51:43.800 |
a difference between kind of the scaling laws 00:51:47.440 |
or at least we still have to figure that out. 00:51:50.800 |
We'll kind of talk about that towards the end as well. 00:51:53.400 |
- Do these modern models also have certain social 00:51:59.440 |
and cultural bias, just like the natural language model? 00:52:06.800 |
So, yeah, some people are actually working on this 00:52:24.560 |
then the model will think that he's playing ping pong 00:52:39.000 |
that you should be working on if you're a student 00:52:42.720 |
how do we get these systems to be much better 00:52:54.360 |
So, we wanna understand the content of a video. 00:53:02.400 |
you might see like what improvements we can make 00:53:08.840 |
- Yeah, so, you're asking about the attention mask 00:53:13.520 |
Yeah, so you can use the same idea for videos 00:53:21.920 |
you can really track objects kind of real time 00:53:39.160 |
because you can very often just sub sample images 00:53:43.280 |
rather than having to deal with the complex video. 00:53:57.560 |
let's say you only provide a single source of media 00:54:10.520 |
is that they're really just built for multi-modal stuff. 00:54:13.640 |
And so, what if I don't have an image, right? 00:54:26.040 |
so the supervised multi-modal bi-transformer, 00:54:30.440 |
how robust is this model to missing images or missing text? 00:54:51.320 |
And so, I think if I'm gonna tell you about multi-modality, 00:54:55.480 |
I also have to tell you how you're gonna check 00:55:16.280 |
because you only have limited GPUs anyway, right? 00:55:33.360 |
So, ImageNet really changed the history of deep learning, 00:55:45.600 |
where they have just a bunch of main multi-modal tasks. 00:55:57.120 |
the bounding boxes, the labels of the bounding boxes, 00:56:00.240 |
they come at like sort of a different pixel granularities. 00:56:07.360 |
annotated in terms of like the categories that it has, 00:56:10.040 |
and then you have five captions for each of these images. 00:56:20.440 |
because you had your picture and you had your caption, 00:56:24.480 |
okay, how do I give the right caption for this image? 00:56:31.080 |
the right image or the image for the piece of text? 00:56:34.800 |
So, there's a bunch of very impactful datasets 00:56:37.680 |
that do this stuff that we already talked about, LAION, 00:56:44.520 |
as the canonical instance of this dataset category. 00:56:49.200 |
And then, the other thing that people really care about 00:56:56.320 |
And so, there really are a bunch of academic groups 00:57:03.760 |
that they didn't really care about anything else, 00:57:07.520 |
that are really optimized just for multimodal 00:57:13.200 |
in the citation counts as of last night, 3 a.m., 00:57:24.520 |
And so, what you do here is you just have an image 00:57:30.120 |
so annotators, right, they ask these simple questions, 00:57:34.400 |
and now we want to be able to answer these questions 00:57:39.760 |
one of the kind of embarrassing backstories of this dataset 00:57:45.560 |
was that the images were actually found to not really matter at all. 00:58:01.240 |
the right answer for how much or how many question was two. 00:58:04.520 |
So if you just predicted two for every "how much" or "how many" question, 00:58:08.600 |
you got like 70% accuracy on the counting category. 00:58:12.160 |
So careful dataset or evaluation benchmark design 00:58:18.040 |
and you really need to think about what you're doing. 00:58:20.000 |
You can't just like set some data aside and evaluate it on, 00:58:23.120 |
you have to really think about what you're doing. 00:58:30.400 |
a better designed version of this dataset maybe. 00:58:34.960 |
There are also kind of very targeted datasets 00:58:40.680 |
that really try to measure one particular thing. 00:58:42.920 |
And I think one of the things we really want to get at 00:58:46.080 |
with these models is what we would call compositionality. 00:58:49.320 |
So we want to be able to really take the parts 00:58:59.320 |
that was designed really to measure the compositionality 00:59:03.040 |
both on the language side and on the vision side. 00:59:06.840 |
between all of these different objects in the images. 00:59:16.480 |
But a lot of these datasets really had big problems. 00:59:21.320 |
So one of the problems is, they were too easy. 00:59:31.960 |
and that's probably gonna make some people's lives better. 00:59:40.640 |
So obviously, these memes are not actually the real hateful examples 00:59:51.080 |
which are in the dataset, but that would be less fun. 00:59:54.080 |
So these are mean meme examples to kind of demonstrate 01:00:02.200 |
And so one of the problems we had, as I said, 01:00:28.400 |
But so it turns out that if you just swap out the background 01:00:54.240 |
suddenly it's like a really nice thing to say, right? 01:01:00.520 |
if you want to classify this correctly for the meanness, 01:01:04.280 |
then you have to really understand multimodal reasoning. 01:01:12.480 |
And so it was really constructed by design to do that. 01:01:19.400 |
is we use some really highly trained annotators. 01:01:26.240 |
is that nobody really knows who owns the meme, 01:01:38.200 |
and they were very afraid of copyright things. 01:01:49.640 |
so we could show them kind of the actual examples. 01:01:54.960 |
that were kind of corresponding to the original source image 01:02:00.520 |
but now with an image that we could buy from Getty. 01:02:07.440 |
so that we could then release the dataset to the public 01:02:11.000 |
so that people could do actually research on this 01:02:21.240 |
sorry, it's a startup world with co-founders. 01:02:39.960 |
And so this led to a really nice dataset, I think, 01:02:46.120 |
that I think a lot of people in the field had, 01:02:48.360 |
which is that multimodal pre-training doesn't really work. 01:02:53.560 |
So multimodal pre-training doesn't really work. 01:02:58.000 |
And so all of this stuff that people have been doing 01:03:02.880 |
actually turned out maybe to not really be that useful 01:03:06.080 |
anyway and so maybe it got you like one point extra, right? 01:03:09.400 |
From VisualBERT to like a different VisualBERT, 01:03:16.680 |
So that means like we still have to figure this stuff out, 01:03:29.800 |
that does something new, like we're not there yet. 01:03:33.080 |
And I think that's encouraging, especially for you. 01:03:45.680 |
to try to see what people could come up with. 01:03:47.920 |
And so there was a lot of nice work coming out of that 01:03:52.480 |
and we really kind of managed to crank the numbers up 01:03:56.240 |
But the solutions were slightly disappointing. 01:04:12.760 |
where there wasn't really the fundamental breakthrough 01:04:25.840 |
So the theme sort of of this section is like, 01:04:28.000 |
if you make a dataset, think about it very carefully 01:04:31.600 |
because you can really be very creative with this 01:04:33.560 |
and really measure the things you're trying to get at. 01:04:44.240 |
and it's way better than things that were previously there, 01:04:47.200 |
but does it understand compositional relationships 01:04:50.280 |
in the same way that humans would understand it 01:04:52.240 |
or is it sort of just fitting onto the data distribution 01:04:55.440 |
and it can be very good at the head of the distribution, 01:05:00.360 |
And you can probably already guess where this is going, 01:05:07.560 |
you would have some plants surrounding a light bulb 01:05:10.960 |
or you would have a light bulb surrounding some plants. 01:05:13.880 |
So notice that the words here are exactly the same words, 01:05:19.480 |
So, and so the visual depiction of these words 01:05:25.960 |
your contrastive model is actually good at understanding 01:05:29.000 |
the visual semantic or the visual linguistic compositionality 01:05:43.400 |
and it just kind of is biased toward what it sees often, 01:05:56.320 |
"Order Word Matters Pre-training for Little." 01:06:15.600 |
And so that's probably not what we want to have. 01:06:37.400 |
Like these are very different pictures, right? 01:06:54.960 |
State-of-the-art models often perform below random chance. 01:07:03.240 |
we still have a lot of work to do, which is good. 01:07:38.960 |
If you don't add that, then it breaks down completely. 01:07:45.120 |
or sort of tuning on the test set, but okay, you know. 01:07:51.080 |
So it definitely is better than I think a lot of people 01:07:54.760 |
would have expected even a couple of years ago. 01:07:57.120 |
But it's not perfect because people on the internet 01:08:02.120 |
like to take more pictures of spoons than forks. 01:08:04.800 |
So if you say there are fewer spoons than forks, 01:08:17.400 |
You know, and so maybe it's like the matrix or something, 01:08:25.040 |
So again, what you can see here is that these models 01:08:37.160 |
like it still can't count fingers and things like that. 01:08:39.840 |
So again, there's still a lot of cool work to be done. 01:08:57.320 |
because so we've really just been focused on images 01:09:04.560 |
And so that makes it sort of an obvious thing to focus on. 01:09:10.760 |
like vision is a very dominant modality, right? 01:09:13.080 |
So how we understand the world is very vision driven, 01:09:18.840 |
So there's all these other interesting problems 01:09:22.920 |
And so the most obvious one is just speech or audio, right? 01:09:29.120 |
and really we could do another lecture just like this, 01:09:34.200 |
And there's lots of interesting stuff to talk about. 01:09:41.000 |
of how amazing Alec Radford is at creating datasets. 01:09:45.000 |
So there's this Whisper model that came out of OpenAI 01:09:57.400 |
and they trained this very fancy thing on there, 01:10:06.600 |
and then you feed that into a big transformer. 01:10:08.640 |
So this is sort of your encoder self-attention here, right? 01:10:17.280 |
So this is encoder decoder, basic transformer model, 01:10:23.280 |
one dimensional convolutions over the log-mel spectrogram. 01:10:26.280 |
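A rough sketch of that kind of front end, using torchaudio for the log-mel spectrogram and two 1-D convolutions before a transformer encoder; the hyperparameters here are illustrative, not the actual Whisper configuration.

```python
import torch
import torch.nn as nn
import torchaudio

class AudioFrontEnd(nn.Module):
    """Waveform -> log-mel spectrogram -> 1-D convolutions -> sequence of vectors
    that a transformer encoder can consume."""
    def __init__(self, n_mels=80, d_model=512):
        super().__init__()
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=400, hop_length=160, n_mels=n_mels)
        self.conv1 = nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1)

    def forward(self, waveform):
        # waveform: (batch, n_samples) sampled at 16 kHz
        mel = torch.log(self.mel(waveform) + 1e-6)        # (batch, n_mels, frames)
        x = torch.relu(self.conv1(mel))
        x = torch.relu(self.conv2(x))                      # downsample in time
        return x.transpose(1, 2)                           # (batch, frames', d_model)
```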
And so there's lots of papers that do very similar things. 01:10:32.440 |
that tried to turn the wave signal into vectors, 01:10:34.720 |
or you can discretize it in lots of different ways. 01:10:40.080 |
Then I think one of the funny observations actually 01:10:42.760 |
is that you can just reduce audio to vision anyway, right? 01:11:04.280 |
So what does the spectrum of the audio file look like? 01:11:08.000 |
Feed that to a regular ConvNet, like an AlexNet even, 01:11:11.480 |
and then that gives you amazing auditory features. 01:11:15.520 |
between violins or guitars and things like that. 01:11:17.960 |
So maybe you can just reduce all of this to vision. 01:11:23.480 |
can we also reduce language to vision or vision to language? 01:11:27.280 |
So that's sort of what people are thinking about. 01:11:35.320 |
So a lot of these ideas also extend pretty directly 01:11:38.720 |
to video, but now you just have more data, right? 01:11:46.880 |
Probably a lot of the images are pretty useless 01:11:50.120 |
for what you're trying to do with this video model, right? 01:11:53.840 |
It doesn't really add all that much information. 01:12:08.360 |
and language transformer encoder thing on top of that. 01:12:16.400 |
And so there's this, so Merlot is a nice architecture 01:12:23.400 |
and a follow-up with kind of a silly name, MERLOT Reserve, where they also added audio 01:12:30.120 |
And so we're going towards this foundation model 01:12:33.400 |
that can consume all of these different modalities 01:12:37.160 |
And that's really like a clear trend in the field. 01:12:55.160 |
But what you can do is you can have simulated environments. 01:13:01.680 |
where they had this agent walk around in a maze 01:13:03.800 |
and then it could follow natural language instructions. 01:13:06.160 |
It could also generalize to things like daxes and blickets 01:13:08.640 |
and different sort of groundings and assignments 01:13:16.560 |
because this is how humans learn language, right? 01:13:21.520 |
We have all of these different perceptual observations. 01:13:29.200 |
And that's how we learn everything we know about the world. 01:13:32.160 |
And so our language is very intricately connected 01:13:45.600 |
So especially with this kind of conditioning on text 01:13:54.360 |
And the original GAN we talked about at the beginning. 01:13:59.000 |
but now you're generating 3D point clouds, right? 01:14:13.720 |
and it's just gonna design the whole house for you. 01:14:16.600 |
So you can just like tweak the prompt and things like that. 01:14:23.600 |
So the final modality I just briefly wanted to talk about 01:14:33.720 |
And so olfaction means smell if you didn't know. 01:14:39.720 |
so my PhD thesis was about grounding semantics 01:14:50.080 |
now audio is sort of the obvious next one, right? 01:14:59.560 |
And that's gonna give you a richer representation. 01:15:03.640 |
what's actually very primitive to their meaning 01:15:15.640 |
if you want to complete all of your perceptual modalities 01:15:19.320 |
is you can try to build olfactory embeddings. 01:15:32.640 |
this Sigma Aldrich Fine Flavors and Fragrances catalog, 01:15:36.840 |
where you can look up words like melon and pineapple 01:15:40.160 |
and then it's gonna give you all of the chemical compounds 01:15:52.880 |
to get it to be a bit more of a real embedding model. 01:15:56.360 |
So now you get smell embeddings, smell vectors, 01:15:59.640 |
and then you can compute similarity judgments 01:16:12.200 |
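A toy sketch of how such smell embeddings can be built: represent each word by a bag of the aroma compounds a catalogue lists for it, then compare words with cosine similarity. The compound names below are made-up placeholders, not entries from the actual catalogue.

```python
import numpy as np

# Hypothetical toy data: for each word, the set of aroma compounds a flavour /
# fragrance catalogue might list for it.
word_to_compounds = {
    "melon":     {"c1", "c2", "c3"},
    "pineapple": {"c2", "c3", "c4"},
    "leather":   {"c7", "c8"},
}

compounds = sorted({c for cs in word_to_compounds.values() for c in cs})
index = {c: i for i, c in enumerate(compounds)}

def smell_vector(word):
    """Bag-of-compounds 'smell embedding': 1 if the compound is listed for the word."""
    vec = np.zeros(len(compounds))
    for c in word_to_compounds[word]:
        vec[index[c]] = 1.0
    return vec

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Fruits end up closer to each other than to leather-like smells.
print(cosine(smell_vector("melon"), smell_vector("pineapple")))
print(cosine(smell_vector("melon"), smell_vector("leather")))
```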
So you get these clusters of different smells 01:16:27.800 |
so like if you have a word like democracy in there, 01:16:43.120 |
And then, so the really interesting thing to me was that these smell embeddings were actually more correlated with human similarity judgments 01:16:51.880 |
than the linguistic vectors we had at the time. 01:17:01.440 |
And so you can do like skip gram and things like that. 01:17:04.800 |
But that thing is not going to be as correlated 01:17:14.400 |
So even something like smell where maybe we think, 01:17:37.000 |
I'll just, I think I've already said most of this actually. 01:17:39.640 |
So one foundation model is going to rule them all. 01:17:51.880 |
and trying to understand really what is the relationship 01:17:55.600 |
which one do we want more of, that sort of stuff. 01:18:09.480 |
We need way better evaluation and better measurements.