
Stanford CS224N NLP with Deep Learning | 2023 | Lecture 16 - Multimodal Deep Learning, Douwe Kiela


Whisper Transcript

00:00:00.000 | [BLANK_AUDIO]
00:00:05.200 | So today, I'm delighted to introduce our first invited speaker,
00:00:10.560 | who's Douwe Kiela.
00:00:12.800 | As well as being an invited speaker, and I'll tell you his background,
00:00:17.920 | Douwe is also in the Symbolic Systems program as an adjunct professor and
00:00:23.840 | has been involved with some students in that role as well.
00:00:26.680 | But in his invited role, he's originally from the Netherlands,
00:00:30.640 | where he even learned some logic, among other things, back in the old days.
00:00:34.760 | But in more recent times,
00:00:36.680 | he's been a prominent deep learning researcher.
00:00:41.160 | For a number of years, he worked at Facebook, now Meta, in the FAIR unit,
00:00:47.600 | and was involved in various ideas, including retrieval augmented generation.
00:00:54.000 | After that, he then spent some time at Hugging Face.
00:00:57.760 | He's become interested in looking at multimodal models,
00:01:01.560 | which is what he's gonna be talking about today.
00:01:04.640 | And welcome, Douwe, it's great to have you.
00:01:07.000 | >> Thank you very much.
00:01:08.680 | >> [APPLAUSE]
00:01:15.560 | >> All right, that works, right?
00:01:17.360 | Yes, yeah, thanks everyone for coming.
00:01:20.520 | I understand that you get points for being here, so
00:01:22.520 | you're not really here for me.
00:01:23.600 | >> [LAUGH]
00:01:24.320 | >> But thanks for coming anyway.
00:01:27.160 | So I'm gonna talk about multimodal deep learning.
00:01:29.680 | It's gonna have an NLP focus, of course, as for this course.
00:01:32.800 | But it's also because otherwise I would really be talking for
00:01:36.720 | many more hours than I have time for here.
00:01:38.880 | So I'll try to really keep it focused on the things that I think will be
00:01:43.120 | most useful for you to learn.
00:01:44.760 | And so the first thing you should understand is that this whole concept of
00:01:48.760 | multimodality is kind of ill-defined, actually.
00:01:52.360 | So if you go to the dictionary, you'll see that it means having or
00:01:56.160 | involving several modes or modalities or maxima.
00:02:00.720 | And so what mode here really means is, so it could be mode in the very generic sense,
00:02:06.080 | or it could be a very precise sense of the mode of a statistical distribution.
00:02:11.240 | And so depending on the paper you're reading, in some cases,
00:02:14.440 | people really mean the statistical sense.
00:02:16.200 | In other cases, people really mean this sort of very vague concept of a modality,
00:02:20.920 | where it really means the type of information that you're getting.
00:02:23.840 | So an example of modality in that case is an image or speech signal or
00:02:28.480 | audio in general, or even olfaction, so smell or things like that.
00:02:33.160 | So in this lecture, we're just gonna focus mostly on text,
00:02:38.640 | because this is an NLP course, and we're gonna focus on images mostly as
00:02:42.640 | the other modality to keep it simple.
00:02:45.040 | All right, so why does it matter?
00:02:48.400 | Why do we care about multimodality?
00:02:50.880 | And so there are a couple of really good reasons in general for this.
00:02:55.480 | The first one is about faithfulness.
00:02:58.320 | So if you look at how we humans understand the world,
00:03:01.200 | how we make sense of what happens in the world, that is very multimodal, right?
00:03:06.720 | So we perceive the world not just using vision or just audio, but
00:03:11.040 | we synthesize information across all of these different modalities, and
00:03:14.600 | that's how we understand the world and each other.
00:03:17.800 | There's also a very practical argument for doing it.
00:03:20.680 | It's because the Internet is multimodal, right?
00:03:23.000 | So if you go to, I don't know, Facebook or something like that,
00:03:27.080 | it rarely happens that it's just text or just an image.
00:03:29.760 | There's usually a combination of multiple modalities.
00:03:33.160 | And then the final good reason that we're just starting to hit now,
00:03:37.920 | if you're really following where the field is going,
00:03:40.760 | we're kind of running out of text data for these large language models.
00:03:44.000 | So one interesting way to keep scaling on the data side
00:03:48.200 | is to make use of all of these other modalities, right?
00:03:51.000 | So if you can have your language model also watch all of the videos of cats in
00:03:55.080 | the world, it's gonna understand the concept of cat much better.
00:03:59.160 | And that's what we want to have in these models.
00:04:00.760 | We want them to understand the world in the same way that humans understand it.
00:04:05.040 | So right now, multimodality is really one of the main frontiers of this new
00:04:10.960 | foundation model drive that we're all in right now.
00:04:15.600 | There's a thing called the McGurk effect.
00:04:17.760 | Let's see if it loads up.
00:04:19.640 | But so what we'll see when this loads is this guy over here,
00:04:25.240 | and we'll have the same audio effect being played.
00:04:28.960 | So the audio is exactly the same, and this man is gonna say something like,
00:04:33.040 | bah, bah, bah.
00:04:35.920 | And so you're hearing a 'b' there, I think, if you look at my mouth,
00:04:39.720 | because that's what I said.
00:04:41.360 | But if you then change the video to one where he mouths, fah, fah, fah,
00:04:46.200 | with exactly the same audio, you're going to hear the other version.
00:04:50.680 | So unfortunately, I can't really swap in the different audio here, so
00:04:54.160 | you have to trust me for it.
00:04:55.640 | We might suddenly start hearing a guy saying, bah, bah, bah, and then.
00:04:59.000 | >> [LAUGH] >> All right, so
00:05:02.480 | multimodal applications, so when we have multiple modalities,
00:05:09.320 | we can do all kinds of interesting things.
00:05:11.640 | And as I said, most of the use cases we have on the Internet,
00:05:14.720 | they're all multimodal.
00:05:16.320 | And there are some really kind of obvious things we would be interested in
00:05:20.440 | if we have information from these different data sources,
00:05:23.600 | right, from different modalities.
00:05:25.600 | So obviously, we might want to do retrieval.
00:05:28.000 | So maybe given a bit of text, we want to find the right image.
00:05:31.560 | Or maybe given some image, we want to find the right text for it so
00:05:34.960 | we can match them up.
00:05:36.440 | Obviously, we can also do this in a generative setting.
00:05:38.600 | So then we have image captioning, which you probably heard of.
00:05:41.080 | We can do text to image generation.
00:05:43.640 | So that's image synthesis, like Stable Diffusion.
00:05:46.200 | Everybody in the audience here has probably seen that.
00:05:48.880 | Then we can do visual question answering, where we have an image and text.
00:05:52.920 | And then we need to generate some new text.
00:05:54.920 | We have multimodal classification, where we have image and text, and
00:05:57.560 | we need to have a label, for example, whether something is hate speech or not.
00:06:01.160 | And then in general, we want to be able to have a richer understanding of
00:06:05.920 | information, which means that we combine images and text and then use it for
00:06:09.360 | downstream applications that require better understanding or better generation.
00:06:14.400 | So this field really is super hot right now.
00:06:17.520 | So there's this nice paper title.
00:06:20.000 | I predict that this paper is going to do really well in terms of citations,
00:06:23.480 | just because it has such a citable title.
00:06:25.880 | I think a lot of people are not actually going to read it.
00:06:29.000 | And so, I mean, I've been in this field for quite a while now, and
00:06:32.080 | people have been saying this for a really long time.
00:06:34.000 | I think Chris would agree.
00:06:35.160 | So for decades, people have been saying that multimodal is the next big thing.
00:06:39.560 | But now it's really true, I think.
00:06:40.920 | >> [LAUGH] >> All right, so
00:06:44.080 | the outline for what we're gonna be talking about.
00:06:46.520 | So first, I'm gonna tell you a little bit about early models.
00:06:49.560 | Then we're gonna do a bit of a deep dive on some of the specifics.
00:06:53.200 | Then we're gonna go over a particular type of fusion,
00:06:56.560 | contrastive models or late fusion.
00:06:58.640 | Then we're gonna go through a little bit of the history of
00:07:01.760 | multimodal foundation models.
00:07:03.880 | Then we're gonna talk a little bit about evaluation,
00:07:06.000 | a little bit about other modalities.
00:07:07.600 | And then I'll make some predictions for the future, and hopefully maybe give you
00:07:10.680 | some cool research ideas or things to talk or think about.
00:07:13.720 | All right, so obviously,
00:07:17.080 | there's a lot of work that happened before deep learning.
00:07:19.920 | But I think if you want to start from the deep learning revolution and
00:07:23.680 | what was happening in images and text, then a good starting point is,
00:07:29.080 | for example, WSABIE or DeViSE, or Richard Socher,
00:07:33.880 | who you've probably heard of, has done some really cool early work in this.
00:07:38.280 | They really pioneered a lot of these ideas.
00:07:40.600 | And the basic gist of this is that we have a vision model on the one hand,
00:07:45.960 | we have a language model.
00:07:47.800 | So this really, I mean,
00:07:48.760 | the first lecture of this course I think was about word embeddings, right?
00:07:51.600 | So that's just your basic word embedding model.
00:07:54.240 | And now we need to figure out how to align them in the same multimodal space.
00:07:58.560 | So the way you do that is you get some sort of similarity metric, right?
00:08:01.640 | A score function or like a kernel function,
00:08:03.720 | if you're thinking about this from a support vector machine literature perspective.
00:08:07.400 | And now you need to figure out in a max margin or margin loss,
00:08:13.160 | how you want to align these two points in your embedding space, right?
00:08:16.360 | So things that are similar, you want to bring them closer together.
00:08:18.800 | Things that are not, you want to bring them further apart.
00:08:21.720 | And if you do that in this multimodal embedding space,
00:08:24.960 | that means that you can do interesting cross-modal transfer,
00:08:28.840 | where you can take the word embedding for something like auto or like horse,
00:08:32.840 | and then you can find close images in the embedding space to that thing.
00:08:36.720 | And now you've solved the retrieval problem.
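As a rough sketch of the margin-based alignment objective described here (illustrative PyTorch, not code from any of the papers mentioned; `img_emb` and `txt_emb` are just batches of image and text embeddings):

```python
import torch
import torch.nn.functional as F

def margin_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Matched image/text pairs (the diagonal) should score at least
    `margin` higher than every mismatched pair, in both directions."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    scores = img_emb @ txt_emb.t()          # (B, B) cosine similarities
    pos = scores.diag()
    # hinge over mismatched captions for each image, and vice versa
    cost_img = (margin + scores - pos.unsqueeze(1)).clamp(min=0)
    cost_txt = (margin + scores - pos.unsqueeze(0)).clamp(min=0)
    off_diag = ~torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return (cost_img[off_diag].sum() + cost_txt[off_diag].sum()) / scores.size(0)
```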
00:08:39.960 | So this is a really nice early application.
00:08:42.280 | And I think a lot of this stuff that I'm going to talk about in the early slides,
00:08:46.400 | you're going to see this thing come over and over again.
00:08:49.000 | You're going to see it get kind of reinvented with fancier models,
00:08:52.040 | but it's basically all the same stuff.
00:08:54.760 | So you can do cross-modal transfer where you have images and text,
00:08:59.080 | but you can also combine them together so that you get a multimodal word embedding.
00:09:03.640 | And so this just gives you a more accurate representation
00:09:08.000 | of how humans understand word meaning.
00:09:10.600 | Because when we think about the word moon or cat or something,
00:09:14.200 | we can go to Wikipedia and read that a cat is a small carnivorous mammal
00:09:18.320 | that people like to keep as pets.
00:09:20.080 | Or we can just go and look at pictures of cats,
00:09:22.200 | and now we understand what a cat is.
00:09:24.400 | And I would argue actually that for a lot of people,
00:09:26.720 | the picture of the cat is much closer to the meaning of the concept of cat.
00:09:31.240 | So some early work where people were trying to do this
00:09:36.360 | is from Bruni et al. where they did multimodal distributional semantics
00:09:39.960 | using this very elegant approach called bag of visual words.
00:09:44.160 | So just like who has heard of bag of visual words?
00:09:48.600 | Very few people.
00:09:49.480 | Okay, so it's surprisingly simple, and so I kind of like it.
00:09:53.160 | It's nicely elegant.
00:09:54.480 | So you take a picture of a moon in this case.
00:09:57.160 | I think you can see it in the back too, right?
00:09:58.720 | So we use an algorithm like SIFT to find interesting key points.
00:10:03.760 | So it's sort of where the difference between the pixels and the pixels next to it,
00:10:08.040 | where that difference is big, those are sort of the spots you want to be looking at.
00:10:13.080 | And for each of these key points, you get feature descriptors.
00:10:16.960 | So relatively small vectors, 32-dimensional or so,
00:10:20.480 | depending on the implementation of this.
00:10:23.080 | And what you can do now with these feature descriptors
00:10:25.440 | is you can cluster them using k-means,
00:10:27.840 | and then you assign every one of these points to a cluster,
00:10:31.400 | so you can count how often each cluster occurs, right?
00:10:33.360 | So in this picture of the moon, we have like actually the count is...
00:10:36.560 | Oh, yeah, so there are three like red dots, right?
00:10:38.600 | So that's why the red dot one is three.
00:10:41.520 | So what that gives you is an idea of the visual words,
00:10:46.000 | very similar to the original bag of words model
00:10:48.280 | that you hopefully have heard about maybe in the first lecture.
00:10:52.400 | So that's the visual equivalent of the textual thing.
00:10:56.040 | And so if you do this and you then concatenate
00:10:58.800 | or you apply SVD to fuse the information,
00:11:00.840 | what you get is a word embedding
00:11:03.200 | that is much more representative of human meaning.
00:11:06.560 | So, you know, as reflected in the datasets
00:11:09.040 | that people used to care about at the time.
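A minimal sketch of that bag-of-visual-words pipeline, assuming OpenCV's SIFT and scikit-learn's k-means (the function and variable names here are illustrative, not from the original implementation):

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def bag_of_visual_words(image_paths, vocab_size=100):
    sift = cv2.SIFT_create()
    per_image_desc = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)  # keypoint descriptors (assumes some are found)
        per_image_desc.append(desc)
    # Cluster all descriptors into a "visual vocabulary" with k-means.
    kmeans = KMeans(n_clusters=vocab_size).fit(np.vstack(per_image_desc))
    # Represent each image as a histogram of visual-word counts.
    return np.stack([
        np.bincount(kmeans.predict(desc), minlength=vocab_size)
        for desc in per_image_desc
    ])
```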
00:11:12.600 | So after that, there were a couple of people,
00:11:15.360 | me included, who tried to take these ideas
00:11:17.520 | and then really apply deep learning to them.
00:11:19.320 | So some of the very early versions of this
00:11:22.400 | use convolutional neural networks,
00:11:24.760 | and then you can transfer the features from your ConvNet
00:11:28.720 | and you take your word embeddings,
00:11:30.400 | which you've seen in the first lecture,
00:11:32.480 | and then you can concatenate them.
00:11:34.600 | Now you have a multimodal word vector,
00:11:36.640 | or you can do something slightly fancier.
00:11:38.520 | So you've seen the skip-gram model.
00:11:40.960 | You can also try to do skip-gram predictions
00:11:43.720 | onto image features, right?
00:11:45.320 | So when you see a word like cat in some context,
00:11:48.000 | like the cute little cat sat on the mat,
00:11:50.680 | then when you see cat, you also want to predict cat pictures.
00:11:54.280 | So super easy ideas, but it turned out
00:11:56.400 | that this gives you much richer word representations.
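The multimodal skip-gram idea can be sketched as an extra visual term on top of the usual skip-gram objective (a toy formulation with my own naming, not the original implementation):

```python
import torch.nn.functional as F

def multimodal_skipgram_loss(center_vec, context_logits, context_targets,
                             image_feat=None, lambda_img=0.5):
    # Standard skip-gram term: predict the context words around the center word.
    text_loss = F.cross_entropy(context_logits, context_targets)
    if image_feat is None:                      # not every word has associated images
        return text_loss
    # Visual term: pull the word vector toward CNN features of images of that word,
    # e.g. the embedding of a cat photo when the center word is "cat".
    visual_loss = 1 - F.cosine_similarity(center_vec, image_feat, dim=-1).mean()
    return text_loss + lambda_img * visual_loss
```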
00:11:59.440 | So that's kind of cool,
00:12:00.680 | but obviously words are very limited.
00:12:02.880 | What we really care about is not words, but sentences.
00:12:06.080 | So then people started really looking
00:12:08.360 | into sentence representations and how can we figure out
00:12:12.360 | how to get compositional understanding
00:12:14.760 | in these sentence representations
00:12:16.360 | and how do we align that with images?
00:12:18.960 | So the loss here is very similar
00:12:22.480 | to what we saw with words and pictures,
00:12:24.360 | but now we just have a sentence encoder, right?
00:12:27.320 | And so there's some really cool early papers
00:12:29.440 | from Andrej Karpathy, and Richard Socher
00:12:31.640 | also had some work here.
00:12:33.360 | And then, so the basic idea is just that
00:12:37.880 | instead of having these word embeddings,
00:12:39.440 | we now have an LSTM in these papers
00:12:42.120 | or some other kind of recurrent neural network,
00:12:44.240 | or in the case of this one, recursive neural network,
00:12:47.360 | and then we try to align the features together.
00:12:50.400 | And so these three or four papers
00:12:53.320 | are actually very important.
00:12:54.360 | This one by me is less important,
00:12:56.320 | but it's still kind of interesting
00:12:57.680 | because we showed here that grounded sentence representations are useful.
00:13:02.160 | So if you actually just use this part here
00:13:04.520 | as a sentence encoder for NLP tasks,
00:13:07.440 | the ability to just predict pictures from it
00:13:09.840 | already gives you a really good sentence representation.
00:13:12.800 | Right, so just by predicting pictures,
00:13:15.560 | you can sort of imagine what things look like,
00:13:17.720 | and that gives you a really good meaning representation,
00:13:19.800 | which you can then transfer to, I don't know,
00:13:22.000 | sentiment classification or something else.
00:13:24.360 | And then of course, once we have sentence encoders,
00:13:30.720 | or then we also have decoders,
00:13:33.040 | and so when the sequence-to-sequence architecture came out,
00:13:36.440 | which you've probably also heard about in this course,
00:13:39.160 | what you can do instead of having a text encoder
00:13:42.160 | for like your source language,
00:13:43.440 | if you're doing machine translation,
00:13:44.800 | is you can plug in a ConvNet instead of an LSTM encoder,
00:13:49.800 | and now you can generate captions.
00:13:51.680 | So that's exactly what people did.
00:13:53.840 | We used to have all of these fancy diagrams in our papers
00:13:56.400 | then where we explained LSTM and how that works.
00:13:59.160 | Probably people don't learn that anymore these days.
00:14:01.560 | They do?
00:14:02.400 | - Mostly for LSTM.
00:14:03.240 | - Very good.
00:14:04.400 | They might make a comeback, I think, you know,
00:14:06.320 | at some point, transformers are gonna go away.
00:14:09.360 | We'll see.
00:14:10.200 | And so one of the things that people figured out
00:14:14.680 | in machine translation very early on
00:14:16.600 | is that you can do alignment of words
00:14:19.280 | between your source language and your target language,
00:14:21.760 | and you can do the same thing actually with images, right?
00:14:24.040 | So if you want to align a word in your generated sequence
00:14:28.760 | with something in your picture,
00:14:30.200 | then you can use the same approach for that,
00:14:33.480 | and that approach, of course, is called attention, right?
00:14:35.720 | So, you know, you learned a lot about attention
00:14:38.280 | probably in this course, and so, yeah,
00:14:40.760 | that was one of the building blocks
00:14:42.400 | of these systems as well
00:14:43.480 | where you can do very interesting things
00:14:45.720 | and really see that when it has to generate stop
00:14:49.080 | for the stop sign,
00:14:50.120 | that it's really actually looking at the stop sign, right?
00:14:52.720 | So there's a really cool alignment going on there
00:14:55.760 | in these models.
00:14:56.720 | And so the final kind of early model
00:15:00.480 | we should talk about a little bit is GANs.
00:15:04.160 | Who here has heard of GANs?
00:15:05.600 | Okay, that's a lot more than visual words.
00:15:09.200 | I guess that makes sense.
00:15:11.080 | And so, yeah, the basic idea of a GAN
00:15:14.160 | is really that you have this generator and discriminator,
00:15:16.480 | and you want to have the generator generate images
00:15:19.400 | that the discriminator cannot distinguish from,
00:15:23.120 | so it cannot distinguish fake and real images, right?
00:15:25.640 | And if you do that,
00:15:26.480 | you can actually condition that on the piece of text,
00:15:30.040 | and then you can generate images
00:15:32.520 | using some text prompt, right?
00:15:35.040 | So that's what kind of the first versions
00:15:38.280 | of stable diffusion were doing things like this,
00:15:40.560 | and it's all a natural progression to that model.
00:15:43.840 | So those were the early models.
00:15:47.240 | Maybe, do people have any burning questions about this,
00:15:51.480 | or does this all make sense?
00:15:52.880 | All right.
00:15:56.360 | So let's do a bit of a deeper dive then
00:15:58.960 | in particular on features and fusion.
00:16:02.040 | So those are really the kind of core building blocks
00:16:04.040 | for all of this multimodal stuff.
00:16:06.000 | But before we go there, maybe very briefly,
00:16:08.760 | like if all of this multimodal stuff is cool
00:16:11.640 | and sort of useful and doesn't look that difficult,
00:16:15.680 | like why aren't we all doing multimodal things?
00:16:18.520 | So why do we focus on specific modalities?
00:16:21.920 | And I think there are a couple of problems
00:16:23.400 | just to be aware of.
00:16:24.680 | So one is modalities can sometimes dominate,
00:16:28.120 | especially text is much more dominant
00:16:30.600 | than vision or audio in many use cases, right?
00:16:33.160 | So you can already just have a model
00:16:35.320 | that picks up on the text signal
00:16:36.960 | and basically learns to ignore the image completely,
00:16:39.280 | which actually happened embarrassingly
00:16:41.240 | for visual question answering, we'll get to that.
00:16:43.520 | So visual question answering, you could do that
00:16:45.600 | without actually looking at the picture.
00:16:47.600 | Additional modalities can add a lot of noise.
00:16:51.760 | So it makes your machine learning problem more difficult.
00:16:54.920 | You don't always have full coverage, right?
00:16:56.600 | So as I said, if you look at Facebook posts,
00:16:58.400 | sometimes you have text, sometimes you have pictures,
00:17:00.280 | sometimes you have both,
00:17:01.280 | but you don't have a guarantee that you always have both.
00:17:03.400 | So how do you deal with that?
00:17:04.880 | In many cases, we just really weren't ready.
00:17:07.480 | It was too complicated to implement stuff.
00:17:10.080 | And also just in general, like how to design your model
00:17:13.160 | really to combine all the information
00:17:16.640 | is actually quite complicated.
00:17:18.040 | So in order to maybe drive that point home a little bit,
00:17:23.040 | so featurizing text, I guess we all know how to do that
00:17:28.200 | by now, especially sort of in the age of transformers
00:17:30.760 | and before in LSTMs, where we just have like,
00:17:33.080 | you have your batch by your sequence.
00:17:34.920 | So batch size by sequence length by embedding size, right?
00:17:38.160 | So it's always like a 3D tensor,
00:17:41.320 | and that's how you encode your textual information
00:17:43.880 | when you pump it through your neural net.
00:17:46.520 | And so with images, it's slightly trickier
00:17:48.720 | because you can just kind of look at the patches,
00:17:51.800 | but then if you do convolutions,
00:17:53.360 | you're kind of like shifting over the image
00:17:55.400 | and then you're aggregating, right?
00:17:57.840 | And in many cases, you don't really want to be this uniform.
00:18:01.160 | You want to have something that actually looks
00:18:03.080 | at the things in the picture, right?
00:18:05.120 | So this is called region features,
00:18:07.000 | where you would use an object detector
00:18:08.680 | as a first step for processing your image.
00:18:11.200 | And then you would have a confident backbone
00:18:13.000 | that encodes the features for that particular sub image,
00:18:16.520 | like this guy's like skateboard or something.
00:18:18.480 | It has its own like vector representation, right?
00:18:21.080 | And then in terms of dense features,
00:18:24.040 | we now also have vision transformers.
00:18:25.720 | So we'll just very quickly go over that
00:18:28.000 | to make sure we're on the same page.
00:18:29.800 | So there are all these models like YOLO
00:18:31.400 | is a really good one if you haven't heard of that yet.
00:18:34.280 | So we're at YOLO v7 now, I think, or eight, I don't know.
00:18:37.840 | So there's a new one coming out every other year
00:18:41.640 | or something.
00:18:42.840 | But the basic idea is that we get these bounding boxes
00:18:46.320 | for things in the image, or actually segmentations,
00:18:49.000 | but the bounding boxes are what people tend to use
00:18:51.080 | and they have labels, right?
00:18:52.560 | So this is labeled like backpack or something.
00:18:54.840 | And so you can do this as a pre-processing step
00:18:57.560 | on your image to get a much richer representation
00:19:00.920 | of what is really in that image,
00:19:02.280 | which you can then pump into your system
00:19:04.040 | as we'll see later.
00:19:05.240 | And so then how you encode the information
00:19:07.920 | that is in these little bounding boxes
00:19:09.640 | or actually in the image itself in general,
00:19:12.480 | we just use a standard ConvNet for that.
00:19:14.680 | And so this probably feels like super obvious now,
00:19:18.680 | but in 2014, when people were starting to discover this,
00:19:22.040 | it was really very surprising that you could just use
00:19:25.120 | off the shelf conf net features
00:19:27.080 | to really replace the entire computer vision pipeline.
00:19:30.160 | So people used to do all of this very fancy,
00:19:32.600 | sophisticated stuff and people, you know,
00:19:34.680 | spend decades on trying to refine this.
00:19:36.600 | And then it was all thrown away and replaced by a ConvNet
00:19:39.440 | that does all of that stuff for free.
00:19:41.920 | And so the cool thing you get there is that you can transfer
00:19:44.840 | very easily across different tasks.
00:19:46.760 | So you can have a very generic conf net
00:19:49.000 | and then use it to all kinds of very specialized things
00:19:52.320 | like spotting buildings in Paris, for example,
00:19:55.880 | or flowers or other stuff.
00:19:57.720 | And then of course, in the age of transformers,
00:20:01.840 | how far are we?
00:20:03.120 | We're already quite a while in.
00:20:04.200 | This is only the first transformer actually
00:20:06.240 | in the slide deck.
00:20:07.320 | So, you know, we're making good progress.
00:20:10.720 | So vision transformers are what we would use these days
00:20:13.760 | to encode the images where you have these flattened patches
00:20:17.320 | and then you would do kind of the standard
00:20:20.520 | BERT architecture maybe as you would know it
00:20:22.880 | from this course, and then you do classification, right?
00:20:25.240 | So this is all like a standard transformer,
00:20:27.200 | everything's standard, except now your input here
00:20:29.240 | is not words or tokens, it's patches of an image
00:20:32.840 | and then you classify that.
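A minimal patch-embedding sketch of what a ViT does with an image before the standard transformer layers take over (illustrative, not the original ViT code):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split the image into fixed-size patches, flatten each patch,
    and project it to the model dimension, giving a 'token' sequence."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        # A strided conv is the usual trick for "flatten patch + linear projection".
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, dim, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, dim), like a sequence of tokens

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))   # -> (2, 196, 768)
```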
00:20:34.240 | All right, so then we have a bunch of features
00:20:37.960 | and now how do we combine the information, right?
00:20:40.120 | So let's say we have two vectors, U and V.
00:20:43.000 | So, you know, it sounds easy, right?
00:20:45.120 | To how we could combine them.
00:20:47.480 | It turns out that there are actually
00:20:49.040 | very many ways to combine them.
00:20:50.520 | So I don't think it's really useful to go over
00:20:52.960 | all the different ways here,
00:20:55.200 | but you can do very simple things, right?
00:20:56.720 | So obviously like inner product or similarity
00:20:59.240 | is what you would use if you want to do cross-modal things.
00:21:01.600 | So if you want to embed things in the same vector space,
00:21:04.840 | but you can do sort of fancier projections on top
00:21:08.440 | or different combinations that are kind of linear
00:21:11.640 | or you can do multiplicative things
00:21:13.520 | where you multiply the components element-wise
00:21:16.280 | or you do some sort of gating over the different features.
00:21:18.680 | You can do attention, you can do fancier bilinear things,
00:21:22.320 | you can do very fancy compact bilinear things.
00:21:25.240 | So there's really a wealth of literature
00:21:27.280 | kind of on all the different ways
00:21:28.720 | you can combine two vectors.
00:21:30.560 | And so this is called multimodal fusion
00:21:33.760 | and most of the literature on multimodality
00:21:36.040 | is essentially about this question,
00:21:37.840 | what is the best way to do fusion?
00:21:39.680 | And that's it.
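To make that concrete, here are several of the standard ways of fusing two modality vectors u and v gathered in one place (an illustrative grab-bag; a real model would pick one of these and tune it):

```python
import torch
import torch.nn as nn

class FusionZoo(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(2 * d, d)       # linear projection on top of the concat
        self.gate = nn.Linear(2 * d, d)       # gating over the features
        self.bilinear = nn.Bilinear(d, d, d)  # a full, uncompressed bilinear combination

    def forward(self, u, v):
        score  = (u * v).sum(-1)                        # inner product / similarity (late fusion)
        concat = torch.cat([u, v], dim=-1)              # simplest combination
        linear = self.proj(concat)                      # projected combination
        mult   = u * v                                  # element-wise multiplicative
        gated  = torch.sigmoid(self.gate(concat)) * u   # gate one modality using both
        bilin  = self.bilinear(u, v)                    # bilinear pooling
        return score, concat, linear, mult, gated, bilin
```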
00:21:42.160 | So I think within that discussion
00:21:44.200 | is maybe useful to distinguish
00:21:45.600 | between different levels of fusion.
00:21:47.800 | So you can do it very early
00:21:49.680 | where basically you make sure
00:21:51.200 | you have the different features
00:21:52.440 | and then you just kind of,
00:21:54.280 | in the sort of modern sense of attention,
00:21:56.480 | you would attend to everything
00:21:57.800 | in all the features from the beginning.
00:21:59.960 | You can first treat them separately and then combine them
00:22:03.200 | or you can treat them as completely separate
00:22:05.280 | and then you only combine the final scores, right?
00:22:08.040 | And so that's kind of what we would call early fusion
00:22:11.440 | and then sort of my invention
00:22:13.120 | for calling the middle part
00:22:14.520 | would be sort of middle fusion
00:22:15.760 | and then you have late fusion
00:22:17.640 | where you really just combine the scores or the logits
00:22:20.560 | but you don't really have any interaction
00:22:22.640 | between the information from the different modalities.
00:22:25.520 | So you could do really fun stuff with multimodal fusion.
00:22:30.520 | So this is a paper I really like film
00:22:34.120 | where you have this sort of very special feature map,
00:22:40.000 | this sort of F here and it gets modulated
00:22:42.720 | by a multiplicative factor.
00:22:44.840 | So this gamma and an additive sort of bias vector,
00:22:48.120 | this beta and you have a different one
00:22:50.560 | for every layer of a ResNet
00:22:52.400 | that is conditioned on some encoding
00:22:54.560 | of the thing you're after.
00:22:56.280 | So in this case, are there more cubes than yellow things?
00:22:58.640 | So we have some vector representation for that
00:23:01.320 | and we use that vector representation
00:23:03.000 | to modulate the ResNet blocks
00:23:05.200 | at every layer of the ConvNet.
00:23:08.480 | So you can really do very fun things
00:23:10.880 | where you're sort of modulating one network
00:23:12.800 | with the other one and really try to have them learn
00:23:15.880 | as much as possible from that.
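A FiLM layer itself is only a couple of lines; roughly (my own sketch of the idea, with a linear layer standing in for however the conditioning network is actually built):

```python
import torch.nn as nn

class FiLMLayer(nn.Module):
    """The conditioning vector (e.g. an encoding of the question) predicts a
    per-channel scale (gamma) and shift (beta) that modulate a feature map F."""
    def __init__(self, cond_dim, num_channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, feat_map, cond):        # feat_map: (B, C, H, W), cond: (B, cond_dim)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]        # broadcast over the spatial dimensions
        beta = beta[:, :, None, None]
        return gamma * feat_map + beta         # FiLM(F) = gamma * F + beta
```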
00:23:17.400 | All right, so let's talk about late fusion then.
00:23:22.240 | So late fusion is what we would now call contrastive models
00:23:26.520 | but the basic idea is that we have this similarity score.
00:23:29.600 | So we have the two kind of,
00:23:31.000 | we process the modalities completely independently
00:23:33.360 | and then at the very end, we do some combination.
00:23:35.960 | And the most famous instance of that these days is CLIP.
00:23:40.480 | So who's heard of CLIP?
00:23:41.640 | Okay, so CLIP from OpenAI.
00:23:47.840 | So it's again, exactly the same contrastive loss
00:23:51.280 | that we've seen in all these early approaches.
00:23:54.200 | It does kind of negative sampling, but then in batch.
00:23:58.320 | So you just have a batch,
00:23:59.520 | you have two things that are aligned, right?
00:24:01.400 | So like this, the first piece of text
00:24:03.560 | and the first image, they are aligned.
00:24:05.120 | So this is the right answer.
00:24:06.960 | And I just wanna make sure that I rank this thing higher
00:24:09.680 | than all the alternatives, right?
00:24:12.080 | And I wanna make sure I rank this thing higher
00:24:14.080 | than all the alternatives.
00:24:15.400 | So it's a very, very simple idea.
00:24:18.280 | Really, nothing special about this architecture
00:24:20.920 | that was sort of invented here,
00:24:22.480 | but what made this thing so cool was first of all,
00:24:25.840 | it was transformers and it was transformers all the way.
00:24:28.360 | So your text encoder would be a transformer
00:24:30.360 | and your image encoder would be a ViT image encoder.
00:24:33.640 | So also a transformer.
00:24:35.880 | And it was trained on lots and lots of web data.
00:24:39.400 | So Alec Radford is really a genius
00:24:41.600 | at creating very high quality datasets.
00:24:44.440 | And he created, I think 300 million image text pairs
00:24:47.720 | for this dataset, trained a bigger model on it
00:24:50.520 | than people used to do.
00:24:52.760 | And then we got this amazing model out of it.
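The in-batch contrastive objective described here can be sketched in a few lines (illustrative PyTorch, not OpenAI's implementation):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """The i-th image and i-th caption are the positive pair;
    every other item in the batch acts as a negative."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)   # diagonal = correct match
    loss_i2t = F.cross_entropy(logits, targets)      # rank the right caption per image
    loss_t2i = F.cross_entropy(logits.t(), targets)  # rank the right image per caption
    return (loss_i2t + loss_t2i) / 2
```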
00:24:55.640 | And so moving away from the words there
00:25:00.200 | to the sort of texts that you would see on the internet.
00:25:03.280 | So the caption for an image on the web,
00:25:05.840 | it's not gonna say dog or cat.
00:25:07.640 | It's gonna say a photo of a cat doing something, something.
00:25:10.720 | So that means that you can do kind of zero shot
00:25:14.800 | label predictions where you have a photo of the,
00:25:17.760 | and then you need to figure out what the right label
00:25:21.560 | is for a given image using this kind of prompt.
00:25:24.600 | So the thing, you probably all know
00:25:26.880 | about prompting large language models.
00:25:28.520 | And so you can prompt vision and language models
00:25:30.880 | in very much the same way and do zero shot generalization.
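Zero-shot classification then amounts to embedding a prompt per candidate label and picking the closest one; a sketch, where `image_encoder` and `text_encoder` stand in for the two pretrained towers (with the text encoder assumed to handle tokenization internally):

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image, labels, image_encoder, text_encoder):
    prompts = [f"a photo of a {label}" for label in labels]
    with torch.no_grad():
        img = F.normalize(image_encoder(image), dim=-1)    # (1, d)
        txt = F.normalize(text_encoder(prompts), dim=-1)   # (num_labels, d)
    probs = (img @ txt.t()).softmax(dim=-1)                # similarities -> distribution
    return labels[probs.argmax().item()]
```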
00:25:34.720 | So if you want to read a really good paper,
00:25:38.160 | I would recommend that you read this paper.
00:25:39.800 | This is really one that's gonna teach you
00:25:41.320 | how to write really good papers.
00:25:42.800 | It's thorough, it's really worth a very close read,
00:25:45.920 | I think, if you're interested in this field.
00:25:48.240 | And so I think when it came out,
00:25:51.240 | actually on ImageNet itself,
00:25:53.280 | it didn't really outperform ResNet.
00:25:55.640 | So you might think, oh yeah,
00:25:57.920 | actually it's not all that special.
00:25:59.960 | But what really made it special was that it generalized
00:26:02.400 | much better to these other datasets, right?
00:26:04.680 | So this ResNet thing here is pretty terrible
00:26:08.080 | at some of these adversarial versions of ImageNet,
00:26:11.440 | and CLIP is super robust to that.
00:26:13.160 | So it's just a way better image encoder in general.
00:26:16.120 | So very quickly after CLIP,
00:26:19.960 | there was this paper from Google called ALIGN,
00:26:23.880 | which was basically exactly the same idea.
00:26:27.560 | The field is not really that creative at all
00:26:29.880 | as like the same idea,
00:26:30.840 | but then you just keep like throwing more data
00:26:32.840 | and more compute at it, and it often works much better.
00:26:35.640 | So that's what they found here too.
00:26:38.120 | 1.8 billion image-text pairs instead of 300 million
00:26:41.120 | gives you a better model.
00:26:43.120 | Surprise.
00:26:43.960 | But so still very cool.
00:26:47.360 | And what is really cool, I think,
00:26:49.440 | is that there's this organization called LAION,
00:26:51.800 | where they've started this open source collective
00:26:56.680 | to create really high quality datasets.
00:26:59.280 | And so the LAION, the initial dataset was,
00:27:04.280 | how many examples in the initial LAION?
00:27:06.880 | - 400. - 400 million.
00:27:08.840 | He knows, I know that he knows.
00:27:10.440 | (audience laughing)
00:27:11.280 | And so now there's a much bigger version of LAION
00:27:14.920 | that's even multilingual and it has 5 billion examples.
00:27:18.080 | So stable diffusion was trained on sort of the image,
00:27:21.600 | the English subset of this thing.
00:27:23.760 | And that's one of the reasons that it's so awesome,
00:27:26.200 | it's because it's just seen a ton of data.
00:27:29.160 | And that really makes your system a lot better.
00:27:31.360 | So if you're looking for like the ultimate dataset
00:27:34.000 | to play around with your own ideas,
00:27:37.240 | if you have enough compute, obviously,
00:27:38.840 | then you should really look at this dataset.
00:27:41.360 | All right, any questions about up until this point?
00:27:46.480 | No, all right.
00:27:51.840 | So then we'll move on from late fusion
00:27:56.000 | to kind of middle fusion, early fusion.
00:27:59.640 | And this really is kind of the core
00:28:01.680 | of what I think a lot of people in the field right now,
00:28:04.520 | or if you're interested in getting in this field,
00:28:06.080 | or if you're going to go into industry
00:28:07.840 | and you're gonna be using this stuff,
00:28:09.760 | like this is what you should really understand.
00:28:12.320 | And again, like the ideas sort of stack onto each other.
00:28:16.440 | So I've kind of sequenced the slides
00:28:18.840 | to give you an idea sort of how the scientists
00:28:21.240 | kind of came up with the next step.
00:28:23.440 | And you can really see the architecture
00:28:25.080 | just get slightly more and more advanced,
00:28:27.160 | but basically a lot of it is just more data
00:28:28.960 | and more compute again.
00:28:30.640 | So who knows how BERT works?
00:28:35.720 | (audience laughing)
00:28:39.160 | Everybody should raise their hand now in this.
00:28:41.560 | So yeah, so BERT is kind of so canonical.
00:28:46.560 | I think everybody kind of gets how BERT works, right?
00:28:48.720 | So I don't think we need a real refresher,
00:28:53.920 | but I want you to think.
00:28:55.800 | And so the reason I have this slide is
00:28:55.800 | because I want you to think about if you have a BERT model
00:28:59.800 | and you have a bunch of images,
00:29:01.720 | how are you going to turn that BERT model
00:29:03.600 | into something multimodal?
00:29:05.080 | Right, so there are a bunch of like obvious things
00:29:09.400 | you could do given the kind of features I told you about
00:29:11.880 | in the sort of fusion process.
00:29:13.240 | So how are you gonna do that?
00:29:15.400 | Does anybody wanna like say something?
00:29:19.480 | (audience member speaking indistinctly)
00:29:24.240 | - Like if you're doing classification,
00:29:25.880 | you can take the CLS token from BERT
00:29:28.600 | and then just concatenate it to whatever encoder,
00:29:31.440 | like maybe an ANN or whatever you're training
00:29:33.840 | on the data concatenated and then train.
00:29:37.200 | - Okay, exactly, yeah.
00:29:38.160 | So you can take the ConvNet features
00:29:41.160 | and the CLS token from BERT, concatenate them,
00:29:43.800 | and then classify for like a cat or something like that
00:29:47.400 | or whatever the thing is you're interested in, yeah.
00:29:50.040 | Yeah, so that's one thing.
00:29:51.240 | You could also like take the ConvNet features
00:29:53.560 | and like give them to the BERT model
00:29:55.920 | in lots of different ways, right?
00:29:58.080 | We can use the region features.
00:29:59.520 | So I think a lot of people when BERT came out
00:30:04.080 | who were working in vision and language processing
00:30:06.640 | were thinking exactly about, okay,
00:30:08.040 | so do we do like middle fusion, late fusion?
00:30:10.120 | Do we do early fusion?
00:30:11.160 | How do we do the fusion?
00:30:13.440 | And so there were a lot of papers all coming out
00:30:16.680 | basically at around the same time
00:30:18.120 | where people were doing versions of this.
00:30:21.120 | So BERT was really kind of the innovation
00:30:23.040 | and then everybody sort of just plugged it
00:30:25.120 | into their own thing because of Hugging Face Transformers
00:30:27.440 | and things like that.
00:30:28.320 | So the first thing is Visual BERT.
00:30:33.320 | This was one of the very early ones
00:30:34.760 | where you have this image
00:30:36.240 | and people would do object detection on this.
00:30:39.600 | So you get like a hat and a racket and a shirt
00:30:41.720 | and things like that.
00:30:42.800 | So you can just really take these features
00:30:45.200 | and then plug them into your transformer model
00:30:49.080 | and then you try to like recover the features.
00:30:52.720 | And so this really is probably
00:30:54.440 | like the simplest way to do it, right?
00:30:56.400 | And so this is what we call a single stream architecture
00:31:00.920 | where you have all of these kind of concatenating
00:31:03.560 | the original input features
00:31:05.000 | and then putting them through the same transformer.
00:31:07.960 | What you can also do,
00:31:08.960 | and that's something that this model called VILBERT did
00:31:12.320 | is where you have two different streams.
00:31:14.640 | So you essentially have these two parallel transformers
00:31:18.240 | but at every layer,
00:31:19.880 | you kind of give them cross attention, right?
00:31:23.120 | So, or co-attention as they call it,
00:31:24.920 | but it's basically like,
00:31:25.760 | so you just make sure you have an attention map
00:31:28.040 | that spans both
00:31:28.880 | and then you just do your full normal transformer layer
00:31:31.880 | again.
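The co-attention idea boils down to cross-attention in both directions between the two streams; a rough sketch (simplified relative to the actual ViLBERT block, which also has its own feed-forward and normalization layers per stream):

```python
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Each modality attends to the other: queries come from one stream,
    keys and values from the other."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, txt, img):              # txt: (B, T, d), img: (B, R, d)
        txt2, _ = self.txt_attends_img(query=txt, key=img, value=img)
        img2, _ = self.img_attends_txt(query=img, key=txt, value=txt)
        return txt + txt2, img + img2         # residual connections, as in a transformer layer
```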
00:31:33.120 | And then, so this you can train
00:31:35.280 | just like your regular BERT, right?
00:31:37.120 | So you have your masked model,
00:31:40.960 | masked language model here
00:31:42.120 | and here you do sort of some equivalent of that
00:31:44.480 | and then you also have your next sentence prediction
00:31:47.320 | which you probably remember from your BERT lecture.
00:31:50.120 | But instead here we're saying, okay,
00:31:52.560 | is this image aligned with this piece of text or not?
00:31:55.600 | There's also LXMERT.
00:31:58.240 | I mean, there I could go on forever.
00:32:00.000 | There are like a hundred papers that came out
00:32:02.040 | that did this all at the same time.
00:32:03.680 | So LXMERT had a different cross-modal output encoder,
00:32:07.000 | a bunch of different ways
00:32:08.800 | of encoding the positional information, right?
00:32:10.920 | So you could say, okay,
00:32:11.760 | I just have a bunch of bounding boxes that are featurized
00:32:14.160 | but I don't care about where they are in the image.
00:32:16.640 | So it's just kind of like a,
00:32:17.800 | just a bag of bounding boxes.
00:32:20.640 | Or you could say, I found it here.
00:32:22.000 | Like this is the particular like top left
00:32:23.880 | and bottom right coordinate.
00:32:25.600 | And that's what you featurize into your network.
00:32:28.200 | You can also do something even dumber.
00:32:32.960 | And I can say that because this is my paper.
00:32:35.160 | Where you just take the image itself,
00:32:38.920 | you put it through a ResNet
00:32:40.840 | and then you do a little bit of pooling
00:32:42.960 | on the final feature maps
00:32:44.640 | and you just give those feature maps to BERT.
00:32:47.280 | And so you then need to distinguish
00:32:50.120 | between like your text segment embeddings, right?
00:32:52.640 | And your vision segment embeddings.
00:32:56.480 | But so this actually works surprisingly well.
00:32:58.920 | You don't have to do any additional training.
00:33:02.160 | You can just take BERT out of the box.
00:33:03.880 | Initially you freeze it.
00:33:05.040 | You learn to project into BERT token space.
00:33:07.680 | Then you unfreeze your ResNet
00:33:09.480 | and then finally you unfreeze your BERT.
00:33:11.640 | And now you have a very good multimodal classifier
00:33:13.920 | on the problem you care about.
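A stripped-down version of that recipe might look like this (a sketch in the spirit of MMBT, using Hugging Face's BertModel and a torchvision ResNet; the real model handles segment and position embeddings more carefully):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import BertModel

class SimpleMMBT(nn.Module):
    def __init__(self, num_visual_tokens=3, num_labels=2):
        super().__init__()
        cnn = resnet50(weights="IMAGENET1K_V1")
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])   # keep the feature maps
        self.pool = nn.AdaptiveAvgPool2d((1, num_visual_tokens))    # pool into a few vectors
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.img_proj = nn.Linear(2048, self.bert.config.hidden_size)  # project into token space
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, image, input_ids, attention_mask):
        feats = self.pool(self.backbone(image))                        # (B, 2048, 1, K)
        vis_tokens = self.img_proj(feats.flatten(2).transpose(1, 2))   # (B, K, hidden)
        txt_tokens = self.bert.embeddings.word_embeddings(input_ids)   # (B, T, hidden)
        inputs = torch.cat([vis_tokens, txt_tokens], dim=1)
        vis_mask = torch.ones(vis_tokens.shape[:2],
                              dtype=attention_mask.dtype, device=attention_mask.device)
        mask = torch.cat([vis_mask, attention_mask], dim=1)
        out = self.bert(inputs_embeds=inputs, attention_mask=mask)
        return self.classifier(out.last_hidden_state.mean(dim=1))      # simple pooled classifier
```

The staged unfreezing described above would then just be a matter of calling requires_grad_(False) on the BERT and ResNet modules at the start of training and lifting it in later stages.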
00:33:15.400 | So a lot of these other papers,
00:33:17.520 | they're doing what they call multimodal pre-training
00:33:19.840 | where first you have a BERT model and a ResNet.
00:33:22.320 | So they're kind of unimodally pre-trained
00:33:24.480 | and then you cobble them together
00:33:25.880 | and then you have a multimodal
00:33:28.000 | sort of intermediary pre-training step
00:33:29.960 | before you fine tune it on the problem you care about.
00:33:32.920 | And what we showed here is that you don't really need that
00:33:35.000 | actually in many cases.
00:33:36.440 | So it's a very strong baseline.
00:33:39.760 | You can also go to the pixel level completely.
00:33:42.880 | So that's what they did in this other paper
00:33:45.440 | called PixelBERT where they,
00:33:47.200 | it's basically exactly MMBT.
00:33:50.040 | So the previous supervised one,
00:33:52.400 | but here they do do the multimodal pre-training step
00:33:55.200 | and show that I think for VQA it helps a little bit.
00:33:57.800 | So there are many of these BERTs
00:34:02.080 | doing sort of visual things.
00:34:04.200 | People really tried everything.
00:34:06.640 | Here's another one called UNITER
00:34:08.200 | where they added a bunch of different losses.
00:34:10.960 | We can really talk about this for a very long time.
00:34:14.040 | We're not gonna do that.
00:34:15.000 | I'm just gonna kind of talk you through
00:34:16.800 | some of the more interesting ones.
00:34:18.520 | So this one I think is quite interesting, ViLT,
00:34:21.280 | because here this is really the first instance
00:34:23.400 | where we are completely gone from ConvNet features.
00:34:27.040 | So we don't do any pre-processing on the image,
00:34:29.960 | no region features,
00:34:31.080 | no backbone that it featurizes
00:34:32.840 | the parts of the image we care about.
00:34:35.600 | We just have these patches of the image.
00:34:37.280 | So we really just flatten those patches,
00:34:40.240 | we just pump them into the transformer straight away.
00:34:43.240 | So this really is like sort of BERT and VIT together
00:34:46.000 | in one model and this worked really very well.
00:34:48.840 | So that's been the trend.
00:34:52.240 | So here's a nice, very long list
00:34:54.960 | of all of these different models and what they do.
00:34:57.400 | And so really the distinctions are just in
00:35:00.240 | what is the text encoder that you use?
00:35:02.040 | So do you use BERT or something fancier or better,
00:35:04.920 | Roberta, what is your vision encoder?
00:35:08.320 | So in many cases you have these region features.
00:35:11.000 | So you would do an RCNN style thing,
00:35:13.480 | or you could just do a ResNet or a VIT.
00:35:15.800 | You have different kinds of fusion.
00:35:17.560 | So either a single or dual stream as we talked about,
00:35:20.520 | so VisualBERT or ViLBERT.
00:35:22.960 | Different pre-training tasks,
00:35:24.240 | so masked language modeling, image-text matching.
00:35:28.240 | There's a bunch of like funkier ones you can do.
00:35:31.720 | So, and then finally you can do multimodal pre-training
00:35:35.320 | on all of these different datasets that have aligned data.
00:35:38.280 | So you are probably wondering,
00:35:41.600 | okay, so what is really the interesting difference
00:35:44.480 | between a lot of these?
00:35:46.520 | And so I have another recommended paper
00:35:49.560 | that if you're interested in this space,
00:35:51.120 | you should really take a look at.
00:35:52.240 | It's also a really well done paper
00:35:54.560 | where they unmask multimodal pre-training.
00:35:59.360 | So basically they say, if you take all of these
00:36:02.720 | little model inventions and you train these different models
00:36:06.320 | on exactly the same data in exactly the same way,
00:36:09.520 | it turns out that they're all basically the same.
00:36:11.960 | So that's a lot of kind of wasted effort
00:36:16.520 | on the part of the field because everybody's saying,
00:36:18.480 | well, my model is better, but it's actually just
00:36:20.360 | because you trained it on different data.
00:36:22.640 | There's no real sort of model innovation going on
00:36:26.560 | in a lot of these things.
00:36:28.400 | So I don't mean to sound discouraging
00:36:29.880 | or anything like that, but I think that's why
00:36:33.360 | this paper is really nice and really important
00:36:35.360 | is because it just shows us what really matters.
00:36:38.880 | So this is also work that I did myself called Flava
00:36:43.880 | with my team where we wanted to take these ideas
00:36:49.160 | really to the limit.
00:36:50.000 | So a lot of the things that you've seen now,
00:36:52.960 | so the VisualBERTs and the ViLBERTs
00:36:54.600 | and the things like that,
00:36:55.440 | they're all about multimodal questions.
00:36:57.120 | So how can we do visual question answering,
00:36:59.720 | something like that,
00:37:00.560 | where we just have these two modalities,
00:37:02.320 | we only care about problems that always involve
00:37:04.440 | these two modalities.
00:37:05.760 | And where we want to go, and this is kind of
00:37:08.280 | the basic premise, I think, of foundation models in general
00:37:11.720 | is that we have one model to rule them all.
00:37:14.280 | So this one model can consume data
00:37:16.120 | from all of these different modalities
00:37:17.760 | and it can synthesize across all of these
00:37:19.880 | different modalities and then do useful things
00:37:22.080 | with that information.
00:37:24.280 | So with Flava, that's exactly what we tried to build.
00:37:27.320 | So we wanted to have one foundation model
00:37:29.600 | that is good at vision and language and computer vision
00:37:32.360 | and natural language processing,
00:37:34.040 | is jointly pre-trained on all of these
00:37:35.800 | different data sources.
00:37:36.880 | So it's also trained on just CC News,
00:37:39.200 | so Common Crawl and BookCorpus.
00:37:41.960 | So it's very good at the sort of things
00:37:43.360 | you would expect BERT to be good at.
00:37:45.200 | It's trained on ImageNet for image data.
00:37:47.400 | So it's good at the things that you would expect
00:37:49.400 | as a kind of basic image model to be good at.
00:37:51.600 | And then you have this PMD dataset
00:37:53.880 | that we created out of publicly available
00:37:57.000 | image text pairs that we also train it on.
00:37:59.920 | So this PMD dataset is really just,
00:38:01.920 | if you take all the datasets that were ever created
00:38:04.280 | that have image text pairs that are publicly available.
00:38:06.840 | So unfortunately, the CLIP data and the Google ALIGN data
00:38:10.200 | and all of these datasets, they haven't been open source.
00:38:12.320 | So this is before LAION.
00:38:15.040 | So now there's a good alternative to this.
00:38:17.800 | But so this PMD dataset,
00:38:19.520 | if you combine all of these image text pairs,
00:38:22.960 | you get 70 million of them.
00:38:24.400 | So that's still pretty decent size.
00:38:26.320 | And then you can take all of this data,
00:38:28.760 | basically to solve all of these problems
00:38:30.520 | that we know we care about in these different fields.
00:38:32.600 | So you can do multi-modal reasoning,
00:38:34.360 | you can do language understanding,
00:38:35.720 | you can do visual recognition,
00:38:37.520 | all with exactly the same model.
00:38:39.400 | And that's a very powerful idea.
00:38:41.240 | I think if you work at a company like Facebook,
00:38:44.480 | you don't want to have different models
00:38:45.800 | for all kinds of different things.
00:38:47.360 | You want to have one model
00:38:48.560 | that you can really use for everything.
00:38:50.360 | That's gonna really make your life a lot easier.
00:38:53.160 | So the exact architecture here is that on the one hand,
00:38:57.760 | we have this image encoder where we take the image,
00:39:00.400 | we encode it as patches,
00:39:01.720 | and we just do what we call masked image modeling,
00:39:04.560 | but it's basically masked language modeling,
00:39:06.280 | and then just on the image tokens.
00:39:09.040 | And then on the other side,
00:39:11.720 | we have the masked language modeling on the language.
00:39:16.480 | So your regular sort of BERT thing.
00:39:18.440 | And then we have a multi-modal part
00:39:20.280 | where all of this information gets combined.
00:39:22.960 | So we have a masked multimodal modeling loss term
00:39:27.120 | where you can also do image text matching.
00:39:28.960 | So this is like your BERT next sentence prediction thing.
00:39:31.520 | And then we also have a global contrastive loss,
00:39:33.680 | which is exactly like CLIP.
00:39:35.440 | So if you do all of this stuff,
00:39:37.080 | it's just all transformers all the way down.
00:39:40.400 | It's sort of a very elegant way,
00:39:42.080 | I think, to combine a lot of this information.
00:39:44.920 | And when you do that,
00:39:45.880 | you get something that can really do
00:39:47.560 | a lot of things very well.
00:39:49.200 | So we're not gonna talk about that table,
00:39:51.520 | it's just way too many numbers.
00:39:53.080 | But so just trust me,
00:39:54.800 | we were pretty thorough generating the table here.
00:39:57.960 | And so over 35 different tasks,
00:40:01.640 | if you compare Flava to all kinds of different ablations
00:40:04.480 | in terms of CLIP models,
00:40:06.120 | then this is just a much better way
00:40:08.040 | to get to this information.
00:40:10.280 | So I think this is a nice example
00:40:11.880 | of where we're probably gonna go with the field
00:40:15.080 | in the near future.
00:40:17.400 | So the other trend that we see
00:40:19.320 | very obviously in the field right now
00:40:20.960 | is that everybody cares about generative models.
00:40:23.680 | So language models and image generative models,
00:40:27.760 | there's just a trend where we want to be generative,
00:40:30.040 | we wanna move away from this contrastive,
00:40:31.680 | discriminative stuff to the more interesting,
00:40:34.880 | more richer representations maybe that you get
00:40:37.360 | out of generating sequences or images.
00:40:41.240 | So this SimVLM paper was one of the first ones
00:40:44.200 | where they really had this separate decoder
00:40:46.160 | that was trying to generate or kind of complete captions,
00:40:49.480 | which they showed gives you a lot richer representations.
00:40:53.640 | I think this is actually the current state of the art now,
00:40:56.920 | it's called CoCa.
00:40:56.920 | So a lot of these models,
00:40:59.840 | they all again look very similar,
00:41:02.240 | but in this case now we're starting
00:41:03.520 | to really see these text decoders.
00:41:05.280 | So initially with Clip,
00:41:07.040 | I think that's also what they were trying to go for,
00:41:09.200 | like OpenAI being a company
00:41:10.880 | that really likes generative models,
00:41:12.840 | but they couldn't really get it to work.
00:41:14.280 | And I think, so it took us a while as a field
00:41:16.560 | to really figure out how to do this the right way.
00:41:19.280 | And so right now we're really kind of
00:41:23.640 | in the age of language models, right?
00:41:25.440 | And so one of the interesting things you can do
00:41:29.000 | with language models is just keep them frozen
00:41:32.080 | and then learn how to project into the language models.
00:41:35.160 | So the MMBT architecture I talked about
00:41:38.320 | where we had this BERT model,
00:41:39.760 | we kind of kept it frozen
00:41:40.960 | and we learned to project into the BERT token space.
00:41:44.920 | You can do exactly the same thing,
00:41:46.440 | but then with a much fancier model
00:41:49.000 | or something like T5,
00:41:50.320 | even where you just have an encoder decoder
00:41:52.200 | or some kind of generative part of this,
00:41:54.760 | you keep that thing frozen
00:41:56.560 | and then you learn to project into the token space
00:41:59.520 | of that frozen language model.
00:42:01.640 | And then you can do lots of fun stuff, it turns out.
00:42:04.960 | So what they show in this paper
00:42:06.600 | is that you then get few-shot learners.
00:42:08.920 | So all of the things you see with GPT-3
00:42:11.120 | where you can just give it some kind of in-context examples
00:42:14.040 | and it's gonna figure out binding kind of on the fly.
00:42:18.480 | So it says like, this is a dax and this is a blicket.
00:42:21.320 | So what is this?
00:42:22.760 | And then it gives you the answer that it's a dax.
00:42:25.440 | So it really learns in context
00:42:26.960 | how you decide the feature mappings,
00:42:29.400 | which is really kind of solving the grounding problem
00:42:32.480 | that a lot of this multimodal stuff started with.
00:42:36.040 | So I think that's very cool.
00:42:37.200 | And then probably one of the coolest papers right now
00:42:41.960 | or models right now that you might've heard of
00:42:43.880 | if you follow the field is Flamingo out of DeepMind,
00:42:47.480 | where they take a Chinchilla language model.
00:42:50.200 | And so this is really a compute-optimal language model.
00:42:53.640 | And now you have this vision encoder
00:42:56.200 | that encodes multiple different images
00:42:59.400 | that you can then do reasoning over
00:43:01.400 | and then kind of auto-complete.
00:43:03.160 | So what this gets you is just a much more powerful model
00:43:06.800 | because you can do your generative
00:43:09.360 | over lots of different images.
00:43:10.720 | So it's really like stepwise, you can see it, right?
00:43:13.400 | We started off with very simple transformers
00:43:15.520 | and now we're actually at something
00:43:16.960 | that is starting to get pretty complicated
00:43:19.280 | because we have these building blocks
00:43:21.160 | like a Perceiver Resampler,
00:43:23.360 | where we have a bunch of different images
00:43:25.920 | that we featurize
00:43:26.960 | and now we need to compress the information
00:43:29.320 | because sometimes we have three images,
00:43:30.840 | sometimes we have five images.
00:43:32.200 | So we wanna make sure that we can compress it
00:43:34.080 | so that it's always ready for consumption
00:43:36.080 | by the next layer of the language model.
00:43:40.400 | And then, so this paper again,
00:43:42.040 | is a really good paper to read
00:43:43.400 | because they actually, so this is not me,
00:43:45.880 | this is not my code, this comes from the actual paper.
00:43:48.120 | So they just have the diagram together with the code
00:43:50.560 | so that you can really understand what it's doing,
00:43:52.880 | which I think is really great.
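As a rough illustration of the compression the Perceiver Resampler performs, here is a stripped-down sketch (my own simplification, not the paper's code; Flamingo's real resampler stacks several such layers with feed-forward blocks): a fixed set of learned latent queries cross-attends to however many image features come in and always returns the same number of vectors.

```python
import torch
import torch.nn as nn

class TinyPerceiverResampler(nn.Module):
    """Compress a variable number of image features into a fixed number of latents."""

    def __init__(self, dim=1024, n_latents=64, n_heads=8):
        super().__init__()
        # Learned queries: always n_latents of them, no matter how many frames come in.
        self.latents = nn.Parameter(torch.randn(n_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, image_features):
        # image_features: (B, N, dim) where N varies (3 images, 5 images, ...).
        q = self.latents.unsqueeze(0).expand(image_features.shape[0], -1, -1)
        attended, _ = self.cross_attn(q, image_features, image_features)
        return attended + self.ff(attended)   # (B, n_latents, dim): fixed size

resampler = TinyPerceiverResampler()
print(resampler(torch.randn(2, 3 * 256, 1024)).shape)  # torch.Size([2, 64, 1024])
print(resampler(torch.randn(2, 5 * 256, 1024)).shape)  # torch.Size([2, 64, 1024])
```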
00:43:56.400 | And so once you have your perceiver resampling step,
00:44:00.840 | what you then do is you do a gated cross attention.
00:44:04.040 | This is how you implement it.
00:44:06.200 | And so this gated cross attention,
00:44:08.440 | you do that before your frozen language model layer.
00:44:12.200 | So you really just have a frozen chinchilla language model
00:44:15.480 | and you learn to kind of modulate the information
00:44:17.800 | that goes into that language model.
00:44:20.040 | You propagate the gradients all the way back,
00:44:22.240 | you just don't update the language model.
00:44:24.040 | So you're really kind of trying to figure out like,
00:44:25.880 | how am I gonna design my signal
00:44:28.040 | so that my language model can do the most with it, right?
00:44:31.000 | How am I gonna combine the information?
00:44:32.920 | So you'll notice that now we do it before the layer, right?
00:44:35.400 | In a lot of this other stuff,
00:44:36.480 | you would do the attention after the layer,
00:44:38.160 | but here you do it before.
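Here is a simplified sketch of such a gated cross-attention block (my own stripped-down version, not the code from the paper): the output is scaled by tanh gates initialized at zero, so at the start of training the frozen language-model layer that follows sees exactly what it would have seen with no visual input at all.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Inserted BEFORE a frozen LM layer: lets text tokens attend to visual latents."""

    def __init__(self, dim=1024, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Gates start at zero: tanh(0) = 0, so the frozen LM is initially undisturbed.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, visual_latents):
        # text_hidden: (B, T, dim); visual_latents: (B, n_latents, dim) from the resampler.
        attended, _ = self.attn(text_hidden, visual_latents, visual_latents)
        x = text_hidden + torch.tanh(self.attn_gate) * attended
        x = x + torch.tanh(self.ff_gate) * self.ff(x)
        return x  # this then feeds into the frozen language-model layer
```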
00:44:39.640 | So Karpathy, I think more than 10 years ago, had this image
00:44:45.920 | with Barack Obama kind of setting his foot here on the scale
00:44:49.000 | to make somebody think like,
00:44:51.800 | they're a lot heavier than they really are.
00:44:54.440 | So this is obviously funny to us,
00:44:56.400 | but not to an AI system,
00:44:58.920 | I think unless it really understands the scene.
00:45:02.120 | And so that's why Karpathy at the time said,
00:45:04.680 | this would be a really good visual Turing test.
00:45:06.680 | Like if a system can figure this out,
00:45:08.720 | then it's actually really smart.
00:45:11.000 | And so obviously it's been a bit of a challenge
00:45:13.000 | for everybody working in the field
00:45:14.320 | then to get something that actually works on this.
00:45:16.480 | And so Flamingo, as it turns out, kind of gets the joke.
00:45:20.040 | But yeah, so it's a bit unclear if it really gets the joke,
00:45:24.640 | because if you read this conversation,
00:45:26.160 | it's sort of kind of getting steered
00:45:27.680 | in the right direction, right?
00:45:28.760 | But at least we're making progress,
00:45:31.240 | let's put it that way.
00:45:32.320 | And then, so in Flamingo,
00:45:36.120 | you still have a lot of moving parts,
00:45:37.600 | but you can really take this almost to the full extreme
00:45:40.000 | where you try to freeze almost everything.
00:45:42.120 | And you just want to learn this kind of mapping
00:45:44.520 | between your image encoder and your language model,
00:45:47.320 | or your image encoder and your encoder decoder architecture.
00:45:50.680 | And all you really do is just a projection
00:45:52.840 | between the two, right?
00:45:54.400 | So there's this nice model called BLIP-2,
00:45:57.000 | where they experiment with like OPT for the language model
00:46:00.040 | and Flan-T5 for the encoder-decoder architecture.
00:46:03.080 | And this just gives you amazing results.
00:46:04.960 | It gives you really complex captions and things like that
00:46:09.040 | without any real direct supervision on the captions itself,
00:46:12.280 | which is pretty impressive, I think.
00:46:13.840 | So that just shows you the power
00:46:15.600 | of language models in general.
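If you want to poke at this yourself, BLIP-2 checkpoints are available through Hugging Face Transformers; a rough usage sketch might look like the following (the specific checkpoint name, the example image URL, and the prompt format are assumptions on my part, so treat it as a starting point rather than a definitive recipe).

```python
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

checkpoint = "Salesforce/blip2-opt-2.7b"  # assumed checkpoint name
processor = Blip2Processor.from_pretrained(checkpoint)
model = Blip2ForConditionalGeneration.from_pretrained(checkpoint, torch_dtype=torch.float16)
model.to("cuda")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # an example COCO image
image = Image.open(requests.get(url, stream=True).raw)

# Captioning: no text prompt, just the image.
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])

# Visual question answering: prepend a question as the text prompt.
prompt = "Question: how many animals are in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
ids = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```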
00:46:17.600 | So here are some examples.
00:46:21.400 | So it can really do like different things
00:46:23.320 | from captioning to reasoning, to visual question answering,
00:46:26.640 | to like location detection.
00:46:29.960 | So you can have a long conversation with this system.
00:46:32.720 | This really is kind of the future where we're going, right?
00:46:35.280 | Where we're going to have a ChatGPT,
00:46:36.600 | but it's also going to be able to see the world in a way.
00:46:39.600 | And so I think an interesting thing,
00:46:43.440 | so you've probably heard of like chain of thought prompting
00:46:46.080 | and things like that, where you ask the language model,
00:46:48.080 | like let's think step-by-step,
00:46:50.520 | and you can tell a vision and language model,
00:46:54.080 | generate a rationale for why something might be the case.
00:46:58.360 | So you generate a potential explanation
00:47:01.200 | for what your answer might be.
00:47:03.160 | And then after that, you ask it to answer the question.
00:47:05.880 | And it turns out that if you do that sort of multimodal
00:47:08.760 | chain of thought prompting, then the system gets much better.
00:47:11.960 | And so, this was like the new state of the art
00:47:14.880 | on ScienceQA or a benchmark like that,
00:47:17.800 | just because it learns to unpack the information, right?
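As a sketch of what that two-stage prompting can look like in practice, something along these lines (the prompt wording is illustrative and the `vlm_generate` helper is hypothetical, standing in for any vision-language model that can continue a text prompt conditioned on an image):

```python
def multimodal_chain_of_thought(vlm_generate, image, question, options):
    # Stage 1: ask the model to produce a rationale before committing to an answer.
    rationale_prompt = (
        f"Question: {question}\n"
        f"Options: {', '.join(options)}\n"
        "Let's think step by step about what the image shows:"
    )
    rationale = vlm_generate(image, rationale_prompt)

    # Stage 2: condition the final answer on the question plus the generated rationale.
    answer_prompt = (
        f"Question: {question}\n"
        f"Options: {', '.join(options)}\n"
        f"Rationale: {rationale}\n"
        "Therefore, the answer is:"
    )
    return vlm_generate(image, answer_prompt)
```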
00:47:20.920 | And so I think we're really as a field,
00:47:23.560 | just starting to figure out what the potential is of this.
00:47:26.760 | And I think this paper is where they also show
00:47:29.080 | that multimodal chain of thought prompting
00:47:31.120 | really gets you pretty amazing results.
00:47:33.480 | And they show very nice results on Raven's matrices
00:47:37.040 | and like very complicated kind of IQ tests sort of things
00:47:40.960 | that humans are supposed to be really good at,
00:47:43.320 | but you have to be a pretty smart human
00:47:44.840 | to really be good at this.
00:47:46.280 | And this system just nails it.
00:47:47.760 | So, we're making super fast progress.
00:47:51.640 | And we started off from a very simple BERT model
00:47:54.560 | that was able to look at some pictures.
00:47:56.320 | And now we're getting to these
00:47:57.440 | very sophisticated foundation models.
00:48:00.520 | So, that was my short history
00:48:02.440 | of multimodal foundation models.
00:48:04.680 | So, how much time do I have left?
00:48:08.720 | - So, until 5:50, so 25 minutes.
00:48:11.280 | - All right, okay, plenty of time.
00:48:13.040 | - We can also take questions.
00:48:15.560 | - Yeah, please, questions.
00:48:17.040 | - Do we do much pre-processing of images
00:48:22.840 | for these models anymore?
00:48:24.000 | So, I noticed a lot of the images
00:48:25.360 | just look like they were boxes,
00:48:28.720 | like they just passed through
00:48:30.640 | with kind of no sense of shape in them.
00:48:33.720 | - Yeah, yeah, so I think the history of computer vision
00:48:38.320 | has been very similar to the history
00:48:40.040 | of natural language processing,
00:48:41.320 | where we thought we needed all of this structure
00:48:43.200 | and all of these different things.
00:48:44.720 | And it turns out you can just throw it all away
00:48:46.680 | and just have a big transformer over the patches.
00:48:49.640 | Sorry, yes.
00:48:53.040 | - Seeing as it's 2.31 and one minute, save time.
00:48:56.120 | (all laughing)
00:48:58.720 | - You mentioned a couple of times
00:49:00.960 | like models being frozen, what does that mean?
00:49:03.800 | - Yeah, yeah, sorry, I should have explained that better.
00:49:06.760 | So, it just means that we are not updating the weights.
00:49:11.280 | So, like if we go to this here, I think is a nice example.
00:49:15.880 | So, we have frozen self-attention.
00:49:19.520 | So, that just means that when we do a forward pass,
00:49:22.240 | we go all the way to whatever we want to predict,
00:49:24.560 | we get some gradients, we take them all the way down,
00:49:27.200 | but we only update the non-frozen layers.
00:49:30.760 | So, here the gradients actually do get updated,
00:49:32.920 | but these just never change.
00:49:34.480 | And so, the reason you wanna do that
00:49:36.280 | is because otherwise you're gonna drift way too far.
00:49:39.000 | And so, then you're gonna kind of destroy
00:49:41.640 | all of the cool stuff your language model has learned,
00:49:44.000 | because you're just gonna focus on the small data set
00:49:46.680 | that you're training it on.
00:49:48.160 | So, you wanna preserve the abilities of the language model,
00:49:50.800 | but you want it to become good at the thing you care about.
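In code, "frozen" usually just amounts to setting requires_grad to False (or leaving those parameters out of the optimizer); a tiny sketch with made-up stand-in modules:

```python
import torch.nn as nn
from torch.optim import AdamW

# Stand-ins for a big pretrained LM and a small trainable adapter.
frozen_lm = nn.Linear(512, 512)
adapter = nn.Linear(512, 512)

# Freeze the language model: no parameter updates will ever reach it.
for p in frozen_lm.parameters():
    p.requires_grad = False

# Only the adapter's parameters are handed to the optimizer.
optimizer = AdamW([p for p in adapter.parameters() if p.requires_grad], lr=1e-4)

# During training, the backward pass still flows *through* frozen_lm,
# but optimizer.step() only ever changes the adapter's weights.
```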
00:49:53.720 | Other questions?
00:49:59.760 | - In terms of multimodal fusion,
00:50:03.160 | is there a benefit to doing like the early or middle fusion
00:50:05.400 | as opposed to like only doing the late fusion?
00:50:08.000 | - Yeah, so, I mean, we're gonna talk about evaluation next,
00:50:11.800 | but so it really depends on the task that you care about.
00:50:15.400 | And so, I would say the earlier is always the better
00:50:19.200 | if you can afford it.
00:50:20.360 | And so, like CLIP is very efficient to train,
00:50:23.920 | it's very late fusion, right, at the very end.
00:50:25.800 | So, there's no interaction between the different modalities.
00:50:28.800 | And so, that's really good if you want to be very efficient,
00:50:33.240 | and for training, it's much nicer.
00:50:37.480 | But if you want to have a richer understanding
00:50:39.640 | of the multi-modal signal,
00:50:41.520 | then you want to do earlier fusion.
00:50:43.280 | So, yeah, it's always a trade-off.
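A minimal sketch of the trade-off being described (the encoders and dimensions are placeholders): late fusion only lets the two modalities interact through a final dot product, while early fusion runs image and text tokens through one shared transformer so they can interact at every layer, at a much higher compute cost.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """CLIP-style: separate encoders, interaction only via a final dot product."""
    def __init__(self, image_encoder, text_encoder):
        super().__init__()
        self.image_encoder, self.text_encoder = image_encoder, text_encoder

    def forward(self, image, text_tokens):
        img = self.image_encoder(image)        # (B, D)
        txt = self.text_encoder(text_tokens)   # (B, D)
        return (img * txt).sum(-1)             # similarity score: cheap, but shallow

class EarlyFusion(nn.Module):
    """One transformer over the concatenated image and text token sequences."""
    def __init__(self, dim=512, n_layers=6, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.score = nn.Linear(dim, 1)

    def forward(self, image_tokens, text_tokens):
        # image_tokens: (B, Ni, dim); text_tokens: (B, Nt, dim)
        fused = self.encoder(torch.cat([image_tokens, text_tokens], dim=1))
        return self.score(fused[:, 0])         # richer interactions, more compute
```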
00:50:47.520 | - It seems that images are just a lot more data than text.
00:50:52.520 | So, how much more difficult are these to train
00:50:55.600 | and how much bigger does like the image processing
00:50:59.720 | have to be compared to the language model?
00:51:03.520 | - Yeah, so, images are more complex in a way,
00:51:08.520 | but they're also kind of higher bandwidth representations.
00:51:12.520 | So, there's a lot of kind of like,
00:51:14.280 | just pixels that our brains just abstract away, right?
00:51:17.640 | It's really about the scene that you're seeing
00:51:19.440 | and like you're not really thinking too much
00:51:21.960 | about the pixels themselves.
00:51:26.680 | So, like Yann LeCun likes to say
00:51:30.680 | that language is just a kind of low-bandwidth
00:51:33.560 | proxy for a language of thought,
00:51:33.560 | which is much richer and much higher bandwidth.
00:51:39.800 | And he thinks that's probably visual; I'm not so sure.
00:51:39.800 | But so, yeah, I don't think that there's necessarily
00:51:43.800 | a difference between kind of the scaling laws
00:51:45.800 | that you see in these systems,
00:51:47.440 | or at least we still have to figure that out.
00:51:50.800 | We'll kind of talk about that towards the end as well.
00:51:53.400 | - Do these modern models also have certain social
00:52:02.800 | and cultural biases, just like the natural language models?
00:52:02.800 | - Oh, yeah, they have terrible biases, yeah.
00:52:05.000 | (audience laughing)
00:52:06.800 | So, yeah, some people are actually working on this
00:52:09.800 | who are in this very room.
00:52:11.000 | But so, these models can be very racist
00:52:13.480 | also in what they generate
00:52:14.760 | or the kind of predictions they make.
00:52:17.280 | So, if you have an Asian basketball player
00:52:21.200 | standing sort of like this
00:52:22.280 | with a basketball very obviously there,
00:52:24.560 | then the model will think that he's playing ping pong
00:52:26.560 | because he's Asian.
00:52:27.600 | (audience laughing)
00:52:28.720 | I'm not joking.
00:52:29.560 | So, these models,
00:52:33.800 | just like all neural networks,
00:52:35.800 | this is really a big problem.
00:52:36.920 | And one of the most interesting problems
00:52:39.000 | that you should be working on if you're a student
00:52:41.120 | and you wanna make a difference is,
00:52:42.720 | how do we get these systems to be much better
00:52:45.280 | at these sorts of things?
00:52:51.040 | - So, in one of the examples,
00:52:52.640 | you showed like the model interpreting
00:52:54.360 | the content of an image.
00:52:57.040 | So, say we wanna understand the content of a video.
00:52:59.720 | So, what actual challenges might you see
00:53:00.560 | like with video?
00:53:00.560 | - Yeah, so, in the video,
00:53:02.400 | you might see like what improvements we can make
00:53:06.880 | to our new scope.
00:53:08.840 | - Yeah, so, you're asking about the attention mask
00:53:12.560 | sort of, right?
00:53:13.520 | Yeah, so you can use the same idea for videos
00:53:16.640 | and you just look at the video
00:53:18.080 | and so these systems are so good.
00:53:20.200 | Now, the object detectors are so good,
00:53:21.920 | you can really track objects kind of real time
00:53:25.040 | as they go through your video.
00:53:27.280 | And so, you can try to check how that aligns
00:53:29.240 | with your attention mask in your model.
00:53:31.960 | So, a lot of like,
00:53:35.280 | so videos I think are sort of interesting,
00:53:37.160 | but they're also not really that interesting
00:53:39.160 | because you can very often just subsample images
00:53:42.240 | and solve the problem on the images
00:53:43.280 | rather than having to deal with the complex video.
00:53:45.800 | But yeah.
00:53:48.160 | All right, maybe one more question
00:53:50.680 | and then we'll go do some evaluation.
00:53:52.680 | - So, these multimodal models,
00:53:56.360 | when you only provide,
00:53:57.560 | let's say you only provide a single source of media,
00:53:59.400 | like just one modality rather than both,
00:54:01.680 | how does it perform in that case?
00:54:03.320 | 'Cause it's obviously more geared
00:54:05.240 | for multi-modal cases.
00:54:06.880 | - Yeah, so I mean,
00:54:07.720 | that's one of the giant shortcomings
00:54:09.480 | of a lot of these models
00:54:10.520 | is that they're really just built for multi-modal stuff.
00:54:13.640 | And so, what if I don't have an image, right?
00:54:16.960 | And so, I mean, that's why we did Flava
00:54:20.480 | because we want to have one model
00:54:21.680 | that can do all of that stuff.
00:54:23.840 | And that's why in MMBT,
00:54:26.040 | so the supervised multimodal bitransformer,
00:54:28.680 | we actually have an analysis of like,
00:54:30.440 | how robust is this model to missing images or missing text?
00:54:33.880 | So, I think a lot of folks
00:54:37.120 | working on these early visual BERT models
00:54:39.360 | were kind of myopically focused on VQA,
00:54:42.640 | which is actually a great segue
00:54:43.880 | to what I wanna talk about next.
00:54:45.480 | So, it really depends on the task
00:54:49.520 | that you care about, as I said, right?
00:54:51.320 | And so, I think if I'm gonna tell you about multi-modality,
00:54:55.480 | I also have to tell you how you're gonna check
00:54:57.400 | that the multi-modal system
00:54:58.520 | is actually good at multi-modal things.
00:55:00.760 | And so, that's the topic of evaluation,
00:55:04.720 | which actually is a super important topic.
00:55:07.080 | And a lot of people, they wanna be cool
00:55:08.800 | and build big models,
00:55:10.680 | but I think it should be way cooler
00:55:12.320 | to do proper evaluation of these models,
00:55:14.640 | especially if you're in academia
00:55:16.280 | because you only have limited GPUs anyway, right?
00:55:18.800 | So, what can you do?
00:55:21.120 | Sorry, I don't wanna rub it in.
00:55:24.200 | (audience laughing)
00:55:26.640 | So, how do you check?
00:55:29.320 | Well, there's this amazing project.
00:55:33.360 | So, ImageNet really changed the history of deep learning,
00:55:36.440 | I think, and this other dataset, COCO,
00:55:38.760 | I think also really changed,
00:55:40.840 | especially vision and language,
00:55:42.120 | but also, I think, vision in general,
00:55:45.600 | where they have just a bunch of the main multimodal tasks.
00:55:50.600 | So, these images are very richly annotated
00:55:53.560 | with all kinds of different things.
00:55:54.800 | So, like the segmentation of the objects,
00:55:57.120 | the bounding boxes, the labels of the bounding boxes,
00:56:00.240 | they come at like sort of a different pixel granularities.
00:56:04.040 | It's a huge dataset.
00:56:05.880 | It's very fine-grained,
00:56:07.360 | annotated in terms of like the categories that it has,
00:56:10.040 | and then you have five captions for each of these images.
00:56:14.320 | And so, this really was the first dataset
00:56:17.120 | that unlocked a lot of sort of vision
00:56:18.960 | and language processing at scale
00:56:20.440 | because you had your picture and you had your caption,
00:56:23.040 | and now you need to figure out,
00:56:24.480 | okay, how do I give the right caption for this image?
00:56:26.920 | So, that's image captioning,
00:56:28.000 | or can I retrieve, given some piece of text,
00:56:31.080 | the right image or the image for the piece of text?
00:56:34.800 | So, there's a bunch of very impactful datasets
00:56:37.680 | that do this stuff that we already talked about, LAION,
00:56:40.480 | but COCO really is the main one still,
00:56:42.760 | I think, that a lot of people kind of use
00:56:44.520 | as the canonical instance of this dataset category.
00:56:49.200 | And then, the other thing that people really care about
00:56:52.240 | in vision and language processing
00:56:53.840 | is visual question answering.
00:56:56.320 | And so, there really are a bunch of academic groups
00:57:00.800 | who are or have been so focused on this task
00:57:03.760 | that they didn't really care about anything else,
00:57:05.760 | and that's why you see a lot of models
00:57:07.520 | that are really optimized just for multimodal
00:57:10.280 | and nothing else.
00:57:11.600 | And you can see that kind of reflected
00:57:13.200 | in the citation counts as of last night, 3 a.m.,
00:57:16.640 | where, so, VQA just has way more citations
00:57:21.320 | than image captioning datasets even, right?
00:57:24.520 | And so, what you do here is you just have an image
00:57:27.480 | and then people ask very simple questions,
00:57:30.120 | so annotators, right, they ask these simple questions,
00:57:33.120 | they give the answers,
00:57:34.400 | and now we want to be able to answer these questions
00:57:37.400 | with machines.
00:57:38.560 | And as I alluded to earlier,
00:57:39.760 | one of the kind of embarrassing backstories of this dataset
00:57:43.120 | was that the initial version of the dataset
00:57:45.560 | was actually found to have images not really matter at all.
00:57:50.640 | So you could just look at the question,
00:57:52.680 | then it could have something like
00:57:54.200 | how many slices of pizza are there?
00:57:56.920 | And so, well, not in that particular case,
00:57:58.960 | but in almost all of the dataset,
00:58:01.240 | the right answer for how much or how many question was two.
00:58:04.520 | So if you just predicted two to every how much
00:58:07.760 | or how many question,
00:58:08.600 | you got like 70% accuracy on the counting category.
00:58:12.160 | So careful dataset or evaluation benchmark design
00:58:16.600 | is also really a skill,
00:58:18.040 | and you really need to think about what you're doing.
00:58:20.000 | You can't just like set some data aside and evaluate it on,
00:58:23.120 | you have to really think about what you're doing.
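One simple sanity check in that spirit is to score a question-only majority baseline before trusting any model numbers; a minimal sketch (the field names are assumptions about how the annotations are stored):

```python
from collections import Counter, defaultdict

def majority_answer_baseline(train_qas, test_qas):
    """Predict the most common training answer per question prefix, ignoring the images.

    Each item is assumed to look like {"question": "how many slices ...", "answer": "2"}.
    If this scores well, the benchmark can be gamed by language priors alone.
    """
    by_prefix = defaultdict(Counter)
    for qa in train_qas:
        prefix = " ".join(qa["question"].lower().split()[:3])  # e.g. "how many slices"
        by_prefix[prefix][qa["answer"]] += 1

    correct = 0
    for qa in test_qas:
        prefix = " ".join(qa["question"].lower().split()[:3])
        guess = by_prefix[prefix].most_common(1)[0][0] if by_prefix[prefix] else "yes"
        correct += guess == qa["answer"]
    return correct / len(test_qas)
```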
00:58:25.760 | And so there's GQA by Chris actually,
00:58:27.960 | which is also just, I think,
00:58:30.400 | a better designed version of this dataset maybe.
00:58:32.560 | So you might want to use that these days.
00:58:34.960 | There are also kind of very targeted datasets
00:58:40.680 | that really try to measure one particular thing.
00:58:42.920 | And I think one of the things we really want to get at
00:58:46.080 | with these models is what we would call compositionality.
00:58:49.320 | So we want to be able to really take the parts
00:58:51.680 | and reason about the whole
00:58:53.600 | and understand the relationships
00:58:55.080 | between the different concepts.
00:58:56.280 | So CLEVR was a very clever dataset
00:58:59.320 | that was designed really to measure the compositionality
00:59:03.040 | both on the language side and on the vision side.
00:59:05.160 | So you have to understand the relationships
00:59:06.840 | between all of these different objects in the images.
00:59:09.920 | So that's been a pretty impactful dataset.
00:59:11.800 | I think we're really forcing people
00:59:13.920 | to think about compositionality.
00:59:16.480 | But a lot of these datasets really had big problems.
00:59:21.320 | So one of the problems is, they were too easy.
00:59:25.200 | So VQA is sort of like plateauing out.
00:59:27.080 | We can talk about that a little bit too.
00:59:28.760 | They also weren't really realistic.
00:59:30.000 | So you could solve VQA
00:59:31.960 | and that's probably gonna make some people's lives better.
00:59:34.840 | You're all like trying to process the memes.
00:59:36.640 | I can see everybody.
00:59:37.680 | (laughs)
00:59:39.200 | Okay, let's get to the memes first.
00:59:40.640 | So obviously, so these memes are not actually
00:59:44.960 | in the dataset.
00:59:46.640 | So I could put some really hateful memes
00:59:49.080 | about sort of Hitler or something,
00:59:51.080 | which are in the dataset, but that would be less fun.
00:59:54.080 | So these are mean meme examples to kind of demonstrate
00:59:59.080 | how the dataset was constructed.
01:00:02.200 | And so one of the problems we had, as I said,
01:00:04.920 | like VQA, the V didn't really matter.
01:00:07.280 | What we want to have is a dataset,
01:00:09.040 | if we care about multimodality specifically,
01:00:11.720 | is like, how do we get a dataset
01:00:14.160 | that you can only get right
01:00:15.880 | if you are good at multimodal reasoning
01:00:17.600 | and otherwise you're just gonna screw it up.
01:00:20.080 | And so this is what we came up with
01:00:21.640 | is if you have a meme like this one,
01:00:23.920 | love the way you smell today.
01:00:25.200 | I mean, that's not very nice
01:00:26.400 | if you send this to your friends, right?
01:00:28.400 | But so it turns out that if you just swap out the background
01:00:35.160 | now it's a very nice thing to say, right?
01:00:37.720 | And like this one is, I don't know,
01:00:39.880 | you're maybe a bit weird if you like this,
01:00:41.520 | but there's nothing wrong with it, right?
01:00:44.840 | And so it's the same for this one here,
01:00:48.200 | like look how many people love you
01:00:49.520 | with the tumbleweed that's really sad.
01:00:51.160 | And like, if you change just one word
01:00:54.240 | suddenly it's like a really nice thing to say, right?
01:00:56.880 | So if you want to solve this,
01:01:00.520 | if you want to classify this correctly for the meanness,
01:01:04.280 | then you have to really understand multimodal reasoning.
01:01:07.320 | You have to understand the relationship
01:01:09.120 | between the image and the text
01:01:10.640 | in order to get to the right label, right?
01:01:12.480 | And so it was really constructed by design to do that.
01:01:15.400 | And so how we did it exactly
01:01:19.400 | is we used some really highly trained annotators.
01:01:22.920 | And then one of the big problems
01:01:24.960 | with a lot of these datasets
01:01:26.240 | is that nobody really knows who owns the meme,
01:01:30.480 | for example, right?
01:01:31.320 | So somebody makes this meme,
01:01:32.440 | now they technically own a copyright.
01:01:34.120 | And so when I made this dataset,
01:01:36.280 | I was working at the Facebook
01:01:38.200 | and they were very afraid of copyright things.
01:01:40.840 | So what we actually had to do
01:01:42.000 | is we had to pay people to make new memes.
01:01:45.280 | (audience laughing)
01:01:48.040 | And so not from scratch,
01:01:49.640 | so we could show them kind of the actual examples.
01:01:51.920 | And then they had to try to find images
01:01:54.960 | that were kind of corresponding to the original source image
01:01:58.880 | and try to recreate the meme,
01:02:00.520 | but now with an image that we could buy from Getty.
01:02:03.560 | And so we gave a lot of money to Getty
01:02:07.440 | so that we could then release the dataset to the public
01:02:11.000 | so that people could do actually research on this
01:02:13.120 | and understand for their multimodal models,
01:02:15.200 | whether they're good or not.
01:02:17.040 | And so we really tried to make it
01:02:18.520 | so that we had these benign confounders,
01:02:21.240 | sorry, it's the startup world, with co-founders.
01:02:26.400 | So the confounder here is obviously
01:02:30.040 | like you have your original meme
01:02:31.440 | and then you have your confounder
01:02:33.680 | where you swap out one of the modalities
01:02:35.360 | and here you have the other one, right?
01:02:36.560 | So we had our annotators do that as well.
01:02:39.960 | And so this led to a really nice dataset, I think,
01:02:44.080 | because it showed some of the intuitions
01:02:46.120 | that I think a lot of people in the field had,
01:02:48.360 | which is that multimodal pre-training doesn't really work.
01:02:51.520 | Is that an alarm?
01:02:53.560 | So multimodal pre-training doesn't really work.
01:02:58.000 | And so all of this stuff that people have been doing
01:03:02.880 | with all their fancy visual BERT models
01:03:06.080 | actually turned out maybe to not really be that useful
01:03:09.400 | anyway, and so maybe it got you like one point extra, right?
01:03:12.080 | From one visual BERT to like a different visual BERT,
01:03:12.080 | like less than a point,
01:03:13.800 | just by doing that multimodal pre-training.
01:03:16.680 | So that means like we still have to figure this stuff out,
01:03:20.560 | right?
01:03:21.400 | This dataset is far from solved
01:03:22.960 | and we still have a long way to go
01:03:25.120 | despite all of these fancy models
01:03:26.720 | and a new paper coming out every week
01:03:29.800 | that does something new, like we're not there yet.
01:03:33.080 | And I think that's encouraging, especially for you.
01:03:35.880 | Like when you can go out and solve it.
01:03:38.600 | So what we did with this dataset
01:03:42.000 | is we organized a competition.
01:03:43.280 | We had 100K in prize money
01:03:45.680 | to try to see what people could come up with.
01:03:47.920 | And so there was a lot of nice work coming out of that
01:03:52.480 | and we really kind of managed to crank the numbers up
01:03:55.080 | by quite a lot.
01:03:56.240 | But the solutions were slightly disappointing.
01:04:00.200 | So I don't know if you've ever used Kaggle,
01:04:02.120 | but if you wanna really win on Kaggle,
01:04:03.880 | you just have to ensemble the hell out
01:04:05.760 | of all of the different models
01:04:06.880 | that are current state of the art
01:04:08.360 | and then you're very likely to win, right?
01:04:10.120 | And so that's what happened here
01:04:12.760 | where there wasn't really the fundamental breakthrough
01:04:17.120 | we had maybe been hoping for.
01:04:18.440 | So that still needs to be built, I think.
01:04:21.440 | So this other dataset,
01:04:24.280 | I just wanna kind of briefly talk about.
01:04:25.840 | So the theme sort of of this section is like,
01:04:28.000 | if you make a dataset, think about it very carefully
01:04:31.600 | because you can really be very creative with this
01:04:33.560 | and really measure the things you're trying to get at.
01:04:36.440 | So this dataset, Winoground,
01:04:39.680 | we were trying to figure out,
01:04:40.880 | okay, how good is CLIP actually?
01:04:42.880 | So it looks really amazing
01:04:44.240 | and it's way better than things that were previously there,
01:04:47.200 | but does it understand compositional relationships
01:04:50.280 | in the same way that humans would understand it
01:04:52.240 | or is it sort of just fitting onto the data distribution
01:04:55.440 | and it can be very good at the head of the distribution,
01:04:58.360 | but it's terrible at the tail.
01:05:00.360 | And you can probably already guess where this is going,
01:05:03.480 | but so just to give you an illustration
01:05:06.200 | of what is in this dataset,
01:05:07.560 | you would have some plants surrounding a light bulb
01:05:10.960 | or you would have a light bulb surrounding some plants.
01:05:13.880 | So notice that the words here are exactly the same words,
01:05:17.720 | but in a different order, right?
01:05:19.480 | So, and so the visual depiction of these words
01:05:23.400 | is very, very different.
01:05:24.840 | So if your model,
01:05:25.960 | your contrastive model is actually good at understanding
01:05:29.000 | the visual semantic or the visual linguistic compositionality
01:05:34.000 | of these examples, then it can get it right.
01:05:39.440 | But again, if it's actually just overfitting
01:05:41.640 | on the data distribution that it's seen
01:05:43.400 | and it just kind of is biased toward what it sees often,
01:05:47.120 | then it doesn't really get it, right?
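A minimal sketch of the kind of evaluation this implies, assuming some `clip_score(image, caption)` function that returns a similarity score (the function name and the data layout are hypothetical): for each example, check whether the model prefers the matching caption for each image and the matching image for each caption.

```python
def winoground_style_scores(examples, clip_score):
    """examples: dicts with image_0, image_1, caption_0, caption_1;
    caption_i correctly describes image_i, and the two captions use
    the same words in a different order."""
    text_correct = image_correct = group_correct = 0
    for ex in examples:
        s00 = clip_score(ex["image_0"], ex["caption_0"])
        s01 = clip_score(ex["image_0"], ex["caption_1"])
        s10 = clip_score(ex["image_1"], ex["caption_0"])
        s11 = clip_score(ex["image_1"], ex["caption_1"])
        text_ok = s00 > s01 and s11 > s10    # right caption chosen for each image
        image_ok = s00 > s10 and s11 > s01   # right image chosen for each caption
        text_correct += text_ok
        image_correct += image_ok
        group_correct += text_ok and image_ok
    n = len(examples)
    # Random scoring gives about 25% on the text and image scores and ~17% on the group score.
    return text_correct / n, image_correct / n, group_correct / n
```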
01:05:48.760 | And so one paper that we use
01:05:51.520 | as a source of inspiration for this work
01:05:53.800 | is this paper here,
01:05:56.320 | "Order Word Matters Pre-training for Little."
01:05:59.400 | So we actually found that the order of words
01:06:01.280 | doesn't even matter that much
01:06:03.000 | for general pre-training very often,
01:06:05.880 | which is also kind of a scary thing, right?
01:06:07.520 | So this is deep learning for NLP.
01:06:09.160 | We think that language is really important,
01:06:11.360 | but these models can reason about language
01:06:13.840 | even if you shuffle all the words.
01:06:15.600 | And so that's probably not what we want to have.
01:06:19.440 | And so that doesn't tell you something
01:06:22.480 | about how great we are as researchers.
01:06:24.920 | It tells you something about how terrible
01:06:26.640 | our evaluation benchmarks are, right?
01:06:29.040 | And that's what we need to fix.
01:06:30.600 | So what we did with this data set,
01:06:33.360 | here are some other nice examples,
01:06:34.680 | like there's a mug in some grass
01:06:36.040 | or there's some grass in a mug.
01:06:37.400 | Like these are very different pictures, right?
01:06:39.480 | And so for us, these are trivial.
01:06:41.560 | So like, what's the difference
01:06:43.120 | between a truck fire and a fire truck?
01:06:45.040 | They're pretty important, I think,
01:06:48.480 | also to get that distinction right.
01:06:50.600 | So guess what?
01:06:54.960 | State-of-the-art models often perform below random chance.
01:06:58.400 | So, you know, as I said,
01:07:03.240 | we still have a lot of work to do, which is good.
01:07:06.600 | And so when this paper came out,
01:07:08.880 | I think that the reaction was really nice.
01:07:11.800 | And so when DALL-E 2 came out,
01:07:14.280 | so you've probably heard of DALL-E 2, right?
01:07:17.560 | So it's sort of like stable diffusion,
01:07:19.360 | but then before stable diffusion.
01:07:22.120 | And so this was really the first model
01:07:24.640 | that really showed just how impressive
01:07:27.200 | these generative models can be
01:07:29.320 | when they're creating images.
01:07:31.240 | So this is, there's a mug in some grass.
01:07:34.160 | You do have to kind of cheat a little bit
01:07:36.160 | because you have to add digital art here.
01:07:38.960 | If you don't add that, then it breaks down completely.
01:07:42.120 | So it's sort of prompt hacking, I think,
01:07:45.120 | or sort of tuning on the test set, but okay, you know.
01:07:48.400 | So this is pretty good, right?
01:07:51.080 | So it definitely is better than I think a lot of people
01:07:54.760 | would have expected even a couple of years ago.
01:07:57.120 | But it's not perfect because people on the internet
01:08:02.120 | like to take more pictures of spoons than forks.
01:08:04.800 | So if you say there are fewer spoons than forks,
01:08:10.120 | or there are fewer forks than spoons,
01:08:12.520 | it just really likes spoons more.
01:08:14.320 | (audience laughing)
01:08:17.400 | You know, and so maybe it's like the matrix or something,
01:08:20.560 | I don't know, but so spoons are just nicer.
01:08:25.040 | So again, what you can see here is that these models
01:08:28.200 | really are just reflections of the data
01:08:30.320 | that they're trained on, right?
01:08:32.560 | And yeah, so models are getting better,
01:08:35.280 | but if you've looked at stable diffusion,
01:08:37.160 | like it still can't count fingers and things like that.
01:08:39.840 | So again, there's still a lot of cool work to be done.
01:08:43.520 | Any questions on evaluation?
01:08:46.320 | (man speaking faintly)
01:08:49.560 | No, okay.
01:08:54.320 | So let's talk about other modalities then,
01:08:57.320 | because so we've really just been focused on images
01:09:00.000 | and images are great.
01:09:01.120 | There are lots of images on the internet.
01:09:04.560 | And so that makes it sort of an obvious thing to focus on.
01:09:08.400 | It's also, I think if you look at our brain,
01:09:10.760 | like vision is a very dominant modality, right?
01:09:13.080 | So how we understand the world is very vision driven,
01:09:16.720 | but it doesn't have to be the case.
01:09:18.840 | So there's all these other interesting problems
01:09:20.920 | that involve different modalities.
01:09:22.920 | And so the most obvious one is just speech or audio, right?
01:09:26.440 | So after seeing comes hearing,
01:09:29.120 | and really we could do another lecture just like this,
01:09:32.760 | just on speech and audio.
01:09:34.200 | And there's lots of interesting stuff to talk about.
01:09:36.640 | Obviously we don't have time,
01:09:38.120 | but I'll give you another nice example
01:09:41.000 | of how amazing Alec Radford is at creating datasets.
01:09:45.000 | So there's this Whisper model that came out of OpenAI
01:09:48.240 | not too long ago,
01:09:49.600 | which was trained on 680,000 hours
01:09:52.040 | of multilingual multitask speech data.
01:09:55.440 | So speech with transcriptions,
01:09:57.400 | and they trained this very fancy thing on there,
01:10:01.280 | which actually is not very fancy at all.
01:10:02.960 | It's just a log-mel spectrogram.
01:10:04.400 | So how you represent the audio signal,
01:10:06.600 | and then you feed that into a big transformer.
01:10:08.640 | So this is sort of your encoder self-attention here, right?
01:10:11.880 | And then you have your decoder
01:10:13.080 | where you have your cross attention,
01:10:14.840 | and then you just generate the sequence.
01:10:17.280 | So this is an encoder-decoder, basic transformer model,
01:10:20.880 | but your input is convolutions,
01:10:23.280 | one-dimensional convolutions over the log-mel spectrogram.
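Here is a rough sketch of that audio front end, assuming a 16 kHz waveform and torchaudio for the mel spectrogram; the filter counts and strides loosely follow the 80-mel, stride-two setup rather than reproducing the actual model exactly.

```python
import torch
import torch.nn as nn
import torchaudio

# Waveform -> log-mel spectrogram (assumed 16 kHz audio, 80 mel bins).
melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=400,
                                               hop_length=160, n_mels=80)

def log_mel(waveform):
    return torch.log(melspec(waveform).clamp(min=1e-10))   # (B, 80, frames)

# Two 1-D convolutions over the spectrogram, then a standard transformer encoder.
frontend = nn.Sequential(
    nn.Conv1d(80, 512, kernel_size=3, padding=1), nn.GELU(),
    nn.Conv1d(512, 512, kernel_size=3, stride=2, padding=1), nn.GELU(),
)
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

waveform = torch.randn(1, 16000 * 5)           # 5 seconds of (fake) audio
feats = frontend(log_mel(waveform))            # (1, 512, frames / 2)
encoded = encoder(feats.transpose(1, 2))       # (1, frames / 2, 512), ready for a decoder
```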
01:10:26.280 | And so there's lots of papers that do very similar things.
01:10:30.160 | There's models like wav2vec
01:10:32.440 | that tried to turn the wave signal into vectors,
01:10:34.720 | or you can discretize it in lots of different ways.
01:10:37.800 | So there's a wealth of literature.
01:10:40.080 | Then I think one of the funny observations actually
01:10:42.760 | is that you can just reduce audio to vision anyway, right?
01:10:46.040 | So that's what you could sort of argue
01:10:48.440 | this log mail spectrogram does,
01:10:49.960 | but so not to toot my own horn,
01:10:52.600 | but in 2017, I did this paper where we showed
01:10:55.440 | that you can just take a real audio sample,
01:10:58.480 | turn it into a kind of a spectrogram,
01:11:03.240 | really just a spectrogram.
01:11:04.280 | So what does the spectrum of the audio file look like?
01:11:08.000 | Feed that to a regular conv net, like an AlexNet even,
01:11:11.480 | and then that gives you amazing auditory features.
01:11:13.640 | So now you can use this to distinguish
01:11:15.520 | between violins or guitars and things like that.
01:11:17.960 | So maybe you can just reduce all of this to vision.
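As a minimal sketch of that reduce-audio-to-vision trick (the normalization and the choice of a torchvision ResNet-18 are my assumptions; the original work used an AlexNet-style network): render the spectrogram as a three-channel image and reuse a pretrained image CNN as the feature extractor.

```python
import torch
import torchaudio
import torchvision

# Pretrained image CNN with the classification head removed (recent torchvision assumed).
cnn = torchvision.models.resnet18(weights="IMAGENET1K_V1")
cnn.fc = torch.nn.Identity()
cnn.eval()

melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128)

def auditory_embedding(waveform):
    spec = torch.log(melspec(waveform).clamp(min=1e-10))   # (1, 128, frames)
    spec = (spec - spec.mean()) / (spec.std() + 1e-6)      # crude normalization
    image = spec.unsqueeze(0).repeat(1, 3, 1, 1)           # treat it as a fake RGB image
    image = torch.nn.functional.interpolate(image, size=(224, 224))
    with torch.no_grad():
        return cnn(image)                                  # (1, 512) auditory feature

# e.g. compare a violin clip and a guitar clip via cosine similarity of their embeddings.
```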
01:11:21.520 | So one question maybe you could ask is like,
01:11:23.480 | can we also reduce language to vision or vision to language?
01:11:27.280 | So that's sort of what people are thinking about.
01:11:31.000 | So we talked about the video.
01:11:33.920 | There was a question about video.
01:11:35.320 | So a lot of these ideas also extend pretty directly
01:11:38.720 | to video, but now you just have more data, right?
01:11:41.200 | So like Flamingo already had a bunch
01:11:43.160 | of different images in it.
01:11:44.240 | You can do Flamingo over videos.
01:11:46.880 | Probably a lot of the images are pretty useless
01:11:50.120 | for what you're trying to do with this video model, right?
01:11:52.520 | So they're too similar.
01:11:53.840 | It doesn't really add all that much information.
01:11:56.080 | So you wanna sub sample the frames
01:11:58.280 | so that you get the most useful information
01:12:00.160 | out of your video.
01:12:01.720 | And so there's a bunch of approaches
01:12:03.640 | that kind of take the key frames
01:12:05.520 | and then you just do a standard joint vision
01:12:08.360 | and language transformer encoder thing on top of that.
01:12:11.600 | So this is kind of becoming hopefully
01:12:13.680 | by now a very familiar recipe, right?
01:12:16.400 | And so there's this, so MERLOT is a nice architecture
01:12:20.240 | that does this.
01:12:21.080 | And then they came up with MERLOT Reserve,
01:12:23.400 | kind of a silly name where they also added audio
01:12:27.000 | to this model.
01:12:28.000 | So this is now a tri-modal model, right?
01:12:30.120 | And so we're going towards this foundation model
01:12:33.400 | that can consume all of these different modalities
01:12:36.000 | all in one go.
01:12:37.160 | And that's really like a clear trend in the field.
01:12:39.720 | Another very interesting direction,
01:12:44.040 | I think where in the field,
01:12:45.760 | we were very excited about this for a while,
01:12:47.680 | but I think it's sort of gone now
01:12:50.960 | because it's too difficult to create
01:12:53.200 | lots of high quality data in this setting.
01:12:55.160 | But what you can do is you can have simulated environments.
01:12:58.960 | So this is a paper from DeepMind from 2017
01:13:01.680 | where they had this agent walk around in a maze
01:13:03.800 | and then it could follow natural language instructions.
01:13:06.160 | It could also generalize to like daxes and blickets
01:13:08.640 | and different sorts of groundings and assignments
01:13:11.280 | that you could do in that environment.
01:13:13.840 | So this is a super interesting direction,
01:13:15.560 | I think in the longterm,
01:13:16.560 | because this is how humans learn language, right?
01:13:18.640 | Like we walk around in the world,
01:13:20.000 | we interact with our environments.
01:13:21.520 | We have all of these different perceptual observations.
01:13:24.360 | We synthesize them in our brain.
01:13:26.200 | We manipulate objects.
01:13:27.720 | We change our own viewpoint.
01:13:29.200 | And that's how we learn everything we know about the world.
01:13:32.160 | And so our language is very intricately connected
01:13:35.400 | to that world and how we observe it.
01:13:37.440 | So I think that might make a comeback
01:13:41.200 | at some point in the future.
01:13:42.680 | You can also do other stuff.
01:13:45.600 | So especially with this kind of conditioning on text
01:13:48.560 | that we're seeing a lot of, right?
01:13:50.600 | So like, so, you know,
01:13:51.720 | DALL-E 2 and stable diffusion
01:13:53.080 | and all of these different things.
01:13:54.360 | And the original GAN we talked about at the beginning.
01:13:57.840 | You can do the same thing,
01:13:59.000 | but now you're generating 3D point clouds, right?
01:14:02.240 | So this is a 3D corgi generated from the prompt 'a corgi'.
01:14:06.000 | And so this prompt can probably become
01:14:08.040 | much more complex over time.
01:14:09.480 | And you can do like sort of AutoCAD design
01:14:11.920 | and just say like, give me a house
01:14:13.720 | and it's just gonna design the whole house for you.
01:14:16.600 | So you can just like tweak the prompt and things like that.
01:14:19.800 | Like that's all coming
01:14:21.400 | or even already here in many cases.
01:14:23.600 | So the final modality I just briefly wanted to talk about
01:14:27.960 | is olfactory embeddings.
01:14:30.200 | (audience laughing)
01:14:33.720 | And so olfaction means smell if you didn't know.
01:14:37.080 | And so it turns out,
01:14:39.720 | so my PhD thesis was about grounding semantics
01:14:44.640 | in different perceptual modalities.
01:14:47.080 | So a lot of my work started in vision
01:14:49.240 | and then it's like, okay,
01:14:50.080 | now audio is sort of the obvious next one, right?
01:14:52.320 | So you can learn the meaning of violin
01:14:54.120 | and then maybe you can learn that violin,
01:14:56.760 | like what a violin looks like and what it is
01:14:58.680 | and what it sounds like.
01:14:59.560 | And that's gonna give you a richer representation.
01:15:01.800 | But for a lot of these words,
01:15:03.640 | what's actually very primitive to their meaning
01:15:06.360 | is what they smell like,
01:15:07.600 | because in our brains,
01:15:09.280 | that's really one of the core areas
01:15:10.880 | and one of the oldest areas in your brain.
01:15:13.440 | So what you can try to do
01:15:15.640 | if you want to complete all of your perceptual modalities
01:15:19.320 | is you can try to build olfactory embeddings.
01:15:21.480 | So it was kind of a joke paper I did,
01:15:24.560 | but the funny thing is it actually worked.
01:15:28.040 | So there's a catalog,
01:15:32.640 | this Sigma Aldrich Fine Flavors and Fragrances catalog,
01:15:36.840 | where you can look up words like melon and pineapple
01:15:40.160 | and then it's gonna give you all of the chemical compounds
01:15:43.120 | that produce this smell or taste.
01:15:45.200 | And so if you do that,
01:15:47.880 | then you can count the occurrences
01:15:49.720 | and then you can sort of do SVD
01:15:51.440 | or something like that on it
01:15:52.880 | to get it to be a bit more of a real embedding model.
01:15:56.360 | So now you get smell embeddings, smell vectors,
01:15:59.640 | and then you can compute similarity judgments
01:16:02.880 | between these smells.
01:16:04.400 | So turns out apple smells like pear,
01:16:07.040 | and chocolate and cocoa and sweet and coffee
01:16:11.120 | are sort of related.
01:16:12.200 | So you get these clusters of different smells
01:16:14.560 | just based off of their chemical compounds.
01:16:17.200 | So this bag of chemical compounds model
01:16:20.000 | gives you a very rich representation.
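A minimal sketch of that bag-of-chemical-compounds recipe (the compound lists below are made up for illustration; the real ones come from the catalog): build a word-by-compound count matrix, reduce it with SVD, and compare the resulting vectors with cosine similarity.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy word -> chemical-compound lists (illustrative only, not real catalog data).
word_compounds = {
    "apple":     ["ethyl acetate", "hexanal", "butyl acetate"],
    "pear":      ["ethyl acetate", "hexyl acetate", "hexanal"],
    "chocolate": ["vanillin", "pyrazine", "furaneol"],
    "coffee":    ["pyrazine", "furaneol", "guaiacol"],
}

words = list(word_compounds)
compounds = sorted({c for cs in word_compounds.values() for c in cs})
counts = np.zeros((len(words), len(compounds)))
for i, w in enumerate(words):
    for c in word_compounds[w]:
        counts[i, compounds.index(c)] += 1

# "Smell embeddings": SVD over the word-by-compound count matrix.
emb = TruncatedSVD(n_components=2).fit_transform(counts)
sim = cosine_similarity(emb)
print(dict(zip(words, sim[words.index("apple")])))  # apple should land closest to pear
```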
01:16:22.360 | And so if you look at all of the words
01:16:25.000 | that are concrete enough to have smell,
01:16:27.800 | so like if you have a word like democracy in there,
01:16:30.360 | that doesn't really smell like anything.
01:16:33.160 | So you ignore democracy,
01:16:36.400 | you just focus on the things that smell
01:16:41.000 | or that could smell, I guess.
01:16:43.120 | And then, so the really interesting thing to me
01:16:45.840 | is that this is much more correlated
01:16:50.200 | with human similarity judgments
01:16:51.880 | than the linguistic vectors we had at the time.
01:16:54.280 | So for a word like apple,
01:16:57.440 | like you can just get a word vector
01:16:59.080 | like you've learned in your first lecture.
01:17:01.440 | And so you can do like skip gram and things like that.
01:17:04.800 | But that thing is not going to be as correlated
01:17:07.320 | with human similarity judgments
01:17:09.240 | as this bag of chemical compounds model.
01:17:12.320 | So that's pretty interesting.
01:17:14.400 | So even something like smell where maybe we think,
01:17:17.120 | this doesn't really matter.
01:17:18.440 | If you really want to understand
01:17:19.720 | how humans understand language,
01:17:21.680 | then maybe you want to include this
01:17:23.640 | in your foundation model too.
01:17:25.280 | But I would start with the other modalities.
01:17:30.000 | All right.
01:17:32.160 | Okay, yeah, sorry.
01:17:35.800 | So where to next?
01:17:37.000 | I'll just, I think I've already said most of this actually.
01:17:39.640 | So one foundation model is going to rule them all.
01:17:42.440 | And so, I mean, there will be many of these,
01:17:45.920 | but a lot of them are going to have
01:17:47.240 | very similar traits, I think.
01:17:49.760 | We're going to be looking at scaling laws
01:17:51.880 | and trying to understand really what is the relationship
01:17:54.240 | between the different modalities,
01:17:55.600 | which one do we want more of, that sort of stuff.
01:17:58.880 | We're gonna have retrieval augmentation.
01:18:00.400 | This thing is gonna be really huge.
01:18:02.040 | If you've heard of RAG,
01:18:03.880 | or if you haven't, you should look it up.
01:18:06.080 | So all of these parts of these models
01:18:07.760 | can also be multimodal.
01:18:09.480 | We need way better evaluation and better measurements.
01:18:12.160 | We already talked about that too.
01:18:14.080 | And that's all I have, thank you.
01:18:15.680 | (audience applauding)
01:18:17.400 | (upbeat music)
01:18:19.980 | (SILENCE)