Stanford CS224N NLP with Deep Learning | 2023 | Lecture 16 - Multimodal Deep Learning, Douwe Kiela

So today, I'm delighted to introduce our first invited speaker, 00:00:12.800 | 
Douwe, as well as being invited, and I'll tell you his background, 00:00:17.920 | 
has also been an adjunct professor in the Symbolic Systems program and 00:00:23.840 | 
has been involved with some students in that role as well. 00:00:26.680 | 
But in his invited role, he's originally from the Netherlands, 00:00:30.640 | 
where he even learned some logic, among other things, back in the old days. 00:00:36.680 | 
he's been a prominent deep learning researcher. 00:00:41.160 | 
For a number of years, he worked at Facebook, now Meta, in the FAIR unit, 00:00:47.600 | 
and was involved in various ideas, including retrieval augmented generation. 00:00:54.000 | 
After that, he then spent some time at Hugging Face. 00:00:57.760 | 
He's become interested in looking at multimodal models, 00:01:01.560 | 
which is what he's gonna be talking about today. 00:01:20.520 | 
I understand that you get points for being here, so 00:01:27.160 | 
So I'm gonna talk about multimodal deep learning. 00:01:29.680 | 
It's gonna have an NLP focus, of course, as for this course. 00:01:32.800 | 
But it's also because otherwise I would really be talking for 00:01:38.880 | 
So I'll try to really keep it focused on the things that I think will be 00:01:44.760 | 
And so the first thing you should understand is that this whole concept of 00:01:48.760 | 
multimodality is kind of ill-defined, actually. 00:01:52.360 | 
So if you go to the dictionary, you'll see that it means having or 00:01:56.160 | 
involving several modes or modalities or maxima. 00:02:00.720 | 
And so what mode here really means is, so it could be mode in the very generic sense, 00:02:06.080 | 
or it could be a very precise sense of the mode of a statistical distribution. 00:02:11.240 | 
And so depending on the paper you're reading, in some cases, people really mean the mode of a statistical distribution. 00:02:16.200 | 
In other cases, people really mean this sort of very vague concept of a modality, 00:02:20.920 | 
where it really means the type of information that you're getting. 00:02:23.840 | 
So an example of modality in that case is an image or speech signal or 00:02:28.480 | 
audio in general, or even olfaction, so smell or things like that. 00:02:33.160 | 
So in this lecture, we're just gonna focus mostly on text, 00:02:38.640 | 
because this is an NLP course, and we're gonna focus on images mostly as the other modality. 00:02:50.880 | 
And so there are a couple of really good reasons in general for this. 00:02:58.320 | 
So if you look at how we humans understand the world, 00:03:01.200 | 
how we make sense of what happens in the world, that is very multimodal, right? 00:03:06.720 | 
So we perceive the world not just using vision or just audio, but 00:03:11.040 | 
we synthesize information across all of these different modalities, and 00:03:14.600 | 
that's how we understand the world and each other. 00:03:17.800 | 
There's also a very practical argument for doing it. 00:03:20.680 | 
It's because the Internet is multimodal, right? 00:03:23.000 | 
So if you go to, I don't know, Facebook or something like that, 00:03:27.080 | 
it rarely happens that it's just text or just an image. 00:03:29.760 | 
There's usually a combination of multiple modalities. 00:03:33.160 | 
And then the final good reason that we're just starting to hit now, 00:03:37.920 | 
if you're really following where the field is going, 00:03:40.760 | 
we're kind of running out of text data for these large language models. 00:03:44.000 | 
So one interesting way to keep scaling on the data side 00:03:48.200 | 
is to make use of all of these other modalities, right? 00:03:51.000 | 
So if you can have your language model also watch all of the videos of cats in 00:03:55.080 | 
the world, it's gonna understand the concept of cat much better. 00:03:59.160 | 
And that's what we want to have in these models. 00:04:00.760 | 
We want them to understand the world in the same way that humans understand it. 00:04:05.040 | 
So right now, multimodality is really one of the main frontiers of this new 00:04:10.960 | 
foundation model drive that we're all in right now. 00:04:19.640 | 
But so what we'll see when this loads is this guy over here, 00:04:25.240 | 
and we'll have the same audio effect being played. 00:04:28.960 | 
So the audio is exactly the same, and this man is gonna say something like, 00:04:35.920 | 
And so you're hearing a bee there, I think, if you look at my mouth, 00:04:41.360 | 
But if you then change the video to where he says, bah, bah, bah, 00:04:46.200 | 
with exactly the same audio, you're going to hear the other version. 00:04:50.680 | 
So unfortunately, I can't really swap in the different audio here, so 00:04:55.640 | 
We might suddenly start hearing a guy saying, bah, bah, bah, and then. 00:05:02.480 | 
multimodal applications, so when we have multiple modalities, 00:05:11.640 | 
And as I said, most of the use cases we have on the Internet, 00:05:16.320 | 
And there are some really kind of obvious things we would be interested in 00:05:20.440 | 
if we have information from these different data sources, 00:05:28.000 | 
So maybe given a bit of text, we want to find the right image. 00:05:31.560 | 
Or maybe given some image, we want to find the right text for it so 00:05:36.440 | 
Obviously, we can also do this in a generative setting. 00:05:38.600 | 
So then we have image captioning, which you probably heard of. 00:05:43.640 | 
And the other way around, that's image synthesis, so Stable Diffusion. 00:05:46.200 | 
Everybody in the audience here has probably seen that. 00:05:48.880 | 
Then we can do visual question answering, where we have an image and text. 00:05:54.920 | 
We have multimodal classification, where we have image and text, and 00:05:57.560 | 
we need to have a label, for example, whether something is hate speech or not. 00:06:01.160 | 
And then in general, we want to be able to have a richer understanding of 00:06:05.920 | 
information, which means that we combine images and text and then use it for 00:06:09.360 | 
downstream applications that require better understanding or better generation. 00:06:20.000 | 
I predict that this paper is going to do really well in terms of citations, 00:06:25.880 | 
but I think a lot of people are not actually going to read it. 00:06:29.000 | 
And so, I mean, I've been in this field for quite a while now, and 00:06:32.080 | 
people have been saying this for a really long time. 00:06:35.160 | 
So for decades, people have been saying that multimodal is the next big thing. 00:06:44.080 | 
the outline for what we're gonna be talking about. 00:06:46.520 | 
So first, I'm gonna tell you a little bit about early models. 00:06:49.560 | 
Then we're gonna do a bit of a deep dive on some of the specifics. 00:06:53.200 | 
Then we're gonna go over a particular type of fusion, 00:06:58.640 | 
Then we're gonna go through a little bit of the history of 00:07:03.880 | 
Then we're gonna talk a little bit about evaluation, 00:07:07.600 | 
And then I'll make some predictions for the future, and hopefully maybe give you 00:07:10.680 | 
some cool research ideas or things to talk or think about. 00:07:17.080 | 
there's a lot of work that happened before deep learning. 00:07:19.920 | 
But I think if you want to start from the deep learning revolution and 00:07:23.680 | 
what was happening in images and text, then a good starting point is, 00:07:29.080 | 
for example, WSABIE or DeViSE, or Richard Socher, 00:07:33.880 | 
who you've probably heard of, has done some really cool early work in this. 00:07:40.600 | 
And the basic gist of this is that we have a vision model on the one hand, 00:07:48.760 | 
the first lecture of this course I think was about word embeddings, right? 00:07:51.600 | 
So that's just your basic word embedding model. 00:07:54.240 | 
And now we need to figure out how to align them in the same multimodal space. 00:07:58.560 | 
So the way you do that is you get some sort of similarity metric, right? 00:08:03.720 | 
if you're thinking about this from a support vector machine literature perspective. 00:08:07.400 | 
And now you need to figure out, with a max-margin or ranking loss, 00:08:13.160 | 
how you want to align these two points in your embedding space, right? 00:08:16.360 | 
So things that are similar, you want to bring them closer together. 00:08:18.800 | 
Things that are not, you want to bring them further apart. 00:08:21.720 | 
And if you do that in this multimodal embedding space, 00:08:24.960 | 
that means that you can do interesting cross-modal transfer, 00:08:28.840 | 
where you can take the word embedding for something like auto or like horse, 00:08:32.840 | 
and then you can find close images in the embedding space to that thing. 00:08:42.280 | 
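A minimal sketch of that kind of max-margin alignment loss, assuming we already have word embeddings and image features projected to the same dimensionality (all names and sizes here are illustrative):

```python
import torch
import torch.nn.functional as F

def ranking_loss(word_vecs, image_vecs, margin=0.2):
    """Hinge / max-margin loss that pulls matched word-image pairs together
    and pushes mismatched pairs apart (row i of each tensor is a true pair)."""
    # Cosine similarities between every word and every image: (N, N)
    sims = F.normalize(word_vecs, dim=-1) @ F.normalize(image_vecs, dim=-1).T
    pos = sims.diag().unsqueeze(1)                  # similarity of true pairs, (N, 1)
    # For each word, every other image in the batch acts as a negative
    loss = torch.clamp(margin - pos + sims, min=0)  # hinge: want pos > neg + margin
    loss = loss - torch.diag(loss.diag())           # don't penalize the true pair itself
    return loss.mean()

# toy usage: 8 word embeddings and 8 matching image features, both 128-d
loss = ranking_loss(torch.randn(8, 128), torch.randn(8, 128))
```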
And I think a lot of this stuff that I'm going to talk about in the early slides, 00:08:46.400 | 
you're going to see this thing come over and over again. 00:08:49.000 | 
You're going to see it get kind of reinvented with fancier models, 00:08:54.760 | 
So you can do cross-modal transfer where you have images and text, 00:08:59.080 | 
but you can also combine them together so that you get a multimodal word embedding. 00:09:03.640 | 
And so this just gives you a more accurate representation 00:09:10.600 | 
Because when we think about the word moon or cat or something, 00:09:14.200 | 
we can go to Wikipedia and read that a cat is a small carnivorous mammal 00:09:20.080 | 
Or we can just go and look at pictures of cats, 00:09:24.400 | 
And I would argue actually that for a lot of people, 00:09:26.720 | 
the picture of the cat is much closer to the meaning of the concept of cat. 00:09:31.240 | 
So some early work where people were trying to do this 00:09:36.360 | 
is from Bruni et al. where they did multimodal distributional semantics 00:09:39.960 | 
using this very elegant approach called bag of visual words. 00:09:44.160 | 
So just like who has heard of bag of visual words? 00:09:49.480 | 
Okay, so it's surprisingly simple, and so I kind of like it. 00:09:54.480 | 
So you take a picture of a moon in this case. 00:09:57.160 | 
I think you can see it in the back too, right? 00:09:58.720 | 
So we use an algorithm like SIFT to find interesting key points. 00:10:03.760 | 
So it's sort of where the difference between the pixels and the pixels next to it, 00:10:08.040 | 
where that difference is big, those are sort of the spots you want to be looking at. 00:10:13.080 | 
And for each of these key points, you get feature descriptors. 00:10:16.960 | 
So relatively small vectors, like 32-dimensional, 00:10:23.080 | 
And what you can do now with these feature descriptors is cluster them, 00:10:27.840 | 
and then you assign every one of these points to its nearest cluster, 00:10:31.400 | 
so you can count how often they occur, right? 00:10:33.360 | 
So in this picture of the moon, we have like actually the count is... 00:10:36.560 | 
Oh, yeah, so there are three like red dots, right? 00:10:41.520 | 
So what that gives you is an idea of the visual words, 00:10:46.000 | 
very similar to the original bag of words model 00:10:48.280 | 
that you hopefully have heard about maybe in the first lecture. 00:10:52.400 | 
So that's the visual equivalent of the textual thing. 00:10:56.040 | 
And so if you do this and you then concatenate it with your textual word vector, you get a representation 00:11:03.200 | 
that is much more representative of human meaning. 00:11:12.600 | 
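A toy sketch of that bag-of-visual-words pipeline, using OpenCV's SIFT and k-means clustering; the parameters are illustrative, and a real setup would build the codebook over a large image collection:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def bag_of_visual_words(image_paths, num_words=100):
    """Toy bag-of-visual-words: SIFT descriptors -> k-means 'visual words' -> count histograms."""
    sift = cv2.SIFT_create()
    all_desc, per_image_desc = [], []
    for path in image_paths:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(gray, None)   # (num_keypoints, 128) descriptors
        per_image_desc.append(desc)
        all_desc.append(desc)
    # Cluster all descriptors: each cluster centroid is one "visual word"
    codebook = KMeans(n_clusters=num_words, n_init=10).fit(np.vstack(all_desc))
    # Represent each image as a histogram of how often each visual word occurs
    histograms = []
    for desc in per_image_desc:
        words = codebook.predict(desc)
        histograms.append(np.bincount(words, minlength=num_words))
    return np.stack(histograms)   # (num_images, num_words), like a textual bag of words
```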
So after that, there were a couple of people, 00:11:24.760 | 
and then you can transfer the features from your ConvNet 00:11:45.320 | 
So when you see a word like cat in some context, 00:11:50.680 | 
then when you see cat, you also want to predict cat pictures. 00:11:56.400 | 
that this gives you much richer word representations. 00:12:02.880 | 
What we really care about is not words, but sentences. 00:12:08.360 | 
into sentence representations and how can we figure out 00:12:24.360 | 
but now we just have a sentence encoder, right? 00:12:42.120 | 
or some other kind of recurrent neural network, 00:12:44.240 | 
or in the case of this one, recursive neural network, 00:12:47.360 | 
and then we try to align the features together. 00:12:57.680 | 
because we showed here that a grounded sentence representation 00:13:09.840 | 
already gives you a really good sentence representation. 00:13:15.560 | 
you can sort of imagine what things look like, 00:13:17.720 | 
and that gives you a really good meaning representation, 00:13:19.800 | 
which you can then transfer to, I don't know, 00:13:24.360 | 
And then of course, once we have sentence encoders, 00:13:33.040 | 
and so when the sequence-to-sequence architecture came out, 00:13:36.440 | 
which you've probably also heard about in this course, 00:13:39.160 | 
what you can do instead of having a text encoder 00:13:44.800 | 
is you can plug in a ConvNet instead of an LSTM encoder, 00:13:53.840 | 
We used to have all of these fancy diagrams in our papers 00:13:56.400 | 
then where we explained LSTM and how that works. 00:13:59.160 | 
Probably people don't learn that anymore these days. 00:14:04.400 | 
They might make a comeback, I think, you know, 00:14:06.320 | 
at some point, transformers are gonna go away. 00:14:10.200 | 
And so one of the things that people figured out 00:14:19.280 | 
between your source language and your target language, 00:14:21.760 | 
and you can do the same thing actually with images, right? 00:14:24.040 | 
So if you want to align a word in your generated sequence 00:14:33.480 | 
and that approach, of course, is called attention, right? 00:14:35.720 | 
So, you know, you learned a lot about attention 00:14:45.720 | 
and really see that when it has to generate stop 00:14:50.120 | 
that it's really actually looking at the stop sign, right? 00:14:52.720 | 
So there's a really cool alignment going on there 00:15:14.160 | 
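In the spirit of those attention-based captioning models, here is a minimal sketch of one attention step over region features; the shapes and projection names are made up for illustration:

```python
import torch
import torch.nn.functional as F

def attend_to_image(decoder_state, region_feats, W_q, W_k):
    """One attention step in a captioning decoder: the current decoder state
    scores every image region, and the next caption word is generated from a
    weighted sum of the regions it 'looks at' (e.g. the stop sign)."""
    # decoder_state: (hidden,), region_feats: (num_regions, feat_dim)
    query = W_q(decoder_state)                 # (attn_dim,)
    keys = W_k(region_feats)                   # (num_regions, attn_dim)
    scores = keys @ query                      # (num_regions,)
    alphas = F.softmax(scores, dim=0)          # attention weights over regions
    context = alphas @ region_feats            # (feat_dim,) weighted image summary
    return context, alphas

# toy shapes: 196 regions (a 14x14 grid) of 512-d ConvNet features, 512-d decoder state
W_q, W_k = torch.nn.Linear(512, 256), torch.nn.Linear(512, 256)
context, alphas = attend_to_image(torch.randn(512), torch.randn(196, 512), W_q, W_k)
```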
is really that you have this generator and discriminator, 00:15:16.480 | 
and you want to have the generator generate images 00:15:19.400 | 
that the discriminator cannot distinguish from, 00:15:23.120 | 
so it cannot distinguish fake and real images, right? 00:15:26.480 | 
you can actually condition that on the piece of text, 00:15:38.280 | 
of stable diffusion were doing things like this, 00:15:40.560 | 
and it's all a natural progression to that model. 00:15:47.240 | 
Maybe, do people have any burning questions about this, 00:16:02.040 | 
So those are really the kind of core building blocks 00:16:11.640 | 
and sort of useful and doesn't look that difficult, 00:16:15.680 | 
like why aren't we all doing multimodal things? 00:16:30.600 | 
than vision or audio in many use cases, right? 00:16:36.960 | 
and basically learns to ignore the image completely, 00:16:41.240 | 
for visual question answering, we'll get to that. 00:16:43.520 | 
So visual question answering, you could do that 00:16:47.600 | 
Additional modalities can add a lot of noise. 00:16:51.760 | 
So it makes your machine learning problem more difficult. 00:16:58.400 | 
sometimes you have text, sometimes you have pictures, 00:17:01.280 | 
but you don't have a guarantee that you always have both. 00:17:10.080 | 
And also just in general, like how to design your model 00:17:18.040 | 
So in order to maybe drive that point home a little bit, 00:17:23.040 | 
so featurizing text, I guess we all know how to do that 00:17:28.200 | 
by now, especially sort of in the age of transformers 00:17:30.760 | 
and before in LSTMs, where we just have like, 00:17:34.920 | 
So batch size by sequence length by embedding size, right? 00:17:41.320 | 
and that's how you encode your textual information 00:17:48.720 | 
because you can just kind of look at the patches, 00:17:57.840 | 
And in many cases, you don't really want to be this uniform. 00:18:01.160 | 
You want to have something that actually looks 00:18:13.000 | 
that encodes the features for that particular sub image, 00:18:16.520 | 
like this guy's like skateboard or something. 00:18:18.480 | 
It has its own like vector representation, right? 00:18:31.400 | 
is a really good one if you haven't heard of that yet. 00:18:34.280 | 
So we're at YOLO v7 now, I think, or eight, I don't know. 00:18:37.840 | 
So there's a new one coming out every other year 00:18:42.840 | 
But the basic idea is that we get these bounding boxes 00:18:46.320 | 
for things in the images; there are actually segmentations too, 00:18:49.000 | 
but the bounding boxes are what people tend to use 00:18:52.560 | 
So this is labeled like backpack or something. 00:18:54.840 | 
And so you can do this as a pre-processing step 00:18:57.560 | 
on your image to get a much richer representation 00:19:14.680 | 
And so this probably feels like super obvious now, 00:19:18.680 | 
but in 2014, when people were starting to discover this, 00:19:22.040 | 
it was really very surprising that you could just use these pre-trained ConvNet features 00:19:27.080 | 
to really replace the entire computer vision pipeline. 00:19:36.600 | 
And then it was all thrown away and replaced by a ConvNet 00:19:41.920 | 
And so the cool thing you get there is that you can transfer 00:19:49.000 | 
and then use it to all kinds of very specialized things 00:19:52.320 | 
like spotting buildings in Paris, for example, 00:19:57.720 | 
And then of course, in the age of transformers, 00:20:10.720 | 
So vision transformers are what we would use these days 00:20:13.760 | 
to encode the images, where you have these flattened patches 00:20:20.520 | 
fed into a BERT architecture, maybe as you would know it 00:20:22.880 | 
from this course, and then you do classification, right? 00:20:27.200 | 
everything's standard, except now your input here 00:20:29.240 | 
is not words or tokens, it's patches of an image 00:20:34.240 | 
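A minimal sketch of that patchification step, assuming square images and non-overlapping patches in the standard ViT style (the linear projection and position embeddings are only mentioned in the comment):

```python
import torch

def patchify(images, patch_size=16):
    """Split images into non-overlapping patches and flatten each one,
    so the transformer sees a sequence of 'image tokens' instead of words."""
    B, C, H, W = images.shape                           # e.g. (batch, 3, 224, 224)
    p = patch_size
    patches = images.unfold(2, p, p).unfold(3, p, p)    # (B, C, H/p, W/p, p, p)
    patches = patches.permute(0, 2, 3, 1, 4, 5)         # (B, H/p, W/p, C, p, p)
    return patches.reshape(B, -1, C * p * p)            # (B, num_patches, patch_dim)

tokens = patchify(torch.randn(4, 3, 224, 224))          # (4, 196, 768)
# A linear projection of these flattened patches plus position embeddings is
# then fed to a standard transformer encoder, just like word embeddings.
```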
All right, so then we have a bunch of features 00:20:37.960 | 
and now how do we combine the information, right? 00:20:50.520 | 
So I don't think it's really useful to go over 00:20:56.720 | 
So obviously like inner product or similarity 00:20:59.240 | 
is what you would use if you want to do cross-modal things. 00:21:01.600 | 
So if you want to embed things in the same vector space, 00:21:04.840 | 
but you can do sort of fancier projections on top 00:21:08.440 | 
or different combinations that are kind of linear 00:21:13.520 | 
where you multiply the components element-wise 00:21:16.280 | 
or you do some sort of gating over the different features. 00:21:18.680 | 
You can do attention, you can do fancier bilinear things, 00:21:22.320 | 
you can do very fancy compact bilinear things. 00:21:59.960 | 
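A few of those simple fusion operators, as a hedged sketch; the names and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class SimpleFusion(nn.Module):
    """A few basic fusion operators: concatenate, multiply element-wise,
    or gate one modality with the other."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, text_vec, image_vec, mode="concat"):
        if mode == "concat":              # just stack the two feature vectors
            return torch.cat([text_vec, image_vec], dim=-1)
        if mode == "product":             # element-wise (Hadamard) interaction
            return text_vec * image_vec
        if mode == "gate":                # text decides how much of the image to let through
            return torch.sigmoid(self.gate(text_vec)) * image_vec
        raise ValueError(mode)

fusion = SimpleFusion(dim=512)
fused = fusion(torch.randn(8, 512), torch.randn(8, 512), mode="gate")
```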
You can first treat them separately and then combine them 00:22:05.280 | 
and then you only combine the final scores, right? 00:22:08.040 | 
And so that's the difference between what we would call early fusion 00:22:17.640 | 
and late fusion, where you really just combine the scores or the logits 00:22:22.640 | 
between the information from the different modalities. 00:22:25.520 | 
So you could do really fun stuff with multimodal fusion. 00:22:34.120 | 
where you have this sort of very special feature map, 00:22:44.840 | 
So this multiplicative gamma and an additive sort of bias vector, beta, 00:22:56.280 | 
So in this case, are there more cubes than yellow things? 00:22:58.640 | 
So we have some vector representation for that question, 00:23:12.800 | 
and we modulate the feature map of the other one and really try to have them learn together. 00:23:17.400 | 
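This is describing FiLM-style feature-wise modulation; a minimal sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """FiLM-style modulation: the question vector predicts a scale (gamma) and
    shift (beta) that are applied channel-wise to the image feature map."""
    def __init__(self, text_dim, num_channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(text_dim, 2 * num_channels)

    def forward(self, feature_map, text_vec):
        # feature_map: (B, C, H, W), text_vec: (B, text_dim)
        gamma, beta = self.to_gamma_beta(text_vec).chunk(2, dim=-1)   # (B, C) each
        gamma = gamma[:, :, None, None]        # broadcast over spatial positions
        beta = beta[:, :, None, None]
        return gamma * feature_map + beta      # language modulates vision

film = FiLMLayer(text_dim=256, num_channels=128)
out = film(torch.randn(4, 128, 14, 14), torch.randn(4, 256))
```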
All right, so let's talk about late fusion then. 00:23:22.240 | 
So late fusion is what we would now call contrastive models 00:23:26.520 | 
but the basic idea is that we have this similarity score. 00:23:31.000 | 
we process the modalities completely independently 00:23:33.360 | 
and then at the very end, we do some combination. 00:23:35.960 | 
And the most famous instance of that these days is CLIP. 00:23:47.840 | 
So it's again, exactly the same contrastive loss 00:23:51.280 | 
that we've seen in all these early approaches. 00:23:54.200 | 
It does a kind of negative sampling, but then in-batch. 00:24:06.960 | 
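A minimal sketch of that in-batch contrastive loss, where the diagonal of the similarity matrix holds the matched pairs; the temperature and dimensions are illustrative:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_feats, text_feats, temperature=0.07):
    """In-batch contrastive loss: row i of each tensor is a matched pair, and
    every other row in the batch serves as a negative for it."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.T / temperature     # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))                # correct match is the diagonal
    loss_i = F.cross_entropy(logits, targets)             # image -> which text?
    loss_t = F.cross_entropy(logits.T, targets)           # text  -> which image?
    return (loss_i + loss_t) / 2

loss = clip_style_loss(torch.randn(32, 512), torch.randn(32, 512))
```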
And I just wanna make sure that I rank this thing higher 00:24:12.080 | 
And I wanna make sure I rank this thing higher 00:24:18.280 | 
Really, nothing special about this architecture 00:24:22.480 | 
but what made this thing so cool was first of all, 00:24:25.840 | 
it was transformers and it was transformers all the way. 00:24:30.360 | 
and your image encoder would be a ViT image encoder. 00:24:35.880 | 
And it was trained on lots and lots of web data. 00:24:44.440 | 
And he created, I think 300 million image text pairs 00:24:47.720 | 
for this dataset, trained a bigger model on it 00:24:52.760 | 
And then we got this amazing model out of it. 00:25:00.200 | 
to the sort of texts that you would see on the internet. 00:25:07.640 | 
It's gonna say a photo of a cat doing something, something. 00:25:10.720 | 
So that means that you can do kind of zero-shot 00:25:14.800 | 
label predictions, where you have a prompt like "a photo of a ...", 00:25:17.760 | 
and then you need to figure out what the right label 00:25:21.560 | 
is for a given image using this kind of prompt. 00:25:28.520 | 
And so you can prompt vision and language models 00:25:30.880 | 
in very much the same way and do zero-shot generalization. 00:25:42.800 | 
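A hedged sketch of that zero-shot recipe; encode_image and encode_text are placeholders for whatever contrastive encoders you have (e.g. a CLIP-style model), and the prompt template is the "a photo of a ..." idea from the lecture:

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image, class_names, encode_image, encode_text):
    """Zero-shot label prediction: embed the image and one prompt per label,
    then pick the prompt whose embedding is most similar to the image."""
    prompts = [f"a photo of a {name}" for name in class_names]
    image_emb = F.normalize(encode_image(image), dim=-1)        # (1, D)
    text_embs = F.normalize(encode_text(prompts), dim=-1)       # (num_classes, D)
    sims = (image_emb @ text_embs.T).squeeze(0)                 # similarity to each prompt
    return class_names[sims.argmax().item()]

# e.g. zero_shot_classify(img, ["cat", "dog", "guacamole"], encode_image, encode_text)
```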
It's thorough, it's really worth a very close read, 00:25:59.960 | 
But what really made it special was that it generalized 00:26:08.080 | 
at some of these adversarial versions of ImageNet, 00:26:13.160 | 
So it's just a way better image encoder in general. 00:26:19.960 | 
there was this paper from Google called ALIGN, 00:26:30.840 | 
but then you just keep like throwing more data 00:26:32.840 | 
and more compute at it, and it often works much better. 00:26:38.120 | 
1.8 billion image-text pairs instead of 300 million 00:26:49.440 | 
is that there's this organization called LAION, 00:26:51.800 | 
where they've started this open source collective 00:27:11.280 | 
And so now there's a much bigger version of LAION 00:27:14.920 | 
that's even multilingual and it has 5 billion examples. 00:27:18.080 | 
So Stable Diffusion was trained on sort of the image, 00:27:23.760 | 
And that's one of the reasons that it's so awesome, 00:27:29.160 | 
And that really makes your system a lot better. 00:27:31.360 | 
So if you're looking for like the ultimate dataset 00:27:41.360 | 
All right, any questions about up until this point? 00:28:01.680 | 
of what I think a lot of people in the field right now, 00:28:04.520 | 
or if you're interested in getting in this field, 00:28:09.760 | 
like this is what you should really understand. 00:28:12.320 | 
And again, like the ideas sort of stack onto each other. 00:28:18.840 | 
to give you an idea sort of how the scientists 00:28:39.160 | 
Everybody should raise their hand now in this. 00:28:46.560 | 
I think everybody kind of gets how BERT works, right? 00:28:55.800 | 
because I want you to think about if you have a BERT model 00:29:05.080 | 
Right, so there are a bunch of like obvious things 00:29:09.400 | 
you could do given the kind of features I told you about 00:29:28.600 | 
and then just concatenate it to whatever encoder, 00:29:31.440 | 
like maybe an ANN or whatever you're training 00:29:41.160 | 
and the CLS token from BERT, concatenate them, 00:29:43.800 | 
and then classify for like a cat or something like that 00:29:47.400 | 
or whatever the thing is you're interested in, yeah. 00:29:51.240 | 
You could also like take the ConvNet features 00:29:59.520 | 
So I think a lot of people when BERT came out 00:30:04.080 | 
who were working in vision and language processing 00:30:13.440 | 
And so there were a lot of papers all coming out 00:30:25.120 | 
into their own thing because of Hugging Face Transformers 00:30:36.240 | 
and people would do object detection on this. 00:30:39.600 | 
So you get like a hat and a racket and a shirt 00:30:45.200 | 
and then plug them into your transformer model 00:30:49.080 | 
and then you try to like recover the features. 00:30:56.400 | 
And so this is what we call a single stream architecture 00:31:00.920 | 
where you have all of these kind of concatenating 00:31:05.000 | 
and then putting them through the same transformer. 00:31:08.960 | 
and that's something that this model called ViLBERT did 00:31:14.640 | 
So you essentially have these two parallel transformers 00:31:19.880 | 
you kind of give them cross attention, right? 00:31:25.760 | 
so you just make sure you have an attention map 00:31:28.880 | 
and then you just do your full normal transformer layer 00:31:42.120 | 
and here you do sort of some equivalent of that 00:31:44.480 | 
and then you also have your next sentence prediction 00:31:47.320 | 
which you probably remember from your BERT lecture. 00:31:52.560 | 
is this image aligned with this piece of text or not? 00:32:00.000 | 
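A simplified sketch of one such co-attention step between the two streams; a real ViLBERT-style layer also has residual connections, layer norm, and feed-forward sublayers, as noted in the comment:

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Two-stream co-attention: queries come from one stream, keys/values from
    the other, so each modality can 'look at' the other."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_tokens, image_regions):
        # text_tokens: (B, L, D), image_regions: (B, R, D)
        text_out, _ = self.txt_attends_img(text_tokens, image_regions, image_regions)
        image_out, _ = self.img_attends_txt(image_regions, text_tokens, text_tokens)
        # A full layer would add residuals, layer norm, and feed-forward sublayers here.
        return text_out, image_out

block = CoAttentionBlock()
t, v = block(torch.randn(2, 16, 768), torch.randn(2, 36, 768))
```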
There are like a hundred papers that came out 00:32:03.680 | 
So LXMERT had a different cross-modal output encoder, 00:32:08.800 | 
of encoding the positional information, right? 00:32:11.760 | 
I just have a bunch of bounding boxes that are featurized 00:32:14.160 | 
but I don't care about where they are in the image. 00:32:25.600 | 
And that's what you featurize into your network. 00:32:44.640 | 
and you just give those feature maps to BERT. 00:32:50.120 | 
between like your text segment embeddings, right? 00:32:56.480 | 
But so this actually works surprisingly well. 00:32:58.920 | 
You don't have to do any additional training. 00:33:11.640 | 
And now you have a very good multimodal classifier 00:33:17.520 | 
they're doing what they call multimodal pre-training 00:33:19.840 | 
where first you have a BERT model and a ResNet. 00:33:29.960 | 
before you fine tune it on the problem you care about. 00:33:32.920 | 
And what we showed here is that you don't really need that 00:33:39.760 | 
You can also go to the pixel level completely. 00:33:52.400 | 
but here they do do the multimodal pre-training step 00:33:55.200 | 
and show that I think for VQA it helps a little bit. 00:34:08.200 | 
where they added a bunch of different losses. 00:34:10.960 | 
We can really talk about this for a very long time. 00:34:18.520 | 
So this one, I think, is quite interesting, ViLT, 00:34:21.280 | 
because here this is really the first instance 00:34:23.400 | 
where we have completely moved away from ConvNet features. 00:34:27.040 | 
So we don't do any pre-processing on the image, 00:34:37.280 | 
So it's really integrated: we flatten those patches, 00:34:40.240 | 
we just pump them into the transformer straight away. 00:34:43.240 | 
So this really is like sort of BERT and ViT together 00:34:46.000 | 
in one model and this worked really very well. 00:34:54.960 | 
of all of these different models and what they do. 00:35:02.040 | 
So do you use BERT or something fancier or better, 00:35:08.320 | 
So in many cases you have these region features. 00:35:17.560 | 
So either a single or dual stream as we talked about, 00:35:24.240 | 
so masked language modeling, image-text matching. 00:35:28.240 | 
There's a bunch of like funkier ones you can do. 00:35:31.720 | 
So, and then finally you can do multimodal pre-training 00:35:35.320 | 
on all of these different datasets that have aligned data. 00:35:41.600 | 
okay, so what is really the interesting difference 00:35:59.360 | 
So basically they say, if you take all of these 00:36:02.720 | 
little model inventions and you train these different models 00:36:06.320 | 
on exactly the same data in exactly the same way, 00:36:09.520 | 
it turns out that they're all basically the same. 00:36:16.520 | 
on the part of the field because everybody's saying, 00:36:18.480 | 
well, my model is better, but it's actually just 00:36:22.640 | 
There's no real sort of model innovation going on 00:36:29.880 | 
or anything like that, but I think that's why 00:36:33.360 | 
this paper is really nice and really important 00:36:35.360 | 
is because it just shows us what really matters. 00:36:38.880 | 
So this is also work that I did myself, called FLAVA, 00:36:43.880 | 
with my team where we wanted to take these ideas 00:37:02.320 | 
we only care about problems that always involve 00:37:08.280 | 
the basic premise, I think, of foundation models in general 00:37:19.880 | 
different modalities and then do useful things 00:37:24.280 | 
So with FLAVA, that's exactly what we tried to build. 00:37:29.600 | 
that is good at vision and language and computer vision 00:37:47.400 | 
So it's good at the things that you would expect 00:37:49.400 | 
as a kind of basic image model to be good at. 00:38:01.920 | 
if you take all the datasets that were ever created 00:38:04.280 | 
that have image text pairs that are publicly available. 00:38:06.840 | 
So unfortunately, the CLIP data and the Google ALIGN data 00:38:10.200 | 
and all of these datasets, they haven't been open source. 00:38:19.520 | 
if you combine all of these image text pairs, 00:38:30.520 | 
that we know we care about in these different fields. 00:38:41.240 | 
I think if you work at a company like Facebook, 00:38:50.360 | 
That's gonna really make your life a lot easier. 00:38:53.160 | 
So the exact architecture here is that on the one hand, 00:38:57.760 | 
we have this image encoder where we take the image, 00:39:01.720 | 
and we just do what we call masked image modeling, 00:39:11.720 | 
we have the masked language modeling on the language. 00:39:22.960 | 
So we have a masked multimodal modeling loss term 00:39:28.960 | 
So this is like your BERT next sentence prediction thing. 00:39:31.520 | 
And then we also have a global contrastive loss, 00:39:42.080 | 
I think, to combine a lot of this information. 00:39:54.800 | 
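The overall training objective is then just a weighted combination of these loss terms. A minimal sketch, with made-up weights and with each per-task loss assumed to be computed elsewhere (this is not FLAVA's exact recipe):

```python
import torch

def combined_pretraining_loss(loss_mim, loss_mlm, loss_mmm, loss_itm, loss_contrastive,
                              weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the pretraining objectives: masked image modeling,
    masked language modeling, masked multimodal modeling, image-text matching,
    and the global contrastive loss. Weights here are purely illustrative."""
    terms = [loss_mim, loss_mlm, loss_mmm, loss_itm, loss_contrastive]
    return sum(w * t for w, t in zip(weights, terms))

# toy usage with dummy scalar losses standing in for the real ones
total = combined_pretraining_loss(*[torch.rand(()) for _ in range(5)])
```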
we were pretty thorough generating the table here. 00:40:01.640 | 
if you compare FLAVA to all kinds of different ablations 00:40:11.880 | 
of where we're probably gonna go with the field 00:40:20.960 | 
is that everybody cares about generative models. 00:40:23.680 | 
So language models and image generative models, 00:40:27.760 | 
there's just a trend where we want to be generative, 00:40:31.680 | 
discriminative stuff to the more interesting, 00:40:34.880 | 
more richer representations maybe that you get 00:40:41.240 | 
So this SimVLM paper was one of the first ones 00:40:46.160 | 
that was trying to generate or kind of complete captions, 00:40:49.480 | 
which they showed gives you a lot richer representations. 00:40:53.640 | 
I think this is actually the current state of the art now, 00:41:07.040 | 
I think that's also what they were trying to go for, 00:41:14.280 | 
And I think, so it took us a while as a field 00:41:16.560 | 
to really figure out how to do this the right way. 00:41:25.440 | 
And so one of the interesting things you can do 00:41:29.000 | 
with language models is just keep them frozen 00:41:32.080 | 
and then learn how to project into the language models. 00:41:40.960 | 
and we learned to project into the BERT token space. 00:41:56.560 | 
and then you learn to project into the token space 00:42:01.640 | 
And then you can do lots of fun stuff, it turns out. 00:42:11.120 | 
where you can just give it some kind of in-context examples 00:42:14.040 | 
and it's gonna figure out binding kind of on the fly. 00:42:18.480 | 
So it says like, this is a dax and this is a blicket. 00:42:22.760 | 
And then it gives you the answer that it's a dax. 00:42:29.400 | 
which is really kind of solving the grounding problem 00:42:32.480 | 
that a lot of this multimodal stuff started with. 00:42:37.200 | 
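A minimal sketch of that "project into a frozen language model" recipe: only a small projection from image features into the language model's token embedding space is trained, and the frozen model just continues the resulting sequence. All dimensions and names here are illustrative, not any particular paper's exact setup:

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    """Map frozen image features into the (frozen) language model's token
    embedding space, producing a few 'visual tokens' that get prepended to
    the text embeddings. Only this projection is trained."""
    def __init__(self, image_dim=1024, lm_dim=4096, num_prefix_tokens=4):
        super().__init__()
        self.proj = nn.Linear(image_dim, lm_dim * num_prefix_tokens)
        self.num_prefix_tokens = num_prefix_tokens
        self.lm_dim = lm_dim

    def forward(self, image_feats, text_token_embs):
        # image_feats: (B, image_dim), text_token_embs: (B, L, lm_dim)
        prefix = self.proj(image_feats).view(-1, self.num_prefix_tokens, self.lm_dim)
        # Concatenate visual tokens in front of the text; the frozen LM then
        # just sees a slightly longer "sentence" and continues it as usual.
        return torch.cat([prefix, text_token_embs], dim=1)

prefix = VisualPrefix()
lm_input = prefix(torch.randn(2, 1024), torch.randn(2, 10, 4096))  # (2, 14, 4096)
```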
And then probably one of the coolest papers right now 00:42:41.960 | 
or models right now that you might've heard of 00:42:43.880 | 
if you follow the field is Flamingo out of DeepMind, 00:42:50.200 | 
And so this is really a multimodal language model. 00:43:03.160 | 
So what this gets you is just a much more powerful model 00:43:10.720 | 
So it's really like stepwise, you can see it, right? 00:43:32.200 | 
So we wanna make sure that we can compress it 00:43:45.880 | 
this is not my code, this comes from the actual paper. 00:43:48.120 | 
So they just have the diagram together with the code 00:43:50.560 | 
so that you can really understand what it's doing, 00:43:56.400 | 
And so once you have your perceiver resampling step, 00:44:00.840 | 
what you then do is you do a gated cross attention. 00:44:08.440 | 
you do that before your frozen language model layer. 00:44:12.200 | 
So you really just have a frozen Chinchilla language model 00:44:15.480 | 
and you learn to kind of modulate the information 00:44:20.040 | 
You propagate the gradients all the way back, 00:44:24.040 | 
So you're really kind of trying to figure out like, 00:44:28.040 | 
so that my language model can do the most with it, right? 00:44:32.920 | 
So you'll notice that now we do it before the layer, right? 00:44:39.640 | 
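A simplified sketch of that gated cross-attention idea (not the paper's exact code): the tanh gate starts at zero, so at initialization the frozen language model behaves exactly as before, and it gradually learns how much visual information to mix in:

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Text attends over the resampled visual tokens, and a tanh gate
    initialized at zero controls how much of that signal is added back."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0: no visual signal at init

    def forward(self, text_hidden, visual_tokens):
        # text_hidden: (B, L, D) from the frozen LM; visual_tokens: (B, V, D) from the resampler
        attended, _ = self.attn(text_hidden, visual_tokens, visual_tokens)
        return text_hidden + torch.tanh(self.gate) * attended   # gated residual

layer = GatedCrossAttention(dim=512, heads=8)
out = layer(torch.randn(2, 12, 512), torch.randn(2, 64, 512))
```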
So Karpathy, I think more than 10 years ago, had this image 00:44:45.920 | 
with Barack Obama kind of setting his foot here on the scale 00:44:58.920 | 
I think unless it really understands the scene. 00:45:04.680 | 
this would be a really good visual Turing test. 00:45:11.000 | 
And so obviously it's been a bit of a challenge 00:45:14.320 | 
then to get something that actually works on this. 00:45:16.480 | 
And so Flamingo, as it turns out, kind of gets the joke. 00:45:20.040 | 
But yeah, so it's a bit unclear if it really gets the joke, 00:45:37.600 | 
but you can really take this almost to the full extreme 00:45:42.120 | 
And you just want to learn this kind of mapping 00:45:44.520 | 
between your image encoder and your language model, 00:45:47.320 | 
or your image encoder and your encoder decoder architecture. 00:45:57.000 | 
where they experiment with like OPT for the language model 00:46:00.040 | 
and Flan-T5 for the encoder-decoder architecture. 00:46:04.960 | 
It gives you really complex captions and things like that 00:46:09.040 | 
without any real direct supervision on the captions itself, 00:46:23.320 | 
from captioning to reasoning, to visual question answering, 00:46:29.960 | 
So you can have a long conversation with this system. 00:46:32.720 | 
This really is kind of the future where we're going, right? 00:46:36.600 | 
but it's also going to be able to see the world in a way. 00:46:43.440 | 
so you've probably heard of like chain of thought prompting 00:46:46.080 | 
and things like that, where you ask the language model, 00:46:50.520 | 
and you can tell a vision and language model, 00:46:54.080 | 
generate a rationale for why something might be the case. 00:47:03.160 | 
And then after that, you ask it to answer the question. 00:47:05.880 | 
And it turns out that if you do that sort of multimodal 00:47:08.760 | 
chain of thought prompting, then the system gets much better. 00:47:11.960 | 
And so, this was like the new state of the art 00:47:17.800 | 
just because it learns to unpack the information, right? 00:47:23.560 | 
just starting to figure out what the potential is of this. 00:47:26.760 | 
And I think this paper is where they also show 00:47:33.480 | 
And they show very nice results on Raven matrices 00:47:37.040 | 
and like very complicated kind of IQ tests sort of things 00:47:40.960 | 
that humans are supposed to be really good at, 00:47:51.640 | 
And we started off from a very simple BERT model 00:48:33.720 | 
- Yeah, yeah, so I think the history of computer vision 00:48:41.320 | 
where we thought we needed all of this structure 00:48:44.720 | 
And it turns out you can just throw it all away 00:48:46.680 | 
and just have a big transformer over the patches. 00:48:53.040 | 
- Seeing as it's 2:31 and one minute, save time. 00:49:03.800 | 
- Yeah, yeah, sorry, I should have explained that better. 00:49:06.760 | 
So, it just means that we are not updating the weights. 00:49:11.280 | 
So, like if we go to this here, I think is a nice example. 00:49:19.520 | 
So, that just means that when we do a forward pass, 00:49:22.240 | 
we go all the way to whatever we want to predict, 00:49:24.560 | 
we get some gradients, we take them all the way down, 00:49:30.760 | 
So, here the gradients actually do get updated, 00:49:36.280 | 
is because otherwise you're gonna drift way too far. 00:49:41.640 | 
all of the cool stuff your language model has learned, 00:49:44.000 | 
because you're just gonna focus on the small data set 00:49:48.160 | 
So, you wanna preserve the abilities of the language model, 00:49:50.800 | 
but you want it to become good at the thing you care about. 00:50:03.160 | 
is there a benefit to doing like the earlier middle fusion 00:50:08.000 | 
- Yeah, so, I mean, we're gonna talk about evaluation next, 00:50:11.800 | 
but so it really depends on the task that you care about. 00:50:15.400 | 
And so, I would say the earlier is always the better 00:50:20.360 | 
And so, like CLIP is very efficient to train, 00:50:23.920 | 
it's very late fusion, right, at the very end. 00:50:25.800 | 
So, there's no interaction between the different modalities. 00:50:28.800 | 
And so, that's really good if you want to be very efficient 00:50:33.240 | 
and if you wanna be like, for training, it's much nicer. 00:50:37.480 | 
But if you want to have a richer understanding 00:50:47.520 | 
- It seems that images are just a lot more data than text. 00:50:52.520 | 
So, how much more difficult are these to train 00:50:55.600 | 
and how much bigger does like the image processing 00:51:03.520 | 
- Yeah, so, images are more complex in a way, 00:51:08.520 | 
but they're also kind of higher bandwidth representations. 00:51:14.280 | 
just pixels that our brains just abstract away, right? 00:51:17.640 | 
It's really about the scene that you're seeing 00:51:26.680 | 
that language is just a kind of low bandwidth, 00:51:33.560 | 
which is much richer and much higher bandwidth. 00:51:35.800 | 
And like he thinks probably visual, I'm not so sure. 00:51:39.800 | 
But so, yeah, I don't think that there's necessarily 00:51:43.800 | 
a difference between kind of the scaling laws 00:51:47.440 | 
or at least we still have to figure that out. 00:51:50.800 | 
We'll kind of talk about that towards the end as well. 00:51:53.400 | 
- Do these modern models also have certain social 00:51:59.440 | 
and cultural biases, just like natural language models? 00:52:06.800 | 
So, yeah, some people are actually working on this 00:52:24.560 | 
then the model will think that he's playing ping pong 00:52:39.000 | 
that you should be working on if you're a student 00:52:42.720 | 
how do we get these systems to be much better 00:52:54.360 | 
So, we wanna understand the content of a video. 00:53:02.400 | 
you might see like what improvements we can make 00:53:08.840 | 
- Yeah, so, you're asking about the attention mask 00:53:13.520 | 
Yeah, so you can use the same idea for videos 00:53:21.920 | 
you can really track objects kind of real time 00:53:39.160 | 
because you can very often just sub sample images 00:53:43.280 | 
rather than having to deal with the complex video. 00:53:57.560 | 
let's say you only provide a single source of media 00:54:10.520 | 
is that they're really just built for multi-modal stuff. 00:54:13.640 | 
And so, what if I don't have an image, right? 00:54:26.040 | 
so the supervised multi-modal bi-transformer, 00:54:30.440 | 
how robust is this model to missing images or missing text? 00:54:51.320 | 
And so, I think if I'm gonna tell you about multi-modality, 00:54:55.480 | 
I also have to tell you how you're gonna check 00:55:16.280 | 
because you only have limited GPUs anyway, right? 00:55:33.360 | 
So, ImageNet really changed the history of deep learning, 00:55:45.600 | 
where they have just a bunch of main multi-modal tasks. 00:55:57.120 | 
the bounding boxes, the labels of the bounding boxes, 00:56:00.240 | 
they come at like sort of a different pixel granularities. 00:56:07.360 | 
annotated in terms of like the categories that it has, 00:56:10.040 | 
and then you have five captions for each of these images. 00:56:20.440 | 
because you had your picture and you had your caption, 00:56:24.480 | 
okay, how do I give the right caption for this image? 00:56:31.080 | 
the right image or the image for the piece of text? 00:56:34.800 | 
So, there's a bunch of very impactful datasets 00:56:37.680 | 
that do this stuff that we already talked about, LAION, 00:56:44.520 | 
as the canonical instance of this dataset category. 00:56:49.200 | 
And then, the other thing that people really care about 00:56:56.320 | 
And so, there really are a bunch of academic groups 00:57:03.760 | 
that they didn't really care about anything else, 00:57:07.520 | 
that are really optimized just for multimodal 00:57:13.200 | 
in the citation counts as of last night, 3 a.m., 00:57:24.520 | 
And so, what you do here is you just have an image 00:57:30.120 | 
so annotators, right, they ask these simple questions, 00:57:34.400 | 
and now we want to be able to answer these questions 00:57:39.760 | 
one of the kind of embarrassing backstories of this dataset 00:57:45.560 | 
is that the images were actually found to not really matter at all. 00:58:01.240 | 
the right answer for a "how much" or "how many" question was two. 00:58:04.520 | 
So if you just predicted two for every "how much" or "how many" question, 00:58:08.600 | 
you got like 70% accuracy on the counting category. 00:58:12.160 | 
So careful dataset or evaluation benchmark design 00:58:18.040 | 
and you really need to think about what you're doing. 00:58:20.000 | 
You can't just like set some data aside and evaluate it on, 00:58:23.120 | 
you have to really think about what you're doing. 00:58:30.400 | 
a better designed version of this dataset maybe. 00:58:34.960 | 
There are also kind of very targeted datasets 00:58:40.680 | 
that really try to measure one particular thing. 00:58:42.920 | 
And I think one of the things we really want to get at 00:58:46.080 | 
with these models is what we would call compositionality. 00:58:49.320 | 
So we want to be able to really take the parts 00:58:59.320 | 
that was designed really to measure the compositionality 00:59:03.040 | 
both on the language side and on the vision side. 00:59:06.840 | 
between all of these different objects in the images. 00:59:16.480 | 
But a lot of these datasets really had big problems. 00:59:21.320 | 
So one of the problems is, they were too easy. 00:59:31.960 | 
and that's probably gonna make some people's lives better. 00:59:40.640 | 
So obviously, these memes are not actually the real ones 00:59:51.080 | 
which are in the dataset, but that would be less fun. 00:59:54.080 | 
So these are mean meme examples to kind of demonstrate 01:00:02.200 | 
And so one of the problems we had, as I said, 01:00:28.400 | 
But so it turns out that if you just swap out the background 01:00:54.240 | 
suddenly it's like a really nice thing to say, right? 01:01:00.520 | 
if you want to classify this correctly for the meanness, 01:01:04.280 | 
then you have to really understand multimodal reasoning. 01:01:12.480 | 
And so it was really constructed by design to do that. 01:01:19.400 | 
is we use some really highly trained annotators. 01:01:26.240 | 
is that nobody really knows who owns the meme, 01:01:38.200 | 
and they were very afraid of copyright things. 01:01:49.640 | 
so we could show them kind of the actual examples. 01:01:54.960 | 
that were kind of corresponding to the original source image 01:02:00.520 | 
but now with an image that we could buy from Getty. 01:02:07.440 | 
so that we could then release the dataset to the public 01:02:11.000 | 
so that people could do actually research on this 01:02:21.240 | 
sorry, it's a startup world with co-founders. 01:02:39.960 | 
And so this led to a really nice dataset, I think, 01:02:46.120 | 
that I think a lot of people in the field had, 01:02:48.360 | 
which is that multimodal pre-training doesn't really work. 01:02:53.560 | 
So multimodal pre-training doesn't really work. 01:02:58.000 | 
And so all of this stuff that people have been doing 01:03:02.880 | 
actually turned out maybe to not really be that useful 01:03:06.080 | 
anyway and so maybe it got you like one point extra, right? 01:03:09.400 | 
From VisualBERT to like a different VisualBERT, 01:03:16.680 | 
So that means like we still have to figure this stuff out, 01:03:29.800 | 
that does something new, like we're not there yet. 01:03:33.080 | 
And I think that's encouraging, especially for you. 01:03:45.680 | 
to try to see what people could come up with. 01:03:47.920 | 
And so there was a lot of nice work coming out of that 01:03:52.480 | 
and we really kind of managed to crank the numbers up 01:03:56.240 | 
But the solutions were slightly disappointing. 01:04:12.760 | 
where there wasn't really the fundamental breakthrough 01:04:25.840 | 
So the theme sort of of this section is like, 01:04:28.000 | 
if you make a dataset, think about it very carefully 01:04:31.600 | 
because you can really be very creative with this 01:04:33.560 | 
and really measure the things you're trying to get at. 01:04:44.240 | 
and it's way better than things that were previously there, 01:04:47.200 | 
but does it understand compositional relationships 01:04:50.280 | 
in the same way that humans would understand it 01:04:52.240 | 
or is it sort of just fitting onto the data distribution 01:04:55.440 | 
and it can be very good at the head of the distribution, 01:05:00.360 | 
And you can probably already guess where this is going, 01:05:07.560 | 
you would have some plants surrounding a light bulb 01:05:10.960 | 
or you would have a light bulb surrounding some plants. 01:05:13.880 | 
So notice that the words here are exactly the same words, 01:05:19.480 | 
So, and so the visual depiction of these words 01:05:25.960 | 
your contrastive model is actually good at understanding 01:05:29.000 | 
the visual semantic or the visual linguistic compositionality 01:05:43.400 | 
and it just kind of is biased toward what it sees often, 01:05:56.320 | 
"Order Word Matters Pre-training for Little." 01:06:15.600 | 
And so that's probably not what we want to have. 01:06:37.400 | 
Like these are very different pictures, right? 01:06:54.960 | 
State-of-the-art models often perform below random chance. 01:07:03.240 | 
we still have a lot of work to do, which is good. 01:07:38.960 | 
If you don't add that, then it breaks down completely. 01:07:45.120 | 
or sort of tuning on the test set, but okay, you know. 01:07:51.080 | 
So it definitely is better than I think a lot of people 01:07:54.760 | 
would have expected even a couple of years ago. 01:07:57.120 | 
But it's not perfect because people on the internet 01:08:02.120 | 
like to take more pictures of spoons than forks. 01:08:04.800 | 
So if you say there are fewer spoons than forks, 01:08:17.400 | 
You know, and so maybe it's like the matrix or something, 01:08:25.040 | 
So again, what you can see here is that these models 01:08:37.160 | 
like it still can't count fingers and things like that. 01:08:39.840 | 
So again, there's still a lot of cool work to be done. 01:08:57.320 | 
because so we've really just been focused on images 01:09:04.560 | 
And so that makes it sort of an obvious thing to focus on. 01:09:10.760 | 
like vision is a very dominant modality, right? 01:09:13.080 | 
So how we understand the world is very vision driven, 01:09:18.840 | 
So there's all these other interesting problems 01:09:22.920 | 
And so the most obvious one is just speech or audio, right? 01:09:29.120 | 
and really we could do another lecture just like this, 01:09:34.200 | 
And there's lots of interesting stuff to talk about. 01:09:41.000 | 
of how amazing Alec Radford is at creating datasets. 01:09:45.000 | 
So there's this Whisper model that came out of OpenAI 01:09:57.400 | 
and they trained this very fancy thing on there, 01:10:06.600 | 
and then you feed that into a big transformer. 01:10:08.640 | 
So this is sort of your encoder self-attention here, right? 01:10:17.280 | 
So this is encoder decoder, basic transformer model, 01:10:23.280 | 
one-dimensional convolutions over the log-mel spectrogram. 01:10:26.280 | 
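A hedged sketch of that kind of audio frontend, using torchaudio for the log-mel spectrogram; the filter sizes and strides here are illustrative, not Whisper's actual configuration:

```python
import torch
import torch.nn as nn
import torchaudio

# Waveform -> log-mel spectrogram -> 1-D convolutions, ready for a transformer encoder.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=400,
                                           hop_length=160, n_mels=80)
frontend = nn.Sequential(
    nn.Conv1d(80, 512, kernel_size=3, padding=1), nn.GELU(),
    nn.Conv1d(512, 512, kernel_size=3, stride=2, padding=1), nn.GELU(),  # downsample in time
)

waveform = torch.randn(1, 16000 * 5)                  # 5 seconds of fake 16 kHz audio
log_mel = torch.log(mel(waveform) + 1e-6)             # (1, 80, num_frames)
audio_tokens = frontend(log_mel).transpose(1, 2)      # (1, num_frames/2, 512) for the transformer
```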
And so there's lots of papers that do very similar things. 01:10:32.440 | 
that tried to turn the wave signal into vectors, 01:10:34.720 | 
or you can discretize it in lots of different ways. 01:10:40.080 | 
Then I think one of the funny observations actually 01:10:42.760 | 
is that you can just reduce audio to vision anyway, right? 01:11:04.280 | 
So what does the spectrum of the audio file look like? 01:11:08.000 | 
Feed that to a regular ConvNet, like an AlexNet even, 01:11:11.480 | 
and then that gives you amazing auditory features. 01:11:15.520 | 
between violins or guitars and things like that. 01:11:17.960 | 
So maybe you can just reduce all of this to vision. 01:11:23.480 | 
can we also reduce language to vision or vision to language? 01:11:27.280 | 
So that's sort of what people are thinking about. 01:11:35.320 | 
So a lot of these ideas also extend pretty directly 01:11:38.720 | 
to video, but now you just have more data, right? 01:11:46.880 | 
Probably a lot of the images are pretty useless 01:11:50.120 | 
for what you're trying to do with this video model, right? 01:11:53.840 | 
It doesn't really add all that much information. 01:12:08.360 | 
and language transformer encoder thing on top of that. 01:12:16.400 | 
And so there's this MERLOT, which is a nice architecture, 01:12:23.400 | 
kind of a silly name, where they also added audio 01:12:30.120 | 
And so we're going towards this foundation model 01:12:33.400 | 
that can consume all of these different modalities 01:12:37.160 | 
And that's really like a clear trend in the field. 01:12:55.160 | 
But what you can do is you can have simulated environments. 01:13:01.680 | 
where they had this agent walk around in a maze 01:13:03.800 | 
and then it could have natural language instructions. 01:13:06.160 | 
It could also generalize to like daxes and blickets 01:13:08.640 | 
and different sort of groundings and assignments 01:13:16.560 | 
because this is how humans learn language, right? 01:13:21.520 | 
We have all of these different perceptual observations. 01:13:29.200 | 
And that's how we learn everything we know about the world. 01:13:32.160 | 
And so our language is very intricately connected 01:13:45.600 | 
So especially with this kind of conditioning on text 01:13:54.360 | 
And the original GAN we talked about at the beginning. 01:13:59.000 | 
but now you're generating 3D point clouds, right? 01:14:13.720 | 
and it's just gonna design the whole house for you. 01:14:16.600 | 
So you can just like tweak the prompt and things like that. 01:14:23.600 | 
So the final modality I just briefly wanted to talk about 01:14:33.720 | 
And so olfaction means smell if you didn't know. 01:14:39.720 | 
so my PhD thesis was about grounding semantics 01:14:50.080 | 
now audio is sort of the obvious next one, right? 01:14:59.560 | 
And that's gonna give you a richer representation. 01:15:03.640 | 
what's actually very primitive to their meaning 01:15:15.640 | 
if you want to complete all of your perceptual modalities 01:15:19.320 | 
is you can try to build olfactory embeddings. 01:15:32.640 | 
this Sigma Aldrich Fine Flavors and Fragrances catalog, 01:15:36.840 | 
where you can look up words like melon and pineapple 01:15:40.160 | 
and then it's gonna give you all of the chemical compounds 01:15:52.880 | 
to get it to be a bit more of a real embedding model. 01:15:56.360 | 
So now you get smell embeddings, smell vectors, 01:15:59.640 | 
and then you can compute similarity judgments 01:16:12.200 | 
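A toy sketch of how such olfactory embeddings can be built: each word becomes a bag-of-compounds count vector, and similarity is just cosine similarity between those vectors. The compound lists below are made-up placeholders, not entries from the actual catalog:

```python
import numpy as np

# Hypothetical toy data: which aroma compounds a catalog associates with each word.
word_to_compounds = {
    "melon":     ["ethyl butyrate", "cis-6-nonenal"],
    "pineapple": ["ethyl butyrate", "allyl hexanoate"],
    "democracy": [],   # abstract words get little or no olfactory grounding
}

compounds = sorted({c for cs in word_to_compounds.values() for c in cs})
index = {c: i for i, c in enumerate(compounds)}

def smell_vector(word):
    """Bag-of-compounds vector: count each chemical compound linked to the word."""
    vec = np.zeros(len(compounds))
    for c in word_to_compounds[word]:
        vec[index[c]] += 1
    return vec

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

print(cosine(smell_vector("melon"), smell_vector("pineapple")))  # share a compound -> similar
```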
So you get these clusters of different smells 01:16:27.800 | 
so like if you have a word like democracy in there, 01:16:43.120 | 
And then, so the really interesting thing to me 01:16:51.880 | 
than the linguistic vectors we had at the time. 01:17:01.440 | 
And so you can do like skip gram and things like that. 01:17:04.800 | 
But that thing is not going to be as correlated 01:17:14.400 | 
So even something like smell where maybe we think, 01:17:37.000 | 
I'll just, I think I've already said most of this actually. 01:17:39.640 | 
So one foundation model is going to rule them all. 01:17:51.880 | 
and trying to understand really what is the relationship 01:17:55.600 | 
which one do we want more of, that sort of stuff. 01:18:09.480 | 
We need way better evaluation and better measurements.