
Stanford CS224N NLP with Deep Learning | 2023 | Lecture 16 - Multimodal Deep Learning, Douwe Kiela


Transcript

So today, I'm delighted to introduce our first invited speaker, who's Douwe Kiela. As well as being invited, and I'll tell you his background, Douwe has also been an adjunct professor in the Symbolic Systems program and has been involved with some students in that role as well.

But in his invited role, he's originally from the Netherlands, where he even learned some logic, among other things, back in the old days. But in more recent times, he's been a prominent deep learning researcher. For a number of years, he worked at Facebook, now Meta, in the FAIR unit, and was involved in various ideas, including retrieval augmented generation.

After that, he then spent some time at Hugging Face. He's become interested in looking at multimodal models, which is what he's gonna be talking about today. And welcome, Douwe, it's great to have you. >> Thank you very much. All right, that works, right? Yes, yeah, thanks everyone for coming.

I understand that you get points for being here, so you're not really here for me. But thanks for coming anyway. So I'm gonna talk about multimodal deep learning. It's gonna have an NLP focus, of course, as for this course. But it's also because otherwise I would really be talking for many more hours than I have time for here.

So I'll try to really keep it focused on the things that I think will be most useful for you to learn. And so the first thing you should understand is that this whole concept of multimodality is kind of ill-defined, actually. So if you go to the dictionary, you'll see that it means having or involving several modes or modalities or maxima.

And so what mode here really means is, so it could be mode in the very generic sense, or it could be a very precise sense of the mode of a statistical distribution. And so depending on the paper you're reading, in some cases, people really mean the statistical sense. In other cases, people really mean this sort of very vague concept of a modality, where it really means the type of information that you're getting.

So an example of modality in that case is an image or speech signal or audio in general, or even olfaction, so smell or things like that. So in this lecture, we're just gonna focus mostly on text, because this is an NLP course, and we're gonna focus on images mostly as the other modality to keep it simple.

All right, so why does it matter? Why do we care about multimodality? And so there are a couple of really good reasons in general for this. The first one is about faithfulness. So if you look at how we humans understand the world, how we make sense of what happens in the world, that is very multimodal, right?

So we perceive the world not just using vision or just audio, but we synthesize information across all of these different modalities, and that's how we understand the world and each other. There's also a very practical argument for doing it. It's because the Internet is multimodal, right? So if you go to, I don't know, Facebook or something like that, it rarely happens that it's just text or just an image.

There's usually a combination of multiple modalities. And then the final good reason that we're just starting to hit now, if you're really following where the field is going, we're kind of running out of text data for these large language models. So one interesting way to keep scaling on the data side is to make use of all of these other modalities, right?

So if you can have your language model also watch all of the videos of cats in the world, it's gonna understand the concept of cat much better. And that's what we want to have in these models. We want them to understand the world in the same way that humans understand it.

So right now, multimodality is really one of the main frontiers of this new foundation model drive that we're all in right now. There's a thing called the McGurk effect. Let's see if it loads up. But so what we'll see when this loads is this guy over here, and we'll have the same audio effect being played.

So the audio is exactly the same, and this man is gonna say something like, bah, bah, bah. And so you're hearing a "ba" there, I think, if you look at my mouth, because that's what I said. But if you then change the video to one where he mouths something like, fah, fah, fah, with exactly the same audio, you're going to hear the other version.

So unfortunately, I can't really swap in the different audio here, so you have to trust me for it. We might suddenly start hearing a guy saying, fah, fah, fah, and then. All right, so multimodal applications, so when we have multiple modalities, we can do all kinds of interesting things.

And as I said, most of the use cases we have on the Internet, they're all multimodal. And there are some really kind of obvious things we would be interested in if we have information from these different data sources, right, from different modalities. So obviously, we might want to do retrieval.

So maybe given a bit of text, we want to find the right image. Or maybe given some image, we want to find the right text for it so we can match them up. Obviously, we can also do this in a generative setting. So then we have image captioning, which you probably heard of.

We can do text to image generation. So that's image synthesis, like Stable Diffusion. Everybody in the audience here has probably seen that. Then we can do visual question answering, where we have an image and text. And then we need to generate some new text. We have multimodal classification, where we have image and text, and we need to have a label, for example, whether something is hate speech or not.

And then in general, we want to be able to have a richer understanding of information, which means that we combine images and text and then use it for downstream applications that require better understanding or better generation. So this field really is super hot right now. So there's this nice paper title.

I predict that this paper is going to do really well in terms of citations, just because it has such a citable title. I think a lot of people are not actually going to read it. And so, I mean, I've been in this field for quite a while now, and people have been saying this for a really long time.

I think Chris would agree. So for decades, people have been saying that multimodal is the next big thing. But now it's really true, I think. >> >> All right, so the outline for what we're gonna be talking about. So first, I'm gonna tell you a little bit about early models.

Then we're gonna do a bit of a deep dive on some of the specifics. Then we're gonna go over a particular type of fusion, contrastive models or late fusion. Then we're gonna go through a little bit of the history of multimodal foundation models. Then we're gonna talk a little bit about evaluation, a little bit about other modalities.

And then I'll make some predictions for the future, and hopefully maybe give you some cool research ideas or things to talk or think about. All right, so obviously, there's a lot of work that happened before deep learning. But I think if you want to start from the deep learning revolution and what was happening in images and text, then a good starting point is, for example, WSABIE or DeViSE, or Richard Socher, who you've probably heard of, has done some really cool early work in this.

They really pioneered a lot of these ideas. And the basic gist of this is that we have a vision model on the one hand, we have a language model. So this really, I mean, the first lecture of this course I think was about word embeddings, right? So that's just your basic word embedding model.

And now we need to figure out how to align them in the same multimodal space. So the way you do that is you get some sort of similarity metric, right? A score function or like a kernel function, if you're thinking about this from a support vector machine literature perspective.

And now you need to figure out in a max margin or margin loss, how you want to align these two points in your embedding space, right? So things that are similar, you want to bring them closer together. Things that are not, you want to bring them further apart. And if you do that in this multimodal embedding space, that means that you can do interesting cross-modal transfer, where you can take the word embedding for something like auto or like horse, and then you can find close images in the embedding space to that thing.

And now you've solved the retrieval problem. So this is a really nice early application. And I think a lot of this stuff that I'm going to talk about in the early slides, you're going to see this thing come over and over again. You're going to see it get kind of reinvented with fancier models, but it's basically all the same stuff.
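To make that alignment idea a bit more concrete, here is a minimal sketch of a pairwise max-margin loss between word embeddings and image features, in the spirit of what was just described; this is my own illustration rather than the exact WSABIE or DeViSE code, and the margin value is an arbitrary choice.

```python
import torch
import torch.nn.functional as F

def cross_modal_hinge_loss(word_vecs, img_vecs, margin=0.2):
    """Pairwise max-margin loss: each aligned (word, image) pair should score
    higher than mismatched pairs by at least `margin`.
    word_vecs, img_vecs: (batch, dim), where row i of each form an aligned pair."""
    word_vecs = F.normalize(word_vecs, dim=-1)
    img_vecs = F.normalize(img_vecs, dim=-1)
    scores = word_vecs @ img_vecs.t()              # (batch, batch) cosine similarities
    positives = scores.diag().unsqueeze(1)         # score of each true pair
    # hinge against every mismatched image for each word, and vice versa
    cost_images = (margin + scores - positives).clamp(min=0)
    cost_words = (margin + scores - positives.t()).clamp(min=0)
    # don't penalize the diagonal (the true pairs themselves)
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_images = cost_images.masked_fill(mask, 0)
    cost_words = cost_words.masked_fill(mask, 0)
    return cost_images.mean() + cost_words.mean()
```

Things that are similar get pulled together, everything else gets pushed apart by at least the margin, which is exactly the "closer together, further apart" intuition above.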

So you can do cross-modal transfer where you have images and text, but you can also combine them together so that you get a multimodal word embedding. And so this just gives you a more accurate representation of how humans understand word meaning. Because when we think about the word moon or cat or something, we can go to Wikipedia and read that a cat is a small carnivorous mammal that people like to keep as pets.

Or we can just go and look at pictures of cats, and now we understand what a cat is. And I would argue actually that for a lot of people, the picture of the cat is much closer to the meaning of the concept of cat. So some early work where people were trying to do this is from Bruni et al.

where they did multimodal distributional semantics using this very elegant approach called bag of visual words. So just like who has heard of bag of visual words? Very few people. Okay, so it's surprisingly simple, and so I kind of like it. It's nicely elegant. So you take a picture of a moon in this case.

I think you can see it in the back too, right? So we use an algorithm like SIFT to find interesting key points. So it's sort of where the difference between the pixels and the pixels next to it, where that difference is big, those are sort of the spots you want to be looking at.

And for each of these key points, you get feature descriptors. So relatively small vectors, like 32-dimensional, depending on the implementation of this. And what you can do now with these feature descriptors is you can cluster them using k-means, and then you assign every one of these points so you can count how often they occur, right?

So in this picture of the moon, we have like actually the count is... Oh, yeah, so there are three like red dots, right? So that's why the red dot one is three. So what that gives you is an idea of the visual words, very similar to the original bag of words model that you hopefully have heard about maybe in the first lecture.

So that's the visual equivalent of the textual thing. And so if you do this and you then concatenate or you apply SVD to fuse the information, what you get is a word embedding that is much more representative of human meaning. So, you know, as reflected in the datasets that people used to care about at the time.
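If you want to see just how simple this is, here's a minimal sketch of the bag-of-visual-words pipeline, assuming OpenCV for SIFT and scikit-learn for k-means; the vocabulary size and file paths are placeholder choices, not the setup from Bruni et al.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(image_paths):
    """Collect SIFT feature descriptors from a list of image files."""
    sift = cv2.SIFT_create()
    per_image, all_desc = [], []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)   # (num_keypoints, 128)
        per_image.append(desc)
        all_desc.append(desc)
    return per_image, np.vstack(all_desc)

def bag_of_visual_words(image_paths, num_visual_words=100):
    per_image, all_desc = sift_descriptors(image_paths)
    # The "visual vocabulary": k-means centroids over all descriptors.
    kmeans = KMeans(n_clusters=num_visual_words, n_init=10, random_state=0).fit(all_desc)
    # Each image becomes a histogram of how often each visual word occurs.
    histograms = []
    for desc in per_image:
        words = kmeans.predict(desc)
        histograms.append(np.bincount(words, minlength=num_visual_words))
    return np.stack(histograms)
```

The resulting histogram per image is the visual analogue of a bag-of-words count vector, which you can then concatenate with (or SVD-fuse with) the textual representation as described above.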

So after that, there were a couple of people, me included, who tried to take these ideas and then really apply deep learning to them. So some of the very early versions of this use convolutional neural networks, and then you can transfer the features from your ConvNet and you take your word embeddings, which you've seen in the first lecture, and then you can concatenate them.

Now you have a multimodal word vector, or you can do something slightly fancier. So you've seen the skip-gram model. You can also try to do skip-gram predictions onto image features, right? So when you see a word like cat in some context, like the cute little cat sat on the mat, then when you see cat, you also want to predict cat pictures.

So super easy ideas, but it turned out that this gives you much richer word representations. So that's kind of cool, but obviously words are very limited. What we really care about is not words, but sentences. So then people started really looking into sentence representations and how can we figure out how to get compositional understanding in these sentence representations and how do we align that with images?

So the loss here is very similar to what we saw with words and pictures, but now we just have a sentence encoder, right? And so there's some really cool early papers from Andrej Karpathy, and Richard Socher also had some work here. And then, so the basic idea is just that instead of having these word embeddings, we now have an LSTM in these papers or some other kind of recurrent neural network, or in the case of this one, recursive neural network, and then we try to align the features together.

And so these three or four papers are actually very important. This one by me is less important, but it's still kind of interesting because we showed here that grounded sentence representations work. So if you actually just use this part here as a sentence encoder for NLP tasks, the ability to just predict pictures from it already gives you a really good sentence representation.

Right, so just by predicting pictures, you can sort of imagine what things look like, and that gives you a really good meaning representation, which you can then transfer to, I don't know, sentiment classification or something else. And then of course, once we have sentence encoders, or then we also have decoders, and so when the sequence-to-sequence architecture came out, which you've probably also heard about in this course, what you can do instead of having a text encoder for like your source language, if you're doing machine translation, is you can plug in a ConvNet instead of an LSTM encoder, and now you can generate captions.

So that's exactly what people did. We used to have all of these fancy diagrams in our papers then where we explained LSTM and how that works. Probably people don't learn that anymore these days. They do? - Mostly for LSTM. - Very good. They might make a comeback, I think, you know, at some point, transformers are gonna go away.

We'll see. And so one of the things that people figured out in machine translation very early on is that you can do alignment of words between your source language and your target language, and you can do the same thing actually with images, right? So if you want to align a word in your generated sequence with something in your picture, then you can use the same approach for that, and that approach, of course, is called attention, right?

So, you know, you learned a lot about attention probably in this course, and so, yeah, that was one of the building blocks of these systems as well where you can do very interesting things and really see that when it has to generate stop for the stop sign, that it's really actually looking at the stop sign, right?

So there's a really cool alignment going on there in these models. And so the final kind of early model we should talk about a little bit is GANs. Who here has heard of GANs? Okay, that's a lot more than visual words. I guess that makes sense. And so, yeah, the basic idea of a GAN is really that you have this generator and discriminator, and you want to have the generator generate images that the discriminator cannot distinguish from, so it cannot distinguish fake and real images, right?

And if you do that, you can actually condition that on a piece of text, and then you can generate images using some text prompt, right? So the precursors of something like Stable Diffusion were doing things like this, and it's all a natural progression to that kind of model.

So those were the early models. Maybe, do people have any burning questions about this, or does this all make sense? All right. So let's do a bit of a deeper dive then in particular on features and fusion. So those are really the kind of core building blocks for all of this multimodal stuff.

But before we go there, maybe very briefly, like if all of this multimodal stuff is cool and sort of useful and doesn't look that difficult, like why aren't we all doing multimodal things? So why do we focus on specific modalities? And I think there are a couple of problems just to be aware of.

So one is modalities can sometimes dominate, especially text is much more dominant than vision or audio in many use cases, right? So you can already just have a model that picks up on the text signal and basically learns to ignore the image completely, which actually happened embarrassingly for visual question answering, we'll get to that.

So visual question answering, you could do that without actually looking at the picture. Additional modalities can add a lot of noise. So it makes your machine learning problem more difficult. You don't always have full coverage, right? So as I said, if you look at Facebook posts, sometimes you have text, sometimes you have pictures, sometimes you have both, but you don't have a guarantee that you always have both.

So how do you deal with that? In many cases, we just really weren't ready. It was too complicated to implement stuff. And also just in general, like how to design your model really to combine all the information is actually quite complicated. So in order to maybe drive that point home a little bit, so featurizing text, I guess we all know how to do that by now, especially sort of in the age of transformers and before in LSTMs, where we just have like, you have your batch by your sequence.

So batch size by sequence length by embedding size, right? So it's always like a 3D tensor, and that's how you encode your textual information when you pump it through your neural net. And so with images, it's slightly trickier because you can just kind of look at the patches, but then if you do convolutions, you're kind of like shifting over the image and then you're aggregating, right?

And in many cases, you don't really want to be this uniform. You want to have something that actually looks at the things in the picture, right? So this is called region features, where you would use an object detector as a first step for processing your image. And then you would have a ConvNet backbone that encodes the features for that particular sub image, like this guy's skateboard or something.

It has its own like vector representation, right? And then in terms of dense features, we now also have vision transformers. So we'll just very quickly go over that to make sure we're on the same page. So there are all these models like YOLO is a really good one if you haven't heard of that yet.

So we're at YOLO v7 now, I think, or eight, I don't know. So there's a new one coming out every other year or something. But the basic idea is that we get these bounding boxes for things in the images. There are actually segmentations too, but the bounding boxes are what people tend to use, and they have labels, right?

So this is labeled like backpack or something. And so you can do this as a pre-processing step on your image to get a much richer representation of what is really in that image, which you can then pump into your system as we'll see later. And so then how you encode the information that is in these little bounding boxes, or actually in the image itself in general, we just use a standard ConvNet for that.
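As a rough sketch of that region-feature pre-processing step, here's what it might look like with torchvision's off-the-shelf Faster R-CNN as a stand-in detector; the actual papers in this space typically used a Visual Genome-trained detector, and the file name and score threshold here are just placeholders.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Pretrained detector (COCO classes) as a stand-in for the detectors used
# in the region-feature papers; assumes a reasonably recent torchvision.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = to_tensor(Image.open("photo.jpg").convert("RGB"))   # placeholder path
with torch.no_grad():
    (prediction,) = detector([image])

# Keep only confident detections: these boxes plus labels are the "regions"
# you would featurize with a ConvNet backbone and feed to the multimodal model.
keep = prediction["scores"] > 0.7
boxes = prediction["boxes"][keep]    # (num_regions, 4) in pixel coordinates
labels = prediction["labels"][keep]  # category indices, e.g. "backpack"
```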

And so this probably feels like super obvious now, but in 2014, when people were starting to discover this, it was really very surprising that you could just use off-the-shelf ConvNet features to really replace the entire computer vision pipeline. So people used to do all of this very fancy, sophisticated stuff and people, you know, spent decades on trying to refine this.

And then it was all thrown away and replaced by a ConvNet that does all of that stuff for free. And so the cool thing you get there is that you can transfer very easily across different tasks. So you can have a very generic ConvNet and then use it for all kinds of very specialized things like spotting buildings in Paris, for example, or flowers or other stuff.

And then of course, in the age of transformers, how far are we? We're already quite a while in. This is only the first transformer actually in the slide deck. So, you know, we're making good progress. So vision transformers are what we would use these days to encode the images, where you have these flattened patches and then you would do kind of the standard BERT architecture maybe as you would know it from this course, and then you do classification, right?
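Just to make the patch idea concrete, here's a minimal sketch of that patchification step; it's my own illustration rather than the actual ViT code, and the sizes are the common defaults rather than anything prescribed.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and linearly project each
    patch to the transformer's hidden size, plus a [CLS]-style token."""
    def __init__(self, image_size=224, patch_size=16, channels=3, hidden=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A strided convolution is the usual trick for "flatten + project".
        self.proj = nn.Conv2d(channels, hidden, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, hidden))

    def forward(self, images):                        # (batch, 3, H, W)
        x = self.proj(images)                          # (batch, hidden, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)               # (batch, num_patches, hidden)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed
```

The output sequence then goes into a completely standard transformer encoder, exactly as if the patches were tokens.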

So this is all like a standard transformer, everything's standard, except now your input here is not words or tokens, it's patches of an image and then you classify that. All right, so then we have a bunch of features and now how do we combine the information, right? So let's say we have two vectors, U and V.

So, you know, it sounds easy, right? To how we could combine them. It turns out that there are actually very many ways to combine them. So I don't think it's really useful to go over all the different ways here, but you can do very simple things, right? So obviously like inner product or similarity is what you would use if you want to do cross-modal things.

So if you want to embed things in the same vector space, but you can do sort of fancier projections on top or different combinations that are kind of linear or you can do multiplicative things where you multiply the components element-wise or you do some sort of gating over the different features.

You can do attention, you can do fancier bilinear things, you can do very fancy compact bilinear things. So there's really a wealth of literature kind of on all the different ways you can combine two vectors. And so this is called multimodal fusion and most of the literature on multimodality is essentially about this question, what is the best way to do fusion?

And that's it. So I think within that discussion is maybe useful to distinguish between different levels of fusion. So you can do it very early where basically you make sure you have the different features and then you just kind of, in the sort of modern sense of attention, you would attend to everything in all the features from the beginning.

You can first treat them separately and then combine them or you can treat them as completely separate and then you only combine the final scores, right? And so that's kind of what we would call early fusion and then sort of my invention for calling the middle part would be sort of middle fusion and then you have late fusion where you really just combine the scores or the logits but you don't really have any interaction between the information from the different modalities.

So you could do really fun stuff with multimodal fusion. So this is a paper I really like, FiLM, where you have this sort of very special feature map, this sort of F here, and it gets modulated by a multiplicative factor, this gamma, and an additive sort of bias vector, this beta, and you have a different one for every layer of a ResNet that is conditioned on some encoding of the thing you're after.

So in this case, are there more cubes than yellow things? So we have some vector representation for that and we use that vector representation to modulate the ResNet blocks at every layer of the ConvNet. So you can really do very fun things where you're sort of modulating one network with the other one and really try to have them learn as much as possible from that.
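Here's a minimal sketch of that feature-wise linear modulation idea; it captures the gamma/beta conditioning described above, though the real FiLM model wraps this inside residual blocks with convolutions and batch norm.

```python
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    """Feature-wise linear modulation: the text encoding predicts a per-channel
    scale (gamma) and shift (beta) that modulate a ConvNet feature map."""
    def __init__(self, text_dim, num_channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(text_dim, 2 * num_channels)

    def forward(self, feature_map, text_encoding):
        # feature_map: (batch, C, H, W); text_encoding: (batch, text_dim)
        gamma, beta = self.to_gamma_beta(text_encoding).chunk(2, dim=-1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)    # (batch, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return gamma * feature_map + beta            # modulated features
```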

All right, so let's talk about late fusion then. So late fusion is what we would now call contrastive models but the basic idea is that we have this similarity score. So we have the two kind of, we process the modalities completely independently and then at the very end, we do some combination.

And the most famous instance of that these days is CLIP. So who's heard of CLIP? Okay, so CLIP from OpenAI. So it's again, exactly the same contrastive loss that we've seen in all these early approaches. It does kind of negative sampling, but then in batch. So you just have a batch, you have two things that are aligned, right?

So like this, the first piece of text and the first image, they are aligned. So this is the right answer. And I just wanna make sure that I rank this thing higher than all the alternatives, right? And I wanna make sure I rank this thing higher than all the alternatives.

So it's a very, very simple idea. Really, nothing special about this architecture that was sort of invented here, but what made this thing so cool was first of all, it was transformers and it was transformers all the way. So your text encoder would be a transformer and your image encoder would be a ViT image encoder.

So also a transformer. And it was trained on lots and lots of web data. So Alec Radford is really a genius at creating very high quality datasets. And he created, I think 300 million image text pairs for this dataset, trained a bigger model on it than people used to do.
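The core training objective, by the way, fits in a few lines. Here's a sketch in the spirit of the pseudocode in the CLIP paper (not their actual code; in the real model the temperature is a learned parameter):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, temperature=0.07):
    """In-batch contrastive loss: row i of each tensor is an aligned pair,
    and every other element of the batch serves as a negative."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = image_features @ text_features.t() / temperature   # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: match images to texts and texts to images.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```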

And then we got this amazing model out of it. And so moving away from the words there to the sort of texts that you would see on the internet. So the caption for an image on the web, it's not gonna say dog or cat. It's gonna say a photo of a cat doing something, something.

So that means that you can do kind of zero shot label predictions, where you have "a photo of a ...", and then you need to figure out what the right label is for a given image using this kind of prompt. So the thing is, you probably all know about prompting large language models.
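For instance, here's a short sketch of that zero-shot prompting trick, assuming the Hugging Face transformers CLIP interface; the checkpoint name, labels, and image path are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint name; swap in whichever CLIP weights you have.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["cat", "dog", "horse"]
prompts = [f"a photo of a {label}" for label in labels]   # the zero-shot prompt
image = Image.open("photo.jpg")                            # placeholder image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)           # similarity to each prompt
print(dict(zip(labels, probs[0].tolist())))
```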

And so you can prompt vision and language models in very much the same way and do zero shot generalization. So if you want to read a really good paper, I would recommend that you read this paper. This is really one that's gonna teach you how to write really good papers.

It's thorough, it's really worth a very close read, I think, if you're interested in this field. And so I think when it came out, actually on ImageNet itself, it didn't really outperform ResNet. So you might think, oh yeah, actually it's not all that special. But what really made it special was that it generalized much better to these other datasets, right?

So this ResNet thing here is pretty terrible at some of these adversarial versions of ImageNet, and CLIP is super robust to that. So it's just a way better image encoder in general. So very quickly after CLIP, there was this paper from Google called ALIGN, which was basically exactly the same idea.

The field is not really that creative at all, so it's the same idea, but then you just keep throwing more data and more compute at it, and it often works much better. So that's what they found here too. 1.8 billion image-text pairs instead of 300 million gives you a better model.

Surprise. But so still very cool. And what is really cool, I think, is that there's this organization called LAION, where they've started this open source collective to create really high quality datasets. And so the initial LAION dataset was, how many examples in the initial LAION? - 400. - 400 million.

He knows, I know that he knows. (audience laughing) And so now there's a much bigger version of LAION that's even multilingual and it has 5 billion examples. So Stable Diffusion was trained on the English image subset of this thing. And that's one of the reasons that it's so awesome, it's because it's just seen a ton of data.

And that really makes your system a lot better. So if you're looking for like the ultimate dataset to play around with your own ideas, if you have enough compute, obviously, then you should really look at this dataset. All right, any questions about up until this point? No, all right.

So then we'll move on from late fusion to kind of middle fusion, early fusion. And this really is kind of the core of what I think a lot of people in the field right now, or if you're interested in getting in this field, or if you're going to go into industry and you're gonna be using this stuff, like this is what you should really understand.

And again, like the ideas sort of stack onto each other. So I've kind of sequenced the slides to give you an idea sort of how the scientists kind of came up with the next step. And you can really see the architecture just get slightly more and more advanced, but basically a lot of it is just more data and more compute again.

So who knows how BERT works? (audience laughing) Everybody should raise their hand now in this. So yeah, so BERT is kind of so canonical. I think everybody kind of gets how BERT works, right? So I don't think we need a real refresher, but I think you can think. And so the reason I have this slide is because I want you to think about, if you have a BERT model and you have a bunch of images, how are you going to turn that BERT model into something multimodal?

Right, so there are a bunch of like obvious things you could do given the kind of features I told you about in the sort of fusion process. So how are you gonna do that? Does anybody wanna like say something? (audience member speaking indistinctly) - Like if you're doing classification, you can take the CLS token from BERT and then just concatenate it to whatever encoder, like maybe a CNN or whatever you're training on the data, concatenate it, and then train.

- Okay, exactly, yeah. So you can take the ConvNet features and the classification token from BERT, concatenate them, and then classify for like a cat or something like that or whatever the thing is you're interested in, yeah. Yeah, so that's one thing. You could also like take the ConvNet features and like give them to the BERT model in lots of different ways, right?

We can use the region features. So I think a lot of people when BERT came out who were working in vision and language processing were thinking exactly about, okay, so do we do like middle fusion, late fusion? Do we do early fusion? How do we do the fusion? And so there were a lot of papers all coming out basically at around the same time where people were doing versions of this.

So BERT was really kind of the innovation and then everybody sort of just plugged it into their own thing because of Hugging Face Transformers and things like that. So the first thing is Visual BERT. This was one of the very early ones where you have this image and people would do object detection on this.

So you get like a hat and a racket and a shirt and things like that. So you can just really take these features and then plug them into your transformer model and then you try to like recover the features. And so this really is probably like the simplest way to do it, right?

And so this is what we call a single stream architecture, where you have all of these kind of concatenated with the original input features and then put through the same transformer. What you can also do, and that's something that this model called ViLBERT did, is where you have two different streams.

So you essentially have these two parallel transformers but at every layer, you kind of give them cross attention, right? So, or co-attention as they call it, but it's basically like, so you just make sure you have an attention map that spans both and then you just do your full normal transformer layer again.

And then, so this you can train just like your regular BERT, right? So you have your masked model, masked language model here and here you do sort of some equivalent of that and then you also have your next sentence prediction which you probably remember from your BERT lecture. But instead here we're saying, okay, is this image aligned with this piece of text or not?
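Here's a rough sketch of that dual-stream co-attention idea; it's my own simplification, and the real ViLBERT block also has feed-forward sublayers and the usual transformer plumbing around it.

```python
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    """Sketch of a ViLBERT-style co-attention block: the text stream attends to
    image features and the image stream attends to text features, in parallel."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.text_attends_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_attends_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, text, image):
        # Queries come from one modality, keys and values from the other.
        t, _ = self.text_attends_image(query=text, key=image, value=image)
        v, _ = self.image_attends_text(query=image, key=text, value=text)
        # Residual connections; the real model adds feed-forward sublayers too.
        return self.norm_t(text + t), self.norm_v(image + v)
```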

There's also LXMERT. I mean, there I could go on forever. There are like a hundred papers that came out that did this all at the same time. So LXMERT had a different cross-modal output encoder, a bunch of different ways of encoding the positional information, right? So you could say, okay, I just have a bunch of bounding boxes that are featurized but I don't care about where they are in the image.

So it's just kind of like a, just a bag of bounding boxes. Or you could say, I found it here. Like this is the particular like top left and bottom right coordinate. And that's what you featurize into your network. You can also do something even dumber. And I can say that because this is my paper.

Where you just take the image itself, you put it through a ResNet and then you do a little bit of pooling on the final feature maps and you just give those feature maps to BERT. And so you then need to distinguish between like your text segment embeddings, right? And your vision segment embeddings.

But so this actually works surprisingly well. You don't have to do any additional training. You can just take BERT out of the box. Initially you freeze it. You learn to project into BERT token space. Then you unfreeze your ResNet and then finally you unfreeze your BERT. And now you have a very good multimodal classifier on the problem you care about.
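Here is a minimal sketch of that recipe; the pooling shape and token count are illustrative choices rather than the exact MMBT configuration.

```python
import torch
import torch.nn as nn

class PooledImageToTokens(nn.Module):
    """Sketch of the MMBT idea: pool a ResNet feature map into a handful of
    vectors and project them into BERT's token embedding space, so the image
    looks like a few extra 'tokens' alongside the text."""
    def __init__(self, resnet_dim=2048, bert_dim=768, num_image_tokens=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d((num_image_tokens, 1))
        self.project = nn.Linear(resnet_dim, bert_dim)

    def forward(self, feature_map):                  # (batch, 2048, H, W)
        x = self.pool(feature_map)                   # (batch, 2048, num_tokens, 1)
        x = x.squeeze(-1).transpose(1, 2)            # (batch, num_tokens, 2048)
        return self.project(x)                       # (batch, num_tokens, 768)

# Staged training as described above: start with only the projection trainable,
# then unfreeze the ResNet, and finally unfreeze BERT itself.
def set_trainable(module, trainable):
    for p in module.parameters():
        p.requires_grad = trainable
```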

So a lot of these other papers, they're doing what they call multimodal pre-training where first you have a BERT model and a ResNet. So they're kind of unimodally pre-trained and then you cobble them together and then you have a multimodal sort of intermediary pre-training step before you fine tune it on the problem you care about.

And what we showed here is that you don't really need that actually in many cases. So it's a very strong baseline. You can also go to the pixel level completely. So that's what they did in this other paper called PixelBERT where they, it's basically exactly MMBT. So the previous supervised one, but here they do do the multimodal pre-training step and show that I think for VQA it helps a little bit.

So there are many of these BERTs doing sort of visual things. People really tried everything. Here's another one called UNITER where they added a bunch of different losses. We can really talk about this for a very long time. We're not gonna do that. I'm just gonna kind of talk you through some of the more interesting ones.

So this one I think is quite interesting, ViLT, because here this is really the first instance where we are completely gone from ConvNet features. So we don't do any pre-processing on the image, no region features, no backbone that featurizes the parts of the image we care about. We just have these patches of the image.

So really, we just flatten those patches, we pump them into the transformer straight away. So this really is like sort of BERT and ViT together in one model and this worked really very well. So that's been the trend. So here's a nice, very long list of all of these different models and what they do.

And so really the distinctions are just in what is the text encoder that you use? So do you use BERT or something fancier or better, RoBERTa; what is your vision encoder? So in many cases you have these region features. So you would do an R-CNN style thing, or you could just do a ResNet or a ViT.

You have different kinds of fusion. So either a single or dual stream as we talked about, so VisualBERT or ViLBERT. Different pre-training tasks, so masked language modeling, image text matching. There's a bunch of funkier ones you can do. So, and then finally you can do multimodal pre-training on all of these different datasets that have aligned data.

So you are probably wondering, okay, so what is really the interesting difference between a lot of these? And so I have another recommended paper that if you're interested in this space, you should really take a look at. It's also a really well done paper where they unmask multimodal pre-training.

So basically they say, if you take all of these little model inventions and you train these different models on exactly the same data in exactly the same way, it turns out that they're all basically the same. So that's a lot of kind of wasted effort on the part of the field because everybody's saying, well, my model is better, but it's actually just because you trained it on different data.

There's no real sort of model innovation going on in a lot of these things. So I don't mean to sound discouraging or anything like that, but I think that's why this paper is really nice and really important is because it just shows us what really matters. So this is also work that I did myself called Flava with my team where we wanted to take these ideas really to the limit.

So a lot of the things that you've seen now, so the VisualBERTs and the ViLBERTs and things like that, they're all about multimodal questions. So how can we do visual question answering, something like that, where we just have these two modalities, we only care about problems that always involve these two modalities.

And where we want to go, and this is kind of the basic premise, I think, of foundation models in general is that we have one model to rule them all. So this one model can consume data from all of these different modalities and it can synthesize across all of these different modalities and then do useful things with that information.

So with FLAVA, that's exactly what we tried to build. So we wanted to have one foundation model that is good at vision and language and computer vision and natural language processing, is jointly pre-trained on all of these different data sources. So it's also trained on just CCNews, so Common Crawl news, and BookCorpus.

So it's very good at the sort of things you would expect BERT to be good at. It's trained on ImageNet for image data. So it's good at the things that you would expect as a kind of basic image model to be good at. And then you have this PMD dataset that we created out of publicly available image text pairs that we also train it on.

So this PMD dataset is really just, if you take all the datasets that were ever created that have image text pairs that are publicly available. So unfortunately, the CLIP data and the Google ALIGN data and all of these datasets, they haven't been open sourced. So this is before LAION. So now there's a good alternative to this.

But so this PMD dataset, if you combine all of these image text pairs, you get 70 million of them. So that's still pretty decent size. And then you can take all of this data, basically to solve all of these problems that we know we care about in these different fields.

So you can do multi-modal reasoning, you can do language understanding, you can do visual recognition, all with exactly the same model. And that's a very powerful idea. I think if you work at a company like Facebook, you don't want to have different models for all kinds of different things.

You want to have one model that you can really use for everything. That's gonna really make your life a lot easier. So the exact architecture here is that on the one hand, we have this image encoder where we take the image, we encode it as patches, and we just do what we call masked image modeling, but it's basically masked language modeling, and then just on the image tokens.

And then on the other side, we have the masked language modeling on the language. So your regular sort of BERT thing. And then we have a multi-modal part where all of this information gets combined. So we have a masked multi-modal modeling loss term where you can also do image text matching.

So this is like your BERT next sentence prediction thing. And then we also have a global contrastive loss, which is exactly like CLIP. So if you do all of this stuff, it's just all transformers all the way down. It's sort of a very elegant way, I think, to combine a lot of this information.

And when you do that, you get something that can really do a lot of things very well. So we're not gonna talk about that table, it's just way too many numbers. But so just trust me, we were pretty thorough generating the table here. And so over 35 different tasks, if you compare FLAVA to all kinds of different ablations in terms of CLIP models, then this is just a much better way to get to this information.

So I think this is a nice example of where we're probably gonna go with the field in the near future. So the other trend that we see very obviously in the field right now is that everybody cares about generative models. So language models and image generative models, there's just a trend where we want to be generative, we wanna move away from this contrastive, discriminative stuff to the more interesting, more richer representations maybe that you get out of generating sequences or images.

So this SimVLM paper was one of the first ones where they really had this separate decoder that was trying to generate or kind of complete captions, which they showed gives you a lot richer representations. I think this is actually the current state of the art now, it's called CoCa.

So a lot of these models, they all again look very similar, but in this case now we're starting to really see these text decoders. So initially with CLIP, I think that's also what they were trying to go for, like OpenAI being a company that really likes generative models, but they couldn't really get it to work.

And I think, so it took us a while as a field to really figure out how to do this the right way. And so right now we're really kind of in the age of language models, right? And so one of the interesting things you can do with language models is just keep them frozen and then learn how to project into the language models.

So the MMBT architecture I talked about where we had this BERT model, we kind of kept it frozen and we learned to project into the BERT token space. You can do exactly the same thing, but then with a much fancier model or something like T5, even where you just have an encoder decoder or some kind of generative part of this, you keep that thing frozen and then you learn to project into the token space of that frozen language model.
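A minimal sketch of that "project into a frozen language model" recipe might look like this; the dimensions and prefix length are placeholders, and the actual papers add more machinery around it.

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    """Sketch of the frozen-LM recipe: a trainable projection maps image
    features to a few vectors in the frozen language model's embedding space,
    which are prepended to the text embeddings as a visual prefix."""
    def __init__(self, image_dim=1024, lm_dim=4096, prefix_len=2):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim
        self.project = nn.Linear(image_dim, prefix_len * lm_dim)

    def forward(self, image_features, text_embeddings):
        # image_features: (batch, image_dim); text_embeddings: (batch, seq, lm_dim)
        prefix = self.project(image_features).view(-1, self.prefix_len, self.lm_dim)
        return torch.cat([prefix, text_embeddings], dim=1)

# During training, only this projection (and optionally the image encoder)
# gets gradient updates; the language model's weights stay frozen.
```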

And then you can do lots of fun stuff, it turns out. So what they show in this paper is that you then get few-shot learners. So all of the things you see with GPT-3 where you can just give it some kind of in-context examples and it's gonna figure out binding kind of on the fly.

So it says like, this is a dax and this is a blicket. So what is this? And then it gives you the answer that it's a dax. So it really learns in context how you decide the feature mappings, which is really kind of solving the grounding problem that a lot of this multimodal stuff started with.

So I think that's very cool. And then probably one of the coolest papers right now or models right now that you might've heard of if you follow the field is Flamingo out of DeepMind, where they take a chinchilla language model. And so this is really an optimal language model.

And now you have this vision encoder that encodes multiple different images that you can then do reasoning over and then kind of auto-complete. So what this gets you is just a much more powerful model because you can do your generative over lots of different images. So it's really like stepwise, you can see it, right?

We started off with very simple transformers and now we're actually at something that is starting to get pretty complicated, because we have these building blocks like a perceiver resampler, where we have a bunch of different images that we featurize and now we need to compress the information because sometimes we have three images, sometimes we have five images.

So we wanna make sure that we can compress it so that it's always ready for consumption by the next layer of the language model. And then, so this paper again, is a really good paper to read because they actually, so this is not me, this is not my code, this comes from the actual paper.

So they just have the diagram together with the code so that you can really understand what it's doing, which I think is really great. And so once you have your perceiver resampling step, what you then do is you do a gated cross attention. This is how you implement it.

And so this gated cross attention, you do that before your frozen language model layer. So you really just have a frozen chinchilla language model and you learn to kind of modulate the information that goes into that language model. You propagate the gradients all the way back, you just don't update the language model.

So you're really kind of trying to figure out like, how am I gonna design my signal so that my language model can do the most with it, right? How am I gonna combine the information? So you'll notice that now we do it before the layer, right? In a lot of this other stuff, you would do the attention after the layer, but here you do it before.

So Karpathy, I think more than 10 years ago, had this image with Barack Obama kind of setting his foot here on the scale to make somebody think they're a lot heavier than they really are. So this is obviously funny to us, but not to an AI system, I think, unless it really understands the scene.

And so that's why Karpathy at the time said, this would be a really good visual Turing test. Like if a system can figure this out, then it's actually really smart. And so obviously it's been a bit of a challenge for everybody working in the field then to get something that actually works on this.

And so Flamingo, as it turns out, kind of gets the joke. But yeah, so it's a bit unclear if it really gets the joke, because if you read this conversation, it's sort of kind of getting steered in the right direction, right? But at least we're making progress, let's put it that way.

And then, so in Flamingo, you still have a lot of moving parts, but you can really take this almost to the full extreme where you try to freeze almost everything. And you just want to learn this kind of mapping between your image encoder and your language model, or your image encoder and your encoder decoder architecture.

And all you really do is just a projection between the two, right? So there's this nice model called BLIP-2, where they experiment with like OPT for the language model and Flan-T5 for the encoder decoder architecture. And this just gives you amazing results. It gives you really complex captions and things like that without any real direct supervision on the captions itself, which is pretty impressive, I think.

So that just shows you the power of language models in general. So here are some examples. So it can really do like different things from captioning to reasoning, to visual question answering, to like location detection. So you can have a long conversation with this system. This really is kind of the future where we're going, right?

Where we're going to have a chat GPT, but it's also going to be able to see the world in a way. And so I think an interesting thing, so you've probably heard of like chain of thought prompting and things like that, where you ask the language model, like let's think step-by-step, and you can tell a vision and language model, generate a rationale for why something might be the case.

So you generate a potential explanation for what your answer might be. And then after that, you ask it to answer the question. And it turns out that if you do that sort of multimodal chain of thought prompting, then the system gets much better. And so, this was like the new state of the art on ScienceQA or a benchmark like that, just because it learns to unpack the information, right?

And so I think we're really as a field just starting to figure out what the potential is of this. And I think this paper is where they also show that multimodal chain of thought prompting really gets you pretty amazing results. And they show very nice results on Raven's Progressive Matrices and very complicated kind of IQ test sort of things that humans are supposed to be really good at, but you have to be a pretty smart human to really be good at this.
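Just to illustrate the two-stage structure of that prompting recipe, here's a tiny sketch; the prompt wording and the `multimodal_model.generate` call in the comments are hypothetical placeholders, not the actual setup from the paper.

```python
def two_stage_prompts(question, options, rationale=None):
    """Sketch of two-stage multimodal chain-of-thought: first ask the model
    (which also sees the image) for a rationale, then feed that rationale
    back in and ask for the final answer."""
    base = f"Question: {question}\nOptions: {', '.join(options)}\n"
    if rationale is None:
        return base + "Solution: Let's think step by step."   # stage 1: rationale
    return base + f"Rationale: {rationale}\nAnswer:"           # stage 2: answer

# stage 1: rationale = multimodal_model.generate(image, two_stage_prompts(q, opts))
# stage 2: answer    = multimodal_model.generate(image, two_stage_prompts(q, opts, rationale))
```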

And this system just nails it. So, we're making super fast progress. And we started off from a very simple bird model that was able to look at some pictures. And now we're getting to these very sophisticated foundation models. So, that was my short history of multimodal foundation models. So, how much time do I have left?

- So, after 5.50, so 25 minutes. - All right, okay, plenty of time. - We can also take questions. - Yeah, please, questions. - Do we do much pre-processing of images for these models anymore? I noticed a lot of the images just look like they were boxes, like they just passed through with kind of no sense of shape to them.

- Yeah, yeah, so I think the history of computer vision has been very similar to the history of natural language processing, where we thought we needed all of this structure and all of these different things. And it turns out you can just throw it all away and just have a big transformer over the patches.

Sorry, yes. - Seeing as it's 2.31 and one minute, save time. (all laughing) - You mentioned a couple of times models being frozen, what does that mean? - Yeah, yeah, sorry, I should have explained that better. So, it just means that we are not updating the weights. So, like if we go to this here, I think, is a nice example.

So, we have frozen self-attention. So, that just means that when we do a forward pass, we go all the way to whatever we want to predict, we get some gradients, we take them all the way down, but we only update the non-frozen layers. So, here these layers actually do get updated, but the frozen ones just never change.

And so, the reason you wanna do that is because otherwise you're gonna drift way too far. And so, then you're gonna kind of destroy all of the cool stuff your language model has learned, because you're just gonna focus on the small data set that you're training it on. So, you wanna preserve the abilities of the language model, but you want it to become good at the thing you care about.
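Concretely, in PyTorch terms, "frozen" just means the parameters have gradient updates turned off; a minimal sketch, with the model names in the comments as placeholders:

```python
import torch

def freeze(module):
    """Freeze a module: gradients still flow through it during backprop,
    but its own weights are never updated by the optimizer."""
    for param in module.parameters():
        param.requires_grad = False

# e.g. freeze the language model and train only the projection layers:
# freeze(language_model)
# optimizer = torch.optim.AdamW(
#     [p for p in full_model.parameters() if p.requires_grad], lr=1e-4)
```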

Other questions? - In terms of multimodal fusion, is there a benefit to doing the earlier or middle fusion as opposed to only doing the late fusion? - Yeah, so, I mean, we're gonna talk about evaluation next, but so it really depends on the task that you care about. And so, I would say the earlier is always the better if you can afford it.

And so, like CLIP is very efficient to train, it's very late fusion, right, at the very end. So, there's no interaction between the different modalities. And so, that's really good if you want to be very efficient and if you wanna be like, for training, it's much nicer. But if you want to have a richer understanding of the multi-modal signal, then you want to do earlier fusion.

So, yeah, it's always a trade-off. - It seems that images are just a lot more data than text. So, how much more difficult are these to train and how much bigger does like the image processing have to be compared to the language model? - Yeah, so, images are more complex in a way, but they're also kind of higher bandwidth representations.

So, there's a lot of kind of like, just pixels that our brains just abstract away, right? It's really about the scene that you're seeing and like you're not really thinking too much about the pixels themselves. So, Yann LeCun likes to say that language is just a kind of low bandwidth proxy for a language of thought, which is much richer and much higher bandwidth. And he thinks that's probably visual, I'm not so sure.

And like he thinks probably visual, I'm not so sure. But so, yeah, I don't think that there's necessarily a difference between kind of the scaling laws that you see in these systems, or at least we still have to figure that out. We'll kind of talk about that towards the end as well.

- Do these modern models also have certain social and cultural bias, just like the natural language model? - Oh, yeah, they have terrible biases, yeah. (audience laughing) So, yeah, some people are actually working on this who are in this very room. But so, these models can be very racist also in what they generate or the kind of predictions they make.

So, if you have an Asian basketball player standing sort of like this with a basketball very obviously there, then the model will think that he's playing ping pong because he's Asian. (audience laughing) I'm not joking. So, these models, just like all neural networks, this is really a big problem.

And one of the most interesting problems that you should be working on if you're a student and you wanna make a difference is, how do we get these systems to be much better at these sorts of things? - So, in one of the examples, you show like the model interpret from the content of an image.

So, we wanna understand the content of a video. So, what actual challenges you might see like in the video? - Yeah, so, in the video, you might see like what improvements we can make to our new scope. - Yeah, so, you're asking about the attention mask sort of, right?

Yeah, so you can use the same idea for videos and you just look at the video and so these systems are so good. Now, the object detectors are so good, you can really track objects kind of real time as they go through your video. And so, you can try to check how that aligns with your attention mask in your model.

So, a lot of like, so videos I think are sort of interesting, but they're also not really interesting because you can very often just sub sample images and solve the images rather than having to deal with the complex video. But yeah. All right, maybe one more question and then we'll go do some evaluation.

- So, these multi-modal models, when you only provide, let's say you only provide a single source of media, for example only text or only vision, how does it perform in that case? 'Cause it's obviously more geared for multi-modal cases. - Yeah, so I mean, that's one of the giant shortcomings of a lot of these models is that they're really just built for multi-modal stuff.

And so, what if I don't have an image, right? And so, I mean, that's why we did FLAVA, because we want to have one model that can do all of that stuff. And that's why in MMBT, so the supervised multimodal bitransformer, we actually have an analysis of like, how robust is this model to missing images or missing text?

So, I think a lot of folks working on these early VisualBERT-style models were kind of myopically focused on VQA, which is actually a great segue to what I wanna talk about next. So, it really depends on the task that you care about, as I said, right? And so, I think if I'm gonna tell you about multi-modality, I also have to tell you how you're gonna check that the multi-modal system is actually good at multi-modal things.

And so, that's the topic of evaluation, which actually is a super important topic. And a lot of people, they wanna be cool and build big models, but I think it should be way cooler to do proper evaluation of these models, especially if you're in academia because you only have limited GPUs anyway, right?

So, what can you do? Sorry, I don't wanna rub it in. (audience laughing) So, how do you check? Well, there's this amazing project. So, ImageNet really changed the history of deep learning, I think, and this other dataset, COCO, I think also really changed, especially vision and language, but also, I think, vision in general, where they have just a bunch of main multi-modal tasks.

So, these images are very richly annotated with all kinds of different things. So, like the segmentation of the objects, the bounding boxes, the labels of the bounding boxes, they come at like sort of a different pixel granularities. It's a huge dataset. It's very fine-grained, annotated in terms of like the categories that it has, and then you have five captions for each of these images.

And so, this really was the first dataset that unlocked a lot of sort of vision and language processing at scale because you had your picture and you had your caption, and now you need to figure out, okay, how do I give the right caption for this image? So, that's image captioning, or can I retrieve, given some piece of text, the right image or the image for the piece of text?

So, there's a bunch of very impactful datasets that do this stuff that we already talked about, like LAION, but COCO really is the main one still, I think, that a lot of people kind of use as the canonical instance of this dataset category. And then, the other thing that people really care about in vision and language processing is visual question answering.

And so, there really are a bunch of academic groups who are or have been so focused on this task that they didn't really care about anything else, and that's why you see a lot of models that are really optimized just for multimodal and nothing else. And you can see that kind of reflected in the citation counts as of last night, 3 a.m., where, so, VQA just has way more citations than image captioning datasets even, right?

And so, what you do here is you just have an image and then people ask very simple questions, so annotators, right, they ask these simple questions, they give the answers, and now we want to be able to answer these questions with machines. And as I alluded to earlier, one of the kind of embarrassing backstories of this dataset was that the initial version of the dataset was actually found to have images not really matter at all.

So you could just look at the question, something like "how many slices of pizza are there?" And, well, not in that particular case, but across almost all of the dataset, the right answer to a "how much" or "how many" question was two. So if you just predicted two for every "how much" or "how many" question, you got like 70% accuracy on the counting category.
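To illustrate that failure mode, here is a tiny sketch of the kind of "blind" question-only baseline that exposes these language priors; the toy training pairs and the crude prefix-based question typing are made up for illustration.

```python
# Sketch of a "blind" language-prior baseline: predict the most frequent
# training answer for each question type, never looking at the image.
# The toy data and the prefix-based question typing are illustrative assumptions.
from collections import Counter, defaultdict

train = [
    ("how many slices of pizza are there?", "2"),
    ("how many people are in the photo?", "2"),
    ("what color is the bus?", "red"),
    ("is the man wearing a hat?", "yes"),
]

def question_type(q: str) -> str:
    # Very crude: the first two words stand in for the VQA question-type buckets.
    return " ".join(q.lower().split()[:2])

majority = defaultdict(Counter)
for q, a in train:
    majority[question_type(q)][a] += 1

def blind_predict(q: str) -> str:
    counts = majority.get(question_type(q))
    return counts.most_common(1)[0][0] if counts else "yes"  # "yes" is the classic fallback prior

print(blind_predict("how many dogs are on the couch?"))  # -> "2", no image needed
```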

So careful dataset or evaluation benchmark design is really a skill, and you really need to think about what you're doing. You can't just set some data aside and evaluate on it; you have to really think about what you're measuring. And so there's GQA by Chris, actually, which is also just, I think, a better designed version of this dataset, maybe.

So you might want to use that these days. There are also kind of very targeted datasets that really try to measure one particular thing. And I think one of the things we really want to get at with these models is what we would call compositionality. So we want to be able to really take the parts and reason about the whole and understand the relationships between the different concepts.

So CLEVR was a very clever dataset that was designed really to measure compositionality both on the language side and on the vision side. So you have to understand the relationships between all of these different objects in the images. So that's been a pretty impactful dataset, I think, for really forcing people to think about compositionality.

But a lot of these datasets really had big problems. So one of the problems is they were too easy. VQA is sort of plateauing out; we can talk about that a little bit too. And they weren't really realistic either. Sure, if you could solve VQA, that's probably gonna make some people's lives better.

You're all like trying to process the memes. I can see everybody. (laughs) Okay, let's get to the memes first. So obviously, so these memes are not actually in the dataset. So I could put some really hateful memes about sort of Hitler or something, which are in the dataset, but that would be less fun.

So these are mean meme examples, to kind of demonstrate how the dataset was constructed. And so one of the problems we had, as I said, is that in VQA the V didn't really matter. What we want, if we care about multimodality specifically, is a dataset that you can only get right if you are good at multimodal reasoning, and otherwise you're just gonna screw it up.

And so what we came up with is: if you have a meme like this one, "love the way you smell today", I mean, that's not very nice if you send this to your friends, right? But it turns out that if you just swap out the background, now it's a very nice thing to say, right?

And like this one is, I don't know, you're maybe a bit weird if you like this, but there's nothing wrong with it, right? And so it's the same for this one here, like look how many people love you with the tumbleweed that's really sad. And like, if you change just one word suddenly it's like a really nice thing to say, right?

So if you want to solve this, if you want to classify this correctly for the meanness, then you have to really understand multimodal reasoning. You have to understand the relationship between the image and the text in order to get to the right label, right? And so it was really constructed by design to do that.
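For intuition, the kind of classifier this dataset is probing might look, at its simplest, like the late-fusion sketch below: concatenate image and text features and classify. The encoder choices, feature sizes, and class names are my assumptions, not the actual challenge baselines; the point of the benign confounders is exactly that an image-only or text-only version of this head cannot separate the classes.

```python
# Minimal late-fusion sketch for meme classification: concatenate an image
# embedding and a text embedding, then classify. Encoders and the 512-d
# feature size are illustrative assumptions.
import torch
import torch.nn as nn

class LateFusionMemeClassifier(nn.Module):
    def __init__(self, img_dim=512, txt_dim=512, hidden=256, n_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, img_dim) from a frozen vision encoder (e.g. a CLIP image tower)
        # txt_feats: (B, txt_dim) from a frozen text encoder (e.g. a CLIP text tower)
        fused = torch.cat([img_feats, txt_feats], dim=-1)
        return self.head(fused)  # logits over {benign, mean}

model = LateFusionMemeClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 2])
```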

And so how we did it exactly is we used some really highly trained annotators. And then one of the big problems with a lot of these datasets is that nobody really knows who owns the meme, for example, right? So somebody makes this meme, and now they technically own the copyright.

And so when I made this dataset, I was working at Facebook, and they were very afraid of copyright issues. So what we actually had to do is we had to pay people to make new memes. (audience laughing) And so not from scratch, we could show them kind of the actual examples.

And then they had to try to find images that kind of corresponded to the original source image and recreate the meme, but now with an image that we could buy from Getty. And so we gave a lot of money to Getty so that we could release the dataset to the public, so that people could actually do research on this and understand whether their multimodal models are good or not.

And so we really tried to make it so that we had these benign confounders, or co-founders, sorry, that's the startup world talking. So the confounder here is obviously: you have your original meme, and then you have your confounder where you swap out one of the modalities, and here you have the other one, right?

So we had our annotators do that as well. And so this led to a really nice dataset, I think, because it showed some of the intuitions that I think a lot of people in the field had, which is that multimodal pre-training doesn't really work. Is that an alarm? So multimodal pre-training doesn't really work.

And so all of this stuff that people have been doing with their fancy visual BERT models actually turned out to maybe not be that useful anyway, so maybe it got you like one point extra, right? From a visual BERT to a different visual BERT, less than a point, just from doing that multimodal pre-training.

So that means we still have to figure this stuff out, right? This dataset is far from solved, and we still have a long way to go despite all of these fancy models and a new paper coming out every week that does something new. Like, we're not there yet.

And I think that's encouraging, especially for you, like, you can go out and solve it. So what we did with this dataset is we organized a competition. We had 100K in prize money to try to see what people could come up with. And so there was a lot of nice work coming out of that, and we really managed to crank the numbers up by quite a lot.

But the solutions were slightly disappointing. So I don't know if you've ever used Kaggle, but if you wanna really win on Kaggle, you just have to ensemble the hell out of all of the different models that are current state of the art and then you're very likely to win, right?

And so that's what happened here: there wasn't really the fundamental breakthrough we had maybe been hoping for. So that still needs to be built, I think. So there's this other dataset I just wanna briefly talk about. The theme of this section, sort of, is that if you make a dataset, think about it very carefully, because you can be really creative with this and really measure the things you're trying to get at.

So this dataset, Winoground, we were trying to figure out, okay, how good is CLIP actually? So it looks really amazing, and it's way better than things that were previously there, but does it understand compositional relationships in the same way that humans would, or is it sort of just fitting the data distribution, where it can be very good at the head of the distribution but terrible at the tail?

And you can probably already guess where this is going, but so just to give you an illustration of what is in this dataset, you would have some plants surrounding a light bulb or you would have a light bulb surrounding some plants. So notice that the words here are exactly the same words, but in a different order, right?

And so the visual depiction of these words is very, very different. So if your contrastive model is actually good at understanding the visual-semantic, or visual-linguistic, compositionality of these examples, then it can get it right. But again, if it's actually just overfitting on the data distribution that it's seen, and it's just kind of biased toward what it sees often, then it doesn't really get it, right?
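For reference, scoring this kind of paired setup typically looks roughly like the sketch below: each example has two captions and two images built from the same words in a different order, and the model only gets full credit if it matches both directions. Here `score(c, i)` stands in for whatever image-text similarity your model produces (e.g. a CLIP cosine similarity) and is an assumption, not the exact evaluation code.

```python
# Sketch of Winoground-style scoring over a (c0, c1, i0, i1) example.
def text_score(score, c0, c1, i0, i1):
    # The model must prefer the matching caption for *each* image.
    return score(c0, i0) > score(c1, i0) and score(c1, i1) > score(c0, i1)

def image_score(score, c0, c1, i0, i1):
    # ...and the matching image for *each* caption.
    return score(c0, i0) > score(c0, i1) and score(c1, i1) > score(c1, i0)

def group_score(score, c0, c1, i0, i1):
    # Credit only if both directions are right; random chance is 1/6 here,
    # which is why "below random" is such a damning result.
    return text_score(score, c0, c1, i0, i1) and image_score(score, c0, c1, i0, i1)

# Toy usage with made-up similarity numbers.
toy = {("c0", "i0"): 0.9, ("c1", "i0"): 0.2, ("c0", "i1"): 0.1, ("c1", "i1"): 0.8}
score = lambda c, i: toy[(c, i)]
print(group_score(score, "c0", "c1", "i0", "i1"))  # True
```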

And so one paper that we use as a source of inspiration for this work is this paper here, "Order Word Matters Pre-training for Little." So we actually found that the order of words doesn't even matter that much for general pre-training very often, which is also kind of a scary thing, right?

So this is deep learning for NLP. We think that language is really important, but these models can reason about language even if you shuffle all the words. And so that's probably not what we want to have. And so that doesn't tell you something about how great we are as researchers.

It tells you something about how terrible our evaluation benchmarks are, right? And that's what we need to fix. So what we did with this data set, here are some other nice examples, like there's a mug in some grass or there's some grass in a mug. Like these are very different pictures, right?

And so for us, these are trivial. Or, like, what's the difference between a truck fire and a fire truck? It's pretty important, I think, to get that distinction right. So guess what? State-of-the-art models often perform below random chance. So, you know, as I said, we still have a lot of work to do, which is good.

And so when this paper came out, I think the reaction was really nice. And so then DALL-E 2 came out, so you've probably heard of DALL-E 2, right? It's sort of like Stable Diffusion, but before Stable Diffusion. And so this was really the first model that really showed just how impressive these generative models can be when they're creating images.

So this is "there's a mug in some grass." You do have to kind of cheat a little bit, because you have to add "digital art" here. If you don't add that, then it breaks down completely. So it's sort of prompt hacking, I think, or sort of tuning on the test set, but okay, you know.

So this is pretty good, right? So it definitely is better than I think a lot of people would have expected even a couple of years ago. But it's not perfect because people on the internet like to take more pictures of spoons than forks. So if you say there are fewer spoons than forks, or there are fewer forks than spoons, it just really likes spoons more.

(audience laughing) You know, and so maybe it's like the Matrix or something, I don't know, but spoons are just nicer. So again, what you can see here is that these models really are just reflections of the data that they're trained on, right? And yeah, so models are getting better, but if you've looked at Stable Diffusion, it still can't count fingers and things like that.

So again, there's still a lot of cool work to be done. Any questions on evaluation? (man speaking faintly) No, okay. So let's talk about other modalities then, because so we've really just been focused on images and images are great. There are lots of images on the internet. And so that makes it sort of an obvious thing to focus on.

It's also, I think if you look at our brain, like vision is a very dominant modality, right? So how we understand the world is very vision driven, but it doesn't have to be the case. So there's all these other interesting problems that involve different modalities. And so the most obvious one is just speech or audio, right?

So after seeing comes hearing, and really we could do another lecture just like this, just on speech and audio. And there's lots of interesting stuff to talk about. Obviously we don't have time, but I'll give you another nice example of how amazing Alec Radford is at creating datasets. So there's this Whisper model that came out of OpenAI not too long ago, which was trained on 680,000 hours of multilingual, multitask speech data.

So speech with transcriptions, and they trained this very fancy thing on there, which actually is not very fancy at all. It's just a log-mel spectrogram, so that's how you represent the audio signal, and then you feed that into a big transformer. So this is sort of your encoder self-attention here, right?

And then you have your decoder where you have your cross-attention, and then you just generate the sequence. So this is a basic encoder-decoder transformer model, but your input is convolutions, one-dimensional convolutions over the log-mel spectrogram. And so there's lots of papers that do very similar things.
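As a concrete aside, using the released Whisper checkpoints is about this simple; the openai-whisper package, the model size, and the audio path below are assumptions of this sketch.

```python
# Sketch of transcription with a released Whisper checkpoint via the
# openai-whisper package; model size and audio path are assumptions.
import whisper

model = whisper.load_model("base")             # log-mel frontend + transformer encoder-decoder
result = model.transcribe("lecture_clip.mp3")  # chunking, decoding, language detection handled internally
print(result["text"])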

There's models like wav2vec that try to turn the wave signal into vectors, or you can discretize it in lots of different ways. So there's a wealth of literature. Then I think one of the funny observations, actually, is that you can just reduce audio to vision anyway, right? That's sort of what you could argue this log-mel spectrogram does. But, not to toot my own horn, back in 2017 I did this paper where we showed that you can just take a real audio sample and turn it into a kind of spectrogram, really just a spectrogram.

So what does the spectrogram of the audio file look like? Feed that to a regular convnet, like an AlexNet even, and that gives you amazing auditory features. So now you can use this to distinguish between violins and guitars and things like that. So maybe you can just reduce all of this to vision.
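A rough sketch of that trick, assuming librosa for the spectrogram and an ImageNet-pretrained AlexNet from torchvision as the feature extractor (these library choices and the audio path are mine, not the paper's exact setup):

```python
# Sketch of "reduce audio to vision": turn a clip into a mel spectrogram image
# and reuse an ImageNet-pretrained CNN as the feature extractor.
# Library choices and the audio path are illustrative assumptions.
import librosa
import numpy as np
import torch
import torchvision

y, sr = librosa.load("violin.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)              # (128, time) "image"

# Normalize to [0, 1], fake 3 channels, resize to what AlexNet expects.
img = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
x = torch.tensor(img, dtype=torch.float32)[None, None]     # (1, 1, 128, T)
x = torch.nn.functional.interpolate(x, size=(224, 224)).repeat(1, 3, 1, 1)

alexnet = torchvision.models.alexnet(weights="IMAGENET1K_V1")
alexnet.eval()
with torch.no_grad():
    feats = alexnet.features(x).flatten(1)                  # auditory features
print(feats.shape)
```

Those features can then go straight into a simple classifier for instruments, environmental sounds, and so on.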

So one question maybe you could ask is like, can we also reduce language to vision or vision to language? So that's sort of what people are thinking about. So we talked about the video. There was a question about video. So a lot of these ideas also extend pretty directly to video, but now you just have more data, right?

So like Flamingo already had a bunch of different images in it. You can do Flamingo over videos. Probably a lot of the images are pretty useless for what you're trying to do with this video model, right? So they're too similar. It doesn't really add all that much information. So you wanna sub sample the frames so that you get the most useful information out of your video.
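A minimal sketch of that subsampling step, using OpenCV to grab evenly spaced frames; the reader library and the video path are assumptions here.

```python
# Sketch of frame subsampling: keep a handful of evenly spaced frames from a
# video before handing them to a joint vision-language encoder.
import cv2
import numpy as np

def sample_frames(path: str, num_frames: int = 8) -> np.ndarray:
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)  # (num_frames, H, W, 3), ready for an image encoder

frames = sample_frames("cooking_clip.mp4")
print(frames.shape)
```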

And so there's a bunch of approaches that kind of take the key frames and then just do a standard joint vision-and-language transformer encoder thing on top of that. So this is hopefully by now becoming a very familiar recipe, right? And so MERLOT is a nice architecture that does this.

And then they came up with MERLOT Reserve, kind of a silly name, where they also added audio to this model. So this is now a tri-modal model, right? And so we're going towards this foundation model that can consume all of these different modalities in one go. And that's really a clear trend in the field.

Another very interesting direction, I think where in the field, we were very excited about this for a while, but I think it's sort of gone now because it's too difficult to create lots of high quality data in this setting. But what you can do is you can have simulated environments.

So this is a paper from DeepMind from 2017 where they had this agent walk around in a maze, and it could follow natural language instructions. It could also generalize to made-up words like "dax" and "blick", to different sorts of groundings and assignments that you could do in that environment. So this is a super interesting direction, I think, in the long term, because this is how humans learn language, right?

Like we walk around in the world, we interact with our environments. We have all of these different perceptual observations. We synthesize them in our brain. We manipulate objects. We change our own viewpoint. And that's how we learn everything we know about the world. And so our language is very intricately connected to that world and how we observe it.

So I think that might make a comeback at some point in the future. You can also do other stuff, especially with this kind of conditioning on text that we're seeing a lot of, right? So, you know, DALL-E 2 and Stable Diffusion and all of these different things.

And the original GAN we talked about at the beginning. You can do the same thing, but now you're generating 3D point clouds, right? So this is a 3D corgi, generated from a prompt like "a corgi". And so this prompt can probably become much more complex over time. And you can do sort of AutoCAD design and just say, give me a house, and it's just gonna design the whole house for you.

So you can just like tweak the prompt and things like that. Like that's all coming or even already here in many cases. So the final modality I just briefly wanted to talk about is olfactory embeddings. (audience laughing) And so olfaction means smell if you didn't know. And so it turns out, so my PhD thesis was about grounding semantics in different perceptual modalities.

So a lot of my work started in vision and then it's like, okay, now audio is sort of the obvious next one, right? So you can learn the meaning of violin and then maybe you can learn that violin, like what a violin looks like and what it is and what it sounds like.

And that's gonna give you a richer representation. But for a lot of these words, what's actually very primitive to their meaning is what they smell like, because in our brains, that's really one of the core areas and one of the oldest areas in your brain. So what you can try to do if you want to complete all of your perceptual modalities is you can try to build olfactory embeddings.

So it was kind of a joke paper I did, but the funny thing is it actually worked. So there's a catalog, this Sigma Aldrich Fine Flavors and Fragrances catalog, where you can look up words like melon and pineapple and then it's gonna give you all of the chemical compounds that produce this smell or taste.

And so if you do that, then you can count the occurrences and then you can sort of do SVD or something like that on it to get it to be a bit more of a real embedding model. So now you get smell embeddings, smell vectors, and then you can compute similarity judgments between these smells.
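Here is a toy sketch of that pipeline, a word-by-compound count matrix, truncated SVD, then cosine similarity; the tiny compound lists below are invented for illustration and are not taken from the actual catalog.

```python
# Sketch of the bag-of-chemical-compounds idea: count matrix over compounds,
# truncated SVD for dense "smell embeddings", then cosine similarity.
# The compound lists are made up for illustration.
import numpy as np

vocab = ["apple", "pear", "chocolate", "coffee"]
compounds = {
    "apple":     ["ethyl_acetate", "hexyl_acetate", "butanol"],
    "pear":      ["hexyl_acetate", "ethyl_acetate"],
    "chocolate": ["pyrazine", "vanillin"],
    "coffee":    ["pyrazine", "furfural"],
}
all_c = sorted({c for cs in compounds.values() for c in cs})
M = np.array([[compounds[w].count(c) for c in all_c] for w in vocab], float)

# Truncated SVD -> low-dimensional smell vectors.
U, S, _ = np.linalg.svd(M, full_matrices=False)
emb = U[:, :2] * S[:2]

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

print(cos(emb[0], emb[1]))  # apple vs pear: high (shared compounds)
print(cos(emb[0], emb[2]))  # apple vs chocolate: low
```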

So it turns out apple smells like pear, and chocolate and cocoa and sweet and coffee are sort of related. So you get these clusters of different smells, just based off of their chemical compounds. So this bag-of-chemical-compounds model gives you a very rich representation. And so you look at all of the words that are concrete enough to have a smell; if you have a word like democracy in there, that doesn't really smell like anything.

So you ignore democracy, you just focus on the things that smell or that could smell, I guess. And then, so the really interesting thing to me is that this is much more correlated with human similarity judgments than the linguistic vectors we had at the time. So for a word like apple, like you can just get a word vector like you've learned in your first lecture.

And so you can do like skip gram and things like that. But that thing is not going to be as correlated with human similarity judgments as this bag of chemical compounds model. So that's pretty interesting. So even something like smell where maybe we think, this doesn't really matter. If you really want to understand how humans understand language, then maybe you want to include this in your foundation model too.

But I would start with the other modalities. All right. Okay, yeah, sorry. So where to next? I'll just, I think I've already said most of this actually. So one foundation model is going to rule them all. And so, I mean, there will be many of these, but a lot of them are going to have very similar traits, I think.

We're going to be looking at scaling laws and trying to understand really what the relationship between the different modalities is, which one we want more of, that sort of stuff. We're gonna have retrieval augmentation; this is gonna be really huge. If you've heard of RAG, or if you haven't, you should look it up.

So all of these parts of these models can also be multimodal. We need way better evaluation and better measurements. We already talked about that too. And that's all I have, thank you. (audience applauding)