
Stanford CS25: V1 | Transformers in Vision: Tackling problems in Computer Vision


Chapters

0:34 General Visual Representation
4:08 The Visual Task Adaptation Benchmark
7:26 Self-Supervised Pre-Training
7:58 Semi-Supervised Training
21:22 Synthetic Images
26:33 Applying Transformers to Vision
26:49 Embedding Space
42:05 Early Convolutions
45:28 Patch Size
46:24 Inference Speed
59:31 Scaling the Data Set

Transcript

Today, I'm going to talk to you about vision transformers, since this is all about transformers, specifically their application for visual representation learning. But before we jump into transformers, I'm going to spend like 10 or 15 minutes giving you a lot of context on all of this, and specifically also on the vision part of things, because I think the majority of what you have seen and will see will be about language.

All right, so let's get started. My goal, and that of my close collaborators, is to find a general visual representation, and you're soon going to see what that means and why, or what we can do if we imagine we have a general visual representation. The hope is that with this, we can kickstart all kinds of tasks that require visual input.

That means most tasks that you do when you have your eyes open, basically, because if you have a good understanding of what you see, then you can much more quickly understand what's going on and what you should do. And eventually, I've had a little kid for a year now, and so I really want that when he's grown up, there is some kind of robot.

It doesn't need to be nice and pretty like in movies, just maybe an arm or whatever, that my kid could teach, or my parents who cannot program can teach to do some boring task that they really don't want to do. And I believe one component of this is a good visual representation that generalizes to understanding the world visually everywhere.

It's not all that's required, but it's one part, and the part that I'm trying to push. So this is for context and motivation on working on general visual representation, and one good example of a general visual representation is the humans, and I'm going to show you what I mean by that.

So here is a task that I give you. There are three classes, class A, B, and C, and I give you five images of each class, okay? And here I give you a new image, and I'm sure that by now you all know which class it is. I'm not going to ask because I don't actually see you.

If I was in the room, I would do the raised hands, but I'm sure you know it's class A now. Okay, this is fine. We have seen millions of flowers in our lives, hopefully, but there are other kinds of pictures, like these satellite images that you don't see much in your life.

Some people may have never seen them, or only sometimes, like when you fly or maybe on TV or on the Internet, but it's rather rare. But still, same story. Three classes, class A, B, C, five images of each, and I show you a new image. This might be a little bit less trivial than the flower, but I think I've spent enough time talking that by now, most of you should know that this is class B.

Shows a, what is it, basketball court, right? All right, now even more abstract. You don't see this in real life, all right, but still, I give you images of class A and B. I have just two to make it a bit easier here because you need to use your brain a little bit more, and I show you this new image, and now I should do a little bit of small talk to let you think, like you see that there is like spheres, boxes, and whatnot, and by now, I hope that most of you know that this is class A.

Why? Because there is three objects in class A, and class B is always, what is it, five objects, no matter what they are, what they look like. Okay, I think by now, you more or less understand what I mean when I mean a good visual representation, general visual representation, right?

Some, I don't know how to call it, in your brain, in your eyes such that you can quickly see something new and understand what's going on with just a few examples, and then generalize from that, right, and that's the goal. Then the next step, if you have the goal, how do we measure progress towards it?

And this is a paper we did a few years ago with my collaborators, which we call the Visual Task Adaptation Benchmark. It's kind of a formalization of the little game that we just played. So it's a benchmark, and there is some component that you, or anybody who participates in the benchmark, does, which is creating a model with some data.

We don't really care what data, what model, how, what not. Just you come with a model. Then we come with this landscape of all possible visual tasks that kind of make sense, which is a vague statement, and we sample some tasks from that, and this is kind of the task that you have just seen.

They were actually taken out of this Task Adaptation Benchmark, and we have, for a first step, made 19 such tasks where we try to cover broad types of visual tasks, not just classification of natural images like these dogs and cats things, but also of very specialized images like satellite images, and also non-classification tasks that involve counting, like the one I showed you before, right, that can be expressed in this simple classification API but logically require some more thinking.

Some things like distance, we have something with cars and with distance of the closest car and things like that. It should cover a broad range of variation, and then with the model that you came to this benchmark, you can do some adaptation step on each of the datasets, one after another or at the same time.

It doesn't really matter, but then you should have, as a result, a model for this dataset, which is very small, that has just seen a few examples for each class and then performs well there. Then we just take the average score across all of these tasks, and this is what we call the VTAB score, and this is how, for now, we judge how good a general visual representation your model and adaptation algorithm have. And now just for some nomenclature: for this preparation, we have words that we often use. One is pre-training.

Sometimes we call it the upstream, like upstream data, upstream training, something, so I may use this word interchangeably with pre-training. And then there is the second part, which we usually call transfer, and then sometimes we say downstream. And the adaptation, in principle, is whatever you want, but for our work, we almost always just use very simple fine-tuning without any bells and whistles because it's simple and works well.
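To make the recipe concrete, here is a minimal sketch of the protocol just described, with hypothetical placeholder names; this is an illustration of the idea, not the actual VTAB code.

```python
# Sketch of the benchmark protocol: pre-train once, then adapt and evaluate
# independently on each downstream task, and report the average score.
# `adapt` and `evaluate` are hypothetical stand-ins, e.g. plain fine-tuning
# on the small training set and accuracy on the task's test set.

def vtab_style_score(pretrained_model, tasks, adapt, evaluate):
    """tasks: list of (few_shot_train_set, test_set) pairs (19 in the benchmark)."""
    scores = [evaluate(adapt(pretrained_model, train_set), test_set)
              for train_set, test_set in tasks]
    return sum(scores) / len(scores)   # the reported score is just the mean
```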

In general, we try to do things as simple as possible as long as it still works well, and so sometimes I even just say, like, fine-tuning, and that means moving from this pre-training to the transfer. All right, so far for the setting, so far so good? Good. Then the question is, how do we get there? We spent a lot of time thinking about this and trying different things, and this is also roughly the outline of all that I have available to talk about, which doesn't mean we're going to cover everything, so I'm not going to go, like, through the outline exactly, but you will see this again and again, and as you see, the vision transformer only comes a little bit later.

There's some stuff before that, so this one, just really quickly because it doesn't matter for this course, is that we spent some time trying self-supervised pre-training, which is very popular in language and in vision has only recently become popular, and it doesn't work that well. You don't need to understand these bars, but basically higher is better, and here, just look at the blue ones.

That's the VTAB score for this few-shot VTAB, and self-supervised learning performs like this bar. We tried multiple methods and multiple models and so on. It was a proper good benchmark, but it was a couple years ago. Then we moved on to semi-supervised training, so a few labeled examples and a ton of unlabeled examples.

That's this next blue bar. Did you actually see the mouse cursor? Sorry. - We don't see the mouse cursor. - Maybe I need to do some laser -- - Oh, we can see it. We can see it. - Yeah. - Oh, okay. - Yeah, so then semi-supervised is that blue bar which is a lot higher than this other blue bar, so what this means to us is that by adding a few labeled examples, we're able to get much better or much more general visual representation.

Then I'm not going to spend more time on this and how exactly and so on, but I'm going to move to the next one, which was for us kind of a breakthrough when we figured out that, well, if we just scale up fully-supervised pre-training, then we get really much better representations than everything we've seen before, and here I want to briefly spend some time on that one because it's the precursor to using transformers in vision.

So the idea is simple. There are tons of images on the Internet. That's always what you hear is motivation for semi-supervised or unsupervised learning, right? But actually, where these images come from, there's almost always some extra information, like surrounding the image on the Web or if you collect it otherwise, there's some extra information there that you could use as some weak source of information or some weak label, right?

Then it happens that in Google, there's some team that actually does this for production, and they have collected already a large dataset with some pipeline that from the surrounding signals somewhat automatically, but very noisily annotates the images, and we wanted to figure out how far can we go when we scale up pre-training.

Then, long story short, you need a couple of ingredients. One is patience. I really like this plot. This is one of the curves of just pre-training on large data with large models. The details don't really matter. The gist is that if I zoom into this little box, I see this here, and this is the metric for the training, like the performance in upstream.

Then I see this after spending eight GPU-weeks of compute. What does GPU-week mean? It means one GPU for eight weeks, or eight GPUs for one week, or 16 GPUs for half a week, and so on, right? But this looks flat. A reasonable person would say, "Yeah, there's no progress for a week on eight GPUs.

This is flat. I'm going to stop and try something else," but we are not reasonable, so we keep going, and this is what the exact same spot looks like after eight GPU months of training, and you can clearly see the things are progressing, right? So it may not always be obvious, and you need patience.

The second thing is that you actually need to scale up everything, so this was work done with ResNets, not yet with transformers. You see a lot of ResNet models here. The x-axis is the number of images available. In vision, there is this ImageNet dataset, which is a very common, super common dataset for pre-training, and which has 1.3 million images.

There's another one which has 10 times more images that's still public, and then there is one subset from this internal group that has 300 million labeled images. The y-axis is a measure of accuracy on some task, and we tried many; they all look similar. And the dots are differently sized ResNets.

The blue dot is the standard ResNet-50 that everybody uses. If you train this one on more data, it looks promising at first, but if you go to even more data, it looks like, oh, okay, this doesn't really seem that useful, and this is what most people have been doing for a long time, and a lot of people, even in Google, were like, yeah, I tried this internal checkpoint on these tons of data.

It doesn't really help that much. However, what we found out, and in hindsight it's kind of obvious, is that you actually need to scale not just the data but also the model. Here, this blue dot is a gigantic ResNet that is slow as hell, but when you scale this up together with the data, you keep getting benefits from adding more data. And then if you do these two things, scale up everything and be patient (being patient, you could also say, is scaling up your patience).

Then you get a lot of benefits. So here there is the few-shot transfer learning that I showed you before; on the x-axis is the size of the model, on the y-axis is the accuracy on one of these tasks, but again, the others look similar, and these three different curves are pre-training with different dataset sizes.

With the green one, the standard one, you don't really see benefit, or only a small benefit, from going with larger models. The blue one is 10 times larger. You start seeing some slope upwards, but really only with this giant data do you start getting better and better and better at this few-shot transfer learning when you pre-train on more and more data with larger and larger models.

The second benefit, which we did not anticipate really at all but then found out, is that these models are super robust when you scale everything up. This is ObjectNet. It's a dataset that's specifically designed to measure robustness, and it shows things in crazy contexts, like a chair in a bathtub and things like that, and you should recognize it as a chair.

Here, the x-axis is, again, how large the model is, and the pink dots are existing models from the literature, and then these lines, same color coding, are what we found out. Again, you see this large data, and then going to a large model just gives you amazing benefits on, like in this case, out-of-distribution robustness.

This was amazing. Scale up everything, be patient, and get huge benefit. - Sorry, Lucas. Sorry for interrupting you, but there is a question from a student in the class. - Yep. - Right. Do you want to unmute yourself and ask it yourself? - Yeah, I can ask my question.

Can people hear me? Maybe there's some-- - Yes. - I'm sorry, one second. Let me just step away real quick. Yeah, so the question I wanna know is, what work has been done characterizing the parameters after pre-training finishes? Like, the reason why I'm motivating this question is, it seems like we do this tremendous amount of pre-training, but it seems like we might be able to significantly reduce that if we just have smarter initialization schemes.

- Yeah, you know, I've been thinking about this for a long time, actually, also. And I've come to conclude that I think not. I think there are, like, two parts. One is, like, what I like to hand-wavily call the numerics of the weights. You know, that everything is in a nice range, such that it can have nice input/output functions, and so on, and that your optimizer can do steps that make reasonable changes to the input/output function, but not too large, and so on.

I think that is part of it, and that you can get through good init or good normalizations and whatnot. But then I also think there is, I do think that these models memorize a lot, and then, personally, I believe, but I don't know of evidence or so, that these models do more kind of, you know, remembering similarity to things they've seen in training.

And then, as you grow things up, they have more memory, and they have seen more things, so they should be better on more newer things, because there's more similar things they have seen. And this, I don't think you can, like, just create one shot from initialization. But I don't have the immediate pointer to a paper at the top of my head now to answer your question.

- Okay, thank you. - I think we also have more questions; someone has posted in the chat and someone is raising their hand. Maybe in this order, you wanna ask your question first? - Yeah, for sure, I can go ahead. So I just have a quick clarification on this chart right here, the chart number three.

The BiT-L and BiT-M and BiT-S, are they the same model architecture, but just trained on different datasets? So BiT-S is trained on the 1.3 million images, all the way to the 300 million image dataset for BiT-L? - Yes and no. The architecture is here on the x-axis.

So within one vertical slice, these are the same architecture. And then the different points are random restarts, because when you do few-shot learning, there is a lot of variance in which few examples you see. And then again, this next vertical slice is the same model and so on.

And as you go to the right, the model gets larger. And so you can see that for this little data, going to a larger model doesn't really help you much; it only really helps for this giant data. So the L means the giant data, not necessarily a giant model, in this case. - Right, that makes a lot of sense, thank you.

- Okay. - Do you have a question? Oh, I see you're raising your hand as well. Go ahead and let Otto. - Hey, yeah, thanks. What is the intuition for the upstream performance in figure one spiking so suddenly at like 60 or 40 points in training? - Here, right?

Yeah. - Yeah, yeah, I'm looking at it again, like around one point, like, I don't know, that just seems like an odd looking training curve. So like, what's the intuition behind that? - Yeah, this is old school computer vision thing, or old school, I mean, a few years ago.

It is when the learning rate changes. In computer vision, it used to be very common to have the learning rate in a kind of staircase pattern. So it's constant for a while, and then you stop, you divide the learning rate by 10, usually, boom, smaller, and then you continue.

And this gives you this huge jump. And nowadays, people don't use this much anymore. And this work was like three years ago, I think, or two or three years ago, I don't remember. It was very common back then. And nowadays, people use more continuously changing learning rate schedule, and then you don't really have this sudden change anymore.

But if you would overlay it, it would be more continuous, but going roughly the same way. And then in language, I think most people, or many people, use just a linearly decreasing learning rate schedule, where you also don't see this effect, because the learning rate continuously decreases. - Okay, yeah, sounds good, thanks.
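For reference, here is a small sketch contrasting the old-school staircase schedule with a continuously decaying one; the boundaries and values are made up for illustration.

```python
import math

def staircase_lr(step, base_lr=0.1, boundaries=(30_000, 60_000, 80_000)):
    """Step schedule: constant for a while, then divide by 10 at fixed points.
    Each division produces the sudden jump seen in the training curve."""
    lr = base_lr
    for boundary in boundaries:
        if step >= boundary:
            lr /= 10.0
    return lr

def cosine_lr(step, total_steps=90_000, base_lr=0.1):
    """A continuously decaying alternative: no sudden jumps in the curve."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
```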

- And then, because you asked about this dotted line: actually here, if you're like here, you could say, okay, but this is excessive, right? Maybe it does really seem almost flat. Maybe you could have started the decay earlier, and earlier, and earlier, and then you would get the same, but much quicker.

And this one shows what would happen then. And you do land at a much worse place in the end than when you are patient. - All right, yeah, yeah, that makes sense. Thanks. - Were there more questions, or should I continue? - I think both of you have your answers. - 'Cause I need to mention, I don't see you, I just see my slide.

- Yeah, it's fine, we can coordinate that with this. - Hi, yeah, so I just wanted to make sure that I'm on the same page. So basically what you're trying to do is multitask learning with convolutional neural networks/LSTMs, right? That's kind of like ResNet. But you're doing multitask learning, correct?

- No, where does the multitask come from? Or where does it come from? - Because like, initially, like you showed like different, - Ah, yeah, okay. - Yeah, okay. - So there is two phases. The first one is the pre-training. And this pre-training, I didn't mention it yet. I just said, I don't care what you do in the pre-training, just pre-train somehow, and give me the model.

And then I test it on multiple tasks independently. And testing it on multiple tasks means, like, transferring it to that task, which in our case means fine-tuning it just on that task and then seeing how well it does, and so on. But it could mean other things. Like later we moved to just learning a linear regression on top of the embeddings for each task.

And now during the pre-training, what we do is just regular supervised learning, but just scaling everything up. And regular supervised learning is just, well, not multitask, but multilabel, in the sense that an image could have a couple of labels, but it usually doesn't. But this is minor.
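As an illustration of what "multilabel" supervised pre-training can look like, here is a minimal sketch with one independent sigmoid per class; this is just an example of the idea, not necessarily the exact loss used in that pipeline.

```python
import torch
import torch.nn.functional as F

def multilabel_loss(logits, targets):
    """One independent sigmoid per class, so an image may carry several
    labels (though it usually has just one)."""
    return F.binary_cross_entropy_with_logits(logits, targets)

# Hypothetical usage: 4 images, 10 possible labels, multi-hot targets.
logits = torch.randn(4, 10)
targets = torch.zeros(4, 10)
targets[0, 3] = 1.0          # most images have a single label...
targets[1, [2, 7]] = 1.0     # ...but a couple of labels is possible too
print(multilabel_loss(logits, targets))
```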

- Okay, got it. - Thanks. - All right, we have a question. - Yeah, I just have a quick follow-up about the discussion that was started earlier about memorization, like models memorizing the data in the pre-training datasets. So I know on the language side, there's a quite interesting phenomenon that you can pre-train on a synthetic language that doesn't have any semantic meaning but only has structure, like paired parentheses or things like that.

And that actually gives you almost the same boost in your downstream transfer as normal pre-training. So this means that for language, the structure seems to make a lot of the contribution, and it can be provided by synthetic data. But I don't know if it's a different case for images. Have people done, maybe, some synthetic pre-training dataset for images?

- Yeah, there was a paper, I forgot the name and the authors, but it creates completely synthetic images, and like, not even renderings of some realistic things, but just completely abstract patterns, waves, and shapes and so on, and uses that for pre-training. And then it shows that they get almost the same performance as ImageNet pre-training; they actually do this with vision transformers.

But yeah, they never go further or it is not clear, you know, they kind of show that you can almost get to this point here. That is not clear how much further can you go with this. And I think probably not much further, but it's just me guessing that not much further, I don't have evidence for it.

- Right, so I have one question and then we can continue with the talk. You said that you think the large vision models are learning some sort of similarity to the dataset they're trained on. So do you think they are behaving like prototypical networks, in a sense?

- They're behaving like what networks? - Oh, so like prototypical networks? Essentially like when you're doing few-shot learning, you just say, like, "I'm going to learn a network." - Yeah, yeah, yeah. - And learn the metric space. - Probably not exactly, but close-ish. - I mean, I cannot really say, because this is just some intuitive guess that I have.

That's what they do, but nobody really knows what the models do, right? Yeah, I mean, when we do something like prototypical networks for the few-shot learning with these pre-trained models, we do get worse performance than when we do fine-tuning. So there is a bit more to it still.

However, I don't know what this more is. (laughs) - Okay, thanks. - All right, let's continue. Okay, yeah, so, ah, right, and I didn't mention, but on ImageNet, which is the top benchmark in computer vision, with this work, with Big Transfer, we finally were able to increase the score after a long period of a couple of years of no improvement, despite many attempts, which you see as the gray dots.

This was, yay, awesome. Pre-training, scaling up everything, and leveraging the data. And then, okay, let's not care about that. Yeah, that's, okay, this is just a little aside, that if you are in the setting that I mentioned of pre-training on huge amounts of data and then testing on many other tasks, you should, of course, be careful that you don't have images from the other tasks in your pre-training data, right?

Otherwise, you have seen them during training, and then you're not really generalizing, and you're just fooling yourself with good scores. And this is a real danger when we get huge amounts of data, because, like, ImageNet images can totally be in huge amounts of data, right? So we actually use an internal pipeline that is really good at finding duplicates, and also near-duplicates, like when they are shifted, rotated, squeezed, color-changed a bit, whatnot.

It still finds them. And we use this to completely remove all images from the test datasets that we test on later. And we actually found that a lot of classic vision datasets have clear duplicates between their training and validation sets, and between the training set of ImageNet and the CIFAR-10 and CIFAR-100 test sets, and so on.

So near-duplicates are quite a widespread problem in vision. And this slide is just to say, hey, there are problems, but in all that we present, we actually took care that in the pre-training, as best as we can, we don't have near-duplicates. Right, now back to being like, hey, we figured out large data and a large model, and then things get really good.

And that's how we got to transformers, basically. In computer vision, everything was convolutional networks for many years. And basically there was nothing else, CNN is king. However, in language, we saw a transformation recently, right, that everything used to be LSTM, everywhere LSTM was king, and then came the transformer.

And in the case when there is a lot of data available, suddenly transformer worked much better than LSTM. For little data, that was still not the case exactly. So what we then thought is that, okay, so we are now in this regime where we have tons of data and we see benefit from it.

Can we see even more benefit if we try also out the transformer architecture in vision? And that's basically what we did. To be fair, there were a few other attempts at trying out transformer in vision before, that I don't want to detail too much here because I don't want to point fingers too much, but they were all not really using transformers for learning everything from the data.

It was always like, get something out of a ResNet first, like object detection proposals or high-level feature maps or things like that, and then stick a little transformer on top. But we wanted to go all the way, just transformer everything. And so we came up with the simplest and most natural, I believe, way of applying transformers to vision, which is you take the image, you cut it into pieces, and that's it, like a puzzle.

Tack, tack, tack, patches, and that's it. Each of these patches, you take it and you project it into your embedding space, which is the input to the transformer. The embedding space is just an abstract space of, let's say, 768 dimensions, for example. How do you embed it? You just take the pixel values and put a linear projection layer on top.

So take all the pixels, flatten the vector, matrix multiply into whatever size you want, and use the same matrix for all the patches. And here we just went the simplest way ever with non-overlapping patches and everything. You can, and people later did, go on and say, "Hey, this is almost a convolution.

Let's make it a proper convolution. Let's make a stack of them," whatnot. But that is all follow-up work that came later. This is just the simplest way to do it first. Then we have these embedded patches, and we treat them exactly, literally like the tokens in language, and then give them to exactly the BERT transformer from the language folks.

And just like in language, we add this class token, or I think in language it's like an end-of-sentence token or something. And we add the position embeddings, which can be learned, to the tokens. And then we feed all of this to a transformer encoder, which has an MLP head, which reads out this class token and then maps it to a softmax layer for classification, for example.

And that's it. That is the vision transformer. So it's literally a BERT transformer, but instead of words or sentence tokens, you feed in patches transformed into tokens. And that's it. And then just the same story as before: scale everything up. Compute, dataset, model size, patience, everything. And see what happens.
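To make that description concrete, here is a minimal PyTorch sketch of the pipeline: cut into patches, linearly project, prepend a class token, add learned position embeddings, run a transformer encoder, and read the class token out into a classification head. It is simplified (it ignores pre-norm, GELU, and other details of the actual model), and the hyperparameters are only roughly those of a base-sized model.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Simplified sketch of the pipeline described above (not the real code)."""
    def __init__(self, image_size=224, patch=16, dim=768, depth=12, heads=12,
                 num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch) ** 2          # 14 * 14 = 196 with the defaults
        # "Cut into patches + shared linear projection" done as one strided conv.
        self.to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Freely learned position embeddings (class token + one per patch).
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)           # reads out the class token

    def forward(self, images):                            # images: (B, 3, H, W)
        x = self.to_tokens(images)                        # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)                  # (B, 196, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed   # prepend cls, add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                         # logits from the class token
```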

Is this good or not? That was the question. And now we can see a plot here. This is a similar plot as before. The gray area is actually where all of the BiT dots were before. And now the bubbles are vision transformers of different sizes. And the bubble is kind of the size of the model, although it's a bit hard to say exactly.

And what you can see first is that with little data, ImageNet is the 1.3 million images. It works worse than ResNet. So if we would not believe in this idea and just try this, we're like, "Okay, this is a crap idea." And 1.3 million images is not that little.

Then with the 10 times larger dataset, it starts in the same ballpark as the ResNet. And when we go to much larger data with a much larger transformer, then we actually start outperforming this ResNet. And we outperform it just by a little. But this ResNet was really hard to get and is extremely clumsy and slow and big.

So we were very excited by this. Then we did more controlled studies and everything. And one of them is using subsets of the same dataset. And there are lots of curves, but basically just look at the dark gray one and the light blue one. These are roughly similarly fast, and similarly clumsy or easy or difficult to use: BiT, which is a ResNet variant, and ViT, the vision transformer.

And what you can see, vision transformer, when we have little, in quotes, little data, is really bad compared to ResNet. But as we start having a lot of data, actually, it starts outperforming the ResNet. And this is very promising because I think everything that looks huge and a lot and so on now, in five or 10 years, it's maybe regular.

Like 10 years ago, ImageNet seemed to be a huge and massive amount of data. Now, not anymore. So we should look to the future. And this looks promising for the future. Then back to the same benchmark. That was another little jump. - Because we, yeah, yeah, we have some questions.

- Yep. - There is also a question in the chat. So in that order, if you want to unmute yourself and ask your questions. - Sure, yeah. And I think Dimal already answered part of the question, but I was wondering, in the input to this transformer, when you're chunking up the image into little puzzle pieces and then feeding them in, does the order of feeding these patches in matter?

Like if you switch the order, does the prediction maybe change? - Yeah, that's a good question. And I actually have a slide on something like this, but not exactly. Let me jump there. So first of all, if the order is consistent during training, right? And you don't shuffle the order again for each new image, then it's literally the exact same.

You get the same curves and everything, because we don't encode the order anywhere. If you start randomizing the order all the time during training, then performance gets quite a lot worse. And let me show you why. This slide was on my plan to present anyway, so let's jump here.

These are, this is a visualization of the position embeddings. What does it mean? So in this case, we had 14 by 14 patches that we cut the image in. So it means we have also 14 by 14 position embeddings. Although we just see them as one long sequence of, what is it?

196, or something like that. And now each of these pictures shows, for the position embedding which corresponds to that location, how similar it is to all the other position embeddings. So let's look at this one, for example. Yellow means perfectly similar, like exactly the same. And blue means opposite in terms of cosine similarity.

So this position embedding is most similar to itself, which is this pixel here. And then the neighboring pixels show how similar it is to the position embeddings that originally correspond to the neighboring patches. And we do see a very clear pattern, that each position embedding is very similar to the embeddings from its surrounding patches.

And we didn't implement any of this, right? We just had these position embeddings at randomly initialized variables, and they are learned as freely as the rest of the parameters of the model. But they learned to recover this notion of what are my neighbor patches, even though we don't give this information anywhere at any time, besides the raw image data and the task to please classify this image.
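For concreteness, this is roughly how such a similarity map can be computed from any learned position embedding table; the shapes assume the 14 by 14 grid from above, and this is an illustrative sketch rather than the paper's plotting code.

```python
import numpy as np

def pos_embed_similarity(pos_embed):
    """pos_embed: (196, dim) learned position embeddings for a 14x14 grid.
    Returns (14, 14, 14, 14): for each patch location, the cosine similarity
    of its embedding to every other location's embedding."""
    normed = pos_embed / np.linalg.norm(pos_embed, axis=-1, keepdims=True)
    sim = normed @ normed.T                     # (196, 196) cosine similarities
    return sim.reshape(14, 14, 14, 14)

# Illustrative usage with random embeddings (a trained model would show the
# clear "similar to my neighbors" pattern discussed above):
maps = pos_embed_similarity(np.random.randn(196, 768))
print(maps[0, 0].shape)   # similarity map for the top-left patch -> (14, 14)
```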

So that's pretty cool, I think. But it also means that if you take the trained model now and give in patches in a completely differently shuffled order, it's going to perform poorly because these learned position embeddings don't make sense anymore. We did try also to implement, like, position embeddings which encode the location as hardcoded by us, and other fancy position embeddings like relative ones.

But basically, none of that really outperformed these freely learned ones. And the freely learned ones are simpler: you just randomly init them, let them be learned as part of SGD, and that's it. And so we go with that. -Nice, it's awesome. -We have one more question from -- -Hey, yeah, I was wondering if you could -- Yeah, this slide.

I think something that's really interesting is we're talking about scaling up the data, and scaling up the model would be fun as well. But it seems like you're reaching an asymptote, right, when you keep doing the scaling. So I'm curious if you have any thoughts on that. Like, do these points just look like that, or is there kind of a best you can do, where when you keep scaling the data or the parameters, you're actually not going to get much more? -Yeah, I have another slide about this, but much further in the talk, and I would like to not jump to it now, if you don't mind.

And then maybe in 10, 15 minutes, we will be there. -Sounds good. Thanks. -Yeah, maybe to be a bit optimistic, it does seem like the transformers have a better slope here in the end, and there is a plateau earlier. -Sorry, Lucas, I did not mean to interrupt. Are there any more questions before we proceed?

-Yeah, can I ask my question real quick? -Sorry about that. -So what I'm curious to know is how does this ViT compare to if you equip a ConvNet, so, for example, a ResNet, with an attention mechanism? -Mm-hmm. -Like, how much of this is due to the structure of a transformer and the particular way it operates versus just the benefit of attention that a vanilla ConvNet does not have access to?

-Yeah, so this has been tried many times before, and the first time that I know of was actually from -- I mispronounce his name, but Kaiming He, the inventor of ResNet, and some of his colleagues; they called it non-local networks. This was way -- I think even before the transformer paper, if I remember correctly, and they basically inserted attention blocks at various locations in the ResNet, and then they showed improvement, but it was, like, tiny improvements.

It was a cool block and a simple paper, but it was not really worth it. And people usually place the attention -- you can imagine, if you place the attention just on the pixels and don't do this patch-cutting, this is way too expensive computation-wise, right? If you have 224 by 224 pixels, that's like -- yeah, I cannot do this in my head.

I don't know, 50,000 or so maybe pixels? Attending to 50,000 others, that doesn't work, so people just do it in the very high and very final layers of the ResNet, like, where it's maybe seven by seven, and then they add a bit of -- sprinkle a bit of attention there, but then you don't really get much benefit of scaling because it's essentially still a ResNet.

And there is -- in ResNet, there is this block called Squeeze-and-Excite that has been getting really popular -- or has gotten really popular and improves ResNet quite a bit, and that is also kind of a form of attention, but, like, nicely tailored to images. I'm not sure -- it's arguable.

But yeah, it has been tried many times before, but it just -- it doesn't show -- or it hasn't been shown to have this scaling benefit as much as this did. -So I think I'm missing something critical here, which is you just said, in effect, that it's computationally difficult to do an attention layer at a low level in the ResNet, but why is it any different than doing an attention layer in the Vision Transformer?

-Because we cut the patches first, so we have maybe 14 by 14 patches, which is not that much. -Okay, but I'm confused. Like, you could imagine, not at a high level, not at a high layer in the ResNet, but at a relatively low layer, after you've applied, like, one or two convolutional filters -- convolutional layers, excuse me -- then you have something the size of the patches.

-That's still 50 by 50 at the early layers, and that's -- -But 50 by 50 is significantly less than, I don't know, like, 400 by 400 or whatever. -But it's still 2,500 tokens attending to 2,500 tokens, which -- -Yeah, I mean, it's a lot, but it's not comparable. I don't know.

Okay, cool. Thank you. -Yeah. I mean, it could be tractable. Okay, maybe another answer to your question is that we're slowly getting to this, my next slide after the set of questions, where we do try something almost like what you said: have a very small part of the ResNet, and then stick a transformer on top of it, but, like, the full transformer encoder on top of it, and not just sprinkle a few attention layers and then continue with convolutions and so on.

And this is shown here; we call them hybrids, but it's almost literally what you said, actually: a few early layers from the ResNet, with varying amounts of them, and then stick the whole transformer encoder on top. And this seems to work well, too, especially for the -- when you -- the x-axis in this case is amount of compute, so for little compute, it seems to work well.

But then the scaling behavior of the pure transformer is a little better, so we focused on that. I think we later tried also the hybrid further to the right, and it was a bit lower, but it was after the paper, so it's not on this plot, which I just cut out of the paper.

But you can already see the trend here. Yeah, so if you don't scale all the way up, then this is a totally reasonable thing to do: have a little bit of ResNet and then the encoder from the transformer. -Do you want to ask a question? -Yeah, I was just wondering about the -- basically, there's a short section of the paper about fine-tuning at higher resolution, and in that case, right, the pre-trained position embeddings are, like, skewed, right?

And it basically says that you guys are, like, interpolating. Can you, like, talk a little bit? Like, how do you interpolate what's going on? -Yeah. Actually, when I checked the slides earlier today, I was like, "Oh, it would be cool to have a slide on that." And we don't have a nice visualization in the paper, either, because it's a bit difficult to explain, but this is the best starting point we have.

So if you want to increase the resolution of the image, and you keep the patch size fixed, it means you have more patches suddenly, right? And then, as you say, the patch embeddings, like, what do you even use as position embeddings, right? And basically, you can see here that we see that they learn a very regular structure, right?

We don't really know what the structure of these position embeddings that it learned is. We just see the similarity to each other and that it is very regular. And so this gave us the intuition that we may be able to just take them; kind of imagine these boxes sliding apart, and new boxes appear between them, and they are just the interpolation of the surrounding ones.

And that's basically what we do with the position embeddings. We create new ones where there are missing ones, because we need more, and by interpolating the surrounding. Or more precisely, we basically see them as a picture, in this case, 14 by 14, with 700-something channels, or whatever is the dimensionality.

And then we basically resize this like you would resize a picture, by interpolation. And that way, we get new position embeddings that we don't understand, but they follow the same pattern as the learned ones, just at a higher resolution, basically.
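A minimal sketch of that interpolation step, treating the position embeddings as a little image and resizing it; the interpolation mode here is an assumption, and a real implementation would also handle the class token's embedding separately.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=14, new_grid=24):
    """pos_embed: (old_grid*old_grid, dim) learned position embeddings.
    Treat them as a (dim, old_grid, old_grid) image and interpolate it up,
    as described above. Returns (new_grid*new_grid, dim)."""
    dim = pos_embed.shape[-1]
    grid = pos_embed.reshape(old_grid, old_grid, dim).permute(2, 0, 1)  # (dim, H, W)
    grid = F.interpolate(grid.unsqueeze(0), size=(new_grid, new_grid),
                         mode='bilinear', align_corners=False)
    return grid.squeeze(0).permute(1, 2, 0).reshape(new_grid * new_grid, dim)

# E.g. fine-tuning at 384x384 with 16x16 patches gives a 24x24 grid:
new_pe = resize_pos_embed(torch.randn(196, 768))
print(new_pe.shape)   # torch.Size([576, 768])
```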

Yeah, go ahead. - Yeah, I just have a quick question. So when you're creating the embeddings as input, right now you're doing a linear projection, at least in this case. Has there been work on doing it some other way, 'cause there are a lot of pixels that are close to each other? - Yeah, there were quite a few works that tried various other things.

One that I especially liked recently, it's called "Early Convolutions Help Transformers See Better," or something like that. And they basically say, "Okay, instead of this one big linear projection, we replace it by a stack of three-by-three convolutions with stride two." And then they also have nonlinearities between them, normalizations between them, but such that the overall stride is the same as this patchifying.

So the outcome would then be the same dimensionality as after this patch cutting and then projecting. And then they showed that, supposedly, it makes it a bit easier to optimize, in the sense that more optimizer settings are good settings. In many scenarios, it performs the same, but it gets there more robustly.

And they also show some scenarios where this performs much better, like for example, when pre-training on, actually, when they pre-train on more data, that seems to perform even better. I have played a bit with it and tried to reproduce it. I don't have it fully reproduced, but I don't see as much benefit as in the paper yet.
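As a rough illustration of that kind of convolutional stem, here is a sketch with four 3-by-3, stride-2 convolutions (overall stride 16, matching the 16-by-16 patchify) followed by a 1-by-1 projection. The channel counts are made up for the sketch and are not the paper's exact configuration.

```python
import torch.nn as nn

def conv_stem(dim=768):
    """Illustrative stand-in for the patchify layer: four 3x3/stride-2 convs
    (overall stride 16), with norm and nonlinearity between them, then a 1x1
    conv up to the transformer width. Channel counts here are invented."""
    chans = [3, 64, 128, 256, 512]
    layers = []
    for c_in, c_out in zip(chans[:-1], chans[1:]):
        layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                   nn.BatchNorm2d(c_out),
                   nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(chans[-1], dim, kernel_size=1))  # project to token width
    return nn.Sequential(*layers)

# For an input of shape (B, 3, 224, 224), conv_stem()(x) has shape
# (B, dim, 14, 14): the same token grid as 16x16 patchify + linear projection.
```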

But that's not to say that the paper is wrong, just that I didn't get there yet. That is one example of them. There are other papers that do stuff, but this one I found especially interesting because it's simple. - Thank you. - All right, continue? - We don't have any more questions.

- All right, then let's see. Yeah, I have like three more interesting details from the paper and then depending on if we want more discussion or more content, I have more content, like also the question about, does it saturate here or not? All right, so another interesting thing that we had in the paper, but it is buried in the appendix, and then follow-up papers from others have been written on this by now actually, is like how should we scale these transformers?

Right, in the high-level shape of the transformer, there are lots of settings that you could choose. And we actually tried many of them. So we started with a reasonable medium-sized transformer, this dot in the middle, and then we varied things one by one, such that we always double the compute.

So for example, this pink line, if we go to the right, this point increases the width, such that we double the compute. X-axis is compute relative to this starting point. And we have all of these different settings. There's the width, which is how wide are the vectors with which self-attention is done, which is for the base model 768, and then goes larger or smaller.

As you see, scaling this does not seem promising, so we didn't scale that one much. Then there are other things, like the width of the multi-layer perceptron, or what some people call the one-by-one convolution, in these architectures. And this seems to scale a bit nicer, this orange part.

I actually wonder where it went to the left. I don't remember. I don't know if it's hidden somewhere or if we just didn't scale it down, but anyways. Then another thing to scale, which does not exist in the transformers from text is the patch size. As you make the patch smaller, you get more and more tokens out of an image and thus more and more compute capacity.

This is the green one, which also seems to scale nicely. Then the depth is an interesting one, this yellow one. And this is the number of encoder blocks. As we scale, it first seems like, wow, this is the thing you want to scale, but then it does seem to plateau.

And it scales really badly if you decrease the depth. So that's not a good thing to decrease. However, the width seems to be a good thing to decrease if you want to go to smaller models. And then the blue is just scaling everything together, by roughly the same amount, such that the compute doubles.

That seems to scale as nicely as the rest and is relatively simple, at least conceptually. So we like this, and we went with that whenever we scaled the model up or down. And this one I really like is the inference speed, because if you have an image size of 224 pixels, it actually means you have 224 by 224 pixels.

So if you then patchify it with a 16 by 16 patch size, for example, you have 14 by 14 patches, so the sequence length is actually 196. And then on top of the sequence length, you have the self-attention operation, which is squared again. So overall, with respect to image size, the self-attention operation is to the fourth power, which is called quartic.
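A back-of-the-envelope way to see both points, how the shape knobs change compute and how the cost grows with image size, is sketched below. It is a rough relative estimate for the encoder only, not an exact FLOP count; note that only the token-token mixing term is quartic in the image side length, which is part of why moderate resolutions are not hit too hard yet.

```python
def rough_encoder_cost(image=224, patch=16, width=768, mlp=3072, depth=12):
    """Very rough relative compute estimate for the encoder, just to play with
    the knobs discussed above (heads, layer norms, patch embedding and the
    classification head are all ignored)."""
    tokens = (image // patch) ** 2                 # 224 / 16 = 14 -> 196 tokens
    attn_proj = 4 * tokens * width * width         # Q, K, V and output projections
    attn_mix = 2 * tokens * tokens * width         # QK^T and (attention @ V): the quartic term
    mlp_part = 2 * tokens * width * mlp            # the two dense layers of the MLP
    return depth * (attn_proj + attn_mix + mlp_part)

base = rough_encoder_cost()
for name, kwargs in [("wider", dict(width=1024)),
                     ("deeper", dict(depth=24)),
                     ("bigger MLP", dict(mlp=6144)),
                     ("smaller patches", dict(patch=8)),
                     ("448px images", dict(image=448)),
                     ("1024px images", dict(image=1024))]:
    print(f"{name:16s} {rough_encoder_cost(**kwargs) / base:5.1f}x")
```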

So that is really bad. Like everybody who sees O of something to the fourth is like, "What the hell are you doing? This is never going to scale." So we checked what it looks like in practice with the image sizes that we operate in, and this is what you see here.

On the y-axis is how fast it goes, basically how fast it does inference, and on the x-axis is varying the input size. And this, what this means, it doesn't look so bad yet. Basically, when you go here to the 512, to the really large image, then you see that the transformers actually start going down a lot more than the ResNets.

But in this reasonable image size, let's call it very typical, it doesn't seem so bad in practice yet. So we're not getting hit by the big O yet. But as we go larger, it will likely be a problem, and there will be a lot of follow-up works trying to make that better.

Then, this is the last one from the original paper. This is looking at, sort of, the receptive field size. So in the self-attention operation, how far away do heads typically attend? And here on the x-axis, we see the layer in the network. To the right is more towards the output, the classes, and to the left is more towards the input, the patches.

And the y-axis is how far, on average across, I think, the whole validation set, does the self-attention look? And "does look" means the peak of the self-attention, or the max: how far away is it? Something like that. And each dot is a different head, because we use multi-head self-attention.

And so what this shows is that in the early layers, actually you have some heads that go far, but also a lot of heads that look very nearby them, so locally. And as we go deeper in the model, we only are left with heads that, on average, look further.
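One common way to compute such a number is sketched below: the attention-weighted average spatial distance between a query patch and the patches it attends to, per head. The talk leaves the exact definition open, so treat this as one reasonable variant rather than the paper's exact code.

```python
import numpy as np

def mean_attention_distance(attn, grid=14):
    """attn: (heads, tokens, tokens) attention weights over a grid of
    grid*grid patch tokens (class token stripped). Returns, per head, the
    average spatial distance (in patch units) between each query patch and
    the patches it attends to, weighted by the attention weights."""
    ys, xs = np.mgrid[:grid, :grid]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(float)  # (T, 2)
    # Pairwise distances between all patch locations: (T, T)
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Attention-weighted distance per query, then averaged over queries.
    return (attn * dists).sum(-1).mean(-1)          # shape: (heads,)

# Illustrative usage with a uniform attention pattern over 14x14 patches:
T = 14 * 14
uniform = np.full((12, T, T), 1.0 / T)
print(mean_attention_distance(uniform))   # every head looks "globally"
```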

So it's just some kind of analysis. There is no immediate action to take about this, but it's interesting to see that the earlier layers learn a mixture of looking at a local neighborhood and looking globally, and the later layers only look globally. Right. So that is about the original vision transformers.

Now, I don't know how long you want me to continue speaking or discussing. I have a couple of options that I can talk about. One is a project that was about scaling things up even further, and this one also has the answer to the -- I can also jump straight to the answer if you don't want to hear the rest.

But to the question of, like, how does it continue to the right? Are we saturating? There is another project about how to train vision transformers when you don't have massive amounts of data. Can you still do it? Is it reasonable? Or is it maybe just unreasonable to do? This one is maybe too unrelated.

Let's not talk about this. And the last one is, like, I talk all about these benefits of a really large model when you pre-train them on lots of data. Okay, that's nice. That's how we get a good model. But then actually using a model that is massive is not fun at all.

Like, it doesn't fit on your GPUs. You need, like, multiple TPUs to even use it. So people are not happy to use it and usually still go back to small-ish models, even though they know, like, larger models should be better. What can we do about it? That's another project we had, which is about distillation.

So I would say it's up to you guys what you prefer to do. Or if you have plenty of questions, we can continue with the questions now, because I think now the original one hour would be over, right? -Right. -So I think one suggestion was, like, we can continue the talk, and we'll also be recording it so people can, like, just, like, go and see it if they miss out something.

So we could do that. -Yeah, the other thing is two people have their hands raised, so we can... -Okay. -...take questions first. -Up to you guys, and fight either way. -So you guys want to ask a question? -Yeah, I just had a pretty basic question. So if an object lies on the border between the patches, does that impact the model's performance in any way?

-Yeah, I mean, that's not a basic question. It's a good question. There is a mix of answers. So one is we didn't specifically go and test this. It would be an interesting thing to test in a very controlled way with some of the trained models. That's for sure. The other thing is that when you have a massive data set, like 300 million images, it's an insane amount.

I used to try to conceptualize how much ImageNet is, 1 million images, and I think I did the math. It's like, if you go through ImageNet and look at all of the images, each image for a couple of seconds, you are sitting there for a month or something like that.

I don't remember. But so 300 million is just insanely massive. And then on top of that, we do actually use random augmentations, like random crops out of the image. So I would say it's the default that you see objects that don't fall neatly inside a patch during the training already. And if you look at here, basically, this is the standard model, like how the patches are.

When we have 14 by 14, they look roughly this size also. Then an object is usually scattered across many patches, actually, because objects in typical images are relatively large, right? People don't take a picture where the object of interest is super tiny in the corner. So that's the default that you see during pre-training.

And so I believe that the model just learns to do that much better, actually. Then the other answer to the question is like, OK, maybe if you did some nicer thing than this very crude patch cutting, like for example, this stack of convolutions that I mentioned, maybe this is even better.

Thank you. Thank you. So you mentioned that we're using transformers, or at least you mentioned in the paper that they lack locality and translation equivariance. I was just thinking, aren't these sort of properties that you probably want, especially when you're in the vision domain? So why is it that we would prefer not to have them? The audio was not that good, but I believe I understood the question.

It's that we say that transformers lack a locality bias, or prior, or whatever, and why is this even something that we would want, right? Wouldn't we want our models to know about locality if they are about pictures in the first place? Yes and no. So that's why I gave the context in the beginning.

This is all about what happens when you scale things up. And specifically, in the ideal world, at least in our mind, we want gigantic amounts of data. And we believe that it will just keep growing as the years go by. And there will be more and more data just generally there.

And then we want the model to have as little of our thinking built in. Because what we may think that is good to solve the task may actually not be best to solve the task. Maybe an analogy would be, what was it, AlphaGo that made some moves that experts would say, this is crazy.

This is a silly move. But it actually then was much better. And in a similar way, we want to encode as little as possible into the model, such that if we just throw massive amounts of data and a difficult task at it, it might figure out things that are even better, that we didn't think of before.

This is our approach. Because we believe, as I mentioned already, that what seems massive and excessive now will be the norm in five years or so. So that's where we want to go and look at what the direction is. However, if you want to just get something working now, and don't have massive amounts of data, and don't want to use a pre-trained model for some reason (you should always use a pre-trained model, but if you don't want to), then it makes total sense to build in some of your prior intuition and knowledge of what should probably help the model, like locality.

I hope this answered your question. I suppose this is a quick follow-up. For, like, any vision task, isn't that sort of -- yeah, I don't know. Maybe I'm not seeing exactly why we'd not want those inductive biases. Could you maybe elaborate on that? Why is it that we don't want locality or translation equivariance? Well, ideally, we want a model that is powerful enough to learn about this concept itself, if it is useful for solving the task.

If it's not useful for solving the task, then if we had put it in, there would be no way for the model not to do it, right? That is ideally the outcome. In a similar way, in language it seemed to be nonsense to not encode the left-to-right direction of text, like RNNs do.

But then comes the transformer and just doesn't, and works much better if you throw a lot of data at it. And it recovers that, plus some more, or a more flexible variant of it or something like that, that is even better for solving tasks. So basically, the idea is that we are not smart enough to design the model in the way that will be best for the task.

Let's rather give it all the flexibility and all the data it needs to figure out what is the best way of solving the task. I mean, it is a philosophy of approaching it. I'm not saying this is the only true way, right? So we have around seven minutes left before the scheduled end of the talk.

And Lucas, we want to be mindful of your time as well, because it is evening where you are. So one thing we could do is you could-- I don't see any more questions right now. So you could quickly go over the last few bits, maybe skipping through the details and just talking about the final results.

I will do this at a high level, then, for those two that are still very, very tied to transformers and answer some questions from before. Like the first question was, OK, are we saturating? Yes or no? And here, no. This was the plot on this benchmark from the original vision transformer paper.

But then it's like, these transformers, when we use them, we just notice they have really nice scaling properties. And they seem, actually, to be easier to scale up without paying massive compute, compared to ResNets, just from the gut feeling of us having experience with both. And so we went and looked at what happens if we scale the vision transformer just as far up as we possibly can.

And we put quite a lot of our blood into making this happen. One part of it is scaling the dataset. So we went back to this Google-internal team; this 300 million dataset is just one out of many that they work with. And we asked around, and they basically had a 3 billion, like 10 times larger, dataset that we could also play around with.

So there we go, we want to scale up the dataset. And this is just showing, yes, just scaling up the dataset and switching to it gives you benefits, but that's not all of it. Then the next thing is we needed to figure out how to use less memory on device, like on GPU or TPU, because already previously, with this setup, we fit the model as large as we could fit.

So we did a lot of tricks that I will skip for now, and we were able to scale much larger. This is like -- this plot shows the size of the model in the different shape factors that I mentioned before, like the width of the MLP on the x-axis, the self-attention width on the y-axis, and then the different plots are different numbers of layers, the depth.

This box is how large a transformer we did in the original paper. And then, boom, one step further and two steps further: this is the super massive transformer we did in this scaling paper. And with all of our tricks, this is how much larger we could go: a lot larger. Then, yeah, some learning rate stuff, and it is really cool.

I recommend people to look at the square root learning rate schedule, which is cool and often just mentioned as a side note. It is also cool, but I'm going to skip it in the interest of time. And basically, we scaled it up a lot. And of course, again, we always get this ImageNet number a bit higher.

This is actually plus 2% on what we had before, which is very significant in this high percentage range there. But also, what's very interesting is the few-shot again. By just keeping scaling up everything, we get a super large boost in few-shot again. This is ImageNet top-1 accuracy.

And for example, with just 10 images per ImageNet class, which means 10,000 images total because there are 1,000 classes, we get this big of a jump. We get 85% top-1 accuracy, which is what you typically get when using the full dataset, basically. So this is scaling up; it actually makes few-shot work significantly better.

And then I'm going to skip on this. Well, this actually has an interesting message. This is three times the same story, but measured in a slightly different way, which is that if you make the model larger, it actually needs to see fewer images to get to a similar score.

This blue line is a tiny vision transformer, then the base vision transformer, and the large one. And the y-axis is the error, so lower is better. And actually -- still, we're talking in millions of images, and here it's 100 million images -- but still, you need to see a lot fewer images with the larger models.

Doesn't mean a lot less compute, right? Because the model is larger and thus slower. But it's interesting. And then there are some scaling laws that are popular in language. And we, I think, maybe for the first time in discriminative image learning, show that, yeah, they appear to be here, too.

And then -- right. Then we want to -- sorry, I had the order of the slides mixed up in my head, so I'm a bit surprised. But then another thread was that, besides further scaling up the model, we wanted to push even further into this direction of less hand-engineering of things into the model architecture.

And then with the vision transformer, or the transformer in general, the obviously most hand-engineered part of it is the self-attention. So we asked, can we do something more generic than that, and less smart than that, basically? And we ended up replacing it, essentially, with just a multi-layer perceptron that, however, has a little bit of structure, but much less than self-attention.
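For reference, the rough shape of such a block, as in the MLP-Mixer architecture: one MLP mixes information across the token dimension, then another mixes across channels. This is a simplified sketch of the idea; the actual design has more details.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """Rough sketch: replace self-attention with an MLP applied across the
    token dimension (token mixing), followed by the usual MLP across
    channels (channel mixing). Simplified; see the MLP-Mixer paper."""
    def __init__(self, tokens=196, dim=768, token_hidden=384, channel_hidden=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mix = nn.Sequential(nn.Linear(tokens, token_hidden), nn.GELU(),
                                       nn.Linear(token_hidden, tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mix = nn.Sequential(nn.Linear(dim, channel_hidden), nn.GELU(),
                                         nn.Linear(channel_hidden, dim))

    def forward(self, x):                      # x: (batch, tokens, dim)
        y = self.norm1(x).transpose(1, 2)      # (batch, dim, tokens)
        x = x + self.token_mix(y).transpose(1, 2)
        return x + self.channel_mix(self.norm2(x))
```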

So I will skip the structure for the sake of time. And we're coming back to this plot, where the question was, aren't we saturating? Now, this plot is slightly different. We, again, have this BiT ResNet here in black. And the full green line is the vision transformer, and the other colors, also the full lines, are vision transformers.

So it is exactly the same numbers as from before. However, now we also throw in this mixer architecture, which we believe is even more flexible and less hand-engineered than transformer. And as you see, with less data, it's even worse. However, with much more data, it may be surpassing the transformer, or it may be random noise.

Not clear at this point, right? Because it's the only point where this happens. So we need to go further. So we use this 3 billion data set, for example, from the previous paper that I mentioned here, and try to extend these lines to the right to see what happens.

We don't extend many of them, because these are very expensive experiments that require a ton of patience. But we extended the two most interesting ones. And it seems that it continues. First of all, yes, the vision transformer keeps increasing. We don't have such an experiment with the ResNet, because it doesn't look promising enough to pay the cost of doing it.

But it also seems that the mixer, which we believe is an even more flexible architecture, actually is consistently above the transformer now, which is good news. And yeah, it is good news. So we're now right at the time when I should stop, right? Or open to more questions again. Yeah, I guess, as a question -- Can I ask a follow-up on the scaling that you were showing earlier?

It's related to my previous question. I'm curious how this model size compares to model sizes for BERT or the natural language models. Like, especially when we're going from smaller models to much bigger models, are they comparable at all in terms of model size? And if not, why do you think -- what is the difference in model sizes for these two different tasks?

Yeah, actually, a colleague of mine has a slide, which I hate but he loves -- it's the number of parameters of models in NLP and in vision. And the question is, how do you measure model size? If you just measure the number of parameters, then these vision models are much smaller. However, for the language models, a huge chunk of the parameters is in the dictionary, for example, which for us just doesn't exist.

Ours is a linear embedding, which is a trivial number of parameters. So in terms of number of parameters, it's much smaller. My personal opinion is that number of parameters doesn't mean that much. Then the other way that you could measure this is maybe in terms of compute, like how many floating point operations it does on one data point.

And in terms of this, it's in the same ballpark. However, last time I checked, which is quite a few months ago, the largest language model was still like four or five times more than the vision model, I believe. So those are the two ways of measuring model size.

I don't think either of the ways is the one true way to measure model size. And I think it's actually an interesting research topic; how to properly measure and order models in terms of capacity is not clear. Do you know why the vision one is -- I'm sorry, why the vision one is four times smaller? Like, why is that?

Like, what about that ?? I think it's just there is less interest in it, so less resources spent on it, basically. Like in Google, there are many, many more groups doing research with language than with vision. And I think we are one of the few groups that have access to a lot of resources and are interested in scaling up things in vision so much.

Whereas in language, it seems there are a lot of groups that are doing that. I think that's the main reason, actually. It's not that we don't want to go beyond that; if we could, we would go even further. Awesome, thank you. Right, so we are actually over time at this point.

So anyone who has to leave, please feel free to do so. And before we do that, Lucas, thank you so much for joining, for all the way from across the ocean. And we know it's in the evening, so thank you for taking your free time to come and talk to us here.

Yeah, thanks for the invitation. Always like to talk about the work.