Stanford CS25: V1 | Transformers in Vision: Tackling problems in Computer Vision
Chapters
0:00
0:34 General Visual Representation
4:08 The Visual Task Adaptation Benchmark
7:26 Self-Supervised Pre-Training
7:58 Semi-Supervised Training
21:22 Synthetic Images
26:33 Applying Transformers to Vision
26:49 Embedding Space
42:05 Early Convolutions
45:28 Patch Size
46:24 Inference Speed
59:31 Scaling the Data Set
Today, I'm going to talk to you about vision transformers, 00:00:20.560 |
and specifically also on the vision part of things, 00:00:24.320 |
because I think the majority of what you have seen 00:00:37.000 |
and you're going to soon see what that means and why, 00:00:56.680 |
because if you have a good understanding of what you see, 00:01:00.240 |
then you can much quicker understand what's going on 00:01:04.640 |
And eventually -- I now have a little kid, since a year ago, 00:01:09.720 |
and so I really want that when he's grown up, 00:01:16.080 |
It doesn't need to be nice and pretty like in movies, 00:01:18.080 |
just maybe an arm or whatever, that my kid could teach, 00:01:36.000 |
It's not all that's required, but it's one part, 00:01:45.080 |
and one good example of a general visual representation 00:01:49.920 |
and I'm going to show you what I mean by that. 00:01:58.120 |
and I give you five images of each class, okay? 00:02:04.040 |
and I'm sure that by now you all know which class it is. 00:02:09.320 |
I'm not going to ask because I don't actually see you. 00:02:11.760 |
If I was in the room, I would do the raised hands, 00:02:17.160 |
Okay, this is fine. We have seen millions of flowers 00:02:26.960 |
Some people may have never seen it sometimes, 00:02:29.520 |
like when you fly or maybe on TV or in the Internet or so, 00:02:35.680 |
Three classes, class A, B, C, five images of each, 00:02:41.720 |
This might be a little bit less trivial than the flower, 00:02:44.720 |
but I think I've spent enough time talking that by now, 00:02:48.520 |
most of you should know that this is class B. 00:02:51.160 |
Shows a, what is it, basketball court, right? 00:03:00.200 |
but still, I give you images of class A and B. 00:03:05.320 |
because you need to use your brain a little bit more, 00:03:10.680 |
and now I should do a little bit of small talk 00:03:15.960 |
like you see that there is like spheres, boxes, and whatnot, 00:03:26.880 |
and class B is always, what is it, five objects, 00:03:29.640 |
no matter what they are, what they look like. 00:03:33.080 |
Okay, I think by now, you more or less understand 00:03:37.680 |
what I mean by a good visual representation, 00:03:46.640 |
in your brain, in your eyes such that you can quickly see 00:03:53.760 |
with just a few examples, and then generalize from that, 00:04:08.800 |
which we call the Visual Task Adaptation Benchmark. 00:04:10.920 |
It's kind of a formalization of the little game 00:04:20.280 |
or anybody who participates in the benchmark does, 00:04:24.800 |
We don't really care what data, what model, how, what not. 00:04:30.440 |
Then we come with this landscape of all possible visual tasks 00:04:35.280 |
that kind of make sense, which is a vague statement, 00:04:41.160 |
and this is kind of the task that you have just seen. 00:04:44.840 |
They were actually taken out of this Task Adaptation Benchmark, 00:04:49.160 |
and we have, for a first step, made 19 such tasks 00:04:53.000 |
where we try to cover broad types of visual tasks, 00:05:01.200 |
but also of very specialized images like satellite images, 00:05:04.480 |
also non-classification tasks that involve counting, 00:05:09.360 |
but that can be expressed in this simple classification API, 00:05:13.200 |
but that logically requires some more thinking. 00:05:15.840 |
Some things like distance, we have something with cars 00:05:19.760 |
and with distance of the closest car and things like that. 00:05:27.000 |
and then with the model that you came to this benchmark, 00:05:32.000 |
you can do some adaptation step on each of the datasets, 00:05:40.720 |
a model of this dataset, which is very small. 00:05:44.040 |
It just has seen a few examples for each class 00:05:56.120 |
we judge how good of a general visual representation 00:06:01.040 |
does your model and adaptation algorithm have, 00:06:03.320 |
and now just for some nomenclature, this preparation, 00:06:08.560 |
we have words that we often use, like pre-training. 00:06:12.560 |
like upstream data, upstream training, something, 00:06:15.640 |
so I may use this word interchangeably with pre-training, 00:06:23.720 |
and the adaptation, in principle, it's whatever you want, 00:06:29.440 |
but for our work, we almost always just use very simple, 00:06:36.720 |
In general, we try to do things as simple as possible. 00:06:39.600 |
It still works well, and so sometimes I even just say, 00:06:44.280 |
That means moving from this pre-training to the transfer. 00:06:47.120 |
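To make the benchmark protocol concrete, here is a minimal sketch of the evaluation loop described above. The function and attribute names (`pretrain_model`, `adapt`, `evaluate`, `task.sample_train`) are illustrative placeholders, not VTAB's actual API; the 1,000-example budget matches the VTAB-1k variant, while the few-shot game above uses only a handful of images per class.

```python
# Minimal sketch of the VTAB-style protocol; names are hypothetical placeholders.
def run_benchmark(pretrain_model, adapt, evaluate, tasks, examples_per_task=1000):
    """Pre-train once (any data, model, or recipe), then adapt and evaluate
    independently on each downstream task with a small training set."""
    model = pretrain_model()
    scores = []
    for task in tasks:                              # e.g. 19 tasks: natural, specialized, structured
        small_train = task.sample_train(examples_per_task)
        adapted = adapt(model, small_train)         # e.g. fine-tune, or fit a linear head
        scores.append(evaluate(adapted, task.test_set))
    return sum(scores) / len(scores)                # benchmark score: mean accuracy over tasks
```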
All right, so so far for the settings, so far so good? 00:06:52.440 |
Good. Then the question is, how do we get there, 00:06:58.600 |
and we spend a lot of time thinking about this 00:07:07.520 |
which doesn't mean we're going to cover everything, 00:07:10.120 |
so I'm not going to go, like, through the outline exactly, 00:07:17.040 |
the transformer only comes a little bit later. 00:07:25.280 |
is that we spend some time trying self-supervised pre-training 00:07:30.200 |
and in vision only recently has become popular, 00:07:42.880 |
That's the VTAB score for this few-shot VTAB, 00:07:47.000 |
and self-supervised learning performs like this bar. 00:07:49.680 |
We tried multiple methods and multiple models and so on. 00:07:55.560 |
Then we moved on to semi-supervised training, 00:08:00.080 |
so a few labeled examples and a ton of unlabeled examples. 00:08:19.960 |
- Yeah, so then semi-supervised is that blue bar 00:08:22.520 |
which is a lot higher than this other blue bar, 00:08:33.240 |
Then I'm not going to spend more time on this 00:08:43.600 |
if we just scale up fully-supervised pre-training, 00:08:47.280 |
then we get really much better representations 00:08:51.800 |
and here I want to briefly spend some time on that one 00:09:04.880 |
for semi-supervised or unsupervised learning, right? 00:09:10.160 |
there's almost always some extra information, 00:09:18.440 |
that you could use as some weak source of information 00:09:25.080 |
there's some team that actually does this for production, 00:09:28.080 |
and they have collected already a large dataset 00:09:31.240 |
with some pipeline that from the surrounding signals 00:09:38.240 |
and we wanted to figure out how far can we go 00:09:43.760 |
Then, long story short, you need a couple of ingredients. 00:09:50.680 |
This is one of the curves of just pre-training 00:09:57.640 |
The gist is that if I zoom into this little box, 00:10:00.360 |
I see this here, and this is the metric for the training, 00:10:06.280 |
Then I see after spending eight GPU weeks of compute, 00:10:13.920 |
one GPU for eight weeks or eight GPUs for one week 00:10:21.560 |
But this looks flat. A reasonable person would say, 00:10:24.000 |
"Yeah, there's no progress for a week on eight GPUs. 00:10:26.760 |
This is flat. I'm going to stop and try something else," 00:10:31.600 |
and this is what the exact same spot looks like 00:10:35.880 |
and you can clearly see the things are progressing, right? 00:10:39.480 |
So it may not always be obvious, and you need patience. 00:10:52.760 |
The x-axis is the number of images available. 00:10:55.000 |
In vision, there is this ImageNet dataset, 00:10:57.000 |
which is a very common, super common dataset for pre-training, 00:11:02.560 |
There's another one which has 10 times more images 00:11:04.560 |
that's still public, and then there is one subset 00:11:11.240 |
so the y-axis is measure of accuracy on some tasks, 00:11:22.000 |
The blue dot is the standard ResNet 50 that everybody uses. 00:11:29.040 |
but if you go to even more data, it looks like, 00:11:30.960 |
oh, okay, this doesn't really seem that useful, 00:11:34.280 |
and this is what most people have been doing for a long time, 00:11:37.280 |
and a lot of people, even in Google, were like, 00:11:39.960 |
yeah, I tried this internal checkpoint on these tons of data. 00:11:45.800 |
However, what we found out, and in hindsight, 00:11:48.000 |
it's kind of obvious, is that you actually need to scale 00:11:53.760 |
Here, this blue dot is a gigantic ResNet that is slow as hell, 00:11:57.520 |
but when you scale this up together with the data, 00:11:59.720 |
you keep getting benefit with adding more data, 00:12:05.720 |
"be patient" could also be phrased as "scale up your patience." 00:12:14.040 |
so here there is few-shot transfer learning. 00:12:22.400 |
on the y-axis is the accuracy on one of these tasks, 00:12:33.240 |
you don't really see benefit or small benefit 00:12:43.760 |
you start getting better and better and better 00:12:52.000 |
Second benefit that we did not anticipate really at all, 00:12:55.560 |
but then found out is that these models are super robust 00:13:07.120 |
like a chair in the bathtub and things like that, 00:13:13.920 |
Here, the pink dots are basically how existing models, 00:13:17.480 |
and x-axis is, again, how large is the model, 00:13:19.920 |
and pink dot is existing ones from the literature, 00:13:31.840 |
like in this case, out-of-distribution robustness. 00:13:38.280 |
Scale up everything, be patient, and get huge benefit. 00:13:45.040 |
but there is a question from a student in the class. 00:13:49.400 |
Do you want to unmute yourself and ask it yourself? 00:13:59.840 |
what work has been done characterizing the parameters 00:14:04.040 |
Like, the reason why I'm motivating this question is, 00:14:06.960 |
it seems like we do this tremendous amount of pre-training, 00:14:12.000 |
if we just have smarter initialization schemes. 00:14:19.560 |
And they've come to conclude that I think not. 00:14:33.080 |
You know, that everything is in a nice range, 00:14:35.200 |
such that it can have nice input/output functions, 00:14:38.320 |
and so on, and that your optimizer can do steps 00:14:41.160 |
that make reasonable change to the input/output function, 00:15:05.280 |
remembering similarity to things they've seen in training. 00:15:11.160 |
they have more memory, and they have seen more things, 00:15:13.840 |
so they should be better on more newer things, 00:15:16.480 |
because there's more similar things they have seen. 00:15:24.400 |
But I don't have the immediate pointer to a paper 00:15:29.320 |
at the top of my head now to answer your question. 00:15:36.200 |
so has posted on the chat and is raising his hand. 00:15:40.160 |
Maybe in this order, you wanna ask your question first? 00:15:46.000 |
So I just have a quick clarification on this chart right here, 00:16:02.720 |
all the way to the 300 million image dataset for BiT-L? 00:16:16.000 |
And then the different points are random restarts, 00:16:27.640 |
And as you go to the right, the model gets larger. 00:16:30.320 |
And so you can see that for this little data, 00:16:32.840 |
going to larger model doesn't really help you much 00:16:42.200 |
- Right, that makes a lot of sense, thank you. 00:16:54.320 |
What is the intuition for the upstream performance 00:17:12.920 |
that just seems like an odd looking training curve. 00:17:19.000 |
- Yeah, this is old school computer vision thing, 00:17:26.240 |
In computer vision, it used to be very common 00:17:28.440 |
to have the learning rate in a kind of staircase pattern. 00:17:31.520 |
So it's constant for a while, and then you stop, 00:17:40.680 |
And nowadays, people don't use this much anymore. 00:17:42.880 |
And this work was like three years ago, I think, 00:17:48.960 |
And nowadays, people use more continuously changing 00:17:51.800 |
learning rate schedule, and then you don't really have 00:17:55.960 |
But if you would overlay it, it would be like 00:17:57.760 |
more continuously, but going roughly the same. 00:18:05.560 |
learning rate schedule, where also you don't see 00:18:07.240 |
this effect, because learning rate continuously decreases. 00:18:12.680 |
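For illustration, here is what the two kinds of schedule look like in code; the boundaries, decay factor, and step counts are made-up values, not the ones used in the speaker's experiments.

```python
import math

def staircase_lr(step, base_lr=0.1, boundaries=(30_000, 60_000, 80_000), factor=0.1):
    """Old-school step schedule: constant for a while, then dropped at each boundary.
    The drop is what produces the sudden jump in the training curve."""
    drops = sum(step >= b for b in boundaries)
    return base_lr * factor ** drops

def cosine_lr(step, total_steps=90_000, base_lr=0.1):
    """A continuously decaying alternative; no visible jumps in the curve."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))
```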
- And then this is what, because you asked for, 00:18:18.640 |
Actually here, if you're like here, you could say, 00:18:26.400 |
Maybe you could have started the decay earlier, 00:18:28.920 |
and earlier, and earlier, and then you would get the same, 00:18:35.840 |
And you do land at much worse place in the end 00:18:55.960 |
- Yeah, it's fine, we can coordinate that with this. 00:19:05.000 |
So basically what you're trying to do is multitask learning 00:19:08.320 |
with convolutional neural networks/LSTMs, right? 00:19:13.320 |
But you're doing multitask learning, correct? 00:19:21.920 |
- Because like, initially, like you showed like different, 00:19:33.400 |
And this pre-training, I didn't mention it yet. 00:19:36.320 |
I just said, I don't care what you do in the pre-training, 00:19:38.960 |
just pre-train somehow, and give me the model. 00:19:41.600 |
And then I test it on multiple tasks independently. 00:19:49.080 |
which in our case means fine-tune it just on the task, 00:19:55.520 |
Like later we moved to just learning a linear regression 00:20:03.240 |
what we do is just regular supervised learning, 00:20:14.800 |
a couple labels or not, but it usually doesn't have. 00:20:30.360 |
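As a concrete example of the cheap adaptation step, here is a sketch of a closed-form linear probe on frozen features, in the spirit of the linear regression the speaker alludes to; the regularization constant and the {-1, +1} target encoding are illustrative choices, not necessarily the exact recipe used.

```python
import numpy as np

def linear_probe(features, labels, num_classes, l2=1e-3):
    """Fit a ridge-regression readout from frozen features to per-class targets.
    features: (N, D) array from the pre-trained model; labels: (N,) integer classes."""
    X = np.asarray(features, dtype=np.float64)
    Y = np.eye(num_classes)[labels] * 2 - 1                   # {-1, +1} one-vs-rest targets
    W = np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ Y)
    return W                                                  # predict via argmax(X_new @ W)
```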
like the discussion rather than started about this, 00:20:33.960 |
it's like memorization, or it's more memorizing the data 00:20:42.080 |
that you can pre-train on a synthetic language 00:20:45.280 |
that's, it doesn't have any semantic meaning, 00:20:52.600 |
And that actually gives you almost the same boost 00:20:56.280 |
in your downstream transfer as a normal pre-training. 00:21:04.600 |
the structure seems to make a lot of contribution, 00:21:11.040 |
it's a different case, maybe. Have people done 00:21:13.640 |
maybe some synthetic pre-training dataset for images? 00:21:24.600 |
and like not even rendering of some realistic things, 00:21:27.440 |
but just completely patterns, waves, and shapes and so on, 00:21:33.760 |
And then it shows that they get almost the same performance 00:21:38.080 |
they actually do this with vision transformers. 00:21:41.520 |
But yeah, they never go further or it is not clear, 00:21:45.600 |
you know, they kind of show that you can almost get 00:21:49.200 |
That is not clear how much further can you go with this. 00:21:56.000 |
but it's just me guessing that not much further, 00:22:07.360 |
Said that you think like the large vision models 00:22:21.720 |
Essentially like when you're doing few-shot learning, 00:22:24.760 |
you just say like, "I'm going to learn a network." 00:22:40.560 |
because this is just some intuitive guess that I have. 00:22:52.480 |
when we do something like prototypical networks 00:22:54.680 |
for few-shot learning with these pre-trained models, 00:22:57.560 |
we do get worse performance than when we do fine-tuning. 00:23:12.160 |
Okay, yeah, so, ah, right, and I didn't mention, 00:23:23.520 |
in computer vision, with this work, with the Big Transfer, 00:23:30.600 |
after there was a long period of a couple of years 00:23:46.080 |
Yeah, that's, okay, this is just a little aside, 00:23:51.360 |
that if you are in the setting that I mentioned 00:23:59.520 |
that you don't have images from the other tasks 00:24:05.360 |
Otherwise, you have seen them during training, 00:24:09.240 |
and you're just fooling yourself with good scores. 00:24:12.520 |
And this is a real danger when we get huge amounts of data, 00:24:15.240 |
because, like, ImageNet images can totally be 00:24:24.920 |
and also near-duplicates, like when they are shifted, 00:24:27.760 |
rotated, squeezed, color changed a bit, whatnot. 00:24:32.000 |
And we use this to completely remove all images 00:24:34.920 |
from the test data sets that we test on later. 00:24:44.920 |
between the training set of ImageNet and CIFAR, 00:24:50.960 |
So near-duplicates are quite a widespread problem in vision. 00:24:54.600 |
And this slide is just to say, hey, there are problems, 00:24:58.720 |
we actually took care that in the pre-training, 00:25:01.120 |
as best as we can, we don't have near-duplicates. 00:25:12.760 |
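The speaker does not describe the actual de-duplication pipeline, but one simple way to flag near-duplicates is to compare image embeddings (or perceptual hashes) between the pre-training set and every downstream test set, as in this sketch; the threshold is an arbitrary illustrative value.

```python
import numpy as np

def flag_near_duplicates(train_emb, test_emb, threshold=0.95, batch=1024):
    """Return indices of pre-training images whose embedding is suspiciously close
    to any test-set embedding (illustration only, not the production pipeline)."""
    train = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    test = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    flagged = []
    for start in range(0, len(train), batch):                 # chunk to limit memory
        sims = train[start:start + batch] @ test.T            # cosine similarities
        hits = np.where(sims.max(axis=1) > threshold)[0]
        flagged.extend((start + hits).tolist())
    return flagged
```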
And that's how we got to transformers, basically. 00:25:16.440 |
In computer vision, everything was convolutional networks 00:25:20.480 |
And basically there was nothing else, CNN is king. 00:25:23.440 |
However, in language, we saw a transformation recently, 00:25:29.320 |
everywhere LSTM was king, and then came the transformer. 00:25:32.880 |
And in the case when there is a lot of data available, 00:25:35.880 |
suddenly transformer worked much better than LSTM. 00:25:39.400 |
For little data, that was still not the case exactly. 00:25:45.600 |
so we are now in this regime where we have tons of data 00:26:06.360 |
because I don't want to point fingers too much, 00:26:09.120 |
but they were all not really using transformers 00:26:14.920 |
It was always like, get something out of a ResNet first, 00:26:21.560 |
or high-level feature maps or things like that, 00:26:30.240 |
And so we came up with the simplest and most natural, 00:26:33.080 |
I believe, way of applying transformers to vision, 00:26:36.440 |
which is you take the image, you cut it into pieces, 00:26:48.320 |
and you project it into your embedding space, 00:27:18.960 |
You can, and people later did, go on and say, 00:27:29.480 |
This is just the simplest way to do it first. 00:27:40.080 |
and then give them to exactly the BERT transformer 00:27:45.960 |
And just like in language, we add this class token, 00:27:49.400 |
or I think in language it's like the end-of-sentence token 00:27:53.960 |
And we add the position embeddings to the tokens 00:27:59.960 |
And then we feed all of this to a transformer encoder, 00:28:02.760 |
which has a MLP head, which reads out this class token, 00:28:12.720 |
And that's it. That is the vision transformer. 00:28:25.280 |
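Here is a minimal PyTorch sketch of the architecture just described: cut the image into patches, linearly project them, prepend a class token, add learned position embeddings, run a transformer encoder, and read the class token with a head. The dimensions are deliberately tiny, and the stock `nn.TransformerEncoder` differs in small details (for example, norm placement) from the paper's implementation.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal sketch of the vision transformer described above (illustrative sizes,
    much smaller than the real ViT-B/16)."""
    def __init__(self, image_size=224, patch=16, dim=192, depth=4, heads=3, classes=10):
        super().__init__()
        num_patches = (image_size // patch) ** 2                  # 14 * 14 = 196
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # linear projection of patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim)) # learned position embeddings
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, classes)                       # head reading out the class token

    def forward(self, images):                                    # images: (B, 3, H, W)
        tokens = self.patchify(images).flatten(2).transpose(1, 2) # (B, 196, dim)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        x = torch.cat([cls, tokens], dim=1) + self.pos_embed      # prepend class token, add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                                 # classify from the class token
```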
And then just same story as before, scale everything up. 00:28:28.360 |
Compute, data set, model size, patience, everything. 00:28:41.880 |
The gray area is actually where all of the BiT dots were before. 00:28:51.000 |
And the bubble is kind of the size of the model, 00:28:56.320 |
And what you can see first is that with little data, 00:29:05.440 |
and just try this, we're like, "Okay, this is a crap idea." 00:29:20.760 |
then we actually start outperforming this ResNet. 00:29:33.840 |
Then we did more controlled studies and everything. 00:29:35.880 |
And one of them is like using subset of the same data set. 00:29:53.320 |
which is a ResNet variant, and ViT, the vision transformer. 00:30:04.360 |
But as we start having a lot of data, actually, 00:30:12.840 |
and a lot and so on now, in five or 10 years, 00:30:17.000 |
Like 10 years ago, ImageNet seemed to be huge 00:30:33.200 |
- Because we, yeah, yeah, we have some questions. 00:30:45.880 |
if you want to unmute yourself and ask the questions. 00:30:51.280 |
And I think Dimal already answered part of the question, 00:30:54.080 |
but I was wondering in the input to this transformer, 00:30:58.600 |
into little puzzle pieces and then feeding them in, 00:31:02.920 |
does the order of feeding these patches in matter? 00:31:13.480 |
And I actually have a slide on something like this, 00:31:21.120 |
if the order is consistent during training, right? 00:31:24.840 |
And you don't shuffle the order again for each new image, 00:31:41.000 |
This slide was on my plan to present anyways. 00:31:53.280 |
we had 14 by 14 patches that we cut the image in. 00:31:56.880 |
So it means we have also 14 by 14 position embeddings. 00:32:01.160 |
Although we just see them as one long sequence of, 00:32:04.480 |
190-something -- 196, to be exact. 00:32:09.720 |
And now each of these pictures shows the position embedding, 00:32:15.400 |
How similar is it to all the other position embeddings? 00:32:20.560 |
Yellow means perfectly similar, like exactly the same. 00:32:23.680 |
And blue means opposite in terms of cosine similarity. 00:32:27.320 |
So this position embedding is most similar to itself, 00:32:32.360 |
And then the neighboring pixels is how similar is it 00:32:35.520 |
to the position embeddings that correspond originally 00:32:44.760 |
to the embedding from its surrounding patches. 00:33:13.560 |
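The similarity maps described here can be reproduced with a few lines: take the learned grid of position embeddings, compute the cosine similarity of each one against all others, and reshape back to the patch grid. This sketch assumes a 14x14 grid with the class-token embedding removed.

```python
import numpy as np

def pos_embed_similarity(pos_embed, grid=14):
    """Cosine similarity of each position embedding to all others, reshaped to the
    patch grid, reproducing the kind of plot described above.
    pos_embed: (grid*grid, D) learned embeddings (class-token entry removed)."""
    normed = pos_embed / np.linalg.norm(pos_embed, axis=1, keepdims=True)
    sims = normed @ normed.T                        # (196, 196) cosine similarities
    return sims.reshape(grid, grid, grid, grid)     # sims[i, j] is the (grid, grid) map for patch (i, j)
```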
But it also means that if you take the trained model now 00:33:26.480 |
We did try also to implement, like, position embeddings 00:33:30.360 |
which encode the location as hardcoded by us, 00:33:35.440 |
and other fancy position embeddings like relative ones. 00:33:39.520 |
But basically, none of that really outperformed 00:33:48.200 |
And so we go with that, and so just like that. 00:34:05.400 |
and scaling up the model would be fun as well. 00:34:08.240 |
But it seems like you're reaching an asymptote, right, 00:34:13.400 |
So I'm curious if you have any thoughts on that. 00:34:18.160 |
or is there kind of a best you can sort of do 00:34:21.480 |
where when you're pre-training the data or the parameters, 00:34:32.040 |
where I would like to not jump on it, if you don't mind. 00:34:36.640 |
And then maybe in 10, 15 minutes, we will be there. 00:34:57.000 |
Are there any more questions before we proceed? 00:35:03.120 |
-So what I'm curious to know is how does this ViT 00:35:08.680 |
so, for example, ResNet, with an attention mechanism? 00:35:13.360 |
-Like, how much of this is due to the structure of a transformer 00:35:18.800 |
that a vanilla ConvNet does not have access to? 00:35:22.040 |
-Yeah, so this has been tried many times before, 00:35:27.800 |
was actually from -- I mispronounce his name, 00:35:34.120 |
and some of his colleagues, they called it non-local networks. 00:35:37.360 |
This was way -- I think even before the transformer paper, 00:35:59.400 |
you can imagine if you place the attention just on the pixels 00:36:03.360 |
this is way too expensive computation-wise, right? 00:36:07.080 |
If you have 224 by 224 pixels, 00:36:08.920 |
that's like -- yeah, I cannot do this in my head. 00:36:14.320 |
Attending to 40,000 others, that doesn't work, 00:36:27.280 |
but then you don't really get much benefit of scaling 00:36:43.920 |
and that is also kind of a form of attention, 00:36:53.520 |
But yeah, it has been tried many times before, 00:36:57.640 |
or it hasn't been shown to have this scaling benefit 00:37:03.280 |
-So I think I'm missing something critical here, 00:37:09.920 |
to do an attention layer at a low level in the ResNet, 00:37:12.920 |
but why is it any different than doing an attention layer 00:37:28.160 |
Like, you could imagine, not at a high level, 00:37:34.280 |
after you've applied, like, one or two convolutional filters -- 00:37:39.440 |
then you have something the size of the patches. 00:37:43.080 |
-That's still 50 by 50 at the early layers, and that's -- 00:37:52.320 |
-But it's still 2,500 tokens attending to 2,500 tokens, 00:38:12.880 |
where we do try something almost like what you said, 00:38:23.640 |
but, like, the full transformer encoder on top of it, 00:38:32.080 |
And this is this process, and we call them hybrid, 00:38:35.280 |
but it's almost literally what you said, actually, 00:38:42.000 |
and then stick the whole transformer encoder. 00:38:53.360 |
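A sketch of that hybrid idea, assuming a torchvision ResNet-50 truncated after its third stage, so a 224x224 image becomes a 14x14 feature map whose spatial positions are used as tokens; position embeddings and the classification head are omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class HybridBackbone(nn.Module):
    """Sketch of the 'hybrid' variant: ResNet stages produce a feature map, and each
    spatial position becomes one token for the transformer (illustrative dimensions)."""
    def __init__(self, dim=768, depth=4, heads=12):
        super().__init__()
        cnn = resnet50(weights=None)
        # Keep everything up to layer3: a 224x224 input -> (B, 1024, 14, 14) feature map.
        self.stem = nn.Sequential(cnn.conv1, cnn.bn1, cnn.relu, cnn.maxpool,
                                  cnn.layer1, cnn.layer2, cnn.layer3)
        self.proj = nn.Linear(1024, dim)                    # map CNN channels to token dim
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, images):
        fmap = self.stem(images)                            # (B, 1024, 14, 14)
        tokens = self.proj(fmap.flatten(2).transpose(1, 2)) # (B, 196, dim)
        return self.encoder(tokens)
```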
so for the little compute, it seems to work well. 00:38:55.920 |
But then the scaling behavior of the pure ResNet 00:39:00.680 |
I think we later tried also hybrid further to the right, 00:39:03.280 |
and it was a bit lower, but it was after the paper, 00:39:05.880 |
so it's not on this plot, which I just cut out of the paper. 00:39:15.520 |
then this is a totally reasonable thing to do, 00:39:32.960 |
basically, there's like a short section of paper 00:39:35.640 |
about, like, fine-tuning and, like, higher resolution, 00:39:38.120 |
and in that case, right, like, the pre-trained, 00:39:40.640 |
like, position embeddings, sorry, are, like, skewed, right? 00:39:45.480 |
And it basically says that you guys are, like, interpolating. 00:39:50.480 |
Like, how do you interpolate what's going on? 00:39:52.400 |
-Yeah. Actually, when I checked the slides earlier today, 00:39:55.840 |
I was like, "Oh, it would be cool to have a slide on that." 00:40:00.440 |
And we don't have a nice visualization in the paper, 00:40:02.520 |
either, because it's a bit difficult to explain, 00:40:08.320 |
So if you want to increase the resolution of the image, 00:40:13.520 |
it means you have more patches suddenly, right? 00:40:15.720 |
And then, as you say, the patch embeddings, like, 00:40:18.480 |
what do you even use as position embeddings, right? 00:40:25.480 |
that they learn a very regular structure, right? 00:40:41.680 |
kind of imagine these boxes, they slide apart, 00:40:50.040 |
And that's basically what we do with the position embeddings. 00:40:54.520 |
We create new ones where there are missing ones, 00:41:02.240 |
Or more precisely, we basically see them as a picture, 00:41:06.040 |
in this case, 14 by 14, with 700-something channels, 00:41:14.320 |
like you would resize a picture by interpolation. 00:41:19.000 |
And that way, we get more and new position embeddings 00:41:24.040 |
but they follow the same pattern as the learned ones, 00:41:39.080 |
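In code, the interpolation amounts to reshaping the learned grid embeddings into a small "image" and resizing it, roughly like this sketch; the interpolation mode and the handling of the class token here are illustrative details.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=14, new_grid=24):
    """Interpolate learned position embeddings to a new grid when fine-tuning at a
    higher resolution. pos_embed: (1, 1 + old_grid**2, D) with the class token first."""
    cls_tok, grid_tok = pos_embed[:, :1], pos_embed[:, 1:]
    dim = grid_tok.shape[-1]
    # Treat the grid embeddings as an old_grid x old_grid "image" with D channels ...
    grid_tok = grid_tok.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    # ... and resize it like a picture, here with bicubic interpolation.
    grid_tok = F.interpolate(grid_tok, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    grid_tok = grid_tok.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_tok, grid_tok], dim=1)
```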
So when you're creating the embeddings as input, 00:41:50.440 |
Has there been work to do to memorize the other way, 00:41:52.440 |
'cause there's a lot of pixels that are close to each other? 00:42:04.880 |
it's called "Early Convolutions Help Transformers See Better," 00:42:16.240 |
we replace it by a stack of three-by-three convolution 00:42:22.280 |
And then they have also nonlinearities between them, 00:42:32.160 |
So the outcome would then be the same dimensionality 00:42:36.040 |
as after this patch cutting and then projecting. 00:42:40.440 |
supposedly it makes it a bit easier to optimize 00:42:45.160 |
in the sense that more optimizer settings are good settings. 00:43:08.880 |
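A sketch of such a convolutional stem, under the assumption of four stride-2 3x3 convolutions with nonlinearities and a final 1x1 projection, so that a 224x224 image again ends up as a 14x14 grid of token vectors; the channel widths are illustrative, not the paper's exact recipe.

```python
import torch.nn as nn

def conv_stem(dim=768):
    """Sketch of a convolutional stem in the spirit of 'Early Convolutions Help
    Transformers See Better': a stack of stride-2 3x3 convs with nonlinearities,
    ending at the same 16x downsampling and channel count as a 16x16 patchify."""
    chans = [3, 64, 128, 256, 512]
    layers = []
    for c_in, c_out in zip(chans[:-1], chans[1:]):
        layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                   nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(512, dim, kernel_size=1))   # final 1x1 to the token dimension
    return nn.Sequential(*layers)                        # 224x224 input -> (dim, 14, 14)
```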
I have played a bit with it and tried to reproduce it. 00:43:16.120 |
but I don't see as much benefit as in the paper yet. 00:43:19.040 |
But that's not to say that the paper is wrong, 00:43:42.640 |
Yeah, I have like three more interesting details 00:43:49.600 |
I have more content, like also the question about, 00:44:05.880 |
is like how should we scale these transformers? 00:44:19.080 |
So we started with the reasonable medium-sized transformer, 00:44:32.760 |
if we go to the right, this point increases the width, 00:44:39.040 |
X-axis is compute relative to this starting point. 00:44:46.240 |
There's the width, which is how wide are the vectors 00:45:08.760 |
or some people call it the one-by-one convolution 00:45:12.840 |
And this seems to scale a bit nicer, this orange part. 00:45:21.840 |
or if we just didn't scale it down, but anyways. 00:45:26.000 |
which does not exist in the transformers from text 00:45:37.520 |
This is the green one, which also seems to scale nicely. 00:45:42.120 |
Then the depth is an interesting one, this yellow one. 00:45:53.640 |
And it scales really badly if you decrease the depth. 00:45:59.040 |
However, the width seems to be a good thing to decrease 00:46:03.160 |
And then the blue is just scaling everything together 00:46:09.960 |
That seems to scale nicely as well as the rest 00:46:14.920 |
and is relatively simple, or at least conceptually. 00:46:23.440 |
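For reference, these are the standard ViT sizes from the original paper, which scale depth, width, MLP width, and heads together in roughly the "scale everything" spirit described above; the patch size is chosen separately and controls the sequence length.

```python
# Standard ViT configurations (from the original ViT paper).
VIT_CONFIGS = {
    "ViT-Base":  dict(layers=12, width=768,  mlp=3072, heads=12),
    "ViT-Large": dict(layers=24, width=1024, mlp=4096, heads=16),
    "ViT-Huge":  dict(layers=32, width=1280, mlp=5120, heads=16),
}
```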
And this one I really like is the inference speed, 00:46:26.560 |
because if you have the image size of 224 pixels, 00:46:29.640 |
it actually means you have 224 by 224 pixels. 00:46:32.520 |
So if you have, then you patchify it with 16 by 16 patch, 00:46:37.000 |
for example, patch size, then you have 14 by 14 patches. 00:46:42.520 |
So that is, the sequence length is actually 196. 00:46:54.760 |
the self-attention operation is to the fourth power, 00:47:03.000 |
Like everybody who sees all of something to the fourth 00:47:09.760 |
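The back-of-the-envelope arithmetic looks like this: the number of tokens grows quadratically with the image side length, and self-attention is quadratic in the number of tokens, hence quartic in the side length.

```python
def attention_cost(image_size=224, patch=16):
    """Sequence length and pairwise-attention count for a given image and patch size."""
    tokens_per_side = image_size // patch      # 224 / 16 = 14
    seq_len = tokens_per_side ** 2             # 14 * 14 = 196 tokens (plus one class token)
    attn_pairs = seq_len ** 2                  # attention cost is quadratic in tokens
    return seq_len, attn_pairs

print(attention_cost(224, 16))   # (196, 38416)
print(attention_cost(448, 16))   # (784, 614656): 2x the resolution, 16x the attention cost
```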
So we checked what does it look like in practice 00:47:25.920 |
And this, what this means, it doesn't look so bad yet. 00:47:37.320 |
actually start going down a lot more than the ResNets. 00:47:52.120 |
But as we go larger, it will likely be a problem, 00:48:01.440 |
Then, this is the last one from the original paper. 00:48:07.560 |
This is looking at the input's receptive field size. 00:48:17.920 |
And here on the x-axis, we see the layer in the network. 00:48:21.600 |
To the right is more towards the output, the classes, 00:48:24.040 |
and to the left is more towards the input, the patches. 00:48:35.320 |
And does look means that the peak of the self-attention 00:48:45.800 |
because we can use multi-head self-attention. 00:48:48.560 |
And so what this shows is that in the early layers, 00:48:53.360 |
but also a lot of heads that look very nearby them, 00:48:59.600 |
we only are left with heads that, on average, look further. 00:49:07.000 |
There is not immediately action to take about this, 00:49:09.840 |
but it's interesting to see that earlier layers, 00:49:12.640 |
they learn a mixture of looking to a local neighborhood 00:49:24.280 |
So that is about the original vision transformers. 00:49:35.320 |
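The quantity behind that plot is the mean attention distance per head: how far, on average in patch units, each head attends. A sketch of computing it from one layer's attention weights follows; the shapes assume a 14x14 grid of patch tokens with the class token dropped.

```python
import numpy as np

def mean_attention_distance(attn, grid=14):
    """Average spatial distance (in patch units) that each head attends over.
    attn: (heads, grid*grid, grid*grid) attention weights over the patch tokens."""
    coords = np.stack(np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij"),
                      axis=-1).reshape(-1, 2)                        # (196, 2) patch coordinates
    dists = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)   # (196, 196) pairwise distances
    return (attn * dists).sum(axis=(1, 2)) / attn.sum(axis=(1, 2))       # one value per head
```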
I have a couple of options that I can talk about, 00:49:38.760 |
which is one project that was further scaling updates, 00:49:52.720 |
There is another project about how to train vision transformers 00:50:08.360 |
I talk all about these benefits of a really large model 00:50:13.640 |
Okay, that's nice. That's how we get a good model. 00:50:17.000 |
But then actually using a model that is massive 00:50:22.960 |
You need, like, multiple TPUs to even use it. 00:50:29.080 |
and usually still go back to small-ish models, 00:50:32.000 |
even though they know, like, larger models should be better. 00:50:37.800 |
That's another project we had, which is about distillation. 00:50:41.520 |
So I would say it's up to you guys what you prefer to do. 00:50:49.200 |
because I think now the original one hour would be over, right? 00:50:56.880 |
and we'll also be recording it so people can, like, 00:50:59.360 |
just, like, go and see it if they miss out something. 00:51:04.240 |
-Yeah, the other thing is two people have their hands raised, 00:51:28.000 |
So if an object lies on the border between the patches, 00:51:32.280 |
does that impact the model's performance in any way? 00:51:45.560 |
So one is we didn't specifically go and test this. 00:51:48.960 |
It would be an interesting thing to test in a very controlled way 00:51:57.520 |
The other thing is that when you have a massive data set, 00:52:01.360 |
like 300 million images, it's an insane amount. 00:52:03.920 |
I used to try to conceptualize how much is ImageNet, 00:52:07.960 |
1 million images, and I think I did the math. 00:52:10.920 |
It's like if you go through ImageNet and look at all of the images, 00:52:17.200 |
you are sitting there for a month or something like that. 00:52:27.160 |
random augmentations, like random crop out of the image. 00:52:30.920 |
So I would say it's the default that you see objects 00:52:34.040 |
that don't fall on a patch during the training already. 00:52:40.600 |
this is the standard model, like how the patches are. 00:52:44.360 |
When we have 14 by 14, they look roughly this size also. 00:52:50.240 |
Then an object is usually scattered across many patches, 00:52:59.280 |
People don't take a picture where the object of interest 00:53:04.400 |
So that's the default that you see during pre-training. 00:53:13.640 |
Then the other answer to the question is like, OK, 00:53:16.480 |
maybe if you did some nicer thing than this very crude 00:53:20.720 |
patch cutting, like for example, this stack of convolutions 00:53:31.480 |
So you mentioned that we're using transformers, 00:53:49.800 |
I was just thinking, are these sort of properties 00:53:54.400 |
that you probably [INAUDIBLE] and especially when 00:54:02.720 |
So why is it that we would prefer [INAUDIBLE] 00:54:14.920 |
Is that we say that transformers lack locality bias, or prior, 00:54:21.280 |
And why is this even something that we want, right? 00:54:25.120 |
Wouldn't we want our models to know about locality 00:54:27.560 |
if they are about pictures in the first place? 00:54:32.680 |
So that's why I gave the context in the beginning. 00:54:35.520 |
This is all about what happens when you scale things up. 00:54:39.760 |
And specifically, in the ideal world, at least in our mind, 00:54:54.000 |
And there will be more and more data just generally there. 00:55:04.720 |
Because what we may think that is good to solve the task 00:55:14.880 |
AlphaGo that made some moves that experts would say, 00:55:22.240 |
And in a similar way, we want to encode as little as possible 00:55:28.400 |
throw massive amounts of data in the difficult task at it, 00:55:31.520 |
that it might think things that are even better that we 00:55:37.960 |
Because we believe that, as I mentioned, I think, already, 00:55:47.560 |
So that's where we want to go and look what's the direction. 00:55:51.520 |
However, if you want to just get something working now 00:55:59.880 |
for some reason, which always use a pre-trained model, 00:56:08.800 |
of your prior intuition and knowledge of what should 00:56:22.280 |
What sort of [INAUDIBLE] like any vision task? 00:56:53.840 |
is powerful enough to learn about this concept itself 00:57:01.320 |
If it's not useful to solve the task, then if we had put it in, 00:57:05.840 |
there is no way for the model not to do this, right? 00:57:19.480 |
the from-left-to-right direction of text, like in RNNs. 00:57:19.480 |
And works much better if you throw a lot of data at it. 00:57:29.600 |
And it recovers that plus some more or a more flexible variant 00:57:42.840 |
as smart to design the thing, the model in the way that 00:57:49.160 |
Let's rather give it all the flexibility and all the data 00:57:55.200 |
I mean, it is a philosophy of approaching it. 00:58:05.640 |
I'm not saying this is the only true way, right? 00:58:18.720 |
And Lucas, we want to be mindful of your time 00:58:21.240 |
as well, because it is evening where you are. 00:58:30.240 |
So you could quickly go over the last few bits, 00:58:41.080 |
Those two that are still very, very tight to transformers 00:58:44.360 |
and answer some questions that happened before. 00:58:46.880 |
Like the first question was like, OK, are we saturating? 00:59:04.280 |
when we use them, we just notice they have really nice scaling 00:59:08.920 |
to scale up without paying massive compute as much 00:59:12.920 |
as ResNet, just from gut feeling from us having experience 00:59:20.120 |
if we scale vision transformer just as far up 00:59:25.640 |
And we spent quite a lot of our blood into making this happen. 00:59:36.680 |
that this 300 million data set is just one out of many 00:59:43.480 |
had the 3 billion, like 10 times larger data set 00:59:52.360 |
And this is just showing, yes, just scaling up the data set 01:00:00.120 |
Then the next thing is we needed to figure out 01:00:03.080 |
how to use less memory on device, like on GPU or TPU, 01:00:10.280 |
we fitted the model as large as we could fit. 01:00:13.200 |
So we did a lot of tricks that I will skip for now 01:00:24.520 |
that I mentioned before, like the width of the MLP on the x-axis, 01:00:29.720 |
and then the different plots are different layers for the depth. 01:00:37.800 |
And then boom, one step further and two steps further, 01:00:51.640 |
Then, yeah, some learning rate stuff, and it is really cool. 01:00:54.360 |
I recommend people to look at square root learning rate 01:00:56.840 |
schedule, which is cool, and often just mentioned 01:01:17.840 |
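A sketch of the kind of reciprocal-square-root schedule being recommended: linear warmup, a long ~1/sqrt(step) middle that stays relatively flat so you can keep training without committing to an end point, and a linear cooldown at the end. The constants here are illustrative, not the ones from the paper.

```python
import math

def rsqrt_schedule(step, peak_lr=1e-3, warmup=10_000, total=200_000, cooldown=20_000):
    """Reciprocal-square-root schedule with linear warmup and linear cooldown."""
    lr = peak_lr * math.sqrt(warmup / max(step, warmup))   # ~1/sqrt(step) after warmup
    if step < warmup:
        lr *= step / warmup                                # linear warmup
    if step > total - cooldown:
        lr *= max(0.0, (total - step) / cooldown)          # linear cooldown to zero
    return lr
```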
This is actually plus 2% on what we had before, 01:01:20.200 |
which is very significant in this high percentage range 01:01:35.800 |
And for example, it's just 10 images per image net class, 01:01:40.560 |
which means 10,000 images total because 1,000 classes. 01:01:46.280 |
We get 85% top-1 accuracy, which is what you typically 01:01:58.600 |
It actually makes few-shot work significantly better. 01:02:04.240 |
Well, this actually has an interesting message. 01:02:22.080 |
and the base vision transformer and the large one. 01:02:33.880 |
But still, you need to see a lot fewer images 01:03:01.400 |
sorry, I had the order of the slides mixed up in my head. 01:03:05.120 |
But then another thread was that besides further scaling up 01:03:11.040 |
into this direction of less hand engineering of things 01:03:21.360 |
transformer in general, what is the obviously most hand-engineered part, 01:03:29.440 |
and can we replace it with something more generic than that and less smart than that, basically? 01:03:34.240 |
And we ended up by replacing it, essentially, 01:03:36.480 |
with just a multi-layer perceptron that, however, 01:03:46.120 |
So I will skip the structure, for the sake of time. 01:03:46.120 |
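For concreteness, here is a PyTorch sketch of one block of that MLP-only architecture (MLP-Mixer): a token-mixing MLP applied across patch positions in place of self-attention, followed by a channel-mixing MLP, each with a residual connection. The hidden sizes are illustrative.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer block: mix across tokens, then mix across channels."""
    def __init__(self, tokens=196, dim=512, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(nn.Linear(tokens, token_hidden), nn.GELU(),
                                       nn.Linear(token_hidden, tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(nn.Linear(dim, channel_hidden), nn.GELU(),
                                         nn.Linear(channel_hidden, dim))

    def forward(self, x):                        # x: (B, tokens, dim)
        y = self.norm1(x).transpose(1, 2)        # (B, dim, tokens): mix across token positions
        x = x + self.token_mlp(y).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))
```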
And we're coming back to this plot, where the question was, 01:03:56.040 |
We, again, have this BiT ResNet here in black. 01:03:59.040 |
And the full green line is the vision transformer. 01:04:05.800 |
So it is exactly the same numbers as from before. 01:04:09.000 |
However, now we also throw in this mixer architecture, 01:04:11.800 |
which we believe is even more flexible and less 01:04:16.240 |
And as you see, with less data, it's even worse. 01:04:22.520 |
be surpassing the transformer, or it may be random noise. 01:04:29.000 |
Because it's the only point where this happens. 01:04:35.960 |
for example, from the previous paper that I mentioned here, 01:04:40.160 |
and try to extend these lines to the right to see what happens. 01:04:54.720 |
And that, first of all, yes, the vision transformer 01:04:59.880 |
We don't have such experiment with the ResNet, 01:05:07.520 |
But it also seems that the mixer, what we believe 01:05:11.560 |
actually is consistently above the transformer now, 01:05:19.200 |
So we're now right at the time when I should stop, right? 01:05:37.640 |
to model sizes for BERT or other natural language models. 01:05:37.640 |
Like, especially when we're going from smaller models 01:05:46.080 |
to much bigger models, are they comparable at all 01:05:53.880 |
what is the [INAUDIBLE] models for these two different tasks? 01:05:57.280 |
Yeah, actually, a colleague of mine has a slide, which I hate. 01:06:01.160 |
But he loves-- it's the model number of parameters 01:06:07.240 |
And the question is, how do you measure model size? 01:06:15.160 |
However, the language models, number of parameters, 01:06:19.200 |
like a huge chunk of it is in the dictionary, 01:06:21.480 |
for example, which for us just doesn't exist. 01:06:23.880 |
It is linear embedding, which is trivial number of parameters. 01:06:29.240 |
So in terms of number of parameters, it's much smaller. 01:06:39.120 |
this maybe in terms of compute, like how much floating point 01:06:46.400 |
And in terms of this, it's in the same ballpark. 01:06:50.120 |
However, last time I checked, which is quite a few months 01:06:55.640 |
like four times more or five times more in the vision model, 01:07:02.200 |
So that's the two ways of measuring model size. 01:07:09.560 |
And I think it's actually an interesting research topic, 01:07:11.920 |
like how to properly measure and order models 01:07:26.520 |
I think it's just there is less interest in it, 01:07:34.360 |
Like in Google, there are many, many more groups 01:07:37.160 |
doing research with language than with vision. 01:07:40.720 |
And I think we are one of the few groups that 01:07:45.640 |
and are interested in scaling up things in vision so much. 01:07:48.920 |
Whereas in language, it seems there are a lot of groups 01:07:56.040 |
It's not that we don't want to go beyond that, 01:08:07.760 |
Right, so we are actually over time at this point. 01:08:10.840 |
So anyone who has to leave, please feel free to do so. 01:08:13.720 |
And before we do that, Lucas, thank you so much for joining, 01:08:21.600 |
And we know it's in the evening, so thank you 01:08:24.080 |
for taking your free time to come and talk to us here.