
Stanford CS25: V1 | Transformers in Vision: Tackling Problems in Computer Vision


Chapters

0:00
0:34 General Visual Representation
4:08 The Visual Task Adaptation Benchmark
7:26 Self-Supervised Pre-Training
7:58 Semi-Supervised Training
21:22 Synthetic Images
26:33 Applying Transformers to Vision
26:49 Embedding Space
42:05 Early Convolutions
45:28 Patch Size
46:24 Inference Speed
59:31 Scaling the Data Set

Whisper Transcript

00:00:00.000 | Today, I'm going to talk to you about vision transformers,
00:00:07.640 | since this is all about transformers,
00:00:09.440 | specifically their application
00:00:12.200 | for visual representation learning.
00:00:14.400 | But before we jump into transformers,
00:00:16.360 | I'm going to spend like 10 or 15 minutes
00:00:18.080 | giving you a lot of context on all of this,
00:00:20.560 | and specifically also on the vision part of things,
00:00:24.320 | because I think the majority of what you have seen
00:00:27.200 | and will see will be about language.
00:00:29.960 | All right, so let's get started.
00:00:32.280 | My goal and that of my close collaborators
00:00:34.400 | is to find general visual representation,
00:00:37.000 | and you're going to soon see what that means and why,
00:00:40.720 | or what can we do if we imagine
00:00:42.640 | we have a general visual representation.
00:00:45.040 | The hope is that with this,
00:00:46.680 | we can kickstart all kinds of tasks
00:00:49.640 | that require visual input.
00:00:51.280 | That means most tasks that you do
00:00:52.960 | when you have your eyes open, basically,
00:00:56.680 | because if you have a good understanding of what you see,
00:01:00.240 | then you can much quicker understand what's going on
00:01:03.040 | and what you should do.
00:01:04.640 | And eventually, I've had a little kid for a year now,
00:01:09.720 | and so I really want that when he's grown up,
00:01:13.320 | that there is like some kind of robot.
00:01:16.080 | It doesn't need to be nice and pretty like in movies,
00:01:18.080 | just maybe an arm or whatever, that my kid could teach,
00:01:21.640 | or my parents who cannot program
00:01:23.520 | can teach to do some boring task
00:01:25.600 | that they really don't want to do.
00:01:27.440 | And I believe one component of this
00:01:29.440 | is a good visual representation
00:01:31.800 | that generalizes to understanding
00:01:33.880 | the world visually everywhere.
00:01:36.000 | It's not all that's required, but it's one part,
00:01:38.160 | and the part that I'm trying to push.
00:01:40.000 | So this is for context and motivation
00:01:42.520 | on working on general visual representation,
00:01:45.080 | and one good example of a general visual representation
00:01:47.920 | is the humans,
00:01:49.920 | and I'm going to show you what I mean by that.
00:01:51.680 | So here is a task that I give you.
00:01:55.680 | There is three classes, class A, B, and C,
00:01:58.120 | and I give you five images of each class, okay?
00:02:00.560 | And here I give you a new image,
00:02:04.040 | and I'm sure that by now you all know which class it is.
00:02:09.320 | I'm not going to ask because I don't actually see you.
00:02:11.760 | If I was in the room, I would do the raised hands,
00:02:14.560 | but I'm sure you know it's class A now.
00:02:17.160 | Okay, this is fine. We have seen millions of flowers
00:02:19.280 | in our lives, hopefully,
00:02:21.600 | but there is other kinds of pictures,
00:02:23.280 | like this satellite images
00:02:25.360 | that you don't see much in your life.
00:02:26.960 | Some people may have never seen it sometimes,
00:02:29.520 | like when you fly or maybe on TV or in the Internet or so,
00:02:32.840 | but it's rather rare, but still, same story.
00:02:35.680 | Three classes, class A, B, C, five images of each,
00:02:39.760 | and I show you a new image.
00:02:41.720 | This might be a little bit less trivial than the flower,
00:02:44.720 | but I think I've spent enough time talking that by now,
00:02:48.520 | most of you should know that this is class B.
00:02:51.160 | Shows a, what is it, basketball court, right?
00:02:53.720 | All right, now even more abstract.
00:02:57.760 | You don't see this in real life, all right,
00:03:00.200 | but still, I give you images of class A and B.
00:03:03.160 | I have just two to make it a bit easier here
00:03:05.320 | because you need to use your brain a little bit more,
00:03:08.120 | and I show you this new image,
00:03:10.680 | and now I should do a little bit of small talk
00:03:14.400 | to let you think,
00:03:15.960 | like you see that there is like spheres, boxes, and whatnot,
00:03:19.120 | and by now, I hope that most of you know
00:03:21.760 | that this is class A. Why?
00:03:24.320 | Because there is three objects in class A,
00:03:26.880 | and class B is always, what is it, five objects,
00:03:29.640 | no matter what they are, what they look like.
00:03:33.080 | Okay, I think by now, you more or less understand
00:03:37.680 | what I mean when I mean a good visual representation,
00:03:40.160 | general visual representation, right?
00:03:42.400 | Something, I don't know what to call it,
00:03:46.640 | in your brain, in your eyes such that you can quickly see
00:03:51.360 | something new and understand what's going on
00:03:53.760 | with just a few examples, and then generalize from that,
00:03:57.480 | right, and that's the goal.
00:04:00.320 | Then the next step, if you have the goal,
00:04:02.680 | how do we measure progress towards it?
00:04:04.640 | And this is a paper we did a few years ago
00:04:07.680 | with my collaborators,
00:04:08.800 | which we call the Visual Task Adaptation Benchmark.
00:04:10.920 | It's kind of formalization of the little game
00:04:13.120 | that we just played, so it's a benchmark,
00:04:16.880 | and there is some component that you,
00:04:20.280 | or anybody who participates in the benchmark does,
00:04:22.440 | which is creating a model with some data.
00:04:24.800 | We don't really care what data, what model, how, what not.
00:04:28.160 | Just you come with a model.
00:04:30.440 | Then we come with this landscape of all possible visual tasks
00:04:35.280 | that kind of make sense, which is a vague statement,
00:04:38.360 | and we sample some tasks from that,
00:04:41.160 | and this is kind of the task that you have just seen.
00:04:44.840 | They were actually taken out of this Task Adaptation Benchmark,
00:04:49.160 | and we have, for a first step, made 19 such tasks
00:04:53.000 | where we try to cover broad types of visual tasks,
00:04:56.320 | not just classes of natural images
00:04:59.320 | like these dogs and cats things,
00:05:01.200 | but also of very specialized images like satellite image,
00:05:04.480 | also non-classification tasks that involve counting,
00:05:07.360 | like the one I showed you before, right,
00:05:09.360 | but that can be expressed in this simple classification API,
00:05:13.200 | but that logically requires some more thinking.
00:05:15.840 | Some things like distance, we have something with cars
00:05:19.760 | and with distance of the closest car and things like that.
00:05:22.560 | It should cover a broad range of variation,
00:05:27.000 | and then with the model that you came to this benchmark,
00:05:32.000 | you can do some adaptation step on each of the datasets,
00:05:35.760 | one after another or at the same time.
00:05:37.560 | It doesn't really matter,
00:05:38.920 | but then you should have, as a result,
00:05:40.720 | a model of this dataset, which is very small.
00:05:44.040 | It just has seen a few examples for each class
00:05:46.960 | that then performs well there,
00:05:48.560 | and then we just take the average score
00:05:50.880 | across all of these tasks,
00:05:52.040 | and this is what we call the VTAB score,
00:05:54.040 | and this is how, for now,
00:05:56.120 | we judge how good of a general visual representation
00:06:01.040 | does your model and adaptation algorithm have,
00:06:03.320 | and now just for some nomenclature, this preparation,
00:06:08.560 | we have words that we often use pre-training.
00:06:10.720 | Sometimes we call it the upstream,
00:06:12.560 | like upstream data, upstream training, something,
00:06:15.640 | so I may use this word interchangeably with pre-training,
00:06:18.680 | and then there is the second part,
00:06:20.200 | which we usually call transfer,
00:06:21.920 | and then sometimes we say downstream,
00:06:23.720 | and the adaptation, in principle, it's whatever you want,
00:06:29.440 | but for our work, we almost always just use very simple,
00:06:32.200 | fine-tuning without any bells and whistles
00:06:34.680 | because it's simple and works well.
00:06:36.720 | In general, we try to do things as simple as possible.
00:06:39.600 | It still works well, and so sometimes I even just say,
00:06:42.720 | like, fine-tuning, when I mean transferring.
00:06:44.280 | That means moving from this pre-training to the transfer.
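As a minimal sketch of the protocol just described (not the actual benchmark code), assuming hypothetical `fine_tune` and `evaluate` helpers for whatever upstream model and transfer recipe you bring:

```python
# Minimal sketch of the VTAB protocol: bring one pre-trained model, adapt it
# independently on each small downstream task (here by fine-tuning), and
# report the mean accuracy across all tasks.
def vtab_score(pretrained_model, tasks, fine_tune, evaluate):
    scores = []
    for task in tasks:                                         # e.g. 19 tasks, each with few labeled examples
        adapted = fine_tune(pretrained_model, task.train_set)  # the "downstream" / transfer step
        scores.append(evaluate(adapted, task.test_set))        # accuracy on that task
    return sum(scores) / len(scores)                           # the average is the reported score
```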
00:06:47.120 | All right, so so far for the settings, so far so good?
00:06:52.440 | Good. Then the question is, how do we get there,
00:06:58.600 | and we spend a lot of time thinking about this
00:07:00.640 | and trying different things,
00:07:02.240 | and this is also roughly the outline
00:07:04.600 | of all that I have available to talk about,
00:07:07.520 | which doesn't mean we're going to cover everything,
00:07:10.120 | so I'm not going to go, like, through the outline exactly,
00:07:13.640 | but you will see this again and again,
00:07:15.160 | and as you see, the vision transformer
00:07:17.040 | only comes a little bit later.
00:07:19.080 | There's some stuff before that,
00:07:20.680 | so this one, just really quickly
00:07:23.640 | because it doesn't matter for this course,
00:07:25.280 | is that we spend some time trying self-supervised pre-training
00:07:28.440 | which is very popular in language,
00:07:30.200 | and in vision only recently has become popular,
00:07:32.760 | and it doesn't work that well.
00:07:35.480 | You don't need to understand these bars,
00:07:38.080 | but basically higher is better,
00:07:40.000 | and here, just look at the blue ones.
00:07:42.880 | That's the VTAB score for this few-shot VTAB,
00:07:47.000 | and self-supervised learning performs like this bar.
00:07:49.680 | We tried multiple methods and multiple models and so on.
00:07:52.520 | It was a proper good benchmark,
00:07:53.960 | but it was a couple years ago.
00:07:55.560 | Then we moved on to semi-supervised training,
00:08:00.080 | so a few labeled examples and a ton of unlabeled examples.
00:08:03.440 | That's this next blue bar.
00:08:05.000 | Did you actually see the mouse cursor?
00:08:06.400 | Sorry.
00:08:08.000 | - We don't see the mouse cursor.
00:08:11.680 | - Maybe I need to do some laser --
00:08:16.240 | - Oh, we can see it. We can see it.
00:08:17.640 | - Yeah. - Oh, okay.
00:08:19.960 | - Yeah, so then semi-supervised is that blue bar
00:08:22.520 | which is a lot higher than this other blue bar,
00:08:24.880 | so what this means to us
00:08:26.400 | is that by adding a few labeled examples,
00:08:28.280 | we're able to get much better
00:08:31.040 | or much more general visual representation.
00:08:33.240 | Then I'm not going to spend more time on this
00:08:36.240 | and how exactly and so on,
00:08:38.320 | but I'm going to move to the next one,
00:08:39.640 | which was for us kind of a breakthrough
00:08:42.120 | when we figured out that, well,
00:08:43.600 | if we just scale up fully-supervised pre-training,
00:08:47.280 | then we get really much better representations
00:08:49.840 | than everything we've seen before,
00:08:51.800 | and here I want to briefly spend some time on that one
00:08:54.080 | because it's the precursor to using vision
00:08:56.480 | or transformers in vision.
00:08:58.080 | So the idea is simple.
00:09:00.960 | There are tons of images on the Internet.
00:09:03.200 | That's always what you hear is motivation
00:09:04.880 | for semi-supervised or unsupervised learning, right?
00:09:07.760 | But actually, where these images come from,
00:09:10.160 | there's almost always some extra information,
00:09:12.520 | like surrounding the image on the Web
00:09:14.520 | or if you collect it otherwise,
00:09:16.720 | there's some extra information there
00:09:18.440 | that you could use as some weak source of information
00:09:21.240 | or some weak label, right?
00:09:22.840 | Then it happens that in Google,
00:09:25.080 | there's some team that actually does this for production,
00:09:28.080 | and they have collected already a large dataset
00:09:31.240 | with some pipeline that from the surrounding signals
00:09:34.280 | somewhat automatically,
00:09:36.040 | but very noisily annotates the images,
00:09:38.240 | and we wanted to figure out how far can we go
00:09:42.160 | when we scale up pre-training.
00:09:43.760 | Then, long story short, you need a couple of ingredients.
00:09:47.920 | One is patience. I really like this plot.
00:09:50.680 | This is one of the curves of just pre-training
00:09:53.120 | on large data with large models.
00:09:55.040 | The details don't really matter.
00:09:57.640 | The gist is that if I zoom into this little box,
00:10:00.360 | I see this here, and this is the metric for the training,
00:10:04.320 | like the performance in upstream.
00:10:06.280 | Then I see after spending eight GPU weeks of compute,
00:10:09.440 | what does GPU week mean?
00:10:10.560 | It means eight GPUs for a week or, sorry,
00:10:13.920 | one GPU for eight weeks or eight GPUs for one week
00:10:17.960 | or 16 GPUs for half week and so on, right?
00:10:21.560 | But this looks flat. A reasonable person would say,
00:10:24.000 | "Yeah, there's no progress for a week on eight GPUs.
00:10:26.760 | This is flat. I'm going to stop and try something else,"
00:10:29.000 | but we are not reasonable, so we keep going,
00:10:31.600 | and this is what the exact same spot looks like
00:10:33.920 | after eight GPU months of training,
00:10:35.880 | and you can clearly see the things are progressing, right?
00:10:39.480 | So it may not always be obvious, and you need patience.
00:10:42.240 | The second thing is that you actually need
00:10:45.880 | to scale up everything,
00:10:47.120 | so this was work done with ResNets,
00:10:49.000 | not yet with transformers.
00:10:50.640 | As you see, there are a lot of ResNet models here.
00:10:52.760 | The x-axis is the number of images available.
00:10:55.000 | In vision, there is this ImageNet dataset,
00:10:57.000 | which is a very common, super common dataset for pre-training,
00:11:00.400 | which has 1.3 million images.
00:11:02.560 | There's another one which has 10 times more images
00:11:04.560 | that's still public, and then there is one subset
00:11:07.640 | from this internal group
00:11:09.200 | that has 300 million labeled images,
00:11:11.240 | so the y-axis is measure of accuracy on some tasks,
00:11:16.480 | and we tried many. They all look similar,
00:11:20.000 | and the dots are differently sized ResNets.
00:11:22.000 | The blue dot is the standard ResNet-50 that everybody uses.
00:11:25.320 | If you train this one on more data,
00:11:27.480 | it looks promising at first,
00:11:29.040 | but if you go to even more data, it looks like,
00:11:30.960 | oh, okay, this doesn't really seem that useful,
00:11:34.280 | and this is what most people have been doing for a long time,
00:11:37.280 | and a lot of people, even in Google, were like,
00:11:39.960 | yeah, I tried this internal checkpoint on these tons of data.
00:11:43.840 | It doesn't really help that much.
00:11:45.800 | However, what we found out, and in hindsight,
00:11:48.000 | it's kind of obvious, is that you actually need to scale
00:11:51.720 | not just the data but also the model.
00:11:53.760 | Here, this blue dot is a gigantic ResNet that is slow as hell,
00:11:57.520 | but when you scale this up together with the data,
00:11:59.720 | you keep getting benefit with adding more data,
00:12:02.320 | and then if you do these two things,
00:12:03.840 | scale up everything and be patient,
00:12:05.720 | be patient, which you could also call scaling up your patience.
00:12:10.280 | Then you get a lot of benefits,
00:12:14.040 | so here there is few-shot transfer learning,
00:12:17.040 | like what I showed you before,
00:12:19.000 | and on the x-axis is size of the model,
00:12:22.400 | on the y-axis is the accuracy on one of these tasks,
00:12:25.480 | but again, others look similar,
00:12:27.440 | and these three different curves
00:12:29.040 | are for pre-training with different dataset sizes.
00:12:31.600 | The green one being the standard one,
00:12:33.240 | you don't really see benefit or small benefit
00:12:36.120 | from going with larger models.
00:12:37.640 | The blue one is 10 times larger.
00:12:39.000 | You start seeing some slope upwards,
00:12:42.000 | but really only with this giant data,
00:12:43.760 | you start getting better and better and better
00:12:46.240 | at this few short transfer learning
00:12:48.080 | when you pre-train on more and more data
00:12:49.720 | with larger and larger models.
00:12:52.000 | Second benefit that we did not anticipate really at all,
00:12:55.560 | but then found out is that these models are super robust
00:12:58.400 | when you scale everything up.
00:12:59.840 | This is ObjectNet.
00:13:02.240 | It's a data set that's specifically designed
00:13:04.320 | to measure robustness,
00:13:05.240 | and it shows things in crazy contexts,
00:13:07.120 | like a chair in the bathtub and things like that,
00:13:10.200 | and you should recognize it as a chair.
00:13:13.920 | Here, the pink dots are basically how existing models,
00:13:17.480 | and x-axis is, again, how large is the model,
00:13:19.920 | and pink dot is existing ones from the literature,
00:13:22.600 | and then these lines, same color coding,
00:13:24.680 | is what we found out.
00:13:26.080 | Again, you see this large data,
00:13:28.160 | and then going to large model
00:13:29.520 | just gives you amazing benefits on,
00:13:31.840 | like in this case, out-of-distribution robustness.
00:13:34.360 | This was amazing.
00:13:38.280 | Scale up everything, be patient, and get huge benefit.
00:13:42.160 | - Sorry, Lucas.
00:13:43.960 | Sorry for interrupting you,
00:13:45.040 | but there is a question from a student in the class.
00:13:47.840 | - Yep. - Right.
00:13:49.400 | Do you want to unmute yourself and ask it yourself?
00:13:51.960 | - Yeah, I can ask my question.
00:13:55.040 | Can people hear me?
00:13:55.960 | Maybe there's some-- - Yes.
00:13:56.800 | - I'm sorry, one second.
00:13:57.640 | Let me just step away real quick.
00:13:58.480 | Yeah, so the question I wanna know is,
00:13:59.840 | what work has been done characterizing the parameters
00:14:02.120 | after pre-training finishes?
00:14:04.040 | Like, the reason why I'm motivating this question is,
00:14:06.960 | it seems like we do this tremendous amount of pre-training,
00:14:09.320 | but it seems like we might be able
00:14:10.280 | to significantly reduce that
00:14:12.000 | if we just have smarter initialization schemes.
00:14:14.520 | - Yeah, you know, I've been thinking this
00:14:17.680 | for a long time, actually, also.
00:14:19.560 | And I've come to conclude that I think not.
00:14:25.440 | I think there is, like, two parts.
00:14:28.200 | One is, like, what I like to call
00:14:30.800 | hand-wavy the numerics of the weights.
00:14:33.080 | You know, that everything is in a nice range,
00:14:35.200 | such that it can have nice input/output functions,
00:14:38.320 | and so on, and that your optimizer can do steps
00:14:41.160 | that make reasonable change to the input/output function,
00:14:44.440 | but not too large, and so on.
00:14:46.600 | I think that is part of it,
00:14:48.120 | and that you can get through good init
00:14:50.040 | or good normalizations and whatnot.
00:14:52.040 | But then I also think there is,
00:14:54.880 | I do think that these models memorize a lot,
00:14:57.040 | and then, personally, I believe,
00:14:59.480 | but I don't know of evidence or so,
00:15:01.560 | that these models do more kind of, you know,
00:15:05.280 | remembering similarity to things they've seen in training.
00:15:09.080 | And then, as you grow things up,
00:15:11.160 | they have more memory, and they have seen more things,
00:15:13.840 | so they should be better on more newer things,
00:15:16.480 | because there's more similar things they have seen.
00:15:19.240 | And this, I don't think you can, like,
00:15:21.560 | just create one shot from initialization.
00:15:24.400 | But I don't have the immediate pointer to a paper
00:15:29.320 | at the top of my head now to answer your question.
00:15:31.840 | - Okay, thank you.
00:15:32.760 | - I think we also have more questions,
00:15:36.200 | so has posted on the chat and is raising his hand.
00:15:40.160 | Maybe in this order, you wanna ask your question first?
00:15:44.400 | - Yeah, for sure, I can go ahead.
00:15:46.000 | So I just have a quick clarification on this chart right here,
00:15:50.760 | the chart number three.
00:15:52.840 | The BiT-L and BiT-M and BiT-S,
00:15:54.760 | are they the same model architecture,
00:15:57.840 | but just trained on different datasets?
00:15:59.920 | So the BiT-S is trained on the 1.3 million
00:16:02.720 | all the way to the 300 million image dataset for BiT-L?
00:16:05.560 | - Yes and no.
00:16:09.480 | The architecture is here on the x-axis.
00:16:11.640 | So within one vertical slice,
00:16:13.960 | these are the same architecture.
00:16:16.000 | And then the different points are random restarts,
00:16:18.760 | because when you do few-shot learning,
00:16:20.280 | there is a lot of variance
00:16:21.480 | in which few examples do you see.
00:16:24.120 | And then again, this next vertical slice
00:16:26.200 | is the same model and so on.
00:16:27.640 | And as you go to the right, the model gets larger.
00:16:30.320 | And so you can see that for this little data,
00:16:32.840 | going to larger model doesn't really help you much
00:16:35.400 | for pre-training, only for this giant data,
00:16:37.920 | it's really the giant data,
00:16:39.240 | not necessarily the giant model, in this case.
00:16:42.200 | - Right, that makes a lot of sense, thank you.
00:16:44.680 | - Okay.
00:16:45.520 | - Do you have a question?
00:16:48.600 | Oh, I see you're raising your hand as well.
00:16:51.560 | Go ahead and let Otto.
00:16:53.320 | - Hey, yeah, thanks.
00:16:54.320 | What is the intuition for the upstream performance
00:17:00.040 | in figure one spiking so suddenly
00:17:03.000 | at like 60 or 40 points in training?
00:17:07.720 | - Here, right?
00:17:08.560 | Yeah.
00:17:09.480 | - Yeah, yeah, I'm looking at it again,
00:17:10.920 | like around one point, like, I don't know,
00:17:12.920 | that just seems like an odd looking training curve.
00:17:15.560 | So like, what's the intuition behind that?
00:17:19.000 | - Yeah, this is old school computer vision thing,
00:17:21.600 | or old school, I mean, a few years ago.
00:17:24.040 | Is this when the learning rate changes?
00:17:26.240 | In computer vision, it used to be very common
00:17:28.440 | to have the learning rate in a kind of staircase pattern.
00:17:31.520 | So it's constant for a while, and then you stop,
00:17:34.000 | you divide the learning rate by 10, usually,
00:17:36.160 | boom, smaller, and then you continue.
00:17:38.280 | And this gives you this huge jump.
00:17:40.680 | And nowadays, people don't use this much anymore.
00:17:42.880 | And this work was like three years ago, I think,
00:17:44.720 | or two or three years ago, I don't remember.
00:17:47.360 | It was very common back then.
00:17:48.960 | And nowadays, people use more continuously changing
00:17:51.800 | learning rate schedule, and then you don't really have
00:17:54.120 | this sudden change anymore.
00:17:55.960 | But if you would overlay it, it would be like
00:17:57.760 | more continuously, but going roughly the same.
00:18:00.720 | And then in language, I think most people,
00:18:03.000 | or many people use just linearly decreasing
00:18:05.560 | learning rate schedule, where also you don't see
00:18:07.240 | this effect, because learning rate continuously decreases.
00:18:11.000 | - Okay, yeah, sounds good, thanks.
00:18:12.680 | - And then this is, because you asked
00:18:16.760 | about this dotted line.
00:18:18.640 | Actually here, if you're like here, you could say,
00:18:21.160 | okay, but this is excessive, right?
00:18:22.840 | Maybe it does really seem almost flat.
00:18:26.400 | Maybe you could have started the decay earlier,
00:18:28.920 | and earlier, and earlier, and then you would get the same,
00:18:31.800 | but much quicker.
00:18:32.920 | And this one shows what would happen then.
00:18:35.840 | And you do land at a much worse place in the end
00:18:39.160 | than when being patient.
00:18:41.080 | - All right, yeah, yeah, that makes sense.
00:18:43.440 | Thanks.
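For concreteness, here is a small illustrative sketch of the two kinds of schedule contrasted in that answer; the base learning rate and step boundaries are made-up values, not the ones used in the experiments.

```python
import math

# Old-school "staircase" schedule: constant learning rate, divided by 10 at a
# few hand-picked boundaries. The division is what produces the sudden jumps
# in the training curve.
def staircase_lr(step, base_lr=0.1, boundaries=(60_000, 80_000)):
    lr = base_lr
    for b in boundaries:
        if step >= b:
            lr /= 10.0
    return lr

# A smoothly decaying alternative (cosine), closer to what is common now;
# no sudden drops, so no visible jumps in the curve.
def cosine_lr(step, total_steps, base_lr=0.1):
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))
```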
00:18:45.400 | - Was there more question, or I continue?
00:18:49.080 | - I think both of you have your answers.
00:18:51.960 | - 'Cause I need to mention, I don't see you,
00:18:53.960 | I just see my slide.
00:18:55.960 | - Yeah, it's fine, we can coordinate that with this.
00:18:58.560 | - Hi, yeah, so I just wanted to make sure
00:19:03.320 | that I'm on the same page.
00:19:05.000 | So basically what you're trying to do is multitask learning
00:19:08.320 | with convolutional neural networks/LSTMs, right?
00:19:11.960 | That's kind of like ResNet.
00:19:13.320 | But you're doing multitask learning, correct?
00:19:17.160 | - No, where does the multitask come from?
00:19:20.160 | Or where does it come from?
00:19:21.920 | - Because like, initially, like you showed like different,
00:19:24.960 | - Ah, yeah, okay.
00:19:26.640 | - Yeah, okay.
00:19:28.920 | - So there is two phases.
00:19:30.720 | The first one is the pre-training.
00:19:33.400 | And this pre-training, I didn't mention it yet.
00:19:36.320 | I just said, I don't care what you do in the pre-training,
00:19:38.960 | just pre-train somehow, and give me the model.
00:19:41.600 | And then I test it on multiple tasks independently.
00:19:45.320 | And I'm tested on multiple tasks,
00:19:47.040 | means like transfer it to that task,
00:19:49.080 | which in our case means fine-tune it just on the task,
00:19:51.920 | and then see how well it does, and so on.
00:19:54.200 | But it could mean other things.
00:19:55.520 | Like later we moved to just learning a linear regression
00:19:58.800 | on top of the embeddings for each task.
00:20:01.160 | And now during the pre-training,
00:20:03.240 | what we do is just regular supervised learning,
00:20:05.760 | but just scaling everything up.
00:20:07.520 | And regular supervised learning is just,
00:20:09.680 | well, not multitask, but multilabel,
00:20:13.360 | in the sense that an image could have
00:20:14.800 | a couple labels or not, but it usually doesn't have.
00:20:17.280 | - This is minor. - Okay, got it.
00:20:20.200 | - Thanks.
00:20:21.040 | - All right, we have a question.
00:20:27.440 | - Yeah, just have a quick follow-up
00:20:28.760 | about the discussion that started earlier,
00:20:30.360 | about whether it's like memorization,
00:20:33.960 | or it's more memorizing the data
00:20:37.400 | in pre-training datasets.
00:20:39.040 | So I know in the language side,
00:20:40.520 | there's a quite interesting phenomenon
00:20:42.080 | that you can pre-train on a synthetic language
00:20:45.280 | that's, it doesn't have any semantic meaning,
00:20:48.400 | but it only has structural,
00:20:49.960 | like paired premises or things like that.
00:20:52.600 | And that actually gives you almost the same boost
00:20:56.280 | in your downstream transfer as a normal pre-training.
00:20:59.520 | So I wonder if, say like,
00:21:02.160 | so this means like in for language, right,
00:21:04.600 | the structure seems to make a lot of contribution,
00:21:07.480 | which can be replaced by visualization.
00:21:09.680 | But I don't know if it's an image,
00:21:11.040 | it's a different case, maybe to have people done,
00:21:13.640 | maybe some synthetic pre-training data set for image.
00:21:17.960 | - Yeah, there was a paper,
00:21:19.920 | I forgot the name and the authors,
00:21:22.320 | but it creates completely synthetic images
00:21:24.600 | and like not even rendering of some realistic things,
00:21:27.440 | but just completely patterns, waves, and shapes and so on,
00:21:31.880 | and uses that for pre-training.
00:21:33.760 | And then it shows that they get almost the same performance
00:21:37.040 | as ImageNet quickly,
00:21:38.080 | they actually do this with vision transformers.
00:21:41.520 | But yeah, they never go further or it is not clear,
00:21:45.600 | you know, they kind of show that you can almost get
00:21:47.920 | to this point here.
00:21:49.200 | That is not clear how much further can you go with this.
00:21:53.480 | And I think probably not much further,
00:21:56.000 | but it's just me guessing that not much further,
00:21:59.160 | I don't have evidence for it.
00:22:00.600 | - Right, so I have one question
00:22:04.880 | and then we can continue with the talk.
00:22:07.360 | You said that you think the large vision models
00:22:09.880 | are like learning some sort of similarity
00:22:11.920 | to the data set they're trained on.
00:22:13.040 | So do you think they are behaving
00:22:14.480 | like prototypical networks, in a sense?
00:22:16.800 | - They're behaving like what networks?
00:22:19.760 | - Oh, so like prototypical networks?
00:22:21.720 | Essentially like when you're doing pre-short learning,
00:22:24.760 | you just say like, "I'm going to learn a network."
00:22:26.640 | - Yeah, yeah, yeah.
00:22:27.480 | - And learn the metric space.
00:22:28.920 | - Probably not exactly, but close-ish.
00:22:39.240 | - I mean, I cannot really say
00:22:40.560 | because this is just some intuitive guess that I have.
00:22:44.000 | That's what they do, but nobody really knows
00:22:45.600 | what the models do, right?
00:22:46.960 | Yeah, I mean, we do get much more,
00:22:52.480 | when we do something like prototypical networks
00:22:54.680 | for few-shot learning with these pre-trained models,
00:22:57.560 | we do get worse performance than when we do fine-tuning.
00:23:00.880 | So there is a bit more to it still.
00:23:03.880 | However, I don't know what is this more.
00:23:07.800 | (laughs)
00:23:09.120 | - Okay, thanks.
00:23:10.760 | - All right, let's continue.
00:23:12.160 | Okay, yeah, so, ah, right, and I didn't mention,
00:23:20.600 | but on ImageNet, which is the top benchmark
00:23:23.520 | in computer vision, with this work, with the big transfer,
00:23:27.640 | we finally were able to increase the score
00:23:30.600 | after there was a long period of a couple of years
00:23:33.320 | of no improvement, despite many attempts
00:23:35.680 | that you can see there.
00:23:37.240 | This was, yay, awesome.
00:23:39.200 | Pre-training, scaling up everything,
00:23:41.160 | and leveraging the data.
00:23:43.680 | And then, okay, let's not care about that.
00:23:46.080 | Yeah, that's, okay, this is just a little aside,
00:23:51.360 | that if you are in the setting that I mentioned
00:23:53.640 | of pre-training on huge amounts of data
00:23:55.560 | and then testing on many other tasks,
00:23:58.040 | you should, of course, be careful
00:23:59.520 | that you don't have images from the other tasks
00:24:02.440 | in your pre-training data, right?
00:24:05.360 | Otherwise, you have seen them during training,
00:24:07.360 | and then you're not really generalizing,
00:24:09.240 | and you're just fooling yourself with good scores.
00:24:12.520 | And this is a real danger when we get huge amounts of data,
00:24:15.240 | because, like, ImageNet images can totally be
00:24:17.600 | in huge amounts of data, right?
00:24:19.440 | So we actually use an internal pipeline
00:24:22.840 | that is really good at finding duplicates,
00:24:24.920 | and also new duplicates, like when they are shifted,
00:24:27.760 | rotated, squeezed, color changed a bit, whatnot.
00:24:30.760 | It still finds it.
00:24:32.000 | And we use this to completely remove all images
00:24:34.920 | from the test data sets that we test on later.
00:24:37.880 | And we actually found that a lot of classic
00:24:39.960 | just vision data sets have clear duplicates
00:24:42.520 | between their training and validation set,
00:24:44.920 | between the training set of ImageNet and CIFAR,
00:24:48.080 | 10 and 100 test sets, and so on.
00:24:50.960 | So new duplicates are quite widespread problem in vision.
00:24:54.600 | And this slide is just to say, hey, there are problems,
00:24:57.080 | but in all that we present,
00:24:58.720 | we actually took care that in the pre-training,
00:25:01.120 | as best as we can, we don't have new duplicates.
00:25:04.840 | Right, now back to being like, hey,
00:25:08.520 | we figured out large data, a large model,
00:25:10.560 | and then things get really good.
00:25:12.760 | And that's how we got to transformers, basically.
00:25:16.440 | In computer vision, everything was convolutional networks
00:25:19.160 | for many years.
00:25:20.480 | And basically there was nothing else, CNN is king.
00:25:23.440 | However, in language, we saw a transformation recently,
00:25:26.960 | right, that everything used to be LSTM,
00:25:29.320 | everywhere LSTM was king, and then came the transformer.
00:25:32.880 | And in the case when there is a lot of data available,
00:25:35.880 | suddenly transformer worked much better than LSTM.
00:25:39.400 | For little data, that was still not the case exactly.
00:25:42.840 | So what we then thought is that, okay,
00:25:45.600 | so we are now in this regime where we have tons of data
00:25:48.320 | and we see benefit from it.
00:25:50.040 | Can we see even more benefit if we try also
00:25:52.520 | out the transformer architecture in vision?
00:25:55.600 | And that's basically what we did.
00:25:58.280 | To be fair, there were a few other attempts
00:26:01.480 | at trying out transformer in vision before,
00:26:03.960 | that I don't want to detail too much here
00:26:06.360 | because I don't want to point fingers too much,
00:26:09.120 | but they were all not really using transformers
00:26:12.800 | for learning everything from the data.
00:26:14.920 | It was always like, get something out of a ResNet first,
00:26:19.280 | like object detection proposals
00:26:21.560 | or high-level feature maps or things like that,
00:26:24.280 | and then stick a little transformer on top.
00:26:26.480 | But we wanted to go all the way,
00:26:28.040 | just transformer everything.
00:26:30.240 | And so we came up with the simplest and most natural,
00:26:33.080 | I believe, way of applying transformers to vision,
00:26:36.440 | which is you take the image, you cut it into pieces,
00:26:39.360 | and that's it, like a puzzle.
00:26:41.400 | Tack, tack, tack, patches, and that's it.
00:26:45.440 | Each of these patches, you take it
00:26:48.320 | and you project it into your embedding space,
00:26:50.680 | which is the input to the transformer.
00:26:52.640 | Embedded space is just abstract space of,
00:26:55.120 | let's say, 768 dimensions, for example.
00:26:58.160 | How do you embed it?
00:26:59.040 | You just take the pixel values
00:27:00.640 | and put the linear projection layer on top.
00:27:03.640 | So take all the pixels, flatten the vector,
00:27:07.520 | matrix multiply into whatever size you want,
00:27:10.680 | and use the same matrix for all the patches.
00:27:14.520 | And here we just went the simplest way ever
00:27:16.600 | with non-overlapping patches and everything.
00:27:18.960 | You can, and people later did, go on and say,
00:27:22.600 | "Hey, this is almost a convolution.
00:27:24.280 | Let's make proper convolution.
00:27:25.760 | Let's make stack of them," whatnot.
00:27:27.440 | But this was all follow-up work later.
00:27:29.480 | This is just the simplest way to do it first.
00:27:32.480 | Then we have these embedded patches,
00:27:34.160 | and we treat them exactly literally
00:27:36.840 | like the tokens in language,
00:27:40.080 | and then give them to exactly the BERT transformer
00:27:44.280 | from language folks.
00:27:45.960 | And just like in language, we add this class token,
00:27:49.400 | or I think the language is like end-of-sentence token
00:27:51.800 | or something.
00:27:53.960 | And we add the position embeddings to the tokens
00:27:57.680 | that can be learned.
00:27:59.960 | And then we feed all of this to a transformer encoder,
00:28:02.760 | which has a MLP head, which reads out this class token,
00:28:07.200 | and then maps it to Softmax layer
00:28:10.120 | for classification, for example.
00:28:12.720 | And that's it. That is the vision transformer.
00:28:15.120 | So it's literally a BERT transformer,
00:28:16.960 | but instead of words or sentence tokens,
00:28:20.960 | feed in patches transformed into tokens.
00:28:23.840 | And that's it.
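A rough sketch of the input pipeline just described, assuming PyTorch and the sizes used as the running example (224-pixel images, 16 by 16 patches, 768-dimensional embeddings); this is illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=768):
        super().__init__()
        self.patch = patch
        num_patches = (img_size // patch) ** 2                 # 14 * 14 = 196
        self.proj = nn.Linear(patch * patch * 3, dim)          # one matrix shared by all patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))  # extra "class" token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # freely learned positions

    def forward(self, x):                                      # x: (B, 3, 224, 224)
        B, C, _, _ = x.shape
        p = self.patch
        # cut the image into non-overlapping p x p patches and flatten each one
        x = x.unfold(2, p, p).unfold(3, p, p)                  # (B, C, 14, 14, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        tokens = self.proj(x)                                  # linear projection into the embedding space
        tokens = torch.cat([self.cls_token.expand(B, -1, -1), tokens], dim=1)
        return tokens + self.pos_embed                         # fed to a standard BERT-style encoder
```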
00:28:25.280 | And then just same story as before, scale everything up.
00:28:28.360 | Compute, dataset, model size, patience, everything.
00:28:32.400 | And see what happens. Is this good or not?
00:28:35.920 | That was the question.
00:28:37.280 | And now we can see a plot here.
00:28:39.080 | This is similar plot as before.
00:28:41.880 | The gray area is actually where all of the BiT dots were before.
00:28:46.480 | And now the bubbles are vision transformers
00:28:49.400 | of different sizes.
00:28:51.000 | And the bubble is kind of the size of the model,
00:28:53.800 | although it's a bit hard to say exactly.
00:28:56.320 | And what you can see first is that with little data,
00:28:58.760 | ImageNet is the 1.3 million images.
00:29:01.360 | It works worse than ResNet.
00:29:03.520 | So if we would not believe in this idea
00:29:05.440 | and just try this, we're like, "Okay, this is a crap idea."
00:29:07.920 | And 1.3 million images is not that little.
00:29:11.520 | Then the 10 times larger data sets
00:29:13.080 | started in the same ballpark as ResNet.
00:29:15.640 | And when we go to much larger data
00:29:18.800 | with a much larger transformer,
00:29:20.760 | then we actually start outperforming this ResNet.
00:29:23.600 | And we outperform it just by a little.
00:29:25.720 | But this ResNet was really hard to get
00:29:27.560 | and is extremely clumsy and slow and big.
00:29:30.360 | So we were very excited by this.
00:29:33.840 | Then we did more controlled studies and everything.
00:29:35.880 | And one of them is like using subset of the same data set.
00:29:40.040 | And there's lots of curves,
00:29:41.360 | but basically just look at the dark gray one
00:29:44.240 | and the light blue one.
00:29:46.280 | These are roughly similarly fast and clumsy
00:29:49.800 | or easy to use or difficult to use: BiT,
00:29:53.320 | which is a ResNet variant, and ViT, the vision transformer.
00:29:57.000 | And what you can see, vision transformer,
00:29:58.560 | when we have little, in quotes, little data,
00:30:01.800 | is really bad compared to ResNet.
00:30:04.360 | But as we start having a lot of data, actually,
00:30:07.360 | it starts outperforming the ResNet.
00:30:09.200 | And this is very promising
00:30:10.600 | because I think everything that looks huge
00:30:12.840 | and a lot and so on now, in five or 10 years,
00:30:15.640 | it's maybe regular.
00:30:17.000 | Like 10 years ago, ImageNet seemed to be huge
00:30:20.400 | and massive amount of data.
00:30:21.640 | No, not anymore.
00:30:23.520 | So we should look to the future.
00:30:24.800 | And this looks promising for the future.
00:30:28.040 | Then back to the same benchmark.
00:30:29.760 | That was another little jump.
00:30:33.200 | - Because we, yeah, yeah, we have some questions.
00:30:37.080 | - Yep.
00:30:38.280 | There is also this section about, yeah.
00:30:41.960 | So it's in that order,
00:30:45.880 | if you want to unmute yourself and ask the questions.
00:30:50.440 | - Sure, yeah.
00:30:51.280 | And I think Dimal already answered part of the question,
00:30:54.080 | but I was wondering in the input to this transformer,
00:30:57.360 | when you're chunking up the image
00:30:58.600 | into little puzzle pieces and then finding them,
00:31:02.920 | does the order of feeding these patches in matter?
00:31:06.840 | Like if you switch the order,
00:31:08.880 | does the prediction maybe change?
00:31:11.600 | - Yeah, that's a good question.
00:31:13.480 | And I actually have a slide on something like this,
00:31:16.160 | but not exactly.
00:31:17.800 | Let me jump there.
00:31:20.080 | So first of all,
00:31:21.120 | if the order is consistent during training, right?
00:31:24.840 | And you don't shuffle the order again for each new image,
00:31:28.360 | then it's literally the exact same.
00:31:30.280 | You get the same curve saying everything
00:31:31.920 | because we don't encode the order anywhere.
00:31:34.200 | If you start randomizing the order
00:31:36.160 | all the time during training,
00:31:37.520 | then performance gets quite a lot worse.
00:31:39.960 | And let me show you why.
00:31:41.000 | This is the slide was on my plan to present anyways.
00:31:45.280 | Then if you ask, let's jump here.
00:31:47.440 | These are, this is a visualization
00:31:49.360 | of the position embeddings.
00:31:51.600 | What does it mean?
00:31:52.440 | So in this case,
00:31:53.280 | we had 14 by 14 patches that we cut the image in.
00:31:56.880 | So it means we have also 14 by 14 position embeddings.
00:32:01.160 | Although we just see them as one long sequence of,
00:32:03.640 | what is it?
00:32:04.480 | 196 of them.
00:32:09.720 | And now each of these pictures shows the position embedding,
00:32:13.040 | which corresponds to this location.
00:32:15.400 | How similar is it to all the other position embeddings?
00:32:18.720 | So let's look at this one, for example.
00:32:20.560 | Yellow means perfectly similar, like exactly the same.
00:32:23.680 | And blue means opposite in terms of cosine similarity.
00:32:27.320 | So this position embedding is most similar to itself,
00:32:30.760 | which is the pixel here.
00:32:32.360 | And then the neighboring pixels is how similar is it
00:32:35.520 | to the position embeddings that correspond originally
00:32:39.000 | to the neighboring patch.
00:32:40.640 | And we do see a very clear pattern
00:32:42.400 | that each position embedding is very similar
00:32:44.760 | to the embedding from its surrounding patches.
00:32:48.080 | And we didn't implement any of this, right?
00:32:51.840 | We just had these position embeddings
00:32:53.440 | at randomly initialized variables,
00:32:55.240 | and they are learned as freely
00:32:57.280 | as the rest of the parameters of the model.
00:32:59.600 | But they learned to recover this notion
00:33:01.400 | of what are my neighbor patches,
00:33:03.320 | even though we don't give this information
00:33:05.080 | anywhere at any time,
00:33:06.480 | besides the raw image data and the task
00:33:08.440 | to please classify this image.
00:33:12.280 | So that's pretty cool, I think.
00:33:13.560 | But it also means that if you take the trained model now
00:33:16.480 | and give in patches
00:33:18.680 | in a completely differently shuffled order,
00:33:21.200 | it's going to perform poorly
00:33:22.480 | because these learned position embeddings
00:33:24.680 | don't make sense anymore.
00:33:26.480 | We did try also to implement, like, position embeddings
00:33:30.360 | which encode the location as hardcoded by us,
00:33:35.440 | and other fancy position embeddings like relative ones.
00:33:39.520 | But basically, none of that really outperformed
00:33:42.160 | these freely learned.
00:33:43.320 | And then the freely learned is simpler.
00:33:44.760 | You just run them in it,
00:33:46.080 | let it learn as part of SGD, and that's it.
00:33:48.200 | And so we go with that, and so just like that.
00:33:53.480 | -Nice, it's awesome.
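A small sketch of how similarity maps like the ones just described can be computed; `patch_pos_embed` is a hypothetical (196, dim) tensor holding the learned patch position embeddings, with the class token already stripped off.

```python
import torch
import torch.nn.functional as F

def pos_embed_similarity(patch_pos_embed, grid=14):
    # Normalize, then take all pairwise dot products, i.e. cosine similarities.
    pe = F.normalize(patch_pos_embed, dim=-1)   # (grid*grid, dim)
    sim = pe @ pe.t()                           # (196, 196)
    # The returned sim_maps[i, j] is the 14x14 similarity map for the patch at
    # row i, column j, which is one of the little pictures in the figure.
    return sim.reshape(grid, grid, grid, grid)
```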
00:33:55.440 | -We have one more question from --
00:33:58.920 | -Hey, yeah, I was wondering if you could --
00:34:00.600 | Yeah, this slide.
00:34:01.840 | I think something that's really interesting
00:34:03.320 | is we're talking about scaling up the data,
00:34:05.400 | and scaling up the model would be fun as well.
00:34:08.240 | But it seems like you're reaching an awesome job, right,
00:34:11.280 | when you keep doing the scaling.
00:34:13.400 | So I'm curious if you have any thoughts on that.
00:34:15.920 | Like, are these points just look like that,
00:34:18.160 | or is there kind of a best you can sort of do
00:34:21.480 | where when you're pre-training the data or the parameters,
00:34:25.400 | you're actually not going to get much --
00:34:27.480 | -Yeah, I have another slide,
00:34:29.000 | but much further in the talk about this,
00:34:32.040 | where I would like to not jump on it, if you don't mind.
00:34:36.640 | And then maybe in 10, 15 minutes, we will be there.
00:34:40.920 | -Sounds good. Thanks.
00:34:42.520 | -Yeah, maybe to be a bit optimistic,
00:34:46.800 | it does seem like the transformers
00:34:48.640 | have a better slope here in the end,
00:34:50.240 | and there is a plateau earlier.
00:34:53.920 | -Sorry, Lucas, I did not mean to interrupt.
00:34:57.000 | Are there any more questions before we proceed?
00:35:00.040 | -Yeah, can I ask my question real quick?
00:35:02.160 | -Sorry about that.
00:35:03.120 | -So what I'm curious to know is how does this VIT
00:35:06.200 | compare to if you equip a ConvNet,
00:35:08.680 | so, for example, ResNet, with an attention mechanism?
00:35:12.320 | -Mm-hmm.
00:35:13.360 | -Like, how much of this is due to the structure of a transformer
00:35:15.840 | and the particular way it operates
00:35:17.240 | versus just the benefit of attention
00:35:18.800 | that a vanilla ConvNet does not have access to?
00:35:22.040 | -Yeah, so this has been tried many times before,
00:35:26.400 | and the first time that I know of
00:35:27.800 | was actually from -- I mispronounce his name,
00:35:30.840 | but Kaiming He, the inventor of ResNet,
00:35:34.120 | and some of his colleagues; they called it non-local networks.
00:35:37.360 | This was way -- I think even before the transformer paper,
00:35:39.920 | if I remember correctly,
00:35:42.760 | and they basically inserted attention blocks
00:35:45.000 | at various locations in the ResNet,
00:35:47.120 | and then they showed improvement,
00:35:48.200 | but it was, like, tiny improvements.
00:35:51.080 | It was a cool block and a simple paper,
00:35:53.560 | but it was not really worth it,
00:35:55.040 | and people usually place their attention --
00:35:59.400 | you can imagine if you place the attention just on the pixels
00:36:01.840 | and don't do this patch-cutting,
00:36:03.360 | this is way too expensive computation-wise, right?
00:36:07.080 | If you have two to four by two to four pixels,
00:36:08.920 | that's like -- yeah, I cannot do this in my head.
00:36:11.240 | I don't know, 40,000 or so maybe pixels?
00:36:14.320 | Attending to 40,000 others, that doesn't work,
00:36:16.960 | so people just do it in the very high
00:36:18.960 | and very final layers of the ResNet,
00:36:20.760 | like, where it's maybe seven by seven,
00:36:23.600 | and then they add a bit of --
00:36:25.040 | sprinkle a bit of attention there,
00:36:27.280 | but then you don't really get much benefit of scaling
00:36:29.800 | because it's essentially still a ResNet.
00:36:32.280 | And there is -- in ResNet,
00:36:34.560 | there is this block called Squeeze Excite
00:36:37.400 | that has been getting really popular --
00:36:39.680 | or has gotten really popular
00:36:41.120 | and improves ResNet quite a bit,
00:36:43.920 | and that is also kind of a form of attention,
00:36:47.120 | but, like, nicely tailored to images.
00:36:50.480 | I'm not doing -- it's arguable.
00:36:53.520 | But yeah, it has been tried many times before,
00:36:55.480 | but it just -- it doesn't show --
00:36:57.640 | or it hasn't been shown to have this scaling benefit
00:37:01.000 | as much as they did.
00:37:03.280 | -So I think I'm missing something critical here,
00:37:05.160 | which is you just said, in fact,
00:37:06.960 | or it's computationally difficult
00:37:09.920 | to do an attention layer at a low level in the ResNet,
00:37:12.920 | but why is it any different than doing an attention layer
00:37:15.400 | in the Vision Transformer?
00:37:17.840 | -Because we cut the patches first,
00:37:20.040 | so we have maybe 14 by 14 patches,
00:37:22.800 | which is not that much.
00:37:26.320 | -Okay, but I'm confused.
00:37:28.160 | Like, you could imagine, not at a high level,
00:37:31.480 | not at a high layer in the ResNet,
00:37:32.920 | but at a relatively low layer,
00:37:34.280 | after you've applied, like, one or two convolutional filters --
00:37:37.720 | convolutional layers, excuse me --
00:37:39.440 | then you have something the size of the patches.
00:37:43.080 | -That's still 50 by 50 at the early layers, and that's --
00:37:47.440 | -But 50 by 50 is significantly less than,
00:37:49.440 | I don't know, like, 400 by 400 or whatever.
00:37:52.320 | -But it's still 2,500 tokens attending to 2,500 tokens,
00:37:56.960 | which -- -Yeah, I mean, it's a lot,
00:37:58.080 | but it's not comparable.
00:38:01.000 | I don't know. Okay, cool. Thank you.
00:38:02.400 | -Yeah. I mean, it could be tracked.
00:38:04.840 | Okay, maybe another answer to your question
00:38:06.960 | is then we're slowly getting to this,
00:38:09.040 | my next slide after the set of questions,
00:38:12.880 | where we do try something almost like what you said,
00:38:17.600 | have a very small part of the ResNet,
00:38:19.880 | and then stick a transformer on top of it,
00:38:23.640 | but, like, the full transformer encoder on top of it,
00:38:26.400 | and not just sprinkle a few attention layers
00:38:29.320 | and then continue with columns and so on.
00:38:32.080 | And this is this process, and we call them hybrid,
00:38:35.280 | but it's almost literally what you said, actually,
00:38:37.560 | like a few early layers from the ResNet
00:38:39.760 | and with different varying amount,
00:38:42.000 | and then stick the whole transformer encoder.
00:38:45.480 | And this seems to work well, too,
00:38:48.080 | especially for the -- when you --
00:38:50.880 | x-axis in this case is amount of compute,
00:38:53.360 | so for the little compute, it seems to work well.
00:38:55.920 | But then the scaling behavior of the pure ResNet
00:38:58.280 | is a little better, so we focused on that.
00:39:00.680 | I think we later tried also hybrid further to the right,
00:39:03.280 | and it was a bit lower, but it was after the paper,
00:39:05.880 | so it's not on this plot, which I just cut out of the paper.
00:39:08.760 | But you can already see the trend here.
00:39:12.840 | Yeah, so if you don't scale all the way up,
00:39:15.520 | then this is a totally reasonable thing to do,
00:39:18.120 | have a little bit of ResNet
00:39:19.800 | and then the encoder from transformer.
00:39:22.960 | -Do you want to ask a question?
00:39:29.040 | -Yeah, I was just wondering about the --
00:39:32.960 | basically, there's like a short section of paper
00:39:35.640 | about, like, fine-tuning and, like, higher resolution,
00:39:38.120 | and in that case, right, like, the pre-trained,
00:39:40.640 | like, position embeddings, sorry, are, like, skewed, right?
00:39:45.480 | And it basically says that you guys are, like, interpolating.
00:39:48.320 | Can you, like, talk a little bit?
00:39:50.480 | Like, how do you interpolate what's going on?
00:39:52.400 | -Yeah. Actually, when I checked the slides earlier today,
00:39:55.840 | I was like, "Oh, it would be cool to have a slide on that."
00:40:00.440 | And we don't have a nice visualization in the paper,
00:40:02.520 | either, because it's a bit difficult to explain,
00:40:04.520 | but this is the best starting point we have.
00:40:08.320 | So if you want to increase the resolution of the image,
00:40:11.760 | and you keep the patch size fixed,
00:40:13.520 | it means you have more patches suddenly, right?
00:40:15.720 | And then, as you say, the patch embeddings, like,
00:40:18.480 | what do you even use as position embeddings, right?
00:40:22.840 | And basically, you can see here that we see
00:40:25.480 | that they learn a very regular structure, right?
00:40:27.680 | We don't really know what is the structure
00:40:29.520 | of these position embeddings that I learned.
00:40:31.160 | We just see the similarity to each other
00:40:33.600 | and that it is very regular.
00:40:36.240 | And so this gave us the intuition
00:40:37.600 | that we may be able to just take them,
00:40:41.680 | kind of imaging these boxes, they slide apart,
00:40:45.080 | and new boxes appear between them,
00:40:46.880 | and they are just the interpolation
00:40:48.320 | of the surrounding ones.
00:40:50.040 | And that's basically what we do with the position embeddings.
00:40:54.520 | We create new ones where there are missing ones,
00:40:57.640 | because we need more,
00:41:00.120 | and by interpolating the surrounding.
00:41:02.240 | Or more precisely, we basically see them as a picture,
00:41:06.040 | in this case, 14 by 14, with 700-something channels,
00:41:10.000 | or whatever is the dimensionality.
00:41:12.000 | And then we basically resize this
00:41:14.320 | like you would resize a picture by interpolation.
00:41:19.000 | And that way, we get more and new position embeddings
00:41:22.200 | that we don't understand where they are,
00:41:24.040 | but they follow the same pattern as the learned ones,
00:41:26.240 | just at a higher resolution, basically.
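A hedged sketch of that interpolation trick (illustrative names, not the paper's code): treat the learned patch position embeddings as a 14 by 14 "image" with one channel per embedding dimension, resize it to the new grid, and keep the class-token embedding as is.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=14, new_grid=24):
    # pos_embed: (1, 1 + old_grid**2, dim); the first entry is the class token.
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pe.shape[-1]
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pe, patch_pe], dim=1)  # (1, 1 + new_grid**2, dim)
```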
00:41:29.720 | Yeah, go ahead.
00:41:35.520 | - Yeah, I just have a quick question.
00:41:39.080 | So when you're creating the embeddings as input,
00:41:44.080 | right now you're doing a linear projection,
00:41:47.920 | at least in this case.
00:41:50.440 | Has there been work to do the embedding some other way,
00:41:52.440 | 'cause there's a lot of pixels that are close to each other?
00:41:56.920 | - Yeah, there were quite a few works
00:41:59.920 | that tried varying other things.
00:42:02.880 | One that I especially liked recently,
00:42:04.880 | it's called "Early Convolutions Help Transformers See Better,"
00:42:08.920 | or something like that.
00:42:10.640 | And they basically say,
00:42:11.520 | "Okay, instead of this linear projection,
00:42:13.720 | instead of this one big linear projection,
00:42:16.240 | we replace it by a stack of three-by-three convolution
00:42:20.040 | with a stride two."
00:42:22.280 | And then they have also nonlinearities between them,
00:42:24.640 | normalizations between them,
00:42:26.520 | but such that the overall stride is the same
00:42:29.440 | as this patchifying.
00:42:32.160 | So the outcome would then be the same dimensionality
00:42:36.040 | as after this patch cutting and then projecting.
00:42:39.040 | And then they showed that,
00:42:40.440 | supposedly it makes it a bit easier to optimize
00:42:45.160 | in the sense that more optimizer settings are good settings.
00:42:48.760 | In many scenarios, it performs the same,
00:42:53.640 | but like more robustly to get there.
00:42:56.800 | And they also show some scenarios
00:42:59.080 | where this performs much better,
00:43:02.040 | like for example, when pre-training on,
00:43:04.480 | actually, when they pre-train on more data,
00:43:06.920 | that seems to perform even better.
00:43:08.880 | I have played a bit with it and tried to reproduce it.
00:43:14.080 | I don't have it fully reproduced,
00:43:16.120 | but I don't see as much benefit as in the paper yet.
00:43:19.040 | But that's not to say that the paper is wrong,
00:43:20.920 | just that I didn't get there yet.
00:43:22.640 | That is one example of them.
00:43:26.080 | There are other papers that do stuff,
00:43:28.280 | but this one I found especially interesting
00:43:30.520 | because it's simple.
00:43:31.920 | - Thank you.
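To make the convolutional stem described in that answer concrete, here is a rough sketch; the channel widths and the choice of normalization are assumptions for illustration, not the paper's exact configuration.

```python
import torch.nn as nn

def conv_stem(dim=768):
    chans = [3, 64, 128, 256, 512]                 # assumed widths, for illustration
    layers = []
    for c_in, c_out in zip(chans[:-1], chans[1:]):
        layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                   nn.BatchNorm2d(c_out),
                   nn.ReLU(inplace=True)]
    # Four stride-2 convolutions give an overall stride of 16, so a 224x224
    # image still becomes a 14x14 grid of tokens; the final 1x1 convolution
    # plays the role of the linear patch projection.
    layers.append(nn.Conv2d(chans[-1], dim, kernel_size=1))
    return nn.Sequential(*layers)
```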
00:43:34.440 | - All right, continue?
00:43:37.720 | - We don't have any more questions.
00:43:40.440 | - All right, then let's see.
00:43:42.640 | Yeah, I have like three more interesting details
00:43:44.800 | from the paper and then depending on
00:43:47.680 | if we want more discussion or more content,
00:43:49.600 | I have more content, like also the question about,
00:43:52.120 | does it saturate here or not?
00:43:53.600 | All right, so another interesting thing
00:43:57.400 | that we had in the paper,
00:43:58.440 | but it is buried in the appendix,
00:44:00.280 | and then follow-up papers from others
00:44:03.560 | have been written on this by now actually,
00:44:05.880 | is like how should we scale these transformers?
00:44:09.400 | I don't know, right? In the high-level shape
00:44:13.120 | of the transformer, there's lots of settings
00:44:15.880 | that you could choose.
00:44:17.400 | And we actually tried many of them.
00:44:19.080 | So we started with the reasonable medium-sized transformer,
00:44:22.360 | this dot in the middle,
00:44:23.680 | and then we varied things one by one,
00:44:27.640 | such that we always double the compute.
00:44:30.960 | So for example, this pink line,
00:44:32.760 | if we go to the right, this point increases the width,
00:44:36.600 | such that we double the compute.
00:44:39.040 | X-axis is compute relative to this starting point.
00:44:44.040 | And we have all of these different settings.
00:44:46.240 | There's the width, which is how wide are the vectors
00:44:50.320 | with which self-attention is done,
00:44:52.320 | which is for the base model 768,
00:44:54.600 | and then goes larger or smaller.
00:44:57.520 | And as you see, scaling this
00:45:00.680 | does not seem promising.
00:45:02.760 | So we didn't scale that much.
00:45:04.400 | Then there's other things like the width
00:45:06.720 | of the multi-layer perceptron,
00:45:08.760 | or some people call it the one-by-one convolution
00:45:11.280 | in these attentions.
00:45:12.840 | And this seems to scale a bit nicer, this orange part.
00:45:16.160 | I actually wonder where it went to the left.
00:45:18.400 | I don't remember.
00:45:19.400 | I don't know if it's hidden somewhere
00:45:21.840 | or if we just didn't scale it down, but anyways.
00:45:24.720 | Then another thing to scale,
00:45:26.000 | which does not exist in the transformers from text
00:45:29.040 | is the patch size.
00:45:30.400 | As you make the patch smaller,
00:45:32.040 | you get more and more tokens out of an image
00:45:34.960 | and thus more and more compute capacity.
00:45:37.520 | This is the green one, which also seems to scale nicely.
00:45:42.120 | Then the depth is an interesting one, this yellow one.
00:45:45.960 | And this is the number of encoder blocks.
00:45:48.600 | As we scale, it first seems like, wow,
00:45:50.360 | this is the thing you want to scale,
00:45:51.640 | but then it does seem to plateau.
00:45:53.640 | And it scales really badly if you decrease the depth.
00:45:56.760 | So that's not a good thing to decrease.
00:45:59.040 | However, the width seems to be a good thing to decrease
00:46:01.240 | if you want to go to smaller models.
00:46:03.160 | And then the blue is just scaling everything together
00:46:06.120 | such that the compute doubles,
00:46:08.360 | like scaling everything by roughly the same amount.
00:46:09.960 | That seems to scale as nicely as the rest
00:46:14.920 | and is relatively simple, at least conceptually.
00:46:17.240 | So we like this, so we went with that
00:46:18.800 | whenever we scaled up or down the model.
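A back-of-the-envelope sketch of those scaling knobs; the FLOP formula and the specific values are a rough approximation for illustration, not the paper's numbers.

```python
# Coarse per-forward-pass compute estimate for a ViT, to illustrate how each
# shape knob (width, MLP width, depth, patch size) moves the compute.
def vit_flops(width=768, mlp_dim=3072, depth=12, patch=16, image=224):
    tokens = (image // patch) ** 2
    attn = 4 * tokens * width**2 + 2 * tokens**2 * width   # QKV/output projections + attention
    mlp = 2 * tokens * width * mlp_dim                      # the two MLP matmuls
    return depth * (attn + mlp)

base = vit_flops()
for name, kwargs in [("wider", dict(width=1152)),           # illustrative settings only
                     ("bigger MLP", dict(mlp_dim=6144)),
                     ("deeper", dict(depth=24)),
                     ("smaller patch", dict(patch=12))]:
    print(f"{name}: {vit_flops(**kwargs) / base:.2f}x compute")
```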
00:46:23.440 | And this one I really like is the inference speed,
00:46:26.560 | because if you have an image size of 224 pixels,
00:46:29.640 | it actually means you have 224 by 224 pixels.
00:46:32.520 | So if you then patchify it with a 16 by 16 patch size,
00:46:37.000 | for example, then you have 14 by 14 patches.
00:46:42.520 | So the sequence length is actually 196.
00:46:45.440 | And then on top of the sequence length,
00:46:48.320 | you have the self-attention operation,
00:46:49.800 | which is quadratic again.
00:46:51.880 | So overall, with respect to image size,
00:46:54.760 | the self-attention operation is to the fourth power,
00:46:58.200 | which is called quartic.
00:47:01.520 | So that is really bad.
00:47:03.000 | Like everybody who sees O of something to the fourth
00:47:05.960 | is like, "What the hell are you doing?
00:47:07.240 | This is never going to scale."
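A tiny illustration of that fourth-power growth: the token count grows quadratically with image size, and attention grows quadratically with token count (numbers below assume a 16-pixel patch).

```python
# Token count grows as (image/patch)^2, and self-attention grows with the
# square of the token count, so attention cost grows as image size ^ 4.
patch = 16
base_tokens = (224 // patch) ** 2                  # 196 tokens at 224px
for image in (224, 384, 512, 1024):
    tokens = (image // patch) ** 2
    ratio = tokens**2 / base_tokens**2             # relative pairwise-attention cost
    print(f"{image}px -> {tokens} tokens, {ratio:.1f}x the 224px attention cost")
```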
00:47:09.760 | So we checked what does it look like in practice
00:47:12.520 | with the image sizes that we operate in,
00:47:15.000 | and this is what you see here.
00:47:16.240 | On the y-axis is how fast it goes,
00:47:19.920 | basically how fast it does inference,
00:47:22.000 | and on the x-axis is varying the input size.
00:47:25.920 | And what this means is, it doesn't look so bad yet.
00:47:31.760 | Basically, when you go here to the 512,
00:47:34.240 | to the really large image,
00:47:35.320 | then you see that the transformers
00:47:37.320 | actually start going down a lot more than the ResNets.
00:47:42.920 | But in this reasonable image size range,
00:47:44.600 | let's call it typical,
00:47:46.200 | it doesn't seem so bad in practice yet.
00:47:48.240 | So we're not getting hit by the big O yet.
00:47:52.120 | But as we go larger, it will likely be a problem,
00:47:54.480 | and there will be a lot of follow-up works
00:47:56.240 | trying to make that better.
00:48:01.440 | Then, this is the last one from the original paper.
00:48:07.560 | This is looking at the input's receptive field size.
00:48:10.960 | So in the self-attention operation,
00:48:13.360 | how far away do heads typically attend?
00:48:17.920 | And here on the x-axis, we see the layer in the network.
00:48:21.600 | To the right is more towards the output, the classes,
00:48:24.040 | and to the left is more towards the input, the patches.
00:48:27.760 | And the y-axis is how far on average across,
00:48:31.480 | I think, the whole validation set,
00:48:33.640 | does the self-attention look?
00:48:35.320 | And "look" means the peak of the self-attention,
00:48:39.120 | or the max, how far away is it?
00:48:41.360 | Something like that.
00:48:44.200 | And each dot is a different head
00:48:45.800 | because we can use multi-head self-attention.
00:48:48.560 | And so what this shows is that in the early layers,
00:48:50.960 | actually you have some heads that go far,
00:48:53.360 | but also a lot of heads that look very nearby,
00:48:56.840 | so locally.
00:48:58.000 | And as we go deeper in the model,
00:48:59.600 | we only are left with heads that, on average, look further.
00:49:03.560 | So it's just some kind of analysis.
00:49:07.000 | There is not immediately action to take about this,
00:49:09.840 | but it's interesting to see that earlier layers,
00:49:12.640 | they learn a mixture of looking to a local neighborhood
00:49:16.440 | and looking globally,
00:49:18.120 | and later layers only look globally.
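A minimal sketch of how such a mean attention distance could be computed per head; this is one reading of the analysis, not the authors' code, and the exact definition in the paper may differ slightly.

```python
# For each head, average over queries the attention-weighted spatial distance
# between a query patch and the key patches it attends to (in patch units).
import numpy as np

def mean_attention_distance(attn, grid):
    """attn: (heads, tokens, tokens) softmax attention over a grid x grid patch
    grid, class token excluded for simplicity."""
    ys, xs = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], -1).astype(np.float32)
    dists = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)  # (tokens, tokens)
    return (attn * dists[None]).sum(-1).mean(-1)   # expected distance per query, then per head

heads, grid = 12, 14
attn = np.random.rand(heads, grid * grid, grid * grid)
attn /= attn.sum(-1, keepdims=True)                # rows sum to 1, like softmax output
print(mean_attention_distance(attn, grid))         # one mean distance per head
```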
00:49:22.680 | Right.
00:49:24.280 | So that is about the original vision transformers.
00:49:30.360 | Now, I don't know how long you want me
00:49:33.320 | to continue speaking or discussing.
00:49:35.320 | I have a couple of options that I can talk about,
00:49:38.760 | which is one project that was further scaling updates,
00:49:41.920 | and this one also has the answer to the --
00:49:44.040 | I can also jump straight to the answer
00:49:45.400 | if you don't want to hear the rest.
00:49:46.880 | But to the question of, like,
00:49:48.560 | how does it continue to the right?
00:49:49.960 | Are we saturating?
00:49:52.720 | There is another project about how to train vision transformers
00:49:56.400 | when you don't have massive amounts of data.
00:49:58.600 | Can you still do it? Is it reasonable?
00:50:00.360 | Or is it maybe just unreasonable to do?
00:50:03.320 | This one is maybe too unrelated.
00:50:04.800 | Let's not talk about this.
00:50:06.160 | And the last one is, like,
00:50:08.360 | I talk all about these benefits of a really large model
00:50:11.920 | when you pre-train them on lots of data.
00:50:13.640 | Okay, that's nice. That's how we get a good model.
00:50:17.000 | But then actually using a model that is massive
00:50:19.880 | is not fun at all.
00:50:21.200 | Like, it doesn't fit on your GPUs.
00:50:22.960 | You need, like, multiple TPUs to even use it.
00:50:25.960 | So people are not happy to use it
00:50:29.080 | and usually still go back to small-ish models,
00:50:32.000 | even though they know, like, larger models should be better.
00:50:34.760 | What can we do about it?
00:50:37.800 | That's another project we had, which is about distillation.
00:50:41.520 | So I would say it's up to you guys what you prefer to do.
00:50:46.080 | Or if you have plenty of questions,
00:50:47.480 | we can continue with the questions now,
00:50:49.200 | because I think now the original one hour would be over, right?
00:50:53.080 | -Right. -So I think one suggestion was,
00:50:54.960 | like, we can continue the talk,
00:50:56.880 | and we'll also be recording it so people can, like,
00:50:59.360 | just, like, go and see it if they miss out something.
00:51:02.720 | So we could do that.
00:51:04.240 | -Yeah, the other thing is two people have their hands raised,
00:51:06.800 | so we can... -Okay.
00:51:09.160 | -...take questions first.
00:51:10.960 | -Up to you guys, I'm fine either way.
00:51:13.400 | -So you guys want to ask a question?
00:51:24.640 | -Yeah, I just had a pretty basic question.
00:51:28.000 | So if an object lies on the border between the patches,
00:51:32.280 | does that impact the model's performance in any way?
00:51:37.040 | -Yeah, I mean, that's not a basic question.
00:51:39.280 | It's a good question.
00:51:40.880 | There is a mix of answers.
00:51:45.560 | So one is we didn't specifically go and test this.
00:51:48.960 | It would be an interesting thing to test in a very controlled way
00:51:51.800 | with some of the trained models.
00:51:55.000 | That's for sure.
00:51:57.520 | The other thing is that when you have a massive data set,
00:52:01.360 | like 300 million images, it's an insane amount.
00:52:03.920 | I used to try to conceptualize how much is ImageNet,
00:52:07.960 | 1 million images, and I think I did the math.
00:52:10.920 | It's like if you go through ImageNet and look at all of the images,
00:52:15.520 | each image for a couple of seconds,
00:52:17.200 | you are sitting there for a month or something like that.
00:52:19.720 | Don't remember.
00:52:21.600 | But so 300 million is just insanely massive.
00:52:24.520 | And then on top of that, we do actually use
00:52:27.160 | random augmentations, like random crop out of the image.
00:52:30.920 | So I would say it's the default that you see objects
00:52:34.040 | that don't fall neatly within a patch during the training already.
00:52:38.760 | And if you look at here, basically,
00:52:40.600 | this is the standard model, like how the patches are.
00:52:44.360 | When we have 14 by 14, they look roughly this size also.
00:52:50.240 | Then an object is usually scattered across many patches,
00:52:54.040 | actually, because objects in typical images
00:52:57.920 | are relatively large, right?
00:52:59.280 | People don't take a picture where the object of interest
00:53:01.640 | is super tiny in the corner.
00:53:04.400 | So that's the default that you see during pre-training.
00:53:06.680 | And so I believe that the model just
00:53:08.840 | learns to do that much better, actually.
00:53:13.640 | Then the other answer to the question is like, OK,
00:53:16.480 | maybe if you did some nicer thing than this very crude
00:53:20.720 | patch cutting, like for example, this stack of convolutions
00:53:24.480 | that I mentioned, maybe this is even better.
00:53:28.000 | Thank you.
00:53:30.960 | Thank you.
00:53:31.480 | So you mentioned that we're using transformers,
00:53:40.720 | or at least you mentioned in the paper
00:53:42.320 | that they lack locality and translation equivariance.
00:53:49.800 | I was just thinking, are these sort of properties
00:53:54.400 | that you probably [INAUDIBLE] and especially when
00:54:00.880 | you're in the [INAUDIBLE]
00:54:02.720 | So why is it that we would prefer [INAUDIBLE]
00:54:12.040 | The audio was not that good, but I
00:54:13.520 | believe I understood the question.
00:54:14.920 | Is that we say that transformers lack locality bias, or prior,
00:54:20.160 | or whatever?
00:54:21.280 | And why is this even something that we want, right?
00:54:25.120 | Wouldn't we want our models to know about locality
00:54:27.560 | if they are about pictures in the first place?
00:54:30.800 | Yes and no.
00:54:32.680 | So that's why I gave the context in the beginning.
00:54:35.520 | This is all about what happens when you scale things up.
00:54:39.760 | And specifically, in the ideal world, at least in our mind,
00:54:46.640 | we want gigantic amounts of data.
00:54:49.360 | And we believe that it will just keep
00:54:51.400 | growing as the years go by.
00:54:54.000 | And there will be more and more data just generally there.
00:54:58.560 | And then we want the model to have as little
00:55:01.160 | of our thinking built in.
00:55:04.720 | Because what we may think that is good to solve the task
00:55:08.000 | may actually not be best to solve the task.
00:55:10.880 | Maybe an analogy would be, what was it,
00:55:14.880 | AlphaGo that made some moves that experts would say,
00:55:17.760 | this is crazy.
00:55:18.440 | This is a silly move.
00:55:19.360 | But it actually then was much better.
00:55:22.240 | And in a similar way, we want to encode as little as possible
00:55:26.040 | into the model, such that if we just
00:55:28.400 | throw massive amounts of data in the difficult task at it,
00:55:31.520 | that it might think things that are even better that we
00:55:33.720 | didn't think of before.
00:55:36.120 | This is our approach.
00:55:37.960 | Because we believe that, as I mentioned, I think, already,
00:55:43.040 | what seems massive and excessive now
00:55:45.400 | will be the norm in five years or so.
00:55:47.560 | So that's where we want to go and look what's the direction.
00:55:51.520 | However, if you want to just get something working now
00:55:56.480 | and don't have massive amounts of data
00:55:58.240 | and don't want to use pre-trained model
00:55:59.880 | for some reason, which always use a pre-trained model,
00:56:04.440 | but if you don't want to, then it
00:56:06.760 | makes total sense to build in some
00:56:08.800 | of your prior intuition and knowledge of what should
00:56:12.000 | probably help the model, like locality.
00:56:16.240 | I hope this answered your question.
00:56:19.320 | I suppose this is a quick follow up.
00:56:22.280 | What sort of [INAUDIBLE] like any vision task?
00:56:30.040 | Isn't that sort of like, yeah, I don't know.
00:56:33.560 | Maybe I'm not seeing exactly why we'd not
00:56:37.720 | want those inductive biases.
00:56:40.080 | Could you maybe elaborate on that?
00:56:42.680 | Why is it that we don't want locality
00:56:46.240 | or what translation [INAUDIBLE]
00:56:50.560 | Well, ideally, we want the model that
00:56:53.840 | is powerful enough to learn about this concept itself
00:56:58.840 | if it is useful to solve the task.
00:57:01.320 | If it's not useful to solve the task, then if we had put it in,
00:57:05.840 | there is no way for the model not to do this, right?
00:57:10.840 | That is ideally the outcome.
00:57:12.600 | In a similar way also that in language,
00:57:16.120 | it seemed to be nonsense to not encode
00:57:19.480 | the from left to right direction of text, like in RNNs.
00:57:24.080 | But then comes transformer and just doesn't.
00:57:26.440 | And works much better if you throw a lot of data at it.
00:57:29.600 | And it recovers that plus some more or a more flexible variant
00:57:34.640 | of it or something like that.
00:57:35.840 | That is even better for solving tasks.
00:57:39.240 | So basically, the idea being that we are not
00:57:42.840 | as smart to design the thing, the model in the way that
00:57:47.800 | will be best for the task.
00:57:49.160 | Let's rather give it all the flexibility and all the data
00:57:52.440 | it needs to figure out what is the best
00:57:54.160 | way of solving the task.
00:57:55.200 | I mean, it is a philosophy of approaching it.
00:58:05.640 | I'm not saying this is the only true way, right?
00:58:07.880 | So we have around seven minutes left
00:58:16.320 | before the scheduled end of the talk.
00:58:18.720 | And Lucas, we want to be mindful of your time
00:58:21.240 | as well, because it is evening where you are.
00:58:24.520 | So one thing we could do is you could--
00:58:28.360 | I don't see any more questions right now.
00:58:30.240 | So you could quickly go over the last few bits,
00:58:33.800 | maybe skipping through the details
00:58:36.000 | and just talking about the final results.
00:58:39.200 | I will do this to a high level, then.
00:58:41.080 | Those two that are still very, very tied to transformers
00:58:44.360 | and answer some questions that happened before.
00:58:46.880 | Like the first question was like, OK, are we saturating?
00:58:50.360 | Yes or no?
00:58:51.320 | And here, no.
00:58:56.280 | This was the bit on this benchmark
00:59:01.360 | from the original vision transformer paper.
00:59:01.360 | But then it's like these transformers,
00:59:04.280 | when we use them, we just notice they have really nice scaling
00:59:06.760 | properties.
00:59:07.280 | And they seem, actually, to be easier
00:59:08.920 | to scale up without paying massive compute as much
00:59:12.920 | as ResNet, just from gut feeling from us having experience
00:59:17.320 | with both.
00:59:17.920 | And so we went and looked what happens
00:59:20.120 | if we scale vision transformer just as far up
00:59:23.960 | as we possibly can.
00:59:25.640 | And we put quite a lot of blood and sweat into making this happen.
00:59:30.400 | One part of it is scaling the data set.
00:59:33.640 | So we went back to this Google internal team
00:59:36.680 | that this 300 million data set is just one out of many
00:59:39.760 | that they work with.
00:59:41.360 | And we asked around, and they basically
00:59:43.480 | had the 3 billion, like 10 times larger data set
00:59:46.240 | that we could also play around with.
00:59:49.280 | So there we go.
00:59:50.240 | We want to scale up the data set.
00:59:52.360 | And this is just showing, yes, just scaling up the data set
00:59:56.120 | and switching to it gives you benefits,
00:59:57.720 | but that's not all of it.
01:00:00.120 | Then the next thing is we needed to figure out
01:00:03.080 | how to use less memory on device, like on GPU or TPU,
01:00:07.640 | because already previously with this setup,
01:00:10.280 | we fitted the model as large as we could fit.
01:00:13.200 | So we did a lot of tricks that I will skip for now
01:00:16.360 | and are able to scale much larger.
01:00:18.960 | This is like-- this plot shows the size
01:00:22.040 | of the model in the different shape factors
01:00:24.520 | that I mentioned before, like the width of the MLP on the x-axis,
01:00:27.400 | the self-attention width on the y-axis,
01:00:29.720 | and then the different plots are different layers for the depth.
01:00:32.760 | This box are how large the transformer
01:00:35.720 | we did in the original paper.
01:00:37.800 | And then boom, one step further and two steps further,
01:00:40.640 | this is just super massive transformer
01:00:43.440 | we did in this scaling paper.
01:00:46.600 | And with all of our tricks, how much larger
01:00:48.840 | we could go, a lot larger.
01:00:51.640 | Then, yeah, some learning rate stuff, and it is really cool.
01:00:54.360 | I recommend people to look at square root learning rate
01:00:56.840 | schedule, which is cool, and often just mentioned
01:00:59.320 | as a side note.
01:01:02.400 | It is also cool, but I'm going to skip it
01:01:04.240 | in the interest of time.
01:01:08.160 | And basically, we scaled it up a lot.
01:01:10.920 | And of course, again, we always get
01:01:14.040 | this ImageNet number a bit higher.
01:01:17.840 | This is actually plus 2% on what we had before,
01:01:20.200 | which is very significant in this high percentage range
01:01:23.440 | there.
01:01:24.880 | But also, what's very interesting
01:01:26.520 | is the few-shot again.
01:01:28.760 | By just keep scaling up everything,
01:01:30.840 | we get a super large boost in few-shot again.
01:01:33.600 | This is image net top 1 accuracy.
01:01:35.800 | And for example, it's just 10 images per image net class,
01:01:40.560 | which means 10,000 images total because 1,000 classes.
01:01:44.000 | We get this big of a jump.
01:01:46.280 | We get 85% top-1 accuracy, which is what you typically
01:01:52.840 | get when using the full data set, basically.
01:01:56.320 | So this is scaling up.
01:01:58.600 | It actually makes few-shot work significantly better.
01:02:02.000 | And then I'm going to skip on this.
01:02:04.240 | Well, this actually has an interesting message.
01:02:06.480 | This is three times the same story,
01:02:08.960 | but measured in a slightly different way,
01:02:11.040 | which is that if you make the model larger,
01:02:13.920 | it actually needs to see fewer images
01:02:16.320 | to get to a similar score.
01:02:18.400 | This blue line is a tiny vision transformer,
01:02:22.080 | then the base vision transformer and the large one.
01:02:24.680 | And the y-axis is the error.
01:02:26.520 | So lower is better.
01:02:28.040 | And actually, you need to see--
01:02:30.000 | still, we're talking in millions of images,
01:02:32.120 | and here it's 100 million images.
01:02:33.880 | But still, you need to see a lot fewer images
01:02:36.560 | with the larger models.
01:02:37.960 | Doesn't mean a lot less compute, right?
01:02:39.560 | Because the model is larger and thus slower.
01:02:42.400 | But it's interesting.
01:02:44.160 | And then there's some scaling laws
01:02:45.720 | that are popular in language.
01:02:47.240 | And we, I think, maybe for the first time
01:02:49.880 | in discriminative image learning show
01:02:52.880 | that, yeah, they appear to be here, too.
01:02:56.800 | And then-- right.
01:02:59.800 | Then we want to--
01:03:01.400 | sorry, I had the order of the slides mixed up in my head.
01:03:04.160 | So I'm a bit surprised.
01:03:05.120 | But then another threat was that besides further scaling up
01:03:08.720 | the model, we wanted to push even further
01:03:11.040 | into this direction of less hand engineering of things
01:03:16.520 | into the model architecture.
01:03:18.560 | And then with the vision transformer,
01:03:21.360 | transform in general, what is the obviously most hand
01:03:24.360 | engineered part of it is the self-attention.
01:03:26.480 | So we tried, what can we do something
01:03:29.440 | more generic than that and less smart than that, basically?
01:03:34.240 | And we ended up by replacing it, essentially,
01:03:36.480 | with just a multi-layer perceptron that, however,
01:03:42.400 | has a little bit of structure, but much less
01:03:44.680 | than self-attention.
01:03:46.120 | I will skip the structure for the sake of time.
01:03:49.680 | And we're coming back to this plot, where the question was,
01:03:53.120 | aren't we saturating?
01:03:54.480 | Now, this plot is slightly different.
01:03:56.040 | We, again, have this BiT ResNet here in black.
01:03:59.040 | And the full green line is the vision transformer.
01:04:02.760 | And the other color, also the full lines,
01:04:04.680 | are the vision transformers.
01:04:05.800 | So it is exactly the same numbers as from before.
01:04:09.000 | However, now we also throw in this mixer architecture,
01:04:11.800 | which we believe is even more flexible and less
01:04:14.240 | hand-engineered than transformer.
01:04:16.240 | And as you see, with less data, it's even worse.
01:04:19.400 | However, with much more data, it may
01:04:22.520 | be surpassing the transformer, or it may be random noise.
01:04:27.560 | Not clear at this point, right?
01:04:29.000 | Because it's the only point where this happens.
01:04:32.240 | So we need to go further.
01:04:33.760 | So we use this 3 billion data set,
01:04:35.960 | for example, from the previous paper that I mentioned here,
01:04:40.160 | and try to extend these lines to the right to see what happens.
01:04:44.360 | We don't extend many of them, because these
01:04:46.360 | are very expensive experiments that
01:04:48.360 | require a ton of patience.
01:04:50.320 | But we extended two most interesting.
01:04:52.840 | And it seems that it continues.
01:04:54.720 | And that, first of all, yes, the vision transformer
01:04:58.000 | keeps increasing.
01:04:59.880 | We don't have such experiment with the ResNet,
01:05:02.280 | because it doesn't look promising enough
01:05:04.400 | to pay the cost of doing it.
01:05:07.520 | But it also seems that the mixer, what we believe
01:05:10.000 | is even more flexible architecture,
01:05:11.560 | actually is consistently above the transformer now,
01:05:15.160 | which is good news.
01:05:17.080 | And yeah, it is good news.
01:05:19.200 | So we're now right at the time when I should stop, right?
01:05:23.640 | Or open to more questions again.
01:05:26.120 | Yeah, I guess, as a question--
01:05:29.160 | Can I ask a follow up on the scaling
01:05:31.520 | that you were showing earlier?
01:05:33.080 | It's related to my previous question.
01:05:35.080 | I'm curious how this model size compares
01:05:37.640 | to model sizes for BERT or the natural language models.
01:05:41.840 | [INAUDIBLE]
01:05:43.840 | Like, especially when we're going from smaller models
01:05:46.080 | to much bigger models, are they comparable at all
01:05:49.520 | in terms of model size?
01:05:50.880 | And if not, why do you think--
01:05:53.880 | what is the [INAUDIBLE] models for these two different tasks?
01:05:57.280 | Yeah, actually, a colleague of mine has a slide, which I hate.
01:06:01.160 | But he loves-- it's the model number of parameters
01:06:04.280 | in NLP and in vision.
01:06:07.240 | And the question is, how do you measure model size?
01:06:10.240 | If you just measure number of parameters,
01:06:12.200 | then these vision models are much smaller.
01:06:15.160 | However, the language models, number of parameters,
01:06:19.200 | like a huge chunk of it is in the dictionary,
01:06:21.480 | for example, which for us just doesn't exist.
01:06:23.880 | It is linear embedding, which is trivial number of parameters.
01:06:29.240 | So in terms of number of parameters, it's much smaller.
01:06:32.480 | My personal opinion is number of parameters
01:06:34.800 | doesn't mean that much.
01:06:37.200 | Then the other way that you could measure
01:06:39.120 | this maybe in terms of compute, like how much floating point
01:06:42.760 | operations does it do on one data point.
01:06:46.400 | And in terms of this, it's in the same ballpark.
01:06:50.120 | However, last time I checked, which is quite a few months
01:06:52.720 | ago, the largest language model was still
01:06:55.640 | like four times more or five times more in the vision model,
01:06:59.800 | I believe.
01:07:02.200 | So that's the two ways of measuring model size.
01:07:05.320 | I don't think either of the ways is
01:07:07.880 | the one true way to measure model size.
01:07:09.560 | And I think it's actually an interesting research topic,
01:07:11.920 | like how to properly measure and order models
01:07:15.280 | in terms of capacity is not clear.
01:07:18.520 | [INAUDIBLE]
01:07:19.800 | Do you know why the vision is--
01:07:21.720 | I'm sorry, the vision is four times smaller?
01:07:23.880 | Like, what about that [INAUDIBLE]??
01:07:26.520 | I think it's just there is less interest in it,
01:07:30.200 | so less resources spent on it, basically.
01:07:34.360 | Like in Google, there are many, many more groups
01:07:37.160 | doing research with language than with vision.
01:07:40.720 | And I think we are one of the few groups that
01:07:44.000 | have access to a lot of resources
01:07:45.640 | and are interested in scaling up things in vision so much.
01:07:48.920 | Whereas in language, it seems there are a lot of groups
01:07:51.240 | that are doing that.
01:07:53.000 | I think that's the main reason, actually.
01:07:56.040 | It's not that we don't want to go beyond that,
01:07:58.760 | or if we can, we would go even more.
01:08:03.960 | Awesome, thank you.
01:08:05.280 | [INAUDIBLE]
01:08:07.760 | Right, so we are actually over time at this point.
01:08:10.840 | So anyone who has to leave, please feel free to do so.
01:08:13.720 | And before we do that, Lucas, thank you so much for joining,
01:08:18.240 | for all the way from across the ocean.
01:08:21.600 | And we know it's in the evening, so thank you
01:08:24.080 | for taking your free time to come and talk to us here.
01:08:27.520 | Yeah, thanks for the invitation.
01:08:29.040 | Always like to talk about the work.