
Best of 2024 in Vision [LS Live @ NeurIPS]



00:00:00.000 | (upbeat music)
00:00:02.580 | - Hi, we're Isaac and Peter from Roboflow.
00:00:08.720 | And we're gonna talk about the best papers
00:00:11.280 | of 2024 in computer vision.
00:00:13.520 | So for us, we define best as what made the biggest shifts
00:00:19.680 | in the space.
00:00:21.720 | And to determine that we looked at
00:00:23.840 | what are some major trends that happened
00:00:26.240 | and what papers most contributed to those trends.
00:00:29.160 | So I'm gonna talk about a couple of trends.
00:00:30.280 | Peter's gonna talk about a trend
00:00:31.340 | and then we're gonna hand it off to Moondream.
00:00:34.400 | So the trends that I'm interested in talking about
00:00:39.720 | are a major transition from models
00:00:42.420 | that run on per image basis
00:00:44.080 | to models that run using the same basic ideas on video.
00:00:48.720 | And then also how DETRs are starting to take over
00:00:51.760 | the real-time object detection scene
00:00:56.360 | from the YOLOs, which have been dominant for years.
00:00:58.960 | So as a highlight, we're gonna talk about Sora,
00:01:04.620 | which from my perspective is the biggest paper of 2024,
00:01:08.120 | even though it came out in February.
00:01:09.920 | Yeah, yeah.
00:01:13.460 | So Sora is just a blog post.
00:01:16.860 | So I'm going to fill it in with details
00:01:20.040 | from replication efforts, including Open-Sora
00:01:22.680 | and related work such as Stable Video Diffusion.
00:01:26.600 | And then we're also gonna talk about SAM2,
00:01:30.040 | which applies the SAM strategy to video.
00:01:32.880 | And then how DETRs are,
00:01:36.240 | the improvements in 2024 to DETRs
00:01:37.840 | that are making them a Pareto improvement
00:01:39.360 | over YOLO-based models.
00:01:41.120 | So to start this off,
00:01:44.360 | we're gonna talk about the state-of-the-art
00:01:46.960 | of video generation at the end of 2023.
00:01:50.040 | MagVIT is a
00:01:55.080 | discrete token video tokenizer akin to VQ-GAN,
00:01:58.960 | but applied to video sequences.
00:02:01.000 | And it actually outperforms state-of-the-art
00:02:05.760 | handcrafted video compression frameworks
00:02:08.840 | in terms of the bit rate versus human preference for quality.
00:02:13.840 | And videos generated by autoregressing
00:02:15.720 | on these discrete tokens
00:02:17.080 | generate some pretty nice stuff,
00:02:20.560 | but up to like five seconds length
00:02:22.000 | and you know, not super detailed.
00:02:23.480 | And then suddenly a few months later, we have this,
00:02:28.480 | which when I saw it, it was totally mind-blowing to me.
00:02:32.120 | 1080p, a whole minute long.
00:02:34.440 | We've got light reflecting in puddles.
00:02:36.000 | That reflectivity reminds me of those RTX demonstrations
00:02:41.000 | for next generation video games, such as Cyberpunk,
00:02:44.160 | but with better graphics.
00:02:46.760 | You can see some issues in the background
00:02:48.240 | if you look closely, but they're kind of,
00:02:50.320 | as with a lot of these models,
00:02:52.480 | the issues tend to be things
00:02:54.120 | that people aren't going to pay attention to
00:02:55.880 | unless they're looking for them.
00:02:57.040 | In the same way that, like, six fingers on a hand
00:02:59.640 | is a giveaway you're not going to notice
00:03:02.320 | unless you're looking for it.
00:03:03.760 | So yeah, as we said, Sora does not have a paper.
00:03:08.440 | So we're going to be filling it in with context
00:03:10.920 | from the rest of the computer vision scene
00:03:14.040 | attempting to replicate these efforts.
00:03:16.440 | So the first step: you have an LLM caption
00:03:21.800 | a huge amount of videos.
00:03:23.120 | This is a trick that they introduced in DALL-E 3,
00:03:28.520 | where they train an image captioning model
00:03:32.240 | to just generate very high quality captions
00:03:34.240 | for a huge corpus
00:03:35.360 | and then train a diffusion model on that.
00:03:39.760 | The Sora post and the replication efforts
00:03:42.240 | also show a bunch of other steps
00:03:44.040 | that are necessary for good video generation,
00:03:47.480 | including filtering by aesthetic score
00:03:50.360 | and filtering by making sure the videos have enough motion
00:03:53.320 | so that the generator isn't just, kind of,
00:03:55.960 | learning to generate static frames.
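
As a rough sketch of what that kind of data filtering could look like in practice: the frame-difference motion proxy, the idea of a separate aesthetic predictor, and the thresholds below are all assumptions for illustration, not details from the Sora post or Open-Sora.

```python
import numpy as np

def mean_frame_difference(frames: np.ndarray) -> float:
    """Crude motion proxy: average absolute pixel change between consecutive frames.

    frames: array of shape (T, H, W, C) with values in [0, 255].
    """
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    return float(diffs.mean())

def keep_clip(frames, aesthetic_score, min_aesthetic=5.0, min_motion=2.0):
    """Keep a clip only if it looks good enough and actually moves.

    aesthetic_score would come from some separate aesthetic predictor;
    the thresholds here are made up for illustration.
    """
    return aesthetic_score >= min_aesthetic and mean_frame_difference(frames) >= min_motion

# Example: a perfectly static 16-frame clip gets filtered out.
static_clip = np.zeros((16, 64, 64, 3), dtype=np.uint8)
print(keep_clip(static_clip, aesthetic_score=6.0))  # False: no motion
```
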
00:03:58.160 | So then we encode our video
00:04:04.040 | into a series of space-time latents.
00:04:06.600 | Once again, they were very sparse on details.
00:04:09.840 | So among the replication-related works,
00:04:13.680 | OpenSora actually uses a MagVIT V2 itself to do this,
00:04:17.320 | but swapping out the discretization step
00:04:21.520 | with a classic VAE autoencoder framework.
00:04:25.240 | They show that there's a lot of benefit
00:04:30.000 | from getting the temporal compression,
00:04:31.520 | which makes a lot of sense, as sequential frames
00:04:35.400 | in videos have mostly redundant information.
00:04:38.080 | So by compressing in the temporal space,
00:04:43.640 | you allow the latent to hold a lot more semantic information
00:04:47.240 | while avoiding that duplication.
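
Here is a toy illustration of that temporal compression; it is not MagVIT V2's actual architecture, just a small 3D conv stack showing how 16 frames can collapse into 4 latent time steps.

```python
import torch
import torch.nn as nn

# A toy spatio-temporal encoder: just an illustration of how a 3D conv stack
# compresses the time axis as well as the spatial axes.
encoder = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, stride=(2, 2, 2), padding=1),   # halve T, H, W
    nn.SiLU(),
    nn.Conv3d(64, 16, kernel_size=3, stride=(2, 2, 2), padding=1),  # halve again
)

video = torch.randn(1, 3, 16, 256, 256)   # (batch, channels, frames, height, width)
latents = encoder(video)
print(latents.shape)                       # torch.Size([1, 16, 4, 64, 64])
# 16 frames -> 4 latent "time steps": redundant neighboring frames share one latent.
```
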
00:04:49.800 | So we've got our space-time latents, possibly via
00:04:58.440 | some 3D VAE, presumably a MagVIT V2.
00:05:02.560 | And then you throw it into a diffusion transformer.
00:05:07.440 | So I think it's personally interesting to note
00:05:11.800 | that OpenSora is using a MagVIT V2,
00:05:14.960 | which originally used an autoregressive transformer decoder
00:05:18.680 | to model the latent space,
00:05:20.200 | but is now using a diffusion transformer.
00:05:25.200 | So it's still a transformer happening.
00:05:27.360 | Just the question is like,
00:05:28.200 | is it parameterizing the stochastic differential equation?
00:05:31.880 | Is it parameterizing a conditional distribution
00:05:34.480 | via autoregression?
00:05:35.680 | It's also worth noting that most diffusion models today,
00:05:44.520 | the very high performance ones are switching away
00:05:46.440 | from the classic like DDPM,
00:05:48.640 | denoising diffusion probabilistic modeling framework
00:05:51.240 | to rectified flows.
00:05:52.560 | Rectified flows have a very interesting property
00:05:56.080 | that as they converge,
00:05:58.520 | they actually get closer to being able to be sampled
00:06:01.480 | with a single step,
00:06:02.880 | which means that in practice,
00:06:05.480 | you can actually generate high quality samples much faster.
00:06:08.440 | Major problem of DDPM and related models
00:06:13.640 | for the past four years is just that
00:06:15.920 | they require many, many steps
00:06:18.000 | to generate high quality samples.
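
As a generic sketch of why rectified flows sample in few steps (not code from Sora or any specific replication): the training target is the straight-line velocity between noise and data, and sampling is just Euler integration along it. The `model(x, t)` callable is a stand-in for whatever network you use.

```python
import torch

def rectified_flow_loss(model, x0):
    """One training step of a rectified flow: a minimal sketch.

    x0: a batch of clean samples. The model predicts the velocity that moves
    noise toward data along the straight-line path x_t = (1 - t) * noise + t * x0.
    """
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)), device=x0.device)
    x_t = (1 - t) * noise + t * x0
    target_velocity = x0 - noise                     # constant along the straight path
    return ((model(x_t, t) - target_velocity) ** 2).mean()

@torch.no_grad()
def sample(model, shape, steps=4, device="cpu"):
    """Euler integration from noise to data; as the flow straightens, fewer steps suffice."""
    x = torch.randn(shape, device=device)
    for i in range(steps):
        t = torch.full((shape[0], *([1] * (len(shape) - 1))), i / steps, device=device)
        x = x + model(x, t) / steps                  # follow the predicted velocity
    return x
```
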
00:06:20.040 | So, and naturally the third step
00:06:23.760 | is throwing lots of compute at the problem.
00:06:26.080 | So I never figured out how to manage
00:06:30.520 | to get this video to loop,
00:06:31.960 | but we see very little compute,
00:06:36.080 | medium compute, lots of compute.
00:06:39.120 | This is so interesting
00:06:40.000 | because the original diffusion transformer paper
00:06:42.480 | from Facebook actually showed that,
00:06:45.000 | in fact, the specific hyperparameters of the transformer
00:06:47.400 | didn't really matter that much.
00:06:49.160 | What mattered was that you were just increasing
00:06:51.760 | the amount of compute that the model had.
00:06:54.480 | So I love how in the, once again, little blog post,
00:06:59.360 | they don't even talk about
00:07:00.200 | like the specific hyperparameters.
00:07:01.160 | They say, we're using a diffusion transformer
00:07:03.320 | and we're just throwing more compute at it
00:07:04.520 | and this is what happens.
00:07:05.760 | OpenSora shows similar results.
00:07:10.520 | The primary issue I think here is that
00:07:13.920 | no one else has 32X compute budget.
00:07:17.280 | So we end up with these,
00:07:18.640 | we end up in the middle of the domain
00:07:22.400 | in most of the related work,
00:07:24.920 | which is still super, super cool.
00:07:27.400 | It's just a little disappointing considering the context.
00:07:30.560 | So I think this is a beautiful extension
00:07:34.640 | of the framework that was introduced in '22 and '23
00:07:40.320 | for these very high quality per image generation
00:07:43.280 | and then extending that to videos.
00:07:45.000 | It's awesome.
00:07:47.600 | And it's GA as of Monday,
00:07:49.360 | except no one can seem to get access to it
00:07:51.200 | because they keep shutting down the login.
00:07:53.640 | The next, so next paper I wanted to talk about is SAM.
00:07:59.320 | So we at Roboflow allow users to label data
00:08:03.200 | and train models on that data.
00:08:04.680 | SAM for us has saved our users 75 years of labeling time.
00:08:10.000 | We are the, to the best of my knowledge,
00:08:11.760 | the largest SAM API that exists.
00:08:16.320 | We also, SAM also allows us to have our users
00:08:19.320 | train just pure bounding box regression models
00:08:22.680 | and use those to generate high quality masks,
00:08:25.600 | which has the great side effect
00:08:29.680 | of requiring less training data
00:08:31.400 | to have a meaningful convergence.
00:08:33.160 | So most people are data limited in the real world.
00:08:35.720 | So anything that requires less data
00:08:37.120 | to get to a useful thing is super useful.
00:08:40.360 | Most of our users actually run their object,
00:08:44.920 | per frame object detectors on every frame in a video,
00:08:47.800 | or maybe not most, but many, many.
00:08:49.600 | And so SAM2 falls into this category
00:08:55.480 | of taking
00:08:57.280 | something that really, really works
00:08:59.080 | and applying it to a video,
00:09:01.880 | which has the wonderful benefit of being plug and play
00:09:05.000 | with most of our, many of our users use cases.
00:09:08.920 | We're still building out a sufficiently mature pipeline
00:09:12.800 | to take advantage of that, but it's in the works.
00:09:15.800 | So here we've got a great example.
00:09:20.040 | We can click on cells and then follow them.
00:09:23.520 | You even notice the cell goes away and comes back
00:09:25.560 | and we can still keep track of it,
00:09:28.120 | which is very challenging for existing object trackers.
00:09:36.920 | High-level overview of how SAM2 works.
00:09:39.440 | There's a simple pipeline here where we can
00:09:48.760 | provide some type of prompt and it fills out
00:09:51.440 | the rest of the likely masks for that object
00:09:55.240 | throughout the rest of the video.
00:09:56.440 | So here we're giving a bounding box in the first frame,
00:09:59.120 | a set of positive negative points,
00:10:00.720 | or even just a simple mask.
00:10:04.680 | I'm going to assume people are somewhat familiar with SAM.
00:10:09.680 | So I'm going to just give a high-level overview
00:10:11.720 | of how SAM works.
00:10:13.720 | You have an image encoder that runs on every frame.
00:10:16.800 | SAM2 can be used on a single image,
00:10:20.760 | in which case the only difference between SAM2 and SAM
00:10:23.400 | is the image encoder: SAM used a standard ViT.
00:10:31.360 | SAM2 replaced that with a Hiera hierarchical encoder,
00:10:36.360 | which gets approximately the same results,
00:10:39.240 | but leads to six times faster inference,
00:10:42.280 | which is excellent, especially considering
00:10:44.560 | how a trend of 2023 was replacing the ViT
00:10:48.960 | with more efficient backbones.
00:10:50.760 | In the case where you're doing video segmentation,
00:10:56.080 | the difference is that you actually create a memory bank
00:10:58.920 | and you cross attend the features from the image encoder
00:11:02.800 | based on the memory bank.
00:11:04.560 | So the feature set that is created is essentially,
00:11:09.560 | well, I'll go more into it in a couple of slides,
00:11:14.500 | but we take the features from the past couple frames
00:11:19.320 | plus a set of object pointers and the set of prompts
00:11:24.520 | and use that to generate our new masks.
00:11:28.920 | We then fuse the new masks for this frame
00:11:30.980 | with the image features and add that to the memory bank.
00:11:35.980 | It's, well, I'll say more in a minute.
00:11:39.720 | Just like SAM, SAM2 actually uses a data engine
00:11:44.440 | to create its data set, in that
00:11:47.320 | they assembled a huge amount of reference data,
00:11:50.020 | used people to label some of it and train the model,
00:11:54.500 | used the model to label more of it
00:11:57.340 | and asked people to refine the predictions of the model.
00:11:59.780 | And then ultimately the data set is just created
00:12:02.660 | from the final output of the model on the reference data.
00:12:06.960 | It's very interesting.
00:12:08.740 | This paradigm is so interesting to me
00:12:09.980 | because it unifies a model and a data set
00:12:14.100 | in a way that is very unique.
00:12:16.920 | It seems unlikely that another model could come in
00:12:19.340 | and have such a tight relationship with the training set.
00:12:22.340 | Yeah, so brief overview of how the memory bank works.
00:12:30.460 | The paper did not have a great visual,
00:12:33.740 | so I'm just, I'm going to fill in a bit more.
00:12:35.940 | So we take the last couple of frames from our video
00:12:49.700 | and attend to them,
00:12:54.700 | along with the set of prompts that we provided,
00:12:54.700 | they could come from the future,
00:12:56.420 | they could come from anywhere in the video,
00:12:58.180 | as well as reference object pointers saying,
00:13:01.500 | by the way, here's what we've found so far.
00:13:04.020 | Attending to the last few frames
00:13:05.980 | has the interesting benefit of allowing it
00:13:08.780 | to model complex object motion, and,
00:13:17.220 | by limiting the number of frames that you attend to,
00:13:19.940 | you manage to keep the model running in real time.
00:13:22.460 | This is such an interesting topic for me
00:13:24.600 | because one would assume that attending
00:13:27.540 | to all of the frames is super essential
00:13:30.140 | or having some type of summarization
00:13:31.380 | of all the frames is super essential for high performance,
00:13:35.060 | but we see in their later ablation
00:13:37.300 | that that actually is not the case.
00:13:39.060 | So here, just to make sure
00:13:43.200 | that there is some benchmarking happening,
00:13:45.060 | we just compared to some of the stuff
00:13:46.700 | that came out prior,
00:13:49.700 | and indeed the SAM2 strategy does improve
00:13:52.380 | on the state of the art.
00:13:53.740 | This ablation deep in their appendices
00:13:59.620 | was super interesting to me.
00:14:01.040 | We see in section C, the number of memories.
00:14:05.660 | One would assume that increasing the count of memories
00:14:08.820 | would meaningfully increase performance.
00:14:11.220 | And we see that it has some impact,
00:14:12.660 | but not the type that you'd expect.
00:14:15.700 | And that it meaningfully decreases speed,
00:14:17.660 | which justifies in my mind,
00:14:19.380 | just having this FIFO queue of memories.
00:14:22.320 | Although in the future,
00:14:25.620 | I'm super interested to see a more dedicated summarization
00:14:30.340 | of all of the past video,
00:14:31.880 | not just a stacking of the last frames.
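
A pseudocode-level sketch of that per-frame loop might look like the following; the module names are stand-ins for SAM2's real components, and the FIFO size is just an example.

```python
from collections import deque

def track_video(frames, prompts, image_encoder, memory_attention, mask_decoder,
                memory_encoder, num_memories=7):
    """Rough sketch of a SAM2-style video segmentation loop, not the actual code.

    The callables (image_encoder, memory_attention, mask_decoder, memory_encoder)
    are stand-ins for the real modules.
    """
    memory_bank = deque(maxlen=num_memories)   # recent frame memories, oldest dropped first
    object_pointers = []                       # lightweight tokens summarizing the object so far
    masks = []
    for frame in frames:
        features = image_encoder(frame)
        # Condition current features on recent memories, pointers, and user prompts.
        conditioned = memory_attention(features, list(memory_bank), object_pointers, prompts)
        mask, pointer = mask_decoder(conditioned)
        # Fuse this frame's prediction with its features and push into the bank.
        memory_bank.append(memory_encoder(features, mask))
        object_pointers.append(pointer)
        masks.append(mask)
    return masks
```
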
00:14:35.340 | So that's another extension of beautiful per-frame work
00:14:44.180 | into the video domain.
00:14:47.580 | The next trend I'm interested in talking about
00:14:49.320 | is this: at Roboflow,
00:14:53.460 | we're super interested in training
00:14:54.660 | real-time object detectors.
00:14:56.180 | Those are our bread and butter.
00:14:57.460 | And so we're doing a lot to keep track
00:14:58.860 | of what is actually happening in that space.
00:15:01.820 | We are finally starting to see something change.
00:15:07.160 | So for years, YOLOs have been the dominant way
00:15:10.940 | of doing real-time object detection.
00:15:12.980 | And we can see here that they've essentially stagnated.
00:15:16.320 | The performance between YOLOv10 and v11
00:15:18.500 | is not meaningfully different,
00:15:21.260 | at least in this type of high-level chart.
00:15:25.340 | And even from the last couple of series,
00:15:26.860 | there's not a major change.
00:15:28.900 | So YOLOs have hit a plateau.
00:15:32.620 | DETRs have not.
00:15:35.940 | So we can look here and see the YOLO series
00:15:40.320 | has this plateau, and then these RT-DETR,
00:15:43.860 | LW-DETR, and D-FINE have meaningfully changed that plateau
00:15:47.540 | so that, in fact, the best D-FINE models
00:15:50.040 | are plus 4.6 AP on COCO at the same latency.
00:15:54.100 | So three major steps to accomplish this.
00:15:59.680 | The first is RT-DETR, which is technically
00:16:01.900 | a 2023 paper preprint, but published officially in '24,
00:16:06.000 | so I'm going to include that.
00:16:07.440 | I hope that's okay.
00:16:09.820 | RT-DETR showed that
00:16:12.300 | we could actually match or out-speed YOLOs.
00:16:14.600 | And then LW-DETR showed that pre-training
00:16:18.540 | is hugely effective on DETRs, and much less so on YOLOs.
00:16:22.260 | And then D-FINE added the types of bells and whistles
00:16:24.060 | that we expect in this arena.
00:16:28.260 | So the major improvement that RT-DETR shows
00:16:33.540 | was taking the multiscale features
00:16:37.400 | that DETRs typically pass into their encoder
00:16:39.980 | and decoupling them
00:16:41.060 | into a much more efficient transformer encoder.
00:16:44.400 | The transformer is, of course, quadratic complexity,
00:16:48.560 | so decreasing the amount of stuff that you pass in at once
00:16:52.100 | is super helpful for increasing your runtime,
00:16:55.780 | or increasing your throughput.
00:16:57.920 | So that change basically brought us up to YOLO speed,
00:17:01.980 | and then they do a hardcore analysis
00:17:04.180 | on benchmarking YOLOs, including the NMS step.
00:17:09.180 | Once you include the NMS in the latency calculation,
00:17:14.600 | you see that, in fact, these DETRs are outperforming,
00:17:18.600 | at least at this time, the YOLOs that existed.
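
To make the "include NMS in the latency" point concrete, here is a small sketch of timing the post-processing a YOLO-style head still needs after the forward pass; the thresholds and the 8400-box example are illustrative, not RT-DETR's actual benchmark harness.

```python
import time
import torch
from torchvision.ops import nms

def timed_postprocess(raw_boxes, raw_scores, score_thresh=0.25, iou_thresh=0.7):
    """Time only the post-processing a YOLO-style model still needs.

    A DETR-style model predicts a fixed set of boxes directly, so this step
    disappears from its latency budget. Thresholds are typical defaults.
    """
    start = time.perf_counter()
    keep = raw_scores > score_thresh
    boxes, scores = raw_boxes[keep], raw_scores[keep]
    kept = nms(boxes, scores, iou_thresh)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return boxes[kept], scores[kept], elapsed_ms

# Example with fake predictions: 8400 candidate boxes, as a YOLO head might emit.
raw_boxes = torch.rand(8400, 4) * 640
raw_boxes[:, 2:] += raw_boxes[:, :2]          # ensure x2 > x1 and y2 > y1
raw_scores = torch.rand(8400)
_, _, ms = timed_postprocess(raw_boxes, raw_scores)
print(f"NMS post-processing took {ms:.2f} ms")
```
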
00:17:22.760 | Then LW-DETR goes in and suggests that,
00:17:26.660 | in fact, the huge boost here is from pre-training.
00:17:32.000 | So this is the D-FINE line,
00:17:35.200 | and this is the D-FINE line without pre-training.
00:17:37.240 | It's within range, it's still an improvement
00:17:39.320 | over the YOLOs, but the really huge boost
00:17:42.200 | comes from the benefit of pre-training.
00:17:44.160 | When YOLOX came out in 2021,
00:17:48.240 | they showed that they got much better results
00:17:51.080 | by having a much, much longer training time,
00:17:54.040 | but they found that when they did that,
00:17:57.240 | they actually did not benefit from pre-training.
00:18:00.160 | So you see in this graph from LW-DETR,
00:18:04.040 | in fact, YOLOs do have a real benefit from pre-training,
00:18:07.240 | but it goes away as we increase the training time.
00:18:10.460 | Then the DETRs converge much faster.
00:18:13.240 | LW-DETR trains for only 50 epochs,
00:18:15.120 | RT-DETR, 60 epochs.
00:18:17.460 | So one could assume that, in fact,
00:18:19.460 | the entire extra gain from pre-training
00:18:22.960 | is that you're not destroying your original
00:18:25.700 | pre-trained weights
00:18:27.460 | by relying on
00:18:29.460 | this long training cycle.
00:18:31.460 | And then LW-DETR also shows superior performance
00:18:37.660 | on our favorite data set, Roboflow 100,
00:18:41.040 | which means that they do better on the real world,
00:18:42.840 | not just on COCO.
00:18:44.120 | Then D-FINE throws all the bells and whistles at it.
00:18:49.500 | YOLO models tend to have a lot of very specific,
00:18:53.880 | complicated loss functions.
00:18:56.500 | D-FINE brings that into the DETR world
00:18:59.840 | and shows consistent improvement
00:19:00.920 | on a variety of DETR-based frameworks.
00:19:03.080 | So bring these all together,
00:19:07.200 | and we see that suddenly we have almost 60 AP on COCO
00:19:11.120 | while running in like 10 milliseconds.
00:19:13.280 | Huge, huge stuff.
00:19:14.620 | So we're spending a lot of time trying to build models
00:19:19.960 | that work better with less data,
00:19:21.880 | and DETRs are clearly becoming a promising step
00:19:24.800 | in that direction.
00:19:26.700 | What we're interested in seeing from the DETRs
00:19:30.660 | in this trend next is: Co-DETR
00:19:33.280 | and the models that are currently sitting
00:19:35.360 | at the top of the leaderboard for large-scale inference
00:19:40.360 | scale really well as you switch out the backbone.
00:19:44.620 | We're very interested in seeing
00:19:46.400 | and having people publish a paper, potentially us,
00:19:49.620 | on what happens if you take these real-time ones
00:19:52.040 | and then throw a Swin-G at it.
00:19:53.360 | Like, do we have a Pareto curve that extends
00:19:56.040 | from the real-time domain all the way up
00:19:57.780 | to the super, super slow but high-performance domain?
00:20:02.580 | We also wanna see people benchmarking on RF100 more
00:20:05.860 | because that type of data is what's relevant
00:20:08.460 | for most users.
00:20:09.620 | And we wanna see more pre-training
00:20:13.200 | because pre-training works now.
00:20:15.080 | It's super cool.
00:20:15.960 | - All right, so, yeah.
00:20:21.660 | So, in that theme, one of the big things
00:20:24.420 | that we're focusing on is how do we get more
00:20:26.620 | out of our pre-trained models?
00:20:28.380 | And one of the lenses to look at this is through
00:20:31.880 | sort of this new requirement for, like,
00:20:34.620 | fine-grained visual details in the representations
00:20:37.880 | that are extracted from your foundation model.
00:20:40.700 | So, it's sort of a hook for this.
00:20:42.460 | Oh, yeah, this is just a list of all the papers
00:20:45.840 | that I'm gonna mention.
00:20:46.700 | I just wanted to make sure I cite the actual papers
00:20:48.380 | so you can find them later.
00:20:50.660 | Yeah, so, sort of the big hook here is that
00:20:54.080 | I make the claim that LLMs can't see.
00:20:56.460 | If you go to Claude or ChatGPT,
00:21:00.920 | you ask it to look at this watch
00:21:04.840 | and tell you what time it is, it fails, right?
00:21:07.120 | And so, you could say, like, maybe the,
00:21:11.800 | like, this is, like, a very classic test of an LLM,
00:21:14.840 | but you could say, okay, maybe this image
00:21:16.540 | is, like, too zoomed out and it just, like,
00:21:19.500 | it'll do better if we increase the resolution
00:21:21.580 | and it has easier time finding these fine-grained features,
00:21:24.540 | like where the watch hands are pointing.
00:21:26.340 | No dice.
00:21:27.160 | And you could say, okay, well, maybe the model
00:21:29.200 | just doesn't know how to tell time
00:21:30.700 | from knowing the position of the hands,
00:21:32.660 | but if you actually prompt it textually,
00:21:34.200 | it's very easy for it to tell the time.
00:21:35.700 | So, this, to me, is proof that these LLMs
00:21:38.540 | literally cannot see the position of the watch hands
00:21:40.840 | and it can't see those details.
00:21:41.960 | So, the question is, sort of, why?
00:21:43.620 | And for you anthropic heads out there, Claude fails, too.
00:21:48.880 | So, my first pick for Best Paper of 2024 Envision
00:21:53.880 | is this MMVP paper, which tries to investigate
00:21:57.260 | why do LLMs not have the ability to see fine-grained details?
00:22:00.880 | And so, for instance, it comes up
00:22:03.040 | with a lot of images like this, where you ask it a question
00:22:05.760 | that seems very visually apparent to us,
00:22:07.260 | like, which way is the school bus facing?
00:22:08.620 | And it gets it wrong.
00:22:09.460 | And then, of course, it makes up details
00:22:11.040 | to support its wrong claim.
00:22:12.460 | And so, the process by which it finds these images
00:22:16.540 | is, sort of, contained in its hypothesis
00:22:18.920 | for why it can't see these details.
00:22:21.240 | So, it hypothesizes that models
00:22:24.920 | that have been initialized with CLIP
00:22:26.960 | as their vision encoder,
00:22:28.540 | they don't have fine-grained details
00:22:31.080 | in the features extracted using CLIP,
00:22:33.080 | because CLIP, sort of, doesn't need to find
00:22:36.920 | these fine-grained details to do its job correctly,
00:22:38.840 | which is just to match captions to images, right?
00:22:42.340 | And, sort of, at a high level,
00:22:44.580 | even if ChatGPT wasn't initialized with CLIP
00:22:46.800 | and wasn't trained contrastively,
00:22:49.460 | the vision encoder wasn't trained contrastively at all,
00:22:52.220 | still, in order to do its job of capturing the image,
00:22:55.340 | it could do a pretty good job
00:22:56.800 | without actually finding the exact position
00:22:58.800 | of all the objects and visual features in the image, right?
00:23:02.040 | So, this paper finds a set of difficult images
00:23:05.920 | for these types of models.
00:23:07.620 | And the way it does it is it looks for embeddings
00:23:10.000 | that are similar in CLIP space, but far in DINOv2 space.
00:23:13.300 | So, DINOv2 is a foundation model
00:23:15.300 | that was trained self-supervised purely on image data,
00:23:20.000 | and it, kind of, uses, like,
00:23:21.420 | some complex student-teacher framework,
00:23:23.960 | but, essentially, it patches out, like,
00:23:26.220 | certain areas of the image
00:23:28.380 | or, like, crops at certain areas of the image
00:23:29.960 | and tries to make sure
00:23:30.840 | that those have consistent representations,
00:23:32.420 | which is a way for it to learn
00:23:33.960 | very fine-grained visual features.
00:23:36.600 | And so, if you take things that are very close in CLIP space
00:23:39.300 | and very far in DINOv2 space,
00:23:41.300 | you get a set of images that basically are pairs of images
00:23:45.840 | that are hard for ChatGPT
00:23:47.300 | and other big language models to distinguish.
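
A minimal sketch of that pair-mining idea, assuming you already have matching CLIP and DINOv2 embeddings for a set of images; the similarity thresholds are illustrative, not the MMVP paper's exact cutoffs.

```python
import numpy as np

def find_clip_blind_pairs(clip_emb, dino_emb, clip_sim_min=0.95, dino_sim_max=0.6):
    """Select image pairs that CLIP thinks are near-identical but DINOv2 separates.

    clip_emb, dino_emb: (N, D) arrays of precomputed, matching-order embeddings.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    clip_sim = normalize(clip_emb) @ normalize(clip_emb).T   # cosine similarity matrices
    dino_sim = normalize(dino_emb) @ normalize(dino_emb).T
    pairs = []
    n = clip_emb.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if clip_sim[i, j] > clip_sim_min and dino_sim[i, j] < dino_sim_max:
                pairs.append((i, j))
    return pairs

# Usage: embeddings would come from a CLIP image encoder and DINOv2, respectively.
pairs = find_clip_blind_pairs(np.random.randn(100, 512), np.random.randn(100, 768))
```
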
00:23:49.720 | So, if you then ask it questions about this image,
00:23:52.600 | well, as you can see from this chart,
00:23:54.880 | it's going to answer the same way for both images, right?
00:23:58.600 | Because, from the perspective of the vision encoder,
00:24:01.640 | they're the same image.
00:24:03.000 | And so, if you ask a question, like,
00:24:03.960 | "How many eyes does this animal have?"
00:24:05.540 | It answers the same for both.
00:24:09.880 | And, like, all these other models, including LLaVA,
00:24:09.880 | do the same thing, right?
00:24:11.920 | And so, this is the benchmark that they create,
00:24:14.080 | which is, like, finding, like, CLIP-blind pairs,
00:24:17.760 | which is pairs of images that are similar in CLIP space,
00:24:19.680 | and creating a data set of multiple-choice questions
00:24:23.220 | based off of those.
00:24:24.760 | And so, how do these models do?
00:24:26.880 | Well, really bad.
00:24:29.080 | LLaVA, I think...
00:24:30.500 | So, ChatGPT and Gemini do a little bit better
00:24:33.420 | than random guessing,
00:24:34.340 | but, like, at half of the performance of humans,
00:24:36.220 | who find these problems to be very easy.
00:24:39.040 | LLaVA is, interestingly,
00:24:41.300 | extremely negatively correlated with this data set.
00:24:44.720 | It does much, much, much, much worse than random guessing,
00:24:47.640 | which means that this process has done a very good job
00:24:50.600 | of identifying hard images for LLaVA, specifically.
00:24:54.680 | And that's because LLaVA is basically
00:24:57.040 | not trained for very long and is initialized from CLIP.
00:24:59.380 | And so, you would expect it to do poorly on this data set.
00:25:03.160 | So, one of the proposed solutions that this paper attempts
00:25:08.040 | is by basically saying,
00:25:09.300 | "Okay, well, if Clip features aren't enough,
00:25:10.920 | "what if we train the visual encoder
00:25:12.800 | "of the language model also on Dyno features?"
00:25:15.040 | And so, it proposes two different ways of doing this.
00:25:19.080 | One, additively, which is basically interpolating
00:25:22.540 | between the two features.
00:25:23.800 | And then, one is interleaving,
00:25:25.640 | which is just kind of like training one
00:25:27.260 | on the combination of both features.
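
In code, the two mixing schemes are roughly the following; this assumes both feature sets have already been projected to the same shape, which glosses over the adapters the paper actually trains.

```python
import torch

def additive_mix(clip_tokens, dino_tokens, alpha):
    """Interpolate the two feature sets: alpha=0 is pure CLIP, alpha=1 is pure DINOv2.

    Assumes both have already been projected to the same (batch, tokens, dim) shape.
    """
    return (1 - alpha) * clip_tokens + alpha * dino_tokens

def interleaved_mix(clip_tokens, dino_tokens):
    """Feed both token sets to the language model, doubling the visual token count."""
    return torch.cat([clip_tokens, dino_tokens], dim=1)

clip_tokens = torch.randn(1, 256, 4096)
dino_tokens = torch.randn(1, 256, 4096)
print(additive_mix(clip_tokens, dino_tokens, alpha=0.25).shape)  # (1, 256, 4096)
print(interleaved_mix(clip_tokens, dino_tokens).shape)           # (1, 512, 4096)
```
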
00:25:30.180 | So, there's this really interesting trend
00:25:32.000 | when you do the additive mixture of features.
00:25:34.720 | So, zero is all CLIP features
00:25:38.480 | and one is all DINOv2 features.
00:25:40.900 | So, I think it's helpful
00:25:44.720 | to look at the rightmost chart first,
00:25:46.380 | which is, as you increase the number of DINOv2 features,
00:25:48.960 | your model does worse and worse and worse
00:25:50.600 | on the actual language modeling task.
00:25:52.560 | And that's because DINOv2 features
00:25:54.160 | were trained completely in a self-supervised manner
00:25:57.280 | and completely in image space.
00:25:58.600 | It knows nothing about text.
00:25:59.700 | These features aren't really compatible
00:26:01.520 | with these text models.
00:26:03.000 | And so, you can train an adapter all you want,
00:26:05.280 | but it seems that it's in such an alien language
00:26:07.420 | that it's like a very hard optimization
00:26:09.080 | for these models to solve.
00:26:11.560 | And so, that kind of supports what's happening on the left,
00:26:14.880 | which is that, yeah, it gets better
00:26:16.680 | at answering these questions
00:26:19.640 | as you include more DINOv2 features, up to a point,
00:26:23.140 | but then when you oversaturate,
00:26:24.800 | it completely loses its ability to answer language
00:26:28.860 | and do language tasks.
00:26:31.640 | So, you can also see with the interleaving,
00:26:35.520 | they essentially double the number of tokens
00:26:38.080 | that are going into these models and just train on both.
00:26:41.640 | And it still doesn't really solve the MMVP task.
00:26:43.960 | It gets LLaVA-1.5 above random guessing by a little bit,
00:26:47.560 | but it's still not close to ChatGPT
00:26:50.600 | or any human performance, obviously.
00:26:54.200 | So, clearly, this proposed solution
00:26:56.540 | of just using DINOv2 features directly isn't gonna work.
00:27:00.000 | And basically what that means is that
00:27:01.920 | as a vision foundation model,
00:27:06.040 | DINOv2 is gonna be insufficient for language tasks, right?
00:27:09.840 | So, my next pick for best paper of 2024
00:27:13.640 | would be Florence 2, which tries to solve this problem
00:27:16.000 | by incorporating not only this dimension
00:27:19.280 | of spatial hierarchy,
00:27:20.420 | which is to say pixel level understanding,
00:27:23.320 | but also in making sure to include
00:27:25.300 | what they call semantic granularity,
00:27:27.000 | which ends up, the goal is basically to have features
00:27:30.720 | that are sufficient for finding objects in the image.
00:27:34.000 | So, they have enough pixel information,
00:27:37.520 | but also can be talked about and can be reasoned about.
00:27:40.520 | And that's on the semantic granularity axis.
00:27:44.880 | So, here's an example of basically three different
00:27:49.520 | paradigms of labeling that they do.
00:27:51.680 | So, they create a big data set.
00:27:54.160 | One is text, which is just captioning.
00:27:56.800 | And you would expect a model
00:27:57.920 | that's trained only on captioning
00:27:59.120 | to have similar performance to ChatGPT
00:28:01.000 | and not have spatial hierarchy,
00:28:03.920 | not have features that are meaningful at the pixel level.
00:28:07.560 | And so, they add another type, which is region text pairs,
00:28:11.080 | which is essentially either classifying a region
00:28:14.080 | or doing object detection
00:28:19.080 | or doing instance segmentation on that region
00:28:22.080 | or captioning that region.
00:28:23.640 | And then they have text phrase region annotations,
00:28:26.240 | which is essentially a triple.
00:28:28.560 | And basically, not only do you have a region
00:28:31.040 | that you've described,
00:28:32.160 | you also find its place in a descriptive paragraph
00:28:36.720 | about the image,
00:28:37.560 | which is basically trying to introduce
00:28:39.760 | even more semantic understanding of these regions.
00:28:42.240 | And so, for instance,
00:28:43.640 | if you're saying a woman riding on the road,
00:28:46.040 | you have to know what a woman is and what the road is
00:28:48.120 | and that she's on top of it.
00:28:49.120 | And that's basically composing a bunch of objects
00:28:52.040 | in this visual space,
00:28:53.120 | but also thinking about it semantically.
00:28:55.240 | Right?
00:28:56.280 | And so, the way that they do this is they take...
00:28:59.400 | Basically, they just dump features from a vision encoder
00:29:04.400 | straight into an encoder-decoder transformer.
00:29:08.440 | And then they train a bunch of different tasks
00:29:12.720 | like object detection and so on as a language task.
00:29:16.240 | And I think that's one of the big things
00:29:17.520 | that we saw in 2024
00:29:19.760 | is these vision language models
00:29:23.480 | operating on pixel space linguistically.
00:29:26.880 | So, they introduce a bunch of new tokens
00:29:28.360 | to point to locations in pixel space.
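
The general flavor of those location tokens is to quantize coordinates into a fixed vocabulary of bins, something like the sketch below; the bin count, token spelling, and coordinate order here are illustrative, since Florence 2 and PaliGemma each have their own exact format.

```python
def box_to_location_tokens(x1, y1, x2, y2, image_w, image_h, num_bins=1000):
    """Quantize box coordinates into discrete <loc_...> tokens (illustrative format)."""
    def bin_of(value, size):
        return min(int(value / size * num_bins), num_bins - 1)

    return [
        f"<loc_{bin_of(x1, image_w)}>",
        f"<loc_{bin_of(y1, image_h)}>",
        f"<loc_{bin_of(x2, image_w)}>",
        f"<loc_{bin_of(y2, image_h)}>",
    ]

# A box in a 640x480 image becomes four extra vocabulary tokens the decoder can emit.
print(box_to_location_tokens(32, 48, 320, 240, image_w=640, image_h=480))
# ['<loc_50>', '<loc_100>', '<loc_500>', '<loc_500>']
```
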
00:29:33.080 | So, how does it work?
00:29:35.520 | How does it actually do?
00:29:37.280 | We can see, if you look at the graph on the right,
00:29:40.200 | which is using the DINO framework,
00:29:44.560 | your pre-trained Florence 2 models transfer very, very well.
00:29:50.400 | They get 60 mAP on COCO,
00:29:53.000 | which is like approaching state-of-the-art.
00:29:54.960 | And they train with...
00:29:55.800 | - Recording in progress.
00:29:57.520 | - You're good.
00:29:58.440 | And they train much more efficiently.
00:30:02.960 | So, they converge a lot faster,
00:30:04.360 | which both of these things are pointing to the fact
00:30:06.720 | that they're actually leveraging
00:30:08.320 | their pre-trained weights effectively.
00:30:10.240 | So, where is it falling short?
00:30:14.200 | So, these models, I forgot to mention,
00:30:16.520 | Florence 2 comes in a 0.2 billion
00:30:18.040 | and a 0.7 billion parameter count.
00:30:20.360 | So, they're very, very small
00:30:21.600 | in terms of being a language model.
00:30:24.240 | And I think that this framework, you can see saturation.
00:30:27.760 | So, what this graph is showing is that
00:30:30.280 | if you train a Florence 2 model
00:30:32.440 | purely on the image level and region level annotations
00:30:35.320 | and not including the pixel level annotations,
00:30:38.040 | like segmentation,
00:30:40.240 | it actually performs better as an object detector.
00:30:43.960 | And what that means is that
00:30:45.640 | it's not able to actually learn all the visual tasks
00:30:48.400 | that it's trying to learn
00:30:49.480 | because it doesn't have enough capacity.
00:30:51.160 | So, I'd like to see this paper explore larger model sizes,
00:30:54.440 | which brings us to our next big paper of 2024,
00:30:58.880 | or two papers.
00:31:00.200 | So, PaliGemma came out earlier this year.
00:31:02.160 | PaliGemma 2 was released, I think, like a week or two ago.
00:31:05.040 | Oh, I forgot to mention, you can actually train
00:31:08.400 | like, label text data sets on Roboflow
00:31:10.720 | and you can train a Florence 2 model
00:31:12.240 | and you can actually train a PaliGemma 2 model on Roboflow,
00:31:15.640 | which we got into the platform
00:31:16.840 | within like 14 hours of release,
00:31:18.120 | which I was really excited about.
00:31:19.800 | So, anyway, so PaliGemma 2...
00:31:21.920 | So, PaliGemma is essentially doing the same thing,
00:31:24.560 | but instead of doing an encoder-decoder,
00:31:26.280 | it just dumps everything
00:31:27.120 | into a decoder-only transformer model.
00:31:29.560 | But it also introduced the concept of location tokens
00:31:31.840 | to point to objects in pixel space.
00:31:35.240 | PaliGemma 2...
00:31:36.560 | So, PaliGemma uses Gemma as the language encoder
00:31:38.680 | and it uses Gemma 2B.
00:31:39.880 | PaliGemma 2 introduces using multiple different sizes
00:31:43.120 | of language encoders.
00:31:44.160 | So, the way that they sort of get around
00:31:48.360 | having to do encoder-decoder
00:31:49.960 | is they use the concept of prefix loss,
00:31:52.320 | which basically means that
00:31:53.680 | when it's generating tokens autoregressively,
00:31:58.360 | all of those tokens in the prefix,
00:32:01.160 | which is like the image that it's looking at
00:32:03.040 | and like a description of the task that it's trying to do,
00:32:05.920 | they're attending to each other fully, full attention,
00:32:09.320 | which means that it can sort of bind high level...
00:32:12.960 | It's easier for the prefix to color the output
00:32:17.760 | of the suffix
00:32:19.160 | and also to just find features easily.
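
A minimal sketch of that prefix-style attention mask, with `True` meaning a position may attend to another; this is just the masking idea, not PaliGemma's implementation.

```python
import torch

def prefix_lm_mask(prefix_len: int, total_len: int) -> torch.Tensor:
    """Full (bidirectional) attention over the prefix (image tokens + task
    description), causal attention over the generated suffix."""
    mask = torch.zeros(total_len, total_len, dtype=torch.bool)
    mask[:, :prefix_len] = True                            # everyone sees the whole prefix
    causal = torch.tril(torch.ones(total_len, total_len, dtype=torch.bool))
    mask[prefix_len:, prefix_len:] = causal[prefix_len:, prefix_len:]  # suffix is causal
    return mask

print(prefix_lm_mask(prefix_len=3, total_len=5).int())
# tensor([[1, 1, 1, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]])
```
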
00:32:23.440 | So, this is sort of an example
00:32:25.920 | of one of the tasks that I was trained on,
00:32:27.360 | which is you describe the task in English
00:32:29.800 | and then you give it all these...
00:32:34.520 | You're asking for it to segment these two classes of objects
00:32:38.960 | and then it finds their locations using these tokens
00:32:42.760 | and it finds their masks using some encoding
00:32:46.480 | of the masks into tokens.
00:32:50.200 | And yeah, so one of my critiques,
00:32:54.040 | I guess, of PolyGemma 1, at least,
00:32:56.080 | is that you find that performance saturates
00:32:59.080 | as a pre-trained model
00:32:59.960 | after only 300 million examples seen.
00:33:02.400 | So, what this graph is representing
00:33:06.000 | is each blue dot is a performance on some downstream task.
00:33:09.560 | You can see that after seeing 300 million examples,
00:33:12.520 | it sort of does equally well
00:33:15.440 | on all of the downstream tasks that they tried it on,
00:33:18.400 | which was a lot, as it does at one billion examples,
00:33:21.680 | which to me also kind of suggests
00:33:23.720 | a lack of capacity for this model.
00:33:25.560 | PaliGemma 2, you can see the results on object detection.
00:33:31.520 | So, these were transferred to COCO.
00:33:35.800 | And you can see that this sort of also points
00:33:39.200 | to an increase in capacity being helpful to the model.
00:33:41.280 | You can see as both the resolution increases
00:33:44.720 | and the parameter count of the language model increases,
00:33:47.360 | performance increases.
00:33:48.640 | So, resolution makes sense.
00:33:49.640 | Obviously, it helps to find
00:33:51.960 | small objects in the image,
00:33:53.560 | but it also makes sense from another reason,
00:33:55.080 | which is that it kind of gives the model
00:33:56.880 | a thinking register and it gives it more tokens
00:33:58.800 | to process when making its predictions.
00:34:01.440 | But yeah, you could say, oh, 43.6, that's not that great.
00:34:06.600 | Like Florence 2 got 60,
00:34:08.960 | but this is not training a DINO or a DETR
00:34:12.520 | on top of this image encoder.
00:34:16.240 | It's doing the raw language modeling task on COCO.
00:34:20.520 | So, it doesn't have any of the bells and whistles.
00:34:21.960 | It doesn't have any of the fancy losses.
00:34:23.360 | It doesn't even have bipartite graph matching
00:34:25.600 | or anything like that.
00:34:27.400 | Okay, the big result and one of the reasons
00:34:30.360 | that I was really excited about this paper
00:34:32.920 | is that they blow everything else away on MMVP.
00:34:35.520 | I mean, 47.3, sure, that's nowhere near human accuracy,
00:34:39.400 | which again is 94%,
00:34:40.680 | but for a 2 billion parameter language model
00:34:44.600 | to beat ChatGPT, that's quite the achievement.
00:34:47.120 | And that sort of brings us to our final pick
00:34:51.320 | for paper of the year, which is AIMv2.
00:34:56.080 | So, AIMv2 sort of says, okay, maybe this language model,
00:35:01.080 | like maybe coming up with all these specific annotations
00:35:04.760 | to find features with high fidelity in pixel space
00:35:08.760 | isn't actually necessary.
00:35:10.560 | And we can come up with an even simpler
00:35:12.920 | and more beautiful idea for combining image tokens
00:35:17.280 | and pixel tokens in a way that's interfaceable
00:35:19.640 | for language tasks.
00:35:21.120 | And this is nice because it can scale.
00:35:23.680 | You can come up with lots more data
00:35:25.360 | if you don't have to come up
00:35:26.280 | with all these annotations, right?
00:35:28.080 | So, the way that it works is it does something
00:35:30.160 | very, very similar to PaliGemma
00:35:31.680 | where you have a vision encoder
00:35:33.040 | that dumps image tokens into a decoder-only transformer.
00:35:36.840 | But the interesting thing is that
00:35:40.000 | it also autoregressively tries to reconstruct
00:35:42.760 | the image tokens with a mean squared error loss.
00:35:46.200 | So, instead of having to come up
00:35:47.320 | with fancy object detection or segmentation labels,
00:35:51.520 | you can just try to reconstruct the image
00:35:53.240 | and have it learn fine-grained features that way.
00:35:55.720 | And it does this in kind of, I think, a beautiful way
00:35:59.000 | that's kind of compatible
00:36:00.080 | with the PaliGemma line of thinking,
00:36:01.400 | which is randomly sampling a prefix length
00:36:04.560 | and using only this number of image tokens as the prefix.
00:36:08.480 | And so, doing a similar thing with the causal mask.
00:36:13.320 | So, the causal prefix is the attention mask on the right.
00:36:16.360 | So, it's doing full block attention
00:36:18.760 | with some randomly sampled number of image tokens
00:36:21.120 | to then reconstruct the rest of the image
00:36:22.600 | and the downstream caption for that image.
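
Putting those pieces together, a rough sketch of the training objective might look like this; `model` is a stand-in that returns patch predictions and caption logits, and the details here are my reading of the description above, not Apple's code.

```python
import torch
import torch.nn.functional as F

def aimv2_style_loss(model, image_patches, caption_ids, min_prefix=1):
    """A rough sketch of an AIMv2-style objective.

    image_patches: (B, N, D) continuous patch embeddings.
    caption_ids:   (B, L) caption token ids.
    A prefix length is sampled; the model gets full attention over that prefix
    and must autoregressively predict the remaining patches (MSE) and the
    caption tokens (cross-entropy).
    """
    num_patches = image_patches.shape[1]
    prefix_len = int(torch.randint(min_prefix, num_patches, (1,)))
    patch_preds, caption_logits = model(image_patches, caption_ids, prefix_len)

    # Regress the patches that come after the prefix.
    image_loss = F.mse_loss(patch_preds[:, prefix_len:], image_patches[:, prefix_len:])
    # Standard next-token prediction on the caption.
    text_loss = F.cross_entropy(
        caption_logits[:, :-1].flatten(0, 1), caption_ids[:, 1:].flatten()
    )
    return image_loss + text_loss
```
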
00:36:26.160 | And so, this is the dataset that they train on.
00:36:30.160 | It's internet-scale data, very high-quality data
00:36:34.000 | created by the Data Filtering Networks paper, essentially,
00:36:38.320 | which is maybe the best CLIP data that exists.
00:36:42.120 | And we can see that this is finally a model
00:36:46.640 | that doesn't saturate.
00:36:48.520 | Even at the highest parameter count,
00:36:51.360 | it appears to be, well,
00:36:55.160 | improving in performance
00:36:59.160 | with more and more samples seen.
00:37:00.880 | And so, you can sort of think that, you know,
00:37:03.800 | if we just keep bumping the parameter count
00:37:05.920 | and increasing the example seen,
00:37:07.280 | which is the line of thinking for language models,
00:37:10.400 | then it'll keep getting better.
00:37:12.320 | So, how does it actually do at finding...
00:37:14.080 | Oh, it also improves with resolution,
00:37:16.400 | which you would expect for a model that...
00:37:20.440 | This is the ImageNet classification accuracy,
00:37:22.680 | but yeah, it does better if you increase the resolution,
00:37:25.480 | which means that it's actually leveraging
00:37:26.920 | and finding fine-grained visual features.
00:37:29.760 | And so, how does it actually do compared to CLIP on COCO?
00:37:34.800 | Well, you can see that if you slap
00:37:36.800 | a transformer detection head on it,
00:37:39.400 | and train it on COCO, it gets to 60.2,
00:37:41.280 | which is also within spitting distance of SOTA,
00:37:44.200 | which means that it does a very good job
00:37:45.680 | of finding visual features.
00:37:48.480 | But you could say, okay, well, wait a second,
00:37:51.760 | CLIP got to 59.1, so, like,
00:37:55.600 | how does this prove your claim at all?
00:37:57.040 | Because doesn't that mean, like,
00:37:59.000 | CLIP, which is known to be CLIP-blind
00:38:00.920 | and do badly on MMVP,
00:38:02.440 | it's able to achieve a very high performance
00:38:04.720 | on this fine-grained visual features task
00:38:07.560 | of object detection?
00:38:08.800 | Well, they train on, like, tons of data.
00:38:11.800 | They train on, like, Objects 365, COCO, Flickr,
00:38:15.720 | and everything else.
00:38:17.120 | And so, I think that this benchmark
00:38:18.560 | doesn't do a great job of selling
00:38:19.800 | how good of a pre-trained model AIMv2 is.
00:38:22.040 | And we would like to see performance
00:38:25.000 | with fewer data examples
00:38:27.840 | and not trained to convergence on object detection.
00:38:29.760 | So, seeing it in the real world
00:38:31.640 | on, like, a dataset like Roboflow 100,
00:38:33.320 | I think would be quite interesting.
00:38:35.760 | And our, I guess, our final, final pick
00:38:38.360 | for paper of 2024 would be Moondream.
00:38:42.280 | So, introducing Vik to talk about that.
00:38:42.280 | - But overall, that was exactly what I was looking for.
00:38:49.640 | Like, best of 2024, amazing job.
00:38:51.800 | Yeah, you can.
00:38:54.480 | Does anyone have questions
00:38:56.400 | while Vik gets set up, like, vision stuff?
00:38:58.400 | Yeah?
00:39:02.720 | Vic, go ahead. - Hi.
00:39:06.520 | Well, while we're getting set up, hi, over here.
00:39:09.920 | Thanks for the really awesome talk.
00:39:11.760 | One of the things that's been weird and surprising
00:39:13.760 | is that the foundation model companies
00:39:19.280 | and even these MLMs,
00:39:22.560 | they're just, like, worse than RTTetter at detection still.
00:39:27.200 | Like, if you wanted to pay a bunch of money
00:39:30.280 | to auto-label your detection dataset,
00:39:32.080 | if you gave it to OpenAI or Claude,
00:39:33.920 | that would be, like, a big waste.
00:39:36.440 | So, I'm curious, just, like,
00:39:37.520 | even PaliGemma 2, like, is worse.
00:39:40.840 | So, I'm curious to hear your thoughts on, like,
00:39:43.480 | how come nobody's cracked the code on, like,
00:39:46.040 | a generalist that really, you know,
00:39:50.320 | beats a specialist model in computer vision
00:39:53.360 | like they have in LM land?
00:39:56.120 | - I can, can you hear me?
00:40:01.080 | - Yeah, you gotta press the speak button.
00:40:03.440 | - Okay.
00:40:04.320 | - Oh, yeah.
00:40:05.160 | (laughing)
00:40:07.560 | - It's a very, very interesting question.
00:40:09.760 | I think it depends on the specific domain.
00:40:13.360 | For image classification, it's basically there.
00:40:16.600 | In the, AIMV2 showed a simple attentional probe
00:40:20.480 | on the pre-trained features gets, like, 90%,
00:40:22.520 | which is as well as anyone does.
00:40:24.960 | The bigger question, like,
00:40:29.040 | why isn't it transferring to object detection,
00:40:33.520 | especially, like, real-time object detection?
00:40:35.760 | I think, in my mind, there are two answers.
00:40:39.240 | One is object detection is really, really, really,
00:40:43.280 | the architectures are super domain-specific.
00:40:46.480 | You know, we see these,
00:40:47.320 | all these super, super complicated things,
00:40:48.800 | and it's not super easy to build something
00:40:52.720 | that just transfers naturally like that,
00:40:54.440 | whereas image classification, you know,
00:40:56.440 | clip pre-training transfers super, super easily.
00:40:59.640 | And the other thing is, until recently,
00:41:04.240 | the real-time object detectors
00:41:06.000 | didn't even really benefit from pre-training.
00:41:08.560 | Like, you see the YOLOs that are, like,
00:41:10.200 | essentially saturated, showing very little difference
00:41:12.720 | with pre-training improvements,
00:41:15.440 | with using a pre-trained model at all,
00:41:17.680 | it's not surprising, necessarily,
00:41:19.640 | that people aren't looking at the effects
00:41:22.880 | of better and better pre-training on real-time detection.
00:41:25.920 | Maybe that'll change in the next year.
00:41:27.800 | Does that answer your question?
00:41:29.480 | - Cool.
00:41:30.320 | Can you guys hear me?
00:41:33.320 | Yeah, one thing I want to add is just, like,
00:41:35.040 | or just to summarize, basically, is that, like,
00:41:37.520 | until 2024, you know,
00:41:40.080 | we haven't really seen a combination
00:41:41.720 | of transformer-based object detectors and fancy losses,
00:41:46.720 | and PaliGemma suffers from the same problem,
00:41:49.120 | which is basically to say that these ResNet,
00:41:52.360 | or, like, the convolutional models,
00:41:54.280 | they have all these, like, extreme optimizations
00:41:58.200 | for doing object detection,
00:42:00.160 | but essentially, I think it's kind of been shown now
00:42:02.840 | that convolutional models, like,
00:42:04.200 | just don't benefit from pre-training
00:42:05.720 | and just don't, like, have the level of intelligence
00:42:07.440 | of transformer models.
00:42:08.560 | - Awesome.
00:42:13.080 | Balundri.
00:42:14.760 | - Hi, can you hear me?
00:42:17.040 | - Cool.
00:42:17.880 | - I can hear you, see you.
00:42:19.000 | Are you sharing your screen?
00:42:20.120 | - I might have forgotten to do that.
00:42:22.440 | Let me do that.
00:42:23.280 | - Sorry, you should've done that.
00:42:24.120 | - Okay.
00:42:24.960 | - Here's your screen.
00:42:35.320 | - Uh-oh, classic.
00:42:37.160 | You might have to quit Zoom and restart.
00:42:40.640 | - What?
00:42:41.480 | - It's fine.
00:42:43.440 | Yeah, it's like, we have a capture of your screen.
00:42:46.960 | I'll just make sure it's visible.
00:42:49.120 | So let's get to your screen.
00:42:52.440 | - Okay.
00:42:54.080 | Easy enough.
00:42:54.920 | - How do you make it, like, wait for you?
00:42:58.880 | - Quit Zoom.
00:43:04.080 | - Yeah, yeah, there you go.
00:43:04.920 | Perfect.
00:43:05.760 | - All right.
00:43:07.480 | Hi, everyone.
00:43:08.320 | My name is Vik.
00:43:09.440 | I've been working on Moondream for almost a year now,
00:43:12.560 | like Sean mentioned.
00:43:13.440 | I just went and looked,
00:43:14.440 | and it turns out the first version,
00:43:16.280 | I released December 29, 2023.
00:43:18.240 | It's been a fascinating journey.
00:43:21.040 | So Moondream started off as a tiny vision language model.
00:43:25.720 | Since then, we've extended scope a little bit
00:43:27.360 | to also try and build some tooling,
00:43:30.080 | client libraries, et cetera,
00:43:31.120 | to help people really deploy it.
00:43:34.360 | Unlike traditional large models
00:43:37.680 | that are focused at assistant-type use cases,
00:43:39.360 | we're laser-focused on building
00:43:41.480 | capabilities that developers can use
00:43:49.680 | to build vision applications
00:43:58.200 | that can run anywhere.
00:43:59.120 | So in a lot of cases for vision more so than for text,
00:44:02.720 | you really care about being able to run on the edge,
00:44:05.000 | run in real time, et cetera.
00:44:06.000 | So that's really important.
00:44:08.840 | We have different output modalities that we support.
00:44:12.560 | There's query where you can ask
00:44:14.160 | general English questions about an image
00:44:15.960 | and get back human-like answers.
00:44:18.080 | There's captioning,
00:44:19.280 | which a lot of our users use
00:44:21.040 | for generating synthetic datasets
00:44:23.480 | to then train diffusion models and whatnot.
00:44:26.360 | We've done a lot of work to minimize hallucinations there.
00:44:28.200 | So that's used a lot.
00:44:31.080 | We have open vocabulary object detection built in,
00:44:33.120 | similar to a couple of more recent models
00:44:34.560 | like PaliGemma, et cetera,
00:44:35.480 | where rather than having to train a dedicated model,
00:44:38.040 | you can just say, "Show me soccer balls in this image,"
00:44:41.000 | or, "Show me if there are any deer in this image."
00:44:42.640 | It'll detect it.
00:44:43.640 | More recently, earlier this month,
00:44:46.520 | we released pointing capability
00:44:48.720 | where if all you're interested in is the center of an object,
00:44:52.440 | you can just ask it to point out where that is.
00:44:56.360 | This is very useful
00:45:00.360 | when you're doing UI automation-type stuff.
00:45:00.360 | Let's see.
00:45:01.200 | We have two models out right now.
00:45:05.840 | There's a general-purpose 2B parameter model,
00:45:08.160 | which runs fairly...
00:45:11.080 | Like, it's fine if you're running on a server.
00:45:13.040 | It's good for our local LLaMA desktop friends,
00:45:16.720 | and it can run on flagship mobile phones,
00:45:18.800 | but it never really fulfilled the promise
00:45:21.000 | of being able to run anywhere.
00:45:23.000 | Last week, we released a new 0.5B parameter model,
00:45:25.880 | which should be seen less as a general-purpose model
00:45:28.920 | and more as a distillation target
00:45:30.560 | for the 2B parameter model.
00:45:32.400 | It's very good if you're running on older mobile phones
00:45:36.080 | or edge devices.
00:45:37.760 | Uses less memory,
00:45:39.400 | even with our not-yet-fully-optimized inference client.
00:45:42.120 | So the way we built our 0.5B model
00:45:47.960 | was to start with the 2B parameter model
00:45:50.880 | and prune it while doing continual training
00:45:55.720 | to retain performance.
00:45:57.400 | We...
00:45:58.880 | Our objective during the pruning
00:46:00.280 | was to preserve accuracy across a broad set of benchmarks.
00:46:04.760 | So the way we went about it
00:46:05.840 | was to estimate the importance
00:46:07.400 | of different components of the model,
00:46:08.640 | like attention heads, channels,
00:46:10.360 | MLP rows and whatnot,
00:46:14.440 | using basically a technique based on the gradient.
00:46:17.520 | I'm not sure how much people want to know details.
00:46:19.320 | We'll be writing a paper about this,
00:46:20.560 | but feel free to grab me if you have more questions.
00:46:23.920 | Then we iteratively prune a small chunk
00:46:26.400 | that'll minimize loss in performance,
00:46:28.360 | retrain the model to recover performance and bring it back.
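
A generic sketch of that kind of gradient-based importance scoring is below; first-order `|weight * gradient|` scores are a common pruning heuristic, and the actual Moondream procedure may differ (their paper is still pending).

```python
import torch

def head_importance(attn_weight: torch.Tensor, attn_grad: torch.Tensor, num_heads: int):
    """First-order importance score per attention head: |weight * gradient|,
    summed over each head's parameters. A common pruning heuristic, shown
    here only as a sketch of the idea."""
    per_param = (attn_weight * attn_grad).abs()
    return per_param.view(num_heads, -1).sum(dim=1)   # one score per head

def prune_smallest(scores: torch.Tensor, fraction: float = 0.1):
    """Return indices of the least important heads to remove this iteration."""
    k = max(1, int(fraction * scores.numel()))
    return torch.topk(scores, k, largest=False).indices

# Example with fake tensors standing in for one attention projection and its gradient.
w, g = torch.randn(2048, 2048), torch.randn(2048, 2048)
scores = head_importance(w, g, num_heads=16)
print(prune_smallest(scores))   # heads to drop, then retrain and repeat
```
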
00:46:31.480 | The 0.5B we released is more of a proof of concept
00:46:35.040 | that this is possible.
00:46:35.880 | I think the thing that's really exciting about this
00:46:37.640 | is it makes it possible for...
00:46:39.440 | For developers to build using the 2B param model
00:46:44.880 | and just explore, build their application.
00:46:48.400 | And then once they're ready to deploy,
00:46:50.680 | figure out what exactly they need out of the model
00:46:52.560 | and prune those capabilities into a smaller form factor
00:46:54.680 | that makes sense for their deployment target.
00:46:56.960 | So yeah, very excited about that.
00:47:00.680 | Let me talk to you folks a little bit about another problem
00:47:04.240 | I've been working on recently,
00:47:05.160 | which is similar to the clocks example
00:47:07.040 | we've been talking about.
00:47:07.880 | We had a customer reach out
00:47:11.240 | who had a bunch of gauges out in the field.
00:47:14.240 | This is very common in manufacturing and oil and gas
00:47:16.800 | where you have a bunch of analog devices
00:47:19.720 | that you need to monitor.
00:47:20.960 | It's expensive to have humans look at that
00:47:24.040 | and monitor stuff and make sure that the system
00:47:27.320 | gets shut down when the temperature goes over 80
00:47:29.440 | or something.
00:47:30.360 | So I was like, yeah, this seems easy enough.
00:47:32.240 | Happy to help you distill that.
00:47:34.680 | Let's get it going.
00:47:36.480 | Turns out our model couldn't do it at all.
00:47:38.560 | I went and looked at other open source models
00:47:40.760 | to see if I could just generate a bunch of data
00:47:43.120 | and learn from that.
00:47:43.960 | That did not work either.
00:47:45.680 | So I was like, let's look at what the folks
00:47:47.240 | with hundreds of billions of dollars in market cap
00:47:51.000 | have to offer.
00:47:51.840 | And yeah, that doesn't work either.
00:47:53.960 | My hypothesis is that these models are trained
00:48:00.040 | using a large amount of image-text data
00:48:03.200 | scraped from the internet.
00:48:04.480 | And that can be biased.
00:48:05.320 | In the case of gauges,
00:48:06.640 | most gauge images aren't gauges in the wild.
00:48:09.440 | They're product detail images like these,
00:48:12.680 | where it's always set to zero.
00:48:14.280 | It's paired with an alt text that says something like
00:48:16.360 | G-I-V-T-O pressure sensor, PSI zero to 30 or something.
00:48:21.360 | And so the models are fairly good
00:48:23.760 | at picking up those details.
00:48:24.680 | It'll tell you that it's a pressure gauge.
00:48:26.000 | It'll tell you what the brand is,
00:48:26.840 | but it doesn't really learn to pay attention
00:48:28.680 | to the needle over there.
00:48:30.880 | And so, yeah, that's a gap we need to address.
00:48:36.480 | So naturally my mind goes to like,
00:48:39.800 | let's use synthetic data to solve this problem.
00:48:42.520 | That works, but it's problematic
00:48:46.160 | because it turned out we needed millions
00:48:47.760 | of synthetic gauge images to get to reasonable performance.
00:48:50.920 | And thinking about it, reading a gauge is not
00:48:55.480 | a zero-shot process in our minds, right?
00:48:57.520 | Like if you had to tell me the reading in Celsius
00:49:00.440 | for this real world gauge, there's two dials on there.
00:49:03.920 | So first you have to figure out which one
00:49:05.200 | you have to be paying attention to,
00:49:06.160 | like the inner one or the outer one.
00:49:07.920 | You look at the tip of the needle,
00:49:11.080 | you look at what labels it's between,
00:49:13.360 | and then you count how many and do some math
00:49:17.200 | to figure out what that probably is.
00:49:19.360 | So what happens if we just add that as chain of thought
00:49:23.280 | to give the model a better understanding
00:49:27.600 | of the different steps,
00:49:29.720 | to allow the model to better learn the subtasks
00:49:31.360 | it needs to perform to accomplish this goal?
00:49:33.560 | So you can see in this example,
00:49:36.640 | this was actually generated
00:49:37.560 | by the latest version of our model.
00:49:39.480 | It's like, okay, Celsius is the inner scale.
00:49:42.120 | It's between 50 and 60.
00:49:43.200 | There's 10 ticks.
00:49:44.280 | It's at the second tick.
00:49:46.360 | It's a little debatable here.
00:49:47.440 | Like there's a weird shadow situation going on.
00:49:49.400 | The dial is off.
00:49:50.440 | So I don't know what the ground truth is,
00:49:52.040 | but it works okay.
00:49:54.920 | There's points on there,
00:49:57.640 | and those points are actually grounded.
00:50:00.040 | I don't know if this is easy to see,
00:50:01.880 | but when I click on those,
00:50:02.880 | there's a little red dot that moves around on the image.
00:50:05.120 | The model actually has to predict
00:50:07.000 | where those points are.
00:50:09.880 | I was originally trying to do this with bounding boxes,
00:50:11.920 | but then Molmo came out with pointing capabilities
00:50:14.840 | and it's like pointing is a much better paradigm
00:50:17.680 | to represent this.
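As a rough illustration of what a grounded chain-of-thought sample could look like, here is a sketch of one training example where each reasoning step can carry a normalized (x, y) point the model must also predict. The schema, field names, and the `<point .../>` serialization are assumptions for this sketch, not Moondream's actual data format.

```python
# One illustrative grounded chain-of-thought example for gauge reading.
# The schema, field names, and <point/> serialization are assumptions for this
# sketch, not Moondream's actual format; points are normalized (x, y) coordinates.
example = {
    "image": "gauge_001.jpg",
    "question": "What is the reading in Celsius?",
    "chain_of_thought": [
        {"step": "Celsius is the inner scale.", "point": None},
        {"step": "The needle tip is here.", "point": (0.62, 0.41)},
        {"step": "It falls between the 50 and 60 labels.", "point": (0.58, 0.37)},
        {"step": "There are 5 ticks between labels, so each tick is 2 degrees.", "point": None},
        {"step": "The needle is on the 2nd tick past 50.", "point": (0.62, 0.41)},
    ],
    "answer": "54 Celsius",
}

def render_target(ex: dict) -> str:
    """Serialize the example into the text target the model is trained to emit."""
    lines = []
    for s in ex["chain_of_thought"]:
        point = s["point"]
        suffix = f" <point x={point[0]:.2f} y={point[1]:.2f}/>" if point else ""
        lines.append(s["step"] + suffix)
    lines.append(f"Answer: {ex['answer']}")
    return "\n".join(lines)

print(render_target(example))
```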
00:50:20.960 | We see pretty good results.
00:50:23.440 | This one's actually for clock reading.
00:50:24.800 | I couldn't find our chart for gauge reading
00:50:27.560 | at the last minute.
00:50:28.400 | So the light blue chart is with our grounded chain of thought.
00:50:33.400 | We built a clock-reading benchmark
00:50:40.320 | of about 500 images,
00:50:41.520 | and this measures accuracy on that.
00:50:44.240 | You can see it's a lot more sample efficient
00:50:47.400 | when you're using the chain of thought to help the model.
00:50:55.040 | Another big benefit from this approach
00:50:59.040 | is you can kind of understand how the model is doing it
00:51:02.800 | and how it's failing.
00:51:04.560 | So in this example,
00:51:05.880 | the actual correct reading is 54 Celsius,
00:51:08.480 | the model output 56.
00:51:10.440 | Not too bad, but you can actually go and see
00:51:13.720 | where it messed up.
00:51:15.920 | Like it got a lot of these right,
00:51:17.280 | except instead of saying it was on the seventh tick,
00:51:22.120 | it actually predicted that it was the eighth tick
00:51:24.600 | and that's why it went with 56.
00:51:26.360 | So now that you know that it's failing in this way,
00:51:30.960 | you can adjust how you're doing the chain of thought
00:51:32.760 | to maybe say like actually count out each tick from 40
00:51:35.480 | instead of just trying to say it's the eighth tick.
00:51:37.880 | Or you might say like, okay,
00:51:38.960 | I see that there's that middle thing.
00:51:40.320 | I'll count from there instead of all the way from 40.
00:51:43.160 | So it helps a ton.
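For instance, the revised chain-of-thought target might enumerate the ticks explicitly rather than jumping straight to the tick index. The exact wording below is only an illustrative guess at what that revised target could look like, based on the 54-versus-56 example above.

```python
# Illustrative revised chain-of-thought target that counts out each tick from the
# 40 label instead of asserting the tick index directly; wording is a guess.
revised_target = "\n".join([
    "Celsius is the inner scale.",
    "The needle is past the 40 label; each tick is 2 degrees.",
    "Counting ticks from 40: 42, 44, 46, 48, 50, 52, 54 - the needle tip lands on the 7th tick.",
    "Answer: 54 Celsius",
])
print(revised_target)
```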
00:51:46.080 | The other thing I'm excited about
00:51:47.040 | is few-shot prompting or test-time training with this.
00:51:50.480 | Like if a customer has a specific gauge
00:51:52.720 | that we're seeing minor errors on,
00:51:55.680 | they can give us a couple of examples
00:51:57.240 | where like if it's misdetecting the needle,
00:52:00.560 | they can go in and correct that in the chain of thought
00:52:02.160 | and hopefully that works the next time.
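Here is a rough sketch of how that correction loop could work in code: take the model's trace for a customer's gauge, patch the misdetected needle point with the human-verified one, and reuse the corrected examples as few-shot context (or as a tiny test-time fine-tuning set). The helpers and prompt format are assumptions; `render_target` refers to the earlier grounded-example sketch.

```python
# Sketch of the correction loop: patch the misdetected needle point in a model trace
# with a human-verified point, then reuse corrected traces as few-shot context.
# Helper names and the prompt format are illustrative assumptions.
from copy import deepcopy

def correct_needle_point(model_trace: dict, true_point: tuple[float, float]) -> dict:
    fixed = deepcopy(model_trace)
    for step in fixed["chain_of_thought"]:
        if "needle" in step["step"].lower():
            step["point"] = true_point   # human-verified location replaces the bad prediction
    return fixed

def build_fewshot_prompt(corrections: list[dict], new_question: str) -> str:
    # render_target is the serializer from the earlier grounded-example sketch
    shots = "\n\n".join(render_target(c) for c in corrections)
    return f"{shots}\n\nQuestion: {new_question}\nReasoning:"
```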
00:52:04.120 | Now, it's an exciting approach,
00:52:09.040 | but we've only applied it to clocks and gauges.
00:52:10.400 | The real question is, is it going to generalize?
00:52:13.320 | Probably. Like, there's some evidence from text models
00:52:15.760 | that when you train on a broad number of tasks,
00:52:17.400 | it does generalize,
00:52:18.240 | and I'm seeing some signs of that with our model as well.
00:52:21.720 | So in addition to the image-based chain of thought stuff,
00:52:25.680 | I also added some spelling-based chain of thought
00:52:29.160 | to help it better understand OCR, I guess.
00:52:33.600 | I don't understand why everyone doesn't do this by the way.
00:52:36.760 | Like it's a trivial benchmark question.
00:52:38.760 | It's very, very easy to nail.
00:52:40.880 | But I also wanted to support it for stuff
00:52:45.000 | like license plate partial matching,
00:52:46.640 | like hey, does any license plate in this image
00:52:49.280 | start with WHA or whatever?
00:52:50.880 | So yeah, that sort of worked.
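As an example of what that spelling-style chain of thought could look like for the partial-match case, here is a small sketch that spells each detected plate character by character before doing the prefix check. The output format is an assumption, not Moondream's actual behavior.

```python
# Illustrative spelling-based chain of thought for partial license-plate matching:
# spell out each detected plate character by character, then answer the prefix check.
def spelling_cot(detected_plates: list[str], prefix: str) -> str:
    lines = []
    for plate in detected_plates:
        lines.append(f"I see a plate reading {plate}, spelled {'-'.join(plate.upper())}.")
    match = any(p.upper().startswith(prefix.upper()) for p in detected_plates)
    lines.append(f"Does any plate start with {prefix.upper()}? {'Yes' if match else 'No'}.")
    return "\n".join(lines)

print(spelling_cot(["WHA4821", "KJX902"], "WHA"))
```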
00:52:54.120 | All right, that ends my story about the gauges.
00:53:00.840 | If you think about what's going on over here,
00:53:03.800 | it's interesting that like LLMs
00:53:05.880 | are showing enormous progress in reasoning,
00:53:10.880 | especially with the latest set of models that we've seen.
00:53:14.600 | But we're not really seeing that in vision.
00:53:17.000 | I have a feeling that VLMs are lagging behind
00:53:20.680 | as we can see with these tasks
00:53:23.440 | that should be very simple for a human to do
00:53:25.080 | that are very easy to find VLMs failing at.
00:53:29.560 | My hypothesis on why this is the case
00:53:31.280 | is because on the internet,
00:53:33.600 | there's a ton of data that talks about how to reason.
00:53:36.440 | There's books about how to solve problems.
00:53:38.760 | There's books critiquing the books
00:53:40.240 | about how to solve problems.
00:53:41.720 | But humans are just so good at perception
00:53:43.440 | that we never really talk about it.
00:53:45.640 | Like maybe in art books where it's like,
00:53:47.440 | hey, to show that that mountain is further away,
00:53:49.880 | you need to desaturate it a bit or whatever,
00:53:51.880 | but the actual data on how to like look at images
00:53:56.880 | isn't really present.
00:53:58.760 | Also, the data we have is kind of sketchy.
00:54:01.160 | The best source of data we have
00:54:02.280 | is like image alt-text pairs on the internet
00:54:04.520 | and that's pretty low quality.
00:54:06.040 | So yeah, I think our solution here is really just,
00:54:09.800 | we need to teach them how to operate on individual tasks
00:54:13.240 | and figure out how to scale that out.
00:54:15.640 | All right, yep.
00:54:19.480 | So conclusion, at Moondream we're trying
00:54:23.200 | to build amazing VLMs that run everywhere.
00:54:25.560 | Very hard problem, much work ahead,
00:54:27.640 | but we're making a ton of progress
00:54:29.240 | that I'm really excited about.
00:54:31.440 | If anyone wants to chat about more technical details
00:54:35.280 | about how we're doing this or interested in collaborating,
00:54:37.360 | please hit me up.
00:54:38.760 | - Yeah, like, when people say multi-modality,
00:54:48.800 | I always think about vision as the first among equals
00:54:52.000 | in all the modalities.
00:54:53.000 | So I really appreciate having the experts.
00:54:57.480 | - This is the year that vision language models
00:54:59.440 | became mainstream with every model from GPT-4o to o1
00:55:03.400 | to Claude 3 to Gemini 1 and 2 to Llama 3.2
00:55:08.000 | to Mistral's Pixtral to AI2's Pixmo going multi-modal.
00:55:13.000 | We asked Peter and Isaac to highlight the best work
00:55:15.680 | in computer vision for 2024.
00:55:18.320 | And they blew us away with the complete overview.
00:55:21.720 | As a special bonus, we also got a bonus talk
00:55:24.400 | from Vik Korrapati at Moondream
00:55:26.920 | who gave an incredible talk
00:55:28.240 | at this year's AI Engineer World's Fair
00:55:31.080 | on his tiny 0.5 billion parameter pruned
00:55:34.080 | vision language model that absolutely slaps.
00:55:37.400 | As always, don't forget to check the show notes
00:55:39.800 | for the YouTube link to their talk, as well as their slides.
00:55:43.320 | Watch out and take care.