
Best of 2024 in Vision [LS Live @ NeurIPS]



00:00:00.000 | (upbeat music)
00:00:02.580 | - Hi, we're Isaac and Peter from Roboflow.
00:00:08.720 | And we're gonna talk about the best papers
00:00:11.280 | of 2024 in computer vision.
00:00:13.520 | So for us, we define best as what made the biggest shifts
00:00:19.680 | in the space.
00:00:21.720 | And to determine that we looked at
00:00:23.840 | what are some major trends that happened
00:00:26.240 | and what papers most contributed to those trends.
00:00:29.160 | So I'm gonna talk about a couple of trends.
00:00:30.280 | Peter's gonna talk about a trend
00:00:31.340 | and then we're gonna hand it off to Moondream.
00:00:34.400 | So the trends that I'm interested in talking about
00:00:39.720 | are a major transition from models
00:00:42.420 | that run on per image basis
00:00:44.080 | to models that run using the same basic ideas on video.
00:00:48.720 | And then also how DETRs are starting to take over
00:00:51.760 | the real-time object detection scene
00:00:56.360 | from the YOLOs, which have been dominant for years.
00:00:58.960 | So as a highlight, we're gonna talk about Sora,
00:01:04.620 | which from my perspective is the biggest paper of 2024,
00:01:08.120 | even though it came out in February.
00:01:09.920 | Yeah, yeah.
00:01:13.460 | So Sora is just a blog post.
00:01:16.860 | So I'm going to fill it in with details
00:01:20.040 | from replication efforts, including Open-Sora
00:01:22.680 | and related work such as Stable Video Diffusion.
00:01:26.600 | And then we're also gonna talk about SAM2,
00:01:30.040 | which applies the SAM strategy to video.
00:01:32.880 | And then how DETRs are,
00:01:36.240 | the improvements in 2024 to DETRs
00:01:37.840 | that are making them a Pareto improvement
00:01:39.360 | over YOLO-based models.
00:01:41.120 | So to start this off,
00:01:44.360 | we're gonna talk about the state-of-the-art
00:01:46.960 | of video generation at the end of 2023.
00:01:50.040 | MagVIT is a
00:01:55.080 | discrete token video tokenizer akin to VQ-GAN,
00:01:58.960 | but applied to video sequences.
00:02:01.000 | And it actually outperforms state-of-the-art
00:02:05.760 | handcrafted video compression frameworks
00:02:08.840 | in terms of the bit rate versus human preference for quality.
00:02:13.840 | And videos generated by autoregressing
00:02:15.720 | on these discrete tokens
00:02:17.080 | generate some pretty nice stuff,
00:02:20.560 | but up to like five seconds length
00:02:22.000 | and you know, not super detailed.
00:02:23.480 | And then suddenly a few months later, we have this,
00:02:28.480 | which when I saw it, it was totally mind-blowing to me.
00:02:32.120 | 1080p, a whole minute long.
00:02:34.440 | We've got light reflecting in puddles.
00:02:36.000 | That reflectivity reminds me of those RTX demonstrations
00:02:41.000 | for next generation video games, such as Cyberpunk,
00:02:44.160 | but with better graphics.
00:02:46.760 | You can see some issues in the background
00:02:48.240 | if you look closely, but they're kind of,
00:02:50.320 | as with a lot of these models,
00:02:52.480 | the issues tend to be things
00:02:54.120 | that people aren't going to pay attention to
00:02:55.880 | unless they're looking for them.
00:02:57.040 | In the same way that, like, six fingers on a hand
00:02:59.640 | is a giveaway you're not going to notice
00:03:02.320 | unless you're looking for it.
00:03:03.760 | So yeah, as we said, Sora does not have a paper.
00:03:08.440 | So we're going to be filling it in with context
00:03:10.920 | from the rest of the computer vision scene
00:03:14.040 | attempting to replicate these efforts.
00:03:16.440 | So the first step: you have an LLM caption
00:03:21.800 | a huge amount of videos.
00:03:23.120 | This is a trick that they introduced in DALL-E 3,
00:03:28.520 | where they train an image captioning model
00:03:32.240 | to just generate very high quality captions
00:03:34.240 | for a huge corpus
00:03:35.360 | and then train a diffusion model on that.
00:03:39.760 | The Sora post and the replication efforts
00:03:42.240 | also show a bunch of other steps
00:03:44.040 | that are necessary for good video generation,
00:03:47.480 | including filtering by aesthetic score
00:03:50.360 | and filtering by making sure the videos have enough motion
00:03:53.320 | so that the generator isn't just, kind of,
00:03:55.960 | learning to generate static frames.
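
As a rough sketch of what that kind of data filtering could look like in practice: the frame-difference motion proxy, the idea of a separate aesthetic predictor, and the thresholds below are all assumptions for illustration, not details from the Sora post or Open-Sora.

```python
import numpy as np

def mean_frame_difference(frames: np.ndarray) -> float:
    """Crude motion proxy: average absolute pixel change between consecutive frames.

    frames: array of shape (T, H, W, C) with values in [0, 255].
    """
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    return float(diffs.mean())

def keep_clip(frames, aesthetic_score, min_aesthetic=5.0, min_motion=2.0):
    """Keep a clip only if it looks good enough and actually moves.

    aesthetic_score would come from some separate aesthetic predictor;
    the thresholds here are made up for illustration.
    """
    return aesthetic_score >= min_aesthetic and mean_frame_difference(frames) >= min_motion

# Example: a perfectly static 16-frame clip gets filtered out.
static_clip = np.zeros((16, 64, 64, 3), dtype=np.uint8)
print(keep_clip(static_clip, aesthetic_score=6.0))  # False: no motion
```
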
00:03:58.160 | So then we encode our video
00:04:04.040 | into a series of space-time latents.
00:04:06.600 | Once again, they were very sparse on details.
00:04:09.840 | So among the replication-related works,
00:04:13.680 | OpenSora actually uses a MagVIT V2 itself to do this,
00:04:17.320 | but swapping out the discretization step
00:04:21.520 | with a classic VAE autoencoder framework.
00:04:25.240 | They show that there's a lot of benefit
00:04:30.000 | from getting the temporal compression,
00:04:31.520 | which makes a lot of sense, as sequential frames
00:04:35.400 | in videos have mostly redundant information.
00:04:38.080 | So by compressing in the temporal space,
00:04:43.640 | you allow the latent to hold a lot more semantic information
00:04:47.240 | while avoiding that duplication.
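
Here is a toy illustration of that temporal compression; it is not MagVIT V2's actual architecture, just a small 3D conv stack showing how 16 frames can collapse into 4 latent time steps.

```python
import torch
import torch.nn as nn

# A toy spatio-temporal encoder: just an illustration of how a 3D conv stack
# compresses the time axis as well as the spatial axes.
encoder = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, stride=(2, 2, 2), padding=1),   # halve T, H, W
    nn.SiLU(),
    nn.Conv3d(64, 16, kernel_size=3, stride=(2, 2, 2), padding=1),  # halve again
)

video = torch.randn(1, 3, 16, 256, 256)   # (batch, channels, frames, height, width)
latents = encoder(video)
print(latents.shape)                       # torch.Size([1, 16, 4, 64, 64])
# 16 frames -> 4 latent "time steps": redundant neighboring frames share one latent.
```
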
00:04:49.800 | So we've got our space-time latents, possibly via
00:04:58.440 | some 3D VAE, presumably a MagVIT V2.
00:05:02.560 | And then you throw it into a diffusion transformer.
00:05:07.440 | So I think it's personally interesting to note
00:05:11.800 | that OpenSora is using a MagVIT V2,
00:05:14.960 | which originally used an autoregressive transformer decoder
00:05:18.680 | to model the latent space,
00:05:20.200 | but is now using a diffusion transformer.
00:05:25.200 | So it's still a transformer happening.
00:05:27.360 | Just the question is like,
00:05:28.200 | is it parameterizing the stochastic differential equation?
00:05:31.880 | Is it parameterizing a conditional distribution
00:05:34.480 | via autoregression?
00:05:35.680 | It's also worth noting that most diffusion models today,
00:05:44.520 | the very high performance ones are switching away
00:05:46.440 | from the classic like DDPM,
00:05:48.640 | denoising diffusion probabilistic modeling framework
00:05:51.240 | to rectified flows.
00:05:52.560 | Rectified flows have a very interesting property
00:05:56.080 | that as they converge,
00:05:58.520 | they actually get closer to being able to be sampled
00:06:01.480 | with a single step,
00:06:02.880 | which means that in practice,
00:06:05.480 | you can actually generate high quality samples much faster.
00:06:08.440 | Major problem of DDPM and related models
00:06:13.640 | for the past four years is just that
00:06:15.920 | they require many, many steps
00:06:18.000 | to generate high quality samples.
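
As a generic sketch of why rectified flows sample in few steps (not code from Sora or any specific replication): the training target is the straight-line velocity between noise and data, and sampling is just Euler integration along it. The `model(x, t)` callable is a stand-in for whatever network you use.

```python
import torch

def rectified_flow_loss(model, x0):
    """One training step of a rectified flow: a minimal sketch.

    x0: a batch of clean samples. The model predicts the velocity that moves
    noise toward data along the straight-line path x_t = (1 - t) * noise + t * x0.
    """
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)), device=x0.device)
    x_t = (1 - t) * noise + t * x0
    target_velocity = x0 - noise                     # constant along the straight path
    return ((model(x_t, t) - target_velocity) ** 2).mean()

@torch.no_grad()
def sample(model, shape, steps=4, device="cpu"):
    """Euler integration from noise to data; as the flow straightens, fewer steps suffice."""
    x = torch.randn(shape, device=device)
    for i in range(steps):
        t = torch.full((shape[0], *([1] * (len(shape) - 1))), i / steps, device=device)
        x = x + model(x, t) / steps                  # follow the predicted velocity
    return x
```
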
00:06:20.040 | So, and naturally the third step
00:06:23.760 | is throwing lots of compute at the problem.
00:06:26.080 | So I never figured out how to manage
00:06:30.520 | to get this video to loop,
00:06:31.960 | but we see very little compute,
00:06:36.080 | medium compute, lots of compute.
00:06:39.120 | This is so interesting
00:06:40.000 | because the original diffusion transformer paper
00:06:42.480 | from Facebook actually showed that,
00:06:45.000 | in fact, the specific hyperparameters of the transformer
00:06:47.400 | didn't really matter that much.
00:06:49.160 | What mattered was that you were just increasing
00:06:51.760 | the amount of compute that the model had.
00:06:54.480 | So I love how in the, once again, little blog post,
00:06:59.360 | they don't even talk about
00:07:00.200 | like the specific hyperparameters.
00:07:01.160 | They say, we're using a diffusion transformer
00:07:03.320 | and we're just throwing more compute at it
00:07:04.520 | and this is what happens.
00:07:05.760 | OpenSora shows similar results.
00:07:10.520 | The primary issue I think here is that
00:07:13.920 | no one else has 32X compute budget.
00:07:17.280 | So we end up with these,
00:07:18.640 | we end up in the middle of the domain
00:07:22.400 | in most of the related work,
00:07:24.920 | which is still super, super cool.
00:07:27.400 | It's just a little disappointing considering the context.
00:07:30.560 | So I think this is a beautiful extension
00:07:34.640 | of the framework that was introduced in '22 and '23
00:07:40.320 | for these very high quality per image generation
00:07:43.280 | and then extending that to videos.
00:07:45.000 | It's awesome.
00:07:47.600 | And it's GA as of Monday,
00:07:49.360 | except no one can seem to get access to it
00:07:51.200 | because they keep shutting down the login.
00:07:53.640 | The next, so next paper I wanted to talk about is SAM.
00:07:59.320 | So we at Roboflow allow users to label data
00:08:03.200 | and train models on that data.
00:08:04.680 | SAM for us has saved our users 75 years of labeling time.
00:08:10.000 | We are the, to the best of my knowledge,
00:08:11.760 | the largest SAM API that exists.
00:08:16.320 | We also, SAM also allows us to have our users
00:08:19.320 | train just pure bounding box regression models
00:08:22.680 | and use those to generate high quality masks,
00:08:25.600 | which has the great side effect
00:08:29.680 | of requiring less training data
00:08:31.400 | to have a meaningful convergence.
00:08:33.160 | So most people are data limited in the real world.
00:08:35.720 | So anything that requires less data
00:08:37.120 | to get to a useful thing is super useful.
00:08:40.360 | Most of our users actually run their object,
00:08:44.920 | per frame object detectors on every frame in a video,
00:08:47.800 | or maybe not most, but many, many.
00:08:49.600 | And so SAM2 falls into this category
00:08:55.480 | of taking
00:08:57.280 | something that really, really works
00:08:59.080 | and applying it to a video,
00:09:01.880 | which has the wonderful benefit of being plug and play
00:09:05.000 | with most of our, many of our users use cases.
00:09:08.920 | We're still building out a sufficiently mature pipeline
00:09:12.800 | to take advantage of that, but it's in the works.
00:09:15.800 | So here we've got a great example.
00:09:20.040 | We can click on cells and then follow them.
00:09:23.520 | You even notice the cell goes away and comes back
00:09:25.560 | and we can still keep track of it,
00:09:28.120 | which is very challenging for existing object trackers.
00:09:36.920 | High-level overview of how SAM2 works.
00:09:39.440 | There's a simple pipeline here where we can
00:09:48.760 | provide some type of prompt and it fills out
00:09:51.440 | the rest of the likely masks for that object
00:09:55.240 | throughout the rest of the video.
00:09:56.440 | So here we're giving a bounding box in the first frame,
00:09:59.120 | a set of positive negative points,
00:10:00.720 | or even just a simple mask.
00:10:04.680 | I'm going to assume people are somewhat familiar with SAM.
00:10:09.680 | So I'm going to just give a high-level overview
00:10:11.720 | of how SAM works.
00:10:13.720 | You have an image encoder that runs on every frame.
00:10:16.800 | SAM2 can be used on a single image,
00:10:20.760 | in which case the only difference between SAM2 and SAM
00:10:23.400 | is the image encoder: SAM used a standard ViT.
00:10:31.360 | SAM2 replaced that with a Hiera hierarchical encoder,
00:10:36.360 | which gets approximately the same results,
00:10:39.240 | but leads to six times faster inference,
00:10:42.280 | which is excellent, especially considering
00:10:44.560 | how a trend of 2023 was replacing the ViT
00:10:48.960 | with more efficient backbones.
00:10:50.760 | In the case where you're doing video segmentation,
00:10:56.080 | the difference is that you actually create a memory bank
00:10:58.920 | and you cross attend the features from the image encoder
00:11:02.800 | based on the memory bank.
00:11:04.560 | So the feature set that is created is essentially,
00:11:09.560 | well, I'll go more into it in a couple of slides,
00:11:14.500 | but we take the features from the past couple frames
00:11:19.320 | plus a set of object pointers and the set of prompts
00:11:24.520 | and use that to generate our new masks.
00:11:28.920 | We then fuse the new masks for this frame
00:11:30.980 | with the image features and add that to the memory bank.
00:11:35.980 | It's, well, I'll say more in a minute.
00:11:39.720 | Just like SAM, SAM2 actually uses a data engine
00:11:44.440 | to create its data set, in that
00:11:47.320 | they assembled a huge amount of reference data,
00:11:50.020 | used people to label some of it and train the model,
00:11:54.500 | used the model to label more of it
00:11:57.340 | and asked people to refine the predictions of the model.
00:11:59.780 | And then ultimately the data set is just created
00:12:02.660 | from the final output of the model on the reference data.
00:12:06.960 | It's very interesting.
00:12:08.740 | This paradigm is so interesting to me
00:12:09.980 | because it unifies a model and a data set
00:12:14.100 | in a way that is very unique.
00:12:16.920 | It seems unlikely that another model could come in
00:12:19.340 | and have such a tight relationship with the training set.
00:12:22.340 | Yeah, so brief overview of how the memory bank works.
00:12:30.460 | The paper did not have a great visual,
00:12:33.740 | so I'm just, I'm going to fill in a bit more.
00:12:35.940 | So we take the last couple of frames from our video
00:12:49.700 | and attend to them,
00:12:54.700 | along with the set of prompts that we provided,
00:12:54.700 | they could come from the future,
00:12:56.420 | they could come from anywhere in the video,
00:12:58.180 | as well as reference object pointers saying,
00:13:01.500 | by the way, here's what we've found so far.
00:13:04.020 | Attending to the last few frames
00:13:05.980 | has the interesting benefit of allowing it
00:13:08.780 | to model complex object motion, and,
00:13:17.220 | by limiting the number of frames that you attend to,
00:13:19.940 | you manage to keep the model running in real time.
00:13:22.460 | This is such an interesting topic for me
00:13:24.600 | because one would assume that attending
00:13:27.540 | to all of the frames is super essential
00:13:30.140 | or having some type of summarization
00:13:31.380 | of all the frames is super essential for high performance,
00:13:35.060 | but we see in their later ablation
00:13:37.300 | that that actually is not the case.
00:13:39.060 | So here, just to make sure
00:13:43.200 | that there is some benchmarking happening,
00:13:45.060 | we just compared to some of the stuff
00:13:46.700 | that came out prior,
00:13:49.700 | and indeed the SAM2 strategy does improve
00:13:52.380 | on the state of the art.
00:13:53.740 | This ablation deep in their appendices
00:13:59.620 | was super interesting to me.
00:14:01.040 | We see in section C, the number of memories.
00:14:05.660 | One would assume that increasing the count of memories
00:14:08.820 | would meaningfully increase performance.
00:14:11.220 | And we see that it has some impact,
00:14:12.660 | but not the type that you'd expect.
00:14:15.700 | And that it meaningfully decreases speed,
00:14:17.660 | which justifies in my mind,
00:14:19.380 | just having this FIFO queue of memories.
00:14:22.320 | Although in the future,
00:14:25.620 | I'm super interested to see a more dedicated summarization
00:14:30.340 | of all of the past video,
00:14:31.880 | not just a stacking of the last frames.
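
A pseudocode-level sketch of that per-frame loop might look like the following; the module names are stand-ins for SAM2's real components, and the FIFO size is just an example.

```python
from collections import deque

def track_video(frames, prompts, image_encoder, memory_attention, mask_decoder,
                memory_encoder, num_memories=7):
    """Rough sketch of a SAM2-style video segmentation loop, not the actual code.

    The callables (image_encoder, memory_attention, mask_decoder, memory_encoder)
    are stand-ins for the real modules.
    """
    memory_bank = deque(maxlen=num_memories)   # recent frame memories, oldest dropped first
    object_pointers = []                       # lightweight tokens summarizing the object so far
    masks = []
    for frame in frames:
        features = image_encoder(frame)
        # Condition current features on recent memories, pointers, and user prompts.
        conditioned = memory_attention(features, list(memory_bank), object_pointers, prompts)
        mask, pointer = mask_decoder(conditioned)
        # Fuse this frame's prediction with its features and push into the bank.
        memory_bank.append(memory_encoder(features, mask))
        object_pointers.append(pointer)
        masks.append(mask)
    return masks
```
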
00:14:35.340 | So that's another extension of beautiful per-frame work
00:14:44.180 | into the video domain.
00:14:47.580 | The next trend I'm interested in talking about
00:14:49.320 | is this: at Roboflow,
00:14:53.460 | we're super interested in training
00:14:54.660 | real-time object detectors.
00:14:56.180 | Those are our bread and butter.
00:14:57.460 | And so we're doing a lot to keep track
00:14:58.860 | of what is actually happening in that space.
00:15:01.820 | We are finally starting to see something change.
00:15:07.160 | So for years, YOLOs have been the dominant way
00:15:10.940 | of doing real-time object detection.
00:15:12.980 | And we can see here that they've essentially stagnated.
00:15:16.320 | The performance between YOLOv10 and v11
00:15:18.500 | is not meaningfully different,
00:15:21.260 | at least in this type of high-level chart.
00:15:25.340 | And even from the last couple of series,
00:15:26.860 | there's not a major change.
00:15:28.900 | So YOLOs have hit a plateau.
00:15:32.620 | DETRs have not.
00:15:35.940 | So we can look here and see the YOLO series
00:15:40.320 | has this plateau, and then these RT-DETR,
00:15:43.860 | LW-DETR, and D-FINE have meaningfully changed that plateau
00:15:47.540 | so that, in fact, the best D-FINE models
00:15:50.040 | are plus 4.6 AP on COCO at the same latency.
00:15:54.100 | So three major steps to accomplish this.
00:15:59.680 | The first is RT-DETR, which is technically
00:16:01.900 | a 2023 paper preprint, but published officially in '24,
00:16:06.000 | so I'm going to include that.
00:16:07.440 | I hope that's okay.
00:16:09.820 | RT-DETR showed that
00:16:12.300 | we could actually match or out-speed YOLOs.
00:16:14.600 | And then LW-DETR showed that pre-training
00:16:18.540 | is hugely effective on DETRs, and much less so on YOLOs.
00:16:22.260 | And then D-FINE added the types of bells and whistles
00:16:24.060 | that we expect in this arena.
00:16:28.260 | So the major improvement that RT-DETR shows
00:16:33.540 | was taking the multiscale features
00:16:37.400 | that DETRs typically pass into their encoder
00:16:39.980 | and decoupling them
00:16:41.060 | into a much more efficient transformer encoder.
00:16:44.400 | The transformer is, of course, quadratic complexity,
00:16:48.560 | so decreasing the amount of stuff that you pass in at once
00:16:52.100 | is super helpful for increasing your runtime,
00:16:55.780 | or increasing your throughput.
00:16:57.920 | So that change basically brought us up to YOLO speed,
00:17:01.980 | and then they do a hardcore analysis
00:17:04.180 | on benchmarking YOLOs, including the NMS step.
00:17:09.180 | Once you include the NMS in the latency calculation,
00:17:14.600 | you see that, in fact, these DETRs are outperforming,
00:17:18.600 | at least at this time, the YOLOs that existed.
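
To make the "include NMS in the latency" point concrete, here is a small sketch of timing the post-processing a YOLO-style head still needs after the forward pass; the thresholds and the 8400-box example are illustrative, not RT-DETR's actual benchmark harness.

```python
import time
import torch
from torchvision.ops import nms

def timed_postprocess(raw_boxes, raw_scores, score_thresh=0.25, iou_thresh=0.7):
    """Time only the post-processing a YOLO-style model still needs.

    A DETR-style model predicts a fixed set of boxes directly, so this step
    disappears from its latency budget. Thresholds are typical defaults.
    """
    start = time.perf_counter()
    keep = raw_scores > score_thresh
    boxes, scores = raw_boxes[keep], raw_scores[keep]
    kept = nms(boxes, scores, iou_thresh)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return boxes[kept], scores[kept], elapsed_ms

# Example with fake predictions: 8400 candidate boxes, as a YOLO head might emit.
raw_boxes = torch.rand(8400, 4) * 640
raw_boxes[:, 2:] += raw_boxes[:, :2]          # ensure x2 > x1 and y2 > y1
raw_scores = torch.rand(8400)
_, _, ms = timed_postprocess(raw_boxes, raw_scores)
print(f"NMS post-processing took {ms:.2f} ms")
```
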
00:17:22.760 | Then LW-DETR goes in and suggests that,
00:17:26.660 | in fact, the huge boost here is from pre-training.
00:17:32.000 | So this is the D-FINE line,
00:17:35.200 | and this is the D-FINE line without pre-training.
00:17:37.240 | It's within range, it's still an improvement
00:17:39.320 | over the YOLOs, but the really huge boost
00:17:42.200 | comes from the benefit of pre-training.
00:17:44.160 | When YOLOX came out in 2021,
00:17:48.240 | they showed that they got much better results
00:17:51.080 | by having a much, much longer training time,
00:17:54.040 | but they found that when they did that,
00:17:57.240 | they actually did not benefit from pre-training.
00:18:00.160 | So you see in this graph from LW-DETR,
00:18:04.040 | in fact, YOLOs do have a real benefit from pre-training,
00:18:07.240 | but it goes away as we increase the training time.
00:18:10.460 | Then the DETRs converge much faster.
00:18:13.240 | LW-DETR trains for only 50 epochs,
00:18:15.120 | RT-DETR, 60 epochs.
00:18:17.460 | So one could assume that, in fact,
00:18:19.460 | the entire extra gain from pre-training
00:18:22.960 | is that you're not destroying your original
00:18:25.700 | pre-trained weights
00:18:27.460 | by relying on
00:18:29.460 | this long training cycle.
00:18:31.460 | And then LW-DETR also shows superior performance
00:18:37.660 | on our favorite data set, Roboflow 100,
00:18:41.040 | which means that they do better on the real world,
00:18:42.840 | not just on COCO.
00:18:44.120 | Then D-FINE throws all the bells and whistles at it.
00:18:49.500 | YOLO models tend to have a lot of very specific,
00:18:53.880 | complicated loss functions.
00:18:56.500 | D-FINE brings that into the DETR world
00:18:59.840 | and shows consistent improvement
00:19:00.920 | on a variety of DETR-based frameworks.
00:19:03.080 | So bring these all together,
00:19:07.200 | and we see that suddenly we have almost 60 AP on COCO
00:19:11.120 | while running in like 10 milliseconds.
00:19:13.280 | Huge, huge stuff.
00:19:14.620 | So we're spending a lot of time trying to build models
00:19:19.960 | that work better with less data,
00:19:21.880 | and DETRs are clearly becoming a promising step
00:19:24.800 | in that direction.
00:19:26.700 | What we're interested in seeing from the DETRs
00:19:30.660 | in this trend next is: Co-DETR
00:19:33.280 | and the models that are currently sitting
00:19:35.360 | at the top of the leaderboard for large-scale inference
00:19:40.360 | scale really well as you switch out the backbone.
00:19:44.620 | We're very interested in seeing
00:19:46.400 | and having people publish a paper, potentially us,
00:19:49.620 | on what happens if you take these real-time ones
00:19:52.040 | and then throw a Swin-G at it.
00:19:53.360 | Like, do we have a Pareto curve that extends
00:19:56.040 | from the real-time domain all the way up
00:19:57.780 | to the super, super slow but high-performance domain?
00:20:02.580 | We also wanna see people benchmarking on RF100 more
00:20:05.860 | because that type of data is what's relevant
00:20:08.460 | for most users.
00:20:09.620 | And we wanna see more pre-training
00:20:13.200 | because pre-training works now.
00:20:15.080 | It's super cool.
00:20:15.960 | - All right, so, yeah.
00:20:21.660 | So, in that theme, one of the big things
00:20:24.420 | that we're focusing on is how do we get more
00:20:26.620 | out of our pre-trained models?
00:20:28.380 | And one of the lenses to look at this is through
00:20:31.880 | sort of this new requirement for, like,
00:20:34.620 | fine-grained visual details in the representations
00:20:37.880 | that are extracted from your foundation model.
00:20:40.700 | So, it's sort of a hook for this.
00:20:42.460 | Oh, yeah, this is just a list of all the papers
00:20:45.840 | that I'm gonna mention.
00:20:46.700 | I just wanted to make sure I cite the actual papers
00:20:48.380 | so you can find them later.
00:20:50.660 | Yeah, so, sort of the big hook here is that
00:20:54.080 | I make the claim that LLMs can't see.
00:20:56.460 | If you go to Claude or ChatGPT,
00:21:00.920 | you ask it to look at this watch
00:21:04.840 | and tell you what time it is, it fails, right?
00:21:07.120 | And so, you could say, like, maybe the,
00:21:11.800 | like, this is, like, a very classic test of an LLM,
00:21:14.840 | but you could say, okay, maybe this image
00:21:16.540 | is, like, too zoomed out and it just, like,
00:21:19.500 | it'll do better if we increase the resolution
00:21:21.580 | and it has easier time finding these fine-grained features,
00:21:24.540 | like where the watch hands are pointing.
00:21:26.340 | No dice.
00:21:27.160 | And you could say, okay, well, maybe the model
00:21:29.200 | just doesn't know how to tell time
00:21:30.700 | from knowing the position of the hands,
00:21:32.660 | but if you actually prompt it textually,
00:21:34.200 | it's very easy for it to tell the time.
00:21:35.700 | So, this, to me, is proof that these LLMs
00:21:38.540 | literally cannot see the position of the watch hands
00:21:40.840 | and it can't see those details.
00:21:41.960 | So, the question is, sort of, why?
00:21:43.620 | And for you anthropic heads out there, Claude fails, too.
00:21:48.880 | So, my first pick for Best Paper of 2024 Envision
00:21:53.880 | is this MMVP paper, which tries to investigate
00:21:57.260 | why do LLMs not have the ability to see fine-grained details?
00:22:00.880 | And so, for instance, it comes up
00:22:03.040 | with a lot of images like this, where you ask it a question
00:22:05.760 | that seems very visually apparent to us,
00:22:07.260 | like, which way is the school bus facing?
00:22:08.620 | And it gets it wrong.
00:22:09.460 | And then, of course, it makes up details
00:22:11.040 | to support its wrong claim.
00:22:12.460 | And so, the process by which it finds these images
00:22:16.540 | is, sort of, contained in its hypothesis
00:22:18.920 | for why it can't see these details.
00:22:21.240 | So, it hypothesizes that models
00:22:24.920 | that have been initialized with CLIP
00:22:26.960 | as their vision encoder,
00:22:28.540 | they don't have fine-grained details
00:22:31.080 | in the features extracted using CLIP,
00:22:33.080 | because CLIP, sort of, doesn't need to find
00:22:36.920 | these fine-grained details to do its job correctly,
00:22:38.840 | which is just to match captions to images, right?
00:22:42.340 | And, sort of, at a high level,
00:22:44.580 | even if ChatGPT wasn't initialized with CLIP
00:22:46.800 | and wasn't trained contrastively,
00:22:49.460 | the vision encoder wasn't trained contrastively at all,
00:22:52.220 | still, in order to do its job of capturing the image,
00:22:55.340 | it could do a pretty good job
00:22:56.800 | without actually finding the exact position
00:22:58.800 | of all the objects and visual features in the image, right?
00:23:02.040 | So, this paper finds a set of difficult images
00:23:05.920 | for these types of models.
00:23:07.620 | And the way it does it is it looks for embeddings
00:23:10.000 | that are similar in CLIP space, but far in DINOv2 space.
00:23:13.300 | So, DINOv2 is a foundation model
00:23:15.300 | that was trained self-supervised purely on image data,
00:23:20.000 | and it, kind of, uses, like,
00:23:21.420 | some complex student-teacher framework,
00:23:23.960 | but, essentially, it patches out, like,
00:23:26.220 | certain areas of the image
00:23:28.380 | or, like, crops at certain areas of the image
00:23:29.960 | and tries to make sure
00:23:30.840 | that those have consistent representations,
00:23:32.420 | which is a way for it to learn
00:23:33.960 | very fine-grained visual features.
00:23:36.600 | And so, if you take things that are very close in CLIP space
00:23:39.300 | and very far in DINOv2 space,
00:23:41.300 | you get a set of images that basically are pairs of images
00:23:45.840 | that are hard for ChatGPT
00:23:47.300 | and other big language models to distinguish.
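
A minimal sketch of that pair-mining idea, assuming you already have matching CLIP and DINOv2 embeddings for a set of images; the similarity thresholds are illustrative, not the MMVP paper's exact cutoffs.

```python
import numpy as np

def find_clip_blind_pairs(clip_emb, dino_emb, clip_sim_min=0.95, dino_sim_max=0.6):
    """Select image pairs that CLIP thinks are near-identical but DINOv2 separates.

    clip_emb, dino_emb: (N, D) arrays of precomputed, matching-order embeddings.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    clip_sim = normalize(clip_emb) @ normalize(clip_emb).T   # cosine similarity matrices
    dino_sim = normalize(dino_emb) @ normalize(dino_emb).T
    pairs = []
    n = clip_emb.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if clip_sim[i, j] > clip_sim_min and dino_sim[i, j] < dino_sim_max:
                pairs.append((i, j))
    return pairs

# Usage: embeddings would come from a CLIP image encoder and DINOv2, respectively.
pairs = find_clip_blind_pairs(np.random.randn(100, 512), np.random.randn(100, 768))
```
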
00:23:49.720 | So, if you then ask it questions about this image,
00:23:52.600 | well, as you can see from this chart,
00:23:54.880 | it's going to answer the same way for both images, right?
00:23:58.600 | Because, from the perspective of the vision encoder,
00:24:01.640 | they're the same image.
00:24:03.000 | And so, if you ask a question, like,
00:24:03.960 | "How many eyes does this animal have?"
00:24:05.540 | It answers the same for both.
00:24:09.880 | And, like, all these other models, including LLaVA,
00:24:09.880 | do the same thing, right?
00:24:11.920 | And so, this is the benchmark that they create,
00:24:14.080 | which is, like, finding, like, CLIP-blind pairs,
00:24:17.760 | which is pairs of images that are similar in CLIP space,
00:24:19.680 | and creating a data set of multiple-choice questions
00:24:23.220 | based off of those.
00:24:24.760 | And so, how do these models do?
00:24:26.880 | Well, really bad.
00:24:29.080 | LLaVA, I think...
00:24:30.500 | So, ChatGPT and Gemini do a little bit better
00:24:33.420 | than random guessing,
00:24:34.340 | but, like, at half of the performance of humans,
00:24:36.220 | who find these problems to be very easy.
00:24:39.040 | LLaVA is, interestingly,
00:24:41.300 | extremely negatively correlated with this data set.
00:24:44.720 | It does much, much, much, much worse than random guessing,
00:24:47.640 | which means that this process has done a very good job
00:24:50.600 | of identifying hard images for LLaVA, specifically.
00:24:54.680 | And that's because LLaVA is basically
00:24:57.040 | not trained for very long and is initialized from CLIP.
00:24:59.380 | And so, you would expect it to do poorly on this data set.
00:25:03.160 | So, one of the proposed solutions that this paper attempts
00:25:08.040 | is by basically saying,
00:25:09.300 | "Okay, well, if Clip features aren't enough,
00:25:10.920 | "what if we train the visual encoder
00:25:12.800 | "of the language model also on Dyno features?"
00:25:15.040 | And so, it proposes two different ways of doing this.
00:25:19.080 | One, additively, which is basically interpolating
00:25:22.540 | between the two features.
00:25:23.800 | And then, one is interleaving,
00:25:25.640 | which is just kind of like training one
00:25:27.260 | on the combination of both features.
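
In code, the two mixing schemes are roughly the following; this assumes both feature sets have already been projected to the same shape, which glosses over the adapters the paper actually trains.

```python
import torch

def additive_mix(clip_tokens, dino_tokens, alpha):
    """Interpolate the two feature sets: alpha=0 is pure CLIP, alpha=1 is pure DINOv2.

    Assumes both have already been projected to the same (batch, tokens, dim) shape.
    """
    return (1 - alpha) * clip_tokens + alpha * dino_tokens

def interleaved_mix(clip_tokens, dino_tokens):
    """Feed both token sets to the language model, doubling the visual token count."""
    return torch.cat([clip_tokens, dino_tokens], dim=1)

clip_tokens = torch.randn(1, 256, 4096)
dino_tokens = torch.randn(1, 256, 4096)
print(additive_mix(clip_tokens, dino_tokens, alpha=0.25).shape)  # (1, 256, 4096)
print(interleaved_mix(clip_tokens, dino_tokens).shape)           # (1, 512, 4096)
```
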
00:25:30.180 | So, there's this really interesting trend
00:25:32.000 | when you do the additive mixture of features.
00:25:34.720 | So, zero is all CLIP features
00:25:38.480 | and one is all DINOv2 features.
00:25:40.900 | So, I think it's helpful
00:25:44.720 | to look at the rightmost chart first,
00:25:46.380 | which is, as you increase the number of DINOv2 features,
00:25:48.960 | your model does worse and worse and worse
00:25:50.600 | on the actual language modeling task.
00:25:52.560 | And that's because DINOv2 features
00:25:54.160 | were trained completely in a self-supervised manner
00:25:57.280 | and completely in image space.
00:25:58.600 | It knows nothing about text.
00:25:59.700 | These features aren't really compatible
00:26:01.520 | with these text models.
00:26:03.000 | And so, you can train an adapter all you want,
00:26:05.280 | but it seems that it's in such an alien language
00:26:07.420 | that it's like a very hard optimization
00:26:09.080 | for these models to solve.
00:26:11.560 | And so, that kind of supports what's happening on the left,
00:26:14.880 | which is that, yeah, it gets better
00:26:16.680 | at answering these questions
00:26:19.640 | as you include more DINOv2 features, up to a point,
00:26:23.140 | but then when you oversaturate,
00:26:24.800 | it completely loses its ability to answer language
00:26:28.860 | and do language tasks.
00:26:31.640 | So, you can also see with the interleaving,
00:26:35.520 | they essentially double the number of tokens
00:26:38.080 | that are going into these models and just train on both.
00:26:41.640 | And it still doesn't really solve the MMVP task.
00:26:43.960 | It gets LLaVA-1.5 above random guessing by a little bit,
00:26:47.560 | but it's still not close to ChatGPT
00:26:50.600 | or any human performance, obviously.
00:26:54.200 | So, clearly, this proposed solution
00:26:56.540 | of just using DINOv2 features directly isn't gonna work.
00:27:00.000 | And basically what that means is that
00:27:01.920 | as a vision foundation model,
00:27:06.040 | DINOv2 is gonna be insufficient for language tasks, right?
00:27:09.840 | So, my next pick for best paper of 2024
00:27:13.640 | would be Florence 2, which tries to solve this problem
00:27:16.000 | by incorporating not only this dimension
00:27:19.280 | of spatial hierarchy,
00:27:20.420 | which is to say pixel level understanding,
00:27:23.320 | but also in making sure to include
00:27:25.300 | what they call semantic granularity,
00:27:27.000 | which ends up, the goal is basically to have features
00:27:30.720 | that are sufficient for finding objects in the image.
00:27:34.000 | So, they have enough pixel information,
00:27:37.520 | but also can be talked about and can be reasoned about.
00:27:40.520 | And that's on the semantic granularity axis.
00:27:44.880 | So, here's an example of basically three different
00:27:49.520 | paradigms of labeling that they do.
00:27:51.680 | So, they create a big data set.
00:27:54.160 | One is text, which is just captioning.
00:27:56.800 | And you would expect a model
00:27:57.920 | that's trained only on captioning
00:27:59.120 | to have similar performance to ChatGPT
00:28:01.000 | and not have spatial hierarchy,
00:28:03.920 | not have features that are meaningful at the pixel level.
00:28:07.560 | And so, they add another type, which is region text pairs,
00:28:11.080 | which is essentially either classifying a region
00:28:14.080 | or doing object detection
00:28:19.080 | or doing instance segmentation on that region
00:28:22.080 | or captioning that region.
00:28:23.640 | And then they have text phrase region annotations,
00:28:26.240 | which is essentially a triple.
00:28:28.560 | And basically, not only do you have a region
00:28:31.040 | that you've described,
00:28:32.160 | you also find its place in a descriptive paragraph
00:28:36.720 | about the image,
00:28:37.560 | which is basically trying to introduce
00:28:39.760 | even more semantic understanding of these regions.
00:28:42.240 | And so, for instance,
00:28:43.640 | if you're saying a woman riding on the road,
00:28:46.040 | you have to know what a woman is and what the road is
00:28:48.120 | and that she's on top of it.
00:28:49.120 | And that's basically composing a bunch of objects
00:28:52.040 | in this visual space,
00:28:53.120 | but also thinking about it semantically.
00:28:55.240 | Right?
00:28:56.280 | And so, the way that they do this is they take...
00:28:59.400 | Basically, they just dump features from a vision encoder
00:29:04.400 | straight into an encoder-decoder transformer.
00:29:08.440 | And then they train a bunch of different tasks
00:29:12.720 | like object detection and so on as a language task.
00:29:16.240 | And I think that's one of the big things
00:29:17.520 | that we saw in 2024
00:29:19.760 | is these vision language models
00:29:23.480 | operating on pixel space linguistically.
00:29:26.880 | So, they introduce a bunch of new tokens
00:29:28.360 | to point to locations in pixel space.
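
The general flavor of those location tokens is to quantize coordinates into a fixed vocabulary of bins, something like the sketch below; the bin count, token spelling, and coordinate order here are illustrative, since Florence 2 and PaliGemma each have their own exact format.

```python
def box_to_location_tokens(x1, y1, x2, y2, image_w, image_h, num_bins=1000):
    """Quantize box coordinates into discrete <loc_...> tokens (illustrative format)."""
    def bin_of(value, size):
        return min(int(value / size * num_bins), num_bins - 1)

    return [
        f"<loc_{bin_of(x1, image_w)}>",
        f"<loc_{bin_of(y1, image_h)}>",
        f"<loc_{bin_of(x2, image_w)}>",
        f"<loc_{bin_of(y2, image_h)}>",
    ]

# A box in a 640x480 image becomes four extra vocabulary tokens the decoder can emit.
print(box_to_location_tokens(32, 48, 320, 240, image_w=640, image_h=480))
# ['<loc_50>', '<loc_100>', '<loc_500>', '<loc_500>']
```
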
00:29:33.080 | So, how does it work?
00:29:35.520 | How does it actually do?
00:29:37.280 | We can see, if you look at the graph on the right,
00:29:40.200 | which is using the DINO framework,
00:29:44.560 | your pre-trained Florence 2 models transfer very, very well.
00:29:50.400 | They get 60 mAP on COCO,
00:29:53.000 | which is like approaching state-of-the-art.
00:29:54.960 | And they train with...
00:29:55.800 | - Recording in progress.
00:29:57.520 | - You're good.
00:29:58.440 | And they train much more efficiently.
00:30:02.960 | So, they converge a lot faster,
00:30:04.360 | which both of these things are pointing to the fact
00:30:06.720 | that they're actually leveraging
00:30:08.320 | their pre-trained weights effectively.
00:30:10.240 | So, where is it falling short?
00:30:14.200 | So, these models, I forgot to mention,
00:30:16.520 | Florence 2 comes in a 0.2 billion
00:30:18.040 | and a 0.7 billion parameter count.
00:30:20.360 | So, they're very, very small
00:30:21.600 | in terms of being a language model.
00:30:24.240 | And I think that this framework, you can see saturation.
00:30:27.760 | So, what this graph is showing is that
00:30:30.280 | if you train a Florence 2 model
00:30:32.440 | purely on the image level and region level annotations
00:30:35.320 | and not including the pixel level annotations,
00:30:38.040 | like segmentation,
00:30:40.240 | it actually performs better as an object detector.
00:30:43.960 | And what that means is that
00:30:45.640 | it's not able to actually learn all the visual tasks
00:30:48.400 | that it's trying to learn
00:30:49.480 | because it doesn't have enough capacity.
00:30:51.160 | So, I'd like to see this paper explore larger model sizes,
00:30:54.440 | which brings us to our next big paper of 2024,
00:30:58.880 | or two papers.
00:31:00.200 | So, PaliGemma came out earlier this year.
00:31:02.160 | PaliGemma 2 was released, I think, like a week or two ago.
00:31:05.040 | Oh, I forgot to mention, you can actually train
00:31:08.400 | like, label text data sets on Roboflow
00:31:10.720 | and you can train a Florence 2 model
00:31:12.240 | and you can actually train a PaliGemma 2 model on Roboflow,
00:31:15.640 | which we got into the platform
00:31:16.840 | within like 14 hours of release,
00:31:18.120 | which I was really excited about.
00:31:19.800 | So, anyway, so PaliGemma 2...
00:31:21.920 | So, PaliGemma is essentially doing the same thing,
00:31:24.560 | but instead of doing an encoder-decoder,
00:31:26.280 | it just dumps everything
00:31:27.120 | into a decoder-only transformer model.
00:31:29.560 | But it also introduced the concept of location tokens
00:31:31.840 | to point to objects in pixel space.
00:31:35.240 | PaliGemma 2...
00:31:36.560 | So, PaliGemma uses Gemma as the language encoder
00:31:38.680 | and it uses Gemma 2B.
00:31:39.880 | PaliGemma 2 introduces using multiple different sizes
00:31:43.120 | of language encoders.
00:31:44.160 | So, the way that they sort of get around
00:31:48.360 | having to do encoder-decoder
00:31:49.960 | is they use the concept of prefix loss,
00:31:52.320 | which basically means that
00:31:53.680 | when it's generating tokens autoregressively,
00:31:58.360 | all of those tokens in the prefix,
00:32:01.160 | which is like the image that it's looking at
00:32:03.040 | and like a description of the task that it's trying to do,
00:32:05.920 | they're attending to each other fully, full attention,
00:32:09.320 | which means that it can sort of bind high level...
00:32:12.960 | It's easier for the prefix to color the output
00:32:17.760 | of the suffix
00:32:19.160 | and also to just find features easily.
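
A minimal sketch of that prefix-style attention mask, with `True` meaning a position may attend to another; this is just the masking idea, not PaliGemma's implementation.

```python
import torch

def prefix_lm_mask(prefix_len: int, total_len: int) -> torch.Tensor:
    """Full (bidirectional) attention over the prefix (image tokens + task
    description), causal attention over the generated suffix."""
    mask = torch.zeros(total_len, total_len, dtype=torch.bool)
    mask[:, :prefix_len] = True                            # everyone sees the whole prefix
    causal = torch.tril(torch.ones(total_len, total_len, dtype=torch.bool))
    mask[prefix_len:, prefix_len:] = causal[prefix_len:, prefix_len:]  # suffix is causal
    return mask

print(prefix_lm_mask(prefix_len=3, total_len=5).int())
# tensor([[1, 1, 1, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]])
```
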
00:32:23.440 | So, this is sort of an example
00:32:25.920 | of one of the tasks that I was trained on,
00:32:27.360 | which is you describe the task in English
00:32:29.800 | and then you give it all these...
00:32:34.520 | You're asking for it to segment these two classes of objects
00:32:38.960 | and then it finds their locations using these tokens
00:32:42.760 | and it finds their masks using some encoding
00:32:46.480 | of the masks into tokens.
00:32:50.200 | And yeah, so one of my critiques,
00:32:54.040 | I guess, of PolyGemma 1, at least,
00:32:56.080 | is that you find that performance saturates
00:32:59.080 | as a pre-trained model
00:32:59.960 | after only 300 million examples seen.
00:33:02.400 | So, what this graph is representing
00:33:06.000 | is each blue dot is a performance on some downstream task.
00:33:09.560 | You can see that after seeing 300 million examples,
00:33:12.520 | it sort of does equally well
00:33:15.440 | on all of the downstream tasks that they tried it on,
00:33:18.400 | which was a lot, as it does at one billion examples,
00:33:21.680 | which to me also kind of suggests
00:33:23.720 | a lack of capacity for this model.
00:33:25.560 | PaliGemma 2, you can see the results on object detection.
00:33:31.520 | So, these were transferred to COCO.
00:33:35.800 | And you can see that this sort of also points
00:33:39.200 | to an increase in capacity being helpful to the model.
00:33:41.280 | You can see as both the resolution increases
00:33:44.720 | and the parameter count of the language model increases,
00:33:47.360 | performance increases.
00:33:48.640 | So, resolution makes sense.
00:33:49.640 | Obviously, it helps to find
00:33:51.960 | small objects in the image,
00:33:53.560 | but it also makes sense from another reason,
00:33:55.080 | which is that it kind of gives the model
00:33:56.880 | a thinking register and it gives it more tokens
00:33:58.800 | to process when making its predictions.
00:34:01.440 | But yeah, you could say, oh, 43.6, that's not that great.
00:34:06.600 | Like Florence 2 got 60,
00:34:08.960 | but this is not training a DINO or a DETR
00:34:12.520 | on top of this image encoder.
00:34:16.240 | It's doing the raw language modeling task on COCO.
00:34:20.520 | So, it doesn't have any of the bells and whistles.
00:34:21.960 | It doesn't have any of the fancy losses.
00:34:23.360 | It doesn't even have bipartite graph matching
00:34:25.600 | or anything like that.
00:34:27.400 | Okay, the big result and one of the reasons
00:34:30.360 | that I was really excited about this paper
00:34:32.920 | is that they blow everything else away on MMVP.
00:34:35.520 | I mean, 47.3, sure, that's nowhere near human accuracy,
00:34:39.400 | which again is 94%,
00:34:40.680 | but for a 2 billion parameter language model
00:34:44.600 | to beat ChatGPT, that's quite the achievement.
00:34:47.120 | And that sort of brings us to our final pick
00:34:51.320 | for paper of the year, which is AIMv2.
00:34:56.080 | So, AIMv2 sort of says, okay, maybe this language model,
00:35:01.080 | like maybe coming up with all these specific annotations
00:35:04.760 | to find features with high fidelity in pixel space
00:35:08.760 | isn't actually necessary.
00:35:10.560 | And we can come up with an even simpler
00:35:12.920 | and more beautiful idea for combining image tokens
00:35:17.280 | and pixel tokens in a way that's interfaceable
00:35:19.640 | for language tasks.
00:35:21.120 | And this is nice because it can scale.
00:35:23.680 | You can come up with lots more data
00:35:25.360 | if you don't have to come up
00:35:26.280 | with all these annotations, right?
00:35:28.080 | So, the way that it works is it does something
00:35:30.160 | very, very similar to PaliGemma
00:35:31.680 | where you have a vision encoder
00:35:33.040 | that dumps image tokens into a decoder-only transformer.
00:35:36.840 | But the interesting thing is that
00:35:40.000 | it also autoregressively tries to reconstruct
00:35:42.760 | the image tokens with a mean squared error loss.
00:35:46.200 | So, instead of having to come up
00:35:47.320 | with fancy object detection or segmentation labels,
00:35:51.520 | you can just try to reconstruct the image
00:35:53.240 | and have it learn fine-grained features that way.
00:35:55.720 | And it does this in kind of, I think, a beautiful way
00:35:59.000 | that's kind of compatible
00:36:00.080 | with the PaliGemma line of thinking,
00:36:01.400 | which is randomly sampling a prefix length
00:36:04.560 | and using only this number of image tokens as the prefix.
00:36:08.480 | And so, doing a similar thing with the causal mask.
00:36:13.320 | So, the causal prefix is the attention mask on the right.
00:36:16.360 | So, it's doing full block attention
00:36:18.760 | with some randomly sampled number of image tokens
00:36:21.120 | to then reconstruct the rest of the image
00:36:22.600 | and the downstream caption for that image.
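
Putting those pieces together, a rough sketch of the training objective might look like this; `model` is a stand-in that returns patch predictions and caption logits, and the details here are my reading of the description above, not Apple's code.

```python
import torch
import torch.nn.functional as F

def aimv2_style_loss(model, image_patches, caption_ids, min_prefix=1):
    """A rough sketch of an AIMv2-style objective.

    image_patches: (B, N, D) continuous patch embeddings.
    caption_ids:   (B, L) caption token ids.
    A prefix length is sampled; the model gets full attention over that prefix
    and must autoregressively predict the remaining patches (MSE) and the
    caption tokens (cross-entropy).
    """
    num_patches = image_patches.shape[1]
    prefix_len = int(torch.randint(min_prefix, num_patches, (1,)))
    patch_preds, caption_logits = model(image_patches, caption_ids, prefix_len)

    # Regress the patches that come after the prefix.
    image_loss = F.mse_loss(patch_preds[:, prefix_len:], image_patches[:, prefix_len:])
    # Standard next-token prediction on the caption.
    text_loss = F.cross_entropy(
        caption_logits[:, :-1].flatten(0, 1), caption_ids[:, 1:].flatten()
    )
    return image_loss + text_loss
```
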
00:36:26.160 | And so, this is the dataset that they train on.
00:36:30.160 | It's internet-scale data, very high-quality data
00:36:34.000 | created by the Data Filtering Networks paper, essentially,
00:36:38.320 | which is maybe the best CLIP data that exists.
00:36:42.120 | And we can see that this is finally a model
00:36:46.640 | that doesn't saturate.
00:36:48.520 | Even at the highest parameter count,
00:36:51.360 | it appears to be, well,
00:36:55.160 | improving in performance
00:36:59.160 | with more and more samples seen.
00:37:00.880 | And so, you can sort of think that, you know,
00:37:03.800 | if we just keep bumping the parameter count
00:37:05.920 | and increasing the example seen,
00:37:07.280 | which is the line of thinking for language models,
00:37:10.400 | then it'll keep getting better.
00:37:12.320 | So, how does it actually do at finding...
00:37:14.080 | Oh, it also improves with resolution,
00:37:16.400 | which you would expect for a model that...
00:37:20.440 | This is the ImageNet classification accuracy,
00:37:22.680 | but yeah, it does better if you increase the resolution,
00:37:25.480 | which means that it's actually leveraging
00:37:26.920 | and finding fine-grained visual features.
00:37:29.760 | And so, how does it actually do compared to CLIP on COCO?
00:37:34.800 | Well, you can see that if you slap
00:37:36.800 | a transformer detection head on it,
00:37:39.400 | and train it on COCO, it gets to 60.2,
00:37:41.280 | which is also within spitting distance of SOTA,
00:37:44.200 | which means that it does a very good job
00:37:45.680 | of finding visual features.
00:37:48.480 | But you could say, okay, well, wait a second,
00:37:51.760 | CLIP got to 59.1, so, like,
00:37:55.600 | how does this prove your claim at all?
00:37:57.040 | Because doesn't that mean, like,
00:37:59.000 | CLIP, which is known to be CLIP-blind
00:38:00.920 | and do badly on MMVP,
00:38:02.440 | it's able to achieve a very high performance
00:38:04.720 | on this fine-grained visual features task
00:38:07.560 | of object detection?
00:38:08.800 | Well, they train on, like, tons of data.
00:38:11.800 | They train on, like, Objects 365, COCO, Flickr,
00:38:15.720 | and everything else.
00:38:17.120 | And so, I think that this benchmark
00:38:18.560 | doesn't do a great job of selling
00:38:19.800 | how good of a pre-trained model AIMv2 is.
00:38:22.040 | And we would like to see performance
00:38:25.000 | with fewer data examples
00:38:27.840 | and not trained to convergence on object detection.
00:38:29.760 | So, seeing it in the real world
00:38:31.640 | on, like, a dataset like Roboflow 100,
00:38:33.320 | I think would be quite interesting.
00:38:35.760 | And our, I guess, our final, final pick
00:38:38.360 | for paper of 2024 would be Moondream.
00:38:42.280 | So, introducing Vik to talk about that.
00:38:42.280 | - But overall, that was exactly what I was looking for.
00:38:49.640 | Like, best of 2024, amazing job.
00:38:51.800 | Yeah, you can.
00:38:54.480 | Does anyone have questions
00:38:56.400 | while Vik gets set up, like, vision stuff?
00:38:58.400 | Yeah?
00:39:02.720 | Vic, go ahead. - Hi.
00:39:06.520 | Well, while we're getting set up, hi, over here.
00:39:09.920 | Thanks for the really awesome talk.
00:39:11.760 | One of the things that's been weird and surprising
00:39:13.760 | is that the foundation model companies
00:39:19.280 | and even these MLMs,
00:39:22.560 | they're just, like, worse than RTTetter at detection still.
00:39:27.200 | Like, if you wanted to pay a bunch of money
00:39:30.280 | to auto-label your detection dataset,
00:39:32.080 | if you gave it to OpenAI or Claude,
00:39:33.920 | that would be, like, a big waste.
00:39:36.440 | So, I'm curious, just, like,
00:39:37.520 | even PaliGemma 2, like, is worse.
00:39:40.840 | So, I'm curious to hear your thoughts on, like,
00:39:43.480 | how come nobody's cracked the code on, like,
00:39:46.040 | a generalist that really, you know,
00:39:50.320 | beats a specialist model in computer vision
00:39:53.360 | like they have in LM land?
00:39:56.120 | - I can, can you hear me?
00:40:01.080 | - Yeah, you gotta press the speak button.
00:40:03.440 | - Okay.
00:40:04.320 | - Oh, yeah.
00:40:05.160 | (laughing)
00:40:07.560 | - It's a very, very interesting question.
00:40:09.760 | I think it depends on the specific domain.
00:40:13.360 | For image classification, it's basically there.
00:40:16.600 | In the, AIMV2 showed a simple attentional probe
00:40:20.480 | on the pre-trained features gets, like, 90%,
00:40:22.520 | which is as well as anyone does.
00:40:24.960 | The bigger question, like,
00:40:29.040 | why isn't it transferring to object detection,
00:40:33.520 | especially, like, real-time object detection?
00:40:35.760 | I think, in my mind, there are two answers.
00:40:39.240 | One is object detection is really, really, really,
00:40:43.280 | the architectures are super domain-specific.
00:40:46.480 | You know, we see these,
00:40:47.320 | all these super, super complicated things,
00:40:48.800 | and it's not super easy to build something
00:40:52.720 | that just transfers naturally like that,
00:40:54.440 | whereas image classification, you know,
00:40:56.440 | clip pre-training transfers super, super easily.
00:40:59.640 | And the other thing is, until recently,
00:41:04.240 | the real-time object detectors
00:41:06.000 | didn't even really benefit from pre-training.
00:41:08.560 | Like, you see the YOLOs that are, like,
00:41:10.200 | essentially saturated, showing very little difference
00:41:12.720 | with pre-training improvements,
00:41:15.440 | with using a pre-trained model at all,
00:41:17.680 | it's not surprising, necessarily,
00:41:19.640 | that people aren't looking at the effects
00:41:22.880 | of better and better pre-training on real-time detection.
00:41:25.920 | Maybe that'll change in the next year.
00:41:27.800 | Does that answer your question?
00:41:29.480 | - Cool.
00:41:30.320 | Can you guys hear me?
00:41:33.320 | Yeah, one thing I want to add is just, like,
00:41:35.040 | or just to summarize, basically, is that, like,
00:41:37.520 | until 2024, you know,
00:41:40.080 | we haven't really seen a combination
00:41:41.720 | of transformer-based object detectors and fancy losses,
00:41:46.720 | and PaliGemma suffers from the same problem,
00:41:49.120 | which is basically to say that these ResNet,
00:41:52.360 | or, like, the convolutional models,
00:41:54.280 | they have all these, like, extreme optimizations
00:41:58.200 | for doing object detection,
00:42:00.160 | but essentially, I think it's kind of been shown now
00:42:02.840 | that convolutional models, like,
00:42:04.200 | just don't benefit from pre-training
00:42:05.720 | and just don't, like, have the level of intelligence
00:42:07.440 | of transformer models.
00:42:08.560 | - Awesome.
00:42:13.080 | Balundri.
00:42:14.760 | - Hi, can you hear me?
00:42:17.040 | - Cool.
00:42:17.880 | - I can hear you, see you.
00:42:19.000 | Are you sharing your screen?
00:42:20.120 | - I might have forgotten to do that.
00:42:22.440 | Let me do that.
00:42:23.280 | - Sorry, you should've done that.
00:42:24.120 | - Okay.
00:42:24.960 | - Here's your screen.
00:42:35.320 | - Uh-oh, classic.
00:42:37.160 | You might have to quit Zoom and restart.
00:42:40.640 | - What?
00:42:41.480 | - It's fine.
00:42:43.440 | Yeah, it's like, we have a capture of your screen.
00:42:46.960 | I'll just make sure it's visible.
00:42:49.120 | So let's get to your screen.
00:42:52.440 | - Okay.
00:42:54.080 | Easy enough.
00:42:54.920 | - How do you make it, like, wait for you?
00:42:58.880 | - Quit Zoom.
00:43:04.080 | - Yeah, yeah, there you go.
00:43:04.920 | Perfect.
00:43:05.760 | - All right.
00:43:07.480 | Hi, everyone.
00:43:08.320 | My name is Vik.
00:43:09.440 | I've been working on Moondream for almost a year now,
00:43:12.560 | like Sean mentioned.
00:43:13.440 | I just went and looked,
00:43:14.440 | and it turns out the first version,
00:43:16.280 | I released December 29, 2023.
00:43:18.240 | It's been a fascinating journey.
00:43:21.040 | So Moondream started off as a tiny vision language model.
00:43:25.720 | Since then, we've extended scope a little bit
00:43:27.360 | to also try and build some tooling,
00:43:30.080 | client libraries, et cetera,
00:43:31.120 | to help people really deploy it.
00:43:34.360 | Unlike traditional large models
00:43:37.680 | that are focused at assistant-type use cases,
00:43:39.360 | we're laser-focused on building
00:43:41.480 | capabilities that developers can use
00:43:49.680 | to build vision applications
00:43:58.200 | that can run anywhere.
00:43:59.120 | So in a lot of cases for vision more so than for text,
00:44:02.720 | you really care about being able to run on the edge,
00:44:05.000 | run in real time, et cetera.
00:44:06.000 | So that's really important.
00:44:08.840 | We have different output modalities that we support.
00:44:12.560 | There's query where you can ask
00:44:14.160 | general English questions about an image
00:44:15.960 | and get back human-like answers.
00:44:18.080 | There's captioning,
00:44:19.280 | which a lot of our users use
00:44:21.040 | for generating synthetic datasets
00:44:23.480 | to then train diffusion models and whatnot.
00:44:26.360 | We've done a lot of work to minimize hallucinations there.
00:44:28.200 | So that's used a lot.
00:44:31.080 | We have open vocabulary object detection built in,
00:44:33.120 | similar to a couple of more recent models
00:44:34.560 | like PaliGemma, et cetera,
00:44:35.480 | where rather than having to train a dedicated model,
00:44:38.040 | you can just say, "Show me soccer balls in this image,"
00:44:41.000 | or, "Show me if there are any deer in this image."
00:44:42.640 | It'll detect it.
00:44:43.640 | More recently, earlier this month,
00:44:46.520 | we released pointing capability
00:44:48.720 | where if all you're interested in is the center of an object,
00:44:52.440 | you can just ask it to point out where that is.
00:44:56.360 | This is very useful
00:45:00.360 | when you're doing UI automation-type stuff.
00:45:00.360 | Let's see.
00:45:01.200 | We have two models out right now.
00:45:05.840 | There's a general-purpose 2B parameter model,
00:45:08.160 | which runs fairly...
00:45:11.080 | Like, it's fine if you're running on a server.
00:45:13.040 | It's good for our local LLaMA desktop friends,
00:45:16.720 | and it can run on flagship mobile phones,
00:45:18.800 | but it never really fulfilled the promise
00:45:21.000 | of being able to run anywhere.
00:45:23.000 | Last week, we released a new 0.5B parameter model,
00:45:25.880 | which should be seen less as a general-purpose model
00:45:28.920 | and more as a distillation target
00:45:30.560 | for the 2B parameter model.
00:45:32.400 | It's very good if you're running on older mobile phones
00:45:36.080 | or edge devices.
00:45:37.760 | Uses less memory,
00:45:39.400 | even with our not-yet-fully-optimized inference client.
00:45:42.120 | So the way we built our 0.5B model
00:45:47.960 | was to start with the 2B parameter model
00:45:50.880 | and prune it while doing continual training
00:45:55.720 | to retain performance.
00:45:57.400 | We...
00:45:58.880 | Our objective during the pruning
00:46:00.280 | was to preserve accuracy across a broad set of benchmarks.
00:46:04.760 | So the way we went about it
00:46:05.840 | was to estimate the importance
00:46:07.400 | of different components of the model,
00:46:08.640 | like attention heads, channels,
00:46:10.360 | MLP rows and whatnot,
00:46:14.440 | using basically a technique based on the gradient.
00:46:17.520 | I'm not sure how much people want to know details.
00:46:19.320 | We'll be writing a paper about this,
00:46:20.560 | but feel free to grab me if you have more questions.
00:46:23.920 | Then we iteratively prune a small chunk
00:46:26.400 | that'll minimize loss in performance,
00:46:28.360 | retrain the model to recover performance and bring it back.
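
A generic sketch of that kind of gradient-based importance scoring is below; first-order `|weight * gradient|` scores are a common pruning heuristic, and the actual Moondream procedure may differ (their paper is still pending).

```python
import torch

def head_importance(attn_weight: torch.Tensor, attn_grad: torch.Tensor, num_heads: int):
    """First-order importance score per attention head: |weight * gradient|,
    summed over each head's parameters. A common pruning heuristic, shown
    here only as a sketch of the idea."""
    per_param = (attn_weight * attn_grad).abs()
    return per_param.view(num_heads, -1).sum(dim=1)   # one score per head

def prune_smallest(scores: torch.Tensor, fraction: float = 0.1):
    """Return indices of the least important heads to remove this iteration."""
    k = max(1, int(fraction * scores.numel()))
    return torch.topk(scores, k, largest=False).indices

# Example with fake tensors standing in for one attention projection and its gradient.
w, g = torch.randn(2048, 2048), torch.randn(2048, 2048)
scores = head_importance(w, g, num_heads=16)
print(prune_smallest(scores))   # heads to drop, then retrain and repeat
```
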
00:46:31.480 | The 0.5B we released is more of a proof of concept
00:46:35.040 | that this is possible.
00:46:35.880 | I think the thing that's really exciting about this
00:46:37.640 | is it makes it possible for...
00:46:39.440 | For developers to build using the 2B param model
00:46:44.880 | and just explore, build their application.
00:46:48.400 | And then once they're ready to deploy,
00:46:50.680 | figure out what exactly they need out of the model
00:46:52.560 | and prune those capabilities into a smaller form factor
00:46:54.680 | that makes sense for their deployment target.
00:46:56.960 | So yeah, very excited about that.
00:47:00.680 | Let me talk to you folks a little bit about another problem
00:47:04.240 | I've been working on recently,
00:47:05.160 | which is similar to the clocks example
00:47:07.040 | we've been talking about.
00:47:07.880 | We had a customer reach out
00:47:11.240 | who had a bunch of gauges out in the field.
00:47:14.240 | This is very common in manufacturing and oil and gas
00:47:16.800 | where you have a bunch of analog devices
00:47:19.720 | that you need to monitor.
00:47:20.960 | It's expensive to have humans look at that
00:47:24.040 | and monitor stuff and make sure that the system
00:47:27.320 | gets shut down when the temperature goes over 80
00:47:29.440 | or something.
00:47:30.360 | So I was like, yeah, this seems easy enough.
00:47:32.240 | Happy to help you distill that.
00:47:34.680 | Let's get it going.
00:47:36.480 | Turns out our model couldn't do it at all.
00:47:38.560 | I went and looked at other open source models
00:47:40.760 | to see if I could just generate a bunch of data
00:47:43.120 | and learn from that.
00:47:43.960 | That did not work either.
00:47:45.680 | So I was like, let's look at what the folks
00:47:47.240 | with hundreds of billions of dollars in market cap
00:47:51.000 | have to offer.
00:47:51.840 | And yeah, that doesn't work either.
00:47:53.960 | My hypothesis is that these models are trained
00:48:00.040 | using a large amount of image-text data
00:48:03.200 | scraped from the internet.
00:48:04.480 | And that can be biased.
00:48:05.320 | In the case of gauges,
00:48:06.640 | most gauge images aren't gauges in the wild.
00:48:09.440 | They're product detail images like these,
00:48:12.680 | where it's always set to zero.
00:48:14.280 | It's paired with an alt text that says something like
00:48:16.360 | G-I-V-T-O pressure sensor, PSI zero to 30 or something.
00:48:21.360 | And so the models are fairly good
00:48:23.760 | at picking up those details.
00:48:24.680 | It'll tell you that it's a pressure gauge.
00:48:26.000 | It'll tell you what the brand is,
00:48:26.840 | but it doesn't really learn to pay attention
00:48:28.680 | to the needle over there.
00:48:30.880 | And so, yeah, that's a gap we need to address.
00:48:36.480 | So naturally my mind goes to like,
00:48:39.800 | let's use synthetic data to solve this problem.
00:48:42.520 | That works, but it's problematic
00:48:46.160 | because it turned out we needed millions
00:48:47.760 | of synthetic gauge images to get to reasonable performance.
00:48:50.920 | And thinking about it, reading a gauge is not
00:48:55.480 | a zero-shot process in our minds, right?
00:48:57.520 | Like if you had to tell me the reading in Celsius
00:49:00.440 | for this real world gauge, there's two dials on there.
00:49:03.920 | So first you have to figure out which one
00:49:05.200 | you have to be paying attention to,
00:49:06.160 | like the inner one or the outer one.
00:49:07.920 | You look at the tip of the needle,
00:49:11.080 | you look at what labels it's between,
00:49:13.360 | and then you count how many and do some math
00:49:17.200 | to figure out what that probably is.
00:49:19.360 | So what happens if we just add that as chain of thought
00:49:23.280 | to give the model a better understanding
00:49:27.600 | of the different steps,
00:49:29.720 | to allow the model to better learn the subtasks
00:49:31.360 | it needs to perform to accomplish this goal?
00:49:33.560 | So you can see in this example,
00:49:36.640 | this was actually generated
00:49:37.560 | by the latest version of our model.
00:49:39.480 | It's like, okay, Celsius is the inner scale.
00:49:42.120 | It's between 50 and 60.
00:49:43.200 | There's 10 ticks.
00:49:44.280 | It's at the second tick.
00:49:46.360 | It's a little debatable here.
00:49:47.440 | Like there's a weird shadow situation going on.
00:49:49.400 | The dial is off.
00:49:50.440 | So I don't know what the ground truth is,
00:49:52.040 | but it works okay.
00:49:54.920 | There's points on there,
00:49:57.640 | and those points are actually grounded.
00:50:00.040 | I don't know if this is easy to see,
00:50:01.880 | but when I click on those,
00:50:02.880 | there's a little red dot that moves around on the image.
00:50:05.120 | The model actually has to predict
00:50:07.000 | where those points are.
00:50:09.880 | I was originally trying to do this with bounding boxes,
00:50:11.920 | but then Molmo came out with pointing capabilities
00:50:14.840 | and it's like pointing is a much better paradigm
00:50:17.680 | to represent this.
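As a rough illustration of what a grounded chain-of-thought sample could look like, here is a sketch of one training example where each reasoning step can carry a normalized (x, y) point the model must also predict. The schema, field names, and the `<point .../>` serialization are assumptions for this sketch, not Moondream's actual data format.

```python
# One illustrative grounded chain-of-thought example for gauge reading.
# The schema, field names, and <point/> serialization are assumptions for this
# sketch, not Moondream's actual format; points are normalized (x, y) coordinates.
example = {
    "image": "gauge_001.jpg",
    "question": "What is the reading in Celsius?",
    "chain_of_thought": [
        {"step": "Celsius is the inner scale.", "point": None},
        {"step": "The needle tip is here.", "point": (0.62, 0.41)},
        {"step": "It falls between the 50 and 60 labels.", "point": (0.58, 0.37)},
        {"step": "There are 5 ticks between labels, so each tick is 2 degrees.", "point": None},
        {"step": "The needle is on the 2nd tick past 50.", "point": (0.62, 0.41)},
    ],
    "answer": "54 Celsius",
}

def render_target(ex: dict) -> str:
    """Serialize the example into the text target the model is trained to emit."""
    lines = []
    for s in ex["chain_of_thought"]:
        point = s["point"]
        suffix = f" <point x={point[0]:.2f} y={point[1]:.2f}/>" if point else ""
        lines.append(s["step"] + suffix)
    lines.append(f"Answer: {ex['answer']}")
    return "\n".join(lines)

print(render_target(example))
```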
00:50:20.960 | We see pretty good results.
00:50:23.440 | This one's actually for clock reading.
00:50:24.800 | I couldn't find our chart for gauge reading
00:50:27.560 | at the last minute.
00:50:28.400 | So the light blue chart is with our grounded chain of thought.
00:50:33.400 | We built a clock-reading benchmark
00:50:40.320 | of about 500 images,
00:50:41.520 | and this measures accuracy on that.
00:50:44.240 | You can see it's a lot more sample efficient
00:50:47.400 | when you're using the chain of thought to help the model.
00:50:55.040 | Another big benefit from this approach
00:50:59.040 | is you can kind of understand how the model is doing it
00:51:02.800 | and how it's failing.
00:51:04.560 | So in this example,
00:51:05.880 | the actual correct reading is 54 Celsius,
00:51:08.480 | the model output 56.
00:51:10.440 | Not too bad, but you can actually go and see
00:51:13.720 | where it messed up.
00:51:15.920 | Like it got a lot of these right,
00:51:17.280 | except instead of saying it was on the seventh tick,
00:51:22.120 | it actually predicted that it was the eighth tick
00:51:24.600 | and that's why it went with 56.
00:51:26.360 | So now that you know that it's failing in this way,
00:51:30.960 | you can adjust how you're doing the chain of thought
00:51:32.760 | to maybe say like actually count out each tick from 40
00:51:35.480 | instead of just trying to say it's the eighth tick.
00:51:37.880 | Or you might say like, okay,
00:51:38.960 | I see that there's that middle thing.
00:51:40.320 | I'll count from there instead of all the way from 40.
00:51:43.160 | So it helps a ton.
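For instance, the revised chain-of-thought target might enumerate the ticks explicitly rather than jumping straight to the tick index. The exact wording below is only an illustrative guess at what that revised target could look like, based on the 54-versus-56 example above.

```python
# Illustrative revised chain-of-thought target that counts out each tick from the
# 40 label instead of asserting the tick index directly; wording is a guess.
revised_target = "\n".join([
    "Celsius is the inner scale.",
    "The needle is past the 40 label; each tick is 2 degrees.",
    "Counting ticks from 40: 42, 44, 46, 48, 50, 52, 54 - the needle tip lands on the 7th tick.",
    "Answer: 54 Celsius",
])
print(revised_target)
```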
00:51:46.080 | The other thing I'm excited about
00:51:47.040 | is few-shot prompting or test-time training with this.
00:51:50.480 | Like if a customer has a specific gauge
00:51:52.720 | that we're seeing minor errors on,
00:51:55.680 | they can give us a couple of examples
00:51:57.240 | where like if it's misdetecting the needle,
00:52:00.560 | they can go in and correct that in the chain of thought
00:52:02.160 | and hopefully that works the next time.
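Here is a rough sketch of how that correction loop could work in code: take the model's trace for a customer's gauge, patch the misdetected needle point with the human-verified one, and reuse the corrected examples as few-shot context (or as a tiny test-time fine-tuning set). The helpers and prompt format are assumptions; `render_target` refers to the earlier grounded-example sketch.

```python
# Sketch of the correction loop: patch the misdetected needle point in a model trace
# with a human-verified point, then reuse corrected traces as few-shot context.
# Helper names and the prompt format are illustrative assumptions.
from copy import deepcopy

def correct_needle_point(model_trace: dict, true_point: tuple[float, float]) -> dict:
    fixed = deepcopy(model_trace)
    for step in fixed["chain_of_thought"]:
        if "needle" in step["step"].lower():
            step["point"] = true_point   # human-verified location replaces the bad prediction
    return fixed

def build_fewshot_prompt(corrections: list[dict], new_question: str) -> str:
    # render_target is the serializer from the earlier grounded-example sketch
    shots = "\n\n".join(render_target(c) for c in corrections)
    return f"{shots}\n\nQuestion: {new_question}\nReasoning:"
```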
00:52:04.120 | Now, it's an exciting approach,
00:52:09.040 | but we've only applied it to clocks and gauges.
00:52:10.400 | The real question is, is it going to generalize?
00:52:13.320 | Probably. Like, there's some evidence from text models
00:52:15.760 | that when you train on a broad number of tasks,
00:52:17.400 | it does generalize,
00:52:18.240 | and I'm seeing some signs of that with our model as well.
00:52:21.720 | So in addition to the image-based chain of thought stuff,
00:52:25.680 | I also added some spelling-based chain of thought
00:52:29.160 | to help it better understand OCR, I guess.
00:52:33.600 | I don't understand why everyone doesn't do this by the way.
00:52:36.760 | Like it's a trivial benchmark question.
00:52:38.760 | It's very, very easy to nail.
00:52:40.880 | But I also wanted to support it for stuff
00:52:45.000 | like license plate partial matching,
00:52:46.640 | like hey, does any license plate in this image
00:52:49.280 | start with WHA or whatever?
00:52:50.880 | So yeah, that sort of worked.
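As an example of what that spelling-style chain of thought could look like for the partial-match case, here is a small sketch that spells each detected plate character by character before doing the prefix check. The output format is an assumption, not Moondream's actual behavior.

```python
# Illustrative spelling-based chain of thought for partial license-plate matching:
# spell out each detected plate character by character, then answer the prefix check.
def spelling_cot(detected_plates: list[str], prefix: str) -> str:
    lines = []
    for plate in detected_plates:
        lines.append(f"I see a plate reading {plate}, spelled {'-'.join(plate.upper())}.")
    match = any(p.upper().startswith(prefix.upper()) for p in detected_plates)
    lines.append(f"Does any plate start with {prefix.upper()}? {'Yes' if match else 'No'}.")
    return "\n".join(lines)

print(spelling_cot(["WHA4821", "KJX902"], "WHA"))
```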
00:52:54.120 | All right, that ends my story about the gauges.
00:53:00.840 | If you think about what's going on over here,
00:53:03.800 | it's interesting that like LLMs
00:53:05.880 | are showing enormous progress in reasoning,
00:53:10.880 | especially with the latest set of models that we've seen.
00:53:14.600 | But we're not really seeing that in vision.
00:53:17.000 | I have a feeling that VLMs are lagging behind
00:53:20.680 | as we can see with these tasks
00:53:23.440 | that should be very simple for a human to do
00:53:25.080 | that are very easy to find VLMs failing at.
00:53:29.560 | My hypothesis on why this is the case
00:53:31.280 | is because on the internet,
00:53:33.600 | there's a ton of data that talks about how to reason.
00:53:36.440 | There's books about how to solve problems.
00:53:38.760 | There's books critiquing the books
00:53:40.240 | about how to solve problems.
00:53:41.720 | But humans are just so good at perception
00:53:43.440 | that we never really talk about it.
00:53:45.640 | Like maybe in art books where it's like,
00:53:47.440 | hey, to show that that mountain is further away,
00:53:49.880 | you need to desaturate it a bit or whatever,
00:53:51.880 | but the actual data on how to like look at images
00:53:56.880 | isn't really present.
00:53:58.760 | Also, the data we have is kind of sketchy.
00:54:01.160 | The best source of data we have
00:54:02.280 | is like image alt-text pairs on the internet
00:54:04.520 | and that's pretty low quality.
00:54:06.040 | So yeah, I think our solution here is really just,
00:54:09.800 | we need to teach them how to operate on individual tasks
00:54:13.240 | and figure out how to scale that out.
00:54:15.640 | All right, yep.
00:54:19.480 | So conclusion, at Moondream we're trying
00:54:23.200 | to build amazing VLMs that run everywhere.
00:54:25.560 | Very hard problem, much work ahead,
00:54:27.640 | but we're making a ton of progress
00:54:29.240 | that I'm really excited about.
00:54:31.440 | If anyone wants to chat about more technical details
00:54:35.280 | about how we're doing this or interested in collaborating,
00:54:37.360 | please hit me up.
00:54:38.760 | - Yeah, like, when people say multi-modality,
00:54:48.800 | I always think about vision as the first among equals
00:54:52.000 | in all the modalities.
00:54:53.000 | So I really appreciate having the experts.
00:54:57.480 | - This is the year that vision language models
00:54:59.440 | became mainstream with every model from GPT-4o to o1
00:55:03.400 | to Claude 3 to Gemini 1 and 2 to Llama 3.2
00:55:08.000 | to Mistral's Pixtral to AI2's Pixmo going multi-modal.
00:55:13.000 | We asked Peter and Isaac to highlight the best work
00:55:15.680 | in computer vision for 2024.
00:55:18.320 | And they blew us away with the complete overview.
00:55:21.720 | As a special bonus, we also got a bonus talk
00:55:24.400 | from Vik Korrapati at Moondream
00:55:26.920 | who gave an incredible talk
00:55:28.240 | at this year's AI Engineer World's Fair
00:55:31.080 | on his tiny 0.5 billion parameter pruned
00:55:34.080 | vision language model that absolutely slaps.
00:55:37.400 | As always, don't forget to check the show notes
00:55:39.800 | for the YouTube link to their talk, as well as their slides.
00:55:43.320 | Watch out and take care.