Vision AI in 2025 — Peter Robicheaux, Roboflow

00:00:00.000 |
I'm going to be giving a quick presentation about the State of the Union regarding AI Vision. 00:00:20.280 |
So I'm Peter Robicheaux. I'm the ML lead at Roboflow, which is a platform for building and 00:00:29.620 |
deploying vision models. A lot of people are really interested in LLMs these days, so I'm 00:00:37.400 |
trying to pitch why computer vision matters. If you think about systems that interact with 00:00:44.240 |
the real world, they have to use vision as one of their primary inputs because the built 00:00:50.240 |
world is sort of built around vision as a fundamental primitive. There's a big gap between where human vision 00:00:59.240 |
is and where computer vision is. I would argue a bigger gap than currently exists between human 00:01:05.920 |
speech and computer speech. Computer vision has its own set of problems that are very distinct 00:01:15.940 |
from the problems that need to be solved by LLMs. Latency usually matters. If you want to 00:01:21.260 |
perceive motion, you have to be running your processing at multiple frames per second. You usually 00:01:27.420 |
want to run at the edge. You can't have one big hub where you do all of your computation 00:01:32.900 |
because you would introduce too much latency to make decisions based off that computation. 00:01:40.100 |
So I sort of gave a version of this talk on the Latent Space podcast at NeurIPS. And retrospectively, 00:01:49.480 |
I think we identified a few problems with the field of vision in 2024, one of them being 00:01:56.280 |
that evals are saturated. Vision evals like ImageNet and COCO are mostly pattern matching. 00:02:03.160 |
They measure your ability to match patterns and don't really require much visual intelligence 00:02:08.960 |
to solve. Consequently, I think, vision models don't leverage big pre-training the way that 00:02:17.720 |
language models do. So right now, you can take a language model and unleash it on the internet 00:02:21.900 |
and get something incredibly smart. Some of the best vision models are moving in that direction. 00:02:27.860 |
But because you don't need that level of knowledge and intelligence to solve the evals, there's kind 00:02:34.360 |
of no incentive to do so. And I think part of that -- so there are sort of two dimensions 00:02:41.440 |
here. One is that vision doesn't leverage big pre-training. You can think of it like this: if you're 00:02:46.840 |
building an application with language right now, you probably want to use the smartest model 00:02:50.440 |
to get an embedding that works really well for you. In language, there are 00:02:55.000 |
downstream applications that make really good use of the pre-training and the embeddings that 00:02:59.560 |
they get from large language models. But there aren't really good vision models that can leverage 00:03:05.720 |
these embeddings. And the corollary to this is that the quality of big pre-trained models just isn't the 00:03:13.960 |
same in vision as it is in language. And so my underlying conclusion is vision models aren't smart. That's 00:03:20.640 |
the takeaway. And I can prove it to you. So last year, when Claude 3.5 was the latest, you could give it an 00:03:27.360 |
image of a watch and it just guesses -- you ask it what time it is and it'll just guess a random time. 00:03:32.480 |
And that's because this model has a good conceptual, abstract idea of what a clock is or what a watch 00:03:38.480 |
is, but when it comes to actually identifying the location of the watch hands and finding the numbers on the 00:03:43.280 |
watch, it's hopeless. And updated for Claude 4: it still has no idea what time it is. And this is an even 00:03:50.880 |
more egregious failure because 10:10 is the stock time on basically all watches. So the fact that it 00:03:57.040 |
couldn't even get the most common time is pretty telling. So there's this really cool 00:04:05.440 |
dataset called MMVP that's trying to measure this inability of LLMs to see. You can 00:04:13.840 |
see an example here where they ask a question that seems incredibly obvious. So 00:04:19.920 |
in this case, they asked the model, which is GPT-4o, which direction the school bus is 00:04:26.960 |
facing. Are we seeing the front or the back of the school bus? And the model gets it completely wrong 00:04:31.840 |
and then hallucinates details to support its claim. And again, I think this is evidence that large 00:04:36.880 |
language models, which are maybe the most intelligent models that we have, like cannot see. And that is 00:04:43.280 |
due to a lack of visual features that they can perceive with. And so the way that this dataset was 00:04:48.720 |
created is they went and found pairs of images that were close in CLIP space but far apart in DINOv2 space. 00:04:55.840 |
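To make that pair-mining step concrete, here's a minimal sketch of the idea, assuming you already have `clip_embed` and `dino_embed` callables that map an image to an embedding vector; the thresholds are purely illustrative and are not the values the MMVP authors used.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_clip_blind_pairs(images, clip_embed, dino_embed,
                          clip_thresh=0.95, dino_thresh=0.6):
    """Return image pairs that CLIP thinks are near-identical
    but DINOv2 considers clearly different ("CLIP-blind" pairs)."""
    clip_vecs = [clip_embed(im) for im in images]
    dino_vecs = [dino_embed(im) for im in images]
    pairs = []
    for i in range(len(images)):
        for j in range(i + 1, len(images)):
            if (cosine(clip_vecs[i], clip_vecs[j]) > clip_thresh and
                    cosine(dino_vecs[i], dino_vecs[j]) < dino_thresh):
                pairs.append((i, j))
    return pairs
```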
So CLIP is a vision language model that was contrastively trained on basically the whole internet. 00:05:01.200 |
And so what this is showing is that CLIP is not discriminative enough to tell these two images 00:05:13.440 |
apart, right? So according to CLIP, these two images basically look the same. And what that's 00:05:17.920 |
pointing to is a failure in vision language pre-training. And so the way CLIP is trained is 00:05:25.360 |
basically you come up with a big dataset of captioned images, you 00:05:32.400 |
scramble the captions and the images, and you ask the model to pair each image with its caption. 00:05:36.800 |
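As a rough sketch of that matching objective, here is the standard symmetric contrastive (InfoNCE-style) loss over a batch of paired image and text embeddings; the encoders are assumed to live elsewhere, and the temperature is just a placeholder value.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of N paired embeddings.

    image_emb, text_emb: [N, D] tensors where row i of each is a matched pair.
    Every other row in the batch acts as a negative ("scrambled") pairing.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature        # [N, N] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal = correct pair
    loss_i2t = F.cross_entropy(logits, targets)          # match each image to its caption
    loss_t2i = F.cross_entropy(logits.T, targets)        # match each caption to its image
    return (loss_i2t + loss_t2i) / 2
```

The point relevant to the argument here is that the loss only scores whole-image against whole-caption agreement, so any detail the caption never mentions contributes nothing to training.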
But the thing is, if you go back and look at these two images, what is a caption that would distinguish 00:05:42.240 |
them, right? It's the peculiar pose of the dog: in one image it's facing the camera 00:05:48.640 |
and in one it's facing away. But these are the sort of details that aren't included in the caption. 00:05:52.320 |
So if your loss function can't tell these two images apart, then why would your model be able to, 00:05:56.400 |
right? So the claim is that vision-only pre-training kind of works. So DINOv2 is this really cool model. 00:06:04.960 |
What you're seeing right now is a visualization of its PCA features, which have been self-discovered 00:06:10.960 |
by pre-training on the whole internet. What's really cool is that not only does it find the mask of the dog, 00:06:17.680 |
obviously -- that's sort of easy because it's highly contrasted with the green background -- 00:06:22.240 |
but it also finds the segments of the dog, and it even finds analogous segments. So if you look at 00:06:28.800 |
these principal components and you compare the legs of the dog, they land in the same sort of feature space as the legs of a human. 00:06:33.760 |
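Here's a minimal sketch of how that kind of PCA visualization is typically produced, assuming you already have the DINOv2 patch embeddings for an image as an (H, W, D) array; this is not the exact code behind the figure in the talk.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_feature_map(patch_feats: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project per-patch features to 3 principal components and scale to [0, 1]
    so they can be displayed as an RGB image (one color channel per component)."""
    h, w, d = patch_feats.shape
    flat = patch_feats.reshape(-1, d)                  # (H*W, D) patch embeddings
    comps = PCA(n_components=n_components).fit_transform(flat)
    comps = (comps - comps.min(0)) / (comps.max(0) - comps.min(0) + 1e-8)
    return comps.reshape(h, w, n_components)           # (H, W, 3) pseudo-color map
```

Patches whose features point in similar directions, like dog legs and human legs, end up with similar colors in the resulting map.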
And so there's this big open question, which is: how do we get vision 00:06:41.920 |
features that are well aligned with language features, usable by VLMs, that don't suck, and 00:06:48.880 |
that have real visual fidelity? Cool. So that's part of the story. The other part of the question that needs 00:06:58.880 |
to be answered is given that we have some sort of semi-working large pre-training of vision models, 00:07:05.520 |
why aren't we leveraging these vision models? And I would answer that at least in the object detection space, 00:07:10.800 |
the answer is mostly in the distinction between convolutional models and transformers. 00:07:16.320 |
So this is from LW-DETR, which is one of the top performing detection transformers that currently exists. 00:07:25.200 |
If you look at this graph and you look at YOLOv8N, which is a convolutional object detector on the edge, 00:07:31.760 |
with and without pre-training on Objects365, it gains like 0.2 mAP. mAP is the main accuracy metric 00:07:38.000 |
for object detectors. So Objects365, which is a big 1.6 million image dataset -- 00:07:44.160 |
pre-training on it leads to almost no performance improvement on COCO. Whereas for LW-DETR, which is a 00:07:52.880 |
transformer-based model, if you look at the mAP column without pre-training 00:07:58.400 |
and you look at the mAP column with pre-training, you can see that you're getting like five mAP of 00:08:02.240 |
improvement across the board, sometimes even seven mAP of improvement, which is a gigantic amount, 00:08:07.040 |
right? And so basically, while the language world knows that transformers are able to leverage big 00:08:14.640 |
pre-trainings and yield decent results, the visual world is sort of just now catching up. And you can see 00:08:22.000 |
this from the scale of the big pre-training. In the image world, pre-training on Objects365 with 1.6 00:08:28.080 |
million images is considered a large pre-training. That would be like a tiny challenge dataset for 00:08:34.080 |
like undergrads in the LLM world. So I want to announce Roboflow's new model called RF-DETR, 00:08:44.480 |
which leverages the DINOv2 pre-trained backbone and uses it in a real-time object detection context. 00:08:52.800 |
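To make the idea concrete, here is a deliberately simplified conceptual sketch of a DETR-style detection head sitting on top of a pre-trained ViT backbone; this is not RF-DETR's actual architecture or code, and the layer sizes, query count, and backbone interface are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class DinoBackedDetector(nn.Module):
    """Toy DETR-style detector on top of a pre-trained ViT backbone.

    `backbone` is assumed to map an image batch to patch tokens of shape
    [B, num_patches, feat_dim] (as a DINOv2 ViT can); everything else here
    is a deliberately minimal stand-in for a real transformer decoder head.
    """

    def __init__(self, backbone: nn.Module, feat_dim: int = 384,
                 hidden_dim: int = 256, num_queries: int = 100,
                 num_classes: int = 80):
        super().__init__()
        self.backbone = backbone                      # pre-trained, e.g. a DINOv2 ViT-S
        self.input_proj = nn.Linear(feat_dim, hidden_dim)
        self.queries = nn.Embedding(num_queries, hidden_dim)
        decoder_layer = nn.TransformerDecoderLayer(hidden_dim, nhead=8,
                                                   batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=3)
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(hidden_dim, 4)                  # (cx, cy, w, h)

    def forward(self, images: torch.Tensor):
        tokens = self.backbone(images)                # [B, P, feat_dim] patch features
        memory = self.input_proj(tokens)              # [B, P, hidden_dim]
        q = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        decoded = self.decoder(q, memory)             # queries attend to patch features
        return self.class_head(decoded), self.box_head(decoded).sigmoid()
```

The design point the talk is making is simply that the backbone's features come from large-scale self-supervised pre-training rather than being learned from scratch on a detection dataset.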
So this is sort of our answer to the hole that we see in the field: why aren't we leveraging big 00:09:00.160 |
pre-trainings for visual models? And so here are some of the metrics. You can see that 00:09:08.800 |
basically what we did is we took the LW-DETR backbone and we kind of swapped it out for 00:09:12.560 |
the DINOv2 backbone. And we get a decent improvement on COCO. We're still not SOTA 00:09:19.760 |
on COCO compared to D-FINE, which is the current SOTA; we're like second. But I think what's 00:09:25.840 |
really interesting is there's this other dataset called RF100-VL, which we created to measure the sort of 00:09:32.880 |
domain adaptability of this model. And you can see massive gains from using the DINOv2 pre-trained 00:09:39.360 |
backbone, which is basically pointing to the fact that, number one, COCO is too easily solvable. 00:09:44.560 |
It basically has common classes like humans and coffee cups and stuff like that. So it's not 00:09:51.200 |
a good measure of the intelligence of your model. Rather, the way that you optimize COCO is by 00:09:56.080 |
really nailing the precise location of a bounding box, really having good iterative 00:10:01.680 |
refinement of the locations that you're guessing. Whereas we posit that RF100-VL, this new dataset, is a 00:10:10.640 |
better measure of the intelligence of a visual model. So we're introducing a new dataset, RF100-VL, 00:10:18.240 |
which is a collection of 100 different object detection datasets that were pulled from our open source 00:10:23.920 |
collection of datasets. We have something like 750,000 datasets or 00:10:30.640 |
whatever on Roboflow Universe. And we hand curated the 100 best by some metrics. So we sorted by 00:10:39.600 |
community engagement and we tried to find very difficult domains. So you'll notice, for instance, 00:10:46.480 |
we have camera poses different from those common in COCO. So we have aerial camera positioning and 00:10:53.920 |
such, which requires your model to sort of understand different views of an object in order 00:10:59.840 |
to do well. We have different visual imaging domains, like you can see microscopes and x-rays and all 00:11:06.880 |
these sorts of things. So yeah, we think that this dataset can measure the richness of the features that are 00:11:15.280 |
learned by object detectors in a much more comprehensive way than COCO. And the other fun thing 00:11:23.200 |
about this is that it is a vision language benchmark. So we are able to benchmark a bunch of different models 00:11:28.720 |
on RF100-VL, asking them to do things like contextualizing the class name in the context of this 00:11:36.240 |
dataset -- where is this action happening, for instance? So if you look at the top left, we have this class, which is block, 00:11:44.160 |
which is representing an action, a volleyball block. But you have to be smart enough to contextualize this 00:11:49.280 |
like word embedding of block within the context of volleyball to be able to detect that. Same thing with 00:11:54.320 |
this thunderbolt type defect in this cable here. If you just ask a dumb visual language model to detect 00:12:00.720 |
thunderbolts in the image, it will find nothing. But if it contextualizes it in the context of a cable 00:12:05.200 |
defect, then it will be able to find more things. And it also increases the breadth of classes. So if you 00:12:11.680 |
only look at COCO, you're basically asking your model, hey, can you find a dog? Can you find a cat? But 00:12:17.760 |
can you find fibrosis? Now your model needs to have a lot more information 00:12:22.240 |
about the world to solve that problem. Same thing with different imaging domains. 00:12:27.040 |
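As a rough illustration of what contextualizing the class name can look like when prompting a VLM, here's a hedged sketch; the prompt wording and the commented-out `query_vlm` call are made up for illustration and are not the RF100-VL evaluation harness.

```python
def build_detection_prompt(class_name: str,
                           dataset_context: str,
                           annotator_instructions: str) -> str:
    """Compose a grounded detection prompt that embeds the class name
    in the dataset's context instead of asking for the bare word."""
    return (
        f"You are looking at an image from a dataset about {dataset_context}. "
        f"Detect every instance of the class '{class_name}'. "
        f"Annotator instructions: {annotator_instructions} "
        f"Return one bounding box per instance as (x_min, y_min, x_max, y_max)."
    )

# Bare prompt ("find thunderbolts") would likely return nothing useful;
# the same class name framed as a cable defect gives the model a chance.
prompt = build_detection_prompt(
    class_name="thunderbolt",
    dataset_context="defect inspection of industrial cables",
    annotator_instructions="<description of this defect from the dataset's labeling guide>",
)
# boxes = query_vlm(image, prompt)   # hypothetical VLM call
```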
So it is a vision language benchmark. So we also have visual descriptions and sort of instructions on 00:12:36.800 |
how to find the objects that are present in the image. And basically what we found is, if you take 00:12:43.600 |
a YOLOv8 model and you train it on like 10 examples per class, it does better than 00:12:50.320 |
Qwen2.5-VL 72B, a state-of-the-art, gigantic vision language model. So the vision 00:12:58.960 |
language models are really good right now at generalizing out of distribution in the linguistic 00:13:04.000 |
domain, but absolutely hopeless when it comes to generalizing in the visual domain. And so we hope 00:13:09.520 |
that this benchmark can sort of drive that part of the research and make sure that the visual parts of 00:13:17.520 |
VLMs don't get left behind. And yeah, basically by leveraging stronger embeddings, a DETR model 00:13:27.040 |
does much, much better on RF100-VL than just leveraging embeddings learned on Objects365, which makes sense. 00:13:43.920 |
Oh, yeah, yeah, yeah. It's like 20 million parameters at the small size. Yeah. Cool. Any other questions? 00:13:51.920 |
This works. Yeah, it's publicly available. If you go to RF100VL.org, 00:14:05.680 |
you can find our arXiv paper as well as the code utilities to help download the dataset. It's also 00:14:12.320 |
on Hugging Face somewhere. Yeah. Yeah, so Roboflow kind of has a pretty unique strategy when it comes to 00:14:23.920 |
our platform. So we make our platform freely available to all researchers, basically. And so we have like 00:14:30.240 |
a ton of people who use our platform to label medical data and biological data for their own papers and 00:14:36.640 |
their own research. And then our only ask is that they then contribute that data back to the community 00:14:42.480 |
and make it open source. And so a lot of this data comes from papers cited in Nature and stuff like that. 00:14:47.280 |
Yeah, so the dataset is kind of measuring the 00:14:59.920 |
performance across a bunch of different imaging modalities, or predictive modalities, I guess. 00:15:10.560 |
So I think the most interesting track of the dataset is the few-shot track. 00:15:28.560 |
So basically we've constructed canonical 10-shot splits. So we provide the model the class name, 00:15:38.560 |
annotator instructions on how to find that class, as well as 10 visual examples per class. And 00:15:45.360 |
basically no model exists that can leverage those three things and get higher mAP than if you 00:15:52.000 |
just deleted one of those inputs. I see that as one of the big shortcomings of visual language models. 00:15:57.760 |
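For concreteness, the few-shot protocol just described could be scored with a loop roughly like the one below; `load_10shot_split`, `adapt_model`, and `evaluate_map` are hypothetical callables supplied by the caller, not RF100-VL's actual tooling.

```python
from statistics import mean

def evaluate_few_shot_track(model, datasets,
                            load_10shot_split, adapt_model, evaluate_map):
    """Score a model on a 10-shot track: for each dataset the model gets the
    class names, annotator instructions, and 10 labeled examples per class,
    then is scored by mAP on that dataset's held-out test split."""
    scores = []
    for ds in datasets:
        class_names, instructions, support_set, test_set = load_10shot_split(ds)
        adapted = adapt_model(model, class_names, instructions, support_set)  # fine-tune or prompt
        scores.append(evaluate_map(adapted, test_set))                        # COCO-style mAP
    return mean(scores)  # average mAP across the datasets
```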
Yeah, so currently the specialists are by far the best. 00:16:27.920 |
We benchmarked Grounding DINO specifically, both zero-shot and fine-tuned. So zero-shot Grounding 00:16:32.720 |
DINO got like 19 mAP average on RF100-VL, which is kind of good, kind of bad. So if you take 00:16:38.400 |
a YOLOv8 nano and you train it from scratch on the 10-shot examples, which is not a lot of data, 00:16:43.840 |
obviously, it gets something like 25 mAP. So to be worse than fine-tuning a YOLO from scratch is sort of bad. But if you then fine-tune the Grounding DINO with a federated loss, that's the highest performing model we have on the dataset. 00:16:56.240 |
However, that being said, I think that the point of the dataset should be: hey, you should be able to leverage these annotator instructions, the 10-shot examples, and the class names, and come up with something more accurate, which requires a generalist model. But okay, I think I'm super over time. So yeah, thanks for the questions. Cool. Thanks, everyone.