Vision AI in 2025 — Peter Robicheaux, Roboflow

00:00:00.000 |
I'm going to be giving a quick presentation about the State of the Union regarding AI Vision. 00:00:20.280 |
So I'm Peter Robicheaux. I'm the ML lead at Roboflow, which is a platform for building and 00:00:29.620 |
deploying vision models. A lot of people are really interested in LLMs these days, so I'm 00:00:37.400 |
trying to pitch why computer vision matters. If you think about systems that interact with 00:00:44.240 |
the real world, they have to use vision as one of their primary inputs because the built 00:00:50.240 |
world is sort of built around vision as a fundamental primitive. There's a big gap between where human vision 00:00:59.240 |
is and where computer vision is. I would argue a bigger gap than currently exists between human 00:01:05.920 |
speech and computer speech. Computer vision has its own set of problems that are very distinct 00:01:15.940 |
from the problems that need to be solved by LLMs. Latency usually matters. If you want to 00:01:21.260 |
perceive motion, you have to be running your processing at multiple frames per second. You usually 00:01:27.420 |
want to run at the edge. You can't have one big hub where you do all of your computation 00:01:32.900 |
because you would introduce too much latency to make decisions based off that computation. 00:01:40.100 |
So I sort of gave a version of this talk on the Latent Space podcast at NeurIPS. And retrospectively, 00:01:49.480 |
I think we identified a few problems with the field of vision in 2024, one of them being 00:01:56.280 |
that evals are saturated. Vision evals like ImageNet and COCO are mostly pattern matching. 00:02:03.160 |
They measure your ability to match patterns and don't really require much visual intelligence 00:02:08.960 |
to solve. Consequently, I think, vision models don't leverage big pre-training the way that 00:02:17.720 |
language models do. So right now, you can take a language model and unleash it on the internet 00:02:21.900 |
and get something incredibly smart. Some of the best vision models are moving in that direction. 00:02:27.860 |
But because you don't need that level of knowledge and intelligence to solve the evals, there's kind 00:02:34.360 |
of no incentive to do so. And I think part of that -- so there are sort of two dimensions 00:02:41.440 |
here. One is that vision doesn't leverage big pre-training. You can think of it like this: if you're 00:02:46.840 |
building an application with language right now, you probably want to use the smartest model 00:02:50.440 |
to get an embedding that works really well for you. In language, there are 00:02:55.000 |
downstream applications that make really good use of the pre-training and the embeddings that 00:02:59.560 |
they get from large language models. But there aren't really good vision models that can leverage 00:03:05.720 |
these embeddings. And the corollary to this is that the quality of big pre-trained models just isn't the 00:03:13.960 |
same in vision as it is in language. And so my underlying conclusion is vision models aren't smart. That's 00:03:20.640 |
the takeaway. And I can prove it to you. So last year, when Claude 3.5 was the latest, you could give it an 00:03:27.360 |
image of a watch and it just guesses -- you ask it what time it is and it'll just guess a random time. 00:03:32.480 |
And that's because this model has a good conceptual, abstract idea of what a clock is or what a watch 00:03:38.480 |
is, but when it comes to actually identifying the location of the watch hands and finding the numbers on the 00:03:43.280 |
watch, it's hopeless. And updated for Claude 4: it still has no idea what time it is. And this is an even 00:03:50.880 |
more egregious failure because 10:10 is the stock time on basically all watches. So the fact that it 00:03:57.040 |
couldn't even get the most common time is pretty telling. So there's this really cool 00:04:05.440 |
dataset called MMVP that's trying to measure this inability of LLMs to see. You can 00:04:13.840 |
see an example here where they ask a question that seems incredibly obvious. So 00:04:19.920 |
in this case, they asked the model, which is GPT-4o, which direction the school bus is 00:04:26.960 |
facing. Are we seeing the front or the back of the school bus? And the model gets it completely wrong 00:04:31.840 |
and then hallucinates details to support its claim. And again, I think this is evidence that large 00:04:36.880 |
language models, which are maybe the most intelligent models that we have, like cannot see. And that is 00:04:43.280 |
due to a lack of visual features that they can perceive with. And so the way that this dataset was 00:04:48.720 |
created is they went and found pairs of images that were close in CLIP space but far apart in DINOv2 space. 00:04:55.840 |
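To make that pair-mining step concrete, here's a minimal sketch of the idea, assuming you already have `clip_embed` and `dino_embed` callables that map an image to an embedding vector; the thresholds are purely illustrative and are not the values the MMVP authors used.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_clip_blind_pairs(images, clip_embed, dino_embed,
                          clip_thresh=0.95, dino_thresh=0.6):
    """Return image pairs that CLIP thinks are near-identical
    but DINOv2 considers clearly different ("CLIP-blind" pairs)."""
    clip_vecs = [clip_embed(im) for im in images]
    dino_vecs = [dino_embed(im) for im in images]
    pairs = []
    for i in range(len(images)):
        for j in range(i + 1, len(images)):
            if (cosine(clip_vecs[i], clip_vecs[j]) > clip_thresh and
                    cosine(dino_vecs[i], dino_vecs[j]) < dino_thresh):
                pairs.append((i, j))
    return pairs
```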
So CLIP is a vision language model that was contrastively trained on basically the whole internet. 00:05:01.200 |
And so what this is showing is that CLIP is not discriminative enough to tell these two images 00:05:13.440 |
apart, right? So according to CLIP, these two images basically look the same. And what that's 00:05:17.920 |
pointing to is a failure in vision language pre-training. And so the way CLIP is trained is 00:05:25.360 |
basically you come up with a big dataset of captioned images, you 00:05:32.400 |
scramble the captions and the images, and you ask the model to pair each image with its caption. 00:05:36.800 |
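As a rough sketch of that matching objective, here is the standard symmetric contrastive (InfoNCE-style) loss over a batch of paired image and text embeddings; the encoders are assumed to live elsewhere, and the temperature is just a placeholder value.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of N paired embeddings.

    image_emb, text_emb: [N, D] tensors where row i of each is a matched pair.
    Every other row in the batch acts as a negative ("scrambled") pairing.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature        # [N, N] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal = correct pair
    loss_i2t = F.cross_entropy(logits, targets)          # match each image to its caption
    loss_t2i = F.cross_entropy(logits.T, targets)        # match each caption to its image
    return (loss_i2t + loss_t2i) / 2
```

The point relevant to the argument here is that the loss only scores whole-image against whole-caption agreement, so any detail the caption never mentions contributes nothing to training.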
But the thing is, if you go back and look at these two images, what is a caption that would distinguish 00:05:42.240 |
them, right? It's the peculiar pose of the dog: in one image it's facing the camera 00:05:48.640 |
and in one it's facing away. But these are the sort of details that aren't included in the caption. 00:05:52.320 |
So if your loss function can't tell these two images apart, then why would your model be able to, 00:05:56.400 |
right? So the claim is that vision-only pre-training kind of works. So DINOv2 is this really cool model. 00:06:04.960 |
What you're seeing right now is a visualization of its PCA features, which have been self-discovered 00:06:10.960 |
by pre-training on the whole internet. What's really cool is that not only does it find the mask of the dog, 00:06:17.680 |
obviously -- that's sort of easy because it's highly contrasted with the green background -- 00:06:22.240 |
but it also finds the segments of the dog, and it even finds analogous segments. So if you look at 00:06:28.800 |
these principal components and you compare the legs of the dog, they land in the same sort of feature space as the legs of a human. 00:06:33.760 |
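Here's a minimal sketch of how that kind of PCA visualization is typically produced, assuming you already have the DINOv2 patch embeddings for an image as an (H, W, D) array; this is not the exact code behind the figure in the talk.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_feature_map(patch_feats: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project per-patch features to 3 principal components and scale to [0, 1]
    so they can be displayed as an RGB image (one color channel per component)."""
    h, w, d = patch_feats.shape
    flat = patch_feats.reshape(-1, d)                  # (H*W, D) patch embeddings
    comps = PCA(n_components=n_components).fit_transform(flat)
    comps = (comps - comps.min(0)) / (comps.max(0) - comps.min(0) + 1e-8)
    return comps.reshape(h, w, n_components)           # (H, W, 3) pseudo-color map
```

Patches whose features point in similar directions, like dog legs and human legs, end up with similar colors in the resulting map.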
And so there's this big open question, which is: how do we get vision 00:06:41.920 |
features that are well aligned with language features, usable by VLMs, that don't suck, and 00:06:48.880 |
that have real visual fidelity? Cool. So that's part of the story. The other part of the question that needs 00:06:58.880 |
to be answered is given that we have some sort of semi-working large pre-training of vision models, 00:07:05.520 |
why aren't we leveraging these vision models? And I would answer that at least in the object detection space, 00:07:10.800 |
the answer is mostly in the distinction between convolutional models and transformers. 00:07:16.320 |
So this is from LW-DETR, which is one of the top performing detection transformers that currently exists. 00:07:25.200 |
If you look at this graph and you look at YOLOv8N, which is a convolutional object detector on the edge, 00:07:31.760 |
with and without pre-training on Objects365, it gains like 0.2 mAP. mAP is the main accuracy metric 00:07:38.000 |
for object detectors. So Objects365, which is a big 1.6 million image dataset -- 00:07:44.160 |
pre-training on it leads to almost no performance improvement on COCO. Whereas for LW-DETR, which is a 00:07:52.880 |
transformer-based model, if you look at the mAP column without pre-training 00:07:58.400 |
and you look at the mAP column with pre-training, you can see that you're getting like five mAP of 00:08:02.240 |
improvement across the board, sometimes even seven mAP of improvement, which is a gigantic amount, 00:08:07.040 |
right? And so basically, while the language world knows that transformers are able to leverage big 00:08:14.640 |
pre-trainings and yield decent results, the visual world is sort of just now catching up. And you can see 00:08:22.000 |
this from the scale of the big pre-training. In the image world, pre-training on Objects365 with 1.6 00:08:28.080 |
million images is considered a large pre-training. That would be like a tiny challenge dataset for 00:08:34.080 |
like undergrads in the LLM world. So I want to announce Roboflow's new model called RF-DETR, 00:08:44.480 |
which leverages the DINOv2 pre-trained backbone and uses it in a real-time object detection context. 00:08:52.800 |
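To make the idea concrete, here is a deliberately simplified conceptual sketch of a DETR-style detection head sitting on top of a pre-trained ViT backbone; this is not RF-DETR's actual architecture or code, and the layer sizes, query count, and backbone interface are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class DinoBackedDetector(nn.Module):
    """Toy DETR-style detector on top of a pre-trained ViT backbone.

    `backbone` is assumed to map an image batch to patch tokens of shape
    [B, num_patches, feat_dim] (as a DINOv2 ViT can); everything else here
    is a deliberately minimal stand-in for a real transformer decoder head.
    """

    def __init__(self, backbone: nn.Module, feat_dim: int = 384,
                 hidden_dim: int = 256, num_queries: int = 100,
                 num_classes: int = 80):
        super().__init__()
        self.backbone = backbone                      # pre-trained, e.g. a DINOv2 ViT-S
        self.input_proj = nn.Linear(feat_dim, hidden_dim)
        self.queries = nn.Embedding(num_queries, hidden_dim)
        decoder_layer = nn.TransformerDecoderLayer(hidden_dim, nhead=8,
                                                   batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=3)
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(hidden_dim, 4)                  # (cx, cy, w, h)

    def forward(self, images: torch.Tensor):
        tokens = self.backbone(images)                # [B, P, feat_dim] patch features
        memory = self.input_proj(tokens)              # [B, P, hidden_dim]
        q = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        decoded = self.decoder(q, memory)             # queries attend to patch features
        return self.class_head(decoded), self.box_head(decoded).sigmoid()
```

The design point the talk is making is simply that the backbone's features come from large-scale self-supervised pre-training rather than being learned from scratch on a detection dataset.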
So this is sort of our answer to the hole that we see in the field: why aren't we leveraging big 00:09:00.160 |
pre-trainings for visual models? And so here are some of the metrics. You can see that 00:09:08.800 |
basically what we did is we took the LW-DETR backbone and we kind of swapped it out for 00:09:12.560 |
the DINOv2 backbone. And we get a decent improvement on COCO. We're still not SOTA 00:09:19.760 |
on COCO compared to D-FINE, which is the current SOTA; we're like second. But I think what's 00:09:25.840 |
really interesting is there's this other dataset called RF100-VL, which we created to measure the sort of 00:09:32.880 |
domain adaptability of this model. And you can see massive gains from using the DINOv2 pre-trained 00:09:39.360 |
backbone, which is basically pointing to the fact that, number one, COCO is too easily solvable. 00:09:44.560 |
It basically has common classes like humans and coffee cups and stuff like that. So it's not 00:09:51.200 |
a good measure of the intelligence of your model. Rather, the way that you optimize COCO is by 00:09:56.080 |
really nailing the precise location of a bounding box, really having good iterative 00:10:01.680 |
refinement of the locations that you're guessing. Whereas we posit that RF100-VL, this new dataset, is a 00:10:10.640 |
better measure of the intelligence of a visual model. So we're introducing a new dataset, RF100-VL, 00:10:18.240 |
which is a collection of 100 different object detection datasets that were pulled from our open source 00:10:23.920 |
collection of datasets. We have something like 750,000 datasets or 00:10:30.640 |
whatever on Roboflow Universe. And we hand curated the 100 best by some metrics. So we sorted by 00:10:39.600 |
community engagement and we tried to find very difficult domains. So you'll notice, for instance, 00:10:46.480 |
we have camera poses different from those common in COCO. So we have aerial camera positioning and 00:10:53.920 |
such, which requires your model to sort of understand different views of an object in order 00:10:59.840 |
to do well. We have different visual imaging domains, like you can see microscopes and x-rays and all 00:11:06.880 |
these sorts of things. So yeah, we think that this dataset can measure the richness of the features that are 00:11:15.280 |
learned by object detectors in a much more comprehensive way than COCO. And the other fun thing 00:11:23.200 |
about this is that it is a vision language benchmark. So we are able to benchmark a bunch of different models 00:11:28.720 |
on RF100-VL, asking them to do things like contextualizing the class name in the context of this 00:11:36.240 |
dataset -- where is this action happening, for instance? So if you look at the top left, we have this class, which is block, 00:11:44.160 |
which is representing an action, a volleyball block. But you have to be smart enough to contextualize this 00:11:49.280 |
like word embedding of block within the context of volleyball to be able to detect that. Same thing with 00:11:54.320 |
this thunderbolt type defect in this cable here. If you just ask a dumb visual language model to detect 00:12:00.720 |
thunderbolts in the image, it will find nothing. But if it contextualizes it in the context of a cable 00:12:05.200 |
defect, then it will be able to find more things. And it also increases the breadth of classes. So if you 00:12:11.680 |
only look at COCO, you're basically asking your model, hey, can you find a dog? Can you find a cat? But 00:12:17.760 |
can you find fibrosis? Now your model needs to have a lot more information 00:12:22.240 |
about the world to solve that problem. Same thing with different imaging domains. 00:12:27.040 |
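As a rough illustration of what contextualizing the class name can look like when prompting a VLM, here's a hedged sketch; the prompt wording and the commented-out `query_vlm` call are made up for illustration and are not the RF100-VL evaluation harness.

```python
def build_detection_prompt(class_name: str,
                           dataset_context: str,
                           annotator_instructions: str) -> str:
    """Compose a grounded detection prompt that embeds the class name
    in the dataset's context instead of asking for the bare word."""
    return (
        f"You are looking at an image from a dataset about {dataset_context}. "
        f"Detect every instance of the class '{class_name}'. "
        f"Annotator instructions: {annotator_instructions} "
        f"Return one bounding box per instance as (x_min, y_min, x_max, y_max)."
    )

# Bare prompt ("find thunderbolts") would likely return nothing useful;
# the same class name framed as a cable defect gives the model a chance.
prompt = build_detection_prompt(
    class_name="thunderbolt",
    dataset_context="defect inspection of industrial cables",
    annotator_instructions="<description of this defect from the dataset's labeling guide>",
)
# boxes = query_vlm(image, prompt)   # hypothetical VLM call
```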
So it is a vision language benchmark. So we also have visual descriptions and sort of instructions on 00:12:36.800 |
how to find the objects that are present in the image. And basically what we found is, if you take 00:12:43.600 |
a YOLOv8 model and you train it on like 10 examples per class, it does better than 00:12:50.320 |
Qwen2.5-VL 72B, a state-of-the-art, gigantic vision language model. So the vision 00:12:58.960 |
language models are really good right now at generalizing out of distribution in the linguistic 00:13:04.000 |
domain, but absolutely hopeless when it comes to generalizing in the visual domain. And so we hope 00:13:09.520 |
that this benchmark can sort of drive that part of the research and make sure that the visual parts of 00:13:17.520 |
VLMs don't get left behind. And yeah, basically by leveraging stronger embeddings, a DETR model 00:13:27.040 |
does much, much better on RF100-VL than just leveraging embeddings learned on Objects365, which makes sense. 00:13:43.920 |
Oh, yeah, yeah, yeah. It's like 20 million parameters at the small size. Yeah. Cool. Any other questions? 00:13:51.920 |
This works. Yeah, it's publicly available. If you go to RF100VL.org, 00:14:05.680 |
you can find our arXiv paper as well as the code utilities to help download the dataset. It's also 00:14:12.320 |
on Hugging Face somewhere. Yeah. Yeah, so Roboflow kind of has a pretty unique strategy when it comes to 00:14:23.920 |
our platform. So we make our platform freely available to all researchers, basically. And so we have like 00:14:30.240 |
a ton of people who use our platform to label medical data and biological data for their own papers and 00:14:36.640 |
their own research. And then our only ask is that they then contribute that data back to the community 00:14:42.480 |
and make it open source. And so a lot of this data comes from papers cited in Nature and stuff like that. 00:14:47.280 |
Yeah, so the dataset is kind of measuring the 00:14:59.920 |
performance across a bunch of different imaging modalities, or predictive modalities, I guess. 00:15:10.560 |
So I think the most interesting track of the dataset is the few-shot track. 00:15:28.560 |
So basically we've constructed canonical 10-shot splits. So we provide the model the class name, 00:15:38.560 |
annotator instructions on how to find that class, as well as 10 visual examples per class. And 00:15:45.360 |
basically no model exists that can leverage those three things and get higher mAP than if you 00:15:52.000 |
just deleted one of those inputs. I see that as one of the big shortcomings of visual language models. 00:15:57.760 |
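For concreteness, the few-shot protocol just described could be scored with a loop roughly like the one below; `load_10shot_split`, `adapt_model`, and `evaluate_map` are hypothetical callables supplied by the caller, not RF100-VL's actual tooling.

```python
from statistics import mean

def evaluate_few_shot_track(model, datasets,
                            load_10shot_split, adapt_model, evaluate_map):
    """Score a model on a 10-shot track: for each dataset the model gets the
    class names, annotator instructions, and 10 labeled examples per class,
    then is scored by mAP on that dataset's held-out test split."""
    scores = []
    for ds in datasets:
        class_names, instructions, support_set, test_set = load_10shot_split(ds)
        adapted = adapt_model(model, class_names, instructions, support_set)  # fine-tune or prompt
        scores.append(evaluate_map(adapted, test_set))                        # COCO-style mAP
    return mean(scores)  # average mAP across the datasets
```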
Yeah, so currently the specialists are by far the best. 00:16:27.920 |
We benchmarked Grounding DINO specifically, both zero-shot and fine-tuned. So zero-shot Grounding 00:16:32.720 |
DINO got like 19 mAP average on RF100-VL, which is kind of good, kind of bad. So if you take 00:16:38.400 |
a YOLOv8 nano and you train it from scratch on the 10-shot examples, which is not a lot of data, 00:16:43.840 |
obviously, it gets something like 25 mAP. So to be worse than fine-tuning a YOLO from scratch is sort of bad. But if you then fine-tune the Grounding DINO with a federated loss, that's the highest performing model we have on the dataset. 00:16:56.240 |
However, that being said, I think that the point of the dataset should be: hey, you should be able to leverage these annotator instructions, the 10-shot examples, and the class names, and come up with something more accurate, which requires a generalist model. But okay, I think I'm super over time. So yeah, thanks for the questions. Cool. Thanks, everyone.