I'm going to be giving a quick presentation on the state of the union of AI vision. I'm Peter Robichaux, the ML lead at Roboflow, which is a platform for building and deploying vision models. A lot of people are really interested in LLMs these days, so I'm going to try to pitch why computer vision matters.
If you think about systems that interact with the real world, they have to use vision as one of their primary inputs, because the built world is designed around vision as a fundamental primitive. There's a big gap between where human vision is and where computer vision is. I would argue it's a bigger gap than currently exists between human speech and computer speech.
Computer vision has its own set of problems that are very distinct from the problems that need to be solved by LLMs. Latency usually matters: if you want to perceive motion, you have to be running your perception pipeline at multiple frames per second. You usually want to run at the edge, too. You can't have one big hub where you do all of your computation, because you would introduce too much latency to make decisions based on that computation.
I gave a version of this talk on the Latent Space podcast at NeurIPS, and in retrospect I think we identified a few problems with the field of vision in 2024, one of them being that evals are saturated. Vision evals like ImageNet and COCO are mostly pattern matching.
They measure your ability to match patterns and don't require much visual intelligence to solve. Consequently, I think, vision models don't leverage big pre-training the way that language models do. Right now, you can take a language model, unleash it on the internet, and get something incredibly smart.
Some of the best vision models are moving in that direction, but because you don't need that level of knowledge and intelligence to solve the evals, there's kind of no incentive to do so. So there are really two dimensions here. One is that vision doesn't leverage big pre-training.
Think of it this way: if you're building an application with language right now, you probably want to use the smartest model to get an embedding that works really well for you, and there are downstream applications that make really good use of the pre-training and the embeddings that come out of large language models. But there aren't really good vision models that can leverage embeddings in the same way. The corollary is that the quality of big pre-trained models just isn't the same in vision as it is in language. So my underlying conclusion is that vision models aren't smart. That's the takeaway, and I can prove it to you.
Last year, when Claude 3.5 was current, you could give it an image of a watch, ask it what time it is, and it would just guess a random time. That's because the model has a good conceptual, abstract idea of what a clock or a watch is, but when it comes to actually identifying the location of the watch hands and finding the numbers on the watch face, it's hopeless. Updated for Claude 4: it still has no idea what time it is. And this is an especially egregious failure, because 10:10 is the stock time shown on basically every watch photo, so the fact that it couldn't even get the most common time is pretty telling.
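This kind of probe is easy to reproduce yourself. Below is a minimal sketch using the Anthropic Python SDK; the image path and the specific model name are placeholders, so substitute whichever vision-capable model you want to test.

```python
import base64
import anthropic  # assumes the official `anthropic` Python SDK is installed

# Placeholder image of a watch face.
with open("watch.jpg", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder: any vision-capable model
    max_tokens=256,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/jpeg", "data": image_b64}},
            {"type": "text", "text": "What time does this watch show?"},
        ],
    }],
)
print(message.content[0].text)
```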
There's this really cool dataset called MMVP that tries to measure this inability of LLMs to see. You can see an example here where they ask a question that seems incredibly obvious. In this case, they ask the model, which is GPT-4o, which direction the school bus is facing.
Are we seeing the front or the back of the school bus? And the model gets it completely wrong and then hallucinates details to support its claim. Again, I think this is evidence that large language models, which are maybe the most intelligent models we have, cannot see.
And that is due to a lack of visual features that they can perceive with. The way this dataset was created is they found pairs of images that are close in CLIP space but far apart in DINOv2 space. CLIP is a vision-language model that was contrastively trained on the whole internet.
So what this is showing is that CLIP is not discriminative enough to tell these two images apart. According to CLIP, these two images basically look the same, and that points to a failure in vision-language pre-training. The way CLIP is trained is basically: you collect a big dataset of captioned images, scramble the captions and the images, and ask the model to pair each image with its caption.
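To make that concrete, here's a minimal sketch of a CLIP-style contrastive objective (a simplification of the real training recipe, not OpenAI's actual code). The only signal the loss ever sees is the pairing between images and captions, so anything a caption doesn't mention is invisible to it.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Simplified CLIP-style contrastive loss.

    image_emb, text_emb: (batch, dim) embeddings of paired images and captions;
    the i-th image and i-th caption are a true pair, everything else a negative.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Symmetric cross-entropy: match each image to its caption and vice versa.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```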
But the thing is, if you go back and look at these two images, what caption would actually distinguish them? It's the peculiar pose of the dog: in one image it's facing the camera and in the other it's facing away. Those are the sorts of details that don't make it into a caption.
So if your loss function can't tell these two images apart, why would your model be able to? The claim, then, is that vision-only pre-training kind of works. DINOv2 is this really cool model, and what you're seeing right now is a visualization of the PCA of its features, which were self-discovered by pre-training on the whole internet.
What's really cool is that not only does it find the mask of the dog, which is sort of easy because it's highly contrasted against the green background, it also finds the segments of the dog, and it even finds analogous segments. If you look at these principal components, the legs of a dog land in the same sort of feature space as the legs of a human.
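This kind of visualization is straightforward to reproduce with the publicly released DINOv2 weights. A rough sketch, assuming the torch hub checkpoint and its `forward_features` output keys behave as in the released repo; the image path is a placeholder.

```python
import torch
import torchvision.transforms as T
from PIL import Image

# Load a pretrained DINOv2 ViT-B/14 backbone from torch hub.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

# Input side lengths must be divisible by the patch size (14); 518 / 14 = 37.
preprocess = T.Compose([
    T.Resize((518, 518)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
x = preprocess(Image.open("dog.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    patches = model.forward_features(x)["x_norm_patchtokens"][0]  # (n_patches, dim)

# Project patch features onto their top 3 principal components and map them to RGB.
patches = patches - patches.mean(dim=0)
_, _, v = torch.pca_lowrank(patches, q=3)
rgb = patches @ v                                       # (n_patches, 3)
rgb = (rgb - rgb.min(0).values) / (rgb.max(0).values - rgb.min(0).values + 1e-6)
rgb = rgb.reshape(37, 37, 3)                            # 37 x 37 grid of patches
```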
So there's this big open question: how do we get vision features that are well aligned with language features, usable by VLMs, and that actually have visual fidelity? Cool, so that's part of the story. The other question that needs to be answered is: given that we have some sort of semi-working large-scale pre-training for vision models, why aren't we leveraging those models?
I would answer that, at least in the object detection space, the answer mostly lies in the distinction between convolutional models and transformers. This is from LW-DETR, which is one of the top-performing detection transformers that currently exists. If you look at this graph at YOLOv8-N, which is a convolutional object detector for the edge, with and without pre-training on Objects365 it gains something like 0.2 mAP, mAP being the main accuracy metric for object detectors.
So pre-training on Objects365, which is a big dataset of 1.6 million images, leads to almost no performance improvement on COCO. Whereas for LW-DETR, which is a transformer-based model, if you compare the mAP column without pre-training to the mAP column with pre-training, you're getting around 5 mAP of improvement across the board, sometimes even 7 mAP, which is a gigantic amount.
So basically, while the language world has long known that transformers are able to leverage big pre-training and yield decent results, the vision world is only just now catching up. And you can see this from the scale of what counts as big pre-training: in the image world, pre-training on Objects365 with 1.6 million images is considered a large pre-training.
That would be a tiny challenge dataset for undergrads in the LLM world. So I want to announce Roboflow's new model, RF-DETR, which takes the DINOv2 pre-trained backbone and uses it in a real-time object detection context. This is our answer to the hole we see in the field: why aren't we leveraging big pre-training for vision models?
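To be clear about what "a DINOv2 backbone in a real-time detector" means at a sketch level, here's a toy illustration of the idea: DINOv2 patch features feeding a lightweight DETR-style decoder. This is not the actual RF-DETR architecture; the head, layer counts, and dimensions are made up purely to show the shape of the approach.

```python
import torch
import torch.nn as nn

class DinoBackboneDetector(nn.Module):
    """Hypothetical sketch: a DETR-style head on top of a DINOv2 backbone.

    Illustrates the backbone-swap idea only; it is not the RF-DETR architecture.
    """

    def __init__(self, num_classes, num_queries=300, dim=256):
        super().__init__()
        self.backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
        self.proj = nn.Linear(768, dim)                # project ViT-B features to decoder width
        self.queries = nn.Embedding(num_queries, dim)  # learned object queries
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(dim, 4)                  # (cx, cy, w, h), normalized

    def forward(self, images):  # images: (B, 3, H, W) with H, W divisible by 14
        feats = self.backbone.forward_features(images)["x_norm_patchtokens"]
        memory = self.proj(feats)                                    # (B, n_patches, dim)
        queries = self.queries.weight.unsqueeze(0).expand(len(images), -1, -1)
        hs = self.decoder(queries, memory)                           # (B, num_queries, dim)
        return self.class_head(hs), self.box_head(hs).sigmoid()
```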
And here are some of the metrics. Basically, what we did is take LW-DETR and swap its backbone out for the DINOv2 backbone, and we get a decent improvement on COCO. We're still not SOTA on COCO compared to D-FINE, which is the current SOTA; we're roughly second. But what I think is really interesting is this other dataset, RF100-VL, which we created to measure the domain adaptability of a model. There you can see massive gains from using the DINOv2 pre-trained backbone, which points to the fact that, number one, COCO is too easily solvable.
It basically has common classes like humans and coffee cups, so it's not a good measure of the intelligence of your model. The way you optimize for COCO is more about really nailing the precise location of a bounding box, having really good iterative refinement of the box locations you're predicting.
Whereas we posit that RF100-VL, this new dataset, is a better measure of the intelligence of a vision model. So we're introducing a new dataset, RF100-VL, which is a collection of 100 different object detection datasets pulled from our open-source collection. We have something like 750,000 datasets on Roboflow Universe.
We hand-curated the 100 best, by some metrics: we sorted by community engagement and tried to find very difficult domains. So you'll notice, for instance, that we have camera poses different from what's common in COCO, such as aerial camera positioning, which requires your model to understand different views of an object in order to do well.
We have different visual imaging domains too, as you can see: microscopes, X-rays, all those sorts of things. So we think this dataset can measure the richness of the features learned by object detectors in a much more comprehensive way than COCO. And the other fun thing is that it's a vision-language benchmark.
We're able to benchmark a bunch of different models on RF100-VL, asking them to do things like contextualize the class name within the context of the dataset. Where is this action happening, for instance? If you look at the top left, we have this class, "block," which represents an action: a volleyball block.
But you have to be smart enough to contextualize the word embedding of "block" within the context of volleyball to be able to detect that. Same thing with this thunderbolt-type defect in the cable here: if you just ask a naive vision-language model to detect thunderbolts in the image, it will find nothing.
But if it contextualizes the query as a cable defect, it will be able to find more. The benchmark also increases the breadth of classes. If you only look at COCO, you're basically asking your model: hey, can you find a dog? Can you find a cat?
But can you find fibrosis? Now your model needs a lot more information about the world to solve that problem. Same thing with different imaging domains. So it is a vision-language benchmark: we also provide visual descriptions and annotator instructions on how to find the objects present in each image.
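The mechanics of that contextualization are easy to sketch. Here's a hypothetical helper (the names and wording are illustrative, not the benchmark's actual format) showing how the class name, the dataset's domain, and the annotator instructions might be folded into the text query handed to an open-vocabulary detector or a VLM.

```python
def build_grounded_prompt(class_name, dataset_context, annotator_instructions):
    """Hypothetical helper: enrich a bare class name with domain context so a
    word like "block" is interpreted in its dataset's domain (here, volleyball)."""
    return (
        f"Detect every instance of '{class_name}'. "
        f"Domain: {dataset_context}. "
        f"Annotator guidance: {annotator_instructions}"
    )

prompt = build_grounded_prompt(
    class_name="block",
    dataset_context="broadcast footage of volleyball matches",
    annotator_instructions="a player at the net jumping with arms raised to block a spike",
)
```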
And basically what we found is that you can take a YOLOv8 model, train it on about 10 examples per class, and it does better than Qwen2.5-VL 72B, a state-of-the-art, gigantic vision-language model. So vision-language models are really good right now at generalizing out of distribution in the linguistic domain, but absolutely hopeless when it comes to generalizing in the visual domain.
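For a sense of what that specialist baseline looks like in practice, here's a rough sketch using the Ultralytics package; the data YAML path is a placeholder for one RF100-VL 10-shot split converted to Ultralytics format.

```python
from ultralytics import YOLO

# Start from COCO-pretrained YOLOv8-nano weights ("yolov8n.pt"), or use
# "yolov8n.yaml" instead to build the architecture from scratch.
model = YOLO("yolov8n.pt")
model.train(data="rf100vl_task/data.yaml", epochs=100, imgsz=640, batch=8)

metrics = model.val()    # evaluate on the task's validation split
print(metrics.box.map)   # mAP averaged over IoU 0.50-0.95
```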
So we hope this benchmark can drive that part of the research and make sure the visual side of VLMs doesn't get left behind. And basically, by leveraging stronger embeddings, a DETR model does much, much better on RF100-VL than one that only leverages embeddings learned on Objects365, which makes sense.
And that's my talk. Thank you. Yes? Can you fine-tune it and run it on the edge? Fine-tune it on the edge? Oh, yeah, yeah. It's about 20 million parameters at the small size. Cool, any other questions? Yeah, it's publicly available: if you go to RF100VL.org, you can find our arXiv paper as well as code utilities to help download the dataset. It's also on Hugging Face somewhere.
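Once you know the exact Hugging Face location, pulling it down is a one-liner with `huggingface_hub`; the repo id below is deliberately a placeholder, so check rf100vl.org for the official download utilities and the real identifier.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id; see rf100vl.org for the benchmark's actual location.
local_dir = snapshot_download(repo_id="<rf100-vl-dataset-repo>", repo_type="dataset")
print("Downloaded to", local_dir)
```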
Yeah, so Roboflow has a pretty unique strategy when it comes to our platform: we make it freely available to all researchers, basically. So we have a ton of people who use our platform to label medical data and biological data for their own papers and their own research.
Our only ask is that they then contribute that data back to the community and make it open source. So a lot of this data comes from papers cited in Nature and things like that. Yeah, so the dataset is measuring performance across a bunch of different imaging modalities, or prediction modalities, I guess.
I think the most interesting track of the dataset is the few-shot track. We've constructed canonical 10-shot splits: we provide the model with the class name, annotator instructions on how to find that class, and 10 visual examples per class. And basically no model exists that can leverage all three of those things and get higher mAP than if you just deleted one of them.
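Concretely, you can think of each class in the few-shot track as bundling three inputs. Here's a hypothetical sketch of that structure; the field names are mine, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FewShotClassTask:
    """Hypothetical view of one class in a 10-shot split: the three signals a
    model gets are the class name, annotator instructions, and 10 examples."""
    class_name: str                  # e.g. "block"
    annotator_instructions: str      # how a human labeler was told to find it
    example_images: List[str]        # paths to the 10 annotated exemplar images
    example_annotations: List[dict]  # COCO-style boxes for those exemplars

task = FewShotClassTask(
    class_name="block",
    annotator_instructions="a volleyball player at the net jumping to block a spike",
    example_images=[f"shots/block_{i:02d}.jpg" for i in range(10)],
    example_annotations=[{"bbox": [120, 40, 60, 180]}] * 10,  # toy values
)
```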
I see that as one of the big shortcomings of vision-language models. Yeah, so currently the specialists are by far the best. We benchmarked Grounding DINO specifically, both zero-shot and fine-tuned. Zero-shot Grounding DINO got around 19 mAP on average on RF100-VL, which is kind of good, kind of bad.
If you take a YOLOv8-nano and train it from scratch on the 10-shot examples, which is obviously not a lot of data, it gets something like 25 mAP. So being worse than a YOLO trained from scratch is sort of bad. But if you then fine-tune Grounding DINO with a federated loss, that's the highest-performing model we have on the dataset.
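For reference, running Grounding DINO zero-shot looks roughly like this with the open-source repo's inference utilities. The config and weight paths follow that repo's README, the image is a placeholder, and the caption is an illustrative RF100-VL-style prompt.

```python
from groundingdino.util.inference import load_model, load_image, predict

# Paths follow the IDEA-Research/GroundingDINO README; adjust them to where the
# repo is cloned and the Swin-T checkpoint is downloaded.
model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth",
)
image_source, image = load_image("volleyball_match.jpg")  # placeholder image

# Class names are passed as a ' . '-separated caption string.
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="volleyball block . volleyball player",
    box_threshold=0.35,
    text_threshold=0.25,
)
print(len(boxes), "detections:", phrases)
```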
That being said, I think the point of the dataset should be that you can leverage the annotator instructions, the 10-shot examples, and the class names together and come up with something more accurate, which requires a generalist model. But okay, I think I'm super over time.
So yeah, thanks for the questions. Cool. Thanks, everyone. We'll see you next time.