Best of 2024 in Vision [LS Live @ NeurIPS]

00:00:13.520 |
So for us, we define best as what made the biggest shifts 00:00:26.240 |
and what papers most contributed to those trends. 00:00:31.340 |
and then we're gonna hand it off to Moondream. 00:00:34.400 |
So the trends that I'm interested in talking about 00:00:44.080 |
to models that run using the same basic ideas on video. 00:00:48.720 |
And then also how DETRs are starting to take over 00:00:56.360 |
from the YOLOs, which have been dominant for years. 00:00:58.960 |
So as a highlight, we're gonna talk about Sora, 00:01:04.620 |
which from my perspective is the biggest paper of 2024, 00:01:20.040 |
from replication efforts, including Open-Sora 00:01:22.680 |
and related work such as Stable Video Diffusion. 00:01:55.080 |
discrete-token video tokenizer akin to VQ-GAN, 00:02:08.840 |
in terms of the bit rate versus human preference for quality 00:02:23.480 |
And then suddenly a few months later, we have this, 00:02:28.480 |
which when I saw it, it was totally mind-blowing to me. 00:02:36.000 |
That reflection reminds me of those RTX demonstrations 00:02:41.000 |
for next generation video games, such as Cyberpunk, 00:02:57.040 |
In the same way that like six fingers on a hand, 00:03:03.760 |
So yeah, as we said, Sora does not have a paper. 00:03:08.440 |
So we're going to be filling it in with context 00:03:23.120 |
This is a trick that they introduced in DALL-E 3 00:03:44.040 |
that are necessary for good video generation, 00:03:50.360 |
and filtering by making sure the videos have enough motion 00:03:53.320 |
so they're not just like kind of the generators 00:04:06.600 |
Once again, they were very sparse on details. 00:04:13.680 |
Open-Sora actually uses MAGVIT-v2 itself to do this, 00:04:31.520 |
which makes a lot of sense, as sequential frames 00:04:35.400 |
in videos have mostly redundant information. 00:04:43.640 |
you allow the latent to hold a lot more semantic information 00:04:49.800 |
So we've got our space-time latents, possibly via MAGVIT-v2, 00:05:02.560 |
And then you throw it into a diffusion transformer. 00:05:07.440 |
So I think it's personally interesting to note 00:05:14.960 |
which originally used an autoregressive transformer decoder 00:05:28.200 |
is it parameterizing the stochastic differential equation? 00:05:31.880 |
Is it parameterizing a conditional distribution 00:05:35.680 |
It's also worth noting that most diffusion models today, 00:05:44.520 |
the very high performance ones, are switching away 00:05:48.640 |
from the denoising diffusion probabilistic modeling framework to rectified flows. 00:05:52.560 |
Rectified flows have a very interesting property: 00:05:58.520 |
they actually get closer to being able to be sampled 00:06:05.480 |
in fewer steps, so you can actually generate high quality samples much faster. 00:06:40.000 |
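To make the rectified flow idea concrete, here is a minimal sketch of the training objective and few-step Euler sampling. The model signature, latent shapes, and step count are illustrative assumptions, not Sora's (unpublished) recipe.

```python
import torch

def rectified_flow_loss(model, x0, x1, cond):
    """One rectified-flow training step (a sketch, not any specific paper's code).

    x0: Gaussian noise, x1: data (e.g. space-time video latents), cond: text conditioning.
    The model learns to predict the constant velocity (x1 - x0) along the straight
    line x_t = (1 - t) * x0 + t * x1.
    """
    t = torch.rand(x1.shape[0], device=x1.device)
    t_b = t.view(-1, *([1] * (x1.dim() - 1)))   # broadcast t over the latent dims
    x_t = (1 - t_b) * x0 + t_b * x1
    v_pred = model(x_t, t, cond)                # hypothetical model signature
    return ((v_pred - (x1 - x0)) ** 2).mean()

@torch.no_grad()
def sample(model, noise, cond, steps=8):
    """Euler integration from noise to data; straighter paths need fewer steps."""
    x, dt = noise, 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * model(x, t, cond)
    return x
```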
because the original diffusion transformer paper 00:06:45.000 |
in fact, the specific hyperparameters of the transformer 00:06:49.160 |
didn't really matter. What mattered was that you were just increasing 00:06:54.480 |
So I love how in the, once again, little blog post, 00:07:01.160 |
they say, we're using a diffusion transformer 00:07:27.400 |
It's just a little disappointing considering the context. 00:07:34.640 |
of the framework that was introduced in '22 and '23 00:07:40.320 |
for very high quality image generation 00:07:53.640 |
So the next paper I wanted to talk about is SAM. 00:08:04.680 |
SAM for us has saved our users 75 years of labeling time. 00:08:16.320 |
SAM also allows us to have our users 00:08:19.320 |
train just pure bounding box regression models 00:08:22.680 |
and use those to generate high quality masks, 00:08:33.160 |
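As a rough illustration of that box-to-mask workflow, here is a sketch using Meta's segment-anything package; the checkpoint path, image array, and box coordinates are placeholders standing in for real data and a real box-regression model.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load SAM and prompt it with a bounding box from a cheap box-regression model.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # local checkpoint path
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in for a real HxWx3 RGB frame
predictor.set_image(image)

box = np.array([100, 120, 300, 360])              # (x0, y0, x1, y1) from the box model
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape, scores)                        # masks: [1, H, W] boolean array
```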
So most people are data limited in the real world. 00:08:44.920 |
per frame object detectors on every frame in a video, 00:08:49.600 |
And so SAM falls into this category of taking, 00:09:01.880 |
which has the wonderful benefit of being plug and play 00:09:05.000 |
with many of our users' use cases. 00:09:08.920 |
We're still building out a sufficiently mature pipeline 00:09:12.800 |
to take advantage of that, but it's in the works. 00:09:23.520 |
You even notice the cell goes away and comes back 00:09:28.120 |
which is very challenging for existing object trackers. 00:09:39.440 |
There's a simple pipeline here where we can give, 00:09:56.440 |
So here we're giving a bounding box in the first frame, 00:10:04.680 |
I'm going to assume people are somewhat familiar with SAM. 00:10:09.680 |
So I'm going to just give a high-level overview 00:10:13.720 |
You have an image encoder that runs on every frame. 00:10:20.760 |
in which case the only difference between SAM2 and SAM 00:10:23.400 |
is the image encoder: SAM used a standard ViT, 00:10:31.360 |
while SAM2 replaced that with Hiera, a hierarchical encoder, 00:10:50.760 |
In the case where you're doing video segmentation, 00:10:56.080 |
the difference is that you actually create a memory bank 00:10:58.920 |
and you cross attend the features from the image encoder 00:11:04.560 |
So the feature set that is created is essentially, 00:11:09.560 |
well, I'll go more into it in a couple of slides, 00:11:14.500 |
but we take the features from the past couple frames 00:11:19.320 |
plus a set of object pointers and the set of prompts 00:11:28.920 |
We then fuse the new masks for this frame 00:11:30.980 |
with the image features and add that to the memory bank. 00:11:39.720 |
Just like SAM, SAM2 actually uses a data engine 00:11:47.320 |
they assembled a huge amount of reference data, 00:11:50.020 |
used people to label some of it and train the model, 00:11:57.340 |
and asked people to refine the predictions of the model. 00:11:59.780 |
And then ultimately the data set is just created 00:12:02.660 |
from the final output of the model on the reference data. 00:12:16.920 |
It seems unlikely that another model could come in 00:12:19.340 |
and have such a tight relationship with the training set. 00:12:22.340 |
Yeah, so brief overview of how the memory bank works. 00:12:33.740 |
so I'm just, I'm going to fill in a bit more. 00:12:35.940 |
So we take the last couple of frames from our video 00:12:42.780 |
and cross-attend them along with the set of prompts that we provided, 00:12:58.180 |
as well as reference object pointers saying, 00:13:08.780 |
to model complex object motion without actually, 00:13:17.220 |
by limiting the number of frames that you attend to, 00:13:19.940 |
you manage to keep the model running in real time. 00:13:31.380 |
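Here is a conceptual sketch of that memory-attention loop. The class names, dimensions, and the cap on how many recent frames are kept are illustrative guesses, not the actual SAM2 implementation.

```python
import torch
import torch.nn as nn

class MemoryAttention(nn.Module):
    """Sketch of SAM2-style memory attention: per-frame features cross-attend to a
    small rolling bank of fused (features + mask) memories, prompts, and object pointers."""

    def __init__(self, dim=256, num_heads=8, max_memories=7):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.max_memories = max_memories
        self.memory_bank = []   # fused memories from recent frames, each [B, N, dim]

    def forward(self, frame_feats, prompt_tokens, object_pointers):
        # frame_feats: [B, N, dim] from the per-frame image encoder
        if self.memory_bank:
            memory = torch.cat(self.memory_bank + [prompt_tokens, object_pointers], dim=1)
            frame_feats, _ = self.cross_attn(frame_feats, memory, memory)
        return frame_feats

    def update(self, fused_frame_memory):
        # Keep only the last few frames so inference stays real-time.
        self.memory_bank.append(fused_frame_memory)
        self.memory_bank = self.memory_bank[-self.max_memories:]
```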
of all the frames is super essential for a high performance, 00:14:05.660 |
One would assume that increasing the count of memories 00:14:25.620 |
I'm super interested to see a more dedicated summarization 00:14:35.340 |
So that's another extension of beautiful per-frame work 00:14:47.580 |
The next trend I'm interested in talking about 00:15:01.820 |
We are finally starting to see something change. 00:15:07.160 |
So for years, YOLOs have been the dominant way 00:15:12.980 |
And we can see here that they've essentially stagnated. 00:15:35.940 |
So we can look here and see the YOLO series 00:15:43.860 |
LW-DETR, and D-FINE have meaningfully changed that plateau 00:16:01.900 |
a 2023 preprint, but published officially in '24, 00:16:12.300 |
we could actually match or out-speed YOLOs. 00:16:18.540 |
is hugely effective on DETRs, and much less so on YOLOs. 00:16:22.260 |
And then D-FINE added the types of bells and whistles 00:16:28.260 |
So the major improvements that RT-DETR shows 00:16:37.400 |
that DETRs typically pass into their encoder 00:16:41.060 |
into a much more efficient transformer encoder. 00:16:44.400 |
The transformer is, of course, quadratic complexity, 00:16:48.560 |
so decreasing the amount of stuff that you pass in at once 00:16:52.100 |
is super helpful for increasing your runtime, 00:16:57.920 |
So that change basically brought us up to YOLO speed, 00:17:04.180 |
on benchmarking YOLOs, including the NMS step. 00:17:09.180 |
Once you include the NMS in the latency calculation, 00:17:14.600 |
you see that, in fact, these DETRs are outperforming, 00:17:18.600 |
at least at this time, the YOLOs that existed. 00:17:26.660 |
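The benchmarking point is easy to restate in code: measure latency end to end, with NMS inside the timed region. This is only a sketch; the model call, thresholds, and timing setup are placeholders, not the actual benchmark harness from the paper.

```python
import time
import torch
from torchvision.ops import nms

def end_to_end_latency(model, image, iou_thresh=0.7, score_thresh=0.25):
    """Time the detector including post-processing, since NMS is part of what
    a YOLO-style model needs before its outputs are usable."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    boxes, scores = model(image)                  # hypothetical raw (boxes, scores) output
    keep = scores > score_thresh
    boxes, scores = boxes[keep], scores[keep]
    kept = nms(boxes, scores, iou_thresh)         # the step often left out of YOLO benchmarks
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.perf_counter() - start, boxes[kept]
```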
in fact, in this frame, the huge boost here is from pre-training 00:17:35.200 |
and this is the D-FINE line without pre-training. 00:17:48.240 |
they showed that they got much better results 00:17:57.240 |
they actually did not benefit from pre-training. 00:18:04.040 |
in fact, YOLOs do have a real benefit from pre-training, 00:18:04.040 |
but it goes away as we increase the training time. 00:18:22.960 |
is that you're not destroying your original weights 00:18:31.460 |
And then LW-DETR also shows superior performance 00:18:41.040 |
which means that they do better on the real world, 00:18:44.120 |
Then D-FINE throws all the bells and whistles at it. 00:18:49.500 |
YOLO models tend to have a lot of very specific, 00:19:07.200 |
and we see that suddenly we have almost 60 AP on COCO 00:19:14.620 |
So we're spending a lot of time trying to build models 00:19:21.880 |
and DETRs are clearly becoming a promising step 00:19:26.700 |
What we're interested in seeing from the DETRs 00:19:35.360 |
on the top of the leaderboard for large-scale inference 00:19:40.360 |
scale really well as you switch out the backbone. 00:19:46.400 |
and having people publish a paper, potentially us, 00:19:49.620 |
on what happens if you take these real-time ones 00:19:57.780 |
to the super, super slow but high-performance domain? 00:20:02.580 |
We also wanna see people benchmarking on RF100 more 00:20:28.380 |
And one of the lenses to look at this is through 00:20:34.620 |
fine-grained visual details and your representations 00:20:37.880 |
that are extracted from your foundation model. 00:20:42.460 |
Oh, yeah, this is just a list of all the papers 00:20:46.700 |
I just wanted to make sure I cited an actual paper 00:21:04.840 |
and tell me what time it is, it fails, right? 00:21:11.800 |
like, this is, like, a very classic test of an LLM, 00:21:19.500 |
it'll do better if we increase the resolution 00:21:21.580 |
and it has easier time finding these fine-grained features, 00:21:27.160 |
And you could say, okay, well, maybe the model 00:21:38.540 |
literally cannot see the position of the watch hands 00:21:43.620 |
And for you Anthropic heads out there, Claude fails, too. 00:21:48.880 |
So, my first pick for Best Paper of 2024 in Vision 00:21:53.880 |
is this MMVP paper, which tries to investigate 00:21:57.260 |
why do LLMs not have the ability to see fine-grained details? 00:22:03.040 |
with a lot of images like this, where you ask it a question 00:22:12.460 |
And so, the process by which it finds these images 00:22:36.920 |
these fine-grained details to do its job correctly, 00:22:38.840 |
which is just to match captions and images, right? 00:22:49.460 |
the vision encoder wasn't trained contrastively at all, 00:22:52.220 |
still, in order to do its job of capturing the image, 00:22:58.800 |
of all the objects and visual features in the image, right? 00:23:02.040 |
So, this paper finds a set of difficult images 00:23:07.620 |
And the way it does it is it looks for embeddings 00:23:10.000 |
that are similar in CLIP space, but far in DINOv2 space. 00:23:10.000 |
that was trained self-supervised purely on image data, 00:23:28.380 |
or, like, crops at certain areas of the image 00:23:36.600 |
And so, if you take things that are very close in CLIP space 00:23:36.600 |
you get a set of images that basically are pairs of images 00:23:47.300 |
and other big language models to distinguish. 00:23:49.720 |
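A minimal sketch of that pair-finding step, assuming you already have L2-normalized CLIP and DINOv2 embeddings for the same image set; the similarity thresholds here are placeholders, not the MMVP paper's exact values.

```python
import torch

def clip_blind_pairs(clip_emb, dino_emb, clip_thresh=0.95, dino_thresh=0.6):
    """clip_emb: [N, D1], dino_emb: [N, D2] L2-normalized embeddings of the same N images.
    Returns index pairs that look nearly identical to CLIP but different to DINOv2."""
    clip_sim = clip_emb @ clip_emb.T              # cosine similarity in CLIP space
    dino_sim = dino_emb @ dino_emb.T              # cosine similarity in DINOv2 space
    candidate = (clip_sim > clip_thresh) & (dino_sim < dino_thresh)
    idx = torch.arange(candidate.shape[0])
    upper = idx.unsqueeze(0) > idx.unsqueeze(1)   # keep each pair (i, j) with j > i once
    i, j = (candidate & upper).nonzero(as_tuple=True)
    return list(zip(i.tolist(), j.tolist()))
```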
So, if you then ask it questions about this image, 00:23:54.880 |
it's going to answer the same way for both images, right? 00:23:58.600 |
Because, from the perspective of the vision encoder, 00:24:06.960 |
And, like, all these other models, including LLaVA, 00:24:06.960 |
And so, this is the benchmark that they create, 00:24:14.080 |
which is, like, finding, like, CLIP-blind pairs, 00:24:17.760 |
which is pairs of images that are similar in CLIP space, 00:24:19.680 |
and creating a data set of multiple-choice questions 00:24:30.500 |
So, ChatGPT and Gemini do a little bit better 00:24:30.500 |
extremely negatively correlated with this data set. 00:24:44.720 |
It does much, much, much, much worse than random guessing, 00:24:47.640 |
which means that this process has done a very good job 00:24:50.600 |
of identifying hard images for LLaVA, specifically. 00:24:50.600 |
not trained for very long and is initialized from CLIP. 00:24:57.040 |
And so, you would expect it to do poorly on this data set. 00:25:03.160 |
So, one of the proposed solutions that this paper attempts 00:25:12.800 |
"of the language model also on Dyno features?" 00:25:15.040 |
And so, it proposes two different ways of doing this. 00:25:19.080 |
One, additively, which is basically interpolating 00:25:32.000 |
when you do the additive mixture of features. 00:25:46.380 |
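A small sketch of the additive mixing idea: interpolate CLIP and DINOv2 patch features before the usual vision-to-LLM adapter. The projection, dimensions, and mixing ratio alpha are illustrative; the paper sweeps the ratio rather than fixing it.

```python
import torch.nn as nn

class AdditiveMixAdapter(nn.Module):
    """Interpolate CLIP and DINOv2 features, then project into the language model."""

    def __init__(self, clip_dim=1024, dino_dim=1536, llm_dim=4096, alpha=0.5):
        super().__init__()
        self.dino_proj = nn.Linear(dino_dim, clip_dim)  # bring DINOv2 into CLIP's width
        self.to_llm = nn.Linear(clip_dim, llm_dim)      # the usual vision-to-LLM adapter
        self.alpha = alpha

    def forward(self, clip_feats, dino_feats):
        # clip_feats: [B, N, clip_dim], dino_feats: [B, N, dino_dim] patch features
        mixed = (1 - self.alpha) * clip_feats + self.alpha * self.dino_proj(dino_feats)
        return self.to_llm(mixed)
```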
which is as you increase the number of DINOv2 features, 00:25:46.380 |
were trained completely in a self-supervised manner 00:25:54.160 |
And so, you can train an adapter all you want, 00:26:05.280 |
but it seems that it's in such an alien language 00:26:11.560 |
And so, that kind of supports what's happening on the left, 00:26:19.640 |
as you include more DINOv2 features up to a point, 00:26:19.640 |
it completely loses its ability to answer language 00:26:38.080 |
that are going into these models and just train on both. 00:26:41.640 |
And it still doesn't really solve the MMVP task. 00:26:43.960 |
It gets LLaVA 1.5 above random guessing by a little bit, 00:26:43.960 |
of just using DINOv2 features directly isn't gonna work. 00:26:56.540 |
DINOv2 is gonna be insufficient for language tasks, right? 00:27:06.040 |
would be Florence 2, which tries to solve this problem 00:27:27.000 |
which ends up, the goal is basically to have features 00:27:30.720 |
that are sufficient for finding objects in the image. 00:27:37.520 |
but also can be talked about and can be reasoned about. 00:27:44.880 |
So, here's an example of basically three different 00:28:03.920 |
not have features that are meaningful at the pixel level. 00:28:07.560 |
And so, they add another type, which is region text pairs, 00:28:11.080 |
which is essentially either classifying a region 00:28:23.640 |
And then they have text phrase region annotations, 00:28:32.160 |
you also find its place in a descriptive paragraph 00:28:39.760 |
even more semantic understanding of these regions. 00:28:46.040 |
you have to know what a woman is and what the road is 00:28:49.120 |
And that's basically composing a bunch of objects 00:28:56.280 |
And so, the way that they do this is they take... 00:28:59.400 |
Basically, they just dump features from a vision encoder 00:29:08.440 |
And then they train a bunch of different tasks 00:29:12.720 |
like object detection and so on as a language task. 00:29:37.280 |
We can see, if you look at the graph on the right, 00:29:44.560 |
your pre-trained Florence 2 models transfer very, very well. 00:30:04.360 |
which both of these things are pointing to the fact 00:30:24.240 |
And I think that this framework, you can see saturation. 00:30:32.440 |
purely on the image level and region level annotations 00:30:35.320 |
and not including the pixel level annotations, 00:30:40.240 |
it actually performs better as an object detector. 00:30:45.640 |
it's not able to actually learn all the visual tasks 00:30:51.160 |
So, I'd like to see this paper explore larger model sizes, 00:30:54.440 |
which brings us to our next big paper of 2024, 00:31:02.160 |
PaliGemma 2 was released, I think, like a week or two ago. 00:31:05.040 |
Oh, I forgot to mention, you can actually train 00:31:12.240 |
and you can actually train a PaliGemma 2 model on Roboflow, 00:31:21.920 |
So, PaliGemma is essentially doing the same thing, 00:31:29.560 |
But it also introduced the concept of location tokens 00:31:36.560 |
So, PaliGemma uses Gemma as the language model 00:31:39.880 |
PaliGemma 2 introduces using multiple different sizes 00:31:53.680 |
when it's generating tokens autoregressively, 00:32:03.040 |
and like a description of the task that it's trying to do, 00:32:05.920 |
they're attending to each other fully, full attention, 00:32:09.320 |
which means that it can sort of bind high level... 00:32:12.960 |
It's easier for the prefix to color the output 00:32:34.520 |
You're asking for it to segment these two classes of objects 00:32:38.960 |
and then it finds their locations using these tokens 00:33:06.000 |
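To make the location-token idea concrete, here is a sketch of how a box can be serialized into PaliGemma-style `<locXXXX>` tokens so that detection becomes ordinary token generation. The 1024-bin quantization and y/x ordering follow the released PaliGemma convention as I understand it; treat the exact details as an assumption.

```python
def box_to_loc_tokens(box, img_w, img_h, bins=1024):
    """box = (x_min, y_min, x_max, y_max) in pixels -> '<locYYYY><locXXXX>...' string.
    Coordinates are ordered y_min, x_min, y_max, x_max and quantized into `bins` buckets."""
    x0, y0, x1, y1 = box

    def q(v, size):
        return min(bins - 1, int(v / size * bins))

    coords = [q(y0, img_h), q(x0, img_w), q(y1, img_h), q(x1, img_w)]
    return "".join(f"<loc{c:04d}>" for c in coords)

# e.g. prompt: "detect soccer ball" -> target text: box_to_loc_tokens(...) + " soccer ball"
print(box_to_loc_tokens((120, 40, 380, 300), img_w=640, img_h=480))
```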
is each blue dot is a performance on some downstream task. 00:33:09.560 |
You can see that after seeing 300 million examples, 00:33:15.440 |
on all of the downstream tasks that they tried it on, 00:33:25.560 |
PaliGemma 2, you can see the results on object detection. 00:33:35.800 |
And you can see that this sort of also points 00:33:39.200 |
to an increase in capacity being helpful to the model. 00:33:44.720 |
and the parameter count of the language model increases, 00:33:56.880 |
a thinking register and it gives it more tokens 00:34:01.440 |
But yeah, you could say, oh, 43.6, that's not that great. 00:34:12.520 |
on top of this language or this image encoder. 00:34:16.240 |
It's doing the raw language modeling task on COCO. 00:34:20.520 |
So, it doesn't have any of the bells and whistles. 00:34:23.360 |
It doesn't even have bipartite graph matching 00:34:32.920 |
is that they blow everything else away on MMVP. 00:34:35.520 |
I mean, 47.3, sure, that's nowhere near human accuracy, 00:34:56.080 |
So, AIMV2 sort of says, okay, maybe this language model, 00:35:01.080 |
like maybe coming up with all these specific annotations 00:35:04.760 |
to find features and with high fidelity in pixel space 00:35:12.920 |
and more beautiful idea for combining image tokens 00:35:17.280 |
and pixel tokens in a way that's interfaceable 00:35:28.080 |
So, the way that it works is it does something 00:35:33.040 |
that dumps image tokens into a decoder-only transformer. 00:35:47.320 |
with fancy object detection or segmentation labels, 00:35:53.240 |
and have it learn fine-grained features that way. 00:35:55.720 |
And it does this in kind of, I think, a beautiful way 00:36:04.560 |
and using only this number of image tokens as the prefix. 00:36:08.480 |
And so, doing a similar thing with the causal. 00:36:13.320 |
So, the causal prefix is the attention mask on the right. 00:36:18.760 |
with some randomly sampled number of image tokens 00:36:26.160 |
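Here is a minimal sketch of a prefix-causal attention mask of the kind AIMv2 describes (illustrative, not the actual AIMv2 code): the first `prefix_len` image tokens attend bidirectionally, and everything after them is causal while still seeing the full prefix.

```python
import torch

def prefix_causal_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Return a [seq_len, seq_len] boolean mask where True means attention is allowed."""
    idx = torch.arange(seq_len)
    mask = idx.unsqueeze(1) >= idx.unsqueeze(0)   # causal base: token i sees tokens j <= i
    mask[:, :prefix_len] = True                   # every token also sees the full prefix
    return mask

# During training, the prefix length is sampled randomly per example, e.g.:
# prefix_len = torch.randint(1, num_image_tokens + 1, (1,)).item()
print(prefix_causal_mask(6, 3).int())
```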
And so, this is the dataset that they train on. 00:36:30.160 |
It's internet-scale data, very high-quality data 00:36:34.000 |
created by the Data Filtering Networks paper, essentially, 00:36:34.000 |
which is maybe the best CLIP data that exists. 00:36:38.320 |
it appears to be, well, at the highest parameter count, 00:37:00.880 |
And so, you can sort of think that, you know, 00:37:07.280 |
which is the line of thinking for language models, 00:37:20.440 |
This is the ImageNet classification accuracy, 00:37:22.680 |
but yeah, it does better if you increase the resolution, 00:37:29.760 |
And so, how does it actually do compared to CLIP on COCO? 00:37:41.280 |
which is also within spitting distance of SOTA, 00:37:41.280 |
But you could say, okay, well, wait a second, 00:38:11.800 |
They train on, like, Objects 365, COCO, Flickr, 00:38:27.840 |
and not train to convergence on object detection. 00:38:42.280 |
- But overall, that was exactly what I was looking for. 00:39:06.520 |
Well, while we're getting set up, hi, over here. 00:39:11.760 |
One of the things that's been weird and surprising 00:39:22.560 |
they're just, like, worse than RT-DETR at detection still. 00:39:22.560 |
So, I'm curious to hear your thoughts on, like, 00:40:13.360 |
For image classification, it's basically there. 00:40:16.600 |
In the AIMv2 paper, they showed a simple attentional probe 00:40:16.600 |
why isn't it transferring to object detection, 00:40:33.520 |
especially, like, real-time object detection? 00:40:39.240 |
One is object detection is really, really, really, 00:40:56.440 |
CLIP pre-training transfers super, super easily. 00:41:06.000 |
didn't even really benefit from pre-training. 00:41:10.200 |
essentially saturated, showing very little difference 00:41:22.880 |
of better and better pre-training on real-time detection. 00:41:35.040 |
or just to summarize, basically, is that, like, 00:41:41.720 |
of transformer-based object detectors and fancy losses, 00:41:54.280 |
they have all these, like, extreme optimizations 00:42:00.160 |
but essentially, I think it's kind of been shown now 00:42:05.720 |
and just don't, like, have the level of intelligence 00:42:43.440 |
Yeah, it's like, we have a capture of your screen. 00:43:09.440 |
I've been working on Moondream for almost a year now, 00:43:21.040 |
So Moondream started off as a tiny vision language model. 00:43:25.720 |
Since then, we've extended scope a little bit 00:43:37.680 |
that are focused at assistant-type use cases, 00:43:49.680 |
yeah, we're laser-focused on building capabilities 00:43:54.480 |
that developers can use to build vision applications 00:43:59.120 |
So in a lot of cases for vision more so than for text, 00:44:02.720 |
you really care about being able to run on the edge, 00:44:08.840 |
We have different output modalities that we support. 00:44:26.360 |
We've done a lot of work to minimize hallucinations there. 00:44:31.080 |
We have open vocabulary object detection built in, 00:44:35.480 |
where rather than having to train a dedicated model, 00:44:38.040 |
you can just say, "Show me soccer balls in this image," 00:44:41.000 |
or, "Show me if there are any deer in this image." 00:44:48.720 |
where if all you're interested in is the center of an object, 00:44:52.440 |
you can just ask it to point out where that is. 00:45:13.040 |
It's good for our local Llama desktop friends, 00:45:13.040 |
It's very good if you're running on older mobile phones 00:45:39.400 |
even with our not-yet-fully-optimized inference client. 00:46:00.280 |
was to preserve accuracy across a broad set of benchmarks. 00:46:14.440 |
using basically a technique based on the gradient. 00:46:17.520 |
I'm not sure how much people want to know details. 00:46:20.560 |
but feel free to grab me if you have more questions. 00:46:28.360 |
retrain the model to recover performance and bring it back. 00:46:31.480 |
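For readers who want something concrete, here is a generic sketch of gradient-based (first-order Taylor) importance pruning followed by recovery fine-tuning. This is not Moondream's exact recipe, which isn't spelled out in the talk; it only illustrates the general shape of the approach: score parameters on a calibration set, drop the lowest-scoring ones, then retrain.

```python
import torch

def taylor_importance(model, calib_batches, loss_fn):
    """Score each parameter by accumulated |weight * gradient| over a calibration set."""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for inputs, targets in calib_batches:
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += (p * p.grad).abs()
    return scores

def prune(model, scores, keep_ratio=0.25):
    """Zero out the lowest-importance weights; real pipelines would then fine-tune
    (and often remove structure, not just zero it) to recover accuracy."""
    flat = torch.cat([s.flatten() for s in scores.values()])
    k = max(1, int(len(flat) * (1 - keep_ratio)))
    threshold = flat.kthvalue(k).values
    for n, p in model.named_parameters():
        p.data[scores[n] <= threshold] = 0.0
```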
The 0.5B we released is more of a proof of concept 00:46:35.880 |
I think the thing that's really exciting about this 00:46:39.440 |
For developers to build using the 2B param model 00:46:50.680 |
figure out what exactly they need out of the model 00:46:52.560 |
and prune those capabilities into a smaller form factor 00:46:54.680 |
that makes sense for their deployment target. 00:47:00.680 |
Let me talk to you folks a little bit about another problem 00:47:07.880 |
We had a customer reach out who was talking about, 00:47:14.240 |
This is very common in manufacturing and oil and gas 00:47:24.040 |
and monitor stuff and make sure that the system 00:47:27.320 |
gets shut down when the temperature goes over 80 00:47:38.560 |
I went and looked at other open source models 00:47:40.760 |
to see if I could just generate a bunch of data 00:47:47.240 |
with hundreds of billions of dollars in market cap 00:47:53.960 |
My hypothesis is that the way these models are trained 00:48:14.280 |
It's paired with an alt text that says something like 00:48:16.360 |
G-I-V-T-O pressure sensor, PSI zero to 30 or something. 00:48:30.880 |
And so, yeah, that's a gap we need to address. 00:48:39.800 |
let's use synthetic data to solve this problem. 00:48:47.760 |
of synthetic gauge images to get to reasonable performance. 00:48:50.920 |
And thinking about it, reading a gauge is, like, not a one-shot, 00:48:55.480 |
like it's not a zero-shot process in our minds, right? 00:48:57.520 |
Like if you had to tell me the reading in Celsius 00:49:00.440 |
for this real world gauge, there's two dials on there. 00:49:19.360 |
So what happens if we just add that as chain of thought 00:49:29.720 |
to allow the model to better learn the subtasks 00:49:47.440 |
Like there's a weird shadow situation going on. 00:50:05.120 |
On the image, the model actually has to predict 00:50:09.880 |
I was originally trying to do this with bounding boxes, 00:50:11.920 |
but then Molmo came out with pointing capabilities 00:50:11.920 |
and it's like pointing is a much better paradigm 00:50:28.400 |
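To illustrate what a grounded, pointing-based chain-of-thought target might look like, here is a made-up training example with points as normalized coordinates. The exact format Moondream uses isn't specified in the talk; this is only a sketch of the idea.

```python
# Hypothetical grounded chain-of-thought example for gauge reading.
example = {
    "image": "pressure_gauge_042.jpg",               # placeholder filename
    "question": "What is the reading on the gauge in PSI?",
    "answer": (
        "The needle pivots at point (0.52, 0.61) and its tip is at (0.33, 0.38). "
        "The scale runs from 0 at (0.25, 0.78) to 30 at (0.78, 0.77), "
        "with major ticks every 5 PSI. "
        "The tip sits between the 10 and 15 tick marks, closer to 10, "
        "so the reading is about 11 PSI."
    ),
}
```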
So the light blue chart is with our grounded chain of thought. 00:50:33.400 |
This measures, we built a clock reading benchmark 00:50:47.400 |
when you're using the chain of thought to help the model. 00:50:59.040 |
is you can kind of understand how the model is doing it 00:51:17.280 |
except instead of saying it was on the seventh tick, 00:51:22.120 |
it actually predicted that it was the eighth tick 00:51:26.360 |
So now that you know that this is failing in this way, 00:51:26.360 |
you can adjust how you're doing the chain of thought 00:51:32.760 |
to maybe say like actually count out each tick from 40 00:51:35.480 |
instead of just trying to say it's the eighth tick. 00:51:40.320 |
I'll count from there instead of all the way from 40. 00:51:47.040 |
is few-shot prompting or test-time training with this. 00:51:47.040 |
they can go in and correct that in the chain of thought 00:52:10.400 |
The real question is, is it going to generalize? 00:52:13.320 |
Probably like there's some science from text models 00:52:15.760 |
that when you train on a broad number of tasks, 00:52:18.240 |
And I'm seeing some signs of this with our model as well. 00:52:21.720 |
So in addition to the image-based chain of thought stuff, 00:52:25.680 |
I also added some spelling-based chain of thought 00:52:29.160 |
to help it understand, better understand OCR, I guess. 00:52:33.600 |
I don't understand why everyone doesn't do this by the way. 00:52:46.640 |
like hey, does any license plate in this image 00:52:54.120 |
All right, that ends my story about the gauges. 00:53:00.840 |
If you think about what's going on over here, 00:53:10.880 |
especially with the latest set of models that we've seen. 00:53:17.000 |
I have a feeling that VLMs are lagging behind 00:53:33.600 |
there's a ton of data that talks about how to reason. 00:53:47.440 |
hey, to show that that mountain is further away, 00:53:51.880 |
but the actual data on how to like look at images 00:54:06.040 |
So yeah, I think our solution here is really just, 00:54:09.800 |
we need to teach them how to operate on individual tasks 00:54:31.440 |
If anyone wants to chat about more technical details 00:54:35.280 |
about how we're doing this or interested in collaborating, 00:54:38.760 |
- Yeah, like I always, when people say multi-modality, 00:54:48.800 |
I always think about vision as the first among equals 00:54:57.480 |
- This is the year that vision language models 00:54:59.440 |
became mainstream with every model from GPT-4o to o1 00:55:08.000 |
to Mistral's Pixtral to AI2's Pixmo going multi-modal. 00:55:13.000 |
We asked Peter and Isaac to highlight the best work 00:55:18.320 |
And they blew us away with the complete overview. 00:55:37.400 |
As always, don't forget to check the show notes 00:55:39.800 |
for the YouTube link to their talk, as well as their slides.