Segment Anything 2: Memory + Vision = Object Permanence — with Nikhila Ravi and Joseph Nelson
Chapters
0:00 Introducing Nikhila
3:38 The Impact of SAM 1 in 2023
9:15 Do People Finetune SAM?
13:05 Video Demo of SAM
17:01 Why the Demo is so Important
20:23 SAM 1 vs SAM 2 Architecture
23:46 Video Demo of SAM on Roboflow
29:44 Extending SAM 2 with other models
32:00 Limitations of SAM: Screenshots
35:56 SAM 2 Paper
36:15 SA-V Dataset and SAM Data Engine
40:15 Memory Attention to solve Video
44:24 "Context Length" in Memory Attention
45:17 Object Tracking
47:52 The Future of FAIR
49:23 CVPR, Trends in Vision
60:04 Calls to Action
Our first, one of our very first viral podcasts 00:00:14.500 |
- And this time we are joined by the lead author 00:00:21.160 |
- There's a whole story that we can refer people back 00:00:35.420 |
Why, you know, why did you choose computer vision 00:00:37.980 |
coming out of your specialization at Cambridge? 00:00:41.720 |
So I did my undergraduate degree in engineering 00:00:50.840 |
you sort of study everything from mechanical engineering 00:01:02.260 |
I started taking more classes in machine learning 00:01:05.340 |
and computational neuroscience, and I really enjoyed it. 00:01:08.300 |
And actually after graduating from undergrad, 00:01:14.520 |
And so I was initially planning on becoming a doctor, 00:01:28.700 |
And in my machine learning class in undergrad, 00:01:45.980 |
okay, maybe I want to try something different 00:01:49.380 |
Maybe this is a different path I want to take. 00:01:51.740 |
And then in the gap year, I did a bunch of coding, 00:01:59.740 |
And then I got a scholarship to come and study in America. 00:02:05.180 |
took a bunch of computer science classes at Harvard and MIT, 00:02:12.380 |
I really, really enjoyed working in computer vision, 00:02:15.300 |
applied to Facebook and got this job at Facebook. 00:02:17.940 |
And I've now, at Facebook at the time, now Meta. 00:02:29.220 |
I'm not like a research, typical research scientist. 00:02:32.420 |
Definitely came from more of an engineering background. 00:02:37.500 |
have had amazing opportunities to work across 00:02:40.540 |
so many different interesting problems in computer vision 00:02:46.720 |
How can you go from images of objects to 3D structures? 00:03:02.420 |
- It's weird because I guess with Segment Anything 2, 00:03:07.340 |
You know, you started with 3D and now you're solving the 4D. 00:03:10.700 |
- Yeah, it's just going from 3D to images to video. 00:03:15.740 |
And actually one of the nice things has been, 00:03:18.540 |
so I think I mentioned I wanted to become a doctor, 00:03:21.780 |
but actually Sam is having so much impact in medicine, 00:03:30.220 |
hopefully SAM 2 can also have a similar sort of impact 00:03:36.180 |
- Yeah, I want to give Joseph a chance to comment. 00:03:42.620 |
but like in the past year since we did our podcast on Sam, 00:03:53.020 |
You know, recapping from the first release to present, 00:03:56.020 |
Sam introduces the ability for models to near zero shot, 00:04:03.020 |
identify kind of perfect polygons and outlines 00:04:13.740 |
lots of manual labeling, lots of manual preparation, 00:04:17.460 |
clicking very meticulously to create outlines of individuals 00:04:24.940 |
to do zero shot segmentation of items inside images, 00:04:29.940 |
though none were as high quality as segment anything. 00:04:35.420 |
And with the introduction of segment anything, 00:04:38.780 |
you can pass an image with Sam one, Sam two videos as well, 00:04:57.940 |
for the downstream task and problem you're working on. 00:05:00.700 |
Though Sam has accelerated the rate at which developers 00:05:05.700 |
are able to use computer vision and production applications. 00:05:10.300 |
So at Roboflow, we were very quick to enable the community 00:05:15.140 |
of computer vision developers and engineers to use Sam 00:05:23.260 |
you could kind of use Sam as is to like pass an image 00:05:28.340 |
Another use case for Sam is in preparation of data 00:05:40.140 |
where you have a bunch of images from a wet lab experiment. 00:05:46.340 |
you need to count the presence of a particular protein 00:05:52.140 |
To count all the individual protein reactions, 00:05:59.900 |
will still like kind of individually count and say, 00:06:02.340 |
what are the presence of all of those proteins? 00:06:10.580 |
But often you may need to also add like a class name 00:06:14.860 |
to what the protein is, or you may need to say, 00:06:17.860 |
hey, like I care about the protein portion of this, 00:06:20.340 |
I don't care about the rest of the portion of this image. 00:06:23.420 |
And, or what it encourages and asks for the user to do 00:06:36.620 |
which is kind of a new paradigm that Sam introduced. 00:06:39.140 |
And so at Roboflow, we have one portion of our tool stack 00:06:45.980 |
With segment anything, Sam can already provide, 00:06:49.540 |
hey, here's where I see the outlines of objects, 00:06:54.060 |
hey, here's where the outlines of objects matter. 00:07:01.700 |
And users have labeled about 49 million images 00:07:09.580 |
And that's like 5 million in the last 30 days alone. 00:07:16.900 |
we did kind of like a rough back-of-napkin calculation 00:07:24.060 |
you're clicking individual points to create a polygon. 00:07:29.820 |
And I'm sure in a bit, we can maybe screen share 00:07:32.140 |
and show some examples of what this experience is like. 00:07:37.900 |
on average saves, you know, maybe a dozen or so seconds. 00:07:44.940 |
on the order of magnitude of 35 years of time for users. 00:07:51.460 |
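For reference, a minimal sketch of that back-of-napkin arithmetic; the per-image savings and image count below are illustrative assumptions rather than Roboflow-reported figures.

```python
# Rough back-of-napkin estimate of annotation time saved with SAM-assisted labeling.
# All inputs are illustrative assumptions, not reported figures.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

images_labeled = 49_000_000       # SAM-assisted images on the hosted platform
seconds_saved_per_image = 22      # assumed: a couple of polygons, roughly a dozen seconds each

total_seconds_saved = images_labeled * seconds_saved_per_image
print(f"~{total_seconds_saved / SECONDS_PER_YEAR:.0f} years of labeling time saved")  # ~34 years
```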
- So I mean, basically like in the first year 00:07:55.580 |
not only can you say, hey, I'm just gonna go use this model, 00:07:57.940 |
but those numbers that like 49 million images 00:08:01.300 |
is an estimate directly related to just the hosted side. 00:08:05.260 |
So imagine all of the users that are self-hosting 00:08:28.100 |
you know, people use terms like game changing 00:08:29.860 |
and these sorts of things, it has changed the industry. 00:08:42.980 |
was how many fields actually rely on manual segmentation. 00:08:51.300 |
'cause you get to see all the users of these tools. 00:08:56.180 |
people working on understanding coral reef bleaching 00:09:18.140 |
but is everyone using stock segment anything? 00:09:25.340 |
for the medical field without fine tuning, right? 00:09:32.820 |
So one of the design decisions we made in Sam 00:09:40.300 |
And so all the data is annotated in a class agnostic way. 00:09:59.100 |
So you can imagine that we have 11 million images 00:10:17.300 |
that looked like it, but we didn't have to label it. 00:10:22.740 |
for applications that it wasn't really trained for, 00:10:32.140 |
But having said that, there's probably certain domains 00:10:37.500 |
in order to be able to segment something properly. 00:10:42.020 |
having some extra fine tuning data would probably help. 00:10:45.460 |
And we've sort of seen that there's some papers 00:10:56.060 |
- Once Sam came out, there were adaptations that said, 00:10:59.580 |
could we use Sam to be, you know, like efficient Sam, 00:11:02.700 |
like basically take Sam and maybe accelerate it. 00:11:07.300 |
like cell Sam, for example, out of the UC system. 00:11:15.140 |
there's kind of two ways by which that's done. 00:11:19.620 |
like potentially Sam doesn't have a good concept 00:11:27.940 |
and increase the accuracy for zero shot prediction. 00:11:31.940 |
The second way though, is it's not fine tuning, 00:11:35.900 |
It's just guiding the model's existing knowledge 00:11:41.780 |
And both those are actually kind of equally important 00:11:47.500 |
that the objects of interest can be correctly segmented 00:11:55.660 |
like an omniscient Sam that could see every segment 00:11:57.820 |
in every domain with all pixels perfectly outlined, 00:12:04.900 |
to almost like signal to the model what you care about. 00:12:08.260 |
Like to paint this picture, if you were like a retailer 00:12:18.940 |
you may care about, you know, only the shirt. 00:12:21.300 |
And Sam by default might segment the full person. 00:12:24.060 |
And so there's visual prompting that you can do 00:12:27.460 |
to ensure that you only outline maybe the shirt 00:12:29.820 |
for the purposes of swapping in and out different shirts 00:12:31.860 |
for displaying a given model on a retail page. 00:12:35.780 |
And so I think what's interesting is that's where like, 00:12:39.660 |
but that's where like when you apply to industry, 00:12:41.900 |
like one thing that's particularly important with tooling 00:12:45.060 |
and enabling Sam to reach its full potential. 00:12:55.100 |
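For reference, a minimal sketch of that kind of visual prompting with the public segment-anything package: a rough box plus a negative click steers the class-agnostic model toward just the shirt instead of the whole person. The checkpoint path, image path, and all coordinates are placeholders.

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Checkpoint path, image path, and coordinates below are illustrative placeholders.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)
predictor.set_image(np.array(Image.open("model_photo.jpg").convert("RGB")))

# A rough box around the shirt plus a negative click on the face steers the
# class-agnostic model toward the garment rather than the full person.
masks, scores, _ = predictor.predict(
    box=np.array([220, 140, 480, 420]),   # x0, y0, x1, y1
    point_coords=np.array([[350, 90]]),   # click on the face...
    point_labels=np.array([0]),           # ...labeled 0 = background
    multimask_output=False,
)
shirt_mask = masks[0]  # boolean HxW mask, ready for swapping in different shirts
```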
on the class labeling side is the Grounding DINO work, right? 00:13:18.380 |
So we have a web demo where anyone can try Sam 2 on a video. 00:13:23.380 |
Here we have a video of someone kicking a football 00:13:34.540 |
in any frame of the video and this will work. 00:13:39.300 |
So the model's now tracking this in real time. 00:13:45.660 |
And now you can see the ball has been tracked 00:13:50.660 |
There's even like a little bit of a challenging case here 00:13:56.620 |
and actually the model makes a little bit of a mistake, 00:14:02.300 |
Here, the model makes a little bit of a mistake here, 00:14:09.180 |
until we get the mask that we want on this frame. 00:14:17.420 |
taking into account the additional information 00:14:22.700 |
We've also added a couple of other fun things 00:14:24.660 |
you can do on top of the track, like add effects. 00:14:28.660 |
We can add foreground effects, background effects, 00:14:37.100 |
as part of other tools like video editing tools 00:14:49.660 |
where we might not have even imagined SAM2 being useful. 00:15:00.140 |
even though models never really seen an octopus before. 00:15:07.300 |
that SAM2 can actually quite effectively keep track 00:15:19.620 |
of all the different tentacles is quite accurate. 00:15:25.820 |
is that objects can actually become occluded. 00:15:31.380 |
And a really fun example here is the shuffling cup game, 00:15:36.540 |
And so here I can click on the ball in the first frame. 00:15:45.820 |
is that there's three cups that look exactly the same. 00:15:49.100 |
And then there's a ball that will get occluded by the cup. 00:16:07.860 |
I wanted to point out a couple of fun demo UX features 00:16:11.500 |
that we added that actually really help with this. 00:16:25.220 |
the object disappears, and then the object comes back. 00:16:30.980 |
when the object's being occluded and when it's not. 00:16:35.940 |
if you need to go in and fix the model prediction or not. 00:16:45.300 |
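For reference, a minimal sketch of the same click-then-track flow using the released sam2 package; the function and config names follow the public facebookresearch/sam2 repo, but exact signatures, paths, and coordinates here are assumptions to check against the repo.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Config and checkpoint names are placeholders; use the ones shipped with the repo.
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")

with torch.inference_mode():
    # init_state expects a video (a directory of JPEG frames in the repo's examples).
    state = predictor.init_state(video_path="videos/football_frames")

    # A single positive click on the ball in frame 0 is the only prompt.
    predictor.add_new_points(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[410, 290]], dtype=np.float32),  # (x, y), illustrative
        labels=np.array([1], dtype=np.int32),              # 1 = positive click
    )

    # Memory attention carries the masklet through the rest of the video,
    # including frames where the ball is briefly occluded.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()  # one boolean mask per tracked object

# Refinement clicks on later frames are added the same way (a negative click to
# remove a region, a positive click to add one) before re-running propagation.
```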
- One thing that I think is really notable here, 00:16:49.180 |
One is like, I'd love to have a little bit of a discussion 00:16:53.460 |
of the embedded scene to keep track of the ball 00:16:58.620 |
One thing that Meta has put an emphasis on here 00:17:01.740 |
in a much greater degree than other model releases 00:17:25.940 |
was available prior to the web experience of ChatGPT. 00:17:29.340 |
Can you talk a bit about why that was a consideration 00:17:38.220 |
in tandem with training and releasing a new model? 00:17:41.780 |
I think that's a really great example of how, 00:17:43.700 |
you know, ChatGPT was really more of a UX innovation. 00:17:48.100 |
Obviously, it was like a number of research innovations 00:17:52.500 |
But as you said, like the underlying technology 00:17:56.660 |
putting this UX around it as a chat interface 00:18:03.700 |
and people understanding how it could be useful 00:18:07.980 |
And in computer vision, especially, it's so visual. 00:18:13.820 |
is by trying it on your own image or your own video. 00:18:19.300 |
we put a lot of effort in building like a high-quality demo. 00:18:43.260 |
With this approach, we found it to be really successful. 00:18:53.220 |
outside of machine learning would never have tried SAM 00:18:59.020 |
And I think that definitely led to a lot of the adoption 00:19:25.340 |
that maybe has not had much thought given to it. 00:19:41.620 |
for not thinking about only the new model capability, 00:19:44.900 |
but what sort of applications folks want to build 00:19:51.300 |
to think about many things that you might postpone. 00:20:01.380 |
And so it really forces you to think about these things 00:20:05.020 |
much sooner and actually makes us think about 00:20:08.340 |
how to, what kind of image encoder we want to use 00:20:10.940 |
or like other hardware efficiency improvements. 00:20:16.660 |
become a first-class citizen when you put the demo first. 00:20:22.220 |
and this is related to the architecture change. 00:20:27.340 |
you have the encoder that's creating the embeddings 00:20:39.180 |
can be run independently and on a cheaper process. 00:20:42.460 |
So in the SAM1 demo, the way that it was structured, 00:20:45.700 |
and also this is the way that we have our SAM tools 00:20:49.460 |
is images go to a GPU to get all the SAM-based embeddings. 00:21:11.140 |
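For reference, a minimal sketch of that split with SAM 1's public tooling: the heavy image encoder runs once on a GPU, and the cached embedding is handed to the lightweight prompt encoder and mask decoder (exportable to ONNX in the segment-anything repo), which can run per click on cheaper hardware or in the browser. Paths and the exact export workflow are assumptions to verify against the repo.

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Server side (GPU): run the heavy image encoder once per image.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth").to("cuda")
predictor = SamPredictor(sam)
predictor.set_image(np.array(Image.open("frame.jpg").convert("RGB")))

# The cached embedding (~4 MB for ViT-H) is all the client needs per image.
embedding = predictor.get_image_embedding().cpu().numpy()
np.save("frame_embedding.npy", embedding)

# Client side (CPU, or the browser via onnxruntime-web): the small prompt encoder
# and mask decoder, exported with the repo's scripts/export_onnx_model.py, run
# against this cached embedding for every interactive click, so no GPU round trip
# is needed per prompt.
```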
And I think that's because you made some notable improvements 00:21:17.700 |
- Can you talk a bit about what led to those speed increases 00:21:29.900 |
- Yeah, so the SAM2 web demo is primarily focused on video. 00:21:33.740 |
We decided to just keep it simple and focus on video. 00:21:52.180 |
to adopt the same architecture as SAM for video 00:21:55.260 |
because we can't send the per frame image embeddings 00:22:02.180 |
In SAM, each frame embedding was like four megabytes. 00:22:05.100 |
But if you have a long video and that's like per frame, 00:22:12.340 |
So SAM2 actually, in terms of the architecture details, 00:22:18.620 |
but SAM1 model was around 630 million parameters, 00:22:23.620 |
a fraction of the size of these large language models, 00:22:39.780 |
So we changed the image encoder from a ViT-H in SAM 00:22:44.380 |
to a Hiera model, which is also developed by Meta. 00:22:48.940 |
So that definitely was something that helped. 00:22:51.220 |
And in terms of the efficiency compared to SAM, 00:22:54.580 |
so if we were to run SAM per frame on a video 00:23:04.900 |
Number of things improved the efficiency of SAM2 00:23:07.380 |
such that we were actually able to run this entirely 00:23:15.100 |
But I am very curious to see who puts this on device. 00:23:18.420 |
I'm pretty sure soon we'll see an on-device SAM2 00:23:21.980 |
or maybe even running in the browser or something. 00:23:30.340 |
But we were able to make a compelling web demo 00:23:39.740 |
I want to talk more about things from the paper, 00:23:41.580 |
but I think we're still in this sort of demo section 00:23:43.500 |
and so I want to hand it to Joseph for his demo 00:23:48.100 |
- So I can give some context into one key area 00:24:02.260 |
to have a generalizable model for zero-shot capability. 00:24:22.340 |
So I will similarly share my screen and show an example. 00:24:30.740 |
and there's a number of ways that I could annotate things. 00:24:42.300 |
this is where we make use of models like Segment Anything 00:24:46.660 |
to propose candidate masks and make it faster. 00:25:04.780 |
of Segment Anything 2 performing better on images 00:25:20.260 |
you'll see here that like the original candidate proposal 00:25:39.060 |
but in fact, what I want is I want to name that as a class 00:25:42.660 |
because maybe for the model that I'm building, 00:25:51.060 |
Or, you know, maybe I'm even using like a multimodal model 00:25:56.300 |
to regions of interest in the images as a specific thing. 00:26:06.700 |
zero-shot prediction, and here we have our friend Rick. 00:26:10.780 |
So I get this really rich candidate set of predictions, 00:26:24.740 |
but also of the, what is inside that segment, 00:26:35.900 |
why maybe your team made a conscious decision 00:26:43.100 |
that are also adding open-text prompting capabilities 00:26:54.860 |
which, you know, you can do even image-to-image 00:27:01.340 |
And maybe I can actually give an example of that 00:27:11.780 |
I could try out, you know, prompting Grounding Dino 00:27:17.100 |
And what's notable is, let's do, I don't know, 00:27:20.620 |
let's prompt for person, and we'll prompt for person, 00:27:24.660 |
and let's prompt for, I don't know, microphone, 00:27:38.220 |
allows me to create, in this case, bounding boxes, 00:27:45.980 |
And, you know, we've already seen applications 00:28:00.220 |
and then get the benefits of the zero-shot segmentation 00:28:03.420 |
at the same time as getting the open-form querying. 00:28:09.660 |
we maintain a framework called, like, Autodistill, 00:28:18.260 |
and then prompt and say what you want from that ontology. 00:28:23.780 |
- You can apply videos or groups of images, yes. 00:28:26.740 |
So this is using a project called Autodistill. 00:28:29.580 |
And the concept of Autodistill is use a base model, 00:28:39.780 |
which also could be video broken into individual frames, 00:28:49.860 |
And then the combination of the grounding capabilities of, 00:28:54.540 |
in the example I was showing, Florence 2 plus SAM, 00:29:10.580 |
run this across a bunch of images or video frames, 00:29:21.740 |
And in fact, like, the open form grounding capabilities 00:29:26.780 |
became something the field was broadly doing. 00:29:31.820 |
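For reference, a minimal sketch of that Autodistill flow; the package and class names follow Roboflow's public autodistill project, and the ontology, folder, and file extension are placeholders.

```python
from autodistill.detection import CaptionOntology
from autodistill_grounded_sam import GroundedSAM

# The ontology maps open-text prompts (what the grounding model is asked for)
# to the class names the auto-labeled dataset should use.
ontology = CaptionOntology({
    "person": "person",
    "microphone": "microphone",
})

# Grounding DINO finds boxes for the prompts; SAM turns those boxes into masks.
base_model = GroundedSAM(ontology=ontology)

# Auto-label a folder of images (or video broken into frames) into a dataset
# that a smaller, faster target model can then be trained on.
base_model.label("./frames", extension=".jpg")
```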
one of the things I thought maybe SAM 2 would do 00:29:36.260 |
So I'm curious to hear, like, the conscious decision to say, 00:29:39.140 |
hey, we want to continue to be class-agnostic. 00:29:41.340 |
We don't want to add yet maybe open form text prompting 00:29:45.900 |
as a part of finding the segments and parts of images. 00:29:51.660 |
And if you are encouraged or if you want kind of, like, 00:29:55.100 |
what's happening here where people are naturally 00:29:58.420 |
as something that you would expect and encourage to happen 00:30:01.340 |
despite not having it in the base model itself. 00:30:06.340 |
So I think it's really cool that the community 00:30:08.260 |
is taking SAM and taking SAM 2 and building on top of it 00:30:19.540 |
And then in terms of why we didn't put it into SAM 2, 00:30:22.780 |
so as you've probably seen with SAM and SAM 2, 00:30:35.060 |
we are trying to limit the focus on one thing 00:30:46.580 |
but can we do it so well that it's effectively solved? 00:30:57.180 |
we are working on each of these problems one at a time 00:31:08.020 |
the text prompting problem as like the next challenge? 00:31:21.540 |
and that's, I think, proven to be well accomplished. 00:31:24.660 |
- It's like taking both the data, the model, and the demo, 00:31:41.620 |
- This development reminds me of how, you know, 00:31:43.740 |
when you do, and you break out the interpretability 00:31:50.780 |
I feel like SAM is the edge detection version equivalent, 00:31:54.340 |
and then you build up to whatever the next feature is 00:32:01.980 |
and the model was released at 4 p.m. Pacific on Monday. 00:32:04.940 |
We're recording this at 11 a.m. Pacific on Thursday. 00:32:08.540 |
So it's very fresh for a lot of the capabilities. 00:32:11.820 |
And it is so clear that it is a stepwise change 00:32:26.220 |
One thing that's interesting is finding like domain problems 00:32:30.060 |
where there might be still domain applicability 00:32:40.100 |
which is like seven different domain type problems 00:32:43.220 |
that the industry commonly is working on in vision. 00:32:53.500 |
segment anything maybe less performant than other models 00:33:02.340 |
that are building agents to interact with the web 00:33:04.860 |
are particularly interested in that challenge 00:33:16.900 |
And I can show an example of like maybe what, 00:33:19.180 |
how like SAM kind of performs on this challenge 00:33:21.820 |
just to outline some of the context of this problem. 00:33:29.180 |
and what you would expect to want to be the case. 00:33:32.340 |
where I run SAM on the source image on the left, 00:33:41.100 |
where we just grabbed like the top 100 websites by traffic 00:33:49.940 |
and I'm curious how you think about this challenge 00:33:53.740 |
for this type of problem is processing screenshots. 00:34:05.900 |
and then right is SAM2 running on that image. 00:34:13.260 |
hey, tell me all of the buttons that an agent could press, 00:34:15.740 |
tell me like maybe the headlines of the articles, 00:34:25.620 |
I'm curious like how you think about a challenge like this 00:34:29.260 |
for a model that sees everything in the world, 00:34:38.540 |
and how you would expect to see improvement for domains 00:34:50.820 |
We try to build like these foundational models 00:34:53.900 |
that can be applied to lots of different use cases 00:35:01.620 |
potentially people might want to annotate some data, 00:35:11.180 |
that are very custom for different use cases. 00:35:18.540 |
But as you said, like the model is an annotation tool 00:35:23.260 |
And so I think that's definitely the approach 00:35:28.900 |
for you to improve the model as well as the model itself. 00:35:33.020 |
Focus on like as many multi or zero shot problems 00:35:36.220 |
and then allow the community to pick up the torch 00:35:40.660 |
Like we can't solve all the problems ourselves. 00:35:42.900 |
Like we can't solve all the different domains, 00:35:45.020 |
but if we can provide a sort of base hammer tool 00:35:54.340 |
I guess we want to transition to a little bit 00:35:55.820 |
on like asking more questions about the paper. 00:36:08.180 |
but just like just really, really well-written 00:36:10.220 |
and a lot of disclosures, including the dataset as well. 00:36:12.980 |
I think the top question that people had on the dataset, 00:36:18.500 |
about the data engine as well, which I really love. 00:36:32.020 |
but as a research manager for this whole thing, 00:37:06.260 |
That's like the most basic way of extending SAM to video. 00:37:20.660 |
that takes the mask as the first frame input. 00:37:38.060 |
that can do both image and video segmentation 00:37:44.740 |
And we found that, you know, going from each phase, 00:37:51.620 |
And in particular, when you get rid of this two-part model, 00:37:59.540 |
so you prompt the model in one frame to select an object, 00:38:05.740 |
to all the other frames of the video to track the object. 00:38:09.860 |
But if the model makes a mistake and you want to correct it, 00:38:21.660 |
to remove a region or a positive click to add a region. 00:38:27.740 |
you would have to delete that frame prediction 00:38:34.220 |
And so you can imagine for more complex objects, 00:38:37.420 |
this is actually adding like a lot of extra time 00:38:47.780 |
really follow like how we thought about the model design 00:38:53.220 |
because it really helped improve the data quality 00:39:05.900 |
by the time you hit stage three, which is kind of cool. 00:39:08.740 |
- We joked that when SAM1 came out at Roboflow, 00:39:11.180 |
we're like, "Was this purpose built for our software?" 00:39:13.780 |
Like you have the embedding take like a big model 00:39:23.460 |
Now hearing you talk about how you think about 00:39:25.860 |
building models with a demo in mind, it makes sense. 00:39:37.860 |
is gonna take seminal advances and apply them. 00:39:42.460 |
Like it could also be a model that outputs boxes 00:39:51.980 |
or as a component as part of a larger AI system. 00:40:01.140 |
It needs to have the zero shot generalization capability. 00:40:18.580 |
the sort of research level, architecture level innovation 00:40:22.180 |
that enabled what I've been calling object permanence 00:40:33.460 |
the way we think about extending SAM to video 00:40:36.660 |
is that an image is just a special case of a video 00:40:46.860 |
to be able to support segmentation across videos. 00:40:50.380 |
So this is a quick video that shows how this works. 00:40:53.500 |
So SAM architecture, we have the image encoder, 00:40:56.020 |
we have a prompt encoder, we have a mask decoder. 00:40:59.300 |
You can click on an image and that basically is a prompt. 00:41:04.300 |
We use that prompt along with the image embedding 00:41:11.340 |
Going to SAM 2, we can also apply SAM 2 to images 00:41:15.460 |
because we can, as I said, treat an image as a video 00:41:20.420 |
And so when we are in the SAM 2 architecture, 00:41:27.740 |
There's memory attention, there's a memory encoder, 00:41:45.340 |
because they provide the context of the target object 00:41:59.180 |
or the prompted frames, which are basically the frames 00:42:02.060 |
at which a user or a model provides input like clicks. 00:42:06.860 |
And then there's like the surrounding frames. 00:42:09.100 |
And so we use six frames around the current frame 00:42:19.660 |
Going into a little bit more detail about that, 00:42:21.500 |
there's like two kinds of memory that we use. 00:42:38.100 |
how does this relate to context window and LLMs. 00:42:46.940 |
So they both provide different types of information 00:42:49.700 |
on the spatial side or in terms of the concept 00:42:54.620 |
And so we found that having like six frame length 00:42:56.980 |
for the spatial memory coupled with this longer period 00:43:03.260 |
strong video segmentation accuracy at high speed. 00:43:06.380 |
So as I mentioned, the real time aspect is really important. 00:43:10.220 |
We have to find this speed accuracy trade off. 00:43:12.780 |
And one way in which we sort of circumvent this 00:43:15.700 |
is by allowing additional prompts on subsequent frames. 00:43:24.300 |
After an occlusion, you can provide another prompt 00:43:29.940 |
And so the prompted frames are always in the memory. 00:43:35.700 |
where the model will always remember what you provided. 00:43:39.620 |
And so that's a way in which we can sort of avoid 00:43:45.500 |
That actually is a big limitation of current models. 00:43:50.140 |
don't allow any way to recover if the model makes a mistake. 00:43:53.380 |
And so Joseph, going back to your point about the demo, 00:44:03.140 |
like it's not going to be a one time prediction, 00:44:06.660 |
but you actually want to be able to intervene. 00:44:11.540 |
you can actually be like, no, actually do it this way 00:44:15.500 |
And so we really want to bring some of that thinking 00:44:18.620 |
into how we build these computer vision models as well. 00:44:22.700 |
My main reaction to finding out about the context length, 00:44:26.220 |
input frames and six past frames as their default 00:44:33.060 |
we're very used to severely extending context windows. 00:44:37.060 |
And what does that do to the memory of your model? 00:44:40.540 |
- So I think maybe one thing that's different 00:44:42.500 |
is that the object in videos, it is challenging. 00:44:53.580 |
is probably the amount of context that you need 00:44:57.780 |
than maintaining a long multi-turn conversation. 00:45:01.060 |
And so coupling this short-term spatial memory 00:45:04.500 |
with these longer-term object pointers, we found, was enough. 00:45:16.340 |
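For reference, a schematic (not the released SAM 2 implementation) of how such a memory design can be wired up: a short FIFO of spatial memories from the last six frames, prompted frames that are never evicted, and a longer list of compact object pointers, all cross-attended by the current frame's features before the mask decoder runs. Sizes and class names here are assumptions.

```python
from collections import deque

import torch
import torch.nn as nn


class MemoryBank:
    """Schematic memory for streaming video segmentation (not the sam2 code)."""

    def __init__(self, num_recent_frames: int = 6, max_object_pointers: int = 16):
        self.recent = deque(maxlen=num_recent_frames)       # spatial memories, FIFO
        self.prompted = []                                   # clicked frames: never evicted
        self.pointers = deque(maxlen=max_object_pointers)    # compact per-frame object tokens

    def add(self, spatial_memory: torch.Tensor, object_pointer: torch.Tensor,
            is_prompted: bool) -> None:
        (self.prompted if is_prompted else self.recent).append(spatial_memory)
        self.pointers.append(object_pointer)

    def tokens(self) -> torch.Tensor:
        # Flatten spatial memories (B, C, H, W) -> (B, H*W, C) and concat with pointers (B, C).
        spatial = [m.flatten(2).transpose(1, 2) for m in (*self.prompted, *self.recent)]
        pointers = [p.unsqueeze(1) for p in self.pointers]
        return torch.cat(spatial + pointers, dim=1)


class MemoryAttention(nn.Module):
    """Current-frame features cross-attend to the memory bank."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_feats: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, H*W, C); memory: (B, M, C)
        attended, _ = self.cross(frame_feats, memory, memory)
        return frame_feats + attended  # conditioned features go on to the mask decoder
```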
with how literature refers to object re-identification, 00:45:20.060 |
object re-identification is not only what SAM does 00:45:23.780 |
for identifying that an object is similar across frames, 00:45:36.180 |
in addition to seeing that the same looking thing 00:45:43.500 |
I think, you know, SAM2 definitely isn't perfect 00:45:48.900 |
that we'd love to see people in the community 00:45:56.300 |
is where there are multiple similar looking objects, 00:46:03.660 |
Keeping track of the target object is a challenge. 00:46:11.100 |
but again, the ability to provide refinement clicks 00:46:15.260 |
is one way to sort of circumvent that problem. 00:46:18.780 |
In most cases, when there's lots of similar looking objects, 00:46:23.540 |
you can get the perfect track throughout the video. 00:46:26.580 |
So definitely that's one way to solve that problem. 00:46:30.580 |
But, you know, we could have better motion estimation. 00:46:35.460 |
to be able to disambiguate similar looking objects 00:46:43.820 |
anyone interested in this kind of architecture. 00:46:46.340 |
Like, are there papers that you would refer people to 00:46:51.260 |
or, you know, have other interesting alternative approaches? 00:47:04.020 |
It really, really depends on what your application is. 00:47:06.420 |
Like, if you don't care about the entire mask, 00:47:16.580 |
might not actually be necessary for certain use cases. 00:47:21.780 |
you might not need the full capabilities of SAM or SAM2. 00:47:26.180 |
There's many different approaches to tracking. 00:47:27.980 |
I think I would encourage people to think about, like, 00:47:39.180 |
You know, maybe you don't even need the full mask. 00:47:42.660 |
But you have solved the problem that you set out to solve, 00:47:46.540 |
which is something that we're still appreciating even today. 00:47:50.220 |
I would just transition to sort of forward-looking, 00:47:59.900 |
And obviously you're the best person to ask about that. 00:48:11.300 |
this ImageBind, like, how are things organized? 00:48:18.220 |
we have a number of different research areas. 00:48:26.100 |
basically look at all the fundamental problems 00:48:36.900 |
There are tons of other problems in computer vision 00:48:44.660 |
And so that's really the area in which I work on. 00:48:48.100 |
And then there's a number of other research areas 00:48:53.940 |
in more efficient models and various other topics. 00:48:57.540 |
So FAIR in general is still very much pushing the boundaries 00:49:08.540 |
actually I probably shouldn't talk about llama, 00:49:53.260 |
The way I kind of see the field continuing to progress, 00:49:57.900 |
like the problem statement of computer vision 00:50:13.820 |
out in the world are on the center of that bell curve. 00:50:16.060 |
And then there's things that are less frequently occurring 00:50:24.580 |
"Hey, can we find 80 common objects in context?" 00:50:29.140 |
Like silverware and fridge and these sorts of things. 00:50:32.380 |
And we also conceptualized the challenge of computer vision 00:50:35.460 |
in terms of breaking it down into individual task types, 00:50:38.100 |
because that's like the tools we had for the day. 00:50:40.020 |
So that's why you have the origination of classification, 00:50:45.300 |
And then as you see things continue to progress, 00:50:50.860 |
that need to observe areas in the long tails. 00:50:59.940 |
Some of our customers like Rivian, for example, 00:51:02.420 |
only Rivian knows what the inside of like a Rivian 00:51:05.340 |
should look like as it's assembled and put together 00:51:10.460 |
So how could a model even been trained on the things 00:51:13.820 |
that go inside the componentry of producing a vehicle? 00:51:17.900 |
And what's kind of happening with computer vision 00:51:24.860 |
in the middle of the bell curve push outward faster. 00:51:27.740 |
That's where you see the advent of like open text models 00:51:31.540 |
or the richness of understanding of multimodal models 00:51:46.020 |
kind of like the messy middle in between those two, right? 00:51:48.500 |
So like, Nikhila kind of talked about examples 00:51:54.100 |
even though there weren't octopi in the training data. 00:51:58.580 |
where SAM isn't yet super great at screenshots. 00:52:04.900 |
But what's gonna happen is there needs to be systems 00:52:09.540 |
that I think about like tooling to also validate 00:52:11.980 |
that models are doing what we want them to do, 00:52:13.980 |
adapting to datasets that we want them to adapt to. 00:52:16.500 |
And so there's a lot of things on a forward-looking basis 00:52:19.380 |
that allow propelling that expansion of generalizability. 00:52:30.380 |
of dataset curation continues to play a massive role. 00:52:35.140 |
Something that's notable, I think, about SAM 2 00:52:51.380 |
the largest model being a couple hundred million parameters, 00:53:00.740 |
we're gonna see more capable, more generalizable models 00:53:04.580 |
being able to run on a wider array of problems 00:53:07.900 |
with zero or multi-shot capability at a faster rate. 00:53:22.140 |
and probably blended architectures increasingly too. 00:53:25.220 |
So my viewpoint of like on a go-forward basis 00:53:27.700 |
is we will have that bell curve of what humans can see 00:53:32.500 |
both in the center of that curve and the long tails 00:53:36.740 |
allow richer understanding multi and zero-shot 00:53:45.300 |
that allow using them in practical and pragmatic ways. 00:53:52.340 |
the research trends map or don't map to that. 00:54:16.940 |
and then the size of these datasets are really small. 00:54:20.460 |
So with SAM, it's, you know, we had a billion masks, 00:54:24.820 |
we had 11 million images, didn't have class labels, 00:54:28.500 |
but even before that, there were a lot of datasets 00:54:33.580 |
with significantly more, with like a lot of class labels, 00:54:49.740 |
And they're usually like people, there's cars, 00:54:52.300 |
there's dogs and cats and all these common objects, 00:55:06.820 |
these video tracking models actually don't have 00:55:12.260 |
And so that's why having this dataset is really important 00:55:17.100 |
for the segment anything capability in video, 00:55:20.180 |
because if you just provide the mask as the input 00:55:23.580 |
to an off-the-shelf video object segmentation model, 00:55:37.780 |
So doing these sort of combining two models together 00:55:41.380 |
to try to get a capability will actually only get you so far 00:55:45.540 |
and being able to actually create the dataset 00:55:54.620 |
And we can actually see that when we do comparisons 00:55:59.980 |
with the same input mask and the baseline model 00:56:10.420 |
whereas these baselines might actually start tracking 00:56:13.460 |
the entire person because that's what they're used to doing 00:56:16.260 |
and isolating it to just one part of the person 00:56:19.100 |
is not something they were ever trained to do. 00:56:21.620 |
And so those are sort of some of the limitations. 00:56:37.780 |
Or it's actually, we found that in the SAM2 paper, 00:56:50.620 |
And we find that actually SAM2 is a lot better than SAM 00:56:54.340 |
when it comes to segmenting objects in video frames, 00:57:02.660 |
And so I think that's maybe one learning from this project 00:57:12.580 |
as if you really think about how to build things 00:57:27.820 |
of going from COCO to SAM to SAM2. 00:57:27.820 |
to have that perspective as we build these models 00:57:38.820 |
and as we think about the type of capabilities 00:57:50.900 |
So if like COCO is common objects in context, 00:57:53.060 |
RF100 is like novel objects in weird contexts, 00:58:01.220 |
And so we challenged the community as a part of, 00:58:07.420 |
And it's basically like how well can you create models 00:58:13.540 |
is how well things can learn domain adaptation. 00:58:21.100 |
And what's really impressive about SAM and SAM2 00:58:24.820 |
from what you just described is even with the limited set, 00:58:27.700 |
the class agnostic approach affords the generalizability 00:58:32.180 |
even to out of distribution examples, surprisingly well. 00:58:39.100 |
And so that research direction seems extremely promising. 00:58:42.540 |
- Yeah, and actually Piotr is always telling us like, 00:58:45.460 |
don't care about COCO, even though he built COCO. 00:58:51.540 |
And really keeping that zero shot real world use cases 00:59:00.980 |
- Okay, I think that just leaves us to calls to action 00:59:03.620 |
for engineers, researchers, and personal recommendations. 00:59:09.340 |
- Yeah, so please try out all the resources we put out. 00:59:12.780 |
We, you know, open sourced the SA-V dataset, SAM2, 00:59:22.780 |
Please try all of these things that we've released. 00:59:31.180 |
Actually in the blog post, we go through many of these 00:59:36.980 |
And so if you have any ideas of how to improve these, 00:59:40.700 |
like please build on top of what we've released. 00:59:43.540 |
We would love to see some of these problems get solved 01:00:14.420 |
And then obviously the incredible open source 01:00:21.100 |
It was a much better episode with you than without you. 01:00:28.020 |
just let us know and we'll come back on again.