
Segment Anything 2: Memory + Vision = Object Permanence — with Nikhila Ravi and Joseph Nelson


Chapters

0:00 Introducing Nikhila
3:38 The Impact of SAM 1 in 2023
9:15 Do People Finetune SAM?
13:05 Video Demo of SAM
17:01 Why the Demo is so Important
20:23 SAM 1 vs SAM 2 Architecture
23:46 Video Demo of SAM on Roboflow
29:44 Extending SAM 2 with other models
32:00 Limitations of SAM: Screenshots
35:56 SAM 2 Paper
36:15 SA-V Dataset and SAM Data Engine
40:15 Memory Attention to solve Video
44:24 "Context Length" in Memory Attention
45:17 Object Tracking
47:52 The Future of FAIR
49:23 CVPR, Trends in Vision
1:00:04 Calls to Action

Transcript

(upbeat music) - Welcome to the Latent Space podcast. I'm delighted to do Segment Anything 2. One of our very first viral podcasts was Segment Anything 1 with Joseph. Welcome back. - Thanks so much. - And this time we are joined by the lead author of Segment Anything 2, Nikhila Ravi.

Welcome. - Thank you. Thanks for having me. - There's a whole story there; we can refer people back to episode four of the podcast, way back when, for the story of Segment Anything. But I think we're interested in just introducing you as a researcher, on the human side.

What was your path into AI research? Why, you know, why did you choose computer vision coming out of your specialization at Cambridge? - Yeah, yeah, sure. So I did my undergraduate degree in engineering at Cambridge University. The engineering program is very general. So first couple of years, you sort of study everything from mechanical engineering to fluid mechanics, structural mechanics, material science, and also computer science.

Towards the end of my degree, I started taking more classes in machine learning and computational neuroscience, and I really enjoyed it. And actually after graduating from undergrad, I had a place at Oxford to study medicine. And so I was initially planning on becoming a doctor, had everything planned, and then decided to take a gap year after finishing undergrad.

And actually that was around the time that sort of deep learning was emerging. And in my machine learning class in undergrad, I remember one day our professor came in and that was when Google acquired DeepMind. And so that became like a huge thing. We talked about it for the whole class.

It kind of really kicked off thinking about, okay, maybe I want to try something different other than medicine. Maybe this is a different path I want to take. And then in the gap year, I did a bunch of coding, worked on a number of projects, did some sort of freelance contracting work.

And then I got a scholarship to come and study in America. So I went to Harvard for a year, took a bunch of computer science classes at Harvard and MIT, worked on a number of AI projects, especially in computer vision. I really, really enjoyed working in computer vision, applied to Facebook and got this job at Facebook.

And I applied to Facebook at the time, now Meta, and I've been here for seven years. So it's been a very circuitous path, probably not a very conventional one. I didn't do a PhD. I'm not a typical research scientist. I definitely came from more of an engineering background. But since being at Meta, I've had amazing opportunities to work across so many different interesting problems in computer vision, from 3D computer vision.

How can you go from images of objects to 3D structures? And then going back to 2D computer vision and actually understanding the objects and the pixels in the images themselves. So it's been a very interesting journey over the past seven years. - It's weird because I guess with Segment Anything 2, it's like 4D because you solve time.

You know, you started with 3D and now you're solving the 4D. - Yeah, it's just going from 3D to images to video. It's really covering the full spectrum. And actually one of the nice things has been, so I think I mentioned I wanted to become a doctor, but actually Sam is having so much impact in medicine, probably more than I could have ever had as a doctor myself.

So I think, you know, hopefully Sam 2 can also have a similar sort of impact in medicine and other fields. - Yeah, I want to give Joseph a chance to comment. Does that also mirror your experience? We know your story about going into vision, but in the past year since we did our podcast on Sam, what's been the impact that you've seen?

- Segment Anything set a new standard in computer vision. You know, recapping from the first release to present, Sam introduced the ability for models to, near zero-shot, meaning without any training, identify kind of perfect polygons and outlines of items and objects inside images. And that capability previously required lots of manual labeling, lots of manual preparation, clicking very meticulously to create outlines of individual objects and people.

And there were some models that attempted to do zero-shot segmentation of items inside images, though none were as high quality as Segment Anything. And with the introduction of Segment Anything, you can pass an image to Sam 1, and with Sam 2 videos as well, and get pixel-perfect outlines of most everything inside the images.

Now there are some edge cases across domains, and similar to the human eye, sometimes you need to say which item you most care about for the downstream task and problem you're working on. But Sam has accelerated the rate at which developers are able to use computer vision in production applications.

So at Roboflow, we were very quick to enable the community of computer vision developers and engineers to use Sam and apply it to their problems. The principal way is using Sam as is: you pass an image and receive back masks (a minimal sketch of that flow is shown below). Another use case for Sam is in preparation of data for other types of problems.
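
A minimal sketch of that "pass an image, receive back masks" flow, using Meta's open-source segment-anything package. The checkpoint and image paths are placeholders and the generator is left at its defaults, so treat this as illustrative rather than Roboflow's actual integration:

```python
# Minimal sketch: automatic mask generation with SAM 1 (segment-anything package).
# Checkpoint and image paths below are placeholders.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")  # the image encoder is heavy, so use a GPU if available

mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("image.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts: "segmentation", "area", "bbox", ...
print(f"SAM proposed {len(masks)} class-agnostic masks")
```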

So for example, in the medical domain, let's say that you're working on a problem where you have a bunch of images from a wet lab experiment. And from each of those images, you need to count the presence of a particular protein that reacts to some experiment. To count all the individual protein reactions, lab assistants to this day will still go in and individually count the presence of all of those proteins.

With segment anything, it's able to identify all of those individual items correctly. But often you may need to also add like a class name to what the protein is, or you may need to say, hey, like I care about the protein portion of this, I don't care about the rest of the portion of this image.

What it encourages and asks the user to do is to provide some visual prompting, to say, hey, which part? Sam says, hey, I can find segments of anything, but which segments do you care about? And so you can do visual prompting, which is kind of a new primitive that Sam introduced.

And so at Roboflow, we have one portion of our tool stack that enables users to very quickly label data. With Segment Anything, Sam can already propose, hey, here's where I see the outlines of objects, or a user can click to prompt and say, hey, here's where the outlines of objects matter.

And I recently pulled statistics on the usage of Sam in Roboflow over the course of the last year. Users have labeled about 49 million images using Segment Anything on the hosted side of the Roboflow platform, and that's like 5 million in the last 30 days alone. And for those images, we did a rough back-of-the-napkin calculation of how much time that has saved.

Because again, the alternative is you're clicking individual points to create a polygon. And with Sam, you just click once and it guesses where the polygon is. And I'm sure in a bit, we can maybe screen share and show some examples of what this experience is like. And in that time estimation, it's like, on average saves, you know, maybe a dozen or so seconds.

And we estimate that this has probably saved on the order of magnitude of 35 years of time for users. - That's incredible. - So I mean, basically like in the first year of a model being available, not only can you say, hey, I'm just gonna go use this model, but those numbers that like 49 million images is an estimate directly related to just the hosted side.

So imagine all of the users that are self-hosting or using Sam for robotics applications or out in the field or offline, where it's not even like the time or the image counts are tabulated. And we're probably talking about, you know, just a fraction of the amount of value that's actually being produced for a number of downstream tasks.

So to say that the impact has been, you know, people use terms like game changing and these sorts of things, it has changed the industry. It's set a new standard. And with the release of Sam 2, I think we're about to see an acceleration of those capabilities for a lot of reasons.

- That's really great to hear. I think one of the learnings from the release of Sam 1 was how many fields actually rely on manual segmentation. I think we're not really exposed to that. Maybe you are at Roboflow 'cause you get to see all the users of these tools. But for me, it was, you know, people working on understanding coral reef bleaching or farmers counting their cows, and so many different applications that, as a researcher at Meta, you never get exposed to, but can have impact towards.

So I think that was really awesome to hear. - So as sort of audience surrogate who knows less than the two of you, I'm gonna ask a really dumb question maybe, but is everyone using stock segment anything? Are they fine tuning for the medical domain? Like how on earth could it work for the medical field without fine tuning, right?

Like, is that a thing? - So I mean, I can give a quick perspective from the research side. So one of the design decisions we made in Sam was to not have class labels. And so all the data is annotated in a class agnostic way. So anything that has a boundary, we consider to be an object.

So for example, in any image, there's lots of small objects. We might not know what their names are, but you can draw a boundary around them. So you can imagine that we have 11 million images in the SA-1B dataset, and we annotated all the objects. There's many, many small objects.

And so if you think about cells, they're also kind of small objects. There's probably things in the training data that looked like it, but we didn't have to label it. And so that means that even when you use Sam for applications that it wasn't really trained for, because we didn't restrict it to a certain set of categories, you can actually use it out of the box without custom adaptation.

But having said that, there's probably certain domains where you need some expertise in order to be able to segment something properly. And for those use cases, having some extra fine tuning data would probably help. And we've sort of seen that there's some papers that have come out that do this.

And we'd love to hear, Joseph, how people are collecting data with Sam and fine-tuning for their use cases. - Once Sam came out, there were adaptations that said, could we make Sam more efficient, like EfficientSAM, basically take Sam and maybe accelerate it. And then there were domain-adapted Sams, like CellSAM, for example, out of the UC system.

Now, what's interesting is there's, like adapting Sam to a domain, there's kind of two ways by which that's done. One is, as you mentioned, like potentially Sam doesn't have a good concept of the objects of interest. And so you need to do domain adaptation and increase the accuracy for zero shot prediction.

The second way though, is it's not fine tuning, it's actually just prompting. It's just guiding the model's existing knowledge to say which segments you care about. And both those are actually kind of equally important on the application side. You need to like a priori ensure that the objects of interest can be correctly segmented and maybe collect data to do that.

But even if you had like a perfect Sam, like an omniscient Sam that could see every segment in every domain with all pixels perfectly outlined, in production, you would still need some way to almost like signal to the model what you care about. Like to paint this picture, if you were like a retailer and you are providing photos of models wearing your clothing on your retail site, you may care about, you know, only the shirt.

And Sam by default might segment the full person. And so there's visual prompting that you can do to ensure that you only outline maybe the shirt, for the purposes of swapping in and out different shirts for displaying a given model on a retail page. And so I think what's interesting is, I wouldn't call it domain adaptation, but when you apply this in industry, that kind of tooling becomes particularly important for enabling Sam to reach its full potential.

- That's really encouraging to hear. I should also note that, you know, the last time we talked about this, a very natural addition on the class labeling side was the Grounding DINO work, right? So I think people built Grounded SAM and all the other extensions.

I think it's probably a good time to cut to a quick demo of Sam 2 for people who are tuning in for Sam 2, and who better to demo Sam 2 than Nikhila. - Sure. So I'll try to narrate what I'm doing so audio listeners can also understand. So we have a web demo where anyone can try Sam 2 on a video.

Here we have a video of someone kicking a football and I'm gonna click on the football to select the object in the first frame, but you can actually select the object in any frame of the video and this will work. The next step is to hit track. So the model's now tracking this in real time.

We don't save any of this. It's all running in real time. And now you can see the ball has been tracked throughout the entire video. There's even like a little bit of a challenging case here where the shoe covers the football and actually the model makes a little bit of a mistake, but that's okay because we can...

The model makes a little bit of a mistake here, but we can actually add a refinement click. You can add negative clicks until we get the mask that we want on this frame. And then you can hit track again and the model will track the object, taking into account the additional information I've provided at that frame.
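
The click, track, and refine loop Nikhila narrates above can also be driven programmatically with the video predictor in Meta's open-source sam2 package. The sketch below follows the public SAM 2 repository, but the exact function names, config paths, checkpoint files, and click coordinates are assumptions; check the current release before relying on it.

```python
# Sketch of the demo's click -> track -> refine loop with the sam2 video predictor.
# Config/checkpoint names and click coordinates are placeholders.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(video_path="football_clip_frames/")

    # Click on the football in the first frame (label 1 = positive click).
    predictor.add_new_points(state, frame_idx=0, obj_id=1,
                             points=np.array([[430.0, 310.0]], dtype=np.float32),
                             labels=np.array([1], dtype=np.int32))

    # "Track": propagate that prompt through the whole video.
    masks = {f: m for f, _, m in predictor.propagate_in_video(state)}

    # Refinement: a negative click (label 0) on a frame where the shoe was wrongly
    # included, then propagate again; prompted frames stay in the model's memory.
    predictor.add_new_points(state, frame_idx=42, obj_id=1,
                             points=np.array([[415.0, 335.0]], dtype=np.float32),
                             labels=np.array([0], dtype=np.int32))
    masks = {f: m for f, _, m in predictor.propagate_in_video(state)}
```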

We've also added a couple of other fun things you can do on top of the track, like add effects. We can add foreground effects, background effects, and these are just ways of showing how we can use the output from SAM2 as part of other tools like video editing tools or other systems.

So this is just a preview of what you can do with SAM2. But the really cool use cases are places where we might not have even imagined SAM2 being useful. So we have a number of examples of things you might want to use it for. There's like underwater videos that it works actually really well for, even though the model has never really seen an octopus before.

And an octopus has a lot of moving parts, and SAM2 can actually quite effectively keep track of all the different tentacles. We can probably see it more clearly if I desaturate the background; we can see that the tracking of all the different tentacles is quite accurate. Another challenge with video is that objects can actually become occluded.

They can disappear from view and reappear. And a really fun example here is the shuffling cup game, which many of you might have seen. And so here I can click on the ball in the first frame. I can also click on a different cup. And so here the additional challenge is that there's three cups that look exactly the same.

And then there's a ball that will get occluded by the cup. So the ball is no longer visible. The cups are all moving around. They all look the same, but the model actually keeps track of the cup that we selected. And as you can see at the end here, I'll jump to the end so you can see, it actually finds the cup again.

I wanted to point out a couple of fun demo UX features that we added that actually really help with this. If you look at the bottom, there's these swim lanes, and the thickness of the swim lane actually tells you if the object's visible or not.

So at the beginning, the object's visible, the object disappears, and then the object comes back. So you can actually visually tell when the object's being occluded and when it's not. And so it's a nice way of like knowing if you need to go in and fix the model prediction or not.

And so these are some of the UX innovations that we came up with, as well as the model innovations. - One thing that I think is really notable here, there's two things. One is like, I'd love to have a little bit of a discussion about how the model's keeping track of the embedded scene to keep track of the ball and the cup in different places.

Pause on that for a second. One thing that Meta has put an emphasis on here, to a much greater degree than other model releases, is the demo experience: recognizing that in addition to having a model that can do zero-shot segmentation, you've created a web experience that allows folks to experience both the video effects and the types of UX innovations that encourage usage and adoption.

It's actually kind of reminiscent of how the underlying technology of ChatGPT was available prior to the web experience of ChatGPT. Can you talk a bit about why that was a consideration for your team and how you thought about the creation of the demo experience in tandem with training and releasing a new model?

- Yeah, absolutely. I think that's a really great example of how, you know, ChatGPT was really more of a UX innovation. Obviously, there were a number of research innovations that helped to get to this point. But as you said, the underlying technology was around for a while and, you know, putting this UX around it as a chat interface helped tremendously with adoption and with people understanding how it could be useful for real-world use cases.

And in computer vision, especially, it's so visual. The best way to show how these models work is by trying it on your own image or your own video. With the original SAM, we put a lot of effort in building like a high-quality demo. And the other piece here is that the demo is actually the annotation tool.

So we actually use the demo as a way to improve our annotation tool. And so then it becomes very natural to invest in building a good demo because it speeds up your annotation and improves the data quality and that will improve the model quality. With this approach, we found it to be really successful.

And obviously, externally, people really liked being able to try it. I think, you know, people in fields outside of machine learning would never have tried SAM if we didn't have that demo. And I think that definitely led to a lot of the adoption in like diverse fields. And so because we saw that with SAM 2, like the demo was a priority, first-class citizen from day one.

And so we really invested in making that. And I think with SAM 2 as well, we wanted to have like a step change in the demo experience. Interactive video segmentation, I think that experience is something that maybe has not had much thought given to it. And we really wanted to be like, okay, if we are to design a step changing video segmentation experience, what would that look like?

And that really did influence our model and annotation design as well. - It's a really encouraging trend for not thinking about only the new model capability, but what sort of applications folks want to build with models as a result of that downstream. - I think it also really forces you to think about many things that you might postpone.

For example, efficiency. For a good demo experience, making it real time is super important. No one wants to wait. And so it really forces you to think about these things much sooner and actually makes us think about how to, what kind of image encoder we want to use or like other hardware efficiency improvements.

So those kinds of things, I think, become a first-class citizen when you put the demo first. - That's one thing I was going to ask about, and this is related to the architecture change. So SAM1, in the SAM1 demo experience, you have the encoder that's creating the embeddings of all the potential spaces.

That needs to be run on a GPU. That's a relatively intensive operation. But then the query of those embeddings can be run independently and on a cheaper process. So in the SAM1 demo, the way that it was structured, and also this is the way that we have our SAM tools structured in RoboFlow as well, is images go to a GPU to get all the SAM-based embeddings.

But then for querying those embeddings, we do that client-side in the browser so that the user can very quickly, you know, you can move your mouse over and you get the proposed candidate masks that SAM found for that region of the image. In SAM2, you drop that in the web demo.

And I think that's because you made some notable improvements to the rate at which encoding happens. - Can you talk a bit about what led to those speed increases and again, how that interplays with providing a fast user experience for interacting with the model? - Yeah, so the SAM2 web demo is primarily focused on video.

We decided to just keep it simple and focus on video. And on GitHub, we have a Colab notebook that shows how to run SAM2 on images. So if you're interested in replacing SAM with SAM2 for images, check out GitHub. But for the SAM2 demo, it's not as straightforward to adopt the same architecture as SAM for video, because we can't send the per-frame image embeddings for an entire video back to the front end.

In SAM, each frame embedding was like four megabytes. But if you have a long video and that's per frame, it would become impossible to send that back to the front end. So for SAM2, in terms of the architecture details, I was actually just looking at this earlier, but the SAM1 model was around 630 million parameters, a fraction of the size of these large language models, so very small.
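
As a rough sanity check on that four-megabyte figure: SAM 1's image embedding is a 256 x 64 x 64 float32 tensor, and multiplying that out for a whole clip shows why shipping per-frame embeddings to the browser stops being viable. The clip length and frame rate below are just assumed example values.

```python
# Back-of-envelope: per-frame embedding size vs. a whole video.
channels, height, width, bytes_per_float = 256, 64, 64, 4   # SAM 1 image embedding, float32
per_frame_mb = channels * height * width * bytes_per_float / 2**20
seconds, fps = 10, 24                                        # assumed example clip
per_video_mb = per_frame_mb * seconds * fps
print(f"{per_frame_mb:.1f} MB per frame, ~{per_video_mb:.0f} MB for a {seconds}s {fps}fps clip")
# -> 4.0 MB per frame, ~960 MB for a 10s 24fps clip
```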

Actually, for SAM2, the largest model is around 224 million parameters, so it's about one third the size of the original SAM model. We changed the image encoder from the ViT-H in SAM to a Hiera model, which is also developed by Meta, so that definitely was something that helped. And in terms of efficiency compared to SAM, if we were to run SAM per frame on a video versus running SAM2, SAM2 is around six times faster.

A number of things improved the efficiency of SAM2 such that we were actually able to run this entirely on the server and not have any component on the front end. But I am very curious to see who puts this on device. I'm pretty sure soon we'll see an on-device SAM2, or maybe even one running in the browser or something.

So I think that could definitely unlock some of these edge use cases. But we were able to make a compelling web demo without having to do that. - Hugging Face is probably already working on a Transformers.js version of it. But totally makes sense. I want to talk more about things from the paper, but I think we're still in this sort of demo section, and so I want to hand it to Joseph for his demo to see what the Roboflow site looks like.

- So I can give some context into one key area that Nikhila, you mentioned earlier, which is that SAM has made the decision, both SAM1 and SAM2, to be class-agnostic in terms of its predictions, and that you then have the ability to have a generalizable model with zero-shot capability.

However, in a lot of domain applications, you do want the class-wise name. And so a lot of the challenge can be adding that class-wise name, at least for annotation, to an experience that we've created. That's one of the key considerations. So I will similarly share my screen and show an example.

Here, I have a bunch of images and there's a number of ways that I could annotate things. Like I could prompt a large multimodal model with like grounding capabilities. You could outsource it. Or I can do manual labeling. And with the manual labeling, this is where we make use of models like Segment Anything to propose candidate masks and make it faster.

So we have this annotation pane in what we call the Smart Polygon tool, which is powered by Segment Anything. This is currently Segment Anything 1. We're upgrading it and expect to see improvements similar to what the paper shows, with Segment Anything 2 performing better on images as well as video. But with Segment Anything, I'm able to basically prompt regions of interest in my image.

So for example, if like I wanted to say, I want to like add the drum set, you'll see here that like the original candidate proposal is just the bass drum, but let's say I wanted the whole drum set. So the UX primitive of being able to add and subtract candidate regions of interest is really intuitive here.
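
That add-and-subtract interaction maps directly onto SAM's point prompts: label 1 means include this region, label 0 means exclude it. Here is a hedged sketch with the segment-anything predictor; the click coordinates and file names are made up for illustration, and this is not Roboflow's actual implementation.

```python
# Sketch: interactive point prompting, mirroring add/subtract clicks in a labeling UI.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("band_photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # the heavy encoder pass runs once; prompts below are cheap

# First click lands on the bass drum; a second positive click grows the selection
# to the whole drum set, just like adding a region in the UI (label 1 = add, 0 = subtract).
points = np.array([[520, 380], [560, 250]])  # made-up (x, y) pixel coordinates
labels = np.array([1, 1])
masks, scores, _ = predictor.predict(point_coords=points, point_labels=labels,
                                     multimask_output=True)
best_mask = masks[int(np.argmax(scores))]   # boolean HxW mask to attach a class name to
```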

And now, great, I have this outline, but in fact, what I want is I want to name that as a class because maybe for the model that I'm building, I want to build like a task-specific model, you know, like an object detection model or an instance segmentation model. Or, you know, maybe I'm even using like a multimodal model and I want that multimodal model to refer to regions of interest in the images as a specific thing.

And so I think what's really powerful is, of course, I get this really rich zero-shot prediction, and here we have our friend Rick. So I get this really rich candidate set of predictions, but then by adding the class-wise label, I can very quickly make sure that any downstream tasks are aware not just of the segment, but also of what is inside that segment, which actually takes me to a separate point, something that I predict is probably going to happen.

And Nikhila, I'm actually kind of interested in why maybe your team made a conscious decision to not do this initially with SAM2. There's been an emergent set of models that are adding open-text prompting capabilities to grounding models. So for example, you've seen models like Grounding DINO or OWL-ViT, where you can do even image-to-image or text-to-image-based prompting to find regions of interest.

And maybe I can actually give an example of that even in the context of this same data. So if I wanted to try out Grounding DINO on the same set of images, I could try prompting Grounding DINO for a set of different classes. Let's do, I don't know, let's prompt for person, and let's prompt for microphone.

Here, I can text-prompt the image, and then the understanding, in this case Grounding DINO's understanding of where people are in this image, allows me to create, in this case, bounding boxes, but, you know, soon you can do segmentations, or in tandem with SAM, do segmentations. And we've already seen applications of using SAM2 in tandem with models like Grounding DINO or Florence-2 so that people can basically text-prompt and then get the benefits of the zero-shot segmentation at the same time as getting the open-form querying.

And in doing so, you know, we maintain a framework called Autodistill, so folks can very quickly bring some images, use Autodistill to define an ontology, and then prompt and say what they want from that ontology. - So you already do this for video as well?

- You can apply it to videos or groups of images, yes. So this is using a project called Autodistill. And the concept of Autodistill is to use a base model, like a big base model, which could be SAM or Grounding DINO, and then you pass a directory of images, which also could be video broken into individual frames, and you pass an ontology as well.

So an example I was just showing was the Hello World we have, which is a shipping container. The combination of the grounding capabilities of, in the example I was showing, Florence-2 plus SAM looks for the concept of container. And then SAM does the rich segmentation, turning that concept of container into the candidate proposal of the region, so that a user could just say, hey, I want all the shipping containers, run this across a bunch of images or video frames, and then get back the class-wise labels plus the regions of interest.
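
For readers who want to reproduce that grounding-plus-segmentation flow, a hedged sketch with Autodistill is below. The package and class names follow Roboflow's autodistill and autodistill-grounded-sam projects, but exact arguments may have shifted between releases, so verify against the current docs.

```python
# Sketch: auto-label a folder of images (or extracted video frames) with a grounded SAM base model.
# The ontology maps an open-text prompt ("shipping container") to the class name you want ("container").
from autodistill.detection import CaptionOntology
from autodistill_grounded_sam import GroundedSAM

base_model = GroundedSAM(ontology=CaptionOntology({"shipping container": "container"}))

# Grounding finds boxes for the prompt, SAM turns them into masks, and the labeled
# dataset is written out so you can train a smaller task-specific model on it.
base_model.label(input_folder="./frames", extension=".jpg", output_folder="./dataset")
```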

And this feels like a natural extension. And in fact, in the time between SAM 1 and SAM 2, open-form grounding capabilities became something the field was broadly building. So I'm curious, from your perspective, one of the things I thought maybe SAM 2 would do is actually add this capability natively.

So I'm curious to hear, like, the conscious decision to say, hey, we want to continue to be class-agnostic. We don't want to add yet maybe open form text prompting as a part of finding the segments and parts of images. And I'd love to hear about, like, the decision to think about it that way.

And if you are encouraged or if you want kind of, like, what's happening here where people are naturally combining these capabilities as something that you would expect and encourage to happen despite not having it in the base model itself. - Yeah, it's a great question. So I think it's really cool that the community is taking SAM and taking SAM 2 and building on top of it and coming up with cool applications.

We love to see that. That's exactly why we open source our work. And then in terms of why we didn't put it into SAM 2, so as you've probably seen with SAM and SAM 2, it's a fairly narrow problem, but we really try to make it a step change in the capability.

And so with each version, we are trying to limit the focus on one thing that we can know we can do really well. And in this case, like the first SAM, it was class-agnostic segmentation, but can we do it so well that it's effectively solved? And similarly, can we do that same thing, but with video segmentation?

So one step at a time, we are working on each of these problems one at a time so that we can actually deliver something that's really world-class and step-changing. - So does that mean SAM 3 will have the text prompting problem as like the next challenge? - Who knows, who knows?

(laughing) Maybe the community will build that too. - It makes sense to like very narrowly do something very well, and that's, I think, proven to be well accomplished. - It's like taking both the data, the model, and the demo, and how can we push all three towards solving one thing really well?

So we found that that's a good recipe, and that's how we've limited the focus of each of these models. - This development reminds me of how, you know, when you break out the interpretability of ConvNets, you can see, oh, this is the edge detection one.

I feel like SAM is the edge detection equivalent, and then you build up to whatever the next feature is on top of that. - Can I bring up one limitation of SAM? So we've had SAM 1, now SAM 2, and the model was released at 4 p.m.

Pacific on Monday. We're recording this at 11 a.m. Pacific on Thursday. So it's very fresh for a lot of the capabilities. And it is so clear that it is a stepwise change in the capability that, Nikhila, you mentioned your team wants to do, which is extend SAM's zero-shot, class-agnostic capability to video. Like, A+, kind of mission accomplished.

One thing that's interesting is finding domain problems where there might still be domain applicability and domain adaptation available. One benchmark that we introduced at CVPR is this thing called RF100, which covers like seven different domain-type problems that the industry is commonly working on in vision.

Like underwater, document processing, aerial examples, medicine examples. And one place where, interestingly, Segment Anything is maybe less performant than other models is handling screenshots. For example, a lot of folks that are building agents to interact with the web are particularly interested in that challenge of, given a screenshot of a computer, what are all the buttons?

And how could I autonomously navigate and prompt and tell it to click? And I can show an example of like maybe what, how like SAM kind of performs on this challenge just to outline some of the context of this problem. But I'm curious like how you think about limitations like this and what you would expect to want to be the case.

So here I just have a notebook where I run SAM on the source image on the left, and then the SAM output is on the right. And this is just a screenshot of a website; we grabbed like the top 100 websites by traffic and took screenshots of them.

One example of a place where I could see the community improving on SAM, and I'm curious how you think about this challenge and maybe why SAM is less well adapted to it, is processing screenshots. So I'll share my screen to give an example for viewers that are participating.

Here you see like an example screenshot of a website on the left, and then right is SAM2 running on that image. And in the context of agents, folks usually want to have like, hey, tell me all of the buttons that an agent could press, tell me like maybe the headlines of the articles, tell me the individual images.

And SAM2 behaves perhaps predictably where it outlines like people in the images and like some of like the screen text. I'm curious like how you think about a challenge like this for a model that sees everything in the world, what about handling digital contexts and why maybe it could perform better here and how you would expect to see improvement for domains that might have been out of distribution from the training data?

- Yeah, this is a good question. So at FAIR, we don't really build with a specific use case in mind. We try to build like these foundational models that can be applied to lots of different use cases out of the box. So I think in this kind of example, potentially people might want to annotate some data, fine tune on top of what we release.

I think we probably won't build things that are very custom for different use cases. I think that's not a direction we'll go in. But as you said, like the model is an annotation tool to improve the model. And so I think that's definitely the approach we want to take is we provide the tools for you to improve the model as well as the model itself.

- That makes sense. Focus on like as many multi or zero shot problems and then allow the community to pick up the torch for domain adaptation. - Yeah, absolutely. Like we can't solve all the problems ourselves. Like we can't solve all the different domains, but if we can provide a sort of base hammer tool and then people can apply it to all their different problems.

- Well, if you don't mind, I guess we want to transition to a little bit on like asking more questions about the paper. - Sure. - There's a lot in here. I love the transparency from Meta recently with like Llama 3 last week. And then, and was it last week?

Maybe a little bit less than a week ago, but it's just really, really well-written, with a lot of disclosures, including the dataset as well. I think the top question that people had was on the dataset. You know, you've released a diverse set of videos, and there's a lot of discussion about the data engine as well, which I really love.

And I think it's innovative if you want to share anything about that. I think the top question is like, how do you decide the size of dataset? You know, what were you constrained by? People are asking about scaling laws. You had some ablations, but as a research manager for this whole thing, like how do you decide what you need?

- Yeah, I mean, it's a great question. I think it's, as with all papers, you write them at the end of the project. So we can put these nice plots at the end, but going into it, I think, you know, the data engine design really follows sort of the model design, how we thought about the task, how we thought of the model capabilities.

You can really see it's reflected in the different phases of the data engine. We started with just SAM. We apply SAM per frame. That's like the most basic way of extending SAM to video. Then the most obvious thing to do is to take the output masks from SAM and then provide it as input into a video object segmentation model that takes the mask as the first frame input.

And that's exactly what we did. We had SAM plus a version of SAM2 that only had mask as input. And then in the last phase, we got rid of SAM entirely and just had this one unified model that can do both image and video segmentation and do everything in just one model.

And we found that, you know, going from each phase, it both improved the efficiency and it improved the data quality. And in particular, when you get rid of this two-part model, one of the advantages is that when you make refinement clicks, so you prompt the model in one frame to select an object, then you propagate those predictions to all the other frames of the video to track the object.

But if the model makes a mistake and you want to correct it, when you have this unified model, you only need to provide refinement clicks. So you can provide maybe a negative click to remove a region or a positive click to add a region. But if you had this decoupled model, you would have to delete that frame prediction and re-annotate from scratch.

And so you can imagine for more complex objects, this is actually adding like a lot of extra time to redefine that object every time you want to make a correction. So both the data and the data engine phases really follow like how we thought about the model design and the evolution of the capabilities, because it really helped improve the data quality and the annotation efficiency as well.

- Yeah, you had a really nice table with the time taken to annotate, and it was just going down and down. I think it was down by like 90% by the time you hit stage three, which is kind of cool. - We joked at Roboflow that when SAM1 came out, we're like, "Was this purpose-built for our software?" Like, you have the embedding take a big model, and the querying of the embeddings, a smaller model that happens in the browser, which felt remarkably aligned.

Now hearing you talk about how you think about building models with a demo in mind, it makes sense. Like you're thinking about the ways that folks downstream are gonna be consuming and creating value. So what felt like maybe a coincidence was perhaps a deliberate choice by Meta to take into account how industry is gonna take seminal advances and apply them.

- Yeah, and it's not just humans. It could also be a model that outputs boxes that then get fed into this model. So we're really thinking about this as a component that could be used by a human or as part of a larger AI system. And that brings a number of design requirements: it needs to be promptable.

It needs to have the zero-shot generalization capability. We need it to be real time. And those requirements really are very core to how we think about these models. - I cannot end this podcast without talking about the architecture, because this is effectively the research-level, architecture-level innovation that enabled what I've been calling object permanence for SAM, and its memory retention.

What was the inspiration going into it? And what did you find? - Yeah, so at a high level, the way we think about extending SAM to video is that an image is just a special case of a video that just has one frame. With that idea in mind, we can extend the SAM architecture to be able to support segmentation across videos.

So this is a quick video that shows how this works. So SAM architecture, we have the image encoder, we have a prompt encoder, we have a mask decoder. You can click on an image and that basically is a prompt. We use that prompt along with the image embedding to make a mask prediction for that image.

Going to SAM 2, we can also apply SAM 2 to images because we can, as I said, treat an image as a video with a single frame. And so when we are in the SAM 2 architecture, we introduce this new memory mechanism that consists of three main components. There's memory attention, there's a memory encoder, and then there's a memory bank.

And when we apply SAM 2 to images, these are effectively not used and the architecture just collapses down to the original SAM architecture. But when we do apply this to video, the memory components become really useful because they provide the context of the target object from other frames. And this can come from past frames; there are two types of memory.

So there's like the conditional frames or the prompted frames, which are basically the frames at which a user or a model provides input like clicks. And then there's like the surrounding frames. And so we use six frames around the current frame as memory of the object. So there's both those types of memory that we use to make the mask prediction.

Going into a little bit more detail about that, there are two kinds of memory that we use. One is spatial memory, this high-resolution memory that captures the spatial details. And then we also have this longer-term object pointer memory that captures some of the higher-level concepts.

And I think swyx, you had a comment about how this relates to context windows in LLMs. And both of these types of memories have some relation to context window. They both provide different types of information, on the spatial side or in terms of the concept of the object that we want to track.

And so we found that having a six-frame length for the spatial memory, coupled with this longer-term object pointer memory, provides strong video segmentation accuracy at high speed. So as I mentioned, the real-time aspect is really important. We have to find this speed-accuracy trade-off.
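
As an aside for readers, the loop Nikhila is describing can be summarized roughly as follows. This is a schematic paraphrase of the ideas in the SAM 2 paper, with stub functions standing in for the real encoders and decoders; it is not the actual implementation.

```python
# Schematic of SAM 2's per-frame memory loop, paraphrased from the paper. The "encoders"
# below are stand-in stubs so the data flow runs end to end; they are NOT the real components.
from collections import deque
import numpy as np

NUM_RECENT = 6                                   # surrounding frames kept as spatial memory

recent_memories = deque(maxlen=NUM_RECENT)       # FIFO spatial memories of recent frames
prompted_memories = []                           # frames with user clicks are always retained
object_pointers = []                             # compact tokens summarizing the target object

def encode_image(frame):       return np.random.rand(256, 64, 64)             # stub image encoder
def encode_memory(feat, mask): return feat * mask.mean()                      # stub memory encoder
def attend(feat, memories, pointers):                                         # stub memory attention
    return feat if not memories else feat + np.mean(memories, axis=0)
def decode_mask(feat, prompt): return (feat.mean(axis=0) > 0.5), feat.mean()  # stub mask decoder

def segment_frame(frame, prompt=None):
    feat = encode_image(frame)
    feat = attend(feat, list(prompted_memories) + list(recent_memories), object_pointers)
    mask, pointer = decode_mask(feat, prompt)
    memory = encode_memory(feat, mask.astype(float))
    # Prompted frames go into the always-kept memory; others into the six-slot FIFO.
    (prompted_memories if prompt is not None else recent_memories).append(memory)
    object_pointers.append(pointer)
    return mask

# Frame 0 carries a click prompt; later frames are tracked from memory alone.
video = [np.zeros((720, 1280, 3)) for _ in range(10)]
masks = [segment_frame(f, prompt=(430, 310) if i == 0 else None) for i, f in enumerate(video)]
```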

And one way in which we sort of circumvent this is by allowing additional prompts on subsequent frames. So even if the model makes a mistake, maybe it loses the object. After an occlusion, you can provide another prompt which actually goes into the memory. And so the prompted frames are always in the memory.

And so if you provide a prompt on a frame, the model will always remember what you provided. And so that's a way in which we can sort of avoid some of the model failure cases. That actually is a big limitation of current models: current video object segmentation models don't allow any way to recover if the model makes a mistake.

And so Joseph, going back to your point about the demo, that's something that we found just by playing with these models. There's no way to make a correction. And in many real world use cases, like it's not going to be a one time prediction, but you actually want to be able to intervene.

Like if an LLM makes a mistake, you can actually be like, no, actually do it this way, and provide feedback. And so we really want to bring some of that thinking into how we build these computer vision models as well. - Amazing. My main reaction to finding out about the context length, with six past frames as the default, is why not 60?

Why not 600? In text language models, we're very used to severely extending context windows. And what does that do to the memory of your model? - So I think maybe one thing that's different is that objects in videos are challenging. Objects can change in appearance. There's different lighting conditions.

They can deform. But I think a difference from language models is that the amount of context you need is probably significantly less than for maintaining a long multi-turn conversation. And so coupling this short-term spatial memory with these longer-term object pointers we found was enough. So I think that's probably one difference between vision models and LLMs.

- I think so. If one wanted to be really precise with how literature refers to object re-identification, object re-identification is not only what SAM does for identifying that an object is similar across frames, it's also assigning a unique ID. How do you think about models keeping track of occurrences of objects in addition to seeing that the same looking thing is present in multiple places?

- Yeah, it's a good question. I think, you know, SAM2 definitely isn't perfect and there's many limitations that we'd love to see people in the community help us address. But one definitely challenging case is where there are multiple similar looking objects, especially if there's like a crowded scene with multiple similar looking objects.

Keeping track of the target object is a challenge. That's still something that I don't know if we've solved perfectly, but again, the ability to provide refinement clicks is one way to sort of circumvent that problem. In most cases, when there's lots of similar looking objects, if you add enough refinement clicks, you can get the perfect track throughout the video.

So definitely that's one way to solve that problem. But, you know, we could have better motion estimation. We could do other things in the model to be able to disambiguate similar looking objects more effectively. - I'm just interested in leaving breadcrumbs for other researchers, anyone interested in this kind of architecture.

Like, are there papers that you would refer people to that are influential in your thinking or, you know, have other interesting alternative approaches? - I think there's other ways in which you can do tracking in video. You might not even need the full mask. I think there's some other works that just track, like, points on objects.

It really, really depends on what your application is. Like, if you don't care about the entire mask, you could just track a bounding box. You could just track a point on an object. And so having the high fidelity mask might not actually be necessary for certain use cases. From that perspective, you might not need the full capabilities of SAM or SAM2.

There's many different approaches to tracking. I think I would encourage people to think about, like, what actually they need for their use case and then try to find something that fits versus, yeah, maybe SAM2 is too much. You know, maybe you don't even need the full mask. - Makes total sense.

But you have solved the problem that you set out to solve, which is no mean feat, which is something that we're still appreciating even today. If there are no further questions, I would just transition to sort of forward-looking, future-looking stuff. Joseph already hinted at, like, you know, our interest in SAM and the future of SAM.

And obviously you're the best person to ask about that. I'm also interested in, like, how should external people think about FAIR? You know, there's all this stuff going on: Llama, Chameleon, Voicebox, ImageBind. Like, how are things organized? And, you know, where are things trending?

- Yeah, so in FAIR, you know, we have a number of different research areas. I work in an area called perception. So we build vision systems that basically look at all the fundamental problems in computer vision. Can we build a step change in all of these different capabilities?

SAM was one example. SAM-2 is another example. There are tons of other problems in computer vision where we've made a lot of progress, but can we really say that they're solved? And so that's really the area in which I work on. And then there's a number of other research areas in language and in embodied AI, in more efficient models and various other topics.

So FAIR in general is still very much pushing the boundaries on solving these foundational problems across different domains. And then there's also obviously, like, actually I probably shouldn't talk about Llama, so let's not include that. - I was gonna ask about that. (both laughing) Well, fair enough. Maybe just outside of FAIR, then, the future of computer vision, right?

Like you are very involved in the community. What's the talk of the town at CVPR? Both of you went. Who's doing the most interesting work? It's a question for both of you. - I think the trend we're seeing towards more zero-shot capability for common examples will accelerate. I think multimodality, meaning using images in tandem with text for richer understanding, or images and video in tandem with audio and other mixed media, will be a continued acceleration trend.

The way I kind of see the field continuing to progress, like the problem statement of computer vision is making sense of visual input. And I think about the world as the things that need to be observed follow your traditional bell curve, where like things that most frequently exist out in the world are on the center of that bell curve.

And then there's things that are less frequently occurring that are in those long tails. For example, as far back as 2014, you have the COCO dataset, which sets out to say, "Hey, can we find 80 common objects in context?" Like silverware and fridges and these sorts of things. And we also conceptualized the challenge of computer vision in terms of breaking it down into individual task types, because those were the tools we had at the time.

So that's why you have the origination of classification, object detection, instance segmentation. And then as you see things continue to progress, you have models and things that need to observe areas in the long tails. And so if you think of the COCO dataset as the center of that bell curve, I think of the long tails as really edge-case problems.

Some of our customers, like Rivian, for example: only Rivian knows what the inside of a Rivian should look like as it's assembled and put together before it makes its way to a customer. And they're making custom parts, right? So how could a model even have been trained on the things that go into the componentry of producing a vehicle?

And what's kind of happening with computer vision is you're seeing models that generalize in the middle of the bell curve push outward faster. That's where you see the advent of like open text models or the richness of understanding of multimodal models to allow richer understanding without perhaps any training, or maybe just using pre-training and applying it to a given problem.

And then there's, you know, kind of like the messy middle in between those two, right? So Nikhila kind of talked about examples where SAM does well out of distribution, where it finds an octopus even though there weren't octopi in the training data. I showed an example with screenshots, where SAM isn't yet super great.

So maybe that's in the messy middle or in the longer tails for now. But what's gonna happen is there need to be systems of validating; that's the point of view I think about, like tooling to validate that models are doing what we want them to do and adapting to the datasets that we want them to adapt to.

And so there's a lot of things on a forward-looking basis that propel that expansion of generalizability. That's where open-text problems, that's where scaling up of training and dataset curation continues to play a massive role. Something that's notable, I think, about SAM 2 is it's, what, 57,000 videos, 51,000 videos?

- About 51,000, yeah. - And 100,000 internal datasets. - That's not massive, right? And the model size also isn't: the largest model is a couple hundred million parameters, and the smallest model is 38 million parameters and can run at 45 FPS on an A100, right? We're gonna see more capable, more generalizable models being able to run on a wider array of problems with zero- or multi-shot capability at a faster rate.

And I think the architecture innovations in things like SAM 2, of memory, of transformers increasingly making their way into vision, and probably blended architectures increasingly too. So my viewpoint on a go-forward basis is we will have that bell curve of what humans can see, both in the center of that curve and the long tails, and architectural changes allow richer multi- and zero-shot understanding, and putting those into systems, into industry, and into contexts that allow using them in practical and pragmatic ways.

Nikhila, I'd love to hear your thoughts and perspective on how you think the research trends map or don't map to that, and maybe some of the key innovations that you saw at CVPR this year that got you excited about the direction, and maybe some promising early directions that you're thinking about researching or pushing the boundaries of further.

- Yeah, I just wanted to actually reply to a couple of things that you said. So actually in video object segmentation, the number of classes that are annotated and the size of these datasets are really small. With SAM, you know, we had a billion masks and 11 million images and didn't have class labels, but even before that, there were a lot of image datasets annotated with a significantly larger number of class labels, whereas in video datasets, the number of class labels is very small.

So there's like YouTube VOS, which has 94 object categories, there's MOSE, which has around like 30 or so object categories. And they're usually like people, there's cars, there's dogs and cats and all these common objects, but not really, they don't really cover a very large number of object categories.

And so while SAM learned this general notion of what an object is in an image, these video tracking models actually don't have that knowledge at all. And so that's why having this dataset is really important for the segment anything capability in video, because if you just provide the mask as the input to an off-the-shelf video object segmentation model, it might not actually be able to track that arbitrary object mask as effectively as a SAM2 model that's actually trained to track any object across the entire video.

So doing these sort of combining two models together to try to get a capability will actually only get you so far and being able to actually create the dataset to enable that anything capability, it was actually really important. And we can actually see that when we do comparisons with baselines where we provide SAM2 with the same input mask and the baseline model with the same input mask, for example, the T-shirt of a person, SAM2 can track the T-shirt effectively across the entire video, whereas these baselines might actually start tracking the entire person because that's what they're used to doing and isolating it to just one part of the person is not something they were ever trained to do.

And so those are sort of some of the limitations. Another thing is that segmenting an image and segmenting a video frame are actually two different things. A video frame is still an image, but there might be motion blur or it might have lower resolution. Actually, in the SAM2 paper, we have this study where we look at the SAM image segmentation task on images and also on frames from videos.

And we find that actually SAM2 is a lot better than SAM when it comes to segmenting objects in video frames, because they actually have a sort of slightly different distribution than images. And so I think that's maybe one learning from this project is like combining two models and sort of just smushing things together might not actually be as effective as if you really think about how to build things in a unified way.

And then another really interesting point is that the last author of the COCO dataset, Piotr Dollár, is the head of our research group. And so he's really seen the whole decade of going from COCO to SAM to SAM2. And so it's been very interesting to have that perspective as we build these models and as we think about the types of capabilities we want to build.

- We hosted this challenge at CVPR when we introduced RF100, which is kind of meant to be the anti-COCO. So if COCO is common objects in context, RF100 is like novel objects in weird contexts: thermal data, aerial stuff, and the things we were talking about earlier.

And so we challenged the community as part of what's called ODinW with Microsoft, object detection in the wild. And it's basically how well can you create models that work zero-shot, but really what you end up measuring is how well things can do domain adaptation.

Like how quickly can something be retrained or fine-tuned to a given domain problem? And what's really impressive about SAM and SAM2, from what you just described, is that even with a limited set of categories, the class-agnostic approach affords generalizability even to out-of-distribution examples, surprisingly well. It's remarkably robust.

And so that research direction seems extremely promising. - Yeah, and actually Piotr is always telling us, don't care about COCO, even though he built COCO. So that's always fun. And we're really keeping those zero-shot, real-world use cases in mind as we build and try to do things in as general a way as possible.

- Okay, I think that just leaves us with calls to action for engineers, researchers, and personal recommendations. What do you have? - Yeah, so please try out all the resources we put out. We, you know, open-sourced the SA-V dataset, SAM2, the various SAM2 models, the paper, the demo, and the dataset visualizer.

Please try all of these things that we've released. And also, as I said, SAM2 isn't perfect. There are a number of limitations. Actually, in the blog post, we go through many of these in quite a lot of detail with examples. And so if you have any ideas of how to improve these, please build on top of what we've released.

We would love to see some of these problems get solved, and maybe we can incorporate them back into future model versions. So it would be really cool to see SAM2 used for all your different use cases: build on top of it, improve it, and share what you've built back with us. We'd love to hear from you.

- Lovely. We'll definitely want people to comment and share what they're building on SAM and SA-V and all the other stuff that's going on. Thank you so much for your time. This was wonderful. And then obviously the incredible open source that you've given us. Joseph, thank you as well for guest hosting.

It was a much better episode with you than without you. So I appreciate both of you coming on, and whenever SAM 3 is out, or whatever else you guys are working on, just let us know and we'll come back on again. - Thank you. - Thanks. - Bye. (upbeat music)