Segment Anything 2: Memory + Vision = Object Permanence — with Nikhila Ravi and Joseph Nelson
Chapters
0:00 Introducing Nikhila
3:38 The Impact of SAM 1 in 2023
9:15 Do People Finetune SAM?
13:05 Video Demo of SAM
17:01 Why the Demo is so Important
20:23 SAM 1 vs SAM 2 Architecture
23:46 Video Demo of SAM on Roboflow
29:44 Extending SAM 2 with other models
32:00 Limitations of SAM: Screenshots
35:56 SAM 2 Paper
36:15 SA-V Dataset and SAM Data Engine
40:15 Memory Attention to solve Video
44:24 "Context Length" in Memory Attention
45:17 Object Tracking
47:52 The Future of FAIR
49:23 CVPR, Trends in Vision
60:04 Calls to Action
Our first, one of our very first viral podcasts 00:00:14.500 |
- And this time we are joined by the lead author 00:00:21.160 |
- There's a whole story that we can refer people back 00:00:35.420 |
Why, you know, why did you choose computer vision 00:00:37.980 |
coming out of your specialization at Cambridge? 00:00:41.720 |
So I did my undergraduate degree in engineering 00:00:50.840 |
you sort of study everything from mechanical engineering 00:01:02.260 |
I started taking more classes in machine learning 00:01:05.340 |
and computational neuroscience, and I really enjoyed it. 00:01:08.300 |
And actually after graduating from undergrad, 00:01:14.520 |
And so I was initially planning on becoming a doctor, 00:01:28.700 |
And in my machine learning class in undergrad, 00:01:45.980 |
okay, maybe I want to try something different 00:01:49.380 |
Maybe this is a different path I want to take. 00:01:51.740 |
And then in the gap year, I did a bunch of coding, 00:01:59.740 |
And then I got a scholarship to come and study in America. 00:02:05.180 |
took a bunch of computer science classes at Harvard and MIT, 00:02:12.380 |
I really, really enjoyed working in computer vision, 00:02:15.300 |
applied to Facebook and got this job at Facebook. 00:02:17.940 |
And I've now, at Facebook at the time, now Meta. 00:02:29.220 |
I'm not like a research, typical research scientist. 00:02:32.420 |
Definitely came from more of an engineering background. 00:02:37.500 |
have had amazing opportunities to work across 00:02:40.540 |
so many different interesting problems in computer vision 00:02:46.720 |
How can you go from images of objects to 3D structures? 00:03:02.420 |
- It's weird because I guess with Segment Anything 2, 00:03:07.340 |
You know, you started with 3D and now you're solving the 4D. 00:03:10.700 |
- Yeah, it's just going from 3D to images to video. 00:03:15.740 |
And actually one of the nice things has been, 00:03:18.540 |
so I think I mentioned I wanted to become a doctor, 00:03:21.780 |
but actually Sam is having so much impact in medicine, 00:03:30.220 |
hopefully SAM 2 can also have a similar sort of impact 00:03:36.180 |
- Yeah, I want to give Joseph a chance to comment. 00:03:42.620 |
but like in the past year since we did our podcast on Sam, 00:03:53.020 |
You know, recapping from the first release to present, 00:03:56.020 |
Sam introduces the ability for models to near zero shot, 00:04:03.020 |
identify kind of perfect polygons and outlines 00:04:13.740 |
lots of manual labeling, lots of manual preparation, 00:04:17.460 |
clicking very meticulously to create outlines of individuals 00:04:24.940 |
to do zero shot segmentation of items inside images, 00:04:29.940 |
though none were as high quality as segment anything. 00:04:35.420 |
And with the introduction of segment anything, 00:04:38.780 |
you can pass an image with Sam one, Sam two videos as well, 00:04:57.940 |
for the downstream task and problem you're working on. 00:05:00.700 |
Though Sam has accelerated the rate at which developers 00:05:05.700 |
are able to use computer vision and production applications. 00:05:10.300 |
So at Roboflow, we were very quick to enable the community 00:05:15.140 |
of computer vision developers and engineers to use Sam 00:05:23.260 |
you could kind of use Sam as is to like pass an image 00:05:28.340 |
Another use case for Sam is in preparation of data 00:05:40.140 |
where you have a bunch of images from a wet lab experiment. 00:05:46.340 |
you need to count the presence of a particular protein 00:05:52.140 |
To count all the individual protein reactions, 00:05:59.900 |
will still like kind of individually count and say, 00:06:02.340 |
what are the presence of all of those proteins? 00:06:10.580 |
But often you may need to also add like a class name 00:06:14.860 |
to what the protein is, or you may need to say, 00:06:17.860 |
hey, like I care about the protein portion of this, 00:06:20.340 |
I don't care about the rest of the portion of this image. 00:06:23.420 |
And, or what it encourages and asks for the user to do 00:06:36.620 |
which is kind of a new paradigm that Sam introduced. 00:06:39.140 |
And so at Roboflow, we have one portion of our tool stack 00:06:45.980 |
With segment anything, Sam can already provide, 00:06:49.540 |
hey, here's where I see the outlines of objects, 00:06:54.060 |
hey, here's where the outlines of objects matter. 00:07:01.700 |
And users have labeled about 49 million images 00:07:09.580 |
And that's like 5 million in the last 30 days alone. 00:07:16.900 |
we did kind of like a rough back-of-napkin calculation 00:07:24.060 |
you're clicking individual points to create a polygon. 00:07:29.820 |
And I'm sure in a bit, we can maybe screen share 00:07:32.140 |
and show some examples of what this experience is like. 00:07:37.900 |
on average saves, you know, maybe a dozen or so seconds. 00:07:44.940 |
on the order of magnitude of 35 years of time for users. 00:07:51.460 |
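For reference, a minimal sketch of that back-of-napkin arithmetic; the per-image savings and image count below are illustrative assumptions rather than Roboflow-reported figures.

```python
# Rough back-of-napkin estimate of annotation time saved with SAM-assisted labeling.
# All inputs are illustrative assumptions, not reported figures.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

images_labeled = 49_000_000       # SAM-assisted images on the hosted platform
seconds_saved_per_image = 22      # assumed: a couple of polygons, roughly a dozen seconds each

total_seconds_saved = images_labeled * seconds_saved_per_image
print(f"~{total_seconds_saved / SECONDS_PER_YEAR:.0f} years of labeling time saved")  # ~34 years
```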
- So I mean, basically like in the first year 00:07:55.580 |
not only can you say, hey, I'm just gonna go use this model, 00:07:57.940 |
but those numbers that like 49 million images 00:08:01.300 |
is an estimate directly related to just the hosted side. 00:08:05.260 |
So imagine all of the users that are self-hosting 00:08:28.100 |
you know, people use terms like game changing 00:08:29.860 |
and these sorts of things, it has changed the industry. 00:08:42.980 |
was how many fields actually rely on manual segmentation. 00:08:51.300 |
'cause you get to see all the users of these tools. 00:08:56.180 |
people working on understanding coral reef bleaching 00:09:18.140 |
but is everyone using stock segment anything? 00:09:25.340 |
for the medical field without fine tuning, right? 00:09:32.820 |
So one of the design decisions we made in Sam 00:09:40.300 |
And so all the data is annotated in a class agnostic way. 00:09:59.100 |
So you can imagine that we have 11 million images 00:10:17.300 |
that looked like it, but we didn't have to label it. 00:10:22.740 |
for applications that it wasn't really trained for, 00:10:32.140 |
But having said that, there's probably certain domains 00:10:37.500 |
in order to be able to segment something properly. 00:10:42.020 |
having some extra fine tuning data would probably help. 00:10:45.460 |
And we've sort of seen that there's some papers 00:10:56.060 |
- Once Sam came out, there were adaptations that said, 00:10:59.580 |
could we use Sam to be, you know, like efficient Sam, 00:11:02.700 |
like basically take Sam and maybe accelerate it. 00:11:07.300 |
like cell Sam, for example, out of the UC system. 00:11:15.140 |
there's kind of two ways by which that's done. 00:11:19.620 |
like potentially Sam doesn't have a good concept 00:11:27.940 |
and increase the accuracy for zero shot prediction. 00:11:31.940 |
The second way though, is it's not fine tuning, 00:11:35.900 |
It's just guiding the model's existing knowledge 00:11:41.780 |
And both those are actually kind of equally important 00:11:47.500 |
that the objects of interest can be correctly segmented 00:11:55.660 |
like an omniscient Sam that could see every segment 00:11:57.820 |
in every domain with all pixels perfectly outlined, 00:12:04.900 |
to almost like signal to the model what you care about. 00:12:08.260 |
Like to paint this picture, if you were like a retailer 00:12:18.940 |
you may care about, you know, only the shirt. 00:12:21.300 |
And Sam by default might segment the full person. 00:12:24.060 |
And so there's visual prompting that you can do 00:12:27.460 |
to ensure that you only outline maybe the shirt 00:12:29.820 |
for the purposes of swapping in and out different shirts 00:12:31.860 |
for displaying a given model on a retail page. 00:12:35.780 |
And so I think what's interesting is that's where like, 00:12:39.660 |
but that's where like when you apply to industry, 00:12:41.900 |
like one thing that's particularly important with tooling 00:12:45.060 |
and enabling Sam to reach its full potential. 00:12:55.100 |
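For reference, a minimal sketch of that kind of visual prompting with the public segment-anything package: a rough box plus a negative click steers the class-agnostic model toward just the shirt instead of the whole person. The checkpoint path, image path, and all coordinates are placeholders.

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Checkpoint path, image path, and coordinates below are illustrative placeholders.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)
predictor.set_image(np.array(Image.open("model_photo.jpg").convert("RGB")))

# A rough box around the shirt plus a negative click on the face steers the
# class-agnostic model toward the garment rather than the full person.
masks, scores, _ = predictor.predict(
    box=np.array([220, 140, 480, 420]),   # x0, y0, x1, y1
    point_coords=np.array([[350, 90]]),   # click on the face...
    point_labels=np.array([0]),           # ...labeled 0 = background
    multimask_output=False,
)
shirt_mask = masks[0]  # boolean HxW mask, ready for swapping in different shirts
```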
on the class labeling side is the Grounding DINO work, right? 00:13:18.380 |
So we have a web demo where anyone can try Sam 2 on a video. 00:13:23.380 |
Here we have a video of someone kicking a football 00:13:34.540 |
in any frame of the video and this will work. 00:13:39.300 |
So the model's now tracking this in real time. 00:13:45.660 |
And now you can see the ball has been tracked 00:13:50.660 |
There's even like a little bit of a challenging case here 00:13:56.620 |
and actually the model makes a little bit of a mistake, 00:14:02.300 |
Here, the model makes a little bit of a mistake here, 00:14:09.180 |
until we get the mask that we want on this frame. 00:14:17.420 |
taking into account the additional information 00:14:22.700 |
We've also added a couple of other fun things 00:14:24.660 |
you can do on top of the track, like add effects. 00:14:28.660 |
We can add foreground effects, background effects, 00:14:37.100 |
as part of other tools like video editing tools 00:14:49.660 |
where we might not have even imagined SAM2 being useful. 00:15:00.140 |
even though models never really seen an octopus before. 00:15:07.300 |
that SAM2 can actually quite effectively keep track 00:15:19.620 |
of all the different tentacles is quite accurate. 00:15:25.820 |
is that objects can actually become occluded. 00:15:31.380 |
And a really fun example here is the shuffling cup game, 00:15:36.540 |
And so here I can click on the ball in the first frame. 00:15:45.820 |
is that there's three cups that look exactly the same. 00:15:49.100 |
And then there's a ball that will get occluded by the cup. 00:16:07.860 |
I wanted to point out a couple of fun demo UX features 00:16:11.500 |
that we added that actually really help with this. 00:16:25.220 |
the object disappears, and then the object comes back. 00:16:30.980 |
when the object's being occluded and when it's not. 00:16:35.940 |
if you need to go in and fix the model prediction or not. 00:16:45.300 |
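For reference, a minimal sketch of the same click-then-track flow using the released sam2 package; the function and config names follow the public facebookresearch/sam2 repo, but exact signatures, paths, and coordinates here are assumptions to check against the repo.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Config and checkpoint names are placeholders; use the ones shipped with the repo.
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")

with torch.inference_mode():
    # init_state expects a video (a directory of JPEG frames in the repo's examples).
    state = predictor.init_state(video_path="videos/football_frames")

    # A single positive click on the ball in frame 0 is the only prompt.
    predictor.add_new_points(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[410, 290]], dtype=np.float32),  # (x, y), illustrative
        labels=np.array([1], dtype=np.int32),              # 1 = positive click
    )

    # Memory attention carries the masklet through the rest of the video,
    # including frames where the ball is briefly occluded.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()  # one boolean mask per tracked object

# Refinement clicks on later frames are added the same way (a negative click to
# remove a region, a positive click to add one) before re-running propagation.
```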
- One thing that I think is really notable here, 00:16:49.180 |
One is like, I'd love to have a little bit of a discussion 00:16:53.460 |
of the embedded scene to keep track of the ball 00:16:58.620 |
One thing that Meta has put an emphasis on here 00:17:01.740 |
in a much greater degree than other model releases 00:17:25.940 |
was available prior to the web experience of ChatGPT. 00:17:29.340 |
Can you talk a bit about why that was a consideration 00:17:38.220 |
in tandem with training and releasing a new model? 00:17:41.780 |
I think that's a really great example of how, 00:17:43.700 |
you know, ChatGPT was really more of a UX innovation. 00:17:48.100 |
Obviously, it was like a number of research innovations 00:17:52.500 |
But as you said, like the underlying technology 00:17:56.660 |
putting this UX around it as a chat interface 00:18:03.700 |
and people understanding how it could be useful 00:18:07.980 |
And in computer vision, especially, it's so visual. 00:18:13.820 |
is by trying it on your own image or your own video. 00:18:19.300 |
we put a lot of effort in building like a high-quality demo. 00:18:43.260 |
With this approach, we found it to be really successful. 00:18:53.220 |
outside of machine learning would never have tried SAM 00:18:59.020 |
And I think that definitely led to a lot of the adoption 00:19:25.340 |
that maybe has not had much thought given to it. 00:19:41.620 |
for not thinking about only the new model capability, 00:19:44.900 |
but what sort of applications folks want to build 00:19:51.300 |
to think about many things that you might postpone. 00:20:01.380 |
And so it really forces you to think about these things 00:20:05.020 |
much sooner and actually makes us think about 00:20:08.340 |
how to, what kind of image encoder we want to use 00:20:10.940 |
or like other hardware efficiency improvements. 00:20:16.660 |
become a first-class citizen when you put the demo first. 00:20:22.220 |
and this is related to the architecture change. 00:20:27.340 |
you have the encoder that's creating the embeddings 00:20:39.180 |
can be run independently and on a cheaper process. 00:20:42.460 |
So in the SAM1 demo, the way that it was structured, 00:20:45.700 |
and also this is the way that we have our SAM tools 00:20:49.460 |
is images go to a GPU to get all the SAM-based embeddings. 00:21:11.140 |
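For reference, a minimal sketch of that split with SAM 1's public tooling: the heavy image encoder runs once on a GPU, and the cached embedding is handed to the lightweight prompt encoder and mask decoder (exportable to ONNX in the segment-anything repo), which can run per click on cheaper hardware or in the browser. Paths and the exact export workflow are assumptions to verify against the repo.

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Server side (GPU): run the heavy image encoder once per image.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth").to("cuda")
predictor = SamPredictor(sam)
predictor.set_image(np.array(Image.open("frame.jpg").convert("RGB")))

# The cached embedding (~4 MB for ViT-H) is all the client needs per image.
embedding = predictor.get_image_embedding().cpu().numpy()
np.save("frame_embedding.npy", embedding)

# Client side (CPU, or the browser via onnxruntime-web): the small prompt encoder
# and mask decoder, exported with the repo's scripts/export_onnx_model.py, run
# against this cached embedding for every interactive click, so no GPU round trip
# is needed per prompt.
```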
And I think that's because you made some notable improvements 00:21:17.700 |
- Can you talk a bit about what led to those speed increases 00:21:29.900 |
- Yeah, so the SAM2 web demo is primarily focused on video. 00:21:33.740 |
We decided to just keep it simple and focus on video. 00:21:52.180 |
to adopt the same architecture as SAM for video 00:21:55.260 |
because we can't send the per frame image embeddings 00:22:02.180 |
In SAM, each frame embedding was like four megabytes. 00:22:05.100 |
But if you have a long video and that's like per frame, 00:22:12.340 |
So SAM2 actually, in terms of the architecture details, 00:22:18.620 |
but SAM1 model was around 630 million parameters, 00:22:23.620 |
a fraction of the size of these large language models, 00:22:39.780 |
So we changed the image encoder from a ViT-H in SAM 00:22:44.380 |
to a Hiera model, which is also developed by Meta. 00:22:48.940 |
So that definitely was something that helped. 00:22:51.220 |
And in terms of the efficiency compared to SAM, 00:22:54.580 |
so if we were to run SAM per frame on a video 00:23:04.900 |
Number of things improved the efficiency of SAM2 00:23:07.380 |
such that we were actually able to run this entirely 00:23:15.100 |
But I am very curious to see who puts this on device. 00:23:18.420 |
I'm pretty sure soon we'll see an on-device SAM2 00:23:21.980 |
or maybe even running in the browser or something. 00:23:30.340 |
But we were able to make a compelling web demo 00:23:39.740 |
I want to talk more about things from the paper, 00:23:41.580 |
but I think we're still in this sort of demo section 00:23:43.500 |
and so I want to hand it to Joseph for his demo 00:23:48.100 |
- So I can give some context into one key area 00:24:02.260 |
to have a generalizable model for zero-shot capability. 00:24:22.340 |
So I will similarly share my screen and show an example. 00:24:30.740 |
and there's a number of ways that I could annotate things. 00:24:42.300 |
this is where we make use of models like Segment Anything 00:24:46.660 |
to propose candidate masks and make it faster. 00:25:04.780 |
of Segment Anything 2 performing better on images 00:25:20.260 |
you'll see here that like the original candidate proposal 00:25:39.060 |
but in fact, what I want is I want to name that as a class 00:25:42.660 |
because maybe for the model that I'm building, 00:25:51.060 |
Or, you know, maybe I'm even using like a multimodal model 00:25:56.300 |
to regions of interest in the images as a specific thing. 00:26:06.700 |
zero-shot prediction, and here we have our friend Rick. 00:26:10.780 |
So I get this really rich candidate set of predictions, 00:26:24.740 |
but also of the, what is inside that segment, 00:26:35.900 |
why maybe your team made a conscious decision 00:26:43.100 |
that are also adding open-text prompting capabilities 00:26:54.860 |
which, you know, you can do even image-to-image 00:27:01.340 |
And maybe I can actually give an example of that 00:27:11.780 |
I could try out, you know, prompting Grounding Dino 00:27:17.100 |
And what's notable is, let's do, I don't know, 00:27:20.620 |
let's prompt for person, and we'll prompt for person, 00:27:24.660 |
and let's prompt for, I don't know, microphone, 00:27:38.220 |
allows me to create, in this case, bounding boxes, 00:27:45.980 |
And, you know, we've already seen applications 00:28:00.220 |
and then get the benefits of the zero-shot segmentation 00:28:03.420 |
at the same time as getting the open-form querying. 00:28:09.660 |
we maintain a framework called, like, Autodistill, 00:28:18.260 |
and then prompt and say what you want from that ontology. 00:28:23.780 |
- You can apply videos or groups of images, yes. 00:28:26.740 |
So this is using a project called Autodistill. 00:28:29.580 |
And the concept of Autodistill is use a base model, 00:28:39.780 |
which also could be video broken into individual frames, 00:28:49.860 |
And then the combination of the grounding capabilities of, 00:28:54.540 |
in the example I was showing, Florence 2 plus SAM, 00:29:10.580 |
run this across a bunch of images or video frames, 00:29:21.740 |
And in fact, like, the open form grounding capabilities 00:29:26.780 |
became something the field was broadly doing. 00:29:31.820 |
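For reference, a minimal sketch of that Autodistill flow; the package and class names follow Roboflow's public autodistill project, and the ontology, folder, and file extension are placeholders.

```python
from autodistill.detection import CaptionOntology
from autodistill_grounded_sam import GroundedSAM

# The ontology maps open-text prompts (what the grounding model is asked for)
# to the class names the auto-labeled dataset should use.
ontology = CaptionOntology({
    "person": "person",
    "microphone": "microphone",
})

# Grounding DINO finds boxes for the prompts; SAM turns those boxes into masks.
base_model = GroundedSAM(ontology=ontology)

# Auto-label a folder of images (or video broken into frames) into a dataset
# that a smaller, faster target model can then be trained on.
base_model.label("./frames", extension=".jpg")
```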
one of the things I thought maybe SAM 2 would do 00:29:36.260 |
So I'm curious to hear, like, the conscious decision to say, 00:29:39.140 |
hey, we want to continue to be class-agnostic. 00:29:41.340 |
We don't want to add yet maybe open form text prompting 00:29:45.900 |
as a part of finding the segments and parts of images. 00:29:51.660 |
And if you are encouraged or if you want kind of, like, 00:29:55.100 |
what's happening here where people are naturally 00:29:58.420 |
as something that you would expect and encourage to happen 00:30:01.340 |
despite not having it in the base model itself. 00:30:06.340 |
So I think it's really cool that the community 00:30:08.260 |
is taking SAM and taking SAM 2 and building on top of it 00:30:19.540 |
And then in terms of why we didn't put it into SAM 2, 00:30:22.780 |
so as you've probably seen with SAM and SAM 2, 00:30:35.060 |
we are trying to limit the focus on one thing 00:30:46.580 |
but can we do it so well that it's effectively solved? 00:30:57.180 |
we are working on each of these problems one at a time 00:31:08.020 |
the text prompting problem as like the next challenge? 00:31:21.540 |
and that's, I think, proven to be well accomplished. 00:31:24.660 |
- It's like taking both the data, the model, and the demo, 00:31:41.620 |
- This development reminds me of how, you know, 00:31:43.740 |
when you do, and you break out the interpretability 00:31:50.780 |
I feel like SAM is the edge detection version equivalent, 00:31:54.340 |
and then you build up to whatever the next feature is 00:32:01.980 |
and the model was released at 4 p.m. Pacific on Monday. 00:32:04.940 |
We're recording this at 11 a.m. Pacific on Thursday. 00:32:08.540 |
So it's very fresh for a lot of the capabilities. 00:32:11.820 |
And it is so clear that it is a stepwise change 00:32:26.220 |
One thing that's interesting is finding like domain problems 00:32:30.060 |
where there might be still domain applicability 00:32:40.100 |
which is like seven different domain type problems 00:32:43.220 |
that the industry commonly is working on in vision. 00:32:53.500 |
segment anything maybe less performant than other models 00:33:02.340 |
that are building agents to interact with the web 00:33:04.860 |
are particularly interested in that challenge 00:33:16.900 |
And I can show an example of like maybe what, 00:33:19.180 |
how like SAM kind of performs on this challenge 00:33:21.820 |
just to outline some of the context of this problem. 00:33:29.180 |
and what you would expect to want to be the case. 00:33:32.340 |
where I run SAM on the source image on the left, 00:33:41.100 |
where we just grabbed like the top 100 websites by traffic 00:33:49.940 |
and I'm curious how you think about this challenge 00:33:53.740 |
for this type of problem is processing screenshots. 00:34:05.900 |
and then right is SAM2 running on that image. 00:34:13.260 |
hey, tell me all of the buttons that an agent could press, 00:34:15.740 |
tell me like maybe the headlines of the articles, 00:34:25.620 |
I'm curious like how you think about a challenge like this 00:34:29.260 |
for a model that sees everything in the world, 00:34:38.540 |
and how you would expect to see improvement for domains 00:34:50.820 |
We try to build like these foundational models 00:34:53.900 |
that can be applied to lots of different use cases 00:35:01.620 |
potentially people might want to annotate some data, 00:35:11.180 |
that are very custom for different use cases. 00:35:18.540 |
But as you said, like the model is an annotation tool 00:35:23.260 |
And so I think that's definitely the approach 00:35:28.900 |
for you to improve the model as well as the model itself. 00:35:33.020 |
Focus on like as many multi or zero shot problems 00:35:36.220 |
and then allow the community to pick up the torch 00:35:40.660 |
Like we can't solve all the problems ourselves. 00:35:42.900 |
Like we can't solve all the different domains, 00:35:45.020 |
but if we can provide a sort of base hammer tool 00:35:54.340 |
I guess we want to transition to a little bit 00:35:55.820 |
on like asking more questions about the paper. 00:36:08.180 |
but just like just really, really well-written 00:36:10.220 |
and a lot of disclosures, including the dataset as well. 00:36:12.980 |
I think the top question that people had on the dataset, 00:36:18.500 |
about the data engine as well, which I really love. 00:36:32.020 |
but as a research manager for this whole thing, 00:37:06.260 |
That's like the most basic way of extending SAM to video. 00:37:20.660 |
that takes the mask as the first frame input. 00:37:38.060 |
that can do both image and video segmentation 00:37:44.740 |
And we found that, you know, going from each phase, 00:37:51.620 |
And in particular, when you get rid of this two-part model, 00:37:59.540 |
so you prompt the model in one frame to select an object, 00:38:05.740 |
to all the other frames of the video to track the object. 00:38:09.860 |
But if the model makes a mistake and you want to correct it, 00:38:21.660 |
to remove a region or a positive click to add a region. 00:38:27.740 |
you would have to delete that frame prediction 00:38:34.220 |
And so you can imagine for more complex objects, 00:38:37.420 |
this is actually adding like a lot of extra time 00:38:47.780 |
really follow like how we thought about the model design 00:38:53.220 |
because it really helped improve the data quality 00:39:05.900 |
by the time you hit stage three, which is kind of cool. 00:39:08.740 |
- We joked that when SAM1 came out at Roboflow, 00:39:11.180 |
we're like, "Was this purpose built for our software?" 00:39:13.780 |
Like you have the embedding take like a big model 00:39:23.460 |
Now hearing you talk about how you think about 00:39:25.860 |
building models with a demo in mind, it makes sense. 00:39:37.860 |
is gonna take seminal advances and apply them. 00:39:42.460 |
Like it could also be a model that outputs boxes 00:39:51.980 |
or as a component as part of a larger AI system. 00:40:01.140 |
It needs to have the zero shot generalization capability. 00:40:18.580 |
the sort of research level, architecture level innovation 00:40:22.180 |
that enabled what I've been calling object permanence 00:40:33.460 |
the way we think about extending SAM to video 00:40:36.660 |
is that an image is just a special case of a video 00:40:46.860 |
to be able to support segmentation across videos. 00:40:50.380 |
So this is a quick video that shows how this works. 00:40:53.500 |
So SAM architecture, we have the image encoder, 00:40:56.020 |
we have a prompt encoder, we have a mask decoder. 00:40:59.300 |
You can click on an image and that basically is a prompt. 00:41:04.300 |
We use that prompt along with the image embedding 00:41:11.340 |
Going to SAM 2, we can also apply SAM 2 to images 00:41:15.460 |
because we can, as I said, treat an image as a video 00:41:20.420 |
And so when we are in the SAM 2 architecture, 00:41:27.740 |
There's memory attention, there's a memory encoder, 00:41:45.340 |
because they provide the context of the target object 00:41:59.180 |
or the prompted frames, which are basically the frames 00:42:02.060 |
at which a user or a model provides input like clicks. 00:42:06.860 |
And then there's like the surrounding frames. 00:42:09.100 |
And so we use six frames around the current frame 00:42:19.660 |
Going into a little bit more detail about that, 00:42:21.500 |
there's like two kinds of memory that we use. 00:42:38.100 |
how does this relate to context window and LLMs. 00:42:46.940 |
So they both provide different types of information 00:42:49.700 |
on the spatial side or in terms of the concept 00:42:54.620 |
And so we found that having like six frame length 00:42:56.980 |
for the spatial memory coupled with this longer period 00:43:03.260 |
strong video segmentation accuracy at high speed. 00:43:06.380 |
So as I mentioned, the real time aspect is really important. 00:43:10.220 |
We have to find this speed accuracy trade off. 00:43:12.780 |
And one way in which we sort of circumvent this 00:43:15.700 |
is by allowing additional prompts on subsequent frames. 00:43:24.300 |
After an occlusion, you can provide another prompt 00:43:29.940 |
And so the prompted frames are always in the memory. 00:43:35.700 |
where the model will always remember what you provided. 00:43:39.620 |
And so that's a way in which we can sort of avoid 00:43:45.500 |
That actually is a big limitation of current models. 00:43:50.140 |
don't allow any way to recover if the model makes a mistake. 00:43:53.380 |
And so Joseph, going back to your point about the demo, 00:44:03.140 |
like it's not going to be a one time prediction, 00:44:06.660 |
but you actually want to be able to intervene. 00:44:11.540 |
you can actually be like, no, actually do it this way 00:44:15.500 |
And so we really want to bring some of that thinking 00:44:18.620 |
into how we build these computer vision models as well. 00:44:22.700 |
My main reaction to finding out about the context length, 00:44:26.220 |
input frames and six past frames as their default 00:44:33.060 |
we're very used to severely extending context windows. 00:44:37.060 |
And what does that do to the memory of your model? 00:44:40.540 |
- So I think maybe one thing that's different 00:44:42.500 |
is that the object in videos, it is challenging. 00:44:53.580 |
is probably the amount of context that you need 00:44:57.780 |
than maintaining a long multi-turn conversation. 00:45:01.060 |
And so coupling this short-term spatial memory 00:45:04.500 |
with these longer-term object pointers, we found, was enough. 00:45:16.340 |
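For reference, a schematic (not the released SAM 2 implementation) of how such a memory design can be wired up: a short FIFO of spatial memories from the last six frames, prompted frames that are never evicted, and a longer list of compact object pointers, all cross-attended by the current frame's features before the mask decoder runs. Sizes and class names here are assumptions.

```python
from collections import deque

import torch
import torch.nn as nn


class MemoryBank:
    """Schematic memory for streaming video segmentation (not the sam2 code)."""

    def __init__(self, num_recent_frames: int = 6, max_object_pointers: int = 16):
        self.recent = deque(maxlen=num_recent_frames)       # spatial memories, FIFO
        self.prompted = []                                   # clicked frames: never evicted
        self.pointers = deque(maxlen=max_object_pointers)    # compact per-frame object tokens

    def add(self, spatial_memory: torch.Tensor, object_pointer: torch.Tensor,
            is_prompted: bool) -> None:
        (self.prompted if is_prompted else self.recent).append(spatial_memory)
        self.pointers.append(object_pointer)

    def tokens(self) -> torch.Tensor:
        # Flatten spatial memories (B, C, H, W) -> (B, H*W, C) and concat with pointers (B, C).
        spatial = [m.flatten(2).transpose(1, 2) for m in (*self.prompted, *self.recent)]
        pointers = [p.unsqueeze(1) for p in self.pointers]
        return torch.cat(spatial + pointers, dim=1)


class MemoryAttention(nn.Module):
    """Current-frame features cross-attend to the memory bank."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_feats: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, H*W, C); memory: (B, M, C)
        attended, _ = self.cross(frame_feats, memory, memory)
        return frame_feats + attended  # conditioned features go on to the mask decoder
```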
with how literature refers to object re-identification, 00:45:20.060 |
object re-identification is not only what SAM does 00:45:23.780 |
for identifying that an object is similar across frames, 00:45:36.180 |
in addition to seeing that the same looking thing 00:45:43.500 |
I think, you know, SAM2 definitely isn't perfect 00:45:48.900 |
that we'd love to see people in the community 00:45:56.300 |
is where there are multiple similar looking objects, 00:46:03.660 |
Keeping track of the target object is a challenge. 00:46:11.100 |
but again, the ability to provide refinement clicks 00:46:15.260 |
is one way to sort of circumvent that problem. 00:46:18.780 |
In most cases, when there's lots of similar looking objects, 00:46:23.540 |
you can get the perfect track throughout the video. 00:46:26.580 |
So definitely that's one way to solve that problem. 00:46:30.580 |
But, you know, we could have better motion estimation. 00:46:35.460 |
to be able to disambiguate similar looking objects 00:46:43.820 |
anyone interested in this kind of architecture. 00:46:46.340 |
Like, are there papers that you would refer people to 00:46:51.260 |
or, you know, have other interesting alternative approaches? 00:47:04.020 |
It really, really depends on what your application is. 00:47:06.420 |
Like, if you don't care about the entire mask, 00:47:16.580 |
might not actually be necessary for certain use cases. 00:47:21.780 |
you might not need the full capabilities of SAM or SAM2. 00:47:26.180 |
There's many different approaches to tracking. 00:47:27.980 |
I think I would encourage people to think about, like, 00:47:39.180 |
You know, maybe you don't even need the full mask. 00:47:42.660 |
But you have solved the problem that you set out to solve, 00:47:46.540 |
which is something that we're still appreciating even today. 00:47:50.220 |
I would just transition to sort of forward-looking, 00:47:59.900 |
And obviously you're the best person to ask about that. 00:48:11.300 |
this ImageBind, like, how are things organized? 00:48:18.220 |
we have a number of different research areas. 00:48:26.100 |
basically look at all the fundamental problems 00:48:36.900 |
There are tons of other problems in computer vision 00:48:44.660 |
And so that's really the area in which I work on. 00:48:48.100 |
And then there's a number of other research areas 00:48:53.940 |
in more efficient models and various other topics. 00:48:57.540 |
So FAIR in general is still very much pushing the boundaries 00:49:08.540 |
actually I probably shouldn't talk about llama, 00:49:53.260 |
The way I kind of see the field continuing to progress, 00:49:57.900 |
like the problem statement of computer vision 00:50:13.820 |
out in the world are on the center of that bell curve. 00:50:16.060 |
And then there's things that are less frequently occurring 00:50:24.580 |
"Hey, can we find 80 common objects in context?" 00:50:29.140 |
Like silverware and fridge and these sorts of things. 00:50:32.380 |
And we also conceptualized the challenge of computer vision 00:50:35.460 |
in terms of breaking it down into individual task types, 00:50:38.100 |
because that's like the tools we had for the day. 00:50:40.020 |
So that's why you have the origination of classification, 00:50:45.300 |
And then as you see things continue to progress, 00:50:50.860 |
that need to observe areas in the long tails. 00:50:59.940 |
Some of our customers like Rivian, for example, 00:51:02.420 |
only Rivian knows what the inside of like a Rivian 00:51:05.340 |
should look like as it's assembled and put together 00:51:10.460 |
So how could a model even been trained on the things 00:51:13.820 |
that go inside the componentry of producing a vehicle? 00:51:17.900 |
And what's kind of happening with computer vision 00:51:24.860 |
in the middle of the bell curve push outward faster. 00:51:27.740 |
That's where you see the advent of like open text models 00:51:31.540 |
or the richness of understanding of multimodal models 00:51:46.020 |
kind of like the messy middle in between those two, right? 00:51:48.500 |
So like, Nikhila kind of talked about examples 00:51:54.100 |
even though there weren't octopi in the training data. 00:51:58.580 |
where SAM isn't yet super great at screenshots. 00:52:04.900 |
But what's gonna happen is there needs to be systems 00:52:09.540 |
that I think about like tooling to also validate 00:52:11.980 |
that models are doing what we want them to do, 00:52:13.980 |
adapting to datasets that we want them to adapt to. 00:52:16.500 |
And so there's a lot of things on a forward-looking basis 00:52:19.380 |
that allow propelling that expansion of generalizability. 00:52:30.380 |
of dataset curation continues to play a massive role. 00:52:35.140 |
Something that's notable, I think, about SAM 2 00:52:51.380 |
the largest model being a couple hundred million parameters, 00:53:00.740 |
we're gonna see more capable, more generalizable models 00:53:04.580 |
being able to run on a wider array of problems 00:53:07.900 |
with zero or multi-shot capability at a faster rate. 00:53:22.140 |
and probably blended architectures increasingly too. 00:53:25.220 |
So my viewpoint of like on a go-forward basis 00:53:27.700 |
is we will have that bell curve of what humans can see 00:53:32.500 |
both in the center of that curve and the long tails 00:53:36.740 |
allow richer understanding multi and zero-shot 00:53:45.300 |
that allow using them in practical and pragmatic ways. 00:53:52.340 |
the research trends map or don't map to that. 00:54:16.940 |
and then the size of these datasets are really small. 00:54:20.460 |
So with SAM, it's, you know, we had a billion masks, 00:54:24.820 |
we had 11 million images, didn't have class labels, 00:54:28.500 |
but even before that, there were a lot of datasets 00:54:33.580 |
with significantly more, with like a lot of class labels, 00:54:49.740 |
And they're usually like people, there's cars, 00:54:52.300 |
there's dogs and cats and all these common objects, 00:55:06.820 |
these video tracking models actually don't have 00:55:12.260 |
And so that's why having this dataset is really important 00:55:17.100 |
for the segment anything capability in video, 00:55:20.180 |
because if you just provide the mask as the input 00:55:23.580 |
to an off-the-shelf video object segmentation model, 00:55:37.780 |
So doing these sort of combining two models together 00:55:41.380 |
to try to get a capability will actually only get you so far 00:55:45.540 |
and being able to actually create the dataset 00:55:54.620 |
And we can actually see that when we do comparisons 00:55:59.980 |
with the same input mask and the baseline model 00:56:10.420 |
whereas these baselines might actually start tracking 00:56:13.460 |
the entire person because that's what they're used to doing 00:56:16.260 |
and isolating it to just one part of the person 00:56:19.100 |
is not something they were ever trained to do. 00:56:21.620 |
And so those are sort of some of the limitations. 00:56:37.780 |
Or it's actually, we found that in the SAM2 paper, 00:56:50.620 |
And we find that actually SAM2 is a lot better than SAM 00:56:54.340 |
when it comes to segmenting objects in video frames, 00:57:02.660 |
And so I think that's maybe one learning from this project 00:57:12.580 |
as if you really think about how to build things 00:57:27.820 |
of going from COCO to SAM to SAM2. 00:57:27.820 |
to have that perspective as we build these models 00:57:38.820 |
and as we think about the type of capabilities 00:57:50.900 |
So if like COCO is common objects in context, 00:57:53.060 |
RF100 is like novel objects in weird contexts, 00:58:01.220 |
And so we challenged the community as a part of, 00:58:07.420 |
And it's basically like how well can you create models 00:58:13.540 |
is how well things can learn domain adaptation. 00:58:21.100 |
And what's really impressive about SAM and SAM2 00:58:24.820 |
from what you just described is even with the limited set, 00:58:27.700 |
the class agnostic approach affords the generalizability 00:58:32.180 |
even to out of distribution examples, surprisingly well. 00:58:39.100 |
And so that research direction seems extremely promising. 00:58:42.540 |
- Yeah, and actually Piotr is always telling us like, 00:58:45.460 |
don't care about COCO, even though he built COCO. 00:58:51.540 |
And really keeping that zero shot real world use cases 00:59:00.980 |
- Okay, I think that just leaves us to calls to action 00:59:03.620 |
for engineers, researchers, and personal recommendations. 00:59:09.340 |
- Yeah, so please try out all the resources we put out. 00:59:12.780 |
We, you know, open sourced the SA-V dataset, SAM2, 00:59:22.780 |
Please try all of these things that we've released. 00:59:31.180 |
Actually in the blog post, we go through many of these 00:59:36.980 |
And so if you have any ideas of how to improve these, 00:59:40.700 |
like please build on top of what we've released. 00:59:43.540 |
We would love to see some of these problems get solved 01:00:14.420 |
And then obviously the incredible open source 01:00:21.100 |
It was a much better episode with you than without you. 01:00:28.020 |
just let us know and we'll come back on again.