
Segment Anything 2: Memory + Vision = Object Permanence — with Nikhila Ravi and Joseph Nelson


Chapters

0:00 Introducing Nikhila
3:38 The Impact of SAM 1 in 2023
9:15 Do People Finetune SAM?
13:05 Video Demo of SAM
17:01 Why the Demo is so Important
20:23 SAM 1 vs SAM 2 Architecture
23:46 Video Demo of SAM on Roboflow
29:44 Extending SAM 2 with other models
32:00 Limitations of SAM: Screenshots
35:56 SAM 2 Paper
36:15 SA-V Dataset and SAM Data Engine
40:15 Memory Attention to solve Video
44:24 "Context Length" in Memory Attention
45:17 Object Tracking
47:52 The Future of FAIR
49:23 CVPR, Trends in Vision
60:04 Calls to Action

Whisper Transcript

00:00:00.000 | (upbeat music)
00:00:02.580 | - Welcome to the Latent Space Podcast.
00:00:05.960 | I'm delighted to do Segment Anything 2.
00:00:08.300 | Our first, one of our very first viral podcasts
00:00:11.060 | was Segment Anything 1 with Joseph.
00:00:12.820 | Welcome back.
00:00:13.660 | - Thanks so much.
00:00:14.500 | - And this time we are joined by the lead author
00:00:16.860 | of Segment Anything 2, Nikhila Ravi.
00:00:18.660 | Welcome.
00:00:19.500 | - Thank you.
00:00:20.320 | Thanks for having me.
00:00:21.160 | - There's a whole story that we can refer people back
00:00:23.180 | to episode four of the podcast way back when
00:00:26.180 | for the story of Segment Anything.
00:00:27.820 | But I think we're interested
00:00:29.020 | in just introducing you as a researcher,
00:00:31.300 | as a, on the human side.
00:00:33.260 | What was your path into AI research?
00:00:35.420 | Why, you know, why did you choose computer vision
00:00:37.980 | coming out of your specialization at Cambridge?
00:00:40.880 | - Yeah, yeah, sure.
00:00:41.720 | So I did my undergraduate degree in engineering
00:00:45.520 | at Cambridge University.
00:00:47.620 | The engineering program is very general.
00:00:49.620 | So first couple of years,
00:00:50.840 | you sort of study everything from mechanical engineering
00:00:53.740 | to fluid mechanics, structural mechanics,
00:00:56.660 | material science, and also computer science.
00:01:00.420 | Towards the end of my degree,
00:01:02.260 | I started taking more classes in machine learning
00:01:05.340 | and computational neuroscience, and I really enjoyed it.
00:01:08.300 | And actually after graduating from undergrad,
00:01:11.620 | I had a place at Oxford to study medicine.
00:01:14.520 | And so I was initially planning on becoming a doctor,
00:01:18.060 | had everything planned,
00:01:19.900 | and then decided to take a gap year
00:01:22.780 | after finishing undergrad.
00:01:24.820 | And actually that was around the time
00:01:26.040 | that sort of deep learning was emerging.
00:01:28.700 | And in my machine learning class in undergrad,
00:01:31.900 | I remember one day our professor came in
00:01:34.800 | and that was when Google acquired DeepMind.
00:01:38.300 | And so that became like a huge thing.
00:01:40.940 | We talked about it for the whole class.
00:01:42.740 | It kind of really kicked off thinking about,
00:01:45.980 | okay, maybe I want to try something different
00:01:47.940 | other than medicine.
00:01:49.380 | Maybe this is a different path I want to take.
00:01:51.740 | And then in the gap year, I did a bunch of coding,
00:01:55.300 | worked on a number of projects,
00:01:57.060 | did some sort of freelance contracting work.
00:01:59.740 | And then I got a scholarship to come and study in America.
00:02:02.980 | So I went to Harvard for a year,
00:02:05.180 | took a bunch of computer science classes at Harvard and MIT,
00:02:08.060 | worked on a number of AI projects,
00:02:10.700 | especially in computer vision.
00:02:12.380 | I really, really enjoyed working in computer vision,
00:02:15.300 | applied to Facebook and got this job at Facebook.
00:02:17.940 | And that was Facebook at the time, now Meta.
00:02:21.620 | And I've been here for seven years.
00:02:23.180 | So very circuitous path,
00:02:25.660 | probably not a very conventional one.
00:02:27.460 | I didn't do a PhD.
00:02:29.220 | I'm not like a research, typical research scientist.
00:02:32.420 | Definitely came from more of an engineering background.
00:02:35.460 | But since being at Meta,
00:02:37.500 | have had amazing opportunities to work across
00:02:40.540 | so many different interesting problems in computer vision
00:02:44.500 | from 3D computer vision.
00:02:46.720 | How can you go from images of objects to 3D structures?
00:02:50.860 | And then going back to 2D computer vision
00:02:53.060 | and actually understanding the objects
00:02:55.140 | and the pixels and the images themselves.
00:02:57.540 | So it's been a very interesting journey
00:03:00.460 | over the past seven years.
00:03:02.420 | - It's weird because I guess with Segment Anything 2,
00:03:04.660 | it's like 4D because you solve time.
00:03:07.340 | You know, you started with 3D and now you're solving the 4D.
00:03:10.700 | - Yeah, it's just going from 3D to images to video.
00:03:13.940 | It's really covering the full spectrum.
00:03:15.740 | And actually one of the nice things has been,
00:03:18.540 | so I think I mentioned I wanted to become a doctor,
00:03:21.780 | but actually Sam is having so much impact in medicine,
00:03:25.180 | probably more than I could have ever had
00:03:27.420 | as a doctor myself.
00:03:28.740 | So I think, you know,
00:03:30.220 | hopefully Sam 2 can also have a similar sort of impact
00:03:33.900 | in medicine and other fields.
00:03:36.180 | - Yeah, I want to give Joseph a chance to comment.
00:03:38.740 | Does that also mirror your,
00:03:40.420 | we know your story about going into vision,
00:03:42.620 | but like in the past year since we did our podcast on Sam,
00:03:46.500 | what's been the impact that you've seen?
00:03:48.740 | - Segment anything set a new standard
00:03:51.660 | in computer vision.
00:03:53.020 | You know, recapping from the first release to present,
00:03:56.020 | Sam introduces the ability for models to near zero shot,
00:04:01.020 | meaning without any training,
00:04:03.020 | identify kind of perfect polygons and outlines
00:04:06.660 | of items and objects inside images.
00:04:10.300 | And that capability previously required
00:04:13.740 | lots of manual labeling, lots of manual preparation,
00:04:17.460 | clicking very meticulously to create outlines of individual
00:04:20.820 | items and objects.
00:04:22.140 | And there were some models that attempted
00:04:24.940 | to do zero shot segmentation of items inside images,
00:04:29.940 | though none were as high quality as segment anything.
00:04:35.420 | And with the introduction of segment anything,
00:04:38.780 | you can pass an image with Sam one, Sam two videos as well,
00:04:43.780 | and get perfect, pixel perfect outlines
00:04:47.620 | of most everything inside the images.
00:04:49.940 | Now there are some edge cases across domains
00:04:52.380 | and similar to the human eye,
00:04:54.780 | sometimes you need to say like,
00:04:56.060 | which item you maybe you most care about
00:04:57.940 | for the downstream task and problem you're working on.
00:05:00.700 | Though Sam has accelerated the rate at which developers
00:05:05.700 | are able to use computer vision and production applications.
00:05:10.300 | So at RoboFlow, we were very quick to enable the community
00:05:15.140 | of computer vision developers and engineers to use Sam
00:05:19.300 | and apply it to their problems.
00:05:21.020 | The principal ways of using Sam are,
00:05:23.260 | you could kind of use Sam as is to like pass an image
00:05:26.380 | and receive back masks.
00:05:28.340 | Another use case for Sam is in preparation of data
00:05:32.420 | for other types of problems.
00:05:34.180 | So for example, in the medical domain,
00:05:37.140 | let's say that you're working on a problem
00:05:40.140 | where you have a bunch of images from a wet lab experiment.
00:05:43.700 | And from each of those images,
00:05:46.340 | you need to count the presence of a particular protein
00:05:50.060 | that reacts to some experiments.
00:05:52.140 | To count all the individual protein reactions,
00:05:56.700 | you can go in and lab assistants to this day
00:05:59.900 | will still like kind of individually count and say,
00:06:02.340 | what are the presence of all of those proteins?
00:06:04.820 | With segment anything, it's able to identify
00:06:07.540 | all of those individual items correctly.
00:06:10.580 | But often you may need to also add like a class name
00:06:14.860 | to what the protein is, or you may need to say,
00:06:17.860 | hey, like I care about the protein portion of this,
00:06:20.340 | I don't care about the rest of the portion of this image.
00:06:23.420 | And, or what it encourages and asks for the user to do
00:06:27.020 | is to provide some visual prompting to say,
00:06:29.460 | hey, which part, like Sam says,
00:06:31.780 | hey, I can find segments of anything,
00:06:33.460 | but which segments do you care about?
00:06:34.940 | And so you can do visual prompting,
00:06:36.620 | which is kind of a new paradigm that Sam introduced.
00:06:39.140 | And so at RoboFlow, we have one portion of our tool stack
00:06:43.060 | enables users to very quickly label data.
00:06:45.980 | With segment anything, Sam can already provide,
00:06:49.540 | hey, here's where I see the outlines of objects,
00:06:51.980 | or a user can click to prompt to say,
00:06:54.060 | hey, here's where the outlines of objects matter.
00:06:56.420 | And I recently pulled statistics
00:06:58.020 | from the usage of Sam in RoboFlow
00:06:59.860 | over the course of the last year.
00:07:01.700 | And users have labeled about 49 million images
00:07:05.700 | using segment anything on the hosted side
00:07:07.860 | of the RoboFlow platform.
00:07:09.580 | And that's like 5 million in the last 30 days alone.
00:07:13.740 | And of those images,
00:07:16.900 | we did kind of like a rough back-of-the-napkin calculation
00:07:19.660 | of like how much time that has saved.
00:07:21.860 | Because again, the alternative is
00:07:24.060 | you're clicking individual points to create a polygon.
00:07:27.020 | And with Sam, you just click once
00:07:28.100 | and it guesses where the polygon is.
00:07:29.820 | And I'm sure in a bit, we can maybe screen share
00:07:32.140 | and show some examples of what this experience is like.
00:07:35.180 | And in that time estimation, it's like,
00:07:37.900 | on average saves, you know, maybe a dozen or so seconds.
00:07:41.340 | And we estimate that this has probably saved
00:07:44.940 | on the order of magnitude of 35 years of time for users.
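To make Joseph's back-of-the-napkin math concrete, here is a minimal sketch of the estimate; the 49 million image count comes from the conversation, while the seconds-saved-per-image figure is an assumed placeholder for illustration:

```python
# Rough order-of-magnitude estimate of cumulative annotation time saved (illustrative only).
SECONDS_PER_YEAR = 365 * 24 * 3600

images_labeled = 49_000_000       # hosted Roboflow usage over the first year, per the episode
seconds_saved_per_image = 20      # assumption: manual polygon clicking vs. a single SAM click

total_seconds = images_labeled * seconds_saved_per_image
print(f"~{total_seconds / SECONDS_PER_YEAR:.0f} person-years saved")
# ~31 person-years at 20 s/image, ~19 at 12 s/image -- the same order of magnitude
# as the ~35 years quoted above.
```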
00:07:49.940 | - That's incredible.
00:07:51.460 | - So I mean, basically like in the first year
00:07:54.060 | of a model being available,
00:07:55.580 | not only can you say, hey, I'm just gonna go use this model,
00:07:57.940 | but those numbers that like 49 million images
00:08:01.300 | is an estimate directly related to just the hosted side.
00:08:05.260 | So imagine all of the users that are self-hosting
00:08:08.300 | or using Sam for robotics applications
00:08:10.740 | or out in the field or offline,
00:08:12.900 | where it's not even like the time
00:08:14.940 | or the image counts are tabulated.
00:08:17.420 | And we're probably talking about, you know,
00:08:19.040 | just a fraction of the amount of value
00:08:21.860 | that's actually being produced
00:08:23.020 | for a number of downstream tasks.
00:08:25.100 | So to say that the impact has been,
00:08:28.100 | you know, people use terms like game changing
00:08:29.860 | and these sorts of things, it has changed the industry.
00:08:32.060 | It's set a new standard.
00:08:33.340 | And with the release of Sam 2,
00:08:35.460 | I think we're about to see an acceleration
00:08:37.580 | of those capabilities for a lot of reasons.
00:08:39.820 | - That's really great to hear.
00:08:40.820 | I think one of the things we learned from the release of Sam 1
00:08:42.980 | was how many fields actually rely on manual segmentation.
00:08:47.940 | I think we're not really exposed to that.
00:08:50.020 | Maybe you are at Roboflow
00:08:51.300 | 'cause you get to see all the users of these tools.
00:08:54.860 | But for me, it was, you know,
00:08:56.180 | people working on understanding coral reef bleaching
00:09:00.220 | or farmers counting their cows
00:09:02.140 | and so many different applications
00:09:04.460 | that as a researcher at Meta,
00:09:07.060 | you never get exposed to,
00:09:08.220 | but you can have impact towards.
00:09:10.020 | So I think that was really awesome to hear.
00:09:12.580 | - So as sort of audience surrogate
00:09:14.220 | who knows less than the two of you,
00:09:15.980 | I'm gonna ask a really dumb question maybe,
00:09:18.140 | but is everyone using stock segment anything?
00:09:20.940 | Are they fine tuning for the medical domain?
00:09:23.260 | Like how on earth could it work
00:09:25.340 | for the medical field without fine tuning, right?
00:09:27.940 | Like, is that a thing?
00:09:29.620 | - So I mean, I can give a quick perspective
00:09:31.620 | from the research side.
00:09:32.820 | So one of the design decisions we made in Sam
00:09:37.180 | was to not have class labels.
00:09:40.300 | And so all the data is annotated in a class agnostic way.
00:09:45.300 | So anything that has a boundary,
00:09:47.780 | we consider to be an object.
00:09:49.900 | So for example, in any image,
00:09:52.220 | there's lots of small objects.
00:09:54.700 | We might not know what the name of them are,
00:09:56.620 | but you can draw a boundary around it.
00:09:59.100 | So you can imagine that we have 11 million images
00:10:02.820 | in the SA1B dataset.
00:10:04.620 | We annotated all the objects.
00:10:06.940 | There's many, many small objects.
00:10:09.260 | And so if you think about cells,
00:10:11.340 | they're also kind of small objects.
00:10:14.060 | There's probably things in the training data
00:10:17.300 | that looked like it, but we didn't have to label it.
00:10:19.820 | And so that means that even when you use Sam
00:10:22.740 | for applications that it wasn't really trained for,
00:10:25.260 | because we didn't restrict it
00:10:26.940 | to a certain set of categories,
00:10:28.460 | you can actually use it out of the box
00:10:30.460 | without custom adaptation.
00:10:32.140 | But having said that, there's probably certain domains
00:10:34.980 | where you need some expertise
00:10:37.500 | in order to be able to segment something properly.
00:10:40.180 | And for those use cases,
00:10:42.020 | having some extra fine tuning data would probably help.
00:10:45.460 | And we've sort of seen that there's some papers
00:10:47.660 | that have come out that do this.
00:10:49.460 | And we'd love to hear, Joseph,
00:10:51.140 | how people are collecting data with Sam
00:10:53.580 | and fine tuning for their use cases.
00:10:56.060 | - Once Sam came out, there were adaptations that said,
00:10:59.580 | could we use Sam to be, you know, like efficient Sam,
00:11:02.700 | like basically take Sam and maybe accelerate it.
00:11:05.260 | And then there were domain adapted Sams,
00:11:07.300 | like CellSAM, for example, out of the UC system.
00:11:11.020 | Now, what's interesting is there's,
00:11:13.420 | like adapting Sam to a domain,
00:11:15.140 | there's kind of two ways by which that's done.
00:11:18.580 | One is, as you mentioned,
00:11:19.620 | like potentially Sam doesn't have a good concept
00:11:22.180 | of the objects of interest.
00:11:25.340 | And so you need to do domain adaptation
00:11:27.940 | and increase the accuracy for zero shot prediction.
00:11:31.940 | The second way though, is it's not fine tuning,
00:11:34.660 | it's actually just prompting.
00:11:35.900 | It's just guiding the model's existing knowledge
00:11:39.060 | to say which segments you care about.
00:11:41.780 | And both those are actually kind of equally important
00:11:44.100 | on the application side.
00:11:45.460 | You need to like a priori ensure
00:11:47.500 | that the objects of interest can be correctly segmented
00:11:50.380 | and maybe collect data to do that.
00:11:53.340 | But even if you had like a perfect Sam,
00:11:55.660 | like an omniscient Sam that could see every segment
00:11:57.820 | in every domain with all pixels perfectly outlined,
00:12:02.620 | in production, you would still need some way
00:12:04.900 | to almost like signal to the model what you care about.
00:12:08.260 | Like to paint this picture, if you were like a retailer
00:12:11.540 | and you are providing photos of models
00:12:16.460 | wearing your clothing on your retail site,
00:12:18.940 | you may care about, you know, only the shirt.
00:12:21.300 | And Sam by default might segment the full person.
00:12:24.060 | And so there's visual prompting that you can do
00:12:27.460 | to ensure that you only outline maybe the shirt
00:12:29.820 | for the purposes of swapping in and out different shirts
00:12:31.860 | for displaying a given model on a retail page.
00:12:35.780 | And so I think what's interesting is that's where like,
00:12:38.220 | I wouldn't call it domain adaptation,
00:12:39.660 | but that's where, when you apply it in industry,
00:12:41.900 | tooling is one thing that's particularly important
00:12:45.060 | for enabling Sam to reach its full potential.
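For readers who have not used visual prompting, here is a minimal sketch with the released segment-anything package (SAM 1); the image path, checkpoint file, and click/box coordinates are placeholders invented for the retail-shirt scenario above:

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Placeholder paths -- substitute your own image and downloaded SAM checkpoint.
image = np.array(Image.open("retail_photo.jpg").convert("RGB"))
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")

predictor = SamPredictor(sam)
predictor.set_image(image)

# Visual prompting: a positive click on the shirt, a negative click on the face,
# and a rough box around the torso, so the mask covers the shirt rather than the
# whole person. Coordinates are made up for illustration.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 400], [300, 150]]),
    point_labels=np.array([1, 0]),         # 1 = include this region, 0 = exclude it
    box=np.array([200, 250, 450, 600]),    # x0, y0, x1, y1
    multimask_output=False,
)
```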
00:12:48.100 | - That's really encouraging to hear.
00:12:49.540 | I should also think like, you know,
00:12:51.380 | the last time we talked about this,
00:12:52.820 | we wanted to, a very natural addition
00:12:55.100 | on the class labeling side is the Grounding DINO work, right?
00:12:58.180 | So I think people built a Grounded SAM
00:13:00.220 | and all the other extensions.
00:13:02.140 | I think it's probably a good time
00:13:03.540 | to cut to a quick demo of Sam 2
00:13:05.540 | for people who are tuning in for Sam 2
00:13:08.260 | and who better to demo Sam 2 than Nikki.
00:13:10.260 | - Sure.
00:13:12.660 | So I'll try to narrate what I'm doing
00:13:15.140 | so audio listeners can also understand.
00:13:18.380 | So we have a web demo where anyone can try Sam 2 on a video.
00:13:23.380 | Here we have a video of someone kicking a football
00:13:27.900 | and I'm gonna click on the football
00:13:30.380 | to select the object in the first frame,
00:13:32.860 | but you can actually select the object
00:13:34.540 | in any frame of the video and this will work.
00:13:37.340 | The next step is to hit track.
00:13:39.300 | So the model's now tracking this in real time.
00:13:42.180 | We don't save any of this.
00:13:43.700 | It's all running in real time.
00:13:45.660 | And now you can see the ball has been tracked
00:13:48.940 | throughout the entire video.
00:13:50.660 | There's even like a little bit of a challenging case here
00:13:53.060 | where the shoe covers the football
00:13:56.620 | and actually the model makes a little bit of a mistake,
00:13:59.380 | but that's okay because we can...
00:14:02.300 | Here, the model makes a little bit of a mistake here,
00:14:04.380 | but we can actually add a refinement click.
00:14:07.140 | You can add negative clicks
00:14:09.180 | until we get the mask that we want on this frame.
00:14:12.340 | And then you can hit track again
00:14:15.140 | and the model will track the object,
00:14:17.420 | taking into account the additional information
00:14:20.460 | I've provided at that frame.
00:14:22.700 | We've also added a couple of other fun things
00:14:24.660 | you can do on top of the track, like add effects.
00:14:28.660 | We can add foreground effects, background effects,
00:14:33.020 | and these are just ways of showing
00:14:34.500 | how we can use the output from SAM2
00:14:37.100 | as part of other tools like video editing tools
00:14:41.180 | or other systems.
00:14:42.420 | So this is just a preview of what you can do
00:14:44.980 | with SAM2.
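In code, the interactive flow being demonstrated (click on a frame, track, optionally refine) maps onto the released SAM 2 video predictor roughly as below. The config name, checkpoint path, frame directory, and click coordinates are placeholders, and the call names follow the example notebook in the repository at release, so treat this as a sketch rather than a reference:

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "./checkpoints/sam2_hiera_large.pt")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # A directory of JPEG frames extracted from the video (placeholder path).
    state = predictor.init_state(video_path="./football_frames")

    # One positive click on the football in frame 0 (made-up coordinates).
    predictor.add_new_points(
        inference_state=state, frame_idx=0, obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),   # 1 = positive click, 0 = negative (refinement) click
    )

    # "Track": propagate the mask through the rest of the video.
    video_masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        video_masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```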
00:14:46.420 | But the really cool use cases are places
00:14:49.660 | where we might not have even imagined SAM2 being useful.
00:14:52.740 | So we have a number of examples of things
00:14:54.780 | you might want to use it for.
00:14:56.500 | There's like underwater videos
00:14:58.220 | that it works actually really well for,
00:15:00.140 | even though the model's never really seen an octopus before.
00:15:03.740 | And an octopus has a lot of moving parts,
00:15:07.300 | and SAM2 can actually quite effectively keep track
00:15:11.020 | of all the different tentacles.
00:15:13.060 | And we can probably see it more clearly
00:15:14.900 | if I desaturate the background,
00:15:17.420 | we can see that actually the tracking
00:15:19.620 | of all the different tentacles is quite accurate.
00:15:23.500 | Another challenge with video
00:15:25.820 | is that objects can actually become occluded.
00:15:28.140 | They can disappear from view and reappear.
00:15:31.380 | And a really fun example here is the shuffling cup game,
00:15:34.420 | which many of you might have seen.
00:15:36.540 | And so here I can click on the ball in the first frame.
00:15:40.340 | I can also click on a different cup.
00:15:43.500 | And so here the additional challenge
00:15:45.820 | is that there's three cups that look exactly the same.
00:15:49.100 | And then there's a ball that will get occluded by the cup.
00:15:53.020 | So the ball is no longer visible.
00:15:54.620 | The cups are all moving around.
00:15:56.060 | They all look the same,
00:15:57.780 | but the model actually keeps track
00:15:59.620 | of the cup that we selected.
00:16:01.540 | And as you can see at the end here,
00:16:03.780 | I'll jump to the end so you can see,
00:16:05.860 | it actually finds the cup again.
00:16:07.860 | I wanted to point out a couple of fun demo UX features
00:16:11.500 | that we added that actually really help with this.
00:16:13.780 | So if you can see at the bottom,
00:16:15.100 | there's these swim lanes.
00:16:16.860 | And then the swim lanes,
00:16:18.060 | actually the thickness of the swim lane
00:16:20.260 | tells you if the object's visible or not.
00:16:22.340 | So at the beginning, the object's visible,
00:16:25.220 | the object disappears, and then the object comes back.
00:16:28.100 | So you can actually visually tell
00:16:30.980 | when the object's being occluded and when it's not.
00:16:33.900 | And so it's a nice way of like knowing
00:16:35.940 | if you need to go in and fix the model prediction or not.
00:16:38.420 | And so these are some of the UX innovations
00:16:41.780 | that we came up with,
00:16:43.060 | as well as the model innovations.
00:16:45.300 | - One thing that I think is really notable here,
00:16:48.340 | there's two things.
00:16:49.180 | One is like, I'd love to have a little bit of a discussion
00:16:51.180 | about how the model's keeping track
00:16:53.460 | of the embedded scene to keep track of the ball
00:16:55.780 | and the cup in different places.
00:16:57.260 | Pause on that for a second.
00:16:58.620 | One thing that Meta has put an emphasis on here
00:17:01.740 | in a much greater degree than other model releases
00:17:04.300 | is the demo experience of recognizing that
00:17:07.940 | in addition to having a model
00:17:09.620 | that can do zero-shot segmentation,
00:17:11.700 | you've created a web experience
00:17:14.140 | that allows folks to kind of experience
00:17:16.660 | both the video effects,
00:17:18.060 | but the types of UX innovations
00:17:20.420 | that encourage usage and adoption.
00:17:22.700 | It's actually kind of reminiscent
00:17:23.660 | of how the underlying technology of ChatGPT
00:17:25.940 | was available prior to the web experience of ChatGPT.
00:17:29.340 | Can you talk a bit about why that was a consideration
00:17:31.660 | to your team and how you thought about
00:17:34.380 | the creation of the demo experience
00:17:38.220 | in tandem with training and releasing a new model?
00:17:40.940 | - Yeah, absolutely.
00:17:41.780 | I think that's a really great example of how,
00:17:43.700 | you know, Chat GPT was really more of a UX innovation.
00:17:48.100 | Obviously, it was like a number of research innovations
00:17:50.580 | that helped to get to this point.
00:17:52.500 | But as you said, like the underlying technology
00:17:54.540 | was around for a while and, you know,
00:17:56.660 | putting this UX around it as a chat interface
00:18:00.540 | helped tremendously with adoption
00:18:03.700 | and people understanding how it could be useful
00:18:06.220 | for real-world use cases.
00:18:07.980 | And in computer vision, especially, it's so visual.
00:18:10.980 | The best way to show how these models work
00:18:13.820 | is by trying it on your own image or your own video.
00:18:17.340 | With the original SAM,
00:18:19.300 | we put a lot of effort in building like a high-quality demo.
00:18:23.660 | And the other piece here is that the demo
00:18:26.540 | is actually the annotation tool.
00:18:28.620 | So we actually use the demo
00:18:31.260 | as a way to improve our annotation tool.
00:18:34.100 | And so then it becomes very natural
00:18:36.220 | to invest in building a good demo
00:18:37.820 | because it speeds up your annotation
00:18:39.940 | and improves the data quality
00:18:41.300 | and that will improve the model quality.
00:18:43.260 | With this approach, we found it to be really successful.
00:18:46.260 | And obviously, externally,
00:18:48.140 | people really liked being able to try it.
00:18:50.980 | I think, you know, people in fields
00:18:53.220 | outside of machine learning would never have tried SAM
00:18:56.740 | if we didn't have that demo.
00:18:59.020 | And I think that definitely led to a lot of the adoption
00:19:02.780 | in like diverse fields.
00:19:04.820 | And so because we saw that with SAM 2,
00:19:07.340 | like the demo was a priority,
00:19:09.860 | first-class citizen from day one.
00:19:12.940 | And so we really invested in making that.
00:19:15.980 | And I think with SAM 2 as well,
00:19:18.620 | we wanted to have like a step change
00:19:20.620 | in the demo experience.
00:19:22.180 | Interactive video segmentation,
00:19:23.660 | I think that experience is something
00:19:25.340 | that maybe has not had much thought given to it.
00:19:28.380 | And we really wanted to be like,
00:19:29.740 | okay, if we are to design a step changing
00:19:32.780 | video segmentation experience,
00:19:34.300 | what would that look like?
00:19:35.380 | And that really did influence our model
00:19:37.980 | and annotation design as well.
00:19:40.580 | - It's a really encouraging trend
00:19:41.620 | for not thinking about only the new model capability,
00:19:44.900 | but what sort of applications folks want to build
00:19:47.660 | with models as a result of that downstream.
00:19:49.820 | - I think it also really forces you
00:19:51.300 | to think about many things that you might postpone.
00:19:53.900 | For example, efficiency.
00:19:55.780 | For a good demo experience,
00:19:57.620 | making it real time is super important.
00:19:59.740 | No one wants to wait.
00:20:01.380 | And so it really forces you to think about these things
00:20:05.020 | much sooner and actually makes us think about
00:20:08.340 | how to, what kind of image encoder we want to use
00:20:10.940 | or like other hardware efficiency improvements.
00:20:14.380 | So those kinds of things, I think,
00:20:16.660 | become a first-class citizen when you put the demo first.
00:20:20.620 | - That's one thing I was going to ask about,
00:20:22.220 | and this is related to the architecture change.
00:20:24.260 | So SAM1, in the SAM1 demo experience,
00:20:27.340 | you have the encoder that's creating the embeddings
00:20:30.780 | of all the potential spaces.
00:20:32.740 | That needs to be run on a GPU.
00:20:34.180 | That's a relatively intensive operation.
00:20:36.260 | But then the query of those embeddings
00:20:39.180 | can be run independently and on a cheaper process.
00:20:42.460 | So in the SAM1 demo, the way that it was structured,
00:20:45.700 | and also this is the way that we have our SAM tools
00:20:47.540 | structured in RoboFlow as well,
00:20:49.460 | is images go to a GPU to get all the SAM-based embeddings.
00:20:54.460 | But then for querying those embeddings,
00:20:56.620 | we do that client-side in the browser
00:20:58.780 | so that the user can very quickly,
00:21:00.780 | you know, you can move your mouse over
00:21:02.580 | and you get the proposed candidate masks
00:21:05.460 | that SAM found for that region of the image.
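As a sketch of that split with SAM 1's released API: `set_image` runs the heavy image encoder once, and every subsequent click only re-runs the small prompt encoder and mask decoder against the cached embedding, which is the lightweight part the hosted demo runs client-side. Paths and coordinates below are placeholders:

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

image = np.array(Image.open("example.jpg").convert("RGB"))                 # placeholder image
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth").to("cuda")
predictor = SamPredictor(sam)

# Expensive step: one GPU forward pass through the image encoder caches the embedding.
predictor.set_image(image)

# Cheap step: each hover/click reuses the cached embedding, so it runs at interactive rates.
for click in [(120, 80), (400, 260), (333, 512)]:                          # placeholder clicks
    masks, scores, _ = predictor.predict(
        point_coords=np.array([click]),
        point_labels=np.array([1]),
        multimask_output=True,
    )
```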
00:21:08.140 | In SAM2, you drop that in the web demo.
00:21:11.140 | And I think that's because you made some notable improvements
00:21:14.140 | to the rate at which encoding happens.
00:21:17.700 | - Can you talk a bit about what led to those speed increases
00:21:22.500 | and again, how that interplays
00:21:24.460 | with providing a fast user experience
00:21:27.940 | for interacting with the model?
00:21:29.900 | - Yeah, so the SAM2 web demo is primarily focused on video.
00:21:33.740 | We decided to just keep it simple and focus on video.
00:21:36.980 | And on GitHub, we have a Colab notebook
00:21:40.140 | that shows how to run SAM2 on images.
00:21:42.540 | So if you're interested in using,
00:21:44.340 | replacing SAM with SAM2 for images,
00:21:47.260 | check out GitHub.
00:21:48.260 | But on the SAM2 demo,
00:21:50.660 | it's not as straightforward
00:21:52.180 | to adopt the same architecture as SAM for video
00:21:55.260 | because we can't send the per frame image embeddings
00:21:59.260 | for an entire video back to the front end.
00:22:02.180 | In SAM, each frame embedding was like four megabytes.
00:22:05.100 | But if you have a long video and that's like per frame,
00:22:08.980 | it would become impossible
00:22:10.020 | to send that back to the front end.
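The arithmetic makes the constraint clear; the 4 MB per-frame figure is from the conversation, and the frame rate and clip length are assumptions:

```python
# Why SAM 1's "ship the embedding to the browser" trick doesn't transfer to video.
embedding_mb_per_frame = 4     # per-frame image embedding size mentioned above
fps = 24                       # assumed frame rate
seconds = 60                   # assumed clip length

total_gb = embedding_mb_per_frame * fps * seconds / 1024
print(f"~{total_gb:.1f} GB of embeddings for one minute of video")   # ~5.6 GB, vs. 4 MB per still image
```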
00:22:12.340 | So SAM2 actually, in terms of the architecture details,
00:22:17.060 | I was actually just looking at this earlier,
00:22:18.620 | but the SAM1 model was around 630 million parameters,
00:22:23.620 | a fraction of the size of these large language models,
00:22:27.580 | so very small.
00:22:29.060 | Actually, SAM2, the largest model
00:22:31.860 | is around 224 million parameters.
00:22:34.620 | So it's actually one third the size
00:22:37.900 | of the original SAM model.
00:22:39.780 | So we changed the image encoder from a ViT-H in SAM
00:22:44.380 | to a Hiera model, which is also developed by Meta.
00:22:48.940 | So that definitely was something that helped.
00:22:51.220 | And in terms of the efficiency compared to SAM,
00:22:54.580 | so if we were to run SAM per frame on a video
00:22:57.940 | or run SAM2, it's around six times faster
00:23:01.260 | to run SAM2 versus run SAM per frame.
00:23:04.900 | A number of things improved the efficiency of SAM2
00:23:07.380 | such that we were actually able to run this entirely
00:23:11.420 | on the server and not have any component
00:23:13.940 | in the front end.
00:23:15.100 | But I am very curious to see who puts this on device.
00:23:18.420 | I'm pretty sure soon we'll see an on-device SAM2
00:23:21.980 | or maybe even running in the browser or something.
00:23:25.220 | So I think that could definitely unlock
00:23:27.500 | some of these edge use cases.
00:23:30.340 | But we were able to make a compelling web demo
00:23:33.380 | without having to do that.
00:23:34.860 | - Hugging Face is probably already working
00:23:36.340 | on a Transformers.js version of it.
00:23:38.460 | But totally makes sense.
00:23:39.740 | I want to talk more about things from the paper,
00:23:41.580 | but I think we're still in this sort of demo section
00:23:43.500 | and so I want to hand it to Joseph for his demo
00:23:46.220 | to see what the RoboFlow site looks like.
00:23:48.100 | - So I can give some context into one key area
00:23:51.860 | that Nikolai, you mentioned earlier,
00:23:53.460 | which is SAM has made the decision,
00:23:55.540 | both SAM1 and SAM2, to be class agnostic
00:23:57.860 | in terms of its predictions.
00:23:59.420 | And that you then have the ability
00:24:02.260 | to have a generalizable model for zero-shot capability.
00:24:06.420 | However, in a lot of domain applications,
00:24:08.980 | you do want the class-wise name.
00:24:10.940 | And so a lot of the challenge
00:24:13.820 | can be adding that class-wise name
00:24:16.100 | for at least the annotation to an experience
00:24:19.300 | that we've created.
00:24:20.540 | That's one of the key considerations.
00:24:22.340 | So I will similarly share my screen and show an example.
00:24:27.340 | Here, I have a bunch of images
00:24:30.740 | and there's a number of ways that I could annotate things.
00:24:33.340 | Like I could prompt a large multimodal model
00:24:35.740 | with like grounding capabilities.
00:24:37.740 | You could outsource it.
00:24:39.140 | Or I can do manual labeling.
00:24:41.020 | And with the manual labeling,
00:24:42.300 | this is where we make use of models like Segment Anything
00:24:46.660 | to propose candidate masks and make it faster.
00:24:50.900 | So we have this annotation pane
00:24:53.220 | in what we call the Smart Poly tool,
00:24:55.100 | which is powered by Segment Anything.
00:24:57.340 | This is currently Segment Anything 1.
00:24:59.340 | We're accelerating and seeing improvements
00:25:02.500 | from similar to what the paper shows
00:25:04.780 | of Segment Anything 2 performing better on images
00:25:08.100 | as well as video.
00:25:09.540 | But with Segment Anything,
00:25:11.780 | I'm able to basically prompt regions
00:25:14.420 | of my image of interest.
00:25:16.020 | So for example, if like I wanted to say,
00:25:18.340 | I want to like add the drum set,
00:25:20.260 | you'll see here that like the original candidate proposal
00:25:23.300 | is just the bass drum,
00:25:25.340 | but let's say I wanted the whole drum set.
00:25:27.180 | So the UX primitive of being able to add
00:25:31.100 | and subtract candidate regions of interest
00:25:33.980 | is really intuitive here.
00:25:36.500 | And now, great, I have this outline,
00:25:39.060 | but in fact, what I want is I want to name that as a class
00:25:42.660 | because maybe for the model that I'm building,
00:25:45.420 | I want to build like a task-specific model,
00:25:47.820 | you know, like an object detection model
00:25:49.220 | or an instant segmentation model.
00:25:51.060 | Or, you know, maybe I'm even using like a multimodal model
00:25:54.060 | and I want that multimodal model to refer
00:25:56.300 | to regions of interest in the images as a specific thing.
00:26:01.300 | And so I think what's really powerful
00:26:03.900 | is of course, like I get this really rich
00:26:06.700 | zero-shot prediction, and here we have our friend Rick.
00:26:10.780 | So I get this really rich candidate set of predictions,
00:26:14.460 | but then by adding the class-wise label,
00:26:17.980 | I can, you know, very quickly make sure
00:26:19.700 | that any downstream tasks are aware,
00:26:22.660 | not just of the segment,
00:26:24.740 | but also of the, what is inside that segment,
00:26:29.060 | which actually takes me to a separate point
00:26:32.420 | of something that I predict
00:26:33.260 | that's probably going to happen.
00:26:34.420 | And Nikhila, I'm actually kind of interested
00:26:35.900 | why maybe your team made a conscious decision
00:26:38.220 | to not do this initially with SAM2.
00:26:41.220 | There's been an emergent set of models
00:26:43.100 | that are also adding open-text prompting capabilities
00:26:46.940 | to grounding models.
00:26:48.700 | So for example, like you've seen models
00:26:51.380 | like Grounding Dino or Owlvit,
00:26:54.860 | which, you know, you can do even image-to-image
00:26:57.540 | or text-to-image-based prompting
00:26:59.380 | to find regions of interest.
00:27:01.340 | And maybe I can actually give an example of that
00:27:04.300 | even in the context of this same data.
00:27:06.700 | So if I wanted to try out, you know,
00:27:08.940 | Grounding DINO on the same set of images,
00:27:11.780 | I could try out, you know, prompting Grounding DINO
00:27:14.620 | for a set of different classes.
00:27:17.100 | And what's notable is, let's do, I don't know,
00:27:20.620 | let's prompt for person, and we'll prompt for person,
00:27:24.660 | and let's prompt for, I don't know, microphone,
00:27:28.540 | and we'll ask for a microphone.
00:27:30.580 | Here, I can text-prompt the image,
00:27:32.980 | and then the understanding,
00:27:34.620 | in this case, Grounding DINO's understanding
00:27:36.340 | of where people are in this image
00:27:38.220 | allows me to create, in this case, bounding boxes,
00:27:40.860 | but, you know, soon you can do segmentations
00:27:43.580 | or in tandem with SAM, do segmentations.
00:27:45.980 | And, you know, we've already seen applications
00:27:48.500 | of using SAM2 in tandem with models
00:27:53.100 | like Grounding DINO or Florence-2
00:27:56.820 | so that people can basically text-prompt
00:28:00.220 | and then get the benefits of the zero-shot segmentation
00:28:03.420 | at the same time as getting the open-form querying.
00:28:08.420 | And in doing so, you know,
00:28:09.660 | we maintain a framework called, like, Autodistill,
00:28:11.380 | so, like, folks can very quickly, you know,
00:28:13.700 | bring some images and then using Autodistill
00:28:16.660 | to define an ontology,
00:28:18.260 | and then prompt and say what you want from that ontology.
00:28:21.340 | - So you already do this for video as well?
00:28:23.780 | - You can apply videos or groups of images, yes.
00:28:26.740 | So this is using a project called Autodistill.
00:28:29.580 | And the concept of Autodistill is use a base model,
00:28:33.380 | like a big base model,
00:28:34.340 | which could be, like, SAM or Grounding DINO,
00:28:36.900 | and then you pass a directory of images,
00:28:39.780 | which also could be video broken into individual frames,
00:28:43.580 | and you pass an ontology as well.
00:28:45.340 | So an example I was just showing
00:28:46.780 | was, like, the Hello World we have,
00:28:48.020 | which is, like, a shipping container.
00:28:49.860 | And then the combination of the grounding capabilities of,
00:28:54.540 | in the example I was showing, Florence-2 plus SAM,
00:28:57.660 | looks for the concept of container.
00:28:59.620 | And then SAM does the rich segmentation
00:29:02.780 | of turning that concept of container
00:29:04.820 | into the candidate proposal of the region
00:29:07.300 | so that a user could just say,
00:29:08.980 | hey, I want all the shipping containers,
00:29:10.580 | run this across a bunch of images or video frames,
00:29:13.900 | and then get back the class-wise labels
00:29:17.740 | plus the regions of interest.
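A sketch of that Autodistill flow for the shipping-container example; the package names and the `label` call follow the Autodistill docs at the time and may have changed, so treat it as illustrative:

```python
from autodistill.detection import CaptionOntology
from autodistill_grounded_sam import GroundedSAM

# The ontology maps a text prompt for the grounding model to the class name you
# want in the resulting dataset.
base_model = GroundedSAM(ontology=CaptionOntology({"shipping container": "container"}))

# Runs the grounding model to find "shipping container" regions, then SAM to turn
# each detection into a pixel mask, and writes class-wise labels for every image
# (or extracted video frame) in the folder.
base_model.label("./frames", extension=".jpg")
```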
00:29:19.740 | And this feels like a natural extension.
00:29:21.740 | And in fact, like, adding open-form grounding capabilities
00:29:24.700 | between the SAM 1 and SAM 2 releases
00:29:26.780 | became something the field was broadly doing.
00:29:29.140 | So I'm curious, like, from your perspective,
00:29:31.820 | one of the things I thought maybe SAM 2 would do
00:29:33.780 | is actually add this capability natively.
00:29:36.260 | So I'm curious to hear, like, the conscious decision to say,
00:29:39.140 | hey, we want to continue to be class-agnostic.
00:29:41.340 | We don't want to add yet maybe open form text prompting
00:29:45.900 | as a part of finding the segments and parts of images.
00:29:48.660 | And I'd love to hear about, like,
00:29:49.820 | the decision to think about it that way.
00:29:51.660 | And if you are encouraged or if you want kind of, like,
00:29:55.100 | what's happening here where people are naturally
00:29:56.740 | combining these capabilities
00:29:58.420 | as something that you would expect and encourage to happen
00:30:01.340 | despite not having it in the base model itself.
00:30:05.340 | - Yeah, it's a great question.
00:30:06.340 | So I think it's really cool that the community
00:30:08.260 | is taking SAM and taking SAM 2 and building on top of it
00:30:11.580 | and coming up with cool applications.
00:30:14.340 | We love to see that.
00:30:15.420 | That's exactly why we open source our work.
00:30:19.540 | And then in terms of why we didn't put it into SAM 2,
00:30:22.780 | so as you've probably seen with SAM and SAM 2,
00:30:25.980 | it's a fairly narrow problem,
00:30:28.540 | but we really try to make it a step change
00:30:31.140 | in the capability.
00:30:32.660 | And so with each version,
00:30:35.060 | we are trying to limit the focus on one thing
00:30:38.780 | that we can know we can do really well.
00:30:41.940 | And in this case, like the first SAM,
00:30:44.700 | it was class-agnostic segmentation,
00:30:46.580 | but can we do it so well that it's effectively solved?
00:30:50.420 | And similarly, can we do that same thing,
00:30:52.740 | but with video segmentation?
00:30:55.340 | So one step at a time,
00:30:57.180 | we are working on each of these problems one at a time
00:31:00.540 | so that we can actually deliver something
00:31:02.500 | that's really world-class and step-changing.
00:31:06.060 | - So does that mean SAM 3 will have
00:31:08.020 | the text prompting problem as like the next challenge?
00:31:11.900 | - Who knows, who knows?
00:31:13.100 | (laughing)
00:31:14.540 | Maybe the community will build that too.
00:31:18.580 | - It makes sense to like very narrowly
00:31:20.540 | do something very well,
00:31:21.540 | and that's, I think, proven to be well accomplished.
00:31:24.660 | - It's like taking both the data, the model, and the demo,
00:31:28.620 | and how can we push all three
00:31:31.020 | towards solving one thing really well?
00:31:34.020 | So we found that that's like a good recipe,
00:31:36.900 | and that's what we've limited the focus
00:31:39.460 | of each of these models.
00:31:41.620 | - This development reminds me of how, you know,
00:31:43.740 | when you break out the interpretability
00:31:46.620 | of ConvNets, and you can see like,
00:31:48.620 | oh, this is the edge detection one.
00:31:50.780 | I feel like SAM is the edge detection version equivalent,
00:31:54.340 | and then you build up to whatever the next feature is
00:31:56.580 | on top of that.
00:31:57.500 | - Can I bring up one limitation of SAM?
00:31:59.980 | So like with SAM 1, and now SAM 2,
00:32:01.980 | the model was released at 4 p.m. Pacific on Monday.
00:32:04.940 | We're recording this on 11 a.m. Pacific on Thursday.
00:32:08.540 | So it's very fresh for a lot of the capabilities.
00:32:11.820 | And it is so clear that it is a stepwise change
00:32:15.620 | in the capability that, Nikhila,
00:32:18.220 | you mentioned your team wants to do,
00:32:19.340 | which is extend SAM's zero-shot
00:32:21.140 | class-agnostic capability to video,
00:32:23.100 | like A+ kind of mission accomplished.
00:32:26.220 | One thing that's interesting is finding like domain problems
00:32:30.060 | where there might be still domain applicability
00:32:33.220 | and domain adaptation that is available.
00:32:35.900 | One benchmark that we introduced at CVPR
00:32:38.580 | is this thing called RF100,
00:32:40.100 | which is like seven different domain type problems
00:32:43.220 | that the industry commonly is working on in vision.
00:32:45.300 | Like underwater, document processing,
00:32:47.740 | aerial examples, medicine examples.
00:32:50.540 | And one place where, interestingly,
00:32:53.500 | segment anything is maybe less performant than other models
00:32:58.500 | is handling screenshots.
00:33:00.780 | For example, like a lot of folks
00:33:02.340 | that are building agents to interact with the web
00:33:04.860 | are particularly interested in that challenge
00:33:06.860 | of given a screenshot of a computer,
00:33:09.740 | what are all the buttons?
00:33:11.420 | And how could I autonomously navigate
00:33:14.700 | and prompt and tell it to click?
00:33:16.900 | And I can show an example of like maybe what,
00:33:19.180 | how like SAM kind of performs on this challenge
00:33:21.820 | just to outline some of the context of this problem.
00:33:26.460 | But I'm curious like how you think about
00:33:28.340 | limitations like this
00:33:29.180 | and what you would expect to want to be the case.
00:33:30.900 | So here I just have a notebook
00:33:32.340 | where I run SAM on the source image on the left,
00:33:36.540 | and then the SAM output is on the right.
00:33:38.140 | And this is just a screenshot of a website
00:33:41.100 | where we just grabbed like the top 100 websites by traffic
00:33:43.660 | and grabbed screenshots from them.
00:33:45.620 | One example of a place where I could see
00:33:48.460 | the community improving on SAM,
00:33:49.940 | and I'm curious how you think about this challenge
00:33:51.580 | and maybe why SAM is less well adapted
00:33:53.740 | for this type of problem is processing screenshots.
00:33:57.220 | So I'll share my screen to give an example
00:33:59.500 | for viewers that are participating.
00:34:02.700 | Here you see like an example screenshot
00:34:04.660 | of a website on the left,
00:34:05.900 | and then right is SAM2 running on that image.
00:34:09.860 | And in the context of agents,
00:34:11.820 | folks usually want to have like,
00:34:13.260 | hey, tell me all of the buttons that an agent could press,
00:34:15.740 | tell me like maybe the headlines of the articles,
00:34:17.620 | tell me the individual images.
00:34:19.260 | And SAM2 behaves perhaps predictably
00:34:21.740 | where it outlines like people in the images
00:34:23.420 | and like some of like the screen text.
00:34:25.620 | I'm curious like how you think about a challenge like this
00:34:29.260 | for a model that sees everything in the world,
00:34:32.660 | what about handling digital contexts
00:34:34.940 | and why maybe it could perform better here
00:34:38.540 | and how you would expect to see improvement for domains
00:34:41.940 | that might have been out of distribution
00:34:43.340 | from the training data?
00:34:44.580 | - Yeah, this is a good question.
00:34:45.980 | So at FAIR, we don't really build
00:34:48.940 | with a specific use case in mind.
00:34:50.820 | We try to build like these foundational models
00:34:53.900 | that can be applied to lots of different use cases
00:34:57.220 | out of the box.
00:34:58.300 | So I think in this kind of example,
00:35:01.620 | potentially people might want to annotate some data,
00:35:04.860 | fine tune on top of what we release.
00:35:07.820 | I think we probably won't build things
00:35:11.180 | that are very custom for different use cases.
00:35:14.180 | I think that's not a direction we'll go in.
00:35:18.540 | But as you said, like the model is an annotation tool
00:35:21.740 | to improve the model.
00:35:23.260 | And so I think that's definitely the approach
00:35:26.220 | we want to take is we provide the tools
00:35:28.900 | for you to improve the model as well as the model itself.
00:35:32.180 | - That makes sense.
00:35:33.020 | Focus on like as many multi or zero shot problems
00:35:36.220 | and then allow the community to pick up the torch
00:35:38.300 | for domain adaptation.
00:35:39.820 | - Yeah, absolutely.
00:35:40.660 | Like we can't solve all the problems ourselves.
00:35:42.900 | Like we can't solve all the different domains,
00:35:45.020 | but if we can provide a sort of base hammer tool
00:35:49.940 | and then people can apply it
00:35:51.220 | to all their different problems.
00:35:53.500 | - Well, if you don't mind,
00:35:54.340 | I guess we want to transition to a little bit
00:35:55.820 | on like asking more questions about the paper.
00:35:58.500 | - Sure.
00:35:59.340 | - There's a lot in here.
00:36:00.380 | I love the transparency from Meta recently
00:36:02.980 | with like Llama 3 last week.
00:36:04.300 | And then, and was it last week?
00:36:06.020 | Maybe a little bit less than last week,
00:36:08.180 | but just like just really, really well-written
00:36:10.220 | and a lot of disclosures, including the dataset as well.
00:36:12.980 | I think the top question that people had on the dataset,
00:36:16.860 | you know, you've released a diverse set of videos
00:36:16.860 | and there's a lot of discussion
00:36:18.500 | about the data engine as well, which I really love.
00:36:21.020 | And I think it's innovative
00:36:22.180 | if you want to share anything about that.
00:36:24.220 | I think the top question is like,
00:36:25.580 | how do you decide the size of dataset?
00:36:27.140 | You know, what were you constrained by?
00:36:28.940 | People are asking about scaling laws.
00:36:30.340 | You had some ablations,
00:36:32.020 | but as a research manager for this whole thing,
00:36:34.340 | like how do you decide what you need?
00:36:37.340 | - Yeah, I mean, it's a great question.
00:36:38.860 | I think it's, as with all papers,
00:36:41.380 | you write them at the end of the project.
00:36:43.660 | So we can put these nice plots at the end,
00:36:46.980 | but going into it, I think, you know,
00:36:49.340 | the data engine design really follows
00:36:52.180 | sort of the model design,
00:36:54.100 | how we thought about the task,
00:36:55.860 | how we thought of the model capabilities.
00:36:58.020 | You can really see it's reflected
00:37:00.100 | in the different phases of the data engine.
00:37:02.540 | We started with just SAM.
00:37:04.500 | We apply SAM per frame.
00:37:06.260 | That's like the most basic way of extending SAM to video.
00:37:10.940 | Then the most obvious thing to do
00:37:12.460 | is to take the output masks from SAM
00:37:15.940 | and then provide it as input
00:37:18.180 | into a video object segmentation model
00:37:20.660 | that takes the mask as the first frame input.
00:37:24.180 | And that's exactly what we did.
00:37:25.460 | We had SAM plus a version of SAM2
00:37:28.940 | that only had mask as input.
00:37:31.580 | And then in the last phase,
00:37:33.260 | we got rid of SAM entirely
00:37:35.260 | and just had this one unified model
00:37:38.060 | that can do both image and video segmentation
00:37:41.820 | and do everything in just one model.
00:37:44.740 | And we found that, you know, going from each phase,
00:37:47.740 | it both improved the efficiency
00:37:49.460 | and it improved the data quality.
00:37:51.620 | And in particular, when you get rid of this two-part model,
00:37:54.980 | one of the advantages is that
00:37:57.540 | when you make refinement clicks,
00:37:59.540 | so you prompt the model in one frame to select an object,
00:38:03.740 | then you propagate those predictions
00:38:05.740 | to all the other frames of the video to track the object.
00:38:09.860 | But if the model makes a mistake and you want to correct it,
00:38:14.340 | when you have this unified model,
00:38:16.420 | you only need to provide refinement clicks.
00:38:19.300 | So you can provide maybe a negative click
00:38:21.660 | to remove a region or a positive click to add a region.
00:38:25.500 | But if you had this decoupled model,
00:38:27.740 | you would have to delete that frame prediction
00:38:31.820 | and re-annotate from scratch.
00:38:34.220 | And so you can imagine for more complex objects,
00:38:37.420 | this is actually adding like a lot of extra time
00:38:40.380 | to redefine that object
00:38:42.420 | every time you want to make a correction.
00:38:44.740 | So both the data and the data engine phases
00:38:47.780 | really follow like how we thought about the model design
00:38:50.700 | and the evolution of the capabilities,
00:38:53.220 | because it really helped improve the data quality
00:38:56.260 | and the annotation efficiency as well.
00:38:58.900 | - Yeah, you had a really nice table
00:39:00.380 | with like time taken to annotate,
00:39:01.780 | and it was just going down and down.
00:39:03.100 | I think it was like down by like 90%
00:39:05.900 | by the time you hit stage three, which is kind of cool.
00:39:08.740 | - We joked that when SAM1 came out at RoboFlow,
00:39:11.180 | we're like, "Was this purpose built for our software?"
00:39:13.780 | Like you have the embedding take like a big model
00:39:17.220 | and the querying of the embeddings,
00:39:19.140 | a smaller model that happens in browser,
00:39:21.260 | which felt remarkably aligned.
00:39:23.460 | Now hearing you talk about how you think about
00:39:25.860 | building models with a demo in mind, it makes sense.
00:39:27.940 | Like you're thinking about the ways
00:39:29.100 | that folks downstream are gonna be consuming
00:39:30.940 | and creating value.
00:39:31.860 | So what felt like maybe a coincidence
00:39:33.820 | was perhaps a deliberate choice by Meta
00:39:35.540 | to take into account how industry
00:39:37.860 | is gonna take seminal advances and apply them.
00:39:41.140 | - Yeah, and it's not just humans.
00:39:42.460 | Like it could also be a model that outputs boxes
00:39:45.300 | that then get fed into this model.
00:39:47.140 | So really thinking about this as a component
00:39:50.180 | that could be used by a human
00:39:51.980 | or as a component as part of a larger AI system.
00:39:56.020 | And that has a number of design requirements
00:39:59.500 | that needs to be promptable.
00:40:01.140 | It needs to have the zero shot generalization capability.
00:40:04.820 | We need it to be real time.
00:40:07.300 | And those requirements really are very core
00:40:10.420 | to how we think about these models.
00:40:12.820 | - I cannot end this podcast
00:40:14.500 | without talking about the architecture
00:40:16.020 | because this is your effectively
00:40:18.580 | the sort of research level, architecture level innovation
00:40:22.180 | that enabled what I've been calling object permanence
00:40:25.700 | for SAM and it's memory retention.
00:40:28.580 | What was the inspiration going into it?
00:40:30.140 | And what did you find?
00:40:31.940 | - Yeah, so at a high level,
00:40:33.460 | the way we think about extending SAM to video
00:40:36.660 | is that an image is just a special case of a video
00:40:40.260 | that just has one frame.
00:40:42.060 | With that idea in mind,
00:40:44.340 | we can extend the SAM architecture
00:40:46.860 | to be able to support segmentation across videos.
00:40:50.380 | So this is a quick video that shows how this works.
00:40:53.500 | So SAM architecture, we have the image encoder,
00:40:56.020 | we have a prompt encoder, we have a mask decoder.
00:40:59.300 | You can click on an image and that basically is a prompt.
00:41:04.300 | We use that prompt along with the image embedding
00:41:08.340 | to make a mask prediction for that image.
00:41:11.340 | Going to SAM 2, we can also apply SAM 2 to images
00:41:15.460 | because we can, as I said, treat an image as a video
00:41:19.060 | with a single frame.
00:41:20.420 | And so when we are in the SAM 2 architecture,
00:41:22.900 | we introduce this new memory mechanism
00:41:25.260 | that consists of three main components.
00:41:27.740 | There's memory attention, there's a memory encoder,
00:41:29.740 | and then there's a memory bank.
00:41:31.180 | And when we apply SAM 2 to images,
00:41:33.660 | these are effectively not used
00:41:35.460 | and the architecture just collapses
00:41:37.780 | down to the original SAM architecture.
00:41:40.540 | But when we do apply this to video,
00:41:43.100 | the memory components become really useful
00:41:45.340 | because they provide the context of the target object
00:41:49.500 | from other frames.
00:41:51.540 | And so this could be from the past frames,
00:41:54.140 | it can be from, there's two types of memory.
00:41:56.700 | So there's like the conditioning frames
00:41:59.180 | or the prompted frames, which are basically the frames
00:42:02.060 | at which a user or a model provides input like clicks.
00:42:06.860 | And then there's like the surrounding frames.
00:42:09.100 | And so we use six frames around the current frame
00:42:12.940 | as memory of the object.
00:42:14.220 | So there's both those types of memory
00:42:17.340 | that we use to make the mask prediction.
00:42:19.660 | Going into a little bit more detail about that,
00:42:21.500 | there's like two kinds of memory that we use.
00:42:23.940 | So one is like spatial memory.
00:42:25.940 | So it's like this high resolution memory
00:42:27.540 | that captures the spatial details.
00:42:29.540 | And then we also have this like longer term
00:42:31.740 | object pointer memory that captures
00:42:33.420 | some of the sort of higher level concepts.
00:42:35.780 | And I think, swyx, you had a comment about
00:42:38.100 | how this relates to context windows in LLMs.
00:42:42.180 | And both of these types of memories
00:42:43.820 | have some relation to context window.
00:42:46.940 | So they both provide different types of information
00:42:49.700 | on the spatial side or in terms of the concept
00:42:52.740 | of the objects that we want to track.
00:42:54.620 | And so we found that having like a six-frame length
00:42:56.980 | for the spatial memory, coupled with this longer-term
00:43:00.740 | object pointer memory, provides
00:43:03.260 | strong video segmentation accuracy at high speed.
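As a rough mental model of that memory bank (a sketch under assumed names and structure, not the sam2 implementation): prompted frames are kept for the whole session, the unprompted spatial memories form a rolling window of six recent frames, and compact object pointers accumulate alongside them.

```python
# Illustrative memory bank; names and structure are assumptions, not the sam2 code.
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List

import torch

@dataclass
class MemoryBankSketch:
    num_recent: int = 6  # spatial memories from the six recent/surrounding frames

    prompted: Dict[int, torch.Tensor] = field(default_factory=dict)    # frame_idx -> spatial memory (always kept)
    recent: Deque[torch.Tensor] = field(default_factory=deque)         # rolling high-resolution spatial memories
    object_pointers: List[torch.Tensor] = field(default_factory=list)  # compact, longer-term object-level tokens

    def add_frame(self, frame_idx: int, spatial_mem: torch.Tensor,
                  obj_ptr: torch.Tensor, was_prompted: bool) -> None:
        if was_prompted:
            # Prompted (conditioning) frames are never evicted, so a user click
            # on any frame stays in memory for the rest of the video.
            self.prompted[frame_idx] = spatial_mem
        else:
            self.recent.append(spatial_mem)
            if len(self.recent) > self.num_recent:
                self.recent.popleft()  # keep only the last six unprompted frames
        self.object_pointers.append(obj_ptr)

    def to_attend(self) -> List[torch.Tensor]:
        # Everything the current frame's features cross-attend to.
        return list(self.prompted.values()) + list(self.recent) + self.object_pointers
```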
00:43:06.380 | So as I mentioned, the real time aspect is really important.
00:43:10.220 | We have to find this speed accuracy trade off.
00:43:12.780 | And one way in which we sort of circumvent this
00:43:15.700 | is by allowing additional prompts on subsequent frames.
00:43:19.820 | So even if the model makes a mistake,
00:43:22.540 | maybe it loses the object.
00:43:24.300 | After an occlusion, you can provide another prompt
00:43:27.860 | which actually goes into the memory.
00:43:29.940 | And so the prompted frames are always in the memory.
00:43:33.860 | And so if you provide a prompt on a frame,
00:43:35.700 | the model will always remember what you provided.
00:43:39.620 | And so that's a way in which we can sort of avoid
00:43:42.780 | some of the model failure cases.
00:43:45.500 | That actually is a big limitation of current models.
00:43:48.060 | Current video object segmentation models
00:43:50.140 | don't allow any way to recover if the model makes a mistake.
00:43:53.380 | And so Joseph, going back to your point about the demo,
00:43:56.060 | that's something that we found
00:43:57.260 | just by playing with these models.
00:43:59.180 | There's no way to make a correction.
00:44:01.260 | And in many real world use cases,
00:44:03.140 | like it's not going to be a one time prediction,
00:44:06.660 | but you actually want to be able to intervene.
00:44:08.900 | Like if an LLM makes a mistake,
00:44:11.540 | you can actually be like, no, actually do it this way
00:44:14.020 | and provide feedback.
00:44:15.500 | And so we really want to bring some of that thinking
00:44:18.620 | into how we build these computer vision models as well.
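To make that interactive-correction loop concrete, here is a rough usage sketch based on the video predictor interface in the open-source sam2 repo; the config name, checkpoint path, video path, frame indices, and click coordinates below are placeholders, and the exact signatures should be checked against the released code.

```python
# Rough sketch of interactive correction with the sam2 video predictor.
# Paths, frame indices, and coordinates are placeholders; check exact
# signatures against the open-source repo before relying on this.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    "sam2_hiera_l.yaml", "./checkpoints/sam2_hiera_large.pt"
)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(video_path="./my_video_frames")  # placeholder path

    # Initial prompt: one positive click on the target object in frame 0.
    predictor.add_new_points(
        inference_state=state, frame_idx=0, obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),  # 1 = positive click
    )

    # Propagate the masklet through the rest of the video.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        pass  # inspect or save (frame_idx, obj_ids, mask_logits > 0) here

    # If the model loses the object after an occlusion, add a refinement click
    # on a later frame; prompted frames stay in memory, then re-propagate.
    predictor.add_new_points(
        inference_state=state, frame_idx=120, obj_id=1,
        points=np.array([[400, 180]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        pass
```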
00:44:21.860 | - Amazing.
00:44:22.700 | My main reaction to finding out about the context length,
00:44:26.220 | input frames and six past frames as their default
00:44:28.780 | is why not 60?
00:44:30.460 | Why not 600?
00:44:31.620 | In text language models,
00:44:33.060 | we're very used to massively extending context windows.
00:44:37.060 | And what does that do to the memory of your model?
00:44:40.540 | - So I think maybe one thing that's different
00:44:42.500 | is that with objects in videos, it is challenging.
00:44:45.500 | Objects can change in appearance.
00:44:47.380 | There's different lighting conditions.
00:44:49.220 | They can deform.
00:44:50.740 | But I think a difference to language models
00:44:53.580 | is probably the amount of context that you need
00:44:55.940 | is significantly less
00:44:57.780 | than maintaining a long multi-turn conversation.
00:45:01.060 | And so coupling this short-term spatial memory
00:45:04.500 | with this longer-term object pointers we found was enough.
00:45:08.500 | So I think that's probably one difference
00:45:11.020 | between vision models and LLMs.
00:45:13.780 | - I think so.
00:45:14.620 | If one wanted to be really precise
00:45:16.340 | with how literature refers to object re-identification,
00:45:20.060 | object re-identification is not only what SAM does
00:45:23.780 | for identifying that an object is similar across frames,
00:45:27.620 | it's also assigning a unique ID.
00:45:30.220 | How do you think about models keeping track
00:45:32.980 | of occurrences of objects
00:45:36.180 | in addition to seeing that the same looking thing
00:45:40.500 | is present in multiple places?
00:45:42.460 | - Yeah, it's a good question.
00:45:43.500 | I think, you know, SAM2 definitely isn't perfect
00:45:46.460 | and there's many limitations
00:45:48.900 | that we'd love to see people in the community
00:45:52.020 | help us address.
00:45:53.700 | But one definitely challenging case
00:45:56.300 | is where there are multiple similar looking objects,
00:45:59.060 | especially if there's like a crowded scene
00:46:01.460 | with multiple similar looking objects.
00:46:03.660 | Keeping track of the target object is a challenge.
00:46:07.780 | That's still something that I don't know
00:46:09.340 | if we've solved perfectly,
00:46:11.100 | but again, the ability to provide refinement clicks
00:46:15.260 | is one way to sort of circumvent that problem.
00:46:18.780 | In most cases, when there's lots of similar looking objects,
00:46:21.860 | if you add enough refinement clicks,
00:46:23.540 | you can get the perfect track throughout the video.
00:46:26.580 | So definitely that's one way to solve that problem.
00:46:30.580 | But, you know, we could have better motion estimation.
00:46:33.540 | We could do other things in the model
00:46:35.460 | to be able to disambiguate similar looking objects
00:46:38.820 | more effectively.
00:46:40.180 | - I'm just interested in leaving breadcrumbs
00:46:42.100 | for other researchers,
00:46:43.820 | anyone interested in this kind of architecture.
00:46:46.340 | Like, are there papers that you would refer people to
00:46:49.660 | that are influential in your thinking
00:46:51.260 | or, you know, have other interesting alternative approaches?
00:46:54.260 | - I think there's other ways in which
00:46:55.700 | you can do tracking in video.
00:46:57.340 | You might not even need the full mask.
00:46:59.300 | I think there's some other works
00:47:01.020 | that just track, like, points on objects.
00:47:04.020 | It really, really depends on what your application is.
00:47:06.420 | Like, if you don't care about the entire mask,
00:47:09.140 | you could just track a bounding box.
00:47:10.780 | You could just track a point on an object.
00:47:13.300 | And so having the high fidelity mask
00:47:16.580 | might not actually be necessary for certain use cases.
00:47:20.780 | From that perspective,
00:47:21.780 | you might not need the full capabilities of SAM or SAM2.
00:47:26.180 | There's many different approaches to tracking.
00:47:27.980 | I think I would encourage people to think about, like,
00:47:30.580 | what actually they need for their use case
00:47:33.300 | and then try to find something that fits
00:47:36.260 | versus, yeah, maybe SAM2 is too much.
00:47:39.180 | You know, maybe you don't even need the full mask.
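As a small illustration of that point, here is a hedged sketch in plain NumPy (helper names are illustrative) of reducing a predicted binary mask to the lighter representations just mentioned, a bounding box or a single point, when that is all a downstream tracker needs.

```python
# Sketch: reduce a binary mask to cheaper representations for lightweight tracking.
# Helper names are illustrative; only NumPy is assumed.
import numpy as np

def mask_to_bbox(mask: np.ndarray) -> tuple[int, int, int, int]:
    """Return (x_min, y_min, x_max, y_max) of a boolean HxW mask."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        raise ValueError("empty mask")
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def mask_to_point(mask: np.ndarray) -> tuple[float, float]:
    """Return the (x, y) centroid of a boolean HxW mask."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        raise ValueError("empty mask")
    return float(xs.mean()), float(ys.mean())

# Example: a toy 5x5 mask with a 2x2 square of foreground pixels.
toy = np.zeros((5, 5), dtype=bool)
toy[1:3, 2:4] = True
print(mask_to_bbox(toy))   # (2, 1, 3, 2)
print(mask_to_point(toy))  # (2.5, 1.5)
```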
00:47:41.820 | - Makes total sense.
00:47:42.660 | But you have solved the problem that you set out to solve,
00:47:44.900 | which is no mean feat,
00:47:46.540 | which is something that we're still appreciating even today.
00:47:49.060 | If there are no further questions,
00:47:50.220 | I would just transition to sort of forward-looking,
00:47:53.060 | future-looking stuff.
00:47:54.660 | Joseph already hinted at, like, you know,
00:47:56.420 | our interest in SAM and the future of SAM.
00:47:59.900 | And obviously you're the best person to ask about that.
00:48:02.220 | I'm also interested in, like,
00:48:03.860 | how should external people think about FAIR?
00:48:06.500 | You know, like, there's this stuff going on,
00:48:08.700 | this Llama, this Chameleon, this Voicebox,
00:48:11.300 | this ImageBind, like, how are things organized?
00:48:14.060 | And, you know, where are things trending?
00:48:15.980 | - Yeah, so in FAIR, we, you know,
00:48:18.220 | we have a number of different research areas.
00:48:20.260 | I work in an area called perception.
00:48:22.980 | So we build vision systems that solve,
00:48:26.100 | or basically look at, all the fundamental problems
00:48:28.540 | in computer vision.
00:48:29.660 | Can we build a step change
00:48:31.700 | in all of these different capabilities?
00:48:34.180 | SAM was one example.
00:48:35.380 | SAM-2 is another example.
00:48:36.900 | There are tons of other problems in computer vision
00:48:39.780 | where we've made a lot of progress,
00:48:41.780 | but can we really say that they're solved?
00:48:44.660 | And so that's really the area in which I work on.
00:48:48.100 | And then there's a number of other research areas
00:48:50.860 | in language and in embodied AI,
00:48:53.940 | in more efficient models and various other topics.
00:48:57.540 | So FAIR in general is still very much pushing the boundaries
00:49:01.580 | on solving these foundational problems
00:49:04.700 | across different domains.
00:49:06.460 | And then there's also obviously, like,
00:49:08.540 | actually I probably shouldn't talk about llama,
00:49:10.140 | so let's not include that.
00:49:12.020 | - I was gonna ask about that.
00:49:13.900 | (both laughing)
00:49:16.260 | Well, fair enough.
00:49:17.100 | Maybe just outside of FAIR,
00:49:18.460 | just the future of computer vision, right?
00:49:20.060 | Like you are very involved in the community.
00:49:22.540 | What's the talk of the town at CVPR?
00:49:24.380 | Both of you went.
00:49:25.540 | Who's doing the most interesting work?
00:49:27.380 | It's a question for both of you.
00:49:28.860 | - I think the trends we're seeing
00:49:32.540 | towards more zero-shot capability
00:49:35.420 | for common examples will accelerate.
00:49:38.220 | I think multimodality,
00:49:40.180 | meaning using images in tandem with text
00:49:42.940 | for richer understanding,
00:49:44.260 | or images and video in tandem with audio
00:49:48.340 | and other mixed media
00:49:50.460 | will be a continued acceleration trend.
00:49:53.260 | The way I kind of see the field continuing to progress,
00:49:57.900 | like the problem statement of computer vision
00:50:00.260 | is making sense of visual input.
00:50:02.980 | And I think about the world
00:50:05.300 | as the things that need to be observed
00:50:08.980 | follow your traditional bell curve,
00:50:11.580 | where like things that most frequently exist
00:50:13.820 | out in the world are on the center of that bell curve.
00:50:16.060 | And then there's things that are less frequently occurring
00:50:18.020 | that are in those long tails.
00:50:19.460 | For example, as far back as like 2014,
00:50:22.500 | you have the COCO dataset,
00:50:23.740 | which sets out to say,
00:50:24.580 | "Hey, can we find 80 common objects in context?"
00:50:29.140 | Like silverware and fridge and these sorts of things.
00:50:32.380 | And we also conceptualized the challenge of computer vision
00:50:35.460 | in terms of breaking it down into individual task types,
00:50:38.100 | because that's like the tools we had for the day.
00:50:40.020 | So that's why you have the origination of classification,
00:50:42.940 | object detection, instance segmentation.
00:50:45.300 | And then as you see things continue to progress,
00:50:48.460 | you have models and things
00:50:50.860 | that need to observe areas in the long tails.
00:50:53.140 | And so if you think of the COCO dataset
00:50:54.500 | as the center of that bell curve,
00:50:56.180 | I think of like the long tails,
00:50:57.660 | like really edge case problems.
00:50:59.940 | Some of our customers like Rivian, for example,
00:51:02.420 | only Rivian knows what the inside of like a Rivian
00:51:05.340 | should look like as it's assembled and put together
00:51:07.620 | before it makes its way to a customer.
00:51:09.140 | And they're making custom parts, right?
00:51:10.460 | So how could a model even have been trained on the things
00:51:13.820 | that go inside the componentry of producing a vehicle?
00:51:17.900 | And what's kind of happening with computer vision
00:51:20.380 | is you're seeing models that generalize
00:51:24.860 | in the middle of the bell curve push outward faster.
00:51:27.740 | That's where you see the advent of like open text models
00:51:31.540 | or the richness of understanding of multimodal models
00:51:36.020 | to allow richer understanding
00:51:38.620 | without perhaps any training,
00:51:40.460 | or maybe just using pre-training
00:51:42.060 | and applying it to a given problem.
00:51:44.220 | And then there's like, you know,
00:51:46.020 | kind of like the messy middle in between those two, right?
00:51:48.500 | So like, Nikhila kind of talked about examples
00:51:50.620 | where SAM does well out of distribution,
00:51:52.700 | where like it finds an octopus,
00:51:54.100 | even though there weren't octopi in the training data.
00:51:56.540 | I showed an example where like screenshots,
00:51:58.580 | where SAM isn't yet super great at screenshots.
00:52:01.460 | So maybe that's like in the messy middle
00:52:02.780 | or in the longer tails for now.
00:52:04.900 | But what's gonna happen is there need to be systems
00:52:07.500 | of validation. From that point of view,
00:52:09.540 | I think about, like, tooling to also validate
00:52:11.980 | that models are doing what we want them to do,
00:52:13.980 | adapting to datasets that we want them to adapt to.
00:52:16.500 | And so there's a lot of things on a forward-looking basis
00:52:19.380 | that allow propelling that expansion of generalizability.
00:52:24.220 | That's where open text problems,
00:52:27.180 | that's where scaling up of training,
00:52:30.380 | of dataset curation continues to play a massive role.
00:52:35.140 | Something that's notable, I think, about SAM 2
00:52:37.260 | is it's, what, 57,000 videos, 51,000 videos?
00:52:40.940 | - About 51,000, yeah.
00:52:42.740 | - And 100,000 internal datasets.
00:52:45.180 | - That's like not massive, right?
00:52:49.300 | And the model size also isn't, you know,
00:52:51.380 | the largest model being a couple hundred million parameters,
00:52:53.820 | the smallest model is 38 million parameters
00:52:55.580 | and can run at 45 FPS on an A100, right?
00:52:57.940 | Like the capabilities of,
00:53:00.740 | we're gonna see more capable, more generalizable models
00:53:04.580 | being able to run on a wider array of problems
00:53:07.900 | with zero or multi-shot capability at a faster rate.
00:53:12.140 | And I think the architecture innovations
00:53:15.620 | in things like SAM 2 of memory,
00:53:18.460 | of increasingly like transformers
00:53:20.620 | making their way into vision
00:53:22.140 | and probably blended architectures increasingly too.
00:53:25.220 | So my viewpoint of like on a go-forward basis
00:53:27.700 | is we will have that bell curve of what humans can see
00:53:32.500 | both in the center of that curve and the long tails
00:53:35.260 | and architectural changes
00:53:36.740 | allow richer multi- and zero-shot understanding,
00:53:40.620 | and putting those into systems
00:53:42.180 | and putting those into industry
00:53:43.500 | and putting those into contexts
00:53:45.300 | that allow using them in practical and pragmatic ways.
00:53:48.740 | Nikhila, I'd love to hear like your thought
00:53:50.420 | and perspective of like how you think
00:53:52.340 | the research trends map or don't map to that.
00:53:54.980 | And like maybe some of the key innovations
00:53:57.260 | that you saw at CVPR this year
00:53:58.980 | that got you excited about the direction
00:54:00.860 | and maybe some promising early directions
00:54:03.340 | that you're thinking about researching
00:54:05.300 | or pushing the boundaries of further.
00:54:07.140 | - Yeah, I just wanted to actually reply
00:54:08.900 | to a couple of things that you said about,
00:54:11.660 | so actually in video object segmentation,
00:54:14.340 | the number of classes that are annotated
00:54:16.940 | and then the size of these datasets are really small.
00:54:20.460 | So with SAM, it's, you know, we had a billion masks,
00:54:24.820 | we had 11 million images, didn't have class labels,
00:54:28.500 | but even before that, there were a lot of image datasets
00:54:30.620 | that have class labels and are annotated
00:54:33.580 | with significantly more, with like a lot of class labels,
00:54:37.300 | whereas in video datasets,
00:54:39.340 | the number of class labels is very small.
00:54:42.020 | So there's like YouTube VOS,
00:54:43.860 | which has 94 object categories,
00:54:45.980 | there's MOSE, which has around like 30
00:54:48.340 | or so object categories.
00:54:49.740 | And they're usually like people, there's cars,
00:54:52.300 | there's dogs and cats and all these common objects,
00:54:56.140 | but not really, they don't really cover
00:54:57.980 | a very large number of object categories.
00:55:00.260 | And so while SAM learned this general notion
00:55:03.900 | of what an object is in an image,
00:55:06.820 | these video tracking models actually don't have
00:55:09.500 | that knowledge at all.
00:55:12.260 | And so that's why having this dataset is really important
00:55:17.100 | for the segment anything capability in video,
00:55:20.180 | because if you just provide the mask as the input
00:55:23.580 | to an off-the-shelf video object segmentation model,
00:55:26.380 | it might not actually be able to track
00:55:28.100 | that arbitrary object mask as effectively
00:55:31.260 | as a SAM2 model that's actually trained
00:55:34.020 | to track any object across the entire video.
00:55:37.780 | So this sort of combining two models together
00:55:41.380 | to try to get a capability will actually only get you so far,
00:55:45.540 | and being able to actually create the dataset
00:55:49.340 | to enable that anything capability
00:55:52.260 | was actually really important.
00:55:54.620 | And we can actually see that when we do comparisons
00:55:57.300 | with baselines where we provide SAM2
00:55:59.980 | with the same input mask and the baseline model
00:56:02.380 | with the same input mask,
00:56:04.060 | for example, the T-shirt of a person,
00:56:06.260 | SAM2 can track the T-shirt effectively
00:56:08.660 | across the entire video,
00:56:10.420 | whereas these baselines might actually start tracking
00:56:13.460 | the entire person because that's what they're used to doing
00:56:16.260 | and isolating it to just one part of the person
00:56:19.100 | is not something they were ever trained to do.
00:56:21.620 | And so those are sort of some of the limitations.
00:56:24.580 | Another thing is segmenting an image
00:56:26.940 | and segmenting a video frame
00:56:29.140 | are actually two different things.
00:56:31.100 | So a video frame is still an image,
00:56:33.180 | but there might be motion blur
00:56:35.220 | or it might have lower resolution.
00:56:37.780 | Or, actually, we found that in the SAM2 paper
00:56:41.900 | we have this study where we look
00:56:43.500 | at the SAM image segmentation task on images
00:56:48.140 | and also on frames from videos.
00:56:50.620 | And we find that actually SAM2 is a lot better than SAM
00:56:54.340 | when it comes to segmenting objects in video frames,
00:56:57.980 | because they actually have a sort of
00:56:59.940 | slightly different distribution than images.
00:57:02.660 | And so I think that's maybe one learning from this project
00:57:06.540 | is like combining two models
00:57:08.460 | and sort of just smushing things together
00:57:10.860 | might not actually be as effective
00:57:12.580 | as if you really think about how to build things
00:57:15.060 | in a unified way.
00:57:16.860 | And then another really interesting point
00:57:19.460 | is that from the COCO data set,
00:57:21.540 | the last author, Piotr Dollár,
00:57:23.260 | he's the head of our research group.
00:57:25.420 | And so he's really seen the whole decade
00:57:27.820 | of going from COCO to going from SAM to going to SAM2.
00:57:32.820 | And so that's been very interesting
00:57:35.900 | to have that perspective as we build these models
00:57:38.820 | and as we think about the type of capabilities
00:57:41.820 | we want to build.
00:57:43.220 | - We hosted this challenge at CVPR
00:57:46.220 | when we introduced RF100,
00:57:48.300 | which is kind of meant to be the anti-COCO.
00:57:50.900 | So if like COCO is common objects in context,
00:57:53.060 | RF100 is like novel objects in weird contexts,
00:57:56.500 | like thermal data and like aerial stuff
00:57:59.260 | and things we were talking about earlier.
00:58:01.220 | And so we challenged the community as a part of,
00:58:03.380 | it's called ODinW with Microsoft,
00:58:05.620 | object detection in the wild.
00:58:07.420 | And it's basically like how well can you create models
00:58:10.220 | that either work zero shot,
00:58:11.940 | but really kind of what you end up measuring
00:58:13.540 | is how well things can learn domain adaptation.
00:58:16.260 | Like how quickly can something be retrained
00:58:18.740 | or fine tuned to a given domain problem?
00:58:21.100 | And what's really impressive about SAM and SAM2
00:58:24.820 | from what you just described is even with the limited set,
00:58:27.700 | the class agnostic approach affords the generalizability
00:58:32.180 | even to out of distribution examples, surprisingly well.
00:58:36.660 | Like it's like remarkably robust.
00:58:39.100 | And so that research direction seems extremely promising.
00:58:42.540 | - Yeah, and actually Piotr is always telling us like,
00:58:45.460 | don't care about COCO, even though he built COCO.
00:58:48.580 | So that's always fun.
00:58:51.540 | And really keeping that zero shot real world use cases
00:58:54.980 | in mind as we build and try to do things
00:58:57.780 | in as general a way as possible.
00:59:00.980 | - Okay, I think that just leaves us to calls to action
00:59:03.620 | for engineers, researchers, and personal recommendations.
00:59:07.980 | What do you have?
00:59:09.340 | - Yeah, so please try out all the resources we put out.
00:59:12.780 | We, you know, open-sourced the SA-V dataset, SAM2,
00:59:17.180 | various SAM2 models, the paper, the demo,
00:59:21.100 | this dataset visualizer.
00:59:22.780 | Please try all of these things that we've released.
00:59:26.620 | And also, as I said, SAM2 isn't perfect.
00:59:29.980 | There are a number of limitations.
00:59:31.180 | Actually in the blog post, we go through many of these
00:59:33.860 | in quite a lot of detail with examples.
00:59:36.980 | And so if you have any ideas of how to improve these,
00:59:40.700 | like please build on top of what we've released.
00:59:43.540 | We would love to see some of these problems get solved
00:59:47.060 | and maybe we can incorporate them back
00:59:50.020 | into future model versions.
00:59:53.300 | So it'd be really cool to see you use SAM2
00:59:56.660 | for all your different use cases,
00:59:57.980 | build on top of it, improve it,
01:00:00.460 | and share what you've built back with us.
01:00:02.860 | We'd love to hear from you.
01:00:04.300 | - Lovely.
01:00:05.140 | We'll definitely want people to comment
01:00:06.980 | and share what they build on SAM and SA-V
01:00:10.340 | and all the other stuff that's going on.
01:00:12.060 | Thank you so much for your time.
01:00:13.300 | This is wonderful.
01:00:14.420 | And then obviously the incredible open source
01:00:16.940 | that you've given us.
01:00:18.500 | Joseph, thank you as well for guest hosting.
01:00:21.100 | It was a much better episode with you than without you.
01:00:23.020 | So appreciate both of you coming on
01:00:24.820 | and whenever SAM3 is out
01:00:26.140 | or whatever else you guys are working on,
01:00:28.020 | just let us know and we'll come back on again.
01:00:30.180 | - Thank you.
01:00:31.220 | - Thanks. - Bye.
01:00:32.780 | (upbeat music)