
Moondream: how does a tiny vision model slap so hard? — Vikhyat Korrapati



00:00:00.000 | Hi. My name is Vik. I work on a model, an open-source vision model called Moondream.
00:00:17.040 | A little bit about myself before I dive into Moondream. I was at AWS for about nine years
00:00:23.920 | before I started working on this model. Looking at where the stock price is going,
00:00:28.960 | I'm not sure if that was the right financial decision, but I'm very happy with the work I'm
00:00:32.480 | doing. So let's dive into it. I'll talk about Moondream a little bit.
00:00:36.080 | It is a tiny vision language model. It's less than two billion parameters,
00:00:42.160 | so it can run anywhere, and it's open-source, Apache 2.0, so you can use it to do anything.
00:00:48.960 | Here are some examples of things you can do with Moondream. You can ask it questions about images.
00:00:55.760 | You can caption images. It can detect specific objects inside of images.
00:01:02.000 | So here I asked it to tell me where the peak is, and it gives me coordinates.
00:01:05.120 | It can count stuff. It can do all sorts of things.
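To make this concrete, here is a rough sketch of what querying Moondream looks like from Python against the Hugging Face vikhyatk/moondream2 checkpoint. The encode_image and answer_question methods come from the model's remote-code interface as documented around the time of this talk and may differ in later releases; the image path is just a placeholder.

```python
# Rough sketch: captioning, question answering, and counting with the
# Hugging Face moondream2 checkpoint. Method names (encode_image,
# answer_question) follow the model's remote-code README of this era
# and may have changed in newer releases.
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "vikhyatk/moondream2"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("example.jpg")      # placeholder path
enc = model.encode_image(image)        # encode once, ask many questions

print(model.answer_question(enc, "Describe this image.", tokenizer))
print(model.answer_question(enc, "How many people are in the image?", tokenizer))
print(model.answer_question(enc, "Where is the peak? Answer briefly.", tokenizer))
```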
00:01:08.800 | I had the audacity to title my talk, "How can a tiny vision model slap so hard?"
00:01:14.320 | So I have to back things up a little bit. And so here's me doing that.
00:01:19.040 | So these are two vision benchmarks, vision question-answering benchmarks. One is called VQAv2.
00:01:25.120 | The other is called GQA. As you can see, Moondream has been steadily improving over the
00:01:31.280 | releases I've made over the last three months. I've included a reference line over there for
00:01:35.280 | LLaVA 1.5, which is a popular 7 billion parameter vision model. So this shows you that Moondream gives you
00:01:43.520 | a performance that's comparable to models that are about four times bigger than it.
00:01:48.480 | I didn't really set out to build a vision model; I kind of got roped into it.
00:01:55.520 | I was originally trying to build an application that required an AI agent, so I needed to be able to
00:02:00.000 | see what was going on on the user's screen and have it describe what's on the browser page for QA
00:02:06.240 | testing automation. I tried to do this at first with GPT-4V, but there were too many safety refusals back
00:02:13.600 | then. Like, if there was any human being present in the image, it would just refuse to process it.
00:02:18.240 | It was also going to be really slow and expensive, and so I realized if this is a product I'm trying to
00:02:23.360 | build, I really need to have control over the model itself. So I figured, you know what, how hard can it be?
00:02:28.080 | Let me just go try and build this model myself. Now, the task I was trying to perform here was fairly
00:02:33.520 | constrained. I just needed to describe screens and answer questions about screens, so it doesn't
00:02:41.760 | need to be generally intelligent. I had a couple of 3090s at home, so I figured I'd train a small version
00:02:46.640 | of the model at home and then rent some beefier machines in the cloud to go train a bigger version.
00:02:54.560 | Once I got done training a small version, I was like, hey, this actually works pretty well, so I posted
00:02:58.000 | it on Twitter. I thought, you know what, I might get 20 likes off of this, and then I'll move on with my
00:03:02.320 | side project at the time. It blew up far beyond expectations. I was a little surprised, pleasantly
00:03:08.480 | surprised, but surprised nonetheless. And I immediately started seeing other automated testing companies
00:03:14.480 | reach out and be like, hey, can I use this to describe browser screens? Because this would work really
00:03:19.280 | great for us. As well as other companies, shout out to our friends at Open Interpreter from Seattle,
00:03:24.560 | who basically told us they were interested too. I figured, you know what, like, this is getting a lot of
00:03:31.760 | traction. Let me pause on the whole automated testing app for a couple of weeks and focus on Moondream and see
00:03:36.320 | where it goes. Yeah, so let me dive into a couple of the technical details of what makes the model succeed
00:03:44.640 | despite being small. The first thing we did that I think really helped was deciding what problems the model
00:03:53.280 | should solve and what it should not solve. So Moondream wants to be a developer tool. We focus on being really
00:03:58.400 | accurate and not hallucinating. It doesn't really have a lot of knowledge about the world, so
00:04:04.560 | if you ask it to write a poem, it's probably not going to help you. It's really focused on answering questions and
00:04:10.960 | helping you understand images. This is really important because it affects the type of data
00:04:16.400 | that you use and the sort of benchmarks that you want to focus on. There's a popular vision language
00:04:22.640 | model benchmark called MathVista, which measures how good models are at solving math problems. You take a
00:04:26.800 | picture of a differential equation and you see whether the model can solve it. That was an example of a
00:04:31.920 | non-goal for us because we just want the model to be good at looking at images. The most we do is probably
00:04:37.840 | generate a LaTeX representation of the problem. We don't really want to even attempt to try and solve
00:04:44.240 | calculus. It was not pre-trained from scratch. We use a vision encoder called SigLIP from Google
00:04:52.320 | with a pre-trained text model called Phi-1.5 from Microsoft. The notable thing over here is Phi-1.5 was
00:05:00.960 | also trained on mostly synthetic data, which is very similar to our pipeline, so it works very well.
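For readers unfamiliar with this recipe, the sketch below shows the general idea of gluing a pre-trained vision encoder to a pre-trained language model: project the encoder's patch embeddings into the language model's embedding space and feed them in ahead of the text tokens. The module names, MLP shape, and dimensions are illustrative assumptions, not Moondream's actual implementation (the 729-token figure is the one mentioned again later in the talk).

```python
# Simplified sketch of the "pre-trained vision encoder + pre-trained LM"
# recipe. Dimensions and module names are illustrative, not Moondream's
# actual code: 1152 is a typical SigLIP hidden size, 2048 is Phi-1.5's.
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim=1152, text_dim=2048):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a pre-trained SigLIP
        self.language_model = language_model   # e.g. a pre-trained Phi-1.5
        # The projection MLP is the only part trained from scratch.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, pixel_values, input_ids):
        patches = self.vision_encoder(pixel_values)          # (B, 729, vision_dim)
        image_tokens = self.projector(patches)               # (B, 729, text_dim)
        text_embeds = self.language_model.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)
```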
00:05:04.640 | For this sort of task, pre-training from scratch doesn't really make a difference as opposed to
00:05:12.560 | using pre-trained models, and it is cost prohibitive. So, unless you want to get those brownie points for
00:05:16.640 | saying you trained it from scratch, it's probably not worth doing. We experimented with a bunch of
00:05:20.960 | different other models as they were released, and nothing really made too much of a difference. What does make a
00:05:26.480 | difference, though, is training data. The latest release of Moondream is trained on around
00:05:33.360 | 35 million images, and the problem is, especially when you're on a budget, like high-quality multimodal
00:05:40.960 | training data is really hard to come by. There are a lot of companies out
00:05:47.280 | there that will annotate data with humans, but it's really expensive, and I've heard a rumor recently that
00:05:53.680 | they won't even talk to you anymore unless you're willing to sign an upfront seven-figure commitment.
00:05:57.440 | There's a lot of data on the internet -- image and alt-text pairs. The problem with this is it's often not in the format you want it to be, and it's really noisy, and the noise is really problematic
00:06:11.520 | when you're training small models. And so synthetic data is a way to solve this, where you use that
00:06:18.000 | alt-text information and process it. It's a bit of an open secret that a lot of people are training on
00:06:22.400 | outputs from GPT-4. You probably don't want to do that. Besides being questionable from a terms-of-use standpoint,
00:06:31.120 | it's often not helpful. GPT-4 is a very powerful model. It has reasoning capabilities and knowledge
00:06:36.320 | that your small model is never going to be able to get. And so when you train it on GPT-4 outputs,
00:06:41.120 | what it learns instead is to hallucinate. It's going to generate plausible-sounding outputs that include
00:06:46.160 | details that it cannot possibly memorize, and so you end up in trouble. So this is a little important.
00:06:52.800 | I'm going to go into a little more technical detail for a couple of minutes to dive into how to do synthetic
00:06:57.840 | data. So bear with me for a sec. We'll pop back up. Here's an example of how not to do it. COCO is
00:07:03.840 | a dataset. It has around 200k images. Each image has five short descriptions and a bunch of object
00:07:09.840 | annotations with, like, hey, there's a bicycle at these coordinates and whatnot. And let's say you
00:07:15.760 | want to take those short descriptions and these object annotations and generate more detailed captions
00:07:20.000 | that include the union of all the information present over here. If you just naively call GPT-4
00:07:25.040 | with this information, it generates this. It's not important to read all of it, but there's two important
00:07:30.800 | things to note. The first is that it hallucinates. It says in the second paragraph there's a person near
00:07:37.840 | the right side of the harbor. I think there's, like, a person way back. There's, like, five pixels there
00:07:44.080 | that may be a post. It may be a person. We don't really know. That's because object annotations were
00:07:48.080 | bad. But besides that, like, the model is also taking a lot of creative liberties over here, like saying
00:07:53.280 | there are five yachts standing out from the rest and whatnot. And so you need to do a little
00:08:01.440 | more preprocessing of your data before you feed it to the model. Here's another example. There's a dataset
00:08:06.400 | from Google called Localized Narratives. The task the annotators are given here is:
00:08:12.560 | verbally describe this image. And as you're describing the image, hover your mouse over the part of the image
00:08:18.800 | that you're describing. So it's nice in that it encourages people to create really detailed
00:08:22.640 | descriptions that capture spatial positioning in the image. So for example, here it says the girl in the
00:08:28.800 | front is playing the guitar and whatnot. And spatial reasoning is something that vision language models
00:08:33.040 | typically tend to struggle with. I ended up having to build a fairly sophisticated data processing pipeline to
00:08:38.560 | get really good results with this. Not really important to dive into the details over here. But the important
00:08:43.760 | thing to note is, A, it gets really expensive. Each image ends up being 20 LLM calls. And the LLM here is
00:08:52.480 | Mixtral 8x7B. So it gets pretty expensive. But it was necessary. The training data is the biggest needle
00:09:01.680 | mover in terms of model performance. And because of this, I'd say we spent like maybe one or two orders of
00:09:06.880 | magnitude more compute on generating training data than actually training the model itself.
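To make the preprocessing point concrete, here is a hypothetical sketch of a single step in such a pipeline: rather than handing the LLM raw annotations and hoping for the best, the prompt restricts it to facts that are actually present in the ground-truth data. The call_llm hook, field names, and prompt wording are all illustrative; the real pipeline described above chains roughly 20 calls per image.

```python
# Hypothetical sketch of one synthetic-captioning step: combine ground-truth
# short captions and object annotations into a prompt that forbids the LLM
# from inventing details. `call_llm` stands in for whatever serves the
# caption model (e.g. Mixtral 8x7B) in your setup.

def build_caption_prompt(captions: list[str], objects: list[dict]) -> str:
    object_lines = "\n".join(f"- {o['label']} at box {o['bbox']}" for o in objects)
    caption_lines = "\n".join(f"- {c}" for c in captions)
    return (
        "Write a detailed image description using ONLY the facts below.\n"
        "Do not mention anything that is not explicitly listed. If an\n"
        "annotation is ambiguous, leave that detail out.\n\n"
        f"Short descriptions:\n{caption_lines}\n\n"
        f"Object annotations:\n{object_lines}"
    )

def synthesize_caption(example: dict, call_llm) -> str:
    prompt = build_caption_prompt(example["captions"], example["objects"])
    return call_llm(prompt)
```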
00:09:11.280 | So yeah, this particular dataset we've open-sourced. It's available on Hugging Face. Here's
00:09:18.960 | an example of the type of questions it generates for this image. There's an interesting question towards
00:09:26.400 | the end. What theory does the kid have about the existence of pleasure in the image? I'll talk about
00:09:30.240 | that in a sec. But basically, you want to generate a few distractor questions so the model knows to not always
00:09:35.440 | agree with the question that the user is asking. So yeah, a couple of the challenges involved in
00:09:41.920 | working with synthetic data. There was an interesting incident I had early on where a user was like,
00:09:48.880 | hey, I asked a relatively simple question. Why couldn't the model answer this? And when I looked at it,
00:09:53.920 | it turned out that they didn't capitalize the first letter in their question. And the model had never
00:09:57.920 | seen anything like that during training. So I was like, what do I do over here? And so it's really
00:10:04.000 | important for you to make sure that your training data has the same rough distribution as your real
00:10:09.120 | world queries. So I ended up adding an extra step where we artificially inject capitalization
00:10:13.600 | issues and typos and whatnot into the training data before training on it. There's also this risk of what we call
00:10:19.280 | model collapse, where your model has biases inherent to it. So for example, if you try to ask Mixtral to
00:10:25.440 | generate distractor questions, hey, just generate a question that's completely irrelevant to the image,
00:10:29.600 | it'll always generate something about dinosaurs and aliens. And so if you train your model on that,
00:10:33.280 | it'll instead learn to say, hey, if the question is about dinosaurs and aliens, always say no,
00:10:40.480 | which doesn't really help. And so you need to inject like some entropy into the process of
00:10:45.200 | generating synthetic data to avoid this. In the case of synthetic captioning, you can do something
00:10:49.520 | like, hey, describe this image, but also consider the alt text on the image, which may be noisy,
00:10:53.440 | may be irrelevant. But if it is relevant, use relevant facts from that. And that tends to help a lot.
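A minimal sketch of those two tricks, with every name, topic list, and probability chosen for illustration rather than taken from the actual pipeline:

```python
# Illustrative sketch: (1) perturb synthetic questions so the training
# distribution includes sloppy real-world queries, and (2) seed distractor
# question generation with random topics so the LLM doesn't collapse onto
# one theme. Probabilities and topics are made up for the example.
import random
import string

def perturb_query(q: str, p_lowercase: float = 0.2, p_typo: float = 0.1) -> str:
    if q and random.random() < p_lowercase:
        q = q[0].lower() + q[1:]                    # drop the leading capital
    if len(q) > 5 and random.random() < p_typo:
        i = random.randrange(1, len(q) - 1)
        q = q[:i] + random.choice(string.ascii_lowercase) + q[i + 1:]  # inject a typo
    return q

TOPICS = ["cooking", "astronomy", "tax law", "football", "gardening", "music theory"]

def distractor_prompt() -> str:
    topic = random.choice(TOPICS)   # inject entropy so the questions vary
    return (f"Write a question about {topic} that has nothing to do with the image. "
            f"The correct answer should point out that the image doesn't show this.")
```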
00:10:58.160 | All right. So popping back up,
00:11:08.000 | there's a couple of important learnings I had over the last three months that I would like to share
00:11:11.520 | with all of you. The first was the community was really critical in this whole journey. Seeing that
00:11:18.240 | original engagement that we got from the Moondream release helped me realize that, hey, maybe this is
00:11:23.440 | more valuable than that QA testing application that I was working on, because a lot of people have a need
00:11:28.240 | for this to build applications like that. Coming from an enterprise-ish company, it's been really
00:11:35.040 | valuable. It's been refreshing to be able to just talk to customers directly, like someone will Twitter
00:11:39.760 | DM me and be like, hey, I just saw you're looking for this. What do you think? But it's also helped us
00:11:45.280 | connect with a lot of partners, mentors, and get a lot of support from the community. Being open source
00:11:51.200 | was critical. I kind of didn't really have a choice over here because the competition was free. So
00:11:56.560 | what am I going to do? But when you're in the dev tool space, it is pretty important. Open source is
00:12:02.560 | important to a lot of developers. They would like to have the ability to run it in different
00:12:06.800 | environments. It's also pretty important for a lot of enterprise users. In a lot of cases, they don't
00:12:12.320 | really want to run the software themselves, but having the option is very important to them because
00:12:18.320 | they've had -- most enterprises have had situations where a vendor goes out of business or decides to
00:12:26.400 | screw them in some other capacity. It's also been really critical for engagement for us. We've had
00:12:31.200 | a lot of people in the community help out, port it to different platforms, run the model in the web
00:12:35.280 | browser and whatnot. So it's been very valuable for us. This one is a little controversial. I'm not sure
00:12:44.720 | everyone agrees with this, but I feel pretty strongly that safety guardrails should be implemented at the
00:12:50.000 | application layer, not baked into the model itself. This was one of my learnings from my first attempt to
00:12:57.040 | build a QA testing application with GPT-4V. It made no sense for that application to reject
00:13:03.840 | any picture that contained a human being. I understand why they felt it was important.
00:13:10.320 | DevTools are kind of B2B, not B2C. So it's important to make it easy for developers to
00:13:16.240 | decide what guardrails they want and implement them in their application as opposed to just deciding it for
00:13:21.280 | all users. I'm not saying this is not important at all. Kind of makes sense if you're trying to
00:13:25.040 | build an assistant to bake that stuff directly into the model. But when you're building for
00:13:29.440 | developers, it makes less sense. Yeah, I believe pretty strongly now that tiny models are going to run the world.
00:13:39.920 | In computer vision, more so perhaps than in text models, efficiency is really important.
00:13:46.480 | In a lot of cases, you're really worried about cost because you're processing video at
00:13:54.400 | 30 frames a second at seven-tenths of a cent per second adds up very quickly and
00:13:59.280 | doesn't give you a lot of room to work with. But there's also situations where you're
00:14:02.960 | really worried about privacy or latency and therefore you want to run the model really close to where
00:14:09.520 | decisions need to be made. Which is not to say big models are not useful. I think they're
00:14:14.400 | very useful. I just think that we'll mostly be running them in our development environments maybe
00:14:18.560 | for generating training data. But the artifact that you're going to want to deploy is most likely going
00:14:24.560 | to be a smaller model. Another thing that was a little surprising to me was looking at the different
00:14:37.520 | things people were doing with Moondream. There were a lot of people building net new applications that
00:14:41.920 | weren't possible to do before because the model can understand language as well as images. But there
00:14:47.600 | were also a lot of people doing traditional computer vision things with the model. It's like,
00:14:52.480 | is there a person in the scene? Or is there something suspicious going on? Tell me where the
00:14:57.760 | bus is in this picture from a road camera. All of which was possible to do before we had transformers,
00:15:07.760 | like just train a YOLOv3 model or whatnot.
00:15:11.360 | The lesson I took from this was that prompting is a much better developer experience than having to train
00:15:19.360 | a custom model. And so for a lot of developers that would be interested in incorporating vision into their
00:15:25.280 | applications, before they'd be like, you know what, it's not worth me spending two weeks learning how to
00:15:31.200 | like collect data and annotate it and train my own custom model. Giving them the option to say, hey,
00:15:37.520 | for fairly cheap, you can just in English describe what you want extracted from this image makes it
00:15:43.520 | something that they actually consider doing now. All right. I think I'm a little ahead of time,
00:15:51.040 | so I'm excited to maybe do a live demo if the demo gods smile upon me, but we'll see. In conclusion,
00:15:58.880 | yeah, where's Moondream going? We're not AGI people. I'm really focused on making it really
00:16:05.520 | easy for developers to build amazing applications with vision. There's a bunch of model improvements
00:16:11.440 | that I'm working on right now. I'll talk about some. Right now, we use 729 tokens to represent an
00:16:18.080 | image, so you can only really send one image to the model at a time. We're working on giving users the
00:16:23.200 | option to give a more compressed representation to the model, which makes sense if you're not trying to
00:16:28.000 | read text or something from the image and are just trying to do classification and whatnot. That makes
00:16:31.200 | the model run a lot faster, which is important, especially if you're on CPU as opposed to GPUs,
00:16:35.600 | CPUs can't do as much parallel compute, and so that sort of thing ends up
00:16:40.720 | being really important. We've also just raised a seed round from Felicis, Ascend, and also the GitHub
00:16:50.160 | one, which I forgot to include in the slide. Sorry, GitHub. This means more GPUs, but more importantly, it means I can
00:16:57.840 | finally get some sleep because we're able to get a couple more people to join the team. If you're
00:17:02.480 | interested, please reach out. We have a contact email on the website or just hit me up on Twitter.
00:17:07.120 | We also have an exciting release coming up later this summer that I'm super pumped for, so stay tuned.
00:17:12.000 | I think that's about it. So I have a couple of minutes left, I think, so I'm going to try doing something
00:17:19.280 | that may not be the wisest idea, but we'll see how it goes.
00:17:22.560 | All right. I'll turn the Wi-Fi off. This whole thing is running locally.
00:17:34.080 | So what this is going to do is, like, start taking my webcam feed in, and it's going to use Moondream in an
00:17:44.560 | infinite loop to describe what it sees, and we can ask it different questions. So we'll see how that goes.
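Conceptually, the demo loop is something like the sketch below, reusing the model and tokenizer objects from the earlier snippet; the real demo application is more involved (UI, threading, and so on).

```python
# Rough sketch of a local webcam description loop: grab a frame, ask
# Moondream about it, print the answer, repeat. Assumes the same
# encode_image / answer_question interface as the earlier snippet.
import cv2
from PIL import Image

cap = cv2.VideoCapture(0)
prompt = "Describe what you see. Answer briefly."

while True:
    ok, frame = cap.read()
    if not ok:
        break
    image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    enc = model.encode_image(image)
    print(model.answer_question(enc, prompt, tokenizer))
```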
00:18:06.000 | And, yeah, you can ask it different things. So let's say, is the person wearing glasses? You do have
00:18:15.280 | to tell the model to answer briefly if you want a yes or no; otherwise it gives you a longer
00:18:19.120 | answer instead of a single word. Let's try that.
00:18:27.680 | Yes. Okay. I'll take them off. I can't see. Did it get it?
00:18:35.840 | Let's do that. I'll go back to the old prompt.
00:18:50.720 | All right. Well, that was it for me. Thank you all.
00:19:07.680 | I'll see you next time.