
Moondream: how does a tiny vision model slap so hard? — Vikhyat Korrapati



00:00:00.000 | Hi. My name is Vik. I work on a model, an open-source vision model called Moondream.
00:00:17.040 | A little bit about myself before I dive into Moondream. I was at AWS for about nine years
00:00:23.920 | before I started working on this model. Looking at where the stock price is going,
00:00:28.960 | I'm not sure if that was the right financial decision, but I'm very happy with the work I'm
00:00:32.480 | doing. So let's dive into it. I'll talk about Moondream a little bit.
00:00:36.080 | It is a tiny vision language model. It's less than two billion parameters,
00:00:42.160 | so it can run anywhere, and it's open-source, Apache 2.0, so you can use it to do anything.
00:00:48.960 | Here are some examples of things you can do with Moondream. You can ask it questions about images.
00:00:55.760 | You can caption images. It can detect specific objects inside of images.
00:01:02.000 | So here I asked it to tell me where the peak is, and it gives me coordinates.
00:01:05.120 | It can count stuff. It can do all sorts of things.
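To make this concrete, here is a rough sketch of what querying Moondream looks like from Python against the Hugging Face vikhyatk/moondream2 checkpoint. The encode_image and answer_question methods come from the model's remote-code interface as documented around the time of this talk and may differ in later releases; the image path is just a placeholder.

```python
# Rough sketch: captioning, question answering, and counting with the
# Hugging Face moondream2 checkpoint. Method names (encode_image,
# answer_question) follow the model's remote-code README of this era
# and may have changed in newer releases.
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "vikhyatk/moondream2"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("example.jpg")      # placeholder path
enc = model.encode_image(image)        # encode once, ask many questions

print(model.answer_question(enc, "Describe this image.", tokenizer))
print(model.answer_question(enc, "How many people are in the image?", tokenizer))
print(model.answer_question(enc, "Where is the peak? Answer briefly.", tokenizer))
```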
00:01:08.800 | I had the audacity to title my talk, "How can a tiny vision model slap so hard?"
00:01:14.320 | So I have to back things up a little bit. And so here's me doing that.
00:01:19.040 | So these are two vision benchmarks, vision question-answering benchmarks. One is called VQAv2.
00:01:25.120 | The other is called GQA. As you can see, Moondream has been steadily improving over the
00:01:31.280 | releases I've made over the last three months. I've included a reference line over there for
00:01:35.280 | LLaVA 1.5, which is a popular 7 billion parameter vision model. So this shows you that Moondream gives you
00:01:43.520 | a performance that's comparable to models that are about four times bigger than it.
00:01:48.480 | I didn't really set out to build a vision model; I kind of got roped into it.
00:01:55.520 | I was originally trying to build an application that required an AI agent, so I needed to be able to
00:02:00.000 | see what was going on on the user's screen and have it describe what's on the browser page for QA
00:02:06.240 | testing automation. I tried to do this at first with GPT-4V, but there were too many safety refusals back
00:02:13.600 | then. Like, if there was any human being present in the image, it would just refuse to process it.
00:02:18.240 | It was also going to be really slow and expensive, and so I realized if this is a product I'm trying to
00:02:23.360 | build, I really need to have control over the model itself. So I figured, you know what, how hard can it be?
00:02:28.080 | Let me just go try and build this model myself. Now, the task I was trying to perform here was fairly
00:02:33.520 | constrained. I just needed to describe screens and answer questions about screens, so it doesn't
00:02:41.760 | need to be generally intelligent. I had a couple of 3090s at home, so I figured I'd train a small version
00:02:46.640 | of the model at home and then rent some beefier machines in the cloud to go train a bigger version.
00:02:54.560 | Once I got done training a small version, I was like, hey, this actually works pretty well, so I posted
00:02:58.000 | it on Twitter. I thought, you know what, I might get 20 likes off of this, and then I'll move on with my
00:03:02.320 | side project at the time. It blew up far beyond expectations. I was a little surprised, pleasantly
00:03:08.480 | surprised, but surprised nonetheless. And I immediately started seeing other automated testing companies
00:03:14.480 | reach out and be like, hey, can I use this to describe browser screens? Because this would work really
00:03:19.280 | great for us. As well as other companies, shout out to our friends at Open Interpreter from Seattle,
00:03:24.560 | who basically told us they were interested too. I figured, you know what, like, this is getting a lot of
00:03:31.760 | traction. Let me pause on the whole automated testing app for a couple of weeks and focus on Moondream and see
00:03:36.320 | where it goes. Yeah, so let me dive into a couple of the technical details of what makes the model succeed
00:03:44.640 | despite being small. The first thing we did that I think really helped was deciding what problems the model
00:03:53.280 | should solve and what it should not solve. So Moondream wants to be a developer tool. We focus on being really
00:03:58.400 | accurate and not hallucinating. It doesn't really have a lot of knowledge about the world, so
00:04:04.560 | if you ask it to write a poem, it's probably not going to help you. It's really focused on answering questions and
00:04:10.960 | helping you understand images. This is really important because it affects the type of data
00:04:16.400 | that you use and the sort of benchmarks that you want to focus on. There's a popular vision language
00:04:22.640 | model benchmark called MathVista, which measures how good models are at solving math problems. You take a
00:04:26.800 | picture of a differential equation and you see whether the model can solve it. That was an example of a
00:04:31.920 | non-goal for us because we just want the model to be good at looking at images. The most we do is probably
00:04:37.840 | generate a LaTeX representation of the problem. We don't really want to even attempt to try and solve
00:04:44.240 | calculus. It was not pre-trained from scratch. We use a vision encoder called SigLIP from Google
00:04:52.320 | with a pre-trained text model called Phi-1.5 from Microsoft. The notable thing over here is Phi-1.5 was
00:05:00.960 | also trained on mostly synthetic data, which is very similar to our pipeline, so it works very well.
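For readers unfamiliar with this recipe, the sketch below shows the general idea of gluing a pre-trained vision encoder to a pre-trained language model: project the encoder's patch embeddings into the language model's embedding space and feed them in ahead of the text tokens. The module names, MLP shape, and dimensions are illustrative assumptions, not Moondream's actual implementation (the 729-token figure is the one mentioned again later in the talk).

```python
# Simplified sketch of the "pre-trained vision encoder + pre-trained LM"
# recipe. Dimensions and module names are illustrative, not Moondream's
# actual code: 1152 is a typical SigLIP hidden size, 2048 is Phi-1.5's.
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim=1152, text_dim=2048):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a pre-trained SigLIP
        self.language_model = language_model   # e.g. a pre-trained Phi-1.5
        # The projection MLP is the only part trained from scratch.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, pixel_values, input_ids):
        patches = self.vision_encoder(pixel_values)          # (B, 729, vision_dim)
        image_tokens = self.projector(patches)               # (B, 729, text_dim)
        text_embeds = self.language_model.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)
```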
00:05:04.640 | For this sort of task, pre-training from scratch doesn't really make a difference as opposed to
00:05:12.560 | using pre-trained models, and it is cost prohibitive. So, unless you want to get those brownie points for
00:05:16.640 | saying you trained it from scratch, it's probably not worth doing. We experimented with a bunch of
00:05:20.960 | different other models as they were released, and nothing really made too much of a difference. What does make a
00:05:26.480 | difference, though, is training data. The latest release of Moondream is trained on around
00:05:33.360 | 35 million images, and the problem is, especially when you're on a budget, like high-quality multimodal
00:05:40.960 | training data is really hard to come by. There are a lot of companies out
00:05:47.280 | there that will annotate data with humans, but it's really expensive, and I've heard a rumor recently that
00:05:53.680 | they won't even talk to you anymore unless you're willing to sign an upfront seven-figure commitment.
00:05:57.440 | There's a lot of data on the internet -- image and alt-text pairs. The problem with this is it's often not in the format you want it to be, and it's really noisy, and the noise is really problematic
00:06:11.520 | when you're training small models. And so synthetic data is a way to solve this, where you use that
00:06:18.000 | alt-text information and process it. It's a bit of an open secret that a lot of people are training on
00:06:22.400 | outputs from GPT-4. You probably don't want to do that. Besides being questionable from a terms-of-use standpoint,
00:06:31.120 | it's often not helpful. GPT-4 is a very powerful model. It has reasoning capabilities and knowledge
00:06:36.320 | that your small model is never going to be able to get. And so when you train it on GPT-4 outputs,
00:06:41.120 | what it learns instead is to hallucinate. It's going to generate plausible-sounding outputs that include
00:06:46.160 | details that it cannot possibly memorize, and so you end up in trouble. So this is a little important.
00:06:52.800 | I'm going to go into a little more technical detail for a couple of minutes to dive into how to do synthetic
00:06:57.840 | data. So bear with me for a sec. We'll pop back up. Here's an example of how not to do it. COCO is
00:07:03.840 | a dataset. It has around 200k images. Each image has five short descriptions and a bunch of object
00:07:09.840 | annotations with, like, hey, there's a bicycle at these coordinates and whatnot. And let's say you
00:07:15.760 | want to take those short descriptions and these object annotations and generate more detailed captions
00:07:20.000 | that include the union of all the information present over here. If you just naively call GPT-4
00:07:25.040 | with this information, it generates this. It's not important to read all of it, but there's two important
00:07:30.800 | things to note. The first is that it hallucinates. It says in the second paragraph there's a person near
00:07:37.840 | the right side of the harbor. I think there's, like, a person way back. There's, like, five pixels there
00:07:44.080 | that may be a post. It may be a person. We don't really know. That's because object annotations were
00:07:48.080 | bad. But besides that, like, the model is also taking a lot of creative liberties over here, like saying
00:07:53.280 | there are five yachts standing out from the rest and whatnot. And so you need to do a little
00:08:01.440 | more preprocessing of your data before you feed it to the model. Here's another example. There's a dataset
00:08:06.400 | from Google called Localized Narratives. The task the annotators are given here is:
00:08:12.560 | verbally describe this image. And as you're describing the image, hover your mouse over the part of the image
00:08:18.800 | that you're describing. So it's nice in that it encourages people to create really detailed
00:08:22.640 | descriptions that capture spatial positioning in the image. So for example, here it says the girl in the
00:08:28.800 | front is playing the guitar and whatnot. And spatial reasoning is something that vision language models
00:08:33.040 | typically tend to struggle with. I ended up having to build a fairly sophisticated data processing pipeline to
00:08:38.560 | get really good results with this. Not really important to dive into the details over here. But the important
00:08:43.760 | thing to note is, A, it gets really expensive. Each image ends up being 20 LLM calls. And the LLM here is
00:08:52.480 | Mixtral 8x7B. So it gets pretty expensive. But it was necessary. The training data is the biggest needle
00:09:01.680 | mover in terms of model performance. And because of this, I'd say we spent like maybe one or two orders of
00:09:06.880 | magnitude more compute on generating training data than actually training the model itself.
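To make the preprocessing point concrete, here is a hypothetical sketch of a single step in such a pipeline: rather than handing the LLM raw annotations and hoping for the best, the prompt restricts it to facts that are actually present in the ground-truth data. The call_llm hook, field names, and prompt wording are all illustrative; the real pipeline described above chains roughly 20 calls per image.

```python
# Hypothetical sketch of one synthetic-captioning step: combine ground-truth
# short captions and object annotations into a prompt that forbids the LLM
# from inventing details. `call_llm` stands in for whatever serves the
# caption model (e.g. Mixtral 8x7B) in your setup.

def build_caption_prompt(captions: list[str], objects: list[dict]) -> str:
    object_lines = "\n".join(f"- {o['label']} at box {o['bbox']}" for o in objects)
    caption_lines = "\n".join(f"- {c}" for c in captions)
    return (
        "Write a detailed image description using ONLY the facts below.\n"
        "Do not mention anything that is not explicitly listed. If an\n"
        "annotation is ambiguous, leave that detail out.\n\n"
        f"Short descriptions:\n{caption_lines}\n\n"
        f"Object annotations:\n{object_lines}"
    )

def synthesize_caption(example: dict, call_llm) -> str:
    prompt = build_caption_prompt(example["captions"], example["objects"])
    return call_llm(prompt)
```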
00:09:11.280 | So yeah, this particular dataset we've open-sourced. It's available on Hugging Face. Here's
00:09:18.960 | an example of the type of questions it generates for this image. There's an interesting question towards
00:09:26.400 | the end. What theory does the kid have about the existence of pleasure in the image? I'll talk about
00:09:30.240 | that in a sec. But basically, you want to generate a few distractor questions so the model knows to not always
00:09:35.440 | agree with the question that the user is asking. So yeah, a couple of the challenges involved in
00:09:41.920 | working with synthetic data. There was an interesting incident I had early on where a user was like,
00:09:48.880 | hey, I asked a relatively simple question. Why couldn't the model answer this? And when I looked at it,
00:09:53.920 | it turned out that they didn't capitalize the first letter in their question. And the model had never
00:09:57.920 | seen anything like that during training. So I was like, what do I do over here? And so it's really
00:10:04.000 | important for you to make sure that your training data has the same rough distribution as your real
00:10:09.120 | world queries. So I ended up adding an extra step where we artificially inject capitalization
00:10:13.600 | issues and typos and whatnot into the training data before training on it. There's also this risk of what we call
00:10:19.280 | model collapse, where your model has biases inherent to it. So for example, if you try to ask Mixtral to
00:10:25.440 | generate distractor questions, hey, just generate a question that's completely irrelevant to the image,
00:10:29.600 | it'll always generate something about dinosaurs and aliens. And so if you train your model on that,
00:10:33.280 | it'll instead learn to say, hey, if the question is about dinosaurs and aliens, always say no,
00:10:40.480 | which doesn't really help. And so you need to inject like some entropy into the process of
00:10:45.200 | generating synthetic data to avoid this. In the case of synthetic captioning, you can do something
00:10:49.520 | like, hey, describe this image, but also consider the alt text on the image, which may be noisy,
00:10:53.440 | may be irrelevant. But if it is relevant, use relevant facts from that. And that tends to help a lot.
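A minimal sketch of those two tricks, with every name, topic list, and probability chosen for illustration rather than taken from the actual pipeline:

```python
# Illustrative sketch: (1) perturb synthetic questions so the training
# distribution includes sloppy real-world queries, and (2) seed distractor
# question generation with random topics so the LLM doesn't collapse onto
# one theme. Probabilities and topics are made up for the example.
import random
import string

def perturb_query(q: str, p_lowercase: float = 0.2, p_typo: float = 0.1) -> str:
    if q and random.random() < p_lowercase:
        q = q[0].lower() + q[1:]                    # drop the leading capital
    if len(q) > 5 and random.random() < p_typo:
        i = random.randrange(1, len(q) - 1)
        q = q[:i] + random.choice(string.ascii_lowercase) + q[i + 1:]  # inject a typo
    return q

TOPICS = ["cooking", "astronomy", "tax law", "football", "gardening", "music theory"]

def distractor_prompt() -> str:
    topic = random.choice(TOPICS)   # inject entropy so the questions vary
    return (f"Write a question about {topic} that has nothing to do with the image. "
            f"The correct answer should point out that the image doesn't show this.")
```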
00:10:58.160 | All right. So popping back up,
00:11:08.000 | there's a couple of important learnings I had over the last three months that I would like to share
00:11:11.520 | with all of you. The first was the community was really critical in this whole journey. Seeing that
00:11:18.240 | original engagement that we got from the Moondream release helped me realize that, hey, maybe this is
00:11:23.440 | more valuable than that QA testing application that I was working on, because a lot of people have a need
00:11:28.240 | for this to build applications like that. Coming from an enterprise-ish company, it's been really
00:11:35.040 | valuable. It's been refreshing to be able to just talk to customers directly, like someone will Twitter
00:11:39.760 | DM me and be like, hey, I just saw you're looking for this. What do you think? But it's also helped us
00:11:45.280 | connect with a lot of partners, mentors, and get a lot of support from the community. Being open source
00:11:51.200 | was critical. I kind of didn't really have a choice over here because the competition was free. So
00:11:56.560 | what am I going to do? But when you're in the dev tool space, it is pretty important. Open source is
00:12:02.560 | important to a lot of developers. They would like to have the ability to run it in different
00:12:06.800 | environments. It's also pretty important for a lot of enterprise users. In a lot of cases, they don't
00:12:12.320 | really want to run the software themselves, but having the option is very important to them because
00:12:18.320 | they've had -- most enterprises have had situations where a vendor goes out of business or decides to
00:12:26.400 | screw them in some other capacity. It's also been really critical for engagement for us. We've had
00:12:31.200 | a lot of people in the community help out, port it to different platforms, run the model in the web
00:12:35.280 | browser and whatnot. So it's been very valuable for us. This one is a little controversial. I'm not sure
00:12:44.720 | everyone agrees with this, but I feel pretty strongly that safety guardrails should be implemented at the
00:12:50.000 | application layer, not baked into the model itself. This was one of my learnings from my first attempt to
00:12:57.040 | build a QA testing application with GPT-4V. It made no sense for that application to reject
00:13:03.840 | any picture that contained a human being. I understand why they felt it was important.
00:13:10.320 | DevTools are kind of B2B, not B2C. So it's important to make it easy for developers to
00:13:16.240 | decide what guardrails they want and implement them in their application as opposed to just deciding it for
00:13:21.280 | all users. I'm not saying this is not important at all. Kind of makes sense if you're trying to
00:13:25.040 | build an assistant to bake that stuff directly into the model. But when you're building for
00:13:29.440 | developers, it makes less sense. Yeah, I believe pretty strongly now that tiny models are going to run the world.
00:13:39.920 | In computer vision, more so perhaps than in text models, efficiency is really important.
00:13:46.480 | In a lot of cases, you're really worried about cost because you're processing video at
00:13:54.400 | 30 frames a second at seven-tenths of a cent per second adds up very quickly and
00:13:59.280 | doesn't give you a lot of room to work with. But there's also situations where you're
00:14:02.960 | really worried about privacy or latency and therefore you want to run the model really close to where
00:14:09.520 | decisions need to be made. Which is not to say big models are not useful. I think they're
00:14:14.400 | very useful. I just think that we'll mostly be running them in our development environments maybe
00:14:18.560 | for generating training data. But the artifact that you're going to want to deploy is most likely going
00:14:24.560 | to be a smaller model. Another thing that was a little surprising to me was looking at the different
00:14:37.520 | things people were doing with Moondream. There were a lot of people building net new applications that
00:14:41.920 | weren't possible to do before because the model can understand language as well as images. But there
00:14:47.600 | were also a lot of people doing traditional computer vision things with the model. It's like,
00:14:52.480 | is there a person in the scene? Or is there something suspicious going on? Tell me where the
00:14:57.760 | bus is in this picture from a road camera. All of which was possible to do before we had transformers,
00:15:07.760 | like just train a YOLOv3 model or whatnot.
00:15:11.360 | The lesson I took from this was that prompting is a much better developer experience than having to train
00:15:19.360 | a custom model. And so for a lot of developers that would be interested in incorporating vision into their
00:15:25.280 | applications, before they'd be like, you know what, it's not worth me spending two weeks learning how to
00:15:31.200 | like collect data and annotate it and train my own custom model. Giving them the option to say, hey,
00:15:37.520 | for fairly cheap, you can just in English describe what you want extracted from this image makes it
00:15:43.520 | something that they actually consider doing now. All right. I think I'm a little ahead of time,
00:15:51.040 | so I'm excited to maybe do a live demo if the demo gods smile upon me, but we'll see. In conclusion,
00:15:58.880 | yeah, where's Moondream going? We're not AGI people. I'm really focused on making it really
00:16:05.520 | easy for developers to build amazing applications with vision. There's a bunch of model improvements
00:16:11.440 | that I'm working on right now. I'll talk about some. Right now, we use 729 tokens to represent an
00:16:18.080 | image, so you can only really send one image to the model at a time. We're working on giving users the
00:16:23.200 | option to give a more compressed representation to the model, which makes sense if you're not trying to
00:16:28.000 | read text or something from the image and are just trying to do classification and whatnot. That makes
00:16:31.200 | the model run a lot faster, which is important, especially if you're on CPU as opposed to GPUs,
00:16:35.600 | CPUs can't do as much parallel compute, and so that sort of thing ends up
00:16:40.720 | being really important. We've also just raised a seed round from Felicis, Ascend, and also the GitHub
00:16:50.160 | one, which I forgot to include in the slide. Sorry, GitHub. This means more GPUs, but more importantly, it means I can
00:16:57.840 | finally get some sleep because we're able to get a couple more people to join the team. If you're
00:17:02.480 | interested, please reach out. We have a contact email on the website or just hit me up on Twitter.
00:17:07.120 | We also have an exciting release coming up later this summer that I'm super pumped for, so stay tuned.
00:17:12.000 | I think that's about it. So I have a couple of minutes left, I think, so I'm going to try doing something
00:17:19.280 | that may not be the wisest idea, but we'll see how it goes.
00:17:22.560 | All right. I'll turn the Wi-Fi off. This whole thing is running locally.
00:17:34.080 | So what this is going to do is, like, start taking my webcam feed in, and it's going to use Moondream in an
00:17:44.560 | infinite loop to describe what it sees, and we can ask it different questions. So we'll see how that goes.
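Conceptually, the demo loop is something like the sketch below, reusing the model and tokenizer objects from the earlier snippet; the real demo application is more involved (UI, threading, and so on).

```python
# Rough sketch of a local webcam description loop: grab a frame, ask
# Moondream about it, print the answer, repeat. Assumes the same
# encode_image / answer_question interface as the earlier snippet.
import cv2
from PIL import Image

cap = cv2.VideoCapture(0)
prompt = "Describe what you see. Answer briefly."

while True:
    ok, frame = cap.read()
    if not ok:
        break
    image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    enc = model.encode_image(image)
    print(model.answer_question(enc, prompt, tokenizer))
```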
00:18:06.000 | And, yeah, you can ask it different things. So let's say, is the person wearing glasses? You do have
00:18:15.280 | to tell the model to answer briefly if you want a yes or no; otherwise it gives you a longer
00:18:19.120 | answer instead of a single word. Let's try that.
00:18:27.680 | Yes. Okay. I'll take them off. I can't see. Did it get it?
00:18:35.840 | Let's do that. I'll go back to the old prompt.
00:18:50.720 | All right. Well, that was it for me. Thank you all.
00:19:07.680 | I'll see you next time.