Hi. My name is Vik. I work on an open-source vision model called Moondream. A little bit about myself before I dive into Moondream: I was at AWS for about nine years before I started working on this model. Looking at where the stock price is going, I'm not sure that was the right financial decision, but I'm very happy with the work I'm doing.
So let's dive into it. I'll talk about Moondream a little bit. It's a tiny vision language model, less than two billion parameters, so it can run anywhere, and it's open source under Apache 2.0, so you can use it for anything. Here are some examples of things you can do with Moondream.
You can ask it questions about images. You can caption images. It can detect specific objects inside images; here I asked it to tell me where the peak is, and it gives me coordinates. It can count stuff. It can do all sorts of things. I had the audacity to title my talk "How can a tiny vision model slap so hard?", so I have to back that up a little bit.
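If you want to try it yourself, usage looks roughly like this. This is a minimal sketch based on the Hugging Face integration; method names can change between releases, so check the model card, and the image path here is just a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

# trust_remote_code pulls in the custom Moondream model class from the Hub.
model = AutoModelForCausalLM.from_pretrained("vikhyatk/moondream2", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("vikhyatk/moondream2")

image = Image.open("harbor.jpg")        # placeholder path
enc_image = model.encode_image(image)   # encode once, then ask as many questions as you like
print(model.answer_question(enc_image, "How many boats are in the image?", tokenizer))
print(model.answer_question(enc_image, "Describe this image.", tokenizer))
```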
So here's me backing it up. These are two visual question-answering benchmarks: one is called VQAv2, the other is GQA. As you can see, Moondream has been steadily improving over the releases I've made over the last three months. I've included a reference line for LLaVA 1.5, which is a popular 7-billion-parameter vision model.
So this shows you that Moondream gives you performance comparable to models that are about four times its size. I didn't really set out to build a vision model; I kind of got roped into it. I was originally trying to build an application that required an AI agent, so I needed to be able to see what was going on on the user's screen and have it describe what's on the browser page for QA testing automation.
I tried to do this at first with GPT-4V, but there were too many safety refusals back then. If there was any human being present in the image, it would just refuse to process it. It was also going to be really slow and expensive, so I realized that if this is a product I'm trying to build, I really need to have control over the model itself.
So I figured, you know what, how hard can it be? Let me just go try and build this model myself. Now, the task I was trying to perform here was fairly constrained. I just needed to describe screens and answer questions about screens, so it doesn't need to be generally intelligent.
I had a couple of 3090s at home, so I figured I'd train a small version of the model at home and then rent some beefier machines in the cloud to go train a bigger version. Once I got done training a small version, I was like, hey, this actually works pretty well, so I posted it on Twitter.
I thought, you know what, I might get 20 likes off of this, and then I'll move on with my side project at the time. It blew up far beyond expectations. I was a little surprised, pleasantly surprised, but surprised nonetheless. And I immediately started seeing other automated testing companies reach out and be like, hey, can I use this to describe browser screens?
Because this would work really well for us. Other companies reached out too, shout out to our friends at Open Interpreter from Seattle. So I figured, you know what, this is getting a lot of traction; let me pause the whole automated testing app for a couple of weeks, focus on Moondream, and see where it goes.
Yeah, so let me dive into a couple of the technical details of what makes the model succeed despite being small. The first thing we did that I think really helped was deciding what problems the model should solve and what it should not solve. Moondream wants to be a developer tool.
We focus on being really accurate and not hallucinating. It doesn't really have a lot of knowledge about the world, so if you ask it to write a poem, it's probably not going to help you. It's really focused on answering questions and helping you understand images. This is really important because it affects the type of data you use and the sort of benchmarks you want to focus on.
There's a popular vision language model benchmark called MathVista, which measures how good models are at solving math problems: you take a picture of a differential equation and see whether the model can solve it. That's an example of a non-goal for us, because we just want the model to be good at looking at images.
The most we'd do is generate a LaTeX representation of the problem; we don't want to even attempt to solve calculus. The model was not pre-trained from scratch: we use a vision encoder called SigLIP from Google with a pre-trained text model called Phi-1.5 from Microsoft.
The notable thing here is that Phi-1.5 was also trained mostly on synthetic data, which is very similar to our pipeline, so the two work very well together. For this sort of task, pre-training from scratch doesn't really make a difference compared to starting from pre-trained models, and it's cost prohibitive. So unless you want the brownie points for saying you trained it from scratch, it's probably not worth doing.
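For intuition, the overall shape is the usual one for this class of model: patch embeddings from the vision encoder get projected into the language model's embedding space and prepended to the text tokens. Here's a rough sketch of that wiring; the module names and dimensions are illustrative, not the actual Moondream code.

```python
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    """Illustrative wiring of a pre-trained vision encoder to a pre-trained LM."""

    def __init__(self, vision_encoder, language_model, vision_dim=1152, text_dim=2048):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a SigLIP vision tower
        self.language_model = language_model   # e.g. a Phi-1.5-style decoder
        # Small MLP that projects image patch embeddings into the LM's token space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, pixel_values, input_ids):
        patch_embeds = self.vision_encoder(pixel_values)            # (B, N_patches, vision_dim)
        image_tokens = self.projector(patch_embeds)                 # (B, N_patches, text_dim)
        text_tokens = self.language_model.embed_tokens(input_ids)   # however the LM exposes its embedding layer
        # Prepend the image tokens to the text sequence and decode as usual.
        inputs_embeds = torch.cat([image_tokens, text_tokens], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)
```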
We experimented with a bunch of other models as they were released, and nothing really made too much of a difference. What does make a difference, though, is training data. The latest release of Moondream is trained on around 35 million images, and the problem, especially when you're on a budget, is that high-quality multimodal training data is really hard to come by.
There are a lot of companies out there that will annotate data with humans, but it's really expensive, and I've heard a rumor recently that they won't even talk to you anymore unless you're willing to sign an upfront seven-figure commitment. There's also a lot of data on the internet: image and alt-text pairs.
The problem is that it's often not in the format you want, and it's really noisy, and that noise is really problematic when you're training small models. Synthetic data is a way to solve this, where you take that alt-text information and process it.
It's a bit of an open secret that a lot of people are training on outputs from GPT-4. You probably don't want to do that. Besides being questionable under the terms of use, it's often not helpful. GPT-4 is a very powerful model; it has reasoning capabilities and knowledge that your small model is never going to be able to match.
So when you train on GPT-4 outputs, what your model learns instead is to hallucinate: it generates plausible-sounding outputs that include details it cannot possibly have memorized, and you end up in trouble. This is important, so I'm going to go into a bit more technical detail for a couple of minutes on how to do synthetic data.
So bear with me for a sec; we'll pop back up. Here's an example of how not to do it. COCO is a dataset with around 200k images. Each image has five short descriptions and a bunch of object annotations, like, hey, there's a bicycle at these coordinates and so on.
Let's say you want to take those short descriptions and object annotations and generate more detailed captions that include the union of all the information present. If you just naively call GPT-4 with this information, it generates this. It's not important to read all of it, but there are two important things to note.
The first is that it hallucinates. In the second paragraph it says there's a person near the right side of the harbor. I think there's, like, a person way back; there are maybe five pixels there that may be a post or may be a person, we don't really know. That's because the object annotations were bad.
But besides that, the model is also taking a lot of creative liberties here, like saying there are five yachts standing out from the rest and whatnot. So you need to do a little more preprocessing of your data before you feed it to the model.
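To give a flavor of the kind of preprocessing that helps, here's a sketch of the idea: drop tiny, ambiguous boxes, turn the remaining boxes into coarse position phrases, and tell the model explicitly to stick to the annotations. This is illustrative, not the pipeline we actually used, and the annotation schema here is simplified.

```python
def box_to_position(box, img_w, img_h):
    """Convert a COCO-style [x, y, w, h] box into a coarse position phrase."""
    x, y, w, h = box
    cx, cy = (x + w / 2) / img_w, (y + h / 2) / img_h
    horiz = "left" if cx < 0.33 else "right" if cx > 0.66 else "center"
    vert = "top" if cy < 0.33 else "bottom" if cy > 0.66 else "middle"
    return f"{vert}-{horiz}"

def build_caption_prompt(captions, annotations, img_w, img_h, min_area_frac=0.005):
    # annotations: dicts with a category name and a COCO-style "bbox" (illustrative schema).
    # Drop tiny boxes: a five-pixel blob labeled "person" is exactly what causes hallucinations.
    kept = [a for a in annotations
            if (a["bbox"][2] * a["bbox"][3]) / (img_w * img_h) >= min_area_frac]
    object_lines = [
        f"- {a['category']} at the {box_to_position(a['bbox'], img_w, img_h)} of the image"
        for a in kept
    ]
    return (
        "Combine the following short descriptions and object annotations into one "
        "detailed caption. Only state facts supported by them; do not speculate or "
        "editorialize.\n\n"
        "Descriptions:\n" + "\n".join(f"- {c}" for c in captions) + "\n\n"
        "Objects:\n" + "\n".join(object_lines)
    )
```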
Here's another example. There's a dataset from Google called Localized Narratives. The task annotators are given is: verbally describe this image, and as you're describing it, hover your mouse over the part of the image you're describing. It's nice in that it encourages people to create really detailed descriptions that capture spatial positioning in the image.
So for example, here it says the girl in the front is playing the guitar, and so on. Spatial reasoning is something that vision language models typically struggle with. I ended up having to build a fairly sophisticated data processing pipeline to get really good results with this; it's not important to dive into all the details here.
But the important thing to note is that it gets really expensive: each image ends up being about 20 LLM calls, and the LLM here is Mixtral 8x7B. It was necessary, though. Training data is the biggest needle mover in terms of model performance, and because of this, I'd say we spent maybe one or two orders of magnitude more compute on generating training data than on actually training the model itself.
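I won't reproduce the real pipeline, but the shape of it is several passes of an open LLM over each image's annotations, something like the hypothetical stages below. Every function name and prompt here is made up for illustration.

```python
def generate_qa_for_image(narrative, mouse_trace, llm):
    """Hypothetical multi-pass synthetic QA generation for one image.

    `llm` is any callable that takes a prompt string and returns text
    (for us, an open model like Mixtral 8x7B served locally).
    """
    # Pass 1: clean up the raw spoken narrative (filler words, transcription noise).
    cleaned = llm(f"Rewrite this spoken image description as clean prose:\n{narrative}")

    # Pass 2: fold in spatial information recovered from the mouse trace.
    grounded = llm(
        "Merge the description with these region hints, keeping the spatial wording "
        f"accurate:\nDescription: {cleaned}\nRegions: {mouse_trace}"
    )

    # Passes 3..N: generate question/answer pairs, verify each answer against the
    # description, and add a few distractor questions the model should push back on.
    qa_pairs = llm(f"Write question/answer pairs grounded only in:\n{grounded}")
    verified = llm(f"Remove any pair not supported by this description:\n{qa_pairs}\n{grounded}")
    distractors = llm(f"Write two questions about things NOT present in:\n{grounded}")
    return verified, distractors
```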
So yeah, this particular dataset we've open sourced; it's available on Hugging Face. Here's an example of the type of questions it generates for this image. There's an interesting question towards the end: what theory does the kid have about the existence of pleasure in the image?
I'll talk about that in a sec, but basically, you want to generate a few distractor questions so the model learns not to always agree with the question the user is asking. So yeah, a couple of the challenges involved in working with synthetic data. There was an interesting incident early on where a user was like, hey, I asked a relatively simple question.
Why couldn't the model answer this? When I looked at it, it turned out they hadn't capitalized the first letter of their question, and the model had never seen anything like that during training. So I was like, what do I do here? It's really important to make sure your training data has roughly the same distribution as your real-world queries.
So I ended up adding an extra step where we artificially inject capitalization issues, typos, and so on into the training data before training on it. There's also the risk of what we call model collapse, where the model generating your data has biases inherent to it. For example, if you ask Mixtral to generate distractor questions, hey, just generate a question that's completely irrelevant to the image, it'll always generate something about dinosaurs and aliens.
And if you train your model on that, it'll instead learn, hey, if the question is about dinosaurs or aliens, always say no, which doesn't really help. So you need to inject some entropy into the process of generating synthetic data to avoid this. In the case of synthetic captioning, you can do something like: hey, describe this image, but also consider the alt text on the image, which may be noisy or irrelevant; if it is relevant, use relevant facts from it. That tends to help a lot.
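Both of those fixes are cheap to implement. Here's an illustrative sketch of the two ideas: noising synthetic questions so they match real-world queries, and conditioning caption generation on possibly irrelevant alt text to add entropy. None of this is the production code.

```python
import random

def noisify(question, p_lower=0.15, p_typo=0.1, p_drop_punct=0.2):
    """Make synthetic questions look like real user queries: lowercase starts, typos, missing '?'."""
    if question and random.random() < p_lower:
        question = question[0].lower() + question[1:]
    if random.random() < p_drop_punct and question.endswith("?"):
        question = question[:-1]
    if random.random() < p_typo and len(question) > 4:
        i = random.randrange(1, len(question) - 1)
        # Swap two adjacent characters to simulate a typo.
        question = question[:i] + question[i + 1] + question[i] + question[i + 2:]
    return question

def caption_prompt(alt_text=None):
    """Conditioning on scraped alt text adds entropy so every caption isn't shaped the same way."""
    prompt = "Describe this image in detail."
    if alt_text:
        prompt += (
            " The image was published with this alt text, which may be noisy or "
            f"irrelevant; use facts from it only if they match the image: \"{alt_text}\""
        )
    return prompt
```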
All right. Popping back up, there are a couple of important learnings from the last three months that I'd like to share with all of you. The first is that the community was really critical in this whole journey.
Seeing the engagement that the original Moondream release got helped me realize that, hey, maybe this is more valuable than the QA testing application I was working on, because a lot of people need this to build applications like that. Coming from an enterprise-ish company, it's been really refreshing to be able to just talk to customers directly, like someone DMs me on Twitter and I'm like, hey, I just saw you're looking for this, what do you think? It's also helped us connect with a lot of partners and mentors, and get a lot of support from the community.
Being open source was critical. I kind of didn't have a choice here because the competition was free, so what was I going to do? But when you're in the dev tool space, it's pretty important. Open source matters to a lot of developers; they want the ability to run the model in different environments.
It's also pretty important for a lot of enterprise users. In a lot of cases, they don't really want to run the software themselves, but having the option is very important to them, because most enterprises have had situations where a vendor goes out of business or decides to screw them over in some other capacity.
It's also been really critical for engagement for us. We've had a lot of people in the community help out, port it to different platforms, run the model in the web browser and whatnot. So it's been very valuable for us. This one is a little controversial. I'm not sure everyone agrees with this, but I feel pretty strongly that safety guardrails should be implemented at the application layer, not baked into the model itself.
This was one of my learnings from my first attempt to build a QA testing application with GPT-4V: it made no sense for that application to reject every picture that contained a human being. I understand why they felt it was important. But dev tools are B2B, not B2C.
So it's important to make it easy for developers to decide what guardrails they want and implement them themselves, as opposed to having that decided for all users. I'm not saying this stuff isn't important at all; it kind of makes sense, if you're building a consumer assistant, to bake it right into the model.
But when you're building for developers, it makes less sense. Yeah, I believe pretty strongly now that tiny models are going to run the world. In computer vision, perhaps more so than with text models, efficiency is really important. In a lot of cases you're worried about cost, because processing video at 30 frames a second, at seven-tenths of a cent per second, adds up very quickly and doesn't give you a lot of room to work with.
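To make that concrete, here's the back-of-the-envelope math, reading that figure as $0.007 per second of processed video:

```python
# Rough cost of running continuous video through a hosted model at $0.007 per second.
price_per_second = 0.007              # seven-tenths of a cent
per_hour = price_per_second * 3600    # $25.20 per camera-hour
per_day = per_hour * 24               # $604.80 per camera-day
print(f"${per_hour:.2f}/hour, ${per_day:.2f}/day per video stream")
```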
But there's also situations where you're really worried about privacy or latency and therefore you want to run the model really close to where decisions need to be made. Which is not to say big models are not useful. I think they're very useful. I just think that we'll mostly be running them in our development environments maybe for generating training data.
But the artifact you're going to want to deploy is most likely going to be a smaller model. Another thing that was a little surprising to me was seeing the different things people were doing with Moondream. There were a lot of people building net-new applications that weren't possible before, because the model can understand language as well as images.
But there were also a lot of people doing traditional computer vision things with the model: is there a person in the scene? Is there something suspicious going on? Tell me where the bus is in this picture from a road camera. All of that was possible before we had transformers; you could just train a YOLOv3 model or whatnot.
The lesson I took from this was that prompting is a much better developer experience than having to train a custom model. A lot of developers who would be interested in incorporating vision into their applications used to say, you know what, it's not worth me spending two weeks learning how to collect data, annotate it, and train my own custom model.
Giving them the option to just describe in English, fairly cheaply, what they want extracted from an image makes it something they actually consider doing now. All right, I think I'm a little ahead of time, so I'm excited to maybe do a live demo, if the demo gods smile upon me, but we'll see.
In conclusion: where's Moondream going? We're not AGI people. I'm really focused on making it really easy for developers to build amazing applications with vision. There are a bunch of model improvements I'm working on right now; I'll talk about some. Right now we use 729 tokens to represent an image, so you can only really send one image to the model at a time.
We're working on giving users the option to pass the model a more compressed representation, which makes sense if you're not trying to read text out of the image, if you're just doing classification and whatnot. That makes the model run a lot faster, which is especially important on CPUs as opposed to GPUs, since CPUs can't do as much parallel compute, so that sort of thing ends up mattering a lot.
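Those 729 tokens presumably correspond to a 27-by-27 grid of patch embeddings, so one simple way to get a compressed representation is to pool neighboring patch tokens before the language model sees them, trading fine detail like small text for speed. Here's a minimal sketch of that idea, not the actual implementation.

```python
import torch
import torch.nn.functional as F

def compress_image_tokens(image_tokens, factor=3):
    """Pool a (B, 729, D) grid of image tokens down to (B, (27/factor)^2, D).

    factor=3 turns 729 tokens into 81, which is far cheaper for the LM to attend
    over, at the cost of fine-grained detail like small text.
    """
    b, n, d = image_tokens.shape
    side = int(n ** 0.5)                               # 27 for 729 tokens
    grid = image_tokens.transpose(1, 2).reshape(b, d, side, side)
    pooled = F.avg_pool2d(grid, kernel_size=factor)    # (B, D, side/factor, side/factor)
    return pooled.flatten(2).transpose(1, 2)           # (B, (side/factor)^2, D)
```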
We've also just raised a seed round from Felicis, Ascend, and also the GitHub one, which I forgot to include in the slide. Sorry, GitHub. This means more GPUs, but more importantly, it means I can finally get some sleep because we're able to get a couple more people to join the team.
If you're interested, please reach out. We have a contact email on the website or just hit me up on Twitter. We also have an exciting release coming up later this summer that I'm super pumped for, so stay tuned. I think that's about it. So I have a couple of minutes left, I think, so I'm going to try doing something that may not be the wisest idea, but we'll see how it goes.
All right. I'll turn the Wi-Fi off; this whole thing is running locally. What this is going to do is start taking in my webcam feed, and it's going to use Moondream in an infinite loop to describe what it sees, and we can ask it different questions. So we'll see how that goes.
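In spirit, the demo is just a loop like this, assuming the same Hugging Face API as in the earlier snippet and OpenCV for the webcam; it's a sketch, not the exact demo code.

```python
import cv2
from PIL import Image

# `model` and `tokenizer` are loaded as in the earlier snippet; everything runs locally.
prompt = "Describe this image briefly."
cap = cv2.VideoCapture(0)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # OpenCV yields BGR numpy arrays; convert to an RGB PIL image for the model.
    image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    enc = model.encode_image(image)
    print(model.answer_question(enc, prompt, tokenizer))
```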
And, yeah, you can ask it different things. So let's say: is the person wearing glasses? You do have to tell the model to answer briefly if you want a yes or no; otherwise it gives you a longer answer instead of a single word. Let's try that. Yes. Okay. I'll take them off.
I can't see. Did it get it? Let's do that. I'll go back to the old prompt. All right. Well, that was it for me. Thank you all. I'll see you next time.