
[Paper Club] Molmo + Pixmo + Whisper 3 Turbo - with Vibhu Sapra, Nathan Lambert, Amgadoz


Transcript

- I'll go through here, let me set up real quick. If anything's in the chat, let me know. Okay, so Molmo and Pixmo. This is from AI21. Basically really, really good people. If people haven't read research papers before, this is probably one to start with. - AI2 and AI21 are different things.

- AI2, my bad. - It's hilarious, so many people mess it up. We also get A12. - A12, oh, are they another one or? - No, it's if you misread the I, it's a one. - Okay, so AI2. - Makes it even more confusing with the 21. - Is there an AI21?

Okay, I've been shilling the wrong one, but good correction, good to know. So AI2 though. So basically this is a proper open source, open weights model, and it's a VLM. So if anyone hasn't read papers before, this is one that I highly recommend reading. Clear problem, clear solution, clear explanation of what they do, the data set and everything.

Very, very easy read, no crazy math. So yeah, highly, highly recommend. But TLDR, it's a set of different models at different sizes that are good vision language models and that solve a different problem. So the high level of this is most VLMs are distillations of proprietary closed-source models, right?

So if you need to generate synthetic data, like most open-weight models rely heavily on synthetic data from private models. So think GPT-4V, GPT-4o, Gemini, whatnot. They label a lot of data and then they create a vision language model that's open weight, something like LLaVA. But all they're doing is kind of distilling that proprietary knowledge into an open model.

But this is kind of the key highlight I take away: the community is still missing foundational knowledge about how to build these models from scratch. So they kind of go into everything from dataset construction to training, how they label data and how they do this whole mix. And some of the other papers we covered in paper club are stuff like CLIP, OpenCLIP, which are captioners, the whole history of these and how captioning is pretty important for this.

But basically, yeah, they do that. They explain the dataset mixture, how to do this fine-tuning. They're gonna release it all. There's a whole class of different models that are gonna come out. And the really interesting thing is they're good. So LLaVA is kind of starting to lag behind, but this thing is outperforming.

So there's two ways to benchmark this. You've got standard vision benchmarks, and then you've got like actual usage. And their stuff is on par with GPT-4V. They're starting to surpass it with some of the larger models and they're very small. So yeah, diving into that, we talk about the data set, how they do the training data, the architecture of it, which is kind of what you would expect if you joined our other paper clubs on this, but let's go a little bit more into that.

So one question is how can we generate this data from non-proprietary models? So we can't use a closed-source model to do this captioning. How do we do it? Some of the background that we learned from other papers like CLIP and whatnot is that it's hard to generate really good, diverse, robust descriptions from images.

And the solution that they found was to have audio labels. So they kind of guided people towards: here's a set of about a million images, here's some prompting questions, here's a time limit, so people answer them concisely. From that, they have a dataset. From that, they train it into a model that's based on their open source OLMo models.

So that's the big thing. They can do this end to end without anything proprietary. And yeah, let's go through it a little bit more. So instead of being a distillation, it's independently pre-trained. There's a vision encoder and a language model that are jointly trained. There's some interesting stuff they call out, like traditionally in stuff like LLaVA and different adapters, what they'll do is they'll train this vision encoder.

They'll have an LLM, they'll freeze weights, and then they'll kind of just merge them. They avoid doing that multi-stage pre-training. They don't freeze parts of it. They just kind of do it all as one training run. They show how important high quality data is. So they do this, I think, on roughly a million samples of data.

There's some augmentation. Like they pass these labeled datasets through LLMs and augment them a little bit, but point being they can get better than Gemini 1.5, better than Claude 3.5 Sonnet, better than GPT-4V at a much smaller size with about a million samples of data, which is very impressive, right?

So good, high quality data, and then how to do it from scratch. Now, the other little interesting thing here was it's not as accessible as it seems. Like they still have to pay for a million labeled samples, but point being still very useful. And then we can distill it down to other models.

So yeah, this is one of the main challenges, right? It's hard to collect captioning data. So for example, like if you're given an image here and you're told to describe it, like I would probably just say this is Seattle. I could add in things like, you know, I see the Space Needle.

I see Mount Rainier. I see trees. But if you just blindly give images to either LLMs or whatever, multimodal models or people, we don't give good descriptions, right? We don't give like robust, you know, a city landscape with a lot of skyscrapers, trees in the foreground, water in the background, construction on the side.

We don't do that. But they kind of broke down how they get this type of data and the different segments. And then I think soon the dataset is coming out, but that's basically the challenge, right? So they don't wanna get this from larger proprietary models, which can do it; like you can write a really good robust prompt and get good descriptions, and then you can create a LLaVA-style model, but you're no longer doing this from scratch.

You're doing it from a distillation of proprietary work. So they wanted to do it with annotators. Here's kind of the solution. We ask annotators to describe the images in speech for 60 to 90 seconds, rather than asking them to write descriptions. They prompted them to describe everything in great detail, including descriptions of spatial positioning and relationships.

So stuff like, you know, there's the Space Needle in the middle of a bunch of buildings. With that modality-switching trick, annotators provide far more detailed descriptions in less time. So part of this is, you know, you keep it 60 to 90 seconds. They gave them guiding questions, but that was basically the takeaway.

Like we're not good at writing descriptions, but we're good at talking about and describing what we want. And then from there, they have like a little bit more into what this dataset is. So they've got positional data in some of the prompts, they've got high level, they've got different types of data.

They've got like, part of the dataset is about charts, tables, descriptions of that, so the model can pick it up. And then there's some use cases where this stuff just blows GPT-4V out of the water. So there's a good blog post on the website that kind of shows how you could do this real time.

It's a very Apple Intelligence or Google Lens type example, where I think the team put it on the Apple Vision Pro and they just look at stuff and they're like, "Hey, where is this in the image?" And it can actually annotate the image because that's part of the training set.

So one more use case of high quality, well-labeled data. Benchmarks-wise, they did the traditional 11 academic benchmarks. It does well. The size of the models they put out: there's Molmo-1B, which is based on their open source language model, an MoE with 1 billion active parameters. That performs on par with GPT-4V on both user preference and academic benchmarks.

Then there's the 7Bs, which are based on their 7B language model and Qwen 7B. Those outperform GPT-4V and sit close to 4o. Then there's a large one, which is Molmo-72B. The 72B is based on Qwen 72B, and it outperforms everything on academic benchmarks. And then on human preference, it's behind 4o. So preference-wise, if you guys remember how we do vision benchmarks, there's stuff like VQA, where you QA what's in an image.

It's similar to TriviaQA, where it's just understanding. There's 11 benchmarks, and they're not realistic of how people use it. So they also have this user preference stuff. User preference is basically like an ELO score of models. But TLDR, these are very good small models. This last sentence here is pretty interesting, right?

"Our best models outperform state-of-the-art "proprietary systems, including Gemini 1.5 Pro "and Flash and Cloud 3.5 Sonnet." That's pretty huge. Architecture, if you guys know about other multimodal models, it's an interesting adapter with a little twist of traditionally, like we said, there's this mix of, you have a pre-trained off-the-shelf vision encoder and a language model joined in one space.

Typically, we freeze the language model, and we do contrastive, there's a term for this, but you basically merge these embedding spaces. I'm blanking on the term. In this case, they're like, "No, we don't need to do that." So they go over the four steps of this architecture. Pretty straightforward.

So the first step is a preprocessor. It converts the image into multi-scale, multi-crop images. This allows for diversity in inputs. So stage one of the architecture is a preprocessor. Second, there's a ViT image encoder that maps these images into vision tokens in an embedding space. Three, there's the connector that maps and merges these two embedding spaces.

So the vision encoder, which I believe is based on CLIP, and they mention, we know CLIP is trained on proprietary data, but there's a paper called MetaCLIP from Meta, which shows you can reproduce CLIP in the open. They just use regular CLIP, which is, I think it's fine. So stage one, preprocess images.

Stage two, vision encoder. Stage three, the connector, which merges the embedding spaces of the vision encoder and the language model, and that's not frozen. And then stage four is a decoder-only transformer LLM to produce text. That's the architecture. If people have questions on what's going on here, we can probably dig into it, but for people that joined our other multimodal stuff, you should kind of understand how this works.
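For intuition, here's a minimal runnable sketch of that four-stage flow in PyTorch. All the module sizes and helper names (multi_crop, ToyVLM, etc.) are invented for illustration; this is not the actual Molmo code, just the shape of the data flow described above: preprocess, encode, connect, decode.

```python
import torch
import torch.nn as nn

# Toy sketch of the four-stage flow described above. Every size and helper name
# here is invented for illustration; this is NOT the actual Molmo code.

def multi_crop(image: torch.Tensor, crop: int = 112) -> torch.Tensor:
    """Stage 1 (preprocessor): turn one image into several crops (here, 4 corner crops)."""
    return torch.stack([
        image[:, :crop, :crop], image[:, :crop, -crop:],
        image[:, -crop:, :crop], image[:, -crop:, -crop:],
    ])

class ToyVLM(nn.Module):
    def __init__(self, vit_dim=256, llm_dim=512, vocab=1000):
        super().__init__()
        # Stage 2: stand-in for the ViT/CLIP image encoder that produces vision tokens.
        self.vit = nn.Sequential(nn.Flatten(start_dim=1), nn.LazyLinear(vit_dim))
        # Stage 3: connector projecting vision tokens into the LLM embedding space.
        self.connector = nn.Linear(vit_dim, llm_dim)
        # Stage 4: stand-in for the decoder-only LLM (causal masking omitted for brevity).
        self.embed = nn.Embedding(vocab, llm_dim)
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, image, text_ids):
        vision_tokens = self.vit(multi_crop(image))      # (num_crops, vit_dim)
        vision_embeds = self.connector(vision_tokens)    # (num_crops, llm_dim)
        text_embeds = self.embed(text_ids)               # (seq_len, llm_dim)
        seq = torch.cat([vision_embeds, text_embeds]).unsqueeze(0)
        return self.lm_head(self.blocks(seq))            # next-token logits

logits = ToyVLM()(torch.rand(3, 224, 224), torch.randint(0, 1000, (16,)))
```

The point of the sketch is just that the vision tokens and text tokens end up in one sequence and the whole stack is trained together, with nothing frozen.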

There's a good blog post we went over as well, I think from Chip Huyen, or another one that we can also share as a reference. - Yeah, the Flamingo post. - There's what? - I think that was her Flamingo post. - Yeah, her Flamingo one is a really good one on how multimodal stuff works.

So highly recommend, but also we can dig into it here. And then this is kind of that part of the vision encoder. They use OpenAI's ViT-Large CLIP model, and we know that CLIP is trained on closed source data, and they understand this, but recently MetaCLIP came out from Meta, and they show that you can reproduce it from scratch.

They use OpenAI because it's trained for higher resolution. Meta's was more of a proof of concept. Proof of concept meaning, you know, here's how you can do it. We do it efficiently at low resolution. For the actual production, they just use CLIP, and I'm fine with that. If people disagree, that's a fair topic too.

And then the LLM they use is their recently released OLMo models, which are very similar to this. They're open language models. They've got the range, and then for the big one, it's the open-weight Qwen. Now the dataset and training was kind of- - Do you wanna pause for any commentary, Nathan?

- Yeah. - Otherwise, yeah. - I was just gonna kind of let him cook and then go back to the beginning. - Oh, okay, all right. I'm gonna interleave commentary. - I thought about interrupting. I had unmuted. I just think that it's, I mean, I have this whole take that I've written is that the whole vision space is just so underdeveloped that a lot of these things that seem surprising, I actually think probably aren't surprising.

And the things that this model is good at are things that all the foundation companies, like they're just gonna take our data and train on it. And it's fine tuning, so it's like a million data points is actually kind of a lot, and all this stuff. And it's just like a lot of, I think that it's just like good data works, and they figured out some cool new applications, like clock reading and pointing, and then the data works.

There is actually a niche reference of somebody that did this kind of voice annotation earlier. I can find it. 'Cause someone had mentioned it after we released it, saying that we were the first people to do it. So then somebody's like, oh, but you have to cite this old paper, which is kind of fun.

So I can go find that. - Okay. I love the take of me being like, I love it. It's so straightforward and it works, and people should have done it. And then Nathan here like, yeah, they'll do it too. And it's cool to see the actual use cases. The website has a pretty good blog post on this, and there's a good, very professionally shot video on how this excels.

And it's always nice to see that niche of like, there's so much underdeveloped alpha left in the vision space that you train on the right dataset and it gets good. But it's still cool to see how it's significantly better than LLaVA. It's all open. So I like it. Any other questions from chat and whatnot?

I haven't been following. Is there anything we should pause and break on or? - Something that's worth saying is that I think the other models are much better as text models. Even like Llama, where their vision fine-tuning is generally accepted as being kind of meh for whatever drama reasons you want to attribute to it.

Like their text scores are still much better. So they have more of a text focus, which OpenAI probably does as well. - Is there a drop in performance on these? Like Qwen 72B versus Molmo 72B, is the text performance also taking a hit? - I honestly don't even know, but probably to some extent.

Like I find it so unlikely it's gonna get better. Or even like some of these other models are probably using like instruction models as their base before doing the vision stuff. But this is just like straight base model, no real instruction tuning. There's literally like no chat template for multi-turn, it just concatenates the messages together and like, there's like, good luck.

- So for other people that don't follow, there's base models and then there's instruction fine-tuning, right? So base models predict the next token, instruction models are chat models. So like we have Llama base, which is not a chat model. You can't chat with it. It'll complete your words.

Basically, these are just those adapted to vision, which means that they're not good chat models, but they do perform well and they do what they're supposed to do. But TLDR, yeah, you know, they've done so much open source work. So maybe someone should recreate. I don't know if you guys put the data set out yet.

I hear it's coming soon. - It's just like needing to clean it up. So there was the rush of getting the model out. And I mean, - Makes sense. - Our leadership probably knew Llama was coming on a specific day. If you look at the timing, like there's some of that needed to happen.

I mean, the datasets are just being cleaned. - Okay. - I mean, if you do anything with images, dataset releases are a little bit messier. You have to check for stuff; like, we're not releasing the images, you release the links type of thing, and do some checks.

But yeah, it's coming. I think there's like hundreds of people that filled out the silly Google form asking for interest, which I'm not surprised. It's just like. - Yeah. Well, that's a plug for, you know, they started 90% of the open source stuff. Now take a chat instruction tune model and do the same thing.

If anyone does it. - We did a little bit of experimenting on it, but like in a rushed way and we didn't find it was better. So they didn't add it in. - Hmm. Interesting. - But like this is very visually oriented. The demo, you need an image; you can't send a text-only query to it.

Partially for saving money, but partially because like that's how the model is trained. - Good to know. Good to know. I think the interesting thing there is like, as you mentioned, yeah, Gemini and OpenAI and stuff, they'll basically train on this data set and it'll kind of improve what they do too.

But it also kind of shows a flaw in benchmarks 'cause our vision benchmarks don't account for like text output as well, which is pretty flawed because people do both. Yeah. This is the fun part since it's primarily vision-based. How did they do the vision? How did they generate the data?

So they've got a split of different datasets. Step one is obviously caption generation from the images. So they source web images from diverse sets of data, 70 high-level topics. So high-level topics: street signs, food, memes, drawings, websites, blurry photos and whatnot. They asked annotators to describe images by speaking in detail for 60 to 90 seconds using a single annotator per image.

This was more effective, but yeah, questions were like guided questions. So what's the image at first glance? What objects are there and their counts? What does the text say? What's the background? What's the style and color? I think this is pretty useful 'cause I'm pretty bad at describing images and 60 to 90 seconds is a good amount.

Otherwise people will be either too verbose or too concise. Then after this there's basic pre-processing: so transcribe it using off-the-shelf speech-to-text. Do you know what that was? Was that like Whisper or something, or it just probably doesn't matter, I'm guessing. - Yeah, it was like Whisper and I don't remember the details.

And I had asked and it's like, I don't think they actually logged the audio. I was like, oh, this would be super cool for like other types of multimodal, but I don't think it was in the terms of the data location to also have the audio with the images 'cause it'd be a super cool, like, I don't know.

Like you can see how all these voice features come about. - That's what I was about to say. I was gonna say like, it would have been a pretty cool audio dataset too, but it looks like the dataset that'll be distributed is just gonna be the captioning and image pairs.

So not audio, which would have been fun, but yeah, there's a very straightforward pre-processing. So use Whisper or whatever off-the-shelf speech-to-text they wanna use, then process it through an LLM to improve text quality, so remove artifacts, normalize style, then create a fourth image description by using an LLM to summarize the three original transcripts into a single description.
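As a rough sketch of that pipeline, assuming stand-in `transcribe` and `llm` callables (an off-the-shelf speech-to-text model and a text-only LLM, neither being the paper's actual tooling, and with illustrative prompts):

```python
# Rough sketch of the PixMo-Cap style captioning pipeline described above.
# `transcribe` and `llm` are hypothetical stand-ins for an off-the-shelf
# speech-to-text model and a text-only LLM; the prompts are illustrative only.

def build_captions(audio_clips: list[bytes], transcribe, llm) -> list[str]:
    captions = []
    for audio in audio_clips:              # one 60-90 s spoken description per annotator
        raw = transcribe(audio)            # off-the-shelf ASR
        cleaned = llm(
            "Clean up this spoken image description: remove filler words and ASR "
            "artifacts, normalize the style, keep every visual detail.\n\n" + raw)
        captions.append(cleaned)
    # Extra description: an LLM summary of the cleaned transcripts (natural augmentation).
    merged = llm("Merge these image descriptions into one detailed caption:\n\n"
                 + "\n---\n".join(captions))
    return captions + [merged]
```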

So, you know, basic augmentation. From that, we use these four descriptions to do natural data augmentation. So trained on 712,000 distinct images with 1.3 million captions, including the augmentation. So not a lot, it's basically just fine-tuning after that. So- - I know Eugene Xia has a hand raised.

You have a question, Eugene Xia? - Pop in. - Yeah, it wasn't urgent for me. Because there was a lot of commentary regarding text performance and I guess the presumption that Molmo is inferior in text performance. That's what I kind of understood. I'm actually wondering how much of it is, potentially for especially the closed models, just purely synthetic data, because I can imagine for text in particular, we can generate a lot of synthetic images, essentially signboards, crumpled paper, all these kinds of scenarios, and just put various permutations of text.

And we ask questions about the text through the model. And that essentially forms the training set. - Thoughts? I think a lot of it's also just, yeah, it's a base model. I think that there's a decent bit you could do. I think also at the scale that they're at, like with a 1B and 7B, they could be deployed in a system such that it's primarily image processing, like zero-shot, few-shot, convert this to JSON, answer stuff about this.

I don't see this really being like a chat-style model, because yeah, there's Llama if you want that, right? If you want images and chat with it. But the applications of these edge, small image processing models, I see this more as like OCR output in a pipeline, but that's my two cents.

I guess it's interesting to see that if you did it with a chat instruct tune model, there wasn't any crazy performance difference. But yeah, the other thing here is just that there's not much benchmarking on the language models. So I guess there is MMLU and they're not the best, actually quite roughly, quite bad.

- That's MMMU. - MMMU, my bad. - That's a multimodal version. - So there's no pure text benchmarks, right? Should we benchmark? Someone benchmark it. We've got 40 people on the call that are here for the paper. So someone report back to us. Yeah, any other interesting chat stuff?

Should we crunch the paper quick? - Me? Mama? - Is someone talking? Okay, they're muted. Let's go to that dataset then. So TLDR, people yap, here's guidance questions. It's transcribed. They got about a million captions for 700,000 images, which I'm like, okay, that's kind of expensive. I can't run this experiment myself.

I'm glad they'll put it out. Basically PixMo is the dataset and then there are subsets of it. So PixMo-AskModelAnything: it's basically a collection of answers to a diverse set of questions that a user might ask. So for this image, what are questions people might ask? Here's how they do it.

They have this whole OCR stage. I'm gonna go through these pretty quick, but then here's kind of the subset. So 162,000 question-answer pairs over images; PixMo-Points is the next one. - Can I ask a quick question on the PixMo-AskModelAnything one? So I had a really hard time understanding this.

What we do is we use the stage one model to generate an advanced caption. So I guess what this means is that given an image, get a caption, and then we pass that caption and the OCR output for the image. Does this mean we just apply OCR on the image?

How would that work? If the image is an image of scenery or something. I mean, if someone can just share the intuition behind this. - Nathan, if you have intuition, you'd probably know better. My intuition here is that when you look at there's 70 different high level topics, like drawings, websites, there's another thing here like street signs.

And then there's one that's like primarily charts and tables. So I don't have the exact input output of this but I'm assuming it's more text-based if there's OCR on it. So it's probably, you know, documents, receipts. That's what I would assume this subset of the dataset is. - Yeah, I see.

- Yeah, but that makes sense. - But there, yeah. - Yeah, so we provide the caption, we provide the OCR output in case there's charts and tables. And then we ask the question and the LLM is supposed to answer the question purely based on text. It does not have the image.

Yeah, that sounds crazy to me. I just don't know how we can teach a model to answer anything about an image without actually having an image. But if it works, it works. But that was quite mind-blowing to me, especially this dataset and how they created it. - So I kind of interpreted this as that's like a starting point, and then the annotator would, I wasn't sure why they did it this way and maybe Nathan can speak to this, but you have these captions and then you're kind of just selecting and combining annotations as a human who can actually see the image.

- That makes sense, yeah. Or maybe this is one of the earlier stages where we're just bootstrapping our synthetic data and the quality is maybe not as high or not as intuitive as compared to the points. PixMo-Points, PixMo-Clocks, those were easy for me to understand.

So this is the one thing I got stuck on. Thank you. - Are you, wait, the model was trained with the image and the question pairs though. It's saying that the LLM that's doing the question answering didn't have access to the image. No? Like the output is still image-to-QA pairs that you train Molmo on, but the annotation, that's what I, so like, I think that's what you were saying.

Like, it's interesting how you can train a Molmo to answer questions on an image without giving it the image, but no, it's given the image and question-answer pairs. It's just that the annotation answering the question goes image, to OCR, to an LLM writing question answers on that OCR, and then, you know, there's a, what is it?

There's an annotator to approve the question answers. Now we've got question answers with the image. When we train, we have image plus question answers. - I see, yeah, okay. That actually is, that's actually very intuitive. So essentially what we're saying is that right now, the model can't answer questions. But the model can generate captions, because that's the previous PixMo-Cap it also learned on.

So we're going to get that model to generate captions and then we get an off-the-shelf LLM, probably one of their strong LLMs, to create QA pairs. And now based on that, we now have QA pairs on the image. So that's how you can bootstrap yourself from no data, step by step. Okay.
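A hedged sketch of that bootstrap, with `stage1_caption`, `run_ocr`, `llm`, and `human_approves` as hypothetical stand-ins; this is the flow as discussed here, not the paper's actual code:

```python
# Hedged sketch of the QA bootstrap just described: the text-only LLM never sees the
# image, only the stage-1 caption plus OCR output, and a human approves the pairs.
# All helpers here are hypothetical stand-ins; assume `llm` returns a list of
# "Q: ... A: ..." strings.

def bootstrap_qa(image, stage1_caption, run_ocr, llm, human_approves):
    caption = stage1_caption(image)        # caption from the stage-1 captioning model
    ocr_text = run_ocr(image)              # helps with charts, tables, documents
    qa_pairs = llm(
        "Given this image caption and OCR text, write diverse question-answer pairs "
        f"a user might ask about the image.\n\nCaption: {caption}\n\nOCR: {ocr_text}")
    # The final training example pairs the *image* with the approved Q/A text,
    # even though the LLM that wrote the Q/A never saw the image.
    return [(image, qa) for qa in qa_pairs if human_approves(image, qa)]
```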

Okay. - This little niche here was that off the shelf LLM is not capable. So that's why they do caption and OCR and provide that to the LLM. - Makes sense. - That's where I'm like, oh, I think you mentioned their training questioning without giving it the image, but you still do have the references.

- Yeah, the final Molmo data actually has that, but this is just the process of them generating the data. - Yes. - Cool. Thank you. - Yeah. And then honestly, I found this one pretty confusing at first as well. The rest, there's just quite a few subsets of these.

So points, this is kind of their demo. I don't know if I'm sharing my whole screen or just the, yeah, I'm just sharing the. - PDF. - Well, actually I'll change it real quick. If people haven't seen the demo, I would recommend checking it out, but this is basically points on an image.

So it kind of answers, like if you were to ask, in this example, point to Mount Rainier, the model can drop a dot on the mountain. So it's a labeled dataset of that, next sample. And then they show their example of connecting this thing to the Apple Vision Pro and asking it, like, what do I see here?

They've got a subset of that. Then there's CapQA; it's basically just, here are the other ones. There's Docs: generate code for text- and figure-heavy images, so charts, documents, tables, and we prompt it to do QA pairs. There's Clocks. I was confused about why there's a synthetic analog clock dataset, but there are clocks.

There's a lot of clock examples in here. - I had the same question. Like Nathan, would you- - This is because the model didn't work on clocks. And then the lead really wanted it to work on clocks, and no models work on clocks. So they're like, we've got to make it work on clocks.

One of the interesting things is that it doesn't work on dials, even though it works on clocks. - It doesn't work on dials, like an oven dial or washing machine dial. - Yeah, like a temperature dial or pressure dial. Like it can't read the number off of that, but it'll work on clocks.

But just like such a, like it should be able to, like why can't it generalize? - Isn't it supposed to generalize? Oh my God. Okay, but that was very good. - But it does work well on clocks. - That was very good. - I love the answer though. These are the insights that like we'd never get, but is this just like over fit to a benchmark that nothing else does?

Is this not like it should generalize bit or less? - Is there a clock benchmark? - Hell yeah, dude. You threw it in. You like, now I know. Now every visual language model that comes out, often will be like. - Well, it is going to make, like it is something that they should be able to do.

- Yeah, it should generalize. - I mean, a big story of this is that Matt D., I don't know how to pronounce his name, just got obsessive over data and just kept scaling it up and up and up. And all the things they added to this dataset kept on working.

And they just spent more and more money 'cause it kept getting better and better and better over like six months of just making the dataset bigger. So like the pointing is probably similar. It's like, why can't they point? And then they're like, okay, let's make it work. Like pointing seems useful though, right?

Like the demo you guys showed was actually somewhat useful. Like, you know, translate this table in a whole image to JSON and it does it. Or like stuff like this, like point to Mount Rainier, it can point, but other models can't. Seems useful. Clocks, I don't know. This just reminds me of strawberry.

You know, how many R's are in strawberry? Someone try this with GPT-4V: what time is it? And it can't tell time. And then of course the rest, there's academic datasets. Once again, the Flamingo blog post goes into all these very deeply, would recommend. But TLDR, those are the datasets. Evals are evals.

This is where I was like, okay, it does good on vision. There's academic benchmarks, which are, you know, 11 of the commonly used ones. There's good averages. I love the new rebrand they've done with the whole pink and color theme. It looks good. Averages. But yeah, I mean, across the board, the 7B, the 1B, the 72B, very on par with 4o, 4V, Gemini 1.5.

Thoughts on it and whatnot. The human ELO ranking was pretty interesting, where preference is decent, but I guess the interesting differentiation there is how far off a drop other stuff like LLaVA and whatnot is. Yeah, Phi is doing pretty rough in multimodal, even though it got a multimodal update, but it's probably a better text model.

I guess I'd just be interested now to see how this thing does in text and if we can get a better text version. This is kind of back to the theme of this paper of like stuff, most models use a vision encoder that's closed. So we wanted to do full open.

This is a little harder to read, but let me digest real quick. So the vision encoder for Lava is open weights, but the data and code used to generate it was closed. API models, they're basically all closed. A lot of the generation is done with closed stuff. They're in this corner of all green.

We love all green. We don't love red. But this is basically just a visual of the theme of the paper. I didn't look that deep into it, but that's what it's showing. Once again, the captioning that's done is mostly a distillation of proprietary stuff. For example, all this proprietary stuff sucks at clocks.

So nothing that's a distillation will be good at clocks. Nothing can point. If you just distill from this, if they can't point, your VLM won't point. So we show how to get good data. The answer to that was how people talk with guided questions, do pre-processing, and I guess keep scaling it up.

It's good to see it still scales. - And also, one point on the MOMO one, right? If you scroll up all the way to the top, yeah, over here. The only reason why MOMO 7B and 7BD is not open data encode is because it's based off the QEN data.

I think, correct me if I'm wrong, Nathan. And that's why; it's not because they don't want to open it, it's because we just don't know what was in the Qwen data. Same for the visual encoder. If not, everything that the AI2 team did was-- - This model tests the definition of open source AI, because it's like, you grab some random vision encoder, and that's the only thing that's not open.

It's like kind of-- - Well, the weights are open, but we just don't know what the data and code are, yeah. - Yeah, but what's the difference between a vision-- Like, if the weights of the vision encoder were frozen, I think that it could be fine, but I don't know.

It's weird, because you like say that fine-tunes from MOMO are like open-weight models. They're not like open-source fine-tunes and stuff. - This is a discussion that the open-source initiative and other people are having right now. It's kind of a whole other thing that I don't have enough time to answer the emails for, but it's like, I don't know what to do with it.

- The interesting thing there was the only closed code, the closed open data, was just that they used CLIP, large CLIP, right? They have the sentence that our vision encoder is based on CLIP right here. So for the vision encoder: all of our released models use OpenAI's ViT-Large CLIP model, which provides consistently good results.

While this model uses closed data, it can be reproduced from scratch with MetaCLIP. So that's what gives them some red, because the data that OpenAI trained CLIP on, they didn't open-source. So it's closed data, and they talked about this too; like the paper and the blog post really show how the thing that made vision language models good was a good captioner, which was the last CLIP that they put out.

And this is where OpenAI started to go closed source, where they're like, okay, we've scraped the web, we have our own scraper, we have a good captioner, this, that. Here's the data set, or no, they don't give out the data set. They give the final model, but not the data set.

And then a few years later, Meta comes out with MetaCLIP where they're like, okay, here's how we can remake CLIP. And I think that's personally, this is just like now the open-source debate of they could have used the Meta version and given us a worse model, and it would have given them green here, but I don't know, I think it's fine that they did it.

And then this is closed, 'cause yeah, like Eugene said, Qwen stuff. But that's my two cents. This could have been open if they just used MetaCLIP, but MetaCLIP's not as good, because I think Meta did it as a proof of concept. - I think, honestly, I think it's fine, because right now, as previously outlined, for vision and audio, the gap in datasets in the open-source space is so large, that the biggest thing is actually more about getting the dataset open-source and working.

So for me, if in this process, sure, this may be flawed, and we get a strong open image dataset, which then subsequently we can do captioning on top of, which will create a more complete dataset that's used to train the next open model, then this will be a bootstrap, essentially, towards an all-green open model, which might be a better path than trying to use an inferior CLIP and not have a good dataset in the first place.

- Makes sense. I think also this is a proof of concept at open-source stuff. I don't know how much there's like, this is a product, use it, use it, use it, versus like we were scaling experiments that kept working better. And there's still quite a bit that could be done here, but yeah, it's a really good point out that stuff is a distillation, and we don't know what goes into this.

It does have that issue with CLIP, right? We don't know the data that went into it. MetaCLIP would have been cool, but yeah. Evals are next, so benchmarks. If people understand and want to dig into any of these, we can dig into them. So it's an interesting differentiation of open-weight, open-weight distillation, where Molmo sits, and then these are all the proprietary.

So Claude, Gemini, and GPT. Here's the averages. Molmo 72B kills it. It's the best average across them. And then across the actual benchmarks themselves, they're doing pretty good across the board. I think when you look a little deeper, they mention the 1B is on par with GPT-4V on most benchmarks, but the 7B outperforms 4V.

It's closer to 4.0 and whatnot. So if anyone has anything interesting they'd want to dig into on the benchmarks, we can. Otherwise, I think we know where it sits. The interesting part would be digging into the text generation part. Eugene, you still have your hand up. Are you trying to comment?

Okay, hand up. Yeah, I didn't have much more to dig into in this. I thought the ELO ranking was kind of interesting as well. Here's the ELO ranking. Human preference evals use 15K image and text prompt pairs. They queried the VLMs for responses and presented the resulting triplets to 870 human annotators.

They were given pairwise preference rankings, about 325K samples across 27 models. Nice flex: biggest human preference eval for multimodal models to date. All it took was less than a thousand people. Their ELO rankings use 3x more data than Chatbot Arena from LMSYS. Also, I hear there's some potential tea with LMSYS.

Maybe that's coming out soon for their vision models. And then here's an ELO ranking. - There's potential tea with LMSYS, don't extrapolate. - Well, we'll leave that without much more unless anyone else wants to comment on it. But yeah, if anyone knows how ELO rankings work, they're pretty cool, figure it out.
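For anyone curious, a generic Elo update over pairwise preference votes looks roughly like this; this is the standard chess-style formula, not necessarily the exact scoring the paper used:

```python
# Standard Elo update over pairwise preference votes (illustrative, not necessarily
# the paper's exact scoring procedure).
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)       # winner gains more when the upset is bigger
    return r_winner + delta, r_loser - delta

ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
for winner, loser in votes:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])
```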

But yeah, most preference-ranked was GPT-4o, then Molmo-72B, then Gemini, then Sonnet, then the 7B. It's really cool how the fourth one is the 7B. So very cool on-device model. And then RIP, our boy, LLaVA 1.5 7B down here, Chameleon down here. Chameleon was cool though. They're early-fusion models. They're a different way to do this.

They're a little early. I'd love to see a recreation of fusion for this as opposed to adapters. But those are the two benchmarks. There's some takeaways on this as well. So here's a few key points. Also, there was a line here as well in their pre-training that I wanted to highlight.

So in this data and training, they talk about how they do this different approach. And then they just add this line here. So we don't do RLHF, but we've got the RLHF guy here, but they added a line that we don't do RLHF. Is this meant to say like, you know, there's no SFT, these are base models or just...

Well, there is supervised fine tuning for it, but no RLHF. - For us, team alignment and incentives are hard. - We have a question from Sam. - Thanks, Eugene. Hold on, is my mic on? Can you guys hear me? - Yes, yes. - Great, cool. So this is coming from a place of vague, naivety of multi-modality stuff.

And it's about the granularity that we should attach to images when creating, say, open source image caption datasets. So right now there's a picture, say from like a satellite view of a parking lot. And I want to ask a question that like Eugene could answer of like, what is on the dashboard of the car in the second row, third from the left or something.

And it's a little Hawaiian guy or something like that on the dashboard. So a human can answer that, but I assume a language model only could if it had been trained, I suppose, on image caption pairs that have that level of extreme granularity to them. So like, I've seen some really cool demos on the blog posts for MoMo.

Now, again, I'm not really sure whether to ascribe that performance to the pre-trained vision encoder, the one that they're using, or if it's from some data in the PixMo set that uses that cool audio transcription pipeline. So I guess the question is, if we kind of ignore the fact that we're using a pre-trained image encoder that has a bunch of cool abilities on data that we don't know about, how granular should our captions for image-caption datasets be if we want what we would expect to be, you know, quote, AI performance, you know, if you don't know how this stuff works, from our vision language models.

And why did you guys choose the granularity that you did, I suppose? And the question format as well. Kind of a lot. - Do you want to TLDR that into one, two line area? - How granular should our image caption pairs be, period. Kind of detailed. - I think Nathan probably has more intuition on this, but their thing wasn't about granularity, it seems.

It was about, one, like, yeah, you've got diversity, you've got, you know, processing of it, but then it was also about the split, right? So we talked about these four different ones, like part of it is, you know, generate code for 255,000 text- and figure-heavy images, including charts and whatnot, then get QA pairs on it.

They got clocks, but there's also all the academic datasets, right? So there's a lot of these. If you look into what they are, that blog post goes into them. So I think granularity of specific stuff isn't how I necessarily look at it. It's diversity of the images and what they do with good captions around it.

Nathan, if you want to chime in, go ahead. Eugene, you have your hand raised. I don't know if you want to answer, but let's answer this question before moving to another one. - I don't really have a good response. I think most of this was like really high detail responses is what worked for them, but I don't know.

- Okay. - My intuition, at least, is that I suspect the heavy lifting might be actually more towards the Q&A side rather than the captioning side. I view captioning as a bootstrap, but at the end of the day, we want the model to generalize in what it describes. Captioning tends to overfit to what is being described, whereas question answers kind of force the model to pick up details that it would previously ignore, like plots, for example, or the one extra finger on the hand, or whatever it is.

And as long as you're able to ask questions and you can answer, you're training the model to capture as much information as possible, even if you don't ask. Because even if, let's say, this dataset, you didn't ask this question, another dataset may ask an alternative question instead, and that will just generalize better.

And that's what I suspect. I think captioning is a problem in overfitting from my point of view. - Updates. I really don't have that much intuition on this either, though. Someone else has hand up? You want to pop in? Okay. I'm going to finish the paper real quick in the takeaways.

This is a cool little video we can watch after, and we'll just save time for Q&A. There was really not much left, though. There's just some highlights. So they wanted to highlight features. I will share their highlights. The most efficient model, the 1B, is based on their 1B MOE.

That one matches the performance of 4V on most academic benchmarks and their ELO ranking. The 7B and the... So our OLMo 7B and Qwen 7B-based models comfortably sit between GPT-4V and 4o on both. So 1B, similar to 4V. 7Bs are between 4V and 4o. The big one, the one based on Qwen 72B, is the best on academic benchmarks.

So academic benchmarks-wise, their big one is state-of-the-art, better than proprietary. But ELO-wise, it sits behind 4o. Then our best model outperforms... Yeah, so it's basically best model beats everything. To highlight their potential for action, they tested this on AndroidControl, where it did... I didn't really get this, but they tested it on AndroidControl, where it achieved this low-level accuracy and 69% high-level accuracy.

I don't really know what Android Control is, but they highlighted it. I probably should have read more into it, but... - Excuse me, do you plan to show us something, or you're just pointing? 'Cause we only see the normal page. - Good catch. My bad, I went back to the slides.

(laughs) My bad. But yeah, I was just going over these highlights. So highlights was 1B is based on that. It's at 4V. The 7Bs are between the two OpenAI models. 72B is better than anything on their chart. So it's state-of-the-art better than proprietary. And then there's this last little point here of it does really good on this Android Control, which I don't know what Android Control is, but they wanted to highlight it.

So I've highlighted it. I think that's most of the paper. There's a release plan here. So yeah, Molmo is coming out. They put it out early. They've put out some stuff. There's more stuff coming. So this technical report is getting an update soon. The datasets are coming out, but putting out datasets is hard.

So give them time. Model weights, training and eval code. This is a fun one. This typically doesn't come out because you're probably not doing training on the same hardware they're training on, but always cool to see training and eval code. I guess that's pretty good overview of the paper.

If there's anything we want to dive deeper into, we've got 10 minutes. Nathan, if you want to add anything, cook, go over. - I just have a question, but like I'm ready to get to the section about Whisper, but there are like no other questions. - Nathan. - Yeah, Nathan had to drop.

- Oh, he dropped. Okay, cool. Well, that's paper. There is a little use case thing to just make it useful. We can kind of watch this real quick. And then we've got a little update on Whisper. So this is their video they put out. - Yeah, that's awesome, Nathan.

- It's very Apple Intel like. - Okay. - Count the number of people. - Counting the number of people shows a total of 21. - Convert this table to JSON. - Here's the table converted to JSON format. - I want to sell my bike on Craigslist. Write me a description for it.

- Schwinn bike for sale, blue with white accent. $300 or best offer. - Okay, that's just a, you know, example of what the model can do. I recommend people check out the rest of their blog posts. It's pretty cool. Here's benchmarks. There's a couple other videos if people are interested.

And then there's just a good little TLDR of the paper. I want to save time for questions and stuff. So I'll share the link. We'll share it in Discord. If anyone has last questions on the paper, we can go over now. Otherwise we have a little Whisper update too.

Whisper 3 turbo dropped. I was very impressed. Yeah, questions or should we pass to Whisper? Last shot. - All right, Whisper it is again. - Whisper, let's go. - Whisper. - Sorry, I took away more time. - Whisper? Who's covering Whisper? - It's supposed to be Amgadoz, but. Oh, there he is.

- Yes. Yes, can you guys hear me? - Yes. - Okay, great. Thank you so much for covering the Molmo paper alongside Nathan, that was great. So yeah, let's get started quickly. So can you guys see my screen? - Yes. - Okay, great. So just some updates. OpenAI has recently released a new checkpoint or version of Whisper called Whisper large-v3 turbo.

It's supposed to be faster and more efficient with very little accuracy degradation. And it was inspired by the work behind Distil-Whisper, so just like that pull request that Jong Wook merged into the OpenAI repository. So yeah, a little reminder about Whisper. Whisper is like the state-of-the-art ASR model for transcription, translation and speech detection as well.

The architecture is basically based on the encoder-decoder transformer architecture that was presented in the original Attention Is All You Need paper. So the input is passed to the encoder, which processes the audio and then gives out hidden states at the end. And these hidden states are sent to the decoder, which it uses to generate the text one token at a time.
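As a hedged sketch of that encode-once, decode-token-by-token split using the Hugging Face transformers Whisper implementation (exact API details can vary by version):

```python
# Hedged sketch using Hugging Face transformers (API details may vary by version):
# the encoder runs once over the 30-second feature window, then the decoder
# generates the transcript autoregressively, one token at a time.
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

def transcribe(audio_16khz):               # 1-D float array sampled at 16 kHz
    features = processor(audio_16khz, sampling_rate=16_000,
                         return_tensors="pt").input_features   # log-mel, padded to 30 s
    with torch.no_grad():
        token_ids = model.generate(features)   # encoder once, decoder token by token
    return processor.batch_decode(token_ids, skip_special_tokens=True)[0]
```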

So yeah, the encoder and decoder mostly have the same number of layers; for example, Whisper large-v3 has 32 layers in the encoder and 32 layers in the decoder. So this was like a quick summary about the architecture of Whisper itself. When Whisper was originally released it had different versions: tiny, base, small, medium and large.

And then later on they added large-v2 and large-v3. But for each of these versions, the number of layers in the decoder and the encoder were the same. So for example, base has six layers in the encoder and six layers in the decoder. Yeah, so the motivation behind Whisper Turbo was actually the same motivation behind Distil-Whisper.

And the motivation behind Distil-Whisper was based on two observations. The first one is that the decoder accounts for almost 90% of the latency when you are trying to transcribe any audio. The second observation is that the decoder performs the simpler task of mapping the hidden states of the audio into text. Like the most difficult part is actually extracting these hidden states of the audio and trying to understand what's being said in there.

Like the most difficult part is actually extracting these hidden states of the audio and trying to understand what's being said in there. So yeah, the first observation is because Whisper is like an autoregressive language model and generates tokens, one token at a time. And you need the token at time T to generate the token at time T plus one.

So it is actually a very serial process and we cannot utilize modern GPUs to parallelize this. The second observation was made by actually comparing the different versions of Whisper. So for example, the small and tiny versions could transcribe audio and it is loud and clear. And you can actually hear what's being said but they would struggle with audios that have like background noise or where the speaker is not very clear.

However, all models or all versions, whether it be like small, tiny, large, they all generate coherent text in English and all the other languages. And this kind of like gives a hint that they are actually capable of like modeling the language but they just struggle with the audio sometimes.

So yeah, what exactly is Whisper Turbo? Whisper Turbo is like a new and more efficient version of the original model with minimal accuracy degradation. It uses mainly two techniques to achieve this improvement in speed and compute. The first one is model pruning and the second one is continued pre-training, but it does not use distillation, which is I think a very big misconception about this new release.

So let's talk about model pruning. Pruning in machine learning mainly refers to like eliminating some parameters from a neural network to like reduce the compute and memory requirements of a model without impacting the accuracy as much. In Whisper Turbo, the pruning was done only in the decoder and like entire layers were pruned.

So we went from 32 decoder layers, which were in the original large-v3, to just four layers. And this resulted in a massive reduction in the model size. So the turbo model is like 1.78 times smaller than the original large-v3. So yeah, after you prune a model and you go from 32 layers to four layers, you will probably lose most of the capabilities of the model.

Like the model will probably degrade and collapse and will not be able to even generate coherent text. You will have to train these four layers to work together again as a single decoder. So the way to do this is to actually do continued pre-training. And in continued pre-training, you train the model on a relatively big amount of data to actually teach it again all the things that it has forgotten or gotten confused about.
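A hedged sketch of what that kind of decoder-layer pruning could look like on the Hugging Face Whisper implementation; this is illustrative only, not OpenAI's actual procedure, and which four layers to keep is itself a design choice:

```python
# Illustrative decoder-layer pruning on the HF Whisper implementation (not OpenAI's
# actual procedure; attribute names may differ across transformers versions).
import torch.nn as nn
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

keep = [0, 10, 21, 31]                     # keep 4 of the 32 decoder layers (a design choice)
decoder = model.model.decoder
decoder.layers = nn.ModuleList([decoder.layers[i] for i in keep])
model.config.decoder_layers = len(keep)

# At this point the model is badly degraded; as described above, it then needs
# continued pre-training on (roughly) the original transcription data to recover.
```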

And in the case of Whisper Turbo, the continued pre-training happened on the same dataset that was actually used in the original training of Whisper large-v3. So for reference, the dataset consisted of two epochs. Each epoch had 5 million hours: 1 million out of these were weakly supervised, as in the original Whisper, and 4 million were pseudo-labeled by the Whisper large-v2 model.

So we had like 5 million hours in each epoch. This means that we have 10 million hours over two epochs. So this was the size of the original training data. And the data used in continued pre-training is very similar to this, except we're only using the transcription data. We're not using any of the translation data in the Whisper Turbo continued pre-training.

So yeah, this gives us a hint about the size of the training data of the new model, which is actually quite large. And one of the fun tidbits is that they used a linear learning rate decay, starting from 5e-5. So also another confusion is that it gets compared a lot with Distil-Whisper, but they differ in size and also the training strategy.

Distilled Whisper was trained using knowledge distillation of ASR models. But we'll talk about this like in a minute or two. Let's just go over a quick comparison between the two models. So they both have the same number of encoder layers, which is 32, but the Whisper Turbo has four decoder layers while Distilled Whisper has only two.

The training strategy for Whisper Turbo is pruning and continued pre-training, but for Distilled Whisper it's actually knowledge distillation. And for Whisper Turbo, the training data is multilingual. For Distilled Whisper it's only English. And the task for Distilled Whisper is transcription and the same for Whisper Turbo. - Can I ask a quick question?

- Sure. - The difference with Whisper Turbo and Distilled Whisper is primarily the pruning versus distillation. It seems like we're giving a lot of benefit towards pruning, but the other big distinction is also the difference in data, right? I'm curious on your intuition because I also assume pruning versus distillation is a big part of it and that's why it can stay. It's interesting that it can stay multilingual. I would assume the opposite, but how much of this do you think is based on the strategy versus just better data?

It's interesting that it can stay multilingual. I would assume the opposite, but how much of this do you think is based on the strategy versus just better data? Distilled Whisper came out a while ago versus Whisper Turbo, OpenAI does good data. Like how much of a difference do you think is caused?

Because, so they're both similar size, but Whisper Turbo is a lot better, right? So how much of that better like performance out of smaller size comes from pruning versus data? Because they don't really mention how good this data is. Do you have intuition on that? - Yes, yes. So we kind of like have some info about the dataset.

For Whisper Turbo, it uses exactly the same dataset for Whisper Large V3, except for the translation section, which is excluded. So this is like a very big dataset of like 10 million, or like a few million hours. And it has like many, many different languages. So this is a very diverse and large and robust dataset, which is actually quite important.

On the other hand, Distilled Whisper uses English only data. I think it is a few thousand hours of audio. So this is like much, much, much smaller than the Whisper Turbo. One upside is that the Distilled Whisper is trying to use actually higher quality data, arguably, but we're not sure about this.

So I think there is like a very big emphasis about the data size as well. And alongside the training strategy. For Whisper Turbo, it's a few million hours, for Distilled Whisper, it's just a few thousand hours. And this is quite crucial, I think. For example, Distilled Whisper is only for English.

You cannot use it like for French, German, Spanish, or Arabic, or any other language. And I believe like if you train a single model on multiple datasets, multiple domains, multiple languages, it becomes more robust. And this is like the premise behind Whisper. Like it is a robust model because it is trained on a very diverse dataset, encompassing different domains and languages.

- Got it. - So I hope this kind of answers your question. - Really, really helps with the intuition there. Sorry to cut off. Also great slides, by the way. - Thanks. - Yeah, amazing slides. - Oh, thank you. This one is good. You're gonna like this one a lot more.

So there was a question from swyx about, like, how do you actually do distillation for an ASR model? And I think it's kind of similar to how you distill a language model. So let's say you want to distill Llama 405B to Llama 3B. The way you do this is you try to extract or compress the knowledge from the big model into that smaller model.

And this is exactly what you are doing here. The technicality is, the way you train Whisper is as a language model that is conditioned on the audio. So it's actually like a language model that predicts the next token. So you can have a cross-entropy loss on the next token.

And you only like, the difference is the input also contains audio, not just text. So you kind of like train the smaller model on the next token predicted by the bigger model. So kind of like synthetic data or like pseudo data, pseudo label data to be precise. That's one of the training objectives.

The other training objective is, is you're trying to like make the output distribution of the smaller model as close as possible to the output distribution of the bigger model. So not just the next token, but like the entire distribution of the next token. So if you're training like a language model in a conventional way, you're only like training it on the next token.

You only care about the correct next token. You don't care about the top nine or top 10 predictions. But if you wanna train or like gonna do knowledge distillation and you want to like train the model with a lot of information, you might want to do something called KL divergence training.

In this way, you teach the smaller model to approximate the probability distribution of the bigger model: not just get the next token right, but get the top ten predictions for the next token right as well. And this actually helps the model learn a lot more, because you're giving it much more information for each data point than just next-token prediction.
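A minimal sketch of that combined objective in PyTorch, assuming you already have student and teacher logits for a batch plus the teacher-generated pseudo-label tokens; the weighting `alpha` and `temperature` here are illustrative placeholders, not the values from the Distil-Whisper recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, pseudo_labels,
                      alpha=0.8, temperature=2.0):
    """Combine cross-entropy on the teacher's pseudo-labels with a KL term
    that matches the full next-token distribution of the teacher."""
    # (1) Cross-entropy against the teacher-generated (pseudo-label) tokens.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        pseudo_labels.view(-1),
        ignore_index=-100,            # ignore padding positions
    )
    # (2) KL divergence between softened teacher and student distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * kl + (1 - alpha) * ce
```

The actual Distil-Whisper training additionally filters pseudo-labels by their WER against ground truth; treat the numbers above as placeholders.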

So that's how you do knowledge distillation for an ASR model, in a very brief overview. The Distil-Whisper paper from the Hugging Face team goes into much more detail about this. So yeah. - There's another paper that came out recently that talks about distillation and its three types.

So there's distillation where you do pure SFT on outputs, matching input-output pairs, and then there's distillation with an explicit loss, which is basically the KL divergence you're talking about, where you map (indistinct) We did have a question (indistinct) This is basically token-level decoding of audio chunks, right?

So (indistinct) asked, is Whisper Turbo fast enough to be real-time? And what's the definition of real-time, and how does real-time work with chunk-based decoding, if you have intuition around this. - I think he's got it on mute. You have to unmute.

- Oh. - Unmute. - Unmute. - Yeah, I think I got muted. Can you guys hear me now? - Yeah, you're good. Did you get the question? - Okay. Yes, yes, I got the question. So the concept of real-time is debatable, but if you ask me, if you can transcribe one minute of audio in less than a minute, that is more or less real-time.

So the question is, is the Whisper model fast enough to be real-time? The answer is yes, if you run it on the right hardware. If you run it on a T4 or anything bigger, basically any modern GPU, it's going to be almost real-time.

The latency is going to be maybe a second or two max, if you've set the parameters correctly and sized things correctly. That holds even for large-v2 and large-v3, and it's going to be even faster with Distil-Whisper or Whisper Turbo.

So yes, I think Whisper can actually be very, very good at real-time transcription. - And how do you do live transcription with token- or chunk-based decoding? - One way to do real-time transcription is to select a small chunk size. But before answering that, one of the key things about Whisper is that you always have to give it 30 seconds of audio.

Even if the audio is one second, you have to pad the remaining 29 seconds. So you're always giving it 30 seconds of audio to be processed by the encoder. Then, depending on how long the speech actually is, it might generate two tokens or 200 tokens.

But if you're giving it one second of speech and 29 seconds of padding, it will probably generate only two or three tokens, maybe five, not 200. So you spend the same time on the encoder, but significantly less time on the decoder. One way to use this for real-time decoding is to continuously give Whisper maybe 300 milliseconds of audio as the input, pad it, and ask it to generate the five or ten tokens, or maybe fewer, that correspond to those 300 milliseconds.

Because in half a second someone is probably saying two or three words, which is four or five tokens, which can be decoded relatively quickly, so you get the answer back in almost real time. You keep doing this over and over until your first batch of chunks fills up, and then you can just shift the audio buffer a bit.

It's difficult to explain without a diagram, but it's basically about continuously feeding the model chunks that have been shifted by maybe half a second or so. And if the buffer is less than 30 seconds, you just pad it and give it to the encoder.
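A rough sketch of that loop, assuming the openai-whisper package, a 16 kHz mono stream, and a hypothetical `stream_audio_chunks()` helper yielding ~300 ms of float32 samples; a real implementation would also handle overlap and commit finalized text instead of re-transcribing the whole buffer:

```python
# Streaming sketch: keep an audio buffer, pad it to Whisper's fixed
# 30-second window, transcribe, and shift the buffer once it gets long.
import numpy as np
import whisper

SAMPLE_RATE = 16000
model = whisper.load_model("turbo")           # large-v3-turbo alias in recent releases
buffer = np.zeros(0, dtype=np.float32)

for chunk in stream_audio_chunks():           # hypothetical helper: ~300 ms per chunk
    buffer = np.concatenate([buffer, chunk])
    # Whisper's encoder always consumes 30 s; pad_or_trim pads the rest with silence.
    window = whisper.pad_or_trim(buffer, length=30 * SAMPLE_RATE)
    result = model.transcribe(window, language="en", fp16=False)
    print(result["text"], end="\r")           # overwrite the partial transcript
    # Once the buffer nears 30 s, drop the oldest audio so new speech still fits.
    if len(buffer) > 25 * SAMPLE_RATE:
        buffer = buffer[len(buffer) - 25 * SAMPLE_RATE:]
```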

- Got it. - Does that clarify things? - Very useful intuition. I didn't realize all the padding and so on. Someone asked if these chunks overlap. - You can make them overlap or not overlap. There's a whole set of hyperparameters you can play with.

You can use VAD (voice activity detection) or not use it. You can keep adding chunks without overlapping, or you can make them overlap. There's a whole world of options here. - Okay, got it. Really, really useful intuition. Any other questions or thoughts? This is the last slide, right? We didn't cut off.

- We have two other slides, but this is the main thing. - If you want to keep going, go through them. - Sure. So, some notes on fine-tuning: Whisper Turbo can be fine-tuned just like any of the other models. I asked Jong Wook (one of the Whisper authors) about which hyperparameters to use.

He said you can start with 1e-5 as the learning rate, but you should do a learning rate sweep and proper hyperparameter tuning. So it can be fine-tuned just like any other model. Just be careful: it's probably only going to be fine-tunable for transcription.
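For reference, a minimal fine-tuning sketch along the lines of the standard Hugging Face transformers recipe for Whisper; `train_dataset` and the data collator are assumed to be prepared separately, and the hyperparameters below are just the starting points mentioned here, not tuned values:

```python
# Sketch of fine-tuning the turbo checkpoint with the Hugging Face trainer.
# Assumes train_dataset yields {"input_features", "labels"} examples and that
# a suitable data collator is wired up (omitted here for brevity).
from transformers import (WhisperForConditionalGeneration,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")

args = Seq2SeqTrainingArguments(
    output_dir="whisper-turbo-ft",
    learning_rate=1e-5,                  # suggested starting point; do a sweep
    warmup_steps=500,
    per_device_train_batch_size=8,
    max_steps=4000,
    fp16=True,
)

trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```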

Don't expect too much with translation, or maybe give it a try, but it's not guaranteed to work. So yeah, it's fine-tunable just like any other model. The final two slides are about benchmarks, but just before the benchmarks, people also asked how it compares to faster-whisper and Distil-Whisper.

We've answered the question about Distil-Whisper, but for faster-whisper, it's a valid question because faster-whisper is an inference engine. It's a library, it's code. It can be used to deploy any of the versions of Whisper: tiny, base, small, medium, large, large-v2, large-v3, large-v3-turbo, distil-large, and so on.
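For example, a minimal sketch of loading the turbo checkpoint through faster-whisper, assuming a recent release that ships the converted large-v3-turbo weights and a hypothetical `talk.wav` input file:

```python
# faster-whisper is the inference engine; "large-v3-turbo" is one of the
# model versions it can load.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

segments, info = model.transcribe("talk.wav", language="en")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```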

So try to keep the distinction between the inference engine and the model versions themselves. Any questions before we go into the benchmarks? - No, I guess we can go. - Sure. For benchmarks, I highly recommend that you benchmark on your own data; I think this is quite important. But to give some intuition about the models' performance, I chose three data points that I find very interesting and that matter to me.

One of them is the GPU MODE keynote by Karpathy. This is very recent, so there's no way it was in the training data, hopefully. The second one is a chat with Mistral's CEO, Arthur Mensch. He's French, so he's speaking English with a bit of an accent.

That was, I think, in December of last year, so hopefully it wasn't in the training data either. And the final one is my favorite, the State of GPT talk by Karpathy from, I think, 2023. Karpathy is known for talking relatively fast, so this can be challenging.

So let's jump into the benchmarks. For benchmarking, you generally look at two metrics, WER and CER. WER is word error rate, roughly how often the model gets a word wrong relative to the reference text; CER is the same idea at the character level. For the WER, I benchmarked the different sizes and versions on the data points we discussed.
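As a rough illustration of how these metrics are computed (not the exact evaluation script used for these slides), assuming the jiwer package and made-up reference/hypothesis strings:

```python
# Word error rate and character error rate with jiwer
# (illustrative strings, not the actual benchmark transcripts).
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

print("WER:", jiwer.wer(reference, hypothesis))   # fraction of word-level errors
print("CER:", jiwer.cer(reference, hypothesis))   # fraction of character-level errors
```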

We see that distil-large-v2 is an outlier; it's one of the least accurate models here. The other models are fairly similar, except maybe large-v3, which can be hit or miss. Sometimes it has a relatively higher WER.

Sometimes it's the best-performing model. This aligns with what people have experienced and shared: large-v3 can sometimes hallucinate, get caught in an endless repetition loop, and make the output really bad. But on these data points, the turbo model is doing very, very well, as is distil-large-v3.

But on other benchmarks that I ran for projects whose results I can't share, I can hint that it was not very good; it was doing much worse than the other models. So yeah, do your own benchmarks. Especially if the dataset is not English, or the task is not transcription, you might see very different numbers from these.

So that's the WER; the CER looks very similar. Distil-large-v2 is doing very badly compared to the others, and large-v3 can be hit or miss. So yeah, that was a very quick, super quick introduction to Whisper Turbo. That's the end.

If you have any questions, please go ahead. Thank you for watching. - That was really great. Clap, clap, clap, applause all around. The only thing close to a complaint is no MLX stuff. - What do you mean, MLX? No MLX stuff? I don't have a Mac, so I don't care about MLX. Sorry, but yeah.

(laughing) So yeah, just as a hint, MLX is a framework for deploying or using machine learning models on Apple Silicon hardware, and presumably it's very fast and very efficient.

But I don't have a Mac, so no data points from my side. - All right, my battery's dying. I'm on 1%, so I gotta end this soon. But thank you so much, it was fantastic. Always excited by your Whisper updates and explanations. Really, really like those slides, those shadow slides.

Awesome, awesome. Thanks everyone. - Sure, sure, thank you guys. Have a nice day. - Thank you very much, bye.