[Paper Club] Molmo + Pixmo + Whisper 3 Turbo - with Vibhu Sapra, Nathan Lambert, Amgadoz
00:00:00.000 |
- I'll go through here, let me set up real quick. 00:00:15.520 |
If people haven't read research papers before, 00:00:54.920 |
Clear problem, clear solution, clear explanation 00:00:57.800 |
of what they do, the data set and everything. 00:01:08.500 |
at different sizes that are good vision language models 00:01:17.000 |
are distillations of proprietary closed-source models, right? 00:01:42.240 |
that proprietary knowledge into open knowledge. 00:01:44.440 |
But this is kind of the key highlight I take away 00:01:46.480 |
that the community is still missing foundational knowledge 00:01:50.280 |
about how to build these models from scratch. 00:01:52.740 |
So they kind of go into everything from data set training, 00:01:56.480 |
how they label data and how they can do this whole mix. 00:01:59.880 |
And some of the other papers we covered in paper club 00:02:02.160 |
are stuff like CLIP and OpenCLIP, which are captioners, 00:02:06.040 |
and how captioning is pretty important for this. 00:02:17.920 |
a class of different models that are gonna come out. 00:02:20.240 |
And the really interesting thing is they're good. 00:02:40.960 |
with some of the larger models and they're very small. 00:02:43.180 |
So yeah, diving into that, we talk about the data set, 00:02:46.720 |
how they do the training data, the architecture of it, 00:02:55.080 |
So one of this is how can we generate this data 00:03:00.080 |
So we can't use a closed source model to do this captioning. 00:03:08.000 |
is that it's hard to generate really good, diverse, 00:03:31.800 |
that's based on their open-source OLMo models. 00:03:36.240 |
They can do this end to end without anything proprietary. 00:03:39.660 |
And yeah, let's like go through it a little bit more. 00:03:46.520 |
There's a vision encoder and a language model 00:03:50.320 |
There are some interesting stuff that they call out 00:03:52.240 |
like traditionally in stuff like LLaVA and different adapters, 00:03:55.480 |
what they'll do is they'll train this vision encoder. 00:04:02.880 |
They avoid doing that multi-stage pre-training. 00:04:06.960 |
They just kind of do it all as one training run. 00:04:09.960 |
They show how important high quality data is. 00:04:17.800 |
Like they pass these labeled datasets through LLMs. 00:04:24.000 |
but point being they can get better than Gemini 1.5, 00:04:46.040 |
Like they still have to pay for a million labeled samples, 00:04:51.020 |
And then we can distill it down to other models. 00:04:53.080 |
So yeah, this is one of the main challenges, right? 00:04:59.280 |
So for example, like if you're given an image here 00:05:04.680 |
like I would probably just say this is Seattle. 00:05:12.400 |
But if you just blindly give images to either LLMs 00:05:26.180 |
trees in the foreground, water in the background, 00:05:31.060 |
But they kind of broke down how they get this type of data 00:05:36.220 |
And then I think soon the dataset is coming out, 00:05:41.620 |
So they don't wanna get this from larger proprietary models, 00:05:45.900 |
which can, like you can write a really good robust prompt 00:05:52.280 |
but you're no longer doing this from scratch. 00:05:54.080 |
You're doing it from a distillation of proprietary work. 00:05:59.740 |
We ask annotators to describe the images in speech 00:06:04.240 |
rather than asking them to write descriptions. 00:06:06.680 |
They prompted them to describe everything in great detail, 00:06:09.360 |
including descriptions of spatial positioning 00:06:21.320 |
annotators provide far more detailed descriptions 00:06:34.920 |
but we're good at talking about and describing what we want. 00:06:40.160 |
they have like a little bit more into what this dataset is. 00:06:43.640 |
So they've got positional data in some of the prompts, 00:07:03.520 |
that kind of shows how you could do this real time. 00:07:05.960 |
It's a very Apple Intelligence or Google Lens type example, 00:07:09.180 |
where I think the team put it on the Apple Vision Pro 00:07:11.740 |
and they just look at stuff and they're like, 00:07:18.820 |
So one more use case of high quality, well-labeled data. 00:07:23.940 |
they did the traditional 11 academic benchmarks. 00:07:32.820 |
which is based on their open source language model, 00:07:35.220 |
which is an MoE model with 1 billion active parameters. 00:07:42.380 |
on both user preference and academic benchmarks. 00:07:47.840 |
which is based on their 7B language model and Qwen 7B. 00:08:00.200 |
And that, it outperforms stuff on academic benchmarks. 00:08:05.200 |
And then on human preference, it's behind GPT-4o. 00:08:09.480 |
if you guys remember how we do vision benchmarks, 00:08:18.380 |
It's similar to TriviaQA, where it's just understanding. 00:08:22.300 |
and they're not realistic of how people use it. 00:08:24.180 |
So they also have this user preference stuff. 00:08:26.340 |
User preference is basically like an ELO score of models. 00:08:32.440 |
This last sentence here is pretty interesting, right? 00:08:37.500 |
"proprietary systems, including Gemini 1.5 Pro 00:08:44.240 |
Architecture, if you guys know about other multimodal models, 00:08:50.200 |
with a little twist of traditionally, like we said, 00:08:56.600 |
you have a pre-trained off-the-shelf vision encoder 00:09:05.600 |
and we do contrastive, there's a term for this, 00:09:10.600 |
but you basically merge these embedding spaces. 00:09:14.260 |
In this case, they're like, "No, we don't need to do that." 00:09:17.400 |
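The term being reached for here is contrastive pre-training, which is how CLIP-style vision encoders align their image and text embedding spaces before any adapter stage. A toy sketch of that symmetric contrastive loss, for intuition only; the shapes and temperature are arbitrary, not CLIP's or Molmo's actual configuration:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss used in CLIP-style training: matching image/text
    pairs are pulled together, mismatched pairs in the batch are pushed apart."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)  # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# toy usage with random embeddings
print(clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```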
So they go over the four steps of this architecture. 00:09:24.380 |
It converts the image into multi-scale, multi-crop images. 00:09:29.780 |
So stage one of architecture is a preprocessor. 00:09:42.020 |
that maps and merges these two embedding spaces. 00:09:44.740 |
So the vision encoder, which I believe is based on CLIP, 00:09:52.340 |
but there's a paper called MetaCLIP from Meta. 00:09:56.580 |
They just use regular CLIP, which I think is fine. 00:10:14.420 |
And then stage four is a decoder-only transformer LLM 00:10:20.380 |
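A toy sketch of the four stages as described: a preprocessor producing multi-scale, multi-crop inputs, a vision encoder, a connector, and a decoder-only LLM. Every size, module, and name below is an illustrative placeholder rather than Molmo's actual implementation (an encoder block stands in for the causal LLM for brevity):

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self, patch_dim=768, vision_dim=1024, llm_dim=512, vocab=32000):
        super().__init__()
        self.vision_encoder = nn.Linear(patch_dim, vision_dim)      # stands in for CLIP ViT
        self.connector = nn.Sequential(                              # maps vision features
            nn.Linear(vision_dim, llm_dim), nn.GELU(),               # into the LLM's
            nn.Linear(llm_dim, llm_dim))                             # embedding space
        self.llm = nn.TransformerEncoder(                            # stands in for the LLM
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, crop_patches, text_embeds):
        # Stage 1 (preprocessing) has already turned the image into multi-scale,
        # multi-crop patches: crop_patches is (batch, num_patches, patch_dim).
        vis = self.vision_encoder(crop_patches)        # Stage 2: vision encoder
        vis_tokens = self.connector(vis)               # Stage 3: connector / projector
        seq = torch.cat([vis_tokens, text_embeds], 1)  # prepend image tokens to the text
        return self.lm_head(self.llm(seq))             # Stage 4: "LLM" + output head

model = ToyVLM()
logits = model(torch.randn(2, 64, 768), torch.randn(2, 16, 512))
print(logits.shape)  # torch.Size([2, 80, 32000])
```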
If people have questions on what's going on here, 00:10:24.580 |
but for people that joined our other multimodal stuff, 00:10:27.420 |
you should kind of understand how this works. 00:10:30.300 |
There's a good blog post we went over as well, 00:10:34.860 |
or another one that we can also share as a reference. 00:10:49.000 |
So highly recommend, but also we can dig into it here. 00:10:51.660 |
And then this is kind of that part of the vision encoder. 00:10:57.880 |
and we know that CLIP is trained on closed source data, 00:11:10.360 |
and they show that you can reproduce it from scratch. 00:11:13.160 |
They use OpenAI's CLIP because it's trained for higher resolution. 00:11:25.240 |
For the actual production, they just use CLIP, 00:11:38.440 |
They've got the range, and then for the big one, 00:11:46.200 |
- Do you wanna pause for any commentaries, Nathan? 00:12:05.680 |
I mean, I have this whole take that I've written 00:12:07.360 |
is that the whole vision space is just so underdeveloped 00:12:10.680 |
that a lot of these things that seem surprising, 00:12:16.320 |
are things that all the foundation companies, 00:12:17.880 |
like they're just gonna take our data and train on it. 00:12:20.440 |
And it's fine tuning, so it's like a million data points 00:12:22.880 |
is actually kind of a lot, and all this stuff. 00:12:29.920 |
and they figured out some cool new applications, 00:12:32.100 |
like clock reading and pointing, and then the data works. 00:12:35.260 |
There is actually a niche reference of somebody 00:12:37.320 |
that did this kind of voice annotation earlier. 00:12:42.220 |
'Cause someone had mentioned it after we released it, 00:12:44.980 |
saying that we were the first people to do it. 00:13:01.020 |
And then Nathan here like, yeah, they'll do it too. 00:13:07.260 |
The website has a pretty good blog post on this, 00:13:09.380 |
and there's a good, very professionally shot video 00:13:13.680 |
And it's always nice to see that niche of like, 00:13:17.400 |
there's so much underdeveloped alpha left in vision space 00:13:21.320 |
that you train the right data set and it gets good. 00:13:37.420 |
Is there anything we should pause and break on or? 00:13:43.300 |
I think the other models are much better as text models. 00:13:45.980 |
Even like Llama, where their, like, vision fine-tuning 00:13:48.340 |
is generally accepted as being like kind of meh 00:13:50.620 |
for whatever drama reasons you want to attribute to it. 00:13:53.340 |
Like their text scores are still much better. 00:14:12.680 |
Like I find it so unlikely it's gonna get better. 00:14:18.680 |
are probably using like instruction models as their base 00:14:26.420 |
There's literally like no chat template for multi-turn, 00:14:39.140 |
and then there's instruction fine tuning, right? 00:14:45.180 |
So like we have Llama base, which is not a chat model. 00:14:50.860 |
Basically, these are just those adapted to vision, 00:14:54.220 |
which means that they're not good chat models, 00:15:08.100 |
I don't know if you guys put the data set out yet. 00:15:19.380 |
- Our leadership probably knew Llama was coming 00:15:33.300 |
something like data set releases are a little bit messier. 00:15:38.660 |
you don't like we're not releasing the images. 00:15:40.740 |
You release the links type of thing and do some checks. 00:15:47.580 |
that filled out the silly Google form asking for interest, 00:16:03.820 |
- We did a little bit of experimenting on it, 00:16:05.860 |
but like in a rushed way and we didn't find it was better. 00:16:23.300 |
but partially because like that's how the model is trained. 00:16:30.420 |
as you mentioned, yeah, Gemini and OpenAI and stuff, 00:16:38.060 |
But it also kind of shows a flaw in benchmarks 00:16:45.700 |
which is pretty flawed because people do both. 00:16:51.180 |
This is the fun part since it's primarily vision-based. 00:16:58.700 |
So they've got a split of different data sets. 00:17:02.900 |
Step one is obviously caption generating from the images. 00:17:06.780 |
So they source web images from diverse sets of data, 00:17:15.340 |
Street signs, food, meme, drawings, websites, 00:17:29.500 |
but yeah, questions were like guided questions. 00:17:46.900 |
A bunch of stuff, people will be verbose, concise. 00:17:52.220 |
so transcribe it using off-the-shelf speech to text. 00:17:59.020 |
or it just probably doesn't matter, I'm guessing. 00:18:07.180 |
I don't think they actually logged the audio. 00:18:11.900 |
but I don't think it was in the terms of the data location 00:18:17.380 |
'cause it'd be a super cool, like, I don't know. 00:18:20.260 |
Like you can see how all these voice features come about. 00:18:26.300 |
it would have been a pretty cool audio dataset too, 00:18:29.140 |
but it looks like the dataset that'll be distributed 00:18:33.860 |
is just gonna be the captioning and image pairs. 00:18:39.100 |
but yeah, there's a very straightforward pre-processing. 00:18:44.660 |
or whatever off the shelf they wanna call it speech to text, 00:18:48.260 |
then process it through an LLM to improve text quality. 00:19:08.700 |
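A rough sketch of the annotation pipeline described here: spoken descriptions are transcribed with an off-the-shelf speech-to-text model and then cleaned up by an LLM. The specific models and the cleanup prompt are assumptions for illustration, not the exact pipeline from the paper:

```python
import whisper
from openai import OpenAI

stt = whisper.load_model("large-v3")        # any off-the-shelf ASR model works here
llm = OpenAI()

def caption_from_speech(audio_path: str) -> str:
    raw = stt.transcribe(audio_path)["text"]   # step 1: spoken description -> raw transcript
    resp = llm.chat.completions.create(        # step 2: LLM pass to fix disfluencies/typos
        model="gpt-4o-mini",                   # placeholder model choice
        messages=[
            {"role": "system",
             "content": "Rewrite this spoken image description as clean, detailed prose. "
                        "Do not add information that is not in the transcript."},
            {"role": "user", "content": raw},
        ],
    )
    return resp.choices[0].message.content

# caption = caption_from_speech("annotator_clip_0001.wav")  # hypothetical file name
```

Note that the paper's concern is about not distilling from proprietary vision-language models; this cleanup step is text-only, and any capable language model, including an open one, could be swapped in here.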
we use these four images to do natural data augmentation. 00:19:17.020 |
with 1.3 million captions, including the augmentation. 00:19:20.140 |
So not a lot, it's basically just fine tuning after that. 00:19:34.420 |
regarding text performance and I guess the presumption 00:19:39.060 |
that Molmo is inferior in text performance. 00:19:43.580 |
I'm actually wondering how much of it is potentially 00:19:50.540 |
because I can imagine for text in particular, 00:20:05.500 |
And we ask questions about the text through the model. 00:20:16.820 |
I think a lot of it's also just, yeah, it's a base model. 00:20:21.340 |
I think that there's a decent bit you could do. 00:20:41.380 |
convert this to JSON, answer stuff about this. 00:20:45.020 |
I don't see this really being like a chat with style model, 00:20:49.580 |
because yeah, there's Llama if you want that, right? 00:21:00.420 |
I see this more as like OCR output in a pipeline, 00:21:11.860 |
if you did it with a chat instruct tune model, 00:21:15.180 |
there wasn't any crazy performance difference. 00:21:19.620 |
there's not much benchmarking on the language models. 00:21:24.860 |
So I guess there is MMLU and they're not the best, 00:21:44.580 |
We've got 40 people on the call that are here for the paper. 00:22:12.100 |
So TLDR, people yap, here's guidance questions. 00:22:17.220 |
They got about a million captions for 700,000 images, 00:22:20.900 |
which I'm like, okay, that's kind of expensive. 00:22:33.540 |
It's basically a collection of answers to a diverse set 00:22:39.260 |
So for this image, what are questions people might ask? 00:22:58.860 |
So I had a really hard time understanding this. 00:23:06.140 |
So I guess what this means is that given image, 00:23:10.660 |
and OCR image for that and the OCR output for the image. 00:23:14.420 |
Does this mean we just apply OCR on the image? 00:23:19.540 |
If the image is an image of scenery or something. 00:23:22.380 |
I mean, if someone can just share the intuition behind this. 00:23:25.980 |
- Nathan, if you have intuition, you'd probably know better. 00:23:40.780 |
there's another thing here like street signs. 00:23:42.780 |
And then there's one that's like primarily charts and tables. 00:23:46.380 |
So I don't have the exact input output of this 00:23:49.820 |
but I'm assuming it's more text-based if there's OCR on it. 00:23:53.300 |
So it's probably, you know, documents, receipts. 00:23:58.060 |
That's what I would assume this subset of the dataset is. 00:24:07.620 |
- Yeah, so we provide caption, we provide OCR output 00:24:13.100 |
and the LM is supposed to answer the question 00:24:31.820 |
especially this dataset and how they created it. 00:25:06.900 |
where we just bootstrapping our synthetic data 00:25:11.340 |
or not as intuitive as compared to the point. 00:25:24.980 |
with the image and the question pairs though. 00:25:30.580 |
the question answering didn't have access to the image. 00:25:47.420 |
so like, I think that's what you were saying. 00:25:49.260 |
Like, it's interesting how you can train Molmo 00:25:52.820 |
to answer questions on an image without giving it the image, 00:25:55.540 |
but no, it's given the image and question answer pairs. 00:25:58.780 |
It's just the annotation answering the question 00:26:01.300 |
is image to OCR to GPT-4 write question answers 00:26:06.580 |
on this OCR that, you know, there's a, what is it? 00:26:11.460 |
There's an annotator to approve the question answers. 00:26:16.580 |
When we train, we have image question answers. 00:26:20.500 |
That actually is a, that's actually very intuitive. 00:26:24.660 |
So essentially what we're saying is that right now, 00:26:27.820 |
What we're going to do, but the model can generate captions 00:26:32.780 |
So we're going to get that model to generate captions 00:26:37.940 |
probably one of their strong LLMs to create QA pairs. 00:26:41.620 |
And now based on that, we now have QA pairs on the image. 00:26:50.860 |
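A rough sketch of the flow being described: a text-only LLM that never sees the pixels writes question/answer pairs from the human caption plus OCR output, and those pairs are then attached to the image for training (after annotator review). The prompt, model name, and JSON format are illustrative assumptions:

```python
import json
from openai import OpenAI

client = OpenAI()

def generate_qa_pairs(caption: str, ocr_text: str, n: int = 3) -> list[dict]:
    prompt = (
        f"Image caption:\n{caption}\n\nOCR output from the image:\n{ocr_text}\n\n"
        f"Write {n} question/answer pairs a user might ask about this image. "
        'Return JSON: [{"question": ..., "answer": ...}]'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                      # placeholder text-only model
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)

# pairs = generate_qa_pairs(caption="A receipt from a coffee shop ...",
#                           ocr_text="Latte $4.50\nTotal $4.50")
# each (image, question, answer) triple is then reviewed before it enters training
```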
- This little nuance here was that the off-the-shelf LLM 00:27:00.100 |
- That's where I'm like, oh, I think you mentioned 00:27:02.860 |
they're training question answering without giving it the image, 00:27:07.980 |
- Yeah, the final Molmo data actually has that, 00:27:10.740 |
but this is the process of them just generating the data. 00:27:15.940 |
And then honestly, I found this one pretty confusing 00:27:20.620 |
The rest, there's just quite a few subsets of these. 00:27:27.420 |
I don't know if I'm sharing my whole screen or just the, 00:27:42.720 |
So it kind of answers, like if you were to ask this, 00:27:52.280 |
So it's labeled data set of that next sample. 00:27:56.420 |
And then they show their example of connecting this thing 00:27:59.220 |
to Apple vision pro and asking it, like, what do I see here? 00:28:10.260 |
There's docs generate code for this much text 00:28:33.600 |
- This is because the model didn't work on clocks. 00:28:35.960 |
And then the lead really worked on clocks 00:28:40.660 |
So they're like, we've got to make it work on clocks. 00:28:43.360 |
One of the interesting things is that it doesn't work 00:28:50.040 |
- Yeah, like a temperature dial or pressure dial. 00:28:55.180 |
But just like such a, like it should be able to, 00:29:05.520 |
These are the insights that like we'd never get, 00:29:07.640 |
but is this just like over fit to a benchmark 00:29:11.680 |
Is this not like it should generalize bit or less? 00:29:20.940 |
Now every visual language model that comes out, 00:29:24.680 |
like it is something that they should be able to do. 00:29:28.760 |
- I mean, a big story of this is that like Matt, 00:29:31.920 |
that Deitke, I don't know how to pronounce his name, 00:29:37.160 |
And all the things that are adding to this dataset 00:29:40.920 |
'cause it kept getting better and better and better 00:29:42.540 |
over like six months of just making the dataset bigger. 00:29:49.680 |
And then they're like, okay, let's make it work. 00:29:53.880 |
Like the demo you guys showed was actually somewhat useful. 00:29:56.920 |
Like, you know, translate this table in a whole image 00:30:04.120 |
Or like stuff like this, like point to Mount Rainier, 00:30:21.660 |
And then of course the rest, there's academic datasets. 00:30:27.360 |
goes into all these very deeply, would recommend. 00:30:35.280 |
This is where I was like, okay, it does good on vision. 00:30:41.800 |
which are, you know, 11 of the commonly used ones. 00:31:09.800 |
but I guess the interesting differentiation there 00:31:18.960 |
Yeah, Phi is doing pretty rough in multimodal, 00:31:31.560 |
This is kind of back to the theme of this paper 00:31:34.160 |
of like stuff, most models use a vision encoder 00:31:49.280 |
So the vision encoder for LLaVA is open weights, 00:31:54.280 |
but the data and code used to generate it was closed. 00:32:02.320 |
A lot of the generation is done with closed stuff. 00:32:17.440 |
is mostly a distillation of proprietary stuff. 00:32:21.800 |
For example, all this proprietary stuff sucks at clocks. 00:32:24.680 |
So nothing that's a distillation will be good at clocks. 00:32:45.160 |
- And also, one point on the Molmo one, right? 00:32:48.560 |
If you scroll up all the way to the top, yeah, over here. 00:33:00.040 |
And that's why it's not because they don't want to open it. 00:33:02.680 |
It's because we just don't know what was in the Qwen data. 00:33:06.840 |
If not, it's everything that the AI2 team did was-- 00:33:11.600 |
- This model tests the definition of open-source AI, 00:33:14.560 |
because it's like, you grab some random vision encoder, 00:33:23.520 |
but we just don't know what the data encode is, yeah. 00:33:26.280 |
- Yeah, but what's the difference between a vision-- 00:33:27.720 |
Like, if the weights of the vision encoder were frozen, 00:33:30.960 |
I think that it could be fine, but I don't know. 00:33:34.320 |
It's weird, because you like say that fine-tunes 00:33:39.340 |
They're not like open-source fine-tunes and stuff. 00:33:42.360 |
- This is a discussion that the open-source initiative 00:33:49.200 |
that I don't have enough time to answer the emails for, 00:33:52.520 |
but it's like, I don't know what to do with it. 00:33:55.360 |
- The interesting thing there was the only closed code for, 00:34:04.800 |
They have the sentence that our vision encoder 00:34:13.600 |
all of our released models use OpenAI's ViT-Large CLIP model, 00:34:23.060 |
it can be reproduced from scratch with MetaCLIP. 00:34:28.600 |
because the data that OpenAI trained CLIP on, 00:34:34.400 |
So it's closed data, and they talked about this too, 00:34:36.980 |
like the paper and the blog post really shows 00:34:41.040 |
how the thing that made vision language models good 00:34:47.400 |
And this is where OpenAI started to go closed source, 00:34:50.720 |
where they're like, okay, we've scraped the web, 00:34:59.000 |
They give the final model, but not the data set. 00:35:04.520 |
Meta comes out with MetaCLIP where they're like, 00:35:17.720 |
but I don't know, I think it's fine that they did it. 00:35:26.400 |
This could have been open if they just used MetaCLIP, 00:35:30.880 |
because I think Meta did it as a proof of concept. 00:35:41.360 |
the gap in data set in the open-source space is so large, 00:35:45.280 |
that the biggest thing is actually more about 00:35:48.520 |
getting the data set open-source and working. 00:36:07.640 |
than is used to train it to the next open model, 00:36:21.920 |
and not have a good data set in the first place. 00:36:39.740 |
And there's still quite a bit that could be done here, 00:37:05.220 |
If people understand and want to dig into any of these, 00:37:10.920 |
So it's an interesting differentiation of open-weight, 00:37:30.120 |
And then across the actual benchmarks themselves, 00:37:41.640 |
on most benchmarks, but the 7B outperforms 4V. 00:37:53.820 |
they'd want to dig into on the benchmarks, we can. 00:38:08.480 |
Yeah, I didn't have much more to dig into in this. 00:38:14.340 |
I thought the ELO ranking was kind of interesting as well. 00:38:20.440 |
Human preference evals use 15K image text prompt pairs. 00:38:29.400 |
triplets to all this to 870 human annotators. 00:38:33.240 |
They were given pair-wise preference rankings 00:38:46.160 |
Their Elo rankings are based on 3x more comparisons than Chatbot Arena 00:38:50.980 |
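For reference, a minimal sketch of how pairwise preferences turn into an Elo ranking; the K-factor and starting ratings are arbitrary choices, not the paper's exact setup:

```python
def update_elo(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Standard Elo update for one pairwise comparison between model A and model B."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

ratings = {"molmo-72b": 1000.0, "gpt-4o": 1000.0}
# one pairwise human judgment: the annotator preferred GPT-4o on this prompt
ratings["gpt-4o"], ratings["molmo-72b"] = update_elo(
    ratings["gpt-4o"], ratings["molmo-72b"], a_wins=True)
print(ratings)
```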
Also, I hear there's some potential tea with LMSYS. 00:38:54.820 |
Maybe that's coming out soon for their vision models. 00:39:00.080 |
- There's potential tea with LMSYS, don't extrapolate. 00:39:00.080 |
But yeah, if anyone knows how ELO ranking works, 00:39:17.120 |
But yeah, most preference ranked was GPT-4o, 00:39:17.120 |
then Molmo-72B, then Gemini, then Sonnet, then the 7B. 00:39:20.880 |
It's really cool how the fourth one is the 7B. 00:39:44.500 |
I'd love to see a recreation of fusion for this 00:39:58.160 |
Also, there was a line here as well in their pre-training 00:40:04.680 |
they talk about how they do this different approach. 00:40:10.480 |
So we don't do RLHF, but we've got the RLHF guy here, 00:40:18.860 |
Is this meant to say like, you know, there's no SFT, 00:40:22.980 |
Well, there is supervised fine tuning for it, 00:40:27.840 |
- For us, team alignment and incentives are hard. 00:40:46.600 |
And it's about the granularity that we should attach 00:40:56.680 |
say from like a satellite view of a parking lot. 00:40:59.320 |
And I want to ask a question that like Eugene could answer 00:41:05.520 |
in the second row, third from the left or something. 00:41:36.780 |
that uses that cool like audio transcription pipeline. 00:41:52.960 |
or image caption datasets be if we want to have, 00:42:17.060 |
- Do you want to TLDR that into one, two line area? 00:42:21.940 |
- How granular should our image caption pairs be, period. 00:42:30.260 |
- I think Nathan probably has more intuition on this, 00:42:32.700 |
but their thing wasn't about granularity, it seems. 00:42:37.460 |
It was about one, like, yeah, you've got diversity, 00:42:46.300 |
So we talked about these four different ones, 00:42:54.180 |
and figure heavy images, including charts and whatnot, 00:42:59.620 |
but there's also all the academic datasets, right? 00:43:18.020 |
and what they do with good captions around it. 00:43:27.100 |
but let's answer this question before moving to another one. 00:43:33.100 |
I think most of this was like really high detail responses 00:43:56.300 |
Captioning tends to overfit to what is being described, 00:43:56.300 |
where our question answers kind of forces the model 00:44:02.460 |
to pick up details that it would previously ignore, 00:44:07.140 |
or the one extra finger on the hand, or whatever it is. 00:44:13.460 |
and you can answer, you're training the model 00:44:23.620 |
another dataset may ask an alternative question instead, 00:44:31.540 |
I think captioning is a problem in overfitting 00:44:49.060 |
I'm going to finish the paper real quick in the takeaways. 00:44:51.820 |
This is a cool little video we can watch after, 00:45:12.980 |
on most academic benchmarks and their ELO ranking. 00:45:23.340 |
comfortably sit between GPT-4V and GPT-4o on both. 00:45:45.100 |
their big one is the best state-of-the-art everything, 00:45:58.660 |
Yeah, so it's basically best model beats everything. 00:46:19.900 |
I probably should have read more into it, but... 00:46:22.020 |
- Excuse me, do you plan to show us something, 00:46:35.860 |
But yeah, I was just going over these highlights. 00:46:53.740 |
So it's state-of-the-art better than proprietary. 00:47:00.460 |
of it does really good on this Android Control, 00:47:23.740 |
So this technical report is getting an update soon. 00:47:41.580 |
but always cool to see training and eval code. 00:47:45.620 |
I guess that's pretty good overview of the paper. 00:47:48.500 |
If there's anything we want to dive deeper into, 00:47:53.540 |
Nathan, if you want to add anything, cook, go over. 00:48:03.700 |
but like I'm ready to get to the section about Whisper, 00:48:03.700 |
There is a little use case thing to just make it useful. 00:48:32.660 |
And then we've got a little update on Whisper. 00:48:53.500 |
- Counting the number of people shows a total of 21. 00:49:15.940 |
- Schwinn bike for sale, blue with white accent. 00:49:26.500 |
I recommend people check out the rest of their blog posts. 00:49:33.180 |
There's a couple other videos if people are interested. 00:49:36.580 |
And then there's just a good little TLDR of the paper. 00:49:50.700 |
Otherwise we have a little Whisper update too. 00:49:58.540 |
Yeah, questions or should we pass to Whisper? 00:50:36.140 |
Thank you so much for covering the Molmo paper 00:50:57.660 |
OpenAI has recently released a new checkpoint 00:51:00.300 |
or version of Whisper called Whisper large-v3-turbo. 00:51:03.780 |
It's supposed to be, like, faster and more efficient 00:51:10.100 |
And it was inspired by the work behind Distil-Whisper. 00:51:15.060 |
then that Jong Wook merged into the OpenAI repository. 00:51:24.060 |
Whisper is like the state of the art ASR model 00:51:33.140 |
on the encoder decoder transformer architecture 00:51:47.740 |
And these hidden states are sent to the decoder 00:52:19.940 |
And then later on they added large V2 and large V3. 00:52:29.140 |
So for example, base has six layers in the encoder 00:52:39.700 |
was actually the same motivation behind Distil-Whisper. 00:52:39.700 |
The second observation is that the decoder performs 00:52:58.020 |
the simpler task of like mapping the hidden states 00:53:04.340 |
is actually extracting these hidden states of the audio 00:53:07.260 |
and trying to understand what's being said in there. 00:53:13.100 |
Whisper is like an autoregressive language model 00:53:25.620 |
and we cannot utilize modern GPUs to parallelize this. 00:53:31.380 |
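A toy illustration of that point, with dummy modules standing in for Whisper's real layers: the encoder sees the whole 30-second mel spectrogram in one parallel pass, while the autoregressive decoder has to run once per generated token, sequentially:

```python
import torch
import torch.nn as nn

encoder = nn.Linear(80, 512)                        # stands in for the audio encoder
decoder_step = nn.GRU(512, 512, batch_first=True)   # stands in for one decoding step
lm_head = nn.Linear(512, 1000)                      # toy vocabulary of 1000 tokens

mel = torch.randn(1, 3000, 80)                      # ~30 s of audio as 80-bin mel frames
audio_states = encoder(mel)                         # encoder: one pass, fully parallel

hidden = audio_states.mean(dim=1).unsqueeze(0)      # squeeze the audio info into a toy state
token_emb = torch.zeros(1, 1, 512)                  # "start of transcript" embedding
for step in range(20):                              # decoder: one pass PER token
    out, hidden = decoder_step(token_emb, hidden)
    next_token = lm_head(out).argmax(-1)            # greedy choice of the next token
    token_emb = torch.randn(1, 1, 512)              # (real code would embed next_token here)
print("1 encoder pass, 20 sequential decoder passes")
```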
by actually comparing the different versions of Whisper. 00:53:37.780 |
could transcribe audio and it is loud and clear. 00:54:00.980 |
that they are actually capable of like modeling the language 00:54:04.660 |
but they just struggle with the audio sometimes. 00:54:12.580 |
Whisper Turbo is like a new and more efficient version 00:54:19.300 |
It uses mainly two techniques to achieve this improvement 00:54:19.300 |
to like reduce the compute and memory requirements 00:54:52.220 |
of a model without impacting the accuracy as much. 00:54:55.220 |
In Whisper Turbo, the pruning was done only in the decoder 00:55:12.020 |
So the turbo model is like 1.78 times smaller 00:55:24.980 |
you probably will miss all the capabilities of the model. 00:55:29.340 |
Like the model will probably degrade and collapse 00:55:32.380 |
and will not be able to even generate coherent text. 00:55:35.300 |
You will have to, like, train these four layers 00:55:35.300 |
you train the model on like a relatively big amount of data 00:55:51.460 |
that it has forgotten or like got confused about. 00:55:57.020 |
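A rough sketch of that prune-then-retrain idea, assuming the Hugging Face transformers Whisper implementation; which layers to keep is an illustrative guess, not OpenAI's actual recipe, and the pruned model would need the continued training described here before it is usable:

```python
import torch.nn as nn
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
decoder = model.model.decoder

# Keep only a handful of the 32 decoder layers (first two and last two here,
# purely as an illustration of the layer-selection step).
keep = [0, 1, len(decoder.layers) - 2, len(decoder.layers) - 1]
decoder.layers = nn.ModuleList(decoder.layers[i] for i in keep)
model.config.decoder_layers = len(decoder.layers)

# At this point the model will not produce coherent transcripts; it needs
# continued (pre-)training on a large transcription dataset to recover.
print(f"decoder now has {len(decoder.layers)} layers")
```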
the continued pre-training happened on the same dataset 00:56:01.220 |
in the original training of Whisper Large V3. 00:56:03.700 |
So for reference, the dataset consisted of two epochs. 00:56:10.580 |
1 million out of these were, like, weakly supervised 00:56:21.220 |
So we had like 5 million hours in each epoch. 00:56:23.620 |
This means that we have 10 million hours in two epochs. 00:56:27.060 |
So this was like the size of the original training data. 00:56:34.620 |
except we're only using the transcription data. 00:56:45.460 |
So yeah, this gives us a hint about like the size 00:56:56.380 |
we used a linearly decaying learning rate, 00:57:21.260 |
But we'll talk about this like in a minute or two. 00:57:26.940 |
So they both have the same number of encoder layers, 00:57:31.540 |
but the Whisper Turbo has four decoder layers 00:57:49.580 |
And the task for Distilled Whisper is transcription 00:57:58.460 |
- The difference with Whisper Turbo and Distilled Whisper 00:58:01.100 |
is primarily the pruning versus distillation. 00:58:04.540 |
It seems like we're giving a lot of benefit towards pruning, 00:58:14.820 |
because I also assume pruning versus distillation 00:58:18.540 |
is a big part of it and that's why it can stay. 00:58:20.940 |
It's interesting that it can stay multilingual. 00:58:25.700 |
but how much of this do you think is based on the strategy 00:58:37.220 |
Like how much of a difference do you think is caused? 00:58:50.380 |
out of smaller size comes from pruning versus data? 00:58:53.460 |
Because they don't really mention how good this data is. 00:59:00.220 |
So we kind of like have some info about the dataset. 00:59:02.420 |
For Whisper Turbo, it uses exactly the same dataset 00:59:08.420 |
except for the translation section, which is excluded. 00:59:12.420 |
So this is like a very big dataset of like 10 million, 00:59:16.660 |
And it has like many, many different languages. 00:59:19.140 |
So this is a very diverse and large and robust dataset, 00:59:24.900 |
On the other hand, Distilled Whisper uses English only data. 00:59:39.500 |
is trying to use actually higher quality data, 00:59:54.140 |
for Distilled Whisper, it's just a few thousand hours. 00:59:58.660 |
For example, Distilled Whisper is only for English. 01:00:00.620 |
You cannot use it like for French, German, Spanish, 01:00:05.060 |
And I believe like if you train a single model 01:00:16.900 |
Like it is a robust model because it is trained 01:00:21.060 |
encompassing different domains and languages. 01:00:25.860 |
- So I hope this kind of answers your question. 01:00:27.620 |
- Really, really helps with the intuition there. 01:00:41.860 |
So there was a question from swyx about like, 01:00:44.980 |
how do you actually do distillation for an ASR model? 01:00:51.860 |
So let's say you want to distill Llama 405B to Llama 3B. 01:00:56.100 |
The way you do this is you try to like extract 01:01:07.420 |
the way you train Whisper is a language model 01:01:19.420 |
And you only like, the difference is the input 01:01:28.940 |
on the next token predicted by the bigger model. 01:01:31.660 |
So kind of like synthetic data or like pseudo data, 01:01:41.540 |
is you're trying to like make the output distribution 01:01:48.500 |
to the output distribution of the bigger model. 01:01:51.940 |
but like the entire distribution of the next token. 01:01:59.980 |
you're only like training it on the next token. 01:02:04.260 |
You don't care about the top nine or top 10 predictions. 01:02:07.740 |
But if you wanna train or like gonna do knowledge distillation 01:02:13.860 |
you might want to do something called KL divergence training. 01:02:17.780 |
And in this way, you try to teach the smaller model 01:02:25.700 |
get the top 10 predictions for the next token correctly. 01:02:30.060 |
And this actually helps the model learn a lot more. 01:02:32.340 |
You're giving much more information for each data point 01:02:38.780 |
So this is like how you do a knowledge distillation 01:02:42.020 |
for an ASR model in like a very brief overview. 01:02:52.900 |
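A minimal sketch of that KL-divergence distillation objective: the student is pushed to match the teacher's whole next-token distribution rather than just the single argmax token. The temperature, the T-squared scaling, and the tensor shapes are conventional choices, not a specific recipe from Distil-Whisper:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over the next-token distribution at every position."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # scale by T^2 as is conventional so gradients don't shrink with temperature
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# (batch, seq, vocab) logits; the vocab size here is just a placeholder
student_logits = torch.randn(4, 100, 50000)   # from the small model being trained
teacher_logits = torch.randn(4, 100, 50000)   # from the frozen big model
print(distillation_loss(student_logits, teacher_logits))
```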
- There's another paper that came out recently 01:02:54.500 |
that talks about distillation and the three types. 01:03:01.020 |
and you match input output type distillation, 01:03:25.660 |
is Whisper Turbo fast enough to be real-time? 01:03:30.820 |
and how does real-time work with chunk-based decoding 01:03:33.740 |
is something asked, if you have intuition around this. 01:04:06.020 |
So the concept of like real-time is kind of like debatable, 01:04:18.380 |
is the Whisper model fast enough to be real-time? 01:04:25.140 |
If you run it on a T4 or like anything that's bigger 01:04:39.540 |
And this is even applicable with like the large V2 01:04:41.700 |
and large V3, and it's going to be even faster 01:04:43.940 |
with like Distil-Whisper or like Whisper Turbo. 01:04:46.780 |
So yes, I think Whisper can actually be very, 01:04:49.420 |
very good at real-time kind of transcription. 01:05:00.300 |
- So one of the ways to do real-time transcription 01:05:14.260 |
is that you have to give it 30 seconds of audio. 01:05:18.980 |
like you have to pad the remaining 29 seconds. 01:05:21.540 |
So you're always giving it 30 seconds of audio 01:05:25.740 |
And then depending on actually how long your speech is, 01:05:29.260 |
it will maybe generate two tokens or 200 tokens. 01:05:35.620 |
it will probably generate only two or three tokens 01:05:40.340 |
So you're spending the same time on the encoder, 01:05:42.620 |
but you're spending significantly less time on the decoder. 01:05:45.660 |
So one way to utilize this in real-time decoding 01:05:49.420 |
is continuously give Whisper maybe 300 milliseconds of audio 01:05:59.460 |
and then ask it to generate like the five or 10 tokens, 01:06:02.820 |
or maybe even less that correspond to 300 milliseconds. 01:06:15.180 |
So you keep doing this multiple, multiple times 01:06:17.900 |
until you run into your first batch of chunks, 01:06:22.900 |
and then you can just shift the audio buffer a bit. 01:06:29.340 |
but it's basically about continuously feeding the model 01:07:08.020 |
You can just keep adding the chunks without overlapping. 01:07:29.780 |
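A rough sketch of the buffered pseudo-streaming approach described here, using the openai-whisper package; the chunk size, buffer length, and model name are illustrative choices and assume a recent release that ships the turbo checkpoint:

```python
import numpy as np
import whisper

SAMPLE_RATE = 16000
model = whisper.load_model("turbo")   # assumes an openai-whisper version with the turbo checkpoint

buffer = np.zeros(0, dtype=np.float32)

def on_new_chunk(chunk: np.ndarray) -> str:
    """chunk: ~300 ms of 16 kHz mono float32 audio from the microphone."""
    global buffer
    buffer = np.concatenate([buffer, chunk])       # keep appending small chunks
    if len(buffer) > 25 * SAMPLE_RATE:             # once the buffer gets long,
        buffer = buffer[-25 * SAMPLE_RATE:]        # shift it by dropping the oldest audio
    # whisper pads/trims the input toward a 30 s window internally before encoding,
    # so each call spends roughly constant time in the encoder and only a little
    # time in the decoder for the few tokens the short audio actually contains
    return model.transcribe(buffer, fp16=False)["text"]
```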
- If you want to keep going through, go through. 01:07:37.980 |
I asked Jong Wook about which hyperparameters to use. 01:07:42.060 |
She said, you can start with like 1e-5 01:07:51.060 |
But it can be like fine tuned just like any other model. 01:08:05.460 |
So yeah, it's fine tuneable just like any other model. 01:08:17.180 |
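A minimal sketch of what fine-tuning the turbo checkpoint with that suggested learning rate might look like, assuming the Hugging Face transformers Seq2SeqTrainer; the dataset, data collator, and most arguments are placeholders:

```python
from transformers import (WhisperForConditionalGeneration, WhisperProcessor,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3-turbo")

args = Seq2SeqTrainingArguments(
    output_dir="whisper-turbo-finetuned",
    learning_rate=1e-5,                 # the starting point suggested in the discussion
    warmup_steps=100,
    max_steps=2000,
    per_device_train_batch_size=8,
    fp16=True,
)
# trainer = Seq2SeqTrainer(model=model, args=args,
#                          train_dataset=my_dataset,    # hypothetical prepared dataset
#                          data_collator=my_collator)   # hypothetical padding collator
# trainer.train()
```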
people also asked how it compares to faster-whisper 01:08:21.740 |
So we've answered the question about Distil-Whisper, 01:08:27.780 |
because faster-whisper is like an inference engine. 01:08:33.260 |
It can be used to deploy any of the other versions 01:08:51.580 |
So any question before we go into the benchmarks? 01:09:05.980 |
I highly recommend that you benchmark on your own data. 01:09:26.980 |
so there is no way it was in the training data, hopefully. 01:09:37.580 |
so he's speaking English with kind of like an accent. 01:09:43.020 |
Hopefully it was not in the training data as well. 01:09:50.140 |
So yeah, Karpathy is known to be talking relatively fast, 01:10:16.100 |
I benchmarked the different sizes or versions 01:10:19.420 |
of like the different data points we discussed. 01:10:30.140 |
It's one of the least accurate models out there. 01:10:33.780 |
But the other models are like kind of similar, 01:10:57.260 |
It can be caught in an endless repetition loop 01:11:06.220 |
on these data points is doing very, very well 01:11:11.620 |
But on other benchmarks that I did for like a projects 01:11:18.380 |
but I can give hints that it was not very good. 01:11:20.660 |
It was doing much worse than the other models. 01:11:30.140 |
you might see like very different numbers from this. 01:11:32.980 |
So this is like the WER; the CER is very similar as well. 01:11:38.860 |
compared to the others and large V3 can be a hit or miss. 01:11:46.100 |
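A small sketch of "benchmark on your own data": compute WER and CER for a few Whisper variants against your own reference transcript, here using the jiwer library; the file names are placeholders:

```python
import jiwer
import whisper

reference = open("my_reference_transcript.txt").read()   # your ground-truth transcript

for name in ["base", "large-v3", "turbo"]:
    model = whisper.load_model(name)
    hypothesis = model.transcribe("my_audio.wav")["text"]
    print(name,
          "WER:", round(jiwer.wer(reference, hypothesis), 3),
          "CER:", round(jiwer.cer(reference, hypothesis), 3))
```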
super quick introduction about Whisper Turbo. 01:11:56.980 |
- 'Cause there's no MLX stuff, what do you mean, MLX? 01:11:56.980 |
So I don't have a Mac, so I don't care about MLX. 01:12:15.540 |
So yeah, just as a hint, I think MLX is not good. 01:12:20.380 |
So yeah, just as a hint, I think MLX is like a framework 01:12:24.020 |
for deploying or like using machine learning models 01:12:29.660 |
and presumably it's very fast and very efficient. 01:12:32.660 |
But I don't have a Mac, so no data points from my side. 01:12:45.860 |
Always excited by your Whisper updates and explanations. 01:12:49.540 |
Really, really like those slides, those shadow slides.