back to index

[Paper Club] Molmo + Pixmo + Whisper 3 Turbo - with Vibhu Sapra, Nathan Lambert, Amgadoz


Whisper Transcript | Transcript Only Page

00:00:00.000 | - I'll go through here, let me set up real quick.
00:00:05.000 | If anything's in the chat, let me know.
00:00:08.440 | Okay, so Momo and Pixmo.
00:00:10.960 | This is from AI21.
00:00:13.880 | Basically really, really good people.
00:00:15.520 | If people haven't read research papers before,
00:00:18.320 | this is probably one of them.
00:00:19.520 | - AI2 and AI21 are different things.
00:00:22.520 | - AI2, my bad.
00:00:23.880 | - It's hilarious, so many people mess it up.
00:00:25.760 | We also get A12.
00:00:27.600 | - A12, oh, are they another one or?
00:00:29.980 | - No, it's if you misread the I, it's a one.
00:00:32.960 | - Okay, so AI2.
00:00:33.800 | - Makes it even more confusing with the 21.
00:00:36.480 | - Is there an AI21?
00:00:37.760 | Okay, I've been shilling the wrong one,
00:00:41.240 | but good correction, good to know.
00:00:43.480 | So AI2 though.
00:00:44.960 | So basically this is a proper open source,
00:00:47.980 | open weights model, and it's a VLM.
00:00:50.600 | So if anyone hasn't read papers before,
00:00:52.800 | this is one that I highly recommend reading.
00:00:54.920 | Clear problem, clear solution, clear explanation
00:00:57.800 | of what they do, the data set and everything.
00:00:59.760 | Very, very easy read, no crazy math.
00:01:02.880 | So yeah, highly, highly recommend.
00:01:04.840 | But TLDR, it's a set of different models
00:01:08.500 | at different sizes that are good vision language models
00:01:11.240 | that solve a different problem.
00:01:12.720 | So the high level of this is most VLMs
00:01:17.000 | are distillations of proprietary close source models, right?
00:01:20.160 | So if you need to generate synthetic data,
00:01:24.400 | like most open weight models rely heavily
00:01:26.600 | on synthetic data from private models.
00:01:28.560 | So think GPT-4V, GPT-4O, Gemini, whatnot.
00:01:32.800 | And they label a lot of data
00:01:35.440 | and then they create a vision language model
00:01:37.940 | that's open weight, something like lava.
00:01:40.120 | But all they're doing is kind of distilling
00:01:42.240 | that proprietary knowledge into an open knowledge.
00:01:44.440 | But this is kind of the key highlight I take away
00:01:46.480 | that the community is still missing foundational knowledge
00:01:50.280 | about how to build these models from scratch.
00:01:52.740 | So they kind of go into everything from data set training,
00:01:56.480 | how they label data and how they can do this whole mix.
00:01:59.880 | And some of the other papers we covered in paper club
00:02:02.160 | are stuff like Clip, OpenClip, which are captioners,
00:02:04.960 | the whole history of these
00:02:06.040 | and how captioning is pretty important for this.
00:02:08.640 | But basically, yeah, they do that.
00:02:11.000 | They explain the data set mixture,
00:02:13.280 | how to do this fine tuning.
00:02:14.540 | They're gonna release it all.
00:02:15.640 | There's a set of different,
00:02:17.920 | a class of different models that are gonna come out.
00:02:20.240 | And the really interesting thing is they're good.
00:02:23.240 | So lava is kind of starting to lag behind,
00:02:26.360 | but this thing is outperforming.
00:02:29.360 | So there's two ways to benchmark this.
00:02:31.560 | You've got standard vision benchmarks,
00:02:33.240 | and then you've got like actual usage.
00:02:35.400 | And their stuff is on par with GPT-4V.
00:02:39.560 | They're starting to surpass it
00:02:40.960 | with some of the larger models and they're very small.
00:02:43.180 | So yeah, diving into that, we talk about the data set,
00:02:46.720 | how they do the training data, the architecture of it,
00:02:49.500 | which is kind of what you would expect
00:02:51.120 | if you joined our other paper clubs on this,
00:02:52.720 | but let's go a little bit more into that.
00:02:55.080 | So one of this is how can we generate this data
00:02:58.440 | from non-proprietary models?
00:03:00.080 | So we can't use a closed source model to do this captioning.
00:03:03.020 | How do we do it?
00:03:03.920 | Some of the background that we learned
00:03:05.420 | from other papers like Clip and whatnot
00:03:08.000 | is that it's hard to generate really good, diverse,
00:03:11.320 | like robust descriptions from images.
00:03:14.180 | And this is something that like the solution
00:03:17.500 | that they found was to have audio labels.
00:03:19.520 | So they kind of guided people towards,
00:03:21.680 | here's a set of about a million images.
00:03:24.100 | Here's some prompting questions.
00:03:25.520 | Here's a timeline.
00:03:26.360 | So people answer them concisely.
00:03:28.120 | From that, they have a data set.
00:03:29.720 | From that, they train it into a model
00:03:31.800 | that's based on their open source OMO models.
00:03:34.420 | So that's the big thing.
00:03:36.240 | They can do this end to end without anything proprietary.
00:03:39.660 | And yeah, let's like go through it a little bit more.
00:03:42.240 | So instead of being a distillation,
00:03:44.240 | it's independently pre-trained.
00:03:46.520 | There's a vision encoder and a language model
00:03:48.800 | that's jointly trained.
00:03:50.320 | There are some interesting stuff that they call out
00:03:52.240 | like traditionally in stuff like Alava and different adapters,
00:03:55.480 | what they'll do is they'll train this vision encoder.
00:03:58.400 | They'll have an LLM, they'll freeze weights,
00:04:01.200 | and then they'll kind of just merge them.
00:04:02.880 | They avoid doing that multi-stage pre-training.
00:04:05.660 | They don't freeze parts of it.
00:04:06.960 | They just kind of do it all as one training run.
00:04:09.960 | They show how important high quality data is.
00:04:13.480 | So they do this, I think,
00:04:14.420 | on roughly a million samples of data.
00:04:16.600 | There's some augmentation.
00:04:17.800 | Like they pass through these label data sets through LLMs.
00:04:21.440 | They augment them.
00:04:22.600 | They do augment it a little bit,
00:04:24.000 | but point being they can get better than Gemini 1.5,
00:04:28.240 | better than Cloud 3.5 Sonnet,
00:04:30.080 | better than GPT-4V at a much smaller size
00:04:33.460 | with about a million samples of data,
00:04:35.240 | which is very impressive, right?
00:04:37.060 | So good, high quality data,
00:04:39.280 | and then how to do it from scratch.
00:04:40.920 | Now, the other little interesting thing here
00:04:43.200 | was it's not as accessible as it seems.
00:04:46.040 | Like they still have to pay for a million labeled samples,
00:04:48.960 | but point being still very useful.
00:04:51.020 | And then we can distill it down to other models.
00:04:53.080 | So yeah, this is one of the main challenges, right?
00:04:57.080 | It's hard to collect captioning data.
00:04:59.280 | So for example, like if you're given an image here
00:05:02.760 | and you're told to describe it,
00:05:04.680 | like I would probably just say this is Seattle.
00:05:07.120 | I could add in things like, you know,
00:05:08.320 | I see the Space Needle.
00:05:09.800 | I see Mount Rainier.
00:05:11.300 | I see trees.
00:05:12.400 | But if you just blindly give images to either LLMs
00:05:17.400 | or whatever, multimodal models or people,
00:05:19.760 | we don't give good descriptions, right?
00:05:21.380 | We don't give like robust, you know,
00:05:23.260 | a city landscape with a lot of skyscrapers,
00:05:26.180 | trees in the foreground, water in the background,
00:05:28.600 | construction on the side.
00:05:30.100 | We don't do that.
00:05:31.060 | But they kind of broke down how they get this type of data
00:05:34.940 | and the different segments.
00:05:36.220 | And then I think soon the dataset is coming out,
00:05:38.840 | but that's basically the challenge, right?
00:05:41.620 | So they don't wanna get this from larger proprietary models,
00:05:45.900 | which can, like you can write a really good robust prompt
00:05:48.560 | and get good descriptions,
00:05:50.200 | and then you can create a lava style model,
00:05:52.280 | but you're no longer doing this from scratch.
00:05:54.080 | You're doing it from a distillation of proprietary work.
00:05:56.500 | So they wanted to do it with annotators.
00:05:58.360 | Here's kind of the solution.
00:05:59.740 | We ask annotators to describe the images in speech
00:06:02.840 | for 60 to 90 seconds,
00:06:04.240 | rather than asking them to write descriptions.
00:06:06.680 | They prompted them to describe everything in great detail,
00:06:09.360 | including descriptions of spatial positioning
00:06:11.560 | and relationships.
00:06:12.600 | So stuff like, you know,
00:06:13.720 | there's the Space Needle in the middle
00:06:15.720 | of a bunch of buildings.
00:06:18.300 | With that, the modality switching trick,
00:06:21.320 | annotators provide far more detailed descriptions
00:06:24.160 | in less time.
00:06:25.000 | So part of this is, you know,
00:06:26.840 | you keep it 60 to 90 seconds.
00:06:28.740 | They gave them guiding questions,
00:06:30.100 | but that was basically the takeaway.
00:06:33.240 | Like we're not good at writing descriptions,
00:06:34.920 | but we're good at talking about and describing what we want.
00:06:37.920 | And then from there,
00:06:40.160 | they have like a little bit more into what this dataset is.
00:06:43.640 | So they've got positional data in some of the prompts,
00:06:45.800 | they've got high level,
00:06:47.100 | they've got different types of data.
00:06:49.020 | They've got like,
00:06:50.040 | part of the dataset is about charts, tables,
00:06:53.120 | descriptions of that.
00:06:54.000 | So the model can pick it up.
00:06:55.160 | And then there's some use cases
00:06:56.320 | where this stuff just blows out GPT-4v
00:06:59.080 | from like out of the water.
00:07:00.240 | So there's a good blog post on the website
00:07:03.520 | that kind of shows how you could do this real time.
00:07:05.960 | It's a very Apple Intelligence or Google Lens type example,
00:07:09.180 | where I think the team put it on the Apple Vision Pro
00:07:11.740 | and they just look at stuff and they're like,
00:07:13.640 | "Hey, where is this in the image?"
00:07:15.280 | And it can actually annotate the image
00:07:17.500 | because that's part of the training set.
00:07:18.820 | So one more use case of high quality, well-labeled data.
00:07:22.940 | Benchmarks-wise,
00:07:23.940 | they did the traditional 11 academic benchmarks.
00:07:27.520 | It does well.
00:07:28.680 | The size of the models they put out.
00:07:31.200 | So there's MOMO-1b,
00:07:32.820 | which is based on their open source language model,
00:07:35.220 | which is a MOE 1 billion active parameter model.
00:07:38.420 | That performs on par with GPT-4v
00:07:42.380 | on both user preference and academic benchmarks.
00:07:45.360 | Then there's the 7b,
00:07:47.840 | which is based on their 7b language model and Quen-7b.
00:07:51.200 | That outperforms GPT-4v in 4.0.
00:07:54.620 | Then there's a large one, which is MOMO-72b.
00:07:57.320 | 72b is based on Quen-72b.
00:08:00.200 | And that, it outperforms stuff on academic benchmarks.
00:08:05.200 | And then on human preference, it's behind 4.0.
00:08:08.180 | So preference-wise,
00:08:09.480 | if you guys remember how we do vision benchmarks,
00:08:14.480 | there's stuff like Vision QA,
00:08:15.940 | where you can QA what's in this image.
00:08:18.380 | It's similar to Trivia QA, where it's just understanding.
00:08:21.180 | There's 11 benchmarks,
00:08:22.300 | and they're not realistic of how people use it.
00:08:24.180 | So they also have this user preference stuff.
00:08:26.340 | User preference is basically like an ELO score of models.
00:08:29.660 | But TLDR, these are very good small models.
00:08:32.440 | This last sentence here is pretty interesting, right?
00:08:35.380 | "Our best models outperform state-of-the-art
00:08:37.500 | "proprietary systems, including Gemini 1.5 Pro
00:08:40.640 | "and Flash and Cloud 3.5 Sonnet."
00:08:43.000 | That's pretty huge.
00:08:44.240 | Architecture, if you guys know about other multimodal models,
00:08:48.360 | it's an interesting adapter
00:08:50.200 | with a little twist of traditionally, like we said,
00:08:54.040 | there's this mix of,
00:08:56.600 | you have a pre-trained off-the-shelf vision encoder
00:08:59.760 | and a language model joined in one space.
00:09:02.480 | Typically, we freeze the language model,
00:09:05.600 | and we do contrastive, there's a term for this,
00:09:10.600 | but you basically merge these embedding spaces.
00:09:13.220 | I'm blanking on the term.
00:09:14.260 | In this case, they're like, "No, we don't need to do that."
00:09:17.400 | So they go over the four steps of this architecture.
00:09:21.380 | Pretty straightforward.
00:09:22.580 | So the first step is a preprocessor.
00:09:24.380 | It converts the image into multi-scale, multi-crop images.
00:09:27.340 | This allows for diversity in inputs.
00:09:29.780 | So stage one of architecture is a preprocessor.
00:09:32.500 | Second, it's a VIT image encoder
00:09:34.740 | that maps these images into vision tokens
00:09:39.260 | in an embedding space.
00:09:40.660 | Three, there's the connector
00:09:42.020 | that maps and merges these two embedding spaces.
00:09:44.740 | So the vision encoder, which I believe is based on CLIP,
00:09:48.060 | and they mentioned,
00:09:49.860 | we know CLIP is trained on proprietary data,
00:09:52.340 | but there's a paper called OpenCLIP from Meta,
00:09:54.340 | which shows you can do OpenCLIP.
00:09:56.580 | They just use regular CLIP, which is, I think it's fine.
00:09:59.660 | So stage one, preprocess images.
00:10:02.260 | Stage two, vision encoder.
00:10:04.380 | Stage three, contrastive embeddings,
00:10:06.720 | which merge the embedding dimension
00:10:09.920 | of the vision and the language model,
00:10:12.700 | and that's not frozen.
00:10:14.420 | And then stage four is a decoder-only transformer LLM
00:10:17.500 | to produce text.
00:10:19.240 | That's the architecture.
00:10:20.380 | If people have questions on what's going on here,
00:10:23.320 | we can probably dig into it,
00:10:24.580 | but for people that joined our other multimodal stuff,
00:10:27.420 | you should kind of understand how this works.
00:10:30.300 | There's a good blog post we went over as well,
00:10:33.140 | I think from Chip Huen,
00:10:34.860 | or another one that we can also share as a reference.
00:10:38.820 | - Yeah, the Flamingo post.
00:10:40.100 | - There's what?
00:10:42.060 | - I think that was her Flamingo post.
00:10:45.180 | - Yeah, her Flamingo is a really good one
00:10:47.640 | on how multimodal stuff works.
00:10:49.000 | So highly recommend, but also we can dig into it here.
00:10:51.660 | And then this is kind of that part of the vision encoder.
00:10:54.300 | They use OpenAI's VIT-Large CLIP model,
00:10:57.880 | and we know that CLIP is trained on closed source data,
00:11:03.720 | and they understand this,
00:11:06.200 | but recently MetaCLIP came out from Meta,
00:11:10.360 | and they show that you can reproduce it from scratch.
00:11:13.160 | They use OpenAI because it's trained for higher resolution.
00:11:16.360 | Meta's was more of a proof of concept.
00:11:18.460 | Proof of concept meaning, you know,
00:11:21.840 | here's how you can do it.
00:11:22.920 | We do it efficiently at low resolution.
00:11:25.240 | For the actual production, they just use CLIP,
00:11:27.120 | and I'm fine with that.
00:11:28.520 | If people disagree, that's a fair topic too.
00:11:31.320 | And then the LLM they use
00:11:33.120 | is their recently put out OMO models,
00:11:35.100 | which are very similar to this.
00:11:36.360 | They're open language models.
00:11:38.440 | They've got the range, and then for the big one,
00:11:40.560 | it's OpenWeight Quens.
00:11:43.680 | Now the data set and training was kind of-
00:11:46.200 | - Do you wanna pause for any commentaries, Nathan?
00:11:49.200 | - Yeah.
00:11:50.040 | - Otherwise, yeah.
00:11:50.860 | - I was just gonna kind of let him cook
00:11:55.060 | and then go back to the beginning.
00:11:56.760 | - Oh, okay, all right.
00:11:59.040 | I'm gonna interleave commentary.
00:12:00.820 | - I thought about interrupting.
00:12:02.120 | I had unmuted.
00:12:03.620 | I just think that it's,
00:12:05.680 | I mean, I have this whole take that I've written
00:12:07.360 | is that the whole vision space is just so underdeveloped
00:12:10.680 | that a lot of these things that seem surprising,
00:12:12.480 | I actually think probably aren't surprising.
00:12:14.680 | And the things that this model is good at
00:12:16.320 | are things that all the foundation companies,
00:12:17.880 | like they're just gonna take our data and train on it.
00:12:20.440 | And it's fine tuning, so it's like a million data points
00:12:22.880 | is actually kind of a lot, and all this stuff.
00:12:25.080 | And it's just like a lot of,
00:12:27.400 | I think that it's just like good data works,
00:12:29.920 | and they figured out some cool new applications,
00:12:32.100 | like clock reading and pointing, and then the data works.
00:12:35.260 | There is actually a niche reference of somebody
00:12:37.320 | that did this kind of voice annotation earlier.
00:12:41.380 | I can find it.
00:12:42.220 | 'Cause someone had mentioned it after we released it,
00:12:44.980 | saying that we were the first people to do it.
00:12:47.520 | So then somebody's like,
00:12:48.580 | oh, but you have to cite this old paper,
00:12:50.580 | which is kind of fun.
00:12:51.420 | So I can go find that.
00:12:53.060 | - Okay.
00:12:55.620 | I love the take of me being like, I love it.
00:12:58.300 | It's so straightforward and it works,
00:13:00.120 | and people should have done it.
00:13:01.020 | And then Nathan here like, yeah, they'll do it too.
00:13:04.200 | And it's cool to see the actual use cases.
00:13:07.260 | The website has a pretty good blog post on this,
00:13:09.380 | and there's a good, very professionally shot video
00:13:12.100 | on how this excels.
00:13:13.680 | And it's always nice to see that niche of like,
00:13:17.400 | there's so much underdeveloped alpha left in vision space
00:13:21.320 | that you train the right data set and it gets good.
00:13:25.360 | But it's still cool to see how it's like,
00:13:28.260 | significantly better than Lava.
00:13:30.420 | It's all open.
00:13:31.920 | So I like it.
00:13:33.460 | Any other questions from chat and whatnot?
00:13:36.420 | I haven't been following.
00:13:37.420 | Is there anything we should pause and break on or?
00:13:41.800 | - Something that's worth saying is that
00:13:43.300 | I think the other models are much better as text models.
00:13:45.980 | Even like Lama, where they're like vision fine tuning
00:13:48.340 | is generally accepted as being like kind of meh
00:13:50.620 | for whatever drama reasons you want to attribute to it.
00:13:53.340 | Like their text scores are still much better.
00:13:56.280 | So they have like more of a text focus,
00:13:58.000 | which OpenAI probably does as well.
00:14:00.600 | - Is there a drop in performance on these?
00:14:03.400 | Like the Quen 72 versus Momo 72,
00:14:06.040 | is the text performance also taking a hit?
00:14:09.800 | - I honestly don't even know,
00:14:11.080 | but probably to some extent.
00:14:12.680 | Like I find it so unlikely it's gonna get better.
00:14:15.160 | Or even like some of these other models
00:14:18.680 | are probably using like instruction models as their base
00:14:21.240 | before doing the vision stuff.
00:14:22.640 | But this is just like straight base model,
00:14:25.100 | no real instruction tuning.
00:14:26.420 | There's literally like no chat template for multi-turn,
00:14:28.820 | it just concatenates the messages together
00:14:30.740 | and like, there's like, good luck.
00:14:33.840 | - So for other people that don't follow,
00:14:38.180 | there's base models
00:14:39.140 | and then there's instruction fine tuning, right?
00:14:41.120 | So you can base models predict next token,
00:14:43.660 | instruction models are chat models.
00:14:45.180 | So like we have Lama base, which is not a chat model.
00:14:48.100 | You can't chat with it.
00:14:48.940 | It'll complete your words.
00:14:50.860 | Basically, these are just those adapted to vision,
00:14:54.220 | which means that they're not good chat models,
00:14:56.580 | but they do perform well
00:14:58.140 | and they do what they're supposed to do.
00:15:01.260 | But TLDR, yeah, you know,
00:15:03.820 | they've done so much open source work.
00:15:06.260 | So maybe someone should recreate.
00:15:08.100 | I don't know if you guys put the data set out yet.
00:15:10.020 | I hear it's coming soon.
00:15:11.500 | - It's just like needing to clean it up.
00:15:15.180 | So it's like the brush of get the model out.
00:15:17.020 | And I mean,
00:15:18.540 | - Makes sense.
00:15:19.380 | - Our leadership probably knew Lama was coming
00:15:21.340 | on a specific day.
00:15:22.460 | If you look at the timing,
00:15:23.860 | like there's some of that needed to happen.
00:15:27.540 | I mean, the data sets just being cleaned.
00:15:29.660 | - Okay.
00:15:30.500 | - I mean, if you do anything with images,
00:15:33.300 | something like data set releases are a little bit messier.
00:15:36.900 | You have to check for like,
00:15:38.660 | you don't like we're not releasing the images.
00:15:40.740 | You release the links type of thing and do some checks.
00:15:44.500 | But yeah, it's coming.
00:15:45.940 | I think there's like hundreds of people
00:15:47.580 | that filled out the silly Google form asking for interest,
00:15:50.380 | which I'm not surprised.
00:15:51.660 | It's just like.
00:15:52.500 | - Yeah.
00:15:55.260 | Well, that's a plug for, you know,
00:15:57.180 | they started 90% of the open source stuff.
00:15:59.380 | Now take a chat instruction tune model
00:16:01.660 | and do the same thing.
00:16:02.580 | If anyone does it.
00:16:03.820 | - We did a little bit of experimenting on it,
00:16:05.860 | but like in a rushed way and we didn't find it was better.
00:16:09.340 | So they didn't add it in.
00:16:10.940 | - Hmm. Interesting.
00:16:11.860 | - But like this is very visual oriented.
00:16:15.220 | The demo, you can't send, you need an image
00:16:18.540 | and you can't send a text only query to it.
00:16:21.820 | Partially for saving money,
00:16:23.300 | but partially because like that's how the model is trained.
00:16:26.300 | - Good to know.
00:16:28.420 | Good to know.
00:16:29.260 | I think the interesting thing there is like,
00:16:30.420 | as you mentioned, yeah, Gemini and OpenAI and stuff,
00:16:33.140 | they'll basically train on this data set
00:16:34.980 | and it'll kind of improve what they do too.
00:16:38.060 | But it also kind of shows a flaw in benchmarks
00:16:41.260 | 'cause our vision benchmarks don't account
00:16:44.220 | for like text output as well,
00:16:45.700 | which is pretty flawed because people do both.
00:16:49.020 | Yeah.
00:16:51.180 | This is the fun part since it's primarily vision-based.
00:16:55.100 | How did they do the vision?
00:16:57.220 | How did they generate the data?
00:16:58.700 | So they've got a split of different data sets.
00:17:02.900 | Step one is obviously caption generating from the images.
00:17:06.780 | So they source web images from diverse sets of data,
00:17:10.700 | 70 high-level topics.
00:17:11.860 | So high-level topics, street signs.
00:17:15.340 | Street signs, food, meme, drawings, websites,
00:17:17.620 | blurry photos and whatnot.
00:17:19.140 | They asked annotators to describe images
00:17:21.260 | by speaking in detail for 60 to 90 seconds
00:17:24.660 | using a single annotator per image.
00:17:27.140 | This was more effective,
00:17:29.500 | but yeah, questions were like guided questions.
00:17:31.820 | So what's the image at first glance?
00:17:34.380 | What objects are there in their accounts?
00:17:36.260 | What does the text say?
00:17:37.820 | What's the background?
00:17:38.660 | What's the style and color?
00:17:39.780 | I think this is pretty useful
00:17:41.260 | 'cause I'm pretty bad at describing images
00:17:43.300 | and 60 to 90 seconds is a good answer.
00:17:46.900 | A bunch of stuff, people will be verbose, concise.
00:17:49.860 | Then after this basic pre-processing,
00:17:52.220 | so transcribe it using off-the-shelf speech to text.
00:17:56.340 | Do you know what that was?
00:17:57.180 | Was that like whisper or something
00:17:59.020 | or it just probably doesn't matter, I'm guessing.
00:18:02.060 | - Yeah, it was like whisper
00:18:03.100 | and I don't remember the details.
00:18:05.580 | And I had asked and it's like,
00:18:07.180 | I don't think they actually logged the audio.
00:18:08.820 | I was like, oh, this would be super cool
00:18:10.220 | for like other types of multimodal,
00:18:11.900 | but I don't think it was in the terms of the data location
00:18:15.580 | to also have the audio with the images
00:18:17.380 | 'cause it'd be a super cool, like, I don't know.
00:18:20.260 | Like you can see how all these voice features come about.
00:18:24.620 | - That's what I was about to say.
00:18:25.460 | I was gonna say like,
00:18:26.300 | it would have been a pretty cool audio dataset too,
00:18:29.140 | but it looks like the dataset that'll be distributed
00:18:33.860 | is just gonna be the captioning and image pairs.
00:18:37.540 | So not audio, which would have been fun,
00:18:39.100 | but yeah, there's a very straightforward pre-processing.
00:18:42.580 | So use something like whisper
00:18:44.660 | or whatever off the shelf they wanna call it speech to text,
00:18:48.260 | then process it through an LLM to improve text quality.
00:18:52.300 | So remove artifacts, normalize style,
00:18:55.460 | then create a fourth image description
00:18:58.340 | by using an LLM to summarize
00:19:00.300 | the three original descriptions,
00:19:01.780 | transcripts into a single description.
00:19:03.700 | So, you know, basic augmentation from that,
00:19:08.700 | we use these four images to do natural data augmentation.
00:19:12.860 | So trained on 712,000 distinct images
00:19:17.020 | with 1.3 million captions, including the augmentation.
00:19:20.140 | So not a lot, it's basically just fine tuning after that.
00:19:25.060 | - I know Eugene Xia has a hand raised.
00:19:27.060 | You have a question, Eugene Xia?
00:19:28.580 | - Pop in.
00:19:29.420 | - Yeah, it wasn't urgent for me.
00:19:33.060 | Because there was a lot of commentary
00:19:34.420 | regarding text performance and I guess the presumption
00:19:39.060 | that the MOMO is inferior in text performance.
00:19:41.420 | That's what I kind of understood.
00:19:43.580 | I'm actually wondering how much of it is potentially
00:19:46.340 | for especially the closed models,
00:19:48.540 | just purely synthetic data,
00:19:50.540 | because I can imagine for text in particular,
00:19:53.940 | we can generate a lot of synthetic images,
00:19:57.700 | essentially signboards, crumpled paper,
00:20:00.340 | all these kinds of scenarios,
00:20:01.740 | and just put various permutation of text.
00:20:05.500 | And we ask questions about the text through the model.
00:20:08.660 | And that essentially forms the training set.
00:20:10.900 | - Thoughts?
00:20:16.820 | I think a lot of it's also just, yeah, it's a base model.
00:20:21.340 | I think that there's a decent bit you could do.
00:20:25.580 | I think also at the scale that they're at,
00:20:28.220 | like with a 1B and 7B,
00:20:30.460 | they could be deployed in a system
00:20:33.660 | such that it's primarily image processing,
00:20:37.940 | like zero shot, view shot,
00:20:41.380 | convert this to JSON, answer stuff about this.
00:20:45.020 | I don't see this really being like a chat with style model,
00:20:49.580 | because yeah, there's LLAMA if you want that, right?
00:20:52.460 | If you want images and chat with it.
00:20:54.740 | But the applications of these edge,
00:20:57.780 | small image processing models,
00:21:00.420 | I see this more as like OCR output in a pipeline,
00:21:04.780 | but that's my two cents.
00:21:07.300 | I guess it's interesting to see that
00:21:11.860 | if you did it with a chat instruct tune model,
00:21:15.180 | there wasn't any crazy performance difference.
00:21:17.580 | But yeah, the other thing here is just that
00:21:19.620 | there's not much benchmarking on the language models.
00:21:24.860 | So I guess there is MMLU and they're not the best,
00:21:29.860 | actually quite roughly, quite bad.
00:21:33.580 | - That's MMMU.
00:21:35.340 | - MMMU, my bad.
00:21:36.580 | - That's a multimodal version.
00:21:38.100 | - So there's no pure text benchmarks, right?
00:21:42.700 | Should we benchmark?
00:21:43.700 | Someone benchmark it.
00:21:44.580 | We've got 40 people on the call that are here for the paper.
00:21:47.140 | So someone report back to us.
00:21:49.740 | Yeah, any other interesting chat stuff?
00:21:54.100 | Should we crunch the paper quick?
00:21:57.020 | - Me?
00:21:57.860 | Mama?
00:22:02.780 | - Is someone talking?
00:22:06.220 | Okay, they're muted.
00:22:09.060 | Let's go to that dataset then.
00:22:12.100 | So TLDR, people yap, here's guidance questions.
00:22:15.980 | It's transcribed.
00:22:17.220 | They got about a million captions for 700,000 images,
00:22:20.900 | which I'm like, okay, that's kind of expensive.
00:22:22.660 | I can't run this experiment myself.
00:22:24.540 | I'm glad they'll put it out.
00:22:26.900 | Basically there's this PIXMO is the dataset
00:22:30.500 | and then there are subsets of it.
00:22:31.740 | So PIXMO ask anything model.
00:22:33.540 | It's basically collection of answer a diverse set
00:22:37.460 | of questions that a user might ask.
00:22:39.260 | So for this image, what are questions people might ask?
00:22:43.900 | Here's how they do it.
00:22:44.740 | They have this whole OCR stage.
00:22:46.820 | I'm gonna go through these pretty quick,
00:22:48.420 | but then here's kind of the subset.
00:22:49.700 | So 162,000 question answer pairs of images,
00:22:54.220 | PIXMO points is the next one.
00:22:55.780 | - Can I ask a quick question
00:22:56.660 | on the PIXMO ask model or anything?
00:22:58.860 | So I had a really hard time understanding this.
00:23:01.220 | What we do is we use the stage one model
00:23:04.460 | to generate advanced caption.
00:23:06.140 | So I guess what this means is that given image,
00:23:08.260 | get a caption and then we pass that caption
00:23:10.660 | and OCR image for that and the OCR output for the image.
00:23:14.420 | Does this mean we just apply OCR on the image?
00:23:17.580 | How would that work?
00:23:19.540 | If the image is an image of scenery or something.
00:23:22.380 | I mean, if someone can just share the intuition behind this.
00:23:25.980 | - Nathan, if you have intuition, you'd probably know better.
00:23:32.060 | My intuition here is that when you look
00:23:35.620 | at there's 70 different high level topics,
00:23:38.940 | like drawings, websites,
00:23:40.780 | there's another thing here like street signs.
00:23:42.780 | And then there's one that's like primarily charts and tables.
00:23:46.380 | So I don't have the exact input output of this
00:23:49.820 | but I'm assuming it's more text-based if there's OCR on it.
00:23:53.300 | So it's probably, you know, documents, receipts.
00:23:58.060 | That's what I would assume this subset of the dataset is.
00:24:00.940 | - Yeah, I see.
00:24:03.660 | - Yeah, but that makes sense.
00:24:05.420 | - But there, yeah.
00:24:07.620 | - Yeah, so we provide caption, we provide OCR output
00:24:10.060 | in case there's charts and tables.
00:24:11.580 | And then we ask the question
00:24:13.100 | and the LM is supposed to answer the question
00:24:14.820 | purely solely based on text.
00:24:16.260 | It does not have the image.
00:24:17.620 | Yeah, that sounds crazy to me.
00:24:21.540 | I just don't know how we can teach a model
00:24:23.940 | to answer anything about an image
00:24:26.580 | without actually having an image.
00:24:28.260 | But if it works, it works.
00:24:30.420 | But that was quite mind blowing to me,
00:24:31.820 | especially this dataset and how they created it.
00:24:34.660 | - So I kind of interpreted this as
00:24:38.500 | that's like a starting point
00:24:39.660 | and then the annotator would,
00:24:41.980 | I wasn't sure why they did it this way
00:24:44.100 | and maybe Nathan can speak to this,
00:24:46.220 | but that you like have these captions
00:24:51.220 | and then you're kind of just selecting
00:24:53.100 | and combining annotations as a human
00:24:55.620 | who has access to the vision of the model
00:24:59.340 | or the vision of the image.
00:25:00.980 | - That makes sense, yeah.
00:25:05.380 | Or maybe this is one of the earlier images
00:25:06.900 | where we just bootstrapping our synthetic data
00:25:09.460 | and the quality is maybe it's not as high
00:25:11.340 | or not as intuitive as compared to the point.
00:25:14.100 | Picks more points, picks more clocks,
00:25:16.380 | which was easy for me to understand.
00:25:17.980 | So this is the one thing I got stuck on.
00:25:19.780 | Thank you.
00:25:20.620 | - Are you, wait, the model was trained
00:25:24.980 | with the image and the question pairs though.
00:25:27.740 | It's saying that the LLM that's doing
00:25:30.580 | the question answering didn't have access to the image.
00:25:37.020 | Like it's still a output pair is still image
00:25:40.620 | to QA pairs that you train MOMO on,
00:25:43.980 | but the annotation, that's what I,
00:25:47.420 | so like, I think that's what you were saying.
00:25:49.260 | Like, it's interesting how you can train a MOMO
00:25:52.820 | to answer questions on an image without giving it the image,
00:25:55.540 | but no, it's given the image and question answer pairs.
00:25:58.780 | It's just the annotation answering the question
00:26:01.300 | is image to OCR to GPT-4 write question answers
00:26:06.580 | on this OCR that, you know, there's a, what is it?
00:26:11.460 | There's a annotator to approve the question answers.
00:26:14.300 | Now we've got question answers with image.
00:26:16.580 | When we train, we have image question answers.
00:26:19.660 | - I see, yeah, okay.
00:26:20.500 | That actually is a, that's actually very intuitive.
00:26:24.660 | So essentially what we're saying is that right now,
00:26:26.140 | the model can't answer questions.
00:26:27.820 | What we're going to do, but the model can generate captions
00:26:29.980 | because that's the previous pick small cap
00:26:31.700 | they also learned on.
00:26:32.780 | So we're going to get that model to generate captions
00:26:35.460 | and then we get an off the shelf LLM,
00:26:37.940 | probably one of their strong LLMs to create QA pairs.
00:26:41.620 | And now based on that, we now have QA pairs on the image.
00:26:45.300 | So that's how you could strap yourself
00:26:46.860 | from no data to like step-by-step.
00:26:50.020 | Okay.
00:26:50.860 | - This little niche here was that off the shelf LLM
00:26:53.900 | is not capable.
00:26:55.140 | So that's why they do caption and OCR
00:26:58.100 | and provide that to the LLM.
00:26:59.260 | - Makes sense.
00:27:00.100 | - That's where I'm like, oh, I think you mentioned
00:27:02.860 | their training questioning without giving it the image,
00:27:05.780 | but you still do have the references.
00:27:07.980 | - Yeah, the final mobile data actually has that,
00:27:10.740 | but this is the process of them just generating the image.
00:27:13.060 | - Yes. - Cool.
00:27:14.260 | Thank you.
00:27:15.100 | - Yeah.
00:27:15.940 | And then honestly, I found this one pretty confusing
00:27:19.460 | at first as well.
00:27:20.620 | The rest, there's just quite a few subsets of these.
00:27:23.480 | So points, this is kind of their demo.
00:27:27.420 | I don't know if I'm sharing my whole screen or just the,
00:27:31.060 | yeah, I'm just sharing the.
00:27:32.860 | - PDF.
00:27:34.060 | - Well, actually I'll change it real quick.
00:27:36.260 | If people haven't seen the demo,
00:27:39.260 | I would recommend checking it out,
00:27:40.500 | but this is basically points on an image.
00:27:42.720 | So it kind of answers, like if you were to ask this,
00:27:46.500 | this example, point to Mount Rainier,
00:27:49.380 | the model can drop a dot on the mountain.
00:27:52.280 | So it's labeled data set of that next sample.
00:27:56.420 | And then they show their example of connecting this thing
00:27:59.220 | to Apple vision pro and asking it, like, what do I see here?
00:28:02.220 | They've got a subset of that.
00:28:06.060 | Then there's a cap QA.
00:28:08.400 | It's basically just here are the other ones.
00:28:10.260 | There's docs generate code for this much text
00:28:14.980 | and heavy figure images.
00:28:16.040 | So charts, documents, tables,
00:28:18.300 | we prompt it to do QA pairs.
00:28:20.600 | There's clocks.
00:28:21.580 | I was confused with why there's synthetic
00:28:24.760 | analog clock data set, but there are clocks.
00:28:29.440 | There's a lot of clock examples in here.
00:28:31.520 | - I had the same question.
00:28:32.760 | Like Nathan, would you-
00:28:33.600 | - This is because the model didn't work on clocks.
00:28:35.960 | And then the lead was really worked on clocks
00:28:39.160 | and no models work on clocks.
00:28:40.660 | So they're like, we've got to make it work on clocks.
00:28:43.360 | One of the interesting things is that it doesn't work
00:28:45.040 | on dials, even though it works on clocks.
00:28:47.560 | - It doesn't work on dials,
00:28:48.400 | like an oven dial or washing machine dial.
00:28:50.040 | - Yeah, like a temperature dial or pressure dial.
00:28:52.000 | Like it can't read the number off of that,
00:28:53.480 | but it'll work on clocks.
00:28:55.180 | But just like such a, like it should be able to,
00:28:57.120 | like why can't it generalize?
00:28:58.520 | - Isn't it supposed to generalize?
00:28:59.720 | Oh my God.
00:29:00.560 | Okay, but that was very good.
00:29:01.380 | - But it does work well on clocks.
00:29:02.800 | - That was very good.
00:29:04.640 | - I love the answer though.
00:29:05.520 | These are the insights that like we'd never get,
00:29:07.640 | but is this just like over fit to a benchmark
00:29:10.300 | that nothing else does?
00:29:11.680 | Is this not like it should generalize bit or less?
00:29:14.080 | - Is there a clock benchmark?
00:29:17.640 | - Hell yeah, dude.
00:29:18.480 | You threw it in.
00:29:19.320 | You like, now I know.
00:29:20.940 | Now every visual language model that comes out,
00:29:22.920 | often will be like.
00:29:23.760 | - Well, it is going to make,
00:29:24.680 | like it is something that they should be able to do.
00:29:26.920 | - Yeah, it should generalize it.
00:29:28.760 | - I mean, a big story of this is that like Matt,
00:29:31.920 | that D, I don't know how to pronounce his name,
00:29:34.000 | is just got obsessive over data
00:29:35.560 | and just kept scaling it up and up and up.
00:29:37.160 | And all the things that are adding to this dataset
00:29:38.960 | kept on working.
00:29:39.780 | And they just spent more and more money
00:29:40.920 | 'cause it kept getting better and better and better
00:29:42.540 | over like six months of just making the dataset bigger.
00:29:45.880 | So like the pointing is probably similar.
00:29:48.480 | It's like, why can't they point?
00:29:49.680 | And then they're like, okay, let's make it work.
00:29:52.320 | Like pointing seems useful though, right?
00:29:53.880 | Like the demo you guys showed was actually somewhat useful.
00:29:56.920 | Like, you know, translate this table in a whole image
00:30:01.920 | to JSON and it does it.
00:30:04.120 | Or like stuff like this, like point to Mount Rainier,
00:30:08.000 | it can point, but other models can't.
00:30:10.160 | Seems useful.
00:30:11.000 | Clocks, I don't know.
00:30:12.960 | This just reminds me of strawberry.
00:30:14.420 | You know how many hours are in strawberry?
00:30:17.000 | Someone start this with GPT-4V,
00:30:19.760 | what time is it?
00:30:20.600 | And it can't tell time.
00:30:21.660 | And then of course the rest, there's academic datasets.
00:30:25.640 | Once again, the FlamingoBog post
00:30:27.360 | goes into all these very deeply, would recommend.
00:30:30.180 | But TLDR, those are the datasets.
00:30:33.520 | EVALs are EVALs.
00:30:35.280 | This is where I was like, okay, it does good on vision.
00:30:38.880 | There's academic benchmarks,
00:30:41.800 | which are, you know, 11 of the commonly used ones.
00:30:44.400 | There's good averages.
00:30:45.920 | I love the new rebrand they've done
00:30:47.440 | with the whole pink and color theme.
00:30:49.480 | It looks good.
00:30:50.460 | Averages.
00:30:51.880 | But yeah, I mean, across the board,
00:30:54.280 | the 7B, the 1B, the 72B,
00:30:56.900 | very on par with 404V 1.5.
00:31:00.880 | Thoughts on it and whatnot.
00:31:02.380 | The human ELO ranking was pretty interesting
00:31:05.800 | where preference is decent,
00:31:09.800 | but I guess the interesting differentiation there
00:31:13.240 | is how far off a drop other stuff
00:31:15.480 | like the Lava and whatnot is.
00:31:18.960 | Yeah, PHY is doing pretty rough in multimodal,
00:31:21.860 | even though it got a multimodal update,
00:31:24.160 | but it's probably a better text model.
00:31:26.000 | I guess I'd just be interested now to see
00:31:28.200 | how this thing does in text
00:31:29.880 | and if we can get a better text version.
00:31:31.560 | This is kind of back to the theme of this paper
00:31:34.160 | of like stuff, most models use a vision encoder
00:31:39.160 | that's closed.
00:31:40.620 | So we wanted to do full open.
00:31:44.400 | This is a little harder to read,
00:31:47.800 | but let me digest real quick.
00:31:49.280 | So the vision encoder for Lava is open weights,
00:31:54.280 | but the data and code used to generate it was closed.
00:31:59.600 | API models, they're basically all closed.
00:32:02.320 | A lot of the generation is done with closed stuff.
00:32:04.700 | They're in this corner of all green.
00:32:06.320 | We love all green.
00:32:07.300 | We don't love red.
00:32:09.100 | But this is basically just a visual
00:32:11.160 | of the theme of the paper.
00:32:12.120 | I didn't look that deep into it,
00:32:13.140 | but that's what it's showing that.
00:32:14.580 | Once again, the captioning that's done
00:32:17.440 | is mostly a distillation of proprietary stuff.
00:32:21.800 | For example, all this proprietary stuff sucks at clocks.
00:32:24.680 | So nothing that's a distillation will be good at clocks.
00:32:27.520 | Nothing can point.
00:32:28.760 | If you just distill from this,
00:32:30.320 | if they can't point, your VLM won't point.
00:32:33.780 | So we show how to get good data.
00:32:35.940 | The answer to that was how people talk
00:32:38.200 | with guided questions, do pre-processing,
00:32:40.880 | and I guess keep scaling it up.
00:32:42.440 | It's good to see it still scales.
00:32:45.160 | - And also, one point on the MOMO one, right?
00:32:48.560 | If you scroll up all the way to the top, yeah, over here.
00:32:50.560 | The only reason why MOMO 7B and 7BD
00:32:54.520 | is not open data encode is because
00:32:56.280 | it's based off the QEN data.
00:32:58.300 | I think, correct me if I'm wrong, Nathan.
00:33:00.040 | And that's why it's not because they don't want to open it.
00:33:02.680 | It's because we just don't know what was in QEN data.
00:33:05.380 | Same for a visual encoder.
00:33:06.840 | If not, it's everything that the AI2 team did was--
00:33:11.600 | - This model is test the definition of open source AI,
00:33:14.560 | because it's like, you grab some random vision encoder,
00:33:17.160 | and that's the only thing that's not open.
00:33:18.680 | It's like kind of--
00:33:20.360 | - Well, the weights are open,
00:33:23.520 | but we just don't know what the data encode is, yeah.
00:33:26.280 | - Yeah, but what's the difference between a vision--
00:33:27.720 | Like, if the weights of the vision encoder were frozen,
00:33:30.960 | I think that it could be fine, but I don't know.
00:33:34.320 | It's weird, because you like say that fine-tunes
00:33:36.960 | from MOMO are like open-weight models.
00:33:39.340 | They're not like open-source fine-tunes and stuff.
00:33:42.360 | - This is a discussion that the open-source initiative
00:33:45.000 | and other people are having right now.
00:33:47.280 | It's kind of a whole other thing
00:33:49.200 | that I don't have enough time to answer the emails for,
00:33:52.520 | but it's like, I don't know what to do with it.
00:33:55.360 | - The interesting thing there was the only closed code for,
00:34:00.000 | the closed open data was just,
00:34:01.720 | they used CLIP, large CLIP, right?
00:34:04.800 | They have the sentence that our vision encoder
00:34:08.640 | is based on CLIP right here.
00:34:10.040 | So for the vision encoder, we release,
00:34:13.600 | all of our release models use OpenAI's VIT-Large CLIP model,
00:34:18.600 | which provides consistently good results.
00:34:21.140 | While this model uses closed data,
00:34:23.060 | it can be reduced from scratch with MetaCLIP.
00:34:25.220 | So that's what gives them some red,
00:34:28.600 | because the data that OpenAI trained CLIP on,
00:34:32.880 | they didn't open-source.
00:34:34.400 | So it's closed data, and they talked about this too,
00:34:36.980 | like the paper and the blog post really shows
00:34:41.040 | how the thing that made vision language models good
00:34:44.320 | was a good captioner,
00:34:45.240 | which was the last CLIP that they put out.
00:34:47.400 | And this is where OpenAI started to go closed source,
00:34:50.720 | where they're like, okay, we've scraped the web,
00:34:52.980 | we have our own scraper,
00:34:53.920 | we have a good captioner, this, that.
00:34:55.760 | Here's the data set, or no,
00:34:57.980 | they don't give out the data set.
00:34:59.000 | They give the final model, but not the data set.
00:35:02.020 | And then a few years later,
00:35:04.520 | Meta comes out with MetaCLIP where they're like,
00:35:06.800 | okay, here's how we can remake CLIP.
00:35:08.440 | And I think that's personally,
00:35:10.200 | this is just like now the open-source debate
00:35:11.920 | of they could have used the Meta version
00:35:14.240 | and given us a worse model,
00:35:15.820 | and it would have given them green here,
00:35:17.720 | but I don't know, I think it's fine that they did it.
00:35:20.280 | And then this is closed,
00:35:21.200 | 'cause yeah, like Eugene said, Quinn stuff.
00:35:24.160 | But that's my two cents.
00:35:26.400 | This could have been open if they just used MetaCLIP,
00:35:29.480 | but MetaCLIP's not as good,
00:35:30.880 | because I think Meta did it as a proof of concept.
00:35:33.980 | - I think, honestly, I think it's fine,
00:35:36.280 | because right now, as previously outlined,
00:35:38.760 | before vision and audio,
00:35:41.360 | the gap in data set in the open-source space is so large,
00:35:45.280 | that the biggest thing is actually more about
00:35:48.520 | getting the data set open-source and working.
00:35:51.600 | So for me, if in this process,
00:35:54.620 | sure, this may be flawed,
00:35:55.800 | and we get a strong open image data set,
00:35:59.160 | which then subsequently we can do captioning
00:36:02.440 | on top of that strong open image data set,
00:36:05.700 | which will create a more complete data set
00:36:07.640 | than is used to train it to the next open model,
00:36:10.520 | then this will be a bootstrap, essentially,
00:36:13.680 | towards a all-green open model,
00:36:16.400 | which might be a better path
00:36:19.240 | than trying to use an inferior clip
00:36:21.920 | and not have a good data set in the first place.
00:36:24.320 | - Makes sense.
00:36:26.760 | I think also this is a proof of concept
00:36:29.440 | at open-source stuff.
00:36:31.040 | I don't know how much there's like,
00:36:34.160 | this is a product, use it, use it, use it,
00:36:36.700 | versus like we were scaling experiments
00:36:38.380 | that kept working better.
00:36:39.740 | And there's still quite a bit that could be done here,
00:36:42.780 | but yeah, it's a really good point out
00:36:46.060 | that stuff is a distillation,
00:36:47.720 | and we don't know what goes into this.
00:36:49.620 | It does have that issue with clip, right?
00:36:54.460 | We don't know the data that went into it.
00:36:56.500 | Metaclip would have been cool, but yeah.
00:37:02.300 | Evals are next, so benchmarks.
00:37:05.220 | If people understand and want to dig into any of these,
00:37:09.480 | we can dig into them.
00:37:10.920 | So it's a interesting differentiation of open-weight,
00:37:15.920 | open-weight distillation, where MOMO sits,
00:37:18.760 | and then these are all the proprietary.
00:37:20.460 | So Claude, Gemini, and GPT.
00:37:23.000 | Here's the averages.
00:37:24.260 | MOMO 72B kills it.
00:37:27.920 | It's the best average across them.
00:37:30.120 | And then across the actual benchmarks themselves,
00:37:33.740 | they're doing pretty good across.
00:37:36.180 | I think when you look a little deeper,
00:37:37.700 | they mentioned the 1B is on par with GPT-4V
00:37:41.640 | on most benchmarks, but the 7B outperforms 4V.
00:37:46.640 | It's closer to 4.0 and whatnot.
00:37:50.280 | So if anyone has anything interesting
00:37:53.820 | they'd want to dig into on the benchmarks, we can.
00:37:56.460 | Otherwise, I think we know where it sits.
00:37:58.180 | The interesting part would be digging
00:37:59.960 | into the text generation part.
00:38:01.980 | Eugene, you still have your hand up.
00:38:04.020 | Are you trying to comment?
00:38:06.040 | Okay, hand up.
00:38:08.480 | Yeah, I didn't have much more to dig into in this.
00:38:14.340 | I thought the ELO ranking was kind of interesting as well.
00:38:17.380 | Here's ELO ranking.
00:38:20.440 | Human preference evals use 15K image text prompt pairs.
00:38:25.420 | We queried VLM for responses,
00:38:27.320 | presented the resulting images,
00:38:29.400 | triplets to all this to 870 human annotators.
00:38:33.240 | They were given pair-wise preference rankings
00:38:35.400 | for 325 samples across 27 models.
00:38:39.120 | NiceFlex, biggest human preference eval
00:38:41.880 | for multimodal models to date.
00:38:44.000 | All it took was less than a thousand people.
00:38:46.160 | Their ELO rankings are 3X more than chatbot arena
00:38:50.120 | from LIMPSYS.
00:38:50.980 | Also, I hear there's some potential tea with LIMPSYS.
00:38:54.820 | Maybe that's coming out soon for their vision models.
00:38:57.960 | And then here's an ELO ranking.
00:39:00.080 | - There's potential tea with LIMPSYS, not extrapolate.
00:39:04.020 | - Well, we'll leave that without much more
00:39:07.600 | unless anyone else wants to comment on it.
00:39:09.700 | But yeah, if anyone knows how ELO ranking works,
00:39:14.200 | they're pretty cool, figure it out.
00:39:17.120 | But yeah, most preference ranked was GPT-4.0,
00:39:20.880 | then MOMO-72B, then Gemini, then Sonnet, then the 7B.
00:39:25.880 | It's really cool how the fourth one is the 7B.
00:39:30.180 | So very cool on-device model.
00:39:32.580 | And then RIP, our boy, Lava 1.57B down here,
00:39:37.580 | Chameleon down here.
00:39:39.180 | Chameleon was cool though.
00:39:40.220 | They're fusion models.
00:39:41.180 | They're a different way to do this.
00:39:43.660 | They're a little early.
00:39:44.500 | I'd love to see a recreation of fusion for this
00:39:47.100 | as opposed to adapters.
00:39:48.980 | But those are the two benchmarks.
00:39:51.400 | There's some takeaways on this as well.
00:39:55.060 | So here's a few key points.
00:39:58.160 | Also, there was a line here as well in their pre-training
00:40:01.360 | that I wanted to highlight.
00:40:02.680 | So in this data and training,
00:40:04.680 | they talk about how they do this different approach.
00:40:09.120 | And then they just add this line here.
00:40:10.480 | So we don't do RLHF, but we've got the RLHF guy here,
00:40:15.480 | but they added a line that we don't do RLHF.
00:40:18.860 | Is this meant to say like, you know, there's no SFT,
00:40:21.380 | these are base models or just...
00:40:22.980 | Well, there is supervised fine tuning for it,
00:40:26.020 | but no RLHF.
00:40:27.840 | - For us, team alignment and incentives are hard.
00:40:31.220 | - We have a question from Sam.
00:40:34.720 | - Thanks, Eugene.
00:40:38.160 | Hold on, is my mic on?
00:40:39.000 | Can you guys hear me?
00:40:40.060 | - Yes, yes.
00:40:41.200 | - Great, cool.
00:40:42.240 | So this is coming from a place of vague,
00:40:44.980 | naivety of multi-modality stuff.
00:40:46.600 | And it's about the granularity that we should attach
00:40:51.100 | to images when creating, say,
00:40:52.620 | open source image caption datasets.
00:40:55.080 | So right now there's a picture,
00:40:56.680 | say from like a satellite view of a parking lot.
00:40:59.320 | And I want to ask a question that like Eugene could answer
00:41:01.940 | of like, what is on the dashboard of the car
00:41:05.520 | in the second row, third from the left or something.
00:41:08.440 | And it's a little Hawaiian guy
00:41:09.720 | or something like that on the dashboard.
00:41:11.460 | So a human can answer that,
00:41:12.440 | but I assume a language model only could
00:41:14.340 | if it had been trained, I suppose,
00:41:17.060 | on image caption pairs that have that level
00:41:19.940 | of extreme granularity to them.
00:41:21.820 | So like, I've seen some really cool demos
00:41:23.460 | on the blog posts for MoMo.
00:41:25.580 | Now, again, I'm not really sure
00:41:26.820 | whether to ascribe that performance
00:41:29.260 | to the pre-chained vision encoder,
00:41:32.340 | the one that we're using,
00:41:33.180 | or if it's from some data in the PIXMO set
00:41:36.780 | that uses that cool like audio transcription pipeline.
00:41:39.900 | So I guess the question is,
00:41:41.320 | if we kind of ignore the fact
00:41:42.920 | that we're using a pre-trained image encoder
00:41:44.320 | that has a bunch of cool abilities
00:41:45.800 | on data that we don't know about,
00:41:47.740 | how granular should our image caption,
00:41:51.440 | our captions for image caption,
00:41:52.960 | or image caption datasets be if we want to have,
00:41:56.160 | we would expect to be, you know,
00:41:58.320 | quote AI performance, you know,
00:42:00.920 | if you don't know how this stuff works
00:42:02.940 | from our vision language models.
00:42:04.600 | And why did you guys choose the granularity
00:42:06.480 | that you did, I suppose?
00:42:08.440 | And the question format as well.
00:42:10.100 | Kind of a lot.
00:42:17.060 | - Do you want to TLDR that into one, two line area?
00:42:21.940 | - How granular should our image caption pairs be, period.
00:42:26.200 | Kind of detailed.
00:42:30.260 | - I think Nathan probably has more intuition on this,
00:42:32.700 | but their thing wasn't about granularity, it seems.
00:42:37.460 | It was about one, like, yeah, you've got diversity,
00:42:40.980 | you've got, you know, processing of it,
00:42:44.000 | but then it was also about the split, right?
00:42:46.300 | So we talked about these four different ones,
00:42:48.600 | like part of it is, you know,
00:42:51.100 | generate code from 255,000 texts
00:42:54.180 | and figure heavy images, including charts and whatnot,
00:42:57.100 | then get QA pairs on it.
00:42:58.780 | They got clocks,
00:42:59.620 | but there's also all the academic datasets, right?
00:43:02.100 | So there's a lot of these.
00:43:05.180 | If you look into what they are,
00:43:06.220 | that blog post goes into them.
00:43:07.740 | So I think granularity of specific stuff
00:43:11.980 | isn't how I necessarily look at it.
00:43:14.300 | It's diversity of the images
00:43:18.020 | and what they do with good captions around it.
00:43:22.020 | Nathan, if you want to chime in, go ahead.
00:43:24.500 | Eugene, you have your hand raised.
00:43:26.220 | I don't know if you want to answer,
00:43:27.100 | but let's answer this question before moving to another one.
00:43:30.100 | - I don't really have a good response.
00:43:33.100 | I think most of this was like really high detail responses
00:43:35.860 | is what worked for them, but I don't know.
00:43:39.820 | - Okay.
00:43:40.660 | - My intuition, at least,
00:43:43.220 | is that I suspect the heavy lifting
00:43:45.820 | might be actually more towards the Q&A side
00:43:47.820 | rather than captioning side.
00:43:49.740 | I view captioning as a bootstrap,
00:43:51.820 | but at the end of the day,
00:43:52.660 | we want the model to generalize
00:43:54.620 | what they want to describe.
00:43:56.300 | Captioning tends to overfit to what is being described,
00:43:59.980 | where our question answers kind of forces the model
00:44:02.460 | to pick up details that it would previously ignore,
00:44:05.700 | like plots, for example,
00:44:07.140 | or the one extra finger on the hand, or whatever it is.
00:44:11.420 | And as long as you're able to ask questions
00:44:13.460 | and you can answer, you're training the model
00:44:15.380 | to capture as much information as possible,
00:44:18.380 | even if you don't ask.
00:44:19.500 | Because even if, let's say, this dataset,
00:44:21.660 | you didn't ask this question,
00:44:23.620 | another dataset may ask an alternative question instead,
00:44:28.140 | and that will just generalize better.
00:44:30.380 | And that's what I suspect.
00:44:31.540 | I think captioning is a problem in overfitting
00:44:34.540 | from my point of view.
00:44:35.580 | - Updates.
00:44:37.900 | I really don't have that much intuition
00:44:39.220 | on this either, though.
00:44:40.340 | Someone else has hand up?
00:44:43.380 | You want to pop in?
00:44:44.420 | Okay.
00:44:49.060 | I'm going to finish the paper real quick in the takeaways.
00:44:51.820 | This is a cool little video we can watch after,
00:44:55.620 | and we'll just save time for Q&A.
00:44:57.260 | There was really not much left, though.
00:45:00.660 | There's just some highlights.
00:45:02.140 | So they wanted to highlight features.
00:45:04.300 | I will share their highlights.
00:45:05.940 | The most efficient model, the 1B,
00:45:07.660 | is based on their 1B MOE.
00:45:10.140 | That one matches performance of 4V
00:45:12.980 | on most academic benchmarks and their ELO ranking.
00:45:17.980 | The 7B and the...
00:45:20.460 | So our 7B and Qwen 7B-based models
00:45:23.340 | comfortably sit between GPT 4V and 4O on both.
00:45:28.300 | So 1B, similar to 4V.
00:45:31.620 | 7Bs are between 4V and 4O.
00:45:35.180 | The big one, the one based on Qwen 72B,
00:45:38.660 | is the best benchmarks on academic.
00:45:42.140 | So academic benchmarks-wise,
00:45:45.100 | their big one is the best state-of-the-art everything,
00:45:47.980 | better than proprietary.
00:45:49.620 | But ELO-wise, it sits behind 4O.
00:45:52.940 | Then our best model outperforms...
00:45:58.660 | Yeah, so it's basically best model beats everything.
00:46:01.780 | To highlight their potential for action,
00:46:04.140 | we tested this on Android Control,
00:46:07.540 | where it did lower...
00:46:09.020 | I didn't really get this,
00:46:09.860 | but we tested it on Android Control,
00:46:13.020 | where it achieved this low-level accuracy,
00:46:15.580 | 69% high-level accuracy.
00:46:17.140 | I don't really know what Android Control is,
00:46:18.460 | but they highlighted it.
00:46:19.900 | I probably should have read more into it, but...
00:46:22.020 | - Excuse me, do you plan to show us something,
00:46:24.940 | or you're just pointing?
00:46:26.700 | 'Cause we only see the normal page.
00:46:30.820 | - Good catch.
00:46:31.660 | My bad, I went back to the slides.
00:46:33.100 | (laughs)
00:46:35.020 | My bad.
00:46:35.860 | But yeah, I was just going over these highlights.
00:46:39.340 | So the highlights were: the 1B is based on their MoE,
00:46:44.340 | and it's at 4V level.
00:46:46.100 | The 7Bs are between the two OpenAI models.
00:46:50.300 | 72B is better than anything on their chart.
00:46:53.740 | So it's state-of-the-art better than proprietary.
00:46:56.780 | And then there's this last little point here
00:47:00.460 | of it does really good on this Android Control,
00:47:04.860 | which I don't know what Android Control is,
00:47:07.380 | but they wanted to highlight it.
00:47:08.940 | So I've highlighted it.
00:47:11.980 | I think that's most of the paper.
00:47:13.700 | There's a release plan here.
00:47:15.460 | So yeah, Molmo is coming out.
00:47:18.260 | They put it out early.
00:47:19.460 | They've put out some stuff.
00:47:22.700 | There's more stuff coming.
00:47:23.740 | So this technical report is getting an update soon.
00:47:27.820 | The data sets are coming out,
00:47:29.140 | but putting out data sets is hard.
00:47:30.580 | So give them time.
00:47:32.380 | Model weights, training and eval code.
00:47:34.380 | This is a fun one.
00:47:35.420 | This typically doesn't come out
00:47:36.860 | because you're probably not doing training
00:47:38.940 | on the same hardware they're training on,
00:47:41.580 | but always cool to see training and eval code.
00:47:45.620 | I guess that's pretty good overview of the paper.
00:47:48.500 | If there's anything we want to dive deeper into,
00:47:51.660 | we've got 10 minutes.
00:47:53.540 | Nathan, if you want to add anything, cook, go over.
00:47:57.980 | - I just have a question,
00:48:03.700 | but like I'm ready to get to the section about Whisper,
00:48:07.580 | if there are like no other questions.
00:48:09.780 | - Nathan.
00:48:10.620 | - Yeah, Nathan had to drop.
00:48:20.260 | - Oh, he dropped.
00:48:21.100 | Okay, cool.
00:48:21.940 | Well, that's paper.
00:48:24.700 | There is a little use case thing to just make it useful.
00:48:30.340 | We can kind of watch this real quick.
00:48:32.660 | And then we've got a little update on Whisper.
00:48:35.860 | So this is their video they put out.
00:48:37.940 | - Yeah, that's awesome, Nathan.
00:48:43.260 | - It's very Apple Intelligence-like.
00:48:43.260 | - Okay.
00:48:44.100 | - Count the number of people.
00:48:53.500 | - Counting the number of people shows a total of 21.
00:48:56.580 | - Convert this table to JSON.
00:49:02.700 | - Here's the table converted to JSON format.
00:49:11.420 | - I want to sell my bike on Craigslist.
00:49:14.100 | Write me a description for it.
00:49:15.940 | - Schwinn bike for sale, blue with white accent.
00:49:18.500 | $300 or best offer.
00:49:20.380 | - Okay, that's just a, you know,
00:49:23.220 | example of what the model can do.
00:49:26.500 | I recommend people check out the rest of their blog posts.
00:49:28.860 | It's pretty cool.
00:49:30.140 | Here's benchmarks.
00:49:33.180 | There's a couple other videos if people are interested.
00:49:36.580 | And then there's just a good little TLDR of the paper.
00:49:41.100 | I want to save time for questions and stuff.
00:49:43.580 | So I'll share the link.
00:49:45.620 | We'll share it in Discord.
00:49:47.420 | If anyone has last questions on the paper,
00:49:49.740 | we can go over now.
00:49:50.700 | Otherwise we have a little Whisper update too.
00:49:54.060 | Whisper three turbo dropped.
00:49:57.500 | I was very impressed.
00:49:58.540 | Yeah, questions or should we pass to Whisper?
00:50:02.220 | Last shot.
00:50:03.060 | - All right, Whisper it is again.
00:50:09.060 | - Whisper, let's go.
00:50:11.180 | - Whisper.
00:50:12.260 | - Sorry, I took away more time.
00:50:13.860 | - Whisper?
00:50:22.980 | Who's covering Whisper?
00:50:24.100 | - It's supposed to be Amgadoz, but.
00:50:27.740 | Oh, there he is.
00:50:30.020 | - Yes.
00:50:31.700 | Yes, can you guys hear me?
00:50:33.300 | - Yes.
00:50:34.140 | - Okay, great.
00:50:36.140 | Thank you so much for covering the Momo paper
00:50:39.260 | alongside Nathan, that was great.
00:50:40.900 | So yeah, let's get started quickly.
00:50:45.420 | So can you guys see my screen?
00:50:52.380 | - Yes.
00:50:53.220 | - Okay, great.
00:50:55.820 | So just like some updates.
00:50:57.660 | OpenAI has recently released a new checkpoint
00:51:00.300 | or version of Whisper called Whisper large V3 turbo.
00:51:03.780 | It's supposed to be like more faster and more efficient
00:51:06.980 | with like very little accuracy degradation.
00:51:10.100 | And it was inspired by the work behind this to Whisper.
00:51:13.060 | So just like that pull request,
00:51:15.060 | then that John walk merged into the OpenAI repository.
00:51:19.300 | So yeah, a little reminder about Whisper.
00:51:24.060 | Whisper is like the state of the art ASR model
00:51:26.780 | for like transcription, translation
00:51:28.500 | and speech detection as well.
00:51:31.460 | The architecture is basically based
00:51:33.140 | on the encoder decoder transformer architecture
00:51:36.420 | that was presented in the original
00:51:38.100 | attention is all you need paper.
00:51:40.060 | So the input is passed to the encoder
00:51:42.780 | which processes the audio
00:51:44.380 | and then gives out hidden states at the end.
00:51:47.740 | And these hidden states are sent to the decoder
00:51:50.820 | which it uses to like generate the text
00:51:53.540 | one token at a time.
00:51:54.740 | So yeah, the encoder and decoder
00:51:58.980 | mostly have the same number of layers
00:52:01.380 | for example, in the Whisper large V3
00:52:03.580 | it has 32 layers in the encoder
00:52:05.460 | and 32 layers in the decoder.
00:52:07.180 | So this was like a quick summary
00:52:09.500 | about the architecture of Whisper itself.
00:52:12.020 | Whisper like when it was originally released
00:52:15.820 | it had different versions
00:52:16.900 | namely tiny, base, small, medium, and large.
00:52:19.940 | And then later on they added large V2 and large V3.
00:52:23.100 | But for each of these versions
00:52:24.980 | like the number of layers in the decoder
00:52:27.220 | and the encoder were the same.
00:52:29.140 | So for example, base has six layers in the encoder
00:52:31.540 | and six layers in the decoder.
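For intuition, here's a minimal sketch (assuming the Hugging Face `transformers` package and the public `openai/whisper-*` checkpoint names) that inspects the matching encoder/decoder depths described above; it is an illustration, not part of the talk.

```python
# Hedged sketch: inspect encoder/decoder depth of a couple of Whisper checkpoints.
# Assumes the Hugging Face `transformers` package is installed.
from transformers import WhisperConfig

for name in ["openai/whisper-base", "openai/whisper-large-v3"]:
    config = WhisperConfig.from_pretrained(name)
    # For the original releases, encoder and decoder have matching depths,
    # e.g. base: 6/6, large-v3: 32/32.
    print(name, config.encoder_layers, config.decoder_layers)
```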
00:52:33.060 | Yeah, so the motivation behind Whisper Turbo
00:52:39.700 | was actually the same motivation behind Distil-Whisper.
00:52:42.820 | And the motivation behind Distil-Whisper
00:52:45.140 | was like based on two observations.
00:52:47.340 | The first one is like the decoder accounts
00:52:49.020 | for almost 90% of the latency
00:52:51.940 | when you are trying to transcribe any audio.
00:52:54.260 | The second observation is that the decoder performs
00:52:58.020 | the simpler task of like mapping the hidden states
00:53:01.340 | of the audio into the text.
00:53:03.220 | Like the most difficult part
00:53:04.340 | is actually extracting these hidden states of the audio
00:53:07.260 | and trying to understand what's being said in there.
00:53:09.820 | So yeah, the first observation is because
00:53:13.100 | Whisper is like an autoregressive language model
00:53:16.460 | and generates tokens, one token at a time.
00:53:18.420 | And you need the token at time T
00:53:21.220 | to generate the token at time T plus one.
00:53:22.980 | So it is actually a very serial process
00:53:25.620 | and we cannot utilize modern GPUs to parallelize this.
00:53:29.820 | The second observation was made
00:53:31.380 | by actually comparing the different versions of Whisper.
00:53:34.700 | So for example, the small and tiny versions
00:53:37.780 | could transcribe audio when it is loud and clear,
00:53:41.580 | and you can actually hear what's being said,
00:53:43.860 | but they would struggle with audios
00:53:45.180 | that have like background noise
00:53:46.940 | or where the speaker is not very clear.
00:53:50.140 | However, all models or all versions,
00:53:52.820 | whether it be like small, tiny, large,
00:53:55.380 | they all generate coherent text in English
00:53:58.100 | and all the other languages.
00:53:59.700 | And this kind of like gives a hint
00:54:00.980 | that they are actually capable of like modeling the language
00:54:04.660 | but they just struggle with the audio sometimes.
00:54:07.180 | So yeah, what is exactly Whisper Turbo?
00:54:12.580 | Whisper Turbo is like a new and more efficient version
00:54:15.580 | of the original model
00:54:16.540 | with like minimal accuracy degradation.
00:54:19.300 | It uses mainly two techniques to achieve this improvement
00:54:22.380 | in speed and compute.
00:54:25.580 | The first one is like model pruning
00:54:27.020 | and the second one is continued pre-training
00:54:30.060 | but it does not use distillation
00:54:31.620 | which is I think a very big misconception
00:54:34.220 | about this new release.
00:54:35.580 | So let's talk about model pruning.
00:54:39.660 | Pruning in machine learning mainly refers
00:54:41.620 | to like eliminating some parameters
00:54:44.700 | from a neural network
00:54:47.540 | to like reduce the compute and memory requirements
00:54:52.220 | of a model without impacting the accuracy as much.
00:54:55.220 | In Whisper Turbo, the pruning was done only in the decoder
00:54:58.380 | and like entire layers were pruned.
00:55:01.100 | So we went from 32 decoder layers
00:55:03.740 | which were in the original large V3
00:55:06.300 | to like just four layers.
00:55:08.500 | And this resulted in like massive reduction
00:55:10.460 | in the model size.
00:55:12.020 | So the turbo model is like 1.78 times smaller
00:55:15.980 | than the original large V3.
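As a rough illustration of that pruning step (not OpenAI's actual procedure; the chosen layer indices and the `transformers` API usage here are assumptions), one could drop decoder layers like this:

```python
# Hedged sketch of decoder-layer pruning: keep 4 of the 32 decoder layers.
# This is an illustration only; which layers to keep is a hypothetical choice.
import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

keep = [0, 10, 21, 31]  # hypothetical subset of decoder layers
model.model.decoder.layers = torch.nn.ModuleList(
    model.model.decoder.layers[i] for i in keep
)
model.config.decoder_layers = len(keep)

# The pruned model now needs continued pre-training to produce coherent text again.
total = sum(p.numel() for p in model.parameters())
print(f"{total / 1e6:.0f}M parameters after pruning")
```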
00:55:17.460 | So yeah, after you prune a model
00:55:21.420 | and you go from 32 layers into four layers,
00:55:24.980 | you will probably lose most of the capabilities of the model.
00:55:29.340 | Like the model will probably degrade and collapse
00:55:32.380 | and will not be able to even generate coherent text.
00:55:35.300 | You will have to train these four layers
00:55:37.820 | to work together again as a single decoder.
00:55:40.700 | So the way to do this
00:55:41.860 | is to actually do continued pre-training.
00:55:43.820 | And in continued pre-training,
00:55:45.100 | you train the model on like a relatively big amount of data
00:55:48.980 | to actually teach it again all the things
00:55:51.460 | that it has forgotten or like got confused about.
00:55:55.180 | And in the case of Whisper Turbo,
00:55:57.020 | the continued pre-training happened on the same dataset
00:55:59.780 | that was actually used
00:56:01.220 | in the original training of Whisper Large V3.
00:56:03.700 | So for reference, the training consisted of two epochs.
00:56:07.860 | Each epoch had 5 million hours.
00:56:10.580 | 1 million out of these were like weakly supervised
00:56:14.940 | as in the original Whisper
00:56:16.740 | and 4 million were like pseudo-labeled
00:56:18.460 | by the Whisper Large V2 model.
00:56:21.220 | So we had like 5 million hours in each epoch.
00:56:23.620 | This means that we have 10 million hours in two epochs.
00:56:27.060 | So this was like the size of the original training data.
00:56:30.500 | And the data using continued pre-training
00:56:33.260 | is very similar to this,
00:56:34.620 | except we're only using the transcription data.
00:56:37.540 | We're not using any of the translation data
00:56:40.060 | in the Whisper Turbo continued pre-training.
00:56:45.460 | So yeah, this gives us a hint about like the size
00:56:48.260 | of the training data of the new model,
00:56:49.780 | which is actually quite large.
00:56:51.260 | And one of the fun details is like,
00:56:56.380 | they used a linear learning rate schedule with decay,
00:57:00.460 | starting from 5e-5.
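A tiny sketch of that kind of linear decay schedule (the warmup and total step counts are placeholder assumptions, and the toy module just stands in for the pruned model):

```python
# Hedged sketch: linear learning-rate decay starting from 5e-5.
# Warmup/total steps are placeholders, not the values OpenAI used.
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # stand-in for the pruned Whisper model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=100_000
)

for _ in range(3):  # in the real loop this follows each optimizer.step()
    optimizer.step()
    scheduler.step()
print(scheduler.get_last_lr())
```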
00:57:03.300 | So also another confusion is like,
00:57:09.340 | it gets compared a lot with Distil-Whisper,
00:57:12.740 | but they like differ in size
00:57:14.020 | and also the training strategy.
00:57:15.620 | Distilled Whisper was trained
00:57:17.700 | using knowledge distillation of ASR models.
00:57:21.260 | But we'll talk about this like in a minute or two.
00:57:24.020 | Let's just go over a quick comparison
00:57:25.740 | between the two models.
00:57:26.940 | So they both have the same number of encoder layers,
00:57:30.060 | which is 32,
00:57:31.540 | but the Whisper Turbo has four decoder layers
00:57:33.780 | while Distilled Whisper has only two.
00:57:36.180 | The training strategy for Whisper Turbo
00:57:37.980 | is pruning and continued pre-training,
00:57:39.620 | but for Distilled Whisper
00:57:41.580 | it's actually knowledge distillation.
00:57:43.660 | And for Whisper Turbo,
00:57:44.900 | the training data is multilingual.
00:57:46.780 | For Distilled Whisper it's only English.
00:57:49.580 | And the task for Distilled Whisper is transcription
00:57:52.300 | and the same for Whisper Turbo.
00:57:54.940 | - Can I ask a quick question?
00:57:56.340 | - Sure.
00:57:58.460 | - The difference with Whisper Turbo and Distilled Whisper
00:58:01.100 | is primarily the pruning versus distillation.
00:58:04.540 | It seems like we're giving a lot of benefit towards pruning,
00:58:08.420 | but the other big distinction
00:58:10.180 | is also the difference in data, right?
00:58:13.060 | I'm curious on your intuition
00:58:14.820 | because I also assume pruning versus distillation
00:58:18.540 | is a big part of it and that's why it can stay.
00:58:20.940 | It's interesting that it can stay multilingual.
00:58:23.740 | I would assume the opposite,
00:58:25.700 | but how much of this do you think is based on the strategy
00:58:29.340 | versus just better data?
00:58:31.380 | Distilled Whisper came out a while ago
00:58:34.340 | versus Whisper Turbo, OpenAI does good data.
00:58:37.220 | Like how much of a difference do you think is caused?
00:58:42.100 | Because, so they're both similar size,
00:58:44.020 | but Whisper Turbo is a lot better, right?
00:58:45.780 | So how much of that better like performance
00:58:50.380 | out of smaller size comes from pruning versus data?
00:58:53.460 | Because they don't really mention how good this data is.
00:58:56.260 | Do you have intuition on that?
00:58:59.380 | - Yes, yes.
00:59:00.220 | So we kind of like have some info about the dataset.
00:59:02.420 | For Whisper Turbo, it uses exactly the same dataset
00:59:06.020 | as Whisper Large V3,
00:59:08.420 | except for the translation section, which is excluded.
00:59:12.420 | So this is like a very big dataset of like 10 million,
00:59:15.100 | or like a few million hours.
00:59:16.660 | And it has like many, many different languages.
00:59:19.140 | So this is a very diverse and large and robust dataset,
00:59:23.060 | which is actually quite important.
00:59:24.900 | On the other hand, Distilled Whisper uses English only data.
00:59:29.260 | I think it is a few thousand hours of audio.
00:59:32.380 | So this is like much, much, much smaller
00:59:34.220 | than the Whisper Turbo.
00:59:37.220 | One upside is that the Distilled Whisper
00:59:39.500 | is trying to use actually higher quality data,
00:59:43.620 | arguably, but we're not sure about this.
00:59:45.980 | So I think there is like a very big emphasis
00:59:47.860 | about the data size as well.
00:59:49.740 | And alongside the training strategy.
00:59:51.860 | For Whisper Turbo, it's a few million hours,
00:59:54.140 | for Distilled Whisper, it's just a few thousand hours.
00:59:56.860 | And this is quite crucial, I think.
00:59:58.660 | For example, Distilled Whisper is only for English.
01:00:00.620 | You cannot use it like for French, German, Spanish,
01:00:02.700 | or Arabic, or any other language.
01:00:05.060 | And I believe like if you train a single model
01:00:07.980 | on multiple datasets, multiple domains,
01:00:11.580 | multiple languages, it becomes more robust.
01:00:13.940 | And this is like the premise behind Whisper.
01:00:16.900 | Like it is a robust model because it is trained
01:00:19.500 | on a very diverse dataset,
01:00:21.060 | encompassing different domains and languages.
01:00:25.020 | - Got it.
01:00:25.860 | - So I hope this kind of answers your question.
01:00:27.620 | - Really, really helps with the intuition there.
01:00:29.940 | Sorry to cut off.
01:00:31.740 | Also great slides, by the way.
01:00:33.820 | - Thanks.
01:00:35.580 | - Yeah, amazing slides.
01:00:36.740 | - Oh, thank you.
01:00:39.100 | This one is good.
01:00:39.940 | You're gonna like this one a lot more.
01:00:41.860 | So there was a question from swyx about like,
01:00:44.980 | how do you actually do distillation for an ASR model?
01:00:47.900 | And I think it's kind of like similar
01:00:49.620 | to how you distill a language model.
01:00:51.860 | So let's say you want to distill Llama 405B to Llama 3B.
01:00:56.100 | The way you do this is you try to like extract
01:00:59.620 | or compress the knowledge from the big model
01:01:01.420 | to that smaller model.
01:01:02.700 | And this is exactly what you are doing here.
01:01:05.420 | The technicalities is,
01:01:07.420 | the way you train Whisper is as a language model
01:01:09.700 | that is conditioned on the audio.
01:01:11.580 | So it's actually like a language model
01:01:13.260 | that predicts the next token.
01:01:14.940 | So you can have like cross entropy loss
01:01:17.740 | on like the next token.
01:01:19.420 | The only difference is that the input
01:01:22.420 | also contains audio, not just text.
01:01:25.420 | So you kind of like train the smaller model
01:01:28.940 | on the next token predicted by the bigger model.
01:01:31.660 | So kind of like synthetic data or like pseudo data,
01:01:34.940 | pseudo label data to be precise.
01:01:36.740 | That's one of the training objectives.
01:01:39.140 | The other training objective is,
01:01:41.540 | is you're trying to like make the output distribution
01:01:46.340 | of the smaller model as close as possible
01:01:48.500 | to the output distribution of the bigger model.
01:01:50.220 | So not just the next token,
01:01:51.940 | but like the entire distribution of the next token.
01:01:54.740 | So if you're training like a language model
01:01:58.900 | in a conventional way,
01:01:59.980 | you're only like training it on the next token.
01:02:02.460 | You only care about the correct next token.
01:02:04.260 | You don't care about the top nine or top 10 predictions.
01:02:07.740 | But if you wanna train or like gonna do knowledge distillation
01:02:10.820 | and you want to like train the model
01:02:12.340 | with a lot of information,
01:02:13.860 | you might want to do something called KL divergence training.
01:02:17.780 | And in this way, you try to teach the smaller model
01:02:20.340 | to approximate the probability distribution
01:02:22.860 | of the bigger model.
01:02:23.780 | So not just get the next token right,
01:02:25.700 | get the top 10 predictions for the next token correctly.
01:02:30.060 | And this actually helps the model learn a lot more.
01:02:32.340 | You're giving much more information for each data point
01:02:36.700 | than just next token prediction.
01:02:38.780 | So this is like how you do a knowledge distillation
01:02:42.020 | for an ASR model in like a very brief overview.
01:02:44.900 | The Distil-Whisper model has a good paper
01:02:47.220 | from the Hugging Face team
01:02:48.220 | that goes into much more detail about this.
01:02:50.340 | So yeah.
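To make that concrete, here is a hedged sketch of the two objectives just described: cross-entropy on the teacher's pseudo-labels plus a temperature-scaled KL term between student and teacher distributions. The weights, temperature, and tensor shapes are illustrative assumptions, not the exact Distil-Whisper recipe.

```python
# Hedged sketch of a distillation loss: pseudo-label cross-entropy + KL divergence
# between student and teacher next-token distributions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, pseudo_labels,
                      kl_weight=0.8, ce_weight=1.0, temperature=2.0):
    # Both logit tensors are (batch, seq_len, vocab) from the same audio + text prefix.
    ce = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        pseudo_labels.reshape(-1),
        ignore_index=-100,  # ignore padded positions
    )
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    return ce_weight * ce + kl_weight * kl

# Toy usage with random tensors (batch=2, seq_len=5, vocab=100)
student = torch.randn(2, 5, 100)
teacher = torch.randn(2, 5, 100)
labels = torch.randint(0, 100, (2, 5))
print(distillation_loss(student, teacher, labels))
```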
01:02:52.900 | - There's another paper that came out recently
01:02:54.500 | that talks about distillation and the three types.
01:02:56.940 | So there's also distillation
01:02:58.700 | where you do pure SFT on outputs
01:03:01.020 | and you match input output type distillation,
01:03:03.860 | but there's a distillation loss,
01:03:06.060 | which is basically this KL divergence
01:03:09.780 | that you're talking about,
01:03:10.620 | where you map (indistinct)
01:03:13.860 | We did have a question (indistinct)
01:03:18.220 | This is basically token level decoding
01:03:21.420 | of audio chunks, right?
01:03:23.900 | So (indistinct) asked,
01:03:25.660 | is Whisper Turbo fast enough to be real-time?
01:03:28.860 | And what's the definition of real-time
01:03:30.820 | and how does real-time work with chunk-based decoding
01:03:33.740 | is something asked, if you have intuition around this.
01:03:36.340 | - I think he muted the audio,
01:03:46.140 | and I think he's got it on mute.
01:03:47.900 | Amgad, you have to unmute.
01:03:49.900 | - Oh.
01:03:50.740 | - Unmute.
01:03:52.700 | - Unmute.
01:03:53.540 | - Yeah, I think I got muted.
01:04:00.580 | Can you guys hear me now?
01:04:01.940 | - Yeah, you're good.
01:04:03.180 | Did you get the question?
01:04:04.020 | - Okay.
01:04:04.860 | Yes, yes, I got the question.
01:04:06.020 | So the concept of like real-time is kind of like debatable,
01:04:09.580 | but if you ask me if you can transcribe
01:04:12.060 | a one minute audio in less than a minute,
01:04:13.820 | I think this is kind of like real-time.
01:04:16.300 | So the question is like,
01:04:18.380 | is the Whisper model fast enough to be real-time?
01:04:22.140 | If you're, the answer is yes,
01:04:23.460 | if you run it on the right hardware.
01:04:25.140 | If you run it on a T4 or like anything that's bigger,
01:04:27.940 | basically any modern GPU,
01:04:29.420 | it's going to be almost real-time.
01:04:31.020 | Like the latency is going to be like a few,
01:04:33.860 | like maybe a second or two max,
01:04:35.980 | if you've set the parameters correctly
01:04:37.740 | and if you size it correctly.
01:04:39.540 | And this is even applicable with like the large V2
01:04:41.700 | and large V3, and it's going to be even faster
01:04:43.940 | with like Distil-Whisper or like Whisper Turbo.
01:04:46.780 | So yes, I think Whisper can actually be very,
01:04:49.420 | very good at real-time kind of transcription.
01:04:53.820 | - And how do you do like live transcription
01:04:57.860 | with token or chunk-based decoding?
01:05:00.300 | - So one of the ways to do real-time transcription
01:05:06.620 | is to select a small chunk size.
01:05:10.580 | To just, before answering this question,
01:05:12.420 | like one of the key things about Whisper
01:05:14.260 | is that you have to give it 30 seconds of audio.
01:05:17.380 | Even if the audio is one second,
01:05:18.980 | like you have to pad the remaining 29 seconds.
01:05:21.540 | So you're always giving it 30 seconds of audio
01:05:24.060 | that will be processed by the encoder.
01:05:25.740 | And then depending on actually how long your speech is,
01:05:29.260 | it will maybe generate two tokens or 200 tokens.
01:05:32.100 | But if you're giving it one second of speech
01:05:34.060 | and 29 seconds of padding,
01:05:35.620 | it will probably generate only two or three tokens
01:05:37.780 | or maybe five tokens, not 200.
01:05:40.340 | So you're spending the same time on the encoder,
01:05:42.620 | but you're spending significantly less time on the decoder.
01:05:45.660 | So one way to utilize this in real-time decoding
01:05:49.420 | is continuously give Whisper maybe 300 milliseconds of audio
01:05:54.420 | and give it as the input and pad it,
01:05:59.460 | and then ask it to generate like the five or 10 tokens,
01:06:02.820 | or maybe even less, that correspond to 300 milliseconds.
01:06:05.620 | Because probably in half a second,
01:06:07.660 | someone is saying two or three words,
01:06:09.620 | which is four or five tokens,
01:06:11.460 | which can be decoded relatively quickly
01:06:13.180 | and get back the answer in almost real time.
01:06:15.180 | So you keep doing this multiple, multiple times
01:06:17.900 | until you run into your first batch of chunks,
01:06:22.900 | and then you can just shift the audio buffer a bit.
01:06:25.940 | It's kind of like difficult to explain
01:06:27.140 | without having a graph beforehand,
01:06:29.340 | but it's basically about continuously feeding the model
01:06:33.260 | chunks that are like have been shifted
01:06:37.780 | by maybe half a second or so.
01:06:40.100 | And then you just,
01:06:41.220 | if the buffer is less than 30 seconds,
01:06:44.820 | you just pad it and give it to the encoder.
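Here's a rough sketch of that buffering loop (assuming 16 kHz mono audio, the Hugging Face `transformers` Whisper API, and a ~300 ms chunk callback; the checkpoint name and chunk size are assumptions for illustration, not a definitive streaming implementation):

```python
# Hedged sketch of "pseudo real-time" decoding: keep appending small chunks to a
# rolling buffer, let the feature extractor pad to the fixed 30-second window,
# and re-decode the short hypothesis each time.
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

SAMPLE_RATE = 16_000
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3-turbo")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")

buffer = np.zeros(0, dtype=np.float32)

def on_new_chunk(chunk: np.ndarray) -> str:
    """Called every ~300 ms with a new slice of microphone audio."""
    global buffer
    # Keep at most the last 30 seconds of audio in the buffer.
    buffer = np.concatenate([buffer, chunk])[-30 * SAMPLE_RATE:]
    # The feature extractor pads/truncates to the fixed 30-second window.
    inputs = processor(buffer, sampling_rate=SAMPLE_RATE, return_tensors="pt")
    ids = model.generate(input_features=inputs.input_features, max_new_tokens=64)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]
```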
01:06:47.540 | - Got it.
01:06:49.500 | - Does this give some clarification?
01:06:50.900 | - Very useful intuition.
01:06:52.060 | I didn't realize all the padding and stuff.
01:06:54.540 | Someone asked if these chunks overlap.
01:06:56.460 | - You can make them overlap.
01:06:58.860 | You can make them not overlap.
01:07:00.660 | This is kind of like,
01:07:01.900 | there's like a whole set of hyper parameters
01:07:04.740 | that you can play with.
01:07:05.940 | You can use VAD (voice activity detection).
01:07:07.140 | You can not use VAD.
01:07:08.020 | You can just keep adding the chunks without overlapping.
01:07:10.940 | You can make them overlap.
01:07:11.900 | There's like a whole world of doing this.
01:07:14.500 | - Okay.
01:07:15.340 | Got it.
01:07:17.740 | No, really, really useful intuition.
01:07:19.900 | Any other questions, thoughts?
01:07:22.860 | This is the last slide, right?
01:07:23.820 | We didn't cut off.
01:07:24.700 | - We have two other slides,
01:07:27.140 | but this is like the main thing.
01:07:29.780 | - If you want to keep going through, go through.
01:07:31.820 | - Sure.
01:07:32.660 | So some fine tuning,
01:07:34.660 | WhisperTurbo can be fine tuned
01:07:35.820 | just like any of the other models.
01:07:37.980 | I asked Jong Wook about which hyperparameters to use.
01:07:42.060 | She said, you can start with like 1e-5
01:07:44.740 | as learning rate,
01:07:45.580 | but you should do a learning rate sweep
01:07:48.660 | and like do hyper parameter tuning.
01:07:51.060 | But it can be like fine tuned just like any other model.
01:07:53.780 | Just be careful.
01:07:54.620 | Like it's probably gonna be fine tuneable
01:07:57.780 | only on like transcription.
01:07:59.860 | Don't try too much with translation
01:08:01.500 | or maybe give it a try,
01:08:03.300 | but it's not guaranteed to work.
01:08:05.460 | So yeah, it's fine tuneable just like any other model.
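For reference, a hedged sketch of what such a fine-tuning setup might look like with the Hugging Face `Seq2SeqTrainer`; the dataset, collator, batch size, and step counts are hypothetical placeholders, and only the 1e-5 starting learning rate comes from the discussion above.

```python
# Hedged sketch of fine-tuning the turbo checkpoint on a transcription dataset.
from transformers import (
    WhisperForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")

args = Seq2SeqTrainingArguments(
    output_dir="whisper-turbo-finetuned",
    learning_rate=1e-5,              # suggested starting point; do a sweep
    warmup_steps=500,                # placeholder
    max_steps=5_000,                 # placeholder
    per_device_train_batch_size=16,  # placeholder
    predict_with_generate=True,
    # fp16=True,                     # enable when training on a GPU
)

train_dataset = None  # hypothetical: pre-processed (input_features, labels) dataset
eval_dataset = None   # hypothetical held-out split
data_collator = None  # hypothetical speech seq2seq collator

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)
# trainer.train()  # run once the placeholders above are filled in
```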
01:08:11.300 | The final two slides about like benchmarks,
01:08:15.500 | but just before the benchmarks,
01:08:17.180 | people also asked how it compares to FasterWhisper
01:08:19.820 | and like Distil-Whisper.
01:08:21.740 | So we've answered the question about Distil-Whisper,
01:08:24.180 | but for FasterWhisper like,
01:08:25.900 | this is like a valid question,
01:08:27.780 | because FasterWhisper is like an inference engine.
01:08:30.940 | It's like a library, it's code.
01:08:33.260 | It can be used to deploy any of the other versions
01:08:36.820 | of Whisper.
01:08:37.660 | Whisper is small, tiny, base, medium, large,
01:08:41.100 | large V2, large V3, large V3 turbo,
01:08:43.900 | distal large and so on.
01:08:45.620 | So kind of like try to make a distinction
01:08:48.060 | between the inference engine
01:08:49.260 | and the model versions themselves.
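For example, a short usage sketch of the faster-whisper inference engine loading the turbo weights; the model alias, device settings, and audio file name are assumptions for illustration.

```python
# Hedged sketch: faster-whisper is an inference engine, not a model version;
# it can serve various Whisper checkpoints, including the turbo one.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

segments, info = model.transcribe("talk.mp3", language="en")  # hypothetical file
print("Detected language:", info.language)
for segment in segments:
    print(f"[{segment.start:6.2f} -> {segment.end:6.2f}] {segment.text}")
```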
01:08:51.580 | So any question before we go into the benchmarks?
01:08:55.060 | - No, I guess we can go.
01:09:01.580 | - Sure.
01:09:03.500 | So for benchmarks,
01:09:05.980 | I highly recommend that you benchmark on your own data.
01:09:09.220 | So I think this is quite important,
01:09:12.860 | but to like give some intuition
01:09:14.420 | about the model's performance,
01:09:15.420 | I chose three data points
01:09:17.900 | that I think are very interesting
01:09:19.380 | and they're like very important for me.
01:09:21.740 | So one of them is like the GPU mode
01:09:24.380 | at the end of CTOG by Karpathy.
01:09:26.140 | This is very recent,
01:09:26.980 | so there is no way it was in the training data, hopefully.
01:09:30.780 | The second one is like a chat with like
01:09:34.740 | Mistral CEO, Arthur Mensch.
01:09:36.500 | He is a French guy,
01:09:37.580 | so he's speaking English with kind of like an accent.
01:09:40.300 | This was, I think, in December last year.
01:09:43.020 | Hopefully it was not in the training data as well.
01:09:44.780 | And the final one is my favorite,
01:09:46.220 | like the state of GPT by Karpathy
01:09:47.780 | in I think 2023 as well.
01:09:50.140 | So yeah, Karpathy is known to be talking relatively fast,
01:09:55.380 | so this can be challenging.
01:09:57.580 | So yeah, let's jump into the benchmarks.
01:09:59.940 | For benchmarking,
01:10:02.900 | so you generally look at two metrics,
01:10:05.260 | WER and CER.
01:10:06.620 | WER is word error rate,
01:10:08.940 | like how often the model makes a mistake
01:10:12.380 | about a certain word in the text.
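A tiny sketch of how one might compute these two metrics (assuming the `jiwer` package; the reference and hypothesis strings are toy examples, not the benchmark data):

```python
# Hedged sketch: word error rate (WER) and character error rate (CER)
# between a reference transcript and a model hypothesis, using jiwer.
import jiwer

reference = "the state of gpt by andrej karpathy"
hypothesis = "the state of gbt by andre karpathy"

print("WER:", jiwer.wer(reference, hypothesis))
print("CER:", jiwer.cer(reference, hypothesis))
```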
01:10:15.100 | So for the WER,
01:10:16.100 | I benchmarked the different sizes or versions
01:10:19.420 | of like the different data points we discussed.
01:10:24.420 | So we kind of like see that the distil-large-v2
01:10:28.940 | is kind of like an outlier.
01:10:30.140 | It's one of the least accurate models out there.
01:10:33.780 | But the other models are like kind of similar,
01:10:37.940 | except maybe large V3 can be a hit or miss.
01:10:42.860 | So sometimes it has relatively higher WER.
01:10:45.740 | Sometimes it's the best performing model.
01:10:48.180 | And this also aligns or like is aligned
01:10:51.740 | with like what people have experienced
01:10:54.340 | and have shared like that large V3
01:10:56.180 | can sometimes hallucinate.
01:10:57.260 | It can be caught in an endless repetition loop
01:11:00.580 | and can make the output like really bad.
01:11:02.780 | But yeah, I think that the turbo model
01:11:06.220 | on these data points is doing very, very well
01:11:09.100 | as well as the distil-large-v3.
01:11:11.620 | But on other benchmarks that I did for like a project
01:11:16.620 | whose results I cannot share,
01:11:18.380 | but I can give hints that it was not very good.
01:11:20.660 | It was doing much worse than the other models.
01:11:23.500 | So yeah, do your own benchmark.
01:11:25.220 | And especially if the dataset is not English
01:11:28.140 | and if it's not transcription,
01:11:30.140 | you might see like very different numbers from this.
01:11:32.980 | So this is like the WER; the CER is very similar as well.
01:11:36.380 | The distil-large-v2 is doing very, very badly
01:11:38.860 | compared to the others and large V3 can be a hit or miss.
01:11:42.380 | So yeah, that was like a very quick,
01:11:46.100 | super quick introduction about Whisper Turbo.
01:11:48.500 | This is like the end.
01:11:49.340 | If you have any questions, please go ahead.
01:11:51.820 | Thank you for watching.
01:11:53.980 | - That was really great.
01:11:54.980 | Clap, clap, clap, applause all around.
01:11:56.980 | Question in the chat is: no MLX stuff? What do you mean, MLX?
01:12:02.220 | - No MLX stuff.
01:12:06.540 | So I don't have a Mac, so I don't care about MLX.
01:12:10.540 | Sorry, but yeah.
01:12:11.980 | (laughing)
01:12:14.220 | That's like the two hands.
01:12:15.540 | So yeah, just as a hint,
01:12:20.380 | I think MLX is like a framework
01:12:24.020 | for deploying or like using machine learning models
01:12:27.820 | on Apple Silicon hardware
01:12:29.660 | and presumably it's very fast and very efficient.
01:12:32.660 | But I don't have a Mac, so no data points from my side.
01:12:37.420 | - All right, my battery's dying.
01:12:41.020 | I'm on 1%, so I gotta end this soon.
01:12:43.180 | But thank you so much, it was fantastic.
01:12:45.860 | Always excited by your Whisper updates and explanations.
01:12:49.540 | Really, really like those slides, those shadow slides.
01:12:52.220 | Awesome, awesome.
01:12:54.100 | Thanks everyone.
01:12:54.940 | - Sure, sure, thank you guys.
01:12:56.220 | Have a nice day.
01:12:57.060 | - Thank you very much, bye.