[Paper Club] Molmo + Pixmo + Whisper 3 Turbo - with Vibhu Sapra, Nathan Lambert, Amgadoz
00:00:00.000 |
- I'll go through here, let me set up real quick. 00:00:15.520 |
If people haven't read research papers before, 00:00:54.920 |
Clear problem, clear solution, clear explanation 00:00:57.800 |
of what they do, the data set and everything. 00:01:08.500 |
at different sizes that are good vision language models 00:01:17.000 |
are distillations of proprietary closed-source models, right? 00:01:42.240 |
that proprietary knowledge into open knowledge. 00:01:44.440 |
But this is kind of the key highlight I take away 00:01:46.480 |
that the community is still missing foundational knowledge 00:01:50.280 |
about how to build these models from scratch. 00:01:52.740 |
So they kind of go into everything from data set training, 00:01:56.480 |
how they label data and how they can do this whole mix. 00:01:59.880 |
And some of the other papers we covered in paper club 00:02:02.160 |
are stuff like CLIP and OpenCLIP, which are captioners, 00:02:06.040 |
and how captioning is pretty important for this. 00:02:17.920 |
a class of different models that are gonna come out. 00:02:20.240 |
And the really interesting thing is they're good. 00:02:40.960 |
with some of the larger models and they're very small. 00:02:43.180 |
So yeah, diving into that, we talk about the data set, 00:02:46.720 |
how they do the training data, the architecture of it, 00:02:55.080 |
So one of this is how can we generate this data 00:03:00.080 |
So we can't use a closed source model to do this captioning. 00:03:08.000 |
is that it's hard to generate really good, diverse, 00:03:31.800 |
that's based on their open-source OLMo models. 00:03:36.240 |
They can do this end to end without anything proprietary. 00:03:39.660 |
And yeah, let's like go through it a little bit more. 00:03:46.520 |
There's a vision encoder and a language model 00:03:50.320 |
There are some interesting stuff that they call out 00:03:52.240 |
like traditionally in stuff like LLaVA and different adapters, 00:03:55.480 |
what they'll do is they'll train this vision encoder. 00:04:02.880 |
They avoid doing that multi-stage pre-training. 00:04:06.960 |
They just kind of do it all as one training run. 00:04:09.960 |
They show how important high quality data is. 00:04:17.800 |
Like they pass these labeled datasets through LLMs. 00:04:24.000 |
but point being they can get better than Gemini 1.5, 00:04:46.040 |
Like they still have to pay for a million labeled samples, 00:04:51.020 |
And then we can distill it down to other models. 00:04:53.080 |
So yeah, this is one of the main challenges, right? 00:04:59.280 |
So for example, like if you're given an image here 00:05:04.680 |
like I would probably just say this is Seattle. 00:05:12.400 |
But if you just blindly give images to either LLMs 00:05:26.180 |
trees in the foreground, water in the background, 00:05:31.060 |
But they kind of broke down how they get this type of data 00:05:36.220 |
And then I think soon the dataset is coming out, 00:05:41.620 |
So they don't wanna get this from larger proprietary models, 00:05:45.900 |
which can, like you can write a really good robust prompt 00:05:52.280 |
but you're no longer doing this from scratch. 00:05:54.080 |
You're doing it from a distillation of proprietary work. 00:05:59.740 |
We ask annotators to describe the images in speech 00:06:04.240 |
rather than asking them to write descriptions. 00:06:06.680 |
They prompted them to describe everything in great detail, 00:06:09.360 |
including descriptions of spatial positioning 00:06:21.320 |
annotators provide far more detailed descriptions 00:06:34.920 |
but we're good at talking about and describing what we want. 00:06:40.160 |
they have like a little bit more into what this dataset is. 00:06:43.640 |
So they've got positional data in some of the prompts, 00:07:03.520 |
that kind of shows how you could do this real time. 00:07:05.960 |
It's a very Apple Intelligence or Google Lens type example, 00:07:09.180 |
where I think the team put it on the Apple Vision Pro 00:07:11.740 |
and they just look at stuff and they're like, 00:07:18.820 |
So one more use case of high quality, well-labeled data. 00:07:23.940 |
they did the traditional 11 academic benchmarks. 00:07:32.820 |
which is based on their open source language model, 00:07:35.220 |
which is an MoE model with 1 billion active parameters. 00:07:42.380 |
on both user preference and academic benchmarks. 00:07:47.840 |
which is based on their 7B language model and Qwen 7B. 00:08:00.200 |
And that, it outperforms stuff on academic benchmarks. 00:08:05.200 |
And then on human preference, it's behind GPT-4o. 00:08:09.480 |
if you guys remember how we do vision benchmarks, 00:08:18.380 |
It's similar to TriviaQA, where it's just understanding. 00:08:22.300 |
and they're not realistic of how people use it. 00:08:24.180 |
So they also have this user preference stuff. 00:08:26.340 |
User preference is basically like an ELO score of models. 00:08:32.440 |
This last sentence here is pretty interesting, right? 00:08:37.500 |
"proprietary systems, including Gemini 1.5 Pro 00:08:44.240 |
Architecture, if you guys know about other multimodal models, 00:08:50.200 |
with a little twist of traditionally, like we said, 00:08:56.600 |
you have a pre-trained off-the-shelf vision encoder 00:09:05.600 |
and we do contrastive, there's a term for this, 00:09:10.600 |
but you basically merge these embedding spaces. 00:09:14.260 |
In this case, they're like, "No, we don't need to do that." 00:09:17.400 |
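The term being reached for here is contrastive pre-training, which is how CLIP-style vision encoders align their image and text embedding spaces before any adapter stage. A toy sketch of that symmetric contrastive loss, for intuition only; the shapes and temperature are arbitrary, not CLIP's or Molmo's actual configuration:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss used in CLIP-style training: matching image/text
    pairs are pulled together, mismatched pairs in the batch are pushed apart."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)  # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# toy usage with random embeddings
print(clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```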
So they go over the four steps of this architecture. 00:09:24.380 |
It converts the image into multi-scale, multi-crop images. 00:09:29.780 |
So stage one of architecture is a preprocessor. 00:09:42.020 |
that maps and merges these two embedding spaces. 00:09:44.740 |
So the vision encoder, which I believe is based on CLIP, 00:09:52.340 |
but there's a paper called MetaCLIP from Meta. 00:09:56.580 |
They just use regular CLIP, which I think is fine. 00:10:14.420 |
And then stage four is a decoder-only transformer LLM 00:10:20.380 |
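A toy sketch of the four stages as described: a preprocessor producing multi-scale, multi-crop inputs, a vision encoder, a connector, and a decoder-only LLM. Every size, module, and name below is an illustrative placeholder rather than Molmo's actual implementation (an encoder block stands in for the causal LLM for brevity):

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self, patch_dim=768, vision_dim=1024, llm_dim=512, vocab=32000):
        super().__init__()
        self.vision_encoder = nn.Linear(patch_dim, vision_dim)      # stands in for CLIP ViT
        self.connector = nn.Sequential(                              # maps vision features
            nn.Linear(vision_dim, llm_dim), nn.GELU(),               # into the LLM's
            nn.Linear(llm_dim, llm_dim))                             # embedding space
        self.llm = nn.TransformerEncoder(                            # stands in for the LLM
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, crop_patches, text_embeds):
        # Stage 1 (preprocessing) has already turned the image into multi-scale,
        # multi-crop patches: crop_patches is (batch, num_patches, patch_dim).
        vis = self.vision_encoder(crop_patches)        # Stage 2: vision encoder
        vis_tokens = self.connector(vis)               # Stage 3: connector / projector
        seq = torch.cat([vis_tokens, text_embeds], 1)  # prepend image tokens to the text
        return self.lm_head(self.llm(seq))             # Stage 4: "LLM" + output head

model = ToyVLM()
logits = model(torch.randn(2, 64, 768), torch.randn(2, 16, 512))
print(logits.shape)  # torch.Size([2, 80, 32000])
```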
If people have questions on what's going on here, 00:10:24.580 |
but for people that joined our other multimodal stuff, 00:10:27.420 |
you should kind of understand how this works. 00:10:30.300 |
There's a good blog post we went over as well, 00:10:34.860 |
or another one that we can also share as a reference. 00:10:49.000 |
So highly recommend, but also we can dig into it here. 00:10:51.660 |
And then this is kind of that part of the vision encoder. 00:10:57.880 |
and we know that CLIP is trained on closed source data, 00:11:10.360 |
and they show that you can reproduce it from scratch. 00:11:13.160 |
They use OpenAI's CLIP because it's trained for higher resolution. 00:11:25.240 |
For the actual production, they just use CLIP, 00:11:38.440 |
They've got the range, and then for the big one, 00:11:46.200 |
- Do you wanna pause for any commentaries, Nathan? 00:12:05.680 |
I mean, I have this whole take that I've written 00:12:07.360 |
is that the whole vision space is just so underdeveloped 00:12:10.680 |
that a lot of these things that seem surprising, 00:12:16.320 |
are things that all the foundation companies, 00:12:17.880 |
like they're just gonna take our data and train on it. 00:12:20.440 |
And it's fine tuning, so it's like a million data points 00:12:22.880 |
is actually kind of a lot, and all this stuff. 00:12:29.920 |
and they figured out some cool new applications, 00:12:32.100 |
like clock reading and pointing, and then the data works. 00:12:35.260 |
There is actually a niche reference of somebody 00:12:37.320 |
that did this kind of voice annotation earlier. 00:12:42.220 |
'Cause someone had mentioned it after we released it, 00:12:44.980 |
saying that we were the first people to do it. 00:13:01.020 |
And then Nathan here like, yeah, they'll do it too. 00:13:07.260 |
The website has a pretty good blog post on this, 00:13:09.380 |
and there's a good, very professionally shot video 00:13:13.680 |
And it's always nice to see that niche of like, 00:13:17.400 |
there's so much underdeveloped alpha left in vision space 00:13:21.320 |
that you train the right data set and it gets good. 00:13:37.420 |
Is there anything we should pause and break on or? 00:13:43.300 |
I think the other models are much better as text models. 00:13:45.980 |
Even like Llama, where their, like, vision fine-tuning 00:13:48.340 |
is generally accepted as being like kind of meh 00:13:50.620 |
for whatever drama reasons you want to attribute to it. 00:13:53.340 |
Like their text scores are still much better. 00:14:12.680 |
Like I find it so unlikely it's gonna get better. 00:14:18.680 |
are probably using like instruction models as their base 00:14:26.420 |
There's literally like no chat template for multi-turn, 00:14:39.140 |
and then there's instruction fine tuning, right? 00:14:45.180 |
So like we have Llama base, which is not a chat model. 00:14:50.860 |
Basically, these are just those adapted to vision, 00:14:54.220 |
which means that they're not good chat models, 00:15:08.100 |
I don't know if you guys put the data set out yet. 00:15:19.380 |
- Our leadership probably knew Llama was coming 00:15:33.300 |
something like data set releases are a little bit messier. 00:15:38.660 |
you don't like we're not releasing the images. 00:15:40.740 |
You release the links type of thing and do some checks. 00:15:47.580 |
that filled out the silly Google form asking for interest, 00:16:03.820 |
- We did a little bit of experimenting on it, 00:16:05.860 |
but like in a rushed way and we didn't find it was better. 00:16:23.300 |
but partially because like that's how the model is trained. 00:16:30.420 |
as you mentioned, yeah, Gemini and OpenAI and stuff, 00:16:38.060 |
But it also kind of shows a flaw in benchmarks 00:16:45.700 |
which is pretty flawed because people do both. 00:16:51.180 |
This is the fun part since it's primarily vision-based. 00:16:58.700 |
So they've got a split of different data sets. 00:17:02.900 |
Step one is obviously caption generating from the images. 00:17:06.780 |
So they source web images from diverse sets of data, 00:17:15.340 |
Street signs, food, meme, drawings, websites, 00:17:29.500 |
but yeah, questions were like guided questions. 00:17:46.900 |
A bunch of stuff, people will be verbose, concise. 00:17:52.220 |
so transcribe it using off-the-shelf speech to text. 00:17:59.020 |
or it just probably doesn't matter, I'm guessing. 00:18:07.180 |
I don't think they actually logged the audio. 00:18:11.900 |
but I don't think it was in the terms of the data location 00:18:17.380 |
'cause it'd be a super cool, like, I don't know. 00:18:20.260 |
Like you can see how all these voice features come about. 00:18:26.300 |
it would have been a pretty cool audio dataset too, 00:18:29.140 |
but it looks like the dataset that'll be distributed 00:18:33.860 |
is just gonna be the captioning and image pairs. 00:18:39.100 |
but yeah, there's a very straightforward pre-processing. 00:18:44.660 |
or whatever off the shelf they wanna call it speech to text, 00:18:48.260 |
then process it through an LLM to improve text quality. 00:19:08.700 |
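A rough sketch of the annotation pipeline described here: spoken descriptions are transcribed with an off-the-shelf speech-to-text model and then cleaned up by an LLM. The specific models and the cleanup prompt are assumptions for illustration, not the exact pipeline from the paper:

```python
import whisper
from openai import OpenAI

stt = whisper.load_model("large-v3")        # any off-the-shelf ASR model works here
llm = OpenAI()

def caption_from_speech(audio_path: str) -> str:
    raw = stt.transcribe(audio_path)["text"]   # step 1: spoken description -> raw transcript
    resp = llm.chat.completions.create(        # step 2: LLM pass to fix disfluencies/typos
        model="gpt-4o-mini",                   # placeholder model choice
        messages=[
            {"role": "system",
             "content": "Rewrite this spoken image description as clean, detailed prose. "
                        "Do not add information that is not in the transcript."},
            {"role": "user", "content": raw},
        ],
    )
    return resp.choices[0].message.content

# caption = caption_from_speech("annotator_clip_0001.wav")  # hypothetical file name
```

Note that the paper's concern is about not distilling from proprietary vision-language models; this cleanup step is text-only, and any capable language model, including an open one, could be swapped in here.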
we use these four images to do natural data augmentation. 00:19:17.020 |
with 1.3 million captions, including the augmentation. 00:19:20.140 |
So not a lot, it's basically just fine tuning after that. 00:19:34.420 |
regarding text performance and I guess the presumption 00:19:39.060 |
that Molmo is inferior in text performance. 00:19:43.580 |
I'm actually wondering how much of it is potentially 00:19:50.540 |
because I can imagine for text in particular, 00:20:05.500 |
And we ask questions about the text through the model. 00:20:16.820 |
I think a lot of it's also just, yeah, it's a base model. 00:20:21.340 |
I think that there's a decent bit you could do. 00:20:41.380 |
convert this to JSON, answer stuff about this. 00:20:45.020 |
I don't see this really being like a chat with style model, 00:20:49.580 |
because yeah, there's Llama if you want that, right? 00:21:00.420 |
I see this more as like OCR output in a pipeline, 00:21:11.860 |
if you did it with a chat instruct tune model, 00:21:15.180 |
there wasn't any crazy performance difference. 00:21:19.620 |
there's not much benchmarking on the language models. 00:21:24.860 |
So I guess there is MMLU and they're not the best, 00:21:44.580 |
We've got 40 people on the call that are here for the paper. 00:22:12.100 |
So TLDR, people yap, here's guidance questions. 00:22:17.220 |
They got about a million captions for 700,000 images, 00:22:20.900 |
which I'm like, okay, that's kind of expensive. 00:22:33.540 |
It's basically a collection of answers to a diverse set 00:22:39.260 |
So for this image, what are questions people might ask? 00:22:58.860 |
So I had a really hard time understanding this. 00:23:06.140 |
So I guess what this means is that given image, 00:23:10.660 |
and OCR image for that and the OCR output for the image. 00:23:14.420 |
Does this mean we just apply OCR on the image? 00:23:19.540 |
If the image is an image of scenery or something. 00:23:22.380 |
I mean, if someone can just share the intuition behind this. 00:23:25.980 |
- Nathan, if you have intuition, you'd probably know better. 00:23:40.780 |
there's another thing here like street signs. 00:23:42.780 |
And then there's one that's like primarily charts and tables. 00:23:46.380 |
So I don't have the exact input output of this 00:23:49.820 |
but I'm assuming it's more text-based if there's OCR on it. 00:23:53.300 |
So it's probably, you know, documents, receipts. 00:23:58.060 |
That's what I would assume this subset of the dataset is. 00:24:07.620 |
- Yeah, so we provide caption, we provide OCR output 00:24:13.100 |
and the LM is supposed to answer the question 00:24:31.820 |
especially this dataset and how they created it. 00:25:06.900 |
where we just bootstrapping our synthetic data 00:25:11.340 |
or not as intuitive as compared to the point. 00:25:24.980 |
with the image and the question pairs though. 00:25:30.580 |
the question answering didn't have access to the image. 00:25:47.420 |
so like, I think that's what you were saying. 00:25:49.260 |
Like, it's interesting how you can train Molmo 00:25:52.820 |
to answer questions on an image without giving it the image, 00:25:55.540 |
but no, it's given the image and question answer pairs. 00:25:58.780 |
It's just the annotation answering the question 00:26:01.300 |
is image to OCR to GPT-4 write question answers 00:26:06.580 |
on this OCR that, you know, there's a, what is it? 00:26:11.460 |
There's an annotator to approve the question answers. 00:26:16.580 |
When we train, we have image question answers. 00:26:20.500 |
That actually is a, that's actually very intuitive. 00:26:24.660 |
So essentially what we're saying is that right now, 00:26:27.820 |
What we're going to do, but the model can generate captions 00:26:32.780 |
So we're going to get that model to generate captions 00:26:37.940 |
probably one of their strong LLMs to create QA pairs. 00:26:41.620 |
And now based on that, we now have QA pairs on the image. 00:26:50.860 |
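A rough sketch of the flow being described: a text-only LLM that never sees the pixels writes question/answer pairs from the human caption plus OCR output, and those pairs are then attached to the image for training (after annotator review). The prompt, model name, and JSON format are illustrative assumptions:

```python
import json
from openai import OpenAI

client = OpenAI()

def generate_qa_pairs(caption: str, ocr_text: str, n: int = 3) -> list[dict]:
    prompt = (
        f"Image caption:\n{caption}\n\nOCR output from the image:\n{ocr_text}\n\n"
        f"Write {n} question/answer pairs a user might ask about this image. "
        'Return JSON: [{"question": ..., "answer": ...}]'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                      # placeholder text-only model
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)

# pairs = generate_qa_pairs(caption="A receipt from a coffee shop ...",
#                           ocr_text="Latte $4.50\nTotal $4.50")
# each (image, question, answer) triple is then reviewed before it enters training
```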
- This little nuance here was that the off-the-shelf LLM 00:27:00.100 |
- That's where I'm like, oh, I think you mentioned 00:27:02.860 |
they're training question answering without giving it the image, 00:27:07.980 |
- Yeah, the final Molmo data actually has that, 00:27:10.740 |
but this is the process of them just generating the data. 00:27:15.940 |
And then honestly, I found this one pretty confusing 00:27:20.620 |
The rest, there's just quite a few subsets of these. 00:27:27.420 |
I don't know if I'm sharing my whole screen or just the, 00:27:42.720 |
So it kind of answers, like if you were to ask this, 00:27:52.280 |
So it's labeled data set of that next sample. 00:27:56.420 |
And then they show their example of connecting this thing 00:27:59.220 |
to Apple vision pro and asking it, like, what do I see here? 00:28:10.260 |
There's docs generate code for this much text 00:28:33.600 |
- This is because the model didn't work on clocks. 00:28:35.960 |
And then the lead really worked on clocks 00:28:40.660 |
So they're like, we've got to make it work on clocks. 00:28:43.360 |
One of the interesting things is that it doesn't work 00:28:50.040 |
- Yeah, like a temperature dial or pressure dial. 00:28:55.180 |
But just like such a, like it should be able to, 00:29:05.520 |
These are the insights that like we'd never get, 00:29:07.640 |
but is this just like over fit to a benchmark 00:29:11.680 |
Is this not like it should generalize bit or less? 00:29:20.940 |
Now every visual language model that comes out, 00:29:24.680 |
like it is something that they should be able to do. 00:29:28.760 |
- I mean, a big story of this is that like Matt, 00:29:31.920 |
that Deitke, I don't know how to pronounce his name, 00:29:37.160 |
And all the things that are adding to this dataset 00:29:40.920 |
'cause it kept getting better and better and better 00:29:42.540 |
over like six months of just making the dataset bigger. 00:29:49.680 |
And then they're like, okay, let's make it work. 00:29:53.880 |
Like the demo you guys showed was actually somewhat useful. 00:29:56.920 |
Like, you know, translate this table in a whole image 00:30:04.120 |
Or like stuff like this, like point to Mount Rainier, 00:30:21.660 |
And then of course the rest, there's academic datasets. 00:30:27.360 |
goes into all these very deeply, would recommend. 00:30:35.280 |
This is where I was like, okay, it does good on vision. 00:30:41.800 |
which are, you know, 11 of the commonly used ones. 00:31:09.800 |
but I guess the interesting differentiation there 00:31:18.960 |
Yeah, Phi is doing pretty rough in multimodal, 00:31:31.560 |
This is kind of back to the theme of this paper 00:31:34.160 |
of like stuff, most models use a vision encoder 00:31:49.280 |
So the vision encoder for LLaVA is open weights, 00:31:54.280 |
but the data and code used to generate it was closed. 00:32:02.320 |
A lot of the generation is done with closed stuff. 00:32:17.440 |
is mostly a distillation of proprietary stuff. 00:32:21.800 |
For example, all this proprietary stuff sucks at clocks. 00:32:24.680 |
So nothing that's a distillation will be good at clocks. 00:32:45.160 |
- And also, one point on the Molmo one, right? 00:32:48.560 |
If you scroll up all the way to the top, yeah, over here. 00:33:00.040 |
And that's why it's not because they don't want to open it. 00:33:02.680 |
It's because we just don't know what was in the Qwen data. 00:33:06.840 |
If not, it's everything that the AI2 team did was-- 00:33:11.600 |
- This model tests the definition of open-source AI, 00:33:14.560 |
because it's like, you grab some random vision encoder, 00:33:23.520 |
but we just don't know what the data encode is, yeah. 00:33:26.280 |
- Yeah, but what's the difference between a vision-- 00:33:27.720 |
Like, if the weights of the vision encoder were frozen, 00:33:30.960 |
I think that it could be fine, but I don't know. 00:33:34.320 |
It's weird, because you like say that fine-tunes 00:33:39.340 |
They're not like open-source fine-tunes and stuff. 00:33:42.360 |
- This is a discussion that the open-source initiative 00:33:49.200 |
that I don't have enough time to answer the emails for, 00:33:52.520 |
but it's like, I don't know what to do with it. 00:33:55.360 |
- The interesting thing there was the only closed code for, 00:34:04.800 |
They have the sentence that our vision encoder 00:34:13.600 |
all of our released models use OpenAI's ViT-Large CLIP model, 00:34:23.060 |
it can be reproduced from scratch with MetaCLIP. 00:34:28.600 |
because the data that OpenAI trained CLIP on, 00:34:34.400 |
So it's closed data, and they talked about this too, 00:34:36.980 |
like the paper and the blog post really shows 00:34:41.040 |
how the thing that made vision language models good 00:34:47.400 |
And this is where OpenAI started to go closed source, 00:34:50.720 |
where they're like, okay, we've scraped the web, 00:34:59.000 |
They give the final model, but not the data set. 00:35:04.520 |
Meta comes out with MetaCLIP where they're like, 00:35:17.720 |
but I don't know, I think it's fine that they did it. 00:35:26.400 |
This could have been open if they just used MetaCLIP, 00:35:30.880 |
because I think Meta did it as a proof of concept. 00:35:41.360 |
the gap in data set in the open-source space is so large, 00:35:45.280 |
that the biggest thing is actually more about 00:35:48.520 |
getting the data set open-source and working. 00:36:07.640 |
than is used to train it to the next open model, 00:36:21.920 |
and not have a good data set in the first place. 00:36:39.740 |
And there's still quite a bit that could be done here, 00:37:05.220 |
If people understand and want to dig into any of these, 00:37:10.920 |
So it's an interesting differentiation of open-weight, 00:37:30.120 |
And then across the actual benchmarks themselves, 00:37:41.640 |
on most benchmarks, but the 7B outperforms 4V. 00:37:53.820 |
they'd want to dig into on the benchmarks, we can. 00:38:08.480 |
Yeah, I didn't have much more to dig into in this. 00:38:14.340 |
I thought the ELO ranking was kind of interesting as well. 00:38:20.440 |
Human preference evals use 15K image text prompt pairs. 00:38:29.400 |
triplets to all this to 870 human annotators. 00:38:33.240 |
They were given pair-wise preference rankings 00:38:46.160 |
Their Elo rankings are based on 3x more comparisons than Chatbot Arena 00:38:50.980 |
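For reference, a minimal sketch of how pairwise preferences turn into an Elo ranking; the K-factor and starting ratings are arbitrary choices, not the paper's exact setup:

```python
def update_elo(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Standard Elo update for one pairwise comparison between model A and model B."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

ratings = {"molmo-72b": 1000.0, "gpt-4o": 1000.0}
# one pairwise human judgment: the annotator preferred GPT-4o on this prompt
ratings["gpt-4o"], ratings["molmo-72b"] = update_elo(
    ratings["gpt-4o"], ratings["molmo-72b"], a_wins=True)
print(ratings)
```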
Also, I hear there's some potential tea with LMSYS. 00:38:54.820 |
Maybe that's coming out soon for their vision models. 00:39:00.080 |
- There's potential tea with LMSYS, don't extrapolate. 00:39:00.080 |
But yeah, if anyone knows how ELO ranking works, 00:39:17.120 |
But yeah, most preference ranked was GPT-4o, 00:39:17.120 |
then Molmo-72B, then Gemini, then Sonnet, then the 7B. 00:39:20.880 |
It's really cool how the fourth one is the 7B. 00:39:44.500 |
I'd love to see a recreation of fusion for this 00:39:58.160 |
Also, there was a line here as well in their pre-training 00:40:04.680 |
they talk about how they do this different approach. 00:40:10.480 |
So we don't do RLHF, but we've got the RLHF guy here, 00:40:18.860 |
Is this meant to say like, you know, there's no SFT, 00:40:22.980 |
Well, there is supervised fine tuning for it, 00:40:27.840 |
- For us, team alignment and incentives are hard. 00:40:46.600 |
And it's about the granularity that we should attach 00:40:56.680 |
say from like a satellite view of a parking lot. 00:40:59.320 |
And I want to ask a question that like Eugene could answer 00:41:05.520 |
in the second row, third from the left or something. 00:41:36.780 |
that uses that cool like audio transcription pipeline. 00:41:52.960 |
or image caption datasets be if we want to have, 00:42:17.060 |
- Do you want to TLDR that into one, two line area? 00:42:21.940 |
- How granular should our image caption pairs be, period. 00:42:30.260 |
- I think Nathan probably has more intuition on this, 00:42:32.700 |
but their thing wasn't about granularity, it seems. 00:42:37.460 |
It was about one, like, yeah, you've got diversity, 00:42:46.300 |
So we talked about these four different ones, 00:42:54.180 |
and figure heavy images, including charts and whatnot, 00:42:59.620 |
but there's also all the academic datasets, right? 00:43:18.020 |
and what they do with good captions around it. 00:43:27.100 |
but let's answer this question before moving to another one. 00:43:33.100 |
I think most of this was like really high detail responses 00:43:56.300 |
Captioning tends to overfit to what is being described, 00:43:56.300 |
where our question answers kind of forces the model 00:44:02.460 |
to pick up details that it would previously ignore, 00:44:07.140 |
or the one extra finger on the hand, or whatever it is. 00:44:13.460 |
and you can answer, you're training the model 00:44:23.620 |
another dataset may ask an alternative question instead, 00:44:31.540 |
I think captioning is a problem in overfitting 00:44:49.060 |
I'm going to finish the paper real quick in the takeaways. 00:44:51.820 |
This is a cool little video we can watch after, 00:45:12.980 |
on most academic benchmarks and their ELO ranking. 00:45:23.340 |
comfortably sit between GPT-4V and GPT-4o on both. 00:45:45.100 |
their big one is the best state-of-the-art everything, 00:45:58.660 |
Yeah, so it's basically best model beats everything. 00:46:19.900 |
I probably should have read more into it, but... 00:46:22.020 |
- Excuse me, do you plan to show us something, 00:46:35.860 |
But yeah, I was just going over these highlights. 00:46:53.740 |
So it's state-of-the-art better than proprietary. 00:47:00.460 |
of it does really good on this Android Control, 00:47:23.740 |
So this technical report is getting an update soon. 00:47:41.580 |
but always cool to see training and eval code. 00:47:45.620 |
I guess that's pretty good overview of the paper. 00:47:48.500 |
If there's anything we want to dive deeper into, 00:47:53.540 |
Nathan, if you want to add anything, cook, go over. 00:48:03.700 |
but like I'm ready to get to the section about Whisper, 00:48:03.700 |
There is a little use case thing to just make it useful. 00:48:32.660 |
And then we've got a little update on Whisper. 00:48:53.500 |
- Counting the number of people shows a total of 21. 00:49:15.940 |
- Schwinn bike for sale, blue with white accent. 00:49:26.500 |
I recommend people check out the rest of their blog posts. 00:49:33.180 |
There's a couple other videos if people are interested. 00:49:36.580 |
And then there's just a good little TLDR of the paper. 00:49:50.700 |
Otherwise we have a little Whisper update too. 00:49:58.540 |
Yeah, questions or should we pass to Whisper? 00:50:36.140 |
Thank you so much for covering the Molmo paper 00:50:57.660 |
OpenAI has recently released a new checkpoint 00:51:00.300 |
or version of Whisper called Whisper large-v3-turbo. 00:51:03.780 |
It's supposed to be, like, faster and more efficient 00:51:10.100 |
And it was inspired by the work behind Distil-Whisper. 00:51:15.060 |
then that Jong Wook merged into the OpenAI repository. 00:51:24.060 |
Whisper is like the state of the art ASR model 00:51:33.140 |
on the encoder decoder transformer architecture 00:51:47.740 |
And these hidden states are sent to the decoder 00:52:19.940 |
And then later on they added large V2 and large V3. 00:52:29.140 |
So for example, base has six layers in the encoder 00:52:39.700 |
was actually the same motivation behind Distil-Whisper. 00:52:39.700 |
The second observation is that the decoder performs 00:52:58.020 |
the simpler task of like mapping the hidden states 00:53:04.340 |
is actually extracting these hidden states of the audio 00:53:07.260 |
and trying to understand what's being said in there. 00:53:13.100 |
Whisper is like an autoregressive language model 00:53:25.620 |
and we cannot utilize modern GPUs to parallelize this. 00:53:31.380 |
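A toy illustration of that point, with dummy modules standing in for Whisper's real layers: the encoder sees the whole 30-second mel spectrogram in one parallel pass, while the autoregressive decoder has to run once per generated token, sequentially:

```python
import torch
import torch.nn as nn

encoder = nn.Linear(80, 512)                        # stands in for the audio encoder
decoder_step = nn.GRU(512, 512, batch_first=True)   # stands in for one decoding step
lm_head = nn.Linear(512, 1000)                      # toy vocabulary of 1000 tokens

mel = torch.randn(1, 3000, 80)                      # ~30 s of audio as 80-bin mel frames
audio_states = encoder(mel)                         # encoder: one pass, fully parallel

hidden = audio_states.mean(dim=1).unsqueeze(0)      # squeeze the audio info into a toy state
token_emb = torch.zeros(1, 1, 512)                  # "start of transcript" embedding
for step in range(20):                              # decoder: one pass PER token
    out, hidden = decoder_step(token_emb, hidden)
    next_token = lm_head(out).argmax(-1)            # greedy choice of the next token
    token_emb = torch.randn(1, 1, 512)              # (real code would embed next_token here)
print("1 encoder pass, 20 sequential decoder passes")
```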
by actually comparing the different versions of Whisper. 00:53:37.780 |
could transcribe audio and it is loud and clear. 00:54:00.980 |
that they are actually capable of like modeling the language 00:54:04.660 |
but they just struggle with the audio sometimes. 00:54:12.580 |
Whisper Turbo is like a new and more efficient version 00:54:19.300 |
It uses mainly two techniques to achieve this improvement 00:54:19.300 |
to like reduce the compute and memory requirements 00:54:52.220 |
of a model without impacting the accuracy as much. 00:54:55.220 |
In Whisper Turbo, the pruning was done only in the decoder 00:55:12.020 |
So the turbo model is like 1.78 times smaller 00:55:24.980 |
you probably will miss all the capabilities of the model. 00:55:29.340 |
Like the model will probably degrade and collapse 00:55:32.380 |
and will not be able to even generate coherent text. 00:55:35.300 |
You will have to, like, train these four layers 00:55:35.300 |
you train the model on like a relatively big amount of data 00:55:51.460 |
that it has forgotten or like got confused about. 00:55:57.020 |
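A rough sketch of that prune-then-retrain idea, assuming the Hugging Face transformers Whisper implementation; which layers to keep is an illustrative guess, not OpenAI's actual recipe, and the pruned model would need the continued training described here before it is usable:

```python
import torch.nn as nn
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
decoder = model.model.decoder

# Keep only a handful of the 32 decoder layers (first two and last two here,
# purely as an illustration of the layer-selection step).
keep = [0, 1, len(decoder.layers) - 2, len(decoder.layers) - 1]
decoder.layers = nn.ModuleList(decoder.layers[i] for i in keep)
model.config.decoder_layers = len(decoder.layers)

# At this point the model will not produce coherent transcripts; it needs
# continued (pre-)training on a large transcription dataset to recover.
print(f"decoder now has {len(decoder.layers)} layers")
```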
the continued pre-training happened on the same dataset 00:56:01.220 |
in the original training of Whisper Large V3. 00:56:03.700 |
So for reference, the dataset consisted of two epochs. 00:56:10.580 |
1 million out of these were, like, weakly supervised 00:56:21.220 |
So we had like 5 million hours in each epoch. 00:56:23.620 |
This means that we have 10 million hours in two epochs. 00:56:27.060 |
So this was like the size of the original training data. 00:56:34.620 |
except we're only using the transcription data. 00:56:45.460 |
So yeah, this gives us a hint about like the size 00:56:56.380 |
we used a linearly decaying learning rate, 00:57:21.260 |
But we'll talk about this like in a minute or two. 00:57:26.940 |
So they both have the same number of encoder layers, 00:57:31.540 |
but the Whisper Turbo has four decoder layers 00:57:49.580 |
And the task for Distilled Whisper is transcription 00:57:58.460 |
- The difference with Whisper Turbo and Distilled Whisper 00:58:01.100 |
is primarily the pruning versus distillation. 00:58:04.540 |
It seems like we're giving a lot of benefit towards pruning, 00:58:14.820 |
because I also assume pruning versus distillation 00:58:18.540 |
is a big part of it and that's why it can stay. 00:58:20.940 |
It's interesting that it can stay multilingual. 00:58:25.700 |
but how much of this do you think is based on the strategy 00:58:37.220 |
Like how much of a difference do you think is caused? 00:58:50.380 |
out of smaller size comes from pruning versus data? 00:58:53.460 |
Because they don't really mention how good this data is. 00:59:00.220 |
So we kind of like have some info about the dataset. 00:59:02.420 |
For Whisper Turbo, it uses exactly the same dataset 00:59:08.420 |
except for the translation section, which is excluded. 00:59:12.420 |
So this is like a very big dataset of like 10 million, 00:59:16.660 |
And it has like many, many different languages. 00:59:19.140 |
So this is a very diverse and large and robust dataset, 00:59:24.900 |
On the other hand, Distilled Whisper uses English only data. 00:59:39.500 |
is trying to use actually higher quality data, 00:59:54.140 |
for Distilled Whisper, it's just a few thousand hours. 00:59:58.660 |
For example, Distilled Whisper is only for English. 01:00:00.620 |
You cannot use it like for French, German, Spanish, 01:00:05.060 |
And I believe like if you train a single model 01:00:16.900 |
Like it is a robust model because it is trained 01:00:21.060 |
encompassing different domains and languages. 01:00:25.860 |
- So I hope this kind of answers your question. 01:00:27.620 |
- Really, really helps with the intuition there. 01:00:41.860 |
So there was a question from swyx about like, 01:00:44.980 |
how do you actually do distillation for an ASR model? 01:00:51.860 |
So let's say you want to distill Llama 405B to Llama 3B. 01:00:56.100 |
The way you do this is you try to like extract 01:01:07.420 |
the way you train Whisper is a language model 01:01:19.420 |
And you only like, the difference is the input 01:01:28.940 |
on the next token predicted by the bigger model. 01:01:31.660 |
So kind of like synthetic data or like pseudo data, 01:01:41.540 |
is you're trying to like make the output distribution 01:01:48.500 |
to the output distribution of the bigger model. 01:01:51.940 |
but like the entire distribution of the next token. 01:01:59.980 |
you're only like training it on the next token. 01:02:04.260 |
You don't care about the top nine or top 10 predictions. 01:02:07.740 |
But if you wanna train or like gonna do knowledge distillation 01:02:13.860 |
you might want to do something called KL divergence training. 01:02:17.780 |
And in this way, you try to teach the smaller model 01:02:25.700 |
get the top 10 predictions for the next token correctly. 01:02:30.060 |
And this actually helps the model learn a lot more. 01:02:32.340 |
You're giving much more information for each data point 01:02:38.780 |
So this is like how you do a knowledge distillation 01:02:42.020 |
for an ASR model in like a very brief overview. 01:02:52.900 |
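A minimal sketch of that KL-divergence distillation objective: the student is pushed to match the teacher's whole next-token distribution rather than just the single argmax token. The temperature, the T-squared scaling, and the tensor shapes are conventional choices, not a specific recipe from Distil-Whisper:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over the next-token distribution at every position."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # scale by T^2 as is conventional so gradients don't shrink with temperature
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# (batch, seq, vocab) logits; the vocab size here is just a placeholder
student_logits = torch.randn(4, 100, 50000)   # from the small model being trained
teacher_logits = torch.randn(4, 100, 50000)   # from the frozen big model
print(distillation_loss(student_logits, teacher_logits))
```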
- There's another paper that came out recently 01:02:54.500 |
that talks about distillation and the three types. 01:03:01.020 |
and you match input output type distillation, 01:03:25.660 |
is Whisper Turbo fast enough to be real-time? 01:03:30.820 |
and how does real-time work with chunk-based decoding 01:03:33.740 |
is something asked, if you have intuition around this. 01:04:06.020 |
So the concept of like real-time is kind of like debatable, 01:04:18.380 |
is the Whisper model fast enough to be real-time? 01:04:25.140 |
If you run it on a T4 or like anything that's bigger 01:04:39.540 |
And this is even applicable with like the large V2 01:04:41.700 |
and large V3, and it's going to be even faster 01:04:43.940 |
with like Distil-Whisper or like Whisper Turbo. 01:04:46.780 |
So yes, I think Whisper can actually be very, 01:04:49.420 |
very good at real-time kind of transcription. 01:05:00.300 |
- So one of the ways to do real-time transcription 01:05:14.260 |
is that you have to give it 30 seconds of audio. 01:05:18.980 |
like you have to pad the remaining 29 seconds. 01:05:21.540 |
So you're always giving it 30 seconds of audio 01:05:25.740 |
And then depending on actually how long your speech is, 01:05:29.260 |
it will maybe generate two tokens or 200 tokens. 01:05:35.620 |
it will probably generate only two or three tokens 01:05:40.340 |
So you're spending the same time on the encoder, 01:05:42.620 |
but you're spending significantly less time on the decoder. 01:05:45.660 |
So one way to utilize this in real-time decoding 01:05:49.420 |
is continuously give Whisper maybe 300 milliseconds of audio 01:05:59.460 |
and then ask it to generate like the five or 10 tokens, 01:06:02.820 |
or maybe even less that correspond to 300 milliseconds. 01:06:15.180 |
So you keep doing this multiple, multiple times 01:06:17.900 |
until you run into your first batch of chunks, 01:06:22.900 |
and then you can just shift the audio buffer a bit. 01:06:29.340 |
but it's basically about continuously feeding the model 01:07:08.020 |
You can just keep adding the chunks without overlapping. 01:07:29.780 |
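A rough sketch of the buffered pseudo-streaming approach described here, using the openai-whisper package; the chunk size, buffer length, and model name are illustrative choices and assume a recent release that ships the turbo checkpoint:

```python
import numpy as np
import whisper

SAMPLE_RATE = 16000
model = whisper.load_model("turbo")   # assumes an openai-whisper version with the turbo checkpoint

buffer = np.zeros(0, dtype=np.float32)

def on_new_chunk(chunk: np.ndarray) -> str:
    """chunk: ~300 ms of 16 kHz mono float32 audio from the microphone."""
    global buffer
    buffer = np.concatenate([buffer, chunk])       # keep appending small chunks
    if len(buffer) > 25 * SAMPLE_RATE:             # once the buffer gets long,
        buffer = buffer[-25 * SAMPLE_RATE:]        # shift it by dropping the oldest audio
    # whisper pads/trims the input toward a 30 s window internally before encoding,
    # so each call spends roughly constant time in the encoder and only a little
    # time in the decoder for the few tokens the short audio actually contains
    return model.transcribe(buffer, fp16=False)["text"]
```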
- If you want to keep going through, go through. 01:07:37.980 |
I asked Jong Wook about which hyperparameters to use. 01:07:42.060 |
She said, you can start with like 1e-5 01:07:51.060 |
But it can be like fine tuned just like any other model. 01:08:05.460 |
So yeah, it's fine tuneable just like any other model. 01:08:17.180 |
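A minimal sketch of what fine-tuning the turbo checkpoint with that suggested learning rate might look like, assuming the Hugging Face transformers Seq2SeqTrainer; the dataset, data collator, and most arguments are placeholders:

```python
from transformers import (WhisperForConditionalGeneration, WhisperProcessor,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3-turbo")

args = Seq2SeqTrainingArguments(
    output_dir="whisper-turbo-finetuned",
    learning_rate=1e-5,                 # the starting point suggested in the discussion
    warmup_steps=100,
    max_steps=2000,
    per_device_train_batch_size=8,
    fp16=True,
)
# trainer = Seq2SeqTrainer(model=model, args=args,
#                          train_dataset=my_dataset,    # hypothetical prepared dataset
#                          data_collator=my_collator)   # hypothetical padding collator
# trainer.train()
```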
people also asked how it compares to faster-whisper 01:08:21.740 |
So we've answered the question about Distil-Whisper, 01:08:27.780 |
because faster-whisper is like an inference engine. 01:08:33.260 |
It can be used to deploy any of the other versions 01:08:51.580 |
So any question before we go into the benchmarks? 01:09:05.980 |
I highly recommend that you benchmark on your own data. 01:09:26.980 |
so there is no way it was in the training data, hopefully. 01:09:37.580 |
so he's speaking English with kind of like an accent. 01:09:43.020 |
Hopefully it was not in the training data as well. 01:09:50.140 |
So yeah, Karpathy is known to be talking relatively fast, 01:10:16.100 |
I benchmarked the different sizes or versions 01:10:19.420 |
of like the different data points we discussed. 01:10:30.140 |
It's one of the least accurate models out there. 01:10:33.780 |
But the other models are like kind of similar, 01:10:57.260 |
It can be caught in an endless repetition loop 01:11:06.220 |
on these data points is doing very, very well 01:11:11.620 |
But on other benchmarks that I did for like a projects 01:11:18.380 |
but I can give hints that it was not very good. 01:11:20.660 |
It was doing much worse than the other models. 01:11:30.140 |
you might see like very different numbers from this. 01:11:32.980 |
So this is like the WER; the CER is very similar as well. 01:11:38.860 |
compared to the others and large V3 can be a hit or miss. 01:11:46.100 |
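A small sketch of "benchmark on your own data": compute WER and CER for a few Whisper variants against your own reference transcript, here using the jiwer library; the file names are placeholders:

```python
import jiwer
import whisper

reference = open("my_reference_transcript.txt").read()   # your ground-truth transcript

for name in ["base", "large-v3", "turbo"]:
    model = whisper.load_model(name)
    hypothesis = model.transcribe("my_audio.wav")["text"]
    print(name,
          "WER:", round(jiwer.wer(reference, hypothesis), 3),
          "CER:", round(jiwer.cer(reference, hypothesis), 3))
```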
super quick introduction about Whisper Turbo. 01:11:56.980 |
- 'Cause there's no MLX stuff, what do you mean, MLX? 01:11:56.980 |
So I don't have a Mac, so I don't care about MLX. 01:12:15.540 |
So yeah, just as a hint, I think MLX is not good. 01:12:20.380 |
So yeah, just as a hint, I think MLX is like a framework 01:12:24.020 |
for deploying or like using machine learning models 01:12:29.660 |
and presumably it's very fast and very efficient. 01:12:32.660 |
But I don't have a Mac, so no data points from my side. 01:12:45.860 |
Always excited by your Whisper updates and explanations. 01:12:49.540 |
Really, really like those slides, those shadow slides.