back to index[LLM Paper Club] Llama 3.1 Paper: The Llama Family of Models
00:00:00.000 |
And I kicked off like a what do people want to work on section with I'm going to do a deep dive in the paper because I need to make slides on this and that very much overtook the hackathon. 00:00:10.000 |
We had like a solid crew of like 20-30 people that were just discussing the paper with me and slides didn't really get made. 00:00:16.000 |
But also I don't know it was weird hackathon project winner was just a deep dive into the paper, but we had an in-person paper club session that I led yesterday and a lot of people from there trying to join in so it should be vibes. 00:00:30.000 |
I am liking in person hybrid format, I might start running those we'll see how they go but it was good. Everyone had good discussions. 00:00:40.000 |
Amazing, amazing. Yeah, I would be happy to join that. Once I get back to SF this Friday. 00:00:47.000 |
Oh, Friday. Exciting. Yeah. So, as you know, we also we interviewed Thomas who was one of the paper co authors, he did not give us the paper beforehand, which is annoying because after reading the paper I have so much better questions that we actually 00:01:06.000 |
Yeah, yeah, I think that you know a bunch of us have read it. 00:01:09.000 |
I feel like Vibhu you're probably best situated to, to take over the screen if you want, if you have, if you have stuff. 00:01:17.000 |
I have very basic stuff but yeah, I'll share. We also got someone at the hackathon that worked on a hackathon project that's paper to video, so someone's cooking up a video explainer of this it's like literally doing inference right now, we'll share 00:01:31.000 |
that once it's ready. Yeah. Yeah, this is, I mean this is, I'm excited about it but also I'm wondering how to do this justice. I feel like we can post questions in here, and then, you know, people can just kind of discuss in the chat. 00:01:50.000 |
And yeah, I mean like we classically have a lot of side discussions in the zoom chat anyway so I'm not worried about that. 00:01:57.000 |
I mean, yeah, well Vibhu, Vibhu go ahead you can get started. But like, what do what do people think what people want to talk about, you know, personally, I, I called this the synthetic data paper. 00:02:11.000 |
So I have a lot of interesting insights or these questions about the synthetic data stuff, but we can talk about everything like there's just so much in here. 00:02:20.000 |
The format that worked well yesterday was like, we're not getting through 100 pages, I'll give the high level we'll go through the tweet overviews. And then let's just dig into whatever anyone found interesting and had like, you know, something that someone dove into 00:02:34.000 |
so like part of it was they had a bunch of scaling laws for pre training they had scaling laws for how they picked 405 B and 15 trillion tokens. So whatever someone chose to dive deep into is what like I was like okay, well we'll dig into that. 00:02:48.000 |
Also, other part of this I'm probably going to give like a longer one hour breakdown of like I'll go through the whole paper, have like an hour talk at some point so I started slides, they're very not ready these are like 20 minutes of slides just became discussions. 00:03:01.000 |
But basically, I'll spend like two minutes on overview everyone knows llama. 00:03:06.000 |
Interesting stuff that so like they drop three to three 131 was a pretty big update the AP got a lot better than 70 be got a lot better. A lot of this is just for other talk, but, um, yeah, they drop some sizes the context with getting bigger. 00:03:21.000 |
Their scaling laws were just overtrain and pray, but no, they're actually pretty grounded and real scaling laws. 00:03:30.000 |
They're all dense models, after reading the paper their whole justification for this was like we want to see stuff that scales. It's the first actual research paper where they talk about everything hardware inference hardware failures what happened how they fixed it. 00:03:43.000 |
So, real research paper, there's a lot on it it's like basically everything pre training post training, they're scaling laws how they ran experiments it's a great recipe on how to build it, that's what this talk would be later when I do it. 00:03:56.000 |
It's really cooked models really good, it's solid open sources like GPT four level, they talk about how they bring up performance. 00:04:15.000 |
Also for anyone that finds any time to cut us off cut us off it's all vibes. 00:04:20.000 |
The jumps for the eight be basically from three to three one, it got better all around overview of the paper, there's a bunch of Twitter threads so instead of me making slides will go over like the main one shared a discord for everyone that hasn't seen also is my 00:04:37.000 |
whole screen sharing. So is it just okay let me share my desktop. Okay, so for people that are new and not in discord. 00:04:45.000 |
I do have a very active running llama three section, I'll share the paper. So, if we go to this little like you have to find it so you got to go through the news and find the llama there's like 60 links that we've been posting of everything popular on Twitter 00:05:02.000 |
will go through these at some point but paper overview is basically like they have multiple phases of training. I'm very not ready I do have other notes on this that all share it there so I started screenshotting stuff. 00:05:18.000 |
They have like three aspects to a good foundation model, basically data scale complexity. This is why they didn't go into an MOE they want training stability, basic two phase training there's like pre training post training. 00:05:32.000 |
They do a lot of scaling well work so in their pre trained data set, how they, how do they determine what the pre training mixes in the post training in the pre training, they start doing most of the training at like low context, then they continue pre 00:05:48.000 |
A lot of what they said in their complexity section was like, we want to do basic stuff that will like scale up this is like the foundation for how we can like redefine scaling laws and train this stuff up so no crazy complex RL just, you know, SFT for chat 00:06:03.000 |
tuning and then they do a lot of models for rejection sampling to do data set stuff, DPO. 00:06:22.000 |
They added their safety bullshit at the end. Other interesting stuff that didn't make it to Twitter was like, they had multimodal experiments, they did a lot of experiments, they did a lot of tests, they did a lot of experiments, they did a lot 00:06:43.000 |
don't do it. They added their safety bullshit at the end. Other interesting stuff that didn't make 00:06:48.760 |
it to Twitter was like they had multimodal experiments. They trained in adapters. They 00:06:53.720 |
have a vision adapter, audio adapter stuff. There was cool sections on their pre-training mix. So 00:07:00.600 |
basically they used a lot of like traditional filtering techniques. So they have like Roberta 00:07:06.520 |
based filters for high quality. They have this for their synthetic data distribution. Like how 00:07:11.800 |
do we extract out high quality data? Then they have a lot of traditional NLP for like PII text 00:07:16.680 |
extraction. They had a whole section on like how they scrape the web and how they train their own 00:07:21.480 |
parsing HTML. They compared it to what's out there and their stuff's better. There's a lot in this 00:07:26.200 |
paper. Datamix was a really interesting section as well. So they basically go into, here's basically 00:07:34.440 |
what they did, deduplication, all this stuff that you would expect. Model-based filtering was pretty 00:07:39.880 |
cool. They used a lot of like they trained classifiers on LAMA 2 outputs. On the synthetic 00:07:44.840 |
data side, Eugene has a great tweet thread. We'll probably go through it at some point. 00:07:48.520 |
This was an interesting section that we haven't seen before. So when you have like a base model 00:07:54.200 |
that you're pre-training and you have like 15 trillion tokens, how do you determine what the 00:07:58.440 |
right mix of that is? So their finding was like half the tokens are general knowledge, 25% map 00:08:04.440 |
and reasoning, 17% code, all this stuff. But they're like, this is the first research paper 00:08:09.720 |
that actually breaks this stuff down. They actually did like scaling law experiments. 00:08:13.640 |
So they trained small models that were like a couple billion parameters. They started testing 00:08:19.480 |
different datamixes and then they train a large model to see what actually works on what's the 00:08:23.720 |
right datamix. And then they're like, here's the answer for this stuff. Model architecture was 00:08:29.640 |
pretty similar. They like did a few little changes. They did some better attention masking, 00:08:34.600 |
group query attention, here's architecture. All this stuff is like on Twitter, so not as interesting. 00:08:39.640 |
From the podcast that Sean had, the vocab section is pretty interesting. They're like, 00:08:44.600 |
instead of messing with tokenizers, changing vocab is pretty big for small models. Check out 00:08:49.960 |
the podcast or if it comes up in discussion, we'll discuss it. Scaling laws was another interesting 00:08:55.160 |
one for the paper itself. Basically traditional like chinchilla scaling laws used to have this 00:09:01.960 |
whole like, they're predicting what's the optimal for your compute budget, like what's the optimal 00:09:07.160 |
model parameters, all that stuff, how many tokens you train on. We thought that they were just 00:09:12.360 |
scaling and preying and trading like, you know, fixed cost training run for cheaper inference. 00:09:17.880 |
But this stuff is actually grounded. So they developed new scaling laws where TLDR of what 00:09:23.000 |
they did is previously we used to do scaling laws where we're just predicting on next token 00:09:30.040 |
prediction accuracy, right? So we're trying to predict on like perplexity and just how good is 00:09:35.480 |
next token prediction. Instead they do all this fancy math and they change the training objective 00:09:41.720 |
to be like more representative of a reasoning benchmark. They use the ARC challenge where 00:09:46.440 |
basically they have a reasoning benchmark and now instead of doing scaling laws to predict next token 00:09:52.120 |
prediction, they've changed it so that they're doing scaling laws to predict optimal model stuff 00:09:57.240 |
based on actual reasoning. And that's where they come up with this like, their scaling laws show 00:10:02.680 |
that for a 402b model, you want to train on 16 and a half trillion tokens. Based on that, they did a 00:10:08.520 |
flagship 405b based on 15 trillion tokens. And then this is where they have their like infra 00:10:15.400 |
optimal where they started to do the 8b, the 70b, they just reuse their 15 trillion tokens and just 00:10:22.360 |
overtrained and that works. The other really cool section, the sections that didn't make it on 00:10:27.720 |
Twitter were like their training infrastructure. So they give out everything, right? They give out 00:10:33.160 |
like the full pre-training stack of like, they have a section in here on how they do their pre-training. 00:10:38.920 |
So like, one is like the whole hardware configuration. So 16,000 H100 hours, what 00:10:46.040 |
failures they hit, why they went for simplicity. This was a pretty interesting section. Like over 00:10:51.800 |
their 54 day training, they had like 400 job interruptions, 419 unexpected interruptions, 00:10:59.560 |
and like 78% of these were like GPU hardware issues. And then they have a section on like, 00:11:05.240 |
if they did MOE, all this stuff compounds. So we just wanted something like simple, 00:11:09.480 |
scalable that we could deal with well. And like, this is stuff that you don't really see in 00:11:14.280 |
papers anymore, right? It goes further with like, what is the pre-training set? So like, 00:11:19.560 |
these formulas, we don't really see anymore, right? So it's like when they pre-trained it, 00:11:24.920 |
here's their like peak learning rate, here's their warmup, here's their decay, here's how 00:11:28.600 |
many training steps, here's the bat size, little nuggets like this haven't really like come up on 00:11:33.560 |
Twitter yet. But like, you know, at first, they have a bat size of 4 million tokens with a small 00:11:38.280 |
sequence length. So like, the first bit of training is a sequence length of 4,000. Then 00:11:43.080 |
they double it to like 8 million sequences at 8,000 for the next 252 million tokens. After 00:11:49.960 |
they've trained on 200 million tokens, they double it again to like larger bat size for the next 3 00:11:54.920 |
trillion tokens. And then they do most of the training at 8,000 token sequence length. So like, 00:12:00.520 |
little stuff like this, I feel like we still need to digest. There's reasons for why they did this, 00:12:05.800 |
but basically TL;DR, no other open source paper has like a formula like this. And then that's kind 00:12:11.880 |
of what the next like 100 pages is. I feel like at that point, instead of finding what I found 00:12:17.720 |
interesting, like, I found all this stuff really interesting. They talked about the batching, 00:12:22.760 |
GPU utilization, memory, like utilization, all that stuff. Like CUDA optimizations, their whole 00:12:30.600 |
training recipe, what they released performance stuff. Instead, I feel like that's enough of a 00:12:36.280 |
high level overview of the paper. The more fun stuff is like, yeah, so how does it perform? 00:12:41.800 |
They're all better. Infra companies are pretty cheap. And this is also where like everyone else 00:12:46.440 |
can hop into discussion. Eugene, Sean, other Eugene, hop in now. You know, Fireworks is somehow 00:12:53.480 |
really undercutting inference price. The scale leaderboard is a held out leaderboard. It does 00:12:58.200 |
pretty good here. What else? Grok has it. So some insider info for all the Infra companies, 00:13:05.640 |
they gave access to randomized weights that were the same size about six days before launch. 00:13:12.040 |
So six days ago, Infra companies started playing around with it. They started working out how 00:13:16.440 |
they're going to do inference, what type of decoding they need. But they didn't have the 00:13:19.720 |
paper. They didn't have the actual weights. And then day of, they released weights. But like, 00:13:23.960 |
yeah, stuff like Grok is serving a thousand tokens per second. What other discussions that 00:13:29.400 |
we have here? Kyle did pretty good evals on performance. He started doing it on his own 00:13:37.000 |
fine tuning stack. So he started fine tuning it, compared it to 4.0 mini. OpenAI within hours 00:13:43.880 |
responded with like 4.0 mini fine tuning. But fine tuning the Lama 3.18b is kind of on par with 00:13:52.360 |
4.0 mini. 4.0 mini fine tuning is kind of broken and free, but it gets worse. 00:13:56.520 |
What other fun stuff? There's a comparison of model pricing here that's being updated live. 00:14:03.640 |
Other tweets, George Hotz, Karpathy tweeted. VLLM supports it. Other more independent benchmarks 00:14:12.040 |
coming in. Basically, it's good. The other interesting part was the licensing. So they 00:14:17.720 |
changed up their Lama license to proper full open source everything. We have more infra providers, 00:14:24.760 |
NVIDIA stuff. But yeah, that's kind of where I feel like we should open it up. That's the quick 00:14:31.000 |
15, 10 to 15 minute overview. Whatever people found interesting, like I know there was a lot 00:14:35.960 |
of discussion about synthetic data gen. Sean and Eugene, you had good tweets about this. So I think 00:14:43.000 |
this is where we open it up to whatever people found interesting. And then we dig into those 00:14:48.040 |
topics because we're not getting through the rest of it. I'm going to open up chat and see what 00:14:53.400 |
people are up to. But yeah, thoughts everyone. Hop in. Yeah, I wanted to jump in by the way. 00:15:01.560 |
One thing to warn about pricing is that you're going to see a lot of providers 00:15:10.120 |
jumping in and everyone's just trying to get the piece of the pie. So like with some of the 00:15:16.440 |
previous model launches, you see people coming in at lower and lower price and then they'll 00:15:20.040 |
increase later. But I wanted to jump in on the training side because I'm quite sure Mibu, 00:15:25.800 |
Eugene, and Ed will have lots to say on the data. So I think I'll start with that. I can't share 00:15:32.200 |
the screen by the way. Do you want to take over or do you want me to scroll? I want to take over 00:15:38.120 |
slightly because I want to jump through a few things there. So let me share my screen. 00:15:45.960 |
All right. So I didn't see too much talk about on this but for me, one of the big ones is actually 00:15:59.880 |
pipeline parallelism. Not sure how many people... Can you see my screen? Yes. Yeah. So if you're 00:16:08.760 |
looking at this and like what is this crazy freaking schedule that they are doing here. 00:16:13.560 |
But TLDR, pipeline parallelism is generally the idea of scaling up your training across multiple 00:16:21.080 |
GPUs and to build around optimizing that. That has its own benefits. It has also its own downsides. 00:16:30.200 |
And the major downside, the reason why people try to avoid pipeline parallelism at all costs 00:16:36.360 |
and they use like DeepSpeed 3, for example, where the weights are sharded around all the other GPUs 00:16:42.200 |
is that if you look at pipeline parallelism or model parallelism, there's this problem called 00:16:49.080 |
the bubble. The bubble is basically as your data set goes through the different devices. So the 00:16:56.280 |
forward pass and then the backwards pass, you have all this GPU time here where some of the GPUs are 00:17:02.280 |
waiting for other GPUs and are doing nothing and basically you're wasting compute. And because 00:17:07.960 |
everyone wanted to avoid wasting compute, it went on to a search of the algorithm to figure out how 00:17:16.440 |
to do pipeline parallelism. And one major one is actually CLSG, coincidentally Singapore, 00:17:23.880 |
where they created this crazy-ass algorithm to basically train without any wasted time. So you 00:17:30.760 |
see the gray spots are with the wasted time, respectively. And Facebook is now embarking on 00:17:36.440 |
their own journey on this. And the reason why this is exciting even for smaller models is that 00:17:43.240 |
this kind of algorithmic changes on the training is what's going to allow you to train bigger models 00:17:50.520 |
easier on lower-end GPUs. So this concept could apply to, let's say, training a 7TB model on 24GB 00:17:58.680 |
GPUs and things like that. And the reason why they probably need it for the 80GB is because 00:18:04.040 |
they're training 405B. And yeah, and a lot of people thought, like academia thought that this 00:18:11.240 |
treated it as a dead end because of the bubble problem. And then Facebook was like, "You know 00:18:17.720 |
what? We are going to do that." And that, to me, is one of the more exciting things. 00:18:22.200 |
The other one that I saw some people tweet out is about batch sizing being smaller, 00:18:27.800 |
constraints on the batch size. I thought Google has pipeline parallelism in their 00:18:32.440 |
JAX, the distributed training repositories. They don't? Yeah, they do. They do. But the thing is, 00:18:39.000 |
no offense to Google, no one really took, everyone just interpreted it as TPU has 2L VRAM. 00:18:45.480 |
Kind of, kind of, kind of thing. And they had the basic pipeline parallel, but 00:18:51.320 |
we still suffered from the bubble problem. This weird scheduling, which I'm quite sure 00:18:56.120 |
people are going to start replicating it, is to reduce the bubble, the wastage. 00:19:00.760 |
So I also saw lots of papers on this from maybe NVIDIA and Matej Zaharia from Berkeley or Stanford. 00:19:08.440 |
Like they had lots of interleaved pipeline parallelism updates. 00:19:13.560 |
So you're saying no one is using it? Just Facebook has used it more recently? I find that pretty... 00:19:19.400 |
Or at least no one published it within their training processes. Because this is the first 00:19:25.320 |
major model of this CL class size, right, that's saying, "Hey, we are doing pipeline parallelism." 00:19:31.400 |
Google models, so they have some, these pathways, distributed training architecture systems, 00:19:40.760 |
and they publish in maybe OSDI, which is kind of the biggest distributed systems conference. 00:19:46.680 |
So they publish these trainings and they can do all sorts of parallelism within their systems, 00:19:52.040 |
and even a mixture of experts parallelism and stuff like that. So they do quite, 00:19:57.160 |
quite heavy stuff. I'll look it up and post some papers if I find them in the messages. But yeah, 00:20:04.600 |
my mental model was that people are actually doing this at scale. 00:20:11.240 |
Yeah. So I'll draw the distinction between pipeline parallelism and techniques like 00:20:16.120 |
DeepSpeedTree, which is essentially where the GPU has MVLink connectivity to other GPUs to 00:20:25.000 |
actually read the model weights. Pipeline parallelism is really more of like, instead of 00:20:29.640 |
going cross GPU to read the weights of the other models, or the other half of the model, you 00:20:36.040 |
actually just focus on the half of the model that you're working on. And this has the trade-off, 00:20:44.200 |
respectively, of saving VRAM and allowing you training larger model and larger batch size, 00:20:49.240 |
but it means you have the bubble problem. And I think the focus is really more about the bubble 00:20:54.680 |
problem here, rather than anything else. And yeah, like I said, I do expect more people to replicate 00:21:01.160 |
this part. Yeah. So that's the part that I wanted to jump in on. The other major one I wanted to 00:21:07.400 |
jump in on is just multilingual. I'm so happy that I've seen this. We try to avoid using machine 00:21:14.760 |
translated data to fine-tune the model. This is something that I think multiple people know that 00:21:20.760 |
I've been shouting on the roof about saying, "Hey, can we stop using machine translated data for 00:21:25.160 |
other languages?" And then assuming that's great, because when you speak to the other language, 00:21:30.600 |
native speakers, they've been saying, "That sucks." And finally, someone is also, at least 00:21:35.400 |
on the bigger model side, is doing that as well. So particularly excited about that part. But yeah, 00:21:41.400 |
I think I'll hand off to the whole data stream. The interesting little section there of translated 00:21:46.680 |
data is I've still seen it used where they have a Lama3 filter that extracts out what's the highest 00:21:52.200 |
quality data, what's the highest quality reasoning code data and whatnot. And in other work, they'll 00:21:57.160 |
still do... This is very traditional pre-training data set stuff where you need more data augmentation 00:22:02.760 |
to get more high-quality data and translation. So one thing is you can train on multiple rounds of 00:22:08.760 |
that. It's like more epochs on high-quality data. So you can just resample it. But then there was a 00:22:14.040 |
paper that I'm forgetting that tested this. Do they want to only use a little bit? Do they want 00:22:18.280 |
to train on multiple rounds of pass-throughs of the same high-quality data? Or do they want to do 00:22:23.080 |
basic augmentation, like translate and translate back? And somehow translation to other languages 00:22:28.520 |
work better. That was the best option. Translating it to high-quality in another language as opposed 00:22:34.120 |
to translate and translate it back. So there's still some value, but interesting little piece. 00:22:39.560 |
Yeah. So I think I want to hand off to the people who are going to tear all the data 00:22:45.640 |
parts into bits. Because I just wanted to jump in on trains. Because that's what I can uniquely offer. 00:22:50.040 |
>> Awesome. Appreciate that. I think Cameron has his hand up. 00:22:58.040 |
>> Hey, did they make any claims around it being good for code generation? 00:23:05.240 |
I'm interested in whether yes versus cloud. >> Yeah. This is a big contrast to Llama 2, 00:23:15.240 |
where they were intentionally not training for code, and then they put out code Llama separately. 00:23:19.480 |
Now they explicitly outline code as a separate modality, like separate from text. 00:23:25.000 |
Vibhu, I don't know if you have a slide on this stuff. And then they also did synthetic 00:23:30.120 |
data for code as well. Yeah. They just -- they spent a lot more time on code this time around. 00:23:36.920 |
>> Has anyone looked at it versus Cloud 3.5 Sonnet yet? 00:23:41.800 |
>> We vibe checked it. We vibe checked it. >> They did what? Checked vibe? 00:23:48.760 |
>> Yeah. So like, you know, it's not rigorous evals, but like we vibe checked it. And like, 00:23:55.160 |
it does pretty good. So in the paper, they did explicitly mention as well, like, yeah, 00:24:00.040 |
they used to have previous -- they used to have previous code Llama models, right? And part of 00:24:06.600 |
their, like, second step of post-training was to add in this section on code. But they explicitly 00:24:12.360 |
no longer need to do that. And I'll pull up the section of the paper, basically. But they 00:24:16.920 |
mentioned that this is, like, natively trained in, in pre-training as well. But it's a good 00:24:22.920 |
code model. They also have a -- the -- >> There's a scale AI benchmark. 00:24:32.040 |
We can't hear you very well. >> Okay. Yeah. There's a scale AI 00:24:35.720 |
benchmark where Sonnet and 4.0 were compared against the new 4.0.5b model. And 4.5b was found 00:24:45.080 |
to be basically on par with GPT 4.0, which is worse than both Sonnet and GPT 4.0 turbo preview. 00:24:51.560 |
There's a tweet thread and a comment that I'll just drop. But it outperforms Jem and I 1.5. 00:25:01.320 |
The thing I like about the scale benchmarks is that they are pulled out. That is, like, 00:25:06.200 |
none of the companies have access to them. And they're private. So, there's probably more 00:25:11.560 |
durability to the benchmarks. And they don't have as much of a conflict of interest. They 00:25:15.960 |
did co-watch with Lama. So, yeah. There may be a little bit of conflict of interest. 00:25:20.040 |
>> Thank you. Thank you, Jeremy. >> So, overview, the scale -- 00:25:27.320 |
>> Go ahead. >> Scale leaderboards aren't just coding. So, 00:25:32.360 |
for people that don't know, it started out with the GSM 8K, where they tried to recreate it. And 00:25:38.200 |
they made a GSM 1K, which is meant to match the actual benchmark and just be a held out that 00:25:43.960 |
they'll run models, they'll evaluate them. And then that turned into now they have held out 00:25:48.520 |
benchmarks that no one can see what the actual examples are of coding, instruction following, 00:25:53.720 |
math, Spanish. There's a bunch of these. And, yeah, they're kind of, like, pretty good in the 00:25:58.360 |
sense of, like, no one can directly train on them. There was a piece that said, like, when they put 00:26:03.400 |
out their first one, what's the delta between companies, like, models that do really well 00:26:09.240 |
on traditional, like, GSM 8K, but don't do well on 1K, where it's like they haven't seen it before. 00:26:15.000 |
So, they basically tried to test who overfit to the benchmarks. And this is trying to solve that. 00:26:20.840 |
So, if we go through real quick, this is kind of where the 405B sits in coding. It's, like, 00:26:25.960 |
a step right below QPT 4s and Sonnet. Sonnet's still slightly better. And then we can kind of go 00:26:34.120 |
through it. I think they're still testing the 405B, because I'm not seeing it through the rest 00:26:39.160 |
of them. But they're being updated in tweet threads and whatnot. And then Jeremy shared a link 00:26:46.120 |
to the Reddit that talks about this, where they're basically going through them. And then there's 00:26:51.560 |
discussion here, if anyone's interested. But yeah. Someone was also talking in. 00:26:56.200 |
>> Thanks very much. >> I have something to share. The 00:27:02.680 |
coding evaluation, it seems like they so there's I can't share my screen. But basically, 00:27:15.160 |
human eval is kind of one second. Let me try and share it. Can you see? Yeah. So, human eval 00:27:25.960 |
is one of the benchmark data sets that people use to benchmark coding. And it's very simple. 00:27:32.680 |
Like, they have 150 questions. And it's almost, like, autocomplete. Like, solve this simple 00:27:37.480 |
puzzle in Python or things like that. It's very, like, one, two lines. And you can see that 00:27:44.840 |
let's see. So, the LLAMA405B is not state of the art. So, Cloud Sonnet beats it by a few 00:27:52.920 |
percentage points. It's close to the GPT and Sonnet models, but slightly worse. And I think 00:28:00.680 |
this kind of is similar to the vibe checks. My understanding was on the initial LLAMA, 00:28:06.600 |
stuff that Meta didn't focus that much on reasoning or on code, because they're a 00:28:12.200 |
social company. So, maybe reasoning is not as super important for them. But then they hacked 00:28:17.960 |
focused coding data collection session and shared a big code model. Which kind of wasn't that great. 00:28:27.640 |
Maybe if you don't put the data in from the beginning, just trying to fine tune on code 00:28:33.320 |
by itself doesn't work that well. The other thing I wanted to share, can you see this other page? 00:28:42.200 |
Now? Basically, it seems they spend quite a bit to make their coding much better in LLAMA3. 00:28:51.080 |
And they actually train the code experts and then try to use that code experts to maybe, 00:29:00.280 |
I guess, collect high quality human annotations and do some more post-training. And then they 00:29:09.880 |
also did some synthetic data generation to improve coding. So, I think they spent quite a bit to work 00:29:18.280 |
on reasoning and coding. I didn't read this section carefully, but yeah, they have a full 00:29:21.800 |
section on trying to get better code data to generate, to incorporate feedback and do analysis. 00:29:30.200 |
They did quite a bit on coding. Yeah. >> Yeah. There's two sections there. 00:29:39.240 |
One is the synthetic data gen with coding, and the other is the pre-trained mix of their code 00:29:46.120 |
and reasoning sample where they trained a second classifier. So, one of the takeaways there was 00:29:53.160 |
when you're doing pre-processing of 15 trillion tokens, you actually can't just run inference. 00:29:58.280 |
Even Meta with all the GPUs they have, they couldn't afford to just throw LLAMA3 inference as 00:30:05.320 |
this whole 15 trillion token set. So, they trained a code and reasoning classifier on Distal-Roberta, 00:30:11.560 |
which is a small original encoder-decoder transformer to try to annotate out their 00:30:17.400 |
web scrape data for quality and whatnot. So, they have it both there in the pre-training set and in 00:30:23.640 |
the synthetic data gen. There's a really good quote tweet that went on about all this code gen. 00:30:30.440 |
It's by Eugene. I will share screen and throw him on the stage if he wants to talk about it. 00:30:37.400 |
>> Yeah. Thank you. I'm currently commuting, but I'm finding a quiet space right now. 00:30:46.920 |
All right. Great. Thank you, Vibhu. Yeah, I can talk to you, I think. 00:30:49.880 |
Can you hear me fine? >> We can come to it in a few minutes. 00:30:52.920 |
If you're commuting, we can come to it in a bit. >> No, I'll be commuting for a while. I'm walking 00:30:58.040 |
to the stadium right now, team event, but I'm finding a good space to sit. Okay. So, I think 00:31:03.320 |
what really stood out for me in this paper was that how much automation and augmentation was there, 00:31:09.240 |
right? In the first one, you can see they actually use Lama2 to filter out bad data, right? And this 00:31:17.560 |
is in the pre-training step. So, essentially, what they're saying is that, "Hey, we trust Lama2's 00:31:21.880 |
judgment well enough to be able to do that." And if you scroll down, next slide. 00:31:26.840 |
And then over here, you can see that they actually trust Lama3 to do tag intention. They actually 00:31:35.560 |
tag the generated data or the responses based on intention, and they also classify things based on 00:31:43.720 |
difficulty, right? And the thing is, they actually adopt some kind of curriculum learning where they 00:31:50.360 |
start with a single shot prompt or rather a single turn prompt and response, and then after that, 00:31:56.520 |
they move on to multi-turn. Next slide. And then after that, and this is a code expert that 00:32:05.640 |
everyone's been talking about, right? So, what it means is that in order to get Lama good at code, 00:32:11.080 |
as an intermediate step, they had to train a code model. And that sounds quite crazy, right? I mean, 00:32:19.320 |
for me, I mean, sometimes training such large models just seems to take so much effort to 00:32:25.560 |
curate the data, to set up the info and everything, but it seems completely essential in this case. 00:32:33.080 |
They could not have done it without that. And Andrej Karpaty had a great tweet about this, 00:32:36.840 |
whereby every model distillation and synthetic data generation is really now a stepping stone 00:32:45.000 |
for the next better model. Next, please. And then the same thing here is, okay, and of course, 00:32:52.760 |
here, this is just an example of how much we trust the synthetic data, right? The model was prompted 00:32:59.400 |
to generate problems, then solve each problem, so I'm focusing only on the green highlights here, 00:33:04.440 |
solve each problems, and then they give the model the errors, and then they ask the model 00:33:09.000 |
to solve the errors, and then the model also generates the unit tests, which they then use 00:33:13.240 |
to evaluate the generations on the unit test itself. It's like, you see that the human is 00:33:19.160 |
very minimally in the loop. And then if we move on, and you see this pattern everywhere, like 00:33:26.280 |
multilingual, you didn't hear me talk about it. One thing that's interesting here is that they 00:33:31.000 |
generate, they use Lama to generate data for target capabilities, and then they back translate it into 00:33:37.000 |
doc strings and comments. So that's how they can teach the model to explain code. And then they 00:33:43.000 |
use those tweets and comments, those doc strings and comments to actually create code again. And 00:33:47.800 |
then we're going to go through the rest really quickly. It's like multilingual, the same pattern 00:33:52.360 |
here, math and reasoning, the next one, you see it's the same pattern, whereby the model actually 00:33:57.400 |
augments the training data with the step-by-step. So one thing that's really interesting here, 00:34:02.040 |
in the sense that they actually went the extra step, no pun intended, to actually train step-wise 00:34:08.360 |
reward models. That's kind of crazy, no? I mean, they wanted each step in the chain of thought to 00:34:14.440 |
be so good that they actually took the extra effort to train step-wise reward models, which 00:34:19.240 |
they then combined with Monte Carlo Tree Search to improve the reasoning traces. And then you see 00:34:25.880 |
synthetic data for long context, it's the same pattern, Q&A. And then as you scroll down, 00:34:33.400 |
you see synthetic data for image captioning and synthetic data for factuality. Like factuality, 00:34:39.800 |
essentially all of it is just synthetic data, if you look at this. I think time will tell whether 00:34:44.840 |
this really works out well or not. I think we're still too early on the evals. And then you see 00:34:49.160 |
synthetic data for adversarial examples, synthetic data for the image encoder training, where they 00:34:54.520 |
use image captions, and then they augment as existing datasets with new instructions and 00:34:59.160 |
responses. And what's really interesting here, the second last tweet, is that the human annotators 00:35:05.960 |
were actually augmented with model in the loop, right? And if you think about it, this slightly 00:35:15.640 |
represents a shift in how some folks are thinking about it, right? I mean, a lot of people is like 00:35:19.640 |
thinking human in the loop, but no, now it's model in the loop, whereby you use the model 00:35:27.720 |
to create an initial generation that the human then can edit, and it just makes it so easy for 00:35:32.760 |
the human, right? And then the one big takeaway from all of this, and that's what I had from this 00:35:39.160 |
paper, but the one big takeaway from all of this is that, can you imagine how much META had to 00:35:47.240 |
educate and upskill their SDEs or their existing scientists to use this new technology to be 00:35:53.480 |
trusting of it, and the annotators to trust the new technology and to just work based on that. 00:35:59.080 |
So that was quite eye-opening for me, and I think it sort of suggests the 00:36:02.840 |
long-term, here's where the puck is heading. And that's all I had. Thank you. 00:36:11.400 |
Eugene coming in clutch with, I threw him on the spot, he's commuting and already had 00:36:15.800 |
slides and tweet thread. But yeah, what other topics have we got? I've got the chat. 00:36:22.760 |
Did they mention using chain of verification prompting, Eugene? 00:36:29.800 |
Do you mean chain of thought prompting or chain of verification where they try to verify the chain? 00:36:41.400 |
Okay, I don't think they actually did that, but they did mention they had stepwise reward models 00:36:48.040 |
that actually checks every step in the chain of thought. But I don't recall 00:36:58.860 |
Eugene, like early last year, there was tree of thought and some iterations with Monte Carlo 00:37:07.240 |
search and this tree of thought stuff. But at that point, LLMs weren't good enough to verify 00:37:15.160 |
or provide enough signal for this multi-step reasoning things to happen and things in the 00:37:20.040 |
loop. Do you know, do you have some idea how they solved it or why they were able to make all this 00:37:25.400 |
progress? Basically use all the tricks we were reading about maybe half a year, a year ago, 00:37:30.840 |
but it seems they actually got them to work. So yeah, I wonder what made it work. 00:37:36.600 |
Yeah, I don't know. I'm very interesting. I'm very curious about that as well. I wish there 00:37:41.480 |
was more papers showing how to use Monte Carlo tree search and actually get it to work and share 00:37:47.880 |
more details about it. I'm afraid I haven't seen too much of that in this current paper. 00:37:51.720 |
Is it mostly for coding that they employed this or for other tasks as well? Because for coding, 00:37:58.440 |
you could signal back some reward, but for other things, like I don't know how you evaluate things 00:38:03.720 |
and propagate information and validate the chains of that. Yeah, if I recall correctly, 00:38:10.360 |
it was actually in the math and reasoning section. So whereby they actually use stepwise 00:38:15.640 |
reward models to evaluate each step, to score each step in the chain of thoughts, 00:38:23.640 |
Thank you. I'll look into it more. Yeah, it's the math. Lightman et al was the citation. 00:38:37.400 |
And I guess I'll wrap up with one final thing. I'm sorry, it's a bit noisy. I think 00:38:42.200 |
Switzer's Latent Space podcast with Thomas, he really goes really deep and he has a strong 00:38:50.920 |
opinion on synthetic data. I think listening to that podcast will give you a lot more insight 00:38:56.120 |
into how meta is really embracing synthetic data. So I found that podcast quite helpful. 00:39:04.680 |
And this was the Karpathy tweet about synthetic data. 00:39:09.960 |
Also, yeah, great podcast. I think that's the one. Exactly, wait a minute, 00:39:20.680 |
I think that one thing in the sense that everything is a step for the next one. 00:39:26.120 |
No, not this one. It was actually a tweet about smaller models, 00:39:33.080 |
about how the competition for smaller models is going backwards, but buried in there. 00:39:37.240 |
If you scroll down a little bit more. Yeah, this one. Yeah, exactly. You can see the models have 00:39:45.320 |
to first get larger before they get smaller, right? And the three-line paragraph. And then 00:39:51.480 |
it's a staircase of improvement where one model helping to generate training data for the next. 00:39:55.640 |
It's almost like he had read this paper up front and he was alluding to that. I don't know. 00:40:01.320 |
Yeah, it's pretty interesting to see. Also, this was a tweet that came out even before, 00:40:06.200 |
but very much the small model distillation work, it's pretty huge. And that's, I think, 00:40:11.640 |
the big part of the license play of this too, where they did actually finally change their 00:40:17.000 |
license to allow people to generate synthetic data, train on outputs of the 405B. I think the 00:40:23.480 |
405B is a little overhyped for just using it for inference. When it comes to inference 00:40:28.520 |
generation and cost-effectiveness, make sure if experts are pretty efficient. They use less RAM 00:40:36.520 |
at inference time and they're more cost-effective for just the best quality. But then this is really 00:40:43.400 |
valuable for synthetic data gen, for filtering, stuff like that. And that's what I see more of it. 00:40:49.080 |
I know, Sean Swicks, you also had a pretty good write-up about this. 00:40:58.920 |
Or any other about how this is a synthetic data gen model or any other topics we want to dive 00:41:05.480 |
into. Also open to everyone else that's in the call too. If anyone had anything interesting that 00:41:10.600 |
they want to dive into on the paper, pop in, share your thoughts. 00:41:13.480 |
I think Sachin's hand has been raised up for quite some time. 00:41:18.360 |
Yeah, so this is for Eugene and Vibhu also. So we saw SONET actually take over some time back. 00:41:27.240 |
And there's still some, what do you call, gap to cover. So does SONET have 00:41:31.240 |
some other tricks in their back which is getting them that higher up? 00:41:35.240 |
I know someone's working on a write-up about this. 00:41:42.920 |
No, no, no. I abandoned the idea. Yeah, so they never published anything about 00:41:52.840 |
what tricks they use. But the evidence strongly points to the fact that they use the steering 00:41:57.240 |
vectors that they had from the scaling monosemanticity paper. The main evidence is 00:42:04.760 |
that they happened to do this monosemanticity research on SONET. 00:42:13.880 |
SJ is constantly screwing up his mic. They did it on SONET and obviously they only shipped 00:42:21.640 |
3.5 SONET. That's like the smoking gun. If they actually had anything else, any other trick 00:42:28.440 |
that caused 3.5 SONET to be so good, they probably would have deployed it on Haiku and Opus as well. 00:42:35.560 |
The fact that they don't is proof positive that it's basically the monosemanticity stuff. 00:42:40.520 |
Does that answer your question? Do I need to explain what that is? 00:42:45.740 |
I think it's not. I think it's not control vectors. 00:42:53.000 |
They could be. They did say it's a larger model. If you look at the training data and the training 00:42:58.520 |
date for when Cloud 3 SONET came out to 3.5 SONET, it also has a year and a half of significant data 00:43:05.320 |
updates. I think that there was a lot of research that put out on high-quality synthetic data, 00:43:10.360 |
the post-training mixture of it. SONET probably just had a decent bit of... There's a lot more 00:43:17.320 |
that they could squeeze out of it. Also, they did say it's bigger. A lot more research and good 00:43:24.280 |
quality synthetic data, the pre-training data mixture started to come out. It's bigger. I think 00:43:30.920 |
it was just a lot more post-training as well, because there was quite a bit of time... 00:43:35.800 |
In post-training, there's a... Sorry, Vibhu. In post-training, in SONET, you can see there's 00:43:41.080 |
this pause, thinking pause tokens, where sometimes it generates a token and it does internal thinking 00:43:47.240 |
and then generates the answer. It seems like they used some recent tricks where people say, "Hey, 00:43:53.640 |
you need to think step-by-step, but maybe not materialize directly in the answer, the 00:43:58.520 |
step-by-step thinking." Sometimes when you run SONET generations, you can see there's some... 00:44:05.160 |
It stops in the middle and people saw that those are actually pause for thinking and 00:44:11.080 |
side thoughts. It seems that really helps with reasoning tasks quite a bit. That's one additional 00:44:18.840 |
trick that they use. I agree that it's been one year of work, so they probably have lots of tricks 00:44:26.200 |
in there, not just one, two, three. Much like in the Lama paper, you'll see that it's hundreds and 00:44:32.120 |
hundreds of people, still less than a thousand, but yeah, it's like a Manhattan project to build 00:44:37.880 |
one of these things. I actually counted the number of people. Lama 3 had 200 core contributors, 00:44:43.960 |
which is good. It's pretty small. It's less than Gemini, which had 950. 00:44:49.080 |
Sebastian says, "What does thinking mean?" Okay, here's where we get philosophical. 00:45:01.480 |
My quick take on that is this used to be a thing with the original ChatGPT web UI stuff of why is 00:45:07.880 |
it pausing at stuff. I think some of this is also just the way the API works. What's the inference 00:45:14.120 |
it's running on? How's the streaming? What's the API service like? Sometimes when there's blocks, 00:45:19.240 |
it's not that something else is going on. It's just that there's a delayed stream of your API 00:45:23.640 |
response and sometimes people overanalyze that. Is it that it's thinking? Is it what's going on? 00:45:29.560 |
Maybe, maybe not. I think for the Sonnet's case that some users have already used, I guess, 00:45:38.440 |
prompt injection to trick it into instead of doing the XML thinking block and to use it to 00:45:43.640 |
output it in a different format, then you literally get to see the thinking process. 00:45:48.120 |
Yeah, so for what it's worth, I went to iClear and interviewed the pause token author. It's on 00:45:58.040 |
the iClear episode if people want to check it out. I do not think that Cloud specifically 00:46:03.640 |
implemented that version of thinking. I think it's much simpler. It is just chain of thought. It is 00:46:08.680 |
just prompted XML chain of thought that is then subsequently post-processed and removed inside 00:46:14.920 |
of Cloud Artifacts. Yeah, but it's still a form of thinking. It's a form of chain of thought. 00:46:20.520 |
It definitely improves the performance. Right. Sorry. Yeah, that's what I was aware of. 00:46:27.000 |
I'm sorry. I missed what Eugene mentioned for an alternative to 00:46:30.680 |
what you described as a chain of thought that is not presented to the user. 00:46:34.520 |
Eugene? Yeah, so instead of creating a custom token, which is the pause token concept, 00:46:45.320 |
they literally just got the model through prompting, got the model to reply with a thinking 00:46:50.920 |
XML block, which then you can trick it through prompt engineering to substitute the tokens 00:46:55.400 |
respectively. Then suddenly this technique becomes very blatant when you get to see it 00:47:00.840 |
respectively because it's no longer hidden from the UI. Yeah, I think also on a separate line, 00:47:07.880 |
because since a lot of people are looking to the eval and then they're like, "Hey, some of these 00:47:11.800 |
evals are doing worse than, let's say, 4.0," or things like that, the fact that it's already 00:47:17.640 |
closed itself means that if you're just a few weeks away until every single benchmark, you're 00:47:24.520 |
going to see a point jump because someone fine-tuned a code-specific version of the L3 model 00:47:31.480 |
or a medical reasoning-specific version of this model. It's going to take slower than normal 00:47:39.400 |
because I spoke to some people in the fine-tuning community. The biggest hurdle has been, "What do 00:47:45.080 |
you mean you need at least three nodes of H100 to start the process?" Yeah, the amount of VRAM 00:47:51.960 |
requirement is kind of huge. I suspect we are going to see more Loras first before we get 00:47:56.680 |
full fine-tunes. Also, the most random part, I know Meta did this for good reasons because 00:48:04.280 |
they basically did a lot of no-keyword filtering from the sources, but a lot of people in the AI 00:48:10.760 |
companion space, they were like, "No!" basically. Yeah, it makes sense. I'm going to look into that. 00:48:20.120 |
Thanks. Do we have more things on the paper? I mean, there's more to discuss. I feel like 00:48:31.560 |
everyone's being too polite. There's a lot of new scaling laws they brought up. They had a whole 00:48:38.280 |
recipe for post-training, how they did it, how they did their SFT, how they did... They also 00:48:44.040 |
released both the base and the instruct models, how much of this was done by synthetic data, how 00:48:51.000 |
they train their image video adapters, all that stuff for multilingual stuff. They give out a 00:48:56.520 |
whole recipe on how to do this. It's a long, long read, but for anyone that hasn't read a lot of 00:49:02.760 |
papers, this is also probably a really good one that's very approachable, very readable, 00:49:07.640 |
and not too crazy technical, one to at least understand what's going on. They go into some 00:49:13.320 |
of their evals on their multimodality, how their adapters work, how it performs, and they're like, 00:49:18.680 |
"Yeah, it's pretty good." They added speech into speech understanding, how to train a speech 00:49:24.760 |
encoder, how many hours of recording they used, how they filtered it. They go through literally 00:49:31.160 |
all of this. This is probably where you could have an hour on this paper. That's every step 00:49:36.920 |
of it. But it's an interesting one where, yeah, they do go into all that data set, 00:49:41.640 |
how they transcribed it, just little, little stuff too. In their speech understanding section, 00:49:47.240 |
there's a section on, "Our ASR training data contains 230,000 hours of manually transcribed 00:49:55.080 |
screech recording that spans 34 languages." Just a little one line of, "We casually manually 00:50:03.000 |
transcribed 230,000 hours of 34 languages of speech, and we're just training a little adapter 00:50:09.800 |
for this that we're not releasing." They put a lot of work into that. Then it goes even deeper 00:50:15.240 |
into, "How do you use this for pre-training? What about spoken dialogue? How do we fine-tune?" 00:50:21.400 |
What's the recipe for a speech adapter in an LLM? Yeah, we did a lot of pre-processing 00:50:27.160 |
to the base data set of manually transcribe a lot of speech recording, have multi-languages, 00:50:32.520 |
train it out. Here's the speech length segments that we want. Then we fine-tune this adapter for 00:50:39.400 |
spoken dialect. How do we do that? Well, we synthetically generate responses for prompt. 00:50:44.760 |
We ask for transcripts. We generate them. They generate 25,000 hours of speech synthesis through 00:50:52.360 |
voice box, which is a whole other series that Meta has put out around everything, 00:50:57.560 |
how they do voice generation. They have a whole really good breakdown paper of that, how they use 00:51:07.800 |
that to generate model, to fine-tune, and generate synthetic data for this. There's a lot in here, 00:51:13.160 |
if anyone's interested in. A lot of that doesn't make it to Twitter, but dig in, present it. 00:51:19.640 |
Architecture stayed the same. I thought the interesting parts were also just like, 00:51:30.840 |
they want to keep it very foundational and see what works and what they can just scale up. 00:51:37.400 |
The second aspect of that is it'll be fun to see when they start cooking with actual MOEs, 00:51:43.880 |
how do we go outside the box. It's nice to have clarity on their scaling laws. I've definitely 00:51:53.480 |
presented too many times that they just scaled up and prayed, and they took an AP to 15 trillion, 00:51:58.120 |
and they were very inefficient. Other papers, like 5.3 took a big shot at this. 5.3's whole 00:52:04.680 |
paper is about how we had Chinchilla scaling. Then we have this inference optimal Lama scaling, 00:52:11.640 |
and then here's how you could do what we think is right. Now, Lama puts out a paper and they're 00:52:17.240 |
like, "No, this is actually all based. Here's new scaling that's not just next token prediction. 00:52:21.960 |
It's grounded on reasoning. Here's how scaling laws work. Here's how you can use it. Here's 00:52:26.680 |
why we did it." It's going to make sense. The scaling part, I found it interesting and funny 00:52:34.200 |
that they were using ARK as a measurement for the scaling training. One of the realistic that I had 00:52:41.480 |
in my head, it was like, "So Facebook spent over $100 million at the ARK challenge to try to win 00:52:46.760 |
the million-dollar prize." I think it's a different ARK, right? If I'm not mistaken. It's not the 00:52:52.840 |
same as the million-dollar. It's not the same data set. Yeah, but it's still in the line. 00:53:03.240 |
Yeah, a lot of good stuff. In the five to seven minutes we have left, I wanted to give some time 00:53:09.560 |
to, I guess, Hassan. Hassan's actually built an app that's kind of cool with Lama3U.1. Maybe 00:53:17.720 |
there's something to learn about prompting it, building with it, anything surprising. 00:53:22.680 |
Yeah, thanks, Sean. Hey, everybody. I just want to talk about this app that I built real quick. 00:53:30.760 |
Definitely a lot less technical than we're talking right now. This is dropping all the way down to 00:53:35.960 |
the half layer. I guess it is all about building, but I just built this little app. It uses a search 00:53:44.600 |
API to put in whatever topic you want to learn about, like quantization, and it can explain it 00:53:50.120 |
to you at kind of any level you want. So, let's learn about quantization at an elementary level. 00:53:55.080 |
So, it will basically use a search API, grab all these sources, and put it into context, 00:53:59.240 |
because obviously Lama3.1, larger context, you can fit in a lot of stuff, which is great. 00:54:03.480 |
And I've noticed that it's pretty good at kind of dumbing down concepts, but also 00:54:08.600 |
responding to the prompts a little bit better. And almost responding, like if I set a system 00:54:15.400 |
prompt that kind of details how it should behave over the next few messages, I found that it 00:54:21.080 |
responds a little bit better. Like, for example, for this system prompt, I had, like, make sure 00:54:26.600 |
you make the overview really short, because when I was testing on just Lama3 and on other platforms, 00:54:33.160 |
it was giving me a really, really long initial answer. And so, I want it to give a really short 00:54:38.920 |
overview, but at the same time, I want it to be detailed in the future. And I also want it to 00:54:41.960 |
include quizzes at certain times. And so, I just found that it's a little bit -- it was a little 00:54:46.440 |
bit smoother at kind of -- at responding. So, yeah, here it's going to, you know, actually try 00:54:52.440 |
to give me a quiz and try to just be interactive and kind of teach me a subject at any level that 00:54:58.680 |
I want. So, it's fully open source. LamaTutor.com, it's definitely open source. I misspelled this. 00:55:09.400 |
>> Did it give an example of quantization math? 00:55:12.600 |
>> I don't actually know over here. I think it's talking about -- 00:55:18.760 |
>> I just got into a debate on a call where someone was making an obvious error in 00:55:23.720 |
quantization math and how much the memory before versus after. I can send this to them. 00:55:31.480 |
>> I'm not sure if it'll -- I know it does formulas as well, yeah, so it'll put some -- I 00:55:35.160 |
should probably format these a little bit nicer. >> That's awesome, though. 00:55:38.760 |
>> Yeah, thank you. It has -- yeah, I got about 4,000 visitors, and if anybody's curious about 00:55:45.880 |
cost as well, so about 6,400 requests. I had some errors because I was playing around with -- I was 00:55:52.280 |
hitting the context limit and a bunch of other stuff. And also, you know, using the together API. 00:55:57.160 |
Very biased, I work at Together. But the cost, for anybody curious: we have about 12,000 00:56:04.920 |
average tokens per request, and about 5,900 successful requests. And so if you do the math, 00:56:10.200 |
that's like 74 million tokens. And if I go over to our pricing, right now we're at 18 cents 00:56:17.240 |
per million tokens for 8b 3.1. So that comes out to about $12 from the 72 million tokens that I 00:56:25.400 |
used in the last 24 hours from these 4,000 people and 6,000 requests. So that's all I have. 00:56:32.680 |
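For reference, a quick back-of-envelope check of that math, using the rounded figures quoted above (which is why it lands near the ~72 million tokens and roughly $12 mentioned):

```python
# Rough cost check with the numbers quoted in the talk.
successful_requests = 5_900
avg_tokens_per_request = 12_000
price_per_million = 0.18            # $ per 1M tokens for Llama 3.1 8B, as quoted

total_tokens = successful_requests * avg_tokens_per_request   # ~71M tokens
cost = total_tokens / 1_000_000 * price_per_million
print(f"{total_tokens / 1e6:.0f}M tokens -> ${cost:.2f}")      # ~71M tokens -> ~$12.7
```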
>> And so, wait. Wait. Can you show -- did you put a link in the chat? Or can you type -- 00:56:41.200 |
>> No, I'll do that. Yes, I'll put in a link to the LlamaTutor. 00:56:47.000 |
>> Oh, LlamaTutor, okay. >> Hassan: LlamaTutor.com. 00:56:50.280 |
>> Hassan, I have a question at a high level. Is this using some sort of, like, 00:57:02.440 |
have you ever used the GPT's actions? Is it using a system similar to that? 00:57:05.960 |
>> I actually haven't used actions. Can you tell me about that? 00:57:09.800 |
>> So actions is where you input an API spec and the model itself can make the calls -- the model 00:57:17.640 |
itself can decide to make the calls to that API, provided that API spec. 00:57:21.640 |
>> Oh, so like function calling, basically. >> Yes. 00:57:25.960 |
>> No, I'm not using -- yes, I'm not using it on this app. This was, like, 00:57:31.640 |
the most simple example I could do, where I give it a system prompt, I do this API call for search, 00:57:36.920 |
I parse all the sources, and I just drop everything. And I'm like, hey, build something 00:57:41.000 |
interactive. So this is kind of step one. There's obviously so much more I want to do. I want to try 00:57:45.480 |
to do generative UI, maybe, with the Vercel AI SDK, where I show -- for the quiz example, 00:57:52.200 |
I show an actual quiz component that renders. I can, like, generate a little, like, report out 00:57:58.840 |
of all the sources and have, like, read this, you know, like, you can read this to check it out, or 00:58:04.360 |
flashcards, or, like, I feel like there's a lot of directions to go with it, but I kind of just 00:58:08.040 |
wanted to build something really, really quickly. I just built it over the weekend. We got early 00:58:12.760 |
access to 3.1, because we were a launch partner with Meta. So kind of just playing around with it 00:58:19.160 |
and try to build something really quick over a weekend. >> Oh. Okay. I missed that detail. That's 00:58:23.880 |
really cool. >> All right. Thanks. >> So you're doing the retrieval. >> I'm curious if you could 00:58:29.320 |
share what does it take to serve a 405B model all together? >> What does it take to serve 405B? So 00:58:40.040 |
we're serving -- what is it? FP8. It takes eight H100s per instance. 00:58:45.800 |
Yeah. But we're looking into quantizing down to int4 and trying to serve on four H100s. But 00:58:54.440 |
in progress. >> The math is roughly one billion parameters to 1 gig of VRAM, right? So at FP8, yeah, 1 to 1. And then 00:59:05.720 |
you could scale that out. There's pretty good resources on this. I feel like we've shared them 00:59:08.760 |
in Discord. And then someone asked about quantization stuff. Together put out a pretty 00:59:13.880 |
good blog post about this, about how they're now running quantized models. It's somewhere 00:59:18.280 |
in the Zoom chat, and we'll probably share it in Discord, too. Good resource on all that. 00:59:22.600 |
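For anyone who wants the arithmetic behind that rule of thumb, here is a minimal sketch. It counts weights only; KV cache, activations, and runtime overhead are ignored, which is why real deployments like the 8x H100 setup above leave headroom.

```python
# "Roughly 1 GB of VRAM per billion parameters at 8-bit" rule of thumb,
# weights only (no KV cache or activation overhead).
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    return params_billions * bits_per_weight / 8  # bits -> bytes per parameter

print(weight_memory_gb(405, 8))   # ~405 GB -> 8x 80 GB H100s (640 GB) with headroom
print(weight_memory_gb(405, 4))   # ~203 GB -> the 4x H100 / 16x 4090 setups mentioned later
print(weight_memory_gb(70, 4))    # ~35 GB  -> fits a dual-3090/4090 48 GB prosumer rig
print(weight_memory_gb(70, 16))   # ~140 GB -> why an unquantized 70B needs multiple GPUs
```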
Also the paper has a section on inference, of course. So how do you run a 405B? What does 00:59:30.120 |
that look like? What's efficiency in that? They have a whole section on this. They have sections 00:59:35.960 |
on previous Meta work on how to get more token efficiency, like multi-token prediction as 00:59:41.480 |
the pre-training task, and other parts of what you might see. Meta's done a lot of work 00:59:48.200 |
on this. But overall, that's kind of a high-level overview and our thoughts on the paper. If anyone 00:59:54.040 |
else has questions, we can discuss them. Otherwise, next week -- we were supposed to do 01:00:00.600 |
Wizard LM and Orca 3, heavy synthetic data gen stuff this week, today. We pushed that to next 01:00:07.400 |
week. Sorry, this was not super prompted. Basically, less than 24 hours ago, we decided 01:00:15.560 |
let's switch this paper club. So we didn't have crazy slides. But next week, we'll have stuff for 01:00:20.040 |
Orca 3 and Wizard LM. Same paper club, same time. And then at some point, I might do a deeper dive 01:00:28.760 |
into this, a proper one-hour breakdown of all this. If anyone's interested, I'll share somewhere. 01:00:33.960 |
I'm curious about the quantization thing and whether -- if you have the choice of a larger -- 01:00:42.040 |
sorry, more parameters, but worse quantization, how do you figure out the trade-off between that 01:00:48.200 |
and running, say, the 70 billion model without quantization versus the 405 with quantization to 01:00:56.200 |
make it down to the same size? >> People generally do benchmarks. 01:01:00.520 |
These are already happening right now. A lot of the communities are trying to figure this out. 01:01:08.920 |
For anyone who wants to try 4-bit quantization, you can actually run it on 16 01:01:17.240 |
4090 GPUs. I'm already seeing folks doing that. Whether that's 01:01:27.560 |
a good idea, whether the model will get dumber or not is something we'll probably find out in 01:01:31.800 |
the next few days. >> At a high level, though, 01:01:35.320 |
the larger ones see less of a hit when quantizing than the small ones. r/LocalLLaMA has a lot of 01:01:41.880 |
comparisons and benchmarks of different quantizations. Basically, a lot of the prosumer 01:01:47.800 |
rig is a dual 4090 or dual 3090 system. You've got about 48 gigs of VRAM. With 48 gigs of VRAM, 01:01:55.400 |
you can run a 70B at 4-bit quant. A lot of people will spend $3,000 on a local rig. They can run 01:02:05.080 |
a 4-bit 70B, or they'll look at how does that compare to a 34B at higher precision, or a single 01:02:11.560 |
GPU where you're running an 8B. There's solid breakdowns of benchmarks of these. This is where 01:02:18.280 |
Reddit people are doing good work, as opposed to when you look at it from the infra provider side 01:02:25.800 |
of what the benefits of quantizing are. There's also efficiency in QLoRA fine-tuning on quantized 01:02:32.520 |
models. Benchmarks-wise, the TLDR is bigger models take less of a hit, smaller models take 01:02:38.760 |
a bigger hit, and then there's inference speed. Alex, do you have thoughts? >> I have a question, actually. 01:02:46.360 |
I don't usually have questions, but this time I have a question about effects of quantization. 01:02:50.520 |
In your guys' experiences, what are the most visible effects of quantization? What comes 01:03:00.040 |
to mind when you see that the model is quantized, and how quickly do you realize, "Oh, this is the 01:03:06.760 |
effects of quantization"? >> It's just dumber, that's all. >> Dumber in any specific areas? Is it 01:03:12.920 |
dumber knowledge-wise, is it dumber logic-wise, or any specific areas? Which is overall dumber? 01:03:18.440 |
>> One thing that I've seen a lot of community members do is that when you over-quantize 01:03:25.800 |
at longer context length, it starts going into repetition 01:03:28.920 |
rapidly. So that's like the "we have gone too far" line. >> Yeah, in a lot of the one-bit 01:03:40.120 |
quants, as people ineffectively quantize stuff, it starts to just go into pure chaos. So it 01:03:46.280 |
becomes stochastic random next-token prediction. The little trade-offs that you see at what type 01:03:52.280 |
of quantization you want to do, it starts to perform worse at chain of thought, at long context, 01:03:57.640 |
at needle in the haystack. Some of those things start to degrade earlier on in quantization, 01:04:03.320 |
as opposed to pure performance. And then, yeah, sometimes it's just dumber. It's just worse on 01:04:08.200 |
some of the trivia QA-type benchmarks, which is interesting because it's worse on reasoning 01:04:15.560 |
benchmarks, but decent on trivia. So trivia is testing its internal knowledge, what does 01:04:20.360 |
it already know? It doesn't lose facts, but it loses reasoning, it loses verbosity of responses 01:04:26.760 |
and whatnot. So it degrades in that sense. But then you do have benefits of, yeah, it's 01:04:33.400 |
significantly more efficient to run, right? Four-bit versus eight-bit, you can run in half 01:04:37.640 |
the hardware, you get speed-ups, you can fine-tune more efficiently. So there's trade-offs. Though 01:04:43.080 |
the one interesting thing I'll note there is if anyone saw the Apple LLMs at their WWDC stuff, 01:04:49.240 |
they put out an Apple research blog on how they have all their local LoRa swaps, and they dropped 01:04:55.800 |
in a little paragraph where they're like, "We tested our models at a net of one or two-bit 01:05:02.520 |
quantization, and we saw no performance drop compared to full precision." So they're basically 01:05:07.720 |
like, "Yeah, we benchmarked it, we got lossless quantization, no details, no nothing." But 01:05:14.520 |
they seem to have stated they figured it out, which, you know, skeptical, but 01:05:18.840 |
they're running quantization for on-device. Yeah, but on specific tasks for them. 01:05:23.640 |
It's also going to be very... Sorry. It's also going to be task-specific. We have seen 01:05:30.440 |
weird edge cases where when you quantize a model one step for certain tasks, it actually improves. 01:05:37.240 |
We are degrading all the time. That's true, too. 01:05:42.760 |
I'm writing an eval now between cloud providers for 3.1 70B just to see if, like, you know, 01:05:49.320 |
Together's FP8 versus, I don't know, Fireworks or something like this has a difference, or versus 01:05:56.760 |
Groq. Have you seen Artificial Analysis? Oh yeah, I think I saw something. 01:06:06.360 |
These guys, they run pretty good comparisons. Yeah, but they don't do quality, 01:06:10.520 |
they just do reported numbers, I think. But they do compare inference providers. 01:06:17.160 |
Oh, they're not doing their own benchmarks? I thought they have... 01:06:19.800 |
They're doing pricing and they're doing speed. I don't think they're doing quality. 01:06:25.880 |
Groq is much more difficult to benchmark, to be honest. 01:06:29.320 |
So, I'll add to what Alex was asking. I know credible reports of people where they're 01:06:41.080 |
saying between Groq and some other providers the quality is different, in the sense that, everything else 01:06:48.280 |
remaining the same, the inference engines are giving wrong answers. So, I'll just leave it at 01:06:52.920 |
that. I mean, one thing I can tell you right now that I noticed is that Groq at temperature zero 01:06:59.560 |
returns different responses every request. So, I don't know if temperature zero actually works 01:07:03.720 |
there. Yeah, that has been... A lot of people have said that to the CEO also. 01:07:16.200 |
Isn't temperature zero supposed to be the most deterministic instead of... 01:07:21.800 |
Depends on the framework, to be honest. For opinion, I think temperature is zero. 01:07:29.400 |
Well, theoretically, it's supposed to be the most deterministic. I would expect, based on 01:07:35.240 |
what temperature is supposed to do, that it would be the most deterministic. 01:07:46.200 |
So, sometimes there's little reasons, like, when they're running inference, right? They might run 01:07:51.160 |
it on different hardware. So, like, maybe there's a server in West Coast, East Coast that's different 01:07:57.160 |
GPUs, different CUDA kernels, different drivers, and some of that affects how you do rounding, 01:08:01.720 |
how you do next, like, what... Yeah, basically, how is inference being done? So, you see slight 01:08:06.280 |
variation. Then, across inference providers, this is their, like, secret sauce, right? Are they doing 01:08:11.240 |
self-speculative decoding? Are they doing speculative decoding? How are they doing it? So, 01:08:15.240 |
like, all these little things have differences, but there's also the reason of temperatures, 01:08:20.360 |
like, yeah, different hardware, different drivers, different all that. So, some of that answers it, 01:08:25.240 |
but I don't know, Eugene, maybe you have more to add. I was just going to say that for GPUs, 01:08:31.080 |
floating point, just on... I mean, for GPUs with floating points, you push it through so 01:08:35.720 |
many calculations and so many matmuls, the floating points just aren't going to be precise. So, 01:08:41.880 |
that's why even if temperature is zero, it's not going to be the same throughout for multiple 01:08:46.760 |
requests. Yeah, even if it's, like, the same GPU on the same request, like, the order you do 01:08:54.120 |
reduction on the matmuls, as Eugene mentioned, will affect the results because, like, floating 01:08:59.400 |
point addition is not associative. Like, you can try this in Python and add 0.1 plus 01:09:05.720 |
0.2, you might get 0.30000000000000004, not 0.3. So, like, the order you're doing the addition and multiplication 01:09:13.640 |
can affect the results. And if you're doing, like, a thousand addition and multiplications, 01:09:17.720 |
it's difficult to get the same order every time. It's about that very small 0.0000001% noise 01:09:25.320 |
that ends up cascading all the way. If you want true temperature equals zero, the 01:09:31.560 |
only reliable way is to run it on CPU, because the CPU does have the floating point precision 01:09:37.080 |
reliability. But if you're running on CPU, it's going to take forever. 01:09:41.880 |
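A quick way to see the floating-point behavior being described, straight from a Python REPL:

```python
# Floating-point addition is not associative, so changing the reduction order
# (as different kernels and hardware do) changes the result slightly.
a, b, c = 0.1, 0.2, 0.3
print(a + b)                         # 0.30000000000000004, not 0.3
print((a + b) + c == a + (b + c))    # False: grouping changes the result

import random
xs = [random.random() for _ in range(100_000)]
shuffled = random.sample(xs, len(xs))       # same numbers, different order
print(sum(xs) == sum(shuffled))             # often False for the same reason
```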
And to add to Eugene also, there are actually, like, five or six different classes of floating 01:09:52.840 |
point in CUDA. And I believe, like, people 01:09:57.560 |
have to hand-code for these, have to account for these differences. 01:10:04.680 |
And that's where they have to actually be shown the errors, saying that on some other 01:10:09.800 |
platform we see this error and over here we see this answer. And now someone has to painfully 01:10:15.000 |
go back and figure out a bug in their code. So, most of this is basically bugs in the 01:10:20.120 |
translation, not bugs in the hardware. That's what I was trying to say. 01:10:23.400 |
I'm definitely going to try that CPU approach. So, I'm trying to get a project approved, and 01:10:30.600 |
now with the rationale being this type of inconsistency. 01:10:45.240 |
Also, like, it depends on the engine. Like, ExLlamaV2 is not deterministic at all compared to maybe, 01:10:51.320 |
I think, transformers is more deterministic. You can set the torch seed and the 01:10:56.840 |
inference seed. So, also the inference engine is very important. 01:11:11.400 |
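For reference, a minimal sketch of what "set the torch seed" looks like with Hugging Face transformers. The checkpoint name is just an example, and even greedy decoding with a fixed seed only gets you determinism on fixed hardware and software, per the discussion above.

```python
# Pinning down generation in transformers: fixed seed + greedy decoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(0)  # seeds Python, NumPy, and torch RNGs

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"   # example checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"  # device_map needs accelerate
)

inputs = tok("Explain quantization in one sentence.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, do_sample=False, max_new_tokens=64)  # greedy, no sampling
print(tok.decode(out[0], skip_special_tokens=True))
```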
Awesome. Well, thanks, everyone, for joining in on our last minute drop-in. 01:11:20.280 |
Yeah, yeah. Go ahead. I've got to drop, but this room will still stay open. I'm going to make 01:11:26.680 |
someone else host. Feel free to keep discussing. Next week we have another one at 12, but I'm 01:11:38.920 |
No, don't make me host. I'm at a baseball game. 01:11:43.160 |
So, it's quick. So, basically, I found that Llama 3.1, like the 405B, actually might do better 01:11:52.920 |
on some domain-specific questions than, like, ChatGPT-4o. Do you guys want to see a little bit? 01:12:00.200 |
Sure. And it's no surprise, actually. I actually think that more and more people will find 01:12:09.960 |
That's great. So, let me share really quickly. 01:12:13.720 |
So, here's a little summary. So, basically, this is the specific question I was asking. 01:12:20.280 |
Here: "You're an expert in mechanistic interpretability research. How do you use the residual stream?" 01:12:26.040 |
Yeah. And so, here's my summary. Overall, I think, you know, 405B did a better job of explaining. 01:12:34.680 |
And also, there's some important information that's missing in ChatGPT-4o or not expressly 01:12:41.080 |
explained. So, let me show you the answer from Llama 3.1 405B first. 01:12:48.360 |
I feel, I find it's really easy to follow for someone like me. Like, I'm not doing research 01:12:52.840 |
in this field. And it also gives enough technical detail. So, if anyone wants to read a little bit. 01:13:01.480 |
Yeah. So, and I can move to the chat. What about the answer from ChatGPT-4o? Do you guys want to 01:13:11.320 |
see it? Yeah, yeah. Yeah, I'm done. You're done? Okay, cool. So, this one is from ChatGPT-4o. 01:13:21.240 |
I feel like ChatGPT still kind of answers things in a sort of generalized way. 01:13:28.440 |
Let me know if you want me to move, scroll down. Or if I'm scrolling down too fast. 01:13:41.880 |
I think Eugene, the other Eugene would chime in that actually a lot of these things 01:13:55.720 |
are very dependent on the prompt. So, it won't surprise me if you tweak the prompt right, 01:14:11.560 |
the winner ends up flipping around. That's what I meant. Yes, just for the argument, 01:14:17.480 |
I didn't tweak the prompt just for Llama 405B. So, I just said, hey, I have this one, I just 01:14:23.560 |
threw it out. But definitely, you know, a more detailed study needs to be done. And let me show 01:14:28.680 |
you ChatGPT-4. It's better than 4o, but I don't think... I still think Llama 405B did a better job. 01:14:39.640 |
Just a second. So, I'm going to do the same thing, scrolling down. Let me know if I'm moving too 01:14:53.880 |
fast. So, basically, I think, I kind of feel, of course, we need to do a more detailed study. I feel 01:15:13.640 |
like Llama 405B may have been doing a much better job of organizing this knowledge, you know, 01:15:21.960 |
the way it handles how to answer questions, you know, how to organize the knowledge together. 01:15:29.560 |
So, I'll give my personal example that when I look at some healthcare related data, 01:15:37.160 |
and if I'm looking at it through basically ChatGPT-4o, and then even when I use a similar prompt 01:15:47.160 |
that I give through Perplexity, I get different answers. But since I know the domain, I know that 01:15:53.000 |
sometimes, like, ChatGPT is just bullshitting. And Perplexity actually gets references, and it gives 01:15:58.840 |
you a slightly better answer, which I can take and go back to the references and dig deeper. 01:16:05.240 |
So, definitely, once you know the domain better, you should not just fall for 01:16:10.200 |
the English, is my take. The English may look all correct, but it might be nonsense at the end of 01:16:14.440 |
the day. Oh, I looked into the residual stream a little bit. I have pretty good 01:16:21.400 |
confidence it's not bullshitting from 405B. Yeah. So, the thing that really amazed me is, like, 01:16:28.440 |
how it explains it. And for me, like I said, hey, I'm just starting to get into, like, a lot of 01:16:34.200 |
technical things I haven't done before. It seems like it just explained it really well. I can 01:16:39.640 |
just go down and, you know, look into specific information and stuff like that. But definitely, 01:16:45.400 |
I see your point, like it's just one shot. But still, it seems pretty amazing because I didn't 01:16:50.680 |
try to tweak the prompt just for 405B. Yeah. Thank you. Yeah. So, that's what I want to share. 01:16:56.200 |
Yeah, definitely. We are going to see more and more of this. Like, I think what's interesting, 01:17:02.680 |
especially when you're having different foundation models, you're going to have very different 01:17:07.800 |
default outputs. Because, yeah, we can probably steer all models, especially the bigger ones, 01:17:13.080 |
with prompt engineering to certain desired effects, or even, like, fine-tuning. But the default out of 01:17:19.560 |
the box would be what's interesting, because that's what a lot of day-to-day users 01:17:25.560 |
will experience. So, like, for me, for the longest period of time, I didn't care that GPT-4 01:17:31.960 |
had a better eval score across the board than the Claude model. The Claude model just seems 01:17:38.840 |
friendlier and nicer to me, and I like it better. And maybe that's what does matter sometimes when 01:17:45.320 |
they are all good enough to get the job done. Yes. So, I wish we had a better model where 01:17:51.880 |
we don't need to tweak the prompt, you know. In a way, it can reflect how well the model has 01:17:58.760 |
organized its knowledge, you know. And so, if it's easy to just talk to it without 01:18:03.800 |
tweaking the prompt, it's actually, you know, at a much more advanced stage, I think. 01:18:09.880 |
But I think in the long run, that will be hard to achieve because if you just view each model 01:18:16.520 |
a bit like an individual, so you can call it Llama-kun, you can call it GPT-4-kun, etc. It's 01:18:25.480 |
just really a preference thing when it boils down to it. Because, as I said, well, I prefer 01:18:32.440 |
Claude because of the way he did the artwork. I know someone who prefers the other way. 01:18:36.760 |
And that's the challenge here, right? You can't possibly just train the model to 01:18:43.000 |
satisfy everyone in the world. There'll be that back and forth. And that's where all those, like, 01:18:48.200 |
more few-shot prompting will come in, or fine-tunes will come in. And I think that is probably the 01:18:54.200 |
more exciting thing about Llama, because we're going to see lots of fine-tunes. 01:19:04.200 |
I think, because since we are well past the usual, and there's a lot of new faces here, 01:19:11.160 |
I'm just going to, like, recap and point it out. So, on the Latent Space Discord, we host this 01:19:20.520 |
weekly paper talk, and this is why this session happened. This week is just a very special one 01:19:26.920 |
because the whole Llama 3.1 release came out. Prior to that, we were planning to do other papers, 01:19:34.600 |
and we'll probably go back to our usual schedule. So, if any of you, especially the new folks that 01:19:40.280 |
joined in because this got broadcast on a much larger audience, feel free to join us next week,