Back to Index

[LLM Paper Club] Llama 3.1 Paper: The Llama Family of Models


Transcript

And I kicked off like a what do people want to work on section with I'm going to do a deep dive in the paper because I need to make slides on this and that very much overtook the hackathon. We had like a solid crew of like 20-30 people that were just discussing the paper with me and slides didn't really get made.

But also I don't know it was weird hackathon project winner was just a deep dive into the paper, but we had an in-person paper club session that I led yesterday and a lot of people from there trying to join in so it should be vibes. I am liking in person hybrid format, I might start running those we'll see how they go but it was good.

Everyone had good discussions. Amazing, amazing. Yeah, I would be happy to join that. Once I get back to SF this Friday. Oh, Friday. Exciting. Yeah. So, as you know, we also we interviewed Thomas who was one of the paper co authors, he did not give us the paper beforehand, which is annoying because after reading the paper I have so much better questions that we actually end up asking in a podcast but whatever.

Yeah, yeah, I think that you know a bunch of us have read it. I feel like Vibhu you're probably best situated to, to take over the screen if you want, if you have, if you have stuff. I have very basic stuff but yeah, I'll share. We also got someone at the hackathon that worked on a hackathon project that's paper to video, so someone's cooking up a video explainer of this it's like literally doing inference right now, we'll share that once it's ready.

Yeah. Yeah, this is, I mean this is, I'm excited about it but also I'm wondering how to do this justice. I feel like we can post questions in here, and then, you know, people can just kind of discuss in the chat. And yeah, I mean like we classically have a lot of side discussions in the zoom chat anyway so I'm not worried about that.

I mean, yeah, well Vibhu, Vibhu go ahead you can get started. But like, what do what do people think what people want to talk about, you know, personally, I, I called this the synthetic data paper. So I have a lot of interesting insights or these questions about the synthetic data stuff, but we can talk about everything like there's just so much in here.

The format that worked well yesterday was like, we're not getting through 100 pages, I'll give the high level we'll go through the tweet overviews. And then let's just dig into whatever anyone found interesting and had like, you know, something that someone dove into so like part of it was they had a bunch of scaling laws for pre training they had scaling laws for how they picked 405 B and 15 trillion tokens.

So whatever someone chose to dive deep into is what like I was like okay, well we'll dig into that. Also, other part of this I'm probably going to give like a longer one hour breakdown of like I'll go through the whole paper, have like an hour talk at some point so I started slides, they're very not ready these are like 20 minutes of slides just became discussions.

But basically, I'll spend like two minutes on overview everyone knows llama. Interesting stuff that so like they drop three to three 131 was a pretty big update the AP got a lot better than 70 be got a lot better. A lot of this is just for other talk, but, um, yeah, they drop some sizes the context with getting bigger.

Their scaling laws were just overtrain and pray, but no, they're actually pretty grounded and real scaling laws. They're all dense models, after reading the paper their whole justification for this was like we want to see stuff that scales. It's the first actual research paper where they talk about everything hardware inference hardware failures what happened how they fixed it.

So, real research paper, there's a lot on it it's like basically everything pre training post training, they're scaling laws how they ran experiments it's a great recipe on how to build it, that's what this talk would be later when I do it. It's really cooked models really good, it's solid open sources like GPT four level, they talk about how they bring up performance.

Also for anyone that finds any time to cut us off cut us off it's all vibes. The jumps for the eight be basically from three to three one, it got better all around overview of the paper, there's a bunch of Twitter threads so instead of me making slides will go over like the main one shared a discord for everyone that hasn't seen also is my whole screen sharing.

So is it just okay let me share my desktop. Okay, so for people that are new and not in discord. I do have a very active running llama three section, I'll share the paper. So, if we go to this little like you have to find it so you got to go through the news and find the llama there's like 60 links that we've been posting of everything popular on Twitter will go through these at some point but paper overview is basically like they have multiple phases of training.

I'm very not ready I do have other notes on this that all share it there so I started screenshotting stuff. They have like three aspects to a good foundation model, basically data scale complexity. This is why they didn't go into an MOE they want training stability, basic two phase training there's like pre training post training.

They do a lot of scaling well work so in their pre trained data set, how they, how do they determine what the pre training mixes in the post training in the pre training, they start doing most of the training at like low context, then they continue pre training at long contrast.

A lot of what they said in their complexity section was like, we want to do basic stuff that will like scale up this is like the foundation for how we can like redefine scaling laws and train this stuff up so no crazy complex RL just, you know, SFT for chat tuning and then they do a lot of models for rejection sampling to do data set stuff, DPO.

They added their safety bullshit at the end. Other interesting stuff that didn't make it to Twitter was like, they had multimodal experiments, they did a lot of experiments, they did a lot of tests, they did a lot of experiments, they did a lot don't do it. They added their safety bullshit at the end.

Other interesting stuff that didn't make it to Twitter was like they had multimodal experiments. They trained in adapters. They have a vision adapter, audio adapter stuff. There was cool sections on their pre-training mix. So basically they used a lot of like traditional filtering techniques. So they have like Roberta based filters for high quality.

They have this for their synthetic data distribution. Like how do we extract out high quality data? Then they have a lot of traditional NLP for like PII text extraction. They had a whole section on like how they scrape the web and how they train their own parsing HTML. They compared it to what's out there and their stuff's better.

There's a lot in this paper. Datamix was a really interesting section as well. So they basically go into, here's basically what they did, deduplication, all this stuff that you would expect. Model-based filtering was pretty cool. They used a lot of like they trained classifiers on LAMA 2 outputs. On the synthetic data side, Eugene has a great tweet thread.

We'll probably go through it at some point. This was an interesting section that we haven't seen before. So when you have like a base model that you're pre-training and you have like 15 trillion tokens, how do you determine what the right mix of that is? So their finding was like half the tokens are general knowledge, 25% map and reasoning, 17% code, all this stuff.

But they're like, this is the first research paper that actually breaks this stuff down. They actually did like scaling law experiments. So they trained small models that were like a couple billion parameters. They started testing different datamixes and then they train a large model to see what actually works on what's the right datamix.

And then they're like, here's the answer for this stuff. Model architecture was pretty similar. They like did a few little changes. They did some better attention masking, group query attention, here's architecture. All this stuff is like on Twitter, so not as interesting. From the podcast that Sean had, the vocab section is pretty interesting.

They're like, instead of messing with tokenizers, changing vocab is pretty big for small models. Check out the podcast or if it comes up in discussion, we'll discuss it. Scaling laws was another interesting one for the paper itself. Basically traditional like chinchilla scaling laws used to have this whole like, they're predicting what's the optimal for your compute budget, like what's the optimal model parameters, all that stuff, how many tokens you train on.

We thought that they were just scaling and preying and trading like, you know, fixed cost training run for cheaper inference. But this stuff is actually grounded. So they developed new scaling laws where TLDR of what they did is previously we used to do scaling laws where we're just predicting on next token prediction accuracy, right?

So we're trying to predict on like perplexity and just how good is next token prediction. Instead they do all this fancy math and they change the training objective to be like more representative of a reasoning benchmark. They use the ARC challenge where basically they have a reasoning benchmark and now instead of doing scaling laws to predict next token prediction, they've changed it so that they're doing scaling laws to predict optimal model stuff based on actual reasoning.

And that's where they come up with this like, their scaling laws show that for a 402b model, you want to train on 16 and a half trillion tokens. Based on that, they did a flagship 405b based on 15 trillion tokens. And then this is where they have their like infra optimal where they started to do the 8b, the 70b, they just reuse their 15 trillion tokens and just overtrained and that works.

The other really cool section, the sections that didn't make it on Twitter were like their training infrastructure. So they give out everything, right? They give out like the full pre-training stack of like, they have a section in here on how they do their pre-training. So like, one is like the whole hardware configuration.

So 16,000 H100 hours, what failures they hit, why they went for simplicity. This was a pretty interesting section. Like over their 54 day training, they had like 400 job interruptions, 419 unexpected interruptions, and like 78% of these were like GPU hardware issues. And then they have a section on like, if they did MOE, all this stuff compounds.

So we just wanted something like simple, scalable that we could deal with well. And like, this is stuff that you don't really see in papers anymore, right? It goes further with like, what is the pre-training set? So like, these formulas, we don't really see anymore, right? So it's like when they pre-trained it, here's their like peak learning rate, here's their warmup, here's their decay, here's how many training steps, here's the bat size, little nuggets like this haven't really like come up on Twitter yet.

But like, you know, at first, they have a bat size of 4 million tokens with a small sequence length. So like, the first bit of training is a sequence length of 4,000. Then they double it to like 8 million sequences at 8,000 for the next 252 million tokens. After they've trained on 200 million tokens, they double it again to like larger bat size for the next 3 trillion tokens.

And then they do most of the training at 8,000 token sequence length. So like, little stuff like this, I feel like we still need to digest. There's reasons for why they did this, but basically TL;DR, no other open source paper has like a formula like this. And then that's kind of what the next like 100 pages is.

I feel like at that point, instead of finding what I found interesting, like, I found all this stuff really interesting. They talked about the batching, GPU utilization, memory, like utilization, all that stuff. Like CUDA optimizations, their whole training recipe, what they released performance stuff. Instead, I feel like that's enough of a high level overview of the paper.

The more fun stuff is like, yeah, so how does it perform? They're all better. Infra companies are pretty cheap. And this is also where like everyone else can hop into discussion. Eugene, Sean, other Eugene, hop in now. You know, Fireworks is somehow really undercutting inference price. The scale leaderboard is a held out leaderboard.

It does pretty good here. What else? Grok has it. So some insider info for all the Infra companies, they gave access to randomized weights that were the same size about six days before launch. So six days ago, Infra companies started playing around with it. They started working out how they're going to do inference, what type of decoding they need.

But they didn't have the paper. They didn't have the actual weights. And then day of, they released weights. But like, yeah, stuff like Grok is serving a thousand tokens per second. What other discussions that we have here? Kyle did pretty good evals on performance. He started doing it on his own fine tuning stack.

So he started fine tuning it, compared it to 4.0 mini. OpenAI within hours responded with like 4.0 mini fine tuning. But fine tuning the Lama 3.18b is kind of on par with 4.0 mini. 4.0 mini fine tuning is kind of broken and free, but it gets worse. What other fun stuff?

There's a comparison of model pricing here that's being updated live. Other tweets, George Hotz, Karpathy tweeted. VLLM supports it. Other more independent benchmarks coming in. Basically, it's good. The other interesting part was the licensing. So they changed up their Lama license to proper full open source everything. We have more infra providers, NVIDIA stuff.

But yeah, that's kind of where I feel like we should open it up. That's the quick 15, 10 to 15 minute overview. Whatever people found interesting, like I know there was a lot of discussion about synthetic data gen. Sean and Eugene, you had good tweets about this. So I think this is where we open it up to whatever people found interesting.

And then we dig into those topics because we're not getting through the rest of it. I'm going to open up chat and see what people are up to. But yeah, thoughts everyone. Hop in. Yeah, I wanted to jump in by the way. One thing to warn about pricing is that you're going to see a lot of providers jumping in and everyone's just trying to get the piece of the pie.

So like with some of the previous model launches, you see people coming in at lower and lower price and then they'll increase later. But I wanted to jump in on the training side because I'm quite sure Mibu, Eugene, and Ed will have lots to say on the data. So I think I'll start with that.

I can't share the screen by the way. Do you want to take over or do you want me to scroll? I want to take over slightly because I want to jump through a few things there. So let me share my screen. All right. So I didn't see too much talk about on this but for me, one of the big ones is actually pipeline parallelism.

Not sure how many people... Can you see my screen? Yes. Yeah. So if you're looking at this and like what is this crazy freaking schedule that they are doing here. But TLDR, pipeline parallelism is generally the idea of scaling up your training across multiple GPUs and to build around optimizing that.

That has its own benefits. It has also its own downsides. And the major downside, the reason why people try to avoid pipeline parallelism at all costs and they use like DeepSpeed 3, for example, where the weights are sharded around all the other GPUs is that if you look at pipeline parallelism or model parallelism, there's this problem called the bubble.

The bubble is basically as your data set goes through the different devices. So the forward pass and then the backwards pass, you have all this GPU time here where some of the GPUs are waiting for other GPUs and are doing nothing and basically you're wasting compute. And because everyone wanted to avoid wasting compute, it went on to a search of the algorithm to figure out how to do pipeline parallelism.

And one major one is actually CLSG, coincidentally Singapore, where they created this crazy-ass algorithm to basically train without any wasted time. So you see the gray spots are with the wasted time, respectively. And Facebook is now embarking on their own journey on this. And the reason why this is exciting even for smaller models is that this kind of algorithmic changes on the training is what's going to allow you to train bigger models easier on lower-end GPUs.

So this concept could apply to, let's say, training a 7TB model on 24GB GPUs and things like that. And the reason why they probably need it for the 80GB is because they're training 405B. And yeah, and a lot of people thought, like academia thought that this treated it as a dead end because of the bubble problem.

And then Facebook was like, "You know what? We are going to do that." And that, to me, is one of the more exciting things. The other one that I saw some people tweet out is about batch sizing being smaller, constraints on the batch size. I thought Google has pipeline parallelism in their JAX, the distributed training repositories.

They don't? Yeah, they do. They do. But the thing is, no offense to Google, no one really took, everyone just interpreted it as TPU has 2L VRAM. Kind of, kind of, kind of thing. And they had the basic pipeline parallel, but we still suffered from the bubble problem. This weird scheduling, which I'm quite sure people are going to start replicating it, is to reduce the bubble, the wastage.

So I also saw lots of papers on this from maybe NVIDIA and Matej Zaharia from Berkeley or Stanford. Like they had lots of interleaved pipeline parallelism updates. Correct. So you're saying no one is using it? Just Facebook has used it more recently? I find that pretty... Or at least no one published it within their training processes.

Because this is the first major model of this CL class size, right, that's saying, "Hey, we are doing pipeline parallelism." Google models, so they have some, these pathways, distributed training architecture systems, and they publish in maybe OSDI, which is kind of the biggest distributed systems conference. So they publish these trainings and they can do all sorts of parallelism within their systems, and even a mixture of experts parallelism and stuff like that.

So they do quite, quite heavy stuff. I'll look it up and post some papers if I find them in the messages. But yeah, my mental model was that people are actually doing this at scale. Thanks. Yeah. So I'll draw the distinction between pipeline parallelism and techniques like DeepSpeedTree, which is essentially where the GPU has MVLink connectivity to other GPUs to actually read the model weights.

Pipeline parallelism is really more of like, instead of going cross GPU to read the weights of the other models, or the other half of the model, you actually just focus on the half of the model that you're working on. And this has the trade-off, respectively, of saving VRAM and allowing you training larger model and larger batch size, but it means you have the bubble problem.

And I think the focus is really more about the bubble problem here, rather than anything else. And yeah, like I said, I do expect more people to replicate this part. Yeah. So that's the part that I wanted to jump in on. The other major one I wanted to jump in on is just multilingual.

I'm so happy that I've seen this. We try to avoid using machine translated data to fine-tune the model. This is something that I think multiple people know that I've been shouting on the roof about saying, "Hey, can we stop using machine translated data for other languages?" And then assuming that's great, because when you speak to the other language, native speakers, they've been saying, "That sucks." And finally, someone is also, at least on the bigger model side, is doing that as well.

So particularly excited about that part. But yeah, I think I'll hand off to the whole data stream. The interesting little section there of translated data is I've still seen it used where they have a Lama3 filter that extracts out what's the highest quality data, what's the highest quality reasoning code data and whatnot.

And in other work, they'll still do... This is very traditional pre-training data set stuff where you need more data augmentation to get more high-quality data and translation. So one thing is you can train on multiple rounds of that. It's like more epochs on high-quality data. So you can just resample it.

But then there was a paper that I'm forgetting that tested this. Do they want to only use a little bit? Do they want to train on multiple rounds of pass-throughs of the same high-quality data? Or do they want to do basic augmentation, like translate and translate back? And somehow translation to other languages work better.

That was the best option. Translating it to high-quality in another language as opposed to translate and translate it back. So there's still some value, but interesting little piece. Yeah. So I think I want to hand off to the people who are going to tear all the data parts into bits.

Because I just wanted to jump in on trains. Because that's what I can uniquely offer. >> Awesome. Appreciate that. I think Cameron has his hand up. >> Hey, did they make any claims around it being good for code generation? I'm interested in whether yes versus cloud. >> Yeah. This is a big contrast to Llama 2, where they were intentionally not training for code, and then they put out code Llama separately.

Now they explicitly outline code as a separate modality, like separate from text. Vibhu, I don't know if you have a slide on this stuff. And then they also did synthetic data for code as well. Yeah. They just -- they spent a lot more time on code this time around.

>> Has anyone looked at it versus Cloud 3.5 Sonnet yet? >> We vibe checked it. We vibe checked it. >> They did what? Checked vibe? >> Yeah. So like, you know, it's not rigorous evals, but like we vibe checked it. And like, it does pretty good. So in the paper, they did explicitly mention as well, like, yeah, they used to have previous -- they used to have previous code Llama models, right?

And part of their, like, second step of post-training was to add in this section on code. But they explicitly no longer need to do that. And I'll pull up the section of the paper, basically. But they mentioned that this is, like, natively trained in, in pre-training as well. But it's a good code model.

They also have a -- the -- >> There's a scale AI benchmark. >> Yeah. >> Jeremy, can you repeat? We can't hear you very well. >> Okay. Yeah. There's a scale AI benchmark where Sonnet and 4.0 were compared against the new 4.0.5b model. And 4.5b was found to be basically on par with GPT 4.0, which is worse than both Sonnet and GPT 4.0 turbo preview.

There's a tweet thread and a comment that I'll just drop. But it outperforms Jem and I 1.5. The thing I like about the scale benchmarks is that they are pulled out. That is, like, none of the companies have access to them. And they're private. So, there's probably more durability to the benchmarks.

And they don't have as much of a conflict of interest. They did co-watch with Lama. So, yeah. There may be a little bit of conflict of interest. >> Thank you. Thank you, Jeremy. >> So, overview, the scale -- >> Go ahead. >> Scale leaderboards aren't just coding. So, for people that don't know, it started out with the GSM 8K, where they tried to recreate it.

And they made a GSM 1K, which is meant to match the actual benchmark and just be a held out that they'll run models, they'll evaluate them. And then that turned into now they have held out benchmarks that no one can see what the actual examples are of coding, instruction following, math, Spanish.

There's a bunch of these. And, yeah, they're kind of, like, pretty good in the sense of, like, no one can directly train on them. There was a piece that said, like, when they put out their first one, what's the delta between companies, like, models that do really well on traditional, like, GSM 8K, but don't do well on 1K, where it's like they haven't seen it before.

So, they basically tried to test who overfit to the benchmarks. And this is trying to solve that. So, if we go through real quick, this is kind of where the 405B sits in coding. It's, like, a step right below QPT 4s and Sonnet. Sonnet's still slightly better. And then we can kind of go through it.

I think they're still testing the 405B, because I'm not seeing it through the rest of them. But they're being updated in tweet threads and whatnot. And then Jeremy shared a link to the Reddit that talks about this, where they're basically going through them. And then there's discussion here, if anyone's interested.

But yeah. Someone was also talking in. >> Thanks very much. >> I have something to share. The coding evaluation, it seems like they so there's I can't share my screen. But basically, human eval is kind of one second. Let me try and share it. Can you see? Yeah. So, human eval is one of the benchmark data sets that people use to benchmark coding.

And it's very simple. Like, they have 150 questions. And it's almost, like, autocomplete. Like, solve this simple puzzle in Python or things like that. It's very, like, one, two lines. And you can see that let's see. So, the LLAMA405B is not state of the art. So, Cloud Sonnet beats it by a few percentage points.

It's close to the GPT and Sonnet models, but slightly worse. And I think this kind of is similar to the vibe checks. My understanding was on the initial LLAMA, stuff that Meta didn't focus that much on reasoning or on code, because they're a social company. So, maybe reasoning is not as super important for them.

But then they hacked focused coding data collection session and shared a big code model. Which kind of wasn't that great. Maybe if you don't put the data in from the beginning, just trying to fine tune on code by itself doesn't work that well. The other thing I wanted to share, can you see this other page?

Now? Basically, it seems they spend quite a bit to make their coding much better in LLAMA3. And they actually train the code experts and then try to use that code experts to maybe, I guess, collect high quality human annotations and do some more post-training. And then they also did some synthetic data generation to improve coding.

So, I think they spent quite a bit to work on reasoning and coding. I didn't read this section carefully, but yeah, they have a full section on trying to get better code data to generate, to incorporate feedback and do analysis. They did quite a bit on coding. Yeah. >> Yeah.

There's two sections there. One is the synthetic data gen with coding, and the other is the pre-trained mix of their code and reasoning sample where they trained a second classifier. So, one of the takeaways there was when you're doing pre-processing of 15 trillion tokens, you actually can't just run inference.

Even Meta with all the GPUs they have, they couldn't afford to just throw LLAMA3 inference as this whole 15 trillion token set. So, they trained a code and reasoning classifier on Distal-Roberta, which is a small original encoder-decoder transformer to try to annotate out their web scrape data for quality and whatnot.

So, they have it both there in the pre-training set and in the synthetic data gen. There's a really good quote tweet that went on about all this code gen. It's by Eugene. I will share screen and throw him on the stage if he wants to talk about it. >> Yeah.

Thank you. I'm currently commuting, but I'm finding a quiet space right now. All right. Great. Thank you, Vibhu. Yeah, I can talk to you, I think. Can you hear me fine? >> We can come to it in a few minutes. If you're commuting, we can come to it in a bit.

>> No, I'll be commuting for a while. I'm walking to the stadium right now, team event, but I'm finding a good space to sit. Okay. So, I think what really stood out for me in this paper was that how much automation and augmentation was there, right? In the first one, you can see they actually use Lama2 to filter out bad data, right?

And this is in the pre-training step. So, essentially, what they're saying is that, "Hey, we trust Lama2's judgment well enough to be able to do that." And if you scroll down, next slide. And then over here, you can see that they actually trust Lama3 to do tag intention. They actually tag the generated data or the responses based on intention, and they also classify things based on difficulty, right?

And the thing is, they actually adopt some kind of curriculum learning where they start with a single shot prompt or rather a single turn prompt and response, and then after that, they move on to multi-turn. Next slide. And then after that, and this is a code expert that everyone's been talking about, right?

So, what it means is that in order to get Lama good at code, as an intermediate step, they had to train a code model. And that sounds quite crazy, right? I mean, for me, I mean, sometimes training such large models just seems to take so much effort to curate the data, to set up the info and everything, but it seems completely essential in this case.

They could not have done it without that. And Andrej Karpaty had a great tweet about this, whereby every model distillation and synthetic data generation is really now a stepping stone for the next better model. Next, please. And then the same thing here is, okay, and of course, here, this is just an example of how much we trust the synthetic data, right?

The model was prompted to generate problems, then solve each problem, so I'm focusing only on the green highlights here, solve each problems, and then they give the model the errors, and then they ask the model to solve the errors, and then the model also generates the unit tests, which they then use to evaluate the generations on the unit test itself.

It's like, you see that the human is very minimally in the loop. And then if we move on, and you see this pattern everywhere, like multilingual, you didn't hear me talk about it. One thing that's interesting here is that they generate, they use Lama to generate data for target capabilities, and then they back translate it into doc strings and comments.

So that's how they can teach the model to explain code. And then they use those tweets and comments, those doc strings and comments to actually create code again. And then we're going to go through the rest really quickly. It's like multilingual, the same pattern here, math and reasoning, the next one, you see it's the same pattern, whereby the model actually augments the training data with the step-by-step.

So one thing that's really interesting here, in the sense that they actually went the extra step, no pun intended, to actually train step-wise reward models. That's kind of crazy, no? I mean, they wanted each step in the chain of thought to be so good that they actually took the extra effort to train step-wise reward models, which they then combined with Monte Carlo Tree Search to improve the reasoning traces.

And then you see synthetic data for long context, it's the same pattern, Q&A. And then as you scroll down, you see synthetic data for image captioning and synthetic data for factuality. Like factuality, essentially all of it is just synthetic data, if you look at this. I think time will tell whether this really works out well or not.

I think we're still too early on the evals. And then you see synthetic data for adversarial examples, synthetic data for the image encoder training, where they use image captions, and then they augment as existing datasets with new instructions and responses. And what's really interesting here, the second last tweet, is that the human annotators were actually augmented with model in the loop, right?

And if you think about it, this slightly represents a shift in how some folks are thinking about it, right? I mean, a lot of people is like thinking human in the loop, but no, now it's model in the loop, whereby you use the model to create an initial generation that the human then can edit, and it just makes it so easy for the human, right?

And then the one big takeaway from all of this, and that's what I had from this paper, but the one big takeaway from all of this is that, can you imagine how much META had to educate and upskill their SDEs or their existing scientists to use this new technology to be trusting of it, and the annotators to trust the new technology and to just work based on that.

So that was quite eye-opening for me, and I think it sort of suggests the long-term, here's where the puck is heading. And that's all I had. Thank you. Awesome. Thank you, Vips. Eugene coming in clutch with, I threw him on the spot, he's commuting and already had slides and tweet thread.

But yeah, what other topics have we got? I've got the chat. Did they mention using chain of verification prompting, Eugene? Do you mean chain of thought prompting or chain of verification where they try to verify the chain? The latter. Okay, I don't think they actually did that, but they did mention they had stepwise reward models that actually checks every step in the chain of thought.

But I don't recall seeing chain of verification. Sorry. Okay, thank you. Welcome. Eugene, like early last year, there was tree of thought and some iterations with Monte Carlo search and this tree of thought stuff. But at that point, LLMs weren't good enough to verify or provide enough signal for this multi-step reasoning things to happen and things in the loop.

Do you know, do you have some idea how they solved it or why they were able to make all this progress? Basically use all the tricks we were reading about maybe half a year, a year ago, but it seems they actually got them to work. So yeah, I wonder what made it work.

Yeah, I don't know. I'm very interesting. I'm very curious about that as well. I wish there was more papers showing how to use Monte Carlo tree search and actually get it to work and share more details about it. I'm afraid I haven't seen too much of that in this current paper.

Is it mostly for coding that they employed this or for other tasks as well? Because for coding, you could signal back some reward, but for other things, like I don't know how you evaluate things and propagate information and validate the chains of that. Yeah, if I recall correctly, it was actually in the math and reasoning section.

So whereby they actually use stepwise reward models to evaluate each step, to score each step in the chain of thoughts, so that the final output gets better. Thank you. I'll look into it more. Yeah, it's the math. Lightman et al was the citation. And I guess I'll wrap up with one final thing.

I'm sorry, it's a bit noisy. I think Switzer's Latent Space podcast with Thomas, he really goes really deep and he has a strong opinion on synthetic data. I think listening to that podcast will give you a lot more insight into how meta is really embracing synthetic data. So I found that podcast quite helpful.

And this was the Karpathy tweet about synthetic data. Also, yeah, great podcast. I think that's the one. Exactly, wait a minute, I think that one thing in the sense that everything is a step for the next one. No, not this one. It was actually a tweet about smaller models, about how the competition for smaller models is going backwards, but buried in there.

If you scroll down a little bit more. Yeah, this one. Yeah, exactly. You can see the models have to first get larger before they get smaller, right? And the three-line paragraph. And then it's a staircase of improvement where one model helping to generate training data for the next. It's almost like he had read this paper up front and he was alluding to that.

I don't know. Yeah, it's pretty interesting to see. Also, this was a tweet that came out even before, but very much the small model distillation work, it's pretty huge. And that's, I think, the big part of the license play of this too, where they did actually finally change their license to allow people to generate synthetic data, train on outputs of the 405B.

I think the 405B is a little overhyped for just using it for inference. When it comes to inference generation and cost-effectiveness, make sure if experts are pretty efficient. They use less RAM at inference time and they're more cost-effective for just the best quality. But then this is really valuable for synthetic data gen, for filtering, stuff like that.

And that's what I see more of it. I know, Sean Swicks, you also had a pretty good write-up about this. Sorry about that. Or any other about how this is a synthetic data gen model or any other topics we want to dive into. Also open to everyone else that's in the call too.

If anyone had anything interesting that they want to dive into on the paper, pop in, share your thoughts. I think Sachin's hand has been raised up for quite some time. Yeah, Sachin, go ahead. Yeah, so this is for Eugene and Vibhu also. So we saw SONET actually take over some time back.

And there's still some, what do you call, gap to cover. So does SONET have some other tricks in their back which is getting them that higher up? I know someone's working on a write-up about this. No, no, no. I abandoned the idea. Yeah, so they never published anything about what tricks they use.

But the evidence strongly points to the fact that they use the steering vectors that they had from the scaling monosemanticity paper. The main evidence is that they happened to do this monosemanticity research on SONET. What are you guys doing, you little freaks? SJ is constantly screwing up his mic.

They did it on SONET and obviously they only shipped 3.5 SONET. That's like the smoking gun. If they actually had anything else, any other trick that caused 3.5 SONET to be so good, they probably would have deployed it on Haiku and Opus as well. The fact that they don't is proof positive that it's basically the monosemanticity stuff.

Does that answer your question? Do I need to explain what that is? I have a hard time. No, it does. I think it's not. I think it's not control vectors. Yeah, why? They could be. They did say it's a larger model. If you look at the training data and the training date for when Cloud 3 SONET came out to 3.5 SONET, it also has a year and a half of significant data updates.

I think that there was a lot of research that put out on high-quality synthetic data, the post-training mixture of it. SONET probably just had a decent bit of... There's a lot more that they could squeeze out of it. Also, they did say it's bigger. A lot more research and good quality synthetic data, the pre-training data mixture started to come out.

It's bigger. I think it was just a lot more post-training as well, because there was quite a bit of time... In post-training, there's a... Sorry, Vibhu. In post-training, in SONET, you can see there's this pause, thinking pause tokens, where sometimes it generates a token and it does internal thinking and then generates the answer.

It seems like they used some recent tricks where people say, "Hey, you need to think step-by-step, but maybe not materialize directly in the answer, the step-by-step thinking." Sometimes when you run SONET generations, you can see there's some... It stops in the middle and people saw that those are actually pause for thinking and side thoughts.

It seems that really helps with reasoning tasks quite a bit. That's one additional trick that they use. I agree that it's been one year of work, so they probably have lots of tricks in there, not just one, two, three. Much like in the Lama paper, you'll see that it's hundreds and hundreds of people, still less than a thousand, but yeah, it's like a Manhattan project to build one of these things.

I actually counted the number of people. Lama 3 had 200 core contributors, which is good. It's pretty small. It's less than Gemini, which had 950. Sebastian says, "What does thinking mean?" Okay, here's where we get philosophical. My quick take on that is this used to be a thing with the original ChatGPT web UI stuff of why is it pausing at stuff.

I think some of this is also just the way the API works. What's the inference it's running on? How's the streaming? What's the API service like? Sometimes when there's blocks, it's not that something else is going on. It's just that there's a delayed stream of your API response and sometimes people overanalyze that.

Is it that it's thinking? Is it what's going on? Maybe, maybe not. I think for the Sonnet's case that some users have already used, I guess, prompt injection to trick it into instead of doing the XML thinking block and to use it to output it in a different format, then you literally get to see the thinking process.

Yeah, so for what it's worth, I went to iClear and interviewed the pause token author. It's on the iClear episode if people want to check it out. I do not think that Cloud specifically implemented that version of thinking. I think it's much simpler. It is just chain of thought.

It is just prompted XML chain of thought that is then subsequently post-processed and removed inside of Cloud Artifacts. Yeah, but it's still a form of thinking. It's a form of chain of thought. It definitely improves the performance. Right. Sorry. Yeah, that's what I was aware of. I'm sorry. I missed what Eugene mentioned for an alternative to what you described as a chain of thought that is not presented to the user.

Eugene? Yeah, so instead of creating a custom token, which is the pause token concept, they literally just got the model through prompting, got the model to reply with a thinking XML block, which then you can trick it through prompt engineering to substitute the tokens respectively. Then suddenly this technique becomes very blatant when you get to see it respectively because it's no longer hidden from the UI.

Yeah, I think also on a separate line, because since a lot of people are looking to the eval and then they're like, "Hey, some of these evals are doing worse than, let's say, 4.0," or things like that, the fact that it's already closed itself means that if you're just a few weeks away until every single benchmark, you're going to see a point jump because someone fine-tuned a code-specific version of the L3 model or a medical reasoning-specific version of this model.

It's going to take slower than normal because I spoke to some people in the fine-tuning community. The biggest hurdle has been, "What do you mean you need at least three nodes of H100 to start the process?" Yeah, the amount of VRAM requirement is kind of huge. I suspect we are going to see more Loras first before we get full fine-tunes.

Also, the most random part, I know Meta did this for good reasons because they basically did a lot of no-keyword filtering from the sources, but a lot of people in the AI companion space, they were like, "No!" basically. Yeah, it makes sense. I'm going to look into that. Thanks.

Do we have more things on the paper? I mean, there's more to discuss. I feel like everyone's being too polite. There's a lot of new scaling laws they brought up. They had a whole recipe for post-training, how they did it, how they did their SFT, how they did... They also released both the base and the instruct models, how much of this was done by synthetic data, how they train their image video adapters, all that stuff for multilingual stuff.

They give out a whole recipe on how to do this. It's a long, long read, but for anyone that hasn't read a lot of papers, this is also probably a really good one that's very approachable, very readable, and not too crazy technical, one to at least understand what's going on.

They go into some of their evals on their multimodality, how their adapters work, how it performs, and they're like, "Yeah, it's pretty good." They added speech into speech understanding, how to train a speech encoder, how many hours of recording they used, how they filtered it. They go through literally all of this.

This is probably where you could have an hour on this paper. That's every step of it. But it's an interesting one where, yeah, they do go into all that data set, how they transcribed it, just little, little stuff too. In their speech understanding section, there's a section on, "Our ASR training data contains 230,000 hours of manually transcribed screech recording that spans 34 languages." Just a little one line of, "We casually manually transcribed 230,000 hours of 34 languages of speech, and we're just training a little adapter for this that we're not releasing." They put a lot of work into that.

Then it goes even deeper into, "How do you use this for pre-training? What about spoken dialogue? How do we fine-tune?" What's the recipe for a speech adapter in an LLM? Yeah, we did a lot of pre-processing to the base data set of manually transcribe a lot of speech recording, have multi-languages, train it out.

Here's the speech length segments that we want. Then we fine-tune this adapter for spoken dialect. How do we do that? Well, we synthetically generate responses for prompt. We ask for transcripts. We generate them. They generate 25,000 hours of speech synthesis through voice box, which is a whole other series that Meta has put out around everything, how they do voice generation.

They have a whole really good breakdown paper of that, how they use that to generate model, to fine-tune, and generate synthetic data for this. There's a lot in here, if anyone's interested in. A lot of that doesn't make it to Twitter, but dig in, present it. Architecture stayed the same.

I thought the interesting parts were also just like, they want to keep it very foundational and see what works and what they can just scale up. The second aspect of that is it'll be fun to see when they start cooking with actual MOEs, how do we go outside the box.

It's nice to have clarity on their scaling laws. I've definitely presented too many times that they just scaled up and prayed, and they took an AP to 15 trillion, and they were very inefficient. Other papers, like 5.3 took a big shot at this. 5.3's whole paper is about how we had Chinchilla scaling.

Then we have this inference optimal Lama scaling, and then here's how you could do what we think is right. Now, Lama puts out a paper and they're like, "No, this is actually all based. Here's new scaling that's not just next token prediction. It's grounded on reasoning. Here's how scaling laws work.

Here's how you can use it. Here's why we did it." It's going to make sense. The scaling part, I found it interesting and funny that they were using ARK as a measurement for the scaling training. One of the realistic that I had in my head, it was like, "So Facebook spent over $100 million at the ARK challenge to try to win the million-dollar prize." I think it's a different ARK, right?

If I'm not mistaken. It's not the same as the million-dollar. It's not the same data set. Yeah, but it's still in the line. Yeah, a lot of good stuff. In the five to seven minutes we have left, I wanted to give some time to, I guess, Hassan. Hassan's actually built an app that's kind of cool with Lama3U.1.

Maybe there's something to learn about prompting it, building with it, anything surprising. Yeah, thanks, Sean. Hey, everybody. I just want to talk about this app that I built real quick. Definitely a lot less technical than we're talking right now. This is dropping all the way down to the half layer.

I guess it is all about building, but I just built this little app. It uses a search API to put in whatever topic you want to learn about, like quantization, and it can explain it to you at kind of any level you want. So, let's learn about quantization at an elementary level.

So, it will basically use a search API, grab all these sources, and put it into context, because obviously Lama3.1, larger context, you can fit in a lot of stuff, which is great. And I've noticed that it's pretty good at kind of dumbing down concepts, but also responding to the prompts a little bit better.

And almost responding, like if I set a system prompt that kind of details how it should behave over the next few messages, I found that it responds a little bit better. Like, for example, for this system prompt, I had, like, make sure you make the overview really short, because when I was testing on just Lama3 and on other platforms, it was giving me a really, really long initial answer.

And so, I want it to give a really short overview, but at the same time, I want it to be detailed in the future. And I also want it to include quizzes at certain times. And so, I just found that it's a little bit -- it was a little bit smoother at kind of -- at responding.

So, yeah, here it's going to, you know, actually try to give me a quiz and try to just be interactive and kind of teach me a subject at any level that I want. So, it's fully open source. LamaTutor.com, it's definitely open source. I misspelled this. >> Did it give an example of quantization math?

>> I don't actually know over here. I think it's talking about -- >> I just got into a debate on a call where someone was making an obvious error in quantization math and how much the memory before versus after. I can send this to them. >> I'm not sure if it'll -- I know it does formulas as well, yeah, so it'll put some -- I should probably format these a little bit nicer.

>> That's awesome, though. >> Yeah, thank you. It has -- yeah, I got about 4,000 visitors, and if anybody's curious about cost as well, so about 6,400 requests. I had some errors because I was playing around with -- I was hitting the context limit and a bunch of other stuff.

And also, you know, using the together API. Very biased. I work over it together. But the cost for anybody curious, so we have about 12,000 average tokens per request. We have about 5,900 successful tokens. And so if you do the math, that's like 74 million tokens. And if I go over to our pricing, right now we're at 18 cents per million tokens for 8b 3.1.

So that comes out to about $12 from the 72 million tokens that I used in the last 24 hours from these 4,000 people and 6,000 requests. So that's all I have. >> And so, wait. Wait. Can you show -- did you put a link in the chat? Or can you type -- >> No, I'll do that.

Yes, I'll put in a link to the LamaTutor. >> Oh, LamaTutor, okay. >> HassanLamaTutor.com. >> Hassan, I have a question at a high level. Is this using some sort of, like, have you ever used the GPT's actions? Is it using a system similar to that? >> I actually haven't used actions.

Can you tell me about that? >> So actions is where you input an API spec and the model itself can make the calls -- the model itself can decide to make the calls to that API, provided that API spec. >> Oh, so like function calling, basically. >> Yes. >> No, I'm not using -- yes, I'm not using it on this app.

This was, like, the most simple example I could do, where I give it a system prompt, I do this API call for search, I parse all the sources, and I just drop everything. And I'm like, hey, build something interactive. So this is kind of step one. There's obviously so much more I want to do.

I want to try to do generative UI, maybe, with the Versalia ISDK, where I show -- for the quiz example, I show an actual quiz component that renders. I can, like, generate a little, like, report out of all the sources and have, like, read this, you know, like, you can read this to check it out, or flashcards, or, like, I feel like there's a lot of directions to go with it, but I kind of just wanted to build something really, really quickly.

I just built it over the weekend. We got early access to 3.1, because we were a launch partner with Meta. So kind of just playing around with it and try to build something really quick over a weekend. >> Oh. Okay. I missed that detail. That's really cool. >> All right.

Thanks. >> So you're doing the retrieval. >> I'm curious if you could share what does it take to serve a 405B model all together? >> What does it take to serve 405B? So we're serving -- what is it? FP8. It takes eight H100s per instance. Yeah. But we're looking into quantizing down to int4 and trying to serve on four H100s.

But in progress. >> The map is roughly 1 gig to 1 gig of VRAM, right? So at FP8, yeah, 1 to 1. And then you could scale that out. There's pretty good resources on this. I feel like we've shared them in Discord. And then someone asked about quantization stuff.

Together put out pretty good -- like you're now running quantized models in a good blog post about this. It's somewhere in the Zoom chat, and we'll probably share it in Discord, too. Good resource on all that. Also the paper has a section on inference, of course. So how do you run a 405B?

What does that look like? What's efficiency in that? They have a whole section on this. They have sections on previous meta work of how do we have more token efficiency? So multi-token prediction of the pre-training task. How do we have other parts of what you might see? Meta's done a lot of work on this.

But overall, that's kind of a high-level overview and our thoughts on the paper. If anyone else has questions, we can discuss them. Otherwise, next week -- we were supposed to do Wizard LM and Orca 3, heavy synthetic data gen stuff this week, today. We pushed that to next week.

Sorry, this was not super prompted. Basically, less than 24 hours ago, we decided let's switch this paper club. So we didn't have crazy slides. But next week, we'll have stuff for Orca 3 and Wizard LM. Same paper club, same time. And then at some point, I might do a deeper dive into this, a proper one-hour breakdown of all this.

If anyone's interested, I'll share somewhere. I'm curious about the quantization thing and whether -- if you have the choice of a larger -- sorry, more parameters, but worse quantization, how do you figure out the trade-off between that and running, say, the 70 billion model without quantization versus the 405 with quantization to make it down to the same size?

>> People generally do benchmarks. These are already happening right now. A lot of the communities are trying to figure this out. For anyone who wants to try 4-bit quantize, you can actually run it on 16 4090 GPUs if you're doing 4-bit quantize. I'm already seeing folks doing that. Whether that's a good idea, whether the model will get dumber or not is something we'll probably find out in the next few days.

>> At a high level, though, the larger ones see less of a hit when quantizing than the small ones. r/locallama has a lot of comparisons and benchmarks of different quantizations. Basically, a lot of the prosumer rig is a dual 4090 or dual 3090 system. You've got about 48 gigs of VRAM.

With 48 gigs of VRAM, you can run a 70B at 4-bit quant. A lot of people will spend $3,000 on a local rig. They can run a 4-bit 70B, or they'll look at how does that compare to a 34B at higher precision, or a single GPU where you're running an 8B.

There's solid breakdowns of benchmarks of these. This is where Reddit people are doing good work, as opposed to when you look at it from the infra provider side of what's the benefits of quantizing. There's also efficiency in QLORA fine-tuning and quantizing and doing it. Benchmarks-wise, the LDR is bigger models take less of a hit, smaller models take a bigger hit, and then speed inference.

Alex, do you have thoughts? >> I have a question, actually. I don't usually have questions, but this time I have a question about effects of quantization. In your guys' experiences, what are the most visible effects of quantization? What comes to mind when you see that the model is quantized, and how quickly do you realize, "Oh, this is the effects of quantization"?

>> It's just dumber, that's all. >> Dumber in any specific areas? Is it dumber knowledge-wise, is it dumber logic-wise, or any specific areas? Which is overall dumber? >> One thing that I've seen a lot of community members do is that when you over-quantize at longer context length, it starts going into repetition rapidly.

So that's like, "We have gone too far line." >> Yeah, in a lot of the one-bit quants, as people ineffectively quantize stuff, it starts to just go into pure chaos. So it becomes stochastic random next-token prediction. The little trade-offs that you see at what type of quantization you want to do, it starts to perform worse at chain of thought, at long context, at needle in the haystack.

Some of those things start to degrade earlier on in quantization, before raw benchmark performance does. And then, yeah, sometimes it's just dumber. Which is interesting, because it tends to be worse on reasoning benchmarks but still decent on the TriviaQA-type benchmarks, and trivia is about internal knowledge: what does it already know?

It doesn't lose facts, but it loses reasoning, and it loses verbosity of responses and so on. So it degrades in that sense. But then you do have the benefits: it's significantly more efficient to run. Four-bit versus eight-bit, you can run on half the hardware, you get speed-ups, and you can fine-tune more efficiently.

So there are trade-offs. The one interesting thing I'll note there: if anyone saw Apple's LLM stuff at WWDC, they put out an Apple research blog on how they do all their local LoRA swaps, and they dropped in a little paragraph saying, "We tested our models at around two-bit quantization and saw no performance drop compared to full precision." So they're basically saying, "Yeah, we benchmarked it, we got lossless quantization," with no details, no nothing. They seem to be stating they figured it out, which, you know, skeptical, but they are running quantization for on-device.

Yeah, but on specific tasks for them. It's also going to be very... sorry, it's also going to be task-specific. We have seen weird edge cases where quantizing a model one step actually improves it for certain tasks, so it's not degrading all the time. >> That's true, too. I'm running an eval right now across cloud providers for 3.1 70B, just to see if, say, Together's FP8 versus Fireworks or something like that makes a difference, or versus Groq.

Have you seen Artificial Analysis? >> Oh yeah, I think I saw something. They run pretty good comparisons. >> Yeah, but they don't do quality; they just use reported numbers, I think. But they do compare inference providers. >> Oh, they're not doing their own benchmarks? I thought they had... >> They're doing pricing and they're doing speed.

I don't think they're doing quality. >> Yeah, they're not doing benchmarks. Groq is much more difficult to benchmark, to be honest. So, to add to what Alex was asking: I know of credible reports of people saying that between Groq and some other providers the quality is different, in the sense that, everything else remaining the same, the inference engines are giving wrong answers.

So, I'll just leave it at that. >> I mean, one thing I can tell you right now is that I've noticed Groq at temperature zero returns different responses on every request, so I don't know if temperature zero actually works there. >> Yeah, that has been... a lot of people have said that to the CEO as well.
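One way to sanity-check that kind of claim yourself: hit a provider's OpenAI-compatible endpoint several times at temperature 0 with the same prompt and count distinct completions. A rough sketch; the base URL, model id, and environment variable names are placeholders for whatever provider you're testing, not specific recommendations.

```python
# Sketch: probe whether temperature=0 is reproducible on an OpenAI-compatible provider.
import os
from collections import Counter

from openai import OpenAI

client = OpenAI(
    base_url=os.environ["PROVIDER_BASE_URL"],   # placeholder: the provider's API base URL
    api_key=os.environ["PROVIDER_API_KEY"],
)
MODEL = "llama-3.1-70b-instruct"  # placeholder: use whatever model id the provider exposes

prompt = "List the first five prime numbers and briefly explain why 1 is not prime."
completions = []
for _ in range(10):
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=200,
    )
    completions.append(resp.choices[0].message.content)

counts = Counter(completions)
print(f"{len(counts)} distinct completions out of {len(completions)} identical requests")
```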

Isn't temperature zero supposed to be the most deterministic? >> Depends on the framework, to be honest. >> In my opinion, temperature zero should be. Well, theoretically, it's supposed to be the most deterministic. I would expect, based on what temperature is supposed to do, that it would be the most deterministic.

Yeah, I guess... >> Oh, go ahead, Vibhu. >> So, sometimes there are mundane reasons. When they're running inference, they might run it on different hardware; maybe there's a server on the West Coast and one on the East Coast with different GPUs, different CUDA kernels, different drivers, and some of that affects how rounding is done, how the next token...

Yeah, basically, how is inference being done? So you see slight variation. Then, across inference providers, this is their secret sauce, right? Are they doing self-speculative decoding? Are they doing speculative decoding? How are they doing it? All these little things make a difference, but there's also the simpler reason: different hardware, different drivers, different everything.

So, some of that answers it, but I don't know, Eugene, maybe you have more to add. >> I was just going to say that for GPUs, with floating point, when you push it through so many calculations and so many matmuls, the floating-point results just aren't going to be exact.

So that's why, even if temperature is zero, it's not going to be the same across multiple requests. >> Yeah, even on the same GPU for the same request, the order you do the reduction over the matmuls, as Eugene mentioned, will affect the results, because floating-point addition is not associative, so the computation is effectively non-deterministic.

You can try this in Python: add 0.1 plus 0.2 and you get 0.30000000000000004, not 0.3. So the order you do the additions and multiplications in can affect the result, and if you're doing thousands of additions and multiplications, it's difficult to get exactly the same order every time.

It's that very small 0.0000001% of noise that ends up cascading all the way through. If you want true temperature-equals-zero behavior, the only reliable way is to run it on CPU, because the CPU gives you consistent floating-point behavior. But if you're running on CPU, it's going to take forever.
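The underlying point is easy to demonstrate: floating-point addition isn't associative, so summing the same numbers in a different order, which is exactly what a differently scheduled GPU reduction does, gives slightly different results.

```python
import random

print(0.1 + 0.2)  # 0.30000000000000004, not 0.3

# Sum the same numbers in two different orders, as a parallel reduction might.
values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]
reordered = sorted(values)           # same numbers, different summation order
print(sum(values) - sum(reordered))  # usually a tiny nonzero difference

# In a model, a difference that small can flip an argmax at one token,
# and everything generated after that token diverges.
```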

And to add to what Eugene said, there are actually five or six different floating-point formats in CUDA, and I believe people have to hand-code for, or at least account for, those differences. That's where they actually have to be shown the errors: on some other platform we see this answer, and over here we see that answer.

And then someone has to painfully go back and figure out a bug in their code. So most of this is basically bugs in the translation, not bugs in the hardware; that's what I was trying to say. >> I'm definitely going to try that CPU approach. I'm trying to get a project approved right now, with the rationale being exactly this type of inconsistency.

Also, it depends on the engine. ExLlamaV2, for example, is not deterministic at all, whereas Transformers, I think, is more deterministic; you can set the torch seed and the inference seed. So the inference engine is very important, too.
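For local Hugging Face Transformers inference, the "set the torch seed" point looks roughly like the sketch below. The model id is just an example, and even with seeds fixed, bit-exact reproducibility still depends on hardware, kernels, and backend settings.

```python
# Sketch: greedy decoding with fixed seeds in Hugging Face Transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(42)                                          # seeds Python, NumPy, and torch RNGs
torch.use_deterministic_algorithms(True, warn_only=True)

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"    # example id, swap for whatever you run
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tok("Explain the residual stream in one paragraph.", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)  # greedy, i.e. "temperature 0"
print(tok.decode(out[0], skip_special_tokens=True))
```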

>> Thank you. Awesome. Well, thanks, everyone, for joining in on our last-minute drop-in. >> Can I share something really quick? >> Yeah, yeah, go ahead. I've got to drop, but this room will stay open; I'm going to make someone else host. Feel free to keep discussing. Next week we have another one at 12, but I'm going to make Eugene host. >> Oh, it's okay.

I can do it next time. >> No, no, go ahead, go ahead. >> No, don't make me host, I'm at a baseball game. >> No, the other Eugene. >> So, it's quick. Basically, I found that Llama 3.1 405B actually might do better on some domain-specific questions than ChatGPT-4o.

Do you guys want to see a little bit? Just really quick. >> Sure. And it's no surprise, actually. I think more and more people will find specific domains like this as time goes on. >> That's great. So, let me share really quickly. Here's a little summary. Basically, this is the specific question I was asking.

Here: "You're an expert in mechanistic interpretability research. How do you use the residual stream?" Yeah. And here's my summary. Overall, I think 405B did a better job of explaining, and there's some important information that's missing or not explicitly explained in the ChatGPT-4o answer.

So, let me show you the answer from Llama 3.1 405B first. I find it really easy to follow for someone like me, since I'm not doing research in this field, and it also gives enough technical detail. So, if anyone wants to read a little bit... yeah. And then I can move on to the ChatGPT answer.

Do you guys want to see the answer from ChatGPT-4? >> Yeah, yeah. >> Yeah, I'm done. >> You're done? Okay, cool. So, this one is from ChatGPT-4. I feel like ChatGPT still kind of answers things in a generalized way. Let me know if you want me to scroll down,

or if I'm scrolling down too fast. >> I think Eugene, the other Eugene, would chime in that a lot of these things are very dependent on the prompt, so it wouldn't surprise me if you tweaked the prompt right and the winner ended up flipping. >> That's what I meant. Yes, and just to be clear, I didn't tweak the prompt just for Llama 405B.

I just said, hey, I have this question, and I threw it in as-is. But definitely, more detailed study needs to be done. And let me show you ChatGPT-4o. It's better than the GPT-4 answer, but I still think Llama 405B did a better job. Just a second.

So, I'm going to do the same thing, scrolling down; let me know if I'm moving too fast. Basically, of course more detailed study is needed, but I feel like Llama 405B may be doing a much better job of organizing this knowledge: how it handles answering the question, how it pulls the knowledge together.

I think that's pretty important. >> I'll give a personal example: when I look at some healthcare-related data through ChatGPT-4o, and then use a similar prompt through Perplexity, I get different answers. And since I know the domain, I know that sometimes ChatGPT is just bullshitting.

Perplexity actually pulls references and gives you a slightly better, fuller answer, which I can take back to the references and dig deeper into. So once you know the domain, you should not fall for the English, is my take. The English may look all correct, but it might be nonsense at the end of the day.

>> Oh, I looked into the residual stream a little bit, so I have pretty good confidence it's not bullshitting from 405B. Yeah. The thing that really amazed me is how it explains it. And, like I said, I'm just starting to get into a lot of technical things I haven't done before.

It seems like it just explained it really well, and I can go down and look into specific information and things like that. But definitely, I see your point, it's just one shot. Still, it seems pretty impressive, because I didn't try to tweak the prompt just for 405B.

Yeah. Thank you. >> Yeah, so that's what I wanted to share. >> Yeah, definitely, we're going to see more and more of this. I think what's interesting, especially with different foundation models, is that you're going to get very different default outputs. Because, yeah, we can probably steer all models, especially the bigger ones, with prompt engineering toward certain desired effects, or even with fine-tuning.

But the default, out-of-the-box output is what's interesting, because that's what a lot of day-to-day users will actually experience. Like, for me, for the longest time, I didn't care that GPT-4 had a better eval score across the board than the Claude model.

The Claude model just seemed friendlier and nicer to me, and I liked it better. And maybe that's what matters sometimes, when they're all good enough to get the job done. >> Yes. So I wish we had a better model where we don't need to tweak the prompt, you know?

In a way, it can reflect how well the model has organized its knowledge. If it's easy to just talk to it without tweaking the prompt, it's actually at a much more advanced stage, I think. >> But I think in the long run that will be hard to achieve, because if you view each model a bit like an individual, so you can call it Llama-kun, you can call it GPT-4-kun, and so on,

it's really just a preference thing when it boils down to it. Because, as I said, I prefer Claude because of the way he did the artwork, and I know someone who prefers the other way. And that's the challenge here, right? You can't possibly train the model to satisfy everyone in the world.

There'll be that back and forth. And that's where all the few-shot prompting will come in, or the fine-tunes will come in. And I think that's probably the more exciting thing about Llama: we're going to see lots of fine-tunes. >> Sorry, go ahead. >> Since we're well past the usual time and there are a lot of new faces here, I'm just going to recap and point this out.

So, the Latent Space Discord hosts this weekly paper club, and that's why this session happened. This week is just a very special one because the whole Llama 3.1 release came out. Prior to that, we were planning to do other papers, and we'll probably go back to our usual schedule.

So if any of you, especially the new folks who joined because this got broadcast to a much larger audience, want to, feel free to join us next week, and we'll be talking about other papers. >> Great. Thank you. >> Yes, thank you, everyone. Thanks a lot. All right, have a great day, everyone.

Okay, I'll stay. See you later, Eugene. Peace.