back to index

[LLM Paper Club] Llama 3.1 Paper: The Llama Family of Models


Whisper Transcript | Transcript Only Page

00:00:00.000 | And I kicked off like a what do people want to work on section with I'm going to do a deep dive in the paper because I need to make slides on this and that very much overtook the hackathon.
00:00:10.000 | We had like a solid crew of like 20-30 people that were just discussing the paper with me and slides didn't really get made.
00:00:16.000 | But also I don't know it was weird hackathon project winner was just a deep dive into the paper, but we had an in-person paper club session that I led yesterday and a lot of people from there trying to join in so it should be vibes.
00:00:30.000 | I am liking in person hybrid format, I might start running those we'll see how they go but it was good. Everyone had good discussions.
00:00:40.000 | Amazing, amazing. Yeah, I would be happy to join that. Once I get back to SF this Friday.
00:00:47.000 | Oh, Friday. Exciting. Yeah. So, as you know, we also we interviewed Thomas who was one of the paper co authors, he did not give us the paper beforehand, which is annoying because after reading the paper I have so much better questions that we actually
00:01:02.000 | end up asking in a podcast but whatever.
00:01:06.000 | Yeah, yeah, I think that you know a bunch of us have read it.
00:01:09.000 | I feel like Vibhu you're probably best situated to, to take over the screen if you want, if you have, if you have stuff.
00:01:17.000 | I have very basic stuff but yeah, I'll share. We also got someone at the hackathon that worked on a hackathon project that's paper to video, so someone's cooking up a video explainer of this it's like literally doing inference right now, we'll share
00:01:31.000 | that once it's ready. Yeah. Yeah, this is, I mean this is, I'm excited about it but also I'm wondering how to do this justice. I feel like we can post questions in here, and then, you know, people can just kind of discuss in the chat.
00:01:50.000 | And yeah, I mean like we classically have a lot of side discussions in the zoom chat anyway so I'm not worried about that.
00:01:57.000 | I mean, yeah, well Vibhu, Vibhu go ahead you can get started. But like, what do what do people think what people want to talk about, you know, personally, I, I called this the synthetic data paper.
00:02:11.000 | So I have a lot of interesting insights or these questions about the synthetic data stuff, but we can talk about everything like there's just so much in here.
00:02:20.000 | The format that worked well yesterday was like, we're not getting through 100 pages, I'll give the high level we'll go through the tweet overviews. And then let's just dig into whatever anyone found interesting and had like, you know, something that someone dove into
00:02:34.000 | so like part of it was they had a bunch of scaling laws for pre training they had scaling laws for how they picked 405 B and 15 trillion tokens. So whatever someone chose to dive deep into is what like I was like okay, well we'll dig into that.
00:02:48.000 | Also, other part of this I'm probably going to give like a longer one hour breakdown of like I'll go through the whole paper, have like an hour talk at some point so I started slides, they're very not ready these are like 20 minutes of slides just became discussions.
00:03:01.000 | But basically, I'll spend like two minutes on overview everyone knows llama.
00:03:06.000 | Interesting stuff that so like they drop three to three 131 was a pretty big update the AP got a lot better than 70 be got a lot better. A lot of this is just for other talk, but, um, yeah, they drop some sizes the context with getting bigger.
00:03:21.000 | Their scaling laws were just overtrain and pray, but no, they're actually pretty grounded and real scaling laws.
00:03:30.000 | They're all dense models, after reading the paper their whole justification for this was like we want to see stuff that scales. It's the first actual research paper where they talk about everything hardware inference hardware failures what happened how they fixed it.
00:03:43.000 | So, real research paper, there's a lot on it it's like basically everything pre training post training, they're scaling laws how they ran experiments it's a great recipe on how to build it, that's what this talk would be later when I do it.
00:03:56.000 | It's really cooked models really good, it's solid open sources like GPT four level, they talk about how they bring up performance.
00:04:15.000 | Also for anyone that finds any time to cut us off cut us off it's all vibes.
00:04:20.000 | The jumps for the eight be basically from three to three one, it got better all around overview of the paper, there's a bunch of Twitter threads so instead of me making slides will go over like the main one shared a discord for everyone that hasn't seen also is my
00:04:37.000 | whole screen sharing. So is it just okay let me share my desktop. Okay, so for people that are new and not in discord.
00:04:45.000 | I do have a very active running llama three section, I'll share the paper. So, if we go to this little like you have to find it so you got to go through the news and find the llama there's like 60 links that we've been posting of everything popular on Twitter
00:05:02.000 | will go through these at some point but paper overview is basically like they have multiple phases of training. I'm very not ready I do have other notes on this that all share it there so I started screenshotting stuff.
00:05:18.000 | They have like three aspects to a good foundation model, basically data scale complexity. This is why they didn't go into an MOE they want training stability, basic two phase training there's like pre training post training.
00:05:32.000 | They do a lot of scaling well work so in their pre trained data set, how they, how do they determine what the pre training mixes in the post training in the pre training, they start doing most of the training at like low context, then they continue pre
00:05:46.000 | training at long contrast.
00:05:48.000 | A lot of what they said in their complexity section was like, we want to do basic stuff that will like scale up this is like the foundation for how we can like redefine scaling laws and train this stuff up so no crazy complex RL just, you know, SFT for chat
00:06:03.000 | tuning and then they do a lot of models for rejection sampling to do data set stuff, DPO.
00:06:22.000 | They added their safety bullshit at the end. Other interesting stuff that didn't make it to Twitter was like, they had multimodal experiments, they did a lot of experiments, they did a lot of tests, they did a lot of experiments, they did a lot
00:06:43.000 | don't do it. They added their safety bullshit at the end. Other interesting stuff that didn't make
00:06:48.760 | it to Twitter was like they had multimodal experiments. They trained in adapters. They
00:06:53.720 | have a vision adapter, audio adapter stuff. There was cool sections on their pre-training mix. So
00:07:00.600 | basically they used a lot of like traditional filtering techniques. So they have like Roberta
00:07:06.520 | based filters for high quality. They have this for their synthetic data distribution. Like how
00:07:11.800 | do we extract out high quality data? Then they have a lot of traditional NLP for like PII text
00:07:16.680 | extraction. They had a whole section on like how they scrape the web and how they train their own
00:07:21.480 | parsing HTML. They compared it to what's out there and their stuff's better. There's a lot in this
00:07:26.200 | paper. Datamix was a really interesting section as well. So they basically go into, here's basically
00:07:34.440 | what they did, deduplication, all this stuff that you would expect. Model-based filtering was pretty
00:07:39.880 | cool. They used a lot of like they trained classifiers on LAMA 2 outputs. On the synthetic
00:07:44.840 | data side, Eugene has a great tweet thread. We'll probably go through it at some point.
00:07:48.520 | This was an interesting section that we haven't seen before. So when you have like a base model
00:07:54.200 | that you're pre-training and you have like 15 trillion tokens, how do you determine what the
00:07:58.440 | right mix of that is? So their finding was like half the tokens are general knowledge, 25% map
00:08:04.440 | and reasoning, 17% code, all this stuff. But they're like, this is the first research paper
00:08:09.720 | that actually breaks this stuff down. They actually did like scaling law experiments.
00:08:13.640 | So they trained small models that were like a couple billion parameters. They started testing
00:08:19.480 | different datamixes and then they train a large model to see what actually works on what's the
00:08:23.720 | right datamix. And then they're like, here's the answer for this stuff. Model architecture was
00:08:29.640 | pretty similar. They like did a few little changes. They did some better attention masking,
00:08:34.600 | group query attention, here's architecture. All this stuff is like on Twitter, so not as interesting.
00:08:39.640 | From the podcast that Sean had, the vocab section is pretty interesting. They're like,
00:08:44.600 | instead of messing with tokenizers, changing vocab is pretty big for small models. Check out
00:08:49.960 | the podcast or if it comes up in discussion, we'll discuss it. Scaling laws was another interesting
00:08:55.160 | one for the paper itself. Basically traditional like chinchilla scaling laws used to have this
00:09:01.960 | whole like, they're predicting what's the optimal for your compute budget, like what's the optimal
00:09:07.160 | model parameters, all that stuff, how many tokens you train on. We thought that they were just
00:09:12.360 | scaling and preying and trading like, you know, fixed cost training run for cheaper inference.
00:09:17.880 | But this stuff is actually grounded. So they developed new scaling laws where TLDR of what
00:09:23.000 | they did is previously we used to do scaling laws where we're just predicting on next token
00:09:30.040 | prediction accuracy, right? So we're trying to predict on like perplexity and just how good is
00:09:35.480 | next token prediction. Instead they do all this fancy math and they change the training objective
00:09:41.720 | to be like more representative of a reasoning benchmark. They use the ARC challenge where
00:09:46.440 | basically they have a reasoning benchmark and now instead of doing scaling laws to predict next token
00:09:52.120 | prediction, they've changed it so that they're doing scaling laws to predict optimal model stuff
00:09:57.240 | based on actual reasoning. And that's where they come up with this like, their scaling laws show
00:10:02.680 | that for a 402b model, you want to train on 16 and a half trillion tokens. Based on that, they did a
00:10:08.520 | flagship 405b based on 15 trillion tokens. And then this is where they have their like infra
00:10:15.400 | optimal where they started to do the 8b, the 70b, they just reuse their 15 trillion tokens and just
00:10:22.360 | overtrained and that works. The other really cool section, the sections that didn't make it on
00:10:27.720 | Twitter were like their training infrastructure. So they give out everything, right? They give out
00:10:33.160 | like the full pre-training stack of like, they have a section in here on how they do their pre-training.
00:10:38.920 | So like, one is like the whole hardware configuration. So 16,000 H100 hours, what
00:10:46.040 | failures they hit, why they went for simplicity. This was a pretty interesting section. Like over
00:10:51.800 | their 54 day training, they had like 400 job interruptions, 419 unexpected interruptions,
00:10:59.560 | and like 78% of these were like GPU hardware issues. And then they have a section on like,
00:11:05.240 | if they did MOE, all this stuff compounds. So we just wanted something like simple,
00:11:09.480 | scalable that we could deal with well. And like, this is stuff that you don't really see in
00:11:14.280 | papers anymore, right? It goes further with like, what is the pre-training set? So like,
00:11:19.560 | these formulas, we don't really see anymore, right? So it's like when they pre-trained it,
00:11:24.920 | here's their like peak learning rate, here's their warmup, here's their decay, here's how
00:11:28.600 | many training steps, here's the bat size, little nuggets like this haven't really like come up on
00:11:33.560 | Twitter yet. But like, you know, at first, they have a bat size of 4 million tokens with a small
00:11:38.280 | sequence length. So like, the first bit of training is a sequence length of 4,000. Then
00:11:43.080 | they double it to like 8 million sequences at 8,000 for the next 252 million tokens. After
00:11:49.960 | they've trained on 200 million tokens, they double it again to like larger bat size for the next 3
00:11:54.920 | trillion tokens. And then they do most of the training at 8,000 token sequence length. So like,
00:12:00.520 | little stuff like this, I feel like we still need to digest. There's reasons for why they did this,
00:12:05.800 | but basically TL;DR, no other open source paper has like a formula like this. And then that's kind
00:12:11.880 | of what the next like 100 pages is. I feel like at that point, instead of finding what I found
00:12:17.720 | interesting, like, I found all this stuff really interesting. They talked about the batching,
00:12:22.760 | GPU utilization, memory, like utilization, all that stuff. Like CUDA optimizations, their whole
00:12:30.600 | training recipe, what they released performance stuff. Instead, I feel like that's enough of a
00:12:36.280 | high level overview of the paper. The more fun stuff is like, yeah, so how does it perform?
00:12:41.800 | They're all better. Infra companies are pretty cheap. And this is also where like everyone else
00:12:46.440 | can hop into discussion. Eugene, Sean, other Eugene, hop in now. You know, Fireworks is somehow
00:12:53.480 | really undercutting inference price. The scale leaderboard is a held out leaderboard. It does
00:12:58.200 | pretty good here. What else? Grok has it. So some insider info for all the Infra companies,
00:13:05.640 | they gave access to randomized weights that were the same size about six days before launch.
00:13:12.040 | So six days ago, Infra companies started playing around with it. They started working out how
00:13:16.440 | they're going to do inference, what type of decoding they need. But they didn't have the
00:13:19.720 | paper. They didn't have the actual weights. And then day of, they released weights. But like,
00:13:23.960 | yeah, stuff like Grok is serving a thousand tokens per second. What other discussions that
00:13:29.400 | we have here? Kyle did pretty good evals on performance. He started doing it on his own
00:13:37.000 | fine tuning stack. So he started fine tuning it, compared it to 4.0 mini. OpenAI within hours
00:13:43.880 | responded with like 4.0 mini fine tuning. But fine tuning the Lama 3.18b is kind of on par with
00:13:52.360 | 4.0 mini. 4.0 mini fine tuning is kind of broken and free, but it gets worse.
00:13:56.520 | What other fun stuff? There's a comparison of model pricing here that's being updated live.
00:14:03.640 | Other tweets, George Hotz, Karpathy tweeted. VLLM supports it. Other more independent benchmarks
00:14:12.040 | coming in. Basically, it's good. The other interesting part was the licensing. So they
00:14:17.720 | changed up their Lama license to proper full open source everything. We have more infra providers,
00:14:24.760 | NVIDIA stuff. But yeah, that's kind of where I feel like we should open it up. That's the quick
00:14:31.000 | 15, 10 to 15 minute overview. Whatever people found interesting, like I know there was a lot
00:14:35.960 | of discussion about synthetic data gen. Sean and Eugene, you had good tweets about this. So I think
00:14:43.000 | this is where we open it up to whatever people found interesting. And then we dig into those
00:14:48.040 | topics because we're not getting through the rest of it. I'm going to open up chat and see what
00:14:53.400 | people are up to. But yeah, thoughts everyone. Hop in. Yeah, I wanted to jump in by the way.
00:15:01.560 | One thing to warn about pricing is that you're going to see a lot of providers
00:15:10.120 | jumping in and everyone's just trying to get the piece of the pie. So like with some of the
00:15:16.440 | previous model launches, you see people coming in at lower and lower price and then they'll
00:15:20.040 | increase later. But I wanted to jump in on the training side because I'm quite sure Mibu,
00:15:25.800 | Eugene, and Ed will have lots to say on the data. So I think I'll start with that. I can't share
00:15:32.200 | the screen by the way. Do you want to take over or do you want me to scroll? I want to take over
00:15:38.120 | slightly because I want to jump through a few things there. So let me share my screen.
00:15:45.960 | All right. So I didn't see too much talk about on this but for me, one of the big ones is actually
00:15:59.880 | pipeline parallelism. Not sure how many people... Can you see my screen? Yes. Yeah. So if you're
00:16:08.760 | looking at this and like what is this crazy freaking schedule that they are doing here.
00:16:13.560 | But TLDR, pipeline parallelism is generally the idea of scaling up your training across multiple
00:16:21.080 | GPUs and to build around optimizing that. That has its own benefits. It has also its own downsides.
00:16:30.200 | And the major downside, the reason why people try to avoid pipeline parallelism at all costs
00:16:36.360 | and they use like DeepSpeed 3, for example, where the weights are sharded around all the other GPUs
00:16:42.200 | is that if you look at pipeline parallelism or model parallelism, there's this problem called
00:16:49.080 | the bubble. The bubble is basically as your data set goes through the different devices. So the
00:16:56.280 | forward pass and then the backwards pass, you have all this GPU time here where some of the GPUs are
00:17:02.280 | waiting for other GPUs and are doing nothing and basically you're wasting compute. And because
00:17:07.960 | everyone wanted to avoid wasting compute, it went on to a search of the algorithm to figure out how
00:17:16.440 | to do pipeline parallelism. And one major one is actually CLSG, coincidentally Singapore,
00:17:23.880 | where they created this crazy-ass algorithm to basically train without any wasted time. So you
00:17:30.760 | see the gray spots are with the wasted time, respectively. And Facebook is now embarking on
00:17:36.440 | their own journey on this. And the reason why this is exciting even for smaller models is that
00:17:43.240 | this kind of algorithmic changes on the training is what's going to allow you to train bigger models
00:17:50.520 | easier on lower-end GPUs. So this concept could apply to, let's say, training a 7TB model on 24GB
00:17:58.680 | GPUs and things like that. And the reason why they probably need it for the 80GB is because
00:18:04.040 | they're training 405B. And yeah, and a lot of people thought, like academia thought that this
00:18:11.240 | treated it as a dead end because of the bubble problem. And then Facebook was like, "You know
00:18:17.720 | what? We are going to do that." And that, to me, is one of the more exciting things.
00:18:22.200 | The other one that I saw some people tweet out is about batch sizing being smaller,
00:18:27.800 | constraints on the batch size. I thought Google has pipeline parallelism in their
00:18:32.440 | JAX, the distributed training repositories. They don't? Yeah, they do. They do. But the thing is,
00:18:39.000 | no offense to Google, no one really took, everyone just interpreted it as TPU has 2L VRAM.
00:18:45.480 | Kind of, kind of, kind of thing. And they had the basic pipeline parallel, but
00:18:51.320 | we still suffered from the bubble problem. This weird scheduling, which I'm quite sure
00:18:56.120 | people are going to start replicating it, is to reduce the bubble, the wastage.
00:19:00.760 | So I also saw lots of papers on this from maybe NVIDIA and Matej Zaharia from Berkeley or Stanford.
00:19:08.440 | Like they had lots of interleaved pipeline parallelism updates.
00:19:11.960 | Correct.
00:19:13.560 | So you're saying no one is using it? Just Facebook has used it more recently? I find that pretty...
00:19:19.400 | Or at least no one published it within their training processes. Because this is the first
00:19:25.320 | major model of this CL class size, right, that's saying, "Hey, we are doing pipeline parallelism."
00:19:31.400 | Google models, so they have some, these pathways, distributed training architecture systems,
00:19:40.760 | and they publish in maybe OSDI, which is kind of the biggest distributed systems conference.
00:19:46.680 | So they publish these trainings and they can do all sorts of parallelism within their systems,
00:19:52.040 | and even a mixture of experts parallelism and stuff like that. So they do quite,
00:19:57.160 | quite heavy stuff. I'll look it up and post some papers if I find them in the messages. But yeah,
00:20:04.600 | my mental model was that people are actually doing this at scale.
00:20:10.760 | Thanks.
00:20:11.240 | Yeah. So I'll draw the distinction between pipeline parallelism and techniques like
00:20:16.120 | DeepSpeedTree, which is essentially where the GPU has MVLink connectivity to other GPUs to
00:20:25.000 | actually read the model weights. Pipeline parallelism is really more of like, instead of
00:20:29.640 | going cross GPU to read the weights of the other models, or the other half of the model, you
00:20:36.040 | actually just focus on the half of the model that you're working on. And this has the trade-off,
00:20:44.200 | respectively, of saving VRAM and allowing you training larger model and larger batch size,
00:20:49.240 | but it means you have the bubble problem. And I think the focus is really more about the bubble
00:20:54.680 | problem here, rather than anything else. And yeah, like I said, I do expect more people to replicate
00:21:01.160 | this part. Yeah. So that's the part that I wanted to jump in on. The other major one I wanted to
00:21:07.400 | jump in on is just multilingual. I'm so happy that I've seen this. We try to avoid using machine
00:21:14.760 | translated data to fine-tune the model. This is something that I think multiple people know that
00:21:20.760 | I've been shouting on the roof about saying, "Hey, can we stop using machine translated data for
00:21:25.160 | other languages?" And then assuming that's great, because when you speak to the other language,
00:21:30.600 | native speakers, they've been saying, "That sucks." And finally, someone is also, at least
00:21:35.400 | on the bigger model side, is doing that as well. So particularly excited about that part. But yeah,
00:21:41.400 | I think I'll hand off to the whole data stream. The interesting little section there of translated
00:21:46.680 | data is I've still seen it used where they have a Lama3 filter that extracts out what's the highest
00:21:52.200 | quality data, what's the highest quality reasoning code data and whatnot. And in other work, they'll
00:21:57.160 | still do... This is very traditional pre-training data set stuff where you need more data augmentation
00:22:02.760 | to get more high-quality data and translation. So one thing is you can train on multiple rounds of
00:22:08.760 | that. It's like more epochs on high-quality data. So you can just resample it. But then there was a
00:22:14.040 | paper that I'm forgetting that tested this. Do they want to only use a little bit? Do they want
00:22:18.280 | to train on multiple rounds of pass-throughs of the same high-quality data? Or do they want to do
00:22:23.080 | basic augmentation, like translate and translate back? And somehow translation to other languages
00:22:28.520 | work better. That was the best option. Translating it to high-quality in another language as opposed
00:22:34.120 | to translate and translate it back. So there's still some value, but interesting little piece.
00:22:39.560 | Yeah. So I think I want to hand off to the people who are going to tear all the data
00:22:45.640 | parts into bits. Because I just wanted to jump in on trains. Because that's what I can uniquely offer.
00:22:50.040 | >> Awesome. Appreciate that. I think Cameron has his hand up.
00:22:58.040 | >> Hey, did they make any claims around it being good for code generation?
00:23:05.240 | I'm interested in whether yes versus cloud. >> Yeah. This is a big contrast to Llama 2,
00:23:15.240 | where they were intentionally not training for code, and then they put out code Llama separately.
00:23:19.480 | Now they explicitly outline code as a separate modality, like separate from text.
00:23:25.000 | Vibhu, I don't know if you have a slide on this stuff. And then they also did synthetic
00:23:30.120 | data for code as well. Yeah. They just -- they spent a lot more time on code this time around.
00:23:36.920 | >> Has anyone looked at it versus Cloud 3.5 Sonnet yet?
00:23:41.800 | >> We vibe checked it. We vibe checked it. >> They did what? Checked vibe?
00:23:48.760 | >> Yeah. So like, you know, it's not rigorous evals, but like we vibe checked it. And like,
00:23:55.160 | it does pretty good. So in the paper, they did explicitly mention as well, like, yeah,
00:24:00.040 | they used to have previous -- they used to have previous code Llama models, right? And part of
00:24:06.600 | their, like, second step of post-training was to add in this section on code. But they explicitly
00:24:12.360 | no longer need to do that. And I'll pull up the section of the paper, basically. But they
00:24:16.920 | mentioned that this is, like, natively trained in, in pre-training as well. But it's a good
00:24:22.920 | code model. They also have a -- the -- >> There's a scale AI benchmark.
00:24:28.840 | >> Yeah. >> Jeremy, can you repeat?
00:24:32.040 | We can't hear you very well. >> Okay. Yeah. There's a scale AI
00:24:35.720 | benchmark where Sonnet and 4.0 were compared against the new 4.0.5b model. And 4.5b was found
00:24:45.080 | to be basically on par with GPT 4.0, which is worse than both Sonnet and GPT 4.0 turbo preview.
00:24:51.560 | There's a tweet thread and a comment that I'll just drop. But it outperforms Jem and I 1.5.
00:25:01.320 | The thing I like about the scale benchmarks is that they are pulled out. That is, like,
00:25:06.200 | none of the companies have access to them. And they're private. So, there's probably more
00:25:11.560 | durability to the benchmarks. And they don't have as much of a conflict of interest. They
00:25:15.960 | did co-watch with Lama. So, yeah. There may be a little bit of conflict of interest.
00:25:20.040 | >> Thank you. Thank you, Jeremy. >> So, overview, the scale --
00:25:27.320 | >> Go ahead. >> Scale leaderboards aren't just coding. So,
00:25:32.360 | for people that don't know, it started out with the GSM 8K, where they tried to recreate it. And
00:25:38.200 | they made a GSM 1K, which is meant to match the actual benchmark and just be a held out that
00:25:43.960 | they'll run models, they'll evaluate them. And then that turned into now they have held out
00:25:48.520 | benchmarks that no one can see what the actual examples are of coding, instruction following,
00:25:53.720 | math, Spanish. There's a bunch of these. And, yeah, they're kind of, like, pretty good in the
00:25:58.360 | sense of, like, no one can directly train on them. There was a piece that said, like, when they put
00:26:03.400 | out their first one, what's the delta between companies, like, models that do really well
00:26:09.240 | on traditional, like, GSM 8K, but don't do well on 1K, where it's like they haven't seen it before.
00:26:15.000 | So, they basically tried to test who overfit to the benchmarks. And this is trying to solve that.
00:26:20.840 | So, if we go through real quick, this is kind of where the 405B sits in coding. It's, like,
00:26:25.960 | a step right below QPT 4s and Sonnet. Sonnet's still slightly better. And then we can kind of go
00:26:34.120 | through it. I think they're still testing the 405B, because I'm not seeing it through the rest
00:26:39.160 | of them. But they're being updated in tweet threads and whatnot. And then Jeremy shared a link
00:26:46.120 | to the Reddit that talks about this, where they're basically going through them. And then there's
00:26:51.560 | discussion here, if anyone's interested. But yeah. Someone was also talking in.
00:26:56.200 | >> Thanks very much. >> I have something to share. The
00:27:02.680 | coding evaluation, it seems like they so there's I can't share my screen. But basically,
00:27:15.160 | human eval is kind of one second. Let me try and share it. Can you see? Yeah. So, human eval
00:27:25.960 | is one of the benchmark data sets that people use to benchmark coding. And it's very simple.
00:27:32.680 | Like, they have 150 questions. And it's almost, like, autocomplete. Like, solve this simple
00:27:37.480 | puzzle in Python or things like that. It's very, like, one, two lines. And you can see that
00:27:44.840 | let's see. So, the LLAMA405B is not state of the art. So, Cloud Sonnet beats it by a few
00:27:52.920 | percentage points. It's close to the GPT and Sonnet models, but slightly worse. And I think
00:28:00.680 | this kind of is similar to the vibe checks. My understanding was on the initial LLAMA,
00:28:06.600 | stuff that Meta didn't focus that much on reasoning or on code, because they're a
00:28:12.200 | social company. So, maybe reasoning is not as super important for them. But then they hacked
00:28:17.960 | focused coding data collection session and shared a big code model. Which kind of wasn't that great.
00:28:27.640 | Maybe if you don't put the data in from the beginning, just trying to fine tune on code
00:28:33.320 | by itself doesn't work that well. The other thing I wanted to share, can you see this other page?
00:28:42.200 | Now? Basically, it seems they spend quite a bit to make their coding much better in LLAMA3.
00:28:51.080 | And they actually train the code experts and then try to use that code experts to maybe,
00:29:00.280 | I guess, collect high quality human annotations and do some more post-training. And then they
00:29:09.880 | also did some synthetic data generation to improve coding. So, I think they spent quite a bit to work
00:29:18.280 | on reasoning and coding. I didn't read this section carefully, but yeah, they have a full
00:29:21.800 | section on trying to get better code data to generate, to incorporate feedback and do analysis.
00:29:30.200 | They did quite a bit on coding. Yeah. >> Yeah. There's two sections there.
00:29:39.240 | One is the synthetic data gen with coding, and the other is the pre-trained mix of their code
00:29:46.120 | and reasoning sample where they trained a second classifier. So, one of the takeaways there was
00:29:53.160 | when you're doing pre-processing of 15 trillion tokens, you actually can't just run inference.
00:29:58.280 | Even Meta with all the GPUs they have, they couldn't afford to just throw LLAMA3 inference as
00:30:05.320 | this whole 15 trillion token set. So, they trained a code and reasoning classifier on Distal-Roberta,
00:30:11.560 | which is a small original encoder-decoder transformer to try to annotate out their
00:30:17.400 | web scrape data for quality and whatnot. So, they have it both there in the pre-training set and in
00:30:23.640 | the synthetic data gen. There's a really good quote tweet that went on about all this code gen.
00:30:30.440 | It's by Eugene. I will share screen and throw him on the stage if he wants to talk about it.
00:30:37.400 | >> Yeah. Thank you. I'm currently commuting, but I'm finding a quiet space right now.
00:30:46.920 | All right. Great. Thank you, Vibhu. Yeah, I can talk to you, I think.
00:30:49.880 | Can you hear me fine? >> We can come to it in a few minutes.
00:30:52.920 | If you're commuting, we can come to it in a bit. >> No, I'll be commuting for a while. I'm walking
00:30:58.040 | to the stadium right now, team event, but I'm finding a good space to sit. Okay. So, I think
00:31:03.320 | what really stood out for me in this paper was that how much automation and augmentation was there,
00:31:09.240 | right? In the first one, you can see they actually use Lama2 to filter out bad data, right? And this
00:31:17.560 | is in the pre-training step. So, essentially, what they're saying is that, "Hey, we trust Lama2's
00:31:21.880 | judgment well enough to be able to do that." And if you scroll down, next slide.
00:31:26.840 | And then over here, you can see that they actually trust Lama3 to do tag intention. They actually
00:31:35.560 | tag the generated data or the responses based on intention, and they also classify things based on
00:31:43.720 | difficulty, right? And the thing is, they actually adopt some kind of curriculum learning where they
00:31:50.360 | start with a single shot prompt or rather a single turn prompt and response, and then after that,
00:31:56.520 | they move on to multi-turn. Next slide. And then after that, and this is a code expert that
00:32:05.640 | everyone's been talking about, right? So, what it means is that in order to get Lama good at code,
00:32:11.080 | as an intermediate step, they had to train a code model. And that sounds quite crazy, right? I mean,
00:32:19.320 | for me, I mean, sometimes training such large models just seems to take so much effort to
00:32:25.560 | curate the data, to set up the info and everything, but it seems completely essential in this case.
00:32:33.080 | They could not have done it without that. And Andrej Karpaty had a great tweet about this,
00:32:36.840 | whereby every model distillation and synthetic data generation is really now a stepping stone
00:32:45.000 | for the next better model. Next, please. And then the same thing here is, okay, and of course,
00:32:52.760 | here, this is just an example of how much we trust the synthetic data, right? The model was prompted
00:32:59.400 | to generate problems, then solve each problem, so I'm focusing only on the green highlights here,
00:33:04.440 | solve each problems, and then they give the model the errors, and then they ask the model
00:33:09.000 | to solve the errors, and then the model also generates the unit tests, which they then use
00:33:13.240 | to evaluate the generations on the unit test itself. It's like, you see that the human is
00:33:19.160 | very minimally in the loop. And then if we move on, and you see this pattern everywhere, like
00:33:26.280 | multilingual, you didn't hear me talk about it. One thing that's interesting here is that they
00:33:31.000 | generate, they use Lama to generate data for target capabilities, and then they back translate it into
00:33:37.000 | doc strings and comments. So that's how they can teach the model to explain code. And then they
00:33:43.000 | use those tweets and comments, those doc strings and comments to actually create code again. And
00:33:47.800 | then we're going to go through the rest really quickly. It's like multilingual, the same pattern
00:33:52.360 | here, math and reasoning, the next one, you see it's the same pattern, whereby the model actually
00:33:57.400 | augments the training data with the step-by-step. So one thing that's really interesting here,
00:34:02.040 | in the sense that they actually went the extra step, no pun intended, to actually train step-wise
00:34:08.360 | reward models. That's kind of crazy, no? I mean, they wanted each step in the chain of thought to
00:34:14.440 | be so good that they actually took the extra effort to train step-wise reward models, which
00:34:19.240 | they then combined with Monte Carlo Tree Search to improve the reasoning traces. And then you see
00:34:25.880 | synthetic data for long context, it's the same pattern, Q&A. And then as you scroll down,
00:34:33.400 | you see synthetic data for image captioning and synthetic data for factuality. Like factuality,
00:34:39.800 | essentially all of it is just synthetic data, if you look at this. I think time will tell whether
00:34:44.840 | this really works out well or not. I think we're still too early on the evals. And then you see
00:34:49.160 | synthetic data for adversarial examples, synthetic data for the image encoder training, where they
00:34:54.520 | use image captions, and then they augment as existing datasets with new instructions and
00:34:59.160 | responses. And what's really interesting here, the second last tweet, is that the human annotators
00:35:05.960 | were actually augmented with model in the loop, right? And if you think about it, this slightly
00:35:15.640 | represents a shift in how some folks are thinking about it, right? I mean, a lot of people is like
00:35:19.640 | thinking human in the loop, but no, now it's model in the loop, whereby you use the model
00:35:27.720 | to create an initial generation that the human then can edit, and it just makes it so easy for
00:35:32.760 | the human, right? And then the one big takeaway from all of this, and that's what I had from this
00:35:39.160 | paper, but the one big takeaway from all of this is that, can you imagine how much META had to
00:35:47.240 | educate and upskill their SDEs or their existing scientists to use this new technology to be
00:35:53.480 | trusting of it, and the annotators to trust the new technology and to just work based on that.
00:35:59.080 | So that was quite eye-opening for me, and I think it sort of suggests the
00:36:02.840 | long-term, here's where the puck is heading. And that's all I had. Thank you.
00:36:07.880 | Awesome. Thank you, Vips.
00:36:11.400 | Eugene coming in clutch with, I threw him on the spot, he's commuting and already had
00:36:15.800 | slides and tweet thread. But yeah, what other topics have we got? I've got the chat.
00:36:22.760 | Did they mention using chain of verification prompting, Eugene?
00:36:29.800 | Do you mean chain of thought prompting or chain of verification where they try to verify the chain?
00:36:39.320 | The latter.
00:36:41.400 | Okay, I don't think they actually did that, but they did mention they had stepwise reward models
00:36:48.040 | that actually checks every step in the chain of thought. But I don't recall
00:36:53.720 | seeing chain of verification. Sorry.
00:36:55.800 | Okay, thank you.
00:36:57.320 | Welcome.
00:36:58.860 | Eugene, like early last year, there was tree of thought and some iterations with Monte Carlo
00:37:07.240 | search and this tree of thought stuff. But at that point, LLMs weren't good enough to verify
00:37:15.160 | or provide enough signal for this multi-step reasoning things to happen and things in the
00:37:20.040 | loop. Do you know, do you have some idea how they solved it or why they were able to make all this
00:37:25.400 | progress? Basically use all the tricks we were reading about maybe half a year, a year ago,
00:37:30.840 | but it seems they actually got them to work. So yeah, I wonder what made it work.
00:37:36.600 | Yeah, I don't know. I'm very interesting. I'm very curious about that as well. I wish there
00:37:41.480 | was more papers showing how to use Monte Carlo tree search and actually get it to work and share
00:37:47.880 | more details about it. I'm afraid I haven't seen too much of that in this current paper.
00:37:51.720 | Is it mostly for coding that they employed this or for other tasks as well? Because for coding,
00:37:58.440 | you could signal back some reward, but for other things, like I don't know how you evaluate things
00:38:03.720 | and propagate information and validate the chains of that. Yeah, if I recall correctly,
00:38:10.360 | it was actually in the math and reasoning section. So whereby they actually use stepwise
00:38:15.640 | reward models to evaluate each step, to score each step in the chain of thoughts,
00:38:19.480 | so that the final output gets better.
00:38:23.640 | Thank you. I'll look into it more. Yeah, it's the math. Lightman et al was the citation.
00:38:37.400 | And I guess I'll wrap up with one final thing. I'm sorry, it's a bit noisy. I think
00:38:42.200 | Switzer's Latent Space podcast with Thomas, he really goes really deep and he has a strong
00:38:50.920 | opinion on synthetic data. I think listening to that podcast will give you a lot more insight
00:38:56.120 | into how meta is really embracing synthetic data. So I found that podcast quite helpful.
00:39:04.680 | And this was the Karpathy tweet about synthetic data.
00:39:09.960 | Also, yeah, great podcast. I think that's the one. Exactly, wait a minute,
00:39:20.680 | I think that one thing in the sense that everything is a step for the next one.
00:39:26.120 | No, not this one. It was actually a tweet about smaller models,
00:39:33.080 | about how the competition for smaller models is going backwards, but buried in there.
00:39:37.240 | If you scroll down a little bit more. Yeah, this one. Yeah, exactly. You can see the models have
00:39:45.320 | to first get larger before they get smaller, right? And the three-line paragraph. And then
00:39:51.480 | it's a staircase of improvement where one model helping to generate training data for the next.
00:39:55.640 | It's almost like he had read this paper up front and he was alluding to that. I don't know.
00:40:01.320 | Yeah, it's pretty interesting to see. Also, this was a tweet that came out even before,
00:40:06.200 | but very much the small model distillation work, it's pretty huge. And that's, I think,
00:40:11.640 | the big part of the license play of this too, where they did actually finally change their
00:40:17.000 | license to allow people to generate synthetic data, train on outputs of the 405B. I think the
00:40:23.480 | 405B is a little overhyped for just using it for inference. When it comes to inference
00:40:28.520 | generation and cost-effectiveness, make sure if experts are pretty efficient. They use less RAM
00:40:36.520 | at inference time and they're more cost-effective for just the best quality. But then this is really
00:40:43.400 | valuable for synthetic data gen, for filtering, stuff like that. And that's what I see more of it.
00:40:49.080 | I know, Sean Swicks, you also had a pretty good write-up about this.
00:40:56.920 | Sorry about that.
00:40:58.920 | Or any other about how this is a synthetic data gen model or any other topics we want to dive
00:41:05.480 | into. Also open to everyone else that's in the call too. If anyone had anything interesting that
00:41:10.600 | they want to dive into on the paper, pop in, share your thoughts.
00:41:13.480 | I think Sachin's hand has been raised up for quite some time.
00:41:17.480 | Yeah, Sachin, go ahead.
00:41:18.360 | Yeah, so this is for Eugene and Vibhu also. So we saw SONET actually take over some time back.
00:41:27.240 | And there's still some, what do you call, gap to cover. So does SONET have
00:41:31.240 | some other tricks in their back which is getting them that higher up?
00:41:35.240 | I know someone's working on a write-up about this.
00:41:42.920 | No, no, no. I abandoned the idea. Yeah, so they never published anything about
00:41:52.840 | what tricks they use. But the evidence strongly points to the fact that they use the steering
00:41:57.240 | vectors that they had from the scaling monosemanticity paper. The main evidence is
00:42:04.760 | that they happened to do this monosemanticity research on SONET.
00:42:13.080 | What are you guys doing, you little freaks?
00:42:13.880 | SJ is constantly screwing up his mic. They did it on SONET and obviously they only shipped
00:42:21.640 | 3.5 SONET. That's like the smoking gun. If they actually had anything else, any other trick
00:42:28.440 | that caused 3.5 SONET to be so good, they probably would have deployed it on Haiku and Opus as well.
00:42:35.560 | The fact that they don't is proof positive that it's basically the monosemanticity stuff.
00:42:40.520 | Does that answer your question? Do I need to explain what that is?
00:42:43.240 | I have a hard time.
00:42:45.240 | No, it does.
00:42:45.740 | I think it's not. I think it's not control vectors.
00:42:50.920 | Yeah, why?
00:42:53.000 | They could be. They did say it's a larger model. If you look at the training data and the training
00:42:58.520 | date for when Cloud 3 SONET came out to 3.5 SONET, it also has a year and a half of significant data
00:43:05.320 | updates. I think that there was a lot of research that put out on high-quality synthetic data,
00:43:10.360 | the post-training mixture of it. SONET probably just had a decent bit of... There's a lot more
00:43:17.320 | that they could squeeze out of it. Also, they did say it's bigger. A lot more research and good
00:43:24.280 | quality synthetic data, the pre-training data mixture started to come out. It's bigger. I think
00:43:30.920 | it was just a lot more post-training as well, because there was quite a bit of time...
00:43:35.800 | In post-training, there's a... Sorry, Vibhu. In post-training, in SONET, you can see there's
00:43:41.080 | this pause, thinking pause tokens, where sometimes it generates a token and it does internal thinking
00:43:47.240 | and then generates the answer. It seems like they used some recent tricks where people say, "Hey,
00:43:53.640 | you need to think step-by-step, but maybe not materialize directly in the answer, the
00:43:58.520 | step-by-step thinking." Sometimes when you run SONET generations, you can see there's some...
00:44:05.160 | It stops in the middle and people saw that those are actually pause for thinking and
00:44:11.080 | side thoughts. It seems that really helps with reasoning tasks quite a bit. That's one additional
00:44:18.840 | trick that they use. I agree that it's been one year of work, so they probably have lots of tricks
00:44:26.200 | in there, not just one, two, three. Much like in the Lama paper, you'll see that it's hundreds and
00:44:32.120 | hundreds of people, still less than a thousand, but yeah, it's like a Manhattan project to build
00:44:37.880 | one of these things. I actually counted the number of people. Lama 3 had 200 core contributors,
00:44:43.960 | which is good. It's pretty small. It's less than Gemini, which had 950.
00:44:49.080 | Sebastian says, "What does thinking mean?" Okay, here's where we get philosophical.
00:45:01.480 | My quick take on that is this used to be a thing with the original ChatGPT web UI stuff of why is
00:45:07.880 | it pausing at stuff. I think some of this is also just the way the API works. What's the inference
00:45:14.120 | it's running on? How's the streaming? What's the API service like? Sometimes when there's blocks,
00:45:19.240 | it's not that something else is going on. It's just that there's a delayed stream of your API
00:45:23.640 | response and sometimes people overanalyze that. Is it that it's thinking? Is it what's going on?
00:45:29.560 | Maybe, maybe not. I think for the Sonnet's case that some users have already used, I guess,
00:45:38.440 | prompt injection to trick it into instead of doing the XML thinking block and to use it to
00:45:43.640 | output it in a different format, then you literally get to see the thinking process.
00:45:48.120 | Yeah, so for what it's worth, I went to iClear and interviewed the pause token author. It's on
00:45:58.040 | the iClear episode if people want to check it out. I do not think that Cloud specifically
00:46:03.640 | implemented that version of thinking. I think it's much simpler. It is just chain of thought. It is
00:46:08.680 | just prompted XML chain of thought that is then subsequently post-processed and removed inside
00:46:14.920 | of Cloud Artifacts. Yeah, but it's still a form of thinking. It's a form of chain of thought.
00:46:20.520 | It definitely improves the performance. Right. Sorry. Yeah, that's what I was aware of.
00:46:27.000 | I'm sorry. I missed what Eugene mentioned for an alternative to
00:46:30.680 | what you described as a chain of thought that is not presented to the user.
00:46:34.520 | Eugene? Yeah, so instead of creating a custom token, which is the pause token concept,
00:46:45.320 | they literally just got the model through prompting, got the model to reply with a thinking
00:46:50.920 | XML block, which then you can trick it through prompt engineering to substitute the tokens
00:46:55.400 | respectively. Then suddenly this technique becomes very blatant when you get to see it
00:47:00.840 | respectively because it's no longer hidden from the UI. Yeah, I think also on a separate line,
00:47:07.880 | because since a lot of people are looking to the eval and then they're like, "Hey, some of these
00:47:11.800 | evals are doing worse than, let's say, 4.0," or things like that, the fact that it's already
00:47:17.640 | closed itself means that if you're just a few weeks away until every single benchmark, you're
00:47:24.520 | going to see a point jump because someone fine-tuned a code-specific version of the L3 model
00:47:31.480 | or a medical reasoning-specific version of this model. It's going to take slower than normal
00:47:39.400 | because I spoke to some people in the fine-tuning community. The biggest hurdle has been, "What do
00:47:45.080 | you mean you need at least three nodes of H100 to start the process?" Yeah, the amount of VRAM
00:47:51.960 | requirement is kind of huge. I suspect we are going to see more Loras first before we get
00:47:56.680 | full fine-tunes. Also, the most random part, I know Meta did this for good reasons because
00:48:04.280 | they basically did a lot of no-keyword filtering from the sources, but a lot of people in the AI
00:48:10.760 | companion space, they were like, "No!" basically. Yeah, it makes sense. I'm going to look into that.
00:48:20.120 | Thanks. Do we have more things on the paper? I mean, there's more to discuss. I feel like
00:48:31.560 | everyone's being too polite. There's a lot of new scaling laws they brought up. They had a whole
00:48:38.280 | recipe for post-training, how they did it, how they did their SFT, how they did... They also
00:48:44.040 | released both the base and the instruct models, how much of this was done by synthetic data, how
00:48:51.000 | they train their image video adapters, all that stuff for multilingual stuff. They give out a
00:48:56.520 | whole recipe on how to do this. It's a long, long read, but for anyone that hasn't read a lot of
00:49:02.760 | papers, this is also probably a really good one that's very approachable, very readable,
00:49:07.640 | and not too crazy technical, one to at least understand what's going on. They go into some
00:49:13.320 | of their evals on their multimodality, how their adapters work, how it performs, and they're like,
00:49:18.680 | "Yeah, it's pretty good." They added speech into speech understanding, how to train a speech
00:49:24.760 | encoder, how many hours of recording they used, how they filtered it. They go through literally
00:49:31.160 | all of this. This is probably where you could have an hour on this paper. That's every step
00:49:36.920 | of it. But it's an interesting one where, yeah, they do go into all that data set,
00:49:41.640 | how they transcribed it, just little, little stuff too. In their speech understanding section,
00:49:47.240 | there's a section on, "Our ASR training data contains 230,000 hours of manually transcribed
00:49:55.080 | screech recording that spans 34 languages." Just a little one line of, "We casually manually
00:50:03.000 | transcribed 230,000 hours of 34 languages of speech, and we're just training a little adapter
00:50:09.800 | for this that we're not releasing." They put a lot of work into that. Then it goes even deeper
00:50:15.240 | into, "How do you use this for pre-training? What about spoken dialogue? How do we fine-tune?"
00:50:21.400 | What's the recipe for a speech adapter in an LLM? Yeah, we did a lot of pre-processing
00:50:27.160 | to the base data set of manually transcribe a lot of speech recording, have multi-languages,
00:50:32.520 | train it out. Here's the speech length segments that we want. Then we fine-tune this adapter for
00:50:39.400 | spoken dialect. How do we do that? Well, we synthetically generate responses for prompt.
00:50:44.760 | We ask for transcripts. We generate them. They generate 25,000 hours of speech synthesis through
00:50:52.360 | voice box, which is a whole other series that Meta has put out around everything,
00:50:57.560 | how they do voice generation. They have a whole really good breakdown paper of that, how they use
00:51:07.800 | that to generate model, to fine-tune, and generate synthetic data for this. There's a lot in here,
00:51:13.160 | if anyone's interested in. A lot of that doesn't make it to Twitter, but dig in, present it.
00:51:19.640 | Architecture stayed the same. I thought the interesting parts were also just like,
00:51:30.840 | they want to keep it very foundational and see what works and what they can just scale up.
00:51:37.400 | The second aspect of that is it'll be fun to see when they start cooking with actual MOEs,
00:51:43.880 | how do we go outside the box. It's nice to have clarity on their scaling laws. I've definitely
00:51:53.480 | presented too many times that they just scaled up and prayed, and they took an AP to 15 trillion,
00:51:58.120 | and they were very inefficient. Other papers, like 5.3 took a big shot at this. 5.3's whole
00:52:04.680 | paper is about how we had Chinchilla scaling. Then we have this inference optimal Lama scaling,
00:52:11.640 | and then here's how you could do what we think is right. Now, Lama puts out a paper and they're
00:52:17.240 | like, "No, this is actually all based. Here's new scaling that's not just next token prediction.
00:52:21.960 | It's grounded on reasoning. Here's how scaling laws work. Here's how you can use it. Here's
00:52:26.680 | why we did it." It's going to make sense. The scaling part, I found it interesting and funny
00:52:34.200 | that they were using ARK as a measurement for the scaling training. One of the realistic that I had
00:52:41.480 | in my head, it was like, "So Facebook spent over $100 million at the ARK challenge to try to win
00:52:46.760 | the million-dollar prize." I think it's a different ARK, right? If I'm not mistaken. It's not the
00:52:52.840 | same as the million-dollar. It's not the same data set. Yeah, but it's still in the line.
00:53:03.240 | Yeah, a lot of good stuff. In the five to seven minutes we have left, I wanted to give some time
00:53:09.560 | to, I guess, Hassan. Hassan's actually built an app that's kind of cool with Lama3U.1. Maybe
00:53:17.720 | there's something to learn about prompting it, building with it, anything surprising.
00:53:22.680 | Yeah, thanks, Sean. Hey, everybody. I just want to talk about this app that I built real quick.
00:53:30.760 | Definitely a lot less technical than we're talking right now. This is dropping all the way down to
00:53:35.960 | the half layer. I guess it is all about building, but I just built this little app. It uses a search
00:53:44.600 | API to put in whatever topic you want to learn about, like quantization, and it can explain it
00:53:50.120 | to you at kind of any level you want. So, let's learn about quantization at an elementary level.
00:53:55.080 | So, it will basically use a search API, grab all these sources, and put it into context,
00:53:59.240 | because obviously Lama3.1, larger context, you can fit in a lot of stuff, which is great.
00:54:03.480 | And I've noticed that it's pretty good at kind of dumbing down concepts, but also
00:54:08.600 | responding to the prompts a little bit better. And almost responding, like if I set a system
00:54:15.400 | prompt that kind of details how it should behave over the next few messages, I found that it
00:54:21.080 | responds a little bit better. Like, for example, for this system prompt, I had, like, make sure
00:54:26.600 | you make the overview really short, because when I was testing on just Lama3 and on other platforms,
00:54:33.160 | it was giving me a really, really long initial answer. And so, I want it to give a really short
00:54:38.920 | overview, but at the same time, I want it to be detailed in the future. And I also want it to
00:54:41.960 | include quizzes at certain times. And so, I just found that it's a little bit -- it was a little
00:54:46.440 | bit smoother at kind of -- at responding. So, yeah, here it's going to, you know, actually try
00:54:52.440 | to give me a quiz and try to just be interactive and kind of teach me a subject at any level that
00:54:58.680 | I want. So, it's fully open source. LlamaTutor.com, it's definitely open source. I misspelled this.
00:55:09.400 | >> Did it give an example of quantization math?
00:55:12.600 | >> I don't actually know over here. I think it's talking about --
00:55:18.760 | >> I just got into a debate on a call where someone was making an obvious error in
00:55:23.720 | quantization math and how much the memory before versus after. I can send this to them.
00:55:31.480 | >> I'm not sure if it'll -- I know it does formulas as well, yeah, so it'll put some -- I
00:55:35.160 | should probably format these a little bit nicer. >> That's awesome, though.
00:55:38.760 | >> Yeah, thank you. It has -- yeah, I got about 4,000 visitors, and if anybody's curious about
00:55:45.880 | cost as well, so about 6,400 requests. I had some errors because I was playing around with -- I was
00:55:52.280 | hitting the context limit and a bunch of other stuff. And also, you know, using the together API.
00:55:57.160 | Very biased. I work over at Together. But the cost, for anybody curious: so we have about 12,000
00:56:04.920 | average tokens per request. We have about 5,900 successful requests. And so if you do the math,
00:56:10.200 | that's like 74 million tokens. And if I go over to our pricing, right now we're at 18 cents
00:56:17.240 | per million tokens for 8b 3.1. So that comes out to about $12 from the 72 million tokens that I
00:56:25.400 | used in the last 24 hours from these 4,000 people and 6,000 requests. So that's all I have.
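As a sanity check, the arithmetic quoted here works out to roughly the cost Hassan mentions (treating the 12,000 average tokens and $0.18 per million tokens as given, not independently verified):

```python
# Back-of-envelope check of the cost math above, using the numbers as stated on the call.
successful_requests = 5_900
avg_tokens_per_request = 12_000        # prompt + completion, as quoted
price_per_million_tokens = 0.18        # USD for Llama 3.1 8B, as quoted

total_tokens = successful_requests * avg_tokens_per_request
cost_usd = total_tokens / 1_000_000 * price_per_million_tokens
print(f"~{total_tokens / 1e6:.1f}M tokens -> ~${cost_usd:.2f}")   # ~70.8M tokens -> ~$12.74
```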
00:56:32.680 | >> And so, wait. Wait. Can you show -- did you put a link in the chat? Or can you type --
00:56:41.200 | >> No, I'll do that. Yes, I'll put in a link to the LlamaTutor.
00:56:47.000 | >> Oh, LlamaTutor, okay. >> Hassan, LlamaTutor.com.
00:56:50.280 | >> Hassan, I have a question at a high level. Is this using some sort of, like,
00:57:02.440 | have you ever used the GPT's actions? Is it using a system similar to that?
00:57:05.960 | >> I actually haven't used actions. Can you tell me about that?
00:57:09.800 | >> So actions is where you input an API spec and the model itself can make the calls -- the model
00:57:17.640 | itself can decide to make the calls to that API, provided that API spec.
00:57:21.640 | >> Oh, so like function calling, basically. >> Yes.
00:57:25.960 | >> No, I'm not using -- yes, I'm not using it on this app. This was, like,
00:57:31.640 | the most simple example I could do, where I give it a system prompt, I do this API call for search,
00:57:36.920 | I parse all the sources, and I just drop everything. And I'm like, hey, build something
00:57:41.000 | interactive. So this is kind of step one. There's obviously so much more I want to do. I want to try
00:57:45.480 | to do generative UI, maybe, with the Vercel AI SDK, where I show -- for the quiz example,
00:57:52.200 | I show an actual quiz component that renders. I can, like, generate a little, like, report out
00:57:58.840 | of all the sources and have, like, read this, you know, like, you can read this to check it out, or
00:58:04.360 | flashcards, or, like, I feel like there's a lot of directions to go with it, but I kind of just
00:58:08.040 | wanted to build something really, really quickly. I just built it over the weekend. We got early
00:58:12.760 | access to 3.1, because we were a launch partner with Meta. So kind of just playing around with it
00:58:19.160 | and try to build something really quick over a weekend. >> Oh. Okay. I missed that detail. That's
00:58:23.880 | really cool. >> All right. Thanks. >> So you're doing the retrieval. >> I'm curious if you could
00:58:29.320 | share what does it take to serve a 405B model all together? >> What does it take to serve 405B? So
00:58:40.040 | we're serving -- what is it? FP8. It takes eight H100s per instance.
00:58:45.800 | Yeah. But we're looking into quantizing down to int4 and trying to serve on four H100s. But
00:58:54.440 | in progress. >> The math is roughly 1 billion parameters to 1 gig of VRAM, right? So at FP8, yeah, 1 to 1. And then
00:59:05.720 | you could scale that out. There's pretty good resources on this. I feel like we've shared them
00:59:08.760 | in Discord. And then someone asked about quantization stuff. Together put out a pretty
00:59:13.880 | good blog post about this, like why you're now running quantized models. It's somewhere
00:59:18.280 | in the Zoom chat, and we'll probably share it in Discord, too. Good resource on all that.
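To make the serving math concrete, here is a rough VRAM estimator built on the "1 GB per billion parameters at FP8" rule of thumb from the discussion. The 1.2x overhead factor for KV cache, activations, and runtime buffers is a hypothetical placeholder; real deployments vary a lot with context length and batch size.

```python
# Rough VRAM estimate for serving at different quantization levels (sketch only).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_billion: float, precision: str, overhead: float = 1.2) -> float:
    # 1e9 params * bytes-per-param is roughly GB of weights; overhead covers KV cache etc.
    return params_billion * BYTES_PER_PARAM[precision] * overhead

for name, size_b in [("8B", 8), ("70B", 70), ("405B", 405)]:
    estimates = ", ".join(f"{p}: ~{estimate_vram_gb(size_b, p):.0f} GB" for p in BYTES_PER_PARAM)
    print(f"{name} -> {estimates}")

# 405B at FP8 ~ 486 GB, which is why it lands on 8x80GB H100s; at int4 ~ 243 GB,
# which is why 4x H100 becomes plausible; 70B at int4 ~ 42 GB fits a dual-24GB rig.
```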
00:59:22.600 | Also the paper has a section on inference, of course. So how do you run a 405B? What does
00:59:30.120 | that look like? What's the efficiency of that? They have a whole section on this. They have sections
00:59:35.960 | on previous Meta work on how to get more token efficiency, like multi-token prediction as
00:59:41.480 | the pre-training task, and other parts of what you might see. Meta's done a lot of work
00:59:48.200 | on this. But overall, that's kind of a high-level overview and our thoughts on the paper. If anyone
00:59:54.040 | else has questions, we can discuss them. Otherwise, next week -- we were supposed to do
01:00:00.600 | WizardLM and Orca 3, heavy synthetic data gen stuff, this week, today. We pushed that to next
01:00:07.400 | week. Sorry, this was not super prepped. Basically, less than 24 hours ago, we decided
01:00:15.560 | let's switch this paper club. So we didn't have crazy slides. But next week, we'll have stuff for
01:00:20.040 | Orca 3 and WizardLM. Same paper club, same time. And then at some point, I might do a deeper dive
01:00:28.760 | into this, a proper one-hour breakdown of all this. If anyone's interested, I'll share somewhere.
01:00:33.960 | I'm curious about the quantization thing and whether -- if you have the choice of a larger --
01:00:42.040 | sorry, more parameters, but worse quantization, how do you figure out the trade-off between that
01:00:48.200 | and running, say, the 70 billion model without quantization versus the 405 with quantization to
01:00:56.200 | make it down to the same size? >> People generally do benchmarks.
01:01:00.520 | These are already happening right now. A lot of the communities are trying to figure this out.
01:01:08.920 | For anyone who wants to try 4-bit quantize, you can actually run it on 16
01:01:17.240 | 4090 GPUs if you're doing 4-bit quantize. I'm already seeing folks doing that. Whether that's
01:01:27.560 | a good idea, whether the model will get dumber or not is something we'll probably find out in
01:01:31.800 | the next few days. >> At a high level, though,
01:01:35.320 | the larger ones see less of a hit when quantizing than the small ones. r/LocalLLaMA has a lot of
01:01:41.880 | comparisons and benchmarks of different quantizations. Basically, a lot of the prosumer
01:01:47.800 | rigs are dual 4090 or dual 3090 systems. You've got about 48 gigs of VRAM. With 48 gigs of VRAM,
01:01:55.400 | you can run a 70B at 4-bit quant. A lot of people will spend $3,000 on a local rig. They can run
01:02:05.080 | a 4-bit 70B, or they'll look at how does that compare to a 34B at higher precision, or a single
01:02:11.560 | GPU where you're running an 8B. There's solid breakdowns of benchmarks of these. This is where
01:02:18.280 | the Reddit people are doing good work, as opposed to when you look at it from the infra provider side
01:02:25.800 | of what the benefits of quantizing are. There's also efficiency in QLoRA fine-tuning on quantized
01:02:32.520 | models. Benchmarks-wise, the TLDR is bigger models take less of a hit, smaller models take
01:02:38.760 | a bigger hit, and then there's inference speed. Alex, do you have thoughts? >> I have a question, actually.
01:02:46.360 | I don't usually have questions, but this time I have a question about effects of quantization.
01:02:50.520 | In your guys' experiences, what are the most visible effects of quantization? What comes
01:03:00.040 | to mind when you see that the model is quantized, and how quickly do you realize, "Oh, this is the
01:03:06.760 | effects of quantization"? >> It's just dumber, that's all. >> Dumber in any specific areas? Is it
01:03:12.920 | dumber knowledge-wise, is it dumber logic-wise, or any specific areas? Or is it overall dumber?
01:03:18.440 | >> One thing that I've seen from a lot of community members is that when you over-quantize,
01:03:25.800 | at longer context lengths, it starts going into repetition
01:03:28.920 | rapidly. So that's the "we have gone too far" line. >> Yeah, in a lot of the one-bit
01:03:40.120 | quants, as people ineffectively quantize stuff, it starts to just go into pure chaos. So it
01:03:46.280 | becomes stochastic random next-token prediction. The little trade-offs that you see at what type
01:03:52.280 | of quantization you want to do, it starts to perform worse at chain of thought, at long context,
01:03:57.640 | at needle in the haystack. Some of those things start to degrade earlier on in quantization,
01:04:03.320 | as opposed to pure performance. And then, yeah, sometimes it's just dumber. It's just worse on
01:04:08.200 | some of the trivia QA-type benchmarks, which is interesting because it's worse on reasoning
01:04:15.560 | benchmarks, but decent on trivia. So trivia is about internal knowledge: what does
01:04:20.360 | it already know? It doesn't lose facts, but it loses reasoning, it loses verbosity of responses
01:04:26.760 | and whatnot. So it degrades in that sense. But then you do have benefits of, yeah, it's
01:04:33.400 | significantly more efficient to run, right? Four-bit versus eight-bit, you can run in half
01:04:37.640 | the hardware, you get speed-ups, you can fine-tune more efficiently. So there's trade-offs. Though
01:04:43.080 | the one interesting thing I'll note there is if anyone saw the Apple LLMs at their WWDC stuff,
01:04:49.240 | they put out an Apple research blog on how they have all their local LoRA swaps, and they dropped
01:04:55.800 | in a little paragraph where they're like, "We tested our models at a net of one or two-bit
01:05:02.520 | quantization, and we saw no performance drop compared to full precision." So they're basically
01:05:07.720 | like, "Yeah, we benchmarked it, we got lossless quantization, no details, no nothing." But
01:05:14.520 | they seem to have stated they figured it out, which, you know, skeptical, but
01:05:18.840 | they're running quantization for on-device. Yeah, but on specific tasks for them.
01:05:23.640 | It's also going to be very... Sorry. It's also going to be task-specific. We have seen
01:05:30.440 | weird edge cases where when you quantize a model one step, for certain tasks it actually improves,
01:05:37.240 | rather than degrading all the time. That's true, too.
01:05:42.760 | I'm writing an eval now between cloud providers for 3.1 70B just to see if, like, you know,
01:05:49.320 | Together's FP8 versus, I don't know, Fireworks or something like this has a difference, or versus
01:05:56.760 | Groq. Have you seen Artificial Analysis? Oh yeah, I think I saw something.
01:06:06.360 | These guys are pretty good. Yeah, but they don't do quality,
01:06:10.520 | they just do reported numbers, I think. But they do compare inference providers.
01:06:17.160 | Oh, they're not doing their own benchmarks? I thought they have...
01:06:19.800 | They're doing pricing and they're doing speed. I don't think they're doing quality.
01:06:23.560 | Yeah, they're not doing benchmarks. Yeah.
01:06:25.880 | Groq is much more difficult to benchmark, to be honest.
01:06:29.320 | So, I'll add to Alex, like, what he was asking. I know credible reports of people where they're
01:06:41.080 | saying between Groq and some other providers the quality is different, in the sense that, everything else
01:06:48.280 | remaining the same, the inference engines are giving wrong answers. So, I'll just leave it at
01:06:52.920 | that. I mean, one thing I can tell you right now that I noticed is that Groq at temperature zero
01:06:59.560 | returns different responses every request. So, I don't know if temperature zero actually works
01:07:03.720 | there. Yeah, that has been... A lot of people have said that to the CEO also.
01:07:16.200 | Isn't temperature zero supposed to be the most deterministic instead of...
01:07:21.800 | Depends on the framework, to be honest. For opinion, I think temperature is zero.
01:07:29.400 | Well, theoretically, it's supposed to be the most deterministic. I would expect, based on
01:07:35.240 | what temperature is supposed to do, that it would be the most deterministic.
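For anyone following this exchange, here is a minimal sketch of what temperature does to next-token sampling and why temperature zero is conventionally treated as greedy (argmax) decoding. The logits are toy numbers, and real serving stacks may handle the zero case differently; as the discussion below gets into, even greedy decoding can drift across requests if the underlying logits change slightly.

```python
# Toy illustration of temperature scaling and greedy decoding (not any provider's code).
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    if temperature == 0.0:
        # Commonly special-cased as greedy decoding: always take the highest logit.
        return int(np.argmax(logits))
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # numerically stabilized softmax
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([2.0, 1.5, 0.3, -1.0])
rng = np.random.default_rng(0)
print([sample_next_token(logits, 0.0, rng) for _ in range(5)])  # always token 0
print([sample_next_token(logits, 1.0, rng) for _ in range(5)])  # usually a mix of tokens
```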
01:07:42.200 | Yeah, I guess... Oh, go ahead, Vibhu.
01:07:46.200 | So, sometimes there's little reasons, like, when they're running inference, right? They might run
01:07:51.160 | it on different hardware. So, like, maybe there's a server in West Coast, East Coast that's different
01:07:57.160 | GPUs, different CUDA kernels, different drivers, and some of that affects how you do rounding,
01:08:01.720 | how you do next, like, what... Yeah, basically, how is inference being done? So, you see slight
01:08:06.280 | variation. Then, across inference providers, this is their, like, secret sauce, right? Are they doing
01:08:11.240 | self-speculative decoding? Are they doing speculative decoding? How are they doing it? So,
01:08:15.240 | like, all these little things have differences, but there's also the reason of temperatures,
01:08:20.360 | like, yeah, different hardware, different drivers, different all that. So, some of that answers it,
01:08:25.240 | but I don't know, Eugene, maybe you have more to add. I was just going to say that for GPUs,
01:08:31.080 | floating point, just... I mean, for GPUs with floating points, when you push it through so
01:08:35.720 | many calculations and so many matmuls, the floating points just aren't going to be precise. So,
01:08:41.880 | that's why even if temperature is zero, it's not going to be the same throughout for multiple
01:08:46.760 | requests. Yeah, even if it's, like, the same GPU on the same request, the order you do the
01:08:54.120 | reduction on the matmuls, as Eugene mentioned, will affect the results because floating
01:08:59.400 | point addition is not associative. You can try this in Python: add 0.1 plus
01:09:05.720 | 0.2 and you get 0.30000000000000004, not 0.3. So the order you do the additions and multiplications in
01:09:13.640 | can affect the results. And if you're doing, like, a thousand additions and multiplications,
01:09:17.720 | it's difficult to get the same order every time. It's that very small 0.0000001% noise
01:09:25.320 | that ends up cascading all the way. If you want true temperature equals zero, the
01:09:31.560 | only reliable way is to run it on CPU, because the CPU will have the floating point
01:09:37.080 | reliability. But if you're running on CPU, it's going to take forever.
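A quick way to see the floating-point behavior Eugene is describing, in plain Python on CPU; the same non-associativity is why reduction order in GPU matmul kernels can shift logits between runs:

```python
# Floating-point addition is not associative, so the order of a reduction matters.
import math
import random

print(0.1 + 0.2)                 # 0.30000000000000004, not 0.3

random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]
naive = sum(values)              # left-to-right accumulation
shuffled = list(values)
random.shuffle(shuffled)
reordered = sum(shuffled)        # same numbers, different order
print(naive, reordered, naive - reordered)      # usually a tiny nonzero difference
print(naive - math.fsum(values))                # drift versus an exactly rounded sum
```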
01:09:41.880 | And to add to Eugene also, there are actually, like, five or six different classes of floating
01:09:52.840 | point formats in CUDA. And I believe, like, people
01:09:57.560 | have to hand-code for, have to account for, these differences.
01:10:04.680 | And that's where they have to be actually shown the errors, saying that on some other
01:10:09.800 | platform we see this error and over here we see this answer. And now someone has to painfully
01:10:15.000 | go back and figure out a bug in their code. So, most of this is basically bugs in the
01:10:20.120 | translation, not bugs in the hardware. That's what I was trying to say.
01:10:23.400 | I'm definitely going to try that CPU approach. So, I'm trying to get a project approved
01:10:30.600 | now, with the rationale being this type of inconsistency.
01:10:45.240 | Also, like, depends on the engine. Like, XLAT12 is not deterministic at all compared to maybe,
01:10:51.320 | I think, transformers is more deterministic. You can set the torch seed and the
01:10:56.840 | inference seed. So, also the inference engine is very important.
01:11:11.000 | Thank you.
01:11:11.400 | Awesome. Well, thanks, everyone, for joining in on our last minute drop-in.
01:11:18.760 | Can I share something really quick?
01:11:20.280 | Yeah, yeah. Go ahead. I've got to drop, but this room will still stay open. I'm going to make
01:11:26.680 | someone else host. Feel free to keep discussing. Next week we have another one at 12, but I'm
01:11:32.680 | going to make Eugene host.
01:11:34.600 | Oh, it's okay. I can do it next time.
01:11:36.760 | No, no. Go ahead, go ahead, go ahead.
01:11:38.920 | No, don't make me host. I'm at a baseball game.
01:11:41.480 | No, the other Eugene.
01:11:43.160 | So, it's quick. So, basically, I found that Llama 3.1, like the 405B, actually might do better
01:11:52.920 | on some domain-specific questions than, like, ChatGPT-4o. Do you guys want to see a little bit?
01:11:59.400 | Just really quick.
01:12:00.200 | Sure. And it's no surprise, actually. I actually think that more and more people will find
01:12:07.560 | specific domains like this as time goes on.
01:12:09.960 | That's great. So, let me share really quickly.
01:12:13.720 | So, here's a little summary. So, basically, this is the specific question I was asking.
01:12:20.280 | Here, you're an expert in mechanistic interpretability research. How do you use the residual stream?
01:12:26.040 | Yeah. And so, here's my summary. Overall, I think, you know, 405B did a better job of explaining.
01:12:34.680 | And also, there is some important information that is missing in ChatGPT-4o or not expressly
01:12:41.080 | explained. So, let me show you the answer from Llama 3.1 405B first.
01:12:48.360 | I feel, I find it's really easy to follow for someone like me. Like, I'm not doing research
01:12:52.840 | in this field. And it also gives enough technical detail. So, if anyone wants to read a little bit.
01:13:01.480 | Yeah. So, and I can move to the chat. What about the answer from ChatGPT-4o? Do you guys want to
01:13:11.320 | see it? Yeah, yeah. Yeah, I'm done. You're done? Okay, cool. So, this one is from ChatGPT-4o.
01:13:21.240 | I feel like ChatGPT still kind of answers things in a sort of generalized way.
01:13:28.440 | Let me know if you want me to move, scroll down. Or if I'm scrolling down too fast.
01:13:41.880 | I think Eugene, the other Eugene would chime in that actually a lot of these things
01:13:55.720 | are very dependent on the prompt. So, it won't surprise me if you tweak the prompt right,
01:14:11.560 | the winner ends up flipping around. That's what I meant. Yes, just for the argument,
01:14:17.480 | I didn't tweak the prompt just for Llama 405B. So, I just said, hey, I have this one. I just
01:14:23.560 | threw it out. But definitely, you know, a more detailed study needs to be done. And let me show
01:14:28.680 | you ChatGPT-4. It's better than 4o, but I don't think... I still think Llama 405B did a better job.
01:14:39.640 | Just a second. So, I'm going to do the same thing, scrolling down. Let me know if I'm moving too
01:14:53.880 | fast. So, basically, I kind of feel, of course, a more detailed study needs to be done. I feel
01:15:13.640 | like Llama 405B may have been doing a much better job of organizing this knowledge, you know,
01:15:21.960 | the way it handles how to answer questions, you know, how to organize the knowledge together.
01:15:26.440 | I think that's pretty important.
01:15:29.560 | So, I'll give my personal example that when I look at some healthcare related data,
01:15:37.160 | and if I'm looking at it through basically ChatGPT-4o, and then even when I use a similar prompt
01:15:47.160 | through Perplexity, I get different answers. But since I know the domain, I know that
01:15:53.000 | sometimes, like, ChatGPT is just bullshitting. And Perplexity actually gets references, and it gives
01:15:58.840 | you a slightly better or more complete answer, which I can take and go back to the references and dig deeper.
01:16:05.240 | So, you will definitely... once you know the domain better, you should not fall for
01:16:10.200 | the English, is my take. The English may look all correct, but it might be nonsense at the end of
01:16:14.440 | the day. Oh, I looked into the residual stream a little bit. I have pretty good
01:16:21.400 | confidence it's not bullshitting from 405B. Yeah. So, the thing that really amazed me is like
01:16:28.440 | how it explains. And for me, like I said, hey, I'm just starting to get into a lot of
01:16:34.200 | technical things I haven't done before. It seems like it just really explained things well. I can
01:16:39.640 | just go down and, you know, look into specific information and stuff like that. But definitely,
01:16:45.400 | I see your point, like it's just one shot. But still, it seems pretty amazing because I didn't
01:16:50.680 | try to tweak the prompt just for 405B. Yeah. Thank you. Yeah. So, that's what I want to share.
01:16:56.200 | Yeah, definitely. We are going to see more and more of this. Like, I think what's interesting,
01:17:02.680 | especially when you're having different foundation models, you're going to have very different
01:17:07.800 | default outputs. Because, yeah, we can probably steer all models, especially the bigger ones,
01:17:13.080 | with prompt engineering to certain desired effects, or even with fine-tuning. But the default out of
01:17:19.560 | the box would be what's interesting, because that's what a lot of day-to-day users
01:17:25.560 | will experience. So, like for me, for the longest period of time, I didn't care that GPT-4
01:17:31.960 | had a better eval score across the board than the Claude model. The Claude model just seems
01:17:38.840 | friendlier and nicer to me and I like it better. And maybe that's what does matter sometimes when
01:17:45.320 | they are all good enough to get the job done. Yes. So, I wish we had a better model where
01:17:51.880 | we don't need to tweak the prompt, you know. In a way, it can reflect how well the model has
01:17:58.760 | organized its knowledge, you know. And so, if it's easy to just talk to it without
01:18:03.800 | tweaking the prompt, actually, it's, you know, at a much more advanced stage, I think.
01:18:09.880 | But I think in the long run, that will be hard to achieve, because if you just view each model
01:18:16.520 | a bit like an individual, so you can call it Llama-kun, you can call it GPT-4-kun, etc., it's
01:18:25.480 | just really a preference thing when it boils down to it. Because, as I said, well, I prefer
01:18:32.440 | Claude because of the way he did the artwork. I know someone who prefers the other way.
01:18:36.760 | And that's the challenge here, right? You can't possibly just train the model to
01:18:43.000 | satisfy everyone in the world. There'll be that back and forth. And that's where all those, like,
01:18:48.200 | few-shot prompting approaches will come in, or fine-tunes will come in. And I think that is probably the
01:18:54.200 | more exciting thing about Llama, because we're going to see lots of fine-tunes.
01:18:58.200 | Sorry, go ahead.
01:19:04.200 | I think, because since we are well past the usual, and there's a lot of new faces here,
01:19:11.160 | I'm just going to, like, recap and point it out. So, in the Latent Space Discord, we host this
01:19:20.520 | weekly paper talk, and this is why this session happened. This week is just a very special one
01:19:26.920 | because the whole Llama 3.1 release came out. Prior to that, we were planning to do other papers,
01:19:34.600 | and we'll probably go back to our usual schedule. So, if any of you, especially the new folks that
01:19:40.280 | joined in because this got broadcast on a much larger audience, feel free to join us next week,
01:19:47.080 | and we'll be talking other papers.
01:19:50.120 | Great. Thank you. Yes.
01:19:51.480 | Thank you, everyone.
01:19:54.520 | Thanks a lot.
01:19:57.560 | All right. Have a great day, everyone.
01:20:00.680 | Okay. I'll stay. It's got organized.
01:20:27.240 | Okay.
01:20:28.700 | Okay.
01:20:30.160 | Okay.
01:20:31.620 | Okay.
01:20:33.080 | Okay.
01:20:34.540 | Okay.
01:20:36.000 | Okay.
01:20:37.460 | Okay.
01:20:38.920 | Okay.
01:20:40.380 | Okay.
01:20:40.880 | Okay.
01:20:42.880 | Okay.
01:20:44.880 | Okay.
01:20:46.880 | Okay.
01:20:48.880 | Okay.
01:20:50.880 | Okay.
01:20:52.880 | Okay.
01:20:54.880 | Okay.
01:20:56.880 | Okay.
01:20:58.880 | Okay.
01:21:00.880 | Okay.
01:21:02.880 | Okay.
01:21:04.880 | Okay.
01:21:06.880 | Okay.
01:21:08.880 | Okay.
01:21:09.380 | Okay.
01:21:11.380 | Okay.
01:21:13.380 | Okay.
01:21:15.380 | Okay.
01:21:17.380 | Okay.
01:21:19.380 | Okay.
01:21:21.380 | Okay.
01:21:23.380 | Okay.
01:21:25.380 | Okay.
01:21:27.380 | Okay.
01:21:29.380 | Okay.
01:21:31.380 | Okay.
01:21:33.380 | Okay.
01:21:35.380 | Okay.
01:21:37.380 | Okay.
01:21:37.880 | Okay.
01:21:39.880 | Okay.
01:21:41.880 | Okay.
01:21:43.880 | Okay.
01:21:45.880 | Okay.
01:21:47.880 | Okay.
01:21:49.880 | Okay.
01:21:51.880 | Okay.
01:21:53.880 | Okay.
01:21:55.880 | Okay.
01:21:57.880 | Okay.
01:21:59.880 | Okay.
01:22:01.880 | Okay.
01:22:03.880 | Okay.
01:22:05.880 | Okay.
01:22:06.380 | Okay.
01:22:08.380 | Okay.
01:22:10.380 | Okay.
01:22:12.380 | Okay.
01:22:14.380 | Okay.
01:22:16.380 | Okay.
01:22:18.380 | Okay.
01:22:20.380 | Okay.
01:22:22.380 | Okay.
01:22:24.380 | Okay.
01:22:26.380 | Okay.
01:22:28.380 | Okay.
01:22:30.380 | Okay.
01:22:32.380 | Okay.
01:22:34.380 | Okay.
01:22:34.880 | Okay.
01:22:36.880 | Okay.
01:22:38.880 | Okay.
01:22:40.880 | Okay.
01:22:42.880 | Okay.
01:22:44.880 | Okay.
01:22:46.880 | Okay.
01:22:48.880 | Okay.
01:22:50.880 | Okay.
01:22:52.880 | Okay.
01:22:54.880 | Okay.
01:22:56.880 | Okay.
01:22:58.880 | Okay.
01:23:00.880 | Okay.
01:23:03.880 | See you later, Eugene. Peace.