Back to Index

Low Level Technicals of LLMs: Daniel Han


Transcript

Welcome to the AI Engineers World Fair. This is the first workshop. There are a few others running, but thanks for coming. We just arrived from Australia, my brother and I. I think he's over there somewhere, but yes, we just came here. We didn't know a lot of stuff about SF, and I think maybe the US is a bit different from Australia.

But yeah, we're very excited to be here. So we're going to stay here for a few months. So if you want to meet up, you can just hit me up over email or Twitter or wherever. So today I'm going to be talking about low-level technicals of language models. Yes, yes, I'm Daniel.

So we do have a website called Unsloth.ai, if you want to look that up, there's like cute sloths and stuff. So my brother designed that. We'll be using two tiny URLs. Oh, did it? Okay, it's still working. Yeah, so we'll be using two tiny URLs now. So the first one is -- oh, wait, I'll shoot.

Yeah. So the slides are at tinyurl.com/unsloth. Hopefully that works. And there's also a Q&A. So I'll be monitoring the Q&A. You can type any question that you like, and I will be answering questions as we go. And that is at tinyurl.com/unslothqa. So if those two work -- they'll also be at the very bottom of the footer, so if anyone doesn't get this, we'll re-show these links.

Okay. Just -- it doesn't work? Yes? Okay. Okay. Good. Good. Good. Okay. Yes. So you might know me from my tweets. So Google kind of released Gemma, an open source model, a few months ago. And we just found a few issues and bugs across the different implementations.

So like, for example, the first tweet that we ever did was about the approximate GeLU bug issue. There were multiple implementations of Gemma that differed: some of them used exact GeLU, some of them used approximate GeLU. And so, which one is correct?

And so that's the question. And so like, we just tweeted about this. And that was like our first issue that we found. We thought this was just like one issue. But actually, there were like many issues. And so like we found more bugs. And so I'm assuming maybe you know me from this.
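
To make the exact-versus-approximate GeLU difference concrete, here is a minimal PyTorch sketch (not Gemma's actual code) comparing the two. They disagree by a small amount, which is exactly the kind of mismatch that creeps in when implementations of the same model pick different variants.

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, steps=9)

# Exact GeLU uses the Gaussian CDF (erf); the approximate one uses a tanh formula.
exact = F.gelu(x)                        # erf-based GeLU
approx = F.gelu(x, approximate="tanh")   # tanh approximation

# The outputs differ slightly, so mixing the two across implementations
# of the same model gives slightly different activations.
print((exact - approx).abs().max())
```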

We did get partially recognized for our Gemma bug fixes. So yeah. So today we'll be showing you how you can actually find these bugs and issues in language models, and how you can analyze and do this yourself, without, you know, us just doing it manually ourselves. And hopefully this can be like an open source project where everyone can find these issues automatically and help us to solve them.

I always thought about, can we automate this? I don't think it can be fully automated. There are actually many issues in these implementations. And it's not just Gemma. For example, you know, we also analyzed Grok. And there are some weird things in their code, like the attention logits are scaled by 30 times tanh(x / 30).
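
As a quick sketch of what that scaling does (my reading of the formula he's describing, not Grok's actual code): multiplying tanh(x / 30) by 30 leaves small logits roughly unchanged but smoothly caps large ones to the range (-30, 30).

```python
import torch

def soft_cap(scores: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # 30 * tanh(x / 30): approximately the identity for small x,
    # but smoothly saturating at +/- 30 for very large logits.
    return cap * torch.tanh(scores / cap)

scores = torch.tensor([-100.0, -5.0, 0.0, 5.0, 100.0])
print(soft_cap(scores))  # large values get squashed toward +/- 30
```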

It's just a clamping mechanism. You can see I also make mistakes sometimes. Like, you know, I said it's division and not multiplication, so sometimes I misread the code. That's because when the code gets released, I quickly try to analyze it, and sometimes I mistakenly say stuff.

So I have to like lodge, you know, showcase corrections. So yes, I'm still human. But yes, like, you know, we analyze lots of models and stuff like this. And hopefully by the end of these workshops, well, actually, this one, okay, it's actually not one workshop, it's going to be like multiple things in one.

So like, I decided to talk about three things, but I'll tell you about that later. Hopefully, you guys can analyze and learn how to find bugs and stuff like that by the end of today. So another one I did recently was, you know, NVIDIA's Nemotron.

I don't know if you saw this, but NVIDIA released a 340 billion parameter model, which is extremely large. I'm assuming this is in preparation -- like they have to release this earlier, before Llama 3 405B, right? So they have to do this quickly. But there are some weird, interesting things: like, you know, they use squared ReLU, and not the normal SwiGLU and the other types of activations.

So that was very interesting. They were actually the first -- well, actually, not the first, but the first big model trained using these other activation functions. And there are other weird quirks and stuff like that. And hopefully you'll also be able to analyze this: whenever the code comes out, just read it, and you'll get it.

Um, it does take some practice. Okay, the first time when I read this code, it took me many, many days to read through all these architectures and understand exactly what they are. But now it takes like 10 minutes. So I'm sure you'll get there: the code comes out and you just read it.

Um, that's the whole goal today. And also, language models -- if you don't know, it's not just issues and bugs and analysis of these architectures; the tokenizer is a totally separate beast from language models. Tokenization is extremely annoying. So I also tweeted about, you know, the different types of tokenization issues -- like Mistral, Llama, all these different variants from the Mistral team, you know, they have different tokenizations, if you didn't notice.

Um, so the smiley face -- the sun smiley face -- represents a space. And if you actually tokenize them, depending on the model, you'll have different results. And the question is, which one is correct? Unfortunately, I do not know. I did ask the Hugging Face team about this.

Um, and according to them, some of them are correct. And some of them are just because the Mistral team forgot to update the model to the fast tokenization variant. Um, we will be talking about this later as well. Um, but you can see even before you can train or run the model, the tokenizer is broken.

So what are you going to do? It's a multi-pronged problem. So we don't just do language models. You know, our experience is a bit broader than that. I actually used to major in maths and computer science. So, yes, very fun.

Well, actually I kind of did very badly in maths, but anyways, yes, very, very fun. So like SVD -- I don't know if anyone, has anyone done normal machine learning here? Oh yes. Very good. There are a few people. So, SVD -- I'm assuming most people know PCA.

Yes. Principal component analysis. Yes. Okay. Very good. Data visualization. It's a very powerful technique. More people should know about it. Um, SVD. Okay. I don't know if people know about SVD. Okay. Yes. It's a bit less well known. I'm actually a bit confused why people don't know SVD.

It's one of the most important algorithms in all of maths and computer science. It literally underpins many applications, and it is extremely important. Maybe we'll talk about that, but yes, I'm a huge proponent of telling people to learn more about SVD.

So please do the singular value decomposition. That's like the must, must, must, must, must. Okay. Like that's the most important algorithm. It's because it's like one algorithm. It can like spawn many other algorithms and it's like, can be used for many purposes. Um, there's also like the QR decomposition.

Okay. Okay. Okay. No one knows. The LU, Cholesky -- like there's a lot -- randomized SVD. Yes. That's extremely important as well. Um, and yeah, so we don't just do language models. You can ask me any questions about, you know, maths or computer science. Oh, you have a question? Real quick.

Um, so do you think, for the Nemotron 340B, is it a unique architecture? Because you can only use the NeMo loader right now to load and train it. I think the data is just the most valuable part, but we are attempting to convert it to a Hugging Face Transformers safetensors format.

But we've had issues because we don't have the modeling file. So I was wondering, do you think it's similar to -- on the same day, they uploaded a Llama 3 based Nemotron model. Do you think we can get some clues from that for how to build a Hugging Face implementation?

Yes. So the question was, for Nemotron, the code was not released for the actual inference and training, and you have to go through the NeMo training framework from NVIDIA. Um, yes, the code. Yeah. I was actually planning on doing something like that, but as you know, we didn't have time to do that.

So I probably, yeah, I might take a crack at that. And also, is there going to be a Q and A? I don't want to ask any more questions. Oh, yes. No, no, there is. So you can log questions on Q and A. There is, there will be Q and A.

So we, no, no. Yeah. No, you cannot. Anyone can raise their hand and ask a question. I'll just repeat the question. But there is also the Slido; if you want to put random questions there, I will keep monitoring it. Um, yeah. And yeah, so like, oh, yes.

The other one was another paper called "LoRA Learns Less and Forgets Less". It shows that, you know, fine-tuning with LoRA does not really work for learning new knowledge. And, well, it depends. It depends on how you actually read the paper. Some components were incorrect.

They didn't actually train on all linear layers; they kind of forgot a few. And also, you need to set some special parameters to make this work. We will be talking about that as well. But I was just trying to show you that, you know, we don't just do language models.

So we have a whole wealth of knowledge across different industries and stuff -- well, not industries, topics. And you can ask me any question that you like. So, Unsloth. Yes. We launched just last December. So I launched this with my brother. This slide is a bit outdated.

But anyways, I think we have 11.9-something K or something, I don't even know now. But anyways, we launched last December. It generally makes fine-tuning of language models like Llama, Mistral, Gemma around two times faster, generally speaking. And we have like 70% less memory usage.

Now we have some new methodologies which reduce memory even further. And the trick is there is no degradation in accuracy. So we don't do any approximations. That's the whole purpose of the optimizations: we don't want to lose any accuracy. And so we write Triton kernels.

So this is from OpenAI. It's a language for doing CUDA-style GPU programming. Essentially it's an intermediary between CUDA code and Python itself. And we'll be showing some Triton code. I don't know if we have time for programming Triton, but that'll be another topic.

Um, and yeah, so the whole purpose of Unsloth is to make everyone able to fine-tune language models on very bad GPUs, right? So like Tesla T4s, the free Google Colab ones. People do know that Google Colab has free Tesla T4s, right? Yes. Yes. Right.

65 teraflops, right? It's actually not that bad if you use it properly. Just a reminder, there is a common misconception that the P100s on Kaggle are faster. That's actually not correct. P100s, I think, are about five times slower than Tesla T4s for this. So although the P100 is actually a more expensive GPU, I think, it's actually slower, so please do not select the P100 on Kaggle.

Right? So Kaggle has 30 hours of free GPUs per week, and you get two Tesla T4s. So that's 130 teraflops for 30 hours per week. And that is actually very powerful. I think that's about the same as an RTX 3070, although I can't remember exactly.

But, um, yeah, so Kaggle has 30 hours for free per week. Google Colab, it depends on how much you use. Normally you get four hours per day, I think. And I guess the Pro tier is not that bad. It's like $10 per month, and you actually get -- yeah, it's pretty good.

Um, yeah, so probably get Pro. I mean, you could also use RunPod and Lambda Labs and stuff like that. I guess that's another option. But we do actually share a pricing comparison. So you need to be careful when you use GPUs. There is a big issue of, like, oh look, I want to use an H100.

Um, did you actually check how many flops the H100 provides? Be careful of NVIDIA's marketing. It's times two, because it quotes sparsity. So just be careful of that. And also, you have to be careful of whether the flops are for float8 or float16. So just be careful of those.

Um, I do have a pricing comparison where we normalize by the flops, with no sparsity. And we look through Lambda Labs, RunPod, Google Colab, AWS, Google Cloud. And I think RunPod is mostly pretty good. Oh, yes. Question. Oh, the sparsity? So, okay, so the question was, why is it times two flops for the sparsity feature on the NVIDIA GPUs?

So sparsity, the sparsity feature, what it does is you take 50% of the weights and make them go to zero. And NVIDIA essentially allows you to train this two times faster by not doing matrix multiplications on the zeros, right? So you're, like, two times zero is just zero. So, like, you essentially don't fire the transistors.

And essentially, this makes it two times faster. Um, well, that's actually not, that's just a higher level overview. But, like, essentially, you, like, compress the matrix into this special format. And then this NVIDIA special format allows you to do matrix multiplications two times faster. Um, yeah, so it's sparsity.
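
A toy sketch of the 2:4 pattern he's describing (two out of every four weights zeroed), written in plain PyTorch rather than NVIDIA's actual sparse tensor-core format, just to show what gets pruned:

```python
import torch

def prune_2_of_4(weight: torch.Tensor) -> torch.Tensor:
    # For every group of 4 consecutive values, keep the 2 largest magnitudes
    # and zero out the other 2 -- the 2:4 structured sparsity pattern.
    w = weight.reshape(-1, 4)
    smallest = w.abs().argsort(dim=1)[:, :2]   # indices of the 2 smallest per group
    w = w.scatter(1, smallest, 0.0)
    return w.reshape(weight.shape)

w = torch.randn(2, 8)
print(prune_2_of_4(w))  # exactly half of the weights are now zero
```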

So it's on H100s, and it's on A100s as well. And your RTX 30 series has that feature too -- I think it does, yes. So if you want to enable it, the biggest issue is that most companies do not train their language models with sparsity enabled.

If you just set weights to zero, you will actually ruin the behavior of the model. But there are papers which show that you can turn on the feature and then do fine-tuning to make it work. So there are actually papers which do that.

Um, so in theory, you could enable this. But, you know, it depends on what models are released from the large companies. Um, yeah. I think Facebook did implement sparsity in PyTorch and the xFormers library. So I'm assuming there might be a focus on sparsity, because you get two times faster.

Um, and if you know, like, OpenAI, they keep saying, oh, it's two times faster. Hmm. I wonder why. Why is it two times faster? Right? So could it be sparsity? Could it be float8? Right? Float8 is generally two times faster.

Okay. Not exactly, but approximately two times faster. So, you know, all these things -- when you hear two times faster, where does it come from? Could it be these things? We don't know, we're just guessing. Um, yeah. Any other questions? Okay. Just remember, you can raise your hand or you can use the Slido -- wait, are there any -- I'm assuming there are no Slido questions yet.

Um, yeah. Okay. Yeah. Just raise your hand. Um, so we -- so for Unsloth, we do benchmarking against HuggingFace plus Flash Attention 2. Um, and we just show our benchmark. This is kind of old already. The memory reduction is much more now. Um, so -- and we do a blog -- we did a blog post with them.

So thanks to Hugging Face for the collaboration. And essentially, all you need to do is, you know, from unsloth import FastLanguageModel, and we try to make it as easy as possible for people to fine-tune a language model. And yes, we'll be talking about Unsloth a bit later.

Um, oh, there is a question. Is it a myth or solid hypothesis that linear versus cosine, then short one to two epochs versus three to five epochs is the highly -- is -- oh, is highly generalized? Um, I think it depends. Um, so, like, for training methodologies -- sorry.

So, is it a myth or a solid hypothesis that linear versus cosine, and then short epochs versus long epochs, is the highly generalized best way to train any standard base model? I think it depends. So, there are some research papers which show that cosine or linear schedules -- I mean, it depends.

To tell you the truth, I think it's a toss of a coin. I don't think the learning rate scheduler is actually that important. I think it depends more on, like, the data set, the number of parameters. So, research papers show that if you simply change from untied weights to tied weights, you can get better accuracy for smaller models.

So, I think -- the learning rate schedule is not that important. You might get accuracy plus 0.1%. Just train on more data. There we go. Right, get more data. Oh, wait, I pressed back. Just train on more data, and I'm assuming the schedules will end up similar. But to tell you the truth, I think it's best to do small experiments and then test which schedule is the best.
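
If you do want to run that comparison, here is a small sketch using the Hugging Face transformers scheduler helpers (the model, optimizer and step counts are made up for illustration):

```python
import torch
from transformers import (get_cosine_schedule_with_warmup,
                          get_linear_schedule_with_warmup)

model = torch.nn.Linear(10, 10)   # stand-in for a real model
num_training_steps = 1000

# Linear decay after a short warmup...
opt_a = torch.optim.AdamW(model.parameters(), lr=2e-4)
linear = get_linear_schedule_with_warmup(
    opt_a, num_warmup_steps=50, num_training_steps=num_training_steps)

# ...versus cosine decay with the same warmup. Run the same small
# experiment with each and compare, rather than trusting folklore.
opt_b = torch.optim.AdamW(model.parameters(), lr=2e-4)
cosine = get_cosine_schedule_with_warmup(
    opt_b, num_warmup_steps=50, num_training_steps=num_training_steps)
```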

But I don't think it's that important. I think the number of epochs, though, that's actually important. These big companies -- I'm not sure about Llama: 15 trillion tokens, is it actually 15 trillion unique tokens? Or is it, like, 5 trillion tokens times 3 epochs? I do not know.

Right? These questions are very important. If it's 5 trillion tokens times 3 epochs, that's actually very different from 15 trillion unique tokens in total. But generally speaking, if you train for more epochs, 3 is generally a good approximate number.

One is actually the best for pre-training, generally. You shouldn't, like, repeat your data multiple times. Um, but yeah. Did you have a follow-up question, or no? Well, um, so basically, the learning rate was one of the big issues to fix with the Gemma implementation.

Oh, yes. So that's why I kind of -- that's where my pitfall was when I was training my 2B on Gemma. And so I actually trained it pre your fix, and somehow it turned out on the benchmark -- and then after your fix it was better than -- I don't know what happened, but now it's, like, one of the highest ranking 2B models on the benchmark.

Oh, okay. So I don't know, like, what -- do you have any theories about what could happen? I trained it on Transformers, the broken version. Hmm. And then something about using Axolotl and using a very high learning rate. Hmm. But it turned out surprisingly well, and we also used -- we didn't use Unsloth, but we used, like, a TPU, so, um -- So this is after the fixes that we did?

Like, it does -- Before. Before, so before it does better? No, well, no. Before, it was -- it was usable. Everybody else's was unusable, right? Yeah. But it was usable. That surprised me, because everybody else's was unusable. But then, after your fixes, we are now the top -- well, my company has -- on the Open LLM Leaderboard it has the highest -- So you didn't even retrain?

You, like, you just -- No. So after your fixes, somehow -- It made it better? Okay. Do you have any theories on that? To be honest, I do not know. I think, like, because the fixes that we did for Gemma are, like, multi-pronged. Like, it's not like one fix.

It's, like, nine or something. So I don't know which fix caused the change. It could be, like, all of them, maybe. I don't know. It's kind of like, um, just for me, when I -- when I -- when I did the training, right, and it turned out good, and then I heard from, like, all my friends, "Oh, I can't do this.

I can't do this." It's, like, um, it just shocked me, I guess. Okay, yes, that is quite shocking. If, like, you didn't -- yeah, we change the code, like, we kind of, like, fix all the issues, and then you don't need to retrain it, and it does better? Okay, that's -- okay, that's a very interesting phenomenon.

I do not know. I just sent you the code that -- Oh, yeah, okay. It's actually open source online on the PowerPoint. Okay, yeah, great. Um, yes. To be honest, like, language models -- I mean, these are all active areas of research. Please, someone will do a research on that.

Yeah. Okay, yeah. Yes. I cannot say anything other than I just read the code and fixed the bugs. I do not know. So, yeah, yeah, don't worry. Okay. We also do long context fine-tuning. So we show that if you use a new methodology which does gradient checkpointing and offloads it to system RAM, you can increase your context size by around four times.

And the weird part is, if you offload correctly to system RAM from the GPU, the time of execution is only slower by one to two percent. Right? So this is very weird. If you use non-blocking calls and offload the GPU memory into system RAM, if you do it correctly, it's basically not slower.
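
A minimal sketch of the kind of offloading he means (not Unsloth's actual implementation): copy a tensor into pinned system RAM with a non-blocking transfer, so the copy can overlap with ongoing GPU work.

```python
import torch

def offload_to_cpu(gpu_tensor: torch.Tensor) -> torch.Tensor:
    # Pinned (page-locked) host memory is required for truly asynchronous copies.
    cpu_buffer = torch.empty(gpu_tensor.shape, dtype=gpu_tensor.dtype,
                             device="cpu", pin_memory=True)
    # non_blocking=True lets this copy overlap with other GPU compute.
    cpu_buffer.copy_(gpu_tensor, non_blocking=True)
    return cpu_buffer

if torch.cuda.is_available():
    activations = torch.randn(4096, 4096, device="cuda")
    saved = offload_to_cpu(activations)
    torch.cuda.synchronize()   # wait for the copy before actually using it
```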

Um, some implementations, unfortunately, offload incorrectly. I don't want to name anyone, but they offload incorrectly. Sometimes they offload to disk. I don't know who came up with the idea of offloading to disk. But anyways, please try to offload to memory first and then disk. Right? Disk is extremely slow.

Um, and if you can offload to system RAM, you can actually get away with a lot less GPU memory usage. Okay. So, I should have put this at the start. But anyways, today we'll have three approximate topics. I wanted to make them into three separate topics, but I guess I just mixed them together.

Whatever. Um, so you'll be learning about low-level technicals of language models. For example, backpropagation. Why is training transformers not O of N cubed, but rather O of N squared? And there is a lot of maths, but I will try my best to reduce it -- I think I already tried my best to reduce the maths, but there is still some maths.

So, please handle the maths. Um, I will try my best to explain as simply as possible. Um, that's the whole goal of the workshop. So, not that bad maths. You will actually understand the formulas very well. Um, just a reminder, I kind of nearly failed my maths in university.

So, do not worry. Do not be scared. It's very fine. We'll be talking about Unsloth fast fine-tuning. The best tips and tricks for fine-tuning. How do we write the fast kernels for fine-tuning? You know, how do we actually make it two times faster and use 70% less memory?

Like, how? And with no accuracy degradation? And we'll be talking about some, you know, Triton, OpenAI's Triton language and stuff like that. And we'll be finding and fixing bugs. So this will be a constant phenomenon and theme: how do we find and fix bugs in Llama, Mistral, Gemma?

We'll be talking about mixture of experts as well. Oh, wait, maybe not. It depends on time. And we'll be doing lots of bug hunting, bug fixing, and more. And everyone here will be a fantastic bug hunter and bug fixer. And we can essentially open source our effort to fix open source models with everyone here.

Um, oh, yes. And we also have stickers. Um, yes. I don't know where they are. But, like, oh, yes. Yeah, my brother has some stickers. Um, and we bought a few of these stickers, which look pretty cute. Right? So, like, you can wait. My laptop has some, right? I put them on my laptop.

Um, and they're pretty cute. I really like them. Um, so my brother has them. We'll be handing them out. Um, yeah, as well at the end. Um, okay, so let us start. Um, so the transformer, right? So, like, what is the transformer? I'm assuming everyone knows what the transformer is?

Um, does anyone not know what the transformer is? Yes or no? Like, you can simply... Okay. Yes. Okay. So the transformer is just the architecture that is behind all language models. So, like, GPT-4, GPT-3, you know, Llama, Mistral, Gemma, all these open source models. What are they?

Like, what's the architecture behind them? And all of them rely on the transformer. And the transformer is essentially an architecture which seems to be very good for sequence modeling. So it's not just for languages; it can be for any sequence modeling. Right? So, like, you know, Sora is a transformer.

Well, not just a transformer. It's probably plus diffusion, but it's generally a transformer. And there are other different types of models, which don't have to be language modeling. Okay? It's just sequence modeling. And I'll probably show some pictures later. I probably should have explained it a bit better, but just assume that transformers are the method behind all language models.

Okay. GPT-4, GPT-3, GPT-5. Okay. I don't know if, okay, who knows what GPT-5 is, but, like, I'm assuming it's a transformer. Um, the transformers just seem to be very good at learning new knowledge, injecting knowledge into the model. It seems to be very, very good at changing the weights to fit the training data.

Um, which is very interesting. And the GPT-2 architecture was actually very popular as, like, the classic decoder-style transformer. That was very, very popular. It's still used to this day. And it kind of got reincarnated by adding extra components to it. And this new architecture is called Transformer++.

Um, I don't know if people have heard of this, but Transformer++ is the GPT-2 transformer architecture, plus RoPE embeddings, plus SwiGLU, plus RMS layer norm, and with no bias. And I think it's untied weights -- although I'm not sure about GPT-2, I can't remember exactly, but I think it's plus untied weights.

Um, and Transformer++ is the architecture which most people think is the best transformer architecture for now. Yes, for now. There are probably some other tweaks and small things that the transformer can still gain from, but in general, this would be considered the best architecture.

Um, and what does the architecture look like? It is just a list of math equations. So I just wrote down the entire transformer architecture -- well, this is Llama 2, right? So Llama's transformer architecture in one slide. And all you need to do is get some inputs, do some layer norm, do some RoPE embeddings, do some attention, plus some residual, do some layer norm, SwiGLU, whatever, residual, layer norm, and you get logits.

Um, and you essentially repeat this middle section L times, or many times. And that is the transformer architecture. Okay, then maybe the math equations -- I'm not sure, do the math equations scare anyone? I'll be explaining each one. Okay. So, hopefully, I've tried to make the math equations as reasonable as possible.
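
As a concrete example, here is a heavily simplified PyTorch sketch of one such decoder layer -- RMS layer norm, attention, residual, RMS layer norm, SwiGLU MLP, residual -- with RoPE and the embedding/logit layers omitted, so treat it as illustrative rather than a faithful Llama implementation (the sizes are made up):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderLayer(nn.Module):
    def __init__(self, dim: int = 512, hidden: int = 1408, n_heads: int = 8):
        super().__init__()
        self.attn_norm = nn.RMSNorm(dim)   # pre-attention RMS layer norm
        self.mlp_norm = nn.RMSNorm(dim)    # pre-MLP RMS layer norm
        # attention projections, no bias
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)
        # SwiGLU MLP: gate, up and down projections, no bias
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)
        self.n_heads = n_heads

    def forward(self, x):                  # x: (batch, seq, dim)
        b, s, d = x.shape
        h = self.attn_norm(x)
        q, k, v = self.wq(h), self.wk(h), self.wv(h)
        # split into heads: (batch, heads, seq, head_dim); RoPE would be applied here
        q, k, v = (t.view(b, s, self.n_heads, -1).transpose(1, 2) for t in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, s, d)
        x = x + self.wo(attn)                                   # residual
        h = self.mlp_norm(x)
        x = x + self.down(F.silu(self.gate(h)) * self.up(h))    # SwiGLU + residual
        return x

layer = DecoderLayer()
out = layer(torch.randn(2, 16, 512))   # a full model stacks this layer L times
```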

Um, in theory, if you write this down in PyTorch, you actually have a working implementation of a transformer architecture. And yeah, we'll be talking about each component separately as well. Um, yeah. Is anyone scared of the maths?

No. Yes. No. No. Okay. Very good. Okay. Let me just check questions. Does anyone have any questions? Okay. Okay. So did you have a question? Okay. Well, um, so from my understanding, from the layer level for transformers, it's almost comparable to Cosmic Plinko in a way. Is that -- I mean, of course there's hard math behind it.

Cosmic, sorry? What? Cosmic Plinko, like the drop the, you know, cause it's 80 layers, right? And a grid, right? So like low energy something. Sorry. I'm, I'm actually not familiar with that. Do I, you'll have to explain it to me. What did you say? Cosmic, what? Sorry. Cosmic Plinko.

Cosmic Plinko? Do I know that? I'm not that smart. So you'll have to like explain to me what that is. Should I search that up? Not gotcha, not gotcha. Um, like, uh, the arcade game with a bunch of pegs and you drop the bottom Oh, that. Oh, yes. It's a Windows XP.

Output layer, output, you know, on the bottom -- right, you have to visualize it, right? Yes. Like if you take this and visualize it, and then visualize the math. For example, let's take Llama 3 8B, right? It's 32 layers, right? And then layer zero, or I guess, it doesn't matter.

There's layer zero, and then layer 32, which is the output layer. Yes. Now, when you drop a prompt into this cosmic machine. Yes. Which, it's not actually cosmic. You just don't understand it. Okay. Yes. It's not cosmic. Yes. It's just maths. Yup. But so far, it's cosmic. Yes. I don't think we understand.

Personally, I don't think we understand it, if you pretrain it again -- like, where we're... I know. I think we've studied it a lot, but okay. Yes. Okay. It's just maths. Yes. You're the expert, not me. I'm, you know, so. Okay. But, um, so you're just trying to say an analogy. Like it's kind of like the game.

Yeah. So, so people can visualize the math versus like... I think so, we... Oh, yeah. I'll talk about that in the slides, but I think it's more like an analogy. I think it's more like you're going through a maze. Like, okay, it's not a maze. I would say it's more like every single layer has someone trying to make you change clothes.

And then at each layer, there's a fashion designer trying to get you to wear different clothes. And at each layer, the fashion designer doesn't like the previous fashion designer's choices, and they will change your clothes. Something like that. I think that's more like a transformer. It's like each fashion designer has their own views.

Okay. Yeah. I guess you could say that. Oh, yes. Yes. Yes. Wait, it is in Windows XP, right? Is that? It's making me confused. Windows XP, like the game. I think I played it before. Oh, okay. Anyways. Okay. Yeah. Sorry. Yeah. Question. Oh. Um, so when I put subscript i, what it generally means, in effect --

Well, technically everything is a matrix. But if you see any summation signs with subscript i, then it generally means row-wise or column-wise. The w -- if you see a small w, that generally means a vector. But in general, everything that's a capital letter is a matrix.

Um, and why is it a matrix? Because it's just faster. Um, I mean, in theory, you could convert this all to vectors, but for speed purposes, this should be matrices. Um, yeah. Any other questions? Okay. Next. Um, so, why did I put this? Hello, my name is Daniel. Hi, my brother's name is Michael.

I hope everyone will have tons of fun. AI Engineers World Fair is the best. Um, why did I put this? Well, does anyone notice any similarities between these sentences? Or differences? Sorry? Okay. Yes. Okay. Yes. Okay. Yes. Okay. Except for the first sentence. Okay. Anything else? Just say random stuff.

Interesting. Yes. Okay. Hello and hi are the same thing, but kind of different words. Okay. Yes. Yes. Semantic embeddings. Basically this is an example of semantic embeddings. Okay. Okay. Somewhat. Okay. Okay. Yes. Okay. Okay. Okay. Well, to tell the truth, I didn't have any intention. The intention. Yes. Go on.

I mean, it's the king plus queen thing. Yes. Yes. Yes. From word2vec. Yeah. Yeah. Yeah. I know what you're talking about. Yeah. So like, you know, it's king minus man plus woman equals queen. That's, I think, what you're trying to show. Okay. Well, in theory, I guess you could have just seen the next slide.

There are just five. If you simply just look, um, the first one, if you consider punctuation as combined with the word, right? Like hello comma, treat that as one separate component. And if you do this, ignore all spaces. The first one has just five components. Right? What do you think the second one?

Okay. I guess you can already have the slides. But what do you think the second one has? Right? Anyone? I already wrote all the answers there. Okay. Yes. I wrote all the answers. Right? Six, eight, seven. Right? One, two, three, four, five, six, seven, eight. Right? One, two, three, four, five, six, seven.

Right? So if you do this, just assume all punctuation from now on is combined with the word, and ignore all spaces. And with this, we just invented a tokenizer. Right? So this is a very general, random tokenizer we just invented. And essentially the reason why we want to do this is because computers don't understand words.

They only understand numbers. Right? So you have to essentially assign each of these tokens a number, like an ID. Right? So hello is ID zero -- oh, hello comma actually is ID zero. My is ID one. Name is ID two, and so on. Right? So you have to assign an ID to each of these components.
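
As a toy sketch of the tokenizer we just invented (split on spaces, keep punctuation glued to its word, assign each unique piece an ID in order of appearance):

```python
def build_vocab(sentences):
    # Split on whitespace only, so punctuation stays attached to its word,
    # and assign IDs in order of first appearance.
    vocab = {}
    for sentence in sentences:
        for piece in sentence.split():
            if piece not in vocab:
                vocab[piece] = len(vocab)
    return vocab

sentences = ["Hello, my name is Daniel.", "Hi, my brother's name is Michael!"]
vocab = build_vocab(sentences)
print(vocab)                                     # {'Hello,': 0, 'my': 1, 'name': 2, ...}
print([vocab[p] for p in sentences[0].split()])  # the first sentence as IDs
```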

If you don't do that, then a computer doesn't know what you're actually doing. Right? Computers only know numbers. So we just invented a tokenizer. I would not suggest you actually use this tokenizer, but in general, it's actually not that bad. Because -- can anyone see any issues with this new tokenizer we just created?

Um, what are some issues? Yes, very good. So what do you think we should do? Okay. So the point was, we included the punctuation with the words, and that is not helpful. Right? So like "Michael" with an exclamation mark versus "Michael" without an exclamation mark. So what would you suggest to fix this?

Um, interesting. So "hello" will be one token and then the comma itself will be another one -- okay. Interesting. Anyone have any other suggestions? Yes. Yes. Very good. Exactly. So do you have any suggestions for how to improve? Oh yes. Very good. Yes. I haven't heard that in a long time.

Very good. Yes. Old natural language stuff. Um, so the idea was to stem the word. So essentially you can remove like, you know, for example, like skipping can become skip, right? So like skipped, skipping, skip, like, you know, they're all kind of the same. Well, I wouldn't say the same thing, but like in theory, they're the same thing.

Um, any other suggestions? Very good idea, right? If you lowercase them all, then you can reduce a lot of issues. Um, is that a good idea? Capital my and small my, what do you think is the difference? If you do capital my generally means it's the start of the sentence.

If you write my in lowercase, then it means it's in the middle of the sentence. So good idea, though. Um, actually I think -- so BERT, it's old, okay, but people still use BERT -- I think there is a lowercased version of BERT, so they essentially lowercase everything.

Um, and I think it does okay for semantics and stuff, but it doesn't really do well for the decoder style. So don't lowercase, but good idea. Any other suggestions? Yeah. Uh, you build a vocabulary by starting with just the individual characters, and set a size for how large your vocabulary would be.

Yes. You start building word pieces from the frequencies -- oh, you just said the name of the algorithm. It's called WordPiece or BPE tokenization. Yes, exactly. So that's actually the correct -- well, I wouldn't say it's the correct way, I actually don't like that approach, but yes, it's the most recognizable and most industry-standard approach: what you suggested, which is to start off with small individual characters and then combine them together based on some sort of statistic, right?

Like, maybe "hello" is a word that we have to select, right? But also, like, H-E, like H-E-L-L -- okay, "hell", that might be a popular word as well in the dictionary, but anyways -- or "he", right? So "he", "hell", "hello".

Um, each of these might get assigned their own tokens, or they might not, right? So it depends. And the industry standard is to use these methodologies to build up the vocabulary, which we won't be discussing today -- that's for later research. Um, yes. Okay.
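
For reference, a minimal sketch of training a small BPE tokenizer with the Hugging Face tokenizers library -- it builds merges up from individual characters by frequency, exactly as described (the vocabulary size and training text here are made up):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

texts = ["Hello, my name is Daniel.", "Hi, my brother's name is Michael!"] * 100

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()   # split on whitespace/punctuation first

# Start from single characters and learn merges by frequency.
trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(texts, trainer)

print(tokenizer.encode("Hello, my name is Daniel.").tokens)
```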

So let us just look at one sentence, right? The first one is, hello, my name is Daniel. Assuming our tokenization is useful. Okay. Let's just assume the tokenization which we created is helpful. Okay. So remember it is just put all the punctuation together. Um, ignore spaces and don't do lowercase or whatever.

Just, just, you know, tokenize it as it is. It's not very useful, but who cares? We just say this is a good tokenizer. Um, the question now is if I select the first token, hello, right? Let's assume, I don't know if, okay, the color's not very good, but like my name is Daniel is grayed out.

I think -- I'm not sure if you can see that it's grayed out. Pretend someone types in hello comma, right? A language model should predict what's the next word, right? So how does it know to predict my? You know, when you type into ChatGPT, it goes from left to right.

Like you type something, you know, you type some sort of instruction, and then ChatGPT will print the words from left to right. Right. So why is it printing words like that? Can anyone tell me why ChatGPT prints words from left to right? Why is it not doing right to left?

Or why does it not just spit out everything in one go? Yes. Exactly. Correct. So because it's a decoder-type transformer architecture, it is predicting the next word based on the previous words. So very good. And so the point is, we can only have the language model see previous words, not future words, right?

If you accidentally put the future data in, Oh, you know, your accuracy might go to 100% if you use future data. So please do not do that. This is actually a very common mistake. Like I'm not saying this jokingly. It's a very big issue in research. So please read the paper before you actually, like, if you read research papers, please see how they do the methodology.

Um, always read. Always ask the question: did they put future data in when they did the training in the research paper? Okay, this is a pervasive problem, I'm not joking. You can see weird accuracies. Like if you see papers which have 98% accuracy: question, question, question, question, question, question.

Okay. Or 100% accuracy. How is that even possible? How can something be 100% accurate? Question, right? And so the most likely scenario is they used future data. Yes, that is a very good question. So the question was: is it the research paper, the methodology itself, that uses future data, or is it the code implementation that's the problem?

Now that's actually a very good question because it depends. So I think if the researchers, I think normally in research papers, one person does the coding, some, some other people do like secondary coding. And then the main researcher who makes the idea, they don't actually do the coding. So the, the coder might have misinterpreted the methodology.

To be honest, I don't think that's the case. I think it's actually the methodology that's the problem. I think if the researcher, like the main researcher, finds out that their method has 90%, 98% accuracy, why didn't they question the results? Wouldn't that be very sketchy, like 98% accuracy, 100% accuracy?

I'm not joking. This is actually a very serious problem. So if you read enough research papers, you'll see this problem is always the same problem. It's always using future data. So yeah, I think that, yeah. I think like if you don't, yeah. Okay. I won't comment on any papers.

Anyways. Yes. Okay. Does that kind of answer your question? So it depends. I think in general, I would say it is the lead researcher's responsibility to catch these mistakes. Otherwise you shouldn't really be the lead researcher. So that's my take on that.

Um, I think the coder -- I mean the programmer -- might have some issues, but I don't know. I think I blame the lead researcher. Any other points? It seems like you're saying the main red flag is a huge leap in performance. Yes. If you see a huge leap in performance: question, question, question.

It's probably future data. Actually, I wouldn't say probably -- it's like 50% sure it's future data. Um, yes, question. Exactly. So the point was, in the test and train split, sometimes, although you might not be using future data, you might accidentally leak components from the test set into the training set.

And then depending on how you split the data set, you might actually mix the data sets. Is that kind of correct? Yes. The distribution itself. Yes. So that is why you have to be very careful about how you split the data sets, right? For training and testing, you have to use stratification.

You have to inspect the data before you do it. You must enable random shuffling. Right? There is actually a very, very common question on PyTorch: how do you disable random shuffling for Hugging Face? They purposefully don't give you an easy way to turn it off. Think about why -- because people might forget to randomly shuffle.
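
For a classic tabular setup, here is a small scikit-learn sketch of what shuffling plus stratification looks like (the data is random, purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 8)              # fake features
y = np.random.randint(0, 2, size=1000)    # fake binary labels

# shuffle=True mixes the rows before splitting; stratify=y keeps the label
# distribution the same in train and test, so the test set matches the
# training distribution instead of accidentally drifting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())      # roughly equal class balance
```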

Um, and actually this is a pervasive problem with Kaggle competitions. People like to train on the test set. Normally speaking, you should have a 20% test, 80% train split, right? But for the final submission, people will even use the test set in training.

So, you know, it is a pervasive problem. Um, any other points and questions? Okay. So what is the language model? Given the word hello, can you somehow predict my name is Daniel? Right? So like given the word, right? So given that token, can you predict these extra tokens? Um, remember carefully, I purposely only went from left to right, right?

So like, hello can only predict my, can only predict name is and Daniel, right? It cannot predict, right? You can't use the word hello to predict previous words, right? That's cheating. So that is sequence modeling. And the point is when you chain them together, let's assume that you predicted the word my as the next word, right?

So now you have two pieces of information. The text that you see is hello comma my. Using these two pieces of information, you want to predict name is Daniel, and you keep doing this. And that is a language model, right? That is essentially a language model. What is it doing?

You start from the first word, you predict the words into the future, and you keep doing this iteratively, right? And so that is what ChatGPT is doing. Does that kind of make sense? And remember, the point is: never use future data. Okay. I know I keep stressing this, but this is actually a very big issue in machine learning and AI, right?

This is actually the biggest issue I find, in my opinion: using future data. There are so many papers which do this. Yeah. So the second point is you must tokenize each component into numbers. Remember this? So I'm just going to, you know, cook up some numbers. Hello will be 0.11, -0.123, 102.

Okay, I just made those numbers up. Daniel is 0.11, 123, -0.122. Okay. I just randomly made them up. And remember, each component must have the same number of numbers, right? So if hello has three, Daniel must also have three. Can someone tell me, if we say each component must have three numbers, how many combinations do you think there can be for each token, for each component?

How many combinations? What do you think the answer is? So remember you can choose any single value for each of the three numbers, right? 0.11, 0.1112, 0.111113, whatever numbers -- how many combinations are possible? It depends on the float. Okay. Let's assume it's infinite precision.

Correct. The answer is infinity. You can do as many as you like. But normally speaking, you should not use just three. All right. Now the question is, why don't you just use one number then? Right? Hello can just be 0.111. If that's already infinity, 0.111...

If you use two numbers, isn't that also infinity, right? So what's the problem? Well, you should use as many numbers as possible. Try your best to use more numbers. It's because the computer, you know, it's not an infinite machine. So please use more numbers so the model can learn which numbers to assign to each token.

Um, and when you start training a language model, these numbers will be randomly initialized, right? So all these numbers will be randomly initialized. And can someone notice anything about my initialization? What is the issue with it? There is a glaring issue, quite obvious, a very big problematic issue.

Okay, same number. Okay. Okay. Good point. Yes. Did someone say something? What was the other point? Yes, the magnitude is the most important. Right? 123 and 102 are terrible. When you randomly initialize, please do not initialize with large random values. This will destroy your training. That is why your training might have infinities.

And sometimes the training loss goes to zero. Okay, that does not mean your model learned anything. That just means there's some sort of error in your training data -- uh, sorry, your initialization. So be careful of that. Don't worry, Hugging Face does this automatically for you.

So you don't need to worry. And now each component has a list of numbers that it's associated with. Okay, I just used the same numbers for now; it's easier for me to, you know, do the slides. But essentially, hello comma, my, name, is, Daniel -- each of them has numbers associated with them.

And this is the thing that you're trying to learn for each of these components. And remember, this can be converted into a table of numbers. Right? So if you replace all the commas with just columns, these are just tables of numbers. And this table is what you need to train.

Um, and again, remember, given the word hello, you want to predict my name is Daniel. Right? So essentially, given that vector of numbers, can you predict the other vectors of numbers? Um, yes. Oh yeah, you can use as many numbers as you like. So you have to make a choice:

How many numbers do you want to use to represent these tokens? So for example, you can select six numbers, or you can select 1024 numbers, or 2048. It depends on the model creator's choice. Um, yeah. Yes. Does every single row have to be unique? Well, it doesn't have to be -- it might not be the case -- but with high probability, it will be unique.

Um, yeah. Wait, is it like, okay. Um, and so when you do training of a language model, um, there's a trick that you use and remember we want to predict the next word, right? So hello, you want to predict my, right? So can someone notice any pattern with this?

Like, why did I do the arrow? And what's the pattern -- can anyone see it? Anything special with this? No? What happens if you take "hello, my name is Daniel" and just shift it up by one place? Right? If you shift it up by one place, hello is now aligned with my, my is now aligned with name, and so on.

Right. And there'll be a gap at the very bottom. Right. So we simply put EOS, which means end-of-sentence token. That just means it's the end of the sentence. And we just, you know, put it there because there's a gap, right? So remember, machines do not like gaps, and you must fill every position with a number.
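
A tiny sketch of that shift-by-one trick on token IDs (the IDs and the EOS ID are made up):

```python
import torch

# "hello, my name is Daniel" as made-up token IDs, with 5 as a made-up EOS ID
input_ids = torch.tensor([0, 1, 2, 3, 4])
eos_id = 5

# Labels are the same sequence shifted one position up;
# the gap at the end is filled with the end-of-sentence token.
labels = torch.cat([input_ids[1:], torch.tensor([eos_id])])

print(input_ids)  # tensor([0, 1, 2, 3, 4])
print(labels)     # tensor([1, 2, 3, 4, 5])
```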

So that's the reason why we did that. And this is kind of the training mechanism, right? So we essentially, we have that list of, you know, list of words and we want to predict the shifted words. And the transformer, all it does is there's a function to predict that.

So given hello, can you predict my; given my, can you predict name; and so on, right? That's the transformer architecture. It's kind of a bit wrong, but something like that. Okay. Like the f(X) is this gigantic model, you know, with lots of tuning knobs in it.

Okay. Let's check if there are any questions. Okay. And the point is, remember, we can only predict the future words, right? So hello can only predict my name is Daniel and so on, right? That's the purple component. And the blue box is called the attention mechanism.

The f(X), I factored it out, and that is called the multi-layer perceptron, or MLP layer, right? So there are actually two components in a language model. One, the attention, does the prediction of the next word, and the f(X), which is the MLP, just does this mixing component. It just makes it a bit better than simply predicting the next word with attention alone.

Oh, okay. There is a question. Oh, I didn't see. Okay. Well, um, yes. So the comment was normal transformers predict only one token at a time. How about transformers predict multiple tokens at a time? Yes. So actually you can predict multiple tokens at a time. It just depends on what is your training objective at the very last layer.

Remember we shifted by one place, right? Why not shift by two places or three places -- then you'll be predicting two tokens into the future, three tokens into the future. Exactly. So before, it was just one objective, like one column of the output; we just add more.

And yes, you could do multiple tokens. Um, was it the, it was the Facebook paper, right? I can't remember. Yeah, it was a Facebook paper. Um, I forgot the accuracy though. Um, yeah, but yeah, you could do that. Um, I guess like, I don't see any, I guess it's just good for inference time.

Like you can predict, you know, you can make inference. If you do four tokens in the future, you can predict four times faster. Um, okay. Yeah. I think it's mainly for when you do inference, you can do this. If you do two tokens in the future, you can have like two tokens in one go.

Um, I don't know. I don't like that approach. I think predicting one token is better. Because you're already forcing the language model to do so much; you're making it even more problematic. So I think predicting one -- okay, maybe for inference time, it could work. Um, yeah. Yes. Question. So, tokens, yeah.

The tokenizer converts to an ID. So like, yeah. So when the tokenizer says it has 32,000 words, essentially it's an ID from 0 to 31,999, right? So like, hello will have ID 2,557. So then what you do is you take 2,557 and go to the lookup table, which has this table, right?

I made the table, right? So hello has an ID, my has an ID, name has an ID, and so on. And you essentially go to that specific row, and that is your sample in your training data. And you do this for the whole thing.

So from the tokenizer you do get a number -- it's an integer. And you just have to look it up in the embedding matrix and you will get a vector of numbers. Does that kind of answer your question? Oh, you're talking about the embedding dimension. So if you see 2,540 or whatever those numbers are, that is just how many columns, how many numbers you want to use to represent each component.
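
That lookup is just an embedding matrix indexed by the token ID; a minimal PyTorch sketch with made-up sizes:

```python
import torch
import torch.nn as nn

vocab_size = 32_000      # token IDs 0 .. 31,999
embedding_dim = 2048     # how many numbers represent each token

embed = nn.Embedding(vocab_size, embedding_dim)   # the big lookup table

token_ids = torch.tensor([2557, 11, 42])          # e.g. "hello" -> 2557
vectors = embed(token_ids)                        # one row of the table per ID
print(vectors.shape)                              # torch.Size([3, 2048])
```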

Okay. Okay. Any other questions? Yeah. Yes. Oh. Oh, okay. So you can enter them at late, way later. Okay. Okay. Okay. Yeah. Yeah. You had a... I guess it just depends like if you make, so the point, so remember we said how many combinations can we do? Infinity because we can do right, because floating in person representation, um, in theory, if you have infinite position, it can be infinity.

But someone mentioned how, depending on the precision of your float, it's actually limited precision, right? So the point is you want your model, your training objective, to actually learn -- give it as much freedom as you can, right? So if you try to restrict the model when it learns, it might actually not be helpful.

And that is why normally people have these large numbers, like, you know, a 6144 embedding dimension or an 8192 embedding dimension, right? The more numbers you give the model, the more freedom it has to move. So in theory, I mean, in theory, it should have better accuracy.

Um, in theory, everything's in theory. Um, I don't know if there are research papers to show this. I think someone should write a research paper on that, you know, each embedding dimension test, you know, do 3 trillion tokens. Okay. That's probably too many. And then see which one has, I'm assuming the more you add the, um, the higher accuracy.

Yes. I agree on that. I think someone needs to do a research paper. Yes, that should be a new research topic. I've never seen this paper before. So like, that would be very interesting. So yes, question. Yes. The reason for that is it just makes training faster sometimes. So depending on, so there is, I think Andre was one who tweeted about depending if you pad the token, if you pad the vocabulary to a specific number, you can actually make training faster.

Because in the NVIDIA GPUs, when you use tensor cores, if you pad it correctly, it can get the data and cache it more appropriately. So like, for example, the cache size is like 64. I think it's 64. Okay. Maybe I'm making stuff up. But like essentially you have to pad it to a multiple of 64, something like that.

And so sometimes that happens. Another one is like some people want to add when you want to do more fine tuning, when you want to like train the model for more, you want to use one of those unused tokens for your own purpose. And so they left some holes in there.

Yeah. Does that kind of answer your question? So when you do tokenization, assuming you don't encounter these tokens, you won't have any problems. But if you do, then there are problems. Yes. So for example, if you do LLAMA3 fine tuning, if you use the base model for LLAMA3, and you accidentally use one of those tokens, you will get NANs for fine tuning.

Right? So you have to be very, very careful. And so I think what we did for Unsloth is we actually find these untrained tokens first and set them to the mean of all the embeddings, and then you won't have these issues. So I think that's actually a model creator's problem.
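
A rough sketch of that kind of fix (not Unsloth's actual code; detecting "untrained" rows as all-zero is just one simple heuristic):

```python
import torch

def fix_untrained_tokens(embedding_weight: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        # Heuristic: treat rows that are exactly all zero as untrained.
        untrained = (embedding_weight == 0).all(dim=1)
        # Replace them with the mean of the trained embeddings so that
        # hitting one of these tokens doesn't blow up into NaNs.
        mean_vec = embedding_weight[~untrained].mean(dim=0)
        embedding_weight[untrained] = mean_vec
    return embedding_weight

weights = torch.randn(10, 4)
weights[3] = 0.0                      # pretend token 3 was never trained
fix_untrained_tokens(weights)
print(weights[3])                     # now the mean of the other rows
```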

It's like, they probably should have not set it to zero. I don't know why they did that. But anyways, yeah, they should have set it to be like, you know, normal distribution or like some, you know, just random initialization. Yes. Kind of. Yeah. Okay. Any other questions? Okay. Oh, yes.

Oh, yeah. Yeah. You can put a beginning-of-sentence token. I just didn't do that. You should put one -- most language models will put a beginning-of-sentence token.

That's actually very important as well. Um, most models do that now. Um, they found that to be very helpful. To be honest, I don't think so. It's actually that effective. I think the beginning of the BOS token came from the old style, um, um, the old style, the CLS token for, um, I think it was the first token for like bird style.

Um, so they had a classifier token at the very start. I think it was at the very start. I'm not a hundred percent sure, but I think that's where it came from. I don't think the beginning-of-sentence token actually makes that much difference. But you should put it, you should put it -- you know, giving the model more freedom to move is always better.

Um, so yes, I probably should have put the beginning of sentence, but you know, yeah, for demonstration, I did not do that. Um, yes. Okay. We did that. Right. So like the green one, um, right. So like the, the attention block is kind of encoding the stuff that we described, right?

Predicting the next word based on the previous words, right? And so the attention block is that first part, and the MLP block is just the mixing component. And this is kind of the transformer architecture, kind of visualized. And you just repeat this L times.

Um, that is a transformer. Oh. Now, another question I always have is, why is training language models not O of N cubed? Because like, aren't you like, given the word hello, you're predicting my, right? And now we have hello, my, you're predicting name, and then you have hello, my name, and you're predicting is, and so on, right?

Shouldn't that be the training data? Why is the training data just the shifted pairs: hello predicts my, my predicts name, name predicts is, is predicts Daniel, Daniel predicts EOS? That is the training data you actually see. Why is it not the full prefix expansion? Does anyone know why? Sorry? The complexity? Yes, the complexity. Yeah. Very bad complexity, right?

So if the sentence is like 100 words, what do you think? How many? Quite bad. Yes. Basically. Yeah. It's one plus two plus three plus four plus five, all the way up to plus 100, right? So it's N over two times one plus N, I think.

I can't remember my maths, but yeah, something like that. So it grows quadratically; it's very bad. And that's if you have one sentence. What happens if you have like 10 sentences? Oh my. Yeah. But does anyone know why language models don't need to do this? Like, we don't actually need to do this, right?

So we can essentially skip it. Instead of having that full expansion as the training data, your training data is simply "my name is Daniel", shifted up by one. And that's your training data. Why is it not the full expansion? Oh, yes. We haven't talked about position encodings yet.

Yeah. Okay. But you actually don't need position encodings. Oh, okay. Yeah. Attention. What? Oh yeah, the attention mechanism. Correct, that's the answer. Yes, it's because of attention. Well, actually, specifically, masked attention. Right? So that's the trick. Okay, we'll be talking about that; we've mentioned it a few times.

And I'll give you the code again. Well, actually the math formulas for the transformer architecture, right? So, the attention block. We will now be talking about the attention block, where Z = softmax(QK^T / sqrt(d) + M) V.

And as was mentioned, it is the attention mechanism which allows us to skip the O of N cubed complexity and make it O of N squared. Why? Because remember, we want to mask out future tokens, since we don't want to predict based on future data, right? So, weirdly, this mask allows you to train more efficiently.

You know, it's funny, because attention is O of N squared, so the longer your sequence, the worse the complexity. But there is a special trick: you use a mask, and this actually makes attention not that bad. So instead of doing "hello" to predict "my", and so on and so on for every prefix separately, the attention mask handles it, right?

So the attention mask itself means you don't need to do the complicated thing where every prefix separately predicts the next word. Okay. So we'll now be talking just about the attention itself, right? So, softmax of QK transpose over root d.

Just a reminder that whenever you see QK transpose, the queries and keys: there are explanations like "what is a query? what is a key?", and I don't actually like that approach. I would like this to be a math approach. So my view is: you're given the matrix X, which is your embeddings, right?

So remember hello is a vector of numbers, right? You multiply this by some weights, WQ, WK and WV and you get back QKV. Q is query. Okay. Keys and values, but, um, that's a very vague interpretation. I don't really believe, like, I don't really trust those interpretations. It's not that clear.

Just assume it's just math. Okay. Just take your X matrix, multiply it by the weights, and you get some new matrices back. That's my view. And so, if you see why I stacked it like this, does anyone know why I stacked it like this?

Like, why did I present it like this, specifically? Why is the presentation laid out like this? Any points? Sorry, what? Composition? Decomposition. Interesting. Okay, that's a very interesting point, but no. Correct, yes, that's correct. I just lined it up such that it's easier to see.

And if you take the matrix X and you multiply by WQ, you'll get Q, right? And this is actually the correct maths, um, dimensions and stuff like that. Um, and so like, I like, I like to normally tell people to like visualize transformers as maths. It's actually, in my view, it's easier.

Okay. I'm not sure for other people, but my view is easier. I do not like it when they say, oh, queries and like you're trying to do keys and values. I don't even know what that even means. Anyways. Um, and the yellow components are the ones you want to train.

X is what you want to train. WQ is what you want to train, and WK and WV too. And Q, K, V are just the components afterwards. So remember, you have the Q and you have the K; all you need to do is transpose K, then do Q times K transpose, and you get this big square matrix called QK transpose, right?

Hello, my name is Daniel, and so on. Right? So that's kind of what I want to visualize: when you do Q times K transpose, you get a square matrix. And all you need to do now is divide by root d and take the softmax, right?
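
As a minimal, hedged sketch of that whole block in plain PyTorch (single head, no batching, variable names are illustrative), just to show Z = softmax(QK^T / sqrt(d) + M) V end to end:

```python
import math
import torch

def causal_self_attention(X, W_q, W_k, W_v):
    """X: (seq, d) embeddings; W_q, W_k, W_v: (d, d) trainable weights."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # project the embeddings
    d = Q.shape[-1]
    scores = Q @ K.T / math.sqrt(d)                  # QK^T / sqrt(d), shape (seq, seq)
    # causal mask M: future positions get -inf so softmax zeroes them out
    mask = torch.triu(torch.full_like(scores, float("-inf")), diagonal=1)
    attn = torch.softmax(scores + mask, dim=-1)      # each row sums to one
    return attn @ V                                  # weighted mix of the values

seq, d = 5, 8
X = torch.randn(seq, d)
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
Z = causal_self_attention(X, W_q, W_k, W_v)          # (5, 8)
```

The mask is what lets one pass over the sequence train every position at once, instead of re-running the model for every prefix.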

So softmax: essentially you normalize each row to one, right? The exponentials in each row have to sum to one; you need to normalize them. Does anyone know why you should do that, and why you should use softmax? Any clues? Why? Yes. Yes. Okay. That's the answer.

Yes. But like, why? What? Why? Sorry. When you multiply them, you can get NaNs. Oh yes, very good. Do you know how to fix that? Close. You have to subtract the maximum of the row. That's how you fix it. Yes. Oh yes, very good. Okay.
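
A quick hedged sketch of that trick: subtracting the row maximum before exponentiating, which changes nothing mathematically but keeps exp() from overflowing:

```python
import torch

def stable_softmax(scores: torch.Tensor) -> torch.Tensor:
    # subtract the per-row max so the largest exponent is exp(0) = 1
    scores = scores - scores.max(dim=-1, keepdim=True).values
    exp = torch.exp(scores)
    return exp / exp.sum(dim=-1, keepdim=True)

print(stable_softmax(torch.tensor([[1000.0, 1001.0]])))  # no inf/NaN, roughly [0.27, 0.73]
```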

Yes. We want to sample from that. Okay. Sample from that distribution. But what happens if you don't do the soft max? Doesn't this still work or not? Like what happens if you just do QK transpose over root D? Remove the soft max. Like why do I have to do soft max?

Yes. Interesting. Then you can fix that by subtracting the max of the row as well. Exploding values. Anyone else? Okay. What happens if you don't have a non-linearity then? So does it have to be softmax? Can it be something else? Yes, it could be. That is another active area of research which people should focus on: why do we need to use softmax?

Um, generally speaking, research papers show that is actually the most accurate. Um, if you use other activation functions, it might actually not be that accurate. Right? So like, um, but this also is the bottleneck of transformers is because it's a soft max, it does the row sum of the exponentials.

This means that you can't actually decompose this, right? You can't pull the matrix multiplications out. And so if someone can find a way to make this faster, you know, you'll get like millions of dollars. Okay, maybe much more than that. But, yes.

And V is just, remember the V comes from here, right? So we just take the V, multiply it up again, and we get this matrix at the very end. And that is, right? Oh yeah. That, that is the final component, right? This, this empty box is what you get out from the attention mechanism.

For the layer norms, I don't really want to explain too much, but the layer norms essentially: you square all the elements per row, take the mean, take the square root of that, and then divide by it, i.e. multiply by one over it, right?

All this does is normalize the rows to make it easier for the language model to learn, right? So why do people do layer norm? It just makes training easier and more stable; there's no other real point. There are some theories, like with batch normalization, about out-of-distribution data and shifting activations back towards the training distribution.

I just like to think of this as an optimization method. Layer norms just make training easier and more stable. And layer norm is simply, as I said: you take the X matrix, and per row you take the mean of the squares.

And then you divide by the square root of that, and then you multiply by some weights, a vector of weights. And that's just layer norm. Don't worry too much about what layer norm is or what it does; it just makes training better and more stable. Please add as many layer norms as possible.
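
As a hedged PyTorch sketch of that RMS-style norm (not the Triton kernel on the slide, just the same row-wise math written out):

```python
import torch

class SimpleRMSNorm(torch.nn.Module):
    """Divide each row by the root-mean-square of its elements, then scale by a
    trainable weight vector W (the yellow component)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        input_dtype = x.dtype
        mean_sq = x.float().pow(2).mean(dim=-1, keepdim=True)   # mean of squares per row
        x = x.float() * torch.rsqrt(mean_sq + self.eps)         # normalize in float32
        return (self.weight * x).to(input_dtype)                # scale and cast back
```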

Yes, add them everywhere. Layer norms everywhere, and you'll make training much better. Okay. I don't know if you can see this, but in Triton, in order to write Triton code for the layer norm, this is the forward kernel. We will not be talking about Triton today, but it's actually not that complicated.

If you read it more intently and ignore all the surrounding components, there are only a very few lines for the layer norm itself. It's actually not that complicated; the rest is simply how to load the data. It's actually not that hard. Yeah, the backward kernel is where the problems come in.

How do we actually do the differentiation of the layer norms, right? Remember, you have to train the W, the yellow one. You actually have to train that. How do we find the derivatives with respect to W? It is very complicated, and if you want, you can have fun learning the derivatives in your own time.

It is extremely complicated because there are sums, you know, row sums. How do we take the derivative of a row sum? It can get quite complex. I wanted to talk about backpropagation today, but I thought it's probably too heavy on math. So no backpropagation, but I do have tutorials on that.

So if you go to the Triton tutorials, I followed those, and they're actually quite helpful. The backward kernel is just very, very problematic. Now, on to the rope embeddings. Why do we do rope embeddings? Does anybody know what a rope embedding is? Yes? So you could use the rope embeddings to extend context.

Yes. Okay. Do you know how? How does it extend context? So, well, how does it work with YaRN, for example? How would you use rope embeddings to extend context? What would you do? How would I do that? Basically, what I would do is just multiply the base by two, and then you get two times longer context.

You multiply the base by 10. Correct. Yes. So that's where YaRN might begin. Yes. So is that dynamic? Well, static or dynamic. Yeah. So how would you solve the problem if you want to have a 1 million context length, but your dataset is only 1000 words?

How would you solve that problem? How would you think about solving it? Because some people have said they do 10 million context length. Is there any dataset which is 10 million tokens? How would you? It's 15 trillion tokens. So, yeah. Oh, no, no, but that's 15 trillion tokens for the whole dataset.

I mean, like, how do we do long context? Remember, when you do long context training, you have to have a document, which is at least 10 million words for it to learn how to predict the 10 million plus one token. So, um, how I would solve the problem would just be to gather better and more diverse data sets.

Yes, that's the ideal. So what happens if there is no data set, which is 100 million tokens? Then what would you do? How would you synthesize if the model, it's like a chicken and egg problem. How would you do synthesis? So, no, no, no. I would, I would basically, um, just create, I would basically use like a quad or like any of the state of the art models with like a Laura and then get, and then basically circulate the data.

But are they trained on 10 million tokens? Huh? If the model itself wasn't trained on 10 million tokens, does it do long context? So, if I was to try to solve this problem for like clients, for example, like let's say their code base is in 10 million tokens, or, or, or, you know, and they wanted 10 million specific context or whatever, right?

Then I would basically create a synthetic dataset. So not fully synthetic, but a derived dataset from what we have. Okay. Interesting. So assuming we do not have... but I can't assume that we have no data, right? Good point. Okay. I don't know. I think it remains to be seen, like many claims by companies: 10 million context, 100 million context.

I question them. Well, I've only seen one million actually work, so, I mean, yeah. And that's bringing attention to theory, right? Okay. Now we're going into... okay. Yes. Okay. No, no, no, that's fine, I was asking the questions, but okay. Wait, the question was: what is a rope embedding?

Someone who did mention like positions. What does that actually entail? What do you think is the point of a rope embedding? All it does is you want to tell the model to learn what is the position of the words, right? So like, hello, my name is Daniel. It actually has a meaning.

Like, hello is like the first token, right? But then if you put hello as a third token, what's the difference? There is a difference, right? So like, depending on where the word is in the sentence, it matters. So the whole point of embeddings, rope embeddings is that it tells your model to learn where is the position of this component.

And in the old style, they used absolute (or relative) positions. Rope embeddings do some special tricks: you multiply by a cosine, plus a rotated version times a sine, some sort of special rotation. The paper found that if you do rope embeddings, it actually gives higher accuracy.

And, you know, everyone does rope embeddings now. Yeah. So why do you use the... you mean lower, sorry. Yes, there is. I think BERT did not. I don't know, did BERT use rope? I don't think so; BERT used absolute positions. Yes, that's the problem. I think BERT used absolute, I think.

I don't remember anymore, but oh, yes, yes, exactly. So rope did not exist back then. Yeah. And so this paper, the RoFormer paper, shows... so previously people used absolute position encodings, which simply just add the position. Like, you can literally just add: if the position is zero, just add zero.

If the position is one, add one. If the position is two, just add two. That's literally what they do. Well, actually, well, not exactly, but like, you know what I mean, right? You have to divide it by some sort of normalizing factor, right? If the position is 30,000, don't add 30,000, right?

You would destroy training. But that's kind of what they do. And what they show is that if you do rope, you can essentially increase accuracy somehow. And we just take this as gospel; we treat it as true, and everyone uses rope now. Yeah. In that case, do you have an opinion on YaRN versus rope? So YaRN is kind of rope.

So YaRN just does... I'm assuming YaRN is rope, but it does... actually, I don't think I should comment on this, because I'm not an expert on that. I'm no expert on long context.

Like, on rope versus YaRN... but since vLLM only supports static YaRN, unless you're planning on trying to go to one million context, it's actually kind of terrible. Is YaRN the one where the base, like, dynamically changes? Like, if you have up to one million context and you do one million and one context, the base changes with that, the factor changes?

Is that YaRN? Dynamic changing, yeah. So the issue is that in short... in, let's say, five-shot, right? That's not anywhere close to one million context. Or maybe, let's say, a three-shot or a two-shot.

But, obviously, dynamic YaRN, in theory, could fix this rope issue, where we just take this rope as gospel. Are you following my question here? Yes... no, I think dynamic YaRN is just rope, though. Like, okay, that's weird. Okay, maybe I'll unplug and re-plug.

So the screen kind of went away. Let me just do this again. Is it not working? Is it the screen, or no screen? That's weird. Can you answer questions while we're at it? Oh, yes, okay. Yes, I'll answer some questions.

Yeah. Anyone else have questions? Wait, okay. Wait, I need to refresh. Oh, if anyone wants to take a break, you can take a break now if you want. And if you have other questions... yes. Okay, question. Sorry, sorry. YaRN is... so, I don't know what YaRN is, but I hear some talk about extending the context window, and I'm just wondering what the key ideas are when it comes to the context window.

So the question is about context windows. Like, what is the improvement from something like YaRN? Okay. So the point of YaRN, like what we were talking about, is: how do we make a language model do long context without training on long context?

Kind of. Yeah. And so, like, what yarn does is you can essentially extend the context window automatically by dividing the base of the rope embeddings. You change the scale factor. I don't actually have slides for this, but it's just a methodology which allows you to scale the factor, and you essentially magically make the model learn new context, long context.

That's kind of what yarn does. Yes, it's extremely-- if you do one million context length, then your O of n squared is one million squared, which is horrible. There are, like, some other methodologies, like, you know, they want to do, like, linear transformers, and, like, you know, yes, I guess you could try that, but I don't suggest that.
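
For a concrete picture of the simplest rope-scaling variant, linear scaling, here is a hedged sketch where positions are just divided by a factor, so a model trained on 4k positions "sees" 8k positions squeezed into its familiar range when the factor is 2. YaRN itself is more involved; the function and variable names here are illustrative, not from the talk:

```python
import torch

def rope_cos_sin(seq_len: int, dim: int, base: float = 10000.0, scaling_factor: float = 1.0):
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    positions = torch.arange(seq_len).float() / scaling_factor   # the linear-scaling trick
    freqs = torch.outer(positions, inv_freq)                     # (seq_len, dim/2)
    emb = torch.cat((freqs, freqs), dim=-1)                      # (seq_len, dim)
    return emb.cos(), emb.sin()

cos, sin = rope_cos_sin(seq_len=8192, dim=128, scaling_factor=2.0)  # stretch a 4k-style rope to 8k
```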

Yeah. Okay, yeah, sorry, question. What is the difference between in-context learning and fine-tuning, or what's the benefit? That is a good question. So, fine-tuning changes the... you have gradients for fine-tuning. So, yeah, I think it depends. I think I would do in-context learning first: if you have a few-shot prompt, you shove it in and see if it works.

But I would still resort to fine-tuning if you want to be more efficient. And if your model doesn't seem to be learning, then you have to go back to fine-tuning. There was a paper which was released yesterday, I think? Was it yesterday? It showed that in-context learning is very useful.

And it, like, learns kind of, like, how to be, like, a random forest or, like, a tree. I think it was yesterday. The paper. That was quite useful. Very interesting. But I think fine-tuning is still very important. Especially if your model is not learning anything and it doesn't seem to be working, then you have to, like, use fine-tuning to change your behavior.

I don't really have a comment on this. Like, I just feel like you should do everything and try everything. Um, yeah. Any other questions? Well, I have a couple of my apps. I don't know if, uh... Oh, did you... Okay. I think my app kind of glitched. Do you want me to read them off one by one?

Oh, maybe I'll take... okay. Wait. Okay, let's just continue, and then we will do the questions. Yes. I didn't see the time; I've been speaking too long. Anyways, that is the rope kernel.

And it might look horrifying, but it's actually not that bad. It's literally just the formula we described: Q times cosine plus Q times a rotation matrix times sine. And that is just rope. It's actually not that hard. The code is a bit more annoying, but it's mostly just moving data around.

It's just data movement and stuff like that; all of the code is just related to data movement, so it's not that complicated. The most complicated part is the derivatives for rope. And you have to use something called rotate_half, which essentially rotates half of the...

So, okay, just read the code. It's minus X2 concatenated with X1. So you essentially take the matrix X, split it into two halves, put the second half first and the first half second, and negate the new first half.

And the code is, hopefully, reasonably understandable. But the question is... this is Hugging Face code, right? Q times cosine plus rotate_half of Q times sine. The question is, how do we actually find the derivatives of this? This was actually very complicated, because I could see many implementations not doing it correctly.

And the derivative is very special. Simply, if you notice, rotate_half, the function, is literally a matrix multiplication, right? It's Q times R, where R is a rotation matrix. And the rotation matrix has minus identity and identity off the diagonal, and zeros on the diagonal blocks. And if you do this, because it's in matrix notation now...

...the derivative is simply the transpose. And if you take the transpose, the minus sign just flips. And if the minus sign flips, it's literally the same as before, just with the minus on the other half. I'm probably explaining this too quickly, but the point is: if you write it as a matrix multiplication, you can derive the derivatives very simply.
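
For reference, here is a hedged sketch of that rotate_half trick and the rope formula in plain PyTorch. It mirrors the Hugging Face-style code being described, with cos and sin assumed to be broadcastable to q and k:

```python
import torch

def rotate_half(x):
    # split the last dimension in half, swap the halves, negate the new first half:
    # (x1, x2) -> (-x2, x1), which is exactly a multiplication by a rotation matrix R
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, cos, sin):
    # q * cos + rotate_half(q) * sin, and the same for k
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin
```

Because rotate_half is just "multiply by R", the backward pass only needs R transposed, which is where the flipped minus sign comes from.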

My suggestion is: shove the derivatives into Wolfram Alpha or your favorite tool, you can use ChatGPT as well, and you will get the derivatives back. But you must put them into a form that's useful for the computer to digest. Now we'll be talking about the MLP component, right?

So we completed the attention, we completed the rope, the layer norms. The MLP is just a mixing component, right? So it's an activation function times some weights, you know, multiply some of the other weights and stuff like that. Um, all it does is it mixes the signals to make it like, you know, more fancy.

In theory, you don't actually need the MLP component; you don't strictly need this part, but you should put it in so the model has more freedom to learn. And, you know, the famous paper "GLU Variants Improve Transformer", by a very famous author, I'm assuming most people know him.

He showed that if you add GLU, SwiGLU, and all these other variants, you can actually increase accuracy. Once again, this is just treated as fact in the machine learning community. We should be doing more experiments rather than just using this methodology, but, you know, we just treat it as fact.

And this is in the Transformer++ architecture, which everyone uses. There is a very big difference, though, in GPT-2. In GPT-2, they don't use the GLU variants; they simply use a normal MLP, right? So: X times the up-projection weights, apply some activation function, and then you down-project it.

So the down weights, right? But if you do SwiGLU and these newer GLU variants, you essentially add a component where you do an element-wise multiplication with a gate, and then a down projection. So it's actually very similar to the GPT-2 architecture.

You just add an extra component. There's also a naming change, with up, gate, and down projections and so on, but in general you can see it's very similar: just the extra element-wise multiplication component.
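
Here is a hedged side-by-side sketch of the two MLP styles. The module and variable names are illustrative, and GELU and SiLU stand in for "some activation function":

```python
import torch
import torch.nn.functional as F

class GPT2StyleMLP(torch.nn.Module):
    """First equation: down( act( up(x) ) )."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.up = torch.nn.Linear(dim, hidden, bias=False)
        self.down = torch.nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

class SwiGLUMLP(torch.nn.Module):
    """Second equation (Llama/Mistral/Gemma style): down( act(gate(x)) * up(x) )."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = torch.nn.Linear(dim, hidden, bias=False)
        self.up = torch.nn.Linear(dim, hidden, bias=False)
        self.down = torch.nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))
```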

Yeah. But there is, for example, the Nemotron paper: Nemotron, the new model by NVIDIA, did not use GLU. Instead they used squared ReLU, right? And they showed that you don't actually have to do GLU anymore; you can just use squared ReLU and it seems to do okay.

Although it remains to be seen if it actually is good. But yeah, they showed that if you do squared ReLU, you can remove this. You can essentially go back to the GPT-2 architecture, right? You don't need to use the Llama architecture anymore. So Llama, Mistral, and Gemma all used the second equation.

GPT-2 used the first equation. You can go back to the first equation, but the trick is the activation F must be very special, and that's called squared ReLU. And they show that if you do this, your accuracy does not degrade that much. And so, yes, the paper. And interestingly, if you look at the authors, it's the same author.

One of the authors' names is the same. So, you know, they were also the ones to showcase squared ReLU: you don't need to do GLU anymore, you can simply use squared ReLU instead. There is a research paper which I highly suggest people read, and it's called the Physics of Language Models.

It is extremely long, though, but it has many nuggets inside, and I highly suggest people read it. They did so many tests and experiments, and they showed, for example, that if you use GLU variants, the gated MLP, it actually reduces the model's capacity to learn on small models.

Okay. So that's the point: it's on small models. On small models, if you do GPT-2, the first formula, it does better than if you do Llama, Mistral, Gemma, the second formula. Only on small models, right? That's the point. Only on small models. And there are other special things inside the paper, which I highly recommend.

It's extremely useful. For example, they say that the choice of activation function, whether you use SiLU or GELU or ReLU or whatever, is not that important, right? So it's not that important.

Whether you use biases: also not that important. So there are many different things that you don't actually need, the paper shows, but people just treat it as gospel that we have to use this specific component. My suggestion is, you know, we should do more testing in the AI space, like which variants are the most important, but I think this paper is pretty useful.

The code for the SwiGLU kernel, that's the forward kernel. Again, it's not that complicated. The SwiGLU kernel is literally three lines; it's just the three lines which are commented. The rest is just data loading.

How do we actually load the data onto the GPU? That part's not that important. And, you know, if you use torch.compile now, you can simply generate these kernels automatically, and this makes your training much faster. So that's what I suggest people do: just use torch.compile.

You don't have to rewrite Triton kernels. Yes? Sorry, W_gate? Oh, gate. Oh, it's just another matrix. So W_gate, W_up, and W_down: you train all of these. They're just numbers.

So, matrix X times W_gate. Remember this thing, right? For attention it's X times W_Q, W_K, W_V; here, just assume it's W_gate, W_up, and W_down instead. It's the same thing, so you just train these.

Does that kind of answer your question? Oh no, no, it's just a naming convention. W_up is actually called the up projection. The naming convention could just as well have been W_A, W_B, W_C; it's just a convention people like to use: W_up, W_gate, and W_down.

They do have meanings, though. W_down is a down projection; up means up projection. So essentially you take the matrix, make it larger, and then project back to a smaller version. It just gives the model more capacity to learn.

So it's just a naming thing. Any other questions? Okay. As I said before, the derivatives are always a pain. The derivatives of SwiGLU are a nightmare to do, and I do not suggest you do this. But I had to manually do all of it; you can see all my comments.

So the comments are actually there. If you look more carefully, I wrote out the math formulas for how to actually take the derivatives, and it's extremely painful. I highly suggest you not inflict this pain on yourself; it took me many days to do.

Do not... I don't suggest this. Yes, question. That is a very good question. So I use Desmos. Desmos is an online graphing calculator. You type all these equations in and then you can see: does the graph align? Oh, and every single component, you have to be careful.

So every single component, you have to check. Um, oh, is this component correct? Is this component correct? Check all of them. Um, And so like I normally, so like you isolate each component separately and then test it. I'll talk about that actually. Yeah. Um, any other questions though? Yes.

Okay. Um, and this is the cross entropy loss kernel. I'll probably just skip this. Don't have enough time. So, um, I wrote this as if you want to inspect the formulas and stuff like that, how do we do the derivatives for this? Um, you can do this. It's not that complicated.

Actually, it is very complicated. I did spend a lot of time trying to work out the derivatives. It might be a bit foreign for some people. The main reason it gets complicated is when there are sums, right? Doing derivatives with sums is always painful.

Matrix differentiation is actually very easy. If you do X times W, the derivative with respect to W is just X transpose; very simple. But derivatives with sums are horrible. Quite horrible. What you can do for the sums when you take derivatives is transform the sum into a matrix multiplication, right?

So a sum is just X times a vector of all ones. That's the row sum. And essentially, if you do this, you can make the differentiation much easier. But I won't be talking about that; that's for another topic.
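
A quick sanity check of that "sum as a matrix multiplication" idea:

```python
import torch

X = torch.randn(4, 3)
ones = torch.ones(3, 1)
row_sums_matmul = X @ ones                          # (4, 1): each row summed via a matmul
row_sums_direct = X.sum(dim=-1, keepdim=True)
assert torch.allclose(row_sums_matmul, row_sums_direct)
```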

Someone was also asking about stability for softmax. If you subtract the maximum of the row, you can make softmax much more stable. This reduces exponentials of large numbers, so the exponential doesn't blow up. If you do this trick, subtracting the maximum of the row, it makes training much more stable.

Always do this. Yeah, always do this. And that's the code for the forward pass; not that important. Oh yes, I wrote the code for the backward as well. Sorry, this is the forward, and it is quite long, but I wrote all of this down for your own leisure.

If you want to read and implement this, have fun. But I wrote it step by step, right? So take Y equals the log of the sum of exponentials of X, and then I simplified it out: if you exponentiate both sides, you get exp(Y) equals the sum of the exponentials, and so on.

All right. And so I wrote this all down. But there is a methodology which we showed in Unsloth, where you can use chunked cross-entropy. And this is actually very helpful for large models: your logits are very large, so if you chunk them, you can parallelize much better on the GPU.

And so like the problem though, is like the derivatives, the forward propagation, you have to be careful now when you do chunking. Um, so like essentially you divide these into slivers and parallelize each component. Um, I also wrote some, you know, maths and stuff like that for you to review.

And how you actually do the chunked sum is very interesting: the logsumexp of the chunk-wise logsumexps is just the full logsumexp. You have to do some manipulation, and it's actually very interesting. You don't actually need to change that much code to make it work.
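
The identity behind that chunking is just that a logsumexp of chunk-wise logsumexps equals the full logsumexp. This is a toy illustration of that identity, not Unsloth's actual kernel:

```python
import torch

def chunked_logsumexp(logits: torch.Tensor, n_chunks: int) -> torch.Tensor:
    # logsumexp(x) == logsumexp([logsumexp(chunk) for each chunk of x]),
    # so a huge vocabulary can be processed one sliver at a time
    chunk_lse = [c.logsumexp(dim=-1) for c in logits.chunk(n_chunks, dim=-1)]
    return torch.stack(chunk_lse, dim=-1).logsumexp(dim=-1)

logits = torch.randn(2, 128256)                      # e.g. a Llama-3-sized vocabulary
assert torch.allclose(logits.logsumexp(dim=-1), chunked_logsumexp(logits, 2), atol=1e-5)
```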

Okay. Now we'll be going to the next component, which is to investigate the Llama architecture. Hopefully this works. Yes, question. It is part of Unsloth currently. I think I heard that the PyTorch team will be including this in torch.compile, although I'm not a hundred percent sure.

This reduces memory. This makes large vocabulary sizes work. The biggest issue, and the reason you have to do chunking, is that CUDA on NVIDIA GPUs has a limit: 65,536, which is two to the power of 16. I think so.

Yeah. So there is a limit. And if your vocabulary size is larger than 65,536, you must do chunking. Yeah, you have to do chunking. So if you have, like... I think it was Gemma, 128,000 or something? I'm getting confused.

I think it's 128,000. You have to divide it into two, and so your chunks would be two. So it should be in a future PyTorch release, maybe. Yeah. So now we've learned all of this, and now you can read the code for Llama. So go to the code in modeling_llama.py.

It should be in the slides, but you can also type into Google something like "modeling_llama.py". Right. If you go to line 94, you have your rope embedding, which we talked about, right? Just assume... don't worry. Oh, okay.

I should probably use my mouse. Yeah. Okay. I don't really like how GitHub does the symbols pane; it's kind of annoying sometimes, but I think I can disable it if I log in. Anyways. So if you go to this... this part is not important. The first one, at line 74, is LlamaRMSNorm, right?

This is the layer norm kernel, right? This is the code for the layer norm, and it's just this much. You take each row, square the elements, take the mean, and divide by the square root of that.

Remember, the only reason you do layer norm is to make training more stable. And it's actually not that complicated; it's just these few lines. The rest is just boilerplate: you have to set stuff up, do random initialization, blah blah blah, comments, and stuff like that.

Um, but that's the layer norm kernel. The rope kernel, the rotary embedding is a rope kernel. This is just setting stuff up. This is setting stuff up, um, forward, right? This is the most important component, which is the forward component. Now, don't get scared by this. That's because we fixed a bug.

So this is actually our bug fix that we did. In all transformer architectures, you have to be very careful when you downcast to float16 or bfloat16, right? This is actually very important. You have to be very, very careful with float16 and bfloat16 mixed precision training, because you can downcast incorrectly and your rope embeddings will be wrong.

Um, so it's actually, it's not supposed to look as ugly as this, but unfortunately it looks ugly now. Um, before it was just this, right? But this is just setting up code, um, that, you know, we have to like fix the bug and stuff like that. Um, yeah. So it's actually not that long.

Right. So it's actually just this, yeah, up to here. So whenever you see these architectures, they just look complicated, but yeah. No, no, no: if you use torch.compile, we're still, I think, 30% faster. So even if you compare torch.compile against Unsloth...

Oh, we're still 30 times faster... oh, not 30 times, 30%. So we're two times faster than Hugging Face plus Flash Attention 2, but with torch.compile, you know, they're adding kernels from multiple packages in. So yes, they could learn from Unsloth and put it in; I'm assuming they're doing that.

I did have a talk with them, so yes, they're probably doing that already. But yes, we're still 30% faster. I think someone tested this last week. Yeah. Any other questions? Okay. This is not important. This part is the one we were talking about, the linear scaling for rope embeddings.

If you want to extend the context length, that's kind of what they do; this is for scaling. Note that there is the rotate_half part, which I was talking about, right? How do we actually derive the formulas for the gradient? How do we actually do the rope embeddings?

It's just this. The MLP, which we talked about: remember the MLP layer, right? There's a gate projection, the up projection, the down projection. This code is bloat again; ignore it. It's just this one line, the rest is bloat. Ignore that. Right? So the down projection is just the down projection, right?

The activation function of the gate projection, times the up projection, right? So that is the MLP, that is, you know, SwiGLU. One line, not all of this. And the rest is just for training purposes. repeat_kv is for the attention part; I can't explain it too much here.

It's just to make inference faster. When you do QK... if we go back to the slides, where is it? Right: WQ, WK and WV. Instead of training the full WK and WV, you train a small sliver and you repeat it. And this can make inference faster.

And so we don't actually train the full-size matrix WK anymore; we train a small sliver, and we just repeat it. The attention, again: all of this is just preparation, right? And then: okay, ignore, ignore, ignore, blah blah blah, bloat.

Get rid of this; don't look at that. There is Q, K and V, that's the matrix part. The only thing I find people struggle with is the dimension manipulation: you have to manipulate the dimensions of the output. That is actually kind of annoying.

I do agree, this is actually very annoying, but just assume it is this, right? That's all we're trying to do, and these three lines do it. And then we want to do QK transpose, right? Oh, sorry, first you have to apply the rope embedding.

Don't forget to apply the rope embedding. And then QK... that's repeat_kv. repeat_kv is the trick I mentioned to make inference faster: if you go back here, for the K and the V you take only small slivers, you only train small slivers, and you repeat them four times.

And it doesn't hurt accuracy that much. And that's repeat_kv. This is the QK transpose, right? torch.matmul, Q K transpose. This is the attention part. Softmax, dropout... okay, no one uses dropout anymore, get rid of that line. Matrix multiplication, right?
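
Going back to repeat_kv for a second, here is a hedged sketch of what that repetition looks like, along the lines of the Hugging Face helper being discussed (shapes assumed to be batch, key/value heads, sequence, head dimension):

```python
import torch

def repeat_kv(kv: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Grouped-query attention trick: K/V are computed with only a few heads
    (a small sliver) and repeated n_rep times to match the number of query heads."""
    batch, n_kv_heads, seq, head_dim = kv.shape
    if n_rep == 1:
        return kv
    kv = kv[:, :, None, :, :].expand(batch, n_kv_heads, n_rep, seq, head_dim)
    return kv.reshape(batch, n_kv_heads * n_rep, seq, head_dim)
```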

So up to here is QK transpose over root d, softmax, times V. Right? So up to here. And then we have to do an output projection as well. The rest... okay, whenever you see this part, just ignore it, right?

Whenever you see something like "if self.config.pretraining_tp > 1", that's not for faster training; it's for training across multiple GPUs. You can ignore it and just assume it's that one line.

Now there is more code, like the Flash Attention 2 path: ignore it, that's just for faster attention. Ignore, ignore, ignore. You don't need to see this. Scaled dot product attention is a faster version of attention that is native to PyTorch; also ignore, you don't need to see that either.

Pretend you didn't see that. And then finally we get to the decoder layer, right? Okay, maybe I should exit the slides. Where is it? So remember the decoder layer, right? Remember we said we repeat it L times; this is the part we repeat L times.

This is all shoved into the decoder layer, right? We call this one decoder layer. And again: do a layer norm (remember, put layer norms everywhere), do attention, add the residual, which also makes training more stable, do another layer norm, do an MLP, add another residual, and then we're done.

Right? That's one component, one decoder layer. And remember, we repeat this L times; the rest of the code is just doing this L times. Now, where is the L times? Comments, comments, comments, forward. Right, you go to forward. Yeah. In my opinion, put layer norms everywhere.
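
As a hedged sketch, the whole decoder block above fits in a few lines of PyTorch, with the attention, MLP, and norm sub-modules assumed to exist:

```python
import torch

class DecoderLayerSketch(torch.nn.Module):
    """One Llama-style block: norm -> attention -> residual, norm -> MLP -> residual.
    Repeat this L times (e.g. 32) to get the decoder stack."""
    def __init__(self, attn, mlp, norm1, norm2):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.norm1, self.norm2 = norm1, norm2

    def forward(self, x):
        x = x + self.attn(self.norm1(x))   # layer norm, attention, add residual
        x = x + self.mlp(self.norm2(x))    # layer norm, MLP, add residual
        return x
```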

Yeah, but in terms of calculating the residual and then adding it back... oh, that's just... yes. Why is the ordering... I think it's layer norm first, then add the residual. Is that correct? Or maybe I'm getting confused. Yes. The ordering, to tell the truth, doesn't really matter.

I think there are some research papers which show that if you switch the ordering, it changes accuracy by something like 0.01%. To be honest, we need to do more testing again. In my view, shove layer norms everywhere; this should make training more stable.

But it makes training slower, that's the problem; that is the only problem. But I suggest you put layer norms everywhere. Wherever you can, just shove in layer norms. Yeah, layer norms always work. Okay. And... so why do you add the residual from before? You save the state before the layer norm, then you do the layer norm, and then you add it back in.

So why do we have to do that? Is that your question? Yeah. Why is the normalization in between, rather than... Oh, you could do it before if you want to. You could try it. It depends. As I said, all of this is like gospel: why did we do this?

It's just that people tried it, and they said it works. I don't see why this isn't an issue, I guess. Oh, what do you mean by not an issue? Like, why does this work? Or why is this not a... what do you mean by issue? I mean, the normalization kind of changes the representation in some way.

Yes, it scales it. Yeah. It's just math; it just works. Because if you do that, the autograd engine will still know that you did it, and the derivatives will still be applied correctly. So you can just assume that it works. To be honest, that is another research topic.

You should try that. I mean, I'm being serious: there are so many research questions that are open. Like, why doesn't everyone just put layer norms everywhere? Why don't you put a layer norm after the multi-head attention? Put a layer norm after the SwiGLU.

Put a layer norm after the inputs. Put a layer norm, you know, everywhere. And my hypothesis is it makes training more stable, but it makes it slower. Yes. In theory, could mechanistic interpretability... not solve, but explain this? What do you mean by explain this?

Oh, the mechanistic interpretability where you take the... Do, which method? Do you mean like doing the autoencoder style? Like, sparse... Do you mean like the, uh, what, which method? Like the, um, because I know there's like many methods for mechanistic interpretability, like there's different types of methods. This is the sparse autoencoder one or the...

Well, TransformerLens is Neel Nanda's tool. Okay, I don't use that, so I'm not an expert on it, but... No, but you can see the literal activations. Yes, yes, the activation one. Yeah. So basically... well, I'm not going to divert from this too much, but basically, do you think we can figure out why this works this way?

Like, because you said it's kind of an open question, right? But do you think that using tools like transformer lens, where we can look at training activation, or not activation, but like steps, where in fact we would have to, like, um, I'm not sure if I explained it correctly, but like, do you think mechanistic interpretability is a path to understanding this?

Good question. Could mechanistic interpretability... okay, hmm, it depends. My view is that, specifically on the topic of layer norms, they just make training more stable. I don't think they have any deeper meaning; that's my view. The math equations don't show that it has any meaning, I just find that it stabilizes training.

There was, like, papers like the, what was the one? Batch normalization, I forgot what the term was. Um, yeah, like, there was, like, a view which shows that batch normalization reduces problems of out of distribution data and stuff like that. Oh, reduces internal covariate shift or something. That was the phrase, which, yeah, I don't know what that even means.

But anyways, um, does anyone know what that means? There was, like, a, um, there was, like, a video for that as well. The, yeah, does anyone know what that means? Yes. So, layer norms, my view of layer norms is when you do, if you don't do layer norms, if you keep, okay, let's say you take a number, two, you multiply by two, you get four.

Remember, there are 32 layers, right? If you multiply by two continuously, you will get infinities in the end, right? Because you go out of the range of float32. So what layer norm does is bring your numbers back to a good scale. So if you do two times two is four, let's divide it by four, go back to one.

Right? And so now it's one again. If you times two, it's two again, let's divide it by two again, go back to one. So all layer norm does is it makes a scale go back to a good scale. Like, it doesn't, your numbers don't like diverge on both sides.

That's what layer norm kind of does. Does that make sense? Okay. Any other questions? Okay. So, remember the decoder... oh, I think we actually finished reviewing the Llama architecture; there's nothing else to do. The decoder, right? You do this 32 times; remember, it's "for decoder_layer in self.layers".

You do this 32... I think it was 32, I can't remember... multiple times. That's the decoder: you just apply this multiple times, then do a layer norm, and finally you get your logits. Where is it? Your LM head, right? This outputs the probabilities over tokens.

Remember, we're trying to predict the next token; we output probabilities for each token, and that is called the LM head. And where's the forward function? The forward, right? There's a forward: you go through the model and then, okay.

Remember, ignore this, right? Ignore this. And you do the LM head; that's just one line. One line. And then you do the float. Now, another question people have is: why do you have to do the float? Does anyone know why you have to upcast to float?

Why? Any clues? Have a guess. Why do we have to upcast to float? Sorry? Gradients? Okay, close. Why? Why gradients? It is related somehow to gradients. Anyone have a guess? Okay. It's for training stability purposes. So, the softmax: you should always upcast to float32, because it makes training more stable.

If you take the derivatives and the gradients without float32, you might get NaNs as well. Remember, the exponential can be very large, so you want float32, which has a much larger range than float16. Float16's maximum is about 65,504.

Right? But float32, the maximum is like some large number to the power of 38 or something. 10 to the power of 38. So, that's why you have to upcast it to float32. This just makes training more stable. Right? So, all of these things that we do tricks, it's just to make training stable.

Yes? You said you're doing the operation 32 times. Is that just an arbitrary thing that people figured out works? Oh, that's up to you. So if you want more parameters, you can do it 300 times, up to you. That just makes your model roughly 10 times larger.

So, like, when you see, when you hear, like, you know, llamas. Yeah? Each time you do it, it's generating, like, a weight for the model? Like, each layer, like, each time you iterate, it's going to generate, like, a sort of weight? Is that one of them? So, the weights you train, when you take the tokens, you go through the architecture and, like, it changes the, it changes the tokens.

And these tokens keep shifting in some new direction, and you keep doing this. The trade-off is you'll have to train more weights. So each of the 32 repeats has different weights? Correct, each of the iterations. There will be 32 different sets of weights, one for each layer.

And so, normally, if you see something like GPT-4, what is it, one-point-something trillion parameters, I'm assuming there are more layers: a larger embedding dimension, larger this, larger that, more layers. Normally speaking, the more layers you have, the more the model can learn. So that's the whole reason you want to add more layers.

You just want to increase the capacity of the model to learn. Again, a lot of these choices are just about making training stable. And so, remember the shifting trick that we did? In PyTorch, the shifting trick is just this and this. That's the shifting trick; that's the thing that makes it learn to predict the next token.

And then you pass it through the loss function, the cross-entropy loss, which we discussed. And that's the Llama architecture. The rest is not that useful. So, in theory, you could write the entire Llama architecture in, like, 50 lines or something.
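
As a hedged sketch of that shifting trick plus the cross-entropy loss just mentioned (tensor names and shapes are illustrative, not the exact Hugging Face code):

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq, vocab) from the LM head; input_ids: (batch, seq)."""
    shift_logits = logits[:, :-1, :].float()   # position i predicts token i+1; upcast to float32
    shift_labels = input_ids[:, 1:]            # labels are just the inputs shifted by one
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```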

The rest is just unnecessary bloat, right? All 1,600 lines of comments and stuff like that. But, you know, this is Hugging Face's implementation; it's highly respected, and it's what you should look at first when you read a new architecture. So we just went through the Llama architecture.

Hopefully, you can kind of get a sense of it. Obviously, if this is your first time reading the source code, it's actually not that hard. It's not that complicated. You just have to see which components you can ignore, right? It's not that scary. Yep. Does that kind of get it?

Or do you guys kind of get that feel? We're going to do more; obviously, this is the first one. Any questions? Yeah. Is there a major architectural difference between Llama 2 and Llama 3? No, not really, other than more tokens. I think they changed the embedding size; they did change some of the numbers.

Like how many numbers you use to represent each token. And they made the vocabulary much larger, and trained on more tokens. Other than that, no, there's no change at all. Yeah. Yes. The reason why... it's funny, I used to work at NVIDIA, so why shouldn't I be writing CUDA, right?

The reason is, I find CUDA extremely annoying to write. And if you want to optimize just for NVIDIA hardware, okay, go ahead, you can do CUDA. But my view is, I don't think that's going to be the situation forever. So, as a safety precaution, let's just do Triton, right?

Let Triton handle the compiling down to CUDA or AMD or Intel or whatever, right? Triton can be the intermediary. If you want to get like 10% faster, yes, you should do CUDA. But it's only 10%, right? If you already do fine-tuning two times faster, you're already nearly at the ceiling.

You can only go so much further. And so if you want to go the extra mile, yes, I'm more than happy to welcome you to do that. But I don't know... it's funny, because I used to do CUDA all the time, but I don't suggest it.

You will get more performance, though, but I don't suggest it. Yes, question. So we never had to drop down to CUDA, right? Oh, sorry, what? We never had to drop down to CUDA; you could deal with Triton. Yes, you don't. Yeah. So with Triton, you write it in Triton, then it compiles down to CUDA.

Yeah, sorry. Wait... yeah, actually, it could work. The only reason it doesn't work on AMD is Triton, I think. And xformers. Actually, if Triton works on AMD, we work. If xformers, Facebook's attention library, works on AMD, then we work.

Oh, funny. Then we work. But anyways, it depends on those conditions. So if AMD has those working, then yes. In theory, you can remove xformers and just use scaled dot product attention, so there's only one dependency, which is Triton. I think some people have gotten it to work. So it depends.

Yes. I kind of have an answer to that: I've trained on an AMD Instinct MI300X, one card, with Triton, and it works with AMD. Okay, I mean, if Triton works, then yes, it works. So I just have an answer. Sorry. Okay, good, you answered, yeah. Okay. Yeah, but officially we do not support AMD, but I guess it works.

Okay, that's interesting. Yes. Okay. What's next? Where's the Gemma one? Yes. Okay. So now we're talking about Gemma bugs, specifically Gemma. If you go to our blog post, we actually wrote a blog post about all the issues that we found in Gemma. For example, you must add a BOS token.

There is a typo in the paper. Yes. So we don't just find bugs and, you know, we have to read the paper first to understand the model. Now, the problem is sometimes when people release models, they don't release papers. That is very painful. That happens a lot now. So please, model creators, please provide papers.

Otherwise, it gets more complicated. There's also, like, some other issues. And we have a Colab notebook, which provides all these -- so if you open up the link, Gemma details, in the -- remember, if you don't have access to these slides, it is tinyurl.com/unsloth, right? That's the slides. If you open up the Colab notebook, this is actually runnable in Colab.

Please log into your Google account for this to actually work. But we show the log L2 norm: we check every single layer (there are like 18 layers), comparing the output of the reference implementation, the DeepMind one, against the Hugging Face one, the PyTorch one, and the other ones.

And if you do the L2 norm, you find that the error is very high. And what we showed is that you can actually move the error down by doing multiple changes, right? So each line -- you can see there's, like, multiple lines -- each line is actually a method that we apply to make it better, right?

So, like, we finally found that approximately either the blue line -- I mean, it depends on which one you like -- either the blue line or the black line makes training much better. Does anyone notice any interesting things about this graph? Anything interesting? Like, do you see the -- so remember, each line is a fix that we did, right?

So, like, there's many lines, and we did a fix, and it changes the error. And we selected the black line to be the final one. Does anyone have any -- what is, like, anything interesting? Yes. So one of them caused a huge jump, and that is a float 32 fix that we did for all architectures.

Yes. And the other ones are less prominent. But anything else? Anything else interesting? Yes? Yes. Fantastic. Why? I do not know. And that is a good question, and I don't actually know. I think it's just language -- yeah, I have a theory. The theory is -- Yeah, but unfortunately, I can't say everything.

Like, I mean, my theory is -- and there was also a jump as well in the middle -- and the blue line, you know, it starts from very low, it goes up very high, and everything does this, right? So, like, there is some weird transition boundary in the Gemma model, right?

And so, like, I'm just going to guess. My guess is that when you train a transformer, the later layers get harder and harder to train, right? The earlier layers actually get very easy to train. And so this transition boundary is when the model probably wasn't really trained that well.

So I'm going to guess -- this is just guessing -- that maybe the model should have been trained on more data, and the boundary would disappear. This is just my guess. So there is a phenomenon, essentially, that with more data -- the last layers of the model are much harder to train. And that's kind of my theory.

But I don't think that's necessarily correct. But, okay. And the blue one kind of dropped for a moment, right? Yes. Right before the last one? Yeah, exactly. So in the end -- so now the question is, like, why did we choose the black one, then? Why didn't we choose the green -- the blue line?

So that's adding the exact GELU that we found. So if you add the rope fix plus the exact GELU, you get the blue line. But we, in the end, decided to do the black line. And why do you think that is? We did not choose the blue line. We should have chosen the blue line, right?

But the final -- all the fixes that we did. So essentially, the answer why we did not choose the blue line -- the blue line should actually have had lower error, right? The reason why we didn't choose that is because there was not just one error. There were two errors.

There were many errors, and all of the errors compounded together. We finally chose the black line because it matches the original implementation. Because, remember, the trick is you have to match the original implementation of whatever the Gemma model creators did. So you can't just look at this one error.

I mean, maybe if someone chose different fixes than we did, you could probably get a lower training loss. I guess you could. But we decided to choose the black line because that's what the original implementation did. Any other questions? Oh, I'm talking about the weights. So the model weights are the ones being trained.

All right? So the rest you don't actually train -- it's just the weights themselves. Yeah, it's just a bit better, right? At the end, you end up with the model. And that leads me to my follow-up question. Do you have examples of the training -- like, what are those training-based properties?

Is it iterative, where you're just going to see the training loss over time? Yes. So remember, the goal of a transformer is you want to predict the next word, right? So take the sentence, "Hello, my name is Daniel." From "Hello," you're trying to predict "my." From "my," predict "name." And so on.

You have this data, correct? Like, you have -- just take novels. You shove in the novels. You essentially create the data out of thin air. And then you change these weights using, like, backpropagation -- you do derivatives -- and try to change these weights such that you get the highest accuracy.

And this training procedure is called backpropagation. And so, like, I was trying to show you, like, how do we actually derive the derivatives? When you do backpropagation, you need to derive the derivatives. Just use PyTorch. PyTorch will do the derivatives for you. Um, and -- yes, but that's -- does that kind of answer your question, or?
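As a rough sketch of what that looks like in practice (this is not the talk's notebook code -- it's a toy model just to show the mechanics of next-token prediction and letting PyTorch do the derivatives):

```python
# Minimal sketch: next-token prediction with PyTorch autograd doing the derivatives.
# Toy embedding + linear "model" just to show the mechanics; not a real transformer.
import torch
import torch.nn as nn

vocab_size, hidden = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab_size, (1, 16))    # pretend this is "Hello my name is Daniel ..."
inputs, labels = tokens[:, :-1], tokens[:, 1:]    # shift by one: each token predicts the next

logits = model(inputs)                            # (1, 15, vocab_size)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), labels.reshape(-1))
loss.backward()                                   # PyTorch derives the derivatives for you
optimizer.step()
optimizer.zero_grad()
```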

Yeah, I think this one is working. Okay, yes. So you mentioned that the layers are embedded in the frame -- I know that's why I did the layer-freezing thing -- but is there a way to change the layers?

Yes, the answer is actually yes. So you can actually -- depending on the layer -- so for now, what we do is, for your embedding and your final layer, you can use different learning rates. So we found that if you train the embedding weights and the LM head with a learning rate that's a factor of 10 smaller, you can actually get increased accuracy.
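A minimal sketch of how you might set that up with optimizer parameter groups (the module names "embed_tokens" and "lm_head" follow Llama-style HuggingFace models and are an assumption here; this is not Unsloth's actual implementation):

```python
import torch
import torch.nn as nn

# Stand-in for a real transformer; only the module names matter for this sketch.
class TinyLM(nn.Module):
    def __init__(self, vocab=100, hidden=32):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab, hidden)
        self.body = nn.Linear(hidden, hidden)
        self.lm_head = nn.Linear(hidden, vocab)

model = TinyLM()
base_lr = 2e-4

slow, fast = [], []
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    (slow if ("embed_tokens" in name or "lm_head" in name) else fast).append(param)

optimizer = torch.optim.AdamW([
    {"params": fast, "lr": base_lr},        # everything else at the normal learning rate
    {"params": slow, "lr": base_lr / 10},   # embeddings and lm_head roughly 10x smaller
])
```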

So yes, you should. You should change the learning rates for each layer. But people don't actually do that. I think it's because if you set a learning rate for each layer beforehand, you're kind of introducing subjective bias.

So that's why people just set one learning rate for all the layers. But I think in this case -- I'm just going to guess -- this is transformer-general. This is not just for Gemma; this is for all transformers. Maybe layer-wise learning rates could work.

I think there are some papers which do that. I think it's called LARS -- LARS does layer-wise learning rates. I hope that answered your question. Yes? Oh, it's a log L2 norm. So you take the DeepMind implementation, you code it up correctly.

Then you take the other implementations like PyTorch, HuggingFace, even DeepMind's own implementations. And then you check each layer, the output, you compare it with the original -- the correct implementation -- and check what's the error. And that's the thing that I graphed. And your goal is you want the error to go to zero, right?

So like you want it to go all the way to zero, so like, you know, on the bottom and not like very high. And that's log scale, right? So the error is not like a small number. It's 1,000, right? So like every single line, every single step you go down is a log difference, right?
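Here's a small sketch of that per-layer check (a hypothetical helper, not the notebook's exact code); it assumes you've already collected one hidden state per layer from the reference implementation and from the implementation you're testing, for example via output_hidden_states=True in HuggingFace:

```python
# Compare each layer's output from a reference implementation against another
# implementation, and return log10 of the L2 error per layer (as in the plot).
import torch

def layer_errors(ref_hidden, test_hidden):
    errors = []
    for ref, test in zip(ref_hidden, test_hidden):
        ref, test = ref.float(), test.float()            # compare in float32
        err = torch.linalg.norm(ref - test)              # L2 norm of the difference
        errors.append(torch.log10(err + 1e-12).item())   # log scale
    return errors
```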

Essentially, I logged it. If you did not log it, it would look very bad. But I just logged it. Yeah, does that -- okay. Any other questions? Yeah. This actually happens a lot, very frequently. I think, like, for example, TinyLlama -- someone was training TinyLlama, and training was already 80% completed.

They found a bug in the tokenization. So it happens very frequently. And it depends on what you want to do -- I think it's up to the model creator. If you already spent millions of dollars, would you change the way you're tokenizing?

If you already spent millions of dollars, maybe just keep it -- just train it with the bug, and then you release the bugged version. But it should still work. Hopefully. Yeah. So in theory, OpenAI would have a lot of difficulty shifting if somebody else found an optimized tokenizer or something like that -- they would have trouble shifting to it, because they would have to spend -- You have to retrain everything.

Correct. So you just -- just leave it. If you already spent like billions of dollars, it's probably not a good idea to retrain. So even if it offers like a 2x optimization somehow, they would have to retrain and spend -- Yes, you have to retrain everything from scratch.

But that's why like -- I think like -- that's why you should do like small scale experiments. You know, get like a smaller model, train it for less data, test it, and then see if the accuracy is good, and then you scale it up. Yeah. Any other questions? Okay.

I will -- yes. So there's a notebook. So we show step by step exactly what we did. And if you inspect the code -- okay, the Gemma code is now -- the Gemma code -- if you -- oh, okay. Wait, no, it's modeling Gemma. Oh, okay. Maybe I should just go to HuggingFace itself.

Wait, let me go to -- you can actually find this. If you copy paste this -- right, you edit the -- you go to Gemma, and you go to modeling Gemma. All right. This is -- oh, did I not -- okay. Let me just -- okay. Maybe I typed it wrong.

Did they not -- oh, okay. Maybe I did two Ls. My bad. I always get confused on that. Oh, what is this? Hmm. This is interesting. Okay. This is like new. So -- okay. So all of this -- we wrote, inside the code -- like, you know, how Llama does it -- so we showed, for example, in the code now, if you go to HuggingFace's code for Gemma, I tried to write some comments for it to be more clear on why we are doing this.

And so, for example, the layer norm, right -- you have to be careful about where you upcast and downcast. And we write this in here. Where is it? I think it's -- no, no. Oh, wait. Is it -- no, I'm pretty sure I wrote it somewhere. No, it is here.

Yes. Okay. It's a bit unclear. I need to make this bigger. Okay. It's a bit blurry. But you can see that, depending on the model, in Gemma, you have to actually upcast to float32. Everywhere. You must use float32 everywhere. Because the original implementation used float32. Right? So you must always follow the original implementation.

If you don't follow the original implementation, then you will get wrong, like, you know, somewhat worse results. And the problem was other implementations just copied Llama and Mistral's code. And they did not do this. And so we found that you actually have to upcast correctly over here. Right? You have to upcast immediately.

And then you downcast at the very end. And so we wrote a few comments, right? Llama does x.to(float16), whilst Gemma does it differently -- Llama does that, right, but Gemma does this. So there were, like, small little issues with downcasting and upcasting. Another question is, like, why do we have to do downcasting at all?
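To make the pattern concrete, here's a minimal RMSNorm-style sketch of "upcast immediately, downcast at the very end" (a simplified illustration, not Gemma's actual modeling code):

```python
# Do the normalisation math in float32, then cast back to the input dtype at the end.
import torch
import torch.nn as nn

class RMSNormSketch(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(dim))
        self.eps = eps

    def forward(self, x):
        in_dtype = x.dtype
        x = x.float()                                   # upcast immediately
        x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        x = x * (1.0 + self.weight.float())             # Gemma-style (1 + weight); Llama-style norms just multiply by weight
        return x.to(in_dtype)                           # downcast at the very end
```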

Does anyone know why? Like, why is there always, like, downcasting, upcasting? Float32, float16, float8? Does anyone know why we have to do downcasting, upcasting? Yes, correct. It's for faster speed. So do you know how much faster? Like -- Eight? I don't know. So float32 to float16. What do you think?

It depends. Who said two? Okay, good guess. Why? Why did you guess two? Well, that's for sparsity. Okay. Okay. Yes, okay. Float8, approximately two. Actually, it could be more. So float32 to float16 is actually not two. It's actually much more. I think it's five. I think. Or is it six?

The reason is because the representation of the float is different. Right? So float32 -- I have the floating-point representation Wikipedia page. I think it's in here somewhere. Oh, and maybe go to bfloat16. Where is bfloat16? Brain float. Bfloat? Yes. Right? So, like, there it is. Oh, there's more pictures now.

Oh, they edited this. I did not see this before. Okay, this is new. I didn't see AMD's fp24 format or Pixar's before. Oh, okay. They have, like, weird formats now. This is float32, right? And in float32, the exponent has eight bits, and the fraction has 23 bits.

And when you do matrix multiplication, does anyone know how to calculate the number of transistors you need for float32? Does anyone know? It's a formula related to just the exponent and the fraction. What do you think the formula is? Have a guess.

Right? I said that it's approximate. So if you have bfloat16, the fraction is seven, right? Bfloat16 has 16 bits you can use. The exponent is used for the dynamic range of the number, right? So if you want larger numbers, you have to have a larger exponent.

Right? So this means bfloat16 has a range of roughly two to the power of eight -- okay, it's not exactly two to the eight, but just assume, you know, it's two to the power of eight. That's not quite right, but just assume that.

And this one, float32, also has two to the power of eight. There's another format called float16, which has two to the power of five, and then the fraction component is 10. So all of these numbers you can scale, right? How many do you want for the exponent?

How many do you want for the fraction? You must include the sign bit. And the trick is you must fit in 16 bits. So you could have, like, exponent one, and the fraction could be 14 -- that could also work. But does anyone know how many transistors you need for float16, for example?

And bfloat16? Remember, I said it's like around five times faster. That's actually not right -- I think it's even more. What is the formula? Have a guess. How many transistors do you need to do float16 multiplication, approximately? Or float multiplication in general?

It's a formula related to the exponent and the fraction. The answer is exponent plus fraction squared. That's the answer. So what does that mean? That means float16 is 5 plus 10 squared, right? And float32 is 8 plus 23 squared. So it is not two times faster. It is much faster.

Right? So, like, what is that -- 8 plus 23 squared is 537. And what is the other one? I can't remember. So bfloat16 is 8 and 7, right?

So 8 and 7 -- this is Google's format -- it is 57. So what does that mean? How many times faster? Yeah, so it's actually around 10 times faster. Right? So float32 to bfloat16 is around 10 times faster. Float16 is 5 plus 10 squared, so 105. So bfloat16 is approximately 2 times faster than float16.
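Here's the back-of-envelope arithmetic from that heuristic in code form (rough relative costs only, not real transistor counts):

```python
# Heuristic from the talk: multiplier cost ~ exponent bits + mantissa bits squared.
formats = {
    "float32":  (8, 23),
    "float16":  (5, 10),
    "bfloat16": (8, 7),
    "fp8 e4m3": (4, 3),
    "fp8 e5m2": (5, 2),
}
cost = {name: e + m * m for name, (e, m) in formats.items()}
for name, c in cost.items():
    print(f"{name:9s} cost = {c:4d}  ({cost['float32'] / c:.1f}x relative to float32)")
# float32 = 537, float16 = 105 (~5x), bfloat16 = 57 (~10x), fp8 e4m3 = 13 (~40x)
```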

Although no one really notices much difference, in general bfloat16 is actually faster. Right? So that's why it's not 2 times faster -- it's around 10 times faster. And that's why you must use Tesla T4s, as I said, because they have tensor cores, which do this, right?

The tensor cores do float16 multiplication very effectively and very efficiently. And so do not use P100s, right? P100s do not have this. Yes, question? Yes, float8. So float8 -- there are two formats for float8. Oh wait, I don't think -- it's in Wikipedia.

Float8. Oh, okay. Floating point. It's called E-something-M-something -- oh, I'll just use the minifloat page. They have some. Yeah, there we go. Right. So remember, if you want to have eight bits, you get to decide how many you want for the exponent and how many for the fraction, or the mantissa part, right?

You get to decide. And depending on the company, you know, it's unclear -- there's no standard. So this one's 1, 4, 3: one sign bit, four exponent bits, three fraction bits. So what's that? 4 plus 3 squared, which is 13. So float8 is -- yeah, around four times faster than bfloat16.

But in general, it's not. In general, it's like two to three times. It's not going to be four. The reason is because you're packing so many transistors in -- you also have to handle energy, you have to do the data movement. There are other transistors you need.

So approximately it's two to three times faster. That's float8. Can you go even lower? Why don't we do one bit? Well, you must have the sign though, so you can't really do one bit. So 1.58 bit -- some people have been talking about that. Two bit? Two bit could be possible.

The problem with two bit is that it's problematic when you do two-bit training -- yes, okay. So let's see. Let's do two bit, right? So what do you want to do? How many exponent bits? Zero? Remember, you have to have a sign bit -- that's the most important. So one for the exponent and zero for the fraction, right?

Because remember, the fraction is squared. So it's one plus zero squared -- okay, so it's one. So 10-plus times faster? I don't think it works out that way. Maybe two bit is probably too low. Maybe four bit -- four bit could work. Yes? Oh, that's just because they wanted to do that.

Just for easier calculation. And then, for their TensorFloat-32, they use it. So that's what I would say: TF32 is not 32 bits. It is -- they have it somewhere, NVIDIA -- TF32 is 19 bits. That's the trick. They like to do marketing, and they say it's 32, but it's actually 19. Yes.

That's why it's the same. Okay, any other questions? Someone else raised their hand? Okay. But yes, I was going to say, you can do four bit, right? So four bit is actually in NVIDIA's new GPUs -- the B100s do have four bit. So that is approximately two times faster.

Now, the reason it's not more -- okay, let me just try. Four bit: I think it's probably like two plus two squared or something, so six. Okay. Right. It's not going to be that much faster because, as I said, there are power transistors and there are other transistors.

You can only go so far. Just the jump from float32 to float16 was very large. Yes. Yes, so a quick question. So for example, the one-bit BitNet, that -- 1.58 bit. Yeah. So that would be an example of one bit. So it's a different -- so actually, I had a tweet about this.

1.58 bit and float4 are the same in terms of number of transistors. You'd rather use float4. The reason why is, with 1.58 bit you have to do more manipulation to make it work -- you have to use the straight-through estimator. It's a horrible mess. You'd rather just use float4.

So float4 and 1.58 bit are similar. You get to create your own base model if you replicate the paper. Yes. What do you mean? Which most of us have never done, right? Which would be -- Teknium and Nous Research, they've replicated some of it -- it does work. Somewhat.

Somewhat works. I mean, yeah, it's one bit, but -- 1.58, yeah, it's not actually one bit. I think it's, like, ternary -- three values. They call it one bit. Oh, they like to call it one bit. Yeah. So, but my question is, in theory -- obviously, I don't know who works here, but most of us have never built a base model.

Yes. So -- Well, you could. Yeah, you can with enough GPU power, but one bit, you know, that was -- and they even had, like, a really great tutorial -- but do you think that's just -- I'm just asking your opinion about that. I don't think 1.58 bit will be the future.

I think NVIDIA's focus is on float4. I think float4 might be the final precision -- I don't think you can go any faster than that. I think float4 is the final one, no more. So we won't be having that much faster GPUs.

I don't think so. I mean, float4 is actually -- they don't do pure float4; it's like float6 for the gradients and float4 for the activations. You know, it's very weird. I mean, you could do float3, float2, but it's diminishing returns. So, in ARM silicon, though, there have been advances in super low bits -- Fixed-point stuff?

Is it called fixed-point stuff? Or -- huh? I think it's called fixed-point -- I knew ARM has fixed-point. Oh, well, yeah. So it's -- I mean, just like the Snapdragon X, the new one -- Yes. They have -- so it's, like, customizable as well? Or -- I don't know.

Yeah. Well, okay. So the SDK is broken. You have to pass the -- so this is why you can technically run Mixtral 8x7B on your phone at, like, 20-something FPS -- sorry, TPS -- because you can use UFS 4.0 flash storage and subsequently use that as memory.

And then the -- but the thing is, then you're running at 2-bit precision, which is -- So that's probably why -- if you use 2-bit precision, that's why you have memory reductions. But there are actually papers which show that if you do 2-bit for the MLP plus 4-bit for attention, that's actually the most appropriate.

You can actually do that. So that's not an invalid approach. No, that's not invalid -- actually, it works. I've seen it work -- the Mobius people did that, I think. Yeah. Yes, question. Sorry. Okay. Two kind of related questions about precision. First one is, like, why is the negative bit -- like, you must have -- The sign bit?

Yeah. You don't have to. But it's generally standard practice to have the sign bit. In theory, you don't have to. The only problem is, if you don't have a sign bit, your numbers will be 0, 1, 2, right? But what happens if you wanted to, like, make the model -- like, you're trying to not make the model learn negative directions anymore?

You could do that. I don't know if there are papers -- maybe you should write a paper about that. Train a model on that, and let's see what -- okay, but -- yeah. Well, it's very related. I think it's all bits. It's all bits, right? Yeah. Softmax, you're basically just linearly, like, getting stuff down to a certain number of bits.

There's nothing special about softmax and things that could be, like, exponentially big, sort of going this way, right? Like -- The reason is because -- remember, when you do softmax, you also have to normalize by the sum of the exponentials. And if you do the exponential of 10, you already get some large number, and this probability will take over the entire sum.

Well, but you're not, like, logging it. You're just square-rooting it. No, no. It's the sum of exponentials divided by -- oh, sorry -- the exponential divided by the sum of the exponentials. Yeah, but the biggest exponential dominates the sum, right?

Yes. That's the problem, though. If you do that, then your model's not learning. You're just trying to learn to predict one token. Why don't you just predict that one token, then? Like, the largest one that you did. That's kind of what you -- you're forcing the model to not learn anything.

That is why you have to subtract the maximum. That's a trick that we showed: you subtract the maximum, and then you can reduce the effect of this one token, or this one issue. So it's for numerical stability purposes during training. I don't know if that kind of -- okay, probably that didn't answer your question, but -- okay.
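That "minus the maximum" trick, as a quick sketch in PyTorch (the standard numerically stable softmax, not any particular kernel from the talk):

```python
# Subtracting the row max before exponentiating is mathematically identical
# (the max cancels in the ratio) but avoids exp() overflowing for large logits.
import torch

def naive_softmax(x):
    e = torch.exp(x)
    return e / e.sum(dim=-1, keepdim=True)

def stable_softmax(x):
    x = x - x.max(dim=-1, keepdim=True).values   # the "minus the maximum" trick
    e = torch.exp(x)
    return e / e.sum(dim=-1, keepdim=True)

logits = torch.tensor([[1000.0, 999.0, 998.0]])
print(naive_softmax(logits))    # nan: exp(1000) overflows to inf
print(stable_softmax(logits))   # well-behaved probabilities
```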

Yes. From a performance perspective, how much slower was that layer norm, and do we know whether that's actually more accurate than the other way? That is a good question. To be honest, I do not know. I don't think it changes too much. Layer norms, if you upcast, that's probably, yeah, a small effect.

Small effect. But the reason why you need to upcast is because Gemma did it originally, so you have to do it. Remember, the trick is you must follow what the original implementation does. Any other questions? Or -- okay. There are, like, some other issues which we showed -- okay, it's funny, because it's all about upcasting, downcasting, and stuff like that.

Each implementation does its own thing. Unfortunately, how do you actually analyze this? You have to open three screens up. Okay, I'm too excited. You have to open up three implementations: the DeepMind one, the HuggingFace one, the Keras one. You have to open up three screens, and you see, line by line, what they did, and then you have to guess which one's the correct one.

The guessing part is the most painful, so you have to, like, ask HuggingFace: which one's the correct one? You look at the paper: which one's the correct one? You assume the DeepMind one's correct, and stuff like that. So there's, like, some human component where you have to guess.

Guessing. So that's probably why it can't be automated, right? These error-checking things cannot be automated, because there's a human there who made these decisions. And so now you have to decide which of those decisions they chose. And you can't really automate this away.

I guess you could automate this by doing the methodology which we described: try all combinations and see which one has a lower error. I guess you could do that. But remember, you must have the original implementation first. That is a problem. So there's, like, a chicken-and-egg problem. The rope positions -- so this is the one I was talking about, upcasting RoPE.

This is in all architectures now. You must not downcast RoPE. If you do, you will get wrong results. So previously on the left, right, if you see 8192, 8192, 8192 -- those are the positions. That is definitely incorrect, right? What does that mean? Like, do you know why that's incorrect? 8192, 8192, 8192?

Does anyone know why? Remember, these are positions. Why are they all the same? Like, does anyone know why this is really bad? Essentially, now three words have the same position, right? 8192 is the position for all of them. And what is another big error here?

There's actually one more error. Let's assume the maximum is 8192, the sequence length. What is 8192? It's out of bounds. Remember it's zero-indexed in Python, so it's minus one, right? 8191 is the correct maximum, right? So if you correct this, you get 8189, 8190 and 8191, right? And you can see all the numbers are like this.

So the point is -- remember, the whole reason for this problem is that we're using float16 for faster training. Remember how much faster float16 is? Yes, around 5 to 10 times, something around there, right? That is why you do this. And these are the issues that pop out because of it, right?
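A quick sketch of why those positions collapse when stored in float16 (illustrative only):

```python
# Near 8192, float16 can only represent every 4th integer, so distinct
# position ids collapse to the same value when downcast.
import torch

pos = torch.arange(8185, 8193, dtype=torch.float32)
print(pos)                       # 8185., 8186., ..., 8192.
print(pos.to(torch.float16))     # several distinct positions round to the same float16 value
```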

We're trying to make training faster, but then these issues come up. And the GELU one, which we described before -- this was the first bug that we found. Actually, I think this is the main reason we started looking for bugs: we found that, oh, look, there is this bug in the GELU activation function.

And so the point is: Keras used approximate GELU, the PyTorch version used exact GELU, and HuggingFace also used exact GELU. And the question is, which one is the correct one? Is the exact GELU correct? Is the approximate GELU correct? So what's the difference between the exact and approximate GELU activation functions?

Where is the -- I don't know if they have the exact and the -- that's called flex? Oh, okay. That's night mode. Oh, that's even worse. Okay, whatever. Oh wait, that's PReLU. Where is GELU? Oh wait, no, I have to find the right one. Yes. Right. So the exact GELU is this one, right?

There's an error function. Okay, my thing is not rendering it properly. But essentially what you do is you use Desmos. So what I like to do is I use Desmos, right, and literally plot them together on the graph, right? So like, if you have, right --

y is equal to x over two, right -- you literally type this in. And what is this one? I think you can do the error function. Oh yes, you can. Right, you can do the error function of x divided by square root of two. All right. That's the exact GELU, right?

Now you type in this more complicated formula: x over two, times one plus tanh of -- what was it? -- square root of two divided by pi, times x plus 0.044715 x cubed. Was it cubed? Yeah. Okay. Wait, is that correct? Did I do something wrong? Oh, you're right -- I put the square root in the wrong place. It's square root of two over pi, and the tanh wraps everything. I'll have to play around with this. Oh, there we go. There we go. Right. So the blue line and the red line, right?

They're the same thing. But what's the difference? Remember, in Desmos -- I don't know if people know this -- you can actually do derivatives, d over dx. Did anyone know this? You can actually do derivatives. You type d over dx, and then you can do this as well: d over dx of the other one.

Right. And they generally align -- the derivatives of the exact GELU and the approximate GELU generally align. And guess what? You can also do integration: the integral from minus infinity -- oh, did I spell it wrong? -- minus infinity to infinity, right? I think this works; I'm not a hundred percent sure. Right, you take your exact GELU, and you subtract the other one.

Oh, I don't think this works. Yeah, I don't think so. Oh yes, it works. Yes, it works. So what you do is you take the integral from minus infinity to infinity -- so over the entire line -- of the exact GELU minus the approximate GELU, dx.

And there is a difference, right? But the difference is very small -- it's like 10 to the minus 16. It's very, very small. And when we write fast Triton kernels, I generally use this feature. So you can do integration and derivatives, and, you know, you can use Desmos.

So I highly recommend Desmos. And if you do this, that's where we found the problem -- it's like, oh, okay, there is some sort of issue. And if you fix it, remember, the GELU fix does have some effect. But remember, we showed there was only a very small effect.
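The same check done in code rather than Desmos, as a quick sketch (not from the talk's notebooks):

```python
# Exact (erf) GELU vs the tanh approximation.
import math
import torch

x = torch.linspace(-6, 6, steps=10_001, dtype=torch.float64)

exact = 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))
approx = 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

print((exact - approx).abs().max())                        # worst-case gap is tiny (well under 1e-2)
print((torch.nn.functional.gelu(x) - exact).abs().max())   # PyTorch's default gelu is the exact erf form
```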

So the GELU fix alone is not that impactful. The RoPE fix was the most important, right? The RoPE bug actually caused issues, so you must fix that. That's the most important fix that you must do. And finally, there are some other things we do: depending on the precision that you use, there is actually a difference between float16 and bfloat16.

And if you do this, we show that with float32 -- remember, we showed before that with the fixes we did, the lines sometimes go back up, right? But actually, if you do float32, it does work. If you do float32 positions, the lines don't separate.

But once you use float16, the lines then mess up again, right? And with bfloat16, the lines mess up again, right? So this is just a phenomenon of using faster, smaller-precision positions. And that is why you have this problem. But if you do use full precision, you get good results.

And the fine-tuning notebook for Gemma also works. So Gemma is two times faster and uses, I think, 60% less memory as well -- it's more now. So if you run this, remember you have to connect to your Google account and you will get this to run.

Any questions on the Gemma one? Okay. Yes. Yes. Um, where did I put the picture? Oh, it's in the blog post. Um, yes, that's fine. Um, wait, where did I put it? Oh, it's the first picture. Right. Yeah. This one, right? So the x axis is the layer number.

So Gemma has 18 layers, so the x axis just indicates which layer it is. The y axis is the log L2 norm. So what you do is you take the original implementation, like DeepMind's implementation, you take HuggingFace, PyTorch, the other implementations, and you check the output of both of them.

So you run the model through, you take the output of layer one and the output of layer one from the other implementations, and you just find the error. And so this is just the error, and this is log scale. When it's log scale, it looks better; when it's not log scale, it looks very bad.

So does that -- is that better? Yes. Output of each layer. Yes. So that's for Gemma. For Phi-3, it's similar. What you do is you open up the Phi-3 implementation and you read through it. And because you guys can most likely go through Llama and just look at it in general -- remember, delete useless parts of the code -- you will see there are differences in Phi-3.

And the differences are: they use other methodologies, they use upcasting, they use stuff -- but there was a weird thing that we found in the config file. I will show you: the Phi-3 config. Okay, I'll just use the instruct version. When you go to new models, always read the config file, right?

Config.json. When you open it up, it tells you all the tricks you need to know about the model architecture, and I highly recommend reading it. It tells you: what is the EOS token ID? 32000, right? When you look at this -- hmm, is that a good idea? 32000 here, 32000 there. What is the EOS token ID?

Right, 32000. Okay, that's fine. The pad token? No. Hmm, is that a good idea? You have to think about why these are there. How many layers does Phi-3 have? It's 40, right? So 40 layers. How many positional encodings does it have? So how long -- what is the context length?

It is 131072. That's the context length. So this model, the Phi-3 medium, is 128K, right? But it's not 128,000 -- just be careful. It's actually 128K as in 128 times 1024, which is 131072.
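If you want to check these fields yourself, here's a quick sketch (the model id is just one example of a Phi-3 checkpoint -- swap in whichever one you're inspecting; older transformers versions may need trust_remote_code=True):

```python
# Pull a model's config straight from the Hub and inspect the fields discussed above.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
print(config.eos_token_id, config.pad_token_id)   # EOS / pad token ids
print(config.num_hidden_layers)                   # number of transformer layers
print(config.max_position_embeddings)             # context length (e.g. 131072 = 128 * 1024 for 128K models)
print(getattr(config, "sliding_window", None))    # the sliding-window value discussed next
```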

There are other issues with this model as well. Okay -- probably don't use the instruct version here; sorry, choose the smaller version. This is a smaller version. There is a thing we noticed: the sliding window.

So Mistral has a sliding window -- the sliding window essentially attends to only 2048 tokens, and this just makes training much faster. Does anyone notice what the problem is here? Why is it 2047? Anyone notice any issues? Yes -- well, it's not a power of two, right? Correct.

So is that weird? I mean, that's horrible. Yeah. So I did ask the HuggingFace people and they said, yes, it is a bug. So they actually did fix it, but then I don't know why they reverted it back. So I'm a bit confused. They kind of forgot about this.

Yeah. So it's actually supposed to be 2048. Yeah, because that's the only thing that makes sense. Because if you're training on the correct context, right, then this sliding window value makes no sense. In fact, I've seen a lot of sliding window approaches. Yeah.

So yeah, I'm not sure why, but I'm pretty sure this should be 2048. Yeah, I'm very confident -- what I would say is 2048. And yeah, these small issues need to be fixed. They still have not fixed them.

But in Unsloth, it's fixed. So we actually uploaded models which fix them, right? So if you go to the Unsloth HuggingFace repo, we actually have models where we fixed all of them. Oh, this is too big. Where's the Phi one? Oh, I didn't put it up.

Okay, I need to find the Phi one now. Where's Phi? Oh, there: Phi-3 mini 4K instruct, right? If you go to files, you go to config.json, we fixed it. And there are other things that we did to fix it. For example, look at the pad token ID.

Okay. That's actually wrong. Okay, I need to fix my own config here. Anyways, there is a bug which we discovered ourselves -- this one is actually wrong. Another thing is you must not make the pad token the same token ID as the EOS. Never, never, never, never, never.

This must be a different token to the EOS token. In Unsloth, we automatically fix this during loading. It's just the config itself that's not right. But that's okay -- Unsloth itself is fine, just the config is a bit wrong. Oh, okay. I found my own bugs, but okay.

Yes. So, okay. Oh, yeah -- I'm not going to slow down; I'll keep going because there are a lot of slides. Okay. Actually, there aren't that many slides. Okay, actually, there are. Oh, okay. I just noticed we have more.

Okay. So another one is what Phi-3 did: they merged the Q, K and V. Remember we talked about Q, K and V -- they're normally unmerged, right? The weights are separate for the attention matrices. Phi-3 did a very interesting move: they fused them into one matrix.

And we found that to be very problematic for fine-tuning. Because if you fuse them together, when you do LoRA adapters, you actually learn fewer extra weights, and that's much less. So please unfuse them. And we do this -- our version of Phi-3 actually unfuses the weights.

You must unfuse. Actually, I highly suggest you unfuse the weights. You should only fuse them if you want training to be faster -- this will make training maybe five percent faster. It's actually not that much; it's like two percent. And you actually increase memory usage a lot. So just be careful of that as well.
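A back-of-envelope sketch of why a fused QKV matrix gives you fewer trainable LoRA weights than separate Q, K, V projections (the dimensions are made-up Llama-style numbers, not Phi-3's actual ones):

```python
# LoRA adds an A (r x in) and B (out x r) pair per matrix, i.e. r * (in + out) weights.
hidden, kv_dim, r = 4096, 1024, 32   # hypothetical hidden size, grouped-query KV size, LoRA rank

# Separate projections: one LoRA pair per matrix (Q, K, V).
sep = sum(r * (inp + out) for inp, out in [(hidden, hidden), (hidden, kv_dim), (hidden, kv_dim)])

# Fused qkv_proj: a single matrix mapping hidden -> (hidden + 2 * kv_dim),
# so one shared rank-r adapter for all three.
fused = r * (hidden + (hidden + 2 * kv_dim))

print(sep, fused, sep / fused)   # separate gives ~1.8x more trainable LoRA weights here
```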

So just be careful of that as well. Um, oh, yes. They actually did. So this is the sliding window one. They actually fixed it. Um, and then they unfixed it. I think they just forgot about it. I'll probably like push them again to fix it. Um, and this is the fusing of the weights.

So we show that you should actually unfuse the weights -- Q, K and V must be separate. You must not combine them. If you combine them, you actually get lower accuracy, so please do not do that. For tokenization, remember this slide which I showed you, where the smiley faces are, like, the spaces, and each one is a different tokenization.

There are actually many issues with tokenization. This is a totally separate topic from finding bugs and issues in language models -- it's a whole topic of its own, because tokenizers are very problematic, and the issues are very hard to find and fix. Did I double this slide?

Okay, I doubled that. And also we have new Ollama support, which we have not announced yet, which you can try out. Lots of people have asked us how to actually fine-tune a language model and export it to Ollama effectively. Does everyone know what Ollama is, or does anyone not know what Ollama is?

Okay. So Ollama is like an interface. When you fine-tune a model, you have to run it, right? You have to run the model somewhere. And Ollama just makes running the model much easier. So, like, you know, ChatGPT -- ChatGPT is like the running mechanism. Ollama is just like ChatGPT, but they don't ship the model.

You have to select a model. That's kind of what Ollama is. Yes? How did you manage to -- so I've been working on creating Modelfiles using an automated pipeline, but we've found many issues trying to automate Modelfiles. Is this using Unsloth, though?

Or is it using Axolotl or some other one? Did you automate the Modelfile yourself, or? Well, yeah, because we need our own Modelfile, right? Oh, so we do this automatically now. With Unsloth, I spent like one month trying to automate the Modelfile creation.

That's why we were struggling so hard as a company. Yes, I have code for that somewhere. Yeah -- okay, this is open source. Oh yeah, it's already in the GitHub repo. If you go to Unsloth, you go to chat templates, we have code for that, for Ollama.

It is still very ugly. So these are the chat templates -- remember the BOS token, someone mentioned you have to add it; yeah, add the BOS token. This is the Ollama chat template. Ollama has a specific requirement that you must have a chat template, because if you don't use the correct chat template, your model will output incorrect, like, substandard responses.

So this is the chat template for some of them. We had to write chat templates for all of the architectures, and we have an automatic one. So these are Vicuna and so on, Alpaca style, Gemma style -- we also have that.

We have many, many -- even the Llama 3 chat template we have as well. Now, for the automatic one: what we do is we can actually make a chat template, a Modelfile, for you automatically. And this makes your fine-tune much more accurate. Wait, I'll show you -- where is the code for that?

Where is the code? Okay. You can see the code is quite large just for the chat templates, right? This is just for tokenization. So it's not even the -- yes, this is Apache 2.0, right? Yes, it's Apache. Yes, it's open source. Yeah. Wait, where is it? Okay.

So we have something called parse combined prompt, which does something O(N squared). I didn't actually optimize this -- it's O(N squared); I should have done O(N), but anyways, it's O(N squared), checking the prompt. Here's the prompt format. The code for automatic Modelfile creation looks quite ugly, but we actually made it so you can automatically create a Modelfile from your chat template.

You can see it's quite ugly, but it works. Oh, it's even more ugly. Yup, it's quite ugly code. But unfortunately the Modelfile is very hard to create automatically. And so we have the notebook, which allows you to do this. So this notebook is in here, for Alpaca.

So this one's for the Alpaca dataset. And so this is our installation, Llama 3. Where is it? So we'll be using the Alpaca GPT-4 dataset -- you use the Alpaca dataset and you use GPT-4 to create the dataset. And the trick is, we also support CSV files now, so you can actually upload a CSV file and use Unsloth directly to fine-tune the language model.

But the problem is, a language model must have an instruction and an output, right? Only two columns. CSV files and Excel files can have many columns. So what do you do? You have to merge the columns into one. So remember, each of those columns in your Excel file, you convert them into text.

And for example, with the Titanic dataset, you merge them to say they have one sibling or spouse aboard and so on, right? You merge the columns into one row of text. And that's what you do. With Unsloth, you can do this now. I still probably need to edit the syntax, but this merging technique says, okay, your first column is called the instruction column.

And the double brackets mean it's optional. So if the input column exists, then it will say the instruction, followed by "your input is". And you can make this very crazy -- you can do as many columns as you like. I don't know if the syntax is the best, but I will probably be editing this.
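Here's a rough pandas sketch of the merging idea itself (this is not Unsloth's actual syntax or API; the Titanic column names are just the standard Kaggle ones and the file path is hypothetical):

```python
# Collapse several CSV columns into one instruction string plus one output column.
import pandas as pd

df = pd.read_csv("titanic.csv")    # hypothetical path

def make_instruction(row):
    parts = [f"The passenger is {row['Age']} years old"]
    if pd.notna(row.get("SibSp")):                      # optional column, like the [[...]] syntax
        parts.append(f"has {row['SibSp']} siblings/spouses aboard")
    parts.append(f"and travelled in class {row['Pclass']}.")
    return ", ".join(parts) + " Did they survive?"

dataset = pd.DataFrame({
    "instruction": df.apply(make_instruction, axis=1),
    "output": df["Survived"].map({0: "No", 1: "Yes"}),
})
```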

We're going to make a YouTube video to talk about this. This is actually very important for fine-tuning. We noticed that every single provider requires you to use only one instruction column and one output column. Now you can have infinite columns -- well, however many you like -- but you must define the chat template.

And we also have a customizable chat template. So before, when you do fine-tuning of language models, you had to use the Alpaca prompt in our other notebooks, right? "Below is an instruction that describes a task, paired with an input," blah, blah, blah. You put your instruction here, you put your input here, and you put your output here, right?

But notice, what is the problem with this? Is there a problem with this? So you must only put one instruction and one output or response, right? The input is the problem, right? So how do you solve this? You solve this by merging the input into your instruction prompt, right?

So this actually should be removed entirely, right? And your input should be something else. And we can do this now, right? So you must put an input and you must put an output, right? You can only use two columns now -- but remember, even though you can only use two columns, you can use this merging to convert your dataset into two columns.

Yes? Do you lose any of the semantic meaning, though? Oh, no, I don't think so. You don't think so? No, I don't think so. It depends on how you format the dataset. Remember, it's a language model, so the more you tell the language model what to do, the better.

Of course. Yeah. But the thing is, to do the Modelfile creation, you must do two repetitions of this, right? You must do instruction, response, and then you do another instruction, response. You must do this for Unsloth. I found this to be very, very important for the Modelfile creation.

If you do not do this, you get dangling newlines, and you actually make your model output terrible. So you must do two repetitions of this -- it's a must, must, must. And if you don't do that, we'll error out. And once you do this -- we also have examples; for example, this is Llama 3's chat template, right?

We again do two repetitions -- you must do two repetitions, most important. And when you finish training the model, remember you can do Runtime, Run all. You can do inference now, right? Continue the Fibonacci sequence: your input is 1, 1, 2, and so on, and the next Fibonacci number is 13.

I think that's correct. Yes, that's correct. So your language model has learned how to do Fibonacci. And because it's a chat template, you can also shove multiple messages into the model. So this becomes a ChatGPT for you -- a customized ChatGPT that you can use.

And finally, when you want to save the model, you can save it to LoRA adapters. This is only 100 MB in size -- so once you fine-tune the model, you have 100 MB. But some people also want to merge the model back, and that will take 16 GB.

But you must merge it for Ollama support and GGUF and stuff like that. And what we show for Ollama support is: you first have to install Ollama, then you select how you want to save the model to GGUF. So this is 8-bit.

We now support multiple quantization methods, right? You don't have to do 8-bit; you can do 4-bit, 5-bit, whatever you like. And these will be saved in one go, much faster. In fact, I think this will save you like 20 minutes of your time.

And we save this automatically. Okay, and this does all the saving, blah, blah, blah. And you see, we automatically create an Ollama Modelfile using your chat template. And I can verify this is actually correct, because I tried it. And then when you want to serve it, you can actually print out the Modelfile which we created.

And this is the Modelfile. Whoops, I pressed run already. Anyways, finally, to serve it, you can just use the Modelfile. And we do have a CSV version, so you can actually use the Titanic dataset.

Okay, it's loading. So if you want to use the Titanic dataset, you can upload it, right? I upload the Titanic CSV; you can use the CSV file for this. And again, you have the merged columns and so on, right? This is a more complicated example.

In fact, I should provide this entire example for you, for the whole Titanic dataset, merging all the columns into one. And it's the exact same output. So those are the new notebooks that we're sharing. We did not release this yet, so this is for you guys to experiment with and see if there are any issues.

Yeah. And just tell me -- we also have blog posts on our website, which you can see, and our Unsloth GitHub repo. And we have stickers available -- they're very, very cute, for you to take. Thank you. Yeah. And also, yeah, we have Q&A now.

Yeah. Yes. Oh, the problem is, if you put them in JSON format, you still need to have instruction and output. So how would you do that? You need to have only two columns for fine-tuning. Can you do, like -- in your template here, you have instruction, and then you add all of the other columns onto it?

Can you just put the same instruction and then the JSON of the values of the columns? Yes, you could do the JSON. Yes, you can. But we just show you that you can do multiple columns now. So if you have, like, 10 columns, you can now make the 10 columns into one by merging them together.

Does that kind of -- is there a big difference in representing those merged columns as an English sentence versus, like, a dictionary? Oh -- you mean, like, you shove in the actual dictionary for fine-tuning? You could do that, but I think you should use English language, because a language model predicts the next word.

JSON is probably less useful; always convert it into English. Research paper? Yes, there should be another research paper. Yeah. Any other questions? Yes. There are a lot of upvoted questions for me in the chat. Sorry. Oh yeah, I didn't actually check the submitted questions. Whoopsies. It actually didn't load.

So -- oh, there's lots of questions. Okay. I will need to answer each of them afterwards. I think I'm already out of time though. So, yes. Thanks a lot. Bye.
