
A little guide to building Large Language Models in 2024


Chapters

0:00 Intro
0:59 Workflow for LLMs
1:17 Data preparation - intro and good recent resources on data preparation
5:28 A web scale pretraining corpus - goals and challenges
11:29 Web scale data sources – Focus on recent datasets
18:01 Language and quality filtering
24:34 Diving into data deduplication
27:40 Final data preparation for training
31:31 How to evaluate data quality at scale
36:29 The datatrove and lighteval libraries
38:18 Introduction to modeling techniques for LLM training
39:09 When the model is too big: parallelism
40:00 Data parallelism
41:18 Tensor parallelism
44:38 Pipeline parallelism
47:00 Sequence parallelism and references on 4D parallelism
47:52 Synchronization: GPU-CPU and GPU-GPU challenges
52:14 Flash attention v1 and v2
56:23 Stable training recipes
59:12 New architectures: Mixture-of-experts
63:13 New architectures: Mamba
64:49 The nanotron library
66:15 RLHF in 2024
68:23 PPO, DPO and REINFORCE
71:23 Quantization, speculative decoding and compilation: overview and resources
74:36 Sharing your model, datasets and demo – final words

Transcript

Hi everyone. So two weeks ago I gave a graduate class here in Amsterdam to 200 PhD students about how to build, how to train a large language model from scratch in 2024. In this talk I tried to highlight the dark secrets, the things people don't talk about a lot but that are crucial to getting good-performing large language models, and maybe also to highlight a bit what is more hype than reality.

And when I shared the slides afterwards there was a lot of interest in them, so I decided to re-record the talk and post it on YouTube as well. So here is our little guide to building a large language model in 2024. In this talk I'm going to cover three main parts - training, fine-tuning, inference.

I think for fine-tuning and inference you can already find very good recipes, blog posts and explanations online, so I'll really spend most of my time on training, which is the part that is still mostly dark science today, I would say. In training you have three parts - data preparation, efficient training techniques, evaluation.

It's the same here, I'll spend most of my time on the first part, data preparation, because that's really the secret sauce I want to highlight today. So let's dive right in. You can believe me, or you can believe much smarter people at OpenAI or Anthropic, when I say that basically the most important part of your training is the dataset.

So I really like this blog post from James at OpenAI, which highlights how, by training many, many architectures, he basically found that in the end they all converge to roughly the same behavior, which is determined fully by the dataset. What he says is this: the "it" in AI models is the dataset.

Basically, model behavior is much less determined by architecture or hyperparameters than we think, and much more by your dataset. He actually says it's your dataset, nothing else. Well, I'll still talk about architecture a little bit. And Amanda, a researcher at Anthropic, basically said the same thing last week when, asked whether this emergent behavior comes from the data or from the model, she tweeted: "none of us has ever magically pulled anything out of the ether, it's all coming from the dataset".

So if you're more into YouTube than Twitter, there is a nice video by Rutger Bregman that jokingly summarizes all of this; let me play it. It's the video I think about when I read all the tech reports that only talk about model architecture and don't say anything about the data.

I mean it feels like I'm at a firefighters conference and no one's allowed to speak about water. I mean this is not rocket science. I mean we can talk for a very long time about all these stupid philanthropy schemes. We can invite Bono once more but come on, we got to be talking about taxes.

That's it: taxes, taxes, taxes, all the rest is bullshit in my opinion. So basically, for us, we've got to be talking about data. Data, data, data, all the rest is bullshit in my opinion. Now that I've kind of set the scene, let's dive into what I mean by that.

Thanks. Another nice recent paper, I think, is the Yi paper. If you've been following the field you've probably seen that many Chinese teams have trained very good models recently, and the nice thing is that they also have a very, very good tech report.

Much better than what we have in the Western world, I would say, where everyone is now very shy about sharing anything. The Yi models are very good models if you look at the benchmarks, and when training them, they say their underlying assumption is that when you train on extensive data of high enough quality, a standard architecture can exhibit advanced capabilities.

So basically you don't need, for now, to go look beyond transformers, or maybe, as I'll talk about later, beyond slight extensions like mixture-of-experts. If you have very good data, just spend the time carefully crafting your dataset and, for now, stay on one of the simple architectures we use today.

There are extensive resources, as always. I could have cited 20 papers, but I try to keep a small list of resources so you can actually read them. I think these four are nice recent examples. The survey on data selection for language models by AI2 is very nice.

The paper I just mentioned by the Yi team is really great, and two recent datasets that were open-sourced and shared a lot more about how they were built are the Dolma dataset from AI2 and RefinedWeb. A nice thing about RefinedWeb is that I'm working with Guilherme, the lead author, at Hugging Face, so we'll have much more news to share about this kind of dataset; I think it's very nice work.

So you can use data for many things; when you talk about data you're actually talking about various types of data. You can use data for pre-training your model, for instruction tuning, and for alignment, which is when, after pre-training, you want to align the model so it learns to exhibit the nice behavior you want.

In particular a dialogue behavior, which is the one we often want when we interact with these models. You can also have data more for in-context learning, for RAG training, for retrieval training, and I would say each of these aspects has different goals and requires different data.

So as a rough idea, for pre-training you really want maximal diversity. You should assume your model just has no way to generalize: if the behavior you want at the end is not in the pre-training data, there is no way the model will discover it.

You have to put it in the training data. For alignment it's quite different: you want very clean data, because you're training your model to exhibit some specific behavior. You want the model to be very good at, say, function calling, or very good at dialogue.

So you want the model to really learn this behavior. These datasets can usually be much smaller and much more carefully cleaned. In pre-training you will want some noise, so your model knows about noise. In particular there is a debate about whether you should use no toxic data, or no bad-language data, at all.

Right now, I think the main approach people take is to include a decent amount of toxic data, so that the model has already been exposed to it. It's a little bit like your kid, if you want: if you want to tell them that drugs are bad, they first have to know what drugs are.

You cannot really expect them to learn that this is something they shouldn't touch, shouldn't be using, if you don't tell them what it is. It's the same for language models in some way: we want them to be exposed to a small amount of this data so that they can later learn to avoid it and know what they need to avoid.

Basically, assume there are no generalization capabilities in this model: if you want to teach it anything, positive or negative, you first have to put it in the model. So let's talk about the pre-training stage. I already covered it a little, but basically you want maximal coverage, you want to cover everything.

So you will train on a massive quantity of text, at least 1 trillion tokens nowadays, and I think you probably want to aim for more like 10 trillion tokens. The challenges you want to solve here: you want to maximize diversity and coverage, and you want to maximize quality as much as possible, because this is still something your model will learn.

So if your model mostly learns noise, you will still get noise out. You want a little bit of noise so the model is robust to it, but you don't want too much of it. Here is one example; a good rule of thumb is that you want your model to know two things.

You want your model to know the things you may want it to generate at the end: if you want it to generate knowledge about physics, you need to put that in. And you also want your model to learn the things it might be exposed to.

So you want your model to be familiar with what users might input. If user inputs might be noisy, your model should still be trained on that kind of noise; otherwise it will be out of distribution, and as I said, the safest bet is to assume your model doesn't generalize at all.

The main challenges here are maximal diversity, good quality but still a little bit of noise, and data quality evaluation: how do you measure data quality at the billion-token scale? That's what we're going to talk a little bit about as well. So here is the typical pipeline to train a model.

So you start with collection; I'm going to talk a little bit about that. You want to filter by language, deciding which languages to keep, and then you have a set of filters. There are basically two main types of filters: filters that are heuristic, so rules that you wrote yourself, and filters that are more like ML models.

So you have a model trained to identify good-quality text. Usually you want to combine the two, and then you have a set of filters that are more semantic. The rules and the ML models usually operate a bit more at the surface level, and then you want to make sure you really cover the topics you need to know about.

If you want the model to know about physics, about technology, you want to be sure these are in, so you have a step of topic filtering to make sure you extract these topics well. This is another example, from RefinedWeb; the first one was from Yi.

This is from RefinedWeb, just to show you how much data gets removed. You start from Common Crawl, which is basically the internet as crawled over the last 10+ years, you filter that, and you can see there is a lot you will remove. First, I would say, language filtering.

If you only keep English: English is roughly half of the internet. The second biggest language is usually Russian, and then you have all the others in Common Crawl. So you basically remove half of it when you filter for English only, and then you also remove a lot through deduplication.

Why do you want to remove duplicates? We'll talk a little bit about that later, so wait for it. Then you do some more extraction and filtering, and in the end you end up with about 10% of the original Common Crawl size. So if you want to get a trillion tokens, you really want to start from a very large source.

This is another example, this one from the AI2 survey. It's roughly the same steps you see here: language filtering, some heuristics, some of what they call data quality filtering, which is usually machine learning based, some deduplication, and then topic filtering. So where can you start from?

As I said, you want something very large, because you'll only keep about 10% of it. There are two main large sources of data today. One is Common Crawl, which is basically the internet; the other one is more for code, where you usually start from GitHub or something like Software Heritage, a place where code has already been carefully extracted from the web.

You can use some curated sources like Wikipedia or books, and with books you have this big question: should I use only public-domain books, which usually means books older than about 100 years, so up to 1924 as of today, or do you want to dive into the more complicated copyright questions?

That's the big question for model trainers today, I would say. And you have more recent trends like synthetic data generation, where you basically ask an LLM to generate data specifically for you, and because you're essentially paying compute for data, you can scale this quite a lot.

So there is a whole new trend around this, spearheaded by Microsoft and the Phi models, which were trained on billions of tokens of synthetic data generated by GPT-4. I think it's quite interesting, because you can craft the dataset in a more controlled way: you can say, I want this topic, this topic, this topic, this behavior, and given the quality of large language models today, the quality of the resulting data is actually very high.

There is even a recent, interesting paper from Apple about rephrasing the web: you take one page and ask an LLM to rewrite it cleanly, and if you train on this data, which is very clean but still covers a lot of diversity, you can actually train three times faster because you use three times less data.

It's very recent, but it's super interesting. Okay, let me talk about these resources in a bit more detail, because we've been releasing datasets for this at Hugging Face and I want to show you a little of what we released. I'll go in reverse order, so I start with synthetic data.

Loubna, Anton and Leandro at Hugging Face recently released a dataset called Cosmopedia, which is a synthetic dataset of 30 million samples, billions of tokens, generated with one of the best open-source models today, Mixtral Instruct. Here you can see how the generation is controlled through various seeds: basically, we give the model a small topic, or a sentence from a document, you can choose where this seed comes from, and you ask the model to write content on that topic from this seed sample.

So we took some very clean sources, like Stanford's open courses, OpenStax (open textbooks), Khan Academy which you may know, and also some web data, so more diverse data, and even instruction-tuning datasets. And then you can also ask the model to write in various registers: to write a textbook article on this topic for college students, or for high school students, or in various styles, like a blog post about the topic. So you can actually get a lot of diversity even though it's synthetic.
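
Just to make this concrete, here is a minimal sketch of what such seeded generation could look like. The prompt wording, the audience and style choices, and the model used below are my own illustrative assumptions, not the actual Cosmopedia templates or setup (Cosmopedia itself used Mixtral-8x7B-Instruct at much larger scale).

    # Hypothetical sketch of Cosmopedia-style seeded synthetic generation (not the exact prompts used).
    from transformers import pipeline

    # A smaller instruct model is used here so the sketch is cheap to run.
    generator = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta", device_map="auto")

    seed = "Newton's second law states that force equals mass times acceleration."
    audience = "high school students"
    style = "textbook article"

    prompt = (
        f"Write a {style} for {audience} about the topic illustrated by this extract:\n"
        f'"{seed}"\n'
        "Be factual and self-contained, and do not copy the extract verbatim."
    )

    out = generator(prompt, max_new_tokens=512, do_sample=True, temperature=0.7)
    print(out[0]["generated_text"])

Varying the seed source, the audience and the style is what gives you diversity control even though everything is model-generated.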

Here is a quick example of all the clusters; you can do topic clustering to check your coverage. What we discovered is that we could still cover even more clusters, and the work on Cosmopedia 0.2 right now is to extend this to even more clusters and get more coverage. Here you can see that we trained a 1-billion-parameter model on this to show the performance, and it's really competitive with web datasets while being much smaller, and I would say it can be even better with more coverage. So stay tuned for Cosmopedia 0.2, coming in April.

If we move on to code data, there was a very nice release earlier this year: StarCoder2 and The Stack v2. The Stack v2 is really the largest code dataset out there prepared for large language model pre-training: more than 3 billion files in 600 programming languages, and in total roughly 1 trillion tokens.

To get all this data we didn't crawl ourselves; we partnered with a non-profit foundation called Software Heritage, which has been focusing on archiving all the code out there for about 10 years now. And when you have gathered a dataset like this, there is a question: do you sell it to private, closed-source companies, or do you partner with an open-source player to train an open model on it? So Software Heritage reached out to us to partner on the training of a new open-source code generation model called StarCoder2, which you can use as well and which is one of the best code completion models out there today. It's actually a very large open collaboration, you can see all the authors there, mostly led by Hugging Face and great people at ServiceNow.

So really go check this out if you're interested in code data; it's by far the largest and cleanest dataset out there for this. On web data, as I told you, we've been working with the lead author of RefinedWeb to put a very large and very high quality web dataset out there, basically a filtered Common Crawl, so people can start their training from a high quality dataset. This should also be out at the beginning of April, maybe even earlier, so just stay tuned.

So now that we have our data sources, we need to filter them. For filtering by language, I would say keep it simple: fastText from Meta is just great, so just use fastText. It works pretty well and it covers all the languages you may want to filter.
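
As a concrete sketch, language filtering with fastText's language identification model can look roughly like this; the 0.65 confidence threshold is an arbitrary choice for illustration that you would tune on your own data.

    # Minimal sketch of language filtering with fastText's language ID model.
    # Download lid.176.bin from https://fasttext.cc/docs/en/language-identification.html
    import fasttext

    lid_model = fasttext.load_model("lid.176.bin")

    def keep_english(text: str, threshold: float = 0.65) -> bool:
        # fastText expects a single line of text, so strip newlines first
        labels, probs = lid_model.predict(text.replace("\n", " "), k=1)
        return labels[0] == "__label__en" and probs[0] >= threshold

    docs = [
        "The quick brown fox jumps over the lazy dog.",
        "Съешь ещё этих мягких французских булок.",
    ]
    english_docs = [d for d in docs if keep_english(d)]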

Now that we've filtered by language, we want to start cleaning our datasets, and there are basically two ways to do that: heuristics and ML-based filtering.

Let's start with the heuristics. The idea is that you count items: if your document only has, say, two characters per line, it's probably just a bad list or something you don't really want in your large language model. As a reminder, you only want to keep things that are either something your model should ever generate, or something your users might plausibly input. So very long repetitions of a single character, a strange ratio of alphabetic characters to punctuation, all these statistics you can extract are ways to easily filter documents. The nice thing about heuristics is that you know what you're filtering out: you wrote the rules yourself, you can set the thresholds by inspecting the data, and you have very clear control over what you're removing from your dataset. It's controlled, it's robust, you know the prior. The drawbacks are that you're only relying on surface-level signals, you're not looking at the meaning of the document, and you may remove too much: sometimes you think you're just removing bad lists, but maybe some of these are good lists your users might want to input into your model. One way to be a bit more flexible is to use stochastic removal: instead of a one-off binary choice, you sample, and you keep a little bit of the noisy data. Another drawback is that you need to carefully tune your hyperparameters here, the statistics and thresholds you filter on, and that can be a time-consuming process.

The other way to do quality filtering is machine learning filtering. Here you have a set of good examples and a set of bad examples, and you train either a classifier or a perplexity-based filter. Classifier-based: the standard approach is a fastText classifier on n-grams, with your documents labeled as good or bad. Perplexity-based: you train a very small language model, usually a KenLM n-gram model, and you filter out documents whose perplexity is too high. The advantage is that you hopefully get a more semantic understanding from your ML model, even though these are very simple machine learning techniques, and you don't need to tweak all the hyperparameters that you tweak for heuristics. The main disadvantage is that you're not really controlling what you remove; you have a very vague view of what the biases are. Let me give you an example: Wikipedia. If you filter based on Wikipedia, well, Wikipedia is written more than 90% by men, so you're basically also filtering your pre-training corpus to be mostly male-written. Do you want this bias? Maybe not. These are things you still need to be careful about, and it's really hard to know exactly what bias you're introducing.
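
To make the heuristic side concrete, here is a toy sketch of the kind of surface-level rules involved; the statistics and thresholds are illustrative guesses that you would tune by inspecting your own data (the Gopher and RefinedWeb papers list the ones they actually used).

    import re

    def passes_heuristics(doc: str) -> bool:
        """Toy surface-level quality filter; thresholds are illustrative, not tuned."""
        lines = [l for l in doc.splitlines() if l.strip()]
        words = doc.split()
        if not lines or not words:
            return False

        mean_words_per_line = len(words) / len(lines)            # very short lines -> likely a menu or list
        alpha_ratio = sum(c.isalpha() for c in doc) / len(doc)   # symbol-heavy pages look strange
        max_char_run = max(len(m.group(0)) for m in re.finditer(r"(.)\1*", doc))  # 'aaaaaa...' spam

        return mean_words_per_line > 3 and alpha_ratio > 0.6 and max_char_run < 20

A real pipeline would combine dozens of such rules, possibly with stochastic rather than hard removal, and pair them with an ML-based filter.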

A couple of additional, and actually very important, notes on data filtering. You will have several parts in your training data: even if it's only web documents, some of the web is blog posts, some is tutorials, some is company websites. All of these are somewhat specific domains, and you want to make sure they are all processed well. So for each of the big domains you want to have at the end, check that you didn't do something bad in the pre-processing. There are various ways to do that, for instance clustering and inspecting a list of documents per cluster, but the one thing to remember, and I would say it's a general rule of all good data processing, is that you will want to manually inspect the data. Inspect the data you've been keeping, inspect how it looks at the end, how it was filtered. Is it still really readable? Is your LaTeX well processed? Is your PDF OCR well extracted? Manually go through the data that you keep, and also through the data that you remove: did you remove something that is actually very important? You need to sample, you need to take a look. You can focus on the most important parts: for instance, sort your data by top URLs per token count and just read 10 documents for each of these top URLs, and make sure those 10 documents are really well filtered.

Very likely you will also need to craft domain-specific hyperparameters: your heuristics may work well for blog posts but badly filter LaTeX documents, for example. You can either craft specific rules for such a domain, or decide to add this domain afterwards; for instance for code you could remove all code from the web data and just add a very big code dataset separately. But think about the implications of doing that: you will, for instance, remove documents that mix natural language and code, so you want to make sure you add that type of input back so your model still covers it. As I told you, you can also make use of stochastic selection: if a rule is too harsh, sample stochastically in the filtering so that you keep a little bit of noise and smooth your rules.

Now, deduplication. Why do you want to deduplicate? The idea is that there is a lot of duplication on the web; that's something to really be mindful of, the web is hugely duplicated. Duplication increases the density around some topics. Wikipedia is copied a lot over the internet, so maybe it's nice to have a lot of density around Wikipedia, to be sure your model has seen it a lot. But you also have to be aware that duplicated points have more chance of being memorized, and they take more compute because you go over the same data points more times during training, which you don't necessarily want. Deduplication has also been shown to improve accuracy, so generally deduplication is very important and something you want to do well.

How can you deduplicate? You have a couple of methods. There are fuzzy methods, where you extract fixed-size hashes of your documents, so you lose a little bit of accuracy because these hashes are just a rough summary of the n-grams in your document, and then you filter either with MinHash, which is quite a good method in general, or with Bloom filters, which are much stronger at deduplication because you essentially keep only one document per hash; it's very strong, you have a fixed-size vector, which is very constraining. If you don't want to do fuzzy deduplication, you can use exact deduplication, where with a suffix array you extract exactly all the duplicates in your documents. They each have trade-offs: exact filtering is very costly in memory, because the suffix array tables are really huge, and Bloom filters are a very aggressive filter. So usually, for instance in FineWeb, we use MinHash a lot, because you can control the trade-off between memory and accuracy a bit more. Scaling deduplication is also a very big challenge. And we saw a very interesting, counterintuitive result recently: more deduplication also led us to keeping only bad data. When we deduplicated more and more, all the good data ended up being taken out, and what remained was basically bad-quality data that wasn't duplicated, just random enough that it didn't fall into any deduplication bucket. So for deduplication as well: be careful, investigate what you're removing at the end and also what you're keeping, and don't take it as a silver bullet. Just like every filter out there, it's something you should double-check yourself.
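
Here is a rough sketch of fuzzy deduplication with MinHash, using the datasketch library purely for illustration; production pipelines shard and parallelize this across many workers and tune the shingle size, number of permutations and similarity threshold, all of which are illustrative choices below.

    # Toy MinHash deduplication sketch with the datasketch library; parameters are illustrative.
    from datasketch import MinHash, MinHashLSH

    corpus = [
        "The quick brown fox jumps over the lazy dog.",
        "The quick brown fox jumped over the lazy dog!",   # near-duplicate of the first document
        "Completely different document about something else entirely.",
    ]

    def minhash(doc: str, num_perm: int = 128) -> MinHash:
        m = MinHash(num_perm=num_perm)
        shingles = {doc[i:i + 5] for i in range(len(doc) - 4)}  # 5-gram character shingles
        for s in shingles:
            m.update(s.encode("utf-8"))
        return m

    lsh = MinHashLSH(threshold=0.8, num_perm=128)  # ~80% Jaccard similarity counts as a duplicate
    kept = []
    for idx, doc in enumerate(corpus):
        m = minhash(doc)
        if not lsh.query(m):            # no near-duplicate already kept
            lsh.insert(f"doc-{idx}", m)
            kept.append(doc)
    # kept now contains the first and third documents only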

Now that we've finished sourcing, language filtering, filtering by quality (heuristics or ML) and deduplicating, we need to prepare the data for training. There are two main things to do: shuffle it and tokenize it. Shuffling might sound like a joke but it's still very important today; you don't want to train in the order of the Common Crawl dumps, you want a good shuffling of all your data. Then you want to tokenize it. There was recently a very nice video by Andrej Karpathy on tokenizers; you should watch it if you want to know everything about tokenizers, but generally there is just a set of good practices you should be mindful of.

The first one is to sample well across your whole dataset. The first GPT-2 tokenizer was famous for including the names of redditors in its final vocabulary, because it was trained on Reddit data only. You don't want that; you really want to shuffle so that no single part of your dataset is over-represented in the vocabulary of your model. For math you want to be careful about numbers: you don't want, for instance, 42 to be a single token and 43 to be two tokens just because 42 is used more often since the Douglas Adams book. So usually people either split digits, so every number becomes a sequence of single digits, which is what Llama does for instance, or they add the list of all numbers up to a thousand manually to the vocabulary, which is what GPT-4 does; then you need to be sure your dataset is big enough that every number is really well represented in it. For code you want to be mindful of tabs and spaces; they're very important in Python for instance, so you want to handle them well, and you want the model to know what a double space and four spaces are. If you need a default, I would say a byte-level BPE is a good standard way to train a tokenizer, as in the sketch below.
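
A sketch of training a byte-level BPE tokenizer with the Hugging Face tokenizers library; the corpus file name, vocabulary size and the sanity checks at the end are illustrative choices, not a recipe.

    # Minimal byte-level BPE training sketch with the `tokenizers` library.
    from tokenizers import ByteLevelBPETokenizer

    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(
        files=["shuffled_corpus.txt"],     # placeholder: a plain-text dump of a well-shuffled sample of your data
        vocab_size=50_257,                 # illustrative; pick what fits your model
        min_frequency=2,
        special_tokens=["<|endoftext|>"],
    )
    tokenizer.save_model("tokenizer_out")

    # Quick sanity checks of the kind discussed above
    print(tokenizer.encode("42 and 43").tokens)                # are numbers tokenized consistently?
    print(tokenizer.encode("def f():\n    return 1").tokens)   # are spaces/indentation handled well?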

Don't fall into a rabbit hole with tokenizers; they are not the thing that will bring you to AGI. This is just something you want to do cleanly so you don't fall into traps along the way: you can process code and numbers, you don't have strange over-represented tokens, but that's it. By the way, you can also use a tokenizer to inspect your dataset; I'll talk a little bit about that. Scaling tokenization is non-trivial: you really want to parallelize it well, because otherwise pre-processing and tokenizing trillions of tokens can take quite a long time. There are two main approaches. The first is to parallelize the tokenization and then find a way to efficiently merge and shuffle the tokenized datasets. The other is to tokenize during training: you feed the raw text to your pipeline and tokenize just before feeding the model. The nice thing about the first approach is that once your dataset is tokenized, stopping and resuming training is very easy, efficient and reliable. In the second case you can change the tokenizer easily, but you usually don't need to do that a lot, and resuming while being sure you restart exactly from where you were is usually slightly trickier.

Now, how do you evaluate data quality? That's really tricky, because we're talking about trillion-token-scale datasets, so it's really hard to have good metrics. A lot of this is inspecting documents yourself, as I'll tell you, and one good way is training small models to test it. Typically what we've been training here is a 1 to 2 billion parameter model, and you train it on a Chinchilla-optimal amount of data, which is roughly 30 billion tokens; you don't need to train for longer, you're not using this model for inference or anything. When you train this model you need to find some high-signal benchmarks, and not all benchmarks in NLP are high-signal. What is high signal? There are two ways I've seen it defined. One is to make sure that your metric on this benchmark is monotonically increasing during training: you want a benchmark where you really see your model learning steadily and not oscillating a lot, otherwise you get very different results depending on when you stop. The other is low variance: if you train on various seeds, on various parts of your dataset, you want to be roughly in the same ballpark, or at least the standard deviation you measure should be small enough that you can really tell datasets apart. Usually you will want two debugging datasets: one of high quality, and a standard very high quality dataset is C4, which has really stood the test of time, and another that is maybe much noisier, The Pile is sometimes an example, or you can just take pure, unfiltered Common Crawl, and you should see a real gap between the performance of models trained on these two datasets on your benchmark. And obviously you want your model to be above the random baseline; that's also an indication of a good benchmark. If a 1-2 billion parameter model is not above the random baseline, you're just measuring noise.

There are some tricky details to make sure you have high signal; these are things we have in LightEval. For instance, small benchmarks are often multiple-choice questions, where you have four continuations and you want to select one of the four. Small models, and by small I mean 1 to 2 billion parameters, really prefer what we call normalized likelihood: you measure the likelihood of each answer, normalize it by its length, and take the highest one. Larger models, when you move to well-trained 30, 40, even 70 billion parameter models, prefer lettered answers: you spell out the answers, you say "select between A, B, C and D", and the model just generates A, B, C or D. There you can have nice calibration curves, because you have a very clear uncertainty on this single generated token. Keep that for larger models; for small models, I would say focus on normalized likelihood.
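
Here is a rough sketch of what length-normalized likelihood scoring of multiple-choice answers looks like; LightEval and the EleutherAI harness implement the careful version of this, while the model below is just a small stand-in and the code ignores some tokenization-boundary subtleties that real harnesses handle.

    # Sketch: score multiple-choice continuations by length-normalized log-likelihood.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # stand-in for your 1-2B debugging model
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    def continuation_logprob(context: str, continuation: str) -> float:
        ctx_ids = tok(context, return_tensors="pt").input_ids
        full_ids = tok(context + continuation, return_tensors="pt").input_ids
        cont_len = full_ids.shape[1] - ctx_ids.shape[1]   # assumes the context tokenizes identically as a prefix
        with torch.no_grad():
            logits = model(full_ids).logits
        logprobs = torch.log_softmax(logits[0, :-1], dim=-1)   # position i predicts token i+1
        targets = full_ids[0, 1:]
        token_lp = logprobs[torch.arange(targets.shape[0]), targets][-cont_len:]
        return token_lp.sum().item() / cont_len                # length-normalized

    question = "The capital of France is"
    choices = [" Paris", " Rome", " Berlin", " Madrid"]
    best = max(choices, key=lambda c: continuation_logprob(question, c))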

So that's small-model training. Another thing, and I talk about this a lot: manual data inspection. Take your top domains, take your top URLs, inspect 10 documents for each of them, inspect at various stages in your pipeline, and also take a look at what you've discarded. You can set up a search tool over your dataset, which is also very useful, and you can do some clustering to inspect the top documents per cluster rather than per URL. We have a nice library for this by Leandro at Hugging Face called text-clustering, which also includes a nice search tool, so really take a look at it and use it.

There is a more uncommon trick that I really like from Teven, who was at Hugging Face and is now at Mistral, like quite a few people. He told me once that he uses the tokenizer to inspect data: you train a tokenizer on your dataset and you look at the longest tokens and at the least frequent tokens. Do you see strange things there, JavaScript fragments, names of redditors like I was telling you? If they look bad, that means you have a high frequency of bad-quality data in your dataset.

We have a nice library that we released just last month for doing all of these data processing pipelines, called datatrove. It's by Guilherme, the lead author of RefinedWeb, and it started as an open reproduction of RefinedWeb, so a very high quality filtered Common Crawl. What we ended up with is a fully fledged, lightweight library for processing, filtering and deduplicating text data, and basically preparing very large datasets for LLM training. You have pre-built blocks for all the steps I showed you here, it's fully in Python, and it's very easy to set up on Slurm or locally, and to use remote file systems as well if you need to. So take a look at datatrove; it's a small, self-contained Python library, but you really have all the basic blocks you may want to use here.

When you want to evaluate your model, we have a library that works well with datatrove and this pipeline, called LightEval. LightEval is a very lightweight LLM evaluation suite, inspired by the amazing EleutherAI harness. The main difference is that it integrates 3D parallelism from the ground up, which I'm going to talk about next, so efficient model training and inference, and you can play a lot with the prompts and the evals. As I was telling you, small models really prefer normalized log-likelihood while bigger models prefer lettered answers, and here you can play with the prompt formats easily to see how much signal you can extract from each benchmark at your specific debugging model size.

Now, we've talked a lot about data, so let's talk a little bit about modeling. That's the part everyone is waiting for, that's the most exciting part, that's the reason we're all in ML, and I'm very happy to still cover it. What are the essential elements when you train? There are three main things. The first is efficiency and size: you want to fit your billions of parameters efficiently on your GPUs and you want to train really fast, and there are some recipes for this that I'm going to cover quickly. Then you want to train in a roughly stable way: you have to avoid instabilities, but you still want to stay really close to the edge. And then there is the last question, capacity, which is where we'll talk a little bit about architectures other than the plain transformer, but that's really just the last part.

So how do you train a model efficiently, in particular when it's too big to fit on one GPU? When it fits on one GPU there is no real problem, but your 7B, 13B, 30B models are just too big for one GPU with a decent batch size, so you need to parallelize them. Today we have roughly four ways to do parallelism: data parallelism, which I would say everyone has been using already, tensor parallelism, pipeline parallelism, and the slightly more recent sequence parallelism. I'm going to cover them briefly; my idea here is to give you an overview of everything rather than really dive deep, because each of these topics goes very deep from a technical point of view. So this is really entry level, and I've put a couple of selected references you can read to dive deeper.

Let's start with the first one, data parallelism. It usually works out of the box and it's the easiest one; the only challenge is the data loading, to make sure that your model replicas get different data as input. What does data parallelism do? You take one model and you duplicate it on several GPUs, you feed each replica a different part of your batch, and then you all-reduce the gradients, so that you effectively have a larger batch across, say, three GPUs than you had on one GPU. You process different parts of your data in parallel and then you do the optimization step. The main challenge is that last part, the all-reduce used to merge the gradient updates; when you scale to very large models it can start to become a huge bottleneck, and we'll talk a little bit about that.
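
In PyTorch this is essentially DistributedDataParallel plus a DistributedSampler so that each replica sees different data. A minimal sketch with a toy model and dataset, assuming you launch it with torchrun --nproc_per_node=<num_gpus>:

    # Minimal data-parallel training sketch (toy model and data).
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # toy stand-in for a real transformer
    model = DDP(model, device_ids=[local_rank])            # gradients are all-reduced, overlapped with backward
    dataset = TensorDataset(torch.randn(4096, 1024))

    sampler = DistributedSampler(dataset)                  # each replica gets a different shard of the data
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

    for epoch in range(2):
        sampler.set_epoch(epoch)                           # reshuffle differently every epoch
        for (x,) in loader:
            x = x.cuda(local_rank)
            loss = model(x).pow(2).mean()                  # placeholder loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()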

Tensor parallelism is for when you're limited in your data parallelism. Why would you be limited? There are two main cases. One is that your model is just too big to fit on one GPU, so you cannot simply replicate it on various GPUs; you need to split the model somehow. The other case is when your batch size, by replicating the model, starts to be too big. Let's say you really want to scale and you start to have a one to four million token batch size: with a very large batch size, the model makes less efficient use of each token at each optimization step, because each token is kind of washed out in the update. This limit, which we call the critical batch size, is a bit hard to measure; it's roughly around four to six million tokens, and it differs between smaller and bigger models, but basically you cannot just go to a 100 million token batch like that. So you want another way to parallelize, to make more efficient use of your data, and one way to do that is tensor parallelism.

Tensor parallelism is slightly more involved because you need to rewrite your model code; you cannot just change the data loading. Why? Because you will divide all the weight matrices used in the model into two, four or eight pieces, depending on your tensor parallelism degree, put each sub-part of these weight matrices on a different GPU, and synchronize after the operations. The nice thing is that you can combine smart column and row slicing to try to reduce the number of synchronization points. Let me show you a little bit. There are two main parts in a transformer, as you may remember. You have the feed-forward networks: usually two matrix multiplications with an activation in between (it can be a bit more if you're using something like a GLU variant, but basically one matmul, an activation, and another matmul). Here you can split the first matrix multiplication in one direction, usually column-wise, apply the activation separately on each GPU without synchronizing, and then gather by doing the opposite slicing on the second matrix multiplication, a row-wise split, to recover your output. You can do the same smart thing for self-attention: you do one matrix multiplication split in one direction, you do the softmax and dropout separately per shard, and then you combine them with another parallel operation split in the other direction. This way you reduce the number of synchronization points, because you can do a couple of operations without needing to synchronize between the GPUs. The tricky part is always that when you synchronize, you go through the network, and that's much slower than the computation itself.
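
To see why the column-then-row split works without an intermediate synchronization, here is a tiny numerical check on a single device; real tensor parallelism puts each shard on a different GPU and replaces the final sum with an all-reduce, but the math is the same.

    # Numerical sketch of tensor parallelism for an MLP block: Y = relu(X @ W1) @ W2
    import torch

    d_model, d_ff, tp = 16, 64, 2                    # tiny sizes, tensor-parallel degree 2
    X = torch.randn(4, d_model)
    W1 = torch.randn(d_model, d_ff)
    W2 = torch.randn(d_ff, d_model)

    full = torch.relu(X @ W1) @ W2                   # reference computation on one "GPU"

    # Shard W1 column-wise and W2 row-wise; each shard would live on a different GPU.
    W1_shards = W1.chunk(tp, dim=1)
    W2_shards = W2.chunk(tp, dim=0)

    # Each rank computes its partial output independently (the activation needs no sync)...
    partials = [torch.relu(X @ w1) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]
    # ...and a single all-reduce (here just a sum) recovers the full result.
    parallel = sum(partials)

    print(torch.allclose(full, parallel, atol=1e-4))  # True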

The last technique you can use, when you don't want to use tensor parallelism or when you cannot scale tensor parallelism enough, is pipeline parallelism. Usually you want pipeline parallelism when your network is not fast enough to do full tensor parallelism everywhere. Pipeline parallelism reduces the number of network exchanges, because you put some layers on one GPU and other layers on other GPUs, and you only communicate at the interface between two layers, or two groups of layers. For instance you put layers 0 to 3 on one GPU, layers 4 to 7 on the second GPU, and so on. The challenge here is to keep all the GPUs busy: you don't want one GPU working on the first layers of your batch while the other GPUs sit idle waiting for their layers as we move forward through the model, and it can be very challenging to keep maximal utilization of the GPUs. So usually you have a complex interleaving of the forward and backward passes. I can show you here, with the forward pass in blue and the backward pass in green: we split our batch into smaller mini-batches, for instance four, and when the first mini-batch is done on the last device we already start the backward pass while we are still doing the forward pass for the last mini-batches on the other GPUs. This way you reduce what we call the bubble. The tricky thing, as you probably got, is that for tensor parallelism you needed to rewrite the model code, as I told you, and here you also need to rewrite the optimization code: you cannot just do forward, then loss, then backward, because you have parallel execution of backward and forward passes. This makes the code quite complex, and that's why we have a new library called nanotron that tries to make this as simple as possible.

There is a last way to parallelize, called sequence parallelism. Be careful, because the term has two uses. One is a smart way, ring attention, to do attention on very long sequences; the one I'm talking about today is another, simpler one. It's quite similar to tensor parallelism in a way, but instead of slicing the parameter matrices we slice along the sequence. The idea is that with tensor parallelism, there were still some operations between the tensor-parallel blocks where we were not parallelized in any way; and since these operations are applied independently for each token, we can split them along the sequence axis and parallelize there. It's mostly interesting during training, because you need long sequences, or for something that looks a bit like prefill.

What can you read if you want to know more about this? There are many references on parallelism; I tried to pick the ones that give you the highest-level overview and cover as much as possible. I really like this first paper from Joel at ServiceNow, Breadth-First Pipeline Parallelism, which is not very well known but covers a lot of the challenges here. Reducing Activation Recomputation in Large Transformer Models is a very nice one, and it also covers the sequence parallelism I just mentioned. And the last one, actually called Sequence Parallelism, is the ring-attention style paper, which I think is also very interesting but is more of an extension of this presentation.

Now, we've talked a lot about parallelization, but there is an additional thing you need to be mindful of: synchronization. I already talked a little about it for tensor parallelism, and about reducing it, and you have to be very careful here. Why? You have two types of synchronization. One is between GPUs, which happens for instance when you do an all-reduce operation in tensor or data parallelism. The other is between CPU and GPU, which is when your CPU launches the kernels on the GPU. For both of these you want, as much as possible, to overlap computation and communication: if you can asynchronously start some operation and do some communication during that time, it's much better.

Let me talk about two examples. We talked a little during the data parallelism part about the cost of the all-reduce at the end. There is something you've probably been using already without knowing it: if you look at DistributedDataParallel, the DDP in PyTorch, you can see there is a very smart way to do the all-reduce. Naively, you would do your whole forward and backward and then your all-reduce at the end. This is very annoying, because during the all-reduce, where you gather all your gradients together, you don't do any computation; you're just waiting for your GPUs to exchange all the gradients, and that's not what you want, you want to keep your GPUs busy. So if, every time one layer has finished computing its gradients, you can already start reducing them, in parallel with the rest of the computation, you should do that, and if you take a look at the PyTorch code for DistributedDataParallel, that's exactly what it does. Another example is in pipeline parallelism: we saw this forward-backward interleaving to reduce the bubble, and there you can also try to overlap the long gradient reduction with the forward pass of the next batch.

A quicker note about CPU-GPU synchronization: here what you want is to reduce as much as possible the number of times your CPU needs to inspect the data or launch a kernel. So you want to fuse kernels: fuse the operations that can go together, the result of your attention, your activation. If you can do all of that on the GPU without the CPU having to say "okay, now it's time to compute the activation, now it's time to do this", you should do that. That's usually done by merging operations into single kernels.
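
A lot of this fusion now comes essentially for free from compilers. As a small sketch, assuming PyTorch 2.x and a GPU, torch.compile will capture a function like this and fuse the elementwise operations into fewer kernel launches:

    import torch
    import torch.nn.functional as F

    def bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
        # Two elementwise ops; a compiler can fuse them into a single GPU kernel,
        # so the CPU launches one kernel instead of two.
        return F.gelu(x + bias)

    compiled_bias_gelu = torch.compile(bias_gelu)   # requires PyTorch 2.x

    x = torch.randn(1024, 4096, device="cuda")
    bias = torch.randn(4096, device="cuda")
    out = compiled_bias_gelu(x, bias)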

Now I want to talk a little bit about attention. This is very interesting, because if you were already in the field a year and a half ago, or a bit more, there was a lot of work on designing efficient attention computations, because people were really scared by the quadratic cost of attention. All of this has mostly disappeared now, all the Reformer-style, clever long-attention variants, and the main reason it disappeared is that our friend Tri Dao at Stanford invented FlashAttention. The idea of FlashAttention is basically that you just don't materialize the attention matrix. The attention matrix is this very large, sequence-length-squared matrix comparing every token to every other token to compute attention between all of them. Instead of building this very large matrix, you build small tiles on the fly and just keep the running statistics you need to compute your softmax along the way. That's the first step. The second step is that if you only compute small parts of your attention matrix at a time, these parts can be small enough to actually fit in the SRAM of the GPU, the static random access memory. SRAM is a much, much smaller memory, but it sits right next to the compute units; it's not shared across the whole chip, each group of processing units has its own, while the HBM, the high bandwidth memory, is shared by everything. The HBM is the 80 or 40 gigabytes you see advertised; it's really large, but it has much lower bandwidth than SRAM. So instead of computing your attention in the big memory, you compute it in small tiles with running statistics, and these tiles are small enough to fit in this very high bandwidth SRAM. This way you can actually compute attention much faster while using much less memory. FlashAttention essentially solves the quadratic cost of attention, and that's why we don't really care much anymore about linear attention mechanisms, also because their performance was never able to match full attention, apart from sparse attention in some ways.

FlashAttention v2 was a further development of FlashAttention, again roughly two times faster. The idea was mostly to have as much of the computation as possible in matmul FLOPs. You have to know something about GPUs here: they are optimized for matrix-matrix multiplication, so every time you do something else, like a division, you pay a cost, and the cost is expensive; a division is something like 60 times more expensive than a multiplication. For instance, in the softmax we usually divide by some normalization factor; you want to keep that and do it just once at the end, not on every element of your computation. These kinds of things are what FlashAttention 2 brings, along with better parallelism, causal masking (if you're computing a causal mask you just don't need to compute half of the matrix), and better work partitioning across the blocks and warps of the GPU. I won't dive into this because there is a lot to unfold; I would say it's a bit more incremental, but it's still a very, very nice gain.
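
You rarely need to implement any of this yourself these days: in PyTorch 2.x, scaled_dot_product_attention dispatches to a FlashAttention kernel when the shapes, dtypes and hardware allow it (the flash-attn package also exposes the kernels directly). A small sketch, assuming a recent GPU:

    import torch
    import torch.nn.functional as F

    # (batch, heads, seq_len, head_dim) in bf16 on a recent GPU lets the FlashAttention backend be picked
    q = torch.randn(2, 16, 4096, 64, device="cuda", dtype=torch.bfloat16)
    k = torch.randn(2, 16, 4096, 64, device="cuda", dtype=torch.bfloat16)
    v = torch.randn(2, 16, 4096, 64, device="cuda", dtype=torch.bfloat16)

    # The full 4096 x 4096 attention matrix is never materialized in HBM.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)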

So now that we have something efficient, we've parallelized well and we have a very efficient attention computation, we want to make sure that we train well, and here, don't skip the hyperparameter search. There are a couple of very important things to go over: the learning rate, a nice hyperparameter search, making sure your initialization is well done and that your normalization layers are where they need to be, making sure your training is stable, but not too stable either; you want to stay right at the edge, still training with a very high learning rate. There is very little recent work on this, but there are two pieces I really like. There is the muTransfer work, slightly older now I would say, but still very interesting, on how to find hyperparameters on a small model and how to scale them to a larger model; the work by Cerebras was maybe one of the most interesting applications of muTransfer. And there is a very interesting recent work, again from a Chinese team, with a very good, open set of experiments: the MiniCPM blog post, where they really try to optimize the model and the scaling of activations between the various parts of the model, how you want to scale activations between the embeddings and the first layers, etc. You should really give it a look. What is also very interesting is that they challenge the dominant view that cosine is the learning rate schedule everyone should use from now on. Cosine is still a great default, but they use warm-up plus a constant learning rate with a final decay, and they show that they get decent performance with that as well. The nice thing about a constant learning rate is that you don't need to know from the beginning how long you will train your model. That's very interesting, because cosine forces you into a very specific shape where you have to decide from the beginning of your training how long you are going to train, and you cannot resume for longer. If we can get good performance with a flat learning rate plus warm-up and decay (the decay is very important, that's what they show in this paper), then maybe we can get out of this constraint of knowing in advance how long we're going to train. So take a look at this paper; I think it's very nice in terms of stable training recipes. The takeaway here: don't skip this step, do your homework on hyperparameters.
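
Coming back to the schedule for a second: a warm-up + constant + decay schedule of the kind MiniCPM advocates (sometimes called WSD) is easy to write with a LambdaLR. A sketch, with illustrative warm-up and decay fractions and a placeholder model:

    import torch

    def wsd_lambda(step: int, warmup: int = 2_000, total: int = 100_000, decay_frac: float = 0.1) -> float:
        """Warmup / stable / decay multiplier on the base learning rate (fractions are illustrative)."""
        decay_start = int(total * (1 - decay_frac))
        if step < warmup:
            return step / max(1, warmup)                          # linear warm-up
        if step < decay_start:
            return 1.0                                            # long constant plateau, easy to extend
        return max(0.0, (total - step) / (total - decay_start))   # final decay to zero

    model = torch.nn.Linear(8, 8)                                 # placeholder model
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=wsd_lambda)
    # call optimizer.step() and then scheduler.step() once per training step

Because the plateau is flat, you can keep extending training and only run the final decay when you actually decide to stop, which is exactly the flexibility the cosine schedule takes away.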
Recently this changed, using ideas from sparsity, but not too sparse, because we also know GPUs don't really like very sparse matrices. They are, however, quite good with block-sparse matrices. What does block-sparse mean? The matrix is sparse, but the non-zero entries come in blocks, and these blocks are big enough to make efficient use of the GPU: big enough to fill your matmuls, so there is enough to crunch, with empty space between the blocks. This is basically what MegaBlocks did recently, and that's how it unlocked efficient mixture-of-experts training. The idea is that the experts can be these blocks: they are big feed-forward matrices, and if we use this block structure we can repeat the blocks for however many tokens each expert receives and build the sparsity pattern dynamically, because it's really just blocks that repeat. It's a low level of dynamicity, low enough that it can still be very efficient. That's what changed things here: we don't need to drop tokens anymore, we can just dynamically build these big block-sparse matrices from the experts. It even opens the door to something I think nobody has really been using yet: experts of various sizes, big experts and smaller experts, and so on. Very interesting, and I'm really looking forward to what will be built on top of this.

Another interesting development is a kind of revival of recurrent models. There are two main families, but I'll only talk about Mamba. The idea is to use state space models. If you're just out of your master's in AI you probably learned about state space models: these continuous (and then discretized) models that evolve a hidden state over time. All the smart work here was about how to discretize this and keep it efficient, and how to train it efficiently, which was solved by Albert Gu and Tri Dao (whom you may know from FlashAttention). The funny thing is that when you train a Mamba model it behaves kind of like a convolutional network, and when you use it at inference you can run it in a recurrent mode, so it's really, really fast (the toy example below shows this duality on a tiny linear model). Mamba itself is quite hard to dive into, and I think the best entry point is, again, an annotated blog post by Sasha Rush. Maybe you learned about the Transformer architecture from The Annotated Transformer by Sasha Rush a few years ago; now you also have an annotated Mamba blog post, which is a very nice way to learn about the architecture. We're actually training several Mamba models at Hugging Face at the moment with nanotron, so it's also quite easy to train.
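Here is a toy illustration of that duality on a one-dimensional linear state space model. This is not Mamba itself (Mamba adds input-dependent parameters and a hardware-aware scan), and the numbers are made up; it only shows that the same model can be evaluated as a step-by-step recurrence or as a convolution over the whole sequence.

```python
import torch

# Toy 1-D linear state space model:  h_t = a * h_{t-1} + b * x_t ,  y_t = c * h_t
a, b, c = 0.9, 0.5, 1.2
x = torch.randn(16)                                       # an input sequence of length 16

# Inference view: a cheap recurrence, one small state carried between steps.
h, ys_rec = 0.0, []
for x_t in x:
    h = a * h + b * x_t
    ys_rec.append(c * h)
ys_rec = torch.stack(ys_rec)

# Training view: the same map is a convolution with kernel k_i = c * a^i * b,
# so the whole sequence can be computed in parallel.
L = x.shape[0]
kernel = c * (a ** torch.arange(L).float()) * b           # [c*b, c*a*b, c*a^2*b, ...]
ys_conv = torch.stack([(kernel[:t + 1].flip(0) * x[:t + 1]).sum() for t in range(L)])

print(torch.allclose(ys_rec, ys_conv, atol=1e-5))         # True: both views give the same output
```

At inference you only carry a small state per new token (the recurrence); at training you can compute the whole sequence in parallel (the convolutional view).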
I'll show you a little bit of it now. Talking about nanotron: we wanted a very simple library to bring together all the techniques I showed you. We talked about parallelism, about efficient training, about being able to iterate nicely on your hyperparameters, and also about mixture of experts and Mamba. If you want to gather all of this, you usually end up with a very large library with a lot of bells and whistles. We wanted to keep something very minimalistic, and that's how nanotron was born. We want to keep it really under 10,000 lines of code, make it very fast (basically train as fast as possible), and very transparent, so there isn't a lot of wrapping around the things you do. It's very open and very transparent, and it has 3D parallelism, gradient accumulation, mixed precision (which I didn't really talk about, but it's in there), smart optimizer sharding like ZeRO-1, and all the architectures I was just talking about. So take a look at nanotron. It's the kind of research code we use ourselves, so it's still a bit rough around the edges, but it's a very nice code base.

Now that we've trained our model, which took a long time, I'm going to cover the next steps briefly. We also have nice open-source libraries for these, so I want to tell you a little bit about them. Once you've pre-trained your model, you usually want to align it, which means you don't want it to remain a pure completion model that just generates the most likely tokens after the prompt; you want it to behave in a specific way. Usually you start by making the model behave as a dialogue model, so that it learns to generate answers to prompts and not just continuations. You also sometimes want specific behaviors, or you want to do some safety work, to forbid or at least reduce the occurrence of certain behaviors of your model. This step is called alignment, or fine-tuning.

Up to now, the main approach was a rather complex technique called RLHF, reinforcement learning from human feedback. The impressive thing about RLHF, I would say, is that it works at all. It's basically the first widespread use of reinforcement learning in the AI world that is actually really useful for many, many people, but it's still really complex. The main tricky thing, as always in reinforcement learning, is the reward. Usually you define your reward manually; it's very complex, full of heuristics, and that's one reason it doesn't generalize to anything beyond your test environment. The nice thing about RLHF is the HF part: you define your reward from human feedback. You generate some completions, you ask humans to rank them, and you use that to train a reward model. That's very nice, but it's a complex pipeline; you can see on the slide a typical labeling interface humans use to provide these preference judgments. The very impressive thing is that it just works, but in practice the implementation is complicated: you have something like four models, your model being trained (the policy), a base model that you keep around because you want to stay not too far from it, the reward model trained on the human feedback, and the SFT model. All of these models need to be in memory at the same time. You can do some smart sharing of layers, but it's basically very complex, and it's very challenging to fit all of this in memory. That's why we started to build a library called TRL to make this easier.

Now, something very interesting happened last year: DPO, direct preference optimization. The idea is that maybe your language model already knows the reward somehow, so maybe it can be used directly without training a separate reward model. I'm saying that with my hands, but you can write it much more precisely as an equation, the DPO objective (a rough sketch of it follows).
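The equation itself is on the slide rather than in the transcript, so here is a hedged sketch of the DPO loss as it appears in the paper. This is not TRL's API, just the objective written out; the inputs are summed log-probabilities of whole completions under the policy being trained and under the frozen reference (typically the SFT) model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how much more likely each completion is under the policy
    # than under the reference model, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the preferred completion's implicit reward above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy batch of 4 preference pairs, just to show the call.
loss = dpo_loss(torch.randn(4, requires_grad=True), torch.randn(4),
                torch.randn(4), torch.randn(4))
loss.backward()
```

The reference terms are what keep the policy close to the SFT model, and beta plays the role of a temperature on the implicit reward; because this is just a classification-style loss over log-probabilities, it trains much like a regular language modeling objective.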
The DPO paper actually has a very nice math part, which is not always the case for machine learning papers; sometimes the math is just there to get past reviewer 2, but here the math is really nice and interesting. The conclusion is that you can maybe just go with two models, the model you're training and the SFT reference model, which makes training much easier. What the RLHF team at Hugging Face, led by Lewis Tunstall and Ed Beeching, together with former HF people like Nathan and Nazneen, saw is that it makes training much more stable, and it basically just works out of the box, because the objective is much closer to a standard language modeling objective. So DPO changed a lot how we align these models. Earlier this year there was this question: is this the end of it, have we once again pushed reinforcement learning out of the set of most-used ML techniques? Well, no: there is a recent revival of RL through the REINFORCE algorithm, which some of you may know if you were working in the field some time ago (at least I was playing a lot with it for language modeling a long time ago). Papers from Reka and from Cohere show that REINFORCE, and more on-policy RL in general, may still be very competitive with DPO, and maybe even better. So the jury is still out in 2024: is DPO the answer, or will we see a comeback of RL? We'll see.

Now you've pre-trained your model, you've fine-tuned it, the behaviors are great, you're very happy, you think it's a nice model, you've evaluated it as I told you; now you need to deploy it. This will be my last slide, and it will actually be quite short, because there are a lot of resources on this, but one thing to keep in mind is that there have been multiple breakthroughs in inference optimization over the last few months. It's really impressive. I remember, two years ago, when I was saying we might want to deploy models of seven or ten billion parameters, people said this is never going to work, these are just too big. Well, the reality is that today, on my laptop, I can run Mistral 7B and it's really fast, even faster than I talk.

There are a couple of things that made that possible, and those are the things I'm listing on these slides. The first one is quantization. We can just quantize these models: we move them from the floating-point values they are trained in (FP16 or BF16 for most of them) to quantized integers, and it just works, we lose minimal performance. There are various setups and techniques: GPTQ, the techniques included in llama.cpp (GGML), NF4; I put a couple of them on the slide and they all work really well. Honestly, a good default is the llama.cpp one: it's very nice and works well. This also solves a huge pain point in terms of model size, because models are much, much smaller once quantized (the toy example below gives a feel for the basic idea).
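To give a feel for what quantization does, here is a toy per-tensor absmax int8 scheme. This is much simpler than GPTQ or the llama.cpp and NF4 formats just mentioned (those use per-group scales, non-uniform codebooks, or error-compensating rounding), but the principle is the same: store integers plus a scale instead of full floating-point weights.

```python
import torch

def quantize_int8(w: torch.Tensor):
    # One scale for the whole tensor (real schemes use per-group or per-channel scales).
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)                    # a weight matrix as trained, in fp32/bf16
q, scale = quantize_int8(w)                    # stored as 1 byte per weight instead of 2-4
print((w - dequantize(q, scale)).abs().mean()) # small reconstruction error
```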
Now we can do even better with speculative decoding, which is super interesting, and recent developments of it like Medusa. The idea is that you have two models that are roughly similar, but one is much smaller than the other, and they are trained on roughly the same dataset; they should be as close as possible. The small one drafts the next tokens, several at a time, and the big one is only used to validate these drafts: you keep the drafted tokens until they start to diverge from what the large model would have produced. This means you can generate tokens in bunches and just have the big model check them. It takes a little more room in memory, but not much, because the draft model is small, and it speeds up inference a lot. This is basically what lets us use very large models on a laptop.

There is a nice blog post I really like, "Accelerating Generative AI with PyTorch – GPT, Fast", which shows you the other techniques you can use. You can compile your model, you can use CUDA graphs; this is something we covered earlier, the idea of reducing CPU-GPU synchronization as much as possible, so the GPU goes through the layers as autonomously as possible and synchronizes with the CPU as rarely as possible. That gives you even more speed-up; it's really impressive. These are the inference techniques; I just put a few references there for you to explore.

The final step: you've pre-trained your model, you've aligned it, you're happy with the inference, you've quantized it, and you can distribute it. The final step is to share it with the world. We need more knowledge, more open model outputs, more open datasets; we need a lot more sharing in the world. Thankfully, at Hugging Face we've been building a place to share this: use Spaces, evaluate your model openly on the Open LLM Leaderboard, put it on the really great Chatbot Arena, set up a chat demo for people to try it. Basically, please share all the knowledge you've learned and all the artifacts you've created, as much as you can. That's the only reward I'm asking from you for this video. Thanks.

I actually kept this questions slide from my talk. I cannot really answer questions on YouTube, but please leave comments, open a post on Hugging Face, or ping me anywhere; I'm very happy to answer any question you may have about this. Thanks a lot for watching, bye!