
A little guide to building Large Language Models in 2024


Chapters

0:00 Intro
0:59 Workflow for LLMs
1:17 Data preparation - intro and good recent resources on data preparation
5:28 A web scale pretraining corpus - goals and challenges
11:29 Web scale data sources – Focus on recent datasets
18:01 Language and quality filtering
24:34 Diving in data deduplication
27:40 Final data preparation for training
31:31 How to evaluate data quality at scale
36:29 The datatrove and lighteval libraries
38:18 Introduction to modeling techniques for LLM training
39:09 When the model is too big: parallelism
40:00 Data parallelism
41:18 Tensor parallelism
44:38 Pipeline parallelism
47:00 Sequence parallelism and references on 4D parallelism
47:52 Synchronization: GPU-CPU and GPU-GPU challenges
52:14 Flash attention v1 and v2
56:23 Stable training recipes
59:12 New architectures: Mixture-of-experts
63:13 New architectures: Mamba
64:49 The nanotron library
66:15 RLHF in 2024
68:23 PPO, DPO and REINFORCE
71:23 Quantization, speculative decoding and compilation: overview and resources
74:36 Sharing your model, datasets and demo – final words

Whisper Transcript

00:00:00.000 | Hi everyone. So two weeks ago I gave a graduate class here in Amsterdam to 200 PhD students about
00:00:06.560 | how to build, how to train a large language model from scratch in 2024. I tried in this talk to
00:00:14.320 | highlight the dark secrets, the things that people don't talk a lot about but that are very crucial to
00:00:19.600 | getting good performance out of large language models, and maybe to also highlight a bit what is more
00:00:24.640 | hype than reality. And when I shared the slides afterwards there was a lot of interest for this
00:00:30.480 | so I decided I would actually re-record the talk and post it on YouTube as well. So here is our
00:00:38.080 | little guide to building a large language model in 2024. In this talk I'm gonna cover three main
00:00:46.320 | parts - training, fine-tuning, inference. I think for fine-tuning and inference you can already find
00:00:51.520 | super good recipes, super good blog posts and explanations online so I really spend most of my
00:00:57.360 | time on training, which is the part that's you know mostly like dark science I would say today.
00:01:02.400 | In training you have three parts - data preparation, efficient training technique,
00:01:07.680 | evaluation. It's the same here, I'll spend most of my time on the first part, data preparation,
00:01:12.880 | because that's really the secret sauce that I want to highlight today. So let's dive right in.
00:01:21.280 | You can believe me or you can also believe much smarter people at OpenAI or Anthropic
00:01:26.640 | when I say that basically the most important part in your training is the dataset. So I really like
00:01:31.760 | this blog post from James at OpenAI which highlights how you know by training many many
00:01:38.480 | architectures he basically found that in the end they all converge to roughly the same behavior
00:01:44.720 | which is determined fully by the dataset. So what he says is this - the "it" in AI models
00:01:50.800 | is the dataset. Basically model behavior is much less determined by architecture or
00:01:56.800 | hyperparameters than we think and much more by your dataset. He actually says it's your dataset,
00:02:02.160 | nothing else. Well I still talk about architecture a little bit. And Amanda from Anthropic
00:02:08.240 | basically said the same thing last week when she tweeted "is this emergent behavior
00:02:14.800 | coming from data or from the model?" and basically she said "none of us has ever magically pulled
00:02:21.680 | anything out of the ether, it's all coming from the dataset". So if you're more into YouTube than
00:02:28.560 | Twitter, I think there is a nice video that jokingly summarizes all of this, by Rutger Bregman.
00:02:35.040 | Let me play it. That's a video I think about when I read all the tech
00:02:40.880 | reports that are only talking about model architecture and don't say anything about the
00:02:46.000 | data. I mean it feels like I'm at a firefighters conference and no one's allowed to speak about
00:02:52.080 | water. I mean this is not rocket science. I mean we can talk for a very long time about all these
00:02:56.800 | stupid philanthropy schemes. We can invite Bono once more but come on, we got to be talking about
00:03:02.160 | taxes. That's it. Taxes, taxes, taxes. All the rest is bullshit in my opinion.
00:03:06.640 | So basically for us we got to be talking about data. Data, data, data. All the rest is bullshit
00:03:15.920 | in my opinion. So I mean now that I've kind of painted, you know, the landscape, let's dive into
00:03:22.160 | what I mean by that. I mean it feels like I'm at a fire... Thanks. I think another nice
00:03:30.800 | recent paper is the Yi paper. So maybe if you've been following the field you probably saw
00:03:35.840 | that many Chinese teams have actually trained very good models recently. And the nice thing
00:03:41.360 | is that they also have a very very good tech report. Much better than what we have I would
00:03:45.520 | say in the western world where everyone is now very shy about sharing anything. And so the Yi
00:03:51.360 | models are a very good model if you look at the benchmark. And basically when training them they
00:03:57.760 | say that their underlying assumption is that when you train on extensive data of high enough quality
00:04:03.120 | a standard architecture can exhibit advanced capabilities. So basically you don't need yet
00:04:09.440 | now to, you know, go look beyond transformers, or maybe, as I will talk about later, a slight
00:04:17.360 | extension like mixture of experts. If you have very good data just spend the time on carefully
00:04:24.320 | crafting your data set and for now stay on one of these simple architectures that we use today.
00:04:29.840 | I think there is extensive resources as always. I could have cited like 20 papers but I try to keep
00:04:37.920 | like a small list of resources so you can read them extensively. I think these four ones are
00:04:46.080 | nice recent examples. The survey on data selection for language models by Allen AI is very nice.
00:04:53.680 | The paper I just mentioned by the Yi team is really great and I think two recent data sets
00:05:00.080 | that were open sourced and shared a lot more about how they were built were the Dolma data set
00:05:06.160 | from Allen AI and also RefinedWeb. So I think a nice thing about RefinedWeb is that I'm working with
00:05:12.720 | Guilherme the lead author of this at Hugging Face and so we'll have much more news about this data
00:05:18.720 | set to share and I think it's a very nice work. So you can use data for many things. So when you
00:05:25.760 | talk about data you actually talk about various type of data. You can use data for pre-training
00:05:31.280 | your model, you can use data for instruction tuning, you can use data for alignment which is
00:05:37.200 | basically after having pre-trained your model you really want to align it so it learns how to exhibit
00:05:42.080 | the nice behavior that you want. In particular a dialogue behavior which is one we often want to
00:05:47.680 | have when we interact with these models. You can also have data more for in-context learning,
00:05:53.280 | for RAG training, retrieval training, and I would say each of these aspects will have different
00:05:59.120 | goals and will require different data. So as a rough idea for instance for pre-training you want
00:06:04.640 | really the maximal diversity. You want to assume that your model just has no way to generalize. So
00:06:11.120 | if the behavior you want at the end is not in the pre-training data there is no way the model will
00:06:18.160 | discover it. You have to put it in the training data. For alignment it's quite different. You want
00:06:22.880 | very clean data because you're training your model to exhibit some specific behavior. You want model
00:06:28.160 | to really be very good at you know like a function call or like you want your model to be very good
00:06:34.080 | at dialogue. So you want the model really to train and to learn this behavior. So usually
00:06:40.480 | this data set can be much smaller and they can be much more carefully cleaned.
00:06:44.320 | In pre-training you will want some noise so your model knows about the noise. In particular there
00:06:49.520 | is a debate you know should you use no toxic data or like maybe no bad language data.
00:06:54.880 | Right now I think the main approach to this problem by people is to keep a little bit of toxic
00:07:03.520 | data, like a decent amount, so that the model is already exposed to this. It's a little
00:07:08.160 | bit like your kid if you want. If you want to tell them that drugs are bad, right, they have to first know
00:07:13.760 | about drugs. You cannot really, you know, expect them to learn that this is something they shouldn't
00:07:21.120 | touch they should not be using if you don't tell them what it is. It's the same for language model
00:07:26.960 | in some way. We want them to be exposed to this data to a small amount of it so that they can
00:07:31.920 | learn later to avoid this and they will know what they need to avoid. Basically assume that there is
00:07:37.440 | no generalization capabilities in this model. If you want to tell them anything about something
00:07:42.480 | positive or negative you have to first put it in the model. So let's talk about pre-training stage.
00:07:49.200 | I already covered a little bit but basically you want to have maximal coverage you want to cover
00:07:52.880 | everything. So you will train a massive quantity of texts at least 1 trillion token nowadays and
00:07:58.960 | I think you probably want to aim for more like 10 trillion tokens. The challenges that you want to
00:08:04.480 | solve here you want to maximize diversity and coverage and you want to maximize quality as
00:08:10.080 | much as possible because this is still you know something that your model will learn. So if your
00:08:15.440 | model learn mostly noise you will still get noise out. So you want to have a little bit of this so
00:08:20.320 | it's kind of robust to this but you don't want to have too much of this. Here is one example.
00:08:26.320 | Basically, a good rule of thumb is that you will want your model to know
00:08:31.520 | two things. You want your model to know the thing that you may want it to generate at the end.
00:08:36.240 | So if you want to generate knowledge about physics you will want to put that in the model and we want
00:08:42.080 | also your model to learn the thing that it might be exposed to. So you want your model to be familiar
00:08:47.920 | with the thing that the users might input. So if you have inputs that might be noisy from the users
00:08:53.360 | your model should still be trained on it. Otherwise it will be out of distribution
00:08:58.160 | and as I said the safest bet here is to assume your model doesn't generalize at all.
00:09:02.400 | The main challenge here is maximal diversity, good quality but still a little bit of noise
00:09:09.280 | and data quality evaluation. How do you measure data quality at the billion token scale.
00:09:14.800 | That's what we're going to talk a little bit about as well. So here is the typical pipeline to train
00:09:19.680 | a model. So you start by collection. I'm going to talk a little bit about that. You want to filter
00:09:23.840 | by languages which language you want to keep and then you have a set of filters. You have basically
00:09:28.800 | two main type of filters. You have some filters that are more heuristic so they are kind of rules
00:09:33.760 | that you wrote and there are some filters that are more like ML models. So you have a model that you
00:09:40.160 | train to identify some good quality text. Usually you want to combine two and then you have a set of
00:09:46.160 | filters that are more semantic. The rule and the ML model are usually a little bit more on the surface
00:09:51.040 | level and then you want to cover really the topics that you need to know about. If you want to know
00:09:56.000 | about physics, you want to know about technology, you want to be sure that these are in and so you
00:10:00.480 | have a step of like more topic filtering and basically be sure that you extract this topic
00:10:06.160 | very well. This is another example, from RefinedWeb. The first one was from Yi. This is from RefinedWeb
00:10:16.160 | just to show you how much data we remove. So we start from Common Crawl which is basically the
00:10:21.920 | internet crawled since 10 years ago and basically we filter that and you can see that there is a
00:10:30.000 | lot of things that you will remove. First I would say language removal. If you only keep English,
00:10:34.640 | English is roughly half of the internet. The second biggest language is usually Russian and
00:10:39.760 | then you have all of the other in Common Crawl. So basically remove half of it when you only filter
00:10:45.440 | for English you will have a lot of like duplication removal. Why do you want to do duplication
00:10:51.600 | removal? Well we'll talk a little bit about that later so wait. And then you extract a little bit
00:10:56.880 | and in the end you end up with about 10% of the original Common Crawl sizes. So if you want to
00:11:02.960 | get a trillion token that means you really want to start with a very large source. This is an
00:11:09.280 | example, this one from the Allen AI survey. It's roughly the same steps that you will see here.
00:11:15.600 | Language filtering, some heuristics, some what they call data quality which is machine learning
00:11:21.360 | based usually. Some deduplication and then topic filtering basically. So where can you start from?
00:11:31.920 | You want as I said something very large because you'll just keep like 10% of it.
00:11:36.880 | So there is two main large sources of data I would say today. One is Common Crawl,
00:11:40.960 | one is the internet basically and the other one is more like for code. Usually you want to start
00:11:46.160 | from GitHub or something like Software Heritage or like a place where this has been carefully
00:11:51.920 | already extracted from the web. You can use some curated sources like Wikipedia or books and then
00:11:59.920 | in books you have this big question you know like I should use only public domain books which stops
00:12:05.280 | usually about 100 years back, so 1924 for today, or do you want to dive into more, like, copyright
00:12:13.040 | questions. So that's the big, big question, I would say, today for model trainers.
00:12:17.760 | And you have more recent trends like synthetic data generation where you basically will ask one
00:12:23.200 | LLM to generate some data specifically for you and because you're kind of paying compute for data
00:12:29.520 | here you can scale this quite largely. So there is a full new trend on this
00:12:36.080 | spearheaded by Microsoft and the Phi models which were trained on billions of synthetically
00:12:42.160 | generated data from GPT-4. I think it's quite interesting that you can really craft the data
00:12:48.400 | set in a more controlled way here because you can say okay I want this topic, this topic, this topic,
00:12:53.920 | this behavior and given the quality of large language models today the quality of the resulting
00:13:00.000 | data is actually very high. There is even a recent interesting paper from Apple which is about you
00:13:06.480 | know rephrasing the web so you take one page and you actually ask an LLM to write it cleanly and
00:13:12.400 | if you train on this data which is very clean and still cover a lot of diversity you can train
00:13:17.440 | actually three times faster because you use three times less data. It's very recent but it's super
00:13:22.720 | interesting. Okay I talk a little bit about this resource in more details because we've been
00:13:30.480 | releasing data set on this at HuggingFace and I want to show you a little bit what we released
00:13:35.120 | and I go in reverse order so I start with synthetic data. Recently Loubna, Anton
00:13:41.760 | and Leandro at HuggingFace released a data set called Cosmopedia, which is a
00:13:47.360 | synthetic data set of 30 million samples, that's actually billions of tokens, and it was generated
00:13:53.120 | using one of the best open source models today, which is Mixtral Instruct, and here you can
00:13:59.360 | see how basically this is controlled for various seeds so basically we give the model a slight
00:14:06.480 | small topic you know or a sentence from a document and you can choose where this comes from and you
00:14:12.720 | ask the model to to write content you know from this seed sample on the topic. So we took some
00:14:19.600 | very clean sources like the Stanford open courses or OpenStax which is also an open textbook,
00:14:26.480 | Khan Academy that you maybe know and also some web data so I would say more more diverse data
00:14:32.880 | and even instruction tuning data set and then you can ask model also to write you know using various
00:14:40.240 | language you can ask the model to write this for college students you know to write textbook
00:14:44.720 | article on this topic for college students or for high school students you can also ask the model
00:14:50.560 | to write in various styles to write blog posts about this topic and so you can actually have
00:14:55.680 | a lot of diversity even though it's synthetic. Here is a quick example of all the clusters you
00:15:04.480 | can do topic clustering to check that you know that you cover a lot what we discovered is that
00:15:09.680 | we could still cover even more clusters and I would say right now the work on Cosmopedia 0.2
00:15:16.400 | is to extend this to even more cluster and to get basically more coverage and so here you can see
00:15:22.400 | that we trained a 1 billion parameter model on this to show the performance,
00:15:28.720 | and it's really competitive with web data set even being much smaller but I would say it can
00:15:34.480 | it can even be better you know by having more coverage so stay tuned for Cosmopedia 0.2
00:15:40.960 | coming in April. If we go now to code data there was a very nice release earlier this year called
00:15:49.440 | Starcoder 2 and the Stack V2 so the Stack V2 is really the largest code data set out there that's
00:15:57.360 | prepared for large language model pre-training it's more than 3 billion files in 600 programming
00:16:05.920 | languages in total you have like billions of tokens you have roughly 1 trillion tokens in the
00:16:11.760 | Stack V2. So to get all this data basically we didn't crawl ourself we partnered with one of the
00:16:18.720 | non-profit foundation out there called Software Heritage which is a non-profit who has been
00:16:24.560 | focusing on archiving all code that has been out there since you know 10 years ago really
00:16:31.520 | and basically there is a question, you know, when you gather all
00:16:37.360 | this data set what do you do do you sell it to I would say private you know closed source companies
00:16:43.040 | or do you partner with like an open source company to train an open source model on this
00:16:47.440 | and so Software Heritage reached out to us to partner on the training of a new
00:16:52.880 | code open source code generation model called Starcoder 2 that you can use as well and which
00:16:58.320 | is one of one of the best code completion model out there today it's a very large collaboration
00:17:05.120 | actually an open collaboration so you see all the others there mostly led by Hugging Face and
00:17:12.000 | great people at ServiceNow. So really go check this out if you're interested in code data
00:17:17.520 | it's by far the largest and the cleanest data set out there on this. On web data so as I told
00:17:23.200 | you we've been working with the lead author of RefinedWeb to get a very large and very high
00:17:29.680 | quality web data out there so basically a filtered common crawl out there for people to basically
00:17:36.720 | start their training from a high quality data set so this should be also out in the beginning of
00:17:43.440 | April, maybe even earlier, so just stay tuned on this. So now that we've got our data sources, we
00:17:50.640 | need to filter them. For filtering by language, I would say stay simple: fastText by Meta/Facebook
00:17:57.600 | is just great, so just use fastText. It's a great one, it works pretty well, and it has all the
00:18:04.240 | languages you may want to filter.
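As a rough illustration (not the exact pipeline used in RefinedWeb or similar datasets), language filtering with fastText's language-ID model can look like the sketch below; the model file and the confidence threshold are assumptions.

```python
# Hedged sketch: language filtering with fastText's language-ID model
# (lid.176.bin from fasttext.cc); the 0.65 threshold is illustrative.
import fasttext

lid_model = fasttext.load_model("lid.176.bin")

def keep_english(text: str, min_score: float = 0.65) -> bool:
    # fastText's predict() expects a single line of text
    labels, scores = lid_model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__en" and scores[0] >= min_score

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "Le renard brun saute par-dessus le chien paresseux.",
]
english_docs = [d for d in docs if keep_english(d)]
```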
00:18:10.800 | Now that we have filtered by language, we want to start cleaning our data sets. There are basically two ways to do that: heuristics and ML-based. We started with the
00:18:16.480 | heuristics. The heuristic approach is this idea that you will count things, so basically if your documents
00:18:22.080 | only have like you know two characters per line probably it's just it's just a bad list or like
00:18:28.720 | something that you actually don't really want to use in your large language model so as a reminder
00:18:33.600 | you want to keep the things that are either things that your model might ever generate
00:18:38.400 | or things that you think your users might input into your model.
00:18:43.440 | so basically repetition you know a very long repetition of single character something that
00:18:50.320 | you know have a very strange ratio of alphabetic character to punctuation all these statistics
00:18:58.080 | that you can extract are way to easily filter documents the nice thing about heuristics is you
00:19:03.840 | kind of know what you're filtering out you know you wrote the things yourself you can really set
00:19:08.480 | the threshold by inspecting it and you have a very clear control on what you're removing from
00:19:15.200 | your data set. So these are the advantages I told you: it's kind of
00:19:21.200 | controlled, it's robust, you know the priors. And I would say the drawbacks are that you're only relying on
00:19:27.440 | surface level okay you're not looking in the meaning of the document you may also remove too
00:19:32.320 | much sometimes you think you're just removing bad lists but maybe these are also good lists that
00:19:37.200 | your user may want to input in your model one way to be a little bit more flexible about that is to
00:19:43.600 | use stochastic removal instead of you know being a one-off binary choice you sample a little bit
00:19:49.520 | and you keep a little bit of noisy data. Another drawback is that you will need to carefully tune
00:19:55.840 | your hyperparameters here, you know, the statistics and thresholds that you want to filter on, and that's sometimes a
00:20:01.920 | little bit of a time-consuming process.
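To make this concrete, here is a minimal sketch of what such heuristic filters can look like; the statistics and thresholds below are illustrative, not the ones actually used in RefinedWeb or similar pipelines.

```python
# Hedged sketch of surface-level heuristic filters; thresholds are made up.
import itertools

def passes_heuristics(text: str) -> bool:
    lines = [l for l in text.splitlines() if l.strip()]
    if not lines:
        return False
    mean_chars_per_line = sum(len(l) for l in lines) / len(lines)
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    longest_char_run = max(len(list(g)) for _, g in itertools.groupby(text))
    return (
        mean_chars_per_line > 10     # e.g. drop "bad lists" with ~2 characters per line
        and alpha_ratio > 0.6        # drop docs with a strange alphabetic/punctuation ratio
        and longest_char_run < 30    # drop very long repetitions of a single character
    )
```

A stochastic variant of this would keep a small fraction of the documents that fail, e.g. with `random.random() < 0.02`, so the model still sees a little bit of noise.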
00:20:10.000 | Another way to do data set quality filtering is machine learning filtering. So here, basically, how you do it is that you will have a set of good examples,
00:20:14.960 | a set of bad example and you will train either a classifier or a perplexity based filtering
00:20:22.480 | to you know to classify or to predict the next token so classifier based you know usually the
00:20:30.800 | standard one is to use a fastText classifier with some n-grams and you label your documents as good,
00:20:37.840 | bad, whatever. Perplexity-based: you train a very small language model, so usually we use
00:20:44.240 | this KenLM model, right, and we say that if the perplexity is too high then we filter
00:20:52.640 | documents i would say the advantage is here is that you have a more like semantic understanding
00:20:59.200 | hopefully from your ml model even though we use very simple machine learning techniques here
00:21:04.240 | um and you don't you know need to tweak all the hyper parameter that you tweak for heuristics
00:21:10.560 | the main disadvantage is that you're not really controlling what you remove okay you you have a
00:21:17.920 | very vague view of what the biases are so let me give you an example wikipedia okay if you train
00:21:24.320 | your model on wikipedia and you filter based on this: wikipedia is written more than 90% actually
00:21:32.160 | by men so you're basically also filtering your pre-training corpus to be mostly male written
00:21:38.640 | do you want this bias well maybe not right so these are things that you you still need to be
00:21:44.720 | careful and basically it's really hard to know exactly what bias you're introducing
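As a sketch of the perplexity-based variant, assuming you have a small KenLM n-gram model trained on a clean corpus such as Wikipedia; the model file name and the threshold are hypothetical.

```python
# Hedged sketch: perplexity-based filtering with a small KenLM n-gram model.
import kenlm

lm = kenlm.Model("wikipedia.5gram.binary")   # hypothetical path to an n-gram model

def perplexity(text: str) -> float:
    words = text.split()
    # kenlm's score() returns the log10 probability of the whole sentence
    return 10.0 ** (-lm.score(" ".join(words)) / max(len(words), 1))

def keep_document(text: str, max_perplexity: float = 1000.0) -> bool:
    # High perplexity under the "clean" LM usually means noisy or garbled text,
    # but remember the Wikipedia-bias caveat above.
    return perplexity(text) < max_perplexity
```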
00:21:53.040 | A couple of additional notes on data filtering, very important notes actually:
00:21:57.840 | you will have several parts in your training data even if it's only web documents you will
00:22:03.920 | have you know some part of the web data are blog posts some part of the web are you know like
00:22:08.640 | tutorials some part of these are companies websites all of these are somehow specific
00:22:15.840 | domains and you want to make sure they are all you know um processed in a good way so you need
00:22:22.160 | to make sure that for each of these big domains that you want to have at the end you actually
00:22:26.800 | didn't do something bad in the pre-processing so there is various way to do that you can you know
00:22:31.600 | cluster and identify a list of documents in a cluster but just one thing to remember about
00:22:37.600 | all of this and i would say it's a general rule of all good quality data processing is that you
00:22:42.800 | will want to manually inspect the data inspect the data that you've been keeping inspect how it is at
00:22:49.520 | the end how it was filtered is it still really readable is your latex um document well processed
00:22:58.000 | is your pdf ocr well extracted manually go through the data that you keep and also through the data
00:23:04.720 | that you remove did you remove something that you think is actually very important you need to sample
00:23:10.160 | you need to take a look you can take a look just at the most important so for instance you can
00:23:15.280 | sort your data by top urls per token and just read 10 documents for this top urls and make sure that
00:23:22.400 | these 10 documents are really well filtered okay very likely you need to also craft specific
00:23:30.320 | domain focused hyperparameters for instance for your heuristics maybe they will work well for
00:23:36.240 | blog posts but maybe they will just badly filter latex documents so you can either say okay i craft
00:23:42.240 | specific rule for this domain or you can also say i'll just add this domain afterwards uh for
00:23:48.560 | instance code you could say i remove all code for web and just i'll just add a very big code data
00:23:54.080 | set but try to think about the implication of doing that okay you will basically remove for
00:23:58.960 | instance some mixed natural language and code documents so you want to make sure you add this
00:24:04.000 | back again uh so that your model still cover this type of inputs um as i told you you can also make
00:24:11.520 | use of some stochastic selection so if a rule is maybe just too hard too harsh you may want to just
00:24:18.880 | stochastically sample in the filtering so that you keep a little bit of noise you can smooth a bit
00:24:25.600 | your rules. Now, deduplication: why do you want to do deduplication? Well, the idea is that there is
00:24:33.920 | a lot of duplication on the web that's something to really be mindful and to be aware of the web is
00:24:39.920 | hugely duplicated and so duplication will increase the density around some topics okay wikipedia is
00:24:47.760 | copied a lot over the internet so maybe that's nice to have a lot of density around wikipedia so
00:24:52.880 | that you're sure that your model has seen it a lot but um you also have to be aware that duplicated
00:24:58.080 | points they have more chance of being memorized okay they will also take more time because you
00:25:03.120 | will go during your training more times over the same data points so it takes more compute
00:25:09.040 | during training, and you don't really want that; you really need to think about that. Reducing the
00:25:14.960 | duplication, i.e. deduplication, has also been shown to improve accuracy, so generally deduplication
00:25:19.680 | is something that's very important and that you want to do a lot. How can you deduplicate? Well,
00:25:27.760 | you have a couple of methods you have more like fuzzy method where you basically will extract some
00:25:33.040 | hash fixed size hash of your documents and so you will lose here a little bit of accuracy because
00:25:40.240 | these hashes are just a rough summary of the n-grams in your document, and then you will want to filter
00:25:45.600 | them either by MinHash, which is, I would say, quite a good method in general, or by Bloom filters,
00:25:54.800 | which are much stronger deduplication because you just keep one hash and just keep one document
00:26:00.320 | per hash, so it's very strong; you have a fixed size vector which is very constraining.
00:26:05.120 | And if you don't want to do fuzzy deduplication you can use exact deduplication, where you will
00:26:10.880 | extract, you know, with a suffix array, exactly all the duplicates in your documents.
00:26:17.360 | They both have trade-offs in advantages and drawbacks. Exact filtering is very costly in
00:26:26.560 | memory because the suffix array tables are really huge, and as I said Bloom filters are
00:26:34.000 | a very, very strong filter, so usually we use MinHash a lot, for instance in FineWeb,
00:26:40.640 | because you can control a little bit more your trade-off between
00:26:46.320 | memory and accuracy.
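Here is a minimal sketch of fuzzy deduplication with MinHash plus LSH, using the `datasketch` library for readability (datatrove ships its own, more scalable MinHash implementation); the shingle size and similarity threshold are illustrative.

```python
# Hedged sketch of MinHash + LSH fuzzy deduplication (datasketch library).
from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    words = text.lower().split()
    # hash 5-word shingles: a rough, fixed-size summary of the document's n-grams
    for i in range(max(len(words) - 4, 1)):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

def deduplicate(corpus):
    lsh = MinHashLSH(threshold=0.8, num_perm=128)   # ~80% Jaccard counts as a near-duplicate
    kept = []
    for doc_id, text in enumerate(corpus):
        sig = minhash_signature(text)
        if not lsh.query(sig):          # no previously kept near-duplicate
            lsh.insert(str(doc_id), sig)
            kept.append(text)
    return kept
```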
00:26:56.400 | Speeding up the deduplication is also a very big issue, I would say one of the very big challenges, and we saw a very interesting counterintuitive result
00:27:02.160 | recently: that more deduplication also led us to keeping only bad data. So basically when we were
00:27:07.040 | deduplicating more and more all the good data was now taken out and only the the remaining things
00:27:13.760 | were just basically bad quality data that was not deduplicated but that was just so random that it
00:27:20.320 | didn't fall in the deduplication buckets. So I would say for deduplication also, be careful:
00:27:27.360 | investigate what you're removing at the end and also what you're keeping, and don't
00:27:33.360 | take this as a silver bullet; just like every filter out there, it's something that you should
00:27:38.400 | double check yourself now that we've finished you know uh sourcing language filtering filtering by
00:27:46.640 | quality heuristic or ml deduplicating uh topic we need to prepare the data for training there's two
00:27:53.600 | main thing you need to do we need to shuffle it it might seem as a joke but it's still very
00:27:58.160 | important today; you don't want to train in the order of the Common Crawl dumps, you want
00:28:05.040 | a good shuffling of all your data and then you want to tokenize it so recently there was a very
00:28:10.800 | nice video by Andrej Karpathy on tokenizers; you should watch it if you want to know everything
00:28:16.000 | about tokenizers, but generally there's just a set of good practices you should be
00:28:22.320 | mindful of. The first one, I would say, is: sample well across your whole data set. I would say the
00:28:29.440 | first GPT-2 tokenizer was famous for including in the final vocabulary tokens for the names of
00:28:36.400 | redditors, because it was really trained on Reddit data only. You don't want that; you want really to
00:28:42.080 | shuffle so that the single you know the one single part of your data set is not over represented
00:28:48.880 | in your vocabulary the vocabulary of your model for math you want to be careful about numbers you
00:28:55.360 | want to be careful that they are well you know you don't have like for instance 42 as a single token
00:29:00.880 | and 43 as two token because 42 is much more used since uh the douglas adams book and so usually
00:29:09.360 | what people do is either they split digits, so you split all the digits in every number,
00:29:15.280 | that's what for instance llama do or you add you know the list of all numbers manually in your
00:29:22.000 | vocabulary up to a thousand for instance that's what gpt4 do then you need to be sure that your
00:29:28.320 | data set is big enough that every number is really well represented in it for code you want to be
00:29:34.960 | mindful about tabs and spaces they're very important for instance in python and so you want to handle
00:29:41.920 | them well you want to model to know what is a double space and four spaces so just be careful
00:29:47.840 | about this. And basically, if you need something by default, I would say a byte-level BPE is a good
00:29:54.960 | standard way to train a tokenizer. Don't fall into a rabbit hole with tokenizers; they are not the thing
00:30:02.000 | that will bring you to AGI, okay; this is just something you want to do in a clean way so
00:30:08.720 | that you don't fall into the traps along the way: that you're able to process code and numbers, you
00:30:15.360 | know, that you don't have some strange tokens over-represented, but that's it.
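As a hedged sketch of those defaults with the Hugging Face `tokenizers` library (byte-level BPE plus individual-digit splitting); the vocabulary size, special token and toy training texts are placeholders.

```python
# Hedged sketch: byte-level BPE with digit splitting (Hugging Face tokenizers).
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),    # "42" -> "4", "2", Llama-style
    pre_tokenizers.ByteLevel(add_prefix_space=False),  # byte-level, keeps tabs/spaces
])
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=50_257, special_tokens=["<|endoftext|>"])

def text_iterator():
    # iterate over a well-shuffled sample of the whole dataset, not a single domain
    yield from ["def f(x):\n    return x + 42\n", "The answer is 42."]

tokenizer.train_from_iterator(text_iterator(), trainer=trainer)
print(tokenizer.encode("x = 1042").tokens)
```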
00:30:22.560 | By the way, you can also use the tokenizer to inspect your data set; I'm going to talk a little bit about that. Scaling tokenization
00:30:27.840 | is non-trivial you want to really parallelize that well because otherwise pre-processing
00:30:34.240 | and tokenizing trillions of token can take quite a long time in the end and so there is two main
00:30:40.400 | approach the first one is well parallelizing and then finding a way to efficiently merging the post
00:30:47.360 | the tokenized data sets and shuffling it and the other way is that you tokenize during training
00:30:53.360 | basically you feed the direct text to your model and you tokenize just before feeding the model
00:30:59.760 | i would say the the nice thing about the first one is once your doc once your data set is tokenized
00:31:05.280 | and everything stopping training and continuing training around resuming training is very easy
00:31:10.960 | it's very efficient it's very reliable and in the second case is well you can change the tokenizer
00:31:17.920 | easily but usually you don't really need to do that a lot but resuming and being sure that you've
00:31:25.280 | you know we're starting exactly from where you were is usually slightly trickier
00:31:32.320 | now how do you evaluate data quality so that's really tricky because we're talking about trillion
00:31:39.920 | size data sets okay so it's really hard to have some good metrics to evaluate the data quality
00:31:45.520 | so a lot of this is you know inspecting yourself some exact documents as i will tell you and some
00:31:52.240 | easy i would say one one one good way is training small model to test it so typically what we've
00:31:58.000 | been training here for instance is like one to two billion size model and you train at this on
00:32:03.600 | like a chinchilla optimal size you don't need to train for longer you're not using this model
00:32:08.480 | for inference or anything so which is roughly 30 giga token when you train your model you need to
00:32:14.960 | find some high signal benchmark not all the benchmark in nlp are high signal what is a high
00:32:20.480 | signal there is two way i've seen it uh being being being used one way is to make sure that your
00:32:27.840 | matrix on this benchmark is monotonically increasing during training okay you want
00:32:34.400 | basically some benchmark where you really see your model learning learning increasingly and
00:32:39.520 | not like oscillating a lot otherwise depending when you stop you will have like very different
00:32:45.120 | results you want to have a low variance which means if you train on various seeds if you train
00:32:51.680 | on various um you know parts of your data set you want to be sure that you're you're roughly in the
00:32:58.000 | same ballpark at least the the standard deviation that you're measuring is small enough that you can
00:33:02.800 | really tell data set apart so usually you will want to have two debugging data sets one of high
00:33:10.080 | quality. A standard very high quality data set is C4; it's really a data set that has
00:33:16.800 | stood the test of time in terms of high quality. And you want another data set that's maybe much
00:33:22.160 | more complex, The Pile is sometimes an example, or you can take just a pure Common Crawl,
00:33:27.040 | unfiltered, and you should see really a distance between the measurements on your benchmark on these
00:33:33.760 | two data sets the performance of your train model on these two data sets and obviously you want your
00:33:39.040 | model to be above the random baseline you know that's also one indication of a good benchmark
00:33:45.280 | so if a 1-2 billion size model is not above the random baseline you're just measuring noise
00:33:51.360 | and there is some tricky details to make sure that you have high signal these are some things
00:33:57.280 | we have in lighteval, but basically, for instance, if you want to measure multiple-choice
00:34:02.640 | question it's often the case for this small benchmark and that's why for instance you have
00:34:06.640 | four continuation you want to predict you know you want to select one of the four small model
00:34:12.720 | what i call small model is one to two billion size model small models really like more what we call
00:34:19.600 | normalized likelihood so we'll measure the likelihood of each answer normalize it by the
00:34:24.800 | length, and take, you know, the highest likelihood. And larger models, when we move to like 30, 40,
00:34:31.760 | even 70B models, well trained, they will prefer, you know, lettered answers, where you lay out the
00:34:36.720 | answer and then you say select between a b c d and the model just generates a b c d and here you
00:34:42.640 | can have nice calibration curve because you have a very clear uncertainty on this single generated
00:34:48.960 | token. So keep this one for larger models; for small models, I would say, focus on normalized likelihood.
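To make the normalized-likelihood idea concrete, here is a hedged sketch with `transformers`; GPT-2 is only a stand-in for whatever 1-2B debugging model you trained, and it normalizes by token count where an evaluation suite might normalize by character length instead.

```python
# Hedged sketch: length-normalized log-likelihood scoring of multiple choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in for your debug model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def normalized_loglikelihood(context: str, continuation: str) -> float:
    ctx = tok(context, return_tensors="pt").input_ids
    cont = tok(continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(torch.cat([ctx, cont], dim=1)).logits
    # log-probs of the continuation tokens only, conditioned on everything before them
    logprobs = torch.log_softmax(logits[0, ctx.shape[1] - 1 : -1], dim=-1)
    token_lp = logprobs.gather(1, cont[0].unsqueeze(1)).squeeze(1)
    return token_lp.sum().item() / cont.shape[1]       # normalize by continuation length

question = "Question: What is the capital of France?\nAnswer:"
choices = [" Paris", " London", " Berlin", " Madrid"]
scores = [normalized_loglikelihood(question, c) for c in choices]
print(choices[max(range(len(choices)), key=lambda i: scores[i])])
```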
00:34:56.240 | So that's small-model training. Another thing, I talked about this a lot already, is manual data inspection:
00:35:02.160 | take your top domains take your top url inspect 10 documents for each of them inspect also at
00:35:08.400 | various stages in your pipeline and also take a look at what you've discarded okay always
00:35:16.560 | you can set up a search tool in your data set that's also very useful you can do some clustering
00:35:22.480 | to see and to be able also to inspect top documents per maybe more clusters than url so we have here a
00:35:28.800 | nice library by leandro at hugging face called text clustering we also have a nice search tool
00:35:36.720 | in this, so really take a look at this library and use it. There is a more uncommon trick
00:35:43.040 | that I really like, from Teven (there were a lot of people at Hugging Face, you know, who are
00:35:49.360 | now at Mistral), and Teven, who is now at Mistral, told me once that he used the tokenizer
00:35:55.600 | to inspect and basically you can train a tokenizer on your data set and you can take a look at the
00:36:02.560 | longest tokens and maybe the last tokens, so the least frequent tokens, and see there:
00:36:09.520 | okay do you have strange things do you have like javascript parts do you have like name of redditors
00:36:14.880 | like i was telling you and if they look bad that means that you have some high frequency of bad
00:36:21.840 | quality data in your data set um here we have some nice library that we've been releasing you know
00:36:30.320 | just last month for doing all of this all of this data processing pipelines it's called data trove
00:36:36.560 | it's by uh gilerme the lead author of refined web and basically it started as an open reproduction
00:36:43.200 | of refined web so a very high quality filtered common crawl and what we ended up was kind of a
00:36:49.360 | fully fledged lightweight library for processing filter the duplicated test data and basically
00:36:54.880 | preparing very large data set for other than training you have pre-built block for all the
00:37:01.280 | steps that i showed you here it's fully in python and it's very easy to set up on slurm or locally
00:37:09.360 | and to use remote file system as well if you need to so take a look at data trove it's a very small
00:37:17.440 | library i would say self-contained python thing but you really you have all the basic blocks that
00:37:22.000 | you may want to use here um when you want to evaluate your model we have one library that
00:37:27.840 | works well with datatrove and the pipeline, which is called lighteval. Lighteval is a very lightweight
00:37:34.640 | LLM evaluation suite, originally inspired by the amazing EleutherAI harness. I would say the main
00:37:41.600 | difference is that it integrates, from the ground up, the 3D parallelism I'm going to talk about next,
00:37:47.600 | so basically efficient model uh training and inference and you can play a lot with the prompts
00:37:54.560 | and the eval so i was telling you for instance small model really like this like normalized
00:38:00.240 | log likelihood while while bigger model like more like lettered answers and so here you can play with
00:38:06.000 | the prompts easily and so to see how much signal you can extract for each benchmark on your specific
00:38:14.560 | debugging model size now we've talked a lot about data so let's talk a little bit about modeling
00:38:23.040 | that's the part everyone is waiting for that's the most exciting part easily that's the reason we're
00:38:27.840 | all in ml and i'm very happy to still cover this i would say so what are the essential elements when
00:38:34.320 | you train well there is three main thing the first one is efficiency and size you want to fit your
00:38:41.280 | billion parameters model efficiently on your gpu and you want to train really fast so you have some
00:38:46.800 | recepts for this that i'm going to cover quickly and then you want to train in a kind of a roughly
00:38:53.760 | stable way you have to avoid instabilities but still you want to stay really close to it and
00:38:59.360 | then you have the last question which is capacity and that's where we're going to talk a little bit
00:39:03.200 | about other architecture than just the transformers but that's i would say just the last part so how
00:39:09.680 | do you train model efficiently in particular when it's too big to fit on one gpu so when it fits on
00:39:14.800 | one gpu there is no real problem, right: your 1B model, no problem. But your 7, 13, 30B models
00:39:23.920 | they are just too big for one gpu and a decent batch size so you need to parallelize them
00:39:28.400 | today we have four way to do parallelism roughly we have data parallelism that's something everyone
00:39:35.040 | has been using already i would say you have tensor parallelism pipeline parallelism and a much more
00:39:41.840 | recent i would say or slightly more recent sequence parallelism i'm going to cover them
00:39:46.080 | briefly so i would say here my idea is more to give you kind of a overview of everything more
00:39:52.080 | than really dive deep because in each of these topics you could dive really deep in a technical
00:39:57.360 | point of view okay so this is really entry level and i put some references again just select a
00:40:04.400 | couple of references that you can read to dive deeper in this let's start with the first parallelism
00:40:09.280 | data parallelism usually it works out of the box that's the easiest one the only challenge is the
00:40:15.840 | data loading, to make sure that your model replicas will have different data as input. So what does
00:40:24.640 | data parallelism do you take the one model and you duplicate it on several gpu you feed it several
00:40:31.360 | parts of your batch and then you just you know match the gradient reduce the gradients so that
00:40:37.200 | you have basically a larger batch on three gpu for instance than you had on one gpu so you can
00:40:43.040 | process on parallel you know different part of your data and you just make the optimization step
00:40:49.360 | the main challenge i would say is the last part is the all reduce that you use to to kind of merge
00:40:56.960 | the gradient updates and actually when you scale very large model it can start to become a huge
00:41:03.120 | bottleneck so we'll talk a little bit about that um yeah the tensor parallelism is when you don't
00:41:13.200 | want to when you're limited in your data parallelism so why would you be limited by data
00:41:19.520 | parallelism there is two main cases one case is basically your model is just too big to fit on
00:41:25.120 | one gpu so you cannot replicate your model on various gpu you need to split the model somehow
00:41:32.400 | the other case is when your batch size by replicating the model start to be too big
00:41:37.920 | okay so let's say you want to really scale the model and now you start to have like one to four
00:41:44.320 | million token batch size well if you start to have a very large batch size the model for each
00:41:50.240 | optimization step make less efficient use of each token because the batch size is so big that each
00:41:57.520 | token is kind of washed out in the optimization step and roughly it's a little bit hard to measure
00:42:03.520 | this limit which we call the critical batch size it's roughly around four to six million token it's
00:42:09.200 | different for like small and bigger model but basically you cannot really go to 100 million
00:42:14.960 | token batch sizes like that so you want to find another way to parallelize to make more efficient use of
00:42:21.440 | your data and so one way to do that is to use tensor parallelism tensor parallelism is slightly
00:42:28.640 | more involved because you need to rewrite your model code you cannot just rewrite the data
00:42:33.440 | loading code you need to change the model why because you will divide all the matrix multiplication
00:42:39.600 | all the matrices that we use in the model into two, or like four, or like eight, depending on your tensor
00:42:45.440 | parallelism degree and you will put each part of the weights each sub part of this weight matrices
00:42:53.360 | on various GPUs, and synchronization will happen after the operation. So here you need to rewrite the
00:43:01.040 | model code. The nice thing is that you can combine smart column and row slicing to try to reduce the
00:43:07.120 | number of synchronization points let me show you a little bit here you have two main parts in a
00:43:14.000 | transformer as you may remember you have feed forward networks you know you usually have two
00:43:20.160 | two matrix multiplication with an activation in between it can be a bit more if you're using
00:43:26.720 | something different, like if you're using a GLU variant, but basically you will have one matrix
00:43:30.720 | multiplication some activation and another matrix multiplication okay and here you can basically
00:43:37.680 | split the first matrix multiplication in one direction usually column wise you do separately
00:43:44.880 | your activation on each gpu you don't need to synchronize and then you gather by doing the
00:43:50.480 | opposite slicing at the end on the second matrix multiplication you do a row slicing
00:43:56.080 | to gather again your output your activation you can do the same smart thing for self-attention
00:44:04.320 | where you will do one part matrix multiplication in one direction you will split the matrix
00:44:09.200 | you will do like softmax dropouts separately and then you will combine them with you know
00:44:15.360 | another parallel operation in the other direction this way you reduce the number of synchronization
00:44:24.160 | point because you can do a couple of like operation without needing to synchronize between the gpu
00:44:29.360 | The tricky part is always that when you synchronize, you're going through the network,
00:44:33.680 | and that's much slower than just the computation.
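Here is a conceptual sketch of that column-then-row slicing for the feed-forward block, in the spirit of Megatron-style tensor parallelism; it is not a drop-in implementation and assumes a process group with `tp_size` ranks is already initialized.

```python
# Hedged sketch: column-parallel then row-parallel MLP with a single all-reduce.
import torch
import torch.nn as nn
import torch.distributed as dist

class TensorParallelMLP(nn.Module):
    def __init__(self, d_model: int, d_ff: int, tp_size: int):
        super().__init__()
        # each rank only stores a 1/tp_size slice of each weight matrix
        self.up = nn.Linear(d_model, d_ff // tp_size, bias=False)    # column-sliced
        self.down = nn.Linear(d_ff // tp_size, d_model, bias=False)  # row-sliced

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.nn.functional.gelu(self.up(x))  # activation is purely local: no sync needed
        partial = self.down(h)                    # each rank produces a partial sum of the output
        dist.all_reduce(partial)                  # single synchronization point over the network
        return partial
```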
00:44:40.960 | The last approach, which you can use when you don't want to use tensor parallelism or when you cannot scale tensor parallelism enough,
00:44:45.120 | is pipeline parallelism. So usually you want pipeline parallelism when your, like, network
00:44:50.800 | is not fast enough to do full tensor parallelism everywhere okay pipeline parallelism reduce
00:44:58.480 | the number of network exchanges because you will put some layers on some gpu
00:45:03.360 | and other layers and other gpu and you will just communicate at the interface between two layers
00:45:09.920 | or like two groups of layers so you can see here you will put one for instance level layer two
00:45:17.520 | zero to three on one gpu layer four to seven on the second gpu etc etc here the challenge i would
00:45:24.800 | say is to keep all the gpu busy so you don't want to have just you know one group one gpu working
00:45:30.880 | for the first layers of your batch and then being idle while you have the other gpu working for the
00:45:37.040 | other layer you know as we go as we go forward in the model and it can be very challenging
00:45:43.520 | to actually keep have maximal utilization of the gpu so usually you have like complex interleaving
00:45:50.320 | of the forward and the backward path so i can show you here a little bit where you have the forward
00:45:56.480 | path in blue and the backwards path in green and you can see what we do in this case is that we will
00:46:02.960 | split our batch in smaller sub-batch mini-batches so for instance we split a long batch in four
00:46:10.800 | mini-batches and when the first mini-batch is done on the last device we already start the backward
00:46:18.400 | while we are still doing the forward path on the other gpu for the last batches and this way you
00:46:24.800 | can reduce what we call the bubble the tricky thing here as you probably got it is that
00:46:30.000 | for tensor parallelism you needed to rewrite the model code as i told you
00:46:34.880 | and here you also need to rewrite the optimization code okay you cannot just do forward
00:46:41.680 | and then your loss and then backward, because you have parallel execution of the backward and forward
00:46:48.880 | path so this makes using the code quite complex and that's why actually we have a new library
00:46:53.680 | called nanotron that tries to make this as simple as possible.
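As a toy illustration of the idea (layers split across GPUs, the batch split into micro-batches), far simpler than the interleaved one-forward-one-backward schedules a real library implements; the sizes are arbitrary and only the forward path is shown.

```python
# Hedged sketch: naive two-stage pipeline with micro-batches (forward only).
import torch
import torch.nn as nn

def make_layer():
    return nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)

stage0 = nn.Sequential(*[make_layer() for _ in range(12)]).to("cuda:0")  # layers 0-11
stage1 = nn.Sequential(*[make_layer() for _ in range(12)]).to("cuda:1")  # layers 12-23

def pipelined_forward(batch: torch.Tensor, n_microbatches: int = 4) -> torch.Tensor:
    outputs = []
    for micro in batch.chunk(n_microbatches):
        h = stage0(micro.to("cuda:0"))
        # stage1 works on this micro-batch while stage0 (asynchronously) starts
        # the next one, which is what shrinks the idle "bubble"
        outputs.append(stage1(h.to("cuda:1")))
    return torch.cat(outputs).cpu()

out = pipelined_forward(torch.randn(32, 128, 1024))
```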
00:46:57.440 | there is a last way to do parallelization called sequence parallelism so be careful
00:47:05.040 | because there is two use of sequence parallelism there is one which is kind of a smart way to do
00:47:10.800 | ring attention to do attention on very long sequences but the one i talk a little bit about
00:47:16.240 | today is another simpler way it's quite similar to tensor parallelism in a way but instead of
00:47:23.280 | slicing the parameter matrices like we do we slice the sequence this way and the idea is
00:47:29.680 | if you took tensor parallelism here it's the top box we still had some operation between each tensor
00:47:38.080 | parallelism operation where we were not really parallelized in any way and the idea is on this
00:47:44.160 | operation which are applied independently for each token we could split along the sequence
00:47:52.880 | and so we could parallelize this along the sequence axis. It's mostly interesting when you're
00:47:58.720 | doing training, usually, because you have long sequences, or when you're doing prefill.
00:48:03.600 | now what can you read if you want to know more about this there is many reference
00:48:10.480 | on parallelism i try to extract i think the one and i think give you the highest level overview
00:48:16.480 | i would say and cover as much as possible of this thing i really like this first paper from joel at
00:48:22.960 | service now which is not very well known but i think it's very interesting as it covers like a
00:48:28.160 | lot of the challenges here: breadth-first pipeline parallelism. Reducing Activation Recomputation
00:48:34.320 | in Large Transformer Models is a very nice one, and the next one is actually the one on sequence
00:48:40.320 | parallelism that i told you and the last one called sequence parallelism is actually this
00:48:44.960 | ring attention paper that i think is also very interesting but more maybe an extension of this
00:48:50.800 | presentation now we talk about a lot about parallelization okay but there is an additional
00:48:56.960 | thing that you need to be mindful about is synchronization i already talked a little bit
00:49:01.680 | about synchronization okay during tensor parallelism and the thing i talk a little bit
00:49:05.840 | about reducing synchronization and here you you have to be very careful about that well
00:49:10.720 | why well you have two type of synchronization you have one synchronization which is between
00:49:15.920 | various GPUs, which is basically when you do, like, an all-reduce operation
00:49:24.080 | in tensor parallelism and you have one synchronization which is between cpu and gpu which
00:49:29.120 | is when your cpu basically launched the kernel on gpu and you want to reduce or at least you want to
00:49:35.920 | make sure that for both of these as much as possible you can do an overlap of computation
00:49:42.560 | and communication so basically if you can do something called asynchronous computation
00:49:48.160 | basically where you will asynchronously start some operation and do some communication during
00:49:53.760 | this time it's much better so let me talk about two things we we talk a little bit during the
00:49:59.600 | data parallelism part about the cost of the all reduce at the end so that's something you probably
00:50:05.520 | have been using already without knowing it in pytorch which is if you look at the distributed
00:50:11.600 | data parallel so the ddp uh in pytorch you can see that there is a very smart way to do all reduce
00:50:20.000 | so let's look at here basically typically you will usually do like all your forward and your
00:50:24.960 | backward and then you will do your all reduce at the end okay well this is very annoying because
00:50:32.000 | during the all reduce where you gather all your gradient together you don't do any computation
00:50:38.000 | you're just waiting for synchronization there you're just waiting for your gpu to exchange
00:50:43.440 | all the all the great and that's not something you really want you want to keep your gpu busy
00:50:48.560 | so if you have a way that once every time um one layer is finished with computing you can already
00:50:54.880 | start you know reducing you can already start in parallel to computation you can already start
00:51:00.000 | communicating gradient then you should try to do that and if you take a look at the pytorch code
00:51:05.440 | for for distributed data parallel that's something that they do um another example is in pipeline
00:51:13.440 | parallelism you know we saw this this this uh forward backward reducing of the bubble and here
00:51:22.160 | you can also try to, you know, overlap this very long gradient reduction (the 'g' block here)
00:51:29.360 | with some, you know, forward pass of the next batch. And just a quick example and a quicker note
00:51:40.960 | about CPU and GPU synchronization: here, what you will want is to reduce as much as possible the
00:51:46.800 | number of times your CPU needs to inspect the data or needs to launch a kernel, so we want to fuse kernels,
00:51:54.000 | so you want to fuse the operation that could go together the result of your attention your
00:51:58.720 | activation if you can do all of that in the gpu without the cpu needing to say okay now it's time
00:52:04.960 | to compute the activation, now it's time to do this, then you should do that. So that's usually done by
00:52:10.000 | merging operations into single kernels.
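A small hedged illustration of that idea with `torch.compile`, which fuses chains of element-wise operations into fewer kernels so the CPU launches less work and the GPU re-reads memory less often; the shapes here are arbitrary.

```python
# Hedged sketch: fusing elementwise ops (bias add + GeLU) into fewer kernels.
import torch

def bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.gelu(x + bias)

fused_bias_gelu = torch.compile(bias_gelu)   # compiled, fused version of the same op

x = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, device="cuda")
y = fused_bias_gelu(x, b)                    # fewer kernel launches, fewer CPU-GPU round trips
```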
00:52:18.240 | Now I want to talk a little bit about attention. That's very interesting because if you were already in the field like one year
00:52:23.760 | and a half ago a little bit more maybe now um we had a lot of work on designing efficient
00:52:30.720 | attention computation because people were really very scared by the quadratic cost of attention
00:52:38.720 | okay and all of this disappeared now you know there was like all this reformer all these very
00:52:44.160 | long attention smart and the main reason this disappeared was that uh our friend tree dao at
00:52:52.000 | stanford invented flash attention flash attention the idea is basically you will just not materialize
00:52:59.520 | the attention matrix so the attention matrix is this very like large is the n square sequence
00:53:06.000 | square size matrices comparing each token you know to make the attention between all of them
00:53:12.000 | what you could do is instead of building these very large matrices you can just
00:53:16.880 | on the fly you know build small matrices and just keep the statistics that you need
00:53:21.200 | to compute your softmax along the way and that's what flash attention does that's the first step
00:53:27.040 | and the second step for flash attention is that if you just compute along the way small part of
00:53:32.720 | your attention matrix you may even have this small part small enough so that they fit actually
00:53:39.760 | in the sram of the gpu so the static random access memory the sram is a much much smaller memory but
00:53:50.560 | which is really next to each chip and this this this cannot be shared between processes right
00:53:56.320 | this has to be this is a single memory for for a group of of processing while the hbm the high
00:54:04.240 | bandwidth memory is shared by everything so it's the it's the hbm is this 80 or 40 gigabytes memory
00:54:10.800 | you know that you see it's really large but it's also much smaller and much lower bandwidth than
00:54:16.240 | this sram okay so you can compute just like your attention not in one big memory but in small one
00:54:23.200 | with statistics and the small one can be small enough to be fitted in the very very high bandwidth
00:54:28.320 | memory and this way you can actually compute attention really much faster and while using
00:54:34.800 | actually much less memory so flash attention can solve somehow the quadratic attention costs of
00:54:41.680 | attention and that's why we don't really care a lot anymore about you know linear attention
00:54:49.040 | mechanism for instance also because performance were never able to match full attention somehow
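A minimal sketch, not from the talk, assuming PyTorch 2.x on a recent GPU: torch.nn.functional.scaled_dot_product_attention can dispatch to a FlashAttention-style fused kernel, so you never materialize the full attention matrix yourself; the shapes are placeholders.

    import torch
    import torch.nn.functional as F

    batch, heads, seq, head_dim = 2, 16, 4096, 64
    q = torch.randn(batch, heads, seq, head_dim, device="cuda", dtype=torch.bfloat16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    # fused attention: the (seq x seq) score matrix is never materialized in HBM;
    # the kernel works on tiles kept in on-chip SRAM and carries running softmax statistics
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)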
00:54:55.840 | FlashAttention v2 was a further development of FlashAttention,
00:55:02.320 | again roughly two times faster, and here the idea was mostly to have as much as
00:55:09.440 | possible of the computation in matmul FLOPs. You have to know something about GPUs here, which is that
00:55:16.400 | GPUs are heavily optimized for matrix-matrix multiplication, so every time you do something
00:55:22.560 | like a division, or basically anything other than a multiplication, you're paying a cost, and the cost
00:55:29.600 | is very high: a division is something like 60 times more expensive than a multiplication. So, for instance,
00:55:36.720 | when we do the softmax we usually divide by some normalization factor, like the square
00:55:43.200 | root of the model dimension; you want to keep that division and do it just once at the end; you don't
00:55:48.000 | want to do it on each and every element of your computation. These kinds of things
00:55:54.960 | are basically what FlashAttention 2 brings, together with better handling of the causal mask (if
00:56:01.600 | you're computing a causal mask, you just don't need to compute half of the matrix) and better
00:56:06.720 | work partitioning, making better use of the thread blocks and warps of the GPU. I won't dive
00:56:11.920 | into this, because there's a lot to unfold here, and I would say it's a bit more
00:56:17.440 | incremental, but it's still a very nice gain. So now that we have something efficient,
00:56:24.880 | we've parallelized it well and we have a very efficient attention computation, we want to make
00:56:30.800 | sure that we train well. And here, don't skip the hyperparameter search. There are a couple of very
00:56:37.760 | important things that you need to go over: the learning rate, for which you want to do a nice hyperparameter search;
00:56:43.360 | you want to make sure that your initialization is well done, that your norms are where
00:56:48.480 | they need to be; you want to make sure that your training is stable, but also not too
00:56:53.840 | stable, so that you stay at the verge of instability, still training with a very high learning rate.
00:56:58.000 | Here, I would say, there is very little recent work, but there are two pieces that I really like.
00:57:05.120 | There is the muTransfer work, slightly older now, I would say, but still very interesting,
00:57:10.160 | on how to find hyperparameters on a small model and how to transfer them to a larger model;
00:57:15.920 | this work by Cerebras was maybe one of the most interesting applications of muTransfer.
00:57:22.480 | And a very interesting recent work, again from a Chinese team, with a very nice, open set of
00:57:30.240 | experiments, is the MiniCPM blog post, where they really try to optimize the model
00:57:36.800 | and to optimize the scaling of activations between the various parts of the model:
00:57:42.800 | how you want to scale the activations between the embeddings and the first layers, and so on. So you
00:57:47.200 | should really give it a look. What is also very interesting is that they
00:57:51.680 | challenged the dominant view that the cosine learning-rate schedule was the final schedule that everyone should
00:57:58.800 | use from now on. Cosine is still a really great default schedule, but they use
00:58:03.920 | a constant learning rate with a warm-up and a decay phase, and they show that they get decent
00:58:10.960 | performance with that as well. The nice thing about having a linear,
00:58:17.680 | constant learning rate is that you don't need to know from the beginning how long you will train
00:58:23.760 | your model. That's very interesting, because cosine kind of forces you into a very specific
00:58:29.520 | shape where you need to decide, at the beginning of your training, how long you are going to train,
00:58:34.480 | and you cannot resume for longer. If we find a way to get good performance with
00:58:39.440 | a flat learning rate and just a warm-up and a decay (the decay is very important, that's what they show in
00:58:43.840 | this paper), then maybe we can get out of this constraint of knowing from the beginning
00:58:52.960 | how long we're going to train. So take a look at this paper; I think it's very nice in terms
00:58:57.120 | of stable training recipes. The takeaway here, I would say, is: don't skip this step,
00:59:03.440 | do your homework in terms of hyperparameter research.
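A minimal sketch, not from the talk and not the exact MiniCPM recipe, of such a warm-up / constant / decay schedule using PyTorch's LambdaLR; the model, learning rate, and step counts are placeholders.

    import torch

    def wsd_schedule(warmup_steps, decay_steps, total_steps):
        """Warmup-stable-decay: linear warm-up, flat plateau, linear decay at the end."""
        def lr_lambda(step):
            if step < warmup_steps:                      # linear warm-up
                return step / max(1, warmup_steps)
            if step < total_steps - decay_steps:         # constant plateau
                return 1.0
            return max(0.0, (total_steps - step) / decay_steps)   # final decay
        return lr_lambda

    model = torch.nn.Linear(1024, 1024)                  # placeholder model
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lr_lambda=wsd_schedule(warmup_steps=2_000, decay_steps=20_000, total_steps=200_000),
    )
    # in the training loop: loss.backward(); optimizer.step(); scheduler.step()

The appeal described in the talk is that the plateau can in principle be extended and the decay scheduled only when you decide to stop; total_steps is fixed up front here just to keep the sketch short.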
00:59:13.920 | Now, the last part, the one people usually spend the most time talking about, which is a good indication that
00:59:19.520 | it's also the least important part. But let me still talk a little bit about it.
00:59:24.640 | For a long time, transformers were believed to be the final architecture, which is maybe slightly
00:59:33.280 | sad, I would say, for the field: that we haven't had anything really new since the transformer paper
00:59:39.200 | in 2017. Recently there were two extensions I want to cover; one is mixture of experts.
00:59:47.440 | Mixture of experts reduces to a transformer in the limit of one expert, so it's slightly
00:59:53.440 | a stretch to say it's a fully new architecture, but it's still very interesting as a new knob
00:59:59.360 | to play with capacity. Basically, one problem was that, until recently, it was not very
01:00:06.240 | efficient to train a mixture of experts. Let me explain a little. In a mixture of
01:00:11.280 | experts, when your sequence of tokens goes through your model, at some point
01:00:17.280 | you have a router that says, for each token, to which expert this token should go,
01:00:23.440 | and the experts are basically at the MLP level, the feed-forward. So you have several MLPs, several
01:00:29.600 | feed-forward layers, and their number is your number of experts: for instance, three
01:00:34.880 | different feed-forward layers would be three experts, and your router will
01:00:40.240 | say, for each token, "okay, you go to expert one, you to expert two, you to expert three".
01:00:45.040 | Now, each expert was designed to be able to welcome a certain number of tokens, for instance
01:00:52.960 | two tokens in this example. And if three tokens should go to one expert, two tokens
01:00:59.680 | to another, and one to the last one, the expert that gets three tokens is not able to welcome
01:01:05.040 | them all, and so it drops one token; that token is simply not used in the
01:01:11.440 | computation, one input token. That's quite a strong impact, I would say: it means you're
01:01:17.760 | kind of ignoring a part of your inputs, and that led to, I would say, non-optimal performance.
01:01:24.240 | But it was needed for the sake of having very static matrix shapes,
01:01:31.760 | because our GPUs and TPUs are not really well adapted to dynamic architectures. Well, recently this changed, using ideas from
01:01:39.600 | sparsity: not too sparse, because we also know GPUs don't really like very sparse matrices,
01:01:44.880 | but they are actually quite good at block-sparse matrices. So what is block-sparse? It means
01:01:50.160 | the matrix is sparse, but you have blocks, and these blocks are big enough that they make efficient
01:01:55.120 | use of the GPU: big enough that they fill the matmul units,
01:02:00.080 | that there is enough to crunch, while between these blocks you have empty places.
01:02:08.160 | This is basically what MegaBlocks did recently, and that's how it unlocked efficient
01:02:13.920 | mixture-of-experts training: saying maybe our experts could be these blocks, these very
01:02:19.520 | big feed-forward matrices, and if we use this we can just repeat the blocks
01:02:26.320 | for however many tokens there are, and we can handle the sparsity dynamically, because
01:02:32.880 | it's really just blocks that repeat. It's, I would say, a kind of
01:02:37.760 | low level of dynamicity, low enough that it can be very efficient. That's basically
01:02:45.360 | what changed things here: we don't need to drop tokens anymore, we can just dynamically
01:02:50.880 | build these big sparse matrices from the experts. It even opened the door to something
01:02:56.800 | I think nobody has really been using yet, which is that you could have experts of various sizes:
01:03:02.080 | big experts, smaller experts, and so on. Very interesting, and I'm really looking forward to what will be built on top of this.
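A minimal sketch, not from the talk and not the MegaBlocks approach, of the classic capacity-based routing described above: a top-1 router sends each token to one expert MLP, each expert only accepts `capacity` tokens, and overflow tokens are simply dropped. All sizes and names are placeholders.

    import torch
    import torch.nn as nn

    class TinyMoE(nn.Module):
        def __init__(self, d_model=256, d_ff=1024, n_experts=4, capacity=2):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])
            self.capacity = capacity

        def forward(self, x):                       # x: (n_tokens, d_model)
            scores = self.router(x).softmax(-1)     # routing probabilities
            expert_idx = scores.argmax(-1)          # top-1 expert per token
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                token_ids = (expert_idx == e).nonzero(as_tuple=True)[0]
                kept = token_ids[: self.capacity]   # overflow tokens are dropped!
                if kept.numel() > 0:
                    out[kept] = expert(x[kept]) * scores[kept, e].unsqueeze(-1)
            return out

    moe = TinyMoE()
    tokens = torch.randn(6, 256)
    y = moe(tokens)   # dropped tokens get a zero update from the MoE layer

Block-sparse kernels in the MegaBlocks style remove the need for this fixed capacity, so no token has to be dropped.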
01:03:09.760 | Another interesting development was a kind of
01:03:16.480 | revival of recurrent models. There are two main models; well, I will just talk about Mamba.
01:03:24.880 | The idea here is that you can use state space models. If you're just
01:03:32.000 | out of your master's in AI, you probably learned about state space models:
01:03:36.720 | these discrete or continuous models that evolve a state over time.
01:03:45.120 | Here, all the smart work was about how to discretize this and keep it
01:03:50.080 | efficient; that was solved by Albert Gu and, again, Tri Dao of FlashAttention, including how to
01:03:57.920 | train it efficiently. It's very funny, because when you train this Mamba model, it
01:04:03.120 | behaves kind of like a convolutional network, and when you use it at
01:04:07.760 | inference time, you can use it in a kind of recurrent mode, so it's actually really fast. Mamba
01:04:13.840 | itself is quite hard to dive into, and I think the best entry point is again an annotated blog post
01:04:20.080 | by Sasha Rush. Maybe you learned about the transformer architecture from "The Annotated
01:04:25.680 | Transformer" by Sasha Rush a few years ago; now there is also an annotated Mamba blog post, which is,
01:04:32.000 | I think, a very nice way to learn about the Mamba architecture. We're actually training several
01:04:37.040 | Mamba models at Hugging Face at the moment with nanotron, so it's also very easy to train; I'll show you a bit of that.
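A minimal sketch, not from the talk and much simpler than real Mamba (which makes the SSM parameters input-dependent): the inference-time recurrent view of an already-discretized linear state space model, where each new step only needs the previous hidden state. All matrices are placeholders.

    import torch

    d_state, d_in = 16, 1
    A = torch.eye(d_state) * 0.9           # state transition (already discretized)
    B = torch.randn(d_state, d_in) * 0.1   # input projection
    C = torch.randn(1, d_state)            # output projection

    def ssm_step(h, x):
        """One recurrent step: h_t = A h_{t-1} + B x_t ; y_t = C h_t."""
        h = A @ h + B @ x
        return h, C @ h

    h = torch.zeros(d_state, 1)
    for t in range(5):                      # constant memory and time per generated step
        x_t = torch.randn(d_in, 1)
        h, y_t = ssm_step(h, x_t)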
01:04:44.080 | So, talking about nanotron: we wanted a very simple library to use all
01:04:52.240 | the techniques that I showed you. We talked about parallelism, we talked about efficient training,
01:04:58.320 | we talked about being able to iterate nicely on your hyperparameters, and also about having
01:05:04.720 | mixture of experts and Mamba. If you want to gather all of this, you usually end up with a very large library
01:05:10.960 | with a lot of bells and whistles. We wanted to keep something very minimalistic, and that's how nanotron
01:05:15.920 | was born. We want to keep it really under 10,000 lines of code, and we want to make it very fast,
01:05:23.600 | basically train as fast as possible, and also very transparent, so there's not a lot of wrapping around
01:05:30.480 | the things that you do. The idea is that it's very open, very transparent, and you have
01:05:36.720 | in it 3D parallelism and gradient accumulation; I didn't really talk about mixed
01:05:43.200 | precision, but it's in there too; you also have the tools for
01:05:47.280 | smart optimizer sharding, ZeRO-1, and all the architectures I was talking about at the end.
01:05:56.400 | So take a look at nanotron. It's the kind of research code that we use ourselves, so it's still
01:06:02.240 | a bit rough around the edges, but it's a very nice code base.
01:06:07.520 | Now that we've trained our model, which took a long time, I'm going to cover the next steps briefly,
01:06:17.200 | and talk a little bit about them because we also have nice open-source libraries for this, so I want
01:06:22.720 | to tell you a little about them. Once you've pre-trained your model, you usually want to align
01:06:29.040 | it, which means you don't want to keep it as a completion model that just generates
01:06:35.200 | the most likely tokens after the prompt; you want it to behave in a
01:06:41.440 | specific way. Usually you start by having your model behave as a dialogue model, so that it
01:06:46.880 | learns to generate answers to prompts and not just continuations, and you also sometimes want to
01:06:53.120 | instill specific behaviors, or to do some safety work, to forbid or
01:07:01.200 | reduce the occurrence of specific behaviors of your model. So this step is called alignment,
01:07:07.600 | or fine-tuning, and I would say that, up to now, there was one rather complex technique for it, called RLHF,
01:07:15.440 | reinforcement learning from human feedback. The impressive thing about RLHF, I would say, is that
01:07:21.680 | it works at all; it's basically maybe the first widespread use of reinforcement learning
01:07:28.640 | in the AI world that's actually really useful for many, many people. But it's still really
01:07:34.240 | complex in how it works, and the main tricky thing, as always in reinforcement learning,
01:07:42.480 | is the reward. Usually, in reinforcement learning, you define your reward manually; it's very complex,
01:07:48.240 | very full of heuristics, and that's one reason these systems don't generalize to anything
01:07:53.200 | beyond their test environment. The nice thing about RLHF is the HF part, which is that
01:07:59.920 | you define your reward from human feedback: you generate some completions,
01:08:07.040 | you ask humans to rank them, and you use that to train a reward model. Now, this is very nice, but
01:08:14.960 | it's kind of a complex thing; you can see here a typical labeling interface for humans to
01:08:21.920 | label the rewards. In practice, I would say, the very impressive thing is that it just
01:08:27.920 | works at all, but the implementation is very complicated: you have something like four models. You
01:08:34.240 | have the policy model that you're training, you have a base model that you
01:08:38.640 | still use because you want to stay not too far from it, you have a reward model that you trained
01:08:43.600 | on the human feedback, and you have the SFT model. All of these models need to be in memory at the
01:08:48.720 | same time; you can do some smart sharing of layers, but it's basically
01:08:54.880 | very complex, and that's why we started to build a library called TRL to make this easier.
01:09:02.800 | It's also very challenging in terms of fitting all of this in memory.
01:09:06.800 | Now, something very interesting happened last year, which was DPO, direct preference
01:09:15.440 | optimization. The idea here is that maybe your language model already knows the reward
01:09:21.280 | somehow, and so maybe it can be used without training a separate reward model. I'm saying
01:09:28.080 | that with a lot of hand-waving, but you can write it much more precisely as an equation,
01:09:35.280 | the DPO equation here. The DPO paper actually has a very nice math
01:09:41.760 | part; that's not always the case for machine learning papers, where sometimes the math is just there to
01:09:46.720 | get past reviewer 2, but in this case the math is really very nice and very interesting. The
01:09:53.760 | conclusion is that you can maybe just go with two models, the DPO model and the SFT model, which
01:09:58.560 | makes training much easier. What we saw with the RLHF team, led by Lewis Tunstall
01:10:06.320 | and Edward Beeching, together with former HF people like Nathan and Nazneen, was that basically
01:10:13.440 | it makes training much more stable, and it actually just kind of works out of the box,
01:10:17.840 | because your objective is much closer to a standard language modeling objective.
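A minimal sketch, not from the talk and not the TRL implementation, of the DPO loss computed from the summed log-probabilities of the chosen and rejected completions under the policy and the frozen reference (SFT) model; beta and the log-probability values are placeholders.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        """-log sigmoid(beta * (log-ratio of chosen - log-ratio of rejected))."""
        chosen_logratios = policy_chosen_logps - ref_chosen_logps
        rejected_logratios = policy_rejected_logps - ref_rejected_logps
        return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()

    # placeholder per-example summed log-probs for a batch of 4 preference pairs
    policy_chosen   = torch.tensor([-12.3, -40.1, -7.9, -22.0])
    policy_rejected = torch.tensor([-15.0, -38.2, -9.5, -25.4])
    ref_chosen      = torch.tensor([-13.0, -41.0, -8.2, -23.1])
    ref_rejected    = torch.tensor([-14.1, -37.9, -9.1, -24.8])

    loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)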
01:10:22.720 | So DPO changed a lot, I would say, how we align these models, and there was this question,
01:10:30.880 | maybe earlier this year: is this the end of it? Have we again moved reinforcement learning
01:10:39.040 | out of the set of most-used ML techniques? Well, no: there has recently been a revival of RL through
01:10:47.920 | the REINFORCE algorithm, which maybe some of you know if you were working in the field
01:10:52.720 | some time ago; at least, I was playing a lot with it for language modeling a long time ago.
01:10:57.520 | The idea is that at least the papers from Reka here and from Cohere show that
01:11:04.240 | REINFORCE, and more on-policy RL in general, may still be very competitive with DPO, and
01:11:10.720 | maybe even better. So the jury is still out in 2024: is DPO the answer, or will we see a revival
01:11:20.320 | of RL? We'll see. Now you have fine-tuned your model: you pre-trained it, you aligned it, the behaviors are
01:11:28.640 | great, you're very happy, you think it's a nice model, you evaluated it as we discussed. Now
01:11:34.000 | you need to deploy it, and that will be my last slides. It will actually be quite short,
01:11:41.120 | because I think there are a lot of resources on this, but maybe one thing to keep in mind is that
01:11:45.520 | there have been multiple breakthroughs in inference optimization over the last few months. I would
01:11:51.680 | say it's really impressive. I remember, like two years ago, when I was saying, okay, we might want
01:11:56.800 | to deploy a new model of seven or ten billion parameters, people said this is never going to
01:12:02.960 | work, these models are just too big. Well, the reality is that today, on my laptop, I can run Mistral
01:12:09.200 | 7B and it's just really fast, even faster than me talking. There are a couple of things that
01:12:15.680 | made that possible; those are the things I'm listing on these slides. The first one, I would say,
01:12:20.400 | is quantization. That's the first impressive thing: we can just quantize these models, we can move them
01:12:25.680 | from the floating-point values that they have, FP16 for most of them, or bfloat16, to quantized integers, and
01:12:34.800 | it just works; we lose minimal performance. We have various setups, various techniques:
01:12:41.760 | GPTQ, the techniques included in llama.cpp, the GGML formats, and NF4. I put a couple of them
01:12:50.640 | out there; they all just work really well. I think a good default, honestly, is the one from llama.cpp:
01:12:57.120 | it's very nice and it works well. This basically solved huge pain points, also in terms of
01:13:03.120 | model size, because models are much, much smaller once quantized.
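A minimal sketch, not from the talk, assuming the transformers and bitsandbytes libraries are installed and a CUDA GPU is available: loading a model with 4-bit NF4 weights so it takes roughly a quarter of the memory of its fp16 version. The checkpoint name is just an example.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "mistralai/Mistral-7B-v0.1"       # example checkpoint

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                       # quantize linear-layer weights to 4 bits
        bnb_4bit_quant_type="nf4",               # NF4 data type
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )

    inputs = tokenizer("Quantization lets a 7B model fit on", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))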
01:13:09.120 | Now we can do even better with speculative decoding, which is super interesting, and a recent development of it
01:13:14.000 | called Medusa. The idea here is that we have two models that are roughly similar, but one is
01:13:19.040 | much smaller than the other, and they're trained roughly on the same dataset; they should be
01:13:24.480 | as close as possible. The small one will predict whole chunks of tokens, and then we
01:13:29.760 | just use the big one to validate how good these chunks are, and to keep the
01:13:35.040 | tokens up to the point where they start to diverge from what the large model would have output. This
01:13:42.000 | means we can generate tokens in bunches and just validate them with the big model. This
01:13:48.160 | takes a little more room in memory, but not that much, because the small model is much smaller,
01:13:53.920 | and it speeds up inference by a lot. This is basically what lets us run very large models on a laptop.
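A simplified sketch, not from the talk: a greedy draft-and-verify loop with two Hugging Face-style causal LMs (no sampling, no KV cache, batch size 1; draft_model and target_model are placeholders for a small and a large model that share a tokenizer).

    import torch

    @torch.no_grad()
    def speculative_generate(draft_model, target_model, input_ids, k=4, max_new_tokens=32):
        """The small model proposes k tokens; the big model checks them in one forward pass."""
        ids = input_ids
        prompt_len = input_ids.shape[1]
        while ids.shape[1] - prompt_len < max_new_tokens:
            # 1) draft k tokens greedily with the small model (k cheap forward passes)
            draft = ids
            for _ in range(k):
                next_tok = draft_model(draft).logits[:, -1].argmax(-1, keepdim=True)
                draft = torch.cat([draft, next_tok], dim=1)
            # 2) score all drafted positions with the big model in a single forward pass
            target_pred = target_model(draft).logits[:, ids.shape[1] - 1 : -1].argmax(-1)
            drafted = draft[:, ids.shape[1]:]
            # 3) keep drafted tokens until the first disagreement, then take the big model's token
            agree = (drafted == target_pred)[0].long()
            n_ok = int(agree.cumprod(dim=0).sum())
            ids = torch.cat([ids, drafted[:, :n_ok], target_pred[:, n_ok : n_ok + 1]], dim=1)
        return ids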
01:13:59.040 | There is a nice blog post I really like, called "Accelerating Generative AI with PyTorch:
01:14:06.720 | GPT, Fast", which shows you all the other techniques you can use: you can compile your model, you can
01:14:12.800 | use CUDA graphs. That is basically something we covered just earlier: the idea
01:14:17.840 | of reducing CPU-GPU synchronization as much as possible, so that your GPU
01:14:24.080 | goes through the layers as autonomously as possible and you do as little synchronization
01:14:30.400 | with your CPU as possible, which gives you even more of a speed-up. Really impressive.
01:14:35.120 | These are the inference techniques; there are a lot of them, and I just put a few references there for
01:14:41.680 | you to explore. The final step: you've pre-trained your model, you've aligned it, you're very happy about
01:14:49.760 | the inference, you've quantized it, you can distribute it. The final step is to share it with the world. We need
01:14:56.320 | more knowledge, we need more models shared openly, we need more open datasets, we need a lot more
01:15:01.840 | sharing in the world. Thankfully, at Hugging Face we've been building a place to share these things, so
01:15:08.080 | use Spaces, evaluate your model openly on the Open LLM Leaderboard, put it on the really great Chatbot
01:15:14.480 | Arena, set up a chat demo for people to try it. Basically, please share all the knowledge that
01:15:21.600 | you've learned and all the artifacts that you've created, as much as you can. That will be my only
01:15:27.920 | reward, the one I'm asking from you for this video. Thanks! I actually kept this question slide from my talk;
01:15:36.720 | I cannot really answer questions on YouTube, but please leave comments, or open a post
01:15:42.080 | on Hugging Face, or ping me anywhere, and I'm very happy to answer any question
01:15:47.840 | that you may have on this. Thanks a lot for watching. Bye!