A little guide to building Large Language Models in 2024
Chapters
0:00 Intro
0:59 Workflow for LLMs
1:17 Data preparation - intro and good recent resources on data preparation
5:28 A web scale pretraining corpus - goals and challenges
11:29 Web scale data sources – Focus on recent datasets
18:01 Language and quality filtering
24:34 Diving in data deduplication
27:40 Final data preparation for training
31:31 How to evaluate data quality at scale
36:29 The datatrove and lighteval libraries
38:18 Introduction to modeling techniques for LLM training
39:09 When the model is too big: parallelism
40:00 Data parallelism
41:18 Tensor parallelism
44:38 Pipeline parallelism
47:00 Sequence parallelism and references on 4D parallelism
47:52 Synchronisation: GPU-CPU and GPU-GPU challenges
52:14 Flash attention v1 and v2
56:23 Stable training recipes
59:12 New architectures: Mixture-of-experts
63:13 New architectures: Mamba
64:49 The nanotron library
66:15 RLHF in 2024
68:23 PPO, DPO and REINFORCE
71:23 Quantization, speculative decoding and compilation: overview and resources
74:36 Sharing your model, datasets and demo – final words
00:00:00.000 |
Hi everyone. So two weeks ago I gave a graduate class here in Amsterdam to 200 PhD students about 00:00:06.560 |
how to build, how to train a large language model from scratch in 2024. I tried in this talk to 00:00:14.320 |
highlight the dark secrets, the thing that people don't talk a lot about but are very crucial to 00:00:19.600 |
getting good performance large language models, and maybe to also highlight a bit what is more 00:00:24.640 |
hype than reality. And when I shared the slides afterwards there was a lot of interest for this 00:00:30.480 |
so I decided I would actually re-record the talk and post it on YouTube as well. So here is our 00:00:38.080 |
little guide to building a large language model in 2024. In this talk I'm gonna cover three main 00:00:46.320 |
parts - training, fine-tuning, inference. I think for fine-tuning and inference you can already find 00:00:51.520 |
super good recipes, super good blog posts and explanations online so I really spend most of my 00:00:57.360 |
time on training, which is the part that's you know mostly like dark science I would say today. 00:01:02.400 |
In training you have three parts - data preparation, efficient training techniques, 00:01:07.680 |
evaluation. It's the same here, I'll spend most of my time on the first part, data preparation, 00:01:12.880 |
because that's really the secret sauce that I want to highlight today. So let's dive right in. 00:01:21.280 |
You can believe me or you can also believe much smarter people at OpenAI or Anthropic 00:01:26.640 |
when I say that basically the most important part in your training is the dataset. So I really like 00:01:31.760 |
this blog post from James at OpenAI which highlights how you know by training many many 00:01:38.480 |
architectures he basically found that in the end they all converge to roughly the same behavior 00:01:44.720 |
which is determined fully by the dataset. So what he says is this - the "it" in AI models 00:01:50.800 |
is the dataset. Basically model behavior is much less determined by architecture or 00:01:56.800 |
hyperparameters than we think and much more by your dataset. He actually says it's your dataset, 00:02:02.160 |
nothing else. Well I still talk about architecture a little bit. And Amanda from Anthropic 00:02:08.240 |
basically said the same thing last week when she tweeted "is this emergent behavior 00:02:14.800 |
coming from data or from the model?" and basically she said "none of us has ever magically pulled 00:02:21.680 |
anything out of the ether, it's all coming from the dataset". So if you're more into YouTube than 00:02:28.560 |
Twitter, I think there is a nice video that jokingly summarizes all of this by Rutger Bregman. 00:02:35.040 |
Let me play it. That's a video I think about when I read all the tech 00:02:40.880 |
reports that are only talking about model architecture and don't say anything about the 00:02:46.000 |
data. I mean it feels like I'm at a firefighters conference and no one's allowed to speak about 00:02:52.080 |
water. I mean this is not rocket science. I mean we can talk for a very long time about all these 00:02:56.800 |
stupid philanthropy schemes. We can invite Bono once more but come on, we got to be talking about 00:03:02.160 |
taxes. That's it. Taxes, taxes, taxes. All the rest is bullshit in my opinion. 00:03:06.640 |
So basically for us we got to be talking about data. Data, data, data. All the rest is bullshit 00:03:15.920 |
in my opinion. So now that I've kind of painted the landscape, let's dive into 00:03:22.160 |
what I mean by that. I think another nice 00:03:30.800 |
recent paper is the Yi paper. So maybe if you've been following the field you probably saw 00:03:35.840 |
that many Chinese teams have actually trained very good models recently. And the nice thing 00:03:41.360 |
is that they also have a very very good tech report. Much better than what we have I would 00:03:45.520 |
say in the western world where everyone is now very shy about sharing anything. And so the Yi 00:03:51.360 |
models are a very good model if you look at the benchmark. And basically when training them they 00:03:57.760 |
say that their underlying assumption is that when you train on extensive data of high enough quality 00:04:03.120 |
a standard architecture can exhibit advanced capabilities. So basically you don't need, for 00:04:09.440 |
now, to look beyond transformers, or maybe, as I will discuss later, slight 00:04:17.360 |
extensions like mixture of experts. If you have very good data, just spend the time on carefully 00:04:24.320 |
crafting your data set and for now stay on one of these simple architectures that we use today. 00:04:29.840 |
I think there is extensive resources as always. I could have cited like 20 papers but I try to keep 00:04:37.920 |
like a small list of resources so you can read them extensively. I think these four ones are 00:04:46.080 |
nice recent examples. The survey on data selection for language models by Allen AI is very nice. 00:04:53.680 |
The paper I just mentioned by the Yi team is really great, and I think two recent datasets 00:05:00.080 |
that were open-sourced and shared a lot more about how they were built are the Dolma dataset 00:05:06.160 |
from Allen AI and also RefinedWeb. A nice thing about RefinedWeb is that I'm working with 00:05:12.720 |
Guilherme the lead author of this at Hugging Face and so we'll have much more news about this data 00:05:18.720 |
set to share and I think it's a very nice work. So you can use data for many things. So when you 00:05:25.760 |
talk about data you actually talk about various type of data. You can use data for pre-training 00:05:31.280 |
your model, you can use data for instruction tuning, you can use data for alignment which is 00:05:37.200 |
basically after having pre-trained your model you really want to align it so it learns how to exhibit 00:05:42.080 |
the nice behavior that you want. In particular a dialogue behavior which is one we often want to 00:05:47.680 |
have when we interact with these models. You can also have model data more for in-context learning, 00:05:53.280 |
for RAG training, retrieval training, and I would say each of these aspects will have different 00:05:59.120 |
goals and will require different data. So as a rough idea for instance for pre-training you want 00:06:04.640 |
really the maximal diversity. You want to assume that your model just has no way to generalize. So 00:06:11.120 |
if the behavior you want at the end is not in the pre-training data there is no way the model will 00:06:18.160 |
discover it. You have to put it in the training data. For alignment it's quite different. You want 00:06:22.880 |
very clean data because you're training your model to exhibit some specific behavior. You want model 00:06:28.160 |
to really be very good at you know like a function call or like you want your model to be very good 00:06:34.080 |
at dialogue. So you want the model really to train and to learn this behavior. So usually 00:06:40.480 |
this data set can be much smaller and they can be much more carefully cleaned. 00:06:44.320 |
In pre-training you will want some noise so your model knows about the noise. In particular there 00:06:49.520 |
is a debate you know should you use no toxic data or like maybe no bad language data. 00:06:54.880 |
Right now I think the main approach to this problem by people is to use a lot of like toxic 00:07:03.520 |
data or like a lot a decent amount so that the model is already exposed to this. It's a little 00:07:08.160 |
bit like your kid, if you want. If you want to tell them that drugs are bad, they have to first know 00:07:13.760 |
about drugs. You cannot really expect them to learn that this is something they shouldn't 00:07:21.120 |
touch, that they should not be using, if you don't tell them what it is. It's the same for language models 00:07:26.960 |
in some way. We want them to be exposed to this data to a small amount of it so that they can 00:07:31.920 |
learn later to avoid this and they will know what they need to avoid. Basically assume that there is 00:07:37.440 |
no generalization capabilities in this model. If you want to tell them anything about something 00:07:42.480 |
positive or negative you have to first put it in the model. So let's talk about pre-training stage. 00:07:49.200 |
I already covered a little bit but basically you want to have maximal coverage you want to cover 00:07:52.880 |
everything. So you will train on a massive quantity of text, at least 1 trillion tokens nowadays, and 00:07:58.960 |
I think you probably want to aim for more like 10 trillion tokens. The challenges that you want to 00:08:04.480 |
solve here you want to maximize diversity and coverage and you want to maximize quality as 00:08:10.080 |
much as possible because this is still you know something that your model will learn. So if your 00:08:15.440 |
model learn mostly noise you will still get noise out. So you want to have a little bit of this so 00:08:20.320 |
it's kind of robust to this but you don't want to have too much of this. Here is one example. 00:08:26.320 |
Basically you would want your model a good rule of thumb is that you will want your model to know 00:08:31.520 |
two things. You want your model to know the thing that you may want it to generate at the end. 00:08:36.240 |
So if you want to generate knowledge about physics you will want to put that in the model and we want 00:08:42.080 |
also your model to learn the thing that it might be exposed to. So you want your model to be familiar 00:08:47.920 |
with the thing that the users might input. So if you have inputs that might be noisy from the users 00:08:53.360 |
your model should still be trained on it. Otherwise it will be out of distribution 00:08:58.160 |
and as I said the safest bet here is to assume your model doesn't generalize at all. 00:09:02.400 |
The main challenge here is maximal diversity, good quality but still a little bit of noise 00:09:09.280 |
and data quality evaluation. How do you measure data quality at the billion token scale. 00:09:14.800 |
That's what we're going to talk a little bit about as well. So here is the typical pipeline to train 00:09:19.680 |
a model. So you start by collection. I'm going to talk a little bit about that. You want to filter 00:09:23.840 |
by languages which language you want to keep and then you have a set of filters. You have basically 00:09:28.800 |
two main type of filters. You have some filters that are more heuristic so they are kind of rules 00:09:33.760 |
that you wrote and there are some filters that are more like ML models. So you have a model that you 00:09:40.160 |
train to identify some good quality text. Usually you want to combine two and then you have a set of 00:09:46.160 |
filters that are more semantic. The rule and the ML model are usually a little bit more on the surface 00:09:51.040 |
level and then you want to cover really the topics that you need to know about. If you want to know 00:09:56.000 |
about physics, you want to know about technology, you want to be sure that these are in and so you 00:10:00.480 |
have a step of like more topic filtering and basically be sure that you extract this topic 00:10:06.160 |
very well. This is another example, from RefinedWeb; the first one was from Yi. This is from RefinedWeb, 00:10:16.160 |
just to show you how much data we remove. So we start from Common Crawl which is basically the 00:10:21.920 |
internet crawled since 10 years ago and basically we filter that and you can see that there is a 00:10:30.000 |
lot of things that you will remove. First I would say language removal. If you only keep English, 00:10:34.640 |
English is roughly half of the internet. The second biggest language is usually Russian and 00:10:39.760 |
then you have all of the other in Common Crawl. So basically remove half of it when you only filter 00:10:45.440 |
for English. Then you will have a lot of deduplication. Why do you want to do deduplication? 00:10:51.600 |
Well, we'll talk a little bit about that later. And then you extract and filter a bit more, 00:10:56.880 |
and in the end you end up with about 10% of the original Common Crawl size. So if you want to 00:11:02.960 |
get a trillion token that means you really want to start with a very large source. This is an 00:11:09.280 |
example, this one from the Allen AI survey. It's roughly the same steps that you will see here. 00:11:15.600 |
Language filtering, some heuristics, some what they call data quality which is machine learning 00:11:21.360 |
based usually. Some deduplication and then topic filtering basically. So where can you start from? 00:11:31.920 |
You want as I said something very large because you'll just keep like 10% of it. 00:11:36.880 |
So there is two main large sources of data I would say today. One is Common Crawl, 00:11:40.960 |
one is the internet basically and the other one is more like for code. Usually you want to start 00:11:46.160 |
from GitHub or something like Software Heritage or like a place where this has been carefully 00:11:51.920 |
already extracted from the web. You can use some curated sources like Wikipedia or books and then 00:11:59.920 |
in books you have this big question, you know: should I use only public domain books, which usually stops 00:12:05.280 |
100 years before today, so 1924 as of now, or do you want to dive into more of the copyright 00:12:13.040 |
equation. So that's the big question, I would say, for model trainers today. 00:12:17.760 |
And you have more recent trends like synthetic data generation where you basically will ask one 00:12:23.200 |
LLM to generate some data specifically for you and because you're kind of paying compute for data 00:12:29.520 |
here you can scale this quite largely. So there is a full new trend on this 00:12:36.080 |
spearheaded by Microsoft and the Phi models, which were trained on billions of synthetically 00:12:42.160 |
generated data from GPT-4. I think it's quite interesting that you can really craft the data 00:12:48.400 |
set in a more controlled way here because you can say okay I want this topic, this topic, this topic, 00:12:53.920 |
this behavior and given the quality of large language models today the quality of the resulting 00:13:00.000 |
data is actually very high. There is even a recent interesting paper from Apple which is about you 00:13:06.480 |
know rephrasing the web so you take one page and you actually ask an LLM to write it cleanly and 00:13:12.400 |
if you train on this data which is very clean and still cover a lot of diversity you can train 00:13:17.440 |
actually three times faster because you use three times less data. It's very recent but it's super 00:13:22.720 |
interesting. Okay I talk a little bit about this resource in more details because we've been 00:13:30.480 |
releasing data set on this at HuggingFace and I want to show you a little bit what we released 00:13:35.120 |
and I go in reverse order, so I start with synthetic data. Recently, Loubna, Anton 00:13:41.760 |
and Leandro at Hugging Face released a dataset called Cosmopedia, which is a 00:13:47.360 |
synthetic dataset of 30 million samples, that's actually billions of tokens, and it was generated 00:13:53.120 |
using one of the best open source models today, which is Mixtral Instruct, and here you can 00:13:59.360 |
see how basically this is controlled for various seeds so basically we give the model a slight 00:14:06.480 |
small topic you know or a sentence from a document and you can choose where this comes from and you 00:14:12.720 |
ask the model to to write content you know from this seed sample on the topic. So we took some 00:14:19.600 |
very clean sources like the Stanford open courses or OpenStax, which is also open textbooks, 00:14:26.480 |
Khan Academy that you maybe know and also some web data so I would say more more diverse data 00:14:32.880 |
and even instruction tuning datasets. And then you can also ask the model to write for various 00:14:40.240 |
audiences: you can ask the model to write this for college students, you know, to write a textbook 00:14:44.720 |
article on this topic for college students or for high school students you can also ask the model 00:14:50.560 |
to write in various styles to write blog posts about this topic and so you can actually have 00:14:55.680 |
a lot of diversity even though it's synthetic. Here is a quick example of all the clusters you 00:15:04.480 |
can do topic clustering to check that you know that you cover a lot what we discovered is that 00:15:09.680 |
we could still cover even more clusters and I would say right now the work on Cosmopedia 0.2 00:15:16.400 |
is to extend this to even more cluster and to get basically more coverage and so here you can see 00:15:22.400 |
that we trained a 1 billion parameter model on this to show the performance, 00:15:28.720 |
and it's really competitive with web datasets even while being much smaller. But I would say it can 00:15:34.480 |
be even better by having more coverage, so stay tuned for Cosmopedia 0.2 00:15:40.960 |
coming in April. If we go now to code data there was a very nice release earlier this year called 00:15:49.440 |
Starcoder 2 and the Stack V2 so the Stack V2 is really the largest code data set out there that's 00:15:57.360 |
prepared for large language model pre-training it's more than 3 billion files in 600 programming 00:16:05.920 |
languages in total you have like billions of tokens you have roughly 1 trillion tokens in the 00:16:11.760 |
Stack V2. So to get all this data basically we didn't crawl ourself we partnered with one of the 00:16:18.720 |
non-profit foundation out there called Software Heritage which is a non-profit who has been 00:16:24.560 |
focusing on archiving all code that has been out there since you know 10 years ago really 00:16:31.520 |
and basically there is a question, you know: when you gather all 00:16:37.360 |
this data, what do you do? Do you sell it to, I would say, private, closed-source companies, 00:16:43.040 |
or do you partner with an open-source company to train an open-source model on this? 00:16:47.440 |
And so Software Heritage reached out to us to partner on the training of a new 00:16:52.880 |
open-source code generation model called StarCoder 2, that you can use as well and which 00:16:58.320 |
is one of the best code completion models out there today. It's a very large collaboration, 00:17:05.120 |
actually an open collaboration so you see all the others there mostly led by Hugging Face and 00:17:12.000 |
great people at ServiceNow. So really go check this out if you're interested in code data 00:17:17.520 |
it's by far the largest and the cleanest dataset out there for this. On web data: as I told 00:17:23.200 |
you, we've been working with the lead author of RefinedWeb to get a very large and very high 00:17:29.680 |
quality web dataset out there, so basically a filtered Common Crawl, for people to 00:17:36.720 |
start their training from a high quality dataset. This should also be out at the beginning of 00:17:43.440 |
April, maybe it's already out by the time you watch this, so just stay tuned on this. So now that we have our data sources, we 00:17:50.640 |
need to filter them. For filtering by language, I would say stay simple: fastText by Meta 00:17:57.600 |
is just great, so just use fastText. It works pretty well and it has all the 00:18:04.240 |
languages you may want to filter on. Now that we have filtered by language, we want to start cleaning 00:18:10.800 |
our datasets, and there are basically two ways to do that: heuristics and ML-based. We look at heuristics right after the sketch below. 00:18:16.480 |
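As an illustration (this is not code from the talk), here is a minimal sketch of an English-only filter built on fastText's off-the-shelf language-ID model; the model file name and the 0.65 confidence threshold are assumptions you would tune yourself.

```python
# Minimal sketch of language filtering with fastText's language-ID model.
# Assumes you have downloaded lid.176.bin from the fastText website.
import fasttext

lid_model = fasttext.load_model("lid.176.bin")

def keep_english(text: str, threshold: float = 0.65) -> bool:
    # fastText expects a single line of text; predict returns labels like "__label__en"
    labels, probs = lid_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= threshold
```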
Heuristics: the idea is that you will count items. So basically, if your documents 00:18:22.080 |
only have like you know two characters per line probably it's just it's just a bad list or like 00:18:28.720 |
something that you actually don't really want to use in your large language model. So, as a reminder, 00:18:33.600 |
you want to keep things that are either things your model should be able to generate 00:18:38.400 |
or things that you think your users might input into your model; you can drop what is neither. 00:18:43.440 |
So, basically: repetition, you know, a very long repetition of a single character; something that 00:18:50.320 |
has a very strange ratio of alphabetic characters to punctuation. All these statistics 00:18:58.080 |
that you can extract are a way to easily filter documents. The nice thing about heuristics is you 00:19:03.840 |
kind of know what you're filtering out you know you wrote the things yourself you can really set 00:19:08.480 |
the threshold by inspecting it and you have a very clear control on what you're removing from 00:19:15.200 |
your dataset, you know what it is. So these are the advantages I told you: it's kind of 00:19:21.200 |
controlled, it's robust, you know the prior. And I would say the drawbacks are that you're only relying on 00:19:27.440 |
surface-level signals, you're not looking at the meaning of the document. You may also remove too 00:19:32.320 |
much: sometimes you think you're just removing bad lists, but maybe these are also good lists that 00:19:37.200 |
your users may want to input into your model. One way to be a little bit more flexible about that is to 00:19:43.600 |
use stochastic removal: instead of it being a one-off binary choice, you sample a little bit 00:19:49.520 |
and you keep a little bit of the noisy data. Another drawback is that you will need to carefully tune 00:19:55.840 |
your hyperparameters here, you know, the thresholds on the statistics that you want to filter on, and that's sometimes a 00:20:01.920 |
little bit of a time-consuming process. A minimal sketch of such heuristic filters is shown below. 00:20:10.000 |
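Here is a minimal sketch of the kind of heuristic filters described above; every threshold is an illustrative placeholder to be tuned by inspecting your own data, not a value given in the talk.

```python
# Minimal sketch of heuristic quality filters (thresholds are illustrative placeholders).
import random
import string

def heuristic_stats(doc: str) -> dict:
    lines = [l for l in doc.splitlines() if l.strip()]
    n_alpha = sum(c.isalpha() for c in doc)
    n_punct = sum(c in string.punctuation for c in doc) or 1
    # longest run of a single repeated character (e.g. "-----" or "aaaaa")
    longest_run, run = 1, 1
    for prev, cur in zip(doc, doc[1:]):
        run = run + 1 if cur == prev else 1
        longest_run = max(longest_run, run)
    return {
        "mean_chars_per_line": len(doc) / max(len(lines), 1),
        "alpha_punct_ratio": n_alpha / n_punct,
        "longest_char_run": longest_run,
    }

def keep_document(doc: str, keep_noise_prob: float = 0.05) -> bool:
    s = heuristic_stats(doc)
    good = (
        s["mean_chars_per_line"] > 10      # drop list-like pages with tiny lines
        and s["alpha_punct_ratio"] > 2.0   # drop punctuation-heavy boilerplate
        and s["longest_char_run"] < 50     # drop pathological repetitions
    )
    # stochastic removal: keep a small fraction of "bad" docs so the model sees some noise
    return good or random.random() < keep_noise_prob
```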
Another way to do dataset quality filtering is machine-learning-based filtering. Here, basically, how you do it is that you will have a set of good examples and 00:20:14.960 |
a set of bad example and you will train either a classifier or a perplexity based filtering 00:20:22.480 |
to you know to classify or to predict the next token so classifier based you know usually the 00:20:30.800 |
standard one is to use a fastText classifier with some n-grams, and you label your documents as good, 00:20:37.840 |
bad, whatever. Perplexity-based: you train a very small language model, so usually we use 00:20:44.240 |
a KenLM model, and we say that if the perplexity is too high then we filter out the 00:20:52.640 |
document. I would say the advantage here is that you have more of a semantic understanding, 00:20:59.200 |
hopefully from your ml model even though we use very simple machine learning techniques here 00:21:04.240 |
um and you don't you know need to tweak all the hyper parameter that you tweak for heuristics 00:21:10.560 |
the main disadvantage is that you're not really controlling what you remove okay you you have a 00:21:17.920 |
very vague view of what the biases are. So let me give you an example: Wikipedia. If you train 00:21:24.320 |
your filtering model on Wikipedia as the good examples and you filter based on this, well, Wikipedia is written more than 90% 00:21:32.160 |
by men, so you're basically also filtering your pre-training corpus to be mostly male-written. 00:21:38.640 |
Do you want this bias? Well, maybe not, right? So these are things that you still need to be 00:21:44.720 |
careful about, and basically it's really hard to know exactly what bias you're introducing. A tiny sketch of a classifier-based filter is shown below. 00:21:53.040 |
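As a hedged illustration of the classifier-based approach, here is a tiny sketch using fastText's supervised classifier; the training file `quality_train.txt`, its labels and the threshold are hypothetical.

```python
# Minimal sketch of an ML-based quality filter using a fastText classifier.
# Each line of the (hypothetical) quality_train.txt is "__label__good <text>" or "__label__bad <text>".
import fasttext

clf = fasttext.train_supervised(input="quality_train.txt", wordNgrams=2)

def looks_high_quality(text: str, threshold: float = 0.5) -> bool:
    labels, probs = clf.predict(text.replace("\n", " "))
    return labels[0] == "__label__good" and probs[0] >= threshold
```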
A couple of additional notes on data filtering, very important notes actually: 00:21:57.840 |
you will have several parts in your training data even if it's only web documents you will 00:22:03.920 |
have you know some part of the web data are blog posts some part of the web are you know like 00:22:08.640 |
tutorials some part of these are companies websites all of these are somehow specific 00:22:15.840 |
domains and you want to make sure they are all you know um processed in a good way so you need 00:22:22.160 |
to make sure that for each of these big domains that you want to have at the end you actually 00:22:26.800 |
didn't do something bad in the pre-processing so there is various way to do that you can you know 00:22:31.600 |
cluster and identify a list of documents in a cluster but just one thing to remember about 00:22:37.600 |
all of this and i would say it's a general rule of all good quality data processing is that you 00:22:42.800 |
will want to manually inspect the data. Inspect the data that you've been keeping, inspect how it looks at 00:22:49.520 |
the end, how it was filtered. Is it still really readable? Is your LaTeX document well processed? 00:22:58.000 |
Is your PDF OCR well extracted? Manually go through the data that you keep and also through the data 00:23:04.720 |
that you remove: did you remove something that you think is actually very important? You need to sample, 00:23:10.160 |
you need to take a look. You can take a look just at the most important parts: for instance, you can 00:23:15.280 |
sort your data by top URLs per token count and just read 10 documents for these top URLs and make sure that 00:23:22.400 |
these 10 documents are really well filtered okay very likely you need to also craft specific 00:23:30.320 |
domain focused hyperparameters for instance for your heuristics maybe they will work well for 00:23:36.240 |
blog posts but maybe they will just badly filter LaTeX documents. So you can either say, okay, I craft 00:23:42.240 |
specific rules for this domain, or you can also say, I'll just add this domain afterwards. For 00:23:48.560 |
instance for code you could say, I remove all code from the web data and I'll just add a very big code 00:23:54.080 |
dataset. But try to think about the implications of doing that: you will basically remove, for 00:23:58.960 |
instance, some mixed natural-language-and-code documents, so you want to make sure you add these 00:24:04.000 |
back again so that your model still covers this type of input. As I told you, you can also make 00:24:11.520 |
use of some stochastic selection so if a rule is maybe just too hard too harsh you may want to just 00:24:18.880 |
stochastically sample in the filtering so that you keep a little bit of noise you can smooth a bit 00:24:25.600 |
your rules. Now, deduplication: why do you want to deduplicate? Well, the idea is that there is 00:24:33.920 |
a lot of duplication on the web, that's something to really be mindful of and aware of. The web is 00:24:39.920 |
hugely duplicated, and so duplication will increase the density around some topics. Wikipedia is 00:24:47.760 |
copied a lot over the internet so maybe that's nice to have a lot of density around wikipedia so 00:24:52.880 |
that you're sure that your model has seen it a lot but um you also have to be aware that duplicated 00:24:58.080 |
points they have more chance of being memorized okay they will also take more time because you 00:25:03.120 |
will go during your training more times over the same data points so it takes more compute 00:25:09.040 |
during training, and do you really want that? You really need to think about that. Reducing 00:25:14.960 |
duplication, i.e. deduplication, has also been shown to improve accuracy, so generally deduplication 00:25:19.680 |
is something that's very important and that you want to do well. How can you deduplicate? Well, 00:25:27.760 |
you have a couple of methods. You have fuzzy methods, where you basically extract some 00:25:33.040 |
fixed-size hash of your documents, and so you will lose a little bit of accuracy here because 00:25:40.240 |
these hashes are just a rough summary of the n-grams in your document, and then you will filter 00:25:45.600 |
them either by MinHash, which is, I would say, quite a good method in general, or by Bloom filters, 00:25:54.800 |
which are much stronger on deduplication because you just keep one hash and just keep one document 00:26:00.320 |
per hash, so it's very strong: you have a fixed-size vector, which is very constraining. 00:26:05.120 |
And if you don't want to do fuzzy deduplication you can use exact deduplication, where, 00:26:10.880 |
with a suffix array, you will extract exactly all the duplicates in your documents. 00:26:17.360 |
They both have trade-offs in advantages and drawbacks. Exact filtering is very costly in 00:26:26.560 |
memory because the suffix array tables are really huge. As I said, a Bloom filter is 00:26:34.000 |
a very, very strong filter, so usually, for instance in FineWeb, we use 00:26:40.640 |
MinHash a lot, because you can control a little bit more your trade-off between 00:26:46.320 |
memory and accuracy. A minimal MinHash-style sketch is shown below. 00:26:56.400 |
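Here is a minimal MinHash-style sketch using the datasketch library as one possible implementation (the talk does not prescribe a specific tool); the shingle size and similarity threshold are illustrative.

```python
# Minimal sketch of fuzzy deduplication with MinHash + LSH (datasketch library).
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    # hash word 5-grams (shingles) of the document
    words = text.lower().split()
    for i in range(max(len(words) - 4, 1)):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

def deduplicate(docs: dict[str, str], threshold: float = 0.8) -> list[str]:
    """Return the keys of documents kept after near-duplicate removal."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for key, text in docs.items():
        sig = minhash_of(text)
        if not lsh.query(sig):          # no near-duplicate already kept
            lsh.insert(key, sig)
            kept.append(key)
    return kept
```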
Speeding up deduplication is also a very big issue, I would say one of the very big challenges, and we saw a very interesting, counterintuitive result 00:27:02.160 |
recently: more deduplication also led us to keeping only bad data. Basically, when we were 00:27:07.040 |
deduplicating more and more, all the good data was taken out and the only remaining things 00:27:13.760 |
were basically bad quality data that was not deduplicated, but that was just so random that it 00:27:20.320 |
didn't fall into the deduplication buckets. So, I would say, for deduplication also be careful: 00:27:27.360 |
investigate what you're removing at the end and also what you're keeping, and don't 00:27:33.360 |
take this as a silver bullet: just like every filter out there, it's something that you should 00:27:38.400 |
double check yourself. Now that we've finished sourcing, language filtering, filtering by 00:27:46.640 |
quality (heuristic or ML), deduplicating, and topic filtering, we need to prepare the data for training. There are two 00:27:53.600 |
main things you need to do. First, we need to shuffle it: it might seem like a joke, but it's still very 00:27:58.160 |
important today. You don't want to train in the order of the Common Crawl dumps, you want a good 00:28:05.040 |
shuffling of all your data. And then you want to tokenize it. Recently there was a very 00:28:10.800 |
nice video by Andrej Karpathy on tokenizers, you should watch it if you want to know everything 00:28:16.000 |
about tokenizers, but generally there's just a set of good practices you should be 00:28:22.320 |
mindful of. The first one, I would say, is sample well through your whole dataset. The 00:28:29.440 |
first GPT-2 tokenizer was famous for including in the final vocabulary the names of 00:28:36.400 |
redditors, because it was really trained on Reddit data only. You don't want that: you really want to 00:28:42.080 |
shuffle so that one single part of your dataset is not over-represented 00:28:48.880 |
in the vocabulary of your model. For math you want to be careful about numbers: you 00:28:55.360 |
want to be careful that you don't have, for instance, 42 as a single token 00:29:00.880 |
and 43 as two tokens, just because 42 is much more used since the Douglas Adams book. So usually 00:29:09.360 |
what people do is either they split digits, so you split all the digits in every number, 00:29:15.280 |
which is what Llama does for instance, or you add the list of all numbers manually to your 00:29:22.000 |
vocabulary, up to a thousand for instance, which is what GPT-4 does; then you need to be sure that your 00:29:28.320 |
data set is big enough that every number is really well represented in it for code you want to be 00:29:34.960 |
mindful about tabs and spaces, they're very important, for instance, in Python, and so you want to handle 00:29:41.920 |
them well: you want the model to know what a double space and four spaces are. So just be careful 00:29:47.840 |
about this. And basically, if you need something by default, I would say a byte-level BPE is a good 00:29:54.960 |
standard way to train a tokenizer. Don't fall into a rabbit hole on tokenizers, they are not the thing 00:30:02.000 |
that will bring you to AGI. This is just something you want to do in a clean way, so 00:30:08.720 |
that you don't fall into the traps along the way: that you're able to process code and numbers, 00:30:15.360 |
that you don't have some strange tokens over-represented, but that's it. A minimal sketch of training such a tokenizer is shown below. 00:30:22.560 |
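Here is a minimal sketch of training a byte-level BPE tokenizer with the Hugging Face tokenizers library, including the digit-splitting trick mentioned above; the corpus file, vocabulary size and special token are placeholder assumptions.

```python
# Minimal sketch of a byte-level BPE tokenizer with digit splitting.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tok = Tokenizer(models.BPE())
# split digits individually (the Llama-style trick mentioned above), then byte-level pre-tokenization
tok.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),
    pre_tokenizers.ByteLevel(add_prefix_space=False),
])
tok.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=["<|endoftext|>"])
tok.train(files=["corpus_sample.txt"], trainer=trainer)  # hypothetical well-shuffled text sample
tok.save("tokenizer.json")
```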
By the way, you can also use the tokenizer to inspect your dataset; I'm going to talk a little bit about that. Scaling tokenization 00:30:27.840 |
is non-trivial you want to really parallelize that well because otherwise pre-processing 00:30:34.240 |
and tokenizing trillions of token can take quite a long time in the end and so there is two main 00:30:40.400 |
approach the first one is well parallelizing and then finding a way to efficiently merging the post 00:30:47.360 |
the tokenized data sets and shuffling it and the other way is that you tokenize during training 00:30:53.360 |
basically you feed the direct text to your model and you tokenize just before feeding the model 00:30:59.760 |
I would say the nice thing about the first one is that once your dataset is tokenized, 00:31:05.280 |
stopping training and resuming training is very easy, 00:31:10.960 |
very efficient, very reliable. In the second case, well, you can change the tokenizer 00:31:17.920 |
easily, but usually you don't really need to do that a lot, and resuming while being sure that 00:31:25.280 |
you're starting exactly from where you were is usually slightly trickier. 00:31:32.320 |
Now, how do you evaluate data quality? That's really tricky because we're talking about trillion- 00:31:39.920 |
token-scale datasets, so it's really hard to have some good metrics to evaluate the data quality, 00:31:45.520 |
so a lot of this is, you know, inspecting some exact documents yourself, as I will tell you. And one 00:31:52.240 |
good way, I would say, is training small models to test it. So typically what we've 00:31:58.000 |
been training here, for instance, is 1 to 2 billion parameter models, and you train these on 00:32:03.600 |
a Chinchilla-optimal data size (roughly 20 tokens per parameter); you don't need to train for longer, you're not using this model 00:32:08.480 |
for inference or anything, so that's roughly 30 gigatokens. When you train your model you need to 00:32:14.960 |
find some high-signal benchmarks. Not all the benchmarks in NLP are high-signal. What is high 00:32:20.480 |
signal? There are two ways I've seen it defined. One way is to make sure that your 00:32:27.840 |
metric on this benchmark is monotonically increasing during training: you want 00:32:34.400 |
basically some benchmark where you really see your model learning increasingly, and 00:32:39.520 |
not oscillating a lot, because otherwise, depending on when you stop, you will have very different 00:32:45.120 |
results. You also want to have a low variance, which means if you train on various seeds, if you train 00:32:51.680 |
on various parts of your dataset, you want to be sure that you're roughly in the 00:32:58.000 |
same ballpark, or at least the standard deviation that you're measuring is small enough that you can 00:33:02.800 |
really tell datasets apart. So usually you will want to have two debugging datasets, one of high 00:33:10.080 |
quality; a standard very high quality dataset is C4, it's really a dataset that has 00:33:16.800 |
stood the test of time in terms of high quality. And you want another dataset that's maybe much 00:33:22.160 |
noisier; the Pile is sometimes an example, or you can take just pure, unfiltered Common Crawl. 00:33:27.040 |
You should really see a distance between the measurements on your benchmark on these 00:33:33.760 |
two datasets, the performance of your trained model on these two datasets. And obviously you want your 00:33:39.040 |
model to be above the random baseline, that's also one indication of a good benchmark: 00:33:45.280 |
if a 1-2 billion parameter model is not above the random baseline, you're just measuring noise. 00:33:51.360 |
And there are some tricky details to make sure that you have high signal. These are some things 00:33:57.280 |
we have in lighteval, but basically, for instance, if you want to measure multiple-choice 00:34:02.640 |
questions, which is often the case for these small benchmarks where you have 00:34:06.640 |
four continuations and you want the model to select one of the four: small models, 00:34:12.720 |
what I call small models is 1 to 2 billion parameter models, really prefer what we call 00:34:19.600 |
normalized likelihood, so we measure the likelihood of each answer, normalize it by the 00:34:24.800 |
length, and take the highest likelihood. Larger models, when we move to like 30, 40, 00:34:31.760 |
even 70 billion, well trained, they prefer lettered answers, where you lay out the 00:34:36.720 |
answers and then you say select between A, B, C, D, and the model just generates A, B, C or D. Here you 00:34:42.640 |
can have nice calibration curves because you have a very clear uncertainty on this single generated 00:34:48.960 |
token. So keep this one for larger models; for small models, I would say, focus on normalized likelihood, sketched below. 00:34:56.240 |
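Here is a minimal sketch of length-normalized likelihood scoring for a multiple-choice question, in the spirit of what lighteval does but not using its API; the model name and the toy question are placeholders, and normalization can also be done by character count rather than token count.

```python
# Minimal sketch of normalized-likelihood scoring for multiple-choice evaluation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for your own 1-2B debugging model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def normalized_logprob(context: str, continuation: str) -> float:
    ctx_ids = tok(context, return_tensors="pt").input_ids
    cont_ids = tok(continuation, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # position i of log_probs predicts token i+1 of `ids`
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    cont_start = ctx_ids.shape[1]
    total = sum(
        log_probs[pos - 1, ids[0, pos]].item()
        for pos in range(cont_start, ids.shape[1])
    )
    return total / cont_ids.shape[1]  # normalize by continuation length in tokens

question = "Question: What is the capital of France?\nAnswer:"
choices = [" Paris", " London", " Berlin", " Madrid"]
print(max(choices, key=lambda c: normalized_logprob(question, c)))
```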
So that was small-model training. Another thing, I've talked about this a lot already, is manual data inspection: 00:35:02.160 |
take your top domains, take your top URLs, inspect 10 documents for each of them, inspect also at 00:35:08.400 |
various stages in your pipeline, and also take a look at what you've discarded. Always. 00:35:16.560 |
you can set up a search tool in your data set that's also very useful you can do some clustering 00:35:22.480 |
to see and to be able also to inspect top documents per maybe more clusters than url so we have here a 00:35:28.800 |
nice library by Leandro at Hugging Face called text-clustering; we also have a nice search tool 00:35:36.720 |
in this, so really take a look at this library and use it. There is also a more uncommon trick 00:35:43.040 |
that I really like, from Teven (there were a lot of people at Hugging Face who are 00:35:49.360 |
now at Mistral), and Teven, who is now at Mistral, told me once that he uses the tokenizer 00:35:55.600 |
to inspect data: basically you can train a tokenizer on your dataset and you can take a look at the 00:36:02.560 |
longest tokens and maybe the last tokens, so the least frequent tokens, and see: 00:36:09.520 |
do you have strange things, do you have JavaScript parts, do you have names of redditors 00:36:14.880 |
like I was telling you? If they look bad, that means that you have some high frequency of bad 00:36:21.840 |
quality data in your dataset. Here we have a nice library that we've been releasing 00:36:30.320 |
just last month for doing all of these data processing pipelines: it's called datatrove. 00:36:36.560 |
It's by Guilherme, the lead author of RefinedWeb, and basically it started as an open reproduction 00:36:43.200 |
of RefinedWeb, so a very high quality filtered Common Crawl, and what we ended up with was kind of a 00:36:49.360 |
fully fledged lightweight library for processing, filtering and deduplicating text data, and basically 00:36:54.880 |
preparing very large datasets for LLM training. You have pre-built blocks for all the 00:37:01.280 |
steps that I showed you here, it's fully in Python, and it's very easy to set up on Slurm or locally 00:37:09.360 |
and to use remote file systems as well if you need to. So take a look at datatrove; it's a very small, 00:37:17.440 |
self-contained Python library, I would say, but you really have all the basic blocks that 00:37:22.000 |
you may want to use. A rough sketch of what a simple datatrove pipeline can look like is shown below. 00:37:27.840 |
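Here is a rough sketch of what a small datatrove pipeline can look like, based on the reader/filter/writer building blocks the library exposes; treat the exact class names, import paths and arguments as assumptions and check the datatrove README for the real API.

```python
# Rough sketch of a datatrove-style pipeline (class names/paths are assumptions).
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.writers.jsonl import JsonlWriter

pipeline = [
    JsonlReader("raw_data/"),                              # hypothetical input folder
    LambdaFilter(lambda doc: len(doc.text.split()) > 50),  # toy quality heuristic
    JsonlWriter("filtered_data/"),
]

executor = LocalPipelineExecutor(pipeline=pipeline, tasks=4)

if __name__ == "__main__":
    executor.run()
```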
When you want to evaluate your model, we have one library that works well with datatrove and the pipeline, which is called lighteval. Lighteval is a very lightweight 00:37:34.640 |
LLM evaluation suite, inspired by the amazing EleutherAI harness. I would say the main 00:37:41.600 |
difference is that it integrates from the ground up the 3D parallelism I'm going to talk about next, 00:37:47.600 |
so basically efficient model training and inference, and you can play a lot with the prompts 00:37:54.560 |
and the evals. I was telling you, for instance, that small models really prefer normalized 00:38:00.240 |
log likelihood while bigger models prefer lettered answers, and so here you can play with 00:38:06.000 |
the prompts easily, to see how much signal you can extract for each benchmark at your specific 00:38:14.560 |
debugging model size. Now, we've talked a lot about data, so let's talk a little bit about modeling. 00:38:23.040 |
That's the part everyone is waiting for, that's easily the most exciting part, that's the reason we're 00:38:27.840 |
all in ML, and I'm very happy to still cover this, I would say. So what are the essential elements when 00:38:34.320 |
you train? Well, there are three main things. The first one is efficiency and size: you want to fit your 00:38:41.280 |
billion-parameter model efficiently on your GPUs and you want to train really fast, so there are some 00:38:46.800 |
recipes for this that I'm going to cover quickly. Then you want to train in a roughly 00:38:53.760 |
stable way: you have to avoid instabilities, but still you want to stay really close to the edge. And 00:38:59.360 |
then you have the last question which is capacity and that's where we're going to talk a little bit 00:39:03.200 |
about other architecture than just the transformers but that's i would say just the last part so how 00:39:09.680 |
do you train models efficiently, in particular when they're too big to fit on one GPU? When it fits on 00:39:09.680 |
one GPU there is no real problem, right, with a very small model. But your 7B, 13B, 30B models, 00:39:14.800 |
they are just too big for one GPU and a decent batch size, so you need to parallelize them. 00:39:23.920 |
Today we have four ways to do parallelism, roughly. We have data parallelism, that's something everyone 00:39:35.040 |
has been using already, I would say. You have tensor parallelism, pipeline parallelism, and a much more 00:39:41.840 |
recent, I would say, or slightly more recent, sequence parallelism. I'm going to cover them 00:39:46.080 |
briefly. I would say here my idea is more to give you kind of an overview of everything, more 00:39:52.080 |
than to really dive deep, because in each of these topics you could dive really deep from a technical 00:39:57.360 |
point of view. So this is really entry level, and I put some references, again just a select 00:40:04.400 |
couple of references that you can read to dive deeper into this. Let's start with the first parallelism: 00:40:09.280 |
Data parallelism usually works out of the box, that's the easiest one; the only challenge is the 00:40:15.840 |
data loading, to make sure that your model replicas will have different data as input. So what does 00:40:24.640 |
data parallelism do? You take the model and you duplicate it on several GPUs, you feed each replica a different 00:40:31.360 |
part of your batch, and then you all-reduce the gradients, so that 00:40:37.200 |
you have basically a larger batch on three GPUs, for instance, than you had on one GPU. So you can 00:40:43.040 |
process different parts of your data in parallel and you just make one optimization step. 00:40:49.360 |
The main challenge, I would say, is the last part: the all-reduce that you use to merge 00:40:56.960 |
the gradient updates, and actually when you scale to very large models it can start to become a huge 00:41:03.120 |
bottleneck, so we'll talk a little bit more about that; a minimal sketch of data parallelism in PyTorch follows below. 00:41:13.200 |
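Here is a minimal sketch of data parallelism with PyTorch's DistributedDataParallel; the linear model and random dataset are toy stand-ins, and the script assumes it is launched with torchrun.

```python
# Minimal sketch of data parallelism with PyTorch DDP.
# Launch with: torchrun --nproc_per_node=8 train_dp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()    # toy stand-in for your transformer
model = DDP(model)                            # replicates weights, all-reduces gradients

dataset = TensorDataset(torch.randn(4096, 1024))
# the "data loading challenge": each replica must see a different shard of the data
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for (x,) in loader:
    x = x.cuda()
    loss = model(x).pow(2).mean()             # dummy loss
    loss.backward()                           # gradient all-reduce overlaps with backward
    optimizer.step()
    optimizer.zero_grad()
```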
Now, tensor parallelism is for when you're limited in your data parallelism. So why would you be limited by data 00:41:19.520 |
parallelism there is two main cases one case is basically your model is just too big to fit on 00:41:25.120 |
one gpu so you cannot replicate your model on various gpu you need to split the model somehow 00:41:32.400 |
the other case is when your batch size by replicating the model start to be too big 00:41:37.920 |
okay so let's say you want to really scale the model and now you start to have like one to four 00:41:44.320 |
million token batch sizes. Well, if you start to have a very large batch size, the model, at each 00:41:50.240 |
optimization step, makes less efficient use of each token, because the batch size is so big that each 00:41:57.520 |
token is kind of washed out in the optimization step. It's a little bit hard to measure 00:42:03.520 |
this limit, which we call the critical batch size; it's roughly around four to six million tokens, it's 00:42:09.200 |
different for small and bigger models, but basically you cannot really go to 100 million 00:42:14.960 |
token batches like that. So you want to find another way to parallelize, to make more efficient use of 00:42:21.440 |
your data and so one way to do that is to use tensor parallelism tensor parallelism is slightly 00:42:28.640 |
more involved because you need to rewrite your model code you cannot just rewrite the data 00:42:33.440 |
loading code you need to change the model why because you will divide all the matrix multiplication 00:42:39.600 |
all the matrices that we use in the model into two, or four, or eight, depending on your tensor 00:42:45.440 |
parallelism degree and you will put each part of the weights each sub part of this weight matrices 00:42:53.360 |
on various GPUs, and synchronization will happen after the operations. So here you need to rewrite the 00:43:01.040 |
model code. The nice thing is that you can combine smart column and row slicing to try to reduce the 00:43:07.120 |
number of synchronization points let me show you a little bit here you have two main parts in a 00:43:14.000 |
transformer as you may remember you have feed forward networks you know you usually have two 00:43:20.160 |
matrix multiplications with an activation in between; it can be a bit more if you're using 00:43:26.720 |
something different, for instance a GLU variant, but basically you will have one matrix 00:43:30.720 |
multiplication, some activation, and another matrix multiplication. And here you can basically 00:43:37.680 |
split the first matrix multiplication in one direction usually column wise you do separately 00:43:44.880 |
your activation on each gpu you don't need to synchronize and then you gather by doing the 00:43:50.480 |
opposite slicing at the end on the second matrix multiplication you do a row slicing 00:43:56.080 |
to gather again your output your activation you can do the same smart thing for self-attention 00:44:04.320 |
where you will do one part matrix multiplication in one direction you will split the matrix 00:44:09.200 |
you will do like softmax dropouts separately and then you will combine them with you know 00:44:15.360 |
another parallel operation in the other direction. This way you reduce the number of synchronization 00:44:15.360 |
points, because you can do a couple of operations without needing to synchronize between the GPUs; a small sketch of this column/row split for the feed-forward block is shown below. 00:44:24.160 |
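Here is a small illustrative sketch (plain tensors, not nanotron or Megatron code) of the column/row split of a feed-forward block, simulating two tensor-parallel ranks in one process to show that only one synchronization (the final sum) is needed.

```python
# Illustrative simulation of column/row tensor parallelism for a feed-forward block.
import torch

d_model, d_ff, tp = 16, 64, 2
x = torch.randn(4, d_model)                    # activations, replicated on each rank
W1 = torch.randn(d_model, d_ff)                # first matmul, split column-wise
W2 = torch.randn(d_ff, d_model)                # second matmul, split row-wise

W1_shards = W1.chunk(tp, dim=1)                # each rank holds d_ff/tp columns
W2_shards = W2.chunk(tp, dim=0)                # each rank holds d_ff/tp rows

partial_outputs = []
for rank in range(tp):                         # in reality these run on different GPUs
    h = torch.relu(x @ W1_shards[rank])        # no sync needed: activation is local
    partial_outputs.append(h @ W2_shards[rank])

# the only synchronization point: an all-reduce (here a simple sum) of partial outputs
y_tp = sum(partial_outputs)
y_ref = torch.relu(x @ W1) @ W2                # unsharded reference
assert torch.allclose(y_tp, y_ref, atol=1e-5)
```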
The tricky part is always that when you synchronize you're going through the network, and 00:44:29.360 |
that's much that's much slower than just the computation the last part that you can use when 00:44:40.960 |
you when you don't want to use tensor parallelism or when you cannot scale tensor parallelism enough 00:44:45.120 |
is pipeline parallelism so usually you want pipeline parallelism when your like network 00:44:50.800 |
is not fast enough to do full tensor parallelism everywhere okay pipeline parallelism reduce 00:44:58.480 |
the number of network exchanges because you will put some layers on some gpu 00:45:03.360 |
and other layers and other gpu and you will just communicate at the interface between two layers 00:45:09.920 |
or like two groups of layers. So you can see here, you will put, for instance, layers 00:45:17.520 |
zero to three on one GPU, layers four to seven on the second GPU, etc. Here the challenge, I would 00:45:24.800 |
say is to keep all the gpu busy so you don't want to have just you know one group one gpu working 00:45:30.880 |
for the first layers of your batch and then being idle while you have the other gpu working for the 00:45:37.040 |
other layer you know as we go as we go forward in the model and it can be very challenging 00:45:43.520 |
to actually keep have maximal utilization of the gpu so usually you have like complex interleaving 00:45:50.320 |
of the forward and the backward passes. I can show you here a little bit: you have the forward 00:45:56.480 |
passes in blue and the backward passes in green, and you can see what we do in this case is that we 00:46:02.960 |
split our batch into smaller mini-batches. So for instance we split a long batch into four 00:46:10.800 |
mini-batches, and when the first mini-batch is done on the last device we already start the backward 00:46:18.400 |
while we are still doing the forward pass on the other GPUs for the later mini-batches, and this way you 00:46:24.800 |
can reduce what we call the bubble. The tricky thing here, as you probably got it, is that 00:46:30.000 |
for tensor parallelism you needed to rewrite the model code, as I told you, 00:46:34.880 |
and here you also need to rewrite the optimization code: you cannot just do forward, 00:46:41.680 |
then loss, then backward, because you have parallel execution of backward and forward 00:46:48.880 |
passes. So this makes the code quite complex to use, and that's why actually we have a new library 00:46:53.680 |
called nanotron that tries to keep this as simple as possible. 00:46:57.440 |
there is a last way to do parallelization called sequence parallelism so be careful 00:47:05.040 |
because there is two use of sequence parallelism there is one which is kind of a smart way to do 00:47:10.800 |
ring attention to do attention on very long sequences but the one i talk a little bit about 00:47:16.240 |
today is another simpler way it's quite similar to tensor parallelism in a way but instead of 00:47:23.280 |
slicing the parameter matrices like we do we slice the sequence this way and the idea is 00:47:29.680 |
if you took tensor parallelism here it's the top box we still had some operation between each tensor 00:47:38.080 |
parallelism operation where we were not really parallelized in any way and the idea is on this 00:47:44.160 |
operation which are applied independently for each token we could split along the sequence 00:47:52.880 |
and so we could parallelize these along the sequence axis. It's mostly interesting when you're 00:47:58.720 |
doing training, usually, because you need long sequences, or when you're doing prefill, which is a little bit like training. 00:48:03.600 |
now what can you read if you want to know more about this there is many reference 00:48:10.480 |
on parallelism; I tried to extract the ones that I think give you the highest level overview 00:48:16.480 |
and cover as much as possible of these things. I really like this first paper from Joel at 00:48:22.960 |
ServiceNow, which is not very well known but I think is very interesting as it covers a 00:48:28.160 |
lot of the challenges here: Breadth-First Pipeline Parallelism. "Reducing Activation Recomputation 00:48:34.320 |
in Large Transformer Models" is a very nice one, then there is the one on sequence 00:48:40.320 |
parallelism that I told you about, and the last one, called Sequence Parallelism, is actually this 00:48:44.960 |
ring attention paper that I think is also very interesting but is maybe more an extension of this 00:48:50.800 |
presentation now we talk about a lot about parallelization okay but there is an additional 00:48:56.960 |
thing that you need to be mindful about is synchronization i already talked a little bit 00:49:01.680 |
about synchronization okay during tensor parallelism and the thing i talk a little bit 00:49:05.840 |
about reducing synchronization and here you you have to be very careful about that well 00:49:10.720 |
Why? Well, you have two types of synchronization. You have one synchronization which is between 00:49:15.920 |
various GPUs, which is basically when you do something like an all-reduce operation 00:49:24.080 |
in tensor parallelism, and you have one synchronization which is between CPU and GPU, which 00:49:29.120 |
is when your CPU basically launches the kernels on the GPU. And you want to reduce these, or at least you want to 00:49:35.920 |
make sure that for both of these as much as possible you can do an overlap of computation 00:49:42.560 |
and communication so basically if you can do something called asynchronous computation 00:49:48.160 |
basically where you will asynchronously start some operation and do some communication during 00:49:53.760 |
this time it's much better so let me talk about two things we we talk a little bit during the 00:49:59.600 |
data parallelism part about the cost of the all reduce at the end so that's something you probably 00:50:05.520 |
have been using already without knowing it in pytorch which is if you look at the distributed 00:50:11.600 |
data parallel so the ddp uh in pytorch you can see that there is a very smart way to do all reduce 00:50:20.000 |
so let's look at here basically typically you will usually do like all your forward and your 00:50:24.960 |
backward, and then you will do your all-reduce at the end. Well, this is very annoying because 00:50:32.000 |
during the all-reduce, where you gather all your gradients together, you don't do any computation, 00:50:38.000 |
you're just waiting for synchronization there, you're just waiting for your GPUs to exchange 00:50:43.440 |
all the gradients, and that's not something you really want: you want to keep your GPUs busy. 00:50:48.560 |
So if you have a way such that every time one layer is finished computing you can already 00:50:54.880 |
start reducing, you can already start, in parallel to the computation, 00:51:00.000 |
communicating gradients, then you should try to do that. And if you take a look at the PyTorch code 00:51:05.440 |
for DistributedDataParallel, that's something that they do. Another example is in pipeline 00:51:13.440 |
parallelism: we saw this forward-backward interleaving that reduces the bubble, and here 00:51:22.160 |
you can also try to overlap this very long gradient reduction 00:51:29.360 |
here with some forward passes of the next batch. And just a quick example and a quicker note 00:51:40.960 |
about CPU and GPU synchronization. Here what you will want is to reduce as much as possible the 00:51:46.800 |
number of times your CPU needs to inspect the data or needs to launch a kernel, so we want to fuse kernels: 00:51:54.000 |
you want to fuse the operations that could go together, the result of your attention, your 00:51:58.720 |
activation. If you can do all of that on the GPU without the CPU needing to say, okay, now it's time 00:52:04.960 |
to compute the activation, now it's time to do this, you should do that. That's usually done by 00:52:10.000 |
merging operations into single kernels; a tiny illustrative sketch follows below. 00:52:18.240 |
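As a tiny illustration of kernel fusion, here is a sketch using torch.compile as one common way to get fused elementwise kernels; the talk itself does not prescribe a specific tool, and the function below is a made-up example.

```python
# Tiny illustration of kernel fusion via a fusing compiler (torch.compile).
import torch

def gelu_bias_residual(x, bias, residual):
    # three elementwise ops that a fusing compiler can merge into one GPU kernel,
    # instead of the CPU launching three separate kernels
    return torch.nn.functional.gelu(x + bias) + residual

fused = torch.compile(gelu_bias_residual)

x = torch.randn(4096, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")
residual = torch.randn(4096, 4096, device="cuda")
out = fused(x, bias, residual)   # first call compiles, later calls reuse the fused kernel
```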
Now I want to talk a little bit about attention. That's very interesting because, if you were in the field like one year, 00:52:23.760 |
one year and a half ago, a little bit more maybe now, we had a lot of work on designing efficient 00:52:30.720 |
attention computation because people were really very scared by the quadratic cost of attention 00:52:38.720 |
okay and all of this disappeared now you know there was like all this reformer all these very 00:52:44.160 |
long attention smart and the main reason this disappeared was that uh our friend tree dao at 00:52:52.000 |
stanford invented flash attention flash attention the idea is basically you will just not materialize 00:52:59.520 |
the attention matrix so the attention matrix is this very like large is the n square sequence 00:53:06.000 |
square size matrices comparing each token you know to make the attention between all of them 00:53:12.000 |
what you could do is instead of building these very large matrices you can just 00:53:16.880 |
on the fly you know build small matrices and just keep the statistics that you need 00:53:21.200 |
to compute your softmax along the way and that's what flash attention does that's the first step 00:53:27.040 |
The second step for FlashAttention is that if you compute only small parts of 00:53:32.720 |
your attention matrix along the way, you may even make these parts small enough that they actually fit 00:53:39.760 |
in the SRAM of the GPU. The static random access memory, the SRAM, is a much, much smaller memory, but 00:53:50.560 |
it sits right next to the compute units, and it cannot be shared across the whole device: 00:53:56.320 |
it is local memory for a group of processing units, while the HBM, the high 00:54:04.240 |
bandwidth memory, is shared by everything. The HBM is this 80 or 40 gigabyte memory 00:54:10.800 |
that you see advertised; it's really large, but it also has much lower bandwidth than 00:54:16.240 |
the SRAM. So you can compute your attention not in one big memory but in small tiles, 00:54:23.200 |
with running statistics, and these tiles can be small enough to fit in this much faster on-chip 00:54:28.320 |
memory. This way you can compute attention much faster, while using 00:54:34.800 |
much less memory. So FlashAttention somehow solves the quadratic cost of 00:54:41.680 |
attention, and that's why we don't really care much anymore about linear attention 00:54:49.040 |
mechanisms, for instance, also because their performance was never really able to match full attention, 00:54:55.840 |
apart maybe from sparse attention in some ways. FlashAttention v2 was a development of 00:55:02.320 |
FlashAttention, roughly another two times faster, and here the idea was mostly to have as much as 00:55:09.440 |
possible of the computation in matmul FLOPs. You have to know something about GPUs as well, which is that 00:55:16.400 |
GPUs are heavily optimized for matrix-matrix multiplication, so each time you do something 00:55:22.560 |
like a division, or basically anything other than a multiplication, you're paying a cost, and the cost 00:55:29.600 |
is high; a division is something like 60 times more expensive than a multiplication. So for instance, 00:55:36.720 |
when we compute the softmax we usually divide by some normalization factor; 00:55:43.200 |
you want to keep that rescaling and do it just once at the end; you don't 00:55:48.000 |
want to do it on each and every element of your computation. These types of things 00:55:54.960 |
are basically what FlashAttention 2 brings, along with better handling of the causal mask (if 00:56:01.600 |
you're computing a causal mask, you just don't need to compute half of the matrix) and better 00:56:06.720 |
work partitioning, making better use of the thread blocks and warps of the GPU. I won't dive 00:56:11.920 |
into this because there's a lot to unfold here, but I would say it's a bit more 00:56:17.440 |
incremental, still a very, very nice improvement. Below is a toy sketch of the online-softmax idea behind FlashAttention. 00:56:24.880 |
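Here is a toy PyTorch sketch of the online-softmax / tiling idea; the real kernels also tile the queries and run entirely in SRAM, so this is only meant to show why the full N-by-N attention matrix never needs to exist:

    # Sketch: stream over key/value tiles, keeping only running softmax statistics
    # (max and sum) so no (N x N) matrix is ever materialized.
    import torch

    def streaming_attention(q, k, v, tile=128):
        # q, k, v: (N, d)
        N, d = q.shape
        scale = d ** -0.5
        out = torch.zeros_like(q)
        running_max = torch.full((N, 1), float("-inf"))
        running_sum = torch.zeros(N, 1)
        for start in range(0, N, tile):
            k_t, v_t = k[start:start + tile], v[start:start + tile]
            scores = (q @ k_t.T) * scale                      # only an (N, tile) block
            tile_max = scores.max(dim=-1, keepdim=True).values
            new_max = torch.maximum(running_max, tile_max)
            correction = torch.exp(running_max - new_max)     # rescale what we accumulated so far
            p = torch.exp(scores - new_max)
            out = out * correction + p @ v_t
            running_sum = running_sum * correction + p.sum(dim=-1, keepdim=True)
            running_max = new_max
        return out / running_sum

    q, k, v = (torch.randn(1024, 64) for _ in range(3))
    ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v   # the naive, fully materialized version
    assert torch.allclose(streaming_attention(q, k, v), ref, atol=1e-4)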
So now that we have something efficient (we've parallelized this well, we have a very efficient attention computation), we want to make 00:56:30.800 |
sure that we train well, and here, don't skip the hyperparameter search. There are a couple of very 00:56:37.760 |
important things that you need to go over: the learning rate (you want to do a nice hyperparameter search), 00:56:43.360 |
you want to make sure that your initialization is well done, with the right scales where 00:56:48.480 |
they need to be, and you want to make sure that your training is stable, but also not too 00:56:53.840 |
stable, that you're still on the verge of instability, still training with a very high learning rate. 00:56:58.000 |
Here I would say there is very little recent work on this, but there are two pieces that I really like. 00:57:05.120 |
There is the mu-transfer (muP) work, slightly older now I would say, but still very, very interesting, 00:57:10.160 |
on how to find hyperparameters on a small model and how to transfer them to a larger model; 00:57:15.920 |
this work by Cerebras was maybe one of the most interesting applications of mu-transfer. 00:57:22.480 |
And a very interesting recent work, from a Chinese team again, with a very nice set of 00:57:30.240 |
open-source experiments, is the MiniCPM blog post, where they really try to optimize the model 00:57:36.800 |
and to optimize the scaling of activations between the various parts of the model: 00:57:42.800 |
how you want to scale the activations between the embeddings and the first layers, etc. So you 00:57:47.200 |
should really give it a look. What is also very interesting is that they 00:57:51.680 |
challenged the dominant view that the cosine learning-rate schedule is the final schedule that everyone should 00:57:58.800 |
use from now on. Cosine is still really the great default schedule, but they use 00:58:03.920 |
a constant learning rate plus a warmup and a final decay, and they show that they get decent 00:58:10.960 |
performance with that as well. The nice thing about having a 00:58:17.680 |
constant learning rate is that you don't need to know from the beginning how long you will train 00:58:23.760 |
your model for, and that's very interesting because cosine kind of forces you into a very specific 00:58:29.520 |
shape, where you need to decide from the beginning of your training how long you are going to train, 00:58:34.480 |
and you cannot resume for longer. If we find a way to get good performance with 00:58:39.440 |
a flat learning rate and just a warmup and a decay (the decay is very important, that's what they show in 00:58:43.840 |
this paper), then maybe we can get out of this constraint of knowing from the beginning 00:58:52.960 |
how long we're going to train. So take a look at this paper, I think it's very nice in terms 00:58:57.120 |
of stable training recipes, and the takeaway here, I would say, is: don't skip this step, 00:59:03.440 |
do your homework on hyperparameter research. Below is a minimal sketch of such a warmup-constant-decay schedule. 00:59:13.920 |
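As an illustration (my own sketch, not the MiniCPM code), a warmup / constant / decay schedule can look like this; all hyperparameter values are placeholders:

    # Sketch: because the middle phase is flat, you only have to commit to the total
    # training length when you start the final decay.
    import math

    def wsd_lr(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=2000, decay_steps=20000, total_steps=None):
        if step < warmup_steps:                                   # linear warmup
            return max_lr * step / warmup_steps
        if total_steps is None or step < total_steps - decay_steps:
            return max_lr                                         # flat phase: run as long as you want
        # final decay phase: here a cosine-shaped decay down to min_lr over decay_steps
        progress = (step - (total_steps - decay_steps)) / decay_steps
        return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

    # usage: each step, set optimizer.param_groups[0]["lr"] = wsd_lr(step, total_steps=...)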
Now, the last part: that's the one people usually spend the most time talking about, so it's a good indication that 00:59:19.520 |
it's also the least important part, but let me still talk a little bit about this. 00:59:24.640 |
For a long time the transformer was believed to be the final architecture, so it's maybe slightly 00:59:33.280 |
sad, I would say, for the field, that we haven't had anything really new since the transformer paper 00:59:39.200 |
in 2017. Recently there were two extensions I want to cover: one is mixture of experts. 00:59:47.440 |
Mixture of experts reduces to a transformer in the limit of one expert, so it's slightly 00:59:53.440 |
a stretch to say that it's a fully new architecture, but it's still very interesting as a new knob 00:59:59.360 |
to play with capacity. Basically, one problem was that until now it was not very 01:00:06.240 |
efficient to train a mixture of experts, so let me explain a little bit. In a mixture of 01:00:11.280 |
experts, when your sequence of tokens goes through your model, at some point 01:00:17.280 |
you have a router that says, for each token, to which expert this token should go, 01:00:23.440 |
and the experts are basically at the MLP level, the feed-forward: you have several 01:00:29.600 |
feed-forward layers, and their number is the number of your experts; for instance, three 01:00:34.880 |
different feed-forward layers would be three experts. And your router will 01:00:40.240 |
say, for each token, okay, you should go to expert one, to expert two, to expert three. 01:00:45.040 |
Now, each expert was designed to accept only a certain number of tokens, for instance 01:00:52.960 |
two tokens in this example. And if three tokens should go to one expert, two tokens 01:00:59.680 |
to another, and one to the last one, the expert that gets three tokens is not able to accept 01:01:05.040 |
them all, and so drops one token: one input token that is simply not used in the 01:01:11.440 |
computation. That's quite a strong impact, I would say: it means you're 01:01:17.760 |
kind of ignoring a part of your inputs, and that led to, I would say, non-optimal performance. 01:01:24.240 |
But that was needed for the sake of having very static matrix shapes, 01:01:31.760 |
and our GPUs and TPUs are not really well adapted to dynamic architectures. Well, recently, using ideas of 01:01:39.600 |
sparsity, but not too sparse, because we also know GPUs don't really like very sparse matrices, 01:01:44.880 |
but they are actually quite good at block-sparse matrices. So what is block-sparse? It means 01:01:50.160 |
it's sparse, but you have blocks, and these blocks are big enough that they make efficient 01:01:55.120 |
use of the GPU: big enough that they fill your matmul units, 01:02:00.080 |
that there is enough to crunch, but between these blocks you have empty places. 01:02:08.160 |
This is basically what MegaBlocks recently did, and that's how it unlocked efficient 01:02:13.920 |
mixture-of-experts training: saying maybe our experts could be these blocks, these very 01:02:19.520 |
big feed-forward matrices, and if we use this, we can just repeat this block 01:02:26.320 |
for the varying number of tokens, and we can do the sparsity dynamically, because 01:02:32.880 |
it's actually just blocks that repeat. So it's, I would say, a kind of 01:02:37.760 |
low level of dynamicity, low enough that it can be very efficient. That's basically 01:02:45.360 |
what changed the game here: we don't need to drop tokens anymore, we can just dynamically 01:02:50.880 |
build these big sparse matrices from the experts. And it actually even opened the door to something 01:02:56.800 |
I think nobody has really been using yet, which is that you could have experts of various sizes: you could 01:03:02.080 |
have big experts, smaller experts, etc. Very interesting, and I'm really 01:03:09.760 |
looking forward to what will be built on top of this. (Below is a tiny sketch of the classic capacity-limited routing I just described.) 01:03:16.480 |
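Here is a toy sketch of capacity-limited top-1 routing, just to make the token-dropping problem concrete; the expert is replaced by an identity function, whereas a real MoE would apply a separate feed-forward network per expert:

    # Sketch: tokens routed to an expert beyond its fixed capacity are simply dropped,
    # which is exactly the problem block-sparse approaches like MegaBlocks remove.
    import torch

    def route_top1(tokens, router_logits, num_experts, capacity):
        # tokens: (num_tokens, d), router_logits: (num_tokens, num_experts)
        expert_choice = router_logits.argmax(dim=-1)          # which expert each token wants
        outputs = torch.zeros_like(tokens)
        dropped = []
        for e in range(num_experts):
            idx = (expert_choice == e).nonzero(as_tuple=True)[0]
            kept, overflow = idx[:capacity], idx[capacity:]   # only `capacity` tokens fit
            dropped.extend(overflow.tolist())                 # the rest are ignored entirely
            outputs[kept] = tokens[kept]                      # placeholder for expert_e(tokens[kept])
        return outputs, dropped

    tokens, logits = torch.randn(6, 8), torch.randn(6, 3)
    out, dropped = route_top1(tokens, logits, num_experts=3, capacity=2)
    print("dropped tokens:", dropped)   # any expert receiving more than 2 tokens drops the overflow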
Another interesting development was kind of a revival of recurrent models, and there are two main models; well, I'll just talk about Mamba. 01:03:24.880 |
The idea here is that you can use state space models. If you're just 01:03:32.000 |
out of your master's in AI, you probably learned about state space models: 01:03:36.720 |
these discrete or continuous models that evolve a state over time. 01:03:45.120 |
Here, all the smart work was about how to discretize this and keep it 01:03:50.080 |
efficient, and that was solved by Albert Gu and Tri Dao (again, of FlashAttention fame), including how to 01:03:57.920 |
train it efficiently. It's quite funny, because when you train a Mamba model, it 01:04:03.120 |
behaves kind of like a convolutional network, and when you use it at 01:04:07.760 |
inference you can use it in a kind of recurrent mode, so it's really, really fast. (A toy sketch of this dual view is below.) 01:04:13.840 |
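Here is a toy sketch of that dual view for a heavily simplified linear state space model with a diagonal state (not the selective, input-dependent Mamba): the step-by-step recurrence and the convolution give the same outputs.

    # Sketch: a linear time-invariant SSM can be run as a recurrence (cheap per token
    # at inference) or as a convolution over the whole sequence (parallel at training).
    import torch

    torch.manual_seed(0)
    N, L = 16, 32                       # state size, sequence length
    a = torch.rand(N) * 0.9             # diagonal state transition (|a| < 1 for stability)
    b, c = torch.randn(N), torch.randn(N)
    x = torch.randn(L)                  # a scalar input sequence

    # recurrent mode: h_t = a * h_{t-1} + b * x_t ; y_t = c . h_t
    h, y_rec = torch.zeros(N), []
    for t in range(L):
        h = a * h + b * x[t]
        y_rec.append((c * h).sum())
    y_rec = torch.stack(y_rec)

    # convolutional mode: y = conv(x, K) with kernel K_k = c . (a**k * b)
    K = torch.stack([(c * a**k * b).sum() for k in range(L)])
    y_conv = torch.stack([(K[: t + 1].flip(0) * x[: t + 1]).sum() for t in range(L)])

    assert torch.allclose(y_rec, y_conv, atol=1e-4)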
Mamba itself is quite hard to dive into, and I think the best entry point is, again, an annotated blog post 01:04:20.080 |
by Sasha Rush. Maybe you learned about the transformer architecture from the Annotated 01:04:25.680 |
Transformer by Sasha Rush a few years ago; now you also have an Annotated Mamba blog post, which is, 01:04:32.000 |
I think, a very nice way to learn about the Mamba architecture. We're actually training several 01:04:37.040 |
Mambas at Hugging Face at the moment with nanotron, so it's also very easy to train; I'll show you 01:04:44.080 |
a little bit. So, talking about nanotron: we wanted to have a very simple library to use all 01:04:52.240 |
the techniques that I showed you. We talked about parallelism, we talked about efficient training, 01:04:58.320 |
we talked about being able to nicely iterate on your hyperparameters, and also having 01:05:04.720 |
mixture of experts and Mamba. If you want to gather all of this, you usually end up with a very large library 01:05:10.960 |
with a lot of bells and whistles; we wanted to keep something very minimalistic, and that's how nanotron 01:05:15.920 |
was born. We want to keep it really under 10,000 lines of code, and we want to make it very fast, 01:05:23.600 |
basically train as fast as possible, and also very transparent, so there's not a lot of wrapping around 01:05:30.480 |
the things that you do. The idea is that it's very open, very transparent, and you have 01:05:36.720 |
in it 3D parallelism, gradient accumulation, mixed 01:05:43.200 |
precision (which I only mentioned briefly), and you also have everything to do 01:05:47.280 |
smart optimizer sharding (ZeRO stage 1) and all the architectures I was talking about at the end. 01:05:56.400 |
So take a look at nanotron. It's kind of research code that we use, so it's still 01:06:02.240 |
a bit rough around the edges, but it's a very, very nice code base. 01:06:07.520 |
Now that we've trained our model (it took a long time), I'm going to cover the next steps briefly, 01:06:17.200 |
talking a little bit about them because we also have nice open-source libraries for this, so I want 01:06:22.720 |
to tell you a little bit about them. Once you've pre-trained your model, you usually want to align 01:06:29.040 |
it, which means you don't want it to behave as a completion model, which just generates 01:06:35.200 |
the most likely tokens after the prompt; you want it to behave in a 01:06:41.440 |
specific way. Usually you want to start by having your model behave as a dialogue model, so that it 01:06:46.880 |
learns to generate answers to prompts and not just continuations, and you also sometimes want 01:06:53.120 |
specific behaviors, or you want to do some safety work, to forbid or 01:07:01.200 |
reduce the occurrence of specific behaviors of your model. This step is called alignment 01:07:07.600 |
or fine-tuning, and I would say that up to now there was a very complex technique called RLHF, 01:07:15.440 |
reinforcement learning from human feedback. The impressive thing about RLHF, I would say, is that 01:07:21.680 |
it works at all; it's basically maybe the first widespread use of reinforcement learning 01:07:28.640 |
in the AI world that's actually really useful to many, many people. But it's still really, really 01:07:34.240 |
complex. Basically, the main tricky thing here, as always in reinforcement learning, 01:07:42.480 |
is the reward. Usually in reinforcement learning you define your reward manually; it's very complex, 01:07:48.240 |
very full of heuristics, and that's kind of one reason you don't generalize to anything 01:07:53.200 |
beyond your test environment. The nice thing about RLHF is the HF part, which is that 01:07:59.920 |
you define your reward from human feedback: you generate some completions, 01:08:07.040 |
you ask humans to rank them, and you use that to train a reward model. Now, it's very nice, but 01:08:14.960 |
it's kind of a complex thing. You can see here a typical labeling interface for humans to 01:08:21.920 |
label the rewards. In practice, I would say the very impressive thing is that it just 01:08:27.920 |
works, but the implementation is very complicated: you have something like four models. You 01:08:34.240 |
have the model that you're training (the policy), you have a base model that you 01:08:38.640 |
still use because you want to stay not too far from it, you have a reward model that you trained 01:08:43.600 |
on the human feedback, and you have the SFT model. All of these models need to be in memory at the 01:08:48.720 |
same time; you can do some smart sharing of layers, but that's basically 01:08:54.880 |
very complex, and that's why we actually started to build a library called TRL to make this easier. 01:09:02.800 |
It's also very challenging in terms of fitting all of this in memory. 01:09:06.800 |
Now, something very interesting happened last year, which is DPO, direct preference 01:09:15.440 |
optimization. The idea here is that basically your language model maybe already knows the reward 01:09:21.280 |
somehow, and so maybe it can be used without training a separate reward model. I'm saying 01:09:28.080 |
that with my hands, but you can write it much more precisely in an equation, 01:09:35.280 |
the DPO equation here. The DPO paper actually has a very nice math 01:09:41.760 |
part (that's not always the case for machine learning papers, sometimes the math is just there to 01:09:46.720 |
get past reviewer 2), but in this case the math is really very nice and interesting, and the 01:09:53.760 |
conclusion is that you can maybe go with just two models, the policy and the SFT (reference) model, and that 01:09:58.560 |
makes training much easier. What we saw with the RLHF team, led by Lewis Tunstall 01:10:06.320 |
and Ed Beeching, together with former HF people like Nathan and Nazneen, was that basically 01:10:13.440 |
it makes training much more stable, and it actually just kind of works out of the box, 01:10:17.840 |
because your objective is much closer to a standard language modeling objective. (A minimal sketch of the DPO loss is below.) 01:10:22.720 |
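Here is a minimal sketch of the DPO loss, written from the equation in the paper rather than from the TRL implementation; the inputs are the summed log-probabilities of the chosen and rejected completions under the policy and under the frozen reference model:

    # Sketch: a logistic ranking loss on the implicit rewards
    # beta * (log pi_theta(y|x) - log pi_ref(y|x)).
    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp, beta=0.1):
        chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
        rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
        # push the chosen completion's implicit reward above the rejected one's
        return -F.logsigmoid(chosen_reward - rejected_reward).mean()

    # usage sketch: compute the four log-probs with the policy and reference models,
    # then back-propagate through the policy only.
    loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                    torch.tensor([-13.0]), torch.tensor([-14.8]))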
So DPO changed a lot, I would say, how we align these models, and there was this question, 01:10:30.880 |
maybe earlier this year: is this the end of it? Have we again moved reinforcement learning 01:10:39.040 |
out of the most-used ML techniques? Well, no: there has recently been a revival of RL through 01:10:47.920 |
the REINFORCE algorithm, which maybe some of you know if you were working in the field 01:10:52.720 |
some time ago; at least I was playing a lot with it for language modeling a long time ago. 01:10:57.520 |
The idea is that, at least, the papers from Reka and from Cohere show that 01:11:04.240 |
REINFORCE, and more on-policy RL in general, is maybe still very competitive with DPO, and 01:11:10.720 |
maybe even better. So the jury is still out in 2024: is DPO the answer, or will we see a revival 01:11:20.320 |
of RL? We'll see. Now you have your model: you've pre-trained it, you've fine-tuned it, the behaviors are 01:11:28.640 |
great, you're very happy, you think it's a nice model, you've evaluated it as we discussed; now 01:11:34.000 |
you need to deploy it. That will be my last slides, and it will actually be very short, 01:11:41.120 |
because I think there are a lot of resources here. But maybe just something to keep in mind is that 01:11:45.520 |
there were multiple breakthroughs in inference optimization over the last few months. I would 01:11:51.680 |
say it's really impressive. I remember, like two years ago, when I was saying, okay, we might want 01:11:56.800 |
to deploy a new model of seven or ten billion parameters, people said this is never going to 01:12:02.960 |
work, these are just too big. Well, the reality is that today, on my laptop, I can run Mistral 01:12:09.200 |
7B and it's just really fast, even faster than me talking, and there are a couple of things that 01:12:15.680 |
made that possible; those are the things I'm listing on these slides. The first one, I would say, 01:12:20.400 |
is quantization. That's the first impressive thing: we can just quantize these models, we can move them 01:12:25.680 |
from the floating-point values they have (FP16 for most of them, or bfloat16) to quantized integers, and 01:12:34.800 |
it just works, we lose minimal performance. We have various techniques: 01:12:41.760 |
GPTQ, the techniques included in llama.cpp (GGML), and NF4. I put a couple of them 01:12:50.640 |
out there, and they all just work really well. I think a good default, honestly, is the one from llama.cpp; 01:12:57.120 |
it's very nice, it works well, and this basically solved a huge pain point, also in terms of 01:13:03.120 |
model size, because models are much, much smaller once quantized. (A small usage sketch is below.) 01:13:09.120 |
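For instance, with the transformers and bitsandbytes libraries you can load a checkpoint with its linear weights quantized to 4-bit NF4, roughly like this (a sketch; the model id is just an example and a CUDA GPU is assumed):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "mistralai/Mistral-7B-v0.1"    # example checkpoint
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",            # the NF4 data type mentioned above
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )

    inputs = tokenizer("Quantization works because", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))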
Now we can do even better with speculative decoding, which is super interesting, and a recent development of it 01:13:14.000 |
called Medusa. Here the idea is that you have two models that are roughly similar, but one is 01:13:19.040 |
much smaller than the other, and they're trained roughly on the same dataset; they should be 01:13:24.480 |
as close as possible. The small one will actually predict full sequences, and then you 01:13:29.760 |
just use the big one to validate how good these sequences are and to keep the 01:13:35.040 |
tokens up until the point where they start to diverge from the tokens the large model would have produced. This 01:13:42.000 |
means we can generate tokens in bunches and just validate them with the big model. It 01:13:48.160 |
takes a bit more room in memory, but not much, because the small model is much smaller, 01:13:53.920 |
and it speeds up inference a lot as well. This basically lets us use very large models on a laptop. (A small sketch is below.) 01:13:59.040 |
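In the transformers library this is exposed as assisted generation; a rough sketch looks like the following, with the caveat that the two checkpoints must share a tokenizer and the model names here are only placeholders:

    # Sketch: a small draft model proposes tokens, the big model only verifies them.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    big_id = "mistralai/Mistral-7B-v0.1"          # example target model
    small_id = "path/to/a-small-draft-model"      # placeholder: must use the same tokenizer
    tokenizer = AutoTokenizer.from_pretrained(big_id)
    big = AutoModelForCausalLM.from_pretrained(big_id, device_map="auto")
    draft = AutoModelForCausalLM.from_pretrained(small_id, device_map="auto")

    inputs = tokenizer("Speculative decoding works by", return_tensors="pt").to(big.device)
    out = big.generate(**inputs, assistant_model=draft, max_new_tokens=40)
    print(tokenizer.decode(out[0]))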
There is a nice blog post I really like, called "Accelerating Generative AI with PyTorch: 01:14:06.720 |
GPT, Fast", which shows you all the other techniques you can use: you can compile your model, you can 01:14:12.800 |
use CUDA graphs (this is something we covered just earlier): it's the idea 01:14:17.840 |
of reducing CPU-GPU synchronization as much as possible, so you let 01:14:24.080 |
your GPU go through the layers autonomously and you do as little synchronization as possible 01:14:30.400 |
with your CPU, and that gives you even more speed-up. Really, really impressive. (A short sketch is below.) 01:14:35.120 |
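A minimal sketch of that, assuming PyTorch 2.x: compiling with the "reduce-overhead" mode captures the computation in CUDA graphs so the CPU barely intervenes during the forward passes (the tiny model here is just a stand-in for your LLM):

    import torch

    model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True).cuda().eval()
    compiled = torch.compile(model, mode="reduce-overhead")   # fuses kernels and uses CUDA graphs

    x = torch.randn(1, 128, 512, device="cuda")
    with torch.no_grad():
        y = compiled(x)   # first call compiles and captures; subsequent calls are fast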
These are the inference techniques; there's a lot there, and I just put a few references for 01:14:41.680 |
you to explore. The final step: you've pre-trained your model, you've aligned it, you're very happy about 01:14:49.760 |
the inference, you've quantized it, you distribute it. The final step is to share it with the world. We need 01:14:56.320 |
more open knowledge, we need more open model outputs, we need more open datasets, we need a lot more 01:15:01.840 |
sharing in the world. Thankfully, at Hugging Face we've been building a place to share this stuff, so 01:15:08.080 |
use the Spaces, evaluate your model openly on the Open LLM Leaderboard, put it on the really great Chatbot 01:15:14.480 |
Arena, set up a chat demo for people to try it. Basically, please share all the knowledge 01:15:21.600 |
you've learned and all the artifacts you've created, as much as you can. That's the only 01:15:27.920 |
reward I'm asking from you for this video. Thanks! I actually kept this questions slide from my talk; 01:15:36.720 |
I cannot really answer questions on YouTube, but please leave comments or open a post 01:15:42.080 |
on Hugging Face, or ping me anywhere, and I'm very happy to answer any questions 01:15:47.840 |
you may have on this. Thanks a lot for watching. Bye!