
A little guide to building Large Language Models in 2024


Chapters

0:00 Intro
0:59 Workflow for LLMs
1:17 Data preparation - intro and good recent resources on data preparation
5:28 A web scale pretraining corpus - goals and challenges
11:29 Web scale data sources – Focus on recent datasets
18:01 Language and quality filtering
24:34 Diving in data deduplication
27:40 Final data preparation for training
31:31 How to evaluate data quality at scale
36:29 The datatrove and lighteval libraries
38:18 Introduction to modeling techniques for LLM training
39:09 When the model is too big: parallelism
40:00 Data parallelism
41:18 Tensor parallelism
44:38 Pipeline parallelism
47:00 Sequence parallelism and references on 4D parallelism
47:52 Synchronization: GPU-CPU and GPU-GPU challenges
52:14 Flash attention v1 and v2
56:23 Stable training recipes
59:12 New architectures: Mixture-of-experts
63:13 New architectures: Mamba
64:49 The nanotron library
66:15 RLHF in 2024
68:23 PPO, DPO and REINFORCE
71:23 Quantization, speculative decoding and compilation: overview and resources
74:36 Sharing your model, datasets and demo – final words

Whisper Transcript

00:00:00.000 | Hi everyone. So two weeks ago I gave a graduate class here in Amsterdam to 200 PhD students about
00:00:06.560 | how to build, how to train a large language model from scratch in 2024. I tried in this talk to
00:00:14.320 | highlight the dark secrets, the things that people don't talk a lot about but that are very crucial to
00:00:19.600 | getting good performance out of large language models, and maybe to also highlight a bit what is more
00:00:24.640 | hype than reality. And when I shared the slides afterwards there was a lot of interest for this
00:00:30.480 | so I decided I would actually re-record the talk and post it on YouTube as well. So here is our
00:00:38.080 | little guide to building a large language model in 2024. In this talk I'm gonna cover three main
00:00:46.320 | parts - training, fine-tuning, inference. I think for fine-tuning and inference you can already find
00:00:51.520 | super good recipes, super good blog posts and explanations online so I really spend most of my
00:00:57.360 | time on training, which is the part that's you know mostly like dark science I would say today.
00:01:02.400 | In training you have three parts - data preparation, efficient training technique,
00:01:07.680 | evaluation. It's the same here, I'll spend most of my time on the first part, data preparation,
00:01:12.880 | because that's really the secret sauce that I want to highlight today. So let's dive right in.
00:01:21.280 | You can believe me or you can also believe much smarter people at OpenAI or Anthropic
00:01:26.640 | when I say that basically the most important part in your training is the dataset. So I really like
00:01:31.760 | this blog post from James at OpenAI which highlights how you know by training many many
00:01:38.480 | architectures he basically found that in the end they all converge to roughly the same behavior
00:01:44.720 | which is determined fully by the dataset. So what he says is this - the "it" in AI models
00:01:50.800 | is the dataset. Basically model behavior is much less determined by architecture or
00:01:56.800 | hyperparameters than we think and much more by your dataset. He actually says it's your dataset,
00:02:02.160 | nothing else. Well I still talk about architecture a little bit. And Amanda from Anthropic
00:02:08.240 | basically said the same thing last week when she tweeted "is this emergent behavior
00:02:14.800 | coming from data or from the model?" and basically she said "none of us has ever magically pulled
00:02:21.680 | anything out of the ether, it's all coming from the dataset". So if you're more into YouTube than
00:02:28.560 | Twitter, I think there is a nice video that jokingly summarizes all of this, by Rutger Bregman.
00:02:35.040 | Let me play it. That's a video I think about when I read all the tech
00:02:40.880 | reports that are only talking about model architecture and don't say anything about the
00:02:46.000 | data. I mean it feels like I'm at a firefighters conference and no one's allowed to speak about
00:02:52.080 | water. I mean this is not rocket science. I mean we can talk for a very long time about all these
00:02:56.800 | stupid philanthropy schemes. We can invite Bono once more but come on, we got to be talking about
00:03:02.160 | taxes. That's it. Taxes, taxes, taxes. All the rest is bullshit in my opinion.
00:03:06.640 | So basically for us we got to be talking about data. Data, data, data. All the rest is bullshit
00:03:15.920 | in my opinion. So I mean now that I've kind of painted, you know, the landscape, let's dive into
00:03:22.160 | what I mean by that. I mean it feels like I'm at a fire... Thanks. I think another nice
00:03:30.800 | recent paper is the Yi paper. So maybe if you've been following the field you probably saw
00:03:35.840 | that many Chinese teams have actually trained very good models recently. And the nice thing
00:03:41.360 | is that they also have a very very good tech report. Much better than what we have I would
00:03:45.520 | say in the western world where everyone is now very shy about sharing anything. And so the Yi
00:03:51.360 | models are a very good model if you look at the benchmark. And basically when training them they
00:03:57.760 | say that their underlying assumption is that when you train on extensive data of high enough quality
00:04:03.120 | a standard architecture can exhibit advanced capabilities. So basically you don't need yet
00:04:09.440 | now to, you know, go look beyond transformers, or maybe, as I will talk about later, a slight
00:04:17.360 | extension like mixture of experts. If you have very good data just spend the time on carefully
00:04:24.320 | crafting your data set and for now stay on one of these simple architectures that we use today.
00:04:29.840 | I think there is extensive resources as always. I could have cited like 20 papers but I try to keep
00:04:37.920 | like a small list of resources so you can read them extensively. I think these four ones are
00:04:46.080 | nice recent examples. The survey on data selection for language models by Allen AI is very nice.
00:04:53.680 | The paper I just mentioned by the Yi team is really great and I think two recent data sets
00:05:00.080 | that were open sourced and shared a lot more about how they were built were the Dolma data set
00:05:06.160 | from Allen AI and also RefinedWeb. So I think a nice thing about RefinedWeb is that I'm working with
00:05:12.720 | Guilherme the lead author of this at Hugging Face and so we'll have much more news about this data
00:05:18.720 | set to share and I think it's a very nice work. So you can use data for many things. So when you
00:05:25.760 | talk about data you actually talk about various type of data. You can use data for pre-training
00:05:31.280 | your model, you can use data for instruction tuning, you can use data for alignment which is
00:05:37.200 | basically after having pre-trained your model you really want to align it so it learns how to exhibit
00:05:42.080 | the nice behavior that you want. In particular a dialogue behavior which is one we often want to
00:05:47.680 | have when we interact with these models. You can also have data more for in-context learning,
00:05:53.280 | for RAG training, retrieval training, and I would say each of these aspects will have different
00:05:59.120 | goals and will require different data. So as a rough idea for instance for pre-training you want
00:06:04.640 | really the maximal diversity. You want to assume that your model just has no way to generalize. So
00:06:11.120 | if the behavior you want at the end is not in the pre-training data there is no way the model will
00:06:18.160 | discover it. You have to put it in the training data. For alignment it's quite different. You want
00:06:22.880 | very clean data because you're training your model to exhibit some specific behavior. You want model
00:06:28.160 | to really be very good at you know like a function call or like you want your model to be very good
00:06:34.080 | at dialogue. So you want the model really to train and to learn this behavior. So usually
00:06:40.480 | this data set can be much smaller and they can be much more carefully cleaned.
00:06:44.320 | In pre-training you will want some noise so your model knows about the noise. In particular there
00:06:49.520 | is a debate you know should you use no toxic data or like maybe no bad language data.
00:06:54.880 | Right now I think the main approach to this problem by people is to keep a little bit of toxic
00:07:03.520 | data, like a decent amount, so that the model is already exposed to this. It's a little
00:07:08.160 | bit like your kid if you want. If you want to tell them that drugs are bad, right, they have to first know
00:07:13.760 | about drugs. You cannot really, you know, expect them to learn that this is something they shouldn't
00:07:21.120 | touch they should not be using if you don't tell them what it is. It's the same for language model
00:07:26.960 | in some way. We want them to be exposed to this data to a small amount of it so that they can
00:07:31.920 | learn later to avoid this and they will know what they need to avoid. Basically assume that there is
00:07:37.440 | no generalization capabilities in this model. If you want to tell them anything about something
00:07:42.480 | positive or negative you have to first put it in the model. So let's talk about pre-training stage.
00:07:49.200 | I already covered a little bit but basically you want to have maximal coverage you want to cover
00:07:52.880 | everything. So you will train a massive quantity of texts at least 1 trillion token nowadays and
00:07:58.960 | I think you probably want to aim for more like 10 trillion tokens. The challenges that you want to
00:08:04.480 | solve here you want to maximize diversity and coverage and you want to maximize quality as
00:08:10.080 | much as possible because this is still you know something that your model will learn. So if your
00:08:15.440 | model learn mostly noise you will still get noise out. So you want to have a little bit of this so
00:08:20.320 | it's kind of robust to this but you don't want to have too much of this. Here is one example.
00:08:26.320 | Basically, a good rule of thumb is that you will want your model to know
00:08:31.520 | two things. You want your model to know the thing that you may want it to generate at the end.
00:08:36.240 | So if you want to generate knowledge about physics you will want to put that in the model and we want
00:08:42.080 | also your model to learn the thing that it might be exposed to. So you want your model to be familiar
00:08:47.920 | with the thing that the users might input. So if you have inputs that might be noisy from the users
00:08:53.360 | your model should still be trained on it. Otherwise it will be out of distribution
00:08:58.160 | and as I said the safest bet here is to assume your model doesn't generalize at all.
00:09:02.400 | The main challenge here is maximal diversity, good quality but still a little bit of noise
00:09:09.280 | and data quality evaluation. How do you measure data quality at the billion token scale.
00:09:14.800 | That's what we're going to talk a little bit about as well. So here is the typical pipeline to train
00:09:19.680 | a model. So you start by collection. I'm going to talk a little bit about that. You want to filter
00:09:23.840 | by languages which language you want to keep and then you have a set of filters. You have basically
00:09:28.800 | two main type of filters. You have some filters that are more heuristic so they are kind of rules
00:09:33.760 | that you wrote and there are some filters that are more like ML models. So you have a model that you
00:09:40.160 | train to identify some good quality text. Usually you want to combine two and then you have a set of
00:09:46.160 | filters that are more semantic. The rule and the ML model are usually a little bit more on the surface
00:09:51.040 | level and then you want to cover really the topics that you need to know about. If you want to know
00:09:56.000 | about physics, you want to know about technology, you want to be sure that these are in and so you
00:10:00.480 | have a step of like more topic filtering and basically be sure that you extract this topic
00:10:06.160 | very well. This is another example, from RefinedWeb. The first one was from Yi. This is from RefinedWeb
00:10:16.160 | just to show you how much data we remove. So we start from Common Crawl which is basically the
00:10:21.920 | internet crawled since 10 years ago and basically we filter that and you can see that there is a
00:10:30.000 | lot of things that you will remove. First I would say language removal. If you only keep English,
00:10:34.640 | English is roughly half of the internet. The second biggest language is usually Russian and
00:10:39.760 | then you have all of the other in Common Crawl. So basically remove half of it when you only filter
00:10:45.440 | for English you will have a lot of like duplication removal. Why do you want to do duplication
00:10:51.600 | removal? Well we'll talk a little bit about that later so wait. And then you extract a little bit
00:10:56.880 | and in the end you end up with about 10% of the original Common Crawl sizes. So if you want to
00:11:02.960 | get a trillion token that means you really want to start with a very large source. This is an
00:11:09.280 | example, this one from the Allen AI survey. It's roughly the same steps that you will see here.
00:11:15.600 | Language filtering, some heuristics, some what they call data quality which is machine learning
00:11:21.360 | based usually. Some deduplication and then topic filtering basically. So where can you start from?
00:11:31.920 | You want as I said something very large because you'll just keep like 10% of it.
00:11:36.880 | So there is two main large sources of data I would say today. One is Common Crawl,
00:11:40.960 | one is the internet basically and the other one is more like for code. Usually you want to start
00:11:46.160 | from GitHub or something like Software Heritage or like a place where this has been carefully
00:11:51.920 | already extracted from the web. You can use some curated sources like Wikipedia or books and then
00:11:59.920 | in books you have this big question you know like I should use only public domain books which stops
00:12:05.280 | usually about 100 years back, so 1924 for today, or do you want to dive into more, like, copyright
00:12:13.040 | questions. So that's the big, big question, I would say, today for model trainers.
00:12:17.760 | And you have more recent trends like synthetic data generation where you basically will ask one
00:12:23.200 | LLM to generate some data specifically for you and because you're kind of paying compute for data
00:12:29.520 | here you can scale this quite largely. So there is a full new trend on this
00:12:36.080 | spearheaded by Microsoft and the Phi models which were trained on billions of synthetically
00:12:42.160 | generated data from GPT-4. I think it's quite interesting that you can really craft the data
00:12:48.400 | set in a more controlled way here because you can say okay I want this topic, this topic, this topic,
00:12:53.920 | this behavior and given the quality of large language models today the quality of the resulting
00:13:00.000 | data is actually very high. There is even a recent interesting paper from Apple which is about you
00:13:06.480 | know rephrasing the web so you take one page and you actually ask an LLM to write it cleanly and
00:13:12.400 | if you train on this data which is very clean and still cover a lot of diversity you can train
00:13:17.440 | actually three times faster because you use three times less data. It's very recent but it's super
00:13:22.720 | interesting. Okay I talk a little bit about this resource in more details because we've been
00:13:30.480 | releasing data set on this at HuggingFace and I want to show you a little bit what we released
00:13:35.120 | and I go in reverse order so I start with synthetic data. Recently Loubna, Anton
00:13:41.760 | and Leandro at HuggingFace released a data set called Cosmopedia, which is a
00:13:47.360 | synthetic data set of 30 million samples, that's actually billions of tokens, and it was generated
00:13:53.120 | using one of the best open source models today, which is Mixtral Instruct, and here you can
00:13:59.360 | see how basically this is controlled for various seeds so basically we give the model a slight
00:14:06.480 | small topic you know or a sentence from a document and you can choose where this comes from and you
00:14:12.720 | ask the model to to write content you know from this seed sample on the topic. So we took some
00:14:19.600 | very clean sources like the Stanford open courses or OpenStax which is also an open textbook,
00:14:26.480 | Khan Academy that you maybe know and also some web data so I would say more more diverse data
00:14:32.880 | and even instruction tuning data set and then you can ask model also to write you know using various
00:14:40.240 | language you can ask the model to write this for college students you know to write textbook
00:14:44.720 | article on this topic for college students or for high school students you can also ask the model
00:14:50.560 | to write in various styles to write blog posts about this topic and so you can actually have
00:14:55.680 | a lot of diversity even though it's synthetic. Here is a quick example of all the clusters you
00:15:04.480 | can do topic clustering to check that you know that you cover a lot what we discovered is that
00:15:09.680 | we could still cover even more clusters and I would say right now the work on Cosmopedia 0.2
00:15:16.400 | is to extend this to even more cluster and to get basically more coverage and so here you can see
00:15:22.400 | that we trained a 1 billion parameter model on this to show the performance,
00:15:28.720 | and it's really competitive with web data set even being much smaller but I would say it can
00:15:34.480 | it can even be better you know by having more coverage so stay tuned for Cosmopedia 0.2
00:15:40.960 | coming in April. If we go now to code data there was a very nice release earlier this year called
00:15:49.440 | Starcoder 2 and the Stack V2 so the Stack V2 is really the largest code data set out there that's
00:15:57.360 | prepared for large language model pre-training it's more than 3 billion files in 600 programming
00:16:05.920 | languages in total you have like billions of tokens you have roughly 1 trillion tokens in the
00:16:11.760 | Stack V2. So to get all this data basically we didn't crawl ourself we partnered with one of the
00:16:18.720 | non-profit foundation out there called Software Heritage which is a non-profit who has been
00:16:24.560 | focusing on archiving all code that has been out there since you know 10 years ago really
00:16:31.520 | and basically there is a question, you know, when you gather all
00:16:37.360 | this data set what do you do do you sell it to I would say private you know closed source companies
00:16:43.040 | or do you partner with like an open source company to train an open source model on this
00:16:47.440 | and so Software Heritage reached out to us to partner on the training of a new
00:16:52.880 | code open source code generation model called Starcoder 2 that you can use as well and which
00:16:58.320 | is one of one of the best code completion model out there today it's a very large collaboration
00:17:05.120 | actually an open collaboration so you see all the others there mostly led by Hugging Face and
00:17:12.000 | great people at ServiceNow. So really go check this out if you're interested in code data
00:17:17.520 | it's by far the largest and the cleanest data set out there on this. On web data so as I told
00:17:23.200 | you we've been working with the lead author of RefinedWeb to get a very large and very high
00:17:29.680 | quality web data out there so basically a filtered common crawl out there for people to basically
00:17:36.720 | start their training from a high quality data set so this should be also out in the beginning of
00:17:43.440 | April, maybe even earlier, so just stay tuned on this. So now that we've got our data sources, we
00:17:50.640 | need to filter them. For filtering by language, I would say stay simple: fastText by Meta/Facebook
00:17:57.600 | is just great, so just use fastText. It's a great one, it works pretty well, and it has all the
00:18:04.240 | languages you may want to filter.
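As a rough illustration (not the exact pipeline used in RefinedWeb or similar datasets), language filtering with fastText's language-ID model can look like the sketch below; the model file and the confidence threshold are assumptions.

```python
# Hedged sketch: language filtering with fastText's language-ID model
# (lid.176.bin from fasttext.cc); the 0.65 threshold is illustrative.
import fasttext

lid_model = fasttext.load_model("lid.176.bin")

def keep_english(text: str, min_score: float = 0.65) -> bool:
    # fastText's predict() expects a single line of text
    labels, scores = lid_model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__en" and scores[0] >= min_score

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "Le renard brun saute par-dessus le chien paresseux.",
]
english_docs = [d for d in docs if keep_english(d)]
```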
00:18:10.800 | Now that we have filtered by language, we want to start cleaning our data sets. There are basically two ways to do that: heuristics and ML-based. We started with the
00:18:16.480 | heuristics. The heuristic approach is this idea that you will count things, so basically if your documents
00:18:22.080 | only have like you know two characters per line probably it's just it's just a bad list or like
00:18:28.720 | something that you actually don't really want to use in your large language model so as a reminder
00:18:33.600 | you want to keep the things that are either things that your model might ever generate
00:18:38.400 | or things that you think your users might input into your model.
00:18:43.440 | so basically repetition you know a very long repetition of single character something that
00:18:50.320 | you know have a very strange ratio of alphabetic character to punctuation all these statistics
00:18:58.080 | that you can extract are way to easily filter documents the nice thing about heuristics is you
00:19:03.840 | kind of know what you're filtering out you know you wrote the things yourself you can really set
00:19:08.480 | the threshold by inspecting it and you have a very clear control on what you're removing from
00:19:15.200 | your data set. So these are the advantages I told you: it's kind of
00:19:21.200 | controlled, it's robust, you know the priors. And I would say the drawbacks are that you're only relying on
00:19:27.440 | surface level okay you're not looking in the meaning of the document you may also remove too
00:19:32.320 | much sometimes you think you're just removing bad lists but maybe these are also good lists that
00:19:37.200 | your user may want to input in your model one way to be a little bit more flexible about that is to
00:19:43.600 | use stochastic removal instead of you know being a one-off binary choice you sample a little bit
00:19:49.520 | and you keep a little bit of noisy data. Another drawback is that you will need to carefully tune
00:19:55.840 | your hyperparameters here, you know, the statistics and thresholds that you want to filter on, and that's sometimes a
00:20:01.920 | little bit of a time-consuming process.
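To make this concrete, here is a minimal sketch of what such heuristic filters can look like; the statistics and thresholds below are illustrative, not the ones actually used in RefinedWeb or similar pipelines.

```python
# Hedged sketch of surface-level heuristic filters; thresholds are made up.
import itertools

def passes_heuristics(text: str) -> bool:
    lines = [l for l in text.splitlines() if l.strip()]
    if not lines:
        return False
    mean_chars_per_line = sum(len(l) for l in lines) / len(lines)
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    longest_char_run = max(len(list(g)) for _, g in itertools.groupby(text))
    return (
        mean_chars_per_line > 10     # e.g. drop "bad lists" with ~2 characters per line
        and alpha_ratio > 0.6        # drop docs with a strange alphabetic/punctuation ratio
        and longest_char_run < 30    # drop very long repetitions of a single character
    )
```

A stochastic variant of this would keep a small fraction of the documents that fail, e.g. with `random.random() < 0.02`, so the model still sees a little bit of noise.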
00:20:10.000 | Another way to do data set quality filtering is machine learning filtering. So here, basically, how you do it is that you will have a set of good examples,
00:20:14.960 | a set of bad example and you will train either a classifier or a perplexity based filtering
00:20:22.480 | to you know to classify or to predict the next token so classifier based you know usually the
00:20:30.800 | standard one is to use a fastText classifier with some n-grams and you label your documents as good,
00:20:37.840 | bad, whatever. Perplexity-based: you train a very small language model, so usually we use
00:20:44.240 | this KenLM model, right, and we say that if the perplexity is too high then we filter
00:20:52.640 | documents i would say the advantage is here is that you have a more like semantic understanding
00:20:59.200 | hopefully from your ml model even though we use very simple machine learning techniques here
00:21:04.240 | um and you don't you know need to tweak all the hyper parameter that you tweak for heuristics
00:21:10.560 | the main disadvantage is that you're not really controlling what you remove okay you you have a
00:21:17.920 | very vague view of what the biases are so let me give you an example wikipedia okay if you train
00:21:24.320 | your model on wikipedia and you filter based on this: wikipedia is written more than 90% actually
00:21:32.160 | by men so you're basically also filtering your pre-training corpus to be mostly male written
00:21:38.640 | do you want this bias well maybe not right so these are things that you you still need to be
00:21:44.720 | careful and basically it's really hard to know exactly what bias you're introducing
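As a sketch of the perplexity-based variant, assuming you have a small KenLM n-gram model trained on a clean corpus such as Wikipedia; the model file name and the threshold are hypothetical.

```python
# Hedged sketch: perplexity-based filtering with a small KenLM n-gram model.
import kenlm

lm = kenlm.Model("wikipedia.5gram.binary")   # hypothetical path to an n-gram model

def perplexity(text: str) -> float:
    words = text.split()
    # kenlm's score() returns the log10 probability of the whole sentence
    return 10.0 ** (-lm.score(" ".join(words)) / max(len(words), 1))

def keep_document(text: str, max_perplexity: float = 1000.0) -> bool:
    # High perplexity under the "clean" LM usually means noisy or garbled text,
    # but remember the Wikipedia-bias caveat above.
    return perplexity(text) < max_perplexity
```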
00:21:53.040 | A couple of additional notes on data filtering, very important notes actually:
00:21:57.840 | you will have several parts in your training data even if it's only web documents you will
00:22:03.920 | have you know some part of the web data are blog posts some part of the web are you know like
00:22:08.640 | tutorials some part of these are companies websites all of these are somehow specific
00:22:15.840 | domains and you want to make sure they are all you know um processed in a good way so you need
00:22:22.160 | to make sure that for each of these big domains that you want to have at the end you actually
00:22:26.800 | didn't do something bad in the pre-processing so there is various way to do that you can you know
00:22:31.600 | cluster and identify a list of documents in a cluster but just one thing to remember about
00:22:37.600 | all of this and i would say it's a general rule of all good quality data processing is that you
00:22:42.800 | will want to manually inspect the data inspect the data that you've been keeping inspect how it is at
00:22:49.520 | the end how it was filtered is it still really readable is your latex um document well processed
00:22:58.000 | is your pdf ocr well extracted manually go through the data that you keep and also through the data
00:23:04.720 | that you remove did you remove something that you think is actually very important you need to sample
00:23:10.160 | you need to take a look you can take a look just at the most important so for instance you can
00:23:15.280 | sort your data by top urls per token and just read 10 documents for this top urls and make sure that
00:23:22.400 | these 10 documents are really well filtered okay very likely you need to also craft specific
00:23:30.320 | domain focused hyperparameters for instance for your heuristics maybe they will work well for
00:23:36.240 | blog posts but maybe they will just badly filter latex documents so you can either say okay i craft
00:23:42.240 | specific rule for this domain or you can also say i'll just add this domain afterwards uh for
00:23:48.560 | instance code you could say i remove all code for web and just i'll just add a very big code data
00:23:54.080 | set but try to think about the implication of doing that okay you will basically remove for
00:23:58.960 | instance some mixed natural language and code documents so you want to make sure you add this
00:24:04.000 | back again uh so that your model still cover this type of inputs um as i told you you can also make
00:24:11.520 | use of some stochastic selection so if a rule is maybe just too hard too harsh you may want to just
00:24:18.880 | stochastically sample in the filtering so that you keep a little bit of noise you can smooth a bit
00:24:25.600 | your rules. Now, deduplication: why do you want to do deduplication? Well, the idea is that there is
00:24:33.920 | a lot of duplication on the web that's something to really be mindful and to be aware of the web is
00:24:39.920 | hugely duplicated and so duplication will increase the density around some topics okay wikipedia is
00:24:47.760 | copied a lot over the internet so maybe that's nice to have a lot of density around wikipedia so
00:24:52.880 | that you're sure that your model has seen it a lot but um you also have to be aware that duplicated
00:24:58.080 | points they have more chance of being memorized okay they will also take more time because you
00:25:03.120 | will go during your training more times over the same data points so it takes more compute
00:25:09.040 | during training, and you don't really want that; you really need to think about that. Reducing the
00:25:14.960 | duplication, i.e. deduplication, has also been shown to improve accuracy, so generally deduplication
00:25:19.680 | is something that's very important and that you want to do a lot. How can you deduplicate? Well,
00:25:27.760 | you have a couple of methods you have more like fuzzy method where you basically will extract some
00:25:33.040 | hash fixed size hash of your documents and so you will lose here a little bit of accuracy because
00:25:40.240 | these hashes are just a rough summary of the n-grams in your document, and then you will want to filter
00:25:45.600 | them either by MinHash, which is, I would say, quite a good method in general, or by Bloom filters,
00:25:54.800 | which are much stronger deduplication because you just keep one hash and just keep one document
00:26:00.320 | per hash, so it's very strong; you have a fixed size vector which is very constraining.
00:26:05.120 | And if you don't want to do fuzzy deduplication you can use exact deduplication, where you will
00:26:10.880 | extract, you know, with a suffix array, exactly all the duplicates in your documents.
00:26:17.360 | They both have trade-offs in advantages and drawbacks. Exact filtering is very costly in
00:26:26.560 | memory because the suffix array tables are really huge, and as I said Bloom filters are
00:26:34.000 | a very, very strong filter, so usually we use MinHash a lot, for instance in FineWeb,
00:26:40.640 | because you can control a little bit more your trade-off between
00:26:46.320 | memory and accuracy.
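Here is a minimal sketch of fuzzy deduplication with MinHash plus LSH, using the `datasketch` library for readability (datatrove ships its own, more scalable MinHash implementation); the shingle size and similarity threshold are illustrative.

```python
# Hedged sketch of MinHash + LSH fuzzy deduplication (datasketch library).
from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    words = text.lower().split()
    # hash 5-word shingles: a rough, fixed-size summary of the document's n-grams
    for i in range(max(len(words) - 4, 1)):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

def deduplicate(corpus):
    lsh = MinHashLSH(threshold=0.8, num_perm=128)   # ~80% Jaccard counts as a near-duplicate
    kept = []
    for doc_id, text in enumerate(corpus):
        sig = minhash_signature(text)
        if not lsh.query(sig):          # no previously kept near-duplicate
            lsh.insert(str(doc_id), sig)
            kept.append(text)
    return kept
```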
00:26:56.400 | Speeding up the deduplication is also a very big issue, I would say one of the very big challenges, and we saw a very interesting counterintuitive result
00:27:02.160 | recently: that more deduplication also led us to keeping only bad data. So basically when we were
00:27:07.040 | deduplicating more and more all the good data was now taken out and only the the remaining things
00:27:13.760 | were just basically bad quality data that was not deduplicated but that was just so random that it
00:27:20.320 | didn't fall in the deduplication buckets. So I would say for deduplication also, be careful:
00:27:27.360 | investigate what you're removing at the end and also what you're keeping, and don't
00:27:33.360 | take this as a silver bullet; just like every filter out there, it's something that you should
00:27:38.400 | double check yourself now that we've finished you know uh sourcing language filtering filtering by
00:27:46.640 | quality heuristic or ml deduplicating uh topic we need to prepare the data for training there's two
00:27:53.600 | main thing you need to do we need to shuffle it it might seem as a joke but it's still very
00:27:58.160 | important today; you don't want to train in the order of the Common Crawl dumps, you want
00:28:05.040 | a good shuffling of all your data and then you want to tokenize it so recently there was a very
00:28:10.800 | nice video by Andrej Karpathy on tokenizers; you should watch it if you want to know everything
00:28:16.000 | about tokenizers, but generally there's just a set of good practices you should be
00:28:22.320 | mindful of. The first one, I would say, is: sample well across your whole data set. I would say the
00:28:29.440 | first GPT-2 tokenizer was famous for including in the final vocabulary tokens for the names of
00:28:36.400 | redditors, because it was really trained on Reddit data only. You don't want that; you want really to
00:28:42.080 | shuffle so that the single you know the one single part of your data set is not over represented
00:28:48.880 | in your vocabulary the vocabulary of your model for math you want to be careful about numbers you
00:28:55.360 | want to be careful that they are well you know you don't have like for instance 42 as a single token
00:29:00.880 | and 43 as two token because 42 is much more used since uh the douglas adams book and so usually
00:29:09.360 | what people do is either they split digits, so you split all the digits in every number,
00:29:15.280 | that's what for instance llama do or you add you know the list of all numbers manually in your
00:29:22.000 | vocabulary up to a thousand for instance that's what gpt4 do then you need to be sure that your
00:29:28.320 | data set is big enough that every number is really well represented in it for code you want to be
00:29:34.960 | mindful about tabs and spaces they're very important for instance in python and so you want to handle
00:29:41.920 | them well you want to model to know what is a double space and four spaces so just be careful
00:29:47.840 | about this. And basically, if you need something by default, I would say a byte-level BPE is a good
00:29:54.960 | standard way to train a tokenizer. Don't fall into a rabbit hole with tokenizers; they are not the thing
00:30:02.000 | that will bring you to AGI, okay; this is just something you want to do in a clean way so
00:30:08.720 | that you don't fall into the traps along the way: that you're able to process code and numbers, you
00:30:15.360 | know, that you don't have some strange tokens over-represented, but that's it.
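As a hedged sketch of those defaults with the Hugging Face `tokenizers` library (byte-level BPE plus individual-digit splitting); the vocabulary size, special token and toy training texts are placeholders.

```python
# Hedged sketch: byte-level BPE with digit splitting (Hugging Face tokenizers).
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),    # "42" -> "4", "2", Llama-style
    pre_tokenizers.ByteLevel(add_prefix_space=False),  # byte-level, keeps tabs/spaces
])
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=50_257, special_tokens=["<|endoftext|>"])

def text_iterator():
    # iterate over a well-shuffled sample of the whole dataset, not a single domain
    yield from ["def f(x):\n    return x + 42\n", "The answer is 42."]

tokenizer.train_from_iterator(text_iterator(), trainer=trainer)
print(tokenizer.encode("x = 1042").tokens)
```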
00:30:22.560 | By the way, you can also use the tokenizer to inspect your data set; I'm going to talk a little bit about that. Scaling tokenization
00:30:27.840 | is non-trivial you want to really parallelize that well because otherwise pre-processing
00:30:34.240 | and tokenizing trillions of token can take quite a long time in the end and so there is two main
00:30:40.400 | approach the first one is well parallelizing and then finding a way to efficiently merging the post
00:30:47.360 | the tokenized data sets and shuffling it and the other way is that you tokenize during training
00:30:53.360 | basically you feed the direct text to your model and you tokenize just before feeding the model
00:30:59.760 | i would say the the nice thing about the first one is once your doc once your data set is tokenized
00:31:05.280 | and everything stopping training and continuing training around resuming training is very easy
00:31:10.960 | it's very efficient it's very reliable and in the second case is well you can change the tokenizer
00:31:17.920 | easily but usually you don't really need to do that a lot but resuming and being sure that you've
00:31:25.280 | you know we're starting exactly from where you were is usually slightly trickier
00:31:32.320 | now how do you evaluate data quality so that's really tricky because we're talking about trillion
00:31:39.920 | size data sets okay so it's really hard to have some good metrics to evaluate the data quality
00:31:45.520 | so a lot of this is you know inspecting yourself some exact documents as i will tell you and some
00:31:52.240 | easy i would say one one one good way is training small model to test it so typically what we've
00:31:58.000 | been training here for instance is like one to two billion size model and you train at this on
00:32:03.600 | like a chinchilla optimal size you don't need to train for longer you're not using this model
00:32:08.480 | for inference or anything so which is roughly 30 giga token when you train your model you need to
00:32:14.960 | find some high signal benchmark not all the benchmark in nlp are high signal what is a high
00:32:20.480 | signal there is two way i've seen it uh being being being used one way is to make sure that your
00:32:27.840 | matrix on this benchmark is monotonically increasing during training okay you want
00:32:34.400 | basically some benchmark where you really see your model learning learning increasingly and
00:32:39.520 | not like oscillating a lot otherwise depending when you stop you will have like very different
00:32:45.120 | results you want to have a low variance which means if you train on various seeds if you train
00:32:51.680 | on various um you know parts of your data set you want to be sure that you're you're roughly in the
00:32:58.000 | same ballpark at least the the standard deviation that you're measuring is small enough that you can
00:33:02.800 | really tell data set apart so usually you will want to have two debugging data sets one of high
00:33:10.080 | quality. A standard very high quality data set is C4; it's really a data set that has
00:33:16.800 | stood the test of time in terms of high quality. And you want another data set that's maybe much
00:33:22.160 | more complex, The Pile is sometimes an example, or you can take just a pure Common Crawl,
00:33:27.040 | unfiltered, and you should see really a distance between the measurements on your benchmark on these
00:33:33.760 | two data sets the performance of your train model on these two data sets and obviously you want your
00:33:39.040 | model to be above the random baseline you know that's also one indication of a good benchmark
00:33:45.280 | so if a 1-2 billion size model is not above the random baseline you're just measuring noise
00:33:51.360 | and there is some tricky details to make sure that you have high signal these are some things
00:33:57.280 | we have in lighteval, but basically, for instance, if you want to measure multiple-choice
00:34:02.640 | question it's often the case for this small benchmark and that's why for instance you have
00:34:06.640 | four continuation you want to predict you know you want to select one of the four small model
00:34:12.720 | what i call small model is one to two billion size model small models really like more what we call
00:34:19.600 | normalized likelihood so we'll measure the likelihood of each answer normalize it by the
00:34:24.800 | length, and take, you know, the highest likelihood. And larger models, when we move to like 30, 40,
00:34:31.760 | even 70B models, well trained, they will prefer, you know, lettered answers, where you lay out the
00:34:36.720 | answer and then you say select between a b c d and the model just generates a b c d and here you
00:34:42.640 | can have nice calibration curve because you have a very clear uncertainty on this single generated
00:34:48.960 | token. So keep this one for larger models; for small models, I would say, focus on normalized likelihood.
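To make the normalized-likelihood idea concrete, here is a hedged sketch with `transformers`; GPT-2 is only a stand-in for whatever 1-2B debugging model you trained, and it normalizes by token count where an evaluation suite might normalize by character length instead.

```python
# Hedged sketch: length-normalized log-likelihood scoring of multiple choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in for your debug model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def normalized_loglikelihood(context: str, continuation: str) -> float:
    ctx = tok(context, return_tensors="pt").input_ids
    cont = tok(continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(torch.cat([ctx, cont], dim=1)).logits
    # log-probs of the continuation tokens only, conditioned on everything before them
    logprobs = torch.log_softmax(logits[0, ctx.shape[1] - 1 : -1], dim=-1)
    token_lp = logprobs.gather(1, cont[0].unsqueeze(1)).squeeze(1)
    return token_lp.sum().item() / cont.shape[1]       # normalize by continuation length

question = "Question: What is the capital of France?\nAnswer:"
choices = [" Paris", " London", " Berlin", " Madrid"]
scores = [normalized_loglikelihood(question, c) for c in choices]
print(choices[max(range(len(choices)), key=lambda i: scores[i])])
```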
00:34:56.240 | So that's small-model training. Another thing, I talked about this a lot already, is manual data inspection:
00:35:02.160 | take your top domains take your top url inspect 10 documents for each of them inspect also at
00:35:08.400 | various stages in your pipeline and also take a look at what you've discarded okay always
00:35:16.560 | you can set up a search tool in your data set that's also very useful you can do some clustering
00:35:22.480 | to see and to be able also to inspect top documents per maybe more clusters than url so we have here a
00:35:28.800 | nice library by leandro at hugging face called text clustering we also have a nice search tool
00:35:36.720 | in this, so really take a look at this library and use it. There is a more uncommon trick
00:35:43.040 | that I really like, from Teven (there were a lot of people at Hugging Face, you know, who are
00:35:49.360 | now at Mistral), and Teven, who is now at Mistral, told me once that he used the tokenizer
00:35:55.600 | to inspect and basically you can train a tokenizer on your data set and you can take a look at the
00:36:02.560 | longest tokens and maybe the last tokens, so the least frequent tokens, and see there:
00:36:09.520 | okay do you have strange things do you have like javascript parts do you have like name of redditors
00:36:14.880 | like i was telling you and if they look bad that means that you have some high frequency of bad
00:36:21.840 | quality data in your data set um here we have some nice library that we've been releasing you know
00:36:30.320 | just last month for doing all of this all of this data processing pipelines it's called data trove
00:36:36.560 | it's by uh gilerme the lead author of refined web and basically it started as an open reproduction
00:36:43.200 | of refined web so a very high quality filtered common crawl and what we ended up was kind of a
00:36:49.360 | fully fledged lightweight library for processing filter the duplicated test data and basically
00:36:54.880 | preparing very large data set for other than training you have pre-built block for all the
00:37:01.280 | steps that i showed you here it's fully in python and it's very easy to set up on slurm or locally
00:37:09.360 | and to use remote file system as well if you need to so take a look at data trove it's a very small
00:37:17.440 | library i would say self-contained python thing but you really you have all the basic blocks that
00:37:22.000 | you may want to use here um when you want to evaluate your model we have one library that
00:37:27.840 | works well with datatrove and the pipeline, which is called lighteval. Lighteval is a very lightweight
00:37:34.640 | LLM evaluation suite, originally inspired by the amazing EleutherAI harness. I would say the main
00:37:41.600 | difference is that it integrates, from the ground up, the 3D parallelism I'm going to talk about next,
00:37:47.600 | so basically efficient model uh training and inference and you can play a lot with the prompts
00:37:54.560 | and the eval so i was telling you for instance small model really like this like normalized
00:38:00.240 | log likelihood while while bigger model like more like lettered answers and so here you can play with
00:38:06.000 | the prompts easily and so to see how much signal you can extract for each benchmark on your specific
00:38:14.560 | debugging model size now we've talked a lot about data so let's talk a little bit about modeling
00:38:23.040 | that's the part everyone is waiting for that's the most exciting part easily that's the reason we're
00:38:27.840 | all in ml and i'm very happy to still cover this i would say so what are the essential elements when
00:38:34.320 | you train well there is three main thing the first one is efficiency and size you want to fit your
00:38:41.280 | billion parameters model efficiently on your gpu and you want to train really fast so you have some
00:38:46.800 | recepts for this that i'm going to cover quickly and then you want to train in a kind of a roughly
00:38:53.760 | stable way you have to avoid instabilities but still you want to stay really close to it and
00:38:59.360 | then you have the last question which is capacity and that's where we're going to talk a little bit
00:39:03.200 | about other architecture than just the transformers but that's i would say just the last part so how
00:39:09.680 | do you train model efficiently in particular when it's too big to fit on one gpu so when it fits on
00:39:14.800 | one gpu there is no real problem, right: your 1B model, no problem. But your 7, 13, 30B models
00:39:23.920 | they are just too big for one gpu and a decent batch size so you need to parallelize them
00:39:28.400 | today we have four way to do parallelism roughly we have data parallelism that's something everyone
00:39:35.040 | has been using already i would say you have tensor parallelism pipeline parallelism and a much more
00:39:41.840 | recent i would say or slightly more recent sequence parallelism i'm going to cover them
00:39:46.080 | briefly so i would say here my idea is more to give you kind of a overview of everything more
00:39:52.080 | than really dive deep because in each of these topics you could dive really deep in a technical
00:39:57.360 | point of view okay so this is really entry level and i put some references again just select a
00:40:04.400 | couple of references that you can read to dive deeper in this let's start with the first parallelism
00:40:09.280 | data parallelism usually it works out of the box that's the easiest one the only challenge is the
00:40:15.840 | data loading, to make sure that your model replicas will have different data as input. So what does
00:40:24.640 | data parallelism do you take the one model and you duplicate it on several gpu you feed it several
00:40:31.360 | parts of your batch and then you just you know match the gradient reduce the gradients so that
00:40:37.200 | you have basically a larger batch on three gpu for instance than you had on one gpu so you can
00:40:43.040 | process on parallel you know different part of your data and you just make the optimization step
00:40:49.360 | the main challenge i would say is the last part is the all reduce that you use to to kind of merge
00:40:56.960 | the gradient updates and actually when you scale very large model it can start to become a huge
00:41:03.120 | bottleneck so we'll talk a little bit about that um yeah the tensor parallelism is when you don't
00:41:13.200 | want to when you're limited in your data parallelism so why would you be limited by data
00:41:19.520 | parallelism there is two main cases one case is basically your model is just too big to fit on
00:41:25.120 | one gpu so you cannot replicate your model on various gpu you need to split the model somehow
00:41:32.400 | the other case is when your batch size by replicating the model start to be too big
00:41:37.920 | okay so let's say you want to really scale the model and now you start to have like one to four
00:41:44.320 | million token batch size well if you start to have a very large batch size the model for each
00:41:50.240 | optimization step make less efficient use of each token because the batch size is so big that each
00:41:57.520 | token is kind of washed out in the optimization step and roughly it's a little bit hard to measure
00:42:03.520 | this limit which we call the critical batch size it's roughly around four to six million token it's
00:42:09.200 | different for like small and bigger model but basically you cannot really go to 100 million
00:42:14.960 | token batch sizes like that so you want to find another way to parallelize to make more efficient use of
00:42:21.440 | your data and so one way to do that is to use tensor parallelism tensor parallelism is slightly
00:42:28.640 | more involved because you need to rewrite your model code you cannot just rewrite the data
00:42:33.440 | loading code you need to change the model why because you will divide all the matrix multiplication
00:42:39.600 | all the matrices that we use in the model into two, or like four, or like eight, depending on your tensor
00:42:45.440 | parallelism degree and you will put each part of the weights each sub part of this weight matrices
00:42:53.360 | on various GPUs, and synchronization will happen after the operation. So here you need to rewrite the
00:43:01.040 | model code. The nice thing is that you can combine smart column and row slicing to try to reduce the
00:43:07.120 | number of synchronization points let me show you a little bit here you have two main parts in a
00:43:14.000 | transformer as you may remember you have feed forward networks you know you usually have two
00:43:20.160 | two matrix multiplication with an activation in between it can be a bit more if you're using
00:43:26.720 | something different, like if you're using a GLU variant, but basically you will have one matrix
00:43:30.720 | multiplication some activation and another matrix multiplication okay and here you can basically
00:43:37.680 | split the first matrix multiplication in one direction usually column wise you do separately
00:43:44.880 | your activation on each gpu you don't need to synchronize and then you gather by doing the
00:43:50.480 | opposite slicing at the end on the second matrix multiplication you do a row slicing
00:43:56.080 | to gather again your output your activation you can do the same smart thing for self-attention
00:44:04.320 | where you will do one part matrix multiplication in one direction you will split the matrix
00:44:09.200 | you will do like softmax dropouts separately and then you will combine them with you know
00:44:15.360 | another parallel operation in the other direction this way you reduce the number of synchronization
00:44:24.160 | point because you can do a couple of like operation without needing to synchronize between the gpu
00:44:29.360 | The tricky part is always that when you synchronize, you're going through the network,
00:44:33.680 | and that's much slower than just the computation.
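Here is a conceptual sketch of that column-then-row slicing for the feed-forward block, in the spirit of Megatron-style tensor parallelism; it is not a drop-in implementation and assumes a process group with `tp_size` ranks is already initialized.

```python
# Hedged sketch: column-parallel then row-parallel MLP with a single all-reduce.
import torch
import torch.nn as nn
import torch.distributed as dist

class TensorParallelMLP(nn.Module):
    def __init__(self, d_model: int, d_ff: int, tp_size: int):
        super().__init__()
        # each rank only stores a 1/tp_size slice of each weight matrix
        self.up = nn.Linear(d_model, d_ff // tp_size, bias=False)    # column-sliced
        self.down = nn.Linear(d_ff // tp_size, d_model, bias=False)  # row-sliced

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.nn.functional.gelu(self.up(x))  # activation is purely local: no sync needed
        partial = self.down(h)                    # each rank produces a partial sum of the output
        dist.all_reduce(partial)                  # single synchronization point over the network
        return partial
```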
00:44:40.960 | The last approach, which you can use when you don't want to use tensor parallelism or when you cannot scale tensor parallelism enough,
00:44:45.120 | is pipeline parallelism. So usually you want pipeline parallelism when your, like, network
00:44:50.800 | is not fast enough to do full tensor parallelism everywhere okay pipeline parallelism reduce
00:44:58.480 | the number of network exchanges because you will put some layers on some gpu
00:45:03.360 | and other layers and other gpu and you will just communicate at the interface between two layers
00:45:09.920 | or like two groups of layers so you can see here you will put one for instance level layer two
00:45:17.520 | zero to three on one gpu layer four to seven on the second gpu etc etc here the challenge i would
00:45:24.800 | say is to keep all the gpu busy so you don't want to have just you know one group one gpu working
00:45:30.880 | for the first layers of your batch and then being idle while you have the other gpu working for the
00:45:37.040 | other layer you know as we go as we go forward in the model and it can be very challenging
00:45:43.520 | to actually keep have maximal utilization of the gpu so usually you have like complex interleaving
00:45:50.320 | of the forward and the backward path so i can show you here a little bit where you have the forward
00:45:56.480 | path in blue and the backwards path in green and you can see what we do in this case is that we will
00:46:02.960 | split our batch in smaller sub-batch mini-batches so for instance we split a long batch in four
00:46:10.800 | mini-batches and when the first mini-batch is done on the last device we already start the backward
00:46:18.400 | while we are still doing the forward path on the other gpu for the last batches and this way you
00:46:24.800 | can reduce what we call the bubble the tricky thing here as you probably got it is that
00:46:30.000 | for tensor parallelism you needed to rewrite the model code as i told you
00:46:34.880 | and here you also need to rewrite the optimization code okay you cannot just do forward
00:46:41.680 | and then your loss and then backward, because you have parallel execution of the backward and forward
00:46:48.880 | path so this makes using the code quite complex and that's why actually we have a new library
00:46:53.680 | called nanotron that tries to make this as simple as possible.
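As a toy illustration of the idea (layers split across GPUs, the batch split into micro-batches), far simpler than the interleaved one-forward-one-backward schedules a real library implements; the sizes are arbitrary and only the forward path is shown.

```python
# Hedged sketch: naive two-stage pipeline with micro-batches (forward only).
import torch
import torch.nn as nn

def make_layer():
    return nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)

stage0 = nn.Sequential(*[make_layer() for _ in range(12)]).to("cuda:0")  # layers 0-11
stage1 = nn.Sequential(*[make_layer() for _ in range(12)]).to("cuda:1")  # layers 12-23

def pipelined_forward(batch: torch.Tensor, n_microbatches: int = 4) -> torch.Tensor:
    outputs = []
    for micro in batch.chunk(n_microbatches):
        h = stage0(micro.to("cuda:0"))
        # stage1 works on this micro-batch while stage0 (asynchronously) starts
        # the next one, which is what shrinks the idle "bubble"
        outputs.append(stage1(h.to("cuda:1")))
    return torch.cat(outputs).cpu()

out = pipelined_forward(torch.randn(32, 128, 1024))
```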
00:46:57.440 | there is a last way to do parallelization called sequence parallelism so be careful
00:47:05.040 | because there is two use of sequence parallelism there is one which is kind of a smart way to do
00:47:10.800 | ring attention to do attention on very long sequences but the one i talk a little bit about
00:47:16.240 | today is another simpler way it's quite similar to tensor parallelism in a way but instead of
00:47:23.280 | slicing the parameter matrices like we do we slice the sequence this way and the idea is
00:47:29.680 | if you took tensor parallelism here it's the top box we still had some operation between each tensor
00:47:38.080 | parallelism operation where we were not really parallelized in any way and the idea is on this
00:47:44.160 | operation which are applied independently for each token we could split along the sequence
00:47:52.880 | and so we could parallelize this along the sequence axis. It's mostly interesting when you're
00:47:58.720 | doing training, usually, because you have long sequences, or when you're doing prefill.
00:48:03.600 | now what can you read if you want to know more about this there is many reference
00:48:10.480 | on parallelism i try to extract i think the one and i think give you the highest level overview
00:48:16.480 | i would say and cover as much as possible of this thing i really like this first paper from joel at
00:48:22.960 | service now which is not very well known but i think it's very interesting as it covers like a
00:48:28.160 | lot of the challenges here: breadth-first pipeline parallelism. Reducing Activation Recomputation
00:48:34.320 | in Large Transformer Models is a very nice one, and the next one is actually the one on sequence
00:48:40.320 | parallelism that i told you and the last one called sequence parallelism is actually this
00:48:44.960 | ring attention paper that i think is also very interesting but more maybe an extension of this
00:48:50.800 | presentation now we talk about a lot about parallelization okay but there is an additional
00:48:56.960 | thing that you need to be mindful about is synchronization i already talked a little bit
00:49:01.680 | about synchronization okay during tensor parallelism and the thing i talk a little bit
00:49:05.840 | about reducing synchronization and here you you have to be very careful about that well
00:49:10.720 | why well you have two type of synchronization you have one synchronization which is between
00:49:15.920 | various GPUs, which is basically when you do, like, an all-reduce operation
00:49:24.080 | in tensor parallelism and you have one synchronization which is between cpu and gpu which
00:49:29.120 | is when your cpu basically launched the kernel on gpu and you want to reduce or at least you want to
00:49:35.920 | make sure that for both of these as much as possible you can do an overlap of computation
00:49:42.560 | and communication so basically if you can do something called asynchronous computation
00:49:48.160 | basically where you will asynchronously start some operation and do some communication during
00:49:53.760 | this time it's much better so let me talk about two things we we talk a little bit during the
00:49:59.600 | data parallelism part about the cost of the all reduce at the end so that's something you probably
00:50:05.520 | have been using already without knowing it in pytorch which is if you look at the distributed
00:50:11.600 | data parallel so the ddp uh in pytorch you can see that there is a very smart way to do all reduce
00:50:20.000 | so let's look at here basically typically you will usually do like all your forward and your
00:50:24.960 | backward and then you will do your all reduce at the end okay well this is very annoying because
00:50:32.000 | during the all reduce where you gather all your gradient together you don't do any computation
00:50:38.000 | you're just waiting for synchronization there you're just waiting for your gpu to exchange
00:50:43.440 | all the all the great and that's not something you really want you want to keep your gpu busy
00:50:48.560 | so if you have a way that once every time um one layer is finished with computing you can already
00:50:54.880 | start you know reducing you can already start in parallel to computation you can already start
00:51:00.000 | communicating gradient then you should try to do that and if you take a look at the pytorch code
00:51:05.440 | for for distributed data parallel that's something that they do um another example is in pipeline
00:51:13.440 | parallelism you know we saw this this this uh forward backward reducing of the bubble and here
00:51:22.160 | you can also try to, you know, overlap this very long gradient reduction (the 'g' block here)
00:51:29.360 | with some, you know, forward pass of the next batch. And just a quick example and a quicker note
00:51:40.960 | about CPU and GPU synchronization: here, what you will want is to reduce as much as possible the
00:51:46.800 | number of times your CPU needs to inspect the data or needs to launch a kernel, so we want to fuse kernels,
00:51:54.000 | so you want to fuse the operation that could go together the result of your attention your
00:51:58.720 | activation if you can do all of that in the gpu without the cpu needing to say okay now it's time
00:52:04.960 | to compute the activation, now it's time to do this, then you should do that. So that's usually done by
00:52:10.000 | merging operations into single kernels.
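A small hedged illustration of that idea with `torch.compile`, which fuses chains of element-wise operations into fewer kernels so the CPU launches less work and the GPU re-reads memory less often; the shapes here are arbitrary.

```python
# Hedged sketch: fusing elementwise ops (bias add + GeLU) into fewer kernels.
import torch

def bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.gelu(x + bias)

fused_bias_gelu = torch.compile(bias_gelu)   # compiled, fused version of the same op

x = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, device="cuda")
y = fused_bias_gelu(x, b)                    # fewer kernel launches, fewer CPU-GPU round trips
```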
00:52:18.240 | Now I want to talk a little bit about attention. That's very interesting because if you were already in the field like one year
00:52:23.760 | and a half ago a little bit more maybe now um we had a lot of work on designing efficient
00:52:30.720 | attention computation because people were really very scared by the quadratic cost of attention
00:52:38.720 | okay and all of this disappeared now you know there was like all this reformer all these very
00:52:44.160 | long attention smart and the main reason this disappeared was that uh our friend tree dao at
00:52:52.000 | stanford invented flash attention flash attention the idea is basically you will just not materialize
00:52:59.520 | the attention matrix so the attention matrix is this very like large is the n square sequence
00:53:06.000 | square size matrices comparing each token you know to make the attention between all of them
00:53:12.000 | what you could do is instead of building these very large matrices you can just
00:53:16.880 | on the fly you know build small matrices and just keep the statistics that you need
00:53:21.200 | to compute your softmax along the way and that's what flash attention does that's the first step
00:53:27.040 | and the second step for flash attention is that if you just compute along the way small part of
00:53:32.720 | your attention matrix you may even have this small part small enough so that they fit actually
00:53:39.760 | in the sram of the gpu so the static random access memory the sram is a much much smaller memory but
00:53:50.560 | which is really next to each chip and this this this cannot be shared between processes right
00:53:56.320 | this has to be this is a single memory for for a group of of processing while the hbm the high
00:54:04.240 | bandwidth memory is shared by everything so it's the it's the hbm is this 80 or 40 gigabytes memory
00:54:10.800 | you know that you see it's really large but it's also much smaller and much lower bandwidth than
00:54:16.240 | this sram okay so you can compute just like your attention not in one big memory but in small one
00:54:23.200 | with statistics and the small one can be small enough to be fitted in the very very high bandwidth
00:54:28.320 | memory and this way you can actually compute attention really much faster and while using
00:54:34.800 | actually much less memory so flash attention can solve somehow the quadratic attention costs of
00:54:41.680 | attention and that's why we don't really care a lot anymore about you know linear attention
00:54:49.040 | mechanism for instance also because performance were never able to match full attention somehow
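A minimal sketch, not from the talk, assuming PyTorch 2.x on a recent GPU: torch.nn.functional.scaled_dot_product_attention can dispatch to a FlashAttention-style fused kernel, so you never materialize the full attention matrix yourself; the shapes are placeholders.

    import torch
    import torch.nn.functional as F

    batch, heads, seq, head_dim = 2, 16, 4096, 64
    q = torch.randn(batch, heads, seq, head_dim, device="cuda", dtype=torch.bfloat16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    # fused attention: the (seq x seq) score matrix is never materialized in HBM;
    # the kernel works on tiles kept in on-chip SRAM and carries running softmax statistics
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)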
00:54:55.840 | FlashAttention v2 was a further development of FlashAttention,
00:55:02.320 | again roughly two times faster, and here the idea was mostly to have as much as
00:55:09.440 | possible of the computation in matmul FLOPs. You have to know something about GPUs here, which is that
00:55:16.400 | GPUs are heavily optimized for matrix-matrix multiplication, so every time you do something
00:55:22.560 | like a division, or basically anything other than a multiplication, you're paying a cost, and the cost
00:55:29.600 | is very high: a division is something like 60 times more expensive than a multiplication. So, for instance,
00:55:36.720 | when we do the softmax we usually divide by some normalization factor, like the square
00:55:43.200 | root of the model dimension; you want to keep that division and do it just once at the end; you don't
00:55:48.000 | want to do it on each and every element of your computation. These kinds of things
00:55:54.960 | are basically what FlashAttention 2 brings, together with better handling of the causal mask (if
00:56:01.600 | you're computing a causal mask, you just don't need to compute half of the matrix) and better
00:56:06.720 | work partitioning, making better use of the thread blocks and warps of the GPU. I won't dive
00:56:11.920 | into this, because there's a lot to unfold here, and I would say it's a bit more
00:56:17.440 | incremental, but it's still a very nice gain. So now that we have something efficient,
00:56:24.880 | we've parallelized it well and we have a very efficient attention computation, we want to make
00:56:30.800 | sure that we train well. And here, don't skip the hyperparameter search. There are a couple of very
00:56:37.760 | important things that you need to go over: the learning rate, for which you want to do a nice hyperparameter search;
00:56:43.360 | you want to make sure that your initialization is well done, that your norms are where
00:56:48.480 | they need to be; you want to make sure that your training is stable, but also not too
00:56:53.840 | stable, so that you stay at the verge of instability, still training with a very high learning rate.
00:56:58.000 | Here, I would say, there is very little recent work, but there are two pieces that I really like.
00:57:05.120 | There is the muTransfer work, slightly older now, I would say, but still very interesting,
00:57:10.160 | on how to find hyperparameters on a small model and how to transfer them to a larger model;
00:57:15.920 | this work by Cerebras was maybe one of the most interesting applications of muTransfer.
00:57:22.480 | And a very interesting recent work, again from a Chinese team, with a very nice, open set of
00:57:30.240 | experiments, is the MiniCPM blog post, where they really try to optimize the model
00:57:36.800 | and to optimize the scaling of activations between the various parts of the model:
00:57:42.800 | how you want to scale the activations between the embeddings and the first layers, and so on. So you
00:57:47.200 | should really give it a look. What is also very interesting is that they
00:57:51.680 | challenged the dominant view that the cosine learning-rate schedule was the final schedule that everyone should
00:57:58.800 | use from now on. Cosine is still a really great default schedule, but they use
00:58:03.920 | a constant learning rate with a warm-up and a decay phase, and they show that they get decent
00:58:10.960 | performance with that as well. The nice thing about having a linear,
00:58:17.680 | constant learning rate is that you don't need to know from the beginning how long you will train
00:58:23.760 | your model. That's very interesting, because cosine kind of forces you into a very specific
00:58:29.520 | shape where you need to decide, at the beginning of your training, how long you are going to train,
00:58:34.480 | and you cannot resume for longer. If we find a way to get good performance with
00:58:39.440 | a flat learning rate and just a warm-up and a decay (the decay is very important, that's what they show in
00:58:43.840 | this paper), then maybe we can get out of this constraint of knowing from the beginning
00:58:52.960 | how long we're going to train. So take a look at this paper; I think it's very nice in terms
00:58:57.120 | of stable training recipes. The takeaway here, I would say, is: don't skip this step,
00:59:03.440 | do your homework in terms of hyperparameter research.
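A minimal sketch, not from the talk and not the exact MiniCPM recipe, of such a warm-up / constant / decay schedule using PyTorch's LambdaLR; the model, learning rate, and step counts are placeholders.

    import torch

    def wsd_schedule(warmup_steps, decay_steps, total_steps):
        """Warmup-stable-decay: linear warm-up, flat plateau, linear decay at the end."""
        def lr_lambda(step):
            if step < warmup_steps:                      # linear warm-up
                return step / max(1, warmup_steps)
            if step < total_steps - decay_steps:         # constant plateau
                return 1.0
            return max(0.0, (total_steps - step) / decay_steps)   # final decay
        return lr_lambda

    model = torch.nn.Linear(1024, 1024)                  # placeholder model
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lr_lambda=wsd_schedule(warmup_steps=2_000, decay_steps=20_000, total_steps=200_000),
    )
    # in the training loop: loss.backward(); optimizer.step(); scheduler.step()

The appeal described in the talk is that the plateau can in principle be extended and the decay scheduled only when you decide to stop; total_steps is fixed up front here just to keep the sketch short.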
00:59:13.920 | Now, the last part, the one people usually spend the most time talking about, which is a good indication that
00:59:19.520 | it's also the least important part. But let me still talk a little bit about it.
00:59:24.640 | For a long time, transformers were believed to be the final architecture, which is maybe slightly
00:59:33.280 | sad, I would say, for the field: that we haven't had anything really new since the transformer paper
00:59:39.200 | in 2017. Recently there were two extensions I want to cover; one is mixture of experts.
00:59:47.440 | Mixture of experts reduces to a transformer in the limit of one expert, so it's slightly
00:59:53.440 | a stretch to say it's a fully new architecture, but it's still very interesting as a new knob
00:59:59.360 | to play with capacity. Basically, one problem was that, until recently, it was not very
01:00:06.240 | efficient to train a mixture of experts. Let me explain a little. In a mixture of
01:00:11.280 | experts, when your sequence of tokens goes through your model, at some point
01:00:17.280 | you have a router that says, for each token, to which expert this token should go,
01:00:23.440 | and the experts are basically at the MLP level, the feed-forward. So you have several MLPs, several
01:00:29.600 | feed-forward layers, and their number is your number of experts: for instance, three
01:00:34.880 | different feed-forward layers would be three experts, and your router will
01:00:40.240 | say, for each token, "okay, you go to expert one, you to expert two, you to expert three".
01:00:45.040 | Now, each expert was designed to be able to welcome a certain number of tokens, for instance
01:00:52.960 | two tokens in this example. And if three tokens should go to one expert, two tokens
01:00:59.680 | to another, and one to the last one, the expert that gets three tokens is not able to welcome
01:01:05.040 | them all, and so it drops one token; that token is simply not used in the
01:01:11.440 | computation, one input token. That's quite a strong impact, I would say: it means you're
01:01:17.760 | kind of ignoring a part of your inputs, and that led to, I would say, non-optimal performance.
01:01:24.240 | But it was needed for the sake of having very static matrix shapes,
01:01:31.760 | because our GPUs and TPUs are not really well adapted to dynamic architectures. Well, recently this changed, using ideas from
01:01:39.600 | sparsity: not too sparse, because we also know GPUs don't really like very sparse matrices,
01:01:44.880 | but they are actually quite good at block-sparse matrices. So what is block-sparse? It means
01:01:50.160 | the matrix is sparse, but you have blocks, and these blocks are big enough that they make efficient
01:01:55.120 | use of the GPU: big enough that they fill the matmul units,
01:02:00.080 | that there is enough to crunch, while between these blocks you have empty places.
01:02:08.160 | This is basically what MegaBlocks did recently, and that's how it unlocked efficient
01:02:13.920 | mixture-of-experts training: saying maybe our experts could be these blocks, these very
01:02:19.520 | big feed-forward matrices, and if we use this we can just repeat the blocks
01:02:26.320 | for however many tokens there are, and we can handle the sparsity dynamically, because
01:02:32.880 | it's really just blocks that repeat. It's, I would say, a kind of
01:02:37.760 | low level of dynamicity, low enough that it can be very efficient. That's basically
01:02:45.360 | what changed things here: we don't need to drop tokens anymore, we can just dynamically
01:02:50.880 | build these big sparse matrices from the experts. It even opened the door to something
01:02:56.800 | I think nobody has really been using yet, which is that you could have experts of various sizes:
01:03:02.080 | big experts, smaller experts, and so on. Very interesting, and I'm really looking forward to what will be built on top of this.
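A minimal sketch, not from the talk and not the MegaBlocks approach, of the classic capacity-based routing described above: a top-1 router sends each token to one expert MLP, each expert only accepts `capacity` tokens, and overflow tokens are simply dropped. All sizes and names are placeholders.

    import torch
    import torch.nn as nn

    class TinyMoE(nn.Module):
        def __init__(self, d_model=256, d_ff=1024, n_experts=4, capacity=2):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])
            self.capacity = capacity

        def forward(self, x):                       # x: (n_tokens, d_model)
            scores = self.router(x).softmax(-1)     # routing probabilities
            expert_idx = scores.argmax(-1)          # top-1 expert per token
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                token_ids = (expert_idx == e).nonzero(as_tuple=True)[0]
                kept = token_ids[: self.capacity]   # overflow tokens are dropped!
                if kept.numel() > 0:
                    out[kept] = expert(x[kept]) * scores[kept, e].unsqueeze(-1)
            return out

    moe = TinyMoE()
    tokens = torch.randn(6, 256)
    y = moe(tokens)   # dropped tokens get a zero update from the MoE layer

Block-sparse kernels in the MegaBlocks style remove the need for this fixed capacity, so no token has to be dropped.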
01:03:09.760 | Another interesting development was a kind of
01:03:16.480 | revival of recurrent models. There are two main models; well, I will just talk about Mamba.
01:03:24.880 | The idea here is that you can use state space models. If you're just
01:03:32.000 | out of your master's in AI, you probably learned about state space models:
01:03:36.720 | these discrete or continuous models that evolve a state over time.
01:03:45.120 | Here, all the smart work was about how to discretize this and keep it
01:03:50.080 | efficient; that was solved by Albert Gu and, again, Tri Dao of FlashAttention, including how to
01:03:57.920 | train it efficiently. It's very funny, because when you train this Mamba model, it
01:04:03.120 | behaves kind of like a convolutional network, and when you use it at
01:04:07.760 | inference time, you can use it in a kind of recurrent mode, so it's actually really fast. Mamba
01:04:13.840 | itself is quite hard to dive into, and I think the best entry point is again an annotated blog post
01:04:20.080 | by Sasha Rush. Maybe you learned about the transformer architecture from "The Annotated
01:04:25.680 | Transformer" by Sasha Rush a few years ago; now there is also an annotated Mamba blog post, which is,
01:04:32.000 | I think, a very nice way to learn about the Mamba architecture. We're actually training several
01:04:37.040 | Mamba models at Hugging Face at the moment with nanotron, so it's also very easy to train; I'll show you a bit of that.
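A minimal sketch, not from the talk and much simpler than real Mamba (which makes the SSM parameters input-dependent): the inference-time recurrent view of an already-discretized linear state space model, where each new step only needs the previous hidden state. All matrices are placeholders.

    import torch

    d_state, d_in = 16, 1
    A = torch.eye(d_state) * 0.9           # state transition (already discretized)
    B = torch.randn(d_state, d_in) * 0.1   # input projection
    C = torch.randn(1, d_state)            # output projection

    def ssm_step(h, x):
        """One recurrent step: h_t = A h_{t-1} + B x_t ; y_t = C h_t."""
        h = A @ h + B @ x
        return h, C @ h

    h = torch.zeros(d_state, 1)
    for t in range(5):                      # constant memory and time per generated step
        x_t = torch.randn(d_in, 1)
        h, y_t = ssm_step(h, x_t)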
01:04:44.080 | So, talking about nanotron: we wanted a very simple library to use all
01:04:52.240 | the techniques that I showed you. We talked about parallelism, we talked about efficient training,
01:04:58.320 | we talked about being able to iterate nicely on your hyperparameters, and also about having
01:05:04.720 | mixture of experts and Mamba. If you want to gather all of this, you usually end up with a very large library
01:05:10.960 | with a lot of bells and whistles. We wanted to keep something very minimalistic, and that's how nanotron
01:05:15.920 | was born. We want to keep it really under 10,000 lines of code, and we want to make it very fast,
01:05:23.600 | basically train as fast as possible, and also very transparent, so there's not a lot of wrapping around
01:05:30.480 | the things that you do. The idea is that it's very open, very transparent, and you have
01:05:36.720 | in it 3D parallelism and gradient accumulation; I didn't really talk about mixed
01:05:43.200 | precision, but it's in there too; you also have the tools for
01:05:47.280 | smart optimizer sharding, ZeRO-1, and all the architectures I was talking about at the end.
01:05:56.400 | So take a look at nanotron. It's the kind of research code that we use ourselves, so it's still
01:06:02.240 | a bit rough around the edges, but it's a very nice code base.
01:06:07.520 | Now that we've trained our model, which took a long time, I'm going to cover the next steps briefly,
01:06:17.200 | and talk a little bit about them because we also have nice open-source libraries for this, so I want
01:06:22.720 | to tell you a little about them. Once you've pre-trained your model, you usually want to align
01:06:29.040 | it, which means you don't want to keep it as a completion model that just generates
01:06:35.200 | the most likely tokens after the prompt; you want it to behave in a
01:06:41.440 | specific way. Usually you start by having your model behave as a dialogue model, so that it
01:06:46.880 | learns to generate answers to prompts and not just continuations, and you also sometimes want to
01:06:53.120 | instill specific behaviors, or to do some safety work, to forbid or
01:07:01.200 | reduce the occurrence of specific behaviors of your model. So this step is called alignment,
01:07:07.600 | or fine-tuning, and I would say that, up to now, there was one rather complex technique for it, called RLHF,
01:07:15.440 | reinforcement learning from human feedback. The impressive thing about RLHF, I would say, is that
01:07:21.680 | it works at all; it's basically maybe the first widespread use of reinforcement learning
01:07:28.640 | in the AI world that's actually really useful for many, many people. But it's still really
01:07:34.240 | complex in how it works, and the main tricky thing, as always in reinforcement learning,
01:07:42.480 | is the reward. Usually, in reinforcement learning, you define your reward manually; it's very complex,
01:07:48.240 | very full of heuristics, and that's one reason these systems don't generalize to anything
01:07:53.200 | beyond their test environment. The nice thing about RLHF is the HF part, which is that
01:07:59.920 | you define your reward from human feedback: you generate some completions,
01:08:07.040 | you ask humans to rank them, and you use that to train a reward model. Now, this is very nice, but
01:08:14.960 | it's kind of a complex thing; you can see here a typical labeling interface for humans to
01:08:21.920 | label the rewards. In practice, I would say, the very impressive thing is that it just
01:08:27.920 | works at all, but the implementation is very complicated: you have something like four models. You
01:08:34.240 | have the policy model that you're training, you have a base model that you
01:08:38.640 | still use because you want to stay not too far from it, you have a reward model that you trained
01:08:43.600 | on the human feedback, and you have the SFT model. All of these models need to be in memory at the
01:08:48.720 | same time; you can do some smart sharing of layers, but it's basically
01:08:54.880 | very complex, and that's why we started to build a library called TRL to make this easier.
01:09:02.800 | It's also very challenging in terms of fitting all of this in memory.
01:09:06.800 | Now, something very interesting happened last year, which was DPO, direct preference
01:09:15.440 | optimization. The idea here is that maybe your language model already knows the reward
01:09:21.280 | somehow, and so maybe it can be used without training a separate reward model. I'm saying
01:09:28.080 | that with a lot of hand-waving, but you can write it much more precisely as an equation,
01:09:35.280 | the DPO equation here. The DPO paper actually has a very nice math
01:09:41.760 | part; that's not always the case for machine learning papers, where sometimes the math is just there to
01:09:46.720 | get past reviewer 2, but in this case the math is really very nice and very interesting. The
01:09:53.760 | conclusion is that you can maybe just go with two models, the DPO model and the SFT model, which
01:09:58.560 | makes training much easier. What we saw with the RLHF team, led by Lewis Tunstall
01:10:06.320 | and Edward Beeching, together with former HF people like Nathan and Nazneen, was that basically
01:10:13.440 | it makes training much more stable, and it actually just kind of works out of the box,
01:10:17.840 | because your objective is much closer to a standard language modeling objective.
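A minimal sketch, not from the talk and not the TRL implementation, of the DPO loss computed from the summed log-probabilities of the chosen and rejected completions under the policy and the frozen reference (SFT) model; beta and the log-probability values are placeholders.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        """-log sigmoid(beta * (log-ratio of chosen - log-ratio of rejected))."""
        chosen_logratios = policy_chosen_logps - ref_chosen_logps
        rejected_logratios = policy_rejected_logps - ref_rejected_logps
        return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()

    # placeholder per-example summed log-probs for a batch of 4 preference pairs
    policy_chosen   = torch.tensor([-12.3, -40.1, -7.9, -22.0])
    policy_rejected = torch.tensor([-15.0, -38.2, -9.5, -25.4])
    ref_chosen      = torch.tensor([-13.0, -41.0, -8.2, -23.1])
    ref_rejected    = torch.tensor([-14.1, -37.9, -9.1, -24.8])

    loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)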
01:10:22.720 | So DPO changed a lot, I would say, how we align these models, and there was this question,
01:10:30.880 | maybe earlier this year: is this the end of it? Have we again moved reinforcement learning
01:10:39.040 | out of the set of most-used ML techniques? Well, no: there has recently been a revival of RL through
01:10:47.920 | the REINFORCE algorithm, which maybe some of you know if you were working in the field
01:10:52.720 | some time ago; at least, I was playing a lot with it for language modeling a long time ago.
01:10:57.520 | The idea is that at least the papers from Reka here and from Cohere show that
01:11:04.240 | REINFORCE, and more on-policy RL in general, may still be very competitive with DPO, and
01:11:10.720 | maybe even better. So the jury is still out in 2024: is DPO the answer, or will we see a revival
01:11:20.320 | of RL? We'll see. Now you have fine-tuned your model: you pre-trained it, you aligned it, the behaviors are
01:11:28.640 | great, you're very happy, you think it's a nice model, you evaluated it as we discussed. Now
01:11:34.000 | you need to deploy it, and that will be my last slides. It will actually be quite short,
01:11:41.120 | because I think there are a lot of resources on this, but maybe one thing to keep in mind is that
01:11:45.520 | there have been multiple breakthroughs in inference optimization over the last few months. I would
01:11:51.680 | say it's really impressive. I remember, like two years ago, when I was saying, okay, we might want
01:11:56.800 | to deploy a new model of seven or ten billion parameters, people said this is never going to
01:12:02.960 | work, these models are just too big. Well, the reality is that today, on my laptop, I can run Mistral
01:12:09.200 | 7B and it's just really fast, even faster than me talking. There are a couple of things that
01:12:15.680 | made that possible; those are the things I'm listing on these slides. The first one, I would say,
01:12:20.400 | is quantization. That's the first impressive thing: we can just quantize these models, we can move them
01:12:25.680 | from the floating-point values that they have, FP16 for most of them, or bfloat16, to quantized integers, and
01:12:34.800 | it just works; we lose minimal performance. We have various setups, various techniques:
01:12:41.760 | GPTQ, the techniques included in llama.cpp, the GGML formats, and NF4. I put a couple of them
01:12:50.640 | out there; they all just work really well. I think a good default, honestly, is the one from llama.cpp:
01:12:57.120 | it's very nice and it works well. This basically solved huge pain points, also in terms of
01:13:03.120 | model size, because models are much, much smaller once quantized.
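A minimal sketch, not from the talk, assuming the transformers and bitsandbytes libraries are installed and a CUDA GPU is available: loading a model with 4-bit NF4 weights so it takes roughly a quarter of the memory of its fp16 version. The checkpoint name is just an example.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "mistralai/Mistral-7B-v0.1"       # example checkpoint

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                       # quantize linear-layer weights to 4 bits
        bnb_4bit_quant_type="nf4",               # NF4 data type
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )

    inputs = tokenizer("Quantization lets a 7B model fit on", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))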
01:13:09.120 | Now we can do even better with speculative decoding, which is super interesting, and a recent development of it
01:13:14.000 | called Medusa. The idea here is that we have two models that are roughly similar, but one is
01:13:19.040 | much smaller than the other, and they're trained roughly on the same dataset; they should be
01:13:24.480 | as close as possible. The small one will predict whole chunks of tokens, and then we
01:13:29.760 | just use the big one to validate how good these chunks are, and to keep the
01:13:35.040 | tokens up to the point where they start to diverge from what the large model would have output. This
01:13:42.000 | means we can generate tokens in bunches and just validate them with the big model. This
01:13:48.160 | takes a little more room in memory, but not that much, because the small model is much smaller,
01:13:53.920 | and it speeds up inference by a lot. This is basically what lets us run very large models on a laptop.
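A simplified sketch, not from the talk: a greedy draft-and-verify loop with two Hugging Face-style causal LMs (no sampling, no KV cache, batch size 1; draft_model and target_model are placeholders for a small and a large model that share a tokenizer).

    import torch

    @torch.no_grad()
    def speculative_generate(draft_model, target_model, input_ids, k=4, max_new_tokens=32):
        """The small model proposes k tokens; the big model checks them in one forward pass."""
        ids = input_ids
        prompt_len = input_ids.shape[1]
        while ids.shape[1] - prompt_len < max_new_tokens:
            # 1) draft k tokens greedily with the small model (k cheap forward passes)
            draft = ids
            for _ in range(k):
                next_tok = draft_model(draft).logits[:, -1].argmax(-1, keepdim=True)
                draft = torch.cat([draft, next_tok], dim=1)
            # 2) score all drafted positions with the big model in a single forward pass
            target_pred = target_model(draft).logits[:, ids.shape[1] - 1 : -1].argmax(-1)
            drafted = draft[:, ids.shape[1]:]
            # 3) keep drafted tokens until the first disagreement, then take the big model's token
            agree = (drafted == target_pred)[0].long()
            n_ok = int(agree.cumprod(dim=0).sum())
            ids = torch.cat([ids, drafted[:, :n_ok], target_pred[:, n_ok : n_ok + 1]], dim=1)
        return ids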
01:13:59.040 | There is a nice blog post I really like, called "Accelerating Generative AI with PyTorch:
01:14:06.720 | GPT, Fast", which shows you all the other techniques you can use: you can compile your model, you can
01:14:12.800 | use CUDA graphs. That is basically something we covered just earlier: the idea
01:14:17.840 | of reducing CPU-GPU synchronization as much as possible, so that your GPU
01:14:24.080 | goes through the layers as autonomously as possible and you do as little synchronization
01:14:30.400 | with your CPU as possible, which gives you even more of a speed-up. Really impressive.
01:14:35.120 | These are the inference techniques; there are a lot of them, and I just put a few references there for
01:14:41.680 | you to explore. The final step: you've pre-trained your model, you've aligned it, you're very happy about
01:14:49.760 | the inference, you've quantized it, you can distribute it. The final step is to share it with the world. We need
01:14:56.320 | more knowledge, we need more models shared openly, we need more open datasets, we need a lot more
01:15:01.840 | sharing in the world. Thankfully, at Hugging Face we've been building a place to share these things, so
01:15:08.080 | use Spaces, evaluate your model openly on the Open LLM Leaderboard, put it on the really great Chatbot
01:15:14.480 | Arena, set up a chat demo for people to try it. Basically, please share all the knowledge that
01:15:21.600 | you've learned and all the artifacts that you've created, as much as you can. That will be my only
01:15:27.920 | reward, the one I'm asking from you for this video. Thanks! I actually kept this question slide from my talk;
01:15:36.720 | I cannot really answer questions on YouTube, but please leave comments, or open a post
01:15:42.080 | on Hugging Face, or ping me anywhere, and I'm very happy to answer any question
01:15:47.840 | that you may have on this. Thanks a lot for watching. Bye!