
The Ultra-Scale Playbook by Hugging Face Nanotron


Transcript

...a bit on LoRAs. Okay, I'll restart, we just started recording. Basically, Hugging Face put out this blog post a week or two ago. It's very in-depth, the estimated reading time is something like two to three days, with really good visualizations and little interactive widgets on how to do pre-training of models. There's a quick overview at the start. The plan is this: we're not covering a two-day read in an hour, so we'll go through the main points of everything and double-click into whatever people have questions on. If anyone has thoughts or comments, jump in whenever. A lot of it is a fairly mathematical guide on how to pre-train a language model. There are plenty of resources out there if you're doing fine-tuning, LoRA, or synthetic data generation, and lots of libraries that make post-training and fine-tuning easy, but they wanted to answer the question everyone has: training across multiple nodes and multiple GPUs is hard, so what is all this data parallelism and pipeline parallelism, and who's actually writing about it? Hugging Face has a cluster of 512 GPUs, they ran thousands of experiments, and they put together a guide: if you want to go pre-train a model, here's your zero-to-eighty-percent of what you should know about distributed training. That means anything with more than one GPU, really more than two: how can you efficiently scale your training? Everything from "the model doesn't fit on one GPU" to "I have a 7B model and want to use ten GPUs efficiently" to multi-node training, plus what actually makes up most of the compute and memory when you train: activations, backprop, all of that.

TL;DR, they kick off with a basic overview, and the framing is that there are pretty good open-source models out there, but nobody really talks about this nitty-gritty stuff. One thing the blog post doesn't explicitly state but does show: you see all these gray dots in their plots. They ran thousands of experiments, and the gray dots are failed runs. Some of them were forced failures, where you overdo your batch size, you know you'll run out of memory, and guess what, you ran out of memory. But the other little niche I've noticed when you do a lot of multi-node, multi-GPU training is that sometimes a random GPU fails, or one GPU produces a weird loss, and your training run is kind of screwed. The post doesn't really account for those cases. Sure, there's checkpointing and things like that, but in practice, when you're training on a thousand GPUs, sometimes one GPU randomly dies and it causes issues. Anyway, that's the high-level framing of what to do.

Kicking off, let's go over the fundamentals and background. They have this cool interactive widget with presets for the Llama family of models: a tiny Llama, an 8B, a 70B. It lets you visualize how much VRAM is required to do pre-training, so training the full thing, at different precisions and different sequence lengths. As we scale up the sequence length, we can see what happens to the VRAM required.
There are a few key parameters to know here: batch size and sequence length, which together largely determine how much GPU memory you're going to use, plus how many parameters the model has, in this case a 70B. More batches means more GPU memory; more sequence length means significantly more. One of the takeaways is that the attention blocks grow quadratically: as I double the sequence length, the attention memory roughly quadruples. So if we start with a sequence length of around 4,000, you can see how the gradients and optimizer states dominate: to train Llama 70B at 4,000 sequence length, activation memory is about 80 gigs, while parameters, gradients, and optimizer states are about 300 gigs (the parameters themselves are a relatively small slice if you're doing mixed precision, which you can also toggle off). Gradients and optimizer states add a lot, and people don't realize that the attention mechanism itself actually doesn't take that much at this length. But as we scale up to, say, a hundred thousand or a million tokens of context, attention really does eat up all the memory. It's a cool visualization. The other knobs, hidden dimension, batch size, vocab size, the activation settings, are also interesting to poke at, and it's nice that they give you the presets. Basically, you should play around with this and read through the intro; they have a high-level overview, and just reading that gives you a decent understanding of what's going on and what takes memory when you train a foundation model from scratch.

Okay, on to efficiency. I guess they ran 4,000 experiments, not a thousand, across different model sizes: 1B, 3B, 8B, up to 70B, showing what crashed and what scaled to more tokens. I thought the chart was kind of weird to read, they went for pretty instead of functional, but the surrounding context is pretty much pure gold.

High-level overview: they look at three challenges when you do distributed training. First, memory usage. This is your hard limitation: how many GPUs do you have and what can you fit? With one GPU you adjust your batch size and sequence length; with two, you decide how to split things. Second, compute efficiency: how do we share the model? Do we do data parallelism, where the entire model sits on every single GPU, or do we shard the model weights because we can't even fit them? Third, communication overhead. This becomes an issue when you're GPU-rich-rich. Most people are GPU-poor, with one or two GPUs for fine-tuning and post-training; GPU-rich-rich is when you have multiple nodes. There's this thing called GPU interconnect, communication bandwidth, a buzzword we hear pretty often. What it means is that when you shard things between GPUs, they need to talk to each other: if I have layer one of a model on one GPU and layer two on another GPU, they have to sync to do their thing.
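To make "GPUs talking to each other" concrete, here's a minimal sketch of the kind of collective that underlies all this syncing, an all-reduce that sums a tensor across ranks with torch.distributed. This is my own toy illustration, not code from the post; the two-process launch and the tiny tensor are just for show.

```python
# Minimal all-reduce sketch. Launch with: torchrun --nproc_per_node=2 this_file.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")          # "gloo" also works on CPU
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    x = torch.ones(4, device="cuda") * (rank + 1)    # each rank holds different data
    dist.all_reduce(x, op=dist.ReduceOp.SUM)         # every rank ends up with the sum
    print(f"rank {rank}: {x.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Data-parallel gradient syncing is essentially this same operation applied to every gradient tensor, which is why the interconnect bandwidth matters so much.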
If they've got good communication, which is basically what NVLink gives you, you're kind of chilling, and there's a lot you can do. Ring attention, for example, is a way to handle really long context: you split a really chunky sequence, say a million tokens of context, across multiple GPUs. But that only works with low communication overhead, meaning the GPUs talk to each other very closely. One node of eight GPUs all interconnected with high bandwidth is great, but when you become GPU-rich and have multiple nodes, how do those nodes talk to each other? Now you have another level of communication bottleneck: each node has good internal interconnect, but for node-to-node communication you probably want to do something different. Those are the three main things. If you're really doing this at scale you probably don't need my yapping, you probably know more than me, but the post gives good formulas and explanations for how to mathematically figure out your batch size and sequence length based on your available compute.

They have a little cheat sheet we can go over quickly. Small models: use simple parallelism; with eight GPUs and a small model, you can probably put the full model on every GPU. If you have a large model that needs more than eight GPUs, there's tensor parallelism; if you have a whole bunch of GPUs, there's MoE-style expert parallelism and so on. There's a section on optimizing throughput with tensor parallelism and data parallelism, all summarized in the cheat sheet. Then they list the parallelization strategies: data parallelism is pretty straightforward and they give pros and cons, then ZeRO-1, ZeRO-2, ZeRO-3, tensor parallelism, pipeline parallelism, and context parallelism. We'll quickly go over all of them. The cheat sheet reads like something you'd expect a research-intern candidate to be able to talk through in an interview; if you're genuinely being handed 500 or a thousand GPUs, you'd better not need it, but it's useful.

Okay, step zero, still part of the overview: what happens when you're training on one GPU? Very high level, we know that LLMs are trained to predict the next token. You do a forward pass on your data, you compute and accumulate gradients, you do backpropagation, you run an optimizer step, and now you have new parameters. That's the diagram: forward pass, backward pass to compute gradients, optimizer step to update parameters, and you keep doing this at scale. Then you can start batching your input. The interesting thing about attention, which takes up so much of this compute, out of the hundreds of gigs of memory required to train an 8B at that scale, or say 400 gigs of VRAM to train a 70B at a batch size of 256, is that all these attention blocks are really good at running in parallel. You don't compute attention sequentially, block after block; you run it all in parallel, and that's exactly what GPUs are good at.
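As a tiny illustration of that, here's attention computed over a whole batch of sequences in one call, using PyTorch's scaled_dot_product_attention (PyTorch 2.x); every shape here is made up for the example.

```python
# One kernel call computes causal attention for all 8 sequences at once.
# Shapes: (batch, heads, seq_len, head_dim).
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 8, 16, 1024, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([8, 16, 1024, 64])
```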
So when you're doing stuff in parallel, you want to start batching. Batching is where you process multiple sequences at once: you take eight sequences, put them in one batch, and run the whole batch in parallel. That's your batch size. The post basically says forget counting sequences, what you really think about is batch size in tokens, which is batch size times sequence length. So if each sequence is a hundred tokens and you have eight of them in a batch, you're really training on 800 tokens per batch. There's a cool line in here for people who don't read much about pre-training: LLMs are typically trained on the order of four to sixty million tokens per batch, and that batch size has kept increasing as clusters get bigger and training gets more efficient. Llama 1 was trained with a batch size of about four million tokens, for 1.4 trillion tokens total. Llama 3 was trained on 15 trillion tokens; I don't know its batch size in tokens off the top of my head. DeepSeek was trained with a 60-million-token batch size for around 14 trillion tokens. The numbers keep going up, and more efficient training is very cool.

So the first challenge is scaling up to these large batch sizes without running out of memory. As much as I want to parallelize everything, I'm going to run out of memory, so the question is: what do we do when one GPU doesn't have enough memory to hold a full batch at the target batch size? The simple answers are to make the batch size smaller, and if you really can't fit it, to start adding more GPUs. But first they want to explain what's actually in memory when you train. You have a few things going on: the model weights, i.e. the parameters, which take some number of bytes per parameter depending on precision (they do all the math on this); the gradients you're computing for backprop; the optimizer state you have to keep around; and the activations. On top of that you need to leave a bit of a buffer, because CUDA kernels add one to two gigs of GPU memory. If you're doing local stuff, say training a 7B on a 4090 or 5090, or on a MacBook with 16 gigs of RAM, remember that 16 or 24 gigs doesn't mean you can use it all: CUDA kernels take one to two gigs, and your screen and general laptop processes take a few more. Quantization and mixed-precision training are the tricks that let you cut the memory requirement for the model weights. Another thing is that you're not using all that memory all the time. They show four training steps of Llama 1B and you see spikes: at one point the memory in use is up around 50 gigs, then it drops while other phases run, gradient accumulation, autograd, activations, and so on, so the memory profile is spiky across a step.
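Before the post's exact math, here's a rough back-of-the-envelope version of that accounting. Everything in this sketch, the function name, the byte counts, the activation fudge factor, is my own approximation for illustration, not the post's formulas; real numbers depend heavily on the optimizer, precision choices, and attention implementation.

```python
def training_memory_gb(n_params_billions, batch_size=1, seq_len=4096,
                       hidden=4096, n_layers=32, mixed_precision=True):
    """Very rough VRAM estimate (GB) for training a dense transformer with Adam."""
    n = n_params_billions * 1e9
    bytes_per = 2 if mixed_precision else 4
    weights = n * bytes_per                       # bf16/fp16 weights (or fp32)
    grads = n * bytes_per                         # gradients in the same dtype
    # Adam in mixed precision: fp32 master weights + two fp32 moment buffers.
    optimizer = n * 12 if mixed_precision else n * 8
    # Activations scale with batch * seq * hidden * layers; ~16 bytes per element
    # is a loose fudge factor (recomputation and attention impl change it a lot).
    activations = batch_size * seq_len * hidden * n_layers * 16
    return (weights + grads + optimizer + activations) / 1e9

print(f"~{training_memory_gb(8):.0f} GB")   # ballpark for an 8B model at batch size 1
```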
They go into a bit of math about how to calculate all of this; if you're interested, read it on your own time, it's pretty long. The tables here are always interesting: they show how much memory you need for full-precision or half-precision training, with and without gradient accumulation. Gradient accumulation is a neat little trick where you accumulate gradients over several steps and then take the loss over their average. Activation memory is another interesting one: activations take real memory too. We're running a little slow so I'll go quicker, but if anyone wants to double-click on any of this, we can come back at the end, or just interrupt me.

Quick question from the chat: that graph you showed, is that per epoch, like by the end of the first epoch? No, this is just training steps. A step here is basically per batch: after you do the forward pass, accumulate your loss, and do a backprop, there's a little dip. Epochs are just a broader grouping of these steps, same concept. A lot of training now is a single epoch, you don't train on the same data over and over as much anymore, which is interesting, but it's the same idea. On epochs, by the way: Meta might be shifting to something like four epochs, and there are interesting niches there, roughly that high-quality data can stand more epochs while ordinary data gets fewer. One epoch is one full pass over all your training data: if I have a thousand samples and train on all one thousand, that's one epoch; two epochs means doing those thousand samples again. There was a whole phase where one epoch over a trillion tokens was the new norm; I guess the new meta is four epochs, according to some paper.

Okay, gradient accumulation, this is a fun one. Storing gradients as you train matters: for every forward pass we have to do a backward pass and calculate gradients, so we predict some tokens, look at what the actual tokens were, and figure out how far off we are and which direction to move, for every parameter. Then we have multiple batches, then an optimizer step. Holding the gradients and everything needed to compute them for a big batch adds up; it takes a lot of memory. So what we can do is gradient accumulation: instead of pushing one entire batch through at once, we split it into micro-batches, do a forward and backward pass on each micro-batch, and accumulate the gradients. If you have a batch of 16, cut it into micro-batches of four, do the forward and backward passes on each, and accumulate, effectively averaging, the gradients. Because each pass only has to hold the activations for a small micro-batch, it really helps reduce your memory footprint. There is a bit of overhead, because instead of one big parallel computation you now spend time stepping through micro-batches and accumulating.
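A minimal sketch of that loop, on a toy linear model rather than a transformer, with made-up sizes:

```python
# Gradient accumulation: sum gradients from several small micro-batches before one
# optimizer step, so each forward/backward only holds a micro-batch's activations.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(128, 10)                       # stand-in for the transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
accum_steps = 4                                  # micro-batches per optimizer step

for step in range(16):
    inputs = torch.randn(8, 128)                 # one micro-batch of 8 samples
    targets = torch.randint(0, 10, (8,))
    loss = F.cross_entropy(model(inputs), targets)
    (loss / accum_steps).backward()              # dividing keeps the sum an average

    if (step + 1) % accum_steps == 0:            # one update per 4 * 8 = 32 samples
        optimizer.step()
        optimizer.zero_grad()
```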
But this is pretty straightforward, it's what you'd expect. I hadn't actually looked at this chart before, but I guess it shows there's a bit of idle GPU time during the accumulation and merging. So that's your one-GPU story: you figure out how much VRAM you have, you take your sequences, you optimize your batch size, and if you can't fit that batch size you do gradient accumulation with micro-batches. There are other little niches, like the fact that with too many micro-batches your loss can get a bit spiky since you're doing lots of small batches, things to look into if you're actually doing this, but at a high level that's what's going on.

Now, that's all great, and one-GPU stuff is pretty easy: there are plenty of libraries that will do single-GPU fine-tuning, even full-parameter, and a lot of companies will do it for us. But what happens when I want to start using multiple GPUs? There was a whole GPU shortage, but now it's pretty chill: I can go to RunPod, Prime Intellect, a bunch of these companies, and rent two, four, eight GPUs; I think RunPod announced they'll soon have on-demand 64 H100s. So say I've got eight or 64 H100s, what do I do? I don't want to just run one batch size naively, that's inefficient. That's where data parallelism comes in. Data parallelism is where you put all of the model weights on every GPU, split up your data, and run different batches on different GPUs. In parallel, every GPU has all the weights and trains on its own batch, then you accumulate the gradients, do the optimization, and update them all in sync. There are a bunch of refinements, ZeRO-1, ZeRO-2 and so on, which we'll talk about in a bit.

The other interesting note here is the whole crypto-adjacent wave of "distributed training": everyone has one GPU or one MacBook, let's train GPT-5 across everyone's machines. What projects like Nous Research and Prime Intellect are actually doing, and not to knock their research, it's genuinely interesting, novel, and hard, is data parallelism across data centers. You have one node here, one in Australia, one in Europe, you put the entire set of weights on each of them, and you train different batches in each place. It's still cool, and it still requires servers in a lot of locations, but the whole model has to fit on every GPU, on every server. So data parallelism really is as straightforward as it sounds: all the weights everywhere, different batches on different devices, and synchronization of the optimization across them. There are different flavors, you can add gradient accumulation, you can bucket gradients, and the post talks about efficiency tricks for all of this, which I don't think we should cover in an hour; if you're at the stage where you're doing this, they go over the main ways to do it.
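For reference, the plain-vanilla version of this in PyTorch looks roughly like the following sketch, with a stand-in model and made-up sizes; DDP keeps a full copy of the weights on every rank and all-reduces the gradients during backward.

```python
# Launch with: torchrun --nproc_per_node=<num_gpus> this_file.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = nn.Linear(1024, 1024).cuda()          # stand-in for the full model
    model = DDP(model, device_ids=[rank])         # replicate weights, sync gradients
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 1024, device="cuda")  # each rank sees its own batch
        loss = model(x).pow(2).mean()
        loss.backward()                            # DDP all-reduces grads here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```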
Yeah, so that's the story so far: we've gone from single-GPU training to data parallelism. Say I have a node of four H100s and I want to train a Llama 7B: I can put the 7B on each GPU, split my big pile of tokens into batches, have each GPU train on its own batch, and then efficiently combine the results. The post has a nice summary of what to do at this stage. First, determine the best global batch size, either by running experiments or by consulting the literature, and their post walks through how. Then select the sequence length. Long sequences take a lot of compute, so what's done in modern models is to train on a small sequence length for roughly 90% of the tokens and then increase it at the end; if I'm not mistaken, Llama did most of its roughly 15 trillion tokens at a short sequence length and then extended to something like 100k context for the last stretch. That's very common because it saves a lot of compute and still generalizes, and two-to-eight-thousand-token lengths work well for the evaluations we have today. I think people don't look into this enough: sure, two to eight thousand tokens works for MMLU, but if you're training something like an LLM-judge classifier you probably don't need 8,000 tokens, so be smart and train on fewer tokens if you need fewer. Then, knowing the batch size, find the largest local batch size that fits on a single GPU, determine the number of GPUs you have for data parallelism, spread your batches across them, and add gradient accumulation if you need it. That's basic data parallelism.

Now there's a whole series of improvements on top of data parallelism: ZeRO-1, ZeRO-2, ZeRO-3. So what is ZeRO? It stands for Zero Redundancy Optimizer. Let me also check chat real quick. Someone notes it's basically map-reduce for doing this syncing; yeah, and they have fancier versions of map-reduce as well. Are the numbers different for reasoning models? Reasoning models are a little different, but that's RL, and RL at some level is similar: you're still just training a transformer, and what's happening inside that transformer is somewhat irrelevant to the parallelism story.

Okay, so ZeRO. Data parallelism is an efficient way to scale training, but the naive replication of optimizer states, gradients, and parameters across the data-parallel ranks introduces significant memory redundancy: every single GPU holds all the optimizer states, all the gradients, and all the parameters. What ZeRO-1, -2, and -3 ask is: can we shard more than just the data? Can we shard the optimizer state, can we shard the gradients, and then, in ZeRO-3, which is the interesting one, can we shard the parameters themselves? ZeRO-1 and ZeRO-2 are essentially more memory-efficient data parallelism: ZeRO-1 shards the optimizer state across ranks, and ZeRO-2 also shards the gradients.
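In practice the ZeRO stage is usually just a config knob. Here's a hedged sketch of the DeepSpeed-style shape of that config; the exact keys and initialize call can differ by version, so treat this as illustrative rather than a working recipe.

```python
# Stage 1 = shard optimizer states, 2 = + gradients, 3 = + parameters.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 32,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                 # shard optimizer states and gradients
    },
}

# Typical usage (assumes `model` exists and the script runs under the deepspeed launcher):
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```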
Back to what's happening in memory. In mixed-precision training we have the parameters, the gradients, and the optimizer states, with the optimizer states kept in higher precision, plus a higher-precision copy of the gradients if we want one. ZeRO-1 partitions the optimizer state. In regular data parallelism, all ranks gather gradients after the backward pass and then perform identical optimizer steps; that's a lot of duplicated work and redundant memory. With ZeRO-1, the forward and backward passes happen as before, then you perform a reduce-scatter on the gradients, each replica performs the optimizer step only on its local shard of the optimizer state, and then you do an all-gather to send everyone the parameter updates they're missing. TL;DR, they're sharding just the optimizer state; the post goes into the details of how. ZeRO-2 is the next step: since each replica is also computing gradients, why not shard those too? During the backward pass, instead of performing an all-reduce you perform a reduce-scatter, which spreads the gradients across ranks and saves more memory again. They have a chart at the end of this section that shows how it all fits together.

Then we have ZeRO-3, the fun one. If you need it, say you're post-training Llama 70B and you want to push the sequence length way up, to a hundred thousand tokens, you're no longer fitting even with one node's worth of replicated weights. This is where FSDP comes in: you shard the parameters themselves, so instead of every GPU holding all the model weights, the weights are split across GPUs (there's a minimal FSDP sketch below, after the questions). It sounds great, but with ZeRO-3 you still need good communication between the GPUs, and there are technical details about how the forward and backward passes gather what they need. Later on we'll see sharding across other dimensions too; the way all of this works is that training, attention, all of it, is just matmuls, rows times columns, and there are smart ways to split and manipulate those products. The post explains what's going on, so read it if you're interested, but if you just want to use it, pretty much any training library, PyTorch or whatever you're using, lets you turn this on in a few lines of code and handles it all on the back end.

That's the three ZeRO stages, so let's take a quick break here and see if there are questions. For weights, activations, gradients, and optimizer states, what's the intuition for which ones we want in lower versus higher precision? It partly depends on whether you're training or doing inference. It's not a strict higher-or-lower choice, it's mixed precision: things like gradient accumulation and certain activations are kept in higher precision, and there's more and more work coming out on low- and mixed-precision training. The general rule of thumb is to use as much precision as you can afford. Half precision and heavy quantization are fine for inference, that's a different game, but for the pieces that don't carry a huge memory overhead you keep higher precision, and the model weights are where you try to lower the precision a bit.
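Here's the promised ZeRO-3-style sketch, using PyTorch FSDP with a toy stack of linear layers standing in for the model; launch it with torchrun, the sizes are made up, and real runs add wrapping policies and mixed-precision settings.

```python
# FSDP shards the parameters across ranks and gathers them on the fly
# for each forward/backward pass.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = nn.Sequential(*[nn.Linear(2048, 2048) for _ in range(8)]).cuda()
    model = FSDP(model)                           # parameters now sharded across ranks
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(16, 2048, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```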
Okay, now we have tensor parallelism. Tensor parallelism is another way of splitting things up: it splits the model weights across their hidden dimension. ZeRO-3 was about sharding whole parameters across data-parallel replicas; this is about splitting the tensors themselves. The post walks through the math of what's happening: you're doing matmuls, rows times columns, here's a smart way to split them, here's a smart way to reduce the results, and here's what that means inside a transformer block. A lot of this is the background behind things like FlashAttention and grouped-query attention, and it's used in models like Llama 3. At a high level it's just another way of sharding your model; you can do it column-wise or row-wise, the two have different communication overheads and different benefits, so look into both (there's a toy sketch of the column-wise split a bit further down). Tensor parallelism applied to full transformer layers is another interesting part of the post. For time's sake we'll start going a little quicker. The important caveat: the more you do this, the more the interconnect between GPUs matters. On a single node of highly interconnected GPUs this works fine, but the minute you're using PCIe cards or spanning multiple nodes, you're adding too much communication bottleneck for it to be worth it.

Next is sequence parallelism and context parallelism, which are closely related. Instead of splitting along the hidden dimension like tensor parallelism, you split along the sequence. This is where things like ring attention come in; we did a paper club on ring attention a while ago. The idea for context and sequence parallelism is: if you can't fit your target context length, you take these 128,000-token sequences and split them across GPUs, say the first 10,000 tokens on GPU one, the next 10,000 on GPU two, and so on. There's a cool way to do this, and the post goes into the math, but conceptually think about what attention is: the relationship between all tokens. You still need to compute that, so with a hundred thousand tokens you need to know how each token relates to every other token, which you'd expect to require everything on one GPU. Instead you shard the sequence and do ring attention: each GPU computes part of the attention, they sync by passing their key and value blocks around, you do the big query-key-value calculation in pieces, and you've effectively computed attention across GPUs.
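Here's the promised toy sketch of the column-wise tensor-parallel split, just to show the matmul really does decompose; the two halves stand in for two GPUs and every size is made up.

```python
# Split a linear layer's weight along its output (hidden) dimension, compute each
# shard's matmul "on a different device", then concatenate the results.
import torch

torch.manual_seed(0)
x = torch.randn(4, 512)                  # (batch, hidden_in)
W = torch.randn(512, 1024)               # full weight: (hidden_in, hidden_out)

W0, W1 = W[:, :512], W[:, 512:]          # "GPU 0" and "GPU 1" each own half the columns
y_parallel = torch.cat([x @ W0, x @ W1], dim=-1)   # shards computed independently

y_full = x @ W                            # what a single device would compute
print(torch.allclose(y_parallel, y_full, atol=1e-5))   # True
```

The row-wise variant is the mirror image: you split the input dimension instead and sum (all-reduce) the partial results rather than concatenating them, which is where the different communication patterns come from.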
Back on the sequence side: there's ring attention, there's zigzag attention, there are little optimizations for all of this, but it's pretty cool. A lot of the "how do we get infinite context" and long-context work came from here. I think it's a little underrated, so reading a bit about context and sequence parallelism is worthwhile.

Next is pipeline parallelism, which is a fun one. This is where you distribute consecutive layers: instead of sharding gradients or optimizer states or sequences, you shard the layers themselves, so layers one, two, three sit on this GPU and other layers sit on other GPUs. Once again it's another approach to fitting really large models, and it reduces the per-GPU memory for everything. But once you've split up the layers, how do you keep training? You still need the forward pass, the backward pass to compute gradients, and the optimizer step, so the post walks through the schedules: all-forward-all-backward, then one-forward-one-backward (1F1B), which is what Llama 3 used, then interleaved stages, and other recipes like zero-bubble and DualPipe. Lots of interesting little optimizations to make this work (there's a naive two-stage sketch below showing the basic idea and where the idle "bubble" comes from). Expert parallelism is the more straightforward one: in a mixture-of-experts model, different experts live on different GPUs. You have a router, each GPU hosts its own expert, in some cases multiple experts per GPU, and you get efficient expert parallelism.

So, the overview summary: data parallelism splits along the batch dimension; tensor parallelism splits along the hidden dimension; sequence and context parallelism are for when you need to extend the context length, say training up to a 500,000-token context on limited GPUs; pipeline parallelism splits along the model's layers; and expert parallelism splits across the experts of a mixture-of-experts. We can also combine these with the ZeRO stages: ZeRO-1 shards the optimizer states, ZeRO-2 shards optimizer states and gradients, ZeRO-3 shards the parameters on top of the other two. That's the high-level menu: if you have multiple GPUs, these are the things you can do, and the post lets you double-click into each one. Say you're doing pipeline parallelism, you're sharding everything, and you then want to extend your sequence length: you can go back into that section and figure out the interleaving stages and how to optimize it. And everything here is still at the applied-researcher level rather than novel research: it's all available in libraries like PyTorch, there are CUDA kernels like FlashAttention that already use this stuff, and these are things you can just consume as you do your training.
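Since the schedules came up, here's the most naive possible pipeline-parallel sketch: two stages on two GPUs with no micro-batch interleaving, which is exactly the setup whose idle bubble schedules like 1F1B and zero-bubble try to shrink. Toy layers, made-up sizes, and it assumes two CUDA devices are visible.

```python
# Consecutive layers live on different devices; activations hop between them.
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 1024), nn.GELU()).to("cuda:0")
        self.stage1 = nn.Sequential(nn.Linear(1024, 1024), nn.GELU()).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))   # first half of the layers on GPU 0
        x = self.stage1(x.to("cuda:1"))   # activations move, second half on GPU 1
        return x

model = TwoStageModel()
out = model(torch.randn(8, 1024))
print(out.device)                          # cuda:1
```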
If you're pushing the boundary, like xAI with its cluster on the order of 200,000 GPUs, or GPT-4.5, which was a very chunky model, that's where you have to push on these ideas and develop even more novel training strategies. But that's the high-level tour of the different parallelization strategies. After that, the post gives a recipe for finding your best training configuration. If you skip everything else, the original intro is the important part: how do we train on one GPU, on two, on a single node. The middle of the post is for when you're GPU-rich and probably smarter than me: how do you train across nodes. One motivation we skipped over, by the way, is that if you have multiple nodes without fast interconnect between them, some of these parallelization strategies work better than others; they're designed to tolerate not having that interconnect across all nodes, and there are recipes for that too.

Okay, high level, let's sum it all up again. Step one: fit a training step into memory, i.e. figure out how to fit a full model instance on your GPUs. There are several regimes, depending on whether you're GPU-rich or GPU-poor. If you're GPU-poor, there's full activation recomputation, there's gradient accumulation, you can train slower: you increase your gradient accumulation steps to process larger batch sizes. If you're GPU-rich: for models under 10B parameters, a single parallelism strategy is enough, something like tensor parallelism or ZeRO-3 across eight GPUs, which is where you shard the model weights across one node. For models between 10B and 100B parameters you'll need more than eight GPUs; they keep saying eight because that's one node. So under 10B on one node, use FSDP with ZeRO-3; for 10B to 100B you'll need multiple nodes, and that's where you start combining tensor parallelism with pipeline parallelism, or tensor parallelism with data parallelism plus ZeRO-3. And when you're rich, with 512 or a thousand GPUs, you shouldn't really be learning this from me, but there's even more you can do there. For the GPU-poor we already covered it: gradient accumulation, find what batch size you can support, split batches into micro-batches, and don't overdo it.

Step two: achieve the target batch size. Depending on where step one left us in terms of micro-batch size and data parallelism, our current batch size might be too small or too big. You scale up data parallelism or the gradient accumulation steps, keep accumulating until you hit the target without running out of memory, and if you want long context, add context parallelism, the ring-attention-style stuff. To decrease the global batch size, you can reduce data parallelism or context parallelism if you have more GPUs (the arithmetic is sketched right after this recap). Step three: optimize training throughput. Set up tensor parallelism using the fast interconnect if you have it, then increase data parallelism: if you have more GPUs, use them and put the whole model on each replica, keep the target batch size as big as you can, try scaling up the different parallelisms one by one, and just keep experimenting and benchmarking.
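That step-two bookkeeping is simple enough to write down as a helper; all the numbers below are made-up illustrations of the arithmetic, nothing more.

```python
def grad_accum_steps(target_batch_tokens, micro_batch_size, seq_len, dp_degree):
    """How many micro-batches to accumulate to hit a target global batch size in tokens."""
    tokens_per_step = micro_batch_size * seq_len * dp_degree
    steps, remainder = divmod(target_batch_tokens, tokens_per_step)
    assert remainder == 0, "pick sizes that divide the target evenly"
    return steps

# e.g. a ~4M-token global batch on 64 GPUs, micro-batches of 2 sequences of 4096 tokens:
print(grad_accum_steps(4 * 1024 * 1024, micro_batch_size=2, seq_len=4096, dp_degree=64))  # 8
```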
They ran a lot of benchmarks; I don't know if anyone found anything in them they want to dive into, but I want to leave ten minutes for questions, so I'm going to skip through a good bit of this. There's a section diving into GPUs themselves, kernel fusing, threading: what a GPU is, how it works, high-bandwidth memory, interconnect, global memory, caching, KV caching, and the CUDA kernels that optimize all of it. I don't think it's as relevant to the high-level question of how to train models across GPUs; if you're interested, read it, it's useful, but it's a little too niche for this session. There's also a bit on how matmuls actually happen, with FlashAttention as the CUDA-level optimization of all of this and ring attention as another example, and I think there's a conclusion at the end.

Mixed-precision training is another interesting one. Mixed-precision training is one thing, mixed-precision inference is another; there's a lot more material out there on inference, but for training they go into accumulation and which parts of training should sit at which precision (there's a minimal mixed-precision loop sketched at the end of this recap). FP8 pre-training is kind of interesting, some people are doing it and we want to see how well it works; recent work, including FP8-LM and DeepSeek-V3, which is an interesting one, shows real potential, and that's kind of huge. The big thing with precision is that as you cut precision in half you also roughly halve the VRAM those tensors need, so for something like a 600B model that's an enormous chunk of memory saved on the weights alone.

Okay, the conclusion. We now know how to train, and how things like Llama 405B and DeepSeek-V3 were trained efficiently across many GPUs. Once again they have their cheat sheets, and if you want cheat sheets it's good to go over them again. There's also a good bit on FP8 versus INT8 and how they're not the same thing, so quantization is worth digging into a bit more, plus Llama 3's 4D parallelism, and a lot of papers and references. Overall it's a pretty solid, interesting high-level overview of how to train. As a quick recap: fit things into memory, pick your model size, pick your GPUs, pick your batch size, and optimize your training throughput until things stop crashing. The parallelization strategies we talked about: data parallel, with everything replicated on every GPU; ZeRO-1, ZeRO-2, ZeRO-3, where you progressively shard optimizer states, gradients, and finally the weights; tensor parallel, across the hidden dimension; pipeline parallel, across model layers; context parallel, for sequence length; and expert parallel, for mixture-of-experts. It's a nice overview, and for the AI-engineer crowd, even if you don't have five hundred or a thousand GPUs, it's worth familiarizing yourself with what's going on and which parts of a model and of pre-training take how much GPU. It's not all attention: there are the feed-forward layers, the activations, the gradients we have to compute, and those actually take up more than half of your training memory. All of that is pretty interesting to know.
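Here's the mixed-precision loop shape mentioned above, as a hedged sketch on a toy model: bf16 autocast with the optimizer holding fp32 parameters; an fp16 version would add a GradScaler. Sizes are made up.

```python
# Forward pass runs in bf16 under autocast; parameters and optimizer stay fp32.
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for _ in range(10):
    x = torch.randn(16, 1024, device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(x).pow(2).mean()      # matmuls run in bf16, saving memory
    loss.backward()                         # grads flow back to the fp32 parameters
    optimizer.step()
    optimizer.zero_grad()
```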
The one thing this blog might have missed, or at least didn't do for me, is that it didn't mention that when you actually do inference and consume the model, you're not carrying most of this around. Let me find that diagram again; oh, it doesn't want to load, let me refresh, there it is. If we're looking at Llama 3 8B at a long sequence length, all these gradients taking on the order of a hundred gigs, and the optimizer states, another several tens of gigs of VRAM, we simply don't have during inference; there's no backprop to do. The attention is still some of it, but all the training-only stuff goes away, and you can quantize better. The post didn't hit on that as much. But high level, it's a good read: throw it into ChatGPT or Claude or whatever your LLM of choice is, say "I'm reading this section, break it down for me," and spend an hour on it. It's useful for the AI engineer, and understanding the difference between training and inference is always useful. A good rule of thumb: if you were hiring a research intern, would they be able to explain all of this to you, the pros and cons, the trade-offs, multi-node versus single node versus single GPU? If they could, that's great, and you should be able to do it yourself too. It's just good general knowledge, so pretend you're in an interview and be able to walk through it. That's the high-level overview of this big pre-training blog. Questions, thoughts, comments? We have five minutes; I'm sure I skipped a lot and people have plenty to add.

Out of curiosity, what precision are state-of-the-art models trained in? Mostly higher precision for the gradients, with mixed precision elsewhere; recently that's being cut down further, and I think DeepSeek-V3 said it was trained largely in FP8, which is interesting, but there's still a lot of high precision in there. Other thoughts, comments, questions? Who wants to train a 70B from scratch knowing all this, are we prepared to blow thousands of dollars an hour? I actually think it's a good experiment to run. Realistically you can rent H100s at around two dollars an hour, so rent one H100 and train a small model, rent two H100s without good interconnect and try some of this stuff, it's very approachable with common libraries, then rent a node, then two nodes; realistically there are only a handful of parallelization strategies to try. A precursor to this blog from Hugging Face is FineWeb, which covers something like five to fifteen trillion pre-training tokens. So go learn, try all this stuff, see what's different, and don't burn too much money: at two dollars an hour per H100, even eight H100s is affordable.

Any papers that can help with doing this for non-English languages? The presenter from last time works on a state-of-the-art multilingual model, not German, some multilingual model, so come to paper club; someone here is apparently an expert in non-English models. Cheapest H100 provider on the internet? Honestly, crypto money: the crypto companies are raising a lot and subsidizing GPUs to be very, very cheap. DeepInfra has them for around two dollars, Modal will give you money, and a lot of people will sponsor compute too. Hyperbolic is a good one, shout out Hyperbolic, I think it's also pretty cheap,
right, like two, three, four dollars an hour. SF Compute is not serverless, I believe; SF Compute is more "bigger cluster for a longer block of time." I like those guys, but as RunPod and others add on-demand 32 or 64 H100s, I don't know if we need it. But yeah, cool. What other questions? The questions are all on where to get GPUs: pay money, it's not that expensive, it's very subsidized right now, to the point where it's arguably cheaper than the underlying compute. People don't like Hyperbolic apparently, so never mind, no Hyperbolic, we don't like them I guess, even though they're slightly cheaper than the others.

Okay, volunteers for next week: who wants to cover a paper? There's a cool post-training survey. Also QwQ-32B came out; I don't know if the thinking model keeps the same name. Is the paper good? I haven't looked, and I don't think there is a paper; if one comes out, that would be cool, but right now it's no paper at all, just a short blog post, around 700 words, with a very sick title, "Embracing the Power of RL," that has nothing but code on how to use it. On LLM post-training, there's a survey paper with a very cool chart; I think it's a good one if someone wants to cover it, and I'll volunteer it to someone if they're down. Otherwise, if there are no more questions... actually this one is pretty short, about 20 pages on post-training, very useful. "Graph of thoughts? Wait, these do cover MCTS, why not MCTS, MCTS is great." Dude, do you know how much of a time sink that was, how much time was wasted on MCTS? Who remembers Strawberry, who remembers MultiOn saying they are Q*? Strawberry and MCTS were kind of a waste of time, but it's good background: MCTS is basically tree search over potential next tokens, can we fan out and do inference-time scaling, test-time compute, instead of RL. A bunch of people spent a bunch of time on this, and some companies claimed they had Q* by doing it, but yeah. Another paper someone recommended, and we don't know if it's credible: "The FFT Strikes Back: An Efficient Alternative to Self-Attention," from one guy at USC. Is he plugging his own paper? I don't know. Alternatives to attention are always interesting, though. Okay, that's paper club this week, guys. Super nicely done, thank you so much, you covered two days of reading in one hour. We always need to cover things quickly in general. Yeah, I feel like there need to be more good papers; paper velocity has slowed somehow. I mean, this isn't even a paper on anything of their own, it's just... okay, this counts, this counts, it's a very good survey. Okay, well, that's paper club, guys. All right, take care, thank you.