Thank you for coming to our session today. A lot of really interesting things are happening in the AI world these days, with all the recent model and GPU developments, and it is really cool to see all the use cases. Here, however, I want to talk a little bit about the infrastructure: how we can support the newest GPUs and the newest machine learning models, and how we can help make sure everything runs smoothly, fast, and productively.

My name is Evgeny Bakulinko, and I'm a product manager at Crusoe. My main responsibility is infrastructure, specifically GPU networking infrastructure, and we are always looking for ways to increase the performance of that network because, as we will see later in the presentation, that is really important.

Now, a little bit about Crusoe. Crusoe is an AI cloud platform with one mission that I think is very important for all of us: to align the future of computing with the future of the climate. There is really strong demand for computing power right now, GPUs are really energy hungry, and a lot of investment is going into data centers, which of course puts additional pressure on the grid and on energy sources. What we are trying to do at Crusoe is utilize stranded energy sources, wasted energy, and renewables to power our data centers. We want to make sure that every time you train your model, and every time you use a GPU for inference, you are not causing any negative impact on the climate.

Whenever we build an AI cloud, we build it on three important pillars.

First of all, there is high performance. Customers buy our services, procure GPU time, and train their models, so we have to ensure that all of the infrastructure is optimized for that training. Every time it is not optimized, every time there is a delay, a glitch, any sort of outage, or simply less-than-great performance, there is a direct impact on the customer's bottom line: it directly affects the time to train and raises the cost to train the model.

The second pillar, which I think is very important for everybody here, is ease of use. We really want to differentiate ourselves from the general-purpose clouds. The hyperscalers are building great infrastructure and trying to support each and every use case customers might have for cloud computing. In our case, however, we really want to focus on the experience of AI engineers, so we want to provide a simpler user interface that allows developers to spin up compute resources, deploy models, train them, use them for inference, and so on. All the underlying complexity of the infrastructure is hidden by us, and I believe it is our job to make sure that stays the case.

And, as I mentioned before, we are climate aligned, which means that as a company we aim to power 100 percent of our data centers with renewables, wasted energy, and other forms of stranded energy, so that we are net zero from a carbon emissions perspective. We have a big story around that; feel free to check it out on our website or come over to our booth on the show floor, and the team will be happy to talk about it.
Now, where are we present? We have a number of data centers located across the U.S. As you can see, three of them are in the continental United States, and they are generally located close to the energy sources I mentioned before: one in Texas, one in the north-central part of the country, and one in the east. We are also building a big data center in Iceland right now that will be powered by geothermal energy, which is again an amazing way to use stranded energy resources or renewables to power a data center. We are trying to follow that model, which is why we place our data centers strategically. The placement of the data center in Iceland will also be important for our EMEA customers, given the latency and the general connectivity to Europe; I think that might be helpful for them as well.

Now, what is our platform? I say Crusoe Cloud, but whenever we talk about any cloud, we are generally talking about three types of products.

First and foremost, there is compute. We offer VMs with GPUs attached to them, so whenever a customer wants access to GPUs, they can get it through a VM; they can get a bunch of VMs connected together and use them as one single training cluster. We also offer CPU instances for data pre-processing or any general-purpose compute tasks you might have, for data preparation, for offload, whatever you need.

From the storage perspective, we offer ephemeral and persistent disks on the node, delivered from the NVMe drives on the local server where your VMs are placed. We also have a persistent block storage solution available, and we are working on delivering managed network file systems for our customers.

On the networking side, there is of course the more traditional VPC networking. That is what we sometimes call the front-end network; it is used to deliver customer traffic from the internet, or from the customer environment wherever the customer's data sources are, to the VM, so that is your main connectivity path to the outside world. We offer a number of additional services there; it is not simply connectivity. We also have firewalls, and we will be offering load balancers soon, but generally we are trying to follow the more traditional path for VPC networking and the requirements customers usually have there.

What is more interesting, and what we will talk about in greater detail a little later, is our rail-optimized InfiniBand cluster networking. For those of you who don't know, GPU cloud providers typically separate their networks: they have the front-end network, which is used for general-purpose traffic, and then all the communication between the GPUs happens on a standalone, separate network that is really high performance, low latency, and high bandwidth, with the whole topology optimized for GPU-to-GPU communication.
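To make that front-end/back-end split a bit more concrete, here is a minimal sketch (not Crusoe-specific) of how a PyTorch training job is typically pointed at the dedicated InfiniBand fabric for collectives while bootstrap traffic stays on the front-end interface, using standard NCCL environment variables. The interface and HCA names (eth0, mlx5_*) are assumptions and will differ per deployment.

# Minimal sketch, assuming a torchrun launch: keep NCCL collective traffic on the
# dedicated back-end InfiniBand fabric while bootstrap/control traffic stays on
# the front-end (VPC) interface. Interface/HCA names below are assumptions.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")                  # assumed front-end NIC
os.environ.setdefault("NCCL_IB_HCA", "mlx5_0,mlx5_1,mlx5_2,mlx5_3")  # assumed back-end HCAs

# torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Simple sanity check that the collective path works across nodes.
x = torch.ones(1, device="cuda")
dist.all_reduce(x)
print(f"rank {dist.get_rank()}: all-reduce result = {x.item()} (expect {dist.get_world_size()})")
dist.destroy_process_group()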
Last but not least, the user experience. As I mentioned before, our main customers, our main persona, the people who use Crusoe Cloud, are AI developers and machine learning engineers, so we want to make sure they have what they need to be successful without having to think too much about infrastructure. We offer a CLI, APIs, and GUIs, so everything can be automated, and everything can be consumed and configured in whatever way you like.

We already have a lot of customers, and it was very fun for me to see that some of them are here on the floor talking about their solutions. This is probably the first time in my life that I am attending a conference, standing at a booth, and not competing with the people around us; we see the companies presenting their solutions here as our partners. We already partner with a number of them: Together AI is here, Boson AI is here, and all of them are using our infrastructure for different purposes. Together AI, for example, uses the Crusoe infrastructure for ML training, for fine-tuning their models, and sometimes for inference. C.ai is using our compute infrastructure to train new foundation models. This is really great: if you are a customer of Together AI, for example, or Codium, or others, it is likely that you have somehow been exposed to the Crusoe infrastructure.

Now, distributed training comes with a very specific set of problems. There is the compute part, where the computation is done on the GPUs, but since we are talking about distributed training, there are a lot of GPUs, and at certain stages, whenever a training step completes, all of the GPUs have to exchange the data they computed on their own. This is typically done through all-reduce or all-gather collectives; training has a forward pass and a backward pass, and then the networking part, without any optimization, takes about 25 to 30 percent of the training time. This is time during which your GPUs sit idle: they cannot compute anything because they have to wait for all the information to be gathered. That is bad for everybody. It is bad for the customers, because they still pay for that infrastructure, they still have to wait, and it delays the model training; but it is also bad for us, because our infrastructure is not performing well enough.

There are a couple of tricks we can apply. First of all, computation and communication overlap allows you to start the network exchange while the computation is still ongoing. But even with that, when we were working with customers, we saw a reduction of only about 10 percent, so about 25 percent of the training time was still spent on the network.
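As a rough illustration of what "starting the network exchange while computation is still ongoing" means, here is a minimal sketch using PyTorch's asynchronous all-reduce. It assumes the NCCL process group has already been initialized (as in the earlier sketch); in practice, frameworks like DDP and FSDP implement this overlap for you by reducing gradient buckets during the backward pass.

# Minimal sketch of computation/communication overlap with an async all-reduce.
import torch
import torch.distributed as dist

def overlapped_step(ready_grads: torch.Tensor, compute_next_chunk):
    # Kick off the all-reduce of gradients that are already computed...
    handle = dist.all_reduce(ready_grads, op=dist.ReduceOp.SUM, async_op=True)
    # ...and keep the GPU busy with the next chunk of computation while the
    # network transfer is in flight, instead of letting it sit idle.
    out = compute_next_chunk()
    # Only block once the reduced gradients are actually needed.
    handle.wait()
    ready_grads /= dist.get_world_size()
    return out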
As a product manager on the infrastructure side, I am constantly being asked: how can we reduce that? How can we utilize the network as much as possible and close that gap? So we have been looking into this, trying to figure out the right cluster networking topology: how can we make sure the data fabric connecting the GPUs is fully optimized and able to provide the bandwidth and the latency that are needed?

The standard fat tree, for those of you who have worked in data center infrastructure before, is what we have traditionally built for years, and it is a great way to build a scalable, possibly non-blocking fabric, but there are a few issues with it. First of all, if we connect the servers shown below to a single leaf, that leaf becomes a single choke point as well as a single fault domain: if we lose the leaf, we lose all the GPUs connected to it. The other thing we were thinking about is that we have other switches available for back-end traffic propagation, so why not use them from a bandwidth perspective and have additional paths?

Let me use a simple two-node example to explain the topology and how we use it. First of all, whenever GPUs want to communicate within one server, they can use the embedded NVLink and NVSwitch, which provides good communication; they never have to go out to the external fabric. Now, when GPUs on different nodes need to communicate and they are connected to the same leaf, that is what we call a single rail: the traffic passes through that one leaf, just one hop away, and you reach the destination. What is interesting is when we want to talk across different rails: we have to go all the way up to the spine, and that introduces an additional hop. Besides the bandwidth saturation problems, that can add extra latency, which really matters for your all-reduce operations.

Luckily for us, NVIDIA, in a recent version of NCCL, introduced a feature called PXN, which allows you to use the internal NVSwitch inside the host to communicate across rails. So whenever we want GPU 0 to talk to GPU 8 on another host, we can use the internal NVSwitch to hop the traffic between GPUs inside the host and then send it out through the leaf that rail is connected to. It still lets us use a single hop through the fabric while reaching GPUs on different rails.

We ran some NCCL tests and saw quite a significant improvement: around 50 percent for small messages and around 50 percent for large messages. For the smaller messages those numbers are about latency; for the larger ones we care more about bandwidth, because latency tends to stay roughly the same.
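For reference, results like these typically come from the open-source nccl-tests benchmarks, and PXN can be toggled through a standard NCCL environment variable, which makes it straightforward to compare runs with and without it. Here is a minimal sketch; the host file, rank counts, and message-size sweep are assumptions, not a tuned recommendation.

# Minimal sketch: toggle PXN via NCCL_PXN_DISABLE and run the nccl-tests
# all-reduce benchmark with and without it. Paths and rank counts are assumptions.
import os
import subprocess

def run_allreduce_bench(pxn_enabled: bool) -> None:
    env = dict(os.environ)
    # PXN is on by default in NCCL >= 2.12; setting NCCL_PXN_DISABLE=1 turns it off.
    env["NCCL_PXN_DISABLE"] = "0" if pxn_enabled else "1"
    # Typical sweep from nccl-tests (github.com/NVIDIA/nccl-tests): message sizes
    # from 8 bytes to 8 GB, doubling each step, one GPU per rank, ranks spread
    # across nodes by mpirun.
    cmd = [
        "mpirun", "-np", "32", "--hostfile", "hosts.txt",
        "./nccl-tests/build/all_reduce_perf",
        "-b", "8", "-e", "8G", "-f", "2", "-g", "1",
    ]
    subprocess.run(cmd, env=env, check=True)

run_allreduce_bench(pxn_enabled=True)
run_allreduce_bench(pxn_enabled=False)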
Those numbers are great; everybody would love them, except the customers, and that makes sense, because those numbers are synthetic and mostly show how a given workload stresses your network. What customers care about is the time to train their particular model. So we used a sparse mixture-of-experts model as an example. I am not going to dive into the details of how it works, but essentially a sparse mixture of experts gives you different layers of experts and a network that routes traffic between them. Whenever you deploy that on a really large GPU cluster, it creates a ton of traffic: all the GPUs have to send data to each other and exchange information, so the load on the network is pretty significant. We used the open-source Mixtral sparse mixture-of-experts model, which consists of eight feed-forward blocks of seven billion parameters each, and we fine-tuned it on 240 H100 GPUs. We saw quite a significant improvement with PXN enabled compared to without it. That 14 percent improvement translates directly into the time to train the model and directly into the cost to train the model, and that is something everybody got really excited about. I definitely got excited, and our customers did as well, because it shows them real value they can get with a model.

That was it from my side. Sorry for going through it so fast; it is a very large topic and hard to cover in full, but I am happy to answer any additional questions you might have. Thank you, and we'll see you next time.