Accelerating Mixture of Experts Training With Rail-Optimized InfiniBand Networking in Crusoe Cloud

Thank you for coming to our session today. A lot of really interesting things are happening in the AI world these days, with all the recent model and GPU developments, and it is really cool to see all the use cases. Here, though, I want to talk a little bit about the infrastructure: how we can support the newest GPUs and the newest machine learning models, and how we can help make sure everything runs smoothly, fast, and productively.

My name is Evgeny Bakulinko, and I'm a product manager at Crusoe. My main responsibility is infrastructure, specifically the GPU networking infrastructure, and we are always looking for ways to increase the performance of that network because, as we will see later in the presentation, it really matters.

Now, a little bit about Crusoe.
Crusoe is an AI cloud platform with one mission I think is very important for all of us: to align the future of computing with the future of climate. There is really strong demand for computing power right now, GPUs are really energy hungry, and a lot of investment is going into data centers, which of course puts additional pressure on the grid and on energy sources. What we are trying to do at Crusoe is utilize stranded energy sources, wasted energy sources, and renewables to power our data centers. We want to make sure that every time you train a model or use a GPU for inference, you are not causing any negative impact on the climate.
Now, whenever we build an AI cloud, we build it on three important pillars.

First, high performance. Customers are buying our services, procuring GPU time, and training their models, so we have to ensure the entire infrastructure is optimized for that training. Every time it is not optimized, every time there is a delay, a glitch, an outage, or simply mediocre performance, it directly impacts the customer's bottom line: it directly impacts the time to train and raises the cost to train the model.

The second pillar, which I think is very important for everybody here, is ease of use.
We really want to separate ourselves from the general-purpose clouds. We know the hyperscalers are building great infrastructure and trying to support every use case their customers might have for cloud computing. In our case, however, we want to focus on the experience of AI engineers, so we provide a simpler user interface that lets developers spin up compute resources, deploy models, train them, use them for inference, and so on. All the underlying complexity of the infrastructure is hidden by us, and I believe it is our job to make sure it stays that way.
And, as I mentioned before, we are climate aligned, which means that as a company we are aiming to power 100 percent of our data centers with renewable, wasted, or otherwise stranded energy sources, to ensure that we are net zero from a carbon emissions perspective. We have a big story around that; feel free to check it on our website or come over to our booth on the show floor, and the team will be happy to talk about it.

Now, where are we present? We have a number of data centers located across the US.
As you can see, three of them are in the continental United States, and they are generally located close to the energy sources I mentioned before: one in Texas, one in the north-central part of the country, and one in the east. We are also currently building one big data center in Iceland that will be powered by geothermal energy, which again is an amazing way to use constrained energy resources or renewables to power a data center. We are trying to follow that model, and that is why we place our data centers strategically. The Iceland data center will also be very important for our EMEA customers, given the latency and the general connectivity to Europe; I think that might be helpful for them as well.
Now, what is our platform? I say Crusoe Cloud, but whenever we talk about any cloud, we are generally talking about three types of products.

First and foremost, compute. We offer VMs with GPUs attached to them, so every time a customer wants access to GPUs, they can get it through a VM; they can get a bunch of VMs connected together and use them as one single training cluster. We also offer CPU instances for data pre-processing or any general-purpose compute tasks you might have: data preparation, offload, whatever you need.
From the storage perspective, we offer ephemeral and persistent disks on the node, delivered from the NVMe drives on the local server where your VMs are placed. We also have a persistent block storage solution available, and we are working on delivering managed network file systems for our customers.
On the networking side there is, of course, the more traditional VPC networking. That is what we sometimes call the front-end network: it delivers customer traffic from the internet, or from the customer's own environment wherever their data sources live, to the VMs, so it is your main connectivity path to the outside world. We offer a number of additional services on top of it, not just plain connectivity: we have firewalls, and we will be offering load balancers soon, but generally we follow the more traditional path for VPC networking and the requirements customers usually have there.
What is more interesting, and what we will talk about in greater detail a little later today, is our rail-optimized InfiniBand cluster networking. For those of you who don't know, GPU providers typically separate their networks: there is the front-end network used for general-purpose traffic, and then all the communication between the GPUs happens on a standalone, separate network that is really high performance, low latency, and high bandwidth, with the whole topology optimized for GPU-to-GPU communication.
Last but not least, the user experience. As I mentioned before, our main customers, our main persona, the people using Crusoe Cloud, are AI developers and machine learning engineers, so we want to make sure they have what they need to be successful without thinking too much about infrastructure. We offer a CLI, APIs, and a GUI, so everything can be automated and everything can be consumed and configured in whichever way you like best.
We already have a lot of customers, and it was a lot of fun for me to see that some of them are here on the floor presenting their solutions live. Whenever I attend a conference and stand at the booth, I don't have to compete with the people around us; we see the companies presenting their solutions right now as our partners. We already partner with a bunch of them: Together AI is here, Boson AI is here, and all of them use our infrastructure for different purposes. Together AI, for example, uses Crusoe infrastructure for ML training, for fine-tuning their models, and sometimes for inference; others use our compute infrastructure to train new foundation models. This is really great: if you are a customer of Together AI, for example, or Codeium, or others, it is likely that you have already been exposed to Crusoe infrastructure in some way.

Now, distributed training has a very specific set of problems.
There is the compute part, where the computation is done on the GPUs. But since we are talking about distributed training, there are a lot of GPUs, and at certain stages, whenever a training step is completed, all the GPUs have to exchange the information they computed on their own. This is typically done through collective operations such as all-reduce or all-gather; the step contains a forward pass and a backward pass, and then, without any optimization, the networking part takes roughly 25 to 30 percent of the training time. That is time when your GPUs sit idle: they cannot compute anything because they have to wait for all the information to be gathered together. That is bad for everybody. It is bad for the customers, because they are still paying for the infrastructure, they still have to wait, and it delays the model training. But it is also bad for us, because our infrastructure is not performing well enough.
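To make that exchange concrete, here is a minimal PyTorch sketch (an illustration, not code from the talk) of the gradient all-reduce that data-parallel training performs after every backward pass; frameworks normally issue these calls for you, but this is the traffic being described:

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Average local gradients across every rank after loss.backward()."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this gradient across all GPUs over the cluster fabric...
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # ...then divide so every rank applies the same averaged update.
            param.grad /= world_size
```

The sketch assumes dist.init_process_group("nccl") has already been called under a launcher such as torchrun; until every rank's all_reduce finishes, the GPUs have nothing left to compute, which is exactly the idle time described above.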
There are a couple of tricks we can apply. First of all, computation/communication overlap allows you to start the network exchange while the computation is still ongoing. But even with that, when we were working with customers we saw a reduction of only about 10 percent, and about 25 percent of the training time was still being spent on the network.
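As a generic example of how that overlap shows up in practice (a PyTorch sketch with assumed layer sizes, not necessarily what Crusoe's customers run), DistributedDataParallel groups gradients into buckets and launches an asynchronous NCCL all-reduce for each bucket as soon as it is ready, while the backward pass keeps computing the remaining layers:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# launch with: torchrun --nproc_per_node=<gpus> overlap_example.py
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
).cuda()
# bucket_cap_mb controls how much gradient data is grouped per all-reduce,
# i.e. the granularity of the compute/communication overlap.
ddp_model = DDP(model, bucket_cap_mb=25)

opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)
x = torch.randn(64, 4096, device="cuda")
loss = ddp_model(x).square().mean()
loss.backward()   # bucket all-reduces run concurrently with this backward pass
opt.step()
```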
As a product manager on the infrastructure side, I am constantly being asked how we can reduce that, how we can use the network as efficiently as possible and close that gap. So we have been looking into it, trying to figure out the right cluster networking topology and how to make sure the data fabric connecting the GPUs is fully optimized and able to provide the bandwidth and latency needed. The standard fat tree, for those of you who have worked in data center infrastructure before, is what we have traditionally built for years, and it is a great way to build a scalable, maybe even non-blocking, fabric. But there are a couple of issues with it. First of all, if we connect the servers shown below to a single leaf switch, that introduces a single choke point as well as a single fault domain: if we lose that leaf, we lose all the GPUs connected to it.
What else were we thinking about? We have those other switches that can carry back-end traffic, so why not use them from a bandwidth perspective and gain additional paths? Let me use a simple two-node example to explain the topology and how we use it. First of all, whenever GPUs want to communicate within one server, they can use the embedded NVLink and NVSwitch; that provides good communication, and they never have to go out to the external fabric at all. Now, for data communication between GPUs on different nodes: if those GPUs are connected to the same leaf, that is what we call a single rail, and the traffic passes through that one leaf, just one hop away, to reach its destination. What is interesting is that when we want to talk across different rails, we have to go all the way up to the spine, and that introduces an additional hop; besides the bandwidth saturation problems, it can add latency, which really matters for your all-reduce operations.
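A toy model of that topology (an illustration with assumed hop counts, not Crusoe's actual fabric definition): NIC/GPU index i on every node is cabled to leaf switch i, so same-index GPUs share a rail, while cross-rail traffic classically has to climb to the spine:

```python
def switch_hops(src_gpu: int, dst_gpu: int, same_node: bool) -> int:
    """Rough count of InfiniBand switches traversed in a rail-optimized fabric."""
    if same_node:
        return 0          # traffic stays on NVLink/NVSwitch inside the server
    if src_gpu == dst_gpu:
        return 1          # same rail: one shared leaf switch between the nodes
    return 3              # cross-rail: source leaf -> spine -> destination leaf

# GPU 0 on node A talking to GPU 0 on node B stays on one rail:
print(switch_hops(0, 0, same_node=False))   # 1
# GPU 0 on node A talking to GPU 3 on node B has to cross the spine:
print(switch_hops(0, 3, same_node=False))   # 3
```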
Luckily for us, NVIDIA, in a recent version of NCCL, introduced a feature called PXN, which allows you to use the internal NVSwitch inside the host to communicate across rails. Whenever we want GPU 0 to talk to GPU 8 on another host, we can hop the traffic over the internal NVSwitch to the GPU whose NIC sits on the destination rail and then send it out to the leaf that NIC is connected to. That still lets us use a single leaf hop while reaching GPUs on different rails.
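For reference, these are the NCCL environment variables that control this behavior (a sketch based on the public NCCL documentation, not configuration taken from the talk; check the variable names and defaults against the docs for your NCCL version, since PXN is already on by default in recent releases when the topology supports it):

```python
import os

# Set before torch.distributed.init_process_group("nccl"), i.e. before the
# NCCL communicator is created, so the settings are picked up.
os.environ["NCCL_PXN_DISABLE"] = "0"        # "1" would switch PXN off entirely
os.environ["NCCL_P2P_PXN_LEVEL"] = "2"      # how aggressively to route p2p traffic via PXN
os.environ["NCCL_DEBUG"] = "INFO"           # log which paths/algorithms NCCL selects
os.environ["NCCL_DEBUG_SUBSYS"] = "GRAPH"   # include topology/graph search details
```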
We ran some NCCL tests and saw quite a significant improvement: about 50 percent for small messages and about 50 percent for large messages. For the smaller messages those numbers are mostly about latency; for the larger ones we care more about bandwidth, because latency tends to stay roughly the same. Those numbers are great, and everybody would love them, but not the customers, and that makes sense: those numbers are synthetic, and they mostly reflect an artificial workload applied to your network.
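Those measurements come from NCCL-level benchmarking; as a rough stand-in for tools such as nccl-tests, the PyTorch timing sketch below (message sizes chosen arbitrarily for illustration) shows the latency-bound small-message regime and the bandwidth-bound large-message regime:

```python
import time
import torch
import torch.distributed as dist

def bench_all_reduce(num_elems: int, iters: int = 20) -> float:
    """Return the average time of one all_reduce over num_elems float32 values."""
    x = torch.zeros(num_elems, dtype=torch.float32, device="cuda")
    for _ in range(5):                     # warm-up iterations
        dist.all_reduce(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

if __name__ == "__main__":
    # launch with: torchrun --nproc_per_node=<gpus> nccl_bench.py
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    n = dist.get_world_size()
    for num_elems in (1 << 10, 1 << 20, 1 << 28):       # ~4 KB, ~4 MB, ~1 GB payloads
        t = bench_all_reduce(num_elems)
        size_bytes = num_elems * 4
        busbw = size_bytes * 2 * (n - 1) / n / t / 1e9  # all-reduce "bus bandwidth"
        if dist.get_rank() == 0:
            print(f"{size_bytes:>12} B  {t * 1e6:9.1f} us  {busbw:6.1f} GB/s")
```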
What customers care about is the time it takes to train their particular model. So we used a sparse mixture of experts as an example. I am not going to dive into the details of how it works, but essentially a sparse mixture of experts gives you different layers of experts and a network that routes traffic between them. When you deploy that on a really large GPU cluster, it creates a ton of traffic: all the GPUs have to send data to each other and exchange information, so the load on the network is quite significant. We took an open-source sparse mixture-of-experts model consisting of eight feed-forward expert blocks of seven billion parameters each, and we fine-tuned it on 240 H100 GPUs.
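For readers unfamiliar with the architecture, below is a minimal sparse mixture-of-experts layer in PyTorch (an illustration of the general idea with made-up dimensions, not the open-source model used in the benchmark): a router picks the top-k of eight feed-forward experts for each token, and once the experts are sharded across GPUs that dispatch becomes all-to-all traffic over the fabric:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy sparse mixture-of-experts layer: route each token to its top-k experts."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)   # learned gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: [tokens, d_model]
        scores = F.softmax(self.router(x), dim=-1)           # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)       # top-k experts per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            tok, slot = (idx == e).nonzero(as_tuple=True)    # tokens routed to expert e
            if tok.numel():
                out[tok] += weights[tok, slot].unsqueeze(-1) * expert(x[tok])
        return out

# Example: 16 tokens of width 512, each dispatched to 2 of the 8 experts.
moe = SparseMoE()
print(moe(torch.randn(16, 512)).shape)   # torch.Size([16, 512])
```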
We saw quite a significant improvement with PXN enabled versus without it: a 14 percent improvement, which translates directly into the time to train the model and therefore directly into the cost to train the model. That is something everybody got really excited about, me included, and our customers as well, because it shows them real value they can get with a model they care about.

That was it from my side. Sorry for going through it so fast; it is a very large topic and it is hard to cover, but I am happy to answer any additional questions you might have.