
Accelerating Mixture of Experts Training With Rail-Optimized InfiniBand Networking in Crusoe Cloud


Whisper Transcript

00:00:00.000 | Thank you for coming to our session today. A lot of really interesting things are happening in the AI world these days with all the recent model and GPU developments, and it is really cool to see all the use cases. Here, however, I want to talk a little bit about the infrastructure: how we can support the newest GPUs and the newest machine learning models, and how we can make sure everything runs smoothly, fast, and productively. My name is Evgeny Bakulinko, I'm a product manager at Crusoe, and my main responsibility is the infrastructure, specifically the GPU networking infrastructure. We are always looking for ways to increase the performance of that network because, as we will see later in the presentation, it really matters. Now, a little bit about Crusoe.
00:01:18.000 | Crusoe is an AI cloud platform with a mission I think is very important for all of us: to align the future of computing with the future of the climate. There is very strong demand for computing power right now, GPUs are really energy hungry, and a lot of investment is going into data centers, which of course puts additional pressure on the grid and on energy sources. What we are trying to do at Crusoe is utilize stranded energy sources, wasted energy sources, and renewables to power our data centers. We want to make sure that every time you train your model, and every time you use a GPU for inference, you are not causing any negative impact on the climate.
00:02:06.180 | Whenever we build the cloud, the AI cloud, we build it on three important pillars. First, there is high performance. Customers are buying our services, procuring GPU time, and training their models, so we have to ensure that all of the infrastructure is optimized for that training. Every time it is not optimized, every time there is a delay, a glitch, an outage, or simply mediocre performance, there is a direct impact on the customer's bottom line: it increases the time to train and raises the cost to train the model. The second pillar, which I think is very important for everybody here, is ease of use.
00:02:55.140 | We really want to separate ourselves from the general-purpose clouds. We know the hyperscalers are building great infrastructure and are trying to support every use case customers might have for cloud computing. In our case, however, we really want to focus on the experience of AI engineers, so we provide a simpler user interface that allows developers to spin up compute resources, deploy models, train them, use them for inference, and so on. All of the underlying infrastructure complexity is hidden by us, and I believe it is our job to make sure it stays that way.
00:03:43.060 | And, as I mentioned before, we are climate aligned, which means that as a company we are really aiming to power 100 percent of our data centers with renewable, wasted, or stranded energy sources, to ensure that we are net zero from a carbon emissions perspective. We have a big story around that; feel free to check it out on our website or come over to our booth on the show floor, and the team will be happy to talk about it.
00:04:19.060 | Now, where are we present right now? We have a number of data centers located across the US. As you can see, three of them are in the continental United States, and they are generally located close to the energy sources I mentioned before: one in Texas, one in the north-central part of the country, and one in the east. We are also building a big data center in Iceland right now that will be powered by geothermal energy, which is again an amazing way to use constrained energy resources or renewables to power a data center. We try to follow that model, which is why we place our data centers strategically. The placement of the Iceland data center will also be important for our EMEA customers, given the latency and general connectivity to Europe; I think that might be helpful for them as well.
00:05:29.060 | Now, what is our platform? I say Crusoe Cloud, but whenever we talk about any cloud we are really talking about three general types of products. First and foremost, there is compute: we offer VMs with GPUs attached to them, so whenever a customer wants access to GPUs, they can get it through a VM. They can get a group of VMs connected together and use them as one single training cluster. We also offer CPU instances for any data pre-processing or general-purpose compute tasks you might have: data preparation, offload, whatever you need.
00:06:09.700 | From the storage perspective, we offer ephemeral and persistent disks on the node, delivered from the NVMe drives on the local server where your VMs are placed. We also have a persistent block storage solution available for our customers, and we are working on delivering managed network file systems as well.
00:06:38.100 | On the networking side there is, of course, the more traditional VPC networking. That is the network we sometimes call the front-end network; it is used to deliver customer traffic from the internet, or from the customer's own environment wherever their data sources live, towards the VMs. That is your main connectivity path to the outside world. We offer a number of additional services on top of that, not simply connectivity: we also have firewalls, and we will be offering load balancers soon, but generally we try to follow the more traditional path for VPC networking and the requirements customers usually have there.
00:07:25.860 | What is more interesting, and what we will talk about in greater detail a little later today, is our rail-optimized InfiniBand cluster networking. For those of you who don't know, GPU providers typically separate their networks: they have a front-end network used for general-purpose traffic, while all the communication between the GPUs happens on a standalone, separate network that is high performance, low latency, and high bandwidth, with the whole topology optimized for GPU-to-GPU communication.
00:07:59.300 | Now, last but not least, the user experience. As I mentioned before, our main customers, our main persona, the people who use Crusoe Cloud, are AI developers and machine learning engineers, so we want to make sure they have what they need in order to be successful without having to think too much about infrastructure. We offer a CLI, APIs, and a GUI, so everything can be automated, consumed, and configured in whatever way you prefer. We already have a lot of customers, and it was fun for me to see on the show floor that some of them are here, talking about their own solutions.
00:08:46.020 | This is probably the first time in my life that I'm attending a conference and standing at a booth without having to compete with all the people around us. We see the companies presenting their solutions here as our partners, and we already partner with a number of them. We have Together AI here, we have Boson AI, and all of them use our infrastructure for different purposes. Together AI, for example, uses Crusoe infrastructure for ML training, for fine-tuning their models, and sometimes for inference; C.ai uses our compute infrastructure to train new foundational models. This is really great: if you are a customer of Together AI, for example, or Codeium, or others, it is likely that you have been exposed to Crusoe infrastructure in some way.
00:09:44.660 | Now, distributed training has a very specific set of problems. There is the compute part, where the computation is done on the GPUs; but since we are talking about distributed training, there are a lot of GPUs, and at certain stages, whenever a training step is completed, all the GPUs have to exchange the information, the data each of them computed on its own. This is typically done through the all-reduce and all-gather collective operations. A step contains a forward pass and a backward pass, but then, without any optimization, the networking part takes about 25 to 30 percent of the training time. That is time during which your GPUs sit idle: they cannot compute anything because they have to wait for all the information to be gathered together.
00:10:59.540 | That is a bad thing for everybody. It is bad for the customers, because they still pay for that infrastructure while they wait and it delays the model training; but it is also bad for us, because we have infrastructure that is not performing well enough.
00:11:17.460 | There are a couple of tricks we can apply. First of all, computation-communication overlap allows you to start the network exchange, the data exchange, while the computation is still ongoing. But even with that, when we were working with customers we saw a reduction of only about 10 percent, so about 25 percent of the training time was still spent on the network.
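As a rough illustration of computation-communication overlap, here is a sketch that issues an asynchronous all-reduce for one gradient bucket while the GPU keeps working on the rest of the backward pass. In real training, frameworks such as PyTorch DistributedDataParallel do this bucketing and overlap automatically; the bucket list and the placeholder compute below are made-up stand-ins.

```python
# Sketch of overlapping communication with computation using async collectives.
# Assumes the process group was initialized as in the previous sketch.
import torch
import torch.distributed as dist

def reduce_with_overlap(grad_buckets):
    """grad_buckets: gradient tensors for consecutive layers, last layer first."""
    handles = []
    for i, grad in enumerate(grad_buckets):
        # Start reducing this bucket without blocking; NCCL runs it on its own stream.
        handles.append(dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True))
        # Meanwhile the GPU keeps computing gradients for the earlier layers
        # (represented here by a placeholder kernel on the next bucket).
        if i + 1 < len(grad_buckets):
            grad_buckets[i + 1].mul_(1.0)
    # Block on the network only after all compute work has been issued.
    for handle in handles:
        handle.wait()
```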
00:11:41.700 | As the product manager on the infrastructure side, I am constantly asked: how can we reduce that? How can we use the network as efficiently as possible and close that gap? So we have been looking into it, trying to figure out the right cluster networking topology and how to make sure that the data fabric used to connect the GPUs is fully optimized and able to provide the bandwidth and the latency that are needed.
00:12:16.660 | The standard fat tree, for those of you who have worked in data center infrastructure before, is something we have traditionally been building for years, and it is a great way to build a scalable, possibly non-blocking fabric. But there are some issues with it. First of all, if we connect the servers shown here to a single leaf, that introduces a single choke point as well as a single fault domain: if we lose the leaf, we lose all the GPUs connected to it.
00:12:55.140 | What else we were thinking about is this: we have that switch that can be used for back-end traffic propagation, so why not use it from the bandwidth perspective and get an additional path? Let me use a simple two-node example to explain the topology and how we use it. First of all, whenever GPUs want to communicate within one server, they can use the embedded NVLink and NVSwitch, which provides good communication; they don't have to go out to the external fabric at all.
00:13:40.180 | Now, whenever we have data communication between GPUs on different nodes: if the GPUs are connected to the same leaf, that is what we call a single rail. It means the traffic passes through that one leaf, just one hop away, and you reach the destination. What is interesting is that when we want to talk across different rails, we have to go all the way up to the spine, which introduces an additional hop. Besides the bandwidth saturation problems, that can add latency, which really matters for your all-reduce operations.
00:14:22.100 | But luckily for us, NVIDIA, in a recent version of NCCL, introduced a feature called PXN, which allows you to use the internal NVSwitch inside the host to communicate across rails. Whenever we want GPU 0 to talk to GPU 8 on another host, we can use the internal switch to hop the traffic between GPUs inside the node and then send it out to the leaf that the destination rail is connected to. That still lets us use one single network hop while reaching GPUs on different rails.
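If you want to compare runs with and without PXN yourself, NCCL exposes it through environment variables. Here is a minimal sketch of an A/B toggle from a training script; NCCL_PXN_DISABLE and NCCL_DEBUG are documented NCCL variables in recent releases (2.12 and later), but you should verify the names and defaults against the documentation for the NCCL build you actually run.

```python
# Sketch: toggling NCCL's PXN path for an A/B comparison.
# NCCL reads these variables when the communicator is created, so set them
# before the first collective operation.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_PXN_DISABLE", "0")  # "0" leaves PXN enabled, "1" disables it
os.environ.setdefault("NCCL_DEBUG", "INFO")     # log which transport/path NCCL chose

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

x = torch.ones(1 << 20, device="cuda")
dist.all_reduce(x)  # inspect the NCCL_DEBUG output to confirm the path taken
dist.destroy_process_group()
```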
00:14:58.660 | We ran some NCCL tests and saw quite a significant improvement: around 50 percent for the small messages and around 50 percent for the large messages. For the smaller messages those numbers are about latency; for the larger ones we care more about bandwidth, because latency tends to stay roughly the same. Those numbers are great, and everybody would love them, but not the customers, and that does make sense: those numbers are synthetic, and they mostly show how the network behaves under a generated workload.
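The numbers above come from collective benchmarks in the spirit of nccl-tests. As a rough, hedged stand-in, here is a Python sweep over message sizes that reports all-reduce latency and bus bandwidth; the size range and iteration counts are arbitrary, and the bus-bandwidth formula, 2*(n-1)/n times size over time, follows the nccl-tests convention.

```python
# Rough stand-in for an all-reduce benchmark: sweep message sizes, time the
# collective, and report latency plus bus bandwidth (nccl-tests convention).
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
world = dist.get_world_size()

for size_bytes in [1 << p for p in range(10, 31, 4)]:  # 1 KB up to 1 GB
    x = torch.empty(size_bytes // 4, dtype=torch.float32, device="cuda")
    for _ in range(5):                                  # warm-up iterations
        dist.all_reduce(x)
    torch.cuda.synchronize()
    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    avg_s = (time.perf_counter() - start) / iters
    bus_gbps = (size_bytes / avg_s) * 2 * (world - 1) / world / 1e9
    if dist.get_rank() == 0:
        print(f"{size_bytes:>12} B  {avg_s * 1e6:10.1f} us  {bus_gbps:8.2f} GB/s")

dist.destroy_process_group()
```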
00:15:35.140 | What customers care about is the time to train their particular model, so we used a sparse mixture of experts as the example. I'm not going to dive into the details of how it works, but essentially a sparse mixture of experts gives you layers of experts and a routing network that directs traffic between them. Whenever you deploy that on a really large GPU cluster, it creates a ton of traffic: all the GPUs have to send data to each other, they have to exchange information, and the load on the network is pretty significant. So we took the Mixtral model, the open-source sparse mixture of experts made up of eight feed-forward expert blocks of seven billion parameters each, and we fine-tuned it on 240 H100 GPUs.
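To give a feel for why a sparse MoE layer is so network-heavy, here is a hedged sketch of expert-parallel token dispatch: each rank hosts one expert, and tokens are shuffled to their expert and back with all-to-all collectives. The random top-1 routing, the equal per-rank splits, and the shape-preserving expert are simplifications for illustration, not the actual Mixtral configuration.

```python
# Hedged sketch of expert-parallel dispatch in a sparse MoE layer: each rank
# hosts one expert, and tokens travel to it (and back) via all-to-all, which is
# what generates the heavy GPU-to-GPU traffic described above.
import torch
import torch.distributed as dist

def moe_layer(tokens: torch.Tensor, expert: torch.nn.Module) -> torch.Tensor:
    world = dist.get_world_size()
    # 1. Route: pick an expert per token (random here; a learned gate in practice).
    dest = torch.randint(0, world, (tokens.size(0),), device=tokens.device)
    order = torch.argsort(dest)
    sorted_tokens = tokens[order]

    # 2. Dispatch: all-to-all sends each rank's chunks to the other ranks.
    #    Equal splits are assumed, so tokens.size(0) must divide by world size;
    #    real MoE layers handle uneven expert loads with explicit split sizes.
    received = torch.empty_like(sorted_tokens)
    dist.all_to_all_single(received, sorted_tokens)

    # 3. Compute: the local expert processes whatever tokens arrived here.
    processed = expert(received)

    # 4. Combine: a second all-to-all returns results to the originating ranks.
    combined = torch.empty_like(processed)
    dist.all_to_all_single(combined, processed)

    # Undo the routing sort so tokens come back in their original order.
    return combined[torch.argsort(order)]
```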
00:16:37.780 | We saw quite a significant improvement when we compared runs with PXN enabled and without it: a 14 percent improvement, which translates directly into the time to train the model and directly into the cost of training the model. That is something everybody got really excited about; I definitely did, and our customers did as well, because it shows them real value they can get with a model. That was it from my side. Sorry for going through it so fast; it's a very large topic and it's hard to cover, but I'm happy to answer any additional questions you might have.
00:17:40.260 | We'll see you next time.