Accelerating Mixture of Experts Training With Rail-Optimized InfiniBand Networking in Crusoe Cloud

Thank you for coming to our session today. A lot of really interesting things are happening in the AI world these days, with all the recent model and GPU developments, and it is really cool to see all the use cases. Here, though, I want to talk a little bit about the infrastructure: how we can support the newest GPUs and the newest machine learning models, and how we can help make sure everything runs smoothly, fast, and productively.

My name is Evgeny Bakulinko, and I'm a product manager at Crusoe. My main responsibility is infrastructure, specifically the GPU networking infrastructure, and we are always looking for ways to increase the performance of that network because, as we will see later in the presentation, it really matters.

Now, a little bit about Crusoe.
Crusoe is an AI cloud platform with one mission I think is very important for all of us: to align the future of computing with the future of climate. There is really strong demand for computing power right now, GPUs are really energy hungry, and a lot of investment is going into data centers, which of course puts additional pressure on the grid and on energy sources. What we are trying to do at Crusoe is utilize stranded energy sources, wasted energy sources, and renewables to power our data centers. We want to make sure that every time you train a model or use a GPU for inference, you are not causing any negative impact on the climate.
Now, whenever we build an AI cloud, we build it on three important pillars.

First, high performance. Customers are buying our services, procuring GPU time, and training their models, so we have to ensure the entire infrastructure is optimized for that training. Every time it is not optimized, every time there is a delay, a glitch, an outage, or simply mediocre performance, it directly impacts the customer's bottom line: it directly impacts the time to train and raises the cost to train the model.

The second pillar, which I think is very important for everybody here, is ease of use.
We really want to separate ourselves from the general-purpose clouds. We know the hyperscalers are building great infrastructure and trying to support every use case their customers might have for cloud computing. In our case, however, we want to focus on the experience of AI engineers, so we provide a simpler user interface that lets developers spin up compute resources, deploy models, train them, use them for inference, and so on. All the underlying complexity of the infrastructure is hidden by us, and I believe it is our job to make sure it stays that way.
And, as I mentioned before, we are climate aligned, which means that as a company we are aiming to power 100 percent of our data centers with renewable, wasted, or otherwise stranded energy sources, to ensure that we are net zero from a carbon emissions perspective. We have a big story around that; feel free to check it on our website or come over to our booth on the show floor, and the team will be happy to talk about it.

Now, where are we present? We have a number of data centers located across the US.
As you can see, three of them are in the continental United States, and they are generally located close to the energy sources I mentioned before: one in Texas, one in the north-central part of the country, and one in the east. We are also currently building one big data center in Iceland that will be powered by geothermal energy, which again is an amazing way to use constrained energy resources or renewables to power a data center. We are trying to follow that model, and that is why we place our data centers strategically. The Iceland data center will also be very important for our EMEA customers, given the latency and the general connectivity to Europe; I think that might be helpful for them as well.
Now, what is our platform? I say Crusoe Cloud, but whenever we talk about any cloud, we are generally talking about three types of products.

First and foremost, compute. We offer VMs with GPUs attached to them, so every time a customer wants access to GPUs, they can get it through a VM; they can get a bunch of VMs connected together and use them as one single training cluster. We also offer CPU instances for data pre-processing or any general-purpose compute tasks you might have: data preparation, offload, whatever you need.
From the storage perspective, we offer ephemeral and persistent disks on the node, delivered from the NVMe drives on the local server where your VMs are placed. We also have a persistent block storage solution available, and we are working on delivering managed network file systems for our customers.
On the networking side there is, of course, the more traditional VPC networking. That is what we sometimes call the front-end network: it delivers customer traffic from the internet, or from the customer's own environment wherever their data sources live, to the VMs, so it is your main connectivity path to the outside world. We offer a number of additional services on top of it, not just plain connectivity: we have firewalls, and we will be offering load balancers soon, but generally we follow the more traditional path for VPC networking and the requirements customers usually have there.
What is more interesting, and what we will talk about in greater detail a little later today, is our rail-optimized InfiniBand cluster networking. For those of you who don't know, GPU providers typically separate their networks: there is the front-end network used for general-purpose traffic, and then all the communication between the GPUs happens on a standalone, separate network that is really high performance, low latency, and high bandwidth, with the whole topology optimized for GPU-to-GPU communication.
Last but not least, the user experience. As I mentioned before, our main customers, our main persona, the people using Crusoe Cloud, are AI developers and machine learning engineers, so we want to make sure they have what they need to be successful without thinking too much about infrastructure. We offer a CLI, APIs, and a GUI, so everything can be automated and everything can be consumed and configured in whichever way you like best.
We already have a lot of customers, and it was a lot of fun for me to see that some of them are here on the floor presenting their solutions live. Whenever I attend a conference and stand at the booth, I don't have to compete with the people around us; we see the companies presenting their solutions right now as our partners. We already partner with a bunch of them: Together AI is here, Boson AI is here, and all of them use our infrastructure for different purposes. Together AI, for example, uses Crusoe infrastructure for ML training, for fine-tuning their models, and sometimes for inference; others use our compute infrastructure to train new foundation models. This is really great: if you are a customer of Together AI, for example, or Codeium, or others, it is likely that you have already been exposed to Crusoe infrastructure in some way.

Now, distributed training has a very specific set of problems.
There is the compute part, where the computation is done on the GPUs. But since we are talking about distributed training, there are a lot of GPUs, and at certain stages, whenever a training step is completed, all the GPUs have to exchange the information they computed on their own. This is typically done through collective operations such as all-reduce or all-gather; the step contains a forward pass and a backward pass, and then, without any optimization, the networking part takes roughly 25 to 30 percent of the training time. That is time when your GPUs sit idle: they cannot compute anything because they have to wait for all the information to be gathered together. That is bad for everybody. It is bad for the customers, because they are still paying for the infrastructure, they still have to wait, and it delays the model training. But it is also bad for us, because our infrastructure is not performing well enough.
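To make that exchange concrete, here is a minimal PyTorch sketch (an illustration, not code from the talk) of the gradient all-reduce that data-parallel training performs after every backward pass; frameworks normally issue these calls for you, but this is the traffic being described:

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Average local gradients across every rank after loss.backward()."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this gradient across all GPUs over the cluster fabric...
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # ...then divide so every rank applies the same averaged update.
            param.grad /= world_size
```

The sketch assumes dist.init_process_group("nccl") has already been called under a launcher such as torchrun; until every rank's all_reduce finishes, the GPUs have nothing left to compute, which is exactly the idle time described above.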
There are a couple of tricks we can apply. First of all, computation/communication overlap allows you to start the network exchange while the computation is still ongoing. But even with that, when we were working with customers we saw a reduction of only about 10 percent, and about 25 percent of the training time was still being spent on the network.
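As a generic example of how that overlap shows up in practice (a PyTorch sketch with assumed layer sizes, not necessarily what Crusoe's customers run), DistributedDataParallel groups gradients into buckets and launches an asynchronous NCCL all-reduce for each bucket as soon as it is ready, while the backward pass keeps computing the remaining layers:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# launch with: torchrun --nproc_per_node=<gpus> overlap_example.py
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
).cuda()
# bucket_cap_mb controls how much gradient data is grouped per all-reduce,
# i.e. the granularity of the compute/communication overlap.
ddp_model = DDP(model, bucket_cap_mb=25)

opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)
x = torch.randn(64, 4096, device="cuda")
loss = ddp_model(x).square().mean()
loss.backward()   # bucket all-reduces run concurrently with this backward pass
opt.step()
```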
As a product manager on the infrastructure side, I am constantly being asked how we can reduce that, how we can use the network as efficiently as possible and close that gap. So we have been looking into it, trying to figure out the right cluster networking topology and how to make sure the data fabric connecting the GPUs is fully optimized and able to provide the bandwidth and latency needed. The standard fat tree, for those of you who have worked in data center infrastructure before, is what we have traditionally built for years, and it is a great way to build a scalable, maybe even non-blocking, fabric. But there are a couple of issues with it. First of all, if we connect the servers shown below to a single leaf switch, that introduces a single choke point as well as a single fault domain: if we lose that leaf, we lose all the GPUs connected to it.
What else were we thinking about? We have those other switches that can carry back-end traffic, so why not use them from a bandwidth perspective and gain additional paths? Let me use a simple two-node example to explain the topology and how we use it. First of all, whenever GPUs want to communicate within one server, they can use the embedded NVLink and NVSwitch; that provides good communication, and they never have to go out to the external fabric at all. Now, for data communication between GPUs on different nodes: if those GPUs are connected to the same leaf, that is what we call a single rail, and the traffic passes through that one leaf, just one hop away, to reach its destination. What is interesting is that when we want to talk across different rails, we have to go all the way up to the spine, and that introduces an additional hop; besides the bandwidth saturation problems, it can add latency, which really matters for your all-reduce operations.
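A toy model of that topology (an illustration with assumed hop counts, not Crusoe's actual fabric definition): NIC/GPU index i on every node is cabled to leaf switch i, so same-index GPUs share a rail, while cross-rail traffic classically has to climb to the spine:

```python
def switch_hops(src_gpu: int, dst_gpu: int, same_node: bool) -> int:
    """Rough count of InfiniBand switches traversed in a rail-optimized fabric."""
    if same_node:
        return 0          # traffic stays on NVLink/NVSwitch inside the server
    if src_gpu == dst_gpu:
        return 1          # same rail: one shared leaf switch between the nodes
    return 3              # cross-rail: source leaf -> spine -> destination leaf

# GPU 0 on node A talking to GPU 0 on node B stays on one rail:
print(switch_hops(0, 0, same_node=False))   # 1
# GPU 0 on node A talking to GPU 3 on node B has to cross the spine:
print(switch_hops(0, 3, same_node=False))   # 3
```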
Luckily for us, NVIDIA, in a recent version of NCCL, introduced a feature called PXN, which allows you to use the internal NVSwitch inside the host to communicate across rails. Whenever we want GPU 0 to talk to GPU 8 on another host, we can hop the traffic over the internal NVSwitch to the GPU whose NIC sits on the destination rail and then send it out to the leaf that NIC is connected to. That still lets us use a single leaf hop while reaching GPUs on different rails.
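For reference, these are the NCCL environment variables that control this behavior (a sketch based on the public NCCL documentation, not configuration taken from the talk; check the variable names and defaults against the docs for your NCCL version, since PXN is already on by default in recent releases when the topology supports it):

```python
import os

# Set before torch.distributed.init_process_group("nccl"), i.e. before the
# NCCL communicator is created, so the settings are picked up.
os.environ["NCCL_PXN_DISABLE"] = "0"        # "1" would switch PXN off entirely
os.environ["NCCL_P2P_PXN_LEVEL"] = "2"      # how aggressively to route p2p traffic via PXN
os.environ["NCCL_DEBUG"] = "INFO"           # log which paths/algorithms NCCL selects
os.environ["NCCL_DEBUG_SUBSYS"] = "GRAPH"   # include topology/graph search details
```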
We ran some NCCL tests and saw quite a significant improvement: about 50 percent for small messages and about 50 percent for large messages. For the smaller messages those numbers are mostly about latency; for the larger ones we care more about bandwidth, because latency tends to stay roughly the same. Those numbers are great, and everybody would love them, but not the customers, and that makes sense: those numbers are synthetic, and they mostly reflect an artificial workload applied to your network.
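Those measurements come from NCCL-level benchmarking; as a rough stand-in for tools such as nccl-tests, the PyTorch timing sketch below (message sizes chosen arbitrarily for illustration) shows the latency-bound small-message regime and the bandwidth-bound large-message regime:

```python
import time
import torch
import torch.distributed as dist

def bench_all_reduce(num_elems: int, iters: int = 20) -> float:
    """Return the average time of one all_reduce over num_elems float32 values."""
    x = torch.zeros(num_elems, dtype=torch.float32, device="cuda")
    for _ in range(5):                     # warm-up iterations
        dist.all_reduce(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

if __name__ == "__main__":
    # launch with: torchrun --nproc_per_node=<gpus> nccl_bench.py
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    n = dist.get_world_size()
    for num_elems in (1 << 10, 1 << 20, 1 << 28):       # ~4 KB, ~4 MB, ~1 GB payloads
        t = bench_all_reduce(num_elems)
        size_bytes = num_elems * 4
        busbw = size_bytes * 2 * (n - 1) / n / t / 1e9  # all-reduce "bus bandwidth"
        if dist.get_rank() == 0:
            print(f"{size_bytes:>12} B  {t * 1e6:9.1f} us  {busbw:6.1f} GB/s")
```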
What customers care about is the time it takes to train their particular model. So we used a sparse mixture of experts as an example. I am not going to dive into the details of how it works, but essentially a sparse mixture of experts gives you different layers of experts and a network that routes traffic between them. When you deploy that on a really large GPU cluster, it creates a ton of traffic: all the GPUs have to send data to each other and exchange information, so the load on the network is quite significant. We took an open-source sparse mixture-of-experts model consisting of eight feed-forward expert blocks of seven billion parameters each, and we fine-tuned it on 240 H100 GPUs.
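For readers unfamiliar with the architecture, below is a minimal sparse mixture-of-experts layer in PyTorch (an illustration of the general idea with made-up dimensions, not the open-source model used in the benchmark): a router picks the top-k of eight feed-forward experts for each token, and once the experts are sharded across GPUs that dispatch becomes all-to-all traffic over the fabric:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy sparse mixture-of-experts layer: route each token to its top-k experts."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)   # learned gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: [tokens, d_model]
        scores = F.softmax(self.router(x), dim=-1)           # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)       # top-k experts per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            tok, slot = (idx == e).nonzero(as_tuple=True)    # tokens routed to expert e
            if tok.numel():
                out[tok] += weights[tok, slot].unsqueeze(-1) * expert(x[tok])
        return out

# Example: 16 tokens of width 512, each dispatched to 2 of the 8 experts.
moe = SparseMoE()
print(moe(torch.randn(16, 512)).shape)   # torch.Size([16, 512])
```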
We saw quite a significant improvement with PXN enabled versus without it: a 14 percent improvement, which translates directly into the time to train the model and therefore directly into the cost to train the model. That is something everybody got really excited about, me included, and our customers as well, because it shows them real value they can get with a model they care about.

That was it from my side. Sorry for going through it so fast; it is a very large topic and it is hard to cover, but I am happy to answer any additional questions you might have.