Continuous Profiling for GPUs — Matthias Loibl, Polar Signals

00:00:14.000 |
Great, thank you for coming. I'm going to talk about maximizing GPU efficiency with continuous profiling for GPUs. 00:00:22.700 |
So what is profiling? Profiling is pretty much as old as programming. 00:00:28.100 |
I think it was first done back in the earliest days of computing, by people 00:00:34.440 |
trying to figure out what was happening on their computers back then. So it's been around basically forever. 00:00:44.220 |
What are we doing with profiling? We are profiling 00:00:46.860 |
basically anything that we can, to inform our 00:00:52.020 |
view of the world. We want to see the memory, CPU, and GPU 00:00:58.000 |
time spent; we want to see the usage of individual instructions and the frequency and duration of function calls. 00:01:05.220 |
So there are a lot of different approaches to profiling, but 00:01:09.980 |
generally speaking, it's super important to performance engineering. 00:01:18.420 |
Obviously we profile to improve performance, and then we can also save money. So if we speed up our 00:01:25.740 |
software by 10%, we might be able to turn off 10% of our servers and save a bunch of money, right? 00:01:34.880 |
There are two different kinds of profiling typically 00:01:39.460 |
that we're seeing these days. One is tracing profiling, where you record each and every event, all the time, constantly. That's 00:01:50.700 |
great for getting the best possible 00:01:54.440 |
view onto the system, but it comes at a pretty high cost 00:01:58.380 |
and generates a lot of data, so it's hard to do 00:02:04.060 |
continuously. And that is why we're doing sampled profiling. 00:02:08.200 |
So what we do is sample for a certain duration, say 10 seconds, and we only sample a hundred times per second, although 00:02:18.900 |
you can tweak how often you want to profile 00:02:21.840 |
and sample. A hundred times per second 00:02:26.720 |
isn't that much for a CPU, and that's why you get less than a percent of overhead on the CPU and only about four megabytes of overhead for the memory profiling. 00:02:37.420 |
You will most definitely miss things, but if you run it always-on, 00:02:43.900 |
you will eventually see most of the relevant things; a stack that executed only once isn't relevant to us anyway. 00:02:55.120 |
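To make the sampling idea concrete, here is a minimal sketch of an in-process sampling profiler in Python. It is only an illustration, assuming a Unix-like OS for SIGPROF; the profiler described in this talk instead samples from outside the process via eBPF:

    import collections
    import signal
    import traceback

    samples = collections.Counter()

    def on_sample(signum, frame):
        # Record the current call stack as one sample.
        stack = tuple(f.name for f in traceback.extract_stack(frame))
        samples[stack] += 1

    # Deliver SIGPROF roughly 100 times per second of consumed CPU time;
    # this interval is the knob mentioned above.
    signal.signal(signal.SIGPROF, on_sample)
    signal.setitimer(signal.ITIMER_PROF, 0.01, 0.01)

    def busy():
        return sum(i * i for i in range(200_000))

    for _ in range(200):
        busy()

    signal.setitimer(signal.ITIMER_PROF, 0, 0)  # stop sampling
    for stack, count in samples.most_common(3):
        print(count, "samples in", " -> ".join(stack))

The counts per unique stack are exactly what the flame charts later in the talk aggregate.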
So yeah, this is basically what we're doing: we see the stacks on the left-hand side 00:03:03.640 |
executing, and these are the functions that are calling each other, and we are just 00:03:08.100 |
taking note, twenty or a hundred times per second. 00:03:24.000 |
Yeah, and that allows us to do it always-on, and do it in production, in 00:03:30.560 |
the production environment. It is pretty important to be able to do this in production and actually see what's happening 00:03:37.780 |
out there in the real world, and to do it with low overhead. And we are actually using eBPF to observe 00:03:49.280 |
what the kernel is doing, so we don't even have to change any of your applications. 00:03:56.060 |
That means you start one thing and it will start 00:04:00.280 |
profiling all of your applications, so you don't really have to instrument anything. 00:04:06.160 |
Quickly about me: I'm Matthias Loibl. I flew in from Berlin, Germany, and 00:04:10.620 |
I'm the director of Polar Signals Cloud. I'm also a 00:04:14.580 |
maintainer of Prometheus, the Prometheus Operator, and Parca, the open source 00:04:19.040 |
version of all of what I'm talking about today, and some other projects. 00:04:27.720 |
We are basically here for GPUs, right? And just earlier this year, 00:04:33.020 |
after working on CPU and memory profiling for the last three or four years, we launched a 00:04:41.000 |
preview of GPU profiling, so I'm going to talk about this today. 00:04:51.140 |
We're talking to NVIDIA's NVML to get these metrics out of your GPU. So we can see, in the blue line on top, the 00:05:03.740 |
utilization of the node, and then the orange line is one particular process on that GPU. We can see 00:05:13.460 |
the process ID over here, so we see individual processes, but we also see the overall node's utilization. 00:05:20.900 |
Further down are the memory utilization and the clock speed, etc. And that will kind of inform 00:05:27.640 |
where we want to look at the performance of our system, right? 00:05:31.840 |
So sometimes we can see the utilization drop, and that might be something we want to investigate to really make sure that we are using the GPU fully. 00:05:42.500 |
Just a couple more metrics we are collecting: there's the power utilization, where the dashed line is the power limit, and then the temperature. 00:05:50.860 |
Temperature is sometimes important, because if you're always at, say, 80 degrees Celsius, the GPU will eventually throttle 00:06:01.120 |
quite significantly. And then obviously there's PCIe throughput: 00:06:05.520 |
it's interesting to know whether you are bound by the data you are transferring between CPU and GPU. 00:06:12.080 |
Perfect. Yeah, so just to repeat: the negative values are receiving, whereas the positive values are sending, at 10 megabytes per second. 00:06:23.320 |
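As a rough sketch of what reading these metrics from NVML looks like, assuming the nvidia-ml-py bindings (pynvml) are installed and there is one GPU at index 0:

    import pynvml

    pynvml.nvmlInit()
    h = pynvml.nvmlDeviceGetHandleByIndex(0)

    util = pynvml.nvmlDeviceGetUtilizationRates(h)     # GPU and memory, in percent
    clock = pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_SM)  # MHz
    power = pynvml.nvmlDeviceGetPowerUsage(h)          # milliwatts
    limit = pynvml.nvmlDeviceGetEnforcedPowerLimit(h)  # the dashed line
    temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
    tx = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_TX_BYTES)
    rx = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_RX_BYTES)

    print(f"util={util.gpu}% sm={clock}MHz power={power / 1000:.0f}W "
          f"(limit {limit / 1000:.0f}W) temp={temp}C pcie tx/rx={tx}/{rx} KB/s")

    # The per-process view (the orange line): which PIDs are on this GPU.
    for p in pynvml.nvmlDeviceGetComputeRunningProcesses(h):
        print("pid", p.pid, "gpu memory bytes", p.usedGpuMemory)

    pynvml.nvmlShutdown()

A collector would poll these calls on an interval and store the results as time series, which is what charts like the ones above are drawn from.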
And then we can use all of those metrics to correlate from the GPU metrics to the CPU 00:06:32.520 |
profiles that we're storing. So we are collecting, like we have done for the last three or four years 00:06:39.300 |
using eBPF, those CPU stacks, and we want to see what is happening on the CPU. So in this case 00:06:47.360 |
we might want to look at a particular stack 00:06:52.300 |
right before the end, because there was some activity there, for example. So we can drag and drop to select a particular 00:06:58.620 |
time range, and then we are presented with a flame chart. 00:07:03.540 |
And in the flame charts we can see what the CPU is doing while the GPU is not fully utilized. So in this case 00:07:11.280 |
we can see what Python is actually calling. 00:07:21.700 |
But oftentimes you will see that the CPU is 00:07:25.200 |
pretty actively trying to load data, being busy that way, and not keeping the GPU busy. 00:07:36.080 |
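A hedged sketch of that pattern, assuming PyTorch with a CUDA GPU: with num_workers=0 the expensive __getitem__ preprocessing runs on the training process and shows up as CPU time in the flame chart while the GPU sits idle, and raising num_workers moves it off the critical path:

    import torch
    from torch.utils.data import DataLoader, Dataset

    class SlowDataset(Dataset):
        def __len__(self):
            return 256

        def __getitem__(self, i):
            # Deliberately CPU-heavy preprocessing.
            x = torch.randn(1, 3, 224, 224)
            for _ in range(10):
                x = torch.nn.functional.avg_pool2d(x, 3, stride=1, padding=1)
            return x.squeeze(0)

    # Compare num_workers=0 versus num_workers=4 while watching GPU utilization.
    loader = DataLoader(SlowDataset(), batch_size=32, num_workers=4)
    model = torch.nn.Conv2d(3, 8, 3).cuda()

    for batch in loader:
        out = model(batch.cuda(non_blocking=True))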
We can see it if you're using Python; if you're using Rust to integrate with CUDA, for example, 00:07:42.040 |
that also works. Any compiled language is going to show up in those stack traces, and even some of the 00:07:48.560 |
interpreted languages are going to show up, like Ruby and Python. 00:07:55.020 |
There is a focus at this conference, so we're talking about GPUs here, but 00:07:59.660 |
it really works with any language and any application: web servers, databases. Vector databases, for example, 00:08:07.420 |
are also obviously interested in improving their performance. 00:08:25.640 |
As you heard, I was talking about these GPU profiles. We look at how much time is spent in 00:08:33.300 |
individual functions on the CPU, but we are more interested in the 00:08:38.560 |
GPU time spent by these functions. So here's a flame chart of 00:08:44.820 |
CUDA functions. Basically what we do is tell the Linux kernel, 00:08:52.120 |
whenever there's a CUDA kernel getting launched from the CPU, to tell us the start time of it, and when it 00:09:03.360 |
terminates; and then we know the duration, how much time that particular kernel was spending on the GPU. 00:09:10.680 |
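The profiler gets those timestamps from the outside via eBPF. To get a feel for per-kernel GPU time from inside your own code, CUDA events give a comparable measurement on the GPU's own timeline; a minimal sketch, assuming PyTorch and a CUDA device:

    import torch

    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()            # timestamp on the GPU stream
    c = a @ b                 # the matmul kernel launches asynchronously
    end.record()              # timestamp after the kernel
    torch.cuda.synchronize()  # wait until both events have actually happened

    print(f"matmul spent {start.elapsed_time(end):.2f} ms on the GPU")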
And that's super interesting, obviously, because now we can actually see how much GPU time these individual 00:09:19.460 |
functions are taking on the GPU. And here's a bit more of a real-world example. 00:09:30.200 |
Yeah, on the right-hand side we can see the main function in Python, and then it calls down into the 00:09:38.500 |
stacks that we're seeing. This is the actual time that we had these functions running, 00:09:54.480 |
down to here, and then the leaf of each stack is the function that was taking time on the GPU. 00:10:06.820 |
The colors correspond to the binaries, in this case, that are running on your machine. 00:10:11.580 |
That's why the blue up here, for example, is Python, and then there are some others. 00:10:19.420 |
Yeah, great question: how do you get started? 00:10:24.560 |
Because we run on Linux using eBPF, you have a binary that you can 00:10:30.920 |
run using systemd; Docker works as well, but we also have a DaemonSet for Kubernetes. 00:10:38.020 |
You deploy that: you get the manifest YAML and give it a token. 00:10:44.420 |
Some of our customers are already using it for CPU and memory profiling, and they're starting to 00:10:50.660 |
also integrate their platforms with our GPU profiling. 00:10:58.300 |
They are interested in improving the performance of their vector engine, right? 00:11:05.420 |
And that's really it. Please visit our booth. 00:11:09.400 |
The first ten people to sign up for a consultation get two hours for free, if you want to. 00:11:17.820 |
And we can also do discounts for seed and Series A startups. That's really it. Thank you so much!