
Continuous Profiling for GPUs — Matthias Loibl, Polar Signals



00:00:00.000 | *Music*
00:00:14.000 | Great, thank you for coming. I'm going to talk about maximizing GPU efficiency with continuous profiling for GPUs.
00:00:22.700 | So what is profiling? Profiling is pretty much as old as programming.
00:00:28.100 | I think it was first done in the
00:00:30.700 | 1970s, when some IBM folks were
00:00:34.440 | trying to figure out what was happening on their computers back then. So it's been around basically forever in
00:00:41.620 | computer science.
00:00:44.220 | And what are we doing with profiling? We are profiling
00:00:46.860 | basically anything that we can, to inform our
00:00:52.020 | view of the world. We want to see memory, CPU, and GPU
00:00:58.000 | time spent; the usage of individual instructions; and the frequency and duration of function calls.
00:01:05.220 | So yeah, there are a lot of different approaches to profiling, but
00:01:09.980 | generally speaking it's super important to performance engineering.
00:01:15.360 | So why would we do this?
00:01:18.420 | Obviously to improve performance, and then we can also save money. So if we
00:01:22.700 | improve our
00:01:25.740 | software by 10%, we might be able to just turn off 10% of our servers and save a bunch of money, right?
00:01:32.220 | So that would be great.
00:01:34.880 | There are two different kinds of profiling typically
00:01:39.460 | that we're seeing these days. One is tracing profiling, where you record each and every event, all the time, constantly. But
00:01:48.280 | obviously, while that's
00:01:50.700 | great for getting the best possible
00:01:54.440 | view of the system, it comes at a pretty high cost
00:01:58.380 | and generates a lot of data, so it's hard to do
00:02:04.060 | continuously. And that is why we're doing sampled profiling.
00:02:08.200 | So what we do is basically sample for a certain duration, like 10 seconds, and we only sample a hundred times per second, or
00:02:16.820 | 20 times per second, etc.
00:02:18.900 | You can tweak how often you want to profile
00:02:21.840 | and sample. A hundred times per second
00:02:26.720 | isn't that much for a CPU, and that's why you get less than a percent of overhead on the CPU, and only about four megabytes of overhead for memory profiling.
00:02:37.420 | You will most definitely miss things, but if you do it always-on,
00:02:43.900 | you will eventually see most of the relevant things. One stack that executed once isn't relevant to us anyway;
00:02:53.120 | we want to see the big picture.
00:02:55.120 | So yeah, this is basically what we're doing: we see the stacks on the left-hand side
00:03:03.640 | executing, and these are the functions that are calling each other. And we are just,
00:03:08.100 | 20 times per second or a hundred times per second, taking note
00:03:13.320 | of what exact stack we're seeing on the CPU,
00:03:17.620 | or which stack is allocating.
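To make the sampling idea concrete, here is a minimal sketch of a stack sampler in plain Python. It only illustrates the concept of counting on-CPU stacks at a fixed frequency; the agent discussed in this talk does this in the kernel with eBPF, not in Python, and the interval and duration below are arbitrary assumptions.

```python
import collections
import sys
import threading
import time
import traceback

def sample_stacks(counts, interval=0.01, duration=10.0):
    """Sample every thread's stack 1/interval times per second for `duration` seconds."""
    self_id = threading.get_ident()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        for thread_id, frame in sys._current_frames().items():
            if thread_id == self_id:  # skip the sampler's own stack
                continue
            # A sample is just the chain of function names, root -> leaf.
            stack = tuple(f.name for f in traceback.extract_stack(frame))
            counts[stack] += 1  # how often this exact stack was on-CPU
        time.sleep(interval)

counts = collections.Counter()
threading.Thread(target=sample_stacks, args=(counts,), daemon=True).start()
```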
00:03:24.000 | Yeah, and that allows us to do it always-on, and do it in production.
00:03:28.000 | Your machine is not
00:03:30.560 | the production environment, so it is pretty important to be able to do this in production and actually see what's happening
00:03:37.780 | out there in the real world, and to do it with low overhead. And we are actually using
00:03:43.460 | Linux eBPF, and
00:03:47.280 | because we're using something
00:03:49.280 | that the kernel is doing, we don't even have to change any of your applications.
00:03:56.060 | That means you start one thing and it will start
00:04:00.280 | profiling all of your applications, so you don't really have to instrument anything.
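A rough sketch of what that kernel-side mechanism looks like, using the bcc Python bindings to attach a small eBPF program to a timer-driven perf event. The map sizes and 99 Hz frequency are assumptions, and this shows the general technique rather than the actual agent:

```python
from bcc import BPF, PerfType, PerfSWConfig

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_STACK_TRACE(stacks, 16384);   // deduplicated stack storage
BPF_HASH(counts, int, u64);       // stack id -> number of samples

int do_sample(struct bpf_perf_event_data *ctx) {
    // Record the user-space stack of whatever is on-CPU right now.
    int stack_id = stacks.get_stackid(&ctx->regs, BPF_F_USER_STACK);
    counts.increment(stack_id);
    return 0;
}
"""

b = BPF(text=prog)
# Fire 99 times per second on every CPU -- cheap enough to leave always on,
# and no application needs to be changed or instrumented.
b.attach_perf_event(ev_type=PerfType.SOFTWARE,
                    ev_config=PerfSWConfig.CPU_CLOCK,
                    fn_name="do_sample",
                    sample_freq=99)
```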
00:04:06.160 | Quickly about me: I'm Matthias Loibl, flew in from Berlin, Germany, and
00:04:10.620 | I'm the Director of Polar Signals Cloud. I'm also a
00:04:14.580 | maintainer of Prometheus, the Prometheus Operator, Parca (the open source
00:04:19.040 | version of all of what I'm talking about today), and some other
00:04:23.360 | projects.
00:04:27.720 | We are basically here for GPUs, right? And just earlier this year,
00:04:33.020 | after working on CPU and memory profiling for the last three or four years, we
00:04:37.820 | started a
00:04:41.000 | preview of GPU profiling. So I'm going to talk about this today, and
00:04:45.160 | why we think it's pretty great.
00:04:49.140 | As you can see in this screenshot,
00:04:51.140 | we're talking to NVIDIA's NVML to get these metrics out of your GPU. So in the blue line on top
00:05:01.040 | we can see the overall
00:05:03.740 | utilization of the node, and then the orange line is one particular process on that GPU. So
00:05:13.460 | over here we can see the process ID; we see individual processes, but we also see the overall node's utilization.
00:05:20.900 | Further down are the memory utilization, the clock speed, etc. And that will kind of inform
00:05:27.640 | where we want to look at the performance of our system, right?
00:05:31.840 | So sometimes we can see the utilization drop, and that might be something that we want to investigate to really make sure that we are using
00:05:39.060 | our GPUs to the fullest.
00:05:42.500 | Just a couple more metrics we are collecting: there's the power usage, where the dashed line is the power limit, and then the temperature.
00:05:50.860 | Temperature sometimes is important because eventually, if you're always at, like, 80 degrees Celsius,
00:05:57.640 | you're going to get throttled by the GPU
00:06:01.120 | quite significantly. And then obviously PCIe throughput:
00:06:05.520 | it's interesting whether you are bound by the data you are transferring between CPU and GPU.
00:06:12.080 | Perfect, yeah. So just to repeat the question: the negative values are receiving, whereas the positive ones are sending, 10 megabytes per second
00:06:20.740 | through PCIe.
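All of the metrics mentioned here are exposed by NVML. A sketch of that kind of polling with the pynvml bindings (device index 0 and the one-second interval are assumptions):

```python
import time
import pynvml as nvml

nvml.nvmlInit()
handle = nvml.nvmlDeviceGetHandleByIndex(0)

while True:
    util = nvml.nvmlDeviceGetUtilizationRates(handle)                  # percent
    mem = nvml.nvmlDeviceGetMemoryInfo(handle)                         # bytes
    sm_clock = nvml.nvmlDeviceGetClockInfo(handle, nvml.NVML_CLOCK_SM) # MHz
    power = nvml.nvmlDeviceGetPowerUsage(handle) / 1000.0              # watts
    limit = nvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000.0    # the dashed line
    temp = nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU)
    # PCIe throughput in KB/s, split into send (TX) and receive (RX).
    tx = nvml.nvmlDeviceGetPcieThroughput(handle, nvml.NVML_PCIE_UTIL_TX_BYTES)
    rx = nvml.nvmlDeviceGetPcieThroughput(handle, nvml.NVML_PCIE_UTIL_RX_BYTES)
    # Per-process view, like the orange line in the screenshot.
    for proc in nvml.nvmlDeviceGetComputeRunningProcesses(handle):
        print(proc.pid, proc.usedGpuMemory)  # usedGpuMemory may be None
    print(util.gpu, mem.used, sm_clock, power, limit, temp, tx, rx)
    time.sleep(1)
```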
00:06:23.320 | And then we can use all of those metrics to correlate from the GPU metrics to the CPU
00:06:32.520 | profiles that we're storing. So we are collecting, like we have done for the last three or four years
00:06:39.300 | using eBPF, those CPU stacks, and we want to see what is happening on the CPU. So in this case
00:06:47.360 | we might want to look at a particular stack on
00:06:50.000 | CPU zero,
00:06:52.300 | right before the end, because there was some activity, for example. So we can drag and drop to select a particular
00:06:58.620 | time range, and then we are presented with a flame chart.
00:07:03.540 | And in the flame chart we can see what the CPU is doing while the GPU is not fully utilized. So in this case
00:07:11.280 | we can see that Python is actually calling,
00:07:14.700 | eventually, the CUDA
00:07:17.800 | functions further down.
00:07:21.700 | But oftentimes you will see that the CPU is
00:07:25.200 | pretty actively trying to load data, and being busy that way, and not keeping the GPU busy.
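A flame chart is just an aggregation of the sampled stacks over the selected time range. Continuing the hypothetical Counter from the earlier Python sketch, the "folded" text format that common flame-graph tooling consumes is one line per unique stack:

```python
def dump_folded(counts, out):
    """Write stack counts as 'root;child;leaf count' lines, the folded
    format consumed by flame-graph tools such as flamegraph.pl."""
    for stack, n in counts.items():
        out.write(";".join(stack) + f" {n}\n")
```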
00:07:34.080 | If you are
00:07:36.080 | using Python, we can see it. If you're using Rust to integrate with CUDA, for example,
00:07:42.040 | that also works. But any compiled language is going to show up in those stack traces, and even some of the
00:07:48.560 | interpreted languages are going to show up, like Ruby, Python,
00:07:51.800 | the JVM, etc. So while
00:07:55.020 | there is a focus at this conference on GPUs,
00:07:59.660 | it really works with any language and any application. Web servers, databases, and vector databases, for example,
00:08:07.420 | are also interested in improving their performance, obviously.
00:08:11.520 | Something super exciting that we first
00:08:15.520 | introduced this morning, so this is
00:08:19.160 | super fresh and hot off the press:
00:08:22.840 | GPU time profiling.
00:08:25.640 | As you heard, I was talking about these GPU profiles, and we look at how much time is spent in
00:08:33.300 | individual functions on the CPU, but we are more interested in the
00:08:38.560 | GPU time spent by these functions. So here's a
00:08:42.240 | small example of
00:08:44.820 | CUDA functions. Basically, what we do is we tell the Linux kernel,
00:08:52.120 | whenever a CUDA kernel gets launched from the CPU, to tell us the start time of that
00:08:57.840 | function, and then eventually tell us the
00:09:00.840 | time when that
00:09:03.360 | terminates. And then we know the duration: how much time that particular kernel was spending on the GPU.
00:09:10.680 | And that's super interesting, obviously, because now we can actually see how much GPU time these individual functions are taking on the GPU.
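A much-simplified sketch of the hooking half of this, attaching an eBPF uprobe to the CUDA runtime's launch entry point with bcc so every launch is reported with a timestamp and the calling stack. The library path is an assumption, and measuring when the kernel actually finishes on the GPU takes more machinery than shown here:

```python
from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_STACK_TRACE(stacks, 16384);

struct launch_t {
    u64 ts;        // nanosecond timestamp of the launch
    int stack_id;  // user-space stack that issued it
};
BPF_PERF_OUTPUT(launches);

int on_launch(struct pt_regs *ctx) {
    struct launch_t ev = {};
    ev.ts = bpf_ktime_get_ns();
    ev.stack_id = stacks.get_stackid(ctx, BPF_F_USER_STACK);
    launches.perf_submit(ctx, &ev, sizeof(ev));
    return 0;
}
"""

b = BPF(text=prog)
# Library path is system-specific; adjust for your CUDA installation.
b.attach_uprobe(name="/usr/local/cuda/lib64/libcudart.so",
                sym="cudaLaunchKernel", fn_name="on_launch")

def handle(cpu, data, size):
    ev = b["launches"].event(data)
    print(ev.ts, ev.stack_id)

b["launches"].open_perf_buffer(handle)
while True:
    b.perf_buffer_poll()
```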
00:09:19.460 | And here's a bit more of a real-world
00:09:24.760 | example. So at the top we can see,
00:09:28.060 | on the,
00:09:30.200 | yeah, on the right-hand side, we can see the main function in Python, and then it calls down into
00:09:35.660 | libcuda down here. And the width of these
00:09:38.500 | stacks that we're seeing is the actual time that these functions
00:09:44.660 | took up on the GPU.
00:09:46.660 | Yeah, so this is basically
00:09:51.660 | the stack on the CPU,
00:09:54.480 | down to here, and then the leaf of each stack is the function that was taking time on the GPU.
00:10:04.820 | The colors are different
00:10:06.820 | binaries, in this case, that are running on your machine.
00:10:11.580 | That's why blue up here, for example, is Python, and then there's some,
00:10:16.640 | I think, CUDA down here.
00:10:19.420 | Yeah, great question: how do you get started?
00:10:24.560 | Because we run on Linux using eBPF, you have a binary that you can
00:10:30.920 | run using systemd. Docker works as well, but we also have a DaemonSet for Kubernetes.
00:10:38.020 | And you deploy that: you get the manifest YAML and give it a token.
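The rough shape of such a DaemonSet, purely as an illustration; the names, image, and secret here are hypothetical, not the actual Polar Signals manifest:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: profiling-agent
spec:
  selector:
    matchLabels: {app: profiling-agent}
  template:
    metadata:
      labels: {app: profiling-agent}
    spec:
      hostPID: true                       # see every process on the node
      containers:
        - name: agent
          image: example.com/profiling-agent:latest   # hypothetical image
          securityContext:
            privileged: true              # required to load eBPF programs
          env:
            - name: TOKEN                 # the token mentioned in the talk
              valueFrom:
                secretKeyRef: {name: profiling-token, key: token}
```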
00:10:44.420 | And then some of our customers are already using it for CPU and memory profiling, and they're starting to
00:10:50.660 | also integrate their platforms with our GPU profiling.
00:10:55.360 | Especially turbopuffer, who
00:10:58.300 | are interested in improving the performance of their vector engine, right?
00:11:05.420 | And that's really it. Please visit our booth.
00:11:09.400 | The first ten people to sign up for a consultation get two hours for free, if you want to.
00:11:17.820 | And we can also do discounts for seed and Series A startups. That's really it. Thank you so much!
00:11:26.440 | We'll see you next time.