*Music* Great, thank you for for coming. I'm gonna talk about maximizing GPU efficiency with continuous profiling for GPUs So what is profiling? Profiling is pretty much as old as programming I think it was like firstly first done in like the 1970s, I think some IBM folks were Trying to figure out what was happening on their computers back then so it's been around basically forever in computer science and What are we doing with profiling?
We are profiling basically anything that we can to to inform our View of the world we want to see like the memory or CPU and GPU Time spent we want to see the usage of the individual instructions and the frequency and duration of these function calls so Yeah, a lot of different approaches to profiling, but Yeah, it's generally speaking super super important to to performance engineering So why would we do this?
Obviously to improve performance and then also we can save money. So if we improve our Software by like 10% we might be able to just like turn off 10% of our like servers and save a bunch of money, right? so that would be great and There are two different kinds of of profiling typically that that we're seeing these days so one is tracing profiling and that is you record each and every event all the time constantly, but obviously, that's Great for like getting like the best possible View onto the system, but it's like pretty high cost And generates a lot of data.
So it's like hard to to do Continuously, and that is why we're doing sampled profiling So what we do is basically we sample for a certain duration like 10 seconds and we only sample a hundred times per second or like 20 times per second, etc You can tweak that how often do you want to profile?
And and and sample so like a hundred times per second Isn't that much for a CPU and that's why you get like less than like a percent overhead on on the CPU and like only like four megabytes of overhead for the memory profiling You will most definitely miss things, but if you do it always on You will eventually see most of the relevant things right like one stack that executed once isn't like relevant to us anyway Like we want to see the big picture So yeah, this is basically what we're what we're doing we we see like the stacks on the left-hand side Executing and these are like the functions that are calling each other and we are just Like 20 times per second or hundred times per second taking taking note On what exact stack we're seeing on the CPU Or which stack is like allocating?
Etc Yeah, and that allows us to like do it always on do it in production your machine is not the production environment, so it is pretty important to be able to do this in production and actually see what's happening out there in the real world and do with a low overhead and we are actually using Linux EVPF and Because we're using something That the kernel is doing we we don't even have to change any of your applications That means you start one thing and it will start Profiling all of your applications, so you don't really have to instrument quickly about me.
I'm Matthias Loewel flew in from Berlin, Germany and I'm the director of policy in its cloud and I'm also a Maintainer of Prometheus Prometheus operator Parker's the open source version of all of what I'm talking about today and some other projects so We are basically here for like GPUs, right and we just earlier this year After like working on CPU and memory profiling for the last three or four years started a preview on GPU profiling, so I'm gonna talk about this today and Why we think it's pretty pretty great As you can see in this screenshot We're talking to Nvidia and VML to get these metrics out of your GPU so we can see in the blue the blue line on top We can see the overall Utilization of the note and then the orange line is one particular process on that GPU so we can see Over here we can see the process idea so we see individual processes, but we also see the overall notes utilization Further down the memory utilization and the clock speed etc.
And that will kind of inform Where we want to look at the performance of our system, right? So sometimes we can see the utilization drop down and that might be something that we want to investigate to really make sure that we are using our GPUs to the fullest Just a couple of more metrics we are collecting so there's like the power utilization and the dashed line is the power limit and then the temperature Temperature sometimes is important because like eventually if you're like always at like 80 degrees Celsius You're gonna get throttled by the GPU Quite significantly and then obviously a PCIe throughput It's interesting.
Are you bound by the data you are transferring between CPU and GPU Perfect. Yeah, so just to repeat like the negative one is receiving whereas the positive ones are sending 10 megabytes per second through PCIe And then we can use all of those metrics to correlate from the CPU profile from the GPU metrics to the CPU Profiles that we're storing so we are like collecting like we have done the last three or four years Using eBPF those CPU stacks and we want to like see what is happening on the CPU so in this case We might want to look at a particular stack on CPU zero Right before the end because there was some activity for example, so we can drag and drop and select a particular Time range and then we are presented with a flame chart um, and in the flame charts we can see what the CPU is doing while the GPU is not fully utilized so in in this case We're we can see that Python is actually calling Eventually the the CUDA Functions further down But oftentimes you you will see that like the CPU is Pretty actively trying to load data and being busy that way and not not keeping the GPU busy um, if you are Using Python we can see it if you're using rust to integrate with CUDA for example That also works but any compiled language is going to show up in those stack traces and even some of the Interpreted languages are going to show up like Ruby Python JVM etc.
So while There is a focus at this conference that we're talking about GPUs here It really works with like any language and in any application so web servers databases and back to databases for example Also are interested in improving their performance obviously Something super exciting that we first Introduced this morning, so this like super fresh and hot of the press is GPU time profiling so um, as you heard like I I was talking about like these GPU profiles and we we look how how much time is spent on individual functions on the CPU but we are like more interested in GPU time spent by these functions, so here's like a small example of CUDA functions and basically what we do is we tell the Linux kernel to Whenever there's a CUDA stack getting put on the on the CPU to tell us the start time of that function and then eventually tell us the time when that kind of Terminates and then we know the duration of how much time that particular kernel was spending on on the GPU And and that's super interesting obviously because now we can actually see how much GPU time these individual Functions are taking on the GPU and here's a bit more of a real world Example so at the top we can see on the Yeah, on the right-hand side we can see like the main function in Python and then calling down into libcuda down here and the width of these Stacks that we're seeing is like the actual time that we had these functions Take up in in the GPU Yeah, so this is like basically the stack on the CPU Down to here and then the leaf of each stack is the function that was taking time on the GPU The colors are different Binaries in this case that are running on your machine So that's why I like blue up here for example is Python and then there's like some Some I think CUDA down here Yeah, great question.
How do you get started? Because we we run on Linux using a BPF you have a binary that you can That you can run using system do docker works as well, but we also have a daemon set for kubernetes And you deploy that you get the manifest yaml and give it a token And then some of our customers are already using it for CPU and memory profiling and they're starting to Also integrate their platforms with our GPU profiling especially like turbo puffer Are are interested in in improving their performance of their vector engine, right?
And that's really it um, please visit visit our booth You got the first ten people get like to sign up for a consultation get two hours for free if you want to And we can also do discounts for C and CSA startups, and that's really it. Thank you so much You We'll see you next time.