
Continuous Profiling for GPUs — Matthias Loibl, Polar Signals



00:00:00.000 | *Music*
00:00:14.000 | Great, thank you for coming. I'm going to talk about maximizing GPU efficiency with continuous profiling for GPUs.
00:00:22.700 | So what is profiling? Profiling is pretty much as old as programming.
00:00:28.100 | I think it was first done in the
00:00:30.700 | 1970s, when some IBM folks were
00:00:34.440 | trying to figure out what was happening on their computers back then. So it's been around basically forever in
00:00:41.620 | computer science.
00:00:44.220 | And what are we doing with profiling? We are profiling
00:00:46.860 | basically anything that we can, to inform our
00:00:52.020 | view of the world. We want to see memory, CPU, and GPU
00:00:58.000 | time spent; the usage of individual instructions; and the frequency and duration of function calls.
00:01:05.220 | So yeah, there are a lot of different approaches to profiling, but
00:01:09.980 | generally speaking it's super important to performance engineering.
00:01:15.360 | So why would we do this?
00:01:18.420 | Obviously to improve performance, and then we can also save money. So if we
00:01:22.700 | improve our
00:01:25.740 | software by 10%, we might be able to just turn off 10% of our servers and save a bunch of money, right?
00:01:32.220 | So that would be great.
00:01:34.880 | There are two different kinds of profiling typically
00:01:39.460 | that we're seeing these days. One is tracing profiling, where you record each and every event, all the time, constantly. But
00:01:48.280 | obviously, while that's
00:01:50.700 | great for getting the best possible
00:01:54.440 | view of the system, it comes at a pretty high cost
00:01:58.380 | and generates a lot of data, so it's hard to do
00:02:04.060 | continuously. And that is why we're doing sampled profiling.
00:02:08.200 | So what we do is basically sample for a certain duration, like 10 seconds, and we only sample a hundred times per second, or
00:02:16.820 | 20 times per second, etc.
00:02:18.900 | You can tweak how often you want to profile
00:02:21.840 | and sample. A hundred times per second
00:02:26.720 | isn't that much for a CPU, and that's why you get less than a percent of overhead on the CPU, and only about four megabytes of overhead for memory profiling.
00:02:37.420 | You will most definitely miss things, but if you do it always-on,
00:02:43.900 | you will eventually see most of the relevant things. One stack that executed once isn't relevant to us anyway;
00:02:53.120 | we want to see the big picture.
00:02:55.120 | So yeah, this is basically what we're doing: we see the stacks on the left-hand side
00:03:03.640 | executing, and these are the functions that are calling each other. And we are just,
00:03:08.100 | 20 times per second or a hundred times per second, taking note
00:03:13.320 | of what exact stack we're seeing on the CPU,
00:03:17.620 | or which stack is allocating.
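To make the sampling idea concrete, here is a minimal sketch of a stack sampler in plain Python. It only illustrates the concept of counting on-CPU stacks at a fixed frequency; the agent discussed in this talk does this in the kernel with eBPF, not in Python, and the interval and duration below are arbitrary assumptions.

```python
import collections
import sys
import threading
import time
import traceback

def sample_stacks(counts, interval=0.01, duration=10.0):
    """Sample every thread's stack 1/interval times per second for `duration` seconds."""
    self_id = threading.get_ident()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        for thread_id, frame in sys._current_frames().items():
            if thread_id == self_id:  # skip the sampler's own stack
                continue
            # A sample is just the chain of function names, root -> leaf.
            stack = tuple(f.name for f in traceback.extract_stack(frame))
            counts[stack] += 1  # how often this exact stack was on-CPU
        time.sleep(interval)

counts = collections.Counter()
threading.Thread(target=sample_stacks, args=(counts,), daemon=True).start()
```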
00:03:24.000 | Yeah, and that allows us to do it always-on, and do it in production.
00:03:28.000 | Your machine is not
00:03:30.560 | the production environment, so it is pretty important to be able to do this in production and actually see what's happening
00:03:37.780 | out there in the real world, and to do it with low overhead. And we are actually using
00:03:43.460 | Linux eBPF, and
00:03:47.280 | because we're using something
00:03:49.280 | that the kernel is doing, we don't even have to change any of your applications.
00:03:56.060 | That means you start one thing and it will start
00:04:00.280 | profiling all of your applications, so you don't really have to instrument anything.
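A rough sketch of what that kernel-side mechanism looks like, using the bcc Python bindings to attach a small eBPF program to a timer-driven perf event. The map sizes and 99 Hz frequency are assumptions, and this shows the general technique rather than the actual agent:

```python
from bcc import BPF, PerfType, PerfSWConfig

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_STACK_TRACE(stacks, 16384);   // deduplicated stack storage
BPF_HASH(counts, int, u64);       // stack id -> number of samples

int do_sample(struct bpf_perf_event_data *ctx) {
    // Record the user-space stack of whatever is on-CPU right now.
    int stack_id = stacks.get_stackid(&ctx->regs, BPF_F_USER_STACK);
    counts.increment(stack_id);
    return 0;
}
"""

b = BPF(text=prog)
# Fire 99 times per second on every CPU -- cheap enough to leave always on,
# and no application needs to be changed or instrumented.
b.attach_perf_event(ev_type=PerfType.SOFTWARE,
                    ev_config=PerfSWConfig.CPU_CLOCK,
                    fn_name="do_sample",
                    sample_freq=99)
```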
00:04:06.160 | Quickly about me: I'm Matthias Loibl, flew in from Berlin, Germany, and
00:04:10.620 | I'm the Director of Polar Signals Cloud. I'm also a
00:04:14.580 | maintainer of Prometheus, the Prometheus Operator, Parca (the open source
00:04:19.040 | version of all of what I'm talking about today), and some other
00:04:23.360 | projects.
00:04:27.720 | We are basically here for GPUs, right? And just earlier this year,
00:04:33.020 | after working on CPU and memory profiling for the last three or four years, we
00:04:37.820 | started a
00:04:41.000 | preview of GPU profiling. So I'm going to talk about this today, and
00:04:45.160 | why we think it's pretty great.
00:04:49.140 | As you can see in this screenshot,
00:04:51.140 | we're talking to NVIDIA's NVML to get these metrics out of your GPU. So in the blue line on top
00:05:01.040 | we can see the overall
00:05:03.740 | utilization of the node, and then the orange line is one particular process on that GPU. So
00:05:13.460 | over here we can see the process ID; we see individual processes, but we also see the overall node's utilization.
00:05:20.900 | Further down are the memory utilization, the clock speed, etc. And that will kind of inform
00:05:27.640 | where we want to look at the performance of our system, right?
00:05:31.840 | So sometimes we can see the utilization drop, and that might be something that we want to investigate to really make sure that we are using
00:05:39.060 | our GPUs to the fullest.
00:05:42.500 | Just a couple more metrics we are collecting: there's the power usage, where the dashed line is the power limit, and then the temperature.
00:05:50.860 | Temperature sometimes is important because eventually, if you're always at, like, 80 degrees Celsius,
00:05:57.640 | you're going to get throttled by the GPU
00:06:01.120 | quite significantly. And then obviously PCIe throughput:
00:06:05.520 | it's interesting whether you are bound by the data you are transferring between CPU and GPU.
00:06:12.080 | Perfect, yeah. So just to repeat the question: the negative values are receiving, whereas the positive ones are sending, 10 megabytes per second
00:06:20.740 | through PCIe.
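All of the metrics mentioned here are exposed by NVML. A sketch of that kind of polling with the pynvml bindings (device index 0 and the one-second interval are assumptions):

```python
import time
import pynvml as nvml

nvml.nvmlInit()
handle = nvml.nvmlDeviceGetHandleByIndex(0)

while True:
    util = nvml.nvmlDeviceGetUtilizationRates(handle)                  # percent
    mem = nvml.nvmlDeviceGetMemoryInfo(handle)                         # bytes
    sm_clock = nvml.nvmlDeviceGetClockInfo(handle, nvml.NVML_CLOCK_SM) # MHz
    power = nvml.nvmlDeviceGetPowerUsage(handle) / 1000.0              # watts
    limit = nvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000.0    # the dashed line
    temp = nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU)
    # PCIe throughput in KB/s, split into send (TX) and receive (RX).
    tx = nvml.nvmlDeviceGetPcieThroughput(handle, nvml.NVML_PCIE_UTIL_TX_BYTES)
    rx = nvml.nvmlDeviceGetPcieThroughput(handle, nvml.NVML_PCIE_UTIL_RX_BYTES)
    # Per-process view, like the orange line in the screenshot.
    for proc in nvml.nvmlDeviceGetComputeRunningProcesses(handle):
        print(proc.pid, proc.usedGpuMemory)  # usedGpuMemory may be None
    print(util.gpu, mem.used, sm_clock, power, limit, temp, tx, rx)
    time.sleep(1)
```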
00:06:23.320 | And then we can use all of those metrics to correlate from the GPU metrics to the CPU
00:06:32.520 | profiles that we're storing. So we are collecting, like we have done for the last three or four years
00:06:39.300 | using eBPF, those CPU stacks, and we want to see what is happening on the CPU. So in this case
00:06:47.360 | we might want to look at a particular stack on
00:06:50.000 | CPU zero,
00:06:52.300 | right before the end, because there was some activity, for example. So we can drag and drop to select a particular
00:06:58.620 | time range, and then we are presented with a flame chart.
00:07:03.540 | And in the flame chart we can see what the CPU is doing while the GPU is not fully utilized. So in this case
00:07:11.280 | we can see that Python is actually calling,
00:07:14.700 | eventually, the CUDA
00:07:17.800 | functions further down.
00:07:21.700 | But oftentimes you will see that the CPU is
00:07:25.200 | pretty actively trying to load data, and being busy that way, and not keeping the GPU busy.
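A flame chart is just an aggregation of the sampled stacks over the selected time range. Continuing the hypothetical Counter from the earlier Python sketch, the "folded" text format that common flame-graph tooling consumes is one line per unique stack:

```python
def dump_folded(counts, out):
    """Write stack counts as 'root;child;leaf count' lines, the folded
    format consumed by flame-graph tools such as flamegraph.pl."""
    for stack, n in counts.items():
        out.write(";".join(stack) + f" {n}\n")
```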
00:07:34.080 | If you are
00:07:36.080 | using Python, we can see it. If you're using Rust to integrate with CUDA, for example,
00:07:42.040 | that also works. But any compiled language is going to show up in those stack traces, and even some of the
00:07:48.560 | interpreted languages are going to show up, like Ruby, Python,
00:07:51.800 | the JVM, etc. So while
00:07:55.020 | there is a focus at this conference on GPUs,
00:07:59.660 | it really works with any language and any application. Web servers, databases, and vector databases, for example,
00:08:07.420 | are also interested in improving their performance, obviously.
00:08:11.520 | Something super exciting that we first
00:08:15.520 | introduced this morning, so this is
00:08:19.160 | super fresh and hot off the press:
00:08:22.840 | GPU time profiling.
00:08:25.640 | As you heard, I was talking about these GPU profiles, and we look at how much time is spent in
00:08:33.300 | individual functions on the CPU, but we are more interested in the
00:08:38.560 | GPU time spent by these functions. So here's a
00:08:42.240 | small example of
00:08:44.820 | CUDA functions. Basically, what we do is we tell the Linux kernel,
00:08:52.120 | whenever a CUDA kernel gets launched from the CPU, to tell us the start time of that
00:08:57.840 | function, and then eventually tell us the
00:09:00.840 | time when that
00:09:03.360 | terminates. And then we know the duration: how much time that particular kernel was spending on the GPU.
00:09:10.680 | And that's super interesting, obviously, because now we can actually see how much GPU time these individual functions are taking on the GPU.
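A much-simplified sketch of the hooking half of this, attaching an eBPF uprobe to the CUDA runtime's launch entry point with bcc so every launch is reported with a timestamp and the calling stack. The library path is an assumption, and measuring when the kernel actually finishes on the GPU takes more machinery than shown here:

```python
from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_STACK_TRACE(stacks, 16384);

struct launch_t {
    u64 ts;        // nanosecond timestamp of the launch
    int stack_id;  // user-space stack that issued it
};
BPF_PERF_OUTPUT(launches);

int on_launch(struct pt_regs *ctx) {
    struct launch_t ev = {};
    ev.ts = bpf_ktime_get_ns();
    ev.stack_id = stacks.get_stackid(ctx, BPF_F_USER_STACK);
    launches.perf_submit(ctx, &ev, sizeof(ev));
    return 0;
}
"""

b = BPF(text=prog)
# Library path is system-specific; adjust for your CUDA installation.
b.attach_uprobe(name="/usr/local/cuda/lib64/libcudart.so",
                sym="cudaLaunchKernel", fn_name="on_launch")

def handle(cpu, data, size):
    ev = b["launches"].event(data)
    print(ev.ts, ev.stack_id)

b["launches"].open_perf_buffer(handle)
while True:
    b.perf_buffer_poll()
```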
00:09:19.460 | And here's a bit more of a real-world
00:09:24.760 | example. So at the top we can see,
00:09:28.060 | on the,
00:09:30.200 | yeah, on the right-hand side, we can see the main function in Python, and then it calls down into
00:09:35.660 | libcuda down here. And the width of these
00:09:38.500 | stacks that we're seeing is the actual time that these functions
00:09:44.660 | took up on the GPU.
00:09:46.660 | Yeah, so this is basically
00:09:51.660 | the stack on the CPU,
00:09:54.480 | down to here, and then the leaf of each stack is the function that was taking time on the GPU.
00:10:04.820 | The colors are different
00:10:06.820 | binaries, in this case, that are running on your machine.
00:10:11.580 | That's why blue up here, for example, is Python, and then there's some,
00:10:16.640 | I think, CUDA down here.
00:10:19.420 | Yeah, great question: how do you get started?
00:10:24.560 | Because we run on Linux using eBPF, you have a binary that you can
00:10:30.920 | run using systemd. Docker works as well, but we also have a DaemonSet for Kubernetes.
00:10:38.020 | And you deploy that: you get the manifest YAML and give it a token.
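The rough shape of such a DaemonSet, purely as an illustration; the names, image, and secret here are hypothetical, not the actual Polar Signals manifest:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: profiling-agent
spec:
  selector:
    matchLabels: {app: profiling-agent}
  template:
    metadata:
      labels: {app: profiling-agent}
    spec:
      hostPID: true                       # see every process on the node
      containers:
        - name: agent
          image: example.com/profiling-agent:latest   # hypothetical image
          securityContext:
            privileged: true              # required to load eBPF programs
          env:
            - name: TOKEN                 # the token mentioned in the talk
              valueFrom:
                secretKeyRef: {name: profiling-token, key: token}
```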
00:10:44.420 | And then some of our customers are already using it for CPU and memory profiling, and they're starting to
00:10:50.660 | also integrate their platforms with our GPU profiling.
00:10:55.360 | Especially turbopuffer, who
00:10:58.300 | are interested in improving the performance of their vector engine, right?
00:11:05.420 | And that's really it. Please visit our booth.
00:11:09.400 | The first ten people to sign up for a consultation get two hours for free, if you want to.
00:11:17.820 | And we can also do discounts for seed and Series A startups. That's really it. Thank you so much!
00:11:26.440 | We'll see you next time.