The GPGPU developer experience has a long way to go
Chapters
0:00 Intro
1:45 But behind the scenes these libraries just call
2:18 There are many things Python programmers can't easily do
5:08 There are various hacks to try to handle these shortcomings
8:07 JAX is an interesting new approach
9:38 JAX shares some of the same constraints, and has some of its own
11:44 Julia can create, debug, and profile GPU kernels directly
>> Hi, everybody. I want to talk about my personal opinions on the GPGPU developer experience. I feel like we don't talk about developer experience enough; when we talk about GPGPU, we tend to focus more on performance issues, distributed computing, and so on. I know a lot of the audience here is from an academic background, and folks who focus on GPGPU in academia may not fully realize how incredibly popular GPGPU has become in the last few years. To give you a sense, here are the downloads for the CUDA Toolkit from just one source, the Anaconda Python repository. As you can see, 11.3 has 1.1 million downloads, 11.4 has 1.1 million downloads, and so on. We've got to a point now where literally over a million people are downloading CUDA.
So what are all these people doing? They are not writing CUDA kernels. If you look at the Kaggle developer survey, most developers and data scientists are now using things like TensorFlow, PyTorch, PyTorch Lightning, and fastai. So GPGPU is being used extremely widely around the world now through these higher-level libraries, and nearly always via Python. But the thing is that these libraries, like PyTorch, are behind the scenes calling compiled C and C++ libraries such as cuDNN or the PyTorch C++ backend. Although the Python developer is working in Python, there's a point at which they can't easily dig any deeper, because execution jumps into compiled code. And in the case of things like cuDNN, it's not even open source code.
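As a minimal sketch of that boundary, consider trying to read the Python source of a core op such as torch.matmul: there is nothing to read, because the name is just a binding into the compiled extension.

```python
# A minimal sketch: the Python-level op is a binding into compiled code,
# so there is no Python source to read or step into from here.
import inspect
import torch

print(type(torch.matmul))   # a built-in function from the C extension
try:
    inspect.getsource(torch.matmul)
except TypeError as err:
    print("No Python source available:", err)
```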
So what's the issue? Well, the issue is that for Python programmers, there are things they either can't do at all or can't do conveniently. Because everything ends up inside these very large precompiled libraries, edge deployment can be very difficult. For example, when you install PyTorch, you're actually installing over a gigabyte; it's an over-a-gigabyte download. And trying to turn your Python code into something you can put onto a mobile phone or a Raspberry Pi or whatever is incredibly challenging.
But from a developer experience point of view, it's also very difficult to debug your work. Python programmers are used to using the Python debugger, but most of the real work being done in your code is not happening in Python; it's happening in these lower-level libraries. So trying to understand what's really going on is extremely challenging. The same problem applies to profiling. Obviously we all want our code to run fast, and that's hard to do when you can't easily use your Python profiler to jump in and see what's going on: where are the holdups, how do I make it faster?
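Here is a minimal sketch of what I mean: profile a single PyTorch layer with the standard Python tooling, and nearly all of the time is attributed to opaque built-in calls, because the real work happens past the point where Python-level tools can see.

```python
# A minimal sketch: profiling a single PyTorch layer with the standard
# library profiler. Nearly all time is attributed to opaque built-in
# calls, because the real work happens inside the compiled extension.
import cProfile
import torch

model = torch.nn.Linear(1024, 1024)
x = torch.randn(256, 1024)

# cProfile runs the statement in the __main__ namespace, so this works
# when the file is executed as a script. pdb hits the same wall: there
# is no Python frame to step into once execution enters compiled code.
cProfile.run("y = model(x)", sort="cumulative")
```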
A lot of people I speak to think this isn't important. They say it doesn't matter that Python programmers can't dig into the underlying kernels and understand them, debug them, and customize them, because Python programmers are happy working at these higher levels. But actually this is a big problem, because realistically, whether you're doing research or production work in industry, at some point you want to dive in and change things, and in my experience most of the time there is something I would like to try and change that's buried down inside one of these precompiled libraries. Also, as an educator, it's very hard for me to teach people what's going on, because I can't show them the actual code that's really running behind the scenes. So for understanding implementation details, whether for educational reasons or because you want to understand how an algorithm works in order to improve it, this is either impossible or extremely difficult. And this kind of hackability is critical for the developer experience, in my opinion.
critical for the developer experience, in my opinion. So there's various hacks to try 00:05:11.640 |
and handle these deficiencies. So for example PyTorch now has a specialized profiler just 00:05:19.800 |
for profiling PyTorch. NVIDIA has a specialized profiler as well. These are really neat tools 00:05:26.600 |
and it's really cool that they're being provided for free. But the fact is that it's still 00:05:32.800 |
not a great developer experience to have to learn a whole new tool which works in a different 00:05:37.480 |
way and that's not actually giving you a consistent view of all of your code. For edge deployment 00:05:50.240 |
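As a minimal sketch of the kind of specialized tool I mean, here is the PyTorch profiler used on a toy model; it reports the underlying ATen and CUDA kernels rather than the Python lines you wrote, which is exactly the separate view you have to learn to read.

```python
# A minimal sketch, assuming a reasonably recent PyTorch, of the dedicated
# profiler: it reports the underlying ATen/CUDA kernels rather than the
# Python lines you actually wrote.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())
x = torch.randn(64, 512)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    model(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```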
For edge deployment, or even sometimes for web hosting, there are workarounds, in particular tracing and a just-in-time compiler, provided by both TensorFlow and PyTorch. The idea is that you use the JIT or the tracing mechanism to turn your Python code into code in a different form, most likely ONNX, which is an open standard for sharing these kinds of models. The problem is that Python is a really rich and dynamic language, and neither approach can handle everything Python can do. For example, the PyTorch just-in-time compiler will simply give you an error for all kinds of constructs and say, sorry, I don't know how to do that. Even more frustrating, I find, is that it very often does something slightly different from how Python works, and it's then very difficult to figure out why your code worked in Python but didn't work once compiled to ONNX.
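Here is a minimal sketch of that workflow with PyTorch (the module and file names are just illustrative): trace a small module and export it to ONNX. The data-dependent branch in the forward pass is exactly the kind of Python-level behaviour tracing cannot capture; only the branch taken for the example input ends up in the exported graph.

```python
# A minimal sketch of the tracing/export workaround. The data-dependent
# branch is what tracing cannot capture: only the branch taken for the
# example input is baked into the traced graph.
import torch

class TinyNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(16, 4)

    def forward(self, x):
        if x.sum() > 0:                    # Python-level control flow
            return torch.relu(self.fc(x))
        return self.fc(x)

model = TinyNet().eval()
example = torch.randn(1, 16)

traced = torch.jit.trace(model, example)               # emits a TracerWarning
torch.onnx.export(model, (example,), "tinynet.onnx")   # classic export path, also via tracing
```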
Another very interesting technology is XLA, which comes out of Google and is now available as a backend for both TensorFlow and PyTorch. It's a similar idea to the PyTorch JIT, but it's specifically designed around creating a really fast, accelerated version of your code. Nowadays it's used, for example, when PyTorch wants to talk to a TPU: it goes through the XLA compiler, because that's currently the best way to generate TPU code.
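For instance, here is a rough sketch, assuming the separate torch_xla package is installed, of how a PyTorch model reaches an XLA device such as a TPU core; the exact API details vary by version.

```python
# A rough sketch, assuming the separate torch_xla package is installed
# (API details vary by version): a PyTorch model reaching an XLA device,
# such as a TPU core, through the XLA compiler.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = torch.nn.Linear(128, 128).to(device)
x = torch.randn(32, 128, device=device)

y = model(x)      # operations are staged into an XLA graph
xm.mark_step()    # cut the graph so XLA compiles and executes it
```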
So these are all nice to have, but they have a lot of shortcomings, and they're not nearly as convenient, nor as good a developer experience, as using plain Python with the tools Python programmers are already familiar with.
Another very interesting new approach is JAX. JAX is another Google project, and it's also a Python library, but it's specifically designed to bring Python over to XLA; it's written from the ground up for XLA. What's particularly interesting about JAX is that you can effectively write your own kernels, so you're not as limited as you are with tracing and JIT approaches, where you're restricted to whatever the underlying C or CUDA library has already implemented for you. With JAX you can do a lot more; there's a lot more flexibility. So this is a very interesting approach.
But we still have the problem that the code running on the accelerator is not the code you wrote; it's a transformation of that code through XLA. So again, profiling it, debugging it, and understanding what's really going on is difficult. Also, in order to provide these composable transformations, JAX has a very interesting but in some ways very limited programming model: it's highly functional and immutable. JAX therefore ends up with a certain amount of complexity from this functional programming model. State management becomes difficult, and things like random number generation become particularly challenging. And obviously, in my world of machine learning and deep learning, random numbers are very important, as they are in many other GPU areas.
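To give a feel for that programming model, here is a minimal sketch: jax.jit compiles a pure function through XLA, and random numbers require explicitly created and split PRNG keys rather than a hidden global generator.

```python
# A minimal sketch of JAX's functional style: jit compilation through XLA,
# and explicit PRNG keys instead of a hidden global random state.
import jax
import jax.numpy as jnp

@jax.jit
def noisy_scale(x, key):
    # The key must be passed in; there is no mutable global RNG to draw from.
    noise = jax.random.normal(key, x.shape)
    return 2.0 * x + noise

key = jax.random.PRNGKey(0)
key, subkey = jax.random.split(key)   # split before every use, by convention
print(noisy_scale(jnp.ones((4,)), subkey))
```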
So these all feel like amazing technologies, with so much impressive work going on, but they don't feel like the real long-term solution. I don't see how any of them quite ends up giving us the developer experience we'd like to be able to offer.
Another very interesting technology I wanted to mention is TVM. TVM is an Apache project nowadays, and you can use it directly from Python. You basically end up creating compute expressions, in this case using a lambda. If you're familiar with something like Halide, it's a similar idea: you then create a schedule, in which you describe the various ways you think the computation might best be run on an accelerator. In this case, you're actually binding axes to blocks and threads on the accelerator.
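Here is a rough sketch of that workflow, using TVM's classic tensor-expression and schedule interface (module paths and APIs vary across TVM versions): a vector add described as a compute expression, with its single axis split and bound to GPU blocks and threads.

```python
# A rough sketch using TVM's classic tensor-expression/schedule interface
# (APIs vary across TVM versions): a vector add, with its axis split and
# bound to GPU blocks and threads.
import tvm
from tvm import te

n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")  # the compute expression

s = te.create_schedule(C.op)                   # the schedule: how to run it
bx, tx = s[C].split(C.op.axis[0], factor=64)
s[C].bind(bx, te.thread_axis("blockIdx.x"))    # bind axes to GPU blocks...
s[C].bind(tx, te.thread_axis("threadIdx.x"))   # ...and threads

fadd = tvm.build(s, [A, B, C], target="cuda", name="vector_add")
```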
This is a super convenient way to write kernels. And perhaps more importantly, TVM also has things like auto-schedulers, which is how you can create code that runs as fast as cuDNN or NVIDIA's specialized linear algebra libraries without having to write all the unrolled loops and memory management yourself. But as you can see, in the end it's still nowhere near as convenient as writing normal Python, and what you end up with is compiled code that again has all the developer experience issues I described before.
Perhaps the most interesting path for the future, for me right now, is Julia. Julia is a fairly new language, but what's really interesting from a GPU standpoint is that it addresses nearly all of the developer experience problems I've described; almost none of them exist in Julia. The key thing is that in Julia you can write kernels that look a lot like what you would write in CUDA, but with less boilerplate. You can write parallelized operations, you can handle memory, and it can all be done in Julia. I think this reflects a really underappreciated and important idea: developers should be able to use the same language and the same tools throughout the hierarchy of abstractions in their program.
Again, speaking as an educator, this is incredibly important for teaching people what's going on. It's really important for researchers, because you can hack in at any level. It's really important in industry, because you can jump in and make sure the performance is right at every level. And it also opens up the research world, so that things aren't off the table. I find that the things that get worked on in deep learning research are the things that are conveniently accessible through libraries, and a lot of what isn't accessible has simply not been touched, because it would require people to go in and write their own CUDA kernels, and very, very few people have the patience to do that, at least in the deep learning world.
So really, I guess this is a bit of a plea for the GPGPU community to consider building the next generation of languages and tools, ones that allow developers to do everything they might want to do in a convenient way. With Julia, I feel there are still a lot of gaps in the developer experience more generally, which I think the community is very familiar with: deployment, memory use, startup latency, and so forth. But at least with Julia, it feels like there's a path that could eventually lead to a really beautiful developer experience, and that's not a path I see available in any of the Python frameworks right now. And I would love to see things like TVM become more integrated, with those ideas built into languages and tools. So that's the end of my thoughts on that. Thanks very much.