
The GPGPU developer experience has a long way to go


Chapters

0:00 Intro
1:45 But behind the scenes these libraries just call
2:18 There are many things Python programmers can't easily do
5:08 There are various hacks to try to handle these shortcomings
8:07 JAX is an interesting new approach
9:38 JAX shares some of the same constraints, and has some of its own
11:44 Julia can create, debug, and profile GPU kernels directly

Whisper Transcript

00:00:00.000 | >> Hi, everybody. I want to talk about my personal opinions about the GPGPU developer
00:00:08.560 | experience. I feel like we don't talk about developer experience enough. When we talk
00:00:13.760 | about GPGPU, we tend to focus more on performance issues and distributed computing and stuff
00:00:21.320 | like that. I know a lot of the audience here is from an academic background, and so folks
00:00:28.160 | who focus on GPGPU in academia may not fully realize how incredibly popular GPGPU has become
00:00:36.880 | in the last few years. To give you a sense, this is the downloads for CUDA Toolkit from
00:00:46.440 | just one source, which is from the Anaconda Python repository. As you can see, 11.3 has
00:00:54.800 | 1.1 million downloads; 11.4, 1.1 million downloads; 11.1, a million downloads. We've got to a point
00:01:01.880 | now where literally over a million people are downloading CUDA. So what are all these
00:01:10.020 | people doing? They are not writing CUDA kernels. If you look at the Kaggle developer survey,
00:01:18.800 | most developers, or data scientists, are now using things like TensorFlow and PyTorch
00:01:28.480 | and Lightning and fastai. So GPGPU is being used extremely extensively around the world
00:01:35.400 | now through these higher-level libraries and nearly always via Python. But the thing is
00:01:44.920 | that these libraries like PyTorch, behind the scenes, they are calling compiled C libraries
00:01:51.520 | such as cuDNN or the PyTorch C++ library, that is, C and C++ code. Although the Python
00:02:02.960 | developer is working in Python, there's a point at which they can't easily dig any deeper
00:02:10.720 | because it's jumping into compiled code. And in the case of things like cuDNN, it's not
00:02:15.600 | even open source code. So what's the issue? Well, the issue is that for Python programmers,
00:02:24.680 | there are things that they either can't do at all or can't do conveniently. So because it
00:02:31.240 | ends up being turned into these really big precompiled C libraries,
00:02:40.560 | edge deployment can be very difficult. For example, when you install PyTorch, you're
00:02:46.200 | actually installing over a gigabyte. It's an over a gigabyte download. And trying to
00:02:56.520 | turn your Python code into something that you can then put onto a mobile phone or a Raspberry
00:03:01.800 | Pi or whatever is incredibly challenging. But from a developer experience point of view,
00:03:08.480 | it's actually very difficult to debug your work because Python programmers are used to
00:03:15.800 | using the Python debugger, but most of the real work that's being done in your code
00:03:21.040 | is not happening in Python. It's happening in these lower level libraries. So trying
00:03:26.440 | to understand what's really going on is extremely challenging. Same problem for profiling. Obviously
00:03:31.800 | we all want our code to run fast. And that's challenging to do when you can't easily just
00:03:41.760 | use your Python profiler to jump in and see what's going on: where are the holdups, how do
00:03:46.480 | I make it faster? A lot of people think that this is not important.
00:03:54.800 | When I speak to people, they say it's not important that Python programmers can
00:03:58.840 | dig into the underlying kernels and understand them, debug them, and customize them, because
00:04:08.400 | Python programmers are happy working at these higher levels. But actually this is a big
00:04:13.040 | challenge. Because realistically, whether you're doing research or production in industry,
00:04:21.080 | at some point you want to dive in and change things. And in my experience most of the time
00:04:28.760 | there's something I would like to try and change that's buried down inside one of these
00:04:33.960 | precompiled libraries. Also as an educator, it's very hard for me to teach people what's
00:04:39.240 | going on. Because I can't show them the actual code that's really running behind the scenes.
00:04:46.480 | And so for understanding the implementation details, whether it's for an educational reason
00:04:51.360 | or because you want to understand how the algorithm works to think about how you can
00:04:56.560 | improve it, this is either impossible or extremely difficult. And this kind of hackability is
00:05:04.440 | critical for the developer experience, in my opinion. So there are various hacks to try
00:05:11.640 | and handle these deficiencies. So for example PyTorch now has a specialized profiler just
00:05:19.800 | for profiling PyTorch. NVIDIA has a specialized profiler as well. These are really neat tools
00:05:26.600 | and it's really cool that they're being provided for free. But the fact is that it's still
00:05:32.800 | not a great developer experience to have to learn a whole new tool which works in a different
00:05:37.480 | way, and that doesn't actually give you a consistent view of all of your code.
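To make that concrete, here is a minimal sketch of what the PyTorch-specific profiler looks like; the toy model and tensor shapes are invented for illustration, and the point is simply that this is a separate, framework-specific tool rather than the ordinary Python profiler.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy model and input, invented purely so there is something to profile.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)
x = torch.randn(64, 512, device=device)

# PyTorch's own profiler, not the standard Python one:
# a separate tool with its own API and its own output format.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    model(x)

# The report is organized around framework operators and GPU kernels,
# not the lines of Python you actually wrote.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```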
00:05:50.240 | For edge deployment, or even sometimes web hosting, there are hacks, in particular tracing and just-in-time
00:05:56.840 | compilers, which are provided by both TensorFlow and PyTorch. So the idea is that you use the
00:06:06.360 | JIT or the tracing mechanism to basically turn your Python code into
00:06:17.040 | some code in a different form. In particular, it's likely to be ONNX, which is kind of an
00:06:22.520 | open standard for sharing these kind of models. The problem is that Python is a really rich
00:06:31.960 | and dynamic language. So in either of these cases, they're not capable of handling all
00:06:39.880 | of the things that Python can do. So for example, in the case of the PyTorch just-in-time compiler,
00:06:46.640 | there are all kinds of things where it's just going to give you an error and say, I'm sorry,
00:06:49.400 | I don't know how to do that. More frustrating, I find, is that very often it does something
00:06:55.400 | slightly different from how Python works, and it's then very difficult to know why it
00:07:00.800 | worked in Python but didn't when I compiled it to ONNX.
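To make that failure mode concrete, here's a small sketch (the module and file name are invented for illustration): tracing records only the branch that the example input happens to take, so the captured or exported graph can quietly diverge from the Python you wrote.

```python
import torch

class TinyNet(torch.nn.Module):
    """Hypothetical module, invented for illustration."""
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 2)

    def forward(self, x):
        # Data-dependent control flow: tracing only records the branch
        # taken for the example input, so the traced graph behaves
        # differently from this Python code for other inputs.
        if x.sum() > 0:
            return self.fc(x)
        return self.fc(-x)

model = TinyNet().eval()
example = torch.randn(1, 4)

traced = torch.jit.trace(model, example)            # warns: one branch is baked in
torch.onnx.export(model, example, "tiny_net.onnx")  # the exported graph bakes it in too
```

The traced and exported versions run, but they no longer mean quite the same thing as the original Python, which is exactly the debugging headache described above.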
00:07:09.520 | Another very interesting technology is XLA, which comes out of Google and is now available as a back end for both TensorFlow
00:07:15.520 | and PyTorch. It's a similar kind of idea to the PyTorch JIT, but it's something which
00:07:27.060 | is specifically designed around creating a really fast, accelerated version of your code.
00:07:35.200 | And so nowadays it's used, for example, when PyTorch wants to talk to a TPU, it will go
00:07:41.360 | through the XLA compiler, because that's currently the best way to create TPU code.
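For a rough sense of what that path looks like from the user's side, here is a minimal sketch of the PyTorch/XLA lazy-execution style; it assumes the torch_xla package and an XLA device such as a TPU are available, and the model is invented for illustration.

```python
import torch
import torch_xla.core.xla_model as xm  # assumes the torch_xla package is installed

# Invented model and shapes; on a TPU VM this places the computation on an XLA device.
device = xm.xla_device()
model = torch.nn.Linear(128, 10).to(device)
x = torch.randn(32, 128, device=device)

y = model(x)    # operations are recorded lazily into an XLA graph, not run eagerly
xm.mark_step()  # cut the graph here; XLA compiles and executes it on the device
print(y.cpu().shape)
```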
00:07:48.520 | So these are all nice to have, but they have a lot of shortcomings:
00:07:56.700 | they're not nearly as convenient, and not nearly as good a developer experience, as just using
00:08:04.040 | Python and the Python tools that Python programmers are familiar with.
00:08:10.540 | Another very interesting new approach is JAX. JAX is another Google project and it's also
00:08:19.640 | a Python library, but it's actually specifically designed to bring Python over to XLA. So it's
00:08:30.480 | written from the ground up for XLA. And what's particularly interesting about JAX is that
00:08:35.200 | you can kind of write your own kernels. So you're not as limited as you are with tracing
00:08:45.800 | and JIT approaches, where you're still limited to doing just the stuff that your underlying
00:08:51.920 | C or CUDA or whatever library has written for you, whereas with JAX you can do a lot
00:08:59.520 | more stuff. There's a lot more flexibility. And so this is a very interesting approach.
00:09:06.160 | But we still have the problem that the code that's running on the accelerator is not the
00:09:13.120 | code you wrote. It's a transformation of that code through XLA. And so again, profiling
00:09:19.160 | it and debugging it and understanding really what's going on is difficult. Also, in order
00:09:24.400 | to provide these composable transformations, JAX has a very -- I mean, it's very interesting,
00:09:31.960 | but in some ways a very limited programming model. It's highly functional and immutable.
00:09:37.720 | And so JAX ends up with this kind of complexity from this functional programming model. State
00:09:43.160 | management becomes difficult. Things like random number generation become particularly
00:09:49.200 | challenging. And obviously, in my world of machine learning and deep learning, random
00:09:53.680 | numbers are very important, as they are in many other GPU areas.
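As a small sketch of what that functional style means in practice (the function `noisy_scale` and its parameters are invented for illustration): randomness has to be threaded through explicit PRNG keys, because there is no hidden mutable RNG state to fall back on.

```python
import jax
import jax.numpy as jnp

@jax.jit
def noisy_scale(params, x, key):
    # Everything is explicit and immutable: there is no hidden global RNG,
    # so the PRNG key has to be passed in, split, and returned by hand.
    key, subkey = jax.random.split(key)
    noise = jax.random.normal(subkey, x.shape)
    return params["w"] * x + noise, key

key = jax.random.PRNGKey(0)
params = {"w": jnp.float32(2.0)}
x = jnp.ones((4,))

y, key = noisy_scale(params, x, key)  # the caller must thread the new key forward
print(y)
```

The explicit key plumbing is what makes the transformations composable, but it is also the kind of bookkeeping that ordinary Python code never asks of you.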
00:10:00.560 | So I feel like these are all amazing technologies, with so much impressive work going on. But they don't feel
00:10:08.480 | like the really long-term solutions. I don't see how any of these things quite
00:10:14.680 | end up giving us the developer experience we'd like to be able to offer. Another very
00:10:21.560 | interesting technology I wanted to mention is TVM. So TVM is an Apache project nowadays.
00:10:27.720 | And you can use TVM directly from Python. And you basically end up creating these compute
00:10:35.040 | expressions. In this case, using a lambda. And if you're familiar with something like
00:10:41.440 | Halide, it's a similar kind of idea: you can basically create a schedule in which you specify
00:10:48.320 | the various ways that you think it might best be run on an accelerator.
00:10:56.240 | And in this case, you're actually binding axes to blocks and threads on the accelerator.
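For concreteness, here is roughly what that looks like using TVM's older tensor-expression (te) API, along the lines of its vector-add tutorial; this is a sketch that assumes a CUDA-enabled build of TVM, and the names are just for illustration.

```python
import tvm
from tvm import te

# Compute expression: an element-wise add described with a lambda.
n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")

# Schedule: split the loop and bind the pieces to GPU blocks and threads.
s = te.create_schedule(C.op)
bx, tx = s[C].split(C.op.axis[0], factor=64)
s[C].bind(bx, te.thread_axis("blockIdx.x"))
s[C].bind(tx, te.thread_axis("threadIdx.x"))

# Compile to a CUDA kernel without writing any CUDA by hand.
fadd = tvm.build(s, [A, B, C], target="cuda")
```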
00:11:03.240 | This is a super convenient way to write kernels. And more importantly, perhaps, it also has
00:11:09.560 | things like auto schedulers. So this is how you can create things that run as fast as
00:11:16.560 | cuDNN or, you know, specialized linear algebra libraries from Nvidia or whatever, without
00:11:23.360 | having to write all those unrolled loops and the memory management and whatnot. But
00:11:30.360 | as you can see in the end, it's still not anywhere near as convenient as writing normal
00:11:36.080 | Python. And the thing you end up with is, you know, this kind of compiled code that
00:11:40.800 | again has all the kind of developer experience issues I described before. Perhaps the most
00:11:47.520 | interesting path for the future for me right now is Julia. Julia is a fairly new language.
00:11:57.920 | But what's really interesting from a GPU standpoint is it handles nearly all of the developer
00:12:05.120 | experience problems I described. Nearly none of them exist in Julia. And the key thing
00:12:10.120 | is that in Julia, you can write kernels that look a lot like you would write in CUDA but
00:12:17.640 | with less boilerplate. And you can do parallelized operations. You can handle memory. That can
00:12:29.840 | all be done in Julia. And so I think this is a really underappreciated important idea,
00:12:39.080 | which is that developers should be able to use the same language and the same tools throughout
00:12:45.320 | the hierarchy of abstractions in their program. Again, speaking as an educator, this is incredibly
00:12:51.600 | important for teaching people what's going on. It's really important for a researcher
00:12:58.920 | because you can hack in at any level. It's really important in industry because you can
00:13:03.440 | ensure that you can jump in and make sure the performance is working properly for you
00:13:10.360 | at every level. And it also opens up the research world in such a way that things aren't off
00:13:20.480 | the table. I find that the things that get worked on in deep learning research are the
00:13:24.880 | things that are kind of conveniently accessible through libraries. And a lot of stuff that
00:13:32.400 | isn't has just not really been touched because it requires people to go in and write their
00:13:36.700 | own CUDA kernels. And very, very few people have the patience to do that, at least in
00:13:43.280 | the deep learning world. So yeah, really, I guess this is a bit of a plea for the GPGPU
00:13:56.020 | community to consider building the next generation of languages and tools, which allows developers
00:14:09.960 | to really do everything that they might want to do in a convenient way. For Julia, I feel
00:14:15.800 | like there are a lot of gaps in the developer experience there more generally, which I think
00:14:21.960 | the community is very familiar with, around deployment and the amount of memory it
00:14:27.000 | requires and the latency it takes to start up and so forth. But I do think at
00:14:32.680 | least with Julia, it feels like there's a path there that could eventually
00:14:38.800 | lead to a really beautiful developer experience. And that's not a path that I see available
00:14:45.160 | in really any of the Python frameworks that I see right now. And I would love to see things
00:14:53.400 | like TVM, with those ideas, being more integrated into languages and tools. So yeah, that's
00:15:02.720 | the end of my thoughts on that. Thanks very much.