The GPGPU developer experience has a long way to go
Chapters
0:00 Intro
1:45 But behind the scenes these libraries just call
2:18 There are many things Python programmers can't easily do
5:08 There are various hacks to try to handle these shortcomings
8:07 JAX is an interesting new approach
9:38 JAX shares some of the same constraints, and has some of its own
11:44 Julia can create, debug, and profile GPU kernels directly
>> Hi, everybody. I want to talk about my personal opinions on the GPGPU developer experience. I feel like we don't talk about developer experience enough; when we talk about GPGPU, we tend to focus more on performance issues, distributed computing, and so on. I know a lot of the audience here is from an academic background, and folks who focus on GPGPU in academia may not fully realize how incredibly popular GPGPU has become in the last few years. To give you a sense, here are the downloads for the CUDA Toolkit from just one source, the Anaconda Python repository. As you can see, 11.3 has 1.1 million downloads, 11.4 has 1.1 million downloads, and so on. We've got to a point now where literally over a million people are downloading CUDA.
So what are all these people doing? They are not writing CUDA kernels. If you look at the Kaggle developer survey, most developers and data scientists are now using things like TensorFlow, PyTorch, PyTorch Lightning, and fastai. So GPGPU is being used extremely widely around the world now through these higher-level libraries, and nearly always via Python. But the thing is that these libraries, like PyTorch, are behind the scenes calling compiled C and C++ libraries such as cuDNN or the PyTorch C++ backend. Although the Python developer is working in Python, there's a point at which they can't easily dig any deeper, because execution jumps into compiled code. And in the case of things like cuDNN, it's not even open source code.
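As a minimal sketch of that boundary, consider trying to read the Python source of a core op such as torch.matmul: there is nothing to read, because the name is just a binding into the compiled extension.

```python
# A minimal sketch: the Python-level op is a binding into compiled code,
# so there is no Python source to read or step into from here.
import inspect
import torch

print(type(torch.matmul))   # a built-in function from the C extension
try:
    inspect.getsource(torch.matmul)
except TypeError as err:
    print("No Python source available:", err)
```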
So what's the issue? Well, the issue is that for Python programmers, there are things they either can't do at all or can't do conveniently. Because everything ends up inside these very large precompiled libraries, edge deployment can be very difficult. For example, when you install PyTorch, you're actually installing over a gigabyte; it's an over-a-gigabyte download. And trying to turn your Python code into something you can put onto a mobile phone or a Raspberry Pi or whatever is incredibly challenging.
But from a developer experience point of view, it's also very difficult to debug your work. Python programmers are used to using the Python debugger, but most of the real work being done in your code is not happening in Python; it's happening in these lower-level libraries. So trying to understand what's really going on is extremely challenging. The same problem applies to profiling. Obviously we all want our code to run fast, and that's hard to do when you can't easily use your Python profiler to jump in and see what's going on: where are the holdups, how do I make it faster?
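Here is a minimal sketch of what I mean: profile a single PyTorch layer with the standard Python tooling, and nearly all of the time is attributed to opaque built-in calls, because the real work happens past the point where Python-level tools can see.

```python
# A minimal sketch: profiling a single PyTorch layer with the standard
# library profiler. Nearly all time is attributed to opaque built-in
# calls, because the real work happens inside the compiled extension.
import cProfile
import torch

model = torch.nn.Linear(1024, 1024)
x = torch.randn(256, 1024)

# cProfile runs the statement in the __main__ namespace, so this works
# when the file is executed as a script. pdb hits the same wall: there
# is no Python frame to step into once execution enters compiled code.
cProfile.run("y = model(x)", sort="cumulative")
```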
A lot of people I speak to think this isn't important. They say it doesn't matter that Python programmers can't dig into the underlying kernels and understand them, debug them, and customize them, because Python programmers are happy working at these higher levels. But actually this is a big problem, because realistically, whether you're doing research or production work in industry, at some point you want to dive in and change things, and in my experience most of the time there is something I would like to try and change that's buried down inside one of these precompiled libraries. Also, as an educator, it's very hard for me to teach people what's going on, because I can't show them the actual code that's really running behind the scenes. So for understanding implementation details, whether for educational reasons or because you want to understand how an algorithm works in order to improve it, this is either impossible or extremely difficult. And this kind of hackability is critical for the developer experience, in my opinion.
critical for the developer experience, in my opinion. So there's various hacks to try 00:05:11.640 |
and handle these deficiencies. So for example PyTorch now has a specialized profiler just 00:05:19.800 |
for profiling PyTorch. NVIDIA has a specialized profiler as well. These are really neat tools 00:05:26.600 |
and it's really cool that they're being provided for free. But the fact is that it's still 00:05:32.800 |
not a great developer experience to have to learn a whole new tool which works in a different 00:05:37.480 |
way and that's not actually giving you a consistent view of all of your code. For edge deployment 00:05:50.240 |
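As a minimal sketch of the kind of specialized tool I mean, here is the PyTorch profiler used on a toy model; it reports the underlying ATen and CUDA kernels rather than the Python lines you wrote, which is exactly the separate view you have to learn to read.

```python
# A minimal sketch, assuming a reasonably recent PyTorch, of the dedicated
# profiler: it reports the underlying ATen/CUDA kernels rather than the
# Python lines you actually wrote.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())
x = torch.randn(64, 512)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    model(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```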
For edge deployment, or even sometimes for web hosting, there are workarounds, in particular tracing and a just-in-time compiler, provided by both TensorFlow and PyTorch. The idea is that you use the JIT or the tracing mechanism to turn your Python code into code in a different form, most likely ONNX, which is an open standard for sharing these kinds of models. The problem is that Python is a really rich and dynamic language, and neither approach can handle everything Python can do. For example, the PyTorch just-in-time compiler will simply give you an error for all kinds of constructs and say, sorry, I don't know how to do that. Even more frustrating, I find, is that it very often does something slightly different from how Python works, and it's then very difficult to figure out why your code worked in Python but didn't work once compiled to ONNX.
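Here is a minimal sketch of that workflow with PyTorch (the module and file names are just illustrative): trace a small module and export it to ONNX. The data-dependent branch in the forward pass is exactly the kind of Python-level behaviour tracing cannot capture; only the branch taken for the example input ends up in the exported graph.

```python
# A minimal sketch of the tracing/export workaround. The data-dependent
# branch is what tracing cannot capture: only the branch taken for the
# example input is baked into the traced graph.
import torch

class TinyNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(16, 4)

    def forward(self, x):
        if x.sum() > 0:                    # Python-level control flow
            return torch.relu(self.fc(x))
        return self.fc(x)

model = TinyNet().eval()
example = torch.randn(1, 16)

traced = torch.jit.trace(model, example)               # emits a TracerWarning
torch.onnx.export(model, (example,), "tinynet.onnx")   # classic export path, also via tracing
```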
Another very interesting technology is XLA, which comes out of Google and is now available as a backend for both TensorFlow and PyTorch. It's a similar idea to the PyTorch JIT, but it's specifically designed around creating a really fast, accelerated version of your code. Nowadays it's used, for example, when PyTorch wants to talk to a TPU: it goes through the XLA compiler, because that's currently the best way to generate TPU code.
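For instance, here is a rough sketch, assuming the separate torch_xla package is installed, of how a PyTorch model reaches an XLA device such as a TPU core; the exact API details vary by version.

```python
# A rough sketch, assuming the separate torch_xla package is installed
# (API details vary by version): a PyTorch model reaching an XLA device,
# such as a TPU core, through the XLA compiler.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = torch.nn.Linear(128, 128).to(device)
x = torch.randn(32, 128, device=device)

y = model(x)      # operations are staged into an XLA graph
xm.mark_step()    # cut the graph so XLA compiles and executes it
```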
So these are all nice to have, but they have a lot of shortcomings, and they're not nearly as convenient, nor as good a developer experience, as using plain Python with the tools Python programmers are already familiar with.
Another very interesting new approach is JAX. JAX is another Google project, and it's also a Python library, but it's specifically designed to bring Python over to XLA; it's written from the ground up for XLA. What's particularly interesting about JAX is that you can effectively write your own kernels, so you're not as limited as you are with tracing and JIT approaches, where you're restricted to whatever the underlying C or CUDA library has already implemented for you. With JAX you can do a lot more; there's a lot more flexibility. So this is a very interesting approach.
But we still have the problem that the code running on the accelerator is not the code you wrote; it's a transformation of that code through XLA. So again, profiling it, debugging it, and understanding what's really going on is difficult. Also, in order to provide these composable transformations, JAX has a very interesting but in some ways very limited programming model: it's highly functional and immutable. JAX therefore ends up with a certain amount of complexity from this functional programming model. State management becomes difficult, and things like random number generation become particularly challenging. And obviously, in my world of machine learning and deep learning, random numbers are very important, as they are in many other GPU areas.
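To give a feel for that programming model, here is a minimal sketch: jax.jit compiles a pure function through XLA, and random numbers require explicitly created and split PRNG keys rather than a hidden global generator.

```python
# A minimal sketch of JAX's functional style: jit compilation through XLA,
# and explicit PRNG keys instead of a hidden global random state.
import jax
import jax.numpy as jnp

@jax.jit
def noisy_scale(x, key):
    # The key must be passed in; there is no mutable global RNG to draw from.
    noise = jax.random.normal(key, x.shape)
    return 2.0 * x + noise

key = jax.random.PRNGKey(0)
key, subkey = jax.random.split(key)   # split before every use, by convention
print(noisy_scale(jnp.ones((4,)), subkey))
```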
So these all feel like amazing technologies, with so much impressive work going on, but they don't feel like the real long-term solution. I don't see how any of them quite ends up giving us the developer experience we'd like to be able to offer.
Another very interesting technology I wanted to mention is TVM. TVM is an Apache project nowadays, and you can use it directly from Python. You basically end up creating compute expressions, in this case using a lambda. If you're familiar with something like Halide, it's a similar idea: you then create a schedule, in which you describe the various ways you think the computation might best be run on an accelerator. In this case, you're actually binding axes to blocks and threads on the accelerator.
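Here is a rough sketch of that workflow, using TVM's classic tensor-expression and schedule interface (module paths and APIs vary across TVM versions): a vector add described as a compute expression, with its single axis split and bound to GPU blocks and threads.

```python
# A rough sketch using TVM's classic tensor-expression/schedule interface
# (APIs vary across TVM versions): a vector add, with its axis split and
# bound to GPU blocks and threads.
import tvm
from tvm import te

n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")  # the compute expression

s = te.create_schedule(C.op)                   # the schedule: how to run it
bx, tx = s[C].split(C.op.axis[0], factor=64)
s[C].bind(bx, te.thread_axis("blockIdx.x"))    # bind axes to GPU blocks...
s[C].bind(tx, te.thread_axis("threadIdx.x"))   # ...and threads

fadd = tvm.build(s, [A, B, C], target="cuda", name="vector_add")
```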
This is a super convenient way to write kernels. And perhaps more importantly, TVM also has things like auto-schedulers, which is how you can create code that runs as fast as cuDNN or NVIDIA's specialized linear algebra libraries without having to write all the unrolled loops and memory management yourself. But as you can see, in the end it's still nowhere near as convenient as writing normal Python, and what you end up with is compiled code that again has all the developer experience issues I described before.
Perhaps the most interesting path for the future, for me right now, is Julia. Julia is a fairly new language, but what's really interesting from a GPU standpoint is that it addresses nearly all of the developer experience problems I've described; almost none of them exist in Julia. The key thing is that in Julia you can write kernels that look a lot like what you would write in CUDA, but with less boilerplate. You can write parallelized operations, you can handle memory, and it can all be done in Julia. I think this reflects a really underappreciated and important idea: developers should be able to use the same language and the same tools throughout the hierarchy of abstractions in their program.
Again, speaking as an educator, this is incredibly important for teaching people what's going on. It's really important for researchers, because you can hack in at any level. It's really important in industry, because you can jump in and make sure the performance is right at every level. And it also opens up the research world, so that things aren't off the table. I find that the things that get worked on in deep learning research are the things that are conveniently accessible through libraries, and a lot of what isn't accessible has simply not been touched, because it would require people to go in and write their own CUDA kernels, and very, very few people have the patience to do that, at least in the deep learning world.
So really, I guess this is a bit of a plea for the GPGPU community to consider building the next generation of languages and tools, ones that allow developers to do everything they might want to do in a convenient way. With Julia, I feel there are still a lot of gaps in the developer experience more generally, which I think the community is very familiar with: deployment, memory use, startup latency, and so forth. But at least with Julia, it feels like there's a path that could eventually lead to a really beautiful developer experience, and that's not a path I see available in any of the Python frameworks right now. And I would love to see things like TVM become more integrated, with those ideas built into languages and tools. So that's the end of my thoughts on that. Thanks very much.