
The GPGPU developer experience has a long way to go


Chapters

0:00 Intro
1:45 But behind the scenes these libraries just call
2:18 There are many things Python programmers can't easily do
5:08 There are various hacks to try to handle these shortcomings
8:07 JAX is an interesting new approach
9:38 JAX shares some of the same constraints, and has some of its own
11:44 Julia can create, debug, and profile GPU kernels directly

Transcript

>> Hi, everybody. I want to talk about my personal opinions about the GPGPU developer experience. I feel like we don't talk about developer experience enough. When we talk about GPGPU, we tend to focus more on performance issues and distributed computing and stuff like that. I know a lot of the audience here is from an academic background, and so folks who focus on GPGPU in academia may not fully realize how incredibly popular GPGPU has become in the last few years.

To give you a sense, these are the downloads for the CUDA Toolkit from just one source, the Anaconda Python repository. As you can see, 11.3 has 1.1 million downloads, 11.4 has 1.1 million downloads, and so on. We've got to a point now where literally over a million people are downloading CUDA.

So what are all these people doing? They are not writing CUDA kernels. If you look at the Kaggle developer survey, most developers and data scientists are now using things like TensorFlow and PyTorch and Lightning and fastai. So GPGPU is being used extremely extensively around the world now through these higher-level libraries, and nearly always via Python.

But the thing is that these libraries like PyTorch, behind the scenes, are calling compiled C and C++ libraries such as cuDNN or the PyTorch C++ library. Although the Python developer is working in Python, there's a point at which they can't easily dig any deeper, because it's jumping into compiled code.

And in the case of things like cuDNN, it's not even open source code. So what's the issue? Well, the issue is that for Python programmers, there are things that they either can't do at all or can't do conveniently. Because everything ends up being turned into these really very big precompiled C libraries, edge deployment can be very difficult.

For example, when you install PyTorch, you're actually installing over a gigabyte; it's an over-a-gigabyte download. And trying to turn your Python code into something that you can then put onto a mobile phone or a Raspberry Pi or whatever is incredibly challenging. But from a developer experience point of view, it's actually very difficult to debug your work, because Python programmers are used to using the Python debugger, but most of the real work that's being done in your code is not happening in Python.

It's happening in these lower-level libraries. So trying to understand what's really going on is extremely challenging. Same problem for profiling. Obviously we all want our code to run fast, and that's challenging to do when you can't easily just use your Python profiler to jump in and see what's going on: where are the holdups, and how do I make it faster?

When I speak to people, a lot of them say it's not important that Python programmers be able to dig into the underlying kernels and understand them and debug them and customize them, because Python programmers are happy working at these higher levels. But actually this is a big challenge.

Because realistically, whether you're doing research or production in industry, at some point you want to dive in and change things. And in my experience most of the time there's something I would like to try and change that's buried down inside one of these precompiled libraries. Also as an educator, it's very hard for me to teach people what's going on.

Because I can't show them the actual code that's really running behind the scenes. And so for understanding the implementation details, whether it's for an educational reason or because you want to understand how the algorithm works to think about how you can improve it, this is either impossible or extremely difficult.

And this kind of hackability is critical for the developer experience, in my opinion. So there are various hacks to try and handle these deficiencies. For example, PyTorch now has a specialized profiler just for profiling PyTorch, and NVIDIA has a specialized profiler as well. These are really neat tools, and it's really cool that they're being provided for free.
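For example, here's roughly what the PyTorch-specific profiler looks like in use (a minimal sketch built on the torch.profiler module, with a toy matmul workload chosen just for illustration):

    import torch
    from torch.profiler import profile, ProfilerActivity

    # Profile a tiny matmul-plus-backward on the GPU and print a summary table.
    x = torch.randn(1024, 1024, device="cuda", requires_grad=True)

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        y = (x @ x).sum()
        y.backward()

    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))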

But the fact is that it's still not a great developer experience to have to learn a whole new tool which works in a different way and doesn't actually give you a consistent view of all of your code. For edge deployment, or even sometimes web hosting, there are hacks, in particular tracing and a just-in-time compiler, which are provided by both TensorFlow and PyTorch.

So the idea is that you use the JIT or the tracing mechanism to basically turn your Python code into, you know, some code in a different form. In particular, it's likely to be ONNX, which is kind of an open standard for sharing these kinds of models. The problem is that Python is a really rich and dynamic language.

So in either of these cases, they're not capable of handling all of the things that Python can do. So for example, in the case of the PyTorch just-in-time compiler, there are all kinds of things where it's just going to give you an error and say, "I'm sorry, I don't know how to do that."
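For example, here's a minimal sketch of the kind of perfectly ordinary Python that the TorchScript compiler rejects, because it requires each variable to keep a single static type:

    import torch

    # `y` is a str on one branch and an int on the other -- fine in Python,
    # but TorchScript's static type checker refuses to compile it.
    def label(x: torch.Tensor):
        if bool(x.sum() > 0):
            y = "positive"
        else:
            y = 0
        return y

    try:
        torch.jit.script(label)
    except Exception as e:
        print("TorchScript refused:", e)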

More frustrating, I find, is that very often it does something slightly different from how Python works, and it's then very difficult to know why it worked in Python but didn't work when compiled to ONNX. Another very interesting technology is XLA, which comes out of Google and is now available as a back end for both TensorFlow and PyTorch.

It's a similar kind of idea to the PyTorch JIT, but it's something which is specifically designed around creating a really accelerated, fast version of your code. And so nowadays, for example, when PyTorch wants to talk to a TPU, it will go through the XLA compiler, because that's the best way to create TPU code at this stage.
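As a sketch of what that path looks like from the user's side (this assumes the separate torch_xla package; exact setup and API details vary by version and require an actual XLA device such as a TPU):

    import torch
    import torch_xla.core.xla_model as xm

    device = xm.xla_device()          # an XLA device, e.g. a TPU core
    x = torch.randn(4, 4, device=device)
    y = (x @ x).sum()
    xm.mark_step()                    # hand the recorded operations to the XLA compiler
    print(y.item())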

So these are all nice to have, but they, you know, have a lot of shortcomings; they're not nearly as convenient and not nearly as good a developer experience as just using Python and the Python tools that Python programmers are familiar with. Another very interesting new approach is JAX.

JAX is another Google project and it's also a Python library, but it's actually specifically designed to bring Python over to XLA. So it's written from the ground up for XLA. And what's particularly interesting about JAX is that you can kind of write your own kernels. So you're not as limited as you are with tracing and JIT approaches.

With those approaches, you're still limited to doing just the stuff that your underlying C or CUDA or whatever library has written for you, whereas with JAX you can do a lot more. There's a lot more flexibility. And so this is a very interesting approach. But we still have the problem that the code that's running on the accelerator is not the code you wrote.

It's a transformation of that code through XLA. And so again, profiling it and debugging it and understanding really what's going on is difficult. Also, in order to provide these composable transformations, JAX has a very interesting but, in some ways, very limited programming model.

It's highly functional and immutable. And so JAX ends up with this kind of complexity from this functional programming model. State management becomes difficult. Things like random number generation become particularly challenging. And obviously, in my world of machine learning and deep learning, random numbers are very important, as they are in many other GPU areas.
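For example, instead of a hidden global random state, JAX makes you carry explicit PRNG keys around and split them by hand; a minimal sketch:

    import jax
    import jax.numpy as jnp

    # No hidden RNG state: create an explicit key and split it whenever you
    # want fresh randomness.
    key = jax.random.PRNGKey(0)
    key, subkey = jax.random.split(key)
    x = jax.random.normal(subkey, (3,))

    # Reusing a key reproduces the same numbers -- the "state" is entirely
    # yours to thread through the program, which gets awkward in a big model.
    assert jnp.allclose(x, jax.random.normal(subkey, (3,)))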

So I feel like these are all, like, amazing technologies. There's so much impressive work going on. But they don't feel like, you know, the really long-term solution. I don't see how any of these things quite ends up giving us the developer experience we'd like to be able to offer. Another very interesting technology I wanted to mention is TVM.

So TVM is an Apache project nowadays, and you can use TVM directly from Python. You basically end up creating these compute expressions, in this case using a lambda. And if you're familiar with something like Halide, it's a similar kind of idea: you can then create a schedule, where you express the various ways that you think it might best be run on an accelerator.

And in this case, you're actually binding axes to blocks and threads on the accelerator. This is a super convenient way to write kernels. And more importantly, perhaps, it also has things like auto-schedulers. So this is how you can create things that run as fast as cuDNN or, you know, the specialized linear algebra libraries from NVIDIA or whatever, without having to write all those, you know, unrolled loops and memory management and whatnot.
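Here's roughly what that looks like, modeled on TVM's classic vector-add example (a sketch; API details vary by TVM version):

    import tvm
    from tvm import te

    # Elementwise addition as a compute expression (the lambda), plus a
    # schedule that splits the loop and binds the pieces to GPU blocks and threads.
    n = te.var("n")
    A = te.placeholder((n,), name="A")
    B = te.placeholder((n,), name="B")
    C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")

    s = te.create_schedule(C.op)
    bx, tx = s[C].split(C.op.axis[0], factor=64)
    s[C].bind(bx, te.thread_axis("blockIdx.x"))
    s[C].bind(tx, te.thread_axis("threadIdx.x"))

    # Compile for CUDA (assumes a CUDA-enabled TVM build).
    fadd = tvm.build(s, [A, B, C], target="cuda")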

But as you can see in the end, it's still not anywhere near as convenient as writing normal Python. And the thing you end up with is, you know, this kind of compiled code that again has all the kind of developer experience issues I described before. Perhaps the most interesting path for the future for me right now is Julia.

Julia is a fairly new language. But what's really interesting from a GPU standpoint is it handles nearly all of the developer experience problems I described. Nearly none of them exist in Julia. And the key thing is that in Julia, you can write kernels that look a lot like you would write in CUDA but with less boilerplate.

And you can do parallelized operations. You can handle memory. That can all be done in Julia. And so I think this is a really underappreciated, important idea: developers should be able to use the same language and the same tools throughout the hierarchy of abstractions in their program.

Again, speaking as an educator, this is incredibly important for teaching people what's going on. It's really important for a researcher, because you can hack in at any level. It's really important in industry, because you can jump in and make sure the performance is working properly for you at every level.

And it also opens up the research world in such a way that things aren't off the table. I find that the things that get worked on in deep learning research are the things that are kind of conveniently accessible through libraries. And a lot of stuff that isn't has just not really been touched because it requires people to go in and write their own CUDA kernels.

And very, very few people have the patience to do that, at least in the deep learning world. So yeah, really, I guess this is a bit of a play for the GPGPU community to consider building the next generation of languages and tools, which allows developers to really do everything that they might want to do in a convenient way.

For Julia, I feel like there are a lot of gaps in the developer experience more generally, which I think the community is very familiar with: around deployment, the amount of memory it uses, the startup latency, and so forth. But I do think, at least with Julia, it feels like there's a path there that could eventually lead to a really beautiful developer experience.

And that's not a path that I see available in really any of the Python frameworks that I see right now. And I would love to see things like TVM being more integrated with those ideas into languages and tools. So yeah, that's the end of my thoughts on that. Thanks very much.