
New GPU-Acceleration for PyTorch on M1 Macs! + using with BERT


Chapters

0:00 Intro
1:34 PyTorch MPS
4:57 Installing ARM Python
9:09 Using PyTorch with GPU
12:14 BERT on PyTorch GPU
13:51 Best way to train LLMs on Mac
16:01 Buffer Size Bug
17:24 When we would use Mac M1 GPU

Whisper Transcript

00:00:00.000 | In November 2020, Apple released their latest chips, the M1 chips, based solely on Apple Silicon.
00:00:10.000 | Now, the M1 chips are incredibly powerful for what they are, but they were not particularly well supported for anyone doing deep learning.
00:00:22.000 | Now, TensorFlow supported GPU acceleration on the new M1 chips pretty much straight out of the gate, but PyTorch has only just released it.
00:00:34.000 | Now, it's been a relatively long wait, and it has felt even longer because today's deep learning models rely very heavily on large model sizes and lots of data.
00:00:48.000 | And to process all of that, we can't really rely on CPUs; they're incredibly slow.
00:00:58.000 | So, that has basically made deep learning very difficult on Macs, and practically no one using PyTorch was going to use a Mac for deep learning, until now.
00:01:12.000 | That being said, for at least the first generation M1 chips, we're probably not going to be training any large models with them anytime soon, but we can perform inference on them.
00:01:28.000 | And I will also show you how we can run through training, but it's not going to be anything spectacular.
00:01:35.000 | So, PyTorch's support for GPUs comes in version 1.12.
00:01:42.000 | In the latest version of PyTorch comes support for interaction between PyTorch and the lower-level Metal Performance Shaders, or MPS.
00:01:55.000 | So, it's Metal Performance Shaders that interact with our GPU, almost like a layer between PyTorch and the GPU.
00:02:03.000 | Another pretty interesting thing, or positive thing, about this integration is that PyTorch have collaborated directly with the Apple Metal team.
00:02:13.000 | The Metal team deal with the new M1 chips, and the MPS layer has been fine-tuned for each particular M1 chip, or family of M1 chips.
00:02:27.000 | So, that basically means the performance should be pretty optimal for what it is doing.
00:02:35.000 | So, if we take a look at the release announcement over on PyTorch, we can come down here, and the accelerated training and evaluation - this is on an M1 Ultra, so I believe that's the top M1 chip at the moment -
00:02:55.000 | it shows you pretty incredible speed-up. Now, I will say right now that this is not the same on my M1 chip, or it doesn't seem to be.
00:03:09.000 | So, this is probably the most optimistic result you're going to get, and of course this is in their release announcement, so they're not going to show you the average results; they're just showing you the best they've ever seen.
00:03:23.000 | So, either way, this is actually pretty good.
00:03:29.000 | So, the getting started section on here is not that helpful, to be honest. This is not how you get started, so we need to go through a few steps to get everything organized and put together.
00:03:49.000 | With the new MPS-enabled PyTorch, there are two key prerequisites that are not really mentioned in the announcements.
00:03:59.000 | The first is that you need to have macOS version 12.3 or higher. So, if you don't have that, you'll need to upgrade.
00:04:11.000 | And the other one is that we need to do everything via the ARM version of Python.
00:04:18.000 | Now, we can check all of this within our Python environment using this code here.
00:04:24.000 | So, "import platform", then "platform.mac_ver()", right?
00:04:27.000 | So, the first value that you get there is, in this case, you can see 12.4. That is the macOS version, which must be 12.3 or higher.
00:04:38.000 | And then the last thing you see here is ARM64. That tells me, okay, I've got the right version of Python running here.
00:04:45.000 | It's using the ARM architecture rather than the x86 architecture.
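As a quick sketch, that check looks like this (the exact output will differ by machine):

    import platform
    print(platform.mac_ver())
    # e.g. ('12.4', ('', '', ''), 'arm64') - macOS 12.4 on an ARM build of Python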
00:04:51.000 | If you don't see that, you need to install a new ARM Python environment.
00:04:57.000 | So, if you're using Anaconda, that's great because we're going to go through the installation of ARM Python with Anaconda.
00:05:06.000 | So, the first thing we're doing here is specifying that we need to use the ARM64 version of macOS.
00:05:18.000 | So, we do that here.
00:05:20.000 | We then say, okay, we want to create a new conda environment, call that environment "ml".
00:05:26.000 | We are going to use Python 3.9.
00:05:29.000 | And Conda Forge is probably in your channel list anyway, but just in case, we're also specifying this,
00:05:36.000 | which is just another repository where we're going to pull all of these versions of different packages from.
00:05:44.000 | Okay, so once that has installed, we need to go into that environment.
00:05:51.000 | So, conda activate ml, and then we need to permanently modify the CONDA_SUBDIR variable
00:06:01.000 | to make sure it is always going to be set to osx-arm64.
00:06:07.000 | So, this is to avoid later on when we start pip installing things in this environment,
00:06:13.000 | this variable may switch back to x86, which we don't want.
00:06:19.000 | So, we need to make sure it stays with the ARM environment architecture.
00:06:24.000 | So, we add that in there.
00:06:27.000 | And you'll probably see this where it says, "Please reactivate your environment."
00:06:32.000 | So, to do that, we just do conda activate and run that.
00:06:36.000 | That switches back to the base environment, and then we literally just activate ml again.
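Roughly, the commands for this whole section look like the following, with the environment named "ml" as in the video:

    CONDA_SUBDIR=osx-arm64 conda create -n ml python=3.9 -c conda-forge
    conda activate ml
    conda env config vars set CONDA_SUBDIR=osx-arm64  # persist the variable
    conda activate base  # reactivate so the change takes effect
    conda activate ml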
00:06:42.000 | Now, the next step is to actually pip install PyTorch.
00:06:47.000 | And to do that, we are doing what you can see here.
00:06:51.000 | So, I am running pip install with the --upgrade flag.
00:06:56.000 | You might not need the upgrade flag there, but just in case.
00:07:00.000 | And we need to make sure we are going to install the nightly version of PyTorch
00:07:06.000 | because as of this moment, version 1.12 is only released in the nightly releases,
00:07:13.000 | which is basically just a more frequent but slightly less stable release channel.
00:07:17.000 | It's like a preview of upcoming PyTorch releases.
00:07:19.000 | So, you need all of this here.
00:07:22.000 | So, go across, and you'll be able to copy these from the notebook link that I have in the description below.
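For reference, the nightly install command at the time looked something along these lines (copy the exact command from the notebook if it has changed):

    pip install --upgrade --pre torch torchvision torchaudio \
        --extra-index-url https://download.pytorch.org/whl/nightly/cpu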
00:07:33.000 | Okay, so we run that.
00:07:36.000 | And one thing to just be aware of here.
00:07:39.000 | So, if we have a look here, you'll know it's working and it's being installed in the correct version
00:07:46.000 | if you can see that it says ARM64 up here.
00:07:52.000 | Now, if you are just wanting to use PyTorch with MPS, that's ready.
00:08:00.000 | You can go ahead and start using it, and I'll show you how in a moment.
00:08:04.000 | But for those of you that are going to want to use Hugging Face transformers, there is an extra step.
00:08:09.000 | If we try and pip install transformers datasets, this will probably come up with an error for most of you.
00:08:20.000 | For me, because I have already dealt with the error, it's not popping up.
00:08:25.000 | And it would say something like, "Error, failed building wheels for tokenizers."
00:08:31.000 | So, the reason for that is the transformers library includes fast tokenizers that are quicker than the pure-Python ones.
00:08:40.000 | And they are faster because they use Rust.
00:08:43.000 | Now, Rust is not installed alongside the ARM distribution of Python that we have at the moment.
00:08:50.000 | So, from within our new environment, all we need to do is install Rust like this.
00:08:58.000 | Once we execute that, we can just go ahead and pip install transformers and datasets like we did before.
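The standard rustup installer is what's being used here, followed by the same pip install as before:

    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
    pip install transformers datasets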
00:09:05.000 | And with that, we are ready to move on to our code.
00:09:09.000 | So, for this example, all we need are these libraries here with Torch, transformers, datasets, which we already installed.
00:09:17.000 | And we can check that our PyTorch installation has MPS using this.
00:09:24.000 | So, torch.has_mps. If you run that, you should see True.
00:09:28.000 | And that means you're okay and you're ready to go with the rest of the code.
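That check is just:

    import torch
    print(torch.has_mps)  # True if this build of PyTorch can see the MPS backend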
00:09:32.000 | So, here I'm just pulling some data. I'm not really going to go into detail on this.
00:09:36.000 | I'm not pulling much data because it's a pretty low-spec M1 chip and it can't handle much.
00:09:46.000 | So, here I'm initializing a tokenizer and model.
00:09:50.000 | If that doesn't make much sense to you, it doesn't really matter for the sake of understanding PyTorch here.
00:09:56.000 | I'm tokenizing my text, and then this is being run on CPU.
00:10:02.000 | This bit here. I have not moved anything to the MPS device, or the GPU, yet.
00:10:11.000 | This is all being run on CPU and we get 547 milliseconds as an average time for this model processing our data.
00:10:22.000 | And this is a BERT model, bert-base-cased, from the Hugging Face library.
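A minimal sketch of that CPU run, assuming "text" holds the list of strings pulled from the dataset above:

    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModel.from_pretrained("bert-base-cased")

    # tokenize a batch of text; the model and tensors stay on the CPU by default
    tokens = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    outputs = model(**tokens)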
00:10:28.000 | Now, that's CPU performance. If we want to test with MPS, there's a few things we do.
00:10:34.000 | So, we have to move everything, the tensors and the model, over to the MPS device.
00:10:40.000 | So, we set that with torch.device("mps"), and then we just say model.to(device),
00:10:47.000 | and tokens, so your tensors, .to(device) as well. And that's all there is to it.
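In code, that move looks roughly like this, continuing from the CPU sketch above:

    import torch

    device = torch.device("mps")
    model.to(device)
    tokens = tokens.to(device)  # BatchEncoding objects support .to() directly
    outputs = model(**tokens)   # now runs on the M1 GPU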
00:10:54.000 | Now, if we rerun it, we see it's faster. So, we get 345 milliseconds.
00:11:00.000 | So, a little bit better. Not a massive difference, but it is a little bit better.
00:11:05.000 | Now, this is using a batch size of 64. I tested a few different batch sizes with this data set.
00:11:12.000 | And I found that when it comes to larger batch sizes, at least for inference, and I imagine it's the same for training,
00:11:24.000 | we get a more significant difference as we increase the number of values in each batch.
00:11:34.000 | So, really at 64 there's very little difference and you can kind of see that.
00:11:40.000 | But then when we increase that, the difference is definitely more pronounced.
00:11:46.000 | So, we're not using like A100 GPUs or anything here. We're just using the first generation MacBook Pro,
00:11:56.000 | the almost base specs with the M1 chip. That's all we're using. So, it's not going to blow us away.
00:12:05.000 | But nonetheless, just for the sake of moving our model and tensors to a different device, I think this is pretty cool.
00:12:14.000 | Now, for those of you that are interested in using this with Hugging Face,
00:12:19.000 | I will quickly go through the setup for that because there's a few things to just be aware of.
00:12:25.000 | So, as before, I'm doing the same thing. We've got our device, it's MPS, and we're using the same dataset.
00:12:31.000 | Nothing new here. And we go through everything.
00:12:36.000 | And in this case, this is training the entire BERT model, and anything beyond a batch size of one
00:12:47.000 | just doesn't work. So, you can see here, where's my batch?
00:12:54.000 | So, I'm creating the data loader here with a batch size of one.
00:13:02.000 | We have our model.to(device) because we're using MPS.
00:13:07.000 | And when we're loading the different tensors for training in this training loop here,
00:13:16.000 | I'm just moving them to the same MPS device. So, anything beyond a batch size of one, and I even tried just two,
00:13:24.000 | my kernel just died. So, yeah, you just have to be wary of that.
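As a rough sketch of that loop, assuming "dataset" yields dicts of tensors that include the labels the model expects:

    import torch
    from torch.utils.data import DataLoader

    device = torch.device("mps")
    loader = DataLoader(dataset, batch_size=1)  # batch sizes above 1 crashed the kernel here
    model.to(device)
    optim = torch.optim.AdamW(model.parameters(), lr=5e-5)

    model.train()
    for batch in loader:
        optim.zero_grad()
        batch = {k: v.to(device) for k, v in batch.items()}  # move tensors to MPS
        loss = model(**batch).loss
        loss.backward()
        optim.step()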
00:13:30.000 | But with one, this is actually showing a little bit lower than what I got in my other test, but I tried it a few times.
00:13:37.000 | I got around 90 minutes to train that. Now, that's the full BERT model.
00:13:44.000 | And we're probably not going to use a MacBook, or at least not this MacBook, for training that.
00:13:50.000 | Now, moving on to maybe the way that we would actually use this on a Mac: with these NLP models and so on,
00:14:03.000 | we typically have two components. We have the larger core of the model.
00:14:08.000 | So, that would be BERT itself that has been pre-trained by Google or Microsoft or someone else.
00:14:15.000 | That includes a lot of parameters. But then there's a smaller head on top of that, which is just like a couple of linear layers
00:14:22.000 | that does something with the output from that BERT model. So, for example, it might classify the text for you.
00:14:29.000 | And, well, OK, we can't train a full BERT model, but we can train that head on top, which is called fine-tuning.
00:14:37.000 | So, let's go ahead and have a look at how we might do that.
00:14:43.000 | So, this is a little more complex because to do this, we need to initialize our entire model.
00:14:49.000 | So, the whole BERT model and the head, and then we need to freeze all of the BERT model parameters.
00:14:57.000 | So, an extra step, although it's not anything complicated.
00:15:10.000 | So, here I'm initializing my model again. I'm using BERT for sequence classification.
00:15:18.000 | And this is all I'm doing. So, for param in model.bert.parameters(), I'm setting param.requires_grad equal to False.
00:15:28.000 | And taking a look at the model printout from PyTorch, we can see there's this BERT at the top,
00:15:36.000 | and then there's this classification part at the bottom. Those are all the parameters in our model.
00:15:44.000 | And when we're saying model.bert.parameters(), we are accessing all the BERT parameters,
00:15:50.000 | and we're leaving out those classifier parameters at the end there. So, that is how that works.
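As a sketch, assuming the same bert-base-cased checkpoint:

    from transformers import BertForSequenceClassification

    model = BertForSequenceClassification.from_pretrained("bert-base-cased")

    # freeze the BERT core; only the classifier head stays trainable
    for param in model.bert.parameters():
        param.requires_grad = False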
00:15:57.000 | And then we just go down and we would run this again.
00:16:01.000 | Okay. So, in this case, we have this error. And in order to get rid of that,
00:16:07.000 | we actually need to downgrade to a slightly older version of the PyTorch nightly release,
00:16:14.000 | because this is just a bug that's popped up in one of the more recent releases.
00:16:19.000 | Hopefully, by the time you're watching this, it won't be a problem anymore, so you won't get this anyway.
00:16:23.000 | But if you do, this is how we fix it.
00:16:26.000 | So, all we do is pip install, making sure we upgrade, and making sure that we do this.
00:16:32.000 | We don't actually need to include torchvision and torchaudio here, but they're included anyway.
00:16:36.000 | So, we just downgrade to this nightly release. Okay. That will fix the problem.
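The fix is something like the following; the exact nightly build to pin is in the linked notebook, so the version here is just a placeholder:

    pip install --upgrade --pre "torch==<older-nightly-version>" \
        --extra-index-url https://download.pytorch.org/whl/nightly/cpu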
00:16:45.000 | And then we can go back to our code and rerun everything.
00:16:48.000 | Okay. So, in the little GPU usage history, we can see this peak now that the model is training.
00:16:57.000 | And that's just going to basically stay at the top for the next four or so minutes
00:17:04.000 | and go back down once we finish training.
00:17:08.000 | So, let's skip ahead to this finishing.
00:17:12.000 | Okay. So, that has just finished. You can see GPU usage ramps down straight away.
00:17:18.000 | And yeah, so it took pretty much one second shy of four minutes.
00:17:25.000 | So, it's relatively quick. Nothing special, but given that I'm just on the first generation M1,
00:17:33.000 | using the almost base level MacBook Pro, it's not bad.
00:17:38.000 | I think, realistically, you're probably not going to do much training on the M1 chips
00:17:45.000 | unless you have maybe an M1 Ultra. Maybe in that case, you would.
00:17:49.000 | But even then, I'm not sure the chips are really at that level yet.
00:17:56.000 | Nonetheless, I'm sure with future iterations of MPS, and MPS-enabled PyTorch in particular,
00:18:07.000 | it's probably going to improve.
00:18:09.000 | So, it's useful to be able to do this.
00:18:13.000 | And even maybe just for a little bit of maybe fine-tuning or at least inference here and there,
00:18:20.000 | I think this is pretty useful.
00:18:22.000 | And it's exciting to see where this will actually go.
00:18:26.000 | Maybe in the future, Macs are for deep learning.
00:18:32.000 | That would be interesting.
00:18:35.000 | So, yeah, it's pretty cool.
00:18:39.000 | I know the setup for this is kind of finicky.
00:18:44.000 | So, I hope this has at least helped you figure that out.
00:18:49.000 | So, I hope all this has been interesting.
00:18:52.000 | Thank you very much for watching, and I will see you again in the next one.