- Yeah, thank you all for coming. I'm sure you're wondering what this has to do with AI, but we'll get there. So let me set the stage. At the turn of the 20th century, physics had a big problem: the prevailing theory predicted that a hot object would radiate an infinite amount of energy, which was very strange, because clearly that would violate a lot of physical principles.

So what did physicists do about this? Well, Max Planck came along and said: if we just assume that energy is quantized, then we can solve this problem. Mathematically, it works out fine. But he introduced this new thing, h, a constant that comes out of nowhere.

And they didn't really know at the time how to measure it. So another chap, Millikan, came up with an experiment in 1909 to pin the value down. He did it indirectly, by measuring the charge of the electron: he sprayed charged oil droplets into the air between charged plates and watched how fast the charges moved.

The details aren't so important, but he got a result, and it was big news in the scientific community. Everyone said: now we know what this charge is, and from it what h is. And all was good, and everyone started using his number in their calculations.
Many years went by, and all the experiments seemed to agree with his reading. It took until around 1929, just after that last data point, for people to realize that this wasn't the actual value. And what's interesting when you look at the history is that you've got this period of 15 years or so where everyone thought it was the right value.

You look back and ask: why did they think this? It turns out people were doing the right experiments, lots of them, and there are plenty of data points in between that have just been lost in the history books.
Part of that is embarrassment; this is a very embarrassing episode for the scientific community. So what happened? Well, these scientists were running their experiments, they'd get a result, and then they'd think: wait, the great Millikan got his result in 1909, and mine doesn't agree with his, so I must be wrong. And then they'd fudge the experiment until it produced the same value as his. That went on for a long time. And this is not a trivial thing; it matters quite a lot to science in general.

Which gets me to the point about scientific rigor. It's actually very tricky to do science properly. The fact that all these subsequent experiments agreed on the wrong thing says something about the way scientific progress happens.
There's a lot of inertia behind the current way of doing things. And there's another example, this one about questioning assumptions. There was a researcher looking at something completely different: experiments with rats. He had a hypothesis he wanted to test, a weird, esoteric thing: that he could get rats to navigate a maze in a specific way. They would enter a corridor of doors through a random door, and he wanted them to come out of the door three doors along from the one they went in. It didn't matter which one they entered; he always wanted it to be three. He wanted to show that they could actually think, and consistently count three doors along.
He tried a bunch of stuff. He put a piece of food at the door three doors along, to draw them through that one. But what kept happening is they always went to the door from the previous run. So if they went in through door one and he wanted them to come out of door four, and then he tested again with them entering through door two, they should have come out of door five. But they would still come out of door four.
So he asked: why is this happening? How do the rats know to go back to that same door? He very meticulously went through and made sure there was no pattern on the doors they could distinguish. He painted them all the same way and made sure the textures were identical. Still it didn't work. He thought it might be the smell, maybe a scent coming from the food, so he used chemicals to mask it. That didn't work either. Then he thought: okay, maybe it's something to do with the lighting. A human could solve this through common sense, noticing the lighting falls in such and such a way and reading the pattern off that. So he covered up the corridor and ruled that out too.

And still the same thing happened. Eventually he found that the rats could consistently return to the same door because they remembered the sounds: as they ran along, they recognized the pattern of sounds the corridor made. So he put sand down so they couldn't distinguish the sounds, and that finally broke the behavior.
Now, from a scientific perspective, this is S-tier science: really clearly identifying every assumption you're making and systematically eliminating them. This is great, great science. But the problem was that the scientific community didn't agree. The people conducting experiments at the time made a lot of these assumptions, and they were stuck in their ways. So they discarded this work. It wasn't cited; it was basically just forgotten.

So I think there's this tendency to stick to the current way things are done. Even if a methodology is spot on, the established way of doing things has a lot of inertia behind it. Feynman talked about this in a talk called Cargo Cult Science, in the 70s. He had this quote: the first principle is that you must not fool yourself, and you are the easiest person to fool. There's this tendency to oversimplify the science; it is hard to get it right. And if you're interested, Gwern wrote a blog post all about this, and there's some controversy about who this rat researcher actually was.
But getting to AI: a very similar thing happened in AI, again about questioning assumptions. Just having a good idea out there is not enough. Backpropagation was introduced in a paper in 1963, reinvented in 1976, and reinvented again in 1988. Deep convolutional neural networks were introduced separately, and it was only in 1989 that the two were combined: deep CNNs trained with backpropagation. And then it was roughly three decades later that CNNs were widely accepted. There was still massive skepticism, even though the ideas were out there.
And why was that? I think a big part of it is, again, being stuck in a way of doing things and designing around the existing hardware. CPUs have the von Neumann bottleneck: really good single-core performance, but if you have to read from memory often, that's usually what bottlenecks you. At a systems level, you can ask why GPUs fixed this: they changed the ratio of how many bytes you have to load to how many flops you can execute.
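To put that ratio in concrete terms, here's a minimal roofline-style check in Python. All the numbers are made-up placeholders rather than real chip specs, and the 2-flops-per-parameter decode estimate is a standard rule of thumb, not something from the talk:

```python
# Roofline-style check: a workload is memory-bound when its arithmetic
# intensity (flops per byte moved) falls below the machine's balance
# point (peak flops per byte/s of memory bandwidth).

def classify(flops: float, bytes_moved: float,
             peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a workload as compute-bound or memory-bound."""
    arithmetic_intensity = flops / bytes_moved       # flops per byte
    machine_balance = peak_flops / mem_bandwidth     # flops per byte
    return "compute-bound" if arithmetic_intensity > machine_balance else "memory-bound"

# One decode step of an LLM touches every weight once: roughly 2 flops
# per parameter against 2 bytes per parameter in fp16, i.e. an
# arithmetic intensity of about 1 flop/byte.
print(classify(flops=2e9, bytes_moved=2e9,
               peak_flops=100e12, mem_bandwidth=1e12))  # -> memory-bound
```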
And it's kind of striking when you look at the history of this. There was a groundbreaking, very famous paper where they trained a network on 1,000 machines, 16,000 CPU cores, and it took three days. Then, less than a year later, another paper got the exact same results with three machines in a couple of days, using hardware acceleration: GPUs.

So this gets me to the hardware lottery, an idea introduced by Sara Hooker in 2020, which says that the best research ideas don't necessarily win. There are a lot of factors that mean a great idea can be out there, great science can be being done, and it still doesn't get adopted and accepted.
There's a recent example of this that I think is quite interesting. Preparing for this talk, I had the realization that LLMs are creating inertia too, because the things they're good at are the things people will work on. If they're good at generating Python code, then more people will write Python code, and it's a feedback loop: more people use it, and the LLMs get better at that thing. So now, if you wanted to come out with a new programming language, maybe it's a really good idea, but it's much harder to get adoption when the tooling is way worse because the LLMs don't support it well. There's a paper where they tabulated which language different models perform best in across a set of tasks, and basically everything was Python.
What was it, 90 to 97% of all problems? Python was the best. So, what if we did question our assumptions? That brings me to what I'm working on with EXO. We're building an orchestration layer for AI that runs on different hardware targets. We're sitting at a layer where I haven't seen much work right now, and it's a real pain point: having a reliable thing that can orchestrate lots of different kinds of devices, with different connections, in an ad hoc mesh network.
To give you some idea of the kind of things we're doing and the solution space we sit in: everything in EXO is modeled as a causally consistent set of events. There's essentially an ordering on everything that happens across the whole distributed system. With all these things going on, it's really hard to reason about, say, moving a KV cache around: how do you know it happened successfully? How do you know if something depends on it? So you build a causal graph, and then you can reason about the system and get guarantees about where the data is and what's going on.
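To make that concrete, here's a minimal sketch of causally ordered events. It's illustrative only, not EXO's actual data model; the event names and the parents-as-dependencies representation are assumptions for the example:

```python
# Toy causal-event log (illustrative; not EXO's actual data model).
# Each event lists the events it depends on, which defines a
# happens-before graph you can query for guarantees.
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    id: str
    parents: tuple[str, ...] = ()  # events that must happen before this one

def happened_before(log: dict[str, Event], a: str, b: str) -> bool:
    """True if event `a` is a causal ancestor of event `b`."""
    frontier = list(log[b].parents)
    while frontier:
        p = frontier.pop()
        if p == a:
            return True
        frontier.extend(log[p].parents)
    return False

# Hypothetical scenario: a KV cache is copied to another node, and a
# decode step depends on that copy having completed.
log = {
    "kv_copy_started": Event("kv_copy_started"),
    "kv_copy_done":    Event("kv_copy_done", ("kv_copy_started",)),
    "decode_step":     Event("decode_step", ("kv_copy_done",)),
}
assert happened_before(log, "kv_copy_started", "decode_step")  # transitive dependency
```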
Just to give a quick example of what this enables in practical terms: the DGX Spark is coming out soon. I'm still waiting, it's been delayed a few times I think, but hopefully soon. It's NVIDIA's new consumer box, if you haven't seen it. It has a lot of flops and it's pretty good for the cost, but the memory bandwidth is kind of lacking and it doesn't have that much memory. A Mac Studio has a lot more memory bandwidth, but a lot fewer flops. So if you look at generating with an LLM, there are two phases, right? There's the prefill phase, which is compute bound, and the generation phase, which is memory bandwidth bound. As far as I know, there isn't really anything that can reliably take these different sets of devices and figure out the best way to run the whole workload across all the devices you have available.
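As an illustration of why you'd split the phases across devices, here's a toy estimate. All numbers are hypothetical placeholders, not DGX Spark or Mac Studio specs; the 2-flops-per-parameter-per-token rule and fp16 weights are standard approximations:

```python
# Toy phase-time estimates (numbers are illustrative placeholders).
# Prefill cost scales with compute; decode cost scales with how fast
# you can stream the weights from memory on every token.

def prefill_seconds(params: float, prompt_tokens: int, peak_flops: float) -> float:
    return 2 * params * prompt_tokens / peak_flops        # ~2 flops/param/token

def decode_seconds(params: float, new_tokens: int, mem_bandwidth: float,
                   bytes_per_param: float = 2.0) -> float:
    return params * bytes_per_param * new_tokens / mem_bandwidth  # weights re-read each token

P = 70e9  # hypothetical 70B-parameter model
flops_heavy = {"peak_flops": 500e12, "mem_bandwidth": 0.25e12}  # lots of compute
bw_heavy    = {"peak_flops": 50e12,  "mem_bandwidth": 0.80e12}  # lots of bandwidth

# Run each phase on the device where it's cheapest: prefill on the
# flops-heavy box, decode on the bandwidth-heavy one.
print(prefill_seconds(P, 4096, flops_heavy["peak_flops"]))   # ~1.1 s
print(decode_seconds(P, 500, bw_heavy["mem_bandwidth"]))     # ~87.5 s
```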
But this is possible now with EXO. Another example is some research we're working on, more on the training side; it will be made public pretty soon. Essentially, we're questioning the assumptions about what hardware is best to train on. Apple Silicon has a lot more memory, obviously, but it's a lot more expensive per flop. What if you could use that memory for something useful, to make training more efficient? There's been a whole area of research on second-order methods and other ways of making training more efficient, but a lot of it has been discarded because of the memory requirements.

So we're going to come out with a new optimizer that is essentially two times more efficient per flop than Adam, but uses a lot more memory. If you look at the ratio of memory to flops on Apple Silicon, it's around 20x that of NVIDIA, depending on which chips you compare.
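As a rough illustration of that kind of ratio comparison (the memory and flops figures below are made-up round numbers, not real spec-sheet values; check actual chips before relying on this):

```python
# Memory-capacity-to-compute ratio, with made-up round numbers.
# The point is the relative gap, not the absolute values.

def gb_per_tflop(memory_gb: float, tflops: float) -> float:
    return memory_gb / tflops

unified_memory_chip = gb_per_tflop(memory_gb=512, tflops=30)   # big memory, modest flops
hbm_accelerator     = gb_per_tflop(memory_gb=80, tflops=1000)  # small memory, huge flops

# The multiple depends heavily on which chips you compare; the talk
# quotes roughly 20x for Apple Silicon vs. NVIDIA.
print(unified_memory_chip / hbm_accelerator)
```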
So you've got all that spare memory, and you can ask: what does the solution space look like if we make use of it? And, yeah, we're buying a lot of Macs and trying a lot of stuff. This is the first batch, but there's going to be another batch, and we're running a lot of experiments at scale. There's a lot of really interesting stuff in that paper, in the sense that not many people have tried doing large training runs on Apple Silicon. I mean, nobody. Nobody has. Even talking to people at Apple, they're surprised at these results; they're not really aware of what the hardware is capable of.

Just a final thing as well, going back to the process of doing science.
I think publishing results that are maybe not the best is something we need to normalize more. We're doing a lot to make all the data accessible, whether it's good or bad. Right now you can go to benchmarks.exolabs.net and there are a lot of configurations there that are just really bad. But having that data out there, and publishing stuff that isn't necessarily the best, is super important. So we've got these benchmarks. Part of that is also running these Macs, and other devices, continuously in CI. We'll probably end up with at least one of pretty much any device you can reasonably run an AI workload on, all continuously pushing results out to the benchmarks.
Yeah, so we're coming out with a big release at the end of this week. That will be the orchestration layer I talked about, and with it will come a lot of the tooling around it: the benchmarks website, and a lot of stuff around testing different algorithms on different devices. We have Exogym, for example, which is a way to run experiments when you don't have 16 Macs like this: you can run them locally and quickly test different distributed algorithms. That will be part of those releases. Yep, that's it.
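For flavor, here's a toy, single-process sketch of the kind of distributed-algorithm simulation described above. It is not Exogym's actual API; the quadratic loss and worker setup are invented for the example:

```python
# Simulate N data-parallel workers in one process (toy example only;
# not Exogym's API). Each "worker" computes a gradient on its own
# shard, then the gradients are averaged, standing in for an all-reduce.
import numpy as np

def data_parallel_step(weights: np.ndarray, shards: list[np.ndarray],
                       lr: float = 0.1) -> np.ndarray:
    # Gradient of the quadratic loss ||w - mean(shard)||^2 per worker.
    grads = [2 * (weights - shard.mean(axis=0)) for shard in shards]
    avg_grad = np.mean(grads, axis=0)  # simulated all-reduce
    return weights - lr * avg_grad

rng = np.random.default_rng(0)
w = np.zeros(4)
shards = [rng.normal(loc=1.0, size=(32, 4)) for _ in range(8)]  # 8 simulated workers
for _ in range(100):
    w = data_parallel_step(w, shards)
print(w)  # converges toward the global data mean (~1.0)
```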
- We've actually got about four minutes, so we're ahead of schedule. Anyone have questions? Don't get mad if I don't pick you, but we'll do these three right here.
- Do you guys do the communication primitives, all the stuff that sits at a lower level?
- Right now it's a bit of a mix, but ideally not.
Ideally, we want to sit higher up in the stack and just focus on this orchestration piece, because I think that's where there's not much out there. The MLX team, for example, has done a lot of work on MLX distributed, which is really good. It's really fast, but it's kind of brittle: if you lose a connection, it just breaks completely, you get errors everywhere. And it's super specialized, obviously, to their configuration. So our hope is that a lot of the work on that layer will be done by the frameworks themselves, like MLX and vLLM, and then we can sit on top. But right now, we're doing a lot of work directly with the MLX team, for example, building out those primitives. Yeah, you in the orange.
- How difficult would it be to scale this up on the AMD side? Or is that ratio going to be smaller there?
- Yeah, so for a lot of this stuff, the absolute numbers don't matter too much; it's more about the ratios. AMD has way more flops than Apple Silicon, so the ratio is probably still going to be really high. And then it comes down to what you end up being bottlenecked by. Obviously they have a lot more network bandwidth, but again, it's relative to the flops, right? If you look at the ratio of network bandwidth to flops, Apple Silicon is actually better than AMD.
So I'm not sure; I'd need to look at the specific device you're talking about, but maybe.
- Yes, one more. Short answer, please.
- [Question about decentralized training over untrusted public networks.]
- Yeah, we're not working on that. I think there are other projects that might be working on that kind of thing, maybe Prime Intellect, and perhaps Hyperbolic as well.
A few of them are in this room. Maybe there's some synergy I don't know about, but for now there just seems to be a lot of work to be done on these private clusters, where you have a fully trusted setup and you don't have to care about all the hard problems that come with untrusted public networks.
- Thanks, everyone. Have a great day. We'll see you next time.