
Ep 18: Petaflops to the People — with George Hotz of tinycorp


Chapters

0:00 Introducing George
2:59 Tinycorp's 3 Theses
11:12 Tinygrad's creation
15:58 Operation fusing in Tinygrad
19:11 Tinygrad debugging
21:14 Tiny Competitiveness on QCOMM vs NVDA
23:21 geohot vs AMD
28:21 Tinygrad vs ggml
30:01 Importance of Good CI
30:37 Mojo and Compatibility
32:43 ggml quantization is made up
35:18 tinygrad: benchmark int8 vs fp16
37:39 Why you can't build tinybox
40:28 The personal compute cluster
43:08 Compute Optimal to Inference Optimal
45:06 Announcing FLOPcoin
46:23 Why Federated AI won't work
47:38 5x faster than Nvidia
48:53 A Person of Compute
49:49 GPT-4's real architecture
51:07 BatchNorm, FlashAttention
52:34 The Bitter Lesson
55:31 Hiring in the Age of AI
60:02 Why AI doesn't replace developers & artists
63:02 Comma Body
67:34 AI Girlfriend
71:00 The Goddess of Everything Else
73:43 John Carmack Insights
77:41 on Elon
78:47 on e/acc
80:24 Avatar 2

Transcript

>> Hey, everyone. Welcome to the Latent Space podcast. This is swyx, writer and editor of Latent Space, and Alessio is taking over with the intros. Alessio is partner and CTO in Residence at Decibel Partners. >> Hey, everyone. Today we have geohot on the podcast, aka George Hotz for his human name.

Everybody knows George, so I'm not going to do a big intro. A couple things that people might have missed. So you were the first to unlock the iPhone. You traded the first ever unlocked iPhone for a Nissan 350Z and three new iPhones. You were then one of the first people to break into the PS3 to run arbitrary code.

You got sued by Sony. You wrote a rap song to fight against that, which is still live on YouTube, which we're going to have in the show notes. Then you did not go to Tesla to build Vision, and instead you started comma.ai, which was an amazing engineering feat in itself, until you got a cease and desist from the government to not put these things on the street.

Turned that into a research-only project. >> You know they're out there. >> Yeah, yeah. No, no, no. They're out there. But like, they're not a, you know, you market them as a research kind of like no warranty. >> Because I use the word DevKit. That's not about the government.

That has nothing to do with the government. We offer a great one-year warranty. The truth about that is it's gatekeeping. What's the difference between a DevKit and not a DevKit? Nothing. Just the question of do you think it's for you? And if you think it's for you, buy it.

It's a consumer product. We call it a DevKit. If you have a problem with that, it's not for you. >> That's great insight. And then I was going through your blog posts to get ready for today. You wrote this post about the hero's journey, and you linked this thing called the portal story, which is kind of the set of stories in movies and books about people living this arbitrary life, and then they run into these magic portals, which kind of take them into a new, very exciting life and dimension.

When you wrote that post, you talked about TinyGrad, which is one of the projects you're working on today. And you mentioned this is more of a hobby, something that is not going to change the course of history. Obviously, you're now going full speed into it. So we would love to learn more about what was the portal that you ran into to get here.

>> Well, what you realize is, you know what made me realize that I absolutely had to do this company? Seeing Sam Altman go in front of Congress. Why? What are the odds they nationalize NVIDIA? You know, what are the odds that large organizations and the government, but of course I repeat myself, decide to try to clamp down on accessibility of ML compute?

I want to make sure that can't happen structurally. So that's why I realized that it's really important that I do this. And actually, from a more practical perspective, I'm working with NVIDIA and Qualcomm to buy chips. NVIDIA has the best training chips. Qualcomm has the best inference chips. Working with these companies is really difficult.

So I'd like to start another organization that eventually, in the limit, either works with people to make chips or makes chips itself and makes them available to anybody. >> You shared kind of three core pieces to TinyGrad. Maybe we can dive into each of them. So XLA, PrimTorch, those are the complex instruction set.

TinyGrad is the reduced instruction set. So you're kind of focused on, again, TinyGrad being small, not being overcomplicated, and trying to get as close to, like, the DSP as possible, in a way. >> Well, it's a very clear analogy from how processors developed. So a lot of processors back in the day were CISC, complex instruction set.

System/360 and then x86. But this isn't how things stayed. They went to, now the most common processor is ARM. And people are excited about RISC-V. RISC-V is even less complex than ARM. No one is excited about CISC processors anymore. They're excited about reduced instruction set processors. So TinyGrad is, we're going to make a RISC instruction set for all ML models.

And yeah, it can run all ML models with basically 25 instead of the 250 of XLA or PrimTorch. So about 10x less complex. >> You talked a lot about existing AI chips. You said if you can't write a fast ML framework for GPUs, you just cannot write one for your own chip.

So that's another one of your core insights. I don't know if you want to expand on that. >> Yeah, I mean, your chip is worse, right? There's no way the chip that you're going to tape out, especially on the first try, is going to be easier to use than an AMD GPU.

And yet there's no good stack for AMD GPUs. So why do you think you can make one for your chip? You can't, right? The only company, there's one other company, aside from NVIDIA, who's succeeded at all at making training chips. What company? >> AMD? Intel? >> No, no, no.

I've never trained, who's trained a model on AMD or Intel? >> Nobody on AMD. Cerebras. >> Cerebras, I'm talking about, you might know some startups who trained models on these chips. I'm surprised no one immediately gets this, because there is one other chip, aside from NVIDIA, that normal people have actually used for training.

>> Apple Neural Engine? >> No, used for training. You can only buy them in the cloud. >> Oh, TPU. >> Exactly, right? So, Midjourney is trained on TPU, right? A lot of startups do actually train on TPUs. And they're the only other successful training chip, aside from NVIDIA.

But what's unique about Google is that they also wrote their own ML framework, right? And if you can't write your own ML framework that is performant on NVIDIA, there's no way you're going to make it performant on your-- >> And they started from TensorFlow, and then they made the chip after.

>> Yeah, exactly, exactly. And you have to do it in that direction. Otherwise, you're going to end up-- Cerebras, one of those things, a million-- I've never seen a Cerebras. No one's ever like, "Oh, I trained my model on a Cerebras." Most people are like, "I trained my model on GPUs." Some people, 20%, are like, "I trained my model on TPUs." >> And then the third one, which is the one that surprised me the most, is Turing completeness is harmful, should be avoided.

It made sense once I read it, but maybe tell us a bit more about how you got there. >> Okay. So, CPUs devote tons of their silicon and power to things like reorder buffers and speculative execution and branch predictors. And the reason that you need all these things is because at compile time, you can't understand how the code's going to run.

This is Rice's theorem. This is the halting problem and its limit. And this is not like, "Oh, the halting problem is theoretical." No, no, no, no. It's actually very real. Does this branch get taken or not? Well, it depends on X. Where does X come from? Yeah, forget it, right?

But no branches depend on X in a neural net. Every branch is a static loop. Like if you're doing a matrix multiply, it's a static loop over the inner dimension. And neural networks are even better. No loads even depend on X, right? So with a GPU shader, right, your load might depend on which texture you're actually loading into RAM.

But with a neural network, your load is, "Well, I load that way." Why? "Well, because I load that way the other million times I ran the same net." Every single time you run the net, you do the exact same set of loads, stores, and arithmetic. The only thing that changes is the data.

And this gives you a very powerful ability to optimize that you can't do with CPU style things, which have branches, and even GPU style things, which have loads and stores. Oh, that makes sense. Well, GPUs, if you want GPU style stuff, you have like load based on X. You now need a cache hierarchy, and not an explicit cache hierarchy, an implicit cache hierarchy.

With eviction policies that are hard-coded into the CPU, you start doing all this stuff and you're never going to get theoretically good performance. Again, I don't think there's 100X. Some startups will talk about 100X, and they'll talk about absolutely ridiculous things like clockless computing or analog computing. Okay. Here, analog computing just won't work.

And clockless computing, sure, it might work in theory, but your EDA tools are... Maybe AIs will be able to design clockless chips, but not humans. But what actually is practical is changing cache hierarchies, and removing branch predictors, and removing warp schedulers. GPUs spend tons of power on warp scheduling, because we have to hide the latency from the memory.

We don't have to hide the latency if everything's statically scheduled. Yeah. Why do you think people are still hanging on to Turing complete? Well, because it's really easy. Turing complete is just really easy, right? It's really easy to just, "Oh, you know, it would just be so nice if I could do like an if statement here, and actually branch the code," right?

So it requires a lot more thought to do it without Turing completeness. And would this be qualitatively different than TPUs? So TPUs are a lot closer. TPUs are a lot closer to what I'm talking about than like CUDA. Okay, so what is CUDA? Well, CUDA is a C-like language, which compiles to an LLVM-like IR, which compiles to PTX, which compiles to SASS, which are all Turing complete.

TPUs are much more like this, yeah. Their memory is pretty statically managed. I did some reverse engineering on the TPU. It's published in TinyGrad. It has like a VLIW instruction, and it runs them. So it's similar. I think the TPUs have a few problems. I think systolic arrays are the wrong choice.

Systolic array, I think they have systolic arrays, because that was the guy's PhD. And of course, Amazon makes-- Could you summarize systolic arrays right now? Systolic arrays are just, okay, so basically you have like, it's a way to do matrix multiplication. Think of a grid of MACs, multiply-accumulate units, and then the grid can multiply and then shift, multiply, then shift, multiply, then shift.

And they are very power efficient, but it becomes hard to schedule a lot of stuff on them if you're not doing perfectly sized dense matrix multiplies. Which you can argue, well, design your models to use perfectly sized dense matrix multiplies, sure. But it's just-- No, but thanks for indulging on these explanations.

I think we need to keep our audience along with us by pausing every now and then to explain key terms. When I say explain a systolic array, I just immediately get a picture in my head of like tilting a matrix and shifting it. It's hard to kind of explain.

Yeah. We'll do some videos so that you're-- We like show notes, but-- We edit it in visuals. Yeah, yeah, yeah. There's some great graphics that just show you, oh, so that's what a systolic array is. But it's a MAC-and-shift machine that looks kind of different from the typical like APU sort of machine.

Sorry, ALU sort of machine. I think the right answer is something that looks more like queues that feed into ALUs. And then you can like prefetch the loads from the memory, put in a bunch of queues, and then the queue is just like, and feeds into another queue over here.
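To make the picture concrete, here is a toy sketch of the multiply-and-shift idea. It is purely illustrative, an assumed output-stationary organization in plain NumPy, not code from tinygrad or any TPU: a grid of MAC cells, with the inputs skewed in time so that the right pair of operands arrives at each cell on each clock tick.

```python
import numpy as np

# Toy model of an output-stationary systolic array (illustrative only).
# Each (i, j) cell is a multiply-accumulate unit that owns one output element.
# Inputs are skewed so A[i, s] and B[s, j] meet at cell (i, j) on tick t = i + j + s.
def systolic_matmul(A, B):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    acc = np.zeros((n, m), dtype=A.dtype)           # one accumulator per MAC cell
    for t in range(n + m + k - 2):                  # clock ticks of the wavefront
        for i in range(n):
            for j in range(m):
                s = t - i - j                       # which operand pair reaches this cell now
                if 0 <= s < k:
                    acc[i, j] += A[i, s] * B[s, j]  # multiply, accumulate, shift next tick
    return acc

A, B = np.random.rand(3, 4), np.random.rand(4, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

The scheduling pain mentioned above falls out of this structure: the grid is only fully busy when the matrices tile it perfectly, which is why oddly shaped or non-dense work maps onto it badly.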

But yeah, but that's not even the main problem with TPUs. The main problem with TPUs is that they're closed source. Not only is the chip closed source, but all of-- XLA is open source, but the XLA to TPU compiler is a 32 megabyte binary blob called libtpu on Google's cloud instances.

It's all closed source. It's all hidden stuff. And, you know, well, there's a reason Google made it closed source. Amazon made a clone of the TPU. It's called Inferentia. Or they have some other name for it, a training-- >> Trainium, yeah. >> Trainium, yeah, yeah, yeah. And here, look, it's a clone of the TPU.

It's--software doesn't work though. Like Google software at least kind of works. >> So those are kind of like the three core theses. The first thing you're working on, that you've been working on, is TinyGrad. And on one of your Twitch streams, you said it's the best thing you've ever written.

Yeah, tell us a bit more about that creation. >> For a long time, TinyGrad had a hard limit of 1,000 lines of code. And what this would force you to do is really make sure you were not wasting lines. I got rid of the restriction because it became a little code golfy at the end.

But once like the core framework of TinyGrad was there in those 1,000 lines, it's not huge now. It's like 2,800 lines now. It's still very readable. But like the core framework, the ideas are expressed with no boilerplate. If you go read PyTorch--you know, PyTorch is actually pretty good code.

I think Facebook's pretty good. But there's so much boilerplate. Go in PyTorch and try to track down how a ReLU actually works. >> Just a lot of instructions? >> Oh, you're going to be diving down a long stack from Python to C to custom libraries to dispatchers to--and then I don't even know how to read TensorFlow.

Like I don't even know where's a ReLU in TensorFlow. Nobody knows. Someone at Google knows maybe. Google as an organism knows. I don't know if any individual at Google knows. >> What are like the important ergonomics like for a developer as you think about designing the TinyGrad API? >> So, the TinyGrad frontend looks very similar to PyTorch.

There's an even higher level frontend you can use for TinyGrad which is just ONNX. We support--we have better support for ONNX than Core ML does. And we're going to have--I think we're going to pass ONNX Runtime soon too. Like people think ONNX Runtime, that's a gold standard for ONNX.

No, you can do better. >> Pass them in what specifically? >> Test, compliance tests. So, ONNX has a big set of compliance tests that you can check out. And we have them running in TinyGrad and there's some failures. We're below ONNX Runtime but we're beyond Core ML. So, like that's like where we are in ONNX support now.

But we will pass. We will pass ONNX Runtime soon because it becomes very easy to add ops because of how, like, you don't need to do anything at the lower levels. You just do it at this very high level and TinyGrad compiles it to something that's fast using these minimal ops.

You can like write--I mean, most concretely what TinyGrad can do that like PyTorch can't really do is if you have something like A times B plus C, right? If you write that in Naive PyTorch, what it's going to do on the GPU is, well, read A, read B in a kernel and then store A times B in memory and then launch another kernel to do A times B plus C, okay?

Got to do those loads from memory. Now I did a whole extra round trip to memory that I just didn't have to do. And you're like, "Yeah, but you can use the Torch JIT and it corrects this." Yeah, for that one example, for that one example of a mul-add, but oh, now you did three multiplies, six multiplies, right?

It doesn't--it won't compile arbitrary code. >> And if you looked into like the other approaches like PyTorch Lightning to accelerate PyTorch itself? >> Well, PyTorch Lightning, my understanding is it's mostly a framework around PyTorch, right? PyTorch Lightning is not going to fix this fundamental problem of I multiply six tensors together, why is it going to memory any more than a single read from each and a single write to the output?

>> Okay. >> Yeah, there are lower level things in PyTorch that are--I'm not exactly sure what Dynamo does but I know they're generating some Triton stuff which is going to generate the kernels on the fly. But, you know, PyTorch Lightning is at a higher level of abstraction. So TinyGrad's front-end stuff looks like PyTorch.

I made a few tweaks, there's a few things I don't like about PyTorch. Why is ReLU a class? No, really, like what's the state? You make a class and there's a state. Everything should just be functional, not torch.nn.functional.relu, but just .relu() on the tensor. Also like there's things in Torch where you have to do torch.something() and not tensor.something(), right?

And like why are these things--like this just--it just shows an API that's like not perfectly refined. But when you're doing stuff TinyGrad style, where you don't have lines to spare, well, it has to work this way, because even the lines to express it-- well, take the where operator in PyTorch.

Why is it true case, condition, false case? The worst--that's how Python expresses ifs. It's disgusting, right? Ternary operators are much nicer. It should be--I can do, like, (a < 0).where(a, 1), right? >> The very Pandas-like API. >> Yeah, yeah, yeah. It's just--it's some--it looks like Torch, NumPy, Pandas.
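As a rough side-by-side of the API style being criticized versus the tensor-method style he prefers (a sketch; the tinygrad import path and exact method names are assumptions and may differ by version):

```python
import torch
from tinygrad.tensor import Tensor  # import path may vary across tinygrad versions

a = torch.randn(4)
relu = torch.nn.ReLU()                          # ReLU as a class, even though it holds no state
y1 = relu(a)
y2 = torch.where(a < 0, a, torch.ones_like(a))  # free function: where(condition, true_case, false_case)

b = Tensor.randn(4)
z1 = b.relu()                                   # just a method on the tensor
z2 = (b < 0).where(b, 1)                        # the "(a < 0).where(a, 1)" style described above
```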

They're all very similar. I tried to take like the cleanest subset of them and express them. But like I said, you can also interact with it using ONNX. But I have a rewrite of Stable Diffusion, I have a rewrite of LLaMA, I have a rewrite of Whisper. You can look at them.

They're shorter than the Torch version and I think they're cleaner. >> And you stream them all? >> Yeah. >> Very nice. >> Laziness is kind of the other important concept that you're leveraging to do operation fusing. Yeah, talk a bit more about that. >> So, yeah, you have basically like a few different like models for compute.

The simplest one is Eager, right? The simplest one is Eager. As soon as the interpreter sees A times B, it actually dispatches A times B, right? Then you have Graph, like TensorFlow, which will put A times B into a graph and then will do absolutely nothing until you actually compile the graph at the end.

I like this third choice, which is somewhere in the middle, laziness. Laziness is, you don't know when the ops are going to dispatch and don't worry about that. You don't have to worry about this as a programmer. You just write out all your stuff and then when you actually type .numpy, it'll be ready by the time you, you know, copy the thing back to CPU.

Or you can do .realize and it will actually like force that tensor to be allocated in RAM. But yeah, a lot of times, right, like, and if you think about it, PyTorch is kind of lazy in a way, but they didn't extend the paradigm far enough, right? When I do A times B in PyTorch, it's going to launch a CUDA kernel to do A times B, but it's not going to wait for that CUDA kernel to complete.

So you're getting the worst possible world. You're getting the same laziness, but you also can't get fusion because PyTorch doesn't know that I'm then going to do plus C. There's no way for it to be like, "Whoa, whoa, whoa, don't launch that CUDA kernel. Whoa, just do this one too." Right?

You can kind of like, this stuff, PyTorch is working on this and, you know, it's a little bit harder. Like at Comma, I felt like I was competing against a lot of idiots. Here I'm competing against, you know, smart, smart, very smart people who've made, yeah, who've made some, I think, different trade-offs, right?

Who've made some different trade-offs. Whereas if you're trying to build something that is just straight up good on NVIDIA and we have a lot of people and complexity to throw at it, yeah, PyTorch made a lot of the right choices. I'm trying to build something that manages complexity. Like you can always make your software do more.

The magic is when you can make your software do more without adding complexity, right? Because, you know, complex things eventually collapse under their own weight. So it's kind of that. >> How does fusing actually work? >> Like TensorFlow actually collapsed. It's kind of what happened, right? How does fusing actually work?

So yeah, there's this thing called lazy.py. And when you do like A times B, that's, it's put into a graph, but it's a very local graph. There's no global graph optimizations. And even this can change, right? Again, like the programming model for TinyGrad does not preclude eagerness, right? Laziness is not guaranteed laziness.

It's just going to try its best. So you put in A times B, and that's a binary op, right? And then you put in A times B, like that's a node in the graph. It's a virtual node because it's not realized yet, plus C. Okay, here's a new node, which takes the C tensor in here and takes the output of A times B.

It's like, whoa, wait, there's two binary ops. Okay, we'll just fuse those together. Okay, here I have a kernel. This kernel has A, B, and C as inputs. It does A times B plus C in the local registers, and then outputs that to memory. And you can GRAPH=1 in TinyGrad.
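From the user's side, that flow looks roughly like this (a minimal sketch assuming tinygrad's 2023-era API; the import path may differ by version). Nothing is dispatched while the expression is being built; realizing the tensor triggers a single fused kernel, and the GRAPH=1 and DEBUG=2 environment variables described next let you see exactly what was launched:

```python
# Run as: GRAPH=1 DEBUG=2 python fuse_example.py
from tinygrad.tensor import Tensor  # import path may vary across tinygrad versions

a = Tensor.rand(1024, 1024)
b = Tensor.rand(1024, 1024)
c = Tensor.rand(1024, 1024)

out = a * b + c       # nothing runs yet: this just builds a small local graph of lazy ops
result = out.numpy()  # realization: the two binary ops are fused into one kernel, so
                      # A, B, and C are each read once and the output is written once
```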

Another, like, amazing thing that TinyGrad has that I've not seen in any other framework is two things. GRAPH=1, which is an environment variable. It will output a complete graph of all the operations. A lot of people are like, oh, you can use PyTorch, export it to ONNX, and use Netron.

Yeah, you can, but like what? That's not what's real. GRAPH=1 will show you the actual kernels that were dispatched to the GPU. You can also set DEBUG=2, which will print those kernels out in your command line. And it will tell you the exact number of flops and the exact number of memory accesses in each kernel.

So you can immediately see, wait a second, okay, this kernel used this many flops, this was the gigaflops, this is how many bytes it read, and this was the gigabytes per second. And then you can profile without having to like, okay, I mean, in theory, in PyTorch, sure, use the NVIDIA Nsight profiler, which is-- >> No one does that.

>> No one does, of course, because it's so difficult, right? Like, actually, NVIDIA used to, pre, I think CUDA 9 was the last one they had it. They had a command line one, but now it's like, okay, I'm going to generate this blob, use this NVIDIA GUI tool to convert it into a Chrome trace and then load it.

Yeah, no one does that, right? I just type DEBUG=2 on any TinyGrad model and it will show you all the kernels that it launches and the efficiency of each kernel, basically. >> Yeah, this is something that John Carmack has often commented about, is that when you code, you need to build in your instrumentation or observability right into that.

I wonder if whatever John is working on, he's adopting this style, and maybe we can sort of encourage it by, like, I don't know, naming it and coining it as a certain kind of debugging style. >> If he would like to start contributing to TinyGrad, I'd be-- >> You should hook up with him.

>> I'd be so happy. I've chatted with him a few times. I'm not really sure what his company is doing. I think it's all, I think it's pretty, but no, I mean, hopefully, like, we get TinyGrad to a point where people actually want to start using it. So TinyGrad right now is uncompetitive on, it's uncompetitive on NVIDIA, it's uncompetitive on x86.

>> And specifically, what do you care about when you say uncompetitive? >> Speed. >> Okay. >> Sheer speed. It's correct. The correctness is there. The correctness for both forwards and backwards passes is there, but on NVIDIA, it's about 5x slower than PyTorch right now. Like, 5x, wow, this is insurmountable.

No, there's reasons it's 5x slower, and I can go through how we're going to make it faster, and it used to be, you know, 100x slower, so, you know, we're making progress, but there's one place where it actually is competitive, and that's Qualcomm GPUs. So TinyGrad is used to run the model in OpenPilot.

Like, right now, it's been live in production now for six months, and TinyGrad is about 2x faster on the GPU than Qualcomm's library. >> And why specifically Qualcomm? >> Well, because we have Qualcomm. We use Qualcomm in the Comma devices. >> Oh, I mean, like, what makes, what about Qualcomm architecture?

>> Oh, what makes it doable? Well, because the world has spent how many millions of man-hours to make NVIDIA fast, and Qualcomm has a team of 10 Qualcomm engineers? Okay, well, who can I beat here? Like, what I propose with TinyGrad is that developer efficiency is much higher, but even if I have 10x higher developer efficiency, I still lose on NVIDIA, right?

You know, okay, I didn't put 100,000 man-hours into it, right? If they put a million, like, that's what I'm saying, but that's what I'm saying we can get, and we are going to close this speed gap a lot. Like, I don't support Tensor Cores yet. That's a big one that's just going to, okay, massively close the gap.

And then AMD. I can't even get, I don't even have a benchmark for AMD because I couldn't get it compiled. Oh, and I tried. Oh, I tried. I spent a day. Like, I spent actually a day trying to get PyTorch, and I got it built. I got it kind of working, and then I tried to run a model.

Like, there's all kinds of weird errors, and the rabbit holes are so deep on this. I'm like, so we, you know, you can compare the speed. Right now, you can run LLAMA. You can run anything you want on AMD. It already all works. Any OpenCL backend works, and it's not terribly slow.

I mean, it's a lot faster than crashing, so it's infinitely times faster than PyTorch on AMD, but pretty soon, we're going to start getting close to theoretical maximums on AMD. That's really where I'm pushing, and I want to get AMD on MLPerf in a couple months, hopefully. >> Now that you bring up AMD.

>> Yeah, let's dive into that, because when you announced the TinyCorp fundraise, you mentioned one of your first goals is build the framework, runtime, and driver for AMD, and then on June 3rd on Twitch, you weren't as excited about AMD anymore. Maybe let's talk a bit about that, and you compared the quality of commit messages from the AMD kernel to the Intel work that people are doing there.

What's important to know? >> So when I said I want to write a framework, I never intended on writing a kernel driver. I mean, I flirted with that idea briefly, but realistically, there's three parts to it, right? There's the ML framework, there's the driver, and then there's the user space runtime.

I was even down to rewrite the user space runtime. I have a GitHub repo called cuda_ioctl_sniffer. It's terribly named, but you can actually launch a CUDA kernel without CUDA, so you don't need CUDA installed. Just the NVIDIA open source driver and this open source repo can launch a CUDA kernel.

So rewriting the user space runtime is doable. Rewriting the kernel driver, I don't even have docs. I don't have any docs for the GPU. It would just be a massive reverse engineering project. When I saw that there, I wasn't complaining about it being slow. I wasn't complaining about PyTorch not compiling.

I was complaining about the thing crashing my entire computer. It panics my kernel, and I have to wait five minutes while it reboots because it's a server motherboard and they take five minutes to reboot. So I was like, "Look, if you guys do not care enough to get me a decent kernel driver, there's no way I'm wasting my time on this, especially when I can use Intel GPUs." Intel GPUs have a stable kernel driver, and they have all their hardware documented.

You can go and you can find all the register docs on Intel GPUs. So I'm like, "Why don't I just use these?" Now, there's a downside to them. Their GPU is $350. You're like, "What a deal. It's $350." You get about $350 worth of performance. If you're paying about $400 for the PCIe slot to put it in, like between the power and all the other stuff, you're like, "Okay, never mind.

You've got to use NVIDIA or AMD from that perspective." But I sent an email to Lisa Su. She responded. >> Nice. >> Oh, you can see you published that email in a Discord. >> I did. I did. And she responded. And I've had a few calls since. And what I did was like what I tried to do.

Well, first off, thank you for responding. It shows me that if you don't care about your kernel panicking, I can't. This is just a huge waste of my time. Right? I'll find someone who will care. I'm not asking for your 7x7 Winograd convolution when transposed to be fast. I'm not asking for that.

I'm asking literally for- >> The basics. >> Oh, and this isn't TinyGrad. This is your demo apps. I ran their demo apps in loops and I got kernel panics. I'm like, "No. Okay." But no, Lisa Su reached out, connected with a whole bunch of different people. They sent me a pre-release version of ROCm 5.6.

They told me you can't release it, which I'm like, "Why do you care?" But they said they're going to release it by the end of the month. And it fixed the kernel panic. The guy managed to reproduce it with the two GPUs and the computer. And yeah, sent me a driver and it works.

So yeah, I had that experience. And then I had another experience where I had two calls with AMD's communication people. I tried to explain to these people open source culture. It's not open source if you dump the source code on a GitHub repo and then forget about it until the next release.

It's not open source if all your issues are from 2022. No one's going to contribute to that project. Sure, it's open source in a very technical sense. To be fair, it's better than nothing. It's better than nothing, but I fixed a bug in NCCL. There's a fun fact, by the way.

If you have a consumer NVIDIA GPU, they don't support peer-to-peer, and their all-reduce bandwidth is horrendously slow because it's using CUDA kernels to do the copy between the GPUs. And it's putting so many transactions on the PCIe bus that it's really slow. But you can use CUDA memcpy, and there's a flag to use CUDA memcpy, but that flag had a bug.

So I posted the issue on NCCL. I expected nothing to happen. The Nvidia guy replied to me within an hour. He's like, "Try this other flag." I'm like, "Okay, I tried the other flag. It still doesn't work, but here's a clean repro." And I spent like three hours writing a very clean repro.

I ended up tracking the issue down myself, but just the fact that somebody responded to me within an hour and cared about fixing the issue, okay, you've shown that it's worth my time, and I will put my time in because let's make this better. I'm here to help. But if you show me that you're like, "You're the kernel panics.

Let's just expect it." Okay. >> Well, it sounds like AMD is getting the message. >> They are. And I don't really think they've had someone explain to them. I was like, "You can build in public." And they're like, "What's an example of building in public?" I'm like, "Go look at PyTorch." Go look at PyTorch, right?

I have two minor things merged into PyTorch because it's very responsive. They're like minor bug fixes, but I feel like it's... >> Yeah. So that's kind of like the lowest level of the stack. And then at a slightly higher level, obviously, there's TinyGrad, there's Mojo, there's GGML. How are you thinking about breadth versus depth and where you decided to focus early on?

>> So GGML is very much like a... Okay, everyone has M1s, right? Actually, I was thinking... In the beginning, I was thinking of something more like GGML focused on the M1s, but GGML showed up and was just like, "We're actually just focusing on the M1s." And actually, M1 PyTorch is considerably better than AMD PyTorch.

M1 PyTorch works. It only gives wrong answers sometimes and it only crashes sometimes. But some models kind of run. When I was writing the metal backend, I was comparing to MPS PyTorch, and I had a discrepancy. TinyGrad checks all its outputs compared to Torch, and I had one where it didn't match.

I'm like, "I checked the matrix by hand. It matches TinyGrad. I don't understand." And then I switched PyTorch back to CPU and it matched. I'm like, "Oh." Yeah. Well, there's bugs. If you transpose the matrix, because I think it has to do with multi-views and PyTorch and weird under-the-hood stuff that's not exposed to you.

There's bugs and maybe they fix them. But it seems like there was a lot of momentum, again, because you're getting how many engineers care about making PyTorch work on M1? Thousands, tens of thousands. And you have an open development process, and guess what? It's going to be good. How many engineers care about AMD working, PyTorch AMD working?

You got 10 guys that work for AMD, and then a couple hobbyists. You revealed an interesting detail about how you debug, which is you hand-check the matrix math. No, I don't hand-check it. One of the best tests in TinyGrad is a file called testops.py. And it's just 100 small examples written in TinyGrad and PyTorch.

And it checks both the forwards and backwards to make sure they match. Good test suite. Very important. That's one of them where I really have put a lot of effort into CI for TinyGrad. I think CI is super important. I want that green check to mean I can merge this.
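In the spirit of testops.py (an assumed, simplified illustration, not the actual file), one of those checks looks roughly like this: build the same small expression in PyTorch and TinyGrad from identical inputs, then compare the forward outputs and the gradients.

```python
import numpy as np
import torch
from tinygrad.tensor import Tensor  # import path may vary across tinygrad versions

def check_forward_backward():
    x_np = np.random.randn(8, 8).astype(np.float32)
    y_np = np.random.randn(8, 8).astype(np.float32)

    # PyTorch reference
    xt = torch.tensor(x_np, requires_grad=True)
    yt = torch.tensor(y_np, requires_grad=True)
    out_t = (xt * yt + yt).relu()
    out_t.sum().backward()

    # same ops in tinygrad
    xg = Tensor(x_np, requires_grad=True)
    yg = Tensor(y_np, requires_grad=True)
    out_g = (xg * yg + yg).relu()
    out_g.sum().backward()

    # forward and backward must match within tolerance
    np.testing.assert_allclose(out_t.detach().numpy(), out_g.numpy(), atol=1e-5)
    np.testing.assert_allclose(xt.grad.numpy(), xg.grad.numpy(), atol=1e-5)
    np.testing.assert_allclose(yt.grad.numpy(), yg.grad.numpy(), atol=1e-5)

check_forward_backward()
```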

I don't want my tests to -- and the green check, if you somehow manage to introduce a bug and get the green check, okay, we're fixing the test. Top priority. Mojo? It's closed source. No, I'm not that interested. You know what I mean? Look, I like Chris Lattner. I think he's going to do great things, and I understand kind of the wisdom even in keeping it closed source.

But I'm interested when it's open. You have an interesting design deviation from him, because he's decided to be -- well, promised to be a superset of Python, and you have decided to break with PyTorch APIs. And I think that affects learnability and transportability of code. You know, if the PyTorch thing ends up being like a stumbling block, I could write a perfect PyTorch.

Like a -- you know, instead of import PyTorch, instead of, like, yeah, import Torch, you type import TinyTorch as Torch. And if that really becomes the stumbling block, I will do that. No, Chris Lattner went much further than PyTorch. Replicating the PyTorch API is something I can do with a couple, you know, like an engineer month or two.

Like a shim. Right, like a shim, yeah. Replicating Python, whoo-hoo-hoo. There's a big graveyard of those things. How's Piston going? How's Jython? You can go way back. So TinyGrid is one layer. You announced TinyBox recently, which is, you know, you made it -- so your core mission is commoditizing the petaflop.

And then your business goal is to sell computers for more than the cost to make, which seems super reasonable. And you're gonna have three TinyBoxes? Red, green, blue? No, no, no, no, no, no, no, no. That was my -- look, you know, a lot of people, like, I love, you know, leaning into like saying I'm giving up, right?

It's great to give up. Giving up is this wonderful thing. It's so liberating. And then, like, you can decide afterward if you really give up or not. There's very little harm in saying you give up, except like, you know, great, Twitter haters have something to talk about. And all press is good press, kids, so.

So obviously -- Just red. Only red. TinyBox, red. TinyBox, red. Unless AMD, you know, upsets me again, and then we're back to other colors. We have other colors to choose from. When you think about hardware design, what are some of the numbers you look for? So teraflops per second is one, but like memory bandwidth is another big limiter.

Like, how do you make those tradeoffs? Well, I mean, fundamentally, I'm limited to what GPUs I can buy. But yeah, for something that I think a lot of people are going to want to reasonably do with -- a coworker of mine described them as luxury AI computers, right? Like, luxury AI computers for people.

And that's like what we're building. And I think a common thing people are going to want to do is run, like, large LLaMA, right? Or large, like, Falcon or whatever. FP16 LLaMA. FP16, exactly. Exactly. You know, int8, I think, can work. I think that, like, what GGML is doing to go to, like, int4, like, this doesn't work.

Like, have you done -- maybe they have. But, like, I read what it was, and I was like, this isn't from any paper. This is just some -- Like, you made -- Squeezing as much as possible. Yeah, you made up some quantization standard to make it run fast. And, like, maybe it works, but, okay, where's, like, the HellaSwag number, right?

Where's your, where's your, you know, all your -- The thesis is right, that, like, if you have billions, hundreds of billions of parameters, that the individual quantization doesn't actually matter that much. Well, the real way to look at all of that is to just say you want to compress the weights, right?

It's a form of weight compression. Quantization is a form of weight compression, right? Now, this is obviously not lossless. It's not a lossless compressor, right? If it's a lossless compressor, and you can show that it's correct, then, okay, we don't have to have any other conversation. But it's a lossy compressor.

Yes. And how do you know that your loss isn't actually losing the power of the model? Interesting. Maybe int4 65B LLaMA is actually the same as FP16 7B LLaMA, right? We don't know. Maybe someone has done this by now, but I looked for it when it, like, first came out, and people were talking about it, and I'm like, I just have -- like, it's not from a paper, right?

The int8 stuff is from a paper where they, like, some of the int8 stuff is from a paper. There's one paper, I think it's, like, LLM.int8, where they actually, you know, do all the tests. And they didn't go fully int8. They made, like, 90% of it int8 and kept, like, 10% of it in FP16 for what they called, like, the outliers or whatever.

So I think that this is not quite so easy. And I think being able -- well, so first off, if you're training, no one's gotten training to work with int8 yet. There's a few papers that vaguely show it. But if you're training, you're going to need BF16 or float16.

So this is why I target that. Now the thing that you're going to want to do is run these large language models out of the box on your hardware in FP16, and that's memory bandwidth. So you need large amounts of memory bandwidth, too. So you ask how I trade off memory bandwidth and flops, so what GPUs can I buy?

>> And I saw one of your -- so first of all, you have this hiring process, which is you've got to solve one of the bounties that are open on TinyGrad. There's no technical interview. One of them is int8 support. Do you already have some things you want to test on?

>> We have int8 support. What I'd like to see somebody do is just load the ggml int8 llama into TinyGrad and then benchmark it against the FP16 one. Int8 already works in TinyGrad. It doesn't actually do the math in int8, which is even a stronger -- like, it does all the math still in FP32.

So int8 can mean you just have your weights in int8, or int8 can mean you actually do your math in int8. And doing your math in int8, the big, like, gain that people care about is actually having your weights in int8, because weights in int8 mean less memory and less memory bandwidth, whereas the math, keep it in FP32.

On M1s, it doesn't even matter if you're doing -- it doesn't matter what data type you're doing in the GPU. I'm not even sure it can do int8, but FP16 and FP32 are the same. It's the same teraflops. So, yeah, no, that's one of the bounties. One of the bounties is get int8 llama running with the int8 weights.
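The distinction he is drawing, sketched in NumPy (illustrative only, not tinygrad code): weight-only int8 stores the weights as int8 plus a scale, saving memory and memory bandwidth, while the matmul itself still happens in FP32.

```python
import numpy as np

# Weight-only int8: int8 weights + a per-tensor scale, math stays in FP32.
W = np.random.randn(1024, 1024).astype(np.float32)
scale = float(np.abs(W).max()) / 127.0
W_int8 = np.round(W / scale).astype(np.int8)   # 4x less memory than FP32, 2x less than FP16

def linear_int8_weights(x_fp32, W_int8, scale):
    W_deq = W_int8.astype(np.float32) * scale  # dequantize on load; the compute is FP32
    return x_fp32 @ W_deq

x = np.random.randn(1, 1024).astype(np.float32)
y = linear_int8_weights(x, W_int8, scale)
```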

And then actually, you don't even need to -- what you could even do, if you really want to test this, just take the FP16 weights, convert them to int8, then convert them back to FP16, then compare the unconverted and converted. >> Oh, that's a nice hack. Oh, yeah.

>> Right? Like -- >> This should be lossless in the other direction. >> Yeah, I think FP16, it should be lossless in the other direction. I'm actually not 100% about that. >> Why not? >> Oh, because, like, you ever try to, like, if you want to represent -- if it was, like, int16, it's not lossless.

>> Sure. >> I think all of int8 can be represented in FP16, but I'm not 100% about that. >> Okay. >> And actually, I think -- >> Just try all the bytes. >> We just have to do it, right? Just literally do it. There's only 256 to check.
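Both checks are a few lines of NumPy (a sketch, assuming a simple symmetric per-tensor quantizer): enumerate all 256 int8 values through FP16 and back, and do the FP16 to int8 to FP16 weight round trip he described to measure the error you introduced.

```python
import numpy as np

# 1) Is int8 -> FP16 -> int8 lossless? There are only 256 values to check.
vals = np.arange(-128, 128).astype(np.int8)
roundtrip = vals.astype(np.float16).astype(np.int8)
assert np.array_equal(vals, roundtrip)  # it is: FP16 represents every integer up to 2048 exactly

# 2) FP16 weights -> int8 -> back to FP16, then measure the damage.
W = np.random.randn(4096, 4096).astype(np.float16)
scale = float(np.abs(W).max()) / 127.0
W_q  = np.round(W.astype(np.float32) / scale).astype(np.int8)   # quantize
W_rt = (W_q.astype(np.float32) * scale).astype(np.float16)      # dequantize
print("max abs error:", np.abs(W.astype(np.float32) - W_rt.astype(np.float32)).max())
```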

But, yeah, either way -- I mean, int4, definitely. So do your int4, convert it back, and now see, even with int4 weights and FP32 math, like, okay, how much has your performance degraded this model? >> Yeah. >> Yeah. >> So can we -- I'm about to zoom out a little bit from the details.

I don't know if you had more to -- >> No, I think, like, the -- you're planning to release the first tiny box, ship them in, like, two to six, eight months, something like that. What's top of mind for you in terms of building a team? Who should -- who are you calling for?

>> Yeah. Well, to stay on the tiny box for one minute, so the GPUs are picked out, and you're like, well, I could make that computer with those GPUs, and my answer is, can you? Do you know how to put -- do you know how hard it is to put six GPUs in a computer?

People think it's really easy, and it's really easy to put one GPU in a computer. It's really easy to put two GPUs in a computer, but now you want to put in eight. Okay, so I'll tell you a few things about these GPUs. They take up four slots. What kind of computer -- you can buy the nicest super micro.

You can't put eight of those in there. You need two slot blowers. If you want to use one of those four-U super micros, you need two slot blowers. Or water cooling. If you're trying to get the four-slot cards in there, you're going to need some form of water cooling.

Or you're going to need -- there are some, like, Chinese 4090s that are blowers, right? You're going to need blowers or water cooling if you're trying to get it in those things, right? >> So are you doing water? >> No, I'm not using that chassis. >> Okay. >> Then, the other thing that -- okay, so now you want to get six GPUs in a computer, so that's a big challenge.

You're like, "Oh, I'll just use PCIe extenders. I saw it online as tech tips. It works great." No, it doesn't. Try PCIe extenders that work at PCIe 4.0. And interconnect bandwidth is super important. >> Yes. >> They don't work at 3.0. No PCIe extender I've tested, and I've bought 20 of them, works at PCIe 4.0.

So you're going to need PCIe re-drivers. Now, okay, how much is that adding cost, right? Like, these things all get really hard. And then, tiny boxes, I've even added another constraint to it. I want this thing to be silent. Not totally silent, but my limit is like 45, maybe 50 dB, but not -- super micro machine, 60 dB.

We have a small -- we have a compute cluster at Comma. You've got to wear ear protection to go in there. >> Yeah, I've seen some videos where you give a tour. >> Oh, yeah. >> Yeah. >> It's noisy. >> It's super loud. You've got all these things just screaming.

>> Yeah. >> 10,000 RPM, just screaming. Like, I want to be able to use the normal big GPU fans, and make this thing so it can sit under your desk, plug into one outlet of power, right? Six GPUs. Your GPUs are 350 watts each. You can't plug that into a wall outlet.

Okay, so how are you going to deal with that? Good questions, right? And you're not sharing them. Well, that one, I mean, that one is pretty obvious. You have to limit the power on the GPUs, right? >> Uh-huh. >> You have to limit the power on the GPUs. Now, you can limit power on GPUs and still get -- you can use like half the power and get 80% of the performance.

This is a known fact about GPUs, but like, that's one of my design constraints. So, when you start to add all these design constraints, good luck building a tiny box yourself. >> Uh-huh. >> You know, obviously, it can be done, but you need something that has actually quite a bit of scale and resources to do it.

>> And you see like the -- under the desk, it's like one of the main use cases, kind of like individual developer use or -- >> Yeah. What I also see is more of a like an AI hub for your home, right? As we start to get like home robotics kind of stuff, you don't want to put the inference on the robot, but you also don't want to put the inference on the cloud.

You don't want to put it on the robot because, okay, it's 1,500 watts, tiny box. You'll put batteries, you'll charge them. Bad idea. I mean, just wireless. Wireless is .5 milliseconds, right? This is super fast. You don't want to go to the cloud for two reasons. One, cloud's far away.

It's not that far away. You can kind of address this. But two, cloud's also mad expensive. Like cloud GPUs are way more expensive than running that GPU at your house, at least any rates you're going to get, right? Maybe if you commit to buy, well, yeah, I'm going to buy 10,000 GPUs for three years, then maybe the cloud will give you a good rate.

But like, you want to buy one GPU in the cloud? Ooh. I mean, okay, you can go to like Vast, but like if you're going on Azure or AWS, oh, that's expensive. Yeah. This is like a personal data center, you know, instead of a cloud data center. We like the term compute cluster, so we can use NVIDIA GPUs.

Data centers may be a little bit dated. It's a compute cluster, which is totally legal under the CUDA license agreement. You talk a lot about the PCIe connection. Do you think there's any bottleneck there? What do you mean? Just you're limited by bandwidth, right? Okay. For some things, yes.

So the bandwidth is roughly 10x less than what you can get with NVLinked A100s. NVLinked A100s are going to have, and then you can even get like full fabric and NVIDIA really pushes on that stuff, 600 gigabytes per second, right? And PCIe 4, you're going to get 60, right?

So you're getting 10x less. That said, why do you need the bandwidth, right? And the answer is you need it for training huge models. If you're training on a tiny box, your limit's going to be about 7 billion, right? If you're training on big stuff, your limits could be like 70 billion, right?

Okay. You can hack it to get a bit higher. You can hack it like GPT hacked it to get a bit higher, but like that 65 billion in LLaMA, like there's a reason they chose 65 billion, right? And that's what can reasonably fit model parallel on 8 GPUs, right?

So yes, you are going to end up training models. The cap's going to be like 7 billion. I actually heard this on your podcast. I don't think that the best chatbot models are going to be the big ones. I think the best chatbot models are going to be the ones where you had a thousand training runs instead of one.

And I don't think that the interconnect bandwidth is going to matter that much. >> So what are we optimizing for instead of compute optimal? >> What do you mean compute optimal? >> So you're talking about this, the Lama style models where you train for like 200x. >> You train longer, yeah.

>> Yeah, yeah. >> Yeah. So, okay. You can always make your model better by doing one of two things, right? At Comma, we just have a strict limit on it. You can always make your model better by training longer and you can always make your model better by making it bigger.

But these aren't the interesting ones, right? Particularly the making it bigger. Because training it longer, fine. You know, you're getting a better set of weights. The inference is the same. The inference is the same whether I trained it for a day or a week. >> Yeah. >> But the, okay, if it's 1 billion versus 10 billion, well, I 10x my inference too, right?

So I think that these big models are kind of, sure, they're great if you're research labs and you're trying to like max out this hypothetical thing. >> Yeah, which you can talk about later. >> Yeah, yeah, yeah. But if you're like a startup or you're like an individual or you're trying to deploy this to the edge anywhere, you don't need that many weights.

>> Yeah, yeah. >> You actually don't want that many weights. >> Yeah. Optimizing for inference rather than capabilities. >> Yes. >> Doing benchmarks. >> Yes, yes. And I think the inference thing, right? There's going to be so much more. Right now, the ratio between like training and inference on clouds, I think it's only still like, I think it's like 2 or 3x, right?

It's 2 or 3x more inference, which doesn't make any sense, right? There should be way more inference. >> Yeah. >> There should be 10 to 100x more inference in the world than training. But then also, like, what is training, right? You start to see these things like LoRA, like, you're getting kind of, it's kind of blurring the lines between inference and training.

And I think that that blurred line is actually really good. I'd like to see much more like on-device training or on-device fine-tuning of the final layer, which is the stuff we're pushing toward at Comma, right? Like, why am I shipping a fixed model? I totally want this model to fine-tune based on, like, how, you know, your left tire is flat, right?

Like, every time you cut the same turn because your left tire is flat, well, it should learn that, right? >> So, would Comma pursue parameter-efficient fine-tuning? >> Yeah. Yeah, yeah, yeah. We're, we're -- >> Seems like a -- >> We're looking into stuff like that. I mean, Comma's already very parameter-efficient because we have to, like, run this thing in a car and you have to, like, cool it and power it.

>> Yeah. >> Yeah. >> Yeah, yeah. And so, that's kind of like the intelligence cluster you have in your home. You see, when a person is using a third-party model, they load it locally and kind of do the final fine-tuning. It kind of stays within the box. >> Yeah. I think that that's one thing.

That's one version of it for the privacy conscious. I also see a world where you can have your tiny box, in its down cycles, mine FLOPcoin, right? You know, not all, it turns out not all crypto is a scam. There's one way to tell if crypto is a scam.

If they're selling the coin before they make the product, it's a scam. >> Yeah. >> If they have the product and then they sell the coin, it's maybe not a scam, right? So, yeah, my thought is, like, each tiny box would let you, would have a private key on it.

And you have to do it this way. You can't just let anyone join because of Sybil attacks, right? There's a real problem of, like, how do I ensure your data is correct? And the way that I ensure your data is correct on the tinynet is if you ever send wrong data, you're banned from the tinynet for life.

>> You're out? >> Yeah, yeah, yeah. >> Oh, wow. >> Your $15,000 hardware box is banned. So, you know, don't cheat. Obviously, if it messes up, we'll forgive you. But I'm saying, like -- >> Somebody's going to try to jailbreak your devices. >> There's no jailbreak. >> There's no jailbreak.

>> There's no jailbreak. >> There's just a different network. >> Well, there's just a private key on each device, right? Like, if you buy a tiny box from the tiny corp, I give you a private key. It's in my back-end server, right? You want to hack my server, that's illegal.

Anything you want to do on the device, the device is yours. My server's mine, right? Like -- >> Yeah, yeah. Have you looked into, like, federated training at all? >> Yeah. So, I mean, okay, you're now -- there's, okay, there's orders of magnitude of federated training. You mean, like, over the cloud and stuff?

Over the internet? >> Yeah, over the internet, but also distributed on a bunch of devices, right? >> Yeah. >> Which some people -- >> I'm very bearish on this stuff. >> Yeah. >> Because of your interconnect bandwidth, right? So, okay, at the high-end, you have your interconnect bandwidth of NVLink, which is 600 gigabytes per second, right?

>> Yeah. >> The tiny box has 60 gigabytes per second. And then your internet has 125 megabytes per second, right? Not gigabytes, 125 megabytes, right? So, okay, that's -- >> Orders of magnitude. >> That's how many orders of magnitude we're talking here? Like, from 60 down to 125? >> Yeah.

>> Like, all right, that's over 100X. That's 400X, right? >> Yeah. >> So, like, no. But what you can do is inference, right? Like, for inference, you don't care. >> Mm-hmm. >> For inference, there's so little bandwidth at the top and the bottom of the model that, like, yeah, you can do federated inference, right?
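For reference, the gaps being counted here, written out with the round numbers above (back-of-the-envelope only):

```python
nvlink   = 600e9   # bytes/s: NVLink-class interconnect
tinybox  = 60e9    # bytes/s: PCIe 4.0-class interconnect in the tiny box
internet = 125e6   # bytes/s: a gigabit home connection

print(nvlink / tinybox)    # 10x
print(tinybox / internet)  # 480x with these round numbers, i.e. "over 100X"
print(nvlink / internet)   # 4800x from NVLink all the way down to the internet
```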

And that's kind of what I'm talking about. There's also interesting things to push into, like, you're like, but, okay, what if you want to run closed-source models? This stuff gets kind of interesting, like, using TPMs on the boxes and stuff. >> Yeah. >> But then someone might jailbreak my device.

So, you know, maybe we don't try to do that. >> Yeah. What's, like, the enterprise use case? Do you see companies buying a bunch of these and, like, stacking them together? >> So, the tiny box is, like, the first version of what we're building. But what I really want to do is be on the absolute edge of flops per dollar and flops per watt.

These are the two numbers that matter. So, the enterprise use case is you want to train, like, Comma. So, Comma just built out a new compute cluster. It's about a person and a half. So, you know, it's a decent size. A person and a half. >> A person being 20 petaflops.

>> A person is 20 petaflops. It's about 30 petaflops. We built out a little compute cluster. And, you know, we paid double what you theoretically could per flop, right? You theoretically could pay half per flop if you designed a bunch of custom stuff. And, yeah, I mean, I could see that being, you know, tiny Corp.

Comma is going to be the first customer. I'm going to build a box for Comma. And then I'm going to show off the box I built for Comma and be like, okay, do you want to buy one? I sell $250,000 training computers. Or how much is one H100 box? It's 400 grand? Okay.

It's 400 grand? Okay. I'll build you a 400 grand training computer and it'll be 10x better than that H100 box. Again, not for every use case. For some, you need the interconnect bandwidth. But for 90% of most companies' model training use cases, the tiny box will be 5x faster for the same price.

Awesome. You mentioned the person of compute. How do we build a human for $20 million? Well, it's a lot cheaper now. It's a lot cheaper now. So, like I said, Comma spent about half a million on our person and a half. What are some of the numbers people should think of when they compare compute to like people?

So, GPT-4 was 100 person-years of training. That's more like on the timescale. 20 petaflops is one person. I think, right now, the math was that for the price of the most expensive thing we build, which is the International Space Station, we could build one Tampa. Yeah, one Tampa of compute.

Yeah, which is 400,000 people. Yeah, we could build. So, like the biggest training clusters today, I know less about how GPT-4 was trained. I know some rough numbers on the weights and stuff, but Llama- A trillion parameters? Well, okay. So, GPT-4 is 220 billion in each head, and then it's an eight-way mixture model.

So, mixture models are what you do when you're out of ideas. So, it's a mixture model. They just train the same model eight times, and then they have some little trick. They actually do 16 inferences, but no, it's not like- So, the multimodality is just a vision model kind of glommed on?

I mean, the multimodality is like obvious what it is too. You just put the vision model in the same token space as your language model. Oh, did people think it was something else? No, no, the mixture has nothing to do with the vision or language aspect of it. It just has to do with, well, okay, we can't really make models bigger than 220 billion parameters.

We want it to be better. Well, how can we make it better? Well, we can train it longer, and okay, we've actually already maxed that out. We're getting diminishing returns there. Okay. A mixture of experts. Yeah, a mixture of experts. We'll train eight of them, right? So, all right.

So, you know, the real truth is whenever a company is secretive, with the exception of Apple, Apple's the only exception, whenever a company is secretive, it's because they're hiding something that's not that cool. And people have this wrong idea over and over again that they think they're hiding it because it's really cool.

It must be amazing. It's a trillion parameters. No, it's a little bigger than GPT-3, and they did an eight-way mixture of experts. Like, all right, dude, anyone can spend eight times the money and get that. All right. But yeah, so coming back to what I think is actually going to happen is, yeah, people are going to train smaller models for longer and fine-tune them and find all these tricks, right?

Like, you know, I think OpenAI used to publish stuff on this, about how much better the training has gotten holding compute constant. And it's gotten a lot better, right? Compare, like, batch norm to no batch norm. Yeah. And now we have like- Because you're finding algorithms like flash attention.

Yeah. Well, flash attention. Yeah. Flash attention is the same compute. Flash attention is an interesting fact where it's actually the identical compute. It's just a more efficient way to do the compute. But I'm even talking about like, look at the new embeddings people are using, right? They used to use this like boring old embeddings.

Now, like, LLaMA uses that complex one, and there was, like, ALiBi. I'm not up to date on all the latest stuff, but those tricks give you so much. There's been a whole round trip with positional embeddings. I don't know if you've seen this discussion. I haven't followed- Like, you need them, you need rotary, and then you don't need them.

I haven't followed exactly. I mean, you quickly run into the obvious problem with positional embeddings, which is you have to invalidate your KV cache if you run off the context. So that's why I think these new ones, they're playing with them, but I'm not that... I'm not an expert on, like, the latest up-to-date language model stuff.
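For reference, rotary position embeddings (the scheme LLaMA actually uses; ALiBi is a separate, bias-based approach) can be sketched like this in NumPy. This is a generic illustration, not any particular model's code.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary position embeddings for x of shape (seq_len, dim), dim even."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)            # per-pair rotation frequencies
    angles = np.arange(seq_len)[:, None] * freqs[None]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # rotate each (x1, x2) pair by a position-dependent angle
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(16, 64)
k = np.random.randn(16, 64)
q, k = rope(q), rope(k)   # applied to queries and keys before the attention matmul
```

The rotation bakes absolute positions into the cached keys, so naively sliding the context window changes those positions, which is the KV-cache invalidation issue mentioned above.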

Yeah. I mean, we have what we do at Comma, and I don't know how that works, but like- What are some of the things, I mean, that people are getting wrong? So back to autonomous driving, there was like the whole like LiDAR versus vision thing. It's like people don't get into accidents because they cannot see well.

They get into accidents because they get distracted and all these things. Do you see similarities today on, like, the path to AGI? Like, what are the- Nothing I say about this is ever going to compete with how Rich Sutton stated it. Rich Sutton, the reinforcement learning guy, wrote The Bitter Lesson.

Nothing I say is ever going to compete with, The Bitter Lesson is way better than any way I'm going to phrase this. Just go read that. And then like, I'm sorry it's bitter, but you actually just have to believe it. Like over and over again, people make this mistake.

They're like, oh, we're going to hand engineer this thing. We're going to hand, no, like stop wasting time. Which is, I mean, OpenAI is not taking The Bitter Lesson. No. OpenAI- They were leaders in deep learning for a long, long, long time. Open- But you're telling me that GPT-4 is not, yeah.

Well, OpenAI was the absolute leader to the thesis that compute is all you need. Yes. Right? And there's a question of how long this thesis is going to continue for. It's a cool thesis. And look, I think I would be lying along with everybody else. I was into language models like way back in the day for the Hutter Prize.

I got into AI through the Hutter Prize. Like 2014, I'm trying to build compressive models of Wikipedia. And I'm like, okay, why is this so hard? Like what this is, is a language model, right? And I'm playing with these like Bayesian things. And I'm just like, oh, but like, I get it.

Like, it needs to be like... it's like, I have two data points and they're almost the same, but how do I measure that almost, right? I couldn't, like, wrap my head around this. And this was around the time Karpathy released the first, like, RNN that generated the Shakespeare stuff.

And I'm like, okay, I get it. Right? It's neural networks that are compressors. Now, you can't actually win the Hutter Prize with these things, because the Hutter Prize is MDL: it's the size of the model plus the size of the encoding. So yeah, you can't... I mean, probably now you can because it's gotten so good, but back in the day you kind of couldn't.
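The scoring he's describing is a minimum description length criterion: roughly,

$$ \text{score} \;=\; |\text{decompressor}| \;+\; |\text{compressed Wikipedia dump}| $$

so a big neural network's weights count against you even if its predictions are excellent, which is why these models couldn't win at the time.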

So I was like, okay, cool. Like this is what it is. I kind of get it. Yeah. I mean, I think I didn't expect that it would continue to work this well. I thought there'd be real limits to how good autocomplete could get. That's fancy autocomplete. But yeah, no, like it works.

It works well. So like, yeah. What is OpenAI getting wrong? Technically, not that much. I don't know. Like if I was a researcher, why would I go work there? Yes. So why is OpenAI like the Miami Heat? No, look, I don't, I don't, this is, this is my technical stuff.

I don't really want to harp on this, but like why go work at OpenAI when you could go work at Facebook, right? As a researcher. Like OpenAI can keep ideologues who, you know, believe ideological stuff and Facebook can keep every researcher who's like, dude, I just want to build AI and publish it.

Yeah. Yeah. Awesome. Yeah. Any other thoughts, tiny corp, bounties? Yeah. So we have, you know, I've been thinking a lot about like what it means to hire in today's world. What actually is the like core? Okay. Look, I'm a believer that machines are going to replace everything in about 20 years.

So, okay. What is that, what is that thing that people can still do that computers can't, right? And this is a narrowing list, but like, you know, back in the day, like imagine I was starting a company in 1960, right? Oh, we're going to have to hire a whole bunch of calculators in the basement to do all the, you know, math to support the, dude, have you heard about computers?

Why don't we just buy a few of those? Oh, oh wow, man. You're right. So, like, I feel like that's kind of happening again. And I'm thinking about it. I'll post in my Discord, I'll be like, okay, who wants to... okay. I just changed this: my unary ops used to be log and exp, in base e.

I changed them to be log2 and exp2 because hardware has log2 and exp2 accelerators. Yeah. And of course you can use change of base; it's one multiply to get it back to base e, but I made the primitives log2 and exp2. Right.
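The change of base he's describing costs one multiply per call; a minimal sketch in plain Python, with `2.0 ** x` and `math.log2` standing in for hardware exp2/log2 units (illustrative, not tinygrad's actual code):

```python
import math

LOG2_E = math.log2(math.e)  # 1 / ln(2)
LN_2 = math.log(2.0)

def exp(x: float) -> float:
    # base-e exp via a hardware-style exp2: exp(x) = 2 ** (x * log2(e))
    return 2.0 ** (x * LOG2_E)

def log(x: float) -> float:
    # base-e log via a hardware-style log2: ln(x) = log2(x) * ln(2)
    return math.log2(x) * LN_2

assert abs(exp(1.0) - math.e) < 1e-12
assert abs(log(math.e) - 1.0) < 1e-12
```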

And this is the kind of, I just posted in the discord. I'm like, could someone put this pull request up? Right. And someone eventually did and I merged it, but I'm like, this is almost to the level where models can do it. Right. We're almost to the point where I can say that to a model and the model can do it.

Have you tried? Yeah, I'm, I don't know. I'm like, I'm, I think it went further. I think autocomplete went further than I thought it would, but I'm also relatively unimpressed with the chatbots, with what I've seen from the language models like there. The problem is if your loss function is categorical cross-entropy on the internet, your responses will always be mid.

Yes. Mode collapse is what I call it. I don't know, maybe I'm not even talking about mode collapse. You're actually trying to predict the, like... look, I rap, I'm a hobbyist rapper. And when I try to get these things to write rap, the raps sound like the kind of raps you read in the YouTube comments.

Nursery school. Yeah. It's like, all right, great, you rhymed box with fox. Sick rhyme, bro. You know, and Drake is rhyming "give it up for me" with "napkins and cutlery." Right. Like, all right, come on. We've got this thing about orange, like, orange is famously hard to rhyme.

Yeah, yeah, yeah. But now, of course, you know, "four inch screws and orange juice" is in GPT's training corpus. But yeah, so I think it went further than everyone kind of thought it would. But the thing that I really want to see is, like, somebody put 10 LLMs in a room and have them discuss the answer before they give it to me.

You can actually do this. Right. And I think the coding things have to be the same way. There is no coder alive, no matter how good you are, that sits down. Well, I'm going to start at cell A1 and type my program and then I'm going to press run and it's going to work.

No one programs like that. So why do we expect the models to write like that? So there's a lot that still needs to be done. But, you know, at the tiny corp, I want to be on the cutting edge of this, too. I want to be, like, program generation.

I mean, what is TinyGrad? It's a compiler. It generates programs; generate the fastest program that meets the spec. Right. Why am I not just having ML do that? So, you know, you kind of have to exist fluidly with the machines. And I've come around on a lot of stuff.

I'm like, wait, TinyGrad, tiny corp should be a remote company. I can't do this in person. Really? Yeah. Like, oh, Comma makes sense to be in person. Like Comma, sure, we have an office in San Diego. But that's a six-year-old company. Right. And it works, and it works for a certain type of people and a certain type of culture.

But what's going to be different this time? OK, remote. But now it's remote. And now I'm getting these like people who apply and I'm like, I literally have a thousand applications. I'm not calling you to do a technical screen. I can't really tell anything from a technical screen. What am I going to do?

Make them code on a whiteboard? Like, bring up a shared notebook document so we could... oh, like, that's not going to work. OK. So then I move to the next thing. We did this at Comma with good success: programming challenges. I've also found them to be, like, completely non-predictive.

I found one thing to actually be predictive and it's wait a second. Just write code in TinyGrad. It's open source. Right. And so, you know, I'm talking to a few people who've been contributing and like contribute or, you know, the job's not for you. But you can do it remote.

And it's, like, a chill job. Like, you're like, oh, yeah, well, I work for the tiny corp. Well, you're writing MIT-licensed software, like, you see what it's doing. Right. So just think of it maybe more as, like, a stipend than a salary, and then also some equity.

Look, you know, I get rich, you all get rich. Yeah. How do you think about agents, and kind of, like, thinking of them as people versus, like, jobs to be done? Sean built this thing called smol developer, and it's in the same vein, like the human in the loop with the language model, just iterating while you write code.

I think I think that's that's absolutely where it goes. And there's like a it's not like one thing. It's like they're small interpreter. There's like small debugger. It's kind of like all these different jobs to be done. It's a small world. Yeah. It's I know this is like the small pockets.

It's like small. I mean, tiny corp. So we're on the same wavelength. How do you think about that? Do you think people will have a human like interaction with like, oh, this is like the AI developer or like is it I'm the human being supercharged by the AI tools?

Oh, I think it's much more like I'm the human supercharged by the AI tools. I think that like coding is tool complete, right? Like driving is not tool complete. Right. Like driving is just like like we hire people to drive who are like below the API line. Right. There's an API line in the world.

Right. Love that. Yeah. There's an API line in the world. And like you can think like Uber is a really clear example. Right. There's the people below the API line and the people above the API line. And the way you can tell if you're below or above, by the way, is is your manager a computer?

Right. Who's the manager of the Uber driver or computer? Does the machine tell you what to do? Or do you tell machines? Exactly. Exactly. So coding is tool complete. Right. Coding is tool complete. Coding is above the API line. So it will always be tools supercharging your coding workflow.

And it will never be you performing some like task like, OK, well, I can do everything except for actually starting a docker container. Like it just doesn't make any sense. Right. Yeah. So we'll always be sort of tools. And, you know, look, we see the same stuff with all the people are like stable diffusion is going to replace artists or whatever.

It's like, dude, it's going to create new artists. What, did Photoshop replace artists? Like, what are you talking about? Right. It's like saying a real artist finger paints, I can't use brushes, brushes are going to replace all the... OK. Like, it's all just tools, and the tools are going to get better and better and better.

And then eventually, yes, the tools are going to replace us. But, you know, that's still 20 years away. So, you know, I've got a company in the meantime. So I've written about the API line before, and I think that's from Venkatesh. I don't know; I definitely took it from someone.

It's definitely not mine. VGR. But I also have speculated a higher line than that, which is the Kanban board. Like who tells the programmers what to do? Right. So are you above or below the Kanban board? Has that evolved your management thinking? Yeah. Like that's sort of what I mean.

It's like, I'm just going to describe the pull request in two sentences and then, like, yeah. So you are running the Kanban board, or the bounties? Yes. Yeah, the bounties are the Kanban board. Exactly. And that is kind of the high level. And then, yeah, we'll get AIs to fill in some and we'll get people to fill in others.

Yeah. And that's also what it means to be, like, full time at the tiny corp. Right. How would you start? I wrote this up pretty concretely. I'm like, OK, step one is you do bounties for the company. Step two is you propose bounties for the company. You obviously don't pay them.

We pay them. But you propose them. And I'm like, yeah, that's a good bounty that helps with the main workflow of the company. And step three is you get hired full time. You get equity, and, you know, maybe you get rich. What else are you designing differently about the employee experience?

I mean, I'm very much... you know, some people really like to keep a separation, right? A separation between, like, employees and management, or customers and employees. Like, at Comma, you know, the reason I do the DevKit thing, it's like, dude, you buy a Comma thing, you're an employee of the company. Like, you're just part of the company.

It's all the same thing. There are no secrets, there are no dividing lines. It's all a spectrum: down here at the spectrum, you pay, and up here at the spectrum, you get paid. You understand this is the same spectrum as college, right?

Like, for undergrad, you pay, and then you get up here to, like, you know, doing a Ph.D. program, you get paid. OK, well, cool. Welcome to the, you know... What about Comma bodies? You mentioned a lot of this stuff is clearly virtual, but then, below the API line, you actually need...

This is a thing that's been announced. Comma bodies? We sell them. You can buy them. They're a thousand bucks on our website. OK, no, no, no. I'm thinking about like what Tesla announced with like the humanoid robot. It's the same thing, except of course we made the comma version of it.

Tesla uses 20 actuators. We use two, right? Like, how do you build the simplest possible thing that can turn the robotics problem into entirely a software problem? So right now it is literally just a comma three on a pole with two wheels. It balances, keeps the comma three up there.

And there's so much you could do with that already. Like, this should replace... how many security guards could this replace? Right. If this thing could just competently wander around a space and take pictures and, you know, focus in on things, send you a text message when someone's trying to break into your building, like, this could already do so much, of course.

But the software is not there yet. Right. So how do we turn robotics into a thing where it's very clearly a software problem? You know, the people don't accept that self-driving cars are a software problem. Like, I don't I don't know what to tell you, man. Like literally just watch the video yourself and then drive with a joystick.

Right. Yeah. Can you drive? And we've actually done this test. We've actually done this test where we've had someone. OK, you just watch this video and here's a joystick and you got to drive the car. And of course, they can drive the car. Yeah. It takes a little bit of practice to get used to that joystick.

But the problem is all in the model. Right. So I can now make the model better. Yeah. Specifically, is there anything in computer vision that you think... Our second most popular episode ever was about Segment Anything coming out of Facebook, which is, as far as I understand, the state of the art in computer vision.

What are you hoping for there that you need for Comma? Is Segment Anything like the large YOLOs or not? I've used the large YOLOs and I'm super impressed by them. Yeah, I think it's solved. I got to check out Segment Anything. I don't think it's a distinct problem.

Right. OK, here's something that I'm interested in. All right. We have great LLMs. We have great text to speech models and we have great speech to text models. OK, so why can I not why can I not talk to an LLM like I'd have a normal conversation with it?

You can, with a latency of, like, two seconds every time. Right. Why? Why isn't this there? And then it feels so unnatural, it's just staccato. Like, I don't like the RLHF models. I don't like the tuned versions of them. I think they take on the personality of a customer support agent.

Oh, come on. You know, I like LLaMA more than ChatGPT. ChatGPT's personality just grated on me. Whereas LLaMA is, like, cool. I write a little bit of pretext paragraph, I can put you in any scenario I want. Right. That's interesting to me. I don't want some, like, you know...

Yeah. So, yeah, I think there is really no distinction between computer vision and language and any of this stuff. It's all eventually going to be fused into one massive model. So to say computer vision is solved... well, it doesn't make any sense, because what's the output of a computer vision model? Segmentation?

Like what a weird task. Right. Who cares? OCR. Who cares? I don't care if you can segment which pixels make up that laptop. I care if you can pick it up. Interact with the real world. And you're going to have the local cluster. You're going to have the body.

Yeah. Yeah. I think I think that's kind of where that goes. So maybe we can paint the future of like the year is 2050. You've achieved all you wanted at Tiny Corp. What is what is the AI enabled future like? Well, Tiny Corp is the second company. Comma was the first.

Comma builds the hardware infrastructure. Tiny Corp builds a software infrastructure. The third company is the first one that's going to build a real product. And that product is AI Girlfriend. No, like I'm dead serious. Right. Like this is the dream product. Right. This is the absolute dream product. Girlfriend is just the like stand in.

Well, no, it's not a stand-in. No, no, no. I actually mean it. Right. So I've been wanting to merge with a machine ever since I was little. Like, you know, how do I merge with the machine? Right. And, like, the maybe Elon style of thinking about this is Neuralink.

Right. Like, I don't think we need any of this. Right. Some of your friends, maybe they get into relationships, and you start thinking of, you know, them and their partner as the same person. You start thinking of them as, like, one person. I mean, they are kind of merged, right?

Like, humans can just kind of do this. It's so cool. It's this ability that we already have. I don't need to put, you know, electrodes in my brain to merge with a machine. I need an AI girlfriend. Right. So that's what I mean. Like, this is the third product.

This is the third company. And yeah, in 2050, I mean, like, it's so hard. I like maybe I can imagine like 2035. I don't even know 2050. But like, yeah, 2035. Like, yeah, that'd be really great. Like I have this like kind of, you know. So in terms of merging, like, isn't it, shouldn't you work on brain upload rather than AI Girlfriend?

But I don't need brain upload. Right. I don't need brain upload either. Like, there's there's thousands of hours of me on YouTube. Right. Yes. If you might, how much of my brain's already uploaded? That's only the stuff that you voice. Yeah, it's not that different. It's not that different.

Right. You really think a powerful, you really think a model with, you know, an exaflop of compute couldn't extract everything that's really going on in my brain. I'm a pretty open person. Right. Like, I'm not running a complex filter. Humans can't run that complex of a filter. Yeah. Like humans just can't.

Like, this is actually a cool quirk of biology. It's like, well, humans can't lie that well. Yeah. Yeah. So is it good or bad to put all of your stream of consciousness out there? I mean, I think it's good. I mean, I don't know. I'm streaming every day. I want to live forever.

We said off mic that we may be the first immortals. Right. Yeah. Like, this is how you live forever. It's a question of, OK, how many weights do I have? Right. OK. Let's say I have a trillion weights. We're talking about a terabyte, 100 terabytes here.

But it's not really 100 terabytes, right? Because what's the real complexity? How much redundancy is there in those weights? So, maximally compressed, how big is the weight file for my brain? Quantize it however you want; quantization is a poor man's compression. I think we're only really talking here about maybe a couple of gigabytes.

Right. And then if you have, like, a couple of gigabytes of true information of yourself up there... cool, man. Like, what does it mean for me to live forever? Like, that's me. Yeah, no, I think that's good. And I think there's a bit of a professionalization of social media, like a lot of people only have what's, like, PC out there, you know. And I feel like, coming back to the ChatGPT thing,

Right. You're going to train a model on, like, everything that's public about a lot of people, and it's like, no one's going to run their model, and they're going to die. Be seen on social media; your life could depend on it. We have a segment. So we're moving on to what would normally be called the lightning round.

But just general takes, because you're a generally interesting person with many other interests. What does The Goddess of Everything Else mean to you? Oh, it means that AI is not really going to kill us. Really? Of course. Tell us more. Look, Lex asked me this, like, is AI going to kill us all?

And I was quick to say yes, but I don't actually really believe it. I think there's a decent chance that AI kills 95 percent of us. OK. But they saw on your Twitch streams that you're with them, so they're not going to. No, I don't... actually, I don't even think it's the AI.

Like I think the AI alignment problem is so misstated. I think it's actually not a question of whether the computer is aligned with the company who owns the computer. It's a question of whether that company is aligned with you or that government's aligned with you. And the answer is no.

And that's how you end up dead. But so what the goddess of everything else means to me is like the complexity will continue. Paper clippers don't exist. You know, there are forces. The paper clipper is cancer. The paper clipper is really just a perfect form of cancer. And the goddess of everything else says, yeah, but cancer doesn't win.

You know? Yeah. It's a beautiful story for those who haven't heard it. And you read it out and I listened to it. Yeah. Good. What else do we have here? Pick a question. So many. Yeah. What are you grateful for today? Oh, man. I mean, it's all just... like, I've been thinking about this stuff forever, and now

it's actually, like, happening, and it's happening in an accessible way, too. I guess that's what I'm really grateful for. AI is not some Manhattan Project-style thing going on behind closed doors, and I'll fight really hard to keep it that way. You know, I'm grateful for just how much is released out there and how much I can just learn and stay up to date.

And I guess I'm grateful to the true fabric of reality that, you know, I didn't need differential equations to understand it. Like, you don't need some... look, I've tried; there's a limit to my math abilities.

I can do most undergrad math, but I took some grad math classes. And OK, now we're getting to the end of what I can do. And it's just the actual like end of what I can do. Like I'm limited by my brain. But, you know, ML stuff, you need high school math.

Yeah, you need almost nothing. You know what I mean? It's, like, seventh grade math; it's all easy. You need more electrical engineering than you need high school math, really. Yeah, well, you need electrical engineering to, like, build the machines. But even that, like, these machines are simpler than the machines that have existed before.

The compute stack looks really nice. So, you know, yeah, I just I'm grateful that it's all happening and I get to understand it, be here. Yeah. Yeah. John Carmack mentioned there's about six insights we have left. Do you have an intuition for what some of the paths people should be taking?

Obviously, you're working on one. What are some of the other branches of the tree that people should go under? I don't think I'm working on one of the six insights. I don't think TinyGrad's any one of the six insights. Something I really like that Elon does, and I try to take it from, try to be inspired by it, is look at the boring tunnel machine and ask how you can build a 10x cheaper one.

All right. Look at the rocket. How can I build a 10x cheaper one? Look at the electric car and say, how can I build a 10x cheaper, like cheaper or, you know, can go further or whatever, whatever, whatever. Right. You just do the straight up physics math. Right. Like I'm trying to do the same thing with with ML frameworks.

Right. And in doing so, making sure that this stuff remains accessible. Right. You could imagine a world where if Google TPUs were actually the ultimate, if Google TPUs were actually the best training things. I mean, actually, you know, I'm kind of grateful for NVIDIA. Right. Like, because if Google TPUs were the ultimate, now you have this huge closed source compiler in between XLA and the hardware.

And yeah, that's just a really bad thing. So, I mean, something that is somewhat upsetting about the tiny corp is that part of it is about trying to prevent downside. But it's not all about preventing downside: we're also building computers, and we're going to build some awesome, powerful, cheap computers along the way.

So, no, I'm not really working directly on any of the six tricks. I also think the six tricks are kind of going to be like luck. I think it's going to be like, you know, please tell me more about what covariate shift is and how that inspired you to come up with batch normalization.

Please tell me more about why it's a transformer and it has a query, a key, and a value. Right. Like, Schmidhuber described it better in fast weights. You know, I mean, my theory about why transformers work has nothing to do with this attention mechanism; it's just the fact that it's semi-weight sharing.

Right. Because the weight matrix is being generated on the fly, you can, like, compress the weight matrix. Right. There's an operation in the transformer which, and by the way, this is why Qualcomm's SNPE can't run transformers.

So most matrix multiplies in neural networks are weights times values. Right. Whereas, you know, when you get to the outer product in transformers, well, it's not weights times values, it's values times values. Right. So SNPE doesn't even support that operation. Right. So it's that operation that gives the transformer its power.

It has nothing to do with the fact that it's attention. Right. And this is just a funny, like... but that is one of the six tricks, right? Batch norm, like, these norms are a trick. Transformers are a trick. Okay, six more. Is there a reason why... so you talk about attention as weight compression.

Compression is not exactly the right word. What I mean is that the weights can change dynamically based on the context. So there was this thing in PAQ8, in the Hutter Prize, that I absolutely loved. And I've never seen it again in neural networks, and it's a really good trick.

Okay. Imagine you have 256 weight sets for a layer. Right. And then you choose which of the weight sets you're loading in based on some context. And that context can come from another neural net. Right. So I have another neural net which projects, you know, 256-wide, one-hot, do a softmax, predict it.

And then I actually load the weights in. And I can do this operation at both test time and train time. Right. I can do this operation at both training and inference. And I load in the weights given the context. Right. Like that is what transformers do. But transformers, instead of having 256 discrete ones, it's actually just that, but continuous.
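A minimal sketch of that discrete version, where a small router network picks among 256 weight sets for a layer based on a context vector; the sizes are illustrative, and the soft blend at the end is the continuous analogue he's pointing at.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SETS, D_IN, D_OUT, D_CTX = 256, 64, 64, 32   # illustrative sizes only

# 256 candidate weight matrices for one layer
weight_sets = rng.standard_normal((N_SETS, D_IN, D_OUT)) * 0.02
# a small "router" that maps a context vector to a score per weight set
router = rng.standard_normal((D_CTX, N_SETS)) * 0.02

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def context_layer(x, ctx):
    """Apply a layer whose weights are chosen by the context.

    Hard selection (argmax over probs) is the discrete PAQ8-style version;
    the soft blend below is the continuous analogue.
    """
    probs = softmax(ctx @ router)             # (N_SETS,) distribution over weight sets
    W = np.tensordot(probs, weight_sets, 1)   # blend the weight sets -> (D_IN, D_OUT)
    return x @ W

x = rng.standard_normal(D_IN)
ctx = rng.standard_normal(D_CTX)
print(context_layer(x, ctx).shape)  # (64,)
```

Hard selection would take the argmax instead of blending; attention gets the same effect continuously, with the data-dependent values-times-values matmul playing the role of the router.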

Yeah. Which is funny that that was in language models. And I just like, when I understood that about transformers, I'm like, oh, this is a real trick. And why are they using the word attention? Yeah. And today is actually the anniversary of attention is all you need. What? Today, six years ago.

Six years. Six years. Changed the world. Wow. Well, there's one of your envelope tricks. Right. And you can easily write it on an envelope. You know, think about how you write out that. How many times have you written that? Because it's not in any libraries because it's like all used a little differently each time.
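For what it's worth, the back-of-the-envelope version of single-head scaled dot-product attention is only a few lines of NumPy (no masking, batching, or multi-head plumbing; a sketch, not any particular library's implementation):

```python
import numpy as np

def attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product attention for x of shape (seq_len, dim)."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])        # the values-times-values matmul
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                             # mix the values by the attention weights
```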

Yeah. If you just write out that exact same, you know. Yeah. Yeah. You've name checked Elon a few times. Yeah. I think about both of you as systems thinkers. Input, output, thinking something in between. Sure. What's different about your style versus his? Elon's fundamental science for the world is physics, mine is information theory.

Huh. But you do a lot of physics as well. I mean, like you base it on. And Elon does a lot of information theory as well, too. But the question is fundamentally that the difference maybe is expressed in what your ambitions are. Right. Elon's ambitions may be like, go to Mars.

Go to Mars. Right. Go to Mars is the ultimate modernist physics ambition. Right. It's physics: how am I getting to Mars? Right. Well, what are electric cars? It's a physics problem. Right. OK. Now he's, like, pushing on the autonomy stuff, and you push a little on information theory.

But fundamentally, his dreams are physics based dreams. My dreams are information based dreams. I want to live forever in virtual reality with my AI girlfriend. Right. Those are those are the aspirations of someone who who who accepts information theory as a core science. So I think that's the main difference between me and him.

He has physics based aspirations and I have information based aspirations. Very, very neat. Mark Andreessen. He is a, hi Mark, he's a listener. He is heavily, he's a big proponent of effective accelerationism. You've been a bit more critical. Why do you say that EAC is not taken seriously by its adherents?

Oh, well, only the left takes ideology seriously. Why is that? Just as a fact. It's just like it's just like a fact. Is the right more cynical? Is that what it is? I don't know. It's like it's like the left actually manages to get energy around the ideologies. Right.

Like, there's a lot more... Look, here you have two effective altruists named Sam going in front of Congress. Only one of them is in jail. You know, it's interesting. They're both calling for regulation in their respective spaces. Right. So SBF is definitely, like, kind of a wolf in sheep's clothing.

Right. He only adopted EA. Oh, and Sam Altman is a genuinely good guy who is not interested in power seeking for himself. All right, we don't have to... Fair enough. Fair enough. But no, e/acc is not, like... you are not serious. Right. You are not actually a serious ideology.

You know, Marc Andreessen. I like Marc Andreessen. But I think some of his Twitter things are like, dude... it's like someone in, like, 2019 whose eyes were just opened about the political world not being what it seems. You mean all the people on the news were lying to me?

Well, they were lying to you, but, OK, we all figured this out five years ago. Now, what are you going to do about it? I'm going to complain about it on Twitter. Right. And that's what e/acc is. Last and maybe most important, why was Avatar 2 bad? Oh, I have a whole thing; you can go on my blog.

I rewrote the script of Avatar 2. I wrote a script that actually might make you feel something for the characters. I killed Jake Sully in the first scene like you had to. Do you really think his second story arc topped his first one? No, of course not. You had to kill the guy and make the movie about the brothers.

Right. And just that alone, and realizing that, like, you could have kept the Titanic scene, it would have been fine. I didn't even take it out. I left your Titanic scene, James Cameron. But I wrote you a story that... so, you know, he just needs ships to sink in water.

He needs... Well, look, it's a great scene. But, like, the movie was just... great CGI, you know, let down by the writing? Maybe. Yeah. Yeah. No, but, like, the CGI, it's a beautiful world. And that's why I care so much. Right.

Like, you don't hear me ranting about Pirates of the Caribbean 2 being a terrible story, because, come on, what do you expect, man? Like, Johnny Depp's like, wow, I had a movie that made me rich. I love this. But this goes back to, like, the mid point. You know, I think you wrote that it feels like ChatGPT wrote the movie.

And that's my worry a little bit. It's kind of converging towards that. Oh, I look Malik wrote the movie. Sorry, I didn't want to interrupt. I closed a pull request two days ago. I was like, was this written by ChatGPT? And I just closed it. Like, you know what?

I honestly feel bad if you were a human who wrote this. Like, you're incapable of being more perplexed. But now I have a classifier running in my head that asks, you know, is this AI or is this a human? And, like, having to deal with all of this is, like, the worst possible...

Like, you know, people are like, how are you mad about these chatbots? You're not mad about, like, Tesla? Well, because if I don't want to buy a Tesla, I don't have to buy a Tesla, and it won't really impact my life negatively. But if I don't want to use a chatbot, it's still going to impact my life negatively.

All the personalized spam now makes me spend more cycles on my classifier to tell if it's spam or not, because you can now use AIs to generate it. Like, no, I mean, we have to move to a model where everything's just a dollar, right? You want to send me an email, it's a dollar.

Like you guys wouldn't care. None of my friends would care. No one would care except the spammers. Right? Like we just got to move to those sort of models. Awesome. One last message you want everyone to remember. Look, go, go try TinyGrad. I hope that we're a serious competitor to what's out there.

And then I want to, you know, take it all the way. We'll start with just building something for GPUs, and then we'll start building chips, and we'll start building fabs, and we'll start building silicon mines, and we'll have the first self-reproducing robot using... Yeah, all right, George.

Thank you so much for coming on. You're a big inspiration. Thank you. Thanks. All right. How was that? We're, uh, not quite like Fridman, but we hope to do something different. Thank you. Thank you.