Ep 18: Petaflops to the People — with George Hotz of tinycorp
Chapters
0:00 Introducing George
2:59 Tinycorp's 3 Theses
11:12 Tinygrad's creation
15:58 Operation fusing in Tinygrad
19:11 Tinygrad debugging
21:14 Tiny Competitiveness on QCOMM vs NVDA
23:21 geohot vs AMD
28:21 Tinygrad vs ggml
30:01 Importance of Good CI
30:37 Mojo and Compatibility
32:43 ggml quantization is made up
35:18 tinygrad: benchmark int8 vs fp16
37:39 Why you can't build tinybox
40:28 The personal compute cluster
43:08 Compute Optimal to Inference Optimal
45:06 Announcing FLOPcoin
46:23 Why Federated AI won't work
47:38 5x faster than Nvidia
48:53 A Person of Compute
49:49 GPT-4's real architecture
51:07 BatchNorm, FlashAttention
52:34 The Bitter Lesson
55:31 Hiring in the Age of AI
60:02 Why AI doesn't replace developers & artists
63:02 Comma Body
67:34 AI Girlfriend
71:00 The Goddess of Everything Else
73:43 John Carmack Insights
77:41 on Elon
78:47 on e/acc
80:24 Avatar 2
00:00:00.000 |
>> Hey, everyone. Welcome to the Latent Space podcast. This is swyx, writer and editor of 00:00:06.440 |
Latent Space, and Alessio is taking over with the intros. Alessio is partner and CTO in Residence 00:00:11.640 |
at Decibel Partners. >> Hey, everyone. Today we have Geohot on 00:00:15.040 |
the podcast, aka George Hotz, his human name. Everybody knows George, so I'm not going 00:00:20.960 |
to do a big intro. A couple things that people might have missed. So you were the first to 00:00:24.600 |
unlock the iPhone. You traded the first ever unlocked iPhone for a Nissan 350Z and three 00:00:30.120 |
new iPhones. You were then one of the first people to break into the PS3 to run arbitrary 00:00:35.480 |
code. You got sued by Sony. You wrote a rap song to fight against that, which is still 00:00:40.400 |
live on YouTube, which we're going to have on the show notes. Then you did not go to 00:00:44.920 |
Tesla to build Vision, and instead you started Comma.ai, which was an amazing engineering feat 00:00:50.720 |
in itself, until you got a cease and desist from the government to not put these things 00:00:55.280 |
on the street. Turned that into a research-only project. 00:00:58.560 |
>> You know they're out there. >> Yeah, yeah. No, no, no. They're out there. 00:01:01.800 |
But like, they're not a, you know, you market them as a research kind of like no warranty. 00:01:06.520 |
>> Because I use the word DevKit. That's not about the government. That has nothing to 00:01:10.000 |
do with the government. We offer a great one-year warranty. The truth about that is it's gatekeeping. 00:01:17.640 |
What's the difference between a DevKit and not a DevKit? Nothing. Just the question of 00:01:22.120 |
do you think it's for you? And if you think it's for you, buy it. It's a consumer product. 00:01:26.480 |
We call it a DevKit. If you have a problem with that, it's not for you. 00:01:31.000 |
>> That's great insight. And then I was going through your blog post to get to the day. 00:01:35.480 |
You wrote this post about the hero's journey, and you linked this thing called the portal 00:01:40.080 |
story, which is kind of the set of stories in movies and books about people living this 00:01:45.240 |
arbitrary life, and then they run into these magic portals, kind of takes them into a new, 00:01:49.680 |
very exciting life and dimension. When you wrote that post, you talked about TinyGrad, 00:01:54.520 |
which is one of the projects you're working on today. And you mentioned this is more of 00:01:58.120 |
a hobby, something that is not going to change the course of history. Obviously, you're now 00:02:01.440 |
going full speed into it. So we would love to learn more about what was the portal that 00:02:07.680 |
>> Well, what you realize is, you know what made me realize that I absolutely had to do 00:02:13.520 |
the company? Seeing Sam Altman go in front of Congress. Why? What are the odds they nationalize 00:02:20.200 |
NVIDIA? You know, what are the odds that large organizations and the government, but of course 00:02:26.400 |
I repeat myself, decide to try to clamp down on accessibility of ML compute? I want to 00:02:34.720 |
make sure that can't happen structurally. So that's why I realized that it's really 00:02:39.560 |
important that I do this. And actually, from a more practical perspective, I'm working 00:02:43.240 |
with NVIDIA and Qualcomm to buy chips. NVIDIA has the best training chips. Qualcomm has 00:02:47.120 |
the best inference chips. Working with these companies is really difficult. So I'd like 00:02:51.000 |
to start another organization that eventually in the limit, either works with people to 00:02:56.000 |
make chips or makes chips itself and makes them available to anybody. 00:03:01.860 |
>> You shared kind of three core pieces to TinyGrad. Maybe we can dive into each of them. 00:03:06.080 |
So XLA, PrimTorch, those are the complex instruction set. TinyGrad is the reduced 00:03:12.640 |
instruction set. So you're kind of focused on, again, TinyGrad being small, not being 00:03:17.040 |
overcomplicated and trying to get as close to like the DSP as possible in a way, where 00:03:22.640 |
>> Well, it's a very clear analogy from how processors developed. So a lot of processors 00:03:26.920 |
back in the day were CISC, complex instruction set. System 360 and then x86. Then this isn't 00:03:34.520 |
how things stayed. They went to now the most common processor is ARM. And people are excited 00:03:40.640 |
about RISC-V. RISC-V is even less complex than ARM. No one is excited about CISC processors 00:03:46.920 |
anymore. They're excited about reduced instruction set processors. So TinyGrad is, we're going 00:03:52.680 |
to make a RISC instruction set for all ML models. And yeah, it can run all ML models with basically 00:03:59.800 |
25 instead of the 250 of XLA or PrimTorch. So about 10x less complex. 00:04:06.040 |
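To give a rough flavor of what a reduced op set means in practice, here is an illustrative sketch. The grouping loosely mirrors how tinygrad organizes its primitives, but the names and counts below are approximate stand-ins, not the project's actual list.

```python
# Illustrative only: the rough shape of a "RISC-style" ML op set, grouped into a few small
# categories. Roughly two dozen primitives like these, composed together, are enough to
# express matmuls, convolutions, attention, normalization, and so on.
from enum import Enum, auto

class UnaryOps(Enum):    EXP2 = auto(); LOG2 = auto(); SIN = auto(); SQRT = auto(); NEG = auto()
class BinaryOps(Enum):   ADD = auto(); SUB = auto(); MUL = auto(); DIV = auto(); MAX = auto(); CMPLT = auto()
class TernaryOps(Enum):  WHERE = auto(); MULACC = auto()
class ReduceOps(Enum):   SUM = auto(); MAX = auto()
class MovementOps(Enum): RESHAPE = auto(); PERMUTE = auto(); EXPAND = auto(); PAD = auto(); SHRINK = auto()

# Contrast with the hundreds of ops a CISC-style IR like XLA or PrimTorch carries.
```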
>> You talked a lot about existing AI chips. You said if you can't write a fast ML framework 00:04:10.600 |
for GPUs, you can't write one for your own chip. So that's another one of your core 00:04:14.760 |
insights. I don't know if you want to expand on that. 00:04:17.280 |
>> Yeah, I mean, your chip is worse, right? There's no way the chip that you're going 00:04:20.600 |
to tape out, especially on the first try, is going to be easier to use than an AMD GPU. 00:04:25.720 |
And yet there's no good stack for AMD GPUs. So why do you think you can make one for your 00:04:30.600 |
chip? You can't, right? The only company, there's one other company, aside from NVIDIA, 00:04:35.560 |
who's succeeded at all at making training chips. What company? 00:04:43.000 |
>> No, no, no. I've never trained, who's trained a model on AMD or Intel? 00:04:49.760 |
>> Cerebras, I'm talking about, you might know some startups who trained models on these 00:04:53.800 |
chips. I'm surprised no one immediately gets this, because there is one other chip, aside 00:04:59.280 |
from NVIDIA, that normal people have actually used for training. 00:05:03.560 |
>> No, used for training. You can only buy them in the cloud. 00:05:09.440 |
>> Exactly, right? So, Midjourney is trained on TPU, right? A lot of startups do actually 00:05:15.680 |
train on TPUs. And they're the only other successful training chip, aside from NVIDIA. 00:05:21.180 |
But what's unique about Google is that they also wrote their own ML framework, right? 00:05:26.160 |
And if you can't write your own ML framework that is performant on NVIDIA, there's no way 00:05:32.680 |
>> And they started from TensorFlow, and then they made the chip after. 00:05:36.000 |
>> Yeah, exactly, exactly. And you have to do it in that direction. Otherwise, you're 00:05:40.040 |
going to end up-- Cerebras, one of those things, a million-- I've never seen a Cerebras. No 00:05:46.560 |
one's ever like, "Oh, I trained my model on a Cerebras." Most people are like, "I trained 00:05:50.520 |
my model on GPUs." Some people, 20%, are like, "I trained my model on TPUs." 00:05:57.040 |
>> And then the third one, which is the one that surprised me the most, is Turing completeness 00:06:01.320 |
is harmful, should be avoided. It made sense once I read it, but maybe tell us a bit more 00:06:09.560 |
>> Okay. So, CPUs devote tons of their silicon and power to things like reorder buffers and 00:06:18.160 |
speculative execution and branch predictors. And the reason that you need all these things 00:06:22.960 |
is because at compile time, you can't understand how the code's going to run. This is Rice's 00:06:28.240 |
theorem. This is the halting problem and its limit. And this is not like, "Oh, the halting 00:06:32.040 |
problem is theoretical." No, no, no, no. It's actually very real. Does this branch get taken 00:06:36.240 |
or not? Well, it depends on X. Where does X come from? Yeah, forget it, right? But no 00:06:41.520 |
branches depend on X in a neural net. Every branch is a static loop. Like if you're doing 00:06:46.360 |
a matrix multiply, it's a static loop over the inner dimension. And neural networks are 00:06:50.720 |
even better. No loads even depend on X, right? So with a GPU shader, right, your load might 00:06:55.760 |
depend on which texture you're actually loading into RAM. But with a neural network, your 00:06:59.480 |
load is, "Well, I load that way." Why? "Well, because I load that way the other million 00:07:03.160 |
times I ran the same net." Every single time you run the net, you do the exact same set 00:07:07.160 |
of loads, stores, and arithmetic. The only thing that changes is the data. And this gives 00:07:12.800 |
you a very powerful ability to optimize that you can't do with CPU style things, which 00:07:19.080 |
have branches, and even GPU style things, which have loads and stores. 00:07:22.160 |
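As a concrete picture of "every branch is a static loop," here is a plain matrix multiply with hypothetical, fixed shapes: the loop bounds and every address touched are known before the data ever shows up.

```python
import numpy as np

M, K, N = 64, 128, 32            # shapes are fixed ahead of time, so the schedule is static
A = np.random.randn(M, K)
B = np.random.randn(K, N)
C = np.zeros((M, N))

for i in range(M):               # static loop: the trip count never depends on the data
    for j in range(N):
        acc = 0.0
        for k in range(K):       # inner reduction: identical loads and stores on every run
            acc += A[i, k] * B[k, j]
        C[i, j] = acc

# No branch or load address above depends on the values inside A or B. That is what lets an
# ML compiler schedule everything statically instead of paying for branch predictors,
# reorder buffers, or warp schedulers at runtime.
```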
Oh, that makes sense. Well, GPUs, if you want GPU style stuff, you have like load based 00:07:26.560 |
on X. You now need a cache hierarchy, and not an explicit cache hierarchy, an implicit 00:07:31.480 |
cache hierarchy. With eviction policies that are hard-coded into the CPU, you start doing 00:07:37.240 |
all this stuff and you're never going to get theoretically good performance. Again, I don't 00:07:42.480 |
think there's 100X. Some startups will talk about 100X, and they'll talk about absolutely 00:07:45.840 |
ridiculous things like clockless computing or analog computing. Okay. Here, analog computing 00:07:50.720 |
just won't work. And clockless computing, sure, it might work in theory, but your EDA 00:07:55.960 |
tools are... Maybe AIs will be able to design clockless chips, but not humans. But what 00:08:02.840 |
actually is practical is changing cache hierarchies, and removing branch predictors, and removing 00:08:07.360 |
warp schedulers. GPUs spend tons of power on warp scheduling, because we have to hide 00:08:10.960 |
the latency from the memory. We don't have to hide the latency if everything's statically scheduled. 00:08:15.040 |
Yeah. Why do you think people are still hanging on to Turing complete? 00:08:19.920 |
Well, because it's really easy. Turing complete is just really easy, right? It's really easy 00:08:24.720 |
to just, "Oh, you know, it would just be so nice if I could do like an if statement here, 00:08:29.820 |
and actually branch the code," right? So it requires a lot more thought to do it without 00:08:37.520 |
And would this be qualitatively different than TPUs? 00:08:40.120 |
So TPUs are a lot closer. TPUs are a lot closer to what I'm talking about than like CUDA. 00:08:46.560 |
Okay, so what is CUDA? Well, CUDA is a C-like language, which compiles to an LLVM-like IR, 00:08:52.240 |
which compiles to PTX, which compiles to SASS, which are all Turing complete. TPUs are much 00:08:57.540 |
more like this, yeah. Their memory is pretty statically managed. I did some reverse engineering 00:09:02.680 |
on the TPU. It's published in TinyGrad. It has, like, VLIW instructions, and it runs them. 00:09:09.520 |
So it's similar. I think the TPUs have a few problems. I think systolic arrays are the 00:09:13.400 |
wrong choice. Systolic array, I think they have systolic arrays, because that was the 00:09:19.280 |
Could you summarize systolic arrays right now? 00:09:20.860 |
Systolic arrays are just, okay, so basically you have like, it's a way to do matrix multiplication. 00:09:26.640 |
Think of a grid of MACs (multiply-accumulate units), and then the grid can multiply and then shift, multiply, then 00:09:31.080 |
shift, multiply, then shift. And they are very power efficient, but it becomes hard 00:09:35.400 |
to schedule a lot of stuff on them if you're not doing perfectly sized dense matrix multiplies. 00:09:42.360 |
Which you can argue, well, design your models to use perfectly sized dense matrix multiplies, 00:09:48.920 |
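For readers who want the picture in code, here is a toy, cycle-level model of an output-stationary systolic array (purely illustrative, not any real TPU design). Each cell does one multiply-accumulate per cycle while operands shift through, and the skewed schedule is exactly why oddly sized or sparse matmuls are awkward to map onto it.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy model: PE(i, j) holds one accumulator; A streams in from the left and B from the
    top, each skewed by one cycle per row/column, so A[i, k] meets B[k, j] at PE(i, j)
    on cycle i + j + k."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for t in range(M + N + K - 2):                 # total pipeline depth in cycles
        for i in range(M):
            for j in range(N):
                k = t - i - j                      # which operand pair reaches this PE now
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]   # one multiply-accumulate, then shift
    return C

A, B = np.random.randn(8, 8), np.random.randn(8, 8)
assert np.allclose(systolic_matmul(A, B), A @ B)
```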
No, but thanks for indulging on these explanations. I think we need to keep our audience along 00:09:57.120 |
with us by pausing every now and then to explain key terms. 00:10:01.680 |
When I say explain a systolic array, I just immediately get a picture in my head of like 00:10:06.000 |
tilting a matrix and shifting it. It's hard to kind of explain. 00:10:13.720 |
Yeah, yeah, yeah. There's some great graphics that just show you, oh, so that's what a systolic 00:10:17.200 |
array is. But it's a MAC-and-shift machine that looks kind of different from the typical 00:10:21.640 |
like APU sort of machine. Sorry, ALU sort of machine. I think the right answer is something 00:10:26.440 |
that looks more like queues that feed into ALUs. And then you can like prefetch the loads 00:10:31.640 |
from the memory, put in a bunch of queues, and then the queue is just like, and feeds 00:10:35.600 |
into another queue over here. But yeah, but that's not even the main problem with TPUs. 00:10:42.360 |
The main problem with TPUs is that they're closed source. Not only is the chip closed 00:10:45.440 |
source, but all of-- XLA is open source, but the XLA to TPU compiler is a 32 megabyte binary 00:10:51.520 |
blob called libtpu on Google's cloud instances. It's all closed source. It's all hidden stuff. 00:10:56.800 |
And, you know, well, there's a reason Google made it closed source. Amazon made a clone 00:10:59.920 |
of the TPU. It's called Inferentia. Or they have some other name for it, a training-- 00:11:04.280 |
>> Trainium, yeah, yeah, yeah. And here, look, it's a clone of the TPU. It's--software doesn't 00:11:08.080 |
work though. Like Google software at least kind of works. 00:11:12.120 |
>> So those are kind of like the three core thesis. The first thing you're working on, 00:11:15.360 |
that you've been working on is TinyGrad. And one of the--your Twitch streams, you said, 00:11:19.600 |
is the best thing you've ever written. Yeah, tell us a bit more about that creation. 00:11:26.840 |
>> For a long time, TinyGrad had a hard limit of 1,000 lines of code. And what this would 00:11:31.280 |
force you to do is really make sure you were not wasting lines. I got rid of the restriction 00:11:37.400 |
because it became a little code golfy at the end. But once like the core framework of TinyGrad 00:11:42.680 |
was there in those 1,000 lines, it's not huge now. It's like 2,800 lines now. It's still 00:11:49.120 |
very readable. But like the core framework, the ideas are expressed with no boilerplate. 00:11:56.420 |
If you go read PyTorch--you know, PyTorch is actually pretty good code. I think Facebook's 00:12:00.720 |
pretty good. But there's so much boilerplate. Go in PyTorch and try to track down how an 00:12:07.400 |
LU actually works. >> Just a lot of instructions? 00:12:10.960 |
>> Oh, you're going to be diving down a long stack from Python to C to custom libraries 00:12:16.800 |
to dispatchers to--and then I don't even know how to read TensorFlow. Like I don't even 00:12:20.360 |
know where's an LU in TensorFlow. Nobody knows. Someone at Google knows maybe. Google as an 00:12:27.080 |
organism knows. I don't know if anyone individual at Google knows. 00:12:31.580 |
>> What are like the important ergonomics like for a developer as you think about designing 00:12:35.400 |
the TinyGrad API? >> So, the TinyGrad frontend looks very similar 00:12:39.240 |
to PyTorch. There's an even higher level frontend you can use for TinyGrad which is just ONNX. 00:12:44.060 |
We support--we have better support for ONNX than Core ML does. And we're going to have--I 00:12:48.680 |
think we're going to pass ONNX Runtime soon too. Like people think ONNX Runtime, that's 00:12:52.000 |
a gold standard for ONNX. No, you can do better. >> Pass them in what specifically? 00:12:55.560 |
>> Test, compliance tests. So, ONNX has a big set of compliance tests that you can check 00:12:59.580 |
out. And we have them running in TinyGrad and there's some failures. We're below ONNX 00:13:05.480 |
Runtime but we're beyond Core ML. So, like that's like where we are in ONNX support now. 00:13:09.800 |
But we will pass. We will pass ONNX Runtime soon because it becomes very easy to add ops 00:13:14.060 |
because of how like you don't need to do anything at the lower levels. You just do it at this 00:13:17.960 |
very high level and TinyGrad compiles it to something that's fast using these minimal 00:13:21.560 |
ops with. You can like write--I mean, most concretely what TinyGrad can do that like 00:13:27.280 |
PyTorch can't really do is if you have something like A times B plus C, right? If you write 00:13:32.460 |
that in Naive PyTorch, what it's going to do on the GPU is, well, read A, read B in 00:13:37.520 |
a kernel and then store A times B in memory and then launch another kernel to do A times 00:13:42.880 |
B plus C, okay? Got to do those loads from memory. I know I did a whole extra round trip 00:13:48.040 |
to memory that I just didn't have to do. And you're like, "Yeah, but you can use the Torch 00:13:51.080 |
JIT and it corrects this." Yeah, for that one example, for that one example of a multiply-accumulate, 00:13:56.720 |
but oh, now you did three multiplies, six multiplies, right? It doesn't--it won't compile 00:14:04.420 |
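To put numbers on the extra round trip being described, here is a back-of-the-envelope sketch with a hypothetical tensor size:

```python
# d = a * b + c on N-element float32 tensors (N is made up for illustration).
N = 1_000_000
bytes_per_float = 4

# Unfused, eager-style execution: two kernels, with the intermediate bounced through memory.
#   kernel 1: read a, read b, write tmp   -> 3N floats of traffic
#   kernel 2: read tmp, read c, write d   -> 3N floats of traffic
naive_traffic = 6 * N * bytes_per_float

# Fused execution, which laziness makes possible: one kernel.
#   read a, read b, read c, write d       -> 4N floats of traffic
fused_traffic = 4 * N * bytes_per_float

print(naive_traffic / fused_traffic)      # 1.5x less memory traffic, plus one fewer kernel launch
```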
>> And if you looked into like the other approaches like PyTorch Lightning to accelerate PyTorch 00:14:10.360 |
>> Well, PyTorch Lightning, my understanding is it's mostly a framework around PyTorch, 00:14:14.280 |
right? PyTorch Lightning is not going to fix this fundamental problem of I multiply six 00:14:18.040 |
tensors together, why is it going to memory any more than a single read from each and 00:14:24.320 |
>> Yeah, there are lower level things in PyTorch that are--I'm not exactly sure what Dynamo 00:14:29.680 |
does but I know they're generating some Triton stuff which is going to generate the kernels 00:14:33.960 |
on the fly. But, you know, PyTorch Lightning is at a higher level of abstraction. So TinyGrad's 00:14:39.840 |
front-end stuff looks like PyTorch. I made a few tweaks, there's a few things I don't 00:14:42.960 |
like about PyTorch. Why is ReLU a class? No, really, like what's the state? You make a 00:14:49.200 |
class and there's a state. Everything should just be functional, and then ReLU is 00:14:52.160 |
just .relu() on the tensor. Also, like, there's things in Torch where you have to do torch 00:14:56.840 |
dot and not tensor dot, right? And like why are these things--like this just--it just 00:15:02.640 |
shows an API that's like not perfectly refined. But when you're doing stuff TinyGrad style 00:15:07.880 |
where you don't have lines, well, it has to work this way because even the lines to express 00:15:12.920 |
the--well, you can't use the where operator unless--and the where operator in PyTorch. 00:15:17.720 |
Why is it true case, condition, false case? The worst--that's how Python expresses ifs. 00:15:24.440 |
It's disgusting, right? Ternary operators are much nicer. It should be like I can do 00:15:28.360 |
(a < 0).where(a, 1), right? >> The very Pandas-like API. 00:15:35.320 |
>> Yeah, yeah, yeah. It's just--it's some--it looks like Torch, NumPy, Pandas. They're all 00:15:40.440 |
very similar. I tried to take like the cleanest subset of them and express them. But like 00:15:44.960 |
I said, you can also interact with it using ONNX. But I have a rewrite of Stable Diffusion, 00:15:50.240 |
I have a rewrite of LLaMA, I have a rewrite of Whisper. You can look at them. They're 00:15:52.840 |
shorter than the Torch version and I think they're cleaner. 00:15:56.360 |
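A rough side-by-side of the frontend differences being described (treat this as a sketch; both APIs may have drifted since this conversation):

```python
import torch
from tinygrad.tensor import Tensor

x = torch.randn(4, 4)
y1 = torch.nn.ReLU()(x)                        # the stateless class George is objecting to
y2 = torch.relu(x)                             # the functional form
y3 = x.where(x < 0, torch.ones_like(x))        # reads as: true case, condition, false case

a = Tensor.randn(4, 4)
b = a.relu()                                   # just a method on the tensor
c = (a < 0).where(a, 1)                        # condition first, NumPy/Pandas style
```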
>> Very nice. >> Laziness is kind of the other important 00:16:00.080 |
concept that you're leveraging to do operation fusing. Yeah, talk a bit more about that. 00:16:05.200 |
>> So, yeah, you have basically like a few different like models for compute. The simplest 00:16:14.160 |
one is Eager, right? The simplest one is Eager. As soon as the interpreter sees A times B, 00:16:20.760 |
it actually dispatches A times B, right? Then you have Graph, like TensorFlow, which will 00:16:27.400 |
put A times B into a graph and then will do absolutely nothing until you actually compile 00:16:35.000 |
the graph at the end. I like this third choice, which is somewhere in the middle, laziness. 00:16:40.080 |
Laziness is, you don't know when the ops are going to dispatch and don't worry about that. 00:16:42.760 |
You don't have to worry about this as a programmer. You just write out all your stuff and then 00:16:46.280 |
when you actually type .numpy, it'll be ready by the time you, you know, copy the thing 00:16:50.540 |
back to CPU. Or you can do .realize and it will actually like force that tensor to be 00:16:54.960 |
allocated in RAM. But yeah, a lot of times, right, like, and if you think about it, PyTorch 00:17:00.960 |
is kind of lazy in a way, but they didn't extend the paradigm far enough, right? When 00:17:04.920 |
I do A times B in PyTorch, it's going to launch a CUDA kernel to do A times B, but it's not 00:17:09.680 |
going to wait for that CUDA kernel to complete. So you're getting the worst possible world. 00:17:13.800 |
You're getting the same laziness, but you also can't get fusion because PyTorch doesn't know 00:17:18.200 |
that I'm then going to do plus C. There's no way for it to be like, "Whoa, whoa, whoa, 00:17:21.560 |
don't launch that CUDA kernel. Whoa, just do this one too." Right? You can kind of like, 00:17:26.320 |
this stuff, PyTorch is working on this and, you know, it's a little bit harder. Like in 00:17:31.920 |
comma, I felt like I was competing against a lot of idiots. Here I'm competing against, 00:17:35.840 |
you know, smart, smart, very smart people who've made, yeah, who've made some, I think, 00:17:41.680 |
different trade-offs, right? Who've made some different trade-offs. Whereas if you're trying 00:17:45.400 |
to build something that is just straight up good on NVIDIA and we have a lot of people 00:17:49.540 |
and complexity to throw at it, yeah, PyTorch made a lot of the right choices. I'm trying 00:17:53.140 |
to build something that manages complexity. Like you can always make your software do 00:17:57.520 |
more. The magic is when you can make your software do more without adding complexity, 00:18:02.320 |
right? Because, you know, complex things eventually collapse under their own weight. So it's kind 00:18:09.500 |
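A minimal sketch of the lazy model he is describing, in tinygrad-style code (exact API details may differ by version):

```python
from tinygrad.tensor import Tensor

a = Tensor.randn(1024, 1024)
b = Tensor.randn(1024, 1024)
c = Tensor.randn(1024, 1024)

d = (a * b + c).relu()   # nothing dispatched yet: this only builds a small local graph,
                         # which is what gives the scheduler room to fuse the ops

out = d.numpy()          # only now are kernels generated, fused, run, and copied back to CPU
# d.realize() would instead force the tensor to be materialized on-device without the copy.
```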
>> Like TensorFlow actually collapsed. It's kind of what happened, right? How does fusing 00:18:15.760 |
actually work? So yeah, there's this thing called lazy.py. And when you do like A times 00:18:21.720 |
B, that's, it's put into a graph, but it's a very local graph. There's no global graph 00:18:28.120 |
optimizations. And even this can change, right? Again, like the programming model for TinyGrad 00:18:32.760 |
does not preclude eagerness, right? Laziness is not guaranteed laziness. It's just going 00:18:37.440 |
to try its best. So you put in A times B, and that's a binary op, right? And then you 00:18:41.960 |
put in A times B, like that's a node in the graph. It's a virtual node because it's not 00:18:45.640 |
realized yet, plus C. Okay, here's a new node, which takes the C tensor in here and takes 00:18:50.360 |
the output of A times B. It's like, whoa, wait, there's two binary ops. Okay, we'll 00:18:53.680 |
just fuse those together. Okay, here I have a kernel. This kernel has A, B, and C as inputs. 00:18:58.200 |
It does A times B plus C in the local registers, and then outputs that to memory. And you can 00:19:04.360 |
GRAPH=1 in TinyGrad. Another, like amazing thing that TinyGrad has that I've not seen 00:19:10.560 |
in any other framework is two things. GRAPH=1, graph equals one, which is an environment variable. 00:19:16.040 |
It will output a complete graph of all the operations. A lot of people are like, oh, 00:19:19.480 |
you can use PyTorch, export it to ONNX, and use Netron. Yeah, you can, but like what? 00:19:24.680 |
That's not what's real. Graph.1 will show you the actual kernels that were dispatched 00:19:28.520 |
to the GPU. You can also set DEBUG=2, which will print those kernels out in 00:19:34.200 |
your command line. And it will tell you the exact number of flops and the exact number 00:19:40.440 |
of memory accesses in each kernel. So you can immediately see, wait a second, okay, 00:19:45.680 |
this kernel used this many flops, this was the gigaflops, this is how many bytes it read, 00:19:49.520 |
and this was the gigabytes per second. And then you can profile without having to like, 00:19:53.280 |
okay, I mean, in theory, in PyTorch, sure, use the NVIDIA Nsight profiler, which is-- 00:19:58.000 |
>> No one does that. >> No one does, of course, because it's so 00:20:00.240 |
difficult, right? Like, actually, NVIDIA used to, pre, I think CUDA 9 was the last one they 00:20:06.320 |
had it. They had a command line one, but now it's like, okay, I'm going to generate this 00:20:09.760 |
blob, use this NVIDIA GUI tool to convert it into a Chrome trace and then load it. Yeah, 00:20:15.480 |
no one does that, right? You just type DEBUG=2 with any TinyGrad model and it will show you 00:20:19.680 |
all the kernels that it launches and the efficiency of each kernel, basically. 00:20:24.160 |
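In practice that introspection looks something like the following (a sketch; the example script name is made up, and the exact output format varies by tinygrad version):

```python
# From the shell, since these are just environment variables:
#   DEBUG=2 python3 my_model.py    # print each dispatched kernel with FLOPs, bytes moved,
#                                  # and the achieved GFLOPS / GB/s
#   GRAPH=1 python3 my_model.py    # dump a graph of the kernels that actually ran
#
# Or from inside Python, setting them before tinygrad is imported:
import os
os.environ["DEBUG"] = "2"
os.environ["GRAPH"] = "1"

from tinygrad.tensor import Tensor
x = Tensor.randn(512, 512)
(x @ x).relu().numpy()             # watch the per-kernel FLOP and memory-bandwidth report
```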
>> Yeah, this is something that John Carmack has often commented about, is that when you 00:20:29.320 |
code, you need to build in your instrumentation or observability right into that. I wonder 00:20:34.520 |
if whatever John is working on, he's adopting this style, and maybe we can sort of encourage 00:20:39.720 |
it by, like, I don't know, naming it and coining it as a certain kind of debugging style. 00:20:46.280 |
>> If he would like to start contributing to TinyGrad, I'd be-- 00:20:49.320 |
>> You should hook up with him. >> I'd be so happy. I've chatted with him 00:20:52.000 |
a few times. I'm not really sure what his company is doing. I think it's all, I think 00:20:55.720 |
it's pretty, but no, I mean, hopefully, like, we get TinyGrad to a point where people actually 00:21:02.240 |
want to start using it. So TinyGrad right now is uncompetitive on, it's uncompetitive 00:21:07.800 |
on NVIDIA, it's uncompetitive on x86. >> And specifically, what do you care about 00:21:13.180 |
>> Okay. >> Sheer speed. It's correct. The correctness 00:21:16.040 |
is there. The correctness for both forwards and backwards passes is there, but on NVIDIA, 00:21:20.680 |
it's about 5x slower than PyTorch right now. Like, 5x, wow, this is insurmountable. No, 00:21:25.240 |
there's reasons it's 5x slower, and I can go through how we're going to make it faster, 00:21:28.040 |
and it used to be, you know, 100x slower, so, you know, we're making progress, but there's 00:21:32.160 |
one place where it actually is competitive, and that's Qualcomm GPUs. So TinyGrad is 00:21:36.560 |
used to run the model in OpenPilot. Like, right now, it's been live in production now 00:21:40.360 |
for six months, and TinyGrad is about 2x faster on the GPU than Qualcomm's library. 00:21:46.360 |
>> And why specifically Qualcomm? >> Well, because we have Qualcomm. We use 00:21:51.080 |
Qualcomm in the Comma devices. >> Oh, I mean, like, what makes, what about 00:21:55.840 |
Qualcomm architecture? >> Oh, what makes it doable? Well, because 00:21:58.920 |
the world has spent how many millions of man-hours to make NVIDIA fast, and Qualcomm has a team 00:22:03.160 |
of 10 Qualcomm engineers? Okay, well, who can I beat here? Like, what I propose with 00:22:08.760 |
TinyGrad is that developer efficiency is much higher, but even if I have 10x higher developer 00:22:14.160 |
efficiency, I still lose on NVIDIA, right? You know, okay, I didn't put 100,000 man-hours 00:22:19.840 |
into it, right? If they put a million, like, that's what I'm saying, but that's what I'm 00:22:23.560 |
saying we can get, and we are going to close this speed gap a lot. Like, I don't support 00:22:28.480 |
Tensor Cores yet. That's a big one that's just going to, okay, massively close the gap. 00:22:33.960 |
And then AMD. I can't even get, I don't even have a benchmark for AMD because I couldn't 00:22:39.280 |
get it compiled. Oh, and I tried. Oh, I tried. I spent a day. Like, I spent actually a day 00:22:43.940 |
trying to get PyTorch, and I got it built. I got it kind of working, and then I tried 00:22:49.400 |
to run a model. Like, there's all kinds of weird errors, and the rabbit holes are so 00:22:52.800 |
deep on this. I'm like, so we, you know, you can compare the speed. Right now, you can 00:22:57.320 |
run LLAMA. You can run anything you want on AMD. It already all works. Any OpenCL backend 00:23:01.160 |
works, and it's not terribly slow. I mean, it's a lot faster than crashing, so it's infinitely 00:23:05.760 |
times faster than PyTorch on AMD, but pretty soon, we're going to start getting close to 00:23:10.560 |
theoretical maximums on AMD. That's really where I'm pushing, and I want to get AMD on 00:23:19.800 |
>> Yeah, let's dive into that, because when you announced the TinyCorp fundraise, you 00:23:23.920 |
mentioned one of your first goals is build the framework, runtime, and driver for AMD, 00:23:29.520 |
and then on June 3rd on Twitch, you weren't as excited about AMD anymore. Maybe let's 00:23:35.080 |
talk a bit about that, and you compared the quality of commit messages from the AMD kernel 00:23:41.360 |
to the Intel work that people are doing there. What's important to know? 00:23:44.800 |
>> So when I said I want to write a framework, I never intended on writing a kernel driver. 00:23:49.160 |
I mean, I flirted with that idea briefly, but realistically, there's three parts to 00:23:55.840 |
it, right? There's the ML framework, there's the driver, and then there's the user space 00:23:59.800 |
runtime. I was even down to rewrite the user space runtime. I have a GitHub repo called 00:24:04.800 |
CUDA I/O Control Sniffer. It's terribly called, but you can actually launch a CUDA kernel 00:24:08.520 |
without CUDA, so you don't need CUDA installed. Just the NVIDIA open source driver and this 00:24:13.820 |
open source repo can launch a CUDA kernel. So rewriting the user space runtime is doable. 00:24:19.040 |
Rewriting the kernel driver, I don't even have docs. I don't have any docs for the GPU. 00:24:23.000 |
It would just be a massive reverse engineering project. When I saw that there, I wasn't complaining 00:24:30.880 |
about it being slow. I wasn't complaining about PyTorch not compiling. I was complaining 00:24:34.120 |
about the thing crashing my entire computer. It panics my kernel, and I have to wait five 00:24:37.880 |
minutes while it reboots because it's a server motherboard and they take five minutes to 00:24:40.640 |
reboot. So I was like, "Look, if you guys do not care enough to get me a decent kernel 00:24:45.160 |
driver, there's no way I'm wasting my time on this, especially when I can use Intel GPUs." 00:24:49.280 |
Intel GPUs have a stable kernel driver, and they have all their hardware documented. You 00:24:53.620 |
can go and you can find all the register docs on Intel GPUs. So I'm like, "Why don't I just 00:24:58.480 |
use these?" Now, there's a downside to them. Their GPU is $350. You're like, "What a deal. 00:25:03.600 |
It's $350." You get about $350 worth of performance. If you're paying about $400 for the PCIe slot 00:25:08.760 |
to put it in, like between the power and all the other stuff, you're like, "Okay, never 00:25:12.520 |
mind. You've got to use NVIDIA or AMD from that perspective." But I sent an email to 00:25:20.600 |
>> Oh, you can see you published that email in a Discord. 00:25:22.600 |
>> I did. I did. And she responded. And I've had a few calls since. And what I did was 00:25:30.160 |
like what I tried to do. Well, first off, thank you for responding. It shows me that 00:25:35.680 |
if you don't care about your kernel panicking, I can't. This is just a huge waste of my time. 00:25:40.640 |
Right? I'll find someone who will care. I'm not asking for your 7x7 Winograd transposed 00:25:46.760 |
convolution to be fast. I'm not asking for that. I'm asking literally for- 00:25:51.640 |
>> Oh, and this isn't TinyGrad. This is your demo apps. I ran their demo apps in loops 00:25:56.320 |
and I got kernel panics. I'm like, "No. Okay." But no, Lisa Su reached out, connected with 00:26:05.640 |
a whole bunch of different people. They sent me a pre-release version of ROCm 5.6. They 00:26:12.040 |
told me you can't release it, which I'm like, "Why do you care?" But they said they're going 00:26:17.240 |
to release it by the end of the month. And it fixed the kernel panic. The guy managed 00:26:20.560 |
to reproduce it with the two GPUs and the computer. And yeah, sent me a driver and it 00:26:27.600 |
works. So yeah, I had that experience. And then I had another experience where I had 00:26:33.080 |
two calls with AMD's communication people. I tried to explain to these people open source 00:26:38.000 |
culture. It's not open source if you dump the source code on a GitHub repo and then 00:26:42.880 |
forget about it until the next release. It's not open source if all your issues are from 00:26:48.000 |
2022. No one's going to contribute to that project. Sure, it's open source in a very 00:26:54.400 |
technical sense. To be fair, it's better than nothing. It's better than nothing, but I fixed 00:26:59.640 |
a bug in NCCL. There's a fun fact, by the way. If you have a consumer NVIDIA GPU, they 00:27:05.800 |
don't support peer-to-peer, and their all-reduce bandwidth is horrendously slow because it's 00:27:10.800 |
using CUDA kernels to do the copy between the GPUs. And it's putting so many transactions 00:27:15.220 |
on the PCIe bus that it's really slow. But you can use CUDA memcpy, and there's a flag 00:27:19.400 |
to use CUDA memcpy, but that flag had a bug. So I posted the issue on NCCL. I expected 00:27:27.360 |
nothing to happen. The Nvidia guy replied to me within an hour. He's like, "Try this 00:27:30.560 |
other flag." I'm like, "Okay, I tried the other flag. It still doesn't work, but here's 00:27:33.900 |
a clean repro." And I spent like three hours writing a very clean repro. I ended up tracking 00:27:40.280 |
the issue down myself, but just the fact that somebody responded to me within an hour and 00:27:43.660 |
cared about fixing the issue, okay, you've shown that it's worth my time, and I will 00:27:47.960 |
put my time in because let's make this better. I'm here to help. But if you show me that 00:27:52.640 |
you're like, "You're the kernel panics. Let's just expect it." Okay. 00:27:56.000 |
>> Well, it sounds like AMD is getting the message. 00:27:59.000 |
>> They are. And I don't really think they've had someone explain to them. I was like, "You 00:28:03.600 |
can build in public." And they're like, "What's an example of building in public?" I'm like, 00:28:06.640 |
"Go look at PyTorch." Go look at PyTorch, right? I have two minor things merged into 00:28:11.760 |
PyTorch because it's very responsive. They're like minor bug fixes, but I feel like it's... 00:28:17.160 |
>> Yeah. So that's kind of like the lowest level of the stack. And then at a slightly 00:28:22.400 |
higher level, obviously, there's TinyGrad, there's Mojo, there's GGML. How are you thinking 00:28:28.200 |
about breadth versus depth and where you decided to focus early on? 00:28:33.600 |
>> So GGML is very much like a... Okay, everyone has M1s, right? Actually, I was thinking... 00:28:38.400 |
In the beginning, I was thinking of something more like GGML focused on the M1s, but GGML 00:28:42.880 |
showed up and was just like, "We're actually just focusing on the M1s." And actually, M1 00:28:49.920 |
PyTorch is considerably better than AMD PyTorch. M1 PyTorch works. It only gives wrong answers 00:28:54.560 |
sometimes and it only crashes sometimes. But some models kind of run. When I was writing 00:29:00.960 |
the metal backend, I was comparing to MPS PyTorch, and I had a discrepancy. TinyGrad 00:29:07.000 |
checks all its outputs compared to Torch, and I had one where it didn't match. I'm like, 00:29:13.000 |
"I checked the matrix by hand. It matches TinyGrad. I don't understand." And then I 00:29:17.040 |
switched PyTorch back to CPU and it matched. I'm like, "Oh." Yeah. Well, there's bugs. 00:29:23.200 |
If you transpose the matrix, because I think it has to do with multi-views and PyTorch 00:29:27.340 |
and weird under-the-hood stuff that's not exposed to you. There's bugs and maybe they 00:29:30.880 |
fix them. But it seems like there was a lot of momentum, again, because you're getting 00:29:36.960 |
how many engineers care about making PyTorch work on M1? Thousands, tens of thousands. 00:29:42.120 |
And you have an open development process, and guess what? It's going to be good. How 00:29:45.120 |
many engineers care about AMD working, PyTorch AMD working? You got 10 guys that work for 00:29:54.000 |
You revealed an interesting detail about how you debug, which is you hand-check the matrix 00:30:00.040 |
No, I don't hand-check it. One of the best tests in TinyGrad is a file called test_ops.py. 00:30:06.600 |
And it's just 100 small examples written in TinyGrad and PyTorch. And it checks both the 00:30:12.720 |
forwards and backwards to make sure they match. 00:30:17.080 |
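The shape of such a parity test, sketched (simplified; not the actual file):

```python
import numpy as np
import torch
from tinygrad.tensor import Tensor

def check_op(torch_fn, tiny_fn, shape=(4, 4), atol=1e-4):
    data = np.random.randn(*shape).astype(np.float32)

    t = torch.tensor(data, requires_grad=True)
    torch_out = torch_fn(t).sum()
    torch_out.backward()

    g = Tensor(data, requires_grad=True)
    tiny_out = tiny_fn(g).sum()
    tiny_out.backward()

    # both the forward values and the backward gradients have to match
    np.testing.assert_allclose(torch_out.detach().numpy(), tiny_out.numpy(), atol=atol)
    np.testing.assert_allclose(t.grad.numpy(), g.grad.numpy(), atol=atol)

check_op(torch.relu, Tensor.relu)
check_op(lambda x: x.exp(), lambda x: x.exp())
```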
That's one of them where I really have put a lot of effort into CI for TinyGrad. I think 00:30:21.400 |
CI is super important. I want that green check to mean I can merge this. I don't want my 00:30:26.360 |
tests to -- and the green check, if you somehow manage to introduce a bug and get the green 00:30:29.760 |
check, okay, we're fixing the test. Top priority. 00:30:33.880 |
It's closed source. No, I'm not that interested. You know what I mean? Look, I like Chris Lattner. 00:30:40.680 |
I think he's going to do great things, and I understand kind of the wisdom even in keeping 00:30:45.080 |
it closed source. But I'm interested when it's open. 00:30:50.240 |
You have an interesting design deviation from him, because he's decided to be -- well, promised 00:30:54.840 |
to be a superset of Python, and you have decided to break with PyTorch APIs. And I think that 00:31:01.160 |
affects learnability and transportability of code. 00:31:05.700 |
You know, if the PyTorch thing ends up being like a stumbling block, I could write a perfect 00:31:13.600 |
PyTorch. Like a -- you know, instead of import PyTorch, instead of, like, yeah, import Torch, 00:31:20.280 |
you type import TinyTorch as Torch. And if that really becomes the stumbling block, I 00:31:25.540 |
will do that. No, Chris Lattner went much further than PyTorch. Replicating the PyTorch 00:31:30.960 |
API is something I can do with a couple, you know, like an engineer month or two. 00:31:35.720 |
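Hypothetically, such a shim is just a thin module re-exporting tinygrad under Torch-shaped names; nothing like this is implied to actually exist, it only illustrates the idea.

```python
# tinytorch.py (hypothetical)
from tinygrad.tensor import Tensor as _Tensor

def tensor(data, requires_grad=False):
    return _Tensor(data, requires_grad=requires_grad)

def relu(x):
    return x.relu()

def matmul(a, b):
    return a.matmul(b)

# User code would then change only its import line:
#   import tinytorch as torch
#   y = torch.relu(torch.tensor([[1.0, -2.0]]))
```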
Right, like a shim, yeah. Replicating Python, whoo-hoo-hoo. There's a big graveyard of those 00:31:41.360 |
things. How's Pyston going? How's Jython? You can go way back. 00:31:51.880 |
So TinyGrad is one layer. You announced TinyBox recently, which is, you know, you made it 00:31:57.560 |
-- so your core mission is commoditizing the petaflop. And then your business goal is to 00:32:03.080 |
sell computers for more than the cost to make, which seems super reasonable. And you're gonna 00:32:10.680 |
No, no, no, no, no, no, no, no. That was my -- look, you know, a lot of people, like, 00:32:15.040 |
I love, you know, leaning into like saying I'm giving up, right? It's great to give up. 00:32:19.040 |
Giving up is this wonderful thing. It's so liberating. And then, like, you can decide 00:32:22.520 |
afterward if you really give up or not. There's very little harm in saying you give up, except 00:32:25.920 |
like, you know, great, Twitter haters have something to talk about. And all press is good press. 00:32:36.440 |
Unless AMD, you know, upsets me again, and then we're back to other colors. We have other 00:32:43.240 |
When you think about hardware design, what are some of the numbers you look for? So teraflops 00:32:48.600 |
per second is one, but like memory bandwidth is another big limiter. Like, how do you make 00:32:54.880 |
Well, I mean, fundamentally, I'm limited to what GPUs I can buy. But yeah, for something 00:32:59.520 |
that I think a lot of people are going to want to reasonably do with -- a coworker of 00:33:05.160 |
mine described them as luxury AI computers, right? Like, luxury AI computers for people. 00:33:11.120 |
And that's like what we're building. And I think a common thing people are going to want 00:33:13.600 |
to do is run, like, large LLaMA, right? Or large Falcon or whatever. 00:33:18.520 |
FP16, exactly. Exactly. You know, int8, I think, can work. I think that, like, what 00:33:23.120 |
GGML is doing to go to, like, Int4, like, this doesn't work. Like, have you done -- maybe 00:33:28.120 |
they have. But, like, I read what it was, and I was like, this isn't from any paper. 00:33:35.320 |
Yeah, you made up some quantization standard to make it run fast. And, like, maybe it works, 00:33:39.560 |
but, okay, where's, like, the HellaSwag number, right? Where's your, where's your, you know, 00:33:45.240 |
The thesis is right, that, like, if you have billions, hundreds of billions of parameters, 00:33:49.080 |
that the individual quantization doesn't actually matter that much. 00:33:52.080 |
Well, the real way to look at all of that is to just say you want to compress the weights, 00:33:55.320 |
right? It's a form of weight compression. Quantization is a form of weight compression, 00:33:58.440 |
right? Now, this is obviously not lossless. It's not a lossless compressor, right? If 00:34:01.280 |
it's a lossless compressor, and you can show that it's correct, then, okay, we don't have 00:34:04.320 |
to have any other conversation. But it's a lossy compressor. 00:34:07.920 |
And how do you know that your loss isn't actually losing the power of the model? 00:34:12.080 |
Maybe int4 65B LLaMA is actually the same as FP16 7B LLaMA, right? We don't know. Maybe 00:34:18.600 |
someone has done this yet, but I looked for it when it, like, first came out, and people 00:34:21.680 |
were talking about it, and I'm like, I just have -- like, it's not from a paper, right? 00:34:25.920 |
The int8 stuff is from a paper where they, like, some of the int8 stuff is from a paper. 00:34:29.720 |
There's one paper, I think it's, like, LLM.int8, where they actually, you know, do all the 00:34:35.960 |
tests. And they didn't go fully int8. They made, like, 90% of it int8 and kept, like, 00:34:41.320 |
10% of it in FP16 for what they called, like, the outliers or whatever. 00:34:46.200 |
So I think that this is not quite so easy. And I think being able -- well, so first off, 00:34:49.560 |
if you're training, no one's gotten training to work with int8 yet. There's a few papers 00:34:52.640 |
that vaguely show it. But if you're training, you're going to need BF16 or float16. So 00:34:58.480 |
this is why I target that. Now the thing that you're going to want to do is run these large 00:35:03.320 |
language models out of the box on your hardware in FP16, and that's memory bandwidth. So 00:35:09.320 |
you need large amounts of memory bandwidth, too. So ask how I trade off memory bandwidth 00:35:17.720 |
>> And I saw one of your -- so first of all, you have this hiring process, which is you've 00:35:22.160 |
got to solve one of the bounties that are open on TinyGrad. There's no technical interview. 00:35:27.280 |
One of them is int8 support. Do you already have some things you want to test on? 00:35:32.480 |
>> We have int8 support. What I'd like to see somebody do is just load the GGML int8 00:35:37.800 |
LLaMA into TinyGrad and then benchmark it against the FP16 one. Int8 already works 00:35:43.720 |
in TinyGrad. It doesn't actually do the math in int8, which is even a stronger -- like, 00:35:49.360 |
it does all the math still in FP32. So int8 can mean you just have your weights in int8, 00:35:54.240 |
or int8 can mean you actually do your math in int8. And doing your math in int8, the 00:35:57.520 |
big, like, gain that people care about is actually having your weights in int8, because 00:36:03.720 |
weights in int8 mean less memory and less memory bandwidth, whereas the math, keep it 00:36:09.280 |
in FP32. On M1s, it doesn't even matter if you're doing -- it doesn't matter what data 00:36:14.240 |
type you're doing in the GPU. I'm not even sure it can do int8, but FP16 and FP32 is 00:36:19.840 |
the same. It's the same teraflops. So, yeah, no, that's one of the bounties. One of the 00:36:25.720 |
bounties is get int8 llama running with the int8 weights. And then actually, you don't 00:36:31.040 |
even need to -- what you could even do, if you really want to test this, just take the 00:36:34.600 |
FP16 weights, convert them to int8, then convert them back to FP16, then compare the 00:36:43.560 |
>> This should be lossless in the other direction. 00:36:45.440 |
>> Yeah, I think FP16, it should be lossless in the other direction. I'm actually not 100% 00:36:54.520 |
>> Oh, because, like, you ever try to, like, if you want to represent -- if it was, like, 00:36:59.800 |
>> I think all of int8 can be represented in FP16, but I'm not 100% about that. 00:37:08.760 |
>> We just have to do it, right? Just literally do it. There's only 256 to check. But, yeah, 00:37:14.720 |
either way -- I mean, int4, definitely. So do your int4, convert it back, and now see, 00:37:19.480 |
even with int4 weight and FP32 math, like, okay, how much has your performance degraded 00:37:27.880 |
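Sketched out, the experiment he is proposing looks roughly like this. The quantization scheme below is a generic symmetric per-tensor one, chosen for illustration only, not GGML's actual format; the point is the round trip and then a benchmark comparison.

```python
import numpy as np

# The trivial direction first: every int8 value is exactly representable in FP16.
assert all(np.float16(v) == v for v in range(-128, 128))

def quant_roundtrip(w, bits=8):
    qmax = 2 ** (bits - 1) - 1                               # 127 for int8, 7 for int4
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return (q.astype(np.float32) * scale).astype(np.float16)

w = np.random.randn(4096, 4096).astype(np.float16)           # stand-in for one weight matrix
print("int8 mean abs error:", np.abs(w - quant_roundtrip(w, 8)).mean())
print("int4 mean abs error:", np.abs(w - quant_roundtrip(w, 4)).mean())

# The real test is then to run the FP16 model and the round-tripped model on HellaSwag
# (or whatever benchmark) and compare scores, rather than trusting the format by fiat.
```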
>> So can we -- I'm about to zoom out a little bit from the details. I don't know if you 00:37:33.240 |
>> No, I think, like, the -- you're planning to release the first tiny box, ship them in, 00:37:37.880 |
like, two to six, eight months, something like that. What's top of mind for you in terms 00:37:42.080 |
of building a team? Who should -- who are you calling for? 00:37:45.840 |
>> Yeah. Well, to stay on the tiny box for one minute, so as the GPU is picked out, and 00:37:53.200 |
you're like, well, I could make that computer with the GPUs, and my answer is, can you? 00:37:57.840 |
Do you know how to put -- do you know how hard it is to put six GPUs in a computer? 00:38:02.600 |
People think it's really easy, and it's really easy to put one GPU in a computer. It's really 00:38:06.240 |
easy to put two GPUs in a computer, but now you want to put in eight. Okay, so I'll tell 00:38:10.680 |
you a few things about these GPUs. They take up four slots. What kind of computer -- you 00:38:15.560 |
can buy the nicest Supermicro. You can't put eight of those in there. You need two 00:38:19.000 |
slot blowers. If you want to use one of those 4U Supermicros, you need two-slot blowers. 00:38:23.240 |
Or water cooling. If you're trying to get the four-slot cards in there, you're going 00:38:26.240 |
to need some form of water cooling. Or you're going to need -- there are some, like, Chinese 00:38:31.120 |
4090s that are blowers, right? You're going to need blowers or water cooling if you're 00:38:40.560 |
>> Then, the other thing that -- okay, so now you want to get six GPUs in a computer, 00:38:45.660 |
so that's a big challenge. You're like, "Oh, I'll just use PCIe extenders. I saw it on 00:38:49.080 |
Linus Tech Tips. It works great." No, it doesn't. Try PCIe extenders that work at PCIe 4.0. 00:38:54.440 |
And interconnect bandwidth is super important. 00:38:56.920 |
>> They don't work at 3.0. No PCIe extender I've tested, and I've bought 20 of them, works 00:39:02.760 |
at PCIe 4.0. So you're going to need PCIe re-drivers. Now, okay, how much is that adding 00:39:08.840 |
cost, right? Like, these things all get really hard. And then, tiny boxes, I've even added 00:39:12.760 |
another constraint to it. I want this thing to be silent. Not totally silent, but my limit 00:39:17.520 |
is like 45, maybe 50 dB, but not -- a Supermicro machine is 60 dB. We have a small -- we 00:39:24.760 |
have a compute cluster at Comma. You've got to wear ear protection to go in there. 00:39:28.840 |
>> Yeah, I've seen some videos where you give a tour. 00:39:34.080 |
>> It's super loud. You've got all these things just screaming. 00:39:36.080 |
>> 10,000 RPM, just screaming. Like, I want to be able to use the normal big GPU fans, 00:39:43.320 |
and make this thing so it can sit under your desk, plug into one outlet of power, right? 00:39:48.880 |
Six GPUs. Your GPUs are 350 watts each. You can't plug that into a wall outlet. Okay, 00:39:55.600 |
so how are you going to deal with that? Good questions, right? And you're not sharing them. 00:40:00.360 |
Well, that one, I mean, that one is pretty obvious. You have to limit the power on the 00:40:05.760 |
>> You have to limit the power on the GPUs. Now, you can limit power on GPUs and still 00:40:08.160 |
get -- you can use like half the power and get 80% of the performance. This is a known 00:40:12.320 |
fact about GPUs, but like, that's one of my design constraints. So, when you start to 00:40:15.840 |
add all these design constraints, good luck building a tiny box yourself. 00:40:20.840 |
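The outlet math behind that constraint, with assumed US residential numbers:

```python
outlet_watts  = 120 * 15 * 0.8       # 120 V, 15 A circuit, ~80% continuous-load rule -> 1440 W
gpus          = 6
stock_watts   = 350                  # per GPU at stock power limits
print(gpus * stock_watts)            # 2100 W: does not fit on one outlet

limited_watts = 200                  # power-limit each GPU to roughly half...
print(gpus * limited_watts)          # ...1200 W, while keeping ~80% of the performance
print(outlet_watts - gpus * limited_watts)   # ~240 W left for CPU, fans, and drives
```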
>> You know, obviously, it can be done, but you need something that has actually quite 00:40:27.040 |
>> And you see like the -- under the desk, it's like one of the main use cases, kind 00:40:33.160 |
>> Yeah. What I also see is more of a like an AI hub for your home, right? As we start 00:40:38.200 |
to get like home robotics kind of stuff, you don't want to put the inference on the robot, 00:40:43.720 |
but you also don't want to put the inference on the cloud. You don't want to put it on 00:40:47.000 |
the robot because, okay, it's 1,500 watts, tiny box. You'll put batteries, you'll charge 00:40:52.680 |
them. Bad idea. I mean, just use wireless. Wireless is 0.5 milliseconds, right? This is super fast. 00:41:00.040 |
You don't want to go to the cloud for two reasons. One, cloud's far away. It's not that 00:41:04.200 |
far away. You can kind of address this. But two, cloud's also mad expensive. Like cloud 00:41:10.560 |
GPUs are way more expensive than running that GPU at your house, at least any rates you're 00:41:14.960 |
going to get, right? Maybe if you commit to buy, well, yeah, I'm going to buy 10,000 GPUs 00:41:18.960 |
for three years, then maybe the cloud will give you a good rate. But like, you want to 00:41:22.320 |
buy one GPU in the cloud? Ooh. I mean, okay, you can go to like Vast, but like if you're 00:41:26.080 |
going on Azure or AWS, oh, that's expensive. Yeah. This is like a personal data center, 00:41:30.880 |
you know, instead of a cloud data center. We like the term compute cluster, so we can 00:41:34.960 |
use NVIDIA GPUs. Data centers may be a little bit dated. It's a compute cluster, which is 00:41:40.720 |
totally legal under the CUDA license agreement. You talk a lot about the PCIe connection. 00:41:45.080 |
Do you think there's any fat there to the term? What do you mean? Just you're limited 00:41:50.760 |
by bandwidth, right? Okay. For some things, yes. So the bandwidth is roughly 10x less 00:41:58.160 |
than what you can get with NVLinked A100s. NVLinked A100s are going to have, and then 00:42:03.000 |
you can even get like full fabric and NVIDIA really pushes on that stuff, 600 gigabytes 00:42:07.280 |
per second, right? And PCIe 4, you're going to get 60, right? So you're getting 10x less. 00:42:14.480 |
That said, why do you need the bandwidth, right? And the answer is you need it for training 00:42:19.880 |
huge models. If you're training on a tiny box, your limit's going to be about 7 billion, 00:42:25.040 |
right? If you're training on big stuff, your limits could be like 70 billion, right? Okay. 00:42:29.720 |
You can hack it to get a bit higher. You can hack it like GPT hacked it to get a bit higher, 00:42:32.880 |
but like that 65 billion in LLaMA, like there's a reason they chose 65 billion, right? And 00:42:36.720 |
that's what can reasonably fit model parallel on GPUs, right? So yes, you are going to 00:42:43.880 |
end up training models. The cap's going to be like 7 billion. I actually heard this on 00:42:47.040 |
your podcast. I don't think that the best chatbot models are going to be the big ones. 00:42:51.720 |
I think the best chatbot models are going to be the ones where you had a thousand training 00:42:54.680 |
runs instead of one. And I don't think that the interconnect bandwidth is going to matter 00:43:00.320 |
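A rough rule of thumb behind those ceilings (assumed numbers; activation memory and sharding overheads ignored): training with Adam in mixed precision costs on the order of 16 bytes per parameter, so the available VRAM roughly sets the largest model you can train.

```python
bytes_per_param = 16      # ~2 (fp16 weights) + 2 (fp16 grads) + 12 (fp32 master + Adam moments)

tiny_box_vram  = 6 * 24e9                       # e.g. six 24 GB consumer GPUs
print(tiny_box_vram / bytes_per_param / 1e9)    # ~9  -> a ~7B model is about the ceiling

nvlink_node    = 8 * 80e9                       # one NVLinked 8x80 GB node
print(nvlink_node / bytes_per_param / 1e9)      # ~40 -> 65-70B already needs sharding tricks
```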
>> So what are we optimizing for instead of compute optimal? 00:43:05.640 |
>> So you're talking about this, the LLaMA-style models where you train for, like, 200x. 00:43:13.040 |
>> Yeah. So, okay. You can always make your model better by doing one of two things, right? 00:43:16.680 |
In a comma, we just have a strict limit on it. You can always make your model better 00:43:19.800 |
by training longer and you can always make your model better by making it bigger. But 00:43:23.720 |
these aren't the interesting ones, right? Particularly the making it bigger. Because 00:43:27.480 |
training it longer, fine. You know, you're getting a better set of weights. The inference 00:43:30.240 |
is the same. The inference is the same whether I trained it for a day or a week. 00:43:34.960 |
>> But the, okay, if it's 1 billion versus 10 billion, well, I 10x my inference too, 00:43:38.680 |
right? So I think that these big models are kind of, sure, they're great if you're research 00:43:43.120 |
labs and you're trying to like max out this hypothetical thing. 00:43:47.200 |
>> Yeah, yeah, yeah. But if you're like a startup or you're like an individual or you're 00:43:51.760 |
trying to deploy this to the edge anywhere, you don't need that many weights. 00:43:56.880 |
>> You actually don't want that many weights. 00:43:57.880 |
>> Yeah. Optimizing for inference rather than capabilities. 00:44:01.680 |
>> Yes, yes. And I think the inference thing, right? There's going to be so much more. Right 00:44:06.360 |
now, the ratio between like training and inference on clouds, I think it's only still like, I 00:44:10.680 |
think it's like 2 or 3x, right? It's 2 or 3x more inference, which doesn't make any 00:44:13.680 |
sense, right? There should be way more inference. 00:44:16.160 |
>> There should be 10 to 100x more inference in the world than training. But then also, 00:44:20.400 |
like, what is training, right? You start to see these things like LoRA, like, you're 00:44:24.720 |
getting kind of, it's kind of blurring the lines between inference and training. And 00:44:28.960 |
I think that that blurred line is actually really good. I'd like to see much more like 00:44:32.000 |
on-device training or on-device fine-tuning of the final layer, where we're pushing toward 00:44:36.920 |
the stuff that come, right? Like, why am I shipping a fixed model? I totally want this 00:44:40.120 |
model to fine-tune based on, like, how, you know, your left tire is flat, right? Like, 00:44:46.560 |
every time you cut the same turn because your left tire is flat, well, it should learn that, 00:44:50.920 |
>> So, would Comma pursue parameter-efficient fine-tuning? 00:44:53.200 |
>> Yeah. Yeah, yeah, yeah. We're, we're, we're, we're -- 00:44:56.280 |
>> We're, we're looking into stuff like that. I mean, Comma's already very parameter-efficient 00:44:59.440 |
because we have to, like, run this thing in a car and you have to, like, cool it and power 00:45:05.120 |
>> Yeah, yeah. And so, that's kind of like intelligence cluster you have in your home. 00:45:07.960 |
You see when the person is using third-party model, they load them locally and kind of 00:45:13.880 |
do the final fine-tuning. It kind of stays within the box. 00:45:16.560 |
>> Yeah. I think that that's one thing. That's one version of it for the privacy conscious. 00:45:21.560 |
I also see a world where you can have your tiny box, in its down cycles, mine FLOPcoin, 00:45:29.000 |
right? You know, not all, it turns out not all crypto is a scam. There's one way to tell 00:45:32.760 |
if crypto is a scam. If they're selling the coin before they make the product, it's a 00:45:38.000 |
>> If they have the product and then they sell the coin, it's maybe not a scam, right? So, 00:45:40.400 |
yeah, my thought is, like, each tiny box would let you, would have a private key on it. And 00:45:44.800 |
you have to do it this way. You can't just let anyone join because of Sybil attacks, right? 00:45:47.680 |
There's a real problem of, like, how do I ensure your data is correct? And the way that 00:45:51.320 |
I ensure your data is correct on the tiny net is if you ever send wrong data, you're 00:45:59.640 |
>> Your $15,000 hardware box is banned. So, you know, don't cheat. Obviously, if it messes 00:46:02.040 |
up, we'll forgive you. But I'm saying, like -- 00:46:04.240 |
>> Somebody's going to try to jailbreak your devices. 00:46:09.960 |
>> Well, there's just a private key on each device, right? Like, if you buy a tiny box 00:46:12.360 |
from the tiny corp, I give you a private key. It's in my back-end server, right? You want 00:46:15.320 |
to hack my server, that's illegal. Anything you want to do on the device, the device is yours. 00:46:19.560 |
>> Yeah, yeah. Have you looked into, like, federated training at all? 00:46:25.280 |
>> Yeah. So, I mean, okay, you're now -- there's, okay, there's orders of magnitude of federated 00:46:29.760 |
training. You mean, like, over the cloud and stuff? Over the internet? 00:46:32.960 |
>> Yeah, over the internet, but also distributed on a bunch of devices, right? 00:46:40.560 |
>> Because of your interconnect bandwidth, right? So, okay, at the high-end, you have 00:46:42.880 |
your interconnect bandwidth of NVLink, which is 600 gigabytes per second, right? 00:46:47.440 |
>> The tiny box has 60 gigabytes per second. And then your internet has 125 megabytes per 00:46:53.520 |
second, right? Not gigabits, 125 megabytes, right? So, okay, that's -- 00:46:59.640 |
>> That's how many orders of magnitude we're talking here? Like, from 60 gigabytes down to 125 megabytes? 00:47:05.280 |
>> Like, all right, that's over 100X. That's 400X, right? 00:47:08.960 |
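[Editor's note: the bandwidth gap in numbers, using the figures quoted here (600 GB/s NVLink, 60 GB/s tiny box, 125 MB/s internet). 60 GB/s over 125 MB/s is about 480x, which is the "400X" being gestured at.]

```python
nvlink   = 600e9    # NVLink, ~600 GB/s
tinybox  = 60e9     # tiny box interconnect, ~60 GB/s
internet = 125e6    # ~1 Gbit/s internet = 125 MB/s

print(nvlink / tinybox)     # 10.0   -> NVLink down to the tiny box
print(tinybox / internet)   # 480.0  -> tiny box down to the internet
print(nvlink / internet)    # 4800.0 -> NVLink down to the internet
```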
>> So, like, no. But what you can do is inference, right? Like, for inference, you don't care. 00:47:14.200 |
>> For inference, there's so little bandwidth at the top and the bottom of the model that, 00:47:17.880 |
like, yeah, you can do federated inference, right? And that's kind of what I'm talking 00:47:21.480 |
about. There's also interesting things to push into, like, you're like, but, okay, what 00:47:26.520 |
if you want to run closed-source models? This stuff gets kind of interesting, like, using 00:47:33.320 |
>> But then someone might jailbreak my device. So, you know, maybe we don't try to do that. 00:47:37.440 |
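[Editor's note: the point about there being very little bandwidth at the top and bottom of the model, made concrete. If you split a model across machines, only one layer's activations cross the link per token, not weights or gradients. The numbers below (a 4096-wide fp16 residual stream) are assumptions for illustration, not measurements of any particular model.]

```python
d_model, bytes_per_value = 4096, 2        # assumed hidden width, fp16 activations
per_token = d_model * bytes_per_value     # 8192 bytes crossing the cut per token

internet = 125e6                          # the 125 MB/s link from above
print(internet / per_token)               # ~15,000 tokens/s of link capacity
```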
>> Yeah. What's, like, the enterprise use case? Do you see companies buying a bunch of these? 00:47:43.160 |
>> So, the tiny box is, like, the first version of what we're building. But what I really 00:47:47.800 |
want to do is be on the absolute edge of flops per dollar and flops per watt. These are the 00:47:52.960 |
two numbers that matter. So, the enterprise use case is you want to train, like, Comma. 00:47:57.520 |
So, Comma just built out a new compute cluster. It's about a person and a half. So, you know, 00:48:06.160 |
>> A person is 20 petaflops. It's about 30 petaflops. We built out a little compute cluster. 00:48:12.080 |
And, you know, we paid double what you theoretically could per flop, right? You theoretically could 00:48:17.840 |
pay half per flop if you designed a bunch of custom stuff. And, yeah, I mean, I could 00:48:22.120 |
see that being, you know, tiny Corp. Comma is going to be the first customer. I'm going 00:48:26.040 |
to build a box for Comma. And then I'm going to show off the box I built for Comma and 00:48:29.960 |
be like, okay, like, do you want one? I sell $250,000 training computers. Or how 00:48:34.280 |
much is one H100 box? It's 400 grand? Okay. I'll build you a 400 grand training computer 00:48:39.360 |
and it'll be 10x better than that H100 box. Again, not for every use case. For some, you 00:48:45.520 |
need the interconnect bandwidth. But for 90% of most companies' model training use cases, 00:48:50.120 |
the tiny box will be 5x faster for the same price. 00:48:54.240 |
Awesome. You mentioned the person of compute. How do we build a human for $20 million? 00:48:59.560 |
Well, it's a lot cheaper now. It's a lot cheaper now. So, like I said, Comma spent about half 00:49:05.960 |
a million on our person and a half. What are some of the numbers people should think of 00:49:12.400 |
when they compare compute to like people? So, GPT-4 was 100 person-years of training. That's 00:49:18.600 |
more like on the timescale. 20 petaflops is one person. I think you, right now, the math 00:49:24.840 |
was that for the price of the most expensive thing we build, which is the International 00:49:28.840 |
Space Station, we could build one Tampa. Yeah, one Tampa of compute. 00:49:33.600 |
Yeah, which is 400,000 people. Yeah, we could build. So, like the biggest 00:49:39.880 |
training clusters today, I know less about how GPT-4 was trained. I know some rough numbers 00:49:43.960 |
on the weights and stuff, but Llama- A trillion parameters? 00:49:48.640 |
Well, okay. So, GPT-4 is 220 billion in each head, and then it's an eight-way mixture model. 00:49:53.280 |
So, mixture models are what you do when you're out of ideas. So, it's a mixture model. They 00:49:58.360 |
just train the same model eight times, and then they have some little trick. They actually 00:50:01.280 |
do 16 inferences, but no, it's not like- So, the multimodality is just a vision model 00:50:06.400 |
kind of glommed on? I mean, the multimodality is like obvious 00:50:10.440 |
what it is too. You just put the vision model in the same token space as your language model. 00:50:13.600 |
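[Editor's note: a toy sketch of the eight-way mixture-of-experts idea described above, several expert weight sets plus a router that decides which ones to run and how to mix their outputs. Purely illustrative; the sizes, the top-2 routing, and the learned gate here are generic MoE conventions, not GPT-4's actual, unconfirmed architecture.]

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

n_experts, d = 8, 16
experts = [np.random.randn(d, d) * 0.02 for _ in range(n_experts)]   # eight expert weight sets
router  = np.random.randn(d, n_experts) * 0.02                       # small gating network

def moe_forward(x, top_k=2):
    gate = softmax(x @ router)                  # how much the router trusts each expert
    picked = np.argsort(gate)[-top_k:]          # only run the top-k experts per token
    out = sum(gate[i] * (x @ experts[i]) for i in picked)
    return out / gate[picked].sum()             # renormalize over the experts actually run

y = moe_forward(np.random.randn(d))
```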
Oh, did people think it was something else? No, no, the mixture has nothing to do with 00:50:16.360 |
the vision or language aspect of it. It just has to do with, well, okay, we can't really 00:50:20.160 |
make models bigger than 220 billion parameters. We want it to be better. Well, how can we 00:50:25.280 |
make it better? Well, we can train it longer, and okay, we've actually already maxed that 00:50:30.280 |
out. We're getting diminishing returns there. Okay. 00:50:33.080 |
A mixture of experts. Yeah, a mixture of experts. We'll train eight 00:50:35.240 |
of them, right? So, all right. So, you know, the real truth is whenever a company is secretive, 00:50:41.680 |
with the exception of Apple, Apple's the only exception, whenever a company is secretive, 00:50:45.340 |
it's because they're hiding something that's not that cool. And people have this wrong 00:50:49.000 |
idea over and over again that they think they're hiding it because it's really cool. It must 00:50:52.240 |
be amazing. It's a trillion parameters. No, it's a little bigger than GPT-3, and they 00:50:55.960 |
did an eight-way mixture of experts. Like, all right, dude, anyone can spend eight times 00:50:59.160 |
the money and get that. All right. But yeah, so coming back to what I think is actually 00:51:07.560 |
going to happen is, yeah, people are going to train smaller models for longer and fine-tune 00:51:11.960 |
them and find all these tricks, right? Like, you know, I think OpenAI used to publish 00:51:17.480 |
stuff on this, you know, when they would publish stuff about how much better the training has 00:51:23.680 |
gotten given the same, holding compute constant. And it's gotten a lot better, right? Compare 00:51:32.400 |
And now we have like- Because you're finding algorithms like flash 00:51:34.960 |
attention. Yeah. Well, flash attention. Yeah. Flash attention 00:51:40.480 |
is the same compute. Flash attention is an interesting fact where it's actually the identical 00:51:43.160 |
compute. It's just a more efficient way to do the compute. But I'm even talking about 00:51:46.320 |
like, look at the new embeddings people are using, right? They used to use this like boring 00:51:53.040 |
old embeddings. Now, like Llama uses that complex one, and that was like ALiBi. I'm not up to 00:51:56.720 |
date on all the latest stuff, but those tricks give you so much. 00:52:00.640 |
There's been a whole round trip with positional embeddings. I don't know if you've seen this 00:52:06.520 |
Like you need them, you need rotational, and then you don't need them. 00:52:09.080 |
I haven't followed exactly. I mean, you quickly run into the obvious problem with positional 00:52:13.320 |
embeddings, which is you have to invalidate your KV cache if you run off the context. 00:52:17.480 |
So that's why I think these new ones, they're playing with them, but I'm not that. I'm not 00:52:22.800 |
an expert on like the latest up-to-date language model stuff. 00:52:26.120 |
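[Editor's note: the "complex" embedding Llama actually uses is rotary position embedding (RoPE); ALiBi is a different relative-position trick used by some other models. A minimal numpy sketch of RoPE: each pair of channels in the queries and keys is rotated by an angle proportional to the token's position, so the attention score depends on relative offsets, which plays more nicely with a KV cache than learned absolute embeddings. Shapes here are arbitrary.]

```python
import numpy as np

def rope(x, base=10000.0):
    # x: (seq_len, d) queries or keys, d even
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, None]               # absolute positions
    inv_freq = base ** (-np.arange(0, d, 2) / d)    # one frequency per channel pair
    ang = pos * inv_freq                            # (seq_len, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin              # rotate each (even, odd) channel pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = rope(np.random.randn(8, 64))
k = rope(np.random.randn(8, 64))
scores = q @ k.T   # attention logits now encode relative position
```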
Yeah. I mean, we have what we do at Comma, and I 00:52:33.940 |
What are some of the things, I mean, that people are getting wrong? So back to autonomous 00:52:38.140 |
driving, there was like the whole like LiDAR versus vision thing. It's like people don't 00:52:42.460 |
get into accidents because they cannot see well. They get into accidents because they 00:52:45.660 |
get distracted and all these things. What are, do you see similarities today on like 00:52:49.940 |
the path to AGI? Like are there people, like what are like the- 00:52:53.780 |
Nothing I say about this is ever going to compete with how Rich Sutton stated it. Rich 00:52:57.980 |
Sutton is the writer of the Reinforcement Learning textbook and of The Bitter Lesson. Nothing I say is ever going 00:53:01.800 |
to compete with, The Bitter Lesson is way better than any way I'm going to phrase this. 00:53:05.040 |
Just go read that. And then like, I'm sorry it's bitter, but you actually just have to 00:53:08.760 |
believe it. Like over and over again, people make this mistake. They're like, oh, we're 00:53:13.240 |
going to hand engineer this thing. We're going to hand, no, like stop wasting time. 00:53:17.160 |
Which is, I mean, OpenAI is not taking The Bitter Lesson. 00:53:23.640 |
They were leaders in deep learning for a long, long, long time. 00:53:27.680 |
But you're telling me that GPT-4 is not, yeah. 00:53:29.340 |
Well, OpenAI was the absolute leader to the thesis that compute is all you need. 00:53:33.980 |
Right? And there's a question of how long this thesis is going to continue for. It's 00:53:36.900 |
a cool thesis. And look, I think I would be lying along with everybody else. I was into 00:53:41.540 |
language models like way back in the day for the Hutter Prize. I got into AI through the 00:53:45.820 |
Hutter Prize. Like 2014, I'm trying to build compressive models of Wikipedia. And I'm like, 00:53:50.660 |
okay, why is this so hard? Like what this is, is a language model, right? And I'm playing 00:53:54.180 |
with these like Bayesian things. And I'm just like, oh, but like, I get it. Like, it needs 00:53:58.820 |
to be like, like, it's like, I have two data points and they're like almost the same, but 00:54:02.860 |
how do I measure that almost, right? I just like, you know, wrap my head around. I couldn't 00:54:07.660 |
like, like wrap my head around this. And this was around the time Karpathy released the 00:54:10.500 |
first like RNN that generated the Shakespeare stuff. And I'm like, okay, I get it. Right? 00:54:17.380 |
It's neural networks that are compressors. Now this isn't actually, you can't actually 00:54:19.980 |
win the Hutter Prize with these things because the Hutter Prize is MDL. It's the size 00:54:24.380 |
of the model plus the size of the encoding. So yeah, you can't, I mean, probably 00:54:30.460 |
now you can because it's gotten so good, but yeah, back in the day you kind of couldn't. 00:54:35.140 |
So I was like, okay, cool. Like this is what it is. I kind of get it. Yeah. I mean, I think 00:54:39.760 |
I didn't expect that it would continue to work this well. I thought there'd be real 00:54:44.460 |
limits to how good autocomplete could get. That's fancy autocomplete. But yeah, no, like 00:54:49.780 |
it works. It works well. So like, yeah. What is OpenAI getting wrong? Technically, not 00:54:57.060 |
that much. I don't know. Like if I was a researcher, why would I go work there? 00:55:05.820 |
No, look, I don't, I don't, this is, this is my technical stuff. I don't really want 00:55:10.180 |
to harp on this, but like why go work at OpenAI when you could go work at Facebook, right? 00:55:14.140 |
As a researcher. Like OpenAI can keep ideologues who, you know, believe ideological stuff and 00:55:19.660 |
Facebook can keep every researcher who's like, dude, I just want to build AI and publish 00:55:26.740 |
Awesome. Yeah. Any other thoughts, tiny corp, bounties? 00:55:31.780 |
Yeah. So we have, you know, I've been thinking a lot about like what it means to hire in 00:55:39.100 |
today's world. What actually is the like core? Okay. Look, I'm a believer that machines are 00:55:46.060 |
going to replace everything in about 20 years. So, okay. What is that, what is that thing 00:55:54.220 |
that people can still do that computers can't, right? And this is a narrowing list, but like, 00:56:00.740 |
you know, back in the day, like imagine I was starting a company in 1960, right? Oh, 00:56:04.460 |
we're going to have to hire a whole bunch of calculators in the basement to do all the, 00:56:08.180 |
you know, math to support the, dude, have you heard about computers? Why don't we just 00:56:12.500 |
buy a few of those? Oh, oh wow, man. You're right. So like, I feel like that's kind of 00:56:19.180 |
happening again. And I'm thinking about, I will post in my discord. I'll be like, okay, 00:56:22.980 |
who wants to, like, okay. I just changed my unary ops. They used to be log and exp, base e. 00:56:28.500 |
I changed them to be log2 and exp2 because hardware has log2 and exp2 accelerators. 00:56:33.940 |
Yeah. And of course you can use change of base. It's one multiply to get it back 00:56:37.260 |
to e, but like I made the primitives log2 and exp2. 00:56:42.220 |
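[Editor's note: the change-of-base identity being referenced, as a runnable check. With exp2 and log2 as the hardware primitives, base-e exp and log each cost exactly one extra multiply.]

```python
import math

LOG2_E = math.log2(math.e)     # ~1.4427
LN_2   = math.log(2.0)         # ~0.6931

def exp_via_exp2(x):
    return 2.0 ** (x * LOG2_E)     # exp(x) = exp2(x * log2(e))

def log_via_log2(x):
    return math.log2(x) * LN_2     # log(x) = log2(x) * ln(2)

assert abs(exp_via_exp2(3.0) - math.exp(3.0)) < 1e-9
assert abs(log_via_log2(3.0) - math.log(3.0)) < 1e-12
```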
I just posted in the discord. I'm like, could someone put this pull request up? Right. And 00:56:45.140 |
someone eventually did and I merged it, but I'm like, this is almost to the level where 00:56:48.940 |
models can do it. Right. We're almost to the point where I can say that to a model and 00:56:52.620 |
the model can do it. Have you tried? Yeah, I'm, I don't know. I'm like, I'm, I think 00:57:02.740 |
it went further. I think autocomplete went further than I thought it would, but I'm also 00:57:07.060 |
relatively unimpressed with the chatbots, with what I've seen from the language models 00:57:12.140 |
like there. The problem is if your loss function is categorical cross-entropy on the internet, 00:57:19.700 |
your responses will always be mid. Yes. Mode collapse is what I call it. I don't know. 00:57:24.220 |
Maybe I'm not even talking about mode collapse. You're actually trying to predict the like, 00:57:27.300 |
like, look, I rap, I'm a hobbyist rapper. And like, when I try to get these things to 00:57:31.500 |
write rap, the raps sound like the kind of raps you read in the YouTube comments. Nursery 00:57:34.820 |
school. Yeah. It's like, all right, great. You're right. Box with Fox. Sick rhyme, bro. 00:57:40.740 |
You know, you know, and Drake is rhyming. Give it up for me with napkins and cutlery. 00:57:45.940 |
Right. Like, like, all right, come on. We've got like this thing about orange, like orange 00:57:50.220 |
is famous. Yeah, yeah, yeah, yeah. But now, of course, you know, four inch screws and 00:57:54.020 |
orange juice is in, is in GPT's training corpus. But yeah, so I think it went further than 00:58:01.420 |
like everyone kind of thought it would. But the thing that I really want to see is like 00:58:04.380 |
somebody put 10 LLMs in a room and have them discuss the answer before they give it to 00:58:08.500 |
me. You can actually do this. Right. And I think the coding things have to be the same 00:58:12.540 |
way. There is no coder alive, no matter how good you are, that sits down. Well, I'm going 00:58:16.140 |
to start at cell A1 and type my program and then I'm going to press run and it's going 00:58:20.940 |
to work. No one programs like that. So why do we expect the models to write? So so there's 00:58:26.180 |
there's a lot that like still needs to be done. But, you know, at the tiny corp, I want 00:58:29.740 |
to be on the cutting edge of this, too. I want to be like program generation. I mean, 00:58:34.220 |
what is TinyGrad? It's a compiler. It generates programs, generates the fastest program that 00:58:37.260 |
meets the spec. Right. Why am I not just having ML do that? So, you know, it's kind of a you 00:58:42.940 |
have to exist fluidly with the machines. And I come around on a lot of stuff. I'm like, 00:58:48.860 |
wait, TinyGrad, TinyCorp should be a remote company. I can't do this in person. Really? 00:58:53.500 |
Yeah. Like, oh, Comma makes sense to be in person. Like Comma, sure. Yeah, we'll get 00:58:57.580 |
an office in San Diego. Like, but that's a six year old company. Right. And it works and it works 00:59:01.640 |
for a certain type of people and certain type of culture. But what's going to be different 00:59:04.260 |
this time? OK, remote. But now it's remote. And now I'm getting these like people who 00:59:07.700 |
apply and I'm like, I literally have a thousand applications. I'm not calling you to do a 00:59:12.580 |
technical screen. I can't really tell anything from a technical screen. What am I going to 00:59:16.020 |
do? Make a code on a whiteboard? Like, bring up bring up a shared notebook document so 00:59:20.100 |
we could. Oh, like, that's not going to work. OK. So then I move to the next thing. We do 00:59:24.300 |
this at Comma with good success, programming challenges. I've also found them to be like 00:59:28.540 |
completely non-predictive. I found one thing to actually be predictive and it's wait a 00:59:34.300 |
second. Just write code in TinyGrad. It's open source. Right. And so, you know, I'm 00:59:39.340 |
talking to a few people who've been contributing and like contribute or, you know, the job's 00:59:44.020 |
not for you. But you can do it remote. And it's like it's a chill job. Like you're not 00:59:47.340 |
you're like, oh, yeah, well, I work for the tiny corp. Well, you're writing MIT licensed 00:59:51.060 |
software like you see what it's doing. Right. Like, well, just I think think of it maybe 00:59:54.540 |
more of like a stipend than a salary and then also some equity. Look, you know, I get rich. 00:59:58.420 |
You all get rich. Yeah. How do you think about agents and kind of like thinking of them as 01:00:06.580 |
people versus like job to be done? Sean built this thing called smol developer. And then 01:00:11.860 |
it's in the same vein, like the human in the loop with the language model and just iterating 01:00:17.220 |
while you write code. I think I think that's that's absolutely where it goes. And there's 01:00:20.560 |
like, it's not like one thing. It's like there's smol interpreter, there's like smol 01:00:24.340 |
debugger. It's kind of like all these different jobs to be done. It's a smol world. Yeah. 01:00:28.340 |
It's I know this is like the small pockets. It's like small. I mean, tiny corp. So we're 01:00:33.020 |
on the same wavelength. How do you think about that? Do you think people will have a human 01:00:37.500 |
like interaction with like, oh, this is like the AI developer or like is it I'm the human 01:00:41.980 |
being supercharged by the AI tools? Oh, I think it's much more like I'm the human supercharged 01:00:48.140 |
by the AI tools. I think that like coding is tool complete, right? Like driving is not 01:00:52.780 |
tool complete. Right. Like driving is just like like we hire people to drive who are 01:00:56.180 |
like below the API line. Right. There's an API line in the world. Right. Love that. Yeah. 01:01:00.060 |
There's an API line in the world. And like you can think like Uber is a really clear 01:01:02.740 |
example. Right. There's the people below the API line and the people above the API line. 01:01:06.220 |
And the way you can tell if you're below or above, by the way, is is your manager a computer? 01:01:10.540 |
Right. Who's the manager of the Uber driver or computer? Does the machine tell you what 01:01:13.060 |
to do? Or do you tell machines? Exactly. Exactly. So coding is tool complete. Right. Coding 01:01:19.820 |
is tool complete. Coding is above the API line. So it will always be tools supercharging 01:01:25.100 |
your coding workflow. And it will never be you performing some like task like, OK, well, 01:01:32.460 |
I can do everything except for actually starting a docker container. Like it just doesn't make 01:01:36.340 |
any sense. Right. Yeah. So we'll always be sort of tools. And, you know, look, we see 01:01:39.780 |
the same stuff with all the people are like stable diffusion is going to replace artists 01:01:44.420 |
or whatever. It's like, dude, like it's going to create new artists. What did Photoshop 01:01:47.780 |
replace artists? Like, what are you talking about? Right. Like, you know, a real artist's 01:01:53.300 |
finger paint. I can't use brushes. Brushes are, you know, brushes are going to replace 01:01:57.660 |
all the. OK. Like, I just can't like it's all just tools and the tools are going to 01:02:01.900 |
get better and better and better. And then eventually, yes, the tools are going to replace 01:02:04.900 |
us. But, you know, that's still 20 years away. So, you know, I've got a company in the meantime. 01:02:10.420 |
So I've written about the API line before, and I think that's from Venkatesh. I don't 01:02:13.820 |
know if you I definitely took it from someone. It's definitely not mine. VGR. But I also 01:02:18.060 |
have speculated a higher line than that, which is the Kanban board. Like who tells the programmers 01:02:23.180 |
what to do? Right. So are you above or below the Kanban board? Has that evolved your management 01:02:29.540 |
thinking? Yeah. Like that's sort of what I mean. Like it's like I'm just going to describe 01:02:33.780 |
the pull request in two sentences and then like, yeah. So you are running the Kanban 01:02:37.740 |
board or the bounties? Yes. Yeah. The bounties are the Kanban board. Exactly. And that is 01:02:42.300 |
kind of the high level. And then like, yeah, we'll get AIs to fill in some and we'll get 01:02:46.380 |
people to fill in others. Yeah. And that's also what it means to be like full time at 01:02:50.540 |
a tiny corp. Right. Would you start and I wrote this up pretty concretely. I'm like, 01:02:54.260 |
OK, step one is you do bounties for the company. Step two is you propose bounties for the company. 01:02:58.460 |
You don't pay them, obviously. We pay them. But you propose them. And I'm like, yeah, 01:03:02.380 |
that's a good bounty that like helps with the main workflow of the company. And step 01:03:06.660 |
three is you get hired full time. You get equity. We all know maybe you're rich. What 01:03:11.620 |
else are you designing differently about the employee experience? I mean, I'm very much 01:03:16.780 |
a like, you know, some people really like to like, like keep a separation. Right. Some 01:03:20.900 |
people really like to keep a separation between like employees and management or customers 01:03:25.940 |
and employees like a comma. You know, the reason I do the DevKit thing, it's like, dude, 01:03:30.300 |
you buy a comma thing, you're an employee of the company, like you're just part of the 01:03:33.180 |
company. It's all the same thing. There's no like secrets. There's no dividing lines. 01:03:37.540 |
There's no like it's all a spectrum for like, you know, down here at the spectrum, like 01:03:41.220 |
you pay and then up here at the spectrum you get paid. You understand this is the same 01:03:44.220 |
spectrum of college, right? Like for undergrad, you pay and then you get up here to like, 01:03:48.700 |
you know, doing a Ph.D. program, you get paid. OK, well, cool. Welcome to the, you know. 01:03:55.660 |
What about comma bodies? You know, you mentioned a lot of this stuff is clearly virtual, but 01:04:00.660 |
then there's below the API line you actually need. 01:04:03.540 |
This is a thing that's been announced. Comma bodies? 01:04:06.620 |
We sell them. You can buy them. They're a thousand bucks on our website. 01:04:09.820 |
OK, no, no, no. I'm thinking about like what Tesla announced with like the humanoid robot. 01:04:14.180 |
It's the same thing, except of course we made the Comma version of it. Tesla uses 20 actuators. 01:04:14.180 |
We use two, right? Like how do you how do you build the simplest possible thing that 01:04:23.100 |
can like turn the robotics problem into entirely a software problem? So right now it is literally 01:04:28.000 |
just a comma three on a pole with two wheels. It balances, keeps the comma three up there. 01:04:35.620 |
And like there's so much you could do with that already. Like this should replace you. 01:04:39.940 |
How many security guards could this replace? Right. If this thing could just competently 01:04:43.660 |
wander around a space and take pictures and, you know, focus in on things, send you a text 01:04:49.940 |
message when someone's trying to break into your building, you know, like like this could 01:04:53.100 |
already do so much, of course. But the software is not there yet. Right. So how do we turn 01:04:57.940 |
robotics into a thing where it's very clearly a software problem? You know, the people don't 01:05:01.360 |
accept that self-driving cars are a software problem. Like, I don't I don't know what to 01:05:04.980 |
tell you, man. Like literally just watch the video yourself and then drive with a joystick. 01:05:09.900 |
Right. Yeah. Can you drive? And we've actually done this test. We've actually done this test 01:05:13.500 |
where we've had someone. OK, you just watch this video and here's a joystick and you got 01:05:16.840 |
to drive the car. And of course, they can drive the car. Yeah. It takes a little bit 01:05:19.660 |
of practice to get used to that joystick. But the problem is all in the model. Right. So 01:05:24.820 |
I can now make the model better. Yeah. Specifically, anything in computer vision that you think 01:05:30.860 |
our second most popular episode ever was about segment anything coming out of Facebook, which 01:05:35.300 |
is as far as I understand, the state of the art in computer vision. What are you hoping 01:05:39.420 |
for there that you need for Comma? I think a Segment Anything, like the large, large 01:05:45.060 |
YOLOs or not. I've used like large YOLOs and I'm super impressed by them. Yeah. I think 01:05:49.740 |
it's solved. I got to check out segment anything. I don't think it's a distinct problem. Right. 01:05:53.860 |
OK, here's something that I'm interested in. All right. We have great LLMs. We have great 01:05:57.780 |
text to speech models and we have great speech to text models. OK, so why can I not why can 01:06:01.740 |
I not talk to an LLM like I'd have a normal conversation with it? You can with the latency 01:06:05.580 |
of like two seconds every time. Right. Why? Why isn't this? And then it feels so unnatural. 01:06:11.540 |
It's just like staccato. Like, I don't like the RLHF models. I don't like the tuned versions 01:06:16.220 |
of them. I think that they become you take on the personality of a customer support agent. 01:06:21.540 |
Oh, come on. You know, I like, I like Llama more than ChatGPT. ChatGPT's personality 01:06:27.900 |
just grated on me. Whereas Llama was like cool. I write a little bit of pretext paragraph. 01:06:32.660 |
I can put you in any scenario I want. Right. Like that's interesting to me. I don't want 01:06:36.620 |
some like, you know. Yeah. So, yeah, I think there is really no like distinction between 01:06:44.980 |
computer vision and language and any of this stuff. It's all eventually going to be fused 01:06:50.700 |
into one massive. So to say computer vision is solved. Well, it doesn't make any sense 01:06:54.540 |
because what's the output of a computer vision model? Segmentation? Like what a weird task. 01:06:58.740 |
Right. Who cares? OCR. Who cares? I don't care if you can segment which pixels make 01:07:03.420 |
up that laptop. I care if you can pick it up. Interact with the real world. And you're 01:07:10.060 |
going to have the local cluster. You're going to have the body. Yeah. Yeah. I think I think 01:07:15.040 |
that's kind of where that goes. So maybe we can paint the future of like the year is 2050. 01:07:23.020 |
You've achieved all you wanted at Tiny Corp. What is what is the AI enabled future like? 01:07:28.700 |
Well, Tiny Corp is the second company. Comma was the first. Comma builds the hardware infrastructure. 01:07:33.860 |
Tiny Corp builds a software infrastructure. The third company is the first one that's 01:07:36.940 |
going to build a real product. And that product is AI Girlfriend. No, like I'm dead serious. 01:07:42.940 |
Right. Like this is the dream product. Right. This is the absolute dream product. Girlfriend 01:07:47.820 |
is just the like stand in. Well, no, it's not a stand in. No, no, no, no. I actually 01:07:52.060 |
mean it. Right. So I've been wanting to merge with a machine ever since I was like mad little 01:07:56.500 |
like, you know, how do I merge with the machine? Right. And like, you can look at like in like 01:08:00.260 |
a maybe the Elon style we're thinking about is Neuralink. Right. Like, I don't think we 01:08:04.780 |
need any of this. Right. Some of your friends, maybe they get into relationships and you 01:08:09.820 |
start thinking of, you know, them and their partner is the same person. You start thinking 01:08:13.460 |
of them as like one person. I mean, they are kind of like merged, right? Like humans can 01:08:18.460 |
just kind of do this. It's so cool. It's this ability that we already have. It's only to 01:08:23.140 |
put, you know, electrodes in my brain to merge with a machine. I need an AI Girlfriend. Right. 01:08:29.460 |
So that's what I mean. Like this is this is the third product. This is the third company. 01:08:34.500 |
And yeah, in 2050, I mean, like, it's so hard. I like maybe I can imagine like 2035. I don't 01:08:41.860 |
even know 2050. But like, yeah, 2035. Like, yeah, that'd be really great. Like I have 01:08:45.980 |
this like kind of, you know. So in terms of merging, like, isn't it, shouldn't you work 01:08:51.340 |
on brain upload rather than AI Girlfriend? But I don't need brain upload. Right. I don't 01:08:55.580 |
need brain upload either. Like, there's there's thousands of hours of me on YouTube. Right. 01:08:59.740 |
Yes. If you might, how much of my brain's already uploaded? That's only the stuff that 01:09:03.420 |
you voice. Yeah, it's not that different. It's not that different. Right. You really 01:09:07.780 |
think a powerful, you really think a model with, you know, an exaflop of compute couldn't 01:09:12.380 |
extract everything that's really going on in my brain. I'm a pretty open person. Right. 01:09:16.340 |
Like, I'm not running a complex filter. Humans can't run that complex of a filter. Yeah. 01:09:19.740 |
Like humans just can't. Like, this is actually a cool quirk of biology. It's like, well, 01:09:24.460 |
humans can't lie that well. Yeah. Yeah. So is it good or bad to put all of your stream 01:09:30.480 |
of consciousness out there? I mean, I think it's good. I mean, I don't know. I'm streaming 01:09:37.140 |
every day. I want to live forever. We said off mic that we may be the first immortals. 01:09:43.100 |
Right. Yeah. Yeah. Like, this is how you this is how you live forever. It's a question of, 01:09:47.900 |
OK, how many weights do I have? Right. OK. Let's say I have a trillion weights. It's 01:09:51.900 |
talking about a terabyte, 100 terabytes here. But it's not really 100 terabytes. Right. 01:09:55.840 |
Because it's a complexity. How much redundancy is there in those weights? So, like, maximally 01:09:59.460 |
compressed, how big is the weight file for my brain? Quantize it whatever you want. Quantization 01:10:05.620 |
is a poor man's compression. I think we're only talking really here about like maybe 01:10:12.100 |
a couple of gigabytes. Right. And then if you have like a couple of gigabytes of true 01:10:15.860 |
information of yourself up there. Cool, man. Like, what does it mean for me to live forever? 01:10:21.540 |
Like, that's me. Yeah, no, I think that's good. And I think like the there's a bit of 01:10:27.100 |
like a professionalization of social media or like a lot of people only have what's like 01:10:32.660 |
PC out there, you know, and I feel like, you're going to get, come back to the ChatGPT thing. 01:08:36.540 |
Right. You're going to train a model and like everything that's public about a lot of people 01:10:40.620 |
and it's like no one's going to run their model and they're going to die. I see on social 01:10:46.420 |
media your life could depend on it. We have a segment. So we're moving on to a what would 01:10:55.420 |
normally be called the lightning round. But just just general takes because you're a generally 01:10:58.820 |
interesting person with many other interests. What does the Goddess of Everything Else mean 01:11:03.820 |
to you? Oh, it means that AI is not really going to kill us. Really? Of course. Tell us more. 01:11:14.460 |
Look, Lex asked me this, like, is AI going to kill us all? And I was quick to say yes, 01:11:20.580 |
but I don't actually really believe it. I think there's a decent chance. I think there's 01:11:23.980 |
a decent chance that AI kills 95 percent of us. OK. But they saw on your Twitch streams 01:11:29.980 |
that you're with them, so they're not going to. No, I don't think I actually I don't also 01:11:34.540 |
think it's AI. Like I think the AI alignment problem is so misstated. I think it's actually 01:11:38.220 |
not a question of whether the computer is aligned with the company who owns the computer. 01:11:42.100 |
It's a question of whether that company is aligned with you or that government's aligned 01:11:44.820 |
with you. And the answer is no. And that's how you end up dead. But so what the goddess 01:11:49.580 |
of everything else means to me is like the complexity will continue. Paper clippers don't 01:11:54.980 |
exist. You know, there are forces. The paper clipper is cancer. The paper clipper is really 01:11:59.340 |
just a perfect form of cancer. And the goddess of everything else says, yeah, but cancer 01:12:04.460 |
doesn't win. You know? Yeah. It's a beautiful story for those who haven't heard it. And 01:12:09.220 |
you read it out and I listened to it. Yeah. Good. What else we have here? Pick a question. 01:12:14.940 |
So many. Yeah. What are you grateful for today? Oh, man. I mean, it's all just like, I've been 01:12:23.100 |
thinking about this stuff forever, and it's actually like happening and 01:12:28.700 |
it's happening in an accessible way, too. I guess that's what I'm really grateful for. 01:12:32.420 |
It's not like, like, AI is not some Manhattan Project style, you know, behind-closed-doors 01:12:37.860 |
thing. I'll fight really hard to keep it that way. You know, that's, that's, I'm grateful 01:12:45.300 |
for just just how much is released out there and how much I can just learn and stay up 01:12:50.140 |
to date. And I guess I'm grateful to the true fabric of reality that, you know, I didn't 01:12:54.980 |
need differential equations to understand it. Like I don't need you don't need you don't 01:12:58.540 |
need some like like like there's there's I've tried to do. There's a limit to my to my math 01:13:03.580 |
abilities. I can do most undergrad math, but I took some grad math classes. And OK, now 01:13:07.580 |
we're getting to the end of what I can do. And it's just the actual like end of what 01:13:11.460 |
I can do. Like I'm limited by my brain. But, you know, ML stuff, you need high school math. 01:13:17.820 |
Yeah, I could do nothing. You know what I mean? When I went to my major, seventh grade, 01:13:22.500 |
like it's all easy. You need more electrical engineering than you need high school math 01:13:25.660 |
early. Yeah, well, you need electrical engineering to like build the machines. But even that, 01:13:30.020 |
like these machines are simpler than the machines that have existed before. The compute stack 01:13:34.660 |
looks really nice. So, you know, yeah, I just I'm grateful that it's all happening and I 01:13:38.460 |
get to understand it, be here. Yeah. Yeah. John Carmack mentioned there's about six insights 01:13:44.620 |
we have left. Do you have an intuition for what some of the paths people should be taking? 01:13:48.860 |
Obviously, you're working on one. What are some of the other branches of the tree that 01:13:53.460 |
people should go under? I don't think I'm working on one of the six insights. I don't 01:13:56.860 |
think TinyGrad's any one of the six insights. Something I really like that Elon does, and 01:14:01.420 |
I try to take it from, try to be inspired by it, is look at the boring tunnel machine 01:14:10.140 |
and ask how you can build a 10x cheaper one. All right. Look at the rocket. How can I build 01:14:13.580 |
a 10x cheaper one? Look at the electric car and say, how can I build a 10x cheaper, like 01:14:17.700 |
cheaper or, you know, can go further or whatever, whatever, whatever. Right. You just do the 01:14:21.380 |
straight up physics math. Right. Like I'm trying to do the same thing with with ML frameworks. 01:14:25.540 |
Right. And in doing so, making sure that this stuff remains accessible. Right. You could 01:14:31.560 |
imagine a world where if Google TPUs were actually the ultimate, if Google TPUs were 01:14:37.180 |
actually the best training things. I mean, actually, you know, I'm kind of grateful for 01:14:39.420 |
NVIDIA. Right. Like, because if Google TPUs were the ultimate, now you have this huge 01:14:42.660 |
closed source compiler in between XLA and the hardware. And yeah, that's just a really 01:14:49.220 |
bad thing. So, I mean, something that is somewhat upsetting about the TinyGrad is that it is 01:14:53.300 |
trying to prevent downside, but it's not all trying to prevent downside. Like we're also 01:14:57.640 |
building computers and we're going to build some awesome, powerful, cheap computers along 01:15:02.260 |
the way. So, no, I'm not really working directly on any of the six tricks. I also think the 01:15:06.260 |
six tricks are kind of going to be like luck. I think it's going to be like, you know, please 01:15:11.020 |
tell me more about what covariate shift is and how that inspired you to come up with 01:15:14.140 |
batch normalization. Please tell me more about why it's a transformer and it has a query, 01:15:18.380 |
a key and a value. Right. Like Schmidhuber described it better in fast weights. You know, 01:15:23.140 |
like, I mean, my theory about why transformers work has nothing to do with this attention 01:15:27.540 |
mechanism and just the fact that like it's semi-weight sharing. Right. Like because the 01:15:31.180 |
weight matrix is being generated on the fly, you can, you can like compress the weight 01:15:35.180 |
matrix. Right. Like this is what that, there's a, there's an operation in the, in the transformer, 01:15:39.740 |
which like, and by the way, this is like Qualcomm's SNPE can't run transformers for this reason. 01:15:45.900 |
So most matrix multipliers in neural networks are weights times values. Right. Whereas, 01:15:51.480 |
you know, when you get to the, the, the, the outer product in, in transformers, well it's not 01:15:55.380 |
weights times values. It's a, it's values times values. Right. So SNPE like doesn't 01:15:59.540 |
even support that operation. Right. So it's like that operation that gives the transformer 01:16:03.940 |
its power. It has nothing to do with the fact that it's attention. Right. And this is just 01:16:07.620 |
as a funny, like, but that is one of the six tricks, right. Batch, like these norms are 01:16:11.780 |
a trick. Transformers are a trick. Okay. Six more. Is there a reason why, so you could 01:16:18.620 |
talk, you talk about attention as weight compression. Compression is not exactly the right word. 01:16:26.020 |
What I mean is that the weights can change dynamically based on the context. So there 01:16:29.980 |
was this thing in PAQ8 in the Hutter Prize that I absolutely loved. And I've never 01:16:33.020 |
seen it again in neural networks and it's a really good trick. Okay. Imagine you have 01:16:36.100 |
256 weight sets for a layer. Right. And then you choose which of the weight sets you're 01:16:41.500 |
loading in based on some context. And that context can come from another neural net. 01:16:45.500 |
Right. So I have another neural net which projects, you know, 256 wide, one 01:16:49.780 |
hot, do a softmax, predict it. And then I actually load the weights in. And I can do 01:16:53.300 |
this operation at both test time and train time. Right. I can do this operation at both 01:16:56.420 |
training and inference. And I load in the weights given the context. Right. Like that 01:17:01.580 |
is what transformers do. But transformers, instead of having 256 discrete ones, it's 01:17:05.940 |
actually just that, but continuous. Yeah. Which is funny that that was in language models. 01:17:09.980 |
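[Editor's note: a small numpy sketch of the trick being described: 256 candidate weight sets for one layer, with a separate context network choosing which one to load via a softmax. With hard selection it's the discrete PAQ8-style version; with a soft blend you get "weights generated on the fly," the continuous idea that attention's softmax(QK^T)V also expresses. All sizes here are made up for illustration.]

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

n_sets, d_in, d_out, d_ctx = 256, 64, 64, 32
weight_sets = np.random.randn(n_sets, d_in, d_out) * 0.02   # 256 weight sets for one layer
ctx_net     = np.random.randn(d_ctx, n_sets) * 0.02         # small network that reads the context

def dynamic_layer(x, ctx, hard=True):
    gate = softmax(ctx @ ctx_net)                 # 256-wide distribution over weight sets
    if hard:
        W = weight_sets[gate.argmax()]            # discrete: load exactly one weight set
    else:
        W = np.tensordot(gate, weight_sets, 1)    # continuous: blend the weight sets
    return x @ W                                  # works the same at train and test time

y = dynamic_layer(np.random.randn(d_in), np.random.randn(d_ctx))
```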
And I just like, when I understood that about transformers, I'm like, oh, this is a real 01:17:13.140 |
trick. And why are they using the word attention? Yeah. And today is actually the anniversary 01:17:17.860 |
of attention is all you need. What? Today, six years ago. Six years. Six years. Changed 01:17:23.860 |
the world. Wow. Well, there's one of your envelope tricks. Right. And you can easily 01:17:27.260 |
write it on an envelope. You know, think about how you write out that. How many times have 01:17:30.220 |
you written that? Because it's not in any libraries because it's like all used a little 01:17:33.200 |
differently each time. Yeah. If you just write out that exact same, you know. Yeah. Yeah. 01:17:39.860 |
You've name checked Elon a few times. Yeah. I think about both of you as systems thinkers. 01:17:44.900 |
Input, output, thinking something in between. Sure. What's different about your style versus 01:17:50.740 |
his? Elon's fundamental science for the world is physics, mine is information theory. Huh. 01:17:57.540 |
But you do a lot of physics as well. I mean, like you base it on. And Elon does a lot of 01:18:00.820 |
information theory as well, too. But the question is fundamentally that the difference maybe 01:18:05.700 |
is expressed in what your ambitions are. Right. Elon's ambitions may be like, go to Mars. 01:18:12.340 |
Go to Mars. Right. Go to Mars is the ultimate modern modernist physics ambition. Right. 01:18:17.580 |
It's a physics, but I'm getting to Mars. Right. Well, what are electric cars? It's a physics 01:18:20.740 |
problem. Right. OK. Now he's like pushing on the autonomy stuff and you push a little 01:18:25.180 |
on information theory. But fundamentally, his dreams are physics based dreams. My dreams 01:18:29.980 |
are information based dreams. I want to live forever in virtual reality with my AI girlfriend. 01:18:33.900 |
Right. Those are those are the aspirations of someone who who who accepts information 01:18:37.860 |
theory as a core science. So I think that's the main difference between me and him. He 01:18:40.660 |
has physics based aspirations and I have information based aspirations. Very, very neat. Mark Andreessen. 01:18:47.260 |
He is a, hi Mark, he's a listener. He is heavily, he's a big proponent of effective accelerationism. 01:18:54.240 |
You've been a bit more critical. Why do you say that EAC is not taken seriously by its 01:18:58.380 |
adherents? Oh, well, only the left takes ideology seriously. Why is that? Just as a fact. It's 01:19:08.700 |
just like it's just like a fact. Is the right more cynical? Is that what it is? I don't 01:19:12.260 |
know. It's like it's like the left actually manages to get energy around the ideologies. 01:19:16.900 |
Right. Like like like there's a lot more. Look, here you have you have two effective 01:19:21.740 |
altruists named Sam going in front of Congress. Only one of them is in jail. You know, it's 01:19:26.500 |
interesting. They're both calling for regulation in their respective spaces. Right. So SBF 01:19:30.300 |
is definitely like kind of a wolf in sheep's clothing, kind of. Right. He only adopted 01:19:34.340 |
EAC or EA. Oh, and Sam Altman is a genuinely good guy who is not interested in power seeking 01:19:40.860 |
for himself. All right. We don't we don't have to. Fair enough. Fair enough. But no, 01:19:46.780 |
EAC is not like like you are not serious. Right. You are not actually a serious ideology. 01:19:53.460 |
You know, Mark Andreessen. I like Mark Andreessen. But I think that like some of his Twitter 01:19:58.140 |
things are like, dude, you like just like it's like it's like someone who's like twenty 01:20:01.620 |
nineteen who's like eyes were opened about like the political world being not exact. 01:20:07.540 |
You mean all the people on the news were lying to me? Well, they were lying to you like, 01:20:12.020 |
OK, we all figured this out five years ago. Now, what are you going to do about it? I'm 01:20:15.260 |
going to complain about it on Twitter. Right. And that's what EAC is. 01:20:21.140 |
Last and maybe most important, why was Avatar 2 bad? 01:20:24.700 |
Oh, I have a whole you can go on my blog. I rewrote the script of Avatar 2. I wrote 01:20:31.100 |
a script that actually might make you feel something for the characters. I killed Jake 01:20:35.020 |
Sully in the first scene like you had to. Do you really think his second story arc topped 01:20:39.860 |
his first one? No, of course not. You had to kill the guy and make the movie about the 01:20:43.180 |
brothers. Right. And just that alone and realizing that like you could have kept the Titanic 01:20:47.800 |
scene, it would have been fine. Even take it out. I left your Titanic scene, James Cameron. 01:20:51.460 |
But I wrote you a story that so, you know, just just just he needs ships to sink in water. 01:20:56.940 |
He needs. Well, look, it's a great scene. But like the movie was just like the Roman 01:21:01.980 |
never great CGI, you know, let down by the writing. Maybe. Yeah. Yeah. No, but like the 01:21:06.460 |
CGI like it was it's a beautiful world. And that's why, like, I care so much. Right. Like 01:21:10.740 |
you don't hear me ranting about Pirates of the Caribbean 2 being a terrible story because 01:21:13.980 |
come on, what do you expect, man? Like Johnny Depp's like, wow, I had a movie that made 01:21:17.940 |
me rich. I love this. But this goes back to like the midpoint. You know, I think you wrote 01:21:23.140 |
like it feels like ChatGPT wrote the movie. And that's my worry a little bit. It's like 01:21:27.820 |
kind of converging towards that. Oh, I look Malik wrote the movie. Sorry, I didn't want 01:21:34.380 |
to interrupt. I closed. I closed a pull request two days ago. I was like, was this written 01:21:38.780 |
by ChatGPT? And I just closed it. Like, you know what? I honestly feel bad if you 01:21:42.620 |
were a human who wrote this. Like you're incapable of being more perplexed. But now I have a 01:21:48.980 |
classifier running in my head that asks, you know, is this AI or is this a human? Like, 01:21:54.740 |
you know, the only way to deal with all this, like, like, like, it's like the worst possible. 01:22:00.460 |
Like, you know, people are like, like, like, how are you mad about like these chatbots? 01:22:05.020 |
You're not mad about like Tesla? Well, because if I don't want to buy a Tesla, I want to 01:22:09.300 |
buy a Tesla and it won't really impact my life negatively. But if I don't want to use 01:22:12.580 |
a chatbot, it's still going to impact my life negatively. All the amount of like personalized 01:22:16.540 |
spam that now makes me spend more cycles on my classifier to tell if it's spam or not, 01:22:21.540 |
because you can now use AIs and generate this. Like, no, I mean, we have to move to a model 01:22:26.260 |
where everything's just a dollar, right? Like you want to send me an email, it's a dollar. 01:22:28.940 |
Like you guys wouldn't care. None of my friends would care. No one would care except the spammers. 01:22:32.660 |
Right? Like we just got to move to those sort of models. 01:22:36.620 |
Awesome. One last message you want everyone to remember. 01:22:41.380 |
Look, go, go try TinyGrad. I hope that we're a serious competitor to what's out there. 01:22:51.900 |
And then I want to, you know, I want to take it all the way. We'll start with just building 01:22:55.140 |
something for GPUs and then we'll start building chips and we'll start building fabs and we'll 01:22:59.340 |
start building silicon mines and we'll have the first self-reproducing robot using. Yeah, 01:23:04.460 |
all right, George. Thank you so much for coming on. 01:23:07.500 |
You're a big inspiration. Thank you. Thanks. All right. How was that? We, uh, not, not 01:23:15.780 |
quite like Fridman, but we hope to do something different.