Ep 18: Petaflops to the People — with George Hotz of tinycorp
Chapters
0:00 Introducing George
2:59 Tinycorp's 3 Theses
11:12 Tinygrad's creation
15:58 Operation fusing in Tinygrad
19:11 Tinygrad debugging
21:14 Tiny Competitiveness on QCOMM vs NVDA
23:21 geohot vs AMD
28:21 Tinygrad vs ggml
30:01 Importance of Good CI
30:37 Mojo and Compatibility
32:43 ggml quantization is made up
35:18 tinygrad: benchmark int8 vs fp16
37:39 Why you can't build tinybox
40:28 The personal compute cluster
43:08 Compute Optimal to Inference Optimal
45:06 Announcing FLOPcoin
46:23 Why Federated AI won't work
47:38 5x faster than Nvidia
48:53 A Person of Compute
49:49 GPT-4's real architecture
51:07 BatchNorm, FlashAttention
52:34 The Bitter Lesson
55:31 Hiring in the Age of AI
60:02 Why AI doesn't replace developers & artists
63:02 Comma Body
67:34 AI Girlfriend
71:00 The Goddess of Everything Else
73:43 John Carmack Insights
77:41 on Elon
78:47 on e/acc
80:24 Avatar 2
00:00:00.000 |
>> Hey, everyone. Welcome to the Latent Space podcast. This is swyx, writer and editor of 00:00:06.440 |
Latent Space, and Alessio is taking over with the intros. Alessio is partner and CTO in Residence 00:00:11.640 |
at Decibel Partners. >> Hey, everyone. Today we have Geohot on 00:00:15.040 |
the podcast, aka George Hotz, his human name. Everybody knows George, so I'm not going 00:00:20.960 |
to do a big intro. A couple things that people might have missed. So you were the first to 00:00:24.600 |
unlock the iPhone. You traded the first ever unlocked iPhone for a Nissan 350Z and three 00:00:30.120 |
new iPhones. You were then one of the first people to break into the PS3 to run arbitrary 00:00:35.480 |
code. You got sued by Sony. You wrote a rap song to fight against that, which is still 00:00:40.400 |
live on YouTube, which we're going to have on the show notes. Then you did not go to 00:00:44.920 |
Tesla to build Vision, and instead you started Comma.ai, which was an amazing engineering feat 00:00:50.720 |
in itself, until you got a cease and desist from the government to not put these things 00:00:55.280 |
on the street. Turned that into a research-only project. 00:00:58.560 |
>> You know they're out there. >> Yeah, yeah. No, no, no. They're out there. 00:01:01.800 |
But like, they're not a, you know, you market them as a research kind of like no warranty. 00:01:06.520 |
>> Because I use the word DevKit. That's not about the government. That has nothing to 00:01:10.000 |
do with the government. We offer a great one-year warranty. The truth about that is it's gatekeeping. 00:01:17.640 |
What's the difference between a DevKit and not a DevKit? Nothing. Just the question of 00:01:22.120 |
do you think it's for you? And if you think it's for you, buy it. It's a consumer product. 00:01:26.480 |
We call it a DevKit. If you have a problem with that, it's not for you. 00:01:31.000 |
>> That's great insight. And then I was going through your blog post to get to the day. 00:01:35.480 |
You wrote this post about the hero's journey, and you linked this thing called the portal 00:01:40.080 |
story, which is kind of the set of stories in movies and books about people living this 00:01:45.240 |
arbitrary life, and then they run into these magic portals, kind of takes them into a new, 00:01:49.680 |
very exciting life and dimension. When you wrote that post, you talked about TinyGrad, 00:01:54.520 |
which is one of the projects you're working on today. And you mentioned this is more of 00:01:58.120 |
a hobby, something that is not going to change the course of history. Obviously, you're now 00:02:01.440 |
going full speed into it. So we would love to learn more about what was the portal that 00:02:07.680 |
>> Well, what you realize is, you know what made me realize that I absolutely had to do 00:02:13.520 |
the company? Seeing Sam Altman go in front of Congress. Why? What are the odds they nationalize 00:02:20.200 |
NVIDIA? You know, what are the odds that large organizations and the government, but of course 00:02:26.400 |
I repeat myself, decide to try to clamp down on accessibility of ML compute? I want to 00:02:34.720 |
make sure that can't happen structurally. So that's why I realized that it's really 00:02:39.560 |
important that I do this. And actually, from a more practical perspective, I'm working 00:02:43.240 |
with NVIDIA and Qualcomm to buy chips. NVIDIA has the best training chips. Qualcomm has 00:02:47.120 |
the best inference chips. Working with these companies is really difficult. So I'd like 00:02:51.000 |
to start another organization that eventually in the limit, either works with people to 00:02:56.000 |
make chips or makes chips itself and makes them available to anybody. 00:03:01.860 |
>> You shared kind of three core pieces to TinyGrad. Maybe we can dive into each of them. 00:03:06.080 |
So XLA, PrimTorch, those are the complex instruction set. TinyGrad is the reduced 00:03:12.640 |
instruction set. So you're kind of focused on, again, TinyGrad being small, not being 00:03:17.040 |
overcomplicated and trying to get as close to like the DSP as possible in a way, where 00:03:22.640 |
>> Well, it's a very clear analogy from how processors developed. So a lot of processors 00:03:26.920 |
back in the day were CISC, complex instruction set. System 360 and then x86. Then this isn't 00:03:34.520 |
how things stayed. They went to now the most common processor is ARM. And people are excited 00:03:40.640 |
about RISC-V. RISC-V is even less complex than ARM. No one is excited about CISC processors 00:03:46.920 |
anymore. They're excited about reduced instruction set processors. So TinyGrad is, we're going 00:03:52.680 |
to make a RISC instruction set for all ML models. And yeah, it can run all ML models with basically 00:03:59.800 |
25 instead of the 250 of XLA or PrimTorch. So about 10x less complex. 00:04:06.040 |
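To give a rough flavor of what a reduced op set means in practice, here is an illustrative sketch. The grouping loosely mirrors how tinygrad organizes its primitives, but the names and counts below are approximate stand-ins, not the project's actual list.

```python
# Illustrative only: the rough shape of a "RISC-style" ML op set, grouped into a few small
# categories. Roughly two dozen primitives like these, composed together, are enough to
# express matmuls, convolutions, attention, normalization, and so on.
from enum import Enum, auto

class UnaryOps(Enum):    EXP2 = auto(); LOG2 = auto(); SIN = auto(); SQRT = auto(); NEG = auto()
class BinaryOps(Enum):   ADD = auto(); SUB = auto(); MUL = auto(); DIV = auto(); MAX = auto(); CMPLT = auto()
class TernaryOps(Enum):  WHERE = auto(); MULACC = auto()
class ReduceOps(Enum):   SUM = auto(); MAX = auto()
class MovementOps(Enum): RESHAPE = auto(); PERMUTE = auto(); EXPAND = auto(); PAD = auto(); SHRINK = auto()

# Contrast with the hundreds of ops a CISC-style IR like XLA or PrimTorch carries.
```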
>> You talked a lot about existing AI chips. You said if you can't write a fast ML framework 00:04:10.600 |
for GPUs, you can't write one for your own chip. So that's another one of your core 00:04:14.760 |
insights. I don't know if you want to expand on that. 00:04:17.280 |
>> Yeah, I mean, your chip is worse, right? There's no way the chip that you're going 00:04:20.600 |
to tape out, especially on the first try, is going to be easier to use than an AMD GPU. 00:04:25.720 |
And yet there's no good stack for AMD GPUs. So why do you think you can make one for your 00:04:30.600 |
chip? You can't, right? The only company, there's one other company, aside from NVIDIA, 00:04:35.560 |
who's succeeded at all at making training chips. What company? 00:04:43.000 |
>> No, no, no. I've never trained, who's trained a model on AMD or Intel? 00:04:49.760 |
>> Cerebras, I'm talking about, you might know some startups who trained models on these 00:04:53.800 |
chips. I'm surprised no one immediately gets this, because there is one other chip, aside 00:04:59.280 |
from NVIDIA, that normal people have actually used for training. 00:05:03.560 |
>> No, used for training. You can only buy them in the cloud. 00:05:09.440 |
>> Exactly, right? So, Midjourney is trained on TPU, right? A lot of startups do actually 00:05:15.680 |
train on TPUs. And they're the only other successful training chip, aside from NVIDIA. 00:05:21.180 |
But what's unique about Google is that they also wrote their own ML framework, right? 00:05:26.160 |
And if you can't write your own ML framework that is performant on NVIDIA, there's no way 00:05:32.680 |
>> And they started from TensorFlow, and then they made the chip after. 00:05:36.000 |
>> Yeah, exactly, exactly. And you have to do it in that direction. Otherwise, you're 00:05:40.040 |
going to end up-- Cerebras, one of those things, a million-- I've never seen a Cerebras. No 00:05:46.560 |
one's ever like, "Oh, I trained my model on a Cerebras." Most people are like, "I trained 00:05:50.520 |
my model on GPUs." Some people, 20%, are like, "I trained my model on TPUs." 00:05:57.040 |
>> And then the third one, which is the one that surprised me the most, is Turing completeness 00:06:01.320 |
is harmful, should be avoided. It made sense once I read it, but maybe tell us a bit more 00:06:09.560 |
>> Okay. So, CPUs devote tons of their silicon and power to things like reorder buffers and 00:06:18.160 |
speculative execution and branch predictors. And the reason that you need all these things 00:06:22.960 |
is because at compile time, you can't understand how the code's going to run. This is Rice's 00:06:28.240 |
theorem. This is the halting problem and its limit. And this is not like, "Oh, the halting 00:06:32.040 |
problem is theoretical." No, no, no, no. It's actually very real. Does this branch get taken 00:06:36.240 |
or not? Well, it depends on X. Where does X come from? Yeah, forget it, right? But no 00:06:41.520 |
branches depend on X in a neural net. Every branch is a static loop. Like if you're doing 00:06:46.360 |
a matrix multiply, it's a static loop over the inner dimension. And neural networks are 00:06:50.720 |
even better. No loads even depend on X, right? So with a GPU shader, right, your load might 00:06:55.760 |
depend on which texture you're actually loading into RAM. But with a neural network, your 00:06:59.480 |
load is, "Well, I load that way." Why? "Well, because I load that way the other million 00:07:03.160 |
times I ran the same net." Every single time you run the net, you do the exact same set 00:07:07.160 |
of loads, stores, and arithmetic. The only thing that changes is the data. And this gives 00:07:12.800 |
you a very powerful ability to optimize that you can't do with CPU style things, which 00:07:19.080 |
have branches, and even GPU style things, which have loads and stores. 00:07:22.160 |
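As a concrete picture of "every branch is a static loop," here is a plain matrix multiply with hypothetical, fixed shapes: the loop bounds and every address touched are known before the data ever shows up.

```python
import numpy as np

M, K, N = 64, 128, 32            # shapes are fixed ahead of time, so the schedule is static
A = np.random.randn(M, K)
B = np.random.randn(K, N)
C = np.zeros((M, N))

for i in range(M):               # static loop: the trip count never depends on the data
    for j in range(N):
        acc = 0.0
        for k in range(K):       # inner reduction: identical loads and stores on every run
            acc += A[i, k] * B[k, j]
        C[i, j] = acc

# No branch or load address above depends on the values inside A or B. That is what lets an
# ML compiler schedule everything statically instead of paying for branch predictors,
# reorder buffers, or warp schedulers at runtime.
```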
Oh, that makes sense. Well, GPUs, if you want GPU style stuff, you have like load based 00:07:26.560 |
on X. You now need a cache hierarchy, and not an explicit cache hierarchy, an implicit 00:07:31.480 |
cache hierarchy. With eviction policies that are hard-coded into the CPU, you start doing 00:07:37.240 |
all this stuff and you're never going to get theoretically good performance. Again, I don't 00:07:42.480 |
think there's 100X. Some startups will talk about 100X, and they'll talk about absolutely 00:07:45.840 |
ridiculous things like clockless computing or analog computing. Okay. Here, analog computing 00:07:50.720 |
just won't work. And clockless computing, sure, it might work in theory, but your EDA 00:07:55.960 |
tools are... Maybe AIs will be able to design clockless chips, but not humans. But what 00:08:02.840 |
actually is practical is changing cache hierarchies, and removing branch predictors, and removing 00:08:07.360 |
warp schedulers. GPUs spend tons of power on warp scheduling, because we have to hide 00:08:10.960 |
the latency from the memory. We don't have to hide the latency if everything's statically scheduled. 00:08:15.040 |
Yeah. Why do you think people are still hanging on to Turing complete? 00:08:19.920 |
Well, because it's really easy. Turing complete is just really easy, right? It's really easy 00:08:24.720 |
to just, "Oh, you know, it would just be so nice if I could do like an if statement here, 00:08:29.820 |
and actually branch the code," right? So it requires a lot more thought to do it without 00:08:37.520 |
And would this be qualitatively different than TPUs? 00:08:40.120 |
So TPUs are a lot closer. TPUs are a lot closer to what I'm talking about than like CUDA. 00:08:46.560 |
Okay, so what is CUDA? Well, CUDA is a C-like language, which compiles to an LLVM-like IR, 00:08:52.240 |
which compiles to PTX, which compiles to SASS, which are all Turing complete. TPUs are much 00:08:57.540 |
more like this, yeah. Their memory is pretty statically managed. I did some reverse engineering 00:09:02.680 |
on the TPU. It's published in TinyGrad. It has, like, VLIW instructions, and it runs them. 00:09:09.520 |
So it's similar. I think the TPUs have a few problems. I think systolic arrays are the 00:09:13.400 |
wrong choice. Systolic array, I think they have systolic arrays, because that was the 00:09:19.280 |
Could you summarize systolic arrays right now? 00:09:20.860 |
Systolic arrays are just, okay, so basically you have like, it's a way to do matrix multiplication. 00:09:26.640 |
Think of a grid of MACs (multiply-accumulate units), and then the grid can multiply and then shift, multiply, then 00:09:31.080 |
shift, multiply, then shift. And they are very power efficient, but it becomes hard 00:09:35.400 |
to schedule a lot of stuff on them if you're not doing perfectly sized dense matrix multiplies. 00:09:42.360 |
Which you can argue, well, design your models to use perfectly sized dense matrix multiplies, 00:09:48.920 |
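For readers who want the picture in code, here is a toy, cycle-level model of an output-stationary systolic array (purely illustrative, not any real TPU design). Each cell does one multiply-accumulate per cycle while operands shift through, and the skewed schedule is exactly why oddly sized or sparse matmuls are awkward to map onto it.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy model: PE(i, j) holds one accumulator; A streams in from the left and B from the
    top, each skewed by one cycle per row/column, so A[i, k] meets B[k, j] at PE(i, j)
    on cycle i + j + k."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for t in range(M + N + K - 2):                 # total pipeline depth in cycles
        for i in range(M):
            for j in range(N):
                k = t - i - j                      # which operand pair reaches this PE now
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]   # one multiply-accumulate, then shift
    return C

A, B = np.random.randn(8, 8), np.random.randn(8, 8)
assert np.allclose(systolic_matmul(A, B), A @ B)
```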
No, but thanks for indulging on these explanations. I think we need to keep our audience along 00:09:57.120 |
with us by pausing every now and then to explain key terms. 00:10:01.680 |
When I say explain a systolic array, I just immediately get a picture in my head of like 00:10:06.000 |
tilting a matrix and shifting it. It's hard to kind of explain. 00:10:13.720 |
Yeah, yeah, yeah. There's some great graphics that just show you, oh, so that's what a systolic 00:10:17.200 |
array is. But it's a MAC-and-shift machine that looks kind of different from the typical 00:10:21.640 |
like APU sort of machine. Sorry, ALU sort of machine. I think the right answer is something 00:10:26.440 |
that looks more like queues that feed into ALUs. And then you can like prefetch the loads 00:10:31.640 |
from the memory, put in a bunch of queues, and then the queue is just like, and feeds 00:10:35.600 |
into another queue over here. But yeah, but that's not even the main problem with TPUs. 00:10:42.360 |
The main problem with TPUs is that they're closed source. Not only is the chip closed 00:10:45.440 |
source, but all of-- XLA is open source, but the XLA to TPU compiler is a 32 megabyte binary 00:10:51.520 |
blob called libtpu on Google's cloud instances. It's all closed source. It's all hidden stuff. 00:10:56.800 |
And, you know, well, there's a reason Google made it closed source. Amazon made a clone 00:10:59.920 |
of the TPU. It's called Inferentia. Or they have some other name for it, a training-- 00:11:04.280 |
>> Trainium, yeah, yeah, yeah. And here, look, it's a clone of the TPU. It's--software doesn't 00:11:08.080 |
work though. Like Google software at least kind of works. 00:11:12.120 |
>> So those are kind of like the three core thesis. The first thing you're working on, 00:11:15.360 |
that you've been working on is TinyGrad. And one of the--your Twitch streams, you said, 00:11:19.600 |
is the best thing you've ever written. Yeah, tell us a bit more about that creation. 00:11:26.840 |
>> For a long time, TinyGrad had a hard limit of 1,000 lines of code. And what this would 00:11:31.280 |
force you to do is really make sure you were not wasting lines. I got rid of the restriction 00:11:37.400 |
because it became a little code golfy at the end. But once like the core framework of TinyGrad 00:11:42.680 |
was there in those 1,000 lines, it's not huge now. It's like 2,800 lines now. It's still 00:11:49.120 |
very readable. But like the core framework, the ideas are expressed with no boilerplate. 00:11:56.420 |
If you go read PyTorch--you know, PyTorch is actually pretty good code. I think Facebook's 00:12:00.720 |
pretty good. But there's so much boilerplate. Go in PyTorch and try to track down how an 00:12:07.400 |
LU actually works. >> Just a lot of instructions? 00:12:10.960 |
>> Oh, you're going to be diving down a long stack from Python to C to custom libraries 00:12:16.800 |
to dispatchers to--and then I don't even know how to read TensorFlow. Like I don't even 00:12:20.360 |
know where's an LU in TensorFlow. Nobody knows. Someone at Google knows maybe. Google as an 00:12:27.080 |
organism knows. I don't know if anyone individual at Google knows. 00:12:31.580 |
>> What are like the important ergonomics like for a developer as you think about designing 00:12:35.400 |
the TinyGrad API? >> So, the TinyGrad frontend looks very similar 00:12:39.240 |
to PyTorch. There's an even higher level frontend you can use for TinyGrad which is just ONNX. 00:12:44.060 |
We support--we have better support for ONNX than Core ML does. And we're going to have--I 00:12:48.680 |
think we're going to pass ONNX Runtime soon too. Like people think ONNX Runtime, that's 00:12:52.000 |
a gold standard for ONNX. No, you can do better. >> Pass them in what specifically? 00:12:55.560 |
>> Test, compliance tests. So, ONNX has a big set of compliance tests that you can check 00:12:59.580 |
out. And we have them running in TinyGrad and there's some failures. We're below ONNX 00:13:05.480 |
Runtime but we're beyond Core ML. So, like that's like where we are in ONNX support now. 00:13:09.800 |
But we will pass. We will pass ONNX Runtime soon because it becomes very easy to add ops 00:13:14.060 |
because of how like you don't need to do anything at the lower levels. You just do it at this 00:13:17.960 |
very high level and TinyGrad compiles it to something that's fast using these minimal 00:13:21.560 |
ops with. You can like write--I mean, most concretely what TinyGrad can do that like 00:13:27.280 |
PyTorch can't really do is if you have something like A times B plus C, right? If you write 00:13:32.460 |
that in Naive PyTorch, what it's going to do on the GPU is, well, read A, read B in 00:13:37.520 |
a kernel and then store A times B in memory and then launch another kernel to do A times 00:13:42.880 |
B plus C, okay? Got to do those loads from memory. I know I did a whole extra round trip 00:13:48.040 |
to memory that I just didn't have to do. And you're like, "Yeah, but you can use the Torch 00:13:51.080 |
JIT and it corrects this." Yeah, for that one example, for that one example of a multiply-accumulate, 00:13:56.720 |
but oh, now you did three multiplies, six multiplies, right? It doesn't--it won't compile 00:14:04.420 |
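To put numbers on the extra round trip being described, here is a back-of-the-envelope sketch with a hypothetical tensor size:

```python
# d = a * b + c on N-element float32 tensors (N is made up for illustration).
N = 1_000_000
bytes_per_float = 4

# Unfused, eager-style execution: two kernels, with the intermediate bounced through memory.
#   kernel 1: read a, read b, write tmp   -> 3N floats of traffic
#   kernel 2: read tmp, read c, write d   -> 3N floats of traffic
naive_traffic = 6 * N * bytes_per_float

# Fused execution, which laziness makes possible: one kernel.
#   read a, read b, read c, write d       -> 4N floats of traffic
fused_traffic = 4 * N * bytes_per_float

print(naive_traffic / fused_traffic)      # 1.5x less memory traffic, plus one fewer kernel launch
```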
>> And if you looked into like the other approaches like PyTorch Lightning to accelerate PyTorch 00:14:10.360 |
>> Well, PyTorch Lightning, my understanding is it's mostly a framework around PyTorch, 00:14:14.280 |
right? PyTorch Lightning is not going to fix this fundamental problem of I multiply six 00:14:18.040 |
tensors together, why is it going to memory any more than a single read from each and 00:14:24.320 |
>> Yeah, there are lower level things in PyTorch that are--I'm not exactly sure what Dynamo 00:14:29.680 |
does but I know they're generating some Triton stuff which is going to generate the kernels 00:14:33.960 |
on the fly. But, you know, PyTorch Lightning is at a higher level of abstraction. So TinyGrad's 00:14:39.840 |
front-end stuff looks like PyTorch. I made a few tweaks, there's a few things I don't 00:14:42.960 |
like about PyTorch. Why is ReLU a class? No, really, like what's the state? You make a 00:14:49.200 |
class and there's a state. Everything should just be functional, and then ReLU is 00:14:52.160 |
just .relu() on the tensor. Also, like, there's things in Torch where you have to do torch 00:14:56.840 |
dot and not tensor dot, right? And like why are these things--like this just--it just 00:15:02.640 |
shows an API that's like not perfectly refined. But when you're doing stuff TinyGrad style 00:15:07.880 |
where you don't have lines, well, it has to work this way because even the lines to express 00:15:12.920 |
the--well, you can't use the where operator unless--and the where operator in PyTorch. 00:15:17.720 |
Why is it true case, condition, false case? The worst--that's how Python expresses ifs. 00:15:24.440 |
It's disgusting, right? Ternary operators are much nicer. It should be like I can do 00:15:28.360 |
(a < 0).where(a, 1), right? >> The very Pandas-like API. 00:15:35.320 |
>> Yeah, yeah, yeah. It's just--it's some--it looks like Torch, NumPy, Pandas. They're all 00:15:40.440 |
very similar. I tried to take like the cleanest subset of them and express them. But like 00:15:44.960 |
I said, you can also interact with it using ONNX. But I have a rewrite of Stable Diffusion, 00:15:50.240 |
I have a rewrite of LLaMA, I have a rewrite of Whisper. You can look at them. They're 00:15:52.840 |
shorter than the Torch version and I think they're cleaner. 00:15:56.360 |
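A rough side-by-side of the frontend differences being described (treat this as a sketch; both APIs may have drifted since this conversation):

```python
import torch
from tinygrad.tensor import Tensor

x = torch.randn(4, 4)
y1 = torch.nn.ReLU()(x)                        # the stateless class George is objecting to
y2 = torch.relu(x)                             # the functional form
y3 = x.where(x < 0, torch.ones_like(x))        # reads as: true case, condition, false case

a = Tensor.randn(4, 4)
b = a.relu()                                   # just a method on the tensor
c = (a < 0).where(a, 1)                        # condition first, NumPy/Pandas style
```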
>> Very nice. >> Laziness is kind of the other important 00:16:00.080 |
concept that you're leveraging to do operation fusing. Yeah, talk a bit more about that. 00:16:05.200 |
>> So, yeah, you have basically like a few different like models for compute. The simplest 00:16:14.160 |
one is Eager, right? The simplest one is Eager. As soon as the interpreter sees A times B, 00:16:20.760 |
it actually dispatches A times B, right? Then you have Graph, like TensorFlow, which will 00:16:27.400 |
put A times B into a graph and then will do absolutely nothing until you actually compile 00:16:35.000 |
the graph at the end. I like this third choice, which is somewhere in the middle, laziness. 00:16:40.080 |
Laziness is, you don't know when the ops are going to dispatch and don't worry about that. 00:16:42.760 |
You don't have to worry about this as a programmer. You just write out all your stuff and then 00:16:46.280 |
when you actually type .numpy, it'll be ready by the time you, you know, copy the thing 00:16:50.540 |
back to CPU. Or you can do .realize and it will actually like force that tensor to be 00:16:54.960 |
allocated in RAM. But yeah, a lot of times, right, like, and if you think about it, PyTorch 00:17:00.960 |
is kind of lazy in a way, but they didn't extend the paradigm far enough, right? When 00:17:04.920 |
I do A times B in PyTorch, it's going to launch a CUDA kernel to do A times B, but it's not 00:17:09.680 |
going to wait for that CUDA kernel to complete. So you're getting the worst possible world. 00:17:13.800 |
You're getting the same laziness, but you also can't get fusion because PyTorch doesn't know 00:17:18.200 |
that I'm then going to do plus C. There's no way for it to be like, "Whoa, whoa, whoa, 00:17:21.560 |
don't launch that CUDA kernel. Whoa, just do this one too." Right? You can kind of like, 00:17:26.320 |
this stuff, PyTorch is working on this and, you know, it's a little bit harder. Like in 00:17:31.920 |
comma, I felt like I was competing against a lot of idiots. Here I'm competing against, 00:17:35.840 |
you know, smart, smart, very smart people who've made, yeah, who've made some, I think, 00:17:41.680 |
different trade-offs, right? Who've made some different trade-offs. Whereas if you're trying 00:17:45.400 |
to build something that is just straight up good on NVIDIA and we have a lot of people 00:17:49.540 |
and complexity to throw at it, yeah, PyTorch made a lot of the right choices. I'm trying 00:17:53.140 |
to build something that manages complexity. Like you can always make your software do 00:17:57.520 |
more. The magic is when you can make your software do more without adding complexity, 00:18:02.320 |
right? Because, you know, complex things eventually collapse under their own weight. So it's kind 00:18:09.500 |
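A minimal sketch of the lazy model he is describing, in tinygrad-style code (exact API details may differ by version):

```python
from tinygrad.tensor import Tensor

a = Tensor.randn(1024, 1024)
b = Tensor.randn(1024, 1024)
c = Tensor.randn(1024, 1024)

d = (a * b + c).relu()   # nothing dispatched yet: this only builds a small local graph,
                         # which is what gives the scheduler room to fuse the ops

out = d.numpy()          # only now are kernels generated, fused, run, and copied back to CPU
# d.realize() would instead force the tensor to be materialized on-device without the copy.
```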
>> Like TensorFlow actually collapsed. It's kind of what happened, right? How does fusing 00:18:15.760 |
actually work? So yeah, there's this thing called lazy.py. And when you do like A times 00:18:21.720 |
B, that's, it's put into a graph, but it's a very local graph. There's no global graph 00:18:28.120 |
optimizations. And even this can change, right? Again, like the programming model for TinyGrad 00:18:32.760 |
does not preclude eagerness, right? Laziness is not guaranteed laziness. It's just going 00:18:37.440 |
to try its best. So you put in A times B, and that's a binary op, right? And then you 00:18:41.960 |
put in A times B, like that's a node in the graph. It's a virtual node because it's not 00:18:45.640 |
realized yet, plus C. Okay, here's a new node, which takes the C tensor in here and takes 00:18:50.360 |
the output of A times B. It's like, whoa, wait, there's two binary ops. Okay, we'll 00:18:53.680 |
just fuse those together. Okay, here I have a kernel. This kernel has A, B, and C as inputs. 00:18:58.200 |
It does A times B plus C in the local registers, and then outputs that to memory. And you can 00:19:04.360 |
GRAPH=1 in TinyGrad. Another, like amazing thing that TinyGrad has that I've not seen 00:19:10.560 |
in any other framework is two things. GRAPH=1, graph equals one, which is an environment variable. 00:19:16.040 |
It will output a complete graph of all the operations. A lot of people are like, oh, 00:19:19.480 |
you can use PyTorch, export it to ONNX, and use Netron. Yeah, you can, but like what? 00:19:24.680 |
That's not what's real. Graph.1 will show you the actual kernels that were dispatched 00:19:28.520 |
to the GPU. You can also set DEBUG=2, which will print those kernels out in 00:19:34.200 |
your command line. And it will tell you the exact number of flops and the exact number 00:19:40.440 |
of memory accesses in each kernel. So you can immediately see, wait a second, okay, 00:19:45.680 |
this kernel used this many flops, this was the gigaflops, this is how many bytes it read, 00:19:49.520 |
and this was the gigabytes per second. And then you can profile without having to like, 00:19:53.280 |
okay, I mean, in theory, in PyTorch, sure, use the NVIDIA Nsight profiler, which is-- 00:19:58.000 |
>> No one does that. >> No one does, of course, because it's so 00:20:00.240 |
difficult, right? Like, actually, NVIDIA used to, pre, I think CUDA 9 was the last one they 00:20:06.320 |
had it. They had a command line one, but now it's like, okay, I'm going to generate this 00:20:09.760 |
blob, use this NVIDIA GUI tool to convert it into a Chrome trace and then load it. Yeah, 00:20:15.480 |
no one does that, right? You just type DEBUG=2 with any TinyGrad model and it will show you 00:20:19.680 |
all the kernels that it launches and the efficiency of each kernel, basically. 00:20:24.160 |
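In practice that introspection looks something like the following (a sketch; the example script name is made up, and the exact output format varies by tinygrad version):

```python
# From the shell, since these are just environment variables:
#   DEBUG=2 python3 my_model.py    # print each dispatched kernel with FLOPs, bytes moved,
#                                  # and the achieved GFLOPS / GB/s
#   GRAPH=1 python3 my_model.py    # dump a graph of the kernels that actually ran
#
# Or from inside Python, setting them before tinygrad is imported:
import os
os.environ["DEBUG"] = "2"
os.environ["GRAPH"] = "1"

from tinygrad.tensor import Tensor
x = Tensor.randn(512, 512)
(x @ x).relu().numpy()             # watch the per-kernel FLOP and memory-bandwidth report
```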
>> Yeah, this is something that John Carmack has often commented about, is that when you 00:20:29.320 |
code, you need to build in your instrumentation or observability right into that. I wonder 00:20:34.520 |
if whatever John is working on, he's adopting this style, and maybe we can sort of encourage 00:20:39.720 |
it by, like, I don't know, naming it and coining it as a certain kind of debugging style. 00:20:46.280 |
>> If he would like to start contributing to TinyGrad, I'd be-- 00:20:49.320 |
>> You should hook up with him. >> I'd be so happy. I've chatted with him 00:20:52.000 |
a few times. I'm not really sure what his company is doing. I think it's all, I think 00:20:55.720 |
it's pretty, but no, I mean, hopefully, like, we get TinyGrad to a point where people actually 00:21:02.240 |
want to start using it. So TinyGrad right now is uncompetitive on, it's uncompetitive 00:21:07.800 |
on NVIDIA, it's uncompetitive on x86. >> And specifically, what do you care about 00:21:13.180 |
>> Okay. >> Sheer speed. It's correct. The correctness 00:21:16.040 |
is there. The correctness for both forwards and backwards passes is there, but on NVIDIA, 00:21:20.680 |
it's about 5x slower than PyTorch right now. Like, 5x, wow, this is insurmountable. No, 00:21:25.240 |
there's reasons it's 5x slower, and I can go through how we're going to make it faster, 00:21:28.040 |
and it used to be, you know, 100x slower, so, you know, we're making progress, but there's 00:21:32.160 |
one place where it actually is competitive, and that's Qualcomm GPUs. So TinyGrad is 00:21:36.560 |
used to run the model in OpenPilot. Like, right now, it's been live in production now 00:21:40.360 |
for six months, and TinyGrad is about 2x faster on the GPU than Qualcomm's library. 00:21:46.360 |
>> And why specifically Qualcomm? >> Well, because we have Qualcomm. We use 00:21:51.080 |
Qualcomm in the Comma devices. >> Oh, I mean, like, what makes, what about 00:21:55.840 |
Qualcomm architecture? >> Oh, what makes it doable? Well, because 00:21:58.920 |
the world has spent how many millions of man-hours to make NVIDIA fast, and Qualcomm has a team 00:22:03.160 |
of 10 Qualcomm engineers? Okay, well, who can I beat here? Like, what I propose with 00:22:08.760 |
TinyGrad is that developer efficiency is much higher, but even if I have 10x higher developer 00:22:14.160 |
efficiency, I still lose on NVIDIA, right? You know, okay, I didn't put 100,000 man-hours 00:22:19.840 |
into it, right? If they put a million, like, that's what I'm saying, but that's what I'm 00:22:23.560 |
saying we can get, and we are going to close this speed gap a lot. Like, I don't support 00:22:28.480 |
Tensor Cores yet. That's a big one that's just going to, okay, massively close the gap. 00:22:33.960 |
And then AMD. I can't even get, I don't even have a benchmark for AMD because I couldn't 00:22:39.280 |
get it compiled. Oh, and I tried. Oh, I tried. I spent a day. Like, I spent actually a day 00:22:43.940 |
trying to get PyTorch, and I got it built. I got it kind of working, and then I tried 00:22:49.400 |
to run a model. Like, there's all kinds of weird errors, and the rabbit holes are so 00:22:52.800 |
deep on this. I'm like, so we, you know, you can compare the speed. Right now, you can 00:22:57.320 |
run LLAMA. You can run anything you want on AMD. It already all works. Any OpenCL backend 00:23:01.160 |
works, and it's not terribly slow. I mean, it's a lot faster than crashing, so it's infinitely 00:23:05.760 |
times faster than PyTorch on AMD, but pretty soon, we're going to start getting close to 00:23:10.560 |
theoretical maximums on AMD. That's really where I'm pushing, and I want to get AMD on 00:23:19.800 |
>> Yeah, let's dive into that, because when you announced the TinyCorp fundraise, you 00:23:23.920 |
mentioned one of your first goals is build the framework, runtime, and driver for AMD, 00:23:29.520 |
and then on June 3rd on Twitch, you weren't as excited about AMD anymore. Maybe let's 00:23:35.080 |
talk a bit about that, and you compared the quality of commit messages from the AMD kernel 00:23:41.360 |
to the Intel work that people are doing there. What's important to know? 00:23:44.800 |
>> So when I said I want to write a framework, I never intended on writing a kernel driver. 00:23:49.160 |
I mean, I flirted with that idea briefly, but realistically, there's three parts to 00:23:55.840 |
it, right? There's the ML framework, there's the driver, and then there's the user space 00:23:59.800 |
runtime. I was even down to rewrite the user space runtime. I have a GitHub repo called 00:24:04.800 |
CUDA I/O Control Sniffer. It's terribly called, but you can actually launch a CUDA kernel 00:24:08.520 |
without CUDA, so you don't need CUDA installed. Just the NVIDIA open source driver and this 00:24:13.820 |
open source repo can launch a CUDA kernel. So rewriting the user space runtime is doable. 00:24:19.040 |
Rewriting the kernel driver, I don't even have docs. I don't have any docs for the GPU. 00:24:23.000 |
It would just be a massive reverse engineering project. When I saw that there, I wasn't complaining 00:24:30.880 |
about it being slow. I wasn't complaining about PyTorch not compiling. I was complaining 00:24:34.120 |
about the thing crashing my entire computer. It panics my kernel, and I have to wait five 00:24:37.880 |
minutes while it reboots because it's a server motherboard and they take five minutes to 00:24:40.640 |
reboot. So I was like, "Look, if you guys do not care enough to get me a decent kernel 00:24:45.160 |
driver, there's no way I'm wasting my time on this, especially when I can use Intel GPUs." 00:24:49.280 |
Intel GPUs have a stable kernel driver, and they have all their hardware documented. You 00:24:53.620 |
can go and you can find all the register docs on Intel GPUs. So I'm like, "Why don't I just 00:24:58.480 |
use these?" Now, there's a downside to them. Their GPU is $350. You're like, "What a deal. 00:25:03.600 |
It's $350." You get about $350 worth of performance. If you're paying about $400 for the PCIe slot 00:25:08.760 |
to put it in, like between the power and all the other stuff, you're like, "Okay, never 00:25:12.520 |
mind. You've got to use NVIDIA or AMD from that perspective." But I sent an email to 00:25:20.600 |
>> Oh, you can see you published that email in a Discord. 00:25:22.600 |
>> I did. I did. And she responded. And I've had a few calls since. And what I did was 00:25:30.160 |
like what I tried to do. Well, first off, thank you for responding. It shows me that 00:25:35.680 |
if you don't care about your kernel panicking, I can't. This is just a huge waste of my time. 00:25:40.640 |
Right? I'll find someone who will care. I'm not asking for your 7x7 Winograd transposed 00:25:46.760 |
convolution to be fast. I'm not asking for that. I'm asking literally for- 00:25:51.640 |
>> Oh, and this isn't TinyGrad. This is your demo apps. I ran their demo apps in loops 00:25:56.320 |
and I got kernel panics. I'm like, "No. Okay." But no, Lisa Su reached out, connected with 00:26:05.640 |
a whole bunch of different people. They sent me a pre-release version of ROCm 5.6. They 00:26:12.040 |
told me you can't release it, which I'm like, "Why do you care?" But they said they're going 00:26:17.240 |
to release it by the end of the month. And it fixed the kernel panic. The guy managed 00:26:20.560 |
to reproduce it with the two GPUs and the computer. And yeah, sent me a driver and it 00:26:27.600 |
works. So yeah, I had that experience. And then I had another experience where I had 00:26:33.080 |
two calls with AMD's communication people. I tried to explain to these people open source 00:26:38.000 |
culture. It's not open source if you dump the source code on a GitHub repo and then 00:26:42.880 |
forget about it until the next release. It's not open source if all your issues are from 00:26:48.000 |
2022. No one's going to contribute to that project. Sure, it's open source in a very 00:26:54.400 |
technical sense. To be fair, it's better than nothing. It's better than nothing, but I fixed 00:26:59.640 |
a bug in NCCL. There's a fun fact, by the way. If you have a consumer NVIDIA GPU, they 00:27:05.800 |
don't support peer-to-peer, and their all-reduce bandwidth is horrendously slow because it's 00:27:10.800 |
using CUDA kernels to do the copy between the GPUs. And it's putting so many transactions 00:27:15.220 |
on the PCIe bus that it's really slow. But you can use CUDA memcpy, and there's a flag 00:27:19.400 |
to use CUDA memcpy, but that flag had a bug. So I posted the issue on NCCL. I expected 00:27:27.360 |
nothing to happen. The Nvidia guy replied to me within an hour. He's like, "Try this 00:27:30.560 |
other flag." I'm like, "Okay, I tried the other flag. It still doesn't work, but here's 00:27:33.900 |
a clean repro." And I spent like three hours writing a very clean repro. I ended up tracking 00:27:40.280 |
the issue down myself, but just the fact that somebody responded to me within an hour and 00:27:43.660 |
cared about fixing the issue, okay, you've shown that it's worth my time, and I will 00:27:47.960 |
put my time in because let's make this better. I'm here to help. But if you show me that 00:27:52.640 |
you're like, "You're the kernel panics. Let's just expect it." Okay. 00:27:56.000 |
>> Well, it sounds like AMD is getting the message. 00:27:59.000 |
>> They are. And I don't really think they've had someone explain to them. I was like, "You 00:28:03.600 |
can build in public." And they're like, "What's an example of building in public?" I'm like, 00:28:06.640 |
"Go look at PyTorch." Go look at PyTorch, right? I have two minor things merged into 00:28:11.760 |
PyTorch because it's very responsive. They're like minor bug fixes, but I feel like it's... 00:28:17.160 |
>> Yeah. So that's kind of like the lowest level of the stack. And then at a slightly 00:28:22.400 |
higher level, obviously, there's TinyGrad, there's Mojo, there's GGML. How are you thinking 00:28:28.200 |
about breadth versus depth and where you decided to focus early on? 00:28:33.600 |
>> So GGML is very much like a... Okay, everyone has M1s, right? Actually, I was thinking... 00:28:38.400 |
In the beginning, I was thinking of something more like GGML focused on the M1s, but GGML 00:28:42.880 |
showed up and was just like, "We're actually just focusing on the M1s." And actually, M1 00:28:49.920 |
PyTorch is considerably better than AMD PyTorch. M1 PyTorch works. It only gives wrong answers 00:28:54.560 |
sometimes and it only crashes sometimes. But some models kind of run. When I was writing 00:29:00.960 |
the metal backend, I was comparing to MPS PyTorch, and I had a discrepancy. TinyGrad 00:29:07.000 |
checks all its outputs compared to Torch, and I had one where it didn't match. I'm like, 00:29:13.000 |
"I checked the matrix by hand. It matches TinyGrad. I don't understand." And then I 00:29:17.040 |
switched PyTorch back to CPU and it matched. I'm like, "Oh." Yeah. Well, there's bugs. 00:29:23.200 |
If you transpose the matrix, because I think it has to do with multi-views and PyTorch 00:29:27.340 |
and weird under-the-hood stuff that's not exposed to you. There's bugs and maybe they 00:29:30.880 |
fix them. But it seems like there was a lot of momentum, again, because you're getting 00:29:36.960 |
how many engineers care about making PyTorch work on M1? Thousands, tens of thousands. 00:29:42.120 |
And you have an open development process, and guess what? It's going to be good. How 00:29:45.120 |
many engineers care about AMD working, PyTorch AMD working? You got 10 guys that work for 00:29:54.000 |
You revealed an interesting detail about how you debug, which is you hand-check the matrix 00:30:00.040 |
No, I don't hand-check it. One of the best tests in TinyGrad is a file called test_ops.py. 00:30:06.600 |
And it's just 100 small examples written in TinyGrad and PyTorch. And it checks both the 00:30:12.720 |
forwards and backwards to make sure they match. 00:30:17.080 |
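The shape of such a parity test, sketched (simplified; not the actual file):

```python
import numpy as np
import torch
from tinygrad.tensor import Tensor

def check_op(torch_fn, tiny_fn, shape=(4, 4), atol=1e-4):
    data = np.random.randn(*shape).astype(np.float32)

    t = torch.tensor(data, requires_grad=True)
    torch_out = torch_fn(t).sum()
    torch_out.backward()

    g = Tensor(data, requires_grad=True)
    tiny_out = tiny_fn(g).sum()
    tiny_out.backward()

    # both the forward values and the backward gradients have to match
    np.testing.assert_allclose(torch_out.detach().numpy(), tiny_out.numpy(), atol=atol)
    np.testing.assert_allclose(t.grad.numpy(), g.grad.numpy(), atol=atol)

check_op(torch.relu, Tensor.relu)
check_op(lambda x: x.exp(), lambda x: x.exp())
```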
That's one of them where I really have put a lot of effort into CI for TinyGrad. I think 00:30:21.400 |
CI is super important. I want that green check to mean I can merge this. I don't want my 00:30:26.360 |
tests to -- and the green check, if you somehow manage to introduce a bug and get the green 00:30:29.760 |
check, okay, we're fixing the test. Top priority. 00:30:33.880 |
It's closed source. No, I'm not that interested. You know what I mean? Look, I like Chris Lattner. 00:30:40.680 |
I think he's going to do great things, and I understand kind of the wisdom even in keeping 00:30:45.080 |
it closed source. But I'm interested when it's open. 00:30:50.240 |
You have an interesting design deviation from him, because he's decided to be -- well, promised 00:30:54.840 |
to be a superset of Python, and you have decided to break with PyTorch APIs. And I think that 00:31:01.160 |
affects learnability and transportability of code. 00:31:05.700 |
You know, if the PyTorch thing ends up being like a stumbling block, I could write a perfect 00:31:13.600 |
PyTorch. Like a -- you know, instead of import PyTorch, instead of, like, yeah, import Torch, 00:31:20.280 |
you type import TinyTorch as Torch. And if that really becomes the stumbling block, I 00:31:25.540 |
will do that. No, Chris Lattner went much further than PyTorch. Replicating the PyTorch 00:31:30.960 |
API is something I can do with a couple, you know, like an engineer month or two. 00:31:35.720 |
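Hypothetically, such a shim is just a thin module re-exporting tinygrad under Torch-shaped names; nothing like this is implied to actually exist, it only illustrates the idea.

```python
# tinytorch.py (hypothetical)
from tinygrad.tensor import Tensor as _Tensor

def tensor(data, requires_grad=False):
    return _Tensor(data, requires_grad=requires_grad)

def relu(x):
    return x.relu()

def matmul(a, b):
    return a.matmul(b)

# User code would then change only its import line:
#   import tinytorch as torch
#   y = torch.relu(torch.tensor([[1.0, -2.0]]))
```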
Right, like a shim, yeah. Replicating Python, whoo-hoo-hoo. There's a big graveyard of those 00:31:41.360 |
things. How's Pyston going? How's Jython? You can go way back. 00:31:51.880 |
So TinyGrad is one layer. You announced TinyBox recently, which is, you know, you made it 00:31:57.560 |
-- so your core mission is commoditizing the petaflop. And then your business goal is to 00:32:03.080 |
sell computers for more than the cost to make, which seems super reasonable. And you're gonna 00:32:10.680 |
No, no, no, no, no, no, no, no. That was my -- look, you know, a lot of people, like, 00:32:15.040 |
I love, you know, leaning into like saying I'm giving up, right? It's great to give up. 00:32:19.040 |
Giving up is this wonderful thing. It's so liberating. And then, like, you can decide 00:32:22.520 |
afterward if you really give up or not. There's very little harm in saying you give up, except 00:32:25.920 |
like, you know, great, Twitter haters have something to talk about. And all press is good press. 00:32:36.440 |
Unless AMD, you know, upsets me again, and then we're back to other colors. We have other 00:32:43.240 |
When you think about hardware design, what are some of the numbers you look for? So teraflops 00:32:48.600 |
per second is one, but like memory bandwidth is another big limiter. Like, how do you make 00:32:54.880 |
Well, I mean, fundamentally, I'm limited to what GPUs I can buy. But yeah, for something 00:32:59.520 |
that I think a lot of people are going to want to reasonably do with -- a coworker of 00:33:05.160 |
mine described them as luxury AI computers, right? Like, luxury AI computers for people. 00:33:11.120 |
And that's like what we're building. And I think a common thing people are going to want 00:33:13.600 |
to do is run, like, large LLaMA, right? Or large Falcon or whatever. 00:33:18.520 |
FP16, exactly. Exactly. You know, int8, I think, can work. I think that, like, what 00:33:23.120 |
GGML is doing to go to, like, Int4, like, this doesn't work. Like, have you done -- maybe 00:33:28.120 |
they have. But, like, I read what it was, and I was like, this isn't from any paper. 00:33:35.320 |
Yeah, you made up some quantization standard to make it run fast. And, like, maybe it works, 00:33:39.560 |
but, okay, where's, like, the HellaSwag number, right? Where's your, where's your, you know, 00:33:45.240 |
The thesis is right, that, like, if you have billions, hundreds of billions of parameters, 00:33:49.080 |
that the individual quantization doesn't actually matter that much. 00:33:52.080 |
Well, the real way to look at all of that is to just say you want to compress the weights, 00:33:55.320 |
right? It's a form of weight compression. Quantization is a form of weight compression, 00:33:58.440 |
right? Now, this is obviously not lossless. It's not a lossless compressor, right? If 00:34:01.280 |
it's a lossless compressor, and you can show that it's correct, then, okay, we don't have 00:34:04.320 |
to have any other conversation. But it's a lossy compressor. 00:34:07.920 |
And how do you know that your loss isn't actually losing the power of the model? 00:34:12.080 |
Maybe int4 65B LLaMA is actually the same as FP16 7B LLaMA, right? We don't know. Maybe 00:34:18.600 |
someone has done this yet, but I looked for it when it, like, first came out, and people 00:34:21.680 |
were talking about it, and I'm like, I just have -- like, it's not from a paper, right? 00:34:25.920 |
The int8 stuff is from a paper where they, like, some of the int8 stuff is from a paper. 00:34:29.720 |
There's one paper, I think it's, like, LLM.int8, where they actually, you know, do all the 00:34:35.960 |
tests. And they didn't go fully int8. They made, like, 90% of it int8 and kept, like, 00:34:41.320 |
10% of it in FP16 for what they called, like, the outliers or whatever. 00:34:46.200 |
So I think that this is not quite so easy. And I think being able -- well, so first off, 00:34:49.560 |
if you're training, no one's gotten training to work with int8 yet. There's a few papers 00:34:52.640 |
that vaguely show it. But if you're training, you're going to need BF16 or float16. So 00:34:58.480 |
this is why I target that. Now the thing that you're going to want to do is run these large 00:35:03.320 |
language models out of the box on your hardware in FP16, and that's memory bandwidth. So 00:35:09.320 |
you need large amounts of memory bandwidth, too. So ask how I trade off memory bandwidth 00:35:17.720 |
>> And I saw one of your -- so first of all, you have this hiring process, which is you've 00:35:22.160 |
got to solve one of the bounties that are open on TinyGrad. There's no technical interview. 00:35:27.280 |
One of them is int8 support. Do you already have some things you want to test on? 00:35:32.480 |
>> We have int8 support. What I'd like to see somebody do is just load the GGML int8 00:35:37.800 |
LLaMA into TinyGrad and then benchmark it against the FP16 one. Int8 already works 00:35:43.720 |
in TinyGrad. It doesn't actually do the math in int8, which is even a stronger -- like, 00:35:49.360 |
it does all the math still in FP32. So int8 can mean you just have your weights in int8, 00:35:54.240 |
or int8 can mean you actually do your math in int8. And doing your math in int8, the 00:35:57.520 |
big, like, gain that people care about is actually having your weights in int8, because 00:36:03.720 |
weights in int8 mean less memory and less memory bandwidth, whereas the math, keep it 00:36:09.280 |
in FP32. On M1s, it doesn't even matter if you're doing -- it doesn't matter what data 00:36:14.240 |
type you're doing in the GPU. I'm not even sure it can do int8, but FP16 and FP32 is 00:36:19.840 |
the same. It's the same teraflops. So, yeah, no, that's one of the bounties. One of the 00:36:25.720 |
bounties is get int8 llama running with the int8 weights. And then actually, you don't 00:36:31.040 |
even need to -- what you could even do, if you really want to test this, just take the 00:36:34.600 |
FP16 weights, convert them to int8, then convert them back to FP16, then compare the 00:36:43.560 |
>> This should be lossless in the other direction. 00:36:45.440 |
>> Yeah, I think FP16, it should be lossless in the other direction. I'm actually not 100% 00:36:54.520 |
>> Oh, because, like, you ever try to, like, if you want to represent -- if it was, like, 00:36:59.800 |
>> I think all of int8 can be represented in FP16, but I'm not 100% about that. 00:37:08.760 |
>> We just have to do it, right? Just literally do it. There's only 256 to check. But, yeah, 00:37:14.720 |
either way -- I mean, int4, definitely. So do your int4, convert it back, and now see, 00:37:19.480 |
even with int4 weight and FP32 math, like, okay, how much has your performance degraded 00:37:27.880 |
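Sketched out, the experiment he is proposing looks roughly like this. The quantization scheme below is a generic symmetric per-tensor one, chosen for illustration only, not GGML's actual format; the point is the round trip and then a benchmark comparison.

```python
import numpy as np

# The trivial direction first: every int8 value is exactly representable in FP16.
assert all(np.float16(v) == v for v in range(-128, 128))

def quant_roundtrip(w, bits=8):
    qmax = 2 ** (bits - 1) - 1                               # 127 for int8, 7 for int4
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return (q.astype(np.float32) * scale).astype(np.float16)

w = np.random.randn(4096, 4096).astype(np.float16)           # stand-in for one weight matrix
print("int8 mean abs error:", np.abs(w - quant_roundtrip(w, 8)).mean())
print("int4 mean abs error:", np.abs(w - quant_roundtrip(w, 4)).mean())

# The real test is then to run the FP16 model and the round-tripped model on HellaSwag
# (or whatever benchmark) and compare scores, rather than trusting the format by fiat.
```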
>> So can we -- I'm about to zoom out a little bit from the details. I don't know if you 00:37:33.240 |
>> No, I think, like, the -- you're planning to release the first tiny box, ship them in, 00:37:37.880 |
like, two to six, eight months, something like that. What's top of mind for you in terms 00:37:42.080 |
of building a team? Who should -- who are you calling for? 00:37:45.840 |
>> Yeah. Well, to stay on the tiny box for one minute, so as the GPU is picked out, and 00:37:53.200 |
you're like, well, I could make that computer with the GPUs, and my answer is, can you? 00:37:57.840 |
Do you know how to put -- do you know how hard it is to put six GPUs in a computer? 00:38:02.600 |
People think it's really easy, and it's really easy to put one GPU in a computer. It's really 00:38:06.240 |
easy to put two GPUs in a computer, but now you want to put in eight. Okay, so I'll tell 00:38:10.680 |
you a few things about these GPUs. They take up four slots. What kind of computer -- you 00:38:15.560 |
can buy the nicest Supermicro. You can't put eight of those in there. You need two 00:38:19.000 |
slot blowers. If you want to use one of those 4U Supermicros, you need two-slot blowers. 00:38:23.240 |
Or water cooling. If you're trying to get the four-slot cards in there, you're going 00:38:26.240 |
to need some form of water cooling. Or you're going to need -- there are some, like, Chinese 00:38:31.120 |
4090s that are blowers, right? You're going to need blowers or water cooling if you're 00:38:40.560 |
>> Then, the other thing that -- okay, so now you want to get six GPUs in a computer, 00:38:45.660 |
so that's a big challenge. You're like, "Oh, I'll just use PCIe extenders. I saw it on 00:38:49.080 |
Linus Tech Tips. It works great." No, it doesn't. Try PCIe extenders that work at PCIe 4.0. 00:38:54.440 |
And interconnect bandwidth is super important. 00:38:56.920 |
>> They don't work at 3.0. No PCIe extender I've tested, and I've bought 20 of them, works 00:39:02.760 |
at PCIe 4.0. So you're going to need PCIe re-drivers. Now, okay, how much is that adding 00:39:08.840 |
cost, right? Like, these things all get really hard. And then, tiny boxes, I've even added 00:39:12.760 |
another constraint to it. I want this thing to be silent. Not totally silent, but my limit 00:39:17.520 |
is like 45, maybe 50 dB, but not -- a Supermicro machine is 60 dB. We have a small -- we 00:39:24.760 |
have a compute cluster at Comma. You've got to wear ear protection to go in there. 00:39:28.840 |
>> Yeah, I've seen some videos where you give a tour. 00:39:34.080 |
>> It's super loud. You've got all these things just screaming. 00:39:36.080 |
>> 10,000 RPM, just screaming. Like, I want to be able to use the normal big GPU fans, 00:39:43.320 |
and make this thing so it can sit under your desk, plug into one outlet of power, right? 00:39:48.880 |
Six GPUs. Your GPUs are 350 watts each. You can't plug that into a wall outlet. Okay, 00:39:55.600 |
so how are you going to deal with that? Good questions, right? And you're not sharing them. 00:40:00.360 |
Well, that one, I mean, that one is pretty obvious. You have to limit the power on the 00:40:05.760 |
>> You have to limit the power on the GPUs. Now, you can limit power on GPUs and still 00:40:08.160 |
get -- you can use like half the power and get 80% of the performance. This is a known 00:40:12.320 |
fact about GPUs, but like, that's one of my design constraints. So, when you start to 00:40:15.840 |
add all these design constraints, good luck building a tiny box yourself. 00:40:20.840 |
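The outlet math behind that constraint, with assumed US residential numbers:

```python
outlet_watts  = 120 * 15 * 0.8       # 120 V, 15 A circuit, ~80% continuous-load rule -> 1440 W
gpus          = 6
stock_watts   = 350                  # per GPU at stock power limits
print(gpus * stock_watts)            # 2100 W: does not fit on one outlet

limited_watts = 200                  # power-limit each GPU to roughly half...
print(gpus * limited_watts)          # ...1200 W, while keeping ~80% of the performance
print(outlet_watts - gpus * limited_watts)   # ~240 W left for CPU, fans, and drives
```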
>> You know, obviously, it can be done, but you need something that has actually quite 00:40:27.040 |
>> And you see like the -- under the desk, it's like one of the main use cases, kind 00:40:33.160 |
>> Yeah. What I also see is more of a like an AI hub for your home, right? As we start 00:40:38.200 |
to get like home robotics kind of stuff, you don't want to put the inference on the robot, 00:40:43.720 |
but you also don't want to put the inference on the cloud. You don't want to put it on 00:40:47.000 |
the robot because, okay, it's 1,500 watts, tiny box. You'll put batteries, you'll charge 00:40:52.680 |
them. Bad idea. I mean, just use wireless. Wireless is 0.5 milliseconds, right? This is super fast. 00:41:00.040 |
You don't want to go to the cloud for two reasons. One, cloud's far away. It's not that 00:41:04.200 |
far away. You can kind of address this. But two, cloud's also mad expensive. Like cloud 00:41:10.560 |
GPUs are way more expensive than running that GPU at your house, at least any rates you're 00:41:14.960 |
going to get, right? Maybe if you commit to buy, well, yeah, I'm going to buy 10,000 GPUs 00:41:18.960 |
for three years, then maybe the cloud will give you a good rate. But like, you want to 00:41:22.320 |
buy one GPU in the cloud? Ooh. I mean, okay, you can go to like Vast, but like if you're 00:41:26.080 |
going on Azure or AWS, oh, that's expensive. Yeah. This is like a personal data center, 00:41:30.880 |
you know, instead of a cloud data center. We like the term compute cluster, so we can 00:41:34.960 |
use NVIDIA GPUs. Data centers may be a little bit dated. It's a compute cluster, which is 00:41:40.720 |
totally legal under the CUDA license agreement. You talk a lot about the PCIe connection. 00:41:45.080 |
Do you think there's any fat there to the term? What do you mean? Just you're limited 00:41:50.760 |
by bandwidth, right? Okay. For some things, yes. So the bandwidth is roughly 10x less 00:41:58.160 |
than what you can get with NVLinked A100s. NVLinked A100s are going to have, and then 00:42:03.000 |
you can even get like full fabric and NVIDIA really pushes on that stuff, 600 gigabytes 00:42:07.280 |
per second, right? And PCIe 4, you're going to get 60, right? So you're getting 10x less. 00:42:14.480 |
That said, why do you need the bandwidth, right? And the answer is you need it for training 00:42:19.880 |
huge models. If you're training on a tiny box, your limit's going to be about 7 billion, 00:42:25.040 |
right? If you're training on big stuff, your limits could be like 70 billion, right? Okay. 00:42:29.720 |
You can hack it to get a bit higher. You can hack it like GPT hacked it to get a bit higher, 00:42:32.880 |
but like that 65 billion in LLaMA, like there's a reason they chose 65 billion, right? And 00:42:36.720 |
that's what can reasonably fit model parallel on GPUs, right? So yes, you are going to 00:42:43.880 |
end up training models. The cap's going to be like 7 billion. I actually heard this on 00:42:47.040 |
your podcast. I don't think that the best chatbot models are going to be the big ones. 00:42:51.720 |
I think the best chatbot models are going to be the ones where you had a thousand training 00:42:54.680 |
runs instead of one. And I don't think that the interconnect bandwidth is going to matter 00:43:00.320 |
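A rough rule of thumb behind those ceilings (assumed numbers; activation memory and sharding overheads ignored): training with Adam in mixed precision costs on the order of 16 bytes per parameter, so the available VRAM roughly sets the largest model you can train.

```python
bytes_per_param = 16      # ~2 (fp16 weights) + 2 (fp16 grads) + 12 (fp32 master + Adam moments)

tiny_box_vram  = 6 * 24e9                       # e.g. six 24 GB consumer GPUs
print(tiny_box_vram / bytes_per_param / 1e9)    # ~9  -> a ~7B model is about the ceiling

nvlink_node    = 8 * 80e9                       # one NVLinked 8x80 GB node
print(nvlink_node / bytes_per_param / 1e9)      # ~40 -> 65-70B already needs sharding tricks
```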
>> So what are we optimizing for instead of compute optimal? 00:43:05.640 |
>> So you're talking about this, the LLaMA-style models where you train for, like, 200x. 00:43:13.040 |
>> Yeah. So, okay. You can always make your model better by doing one of two things, right? 00:43:16.680 |
In a comma, we just have a strict limit on it. You can always make your model better 00:43:19.800 |
by training longer and you can always make your model better by making it bigger. But 00:43:23.720 |
these aren't the interesting ones, right? Particularly the making it bigger. Because 00:43:27.480 |
training it longer, fine. You know, you're getting a better set of weights. The inference 00:43:30.240 |
is the same. The inference is the same whether I trained it for a day or a week. 00:43:34.960 |
>> But the, okay, if it's 1 billion versus 10 billion, well, I 10x my inference too, 00:43:38.680 |
right? So I think that these big models are kind of, sure, they're great if you're research 00:43:43.120 |
labs and you're trying to like max out this hypothetical thing. 00:43:47.200 |
>> Yeah, yeah, yeah. But if you're like a startup or you're like an individual or you're 00:43:51.760 |
trying to deploy this to the edge anywhere, you don't need that many weights. 00:43:56.880 |
>> You actually don't want that many weights. 00:43:57.880 |
>> Yeah. Optimizing for inference rather than capabilities. 00:44:01.680 |
>> Yes, yes. And I think the inference thing, right? There's going to be so much more. Right 00:44:06.360 |
now, the ratio between like training and inference on clouds, I think it's only still like, I 00:44:10.680 |
think it's like 2 or 3x, right? It's 2 or 3x more inference, which doesn't make any 00:44:13.680 |
sense, right? There should be way more inference. 00:44:16.160 |
>> There should be 10 to 100x more inference in the world than training. But then also, 00:44:20.400 |
like, what is training, right? You start to see these things like LoRA, like, you're 00:44:24.720 |
getting kind of, it's kind of blurring the lines between inference and training. And 00:44:28.960 |
I think that that blurred line is actually really good. I'd like to see much more like 00:44:32.000 |
on-device training or on-device fine-tuning of the final layer, where we're pushing toward 00:44:36.920 |
the stuff that come, right? Like, why am I shipping a fixed model? I totally want this 00:44:40.120 |
model to fine-tune based on, like, how, you know, your left tire is flat, right? Like, 00:44:46.560 |
every time you cut the same turn because your left tire is flat, well, it should learn that, 00:44:50.920 |
>> So, would Comma pursue parameter-efficient fine-tuning? 00:44:53.200 |
>> Yeah. Yeah, yeah, yeah. We're, we're, we're, we're -- 00:44:56.280 |
>> We're, we're looking into stuff like that. I mean, Comma's already very parameter-efficient 00:44:59.440 |
because we have to, like, run this thing in a car and you have to, like, cool it and power 00:45:05.120 |
>> Yeah, yeah. And so, that's kind of like intelligence cluster you have in your home. 00:45:07.960 |
You see when the person is using third-party model, they load them locally and kind of 00:45:13.880 |
do the final fine-tuning. It kind of stays within the box. 00:45:16.560 |
>> Yeah. I think that that's one thing. That's one version of it for the privacy conscious. 00:45:21.560 |
I also see a world where you can have your tiny box, in its down cycles, mine FLOPcoin, 00:45:29.000 |
right? You know, not all, it turns out not all crypto is a scam. There's one way to tell 00:45:32.760 |
if crypto is a scam. If they're selling the coin before they make the product, it's a 00:45:38.000 |
>> If they have the product and then they sell the coin, it's maybe not a scam, right? So, 00:45:40.400 |
yeah, my thought is, like, each tiny box would let you, would have a private key on it. And 00:45:44.800 |
you have to do it this way. You can't just let anyone join because of Sybil attacks, right? 00:45:47.680 |
There's a real problem of, like, how do I ensure your data is correct? And the way that 00:45:51.320 |
I ensure your data is correct on the tiny net is if you ever send wrong data, you're 00:45:59.640 |
>> Your $15,000 hardware box is banned. So, you know, don't cheat. Obviously, if it messes 00:46:02.040 |
up, we'll forgive you. But I'm saying, like -- 00:46:04.240 |
>> Somebody's going to try to jailbreak your devices. 00:46:09.960 |
>> Well, there's just a private key on each device, right? Like, if you buy a tiny box 00:46:12.360 |
from the tiny corp, I give you a private key. It's in my back-end server, right? You want 00:46:15.320 |
to hack my server, that's illegal. Anything you want to do on the device, the device is yours. 00:46:19.560 |
>> Yeah, yeah. Have you looked into, like, federated training at all? 00:46:25.280 |
>> Yeah. So, I mean, okay, you're now -- there's, okay, there's orders of magnitude of federated 00:46:29.760 |
training. You mean, like, over the cloud and stuff? Over the internet? 00:46:32.960 |
>> Yeah, over the internet, but also distributed on a bunch of devices, right? 00:46:40.560 |
>> Because of your interconnect bandwidth, right? So, okay, at the high-end, you have 00:46:42.880 |
your interconnect bandwidth of NVLink, which is 600 gigabytes per second, right? 00:46:47.440 |
>> The tiny box has 60 gigabytes per second. And then your internet has 125 megabytes per 00:46:53.520 |
second, right? Not gigabits, 125 megabytes, right? So, okay, that's -- 00:46:59.640 |
>> That's how many orders of magnitude we're talking here? Like, from 60 gigabytes down to 125 megabytes? 00:47:05.280 |
>> Like, all right, that's over 100X. That's 400X, right? 00:47:08.960 |
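[Editor's note: the bandwidth gap in numbers, using the figures quoted here (600 GB/s NVLink, 60 GB/s tiny box, 125 MB/s internet). 60 GB/s over 125 MB/s is about 480x, which is the "400X" being gestured at.]

```python
nvlink   = 600e9    # NVLink, ~600 GB/s
tinybox  = 60e9     # tiny box interconnect, ~60 GB/s
internet = 125e6    # ~1 Gbit/s internet = 125 MB/s

print(nvlink / tinybox)     # 10.0   -> NVLink down to the tiny box
print(tinybox / internet)   # 480.0  -> tiny box down to the internet
print(nvlink / internet)    # 4800.0 -> NVLink down to the internet
```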
>> So, like, no. But what you can do is inference, right? Like, for inference, you don't care. 00:47:14.200 |
>> For inference, there's so little bandwidth at the top and the bottom of the model that, 00:47:17.880 |
like, yeah, you can do federated inference, right? And that's kind of what I'm talking 00:47:21.480 |
about. There's also interesting things to push into, like, you're like, but, okay, what 00:47:26.520 |
if you want to run closed-source models? This stuff gets kind of interesting, like, using 00:47:33.320 |
>> But then someone might jailbreak my device. So, you know, maybe we don't try to do that. 00:47:37.440 |
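[Editor's note: the point about there being very little bandwidth at the top and bottom of the model, made concrete. If you split a model across machines, only one layer's activations cross the link per token, not weights or gradients. The numbers below (a 4096-wide fp16 residual stream) are assumptions for illustration, not measurements of any particular model.]

```python
d_model, bytes_per_value = 4096, 2        # assumed hidden width, fp16 activations
per_token = d_model * bytes_per_value     # 8192 bytes crossing the cut per token

internet = 125e6                          # the 125 MB/s link from above
print(internet / per_token)               # ~15,000 tokens/s of link capacity
```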
>> Yeah. What's, like, the enterprise use case? Do you see companies buying a bunch of these? 00:47:43.160 |
>> So, the tiny box is, like, the first version of what we're building. But what I really 00:47:47.800 |
want to do is be on the absolute edge of flops per dollar and flops per watt. These are the 00:47:52.960 |
two numbers that matter. So, the enterprise use case is you want to train, like, Comma. 00:47:57.520 |
So, Comma just built out a new compute cluster. It's about a person and a half. So, you know, 00:48:06.160 |
>> A person is 20 petaflops. It's about 30 petaflops. We built out a little compute cluster. 00:48:12.080 |
And, you know, we paid double what you theoretically could per flop, right? You theoretically could 00:48:17.840 |
pay half per flop if you designed a bunch of custom stuff. And, yeah, I mean, I could 00:48:22.120 |
see that being, you know, tiny Corp. Comma is going to be the first customer. I'm going 00:48:26.040 |
to build a box for Comma. And then I'm going to show off the box I built for Comma and 00:48:29.960 |
be like, okay, like, do you want one? I sell $250,000 training computers. Or how 00:48:34.280 |
much is one H100 box? It's 400 grand? Okay. I'll build you a 400 grand training computer 00:48:39.360 |
and it'll be 10x better than that H100 box. Again, not for every use case. For some, you 00:48:45.520 |
need the interconnect bandwidth. But for 90% of most companies' model training use cases, 00:48:50.120 |
the tiny box will be 5x faster for the same price. 00:48:54.240 |
Awesome. You mentioned the person of compute. How do we build a human for $20 million? 00:48:59.560 |
Well, it's a lot cheaper now. It's a lot cheaper now. So, like I said, Comma spent about half 00:49:05.960 |
a million on our person and a half. What are some of the numbers people should think of 00:49:12.400 |
when they compare compute to like people? So, GPT-4 was 100 person-years of training. That's 00:49:18.600 |
more like on the timescale. 20 petaflops is one person. I think you, right now, the math 00:49:24.840 |
was that for the price of the most expensive thing we build, which is the International 00:49:28.840 |
Space Station, we could build one Tampa. Yeah, one Tampa of compute. 00:49:33.600 |
Yeah, which is 400,000 people. Yeah, we could build. So, like the biggest 00:49:39.880 |
training clusters today, I know less about how GPT-4 was trained. I know some rough numbers 00:49:43.960 |
on the weights and stuff, but Llama- A trillion parameters? 00:49:48.640 |
Well, okay. So, GPT-4 is 220 billion in each head, and then it's an eight-way mixture model. 00:49:53.280 |
So, mixture models are what you do when you're out of ideas. So, it's a mixture model. They 00:49:58.360 |
just train the same model eight times, and then they have some little trick. They actually 00:50:01.280 |
do 16 inferences, but no, it's not like- So, the multimodality is just a vision model 00:50:06.400 |
kind of glommed on? I mean, the multimodality is like obvious 00:50:10.440 |
what it is too. You just put the vision model in the same token space as your language model. 00:50:13.600 |
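[Editor's note: a toy sketch of the eight-way mixture-of-experts idea described above, several expert weight sets plus a router that decides which ones to run and how to mix their outputs. Purely illustrative; the sizes, the top-2 routing, and the learned gate here are generic MoE conventions, not GPT-4's actual, unconfirmed architecture.]

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

n_experts, d = 8, 16
experts = [np.random.randn(d, d) * 0.02 for _ in range(n_experts)]   # eight expert weight sets
router  = np.random.randn(d, n_experts) * 0.02                       # small gating network

def moe_forward(x, top_k=2):
    gate = softmax(x @ router)                  # how much the router trusts each expert
    picked = np.argsort(gate)[-top_k:]          # only run the top-k experts per token
    out = sum(gate[i] * (x @ experts[i]) for i in picked)
    return out / gate[picked].sum()             # renormalize over the experts actually run

y = moe_forward(np.random.randn(d))
```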
Oh, did people think it was something else? No, no, the mixture has nothing to do with 00:50:16.360 |
the vision or language aspect of it. It just has to do with, well, okay, we can't really 00:50:20.160 |
make models bigger than 220 billion parameters. We want it to be better. Well, how can we 00:50:25.280 |
make it better? Well, we can train it longer, and okay, we've actually already maxed that 00:50:30.280 |
out. We're getting diminishing returns there. Okay. 00:50:33.080 |
A mixture of experts. Yeah, a mixture of experts. We'll train eight 00:50:35.240 |
of them, right? So, all right. So, you know, the real truth is whenever a company is secretive, 00:50:41.680 |
with the exception of Apple, Apple's the only exception, whenever a company is secretive, 00:50:45.340 |
it's because they're hiding something that's not that cool. And people have this wrong 00:50:49.000 |
idea over and over again that they think they're hiding it because it's really cool. It must 00:50:52.240 |
be amazing. It's a trillion parameters. No, it's a little bigger than GPT-3, and they 00:50:55.960 |
did an eight-way mixture of experts. Like, all right, dude, anyone can spend eight times 00:50:59.160 |
the money and get that. All right. But yeah, so coming back to what I think is actually 00:51:07.560 |
going to happen is, yeah, people are going to train smaller models for longer and fine-tune 00:51:11.960 |
them and find all these tricks, right? Like, you know, I think OpenAI used to publish 00:51:17.480 |
stuff on this, you know, when they would publish stuff about how much better the training has 00:51:23.680 |
gotten given the same, holding compute constant. And it's gotten a lot better, right? Compare 00:51:32.400 |
And now we have like- Because you're finding algorithms like flash 00:51:34.960 |
attention. Yeah. Well, flash attention. Yeah. Flash attention 00:51:40.480 |
is the same compute. Flash attention is an interesting fact where it's actually the identical 00:51:43.160 |
compute. It's just a more efficient way to do the compute. But I'm even talking about 00:51:46.320 |
like, look at the new embeddings people are using, right? They used to use this like boring 00:51:53.040 |
old embeddings. Now, like Llama uses that complex one, and that was like ALiBi. I'm not up to 00:51:56.720 |
date on all the latest stuff, but those tricks give you so much. 00:52:00.640 |
There's been a whole round trip with positional embeddings. I don't know if you've seen this 00:52:06.520 |
Like you need them, you need rotational, and then you don't need them. 00:52:09.080 |
I haven't followed exactly. I mean, you quickly run into the obvious problem with positional 00:52:13.320 |
embeddings, which is you have to invalidate your KV cache if you run off the context. 00:52:17.480 |
So that's why I think these new ones, they're playing with them, but I'm not that. I'm not 00:52:22.800 |
an expert on like the latest up-to-date language model stuff. 00:52:26.120 |
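[Editor's note: the "complex" embedding Llama actually uses is rotary position embedding (RoPE); ALiBi is a different relative-position trick used by some other models. A minimal numpy sketch of RoPE: each pair of channels in the queries and keys is rotated by an angle proportional to the token's position, so the attention score depends on relative offsets, which plays more nicely with a KV cache than learned absolute embeddings. Shapes here are arbitrary.]

```python
import numpy as np

def rope(x, base=10000.0):
    # x: (seq_len, d) queries or keys, d even
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, None]               # absolute positions
    inv_freq = base ** (-np.arange(0, d, 2) / d)    # one frequency per channel pair
    ang = pos * inv_freq                            # (seq_len, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin              # rotate each (even, odd) channel pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = rope(np.random.randn(8, 64))
k = rope(np.random.randn(8, 64))
scores = q @ k.T   # attention logits now encode relative position
```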
Yeah. I mean, we have what we do at Comma, and I 00:52:33.940 |
What are some of the things, I mean, that people are getting wrong? So back to autonomous 00:52:38.140 |
driving, there was like the whole like LiDAR versus vision thing. It's like people don't 00:52:42.460 |
get into accidents because they cannot see well. They get into accidents because they 00:52:45.660 |
get distracted and all these things. What are, do you see similarities today on like 00:52:49.940 |
the path to AGI? Like are there people, like what are like the- 00:52:53.780 |
Nothing I say about this is ever going to compete with how Rich Sutton stated it. Rich 00:52:57.980 |
Sutton is the writer of the Reinforcement Learning textbook and of The Bitter Lesson. Nothing I say is ever going 00:53:01.800 |
to compete with, The Bitter Lesson is way better than any way I'm going to phrase this. 00:53:05.040 |
Just go read that. And then like, I'm sorry it's bitter, but you actually just have to 00:53:08.760 |
believe it. Like over and over again, people make this mistake. They're like, oh, we're 00:53:13.240 |
going to hand engineer this thing. We're going to hand, no, like stop wasting time. 00:53:17.160 |
Which is, I mean, OpenAI is not taking The Bitter Lesson. 00:53:23.640 |
They were leaders in deep learning for a long, long, long time. 00:53:27.680 |
But you're telling me that GPT-4 is not, yeah. 00:53:29.340 |
Well, OpenAI was the absolute leader to the thesis that compute is all you need. 00:53:33.980 |
Right? And there's a question of how long this thesis is going to continue for. It's 00:53:36.900 |
a cool thesis. And look, I think I would be lying along with everybody else. I was into 00:53:41.540 |
language models like way back in the day for the Hutter Prize. I got into AI through the 00:53:45.820 |
Hutter Prize. Like 2014, I'm trying to build compressive models of Wikipedia. And I'm like, 00:53:50.660 |
okay, why is this so hard? Like what this is, is a language model, right? And I'm playing 00:53:54.180 |
with these like Bayesian things. And I'm just like, oh, but like, I get it. Like, it needs 00:53:58.820 |
to be like, like, it's like, I have two data points and they're like almost the same, but 00:54:02.860 |
how do I measure that almost, right? I just like, you know, wrap my head around. I couldn't 00:54:07.660 |
like, like wrap my head around this. And this was around the time Karpathy released the 00:54:10.500 |
first like RNN that generated the Shakespeare stuff. And I'm like, okay, I get it. Right? 00:54:17.380 |
It's neural networks that are compressors. Now this isn't actually, you can't actually 00:54:19.980 |
win the Hutter Prize with these things because the Hutter Prize is MDL. It's the size 00:54:24.380 |
of the model plus the size of the encoding. So yeah, you can't, I mean, probably 00:54:30.460 |
now you can because it's gotten so good, but yeah, back in the day you kind of couldn't. 00:54:35.140 |
So I was like, okay, cool. Like this is what it is. I kind of get it. Yeah. I mean, I think 00:54:39.760 |
I didn't expect that it would continue to work this well. I thought there'd be real 00:54:44.460 |
limits to how good autocomplete could get. That's fancy autocomplete. But yeah, no, like 00:54:49.780 |
it works. It works well. So like, yeah. What is OpenAI getting wrong? Technically, not 00:54:57.060 |
that much. I don't know. Like if I was a researcher, why would I go work there? 00:55:05.820 |
No, look, I don't, I don't, this is, this is my technical stuff. I don't really want 00:55:10.180 |
to harp on this, but like why go work at OpenAI when you could go work at Facebook, right? 00:55:14.140 |
As a researcher. Like OpenAI can keep ideologues who, you know, believe ideological stuff and 00:55:19.660 |
Facebook can keep every researcher who's like, dude, I just want to build AI and publish 00:55:26.740 |
Awesome. Yeah. Any other thoughts, tiny corp, bounties? 00:55:31.780 |
Yeah. So we have, you know, I've been thinking a lot about like what it means to hire in 00:55:39.100 |
today's world. What actually is the like core? Okay. Look, I'm a believer that machines are 00:55:46.060 |
going to replace everything in about 20 years. So, okay. What is that, what is that thing 00:55:54.220 |
that people can still do that computers can't, right? And this is a narrowing list, but like, 00:56:00.740 |
you know, back in the day, like imagine I was starting a company in 1960, right? Oh, 00:56:04.460 |
we're going to have to hire a whole bunch of calculators in the basement to do all the, 00:56:08.180 |
you know, math to support the, dude, have you heard about computers? Why don't we just 00:56:12.500 |
buy a few of those? Oh, oh wow, man. You're right. So like, I feel like that's kind of 00:56:19.180 |
happening again. And I'm thinking about, I will post in my discord. I'll be like, okay, 00:56:22.980 |
who wants to, like, okay. I just changed my unary ops. They used to be log and exp, base e. 00:56:28.500 |
I changed them to be log2 and exp2 because hardware has log2 and exp2 accelerators. 00:56:33.940 |
Yeah. And of course you can use change of base. It's one multiply to get it back 00:56:37.260 |
to e, but like I made the primitives log2 and exp2. 00:56:42.220 |
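[Editor's note: the change-of-base identity being referenced, as a runnable check. With exp2 and log2 as the hardware primitives, base-e exp and log each cost exactly one extra multiply.]

```python
import math

LOG2_E = math.log2(math.e)     # ~1.4427
LN_2   = math.log(2.0)         # ~0.6931

def exp_via_exp2(x):
    return 2.0 ** (x * LOG2_E)     # exp(x) = exp2(x * log2(e))

def log_via_log2(x):
    return math.log2(x) * LN_2     # log(x) = log2(x) * ln(2)

assert abs(exp_via_exp2(3.0) - math.exp(3.0)) < 1e-9
assert abs(log_via_log2(3.0) - math.log(3.0)) < 1e-12
```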
I just posted in the discord. I'm like, could someone put this pull request up? Right. And 00:56:45.140 |
someone eventually did and I merged it, but I'm like, this is almost to the level where 00:56:48.940 |
models can do it. Right. We're almost to the point where I can say that to a model and 00:56:52.620 |
the model can do it. Have you tried? Yeah, I'm, I don't know. I'm like, I'm, I think 00:57:02.740 |
it went further. I think autocomplete went further than I thought it would, but I'm also 00:57:07.060 |
relatively unimpressed with the chatbots, with what I've seen from the language models 00:57:12.140 |
like there. The problem is if your loss function is categorical cross-entropy on the internet, 00:57:19.700 |
your responses will always be mid. Yes. Mode collapse is what I call it. I don't know. 00:57:24.220 |
Maybe I'm not even talking about mode collapse. You're actually trying to predict the like, 00:57:27.300 |
like, look, I rap, I'm a hobbyist rapper. And like, when I try to get these things to 00:57:31.500 |
write rap, the raps sound like the kind of raps you read in the YouTube comments. Nursery 00:57:34.820 |
school. Yeah. It's like, all right, great. You're right. Box with Fox. Sick rhyme, bro. 00:57:40.740 |
You know, you know, and Drake is rhyming. Give it up for me with napkins and cutlery. 00:57:45.940 |
Right. Like, like, all right, come on. We've got like this thing about orange, like orange 00:57:50.220 |
is famous. Yeah, yeah, yeah, yeah. But now, of course, you know, four inch screws and 00:57:54.020 |
orange juice is in, is in GPT's training corpus. But yeah, so I think it went further than 00:58:01.420 |
like everyone kind of thought it would. But the thing that I really want to see is like 00:58:04.380 |
somebody put 10 LLMs in a room and have them discuss the answer before they give it to 00:58:08.500 |
me. You can actually do this. Right. And I think the coding things have to be the same 00:58:12.540 |
way. There is no coder alive, no matter how good you are, that sits down. Well, I'm going 00:58:16.140 |
to start at cell A1 and type my program and then I'm going to press run and it's going 00:58:20.940 |
to work. No one programs like that. So why do we expect the models to write? So so there's 00:58:26.180 |
there's a lot that like still needs to be done. But, you know, at the tiny corp, I want 00:58:29.740 |
to be on the cutting edge of this, too. I want to be like program generation. I mean, 00:58:34.220 |
what is TinyGrad? It's a compiler. It generates programs, generates the fastest program that 00:58:37.260 |
meets the spec. Right. Why am I not just having ML do that? So, you know, it's kind of a you 00:58:42.940 |
have to exist fluidly with the machines. And I come around on a lot of stuff. I'm like, 00:58:48.860 |
wait, TinyGrad, TinyCorp should be a remote company. I can't do this in person. Really? 00:58:53.500 |
Yeah. Like, oh, Comma makes sense to be in person. Like Comma, sure. Yeah, we'll get 00:58:57.580 |
an office in San Diego. Like, but that's a six year old company. Right. And it works and it works 00:59:01.640 |
for a certain type of people and certain type of culture. But what's going to be different 00:59:04.260 |
this time? OK, remote. But now it's remote. And now I'm getting these like people who 00:59:07.700 |
apply and I'm like, I literally have a thousand applications. I'm not calling you to do a 00:59:12.580 |
technical screen. I can't really tell anything from a technical screen. What am I going to 00:59:16.020 |
do? Make a code on a whiteboard? Like, bring up bring up a shared notebook document so 00:59:20.100 |
we could. Oh, like, that's not going to work. OK. So then I move to the next thing. We do 00:59:24.300 |
this at Comma with good success, programming challenges. I've also found them to be like 00:59:28.540 |
completely non-predictive. I found one thing to actually be predictive and it's wait a 00:59:34.300 |
second. Just write code in TinyGrad. It's open source. Right. And so, you know, I'm 00:59:39.340 |
talking to a few people who've been contributing and like contribute or, you know, the job's 00:59:44.020 |
not for you. But you can do it remote. And it's like it's a chill job. Like you're not 00:59:47.340 |
you're like, oh, yeah, well, I work for the tiny corp. Well, you're writing MIT licensed 00:59:51.060 |
software like you see what it's doing. Right. Like, well, just I think think of it maybe 00:59:54.540 |
more of like a stipend than a salary and then also some equity. Look, you know, I get rich. 00:59:58.420 |
You all get rich. Yeah. How do you think about agents and kind of like thinking of them as 01:00:06.580 |
people versus like job to be done? Sean built this thing called smol developer. And then 01:00:11.860 |
it's in the same vein, like the human in the loop with the language model and just iterating 01:00:17.220 |
while you write code. I think I think that's that's absolutely where it goes. And there's 01:00:20.560 |
like, it's not like one thing. It's like there's smol interpreter, there's like smol 01:00:24.340 |
debugger. It's kind of like all these different jobs to be done. It's a smol world. Yeah. 01:00:28.340 |
It's I know this is like the small pockets. It's like small. I mean, tiny corp. So we're 01:00:33.020 |
on the same wavelength. How do you think about that? Do you think people will have a human 01:00:37.500 |
like interaction with like, oh, this is like the AI developer or like is it I'm the human 01:00:41.980 |
being supercharged by the AI tools? Oh, I think it's much more like I'm the human supercharged 01:00:48.140 |
by the AI tools. I think that like coding is tool complete, right? Like driving is not 01:00:52.780 |
tool complete. Right. Like driving is just like like we hire people to drive who are 01:00:56.180 |
like below the API line. Right. There's an API line in the world. Right. Love that. Yeah. 01:01:00.060 |
There's an API line in the world. And like you can think like Uber is a really clear 01:01:02.740 |
example. Right. There's the people below the API line and the people above the API line. 01:01:06.220 |
And the way you can tell if you're below or above, by the way, is is your manager a computer? 01:01:10.540 |
Right. Who's the manager of the Uber driver or computer? Does the machine tell you what 01:01:13.060 |
to do? Or do you tell machines? Exactly. Exactly. So coding is tool complete. Right. Coding 01:01:19.820 |
is tool complete. Coding is above the API line. So it will always be tools supercharging 01:01:25.100 |
your coding workflow. And it will never be you performing some like task like, OK, well, 01:01:32.460 |
I can do everything except for actually starting a docker container. Like it just doesn't make 01:01:36.340 |
any sense. Right. Yeah. So we'll always be sort of tools. And, you know, look, we see 01:01:39.780 |
the same stuff with all the people are like stable diffusion is going to replace artists 01:01:44.420 |
or whatever. It's like, dude, like it's going to create new artists. What did Photoshop 01:01:47.780 |
replace artists? Like, what are you talking about? Right. Like, you know, a real artist's 01:01:53.300 |
finger paint. I can't use brushes. Brushes are, you know, brushes are going to replace 01:01:57.660 |
all the. OK. Like, I just can't like it's all just tools and the tools are going to 01:02:01.900 |
get better and better and better. And then eventually, yes, the tools are going to replace 01:02:04.900 |
us. But, you know, that's still 20 years away. So, you know, I've got a company in the meantime. 01:02:10.420 |
So I've written about the API line before, and I think that's from Venkatesh. I don't 01:02:13.820 |
know if you I definitely took it from someone. It's definitely not mine. VGR. But I also 01:02:18.060 |
have speculated a higher line than that, which is the Kanban board. Like who tells the programmers 01:02:23.180 |
what to do? Right. So are you above or below the Kanban board? Has that evolved your management 01:02:29.540 |
thinking? Yeah. Like that's sort of what I mean. Like it's like I'm just going to describe 01:02:33.780 |
the pull request in two sentences and then like, yeah. So you are running the Kanban 01:02:37.740 |
board or the bounties? Yes. Yeah. The bounties are the Kanban board. Exactly. And that is 01:02:42.300 |
kind of the high level. And then like, yeah, we'll get AIs to fill in some and we'll get 01:02:46.380 |
people to fill in others. Yeah. And that's also what it means to be like full time at 01:02:50.540 |
a tiny corp. Right. Would you start and I wrote this up pretty concretely. I'm like, 01:02:54.260 |
OK, step one is you do bounties for the company. Step two is you propose bounties for the company. 01:02:58.460 |
You don't pay them, obviously. We pay them. But you propose them. And I'm like, yeah, 01:03:02.380 |
that's a good bounty that like helps with the main workflow of the company. And step 01:03:06.660 |
three is you get hired full time. You get equity. We all know maybe you're rich. What 01:03:11.620 |
else are you designing differently about the employee experience? I mean, I'm very much 01:03:16.780 |
a like, you know, some people really like to like, like keep a separation. Right. Some 01:03:20.900 |
people really like to keep a separation between like employees and management or customers 01:03:25.940 |
and employees like a comma. You know, the reason I do the DevKit thing, it's like, dude, 01:03:30.300 |
you buy a comma thing, you're an employee of the company, like you're just part of the 01:03:33.180 |
company. It's all the same thing. There's no like secrets. There's no dividing lines. 01:03:37.540 |
There's no like it's all a spectrum for like, you know, down here at the spectrum, like 01:03:41.220 |
you pay and then up here at the spectrum you get paid. You understand this is the same 01:03:44.220 |
spectrum of college, right? Like for undergrad, you pay and then you get up here to like, 01:03:48.700 |
you know, doing a Ph.D. program, you get paid. OK, well, cool. Welcome to the, you know. 01:03:55.660 |
What about comma bodies? You know, you mentioned a lot of this stuff is clearly virtual, but 01:04:00.660 |
then there's below the API line you actually need. 01:04:03.540 |
This is a thing that's been announced. Comma bodies? 01:04:06.620 |
We sell them. You can buy them. They're a thousand bucks on our website. 01:04:09.820 |
OK, no, no, no. I'm thinking about like what Tesla announced with like the humanoid robot. 01:04:14.180 |
It's the same thing, except of course we made the Comma version of it. Tesla uses 20 actuators. 01:04:14.180 |
We use two, right? Like how do you how do you build the simplest possible thing that 01:04:23.100 |
can like turn the robotics problem into entirely a software problem? So right now it is literally 01:04:28.000 |
just a comma three on a pole with two wheels. It balances, keeps the comma three up there. 01:04:35.620 |
And like there's so much you could do with that already. Like this should replace you. 01:04:39.940 |
How many security guards could this replace? Right. If this thing could just competently 01:04:43.660 |
wander around a space and take pictures and, you know, focus in on things, send you a text 01:04:49.940 |
message when someone's trying to break into your building, you know, like like this could 01:04:53.100 |
already do so much, of course. But the software is not there yet. Right. So how do we turn 01:04:57.940 |
robotics into a thing where it's very clearly a software problem? You know, the people don't 01:05:01.360 |
accept that self-driving cars are a software problem. Like, I don't I don't know what to 01:05:04.980 |
tell you, man. Like literally just watch the video yourself and then drive with a joystick. 01:05:09.900 |
Right. Yeah. Can you drive? And we've actually done this test. We've actually done this test 01:05:13.500 |
where we've had someone. OK, you just watch this video and here's a joystick and you got 01:05:16.840 |
to drive the car. And of course, they can drive the car. Yeah. It takes a little bit 01:05:19.660 |
of practice to get used to that joystick. But the problem is all in the model. Right. So 01:05:24.820 |
I can now make the model better. Yeah. Specifically, anything in computer vision that you think 01:05:30.860 |
our second most popular episode ever was about segment anything coming out of Facebook, which 01:05:35.300 |
is as far as I understand, the state of the art in computer vision. What are you hoping 01:05:39.420 |
for there that you need for Comma? I think a Segment Anything, like the large, large 01:05:45.060 |
YOLOs or not. I've used like large YOLOs and I'm super impressed by them. Yeah. I think 01:05:49.740 |
it's solved. I got to check out segment anything. I don't think it's a distinct problem. Right. 01:05:53.860 |
OK, here's something that I'm interested in. All right. We have great LLMs. We have great 01:05:57.780 |
text to speech models and we have great speech to text models. OK, so why can I not why can 01:06:01.740 |
I not talk to an LLM like I'd have a normal conversation with it? You can with the latency 01:06:05.580 |
of like two seconds every time. Right. Why? Why isn't this? And then it feels so unnatural. 01:06:11.540 |
It's just like staccato. Like, I don't like the RLHF models. I don't like the tuned versions 01:06:16.220 |
of them. I think that they become you take on the personality of a customer support agent. 01:06:21.540 |
Oh, come on. You know, I like, I like Llama more than ChatGPT. ChatGPT's personality 01:06:27.900 |
just grated on me. Whereas Llama was like cool. I write a little bit of pretext paragraph. 01:06:32.660 |
I can put you in any scenario I want. Right. Like that's interesting to me. I don't want 01:06:36.620 |
some like, you know. Yeah. So, yeah, I think there is really no like distinction between 01:06:44.980 |
computer vision and language and any of this stuff. It's all eventually going to be fused 01:06:50.700 |
into one massive. So to say computer vision is solved. Well, it doesn't make any sense 01:06:54.540 |
because what's the output of a computer vision model? Segmentation? Like what a weird task. 01:06:58.740 |
Right. Who cares? OCR. Who cares? I don't care if you can segment which pixels make 01:07:03.420 |
up that laptop. I care if you can pick it up. Interact with the real world. And you're 01:07:10.060 |
going to have the local cluster. You're going to have the body. Yeah. Yeah. I think I think 01:07:15.040 |
that's kind of where that goes. So maybe we can paint the future of like the year is 2050. 01:07:23.020 |
You've achieved all you wanted at Tiny Corp. What is what is the AI enabled future like? 01:07:28.700 |
Well, Tiny Corp is the second company. Comma was the first. Comma builds the hardware infrastructure. 01:07:33.860 |
Tiny Corp builds a software infrastructure. The third company is the first one that's 01:07:36.940 |
going to build a real product. And that product is AI Girlfriend. No, like I'm dead serious. 01:07:42.940 |
Right. Like this is the dream product. Right. This is the absolute dream product. Girlfriend 01:07:47.820 |
is just the like stand in. Well, no, it's not a stand in. No, no, no, no. I actually 01:07:52.060 |
mean it. Right. So I've been wanting to merge with a machine ever since I was like mad little 01:07:56.500 |
like, you know, how do I merge with the machine? Right. And like, you can look at like in like 01:08:00.260 |
a maybe the Elon style we're thinking about is Neuralink. Right. Like, I don't think we 01:08:04.780 |
need any of this. Right. Some of your friends, maybe they get into relationships and you 01:08:09.820 |
start thinking of, you know, them and their partner is the same person. You start thinking 01:08:13.460 |
of them as like one person. I mean, they are kind of like merged, right? Like humans can 01:08:18.460 |
just kind of do this. It's so cool. It's this ability that we already have. It's only to 01:08:23.140 |
put, you know, electrodes in my brain to merge with a machine. I need an AI Girlfriend. Right. 01:08:29.460 |
So that's what I mean. Like this is this is the third product. This is the third company. 01:08:34.500 |
And yeah, in 2050, I mean, like, it's so hard. I like maybe I can imagine like 2035. I don't 01:08:41.860 |
even know 2050. But like, yeah, 2035. Like, yeah, that'd be really great. Like I have 01:08:45.980 |
this like kind of, you know. So in terms of merging, like, isn't it, shouldn't you work 01:08:51.340 |
on brain upload rather than AI Girlfriend? But I don't need brain upload. Right. I don't 01:08:55.580 |
need brain upload either. Like, there's there's thousands of hours of me on YouTube. Right. 01:08:59.740 |
Yes. If you might, how much of my brain's already uploaded? That's only the stuff that 01:09:03.420 |
you voice. Yeah, it's not that different. It's not that different. Right. You really 01:09:07.780 |
think a powerful, you really think a model with, you know, an exaflop of compute couldn't 01:09:12.380 |
extract everything that's really going on in my brain. I'm a pretty open person. Right. 01:09:16.340 |
Like, I'm not running a complex filter. Humans can't run that complex of a filter. Yeah. 01:09:19.740 |
Like humans just can't. Like, this is actually a cool quirk of biology. It's like, well, 01:09:24.460 |
humans can't lie that well. Yeah. Yeah. So is it good or bad to put all of your stream 01:09:30.480 |
of consciousness out there? I mean, I think it's good. I mean, I don't know. I'm streaming 01:09:37.140 |
every day. I want to live forever. We said off mic that we may be the first immortals. 01:09:43.100 |
Right. Yeah. Yeah. Like, this is how you this is how you live forever. It's a question of, 01:09:47.900 |
OK, how many weights do I have? Right. OK. Let's say I have a trillion weights. It's 01:09:51.900 |
talking about a terabyte, 100 terabytes here. But it's not really 100 terabytes. Right. 01:09:55.840 |
Because it's a complexity. How much redundancy is there in those weights? So, like, maximally 01:09:59.460 |
compressed, how big is the weight file for my brain? Quantize it whatever you want. Quantization 01:10:05.620 |
is a poor man's compression. I think we're only talking really here about like maybe 01:10:12.100 |
a couple of gigabytes. Right. And then if you have like a couple of gigabytes of true 01:10:15.860 |
information of yourself up there. Cool, man. Like, what does it mean for me to live forever? 01:10:21.540 |
Like, that's me. Yeah, no, I think that's good. And I think like the there's a bit of 01:10:27.100 |
like a professionalization of social media or like a lot of people only have what's like 01:10:32.660 |
PC out there, you know, and I feel like, you're going to get, come back to the ChatGPT thing. 01:08:36.540 |
Right. You're going to train a model and like everything that's public about a lot of people 01:10:40.620 |
and it's like no one's going to run their model and they're going to die. I see on social 01:10:46.420 |
media your life could depend on it. We have a segment. So we're moving on to a what would 01:10:55.420 |
normally be called the lightning round. But just just general takes because you're a generally 01:10:58.820 |
interesting person with many other interests. What does the Goddess of Everything Else mean 01:11:03.820 |
to you? Oh, it means that AI is not really going to kill us. Really? Of course. Tell us more. 01:11:14.460 |
Look, Lex asked me this, like, is AI going to kill us all? And I was quick to say yes, 01:11:20.580 |
but I don't actually really believe it. I think there's a decent chance. I think there's 01:11:23.980 |
a decent chance that AI kills 95 percent of us. OK. But they saw on your Twitch streams 01:11:29.980 |
that you're with them, so they're not going to. No, I don't think I actually I don't also 01:11:34.540 |
think it's AI. Like I think the AI alignment problem is so misstated. I think it's actually 01:11:38.220 |
not a question of whether the computer is aligned with the company who owns the computer. 01:11:42.100 |
It's a question of whether that company is aligned with you or that government's aligned 01:11:44.820 |
with you. And the answer is no. And that's how you end up dead. But so what the goddess 01:11:49.580 |
of everything else means to me is like the complexity will continue. Paper clippers don't 01:11:54.980 |
exist. You know, there are forces. The paper clipper is cancer. The paper clipper is really 01:11:59.340 |
just a perfect form of cancer. And the goddess of everything else says, yeah, but cancer 01:12:04.460 |
doesn't win. You know? Yeah. It's a beautiful story for those who haven't heard it. And 01:12:09.220 |
you read it out and I listened to it. Yeah. Good. What else we have here? Pick a question. 01:12:14.940 |
So many. Yeah. What are you grateful for today? Oh, man. I mean, it's all just like, I've been 01:12:23.100 |
thinking about this stuff forever, and it's actually like happening and 01:12:28.700 |
it's happening in an accessible way, too. I guess that's what I'm really grateful for. 01:12:32.420 |
It's not like, like, AI is not some Manhattan Project style, you know, behind-closed-doors 01:12:37.860 |
thing. I'll fight really hard to keep it that way. You know, that's, that's, I'm grateful 01:12:45.300 |
for just just how much is released out there and how much I can just learn and stay up 01:12:50.140 |
to date. And I guess I'm grateful to the true fabric of reality that, you know, I didn't 01:12:54.980 |
need differential equations to understand it. Like I don't need you don't need you don't 01:12:58.540 |
need some like like like there's there's I've tried to do. There's a limit to my to my math 01:13:03.580 |
abilities. I can do most undergrad math, but I took some grad math classes. And OK, now 01:13:07.580 |
we're getting to the end of what I can do. And it's just the actual like end of what 01:13:11.460 |
I can do. Like I'm limited by my brain. But, you know, ML stuff, you need high school math. 01:13:17.820 |
Yeah, I could do nothing. You know what I mean? When I went to my major, seventh grade, 01:13:22.500 |
like it's all easy. You need more electrical engineering than you need high school math 01:13:25.660 |
early. Yeah, well, you need electrical engineering to like build the machines. But even that, 01:13:30.020 |
like these machines are simpler than the machines that have existed before. The compute stack 01:13:34.660 |
looks really nice. So, you know, yeah, I just I'm grateful that it's all happening and I 01:13:38.460 |
get to understand it, be here. Yeah. Yeah. John Carmack mentioned there's about six insights 01:13:44.620 |
we have left. Do you have an intuition for what some of the paths people should be taking? 01:13:48.860 |
Obviously, you're working on one. What are some of the other branches of the tree that 01:13:53.460 |
people should go under? I don't think I'm working on one of the six insights. I don't 01:13:56.860 |
think TinyGrad's any one of the six insights. Something I really like that Elon does, and 01:14:01.420 |
I try to take it from, try to be inspired by it, is look at the boring tunnel machine 01:14:10.140 |
and ask how you can build a 10x cheaper one. All right. Look at the rocket. How can I build 01:14:13.580 |
a 10x cheaper one? Look at the electric car and say, how can I build a 10x cheaper, like 01:14:17.700 |
cheaper or, you know, can go further or whatever, whatever, whatever. Right. You just do the 01:14:21.380 |
straight up physics math. Right. Like I'm trying to do the same thing with with ML frameworks. 01:14:25.540 |
Right. And in doing so, making sure that this stuff remains accessible. Right. You could 01:14:31.560 |
imagine a world where if Google TPUs were actually the ultimate, if Google TPUs were 01:14:37.180 |
actually the best training things. I mean, actually, you know, I'm kind of grateful for 01:14:39.420 |
NVIDIA. Right. Like, because if Google TPUs were the ultimate, now you have this huge 01:14:42.660 |
closed source compiler in between XLA and the hardware. And yeah, that's just a really 01:14:49.220 |
bad thing. So, I mean, something that is somewhat upsetting about the TinyGrad is that it is 01:14:53.300 |
trying to prevent downside, but it's not all trying to prevent downside. Like we're also 01:14:57.640 |
building computers and we're going to build some awesome, powerful, cheap computers along 01:15:02.260 |
the way. So, no, I'm not really working directly on any of the six tricks. I also think the 01:15:06.260 |
six tricks are kind of going to be like luck. I think it's going to be like, you know, please 01:15:11.020 |
tell me more about what covariate shift is and how that inspired you to come up with 01:15:14.140 |
batch normalization. Please tell me more about why it's a transformer and it has a query, 01:15:18.380 |
a key and a value. Right. Like Schmidhuber described it better in fast weights. You know, 01:15:23.140 |
like, I mean, my theory about why transformers work has nothing to do with this attention 01:15:27.540 |
mechanism and just the fact that like it's semi-weight sharing. Right. Like because the 01:15:31.180 |
weight matrix is being generated on the fly, you can, you can like compress the weight 01:15:35.180 |
matrix. Right. Like this is what that, there's a, there's an operation in the, in the transformer, 01:15:39.740 |
which like, and by the way, this is like Qualcomm's SNPE can't run transformers for this reason. 01:15:45.900 |
So most matrix multipliers in neural networks are weights times values. Right. Whereas, 01:15:51.480 |
you know, when you get to the, the, the, the outer product in, in transformers, well it's not 01:15:55.380 |
weights times values. It's a, it's values times values. Right. So SNPE like doesn't 01:15:59.540 |
even support that operation. Right. So it's like that operation that gives the transformer 01:16:03.940 |
its power. It has nothing to do with the fact that it's attention. Right. And this is just 01:16:07.620 |
as a funny, like, but that is one of the six tricks, right. Batch, like these norms are 01:16:11.780 |
a trick. Transformers are a trick. Okay. Six more. Is there a reason why, so you could 01:16:18.620 |
talk, you talk about attention as weight compression. Compression is not exactly the right word. 01:16:26.020 |
What I mean is that the weights can change dynamically based on the context. So there 01:16:29.980 |
was this thing in PAQ8 in the Hutter Prize that I absolutely loved. And I've never 01:16:33.020 |
seen it again in neural networks and it's a really good trick. Okay. Imagine you have 01:16:36.100 |
256 weight sets for a layer. Right. And then you choose which of the weight sets you're 01:16:41.500 |
loading in based on some context. And that context can come from another neural net. 01:16:45.500 |
Right. So I have another neural net which projects, you know, 256 wide, one 01:16:49.780 |
hot, do a softmax, predict it. And then I actually load the weights in. And I can do 01:16:53.300 |
this operation at both test time and train time. Right. I can do this operation at both 01:16:56.420 |
training and inference. And I load in the weights given the context. Right. Like that 01:17:01.580 |
is what transformers do. But transformers, instead of having 256 discrete ones, it's 01:17:05.940 |
actually just that, but continuous. Yeah. Which is funny that that was in language models. 01:17:09.980 |
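[Editor's note: a small numpy sketch of the trick being described: 256 candidate weight sets for one layer, with a separate context network choosing which one to load via a softmax. With hard selection it's the discrete PAQ8-style version; with a soft blend you get "weights generated on the fly," the continuous idea that attention's softmax(QK^T)V also expresses. All sizes here are made up for illustration.]

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

n_sets, d_in, d_out, d_ctx = 256, 64, 64, 32
weight_sets = np.random.randn(n_sets, d_in, d_out) * 0.02   # 256 weight sets for one layer
ctx_net     = np.random.randn(d_ctx, n_sets) * 0.02         # small network that reads the context

def dynamic_layer(x, ctx, hard=True):
    gate = softmax(ctx @ ctx_net)                 # 256-wide distribution over weight sets
    if hard:
        W = weight_sets[gate.argmax()]            # discrete: load exactly one weight set
    else:
        W = np.tensordot(gate, weight_sets, 1)    # continuous: blend the weight sets
    return x @ W                                  # works the same at train and test time

y = dynamic_layer(np.random.randn(d_in), np.random.randn(d_ctx))
```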
And I just like, when I understood that about transformers, I'm like, oh, this is a real 01:17:13.140 |
trick. And why are they using the word attention? Yeah. And today is actually the anniversary 01:17:17.860 |
of attention is all you need. What? Today, six years ago. Six years. Six years. Changed 01:17:23.860 |
the world. Wow. Well, there's one of your envelope tricks. Right. And you can easily 01:17:27.260 |
write it on an envelope. You know, think about how you write out that. How many times have 01:17:30.220 |
you written that? Because it's not in any libraries because it's like all used a little 01:17:33.200 |
differently each time. Yeah. If you just write out that exact same, you know. Yeah. Yeah. 01:17:39.860 |
You've name checked Elon a few times. Yeah. I think about both of you as systems thinkers. 01:17:44.900 |
Input, output, thinking something in between. Sure. What's different about your style versus 01:17:50.740 |
his? Elon's fundamental science for the world is physics, mine is information theory. Huh. 01:17:57.540 |
But you do a lot of physics as well. I mean, like you base it on. And Elon does a lot of 01:18:00.820 |
information theory as well, too. But the question is fundamentally that the difference maybe 01:18:05.700 |
is expressed in what your ambitions are. Right. Elon's ambitions may be like, go to Mars. 01:18:12.340 |
Go to Mars. Right. Go to Mars is the ultimate modern modernist physics ambition. Right. 01:18:17.580 |
It's a physics, but I'm getting to Mars. Right. Well, what are electric cars? It's a physics 01:18:20.740 |
problem. Right. OK. Now he's like pushing on the autonomy stuff and you push a little 01:18:25.180 |
on information theory. But fundamentally, his dreams are physics based dreams. My dreams 01:18:29.980 |
are information based dreams. I want to live forever in virtual reality with my AI girlfriend. 01:18:33.900 |
Right. Those are those are the aspirations of someone who who who accepts information 01:18:37.860 |
theory as a core science. So I think that's the main difference between me and him. He 01:18:40.660 |
has physics based aspirations and I have information based aspirations. Very, very neat. Mark Andreessen. 01:18:47.260 |
He is a, hi Mark, he's a listener. He is heavily, he's a big proponent of effective accelerationism. 01:18:54.240 |
You've been a bit more critical. Why do you say that EAC is not taken seriously by its 01:18:58.380 |
adherents? Oh, well, only the left takes ideology seriously. Why is that? Just as a fact. It's 01:19:08.700 |
just like it's just like a fact. Is the right more cynical? Is that what it is? I don't 01:19:12.260 |
know. It's like it's like the left actually manages to get energy around the ideologies. 01:19:16.900 |
Right. Like like like there's a lot more. Look, here you have you have two effective 01:19:21.740 |
altruists named Sam going in front of Congress. Only one of them is in jail. You know, it's 01:19:26.500 |
interesting. They're both calling for regulation in their respective spaces. Right. So SBF 01:19:30.300 |
is definitely like kind of a wolf in sheep's clothing, kind of. Right. He only adopted 01:19:34.340 |
EAC or EA. Oh, and Sam Altman is a genuinely good guy who is not interested in power seeking 01:19:40.860 |
for himself. All right. We don't we don't have to. Fair enough. Fair enough. But no, 01:19:46.780 |
EAC is not like like you are not serious. Right. You are not actually a serious ideology. 01:19:53.460 |
You know, Mark Andreessen. I like Mark Andreessen. But I think that like some of his Twitter 01:19:58.140 |
things are like, dude, you like just like it's like it's like someone who's like twenty 01:20:01.620 |
nineteen who's like eyes were opened about like the political world being not exact. 01:20:07.540 |
You mean all the people on the news were lying to me? Well, they were lying to you like, 01:20:12.020 |
OK, we all figured this out five years ago. Now, what are you going to do about it? I'm 01:20:15.260 |
going to complain about it on Twitter. Right. And that's what EAC is. 01:20:21.140 |
Last and maybe most important, why was Avatar 2 bad? 01:20:24.700 |
Oh, I have a whole you can go on my blog. I rewrote the script of Avatar 2. I wrote 01:20:31.100 |
a script that actually might make you feel something for the characters. I killed Jake 01:20:35.020 |
Sully in the first scene like you had to. Do you really think his second story arc topped 01:20:39.860 |
his first one? No, of course not. You had to kill the guy and make the movie about the 01:20:43.180 |
brothers. Right. And just that alone and realizing that like you could have kept the Titanic 01:20:47.800 |
scene, it would have been fine. Even take it out. I left your Titanic scene, James Cameron. 01:20:51.460 |
But I wrote you a story that so, you know, just just just he needs ships to sink in water. 01:20:56.940 |
He needs. Well, look, it's a great scene. But like the movie was just like the Roman 01:21:01.980 |
never great CGI, you know, let down by the writing. Maybe. Yeah. Yeah. No, but like the 01:21:06.460 |
CGI like it was it's a beautiful world. And that's why, like, I care so much. Right. Like 01:21:10.740 |
you don't hear me ranting about Pirates of the Caribbean 2 being a terrible story because 01:21:13.980 |
come on, what do you expect, man? Like Johnny Depp's like, wow, I had a movie that made 01:21:17.940 |
me rich. I love this. But this goes back to like the midpoint. You know, I think you wrote 01:21:23.140 |
like it feels like ChatGPT wrote the movie. And that's my worry a little bit. It's like 01:21:27.820 |
kind of converging towards that. Oh, I look Malik wrote the movie. Sorry, I didn't want 01:21:34.380 |
to interrupt. I closed. I closed a pull request two days ago. I was like, was this written 01:21:38.780 |
by ChatGPT? And I just closed it. Like, you know what? I honestly feel bad if you 01:21:42.620 |
were a human who wrote this. Like you're incapable of being more perplexed. But now I have a 01:21:48.980 |
classifier running in my head that asks, you know, is this AI or is this a human? Like, 01:21:54.740 |
you know, the only way to deal with all this, like, like, like, it's like the worst possible. 01:22:00.460 |
Like, you know, people are like, like, like, how are you mad about like these chatbots? 01:22:05.020 |
You're not mad about like Tesla? Well, because if I don't want to buy a Tesla, I want to 01:22:09.300 |
buy a Tesla and it won't really impact my life negatively. But if I don't want to use 01:22:12.580 |
a chatbot, it's still going to impact my life negatively. All the amount of like personalized 01:22:16.540 |
spam that now makes me spend more cycles on my classifier to tell if it's spam or not, 01:22:21.540 |
because you can now use AIs and generate this. Like, no, I mean, we have to move to a model 01:22:26.260 |
where everything's just a dollar, right? Like you want to send me an email, it's a dollar. 01:22:28.940 |
Like you guys wouldn't care. None of my friends would care. No one would care except the spammers. 01:22:32.660 |
Right? Like we just got to move to those sort of models. 01:22:36.620 |
Awesome. One last message you want everyone to remember. 01:22:41.380 |
Look, go, go try TinyGrad. I hope that we're a serious competitor to what's out there. 01:22:51.900 |
And then I want to, you know, I want to take it all the way. We'll start with just building 01:22:55.140 |
something for GPUs and then we'll start building chips and we'll start building fabs and we'll 01:22:59.340 |
start building silicon mines and we'll have the first self-reproducing robot using. Yeah, 01:23:04.460 |
all right, George. Thank you so much for coming on. 01:23:07.500 |
You're a big inspiration. Thank you. Thanks. All right. How was that? We, uh, not, not 01:23:15.780 |
quite like Fridman, but we hope to do something different.