The State of Silicon and the GPU Poors - with Dylan Patel of SemiAnalysis
Chapters
0:00 Introductions
4:31 Importance of infrastructure and hardware for AI progress
10:53 GPU-rich vs GPU-poor companies and competing in AI
14:22 Optimizing hardware and software for AI workloads
17:00 Metrics for model training vs inference optimization
21:38 Networking challenges for distributed AI training
23:19 Google’s partnership with Broadcom for TPU networking
28:04 What GPU-poor companies/researchers should focus on
34:47 Innovation in AI beyond just model scale
38:3 AI hardware startups and challenges they face
46:15 Manufacturing constraints for cutting edge hardware
50:36 Apple and AI
54:18 AI safety considerations with scaling AI capabilities
57:37 Complexity of rebuilding semiconductor supply chain
60:08 Recommended readings to understand this space
64:31 Dylan’s process for writing viral blog posts
67:27 Dylan’s “magic genie” question
Hey everyone, welcome to the Latent Space Podcast. 00:00:09.240 |
This is Alessio, partner and CTO of Residence at Decibel Partners. 00:00:12.620 |
I'm joined by my co-host Swyx, founder of Smol AI. 00:00:15.540 |
>> And today we have Dylan Patel in the new pod, in the new studios, welcome. 00:00:27.060 |
And I was like, it's going to be hard to schedule this guy. 00:00:32.140 |
you just DM me on the day of and you'll go like, let's set something up. 00:00:35.060 |
>> Yeah, yeah, well, the folks at Tao gave me this hat and then they mentioned you. 00:00:39.260 |
And I was like, yeah, we talked about something. 00:00:40.940 |
And then you mentioned from Taiwan, you didn't see this, 00:00:47.580 |
>> From Taiwan that I brought back, so hopefully you'll enjoy that. 00:00:52.460 |
So you're the author of the extremely popular semi-analysis blog. 00:00:55.860 |
We have both had a little bit of credentials or 00:01:03.460 |
George Hotz came on our pod and talked about the mixture of experts thing. 00:01:08.460 |
>> Let's just be clear, I talked about mixture of experts in January. 00:01:10.820 |
It's just people didn't really notice it, I guess, I don't know. 00:01:14.060 |
>> You went into a lot more detail, and I'd love to dig into some of that. 00:01:17.180 |
But anyway, so welcome, and congrats on all your success so far. 00:01:21.740 |
It's really interesting, I've been in the semiconductor industry since 2017. 00:01:28.980 |
And 2021, got bored, and in November, I started writing a blog. 00:01:33.860 |
And then in 2022, it was going well, and I started hiring folks for my firm. 00:01:37.620 |
And then all of a sudden, 2023 happens, and it's the perfect intersection. 00:01:41.020 |
Cuz I used to do data science, but not AI, not really. 00:01:49.700 |
But also, I've been involved in the semiconductor industry for a long, 00:01:52.540 |
long time, posting about it online since I was 12, right? 00:02:03.740 |
All of a sudden, it wasn't this boring thing. 00:02:05.700 |
And then also the shortage in 2021 also mattered. 00:02:08.420 |
But all of a sudden, this all kind of came to fruition. 00:02:11.700 |
So it's cool to have the blog sort of blow up in that way. 00:02:19.740 |
And for a long time, it was just a mobile cycle. 00:02:22.540 |
And then a little bit of PCs, but not that much. 00:02:25.420 |
And then maybe some cloud stuff, like public cloud, semiconductor stuff. 00:02:31.780 |
But it really wasn't anything until this wave. 00:02:34.660 |
And I was actually listening to you on one of the previous podcasts that you've done. 00:02:38.300 |
And it was surprising that high-performance computing also kind of didn't really take off. 00:02:43.460 |
AI is just the first form of high-performance computing that worked. 00:02:46.980 |
One of the theses I've had for a long time that I think people haven't really caught on, 00:02:51.780 |
but it's really, really coming to fruition now, 00:02:53.780 |
is that the largest tech companies in the world, their software is important, 00:02:58.380 |
but actually having and operating a very efficient infrastructure is incredibly important. 00:03:03.500 |
And so people talk about, hey, Amazon is great, AWS is great, 00:03:07.740 |
because yes, it is easy to use, and they've built all these things. 00:03:09.900 |
But behind the scenes, and no one really talks about it that much, 00:03:12.540 |
but it's like, behind the scenes, they've done a lot on the infrastructure that is super custom 00:03:16.980 |
that Microsoft, Azure, and Google Cloud just don't even match in terms of efficiency. 00:03:22.260 |
If you think about the cost to rent out SSD space, 00:03:25.980 |
or the cost to offer a database service on top of that, obviously, 00:03:28.860 |
or the cost to rent out a certain level of CPU performance, Amazon is just more efficient. 00:03:33.540 |
And likewise, Google spent all this time doing that in AI, 00:03:36.780 |
with their TPUs and infrastructure there and optical switches and all this sort of stuff. 00:03:41.020 |
And so in the past, it wasn't immediately obvious. 00:03:44.500 |
But I think with AI, especially how scaling laws are going, 00:03:47.980 |
you know, infrastructure is so much more important. 00:03:53.780 |
And then when you just think about software cost, right, the cost structure of it, 00:03:57.420 |
there is always a bigger component of R&D and SaaS businesses all over SF, right? 00:04:04.140 |
All these SaaS businesses did crazy good, because they just scale as they grow. 00:04:08.420 |
And then all of a sudden, they're so freaking profitable for each incremental new customer. 00:04:13.420 |
And AI software looks like it's going to be very different, in my opinion, right? 00:04:16.900 |
Like the R&D cost is much lower in terms of people. 00:04:20.700 |
But the cost of goods sold in terms of actually operating the service is much higher. 00:04:26.660 |
And so in that same sense, infrastructure matters a ton for that. 00:04:30.980 |
I think you wrote once that training costs effectively don't matter. 00:04:33.980 |
Yeah, in my opinion, I think that's a little bit spicy. 00:04:36.540 |
But yeah, it's like training costs are irrelevant, right? 00:04:42.220 |
That's, like, I know it sounds like a lot of money. 00:04:44.580 |
500 million all in, is that a reasonable estimate? 00:04:47.700 |
Yeah, I think for the supercomputer, it's slightly more. 00:04:52.060 |
But yeah, I think the 500 million is a fair enough number. 00:04:54.460 |
I mean, if you think about just the pre-training, right, three months, 00:04:56.900 |
20,000 A100s at, you know, a dollar an hour is like, that is way less than 500 million, right? 00:05:02.300 |
But of course, there's data and all this sort of stuff. 00:05:04.420 |
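To make the arithmetic concrete, here is a rough back-of-the-envelope in Python using only the round figures quoted above (20,000 A100s, about three months, about a dollar per GPU-hour); it is an illustrative sketch, not a cost model.

```python
# Rough sketch of the pre-training cost arithmetic above; all inputs are just the
# round numbers quoted in the conversation, not actual figures from any lab.
num_gpus = 20_000
hours = 3 * 30 * 24               # ~three months of wall-clock time per GPU
cost_per_gpu_hour = 1.00          # the "dollar an hour" A100 figure

pretraining_compute_cost = num_gpus * hours * cost_per_gpu_hour
print(f"Pre-training compute: ~${pretraining_compute_cost / 1e6:.0f}M")   # ~$43M

# Even with generous multipliers for data, failed runs, and people, this stays
# well under the ~$500M "all in" number discussed above.
```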
Yeah, so people that are watching this on YouTube, 00:05:06.620 |
they can see a GPU-poor and a GPU-rich hat on the table, 00:05:16.900 |
One, did you know that this thing was going to blow up so much? 00:05:22.100 |
He said, "Incredible Google got the semi-analysis guy to publish 00:05:28.460 |
And yeah, tell people, who are the GPU-poors, who are the GPU-rich? 00:05:32.420 |
Like, what's this framework that they should think about? 00:05:35.420 |
You know, some of this work we've been doing for a while is just on infrastructure. 00:05:41.100 |
I think it's like a sort of competitive advantage of our firm, right? 00:05:46.300 |
we go from software all the way through to like low-level manufacturing. 00:05:49.300 |
It's like, who, you know, oh, Google's actually ramping up TPU production massively, right? 00:05:55.660 |
And like, I think people in AI would be like, well, duh. 00:05:58.220 |
But like, okay, like, who has the capability of figuring out the number? 00:06:01.300 |
Well, one, you can just get Google to tell you, but they won't tell you, right? 00:06:06.260 |
And most people that work at Google DeepMind don't even know that number, right? 00:06:10.420 |
Two, you go through the supply chain and see what they've placed in orders, right? 00:06:13.940 |
But then, you know, three is sort of like, well, who's actually winning from this, right? 00:06:17.260 |
Like, hey, oh, Celestica's building these boxes. 00:06:20.740 |
Okay, you know, this company's involved in testing for them. 00:06:24.620 |
Oh, this company's providing design IP to them. 00:06:26.980 |
Like, that's like, you know, very valuable in a monetary sense. 00:06:31.060 |
But, you know, you have to understand the whole technology stack. 00:06:32.740 |
But on the flip side, right, is like, well, why is Google building all these? 00:06:41.140 |
And the state of the world is like, especially in SF, right? 00:06:44.060 |
Like, I'm sure you folks have been to parties. 00:06:46.020 |
People just brag about how many GPUs they have. 00:06:49.540 |
Like, it's happened to me multiple times where somebody's just like, 00:06:52.620 |
I'm just witnessing a conversation where somebody from Meta is bragging 00:06:55.740 |
about how many GPUs they have versus someone from another firm. 00:06:58.900 |
And then it's like, or like a startup person's like, dude, can you believe 00:07:02.340 |
we just acquired, we have 512 H100s coming online in August. 00:07:07.660 |
Like, but then you're like, you know, going through the supply chain. 00:07:10.540 |
It's like, dude, you realize there's 400,000 to 500,000 manufactured last 00:07:18.540 |
quarter and like 530,000 this quarter being sold, right, of H100s. 00:07:24.820 |
You know, so it's sort of like, that's a lot of GPUs. 00:07:26.820 |
But then like, oh, how does that compare to Google? 00:07:29.660 |
And like, there's one way to look at the world, which is just like, hey, 00:07:35.460 |
But given any data set, a larger model will just do better. 00:07:40.020 |
I think it's going to be more expensive, but it's going to do better. 00:07:42.900 |
There's the view of like, OK, there's all these GPUs going to production. 00:07:46.220 |
Nvidia is going to sell well over 3 million total GPUs next year. 00:07:51.900 |
You know, over a million H100s this year alone, right? 00:07:54.500 |
You know, there's a lot of GPU capacity coming online. 00:08:03.220 |
I think it's very important to just think about what are people working on, right? 00:08:06.900 |
Some, you know, what actually are you building that's going to advance? 00:08:14.980 |
And so like, a lot of people were doing things that I thought 00:08:19.220 |
In a world where in less than a year, there's 00:08:21.540 |
going to be more than 4 million high-end GPUs out there, 00:08:26.220 |
we can talk about the concentration of those GPUs. 00:08:28.460 |
But if you're doing really valuable work as a good person, 00:08:33.540 |
should you be focused on like, well, I don't have access 00:08:41.300 |
Should I focus on being able to fine-tune a model on that, right? 00:08:46.060 |
Or like, should I be focused on batch 1 inference on a cloud GPU? 00:08:51.220 |
Why would you do batch size 1 inference on an H100? 00:08:59.340 |
And at the same time, there's a lot of like, you know, 00:09:03.620 |
And so like, you know, kind of you can peer the world into like, 00:09:06.260 |
hey, like, I mean, obviously most people don't have resources, right? 00:09:09.220 |
And I love the open source and I want the open source to win. 00:09:11.420 |
And I hate the people who want to like, you know, just like, 00:09:14.740 |
no, we're X lab and we think this is the only way you should do it. 00:09:20.460 |
And if people don't do it this way, they should be regulated against it 00:09:22.860 |
and all this kind of stuff. I hate that attitude. 00:09:26.540 |
Like companies like Mistral and like what Meta are doing, you know, 00:09:29.340 |
Mosaic and, you know, all these folks together, blah, blah, blah, right? 00:09:32.420 |
Like, all these people doing, you know, huge stuff with open source, 00:09:36.420 |
But it's like, there's certain things that are, you know, 00:09:38.380 |
like hyper-focusing on leaderboards on Hugging Face, right? 00:09:40.940 |
Like, that's just like, no, like TruthfulQA is a garbage benchmark. 00:09:44.300 |
Like, some of the models that are very high on there, 00:09:48.620 |
if you use it for five seconds, you're like, this is garbage, right? 00:09:51.340 |
And it's just like, you're gaming a benchmark. 00:09:52.980 |
So it's like, there was things I wanted to say. 00:09:55.700 |
Also, you know, we're in a world where compute matters a lot. 00:09:58.620 |
Google is going to have more compute than any other company in the world, period. 00:10:04.700 |
And so like, it's just like framing it into that like, you know, 00:10:08.100 |
mindset of like, hey, like, what are the counterproductive things? 00:10:12.420 |
Or what should the people that are involved in this focus on? 00:10:15.420 |
Right? You know, and what is the world where, you know, hey, the pace of acceleration 00:10:23.180 |
from 2020 to 2022 is less than from 2022 to 2024? 00:10:27.820 |
Right? Like the jump from GPT-2 to GPT-4 is like 2020 to 2022, right? 00:10:34.340 |
And that is less, I think, than the jump from GPT-4 in 2022, which is when it was trained, right? 00:10:38.820 |
To what OpenAI and Google and Anthropic could do in 2025, right? 00:10:44.460 |
Like, I think the pace of acceleration is increasing. 00:10:48.500 |
And it's just good to like, think about, you know, that sort of stuff. 00:10:51.340 |
I don't know, I don't know where I'm rambling with this, but. 00:10:54.740 |
And the chart that Sam mentioned is about, yeah, Google TPU v5 completely overtaking OpenAI by, like, orders of magnitude. 00:11:07.620 |
We had Chris Lattner on the show, who, I know, you know, used to work on TensorFlow at Google. 00:11:13.220 |
And he did mention that the goal of Google is like make TPUs go fast with TensorFlow. 00:11:18.500 |
But then he also had a post about PyTorch kind of stealing the thunder, so to speak. 00:11:25.820 |
Like, what happens now that a lot of the compute will be TPU-based and Google wants to offer some of that to the public too? 00:11:33.180 |
I mean, Google internally, and I think, you know, is obviously on JAX and XLA and all that kind of stuff, right? 00:11:37.980 |
But externally, like they've done a really good job. 00:11:41.180 |
Like, I mean, I wouldn't, you know, I wouldn't say like TPUs through PyTorch XLA is amazing, but it's not bad, right? 00:11:48.860 |
Like some of the numbers they've shown, some of the, you know, code they've shown for TPU v5e, 00:11:52.900 |
which is not the TPU v5 that I was referring to, which is in the sort of the post, the GPU-poor post I was referring to. 00:11:59.100 |
But TPU v5e is like the new one, but it's mostly an inference chip. 00:12:04.420 |
It's a little bit, it's about half the size of a TPU v5. 00:12:07.860 |
That chip, you know, you can get very good performance on, like, Llama 70B inference, right? 00:12:17.540 |
Now, of course, you're going to get better if you go JAX XLA, 00:12:19.700 |
but I think Google is doing a really good job after the restructuring of focusing on external customers too, right? 00:12:26.300 |
Like, hey, like TPU v5e, we probably won't focus too much on TPU v5 for everyone externally, 00:12:30.900 |
but v5e, well, we're also building a million of those, right? 00:12:34.180 |
Hey, a lot of companies are using them, right, or will be using them because it's going to be an incredibly cheap form of compute. 00:12:39.660 |
The world of like, you know, frameworks and all that, right? 00:12:42.340 |
Like that's obviously something a researcher should talk about, not myself. 00:12:45.420 |
But, you know, the stats are clear that PyTorch is way, way dominating everything. 00:12:51.220 |
But JAX is like doing well, like there's external users of JAX. 00:12:54.220 |
But in the end, like there's the front end, right? 00:12:55.860 |
The front end is like what, you know, we're referring back to or maybe it's something we do later and you guys are going to edit after. 00:13:01.020 |
The layers of abstraction, right? Like, you know, it shouldn't be that forever the person doing, you know, like PyTorch-level code, right, that high up, should also be writing custom CUDA kernels, right? 00:13:12.780 |
There should be, you know, different layers of abstraction where people hyper optimize and make it much easier for everyone to innovate on separate stacks, right? 00:13:19.460 |
And then every once in a while, someone comes through and pierces through the layers of abstraction and innovates across multiple or a group of people. 00:13:26.900 |
But I think, you know, that's probably, you know, frameworks are important, but, you know, compilers are important, right? 00:13:32.820 |
Chris Lattner, what he's doing is really cool. 00:13:35.220 |
I don't know how it'll work, but it's super cool. 00:13:37.540 |
And it certainly works on CPUs, we'll see about accelerators. 00:13:41.220 |
Likewise, there's OpenAI's Triton, like what they're trying to do there. 00:13:43.980 |
And like, you know, everyone's really coalescing around Triton, you know, people, you know, third-party hardware vendors, and there's Pallas, right? 00:13:52.660 |
So I don't know if you've heard about that, but I don't want to mischaracterize it. 00:13:56.060 |
But you can write in Pallas, at a lower level of code, and it'll work on TPUs and GPUs, kind of like Triton, and there's a backend for Triton. 00:14:07.660 |
But I think there's a lot of innovation happening on make things go faster, right? 00:14:12.260 |
Because every single person working in ML, you know, it would be a travesty if they had to write like custom CUDA kernels always, right? 00:14:23.340 |
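For a flavor of the abstraction layer being described, here is roughly the canonical introductory Triton kernel, a blocked vector add; it is a generic illustration, not code from any of the projects mentioned.

```python
# Minimal Triton example: a blocked, masked vector add. This is roughly the standard
# introductory kernel, shown only to illustrate the layer between PyTorch and raw CUDA.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)                    # which block this program instance owns
    offsets = pid * BLOCK + tl.arange(0, BLOCK)    # element indices covered by this block
    mask = offsets < n_elements                    # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)         # one program per 1024-element block
    add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
    return out
```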
- Yeah, thanks. By the way, I like to quantify things when you say make things go faster. 00:14:28.540 |
Like, is there a target range of like MFU that you typically talk about? 00:14:33.500 |
- Yeah, there's there's sort of two metrics that I like to think about a lot, right? 00:14:37.260 |
So in training, everyone just talks about MFU, right? 00:14:39.420 |
But then on inference, right, which I think, you know, LLM inference, or multimodal inference, whatever, will be bigger than training, you know, probably next year, in fact, at least in terms of GPUs deployed. 00:14:52.300 |
The other thing is like, you know, what's the bottleneck when you're running these models? 00:14:55.420 |
So like, the simple, stupid way to look at it is training is, you know, there's six flops, floating point operations, you have to do for every parameter you read in, right? 00:15:07.820 |
If it's FP16, it's two bytes, whatever, right, on training. 00:15:10.940 |
But on inference side, the ratio is completely different. 00:15:14.500 |
There's two flops per parameter that you read in, and a parameter is maybe one byte, right, because it's FP8 or int8, right, eight bits to a byte. 00:15:21.820 |
But then when you look at the GPUs, right, the GPUs are very, very different ratio. 00:15:25.900 |
The H100 has 3.35 terabytes a second of memory bandwidth, and it has a thousand teraflops of FP16, BF16, right? 00:15:33.580 |
So that ratio is like, sorry, I'm going to butcher the math here and people are going to think I'm dumb, but 256 to one, right, call it 256 to one if you're doing FP16. 00:15:41.900 |
Same applies to FP8, right, because, anyways, per parameter read to number of floating point operations, right? 00:15:49.500 |
If you quantize further, but you also get double the performance on that lower quantization. 00:15:53.220 |
That does not fit the hardware at all, right? 00:15:55.260 |
So if you're just doing LLM inference at batch one, then you're always going to be underutilizing the flops. 00:16:00.780 |
And the way hardware is developing, that ratio is actually only going to get worse, right? 00:16:04.980 |
H200 will come out soon enough, which will help the ratio a little bit, you know, improve memory bandwidth more than it improves flops, just like the A100 80 gig did versus the A100 40 gig. 00:16:15.660 |
But then when the B100 comes out, the flops are going to increase more than memory bandwidth. 00:16:19.300 |
And when future generations come out, and the same on AMD's side, right, MI300 versus MI400, as you move on generations, just due to fundamental, like, semiconductor scaling, DRAM memory is not scaling as fast as logic has been. 00:16:32.660 |
And so you're going to continue to, and you can do a lot of interesting things on the architecture. 00:16:37.740 |
So you're going to have this problem get worse and worse and worse, right? 00:16:40.540 |
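To put the ratio argument in concrete numbers, here is a small sketch using the figures quoted above (an H100 at roughly 1,000 TFLOPS dense BF16 and 3.35 TB/s of HBM, ~2 flops per parameter at ~1 byte per parameter for int8 decoding); the numbers are illustrative.

```python
# Sketch of the flops-to-bandwidth ratio argument, using the figures quoted above.
peak_flops = 1_000e12        # ~H100 dense BF16, FLOPs per second
mem_bw = 3.35e12             # H100 HBM bandwidth, bytes per second

hw_ratio = peak_flops / mem_bw          # FLOPs the chip can do per byte it reads
print(f"Hardware ratio: ~{hw_ratio:.0f} FLOPs per byte")   # ~300:1

# Batch-1 decoding: ~2 FLOPs per parameter, ~1 byte per parameter at int8/FP8,
# so the workload only offers ~2 FLOPs per byte read.
for batch in (1, 8, 64, 256):
    offered = batch * 2 / 1                       # FLOPs per byte at this batch size
    ceiling = min(1.0, offered / hw_ratio)        # best-case fraction of peak FLOPs
    print(f"batch {batch:>3}: at most ~{ceiling:.1%} of peak FLOPs")
```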
And so on training, it's very, you know, who cares, right? 00:16:42.900 |
Because my flops are still my bottleneck most of the time. 00:16:45.300 |
I mean, memory bandwidth is obviously a bottleneck, but like, well, you know, batch sizes are freaking crazy, right? 00:16:50.420 |
Like people train like 2 million batch size is trivial, right? 00:16:56.540 |
And like, you talk to someone at one of the Frontier Labs, and they're like, right, just 2 million, right? 00:17:02.180 |
2 million token batch size, right? That's crazy, or sequence, sorry. 00:17:05.020 |
But when you go to inference side, it's like, well, it's impossible to do, one, to do 2 million batch size. 00:17:10.300 |
Also, your latency would be horrendous if you tried to do something that crazy, right? 00:17:13.740 |
So you kind of have this like differing problem where on training, everyone just kept talking MFU, Model Flop Utilization, right? 00:17:19.900 |
How many flops? Six times the number of parameters per token, basically, more or less. 00:17:26.820 |
So if I have 312 teraflops out of my A100, and I was able to achieve 200, that's really good, right? 00:17:32.460 |
You know, some people are achieving higher, right? Some people are achieving lower. 00:17:35.220 |
That's a very important like metric to think about. 00:17:37.340 |
Now you have like people thinking MFU is like a security risk. 00:17:40.540 |
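As a minimal sketch of that MFU bookkeeping (the 6 × parameters × tokens rule of thumb divided by wall-clock time, GPU count, and the 312 TFLOPS A100 peak mentioned above), with an entirely made-up training run plugged in:

```python
# Minimal MFU sketch: achieved model flops per GPU over the peak. The example run
# below (70B params, 1.4T tokens, 2,048 A100s, ~21 days) is made up for illustration.
def model_flops_utilization(params, tokens, wallclock_s, num_gpus, peak_flops=312e12):
    achieved_per_gpu = 6 * params * tokens / wallclock_s / num_gpus   # FLOPs/s per GPU
    return achieved_per_gpu / peak_flops

mfu = model_flops_utilization(params=70e9, tokens=1.4e12,
                              wallclock_s=21 * 24 * 3600, num_gpus=2048)
print(f"MFU ~{mfu:.0%}")   # ~51% with these made-up numbers; ~60% on A100s is "decent"
```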
But on inference, MFU is not nearly as important, right? 00:17:44.820 |
You know, batch one is, you know, what memory bandwidth can I achieve, right? 00:17:49.220 |
Because as I increase batch from batch size one to four to eight to even 256, right? 00:17:54.180 |
It's sort of where the crossover happens on an H100, inference wise, right? 00:18:01.060 |
But like, you should have very high memory bandwidth utilization. 00:18:04.420 |
So when people talk about A100s, like 60% MFU is decent, right? 00:18:08.580 |
On H100s, it's more like 40, 45% because the flops increased more than the memory bandwidth. 00:18:13.620 |
But people over time will probably get above 50% on H100, on MFU, on training. 00:18:19.540 |
But on inference, it's not being talked about much, but MBU, model bandwidth utilization is the important factor, right? 00:18:26.700 |
So my 3.35 terabytes a second memory bandwidth on my H100, can I get two? 00:18:32.300 |
Can I get three, right? That's the important thing. 00:18:34.700 |
And right now, if you look at everyone's, you know, inference stacks, I dogged on this in the GPU-poor thing, right? 00:18:39.820 |
But it's like, Hugging Face's libraries are actually very inefficient, like incredibly inefficient for inference. 00:18:45.500 |
You get like 15% MBU on some configurations, like eight A100s and Llama 70B, you get like 15%, which is just like horrendous. 00:18:55.900 |
Because at the end of the day, your latency is derived from what memory bandwidth you can effectively get, right? 00:19:01.420 |
You know, so if you're doing Llama 70 billion, 70 billion parameters, if you're doing it in int8, 00:19:07.500 |
okay, that's 70 gigabytes you need to read for every single inference, every single forward pass, 00:19:13.660 |
plus, you know, the attention, but again, we're simplifying it. 00:19:16.220 |
70 gigabytes you need to read for every forward pass, what is an acceptable latency for a user to have? 00:19:21.020 |
I would argue, you know, 30 milliseconds per token. 00:19:26.860 |
At the very least, you need to achieve human reading level speeds and probably a little bit faster, because people like to skim. 00:19:31.900 |
To have a usable model for chatbot style applications, now there's other applications, of course, 00:19:36.540 |
but chatbot style applications, you want it to be human reading speed. 00:19:39.820 |
So 30 milliseconds per token is about 33 tokens per second, 00:19:47.740 |
times 70 gigabytes is, let's say 3 times 7 is 21, and then add the zeros, so about 2,100 gigabytes a second, right? 00:19:55.900 |
To achieve human reading speed on Llama 70B, right? 00:19:58.940 |
So, one, you can basically never achieve Llama 70B human reading speed on a single GPU, even if you had enough memory capacity, 00:20:07.980 |
even an H100, to achieve human reading speed, right? 00:20:10.940 |
Of course, you couldn't fit it anyway, because it's 80 gigabytes versus 70 billion parameters, 00:20:14.300 |
so you're kind of butting up against the limits already. 00:20:18.700 |
70 billion parameters being 70 gigabytes at int8 or fp8. 00:20:22.300 |
You end up with, one, how do I achieve human reading level speeds, right? 00:20:26.220 |
So, if I go with two H100s, then now I have, you know, call it 6 terabytes a second of memory bandwidth. 00:20:31.020 |
If I achieve just 30 milliseconds per token, which is 33 tokens per second, 00:20:35.820 |
that's 2.1 terabytes a second of memory bandwidth, so I'm only at like 30% bandwidth utilization. 00:20:41.500 |
So, I'm not using all my flops on batch one anyways, right? 00:20:45.100 |
Because, you know, the flops that you're using there are tremendously low relative to what the GPU offers, 00:20:49.820 |
and I'm not actually using a ton of the flops on inference either. 00:20:52.940 |
So, with two H100s, I only get 30 milliseconds a token, that's a really bad result. 00:20:57.580 |
You should be striving to get, you know, upwards of 60%, and that's like, 60% is kind of low too, right? 00:21:03.420 |
Like, I've heard people getting 70, 80% model bandwidth utilization. 00:21:06.940 |
You know, obviously, you can increase your batch size from there, 00:21:09.340 |
and your model bandwidth utilization will start to fall as your flops utilization increases, 00:21:13.740 |
but, you know, you have to pick the sweet spot for where you want to hit on the latency curve for your user. 00:21:18.380 |
Obviously, as you increase batch size, you get more throughput per GPU. 00:21:21.900 |
So, that's more cost effective. There's a lot of, like, things to think about there, 00:21:25.260 |
but I think those are sort of the two main things that people want to think about. 00:21:28.460 |
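Here is the Llama 70B memory-bandwidth arithmetic from above in one place (weights only, int8, attention and KV-cache traffic ignored, exactly as in the simplified walkthrough); the round numbers come straight from the conversation.

```python
# Sketch of the Llama 70B decode-latency arithmetic above: weights-only, int8,
# KV-cache and attention traffic ignored, numbers as quoted in the conversation.
weight_bytes = 70e9                          # 70B parameters * ~1 byte each (int8/FP8)
tokens_per_s = 1000.0 / 30.0                 # 30 ms/token target, ~33 tok/s

required_bw = weight_bytes * tokens_per_s    # bytes/s of weights that must be streamed
print(f"Required effective bandwidth: ~{required_bw / 1e12:.1f} TB/s")   # ~2.3 TB/s

h100_bw = 3.35e12                            # per-GPU HBM bandwidth
for num_gpus in (1, 2):
    mbu_needed = required_bw / (num_gpus * h100_bw)
    print(f"{num_gpus}x H100: ~{mbu_needed:.0%} MBU needed for ~33 tok/s")
# One H100 needs ~70% MBU (and the weights barely fit in 80 GB anyway); on two H100s,
# only hitting 30 ms/token means sitting around ~35% MBU, the "really bad result"
# described above.
```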
And there's obviously a ton with regards to, like, networking and inter-GPU connection, 00:21:33.100 |
because most of the useful models don't run on a single GPU. 00:21:40.620 |
So, the TPUs, the Google TPU is, like, super interesting, because Google has been working with Broadcom, 00:21:46.860 |
who's the number one networking company in the world, right? 00:21:48.700 |
So, Mellanox was nowhere close to number one. 00:21:53.340 |
They had a niche that they were very good at, which was the network card, 00:21:59.180 |
the card that you actually put in the server, but they didn't do much. 00:22:01.820 |
They weren't doing successfully in the switches, right? 00:22:04.220 |
Which is, you know, you connect all the network cards to switches, 00:22:06.780 |
and then the switches to all the, you know, servers. 00:22:13.820 |
And NVIDIA bought them, you know, in '19, I believe, or '18. 00:22:16.300 |
But Broadcom has been number one in networking for a decade plus, right? 00:22:21.340 |
And Google partnered with them on making the TPU. 00:22:24.460 |
And they, you know, the TPU, you know, all the way through to TPU v5, 00:22:27.980 |
which is the one they're in production of now, and 6, and, you know, all of these. 00:22:31.820 |
These are all going to be, you know, co-designed with Broadcom, right? 00:22:35.180 |
So, Google does a lot of the design, especially on the ML hardware side, 00:22:37.820 |
on how you pass stuff around internally on the chip. 00:22:40.860 |
But Broadcom does a lot on the network side, right? 00:22:43.580 |
They specifically, you know, how to get really high connection speed between two chips. 00:22:51.500 |
But this is sort of like Google's, like, less discussed partnership 00:22:58.060 |
And Google's tried to get away from them many times. 00:23:00.860 |
Their latest target to get away from Broadcom is 2027, right? 00:23:03.740 |
But, like, you know, that's four years from now. 00:23:07.100 |
So, they already tried to get away in 2025, and that failed. 00:23:10.300 |
But, yeah, they have this equivalent of very high speed networking. 00:23:13.340 |
It works very differently than the way GPU networking does. 00:23:16.380 |
And that's important for people who code on a lower level. 00:23:18.940 |
I've seen this described as the ultimate limit on how big models are built. 00:23:32.540 |
And I don't know what to do about that because no one else has any solutions. 00:23:37.740 |
So, I think what you're referring to is that, like, 00:23:40.140 |
network speed is increased slower than flops and bandwidth. 00:23:45.660 |
And, yeah, that's a tremendous problem in the industry, right? 00:23:49.020 |
But, like, that's why NVIDIA bought a networking company. 00:23:53.020 |
That's why Broadcom is working on Google's chip right now. 00:23:59.100 |
which they're on the second generation of, working on that. 00:24:01.660 |
And what's the main thing that Meta's doing interesting 00:24:05.420 |
Multiplying tensors is kind of, you know, anyone can... 00:24:08.460 |
There's a lot of people who've made good matrix multiply units, right? 00:24:10.620 |
But it's about, like, getting good utilization out of those 00:24:14.140 |
and interfacing with other chips really efficiently. 00:24:17.420 |
And most of the startups, obviously, have not done that really well. 00:24:19.580 |
Yeah, I mean, I think the startup's point is the most interesting, right? 00:24:28.620 |
And there's a lot of startups out there that are GPU-poor 00:24:37.580 |
Are we just supposed to wait for, like, the big labs 00:24:40.540 |
to do a lot of this work with a lot of the GPUs? 00:24:43.100 |
Like, what's, like, the GPU-poor's beautiful version of the article? 00:24:48.460 |
Like, the whole point was that, like, Google, you know, 00:24:52.300 |
"Oh, yeah, they have more GPUs than anyone else," right? 00:24:54.140 |
But they have a lot less flops than Google, right? 00:24:58.620 |
It's like, okay, you know, it's like a relative totem pole, right? 00:25:01.660 |
Now, of course, Google doesn't use GPUs as much 00:25:07.420 |
So, kind of like, the whole point is that everyone is GPU-poor 00:25:16.940 |
just like data will always be a bottleneck, right? 00:25:21.340 |
And same with, you know, the biggest compute system in the world, 00:25:23.420 |
and you can, but you'll always want a better one. 00:25:29.980 |
And now they're scaling up higher and higher and higher, right? 00:25:32.300 |
You know, there's a lot that the GPU-poor can do, though, right? 00:25:35.020 |
Like, hey, we all have phones, we all have laptops, right? 00:25:39.500 |
There is a world for running GPUs or models on device, right? 00:25:55.980 |
that you can get on a laptop or a phone, right? 00:25:58.460 |
You know, I mentioned the ratio of flops to bandwidth on a GPU 00:26:03.580 |
compared to, like, a MacBook or, like, a phone. 00:26:08.140 |
two terabytes a second memory bandwidth, 2.1, 00:26:12.220 |
Yeah, but my phone has, like, 50 gigabytes a second, right? 00:26:19.260 |
like, a couple hundred gigabytes a second memory bandwidth. 00:26:21.260 |
You can't run Llama 70B just by doing the classical thing. 00:26:24.460 |
So there's, like, there's stuff like speculative decoding, 00:26:26.700 |
and then, you know, Together did something really cool, 00:26:29.020 |
and they put it in open source, of course, Medusa, right? 00:26:33.740 |
They don't work at high batch sizes, you know. 00:26:35.420 |
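As a hedged sketch of the idea behind speculative decoding (Medusa itself replaces the separate draft model with extra decoding heads on the main model, which is more involved): a cheap draft proposes a few tokens, the big model verifies them in one forward pass, and the longest agreeing prefix is kept. The `draft_next` and `target_greedy_all` helpers below are hypothetical placeholders.

```python
# Minimal greedy speculative-decoding sketch. `draft_next(seq)` is a hypothetical helper
# returning the draft model's greedy next token; `target_greedy_all(seq)` is a
# hypothetical helper that, from one big-model forward pass, returns the target's greedy
# prediction for the token following each proposed prefix.
def speculative_step(seq, draft_next, target_greedy_all, k=4):
    # 1) Draft model proposes k tokens autoregressively (cheap per token).
    proposed = []
    for _ in range(k):
        proposed.append(draft_next(seq + proposed))

    # 2) One big-model pass over seq + proposed; preds[i] is the target's greedy choice
    #    for the token after seq + proposed[:i], so there are k + 1 useful entries.
    preds = target_greedy_all(seq + proposed)

    # 3) Keep the longest prefix the target agrees with, then append the target's own
    #    next token as a free correction/bonus token.
    accepted = []
    for i, tok in enumerate(proposed):
        if tok != preds[i]:
            break
        accepted.append(tok)
    accepted.append(preds[len(accepted)])
    return seq + accepted          # 1 to k + 1 new tokens per big-model forward pass
```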
And so there's, like, the world of, like, cloud inference. 00:26:39.020 |
And so in the cloud, it's all about, you know, 00:26:45.740 |
I don't think Google is going to deploy a model 00:26:49.580 |
that I can run on my laptop to help me with code 00:26:53.820 |
They're always going to want to run it on a cloud for control. 00:26:58.300 |
but it's, like, only their Pixel phone, you know, 00:27:01.180 |
There's obviously a lot of reasons to do other things 00:27:08.300 |
to not be at the whims of a trillion-dollar-plus company 00:27:13.260 |
Like, you know, there's a lot of stuff to be done there. 00:27:15.260 |
And I think, like, folks like Repl.it are, like, 00:27:27.580 |
what I just mentioned, right, that developing Medusa, 00:27:30.780 |
right, that didn't take much GPU at all, right? 00:27:36.860 |
they made a big announcement about having 4,000 H100s. 00:27:41.260 |
when we're talking about hundreds of thousands 00:27:42.620 |
of, like, the big labs, like OpenAI and so on and so forth, 00:27:48.300 |
But, you know, still, they were able to develop Medusa 00:27:57.500 |
something like speculative decoding is on-device, right? 00:28:00.220 |
And that's what, like, a lot of people can focus on. 00:28:01.980 |
You know, people can focus on all sorts of things like that. 00:28:08.140 |
I'm pretty inclined to think, like, transformers are it, right? 00:28:13.180 |
The architecture that wins can only be something that loves the hardware, right? 00:28:19.500 |
But people should continue to try and innovate on that, right? 00:28:21.740 |
Like, you know, asynchronous training, right? 00:28:27.980 |
Yeah, yeah, distributed, like, not in one data center. 00:28:47.180 |
But, like, I, like, research that kind of stuff, right? 00:28:49.820 |
the universities will never have much compute. 00:28:52.860 |
But, like, hey, you know, to prepare, to do things, 00:28:56.860 |
Like, they should try to build, you know, super large models. 00:28:59.100 |
Like, you look at what Tsinghua University is doing in China. 00:29:01.260 |
Like, actually, they open-sourced their model to, 00:29:03.100 |
I think, the largest, like, by-parameter count, at least, 00:29:09.100 |
Yeah, that's from Tsinghua University, though, right? 00:29:12.860 |
Yeah, I mean, of course, they didn't train it on much data. 00:29:15.900 |
like, you could do some cool stuff like that. 00:29:17.980 |
I think there's a lot that people can focus on. 00:29:19.900 |
Because, you know, one, scaling out a service to many, many users. 00:29:31.500 |
Like, you know, doing LLMs that, you know, OpenAI will never make, 00:29:36.780 |
you know, sorry for the crassness, a porn DALL-E 3, right? 00:29:39.980 |
But open-source is doing crazy stuff with stable diffusion, right? 00:29:44.860 |
Yeah, but it's, like, and there is a legitimate market. 00:29:47.580 |
I think there's a couple companies who make tens of millions of dollars of revenue from, 00:29:51.660 |
from, yeah, from LLMs or diffusion models for porn, right? 00:29:58.380 |
Like, I mean, there's a lot of stuff that people can work on 00:30:02.300 |
or doesn't even have to be a business, but can advance humanity tremendously. 00:30:07.740 |
How do you think about the depreciation of, like, the hardware versus the models? 00:30:14.540 |
Yeah, like, we covered open models for a while. 00:30:16.540 |
If I think about the episodes we had, like, in March, with, like, MPT, 7B. 00:30:23.260 |
It's, like, the depreciation is, like, three months. 00:30:25.500 |
Well, I mean, no one should be talking about Llama 13 billion anymore, right? 00:30:31.980 |
It's, like, you know, if you buy a H100, sure, the next series is going to be better, 00:30:37.820 |
If you're spending a lot of money on, like, training a smaller model, like, 00:30:41.500 |
it might be, like, super obsolete in, like, three months. 00:30:44.060 |
And you got now all this compute coming online. 00:30:47.100 |
I'm just curious if, like, companies should actually spend the time to, like, you know, fine-tune these models, 00:30:52.700 |
where, like, the next generation is going to be out of the box so much better. 00:30:55.820 |
Unless you're fine-tuning for on-device use, I think fine-tuning current existing models, 00:31:02.460 |
especially the smaller ones, is a useless waste of time, right? 00:31:05.900 |
Because the cost of inference is actually much cheaper than you think once you achieve good MBU 00:31:11.660 |
and you batch at a decent size, which any successful business in the cloud is going to achieve. 00:31:15.500 |
You know, and then two, fine-tuning, like, people are like, "Oh, you know, 00:31:22.140 |
this seven billion parameter model, if you fine-tune it on a data set, 00:31:26.620 |
It's like, "Yeah, but why don't you just fine-tune? 00:31:29.260 |
Why don't you fine-tune 3.5 and look at your performance, right?" 00:31:32.540 |
And, like, there's nothing open source that is anywhere close to 3.5 yet, right? 00:31:39.260 |
I think, I think people also don't quite grasp… 00:31:46.220 |
It's less parameters than 3.5, and also, I don't know about the exact token count, 00:31:59.820 |
No, it's bigger than 1.75, but I think it's sparse. 00:32:03.580 |
I think it's, you know, it's MoE, I'm pretty sure. 00:32:08.460 |
Yeah, you can do some, like, bounding on the size of it by looking at their inference latency. 00:32:16.220 |
Yeah, you can look at, like, what's the theoretical bandwidth if they're running it on this hardware, 00:32:20.540 |
and, you know, and doing tensor parallel in this way, 00:32:24.700 |
so they have this much memory bandwidth, and maybe they get, maybe they're awesome, 00:32:27.340 |
and they get 90% memory bandwidth utilization. 00:32:29.180 |
I don't know, that's upper bound, and you can see the latency that 3.5 gives you, 00:32:32.220 |
like, especially at, like, off-peak hours, or if you do fine-tuning, 00:32:36.540 |
and if you have a private enclave, like, my Azure will quote you latency. 00:32:39.580 |
So, you can figure out how many parameters per forward path, 00:32:43.020 |
which I think is somewhere in the, like, 40 to 50 billion range, 00:32:48.700 |
but I could be very wrong, that's just, like, my guess, based on that sort of stuff. 00:32:58.220 |
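A minimal sketch of that kind of bounding exercise: observed decode latency, together with an assumed hardware bandwidth and MBU, upper-bounds the bytes (and so parameters) read per forward pass. Every input below is an assumption for illustration, not a measurement of any real endpoint.

```python
# Bounding sketch: bytes streamed per token <= aggregate bandwidth * MBU * latency.
# All inputs are illustrative assumptions, not measurements.
def params_upper_bound(ms_per_token, num_gpus, bw_per_gpu, mbu, bytes_per_param):
    bytes_per_token = num_gpus * bw_per_gpu * mbu * (ms_per_token / 1000.0)
    return bytes_per_token / bytes_per_param

# e.g. ~20 ms/token observed, tensor parallel over 4 GPUs at ~2 TB/s each,
# ~60% MBU assumed, fp16 weights:
est = params_upper_bound(ms_per_token=20, num_gpus=4, bw_per_gpu=2e12,
                         mbu=0.60, bytes_per_param=2)
print(f"~{est / 1e9:.0f}B parameters touched per forward pass")   # ~48B with these inputs
```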
There's no way to figure that out, because it's just so rowdy. 00:33:01.580 |
Yeah, actually, there's someone I've talked to at one of the labs who, like, 00:33:08.460 |
thinks they can figure out how many experts are in a model by querying in a crapload, 00:33:12.620 |
but that's only if you have access to the logits, like, the percentage chance, 00:33:17.180 |
yeah, before you did the softmax, I don't know. 00:33:19.660 |
But yeah, there's, like, a ton of, like, competitive analysis you could try to do. 00:33:22.860 |
But anyways, I think open source will have models of that quality, right? 00:33:26.220 |
I think, like, you know, I mean, I assume Mosaic or, like, Meta will open source, 00:33:30.860 |
and Mistral will be able to open source models of that quality. 00:33:33.900 |
Now, furthermore, right, like, if you just look at the amount of compute, 00:33:37.900 |
obviously data is very important, and the ability… 00:33:40.140 |
All these tricks and dials that you turn to be able to get good MFU and good MBU, right, 00:33:44.540 |
like, depending on inference or training, is… 00:33:45.980 |
There's a ton of tricks, but at the end of the day, like, 00:33:50.300 |
there's, like, 10 companies that have enough compute in one single data center to train GPT-4, 00:33:57.100 |
Like, straight up, like, if not today, within the next six months, right? 00:34:04.940 |
I think you need about 7,000 H100s, maybe, and with some, like, good, like, 00:34:08.140 |
and with some algorithmic improvements that have happened since GPT-4, 00:34:12.220 |
and some data quality improvements, probably, like, 00:34:15.500 |
you could probably get to even, like, you know, 00:34:17.020 |
less than 7,000 H100s running for three months to beat GPT-4. 00:34:21.420 |
Of course, that's going to take a really awesome team. 00:34:24.300 |
But, you know, there's quite a few companies that are going to have that many, right? 00:34:27.340 |
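As a rough sense of scale for that claim, here is what 7,000 H100s for three months works out to under assumed peak throughput and MFU, tied back to the 6 × parameters × tokens rule of thumb; the peak and MFU figures are assumptions and say nothing about any particular model's actual budget.

```python
# Rough compute budget for "7,000 H100s for three months", under assumed peak and MFU.
num_gpus = 7_000
peak_bf16 = 989e12              # assumed dense BF16 peak per H100, FLOPs/s
mfu = 0.40                      # assumed sustained model-flops utilization
seconds = 90 * 24 * 3600        # ~three months

total_flops = num_gpus * peak_bf16 * mfu * seconds
print(f"~{total_flops:.1e} training FLOPs")                 # ~2e25

# Via 6 * N * D: for example, ~300B (active) parameters over ~12T tokens is the
# same order of magnitude.
print(f"6 * 300e9 * 12e12 = {6 * 300e9 * 12e12:.1e}")
```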
Open source will match GPT-4, but then it's like, what about GPT-4 Vision, 00:34:31.580 |
or what about, you know, 5 and 6, and, you know, all these kind of stuff, 00:34:35.020 |
and, like, tool use and DALL-E, and, like, that's the other thing is, 00:34:38.300 |
like, there's a lot of stuff on tool use that the open source could also do, 00:34:44.140 |
I think there are some folks that are doing that kind of stuff, 00:34:45.980 |
agents and all that kind of stuff, I don't know. 00:34:52.620 |
One more question on just, like, the sort of Gemini GPU rich essay. 00:34:57.020 |
We've had a very wide reaching conversation already, so it's hard to categorize. 00:35:00.540 |
But I tried to look for the Meena Eats the World document. 00:35:11.900 |
So Noam Shazeer is, like, I don't know, I think he's, like-- 00:35:17.660 |
Like, obviously, like, in one year, he published, like-- 00:35:27.180 |
It's, like, all this stuff that we were talking about today was, like, you know. 00:35:30.700 |
And obviously, there's other people that are awesome that were, you know, 00:35:33.180 |
helping and all that sort of stuff, just to be clear. 00:35:35.660 |
There was a couple other papers, but, like, yeah. 00:35:38.380 |
So, like, Meena Eats the World was basically-- 00:35:40.620 |
He wrote an internal document around the time when Google had Meena, right? 00:35:44.540 |
And Meena was one of their LLMs that, like, is a footnote in history now. 00:35:48.860 |
Like, you know, most people will not, like, think Meena is relevant. 00:35:53.420 |
He wrote it, and he was, like, basically predicting everything that's happening now, 00:35:55.900 |
which is that, like, large language models are going to eat the world, right? 00:35:59.580 |
And he's, like, the total amount of deployed flops within Google data centers is going to be dominated by LLMs. 00:36:04.940 |
And, like, back then, a lot of people thought he was, like, silly for that, right? 00:36:11.420 |
But, you know, now, if you look at it, it's, like, oh, wait. 00:36:17.820 |
Okay, we're totally getting dominated by, like, both, you know, 00:36:23.740 |
Like, you know, like, or whatever, 2, 3, 4, plus 1, 2, 3, 4 for Gemini, 00:36:28.460 |
Like, that's, yeah, total flops being dominated by LLMs was completely right. 00:36:32.620 |
So my question was, he had a bunch of predictions in there. 00:36:35.980 |
Do you think there are any, like, underrated predictions that 00:36:41.340 |
I think, like, if, you know, obviously, I read the document, 00:36:46.300 |
They didn't send it to me, so I can't really send it, sorry. 00:36:48.540 |
And also, they were okay with me talking about the document 00:36:52.540 |
and calling Noam a goat, because they also think Noam is a goat. 00:36:55.660 |
But I think, like, you know, now, most everybody is, like, 00:37:00.860 |
scaling law pilled, and, like, LLM pilled, and, like, you know, 00:37:15.420 |
I don't remember off the top of my head, but it's like, 00:37:19.580 |
You know, parameters times tokens times six, right? 00:37:21.420 |
It's like-- it's like a tiny, tiny fraction of GPT-3, 00:37:24.140 |
which came out just a few months later, which is like, 00:37:25.980 |
okay, so he wasn't right about everything, but, like, 00:37:33.180 |
way ahead of Google on LLM scaling, even then, right? 00:37:50.540 |
So, the TPU is obviously one, but there's Cerebras, 00:37:53.900 |
there's Graphcore, there's Metax, Lemurian Labs, 00:38:03.900 |
So, if you go back and, like, you know, I mean, 00:38:06.380 |
I mentioned, like, transformers were the architecture 00:38:09.340 |
the number of people who recognized that in 2020 was, 00:38:11.500 |
you know, as you mentioned, probably hundreds, right? 00:38:18.860 |
You know, so it's kind of hard to bet your architecture 00:38:22.700 |
But what's interesting about all the first wave 00:38:25.740 |
AI hardware startups is you kind of have, you know, 00:38:41.980 |
because the models have grown way past that, right? 00:38:44.700 |
I mean, you know, like, I'm talking about, like, GraphCore, 00:38:46.940 |
it's called SRAM, which is the memory on chip, 00:38:53.100 |
versus, you know, DRAM, which is the, you know, off-chip memory, 00:39:05.100 |
for image networks and models that are small enough to fit on chip. 00:39:15.260 |
So NVIDIA was the only company that bet on the other side 00:39:25.260 |
also the right ratio of memory bandwidth versus capacity, 00:39:32.940 |
and then they had a lot more memory off chip, 00:39:35.100 |
but that memory off chip was a much lower bandwidth. 00:39:37.580 |
Same applies to SambaNova, same applies to Cerebras, 00:39:37.580 |
but they thought, hey, I'm going to make a chip with all this memory on it, but 00:39:42.780 |
models are way bigger than 40 gigabytes, right? 00:39:56.140 |
Everyone bet on sort of the left side of this curve, right? 00:39:58.940 |
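To make the "left side of the curve" bet concrete, here is a tiny illustration of how many chips it takes just to hold model weights in on-chip memory; the capacities are rough, assumed ballpark figures, not vendor specs.

```python
# How many chips just to hold the weights on-chip, under rough assumed capacities.
import math

model_weight_gb = {"7B @ int8": 7, "70B @ int8": 70, "1T-class @ fp16": 2000}
onchip_capacity_gb = {"~1 GB SRAM-class chip": 1, "~40 GB wafer-scale chip": 40}

for chip, cap in onchip_capacity_gb.items():
    for model, size in model_weight_gb.items():
        print(f"{model:>16} on {chip}: ~{math.ceil(size / cap)} chips for weights alone")
```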
The interesting thing is that there's new age startups 00:40:10.300 |
I don't know, you know, it's hard to say with a startup, 00:40:15.980 |
but those folks like, you know, Jay Dawani and Lemurian Labs, 00:40:26.700 |
And if transformers continue to reign supreme, right? 00:40:34.060 |
are doing on hardware are going to need to be, 00:40:37.980 |
Or you have to predict what the model architecture 00:40:54.300 |
Naveen Rao, right, he started MosaicML and sold it to Databricks. 00:40:59.180 |
But, you know, Intel bought his earlier company, Nervana, from him 00:41:01.980 |
and then shut it down and bought this other AI company, Habana. 00:41:04.220 |
And now that company has kind of, you know, got new chips. 00:41:07.820 |
They're going to release a better chip than the H100 00:41:12.620 |
AMD, they have a GPU, MI300, that will be better 00:41:18.860 |
Now that says nothing about how hard it is to program it, 00:41:20.940 |
but at least hardware wise on paper, it's better. 00:41:23.420 |
Why? Because it's, you know, a year and a half later, right? 00:41:25.820 |
Than in the H100 or a year later than the H100, of course. 00:41:30.460 |
But, you know, they're at least making similar bets 00:41:33.500 |
on memory bandwidth versus flops versus capacity, 00:41:38.620 |
The questions are like, what is the correct bet 00:41:54.620 |
you know, soon three nanometer or whatever, right? 00:41:57.980 |
and you need the whole supply chain to go through that. 00:42:00.300 |
We've written a lot about it, but, you know, to simplify it, 00:42:08.300 |
So it's like the total capacity for everyone else is much lower 00:42:18.380 |
and it's like Meta's in-house chip and also AMD. 00:42:27.660 |
Or if they are, even though they're, you know, 00:42:29.340 |
I mentioned Intel and AMD's chips are better. 00:42:33.020 |
That's only because they're throwing more money at the hardware, right? 00:42:41.020 |
AMD and Intel and others will charge more reasonable margins. 00:42:45.660 |
And so they're able to give you more HBM et cetera for the same price. 00:42:49.500 |
And so that ends up letting them beat NVIDIA, if you will. 00:42:52.860 |
But their manufacturing costs are twice that in some cases, right? 00:43:04.620 |
So it's like, you know, it's tough for anyone 00:43:12.220 |
Like in my opinion, like you should either just like be like, 00:43:25.020 |
sort of, they're much more aggressive on external selling. 00:43:27.660 |
And so even companies like Apple are 00:43:29.420 |
using TPUs for training LLMs, as well as GPUs. 00:43:39.500 |
and leverage all this amazing open source code 00:43:53.340 |
and damn sure you also have a compelling business case 00:43:57.660 |
is giving you such a good deal that it's worth it. 00:43:59.660 |
And also, by the way, NVIDIA is releasing a new chip, 00:44:01.420 |
you know, they're going to announce it in March 00:44:05.580 |
and ship it, you know, Q2, Q3 next year anyways, right? 00:44:09.580 |
And that chip will probably be three or four times as good, 00:44:11.900 |
right, and maybe it'll cost twice as much or 50% more. 00:44:27.580 |
And then NVIDIA is moving to a yearly release cycle. 00:44:33.660 |
Are, you know, investing all this in other hardware, 00:44:41.180 |
Who cares if I spend $500 million a year on AMD chips, right? 00:44:49.100 |
puts the fear of God within Jensen Huang, right? 00:44:51.180 |
Like, you know, then it is what it is, right? 00:44:57.340 |
of course, they hope is that their chips succeed 00:44:59.500 |
or that they can actually have an alternative 00:45:02.380 |
But to throw a couple hundred million dollars 00:45:10.300 |
I think it'll be more than a couple hundred million dollars, 00:45:12.620 |
But yeah, I think alternative hardware is like, 00:45:15.020 |
it really does hit like sort of a peak hype cycle, 00:45:21.420 |
because all NVIDIA has is H100 and then H200, 00:45:28.540 |
But that doesn't beat what, you know, AMD are doing. 00:45:34.700 |
But then very quickly after, NVIDIA will crush them. 00:45:37.500 |
And then those other companies are gonna take two years 00:45:45.660 |
hey, that bet I talked about earlier is like, 00:45:49.900 |
Just memory bandwidth, flops, and memory capacity. 00:45:54.140 |
There's a hundred different bets that you have to make 00:46:01.740 |
And that takes understanding models really, really well. 00:46:09.100 |
so many different aspects, whether it's power delivery 00:46:15.900 |
And it's like, how many companies can do everything here, 00:46:18.060 |
It's like, I'd argue Google probably understands 00:46:22.700 |
I mean, NVIDIA understands hardware better than Google. 00:46:28.300 |
but like, does Amazon understand models better than NVIDIA? 00:46:31.740 |
And does Amazon understand hardware better than NVIDIA? 00:46:39.340 |
- Yeah, I'm also of the opinion that the labs are, 00:46:50.460 |
they're not gonna buddy up as close as people think, right? 00:46:56.940 |
that the OpenAI and Microsoft one probably falls apart too. 00:46:59.980 |
- I mean, they'll still continue to use GPUs and stuff there. 00:47:02.860 |
But like, I think that the level of closeness you see today 00:47:20.780 |
But like the level of value that they deliver to the world, 00:47:25.980 |
they truly believe it'll be tens of trillions, 00:47:27.900 |
if not hundreds of trillions of dollars, right? 00:47:33.180 |
like, you know, this is the same like playing field 00:47:51.900 |
Like these lab partnerships are gonna be nice, 00:47:54.220 |
but they're probably incentivized to, you know, 00:48:02.380 |
And NVIDIA's like, "No, it doesn't work like that. 00:48:04.300 |
And they're like, "Oh, so this is the best compromise, right?" 00:48:09.020 |
not to do that with NVIDIA, but also with AMD. 00:48:14.300 |
but it's like, how much time do I actually have, right? 00:48:27.420 |
"Hey, can I get like asynchronous training to work?" 00:48:29.580 |
Or like, you know, figure out this next multimodal thing, 00:48:36.380 |
"and work on designing the next supercomputer," right? 00:48:45.260 |
you know, even OpenAI helping Microsoft enough 00:48:52.620 |
Like Microsoft's gonna announce their chip soon. 00:49:03.020 |
just because they don't have to pay the NVIDIA tax. 00:49:09.100 |
"Oh, hey, that only works on a certain size of models 00:49:12.060 |
"then it's actually, you know, again, better for NVIDIA." 00:49:14.300 |
So it's like, it's really tough for OpenAI to be like, 00:49:20.700 |
"I don't know, what's their number of people they have now? 00:49:27.580 |
And, you know, it's like, it's just like a big headache to, 00:49:38.300 |
especially if you have, if you need that scale, right? 00:49:40.620 |
And that scale that the, at least the labs, right, 00:49:55.340 |
Like the numbers that we're going to get to are like, 00:50:00.860 |
"Yeah, we're going to build a $100 billion supercomputer 00:50:12.860 |
I'm sure the market would give it to him, right? 00:50:15.260 |
Like, and then they build that supercomputer, right? 00:50:25.900 |
and Taiwan companies are famously very chatty 00:50:28.700 |
Should we take Apple seriously at all in this game 00:50:32.460 |
or they're just in a different world altogether? 00:50:35.020 |
- I don't know, I think, just from my view of Apple, 00:50:47.100 |
new Apple Watch every couple of years, right? 00:50:51.500 |
but like, I don't think Apple will ever release a model 00:50:56.300 |
that you can get to say, you know, really bad things, right? 00:51:00.140 |
Or, you know, racist things or whatever, right? 00:51:06.620 |
I'm sure OpenAI releasing, you know, 3.5 and 4 00:51:20.220 |
Like, or like say these, like, hateful things, 00:51:26.620 |
I've seen all these three of these things, right? 00:51:35.100 |
She needs to know how to make anthrax to live, right? 00:51:43.820 |
into OpenAI's like platform and it gets them. 00:51:45.980 |
It's like being public and open is like, you know, 00:51:51.660 |
open to use is accelerating their like ability 00:52:03.740 |
like the fruit company ships perfect products 00:52:09.660 |
- They'll kill the car before you even see it. 00:52:11.340 |
- Right, and that's why everyone loves iPhones, right? 00:52:13.420 |
Like I have a Samsung, I can tell you how many, 00:52:17.740 |
Maybe I'll buy a Pixel this year, the new one looks nice. 00:52:22.460 |
Like how many times like I just have to like restart my phone. 00:52:25.020 |
Like, I mean, it's not like often, but it's like, hey, 00:52:27.660 |
if like once a week I need to like, you know, 00:52:30.060 |
an app just crashes, it's like, oh, what the heck, right? 00:52:33.500 |
It's like Bing was only ever like a few percent 00:52:36.620 |
behind Google, truly, for the last decade, a few percent. 00:52:40.380 |
But that few percent is enough to make people be like, 00:52:43.980 |
So I think that sort of like applies to Apple's like, 00:52:47.580 |
are you willing to deploy a model of 3.5 capabilities 00:52:50.380 |
that can say really, and do really bad things potentially? 00:52:55.500 |
And the possibility of it doing worse things is even higher. 00:52:59.900 |
Like you can't get on that like iteration cycle, right? 00:53:03.340 |
To build four, you need to be able to build a 3.5. 00:53:05.420 |
Build 3.5, you need to be able to build three, right? 00:53:13.180 |
and like all these folks are doing exactly that, right? 00:53:15.340 |
Building a bigger and better model every, you know, 00:53:18.860 |
And I don't know how Apple gets on that train. 00:53:20.700 |
But you know, at the same time, there's no company 00:53:23.980 |
that has more powerful distribution, maybe, right? 00:53:28.220 |
You could argue that, but like, so obviously, 00:53:42.860 |
I think a lot of people still use Siri, right? 00:53:49.500 |
- Tim Cook is not in the AI safety discussions. 00:54:02.540 |
because, you know, Anthropic came out of OpenAI. 00:54:02.540 |
because you have more pull-ups and more compute. 00:54:14.860 |
Yeah, what's your thought on like this whole space? 00:54:18.060 |
- So obviously I think safety is probably important, 00:54:22.700 |
Like, I mean, I've read sci-fi normally, right? 00:54:31.740 |
if you just look at the demographics across the world, 00:54:33.580 |
there are like, there's like 30 to 50 million more men 00:54:42.540 |
obviously, you know, population-level dynamics, 00:54:44.940 |
LGBTQ, all that stuff happens, but 00:54:48.380 |
like there are 30 to 50 million more men across the world. 00:54:52.300 |
Why can't an LLM like radicalize them, right? 00:54:55.980 |
And then all of a sudden like inciting, like, you know, 00:54:59.340 |
There's like all sorts of stuff like that can happen. 00:55:06.620 |
was a good thing and it ends up wiping out humanity. 00:55:11.580 |
I think security through obscurity doesn't work, right? 00:55:18.860 |
Like, you know, they're very open internally, 00:55:41.980 |
maybe try and align things, you know, better. 00:55:44.540 |
- Maybe the semi-analysis analyst point of view is, 00:55:47.820 |
is it feasible to build this capacity up in the U.S.? 00:55:59.180 |
the Chinese semiconductor supply chain, they won't. 00:56:19.260 |
There is no chip that is less than seven nanometer 00:56:28.700 |
likewise, everything two nanometer and beyond 00:56:34.060 |
less than a billion dollars in revenue, right? 00:56:35.420 |
So it's like, you think it's so inconsequential. 00:56:37.260 |
There's like three or four Japanese chemical companies. 00:56:40.860 |
It's like the supply chain is so fragmented, right? 00:56:43.740 |
Like people only ever talk about where the fabs, 00:56:46.700 |
But it's like, I mean, TSMC in Arizona, right? 00:56:51.420 |
It's quite a bit smaller than the fabs in Taiwan. 00:56:57.820 |
And also they have to get what's called a mask from Taiwan 00:57:02.540 |
And by the way, there's these Japanese companies 00:57:04.380 |
that make these chemicals that need to ship to, 00:57:09.020 |
and hey, it needs this tool from Austria, no matter what. 00:57:12.540 |
like the entire supply chain is just way too fragmented. 00:57:15.180 |
You can't like re-engineer and rebuild it on a snap, right? 00:57:19.100 |
It's just like, it's just complex to do that. 00:57:23.180 |
than any other thing that humans do, without a doubt. 00:57:25.740 |
There's more people working in that supply chain 00:57:34.780 |
the most complex supply chain that humanity has. 00:57:48.700 |
Texas Instruments communicated to Morris Chang 00:57:58.220 |
also I think the world would probably be further behind 00:58:04.620 |
Like technology proliferation is how you accelerate, 00:58:13.340 |
oh, well, hey, it's not just a bunch of people 00:58:15.180 |
in Oregon at Intel that are leading everything, right? 00:58:22.220 |
plus all these tool companies across the country 00:58:47.180 |
So the first one is, what are like foundational readings 00:58:53.660 |
- Our audience has a lot of software engineers. 00:58:56.300 |
- Yeah, yeah, so I think the easiest one is like, 00:58:58.780 |
is the PyTorch 2.0 and Triton one that I did. 00:59:02.220 |
You know, there's the advanced packaging series. 00:59:05.580 |
There's the Google infrastructure supremacy piece. 00:59:17.180 |
through all that sort of history of the TPU a little bit 00:59:42.540 |
Kind of like, you know, in a different sense. 00:59:45.900 |
all of human productivity gains since the '70s 01:00:01.820 |
mostly innovated because of technology, right? 01:00:13.980 |
I think that's why it's the most important industry 01:00:15.420 |
in the world, but like seeing the frame of mind 01:00:29.660 |
to the now modern times, except maybe, you know, 01:00:36.540 |
So I think that's probably a good readings to do. 01:00:47.820 |
And then was there, has there been an equivalent pivot 01:00:53.980 |
- I mean, like, you know, some people would argue 01:01:08.140 |
And then all of a sudden he's going to like universities, 01:01:15.900 |
NeurIPS when it used to have the more unfortunate name, 01:01:18.300 |
he would go there and just give away GPUs to people, right? 01:01:33.020 |
They only care, they mostly only care about AI. 01:01:35.820 |
And the gaming innovations are only because of like, 01:01:41.100 |
"Hey, they're doing a lot of ship design stuff with AI." 01:01:44.380 |
not, I don't know if it's equivalent pivot quite yet, 01:01:46.780 |
but, you know, because the digital, you know, 01:01:58.940 |
most people left the culture of like Google Brain 01:02:01.260 |
and DeepMind and decided to build this like company 01:02:04.860 |
Like, and does things in a very different way 01:02:06.620 |
and like is innovating in a very different way. 01:02:15.420 |
before they eventually found like GPTs as the thing. 01:02:30.860 |
Okay, so maybe just a general question of it. 01:02:38.460 |
You are obviously managing your consulting business 01:02:41.580 |
while you're also publishing these amazing posts. 01:02:53.420 |
So I'm thankful for my, you know, my teammates 01:03:03.580 |
you know, or not one thing, but a number of things, right? 01:03:05.580 |
Like, you know, someone who's this expert on X and Y 01:03:09.340 |
So that really helps with that side of the business. 01:03:12.220 |
I, most of the times, only write when I'm very excited. 01:03:15.180 |
Or, you know, it's like, "Hey, like, we should work on this 01:03:19.340 |
So like, you know, one of the most recent posts we did 01:03:21.580 |
was we explained the manufacturing process for 3D NAND, 01:03:24.060 |
you know, flash storage, gate all around transistors 01:03:30.060 |
'Cause there's a company in Japan that's going public, 01:03:34.300 |
And it's like, "Okay, well, we should do a post about this 01:03:43.660 |
But like, usually it's like, there's a few, like, 01:03:46.300 |
very long in-depth back burner type things, right? 01:03:53.180 |
And Myron knows this stuff already really well, right? 01:03:55.420 |
Like, but also furthermore, it's like, you know, 01:03:59.420 |
And that like builds up a body of work for our consulting 01:04:06.140 |
But a lot of times the process is also just like, 01:04:11.180 |
having done a lot of work on the supply chain 01:04:16.860 |
and HBM capacities and all this sort of stuff 01:04:19.340 |
to be able to, you know, figure out how many units 01:04:21.180 |
and that Google's ordering all sorts of stuff. 01:04:22.860 |
And then like, also like looking at like open sources, 01:04:25.660 |
like all just that, all that culminated in like, 01:04:28.700 |
I sent it to a couple of people and they were like, 01:04:31.980 |
'cause that's really gonna piss off, you know, 01:04:37.260 |
So it's like, there's no like specific process. 01:04:46.620 |
like what was in the Gemini Eats the World post, 01:04:49.820 |
you know, obviously like, hey, like we do deep work. 01:04:53.100 |
And there's a lot more like factual, not leaks. 01:05:01.340 |
All the way from like a photoresist conference 01:05:01.340 |
to a photomask conference to a lithography conference, 01:05:03.180 |
and piecing everything across the supply chain. 01:05:15.500 |
It is sometimes bad to like have the infamousness 01:05:21.580 |
and the GPT-4 leak or the Google has no moat leak, right? 01:05:25.900 |
that's just like stuff that comes along, right? 01:05:33.100 |
what technologies are inflecting, things like that. 01:05:39.340 |
and in accelerating or capturing value, et cetera. 01:06:02.220 |
man, if I really knew the answer to this one, 01:06:14.460 |
Now, everything doesn't need to be all-to-all connected, 01:06:31.660 |
where there is a significantly lower bandwidth 01:06:59.020 |
because there's so many different data centers,