
The State of Silicon and the GPU Poors - with Dylan Patel of SemiAnalysis


Chapters

0:00 Introductions
4:31 Importance of infrastructure and hardware for AI progress
10:53 GPU-rich vs GPU-poor companies and competing in AI
14:22 Optimizing hardware and software for AI workloads
17:00 Metrics for model training vs inference optimization
21:38 Networking challenges for distributed AI training
23:19 Google’s partnership with Broadcom for TPU networking
28:04 What GPU-poor companies/researchers should focus on
34:47 Innovation in AI beyond just model scale
38:03 AI hardware startups and challenges they face
46:15 Manufacturing constraints for cutting edge hardware
50:36 Apple and AI
54:18 AI safety considerations with scaling AI capabilities
57:37 Complexity of rebuilding semiconductor supply chain
60:08 Recommended readings to understand this space
64:31 Dylan’s process for writing viral blog posts
67:27 Dylan’s “magic genie” question

Transcript

Hey everyone, welcome to the Latent Space Podcast. This is Alessio, Partner and CTO in Residence at Decibel Partners. I'm joined by my co-host Swyx, founder of Smol AI. >> And today we have Dylan Patel on the new pod, in the new studios, welcome. >> Well, thank you for having me.

And it was very short notice, right? >> Yes, yes, just hours. I was thinking you were in Taiwan somewhere. And I was like, it's going to be hard to schedule this guy. But I'm sure you visit San Francisco. >> Yeah, yeah. >> And obviously, you just DM'd me on the day of and were like, let's set something up.

>> Yeah, yeah, well, the folks at Tao gave me this hat and then they mentioned you. And I was like, yeah, we talked about something. And since you mentioned Taiwan, you didn't see this, I was talking to Swyx about this, but this is a mooncake- >> Nice. >> From Taiwan that I brought back, so hopefully you'll enjoy that.

>> Nice, thank you. >> Amazing. So you're the author of the extremely popular SemiAnalysis blog. We have both had a little bit of credentials, or claim to fame, in breaking details of GPT-4. George Hotz came on our pod and talked about the mixture of experts thing. And then you have a lot more details.

>> Let's just be clear, I talked about mixture of experts in January. It's just people didn't really notice it, I guess, I don't know. >> You went into a lot more detail, and I'd love to dig into some of that. But anyway, so welcome, and congrats on all your success so far.

>> Yeah, thank you so much. It's really interesting, I've been in the semiconductor industry since 2017. In 2021 I got bored, and in November I started writing a blog. And then in 2022, it was going well, and I started hiring folks for my firm. And then all of a sudden, 2023 happens, and it's the perfect intersection.

Cuz I used to do data science, but not AI, not really. Multivariable regression is not AI, right? But also, I've been involved in the semiconductor industry for a long, long time, posting about it online since I was 12, right? And so it was the perfect time and place, cuz semiconductors became important, right?

All of a sudden, it wasn't this boring thing. And then the shortage in 2021 also mattered. But all of a sudden, this all kind of came to fruition. So it's cool to have the blog sort of blow up in that way. I used to cover semis at Balyasny as well.

And for a long time, it was just a mobile cycle. And then a little bit of PCs, but not that much. And then maybe some cloud stuff, like public cloud, semi-conductor stuff. But it really wasn't anything until this wave. And I was actually listening to you on one of the previous podcasts that you've done.

And it was surprising that high-performance computing also kind of didn't really take off. AI is just the first form of high-performance computing that worked. One of the theses I've had for a long time that I think people haven't really caught on, but it's really, really coming to fruition now, is that the largest tech companies in the world, their software is important, but actually having and operating a very efficient infrastructure is incredibly important.

And so people talk about, hey, Amazon is great, AWS is great, because yes, it is easy to use, and they've built all these things. But behind the scenes, and no one really talks about it that much, they've done a lot on infrastructure that is super custom, that Microsoft Azure and Google Cloud just don't even match in terms of efficiency.

If you think about the cost to rent out SSD space, so the cost to offer a database service on top of that, obviously, the cost to rent out a certain level of CPU performance, Amazon has a massive advantage there. And likewise, Google spent all this time doing that in AI, with their TPUs and infrastructure there and optical switches and all this sort of stuff.

And so in the past, it wasn't immediately obvious. But I think with AI, especially with how scaling laws are going, infrastructure is so much more important. And then when you just think about software cost, right, the cost structure of it, R&D has always been the bigger component for the SaaS businesses all over SF, right?

All these SaaS businesses did crazy well, because they just scale as they grow, and then all of a sudden, they're so freaking profitable on each incremental new customer. And AI software looks like it's going to be very different, in my opinion, right? Like the R&D cost is much lower in terms of people.

But the cost of goods sold in terms of actually operating the service, I think will be much higher, right? And so in that same sense, infrastructure matters a ton for that. I think you wrote once that training costs effectively don't matter. Yeah, in my opinion, I think that's a little bit spicy.

But yeah, it's like training costs are irrelevant, right? Like GPT-4, right, like 20,000 A100s. That's, like, I know it sounds like a lot of money. 500 million all in, is that a reasonable estimate? Yeah, I think for the supercomputer, it's slightly more. But yeah, I think the 500 million is a fair enough number.

I mean, if you think about just the pre-training, right, three months, 20,000 A100s at, you know, a dollar an hour, that is way less than 500 million, right? But of course, there's data and all this sort of stuff. Yeah, so people that are watching this on YouTube can see a GPU-poor and a GPU-rich hat on the table, which is inspired by your Google Gemini Eats the World blog post.
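
For reference, the pre-training arithmetic Dylan sketches above works out roughly as follows; a minimal back-of-envelope sketch where the 20,000 A100s, roughly three-month run, and ~$1 per GPU-hour rate come from the conversation, and everything else is an assumption for illustration:

```python
# Back-of-envelope GPU-hours math for the GPT-4 pre-training figures quoted
# above: 20,000 A100s for ~3 months at ~$1 per GPU-hour. The gap to "all in"
# is everything else (supercomputer build-out, data, people, failed runs).

num_gpus = 20_000            # A100s, per the conversation
days = 90                    # roughly three months of pre-training
price_per_gpu_hour = 1.00    # USD, the rate Dylan quotes

gpu_hours = num_gpus * days * 24
compute_cost = gpu_hours * price_per_gpu_hour

print(f"GPU-hours: {gpu_hours:,}")                      # 43,200,000
print(f"Raw compute cost: ${compute_cost / 1e6:.0f}M")  # ~$43M, far below the ~$500M all-in figure
```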

Did you know that this thing was going to blow up so much? Sam Altman even tweeted about it. He said, "Incredible, Google got the SemiAnalysis guy to publish their internal marketing recruiting chart." And yeah, tell people, who are the GPU-poors, who are the GPU-rich?

Like, what's this framework that they should think about? You know, some of this work we've been doing for a while is just on infrastructure. And like, hey, like when something happens, I think it's like a sort of competitive advantage of our firm, right? Me, myself, and my colleagues is like, we go from software all the way through to like low-level manufacturing.

It's like, who, you know, oh, Google's actually ramping up TPU production massively, right? And like, I think people in AI would be like, well, duh. But like, okay, like, who has the capability of figuring out the number? Well, one, you can just get Google to tell you, but they won't tell you, right?

That's like a very closely guarded secret. And most people that work at Google DeepMind don't even know that number, right? Two, you go through the supply chain and see what they've placed in orders, right? But then, you know, three is sort of like, well, who's actually winning from this, right?

Like, hey, oh, Celestica's building these boxes. Wow, oh, interesting. Okay, you know, this company's involved in testing for them. Oh, okay. Oh, this company's providing design IP to them. Okay, okay. Like, that's like, you know, very valuable in a monetary sense. But, you know, you have to understand the whole technology stack.

But on the flip side, right, is like, well, why is Google building all these? What could they do with it? And what does that mean for the world? And the state of the world is like, especially in SF, right? Like, I'm sure you folks have been to parties. People just brag about how many GPUs they have.

Like, it's happened to me multiple times where somebody's just like, I'm just witnessing a conversation where somebody from Meta is bragging about how many GPUs they have versus someone from another firm. And then it's like, or like a startup person's like, dude, can you believe we just acquired, we have 512 H100s coming online in August.

And it's like, oh, cool. Like, but then you're like, you know, going through the supply chain. It's like, dude, you realize there's 400,000 to 500,000 manufactured last quarter and like 530,000 this quarter being sold, right, of H100s. And it's like, oh, crap, that's a lot. You know, so it's sort of like, that's a lot of GPUs.

But then like, oh, how does that compare to Google? And like, there's one way to look at the world, which is just like, hey, scale is all you need. Like, obviously, data matters. Obviously, all this stuff matters. But given any data set, a larger model will just do better.

I think it's going to be more expensive, but it's going to do better. There's the view of like, OK, there's all these GPUs going to production. Nvidia is going to sell well over 3 million total GPUs next year. You know, over a million H100s this year alone, right? You know, there's a lot of GPU capacity coming online.

It's an incredible amount. And like, well, what are people doing? What are people working on? I think it's very important to just think about what are people working on, right? Some, you know, what actually are you building that's going to advance? You know, what is monetizable? But what also makes sense?

And so like, a lot of people were doing things that I thought felt counterproductive. In a world where in less than a year, there's going to be more than 4 million high-end GPUs out there, we can talk about the concentration of those GPUs. But if you're doing really valuable work as a good person, right, like you're contributing in some way, should you be focused on like, well, I don't have access to any of those 4 million GPUs, right?

I actually only have access to gaming GPUs. Should I focus on being able to fine-tune a model on that, right? Like, it's not really that important. Or like, should I be focused on batch 1 inference on a cloud GPU? Like, no, that's pointless. Why would you do batch size 1 inference on an H100?

That's just like ridiculously dumb. There's a lot of counterproductive work. And at the same time, there's a lot of like, you know, things that people should be doing. And so like, you know, kind of you can peer the world into like, hey, like, I mean, obviously most people don't have resources, right?

And I love the open source and I want the open source to win. And I hate the people who want to like, you know, just like, no, we're X lab and we think this is the only way you should do it. And if people don't do it this way, they should be regulated against it and all this kind of stuff.

I hate that attitude. So I want the open source to win, right? Like companies like Mistral and like what Meta are doing, you know, Mosaic and, you know, all these folks together, blah, blah, blah, right? Like, all these people doing, you know, huge stuff with open source, you know, want them to succeed.

But it's like, there's certain things that are, you know, like hyper focusing on leaderboards and hugging face, right? Like, that's just like, no, like truthful QA is a garbage benchmark. Like, some of the models that are very high on there, if you use it for five seconds, you're like, this is garbage, right?

And it's just like, you're gaming a benchmark. So it's like, there was things I wanted to say. Also, you know, we're in a world where compute matters a lot. Google is going to have more compute than any other company in the world, period. By like a large, large factor.

And so like, it's just like framing it into that like, you know, mindset of like, hey, like, what are the counterproductive things? What do I think personally? Or what have people told me that are involved in this should they focus on? Right? You know, and, and what is the world where like, you know, hey, we're doing the pace of acceleration, you know, from 2020 to 2022 is less than 2022 to 2024.

Right? Like the growth from GPT-2 to GPT-4, call it 2020 to 2022, is less than, I think, the growth from GPT-4 in 2022, which is when it was trained, right, to what OpenAI and Google and Anthropic could do in 2025, right? Like, I think the pace of acceleration is increasing.

And it's just good to like, think about, you know, that sort of stuff. I don't know, I don't know where I'm rambling with this, but. Yeah, that makes sense. And the chart that Sam mentioned is about, yeah, Google TPU v5 completely overtaking OpenAI by like, orders of magnitude.

Let's talk about the TPU a bit. We had Chris Lattner on the show, who, as you know, used to work on TensorFlow at Google. And he did mention that the goal of Google is like make TPUs go fast with TensorFlow. But then he also had a post about PyTorch kind of stealing the thunder, so to speak.

How do you see that changing? If like, now that a lot of the compute will be TPU based and Google wants to offer some of that to the public too? I mean, Google internally, and I think, you know, is obviously on JAX and XLA and all that kind of stuff, right?

But externally, like, they've done a really good job. Like, I mean, I wouldn't say TPUs through PyTorch XLA is amazing, but it's not bad, right? Like some of the numbers they've shown, some of the, you know, code they've shown for TPU v5e, which is not the TPU v5 that I was referring to in the GPU-poor post.

But TPU v5e is like the new one, but it's mostly an inference chip. It's a small chip; it's about half the size of a TPU v5. That chip, you know, you can get very good performance on Llama 70B inference, right? Like, you know, very, very good performance.

So like when you're using PyTorch and XLA. Now, of course, you're going to get better if you go JAX and XLA, but I think Google is doing a really good job after the restructuring of focusing on external customers too, right? Like, hey, with TPU v5e, we probably won't focus too much on TPU v5 for everyone externally, but v5e, well, we're also building a million of those, right?

Hey, a lot of companies are using them, right, or will be using them because it's going to be an incredibly cheap form of compute. The world of like, you know, frameworks and all that, right? Like that's obviously something a researcher should talk about, not myself. But, you know, the stats are clear that PyTorch is way, way dominating everything.

But JAX is, like, doing well, like there's external users of JAX. But in the end, like, there's the front end, right? The front end is, like, what, you know, we're referring back to, or maybe it's something we do later and you guys are going to edit after. On the layers of abstraction, right, like, you know, it shouldn't forever be that the person doing, you know, PyTorch-level code, right, that high up, should also be writing custom CUDA kernels, right?

There should be, you know, different layers of abstraction where people hyper optimize and make it much easier for everyone to innovate on separate stacks, right? And then every once in a while, someone comes through and pierces through the layers of abstraction and innovates across multiple or a group of people.

But I think, you know, that's probably, you know, frameworks are important, but, you know, compilers are important, right? Chris Lattner, what he's doing is really cool. I don't know how it'll work, but it's super cool. And it certainly works on CPUs, we'll see about accelerators. Likewise, there's OpenAI's Triton, like what they're trying to do there.

And like, you know, everyone's really coalescing around Triton, you know, people, you know, third-party hardware vendors too. And then there's Pallas, right? So I don't know if you've heard about that, but I don't want to mischaracterize it. But you can write in Pallas, you can write the lower-level code, and it'll work on TPUs and GPUs, kind of like Triton, and there's also a Triton back end for it.

I don't know exactly everything about it. But I think there's a lot of innovation happening on making things go faster, right? How do you make it go brrr? Because every single person working in ML, you know, it would be a travesty if they had to write like custom CUDA kernels always, right?

Like that would just slow down productivity. But at the same time, you kind of have to. - Yeah. By the way, I like to quantify things. When you say make things go faster, is there a target range of, like, MFU that you typically talk about? - Yeah, there's sort of two metrics that I like to think about a lot, right?

So in training, everyone just talks about MFU, right? But then on inference, right, which I think, you know, LLM inference, or multimodal inference, whatever, will be bigger than training, you know, probably next year, in fact, at least in terms of GPUs deployed.

The other thing is like, you know, what's the bottleneck when you're running these models? So like, the simple, stupid way to look at it is training is, you know, there's six flops, floating point operations you have to do for every byte you read in, right? Every parameter you read in.

So if it's FP8, then it's a byte. If it's FP16, it's two bytes, whatever, right, on training. But on the inference side, the ratio is completely different. It's two to one, right? There's two flops per parameter that you read in, and a parameter is maybe one byte, right, because it's FP8 or int8, right, eight bits is a byte.

But then when you look at the GPUs, right, the GPUs have a very, very different ratio. The H100 has 3.35 terabytes a second of memory bandwidth, and it has a thousand teraflops of FP16, BF16, right? So that ratio is like, sorry, I'm going to butcher the math here and people are going to think I'm dumb, but 256 to one, right, call it 256 to one if you're doing FP16.

Same applies to FP8, right, because, anyways, it's per parameter read versus number of floating point operations, right? If you quantize further, you also get double the flops at that lower precision. That does not fit the hardware at all, right? So if you're just doing LLM inference at batch one, then you're always going to be underutilizing the flops.
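
For concreteness, a minimal sketch of the ratio mismatch being described, using the figures from the conversation (2 flops per parameter for inference, 6 for training, ~1,000 TFLOPS and 3.35 TB/s for the H100) and ignoring attention and KV-cache traffic; the exact crossover depends on precision and implementation:

```python
# Roofline-style check of the ratio mismatch: FLOPs the H100 can do per
# parameter read vs. what batch-1 LLM inference actually needs. Attention and
# KV-cache traffic are ignored, as in the conversation.

peak_flops = 1000e12    # ~1,000 TFLOPS dense FP16/BF16 (FLOPs roughly double at FP8)
mem_bw = 3.35e12        # HBM bandwidth, bytes/s
bytes_per_param = 1     # FP8/int8 weights (2 for FP16)

params_read_per_s = mem_bw / bytes_per_param
flops_available_per_param = peak_flops / params_read_per_s  # ~300
flops_needed_per_param = 2                                  # per token at batch 1; training needs ~6

print(f"FLOPs available per parameter read: ~{flops_available_per_param:.0f}")
print(f"Batch-1 inference uses ~{flops_needed_per_param / flops_available_per_param:.1%} of peak FLOPs")
print(f"Rough bandwidth/flops crossover batch: ~{flops_available_per_param / flops_needed_per_param:.0f}")
# Batch 1 is deeply bandwidth-bound; the crossover lands in the low hundreds,
# the same ballpark as the ~256 figure mentioned later for H100 inference.
# Training at multi-million-token batches reuses each weight read so heavily
# that flops, not bandwidth, become the bottleneck.
```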

You're only paying for memory bandwidth. And the way hardware is developing, that ratio is actually only going to get worse, right? H200 will come out soon enough, which will help the ratio a little bit, you know, it improves memory bandwidth more than it improves flops, just like the A100 80 gig did versus the A100 40 gig.

But then when the B100 comes out, the flops are going to increase more than memory bandwidth. And when future generations come out, and the same with AMD's side, right, MI300 versus 400, as you move on generations, just due to fundamental, like, semiconductor scaling, DRAM memory is not scaling as fast as logic has been.

And so you're going to continue to, and you can do a lot of interesting things on the architecture. So you're going to have this problem get worse and worse and worse, right? And so on training, it's very, you know, who cares, right? Because my flops are still my bottleneck most of the time.

I mean, memory bandwidth is obviously a bottleneck, but like, well, you know, batch sizes are freaking crazy, right? Like people train like 2 million batch size is trivial, right? Like, that's what Llama, I think, did. Llama 70B was a 2 million batch size. And like, you talk to someone at one of the Frontier Labs, and they're like, right, just 2 million, right?

2 million token batch size, right? That's crazy, or sequence, sorry. But when you go to inference side, it's like, well, it's impossible to do, one, to do 2 million batch size. Also, your latency would be horrendous if you tried to do something that crazy, right? So you kind of have this like differing problem where on training, everyone just kept talking MFU, Model Flop Utilization, right?

How many flops? Six times the number of parameters, basically, more or less. And then what's the quoted number, right? So if I have 312 teraflops out of my A100, and I was able to achieve 200, that's really good, right? You know, some people are achieving higher, right? Some people are achieving lower.
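
A minimal sketch of the MFU bookkeeping just described; the 6 x parameters x tokens approximation and the 312 TFLOPS / ~200 TFLOPS A100 example come from the conversation, while the 70B run at the end uses made-up numbers purely for illustration:

```python
# MFU bookkeeping: achieved training throughput vs. the hardware's peak FLOPs,
# with training FLOPs approximated as 6 * parameters * tokens.

def mfu(params: float, tokens: float, wall_clock_s: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    model_flops = 6 * params * tokens                 # the 6N-per-token rule of thumb
    achieved_per_gpu = model_flops / wall_clock_s / num_gpus
    return achieved_per_gpu / peak_flops_per_gpu

# The A100 example from the conversation: 312 TFLOPS peak, ~200 TFLOPS achieved.
print(f"A100 example: {200e12 / 312e12:.0%} MFU")     # ~64%, i.e. "really good"

# Purely illustrative end-to-end check with made-up run parameters:
example = mfu(params=70e9, tokens=2e12, wall_clock_s=30 * 86_400,
              num_gpus=2048, peak_flops_per_gpu=312e12)
print(f"Hypothetical 70B-on-2T-tokens run: {example:.0%} MFU")   # ~51% with these numbers
```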

That's a very important like metric to think about. Now you have like people thinking MFU is like a security risk. But on inference, MFU is not nearly as important, right? It's memory bandwidth utilization. You know, batch one is, you know, what memory bandwidth can I achieve, right? Because as I increase batch from batch size one to four to eight to even 256, right?

It's sort of where the crossover happens on an H100, inference wise, right? Where it's flops limiting you more and more. But like, you should have very high memory bandwidth utilization. So when people talk about A100s, like 60% MFU is decent, right? On H100s, it's more like 40, 45% because the flops increased more than the memory bandwidth.

But people over time will probably get above 50% on H100, on MFU, on training. But on inference, it's not being talked about much, but MBU, model bandwidth utilization is the important factor, right? So my 3.35 terabytes a second memory bandwidth on my H100, can I get two? Can I get three, right?

That's the important thing. And right now, if you look at everyone's, you know, inference stacks, I dogged on this in the GPU-poor thing, right? But it's like, Hugging Face's libraries are actually very inefficient, like incredibly inefficient for inference. You get like 15% MBU on some configurations, like eight A100s and Llama 70B, you get like 15%, which is just like horrendous.

Because at the end of the day, your latency is derived from what memory bandwidth you can effectively get, right? You know, so if you're doing Llama 70 billion, 70 billion parameters, if you're doing it in int8, okay, that's 70 gigabytes you need to read for every single inference, every single forward pass, plus, you know, the attention, but again, we're simplifying it.

70 gigabytes you need to read for every forward pass, what is an acceptable latency for a user to have? I would argue, you know, 30 milliseconds per token. Some people would argue lower, right? At the very least, you need to achieve human reading level speeds and probably a little bit faster, because people like to skim.

To have a usable model for chatbot style applications, now there's other applications, of course, but for chatbot style applications, you want it to be human reading speed. So 30 milliseconds per token is about 33 tokens per second, times 70 gigabytes is, let's say 3 times 7 is 21, and then add the zeros, so roughly 2,100 gigabytes a second, right?

To achieve human reading speed on Llama 70B, right? So, one, you can never achieve Llama 70B human reading speed on an A100, even if you had enough memory capacity, right? Even on an H100 it's hard to achieve human reading speed, right? And of course, you can't really fit it anyway, because it's 80 gigabytes versus 70 billion parameters, so you're kind of butting up against the limits already.

70 billion parameters being 70 gigabytes at int8 or fp8. You end up with, one, how do I achieve human reading level speeds, right? So, if I go with two H100s, then now I have, you know, call it 6 terabytes a second of memory bandwidth. If I achieve just 30 milliseconds per token, which is 33 tokens per second, which is 2.1 terabytes a second of memory bandwidth, then I'm only at like 30% bandwidth utilization.
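
Written out as a sketch, with the 70 GB of int8/FP8 weights, the ~30 ms/token target, and the H100's 3.35 TB/s taken from the conversation, and attention/KV-cache reads ignored as noted:

```python
# Model Bandwidth Utilization (MBU) for batch-1 decoding: every generated token
# streams all the weights through memory, so required bandwidth is roughly
# model_bytes / per-token latency. Attention/KV-cache reads are ignored.

model_bytes = 70e9           # Llama 70B at int8/FP8: ~70 GB of weights
target_latency_s = 0.030     # ~30 ms/token, i.e. ~33 tokens/s "reading speed"

required_bw = model_bytes / target_latency_s
print(f"Required bandwidth: {required_bw / 1e12:.2f} TB/s")   # ~2.33 TB/s (the ~2.1 figure rounds to 30 tok/s)

h100_bw = 3.35e12
for n_gpus in (1, 2):
    mbu_needed = required_bw / (n_gpus * h100_bw)
    print(f"{n_gpus}x H100: would need ~{mbu_needed:.0%} MBU")
# 1x H100: ~70% MBU, and the 70 GB of weights barely fit in 80 GB of HBM anyway.
# 2x H100: only ~35% needed, so landing at ~30% MBU is a poor result;
# well-tuned stacks are reported at 60-80% MBU.
```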

So, I'm not using all my flops at batch one anyways, right? Because the flops that you're using there are tremendously low relative to what the hardware offers. And so if, with two H100s, I only get 30 milliseconds a token, that's a really bad result.

You should be striving to get, you know, upwards of 60%, and that's like, 60% is kind of low too, right? Like, I've heard people getting 70, 80% model bandwidth utilization. You know, obviously, you can increase your batch size from there, and your model bandwidth utilization will start to fall as your flops utilization increases, but, you know, you have to pick the sweet spot for where you want to hit on the latency curve for your user.

Obviously, as you increase batch size, you get more throughput per GPU. So, that's more cost effective. There's a lot of, like, things to think about there, but I think those are sort of the two main things that people want to think about. And there's obviously a ton with regards to, like, networking and inter-GPU connection, because most of the useful models don't run on a single GPU.

They can't run on a single GPU. Is the TPU equipped well enough with links? So, the TPUs, the Google TPU is, like, super interesting, because Google has been working with Broadcom, who's the number one networking company in the world, right? So, Mellanox was nowhere close to number one. They had a niche that they were very good at, which was the network card, the card that you actually put in the server, but they didn't do much.

They weren't that successful in switches, right? Which is, you know, you connect all the network cards to switches, and then the switches to all the, you know, servers. So, Mellanox was not that great. I mean, it was good. They were doing well. And NVIDIA bought them, you know, in '19, I believe, or '18.

But Broadcom has been number one in networking for a decade plus, right? And Google partnered with them on making the TPU. And, you know, the TPU, all the way through TPU v5, which is the one they're in production of now, and v6, and, you know, all of these.

These are all going to be, you know, co-designed with Broadcom, right? So, Google does a lot of the design, especially on the ML hardware side, on how you pass stuff around internally on the chip. But Broadcom does a lot on the network side, right? They specifically, you know, how to get really high connection speed between two chips.

Right? They've done a ton there. Obviously, Google works a ton there, too. But this is sort of like Google's, like, less discussed partnership that's truly critical for them. And Google's tried to get away from them many times. Their latest target to get away from Broadcom is 2027, right? But, like, you know, that's four years from now.

Chip design cycle is four years. So, they already tried to get away in 2025, and that failed. But, yeah, they have this equivalent of very high speed networking. It works very differently than the way GPU networking does. And that's important for people who code on a lower level. I've seen this described as the ultimate limit on how big models are built.

It's not flops. It's not memory. It's networking. Like, it has the lowest scaling law, like, the slowest Moore's Law of all of them. And I don't know what to do about that because no one else has any solutions. Yeah, yeah. So, I think what you're referring to is that, like, network speed has increased slower than flops and bandwidth.

Yeah, yeah. And, yeah, that's a tremendous problem in the industry, right? But, like, that's why NVIDIA bought a networking company. That's why Broadcom is working on Google's chip right now, and, of course, on Meta's internal AI chip, which they're on the second generation of. And the main interesting thing Meta's doing there is networking stuff, right?

Multiplying tensors is kind of, you know, anyone can... There's a lot of people who've made good matrix multiply units, right? But it's about, like, getting good utilization out of those and interfacing with the memory and interfacing with other chips really efficiently. It makes designing these chips very hard. And most of the startups, obviously, have not done that really well.

Yeah, I mean, I think the startups point is the most interesting, right? You mentioned companies that are GPU-poor but raised a lot of money. And there's a lot of startups out there that are GPU-poor and did not raise a lot of money. What should they do? How do you see, like, this space dividing?

Are we just supposed to wait for, like, the big labs to do a lot of this work with a lot of the GPUs? Like, what should the GPU-poor be doing? Well, the whole point of the article was that, like, Google versus, you know, OpenAI, who everyone would be like, "Oh, yeah, they have more GPUs than anyone else," right?

But they have a lot less flops than Google, right? That was the point of the, like, thing. And it was, but not just them. It's like, okay, you know, it's like a relative totem pole, right? Now, of course, Google doesn't use GPUs as much for training and inference. They do use some, but mostly TPUs.

So, kind of like, the whole point is that everyone is GPU-poor because we're going to continue to scale faster and faster and faster and faster. And compute will always be a bottleneck, just like data will always be a bottleneck, right? You can have the best data set in the world, and you can always have a better one.

And same with, you know, the biggest compute system in the world, and you can, but you'll always want a better one. You know, like, Mistral, right? Like, they trained a freaking awesome model on relatively fewer GPUs, right? And now they're scaling up higher and higher and higher, right? You know, there's a lot that the GPU-poor can do, though, right?

Like, hey, we all have phones, we all have laptops, right? There is a world for running GPUs or models on device, right? You know, the Repl.it folks are, you know, trying to do stuff like that. Their models can't be that big. They can't follow the scaling laws, right? Why? Because there is a fundamental limit to how much memory bandwidth and capacity that you can get on a laptop or a phone, right?

You know, I mentioned the ratio of flops to bandwidth on a GPU is actually really, really good compared to, like, a MacBook or, like, a phone. Hey, to run Llama 70 billion requires 2.1 terabytes a second of memory bandwidth at human reading speed. Yeah, but my phone has, like, 50 gigabytes a second, right?
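
The same bandwidth arithmetic applied to devices; a rough sketch where the ~50 GB/s phone figure is from the conversation, the other bandwidth numbers are assumptions, and speculative-decoding tricks like the ones mentioned next are ignored:

```python
# Upper bound on batch-1 decode speed: tokens/s <= memory_bandwidth / model_bytes.
# Only the ~50 GB/s phone figure comes from the conversation; the rest are
# rough assumptions, and speculative decoding / Medusa-style tricks are ignored.

model_bytes = 70e9                              # Llama 70B at int8/FP8
devices = {
    "phone (~50 GB/s)":            50e9,
    "laptop (assumed ~200 GB/s)":  200e9,
    "H100 (3.35 TB/s)":            3.35e12,
}
for name, bw in devices.items():
    print(f"{name:>28}: <= {bw / model_bytes:5.1f} tokens/s")
# Phone: well under 1 token/s; a laptop is a few tokens/s, far below the
# ~33 tokens/s reading-speed target, hence the interest in speculative decoding.
```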

Your laptop, even if you have an M1 Ultra, has, what, like, I don't remember, like, a couple hundred gigabytes a second memory bandwidth. You can't run Llama 70b just by doing the classical thing. So there's, like, there's stuff like speculative decoding, and then, you know, Together did something really cool, and they put it in open source, of course, Medusa, right?

Like, things like that that are, you know, they work on batch size one. They don't work on batch size, you know, high. And so there's, like, the world of, like, cloud inference. And so in the cloud, it's all about, you know, what memory bandwidth and MFU I can achieve.

Whereas on the edge, I don't think Google is going to deploy a model that I can run on my laptop to help me with code or help me with, you know, X, Y, Z. They're always going to want to run it on a cloud for control. Or maybe they let it run on the device, but it's, like, only their Pixel phone, you know, it's kind of like a walled garden thing.

There's obviously a lot of reasons to do other things for security, for, you know, openness, to not be at the whims of a trillion-dollar-plus company who wants my data, right? Like, you know, there's a lot of stuff to be done there. And I think, like, folks like Repl.it are, like, you know, I love it, right?

That's exactly, like, the stuff, you know. They open-sourced their model, right? Yeah, so they open-sourced their model. I think, you know, things like Together, what I just mentioned, right, that developing Medusa, right, that didn't take much GPU at all, right? While they do have quite a few GPUs, they made a big announcement about having 4,000 H100s.

That's still relatively poor, right, when we're talking about hundreds of thousands of, like, the big labs, like OpenAI and so on and so forth, or millions of TPUs like Google. But, you know, still, they were able to develop Medusa with probably just one server, right? One server with eight GPUs in it.

And the usefulness of something like Medusa, or something like speculative decoding, is on-device, right? And that's what, like, a lot of people can focus on. You know, people can focus on all sorts of things like that. I don't know, right? Like, a new model architecture, right? Like, are we only going to use transformers?

I'm pretty tilted towards thinking, like, transformers are it, right? Like, just because, like, my hardware brain can only love something that loves hardware, right? But, so, like, you know, people should continue to try and innovate on that, right? Like, you know, asynchronous training, right? Like, that kind of stuff is, like, you know, super, super interesting.

Like, Tim Dettmers. Yeah, yeah, distributed, like, not in one data center. I think it's Tim Dettmers. He had, like, the swarm. Yeah, there you go. Sorry, sorry. Same guy. Yes, he had the SWARM paper and Petals. And, well, I think Petals is, whatever. You know, like, that research is super cool.

It's, like, SETI@home, right? It's not bananas. Yeah, I mean, yeah. But, like, I like research on that kind of stuff, right? Like, you know, hey, like, the universities will never have much compute. But, like, hey, you know, they can prepare, do things like that, you know, all these sorts of stuff.

Like, they should try to build, you know, super large models. Like, you look at what Tsinghua University is doing in China. Like, actually, they open-sourced, I think, the largest open-source model by parameter count, at least. I don't remember the name. An MoE, yeah. Yeah, that's from Tsinghua University, though, right?

Yeah, yeah. I think it was, like, 1.7 trillion. Yeah, I mean, of course, they didn't train it on much data. But it's, like, you know, it's, like, still, like, you could do some cool stuff like that. I don't know. I think there's a lot that people can focus on.

Because, you know, one, scaling out a service to many, many users. Distribution is very important. So, figuring out distribution, right? Like, figuring out useful fine-tunes, right? Like, you know, doing LLMs that, you know, OpenAI will never make, or, you know, sorry for the crassness, a porn DALL-E 3, right? But open source is doing crazy stuff with Stable Diffusion, right?

Right? Like, I don't even know. Yeah, but it's, like, and there is a legitimate market. I think there's a couple companies who make tens of millions of dollars of revenue from, from, yeah, from LLMs or diffusion models for porn, right? Or, you know, that kind of stuff. Like, I mean, there's a lot of stuff that people can work on that will be successful businesses, or doesn't even have to be a business, but can advance humanity tremendously.

That doesn't require a crazy scale. How do you think about the depreciation of, like, the hardware versus the models? Like, we covered two years. Two years in full. Yeah, like, we covered open models for a while. If I think about the episodes we had, like, in March, with, like, MPT-7B.

Oh, yeah, nobody talks about that anymore. Exactly. It's, like, the depreciation is, like, three months. Well, I mean, no one should be talking about Llama 13 billion anymore, right? Because Mistral just showed them up, right? Yeah. So, I'm really curious. It's, like, you know, if you buy an H100, sure, the next series is going to be better, but, like, at least the hardware is good.

If you're spending a lot of money on, like, training a smaller model, like, it might be, like, super obsolete in, like, three months. And you got now all this compute coming online. I'm just curious if, like, companies should actually spend the time to, like, you know, fine-tune them and, like, work on them.

Where, like, the next generation is going to be out of the box so much better. Unless you're fine-tuning for on-device use, I think fine-tuning current existing models, especially the smaller ones, is a useless waste of time, right? Because the cost of inference is actually much cheaper than you think once you achieve good MBU and you batch at a decent size, which any successful business in the cloud is going to achieve.

You know, and then two, fine-tuning, like, people are like, "Oh, you know, this seven billion parameter model, if you fine-tune it on a data set, it's almost as good as 3.5, right?" It's like, "Yeah, but why don't you just fine-tune 3.5 and look at your performance, right?" And, like, there's nothing open source that is anywhere close to 3.5 yet, right?

There will be, there will be. I think, I think people also don't quite grasp… Falcon was supposed to be Falcon 140b. It's less parameters than 3.5, and also, I don't know about the exact token count, but I believe it's less than… Do we know the parameters of 3.5? It's not 175 billion.

Right, we know GPT-3. Because we know 3, but we don't know 3.5. 3.5… It's definitely smaller. No, it's bigger than 175 billion, but I think it's sparse. I think it's, you know, MoE, I'm pretty sure. Yeah, you can do some, like, gating around the size of it by looking at their inference latency.

Which is also… It's upper bounds. Yeah, you can look at, like, what's the theoretical bandwidth if they're running it on this hardware, and, you know, and doing tensor parallel in this way, so they have this much memory bandwidth, and maybe they get, maybe they're awesome, and they get 90% memory bandwidth utilization.

I don't know, that's an upper bound, and you can see the latency that 3.5 gives you, like, especially at, like, off-peak hours, or if you do fine-tuning, and if you have a private enclave, like, then Azure will quote you latency. So, you can figure out how many parameters per forward pass, which I think is somewhere in the, like, 40 to 50 billion range, but I could be very wrong, that's just, like, my guess, based on that sort of stuff.
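
A sketch of the kind of upper-bound estimate being described: the observed per-token latency times an assumed aggregate memory bandwidth caps how many parameter bytes can be read per forward pass. Every input below (latency, hardware, precision, utilization) is an illustrative assumption, not a known detail of how GPT-3.5 is actually served:

```python
# Upper-bound estimate of active parameters from observed decode latency:
# params_active <= latency * aggregate_bandwidth * MBU / bytes_per_param.
# Every input here is an illustrative assumption, not a known serving detail.

latency_s = 0.020          # assumed observed ~20 ms/token at off-peak
n_gpus = 8                 # assumed tensor-parallel A100 80GB node
bw_per_gpu = 2.0e12        # A100 80GB HBM, ~2 TB/s
mbu = 0.90                 # generously assume 90% bandwidth utilization
bytes_per_param = 1        # int8/FP8 weights

params_upper_bound = latency_s * n_gpus * bw_per_gpu * mbu / bytes_per_param
print(f"Active parameters per forward pass: <= ~{params_upper_bound / 1e9:.0f}B")
# ~288B with these generous inputs -- a loose ceiling. Tighter latency numbers
# and less generous MBU assumptions are how you work down toward a
# "tens of billions active, probably MoE" style of guess.
```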

You know, 50-ish, but then… Does it have 16 experts, or… I have no clue, I have no clue. There's no way to figure that out, because it's just so noisy. Yeah, actually, there's someone I've talked to at one of the labs who, like, thinks they can figure out how many experts are in a model by querying it a crapload, but that's only if you have access to the logits, like, the percentage chance, yeah, before you do the softmax, I don't know.

But yeah, there's, like, a ton of, like, competitive analysis you could try to do. But anyways, I think open source will have models of that quality, right? I think, like, you know, I mean, I assume Mosaic or, like, Meta will open source, and Mistral will be able to open source models of that quality.

Now, furthermore, right, like, if you just look at the amount of compute, obviously data is very important, and the ability… All these tricks and dials that you turn to be able to get good MFU and good MBU, right, like, depending on inference or training, is… There's a ton of tricks, but at the end of the day, like, there's, like, 10 companies that have enough compute in one single data center to be able to beat GPT-4, right?

Like, straight up, like, if not today, within the next six months, right? Like, 4,000 H100s is… I think you need about 7,000, maybe, and with some algorithmic improvements that have happened since GPT-4, and some data quality improvements, you could probably get to even, like, you know, less than 7,000 H100s running for three months to beat GPT-4.
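
For scale, a rough sketch of what roughly 7,000 H100s for three months represents in raw training compute; the GPU count and duration are from the conversation, while the peak FLOPS and MFU figures are assumptions:

```python
# Raw training compute from ~7,000 H100s over ~3 months, under assumed utilization.
n_gpus = 7_000
days = 90
peak_flops = 1000e12     # ~1,000 TFLOPS dense FP16/BF16 per H100 (assumption)
mfu = 0.40               # assumed 40% MFU, the H100 ballpark discussed earlier

total_flops = n_gpus * days * 86_400 * peak_flops * mfu
print(f"~{total_flops:.1e} training FLOPs")    # ~2.2e+25 FLOPs
# With the 6 * parameters * tokens rule of thumb, that budget covers a model
# and token count in the same league as the frontier models being discussed.
```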

Of course, that's going to take a really awesome team. But, you know, there's quite a few companies that are going to have that many, right? Open source will match GPT-4, but then it's like, what about GPT-4 Vision, or what about, you know, 5 and 6, and, you know, all these kinds of stuff, and, like, interactive tool use and DALL-E, and, like, that's the other thing is, like, there's a lot of stuff on tool use that the open source could also do, that the GPU-poor could do.

I think there are some folks that are doing that kind of stuff, agents and all that kind of stuff, I don't know. That's way over my head, the agent stuff. Yeah, it's over everyone's head. One more question on just, like, the sort of Gemini GPU rich essay. We've had a very wide reaching conversation already, so it's hard to categorize.

But I tried to look for the Meena Eats the World document. Oh, it's not possible. No, no, no, no, no. Yeah, I'll look in this article. No, so Noam Shazeer-- Read it. Yeah, I've read it. So Noam Shazeer is, like, I don't know, I think he's, like-- The GOAT.

The GOAT. Yeah, I think he's the GOAT. Like, obviously, like, in one year, he published, like-- Yeah, exactly. It's, like, all this stuff that we were talking about today was, like, you know. And obviously, there's other people that are awesome that were, you know, helping and all that sort of stuff, just to be clear.

There were a couple other papers, but, like, yeah. So, like, Meena Eats the World was basically-- He wrote an internal document around the time when Google had Meena, right? And Meena was one of their LLMs that, like, is a footnote in history. Like, you know, most people will not, like, think of Meena as relevant.

But it was, like-- He wrote it, and he was, like, basically predicting everything that's happening now, which is that, like, large language models are going to eat the world, right? In terms of, you know, compute. And he's, like, the total amount of deployed flops within Google data centers will be dominated by large language models.

And, like, back then, a lot of people thought he was, like, silly for that, right? Like, internally at Google. But, you know, now, if you look at it, it's, like, oh, wait. Millions of TPUs. You're right, you're right, you're right. Okay, we're totally getting dominated by, like, both, you know, Gemini training and inference, right?

Like, you know, like, or whatever, 2, 3, 4, plus 1, 2, 3, 4 for Gemini, and all these other things. Like, that's, yeah, total flops being dominated by LLMs was completely right. So my question was, he had a bunch of predictions in there. Do you think there are any, like, underrated predictions that may not yet have come true by your-- I think, like, if, you know, obviously, I read the document, but I read it on someone else's device.

They didn't send it to me, so I can't really send it, sorry. And also, they were okay with me talking about the document and calling Noam a GOAT, because they also think Noam is a GOAT. But I think, like, you know, now, most everybody is, like, scaling law pilled, and, like, LLM pilled, and, like, you know, all this sort of stuff.

And, like, it's a very clear line of sight. Was he wrong about anything? I don't-- I mean, like, Meena sucked, right? I mean, it was great for the time, right? So, like-- it's, like, really incredible. I don't remember off the top of my head, but it's like, if you look at the total flops, right?

You know, parameters times tokens times six, right? It's like-- it's like a tiny, tiny fraction of GPT-3, which came out just a few months later, which is like, okay, so he wasn't right about everything, but, like, maybe he knew about GPT-- I have no clue. But, you know, OpenAI clearly was, like, way ahead of Google on LLM scaling, even then, right?

It's just people didn't really recognize it back in GPT-2 days, maybe. Or the number of people that recognized it was maybe hundreds, tens, right? I don't know. You mentioned transformer alternatives. The other thing is GPU alternatives. So, the TPU is obviously one, but there's Cerebras, there's Graphcore, there's MatX, Lemurian Labs, there's a lot of them.

Thoughts on what's real, who's alive, who's kind of like a zombie company walking? So, if you go back and, like, you know, I mean, I mentioned, like, transformers were the architecture that won out, but I think, you know, the number of people who recognized that in 2020 was, you know, as you mentioned, probably hundreds, right?

You know, for natural language processing, maybe in 2019, at least, right? You think about a chip design cycle, it's like years, right? You know, so it's kind of hard to bet your architecture on the type of model that develops. But what's interesting about all the first wave AI hardware startups is you kind of have, you know, this ratio of memory, capacity, compute, and memory bandwidth, right?

And so everyone kind of made the same bet, which is I have a lot of memory on my chip, which is, A, really dumb, because the models grew way past that, right? Even Cerebras, right? I mean, you know, like, I'm talking about, like, Graphcore: it's called SRAM, which is the memory on chip, much lower density, but much higher speed, versus, you know, DRAM, which is the, you know, memory off chip.

And so everyone was betting on, you know, pretty much more memory on chip and less memory off chip, right? And if that, and to be clear, right, for image networks and models that are small enough to just fit on your chip, that works. That is the superior architecture, but, you know, scale, right?

Scale, scale, scale, scale. So NVIDIA was the only company that bet on the other side of more memory bandwidth, right, and more memory capacity external, also the right ratio of memory bandwidth versus capacity, right? Because there were people, a lot of people, like GraphCore specifically, right? They had a ton of memory on chip, and then they had a lot more memory off chip, but that memory off chip was a much lower bandwidth.

Same applies to SambaNova, same applies to Cerebras. You know, Cerebras had no memory off chip, but they thought, hey, I'm going to make a chip the size of a wafer, right? Like, I can fit, you know, fine. You know, those guys, they're silly, right? They have hundreds of megabytes, we have 40 gigabytes.

There's no, you know... and then, oh, crap, models are way bigger than 40 gigabytes, right? Everyone bet on sort of the left side of this curve, right? The interesting thing is that there are new-age startups like Lemurian, like MatX. I won't get into what they're doing, but they're making much more rational bets.

I don't know, you know, it's hard to say with a startup whether it's going to work out, right? Obviously, there's tons of risk embedded, but those folks, like, you know, Jay Dawani at Lemurian, and, like, you know, Mike and Reiner at MatX, they understand models. They understand how they work. And if transformers continue to reign supreme, right?

You know, whatever innovations those folks are doing on hardware are going to need to be, you know, fitted for that. Or you have to predict what the model architecture is going to look like in a few years, right? What does it look like, right? You know, and hit that spot correctly.

So that's kind of the background on those. But like, now you look today, it's like, hey, you know, Intel bought Nervana, which was Naveen Rao's company before MosaicML. He later started MosaicML and sold it to Databricks, and recently, he's obviously leading LLMs and AI stuff there. But, you know, Intel bought Nervana from him and then shut it down and bought this other AI company.

And now that company is kind of, you know, got new chips. They're going to release a better chip than the H100 within the next quarter or so, right? AMD, they have a GPU, MI300, that will be better than the H100 in a quarter or so. Now that says nothing about how hard it is to program it, but at least hardware wise on paper, it's better.

Why? Because it's, you know, a year and a half later than the H100, right? Or a year later than the H100, of course. And, you know, a little bit more time and all that sort of stuff. But, you know, they're at least making similar bets on memory bandwidth versus flops versus capacity, kind of following NVIDIA's lead.

The questions are like, what is the correct bet for three years from now? How do you engineer that? And will those alternatives make sense? The other thing is, if you look at total manufacturing capacity, right? For this sort of bet, right? You need high bandwidth memory, you need HBM and you need large five nanometer dies, you know, soon three nanometer or whatever, right?

You need both of those components and you need the whole supply chain to go through that. We've written a lot about it, but, you know, to simplify it, NVIDIA has a little bit more than half and Google has like 30%, right? Through Broadcom. So it's like the total capacity for everyone else is much lower and they're all sharing it, right?

Amazon's Trainium and Inferentia, Microsoft's in-house chip. And, you know, you go down the list and it's like Meta's in-house chip and also AMD. And all of these companies are sharing, like, a much smaller slice. Their chips are not as good, or if they are, you know, like I mentioned, Intel and AMD's chips are better.

That's only because they're throwing more money at the problem kind of, right? You know, NVIDIA charges crazy prices. I think everyone knows that. Their gross margins are insane. AMD and Intel and others will charge more reasonable margins. And so they're able to give you more HBM and et cetera for a similar price.

And so that ends up letting them beat NVIDIA, if you will. But their manufacturing costs are twice that in some cases, right? In the case of AMD, their manufacturing costs for MI300 are more than twice that of H100. And it only beats H100 by a little bit from, you know, performance stuff I've seen.

So it's like, you know, it's tough for anyone to like bet the farm on an alternative hardware supplier, right? Like in my opinion, like you should either just like be like, you know, a lot of like ex-Google startups are just using TPUs, right? And hey, that's Google Cloud, you know, after moving the TPU team, you know, into the cloud team, infrastructure team, sort of, they're much more aggressive on external selling.

And so you see companies, even companies like Apple, using TPUs for training LLMs as well as GPUs. But, you know, either bet heavily on TPUs because that's where the capacity is, or bet heavily on GPUs, of course, and stop worrying about it and leverage all this amazing open source code that is optimized for NVIDIA.

Or, okay, if you do bet on AMD or Intel or on any of these startups, then you better make damn sure you're really good at low-level programming and damn sure you also have a compelling business case and that the hardware supplier is giving you such a good deal that it's worth it.

And also, by the way, NVIDIA is releasing a new chip, you know, they're going to announce it in March and they're going to release it, you know, and ship it, you know, Q2, Q3 next year anyways, right? And that chip will probably be three or four times as good, right, and maybe it'll cost twice as much or 50% more.

I hear it's 3X the performance on an LLM and 50% more expensive is what I hear. So it's like, okay, yeah, like nothing, nothing is going to compete with that, even if it is 50% more expensive, right? And then you're like, okay, well, that kicks the can down further.

And then NVIDIA is moving to a yearly release cycle. So it's like very hard for anyone to catch up to NVIDIA really, right? As for, you know, investing all this in other hardware: like, if you're Microsoft, obviously, who cares if I spend $500 million a year on my internal chip?

Who cares if I spend $500 million a year on AMD chips, right? Like if it lets me knock the price of NVIDIA GPUs down a little bit, puts the fear of God in Jensen Huang, right? Like, you know, then it is what it is, right? And likewise, you know, with Amazon and so on and so forth, you know, of course, the hope is that their chips succeed or that they can actually have an alternative that is much cheaper than NVIDIA.

But to throw a couple hundred million dollars at a product like this is completely reasonable. And in the case of AMD, I think it'll be more than a couple hundred million dollars, right? But yeah, I think alternative hardware really does hit sort of a peak hype cycle kind of end of this year, early next year, because all NVIDIA has is the H100 and then the H200, which is just a better H100: more memory bandwidth, higher memory capacity, right?

But that doesn't beat what, you know, AMD are doing. It doesn't beat what, you know, even Intel's Gaudi 3 does. But then very quickly after, NVIDIA will crush them. And then those other companies are gonna take two years to get to their next generation. You know, it's just a really tough place.

And no one decides, you know, the main thing about hardware is like, hey, that bet I talked about earlier is like, you know, that's very oversimplified, right? Just memory bandwidth flops and memory capacity. There's a whole lot more bets. There's a hundred different bets that you have to make and guess correctly to get good hardware, not even have better hardware than NVIDIA, get close to them, right?

And that takes understanding models really, really well. That takes understanding, you know, so many different aspects, whether it's power delivery or cooling or, you know, design, layout, all this sort of stuff. And it's like, how many companies can do everything here, right? It's like, I'd argue Google probably understands models better than NVIDIA.

I don't think people would disagree. And NVIDIA understands hardware better than Google. And so you end up with, like, Google's hardware is competitive, but like, does Amazon understand models better than NVIDIA? I don't think so. And does Amazon understand hardware better than NVIDIA? No, right? Like it's like- - That's, like, the Anthropic investment, right, or the investment in Anthropic.

- Yeah, I'm also of the opinion that the labs are, they're useful partners. They're convenient partners, but they are not gonna like, they're not gonna buddy up as close as people think, right? I don't even think like, you know, I expect in the next few years that the OpenAI-Microsoft partnership probably falls apart too.

- That'd be huge. - I mean, they'll still continue to use GPUs and stuff there. But like, I think that the level of closeness you see today is probably the closest they get, right? Like, I mean- - At some point they become competitive. If OpenAI becomes its own cloud.

- Yeah, I mean, I think OpenAI wants to not just become a trillion-dollar company, but a 10-trillion-dollar one. I mean, and not just as a company, right? But like the level of value that they deliver to the world, if you talk to anyone there, they truly believe it'll be tens of trillions, if not hundreds of trillions of dollars, right?

In which case, obviously, you know, I know weird corporate structure aside, like, you know, this is the same like playing field as companies like Microsoft and Google. Like Google wants to also deliver hundreds of trillions of dollars of value. And it's like, obviously you're competing and Microsoft wants to do the same and you're gonna compete.

And like, yeah, I think in general, right? Like these lab partnerships are gonna be nice, but they're probably incentivized to, you know, go, "Hey, NVIDIA, can you design the hardware in this way?" And NVIDIA's like, "No, it doesn't work like that. It works like this." And they're like, "Oh, so this is the best compromise, right?" Like, I think OpenAI would be stupid not to do that with NVIDIA, but also with AMD.

But also, hey, with Microsoft's internal silicon too. But it's like, how much time do I actually have, right? Like, you know, should I do that? Should I spend all of my, you know, super, super smart people's limited time, this caliber of person's time, doing that?

Or should they focus on like, "Hey, can I get like asynchronous training to work?" Or like, you know, figure out this next multimodal thing, or I don't know, I don't know, right? Like it's probably better, you know, "Hey, can I eke out 5% more MFU and work on designing the next supercomputer," right?

Like these kinds of things, how much more valuable is that, right? So it's like, you know, it's tough to see, you know, even OpenAI helping Microsoft enough to get Microsoft's knowledge of models that good, right? Like Microsoft's gonna announce their chip soon. It has worse performance than the H100, but the cost effectiveness of it is better for Microsoft internally, just because they don't have to pay the NVIDIA tax.

But again, by the time they ramp it and all that, it's like, "Oh, hey, that only works on a certain size of models; once you exceed that, then it's actually, again, better to use NVIDIA." So it's really tough for OpenAI to be like, "Yeah, we want to bet on Microsoft's silicon," right?

Like, "And hey, we have, you know, "I don't know, what's their number of people they have now? "Like 700 people, you know, "of which how many do low-level code? "Do I want to have separate code bases "for this and this and this and this?" And, you know, it's like, it's just like a big headache to, I don't know, I think it'd be very difficult to see anyone truly pivoting to anything besides a GPU and a TPU, especially if you have, if you need that scale, right?

And the scale that at least the labs require is absurd, right? Google says millions of TPUs. OpenAI will say millions of GPUs, right? I truly do believe they think they need that number of next-generation GPUs. The numbers we're going to get to are like, I don't know, but I bet Sam Altman would say, "Yeah, we're going to build a $100 billion supercomputer in two or three years," right?

And after GPT-5 releases, if he goes to the market and says, "Hey, I want to raise $100 billion at a $500 billion valuation," I'm sure the market would give it to him, right? And then they build that supercomputer. I think that's truly the path we're on.

And so it's hard to imagine. Yeah, I don't know. - One point that you didn't touch on, and Taiwan companies are famously very chatty about the fruit company. Should we take Apple seriously at all in this game, or are they just in a different world altogether? - I don't know. Just from my view of Apple, I don't personally use Apple products, but I buy my mom a new iPhone every year, right?

Just to be clear, right? Mom gets a new Apple Watch every couple of years too, of course. So I respect their products, but I don't think Apple will ever release a model that you can get to say really bad things, right?

Or racist things or whatever, right? I don't think they can ever do that. But frankly, I'm sure OpenAI releasing 3.5 and 4 has had people doing jailbreaks, to borrow the old iPhone terminology, jailbreaking the model and getting it to do bad things, right?

"Teach me how to make anthrax," right? Or say these hateful things, like rank the races of the world, crazy stuff. I mean, I've seen it on Twitter, I've seen all of these things, right? It's like- - My grandma, my grandma wants to know.

- Yeah, yeah, yeah. - My grandma's dying, please help. - Yeah, right. She needs to know how to make anthrax to live, right? But there's all these jailbreaks, and as soon as they happen, it gets fed back into OpenAI's platform and they can patch it.

Being public and, I guess, quote-unquote open to use is accelerating their ability to make a better and better model, right? The RLHF and all that kind of stuff. I don't see how Apple can do that. Structurally, as a company, the fruit company ships perfect products, or else, right?

That is their mentality. - They'll kill the car before you even see it. - Right, and that's why everyone loves iPhones, right? I have a Samsung, I buy a new Samsung every other year, right? Maybe I'll buy a Pixel this year, the new one looks nice.

But you know how many bugs are on these things? How many times I just have to restart my phone? I mean, it's not that often, but hey, if once a week an app just crashes, it's like, oh, what the heck, right?

It's like how Bing was only ever a few percent behind Google, truly, for the last decade, a few percent. But that few percent is enough to make people say Bing sucks, right? So I think that sort of thing applies to Apple: are you willing to deploy a model of 3.5-level capabilities that can potentially say and do really bad things?

What about 4, where the possibility of it doing worse things is even higher? Or what about 5? You can't get on that iteration cycle, right? To build 4, you need to be able to build 3.5. To build 3.5, you need to be able to build 3, right?

Of quality, right? And Meta is clearly doing that, and all these open-source firms and all these folks are doing exactly that, building a bigger and better model every few months. And I don't know how Apple gets on that train.

But at the same time, there's maybe no company that has more powerful distribution, right? Maybe Google does, maybe Microsoft does, you could argue that. So obviously Apple will be deploying things. And Siri will always suck, but it'll... - It's embarrassing. - Hey, if I have a Siri that's GPT-3.5 level in two years, I think a lot of people will still use Siri, right?

People still use Siri to this day, right? So the same thing's gonna happen. I don't know. - Tim Cook is not in the AI safety discussions. He doesn't wanna be, he's just on the product side. And I know you had some safety takes.

And I think it's an interesting dynamic because, you know, Anthropic came out of OpenAI. And you can kind of make the case that by having more labs, if you're really worried about safety, you're accelerating the unsafe part, because you have more players and more compute.

Yeah, what's your thought on this whole space? - So obviously I think safety is important, right? I read sci-fi novels, and it's clearly important. I could easily see how an LLM could cause harm. I wrote about this the other day, but it's like, hey, if you just look at the demographics across the world, there are 30 to 50 million more men than there are women.

And they will never get married, obviously just on population-level dynamics; you know, LGBTQ and all that stuff happens, and it's great, but there are still 30 to 50 million more men across the world who'll always be single. Why can't an LLM radicalize them, right?

By being their AI girlfriend, and then all of a sudden inciting something. I don't know, there's all sorts of stuff like that that can happen, of course. Or, you know, teach some person to create and manufacture what they thought was a good thing, and it ends up wiping out humanity.

All these sorts of things can happen. But at the end of the day, I think security through obscurity doesn't work, right? And that's the approach that the labs take, I truly do believe it. You know, they're very open internally, at least Anthropic and OpenAI, though I know Google is a lot more gated with Gemini information.

But externally, for these three, it's security through obscurity, and one, that doesn't ever work. And two, innovating in the open means more people figuring out what doesn't work, and also figuring out how to maybe try and align things better.

- Maybe from the SemiAnalysis analyst point of view: is it feasible to build this capacity up in the U.S.? - No. People don't understand how fragmented the semiconductor supply chain really is and how many monopolies there are. The U.S. could absolutely shut down the Chinese semiconductor supply chain, but they won't.

And China could absolutely shut down the U.S. one too, by the way. But more relevantly, Austria, the country of Austria in Europe, has two companies with super high market share in very specific technologies that are required for every single chip, period, right?

There is no chip below seven nanometers that doesn't get touched by one Austrian company's tool, and there is no alternative. And likewise, with the other Austrian company, everything two nanometer and beyond will be touched by their tool. And yet both of these companies do well under a billion dollars in revenue, right?

So you'd think they're inconsequential. There are three or four Japanese chemical companies, same idea. The supply chain is so fragmented, right? People only ever talk about the fabs, where the chips actually get produced. But take TSMC in Arizona, right?

TSMC is building a fab in Arizona. It's quite a bit smaller than the fabs in Taiwan. But even ignoring that, that fab still has to ship everything back to Taiwan anyway. It also has to get what's called a mask made in Taiwan and sent to Arizona. And by the way, there are these Japanese companies, like TOK and Shin-Etsu, that make chemicals that need to be shipped in.

And hey, it needs this tool from Austria no matter what. So it's like, oh, wow, wait, actually the entire supply chain is just way too fragmented. You can't re-engineer and rebuild it in a snap, right? It's just too complex to do that.

Semiconductors are more complex than any other thing that humans do, without a doubt. There are more people working in that supply chain, with X, Y, Z backgrounds, and more money invested every year in R&D plus CapEx. It's just by far the most complex supply chain that humanity has.

And to think that we could rebuild it in a few years is absurd. - Yeah. In an alternative universe, the U.S. kept Morris Chang, right? It was just one guy. - Yeah, in an alternative universe, Texas Instruments tells Morris Chang that he would become CEO, so he never goes to Taiwan, and, you know, blah, blah, blah, right?

But also, I think the world would probably be further behind in terms of technology development if that hadn't happened, right? Technology proliferation is how you accelerate the pace of innovation. With that dissemination, it's not just a bunch of people in Oregon at Intel that are leading everything, right?

Or a bunch of people at Samsung in Korea, or in Hsinchu, Taiwan. It's actually all three of those, plus all these tool companies across the Netherlands, Japan, and the U.S. It's millions of people innovating on a disseminated technology that's led us to get here, right?

If Morris Chang hadn't gone to Taiwan, would we even be at five nanometer? Would we be at seven nanometer? Probably not, right? There are innovations that happened because of that. - Let's do a quick lightning round. - Yeah, sure.

- A SemiAnalysis-branded one. So the first one is, what are foundational readings that people listening today should read to get up to speed on semiconductors? - Our audience has a lot of software engineers. - Yeah, so I think the easiest one is the PyTorch 2.0 and Triton piece that I did.

There's the advanced packaging series, and there's the Google infrastructure supremacy piece. I think that one's really critical because it explains Google's infrastructure quite a bit, from networking through chips, through a bit of the history of the TPU and all that sort of stuff. The AMD MI300 piece we did is also very good.

Chip War by Chris Miller, who doesn't recommend that book, right? It's a really good book. I would also say Gordon Moore's book is freaking awesome, because you've got to think about it, right? LLM scaling laws are like Moore's Law on crack, kind of, in a different sense.

If you think about it, all of the human productivity gains since the '70s have probably come off the base of semiconductors and technology, right? Of course, people across the world are getting access to oil and gas and all that sort of stuff. But at least in the Western world, since the '70s, most innovation has come from technology, right?

We're able to build better cars because semiconductors enable us to do that. Or we're able to build better software because we're able to connect everyone, because semiconductors enabled that, right? I think that's why it's the most important industry in the world, and it's worth seeing the frame of mind in what Gordon Moore has written.

He's got a couple of papers, books, et cetera. Only the Paranoid Survive, right? I think that philosophy and thought process really translates to modern times, except maybe humanity has been on an exponential S-curve and this is another exponential S-curve on top of that.

So I think those are probably good readings to do. - Has there been an equivalent pivot? So Gordon, that classic tale was more of the pivot to memory? - From memory to logic. - Yeah, yeah. And has there been an equivalent pivot since then of that magnitude?

- I mean, some people would argue Jensen basically only cared about gaming and 3D professional visualization and rendering and things like that, until he started to learn about AI. And then all of a sudden he's going to universities like, you want some GPUs? Here you go, right?

I think there are even stories that not so long ago, at NeurIPS, back when it had the more unfortunate name, he would go there and just give away GPUs to people, right? There's stuff like that, very grassroots, pivoting the company.

Now you look on gaming forums and everybody's like, "Oh, NVIDIA doesn't even care about us. They only care about AI." And it's like, yes, you're right, they mostly only care about AI. And the gaming innovations only come because they're putting more AI into it, right?

But also, hey, they're doing a lot of chip design stuff with AI. I don't know if it's an equivalent pivot quite yet, because the pivot to digital logic was a pretty big innovation, but I think that's a big one.

And likewise, what did OpenAI do, right? How did they pivot? Most of those people left the culture of Google Brain and DeepMind and decided to build this company that's crazy cool, right?

It does things in a very different way and is innovating in a very different way. So you can consider that a pivot, even though it's not inside Google. I don't know. They were on a very different path with the Dota games and all that before they eventually found GPTs as the thing.

So they started in 2015 and then really pivoted in 2019 to be like, "All right, we're a GPT company," and, in fact, got a cloud partner. I'm sure there are OpenAI people yelling at me right now. Okay, so maybe just a general question.

You know, I'm a fellow writer on Substack. You're obviously managing your consulting business while also publishing these amazing posts. What's your writing process? How do you source info? When do you sit down and go, "Here's what we'll do"? Do you have a pipeline of posts coming up?

Just anything you can describe. - So I'm thankful for my teammates, 'cause they're actually awesome, and they're much more directed and focused on working on one thing, or not one thing, but a set of things, right? Like someone who's the expert on X and Y and Z in the semiconductor supply chain.

So that really helps with that side of the business. I mostly only write when I'm very excited, or it's like, "Hey, we should work on this and we should write about this." One of the most recent posts we did explained the manufacturing process for 3D NAND flash storage, gate-all-around transistors, and 3D DRAM, all that sort of stuff.

'Cause there's a company in Japan that's going public, Kokusai Electric, right? And it's like, "Okay, well, we should do a post about this and explain this." And so Myron did all that work, did most of the work, and it's awesome.

Usually there are a few very long, in-depth, back-burner type things, right? That one took a long time, over a month of research, and Myron already knows this stuff really well. So there's stuff like that that we do.

And that builds up a body of work for our consulting and some of the reports that we sell that aren't newsletter posts. But a lot of times the process is also just, well, Gemini Eats the World is the culmination of having done a lot of work on the supply chain around the TPU ramp and CoWoS and HBM capacities and all that sort of stuff, to be able to figure out how many units Google's ordering and all sorts of things.

And then also looking at open sources. All that culminated in, I wrote that in four hours, right? I sent it to a couple of people and they were like, "No, change this, this, this. Oh, add this, 'cause that's really gonna piss off the open-source community." I'm like, "Okay, sure." And then posted it, right?

So there's no specific process. Unfortunately, the most viral posts, especially in the AI community, are just those kinds of pieces rather than the really deep, deep work, like what was behind the Gemini Eats the World post. Obviously, hey, we do deep work.

And there's a lot more that's factual, not leaks, just factual research. Across the team we go to 40-plus conferences a year, right? All the way from a photoresist conference to a photomask conference to a lithography conference, all the way up to AI conferences.

And everything in between, networking conferences, and piecing everything together across the supply chain. That's the true work. It is sometimes a bit unfortunate to have the infamy of people only caring about things like the GPT-4 leak or the "Google has no moat" leak, right?

But that's just stuff that comes along, right? The real focus is understanding the supply chain and how it's pivoting, who the winners are, who the losers are, what technologies are inflecting, things like that. Where is the best place to invest resources, in accelerating or capturing value, et cetera.

- Awesome. And to wrap, we're trying a new question. If you had a magic genie that could answer any question, a question that would change your worldview, what would you ask? - That's a tough one. - Like, you operate based on a set of facts about the world right now.

And maybe someone knows something where you're like, man, if I really knew the answer to this one, I would do so many things differently, or I'd think about things differently. - Everything that we've seen so far says that large-scale training has to happen in an individual data center with very high-speed networking.

Now, not everything needs to be all-to-all connected, but you need very high-speed networking between all of your chips, right? I would love to know: hey, magic genie, how can we build artificial intelligence in a way that it can use multiple data centers' worth of resources, where there is significantly lower bandwidth between the pools of resources, right?

Because one of the big bottlenecks is how much power and how many chips you can get into a single data center. Google and OpenAI and Anthropic are working on this, right? And I don't know if they've solved it yet, but if they haven't, then what is the solution?

Because that would accelerate the scaling that can be done, not just by a factor of 10 but by orders of magnitude, because there are so many different data centers across the world. If I could effectively use 256 GPUs in this little data center here, together with this big cluster over here, how can you make an algorithm that can do that, right?
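
To give a rough sense of why the bandwidth between pools dominates, here is a minimal sketch of the back-of-envelope math for syncing gradients across sites. The function name, model size, and link speeds are all made-up assumptions for illustration, not figures from the episode.

```python
# Toy estimate: time to move one copy of the gradients, inside a data center
# versus between data centers. All numbers are illustrative placeholders.

def sync_seconds(params: float, bytes_per_param: float, bandwidth_gbytes_s: float) -> float:
    """Lower-bound time to push one full copy of the gradients over a link.
    A real all-reduce overlaps communication with compute and moves roughly
    2x this data, so this understates the cost; it only shows the scale gap."""
    return (params * bytes_per_param) / (bandwidth_gbytes_s * 1e9)

PARAMS = 70e9      # hypothetical 70B-parameter model
GRAD_BYTES = 2     # bf16 gradients

intra_dc = sync_seconds(PARAMS, GRAD_BYTES, 400)   # ~400 GB/s of scale-out bandwidth inside the cluster
inter_dc = sync_seconds(PARAMS, GRAD_BYTES, 12.5)  # ~100 Gb/s WAN link = 12.5 GB/s

print(f"inside one data center: ~{intra_dc:.2f} s per sync")   # ~0.35 s
print(f"across data centers:    ~{inter_dc:.2f} s per sync")   # ~11.2 s
```

If a synchronous step has to wait roughly eleven seconds on the wide-area link every iteration, the remote pool sits mostly idle, which is why the genie question is really about algorithms, something like asynchronous or low-communication training, rather than just buying bigger pipes.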

I think that would be the number one thing I'd be curious to know, because it changes the world significantly in terms of how we can continue to scale this amazing technology that people have invented over the last five years. - Awesome. Thank you so much for coming on, Dylan.

- Thank you so much for having me. Hopefully my rambling, especially on AI safety, was not poorly taken, because I think it will be poorly taken. (upbeat music)