Okay, first of all, it's incredible to be here. I have a few notes, so I make sure we cover everything. I wanna take this opportunity to introduce Jonathan. Obviously, a lot of you guys have heard of his company, but you may not know his origin story, which is quite honestly, having been in Silicon Valley for 25 years, one of the most unique origin stories and founder stories you're going to hear.
We're gonna talk about some of the things that he's accomplished both at Google and Groq. We're gonna compare Groq and NVIDIA, because I think it's probably one of the most important technical considerations that people should know. We'll talk about the software stack, and we'll leave with a few data points and bullets, which I think are pretty impressive.
So, I wanna start with something that you do every week, which is you typically tweet out some sort of developer metric. Where are you as of this morning, and why are developers so important? - So, we're at, can you hear me? - Try this one. - Testing, ah, perfect.
So, we are at 75,000 developers, and that is slightly over 30 days from launching our developer console. For comparison, it took NVIDIA seven years to get to 100,000 developers, and we're at 75,000 in about 30-ish days. So, the reason this matters is the developers, of course, are building all the applications, so every developer is a multiplicative effect on the total number of users that you can have, but it's all about developers.
- Let's go all the way back, so just with that backdrop. This is not an overnight success story. This is eight years of plodding through the wilderness, punctuated, frankly, with a lot of misfires, which is really the sign of a great entrepreneur. But I want people to hear this story.
You guys have all heard about entrepreneurs who have dropped out of college to start billion-dollar companies. Jonathan may be the only high school dropout to have also started a billion-dollar company. So, let's start with just two minutes. Just give us the background of you, because it was a very circuitous path to being an entrepreneur.
- So, I dropped out of high school, as mentioned, and I ended up getting a job as a programmer, and my boss noticed that I was clever and told me that I should be taking classes at a university despite having dropped out of high school. So I was unmatriculated; I didn't actually get enrolled.
I started going to Hunter College as a side thing, and then I sort of fell under the wing of one of the professors there, did well, transferred to NYU, and then I started taking PhD courses, but as an undergrad, and then I dropped out of that. - So, do you technically have a high school diploma?
- No. - Okay, this is perfect. - Nor do I have an undergrad degree, but yeah. - So, from NYU, how did you end up at Google? - Well, actually, if it hadn't been for NYU, I don't think I would have ended up at Google, so this is interesting, even though I didn't have the degree.
And I happened to go to an event at Google, and one of the people at Google recognized me because they also went to NYU, and then they referred me. So, you can make some great connections in university, even if you don't graduate, but it was one of the people that I was taking the PhD courses with.
- And when you first went there, sort of what kind of stuff were you working on? - Ads, but testing. So, we were building giant test systems, and if you think that it's hard to build production systems, test systems have to test everything the production system does, and we did it live.
So, every single ads query, we would run 100 tests on that, but we didn't have the budget of the production system. So, we had to write our own threading library, we had to do all sorts of crazy stuff, which you don't think of in ads, but yeah, it was actually harder engineering than the production itself.
- And so, Google's very famous for 20% time, where you kind of can do whatever you want. Is that what led to the birth of the TPU, which is now what most of you guys know as Google's leading custom silicon that they use internally?
- So, 20% time is famous. I called it MCI time, which probably isn't gonna transfer as a joke here, but there were these advertisements for this phone company offering free nights and weekends; so you could work on 20% time so long as it wasn't during your work time, yeah. But every single night, I would go up and work with the speech team.
So, this was separate from my main project, and they bought me some hardware, and I started what was called the TPU as a side project. It was funded out of what a VP referred to as his slush fund, or leftover money. There were actually two other projects to build AI accelerators, and ours was never expected to be successful, which gave us the cover that we needed to do some really counterintuitive and innovative things.
Once that became successful, they brought in the adult supervision. - Okay, take a step back though, what problem were you trying to solve in AI when those words weren't even being used, and what was Google trying to do at the time where you saw an opportunity to build something?
- So, this started in 2012, and at the time, there had never been a machine learning model that outperformed a human being on any task, and the speech team trained a model that transcribed speech better than human beings. The problem was they couldn't afford to put it into production, and so this led to a very famous engineer, Jeff Dean, giving a presentation to the leadership team, it was just two slides.
The first slide was good news, machine learning works. The second slide, bad news, we can't afford it. So, they were gonna have to double or triple the entire global data center footprint of Google at an average of a billion dollars per data center, 20 to 40 data centers, so 20 to 40 billion dollars, and that was just for speech recognition.
If they wanted to do anything else, like search ads, it was gonna cost more. That was uneconomical, and that's been the history with inference. You train it, and then you can't afford to put it into production. - So, against that backdrop, what did you do that was so unique that allowed TPU to be one of the three projects that actually won?
- The biggest thing was Jeff Dean noticed that the main algorithm that was consuming most of the CPU cycles at Google was matrix multiply, and we decided, okay, let's accelerate that, but let's build something around that, and so we built a massive matrix multiplication engine. When doing this, there were those two other competing teams.
They took more traditional approaches to do the same thing. One of them was led by a Turing Award winner, and then what we did was we came up with what's called a systolic array, and I remember when that Turing Award winner was talking about the TPU, he said, "Whoever came up with this must have been really old, because systolic arrays have fallen out of favor," and it was actually me.
I just didn't know what a systolic array was. Someone had to explain to me what the terminology was. It was just kind of the obvious way to do it, and so the lesson is if you come at things knowing how to do them, you might know how to do them the wrong way.
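For readers who haven't met the term: a systolic array is a grid of multiply-accumulate cells through which operands flow in lockstep, one hop per clock cycle, so a matrix multiply proceeds with no per-element instruction fetch or memory round trip. Here is a minimal, purely illustrative Python simulation of an output-stationary systolic array; it is a sketch of the idea only, not the TPU's or Groq's actual design.

```python
# Illustrative output-stationary systolic array computing C = A @ B.
# Row i of A is fed into the left edge and column j of B into the top edge,
# each skewed by one cycle, so matching operands meet at cell (i, j).
def systolic_matmul(A, B):
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0] * N for _ in range(M)]          # accumulators stay in place
    a_reg = [[0] * N for _ in range(M)]      # A operand currently held in each cell
    b_reg = [[0] * N for _ in range(M)]      # B operand currently held in each cell
    for t in range(M + N + K - 2):           # enough cycles to drain the pipeline
        # operands move one cell per cycle: A to the right, B downward
        for i in range(M):
            for j in range(N - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(N):
            for i in range(M - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # feed the skewed edges (zeros outside the valid window)
        for i in range(M):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < K else 0
        for j in range(N):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < K else 0
        # every cell performs one multiply-accumulate per cycle, in parallel
        for i in range(M):
            for j in range(N):
                C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C

assert systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]
```

Each cell only ever exchanges data with its immediate neighbors, which is what makes the structure dense to lay out and easy to clock fast in silicon.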
It's helpful to have people who don't know what should and should not be done. - So as TPU scaled, there was probably a lot of internal recognition at Google. How do you walk away from that, and why did you walk away from that? - Well, all big companies end up becoming political in the end, and when you have something that successful, a lot of people want to own it, and there are always more senior people who start grabbing for it.
I moved on to the Google X team, the rapid eval team, which is the team that comes up with all the crazy ideas at Google X, and I was having fun there, but nothing was turning into a production system. It was all a bunch of playing around, and I wanted to go and do something real again from start to finish.
I wanted to take something from concept to production, and so I started looking outside, and that's when we met. - Well, that is when we met, but the thing is you had two ideas. One was more of let me build an image classifier, and you thought you could out-ResNet ResNet at the time, which was the best thing in town, and then you had this hardware path.
- Well, actually, I had zero intention of building a chip. What happened was I had also built the highest performing image classifier, but I had noticed that all of the software was being given away for free. TensorFlow was being given away for free. The models were being given away for free.
It was pretty clear that machine learning AI was gonna be open source, and it was gonna be-- - Even back then. - Even back then. That was 2016, and so I just couldn't imagine building a business around that, and it would just be hardscrabble. Chips, it takes so long to build them that if you build something innovative and you launch it, it's gonna be four years before anyone can even copy it, let alone pull ahead of it, so that just felt like a much better approach, and it's atoms.
You can monetize that more easily, so right around that time, the TPU paper came out. My name was in it. People started asking about it, and you asked me what I would do differently. - Well, I was investing in public markets as well at the time, a little dalliance in the public markets, and Sundar goes on in a press release and starts talking about TPU, and I was so shocked.
I thought there is no conceivable world in which Google should be building their own hardware. They must know something that the rest of us don't know, and so we need to know that so that we can go and commercialize that for the rest of the world, and I probably met you a few weeks afterwards, and that was probably the fastest investment I'd ever made.
I remember the key moment is you did not have a company, and so we had to incorporate the company after the check was written, which is always either a sign of complete stupidity or in 15 or 20 years, you'll look like a genius, but the odds of the latter are quite small.
Okay, so you start the business. Tell us about the design decisions you were making at Groq at the time, knowing what you knew then, because at the time, it was very different from what it is now. - Well, again, when we started fundraising, we actually weren't even 100% sure that we were gonna do something in hardware, but it was something that I think you asked, Chamath, which is what would you do differently, and my answer was the software, because the big problem we had was we could build these chips in Google, but programming them, every single team at Google had a dedicated person who was hand-optimizing the models, and I'm like, this is absolutely crazy.
Right around then, we had started hiring some people from NVIDIA, and they're like, no, no, no, you don't understand. This is just how it works. This is how we do it, too. We've got these things called kernels, CUDA kernels, and we hand-optimize them. We just make it look like we're not doing that, but the scale, like, all of you understand algorithms and big O complexity.
That's linear complexity. For every application, you need an engineer. NVIDIA now has 50,000 people in their ecosystem, and these are really low-level kernel-writing, assembly-writing hackers who understand GPUs and ML and everything. That's not gonna scale, so we focused on the compiler for the first six months.
We banned whiteboards at Groq because people kept trying to draw pictures of chips. Like, yeah. - So why is it that LLMs prefer Groq? Like, what was the design decision, or what happened in the design of LLMs? Some part of it is skill, obviously, but some part of it was a little bit of luck, but where, what exactly happened that makes you so much faster than NVIDIA, and why there are all of these developers?
What is the? - The crux of it: we didn't know that it was gonna be language, but the inspiration was the last thing that I worked on, which was getting the AlphaGo software, the Go-playing software at DeepMind, working on TPU. Having watched that, it was very clear that inference was going to be a scaled problem.
Everyone else had been looking at inference as you take one chip, you run a model on it, it runs whatever. But what happened with AlphaGo was we ported the software over, and even though we had 170 GPUs versus 48 TPUs, the 48 TPUs won 99 out of 100 games with the exact same software.
What that meant was that more compute was going to result in better performance. And so the insight was, let's build scaled inference. So we built in the interconnect, we built it for scale, and that's what we do now when we're running one of these models: we have hundreds or thousands of chips contributing, just like we did with AlphaGo, but it's built for this as opposed to cobbled together.
- I think this is a good jumping off point. A lot of people, and I think this company deserves a lot of respect, but NVIDIA has been toiling for decades, and they have clearly built an incredible business. But in some ways, when you get into the details, the business is slightly misunderstood.
So can you break down, first of all, where is NVIDIA natively good, and where is it more trying to be good? - So natively good, the classic saying is you don't have to outrun the bear, you just have to outrun your friends. So NVIDIA outruns all of the other chip companies when it comes to software, but they're not a software-first company.
They actually have a very expensive approach, as we discussed. But they have the ecosystem. It's a double-sided market. If you have a kernel-based approach, they've already won. There's no catching up. Hence why we have a kernel-free approach. But the other way that they're very good is vertical integration and forward integration.
What happens is NVIDIA, over and over again, decides that they wanna move up the stack, and whatever their customers are doing, they start doing it. So for example, I think it was Gigabyte or one of these other PCI board manufacturers who recently announced that they're exiting that market, even though 80% of their revenue came from the NVIDIA boards they were building, because NVIDIA moved up and started doing that much lower-margin thing themselves.
And you just see that over and over again. They started building systems. - Yeah, I think the other thing is that NVIDIA's incredible at training. And I think the design decisions that they made, including things like HBM, were really oriented around a world back then in which everything was about training.
There weren't any real-world applications. None of you guys were really building anything in the wild where you needed super-fast inference. And I think that's another. - Absolutely, and what we saw over and over again was you would spend 100% of your compute on training. You would get something that would work well enough to go into production.
And then it would flip to about 5% to 10% training and 90% to 95% inference. But the amount of training would stay the same. The inference would grow massively. And so every time we would have a success at Google, all of a sudden, we would have a disaster. We called it the success disaster, where we can't afford to get enough compute for inference 'cause it goes 10 to 20x immediately, over and over.
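As a back-of-the-envelope illustration of that flip, here is a small sketch; the 10 to 20x multiplier is the figure quoted above, while the 100-unit training budget is an arbitrary assumption chosen just to show the resulting percentages.

```python
# If the training budget stays flat and inference demand arrives at 10-20x
# that budget, the split lands at roughly 5-10% training, 90-95% inference.
training_units = 100                          # assumed fixed training budget
for inference_multiplier in (10, 20):
    inference_units = training_units * inference_multiplier
    total = training_units + inference_units
    print(f"{inference_multiplier}x inference -> "
          f"training {training_units / total:.0%}, "
          f"inference {inference_units / total:.0%}")
# 10x inference -> training 9%, inference 91%
# 20x inference -> training 5%, inference 95%
```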
- And if you take that 10 to 20x and multiply it by the cost of NVIDIA's leading-class solutions, you're talking just an enormous amount of money. So just maybe explain to folks what HBM is and why these systems, like what NVIDIA just announced with the B200, have the complexity and the cost that they do if you're actually trying to do something.
- Yeah, the complexity spans every part of the stack, but there's a couple of components which are in very limited supply. And NVIDIA has locked up the market on these. One of them is HBM. HBM is this high-bandwidth memory which is required to get performance because the speed at which you can run these applications depends on how quickly you can read that memory.
And this is the fastest memory. There is a finite supply. It is only for data centers. So they can't reach into the supply for mobile or other things like you can with other parts. But also interposers. Also, NVIDIA's the largest buyer of super caps in the world and all sorts of other components.
- Cables. - Cables, the 400-gigabit cables, they've bought them all out. So if you want to compete, it doesn't matter how good of a product you design. They've bought out the entire supply chain. - For years. - For years. - So what do you do? - You don't use the same things they do.
- Right. - And that's where we come in. - So how do you design a chip, then? If you look at the leading solution and they're using certain things and they're clearly being successful, how do you, is it just a technical bet to be totally orthogonal and different? Or was it something very specific where you said we cannot be reliant on the same supply chain 'cause we'll just get forced out of business at some point?
- It was actually a really simple observation at the beginning, which is most chip architectures compete on small percentage differences in performance, like 15% is considered amazing. And what we realized was if we were 15% better, no one was gonna change to a radically different architecture. We needed to be five to 10x.
Therefore, the small percentages that you get chasing the leading-edge technologies were irrelevant. So we used an older technology, 14 nanometer, which is underutilized. We didn't use external memory. We used older interconnect, because our architecture needed to provide the advantage, and it needed to be so overwhelming that we didn't need to be at the leading edge.
- So how do you measure sort of speed and value today? And just give us some comparisons for you versus some other folks running what these guys are probably using, Llama, Mistral, et cetera. - Yeah, so we compare on two sides of this. One is the tokens per dollar and one is the tokens per second per user.
So tokens per second per user is the experience. That's the differentiation. And tokens per dollar is the cost. And then also, of course, tokens per watt because power is very limited at the moment. - Right. - If you were to compare us to GPUs, we're typically five to 10x faster.
Apples to apples, like without using speculative decode and other things. So right now, on a 180 billion parameter model, we run about 200 tokens per second, which I think is less than 50 on the next generation GPU that's coming out. - From NVIDIA? - From NVIDIA. - So your current generation is 4x better than the B200?
- Yeah, yeah. And then in total cost, we're about 1/10 the cost versus a modern GPU per token. I want that to sink in for a moment, 1/10 of the cost. - Yeah, I mean, I think the value of that really comes down to the fact that you guys are gonna go and have ideas, and especially if you are part of the venture community and ecosystem and you raise money, folks like me who will give you money will expect you to be investing that wisely.
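To put the figures just quoted side by side, here is a tiny illustrative sketch; the numbers are the speaker's own claims from this exchange, not independent benchmarks.

```python
# Figures as quoted above, for a roughly 180-billion-parameter model.
groq_tokens_per_sec_per_user = 200    # quoted Groq figure
gpu_tokens_per_sec_per_user = 50      # "less than 50" on the next-generation GPU
relative_cost_per_token = 1 / 10      # "about 1/10 the cost ... per token"

speedup = groq_tokens_per_sec_per_user / gpu_tokens_per_sec_per_user
print(f"tokens/sec/user advantage: {speedup:.0f}x")                 # -> 4x
print(f"cost per token: {relative_cost_per_token:.0%} of the GPU")  # -> 10%
```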
Last decade, we went into a very negative cycle where almost 50 cents of every dollar we would give a startup would go right back into the hands of Google, Amazon, and Facebook. You're spending it on compute and you're spending it on ads. This time around, the power of AI should be that you can build companies for 1/10 or 1/100 of the cost, but that won't be possible if you're, again, just shipping the money back out, except just now, in this case, to NVIDIA versus somebody else.
So we will be pushing to make sure that this is kind of the low-cost alternative that happens. - So NVIDIA had a huge splashy announcement a few weeks ago. They showed charts, things going up and to the right. They showed huge dies. They showed huge packaging. Tell us about the B200 and compare it to what you're doing right now.
- Well, the first thing is the B200 is a marvel of engineering. The level of complexity, the level of integration, the number of different components in silicon. They spent $10 billion developing it, but when it was announced, I got some pings from NVIDIA engineers who said, we were a little embarrassed that they were claiming 30X 'cause it's nowhere near that.
And we as engineers felt that that was hurting our credibility. Let's put the 30X claim into perspective. There was this one image that showed a claim of up to 50 tokens per second for the user experience and 140 for throughput. That sort of gives you the value or the cost.
If you were to compare that to the previous generation, that would be saying that the users, if you divide 50 by 30, are getting less than two tokens per second, which would be slow, right? There's nothing running that slow. And then also from a throughput perspective, that would make the cost so astronomical, it would be unbelievable.
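The sanity check being described is a single division, sketched here with the figures quoted above (the up-to-50 tokens per second and the 30X headline).

```python
# If the 30x headline applied to per-user speed, the previous generation
# would have been implausibly slow.
claimed_speedup = 30                  # headline claim for the new part
new_tokens_per_sec_per_user = 50      # "up to 50 tokens per second" per user
implied_previous = new_tokens_per_sec_per_user / claimed_speedup
print(f"implied previous-gen speed: {implied_previous:.2f} tokens/sec/user")  # ~1.67
```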
- I mean, how many of you guys use any of these chat agents right now? Just raise your hand if you use them. And how many of you keep your hands raised if you're satisfied with the speed of performance? You're satisfied. One hand or two? - There's like two or three, yeah.
That's nice. My experience has been that these things, if you want to actually make hallucinations go to zero and really fine-tune the quality of these models, you have to get back to kind of like a traditional web experience or a traditional mobile app experience, where you have a window of probably 300 milliseconds to get an answer back.
In the absence of that, the user experience doesn't scale and it kind of sucks. - How much effort did you spend at Meta and Facebook getting latency down? - I mean, look, at Facebook at one point, I was so disgusted with the speed. In a cold cache, we were approaching 1,000 milliseconds.
And I was so disgusted that I took a small team off to the side, rebuilt the entire website and launched it in India for the Indian market just to prove that we could get it under 500 milliseconds. And it was a huge technical feat that the team did. It was also very poorly received by the mainline engineering team because it was somewhat embarrassing.
But that's the level of intensity we had to approach this problem with. And it wasn't just us. Google realized it, everybody has realized it. There's an economic equation where if you deliver an experience to users under about 250 to 300 milliseconds, you maximize revenue. So if you actually want to be successful, that is the number you have to get to.
So the idea that you can wait and fetch an answer in three to five seconds is completely ridiculous. It's a non-starter. - Here's the actual number. The number is every 100 milliseconds of improvement leads to 8% more engagement on desktop, 34% on mobile. We're talking about 100 milliseconds, which is 1/10 of a second.
Right now, these things take 10 seconds. So think about how much less engagement we're getting today than we otherwise could. - So why don't you now break down this difference? Because I think this is a good place to make sure people leave really understanding it. There's an enormous difference between training and inference and what is required.
And why don't you define the differences so that then we can contrast where things are gonna go? - The biggest is when you're training, the number of tokens that you're training on is measured in months. Like how many tokens can we train on this month? It doesn't matter if it takes a second, 10 seconds, 100 seconds in a batch, how many per month.
In inference, what matters is how many tokens you can generate per millisecond or a couple of milliseconds. It's not in seconds, it's not in months. - Is it fair to say then that NVIDIA is the exemplar in training? - Yes. - And then is it fair to say that there really isn't yet the equivalent scaled winner in inference?
- Not yet. - And do you think that it will be NVIDIA? - I don't think it'll be NVIDIA. - But specifically, why do you think it won't work for that market, even though it's clearly working in training? - In order to get the latency down, what we had to do, we had to design a completely new chip architecture.
We had to design a completely new networking architecture, an entirely new system, an entirely new runtime, an entirely new compiler, an entirely new orchestration layer. We had to throw everything away. And it had to be compatible with PyTorch and whatever else people are actually developing with. Now we're talking about the innovator's dilemma on steroids.
It's hard enough to give up one of those; doing even one of them successfully would make for a very valuable company. But to throw all six of those away is nearly impossible. And also you have to maintain what you have if you wanna keep training. And so now you have to have a completely different architecture for training versus inference: for your chip, for networking, for everything.
- So let's say that the market today is 100 units of training or 95 units of training, five units of inference. I should say that's roughly where most of the revenue and the dollars are being made. What does it look like in four or five years from now? - Well, actually NVIDIA's latest earnings, 40% inference.
It's already starting to climb. Where it's gonna end is somewhere between 90 and 95%, or 90 to 95 units, inference. And so that trajectory is gonna take off rapidly now that we have these open source models that everyone is giving away and you can download a model and run it.
You don't need to train it. - Yeah, one of the things about these open source models is that to build useful applications, you have to either understand or be able to work with CUDA. With you, it doesn't even matter because you can just port. So maybe explain to folks the importance in the inference market of being able to rip and replace these models and where you think these models are going.
- So for the inference market, every two weeks or so, there is a completely new model that has to be run. It's important, it matters. Either it's setting the best quality bar across the board or it's good at a particular task. If you are writing kernels, it's almost impossible to keep up.
In fact, when Llama 2 70 billion was launched, it officially had support for AMD. However, the first support we actually saw implemented came after about a week, and we had it running in, I think, two days. And so that speed, now everyone develops for NVIDIA hardware. So by default, anything launched will work there.
But if you want anything else to work, you can't be writing these kernels by hand. And remember, AMD had official support and it still took about a week. - Right. So if you're starting a company today, you clearly wanna have the ability to swap from Llama to Mistral to Anthropic and back as often as possible.
- Whatever's latest. - And just as somebody who sees these models run, do you have any comment on the quality of these models and where you think some of these companies are going or what you see some doing well versus others? - So they're all starting to catch up with each other.
You're starting to see some leapfrogging. It started off with GPT-4 pulling ahead and it had a lead for about a year over everyone else. And now of course Anthropic has caught up. We're seeing some great stuff from Mistral. But across the board, they're all starting to bunch up in quality.
And so one of the interesting things: Mistral in particular has been able to get close to that quality with smaller models that are less expensive to run, which I think gives them a huge advantage. I think Cohere has an interesting take on a sort of RAG-optimized model. So people are finding niches.
And there's gonna be a couple that are gonna be the best across the board at the highest end. But what we're seeing is a lot of complaints about the cost to run these models. They're just astronomical. And you're not gonna be able to scale up applications for users with them.
- OpenAI has published or has disclosed, as has Meta, as has Tesla and a couple of others, just the total quantum of GPU capacity that they're buying. And you can kind of work backwards to figure out how big the inference market can be, because it's really only supported by them as you guys scale up.
Can you give people a sense of the scale of what folks are fighting for? - So I think Facebook announced that by the end of this year, they're gonna have the equivalent of 650,000 H100s. By the end of this year, Groq will have deployed 100,000 of our LPUs, which do outperform the H100s on a throughput and on a latency basis.
So we will probably get pretty close to the equivalent of Meta ourselves. By the end of next year, we're going to deploy 1.5 million LPUs. For comparison, last year, NVIDIA deployed a total of 500,000 H100s. So 1.5 million means that Groq will probably have more inference generative AI capacity than all of the hyperscalers and cloud service providers combined.
So probably about 50% of the inference compute in the world. - That's just great. Tell us about team building in Silicon Valley. How hard is it to get folks that are real AI folks against the backdrop of, you could go work at Tesla, you could go work at Google, OpenAI, all these places we're hearing about, with multimillion-dollar pay packages that rival playing professional sports.
Like what is going on in finding the people? Now, by the way, you have this interesting thing 'cause your initial chip, you were trying to find folks that knew Haskell. So just tell us, how hard is it to build a team in the Valley to do this? - Impossible.
So if you want to know how to do it, you have to start getting creative, just like anything you want to do well. Don't just compete directly. But yeah, these pay packages are astronomical because everyone views this as a winner take all market. I mean, just that's it. It's not about, am I going to be number two?
Am I going to be number three? They're all going, I got to be number one. So if you don't have the best talent, you're out. Now, here's the mistake. A lot of these AI researchers are amazing at AI, but they're still kind of green. They're new, they're young, right?
This is a new field. And what I always recommend to people is go hire the best, most grizzled engineers who know how to ship stuff and on time and let them learn AI because they will be able to do that faster than you will be able to take the AI researchers and give them the 20 years of experience of deploying production code.
- You were on stage in Saudi Arabia with Saudi Aramco a month ago and announced some big deal. Can you just, like, what is going on with deals like that? Like, where is that market going? Is that you competing with Amazon and Google and Microsoft? Is that what that is?
- It's not competing. It's actually complementary. The announcement was that we are going to be doing a deal together with Aramco Digital. And we haven't announced how large exactly, but it will be large in terms of the amount of compute that we're gonna deploy. And in total, we've done deals that get us past 10% of that 1.5 million LPU goal.
And of course, the hard part is the first deals. So once we announced that, a lot of other deals are now coming through. But the, yeah, go ahead. - So, no, no, I was just, I was just, second, Tim. - So the scale of these deals is that these are larger than the amount of compute that Meta has, right?
And a lot of these tech companies right now, they think that they have such an advantage 'cause they've locked up the supply. They don't want it to be true that there is another alternative out there. And so we're actually doing deals with folks where they're gonna have more compute than a hyperscaler.
- Right, that's a crazy idea. Last question, everybody's worried about what AI means. You've been in it for a very long time. Just end with your perspectives on what we should be thinking and what your perspectives are on the future of AI, our future jobs, all of this typical stuff that people worry about.
- So I get asked a lot, should we be afraid of AI? And my answer to that is, if you think back to Galileo, someone who got in a lot of trouble, the reason he got in trouble was he improved the telescope, popularized it, and made some claims that we were much smaller than everyone wanted to believe.
We were supposed to be the center of the universe and it turns out we weren't. And the better the telescope got, the more obvious it became that we were small. And in a large sense, large language models are the telescope for the mind. It's become clear that intelligence is larger than we are.
And it makes us feel really, really small. And it's scary. But what happened over time was as we realized the universe was larger than we thought, and we got used to that, we started to realize how beautiful it was, and our place in the universe. And I think that's what's gonna happen.
We're gonna realize intelligence is more vast than we ever imagined, and we're gonna understand our place in it. And we're not gonna be afraid of it. - That's a beautiful way to end. Jonathan Ross, everybody. Thanks, guys. (audience applauding) - Thank you very, very much. I was told grok means to understand deeply, with empathy.
That embodied this definition.