Conversation with Groq CEO Jonathan Ross
Okay, first of all, it's incredible to be here. 00:00:03.440 |
I have a few notes, so I make sure we cover everything. 00:00:06.120 |
I wanna take this opportunity to introduce Jonathan. 00:00:11.060 |
Obviously, a lot of you guys have heard of his company, 00:00:28.960 |
that he's accomplished both at Google and Groq. 00:00:33.540 |
because I think it's probably one of the most important 00:00:36.440 |
technical considerations that people should know. 00:00:40.480 |
and we'll leave with a few data points and bullets, 00:00:45.920 |
So, I wanna start with something that you do every week, 00:01:32.120 |
So, the reason this matters is the developers, of course, 00:01:37.400 |
so every developer is a multiplicative effect 00:01:40.160 |
on the total number of users that you can have, 00:01:43.980 |
- Let's go all the way back, so just with that backdrop. 00:01:50.040 |
This is eight years of plodding through the wilderness, 00:01:58.580 |
which is really the sign of a great entrepreneur. 00:02:07.660 |
about entrepreneurs who have dropped out of college 00:02:15.060 |
to have also started a billion-dollar company. 00:02:27.140 |
- So, I dropped out of high school, as mentioned, 00:02:29.900 |
and I ended up getting a job as a programmer, 00:02:37.700 |
and told me that I should be taking classes at a university 00:02:42.940 |
So, unmatriculated, I didn't actually get enrolled. 00:02:46.540 |
I started going to Hunter College as a side thing, 00:02:53.220 |
of one of the professors there, did well, transferred to NYU, 00:03:00.020 |
but as an undergrad, and then I dropped out of that. 00:03:03.180 |
- So, do you technically have a high school diploma? 00:03:07.180 |
- Nor do I have an undergrad degree, but yeah. 00:03:09.740 |
- So, from NYU, how did you end up at Google? 00:03:14.380 |
I don't think I would have ended up at Google, 00:03:15.940 |
so this is interesting, even though I didn't have the degree. 00:03:20.800 |
and one of the people at Google recognized me 00:03:23.720 |
because they also went to NYU, and then they referred me. 00:03:27.160 |
So, you can make some great connections in university, 00:03:30.000 |
even if you don't graduate, but it was one of the people 00:03:36.960 |
sort of what kind of stuff were you working on? 00:03:44.160 |
and if you think that it's hard to build production systems, 00:03:52.360 |
So, every single ads query, we would run 100 tests on that, 00:03:57.360 |
but we didn't have the budget of the production system. 00:04:00.640 |
So, we had to write our own threading library, 00:04:18.440 |
which is now, I think, or what most of you guys know 00:04:25.360 |
- So, 20% time is famous, I called it MCI time, 00:04:29.440 |
which probably isn't gonna transfer as a joke here, 00:04:31.360 |
but there was these advertisements for this phone company, 00:04:34.160 |
free nights and weekends, so you could work on 20% time 00:04:37.400 |
so long as it wasn't during your work time, yeah. 00:04:50.200 |
and I started what was called the TPU as a side project, 00:04:55.200 |
and it was funded out of what a VP referred to 00:05:11.900 |
to do some really counterintuitive and innovative things. 00:05:26.760 |
where you saw an opportunity to build something? 00:05:38.340 |
that transcribed speech better than human beings. 00:05:45.320 |
and so this led to a very famous engineer, Jeff Dean, 00:05:48.360 |
giving a presentation to the leadership team, 00:05:51.680 |
The first slide was good news, machine learning works. 00:05:55.820 |
The second slide, bad news, we can't afford it. 00:06:01.280 |
the entire global data center footprint of Google 00:06:04.760 |
at an average of a billion dollars per data center, 00:06:07.120 |
20 to 40 data centers, so 20 to 40 billion dollars, 00:06:28.460 |
that allowed TPU to be one of the three projects 00:06:39.480 |
most of the CPU cycles at Google were matrix multiplies, 00:06:39.480 |
and so we built a massive matrix multiplication engine. 00:06:52.360 |
When doing this, there were those two other competing teams. 00:06:55.900 |
They took more traditional approaches to do the same thing. 00:06:58.500 |
One of them was led by a Turing Award winner, 00:07:09.060 |
"Whoever came up with this must have been really old 00:07:11.860 |
"because systolic arrays have fallen out of favor," 00:07:15.780 |
I just didn't know what a systolic array was. 00:07:17.820 |
Someone had to explain to me what the terminology was. 00:07:20.020 |
It was just kind of the obvious way to do it, 00:07:54.040 |
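(A minimal sketch of the idea being described: in a systolic array, operands flow through a grid of multiply-accumulate cells in lockstep, so a big matrix multiply becomes a wave of local operations. The toy Python simulation below assumes an output-stationary layout and is illustrative only; it is not the TPU's or Groq's actual design.)

```python
import numpy as np

def systolic_matmul(A, B):
    """Output-stationary systolic-array simulation: rows of A enter from the
    left and columns of B enter from the top, each skewed in time, and every
    cell performs one multiply-accumulate per clock tick on whatever reaches it."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    last_tick = (k - 1) + (m - 1) + (n - 1)   # last wavefront reaches cell (m-1, n-1)
    for t in range(last_tick + 1):            # one outer iteration per clock tick
        for i in range(m):
            for j in range(n):
                step = t - i - j              # which operand pair reaches cell (i, j) now
                if 0 <= step < k:
                    C[i, j] += A[i, step] * B[step, j]
    return C

A = np.random.rand(4, 3)
B = np.random.rand(3, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```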
I moved on to the Google X team, the rapid eval team, 00:08:04.320 |
but nothing was turning into a production system. 00:08:10.280 |
and I wanted to go and do something real again 00:08:13.420 |
I wanted to take something from concept to production, 00:08:26.000 |
One was more of let me build an image classifier, 00:08:29.000 |
and you thought you could out-ResNet ResNet at the time, 00:08:35.300 |
- Well, actually, I had zero intention of building a chip. 00:08:59.880 |
was gonna be open source, and it was gonna be-- 00:09:02.960 |
That was 2016, and so I just couldn't imagine 00:09:11.560 |
that if you build something innovative and you launch it, 00:09:13.960 |
it's gonna be four years before anyone can even copy it, 00:09:18.560 |
so that just felt like a much better approach, 00:09:24.680 |
so right around that time, the TPU paper came out. 00:09:32.080 |
and you asked me what I would do differently. 00:09:44.000 |
in a press release and starts talking about TPU, 00:09:50.360 |
in which Google should be building their own hardware. 00:09:52.560 |
They must know something that the rest of us don't know, 00:10:01.560 |
and I probably met you a few weeks afterwards, 00:10:03.940 |
and that was probably the fastest investment I'd ever made. 00:10:07.480 |
I remember the key moment is you did not have a company, 00:10:15.240 |
which is always either a sign of complete stupidity 00:10:18.280 |
or in 15 or 20 years, you'll look like a genius, 00:10:25.740 |
Tell us about the design decisions you were making 00:10:27.740 |
in Groq at the time, knowing what you knew then, 00:10:39.320 |
but it was something that I think you asked, Chamath, 00:10:50.080 |
but programming them, every single team at Google 00:10:53.280 |
had a dedicated person who was hand-optimizing the models, 00:10:59.440 |
Right around then, we had started hiring some people 00:11:06.400 |
We've got these things called kernels, CUDA kernels, 00:11:09.840 |
We just make it look like we're not doing that, 00:11:12.300 |
but the scale, like, all of you understand algorithms 00:11:21.160 |
NVIDIA now has 50,000 people in their ecosystem. 00:11:24.040 |
How does any, and these are like really low-level 00:11:30.100 |
Not gonna scale, so we focused on the compiler 00:11:35.960 |
because people kept trying to draw pictures of chips. 00:11:51.280 |
but some part of it was a little bit of luck, 00:12:04.600 |
but the inspiration, the last thing that I worked on 00:12:11.800 |
the Go playing software at DeepMind working on TPU, 00:12:19.400 |
that inference was going to be a scaled problem. 00:12:34.320 |
and even though we had 170 GPUs versus 48 TPUs, 00:12:44.580 |
What that meant was compute was going to result 00:12:49.240 |
And so the insight was, let's build scaled inference. 00:12:53.880 |
So we built in the interconnect, we built it for scale, 00:13:00.240 |
we have hundreds or thousands of chips contributing 00:13:04.600 |
but it's built for this as opposed to cobbled together. 00:13:16.080 |
and they have clearly built an incredible business. 00:13:19.480 |
But in some ways, when you get into the details, 00:13:39.240 |
So NVIDIA outruns all of the other chip companies 00:13:44.520 |
They actually have a very expensive approach, 00:13:52.040 |
If you have a kernel-based approach, they've already won. 00:14:00.480 |
is vertical integration and forward integration. 00:14:14.120 |
or one of these other PCI board manufacturers 00:14:18.240 |
even though 80% of their revenue came from NVIDIA, 00:14:23.520 |
they're exiting that market because NVIDIA moved up 00:14:33.800 |
And I think the design decisions that they made, 00:14:38.440 |
were really oriented around a world back then, 00:14:44.920 |
None of you guys were really building anything in the wild 00:14:50.640 |
- Absolutely, and what we saw over and over again 00:14:53.160 |
was you would spend 100% of your compute on training. 00:14:56.800 |
You would get something that would work well enough 00:14:59.760 |
And then it would flip to about 5% to 10% training 00:15:07.080 |
But the amount of training would stay the same. 00:15:11.520 |
And so every time we would have a success at Google, 00:15:18.320 |
where we can't afford to get enough compute for inference 00:15:22.280 |
'cause it goes 10-20x immediately, over and over. 00:15:28.480 |
by the cost of NVIDIA's leading-class solutions, 00:15:31.320 |
you're talking just an enormous amount of money. 00:15:43.840 |
- Yeah, the complexity spans every part of the stack, 00:15:50.760 |
And NVIDIA has locked up the market on these. 00:16:00.720 |
because the speed at which you can run these applications 00:16:04.640 |
depends on how quickly you can read that memory. 00:16:11.840 |
So they can't reach into the supply for mobile 00:16:14.560 |
or other things like you can with other parts. 00:16:18.400 |
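(A hedged back-of-the-envelope for why HBM read speed matters so much: when decoding one token at a time, the accelerator has to stream roughly all of the model's weights from memory for each token, so per-user token rate is bounded by bandwidth divided by weight bytes. The model size and bandwidth figures below are illustrative assumptions, not vendor specs.)

```python
def decode_tokens_per_sec(params_billions, bytes_per_param, bandwidth_tb_per_s):
    """Upper bound on single-user decode speed when every generated token
    requires streaming all model weights from memory once (ignores KV cache,
    batching, and compute limits)."""
    weight_bytes = params_billions * 1e9 * bytes_per_param
    bandwidth_bytes = bandwidth_tb_per_s * 1e12
    return bandwidth_bytes / weight_bytes

# Illustrative: a 70B-parameter model in 16-bit weights behind ~3 TB/s of HBM.
print(decode_tokens_per_sec(70, 2, 3.0))   # ~21 tokens/sec per user, at best
```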
Also, NVIDIA's the largest buyer of super caps in the world 00:16:30.280 |
it doesn't matter how good of a product you design. 00:16:59.760 |
where you said we cannot be reliant on the same supply chain 00:17:01.920 |
'cause we'll just get forced out of business at some point? 00:17:04.480 |
- It was actually a really simple observation 00:17:11.720 |
on small percentages difference in performance, 00:17:18.000 |
And what we realized was if we were 15% better, 00:17:27.400 |
Therefore, the small percentages that you get 00:17:30.780 |
chasing the leading edge technologies was irrelevant. 00:17:34.600 |
So we used an older technology, 14 nanometer, 00:17:44.080 |
because our architecture needed to provide the advantage 00:17:49.140 |
that we didn't need to be at the leading edge. 00:17:51.720 |
- So how do you measure sort of speed and value today? 00:18:02.600 |
- Yeah, so we run, we compare on two sides of this. 00:18:12.160 |
So tokens per second per user is the experience. 00:18:30.160 |
Apples to apples, like without using speculative decode 00:18:34.900 |
So right now, on a 180 billion parameter model, 00:18:44.880 |
on the next generation GPU that's coming out. 00:18:49.960 |
- So your current generation is 4x better than the B200? 00:18:55.240 |
And then in total cost, we're about 1/10 the cost 00:19:02.600 |
I want that to sink in for a moment, 1/10 of the cost. 00:19:13.720 |
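(A small sketch of how "tokens per second" and "total cost" connect: cost per token is just system cost per hour divided by tokens served per hour. The dollar and throughput figures below are placeholders to show the arithmetic, not Groq's or NVIDIA's actual numbers, so the 4x and 1/10 ratios quoted above are not derived here.)

```python
def cost_per_million_tokens(system_cost_per_hour, aggregate_tokens_per_sec):
    """Serving cost per million tokens for a system with a given hourly cost
    and aggregate throughput across all concurrent users."""
    tokens_per_hour = aggregate_tokens_per_sec * 3600
    return system_cost_per_hour / tokens_per_hour * 1e6

# Placeholder numbers: a $40/hour system serving 10,000 tokens/sec in aggregate.
print(cost_per_million_tokens(40.0, 10_000))   # ~$1.11 per million tokens
```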
and especially if you are part of the venture community 00:19:23.240 |
Last decade, we went into a very negative cycle 00:19:30.320 |
into the hands of Google, Amazon, and Facebook. 00:19:40.760 |
or 1/100 of the cost, but that won't be possible 00:19:43.960 |
if you're, again, just shipping the money back out, 00:19:51.720 |
that this is kind of the low-cost alternative that happens. 00:19:56.160 |
- So NVIDIA had a huge splashy announcement a few weeks ago. 00:19:59.840 |
They showed charts, things going up and to the right. 00:20:13.720 |
The level of complexity, the level of integration, 00:20:16.940 |
the amount of different components in silicon. 00:20:25.120 |
I got some pings from NVIDIA engineers who said, 00:20:28.240 |
we were a little embarrassed that they were claiming 30X 00:20:36.040 |
The 30X claim was, let's put it into perspective. 00:20:42.300 |
of up to 50 tokens per second from the user experience 00:20:49.000 |
That sort of gives you the value or the cost. 00:20:52.440 |
If you were to compare that to the previous generation, 00:21:19.280 |
if you're satisfied with the speed of performance? 00:21:34.160 |
if you want to actually make hallucinations go to zero 00:21:37.120 |
and the quality of these models really fine-tuned, 00:21:44.840 |
where you have a window of probably 300 milliseconds 00:21:50.200 |
the user experience doesn't scale and it kind of sucks. 00:21:53.360 |
- How much effort did you spend at Meta and Facebook 00:21:58.960 |
I had a team, I was so disgusted with the speed. 00:22:05.640 |
And I was so disgusted that I took a small team 00:22:13.520 |
and launched it in India for the Indian market 00:22:17.360 |
just to prove that we could get it under 500 milliseconds. 00:22:22.800 |
And it was a huge technical feat that the team did. 00:22:37.520 |
Google realized it, everybody has realized it. 00:22:44.080 |
under about 250 to 300 milliseconds, you maximize revenue. 00:22:52.520 |
So the idea that you can wait and fetch an answer 00:22:55.280 |
in three and five seconds is completely ridiculous. 00:23:06.240 |
leads to 8% more engagement on desktop, 34% on mobile. 00:23:21.320 |
we were getting today than you otherwise could. 00:23:24.440 |
- So why don't you now break down this difference? 00:23:26.440 |
Because this is now where I think a good place 00:23:32.600 |
between training and inference and what is required. 00:23:40.080 |
so that then we can contrast where things are gonna go? 00:23:50.120 |
Like how many tokens can we train on this month? 00:23:53.040 |
It doesn't matter if it takes a second, 10 seconds, 00:24:00.000 |
In inference, what matters is how many tokens 00:24:03.080 |
you can generate per millisecond or a couple of milliseconds. 00:24:16.640 |
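(A quick illustration of the distinction being drawn: training is judged on aggregate tokens processed over a long window, inference on how fast each next token reaches a waiting user. The batch sizes and rates below are made-up placeholders.)

```python
# Training: what matters is aggregate tokens over a long window; a step taking
# 1 second or 10 seconds is fine as long as monthly throughput stays high.
tokens_per_step = 4_000_000            # e.g. a large global batch
step_time_s = 2.0
tokens_per_month = tokens_per_step / step_time_s * 3600 * 24 * 30
print(f"training: {tokens_per_month:.2e} tokens this month")

# Inference: a single user is waiting on each next token, so per-token latency
# (milliseconds, not months) is the number that defines the experience.
tokens_per_sec_per_user = 300
ms_per_token = 1000 / tokens_per_sec_per_user
print(f"inference: {ms_per_token:.1f} ms per token for one user")
```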
- And then is it fair to say that there really isn't yet 00:24:31.520 |
- But specifically, why do you not think it won't work 00:24:35.400 |
for that market, even though it's clearly working in training 00:24:38.760 |
- In order to get the latency down, what we had to do, 00:24:42.840 |
we had to design a completely new chip architecture. 00:24:45.840 |
We had to design a completely new networking architecture, 00:24:48.720 |
an entirely new system, an entirely new runtime, 00:25:03.320 |
Now we're talking about innovators dilemma on steroids. 00:25:08.200 |
which if you were to do one of those successfully 00:25:12.720 |
But to throw all six of those away is nearly impossible. 00:25:20.520 |
And so now you have to have a completely different 00:25:22.520 |
architecture for networking, for training versus inference, 00:25:26.640 |
for your chip, for networking, for everything. 00:25:32.120 |
is 100 units of training or 95 units of training, 00:25:37.000 |
I should say that's roughly where most of the revenue 00:25:42.360 |
What does it look like in four or five years from now? 00:25:45.360 |
- Well, actually NVIDIA's latest earnings, 40% inference. 00:25:52.720 |
at somewhere between 90 and 95%, or 90 to 95 units of inference. 00:25:52.720 |
And so that trajectory is gonna take off rapidly 00:26:07.840 |
- Yeah, one of the things about these open source models 00:26:14.040 |
you have to either understand or be able to work with CUDA. 00:26:18.560 |
With you, it doesn't even matter because you can just port. 00:26:28.280 |
- So for the inference market, every two weeks or so, 00:26:33.280 |
there is a completely new model that has to be run. 00:26:40.720 |
Either it's setting the best quality bar across the board 00:26:52.480 |
In fact, when LLAMA2 70 billion was launched, 00:26:58.520 |
However, the first support we actually saw implemented 00:27:03.080 |
was after about a week and we had it in I think two days. 00:27:11.760 |
So by default, anything launched will work there. 00:27:26.960 |
you clearly wanna have the ability to swap from LLAMA 00:27:32.360 |
to Mistral to Anthropic back as often as possible. 00:27:37.240 |
- And just as somebody who sees these models run, 00:27:40.920 |
do you have any comment on the quality of these models 00:27:42.880 |
and where you think some of these companies are going 00:27:44.800 |
or what you see some doing well versus others? 00:27:47.840 |
- So they're all starting to catch up with each other. 00:27:55.840 |
and it had a lead for about a year over everyone else. 00:28:38.680 |
And you're not gonna be able to scale up applications 00:28:48.480 |
as has Meta, as has Tesla and a couple of others, 00:28:51.640 |
just the total quantum of GPU capacity that they're buying. 00:28:56.920 |
to figure out how big the inference market can be, 00:29:14.160 |
they're gonna have the equivalent of 650,000 H100s. 00:29:25.000 |
which do outperform the H100s on a throughput 00:29:47.520 |
So 1.5 million means that Groq will probably have 00:30:00.680 |
So probably about 50% of the inference compute in the world. 00:30:07.600 |
Tell us about team building in Silicon Valley. 00:30:12.280 |
How hard is it to get folks that are real AI folks 00:30:15.440 |
in the backdrop of you could go work at Tesla, 00:30:28.360 |
Now, by the way, you have this interesting thing 00:30:31.800 |
you were trying to find folks that knew Haskell. 00:30:34.160 |
So just tell us, how hard is it to build a team 00:30:47.800 |
But yeah, these pay packages are astronomical 00:30:50.760 |
because everyone views this as a winner take all market. 00:31:01.400 |
So if you don't have the best talent, you're out. 00:31:04.800 |
A lot of these AI researchers are amazing at AI, 00:31:30.800 |
than you will be able to take the AI researchers 00:31:36.480 |
- You were on stage in Saudi Arabia with Saudi Aramco 00:31:45.400 |
Can you just, like, what is going on with deals like that? 00:31:51.560 |
Is that you competing with Amazon and Google and Microsoft? 00:32:03.680 |
to be doing a deal together with Aramco Digital. 00:32:10.080 |
but it will be large in terms of the amount of compute 00:32:15.640 |
And in total, we've done deals that get us to past 10% 00:32:22.440 |
And of course, the hard part is the first deals. 00:32:33.120 |
- So, no, no, I was just, I was just, second, Tim. 00:32:36.000 |
- So the scale of these deals is that these are larger 00:32:41.000 |
than the amount of compute that Meta has, right? 00:33:00.960 |
where they're gonna have more compute than a hyperscaler. 00:33:07.080 |
Last question, everybody's worried about what AI means. 00:33:17.040 |
Just end with your perspectives on what we should be thinking 00:33:22.240 |
and what your perspectives are on the future of AI, 00:33:27.680 |
- So I get asked a lot, should we be afraid of AI? 00:33:31.640 |
And my answer to that is, if you think back to Galileo, 00:33:38.520 |
the reason he got in trouble was he invented the telescope, 00:33:44.920 |
that we were much smaller than everyone wanted to believe. 00:33:48.400 |
We were supposed to be the center of the universe 00:33:53.920 |
the more obvious it became that we were small. 00:34:02.920 |
It's become clear that intelligence is larger than we are. 00:34:14.240 |
But what happened over time was as we realized 00:34:18.640 |
and we got used to that, we started to realize 00:34:20.640 |
how beautiful it was, and our place in the universe. 00:34:26.240 |
We're gonna realize intelligence is more vast 00:34:49.960 |
I was told grok means to understand deeply with empathy.