Conversation with Groq CEO Jonathan Ross
Okay, first of all, it's incredible to be here. 00:00:03.440 |
I have a few notes, so I make sure we cover everything. 00:00:06.120 |
I wanna take this opportunity to introduce Jonathan. 00:00:11.060 |
Obviously, a lot of you guys have heard of his company, 00:00:28.960 |
that he's accomplished both at Google and Groq. 00:00:33.540 |
because I think it's probably one of the most important 00:00:36.440 |
technical considerations that people should know. 00:00:40.480 |
and we'll leave with a few data points and bullets, 00:00:45.920 |
So, I wanna start with something that you do every week, 00:01:32.120 |
So, the reason this matters is the developers, of course, 00:01:37.400 |
so every developer is a multiplicative effect 00:01:40.160 |
on the total number of users that you can have, 00:01:43.980 |
- Let's go all the way back, so just with that backdrop. 00:01:50.040 |
This is eight years of plodding through the wilderness, 00:01:58.580 |
which is really the sign of a great entrepreneur. 00:02:07.660 |
about entrepreneurs who have dropped out of college 00:02:15.060 |
to have also started a billion-dollar company. 00:02:27.140 |
- So, I dropped out of high school, as mentioned, 00:02:29.900 |
and I ended up getting a job as a programmer, 00:02:37.700 |
and told me that I should be taking classes at a university 00:02:42.940 |
So, unmatriculated, I didn't actually get enrolled. 00:02:46.540 |
I started going to Hunter College as a side thing, 00:02:53.220 |
of one of the professors there, did well, transferred to NYU, 00:03:00.020 |
but as an undergrad, and then I dropped out of that. 00:03:03.180 |
- So, do you technically have a high school diploma? 00:03:07.180 |
- Nor do I have an undergrad degree, but yeah. 00:03:09.740 |
- So, from NYU, how did you end up at Google? 00:03:14.380 |
I don't think I would have ended up at Google, 00:03:15.940 |
so this is interesting, even though I didn't have the degree. 00:03:20.800 |
and one of the people at Google recognized me 00:03:23.720 |
because they also went to NYU, and then they referred me. 00:03:27.160 |
So, you can make some great connections in university, 00:03:30.000 |
even if you don't graduate, but it was one of the people 00:03:36.960 |
sort of what kind of stuff were you working on? 00:03:44.160 |
and if you think that it's hard to build production systems, 00:03:52.360 |
So, every single ads query, we would run 100 tests on that, 00:03:57.360 |
but we didn't have the budget of the production system. 00:04:00.640 |
So, we had to write our own threading library, 00:04:18.440 |
which is now, I think, or what most of you guys know 00:04:25.360 |
- So, 20% time is famous, I called it MCI time, 00:04:29.440 |
which probably isn't gonna transfer as a joke here, 00:04:31.360 |
but there was these advertisements for this phone company, 00:04:34.160 |
free nights and weekends, so you could work on 20% time 00:04:37.400 |
so long as it wasn't during your work time, yeah. 00:04:50.200 |
and I started what was called the TPU as a side project, 00:04:55.200 |
and it was funded out of what a VP referred to 00:05:11.900 |
to do some really counterintuitive and innovative things. 00:05:26.760 |
where you saw an opportunity to build something? 00:05:38.340 |
that transcribed speech better than human beings. 00:05:45.320 |
and so this led to a very famous engineer, Jeff Dean, 00:05:48.360 |
giving a presentation to the leadership team, 00:05:51.680 |
The first slide was good news, machine learning works. 00:05:55.820 |
The second slide, bad news, we can't afford it. 00:06:01.280 |
the entire global data center footprint of Google 00:06:04.760 |
at an average of a billion dollars per data center, 00:06:07.120 |
20 to 40 data centers, so 20 to 40 billion dollars, 00:06:28.460 |
that allowed TPU to be one of the three projects 00:06:39.480 |
most of the CPU cycles at Google were matrix multiplies, 00:06:39.480 |
and so we built a massive matrix multiplication engine. 00:06:52.360 |
When doing this, there were those two other competing teams. 00:06:55.900 |
They took more traditional approaches to do the same thing. 00:06:58.500 |
One of them was led by a Turing Award winner, 00:07:09.060 |
"Whoever came up with this must have been really old 00:07:11.860 |
"because systolic arrays have fallen out of favor," 00:07:15.780 |
I just didn't know what a systolic array was. 00:07:17.820 |
Someone had to explain to me what the terminology was. 00:07:20.020 |
It was just kind of the obvious way to do it, 00:07:54.040 |
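(A minimal sketch of the idea being described: in a systolic array, operands flow through a grid of multiply-accumulate cells in lockstep, so a big matrix multiply becomes a wave of local operations. The toy Python simulation below assumes an output-stationary layout and is illustrative only; it is not the TPU's or Groq's actual design.)

```python
import numpy as np

def systolic_matmul(A, B):
    """Output-stationary systolic-array simulation: rows of A enter from the
    left and columns of B enter from the top, each skewed in time, and every
    cell performs one multiply-accumulate per clock tick on whatever reaches it."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    last_tick = (k - 1) + (m - 1) + (n - 1)   # last wavefront reaches cell (m-1, n-1)
    for t in range(last_tick + 1):            # one outer iteration per clock tick
        for i in range(m):
            for j in range(n):
                step = t - i - j              # which operand pair reaches cell (i, j) now
                if 0 <= step < k:
                    C[i, j] += A[i, step] * B[step, j]
    return C

A = np.random.rand(4, 3)
B = np.random.rand(3, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```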
I moved on to the Google X team, the rapid eval team, 00:08:04.320 |
but nothing was turning into a production system. 00:08:10.280 |
and I wanted to go and do something real again 00:08:13.420 |
I wanted to take something from concept to production, 00:08:26.000 |
One was more of let me build an image classifier, 00:08:29.000 |
and you thought you could out-ResNet ResNet at the time, 00:08:35.300 |
- Well, actually, I had zero intention of building a chip. 00:08:59.880 |
was gonna be open source, and it was gonna be-- 00:09:02.960 |
That was 2016, and so I just couldn't imagine 00:09:11.560 |
that if you build something innovative and you launch it, 00:09:13.960 |
it's gonna be four years before anyone can even copy it, 00:09:18.560 |
so that just felt like a much better approach, 00:09:24.680 |
so right around that time, the TPU paper came out. 00:09:32.080 |
and you asked me what I would do differently. 00:09:44.000 |
in a press release and starts talking about TPU, 00:09:50.360 |
in which Google should be building their own hardware. 00:09:52.560 |
They must know something that the rest of us don't know, 00:10:01.560 |
and I probably met you a few weeks afterwards, 00:10:03.940 |
and that was probably the fastest investment I'd ever made. 00:10:07.480 |
I remember the key moment is you did not have a company, 00:10:15.240 |
which is always either a sign of complete stupidity 00:10:18.280 |
or in 15 or 20 years, you'll look like a genius, 00:10:25.740 |
Tell us about the design decisions you were making 00:10:27.740 |
in Groq at the time, knowing what you knew then, 00:10:39.320 |
but it was something that I think you asked, Chamath, 00:10:50.080 |
but programming them, every single team at Google 00:10:53.280 |
had a dedicated person who was hand-optimizing the models, 00:10:59.440 |
Right around then, we had started hiring some people 00:11:06.400 |
We've got these things called kernels, CUDA kernels, 00:11:09.840 |
We just make it look like we're not doing that, 00:11:12.300 |
but the scale, like, all of you understand algorithms 00:11:21.160 |
NVIDIA now has 50,000 people in their ecosystem. 00:11:24.040 |
How does any, and these are like really low-level 00:11:30.100 |
Not gonna scale, so we focused on the compiler 00:11:35.960 |
because people kept trying to draw pictures of chips. 00:11:51.280 |
but some part of it was a little bit of luck, 00:12:04.600 |
but the inspiration, the last thing that I worked on 00:12:11.800 |
the Go playing software at DeepMind working on TPU, 00:12:19.400 |
that inference was going to be a scaled problem. 00:12:34.320 |
and even though we had 170 GPUs versus 48 TPUs, 00:12:44.580 |
What that meant was compute was going to result 00:12:49.240 |
And so the insight was, let's build scaled inference. 00:12:53.880 |
So we built in the interconnect, we built it for scale, 00:13:00.240 |
we have hundreds or thousands of chips contributing 00:13:04.600 |
but it's built for this as opposed to cobbled together. 00:13:16.080 |
and they have clearly built an incredible business. 00:13:19.480 |
But in some ways, when you get into the details, 00:13:39.240 |
So NVIDIA outruns all of the other chip companies 00:13:44.520 |
They actually have a very expensive approach, 00:13:52.040 |
If you have a kernel-based approach, they've already won. 00:14:00.480 |
is vertical integration and forward integration. 00:14:14.120 |
or one of these other PCI board manufacturers 00:14:18.240 |
even though 80% of their revenue came from NVIDIA, 00:14:23.520 |
they're exiting that market because NVIDIA moved up 00:14:33.800 |
And I think the design decisions that they made, 00:14:38.440 |
were really oriented around a world back then, 00:14:44.920 |
None of you guys were really building anything in the wild 00:14:50.640 |
- Absolutely, and what we saw over and over again 00:14:53.160 |
was you would spend 100% of your compute on training. 00:14:56.800 |
You would get something that would work well enough 00:14:59.760 |
And then it would flip to about 5% to 10% training 00:15:07.080 |
But the amount of training would stay the same. 00:15:11.520 |
And so every time we would have a success at Google, 00:15:18.320 |
where we can't afford to get enough compute for inference 00:15:22.280 |
'cause it goes 10-20x immediately, over and over. 00:15:28.480 |
by the cost of NVIDIA's leading-class solutions, 00:15:31.320 |
you're talking just an enormous amount of money. 00:15:43.840 |
- Yeah, the complexity spans every part of the stack, 00:15:50.760 |
And NVIDIA has locked up the market on these. 00:16:00.720 |
because the speed at which you can run these applications 00:16:04.640 |
depends on how quickly you can read that memory. 00:16:11.840 |
So they can't reach into the supply for mobile 00:16:14.560 |
or other things like you can with other parts. 00:16:18.400 |
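(A hedged back-of-the-envelope for why HBM read speed matters so much: when decoding one token at a time, the accelerator has to stream roughly all of the model's weights from memory for each token, so per-user token rate is bounded by bandwidth divided by weight bytes. The model size and bandwidth figures below are illustrative assumptions, not vendor specs.)

```python
def decode_tokens_per_sec(params_billions, bytes_per_param, bandwidth_tb_per_s):
    """Upper bound on single-user decode speed when every generated token
    requires streaming all model weights from memory once (ignores KV cache,
    batching, and compute limits)."""
    weight_bytes = params_billions * 1e9 * bytes_per_param
    bandwidth_bytes = bandwidth_tb_per_s * 1e12
    return bandwidth_bytes / weight_bytes

# Illustrative: a 70B-parameter model in 16-bit weights behind ~3 TB/s of HBM.
print(decode_tokens_per_sec(70, 2, 3.0))   # ~21 tokens/sec per user, at best
```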
Also, NVIDIA's the largest buyer of super caps in the world 00:16:30.280 |
it doesn't matter how good of a product you design. 00:16:59.760 |
where you said we cannot be reliant on the same supply chain 00:17:01.920 |
'cause we'll just get forced out of business at some point? 00:17:04.480 |
- It was actually a really simple observation 00:17:11.720 |
on small percentages difference in performance, 00:17:18.000 |
And what we realized was if we were 15% better, 00:17:27.400 |
Therefore, the small percentages that you get 00:17:30.780 |
chasing the leading edge technologies was irrelevant. 00:17:34.600 |
So we used an older technology, 14 nanometer, 00:17:44.080 |
because our architecture needed to provide the advantage 00:17:49.140 |
that we didn't need to be at the leading edge. 00:17:51.720 |
- So how do you measure sort of speed and value today? 00:18:02.600 |
- Yeah, so we run, we compare on two sides of this. 00:18:12.160 |
So tokens per second per user is the experience. 00:18:30.160 |
Apples to apples, like without using speculative decode 00:18:34.900 |
So right now, on a 180 billion parameter model, 00:18:44.880 |
on the next generation GPU that's coming out. 00:18:49.960 |
- So your current generation is 4x better than the B200? 00:18:55.240 |
And then in total cost, we're about 1/10 the cost 00:19:02.600 |
I want that to sink in for a moment, 1/10 of the cost. 00:19:13.720 |
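(A small sketch of how "tokens per second" and "total cost" connect: cost per token is just system cost per hour divided by tokens served per hour. The dollar and throughput figures below are placeholders to show the arithmetic, not Groq's or NVIDIA's actual numbers, so the 4x and 1/10 ratios quoted above are not derived here.)

```python
def cost_per_million_tokens(system_cost_per_hour, aggregate_tokens_per_sec):
    """Serving cost per million tokens for a system with a given hourly cost
    and aggregate throughput across all concurrent users."""
    tokens_per_hour = aggregate_tokens_per_sec * 3600
    return system_cost_per_hour / tokens_per_hour * 1e6

# Placeholder numbers: a $40/hour system serving 10,000 tokens/sec in aggregate.
print(cost_per_million_tokens(40.0, 10_000))   # ~$1.11 per million tokens
```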
and especially if you are part of the venture community 00:19:23.240 |
Last decade, we went into a very negative cycle 00:19:30.320 |
into the hands of Google, Amazon, and Facebook. 00:19:40.760 |
or 1/100 of the cost, but that won't be possible 00:19:43.960 |
if you're, again, just shipping the money back out, 00:19:51.720 |
that this is kind of the low-cost alternative that happens. 00:19:56.160 |
- So NVIDIA had a huge splashy announcement a few weeks ago. 00:19:59.840 |
They showed charts, things going up and to the right. 00:20:13.720 |
The level of complexity, the level of integration, 00:20:16.940 |
the amount of different components in silicon. 00:20:25.120 |
I got some pings from NVIDIA engineers who said, 00:20:28.240 |
we were a little embarrassed that they were claiming 30X 00:20:36.040 |
The 30X claim was, let's put it into perspective. 00:20:42.300 |
of up to 50 tokens per second from the user experience 00:20:49.000 |
That sort of gives you the value or the cost. 00:20:52.440 |
If you were to compare that to the previous generation, 00:21:19.280 |
if you're satisfied with the speed of performance? 00:21:34.160 |
if you want to actually make hallucinations go to zero 00:21:37.120 |
and the quality of these models really fine-tuned, 00:21:44.840 |
where you have a window of probably 300 milliseconds 00:21:50.200 |
the user experience doesn't scale and it kind of sucks. 00:21:53.360 |
- How much effort did you spend at Meta and Facebook 00:21:58.960 |
I had a team, I was so disgusted with the speed. 00:22:05.640 |
And I was so disgusted that I took a small team 00:22:13.520 |
and launched it in India for the Indian market 00:22:17.360 |
just to prove that we could get it under 500 milliseconds. 00:22:22.800 |
And it was a huge technical feat that the team did. 00:22:37.520 |
Google realized it, everybody has realized it. 00:22:44.080 |
under about 250 to 300 milliseconds, you maximize revenue. 00:22:52.520 |
So the idea that you can wait and fetch an answer 00:22:55.280 |
in three and five seconds is completely ridiculous. 00:23:06.240 |
leads to 8% more engagement on desktop, 34% on mobile. 00:23:21.320 |
we were getting today than you otherwise could. 00:23:24.440 |
- So why don't you now break down this difference? 00:23:26.440 |
Because this is now where I think a good place 00:23:32.600 |
between training and inference and what is required. 00:23:40.080 |
so that then we can contrast where things are gonna go? 00:23:50.120 |
Like how many tokens can we train on this month? 00:23:53.040 |
It doesn't matter if it takes a second, 10 seconds, 00:24:00.000 |
In inference, what matters is how many tokens 00:24:03.080 |
you can generate per millisecond or a couple of milliseconds. 00:24:16.640 |
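(A quick illustration of the distinction being drawn: training is judged on aggregate tokens processed over a long window, inference on how fast each next token reaches a waiting user. The batch sizes and rates below are made-up placeholders.)

```python
# Training: what matters is aggregate tokens over a long window; a step taking
# 1 second or 10 seconds is fine as long as monthly throughput stays high.
tokens_per_step = 4_000_000            # e.g. a large global batch
step_time_s = 2.0
tokens_per_month = tokens_per_step / step_time_s * 3600 * 24 * 30
print(f"training: {tokens_per_month:.2e} tokens this month")

# Inference: a single user is waiting on each next token, so per-token latency
# (milliseconds, not months) is the number that defines the experience.
tokens_per_sec_per_user = 300
ms_per_token = 1000 / tokens_per_sec_per_user
print(f"inference: {ms_per_token:.1f} ms per token for one user")
```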
- And then is it fair to say that there really isn't yet 00:24:31.520 |
- But specifically, why do you not think it won't work 00:24:35.400 |
for that market, even though it's clearly working in training 00:24:38.760 |
- In order to get the latency down, what we had to do, 00:24:42.840 |
we had to design a completely new chip architecture. 00:24:45.840 |
We had to design a completely new networking architecture, 00:24:48.720 |
an entirely new system, an entirely new runtime, 00:25:03.320 |
Now we're talking about innovators dilemma on steroids. 00:25:08.200 |
which if you were to do one of those successfully 00:25:12.720 |
But to throw all six of those away is nearly impossible. 00:25:20.520 |
And so now you have to have a completely different 00:25:22.520 |
architecture for networking, for training versus inference, 00:25:26.640 |
for your chip, for networking, for everything. 00:25:32.120 |
is 100 units of training or 95 units of training, 00:25:37.000 |
I should say that's roughly where most of the revenue 00:25:42.360 |
What does it look like in four or five years from now? 00:25:45.360 |
- Well, actually NVIDIA's latest earnings, 40% inference. 00:25:52.720 |
at somewhere between 90 and 95%, or 90 to 95 units of inference. 00:25:52.720 |
And so that trajectory is gonna take off rapidly 00:26:07.840 |
- Yeah, one of the things about these open source models 00:26:14.040 |
you have to either understand or be able to work with CUDA. 00:26:18.560 |
With you, it doesn't even matter because you can just port. 00:26:28.280 |
- So for the inference market, every two weeks or so, 00:26:33.280 |
there is a completely new model that has to be run. 00:26:40.720 |
Either it's setting the best quality bar across the board 00:26:52.480 |
In fact, when LLAMA2 70 billion was launched, 00:26:58.520 |
However, the first support we actually saw implemented 00:27:03.080 |
was after about a week and we had it in I think two days. 00:27:11.760 |
So by default, anything launched will work there. 00:27:26.960 |
you clearly wanna have the ability to swap from LLAMA 00:27:32.360 |
to Mistral to Anthropic back as often as possible. 00:27:37.240 |
- And just as somebody who sees these models run, 00:27:40.920 |
do you have any comment on the quality of these models 00:27:42.880 |
and where you think some of these companies are going 00:27:44.800 |
or what you see some doing well versus others? 00:27:47.840 |
- So they're all starting to catch up with each other. 00:27:55.840 |
and it had a lead for about a year over everyone else. 00:28:38.680 |
And you're not gonna be able to scale up applications 00:28:48.480 |
as has Meta, as has Tesla and a couple of others, 00:28:51.640 |
just the total quantum of GPU capacity that they're buying. 00:28:56.920 |
to figure out how big the inference market can be, 00:29:14.160 |
they're gonna have the equivalent of 650,000 H100s. 00:29:25.000 |
which do outperform the H100s on a throughput 00:29:47.520 |
So 1.5 million means that Groq will probably have 00:30:00.680 |
So probably about 50% of the inference compute in the world. 00:30:07.600 |
Tell us about team building in Silicon Valley. 00:30:12.280 |
How hard is it to get folks that are real AI folks 00:30:15.440 |
in the backdrop of you could go work at Tesla, 00:30:28.360 |
Now, by the way, you have this interesting thing 00:30:31.800 |
you were trying to find folks that knew Haskell. 00:30:34.160 |
So just tell us, how hard is it to build a team 00:30:47.800 |
But yeah, these pay packages are astronomical 00:30:50.760 |
because everyone views this as a winner take all market. 00:31:01.400 |
So if you don't have the best talent, you're out. 00:31:04.800 |
A lot of these AI researchers are amazing at AI, 00:31:30.800 |
than you will be able to take the AI researchers 00:31:36.480 |
- You were on stage in Saudi Arabia with Saudi Aramco 00:31:45.400 |
Can you just, like, what is going on with deals like that? 00:31:51.560 |
Is that you competing with Amazon and Google and Microsoft? 00:32:03.680 |
to be doing a deal together with Aramco Digital. 00:32:10.080 |
but it will be large in terms of the amount of compute 00:32:15.640 |
And in total, we've done deals that get us to past 10% 00:32:22.440 |
And of course, the hard part is the first deals. 00:32:33.120 |
- So, no, no, I was just, I was just, second, Tim. 00:32:36.000 |
- So the scale of these deals is that these are larger 00:32:41.000 |
than the amount of compute that Meta has, right? 00:33:00.960 |
where they're gonna have more compute than a hyperscaler. 00:33:07.080 |
Last question, everybody's worried about what AI means. 00:33:17.040 |
Just end with your perspectives on what we should be thinking 00:33:22.240 |
and what your perspectives are on the future of AI, 00:33:27.680 |
- So I get asked a lot, should we be afraid of AI? 00:33:31.640 |
And my answer to that is, if you think back to Galileo, 00:33:38.520 |
the reason he got in trouble was he invented the telescope, 00:33:44.920 |
that we were much smaller than everyone wanted to believe. 00:33:48.400 |
We were supposed to be the center of the universe 00:33:53.920 |
the more obvious it became that we were small. 00:34:02.920 |
It's become clear that intelligence is larger than we are. 00:34:14.240 |
But what happened over time was as we realized 00:34:18.640 |
and we got used to that, we started to realize 00:34:20.640 |
how beautiful it was, and our place in the universe. 00:34:26.240 |
We're gonna realize intelligence is more vast 00:34:49.960 |
I was told grok means to understand deeply with empathy.