
Conversation with Groq CEO Jonathan Ross


Whisper Transcript

00:00:00.000 | Okay, first of all, it's incredible to be here.
00:00:03.440 | I have a few notes, so I make sure we cover everything.
00:00:06.120 | I wanna take this opportunity to introduce Jonathan.
00:00:11.060 | Obviously, a lot of you guys have heard of his company,
00:00:13.020 | but you may not know his origin story,
00:00:15.560 | which is quite honestly,
00:00:16.760 | having been in Silicon Valley for 25 years,
00:00:20.120 | one of the most unique origin stories
00:00:23.520 | and founder stories you're going to hear.
00:00:25.620 | We're gonna talk about some of the things
00:00:28.960 | that he's accomplished both at Google and Groq.
00:00:31.200 | We're gonna compare Groq and NVIDIA,
00:00:33.540 | because I think it's probably one of the most important
00:00:36.440 | technical considerations that people should know.
00:00:38.680 | We'll talk about the software stack,
00:00:40.480 | and we'll leave with a few data points and bullets,
00:00:44.760 | which I think are pretty impressive.
00:00:45.920 | So, I wanna start by something that you do every week,
00:00:50.920 | which is you typically tweet out
00:00:54.140 | some sort of developer metric.
00:00:56.040 | Where are you as of this morning,
00:00:58.800 | and why are developers so important?
00:01:01.580 | - So, we're at, can you hear me?
00:01:05.280 | - Try this one.
00:01:08.520 | - Testing, ah, perfect.
00:01:10.320 | So, we are at 75,000 developers,
00:01:14.240 | and that is slightly over 30 days
00:01:17.280 | from launching our developer console.
00:01:20.020 | For comparison, it took NVIDIA seven years
00:01:24.300 | to get to 100,000 developers,
00:01:27.220 | and we're at 75,000 in about 30-ish days.
00:01:32.120 | So, the reason this matters is the developers, of course,
00:01:35.920 | are building all the applications,
00:01:37.400 | so every developer is a multiplicative effect
00:01:40.160 | on the total number of users that you can have,
00:01:42.160 | but it's all about developers.
00:01:43.980 | - Let's go all the way back, so just with that backdrop.
00:01:47.880 | This is not an overnight success story.
00:01:50.040 | This is eight years of plodding through the wild wilderness,
00:01:55.400 | punctuated, frankly, with a lot of misfires,
00:01:58.580 | which is really the sign of a great entrepreneur.
00:02:02.240 | But I want people to hear this story.
00:02:05.280 | Jonathan may be, you guys have all heard
00:02:07.660 | about entrepreneurs who have dropped out of college
00:02:10.220 | to start billion-dollar companies.
00:02:12.500 | Jonathan may be the only high school dropout
00:02:15.060 | to have also started a billion-dollar company.
00:02:17.420 | So, let's start with just two minutes.
00:02:20.060 | Just give us the background of you,
00:02:22.480 | because it was a very circuitous path
00:02:25.940 | to being an entrepreneur.
00:02:27.140 | - So, I dropped out of high school, as mentioned,
00:02:29.900 | and I ended up getting a job as a programmer,
00:02:34.300 | and my boss noticed that I was clever
00:02:37.700 | and told me that I should be taking classes at a university
00:02:41.480 | despite having dropped out of high school.
00:02:42.940 | So, unmatriculated, I didn't actually get enrolled.
00:02:46.540 | I started going to Hunter College as a side thing,
00:02:50.420 | and then I sort of fell under the wing
00:02:53.220 | of one of the professors there, did well, transferred to NYU,
00:02:56.980 | and then I started taking PhD courses,
00:03:00.020 | but as an undergrad, and then I dropped out of that.
00:03:03.180 | - So, do you technically have a high school diploma?
00:03:05.140 | - No.
00:03:05.980 | - Okay, this is perfect.
00:03:07.180 | - Nor do I have an undergrad degree, but yeah.
00:03:09.740 | - So, from NYU, how did you end up at Google?
00:03:12.420 | - Well, actually, if it hadn't been for NYU,
00:03:14.380 | I don't think I would have ended up at Google,
00:03:15.940 | so this is interesting, even though I didn't have the degree.
00:03:18.360 | And I happened to go to an event at Google,
00:03:20.800 | and one of the people at Google recognized me
00:03:23.720 | because they also went to NYU, and then they referred me.
00:03:27.160 | So, you can make some great connections in university,
00:03:30.000 | even if you don't graduate, but it was one of the people
00:03:33.200 | that I was taking the PhD courses with.
00:03:35.400 | - And when you first went there,
00:03:36.960 | sort of what kind of stuff were you working on?
00:03:39.840 | - Ads, but testing.
00:03:41.660 | So, we were building giant test systems,
00:03:44.160 | and if you think that it's hard to build production systems,
00:03:47.760 | test systems have to test everything
00:03:49.920 | the production system does, we did it live.
00:03:52.360 | So, every single ads query, we would run 100 tests on that,
00:03:57.360 | but we didn't have the budget of the production system.
00:04:00.640 | So, we had to write our own threading library,
00:04:02.620 | we had to do all sorts of crazy stuff,
00:04:04.480 | which you don't think of in ads, but yeah,
00:04:06.880 | it was actually harder engineering
00:04:08.520 | than the production itself.
00:04:10.360 | - And so, Google's very famous for 20% time
00:04:13.300 | where you kind of can do whatever you want,
00:04:15.480 | and is that what led to the birth of TPU,
00:04:18.440 | which is now, I think, or what most of you guys know
00:04:21.420 | is Google's sort of leading custom silicon
00:04:23.960 | that they use internally.
00:04:25.360 | - So, 20% time is famous, I called it MCI time,
00:04:29.440 | which probably isn't gonna transfer as a joke here,
00:04:31.360 | but there were these advertisements for this phone company,
00:04:34.160 | free nights and weekends, so you could work on 20% time
00:04:37.400 | so long as it wasn't during your work time, yeah.
00:04:40.140 | But every single night, I would go up
00:04:42.880 | and work with the speech team.
00:04:45.680 | So, this was separate from my main project,
00:04:47.820 | and they bought me some hardware,
00:04:50.200 | and I started what was called the TPU as a side project,
00:04:55.200 | and it was funded out of what a VP referred to
00:04:59.640 | as his slush fund or leftover money,
00:05:02.280 | and it was never expected,
00:05:03.640 | there were actually two other projects
00:05:05.480 | to build AI accelerators,
00:05:07.440 | it was never expected to be successful,
00:05:09.440 | which gave us the cover that we needed
00:05:11.900 | to do some really counterintuitive and innovative things.
00:05:15.160 | Once that became successful,
00:05:16.840 | they brought in the adult supervision.
00:05:19.080 | - Okay, take a step back though,
00:05:20.720 | what problem were you trying to solve in AI
00:05:23.200 | when those words weren't even being used,
00:05:25.240 | and what was Google trying to do at the time
00:05:26.760 | where you saw an opportunity to build something?
00:05:28.560 | - So, this started in 2012,
00:05:30.580 | and at the time, there had never been
00:05:32.520 | a machine learning model that outperformed
00:05:34.800 | a human being on any task,
00:05:36.360 | and the speech team trained a model
00:05:38.340 | that transcribed speech better than human beings.
00:05:41.400 | The problem was they couldn't afford
00:05:43.720 | to put it into production,
00:05:45.320 | and so this led to a very famous engineer, Jeff Dean,
00:05:48.360 | giving a presentation to the leadership team,
00:05:50.200 | it was just two slides.
00:05:51.680 | The first slide was good news, machine learning works.
00:05:55.820 | The second slide, bad news, we can't afford it.
00:05:58.880 | So, they were gonna have to double or triple
00:06:01.280 | the entire global data center footprint of Google
00:06:04.760 | at an average of a billion dollars per data center,
00:06:07.120 | 20 to 40 data centers, so 20 to 40 billion dollars,
00:06:10.300 | and that was just for speech recognition.
00:06:12.180 | If they wanted to do anything else,
00:06:13.420 | like search ads, it was gonna cost more.
00:06:16.300 | That was uneconomical,
00:06:17.900 | and that's been the history with inference.
00:06:19.620 | You train it, and then you can't afford
00:06:21.660 | to put it into production.
00:06:22.960 | - So, against that backdrop,
00:06:26.380 | what did you do that was so unique
00:06:28.460 | that allowed TPU to be one of the three projects
00:06:31.820 | that actually won?
00:06:33.380 | - The biggest thing was Jeff Dean noticed
00:06:36.260 | that the main algorithm that was consuming
00:06:39.480 | most of the CPU cycles at Google was matrix multiply,
00:06:42.860 | and we decided, okay, let's accelerate that,
00:06:45.740 | but let's build something around that,
00:06:47.980 | and so we built a massive matrix multiplication engine.
00:06:52.360 | When doing this, there were those two other competing teams.
00:06:55.900 | They took more traditional approaches to do the same thing.
00:06:58.500 | One of them was led by a Turing Award winner,
00:07:00.600 | and then what we did was we came up
00:07:02.460 | with what's called a systolic array,
00:07:04.860 | and I remember when that Turing Award winner
00:07:07.480 | was talking about the TPU, he said,
00:07:09.060 | "Whoever came up with this must have been really old
00:07:11.860 | "because systolic arrays have fallen out of favor,"
00:07:14.940 | and it was actually me.
00:07:15.780 | I just didn't know what a systolic array was.
00:07:17.820 | Someone had to explain to me what the terminology was.
00:07:20.020 | It was just kind of the obvious way to do it,
00:07:21.620 | and so the lesson is if you come at things
00:07:24.260 | knowing how to do them,
00:07:25.540 | you might know how to do them the wrong way.
00:07:27.780 | It's helpful to have people who don't know
00:07:30.300 | what should and should not be done.
00:07:33.000 | - So as TPU scales, there's probably a lot
00:07:35.800 | of internal recognition at Google.
00:07:38.180 | How do you walk away from that,
00:07:41.660 | and why did you walk away from that?
00:07:43.760 | - Well, all big companies end up
00:07:45.280 | becoming political in the end,
00:07:46.840 | and when you have something that successful,
00:07:48.920 | a lot of people want to own it,
00:07:50.200 | and there's always more senior people
00:07:51.800 | who start grabbing for it.
00:07:54.040 | I moved on to the Google X team, the rapid eval team,
00:07:58.320 | which is the team that comes up
00:07:59.280 | with all the crazy ideas at Google X,
00:08:01.700 | and I was having fun there,
00:08:04.320 | but nothing was turning into a production system.
00:08:07.300 | It was all a bunch of playing around,
00:08:10.280 | and I wanted to go and do something real again
00:08:12.580 | from start to finish.
00:08:13.420 | I wanted to take something from concept to production,
00:08:16.480 | and so I started looking outside,
00:08:19.200 | and that's when we met.
00:08:21.000 | - Well, that is when we met,
00:08:22.400 | but the thing is you had two ideas.
00:08:26.000 | One was more of let me build an image classifier,
00:08:29.000 | and you thought you could out-ResNet ResNet at the time,
00:08:31.300 | which was the best thing in town,
00:08:33.360 | and then you had this hardware path.
00:08:35.300 | - Well, actually, I had zero intention of building a chip.
00:08:38.960 | What happened was I had also built
00:08:42.960 | the highest performing image classifier,
00:08:47.960 | but I had noticed that all of the software
00:08:51.560 | was being given away for free.
00:08:53.340 | TensorFlow was being given away for free.
00:08:55.080 | The models were being given away for free.
00:08:56.440 | It was pretty clear that machine learning AI
00:08:59.880 | was gonna be open source, and it was gonna be--
00:09:01.560 | - Even back then. - Even back then.
00:09:02.960 | That was 2016, and so I just couldn't imagine
00:09:06.000 | building a business around that,
00:09:07.280 | and it would just be hard scrabble.
00:09:09.240 | Chips, it takes so long to build them
00:09:11.560 | that if you build something innovative and you launch it,
00:09:13.960 | it's gonna be four years before anyone can even copy it,
00:09:16.740 | let alone pull ahead of it,
00:09:18.560 | so that just felt like a much better approach,
00:09:21.040 | and it's atoms.
00:09:21.920 | You can monetize that more easily,
00:09:24.680 | so right around that time, the TPU paper came out.
00:09:28.840 | My name was in it.
00:09:30.160 | People started asking about it,
00:09:32.080 | and you asked me what I would do differently.
00:09:34.360 | - Well, I was investing in public markets
00:09:39.360 | as well at the time, a little dalliance
00:09:41.440 | in the public markets, and Sundar goes on
00:09:44.000 | in a press release and starts talking about TPU,
00:09:47.320 | and I was so shocked.
00:09:48.280 | I thought there is no conceivable world
00:09:50.360 | in which Google should be building their own hardware.
00:09:52.560 | They must know something that the rest of us don't know,
00:09:56.200 | and so we need to know that
00:09:58.520 | so that we can go and commercialize that
00:10:00.200 | for the rest of the world,
00:10:01.560 | and I probably met you a few weeks afterwards,
00:10:03.940 | and that was probably the fastest investment I'd ever made.
00:10:07.480 | I remember the key moment is you did not have a company,
00:10:12.300 | and so we had to incorporate the company
00:10:14.080 | after the check was written,
00:10:15.240 | which is always either a sign of complete stupidity
00:10:18.280 | or in 15 or 20 years, you'll look like a genius,
00:10:21.280 | but the odds of the latter are quite small.
00:10:24.120 | Okay, so you start the business.
00:10:25.740 | Tell us about the design decisions you were making
00:10:27.740 | in Groq at the time, knowing what you knew then,
00:10:30.280 | because at the time, it's very different
00:10:32.160 | from what it is now.
00:10:33.640 | - Well, again, when we started fundraising,
00:10:35.680 | we actually weren't even 100% sure
00:10:37.480 | that we were gonna do something in hardware,
00:10:39.320 | but it was something that I think you asked, Chamath,
00:10:41.580 | which is what would you do differently,
00:10:43.680 | and my answer was the software,
00:10:46.180 | because the big problem we had
00:10:47.680 | was we could build these chips in Google,
00:10:50.080 | but programming them, every single team at Google
00:10:53.280 | had a dedicated person who was hand-optimizing the models,
00:10:57.000 | and I'm like, this is absolutely crazy.
00:10:59.440 | Right around then, we had started hiring some people
00:11:02.080 | from NVIDIA, and they're like, no, no, no,
00:11:03.900 | you don't understand.
00:11:04.740 | This is just how it works.
00:11:05.560 | This is how we do it, too.
00:11:06.400 | We've got these things called kernels, CUDA kernels,
00:11:08.880 | and we hand-optimize them.
00:11:09.840 | We just make it look like we're not doing that,
00:11:12.300 | but the scale, like, all of you understand algorithms
00:11:15.600 | and big O complexity.
00:11:17.580 | That's linear complexity.
00:11:18.840 | For every application, you need an engineer.
00:11:21.160 | NVIDIA now has 50,000 people in their ecosystem.
00:11:24.040 | How does any, and these are like really low-level
00:11:26.360 | kernel-writing, assembly-writing hackers
00:11:28.200 | who understand GPUs and ML and everything.
00:11:30.100 | Not gonna scale, so we focused on the compiler
00:11:32.420 | for the first six months.
00:11:34.280 | We banned whiteboards at Groq
00:11:35.960 | because people kept trying to draw pictures of chips.
00:11:38.560 | Like, yeah.
00:11:39.520 | - So why is it that LLMs prefer Groq?
00:11:45.180 | Like, what was the design decision,
00:11:47.220 | or what happened in the design of LLMs?
00:11:49.840 | Some part of it is skill, obviously,
00:11:51.280 | but some part of it was a little bit of luck,
00:11:52.760 | but where, what exactly happened
00:11:54.840 | that makes you so much faster than NVIDIA
00:11:57.960 | and why there's all of these developers?
00:11:59.760 | What is the?
00:12:00.720 | - The crux of it, we didn't know
00:12:03.080 | that it was gonna be language,
00:12:04.600 | but the inspiration, the last thing that I worked on
00:12:07.880 | was getting the AlphaGo software,
00:12:11.800 | the Go playing software at DeepMind working on TPU,
00:12:16.000 | and having watched that, it was very clear
00:12:19.400 | that inference was going to be a scaled problem.
00:12:22.800 | Everyone else had been looking at inference
00:12:24.720 | as you take one chip, you run a model on it,
00:12:27.600 | it runs whatever.
00:12:28.960 | But what happened with AlphaGo
00:12:31.820 | was we ported the software over,
00:12:34.320 | and even though we had 170 GPUs versus 48 TPUs,
00:12:38.920 | the 48 TPUs won 99 out of 100 games
00:12:42.080 | with the exact same software.
00:12:44.580 | What that meant was compute was going to result
00:12:47.360 | in better performance.
00:12:49.240 | And so the insight was, let's build scaled inference.
00:12:53.880 | So we built in the interconnect, we built it for scale,
00:12:57.540 | and that's what we do now
00:12:58.560 | when we're running one of these models,
00:13:00.240 | we have hundreds or thousands of chips contributing
00:13:02.440 | just like we did with AlphaGo,
00:13:04.600 | but it's built for this as opposed to cobbled together.
00:13:07.840 | - I think this is a good jumping off point.
00:13:09.220 | A lot of people, and I think this company
00:13:11.840 | deserves a lot of respect,
00:13:12.760 | but NVIDIA has been toiling for decades,
00:13:16.080 | and they have clearly built an incredible business.
00:13:19.480 | But in some ways, when you get into the details,
00:13:21.760 | the business is slightly misunderstood.
00:13:24.420 | So can you break down, first of all,
00:13:26.680 | where is NVIDIA natively good,
00:13:29.720 | and where is it more trying to be good?
00:13:31.980 | - So natively good, the classic saying is
00:13:36.520 | you don't have to outrun the bear,
00:13:37.600 | you just have to outrun your friends.
00:13:39.240 | So NVIDIA outruns all of the other chip companies
00:13:41.720 | when it comes to software,
00:13:42.800 | but they're not a software-first company.
00:13:44.520 | They actually have a very expensive approach,
00:13:47.360 | as we discussed.
00:13:48.360 | But they have the ecosystem.
00:13:50.800 | It's a double-sided market.
00:13:52.040 | If you have a kernel-based approach, they've already won.
00:13:54.560 | There's no catching up.
00:13:56.200 | Hence why we have a kernel-free approach.
00:13:58.280 | But the other way that they're very good
00:14:00.480 | is vertical integration and forward integration.
00:14:04.400 | What happens is NVIDIA, over and over again,
00:14:07.320 | decides that they wanna move up the stack,
00:14:09.800 | and whatever their customers are doing,
00:14:11.160 | they start doing it.
00:14:12.220 | So for example, I think it was Gigabyte
00:14:14.120 | or one of these other PCI board manufacturers
00:14:17.160 | who recently announced,
00:14:18.240 | even though 80% of their revenue came from NVIDIA,
00:14:20.840 | NVIDIA boards that they were building,
00:14:23.520 | they're exiting that market because NVIDIA moved up
00:14:26.400 | and started doing a much lower-margin thing.
00:14:28.120 | And you just see that over and over again.
00:14:29.240 | They started building systems.
00:14:30.080 | - Yeah, I think the other thing is
00:14:31.440 | that NVIDIA's incredible at training.
00:14:33.800 | And I think the design decisions that they made,
00:14:36.600 | including things like HBM,
00:14:38.440 | were really oriented around a world back then,
00:14:41.460 | which was everything was about training.
00:14:43.080 | There weren't any real-world applications.
00:14:44.920 | None of you guys were really building anything in the wild
00:14:47.440 | where you needed super-fast inference.
00:14:49.160 | And I think that's another.
00:14:50.640 | - Absolutely, and what we saw over and over again
00:14:53.160 | was you would spend 100% of your compute on training.
00:14:56.800 | You would get something that would work well enough
00:14:58.400 | to go into production.
00:14:59.760 | And then it would flip to about 5% to 10% training
00:15:04.200 | and 90% to 95% inference.
00:15:07.080 | But the amount of training would stay the same.
00:15:09.120 | The inference would grow massively.
00:15:11.520 | And so every time we would have a success at Google,
00:15:14.840 | all of a sudden, we would have a disaster.
00:15:16.840 | We called it the success disaster
00:15:18.320 | where we can't afford to get enough compute for inference
00:15:22.280 | 'cause it goes 10-20x immediately, over and over.
00:15:26.520 | - And if you take that 10-20x and multiply it
00:15:28.480 | by the cost of NVIDIA's leading-class solutions,
00:15:31.320 | you're talking just an enormous amount of money.
00:15:33.000 | So just maybe explain to folks what HBM is
00:15:36.520 | and why these systems,
00:15:37.860 | like what NVIDIA just announced as B200,
00:15:39.920 | the complexity and the cost, actually,
00:15:41.800 | if you're trying to do something.
00:15:43.840 | - Yeah, the complexity spans every part of the stack,
00:15:46.600 | but there's a couple of components
00:15:48.120 | which are in very limited supply.
00:15:50.760 | And NVIDIA has locked up the market on these.
00:15:53.560 | One of them is HBM.
00:15:55.040 | HBM is this high-bandwidth memory
00:15:57.840 | which is required to get performance
00:16:00.720 | because the speed at which you can run these applications
00:16:04.640 | depends on how quickly you can read that memory.
00:16:06.520 | And this is the fastest memory.
00:16:08.720 | There is a finite supply.
00:16:10.000 | It is only for data centers.
00:16:11.840 | So they can't reach into the supply for mobile
00:16:14.560 | or other things like you can with other parts.
00:16:16.800 | But also interposers.
00:16:18.400 | Also, NVIDIA's the largest buyer of super caps in the world
00:16:22.320 | and all sorts of other components.
00:16:24.040 | - Cables.
00:16:24.880 | - Cables, the 400-gigabit cables,
00:16:27.520 | they've bought them all out.
00:16:28.920 | So if you want to compete,
00:16:30.280 | it doesn't matter how good of a product you design.
00:16:32.720 | They've bought out the entire supply chain.
00:16:35.460 | - For years.
00:16:36.300 | - For years.
00:16:38.080 | - So what do you do?
00:16:40.440 | - You don't use the same things they do.
00:16:42.640 | - Right.
00:16:43.480 | - And that's where we come in.
00:16:45.120 | - So how do you design a chip, then?
00:16:47.400 | If you look at the leading solution
00:16:49.240 | and they're using certain things
00:16:50.560 | and they're clearly being successful,
00:16:52.480 | how do you, is it just a technical bet
00:16:55.680 | to be totally orthogonal and different?
00:16:57.360 | Or was it something very specific
00:16:59.760 | where you said we cannot be reliant on the same supply chain
00:17:01.920 | 'cause we'll just get forced out of business at some point?
00:17:04.480 | - It was actually a really simple observation
00:17:06.840 | at the beginning,
00:17:07.800 | which is most chip architectures compete
00:17:11.720 | on small percentages difference in performance,
00:17:14.600 | like 15% is considered amazing.
00:17:18.000 | And what we realized was if we were 15% better,
00:17:21.360 | no one was gonna change
00:17:22.780 | to a radically different architecture.
00:17:24.760 | We needed to be five to 10x.
00:17:27.400 | Therefore, the small percentages that you get
00:17:30.780 | chasing the leading edge technologies was irrelevant.
00:17:34.600 | So we used an older technology, 14 nanometer,
00:17:37.560 | which is underutilized.
00:17:39.560 | We didn't use external memory.
00:17:41.480 | We used older interconnect
00:17:44.080 | because our architecture needed to provide the advantage
00:17:46.760 | and it needed to be so overwhelming
00:17:49.140 | that we didn't need to be at the leading edge.
00:17:51.720 | - So how do you measure sort of speed and value today?
00:17:54.840 | And just give us some comparisons
00:17:56.500 | for you versus some other folks running
00:17:59.740 | what these guys are probably using,
00:18:01.120 | Lama, Mistral, et cetera.
00:18:02.600 | - Yeah, so we run, we compare on two sides of this.
00:18:07.600 | One is the tokens per dollar
00:18:09.400 | and one is the tokens per second per user.
00:18:12.160 | So tokens per second per user is the experience.
00:18:14.480 | That's the differentiation.
00:18:16.040 | And tokens per dollar is the cost.
00:18:18.640 | And then also, of course, tokens per watt
00:18:20.640 | because power is very limited at the moment.
00:18:22.880 | - Right.
00:18:23.720 | - If you were to compare us to GPUs,
00:18:26.880 | we're typically five to 10x faster.
00:18:30.160 | Apples to apples, like without using speculative decode
00:18:33.600 | and other things.
00:18:34.900 | So right now, on a 180 billion parameter model,
00:18:39.360 | we run about 200 tokens per second,
00:18:42.640 | which I think is less than 50
00:18:44.880 | on the next generation GPU that's coming out.
00:18:47.920 | - From NVIDIA?
00:18:48.760 | - From NVIDIA.
00:18:49.960 | - So your current generation is 4x better than the B200?
00:18:54.240 | - Yeah, yeah.
00:18:55.240 | And then in total cost, we're about 1/10 the cost
00:18:59.560 | versus a modern GPU per token.
00:19:02.600 | I want that to sink in for a moment, 1/10 of the cost.
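
To make the two metrics Ross names concrete, here is a back-of-the-envelope sketch in Python. The only figures taken from the conversation are the roughly 200 versus under 50 tokens per second per user and the "about 1/10 the cost per token" claim; the hourly dollar costs are hypothetical placeholders, not real pricing from either vendor.

```python
def tokens_per_dollar(tokens_per_second: float, dollars_per_hour: float) -> float:
    """Throughput-side metric: how many tokens a dollar of serving capacity buys."""
    return tokens_per_second * 3600 / dollars_per_hour

# Speeds quoted above: ~200 tokens/s/user on Groq vs. under 50 on a next-gen GPU.
groq_tps, gpu_tps = 200, 50

# Hypothetical hourly system costs, chosen only so the result reflects the
# "about 1/10 the cost per token" claim; swap in real prices to redo the math.
groq_cost, gpu_cost = 2.0, 5.0

print(f"user experience: {groq_tps / gpu_tps:.0f}x more tokens/s per user")
print(f"cost per token: Groq at "
      f"{tokens_per_dollar(gpu_tps, gpu_cost) / tokens_per_dollar(groq_tps, groq_cost):.0%} "
      f"of the GPU cost")
```
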
00:19:08.280 | - Yeah, I mean, I think the value of that
00:19:09.680 | really comes down to the fact that
00:19:11.720 | you guys are gonna go and have ideas,
00:19:13.720 | and especially if you are part of the venture community
00:19:16.880 | and ecosystem and you raise money,
00:19:18.640 | folks like me who will give you money
00:19:21.160 | will expect you to be investing that wisely.
00:19:23.240 | Last decade, we went into a very negative cycle
00:19:26.280 | where almost 50 cents of every dollar
00:19:28.680 | we would give a startup would go right back
00:19:30.320 | into the hands of Google, Amazon, and Facebook.
00:19:33.160 | You're spending it on compute
00:19:34.360 | and you're spending it on ads.
00:19:36.120 | This time around, the power of AI should be
00:19:38.960 | that you can build companies for 1/10
00:19:40.760 | or 1/100 of the cost, but that won't be possible
00:19:43.960 | if you're, again, just shipping the money back out,
00:19:45.600 | except just now, in this case,
00:19:47.160 | to NVIDIA versus somebody else.
00:19:49.200 | So we will be pushing to make sure
00:19:51.720 | that this is kind of the low-cost alternative that happens.
00:19:56.160 | - So NVIDIA had a huge splashy announcement a few weeks ago.
00:19:59.840 | They showed charts, things going up and to the right.
00:20:02.040 | They showed huge dyes.
00:20:03.320 | They showed huge packaging.
00:20:05.140 | Tell us about the B200 and compare it
00:20:08.240 | to what you're doing right now.
00:20:09.960 | - Well, the first thing is the B200
00:20:12.120 | is a marvel of engineering.
00:20:13.720 | The level of complexity, the level of integration,
00:20:16.940 | the amount of different components in silicon.
00:20:19.960 | They spent $10 billion developing it,
00:20:22.860 | but when it was announced,
00:20:25.120 | I got some pings from NVIDIA engineers who said,
00:20:28.240 | we were a little embarrassed that they were claiming 30X
00:20:30.640 | 'cause it's nowhere near that.
00:20:31.680 | And we as engineers felt
00:20:33.040 | that that was hurting our credibility.
00:20:36.040 | The 30X claim was, let's put it into perspective.
00:20:39.960 | There was this one image that showed a claim
00:20:42.300 | of up to 50 tokens per second from the user experience
00:20:46.080 | and 140 throughput.
00:20:49.000 | That sort of gives you the value or the cost.
00:20:52.440 | If you were to compare that to the previous generation,
00:20:54.600 | that would be saying that the users,
00:20:56.200 | if you divide 50 by 30,
00:20:58.040 | are getting less than two tokens per second,
00:21:00.400 | which would be slow, right?
00:21:05.320 | There's nothing running that slow.
00:21:07.020 | And then also from a throughput perspective,
00:21:08.920 | that would make the cost so astronomical,
00:21:10.720 | it would be unbelievable.
00:21:11.560 | - I mean, how many of you guys use
00:21:13.120 | any of these chat agents right now?
00:21:15.120 | Just raise your hand if you use them.
00:21:17.320 | And how many of you keep your hands raised
00:21:19.280 | if you're satisfied with the speed of performance?
00:21:21.780 | You're satisfied.
00:21:24.400 | One hand or two?
00:21:26.160 | - There's like two or three, yeah.
00:21:27.840 | That's nice.
00:21:28.680 | My experience has been that these things,
00:21:34.160 | if you want to actually make hallucinations go to zero
00:21:37.120 | and the quality of these models really fine-tuned,
00:21:40.200 | you have to get back to kind of like
00:21:42.000 | a traditional web experience
00:21:43.420 | or a traditional mobile app experience
00:21:44.840 | where you have a window of probably 300 milliseconds
00:21:48.100 | to get an answer back.
00:21:49.360 | In the absence of that,
00:21:50.200 | the user experience doesn't scale and it kind of sucks.
00:21:53.360 | - How much effort did you spend at Meta and Facebook
00:21:56.440 | getting latency down?
00:21:57.600 | - I mean, look, at Facebook at one point,
00:21:58.960 | I had a team, I was so disgusted with the speed.
00:22:01.520 | So in a cold cache,
00:22:03.160 | we were approaching 1,000 milliseconds.
00:22:05.640 | And I was so disgusted that I took a small team
00:22:10.160 | off to the side, rebuilt the entire website
00:22:13.520 | and launched it in India for the Indian market
00:22:17.360 | just to prove that we could get it under 500 milliseconds.
00:22:22.800 | And it was a huge technical feat that the team did.
00:22:26.880 | It was also very poorly received
00:22:28.800 | by the mainline engineering team
00:22:30.360 | because it was somewhat embarrassing.
00:22:31.920 | But that's the level of intensity
00:22:34.280 | we had to approach this problem with.
00:22:36.520 | And it wasn't just us.
00:22:37.520 | Google realized it, everybody has realized it.
00:22:39.760 | There's an economic equation
00:22:41.120 | where if you deliver an experience to users
00:22:44.080 | under about 250 to 300 milliseconds, you maximize revenue.
00:22:48.720 | So if you actually want to be successful,
00:22:50.720 | that is the number you have to get to.
00:22:52.520 | So the idea that you can wait and fetch an answer
00:22:55.280 | in three to five seconds is completely ridiculous.
00:22:59.800 | It's a non-starter.
00:23:02.840 | - Here's the actual number.
00:23:03.920 | The number is every 100 milliseconds
00:23:06.240 | leads to 8% more engagement on desktop, 34% on mobile.
00:23:11.240 | We're talking about 100 milliseconds,
00:23:14.760 | which is 1/10 of a second.
00:23:16.920 | Right now, these things take 10 seconds.
00:23:19.280 | So think about how much less engagement
00:23:21.320 | we were getting today than you otherwise could.
00:23:24.440 | - So why don't you now break down this difference?
00:23:26.440 | Because I think this is now a good place
00:23:28.760 | to make sure people leave really understanding.
00:23:31.360 | There's an enormous difference
00:23:32.600 | between training and inference and what is required.
00:23:37.600 | And why don't you define the differences
00:23:40.080 | so that then we can contrast where things are gonna go?
00:23:43.280 | - The biggest is when you're training,
00:23:45.440 | the number of tokens that you're training on
00:23:48.400 | is measured in months.
00:23:50.120 | Like how many tokens can we train on this month?
00:23:53.040 | It doesn't matter if it takes a second, 10 seconds,
00:23:56.600 | 100 seconds in a batch, how many per month.
00:24:00.000 | In inference, what matters is how many tokens
00:24:03.080 | you can generate per millisecond or a couple of milliseconds.
00:24:07.840 | It's not in seconds, it's not in months.
00:24:10.520 | - Is it fair to say then
00:24:12.280 | that NVIDIA is the exemplar in training?
00:24:15.800 | - Yes.
00:24:16.640 | - And then is it fair to say that there really isn't yet
00:24:20.520 | the equivalent scaled winner in inference?
00:24:25.560 | - Not yet.
00:24:26.560 | - And do you think that it will be NVIDIA?
00:24:29.200 | - I don't think it'll be NVIDIA.
00:24:31.520 | - But specifically, why do you think it won't work
00:24:35.400 | for that market, even though it's clearly working in training?
00:24:38.760 | - In order to get the latency down, what we had to do,
00:24:42.840 | we had to design a completely new chip architecture.
00:24:45.840 | We had to design a completely new networking architecture,
00:24:48.720 | an entirely new system, an entirely new runtime,
00:24:51.160 | an entirely new compiler,
00:24:52.320 | an entirely new orchestration layer.
00:24:55.040 | We had to throw everything away.
00:24:57.180 | And it had to be compatible with PyTorch
00:25:01.240 | and what other people are actually developing with.
00:25:03.320 | Now we're talking about innovators dilemma on steroids.
00:25:06.400 | It's hard enough to give up one of those,
00:25:08.200 | which if you were to do one of those successfully
00:25:10.120 | would be a very valuable company.
00:25:12.720 | But to throw all six of those away is nearly impossible.
00:25:16.520 | And also you have to maintain what you have
00:25:18.520 | if you wanna keep training.
00:25:20.520 | And so now you have to have a completely different
00:25:22.520 | architecture for networking, for training versus inference,
00:25:26.640 | for your chip, for networking, for everything.
00:25:30.320 | - So let's say that the market today
00:25:32.120 | is 100 units of training or 95 units of training,
00:25:35.760 | five units of inference.
00:25:37.000 | I should say that's roughly where most of the revenue
00:25:39.840 | and the dollars are being made.
00:25:42.360 | What does it look like in four or five years from now?
00:25:45.360 | - Well, actually NVIDIA's latest earnings, 40% inference.
00:25:48.600 | It's already starting to climb.
00:25:50.800 | Where it's gonna end is it will end
00:25:52.720 | at somewhere between 90 to 95% or 90 to 95 units inference.
00:25:57.720 | And so that trajectory is gonna take off rapidly
00:26:00.600 | now that we have these open source models
00:26:02.480 | that everyone is giving away
00:26:04.280 | and you can download a model and run it.
00:26:06.680 | You don't need to train it.
00:26:07.840 | - Yeah, one of the things about these open source models
00:26:09.840 | is building useful applications,
00:26:14.040 | you have to either understand or be able to work with CUDA.
00:26:18.560 | With you, it doesn't even matter because you can just port.
00:26:20.680 | So maybe explain to folks the importance
00:26:23.400 | in the inference market of being able to rip
00:26:25.000 | and replace these models
00:26:26.120 | and where you think these models are going.
00:26:28.280 | - So for the inference market, every two weeks or so,
00:26:33.280 | there is a completely new model that has to be run.
00:26:38.800 | It's important, it matters.
00:26:40.720 | Either it's setting the best quality bar across the board
00:26:45.720 | or it's good at a particular task.
00:26:48.920 | If you are writing kernels,
00:26:50.960 | it's almost impossible to keep up.
00:26:52.480 | In fact, when Llama 2 70 billion was launched,
00:26:55.640 | it officially had support for AMD.
00:26:58.520 | However, the first support we actually saw implemented
00:27:03.080 | was after about a week and we had it in I think two days.
00:27:08.080 | And so that speed,
00:27:09.400 | now everyone develops for NVIDIA hardware.
00:27:11.760 | So by default, anything launched will work there.
00:27:14.840 | But if you want anything else to work,
00:27:17.720 | you can't be writing these kernels by hand.
00:27:19.720 | And remember, AMD had official support
00:27:21.480 | and it still took about a week.
00:27:22.680 | - Right.
00:27:23.680 | So if you're starting a company today,
00:27:26.960 | you clearly wanna have the ability to swap from Llama
00:27:32.360 | to Mistral to Anthropic and back as often as possible.
00:27:36.080 | - Whatever's latest.
00:27:37.240 | - And just as somebody who sees these models run,
00:27:40.920 | do you have any comment on the quality of these models
00:27:42.880 | and where you think some of these companies are going
00:27:44.800 | or what you see some doing well versus others?
00:27:47.840 | - So they're all starting to catch up with each other.
00:27:51.000 | You're starting to see some leapfrogging.
00:27:52.600 | It started off with GPT-4 pulling ahead
00:27:55.840 | and it had a lead for about a year over everyone else.
00:27:58.600 | And now of course Anthropic has caught up.
00:28:00.840 | We're seeing some great stuff from Mistral.
00:28:03.360 | But across the board,
00:28:05.200 | they're all starting to bunch up in quality.
00:28:08.240 | And so one of the interesting things,
00:28:09.840 | Mistral in particular,
00:28:11.600 | has been able to get closer to quality
00:28:14.120 | with smaller, less expensive models to run,
00:28:16.120 | which I think gives them a huge advantage.
00:28:18.920 | I think Cohere has an interesting take
00:28:21.680 | on a sort of RAG-optimized model.
00:28:24.600 | So people are finding niches.
00:28:26.560 | And there's gonna be a couple
00:28:28.440 | that are gonna be the best across the board
00:28:30.960 | at the highest end.
00:28:33.000 | But what we're seeing is a lot of complaints
00:28:34.720 | about the cost to run these models.
00:28:36.920 | They're just astronomical.
00:28:38.680 | And you're not gonna be able to scale up applications
00:28:42.360 | for users with them.
00:28:43.480 | - OpenAI has published or has disclosed,
00:28:48.480 | as has Meta, as has Tesla and a couple of others,
00:28:51.640 | just the total quantum of GPU capacity that they're buying.
00:28:55.640 | And you can kind of work backwards
00:28:56.920 | to figure out how big the inference market can be,
00:29:00.400 | because it's really only supported by them
00:29:02.760 | as you guys scale up.
00:29:03.960 | Can you give people a sense of the scale
00:29:06.360 | of what folks are fighting for?
00:29:08.320 | - So I think Facebook announced
00:29:12.480 | that by the end of this year,
00:29:14.160 | they're gonna have the equivalent of 650,000 H100s.
00:29:18.280 | By the end of this year,
00:29:21.760 | Groq will have deployed 100,000 of our LPUs,
00:29:25.000 | which do outperform the H100s on a throughput
00:29:28.840 | and on a latency basis.
00:29:31.000 | So we will probably get pretty close
00:29:33.200 | to the equivalent of Meta ourselves.
00:29:36.840 | By the end of next year,
00:29:37.840 | we're going to deploy 1.5 million LPUs.
00:29:41.320 | For comparison, last year,
00:29:43.600 | NVIDIA deployed a total of 500,000 H100s.
00:29:47.520 | So 1.5 million means that Groq will probably have
00:29:51.800 | more inference generative AI capacity
00:29:55.440 | than all of the hyperscalers
00:29:57.640 | and cloud service providers combined.
00:30:00.680 | So probably about 50% of the inference compute in the world.
00:30:03.960 | - That's just great.
00:30:07.600 | Tell us about team building in Silicon Valley.
00:30:12.280 | How hard is it to get folks that are real AI folks
00:30:15.440 | in the backdrop of you could go work at Tesla,
00:30:17.640 | you could go work at Google, open AI,
00:30:19.560 | all these people we are hearing,
00:30:21.200 | multimillion dollar pay packages
00:30:22.800 | that rival playing professional sports.
00:30:25.280 | Like what is going on in finding the people?
00:30:28.360 | Now, by the way, you have this interesting thing
00:30:30.240 | 'cause your initial chip,
00:30:31.800 | you were trying to find folks that knew Haskell.
00:30:34.160 | So just tell us, how hard is it to build a team
00:30:37.320 | in the Valley to do this?
00:30:39.360 | - Impossible.
00:30:41.120 | So if you want to know how to do it,
00:30:43.120 | you have to start getting creative,
00:30:44.440 | just like anything you want to do well.
00:30:46.040 | Don't just compete directly.
00:30:47.800 | But yeah, these pay packages are astronomical
00:30:50.760 | because everyone views this as a winner take all market.
00:30:54.920 | I mean, just that's it.
00:30:56.080 | It's not about, am I going to be number two?
00:30:58.360 | Am I going to be number three?
00:30:59.360 | They're all going, I got to be number one.
00:31:01.400 | So if you don't have the best talent, you're out.
00:31:03.120 | Now, here's the mistake.
00:31:04.800 | A lot of these AI researchers are amazing at AI,
00:31:09.760 | but they're still kind of green.
00:31:11.760 | They're new, they're young, right?
00:31:13.720 | This is a new field.
00:31:15.240 | And what I always recommend to people
00:31:17.240 | is go hire the best, most grizzled engineers
00:31:22.080 | who know how to ship stuff and on time
00:31:26.080 | and let them learn AI
00:31:28.320 | because they will be able to do that faster
00:31:30.800 | than you will be able to take the AI researchers
00:31:32.920 | and give them the 20 years of experience
00:31:34.760 | of deploying production code.
00:31:36.480 | - You were on stage in Saudi Arabia with Saudi Aramco
00:31:41.880 | a month ago and announced some big deal.
00:31:45.400 | Can you just, like, what is going on with deals like that?
00:31:48.920 | Like, where is that market going?
00:31:51.560 | Is that you competing with Amazon and Google and Microsoft?
00:31:56.000 | Is that what that is?
00:31:57.320 | - It's not competing.
00:31:58.360 | It's actually complementary.
00:32:01.080 | The announcement was that we are going
00:32:03.680 | to be doing a deal together with Aramco Digital.
00:32:06.320 | And we haven't announced how large exactly,
00:32:10.080 | but it will be large in terms of the amount of compute
00:32:14.240 | that we're gonna deploy.
00:32:15.640 | And in total, we've done deals that get us to past 10%
00:32:19.920 | of that 1.5 million LPU goal.
00:32:22.440 | And of course, the hard part is the first deals.
00:32:26.280 | So once we announced that,
00:32:27.800 | a lot of other deals are now coming through.
00:32:30.000 | But the, yeah, go ahead.
00:32:33.120 | - So, no, no, I was just, I was just seconding him.
00:32:36.000 | - So the scale of these deals is that these are larger
00:32:41.000 | than the amount of compute that Meta has, right?
00:32:46.640 | And a lot of these tech companies right now,
00:32:50.040 | they think that they have such an advantage
00:32:51.680 | 'cause they've locked up the supply.
00:32:54.000 | They don't want it to be true
00:32:55.640 | that there is another alternative out there.
00:32:58.560 | And so we're actually doing deals with folks
00:33:00.960 | where they're gonna have more compute than a hyperscaler.
00:33:04.280 | - Right, that's a crazy idea.
00:33:07.080 | Last question, everybody's worried about what AI means.
00:33:12.080 | You've been in it for a very long time.
00:33:17.040 | Just end with your perspectives on what we should be thinking
00:33:22.240 | and what your perspectives are on the future of AI,
00:33:24.400 | our future jobs, all of this typical stuff
00:33:26.480 | that people worry about.
00:33:27.680 | - So I get asked a lot, should we be afraid of AI?
00:33:31.640 | And my answer to that is, if you think back to Galileo,
00:33:36.640 | someone who got in a lot of trouble,
00:33:38.520 | the reason he got in trouble was he invented the telescope,
00:33:41.440 | popularized it, and made some claims
00:33:44.920 | that we were much smaller than everyone wanted to believe.
00:33:48.400 | We were supposed to be the center of the universe
00:33:50.080 | and it turns out we weren't.
00:33:52.440 | And the better the telescope got,
00:33:53.920 | the more obvious it became that we were small.
00:33:56.360 | And in a large sense, large language models
00:34:00.000 | are the telescope for the mind.
00:34:02.920 | It's become clear that intelligence is larger than we are.
00:34:07.920 | And it makes us feel really, really small.
00:34:12.520 | And it's scary.
00:34:14.240 | But what happened over time was as we realized
00:34:16.720 | the universe was larger than we thought,
00:34:18.640 | and we got used to that, we started to realize
00:34:20.640 | how beautiful it was, and our place in the universe.
00:34:24.840 | And I think that's what's gonna happen.
00:34:26.240 | We're gonna realize intelligence is more vast
00:34:29.160 | than we ever imagined,
00:34:30.840 | and we're gonna understand our place in it.
00:34:32.800 | And we're not gonna be afraid of it.
00:34:34.680 | - That's a beautiful way to end.
00:34:36.880 | Jonathan Ross, everybody.
00:34:38.200 | Thanks, guys.
00:34:39.040 | (audience applauding)
00:34:48.240 | - Thank you very, very much.
00:34:49.960 | I was told grok means to understand deeply with empathy.
00:34:52.720 | That was embodying this definition.