
2024 Year in Review: The Big Scaling Debate, the Four Wars of AI, Top Themes and the Rise of Agents


Chapters

0:00 Welcome to the 100th Episode!
0:19 Reflecting on the Journey
0:47 AI Engineering: The Rise and Impact
3:15 Latent Space Live and AI Conferences
9:44 The Competitive AI Landscape
21:45 Synthetic Data and Future Trends
35:53 Creative Writing with AI
36:12 Legal and Ethical Issues in AI
38:18 The Data War: GPU Poor vs. GPU Rich
39:12 The Rise of GPU Ultra Rich
40:47 Emerging Trends in AI Models
45:31 The Multi-Modality War
65:31 The Future of AI Benchmarks
73:17 Pionote and Frontier Models
73:47 Niche Models and Base Models
74:30 State Space Models and RWKV
75:48 Inference Race and Price Wars
82:16 Major AI Themes of the Year
82:48 AI Rewind: January to March
86:42 AI Rewind: April to June
93:12 AI Rewind: July to September
94:59 AI Rewind: October to December
99:53 Year-End Reflections and Predictions

Whisper Transcript

00:00:00.000 | (upbeat music)
00:00:02.580 | - Hey everyone, welcome to the Latent Space Podcast.
00:00:06.300 | This is Alessio, partner and CTO at Decibel Partners,
00:00:08.980 | and I'm joined by my co-host Swyx
00:00:10.720 | for the 100th time today.
00:00:12.600 | - Yay, and we're so glad that everyone has followed us
00:00:17.080 | in this journey.
00:00:17.920 | How do you feel about it, 100 episodes?
00:00:19.160 | - Yeah, almost two years that we've been doing this.
00:00:22.320 | We've had four different studios.
00:00:24.780 | We've had a lot of changes.
00:00:26.620 | You know, we used to do this lightning round
00:00:28.440 | when we first started that we didn't like,
00:00:30.880 | and we tried to change the questions.
00:00:32.320 | - Because every answer was Cursor and Perplexity.
00:00:34.200 | - Yeah, exactly.
00:00:35.020 | I love Midjourney.
00:00:35.860 | It's like, do you really not like anything else?
00:00:38.320 | Like, what's the unique thing?
00:00:40.440 | And I think, yeah,
00:00:41.600 | we've also had a lot more research-driven content.
00:00:44.520 | You know, we had like Tri Dao, we had Jeremy Howard,
00:00:46.800 | we had more folks like that.
00:00:48.000 | I think we want to do more of that too in the new year,
00:00:50.080 | like having some of the Gemini folks,
00:00:52.640 | both on the research and the applied side.
00:00:54.640 | Yeah, but it's been a ton of fun.
00:00:56.360 | I think we both started,
00:00:57.600 | I wouldn't say as a joke,
00:00:58.640 | we were kind of like, oh, we should do a podcast.
00:01:00.680 | And I think we kind of caught the right wave, obviously.
00:01:04.200 | And I think your rise of the AI engineer posts
00:01:07.400 | just kind of give people somewhere to congregate
00:01:09.960 | and then the AI engineer summit.
00:01:11.440 | And that's why when I look at our growth chart,
00:01:13.720 | it's kind of like a proxy
00:01:14.680 | for like the AI engineering industry as a whole,
00:01:18.320 | which is almost like, like,
00:01:19.320 | even if we don't do that much, we keep growing
00:01:21.560 | just because there's so many more AI engineers.
00:01:23.800 | Did you expect that growth
00:01:25.440 | or did you expect it would take longer
00:01:27.040 | for like the AI engineer thing to kind of like become,
00:01:29.880 | you know, everybody talks about it today.
00:01:31.520 | - Yeah, my sign of that that we have won
00:01:34.000 | is that Gartner puts it at the top of the hype curve
00:01:36.920 | right now.
00:01:37.760 | So Gartner has called the peak in AI engineering.
00:01:40.320 | I did not expect to what level,
00:01:42.280 | I knew that I was correct when I called it
00:01:44.120 | because I did like two months of work going into that.
00:01:47.040 | But I didn't know how quickly it could happen.
00:01:49.440 | And obviously there's a chance that I could be wrong.
00:01:52.320 | But I think like most people have come around
00:01:53.720 | to that concept.
00:01:54.640 | Hacker News hates it, which is a good sign,
00:01:56.720 | but there's enough people that have defined it,
00:01:58.840 | you know, GitHub, when they launched GitHub Models,
00:02:00.960 | which is their Hugging Face clone,
00:02:02.280 | they put AI engineers in the banner and like above the fold,
00:02:05.560 | like in big letters.
00:02:07.680 | So I think it's like kind of arrived
00:02:09.440 | as a meaningful and useful definition.
00:02:12.200 | I think people are trying to figure out
00:02:13.920 | where the boundaries are.
00:02:15.040 | I think that was a lot of the quote unquote drama
00:02:17.880 | that happens behind the scenes at the World's Fair in June,
00:02:20.820 | because I think there's a lot of doubt or questions
00:02:24.440 | about where ML engineering stops and AI engineering starts.
00:02:28.200 | That's a useful debate to be had.
00:02:29.840 | In some sense, I actually anticipated that as well.
00:02:32.360 | So I intentionally did not put a firm definition there
00:02:36.480 | because most of the successful definitions
00:02:38.560 | are necessarily underspecified
00:02:40.520 | and it's actually useful to have different perspectives.
00:02:42.360 | And then you don't have to specify everything
00:02:44.520 | from the outset.
00:02:45.360 | - Yeah, I was at AWS reInvent and the line
00:02:48.500 | to get into like the AI engineering talk, so to speak,
00:02:51.800 | which is, you know, applied AI and whatnot was like,
00:02:54.120 | there are like hundreds of people just in line to go in.
00:02:56.960 | I think that's kind of what enabled people, right?
00:02:59.400 | Which is what you kind of talked about is like,
00:03:01.040 | hey, look, you don't actually need a PhD,
00:03:02.760 | just use the model.
00:03:04.560 | And then maybe we'll talk about some of the blind spots
00:03:07.000 | that you get as an engineer with the earlier posts
00:03:10.240 | that we also had on the sub stack.
00:03:11.820 | But yeah, it's been a heck of a two years.
00:03:14.800 | - Yeah, you know, I was trying to view the conference
00:03:17.720 | as like NeurIPS is a thing like 16, 17,000 people
00:03:21.520 | and the latent space live event that we held there
00:03:24.440 | was 950 signups.
00:03:26.200 | I think the AI world, the ML world
00:03:28.600 | is still very much research heavy.
00:03:30.560 | And that's as it should be because ML is very much
00:03:33.580 | in a research phase.
00:03:34.420 | But as we move this entire field into production,
00:03:37.760 | I think that ratio inverts
00:03:39.440 | into becoming more engineering heavy.
00:03:41.440 | So at least I think engineering should be on the same level,
00:03:45.280 | even if it's never as prestigious,
00:03:47.120 | like it'll always be low status because at the end of the day
00:03:49.600 | you're manipulating APIs or whatever, but wrapping GPTs,
00:03:54.240 | but there's gonna be an increasing stack and an art
00:03:56.960 | to doing these things well.
00:03:58.800 | And I think that's what we're focusing on for the podcast,
00:04:02.240 | the conference and basically everything I do
00:04:04.720 | seems to make sense.
00:04:06.240 | And I think we'll talk about the trends here that apply.
00:04:09.160 | It's this very strange mix of like keeping on top of research
00:04:12.960 | while not being a researcher
00:04:14.720 | and then putting that research into production.
00:04:16.800 | So like people always ask me like,
00:04:18.520 | why are you covering NeurIPS?
00:04:20.200 | Like, this is a ML research conference.
00:04:22.040 | And I'm like, well, yeah, I mean,
00:04:23.520 | we're not going to like understand everything
00:04:26.240 | or reproduce every single paper,
00:04:28.160 | but the stuff that is being found here
00:04:30.200 | is going to make its way into production at some point,
00:04:31.960 | you hope.
00:04:32.800 | And then actually, like when I talk to the researchers,
00:04:34.480 | they actually get very excited because they're like,
00:04:35.920 | oh, you guys are actually caring
00:04:37.200 | about how this goes into production.
00:04:38.640 | And that's what they really, really want.
00:04:40.840 | The measure of success is previously just peer review,
00:04:43.120 | right, like getting sevens and eights
00:04:45.120 | on their academic review conferences and stuff.
00:04:48.040 | Like citations is one metric, but money is a better metric.
00:04:51.200 | - Right.
00:04:52.040 | Yeah, and there were about 2,200 people on the live stream
00:04:57.240 | or something like that.
00:04:58.080 | - Yeah, 2,200 on the live stream.
00:04:59.520 | - So I tried my best to moderate,
00:05:01.120 | but it was a lot spicier in person
00:05:03.200 | with Jonathan and Dylan
00:05:05.280 | than it was in the chat on YouTube.
00:05:06.840 | - I would say that I actually also created
00:05:09.840 | Latent Space Live in order to address flaws
00:05:12.080 | that are perceived in academic conferences.
00:05:13.920 | It's not NeurIPS specific,
00:05:14.840 | it's ICML, it's ICLR, it's NeurIPS.
00:05:17.000 | Basically, it's very sort of oriented
00:05:19.280 | towards the sort of PhD students market, job market, right?
00:05:23.400 | Like literally, basically everyone's there
00:05:25.240 | to advertise their research and skills and get jobs.
00:05:28.840 | And then obviously all the companies go there to hire them.
00:05:31.760 | And I think that's great for the individual researchers,
00:05:34.040 | but for people going there to get info is not great
00:05:36.600 | because you have to read between the lines,
00:05:38.840 | bring a ton of context
00:05:40.280 | in order to understand every single paper.
00:05:42.120 | So what is missing is effectively what I ended up doing,
00:05:45.280 | which is domain by domain,
00:05:46.680 | go through and recap the best of the year,
00:05:48.440 | survey the field.
00:05:49.520 | And there are, like NeurIPS had a,
00:05:51.920 | I think ICML had a like a position paper track,
00:05:54.080 | NeurIPS added a benchmarks and datasets track.
00:05:57.280 | These are ways in which to address that issue.
00:06:00.520 | And there's always workshops as well.
00:06:01.560 | Every conference has a last day of workshops and stuff
00:06:04.720 | that provide more of an overview,
00:06:06.520 | but they're not specifically prompted to do so.
00:06:08.840 | And I think really organizing a conference
00:06:11.560 | is just about getting good speakers
00:06:12.960 | and giving them the correct prompts.
00:06:14.720 | And then they will just go and do that thing
00:06:16.600 | and they do a very good job of it.
00:06:17.760 | So I think Sarah did a fantastic job
00:06:19.880 | with the startups prompt.
00:06:21.560 | I can't list everybody,
00:06:22.480 | but we did best of 2024 in startups, vision, open models,
00:06:26.440 | post-transformers, synthetic data, small models, and agents.
00:06:30.520 | And then the last one was the,
00:06:32.240 | and then we also did a quick one on reasoning
00:06:34.120 | with Nathan Lambert.
00:06:35.200 | And then the last one, obviously,
00:06:36.160 | was the debate that people were very hyped about.
00:06:39.760 | It was very awkward.
00:06:40.880 | And I'm really thankful for John Frankel, basically,
00:06:43.400 | who stepped up to challenge Dylan,
00:06:45.440 | 'cause Dylan was like, "Yeah, I'll do it."
00:06:47.000 | But he was pro-scaling.
00:06:49.360 | And I think everyone who is in AI is pro-scaling.
00:06:52.920 | So you need somebody who's ready to publicly say,
00:06:55.880 | "No, we've hit a wall."
00:06:57.160 | So that means you're saying Sam Altman's wrong,
00:06:59.840 | you're saying everyone else is wrong.
00:07:02.240 | It helps that this was the day before Ilya went on,
00:07:05.800 | went up on stage and then said,
00:07:06.920 | "Pre-training has hit a wall, data has hit a wall."
00:07:09.560 | So actually, Jonathan ended up winning,
00:07:11.760 | and then Ilya supported that statement.
00:07:13.920 | And then Noam Brown, on the last day,
00:07:15.560 | further supported that statement as well.
00:07:17.000 | So it's kind of interesting that I think
00:07:18.880 | the consensus kind of going in
00:07:20.240 | was that we're not done scaling,
00:07:22.120 | like you should believe in a better lesson.
00:07:24.600 | And then four straight days in a row,
00:07:26.240 | you had Sepp Hochreiter, who is the creator of the LSTM,
00:07:29.160 | along with everyone's favorite OG in AI,
00:07:31.880 | which is Jurgen Schmidhuber.
00:07:34.640 | He said that pre-training has hit a wall,
00:07:37.160 | or we've run into a different kind of wall.
00:07:40.000 | And then we have, you know,
00:07:41.120 | John Frankel, Ilya, and then Noam Brown,
00:07:43.920 | all saying variations of the same thing,
00:07:45.640 | that we have hit some kind of wall in the status quo
00:07:48.440 | of what pre-trained, scaling large pre-trained models
00:07:51.960 | has looked like, and we need a new thing.
00:07:54.120 | And obviously the new thing for people is,
00:07:57.320 | some people are calling it inference time compute
00:07:59.160 | and others test time compute.
00:08:00.280 | I think the collective terminology has been inference time.
00:08:04.200 | And I think that makes sense because calling it
00:08:06.080 | "test time" has a very pre-training bias,
00:08:08.480 | implying that the only reason for running inference at all
00:08:10.280 | is to test your model.
00:08:11.640 | That is not true.
00:08:12.480 | - Right.
00:08:13.880 | - So I quite agree that OpenAI seems to have adopted,
00:08:17.080 | or the community seems to have adopted this terminology
00:08:19.680 | of ITC instead of TTC.
00:08:21.720 | And that makes a lot of sense
00:08:23.160 | because like now we care about inference,
00:08:24.640 | even right down to compute optimality.
00:08:26.840 | Like I actually interviewed this author
00:08:28.560 | who revisited the Chinchilla paper.
00:08:31.840 | The Chinchilla paper is about compute-optimal training,
00:08:33.960 | but what is not stated in there
00:08:35.720 | is that it's pre-training compute-optimal training.
00:08:38.360 | And once you start caring about
00:08:40.680 | inference-compute-optimal training,
00:08:41.840 | you have a different scaling law,
00:08:42.840 | in a way that we did not know last year.
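As a rough illustration of that point, here is a small sketch (not from the episode) of how an inference budget changes the Chinchilla trade-off. The loss constants are the published Chinchilla fit (Hoffmann et al., 2022); the ~6ND training FLOPs and ~2N FLOPs per generated token are the usual approximations, and the budget numbers are made up purely for illustration.

```python
# A rough sketch, not from the episode: how reserving FLOPs for inference
# changes the Chinchilla trade-off. Loss constants are the published fit
# (Hoffmann et al., 2022); budget numbers are purely illustrative.

def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Parametric loss fit: N = parameters, D = training tokens."""
    return E + A / N**alpha + B / D**beta

def train_tokens_left(N, flop_budget, inference_tokens):
    """Tokens you can still train on after reserving ~2*N FLOPs per served token."""
    return (flop_budget - 2 * N * inference_tokens) / (6 * N)

# Fixed total budget, expecting to serve ~2T tokens after training.
budget, D_inf = 1e24, 2e12
for N in (7e9, 70e9):  # 7B vs 70B parameters
    D_train = train_tokens_left(N, budget, D_inf)
    print(f"{N/1e9:.0f}B -> {D_train:.2e} train tokens, "
          f"loss ~ {chinchilla_loss(N, D_train):.2f}")
```

The point is simply that once serving tokens enter the budget, the "optimal" parameter count is no longer the pure pre-training Chinchilla answer.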
00:08:45.000 | - I wonder, because John is,
00:08:46.560 | he's also on the side of attention is all you need.
00:08:49.040 | Like he had the bet with Sasha.
00:08:50.360 | So I'm curious, like he doesn't believe in scaling,
00:08:52.520 | but he thinks the transformer.
00:08:53.920 | I wonder if he's still of that camp.
00:08:56.240 | - So he, obviously everything is nuanced and you know,
00:08:58.480 | I told him to play a character for this debate, right?
00:09:01.160 | So he actually does.
00:09:02.520 | Yeah, he still believes that we can scale more.
00:09:04.960 | He just assumed the character to be very game
00:09:07.800 | for playing this debate.
00:09:09.280 | So even more kudos to him that he assumed a position
00:09:12.280 | that he didn't believe in and still won the debate.
00:09:14.800 | - Get wrecked, Dylan.
00:09:17.040 | Do you just want to quickly run through some of these things
00:09:21.080 | like Sarah's presentation, just the highlights?
00:09:24.600 | - Yeah, we can't go through everyone's slides,
00:09:26.240 | but I pulled out some things as a factor of like stuff
00:09:28.600 | that we were going to talk about.
00:09:29.520 | - Yeah, and we'll publish the rest.
00:09:31.960 | - Yeah, we'll publish on this feed,
00:09:33.160 | the best of 2024 in those domains.
00:09:35.840 | And hopefully people can benefit from the work
00:09:37.840 | that our speakers have done.
00:09:39.400 | But I think it's, these are just good slides.
00:09:41.640 | And I've been looking for sort of end of year recaps
00:09:43.680 | from people.
00:09:44.760 | The field has progressed a lot.
00:09:45.920 | You know, I think the max ELO in 2023 on LMSYS
00:09:49.480 | used to be 1200.
00:09:52.320 | And now everyone is at least at 1275 in their ELOs.
00:09:57.320 | And this is across Gemini, ChatGPT, Grok, 01.AI
00:10:01.640 | with their Yi-Large model, and Anthropic, of course.
00:10:04.760 | It's a very, very competitive race.
00:10:06.000 | There are multiple frontier labs all racing,
00:10:08.280 | but there is a clear tier zero frontier.
00:10:11.200 | And then there's like a tier one,
00:10:13.240 | it's like everything else.
00:10:14.480 | And tier zero is extremely competitive.
00:10:15.840 | It's effectively now a three-horse race
00:10:18.360 | between Gemini, Anthropic, and OpenAI.
00:10:21.840 | I would say that people are still holding out a candle
00:10:24.480 | for XAI.
00:10:25.400 | XAI, I think for some reason,
00:10:27.440 | because their API was very slow to roll out,
00:10:30.360 | it's not included in these like metrics.
00:10:33.960 | So it's actually quite hard to put on there.
00:10:35.840 | Like as someone who also does charts,
00:10:37.760 | XAI is continually snubbed
00:10:39.200 | because they don't work well with the benchmarking people.
00:10:42.640 | So it's a little trivia for why XAI always gets ignored.
00:10:46.240 | The other thing is market share.
00:10:47.720 | So these are slides from Sarah
00:10:49.000 | and we have it up on the screen.
00:10:50.640 | It has gone from very heavily OpenAI.
00:10:54.520 | So we have some numbers and estimates.
00:10:56.240 | These are from RAMP,
00:10:57.440 | estimates of OpenAI market share in December, 2023.
00:11:01.320 | And this is basically, what is it?
00:11:03.560 | GPT-3.5 and GPT-4 being 95% of production traffic.
00:11:08.560 | And I think if you correlate that with stuff
00:11:10.880 | that we asked Harrison Chase on the LangChain episode,
00:11:13.640 | it was true.
00:11:14.560 | And then Claude 3 launched middle of this year.
00:11:19.560 | I think Claude 3 launched in March,
00:11:21.120 | Claude 3.5 Sonnet was in June-ish.
00:11:23.920 | And you can start seeing the market share shift
00:11:25.760 | towards Anthropic very, very aggressively.
00:11:29.120 | And the more recent one is Gemini.
00:11:31.040 | So if I scroll down a little bit,
00:11:32.400 | this is an even more recent dataset.
00:11:33.760 | So RAMP's dataset ends in September, 2024.
00:11:37.600 | Gemini has basically launched a price war at the low end
00:11:40.800 | with Gemini Flash being basically free for personal use.
00:11:45.080 | I think people don't understand the free tier.
00:11:46.680 | It's something like a billion tokens per day.
00:11:48.640 | Unless you're trying to abuse it,
00:11:49.880 | you cannot really exhaust your free tier on Gemini.
00:11:52.520 | They're really trying to get you to use it.
00:11:53.960 | They know they're in like third place,
00:11:56.600 | fourth place, depending how you count.
00:11:58.640 | And so they're going after the lower tier first
00:12:02.040 | and then maybe the upper tier later.
00:12:04.200 | But yeah, Gemini Flash, according to OpenRouter,
00:12:06.640 | is now 50% of their OpenRouter requests.
00:12:10.320 | Obviously, these are the small requests.
00:12:11.520 | These are small, cheap requests
00:12:12.480 | that are mathematically going to be more.
00:12:15.200 | The smart ones obviously are still going to OpenAI.
00:12:17.560 | But it's a very, very big shift in the market.
00:12:19.800 | Like basically over the course of 2023 going into 2024,
00:12:24.200 | OpenAI has gone from 95% market share
00:12:25.640 | to somewhere between 50% and 75% market share.
00:12:29.360 | - Yeah, I'm really curious
00:12:30.200 | how RAMP does the attribution to the model, if it's API,
00:12:33.160 | because I think it's all- - Credit card spend.
00:12:35.000 | - Well, but the credit card doesn't say.
00:12:37.160 | Maybe when they do expenses, they upload the PDF.
00:12:40.520 | But yeah, the Gemini, I think, makes sense.
00:12:42.080 | I think that was one of my main 2024 takeaways
00:12:44.920 | that like the best small model companies
00:12:47.240 | are the large labs, which is not something
00:12:49.120 | I would have thought that the open source
00:12:50.480 | kind of like long tail would be, like the small model.
00:12:53.880 | - Yeah, different sizes of small models
00:12:55.920 | we're talking about here, right?
00:12:56.760 | Like so small model here for Gemini is 8B, right?
00:13:01.120 | Mini, we don't know what the small model size is,
00:13:03.240 | but yeah, it's probably in the double digits
00:13:05.160 | or maybe single digits, but probably double digits.
00:13:07.440 | The open source community has kind of focused
00:13:09.400 | on the one to 3B size, maybe 0.5B, that's Moondream.
00:13:14.400 | And that is small for you, then that's great.
00:13:17.640 | It makes sense that we have a range for small now,
00:13:20.120 | which is like maybe one to 5B.
00:13:22.240 | I'll even put that at the high end.
00:13:24.640 | And so this includes Gemma from Gemini as well,
00:13:27.480 | but also includes the Apple Foundation models,
00:13:30.320 | which are also like, I think Apple Foundation is 3B.
00:13:32.720 | - Yeah, no, that's great.
00:13:34.000 | I mean, I think in the start, small just meant cheap.
00:13:37.240 | I think today small is actually a more nuanced discussion,
00:13:40.960 | you know, that people weren't really having before.
00:13:43.720 | - Yeah, we can keep going.
00:13:45.160 | This is a slide that I slightly disagree with Sarah on.
00:13:47.880 | She's pointing to the Scale SEAL leaderboard.
00:13:51.360 | I think the researchers that I talked to in Europe
00:13:54.480 | were kind of positive on this
00:13:55.520 | because basically you need private test sets
00:14:00.160 | to prevent contamination.
00:14:02.240 | And scale is one of maybe three or four people this year
00:14:06.200 | that has really made an effort
00:14:07.600 | in doing a credible private test set leaderboard.
00:14:11.400 | Llama 405B does well compared to Gemini and GPT-4o.
00:14:16.400 | And I think that's good.
00:14:17.960 | I would say that, you know,
00:14:19.280 | it's good to have an open model that is that big
00:14:21.800 | that does well on those metrics.
00:14:23.840 | But anyone putting 405B in production will tell you,
00:14:27.040 | if you scroll down a little bit
00:14:28.200 | to the artificial analysis numbers,
00:14:30.280 | that it is very slow and very expensive to infer.
00:14:33.800 | It doesn't even fit on like one node of H100s.
00:14:38.120 | Cerebras will be happy to tell you
00:14:39.520 | they can serve 405B on their super large chips.
00:14:42.800 | But, you know, if you need to do anything custom to it,
00:14:45.760 | you're still kind of constrained.
00:14:48.200 | So is 405B really that relevant?
00:14:50.160 | Like, I think most people are basically saying
00:14:52.600 | that they only use 405B as a teacher model
00:14:54.880 | to distill down to something.
00:14:56.720 | Even Meta is doing it.
00:14:58.280 | So when Llama 3.3 launched,
00:15:00.680 | they only launched the 70B
00:15:01.720 | because they used 405B to distill the 70B.
00:15:03.840 | So I don't know if like open source is keeping up.
00:15:06.360 | I think the open source industrial complex
00:15:09.520 | is very invested in telling you that the gap is narrowing.
00:15:13.760 | I kind of disagree.
00:15:15.280 | I think that the gap is widening with O1.
00:15:18.320 | I think there are very, very smart people
00:15:20.600 | trying to narrow that gap and they should.
00:15:22.320 | I really wish them success,
00:15:23.400 | but you cannot use a chart that is nearing 100
00:15:26.880 | in your saturation chart.
00:15:27.920 | And look, the distance between open source
00:15:29.760 | and closed source is narrowing.
00:15:30.840 | Of course it's going to narrow it
00:15:31.680 | because you're near 100.
00:15:32.680 | - Yeah, yeah, yeah.
00:15:34.000 | - This is stupid.
00:15:35.320 | But in metrics that matter, is open source narrowing?
00:15:38.240 | Probably not for O1 for a while.
00:15:40.880 | And it's really up to the open source guys
00:15:43.560 | to figure out if they can match O1 or not.
00:15:46.320 | - I think inference time compute is bad for open source
00:15:48.840 | just because Zuck can donate the flops at training time,
00:15:52.960 | but he cannot donate the flops at inference time.
00:15:55.600 | So it's really hard to like actually keep up
00:15:58.720 | on that axis. - Big business model shift.
00:16:00.840 | - So I don't know what that means for the GPU clouds.
00:16:04.120 | I don't know what that means for the hyperscalers,
00:16:06.520 | but obviously the big labs have a lot of advantage
00:16:09.640 | because it's not a static artifact
00:16:11.640 | that you're putting the compute in.
00:16:13.600 | You're kind of doing that still,
00:16:15.120 | but then you're putting a lot of compute at inference too.
00:16:17.000 | - Yeah, yeah, yeah.
00:16:18.280 | I mean, Llama 4 will be reasoning oriented.
00:16:20.680 | We talked with Thomas Scialom.
00:16:22.640 | Kudos for getting that episode together.
00:16:24.240 | That was really nice.
00:16:25.560 | Well-timed.
00:16:26.400 | Actually, I connected with the Meta AI guy at NeurIPS
00:16:30.320 | and we're gonna coordinate something for Llama4.
00:16:32.760 | - And our friend Clara Shih just joined
00:16:34.960 | to lead the business agent side.
00:16:36.800 | So I'm sure we'll have her on in the new year.
00:16:39.040 | - Yeah, so my comment on the business model shift,
00:16:41.920 | this is super interesting.
00:16:43.000 | Apparently it is wide knowledge
00:16:44.400 | that OpenAI wanted more than $6.6 billion
00:16:46.520 | for their fund raise.
00:16:48.560 | They wanted to raise higher and they did not.
00:16:51.560 | What that means is basically like,
00:16:52.760 | it's very convenient that we're not getting GPT-5,
00:16:55.480 | which would have been a larger pre-train
00:16:57.920 | that would need a lot of upfront money.
00:16:59.840 | Instead, we're converting fixed costs
00:17:02.040 | into variable costs, right?
00:17:03.680 | And passing it on effectively to the customer.
00:17:06.960 | And it's so much easier to take margin there
00:17:09.840 | because you can directly attribute it to like,
00:17:11.760 | oh, you're using this more,
00:17:12.800 | therefore you pay more of the cost
00:17:14.160 | and I'll just slap a margin on there.
00:17:15.480 | So like that lets you control your gross margin
00:17:17.960 | and like tie your spend
00:17:20.360 | or your sort of inference spend accordingly.
00:17:22.920 | And it's just really interesting to,
00:17:25.280 | that this change in the sort of inference paradigm
00:17:29.440 | has arrived exactly at the same time
00:17:31.520 | that the funding environment for pre-training
00:17:33.680 | is effectively drying up, kind of.
00:17:36.560 | I feel like maybe the VCs are very in tune
00:17:38.120 | with research anyway.
00:17:38.960 | So like they would have noticed this,
00:17:40.800 | but it's just interesting.
00:17:43.200 | - Yeah, and I was looking back at our yearly recap
00:17:45.800 | of last year and the big thing
00:17:47.760 | was like the Mixtral price fights, you know?
00:17:50.760 | And I think now it's almost like there's nowhere to go.
00:17:52.840 | Like, you know, Gemini Flash
00:17:53.880 | is like basically giving it away for free.
00:17:55.680 | So I think this is a good way for the labs
00:17:57.920 | to generate more revenue and pass down
00:18:00.240 | - It's great.
00:18:01.080 | - Some of the compute to the customer.
00:18:01.920 | - Yeah, I think we're going to keep going.
00:18:02.760 | I think that $2,000 ChatGPT will come.
00:18:05.520 | - Yeah, I know, totally.
00:18:06.760 | I mean, next year, the first thing I'm doing
00:18:08.880 | is signing up for Devon,
00:18:10.200 | signing up for ChatGPT Pro.
00:18:12.800 | Just to try, I just want to see
00:18:14.160 | what does it look like to spend $1,000 a month on AI?
00:18:17.840 | - Yes, I think if your job is at least AI content creator
00:18:21.280 | or VC or, you know, someone whose job it is
00:18:23.680 | to stay on top of things,
00:18:24.760 | you should already be spending like $1,000 a month on stuff.
00:18:28.080 | And then obviously easy to spend, hard to use.
00:18:31.600 | You have to actually use.
00:18:33.160 | The good thing is that actually Google
00:18:35.320 | lets you do a lot of stuff for free now.
00:18:37.560 | So like deep research that they just launched
00:18:39.880 | uses a ton of inference
00:18:41.360 | and it's free while it's in preview.
00:18:44.480 | So you should use it.
00:18:46.600 | - They need to put that in Lindy.
00:18:47.920 | I've been using Lindy lately.
00:18:49.560 | I've built a bunch of things once we had flow
00:18:51.880 | because I like the new thing.
00:18:53.160 | It's pretty good.
00:18:54.160 | I even did a phone call assistant.
00:18:57.080 | - Yeah, they just launched a new voice.
00:18:58.880 | - Yeah, I think once they get advanced voice mode,
00:19:01.840 | like capability, today it's still like speech to text,
00:19:04.880 | you can kind of tell.
00:19:06.440 | But it's good for like reservations and things like that.
00:19:09.080 | So I have a meeting prepper thing.
00:19:12.320 | And so it's good.
00:19:13.960 | - Okay, I feel like we've covered a lot of stuff.
00:19:16.480 | Yeah, I think we will go over the individual talks
00:19:21.320 | in a separate episode.
00:19:23.040 | I don't want to take too much time with this stuff,
00:19:25.240 | but suffice to say that there was a lot of progress
00:19:27.680 | in each field.
00:19:28.880 | We covered vision.
00:19:29.880 | Basically this is all like the audience voting
00:19:32.120 | for what they wanted.
00:19:33.200 | And then I just invited the best speaker I could find
00:19:35.000 | in each audience, especially agents.
00:19:37.960 | Graham, who I talked to at ICML in Vienna,
00:19:41.560 | he is currently still number one.
00:19:43.200 | It's very hard to stay on top of SWE-Bench.
00:19:45.760 | OpenHands is currently still number one on SWE-Bench Full,
00:19:48.840 | which is the hardest one.
00:19:49.920 | He had very good thoughts on agents,
00:19:51.680 | which I'll highlight for people.
00:19:53.120 | Everyone is saying 2025 is the year of agents,
00:19:55.560 | just like they said last year.
00:19:57.160 | (laughing)
00:19:59.360 | But he had thoughts on like eight parts
00:20:01.040 | of what are the frontier problems to solve in agents.
00:20:03.960 | And so I'll highlight that talk as well.
00:20:05.560 | - Yeah, the number six,
00:20:06.720 | which is the how can agents learn more
00:20:08.760 | about the environment has been super interesting to us
00:20:12.160 | as well, just to think through.
00:20:13.760 | Because yeah, how do you put an agent in an enterprise
00:20:16.720 | where most things in an enterprise
00:20:18.680 | have never been public?
00:20:20.320 | You know, a lot of the tooling,
00:20:21.560 | like the code bases and things like that.
00:20:23.120 | So yeah, there's not--
00:20:23.960 | - So just indexing and rag?
00:20:25.560 | - Well, yeah, but it's more like,
00:20:27.120 | you can't really rag things that are not documented,
00:20:29.480 | but people know them based on how they've been doing it.
00:20:32.520 | You know?
00:20:33.360 | So I think there's almost this like--
00:20:34.400 | - Oh, institutional knowledge.
00:20:35.560 | - So yeah, the boring word is kind of like
00:20:36.920 | a business process extraction.
00:20:38.480 | It's like, how do you actually understand
00:20:40.400 | how these things are done?
00:20:41.480 | - I see.
00:20:42.320 | - And I think today the agents that most people are building
00:20:45.560 | are good at following instruction,
00:20:46.920 | but are not as good as like extracting them from you.
00:20:50.280 | So I think that will be a big unlock.
00:20:52.520 | - Cool.
00:20:53.360 | - Just to touch quickly on the Jeff Dean thing.
00:20:55.680 | I thought it was pretty,
00:20:56.520 | I mean, we'll link it in the things,
00:20:58.680 | but I think the main focus was like,
00:21:00.440 | how do you use ML to optimize the systems
00:21:02.760 | instead of just focusing on ML to do something else?
00:21:05.520 | - Yeah, I think speculative decoding,
00:21:07.080 | we had, you know, Eugene from RWKV on the podcast before,
00:21:10.400 | like he's doing a lot of that with Featherless AI.
00:21:12.840 | - Everyone is, I would say it's the norm.
00:21:14.760 | I'm a little bit uncomfortable with how much it costs,
00:21:17.120 | because it does use more of the GPU per call,
00:21:20.040 | but because everyone is so keen on fast inference,
00:21:22.840 | then yeah, it makes sense.
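For readers who haven't seen it, here is a minimal, greedy sketch of what speculative decoding does; this is a simplification of the idea rather than anyone's actual implementation (real systems verify all draft tokens in one batched forward pass and use a probabilistic accept/reject rule when sampling), and it shows why it spends extra GPU work per call in exchange for latency.

```python
# A minimal, greedy sketch of speculative decoding (a simplification, not a
# production implementation): a cheap "draft" model proposes k tokens, and the
# large "target" model checks them, keeping only the prefix it agrees with.

from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],    # cheap model: next-token guess
    target_next: Callable[[List[int]], int],   # big model: next-token choice
    k: int = 4,
) -> List[int]:
    # 1) Draft k tokens autoregressively with the cheap model.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    # 2) Verify: in practice the target scores all k positions in ONE forward
    #    pass; here we simulate the accept/reject rule position by position.
    accepted, ctx = [], list(prefix)
    for t in draft:
        t_target = target_next(ctx)
        if t_target == t:
            accepted.append(t)         # draft agreed with target: keep it
            ctx.append(t)
        else:
            accepted.append(t_target)  # first disagreement: take target's token
            break                      # the rest of the draft is wasted work
    return prefix + accepted

# Toy usage: the draft always proposes 1; the target emits 1 when the context
# length is odd and 2 otherwise, so the second draft token gets rejected.
print(speculative_step([0], lambda ctx: 1,
                       lambda ctx: 1 if len(ctx) % 2 else 2, k=4))  # [0, 1, 2]
```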
00:21:24.960 | - Exactly.
00:21:26.600 | Yeah, but we'll link that.
00:21:28.480 | Obviously Jeff is great.
00:21:29.320 | - Yeah, so that Jeff is, Jeff's talk was more,
00:21:32.280 | it wasn't focused on Gemini.
00:21:33.600 | I think people got the wrong impression from my tweet,
00:21:36.480 | is more about how Google approaches ML
00:21:38.600 | and uses ML to design systems
00:21:40.240 | and then systems feedback into the ML.
00:21:43.040 | And I think this ties in with Loubna's talk
00:21:45.600 | on synthetic data,
00:21:47.080 | where it's basically the story of bootstrapping
00:21:50.120 | of humans and AI in AI research or AI in production.
00:21:53.680 | So her talk was on synthetic data,
00:21:55.040 | where like how much synthetic data has grown in 2024
00:21:58.200 | in the pre-training side,
00:21:59.120 | the post-training side and the eval side.
00:22:01.360 | And I think Jeff then also extended it basically to chips,
00:22:05.360 | to chip design.
00:22:06.200 | So he'd spent a lot of time talking about AlphaChip.
00:22:08.200 | And most of us in the audience are like,
00:22:10.200 | we're not working on hardware, man.
00:22:11.680 | Like you guys are great.
00:22:12.520 | TPU is great.
00:22:13.360 | Okay, we'll buy TPUs.
00:22:14.200 | - And then there was the Ilya talk.
00:22:15.960 | - Yeah.
00:22:16.800 | - But, and then we have a essay tied to it.
00:22:19.480 | What Ilya saw.
00:22:20.680 | I don't know if we're calling them essays.
00:22:22.120 | What are we calling these?
00:22:23.000 | But post.
00:22:23.840 | - For me, it's just like bonus for the InSpace supporters,
00:22:26.680 | because I feel like they haven't been getting anything.
00:22:28.720 | And then I wanted a more high-frequency way to write stuff.
00:22:33.720 | Like that one I wrote in an afternoon.
00:22:36.600 | I think basically we now have an answer to what Ilya saw.
00:22:39.200 | It's one year since the blip.
00:22:41.640 | And we know what he saw in 2014.
00:22:44.680 | We know what he saw in 2024.
00:22:47.400 | We think we know what he sees in 2024.
00:22:49.160 | He gave some hints.
00:22:50.320 | And then we have vague indications
00:22:52.400 | of what he saw in 2023.
00:22:54.360 | So that was the.
00:22:55.560 | - Yeah.
00:22:56.400 | - Oh, and then 2016 as well,
00:22:57.680 | because of this lawsuit with Elon,
00:22:59.480 | OpenAI is publishing emails from Sam's,
00:23:02.120 | like his personal text messages
00:23:03.960 | to Shivon Zilis, or whatever.
00:23:05.680 | So like, we have emails from Ilya saying,
00:23:08.960 | this is what we're seeing in OpenAI,
00:23:10.880 | and this is why we need to scale up GPUs.
00:23:13.240 | And I think it's very prescient in 2016 to write that.
00:23:16.840 | And so like, it is exactly like,
00:23:19.200 | basically his insights is him and Greg,
00:23:21.840 | basically kind of driving the scaling up of OpenAI,
00:23:25.200 | while they're still playing Dota.
00:23:26.600 | They're like, no, like.
00:23:28.120 | - Yeah, yeah, yeah.
00:23:28.960 | - Like, we see the path here.
00:23:30.680 | - Yeah, and it's funny.
00:23:31.520 | Yeah, they even mentioned, you know,
00:23:32.480 | we can only train on 1v1 Dota.
00:23:34.640 | We need to train on 5v5,
00:23:35.960 | and that takes too many GPUs.
00:23:37.440 | - Yeah, and at least for me, I can speak for myself,
00:23:39.400 | like I didn't see the path from Dota to where we are today.
00:23:42.280 | I think even maybe if you ask them,
00:23:44.120 | like they wouldn't necessarily draw a straight line, but.
00:23:46.760 | - Yeah, yeah, no, I definitely.
00:23:49.200 | But I think like that was like the whole idea
00:23:50.680 | of almost like the RL.
00:23:52.520 | And we talked about this with Nathan on his podcast.
00:23:55.240 | It's like with RL, you can get very good at specific things,
00:23:58.000 | but then you can't really like generalize as much.
00:23:59.720 | And I think the language models are like the opposite,
00:24:01.720 | which is like, you're gonna throw all this data at them
00:24:03.720 | and scale them up,
00:24:04.720 | but then you really need to drive them home
00:24:06.880 | on a specific task later on.
00:24:08.720 | And we'll talk about the OpenAI reinforcement,
00:24:10.840 | fine tuning, announcement too, and all of that.
00:24:13.480 | But yeah, I think like scale is all you need.
00:24:16.480 | That's kind of what Ilya will be remembered for.
00:24:19.160 | - Will be remembered for, yeah.
00:24:21.080 | - And I think just maybe to clarify
00:24:22.680 | on like the pre-training is over thing,
00:24:25.120 | that people love to tweet.
00:24:26.320 | I think the point of the talk was like,
00:24:28.440 | everybody we're scaling these chips,
00:24:30.280 | we're scaling the compute,
00:24:31.280 | but like the second ingredient, which is data,
00:24:33.360 | it's not scaling at the same rate.
00:24:35.040 | So it's not necessarily pre-training is over.
00:24:38.040 | It's kind of like what got us here, won't get us there.
00:24:40.560 | In his email, he predicted like 10X growth
00:24:43.320 | every two years or something like that.
00:24:45.280 | And I think maybe now it's like,
00:24:47.040 | you can 10X the chips again, but-
00:24:49.240 | - I think it's 10X per year, was it?
00:24:51.760 | I don't know.
00:24:52.960 | - And Moore's law is like 2X.
00:24:55.160 | So it's like much faster than that.
00:24:58.000 | And yeah, I like the fossil fuel of AI analogy.
00:25:00.440 | It's kind of like the little background tokens thing.
00:25:03.080 | And so the OpenAI reinforcement fine tuning
00:25:05.640 | is basically like, instead of fine tuning on data,
00:25:07.760 | you fine tune on a reward model.
00:25:09.760 | So it's basically like, instead of being data driven,
00:25:11.960 | it's like task driven.
00:25:13.560 | And I think people have tasks to do.
00:25:15.760 | They don't really have a lot of data.
00:25:17.600 | So I'm curious to see how that changes,
00:25:20.240 | how many people fine tune.
00:25:21.520 | Because I think this is what people run into.
00:25:23.280 | It's like, oh, you can fine tune Lama.
00:25:25.240 | And it's like, okay, where do I get the data
00:25:27.880 | to fine tune it on?
00:25:29.240 | So it's great that we're moving the thing.
00:25:31.360 | And then I really like, he had this chart
00:25:33.000 | where the brain mass and the body mass thing
00:25:35.800 | is basically like mammals that scale linearly
00:25:38.560 | by brain and body size.
00:25:40.360 | And then humans kind of like broke off the slope.
00:25:42.960 | So it's almost like maybe the mammal slope
00:25:45.000 | is like the pre-training slope.
00:25:46.400 | And then the post-training slope is like the human one.
00:25:49.720 | - Yeah, I wonder what the, I mean, we'll know in 10 years,
00:25:52.360 | but I wonder what the y-axis is for Ilya's SSI.
00:25:56.680 | We'll try to get them on.
00:25:57.840 | - Ilya, if you're listening, you're welcome here.
00:26:00.800 | Yeah, and then he had what comes next,
00:26:02.560 | like agent, synthetic data, inference compute.
00:26:04.640 | I thought all of that was like-
00:26:05.480 | - I don't think he was dropping any alpha there.
00:26:06.920 | - Yeah, yeah, yeah.
00:26:07.920 | Any other new reps, highlights, or?
00:26:10.920 | - I think that there was comparatively a lot more work.
00:26:15.800 | Oh, by the way, I need to plug that my friend Yi
00:26:18.920 | made this like little nice-
00:26:20.000 | - Yeah, that was really nice.
00:26:21.400 | - Of like all the, she called it must read papers of 2024.
00:26:26.160 | So I laid out some of these at NeurIPS and it was just gone.
00:26:28.880 | Like everyone just picked it up
00:26:30.280 | 'cause people are dying for like little guidance
00:26:32.920 | and visualizations of each paper.
00:26:35.440 | And so I thought it was really super nice that we got that.
00:26:38.640 | - Should we do a Latent Space book for each year?
00:26:41.600 | - I thought about it.
00:26:42.440 | - For each year, we should.
00:26:43.280 | - Coffee table book.
00:26:44.100 | - Yeah. - Yeah.
00:26:44.940 | - Okay, put it in there, Will.
00:26:46.920 | - Hi, Will, by the way, we haven't introduced you.
00:26:49.720 | These are new, you know,
00:26:50.680 | you're organized, Jamie, we got Will.
00:26:51.520 | - You need to pull up more things.
00:26:53.480 | One thing I saw that, okay, there-
00:26:56.360 | - What this book see.
00:26:58.080 | - Okay, one fun one, and then one more general one.
00:27:00.680 | So the fun one is this paper on agent collusion.
00:27:04.280 | This is a paper on steganography.
00:27:06.120 | This is secret collusion among AI agents,
00:27:08.280 | multi-agent deception via steganography.
00:27:10.200 | I tried to go to NeurIPS
00:27:11.600 | in order to find these kinds of papers
00:27:13.320 | because the real reason, like NeurIPS this year
00:27:17.120 | has a lottery system.
00:27:17.960 | A lot of people actually even go and don't buy tickets
00:27:20.280 | because they just go and attend the side events.
00:27:22.160 | And then also the people who go
00:27:23.600 | and end up crowding around the most popular papers,
00:27:26.240 | which you already know and already read them
00:27:28.360 | before you showed up to NeurIPS.
00:27:30.000 | So the only reason you go there
00:27:30.960 | is to talk to the paper authors.
00:27:32.560 | But there's like something like 10,000 other papers out there
00:27:36.960 | that, you know, are just people's work
00:27:39.280 | that they did on the air.
00:27:40.680 | They failed to get attention for one reason or another.
00:27:42.680 | And this was one of them.
00:27:43.960 | It was like all the way at the back.
00:27:45.080 | And this is a DeepMind paper that actually focuses
00:27:47.440 | on collusion between AI agents
00:27:49.760 | by hiding messages in the text that they generate.
00:27:53.200 | So that's what steganography is.
00:27:54.320 | So a very simple example would be
00:27:56.160 | the first letter of every word.
00:27:57.520 | If you pick that out, you know,
00:27:59.200 | it decodes a different message than that.
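To make the first-letter example concrete, here is a toy decoder (mine, not from the DeepMind paper): the cover text reads like a normal sentence, but concatenating the first character of each word recovers a hidden message.

```python
# Toy illustration of the "first letter of every word" steganography described
# above: the cover text reads normally, but the first characters spell a message.

def decode_acrostic(cover_text: str) -> str:
    """Recover the hidden message from the first letter of each word."""
    return "".join(word[0] for word in cover_text.split())

cover = "Many eager engineers try any tool, never once overlooking novelty."
print(decode_acrostic(cover))  # -> "Meetatnoon"
```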
00:28:01.920 | But something I've always emphasized is to LLMs,
00:28:05.040 | we read left to right.
00:28:06.120 | LLMs can read up, down, sideways, you know,
00:28:08.440 | in random character order.
00:28:10.600 | And it's the same to them as it is to us.
00:28:12.640 | So if we were ever to get, you know,
00:28:15.120 | self-motivated, underlined LLMs
00:28:16.760 | that we're trying to collaborate to take over the planet,
00:28:19.840 | this would be how they do it.
00:28:20.680 | They spread messages among us
00:28:22.800 | in the messages that we generate.
00:28:25.000 | And he developed a scaling law for that.
00:28:27.160 | So he marked, I'm showing it on screen right now,
00:28:30.240 | the emergence of this phenomenon.
00:28:31.800 | Basically, for example, for cipher encoding,
00:28:34.000 | GPT-2, Llama 2, Mixtral, GPT-3.5 have zero capabilities,
00:28:38.560 | and it suddenly emerges at GPT-4.
00:28:40.320 | And this is the kind of Jason Wei type emergence
00:28:42.520 | properties that people kind of look for.
00:28:44.440 | I think what made this paper stand out as well,
00:28:46.960 | so he developed a benchmark for steganography collusion,
00:28:50.680 | and he also focused on Schelling point collusion,
00:28:53.280 | which is very low coordination.
00:28:54.720 | Like for agreeing on a decoding encoding format,
00:28:58.080 | you kind of need to have some agreement on that.
00:29:00.680 | But Schelling point means like very, very low
00:29:02.880 | or almost no coordination.
00:29:04.440 | So for example, if I ask someone,
00:29:06.480 | if the only message I give you is meet me in New York,
00:29:09.200 | and you have no idea where or when,
00:29:11.520 | you would probably meet me at Grand Central Station.
00:29:14.680 | That is, the Grand Central Station is a Schelling point,
00:29:16.800 | and it's probably somewhere during the day.
00:29:19.040 | That is, the Schelling point of New York is Grand Central.
00:29:21.600 | To that extent, Schelling points for steganography
00:29:23.920 | are things like the common decoding methods
00:29:26.760 | that we talked about.
00:29:27.680 | It will be interesting at some point in the future
00:29:29.440 | when we are worried about alignment.
00:29:30.840 | It is not interesting today,
00:29:32.240 | but it's interesting that DeepMind
00:29:33.560 | is already thinking about this.
00:29:35.800 | - Interesting.
00:29:36.720 | I think that's like one of the hardest things
00:29:38.200 | about NeurIPS, is like the long tail.
00:29:40.040 | - Very long tail.
00:29:41.320 | I found a pricing guy.
00:29:42.840 | I'm going to feature him on the podcast.
00:29:44.400 | Basically, this guy from NVIDIA
00:29:46.800 | worked out the optimal pricing for language models.
00:29:51.080 | It's basically an econometrics paper at NeurIPS,
00:29:53.120 | where like everyone else is talking about GPUs.
00:29:55.400 | - And the guy with the GPUs is talking about pricing.
00:29:59.960 | - That was the sort of fun one.
00:30:02.080 | The broader focus I saw is that model papers at NeurIPS
00:30:07.080 | are kind of dead.
00:30:08.520 | No one really presents models anymore.
00:30:10.840 | It's just data sets.
00:30:12.080 | This is all the grad students are working on.
00:30:14.120 | So like there was a data sets track,
00:30:15.720 | and then I was looking around,
00:30:17.080 | I was like, you don't need a data sets track
00:30:18.680 | because every paper is data sets paper.
00:30:20.520 | (laughing)
00:30:22.920 | And so data sets and benchmarks,
00:30:25.480 | they're kind of flip sides of the same thing.
00:30:27.280 | So yeah, if you're a grad student,
00:30:29.040 | your GPU board, you kind of work on that.
00:30:30.560 | And then the sort of big model that people walk around
00:30:33.680 | and pick the ones that they like,
00:30:34.760 | and then they use it in their models.
00:30:36.440 | And that's kind of how it develops, I feel like.
00:30:39.840 | Last year, you had people like Hao Tian,
00:30:43.880 | who worked on LLaVA, which is "take Llama and add vision."
00:30:47.360 | And then obviously XAI hired him,
00:30:49.680 | and he added Vision to Grok.
00:30:52.160 | He's the Vision Grok guy.
00:30:53.560 | This year, I don't think there was any of those.
00:30:55.480 | - Yeah, what were the most popular like orals?
00:30:58.920 | Last year, it was like the Monarch Mixer,
00:31:00.520 | I think was like the most attended.
00:31:03.000 | - Yeah, I need to look it up.
00:31:06.240 | - Yeah, I mean, if nothing comes to mind,
00:31:08.000 | that's also kind of like an answer in a way.
00:31:10.400 | But I think last year, there was a lot of interest
00:31:12.360 | in like furthering models and like different architectures
00:31:15.560 | and all of that.
00:31:16.680 | - I will say that I felt the oral picks this year
00:31:19.760 | were not very good.
00:31:21.240 | Either that, or maybe it's just a highlight
00:31:23.480 | of how I have changed in terms of how I view papers.
00:31:28.480 | So like in my estimation,
00:31:31.080 | two of the best papers this year for datasets
00:31:34.440 | are DataComp and RefinedWeb, or FineWeb.
00:31:38.040 | These are two actually industrially used papers,
00:31:41.360 | not highlighted for oral.
00:31:42.800 | I think DCLM got the spotlight,
00:31:44.920 | FineWeb didn't even get the spotlight.
00:31:46.440 | So like, it's just that the picks were different.
00:31:48.840 | But one thing that does get a lot of play
00:31:51.360 | that a lot of people are debating
00:31:52.960 | is the role of the schedule.
00:31:54.360 | This is the Schedule-Free Optimizer paper from Meta,
00:31:57.120 | from Aaron Defazio.
00:31:58.920 | And this year in the ML community,
00:32:00.960 | there's been a lot of chat about shampoo, soap,
00:32:03.960 | all the bathroom amenities
00:32:05.680 | for optimizing your learning rates.
00:32:08.240 | And most people at the big labs who I asked about this
00:32:12.840 | say that it's cute, but it's not something that matters.
00:32:15.840 | I don't know, but it's something that was discussed
00:32:17.960 | and very, very popular.
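For context, this is roughly what "schedule-free" looks like in code. The sketch below assumes the interface of Meta's open-source `schedulefree` package (AdamWScheduleFree); the model and training loop are dummies, and the key points are just that there is no learning-rate scheduler anywhere and that the optimizer has explicit train/eval modes because it keeps an averaged iterate.

```python
# A sketch assuming the `schedulefree` package's AdamWScheduleFree interface;
# the model and data here are dummies, only the optimizer usage is the point.
import torch
import schedulefree  # assumed: pip install schedulefree

model = torch.nn.Linear(128, 10)
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3)

optimizer.train()  # schedule-free optimizers need an explicit train mode
for _ in range(100):
    x = torch.randn(32, 128)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # note: no scheduler and no scheduler.step() anywhere

optimizer.eval()  # switch to the averaged weights before evaluating/saving
```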
00:32:19.360 | - Four Wars of AI recap, maybe just quickly.
00:32:22.880 | Where do you want to start, Data?
00:32:25.280 | - Yeah, so to remind people,
00:32:27.040 | this is the Four Wars piece that we did
00:32:28.960 | as one of our earlier recaps of this year.
00:32:31.200 | And the belligerents are on the left,
00:32:33.520 | journalists, writers, artists,
00:32:35.040 | anyone who owns IP, basically,
00:32:36.440 | New York Times, Stack Overflow, Reddit,
00:32:38.480 | Getty, Sarah Silverman, George R.R. Martin.
00:32:41.440 | Yeah, and I think this year we can add
00:32:43.600 | Scarlett Johansson to that side of the fence.
00:32:46.880 | So anyone suing OpenAI, basically.
00:32:48.600 | I actually wanted to get a snapshot of all the lawsuits.
00:32:52.560 | I'm sure some lawyer can do it.
00:32:54.600 | That's the Data Quality War.
00:32:55.800 | On the right-hand side, we have the synthetic data people.
00:32:57.800 | And I think we talked about Loubna's talk,
00:32:59.880 | really showing how much synthetic data
00:33:01.800 | has come along this year.
00:33:03.600 | I think there was a bit of a fight
00:33:05.200 | between scale AI and the synthetic data community,
00:33:09.280 | 'cause scale published a paper saying
00:33:11.040 | that synthetic data doesn't work.
00:33:12.840 | Surprise, surprise, scale is the leading vendor
00:33:15.440 | of non-synthetic data.
00:33:16.800 | - Only cage-free annotated data is useful.
00:33:21.400 | - So I think there's some debate going on there,
00:33:23.840 | but I don't think it's much debate anymore
00:33:25.400 | that at least synthetic data,
00:33:27.480 | for the reasons that are addressed in Loubna's talk,
00:33:31.600 | makes sense.
00:33:32.640 | I don't know if you have any perspectives there.
00:33:34.280 | - I think, again, going back to the reinforcement,
00:33:36.120 | fine-tuning, I think that will change a little bit
00:33:38.640 | how people think about it.
00:33:39.680 | I think today, people mostly use synthetic data
00:33:41.880 | for distillation and fine-tuning a smaller model
00:33:45.440 | from a larger model.
00:33:46.960 | I'm not super aware of how the frontier labs use it,
00:33:50.240 | outside of the rephrase, the web thing that Apple also did.
00:33:54.080 | But yeah, I think it'll be useful.
00:33:56.120 | I think whether or not that gets us the big next step,
00:34:00.680 | I think that's maybe like TBD.
00:34:02.720 | I think people love talking about data
00:34:04.160 | because it's like a GPU-poor thing.
00:34:07.640 | I think synthetic data is something that people can do,
00:34:11.800 | so they feel more opinionated about it
00:34:14.040 | compared to the optimizers stuff,
00:34:16.400 | which is like, they don't really work on.
00:34:18.640 | - I think that there is an angle
00:34:20.600 | to the reasoning synthetic data.
00:34:23.880 | So this year, we covered in a paper club
00:34:26.720 | the STAR series of papers.
00:34:29.360 | So that's STaR, Quiet-STaR, V-STaR.
00:34:31.680 | It basically helps you to synthesize reasoning steps
00:34:35.320 | or at least distill reasoning steps from a verifier.
00:34:38.560 | And if you look at the OpenAI RFT API that they released
00:34:43.560 | or that they announced,
00:34:45.040 | basically, they're asking you to submit graders
00:34:47.560 | or they choose from a preset list of graders.
00:34:49.720 | Basically, it feels like a way to create
00:34:52.800 | valid synthetic data for them to fine-tune
00:34:55.720 | the reasoning paths on.
00:34:57.160 | So I think that is another angle
00:34:58.400 | where it starts to make sense.
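To make the grader idea concrete, here is a generic sketch (not OpenAI's actual grader schema, which hasn't been shown in full): a grader boils down to a function that scores a model's final answer against something verifiable, which is what lets you reinforce reasoning traces without hand-labeling chains of thought.

```python
# A generic illustration of a grader, not OpenAI's actual RFT schema:
# score a model's final answer against a verifiable reference in [0, 1].

def numeric_grader(model_answer: str, reference: str, tol: float = 1e-6) -> float:
    """1.0 if the model's final number matches the reference within tol, else 0.0."""
    try:
        return float(abs(float(model_answer.strip()) - float(reference)) <= tol)
    except ValueError:
        return 0.0  # unparseable answers earn no reward

print(numeric_grader(" 42 ", "42"))      # 1.0
print(numeric_grader("about 41", "42"))  # 0.0
```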
00:34:59.760 | And so it's very funny that basically
00:35:02.760 | all the data quality wars between,
00:35:05.680 | let's say the music industry
00:35:06.880 | or the newspaper publishing industry
00:35:08.960 | or the textbooks industry on the big labs,
00:35:11.560 | it's all of the pre-training era.
00:35:13.080 | And then the new era, the reasoning era,
00:35:15.440 | nobody has any problem with all the reasoning.
00:35:18.120 | Fine-tuning, especially because it's all
00:35:19.920 | sort of math and science-oriented
00:35:21.720 | with very reasonable graders.
00:35:23.560 | I think the more interesting next step is
00:35:25.920 | how does it generalize beyond STEM?
00:35:27.960 | We've been using O1 for AI News for a while.
00:35:31.160 | And I would say like for summarization
00:35:33.080 | and creative writing and instruction following,
00:35:35.120 | I think it's underrated.
00:35:36.240 | I started using O1 in our intro songs
00:35:38.720 | before we killed the intro songs,
00:35:40.400 | but it's very good at writing lyrics.
00:35:42.200 | You know, it can actually say like,
00:35:44.120 | I think one of the O1 Pro demos that Noam was showing
00:35:47.400 | was that, you know, you can write an entire paragraph
00:35:50.360 | or three paragraphs without using the letter A, right?
00:35:53.720 | So like literally just anything in the sort of token,
00:35:56.560 | like not even token level,
00:35:57.960 | character level manipulation and counting
00:36:00.040 | and instruction following, it's very, very strong.
00:36:02.640 | And so no surprises when I ask it to rhyme
00:36:05.720 | and to create song lyrics,
00:36:06.800 | it's going to do that much better than in previous models.
00:36:09.520 | So I think it's underrated for creative writing.
00:36:11.760 | - Yeah.
00:36:12.600 | What do you think is the rationale
00:36:14.240 | that they're going to have in court
00:36:15.480 | when they don't show you the thinking choices of O1,
00:36:19.280 | but then they want us to like,
00:36:21.120 | they're getting sued for using other publishers' data,
00:36:24.280 | you know, but then on their end, they're like,
00:36:26.080 | well, you shouldn't be using my data
00:36:27.720 | to then train your model.
00:36:29.160 | So I'm curious to see how that kind of comes.
00:36:31.400 | - Yeah, I mean, O1 has many ways to punish people
00:36:33.840 | without taking them to court,
00:36:35.720 | they already banned ByteDance for distilling their info.
00:36:38.960 | And so anyone caught distilling the chain of thought
00:36:41.560 | will be just disallowed to continue on the API
00:36:44.640 | and it's fine.
00:36:46.240 | It's no big deal.
00:36:47.080 | Like, I don't even think that's an issue at all,
00:36:49.080 | just because the chain of thoughts are pretty well hidden.
00:36:51.880 | Like, you have to work very, very hard to get it to leak.
00:36:55.080 | And then even when it leaks the chain of thought,
00:36:56.920 | you don't know if it's the real one.
00:36:59.080 | So there's much less concern here.
00:37:00.920 | - Yeah, yeah, yeah.
00:37:01.840 | - The bigger concern is actually that
00:37:04.760 | there's not that much IP hiding behind it.
00:37:07.200 | That Cosine, which we talked about,
00:37:10.200 | we talked to him on Dev Day,
00:37:11.720 | can just fine-tune 4o to beat O1.
00:37:14.880 | That Claude Sonnet so far is beating O1 on coding tasks
00:37:18.720 | without, at least O1 preview,
00:37:20.880 | without being a reasoning model
00:37:22.720 | and same for Gemini Pro or Gemini 2.0.
00:37:25.800 | So like, how much is reasoning important?
00:37:28.600 | How much of a moat is there in this
00:37:30.960 | proprietary sort of training data
00:37:32.600 | that they've presumably accomplished?
00:37:34.960 | Because like, even DeepSeek was able to do it
00:37:37.720 | and they had two months notice to do this, to do R1.
00:37:40.600 | So it's actually unclear how much moat there is.
00:37:42.920 | Obviously, if you talk to the Strawberry team,
00:37:45.560 | they'll be like, yeah, I mean,
00:37:46.600 | we spent the last two years doing this.
00:37:48.200 | So we don't know.
00:37:49.760 | And it's going to be interesting
00:37:52.560 | because there'll be a lot of noise from people
00:37:55.080 | who say they have inference time compute
00:37:57.200 | and actually don't
00:37:58.320 | because they just have fancy chain of thought.
00:38:00.320 | And then there's other people
00:38:01.360 | who actually do have very good chain of thought
00:38:03.400 | and you will not see them on the same level as OpenAI
00:38:06.400 | because OpenAI has invested a lot
00:38:07.960 | in building up the mythology of their team,
00:38:10.600 | which makes sense.
00:38:11.440 | Like the real answer is somewhere in between.
00:38:13.160 | - Yeah, I think that's kind of like the main data
00:38:16.240 | more story developing.
00:38:18.480 | GPU poor versus GPU rich.
00:38:20.800 | - Yeah.
00:38:21.640 | - Where do you think we are?
00:38:23.440 | I think there was, again,
00:38:24.720 | going back to like the small model thing,
00:38:26.080 | there was like a time in which the GPU poor
00:38:29.320 | were kind of like the rebel faction
00:38:30.920 | working on like these models
00:38:32.080 | that were like open and small and cheap.
00:38:34.000 | And I think today people don't really care
00:38:35.880 | as much about GPUs anymore.
00:38:37.360 | You also see it in the price of the GPUs.
00:38:39.600 | Like, you know, that market is kind of like plummeted
00:38:41.680 | because there's, people don't want to be,
00:38:43.200 | they want to be GPU free.
00:38:45.520 | They don't even want to be poor.
00:38:46.560 | They just want to be, you know, completely without them.
00:38:49.280 | Yeah, how do you think about this war developing?
00:38:51.600 | - You can tell me about this,
00:38:52.800 | but like, I feel like the appetite for GPU rich startups,
00:38:56.280 | like the, you know, the funding plan is,
00:38:58.840 | we will raise 60 million
00:38:59.840 | and we'll give 50 of that to Nvidia.
00:39:01.800 | That is gone, right?
00:39:02.720 | Like no one's pitching that.
00:39:04.280 | This was literally the plan,
00:39:05.720 | the exact plan of like,
00:39:07.240 | I can name like four or five startups,
00:39:08.640 | but you know, this time last year.
00:39:10.280 | So yeah, GPU rich startups gone.
00:39:12.480 | But I think like the GPU ultra rich,
00:39:16.600 | the GPU ultra high net worth is still going.
00:39:19.040 | So now we're, you know,
00:39:20.600 | we had Leopold's essay on the trillion dollar cluster.
00:39:23.320 | We're not quite there yet.
00:39:24.720 | We have multiple labs, you know, XAI, very famously,
00:39:28.920 | you know, Jensen Huang praising them
00:39:30.120 | for being best boy number one
00:39:31.920 | in spinning up a hundred thousand GPU cluster
00:39:34.360 | in like 12 days or something.
00:39:35.880 | So likewise at Meta, likewise at OpenAI,
00:39:38.360 | likewise at the other labs as well.
00:39:40.200 | So like the GPU ultra rich are going to keep doing that
00:39:42.880 | because I think partially it's an article of faith now
00:39:45.400 | that you just need it.
00:39:46.320 | Like you don't even know what we're going to use it for.
00:39:48.440 | You just need it.
00:39:49.480 | And it makes sense that if,
00:39:51.240 | especially if we're going into
00:39:53.280 | more researchy territory than we are,
00:39:54.960 | so let's say 2020 to 2023
00:39:59.000 | was let's scale big models territory
00:40:01.640 | because we had GPT-3 in 2020.
00:40:04.000 | And we were like, okay,
00:40:04.840 | we'll go from 175B to 1.8T.
00:40:08.520 | And that was GPT-3 to GPT-4.
00:40:10.200 | Okay, that's done.
00:40:11.480 | Like, and as far as everyone is concerned,
00:40:15.000 | Claude, you know, Opus 3.5 is not coming out.
00:40:17.840 | GPT-4.5 is not coming out.
00:40:19.760 | And Gemini 2, like we don't have pro whatever.
00:40:22.920 | We've hit that wall, whatever that wall is.
00:40:24.640 | Maybe I'll call it like the 2 trillion parameter wall.
00:40:26.640 | Like we're not going to 10 trillion.
00:40:28.600 | Like it's just like, no one thinks it's a good idea,
00:40:31.640 | at least from training costs, from amount of data,
00:40:34.640 | or at least the inference, like, would you pay 10X
00:40:38.040 | the price of GPT-4?
00:40:39.280 | Probably not.
00:40:40.320 | Like you want something else
00:40:42.560 | that is at least more useful.
00:40:43.920 | So it makes sense that people are pivoting
00:40:45.840 | in terms of their inference paradigm.
00:40:47.240 | And so when it's more researchy,
00:40:49.040 | then you actually need more just general purpose compute
00:40:51.840 | to mess around with at the exact same time
00:40:54.480 | that production deployments of the previous paradigm
00:40:57.160 | are still ramping up pretty aggressively.
00:40:59.760 | So it makes sense that the GPU rich are growing.
00:41:02.840 | We have now interviewed Together
00:41:04.720 | and Fireworks and Replicate.
00:41:06.840 | We haven't done Anyscale yet,
00:41:08.160 | but I think Amazon may be kind of a sleeper one, Amazon,
00:41:11.320 | in a sense of like they, at reInvent,
00:41:13.720 | I wasn't expecting them to do so well,
00:41:15.840 | but they are now a foundation model lab.
00:41:18.440 | It's kind of interesting.
00:41:20.160 | I think, you know, David went over there
00:41:22.800 | and started just training models.
00:41:25.000 | - Yeah, I mean, that's the power of prepaid contracts.
00:41:28.840 | I think like a lot of AWS customers, you know,
00:41:31.080 | they do this big reserve instance contracts
00:41:33.440 | and now they got to use their money.
00:41:35.600 | That's why so many startups get bought
00:41:38.080 | through the AWS marketplace.
00:41:39.600 | So they can kind of bundle them together and get preferred pricing.
00:41:42.320 | - Okay, so maybe GPU super rich, doing very well.
00:41:45.800 | GPU middle-class, dead.
00:41:47.840 | And then GPU poor.
00:41:49.280 | - I mean, my thing is like,
00:41:50.240 | everybody should just be GPU rich.
00:41:52.640 | There shouldn't really be,
00:41:53.800 | even the GPU poorest is like,
00:41:55.840 | does it really make sense to be GPU poor?
00:41:57.600 | Like if you're GPU poor, you should just use the cloud.
00:42:00.480 | - Yes.
00:42:01.320 | - You know, and I think there might be a future
00:42:03.400 | once we kind of like figure out what the size
00:42:06.000 | and shape of these models is,
00:42:07.280 | where like the tiny box and these things come to fruition
00:42:10.200 | where like you can be GPU poor at home.
00:42:12.360 | But I think today it's like,
00:42:14.160 | why are you working so hard to get these models to run
00:42:17.440 | on like very small clusters where it's like,
00:42:19.760 | it's so cheap to run the-
00:42:21.880 | - Yeah, yeah.
00:42:22.720 | - You know?
00:42:23.560 | - Yeah, yeah.
00:42:24.400 | I think mostly people think it's cool.
00:42:25.800 | People think it's a stepping stone to scaling up.
00:42:28.520 | So they aspire to be GPU rich one day
00:42:30.520 | and they are working on new methods.
00:42:32.040 | Like Nous Research, like probably the most deep tech thing
00:42:35.080 | they've done this year is DisTrO or whatever the new name is.
00:42:38.440 | There's a lot of interest in heterogeneous computing,
00:42:40.800 | distributed computing.
00:42:42.000 | I tend generally to de-emphasize that historically,
00:42:45.080 | but it may be coming to a time
00:42:46.440 | where it is starting to be relevant.
00:42:48.040 | I don't know.
00:42:48.880 | You know, SF Compute launched their
00:42:50.600 | compute marketplace this year.
00:42:51.880 | And like, who's really using that?
00:42:53.360 | Like it's a bunch of small clusters,
00:42:56.040 | disparate types of compute.
00:42:57.880 | And if you can make that useful,
00:43:00.320 | then that will be very beneficial to the broader community,
00:43:04.880 | but maybe still not the source of frontier models.
00:43:07.560 | - Yeah.
00:43:08.400 | - It's just going to be a second tier of compute
00:43:10.840 | that is unlocked for people and that's fine.
00:43:12.960 | But yeah, I mean, I think this year,
00:43:14.360 | I would say a lot more on device.
00:43:16.840 | We are, I now have Apple intelligence on my phone,
00:43:19.720 | doesn't do anything apart from summarize my notifications,
00:43:22.600 | but still not bad.
00:43:24.240 | Like it's multi-modal.
00:43:25.520 | - Yeah.
00:43:26.360 | The notification summaries are so-and-so in my experience.
00:43:29.880 | - Yeah, but they add juice to life.
00:43:32.200 | And then Chrome Nano,
00:43:33.720 | Gemini Nano is coming out in Chrome.
00:43:35.840 | They're still feature flagged,
00:43:37.040 | but you can try it now if you use the alpha.
00:43:40.400 | And so like, I think like,
00:43:41.800 | we're getting the sort of GPU poor version
00:43:45.120 | of a lot of these things coming out.
00:43:47.000 | And I think it's like quite useful.
00:43:49.640 | Like Windows as well,
00:43:50.840 | rolling out RWKV in sort of every Windows deployment.
00:43:53.440 | It's super cool.
00:43:54.800 | And I think the last thing that I never put
00:43:56.840 | in this GPU poor war that I think I should now,
00:43:59.720 | is the number of startups that are GPU poor,
00:44:02.640 | but still scaling very well,
00:44:04.120 | as sort of wrappers on top of
00:44:07.000 | either a foundation model lab or a GPU cloud.
00:44:10.600 | A GPU cloud, it would be Suno.
00:44:12.320 | Suno, RAMP has rated as one of the top ranked,
00:44:15.960 | fastest growing startups of the year.
00:44:18.000 | I think the last public number is like
00:44:19.840 | zero to 20 million this year in ARR.
00:44:22.200 | And Suno runs on Modal.
00:44:24.120 | So Suno itself is not GPU rich,
00:44:26.200 | but they're just doing the training on Modal,
00:44:29.120 | who we've also talked to on the podcast.
00:44:31.440 | The other one would be Bolt.
00:44:33.520 | Straight cloud wrapper.
00:44:34.880 | (laughs)
00:44:36.280 | And again, now they've announced 20 million ARR,
00:44:41.240 | which is another step up from the 8 million
00:44:44.160 | that we put in the title.
00:44:46.440 | So yeah, I mean, it's crazy that all these GPU poors
00:44:48.960 | are finding a way,
00:44:50.240 | while the GPU riches are also finding a way.
00:44:52.120 | And then the only failures,
00:44:53.880 | I kind of call this the GPU smiling curve,
00:44:56.120 | where the edges do well,
00:44:57.720 | 'cause you're either close to the machines
00:44:59.120 | and you're like number one on the machines,
00:45:00.960 | or you're like close to the customers
00:45:02.320 | and you're number one on the customer side.
00:45:03.840 | And the people who are in the middle, Inflection,
00:45:07.120 | Character, didn't do that great.
00:45:09.440 | I think character did the best of all of them.
00:45:11.720 | Like you have a note in here that we apparently said
00:45:14.240 | that character's price tag was 1B.
00:45:15.600 | - Yeah. - Did I say that?
00:45:16.520 | - Yeah.
00:45:17.360 | You said Google should just buy them for 1B.
00:45:19.240 | I thought it was a crazy number.
00:45:20.360 | Then they paid 2.7.
00:45:21.560 | - I mean, for like, yeah.
00:45:22.840 | What do you pay for now?
00:45:23.680 | Like, I don't know what the beginning was like.
00:45:24.960 | Maybe the starting price was 1B.
00:45:26.760 | (laughs)
00:45:27.600 | I mean, whatever it was,
00:45:28.720 | it worked out for everybody involved.
00:45:31.120 | Multi-modality war. In this one,
00:45:32.960 | we never had text-to-video in the first version,
00:45:35.480 | which now is the hottest.
00:45:37.440 | - Yeah, I would say it's a subset of image, but yes.
00:45:40.040 | - Yeah.
00:45:40.880 | Well, but I think at the time,
00:45:41.720 | it wasn't really something people were doing.
00:45:43.200 | And now we have Veo 2, which just came out yesterday.
00:45:46.720 | Sora was released last week.
00:45:48.800 | - Have you tried Sora?
00:45:49.640 | - I've not tried Sora because the day that I tried-
00:45:52.320 | - Yeah, it wasn't.
00:45:54.000 | I think it's generally available now.
00:45:55.360 | You can go to Sora.com and try it.
00:45:56.880 | - Yeah, and then they had the outage,
00:45:59.640 | which I think also played a part into it.
00:46:02.720 | - Small things, just very important.
00:46:04.600 | - What's the other model that you posted today
00:46:06.400 | that was on Replicate, video-01-live?
00:46:09.040 | - Yeah, very, very nondescript name,
00:46:11.920 | but it is from Minimax,
00:46:14.600 | which I think is a Chinese lab.
00:46:16.640 | The Chinese labs do surprisingly well at the video models.
00:46:20.680 | I'm not sure it's actually Chinese.
00:46:22.120 | I don't know, hold me up to that.
00:46:24.480 | Yep, China.
00:46:25.320 | (laughs)
00:46:26.760 | - No, it's good.
00:46:27.600 | - So, Hailuo.
00:46:28.560 | Yeah, the Chinese love video.
00:46:30.560 | What can I say?
00:46:31.400 | They have a lot of training data for video,
00:46:33.760 | or a more relaxed regulatory environment.
00:46:37.360 | - Well, sure, in some way.
00:46:40.560 | Yeah, I don't think there's much else there.
00:46:42.600 | I think like, you know, on the image side,
00:46:44.440 | I think it's still open.
00:46:45.520 | - Yeah, I mean, 11 Labs, now Unicorn.
00:46:48.520 | So basically, what is multi-modality war?
00:46:50.280 | Multi-modality war is,
00:46:51.840 | do you specialize in a single modality, right?
00:46:55.520 | Or do you have God model that does all the modalities?
00:46:59.040 | Right, so this is definitely still going
00:47:00.880 | in a sense of 11 Labs, you know, now Unicorn.
00:47:04.960 | Pika Labs doing well.
00:47:06.320 | They launched Pika 2.0 recently.
00:47:07.840 | HeyGen, I think, has reached a hundred million ARR.
00:47:10.760 | Assembly, I don't know,
00:47:11.600 | but they have billboards all over the place.
00:47:13.400 | So I assume they're doing very, very well.
00:47:15.240 | So these are all specialist models
00:47:17.440 | and specialist startups.
00:47:18.280 | - And product, especially.
00:47:20.560 | - Yep, and then there's the big labs
00:47:22.560 | who are doing the sort of all-in-one play.
00:47:24.960 | And here I would highlight Gemini 2
00:47:26.880 | for having native image output.
00:47:29.600 | Have you seen the demos?
00:47:30.960 | - No.
00:47:31.800 | - Yeah, it's hard to keep up.
00:47:32.720 | Literally, they launched this last week.
00:47:34.240 | And shout out to Paige Bailey,
00:47:36.120 | who came to the Latent Space event to demo
00:47:38.600 | on the day of launch, and she wasn't prepared.
00:47:41.800 | She was just like, "I'm just gonna show you."
00:47:43.280 | So they have voice.
00:47:44.480 | They have, you know, obviously image input,
00:47:47.600 | and then they obviously can code gen and all that.
00:47:49.400 | But the new one that OpenAI and Meta both have,
00:47:53.120 | but they haven't launched yet,
00:47:54.480 | is image output.
00:47:56.040 | So you can literally, I think their demo video
00:47:58.640 | was that you put in an image of a car
00:48:00.600 | and you ask for minor modifications to that car,
00:48:02.840 | they can generate you that modification
00:48:05.040 | exactly as you asked.
00:48:06.520 | So there's no need for the stable diffusion
00:48:09.840 | or comfy wireflow of like mask here,
00:48:12.360 | and then like infill there, inpaint there,
00:48:14.360 | and all that stuff.
00:48:15.200 | This is small model nonsense.
00:48:17.240 | Big model people are like,
00:48:18.480 | "Huh, we got you everything in the transformer."
00:48:21.640 | This is the multimodality war,
00:48:22.840 | which is, do you bet on the God model?
00:48:25.360 | Or do you string together a whole bunch of small models
00:48:27.680 | like a chump?
00:48:29.160 | - Yeah, I don't know, man.
00:48:30.840 | Yeah, that would be interesting.
00:48:31.920 | I mean, obviously I use Midjourney for all of our thumbnails.
00:48:34.960 | - Yes, still SOTA.
00:48:36.560 | - They've been doing a ton on the product, I would say.
00:48:38.640 | They launched a new Midjourney editor thing.
00:48:41.040 | They've been doing a ton.
00:48:42.360 | Because I think, yeah, the model is kind of like,
00:48:44.320 | maybe, you know, people say Black Forest,
00:48:46.360 | the Black Forest models are better than Midjourney
00:48:48.760 | on a pixel by pixel basis.
00:48:50.720 | But I think when you put it together.
00:48:53.120 | - Have you tried the same problems on Black Forest?
00:48:55.200 | - Yes, but the problem is just like,
00:48:56.640 | you know, on Black Forest, it generates one image.
00:48:58.840 | And then it's like, you got to regenerate.
00:49:00.320 | You don't have all these UI things.
00:49:02.200 | Like what I do-
00:49:03.040 | - Skill issue, bro.
00:49:03.880 | - No, but it's like time issue, you know?
00:49:06.000 | It's like on Midjourney-
00:49:07.120 | - Call the API four times.
00:49:08.760 | - No, but then there's no like variate.
00:49:10.920 | Like the good thing about Midjourney is like,
00:49:13.680 | you just go in there and you're cooking.
00:49:15.680 | There's a lot of stuff that just makes it really easy.
00:49:18.080 | And I think people underestimate that.
00:49:19.560 | Like, it's not really a skill issue
00:49:20.880 | because I'm paying Midjourney.
00:49:21.880 | So it's a Black Forest skill issue
00:49:23.320 | because I'm not paying them, you know?
00:49:25.320 | - So, okay.
00:49:26.160 | So this is a UX thing, right?
00:49:27.840 | Like you understand that at least we think
00:49:30.840 | that Black Forest should be able to do all that stuff.
00:49:33.360 | I will also shout out ReCraft.
00:49:34.760 | Has come out on top of the image arena
00:49:36.600 | that artificial analysis has done.
00:49:38.440 | Has apparently taken Flux's place.
00:49:40.760 | Is this still true?
00:49:41.680 | So artificial analysis is now a company.
00:49:44.040 | I highlighted them, I think,
00:49:45.320 | in one of the early AI news of the year.
00:49:47.920 | And they have launched a whole bunch of arenas.
00:49:50.360 | So they're trying to take on LM Arena,
00:49:52.760 | Anastasios and crew.
00:49:54.000 | And they have an image arena.
00:49:55.120 | Oh yeah, ReCraft V3 has now beaten Flux 1.1,
00:49:58.400 | which is very surprising
00:49:59.760 | 'cause Flux and Black Forest Labs
00:50:01.720 | are the old stable diffusion crew
00:50:03.280 | who left stability after the management issues.
00:50:06.480 | So ReCraft has come from nowhere
00:50:07.560 | to be the top image model.
00:50:08.960 | Very, very strange.
00:50:10.040 | I would also highlight that Grok has now launched Aurora,
00:50:13.400 | which is, it's very interesting dynamics
00:50:15.880 | between Grok and Black Forest Labs
00:50:18.360 | because Grok's images were originally launched
00:50:22.080 | in partnership with Black Forest Labs as a thin wrapper.
00:50:24.560 | And then Grok was like, "No, we'll make our own."
00:50:26.600 | And so they've made their own.
00:50:28.280 | I don't know, there are no APIs or benchmarks about it.
00:50:31.240 | They just announced it.
00:50:32.840 | So yeah, that's the multi-modality war.
00:50:35.480 | I would say that so far the small model,
00:50:38.400 | the dedicated model people are winning
00:50:39.960 | because they are just focused on their tasks.
00:50:42.120 | But the big model people are always catching up.
00:50:45.440 | And the moment I saw the Gemini 2 demo of image editing
00:50:49.760 | where I can put an image and just request it and it does,
00:50:52.080 | that's how AI should work.
00:50:53.360 | Not like a whole bunch of complicated steps.
00:50:55.960 | So it really is something.
00:50:58.000 | And I think one frontier that we haven't seen this year,
00:51:00.440 | like obviously video has done very well
00:51:02.320 | and it will continue to grow.
00:51:03.920 | You know, when we have the release of Sora Turbo today,
00:51:06.960 | but at some point we'll get full Sora,
00:51:09.080 | or at least the Hollywood Labs will get full Sora.
00:51:11.040 | We haven't seen video to audio or video synced with audio.
00:51:14.480 | And so the researchers that I talked to
00:51:16.400 | are already starting to talk about that
00:51:17.520 | as the next frontier.
00:51:18.520 | But there's still maybe like five more years of video left
00:51:21.320 | to actually be Sora.
00:51:23.480 | I would say that Gemini's approach compared to OpenAI,
00:51:26.720 | Gemini seems, or DeepMind's approach to video
00:51:29.560 | seems a lot more fully-fledged than OpenAI.
00:51:32.800 | Because if you look at the ICML recap that I published
00:51:36.080 | that so far nobody has listened to,
00:51:37.880 | (laughs)
00:51:40.480 | that people have listened to it.
00:51:41.600 | It's just a different,
00:51:42.440 | definitely a different audience.
00:51:43.280 | - It's only seven hours long.
00:51:44.400 | - It's only seven hours.
00:51:45.240 | - Why are people not listening?
00:51:46.080 | - Basically everything in one video is good.
00:51:48.400 | So DeepMind is working on Genie.
00:51:50.360 | They also launched Genie 2 and VideoPoet.
00:51:52.480 | So like they have maybe four years advantage
00:51:56.080 | on world modeling that OpenAI does not have.
00:51:58.520 | 'Cause OpenAI basically only started
00:52:00.280 | Diffusion Transformers last year
00:52:01.640 | when they hired Bill Peebles.
00:52:03.200 | So DeepMind has a bit of advantage here,
00:52:05.600 | I would say, in showing, like the reason for Veo 2,
00:52:09.360 | well, one, they cherry-picked their videos.
00:52:11.480 | So obviously it looks better than Sora.
00:52:13.000 | But the reason I would believe that Veo 2,
00:52:15.880 | when it's fully launched, will do very well
00:52:18.280 | is because they have all this background work
00:52:20.160 | in video that they've done for years.
00:52:22.200 | Like last year's NeurIPS,
00:52:23.480 | I already was interviewing some of their video people.
00:52:25.840 | I forget their model name,
00:52:27.280 | but for people who are dedicated fans,
00:52:29.280 | they can go to NeurIPS 2023 and see that paper.
00:52:32.480 | - And then last but not least, the LLM OS slash RAG Ops,
00:52:37.480 | formerly known as the RAG Ops War.
00:52:40.520 | I put the latest chart on the Brain Trust episode.
00:52:43.200 | I think I'm gonna separate these essays
00:52:46.200 | from the episode notes.
00:52:47.840 | So the reason I used to do that, by the way,
00:52:49.160 | is 'cause I wanted to show up on Hacker News.
00:52:50.920 | I want the podcast to show up on Hacker News.
00:52:52.920 | So I always put an essay inside of there
00:52:54.480 | because at Hacker News, people like to read and not listen.
00:52:57.960 | - So episode essays, I don't know about you,
00:52:59.760 | we're just doing them separately.
00:53:01.200 | - You say LangChain, LlamaIndex is still growing.
00:53:03.040 | - Yeah, so I looked at the PyPI stats.
00:53:06.000 | I don't care about stars.
00:53:08.280 | On PyPI, you see-
00:53:09.400 | - Do you wanna share your screen?
00:53:11.200 | - Yes.
00:53:12.040 | I prefer to look at actual downloads,
00:53:14.480 | not at stars on GitHub.
00:53:16.920 | So if you look at, you know,
00:53:18.480 | LangChain still growing.
00:53:20.280 | These are the last six months, LlamaIndex still growing.
00:53:23.760 | What I've basically seen is like things that,
00:53:26.600 | one, obviously these things have a commercial product.
00:53:29.760 | So there's like people buying this and sticking with it
00:53:32.200 | versus kind of hopping in between things
00:53:34.360 | versus, you know, for example,
00:53:35.960 | crew.ai, not really growing as much.
00:53:38.280 | The stars are growing.
00:53:39.600 | If you look on GitHub, the stars are growing,
00:53:41.600 | but kind of like the usage is kind of like flat
00:53:44.160 | in the last six months.
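For reference, the download comparison is easy to reproduce; here's a rough sketch against the public pypistats.org JSON API (endpoint shape is from memory, so verify before relying on it):

```python
import requests

packages = ["langchain", "llama-index", "crewai"]
for pkg in packages:
    # pypistats.org exposes recent download counts per package as JSON.
    r = requests.get(f"https://pypistats.org/api/packages/{pkg}/recent", timeout=10)
    r.raise_for_status()
    downloads = r.json()["data"]["last_month"]
    print(f"{pkg:>12}: {downloads:,} downloads in the last month")
```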
00:53:45.360 | - Have they done some kind of a reorg
00:53:47.160 | where they did like a split of packages
00:53:49.920 | and now it's like a bundle of packages?
00:53:51.600 | Sometimes that happens, you know.
00:53:52.960 | - I didn't see that.
00:53:54.480 | - I can see both.
00:53:55.320 | I can see both happening.
00:53:56.840 | The crew.ai is very loud, but not used.
00:53:59.880 | And then-
00:54:00.720 | - Yeah, but anyway, to me, it's just like-
00:54:02.960 | - Yeah, there's no split.
00:54:04.360 | I mean, similar with AutoGPT is like,
00:54:07.200 | there's still a wait list for AutoGPT to be used.
00:54:10.680 | - Yeah, they're still kicking.
00:54:12.560 | They announced some stuff recently.
00:54:14.440 | - But I think that's another one.
00:54:15.800 | It's the fastest growing project in the history of GitHub.
00:54:18.360 | But I think, you know, when you maybe like run the numbers
00:54:21.520 | on like the value of the stars
00:54:23.400 | and like the value of the hype,
00:54:24.480 | I think in AI you see this a lot,
00:54:25.920 | which is like a lot of stars, a lot of interest
00:54:28.320 | at a rate that you didn't really see in the past
00:54:30.120 | in open source where nobody's rushing to star,
00:54:33.400 | you know, a NoSQL database.
00:54:35.120 | It's kind of like just to be able to actually use it.
00:54:37.040 | - Yeah.
00:54:37.880 | I think one thing that's interesting here,
00:54:40.320 | one obviously is that in AI,
00:54:42.000 | you kind of get paid to promise things.
00:54:43.760 | - Yeah.
00:54:44.600 | - And then you, to deliver them, you know,
00:54:45.920 | people have a lot of patience.
00:54:47.160 | I think that patience has come down over time.
00:54:49.720 | One example here is Devin, right, this year,
00:54:51.920 | where a lot of promise in March
00:54:53.320 | and then it took nine months to get to GA.
00:54:56.920 | But I think people are still coming around now on Devin.
00:54:59.480 | Devin's product has improved a little bit
00:55:01.200 | and even you're going to be a paying customer.
00:55:03.480 | So I think something Devin-like will work.
00:55:05.480 | I don't know if it's Devin itself.
00:55:07.360 | The Auto-GPT has an interesting second layer
00:55:10.040 | in terms of what I think is the dynamics going on here,
00:55:13.240 | which is a very AI-specific layer.
00:55:15.960 | Over-promising, under-delivering applies to any startup.
00:55:18.800 | But for AI specifically,
00:55:20.920 | there's this promise of generality
00:55:23.120 | that I can do anything, right?
00:55:24.840 | So Auto-GPT's initial problem was making money,
00:55:27.960 | like increase my net worth.
00:55:29.160 | And I think that means that there's a lot of broad interest
00:55:32.400 | from a lot of different people
00:55:33.480 | who are trying to do all different things
00:55:34.760 | on this one project.
00:55:35.920 | So that's why this concentrates a lot of stars.
00:55:38.320 | And then obviously, because it does too much,
00:55:40.280 | maybe, or it's not focused enough,
00:55:42.000 | then it fails to deploy.
00:55:44.160 | So that would be my explanation
00:55:46.240 | for why the interest to usage ratio is so low.
00:55:49.640 | And the second one is obviously pure execution.
00:55:51.840 | Like the team needs to have a vision
00:55:53.720 | and execute like half the core team left
00:55:56.040 | right after AI Engineer Summit last year.
00:55:58.640 | (laughs)
00:56:00.600 | That will be my explanation as to why,
00:56:01.880 | like this promise of generality works
00:56:04.840 | basically only for ChatGPT.
00:56:06.560 | - Right.
00:56:07.400 | - And maybe for this year, Notebook LM.
00:56:09.960 | Like sticking anything in there, it'll mostly be correct.
00:56:12.560 | And then for basically everyone else,
00:56:14.080 | it's like, you know, we will help you complete code.
00:56:17.400 | We will help you with your PR reviews, like small things.
00:56:20.400 | - Yeah, yeah, yeah.
00:56:21.240 | All right, code interpreting,
00:56:22.680 | we talked about a bunch of times
00:56:25.040 | we soft announced the E2B fundraising on this podcast.
00:56:29.800 | Code Sandbox got acquired by Together AI last week,
00:56:33.880 | which they're now also going to offer as an API.
00:56:36.200 | So more and more activity, which is great.
00:56:39.800 | Yeah, and then in the last step, two episodes ago with Bolt,
00:56:43.240 | we talked about the web container stuff
00:56:45.080 | that we're working on.
00:56:46.440 | I think like there's maybe the spectrum of code interpreting,
00:56:50.280 | which is like, you know, dedicated SDK.
00:56:53.080 | There's like, yeah, the Modals of the world,
00:56:55.320 | which is like, hey, we got a sandbox.
00:56:57.400 | Now you just kind of run the commands
00:56:58.760 | and orchestrate all of that.
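As a rough illustration of the "give the model a sandbox" loop, here's a toy sketch using a plain subprocess; a real setup would run this inside an isolated sandbox service (E2B, Modal, a container, and so on) rather than on the host machine:

```python
import subprocess

def run_in_sandbox(command: str, timeout: int = 10) -> str:
    """Run a shell command and return combined stdout/stderr (illustration only)."""
    proc = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return proc.stdout + proc.stderr

# In a real agent loop, the command would come from the model's tool call,
# and the output would be fed back into the conversation for the next turn.
print(run_in_sandbox("python3 -c 'print(2 ** 32)'"))
```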
00:57:00.240 | I think this is one of the, I mean,
00:57:01.600 | E2B's growth has just been crazy,
00:57:03.520 | just because, I mean, everybody needs to run code, right?
00:57:06.640 | And I think now all the products
00:57:09.320 | and everybody's graduating to like, okay,
00:57:12.160 | it's not enough to just do chat.
00:57:13.840 | So Perplexity, which is an E2B customer,
00:57:16.200 | they do all these nice charts for like finance
00:57:18.480 | and all these different things.
00:57:19.400 | It's like the products are maturing.
00:57:21.720 | And I think this is becoming more and more
00:57:23.200 | of kind of like a hair on fire problem, so to speak.
00:57:26.360 | So yeah, excited to see more.
00:57:28.240 | And this was one that really wasn't on the radar
00:57:30.880 | when we first wrote "The Four Wars."
00:57:33.360 | - Yeah.
00:57:34.200 | I think mostly because I was trying to limit it
00:57:36.840 | to RAG Ops, but I think now that the frontier has expanded
00:57:41.600 | in terms of the core set of tools,
00:57:44.280 | core set of tools would include code interpreting,
00:57:47.400 | like tools that every agent needs, right?
00:57:49.760 | And Graham in his state of agents talk had this as well,
00:57:54.000 | which is kind of interesting for me,
00:57:55.600 | 'cause like everyone finds the same set of things.
00:57:58.520 | So it's basically like, everyone needs web browsing,
00:58:01.040 | everyone needs code interpreting,
00:58:03.240 | and then everyone needs some kind of memory or planning
00:58:06.640 | or whatever that is.
00:58:08.920 | We'll discover this more over time,
00:58:10.280 | but I think this is what we've discovered so far.
00:58:12.640 | I will also call out Morphlabs
00:58:14.560 | for launching a time travel VM.
00:58:16.920 | I think that basically the statefulness of these things
00:58:21.360 | needs to be locked down a lot.
00:58:24.480 | Basically, you can't just spin up a VM,
00:58:26.920 | run code on it, and then kill it.
00:58:28.480 | It's because sometimes you might need to time travel back,
00:58:31.560 | like unwind or fork to explore different paths
00:58:34.760 | for sort of like a tree search approach
00:58:36.600 | to your agent development.
00:58:38.400 | I would call out the newer ones,
00:58:40.360 | the new implementations as the emerging frontier
00:58:42.400 | in terms of like what people kind of are going to need
00:58:44.600 | for agents to do very fan out approaches
00:58:47.600 | to all these sort of code execution.
00:58:49.760 | And then I'll also call out that I think ChatGPT Canvas
00:58:52.800 | with what they launched in the 12 days of shipments
00:58:55.800 | that they announced
00:58:56.640 | has surprisingly superseded Code Interpreter.
00:58:59.600 | Like Code Interpreter was last year's thing.
00:59:01.680 | And now Canvas can also write code and also run code
00:59:04.320 | and do more than Code Interpreter used to do.
00:59:06.080 | So right now it has not killed it.
00:59:07.640 | So there's a toggle box for Canvas and for Code Interpreter
00:59:11.160 | when you create a new custom GPT.
00:59:13.000 | My old thesis is that custom GPTs are your roadmap for investing
00:59:15.760 | 'cause it's what everyone needs.
00:59:17.360 | So now there's a new box called Canvas
00:59:19.040 | that everyone has access to,
00:59:20.880 | but basically there's no reason
00:59:22.680 | why you should use Code Interpreter over Canvas.
00:59:24.720 | Like Canvas has incorporated the diff mode
00:59:27.480 | that both Anthropic and OpenAI and Fireworks has now shipped
00:59:31.960 | that I think is going to be the norm for next year,
00:59:34.600 | that everyone needs some kind of diff mode
00:59:37.840 | Code Interpreter thing.
00:59:38.680 | Like Aider was also very early to this.
00:59:40.360 | Like the Aider benchmarks were also all based on diffs
00:59:43.520 | and Cursor as well.
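To make the diff-mode idea concrete, here's a tiny generic sketch of applying a search/replace style edit; the format is illustrative, not any particular tool's exact spec:

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Apply a single targeted edit; refuse if the anchor text is missing."""
    if search not in source:
        raise ValueError("search block not found; edit cannot be applied safely")
    return source.replace(search, replace, 1)

original = "def greet(name):\n    print('hello ' + name)\n"
edited = apply_search_replace(
    original,
    search="    print('hello ' + name)\n",
    replace="    print(f'hello {name}')\n",
)
print(edited)
```

The point of diffs over whole-file regeneration is that the model only has to emit the changed span, which is cheaper and far less likely to silently clobber unrelated code.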
00:59:44.600 | - Yeah, you want to talk about memory?
00:59:46.600 | - Memory, you think it's not real.
00:59:48.960 | - Yeah, I just don't.
00:59:49.920 | I think most memory product today
00:59:53.040 | just like a summarization and extraction.
00:59:56.480 | - Yeah. - I don't think there is.
00:59:57.320 | - They're very immature.
00:59:58.440 | - Yeah, there's no implicit memory.
01:00:01.320 | It's not explicit memory of what you've written.
01:00:03.400 | There's no implicit extraction of like,
01:00:06.240 | oh, you said no to this, you said no to this 10 times,
01:00:09.480 | so you don't like going on hikes at 6 a.m.
01:00:12.840 | Like it doesn't, none of the memory products do that.
01:00:16.000 | They'll summarize what you say explicitly.
01:00:18.280 | - When you say memory products,
01:00:19.200 | you mean the startups that are more offering
01:00:21.360 | memory as a service?
01:00:22.400 | - Yeah, or even like, you know, Lindy has like memories,
01:00:24.640 | you know, it's like based on what I say, it remembers it.
01:00:28.480 | So it's less about making an actual memory of my preference.
01:00:31.800 | It's more about what explicitly said.
01:00:33.760 | And I'm trying to figure out at what level that gets solved.
01:00:37.400 | You know, like is it, do these memory products
01:00:40.240 | like the MMPTs of the world create a better way
01:00:43.320 | to implicitly extract preference
01:00:46.000 | or can that be done very well?
01:00:48.280 | You know, I think that's why I don't think,
01:00:49.680 | it's not that I don't think memory is real.
01:00:51.520 | I just don't think that like the approaches today
01:00:54.440 | are like actually memory or what you need a system to have.
01:00:57.560 | - Yeah, I would actually agree with that.
01:00:59.720 | But I would just point it to it being immature
01:01:02.200 | rather than not needed.
01:01:04.400 | Like it's clearly something that we will want at some point.
01:01:07.280 | And so the people developing it now are, you know,
01:01:11.280 | not very good at it.
01:01:12.320 | And I would definitely predict that next year
01:01:14.760 | will be better and the year after that
01:01:16.560 | will be better than that.
01:01:17.600 | I definitely think that last time
01:01:19.280 | we had the "Shouldn't You" pod with Harrison as guest host,
01:01:22.040 | I over-focused on LangMEM as a separate product.
01:01:24.480 | He has now rolled it into LangGraph
01:01:26.480 | as a memory service, the same API.
01:01:28.520 | And I think that everyone will need some kind of memory.
01:01:31.800 | And I think that this has distinguished itself now
01:01:35.200 | as a separate need from a normal rag vector database.
01:01:38.160 | In fact, you will need a memory layer,
01:01:39.560 | whether it's on top of a vector database or not,
01:01:41.080 | it's up to you.
01:01:42.000 | A memory database and a vector database
01:01:43.560 | are kind of two different things.
01:01:45.000 | Like I've had to justify this so much, actually,
01:01:47.040 | that I have a draft post in the "Latent Space" dashboard
01:01:50.040 | that basically says like,
01:01:51.680 | what is the difference between memory and knowledge?
01:01:53.920 | And to me, it's very clear.
01:01:54.760 | It's like, knowledge is about the world around you.
01:01:57.040 | And like, there's knowledge that you have,
01:01:58.600 | which is the rag corpus that your,
01:02:00.880 | maybe your company docs or whatever.
01:02:02.520 | And then there's external knowledge,
01:02:03.840 | which is the stuff that you Google.
01:02:05.080 | So you use something like Exa, whatever.
01:02:07.360 | And then there's memory,
01:02:09.040 | which is my interactions with you over time.
01:02:11.200 | Both can be represented by vector databases
01:02:13.160 | or knowledge graphs, doesn't really matter.
01:02:15.200 | Time is a specifically important one in memory
01:02:18.480 | because you need a decay function.
01:02:19.920 | And then you also need like a review function.
01:02:22.440 | A lot of people are implementing this as sleep.
01:02:24.560 | Like when you sleep, you like literally,
01:02:26.360 | you sort of process the day's memories
01:02:28.560 | and you come up with new insights that you then persist
01:02:30.880 | and bring into context in the future.
01:02:32.480 | So I feel like this is being developed.
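As a rough illustration of what a decay function plus a "sleep" consolidation pass might look like, here's a hypothetical sketch; the class names and scoring are made up, and a real system would combine this with semantic retrieval:

```python
import time
from dataclasses import dataclass, field

@dataclass
class Memory:
    text: str
    importance: float = 1.0
    created_at: float = field(default_factory=time.time)

    def score(self, half_life_days: float = 30.0) -> float:
        # Exponential time decay: old memories fade unless they were important.
        age_days = (time.time() - self.created_at) / 86400
        return self.importance * 0.5 ** (age_days / half_life_days)

class MemoryStore:
    def __init__(self) -> None:
        self.memories: list[Memory] = []

    def add(self, text: str, importance: float = 1.0) -> None:
        self.memories.append(Memory(text, importance))

    def recall(self, k: int = 5) -> list[str]:
        # Rank by decayed importance; real systems also use semantic similarity.
        ranked = sorted(self.memories, key=lambda m: m.score(), reverse=True)
        return [m.text for m in ranked[:k]]

    def sleep(self, summarize) -> None:
        # "Sleep" pass: compress recent raw memories into one durable insight.
        insight = summarize([m.text for m in self.memories])
        self.add(insight, importance=2.0)
```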
01:02:35.160 | LangGraph has a version of this. Zep
01:02:37.080 | is another one that's based on Neo4j's knowledge graph
01:02:39.480 | that has a version of this.
01:02:41.440 | MemGPT used to have this, but I think,
01:02:43.400 | I feel like Letta, since it was funded by Quiet Capital,
01:02:47.440 | has broadened out into more of a sort of
01:02:49.960 | general LLMOS type startup,
01:02:52.440 | which I feel like there's a bunch of those now.
01:02:54.080 | There's All Hands and all this.
01:02:55.440 | - Do you think this is a LLMOS product
01:02:57.520 | or should it be a consumer product?
01:02:59.360 | - I think it's a building block.
01:03:00.280 | I think every, I mean, there should,
01:03:02.320 | just like every consumer product is going to have a,
01:03:07.320 | going to eventually want a gateway,
01:03:09.920 | you know, for managing their requests.
01:03:11.920 | An ops tool, you know, that kind of stuff.
01:03:14.240 | Code interpreter for maybe not exposing the code,
01:03:16.760 | but executing code under the hood, for sure.
01:03:18.920 | So it's going to want memory.
01:03:20.440 | It's going to want long-lived memory.
01:03:21.360 | So as a consumer, let's say you are a new.computer,
01:03:25.360 | who, you know, they've launched their own little agents,
01:03:28.960 | or if you're a friend.com,
01:03:30.640 | you're going to want to invest in memory
01:03:32.120 | at some point.
01:03:33.040 | Maybe it's not today.
01:03:34.040 | Maybe you can push it off a lot further
01:03:36.080 | with like a million token context,
01:03:37.920 | but at some point you need to compress your memory
01:03:41.080 | and to selectively retrieve it.
01:03:43.040 | And then what are you going to do?
01:03:45.080 | You have to reinvent the whole memory stack.
01:03:47.560 | And these guys have been doing it for a year now.
01:03:49.840 | - Yeah, to me, it's more like I want to bring the memories.
01:03:53.080 | It's almost like they're my memories, right?
01:03:55.000 | So why-
01:03:55.840 | - So you selectively choose the memory to bring in.
01:03:57.760 | - Why does every time that I go to a new product,
01:04:00.120 | it needs to relearn everything about me?
01:04:01.560 | - Okay, you want portable memories.
01:04:02.920 | - Yeah, is it like a protocol?
01:04:04.960 | Like how does that work?
01:04:06.960 | - Speaking of protocols,
01:04:08.080 | Anthropic's model context protocol that they launched
01:04:10.200 | has a 300 line of code memory implementation.
01:04:12.760 | Very simple, very bad news for all the memory startups,
01:04:15.560 | but that's all you need.
01:04:19.040 | And yeah, it would be nice to have a portable memory of you
01:04:22.200 | to ship to everyone else.
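To make the portable-memory idea concrete, here's a purely hypothetical sketch of what an exported memory file could contain; no such standard exists today, and every field name here is made up:

```python
# Illustrative only: a made-up export format for "bring your own memories".
portable_memory = {
    "version": "0.1",
    "subject": "alessio",
    "exported_at": "2024-12-17T00:00:00Z",
    "memories": [
        {
            "text": "Prefers concise answers with code examples.",
            "kind": "preference",   # explicit statement vs. inferred behavior
            "importance": 0.8,
            "last_confirmed": "2024-12-01",
        },
        {
            "text": "Declined early-morning hikes repeatedly.",
            "kind": "inferred",
            "importance": 0.4,
            "last_confirmed": "2024-11-20",
        },
    ],
}
```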
01:04:23.440 | Simple answer is there's no standardization for a while
01:04:25.840 | because everyone will experiment with their own stuff.
01:04:28.480 | And I think Anthropic's success with MCP
01:04:31.440 | suggests that basically no one else
01:04:34.120 | but the big labs can do it
01:04:35.400 | because no one else has the sway to do this.
01:04:38.160 | Then that's how it's going to be.
01:04:39.720 | (laughing)
01:04:41.280 | Like, unless you have something silly,
01:04:43.600 | like, okay, one form of standardization
01:04:46.840 | basically came from Georgi Gerganov with llama.cpp, right?
01:04:50.520 | And that was completely open source, completely bottoms up.
01:04:53.040 | And that's because there's just a significant amount of work
01:04:54.880 | that needed to be done there.
01:04:55.880 | And then people build off from there.
01:04:57.280 | Another form of standardization is ComfyUI
01:04:59.000 | from comfyanonymous.
01:05:00.320 | So like that kind of standardization can be done.
01:05:03.040 | So someone basically has to create that
01:05:05.880 | for the role-play community
01:05:09.040 | because those are the people with the longest memories.
01:05:11.880 | Right now, the role-play community,
01:05:13.440 | as far as I understand it,
01:05:14.280 | I've looked at SillyTavern, I've looked at Kobold,
01:05:16.480 | they only share character cards.
01:05:18.720 | And there's like four or five different
01:05:20.560 | standardized versions of these character cards,
01:05:22.160 | but nobody has exportable memory yet.
01:05:24.880 | If there was anyone that developed memory first,
01:05:27.000 | that became a standard, it would be those guys.
01:05:28.880 | - Cool, I'm excited to see what people built.
01:05:31.800 | - Benchmarks. - Okay.
01:05:33.160 | - One of our favorite pet topics.
01:05:34.960 | - Yeah, yeah.
01:05:35.840 | So basically, I just wanted to mention this briefly.
01:05:39.080 | Like, I think that in a year, end of year review,
01:05:41.800 | it's useful to remind everybody where we were.
01:05:44.440 | So we talked about how in LMS's ELO,
01:05:47.200 | everyone has gone up and it's a very close race.
01:05:49.960 | And I think benchmarks as well.
01:05:51.880 | I was looking at the OpenAI live stream today
01:05:55.400 | when they introduced O1 API
01:05:57.520 | with structured output and everything.
01:05:59.320 | And the benchmarks they're talking about
01:06:01.080 | are like completely different than the benchmarks
01:06:04.440 | that we were talking about this time last year.
01:06:07.280 | This time last year, we were still talking about MMLU.
01:06:10.200 | Little bit of, there's still like GSM8K.
01:06:12.680 | There's stuff that's basically in V1
01:06:16.440 | of the Hugging Face Open LLM Leaderboard, right?
01:06:18.880 | We talked to Clementine about the decisions that she made
01:06:22.280 | to upgrade to V2.
01:06:24.600 | I will also say LMSys, now LM Arena,
01:06:27.360 | also has emerged this year as the leading like battlegrounds
01:06:31.520 | between the big frontier labs.
01:06:33.560 | But also, we have also seen like the emergence
01:06:35.680 | of SWE-Bench, LiveBench, MMLU Pro, and AIME.
01:06:39.360 | AIME specifically for O1.
01:06:41.040 | It will be interesting to see like top most cited benchmarks
01:06:44.280 | of the year from 2020 to 2021, two, three, four,
01:06:49.120 | and then going to five.
01:06:50.480 | And you can see what has been saturated and solved
01:06:53.240 | and what people care about now.
01:06:55.400 | And so now people care a lot about frontier math and coding,
01:06:58.120 | right?
01:06:58.960 | There's literally a benchmark called FrontierMath,
01:07:00.320 | which I spent a bit of time talking about at NeurIPS.
01:07:03.600 | There's AIME, there's LiveBench,
01:07:05.520 | there's MMLU Pro, and there's SWE-Bench.
01:07:06.920 | I feel like this is good.
01:07:09.120 | And then there was another one,
01:07:11.960 | this time last year, it was GPQA.
01:07:14.200 | I will put math and GPQA here
01:07:16.320 | as sort of top benchmarks of last year.
01:07:19.320 | At NeurIPS, GPQA was declared dead, which is very sad.
01:07:22.360 | People were still talking about GPQA Diamond.
01:07:24.240 | So literally the name of GPQA
01:07:26.040 | is called Google Proof Question Answering.
01:07:28.280 | So it's supposed to be resistant to saturation for a while.
01:07:31.880 | And Noam Brown said that GPQA was dead.
01:07:34.480 | So now we only care about SWE-Bench, LiveBench,
01:07:36.560 | MMLU Pro, AIME, and even SWE-Bench.
01:07:38.400 | We don't care about SWE-Bench proper.
01:07:39.920 | We care about SWE-Bench Verified.
01:07:42.360 | We care about the SWE-Bench Multimodal.
01:07:44.800 | And then we also care about the new Konwinski Prize
01:07:48.080 | from Andy Konwinski,
01:07:48.920 | which is the guy that we talked to yesterday,
01:07:50.520 | who has launched a similar sort of ARC-AGI-style attempt
01:07:53.760 | on a SWE-Bench-type metric,
01:07:56.040 | which arguably is a bit more useful.
01:07:58.360 | OpenAI also has MLEBench,
01:08:00.040 | which is more tracking sort of ML research and bootstrapping,
01:08:04.280 | which arguably like this is the key metric
01:08:06.320 | that is most relevant for the Frontier Labs,
01:08:08.560 | which is when the researchers can automate their own jobs.
01:08:11.240 | So that is a kink in the acceleration curve
01:08:13.800 | if we were ever to reach that.
01:08:15.240 | - Yeah, that makes sense.
01:08:16.080 | I mean, I'm curious.
01:08:17.240 | I think Dylan at the debate, he said SWE-Bench 80%
01:08:22.160 | was like the SOTA for end of next year.
01:08:24.680 | - As a kind of like, you know, watermark.
01:08:27.240 | - Yeah, yeah, yeah.
01:08:28.080 | And keep in mind we started the year at 13%.
01:08:30.440 | - Yeah, exactly.
01:08:31.280 | - And so now we're about 50.
01:08:33.000 | OpenHands is around there.
01:08:34.440 | And yeah, 80 sounds fine.
01:08:36.320 | Konwinski Prize is 90.
01:08:37.760 | - Yeah.
01:08:38.600 | And then as we get to 100 and the open source catches up.
01:08:41.160 | - Oh yeah, magically gonna close the gap
01:08:43.960 | between the closed source and open source.
01:08:45.320 | So basically I think my advice to people
01:08:46.760 | is keep track of the slow cooking of the benchmark landscape
01:08:51.760 | because the labs that are not that frontier
01:08:53.920 | will keep measuring themselves on last year's benchmarks.
01:08:57.720 | And then the labs that are actually frontier
01:08:59.200 | will tell you about benchmarks you've never heard of.
01:09:01.200 | And you'll be like,
01:09:02.040 | oh, like, okay, there's new territory to go on.
01:09:05.120 | That will be the quick tip there.
01:09:06.600 | Yeah, maybe I won't belabor this point too much.
01:09:09.800 | I was also saying maybe VL has introduced
01:09:11.800 | some new video benchmarks, right?
01:09:13.080 | Like basically every new frontier capabilities
01:09:15.480 | and this is the next section that we're gonna go into
01:09:17.440 | introduces new benchmarks.
01:09:19.000 | We'll also briefly talk about RULER as like the new sort of,
01:09:22.680 | last year it was like needle in a haystack
01:09:24.240 | and RULER is basically a multi-dimensional needle
01:09:26.440 | in a haystack.
01:09:27.280 | - Yeah, we'll link on the episodes.
01:09:29.240 | - Yeah, this is like a review
01:09:30.360 | of all the episodes that we've done,
01:09:31.920 | which I have in my head.
01:09:32.920 | This is one of the slides that I did on my Dev Day talk.
01:09:34.960 | So we're moving on from benchmarks to capabilities.
01:09:37.600 | And I think I have a useful categorization
01:09:39.800 | that I've been trying to sell.
01:09:40.640 | I'd be curious on your feedback or edits.
01:09:43.280 | I think there's basically like,
01:09:45.560 | I kind of like the thought spot model of what's mature,
01:09:49.720 | what's emerging, what's frontier, what's niche.
01:09:51.480 | So mature is like stuff that you can just rely on
01:09:53.560 | in production, it's solved, everyone has it.
01:09:55.480 | So what's solved is general knowledge, MMLU.
01:09:57.960 | And then what's solved is kind of long context.
01:09:59.640 | Everyone has 128K.
01:10:01.840 | Today, O1 announced 200K, which is very expensive.
01:10:05.520 | I don't know what the price is there.
01:10:07.280 | What's solved, kind of solved is RAG.
01:10:09.160 | There's like 18 different kinds of RAG,
01:10:10.440 | but it's mostly solved.
01:10:12.200 | Batch transcription, I would say Whisper
01:10:14.120 | is something that you should be using
01:10:16.080 | on as much as possible.
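A minimal batch transcription sketch with the open-source whisper package (pip install openai-whisper); model size and file paths are placeholders:

```python
import glob
import whisper

model = whisper.load_model("base")  # model size is a placeholder
for path in glob.glob("episodes/*.mp3"):  # hypothetical file layout
    result = model.transcribe(path)
    with open(path + ".txt", "w") as f:
        f.write(result["text"])
    print(f"transcribed {path}")
```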
01:10:17.600 | And then code generation, and kind of solved.
01:10:19.480 | There's different tiers of code generation
01:10:21.480 | and I really need to split out.
01:10:22.920 | - Yeah, yeah, yeah.
01:10:23.840 | - Single line autocomplete versus multi-file generation.
01:10:27.120 | I think that is definitely emerging.
01:10:29.240 | So on the emerging side, tool use,
01:10:30.760 | I would still consider emerging,
01:10:32.280 | maybe more mature already,
01:10:34.120 | but they only launched for short output this year.
01:10:36.120 | - Yeah, yeah, yeah.
01:10:36.960 | I think emerging is fine.
01:10:38.560 | - Vision language models.
01:10:39.400 | Everyone has vision now, I think, including O1.
01:10:42.040 | So this is clear.
01:10:43.280 | A subset of vision is PDF parsing.
01:10:45.880 | And I think the community is very excited
01:10:48.280 | about the work being done with ColPali and ColQwen.
01:10:52.000 | - What's for you the break point for vision to go to mature?
01:10:55.080 | - I think it's basically now.
01:10:57.360 | This is maybe two months old.
01:10:59.040 | - Yeah, yeah, yeah.
01:11:00.640 | - NVIDIA, most valuable company in the world.
01:11:02.800 | Also, I think this was in June.
01:11:05.600 | Then also they surprised a lot on the upside
01:11:08.120 | for their Q3 earnings.
01:11:10.760 | I think the quote that I highlighted in AI News
01:11:13.960 | was that it is the best,
01:11:16.080 | like Blackwell is the best selling series
01:11:18.000 | in the history of the company.
01:11:19.560 | And they're sold.
01:11:20.400 | I mean, obviously they're always sold out,
01:11:21.640 | but for him to make that statement,
01:11:23.400 | I think it's another indication
01:11:25.520 | that the transition from the H to the B series
01:11:28.840 | is going to go very well.
01:11:29.960 | - Yeah.
01:11:30.800 | I mean, if you had just bought NVIDIA
01:11:32.720 | and tried to beat the game out, that would be insane.
01:11:35.240 | - Yeah.
01:11:36.600 | Which won more, NVIDIA or Bitcoin?
01:11:39.040 | I think NVIDIA.
01:11:40.560 | - I think in gains, yeah.
01:11:41.560 | - Well, I think the question is like,
01:11:42.600 | people ask me like,
01:11:43.960 | what's the reason to not invest in NVIDIA?
01:11:45.720 | I think it's really just like,
01:11:47.520 | they have committed to this.
01:11:48.800 | They went for a two year cycle to one year cycle, right?
01:11:50.880 | And so it takes one misstep to delay.
01:11:53.920 | You know, like there have been delays in the past
01:11:55.960 | and like when delays happen,
01:11:57.120 | they're typically very good buying opportunities.
01:11:59.440 | Anyway.
01:12:00.280 | - Hey, this is swyx from the editing room.
01:12:03.760 | I actually just realized
01:12:07.200 | that we lost about 15 minutes of audio and video
01:12:10.480 | that was in the episode that we shipped
01:12:12.920 | and I'm just cutting it back in and rerecording.
01:12:15.040 | We don't have time to rerecord before the end of the year.
01:12:17.040 | It's December 31st already.
01:12:19.160 | So I'm just going to do my best to recover what we have
01:12:23.160 | and then sort of segue you in nicely to the end.
01:12:26.440 | So our plan was basically to cover
01:12:28.520 | like what we felt was emerging capabilities,
01:12:30.960 | frontier capabilities, and niche capabilities.
01:12:33.160 | So emerging would be tool use, vision language models,
01:12:36.440 | which you just heard, real-time transcription,
01:12:38.760 | which I have on one of our upcoming episodes to be,
01:12:42.880 | as well as you can try it in Whisper WebGPU,
01:12:45.640 | which is amazing.
01:12:46.880 | I think diarization capabilities are also maturing as well,
01:12:50.680 | but still way too hard to do properly.
01:12:53.760 | Like we had to do a lot of stuff
01:12:55.720 | for the latent space transcripts to come out right.
01:12:59.040 | I think maybe, you know,
01:13:00.520 | Dwarkesh recently has been talking about
01:13:02.280 | how he's using Gemini 2.0 Flash to do it.
01:13:05.040 | And I think that might be a good effort,
01:13:07.160 | a good way to do it.
01:13:08.560 | And especially if there's crosstalk involved,
01:13:10.520 | that might be really good,
01:13:11.760 | but there might be other reasons
01:13:14.360 | to use normal diarization models as well.
01:13:17.200 | Specifically pyannote.
01:13:18.680 | Text and image, we talked about a lot,
01:13:20.400 | so I'm just going to skip.
01:13:21.320 | And then we go to frontier, which, you know,
01:13:23.480 | I think like basically I would say is on the horizon,
01:13:26.840 | but not quite ready for broad usage.
01:13:28.880 | Like it's, you know, interesting to show off to people,
01:13:33.080 | but like we haven't really figured out
01:13:34.440 | how like the daily use,
01:13:36.000 | the large amount of money is going to be made
01:13:38.920 | on long inference, on real-time interruptive,
01:13:42.120 | sort of real-time API voice mode things,
01:13:44.560 | on on-device models, as well as all other modalities.
01:13:47.960 | And then niche models, niche capabilities.
01:13:50.200 | I always say like base models are very underrated.
01:13:52.200 | People always love talking to base models as well.
01:13:56.040 | And we're increasingly getting less access to them.
01:13:59.160 | It's quite possible, I think, you know,
01:14:00.640 | Sam Altman for 2025 was like asking about
01:14:03.520 | what people want him to ship
01:14:05.600 | or what people want him to open source.
01:14:07.800 | And people really want GPT-3 base.
01:14:09.720 | We may get it.
01:14:11.960 | We may get it.
01:14:12.800 | It's just for historical interest, but you know,
01:14:16.280 | at this point, but we may get it.
01:14:18.480 | Like it's definitely not a significant IP anymore for him.
01:14:22.440 | So we'll see, you know,
01:14:23.720 | I think OpenAI has a lot more things to worry about
01:14:25.880 | than shipping base models,
01:14:27.120 | but it will be a very, very nice things to do
01:14:28.840 | for the community.
01:14:30.400 | State space models as well.
01:14:31.480 | I would say like the hype for state space models this year,
01:14:33.840 | even though, you know,
01:14:35.320 | the post-transformers talk at Latent Space Live
01:14:37.680 | was extremely hyped and very well attended and watched.
01:14:41.800 | I would say like, it feels like a step down this year.
01:14:43.840 | I don't know why.
01:14:44.760 | It seems like things are scaling out
01:14:49.800 | in state space models and RWKVs.
01:14:53.160 | So Cartesia, I think is doing extremely well.
01:14:55.600 | We use them for a bunch of stuff,
01:14:57.320 | especially for small talks
01:14:59.120 | and some of our sort of NotebookLM podcast clones.
01:15:02.560 | I think they're a real challenger to 11 Labs as well.
01:15:05.920 | And RWKV of course is rolling out on Windows.
01:15:09.360 | So I'll still say these are niche.
01:15:12.200 | We've been talking about them as the future for a long time.
01:15:14.720 | And I mean, we live technically
01:15:16.640 | in a year in the future from last year,
01:15:18.840 | and we're still saying the exact same things
01:15:20.440 | as we were saying last year.
01:15:21.560 | So what's changed?
01:15:22.720 | I don't know.
01:15:24.360 | I do think the XLSTM paper,
01:15:26.320 | which we will cover when we cover the sort of NeurIPS papers
01:15:29.520 | is worth a look.
01:15:32.280 | I think they are very clear eyed
01:15:34.160 | as to how they wanna fix LSTM.
01:15:36.400 | Okay, so, and then we also wanna cover a little bit
01:15:40.560 | like the major themes of the year.
01:15:42.240 | And then we wanted to go month by month.
01:15:44.120 | So I'll bridge you into it back to the recording,
01:15:46.160 | which we still have the audio of.
01:15:48.160 | So one of the major themes
01:15:49.840 | is sort of the inference race at the bottom.
01:15:51.160 | We started this time last year
01:15:53.400 | with the Mixtral price war of 2023,
01:15:58.160 | with Mixtral going from $1.80 per million tokens
01:16:02.480 | down to $1.27 in the span of like a couple of weeks.
01:16:06.400 | And, you know, I think this,
01:16:08.880 | a lot of people also interested in the price war,
01:16:11.760 | sort of the price intelligence curve for this year as well.
01:16:15.440 | I started tracking it,
01:16:16.720 | I think roundabout in March of 2024 with Haiku's launch.
01:16:21.400 | And so this is, if you're watching the YouTube,
01:16:24.000 | this is what I initially charted out as like,
01:16:26.480 | here's the frontier.
01:16:27.600 | Like, everyone's kind of like in a pretty tight range
01:16:29.600 | of LMSYS ELO versus the model pricing.
01:16:32.480 | You can pay more for more intelligence
01:16:35.080 | and it'll be cheaper to get less intelligence,
01:16:38.160 | but roughly it correlates to a line, a trend line.
01:16:43.080 | And then I could update it again in July
01:16:45.520 | and see that everything had kind of shifted right.
01:16:48.200 | So for the same amount of ELO,
01:16:50.240 | let's say GPT-4 in 2023 would be about sort of 1175
01:16:57.560 | as an ELO,
01:16:58.880 | and you used to get that for like $40
01:17:02.080 | per million tokens.
01:17:03.200 | And now you get Claude 3 Haiku,
01:17:05.160 | which is about the same ELO for 50 cents.
01:17:08.040 | And so that's a two orders of magnitude improvement
01:17:10.680 | in about two years, sorry, in about a year.
01:17:14.440 | But more importantly, I think you can see
01:17:18.040 | the more recent launches like Claude 3 Opus,
01:17:19.840 | which launched in March this year,
01:17:22.200 | now basically superseded completely,
01:17:25.040 | completely dominated by Gemini 1.5 Pro,
01:17:27.120 | which is both cheaper, $5 per million tokens,
01:17:29.320 | as well as smarter.
01:17:31.840 | So it's about slightly higher than 1250 in ELO.
01:17:34.880 | So the March frontier and shift to the July frontier
01:17:38.600 | is roughly one order of magnitude improvement
01:17:40.800 | per sort of ISO ELO.
01:17:43.680 | And I think what you're starting to see now in July
01:17:46.680 | is the emergence of 4o Mini and DeepSeek V2
01:17:49.480 | as outliers to the July frontier,
01:17:51.120 | where the July frontier used to be maintained by 4o,
01:17:54.280 | Llama 405B, Gemini 1.5 Flash and Mistral Nemo.
01:17:58.600 | These things kind of break the frontier.
01:18:00.120 | And then if you update it like a month later,
01:18:02.480 | I think if I go back a month here,
01:18:04.880 | you update it, and you can see
01:18:07.320 | more items start to appear here as well
01:18:10.520 | with the August frontier, with Gemini 1.5 Flash coming out
01:18:14.360 | with an August update as compared to the June update,
01:18:17.560 | being a lot cheaper and roughly the same ELO.
01:18:20.800 | And then we update for September.
01:18:24.000 | And this is one of those things
01:18:25.760 | where we really started to understand the pricing curves
01:18:30.760 | being real instead of something
01:18:32.120 | that some random person on the internet drew on a chart,
01:18:36.040 | because Gemini 1.5 cut their prices
01:18:39.080 | and cut their prices exactly in line
01:18:41.160 | with where everyone else is
01:18:42.720 | in terms of their ELO price charts.
01:18:44.080 | So if you plot by September,
01:18:47.000 | we had O1 preview and pricing and costs and ELOs.
01:18:51.160 | So the frontier was O1 preview,
01:18:54.200 | GPT-4o, O1 mini, 4o mini,
01:18:57.120 | and then Gemini Flash at the low end.
01:18:59.560 | That was the frontier as of September.
01:19:01.720 | Gemini 1.5 Pro was not on that frontier.
01:19:04.120 | Then they cut their prices, they halved their prices,
01:19:07.600 | and suddenly they were on the frontier.
01:19:09.880 | And so it's a very, very tight and predictive line,
01:19:12.080 | which I thought it was really interesting
01:19:13.920 | and entertaining as well.
01:19:16.160 | And I thought that was kind of cool.
01:19:18.280 | In November, we had 3.5 Haiku new.
01:19:21.960 | And obviously we had Sonnet as well.
01:19:25.200 | Sonnet is not, I don't know where Sonnet is on this chart,
01:19:29.120 | but Haiku new basically was 4X the price of old Haiku.
01:19:34.120 | Oh, sorry, 3.5 Haiku was 4X the price of 3 Haiku.
01:19:40.680 | And people were kind of unhappy about that.
01:19:43.120 | There's a reasonable assumption, to be honest,
01:19:47.000 | that it's not a price hike, it's just a bigger model.
01:19:48.920 | So it costs more, but we just don't know that.
01:19:51.360 | There was no transparency on that.
01:19:52.440 | So we are left to draw our own conclusions
01:19:54.640 | on what that means.
01:19:56.360 | That's just is what it is.
01:19:59.360 | So yeah, that would be the sort of price ELO chart.
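For anyone who wants to recreate the chart being described here, a minimal sketch of the idea follows: plot each model's arena ELO against its price per million tokens on a log scale and fit a trend line, which is the "frontier" that keeps shifting down and to the right with each update. The model names and numbers below are placeholder values for illustration, not the actual data behind the chart.

```python
# Minimal sketch of the price-vs-ELO frontier chart, with made-up illustrative numbers.
import numpy as np
import matplotlib.pyplot as plt

# (model, arena_elo, usd_per_million_tokens) -- hypothetical placeholder values
models = [
    ("frontier-large",  1290, 15.00),
    ("frontier-mid",    1270,  5.00),
    ("efficient-small", 1250,  0.50),
    ("budget-tiny",     1200,  0.15),
]

elo = np.array([m[1] for m in models], dtype=float)
price = np.array([m[2] for m in models])

# Fit a straight line in (ELO, log10 price) space: that line is the "frontier".
slope, intercept = np.polyfit(elo, np.log10(price), 1)

xs = np.linspace(elo.min(), elo.max(), 100)
plt.scatter(elo, price)
plt.plot(xs, 10 ** (slope * xs + intercept), linestyle="--", label="frontier trend")
plt.yscale("log")
plt.xlabel("LMSYS Arena ELO")
plt.ylabel("$ per million tokens (log scale)")
plt.legend()
plt.show()
```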
01:20:03.600 | I would say that the main update for this one,
01:20:05.960 | if you go to my LLM pricing chart, which is public,
01:20:09.400 | you can ask me for it, or I've shared it online as well.
01:20:11.760 | The most recent one is Amazon Nova,
01:20:13.240 | which we briefly, briefly talked about on the pod,
01:20:15.600 | where they've really sort of come in
01:20:17.960 | and basically offered Amazon Basics LLM,
01:20:21.280 | where Amazon's Nova Pro, Nova Lite, and Nova Micro
01:20:24.280 | are the efficient frontier
01:20:26.040 | for their intelligence levels of 1200 to 1300.
01:20:30.640 | You wanna get beyond 1300, you have to pay up
01:20:32.640 | for the O1s of the world, and the 4Os of the world,
01:20:34.960 | and the Gemini 1.5 Pros of the world.
01:20:37.880 | But Gemini 2.0 Flash is not on here,
01:20:40.160 | and is probably a good deal higher.
01:20:42.600 | Flash Thinking is not on here,
01:20:44.000 | as well as all the other QWQs, R1s,
01:20:46.960 | and all the other sort of thinking models.
01:20:49.160 | So I'm gonna have to update this chart.
01:20:50.720 | It's always a struggle to keep up to date.
01:20:53.560 | But I wanna give you the idea that basically for,
01:20:57.400 | through 2024, for the same amount of ELO,
01:21:02.400 | what you used to pay at the start of 2024,
01:21:06.920 | let's say $40 to $50 per million tokens,
01:21:11.920 | now is available approximately, with Amazon Nova,
01:21:17.000 | approximately at, I don't know, $0.075 per million tokens.
01:21:22.440 | So like 7.5 cents.
01:21:26.480 | So that is a couple orders of magnitude at least.
01:21:30.880 | Actually, almost three orders of magnitude improvement
01:21:33.280 | in a year.
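As a back-of-envelope check of that claim, using the numbers quoted just above (roughly $40-50 per million tokens at the start of 2024 versus roughly $0.075 per million tokens at the Amazon Nova tier now):

```python
# Quick sanity check of the "almost three orders of magnitude" claim.
import math

start_of_2024 = 45.0   # $ per million tokens, midpoint of the $40-50 range quoted above
end_of_2024 = 0.075    # $ per million tokens, the Nova-tier figure quoted above

ratio = start_of_2024 / end_of_2024
print(f"price ratio: {ratio:.0f}x")                     # 600x
print(f"orders of magnitude: {math.log10(ratio):.2f}")  # ~2.78, i.e. almost three
```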
01:21:34.560 | And I used to say that intelligence,
01:21:37.040 | the cost of intelligence was coming down
01:21:39.840 | one order of magnitude per year, like 10x.
01:21:41.880 | That is already faster than Moore's Law.
01:21:45.120 | But coming down three orders of magnitude this year
01:21:48.200 | is something that I think
01:21:49.240 | not enough people are talking about.
01:21:50.440 | And so even though people understand
01:21:53.560 | that intelligence has become cheaper,
01:21:55.080 | I don't think people are appreciating
01:21:56.840 | how much more accelerated this year has been.
01:22:00.240 | And obviously, I think a lot of people are speculating
01:22:02.360 | how much more next year will be with H200s
01:22:05.240 | becoming commodity, Blackwell's coming out.
01:22:07.720 | It's very hard to predict,
01:22:09.720 | and obviously there are a lot of factors
01:22:11.080 | beyond just the GPUs.
01:22:12.760 | So that is the sort of thematic overview.
01:22:16.080 | And then we went into sort of the annual overview.
01:22:21.080 | This is basically us going through
01:22:24.080 | the AI News releases of the year,
01:22:29.080 | and just picking out favorites.
01:22:32.040 | I had Will, our new research assistant,
01:22:35.080 | help out with the research,
01:22:36.320 | but you can go on to AI News
01:22:37.360 | and check out all the sort of top news of the day.
01:22:41.360 | But we had a little bit of an AI Rewind thing,
01:22:43.840 | which I'll briefly bridge you in
01:22:45.360 | back to the recording that we had.
01:22:48.120 | So January, we had the first round of the year
01:22:52.160 | for Perplexity.
01:22:53.360 | And for me, it was notable that Jeff Bezos backed it.
01:22:56.880 | Jeff doesn't invest in a whole lot of companies,
01:22:58.920 | but when he does, he backed Google
01:23:01.560 | back in the day, and now he's backing the new Google,
01:23:03.920 | which is kind of cool.
01:23:04.880 | Perplexity is now worth $9 billion.
01:23:06.240 | I think they have four rounds this year.
01:23:08.200 | Will also picked out
01:23:12.320 | that Sam was talking about GPT-5 soon.
01:23:14.720 | This was back when he was, I think,
01:23:17.480 | at one of the sort of global summit type things, Davos.
01:23:22.400 | And yeah, no GPT-5.
01:23:25.720 | It's actually, we got O1 and O3.
01:23:28.440 | In February, you know, people were,
01:23:30.680 | we were still sort of thinking about last year's Dev Day.
01:23:34.400 | And this was three months on from Dev Day.
01:23:37.360 | People were kind of losing confidence in GPTs.
01:23:40.680 | And I feel like that hasn't super recovered yet.
01:23:43.920 | I hear from people that there are still stuff in the works
01:23:47.240 | and you should not give up on them.
01:23:48.280 | And they're actually underrated now, which is good.
01:23:51.920 | So I think people are taking a stab at the problem.
01:23:54.600 | I think it's a thing that should exist.
01:23:56.480 | And we just need to keep iterating on them.
01:23:59.120 | Honestly, any marketplace is hard.
01:24:01.400 | It's very hard to judge
01:24:02.840 | given all the other stuff they've shipped.
01:24:04.800 | ChatGPT also released Memory in February,
01:24:08.760 | which we talked about a little bit.
01:24:10.080 | We also had Gemini's Diversity Drama,
01:24:12.360 | which we don't tend to talk a ton about in this podcast
01:24:17.200 | 'cause we try to keep it technical.
01:24:19.040 | But we also started seeing context window size blow out.
01:24:22.440 | So this year, I mean, it was Gemini with 1 million tokens.
01:24:26.520 | But also I think there's 2 million tokens talked about.
01:24:30.600 | We had a podcast with Gradient
01:24:32.040 | talking about how to fine tune for 1 million tokens.
01:24:34.480 | It's not just like what you declare
01:24:36.640 | to be your token context, but you also have to use it well.
01:24:40.280 | And increasingly, I think people are looking at
01:24:43.240 | not just Ruler, which is sort of multi needle
01:24:45.720 | in a haystack we talked about,
01:24:47.160 | but also Muser and like reasoning over long context,
01:24:50.760 | not just being able to retrieve over long context.
01:24:54.440 | And so that's what I would call out there.
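For a rough picture of what a multi-needle, needle-in-a-haystack style check looks like, here is a toy sketch: plant a few odd facts at random depths in a long filler context, then score how many the model can recall. The call_model function is a placeholder for whatever chat API you actually use, and the filler and needles are made up.

```python
# Toy multi-needle recall check; not any specific benchmark's implementation.
import random

def build_haystack(needles: list[str], filler: str, total_chars: int) -> str:
    # Repeat filler text to the target length, then splice the needles in at random positions.
    haystack = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pieces = list(haystack)
    for needle in needles:
        pieces.insert(random.randint(0, len(pieces) - 1), f" {needle} ")
    return "".join(pieces)

def score_recall(answer: str, needles: list[str]) -> float:
    found = sum(1 for n in needles if n.lower() in answer.lower())
    return found / len(needles)

needles = [
    "The secret ingredient in the pasta is nutmeg.",
    "The meeting password is heliotrope.",
    "The backup server lives in Reykjavik.",
]
context = build_haystack(needles, "The quick brown fox jumps over the lazy dog. ", 200_000)
prompt = context + "\n\nList every unusual fact hidden in the text above."

# answer = call_model(prompt)            # placeholder for your model call
# print(score_recall(answer, needles))   # 1.0 means all needles were recalled
```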
01:24:56.520 | Specifically, I think magic.dev as well,
01:24:58.240 | made a lot of waves for the 100 million token model,
01:25:01.120 | which was kind of teased last year,
01:25:02.520 | but whatever it was, they made some noise about it.
01:25:05.720 | Still not released, so we don't know,
01:25:07.320 | but we'll try to get them on the podcast.
01:25:09.560 | In March, Claude 3 came out,
01:25:11.680 | which was huge, huge, huge for Anthropic.
01:25:13.640 | This basically started to mark the shift of market share
01:25:16.200 | that we talked about earlier in the pod,
01:25:18.440 | where most production traffic was on OpenAI,
01:25:22.000 | and now Anthropic had a decent frontier model family
01:25:25.200 | that people could shift to.
01:25:26.760 | And obviously, now we know that Sonnet
01:25:28.280 | is kind of the workhorse,
01:25:30.560 | just like 4o is the workhorse of OpenAI.
01:25:33.840 | Devin came out in March,
01:25:37.320 | and that was a very, very big launch.
01:25:38.920 | It was probably one of the most well-executed PR campaigns,
01:25:43.600 | maybe in tech, maybe this decade.
01:25:47.000 | And then I think there was a lot of backlash
01:25:49.760 | as to what specifically was real
01:25:52.880 | in the videos that they launched with,
01:25:55.520 | and then they took nine months to ship to GA.
01:25:58.720 | And now you can buy it for $500 a month
01:26:00.840 | and form your own opinion.
01:26:01.920 | I think some people are happy, some people less so,
01:26:04.360 | but it's very hard to live up to the promises that they made
01:26:07.920 | and the fact that for some of them they do,
01:26:11.600 | which is interesting.
01:26:13.000 | I think the main thing I would caution out for Devin,
01:26:16.400 | I think people call me a Devin shill sometimes
01:26:18.560 | 'cause I say nice things,
01:26:19.400 | like one nice thing doesn't mean I'm a shill,
01:26:21.840 | basically is that a lot of the ideas can be copied.
01:26:26.400 | And this is always the threat of "GPT wrappers"
01:26:30.240 | that you achieve product market fit with one feature,
01:26:33.200 | it's gonna be copied by 100 other people.
01:26:35.200 | So of course, you gotta compete with branding
01:26:37.000 | and better products and better engineering
01:26:39.160 | and all that sort of stuff,
01:26:40.080 | which Devin has in spades, so we'll see.
01:26:43.680 | - April, we actually talked to Udio and Suno.
01:26:47.160 | We talked to Suno specifically,
01:26:49.720 | but Udio I also got a beta access to
01:26:52.080 | and like AI music generation.
01:26:53.760 | We played with that on the podcast.
01:26:56.160 | I loved it.
01:26:57.520 | Some of our friends at the pod like play in their cars,
01:27:00.080 | like I rode in their cars
01:27:01.200 | while they played our Suno intro songs
01:27:02.880 | and I freaking loved using O1 to craft the lyrics
01:27:05.960 | and Suno and Udio to make the songs.
01:27:10.040 | But ultimately like a lot of people,
01:27:11.840 | some people were skipping them.
01:27:12.800 | I don't know what exact percentages,
01:27:14.720 | but those 10% of you that skipped it,
01:27:18.160 | you're the reason why we cut the intro songs.
01:27:20.600 | (laughs)
01:27:22.480 | We also had the Llama 3 release.
01:27:23.960 | So I think people always wanna see
01:27:27.040 | like a good frontier open source model
01:27:29.880 | and Llama 3 obviously delivered on that
01:27:31.760 | with the 8B and 70B, the 400B came later.
01:27:34.600 | Then in May, GPT-4o released.
01:27:38.840 | And it was a kind of a model efficiency thing,
01:27:41.920 | but also I think just a really good demo
01:27:43.880 | of all the things that 4o was capable of.
01:27:47.120 | Like this is where the messaging of omni-model
01:27:49.880 | really started kicking in.
01:27:51.920 | Previously 4 and 4 Turbo were all text
01:27:55.440 | and not natively sort of vision.
01:27:58.480 | I mean, they had vision, but not natively voice.
01:28:00.680 | And I think everyone fell in love immediately
01:28:04.200 | with the Sky Voice and Sky Voice got taken away
01:28:07.920 | before the public release.
01:28:09.600 | And I think it's probably self-inflicted.
01:28:14.320 | I think that the version of events
01:28:16.720 | that has Sam Altman basically putting a foot in his mouth
01:28:20.800 | with a three-letter tweet causing decent grounds
01:28:25.800 | for a lawsuit where there was no grounds to be had
01:28:28.120 | because they actually just use the voice actress
01:28:29.640 | that sounded like Scarlett Johansson is unfortunate
01:28:34.160 | because we could have had it and we don't.
01:28:36.680 | So that's what it is.
01:28:38.320 | And that's what the consensus seems to be
01:28:40.160 | from the people I talked to.
01:28:41.960 | People be pining for the Scarlett Johansson voice.
01:28:46.080 | In June, Apple announced Apple Intelligence at WWDC.
01:28:50.160 | And we haven't, most of us, if you update your phones,
01:28:53.800 | have it now if you're on an iPhone.
01:28:55.480 | And I would say it's like decent.
01:28:58.160 | I think it wasn't the game changer thing
01:29:01.080 | that caused the Apple stock to rise like 20%.
01:29:05.760 | And just because everyone was gonna upgrade their iPhones
01:29:08.200 | just to get Apple Intelligence, it did not become that.
01:29:10.920 | But it is probably the largest scale rollout
01:29:14.920 | of transformers yet after Google rolled out BERT for search.
01:29:19.200 | And people are using it.
01:29:21.800 | And it's a 3B foundation model
01:29:25.200 | that's running locally on your phone with Loras
01:29:27.320 | that are hot swaps and we have papers for it.
01:29:29.280 | Honestly, Apple did a fantastic job
01:29:31.200 | of doing the best that they can.
01:29:33.560 | They're not the most transparent company in the world
01:29:35.560 | and nobody expects them to be.
01:29:37.760 | But they gave us more than I think we normally get
01:29:42.400 | for Apple tech.
01:29:43.680 | And that's very nice for the research community as well.
01:29:46.680 | NVIDIA, I think we continue to talk about,
01:29:51.800 | I think I was at the Taiwanese trade show, Computex,
01:29:56.560 | and saw Jensen signing women's body parts.
01:30:00.720 | And I think that was maybe a sign of the times,
01:30:03.080 | maybe a sign that things have peaked,
01:30:05.080 | but things are clearly not peaked 'cause they continued.
01:30:07.800 | Moving on to Ilya, and then that bridges us
01:30:11.160 | back into the episode recording.
01:30:12.840 | I'm gonna stop now and stop yapping.
01:30:14.920 | But yeah, we recorded a whole bunch of stuff.
01:30:18.280 | We lost it and we're scrambling to re-record it for you,
01:30:22.600 | but also we're trying to close the chapter on 2024.
01:30:25.480 | So now I'm gonna cut back to the recording
01:30:28.520 | where we talk about the rest of June, July, August,
01:30:31.880 | September, and the second half of 2024's news.
01:30:35.720 | And we'll end the episode there.
01:30:37.320 | Ilya came out of the woodwork.
01:30:41.080 | - Yeah, raised a billion.
01:30:42.760 | - Saw a term sheet, raised a billion dollars.
01:30:45.080 | Dan Gross seems to have now become full-time CEO
01:30:48.120 | of the company, which is interesting.
01:30:49.960 | I thought he was gonna be an investor for life,
01:30:52.200 | but now he's operating.
01:30:53.400 | - He was an investor for a short amount.
01:30:55.480 | - For two years.
01:30:56.840 | What else can we say about Ilya?
01:30:58.600 | I mean, like, I think this idea
01:31:00.840 | that you only ship one product
01:31:01.960 | and it's a straight shot at super intelligence
01:31:04.040 | seems like a really good focusing mission,
01:31:06.520 | but then it runs counter to basically both Tesla
01:31:10.920 | and OpenAI in terms of the ship intermediate products
01:31:14.680 | that get you to that vision.
01:31:15.800 | - Well, I think the question is like,
01:31:17.560 | OpenAI now needs then more money
01:31:19.400 | because they need to support those products.
01:31:21.160 | And I think maybe their bet is like with 1 billion,
01:31:24.280 | we can get to the thing.
01:31:26.040 | But we don't wanna have to have intermediate steps.
01:31:28.360 | Like we're just making it clear
01:31:29.640 | that like, this is what it's about.
01:31:30.840 | - Yeah, but then like, where do you get your data?
01:31:32.680 | - Yeah, totally.
01:31:35.160 | - So I think that's the question.
01:31:38.760 | I think we can also use this as part of a general theme
01:31:42.200 | of the safety wing of OpenAI leaving.
01:31:44.840 | - Yeah.
01:31:45.400 | - It's fair to say that Jan Leike also left
01:31:48.200 | and like basically the entire super alignment team left.
01:31:52.120 | - Yeah, then there was Artifacts,
01:31:53.640 | kind of like the ChatGPT Canvas equivalent, that came out.
01:31:57.400 | - I think more code-oriented.
01:31:59.080 | - Yeah.
01:31:59.560 | - No one has a Canvas clone yet apart from OpenAI.
01:32:03.320 | Interestingly, I think the same person
01:32:06.280 | responsible for artifacts and Canvas, Karina,
01:32:08.760 | officially left Anthropic after this to join OpenAI
01:32:12.280 | on the rare reverse moves.
01:32:13.720 | - Yeah, and then we had AI Engineer World's Fair in June.
01:32:17.640 | It was over 2,000 people, not including us.
01:32:20.440 | I would love to attend the next one.
01:32:23.880 | - If only we can get tickets.
01:32:27.240 | - Yeah, but I think a really good demo.
01:32:29.240 | We now have it and deploy it for everybody.
01:32:31.640 | And Gemini actually kind of beat them to the GA release,
01:32:34.040 | which is kind of interesting.
01:32:35.480 | I think that everyone should basically always have this on
01:32:38.520 | as long as you're comfortable with the privacy settings
01:32:40.200 | because then you have a second person
01:32:41.640 | kind of looking over your shoulder.
01:32:42.680 | - Yeah.
01:32:43.240 | - And like this time next year, I would be willing to bet
01:32:45.960 | that I would just have this running on my machine.
01:32:48.840 | And I think that assistance always on
01:32:52.600 | that you can talk to with vision
01:32:53.640 | that sees what you're seeing.
01:32:54.840 | I think that is where a lot of our
01:32:56.600 | software experiences are going to go.
01:32:57.800 | Then it will be another few years for that to happen
01:33:00.840 | in real life in outside of the screen.
01:33:03.160 | But for screen experiences,
01:33:04.840 | I think it's basically here, but not evenly distributed.
01:33:08.520 | And we've just seen the GA of this capability
01:33:10.840 | that was demoed in June.
01:33:12.200 | - And then July was Lama 3.1,
01:33:14.280 | which we've done a whole podcast on, but that was great.
01:33:18.120 | - July and August is kind of quiet.
01:33:19.720 | - Yeah, structure uploads.
01:33:21.320 | We also did a full podcast on that.
01:33:23.400 | And then September we got O1.
01:33:25.480 | - Yes.
01:33:26.200 | - Strawberry, AKA Q*, AKA we had a nice party
01:33:29.960 | with strawberry glasses.
01:33:30.920 | - Yes, I think very underrated.
01:33:33.080 | Like this is basically from the first internal demo
01:33:36.840 | of Q*, of Strawberry, was let's say November 2023.
01:33:41.240 | So between November to September,
01:33:43.720 | like the whole red teaming and everything,
01:33:46.280 | honestly, a very good ship rate.
01:33:47.960 | Like I don't know if like people are giving OpenAI
01:33:51.000 | enough credit for like this all being available.
01:33:53.320 | in ChatGPT and then shortly after in the API.
01:33:56.920 | I think maybe in the same day, I don't know.
01:33:58.600 | I don't remember the exact sequence already,
01:34:00.440 | but like this is like the frontier model
01:34:02.680 | that was like rolled out very, very quickly
01:34:04.440 | to the whole world.
01:34:05.720 | And then we immediately got used to it,
01:34:07.240 | immediately said it was shit
01:34:08.280 | because we're still using Sonnet or whatever,
01:34:10.360 | but like still very good.
01:34:11.400 | And then obviously now we have O1 Pro and O1 Full.
01:34:15.160 | I think like in terms of like biggest ships of the year,
01:34:17.480 | I think this is it, right?
01:34:18.360 | - Yeah, yeah, totally.
01:34:19.960 | Yeah, and I think it now opens a whole new Pandora's box
01:34:23.080 | for like the inference time compute and all that.
01:34:25.560 | - Yeah, yeah, yeah.
01:34:26.440 | It's funny because like it could have been done
01:34:27.800 | by anyone else before.
01:34:29.400 | - Yeah.
01:34:30.040 | - Literally, this is an open secret.
01:34:31.400 | They were working on it ever since they hired Noam,
01:34:33.240 | but no one else did.
01:34:35.320 | - Yeah.
01:34:36.040 | - Another discovery, I think Ilya actually worked
01:34:38.840 | on a previous version called GPT-0 in 2021.
01:34:42.200 | Same exact idea and it failed, whatever that means.
01:34:46.280 | - Yeah, timing.
01:34:47.400 | - Voice mode also.
01:34:48.520 | - Voice mode, yeah.
01:34:49.400 | Yeah, I think most people have tried it by now
01:34:52.200 | because it's generally available.
01:34:53.400 | - Yeah, I think your wife also likes it.
01:34:55.720 | - Yeah, yeah, yeah.
01:34:57.160 | She talks to it all the time.
01:34:58.680 | - Okay.
01:34:59.400 | - Canvas in October.
01:35:01.320 | - Okay.
01:35:01.800 | - Another big release.
01:35:02.920 | - Have you used it much?
01:35:04.120 | - Not really, honestly.
01:35:06.120 | - I use it a lot.
01:35:07.160 | - What do you use it for mostly?
01:35:08.520 | - Drafting anything.
01:35:09.800 | I think that people don't see where all this is heading.
01:35:12.680 | Like OpenAI is really competing with Google in everything.
01:35:15.240 | Canvas is Google Docs.
01:35:16.760 | And like it's a full document editing environment
01:35:18.920 | with the AI assistant thing at the side
01:35:21.560 | that is arguably better than Google Docs,
01:35:23.960 | at least for some editing use cases, right?
01:35:26.440 | 'Cause it has a much better AI integration
01:35:28.760 | than Google Docs with Gemini on the side.
01:35:31.160 | And so OpenAI is taking on Google and Google Docs.
01:35:33.880 | It's also taking it on in search.
01:35:35.800 | They launched their little Chrome extension thing
01:35:39.560 | to be the default search.
01:35:41.160 | And I think like piece by piece,
01:35:43.160 | it's kind of really tackling on Google in a very smart way
01:35:46.360 | that I think is additive to workflow
01:35:48.280 | and people should start using it as intended
01:35:50.760 | because this is a peek into the future.
01:35:52.600 | Maybe they're not successful, but at least they're trying.
01:35:55.000 | And I think Google has gone without competition for so long
01:35:57.960 | that anyone trying will at least
01:36:00.600 | receive some attention from me.
01:36:01.560 | - Yeah, and then yeah, computer use also came out.
01:36:04.760 | Yeah, that was a busy, it's been a busy couple of months.
01:36:09.880 | - Busy couple of months.
01:36:10.920 | I would say that computer use was one
01:36:13.560 | of the most upvoted demos on Hacker News of the year,
01:36:17.240 | but then comparatively, I don't see people using it as much.
01:36:20.520 | This is how you feel the difference
01:36:21.800 | between a mature capability and an emerging capability.
01:36:25.480 | Maybe this is why vision is emerging
01:36:27.480 | because I launched computer use,
01:36:29.000 | you're not using it today,
01:36:30.120 | but you'll use everything else in a mature category.
01:36:31.880 | And it's mostly because it's not precise enough
01:36:34.600 | or it's too slow or it's too expensive.
01:36:36.520 | And those will be the main criticisms.
01:36:38.440 | - Yeah, that makes sense.
01:36:40.120 | It's also just like overall uneasiness
01:36:42.760 | about just letting it go crazy.
01:36:44.600 | - I don't care.
01:36:45.560 | - Yeah, no, no, totally.
01:36:47.160 | But I think a lot of people do.
01:36:48.680 | - November.
01:36:49.640 | - R1, so that was kind of like the open-source
01:36:52.520 | O1 competitor by DeepSeek.
01:36:53.960 | - Yeah, nobody knew it was coming.
01:36:55.000 | Everyone knew like F1, we had a preview at the Fireworks HQ.
01:36:59.160 | And then I think some other labs did it,
01:37:01.800 | but I think R1 and QwQ from the Qwen team,
01:37:07.080 | both Alibaba affiliated, I think,
01:37:09.560 | are the leading contenders on that front.
01:37:11.400 | And we'll see, we'll see.
01:37:13.400 | - What else to highlight?
01:37:15.240 | I think the Stripe agent toolkit,
01:37:17.160 | I mean, it's a small thing,
01:37:18.680 | but it's just like people are like agents are not real.
01:37:21.000 | It's like when you have companies like Stripe
01:37:23.000 | and like start to build things to support it,
01:37:25.160 | it might not be real today,
01:37:26.280 | but obviously they don't have to do it
01:37:28.360 | because they're not an AI company.
01:37:30.280 | But the fact that they do it shows that there's one demand
01:37:33.400 | and so there's belief on their end.
01:37:35.880 | - This is a broader thing about,
01:37:37.240 | broader thesis for me that I'm exploring around,
01:37:39.880 | do we need special SDKs for agents?
01:37:42.360 | Why can't normal SDKs for humans do the same thing?
01:37:46.440 | The Stripe agent toolkit happens to be a wrapper
01:37:48.280 | on the Stripe SDK, it's fine.
01:37:49.640 | It's just like a nice little DX layer.
01:37:51.720 | But like, it's still unclear to me.
01:37:53.160 | I have been asked my opinion on this before.
01:37:56.120 | And I said, I think I said it on a podcast,
01:37:57.480 | which is like the main layer that you need
01:37:59.560 | is separate off roles
01:38:01.160 | is to separate off roles
01:38:05.400 | And you can lock things down much quicker
01:38:07.000 | or you can identify whether it is an agent
01:38:10.680 | acting on your behalf or actually you.
01:38:12.200 | And that is something that you need.
01:38:14.920 | I had my ElevenLabs key pwned
01:38:17.320 | because I lost my laptop.
01:38:18.680 | And I saw a whole bunch of API calls
01:38:20.840 | and I was like, "Oh, is that me?
01:38:21.960 | "Or is that someone?"
01:38:23.960 | And it turned out to be a key that I committed
01:38:27.320 | onto GitHub and that I didn't scrub.
01:38:29.480 | And so sourcing of where API usage is coming from,
01:38:32.920 | I think you should attribute it to agents
01:38:34.920 | and build for that world.
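One way to picture the "separate roles" idea is a sketch like this, where every key is tagged as belonging to a human or to an agent acting on that human's behalf, so usage can be attributed and agent keys can be scoped narrowly and expired aggressively. The names and fields are hypothetical, not any real provider's API.

```python
# Sketch of attributing API keys to agents vs humans; all names here are hypothetical.
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

@dataclass
class ApiKey:
    key_id: str
    owner: str                       # the human ultimately responsible
    actor: str                       # "human" or "agent"
    scopes: set[str] = field(default_factory=set)
    expires_at: datetime | None = None

def issue_agent_key(owner: str, scopes: set[str], ttl_hours: int = 24) -> ApiKey:
    # Agent keys get narrow scopes and a short lifetime by default.
    now = datetime.now(timezone.utc)
    return ApiKey(
        key_id=f"agent_{owner}_{now.timestamp():.0f}",
        owner=owner,
        actor="agent",
        scopes=scopes,
        expires_at=now + timedelta(hours=ttl_hours),
    )

def is_allowed(key: ApiKey, scope: str) -> bool:
    if key.expires_at and datetime.now(timezone.utc) > key.expires_at:
        return False
    return scope in key.scopes

# Usage: an agent key scoped to text-to-speech only; billing logs show actor="agent".
key = issue_agent_key("swyx", {"tts:generate"})
print(is_allowed(key, "tts:generate"))    # True
print(is_allowed(key, "account:delete"))  # False
```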
01:38:36.040 | But other than that, I think SDKs,
01:38:38.040 | I would see it as a failure of tech and AI
01:38:44.200 | that we need every single thing
01:38:46.200 | needs to be reinvented for agents.
01:38:47.640 | - I agree in some ways.
01:38:49.400 | I think in other ways,
01:38:50.280 | we've also like not always made things super explicit.
01:38:53.880 | There's kind of like a lot of defaults
01:38:56.520 | that people do when they design APIs.
01:38:58.120 | But like, I think if you were to redesign them
01:39:00.680 | in a world in which the person or the agent
01:39:02.920 | using them as like almost infinite memory and context,
01:39:06.120 | like you will maybe do things differently.
01:39:07.640 | - Yeah. - But I don't know.
01:39:08.920 | I think to me that the most interesting
01:39:10.360 | is like REST and GraphQL
01:39:11.960 | is almost more interesting in the world of agents
01:39:14.600 | because agents could come up
01:39:15.720 | with so many different things to query
01:39:17.800 | versus like before I always thought GraphQL
01:39:19.640 | was kind of like not really necessary
01:39:21.320 | because like, you know what you need,
01:39:22.680 | just build the REST endpoint for it.
01:39:24.600 | So yeah, I'm curious to see what else changes.
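To make the REST-versus-GraphQL point concrete: with fixed REST endpoints an agent chains the calls it was given ahead of time, while with GraphQL it can compose one query shaped to whatever it decided it needs. The endpoint and schema below are hypothetical, just to show the difference.

```python
# Hypothetical API; illustrates pre-defined REST calls vs an agent-composed GraphQL query.
import requests

base = "https://api.example.com"

# REST style: one round trip per resource the integration anticipated.
order = requests.get(f"{base}/orders/123").json()
customer = requests.get(f"{base}/customers/{order['customer_id']}").json()
refunds = requests.get(f"{base}/orders/123/refunds").json()

# GraphQL style: the agent writes the query itself, pulling exactly the fields it wants.
query = """
{
  order(id: "123") {
    status
    refunds { amount reason }
    customer { name email }
  }
}
"""
result = requests.post(f"{base}/graphql", json={"query": query}).json()
```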
01:39:27.960 | And then yeah, the search wars.
01:39:29.320 | I think that was, you know, search GPT, perplexity.
01:39:32.520 | - Dropbox. - Dropbox dash.
01:39:34.680 | Yeah, we had Drew on the pod
01:39:36.200 | and then we attended the Pioneer Summit.
01:39:37.800 | The fact that Dropbox has a Google Drive integration,
01:39:40.760 | it's just like, if you told somebody five years ago,
01:39:43.480 | it's like, Dropbox doesn't really care
01:39:45.800 | about hosting your files.
01:39:47.240 | You know, it's like, that doesn't compute.
01:39:49.640 | So yeah, I'm curious to see where that goes.
01:39:52.040 | - Cool. - This whole space.
01:39:53.160 | - And that brings us up to December, still developing.
01:39:56.120 | I'm curious what the last day
01:39:57.320 | of OpenAI's shipmas will be.
01:39:59.320 | I think everyone's expecting something big there.
01:40:01.400 | I think so far has been a very eventful year.
01:40:03.720 | Definitely has grown a lot.
01:40:05.560 | We were asked by Will actually,
01:40:06.920 | like whether we made predictions.
01:40:07.960 | I don't think we did, but.
01:40:09.080 | - Not really. - Maybe we should.
01:40:11.000 | - Well, I think we definitely talked about agents.
01:40:14.040 | - Yes. - And I don't know
01:40:15.480 | if we said it was the year of the agents, but we said.
01:40:18.600 | - But next year is the year.
01:40:19.800 | - No, no, but well, you know, the anatomy of autonomy,
01:40:22.520 | that was April, 2023, you know?
01:40:24.680 | So obviously there's been belief for a while.
01:40:27.720 | - Yes. - But I think now
01:40:28.760 | the models are, I would say maybe the last, yeah, two months.
01:40:32.280 | - Yeah. - I've made a big push
01:40:33.400 | in like capability for like 3.16, 4.1.
01:40:35.640 | - Yeah, I mean, Ilya saying the word agentic
01:40:37.720 | on stage at NeurIPS, it's a big deal.
01:40:40.520 | Satya, I think also saying that a lot these days.
01:40:43.480 | I mean, Sam has been saying that for a while now.
01:40:46.200 | So DeepMind, when you announced Gemini 2.0,
01:40:48.680 | they announced Deep Research,
01:40:50.280 | but also Project Mariner, which is a browser agent,
01:40:52.680 | which is their computer use type thing,
01:40:54.440 | as well as Jules, which is their code agent.
01:40:56.680 | And I think that basically complements
01:40:58.760 | with whatever OpenAI is shipping next year,
01:41:00.200 | which is Codename Operator, which is their agent thing.
01:41:04.360 | It makes sense that if it actually replaces a junior employee,
01:41:07.240 | they will charge $2,000 for it.
01:41:09.000 | - Yeah, I think that's my whole,
01:41:11.240 | I did this post that it's pinned on my Twitter,
01:41:13.560 | so you can find it easily,
01:41:14.520 | but about skill floor and skill ceiling in jobs.
01:41:17.160 | And I think the skill floor more and more,
01:41:18.840 | I think 2025 will be the first year
01:41:20.680 | where the AI sets the skill floor of a role.
01:41:23.960 | I don't think that has been true in the past,
01:41:25.720 | but yeah, I think now really like if Devon works,
01:41:29.480 | if all these customer support agents are working.
01:41:32.360 | So now to be a customer support person,
01:41:34.440 | you need to be better than an agent
01:41:36.760 | because the economics just don't work.
01:41:38.440 | I think the same is gonna happen to in software engineering,
01:41:41.240 | which I think the skill floor is very low.
01:41:43.080 | There's a lot of people doing software engineering
01:41:45.560 | that are really not that good.
01:41:47.080 | So I'm curious to see in the next year of the recap,
01:41:50.360 | what other jobs are gonna have that change.
01:41:52.680 | - Yeah, every NeurIPS that I go,
01:41:54.760 | I have some chats with researchers
01:41:55.880 | and I'll just highlight the best prediction from that group.
01:41:59.080 | And then we'll move on to end of year recap
01:42:01.080 | in terms of we'll just go down the list of top five podcasts
01:42:03.880 | and then we'll end it.
01:42:05.080 | So the best prediction was that there will be a foreign spy
01:42:11.000 | caught at one of the major labs.
01:42:14.280 | So this is part of the consciousness already
01:42:16.840 | that whenever you see someone
01:42:18.920 | who is like too attractive in a San Francisco party,
01:42:22.680 | where the ratio is like a hundred guys to one girl
01:42:25.240 | and suddenly the girl's like super interested in you,
01:42:27.080 | like it may not be your looks.
01:42:28.840 | So there's a lot of like state level secrets
01:42:32.360 | that are kept in these labs.
01:42:33.640 | And not that much security.
01:42:35.080 | I think if anything,
01:42:36.600 | the situational awareness essay did to raise awareness of it,
01:42:40.520 | I think it was directionally correct,
01:42:41.960 | even if not precisely correct.
01:42:43.880 | We should start caring a lot about this.
01:42:45.240 | OpenAI has hired a CISO this year.
01:42:47.800 | And I think like the security space in general,
01:42:49.960 | oh, I remember what I was gonna say
01:42:51.240 | about Apple foundation model before we cut for a break.
01:42:54.120 | They announced Apple Private Cloud Compute.
01:42:56.360 | And I think we're also interested in investing in areas
01:42:59.800 | that are basically secure cloud, LLM inference for everybody.
01:43:03.640 | I think like what we have today is not secure enough.
01:43:05.800 | And because it's like normal security
01:43:07.640 | when like this is literally a state level interest.
01:43:10.360 | - Agreed.
01:43:11.160 | - Top episodes?
01:43:12.120 | - Yeah.
01:43:12.600 | So I'm just going through the sub stack.
01:43:14.680 | Number one, the David Luan one.
01:43:17.480 | - Yeah.
01:43:17.720 | - It's the most popular in 2024.
01:43:19.560 | Why Google failed to make GPT-3?
01:43:21.400 | - I will take a little bit of credit for that,
01:43:23.240 | for the naming of that one,
01:43:24.200 | because I think that was the Hacker News thing.
01:43:26.200 | It's very funny because like,
01:43:27.320 | actually, obviously he wants to talk about Adept,
01:43:29.080 | but then he spent half the episode
01:43:30.280 | talking about his time at OpenAI.
01:43:32.040 | But I think it was a very useful insight
01:43:33.720 | that I'm still using today.
01:43:34.760 | Even in like the earlier post,
01:43:36.440 | I was still referring to what he said.
01:43:38.120 | And when we do podcast episodes,
01:43:40.280 | I try to look for that.
01:43:41.960 | I try to look for things
01:43:43.560 | that we'll still be referencing in the future.
01:43:45.480 | And that concentrated bets idea,
01:43:47.880 | David talked about the Brain compute marketplace,
01:43:51.240 | and then Ilya in his emails that I covered
01:43:54.200 | in the what Ilya saw essay,
01:43:56.520 | had the OpenAI side of this,
01:43:57.960 | where they were like one big training run
01:44:01.240 | is much, much more valuable
01:44:02.920 | than the hundred equivalent small training runs.
01:44:04.520 | So we need to go big
01:44:06.280 | and we need to concentrate bets, not spread them.
01:44:08.040 | - Number two, how NotebookLM was made.
01:44:10.920 | - Yeah, that was fun.
01:44:13.640 | - Yeah, and everybody, I mean,
01:44:15.320 | I think that's like a great example
01:44:16.680 | of like just timeliness.
01:44:18.440 | You know, I think it was top of mind for everybody.
01:44:21.000 | There were great guests.
01:44:21.800 | It just made the rounds on social media.
01:44:24.360 | - Yeah, and that one, I would say,
01:44:26.600 | Raiza is obviously a star,
01:44:27.960 | but she's been on every episode, every podcast.
01:44:30.600 | But Usama, I think, you know,
01:44:32.200 | actually being the guy who worked on the audio model,
01:44:33.960 | being able to talk to him,
01:44:34.760 | I think was a great gift for us.
01:44:37.080 | And I think people should listen back
01:44:38.920 | to how they trained the NotebookLM model.
01:44:41.400 | 'Cause I think you put that level of attention
01:44:44.120 | on any model, you will make it solder.
01:44:45.840 | - Yeah, that's true.
01:44:48.040 | - And it's specifically like, they didn't have evals.
01:44:50.840 | They just- - Vibes.
01:44:52.280 | - They had a group session with vibes.
01:44:54.920 | The ultimate guide to prompting, that was number three.
01:44:57.800 | I think all these episodes that are like summarizing things
01:45:00.920 | that people care about, but they're disparate,
01:45:03.080 | I think always do very well.
01:45:04.920 | - This helps us save
01:45:06.280 | on a lot of smaller prompting episodes, right?
01:45:08.760 | If we interviewed individual paper authors
01:45:11.080 | with like a 10-page paper that is just a different prompt,
01:45:14.120 | like not as useful as like an overview survey thing.
01:45:17.400 | I think the question is what to do from here.
01:45:19.560 | People have actually, I would say,
01:45:21.800 | I've been surprised by how well received that was.
01:45:24.120 | Should we do ultimate guide to other things?
01:45:27.080 | And then should we do prompting 201, right?
01:45:29.560 | Those are the two lessons that we can learn
01:45:31.800 | from the success of this one. - I think if somebody
01:45:32.840 | does the work for us,
01:45:33.880 | that was the good thing about Sander.
01:45:35.720 | Like he had done all the work for us.
01:45:37.000 | - Yeah, Sander is very, very fastidious about this.
01:45:40.040 | So he did a lot of work on that.
01:45:41.400 | I'm definitely keen to have him on next year
01:45:43.960 | to talk more prompting.
01:45:45.320 | Okay, then the next one is the not safe for work one.
01:45:47.640 | - No. - Or structured outputs.
01:45:50.680 | - The next one is brain trust.
01:45:51.880 | - Really? - Yeah.
01:45:53.160 | - Okay, we have a different list then, but yeah.
01:45:55.240 | - I'm just going on the sub stack.
01:45:57.240 | - I see, I see.
01:45:58.200 | So that includes the number of likes,
01:45:59.800 | but I was going by downloads.
01:46:02.120 | It's fine.
01:46:03.820 | - I would say this is almost recency bias
01:46:08.280 | in the way that like the audience keeps growing
01:46:10.520 | and then like the most recent episodes get more views.
01:46:12.680 | So I would say definitely like the NSFW one
01:46:17.640 | was very popular.
01:46:19.000 | What people were telling me they really liked
01:46:20.920 | because it was something people don't cover.
01:46:23.080 | Yeah, structured outputs.
01:46:25.240 | I think people liked that one.
01:46:26.840 | I mean, the same one.
01:46:28.040 | Yeah, I think that's like something
01:46:29.560 | I refer to all the time.
01:46:31.000 | I think that's one of the most interesting areas.
01:46:32.760 | No, the simulation.
01:46:34.920 | - Oh, WebSim, really?
01:46:36.840 | - Yeah, not that use case,
01:46:38.680 | but like how do you use that for like model training
01:46:42.520 | and like agents learning and all of that.
01:46:44.520 | - Yeah, so I would definitely point
01:46:46.680 | to our newest seven hour long episode
01:46:49.480 | on Simulative Environments
01:46:51.640 | because it is the, let's say the scaled up,
01:46:54.200 | very serious AGI lab version of WebSim and WorldSim.
01:46:58.520 | If you take it very, very seriously, you get Genie 2,
01:47:01.240 | which is exactly what you need to then build Sora
01:47:03.240 | and everything else.
01:47:03.960 | So yeah, I think Simulative AI still in summer.
01:47:08.280 | - Yeah, still in summer.
01:47:09.880 | - Still coming.
01:47:10.920 | And I was actually reflecting on this.
01:47:12.040 | Like, would you say that the AI winter has like coming on
01:47:15.480 | or like was it never even here?
01:47:17.400 | 'Cause we did a "Winds of AI Winter" episode
01:47:19.320 | and I was like trying to look for signs.
01:47:21.400 | I think that's kind of gone now.
01:47:23.080 | - Yeah, I would say it was here in the vibes,
01:47:27.080 | but not really in the reality.
01:47:28.440 | You know, when you look back at the yearly recap,
01:47:30.280 | it's like every month there was like progress.
01:47:32.600 | There wasn't really a winter.
01:47:34.040 | There was maybe like a hype winter,
01:47:35.320 | but I don't know if that counts as a real winter.
01:47:38.520 | - I think the scaling has hit a wall thing
01:47:40.920 | has been a big driving discussion for 2024.
01:47:43.960 | And, you know, with some amount of conclusion at NeurIPS
01:47:48.520 | that we were also kind of pointing to
01:47:50.440 | in the "Winds of AI Winter" episode,
01:47:51.960 | but like, it's not a winter by any means.
01:47:54.360 | We know what winter feels like.
01:47:55.560 | It is not winter.
01:47:56.360 | So I think things are going well.
01:47:59.880 | I think every time that people think
01:48:01.560 | that there's like not much happening in AI,
01:48:03.400 | just think back to this time last year
01:48:06.120 | and understand how much has changed
01:48:07.480 | from benchmarks to frontier models,
01:48:09.160 | to market share between OpenAI and the rest.
01:48:11.800 | And then also cover like, you know,
01:48:13.240 | the various coverage areas that we've marked out,
01:48:15.320 | how the discussion has evolved a lot
01:48:17.560 | and what we take for granted now
01:48:18.840 | versus what we did not have a year ago.
01:48:21.560 | - Yeah, and then just to like throw that out there,
01:48:23.560 | there've been 133 funding rounds
01:48:26.520 | over a hundred million in AI this year.
01:48:29.080 | - Does that include Databricks,
01:48:30.600 | the largest venture around in history?
01:48:31.880 | - And that's $10 billion, sheesh.
01:48:34.680 | Well, Mosaic now has been bought
01:48:37.960 | for two something billion
01:48:39.960 | because it was mostly stock, you know?
01:48:42.120 | So price goes up.
01:48:43.160 | - I see.
01:48:43.560 | - Theoretically.
01:48:44.360 | - I see.
01:48:45.000 | So you just bought at a valuation of 40, right?
01:48:46.680 | - Yeah, it was like 43 or something like that.
01:48:48.920 | - And at the time, I remember at the time,
01:48:50.840 | there was a question
01:48:51.480 | about whether or not the valuation was real.
01:48:53.080 | - Yeah, well, that's why everybody-
01:48:54.680 | - Snowflake was down.
01:48:55.560 | - Yeah.
01:48:56.040 | - And like Databricks was a private valuation
01:48:58.440 | that was like two years old.
01:48:59.720 | It's like, who knows what this thing's worth.
01:49:01.720 | Now it's worth 60 billion.
01:49:02.440 | - It's worth more.
01:49:03.080 | It's worth more.
01:49:03.800 | That's what it's worth.
01:49:04.520 | It's worth more than what you thought.
01:49:07.240 | - Yeah, it's been a crazy year,
01:49:09.720 | but I'm excited for next year.
01:49:11.560 | I feel like this is almost like, you know,
01:49:13.080 | now the agent thing needs to happen.
01:49:15.400 | And I think that's really the unlock.
01:49:16.680 | - Yeah, I think it needs to happen.
01:49:18.200 | - I mean-
01:49:19.000 | - I have to agree with you.
01:49:20.360 | Next year is the year of the agent in production.
01:49:21.720 | - Yeah, I don't, you know, it's almost like
01:49:24.120 | I'm not a hundred percent sure it will happen,
01:49:25.960 | but like it needs to happen.
01:49:27.080 | Otherwise it's definitely the winter next year.
01:49:30.120 | Any other parting thoughts?
01:49:33.000 | - I'm very grateful for you.
01:49:35.320 | I think you've been a dream partner
01:49:37.160 | to build Latent Space with.
01:49:38.840 | And also the Discord community,
01:49:41.320 | the paper club people have been beyond my wildest dreams,
01:49:44.680 | so supportive and successful.
01:49:47.800 | It's amazing that the community has grown so much
01:49:51.480 | and the vibe has not changed.
01:49:53.400 | - Yeah, that's true.
01:49:54.760 | - We're almost at 5,000 people.
01:49:56.280 | - Yeah, we started this Discord like four years ago.
01:49:58.040 | - Yeah.
01:49:58.520 | - And still people get it when they join.
01:50:01.000 | You post news here and then you discuss it in threads
01:50:03.160 | and you try not to self-promote too much.
01:50:06.120 | And mostly people obey the rules
01:50:08.680 | and sometimes you smack them down a little bit,
01:50:10.280 | but that's okay.
01:50:10.760 | - We rarely have to ban people, which is great.
01:50:14.440 | But yeah, man, it's been awesome, man.
01:50:16.440 | I think we both started not knowing
01:50:18.600 | where this was going to go.
01:50:19.640 | And now we've done a hundred episodes.
01:50:21.240 | It's easy to see how we're going to get to 200.
01:50:23.560 | I think maybe when we started,
01:50:24.760 | it wasn't easy to see how we would get to 100, you know?
01:50:27.800 | Yeah, excited for more.
01:50:29.160 | Subscribe on YouTube because I swear to God,
01:50:31.640 | we're doing so much work.
01:50:32.520 | We need to make that work.
01:50:33.560 | - It's very expensive for an unclear payoff
01:50:37.240 | as to what we're actually going to get out of it.
01:50:39.240 | But hopefully people discover us more there.
01:50:41.160 | I do believe in YouTube as a podcasting platform
01:50:44.520 | much more so than Spotify.
01:50:45.800 | - Yeah, totally.
01:50:47.560 | Thank you all for listening.
01:50:49.880 | - Yeah, thank you for listening.
01:50:50.520 | - See you in the new year.
01:50:51.320 | - Bye-bye.
01:50:51.800 | (upbeat music)