2024 Year in Review: The Big Scaling Debate, the Four Wars of AI, Top Themes and the Rise of Agents

Chapters
0:00 Welcome to the 100th Episode!
0:19 Reflecting on the Journey
0:47 AI Engineering: The Rise and Impact
3:15 Latent Space Live and AI Conferences
9:44 The Competitive AI Landscape
21:45 Synthetic Data and Future Trends
35:53 Creative Writing with AI
36:12 Legal and Ethical Issues in AI
38:18 The Data War: GPU Poor vs. GPU Rich
39:12 The Rise of GPU Ultra Rich
40:47 Emerging Trends in AI Models
45:31 The Multi-Modality War
65:31 The Future of AI Benchmarks
73:17 Pionote and Frontier Models
73:47 Niche Models and Base Models
74:30 State Space Models and RWKV
75:48 Inference Race and Price Wars
82:16 Major AI Themes of the Year
82:48 AI Rewind: January to March
86:42 AI Rewind: April to June
93:12 AI Rewind: July to September
94:59 AI Rewind: October to December
99:53 Year-End Reflections and Predictions
- Hey everyone, welcome to the Latent Space Podcast. 00:00:06.300 |
This is Alessio, partner and CTO at Decibel Partners, 00:00:12.600 |
- Yay, and we're so glad that everyone has followed us 00:00:19.160 |
- Yeah, almost two years that we've been doing this. 00:00:32.320 |
- Because every answer was cursor and perplexity. 00:00:35.860 |
It's like, do you really not like anything else? 00:00:41.600 |
we've also had a lot more research-driven content. 00:00:44.520 |
You know, we had like Tri Dao, we had Jeremy Howard, 00:00:48.000 |
I think we want to do more of that too in the new year, 00:00:58.640 |
we were kind of like, oh, we should do a podcast. 00:01:00.680 |
And I think we kind of caught the right wave, obviously. 00:01:04.200 |
And I think your rise of the AI engineer posts 00:01:07.400 |
just kind of give people somewhere to congregate 00:01:11.440 |
And that's why when I look at our growth chart, 00:01:14.680 |
for like the AI engineering industry as a whole, 00:01:19.320 |
even if we don't do that much, we keep growing 00:01:21.560 |
just because there's so many more AI engineers. 00:01:27.040 |
for like the AI engineer thing to kind of like become, 00:01:34.000 |
is that Gartner puts it at the top of the hype curve 00:01:37.760 |
So Gartner has called the peak in AI engineering. 00:01:44.120 |
because I did like two months of work going into that. 00:01:47.040 |
But I didn't know how quickly it could happen. 00:01:49.440 |
And obviously there's a chance that I could be wrong. 00:01:52.320 |
But I think like most people have come around 00:01:56.720 |
but there's enough people that have defined it, 00:01:58.840 |
you know, GitHub, when they launched GitHub Models, 00:02:02.280 |
they put AI engineers in the banner and like above the fold, 00:02:15.040 |
I think that was a lot of the quote unquote drama 00:02:17.880 |
that happens behind the scenes at the World's Fair in June, 00:02:20.820 |
because I think there's a lot of doubt or questions 00:02:24.440 |
about where ML engineering stops and AI engineering starts. 00:02:29.840 |
In some sense, I actually anticipated that as well. 00:02:32.360 |
So I intentionally did not put a firm definition there 00:02:40.520 |
and it's actually useful to have different perspectives. 00:02:42.360 |
And then you don't have to specify everything 00:02:48.500 |
to get into like the AI engineering talk, so to speak, 00:02:51.800 |
which is, you know, applied AI and whatnot was like, 00:02:54.120 |
there are like hundreds of people just in line to go in. 00:02:56.960 |
I think that's kind of what enabled people, right? 00:02:59.400 |
Which is what you kind of talked about is like, 00:03:04.560 |
And then maybe we'll talk about some of the blind spots 00:03:07.000 |
that you get as an engineer with the earlier posts 00:03:14.800 |
- Yeah, you know, I was trying to view the conference 00:03:17.720 |
as like NeurIPS is a thing like 16, 17,000 people 00:03:21.520 |
and the latent space live event that we held there 00:03:30.560 |
And that's as it should be because ML is very much 00:03:34.420 |
But as we move this entire field into production, 00:03:41.440 |
So at least I think engineering should be on the same level, 00:03:47.120 |
like it'll always be low status because at the end of the day 00:03:49.600 |
you're manipulating APIs or whatever, but wrapping GPTs, 00:03:54.240 |
but there's gonna be an increasing stack and an art 00:03:58.800 |
And I think that's what we're focusing on for the podcast, 00:04:06.240 |
And I think we'll talk about the trends here that apply. 00:04:09.160 |
It's this very strange mix of like keeping on top of research 00:04:14.720 |
and then putting that research into production. 00:04:23.520 |
we're not going to like understand everything 00:04:30.200 |
is going to make its way into production at some point, 00:04:32.800 |
And then actually, like when I talk to the researchers, 00:04:34.480 |
they actually get very excited because they're like, 00:04:40.840 |
The measure of success is previously just peer review, 00:04:45.120 |
on their academic review conferences and stuff. 00:04:48.040 |
Like citations is one metric, but money is a better metric. 00:04:52.040 |
Yeah, and there were about 2,200 people on the live stream 00:05:19.280 |
towards the sort of PhD students market, job market, right? 00:05:25.240 |
to advertise their research and skills and get jobs. 00:05:28.840 |
And then obviously all the companies go there to hire them. 00:05:31.760 |
And I think that's great for the individual researchers, 00:05:34.040 |
but for people going there to get info is not great 00:05:42.120 |
So what is missing is effectively what I ended up doing, 00:05:51.920 |
I think ICML had a like a position paper track, 00:05:54.080 |
NeurIPS added a benchmarks and datasets track. 00:05:57.280 |
These are ways in which to address that issue. 00:06:01.560 |
Every conference has a last day of workshops and stuff 00:06:06.520 |
but they're not specifically prompted to do so. 00:06:22.480 |
but we did best of 2024 in startups, vision, open models, 00:06:26.440 |
post-transformers, synthetic data, small models, and agents. 00:06:32.240 |
and then we also did a quick one on reasoning 00:06:36.160 |
was the debate that people were very hyped about. 00:06:40.880 |
And I'm really thankful for Jonathan Frankle, basically, 00:06:49.360 |
And I think everyone who is in AI is pro-scaling. 00:06:52.920 |
So you need somebody who's ready to publicly say, 00:06:57.160 |
So that means you're saying Sam Altman's wrong, 00:07:02.240 |
It helps that this was the day before Ilya went on, 00:07:06.920 |
"Pre-training has hit a wall, data has hit a wall." 00:07:26.240 |
you had Sepp Hochreiter, who is the creator of the LSTM, 00:07:34.640 |
He said that pre-training has hit a wall, 00:07:45.640 |
that we have hit some kind of wall in the status quo 00:07:48.440 |
of scaling large pre-trained models 00:07:54.120 |
And obviously the new thing for people is, 00:07:57.320 |
either people are calling it inference time compute 00:08:00.280 |
I think the collective terminology has been inference time. 00:08:04.200 |
And I think that makes sense because test time, 00:08:06.080 |
calling it test time, has a very pre-training bias, 00:08:08.480 |
meaning that the only reason for running inference at all 00:08:13.880 |
- So I quite agree that OpenAI seems to have adopted, 00:08:17.080 |
or the community seems to have adopted this terminology 00:08:28.560 |
who, we've covered or reviewed the Chinchilla paper. 00:08:31.840 |
Chinchilla paper is compute optimal training, 00:08:35.720 |
but it's pre-training compute optimal training. 00:08:46.560 |
he's also on the side of attention is all you need. 00:08:50.360 |
So I'm curious, like he doesn't believe in scaling, 00:08:56.240 |
- So he, obviously everything is nuanced and you know, 00:08:58.480 |
I told him to play a character for this debate, right? 00:09:02.520 |
Yeah, he still believes that we can scale more. 00:09:04.960 |
He just assumed the character to be very game 00:09:09.280 |
So even more kudos to him that he assumed a position 00:09:12.280 |
that he didn't believe in and still won the debate. 00:09:17.040 |
Do you just want to quickly run through some of these things 00:09:21.080 |
like Sarah's presentation, just the highlights? 00:09:24.600 |
- Yeah, we can't go through everyone's slides, 00:09:26.240 |
but I pulled out some things as a factor of like stuff 00:09:35.840 |
And hopefully people can benefit from the work 00:09:39.400 |
But I think it's, these are just good slides. 00:09:41.640 |
And I've been looking for sort of end of year recaps 00:09:45.920 |
You know, I think the max ELO in 2023 on LMSYS 00:09:52.320 |
And now everyone is at least at 1275 in their ELOs. 00:09:57.320 |
And this is across Gemini, ChatGPT, Grok, 01.AI, 00:10:01.640 |
with their Yi-Large model, and Anthropic, of course. 00:10:21.840 |
I would say that people are still holding out a candle 00:10:39.200 |
because they don't work well with the benchmarking people. 00:10:42.640 |
So it's a little trivia for why XAI always gets ignored. 00:10:57.440 |
estimates of OpenAI market share in December, 2023. 00:11:03.560 |
GPT-3.5 and GPT-4 being 95% of production traffic. 00:11:10.880 |
that we asked Harrison Chase on the LangChain episode, 00:11:14.560 |
And then Claude 3 launched middle of this year. 00:11:23.920 |
And you can start seeing the market share shift 00:11:37.600 |
Gemini has basically launched a price war at the low end 00:11:40.800 |
with Gemini Flash being basically free for personal use. 00:11:45.080 |
I think people don't understand the free tier. 00:11:46.680 |
It's something like a billion tokens per day. 00:11:49.880 |
you cannot really exhaust your free tier on Gemini. 00:11:58.640 |
And so they're going after the lower tier first 00:12:04.200 |
But yeah, Gemini Flash, according to OpenRouter, 00:12:15.200 |
The smart ones obviously are still going to OpenAI. 00:12:17.560 |
But it's a very, very big shift in the market. 00:12:19.800 |
Like basically over the course of 2023 going into 2024, 00:12:25.640 |
to reasonably somewhere between 50 to 75% market share. 00:12:30.200 |
how Ramp does the attribution to the model, if it's API, 00:12:33.160 |
because I think it's all- - Credit card spend. 00:12:37.160 |
Maybe when they do expenses, they upload the PDF. 00:12:42.080 |
I think that was one of my main 2024 takeaways 00:12:50.480 |
kind of like long tail would be, like the small model. 00:12:56.760 |
Like so small model here for Gemini is 8B, right? 00:13:01.120 |
Mini, we don't know what the small model size is, 00:13:05.160 |
or maybe single digits, but probably double digits. 00:13:07.440 |
The open source community has kind of focused 00:13:09.400 |
on the one to 3B size, maybe 0.5B, that's Moondream. 00:13:14.400 |
And that is small for you, then that's great. 00:13:17.640 |
It makes sense that we have a range for small now, 00:13:24.640 |
And so this includes Gemma from Gemini as well, 00:13:27.480 |
but also includes the Apple Foundation models, 00:13:30.320 |
which are also like, I think Apple Foundation is 3B. 00:13:34.000 |
I mean, I think in the start, small just meant cheap. 00:13:37.240 |
I think today small is actually a more nuanced discussion, 00:13:40.960 |
you know, that people weren't really having before. 00:13:45.160 |
This is a slide where I slightly disagree with Sarah. 00:13:47.880 |
She's pointing to the Scale SEAL leaderboard. 00:13:51.360 |
I think the researchers that I talked to in Europe 00:14:02.240 |
And scale is one of maybe three or four people this year 00:14:07.600 |
in doing a credible private test set leaderboard. 00:14:11.400 |
Llama 405B does well compared to Gemini and GPT-4o. 00:14:19.280 |
it's good to have an open model that is that big 00:14:23.840 |
But anyone putting 405B in production will tell you, 00:14:30.280 |
that it is very slow and very expensive to infer. 00:14:33.800 |
It doesn't even fit on like one node of H100s. 00:14:39.520 |
they can serve 405B on their super large chips. 00:14:42.800 |
But, you know, if you need to do anything custom to it, 00:14:50.160 |
Like, I think most people are basically saying 00:14:52.600 |
that they only use 405B as a teacher model 00:15:03.840 |
So I don't know if like open source is keeping up. 00:15:09.520 |
is very invested in telling you that the gap is narrowing. 00:15:23.400 |
but you cannot use a chart that is nearing 100 00:15:35.320 |
But in metrics that matter, is open source narrowing? 00:15:46.320 |
- I think inference time compute is bad for open source 00:15:48.840 |
just because Zuck can donate the flops at training time, 00:15:52.960 |
but he cannot donate the flops at inference time. 00:16:00.840 |
- So I don't know what that means for the GPU clouds. 00:16:04.120 |
I don't know what that means for the hyperscalers, 00:16:06.520 |
but obviously the big labs have a lot of advantage 00:16:15.120 |
but then you're putting a lot of compute at inference too. 00:16:26.400 |
Actually, I connected with the AI at Meta guy at NeurIPS 00:16:30.320 |
and we're gonna coordinate something for Llama 4. 00:16:36.800 |
So I'm sure we'll have her on in the new year. 00:16:39.040 |
- Yeah, so my comment on the business model shift, 00:16:48.560 |
They wanted to raise higher and they did not. 00:16:52.760 |
it's very convenient that we're not getting GPT-5, 00:17:03.680 |
And passing it on effectively to the customer. 00:17:09.840 |
because you can directly attribute it to like, 00:17:15.480 |
So like that lets you control your gross margin 00:17:25.280 |
that this change in the sort of inference paradigm 00:17:31.520 |
that the funding environment for pre-training 00:17:43.200 |
- Yeah, and I was looking back at our yearly recap 00:17:47.760 |
was like the mixed trial price fights, you know? 00:17:50.760 |
And I think now it's almost like there's nowhere to go. 00:18:14.160 |
what does it look like to spend $1,000 a month on AI? 00:18:17.840 |
- Yes, I think if your job is at least AI content creator 00:18:24.760 |
you should already be spending like $1,000 a month on stuff. 00:18:28.080 |
And then obviously easy to spend, hard to use. 00:18:37.560 |
So like deep research that they just launched 00:18:49.560 |
I've built a bunch of things once we had flow 00:18:58.880 |
- Yeah, I think once they get advanced voice mode, 00:19:01.840 |
like capability, today it's still like speech to text, 00:19:06.440 |
But it's good for like reservations and things like that. 00:19:13.960 |
- Okay, I feel like we've covered a lot of stuff. 00:19:16.480 |
Yeah, I think we will go over the individual talks 00:19:23.040 |
I don't want to take too much time with this stuff, 00:19:25.240 |
but suffice to say that there was a lot of progress 00:19:29.880 |
Basically this is all like the audience voting 00:19:33.200 |
And then I just invited the best speaker I could find 00:19:45.760 |
OpenHands is currently still number one on SWE-Bench Full, 00:19:53.120 |
Everyone is saying 2025 is the year of agents, 00:20:01.040 |
of what are the frontier problems to solve in agents. 00:20:08.760 |
about the environment has been super interesting to us 00:20:13.760 |
Because yeah, how do you put an agent in an enterprise 00:20:27.120 |
you can't really rag things that are not documented, 00:20:29.480 |
but people know them based on how they've been doing it. 00:20:42.320 |
- And I think today the agents that most people are building 00:20:46.920 |
but are not as good as like extracting them from you. 00:20:53.360 |
- Just to touch quickly on the Jeff Dean thing. 00:21:02.760 |
instead of just focusing on ML to do something else? 00:21:07.080 |
we had, you know, Eugene from RWKV on the podcast before, 00:21:10.400 |
like he's doing a lot of that with Featherless AI. 00:21:14.760 |
I'm a little bit uncomfortable with how much it costs, 00:21:17.120 |
because it does use more of the GPU per call, 00:21:20.040 |
but because everyone is so keen on fast inference, 00:21:29.320 |
- Yeah, so that Jeff is, Jeff's talk was more, 00:21:33.600 |
I think people got the wrong impression from my tweet, 00:21:47.080 |
where it's basically the story of bootstrapping 00:21:50.120 |
of humans and AI in AI research or AI in production. 00:21:55.040 |
where like how much synthetic data has grown in 2024 00:22:01.360 |
And I think Jeff then also extended it basically to chips, 00:22:06.200 |
So he'd spent a lot of time talking about AlphaChip. 00:22:23.840 |
- For me, it's just like bonus for the InSpace supporters, 00:22:26.680 |
because I feel like they haven't been getting anything. 00:22:28.720 |
And then I wanted a more high-frequency way to write stuff. 00:22:36.600 |
I think basically we now have an answer to what Ilya saw. 00:23:13.240 |
And I think it's very prescient in 2016 to write that. 00:23:21.840 |
basically kind of driving the scaling up of OpenAI, 00:23:37.440 |
- Yeah, and at least for me, I can speak for myself, 00:23:39.400 |
like I didn't see the path from Dota to where we are today. 00:23:44.120 |
like they wouldn't necessarily draw a straight line, but. 00:23:49.200 |
But I think like that was like the whole idea 00:23:52.520 |
And we talked about this with Nathan on his podcast. 00:23:55.240 |
It's like with RL, you can get very good at specific things, 00:23:58.000 |
but then you can't really like generalize as much. 00:23:59.720 |
And I think the language models are like the opposite, 00:24:01.720 |
which is like, you're gonna throw all this data at them 00:24:08.720 |
And we'll talk about the OpenAI reinforcement, 00:24:10.840 |
fine tuning, announcement too, and all of that. 00:24:13.480 |
But yeah, I think like scale is all you need. 00:24:16.480 |
That's kind of what Ilya will be remembered for. 00:24:31.280 |
but like the second ingredient, which is data, 00:24:35.040 |
So it's not necessarily pre-training is over. 00:24:38.040 |
It's kind of like what got us here, won't get us there. 00:24:58.000 |
And yeah, I like the fossil fuel of AI analogy. 00:25:00.440 |
It's kind of like the little background tokens thing. 00:25:05.640 |
is basically like, instead of fine tuning on data, 00:25:09.760 |
So it's basically like, instead of being data driven, 00:25:21.520 |
Because I think this is what people run into. 00:25:35.800 |
is basically like mammals that scale linearly 00:25:40.360 |
And then humans kind of like broke off the slope. 00:25:46.400 |
And then the post-training slope is like the human one. 00:25:49.720 |
- Yeah, I wonder what the, I mean, we'll know in 10 years, 00:25:52.360 |
but I wonder what the y-axis is for Ilya's SSI. 00:25:57.840 |
- Ilya, if you're listening, you're welcome here. 00:26:02.560 |
like agent, synthetic data, inference compute. 00:26:05.480 |
- I don't think he was dropping any alpha there. 00:26:10.920 |
- I think that there was comparatively a lot more work. 00:26:15.800 |
Oh, by the way, I need to plug that my friend Yi 00:26:21.400 |
- Of like all the, she called it must read papers of 2024. 00:26:26.160 |
So I laid out some of these at NeurIPS and it was just gone. 00:26:30.280 |
'cause people are dying for like little guidance 00:26:35.440 |
And so I thought it was really super nice that we got that. 00:26:38.640 |
- Should we do a Latent Space book for each year? 00:26:46.920 |
- Hi, Will, by the way, we haven't introduced you. 00:26:58.080 |
- Okay, one fun one, and then one more general one. 00:27:00.680 |
So the fun one is this paper on agent collusion. 00:27:13.320 |
because the real reason, like NeurIPS this year 00:27:17.960 |
A lot of people actually even go and don't buy tickets 00:27:20.280 |
because they just go and attend the side events. 00:27:23.600 |
and end up crowding around the most popular papers, 00:27:32.560 |
But there's like something like 10,000 other papers out there 00:27:40.680 |
They failed to get attention for one reason or another. 00:27:45.080 |
And this is a DeepMind paper that actually focuses 00:27:49.760 |
by hiding messages in the text that they generate. 00:28:01.920 |
But something I've always emphasized is to LLMs, 00:28:16.760 |
that we're trying to collaborate to take over the planet, 00:28:27.160 |
So he marked, I'm showing it on screen right now, 00:28:34.000 |
GPT-2, Llama 2, Mixtral, GPT-3.5: zero capabilities, 00:28:40.320 |
And this is the kind of Jason Wei type emergence 00:28:44.440 |
I think what made this paper stand out as well, 00:28:46.960 |
so he developed a benchmark for steganography collusion, 00:28:50.680 |
and he also focused on Schelling point collusion, 00:28:54.720 |
Like for agreeing on a decoding encoding format, 00:28:58.080 |
you kind of need to have some agreement on that. 00:29:00.680 |
But Schelling point means like very, very low 00:29:06.480 |
if the only message I give you is meet me in New York, 00:29:11.520 |
you would probably meet me at Grand Central Station. 00:29:14.680 |
That is, Grand Central Station is a Schelling point, 00:29:19.040 |
That is, the Schelling point of New York is Grand Central. 00:29:21.600 |
To that extent, Schelling points for steganography 00:29:27.680 |
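To make the Schelling point idea concrete, here's a toy sketch of acrostic-style steganography, the kind of obvious default two models might converge on without prior coordination. This is our own illustration, not the paper's actual benchmark or encoding:

```python
# Toy acrostic steganography: hide a secret as the first letter of each
# sentence. A Schelling-point encoding two agents might independently
# pick; the DeepMind paper's benchmark is far more sophisticated.

def encode(secret: str, sentences: dict[str, str]) -> str:
    # Cover text whose sentence initials spell out the secret.
    return " ".join(sentences[ch] + "." for ch in secret)

def decode(cover_text: str) -> str:
    # Read back the first letter of each sentence.
    parts = [s.strip() for s in cover_text.split(".") if s.strip()]
    return "".join(p[0].lower() for p in parts)

sentences = {
    "h": "Having coffee before the standup",
    "i": "Inference costs keep falling",
}
cover = encode("hi", sentences)
assert decode(cover) == "hi"
```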
It will be interesting at some point in the future 00:29:36.720 |
I think that's like one of the hardest things 00:29:46.800 |
worked out the optimal pricing for language models. 00:29:51.080 |
It's basically an econometrics paper at NeurIPS, 00:29:53.120 |
where like everyone else is talking about GPUs. 00:29:55.400 |
- And the guy with the GPUs is talking about pricing. 00:30:02.080 |
The broader focus I saw is that model papers at NeurIPS 00:30:12.080 |
This is all the grad students are working on. 00:30:25.480 |
they're kind of flip sides of the same thing. 00:30:30.560 |
And then the sort of big model that people walk around 00:30:36.440 |
And that's kind of how it develops, I feel like. 00:30:43.880 |
who worked on LLaVA, which is, take Llama and add vision. 00:30:53.560 |
This year, I don't think there was any of those. 00:30:55.480 |
- Yeah, what were the most popular like orals? 00:31:10.400 |
But I think last year, there was a lot of interest 00:31:12.360 |
in like furthering models and like different architectures 00:31:16.680 |
- I will say that I felt the oral picks this year 00:31:23.480 |
of how I have changed in terms of how I view papers. 00:31:31.080 |
two of the best papers in this year for data sets 00:31:38.040 |
These are two actually industrially used papers, 00:31:46.440 |
So like, it's just that the picks were different. 00:31:54.360 |
This is the Schedule Free Optimizer paper from Meta, 00:32:00.960 |
there's been a lot of chat about Shampoo, SOAP, 00:32:08.240 |
And most people at the big labs who I asked about this 00:32:12.840 |
say that it's cute, but it's not something that matters. 00:32:15.840 |
I don't know, but it's something that was discussed 00:32:43.600 |
Scarlett Johansson to that side of the fence. 00:32:48.600 |
I actually wanted to get a snapshot of all the lawsuits. 00:32:55.800 |
On the right-hand side, we have the synthetic data people. 00:33:05.200 |
between Scale AI and the synthetic data community, 00:33:12.840 |
Surprise, surprise, Scale is the leading vendor 00:33:21.400 |
- So I think there's some debate going on there, 00:33:27.480 |
for the reasons that are addressed in Loubna's talk, 00:33:32.640 |
I don't know if you have any perspectives there. 00:33:34.280 |
- I think, again, going back to the reinforcement, 00:33:36.120 |
fine-tuning, I think that will change a little bit 00:33:39.680 |
I think today, people mostly use synthetic data 00:33:41.880 |
for distillation and fine-tuning a smaller model 00:33:46.960 |
I'm not super aware of how the frontier labs use it, 00:33:50.240 |
outside of the rephrase, the web thing that Apple also did. 00:33:56.120 |
I think whether or not that gets us the big next step, 00:34:07.640 |
I think synthetic data is something that people can do, 00:34:31.680 |
It basically helps you to synthesize reasoning steps 00:34:35.320 |
or at least distill reasoning steps from a verifier. 00:34:38.560 |
And if you look at the OpenAI RFT API that they released 00:34:45.040 |
basically, they're asking you to submit graders 00:34:47.560 |
or they choose from a preset list of graders. 00:35:15.440 |
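For flavor, a grader is roughly a small spec that scores a model output between 0 and 1. The field names below are assumptions for illustration, not the actual API schema:

```python
# Hypothetical grader spec in the spirit of the RFT launch: either pick
# a preset grader or supply your own scoring logic. Field names here
# are illustrative assumptions, not OpenAI's actual schema.

grader_spec = {
    "type": "string_check",                # assumed preset grader type
    "reference_field": "expected_answer",  # ground truth per example
}

def string_check(output: str, expected_answer: str) -> float:
    """Simplest possible grader: 1.0 on exact match, else 0.0."""
    return 1.0 if output.strip() == expected_answer.strip() else 0.0
```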
nobody has any problem with all the reasoning. 00:35:33.080 |
and creative writing and instruction following, 00:35:44.120 |
I think one of the O1 Pro demos that Noam was showing 00:35:47.400 |
was that, you know, you can write an entire paragraph 00:35:50.360 |
or three paragraphs without using the letter A, right? 00:35:53.720 |
So like literally just anything in the sort of token, 00:36:00.040 |
and instruction following, it's very, very strong. 00:36:06.800 |
it's going to do that much better than in previous models. 00:36:09.520 |
So I think it's underrated for creative writing. 00:36:15.480 |
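Constraints like the no-letter-A demo are nice probes precisely because they're trivially checkable. A one-line verifier, for the curious:

```python
def is_lipogram(text: str, banned: str = "a") -> bool:
    """Check the O1 Pro demo constraint: the banned letter never appears."""
    return banned.lower() not in text.lower()

print(is_lipogram("Moths flit by moonlight."))  # True: no letter 'a'
print(is_lipogram("An apple a day."))           # False
```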
when they don't show you the thinking choices of O1, 00:36:21.120 |
they're getting sued for using other publishers' data, 00:36:24.280 |
you know, but then on their end, they're like, 00:36:29.160 |
So I'm curious to see how that kind of comes. 00:36:31.400 |
- Yeah, I mean, O1 has many ways to punish people 00:36:35.720 |
already banned ByteDance for distilling their info. 00:36:38.960 |
And so anyone caught distilling the chain of thought 00:36:41.560 |
will be just disallowed to continue on the API 00:36:47.080 |
Like, I don't even think that's an issue at all, 00:36:49.080 |
just because the chain of thoughts are pretty well hidden. 00:36:51.880 |
Like, you have to work very, very hard to get it to leak. 00:36:55.080 |
And then even when it leaks the chain of thought, 00:37:14.880 |
That Claude Sonnet so far is beating O1 on coding tasks 00:37:34.960 |
Because like, even DeepSeek was able to do it 00:37:37.720 |
and they had two months notice to do this, to do R1. 00:37:40.600 |
So it's actually unclear how much moat there is. 00:37:42.920 |
Obviously, if you talk to the Strawberry team, 00:37:52.560 |
because there'll be a lot of noise from people 00:37:58.320 |
because they just have fancy chain of thought. 00:38:01.360 |
who actually do have very good chain of thought 00:38:03.400 |
and you will not see them on the same level as OpenAI 00:38:11.440 |
Like the real answer is somewhere in between. 00:38:13.160 |
- Yeah, I think that's kind of like the main data 00:38:39.600 |
Like, you know, that market is kind of like plummeted 00:38:46.560 |
They just want to be, you know, completely without them. 00:38:49.280 |
Yeah, how do you think about this war developing? 00:38:52.800 |
but like, I feel like the appetite for GPU rich startups, 00:39:20.600 |
we had Leopold's essay on the trillion dollar cluster. 00:39:24.720 |
We have multiple labs, you know, XAI, very famously, 00:39:31.920 |
in spinning up a hundred thousand GPU cluster 00:39:40.200 |
So like the GPU ultra rich are going to keep doing that 00:39:42.880 |
because I think partially it's an article of faith now 00:39:46.320 |
Like you don't even know what we're going to use it for. 00:40:15.000 |
Claude, you know, Opus 3.5 is not coming out. 00:40:19.760 |
And Gemini 2, like we don't have pro whatever. 00:40:24.640 |
Maybe I'll call it like the 2 trillion parameter wall. 00:40:28.600 |
Like it's just like, no one thinks it's a good idea, 00:40:31.640 |
at least from training costs, from amount of data, 00:40:34.640 |
or at least the inference, like, would you pay 10X 00:40:49.040 |
then you actually need more just general purpose compute 00:40:54.480 |
that production deployments of the previous paradigm 00:40:59.760 |
So it makes sense that the GPU rich are growing. 00:41:08.160 |
but I think Amazon may be kind of a sleeper one, Amazon, 00:41:25.000 |
- Yeah, I mean, that's the power of prepaid contracts. 00:41:28.840 |
I think like a lot of AWS customers, you know, 00:41:39.600 |
So they can kind of bundle them together and prefer pricing. 00:41:42.320 |
- Okay, so maybe GPU super rich, doing very well. 00:41:57.600 |
Like if you're GPU poor, you should just use the cloud. 00:42:01.320 |
- You know, and I think there might be a future 00:42:03.400 |
once we kind of like figure out what the size 00:42:07.280 |
where like the tiny box and these things come to fruition 00:42:14.160 |
why are you working so hard till I get these models to run 00:42:25.800 |
People think it's a stepping stone to scaling up. 00:42:32.040 |
Like Nous Research, like probably the most deep tech thing 00:42:35.080 |
they've done this year is DisTrO or whatever the new name is. 00:42:38.440 |
There's a lot of interest in heterogeneous computing, 00:42:42.000 |
I tend generally to de-emphasize that historically, 00:43:00.320 |
then that will be very beneficial to the broader community, 00:43:04.880 |
but maybe still not the source of frontier models. 00:43:08.400 |
- It's just going to be a second tier of compute 00:43:16.840 |
We are, I now have Apple intelligence on my phone, 00:43:19.720 |
doesn't do anything apart from summarize my notifications, 00:43:26.360 |
The notification summaries are so-and-so in my experience. 00:43:50.840 |
rolling out RWKV in sort of every Windows deployment. 00:43:50.840 |
in this GPU poor war that I think I should now, 00:44:07.000 |
either a foundation model lab or a GPU cloud. 00:44:12.320 |
Suno, Ramp has rated as one of the top ranked, 00:44:12.320 |
And again, now they've announced 20 million ARR, 00:44:46.440 |
So yeah, I mean, it's crazy that all these GPU poors 00:45:03.840 |
And the people who are in the middle inflection, 00:45:09.440 |
I think character did the best of all of them. 00:45:11.720 |
Like you have a note in here that we apparently said 00:45:23.680 |
Like, I don't know what the beginning was like. 00:45:32.960 |
we never had text-to-video in the first version, 00:45:37.440 |
- Yeah, I would say it's a subset of image, but yes. 00:45:41.720 |
it wasn't really something people were doing. 00:45:49.640 |
- I've not tried Sora because the day that I tried- 00:46:04.600 |
- What's the other model that you posted today 00:46:16.640 |
The Chinese labs do surprisingly well at the video models. 00:46:51.840 |
do you specialize in a single modality, right? 00:46:55.520 |
Or do you have God model that does all the modalities? 00:47:00.880 |
in the sense of ElevenLabs, you know, now a unicorn. 00:47:00.880 |
HN, I think has reached a hundred million ARR. 00:47:38.600 |
on the day of launch, and she wasn't prepared. 00:47:41.800 |
She was just like, "I'm just gonna show you." 00:47:47.600 |
and then they obviously can code gen and all that. 00:47:49.400 |
But the new one that OpenAI and Meta both have, 00:47:56.040 |
So you can literally, I think their demo video 00:48:00.600 |
and you ask for minor modifications to that car, 00:48:18.480 |
"Huh, we got you everything in the transformer." 00:48:25.360 |
Or do you string together a whole bunch of small models 00:48:31.920 |
I mean, obviously I use Midjourney for all of our thumbnails. 00:48:36.560 |
- They've been doing a ton on the product, I would say. 00:48:42.360 |
Because I think, yeah, the model is kind of like, 00:48:46.360 |
the Black Forest models are better than Midjourney 00:48:53.120 |
- Have you tried the same problems on Black Forest? 00:48:56.640 |
you know, on Black Forest, it generates one image. 00:49:10.920 |
Like the good thing about Midjourney is like, 00:49:15.680 |
There's a lot of stuff that just makes it really easy. 00:49:30.840 |
that Black Forest should be able to do all that stuff. 00:49:47.920 |
And they have launched a whole bunch of arenas. 00:50:03.280 |
who left stability after the management issues. 00:50:10.040 |
I would also highlight that Grok has now launched Aurora, 00:50:18.360 |
because Grok's images were originally launched 00:50:22.080 |
in partnership with Black Forest Labs as a thin wrapper. 00:50:24.560 |
And then Grok was like, "No, we'll make our own." 00:50:28.280 |
I don't know, there are no APIs or benchmarks about it. 00:50:39.960 |
because they are just focused on their tasks. 00:50:42.120 |
But the big model people are always catching up. 00:50:45.440 |
And the moment I saw the Gemini 2 demo of image editing 00:50:49.760 |
where I can put an image and just request it and it does, 00:50:58.000 |
And I think one frontier that we haven't seen this year, 00:51:03.920 |
You know, when we have the release of Sora Turbo today, 00:51:09.080 |
or at least the Hollywood Labs will get full Sora. 00:51:11.040 |
We haven't seen video to audio or video synced with audio. 00:51:18.520 |
But there's still maybe like five more years of video left 00:51:23.480 |
I would say that Gemini's approach compared to OpenAI, 00:51:26.720 |
Gemini seems, or DeepMind's approach to video 00:51:32.800 |
Because if you look at the ICML recap that I published 00:52:05.600 |
I would say, in showing, like the reason of Veo 2, 00:52:05.600 |
is because they have all this background work 00:52:23.480 |
I already was interviewing some of their video people. 00:52:29.280 |
they can go to NeurIPS 2023 and see that paper. 00:52:32.480 |
- And then last but not least, the LLM OS/RAG Ops, 00:52:32.480 |
I put the latest chart on the Braintrust episode. 00:52:40.520 |
is 'cause I wanted to show up on Hacker News. 00:52:50.920 |
I want the podcast to show up on Hacker News. 00:52:54.480 |
because at Hacker News, people like to read and not listen. 00:53:01.200 |
- You say LangChain, LlamaIndex is still growing. 00:53:20.280 |
These are the last six months, LlamaIndex still growing. 00:53:23.760 |
What I've basically seen is like things that, 00:53:26.600 |
one, obviously these things have a commercial product. 00:53:29.760 |
So there's like people buying this and sticking with it 00:53:39.600 |
If you look on GitHub, the stars are growing, 00:53:41.600 |
but kind of like the usage is kind of like flat 00:54:07.200 |
there's still a wait list for AutoGPT to be used. 00:54:15.800 |
It's the fastest growing project in the history of GitHub. 00:54:18.360 |
But I think, you know, when you maybe like run the numbers 00:54:25.920 |
which is like a lot of stars, a lot of interest 00:54:28.320 |
at a rate that you didn't really see in the past 00:54:30.120 |
in open source where nobody's running to start, 00:54:35.120 |
It's kind of like just to be able to actually use it. 00:54:47.160 |
I think that patience has come down over time. 00:54:56.920 |
But I think people are still coming around now on Devon. 00:55:01.200 |
and even you're going to be a paying customer. 00:55:10.040 |
in terms of what I think is the dynamics going on here, 00:55:15.960 |
Over-promising, under-delivering applies to any startup. 00:55:24.840 |
So Auto-GPT's initial problem was making money, 00:55:29.160 |
And I think that means that there's a lot of broad interest 00:55:35.920 |
So that's why this concentrates a lot of stars. 00:55:38.320 |
And then obviously, because it does too much, 00:55:46.240 |
for why the interest to usage ratio is so low. 00:55:49.640 |
And the second one is obviously pure execution. 00:56:09.960 |
Like sticking anything in there, it'll mostly be correct. 00:56:14.080 |
it's like, you know, we will help you complete code. 00:56:17.400 |
We will help you with your PR reviews, like small things. 00:56:25.040 |
we soft announced the E2B fundraising on this podcast. 00:56:29.800 |
CodeSandbox got acquired by Together AI last week, 00:56:29.800 |
which they're now also going to offer as an API. 00:56:39.800 |
Yeah, and then in the last step, two episodes ago with Bolt, 00:56:46.440 |
I think like there's maybe the spectrum of code interpreting, 00:57:03.520 |
just because, I mean, everybody needs to run code, right? 00:57:16.200 |
they do all these nice charts for like finance 00:57:23.200 |
of kind of like a hair on fire problem, so to speak. 00:57:28.240 |
And this was one that really wasn't on the radar 00:57:34.200 |
I think mostly because I was trying to limit it 00:57:36.840 |
to RAG Ops, but I think now that the frontier has expanded 00:57:36.840 |
core set of tools would include code interpreting, 00:57:49.760 |
And Graham in his state of agents talk had this as well, 00:57:55.600 |
'cause like everyone finds the same set of things. 00:57:58.520 |
So it's basically like, everyone needs web browsing, 00:58:03.240 |
and then everyone needs some kind of memory or planning 00:58:10.280 |
but I think this is what we've discovered so far. 00:58:16.920 |
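As a sketch of what that converged tool set tends to look like in code, with stubbed bodies and names that are ours, not any particular framework's:

```python
# Minimal agent tool registry reflecting the converged core set:
# web browsing, code interpreting, and memory. Names are illustrative.
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {}

def tool(name: str) -> Callable:
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOLS[name] = fn
        return fn
    return register

@tool("web_browse")
def web_browse(url: str) -> str:
    return f"<stripped-down contents of {url}>"  # stub: fetch + clean HTML

@tool("run_code")
def run_code(source: str) -> str:
    return "<stdout from a sandboxed run>"       # stub: sandboxed execution

@tool("memory_lookup")
def memory_lookup(query: str) -> str:
    return "<relevant persisted memories>"       # stub: planning/memory store
```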
I think that basically the statefulness of these things 00:58:28.480 |
It's because sometimes you might need to time travel back, 00:58:31.560 |
like unwind or fork to explore different paths 00:58:40.360 |
the new implementations as the emerging frontier 00:58:42.400 |
in terms of like what people kind of are going to need 00:58:49.760 |
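A sketch of what that unwind/fork surface could look like for a stateful sandbox; this is a hypothetical API to make the point concrete, not any vendor's actual product:

```python
# Hypothetical checkpointing for a stateful code sandbox: snapshot and
# restore give you "time travel back," fork lets an agent explore
# several paths from the same state.
import copy

class Sandbox:
    def __init__(self) -> None:
        self.state: dict = {"files": {}, "env": {}}

    def snapshot(self) -> dict:
        return copy.deepcopy(self.state)

    def restore(self, snap: dict) -> None:
        self.state = copy.deepcopy(snap)         # unwind a bad path

    def fork(self) -> "Sandbox":
        child = Sandbox()
        child.state = copy.deepcopy(self.state)  # branch to explore
        return child
```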
And then I'll also call out that I think ChatGPT Canvas 00:58:52.800 |
with what they launched in the 12 Days of Shipmas 00:58:52.800 |
has surprisingly superseded Code Interpreter. 00:59:01.680 |
And now Canvas can also write code and also run code 00:59:04.320 |
and do more than Code Interpreter used to do. 00:59:07.640 |
So there's a toggle box for Canvas and for Code Interpreter 00:59:13.000 |
My old thesis that custom GPTs is your roadmap for investing 00:59:22.680 |
why you should use Code Interpreter over Canvas. 00:59:27.480 |
that both Anthropic and OpenAI and Fireworks has now shipped 00:59:31.960 |
that I think is going to be the norm for next year, 00:59:40.360 |
Like the Aider benchmarks were also all based on diffs 01:00:01.320 |
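For readers who haven't seen diff-based editing, the rough shape is that the model emits only the changed hunk, in a search/replace form along these lines (simplified from the formats Aider and others use), and applying it is just a substitution:

```python
# Simplified search/replace edit in the spirit of diff-based code
# editing: the model outputs the hunk to find and its replacement,
# never the whole file.
def apply_edit(source: str, search: str, replace: str) -> str:
    assert search in source, "hunk must match the current file exactly"
    return source.replace(search, replace, 1)

original = "def greet():\n    print('hello')\n"
patched = apply_edit(original, "print('hello')", "print('hello, world')")
assert "hello, world" in patched
```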
It's not explicit memory of what you've written. 01:00:06.240 |
oh, use a node to this, use a node to this 10 times 01:00:12.840 |
Like it doesn't, none of the memory products do that. 01:00:22.400 |
- Yeah, or even like, you know, Lindy has like memories, 01:00:24.640 |
you know, it's like based on what I say, it remembers it. 01:00:28.480 |
So it's less about making an actual memory of my preference. 01:00:33.760 |
And I'm trying to figure out at what level that gets solved. 01:00:37.400 |
You know, like is it, do these memory products 01:00:40.240 |
like the MemGPTs of the world create a better way 01:00:51.520 |
I just don't think that like the approaches today 01:00:54.440 |
are like actually memory or what you need a system to have. 01:00:59.720 |
But I would just point it to it being immature 01:01:04.400 |
Like it's clearly something that we will want at some point. 01:01:07.280 |
And so the people developing it now are, you know, 01:01:12.320 |
And I would definitely predict that next year 01:01:19.280 |
we had the "Shouldn't You" pod with Harrison as guest host, 01:01:22.040 |
I over-focused on LangMEM as a separate product. 01:01:28.520 |
And I think that everyone will need some kind of memory. 01:01:31.800 |
And I think that this has distinguished itself now 01:01:35.200 |
as a separate need from a normal rag vector database. 01:01:39.560 |
whether it's on top of a vector database or not, 01:01:45.000 |
Like I've had to justify this so much, actually, 01:01:47.040 |
that I have a draft post in the "Latent Space" dashboard 01:01:51.680 |
what is the difference between memory and knowledge? 01:01:54.760 |
It's like, knowledge is about the world around you. 01:02:15.200 |
Time is a specifically important one in memory 01:02:19.920 |
And then you also need like a review function. 01:02:22.440 |
A lot of people are implementing this as sleep. 01:02:28.560 |
and you come up with new insights that you then persist 01:02:37.080 |
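Sketching the shape of that sleep/review loop, with `summarize` standing in for whatever LLM call you like; an illustration, not any specific product's pipeline:

```python
# Toy "sleep" pass: compress a day's raw memories into one durable
# insight. summarize() is a placeholder for an LLM summarization call.
from datetime import date

def summarize(texts: list[str]) -> str:
    return " / ".join(texts)[:200]  # stand-in for the real LLM call

def sleep(raw_memories: list[str], store: dict[str, str]) -> None:
    """Nightly review: persist consolidated insights, clear the buffer."""
    if raw_memories:
        store[str(date.today())] = summarize(raw_memories)
        raw_memories.clear()
```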
It's another one that's based on Neo4j's knowledge graph 01:02:43.400 |
I feel like Letta, since it was funded by Quiet Capital, 01:02:52.440 |
which I feel like there's a bunch of those now. 01:03:02.320 |
just like every consumer product is going to have a, 01:03:14.240 |
Code interpreter for maybe not exposing the code, 01:03:21.360 |
So as a consumer, let's say you are a new.computer, 01:03:25.360 |
who, you know, they've launched their own little agents, 01:03:37.920 |
but at some point you need to compress your memory 01:03:47.560 |
And these guys have been doing it for a year now. 01:03:49.840 |
- Yeah, to me, it's more like I want to bring the memories. 01:03:55.840 |
- So you selectively choose the memory to bring in. 01:03:57.760 |
- Why does every time that I go to a new product, 01:04:08.080 |
Anthropic's model context protocol that they launched 01:04:10.200 |
has a 300 line of code memory implementation. 01:04:12.760 |
Very simple, very bad news for all the memory startups, 01:04:19.040 |
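Not the actual MCP reference code, but the spirit of it, that a useful memory server can be a tiny store over a flat file, looks something like:

```python
# Minimal file-backed memory store in the spirit of the MCP reference
# memory server (a small knowledge graph persisted as JSON). A sketch,
# not the actual implementation.
import json
import pathlib

PATH = pathlib.Path("memory.json")

def load() -> dict[str, list[str]]:
    return json.loads(PATH.read_text()) if PATH.exists() else {}

def remember(entity: str, observation: str) -> None:
    graph = load()
    graph.setdefault(entity, []).append(observation)
    PATH.write_text(json.dumps(graph, indent=2))

def recall(entity: str) -> list[str]:
    return load().get(entity, [])
```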
And yeah, it would be nice to have a portable memory of you 01:04:23.440 |
Simple answer is there's no standardization for a while 01:04:25.840 |
because everyone will experiment with their own stuff. 01:04:46.840 |
basically came from Georgi Gerganov with llama.cpp, right? 01:04:50.520 |
And that was completely open source, completely bottoms up. 01:04:53.040 |
And that's because there's just a significant amount of work 01:05:00.320 |
So like that kind of standardization can be done. 01:05:09.040 |
because those are the people with the longest memories. 01:05:14.280 |
I've looked at SillyTavern, I've looked at Kobold, 01:05:18.720 |
And there's like four or five different standardized 01:05:24.880 |
If there was anyone that developed memory first, 01:05:27.000 |
that became a standard, it would be those guys. 01:05:28.880 |
- Cool, I'm excited to see what people built. 01:05:35.840 |
So basically, I just wanted to mention this briefly. 01:05:39.080 |
Like, I think that in a year, end of year review, 01:05:41.800 |
it's useful to remind everybody where we were. 01:05:47.200 |
everyone has gone up and it's a very close race. 01:05:51.880 |
I was looking at the OpenAI live stream today 01:06:01.080 |
are like completely different than the benchmarks 01:06:04.440 |
that we were talking about this time last year. 01:06:07.280 |
This time last year, we were still talking about MMLU. 01:06:16.440 |
of the Hugging Face Open Models leaderboard, right? 01:06:18.880 |
We talked to Clementine about the decisions that she made 01:06:27.360 |
also has emerged this year as the leading like battlegrounds 01:06:33.560 |
But also, we have also seen like the emergence 01:06:41.040 |
It will be interesting to see like top most cited benchmarks 01:06:44.280 |
of the year from 2020 to 2021, two, three, four, 01:06:50.480 |
And you can see what has been saturated and solves 01:06:55.400 |
And so now people care a lot about frontier math coding, 01:06:58.960 |
There's literally a benchmark called frontier math, 01:07:00.320 |
which I spent a bit of time talking about at NeurIPS. 01:07:19.320 |
At NeurIPS, GPQA was declared dead, which is very sad. 01:07:22.360 |
People were still talking about GPQA Diamond. 01:07:28.280 |
So it's supposed to be resistant to saturation for a while. 01:07:34.480 |
So now we only care about SWE-Bench, LiveBench, 01:07:44.800 |
And then we also care about the new Konwinski Prize 01:07:48.920 |
which is the guy that we talked to yesterday, 01:07:50.520 |
who has launched a similar sort of ArcAGI attempt 01:08:00.040 |
which is more tracking sort of ML research and bootstrapping, 01:08:08.560 |
which is when the researchers can automate their own jobs. 01:08:17.240 |
I think Dylan at the debate, he said SWE-Bench 80% 01:08:38.600 |
And then as we get to 100 and the open source catches up. 01:08:46.760 |
is keep track of the slow cooking of benchmark language 01:08:53.920 |
will keep measuring themselves on last year's benchmarks. 01:08:59.200 |
will tell you about benchmarks you've never heard of. 01:09:02.040 |
oh, like, okay, there's new territory to go on. 01:09:06.600 |
Yeah, maybe I won't belabor this point too much. 01:09:13.080 |
Like basically every new frontier capabilities 01:09:15.480 |
and this is the next section that we're gonna go into 01:09:19.000 |
We'll also briefly talk about Ruler as like the new sort of, 01:09:24.240 |
and RULER is basically a multi-dimensional needle in a haystack. 01:09:32.920 |
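A stripped-down version of that kind of multi-needle test, our own sketch of the idea rather than RULER's actual harness:

```python
# Toy multi-needle-in-a-haystack eval: plant several key-value needles
# in a long distractor context, then score exact recall. A sketch of
# the idea behind RULER, not its actual harness.
import random

def build_haystack(needles: dict[str, str], filler_lines: int = 1000) -> str:
    lines = ["The grass is green and the sky is blue."] * filler_lines
    for key, value in needles.items():
        lines.insert(random.randrange(len(lines)), f"The {key} is {value}.")
    return "\n".join(lines)

needles = {"magic number": "7481", "secret city": "Lisbon"}
context = build_haystack(needles)
# score = fraction of needles the model repeats back, e.g.:
# answer = llm(context + "\nWhat is the magic number?")  # hypothetical call
```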
This is one of the slides that I did on my Dev Day talk. 01:09:34.960 |
So we're moving on from benchmarks to capabilities. 01:09:45.560 |
I kind of like the thought spot model of what's mature, 01:09:49.720 |
what's emerging, what's frontier, what's niche. 01:09:51.480 |
So mature is like stuff that you can just rely on 01:09:57.960 |
And then what's solved is kind of long context. 01:10:01.840 |
Today, O1 announced 200K, which is very expensive. 01:10:17.600 |
And then code generation, and kind of solved. 01:10:23.840 |
- Single line autocomplete versus multi-file generation. 01:10:34.120 |
but they only launched for short output this year. 01:10:39.400 |
Everyone has vision now, I think, including O1. 01:10:48.280 |
about the work being done with CodePoly and CodeQuin. 01:10:52.000 |
- What's for you the break point for vision to go to mature? 01:11:00.640 |
- NVIDIA, most valuable company in the world. 01:11:10.760 |
I think the quote that I highlighted in AI News 01:11:25.520 |
that the transition from the H to the B series 01:11:32.720 |
and tried to beat the game out, that would be insane. 01:11:48.800 |
They went for a two year cycle to one year cycle, right? 01:11:53.920 |
You know, like there have been delays in the past 01:11:57.120 |
they're typically very good buying opportunities. 01:12:07.200 |
that we lost about 15 minutes of audio and video 01:12:12.920 |
and I'm just cutting it back in and rerecording. 01:12:15.040 |
We don't have time to rerecord before the end of the year. 01:12:19.160 |
So I'm just going to do my best to recover what we have 01:12:23.160 |
and then sort of segue you in nicely to the end. 01:12:30.960 |
frontier capabilities, and niche capabilities. 01:12:33.160 |
So emerging would be tool use, vision language models, 01:12:36.440 |
which you just heard, real-time transcription, 01:12:38.760 |
which I have on one of our upcoming episodes to be, 01:12:46.880 |
I think diarization capabilities are also maturing as well, 01:12:55.720 |
for the latent space transcripts to come out right. 01:13:08.560 |
And especially if there's crosstalk involved, 01:13:23.480 |
I think like basically I would say is on the horizon, 01:13:28.880 |
Like it's, you know, interesting to show off to people, 01:13:36.000 |
the large amount of money is going to be made 01:13:38.920 |
on long inference, on real-time interruptive, 01:13:44.560 |
on on-device models, as well as all other modalities. 01:13:50.200 |
I always say like base models are very underrated. 01:13:52.200 |
People always love talking to base models as well. 01:13:56.040 |
And we're increasingly getting less access to them. 01:14:12.800 |
It's just for historical interest, but you know, 01:14:18.480 |
Like it's definitely not a significant IP anymore for him. 01:14:23.720 |
I think OpenAI has a lot more things to worry about 01:14:27.120 |
but it will be a very, very nice things to do 01:14:31.480 |
I would say like the hype for state space models this year, 01:14:35.320 |
the post-transformers talk at Latent Space Live 01:14:37.680 |
was extremely hyped and very well attended and watched. 01:14:41.800 |
I would say like, it feels like a step down this year. 01:14:53.160 |
So Cartesia, I think is doing extremely well. 01:14:59.120 |
and some of our sort of NotebookLM podcast clones. 01:15:02.560 |
I think they're a real challenger to ElevenLabs as well. 01:15:05.920 |
And RWKV of course is rolling out on Windows. 01:15:12.200 |
We've been talking about them as the future for a long time. 01:15:26.320 |
which we will cover when we cover the sort of NeurIPS papers 01:15:36.400 |
Okay, so, and then we also wanna cover a little bit 01:15:44.120 |
So I'll bridge you into it back to the recording, 01:16:02.480 |
down to $1.27 in the span of like a couple of weeks. 01:16:08.880 |
a lot of people also interested in the price war, 01:16:11.760 |
sort of the price intelligence curve for this year as well. 01:16:16.720 |
I think roundabout in March of 2024 with Haiku's launch. 01:16:21.400 |
And so this is, if you're watching the YouTube, 01:16:24.000 |
this is what I initially charted out as like, 01:16:27.600 |
Like, everyone's kind of like in a pretty tight range 01:16:35.080 |
and it'll be cheaper to get less intelligence, 01:16:38.160 |
but roughly it correlates to aligned and a trend line. 01:16:45.520 |
and see that everything had kind of shifted right. 01:16:50.240 |
let's say GPT-4, 2023 would be about sort of 1175 ELO 01:16:58.880 |
and you used to get that for like $40 per million tokens, 01:17:08.040 |
And so that's a two orders of magnitude improvement 01:17:31.840 |
So it's about slightly higher than 1250 in ELO. 01:17:34.880 |
So the March frontier and shift to the July frontier 01:17:38.600 |
is roughly one order of magnitude improvement 01:17:43.680 |
And I think what you're starting to see now in July 01:17:51.120 |
where July frontier used to be maintained by 4.0, 01:18:00.120 |
And then if you update it like a month later, 01:18:07.320 |
you can see more items start to appear here as well 01:18:10.520 |
with the August frontier, with Gemini 1.5 Flash coming out 01:18:14.360 |
with an August update as compared to the June update, 01:18:17.560 |
being a lot cheaper and roughly the same ELO. 01:18:25.760 |
where we really started to understand the pricing curves 01:18:32.120 |
that some random person on the internet drew on a chart, 01:18:47.000 |
we had O1 preview and pricing and costs and ELOs. 01:19:04.120 |
Then they cut their prices, they halved their prices, 01:19:09.880 |
And so it's a very, very tight and predictive line, 01:19:25.200 |
Sonnet is not, I don't know where this Sonnet on this chart, 01:19:29.120 |
but Haiku new basically was 4X the price of old Haiku. 01:19:34.120 |
Oh, sorry, 3.5 Haiku was 4X the price of 3 Haiku. 01:19:43.120 |
There's a reasonable assumption, to be honest, 01:19:47.000 |
that it's not a price hike, it's just a bigger model. 01:19:48.920 |
So it costs more, but we just don't know that. 01:19:59.360 |
So yeah, that would be the sort of price ELO chart. 01:20:03.600 |
I would say that the main update for this one, 01:20:05.960 |
if you go to my LLM pricing chart, which is public, 01:20:09.400 |
you can ask me for it, or I've shared it online as well. 01:20:13.240 |
which we briefly, briefly talked about on the pod, 01:20:21.280 |
where Amazon's Nova Pro, Nova Lite, and Nova Micro 01:20:26.040 |
for their intelligence levels of 1200 to 1300. 01:20:30.640 |
You wanna get beyond 1300, you have to pay up 01:20:32.640 |
for the O1s of the world, and the 4Os of the world, 01:20:53.560 |
But I wanna give you the idea that basically for, 01:21:11.920 |
now is available approximately, with Amazon Nova, 01:21:17.000 |
approximately at, I don't know, $0.075 per million tokens. 01:21:17.000 |
So that is a couple orders of magnitude at least. 01:21:30.880 |
Actually, almost three orders of magnitude improvement 01:21:56.840 |
how much more accelerated this year has been. 01:22:00.240 |
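The arithmetic behind that claim, taking the rough numbers above at face value:

```python
# Order-of-magnitude check on intelligence-per-dollar: GPT-4-class
# quality at roughly $40 per million tokens in early 2023 versus
# roughly $0.075 per million tokens at comparable ELO in late 2024
# (the Amazon Nova figure above; both numbers are approximate).
import math

old_price, new_price = 40.0, 0.075
print(math.log10(old_price / new_price))  # ~2.73, "almost three orders"
```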
And obviously, I think a lot of people are speculating 01:22:16.080 |
And then we went into sort of the annual overview. 01:22:37.360 |
and check out all the sort of top news of the day. 01:22:41.360 |
But we had a little bit of an AI Rewind thing, 01:22:48.120 |
So January, we had the first round of the year 01:22:53.360 |
And for me, it was notable that Jeff Bezos backed it. 01:22:56.880 |
Jeff doesn't invest in a whole lot of companies, 01:23:01.560 |
back in the day, and now he's backing the new Google, 01:23:17.480 |
at one of the sort of global summit type things, Davos. 01:23:30.680 |
we were still sort of thinking about last year's Dev Day. 01:23:37.360 |
People were kind of losing confidence in GPTs. 01:23:40.680 |
And I feel like that hasn't super recovered yet. 01:23:43.920 |
I hear from people that there are still stuff in the works 01:23:48.280 |
And they're actually underrated now, which is good. 01:23:51.920 |
So I think people are taking a stab at the problem. 01:24:12.360 |
which we don't tend to talk a ton about in this podcast 01:24:19.040 |
But we also started seeing context window size blow out. 01:24:22.440 |
So this year, I mean, it was Gemini with 1 million tokens. 01:24:26.520 |
But also I think there's 2 million tokens talked about. 01:24:32.040 |
talking about how to fine tune for 1 million tokens. 01:24:36.640 |
to be your token context, but you also have to use it well. 01:24:40.280 |
And increasingly, I think people are looking at 01:24:43.240 |
not just Ruler, which is sort of multi needle 01:24:47.160 |
but also Muser and like reasoning over long context, 01:24:50.760 |
not just being able to retrieve over long context. 01:24:58.240 |
made a lot of waves for the 100 million token model, 01:25:02.520 |
but whatever it was, they made some noise about it. 01:25:13.640 |
This basically started to mark the shift of market share 01:25:22.000 |
and now Anthropic had a decent frontier model family 01:25:22.000 |
It was probably one of the most well-executed PR campaigns, 01:25:55.520 |
and then they took nine months to ship to GA. 01:26:01.920 |
I think some people are happy, some people less so, 01:26:04.360 |
but it's very hard to live up to the promises that they made 01:26:13.000 |
I think the main thing I would caution out for Devon, 01:26:16.400 |
I think people call me a Devon show sometimes 01:26:21.840 |
basically is that a lot of the ideas can be copied. 01:26:26.400 |
And this is always the threat of "GPT wrappers" 01:26:30.240 |
that you achieve product market fit with one feature, 01:26:35.200 |
So of course, you gotta compete with branding 01:26:43.680 |
- April, we actually talked to Udio and Suno. 01:26:57.520 |
Some of our friends at the pod like play in their cars, 01:27:02.880 |
and I freaking loved using O1 to craft the lyrics 01:27:18.160 |
you're the reason why we cut the intro songs. 01:27:38.840 |
And it was a kind of a model efficiency thing, 01:27:47.120 |
Like this is where the messaging of omni-model 01:27:58.480 |
I mean, they had vision, but not natively voice. 01:28:00.680 |
And I think everyone fell in love immediately 01:28:04.200 |
with the Sky Voice and Sky Voice got taken away 01:28:16.720 |
that has Sam Altman basically putting a foot in his mouth 01:28:20.800 |
with a three-letter tweet causing decent grounds 01:28:25.800 |
for a lawsuit where there was no grounds to be had 01:28:28.120 |
because they actually just use the voice actress 01:28:29.640 |
that sounded like Scarlett Johansson is unfortunate 01:28:41.960 |
People be pining for the Scarlett Johansson voice. 01:28:46.080 |
In June, Apple announced Apple Intelligence at WWDC. 01:28:50.160 |
And we haven't, most of us, if you update your phones, 01:29:01.080 |
that caused the Apple stock to rise like 20%. 01:29:05.760 |
And just because everyone was gonna upgrade their iPhones 01:29:08.200 |
just to get Apple Intelligence, it did not become that. 01:29:14.920 |
of transformers yet after Google rolled out BERT for search. 01:29:25.200 |
that's running locally on your phone with Loras 01:29:27.320 |
that are hot swaps and we have papers for it. 01:29:33.560 |
They're not the most transparent company in the world 01:29:37.760 |
But they gave us more than I think we normally get 01:29:43.680 |
And that's very nice for the research community as well. 01:29:51.800 |
I think I was at the Taiwanese trade show, Computex, 01:30:00.720 |
And I think that was maybe a sign of the times, 01:30:05.080 |
but things are clearly not peaked 'cause they continued. 01:30:14.920 |
But yeah, we recorded a whole bunch of stuff. 01:30:18.280 |
We lost it and we're scrambling to re-record it for you, 01:30:22.600 |
but also we're trying to close the chapter on 2024. 01:30:28.520 |
where we talk about the rest of June, July, August, 01:30:31.880 |
September, and the second half of 2024's news. 01:30:42.760 |
- Saw a term sheet, raised a billion dollars. 01:30:45.080 |
Dan Gross seems to have now become full-time CEO 01:30:49.960 |
I thought he was gonna be an investor for life, 01:31:01.960 |
and it's a straight shot at super intelligence 01:31:06.520 |
but then it runs counter to basically both Tesla 01:31:10.920 |
and OpenAI in terms of the ship intermediate products 01:31:21.160 |
And I think maybe their bet is like with 1 billion, 01:31:26.040 |
But we don't wanna have to have intermediate steps. 01:31:30.840 |
- Yeah, but then like, where do you get your data? 01:31:38.760 |
I think we can also use this as part of a general theme 01:31:48.200 |
and like basically the entire super alignment team left. 01:31:53.640 |
kind of like the ChatGPT Canvas equivalent that came out. 01:31:59.560 |
- No one has a Canvas clone yet apart from OpenAI. 01:32:06.280 |
responsible for artifacts and Canvas, Karina, 01:32:08.760 |
officially left Anthropic after this to join OpenAI 01:32:13.720 |
- Yeah, and then we had AI Engineer World's Fair in June. 01:32:31.640 |
And Gemini actually kind of beat them to the GA release, 01:32:35.480 |
I think that everyone should basically always have this on 01:32:38.520 |
as long as you're comfortable with the privacy settings 01:32:43.240 |
- And like this time next year, I would be willing to bet 01:32:45.960 |
that I would just have this running on my machine. 01:32:57.800 |
Then it will be another few years for that to happen 01:33:04.840 |
I think it's basically here, but not evenly distributed. 01:33:08.520 |
And we've just seen the GA of this capability 01:33:14.280 |
which we've done a whole podcast on, but that was great. 01:33:26.200 |
- Strawberry, AKA Q*, AKA we had a nice party 01:33:33.080 |
Like, the first internal demo 01:33:36.840 |
of Q*, of Strawberry, was, let's say, November 2023. 01:33:47.960 |
Like I don't know if like people are giving OpenAI 01:33:51.000 |
enough credit for like this all being available. 01:34:08.280 |
because we're still using Sonnet or whatever, 01:34:11.400 |
And then obviously now we have O1 Pro and O1 Full. 01:34:15.160 |
I think like in terms of like biggest ships of the year, 01:34:19.960 |
Yeah, and I think it now opens a whole new Pandora's box 01:34:23.080 |
for like the inference time compute and all that. 01:34:26.440 |
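Since inference-time compute keeps coming up, a toy sketch may help. To be clear, o1's actual training and inference recipe is unpublished; best-of-N selection below is just the simplest version of the "spend more compute at inference to get a better answer" idea, with `generate` and `score` as placeholder stand-ins for a model call and a verifier.

```python
# Toy illustration of inference-time compute: sample N candidates,
# keep the one a scorer likes best. Not how o1 works internally.
import random

def generate(prompt: str) -> str:
    # Placeholder for an LLM sample; returns one candidate answer.
    return f"candidate-{random.randint(0, 9999)}"

def score(prompt: str, answer: str) -> float:
    # Placeholder for a verifier, reward model, or test harness.
    return random.random()

def best_of_n(prompt: str, n: int = 16) -> str:
    # More samples = more inference compute = (usually) a better pick.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda a: score(prompt, a))

print(best_of_n("Prove that the square root of 2 is irrational."))
```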
It's funny because like it could have been done 01:34:31.400 |
They were working on it ever since they hired Noam, 01:34:36.040 |
- Another discovery: I think Ilya actually worked 01:34:42.200 |
on the same exact idea, and it failed, whatever that means. 01:34:49.400 |
Yeah, I think most people have tried it by now 01:35:09.800 |
I think that people don't see where all this is heading. 01:35:12.680 |
Like OpenAI is really competing with Google in everything. 01:35:16.760 |
And like it's a full document editing environment 01:35:31.160 |
And so OpenAI is taking on Google and Google Docs. 01:35:35.800 |
They launched their little Chrome extension thing 01:35:43.160 |
it's really taking on Google in a very smart way, 01:35:43.160 |
Maybe they're not successful, but at least they're trying. 01:35:55.000 |
And I think Google has gone without competition for so long 01:36:01.560 |
- Yeah, and then yeah, computer use also came out. 01:36:04.760 |
Yeah, it's been a busy couple of months. 01:36:04.760 |
- Computer use was one of the most upvoted demos on Hacker News of the year, 01:36:17.240 |
but then comparatively, I don't see people using it as much. 01:36:21.800 |
That's the difference between a mature capability and an emerging capability: 01:36:30.120 |
you'll demo the emerging one, but you'll use everything else in the mature category. 01:36:31.880 |
And it's mostly because it's not precise enough 01:36:49.640 |
- R1, so that was kind of like the open source answer to O1. 01:36:55.000 |
Everyone knew about, like, F1; we had a preview at the Fireworks HQ, 01:37:01.800 |
but I think R1 and QwQ ("Quill") from the Qwen team, 01:37:18.680 |
but it's just like, people are like, "agents are not real." 01:37:21.000 |
It's like when you have companies like Stripe 01:37:23.000 |
and like start to build things to support it, 01:37:30.280 |
But the fact that they do it shows that there's real demand. 01:37:30.280 |
It's a broader thesis for me that I'm exploring: 01:37:42.360 |
why can't normal SDKs for humans do the same thing? 01:37:46.440 |
The Stripe agent toolkit happens to be a wrapper on the normal Stripe SDK, 01:38:01.160 |
so that you don't assume it's a human doing these things. 01:38:23.960 |
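To make the wrapper idea concrete, here is a hedged sketch of what an agent-tagged SDK call could look like. The `stripe.PaymentIntent.create` call and its `metadata` field are real Stripe Python SDK surface, but the wrapper function and metadata keys are illustrative, not the actual stripe-agent-toolkit internals.

```python
# Sketch of the "SDK wrapper for agents" idea: same underlying API, but
# every call is tagged so agent usage is distinguishable from a human's.
import stripe

stripe.api_key = "sk_test_..."  # placeholder key

def agent_create_payment_intent(amount: int, currency: str, agent_id: str):
    return stripe.PaymentIntent.create(
        amount=amount,
        currency=currency,
        # Attribution: record that an agent, not a human, made this call.
        # These metadata keys are made up for illustration.
        metadata={"actor": "agent", "agent_id": agent_id},
    )

# An agent framework would call the wrapper instead of the raw SDK:
# agent_create_payment_intent(2000, "usd", agent_id="support-bot-1")
```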
And it turned out to be a key that I committed 01:38:29.480 |
And so, sourcing where API usage is coming from, 01:38:50.280 |
we've also, like, not always made things super explicit. 01:38:58.120 |
But like, I think if you were to redesign them, 01:39:02.920 |
using them as, like, almost infinite memory and context 01:39:11.960 |
is almost more interesting in the world of agents. 01:39:24.600 |
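A minimal sketch of that "storage as agent memory" idea: the agent reads and writes plain files instead of holding everything in its context window. Every name here is illustrative, not any particular product's API.

```python
# Toy agent memory backed by the filesystem: effectively unbounded,
# persistent across sessions, and readable by humans and agents alike.
from pathlib import Path

class FileMemory:
    def __init__(self, root: str = "agent_memory"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)

    def remember(self, key: str, text: str) -> None:
        # One file per memory key; trivially inspectable and syncable.
        (self.root / f"{key}.txt").write_text(text)

    def recall(self, key: str) -> str | None:
        path = self.root / f"{key}.txt"
        return path.read_text() if path.exists() else None

mem = FileMemory()
mem.remember("user_prefs", "Prefers concise answers; time zone UTC+1.")
print(mem.recall("user_prefs"))
```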
So yeah, I'm curious to see what else changes. 01:39:29.320 |
I think that was, you know, SearchGPT, Perplexity. 01:39:29.320 |
The fact that Dropbox has a Google Drive integration, 01:39:37.800 |
it's just like, if you told somebody five years ago, 01:39:53.160 |
- And that brings us up to December, still developing. 01:39:59.320 |
I think everyone's expecting something big there. 01:40:01.400 |
I think so far it's been a very eventful year. 01:40:01.400 |
- Well, I think we definitely talked about agents. 01:40:15.480 |
I don't know if we said it was the year of the agents, but we said... 01:40:19.800 |
- No, no, but well, you know, the anatomy of autonomy, 01:40:24.680 |
So obviously there's been belief for a while. 01:40:28.760 |
the models are there, I would say, maybe only in the last two months. 01:40:40.520 |
Satya, I think also saying that a lot these days. 01:40:43.480 |
I mean, Sam has been saying that for a while now. 01:40:50.280 |
but also Project Mariner, which is a browser agent, 01:41:00.200 |
and OpenAI's rumored one, codenamed Operator, which is their agent thing. 01:41:00.200 |
It makes sense that if it actually replaces a junior employee, 01:41:11.240 |
I did this post that it's pinned on my Twitter, 01:41:14.520 |
but about skill floor and skill ceiling in jobs. 01:41:23.960 |
I don't think that has been true in the past, 01:41:25.720 |
but yeah, I think now really like if Devin works, 01:41:29.480 |
if all these customer support agents are working. 01:41:38.440 |
I think the same is gonna happen in software engineering, 01:41:43.080 |
There's a lot of people doing software engineering 01:41:47.080 |
So I'm curious to see in the next year of the recap, 01:41:55.880 |
and I'll just highlight the best prediction from that group. 01:42:01.080 |
and then we'll just go down the list of the top five podcasts. 01:42:05.080 |
So the best prediction was that there will be a foreign spy 01:42:18.920 |
who is like too attractive in a San Francisco party, 01:42:22.680 |
where the ratio is like a hundred guys to one girl 01:42:25.240 |
and suddenly the girl's like super interested in you, 01:42:36.600 |
the situational awareness essay did to raise awareness of it, 01:42:47.800 |
And I think like the security space in general, 01:42:51.240 |
One thing about the Apple foundation model before we cut for a break: 01:42:54.120 |
they announced Apple's Private Cloud Compute. 01:42:56.360 |
And I think we're also interested in investing in areas 01:42:59.800 |
that are basically secure cloud, LLM inference for everybody. 01:43:03.640 |
I think like what we have today is not secure enough. 01:43:07.640 |
when like this is literally a state level interest. 01:43:21.400 |
- I will take a little bit of credit for that, 01:43:24.200 |
because I think that was the Hacker News thing. 01:43:27.320 |
actually, obviously he wants to talk about debt, 01:43:43.560 |
that we'll still be referencing in the future. 01:43:47.880 |
David talked about the brain compute marketplace, 01:44:02.920 |
like one big training run being worth more than the hundred equivalent small training runs, 01:44:02.920 |
and we need to concentrate big runs, not spread them. 01:44:18.440 |
You know, I think it was top of mind for everybody. 01:44:27.960 |
but she's been on every episode, every podcast. 01:44:32.200 |
actually being the guy who worked on the audio model, 01:44:41.400 |
'Cause I think you put that level of attention 01:44:48.040 |
- And it's specifically like, they didn't have evals. 01:44:54.920 |
The ultimate guide to prompting, that was number three. 01:44:57.800 |
I think all these episodes that are like summarizing things 01:45:00.920 |
that people care about, but they're disparate, 01:45:06.280 |
on a lot of smaller prompting episodes, right? 01:45:11.080 |
with like a 10-page paper that is just a different prompt, 01:45:14.120 |
like not as useful as like an overview survey thing. 01:45:17.400 |
I think the question is what to do from here. 01:45:21.800 |
I've been surprised by how well received that was. 01:45:31.800 |
from the success of this one. - I think if somebody... 01:45:37.000 |
- Yeah, Sander is very, very fastidious about this. 01:45:45.320 |
Okay, then the next one is the not safe for work one. 01:45:53.160 |
- Okay, we have a different list then, but yeah. 01:46:08.280 |
in the way that like the audience keeps growing 01:46:10.520 |
and then like the most recent episodes get more views. 01:46:19.000 |
What people were telling me they really liked 01:46:31.000 |
I think that's one of the most interesting areas. 01:46:38.680 |
but like how do you use that for like model training 01:46:54.200 |
very serious AGI lab version of WebSim and WorldSim. 01:46:54.200 |
If you take it very, very seriously, you get Genie 2, 01:47:01.240 |
which is exactly what you need to then build Sora 01:47:03.960 |
So yeah, I think Simulative AI is still in summer. 01:47:12.040 |
Like, would you say that the AI winter has, like, come on 01:47:23.080 |
- Yeah, I would say it was here in the vibes, 01:47:28.440 |
You know, when you look back at the yearly recap, 01:47:30.280 |
it's like every month there was like progress. 01:47:35.320 |
but I don't know if that counts as a real winter. 01:47:43.960 |
And, you know, with some amount of conclusion at NeurIPS 01:48:13.240 |
the various coverage areas that we've marked out, 01:48:21.560 |
- Yeah, and then just to like throw that out there, 01:48:45.000 |
So you just bought at a valuation of 40, right? 01:48:46.680 |
- Yeah, it was like 43 or something like that. 01:48:56.040 |
- And like Databricks was a private valuation 01:48:59.720 |
It's like, who knows what this thing's worth. 01:49:20.360 |
Next year is the year of the agent in production. 01:49:24.120 |
I'm not a hundred percent sure it will happen, 01:49:27.080 |
Otherwise it's definitely the winter next year. 01:49:41.320 |
the paper club people have been beyond my wildest dreams, 01:49:47.800 |
It's amazing that the community has grown so much 01:49:56.280 |
- Yeah, we started this Discord like four years ago. 01:50:01.000 |
You post news here and then you discuss it in threads 01:50:08.680 |
and sometimes you smack them down a little bit, 01:50:10.760 |
- We rarely have to ban people, which is great. 01:50:21.240 |
It's easy to see how we're going to get to 200. 01:50:24.760 |
it wasn't easy to see how we would get to 100, you know? 01:50:37.240 |
as to what we're actually going to get out of it. 01:50:41.160 |
I do believe in YouTube as a podcasting platform