The Winds of AI Winter (Q2 Four Wars of the AI Stack Recap)
Chapters
0:00 Intro Song by Suno.ai
2:01 Swyx and Alessio in Singapore
5:49 GPU Rich vs Poors: Frontier Labs
6:35 GPU Rich Frontier Models: Claude 3.5
10:37 GPU Rich helping Poors: Llama 3.1: The Synthetic Data Model
15:41 GPU Rich helping Poors: Frontier Labs Vibe Shift - Phi 3, Gemma 2
18:26 GPU Rich: Mistral Large
21:56 GPU Rich: Nvidia + FlashAttention 3
23:45 GPU Rich helping Poors: Noam Shazeer & Character.AI
28:14 GPU Poors: On Device LLMs: Mozilla Llamafile, Chrome (Gemini Nano), Apple Intelligence
35:33 Quality Data Wars: NYT vs The Atlantic lawyer up vs partner up
37:41 Quality Data Wars: Reddit, ScarJo, RIAA vs Udio & Suno
41:03 Quality Data Wars: Synthetic Data, Jagged Intelligence, AlphaProof
45:33 Multimodality War: ChatGPT Voice Mode, OpenAI demo at AIEWF
47:34 Multimodality War: Meta Llama 3 multimodality + Chameleon
50:54 Multimodality War: PaliGemma + CoPaliGemma
52:55 Renaming Rag/Ops War to LLM OS War
55:31 LLM OS War: Ops War: Prompt Management vs Gateway vs Observability
62:57 LLM OS War: BM42 Vector DB Wars, Memory Databases, GraphRAG
66:15 LLM OS War: Agent Tooling
68:26 LLM OS War: Agent Protocols
70:43 Trend: Commoditization of Intelligence
76:45 Trend: Vertical Service as Software, AI Employees, Brightwave, Dropzone
80:44 Trend: Benchmark Frontiers after MMLU
83:31 Crowdstrike will save us from Skynet
00:00:10.260 |
And today we're in the Singapore studio with swyx. 00:00:13.780 |
- Hey, this is our long-awaited one-on-one episode. 00:00:18.100 |
I don't know how long ago the previous one was. 00:00:26.620 |
It's just really, I think our travel schedules 00:00:28.660 |
have been really difficult to get this stuff together. 00:00:34.500 |
I think we've kind of depleted that backlog now 00:00:38.900 |
But it's been busy and there's been a lot of news. 00:00:40.980 |
So we actually get to do this like sort of rapid fire thing. 00:00:45.260 |
the podcast has grown a lot in the last six months. 00:00:48.020 |
Maybe just reintroducing like what you're up to, 00:00:50.780 |
what I'm up to, and why we're here in Singapore 00:01:10.140 |
And I was at one of the offices kind of on the south side 00:01:13.220 |
and from the 38th floor, you can see Indonesia on one side 00:01:22.140 |
One of the people there said their kid goes to school 00:01:40.180 |
And we got to talk about this trend of Sovereign AI, 00:01:42.580 |
which maybe we might cover on another episode, 00:01:44.860 |
but basically how do you drive, if you're a country, 00:01:56.460 |
should I put all this money in foundation models? 00:01:58.660 |
Should I put it in data centers and infrastructure? 00:02:04.340 |
So we'll touch on some of these trends in the episode, 00:02:08.740 |
And I did not expect some of the most senior people 00:02:11.940 |
at the largest financial institution in Singapore 00:02:13.940 |
to ask about state space models and some of the alternatives. 00:02:21.660 |
- Yeah, I think that that is mostly people trying 00:02:25.380 |
to listen to jargon that is being floated around 00:02:34.420 |
the basics of what they will actually put to work. 00:02:46.980 |
especially when I travel, is to try to ask questions 00:03:02.620 |
I think AI engineering is one way that countries 00:03:06.700 |
without building a hundred billion dollar cluster, 00:03:10.820 |
And so my pitch at the summit was that we would, 00:03:20.740 |
We're also working on bringing the AI Engineer conference 00:03:23.820 |
to Singapore next year, together with ICLR. 00:03:27.940 |
and I'm being looped into various government meetings 00:03:43.940 |
- Maybe just recap since the framework of the four wars 00:03:48.020 |
of AI is something that came up end of last year. 00:04:02.540 |
the data quality wars, the multimodality wars, 00:04:08.100 |
So usually everything falls back under those four categories. 00:04:26.580 |
We haven't done a monthly thing in like three months. 00:04:32.260 |
- That's mostly because I got busy with the conference. 00:04:41.580 |
so that I don't have such a big lift that I don't do it. 00:04:44.020 |
I think the activation energy is the problem, really. 00:04:55.500 |
For a long time, I thought it was kind of like 00:05:00.780 |
at least in some of the hard benchmarks on LMSYS, 00:05:05.700 |
it is the undisputed number one model in the world, 00:05:10.100 |
And we can talk about 4o mini and benchmarking later on, 00:05:12.220 |
but for Claude to be there and hold that position 00:05:14.740 |
for what is more than a month now in AI time is a big deal. 00:05:28.220 |
It marks the beginning of a non-OpenAI-centric world 00:05:35.420 |
That's been a trend that's been going on for a while. 00:05:39.580 |
But now, new open source projects and tooling, 00:05:45.220 |
That's a strategic threat to OpenAI, I think, a little bit. 00:05:59.900 |
So the rumor is that the Scaling Monosemanticity paper 00:06:08.660 |
I've had off-the-record chats with people about that idea, 00:06:11.780 |
and they don't agree that it is the only cause. 00:06:14.740 |
So I was thinking this was the only thing that they did. 00:06:17.340 |
But people say that there's about four or five other tricks 00:06:28.060 |
But it basically says that you can find control vectors, 00:06:34.100 |
to make it better at code without really retraining it. 00:06:37.180 |
You just train a whole bunch of sparse autoencoders, 00:06:44.820 |
or suddenly you care a lot about the Golden Gate Bridge. 00:06:48.580 |
That is a huge, huge win for interpretability 00:06:51.020 |
because up to now we were only doing interpretability 00:06:54.940 |
on toy models, like a few million parameters, 00:07:07.620 |
if we could replicate the same on the open models to then, 00:07:12.100 |
to generate synthetic data for training and fine tuning. 00:07:15.620 |
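The steering idea described above, finding a feature direction with a sparse autoencoder and then adding it to the model's activations, can be sketched in a few lines. The arrays and the "Golden Gate" direction below are toy stand-ins, not Anthropic's actual features:

```python
import numpy as np

def steer(residual, feature_direction, strength=2.0):
    """Push a residual-stream activation along a (hypothetical) SAE
    feature direction, amplifying whatever concept that feature encodes."""
    unit = feature_direction / np.linalg.norm(feature_direction)
    return residual + strength * unit

# Toy 4-dim "residual stream" and a made-up "Golden Gate" feature direction.
residual = np.array([0.5, -1.0, 0.25, 2.0])
golden_gate = np.array([0.0, 3.0, 0.0, 4.0])
steered = steer(residual, golden_gate, strength=2.0)
print(steered)  # the activation is nudged along the feature direction
```

In a real model the same addition would be applied at a chosen layer on every forward pass, which is why no retraining is needed.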
I think obviously Anthropic has a lot of compute 00:07:20.540 |
this is what we should make the model better at, 00:07:22.660 |
they can kind of like put a lot of resources. 00:07:30.820 |
of like the best fine tuning data set owners for a while, 00:07:33.460 |
but at some point that should change, hopefully. 00:07:38.740 |
And I think if we can apply the same principles 00:07:43.460 |
and bring them into like maybe the 7B form factor, 00:07:55.500 |
and Claude is definitely better most of the time. 00:07:59.700 |
but when the vibes are good, the vibes are good. 00:08:02.060 |
- We run most of the AI News summaries on Claude as well. 00:08:10.180 |
but yeah, Claude is very strong at summarization 00:08:24.540 |
Like there's some fundamental irreducible level of mistakes 00:08:38.300 |
I think there are 10 directions that I outlined below, 00:08:46.140 |
Maybe it's good to differentiate between the models. 00:08:49.620 |
But yeah, we have a whole episode with Thomas Scialom 00:08:52.660 |
from the Meta team, which was really, really good. 00:08:59.140 |
- Yeah, I think we're the only ones to coordinate 00:09:19.820 |
- So behind the scenes, you know, one for listeners, 00:09:22.500 |
one thing that we have attention about is who do we invite? 00:09:27.780 |
then it will cause people to download us more, 00:09:36.060 |
And so I think it's this constant back and forth. 00:09:40.660 |
And we're trying to do that, thread that line 00:09:52.740 |
this actually goes all the way back to George Hotz. 00:09:56.580 |
he said, "You have two paths in the podcast world. 00:09:59.500 |
Either you go be Lex Fridman or you stay small and niche." 00:10:08.860 |
But at the same time, I still want us to grow. 00:10:25.900 |
that they've been fine tuning and training on GPT-4 outputs 00:10:34.860 |
there's like a clear path to how do we make a 7B model good 00:10:38.620 |
without having to go through GPT-4 or going to Claude 3. 00:10:44.300 |
but I think we're seeing maybe the, you know, 00:10:47.300 |
not the death, but like selling the picks and shovels, 00:10:52.780 |
is like where most of the value is actually getting captured, 00:11:07.740 |
I still need to go through the large labs to fine tune. 00:11:15.620 |
but I don't know if a lot of people are switching 00:11:22.660 |
I also don't know what the hosting options are 00:11:35.380 |
it's a lot of compute if some of the big products 00:11:38.420 |
will switch to it and you cannot easily run it yourself. 00:11:46.340 |
- Yeah, I would say that it is not enough now 00:11:52.300 |
I actually shipped that in the original email 00:11:54.980 |
and then I changed that in the sort of what you see now 00:12:07.500 |
And I think that is what was interesting for Llama 3 for me, 00:12:12.380 |
90 pages of all killer, no filler, something like that. 00:12:20.540 |
with a proper paper instead of a marketing blog post. 00:12:23.900 |
And they actually spelled out how they'd use synthetic data 00:12:28.700 |
So they have synthetic data for code, for math, 00:12:31.380 |
for multilinguality, for long context, for tool use, 00:12:39.780 |
now you have the license to go distill Llama 3 405B, 00:12:48.340 |
Now you have the permission to do it, how do you do it? 00:12:50.180 |
And I think that people are gonna reference Llama 3 a lot, 00:12:53.380 |
but then they can use those techniques for everything else. 00:12:59.020 |
I was very focused on synthetic data for pre-training 00:13:02.380 |
That's my conversations with Teknium from Nous 00:13:04.900 |
and all the other people doing synthetic data 00:13:09.260 |
But he was talking about post-training as well. 00:13:22.860 |
the synthetic data model is you have the license for it, 00:13:25.980 |
but then you also have the roadmap, the recipe, 00:13:33.060 |
And probably, you know, obviously OpenAI's 00:13:53.620 |
It's like an OpenAI competitor that's state-of-the-art. 00:13:56.980 |
oh, Anthropic, yeah, these guys are cute over there. 00:13:59.260 |
They're trying to do their thing, but it's not OpenAI. 00:14:17.420 |
And I don't know if OpenAI is kind of like sandbagging 00:14:22.820 |
And then they kind of, you know, yesterday or today, 00:14:27.380 |
they launched the SearchGPT thing behind the waitlist. 00:14:53.460 |
you can skip the waitlist, just go to perplexity.com. 00:15:02.580 |
But their implementation will have something different. 00:15:04.820 |
They probably like train a dedicated model for that, 00:15:07.100 |
you know, like they will have some innovation 00:15:12.700 |
We're optimistic, you know, but the vibe shift is real. 00:15:16.700 |
that is just worth commenting on and watching. 00:15:21.860 |
I think what you said there is actually very interesting. 00:15:23.780 |
The trend of successive releases is very important to watch. 00:15:47.300 |
and Phi-2 and Phi-3 subsequently improved a lot as well. 00:15:50.780 |
I would say also similar for Gemma, Gemma 1 and 2. 00:15:56.580 |
in terms of the LocalLlama sort of vibe check, 00:16:04.780 |
They released at the AI Engineer World's Fair. 00:16:07.380 |
And, you know, like I didn't know what to think about it 00:16:10.540 |
'cause Gemma 1 wasn't like super well-received. 00:16:12.420 |
It was just kind of like, here's like free tier Gemma and I, 00:16:28.260 |
And so like the, and we'll talk about this also, 00:16:30.340 |
like just the winds of AI winter is also like, 00:16:32.980 |
what is the depreciation schedule on this model 00:16:42.300 |
Everybody's favorite open weights company. 00:16:46.700 |
- They just released the, you know, Mistral Large Enough. 00:16:55.220 |
presumably because they were speaking at ICML, 00:16:58.900 |
By the way, Brittany is doing a guest host thing for us. 00:17:02.060 |
She's running around the poster sessions doing what I do, 00:17:08.460 |
but I think because we still want to respect their work, 00:17:20.220 |
released as open weights with a research license, 00:17:23.140 |
not a commercial license, but still open weights. 00:17:27.340 |
but it is a step down in terms of the general excitement 00:17:36.380 |
So the general hope is, and I cannot say too much, 00:17:42.940 |
The general hope is that they need something more. 00:17:53.020 |
They made progress here with instruction following 00:18:07.300 |
And now, unfortunately, Mistral does not have that crown 00:18:17.180 |
By the way, they've also deprecated Mistral 7B, 00:18:34.060 |
I believe that they're still very committed to open-source, 00:18:40.620 |
I mean, they have, what, $600 million to do it? 00:18:46.140 |
But people are waiting for what's next from them. 00:18:48.980 |
- Yeah, to me, the perception was interesting. 00:18:55.900 |
"for not making any money anyway from the inference?" 00:19:13.460 |
But now it's like they're kind of moving away from that. 00:19:39.020 |
they have some interesting experimentations with Mamba, 00:19:49.380 |
But Mistral Large, otherwise, it's an update. 00:19:52.100 |
It's a necessary update for Mistral Large V1, 00:19:54.980 |
but other than that, they're just kind of holding the line, 00:20:05.860 |
- And then now we're gonna shift a little bit 00:20:07.580 |
towards the smaller deployable on-device solutions. 00:20:12.180 |
First of all, a shout out to our friend, Tri Dao, 00:20:16.860 |
Flash Attention 2, we kind of did a deep dive on the podcast. 00:20:38.180 |
was: do NVIDIA's competitors pose any threat to NVIDIA? 00:20:48.220 |
which caused a lot of noise with their Sohu chip as well. 00:20:57.380 |
Like Flash Attention 3 only works for NVIDIA, 00:21:11.500 |
I actually heard a really good argument from, 00:21:20.780 |
yeah, absolutely NVIDIA's hardware and ecosystem makes sense. 00:21:28.900 |
it's like the most valuable company in the world right now, 00:21:55.460 |
maybe a couple of years ago about cloud repatriation. 00:21:58.380 |
- Oh yeah, I think he got a lot of shit for that, 00:22:00.500 |
but it's becoming more consensus now, I think. 00:22:12.620 |
and he put up a post talking about five tricks 00:22:15.700 |
that they use to serve 20% of Google search traffic 00:22:21.060 |
A lot of people were very shocked by that number, 00:22:24.780 |
that most conversations are multi-turn, right? 00:22:32.460 |
So obviously there's a good ratio here that matters. 00:22:35.860 |
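Two of the tricks in that post, multi-query attention and cross-layer KV sharing, are really about shrinking the KV cache for those long multi-turn conversations. Back-of-envelope math with illustrative dimensions (not Character.AI's actual config):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Keys and values are stored separately, hence the leading 2x; fp16 = 2 bytes.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative 32-layer model, 32 heads of dim 128, 8k-token conversation.
full_mha = kv_cache_bytes(32, 32, 128, 8192)    # every head keeps its own KV
mqa = kv_cache_bytes(32, 1, 128, 8192)          # multi-query: one shared KV head
shared = kv_cache_bytes(32 // 4, 1, 128, 8192)  # plus KV shared across 4-layer groups
print(full_mha // 2**20, mqa // 2**20, shared // 2**20)  # MiB per conversation
```

For these toy numbers the cache drops from 4 GiB to 32 MiB per conversation, which is the kind of reduction that makes caching whole dialogue histories affordable.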
It's obviously a flex of Character AI's traction 00:22:40.060 |
because I have tried to use Character AI since then, 00:22:42.500 |
and I still cannot for the life of me get it. 00:22:55.140 |
- But please still come on the podcast, Noam Shazeer. 00:23:02.140 |
like what the use case is for apart from like the therapy, 00:23:08.180 |
But anyway, one of the most interesting things, 00:23:19.060 |
And I think like that is something that is an easy win. 00:23:28.740 |
past Chinchilla ratio to optimize for inference, 00:23:33.020 |
hey, let's stop using so much memory when training 00:23:35.940 |
because we're gonna quantize it anyway for inference. 00:23:38.500 |
So like just let's pre-quantize it in training. 00:23:47.700 |
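One common way to "pre-quantize in training" is quantization-aware training with fake quantization in the forward pass. This is a toy sketch of that general technique, not Character.AI's actual recipe:

```python
import numpy as np

def fake_quant_int8(w):
    """Simulate int8 weights during training: snap to a symmetric int8 grid,
    then dequantize back to float, so the forward pass already sees the
    rounding error the deployed int8 model will have."""
    scale = np.abs(w).max() / 127.0          # symmetric per-tensor scale
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

w = np.array([0.1234, -0.52, 1.27])
print(fake_quant_int8(w))  # values snapped to the int8 grid
```

At inference you would keep the int8 integers and the scale, so there is no separate post-training quantization step that degrades the model.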
which I think is basically going to be the norm, right? 00:23:59.180 |
for the long-form conversations that character has. 00:24:02.500 |
And like simultaneously we have independent research 00:24:06.020 |
from other companies about similar hybrid ratios 00:24:11.940 |
with a Mamba transformer hybrid research thing. 00:24:14.740 |
And in their estimation, you only need 7% transformers. 00:24:24.500 |
And basically every form of hybrid architecture 00:24:35.860 |
and it could well be that the transformer block 00:24:45.980 |
can be the RWKVs, can be another transformer, 00:24:56.660 |
one is something that's local, whatever that is. 00:24:59.900 |
And then, you know, who knows what else is next? 00:25:10.100 |
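The hybrid-ratio idea above (a small fraction of attention layers interleaved among SSM layers) can be sketched as a simple layer schedule; the ~7% figure comes from the discussion, but the even-spacing scheme here is illustrative:

```python
def hybrid_schedule(n_layers, attention_fraction=0.07):
    """Interleave full-attention layers among SSM (e.g. Mamba) layers at a
    target ratio. Spacing scheme is a made-up illustration."""
    stride = max(1, round(1 / attention_fraction))  # one attention layer per stride
    return ["attention" if (i + 1) % stride == 0 else "ssm"
            for i in range(n_layers)]

layers = hybrid_schedule(32)
print(layers.count("attention"), "of", len(layers), "layers are attention")
```

Real hybrids tune both the ratio and the placement empirically; the point is just how small the attention share can be.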
Noam thinks that he can do inference at 13X cheaper 00:25:15.780 |
So like, there is a lot of room left to improve inference. 00:25:25.060 |
I was like, they would be losing a ton of money, so. 00:25:31.060 |
So I'm sure money is still an issue for them, 00:25:33.580 |
but I'm also sure they're making a lot of money. 00:25:44.140 |
is like, hey, do you just want to keep building? 00:25:47.580 |
just not worry about the money and go build somewhere else? 00:25:56.980 |
So I'm curious to see what companies decide to stick with it. 00:26:01.500 |
- I think Google or Meta should pay $1 billion 00:26:05.460 |
The purchase price for a character is $1 billion, 00:26:10.700 |
- Which is nothing at their market cap, right? 00:26:13.220 |
Meta's market cap right now is $1.15 trillion 00:26:17.340 |
because they're down 5%, 11% in the past month. 00:26:24.980 |
you know, that's like less than 0.1% of your market cap. 00:26:31.980 |
and they buy 1% of their market cap on that at the time. 00:26:37.060 |
But the last piece of the GPU rich poor wars. 00:26:42.940 |
And now down to the GPU poorest is on-device models, right? 00:26:46.820 |
Which is something that people are very, very excited about. 00:26:52.140 |
I think was kind of like the talk of the town there 00:26:57.700 |
and explain like some of the optimizations that they did. 00:26:59.820 |
And their just general vision for on-device AI. 00:27:02.540 |
I think that like, it's basically the second act of Mozilla. 00:27:07.260 |
Like a lot of good with the open source browser. 00:27:13.100 |
because it's very hard to keep up in that field. 00:27:15.500 |
And Mozilla has had some management issues as well. 00:27:26.420 |
Like open source is synonymous with local, private 00:27:32.540 |
even running this stuff on CPUs at a very, very fast speed 00:27:40.940 |
and we should probably try to support it more. 00:27:49.780 |
- Yeah, I think to me the biggest question about on-device, 00:28:06.940 |
llama.cpp, MLX, those kinds are all sort of that layer. 00:28:18.820 |
So Google Chrome is building Gemini Nano into the browser. 00:28:21.940 |
The next version of Google Chrome will have Nano inside 00:28:29.620 |
There'll be no download, no latency whatsoever 00:28:35.460 |
which is Apple's version, which is in the OS, 00:28:43.420 |
- My biggest question is how much can you differentiate 00:28:53.140 |
And are people gonna be aware of what model is running? 00:29:07.220 |
the more it's just gonna become like a utility, you know? 00:29:10.460 |
So like, you're not gonna need a model router 00:29:18.540 |
- Actually, Apple Intelligence is the model router, I think. 00:29:22.940 |
I did a count in my newsletter, like 14 to 20 adapters. 00:29:34.860 |
To me, I think a lot of people were trying to puzzle out 00:29:37.420 |
the strategic moves between OpenAI and Apple here 00:29:44.460 |
There was some rumors that Google was working with Apple 00:29:46.660 |
to launch it, they did not make it for the launch, 00:29:48.780 |
but presumably Apple wants to commoditize OpenAI, right? 00:29:53.900 |
you can choose your preferred external AI provider 00:29:57.500 |
and it's either OpenAI or Google or someone else. 00:30:00.220 |
I mean, that puts Apple at the center of the world 00:30:05.780 |
And I think that's probably good for privacy, 00:30:10.420 |
'cause you're not running like oversized models 00:30:18.940 |
Like, yeah, I'm not concerned about the capabilities issue. 00:30:23.180 |
Apple put out a whole bunch of proprietary benchmarks 00:30:29.740 |
So like, you know, in the Apple intelligence blog posts, 00:30:33.780 |
were just like their internal human evaluations. 00:30:36.140 |
And only one of them was an industry standard benchmark, 00:30:40.340 |
But like, you know, why didn't you also release your MMLU? 00:30:46.060 |
- Well, I actually think all these models will be good. 00:30:50.340 |
I'm curious to see what the price tag will be 00:31:00.380 |
- Yeah, I mean, today, even if it was 20 billion, 00:31:03.060 |
that's like nothing compared to like, you know, 00:31:29.340 |
it's because OpenAI has to foot the inference costs, right? 00:31:33.180 |
- Well, yeah, Microsoft really is putting it, 00:31:35.660 |
but again, Microsoft is worth two trillion, you know? 00:31:42.740 |
as someone who is a champion of the open web, 00:31:53.980 |
Apple intelligence being like on-device router 00:32:11.220 |
I will highlight that Apple has also put out DataComp-LM. 00:32:15.020 |
I actually interviewed DataComp at NeurIPS last year, 00:32:17.900 |
and they've branched out from just vision and images 00:32:21.500 |
And Apple has put out a reference implementation 00:32:30.220 |
because FineWeb was the state-of-the-art last month. 00:32:38.780 |
open weights, open model, like super everything open. 00:32:50.300 |
which basically innovate in terms of like shared weights 00:32:55.540 |
so that you just optimize the amount of file size 00:32:59.700 |
And I think just the general trend of on-device models, 00:33:04.660 |
the way intelligence too cheap to meter happens is everything happens on-device. 00:33:12.020 |
Like OpenAI's mission is intelligence too cheap to meter, 00:33:23.780 |
Maybe OpenAI, even Sam Altman needs to figure it out 00:33:28.700 |
I don't know if you would buy an OpenAI phone. 00:33:30.580 |
I mean, I'm very locked into the iOS ecosystem, but I mean-- 00:34:03.980 |
- I think there's a lot of news going in the background. 00:34:05.860 |
So like the New York Times lawsuit is still ongoing. 00:34:08.820 |
You know, it's just like we won't have specific things 00:34:14.660 |
There are specific deals that are happening all the time 00:34:17.220 |
with Stack Overflow making deals with everybody, 00:34:19.820 |
with like Shutterstock making deals with everybody. 00:34:22.580 |
It's just, it's hard to make a single news item 00:34:30.100 |
OpenAI's strategy has been to make the New York Times 00:34:34.180 |
prove that their content is actually original 00:34:40.060 |
- Yeah, so it's kind of like, you know, the I, Robot meme. 00:34:42.500 |
It's like, can a robot create a beautiful new symphony? 00:34:51.740 |
- Yeah, I think that the danger with the lawsuit, 00:34:55.780 |
because OpenAI responded, including with Ilya, 00:35:04.980 |
And then suddenly on the eve of the deal, you called it off." 00:35:08.220 |
I don't think New York Times has responded to that one, 00:35:11.980 |
because the New York Times' brand is like trying to be like, 00:35:15.580 |
you know, they're supposed to be the top newspaper 00:35:18.580 |
If OpenAI, like just, and this was my criticism of it 00:35:31.940 |
- So you just lost out on like a hundred million dollars, 00:35:38.980 |
I think they are absolutely right to do that. 00:36:06.300 |
versus partnering, I think it's very interesting. 00:36:08.540 |
- Yeah, I guess the winner in all of this is Reddit, 00:36:12.140 |
which is making over $200 million just in data licensing 00:36:15.420 |
to OpenAI and some of the other AI providers. 00:36:24.180 |
'cause Reddit conveniently did this deal before IPO, right? 00:36:29.060 |
And then, you know, the stock just went up from there. 00:36:35.660 |
So in this market, they're up 25%, I think, since IPO. 00:36:39.380 |
But I saw the FTC had opened an inquiry into it 00:36:44.020 |
So I'm curious what the antitrust regulations 00:36:52.220 |
to prevent kind of like stifling competition. 00:36:57.180 |
where, hey, you cannot actually get all of your data 00:37:03.420 |
because otherwise you're stopping any new company 00:37:08.780 |
- Yeah, that's a serious overreach of the state there. 00:37:12.620 |
- So as a free market person, I want to defend. 00:37:21.900 |
people should be able to make their own decisions 00:37:32.500 |
is that apparently they have added to their robots.txt, 00:37:39.020 |
And that's obviously blocking OpenAI from crawling them, 00:37:49.980 |
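For reference, blocking a crawler via robots.txt is a two-line stanza; GPTBot is the user agent OpenAI documents for its crawler:

```
User-agent: GPTBot
Disallow: /
```

robots.txt is voluntary rather than enforceable, which is part of why these disputes end up as lawsuits and licensing deals instead.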
I think this is big in the sort of normie worlds. 00:37:55.180 |
had a very, very public Apple Notes take down of OpenAI. 00:37:58.940 |
Only Scarlett Johansson can do that to Sam Altman. 00:38:01.380 |
And then, you know, I was very proud of my newsletter 00:38:05.100 |
because the voice of Sky, so I called it Skyfall. 00:38:09.300 |
And, but it's true, like, that one, she can win. 00:38:13.820 |
And there's a very well-established case law there. 00:38:16.300 |
And the YouTubers and the music industry, the RIAA, 00:38:19.220 |
like the most litigious section of the creator economy 00:38:30.720 |
but it's gonna be a very costly legal battle for sure. 00:38:42.740 |
I was pretty optimistic that something like this 00:38:47.740 |
And with the way that the Supreme Court is making rulings, 00:38:52.260 |
like we just need a judgment on whether or not 00:39:11.540 |
If the Supreme Court rules that there are no lawsuits 00:39:16.540 |
- I think people are probably scraping Latent Space 00:39:18.660 |
and we're not getting a dime, so that's what it is. 00:39:26.820 |
for our microphones and travel and stuff like that. 00:39:28.860 |
Yeah, it's definitely not worth the amount of time 00:39:32.340 |
we're putting into it, but it's a labor of love. 00:39:36.820 |
- Yeah, I guess we talked about it a little bit 00:39:39.060 |
before with Lama, but there was also the alpha proof thing. 00:39:46.740 |
- Yeah, Google's AlphaProof almost got a gold medal. 00:39:50.380 |
- Yes, they're one point short of the gold medal. 00:39:52.740 |
- It's a remarkable, I wish they had more questions. 00:39:55.460 |
So the International Math Olympiad has six questions 00:40:02.140 |
Every single question that the AlphaProof model tried, 00:40:09.900 |
And then the cutoff was like sadly one point higher 00:40:15.140 |
like a lot of people have been looking at IMO 00:40:19.500 |
in terms of what AI can achieve and betting markets 00:40:22.860 |
and Eliezer Yudkowsky has updated and saying like, 00:40:27.500 |
Like we basically have reached it near gold medal status. 00:40:31.500 |
We definitely reached a silver and bronze status 00:40:34.100 |
and we'll probably reach gold medal next year, right? 00:40:44.540 |
which is an easier version of the human Math Olympiad. 00:40:48.900 |
This is all like related research work on search 00:41:00.020 |
Like it's always hard to cover this kind of news 00:41:08.220 |
'Cause at the same time, we're having this discussion 00:41:12.460 |
You know, one of the IMO questions was solved in 19 seconds 00:41:20.500 |
At the same time, language models cannot determine 00:41:31.060 |
but it's a funny, and then there's someone else's joke. 00:41:37.100 |
This is a failure to generalize because of tokenization 00:41:43.100 |
We've always been able to train dedicated special models 00:41:54.220 |
I think like if you look back a year and a half ago 00:41:59.700 |
Most people would be like, "Yeah, we can keep scaling." 00:42:15.660 |
would be much more capable at this kind of stuff 00:42:19.100 |
while it also serves our needs with everyday things. 00:42:31.980 |
that we can build super intelligence for sure. 00:42:37.300 |
But right now we're just pursuing super intelligence. 00:42:55.460 |
And by the way, also OpenAI did it with GPT-4o 00:43:07.260 |
In fact, I call it part of the deployment strategy of models. 00:43:10.340 |
You train a base layer, you train a large one, 00:43:22.380 |
to the point where now OpenAI has opened a team 00:43:24.580 |
for mid-training that happens before post-training. 00:43:33.420 |
is before you have capability and you have efficiency, 00:43:36.340 |
there's an in-between layer of generalization 00:43:58.260 |
Yeah, I don't have a good intuition for that. 00:44:02.340 |
Yeah, so we can skip Nemotron, though it's worth looking at 00:44:06.220 |
Multimodal labeling, I think has happened a lot. 00:44:13.220 |
Well, the first news is that 4o voice is still not out 00:44:19.060 |
I think they're starting to roll out the beta 00:44:26.340 |
- I gave in because they're rolling it out next week. 00:44:35.500 |
it's basically because they had nothing to offer people. 00:44:38.940 |
because why keep paying $20 a month for this, right? 00:44:47.220 |
I will pay $200 for the Scarlett Johansson voice, 00:44:49.460 |
but you know, they'll probably get sued for that. 00:45:00.260 |
Roman, I have to really give him a shout out for that. 00:45:11.900 |
I think something that people don't understand 00:45:13.540 |
is OpenAI puts a lot of effort into their presentations 00:45:21.260 |
And I think, yeah, they care about their presentation 00:45:28.380 |
Just for the record, for people who don't understand 00:45:30.340 |
what happened was, first of all, you can go see, 00:45:36.780 |
because it was presented live at a conference 00:45:44.340 |
and it needs to distinguish between its own voice 00:45:49.540 |
So we had OpenAI engineers tune that for our stage 00:46:02.020 |
Because I think people wanted an update on voice. 00:46:13.580 |
is that Llama 3 is supposed to be a multimodal model. 00:46:19.380 |
Apparently, I'm not sure what the whole story there is. 00:46:26.260 |
It uses adapters rather than being natively multimodal. 00:46:35.620 |
because there were these independent threads of Voicebox 00:46:41.540 |
These are all projects that Meta-AI has launched 00:46:46.860 |
But now all that research is being pulled into Llama 3, 00:46:54.660 |
And yeah, you can see the Voicebox mentioned 00:47:01.820 |
because I looked at the state of existing conformer research 00:47:12.020 |
like the sheer amount of resources that are dedicated. 00:47:15.940 |
I think they had 230,000 hours of speech recordings. 00:47:24.260 |
So Meta just needs to 3X the budget on this thing 00:47:30.980 |
- Yeah, and then we can hopefully fine tune on our voice 00:47:38.180 |
- I should also shout out the other thing from Meta, 00:47:40.180 |
which is a very, very big deal, which is Chameleon, 00:47:42.820 |
which is a natively early fusion vision and language model. 00:47:53.660 |
then you kind of fuse them with an adapter layer. 00:47:59.820 |
Chameleon is interleaving in the same way that Idefics, 00:48:12.940 |
And I think like once that is better understood, 00:48:17.180 |
That is the more deep learning build version of this, 00:48:23.060 |
I asked Yi Tay this question about Chameleon in his episode, 00:48:37.820 |
basically all this half-ass measures around adapters 00:48:46.020 |
It is the train from scratch, fully omnimodal model, 00:48:53.220 |
you should read the Chameleon paper, basically. 00:48:58.900 |
because the open model doesn't have image generation. 00:49:05.340 |
The leads were like, "No, do not follow these instructions 00:49:13.380 |
Okay, so yeah, whenever image generation is concerned, 00:49:17.980 |
it's very tricky for large companies to release that, 00:49:26.300 |
and let the open source community put it back in. 00:49:31.340 |
The last piece I had, which I kind of deleted, 00:49:34.980 |
honorable mention of Gemma again with PaliGemma, 00:49:37.540 |
which is one of the smaller releases from Google I/O. 00:49:48.380 |
But ColPali now is being talked a lot about 00:50:00.020 |
- So apparently it is doing better than Amazon Textract 00:50:09.900 |
ColBERT retrieval approach on top of a vision model, 00:50:13.820 |
which, I was severely underestimating PaliGemma 00:50:16.500 |
when it came out, but it continues to come up. 00:50:20.140 |
And again, this is making a lot of progress here, 00:50:22.860 |
just in terms of their applications in real-world use cases. 00:50:26.060 |
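The ColBERT-style late interaction mentioned above reduces to a max-sim sum: each query token keeps only its best match among the document's embeddings (for ColPali, image-patch embeddings). A toy NumPy sketch with made-up embeddings, not the actual ColPali model:

```python
import numpy as np

def maxsim_score(query_emb, doc_emb):
    """ColBERT-style late interaction: for each query token, take its best
    match among the document's patch embeddings, then sum over query tokens."""
    sims = query_emb @ doc_emb.T   # (n_query_tokens, n_patches) similarities
    return sims.max(axis=1).sum()  # best patch per query token, summed

# Toy retrieval: 4 query tokens vs. two pages of 16 patch vectors, dim 8.
rng = np.random.default_rng(0)
query = rng.normal(size=(4, 8))
pages = {"page_a": rng.normal(size=(16, 8)), "page_b": rng.normal(size=(16, 8))}
best = max(pages, key=lambda p: maxsim_score(query, pages[p]))
print(best)
```

The appeal for document retrieval is that pages are embedded once as patch grids, with no OCR or layout parsing step.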
These are small models, but they're very, very capable 00:50:33.780 |
I think maybe a lot of people initially wrote them off, 00:50:36.100 |
but between, you know, some of the Gemini Nano stuff, 00:50:42.540 |
We'll talk about some of the KV cache and context caching. 00:50:52.460 |
He's excited about everything they got going on, so yeah. 00:51:05.580 |
Vertex has this reputation of being extremely hard to use. 00:51:16.020 |
like the Netlify or Vercel to the AWS, right? 00:51:20.820 |
And I think it's Google's chance to reinvent itself 00:51:23.780 |
for this audience, for the AI engineer audience 00:51:25.380 |
that doesn't want like five levels of auth IDs and org IDs 00:51:29.180 |
and policy permissions just to get something going. 00:51:44.100 |
And I might need to actually rename this war. 00:51:46.900 |
- War renaming alert, what are we calling it? 00:52:03.860 |
We also need AIs to work with other agents, right? 00:52:08.780 |
That's not reflected in any of the other wars. 00:52:13.740 |
what does an LLM plug into with the broader ecosystem 00:52:16.820 |
to be more capable than an LLM can be on its own? 00:52:21.780 |
but this is something I've been thinking about a lot. 00:52:48.420 |
- So e2b is basically a code interpreter SDK as a service. 00:52:52.060 |
So you can add code interpreter to any model. 00:52:56.380 |
They have this open source Claude Artifacts clone, 00:53:02.580 |
and the traction they've been getting in open source has been amazing. 00:53:07.020 |
from like 10K to a million containers spun up on the cloud. 00:53:10.900 |
So, I mean, you told me this maybe like nine months ago, 00:53:33.980 |
And yeah, E2B just raised a Series A from Lightspeed, so. 00:53:44.180 |
- So yeah, speaking as a VC, an early stage VC, 00:53:53.020 |
this is like way more important than the actual LLM ops, 00:53:57.300 |
you know, the observability and like all these things, 00:53:59.380 |
like those are nice, but like the way you build real value 00:54:04.620 |
how can this model do more than just chat with me? 00:54:07.380 |
So running code, doing analysis, doing web search. 00:54:25.180 |
And I don't think I'm happy with all the ops solutions 00:54:37.940 |
The central way I explain this thing to people 00:54:39.980 |
is that all the model labs view their job as stopping 00:54:43.020 |
at serving you their model over an API, right? 00:54:46.100 |
That is unfortunately not everything that you need 00:54:57.780 |
And 80 of them show up and they all raise money. 00:55:03.060 |
what do you actually need as sort of an AI native ops layer 00:55:06.620 |
versus what is just plugged into Datadog, right? 00:55:13.700 |
but I appreciate the importance of this thing. 00:55:18.380 |
which is frameworks, gateways, and monitoring or tracing. 00:55:23.140 |
We've talked to like, I interviewed Humanloop in London 00:55:35.340 |
was charging me $49 a month to store my prompt template. 00:55:45.700 |
And it's charging $49 a month for unlimited storage of that. 00:55:49.460 |
It's absurd, but like people want prompt management tools. 00:55:53.100 |
They want to interoperate between PM and developer. 00:56:02.780 |
I was at the Grab office and they also treat prompts as code, 00:56:07.140 |
but they build their own thing to then import the prompts. 00:56:08.580 |
- Yeah, but I want to check prompts into my code base 00:56:11.300 |
But maybe, do you want it outside of the code base? 00:56:17.780 |
What's like, you know, it's not just a string. 00:56:26.220 |
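As one sketch of what "not just a string" can mean when prompts are checked into the code base, here is a hypothetical typed prompt object. The class name, fields, and example values are all invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptTemplate:
    """A prompt checked into the repo as code, not a bare string:
    it carries a version, model settings, and a typed render step."""
    name: str
    version: str
    model: str
    temperature: float
    template: str

    def render(self, **variables: str) -> str:
        # A missing variable raises KeyError at call time,
        # instead of silently shipping a half-filled prompt.
        return self.template.format(**variables)

# Hypothetical example instance.
summarize = PromptTemplate(
    name="summarize-ticket",
    version="2024-07-01",
    model="gpt-4o",
    temperature=0.2,
    template="Summarize this support ticket in one sentence:\n{ticket}",
)
```

This is also diff-able and reviewable in a PR, which is part of the PM/developer interoperability question raised above.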
But I think like the problem with building frameworks 00:56:29.500 |
is like frameworks generalize things that we know work. 00:56:33.500 |
And like right now we don't really know what works. 00:56:35.580 |
- Yeah, but some people have to try, you know, 00:56:42.780 |
if you see the most successful open source frameworks 00:56:45.140 |
that became successful businesses are frameworks 00:56:54.020 |
- Vertical-pilled instead of horizontal-pilled. 00:56:56.980 |
- I mean, we try to be horizontal-pilled, right? 00:56:58.820 |
And it's like, where are all the horizontal startups? 00:57:02.860 |
They're just not that, they're not going to win by themselves. 00:57:07.860 |
I think some of them will win by sheer excellent execution. 00:57:12.340 |
And then, but like the market won't pull them. 00:57:16.980 |
It's like, you know, take like Julius, right? 00:57:20.420 |
It's like, "Hey, why are you guys doing Julius?" 00:57:29.820 |
- They're more dedicated to it than Code Interpreter. 00:57:33.580 |
- Just take it more seriously than (indistinct) 00:57:36.180 |
- I think people underestimate how important it is 00:57:41.060 |
versus trying to serve everybody with some of these things. 00:57:54.780 |
So the only job of the gateway is to just be one endpoint 00:58:00.300 |
And it normalizes the APIs mostly to OpenAI's API 00:58:07.940 |
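A minimal sketch of that normalization step. The provider payload shapes below are simplified illustrations for the example, not exact wire formats:

```python
def normalize_response(provider: str, payload: dict) -> dict:
    """Gateway core job, sketched: map each provider's response
    back into one OpenAI-style response shape."""
    if provider == "openai":
        text = payload["choices"][0]["message"]["content"]
    elif provider == "anthropic":
        text = payload["content"][0]["text"]
    else:
        raise ValueError(f"unknown provider: {provider}")
    # Everything downstream sees one shape, regardless of provider.
    return {"choices": [{"message": {"role": "assistant", "content": text}}]}
```

The value is that application code targets one endpoint and one schema, and swapping models becomes a config change.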
And then lastly, it's monitoring and tracing, right? 00:58:11.780 |
understanding the latency, like P99 or whatever, 00:58:15.820 |
So LangSmith is obviously very, very early on to this stuff. 00:58:29.100 |
It's very hard for me to choose between all those things. 00:58:44.180 |
we recommend these two other friends of ours. 00:58:46.420 |
And I'm like, why am I integrating four tools 00:58:54.980 |
The obvious frustration solution is I build my own, right? 00:58:57.780 |
Which is, you know, we have 14 standards, now we have 15. 00:59:03.700 |
I wish there was a better solution to recommend to people 00:59:06.660 |
because right now I cannot clearly recommend things. 00:59:08.940 |
- Yeah, I think the biggest change in this market 00:59:11.300 |
is like latency is actually not that important anymore. 00:59:14.860 |
Like we lived in the past 10 years in a world 00:59:17.060 |
where like 10, 15, 20 milliseconds made a big difference. 00:59:20.620 |
I think today people will be happy to trade 50 milliseconds 00:59:31.500 |
Instead of saying, is this quality good for this output? 00:59:36.180 |
Like, we're just kind of taking what we did with cloud 00:59:49.100 |
It's like, also like, I don't own most of the models. 00:59:51.820 |
So it's like, this is the GPT-4 API performance. 00:59:56.820 |
It's like, I can't do anything about it, you know? 00:59:58.820 |
So I think that's maybe why the value is not there. 01:00:02.100 |
Like, you know, am I supposed to pay 100K a year? 01:00:04.580 |
Like I pay Datadog or whatever to tell me, 01:00:10.140 |
It's like, you know, and just not, I don't know. 01:00:15.860 |
Okay, so the last piece I'll mention is briefly, 01:00:23.700 |
AI Engineer Ops, the Ops layer on top of the LLM layer 01:00:26.540 |
might follow the same evolution path as the ML Ops layer. 01:00:43.100 |
And you can A/B test like a hundred different variations 01:00:49.460 |
And I could see a straight line from there to like, 01:00:51.860 |
okay, I want this, but for my AI Engineering Ops, 01:00:55.380 |
like I want this level of clarity on like what I do. 01:01:05.020 |
And I see that also happening for AI Engineering as well. 01:01:07.660 |
And let's briefly talk about RAG and context caching, maybe, 01:01:16.460 |
No, I think that's really, a lot of it is like, 01:01:28.140 |
I think today it's mostly like LLM Rails, you know? 01:01:33.140 |
but I think like actually helping people build things. 01:01:47.900 |
but I haven't talked about it on this podcast 01:01:53.820 |
The Vogue thing of last year was vector databases, right? 01:01:59.420 |
And I think the insight is that vector databases 01:02:05.100 |
They do cosine similarity matching and retrieval, 01:02:10.780 |
which was this whole debate between Vespa and who else? 01:02:15.660 |
and I think a couple other companies also chipped in, 01:02:23.860 |
And the history of benchmarking for databases 01:02:25.620 |
goes as far back as Larry Ellison and Oracle and all that. 01:02:36.340 |
I think one of the reasons I put vector databases 01:02:41.460 |
the vector databases have to become more frameworks. 01:02:45.060 |
the ops companies have to become more frameworks, right? 01:02:47.180 |
And then the framework companies have to become ops companies, 01:02:51.100 |
So one element of the vector databases growing, 01:02:54.020 |
I've been looking for what the next direction 01:03:04.340 |
I'm also getting the limitless personal AI wearable, 01:03:07.900 |
I just wanted to record my whole conversation 01:03:15.300 |
I'm sure Character AI has some version of this. 01:03:22.260 |
vector database is very oriented towards factual memory, 01:03:24.340 |
document retrieval, knowledge-based retrieval, 01:03:26.620 |
but it's not the same thing as conversation retrieval, 01:03:32.340 |
what I said to you a year ago, three years ago. 01:03:34.420 |
And it's a different nature of retrieval, right? 01:03:48.940 |
they discover that graphs are a thing for the first time. 01:03:52.140 |
Like the future is graphs and then nothing happens. 01:04:04.500 |
So, this is a fun, this is why I'm not an investor. 01:04:08.340 |
Like you have to get the time that this time is different 01:04:18.900 |
- And so memory databases are one form of that, 01:04:20.660 |
where like they're focused on the problem of long form memory 01:04:24.180 |
for agents, for assistants, for chatbots and all that. 01:04:30.660 |
that I can't really talk about in this sector 01:04:39.580 |
that moving away from just semantic similarity, 01:04:45.780 |
with very different meanings, especially when talking. 01:04:50.260 |
- Yeah, the other direction that vector databases 01:04:51.780 |
have gone into, which LanceDB presented at my conference, 01:04:55.940 |
So Character AI uses LanceDB for multimodal embeddings. 01:05:03.220 |
in terms of what a vector database does for you. 01:05:07.620 |
is mostly the evolution of just the ecosystem of agents, 01:05:20.380 |
and he since announced that they are pivoting OpenDevin 01:05:36.620 |
They're all building like this ecosystem of agents, 01:05:48.540 |
The need for startups to build this ecosystem thing up. 01:05:56.420 |
So memory is emerging, then there's like other stuff. 01:06:02.740 |
To me, browser is slightly different from search 01:06:05.980 |
and Browserbase is another company I invested in 01:06:09.580 |
but they're not the only one in that category by any means. 01:06:18.900 |
Devin, since then, they spoke at the conference as well. 01:06:22.220 |
and actually gave me some personal time as well. 01:06:29.100 |
Each of those things is a potential startup now. 01:06:33.660 |
because they need it to do what they need to do as an agent. 01:06:49.540 |
The reality is that people want to own that standard. 01:06:51.660 |
So, we actually wound down the AI Engineer Foundation, 01:06:54.900 |
whose first project was the Agent Protocol, 01:07:08.700 |
People will keep this proprietary and more power to them. 01:07:16.340 |
We're investors in a bunch of agent companies. 01:07:33.500 |
because that's where the future is going, right? 01:07:35.100 |
So, today it's like intra-agent connectivity, you know? 01:07:39.620 |
it's not like somebody I'm selling into a company 01:07:42.580 |
and the company already uses agent X for that job. 01:07:47.780 |
But I think nobody really cares about that today. 01:07:51.100 |
- Yeah, so I think that that layer right now is OpenAPI. 01:08:01.740 |
So then the next layer is something I have worked on, 01:08:11.820 |
Yeah, but like, you know, RPC or some kind of, you know, 01:08:17.060 |
and this is one of my problems with the LLM OS concept, 01:08:20.900 |
is that do we really need to rewrite every single thing 01:08:30.420 |
Reality is, for now, yes, they need specialized APIs. 01:08:34.080 |
In the distant future, when these things cost nothing, 01:08:36.780 |
then they can use it the same way as humans do, 01:08:36.780 |
but right now they need specialized interfaces. 01:08:40.940 |
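As a concrete example of such a specialized interface, here is a hypothetical OpenAI-style function/tool schema; the web_search tool, its parameters, and defaults are invented for illustration:

```python
# A tool definition an agent could be handed: a JSON-Schema-style
# description of one callable capability. All names here are made up.
web_search_tool = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return top result snippets.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search terms."},
                "max_results": {"type": "integer", "default": 5},
            },
            "required": ["query"],
        },
    },
}
```

This is exactly the kind of machine-readable contract that today substitutes for the "just use English" layer discussed here.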
The layer between agents ideally should just be English, 01:08:53.140 |
- It's interesting because we talk to each other in English, 01:09:00.340 |
- For those people who want to dive in a little bit more, 01:09:02.580 |
I think AutoGen, I would definitely recommend 01:09:08.220 |
that are working on interagents, communication layers, 01:09:10.780 |
to coordinate them, and not necessarily externally 01:09:13.740 |
from company to company, just internally as well. 01:09:17.980 |
to do different things, you're going to need this anyway. 01:09:23.940 |
They're using some mix of English and structured output. 01:09:27.560 |
And yeah, if you have a better idea than that, let us know. 01:09:35.980 |
I think I want to leave some discussion time open 01:09:40.540 |
in the industry that don't exactly fit in the four wars 01:09:45.780 |
So the first one to me is just this trend of open source. 01:09:48.820 |
Obviously this overlaps a lot with the GPU poor thing, 01:09:51.420 |
but I want to really call out this depreciation thing 01:09:55.540 |
Like I do think it's probably one of the bigger theses 01:10:04.340 |
of the depreciation schedule of this sort of model spend. 01:10:11.500 |
but I drew a chart of the price efficiency frontier 01:10:26.740 |
And then I did the same chart in July, two days ago, 01:10:32.220 |
And Mistral is like deprecating their old models 01:10:36.700 |
It is so shocking how predictive and tight this band is. 01:10:43.340 |
and the whole industry is moving the same way. 01:10:45.380 |
And it's roughly one order of magnitude drop in cost 01:10:49.060 |
for the same level of intelligence every four months. 01:10:53.380 |
was one order of magnitude drop in cost every 12 months. 01:11:15.660 |
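A quick back-of-envelope check of what a 10x drop every four months compounds to; these numbers follow from the stated claim itself, not from measured data:

```python
# If cost for the same level of intelligence drops 10x every 4 months,
# what does that imply per month and per year?
months_per_10x = 4
monthly_factor = 10 ** (1 / months_per_10x)  # cost shrinks ~1.78x each month
yearly_drop = 10 ** (12 / months_per_10x)    # 10^3 = 1000x cheaper per year
```

By contrast, the earlier 12-month cadence works out to only 10x per year, which is why the four-month band is so striking.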
Or is it maybe like the timeline is going down 01:11:27.660 |
- You're like, "Wow, you got a good question." 01:11:34.500 |
The first response is something I haven't thought about. 01:11:39.140 |
When the cost of frontier models are going up, 01:11:41.740 |
potentially like SB1047 is going to make it illegal 01:11:46.940 |
For us, I think the opposition has increased enough 01:11:49.940 |
that it's not going to be a real concern for people. 01:12:05.180 |
And what we're talking about here is efficiency. 01:12:09.900 |
That's definitely one of the emergent stories 01:12:23.700 |
- Regardless of GPT-Next and Claude 4 or whatever, 01:12:27.580 |
we will still have efficiency frontiers to pursue. 01:12:30.580 |
And it seems like doing the higher capable thing 01:12:48.140 |
And the 8B had the most uplift across all the benchmarks. 01:12:56.900 |
So the best way to train more efficient models 01:13:04.060 |
So this is fascinating from an investor point of view. 01:13:06.060 |
You're like, okay, you're worried about picks and shovels, 01:13:07.820 |
you're worried about investing in foundation model labs. 01:13:20.060 |
what do you do when you know that your base cost 01:13:22.420 |
is going down an order of magnitude every four months? 01:13:41.140 |
and now the cost of intelligence is going down. 01:13:46.380 |
In the meantime, they have a crap ton of value 01:13:55.060 |
is to make economically non-viable startups now, 01:14:17.260 |
the model providers don't really have a lot of leverage 01:14:36.300 |
and was actually creating a lot of value downstream, 01:14:42.100 |
I think people today are not that happy with the models. 01:14:47.300 |
because I'm not really getting that much out of it. 01:14:53.540 |
and there are people saving 10, 20 million a year 01:14:59.660 |
like document translation and things like that, 01:15:05.780 |
So like the prices just have to go down too much, 01:15:12.060 |
- Yeah, I always mention temperature 2 use cases, right? 01:15:19.060 |
What are the cases where hallucination is a feature, 01:15:21.740 |
So we're the first podcast to interview WebSim, 01:15:27.820 |
Like we took generative AI and we used it to do RAG. 01:15:52.260 |
I think like most companies that are buying AI tooling, 01:15:56.020 |
they want the AI to do some sort of labor for them. 01:16:01.060 |
kind of disinterest maybe comes from a little bit. 01:16:03.780 |
Most companies do not wanna buy tools to build AI. 01:16:07.380 |
and they also do not want to pay a lot of money 01:16:09.660 |
for something that makes employees more productive 01:16:20.540 |
But most companies are not making a lot more money 01:16:28.220 |
like they're much smaller teams compared to before 01:16:36.020 |
which is something that people are used to paying 01:16:42.300 |
if you ask Brightwave, they don't have it public, 01:16:47.620 |
because hedge funds and like investment banking, 01:16:51.300 |
they're used to paying a lot of money for research. 01:17:16.540 |
So there's not really captive researchers anymore 01:17:21.060 |
And like even the sell side research is not that good. 01:17:23.300 |
- Taking them from in-house to external thing. 01:17:29.020 |
we have Dropzone that does security analysis. 01:17:31.260 |
Same, people are used to paying for managed security 01:17:39.940 |
- Okay, and what specifically does Dropzone do? 01:17:57.060 |
that's a phishing email that is in, that is in. 01:18:10.300 |
So it's a very basic economic analysis for the company, 01:18:15.220 |
It's not about, is my analyst going to have more free time? 01:18:41.580 |
I know now it's maybe not as good of an example, 01:18:43.540 |
but CrowdStrike started as a security research company. 01:18:48.500 |
- Yeah, I mean, it's still one of the most successful 01:19:00.980 |
It's like, what's the end labor that I'm building? 01:19:03.940 |
Because, you know, sometimes when you're being too generic 01:19:06.220 |
and you want to help people build things, like Adept, 01:19:08.780 |
you know, David was on the podcast 01:19:14.980 |
- Yeah, it's like, they're working with each company 01:19:28.660 |
he was also on a podcast and spoke at the conference. 01:19:30.940 |
He has this idea of like, it's reports, not RAG. 01:19:37.900 |
RAG is still too much work, still too much chatbotting. 01:19:43.420 |
I think you need to have everyone studying AI research, 01:19:48.060 |
understanding the progress of AI and foundation models, 01:19:50.820 |
needs to have in mind what is next after MMLU. 01:20:03.780 |
even though she made us take down the YouTube. 01:20:06.620 |
But MuSR for multi-step reasoning, MATH for math, 01:20:10.340 |
IFEval for instruction following, BIG-Bench Hard. 01:20:15.420 |
that the Hugging Face leaderboard does not have. 01:20:20.620 |
So MBPP is the current one that is post-HumanEval, 01:20:24.780 |
'cause HumanEval is widely known to be saturated. 01:20:29.660 |
Context utilization, we had Mark from Gradient 01:20:31.740 |
on to talk about RULER, but also ZeroSCROLLS and InfiniteBench 01:20:34.580 |
were the two that Llama 3 used instead of RULER. 01:20:37.820 |
But basically, something that's a little bit more rigorous 01:20:47.300 |
pretty consensus, I've done nothing there apart from, 01:20:49.940 |
yeah, like all models need something like this. 01:20:56.460 |
I think like Vibe Eval is actually the state of the art here. 01:21:02.020 |
So basically, like these are the 10 directions, right? 01:21:04.500 |
Post-MMLU, here are the frontier capabilities. 01:21:11.840 |
and then you have a good sense of how state of the art they are 01:21:21.380 |
How do you think about benchmarking for, you know, 01:21:24.740 |
everyday thing or like benchmarking for something 01:21:33.900 |
and probably more important for product usage, right? 01:21:40.180 |
And then there's a schism in the AI engineering community 01:21:44.740 |
that did not care enough about product evals. 01:21:51.340 |
but I acknowledge that, I think that it's important. 01:21:53.980 |
There was an oversight in my original AI engineer post. 01:21:57.900 |
is to produce product-specific evals for your use case. 01:22:01.620 |
And there's no way that these general academic benchmarks 01:22:10.700 |
These are very, very rigorous and thought through. 01:22:22.780 |
How well does IQ test correlate to job performance? 01:22:32.820 |
We can, you know, we try not to talk about safety. 01:22:37.420 |
is that, you know, if you're worried about agents 01:22:44.660 |
And you have a button that has just been proved 01:22:55.940 |
- That's funny, except for the CrowdStrike people. 01:23:05.900 |
I think, you know, AI News is surprisingly helpful 01:23:12.780 |
I just thought I needed a thing to summarize discords, 01:23:15.620 |
but now it's becoming a proper media company.