(upbeat music) - Hey, everyone. Welcome to the Latent Space Podcast. This is Alessio, partner and CTO in residence at Decibel Partners. And today we're in the Singapore studio with swyx. - Hey, this is our long-awaited one-on-one episode. I don't know how long ago the previous one was. Do you remember?
Three, four months now? - Yeah, it's been a while. - People really enjoyed it. It's just really, I think our travel schedules have been really difficult to get this stuff together. And then we also had like a decent backlog of guests for a while. I think we've kind of depleted that backlog now and we need to build it up again.
(laughing) But it's been busy and there's been a lot of news. So we actually get to do this like sort of rapid fire thing. I think some people, you know, the podcast has grown a lot in the last six months. Maybe just reintroducing like what you're up to, what I'm up to, and why we're here in Singapore and stuff like that.
- Yeah, my first time here in Singapore, which has been really nice. This country is really amazing, I would say. First of all, everything feels like the busiest part of the city. Everything is skyscrapers. There's like plants in all the buildings, or at least in the areas that I've been in, which has been awesome.
And I was at one of the offices kind of on the south side and from the 38th floor, you can see Indonesia on one side and you can see Malaysia on the other side. So it's quite small. One of the people there said their kid goes to school at the border with Malaysia, basically, so they could drive to Malaysia every day.
(all laughing) So they could go pick her up from school. Yeah, and we came here, we hosted with you the Sovereign AI Summit Wednesday night. We had a lot of- - NVIDIA, Goldman, Temasek, Singtel. - GSE, Singtel. And we got to talk about this trend of Sovereign AI, which maybe we might cover on another episode, but basically how do you drive, if you're a country, how do you drive productivity growth in a time where populations are shrinking, the workforce is shrinking, and AI can kind of supplement a lot of this.
And then the question is, okay, should I put all this money in foundation models? Should I put it in data centers and infrastructure? Should I put it in GPUs? Should I put it in agents and whatnot? So we'll touch on some of these trends in the episode, but it was a fun event.
And I did not expect some of the most senior people at the largest financial institution in Singapore to ask about state space models and some of the alternatives. So it's great to see how advanced the conversation is sometimes. - Yeah, I think that that is mostly people trying to latch onto jargon that is being floated around as like, oh, what could kill transformers?
And then they jump straight there without actually exploring the fundamentals, the basics of what they will actually put to work. That's fine, it's a forum to ask questions. So you wanna ask about the future, but I feel like it's not very practical to spend so much time on those things.
Part of the things that I do in this space, especially when I travel, is to try to ask questions about what countries that are not the US and not San Francisco can do, because everyone feels a bit left out. You feel it here as well. And I'm trying to promote alternatives.
I think AI engineering is one way that countries can capitalize on the industry without building a hundred billion dollar cluster, which is one fifth the GDP of Singapore. And so my pitch at the summit was Singapore as the AI Engineer Nation. We're also working on bringing the AI Engineer conference to Singapore next year, together with ICLR.
So yeah, we're just trying my best and I'm being looped into various government meetings to try to make that happen. - Well, we'll definitely be here next year. We'll be, I'll be back here very often. It's really nice. - Yeah, awesome. Okay, well, we have a lot of news.
How do you think we should cover? - Maybe just recap, since the framework of the four wars of AI is something that came up end of last year. So basically, we'll link in the show notes, but the end of year recap for 2023 was basically the four wars of AI, which we picked as GPU-rich versus GPU-poor, the data quality wars, the multimodality wars, and the RAG/Ops wars.
So usually everything falls back under those four categories. So I'm pretty happy that seven months later, it's something that still matters. - It still kind of holds up. - Yeah, most AI stuff from eight months ago, it's really not that relevant anymore. And today, we'll try and bucket some of the recent news on it.
We haven't done a monthly thing in like three months. So three months is a lot of stuff. - That's mostly because I got busy with the conference. But I do want to, actually, I do want to get back on that horse, or maybe just do it weekly so that I don't have such a big lift that I don't do it.
I think the activation energy is the problem, really. So yeah, I think frontier model-wise, it seems like Claude has really carved out a persistent space for itself. For a long time, I thought it was kind of like a clear number two to OpenAI. And with 3.5 Sonnet, at least on some of the hard benchmarks on LMSys, or the coding benchmarks on LMSys, it is the undisputed number one model in the world, even with GPT-4o mini.
And we can talk about 4o mini and benchmarking later on, but for Claude to be there and hold that position for what is more than a month now in AI time is a big deal. There's not much that people know publicly about what Anthropic did for Claude 3.5 Sonnet, but I think it's still a huge achievement.
It marks the beginning of a non-OpenAI-centric world, to the point where people on Twitter have canceled ChatGPT. That's been a trend that's been going on for a while. We talked about the unbundling of ChatGPT. But now, new open source projects and tooling, they're just built for Claude.
They don't even use OpenAI. That's a strategic threat to OpenAI, I think, a little bit. Obviously, OpenAI is so big that it doesn't really care about that. But for Anthropic, it's a big win. I think to see that going and to see Anthropic differentiating itself and actually implementing research.
So the rumor is that the Scaling Monosemanticity paper that they put out two months ago was a big part of Claude 3.5 Sonnet. I've had off-the-record chats with people about that idea, and they don't agree that it is the only cause. So it's not the only thing that they did.
But people say that there's about four or five other tricks that they haven't disclosed yet that went into 3.5 Sonnet. But the Scaling Monosemanticity paper is a very, very good read. It's a very long read. But it basically says that you can find control vectors, control features now that you can turn on to make it better at code without really retraining it.
You just train a whole bunch of sparse autoencoders, find a bunch of features, and just say, let's up those features, and suddenly you're better at code, or suddenly you care a lot about the Golden Gate Bridge. These are the same things to the model. That is a huge, huge win for interpretability because up to now we were only doing interpretability on toy models, like a few million parameters, a model of Go or chess or whatever.
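To make the "turn up a feature" idea concrete, here is a minimal PyTorch sketch of SAE-style feature steering. This is just an illustration of the general technique, not Anthropic's actual setup; the dimensions, the feature index, and the steering scale are all made up.

```python
import torch
import torch.nn as nn

d_model, d_features = 512, 4096  # hypothetical sizes

class SparseAutoencoder(nn.Module):
    """Tiny SAE: encode residual-stream activations into many sparse features."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # sparse feature activations
        return self.decoder(features), features

sae = SparseAutoencoder()          # in practice, trained to reconstruct real activations
resid = torch.randn(1, d_model)    # a residual-stream activation from the model

# "Turn up" one interpretable feature (say, a hypothetical "code" feature).
feature_id, scale = 123, 5.0
steering_vector = sae.decoder.weight[:, feature_id]  # that feature's write direction
steered_resid = resid + scale * steering_vector      # inject it back into the stream
```

The base model itself is never retrained; you only add the feature's decoder direction back into the activations at inference time.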
Claude 3.5 Sonnet was interpreted and usefully improved using this technique. Wow. - Yeah, I think it would be amazing if we could replicate the same on the open models, because now we can use Llama 3.1 to generate synthetic data for training and fine tuning. I think obviously Anthropic has a lot of compute and a lot of money.
So once they figure out, okay, this is what we should make the model better at, they can kind of like put a lot of resources in. I think open source is probably gonna be a more distributed effort. Like I feel like Nous has held the crown of like the best fine tuning data set owners for a while, but at some point that should change, hopefully.
Like other groups should step up. And I think if we can apply the same principles to like a model as big as 405B and bring them into like maybe the 7B form factor, that would be great. But yeah, Claude is great. I canceled ChatGPT a while ago. We run Smol Podcaster for Latent Space.
It runs both on Claude and on OpenAI, and Claude is definitely better most of the time. It's not a benchmark, it's just vibes, but when the vibes are good, the vibes are good. - We run most of the AI News summaries on Claude as well. - But, and I always run it against OpenAI.
Sometimes OpenAI wins. I do a daily comparison, but yeah, Claude is very strong at summarization and instruction following, which is something I care a lot about. So when you talk about frontier models, MMLU no longer cuts it, right? Like we have reached like 92 on MMLU. It's going to like 95, 97.
It just means you're memorizing MMLU. Like there's some fundamental irreducible level of mistakes because of MMLU's quality. We talked about this with Clementine on the Hugging Face episode. And so we need to see what else, what is the next frontier? I think there are 10 directions that I outlined below, but we'll talk about that later.
Yeah, should we move on to Llama 3? - Yeah, 3.1, I guess. We should make sure to differentiate between the models. But yeah, we have a whole episode with Thomas Scialom from the Meta team, which was really, really good. And I'm glad we got the podcast to come out at the same time as the model.
- Yeah, I think we're the only ones to coordinate for the paper release for the big launch, the 405B launch. Zuck did a few interviews, but we're the only ones that did the technical team interview. - Yeah, yeah, yeah. I mean, they were like surfing or something with the Bloomberg person.
We should get invited to surf with Zuck. - I would, yeah, I would be down to. - To the audience, the technical breakdown. - So behind the scenes, for listeners, one thing that we have tension about is who do we invite? Because obviously if we get Mark Zuckerberg, it'll be a big name, and it will cause people to download us more, but it will be a less technical interview because he's not on the research team.
He's CEO of Meta. And so I think it's this constant back and forth. Like we want to grow as a podcast, but we want to serve a technical audience. And we're trying to do that, thread that line because our currency as podcasters is the people that listen to it.
And we need big names, but we also need to serve our audience well. And I think if we don't do it well, this actually goes all the way back to George Hotz. After he finished recording with us, he said, "You have two paths in the podcast world. Either you go be Lex Fridman or you stay small and niche." And we definitely like, we like our niche.
We think it's a good niche. It's going to grow. But at the same time, I still want us to grow. I want us to grow on YouTube, right? And so that's always a Meta thing. Not to get too Meta. - Not that Meta, the other Meta. - Yeah, so Llama 3, yeah.
- I think to me, the biggest thing is the training on outputs. Like every company is just hiding the fact that they've been fine tuning and training on GPT-4 outputs, and you technically cannot do it, but obviously OpenAI is not enforcing it. I think now for the first time, there's like a clear path to how do we make a 7B model good without having to go through GPT-4 or going to Claude 3.
And we'll kind of talk about this later, but I think we're seeing maybe the, you know, not the death, but like selling the picks and shovels, it's kind of going away. And like building the vertical things is like where most of the value is actually getting captured, at least at the early stages.
So being able to make small models better and specific things through a large model is more important than yet another 7B model that I can try and use. But at the end of the day, I still need to go through the large labs to fine tune. So that to me is the most interesting thing.
You know, it's such a large model that like it's obviously amazing, but I don't know if a lot of people are switching from GPT-4 or Claude 3.5 to run 405B. I also don't know what the hosting options are as far as like scaling, you know, I don't know if the Fireworks and Togethers of the world, how much capacity they actually have to serve this model, because at the end of the day, it's a lot of compute if some of the big products will switch to it and you cannot easily run it yourself.
So I don't know, but to me, this synthetic data piece is definitely the most interesting. - Yeah, I would say that it is not enough now to say that synthetic data is real. I actually shipped that in the original email and then I changed that in the sort of what you see now in the podcast description.
But because it is so established now that synthetic data is real, therefore you need to go to the next level, which is, okay, what do you use it for and how do you use it? And I think that is what was interesting for Llama 3 for me, which, if you read the paper, 90 pages of "all killer, no filler," or something like that.
This is what people were saying. For once, a frontier model with a proper paper instead of a marketing blog post. And they actually spelled out how they'd use synthetic data for a few different domains. So they have synthetic data for code, for math, for multilinguality, for long context, for tool use, and then also for ASR and voice generation.
And I think that, yeah, okay, now you have the license to go distill Llama 3 405B, but how do you do that? That is the sort of the next frontier. Now you have the permission to do it, how do you do it? And I think that people are gonna reference Llama 3 a lot, but then they can use those techniques for everything else.
You know, in our episode with Thomas, he talked about like, I was very focused on synthetic data for pre-training 'cause that's my context. That's my conversations with Teknium from Nous and all the other people doing synthetic data for pre-training and fine tuning. But he was talking about post-training as well.
And for everything here it was post-training. In fact, I wish we had spent more time with Thomas on this stuff. We just didn't have the paper beforehand. (all laughing) But I think like when I call Llama 3 the synthetic data model, it's that you have the license for it, but then you also have the roadmap, the recipe, because it's in the paper.
And now everybody knows how to do this. And probably, you know, obviously OpenAI is probably laughing at us 'cause they did this like a year ago, but now it's in the open. - I mean, they can laugh all they want, but they're coming for them. I think, I mean, that's definitely the biggest vibe shift, right?
It's like, obviously Llama 3.1 is good. Obviously Claude is good. Maybe a year and a half ago, you didn't get the benefit of the doubt as an OpenAI competitor to be state-of-the-art. You know, it was kind of like, oh, Anthropic, yeah, these guys are cute over there.
They're trying to do their thing, but it's not OpenAI. And like Llama 2 is great, but like it's really not a serious model. You know, it's like just good enough. I think now it's like every time Anthropic releases something, people are like, okay, this is like a serious thing.
Whenever like Meta releases something, it's like, okay, they're at the same level. And I don't know if OpenAI is kind of like sandbagging on GPT-Next, you know? And then they kind of, you know, yesterday or today, they launched the SearchGPT thing behind a waitlist. - The Singapore time confusion, when was it?
- Yeah, when was it? - Yes, it happened yesterday, U.S. time, but today, Singapore time. - Thursday. It's been really confusing. But yeah, and people are kind of like, oh, okay, OpenAI. I don't know if we can take you seriously. - Well, no, one of the AI Grant people, I think Hirsch, tweeted that, you know, you can skip the waitlist, just go to perplexity.com.
(laughs) And that was a really, really sick burn for the OpenAI SearchGPT waitlist. But their implementation will have something different. They probably like train a dedicated model for that, you know, like they will have some innovation that we haven't seen. - Yeah, data licensing, obviously. - Data licensing, yes.
We're optimistic, you know, but the vibe shift is real. And I think that's something that is just worth commenting on and watching. And yeah, how the other labs catch up. I think what you said there is actually very interesting. The trend of successive releases is very important to watch.
If things get less and less exciting, then it's a red flag for that company. And if things get more and more exciting, it means that these guys have a good team, they have a good plan, good ideas. So yeah, like I will call out, you know, the Microsoft Phi team as well.
Phi-1 was kind of widely regarded to be overtrained on benchmarks, and Phi-2 and Phi-3 subsequently improved a lot as well. I would say also similar for Gemma, Gemma 1 and 2. Gemma 2 is currently leading in terms of the LocalLlama sort of vibe check, eval, informal straw poll.
And that's only like a month after release. They released at the AI Engineer World's Fair. And, you know, like I didn't know what to think about it 'cause Gemma 1 wasn't like super well-received. It was just kind of like, here's like free tier Gemma, you know, but now Gemma 2 is actually like a very legitimately widely used model by the open source and LocalLlama community.
So that's great, until Llama 3.1 8B came along. (laughing) And so, and we'll talk about this also, like just the winds of AI winter is also like, what is the depreciation schedule on this model inference and training cost? Like it's very high. - Yeah. I'm curious to get your thought on Mistral.
Everybody's favorite sparkling weights company. - Yeah, yeah. - They just released the, you know, Mistral Large Enough. - Large, Mistral Large 2. - Yeah, Large 2. - So this was one day after Llama 3.1, presumably because they were speaking at ICML, which is going on right now. By the way, Brittany is doing a guest host thing for us.
She's running around the poster sessions doing what I do, which is very great. 'Cause I couldn't go because of my visa issue. I have to be careful what I say here, because we still want to respect their work, but Mistral Large, I would say, is not as exciting as Llama 3.
I think that is very, very fair to say. It is, yes, another GPT-4 class model, released as open weights with a research license rather than a commercial license, but still open weights. And that's good for the community, but it is a step down in terms of the general excitement around Mistral compared to Llama.
I think that would be fair to say, and I would say that to Mistral themselves. So the general hope is, and I cannot say too much, it's 'cause I've had offline conversations with people close to this. The general hope is that they need something more. Of the 10 elements of what is next in terms of their frontier model boundaries, Mistral needs to make progress there.
They made progress here with instruction following and structured output and multilinguality and all those things. But I think to stand out, you need to basically pull a stunt. You need to be a superlatively good company in one dimension. And now, unfortunately, Mistral does not have that crown as open-source kings.
Like a year ago, I was saying, Mistral are the kings of open-source AI. Now Meta is, and they've lost their crown. By the way, they've also deprecated Mistral 7B, 8x7B, and 8x22B. So now there's only the closed-source models that are on the API platform. So has Mistral basically started becoming more of a closed-model proprietary platform?
I don't believe that's true. I believe that they're still very committed to open-source, but they need to come up with something more that people can use. And that's a grind. I mean, they have, what, $600 million to do it? So that's still good. But people are waiting for what's next from them.
- Yeah, to me, the perception was interesting. In the comments of the release, everybody was like, "Why do you have a non-commercial license if you're not making any money from the inference anyway?" So I feel like the AI engineering tier list is kind of shifting in real time. And maybe Mistral, like you said before, was like, "Hey, thank God for these guys.
"They're saving us in open-source. "They're kind of like speed-running "GPT-1, GPT-2, GPT-3 in open-source." But now it's like they're kind of moving away from that. I haven't really heard of that many people using them as scale commercially, just from discussions. So I'm curious to see what the next step is.
- Yeah, but also you're sort of US-based, and maybe they're not focused there, right? So- - Yeah, no, exactly. - It's a very big elephant, and we're only touching pieces of it as blind leading the blind. (laughs) I will call out, they have some interesting experimentations with Mamba, and Mistral NeMo is actually on the efficiency frontier chart that I drew that is still relevant.
So don't discount Mistral NeMo. But Mistral Large, otherwise, it's an update. It's a necessary update for Mistral Large V1, but other than that, they're just kind of holding the line, not really advancing the field yet. That'll be my statement there. - So those are the frontier big labs. - Yes.
- And then now we're gonna shift a little bit towards the smaller deployable on-device solutions. - Yeah. First of all, a shout out to our friend, Tri Dao, who released Flash Attention 3. Flash Attention 2, we kind of did a deep dive on the podcast. He came on in the studio back then.
It's just great to see how small groups can make a big impact on a whole industry, just by making math better. So it's just great to see. Just wanted to give Tri a shout out. - Something I mentioned there, and it's something that always comes up, even at the Sovereign AI Summit that we did, was do NVIDIA's competitors pose any threat to NVIDIA?
AMD, MatX, Etched, which caused a lot of noise with their Sohu chip as well. And just the simple fact is that NVIDIA has won the hardware lottery, and people are customizing for NVIDIA. Like Flash Attention 3 only works for NVIDIA, only works for H100s. And like this much work, this much scaling, this much validation going into this stuff is very difficult to replicate, or very expensive to replicate for the other hardware ecosystems.
So not impossible. I actually heard a really good argument from, I think it is Martin Casado from A16Z, who was saying basically like, yeah, absolutely NVIDIA's hardware and ecosystem makes sense. And obviously that's contributed to, it's like, I don't know, it's like the most valuable company in the world right now, but current training runs are like 100 million to 200 million in cost.
But when they go to 500 million, when they go to a billion, when they go to 1 trillion, then you can actually start justifying making custom ASICs for your run. And if they cut your costs by like half, then you make your money back in one run. - Yeah, yeah, yeah.
Martin has always been a fan of custom ASICs. I think they wrote a really good post maybe a couple of years ago about cloud repatriation. - Oh yeah, I think he got a lot of shit for that, but it's becoming more consensus now, I think. So Noam Shazeer is blogging again, fantastic gifts to the world.
This guy, nonstop bangers. And so he's at Character AI and he put up a post talking about five tricks that they use to serve 20% of Google search traffic as LLM inference. A lot of people were very shocked by that number, but I think you just have to remember that most conversations are multi-turn, right?
Like in the span of one Google search, I will send like 10 text messages, right? So obviously there's a good ratio here that matters. It's obviously a flex of Character AI's traction among the kids, because I have tried to use Character AI since then, and I still cannot for the life of me get it.
Have you tried? - I have tried it, but yes, definitely not. - Yeah, they launched like voice. I tried to talk to it. It was just so stupid. I just didn't like it myself. But this is what it means. - But please still come on the podcast, Noam Shazeer.
- Sorry, what did I mean? - No, no, no. Because like I don't really understand like what the use case is for apart from like the therapy, role play, homework assistant type of stuff that is the norm. But anyway, one of the most interesting things, so you detailed five tricks.
One thing that people talk a lot about is native int8 training. I got it wrong in our Thomas podcast. I said FP8; it's int8. And I think like that is something that is an easy win. Like we should basically, when we're getting to the point where we're overtraining models 100 times past the Chinchilla ratio to optimize for inference, the next thing is actually like, hey, let's stop using so much memory when training, because we're gonna quantize it anyway for inference.
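As a rough sketch of the "train in the precision you'll serve in" idea, the snippet below does generic quantization-aware training with a straight-through estimator, fake-quantizing weights to int8 on the forward pass while keeping full-precision master weights for the optimizer. This illustrates the general concept only, not Character AI's actual native-int8 recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quant_int8(w: torch.Tensor) -> torch.Tensor:
    scale = w.abs().max() / 127.0 + 1e-12
    w_q = torch.clamp((w / scale).round(), -127, 127) * scale
    # straight-through estimator: forward uses quantized weights,
    # backward treats the quantization as the identity
    return w + (w_q - w).detach()

class QuantLinear(nn.Linear):
    def forward(self, x):
        return F.linear(x, fake_quant_int8(self.weight), self.bias)

layer = QuantLinear(256, 256)
out = layer(torch.randn(4, 256))
out.sum().backward()  # gradients still flow to the full-precision master weights
```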
So like just let's pre-quantize it in training. So that makes a lot of sense. The other thing as well is this concept of global local hybrid architecture, which I think is basically going to be the norm, right? So he has this formula of one to five ratio of global attention to local attention.
And he says that that works for the long-form conversations that Character has. Okay, that's great. And like simultaneously we have independent research from other companies about similar hybrid ratios being the best for their research. So NVIDIA came out with a Mamba-transformer hybrid research thing. And in their estimation, you only need 7% transformers.
Everything else can be state space models. Jamba also had something like between six-to-one and 30-to-one. And basically every form of hybrid architecture seems to be working at the research stage. So I think like if we scale this, it makes complete sense that you just need a mix of architectures, and it could well be that, instead of transformers being all you need, transformers are the global attention block.
And then the local attention thing can be the state space models, can be the RWKVs, can be another transformer, but just limited by a sliding window. And I think like we're slowly discovering like the fundamental building blocks of AI. One is transformers, one is something that's local, whatever that is.
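A minimal sketch of that 1:5 global-to-local layer pattern: one full-attention layer for every five sliding-window layers. The sizes, window, and block design are illustrative, not Character AI's (or anyone's) actual architecture, and the local blocks could just as well be SSM blocks.

```python
import torch
import torch.nn as nn

d_model, n_heads, window = 256, 4, 128

class AttentionBlock(nn.Module):
    def __init__(self, local: bool):
        super().__init__()
        self.local = local
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # causal
        if self.local:
            # also forbid attending further back than `window` tokens
            mask |= torch.tril(torch.ones(T, T, dtype=torch.bool), diagonal=-window)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return x + out

# one global (full-attention) block for every five local (sliding-window) blocks
layers = nn.ModuleList(AttentionBlock(local=(i % 6 != 0)) for i in range(12))

x = torch.randn(2, 512, d_model)
for layer in layers:
    x = layer(x)
```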
And then, you know, who knows what else is next? I mean, the other stuff is adapters, we can talk about that. But yeah, the headline is that Noam, maybe he's too confident, but I mean, I believe him. Noam thinks that he can do inference at 13x cheaper than Fireworks and Together, right?
So like, there is a lot of room left to improve inference. - I mean, it does make sense, right? Because like, otherwise, I don't know. - Otherwise, character would be bankrupt. - Yeah, exactly. I was like, they would be losing a ton of money, so. - They are rumored to be exploring a sale.
So I'm sure money is still an issue for them, but I'm also sure they're making a lot of money. So it's very hard to tell because it's not a very public company. - Well, I think that's one of the things in the market right now too, is like, hey, do you just want to keep building?
Do you want to like, just not worry about the money and go build somewhere else? Kind of like maybe Inflection and Adept and some of these other pseudo-acquihires, licensing deals and whatnot. So I'm curious to see what companies decide to stick with it. - I think Google or Meta should pay $1 billion for Noam alone.
The purchase price for Character is $1 billion, which is super underpriced. - Which is nothing at their market cap, right? - It's nothing. Meta's market cap right now is $1.15 trillion because they're down 5%, 11% in the past month. - What? - Yeah. So if you pay $1 billion, you know, that's less than 0.1% of your market cap.
And they paid $1 billion for WhatsApp, which was about 1% of their market cap at the time. So yeah. - That is beyond our pay grade. But the last piece of the GPU rich-poor wars. So we're going from the super GPU rich down to like the medium GPU rich.
And now down to the GPU poorest is on-device models, right? Which is something that people are very, very excited about. So at my conference, Mozilla AI, I think, was kind of like the talk of the town there with Llamafile. We had Justine Tunney come in and explain like some of the optimizations that they did.
And their just general vision for on-device AI. I think that like, it's basically the second act of Mozilla. Like a lot of good with the open source browser. And obviously then they have since declined because it's very hard to keep up in that field. And Mozilla has had some management issues as well.
But now that the operating system is moving to the AI layer, now they're also like, you know, promoting open source AI there and also like private AI, right? Like open source is synonymous with local, private and all the good things that people want. And I think their vision of like, even running this stuff on CPUs at a very, very fast speed by just like being extremely cracked.
(laughs) I think it's very understated and we should probably try to support it more. And it's just amazing to host these people and see the progress. - Yeah, I think to me the biggest question about on-device, obviously there's a Gemini Nano, which is getting shipped with Chrome. - Yeah, so let's survey, right?
So Llamafile is one executable that runs on every architecture. Similar for, by the way, Mojo from Modular, which also spoke at the conference. And then what else? llama.cpp, MLX, those kinds are all sort of at that layer. Then the next layer up would be the built-in into their products by the vendors.
So Google Chrome is building Gemini Nano into the browser. The next version of Google Chrome will have Nano inside that you can use like window.ai.something and it would just call Nano. There'll be no download, no latency whatsoever 'cause it runs on your device. And there's Apple Intelligence as well, which is Apple's version, which is in the OS, accessible by apps.
And then there's a long tail of others. But yeah, your comments on those things. - My biggest question is how much can you differentiate at that model size? Like how big is gonna be the performance gap between all these models? And are people gonna be aware of what model is running?
Right now, for the large models, we're still pretty aware of like, oh, is this Sonnet 3.5, is this GPT-4, is this, you know, 3.1 405B. I think the smaller you get, the more it's just gonna become like a utility, you know? So like, you're not gonna need a model router for like small models.
You're not gonna need any of that. Like they're all gonna converge to like the best possible performance. - Actually, Apple Intelligence is the model router, I think. They have something like 14, I did a count in my newsletter, like 14 to 20 adapters. And so based on your use case, they'll route and load the adapter or they'll route to OpenAI.
So there is some routing layer. To me, I think a lot of people were trying to puzzle out the strategic moves between OpenAI and Apple here because Apple is in a very good position to commoditize OpenAI. There was some rumors that Google was working with Apple to launch it, they did not make it for the launch, but presumably Apple wants to commoditize OpenAI, right?
So, you know, when you launch, you can choose your preferred external AI provider and it's either OpenAI or Google or someone else. I mean, that puts Apple at the center of the world with the ability to make routing decisions. And I think that's probably good for privacy, probably good for the planet, 'cause you're not running like oversized models on like your spell check pass.
And I'm generally pretty positive on it. Like, yeah, I'm not concerned about the capabilities issue. It meets their benchmarks. Apple put out a whole bunch of proprietary benchmarks 'cause they don't like to do anything in the way that everyone else does it. So like, you know, in the Apple intelligence blog posts, they like, I think like all of them were just like their internal human evaluations.
And only one of them was an industry standard benchmark, which was IFEval, which is good. But like, you know, why didn't you also release your MMLU? Oh, 'cause you suck on it. All right. (laughing) - Well, I actually think all these models will be good. And on the Apple side, I'm curious to see what the price tag will be to be the default.
Right now, Google pays them 20 billion to be the default search. - I see. The rumors is zero. - Yeah, I mean, today, even if it was 20 billion, that's like nothing compared to like, you know, NVIDIA's worth three trillion. So like even paying 20 billion to be the default AI provider, like would be cheap compared to search, given that AI is actually being such a core part of the experience.
Like Google being the default for like Apple's phone experience really doesn't change anything. Becoming the default AI provider for like the Apple experience will be worth a lot more than this. - I mean, so I can justify it being zero instead of 20 billion; it's because OpenAI has to foot the inference costs, right?
So that's a lot. - Well, yeah, Microsoft really is footing it, but again, Microsoft is worth two trillion, you know? - So as someone who, this is the web developer coming out, as someone who is a champion of the open web, Apple has been, let's just say, a roadblock in that direction.
I think Gemini Nano being good is more important than Apple intelligence being generally capable. Apple intelligence being like on-device router for Apple apps is good, but like if you care about the open web, you really need Gemini Nano to work. And we're not sure. Like right now we have some demos showing that it's fast enough, but we haven't had systematic tests on it.
Along the lines of that research, I will highlight that Apple has also put out DataComp-LM. I actually interviewed DataComp at NeurIPS last year, and they've branched out from just vision and images to language models. And Apple has put out a reference implementation of a 7B language model that's built on top of DataComp.
And it is better than FineWeb, which is huge because FineWeb was the state-of-the-art last month. (both laughing) And that's fantastic. So basically like DataComp is open data, open weights, open model, like super everything open. So there will be a lot of people optimizing this kind of model. They'll be building on architectures like MobileLLM and SmolLM, which basically innovate in terms of like shared weights and shared matrices for smaller models so that you just optimize the amount of file size and memory that you take up.
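As one concrete example of the shared-matrices idea, here is the classic weight-tying trick: the input embedding and the output head share a single vocab-by-dimension matrix, which is a large fraction of a small model's parameters. Sizes are illustrative, and this is only one ingredient of what papers like MobileLLM describe, not their full recipe.

```python
import torch.nn as nn

vocab_size, d_model = 32000, 576  # hypothetical small-model sizes

embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# weight tying: one (vocab_size x d_model) matrix stored and trained, not two
lm_head.weight = embed.weight
```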
And I think just the general trend of on-device models, like the only way that intelligence too cheap to meter happens is if everything happens on-device. So unfortunately that means that OpenAI is not involved in this. Like OpenAI's mission is intelligence too cheap to meter, and they're not doing the one thing that needs to happen for that, because there's no business plan in monetizing an API for that.
But by definition, none of this is APIs. - I don't know. Maybe OpenAI, even Sam Altman needs to figure it out so they can do a-- - Yeah, I'm excited for an OpenAI phone. I don't know if you would buy an OpenAI phone. I mean, I'm very locked into the iOS ecosystem, but I mean-- - I will not be the first person to buy it because I don't want to be stuck with like the rabbit equivalent of an iPhone, but I think it makes a lot of sense.
I want their-- - They're building a search engine now. The next thing is the phone. (laughs) - Exactly. So we'll see. - We'll see. When it comes on a wait list, we'll see. - Yeah, yeah, we'll review it. All right, so that was GPU-rich, GPU-poor. Maybe we just want to run quickly through the quality data wars.
There's mostly drama in this section. There's not as much research. - I think there's a lot of news going in the background. So like the New York Times lawsuit is still ongoing. You know, it's just like we won't have specific things to update people on. There are specific deals that are happening all the time with Stack Overflow making deals with everybody, with like Shutterstock making deals with everybody.
It's just, it's hard to make a single news item out of something that is just slowly cooking in the background. - Yeah, on the New York Times thing, OpenAI's strategy has been to make the New York Times prove that their content is actually original or, like, actually interesting.
- Really? - Yeah, so it's kind of like, you know, the iRobot meme. It's like, can a robot create a beautiful new symphony? And the robot is like, can you? - I think that's what OpenAI's strategy is. - Yeah, I think that the danger with the lawsuit, because this lawsuit is very public, because OpenAI responded, including with Ilya, showing their emails with New York Times, saying that, "Hey, we were doing a deal.
You were like very close to a deal. And then suddenly on the eve of the deal, you called it off." I don't think New York Times has responded to that one, but it's very, very strange because the New York Times' brand is like trying to be like, you know, they're supposed to be the top newspaper in their country.
If OpenAI, like just, and this was my criticism of it at the point in time, like, okay, we'll just go to the next best paper, the Washington Post, the Financial Times, they're all happy to work with us. And then what does New York Times have? - Yeah, yeah, yeah.
- So you just lost out on like a hundred million dollars, $200 million a year of licensing deals just because you wanted to pick that war, which ideologically, I think they are absolutely right to do that. But, you know, the other people, The Verge did a very good interview with, I think the Washington Post.
I'm going to get the outlet wrong. The Verge did a very good interview with a newspaper owner, editor, on why they did the deal with OpenAI. And I think that listening to them on like, they're thinking through like the reasoning of like the pros and cons of picking a fight versus partnering, I think it's very interesting.
- Yeah, I guess the winner in all of this is Reddit, which is making over 200 million just in data licensing to OpenAI and some of the other AI providers. I mean, 200 million is like more than most AI startups are making. - So I think that was an IPO play, 'cause Reddit conveniently did this deal before IPO, right?
- Totally. - Is it like a one-time deal? And then, you know, the stock languishes from there? I don't know. - Yeah, well, their IPO is done, and I guess it hasn't gone down. So in this market, they're up 25%, I think, since IPO. But I saw the FTC had opened an inquiry into it just to like investigate.
So I'm curious what the antitrust regulations are gonna be like when it comes to data. Obviously, acquisitions are blocked to prevent kind of like stifling competition. I wonder if for data, it will be similar, where, hey, you cannot actually get all of your data only behind $100 million plus contracts, because otherwise you're stopping any new company from building a competing product, so.
- Yeah, that's a serious overreach of the state there. - Yeah, yeah, yeah. - So as a free market person, I want to defend it. It is weird: I'm a free market person and I'm a content creator, right? So I want to be paid for my content. At the same time, I believe that, you know, people should be able to make their own decisions about all these deals.
But UGC is a weird thing, 'cause UGC is contributed by volunteers. - Yeah. - And the other big news about Reddit is that apparently they have added to their robots.txt, like only Google should index us, right? 'Cause we did the deal with Google. And that's obviously blocking OpenAI from crawling them, Anthropic from crawling them, you know, Perplexity from crawling them.
Perplexity maybe ignores all robots.txt, but that's a whole different other issue. And then the other thing is, I think this is big in the sort of normie world. The actors, you know, Scarlett Johansson had a very, very public Apple Notes takedown of OpenAI. Only Scarlett Johansson can do that to Sam Altman.
And then, you know, I was very proud of my newsletter for that day, I called it Skyfall, because of the voice of Sky. And, but it's true, like, that one, she can win. And there's very well-established case law there. And the YouTubers and the music industry, the RIAA, like the most litigious section of the creator economy, has gone after Udio and Suno, you know, Mikey from our podcast with him.
And it's unclear what will happen there, but it's gonna be a very costly legal battle for sure. - Yeah, I mean, music industry and lawsuits, name a more iconic duo, you know, so I think that's to be expected. - I think last time we talked about this, I was pretty optimistic that something like this would reach the Supreme Court.
And with the way that the Supreme Court is making rulings, like we just need a judgment on whether or not training on data is transformative use. So I think it is. Literally, we're using transformers to do transformative use. So then it's open season for AI to do it. And comparatively, the content creators and owners will lose out, they just will.
'Cause right now we're paying their money out of fear of lawsuits. If the Supreme Court rules that there are no lawsuits to be had, then all their money disappears. - I think people are probably scraping late in space and we're not getting a dime, so that's what it is.
- No, you can support with like an $8 a month subscription and that pays for our microphones and travel and stuff like that. Yeah, it's definitely not worth the amount of time we're putting into it, but it's a labor of love. - Yeah, exactly. - Synthetic data. - Yeah, I guess we talked about it a little bit before with Llama, but there was also the AlphaProof thing.
- Yes, just before I came here, I was working on that, et cetera. - Yeah, Google trained, almost got a gold medal. I forget what the-- - Yes, they're one point short of the gold medal. - Yeah, one point short of the gold medal. - It's a remarkable, I wish they had more questions.
So the International Math Olympiad has six questions and each question is seven points. Every single question that the AlphaProof model tried, it got full marks on; it just failed on two. And then the cutoff was sadly one point higher than that, but still, it was a very big deal. Like a lot of people have been looking at IMO as like the next gold prize, grand prize in terms of what AI can achieve, and betting markets and Eliezer Yudkowsky have updated, saying like, yeah, we're pretty close.
Like we basically have reached near gold medal status. We definitely reached silver and bronze status, and we'll probably reach gold medal next year, right? Which is good. There's also related work from Hugging Face on the NuminaMath competition. So this is on the AI Mathematical Olympiad, which is an easier version of the human Math Olympiad.
This is all like related research work on search and verifier model assisted exploration of mathematical problems. So yeah, that's super positive. I don't really know much else beyond that. Like it's always hard to cover this kind of news 'cause it's not super practical and it also doesn't generalize. So one thing that people are talking about is this concept of jagged intelligence.
'Cause at the same time, we're having this discussion about being superhuman. You know, one of the IMO questions was solved in 19 seconds after we gave the question to AlphaProof. At the same time, language models cannot determine if 9.9 is smaller than or bigger than 9.11. And part of that is the "9.11 is an inside job" joke, which is someone else's joke, but I really like it.
I don't know, and I really like that joke. But it's jagged intelligence. This is a failure to generalize because of tokenization or because of whatever. And what we need is general intelligence. We've always been able to train dedicated special models to win prizes and do stunts. But the grand prize is general intelligence.
That same model does everything. - Is it gonna work that way? I don't know. I think like if you look back a year and a half ago and you would say, "Can one model get to general intelligence?" Most people would be like, "Yeah, we can keep scaling." I think now it's like, is it gonna be more of a mix of models?
You know, like can you actually do one model that does it all? - Yeah, absolutely. I think GPT-5 or Gemini 3 or whatever would be much more capable at this kind of stuff while it also serves our needs with everyday things. It might be completely uneconomical. Like why would you use a giant ass model to do normal stuff?
But it is just a demonstration of proof that we can build super intelligence for sure. And then everything else follows from there. But right now we're just pursuing super intelligence. I always think about this, just reflecting on the GPU rich-poor stuff and now this AlphaGeometry stuff. I used to say you pursue capability first, then you make it more efficient.
You make the frontier model, then you distill it down to the 8B and 70B, which is what Llama 3 did. And by the way, also OpenAI did it with GPT-4o and then distilled it down to 4o mini. And then Claude also did it with Opus and then with 3.5 Sonnet, right?
That's a suitable recipe. In fact, I call it part of the deployment strategy of models. You train a base layer, you train a large one, and then you distill it down. You add structured output generation, tool calling and all that. You add the long context. You add like this standard stack of stuff in post-training that is growing and growing to the point where now OpenAI has opened a team for mid-training that happens before post-training.
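A minimal sketch of the distillation step in that recipe: a small student is trained to match the large teacher's output distribution with a soft-label KL loss. The temperature and sizes are placeholders, and the labs' real pipelines (synthetic data generation, rejection sampling, the rest of the post-training stack) are much richer than this.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student) on softened distributions, scaled by T^2 as is conventional
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)

vocab = 32000
student_logits = torch.randn(8, vocab, requires_grad=True)  # stand-in for the student model
teacher_logits = torch.randn(8, vocab)                      # stand-in for the frozen teacher
loss = distill_loss(student_logits, teacher_logits)
loss.backward()
```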
I think like one thing that I've realized from this AlphaGeometry thing is before you have capability and you have efficiency, there's an in-between layer of generalization that you need to accomplish. You need to do capability in one domain. You need to generalize it. Then you need to efficiencize it.
Then you have good models. - That makes sense. I think like maybe the question is how many things can you make it better for before generalizing it, you know? Yeah, I don't have a good intuition for that. - We'll talk about that in the next thing. Yeah, so we can skip Nemotron's worth looking at if you're interested in synthetic data.
Multimodal labeling, I think, has happened a lot. We'll jump to multimodal now. - Yeah, we got a bunch of news. Well, the first news is that GPT-4o voice is still not out even though the demo was great. I think they're starting to roll out the beta in the next week.
So I subscribed back to ChatGPT Plus. - You gave in? - I gave in because they're rolling it out next week. So you better be on the cutoff or you're not going to get it. - Nice bait by Sam Altman. - I said this, I said when I talked about the unbundling of ChatGPT, it's basically because they had nothing to offer people.
That's why people aren't subscribing because why keep paying $20 a month for this, right? But now they have proprietary models. Oh yeah, I'm back in, right? - We're so back. - We're so back, we're so back. I will pay $200 for the Scarlett Johansson voice, but you know, they'll probably get sued for that.
But yeah, the voice is coming. We had a demo at the World's Fair that was, I think the second public demo. Roman, I have to really give him a shout out for that. We had a few people drop out last minute and he was, he rescued the conference and worked really hard.
Like, you know, I think off the scenes, I think something that people don't understand is OpenAI puts a lot of effort into their presentations and if it's not ready, they won't launch it. Like he was ready to call it off if we didn't make the AV work for him.
And I think, yeah, they care about their presentation and how they launch things to people. Those minor polished details really matter. Just for the record, for people who don't understand what happened was, first of all, you can go see, just look for the GPT-4o talk at the AI Engineer World's Fair.
But second of all, because it was presented live at a conference with large speakers blaring next to you and it is a real-time voice thing. So it's listening to its own voice and it needs to distinguish between its own voice and between the human voice and it needs to ignore its own voice.
So we had OpenAI engineers tune that for our stage to make this thing happen, which is absurd. It was so funny, but also like, shout out to them for doing that for us and for the community, right? Because I think people wanted an update on voice. - Yeah, they definitely do care about demos.
Not much to add there. - Yeah. - Llama 3 voice? - Something that maybe is buried among all the Llama 3 news is that Llama 3 is supposed to be a multimodal model. It was delayed thanks to the European Union. Apparently, I'm not sure what the whole story there is. I didn't really read that much about it.
It is coming. Llama 3 will be multimodal. It uses adapters rather than being natively multimodal. But I think that it's interesting to see the state of Meta AI research come together, because there were these independent threads of Voicebox and Seamless Communication. These are all projects that Meta AI has launched that basically didn't really go anywhere because they were all one-offs.
But now all that research is being pulled into Llama 3, like Llama 3 is just subsuming all of FAIR, all of Meta AI, into this thing. And yeah, you can see Voicebox mentioned in the Llama 3 voice adapter. I was kind of bearish on conformers because I looked at the state of existing conformer research at ICML, ICLR, and NeurIPS, and they were far, far, far behind Whisper, mostly because of scale, like the sheer amount of resources that are dedicated.
But Meta is approaching there. I think they had 230,000 hours of speech recordings. I think Whisper is something like 600,000. So Meta just needs to 3X the budget on this thing and they'll do it. And we'll have open source voice. - Yeah, and then we can hopefully fine tune on our voice and then we just need to write this episode instead of actually recording it.
- I should also shout out the other thing from Meta, which is a very, very big deal, which is Chameleon, which is a natively early fusion vision and language model. So most things are late fusion, basically. Like you freeze an existing language model, you freeze an existing vision transformer, then you kind of fuse them with an adapter layer.
That is what Llama 3 is also doing. But Chameleon is slightly different. Chameleon is interleaving in the same way that IDEFICS, the sort of dataset, is doing, interleaving natively for image generation and vision and text understanding. And I think like once that is better understood, that is going to be better.
That is the more deep learning-built version of this, the more GPU rich version of doing all this. I asked Yi Tay this question about Chameleon in his episode; he did not confirm or deny, but I think he would agree that that is the right way to do multimodality. And now that we're proving out that multimodality is valuable to people, basically all these half-assed measures around adapters are going to flip to natively multimodal.
To me, that's what GPT-4o represents. It is the trained-from-scratch, fully omnimodal model, which is early fusion. So if you want to read that, you should read the Chameleon paper, basically. That's my whole point. - And there was some of the Chameleon drama because the open model doesn't have image generation.
And then there were fine-tuning recipes. - It's so funny. The leads were like, "No, do not follow these instructions to fine tune image generation." - That's just really funny. I don't know what the... Okay, so yeah, whenever image generation is concerned, obviously because of the Gemini issue, it's very tricky for large companies to release that, but they can remove it, say that they removed it, point out exactly where they removed it, and let the open source community put it back in.
(laughs) The last piece I had, which I kind of deleted, was there's a special mention, honorable mention of Gemma again with PaliGemma, which is one of the smaller releases from Google I/O. I think you went, right? So PaliGemma was mentioned in there? I don't know. It was one of the...
- Yeah, yeah, one of the workshops. - Very, very small release. But ColPali is now being talked about a lot as a late fusion model for extracting structured text out of PDFs. Very, very important for business work. - Yeah, I know. - Workhorses. - Yes. - So apparently it is doing better than Amazon Textract and all the other state-of-the-art.
And it's a tiny, tiny model that does this. And it's really interesting. It's a combination of Omar Khattab's ColBERT retrieval approach on top of a vision model. I was severely underestimating PaliGemma when it came out, but it continues to come up. There's a lot of trends. And again, this is making a lot of progress here, just in terms of their applications in real-world use cases.
These are small models, but they're very, very capable and they're a very good basis to build things like ColPali. - Yeah, no, Google has been doing great. I think maybe a lot of people initially wrote them off, but between, you know, some of the Gemini Nano stuff, like Gemma 2, PaliGemma.
We'll talk about some of the KV cache and context caching. - Yeah, yeah, that's a RAG killer. - So there's a lot to like. And our friend Logan is over there now, so. He's excited about everything they got going on, so yeah. - I think there's a little bit of a fight between AI Studio and Vertex.
And what Logan represents is, so he's moved from DevRel to PM, and he was PM for the Gemma 2 launch. Vertex has this reputation of being extremely hard to use. It's one reason why GCP has kind of fallen behind a little bit. And so AI Studio represents like the developer-friendly version of this, like the Netlify or Vercel to the AWS, right?
And I think it's Google's chance to reinvent itself for this audience, for the AI engineer audience that doesn't want like five levels of auth IDs and org IDs and policy permissions just to get something going. - True, true. Yeah, we want to jump into RAG Ops Wars. - What to say here?
I think what RAG Ops Wars are to me is like the tooling around the ecosystem. And I might need to actually rename this war. - War renaming alert, what are we calling it? - The LLM OS. - LLM OS. - Because it used to be when the only job for AIs to do was chatbots.
Then RAG matters, then Ops matters. But now we need AIs to also write code. We also need AIs to work with other agents, right? That's not reflected in any of the other wars. So I think that just the whole point is what does an LLM plug into with the broader ecosystem to be more capable than an LLM can be on its own?
- Yeah. - I just announced it, but this is something I've been thinking about a lot. It's a blog post I've been working on. Basically, my tip to other people is if you want to see where things are going, you go open up the ChatGPT GPT creator. Every single button on the GPT creator is a potential startup.
Exa is for search. The knowledge RAG thing is for RAG. - Yeah, we invested in e2b. - Yeah, congrats. Is that announced? I don't know if you- - It's announced now. By the time this goes out, it'll be. - Briefly, what is e2b? - So e2b is basically a code interpreter SDK as a service.
So you can add a code interpreter to any model. They partnered with Mistral to add that in. They have this open source Claude Artifacts clone using e2b. It's a, I mean, the amount of like traction that they've been getting in open source has been amazing. I think they went in like four months from like 10K to a million containers spun up on the cloud.
So, I mean, you told me this maybe like nine months ago, 12 months ago, something like that. You were like, well, you literally just said every ChatGPT plugin can be- - A business, a startup. - Can be a business, a startup. And I think now it's more clear than ever that the chatbots are just kind of like the band-aid solution, you know, before we build more comprehensive systems.
And yeah, Exa just raised a Series A from Lightspeed, so. - I tried to get you in on that one as well. - Yeah, yeah, no, I read that. - I'm trying to be a scout, man. I don't know. - So yeah, as an early stage VC, giving capabilities to the models is like way more important than the actual LLM ops, you know, the observability and like all these things, like those are nice, but like the way you build real value for a lot of the customers, it's like, how can this model do more than just chat with me?
So running code, doing analysis, doing web search. - I might disagree with you. I think they're all valuable. - Yeah, well. - They're all valuable. So I would disagree with you just on like, I find ops my number one problem right now building Smol Talk and building AI News, building anything that I do.
And I don't think I'm happy with any of the ops solutions that I've explored. There are some 80-something ops startups. - Right. - I nearly, you know, started one of them, but we'll briefly talk about this ops thing and then we'll go back to RAG. The central way I explain this thing to people is that all the model labs view their job as stopping at serving you their model over an API, right?
That is unfortunately not everything that you need in order to productionize this API. So obviously there's all these startups. They're like, yeah, we are ops guys. We've done this for 30 years. We will now do this for AI. And 80 of them show up and they all raise money.
And the question is like, what do you actually need as sort of an AI native ops layer versus what is just plugged into Datadog, right? I don't know if you have dealt with that because I'm not like a super ops person, but I appreciate the importance of this thing.
I think there's three broad categories, which is frameworks, gateways, and monitoring or tracing. We've talked to like, I interviewed Humanloop in London and you've talked to a fair share of them. I've talked to a fair share of them. So the frameworks would be, honestly, I won't name the startup, but basically what this company was doing was charging me $49 a month to store my prompt template.
And every time I make an inference, it would F-string call the prompt template on some variables that I supply. And it's charging $49 a month for unlimited storage of that. It's absurd, but like people want prompt management tools. They want to interoperate between PM and developer. There's some value there.
I don't know what the right price is. - Yeah. - There's some price. - I was at, I'm sure I can share this. I was at the Grab office and they also treat prompts as code, but they build their own thing to then import the prompts. - Yeah, but I want to check prompts into my code base as a developer, right?
But maybe, do you want it outside of the code base? - Well, you can have it in the code base, but like, what's like the prompt file? What's like, you know, it's not just a string. - It's string and model and config. - Exactly, how do you pass these things?
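As a concrete sketch of the "string and model and config" idea, here is roughly what checking a prompt into the code base could look like. The names (PromptSpec, render) and defaults are made up for illustration, not any particular framework's API.

```python
from dataclasses import dataclass

@dataclass
class PromptSpec:
    """A prompt checked into the code base: template string plus model and config."""
    template: str                 # prompt text with {placeholders}
    model: str = "gpt-4o-mini"    # which model the prompt was written against
    temperature: float = 0.0      # sampling config travels with the prompt
    max_tokens: int = 512

    def render(self, **variables) -> str:
        # the "$49/month f-string call": just substitute the variables
        return self.template.format(**variables)

summarize_ticket = PromptSpec(
    template="Summarize the following support ticket in two sentences:\n\n{ticket}"
)

if __name__ == "__main__":
    print(summarize_ticket.render(ticket="My invoice was charged twice this month."))
```

The point is simply that the prompt text, the model name, and the sampling settings version together in git, whatever tooling sits on top.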
But I think like the problem with building frameworks is like frameworks generalize things that we know work. And like right now we don't really know what works. - Yeah, but some people have to try, you know, in the whole point of early stages, you try it before you know it works.
- Yeah, but I think, if you look at the past, the most successful open source frameworks that became successful businesses were built inside companies and then kind of spun out as projects. So I think it's more about ordering. - Vertical-pilled instead of horizontal-pilled. (laughs) - I mean, we try to be horizontal-pilled, right?
And it's like, where are all the horizontal startups? - There are a lot of them. They're just not that, they're not going to win by themselves. I think some of them will win by sheer excellent execution. And then, but like the market won't pull them. They will have to pull the market.
- Oh, but that's the thing. It's like, you know, take like Julius, right? It's like, "Hey, why are you guys doing Julius?" It's like the same as Code Interpreter. And yet they're pretty successful. A lot of people use it because they're like solving a problem. And then- - They're more dedicated to it than Code Interpreter.
- Exactly. So it's like, I think- - Just take it more seriously than (indistinct) - I think people underestimate how important it is to be very good at doing something versus trying to serve everybody with some of these things. So, yeah, I think that's a learning that a lot of founders are having.
- Yes. Okay, so to round out the Ops world. So it's a three circle Venn diagram, right? It's frameworks, it's gateways. So the only job of the gateway is to just be one endpoint that proxies all the other endpoints, right? And it normalizes the APIs, mostly to OpenAI's API, just because most people started with OpenAI.
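A minimal sketch of that gateway idea: one chat() entry point that routes by model name and speaks the OpenAI-shaped /chat/completions request everywhere. The routing table is hypothetical; a real gateway also has to translate providers that do not natively speak the OpenAI shape, plus handle retries, rate limits, and key management.

```python
import os
import requests  # assumes `pip install requests`

# Hypothetical routing table: model-name prefix -> (base URL, API key env var).
# The "local-" entry assumes an OpenAI-compatible local server.
PROVIDERS = {
    "gpt-":   ("https://api.openai.com/v1", "OPENAI_API_KEY"),
    "local-": ("http://localhost:8000/v1",  "LOCAL_API_KEY"),
}

def chat(model: str, messages: list[dict]) -> str:
    """One endpoint that proxies the others, normalized to OpenAI's chat format."""
    for prefix, (base_url, key_env) in PROVIDERS.items():
        if model.startswith(prefix):
            resp = requests.post(
                f"{base_url}/chat/completions",
                headers={"Authorization": f"Bearer {os.environ.get(key_env, '')}"},
                json={"model": model, "messages": messages},
                timeout=60,
            )
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]
    raise ValueError(f"no provider configured for {model!r}")

# example: chat("gpt-4o-mini", [{"role": "user", "content": "hello"}])
```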
And then lastly, it's monitoring and tracing, right? So logging those things, understanding the latency, like P99 or whatever, and like the number of steps that you take. So LangSmith is obviously very, very early on to this stuff. But so is Langfuse. So is, oh my God, like there's so many.
I'm sure like Datadog has some like- - Yeah, yeah. - Weights & Biases has some, you know. It's very hard for me to choose between all those things. So I, as a small team developer, want one tool that does all these things. And my discovery has been that there's so much specialization here.
Like everyone is like, oh yeah, we do this, but we don't do that. For the other stuff, we recommend these two other friends of ours. And I'm like, why am I integrating four tools when I just need one? They're all the same thing. That is my current frustration. The obvious frustration solution is I build my own, right?
Which is, you know, we have 14 standards, now we have 15. So it's just a very messy place to be in. I wish there was a better solution to recommend to people because right now I cannot clearly recommend things. - Yeah, I think the biggest change in this market is like latency is actually not that important anymore.
Like we lived in the past 10 years in a world where like 10, 15, 20 milliseconds made a big difference. I think today people will be happy to trade 50 milliseconds to get higher quality output from a model. So, but still all the tracing is all like, how long did it take?
Like, what's the thing? Instead of saying, is this quality good for this output? Like, should you use another model? Like, we're just kind of taking what we did with cloud and putting it in LLMs instead of saying what actually matters when it comes to LLMs, what you should actually monitor.
Like, I don't really care what my P99 is if the model is crap, right? It's like, also, I don't own most of the models. So it's like, this is the GPT-4 API performance. It's like, okay, in the moment, I can't do anything about it, you know?
So I think that's maybe why the value is not there. Like, you know, am I supposed to pay 100K a year, like I pay Datadog or whatever, for you to tell me that GPT-4 is slow? It's like, you know, I don't know.
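To make that concrete, a tracing record can carry a quality score next to the latency, so the dashboard answers "is the output good" and not just "was it fast". A minimal sketch; the judge function is a placeholder for whatever eval you trust (an LLM judge, an assertion suite, human labels), not any vendor's API.

```python
import time
from dataclasses import dataclass, asdict

@dataclass
class TraceRecord:
    model: str
    latency_ms: float
    quality: float   # 0..1, from whatever eval you trust
    prompt: str
    output: str

def judge(prompt: str, output: str) -> float:
    """Placeholder eval: swap in an LLM judge, assertions, or human labels."""
    return 1.0 if output.strip() else 0.0

def traced_call(model: str, prompt: str, call_fn) -> TraceRecord:
    start = time.perf_counter()
    output = call_fn(prompt)                       # your actual model call goes here
    latency_ms = (time.perf_counter() - start) * 1000
    record = TraceRecord(model, latency_ms, judge(prompt, output), prompt, output)
    print(asdict(record))                          # ship to your tracing backend instead
    return record

if __name__ == "__main__":
    traced_call("stub-model", "Say hi", lambda p: "hi")
```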
- I agree, it's challenging there. Okay, so the last piece I'll mention is briefly, ML Ops is still real. I think LLM Ops, or whatever you call this, AI Engineer Ops, the Ops layer on top of the LLM layer might follow the same evolution path as the ML Ops layer.
And so the most impressive thing I've seen from the ML Ops layer is from Apple. When they announced Apple Intelligence, they also announced Talaria, which is their internal ML Ops tool, with which you can profile the performance of each layer of a transformer. And you can A/B test like a hundred different variations of different quantizations and stuff and pick the best-performing one.
And I could see a straight line from there to like, okay, I want this, but for my AI Engineering Ops, like I want this level of clarity on what I do. And there's a lot of internal engineering within these big companies that takes their ML training very seriously.
And I see that also happening for AI Engineering as well. And let's briefly talk about RAG and context caching, maybe, unless you have other LLM OS stuff that you're excited about. - LLM OS stuff I'm excited about. No, I think a lot of it is moving beyond just being observability, or help for making the prompt call, and actually being an LLM OS, you know?
I think today it's mostly like LLM rails, you know? Like there's no OS, but I think it's actually helping people build things. That's why, you know, if you look at Exa or E2B, it's like, that's the OS, you know? Those are kind of like the OS primitives that you need around it.
- Yeah, okay. So I'll mention a couple of things then. One layer I've been excited about publicly, but I haven't talked about it on this podcast is memory databases, memory layers on top of vector databases. The Vogue thing of last year was vector databases, right? Everybody had a vector database company.
And I think the insight is that vector databases are too low level. Like they're not very useful out of the box. They do cosine similarity matching and retrieval, and that's about it. We'll briefly maybe mention here BM42, which was this whole debate between Vespa and who else? Qdrant. And I think a couple of other companies also chipped in, but it was mainly a very, very public and ugly theater battle over benchmarking for databases.
And the history of benchmarking for databases goes as far back as Larry Ellison and Oracle and all that. It's just very cute to see it happening in the vector database space. Some things don't change. But on top of that, I think one of the reasons I put vector databases inside of these wars is in order to grow, the vector databases have to become more frameworks.
In order to grow, the ops companies have to become more like frameworks, right? And then the framework companies have to become ops companies, which is what LangChain is. So I've been looking for what the next direction of vector database growth is, and one of them is memory, long conversation memory.
I have on me this Bee, which is one of the personal AI wearables. I'm also getting the Limitless personal AI wearable, which is like, I just want to record my whole conversations and have them repeated back to me, or let me search them, augment my memory. I'm sure Character AI has some version of this.
Like everyone has conversation memory that is different from factual memory. And right now, vector database is very oriented towards factual memory, document retrieval, knowledge-based retrieval, but it's not the same thing as conversation retrieval, where I need to know what I've said to you, what I said to you yesterday, what I said to you a year ago, three years ago.
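One way to sketch that difference: conversational recall blends plain cosine similarity with recency (and could add speaker, session, and other scopes), instead of treating every memory like a static document. A toy version, with hand-rolled two-dimensional embeddings standing in for a real embedding model.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Memory:
    text: str
    speaker: str
    timestamp: datetime
    embedding: list[float]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def recall(query_emb, memories, now, half_life_days=30.0, top_k=3):
    """Rank by a blend of semantic similarity and recency, so yesterday beats last year."""
    def score(m: Memory) -> float:
        age_days = (now - m.timestamp).total_seconds() / 86400
        recency = 0.5 ** (age_days / half_life_days)   # exponential decay with age
        return 0.7 * cosine(query_emb, m.embedding) + 0.3 * recency
    return sorted(memories, key=score, reverse=True)[:top_k]

if __name__ == "__main__":
    now = datetime.now()
    memories = [
        Memory("I prefer window seats", "user", now - timedelta(days=1),   [1.0, 0.0]),
        Memory("I prefer aisle seats",  "user", now - timedelta(days=400), [1.0, 0.0]),
    ]
    print(recall([1.0, 0.0], memories, now)[0].text)  # the recent preference wins
```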
And it's a different nature of retrieval, right? So at the conference that we ran, GraphRAG was a big focus for people, the marriage of knowledge graphs and RAG. I think that this is commonly a trap in ML, that people discover graphs are a thing for the first time.
They're like, oh yeah, everything's a graph. Like the future is graphs and then nothing happens. Very, very common. This happened like three, four times in the industry's past as well. But maybe this time is different. - Maybe. (laughs) Unless. - Unless. (laughs) So, this is a fun, this is why I'm not an investor.
Like you have to call the time when this time is actually different, because no ideas are really truly new, but sometimes this time is different. (laughs) - Maybe. - And so memory databases are one form of that, where they're focused on the problem of long form memory for agents, for assistants, for chatbots and all that.
I definitely see that coming. There were some funding rounds that I can't really talk about in this sector, and I've seen that happen a lot. Yeah, I have one more category in the LLM OS, but any comments on-- - Yeah, no, I think that makes sense to me, that moving away from just semantic similarity, I think, is the most important, because people use the same word with very different meanings, especially when talking.
When writing, it's different, but yeah. - Yeah, the other direction that vector databases have gone into, which LanceDB presented at my conference, was multimodality. So Character AI uses LanceDB for multimodal embeddings. That's just a minor difference. I don't think that's like a quantum leap in terms of what a vector database does for you.
The other thing that I see in the LLM OS world is mostly the evolution of just the ecosystem of agents, the agents talking to other agents and coordinating with other agents. So I interviewed Graham Neubig at ICLR and he has since announced that they are pivoting OpenDevin, or broadening OpenDevin, into All Hands AI.
I'm not sure about that name, but it is one of the three LLM OS startups that got funded in the past two months that I know about, and maybe you know more. They're all building this ecosystem of agents, working with other agents, and all this tooling for agents. To me, it makes more sense.
It is probably the biggest thing I missed in doing the four wars. The need for startups to build this ecosystem thing up. So the big categories have been taken. Search, done. Code interpreter, done. There's a long tail of others. So memory is emerging, then there's like other stuff. And so they're focusing on that.
To me, browser is slightly different from search, and Browserbase is another company I invested in that is focused on that, but they're not the only one in that category by any means. I used to tell people, go to the Devin demo and look at the four things that they offer, and each of those things is a startup.
Devin, since then, they spoke at the conference as well. Scott was super nice to me and actually gave me some personal time as well. They have an updated chart of their plans. Look at their plans. They have like 16 things. Each of those things is a potential startup now.
And that is the LLM OS. Everyone's building towards that direction because they need it to do what they need to do as an agent. If you believe in the agents' future, you need all these things. - Yeah. You think the LLM OS is its own company? Do you think it's an open standard?
Do you think? - I would love it to be an open standard. The reality is that people want to own that standard. So, we actually wound down the AI Engineer Foundation, whose first project was the Agent Protocol, which E2B actually donated to the foundation, because no one's interested. Everyone wants to be VC-backed and they want to own it, right?
So, it's too early to be open source. People will keep this proprietary and more power to them. They need to make it work. They need to make revenue before all the other stuff can happen. - Yeah. I'm really curious. We're investors in a bunch of agent companies. None of them really care about how to communicate with other agents.
They're so focused internally, you know? But I think in the future, you know, it talks about this- - I see, you're talking about agent to other external agents. - Yeah, so I think- - I'm not talking about that. - Yeah, I wonder when, like, because that's where the future is going, right?
So, today it's all internal agent connectivity, you know? At some point, it's like, well, somebody, I'm selling into a company and the company already uses agent X for that job. I need to talk to that agent, you know? But I think nobody really cares about that today.
So I think that's usually it. - Yeah, so I think that layer right now is OpenAPI. Just give me a RESTful protocol, I can interoperate with that. A RESTful protocol only does request-response. So then the next layer is something I have worked on, which is long-running request-response, which is workflows, which is what Temporal was supposed to do before, let's just say, management issues.
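A minimal sketch of the long-running request-response shape: submit work, get an id back immediately, poll for status. This is the in-process toy version of what a workflow engine does with durability and retries; the function names are hypothetical, not Temporal's API.

```python
import threading
import time
import uuid

_JOBS: dict[str, dict] = {}

def submit(task_fn, *args) -> str:
    """The 'POST /jobs' half: kick off long-running work, return an id right away."""
    job_id = str(uuid.uuid4())
    _JOBS[job_id] = {"status": "running", "result": None}

    def run():
        _JOBS[job_id]["result"] = task_fn(*args)
        _JOBS[job_id]["status"] = "done"

    threading.Thread(target=run, daemon=True).start()
    return job_id

def status(job_id: str) -> dict:
    """The 'GET /jobs/{id}' half: poll until status flips to done."""
    return _JOBS[job_id]

if __name__ == "__main__":
    def slow_report():
        time.sleep(1)
        return "report ready"

    job = submit(slow_report)
    while status(job)["status"] != "done":
        time.sleep(0.2)
    print(status(job)["result"])
```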
Yeah, but like, you know, RPC or some kind of, you know, I think that the dream is, and this is one of my problems with the LLM OS concept, do we really need to rewrite every single thing for AI-native use cases? Shouldn't the AI just use these things, these tools, the same way as humans use them?
Reality is, for now, yes, they need specialized APIs. In the distant future, when these things cost nothing, then they can use them the same way humans do, but right now they need specialized interfaces. The layer between agents ideally should just be English, you know, like the same way that we talk, but English is too under-specified and unstructured to make that happen, so.
- It's interesting because we talk to each other in English, but then we both use tools to do things to then get the response back. - For those people who want to dive in a little bit more, I think AutoGen, I would definitely recommend looking at that, Crew AI.
There are established frameworks now that are working on inter-agent communication layers to coordinate them, and not necessarily externally from company to company, just internally as well. If you have multiple agents farming out work to do different things, you're going to need this anyway. And I don't think it's that hard.
They are using English. They're using some mix of English and structured output. And yeah, if you have a better idea than that, let us know. - Yeah, we're listening. - So that's the four wars discussion. I think I want to leave some discussion time open for miscellaneous trends that are happening in the industry that don't exactly fit in the four wars or are a layer above the four wars.
So the first one to me is just this trend of open source. Obviously this overlaps a lot with the GPU poor thing, but I want to really call out this depreciation thing that I've been working on. Like I do think it's probably one of the bigger theses that I've had in the past month, which is that we now have a rough idea of the deprecation schedule of this sort of model spend.
And I basically drew a chart. I'll link it in the show notes, but I drew a chart of the price-efficiency frontier as of March, April 2024. And then I had listed all the models that sit within that frontier. Haiku was the best cost per intelligence at that point in time.
And then I did the same chart in July, two days ago, and the whole thing has moved. And Mistral is like deprecating their old models that used to be in the old frontier. It is so shocking how predictive and tight this band is. Very, very tight band, and the whole industry is moving the same way.
And it's roughly one order of magnitude drop in cost for the same level of intelligence every four months. My previous number for this was one order of magnitude drop in cost every 12 months. But the timeline accelerated because at GPT-3, it took about a year to drop order of magnitude.
But now GPT-4, it's really crazy. I don't know what to say about that, but I just want to note it. - Do you think GPT-Next and Claude 4 push it back down because they're coming out with higher intelligence, higher cost? Or is it maybe like the timeline is going down because new frontier models are not really coming out at the same rate?
- Interesting. I don't know. That's a really good question. Wow, I'm stumped. I don't have-- - You're like, "Wow, you got a good question." - Yeah, I don't have an answer. No, I mean, you have a good question, but I thought I had solved this, and then now you came along with it.
The first response is something I haven't thought about. Yeah, yeah. So there's two directions here, right? When the cost of frontier models is going up, potentially like SB1047 is going to make it illegal to train even larger models. For us, I think the opposition has increased enough that it's not going to be a real concern for people.
But I think every lab basically needs a small, medium, large play. And like we said, in the sort of model deployment framework, first you pursue capability, then you pursue generalization, then you pursue efficiency. And what we're talking about here is efficiency. Now we care about efficiency. That's definitely one of the emergent stories of the year, that efficiency matters for GPT-4o, GPT-4o mini, and 3.5 Sonnet in a way that in January, nobody was talking about.
And that's great. - Yeah. - Regardless of GPT-Next and Claude 4 or whatever, or Gemini 2, we will still have efficiency frontiers to pursue. And it seems like doing the higher capable thing creates the synthetic data for us to do the efficient thing. And that means lifting up the, like I had this difference chart between Llama 3 8B and Llama 3 70B versus their 3.1 differences.
And the 8B had the most uplift across all the benchmarks. Right, it makes sense. You're training from the 405B, you're distilling from there, and it's going to have the biggest lift up. So the best way to train more efficient models is to train the large model. - Right, yeah, yeah.
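The transcript's framing is synthetic data from the big model; the classic logit-distillation version of "lift the small model with the large one" looks roughly like this (assumes PyTorch). Shapes, temperature, and the 50/50 weighting are illustrative, not Llama's actual recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL against the teacher's softened logits."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)   # standard temperature-squared scaling
    return alpha * hard + (1 - alpha) * soft

# Toy shapes: a batch of 4 "tokens" over a 10-word vocabulary.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)   # would come from the big, frozen model
labels = torch.randint(0, 10, (4,))

loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```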
- And then you can distill down to the rest. So this is fascinating from an investor point of view. You're like, okay, you're worried about picks and shovels, you're worried about investing in foundation model labs. And that's a matter of opinion. I do think that some foundation model labs are worth investing in because they do pay back very quickly.
I think for engineers, the question is, what do you do when you know that your base cost is going down an order of magnitude every four months? How do you make those assumptions? And I don't know the answer to that. I'm just posing the question. I'm calling attention to it.
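One way to make that assumption explicit is to write the decay down and see what it implies for a budget. A back-of-envelope sketch; the $10,000 per month figure is a placeholder.

```python
def projected_cost(cost_today: float, months_ahead: float, months_per_10x: float = 4.0) -> float:
    """Cost for the same level of intelligence if prices drop 10x every `months_per_10x` months."""
    return cost_today * 10 ** (-months_ahead / months_per_10x)

# A workload that costs $10,000/month today (placeholder figure):
for months in (0, 4, 8, 12):
    print(f"{months:2d} months out: ${projected_cost(10_000, months):,.2f}/month")
# 0 -> $10,000.00, 4 -> $1,000.00, 8 -> $100.00, 12 -> $10.00
```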
Because I think that Cognition is, rumor has it, burning a lot of money. I know nothing from Scott, I haven't talked to him at all about this, even though he's very friendly. But they did that, they got the media attention, and now the cost of intelligence is going down. And it will be economically viable tomorrow.
In the meantime, they have a crap ton of value from user data, and a crap ton of value from media exposure. And I think that the correct stunt to pull is to make economically non-viable startups now, and then wait. But honestly, I'm basically advocating for people to burn VC money.
- Yeah, they can burn my money all they want if they're building something useful. I think the big problem, not a problem, but the thing is, the price of the model comes out, and then people build on it. And then the model providers don't really have a lot of leverage on keeping the price high.
They just have to bring it down because the people downstream of them are not making that much money with them. And I wonder what's gonna be the model where it's like, this model is so good, I'm not putting the price down. Like if GPT-4o was amazing and was actually creating a lot of value downstream, people would be happy to pay.
I think people today are not that happy with the models. They're good, but I'm not paying that much because I'm not really getting that much out of it. Like we have this AI center of excellence with a lot of the Fortune 500 groups, and there are people saving 10, 20 million a year like with these models doing boring stuff, like document translation and things like that, but nobody's making 100 million.
Nobody's making 150 million. So the prices just have to go down, but maybe that will change at some point. - Yeah, I always map temperature to use cases, right? There are temperature zero use cases where you need precision, and then there are the high-temperature use cases where you need creativity. What are the cases where hallucination is a feature, not a bug, right?
So we're the first podcast to interview WebSim, and I'm still pretty positive about the generative part of AI. Like we took generative AI and we used it to do RAG. We have an infinite creativity engine. Let's go do more of that. So we'll hopefully do more episodes there. You have some stuff on agents you wanna- - Yeah, no, I think this is something that we talked a lot about, and we wrote this post months and months ago about shifting from software as a service to services as software.
And that's only more true now. I think like most companies that are buying AI tooling, they want the AI to do some sort of labor for them. And that's why the picks and shovels kind of disinterest maybe comes from a little bit. Most companies do not wanna buy tools to build AI.
They want the AI, and they also do not want to pay a lot of money for something that makes employees more productive because the productivity gains are not accruing to the companies. They're just accruing to the employees. People work less, have longer lunch breaks because they get things done faster.
But most companies are not making a lot more money by making employees productive. That's not true for startups. So if you look at most startups today in AI, they're much smaller teams compared to before. And on the agents side, we have companies like Brightwave, which we had on the podcast. You're selling labor, which is something that people are used to paying for on a certain pay scale.
So when you're doing that, if you ask Brightwave, they don't have it public, but they charge a lot of money, more than you would expect because hedge funds and like investment banking, investment advisors, they're used to paying a lot of money for research. It's like the labor, they don't even care that you use AI.
They just want labor to be done. - I'll mention one pushback, but as a hedge fund, we used to pay for analyst research out of our brokerage cost and not read them. To me, that's my risk of Brightwave, but you know. - No, but I think the- - As a consumer of research, I'm like- - If we want to go down the rabbit hole, there's a lot of pressure on funds for like a OPEX efficiency.
So there aren't really captive researchers anymore at most funds. And like even the sell side research is not that good. - Taking them from in-house to external thing. - Yeah. - Yeah, that makes sense. - So yeah, you know, we have Dropzone that does security analysis. Same thing, people are used to paying for managed security or like outsourced SOC analysts.
They don't want to buy an AI tool to make the security team more productive. So- - Okay, and what specifically does Dropzone do? - They do SOC analysis. So not SOC like the compliance, but it's like when you have security alerts, how do you investigate them? So large enterprises, they get like thousands of phishing emails, and then they forward them to IT, and the IT or security person at tier zero has to go in and say, that's a phishing email, that one isn't, that one isn't.
So they have an agent that does that. So the cost, like for a human to do the analysis at the rate that they get paid, it's like $35 per alert. Dropzone is like $6 per alert. So it's a very basic economic analysis for the company, whether or not they want to buy it.
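That per-alert pitch reduces to one line of arithmetic. The per-alert figures are the ones quoted in the conversation; the alert volume below is a made-up number for illustration.

```python
human_cost_per_alert = 35.0    # quoted cost for a human analyst to triage one alert
ai_cost_per_alert = 6.0        # quoted Dropzone cost per alert
alerts_per_month = 5_000       # hypothetical volume for a large enterprise

monthly_savings = (human_cost_per_alert - ai_cost_per_alert) * alerts_per_month
print(f"${monthly_savings:,.0f} saved per month")  # $145,000 with these assumptions
```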
It's not about, is my analyst going to have more free time? Like, is it more productive? So selling the labor is like the story of the market right now. - My version of this is I should start a consulting services today and then slowly automate myself, my employees out of a job, right?
Is that fundable? - Is that fundable? That's a good question. I think whether or not it is depends on how big you want it to be. - This is a services company, basically. - Yeah, that's, I mean, that's what, I know now it's maybe not as good of an example, but CrowdStrike started as a security research company.
- Yeah, I mean, it's still one of the most successful companies of all time. - Yeah, yeah, yeah. - Yeah, it's an interesting model. I'm always checking my biases there. Anything else on the agents side of things? - No, that's really something that people should spend more time on.
It's like, what's the end labor that I'm building? Because, you know, sometimes when you're being too generic and you want to help people build things, like Adept, you know, David was on the podcast and he said they were sold out of things, but they're kind of like-- - And then he sold out himself.
- Yeah, it's like, they're working with each company and the company has to invest the time to build with them. - Yeah, you need more hands-off. - Exactly. - Yeah. - So, and that's more verticalized. - Yeah, yeah. I'll shout out here, Jason Liu, he was also on a podcast and spoke at the conference.
He has this idea of, it's reports, not RAG. You want things to produce reports, because reports can actually get consumed. RAG is still too much work, still too much chatbotting. I'll briefly mention the new benchmarks I'm thinking about. I think everyone studying AI research, understanding the progress of AI and foundation models, needs to have in mind what is next after MMLU.
I have 10 proposals. Most of them, half of them come from the Hugging Face episode. So everyone's loving Clementine. I want her back on. And she was amazing and very charismatic, even though she made us take down the YouTube. But MuSR for multi-step reasoning, MATH for math, IFEval for instruction following, BIG-Bench Hard.
And code, we're now getting to the area that the Hugging Face leaderboard does not have. And I'm considering making my own 'cause I care about this so much. So MBPP is the current one that is post-HumanEval, 'cause HumanEval is widely known to be saturated. And SciCode is like the newest one that I would point people to.
Context utilization, we had Mark from Gradient on to talk about RULER, but also ZeroSCROLLS and InfiniteBench were the two that Llama 3 used instead of RULER. But basically, something that's a little bit more rigorous than needle in a haystack, that is something that people need. Then you have function calling.
Here, I think Gorilla, API-Bank, Nexus, pretty consensus; I've done nothing there apart from, yeah, like all models need something like this. Then multimodality, where vision is the most important. I think Vibe-Eval is actually the state of the art here. I, you know, am open to being corrected, and then multilinguality.
So basically, like these are the 10 directions, right? Post-MMLU, here are the frontier capabilities. If you're developing models, or if you're encountering a new model, evaluate them on all these elements, and then you have a good sense of how state of the art they are and what you need them for in terms of applying them to your use case.
So I just want to get that out there. - Yeah, and we have the ARC-AGI thing. How do you think about benchmarking for, you know, everyday things, or benchmarking for something that is maybe a hard to reach goal? - Yeah, this has been a debate for, that's obviously very important and probably more important for product usage, right?
Here, I'm talking about benchmarking for general model evals. And then there's a schism in, or a criticism of, the AI engineering community, that it did not care enough about product evals. So Hamel Husain led that, and I had a bit of disagreement with him, but I acknowledge that, I think that it's important.
There was an oversight in my original AI engineer post. So the job of the engineer is to produce product-specific evals for your use case. And there's no way that these general academic benchmarks are going to do that, because they don't know your use case. That's not to say they're unimportant. They will correlate with your use case, and that is a good sign, right?
These are very, very rigorous and thought through. So you want to look for correlates, then you want to look for specifics. And that's something that only you can do. So yeah, ARC-AGI will correlate with IQ. It's an IQ test, right? How well does an IQ test correlate to job performance?
5%, 10%, not nothing, but not everything. And so it's important. - Anything else? - Superintelligence. We can, you know, we try not to talk about safety. My favorite safety joke from our dinner is that, you know, if you're worried about agents taking over the world and you need a button to take them down, just install CrowdStrike on every agent.
And you have a button that has just been proved at the largest scale in the world to disable all agents, right? So safe superintelligence, you should just install CrowdStrike. That's what Ilya Sutskever should do. - That's funny, except for the CrowdStrike people. Awesome, man, this was great. I'm glad we did it.
I'm sure we'll do it more regularly now that you're out of visa jail. - Yeah, yeah. I think, you know, AI News is surprisingly helpful for doing this. - Yeah. - Yeah. I had no idea when I started. I just thought I needed a thing to summarize discords, but now it's becoming a proper media company.
Like a thousand people sign up every month. It's growing. - Cool. Thank you all for listening. - Yeah. - See you next time. - Bye. (upbeat music)