
Why AI Agents Don't Work (yet) - with Kanjun Qiu of Imbue


Chapters

0:00 Introductions
7:13 The origin story of Imbue
11:26 Imbue's approach to training large foundation models optimized for reasoning
14:20 Imbue's goals to build an "operating system" for reliable, inspectable AI agents
17:51 Imbue's process of developing internal tools and interfaces to collaborate with AI agents
19:47 Imbue's focus on improving reasoning capabilities in models, using code and other data
21:33 The value of using both public benchmarks and internal metrics to evaluate progress
21:43 Lessons learned from developing the Avalon research environment
23:31 The limitations of pure reinforcement learning for general intelligence
32:12 Imbue's vision for building better abstractions and interfaces for reliable agents
33:49 Interface design for collaborating with, rather than just communicating with, AI agents
39:51 The future potential of an agent-to-agent protocol
42:53 Leveraging approaches like critiquing between models and chain of thought
47:30 Kanjun's philosophy on enabling team members as creative agents at Imbue
59:54 Kanjun's experience co-founding the communal co-living space The Archive
60:22 Lightning Round

Transcript

- Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. - Hey, and today in the studio we have Kanjun from Imbue. Welcome. - Thank you. - So, you and I have, I guess, crossed paths a number of times, and you were formerly named Generally Intelligent, and you've just announced your rename, rebrand, in a huge, humongous way, so congrats on all that.

- Thank you. - And we're here to dive into deeper detail on Imbue. We'd like to introduce you just on a high-level basis, but then have you go into a little bit more of your personal side. So, you did your BS and MS at MIT, and you also spent some time at the MIT Media Lab, one of the most famous, I guess, computer hacking labs in the world.

- Yeah, true. - What were you doing at that time? - Yeah, I built electronic textiles, so boards that make it possible to make soft clothing. You can sew circuit boards into clothing, and then make clothing electronic. It's not that useful. - You wrote a book about that? - I wrote a book about it, yeah.

- Yeah, yeah, yeah. - Basically, the idea was to teach young women computer science through this route, because what we found was that young girls, they would be really excited about math until about sixth grade, and then they're like, "Oh, math is not good anymore, because I don't feel like the type of person who does math or does programming, but I do feel like the type of person who does crafting." So, it's like, "Okay, what if you combine the two?" - Yeah, yeah, awesome, awesome.

Always more detail to dive into on that. But then you graduated MIT, and you went straight into BizOps at Dropbox, where you were eventually chief of staff, which is a pretty interesting role we can dive into later. And then it seems like the founder bug hit you. You're basically a three-time founder, at Ember, Sorceress, and now at Generally Intelligent/Imbue.

What should people know about you on the personal side that's not on your LinkedIn, that's something you're very passionate about outside of work? - Yeah, I think if you ask any of my friends, they would tell you that I'm obsessed with agency, like human agency and human potential. - That's work.

Come on. - That's not work. What are you talking about? - So, what's an example of human agency that you try to promote? - I feel like, with all of my friends, I have a lot of conversations with them that help figure out what's blocking them. I guess I do this with a team kind of automatically, too.

And I think about it for myself often, building systems. I have a lot of systems to help myself be more effective. At Dropbox, I used to give this onboarding talk called "How to Be Effective," which people liked. I think 1,000 people heard this onboarding talk, and I think maybe Dropbox was more effective.

I think I just really believe that, as humans, we can be a lot more than we are, and it's what drives everything. I guess completely outside of work, I do dance. I do partner dance. - Nice. - Yeah. - Yeah, lots of interest in that stuff, especially in the group living houses in San Francisco, which I've been a little bit part of, and you've also run one of those.

- That's right, yeah. I started The Archive with Josh, my co-founder, and a couple of other folks in 2015. Three of our housemates built it, so. - Was that the, I guess, the precursor to Generally Intelligent, that you started doing more things with Josh? Is that how that relationship started?

- Yeah, so Josh and I are, this is our third company together. Our first company, Josh poached me from Dropbox for Ember, and there, we built a really interesting technology, a laser raster projector VR headset, and then we were like, "VR is not the thing we're most passionate about," and actually, it was kind of early days when we both realized we really do believe that, in our lifetimes, computers that are intelligent are going to be able to allow us to do much more than we can do today as people and be much more as people than we can be today.

At that time, we actually, after Ember, we were like, "Should we work on AI research or start an AI lab?" A bunch of our housemates were joining OpenAI, and we actually decided to do something more pragmatic to apply AI to recruiting and to try to understand, like, "Okay, if we're actually trying to deploy these systems in the real world, what's required?" And that was Sorceress.

That taught us so much. That was maybe an AI agent in a lot of ways, like, what does it actually take to make a product that people can trust and rely on? I think we never really fully got there, and it's taught me a lot about what's required, and it's kind of, I think, informed some of our approach and some of the way that we think about how these systems will actually get used by people in the real world.

Just to go one step deeper on that, so you're building AI agents in 2016, before it was cool. You hit some milestones, you raised $30 million, something was working. So what do you think you succeeded in doing, and then what did you try to do that did not pan out?

Yeah. So the product worked quite well. So Sorceress was an AI system that basically kind of looked for candidates that could be a good fit and then helped you reach out to them. And this was a little bit early. We didn't have language models to help you reach out, so we actually had a team of writers that customized emails, and we automated a lot of the customization.

But the product was pretty magical. Candidates would just be interested and land in your inbox, and then you can talk to them. And as a hiring manager, that's such a good experience. I think there were a lot of learnings, both on the product and market side. On the market side, recruiting is a market that has endogenously high churn, which means people start hiring, we fill the role for them, and then they stop hiring.

So the more we succeed, the more they... It's like the whole dating business. It's the dating business. Exactly. Exactly. It's exactly the same problem as the dating business. And I was really passionate about like, can we help people find work that is more exciting for them? A lot of people are not excited about their jobs, and a lot of companies are doing exciting things, and the matching could be a lot better.

But the dating business kind of phenomenon put a damper on that. So we had a good, it's actually a pretty good business, but as with any business with relatively high churn, the bigger it gets, the more revenue we have, the slower growth becomes. Because if you lose 30% of that revenue year over year, then it becomes a worse business.

So that was the dynamic we noticed quite early on after our Series A. I think the other really interesting thing about it is we realized what was required for people to trust that these candidates were, like, well-vetted and had been selected for a reason. And it's actually what led us to a lot of what we do at Imbue: working on interfaces to figure out how we get to a situation where, when you're building and using agents, these agents are trustworthy to the end user.

That's actually one of the biggest issues with agents that go off and do longer-range goals: I have to trust, did they actually think through the situation? And that really informed a lot of our work today. Yeah. Let's jump into GI now, Imbue. When did you decide recruiting was done for you, and you were ready for the next challenge?

And how did you pick the agent space? I feel like in 2021, it wasn't as mainstream. Yeah. So the LinkedIn says that it started in 2021, but actually we started thinking very seriously about it in early 2020, late 2019, early 2020. Not exactly this idea, but in late 2019, so I mentioned our housemates, Tom Brown and Ben Mann, they're the first two authors on GPT-3.

So what we were seeing is that scale is starting to work and language models probably will actually get to a point where with hacks, they're actually going to be quite powerful. And it was hard to see that at the time, actually, because GPT-3, the early versions of it, there are all sorts of issues.

We're like, "Oh, that's not that useful." But we could kind of see, okay, you keep improving it in all of these different ways and it'll get better. And so what Josh and I were really interested in is, how can we get computers that help us do bigger things? There's this kind of future where I think a lot about, if I were born in 1900 as a woman, my life would not be that fun.

I'd spend most of my time carrying water and literally getting wood to put in the stove to cook food and cleaning and scrubbing the dishes and getting food every day because there's no refrigerator. All of these things, very physical labor. And what's happened over the last 150 years since the Industrial Revolution is we've kind of gotten free energy.

Energy is way more free than it was 150 years ago. And so as a result, we've built all these technologies like the stove and the dishwasher and the refrigerator. And we have electricity and we have infrastructure, running water, all of these things that have totally freed me up to do what I can do now.

And I think the same thing is true for intellectual energy. We don't really see it today because we're so in it, but our computers have to be micromanaged. Part of why people are like, "Oh, you're stuck to your screen all day." Well, we're stuck to our screen all day because literally nothing happens unless I'm doing something in front of my screen.

I can't send my computer off to do a bunch of stuff for me. There is a future where that's not the case, where I can actually go off and do stuff and trust that my computer will pay my bills and figure out my travel plans and do the detailed work that I am not that excited to do so that I can be much more creative and able to do things that I as a human am very excited about and collaborate with other people.

And there are things that people are uniquely suited for. So that's kind of always been the thing that is really exciting, has been really exciting to me. I'm a mathematician. I've known for a long time I think that AI, whatever AI is, it would happen in our lifetimes. And the personal computer kind of started giving us a bit of free intellectual energy.

And this is like really the explosion of free intellectual energy. So in early 2020, we were thinking about this and what happened was self-supervised learning basically started working across everything. So it worked in language. SimCLR came out. MoCo, Momentum Contrast, had come out earlier in 2019.

SimCLR came out in early 2020 and we were like, okay, for the first time, self-supervised learning is working really well across images and text, and we suspected that, like, okay, actually it's the case that machines can learn things the way that humans do. And if that's true, if they can learn things in a fully self-supervised way, because like as people, we are not supervised.

We like go Google things and try to figure things out. So if that's true, then like what the computer could be is much different, you know, is much bigger than what it is today. And so we started exploring ideas around like, how do we actually go? We didn't think about the fact that we could actually just build a research lab.

So we were like, okay, what kind of startup could we build to, like, leverage self-supervised learning so that it eventually becomes something that allows computers to become much more kind of able to do bigger things for us. But that became Generally Intelligent, which started as a research lab. And so your mission is you aim to rekindle the dream of the personal computer.

So when did it go wrong, and what are, like, your first products and kind of, like, user-facing things that you're building to rekindle it? Yeah. So what we do at Imbue is we train large foundation models optimized for reasoning. And the reason for that is because reasoning is actually, we believe, the biggest blocker to agents or systems that can do these larger goals.

If we think about, you know, something that writes an essay, like when we write an essay, we like write it, we don't just output it and then we're done. We like write it and then we look at it and we're like, oh, I need to do more research on that area.

I'm going to go do some research and figure it out and come back and, oh, actually it's not quite right, the structure of the outline, so I'm going to rearrange the outline, rewrite it. It's this very iterative process and it requires thinking through like, okay, what am I trying to do?

Is the goal correct? Also like, has the goal changed as I've learned more? Also, you know, as a tool, like when should I ask the user questions? I shouldn't ask them questions all the time, but I should ask them questions in higher risk situations. How certain am I about the like flight I'm about to book?

There are all of these notions of like risk certainty, playing out scenarios, figuring out how to make a plan that makes sense, how to change the plan, what the goal should be, that are things, you know, that we lump under the bucket of reasoning. And models today, they're not optimized for reasoning.

It turns out that there's not actually as much explicit reasoning data on the internet as you would expect, and so we get a lot of mileage out of optimizing our models for reasoning in pre-training. And then on top of that, we build agents ourselves. I can get into it, we really believe in serious use, like really seriously using the systems and trying to get to an agent that we can use every single day, tons of agents that we can use every single day.

And then we experiment with interfaces that help us better interact with the agents. So those are some set of things that we do on the kind of model training and agent side. And then the initial agents that we build, a lot of them are trying to help us write code better because code is most of what we do every day.

And then on the infrastructure and theory side, we actually do a fair amount of theory work to understand like how do these systems learn? And then also like what are the right abstractions for us to build good agents with, which we can get more into. And if you look at our website, we have a lot of tools.

We build a lot of tools internally. We have a like really nice automated hyperparameter optimizer. We have a lot of really nice infrastructure. And it's all part of the belief of like, okay, let's try to make it so that the humans are doing the things humans are good at as much as possible.

So out of our very small team, we get a lot of leverage. And so would you still categorize yourself as a research lab now, or are you now in startup mode? Is that a transition that is conscious at all? That's a really interesting question. I think we've always intended to build, you know, to try to build the next version of the computer, enable the next version of the computer.

The way I think about it is there is a right time to bring a technology to market. So Apple does this really well. Actually, the iPhone was under development for 10 years, AirPods for five years. And Apple has a story where, you know, for the iPhone, the first multi-touch screen was created.

They actually were like, oh, wow, this is cool. Let's like productionize iPhone. They actually brought, they like did some work trying to productionize it and realized this is not good enough. And they put it back into research to try to figure out like, how do we make it better?

What are the interface pieces that are needed? And then they brought it back into production. So I think of production and research as kind of like these two separate phases. And internally, we have that concept as well, where like things need to be done in order to get to something that's usable.

And then when it's usable, like eventually we figure out how to productize it. What's the culture like to make that happen, to have both, kind of like, product-oriented and research-oriented people? And as you think about building the team, I mean, you just raised $200 million, I'm sure you want to hire more people.

What are like the right archetypes of people that work at Imbue? Hmm. Yeah, I would say we have a very unique culture in a lot of ways. I think a lot about social process design. So how do you design social processes that enable people to be, you know, effective?

I like to think about team members as creative agents. So because most companies, they think of their people as assets. And they're very proud of this. And I think about like, okay, what is an asset? It's something you own, that provides you value that you can discard at any time.

This is a very low bar for people. This is not what people are. And so we try to enable everyone to be a creative agent and to really unlock their superpowers. So a lot of the work I do, you know, I was mentioning earlier, I'm like obsessed with agency.

A lot of the work I do with team members is trying to figure out, like, you know, what are you really good at? What really gives you energy, and where can we put you, and how can I help you unlock that and grow that? So much of our work, you know, in terms of team structure, like, much of our work actually comes from people.

CARBS, our hyperparameter optimizer, came from Abe trying to automate his own research process, doing hyperparameter optimization. And he actually pulled some ideas from plasma physics, he's a plasma physicist, to make the local search work. A lot of our work on evaluations comes from a couple of members of our team who are, like, obsessed with evaluations.

We do a lot of work trying to figure out like, how do you actually evaluate if the model is getting better? Is the model making better agents? Is the agent actually reliable? And so a lot of things, kind of like, I think of people as making a, like, them-shaped blob inside Imbue.

And I think, you know, yeah, that's the kind of person that we're hiring for. We're hiring product engineers and data engineers and research engineers and all these roles. You know, we have projects, not teams. We have a project around data collection and data engineering. That's actually one of the key things that improves model performance.

We have a pre-training kind of project and with some fine tuning as part of that. And then we have an agent's project that's like trying to build on top of our models as well as use other models in the outside world to try to make agents that then we actually use as programmers every day.

So all sorts of different projects. As a founder, you are now sort of a capital allocator among all of these different investments effectively at different projects. And I was interested in how you mentioned that you're optimizing for improving reasoning specifically inside of your pre-training, which I assume is just a lot of data collection.

We are optimizing reasoning inside of our pre-trained models. And a lot of that is about data. And I can talk more about, like, what, you know, what exactly does it involve? But actually a big part, maybe 50% plus, of the work is figuring out, even if you do have models that reason well, the models are still stochastic.

The way you prompt them is still kind of random, like, it makes them do random things. And so how do we get to something that is actually robust and reliable as a user? How can I as a user trust it? You know, I was mentioning earlier, when I talk to other people building agents, they have to do so much work, like, to try to get to something that they can actually productize.

And it takes a long time, and agents haven't been productized yet, partly for this reason: the abstractions are very leaky. You know, we can get like 80% of the way there, but like self-driving cars, the remaining 20% is actually really difficult. We believe that, and we have internally, I think, some things like an interface, for example, that lets me really easily see what the agent execution is, fork it, try out different things, modify the prompt, modify the plan that it is making.

This type of interface, it makes it so that I feel more like I'm collaborating with the agent as it's executing, as opposed to it's just like doing something as a black box. That's an example of a type of thing that's like beyond just the model pre-training. But on the model pre-training side, like reasoning is a thing that we optimize for.

And a lot of that is about, yeah, what data do we put in? Yeah. It's interesting just because I always think like, you know, out of the levers that you have, the resources that you have, I think a lot of people think that running a foundation model company or a research lab is going to be primarily compute.

And I think the share of compute has gone down a lot over the past three years. It used to be the main story, like the main way you scale is you just throw more compute at it. And now it's like FLOPs is not all you need. You need better data, you need better algorithms.

And I wonder where that shift has gone. This is a very vague question, but is it like 30, 30, 30 now? Is it like maybe even higher? So one way I'll put this is people estimate that Llama 2 maybe took about $3 to $4 million of compute, but probably $20 to $25 million worth of labeling data.

And I'm like, okay, well that's a very different story than all these other foundation model labs raising hundreds of millions of dollars and spending it on GPUs. Yeah. Data is really expensive. We generate a lot of data and so that does help. The generated data is actually close to as good as human-labeled data.

So generated data from other models? From our own models. From your own models. Yeah. Do you feel like, and there's certain variations of this, there's the sort of the constitutional AI approach from Anthropic and basically models sampling, training on data from other models. I feel like there's a little bit of like contamination in there or to put it in a statistical form, you're resampling a distribution that you already have that you already know doesn't match human distributions.

Yeah. Yeah. How do you feel about that basically, just philosophically? So when we're optimizing models for reasoning, we are actually trying to make a part of the distribution really spiky. So in a sense, this is actually what we want. We want to, because the internet is a sample of the human distribution that's also skewed in all sorts of ways, that is not the data that we necessarily want these models to be trained on.

And so I don't worry about it that much. What we've seen so far is that it seems to help. When we're generating data, we're not really randomly generating data, we generate very specific things that are like reasoning traces and that help optimize reasoning. Code also is a big piece of improving reasoning.

So yeah, generated code is not that much worse than like regular human written code. You might even say it can be better in a lot of ways. So yeah. So we are trying to already do that. What are some of the tools that you saw that you thought were not a good fit?

So you built Avalon, which is your own simulated world. And when you first started, the kind of, like, metagame was using games to simulate things, using, you know, Minecraft, and then OpenAI's, like, Gym thing, and all these things. And your thing, I think in one of your other podcasts, you mentioned like Minecraft is like way too slow to actually do any serious work.

Is that true? Yeah. I didn't say it. I don't know. That's above my pay grade. But Avalon is like a hundred times faster than Minecraft for simulation. When did you figure that out that you needed to just like build your own thing? Was it kind of like your engineering team was like, hey, this is too slow.

Was it more a long-term investment? At that time, we built Avalon as a research environment to help us learn particular things. And one thing we were trying to learn is like, how do you get an agent that is able to do many different tasks? Like RL agents at that time and environments at that time, what we heard from other RL researchers was the like biggest thing holding the field back is lack of benchmarks that let us kind of explore things like planning and curiosity and things like that and have the agent actually perform better if the agent has curiosity.

And so we were trying to figure out like, okay, how can we have agents that are like able to handle lots of different types of tasks without the reward being pretty handcrafted? That's a lot of what we had seen is that like these very handcrafted rewards. And so Avalon has like a single reward.

It's across all tasks. It also allowed us to kind of create a curriculum, so we could make the level more or less difficult. And it taught us a lot, maybe two primary things. One is that with no curriculum, RL algorithms don't work at all.

So that's actually really interesting. For the non-RL specialists, what is a curriculum in your terminology? So a curriculum in this particular case is basically that the Avalon environment lets us generate simpler environments and harder environments for a given task. What's interesting is that the simpler environments, you know, what you'd expect is the agent succeeds more often, so it gets more reward.

And so, you know, kind of my intuitive way of thinking about it is, okay, the reason why it learns much faster with a curriculum is it's just getting a lot more signal. And that's actually an interesting kind of like general intuition to have about training these things. It's like, what kind of signal are they getting and like, how can you help it get a lot more signal?
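To make the curriculum idea concrete, here is a minimal sketch (not Imbue's actual setup; `generate_world` and `run_episode` are hypothetical stand-ins for an Avalon-style generator and rollout): difficulty only ratchets up while the agent is succeeding often enough to keep receiving reward, which is the "more signal" effect described above.

```python
import random

def generate_world(difficulty: float) -> dict:
    # Hypothetical stand-in for an Avalon-style generator that can make
    # easier or harder versions of the same task (e.g. distance to food,
    # number of obstacles) controlled by `difficulty` in [0, 1].
    return {"difficulty": difficulty}

def run_episode(agent, world: dict) -> bool:
    # Hypothetical rollout: returns True if the agent earned the single
    # task reward. Stubbed randomly so the control flow is visible.
    return random.random() > world["difficulty"]

def curriculum_train(agent, episodes: int = 10_000, target_success: float = 0.5) -> float:
    difficulty, recent = 0.0, []
    for _ in range(episodes):
        recent.append(run_episode(agent, generate_world(difficulty)))
        recent = recent[-100:]
        if len(recent) == 100:
            rate = sum(recent) / 100
            # Keep success frequent enough that reward (signal) keeps flowing,
            # then ratchet difficulty up; back off if the signal dries up.
            if rate > target_success:
                difficulty = min(1.0, difficulty + 0.01)
            elif rate < target_success / 2:
                difficulty = max(0.0, difficulty - 0.01)
    return difficulty
```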

The second thing we learned is that reinforcement learning is not a good vehicle, like pure reinforcement learning is not a good vehicle, for planning and reasoning. So these agents were not able to, they were able to learn all sorts of crazy things. They could learn to climb, like hand-over-hand VR climbing; they could learn to open doors, like very complicated ones, like multiple switches and a lever to open the door.

But they couldn't do any higher level things and they couldn't do those lower level things consistently necessarily. And as a user, we were like, okay, as a user, I do not want to interact with a pure reinforcement learning end-to-end RL agent. As a user, like I need much more control over what that agent is doing.

And so that actually started to get us on the track of thinking about, okay, how do we do the reasoning part in language? And we were pretty inspired by our friend Chelsea Finn at Stanford, who was, I think, working on SayCan at the time, where it's basically an experiment where they have robots kind of trying to do different tasks and actually do the reasoning for the robot in natural language.

And it worked quite well. And that led us to start experimenting very seriously with reasoning. How important is the language part for the agent versus for you to inspect the agent? You know, like is it the interface to kind of the human on the loop really important or? Yeah.

I personally think of it as it's much more important for us, the human user. So I think you probably could get end-to-end agents that work and are fairly general at some point in the future. But I think you don't want that. Like we actually want agents that we can like perturb while they're trying to figure out what to do.

So it's, you know, even a very simple example, internally we have like a type error fixing agent and we have like a test generation agent. Test generation agent goes off the rails all the time. I want to know like, why did it generate this particular test? What was it thinking?

Did it consider, you know, the fact that this is calling out to this other function? Like formatter agent, if it ever comes up with anything weird, I want to be able to debug like what happened. With RL end-to-end stuff, like we couldn't do that. So it sounds like you have a bunch of agents that are operating internally within the company.

What's your most, I guess, successful agent and what's your least successful one? Yeah. A type of agent that works moderately well is, like, fix the color of this button on the website, or, like, change the color of this button. Which is what sweep.dev is doing now. Perfect. Okay. Well, we should just use sweep.dev.

Well, I mean, okay. I don't know how often you have to fix the color of the button, right? Because all of them raise money on the idea that they can go further. And my fear when encountering something like that is that there's some kind of unknown asymptote, a ceiling that they're going to run head-on into, that you've already run into.

We've definitely run into such a ceiling. What is the ceiling? Is there a name for it? I mean, for us, we think of it as reasoning plus these tools. So reasoning plus abstractions, basically. I think actually you can get really far with current models and that's why it's so compelling.

Like we can pile debugging tools on top of these current models, have them critique each other and critique themselves and do all of these, like, you know, spend more compute at inference time, context hacks, you know, retrieval-augmented generation, et cetera, et cetera, et cetera. Like the pile of hacks actually does get us really far.
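As a rough sketch of one of those inference-time hacks, here is what a draft-critique-revise loop can look like. This is illustrative, not Imbue's implementation; `generate` is a placeholder for whatever model call is being used.

```python
def generate(prompt: str) -> str:
    # Placeholder for any language model call (API client, local model, ...).
    raise NotImplementedError

def draft_critique_revise(task: str, rounds: int = 2) -> str:
    """Spend extra compute at inference time: draft, critique, revise."""
    answer = generate(f"Task: {task}\nThink step by step, then answer.")
    for _ in range(rounds):
        critique = generate(
            f"Task: {task}\nProposed answer:\n{answer}\n"
            "List concrete mistakes or unjustified steps. Reply 'OK' if none."
        )
        if critique.strip() == "OK":
            break
        answer = generate(
            f"Task: {task}\nPrevious answer:\n{answer}\n"
            f"Critique:\n{critique}\nWrite a corrected answer."
        )
    return answer
```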

And you're kind of like trying to get more signal out of the channel. We don't like to think about it that way. It's what the default approach is, is like trying to get more signal out of this noisy channel. But the issue with agents is as a user, I want it to be mostly reliable.

It's kind of like self-driving in that way. Like it's not as bad as self-driving; in self-driving, you know, you're hurtling at 70 miles an hour, it's like the hardest agent problem. But I think one thing we learned from Sorceress, and one thing we've learned by using these things internally, is we actually have a pretty high bar for these agents to work.

You know, it is actually really annoying if they only work 50% of the time, and we can make interfaces to make it slightly less annoying. But yeah, there's a ceiling that we've encountered so far, and we need to make the models better, and we also need to make the kind of, like, interface to the user better, and also a lot of the, like, you know, critiquing, we have a lot of, like, generation methods, kind of like spending compute at inference time, generation methods that help things be more robust and reliable, but it's still not 100% of the way there.

So to your question of like what agents work well and what doesn't work well, like most of the agents don't work well and we're slowly making them work better by improving the underlying model and improving these. I think that that's comforting for a lot of people who are feeling a lot of imposter syndrome not being able to make it work.

And I think the fact that you share their struggles, I think also helps people understand how early this is. Yeah, definitely. It's very early and I hope what we can do is help people who are building agents actually like be able to deploy them. I think, you know, that's the gap that we see a lot of today is everyone who's trying to build agents to get to the point where it's robust enough to be deployable.

It's like an unknown amount of time. Okay. Yeah. Well, so this goes back into what Imbue is going to offer as a product or a platform. How are you going to actually help people deploy those agents? Yeah, so our current hypothesis, I don't know if this is actually going to end up being the case.

We've built a lot of tools for ourselves internally around like debugging, around like abstractions or techniques after the model generation happens, like after the language model generates the text, like interfaces for the user and the underlying model itself, like models talking to each other. Maybe some set of those things, kind of like an operating system, some set of those things will be helpful for other people.

And we'll figure out what set of those things is helpful for us to make our agents. Like what we want to do is get to a point where we can start making an agent, deploy it, it's reliable, like, very quickly. And there's a similar analog to software engineering, like in the early days, in the '70s, in the '60s, to program a computer, you had to go all the way down to the registers and write things.

Eventually, we had assembly. That was like an improvement. Then we wrote programming languages with these higher levels of abstraction, and that allowed a lot more people to do this and much faster, and the software created is much less expensive. And I think it's basically a similar route here where we're like in the like bare metal phase of agent building, and we will eventually get to something with much nicer abstractions.

So you touched a little bit on the data before. We had this conversation with George Hotz, where we were like, there's not a lot of reasoning data out there, and can the models really understand? And his take was like, look, with enough compute, you're not that complicated as a human.

The model can figure out eventually why certain decisions are made. What's been your experience? As you think about reasoning data, do you have to do a lot of manual work, or is there a way to prompt models to extract the reasoning from actions that they see? We don't think of it as, oh, throw enough data at it, and then it will figure out what the plan should be.

I think we're much more explicit. So we have a lot of thoughts internally, like many documents about what reasoning is. A way to think about it is as humans, we've learned a lot of reasoning strategies over time. We are better at reasoning now than we were 3,000 years ago.

An example of a reasoning strategy is noticing you're confused. And then when I notice I'm confused, I should ask like, huh, what was the original claim that was made? What evidence is there for this claim, et cetera, et cetera? Does the evidence support the claim? Is the claim correct?

This is like a reasoning strategy that was developed in like the 1600s, with like the advent of science. That's an example of a reasoning strategy. There are tons of them. We employ all the time, lots of heuristics that help us be better at reasoning. And we didn't always have them.

And because they're invented, we can generate data that's much more specific to them. So I think internally, yeah, we have a lot of thoughts on what reasoning is, and we generate a lot more specific data. We're not just like, oh, it'll figure out reasoning from this black box, or it'll figure out reasoning from the data that exists.
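As a toy sketch of what "generating data that's specific to a reasoning strategy" could look like, here is the notice-you're-confused strategy above turned into a synthetic-data template. The prompt wording and `generate` helper are assumptions for illustration, not Imbue's actual pipeline.

```python
STRATEGY = (
    "Notice you are confused. Restate the original claim, list the evidence "
    "given for it, check whether the evidence supports the claim, and only "
    "then decide whether the claim is correct."
)

def generate(prompt: str) -> str:
    # Placeholder for any text-generation model call.
    raise NotImplementedError

def make_reasoning_example(claim: str, evidence: str) -> dict:
    trace = generate(
        f"Claim: {claim}\nEvidence: {evidence}\n"
        f"Apply this strategy and show your work:\n{STRATEGY}"
    )
    # Each (claim, evidence) -> trace pair becomes a training example that
    # demonstrates the strategy explicitly, rather than hoping the model
    # absorbs it from whatever reasoning happens to exist on the internet.
    return {"input": f"Claim: {claim}\nEvidence: {evidence}", "target": trace}
```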

Yeah. I mean, the scientific method is like a good example. And if you think about hallucination, right? And people are thinking, how do we use these models to do net new scientific research? And if you go back in time and the model is like, well, the earth revolves around the sun, and people are like, man, this model is crap.

It's like, what are you talking about? Like the sun revolves around the earth. Like, how do you see the future where like, do you think we can actually, like, if the models are actually good enough, but we don't believe them, it's like, how do we make the two live together?

Say you're like, you use Imbue as a scientist to do a lot of your research, and Imbue tells you, hey, I think this is like a serious bet you should go down. And you're like, no, this sounds impossible. Like, how is that trust going to be built, and like, what are some of the tools that maybe are going to be there to inspect it?

Yeah. So like, one element of it is like, as a person, like, I need to basically get information out of the model such that I can try to understand what's going on with the model. So then the second question is like, okay, how do you do that? And that's kind of, some of our debugging tools, they're not necessarily just for debugging.

They're also for like, interfacing with and interacting with the model. So like, if I go back in this reasoning trace and like, change a bunch of things, what's going to happen? Like, what does it conclude instead? So that kind of helps me understand, like, what are its assumptions? And, you know, we think of these things as tools.
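Here is a minimal sketch of that fork-the-trace idea: copy the agent's reasoning trace up to some step, swap in an edited step, and let the model continue from there so you can compare what it concludes. The `Trace` structure and `generate` call are illustrative assumptions, not Imbue's tooling.

```python
from dataclasses import dataclass, field

def generate(prompt: str) -> str:
    # Placeholder for any language model call.
    raise NotImplementedError

@dataclass
class Trace:
    goal: str
    steps: list[str] = field(default_factory=list)

    def prompt(self) -> str:
        return "\n".join([f"Goal: {self.goal}", *self.steps, "Next step:"])

def fork_and_replay(trace: Trace, at_step: int, edited_step: str) -> Trace:
    """Fork the trace at `at_step`, insert an edited step, and continue."""
    forked = Trace(goal=trace.goal, steps=trace.steps[:at_step] + [edited_step])
    forked.steps.append(generate(forked.prompt()))
    return forked
```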

And so it's really about, like, as a user, how do I use this tool effectively? Like, I need to be willing to be convinced as well. It's like, how do I use this tool effectively, and what can it help me with, and what can it tell me? So there's a lot of mention of code in your process.

And I was hoping to dive in even deeper. I think we might run the risk of giving people the impression that you view code, or you use code, just as like a tool within yourself, within Imbue, just for coding assistance.

I wonder if there's any research or findings that you have to share that talks about the intersection of code and reasoning. Yeah, so the way I think about it intuitively is, like, code is the most explicit example of reasoning data on the internet. And it's not only structured, it's actually very explicit, which is nice.

You know, it says this variable means this, and then it uses this variable, and then the function does this. Like, as people, when we talk in language, it takes a lot more to kind of, like, extract that, like, explicit structure out of, like, our language. And so that's one thing that's really nice about code, is I see it as almost like a curriculum for reasoning.
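A small, invented illustration of that point (not from the episode): the same decision stated in prose leaves its structure implicit, while the code version names every quantity, condition, and conclusion, which is what makes code such explicit reasoning data.

```python
# Prose (implicit structure): "if the flight is refundable, or the price
# dropped by more than ten percent, rebook it; otherwise keep it."

# The same reasoning as code: every quantity, condition, and conclusion is
# named and ordered explicitly.
def should_rebook(refundable: bool, old_price: float, new_price: float) -> bool:
    price_drop = (old_price - new_price) / old_price
    return refundable or price_drop > 0.10
```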

I think we use code in all sorts of ways, like, the coding agents are really helpful for us to understand, like, what are the limitations of the agents? The code is really helpful for the reasoning itself, but also code is a way for models to act. So by generating code, it can act on my computer.

And you know, when we talk about rekindling the dream of the personal computer, kind of where I see computers going is, like, computers will eventually become these much more malleable things, where I, as a user, today, I have to know how to write software code, like, in order to make my computer do exactly what I want it to do.

But in the future, if the computer is able to generate its own code, then I can actually interface with it in natural language. And so we, you know, one way we think about agents is it's kind of like a natural language programming language. It's a way to program my computer in natural language that's much more intuitive to me as a user.
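A minimal sketch of that "program the computer in natural language" loop, under heavy assumptions: `generate` stands in for any code-writing model, and executing model-written code with `exec` is shown only to illustrate the idea that code is how the model acts; a real system would sandbox and inspect the code first.

```python
def generate(prompt: str) -> str:
    # Placeholder for any code-generating model call.
    raise NotImplementedError

def run_request(request: str) -> dict:
    """Turn a plain-English request into code, then run it to act."""
    code = generate(
        "Write a short Python snippet, standard library only, that does the "
        f"following and stores its result in a variable named `result`:\n{request}"
    )
    namespace: dict = {}
    # WARNING: exec on model output is only acceptable inside a sandbox.
    exec(code, namespace)
    return {"code": code, "result": namespace.get("result")}
```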

And these interfaces that we're building are essentially IDEs for users to program our computers in natural language. What do you think about the other, the different approaches people have, kind of like, text first, browser first, like MultiOn? What do you think the best interface will be, or like, what is your, you know, thinking today?

I think chat is very limited as an interface. It is sequential, where these agents don't have to be sequential. So with a chat interface, if the agent does something wrong, I have to, like, figure out how to, like, how do I get it to go back and start from the place I wanted it to start from?

So in a lot of ways, like, chat as an interface, I think Linus, Linus Lee, who you had on this podcast, I really like how he put it: chat as an interface is skeuomorphic. So in the early days, when we made word processors on our computers, they had notepad lines, because that's what we understood, you know, these, like, objects to be.

Chat, like texting someone, is something we understand. So texting our AI is something that we understand. But today's Word documents don't have notepad lines. And similarly, the way we want to interact with agents, like, chat is a very primitive way of interacting with agents. What we want is to be able to inspect their state and to be able to modify them and fork them and all of these other things.

And we internally have, kind of, like, think about what are the right representations for that, like, architecturally, like, what are the right representations? What kind of abstractions do we need to build? And how do we build abstractions that are not leaky? Because if the abstractions are leaky, which they are today, like, you know, this stochastic generation of text is like a leaky abstraction.

I cannot depend on it. And that means it's actually really hard to build on top of. But our experience and belief is, actually, by building better abstractions and better tooling, we can actually make these things non-leaky. And now you can build, like, whole things on top of them. So these other interfaces, because of where we are, we don't think that much about them.

- Cool. Yeah, I mean, you mentioned this is kind of like the Xerox PARC moment for AI. And we had a lot of stuff come out of PARC, like, yeah, what you see is what you get, headers, and, like, MVC, and all this stuff. But yeah. But then we didn't have the iPhone at PARC.

We didn't have all these, like, higher things. What do you think it's reasonable to expect in, like, this era of AI? You know, kind of, like, five years or so? Like, what are, like, the things we'll build today? And what are things that maybe we'll see in, kind of, like, the second wave of products?

- I think the waves will be much faster than before. Like, what we're seeing right now is basically, like, a continuous wave. Let me zoom a little bit earlier. So people like the Xerox PARC analogy I give, but I think there are many different analogies. Like one is the, like, analog-to-digital computer is another analogy for where we are today.

The analog computer Vannevar Bush built in the 1930s, I think, and it's like a system of pulleys. And it can only calculate one function, like, it can calculate, like, an integral. And that was so magical at the time, because you actually did need to calculate this integral a bunch.

But it had a bunch of issues. Like, in analog, errors compound. And so there was actually a set of breakthroughs necessary in order to get to the digital computer. Like Turing's decidability; Shannon showing, I think, that, like, relay circuits can be mapped to Boolean operators.

And a set of other, like, theoretical breakthroughs, which essentially, they were creating abstractions for these, like, very analog circuits. And digital had this nice property of, like, being error correcting. And so when I talk about, like, less leaky abstractions, that's what I mean. That's what I'm kind of pointing a little bit to.

It's not going to look exactly the same way. And then the Xerox PARC piece, a lot of that is about, like, how do we get to computers that as a person, I can actually use well. And the interface actually helps it unlock so much more power. So the sets of things we're working on, like the sets of abstractions and the interfaces, like, hopefully that, like, help us unlock a lot more power in these systems.

Like, hopefully that'll come not too far in the future. I could see a next version, like, maybe a little bit farther out. It's, like, an agent protocol. So a way for different agents to talk to each other and call each other, kind of like HTTP. Do you know it exists already?

Yeah, there is a nonprofit that's working on one. I think it's a bit early, but it's interesting to think about right now. Part of why I think it's early is because the issue with agents is it's not quite like the internet where you could, like, make a website and the website would appear.

The issue with agents is that they don't work. And so it may be a bit early to figure out what the protocol is before we really understand how could these agents get constructed. But, you know, I think that's, I think it's a really interesting question. While we're talking on this agent-to-agent thing, there's been a bit of research recently on some of these approaches.

I tend to just call them extremely complicated chain-of-thoughting, but any perspectives on, kind of, MetaGPT? I think that's the name of the paper. I don't know if you care about it at the level of individual papers coming out, but I did read that recently, and TL;DR, it beat GPT-4 on HumanEval by role-playing a software development agency.

Instead of having a single shot, a single role, you have multiple roles and having all of them criticize each other as agents communicating with other agents. Yeah. I think this is an example of an interesting abstraction of like, okay, can I just plop in this multi-role critiquing and see how it improves my agent?

Can I just plop in chain of thought, tree of thought, plop in these other things and see how they improve my agent? One issue with this kind of prompting is that it's still not very reliable. There's one lens which is like, okay, if you do enough of these techniques, you'll get to high reliability.

I think actually that's a pretty reasonable lens. We take that lens often. Then there's another lens that's like, okay, but it's starting to get really messy what's in the prompt and how do we deal with that messiness? Maybe you need cleaner ways of thinking about and constructing these systems.

We also take that lens. Yeah. I think both are necessary. It's a great question because I feel like this also brought up another question I had for you. I noticed that you work a lot with your own benchmarks, your own evaluations of what is valuable. I would contrast your approach with OpenAI's, as OpenAI tends to just lean on, "Hey, we played StarCraft," or, "Hey, we ran it on the SAT or the AP bio test, and here are the results." Basically, is benchmark culture ruining AI?

Or is that actually a good thing? Because everyone knows what an SAT is and that's fine. I think it's important to use both public and internal benchmarks. Part of why we build our own benchmarks is that there are not very many good benchmarks for agents, actually. To evaluate these things, we actually need to think about it in a slightly different way.

But we also do use a lot of public benchmarks for is the reasoning capability in this particular way improving? Yeah. It's good to use both. For example, the Voyager paper coming out of NVIDIA played Minecraft and set their own benchmarks on getting the Diamond Axe or whatever and exploring as much of the territory as possible.

I don't know how that's received. That's obviously fun and novel for the rest of the AI engineers, the people who are new to the scene. But for people like yourself, who build your own, you built Avalon just because you already found deficiencies with using Minecraft, is that valuable as an approach?

Oh, yeah. I love Voyager. Jim Fan, I think, is awesome. And I really like the Voyager paper, and I think it has a lot of really interesting ideas, which is like the agent can create tools for itself and then use those tools. And he had the idea of the curriculum as well, which is something that we talked about earlier.

Exactly. Exactly. And that's a lot of what we do. We built Avalon mostly because we couldn't use Minecraft very well to learn the things we wanted. And so it's not that much work to build our own. It took us, I don't know, we had eight engineers at the time, took about eight weeks.

So six weeks. Nice. Yeah. And OpenAI built their own as well. Right? Yeah, exactly. It's just nice to have control over our environment, to do our own sandbox, to really try to inspect our own research questions. But if you're doing something like experimenting with agents and trying to get them to do things, Minecraft is a really interesting environment.

And so Voyager has a lot of really interesting ideas in it. Yeah. Cool. One more element that we had on this list, which is context and memory. I think that's kind of like the foundational "RAM" of our era. I think Andrej Karpathy has already made this comparison, so there's nothing new here.

But that's just the amount of working knowledge that we can fit into one of these agents. And it's not a lot. Right? Especially if you need to get them to do long running tasks, if they need to self-correct from errors that they observe while operating in their environment. Do you see this as a problem?

Do you think we're going to just trend to infinite context and that'll go away? Or how do you think we're going to deal with it? When you talked about what's going to happen in the first wave and then in the second wave, I think what we'll see is we'll get relatively simplistic agents pretty soon.

And they will get more and more complex. And there's a future wave in which they are able to do these really difficult, really long running tasks. And the blocker to that future, one of the blockers is memory. And that was true of computers too. I think when von Neumann made the von Neumann architecture, he was like, "The biggest blocker will be memory.

We need this amount of memory," which is like, I don't remember exactly, like 32 kilobytes or something, "to store programs. And that will allow us to write software." He didn't say it this way because he didn't have these terms. And then that only really happened in the '70s with the microchip revolution.

And so it may be the case that we're waiting for some research breakthroughs or some other breakthroughs in order for us to have really good long running memory. And then in the meantime, agents will be able to do all sorts of things that are a little bit smaller than that.

I do think with the pace of the field, we'll probably come up with all sorts of interesting things. Like RAG is already very helpful. Good enough, you think? Maybe. Good enough for some things. How is it not good enough? I don't know. I just think about a situation where you want something that's like an AI scientist.

As a scientist, I have learned so much about my field. And a lot of that data is maybe hard to fine tune on or maybe hard to put into pre-training. A lot of that data, I don't have a lot of repeats of the data that I'm seeing. My understanding is so at the edge that if I'm a scientist, I've accumulated so many little data points.

And ideally, I'd want to store those somehow or use those to fine tune myself as a model somehow or have better memory somehow. I don't think RAG is enough for that kind of thing. But RAG is certainly enough for user preferences and things like that. What should I do in this situation?

What should I do in that situation? That's a lot of tasks. We don't have to be a scientist right away. I have a hard question, if you don't mind me being bold. I think the most comparable lab to Imbue is Adept. Whatever. A research lab with some amount of productization on the horizon, but not just yet.

Why should people work for Imbue over Adept? The way I think about it is I believe in our approach. Maybe this is a general question of competitors. And the way I think about it is we're in a historic moment. This is 1978 or something. Love it. Apple is about to start.

Lots of things are starting at that time. And IBM also exists and all of these other big companies exist. We know what we're doing. We're building reasoning foundation models, trying to make agents that actually work reliably. That are inspectable. That we can modify. That we have a lot of control over.

And I think we have a really special team and culture. And that's what we are. I have a sense of where we want to go, of really trying to help the computer be a much more powerful tool for us. And the type of thing that we're doing is we're trying to build something that enables other people to build agents.

And build something that really can be maybe something like an operating system for agents. I know that that's what we're doing. I don't really know what everyone else is doing. I talk to people and have some sense of what they're doing. And I think it's a mistake to focus too much on what other people are doing.

Because extremely focused execution on the right thing is what matters. And so to the question of why us: I think a strong focus on reasoning, which we believe is the biggest blocker; on inspectability, which we believe is really important for user experience and also for the power and capability of these systems.

Building good, non-leaky abstractions, which we believe is solving the core issue of agents, which is around reliability and being able to make them deployable. And then really seriously trying to use these things ourselves, every single day. And getting to something that we can actually ship to other people, that becomes something that is a platform.

It feels like it could be Mac or Windows. I love the dogfooding approach. That's extremely important. And you will not be surprised how many agent companies I talk to that don't use their own agent. Oh no! That's not good! That's a big surprise. Yeah, I think if we didn't use our own agents, then we would have all of these beliefs about how good they are.

The only other follow-up that I had, based on the answer you just gave, was do you see yourself releasing models, or do you see yourself... What are the artifacts that you want to produce that lead up to the general operating system that you want to have people use? A lot of people, just as a byproduct of their work, just to say, "Hey, I'm still shipping," release a model along the way. Adept took, I don't know, three years, but they released Persimmon recently.

Do you think that kind of approach is something on your horizon or do you think there's something else that you can release that can show people, "Here's the idea, not the end product, but here's the byproduct of what we're doing"? Yeah. I don't really believe in releasing things to show people, "Oh, here's what we're doing," that much.

I think as a philosophy, we believe in releasing things that will be helpful to other people. And so I think we may release models or we may release tools that we think will help agent builders. Ideally, we would be able to do something like that, but I'm not sure exactly what they look like yet.

I think more companies should get into the releasing evals and benchmarks game. Yeah. Something that we have been talking to agent builders about is co-building evals. So we build a lot of our own evals and every agent builder tells me basically evals are their biggest issue. And so, yeah, we're exploring right now.

And if you are building agents, this is like a call. If you are building agents, please reach out to me because I would love to figure out how we can be helpful based on what we've seen. Cool. Well, that's a good call to action. I know a bunch of people that I can send your way.

Cool. Great. Awesome. Yeah. We can zoom out to other interests now. We've got a lot of stuff. I saw from Lexica on the podcast, he had a lot of interesting questions on his website. You similarly have a lot of them. Yeah. I need to do this. I'm very jealous of people who have personal websites where they're like, here's the high level questions of goals of humanity that I want to set people on.

And I don't have that. This is great. This is good. It's never too late, Sean. Yeah. It's never too late. Exactly. There were a few that stuck out as related to your work that maybe you're kind of learning more about it. One is why are curiosity and goal orientation often at odds?

And from a human perspective, I get it, it's like, you know, would you want to like go explore things or kind of like focus on your career? How do you think about that from like an agent perspective, where it's like, should you just stick to the task and try and solve it as in the guardrails as possible?

Or should you look for alternative solutions? Yeah. This is a great question. The problem with these questions is that I'm still confused about them, so in our discussion I will not have good answers; I will still be confused. Why are curiosity and goal orientation so at odds?

I think one thing that's really interesting about agents, actually, is that they can be forked. We can take an agent that's executed to a certain place and say, okay, fork this and do a bunch of different things, try a bunch of different things.

Some of those agents can be goal oriented and some of them can be like more curiosity driven. You can prompt them in slightly different ways. And something I'm really curious about, like what would happen if in the future, you know, we were able to actually go down both paths.
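As a rough illustration of forking an agent into a goal-oriented branch and a curiosity-driven branch, here is a small sketch. The agent state dictionary, the prompts, and the `step` function are hypothetical, not tied to any particular framework or to Imbue's systems.

```python
# Illustrative sketch of forking an agent mid-task -- hypothetical state and API.
import copy

GOAL_PROMPT = "Stay focused on the original task and finish it efficiently."
CURIOSITY_PROMPT = "Explore alternative approaches, even if they seem tangential."

def fork_and_explore(agent_state: dict, step, n_steps: int = 5):
    """Fork one partially-executed agent into a goal-driven and a curious branch."""
    branches = {
        "goal_oriented": {**copy.deepcopy(agent_state), "system_prompt": GOAL_PROMPT},
        "curious": {**copy.deepcopy(agent_state), "system_prompt": CURIOSITY_PROMPT},
    }
    results = {}
    for name, state in branches.items():
        for _ in range(n_steps):
            state = step(state)      # advance each branch independently
        results[name] = state
    return results                   # compare or merge the branches afterwards
```

The point is simply that, unlike a person, the same partial execution can be copied and continued in two different modes at once.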

As a person, the reason I have this question on my website is that I find I can really only take one mode at a time. And I don't understand why. Is it inherent in the kind of context that needs to be held? That's why I think, from an agent perspective, forking is really interesting.

I can't fork myself to do both, but maybe I could fork an agent at a certain point in a task to explore both. How has the thinking changed for you as the funding of the company has changed? That's one thing that I think a lot of people in the space think about: should I raise venture capital?

Like, how should I get money? How do you feel your options to be curious versus goal oriented have changed as you've raised more money and the company has grown? That's really funny. Actually, things have not changed that much. So we raised our Series A, $20 million, in late 2021.

And our entire philosophy at that time was, and still kind of is: how do we figure out the stepping stones, collect the stepping stones that eventually let us build agents, these new kinds of computers that help us do bigger things? And there was a lot of curiosity in that.

And there was a lot of goal orientation in that. The curiosity led us to build CARBS, for example, this hyperparameter optimizer. Great name, by the way. Thank you. Is there a story behind that name? Yeah. Abe loves carbs. It's also cost-aware. So as soon as he came up with cost-aware, he was like, I need to figure out how to make this work.

But the cost awareness of it was really important. So that curiosity led us to this really cool hyperparameter optimizer. It's actually a big part of how we do our research: it lets us experiment on smaller models and have those experimental results carry over to larger ones. And you also published scaling laws for it, which is great.

I think the scaling laws papers, from OpenAI and I think from Google, were the greatest public service to machine learning that any research lab can do. Yeah. Totally. Yeah. And I think what was nice about CARBS is it gave us scaling laws for all sorts of hyperparameters.
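To give a flavor of the idea of cost-aware tuning combined with scaling laws for hyperparameters, here is a toy sketch. It is not the CARBS algorithm or its API; it only illustrates extrapolating one hyperparameter from cheap small-model sweeps to a larger run, with made-up numbers.

```python
# Toy illustration of hyperparameter scaling laws -- NOT the actual CARBS method.
# We tune at several small model sizes, then fit a power law to extrapolate the
# best setting (here, learning rate) to a model too expensive to sweep directly.
import numpy as np

# Hypothetical sweep results: best learning rate found at each small model size.
model_sizes = np.array([1e7, 3e7, 1e8, 3e8])        # parameters
best_lrs    = np.array([6e-4, 4e-4, 3e-4, 2e-4])    # tuned on cheap small runs

# Fit lr ~= a * size^b in log-log space (a simple "scaling law" for one knob).
b, log_a = np.polyfit(np.log(model_sizes), np.log(best_lrs), 1)

def predicted_lr(size: float) -> float:
    return float(np.exp(log_a) * size ** b)

print(predicted_lr(1e9))   # extrapolated learning rate for a 1B-parameter run
```

The "cost-aware" part of the real system goes further, weighing how much each candidate configuration costs to evaluate when deciding what to try next.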

And then there's some goal oriented parts. Like Avalon, it was like a six to eight week sprint for all of us. And we got this thing out. And then now, different projects do more curiosity or more goal orientation at different times. Another one of your questions that we highlighted was, how can we enable artificial agents to permanently learn new abstractions and processes?

I think this might be called online learning. Yeah. So I struggle with this because of that scientist example I gave. As a scientist, I've permanently learned a lot of new things, and I've updated and created new abstractions and learned them pretty reliably. And you were talking about, okay, we have this RAM that we can store learnings in.

But how well does online learning actually work? And the answer right now seems to be, as models get bigger, they fine tune faster. So they're more sample efficient as they get bigger. Because they already had that knowledge in there, you're just unlocking it. Maybe. Partly, maybe because they already have some subset of the representation.

Yeah. So they just memorize things more easily, which is good. So maybe this question is going to be solved. But I still don't know what the answer is. I don't know, maybe have a platform that continually fine-tunes for you as you work in that domain, which is something I'm working on.
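As a sketch of what a platform that continually fine-tunes as you work in a domain might look like, here is one possible loop. The `collect_new_examples` and `finetune` functions are hypothetical stand-ins for a real data-capture and training pipeline; this is not a description of anyone's actual product.

```python
# Sketch of a "continual fine-tuning" loop -- purely illustrative.
import time

def continual_finetune_loop(model, collect_new_examples, finetune,
                            min_batch: int = 64, poll_seconds: int = 3600):
    """Periodically fold newly observed domain examples back into the model."""
    buffer = []
    while True:
        buffer.extend(collect_new_examples())   # e.g. corrected agent outputs
        if len(buffer) >= min_batch:
            model = finetune(model, buffer)     # e.g. a lightweight adapter update
            buffer = []
        time.sleep(poll_seconds)
```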

Well, that's great. We would love to use that. We'll talk more. OK. So two more questions just about your general activities; you've been very active in the San Francisco tech scene. You're a founding member of South Park Commons. Oh, yeah, that's true. Tell me more, because by the time I knew about SPC, it was already a very established thing.

But what was it like in the early days? What was the story there? Yeah, the story is Ruchi, who started it, was the VP of operations at Dropbox. And I was the chief of staff, and we worked together very closely. She's actually one of the investors in Sorceress. And SPC is an investor in Imbue.

And at that time, Ruchi was like, "You know, I would like to start a space for people who are figuring out what's next." And we were figuring out what's next post-Ember, those three months. And she was like, "Do you want to just hang out in this space?" And we're like, "Sure." And it was a really good group, I think, Wasim and Jeff from Pilot, the folks from Zulip, and a bunch of other people at that time.

It was a really good group. We just hung out. There was no programming. It's much more official than it was at that time. Yeah. Now it's like a YC before YC type of thing. That's right. Yeah. At that time, we literally, it was a bunch of friends hanging out in the space together.

And was this concurrent with the archive? Oh, yeah, actually. I think we started the archive around the same time. You're just really big into community. But also, I run a hacker house, right? And I'm also part of, hopefully, what becomes the next South Park Commons or whatever. But what are the principles in organizing communities like that with really exceptional people who go on to do great things?

Do you have to be really picky about who joins? Did all your friends just magically turn out super successful like that? Yeah, I think so. I think we... You know it's not normal, right? This is very special. And a lot of people want to do that and fail. You had the co-authors of GPT-3 in your house.

That's true. And a lot of other really cool people that you'll eventually hear about. And co-founders of Pilot, and anyone else you want to... I don't want to make you pick among your friends, but there's some magic special sauce in getting people together in one workspace, living space, whatever. And that's part of why I'm here in San Francisco.

And I would love for more people to learn about it and also maybe get inspired to build their own. One adage we had when we started the archive was you become the average of the five people closest to you. Yes. And I think that's roughly true. And good people draw good people.

So there are really two things. One, we were quite picky, and it mattered a lot to us: is this someone who, if they were hanging out in the living room, we'd be really excited to come hang out with? Yeah. Two, I think we did a really good job of creating a high-growth environment, and an environment where people felt really safe.

We actually apply these things to our team and it works remarkably well there too. So I spend a lot of time on, basically, how do I create safe spaces for people, where it's not just safety in a rote sense, but a safe space where people really feel inspired by each other. And I think at the archive, we really made each other better.

My friend, Michael Nielsen called it a self-actualization machine. My goodness. And I think, yeah, people came in and- Was he a part of the archive? He was not, but he hung out a lot. I don't remember. Friend of the archive. A friend of the archive, yeah. Like the culture was that we learned a lot of things from each other about how to make better life systems and how to think about ourselves and psychological debugging.

And a lot of us were founders, so having other founders going through similar things was really helpful. And a lot of us worked in AI, so having other people to talk about AI with was really helpful. And so I think all of those things led to a certain kind of idea flux; I think a lot about idea flux, and about the kind of default habits or default impulses a place creates.

It led to an idea flux and a set of default impulses that resulted in some really interesting things, and in us doing much bigger things, I think, than we otherwise would have decided to do, because it felt like taking risks was less risky. So that's something we do a lot of on the team: how do we make it so that taking risks is less risky?

And there's a term called scenius. Yes. I was thinking Kevin Kelly. Kevin Kelly, scenius. I was going to feed you that word, but I didn't want to, like, impress you. Yes. Yes. I think maybe a lot of what I'm interested in is constructing a kind of scenius. And the archive was definitely a scenius in a particular way, or getting toward a scenius in a particular way.

And Jason Benn, my archive housemate, who now runs The Neighborhood, has a good way of putting it: if genius is from your genes, scenius is from your scene. And yeah, I think a lot of the community-building impulse comes from this interest in what kind of idea flux can be created.

There's a question of, why did Xerox PARC come out with all of this interesting stuff? It's their scenius. Why did Bell Labs come out with all this interesting stuff? Maybe it's their scenius. Why didn't the transistor come out of Princeton, or the other groups working on it at the time?

I just think it's remarkable how you hear a lot about Alan Kay. And I just read a bit, and apparently Alan Kay was, like, the most junior guy at Xerox PARC. Yeah. Definitely. He's just the one who talks about it. He talks the most. Yeah, exactly. Yeah. So, you know, hopefully I'm also working towards contributing to that scenius.

I called mine the most provocative name: the Arena. Oh, interesting. That's quite provocative. In the arena. So are you fighting other people in the arena? No. No. You never know. We're in the arena. We're in the arena trying stuff, as they say. You are also a GP at Outset Capital, where you also co-organize the Thursday Nights in AI events, where hopefully someday I'll eventually speak.

You're on the roster. I'm on the roster. Thank you so much. So why spend time being a VC and organizing all these events? You're also a very busy CEO and, you know, why spend time with that? Why is that an important part of your life? Yeah. So I actually really like helping founders.

So Allie, my investing partner, is fortunately amazing and she does everything for the fund. So she, like, hosts the Thursday Night events and she finds folks who we could invest in and she does basically everything. Josh and I are her co-partners. So Allie was our former chief of staff at Sorceress and we just thought she was amazing.

And she wanted to be an investor and Josh and I also, like, care about helping founders and kind of, like, giving back to the community. What we didn't realize at the time when we started the fund is that it would actually be incredibly helpful for Imbue. So talking to AI founders who are building agents and working on, you know, similar things is really helpful.

They could potentially be our customers and they're trying out all sorts of interesting things. And I think being an investor, looking at the space from the other side of the table, it's just a different hat that I routinely put on and it's helpful to see the space from the investor lens as opposed to from the founder lens.

So I find that kind of, like, hat switching valuable. It maybe would lead us to do slightly different things. Let's just wrap with the lightning round. Okay. So we have three questions. Acceleration, exploration, and then a takeaway. So the acceleration question is what's something that already happened in AI that you thought would take much longer to be here?

I think the rate at which we discover new capabilities of existing models and kind of, like, build hacks on top of them to make them work better is something that has been surprising and awesome. And the rate of kind of, like, the community, the research community building on its own ideas.

Cool. Exploration/request for startups. If you weren't building Imbue, what AI company would you build? Every founder has, like, their, like, number two. Really? Yeah. I don't know. Wow. I cannot imagine building any other thing than Imbue. Wow. Well, that's a great answer, too. That's an interesting thing. It's, like, obviously the thing to build.

Okay. It's, like, obviously work on the fundamental platform. Yeah. So that was my attempt at innovating on this question, but the previous version was: what is the most interesting unsolved question in AI? Yeah. My answer is kind of boring, but I think the most interesting unsolved questions are the questions of how we make these stochastic systems into things that we can reliably use and build on top of.

And, yeah, for the takeaway: what's one message you want everyone to remember? Maybe two things. One is, I didn't think in my lifetime I would necessarily be able to work on the things I'm excited to work on in this moment, but we're in a historic moment, one where we'll look back and be like, "Oh, my God.

The future was invented in these years." There is maybe a set of messages to take away from that. One is, like, AI is a tool, like any technology, and, you know, when it comes to things, like, what might the future look like? We like to think about it as it's, like, just a better computer.

It's like a much better, much more powerful computer that gives us a lot of free intellectual energy that we can now, like, solve so many problems with. You know, there are so many problems in the world where we're like, "Oh, it's not worth a person thinking about that," and so things get worse and things get worse.

No one wants to work on maintenance, and, like, this technology gives us the potential to actually be able to, like, allocate intellectual energy to all of those problems, and the world could be much better, like, could be much more thoughtful because of that. I'm so excited about that, and there are definitely risks and dangers, and we actually do a fair amount of work on the policy side.

On the safety side, we think about safety and policy in terms of engineering, theory, and also regulation, comparing to the automobile or the airplane or any new technology: there's a set of new possible capabilities and a set of new possible dangers that are unlocked with every new technology. So on the engineering side, we think a lot about engineering safety: how do we actually engineer these systems so that they are inspectable? That's why we reason in natural language, so that the systems are very inspectable, so that we can stop things if anything weird is happening.

That's why we don't think end-to-end black boxes are a good idea. On the theoretical side, we really believe in deeply understanding what these models are learning. When we actually fine-tune on individual examples, what's going on? When we're pre-training, what's going on? And in building debugging tools for these agents to understand what's going on.

And then on the regulation side, I think there's actually a lot of regulation that already covers many of the dangers people are talking about, and there are areas where there's not much regulation, so we focus on those areas. For example, we built an agent that helped us analyze the roughly 20,000 pages of policy proposals submitted to the Department of Commerce's request for AI policy proposals.

We looked at what problems people brought up and what solutions they presented, then did a summary analysis, and built agents to do that. And now the Department of Commerce is interested in using that as a tool to analyze proposals.
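As a rough sketch of the kind of map-then-summarize analysis described here, the snippet below extracts problems and solutions per document and then summarizes across them. The `llm` completion function is a hypothetical "prompt in, text out" placeholder; this is not the actual tool.

```python
# Rough sketch of a map-reduce document analysis agent -- illustrative only.
# `llm` is a hypothetical completion call: it takes a prompt string and returns text.
def analyze_proposals(documents: list[str], llm) -> str:
    # Map: pull structured notes (problems raised, solutions proposed) per document.
    notes = [
        llm(
            "From this policy proposal, list (1) the problems it raises and "
            "(2) the solutions it proposes, as short bullet points:\n\n" + doc
        )
        for doc in documents
    ]
    # Reduce: summarize recurring problems and solutions across all the notes.
    return llm(
        "Here are notes extracted from many policy proposals. Summarize the "
        "most common problems and the most common proposed solutions:\n\n"
        + "\n\n".join(notes)
    )
```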

And so a lot of what we're trying to do on the regulation side is actually figure out where regulation is missing, and how we can, in a very targeted way, try to fill those missing areas. So I guess if I were to say what the takeaways are: the future could be really exciting if we can actually get agents that are able to do these bigger things.

Reasoning is the biggest blocker, plus, like, these sets of abstractions to make things more robust and reliable. And there are, you know, things where we have to be quite careful and thoughtful about how do we deploy these, and what kind of regulation should go along with it, so that this is actually a technology that, when we deploy it, it is protective to people, and not harmful.

Awesome. Wonderful. Yeah. Thank you so much for your time, Kanjun. Cool. Thank you. That's it. Thank you so much. (upbeat music)