
Why AI Agents Don't Work (yet) - with Kanjun Qiu of Imbue


Chapters

0:00 Introductions
7:13 The origin story of Imbue
11:26 Imbue's approach to training large foundation models optimized for reasoning
14:20 Imbue's goals to build an "operating system" for reliable, inspectable AI agents
17:51 Imbue's process of developing internal tools and interfaces to collaborate with AI agents
19:47 Imbue's focus on improving reasoning capabilities in models, using code and other data
21:33 The value of using both public benchmarks and internal metrics to evaluate progress
21:43 Lessons learned from developing the Avalon research environment
23:31 The limitations of pure reinforcement learning for general intelligence
32:12 Imbue's vision for building better abstractions and interfaces for reliable agents
33:49 Interface design for collaborating with, rather than just communicating with, AI agents
39:51 The future potential of an agent-to-agent protocol
42:53 Leveraging approaches like critiquing between models and chain of thought
47:30 Kanjun's philosophy on enabling team members as creative agents at Imbue
59:54 Kanjun's experience co-founding the communal co-living space The Archive
60:22 Lightning Round

Whisper Transcript

00:00:00.000 | - Hey everyone, welcome to the Latent Space Podcast.
00:00:10.200 | This is Alessio, partner and CTO in residence at Decibel Partners, and I'm joined by my
00:00:14.560 | co-host Swyx, founder of Smol AI.
00:00:17.360 | - Hey, and today in the studio we have Kanjun from Imbue.
00:00:20.840 | Welcome.
00:00:21.840 | - Thank you.
00:00:22.840 | - So, you and I have, I guess, crossed paths a number of times, and you were formerly named
00:00:28.920 | Generally Intelligent, and you've just announced your rename and rebrand in a huge, humongous
00:00:34.800 | way, so congrats on all that.
00:00:35.800 | - Thank you.
00:00:36.800 | - And we're here to dive into deeper detail on Imbue.
00:00:39.700 | We'd like to introduce you just on a high-level basis, but then have you go into a little
00:00:45.040 | bit more of your personal side.
00:00:46.280 | So, you graduated your BS and MS at MIT, and you also spent some time at the MIT Media
00:00:52.280 | Lab, one of the most famous, I guess, computer hacking labs in the world.
00:00:57.040 | - Yeah, true.
00:00:58.040 | - What were you doing at that time?
00:00:59.040 | - Yeah, I built electronic textiles, so boards that make it possible to make soft clothing.
00:01:08.680 | You can sew circuit boards into clothing, and then make clothing electronic.
00:01:11.960 | It's not that useful.
00:01:12.960 | - You wrote a book about that?
00:01:13.960 | - I wrote a book about it, yeah.
00:01:14.960 | - Yeah, yeah, yeah.
00:01:15.960 | - Basically, the idea was to teach young women computer science in this route, because what
00:01:20.120 | we found was that young girls, they would be really excited about math until about sixth
00:01:24.720 | grade, and then they're like, "Oh, math is not good anymore, because I don't feel like
00:01:30.520 | the type of person who does math or does programming, but I do feel like the type of person who
00:01:34.240 | does crafting."
00:01:35.240 | So, it's like, "Okay, what if you combine the two?"
00:01:37.880 | - Yeah, yeah, awesome, awesome.
00:01:40.280 | Always more detail to dive into on that.
00:01:43.400 | But then you graduated MIT, and you went straight into BizOps at Dropbox, where you were eventually
00:01:48.220 | chief of staff, which is a pretty interesting role we can dive into later.
00:01:51.160 | And then it seems like the founder bug hit you.
00:01:52.640 | You were basically a three-times founder at Ember, Sorceress, and now at Generally Intelligent/Imbue.
00:01:57.920 | What should people know about you on the personal side that's not on your LinkedIn, that's something
00:02:02.280 | you're very passionate about outside of work?
00:02:04.000 | - Yeah, I think if you ask any of my friends, they would tell you that I'm obsessed with
00:02:07.760 | agency, like human agency and human potential.
00:02:10.480 | - That's work.
00:02:11.480 | Come on.
00:02:12.480 | - That's not work.
00:02:13.480 | What are you talking about?
00:02:15.720 | - So, what's an example of human agency that you try to promote?
00:02:20.280 | - I feel like, with all of my friends, I have a lot of conversations with them that's helping
00:02:24.240 | figure out what's blocking them.
00:02:26.520 | I guess I do this with a team kind of automatically, too.
00:02:29.680 | And I think about it for myself often, building systems.
00:02:32.320 | I have a lot of systems to help myself be more effective.
00:02:35.360 | At Dropbox, I used to give this onboarding talk called "How to Be Effective," which people
00:02:39.920 | liked.
00:02:40.920 | I think 1,000 people heard this onboarding talk, and I think maybe Dropbox was more effective.
00:02:45.400 | I think I just really believe that, as humans, we can be a lot more than we are, and it's
00:02:51.200 | what drives everything.
00:02:52.200 | I guess completely outside of work, I do dance.
00:02:54.680 | I do partner dance.
00:02:56.040 | - Nice.
00:02:57.040 | - Yeah.
00:02:58.040 | - Yeah, lots of interest in that stuff, especially in the group living houses in San Francisco,
00:03:03.720 | which I've been a little bit part of, and you've also run one of those.
00:03:07.400 | - That's right, yeah.
00:03:08.400 | I started The Archive with Josh, my co-founder, and a couple other folks in 2015.
00:03:13.160 | That's right.
00:03:14.160 | We're three.
00:03:15.160 | Our housemates built, so.
00:03:16.160 | - Was that the, I guess, the precursor to Generally Intelligent, that you started doing
00:03:22.600 | more things with Josh?
00:03:23.600 | Is that how that relationship started?
00:03:25.160 | - Yeah, so Josh and I are, this is our third company together.
00:03:30.040 | Our first company, Josh poached me from Dropbox for Ember, and there, we built a really interesting
00:03:37.960 | technology, a laser raster projector VR headset, and then we were like, "VR is not the thing
00:03:44.100 | we're most passionate about," and actually, it was kind of early days when we both realized
00:03:49.280 | we really do believe that, in our lifetimes, computers that are intelligent are going to
00:03:54.800 | be able to allow us to do much more than we can do today as people and be much more as
00:03:59.400 | people than we can be today.
00:04:02.800 | At that time, we actually, after Ember, we were like, "Should we work on AI research
00:04:06.880 | or start an AI lab?"
00:04:07.880 | A bunch of our housemates were joining OpenAI, and we actually decided to do something more
00:04:12.900 | pragmatic to apply AI to recruiting and to try to understand, like, "Okay, if we're actually
00:04:17.140 | trying to deploy these systems in the real world, what's required?"
00:04:20.580 | And that was Sorceress.
00:04:21.960 | That taught us so much. That was maybe an AI agent in a lot of ways, like,
00:04:28.280 | what does it actually take to make a product that people can trust and rely on?
00:04:34.400 | I think we never really fully got there, and it's taught me a lot about what's required,
00:04:40.100 | and it's kind of like, I think, informed some of our approach and some of the way that we
00:04:43.380 | think about how these systems will actually get used by people in the real world.
00:04:48.580 | Just to go one step deeper on that, so you're building AI agents in 2016, before it was
00:04:53.900 | cool.
00:04:54.900 | You got some milestone, you raised $30 million, something was working.
00:04:59.500 | So what do you think you succeeded in doing, and then what did you try to do that did not
00:05:04.740 | pan out?
00:05:05.740 | Yeah.
00:05:06.740 | So the product worked quite well.
00:05:07.740 | So Sorceress was an AI system that basically kind of looked for candidates that could be
00:05:13.580 | a good fit and then helped you reach out to them.
00:05:16.460 | And this was a little bit early.
00:05:19.180 | We didn't have language models to help you reach out, so we actually had a team of writers
00:05:21.980 | that customized emails, and we automated a lot of the customization.
00:05:27.100 | But the product was pretty magical.
00:05:30.420 | Candidates would just be interested and land in your inbox, and then you can talk to them.
00:05:34.220 | And as a hiring manager, that's such a good experience.
00:05:38.220 | I think there were a lot of learnings, both on the product and market side.
00:05:41.780 | On the market side, recruiting is a market that has endogenously high churn, which means
00:05:46.980 | people start hiring, then we fill the role for them, and they stop hiring.
00:05:50.540 | So the more we succeed, the more they...
00:05:52.940 | It's like the whole dating business.
00:05:54.280 | It's the dating business.
00:05:55.280 | Exactly.
00:05:56.280 | Exactly.
00:05:57.280 | It's exactly the same problem as the dating business.
00:05:59.580 | And I was really passionate about like, can we help people find work that is more exciting
00:06:04.080 | for them?
00:06:05.080 | A lot of people are not excited about their jobs, and a lot of companies are doing exciting
00:06:07.980 | things, and the matching could be a lot better.
00:06:10.420 | But the dating business kind of phenomenon put a damper on that.
00:06:15.900 | So we had a good, it's actually a pretty good business, but as with any business with relatively
00:06:23.620 | high churn, the bigger it gets, the more revenue we have, the slower growth becomes.
00:06:28.060 | Because, like, if you lose 30% of that revenue year over year, then it becomes a worse business.
00:06:34.500 | So that was the dynamic we noticed quite early on after our Series A.
00:06:40.140 | I think the other really interesting thing about it is we realized what was required
00:06:44.460 | for people to trust that these candidates were like well-vetted and had been selected
00:06:48.640 | for a reason.
00:06:50.260 | And it's what actually led us, a lot of what we do at Imbue is working on interfaces to
00:06:54.620 | figure out how do we get to a situation where when you're building and using agents, these
00:07:00.500 | agents are trustworthy to the end user.
00:07:03.140 | That's actually one of the biggest issues with agents that go off and do longer range
00:07:06.780 | goals is that I have to trust, did they actually think through the situation?
00:07:11.660 | And that really informed a lot of our work today.
00:07:13.460 | Yeah.
00:07:14.460 | Let's jump into GI now, Imbue.
00:07:17.180 | When did you decide recruiting was done for you, and you were ready for the next challenge?
00:07:23.380 | And how did you pick the agent space?
00:07:25.780 | I feel like in 2021, it wasn't as mainstream.
00:07:29.700 | Yeah.
00:07:30.700 | So the LinkedIn says that it started in 2021, but actually we started thinking very seriously
00:07:34.840 | about it in late 2019, early 2020.
00:07:39.500 | Not exactly this idea, but in late 2019, so I mentioned our housemates, Tom Brown and
00:07:47.120 | Ben Mann, they're the first two authors on GPT-3.
00:07:49.300 | So what we were seeing is that scale is starting to work and language models probably will
00:07:55.320 | actually get to a point where with hacks, they're actually going to be quite powerful.
00:07:59.460 | And it was hard to see that at the time, actually, because GPT-3, the early versions of it, there
00:08:06.700 | are all sorts of issues.
00:08:07.700 | We're like, "Oh, that's not that useful."
00:08:08.940 | But we could kind of see, okay, you keep improving it in all of these different ways and it'll
00:08:13.900 | get better.
00:08:15.480 | And so what Josh and I were really interested in is, how can we get computers that help
00:08:21.460 | us do bigger things?
00:08:24.140 | There's this kind of future where I think a lot about, if I were born in 1900 as a woman,
00:08:30.540 | my life would not be that fun.
00:08:32.500 | I'd spend most of my time carrying water and literally getting wood to put in the stove
00:08:38.260 | to cook food and cleaning and scrubbing the dishes and getting food every day because
00:08:44.300 | there's no refrigerator.
00:08:45.700 | All of these things, very physical labor.
00:08:48.060 | And what's happened over the last 150 years since the Industrial Revolution is we've kind
00:08:52.460 | of gotten free energy.
00:08:54.580 | Energy is way more free than it was 150 years ago.
00:08:58.780 | And so as a result, we've built all these technologies like the stove and the dishwasher
00:09:02.060 | and the refrigerator.
00:09:03.060 | And we have electricity and we have infrastructure, running water, all of these things that have
00:09:07.720 | totally freed me up to do what I can do now.
00:09:10.460 | And I think the same thing is true for intellectual energy.
00:09:14.520 | We don't really see it today because we're so in it, but our computers have to be micromanaged.
00:09:20.960 | Part of why people are like, "Oh, you're stuck to your screen all day."
00:09:23.780 | Well, we're stuck to our screen all day because literally nothing happens unless I'm doing
00:09:27.380 | something in front of my screen.
00:09:28.380 | I can't send my computer off to do a bunch of stuff for me.
00:09:32.300 | There is a future where that's not the case, where I can actually go off and do stuff and
00:09:37.080 | trust that my computer will pay my bills and figure out my travel plans and do the detailed
00:09:41.780 | work that I am not that excited to do so that I can be much more creative and able to do
00:09:47.020 | things that I as a human am very excited about and collaborate with other people.
00:09:50.660 | And there are things that people are uniquely suited for.
00:09:54.460 | So that's kind of always been the thing that is really exciting, has been really exciting
00:10:01.300 | to me.
00:10:02.300 | I'm a mathematician.
00:10:03.300 | I've known for a long time I think that AI, whatever AI is, it would happen in our lifetimes.
00:10:12.040 | And the personal computer kind of started giving us a bit of free intellectual energy.
00:10:16.320 | And this is like really the explosion of free intellectual energy.
00:10:19.120 | So in early 2020, we were thinking about this and what happened was self-supervised learning
00:10:25.120 | basically started working across everything.
00:10:27.400 | So it worked in language.
00:10:30.040 | SimCLR came out.
00:10:31.040 | MoCo, Momentum Contrast, had come out earlier in 2019.
00:10:35.680 | SimCLR came out in early 2020 and we were like, okay, for the first time, self-supervised
00:10:38.920 | learning is working really well across images and text and suspect that like, okay, actually
00:10:44.040 | it's the case that machines can learn things the way that humans do.
00:10:48.180 | And if that's true, if they can learn things in a fully self-supervised way, because like
00:10:52.620 | as people, we are not supervised.
00:10:54.320 | We like go Google things and try to figure things out.
00:10:56.740 | So if that's true, then like what the computer could be is much different, you know, is much
00:11:01.900 | bigger than what it is today.
00:11:04.120 | And so we started exploring ideas around like, how do we actually go?
00:11:08.860 | We didn't think about the fact that we could actually just build a research lab.
00:11:12.580 | So we were like, okay, what kind of startup could we build to like leverage self-supervised
00:11:17.060 | learning so that it eventually becomes something that allows computers to become much more
00:11:22.400 | kind of able to do bigger things for us.
00:11:25.340 | And that became Generally Intelligent, which started as a research lab.
00:11:30.380 | And so your mission is you aim to rekindle the dream of the personal computer.
00:11:36.340 | So when did it go wrong, and what are, like, your first products and kind of like user-facing
00:11:42.940 | things that you're building to rekindle it?
00:11:46.020 | Yeah.
00:11:47.020 | So what we do at Imbue is we train large foundation models optimized for reasoning.
00:11:53.340 | And the reason for that is because reasoning is actually, we believe the biggest blocker
00:11:57.580 | to agents or systems that can do these larger goals.
00:12:01.140 | If we think about, you know, something that writes an essay, like when we write an essay,
00:12:06.900 | we like write it, we don't just output it and then we're done.
00:12:10.180 | We like write it and then we look at it and we're like, oh, I need to do more research
00:12:13.540 | on that area.
00:12:14.540 | I'm going to go do some research and figure it out and come back and, oh, actually it's
00:12:19.380 | not quite right, the structure of the outline, so I'm going to rearrange the outline, rewrite it.
00:12:24.540 | It's this very iterative process and it requires thinking through like, okay, what am I trying
00:12:29.580 | to do?
00:12:30.580 | Is the goal correct?
00:12:31.900 | Also like, has the goal changed as I've learned more?
00:12:35.140 | Also, you know, as a tool, like when should I ask the user questions?
00:12:39.340 | I shouldn't ask them questions all the time, but I should ask them questions in higher
00:12:42.900 | risk situations.
00:12:44.860 | How certain am I about the like flight I'm about to book?
00:12:50.100 | There are all of these notions of like risk certainty, playing out scenarios, figuring
00:12:53.300 | out how to make a plan that makes sense, how to change the plan, what the goal should be,
00:12:58.100 | that are things, you know, that we lump under the bucket of reasoning.
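To make that "ask the user only in higher-risk situations" idea concrete, here is a minimal sketch of such a loop in Python. Everything in it (the `Step` fields, the thresholds, the callables) is hypothetical illustration rather than Imbue's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    description: str   # what the agent intends to do, e.g. "book flight UA 123"
    risk: float        # estimated cost of a mistake on this step, 0..1
    certainty: float   # the model's confidence that the step is correct, 0..1

def run_plan(steps: list[Step],
             execute: Callable[[Step], None],
             ask_user: Callable[[Step], bool],
             risk_threshold: float = 0.7,
             certainty_threshold: float = 0.8) -> None:
    """Execute a plan, pausing for confirmation only on high-risk or low-certainty steps."""
    for step in steps:
        needs_confirmation = step.risk >= risk_threshold or step.certainty < certainty_threshold
        if needs_confirmation and not ask_user(step):
            # The user rejected the step: stop and replan instead of plowing ahead.
            print(f"Replanning after user rejected: {step.description}")
            return
        execute(step)
```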
00:13:03.060 | And models today, they're not optimized for reasoning.
00:13:05.260 | It turns out that there's not actually that much explicit reasoning data on the internet
00:13:09.580 | as you would expect, and so we get a lot of mileage out of optimizing our models for reasoning
00:13:14.380 | in pre-training.
00:13:15.660 | And then on top of that, we build agents ourselves.
00:13:19.380 | I can get into, we really believe in serious use, like really seriously using the systems
00:13:23.460 | and trying to get to an agent that we can use every single day, tons of agents that
00:13:27.180 | we can use every single day.
00:13:28.780 | And then we experiment with interfaces that help us better interact with the agents.
00:13:33.380 | So those are some set of things that we do on the kind of model training and agent side.
00:13:39.420 | And then the initial agents that we build, a lot of them are trying to help us write
00:13:44.140 | code better because code is most of what we do every day.
00:13:47.580 | And then on the infrastructure and theory side, we actually do a fair amount of theory
00:13:51.100 | work to understand like how do these systems learn?
00:13:53.860 | And then also like what are the right abstractions for us to build good agents with, which we
00:13:58.740 | can get more into.
00:14:00.180 | And if you look at our website, we have a lot of tools.
00:14:03.820 | We build a lot of tools internally.
00:14:05.020 | We have a like really nice automated hyperparameter optimizer.
00:14:08.420 | We have a lot of really nice infrastructure.
00:14:10.580 | And it's all part of the belief of like, okay, let's try to make it so that the humans are
00:14:15.580 | doing the things humans are good at as much as possible.
00:14:18.620 | So out of our very small team, we get a lot of leverage.
00:14:21.180 | And so would you still categorize yourself as a research lab now, or are you now in startup
00:14:24.860 | mode?
00:14:25.860 | Is that a transition that is conscious at all?
00:14:28.420 | That's a really interesting question.
00:14:29.860 | I think we've always intended to build, you know, to try to build the next version of
00:14:34.420 | the computer, enable the next version of the computer.
00:14:37.820 | The way I think about it is there is a right time to bring a technology to market.
00:14:41.700 | So Apple does this really well.
00:14:43.780 | Actually, iPhone was under development for 10 years, AirPods for five years.
00:14:48.620 | And Apple has a story where, you know, iPhone, the first multi-touch screen was created.
00:14:54.240 | They actually were like, oh, wow, this is cool.
00:14:57.060 | Let's like productionize iPhone.
00:14:58.060 | They actually brought, they like did some work trying to productionize it and realized
00:15:02.460 | this is not good enough.
00:15:03.760 | And they put it back into research to try to figure out like, how do we make it better?
00:15:06.580 | What are the interface pieces that are needed?
00:15:08.480 | And then they brought it back into production.
00:15:09.700 | So I think of production and research as kind of like these two separate phases.
00:15:13.940 | And internally, we have that concept as well, where like things need to be done in order
00:15:19.880 | to get to something that's usable.
00:15:21.520 | And then when it's usable, like eventually we figure out how to productize it.
00:15:24.740 | What's the culture like to make that happen, to have both, kind of like, product-oriented and
00:15:29.940 | research-oriented people?
00:15:30.940 | And as you think about building the team, I mean, you just raised 200 million, I'm sure
00:15:34.280 | you want to hire more people.
00:15:36.680 | What are like the right archetypes of people that work at Imbue?
00:15:41.460 | Yeah, I would say we have a very unique culture in a lot of ways.
00:15:44.920 | I think a lot about social process design.
00:15:46.880 | So how do you design social processes that enable people to be, you know, effective?
00:15:53.080 | I like to think about team members as creative agents.
00:15:55.900 | So because most companies, they think of their people as assets.
00:16:01.000 | And they're very proud of this.
00:16:02.340 | And I think about like, okay, what is an asset?
00:16:04.660 | It's something you own, that provides you value that you can discard at any time.
00:16:08.780 | This is a very low bar for people.
00:16:10.320 | This is not what people are.
00:16:12.520 | And so we try to enable everyone to be a creative agent and to really unlock their superpowers.
00:16:17.760 | So a lot of the work I do, you know, I was mentioning earlier, I'm like obsessed with
00:16:21.280 | agency.
00:16:22.280 | A lot of the work I do with team members is try to figure out like, you know, what are
00:16:25.760 | you really good at?
00:16:26.760 | What really gives you energy and where can we put you such that, and how can I help you
00:16:31.040 | unlock that and grow that?
00:16:34.120 | So much of our work, you know, in terms of team structure, like much of our work actually
00:16:37.760 | comes from people.
00:16:39.200 | CARBS, our hyperparameter optimizer, came from Abe trying to automate his own research process,
00:16:46.000 | doing hyperparameter optimization.
00:16:47.880 | And he actually pulled some ideas from plasma physics
00:16:49.960 | (he's a plasma physicist) to make the local search work.
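Not CARBS itself, but the general shape of local search over hyperparameters (keep perturbing the best configuration seen so far) can be sketched in a few lines; the objective and parameters below are made up for illustration.

```python
import random

def local_search(objective, initial: dict, n_trials: int = 50, scale: float = 0.2) -> dict:
    """Toy local-search tuner: perturb the best configuration seen so far and keep improvements."""
    best_cfg, best_score = dict(initial), objective(initial)
    for _ in range(n_trials):
        candidate = {
            k: v * (1 + random.uniform(-scale, scale)) if isinstance(v, float) else v
            for k, v in best_cfg.items()
        }
        score = objective(candidate)
        if score > best_score:
            best_cfg, best_score = candidate, score
    return best_cfg

# Usage: maximize a made-up objective over learning rate and weight decay.
best = local_search(lambda c: -(c["lr"] - 3e-4) ** 2 - c["wd"], {"lr": 1e-3, "wd": 0.01})
```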
00:16:53.040 | A lot of our work on evaluations comes from a couple members of our team who are like
00:16:56.520 | obsessed with evaluations.
00:16:58.120 | We do a lot of work trying to figure out like, how do you actually evaluate if the model
00:17:01.000 | is getting better?
00:17:02.000 | Is the model making better agents?
00:17:03.280 | Is the agent actually reliable?
00:17:05.640 | And so a lot of things kind of like, I think of people as making the, like, them-shaped blob
00:17:09.880 | inside Imbue.
00:17:11.960 | And I think, you know, yeah, that's the kind of person that we're hiring for.
00:17:17.760 | We're hiring product engineers and data engineers and research engineers and all these roles.
00:17:22.960 | You know, we have a project, we have projects, not teams.
00:17:27.000 | We have a project around data collection and data engineering.
00:17:30.300 | That's actually one of the key things that improve the model performance.
00:17:34.600 | We have a pre-training kind of project and with some fine tuning as part of that.
00:17:39.360 | And then we have an agent's project that's like trying to build on top of our models
00:17:42.960 | as well as use other models in the outside world to try to make agents that then we actually
00:17:49.240 | use as programmers every day.
00:17:50.680 | So all sorts of different projects.
00:17:52.640 | As a founder, you are now sort of a capital allocator among all of these different investments
00:17:57.440 | effectively at different projects.
00:18:00.380 | And I was interested in how you mentioned that you're optimizing for improving reasoning
00:18:06.760 | specifically inside of your pre-training, which I assume is just a lot of data collection.
00:18:10.940 | We are optimizing reasoning inside of our pre-trained models.
00:18:15.400 | And a lot of that is about data.
00:18:16.400 | And I can talk more about like what, you know, what exactly does it involve?
00:18:21.540 | But actually big, maybe 50% plus of the work is figuring out even if you do have models
00:18:29.040 | that reason well, like the models are still stochastic.
00:18:32.480 | The way you prompt them is still kind of random, like it makes them do random things.
00:18:37.600 | And so how do we get to something that is actually robust and reliable as a user?
00:18:41.980 | How can I as a user trust it?
00:18:44.000 | You know, I was mentioning earlier when I talked to other people building agents, they
00:18:47.840 | have to do so much work, like to try to get to something that they can actually productize.
00:18:54.160 | And it takes a long time, and agents haven't been productized yet, partly for this
00:19:00.280 | reason, that, like, the abstractions are very leaky.
00:19:03.840 | You know, we can get like 80% of that way there, but like self-driving cars, like the
00:19:07.760 | remaining 20% is actually really difficult.
00:19:10.440 | We believe that, and we have internally, I think, some things, like an interface,
00:19:15.400 | for example, that lets me really easily like see what the agent execution is, fork it,
00:19:21.120 | try out different things, modify the prompt, modify like the plan that it is making.
00:19:28.120 | This type of interface, it makes it so that I feel more like I'm collaborating with the
00:19:32.960 | agent as it's executing, as opposed to it's just like doing something as a black box.
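A minimal sketch of the kind of data structure that makes "see the execution, fork it, try different things" possible; the field names and the usage are hypothetical, not Imbue's interface.

```python
import copy
from dataclasses import dataclass, field

@dataclass
class AgentTrace:
    """An inspectable record of an agent run: the prompt, the plan, and every action taken."""
    prompt: str
    plan: list[str] = field(default_factory=list)
    actions: list[str] = field(default_factory=list)

    def fork(self, at_step: int) -> "AgentTrace":
        """Copy the trace up to a given step so the user can try a different continuation."""
        forked = copy.deepcopy(self)
        forked.plan = forked.plan[:at_step]
        forked.actions = forked.actions[:at_step]
        return forked

# Usage: rewind to just after step 1, edit the plan, and re-run from there.
trace = AgentTrace(prompt="add a retry to the HTTP client",
                   plan=["read client.py", "write retry wrapper", "update call sites"],
                   actions=["opened client.py", "wrote wrapper", "edited 14 call sites"])
alt = trace.fork(at_step=1)
alt.plan.append("write retry wrapper with exponential backoff")
```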
00:19:37.880 | That's an example of a type of thing that's like beyond just the model pre-training.
00:19:41.740 | But on the model pre-training side, like reasoning is a thing that we optimize for.
00:19:46.160 | And a lot of that is about, yeah, what data do we put in?
00:19:50.520 | Yeah.
00:19:51.520 | It's interesting just because I always think like, you know, out of the levers that you
00:19:55.480 | have, the resources that you have, I think a lot of people think that running a foundation
00:20:00.680 | model company or a research lab is going to be primarily compute.
00:20:05.560 | And I think the share of compute has gone down a lot over the past three years.
00:20:10.120 | It used to be the main story, like the main way you scale is you just throw more compute
00:20:13.660 | at it.
00:20:15.000 | And now it's like Flops is not all you need.
00:20:16.820 | You need better data, you need better algorithms.
00:20:19.260 | And I wonder where that shift has gone.
00:20:22.560 | This is a very vague question, but is it like 30, 30, 30 now?
00:20:25.420 | Is it like maybe even higher?
00:20:27.080 | So one way I'll put this is people estimate that Llama 2 maybe took about $3 to $4 million
00:20:33.420 | of compute, but probably $20 to $25 million worth of labeling data.
00:20:39.100 | And I'm like, okay, well that's a very different story than all these other foundation model
00:20:42.700 | labs raising hundreds of millions of dollars and spending it on GPUs.
00:20:47.700 | Yeah.
00:20:48.780 | Data is really expensive.
00:20:54.180 | We generate a lot of data and so that does help.
00:20:58.460 | The generated data is close to, actually as good as, human-labeled data.
00:21:04.180 | So generated data from other models?
00:21:06.740 | From our own models.
00:21:07.740 | From your own models.
00:21:08.740 | Yeah.
00:21:09.740 | Do you feel like, and there's certain variations of this, there's the sort of the constitutional
00:21:14.820 | AI approach from Anthropic and basically models sampling, training on data from other models.
00:21:22.020 | I feel like there's a little bit of like contamination in there or to put it in a statistical form,
00:21:28.620 | you're resampling a distribution that you already have that you already know doesn't
00:21:32.020 | match human distributions.
00:21:33.460 | Yeah.
00:21:34.460 | Yeah.
00:21:35.460 | How do you feel about that basically, just philosophically?
00:21:38.620 | So when we're optimizing models for reasoning, we are actually trying to make a part of the
00:21:44.860 | distribution really spiky.
00:21:46.820 | So in a sense, this is actually what we want.
00:21:50.180 | We want to, because the internet is a sample of the human distribution that's also skewed
00:21:56.140 | in all sorts of ways, that is not the data that we necessarily want these models to be
00:22:01.940 | trained on.
00:22:02.940 | And so I don't worry about it that much.
00:22:05.560 | What we've seen so far is that it seems to help.
00:22:07.360 | When we're generating data, we're not really randomly generating data, we generate very
00:22:11.380 | specific things that are like reasoning traces and that help optimize reasoning.
00:22:17.500 | Code also is a big piece of improving reasoning.
00:22:19.780 | So yeah, generated code is not that much worse than like regular human written code.
00:22:27.460 | You might even say it can be better in a lot of ways.
00:22:29.620 | So yeah.
00:22:30.620 | So we are trying to already do that.
00:22:32.980 | What are some of the tools that you saw that you thought were not a good fit?
00:22:37.200 | So you built Avalon, which is your own simulated world.
00:22:41.600 | And when you first started, the kind of like metagame was like using games to simulate
00:22:47.580 | things, using, you know, Minecraft, and then OpenAI has, like, the Gym thing and all these
00:22:52.980 | things.
00:22:53.980 | And your thing, I think in one of your other podcasts, you mentioned like Minecraft is
00:22:57.560 | like way too slow to actually do any serious work.
00:23:01.480 | Is that true?
00:23:02.480 | Yeah.
00:23:03.480 | I didn't say it.
00:23:04.480 | I don't know.
00:23:05.480 | That's above my pay grade.
00:23:07.320 | But Avalon is like a hundred times faster than Minecraft for simulation.
00:23:12.360 | When did you figure that out that you needed to just like build your own thing?
00:23:16.520 | Was it kind of like your engineering team was like, hey, this is too slow.
00:23:20.560 | Was it more a long-term investment?
00:23:22.760 | At that time, we built Avalon as a research environment to help us learn particular things.
00:23:28.200 | And one thing we were trying to learn is like, how do you get an agent that is able to do
00:23:33.480 | many different tasks?
00:23:35.880 | Like RL agents at that time and environments at that time, what we heard from other RL
00:23:39.960 | researchers was the like biggest thing holding the field back is lack of benchmarks that
00:23:46.420 | let us kind of explore things like planning and curiosity and things like that and have
00:23:52.760 | the agent actually perform better if the agent has curiosity.
00:23:57.160 | And so we were trying to figure out like, okay, how can we have agents that are like
00:24:02.120 | able to handle lots of different types of tasks without the reward being pretty handcrafted?
00:24:09.280 | A lot of what we had seen was, like, these very handcrafted rewards.
00:24:12.800 | And so Avalon has like a single reward.
00:24:15.360 | It's across all tasks.
00:24:17.320 | And it also allowed us to kind of create a curriculum, so we could
00:24:23.640 | make the level more or less difficult.
00:24:26.200 | And it taught us a lot, maybe two primary things.
00:24:29.720 | One is with no curriculum, RL algorithms don't work at all.
00:24:34.000 | So that's actually really interesting.
00:24:36.440 | For the non-RL specialists, what is a curriculum in your terminology?
00:24:39.960 | So a curriculum in this particular case is basically that the environment, Avalon, lets us generate
00:24:47.080 | simpler environments and harder environments for a given task.
00:24:50.400 | What's interesting is that the simpler environments, you know, what you'd expect is the agent succeeds
00:24:54.740 | more often, so it gets more reward.
00:24:57.600 | And so, you know, kind of my intuitive way of thinking about it is, okay, the reason
00:25:01.300 | why it learns much faster with a curriculum is it's just getting a lot more signal.
00:25:06.240 | And that's actually an interesting kind of like general intuition to have about training
00:25:10.220 | these things.
00:25:11.220 | It's like, what kind of signal are they getting and like, how can you help it get a lot more
00:25:15.040 | signal?
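As a rough illustration of that intuition, a curriculum can be as simple as tracking the recent success rate and nudging environment difficulty so the agent keeps receiving reward signal. This is a hypothetical scheduler, not Avalon's actual mechanism.

```python
from collections import deque

class DifficultyCurriculum:
    """Keep task difficulty in the band where the agent succeeds often enough to learn."""
    def __init__(self, target_success: float = 0.5, window: int = 100, step: float = 0.05):
        self.difficulty = 0.0                 # 0 = easiest environment, 1 = hardest
        self.target_success = target_success
        self.results = deque(maxlen=window)   # rolling window of recent episode outcomes
        self.step = step

    def record(self, success: bool) -> None:
        self.results.append(success)
        if len(self.results) < self.results.maxlen:
            return                            # wait for a full window before adjusting
        rate = sum(self.results) / len(self.results)
        if rate > self.target_success:        # too easy: the agent gets reward almost every time
            self.difficulty = min(1.0, self.difficulty + self.step)
        elif rate < self.target_success / 2:  # too hard: almost no reward signal to learn from
            self.difficulty = max(0.0, self.difficulty - self.step)
```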
00:25:16.960 | The second thing we learned is that reinforcement learning is not a good vehicle, like pure
00:25:21.680 | reinforcement learning is not a good vehicle for planning and reasoning.
00:25:24.960 | So these agents were not able to, they were able to learn all sorts of crazy things.
00:25:29.220 | They could learn to climb, like hand over hand in VR climbing, they can learn to open
00:25:33.760 | doors, like very complicated ones, where multiple switches and a lever open the door.
00:25:40.360 | But they couldn't do any higher level things and they couldn't do those lower level things
00:25:46.480 | consistently necessarily.
00:25:49.040 | And as a user, we were like, okay, as a user, I do not want to interact with a pure reinforcement
00:25:53.640 | learning end-to-end RL agent.
00:25:55.580 | As a user, like I need much more control over what that agent is doing.
00:26:00.080 | And so that actually started to get us on the track of thinking about, okay, how do
00:26:03.640 | we do the reasoning part in language?
00:26:06.980 | And we were pretty inspired by our friend Chelsea Finn at Stanford, who was, I think, working
00:26:11.160 | on SayCan at the time, where it's basically an experiment where they have robots kind
00:26:19.600 | of trying to do different tasks and actually do the reasoning for the robot in natural
00:26:23.840 | language.
00:26:25.340 | And it worked quite well.
00:26:27.400 | And that led us to start experimenting very seriously with reasoning.
00:26:32.760 | How important is the language part for the agent versus for you to inspect the agent?
00:26:39.200 | You know, like, is the interface to kind of the human in the loop really important?
00:26:45.360 | Yeah.
00:26:46.360 | I personally think of it as it's much more important for us, the human user.
00:26:49.320 | So I think you probably could get end-to-end agents that work and are fairly general at
00:26:56.760 | some point in the future.
00:26:58.360 | But I think you don't want that.
00:27:00.160 | Like we actually want agents that we can like perturb while they're trying to figure out
00:27:05.400 | what to do.
00:27:06.400 | So it's, you know, even a very simple example, internally we have like a type error fixing
00:27:11.320 | agent and we have like a test generation agent.
00:27:13.960 | Test generation agent goes off the rails all the time.
00:27:17.760 | I want to know like, why did it generate this particular test?
00:27:21.240 | What was it thinking?
00:27:22.440 | Did it consider, you know, the fact that this is calling out to this other function?
00:27:27.560 | Like formatter agent, if it ever comes up with anything weird, I want to be able to
00:27:31.960 | debug like what happened.
00:27:34.200 | With RL end-to-end stuff, like we couldn't do that.
00:27:36.640 | So it sounds like you have a bunch of agents that are operating internally within the company.
00:27:41.280 | What's your most, I guess, successful agent and what's your least successful one?
00:27:45.640 | Yeah.
00:27:46.640 | A type of agent that works moderately well is like fix the color of this button on the
00:27:51.120 | website or like change the color of this button.
00:27:53.680 | Which is what sweep.dev is doing now.
00:27:55.440 | Perfect.
00:27:56.440 | Okay.
00:27:57.440 | Well, we should just use sweep.dev.
00:27:58.440 | Well, I mean, okay.
00:27:59.440 | I don't know how often you have to fix the color of the button, right?
00:28:02.000 | Because all of them raise money on the idea that they can go further.
00:28:06.240 | And my fear when encountering something like that is that there's some kind of unknown
00:28:10.480 | asymptote ceiling that's going to prevent them, that they're going to run head on into
00:28:15.000 | that you've already run into.
00:28:16.640 | We've definitely run into such a ceiling.
00:28:18.480 | What is the ceiling?
00:28:19.480 | Is there a name for it?
00:28:21.240 | I mean, for us, we think of it as reasoning plus these tools.
00:28:25.480 | So reasoning plus abstractions, basically.
00:28:28.760 | I think actually you can get really far with current models and that's why it's so compelling.
00:28:34.360 | Like we can pile debugging tools on top of these current models, have them critique each
00:28:39.720 | other and critique themselves and do all of these, like, you know, spend more compute at
00:28:45.080 | inference time, context hacks, you know, retrieval-augmented generation, et cetera, et cetera,
00:28:51.920 | et cetera.
00:28:52.920 | Like the pile of hacks actually does get us really far.
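A minimal sketch of one such inference-time "hack": generate, critique, revise, and retry, written with placeholder callables rather than any particular model API.

```python
def generate_with_critique(task: str,
                           generate,        # callable: prompt string -> draft answer string
                           critique,        # callable: (task, draft) -> list of problems found
                           max_rounds: int = 3) -> str:
    """Spend extra inference-time compute: keep revising until the critic finds no problems."""
    draft = generate(task)
    for _ in range(max_rounds):
        problems = critique(task, draft)
        if not problems:
            return draft
        revision_prompt = f"Task: {task}\nDraft: {draft}\nFix these problems: {problems}"
        draft = generate(revision_prompt)
    return draft  # best effort once the revision budget runs out
```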
00:28:56.440 | And you're kind of like trying to get more signal out of the channel.
00:29:00.280 | We don't like to think about it that way.
00:29:03.400 | It's what the default approach is, is like trying to get more signal out of this noisy
00:29:06.680 | channel.
00:29:08.360 | But the issue with agents is as a user, I want it to be mostly reliable.
00:29:14.200 | It's kind of like self-driving in that way.
00:29:16.080 | Like it's not as bad as self-driving, like in self-driving, you know, you're like hurtling
00:29:21.000 | at 70 miles an hour, so it's like the hardest agent problem.
00:29:24.320 | But I think one thing we learned from Sorceress and one thing we've learned like by using
00:29:28.480 | these things internally is we actually have a pretty high bar for these agents to work.
00:29:33.760 | You know, it is actually really annoying if they only work 50% of the time and we can
00:29:38.720 | make interfaces to make it slightly less annoying.
00:29:40.680 | But yeah, there's a ceiling that we've encountered so far and we need to make the models better
00:29:46.600 | and we also need to make the kind of like interface to the user better and also a lot
00:29:49.920 | of the like, you know, critiquing, we have a lot of like generation methods, kind of
00:29:56.880 | like spending compute at inference time, generation methods that help things be more robust and
00:30:02.160 | reliable, but it's still not 100% of the way there.
00:30:05.560 | So to your question of like what agents work well and what doesn't work well, like most
00:30:09.240 | of the agents don't work well and we're slowly making them work better by improving the underlying
00:30:13.440 | model and improving these.
00:30:14.800 | I think that that's comforting for a lot of people who are feeling a lot of imposter syndrome
00:30:20.280 | not being able to make it work.
00:30:21.680 | And I think the fact that you share their struggles, I think also helps people understand
00:30:26.880 | how early this is.
00:30:27.880 | Yeah, definitely.
00:30:28.880 | It's very early and I hope what we can do is help people who are building agents actually
00:30:33.200 | like be able to deploy them.
00:30:35.640 | I think, you know, that's the gap that we see a lot of today is everyone who's trying
00:30:39.400 | to build agents to get to the point where it's robust enough to be deployable.
00:30:42.840 | It's like an unknown amount of time.
00:30:46.440 | Okay.
00:30:47.440 | Yeah.
00:30:48.440 | Well, so this goes back into what Imbue is going to offer as a product or a platform.
00:30:51.480 | How are you going to actually help people deploy those agents?
00:30:55.160 | Yeah, so our current hypothesis, I don't know if this is actually going to end up being
00:30:58.720 | the case.
00:31:00.080 | We've built a lot of tools for ourselves internally around like debugging, around like abstractions
00:31:07.040 | or techniques after the model generation happens, like after the language model generates the
00:31:13.080 | text, like interfaces for the user and the underlying model itself, like models talking
00:31:20.320 | to each other.
00:31:22.200 | Maybe some set of those things, kind of like an operating system, some set of those things
00:31:28.120 | will be helpful for other people.
00:31:30.240 | And we'll figure out what set of those things is helpful for us to make our agents.
00:31:34.400 | Like what we want to do is get to a point where we can start making an agent, deploy
00:31:37.320 | it, it's reliable, like very quickly.
00:31:40.120 | And there's a similar analog to software engineering, like in the early days, in the '70s, in the
00:31:44.480 | '60s, like to program a computer, you have to go all the way down to the registers and
00:31:50.440 | write things.
00:31:51.440 | Eventually, we had assembly.
00:31:52.880 | That was like an improvement.
00:31:54.640 | Then we wrote programming languages with these higher levels of abstraction, and that allowed
00:31:58.440 | a lot more people to do this and much faster, and the software created is much less expensive.
00:32:03.240 | And I think it's basically a similar route here where we're like in the like bare metal
00:32:08.280 | phase of agent building, and we will eventually get to something with much nicer abstractions.
00:32:14.360 | So you touched a little bit on the data before.
00:32:17.120 | We had this conversation with George Hotz, we were like, there's not a lot of reasoning
00:32:21.600 | data out there, and can the models really understand?
00:32:24.680 | And his take was like, look, with enough compute, you're not that complicated as a human.
00:32:29.320 | The model can figure out eventually why certain decisions are made.
00:32:33.600 | What's been your experience?
00:32:34.600 | As you think about reasoning data, do you have to do a lot of manual work, or is there
00:32:40.080 | a way to prompt models to extract the reasoning from actions that they see?
00:32:46.160 | We don't think of it as, oh, throw enough data at it, and then it will figure out what
00:32:51.800 | the plan should be.
00:32:53.560 | I think we're much more explicit.
00:32:55.800 | So we have a lot of thoughts internally, like many documents about what reasoning is.
00:32:59.920 | A way to think about it is as humans, we've learned a lot of reasoning strategies over
00:33:04.040 | time.
00:33:05.040 | We are better at reasoning now than we were 3,000 years ago.
00:33:08.000 | An example of a reasoning strategy is noticing you're confused.
00:33:12.060 | And then when I notice I'm confused, I should ask like, huh, what was the original claim
00:33:16.560 | that was made?
00:33:18.560 | What evidence is there for this claim, et cetera, et cetera?
00:33:22.880 | Does the evidence support the claim?
00:33:24.240 | Is the claim correct?
00:33:25.480 | This is like a reasoning strategy that was developed in like the 1600s, with like the
00:33:29.160 | advent of science.
00:33:30.160 | That's an example of a reasoning strategy.
00:33:32.600 | There are tons of them.
00:33:33.600 | We employ all the time, lots of heuristics that help us be better at reasoning.
00:33:38.680 | And we didn't always have them.
00:33:40.980 | And because they're invented, we can generate data that's much more specific to them.
00:33:44.860 | So I think internally, yeah, we have a lot of thoughts on what reasoning is, and we generate
00:33:48.240 | a lot more specific data.
00:33:49.320 | We're not just like, oh, it'll figure out reasoning from this black box, or it'll figure
00:33:54.080 | out reasoning from the data that exists.
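One way to picture "generating data that's specific to a reasoning strategy": turn a strategy like "notice you're confused, then check the claim against its evidence" into a template that a model fills in at scale to produce training traces. The template and example below are hypothetical.

```python
CONFUSION_TEMPLATE = """Claim: {claim}
I notice I'm confused about this claim.
What was the original claim? {claim}
What evidence is there for it? {evidence}
Does the evidence support the claim? {verdict}
Conclusion: {conclusion}"""

def make_reasoning_trace(claim: str, evidence: str, verdict: str, conclusion: str) -> str:
    """Render one synthetic reasoning trace for the 'noticing confusion' strategy."""
    return CONFUSION_TEMPLATE.format(
        claim=claim, evidence=evidence, verdict=verdict, conclusion=conclusion
    )

# In practice a model would fill these slots at scale; this just shows the shape of the data.
example = make_reasoning_trace(
    claim="This function is never called",
    evidence="grep finds two call sites in tests/",
    verdict="No, the evidence contradicts the claim",
    conclusion="The function is called from tests, so it should not be deleted.",
)
```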
00:33:55.800 | Yeah.
00:33:56.800 | I mean, the scientific method is like a good example.
00:34:00.480 | And if you think about hallucination, right?
00:34:03.160 | And people are thinking, how do we use these models to do net new scientific research?
00:34:09.240 | And if you go back in time and the model is like, well, the earth revolves around the
00:34:13.840 | sun, and people are like, man, this model is crap.
00:34:16.600 | It's like, what are you talking about?
00:34:18.280 | Like the sun revolves around the earth.
00:34:20.360 | Like, how do you see the future where like, do you think we can actually, like, if the
00:34:26.120 | models are actually good enough, but we don't believe them, it's like, how do we make the
00:34:31.760 | two live together?
00:34:32.760 | Say you're like, you use Imbue as a scientist to do a lot of your research, and Imbue tells
00:34:37.960 | you, hey, I think this is like a serious bet.
00:34:40.760 | You should go down.
00:34:41.760 | And you're like, no, this sounds impossible.
00:34:43.120 | Like, how is that trust going to be built, and like, what are some of the tools that
00:34:47.240 | maybe are going to be there to inspect it?
00:34:49.760 | Yeah.
00:34:50.760 | So like, one element of it is like, as a person, like, I need to basically get information
00:34:57.040 | out of the model such that I can try to understand what's going on with the model.
00:35:01.160 | So then the second question is like, okay, how do you do that?
00:35:04.560 | And that's kind of, some of our debugging tools, they're not necessarily just for debugging.
00:35:10.080 | They're also for like, interfacing with and interacting with the model.
00:35:12.760 | So like, if I go back in this reasoning trace and like, change a bunch of things, what's
00:35:16.600 | going to happen?
00:35:17.600 | Like, what does it conclude instead?
00:35:19.280 | So that kind of helps me understand, like, what are its assumptions?
00:35:23.440 | And it, you know, we think of these things as tools.
00:35:30.120 | And so it's really about, like, as a user, how do I use this tool effectively?
00:35:33.640 | Like, I need to be willing to be convinced as well.
00:35:36.400 | It's like, how do I use this tool effectively, and what can it help me with, and what can
00:35:39.760 | it tell me?
00:35:40.760 | So there's a lot of mention of code in your process.
00:35:44.520 | And I was hoping to dive in even deeper.
00:35:47.200 | I think we might run the risk of giving people the impression that you view code, or you
00:35:54.560 | use code, just as like a tool within yourself, within Imbue, just for coding assistance.
00:36:01.560 | And I think there's a lot of informal understanding about how adding code to language models improves
00:36:06.760 | their reasoning capabilities.
00:36:08.120 | I wonder if there's any research or findings that you have to share that talks about the
00:36:14.080 | intersection of code and reasoning.
00:36:15.880 | Yeah, so the way I think about it intuitively is, like, code is the most explicit example
00:36:20.920 | of reasoning data on the internet.
00:36:23.800 | And it's not only structured, it's actually very explicit, which is nice.
00:36:27.940 | You know, it says this variable means this, and then it uses this variable, and then the
00:36:32.240 | function does this.
00:36:33.240 | Like, as people, when we talk in language, it takes a lot more to kind of, like, extract
00:36:38.320 | that, like, explicit structure out of, like, our language.
00:36:43.140 | And so that's one thing that's really nice about code, is I see it as almost like a curriculum
00:36:46.920 | for reasoning.
00:36:47.960 | I think we use code in all sorts of ways, like, the coding agents are really helpful
00:36:53.800 | for us to understand, like, what are the limitations of the agents?
00:36:57.800 | The code is really helpful for the reasoning itself, but also code is a way for models
00:37:02.760 | to act.
00:37:04.120 | So by generating code, it can act on my computer.
00:37:08.080 | And you know, when we talk about rekindling the dream of the personal computer, kind of
00:37:11.720 | where I see computers going is, like, computers will eventually become these much more malleable
00:37:17.280 | things, where I, as a user, today, I have to know how to write software code, like,
00:37:24.160 | in order to make my computer do exactly what I want it to do.
00:37:28.660 | But in the future, if the computer is able to generate its own code, then I can actually
00:37:34.320 | interface with it in natural language.
00:37:37.000 | And so we, you know, one way we think about agents is it's kind of like a natural language
00:37:40.860 | programming language.
00:37:42.640 | It's a way to program my computer in natural language that's much more intuitive to me
00:37:45.680 | as a user.
00:37:46.880 | And these interfaces that we're building are essentially IDEs for users to program our
00:37:52.520 | computers in natural language.
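The "code as a way for models to act" point can be made concrete with a tiny sketch: the model turns a natural-language request into a script, and the computer runs it. The `llm` callable is a placeholder, and a real system would sandbox the execution far more carefully.

```python
import subprocess
import tempfile

def act_via_code(request: str, llm) -> str:
    """Turn a natural-language request into Python code, run it, and return its output."""
    code = llm(f"Write a short Python script that does the following:\n{request}\n"
               "Print the result. Output only code.")
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    # NOTE: a real system would run this inside a sandbox with permission and resource limits.
    result = subprocess.run(["python", path], capture_output=True, text=True, timeout=30)
    return result.stdout

# Usage, assuming some `llm` callable exists:
# print(act_via_code("list the five largest files in the current directory", llm))
```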
00:37:54.520 | What do you think about the other, the different approaches people have, kind of like, text
00:37:58.680 | first, browser first, like, MultiOn?
00:38:02.900 | What do you think the best interface will be, or like, what is your, you know, thinking
00:38:07.840 | today?
00:38:08.840 | I think chat is very limited as an interface.
00:38:14.760 | It is sequential, where these agents don't have to be sequential.
00:38:20.760 | So with a chat interface, if the agent does something wrong, I have to, like, figure out
00:38:26.080 | how to, like, how do I get it to go back and start from the place I wanted it to start
00:38:30.680 | from?
00:38:31.680 | So in a lot of ways, like, chat as an interface, I think Linus, Linus Lee, you had him on this podcast.
00:38:37.200 | I really like how he put it, chat as an interface is skeuomorphic.
00:38:41.000 | So in the early days, when we made word processors on our computers, they had notepad lines,
00:38:47.040 | because that's what we understood, you know, these, like, objects to be.
00:38:51.480 | Chat, like texting someone, is something we understand.
00:38:54.600 | So texting our AI is something that we understand.
00:38:58.000 | But today's Word documents don't have notepad lines.
00:39:02.080 | And similarly, the way we want to interact with agents, like, chat is a very primitive
00:39:06.840 | way of interacting with agents.
00:39:08.640 | What we want is to be able to inspect their state and to be able to modify them and fork
00:39:11.640 | them and all of these other things.
00:39:12.840 | And we internally have, kind of, like, think about what are the right representations for
00:39:18.040 | that, like, architecturally, like, what are the right representations?
00:39:22.160 | What kind of abstractions do we need to build?
00:39:24.640 | And how do we build abstractions that are not leaky?
00:39:27.720 | Because if the abstractions are leaky, which they are today, like, you know, this stochastic
00:39:31.520 | generation of text is like a leaky abstraction.
00:39:34.440 | I cannot depend on it.
00:39:35.940 | And that means it's actually really hard to build on top of.
00:39:38.960 | But our experience and belief is, actually, by building better abstractions and better
00:39:43.760 | tooling, we can actually make these things non-leaky.
00:39:46.960 | And now you can build, like, whole things on top of them.
00:39:49.520 | So these other interfaces, because of where we are, we don't think that much about them.
00:39:53.840 | >> Cool.
00:39:54.840 | Yeah, I mean, you mentioned this is kind of like the Xerox PARC moment for AI.
00:40:00.720 | And we had a lot of stuff come out of PARC, like, yeah, what-you-see-is-what-you-get
00:40:05.900 | editors, and, like, MVC, and all this stuff.
00:40:07.940 | But yeah.
00:40:08.940 | But then we didn't have the iPhone at PARC.
00:40:11.100 | We didn't have all these, like, higher things.
00:40:13.380 | What do you think it's reasonable to expect in, like, this era of AI?
00:40:17.460 | You know, kind of, like, five years or so?
00:40:19.940 | Like, what are, like, the things we'll build today?
00:40:21.740 | And what are things that maybe we'll see in, kind of, like, the second wave of products?
00:40:25.820 | >> I think the waves will be much faster than before.
00:40:29.380 | Like, what we're seeing right now is basically, like, a continuous wave.
00:40:33.260 | Let me zoom a little bit earlier.
00:40:34.900 | So people like the Xerox Parc analogy I give, but I think there are many different analogies.
00:40:39.540 | Like one is the, like, analog to digital computer is another analogy to where we are today.
00:40:45.180 | The analog computer Vannevar Bush built in the 1930s, I think, and it's like a system
00:40:50.300 | of pulleys.
00:40:51.800 | And it can only calculate one function, like, it can calculate, like, an integral.
00:40:55.740 | And that was so magical at the time, because you actually did need to calculate this integral
00:40:58.580 | a bunch.
00:40:59.580 | But it had a bunch of issues.
00:41:00.580 | Like, in analog, errors compound.
00:41:03.380 | And so there was actually a set of breakthroughs necessary in order to get to the digital computer.
00:41:08.280 | Like Turing's decidability, Shannon showing, I think, that, like, relay circuits
00:41:18.460 | can be thought of as, can be mapped to, Boolean operators.
00:41:22.180 | And a set of other, like, theoretical breakthroughs, which essentially, they were creating abstractions
00:41:27.940 | for these, like, very analog circuits.
00:41:30.480 | And digital had this nice property of, like, being error correcting.
00:41:34.180 | And so when I talk about, like, less leaky abstractions, that's what I mean.
00:41:37.180 | That's what I'm kind of pointing a little bit to.
00:41:38.700 | It's not going to look exactly the same way.
00:41:41.340 | And then the Xerox PARC piece, a lot of that is about, like, how do we get to computers
00:41:47.020 | that as a person, I can actually use well.
00:41:51.740 | And the interface actually helps it unlock so much more power.
00:41:55.700 | So the sets of things we're working on, like the sets of abstractions and the interfaces,
00:42:00.820 | like, hopefully that, like, help us unlock a lot more power in these systems.
00:42:04.940 | Like, hopefully that'll come not too far in the future.
00:42:08.740 | I could see a next version, like, maybe a little bit farther out.
00:42:13.700 | It's, like, an agent protocol.
00:42:15.580 | So a way for different agents to talk to each other and call each other, kind of like HTTP.
00:42:21.020 | Do you know it exists already?
00:42:23.780 | Yeah, there is a nonprofit that's working on one.
00:42:27.100 | I think it's a bit early, but it's interesting to think about right now.
00:42:32.620 | Part of why I think it's early is because the issue with agents is it's not quite like
00:42:39.460 | the internet where you could, like, make a website and the website would appear.
00:42:44.060 | The issue with agents is that they don't work.
00:42:46.880 | And so it may be a bit early to figure out what the protocol is before we really understand
00:42:50.260 | how could these agents get constructed.
00:42:52.940 | But, you know, I think that's, I think it's a really interesting question.
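Purely as a thought experiment about what an "HTTP for agents" might standardize, and not a description of any existing protocol, the messages might carry the capability being requested, its arguments, and enough of a reasoning trace for the caller to audit the result.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class AgentRequest:
    """Hypothetical agent-to-agent message: one agent asks another to perform a capability."""
    sender: str                      # e.g. "travel-planner"
    recipient: str                   # e.g. "calendar-agent"
    capability: str                  # e.g. "find_free_slot"
    arguments: dict[str, Any] = field(default_factory=dict)

@dataclass
class AgentResponse:
    request: AgentRequest
    result: Any
    reasoning_trace: list[str] = field(default_factory=list)  # lets the caller inspect *why*
    confidence: float = 1.0
```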
00:42:55.300 | While we're talking on this agent-to-agent thing, there's been a bit of research recently
00:42:59.740 | on some of these approaches.
00:43:02.380 | I tend to just call them extremely complicated chain of thoughting, but any perspectives
00:43:09.980 | on kind of MetaGPT, I think, is the name of the paper.
00:43:13.260 | I don't know if you care about at the level of individual papers coming out, but I did
00:43:18.860 | read that recently, and TLDR, it beat GPT-4 on HumanEval by role-playing a software
00:43:25.820 | development agency.
00:43:26.900 | Instead of having a single shot, a single role, you have multiple roles and having all
00:43:31.540 | of them criticize each other as agents communicating with other agents.
00:43:35.100 | Yeah.
00:43:36.100 | I think this is an example of an interesting abstraction of like, okay, can I just plop
00:43:40.100 | in this multi-role critiquing and see how it improves my agent?
00:43:45.020 | Can I just plop in chain of thought, tree of thought, plop in these other things and
00:43:47.940 | see how they improve my agent?
00:43:51.700 | One issue with this kind of prompting is that it's still not very reliable.
00:43:57.300 | There's one lens which is like, okay, if you do enough of these techniques, you'll get
00:44:00.100 | to high reliability.
00:44:01.100 | I think actually that's a pretty reasonable lens.
00:44:03.820 | We take that lens often.
00:44:06.820 | Then there's another lens that's like, okay, but it's starting to get really messy what's
00:44:11.740 | in the prompt and how do we deal with that messiness?
00:44:15.900 | Maybe you need cleaner ways of thinking about and constructing these systems.
00:44:20.100 | We also take that lens.
00:44:21.100 | Yeah.
00:44:22.100 | I think both are necessary.
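As a rough illustration of the "plop in multi-role critiquing" abstraction being discussed, the sketch below runs a single draft-critique-revise round across invented roles. It is not MetaGPT's implementation or Imbue's; `call_model` stands in for whatever chat-completion function you happen to use, and the roles and prompts are made up.

```python
# Minimal sketch of "plug-in" multi-role critiquing. `call_model` is a
# placeholder for any chat-completion function; roles and prompts are invented.
from typing import Callable

ROLES = {
    "engineer": "You write the code for the task.",
    "reviewer": "You critique the engineer's code for bugs and unclear parts.",
    "tester":   "You list concrete test cases the code might fail.",
}


def multi_role_pass(task: str, call_model: Callable[[str], str]) -> str:
    """One round: draft, collect critiques from other roles, then revise."""
    draft = call_model(f"{ROLES['engineer']}\nTask: {task}\nWrite a solution.")
    critiques = []
    for role in ("reviewer", "tester"):
        critiques.append(call_model(
            f"{ROLES[role]}\nTask: {task}\nSolution:\n{draft}\nGive your critique."))
    revision_prompt = (
        f"{ROLES['engineer']}\nTask: {task}\nYour draft:\n{draft}\n"
        "Critiques:\n" + "\n".join(critiques) + "\nRevise the solution."
    )
    return call_model(revision_prompt)


if __name__ == "__main__":
    # A fake model so the sketch runs without any API; swap in a real call.
    fake_model = lambda prompt: f"[model output for prompt of {len(prompt)} chars]"
    print(multi_role_pass("Parse a CSV file and sum the second column", fake_model))
```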
00:44:23.100 | It's a great question because I feel like this also brought up another question I had
00:44:27.800 | for you.
00:44:28.800 | I noticed that you work a lot with your own benchmarks, your own evaluations of what is
00:44:35.260 | valuable.
00:44:37.500 | I would say I would contrast your approach with OpenAI's, as OpenAI tends to just lean
00:44:41.700 | on, "Hey, we played StarCraft," or, "Hey, we ran it on the SAT or the AP bio test and
00:44:50.620 | these are the results."
00:44:52.380 | Basically, is benchmark culture ruining AI?
00:44:59.460 | Or is that actually a good thing?
00:45:00.780 | Because everyone knows what an SAT is and that's fine.
00:45:04.220 | I think it's important to use both public and internal benchmarks.
00:45:07.420 | Part of why we build our own benchmarks is that there are not very many good benchmarks
00:45:10.520 | for agents, actually.
00:45:12.560 | To evaluate these things, we actually need to think about it in a slightly different way.
00:45:18.020 | But we also do use a lot of public benchmarks for: is the reasoning capability in this particular
00:45:24.020 | way improving?
00:45:25.020 | Yeah.
00:45:26.020 | It's good to use both.
00:45:27.020 | For example, the Voyager paper coming out of NVIDIA played Minecraft and set their own
00:45:35.340 | benchmarks on getting the Diamond Axe or whatever and exploring as much of the territory as
00:45:41.100 | possible.
00:45:42.100 | I don't know how that's received.
00:45:43.940 | That's obviously fun and novel for the rest of the AI engineer community, the people who are new
00:45:48.260 | to the scene.
00:45:49.260 | But for people like yourself who you build your own, you build Avalon just because you
00:45:54.620 | already found deficiencies with using Minecraft, is that valuable as an approach?
00:46:00.620 | Oh, yeah.
00:46:01.620 | I love Voyager.
00:46:02.620 | Jim, I think is awesome.
00:46:03.940 | And I really like the Voyager paper and I think it has a lot of really interesting ideas,
00:46:07.180 | which is like the agent can create tools for itself and then use those tools.
00:46:11.460 | And he had the idea of the curriculum as well, which is something that we talked about earlier.
00:46:15.060 | Exactly.
00:46:16.060 | Exactly.
00:46:17.060 | And that's a lot of what we do.
00:46:18.060 | We built Avalon mostly because we couldn't use Minecraft very well to learn the things
00:46:21.340 | we wanted.
00:46:22.420 | And so it's not that much work to build our own.
00:46:25.500 | It took us, I don't know, we had eight engineers at the time, took about eight weeks.
00:46:31.220 | So six weeks.
00:46:32.740 | Nice.
00:46:33.740 | Yeah.
00:46:34.740 | And OpenAI built their own as well.
00:46:35.740 | Right?
00:46:36.740 | Yeah, exactly.
00:46:37.740 | It's just nice to have control over our environment.
00:46:39.180 | And to have our own sandbox to really try to inspect our own research questions.
00:46:44.140 | But if you're doing something like experimenting with agents and trying to get them to do things,
00:46:47.820 | Minecraft is a really interesting environment.
00:46:51.500 | And so Voyager has a lot of really interesting ideas in it.
00:46:54.260 | Yeah.
00:46:55.260 | Cool.
00:46:56.260 | One more element that we had on this list, which is context and memory.
00:47:00.380 | I think that's kind of like the foundational "RAM" of our era.
00:47:05.660 | I think Andrej Karpathy has already made this comparison, so there's nothing new here.
00:47:10.860 | But that's just the amount of working knowledge that we can fit into one of these agents.
00:47:14.260 | And it's not a lot.
00:47:15.260 | Right?
00:47:16.260 | Especially if you need to get them to do long running tasks, if they need to self-correct
00:47:21.500 | from errors that they observe while operating in their environment.
00:47:24.940 | Do you see this as a problem?
00:47:26.200 | Do you think we're going to just trend to infinite context and that'll go away?
00:47:30.540 | Or how do you think we're going to deal with it?
00:47:33.740 | When you talked about what's going to happen in the first wave and then in the second wave,
00:47:39.220 | I think what we'll see is we'll get relatively simplistic agents pretty soon.
00:47:43.660 | And they will get more and more complex.
00:47:46.180 | And there's a future wave in which they are able to do these really difficult, really
00:47:49.940 | long running tasks.
00:47:52.260 | And the blocker to that future, one of the blockers is memory.
00:47:56.660 | And that was true of computers too.
00:48:00.180 | I think when von Neumann made the von Neumann architecture, he was like, "The biggest blocker
00:48:05.780 | will be memory.
00:48:06.780 | We need this amount of memory," which is like, I don't remember exactly, like 32 kilobytes
00:48:10.460 | or something, "to store programs.
00:48:12.300 | And that will allow us to write software."
00:48:14.580 | He didn't say it this way because he didn't have these terms.
00:48:17.860 | And then that only really happened in the '70s with the microchip revolution.
00:48:23.600 | And so it may be the case that we're waiting for some research breakthroughs or some other
00:48:28.620 | breakthroughs in order for us to have really good long running memory.
00:48:33.300 | And then in the meantime, agents will be able to do all sorts of things that are a little
00:48:36.380 | bit smaller than that.
00:48:37.620 | I do think with the pace of the field, we'll probably come up with all sorts of interesting
00:48:40.860 | things.
00:48:41.860 | Like RAG is already very helpful.
00:48:44.100 | Good enough, you think?
00:48:45.580 | Maybe.
00:48:46.580 | Good enough for some things.
00:48:47.580 | How is it not good enough?
00:48:49.380 | I don't know.
00:48:50.380 | I just think about a situation where you want something that's like an AI scientist.
00:48:55.780 | As a scientist, I have learned so much about my field.
00:49:00.740 | And a lot of that data is maybe hard to fine tune on or maybe hard to put into pre-training.
00:49:09.300 | A lot of that data, I don't have a lot of repeats of the data that I'm seeing.
00:49:14.460 | My understanding is so at the edge that if I'm a scientist, I've accumulated so many
00:49:20.060 | little data points.
00:49:21.620 | And ideally, I'd want to store those somehow or use those to fine tune myself as a model
00:49:28.140 | somehow or have better memory somehow.
00:49:32.660 | I don't think RAG is enough for that kind of thing.
00:49:36.020 | But RAG is certainly enough for user preferences and things like that.
00:49:40.500 | What should I do in this situation?
00:49:41.500 | What should I do in that situation?
00:49:42.500 | That's a lot of tasks.
00:49:44.120 | We don't have to be a scientist right away.
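A minimal sketch of the "RAG is enough for user preferences" case: store a handful of remembered preferences, retrieve the most relevant ones, and prepend them to the prompt. The word-overlap scoring is a toy stand-in for a real embedding model, and all of the stored preferences are invented.

```python
# Toy sketch of retrieval-augmented generation over user preferences. The
# word-overlap "similarity" is a stand-in for a real embedding model;
# everything here is illustrative.
MEMORY = [
    "Prefers answers as bulleted lists.",
    "Uses Python 3.11 and type hints everywhere.",
    "Dislikes emails longer than five sentences.",
]


def score(query: str, memory_item: str) -> int:
    """Crude relevance: count shared lowercase words (stand-in for cosine sim)."""
    return len(set(query.lower().split()) & set(memory_item.lower().split()))


def build_prompt(question: str, k: int = 2) -> str:
    """Retrieve the k most relevant preferences and stuff them into the prompt."""
    retrieved = sorted(MEMORY, key=lambda m: score(question, m), reverse=True)[:k]
    context = "\n".join(f"- {m}" for m in retrieved)
    return f"Known user preferences:\n{context}\n\nUser question: {question}"


if __name__ == "__main__":
    print(build_prompt("Draft a short email in Python style about the release"))
```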
00:49:46.620 | I have a hard question, if you don't mind me being bold.
00:49:50.820 | I think the most comparable lab to Imbue is ADEPT.
00:49:55.060 | Whatever.
00:49:56.060 | A research lab with some amount of productization on the horizon, but not just yet.
00:50:04.300 | Why should people work for Imbue over ADEPT?
00:50:06.660 | The way I think about it is I believe in our approach.
00:50:11.140 | Maybe this is a general question of competitors.
00:50:14.760 | And the way I think about it is we're in a historic moment.
00:50:20.280 | This is 1978 or something.
00:50:23.600 | Love it.
00:50:24.600 | Apple is about to start.
00:50:26.480 | Lots of things are starting at that time.
00:50:28.280 | And IBM also exists and all of these other big companies exist.
00:50:34.120 | We know what we're doing.
00:50:35.520 | We're building reasoning foundation models, trying to make agents that actually work reliably.
00:50:41.120 | That are inspectable.
00:50:42.120 | That we can modify.
00:50:43.120 | That we have a lot of control over.
00:50:45.260 | And I think we have a really special team and culture.
00:50:48.140 | And that's what we are.
00:50:50.240 | I have a sense of where we want to go, of really trying to help the computer be a much
00:50:57.020 | more powerful tool for us.
00:50:58.560 | And the type of thing that we're doing is we're trying to build something that enables
00:51:03.400 | other people to build agents.
00:51:05.760 | And build something that really can be maybe something like an operating system for agents.
00:51:10.600 | I know that that's what we're doing.
00:51:12.120 | I don't really know what everyone else is doing.
00:51:15.360 | I talk to people and have some sense of what they're doing.
00:51:18.760 | And I think it's a mistake to focus too much on what other people are doing.
00:51:22.000 | Because extremely focused execution on the right thing is what matters.
00:51:27.240 | And so to the question of why us, I think strong focus on reasoning, which we believe
00:51:36.600 | is the biggest blocker, on inspectability.
00:51:40.840 | Which we believe is really important for user experience.
00:51:45.500 | And also for the power and capability of these systems.
00:51:49.040 | Building good, non-leaky abstractions,
00:51:51.440 | which we believe is solving the core issue of agents, which is around reliability
00:51:55.300 | and being able to make them deployable.
00:51:58.200 | And then really seriously trying to use these things ourselves.
00:52:03.400 | Every single day.
00:52:04.640 | And getting to something that we can actually ship to other people, that becomes something
00:52:08.160 | that is a platform.
00:52:09.160 | It feels like it could be Mac or Windows.
00:52:13.080 | I love the dogfooding approach.
00:52:15.880 | That's extremely important.
00:52:16.880 | And you will not be surprised how many agent companies I talk to that don't use their own
00:52:22.480 | agent.
00:52:23.480 | Oh no!
00:52:24.480 | That's not good!
00:52:25.480 | That's a big surprise.
00:52:26.480 | Yeah, I think if we didn't use our own agents, then we would have all of these beliefs about
00:52:30.720 | how good they are.
00:52:33.300 | The only other follow-up that I had, based on the answer you just gave, was: do you see
00:52:39.120 | yourself releasing models or do you see yourself...
00:52:43.720 | What are the artifacts that you want to produce that lead up to the general operating system
00:52:49.920 | that you want to have people use?
00:52:52.960 | And so a lot of people, just as a byproduct of their work, will say, "Hey, I'm still
00:52:58.480 | shipping, here's a model along the way." Adept took, I don't know, three years, but they
00:53:04.680 | released Persimmon recently.
00:53:08.120 | Do you think that kind of approach is something on your horizon or do you think there's something
00:53:12.240 | else that you can release that can show people, "Here's the idea, not the end product, but
00:53:17.760 | here's the byproduct of what we're doing"?
00:53:19.640 | Yeah.
00:53:20.640 | I don't really believe in releasing things to show people, "Oh, here's what we're doing,"
00:53:24.720 | that much.
00:53:25.960 | I think as a philosophy, we believe in releasing things that will be helpful to other people.
00:53:30.760 | And so I think we may release models or we may release tools that we think will help
00:53:35.720 | agent builders.
00:53:36.960 | Ideally, we would be able to do something like that, but I'm not sure exactly what they
00:53:40.440 | look like yet.
00:53:41.440 | I think more companies should get into the releasing evals and benchmarks game.
00:53:46.400 | Yeah.
00:53:47.400 | Something that we have been talking to agent builders about is co-building evals.
00:53:51.200 | So we build a lot of our own evals and every agent builder tells me basically evals are
00:53:56.520 | their biggest issue.
00:53:57.520 | And so, yeah, we're exploring right now.
00:53:59.640 | And if you are building agents, this is like a call.
00:54:02.080 | If you are building agents, please reach out to me because I would love to figure out how
00:54:06.080 | we can be helpful based on what we've seen.
00:54:09.680 | Cool.
00:54:10.680 | Well, that's a good call to action.
00:54:11.680 | I know a bunch of people that I can send your way.
00:54:13.160 | Cool.
00:54:14.160 | Great.
00:54:15.160 | Awesome.
00:54:16.160 | Yeah.
00:54:17.160 | We can zoom out to other interests now.
00:54:18.160 | We've got a lot of stuff.
00:54:19.160 | I saw from Lexica on the podcast, he had a lot of interesting questions on his website.
00:54:23.480 | You similarly have a lot of them.
00:54:25.840 | Yeah.
00:54:26.840 | I need to do this.
00:54:27.840 | I'm very jealous of people who have personal websites where they're like, here are the high-
00:54:30.640 | level questions and goals for humanity that I want to set people on.
00:54:34.600 | And I don't have that.
00:54:35.600 | This is great.
00:54:36.600 | This is good.
00:54:37.600 | It's never too late, Sean.
00:54:38.600 | Yeah.
00:54:39.600 | It's never too late.
00:54:40.600 | Exactly.
00:54:41.600 | There were a few that stuck out as related to your work that maybe you're kind of learning
00:54:46.040 | more about it.
00:54:47.040 | One is why are curiosity and goal orientation often at odds?
00:54:51.760 | And from a human perspective, I get it, it's like, you know, would you want to like go
00:54:54.880 | explore things or kind of like focus on your career?
00:54:58.000 | How do you think about that from like an agent perspective, where it's like, should you just
00:55:01.440 | stick to the task and try and solve it as in the guardrails as possible?
00:55:05.360 | Or like, should you look for alternative solutions?
00:55:08.080 | Yeah.
00:55:09.080 | This is a great question.
00:55:10.880 | So the problem with these questions is that I'm still confused about them.
00:55:15.760 | So in our discussion, I will not have good answers.
00:55:20.080 | I will be still confused.
00:55:22.280 | Why are curiosity and goal orientation so at odds?
00:55:24.400 | I think one thing that's really interesting about agents actually is that they can be
00:55:27.840 | forked.
00:55:29.360 | So like, you know, we can take an agent that's executed to a certain place and said, okay,
00:55:35.240 | here, like fork this and do a bunch of different things, try a bunch of different things.
00:55:39.280 | Some of those agents can be goal oriented and some of them can be like more curiosity
00:55:42.640 | driven.
00:55:43.640 | You can prompt them in slightly different ways.
00:55:44.640 | And something I'm really curious about, like what would happen if in the future, you know,
00:55:48.480 | we were able to actually go down both paths.
00:55:51.680 | As a person, why I have this question on my website is I really find that like, I really
00:55:56.920 | can only take one mode at a time.
00:56:00.040 | And I don't understand why.
00:56:02.320 | And like, is it inherent in like the kind of context that needs to be held?
00:56:08.560 | That's why I think from an agent perspective, like forking it is really interesting.
00:56:11.600 | Like I can't fork myself to do both, but I maybe could fork an agent to like at a certain
00:56:17.040 | point in a task, yeah, to explore both.
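To illustrate the forking idea in the simplest possible terms, the sketch below snapshots a toy agent state at a checkpoint and continues one copy with a goal-oriented system prompt and another with a curiosity-driven one. `AgentState` and `run_steps` are hypothetical stand-ins, not Imbue's system.

```python
# Illustrative sketch of "forking" an agent: snapshot its state mid-task, then
# continue two copies under different system prompts. Everything here is a toy.
import copy
from dataclasses import dataclass, field


@dataclass
class AgentState:
    system_prompt: str
    history: list = field(default_factory=list)  # transcript of steps so far


def run_steps(state: AgentState, n: int) -> AgentState:
    """Stand-in for actually running the agent; just records placeholder steps."""
    for i in range(n):
        state.history.append(f"[{state.system_prompt[:20]}...] step {i}")
    return state


def fork(state: AgentState, new_prompt: str) -> AgentState:
    """Deep-copy the state so both branches keep the shared prefix of work."""
    branch = copy.deepcopy(state)
    branch.system_prompt = new_prompt
    return branch


if __name__ == "__main__":
    base = run_steps(AgentState("Solve the task directly."), n=3)
    goal = fork(base, "Stay strictly on task; finish as fast as possible.")
    curious = fork(base, "Explore alternative approaches before committing.")
    run_steps(goal, 2)
    run_steps(curious, 2)
    print(len(goal.history), len(curious.history))  # both branches keep the shared prefix
```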
00:56:21.240 | How has the thinking changed for you as the funding of the company changed?
00:56:28.800 | That's one thing that I think a lot of people in the space think is like, oh, should I raise
00:56:32.160 | venture capital?
00:56:33.160 | Like, how should I get money?
00:56:36.120 | How do you feel your options to be curious versus like goal oriented has changed as you
00:56:42.600 | raise more money and kind of like the company has grown?
00:56:45.240 | That's really funny.
00:56:46.240 | Actually, things have not changed that much.
00:56:49.160 | So we raised our Series A $20 million in late 2021.
00:56:54.080 | And our entire philosophy at that time was, and still kind of is, is like, how do we figure
00:57:03.080 | out the stepping stones, like collect stepping stones that eventually let us build agents,
00:57:09.080 | the kind of these new computers that help us do bigger things.
00:57:13.360 | And there was a lot of curiosity in that.
00:57:15.600 | And there was a lot of goal orientation in that.
00:57:17.960 | Like the curiosity led us to build CARBS, for example, this hyperparameter optimizer.
00:57:24.240 | Great name by the way.
00:57:26.520 | Thank you.
00:57:27.520 | Is there a story behind that name?
00:57:28.520 | Yeah.
00:57:29.520 | Abe loves CARBS.
00:57:30.520 | It's also cost aware.
00:57:33.200 | So as soon as he came up with cost aware, he was like, I need to figure out how to make
00:57:36.000 | this work.
00:57:39.120 | But the cost awareness of it was really important.
00:57:40.920 | So that curiosity led us to this really cool hyperparameter optimizer.
00:57:44.600 | That's actually a big part of how we do our research.
00:57:47.040 | It lets us experiment on smaller models.
00:57:50.000 | And for those experiment results to carry to larger ones.
00:57:53.640 | Which you also published a scaling laws thing for it, which is great.
00:57:57.800 | I think the scaling laws paper from OpenAI was the big one.
00:58:01.240 | That, and the one from Google, were, I think, the greatest public service to machine learning that any
00:58:07.280 | research lab can do.
00:58:08.560 | Yeah.
00:58:09.560 | Totally.
00:58:10.560 | Yeah.
00:58:11.560 | And I think what was nice about CARBS is it gave us scaling laws for all sorts of hyperparameters.
00:58:15.920 | And then there's some goal oriented parts.
00:58:17.480 | Like Avalon, it was like a six to eight week sprint for all of us.
00:58:22.080 | And we got this thing out.
00:58:24.580 | And then now, different projects do more curiosity or more goal orientation at different times.
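This is not CARBS itself, just a toy illustration of the cost-aware idea mentioned above: when comparing hyperparameter candidates, keep the ones on the performance-versus-compute-cost Pareto frontier rather than only the single best scorer. The candidate configurations and numbers are invented.

```python
# Not CARBS -- a toy illustration of cost-aware comparison: keep hyperparameter
# candidates on the loss-vs-GPU-hours Pareto frontier. Numbers are made up.
candidates = [
    {"name": "small-lr-3e-4",  "loss": 2.9, "gpu_hours": 4},
    {"name": "small-lr-1e-3",  "loss": 3.1, "gpu_hours": 4},
    {"name": "medium-lr-3e-4", "loss": 2.6, "gpu_hours": 30},
    {"name": "large-lr-3e-4",  "loss": 2.5, "gpu_hours": 240},
]


def pareto_frontier(points):
    """Keep candidates not dominated by one that is cheaper and at least as good."""
    frontier = []
    for p in points:
        dominated = any(
            (q["loss"] <= p["loss"] and q["gpu_hours"] < p["gpu_hours"])
            or (q["loss"] < p["loss"] and q["gpu_hours"] <= p["gpu_hours"])
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


if __name__ == "__main__":
    for c in pareto_frontier(candidates):
        print(c["name"], c["loss"], c["gpu_hours"])
```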
00:58:32.800 | Another one of your questions that we highlighted was, how can we enable artificial agents to
00:58:37.500 | permanently learn new abstractions and processes?
00:58:40.280 | I think this might be called online learning.
00:58:42.880 | Yeah.
00:58:43.880 | So I struggle with this because that scientist example I gave.
00:58:49.440 | As a scientist, I've permanently learned a lot of new things and I've updated and created
00:58:53.600 | new abstractions and learned them pretty reliably.
00:58:56.600 | And you were talking about, OK, we have this RAM that we can store learnings in.
00:59:01.880 | But how well does online learning actually work?
00:59:05.360 | And the answer right now seems to be, as models get bigger, they fine tune faster.
00:59:10.720 | So they're more sample efficient as they get bigger.
00:59:13.360 | Because they already had that knowledge in there, you're just unlocking it.
00:59:18.400 | Maybe.
00:59:19.400 | Partly, maybe because they already have some subset of the representation.
00:59:22.240 | Yeah.
00:59:23.240 | So they just memorize things more, which is good.
00:59:27.600 | So maybe this question is going to be solved.
00:59:30.480 | But I still don't know what the answer is.
00:59:32.360 | I don't know, maybe have a platform that continually fine-tunes for you as you work on that domain,
00:59:40.240 | which is something I'm working on.
00:59:41.320 | Well, that's great.
00:59:42.320 | We would love to use that.
00:59:43.320 | We'll talk more.
00:59:45.320 | So two more questions just about your general activities, and you've just been very active
00:59:52.120 | in the San Francisco tech scene.
00:59:54.360 | You're a founding member of South Park Commons.
00:59:55.960 | Oh, yeah, that's true.
00:59:57.440 | Tell me more, because by the time I knew about SPC, it was already a very established thing.
01:00:03.560 | But what was it like in the early days?
01:00:04.920 | What was the story there?
01:00:05.960 | Yeah, the story is Ruchi, who started it, was the VP of operations at Dropbox.
01:00:11.920 | And I was the chief of staff, and we worked together very closely.
01:00:15.800 | She's actually one of the investors in Sorceress.
01:00:18.080 | And SPC is an investor in Imbue.
01:00:22.440 | And at that time, Ruchi was like, "You know, I would like to start a space for people who
01:00:26.880 | are figuring out what's next."
01:00:29.320 | And we were figuring out what's next post-Ember, those three months.
01:00:32.520 | And she was like, "Do you want to just hang out in this space?"
01:00:34.520 | And we're like, "Sure."
01:00:35.760 | And it was a really good group, I think, Waseem and Jeff from Pilot, the folks from Zulip,
01:00:41.680 | and a bunch of other people at that time.
01:00:43.760 | It was a really good group.
01:00:45.000 | We just hung out.
01:00:46.000 | There was no programming.
01:00:47.240 | It's much more official than it was at that time.
01:00:49.600 | Yeah.
01:00:50.600 | Now it's like a YC before YC type of thing.
01:00:53.240 | That's right.
01:00:54.240 | Yeah.
01:00:55.240 | At that time, we literally, it was a bunch of friends hanging out in the space together.
01:00:56.960 | And was this concurrent with the archive?
01:00:59.240 | Oh, yeah, actually.
01:01:01.480 | I think we started the archive around the same time.
01:01:03.560 | You're just really big into community.
01:01:05.880 | But also, I run a hacker house, right?
01:01:08.840 | And I'm also part of, hopefully, what becomes the next South Park Commons or whatever.
01:01:15.680 | But what are the principles in organizing communities like that with really exceptional
01:01:21.280 | people that go on to do great things?
01:01:23.280 | Do you have to be really picky about who joins?
01:01:26.440 | Did all your friends just magically turn out super successful like that?
01:01:32.160 | Yeah, I think so.
01:01:37.040 | I think we...
01:01:38.040 | You know it's not normal, right?
01:01:39.720 | This is very special.
01:01:41.200 | And a lot of people want to do that and fail.
01:01:45.120 | You had the co-authors of GPT-3 in your house.
01:01:47.840 | That's true.
01:01:48.920 | And a lot of other really cool people that you'll eventually hear about.
01:01:51.240 | And co-founders of Pilot and anyone else you want to...
01:01:53.360 | I don't want you to pick your friends, but there's some magic special sauce in getting
01:01:58.720 | people together in one workspace, living space, whatever.
01:02:02.400 | And that's part of why I'm here in San Francisco.
01:02:05.000 | And I would love for more people to learn about it and also maybe get inspired to build
01:02:09.200 | their own.
01:02:10.200 | One adage we had when we started the archive was you become the average of the five people
01:02:14.360 | closest to you.
01:02:16.360 | And I think that's roughly true.
01:02:17.360 | And good people draw good people.
01:02:18.960 | So there are really two things.
01:02:20.960 | One, we were quite picky and it mattered a lot to us.
01:02:27.000 | Is this someone where if they're hanging out in the living room, we'd be really excited
01:02:30.240 | to come hang out?
01:02:31.240 | Yeah.
01:02:32.240 | Two is I think we did a really good job of creating a high-growth environment and an
01:02:37.200 | environment where people felt really safe.
01:02:40.120 | We actually apply these things to our team and it works remarkably well as well.
01:02:43.920 | So I do a lot of basically how do I create safe spaces for people where it's not just
01:02:49.920 | like safe law, but it's a safe space where people really feel inspired by each other.
01:02:56.000 | And I think at the archive, we really made each other better.
01:02:58.960 | My friend, Michael Nielsen called it a self-actualization machine.
01:03:02.200 | My goodness.
01:03:04.200 | And I think, yeah, people came in and- Was he a part of the archive?
01:03:07.520 | He was not, but he hung out a lot.
01:03:08.960 | I don't remember.
01:03:09.960 | Friend of the archive.
01:03:10.960 | A friend of the archive, yeah.
01:03:12.880 | Like the culture was that we learned a lot of things from each other about how to make
01:03:19.020 | better life systems and how to think about ourselves and psychological debugging.
01:03:23.080 | And a lot of us were founders, so having other founders going through similar things was
01:03:27.280 | really helpful.
01:03:28.920 | And a lot of us worked in AI, and so having other people to talk about AI with was really
01:03:33.000 | helpful.
01:03:34.000 | And so I think all of those things led to a certain kind of idea flux.
01:03:40.920 | I think a lot about idea flux and the kind of default habits or default
01:03:45.440 | impulses a group has.
01:03:46.440 | That idea flux and those default impulses led to some really interesting things
01:03:51.760 | and led to us doing much bigger things, I think, than we otherwise would have decided
01:03:56.880 | to do because it felt like taking risks was less risky.
01:04:01.560 | So that's something we do a lot of on the team is like, how do we make it so that taking
01:04:04.960 | risks is less risky?
01:04:06.800 | And there's a term called scenius.
01:04:09.600 | I was thinking Kevin Kelly.
01:04:10.600 | Kevin Kelly, scenius.
01:04:11.600 | I was going to feed you that word, but I didn't want to like impress you.
01:04:15.760 | I think maybe like a lot of what I'm interested in is constructing a kind of scenius.
01:04:20.080 | And the archive was definitely a scenius in a particular way, or like getting toward a
01:04:23.140 | scenius in a particular way.
01:04:26.040 | And Jason Ben, my archive housemate and who now runs the neighborhood, has a good way
01:04:31.920 | of putting it.
01:04:32.920 | If genius is from your genes, scenius is from your scene.
01:04:36.440 | And yeah, I think like maybe a lot of the community building impulse is from this interest
01:04:41.480 | in what kind of idea flux can be created.
01:04:46.280 | There's a question of like, why did Xerox PARC come out with all of this interesting
01:04:50.040 | stuff?
01:04:51.040 | It's their scenius.
01:04:52.040 | Why did Bell Labs come out with all this interesting stuff?
01:04:55.240 | Maybe it's their scenius.
01:04:56.240 | Why didn't the transistor come out of Princeton and the other people working on it at the
01:05:00.680 | time?
01:05:01.680 | I just think it's remarkable how you hear a lot about Alan Kay.
01:05:05.320 | And I just read a bit and apparently Alan Kay was like the most junior guy at Xerox
01:05:08.600 | PARC.
01:05:09.600 | Yeah.
01:05:10.600 | Definitely.
01:05:11.600 | He's just the one who talks about it.
01:05:13.320 | He talks the most.
01:05:14.320 | Yeah, exactly.
01:05:15.320 | Yeah.
01:05:16.320 | So I, you know, hopefully I'm also working towards contributing to that scenius.
01:05:19.120 | I called mine the most provocative name: The Arena.
01:05:22.080 | Oh, interesting.
01:05:23.560 | That's quite provocative.
01:05:24.560 | In the arena.
01:05:26.080 | So are you fighting other people in the arena?
01:05:29.880 | You never know.
01:05:30.880 | We're in the arena.
01:05:31.880 | We're in the arena trying stuff, as they say.
01:05:36.440 | You are also a GP at Outset Capital, where you also co-organize the Thursday Nights in
01:05:40.680 | AI, where hopefully someday I'll eventually speak.
01:05:45.040 | You're on the roster.
01:05:46.040 | I'm on the roster.
01:05:47.040 | Thank you so much.
01:05:48.040 | So why spend time being a VC and organizing all these events?
01:05:52.760 | You're also a very busy CEO and, you know, why spend time with that?
01:05:57.360 | Why is that an important part of your life?
01:05:59.080 | Yeah.
01:06:00.080 | So I actually really like helping founders.
01:06:01.560 | So Allie, my investing partner, is fortunately amazing and she does everything for the fund.
01:06:09.840 | So she, like, hosts the Thursday Night events and she finds folks who we could invest in
01:06:15.480 | and she does basically everything.
01:06:17.040 | Josh and I are her co-partners.
01:06:19.280 | So Allie was our former chief of staff at Sorceress and we just thought she was amazing.
01:06:23.560 | And she wanted to be an investor and Josh and I also, like, care about helping founders
01:06:28.840 | and kind of, like, giving back to the community.
01:06:30.640 | What we didn't realize at the time when we started the fund is that it would actually
01:06:34.200 | be incredibly helpful for Imbue.
01:06:36.400 | So talking to AI founders who are building agents and working on, you know, similar things
01:06:42.760 | is really helpful.
01:06:44.000 | They could potentially be our customers and they're trying out all sorts of interesting
01:06:47.440 | things.
01:06:48.440 | And I think being an investor, looking at the space from the other side of the table,
01:06:52.920 | it's just a different hat that I routinely put on and it's helpful to see the space from
01:06:57.760 | the investor lens as opposed to from the founder lens.
01:07:01.440 | So I find that kind of, like, hat switching valuable.
01:07:05.040 | It maybe would lead us to do slightly different things.
01:07:07.480 | Let's just wrap with the lightning round.
01:07:09.600 | Okay.
01:07:10.600 | So we have three questions.
01:07:11.600 | Acceleration, exploration, and then a takeaway.
01:07:14.880 | So the acceleration question is what's something that already happened in AI that you thought
01:07:19.400 | would take much longer to be here?
01:07:22.400 | I think the rate at which we discover new capabilities of existing models and kind of, like, build
01:07:27.200 | hacks on top of them to make them work better is something that has been surprising and
01:07:31.400 | awesome.
01:07:32.900 | And the rate of kind of, like, the community, the research community building on its own
01:07:38.640 | ideas.
01:07:39.640 | Cool.
01:07:40.640 | Exploration/request for startups.
01:07:42.960 | If you weren't building Imbue, what AI company would you build?
01:07:49.280 | Every founder has, like, their, like, number two.
01:07:51.840 | Really?
01:07:52.840 | Yeah.
01:07:53.840 | I don't know.
01:07:55.840 | I cannot imagine building any other thing than Imbue.
01:07:57.840 | Well, that's a great answer, too.
01:07:58.840 | That's an interesting thing.
01:07:59.840 | It's, like, obviously the thing to build.
01:08:01.880 | Okay.
01:08:02.880 | It's, like, obviously work on the fundamental platform.
01:08:05.360 | Yeah.
01:08:06.360 | So that was my attempt at innovating on this question, but the previous
01:08:11.720 | one was: what's the most interesting unsolved question in AI?
01:08:15.880 | Yeah.
01:08:16.880 | I think probably the most interesting unsolved question, and my answer is kind of boring,
01:08:22.080 | but the most interesting unsolved questions are these questions of how do we make these
01:08:26.300 | stochastic systems into things that we can, like, reliably use and build on top of?
01:08:31.800 | And, yeah, take away what's one message you want everyone to remember?
01:08:37.760 | Maybe two things.
01:08:38.760 | Like, one is I didn't think in my lifetime I would necessarily be able to work
01:08:45.160 | on the things I'm excited to work on in this moment, but we're in a historic moment,
01:08:49.160 | one where we'll look back and be like, "Oh, my God.
01:08:51.120 | The future was invented in these years."
01:08:53.900 | There is maybe a set of messages to take away from that.
01:08:56.120 | One is, like, AI is a tool, like any technology, and, you know, when it comes to things, like,
01:09:05.280 | what might the future look like?
01:09:06.680 | We like to think about it as it's, like, just a better computer.
01:09:10.360 | It's like a much better, much more powerful computer that gives us a lot of free intellectual
01:09:14.040 | energy that we can now, like, solve so many problems with.
01:09:17.720 | You know, there are so many problems in the world where we're like, "Oh, it's not worth
01:09:20.400 | a person thinking about that," and so things get worse and things get worse.
01:09:23.480 | No one wants to work on maintenance, and, like, this technology gives us the potential
01:09:28.760 | to actually be able to, like, allocate intellectual energy to all of those problems, and the world
01:09:33.520 | could be much better, like, could be much more thoughtful because of that.
01:09:37.280 | I'm so excited about that, and there are definitely risks and dangers, and we actually do a fair
01:09:44.680 | amount of work on the policy side.
01:09:48.040 | On the safety side, like, we think about safety and policy in terms of engineering, theory,
01:09:54.360 | and also regulation, and kind of comparing to, like, the automobile or the airplane or
01:10:00.640 | any new technology, there's, like, a set of new possible, like, capabilities and a set
01:10:06.380 | of new possible dangers that are unlocked with every new technology, and so on the engineering
01:10:10.900 | side, like, we think a lot about engineering safety, like, how do we actually engineer
01:10:14.800 | these systems so that they are inspectable and, you know, why we reason in natural language
01:10:19.640 | so that the systems are very inspectable, so that we can, like, stop things if anything
01:10:24.080 | weird is happening.
01:10:25.080 | That's why we don't think end-to-end black boxes are a good idea.
01:10:28.560 | On the theoretical side, we, like, really believe in, like, deeply understanding, like,
01:10:31.800 | what are they learning?
01:10:32.800 | Like, when we actually fine-tune on individual examples, like, what's going on?
01:10:36.520 | When we're pre-training?
01:10:37.520 | What's going on?
01:10:39.080 | Like, debugging tools for these agents to understand, like, what's going on?
01:10:43.040 | And then on the regulation side, I think there's actually a lot of regulation that already
01:10:49.220 | covers many of the dangers, like, that people are talking about, and there are areas where
01:10:57.040 | there's not much regulation, and so we focus on those areas where there's not much regulation.
01:11:00.540 | So some of our work is actually, we built an agent that helped us analyze the, like,
01:11:06.660 | 20,000 pages of policy proposals submitted in response to the Department of Commerce's request for
01:11:12.580 | AI policy proposals.
01:11:15.240 | And we, like, looked at what were the problems people brought up, and what were the solutions
01:11:19.980 | they presented, and then, like, did a summary analysis and kind of, like, you know, built
01:11:24.880 | agents to do that.
01:11:26.660 | And now the Department of Commerce is, like, interested in using that as a tool to, like,
01:11:30.640 | analyze proposals.
01:11:31.640 | And so a lot of what we're trying to do on the regulation side is, like, actually figure
01:11:36.280 | out where is there regulation missing, and how do we actually, in a very targeted way,
01:11:42.680 | try to solve those missing areas.
01:11:45.480 | So I guess if I were to say, like, what are the takeaways, it's like, the future could
01:11:50.000 | be really exciting if we can actually get agents that are able to do these bigger things.
01:11:55.480 | Reasoning is the biggest blocker, plus, like, these sets of abstractions to make things
01:11:58.760 | more robust and reliable.
01:12:02.200 | And there are, you know, things where we have to be quite careful and thoughtful about how
01:12:06.800 | do we deploy these, and what kind of regulation should go along with it, so that this is actually
01:12:11.280 | a technology that, when we deploy it, it is protective to people, and not harmful.
01:12:16.040 | Awesome.
01:12:17.040 | Wonderful.
01:12:18.040 | Yeah.
01:12:19.040 | Thank you so much for your time, Kanjun.
01:12:20.040 | Cool.
01:12:21.040 | Thank you.
01:12:22.040 | That's it.
01:12:23.040 | Thank you so much.