Why AI Agents Don't Work (yet) - with Kanjun Qiu of Imbue
Chapters
0:00 Introductions
7:13 The origin story of Imbue
11:26 Imbue's approach to training large foundation models optimized for reasoning
14:20 Imbue's goals to build an "operating system" for reliable, inspectable AI agents
17:51 Imbue's process of developing internal tools and interfaces to collaborate with AI agents
19:47 Imbue's focus on improving reasoning capabilities in models, using code and other data
21:33 The value of using both public benchmarks and internal metrics to evaluate progress
21:43 Lessons learned from developing the Avalon research environment
23:31 The limitations of pure reinforcement learning for general intelligence
32:12 Imbue's vision for building better abstractions and interfaces for reliable agents
33:49 Interface design for collaborating with, rather than just communicating with, AI agents
39:51 The future potential of an agent-to-agent protocol
42:53 Leveraging approaches like critiquing between models and chain of thought
47:30 Kanjun's philosophy on enabling team members as creative agents at Imbue
59:54 Kanjun's experience co-founding the communal co-living space The Archive
60:22 Lightning Round
00:00:00.000 |
- Hey everyone, welcome to the Latent Space Podcast. 00:00:10.200 |
This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my 00:00:17.360 |
- Hey, and today in the studio we have Kanjun from Imbue. 00:00:22.840 |
- So, you and I have, I guess, crossed paths a number of times, and you were formerly named 00:00:28.920 |
Generally Intelligent, and you've just announced your rename, rebrand, in a huge, humongous 00:00:36.800 |
- And we're here to dive in into deeper detail on Imbue. 00:00:39.700 |
We'd like to introduce you just on a high-level basis, but then have you go into a little 00:00:46.280 |
So, you graduated your BS and MS at MIT, and you also spent some time at the MIT Media 00:00:52.280 |
Lab, one of the most famous, I guess, computer hacking labs in the world. 00:00:59.040 |
- Yeah, I built electronic textiles, so boards that make it possible to make soft clothing. 00:01:08.680 |
You can sew circuit boards into clothing, and then make clothing electronic. 00:01:15.960 |
- Basically, the idea was to teach young women computer science in this route, because what 00:01:20.120 |
we found was that young girls, they would be really excited about math until about sixth 00:01:24.720 |
grade, and then they're like, "Oh, math is not good anymore, because I don't feel like 00:01:30.520 |
the type of person who does math or does programming, but I do feel like the type of person who 00:01:35.240 |
So, it's like, "Okay, what if you combine the two?" 00:01:43.400 |
But then you graduated MIT, and you went straight into BizOps at Dropbox, where you're eventually 00:01:48.220 |
chief of staff, which is a pretty interesting role we can dive into later. 00:01:51.160 |
And then it seems like the founder bug hit you. 00:01:52.640 |
You were basically a three-times founder at Ember, Sorceress, and now at Generally Intelligent/Imbue. 00:01:57.920 |
What should people know about you on the personal side that's not on your LinkedIn, that's something 00:02:02.280 |
you're very passionate about outside of work? 00:02:04.000 |
- Yeah, I think if you ask any of my friends, they would tell you that I'm obsessed with 00:02:07.760 |
agency, like human agency and human potential. 00:02:15.720 |
- So, what's an example of human agency that you try to promote? 00:02:20.280 |
- I feel like, with all of my friends, I have a lot of conversations with them that's helping 00:02:26.520 |
I guess I do this with a team kind of automatically, too. 00:02:29.680 |
And I think about it for myself often, building systems. 00:02:32.320 |
I have a lot of systems to help myself be more effective. 00:02:35.360 |
At Dropbox, I used to give this onboarding talk called "How to Be Effective," which people 00:02:40.920 |
I think 1,000 people heard this onboarding talk, and I think maybe Dropbox was more effective. 00:02:45.400 |
I think I just really believe that, as humans, we can be a lot more than we are, and it's 00:02:52.200 |
I guess completely outside of work, I do dance. 00:02:58.040 |
- Yeah, lots of interest in that stuff, especially in the group living houses in San Francisco, 00:03:03.720 |
which I've been a little bit part of, and you've also run one of those. 00:03:08.400 |
I started The Archive with Josh, my co-founder, and a couple other folks in 2015. 00:03:16.160 |
- Was that the, I guess, the precursor to Generally Intelligent, that you started doing 00:03:25.160 |
- Yeah, so Josh and I are, this is our third company together. 00:03:30.040 |
Our first company, Josh poached me from Dropbox for Ember, and there, we built a really interesting 00:03:37.960 |
technology, laser raster projector, VR headset, and then we were like, "VR is not the thing 00:03:44.100 |
we're most passionate about," and actually, it was kind of early days when we both realized 00:03:49.280 |
we really do believe that, in our lifetimes, computers that are intelligent are going to 00:03:54.800 |
be able to allow us to do much more than we can do today as people and be much more as 00:04:02.800 |
At that time, we actually, after Ember, we were like, "Should we work on AI research 00:04:07.880 |
A bunch of our housemates were joining OpenAI, and we actually decided to do something more 00:04:12.900 |
pragmatic to apply AI to recruiting and to try to understand, like, "Okay, if we're actually 00:04:17.140 |
trying to deploy these systems in the real world, what's required?" 00:04:21.960 |
That taught us so much about what, that was maybe an AI agent in a lot of ways, like, 00:04:28.280 |
what does it actually take to make a product that people can trust and rely on? 00:04:34.400 |
I think we never really fully got there, and it's taught me a lot about what's required, 00:04:40.100 |
and it's kind of like, I think, informed some of our approach and some of the way that we 00:04:43.380 |
think about how these systems will actually get used by people in the real world. 00:04:48.580 |
Just to go one step deeper on that, so you're building AI agents in 2016, before it was 00:04:54.900 |
You got some milestone, you raised $30 million, something was working. 00:04:59.500 |
So what do you think you succeeded in doing, and then what did you try to do that did not 00:05:07.740 |
So Sorceress was an AI system that basically kind of looked for candidates that could be 00:05:13.580 |
a good fit and then helped you reach out to them. 00:05:19.180 |
We didn't have language models to help you reach out, so we actually had a team of writers 00:05:21.980 |
that customized emails, and we automated a lot of the customization. 00:05:30.420 |
Candidates would just be interested and land in your inbox, and then you can talk to them. 00:05:34.220 |
And as a hiring manager, that's such a good experience. 00:05:38.220 |
I think there were a lot of learnings, both on the product and market side. 00:05:41.780 |
On the market side, recruiting is a market that is endogenously high churn, which means 00:05:46.980 |
because people start hiring, and then we fill the role for them, and they stop hiring. 00:05:57.280 |
It's exactly the same problem as the dating business. 00:05:59.580 |
And I was really passionate about like, can we help people find work that is more exciting 00:06:05.080 |
A lot of people are not excited about their jobs, and a lot of companies are doing exciting 00:06:07.980 |
things, and the matching could be a lot better. 00:06:10.420 |
But the dating business kind of phenomenon put a damper on that. 00:06:15.900 |
So we had a good, it's actually a pretty good business, but as with any business with relatively 00:06:23.620 |
high churn, the bigger it gets, the more revenue we have, the slower growth becomes. 00:06:28.060 |
Because if you lose 30% of that revenue year over year, then it becomes a worse business. 00:06:34.500 |
So that was the dynamic we noticed quite early on after our Series A. 00:06:40.140 |
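To make that churn dynamic concrete, here is a tiny illustration with made-up numbers (not Sorceress's actual figures): with flat new bookings and roughly 30% annual churn, revenue approaches an asymptote, so the bigger the base gets, the slower the net growth.

```python
# Illustrative only: toy numbers, not Sorceress's actual figures.
# With constant new bookings and 30% annual churn, revenue approaches
# new_bookings / churn_rate, so the bigger the base, the slower the net growth.

def project_arr(start_arr, new_bookings_per_year, churn_rate, years):
    """Project annual recurring revenue under flat bookings and constant churn."""
    arr = start_arr
    history = []
    for _ in range(years):
        arr = arr * (1 - churn_rate) + new_bookings_per_year
        history.append(round(arr, 2))
    return history

# Start at $1M ARR, add $1M of new bookings per year, lose 30% per year:
# growth decelerates toward an asymptote near 1.0 / 0.3 ~= $3.3M.
print(project_arr(1.0, 1.0, 0.30, 10))
```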
I think the other really interesting thing about it is we realized what was required 00:06:44.460 |
for people to trust that these candidates were like well-vetted and had been selected 00:06:50.260 |
And it's what actually led us, a lot of what we do at Imbue is working on interfaces to 00:06:54.620 |
figure out how do we get to a situation where when you're building and using agents, these 00:07:03.140 |
That's actually one of the biggest issues with agents that go off and do longer range 00:07:06.780 |
goals is that I have to trust, did they actually think through the situation? 00:07:11.660 |
And that really informed a lot of our work today. 00:07:17.180 |
When did you decide recruiting was done for you, and you were ready for the next challenge? 00:07:25.780 |
I feel like in 2021, it wasn't as mainstream. 00:07:30.700 |
So the LinkedIn says that it started in 2021, but actually we started thinking very seriously 00:07:34.840 |
about it in early 2020, late 2019, early 2020. 00:07:39.500 |
Not exactly this idea, but in late 2019, so I mentioned our housemates, Tom Brown and 00:07:47.120 |
Ben Mann, they're the first two authors on GPT-3. 00:07:49.300 |
So what we were seeing is that scale is starting to work and language models probably will 00:07:55.320 |
actually get to a point where with hacks, they're actually going to be quite powerful. 00:07:59.460 |
And it was hard to see that at the time, actually, because GPT-3, the early versions of it, there 00:08:08.940 |
But we could kind of see, okay, you keep improving it in all of these different ways and it'll 00:08:15.480 |
And so what Josh and I were really interested in is, how can we get computers that help 00:08:24.140 |
There's this kind of future where I think a lot about, if I were born in 1900 as a woman, 00:08:32.500 |
I'd spend most of my time carrying water and literally getting wood to put in the stove 00:08:38.260 |
to cook food and cleaning and scrubbing the dishes and getting food every day because 00:08:48.060 |
And what's happened over the last 150 years since the Industrial Revolution is we've kind 00:08:54.580 |
Energy is way more free than it was 150 years ago. 00:08:58.780 |
And so as a result, we've built all these technologies like the stove and the dishwasher 00:09:03.060 |
And we have electricity and we have infrastructure, running water, all of these things that have 00:09:10.460 |
And I think the same thing is true for intellectual energy. 00:09:14.520 |
We don't really see it today because we're so in it, but our computers have to be micromanaged. 00:09:20.960 |
Part of why people are like, "Oh, you're stuck to your screen all day." 00:09:23.780 |
Well, we're stuck to our screen all day because literally nothing happens unless I'm doing 00:09:28.380 |
I can't send my computer off to do a bunch of stuff for me. 00:09:32.300 |
There is a future where that's not the case, where I can actually go off and do stuff and 00:09:37.080 |
trust that my computer will pay my bills and figure out my travel plans and do the detailed 00:09:41.780 |
work that I am not that excited to do so that I can be much more creative and able to do 00:09:47.020 |
things that I as a human am very excited about and collaborate with other people. 00:09:50.660 |
And there are things that people are uniquely suited for. 00:09:54.460 |
So that's kind of always been the thing that is really exciting, has been really exciting 00:10:03.300 |
I've known for a long time I think that AI, whatever AI is, it would happen in our lifetimes. 00:10:12.040 |
And the personal computer kind of started giving us a bit of free intellectual energy. 00:10:16.320 |
And this is like really the explosion of free intellectual energy. 00:10:19.120 |
So in early 2020, we were thinking about this and what happened was self-supervised learning 00:10:31.040 |
MoCo had come out, Momentum Contrast had come out earlier in 2019. 00:10:35.680 |
SimCLR came out in early 2020 and we were like, okay, for the first time, self-supervised 00:10:38.920 |
learning is working really well across images and text, and we suspected that, like, okay, actually 00:10:44.040 |
it's the case that machines can learn things the way that humans do. 00:10:48.180 |
And if that's true, if they can learn things in a fully self-supervised way, because like 00:10:54.320 |
We like go Google things and try to figure things out. 00:10:56.740 |
So if that's true, then like what the computer could be is much different, you know, is much 00:11:04.120 |
And so we started exploring ideas around like, how do we actually go? 00:11:08.860 |
We didn't think about the fact that we could actually just build a research lab. 00:11:12.580 |
So we were like, okay, what kind of startup could we build to like leverage self-supervised 00:11:17.060 |
learning so that it eventually becomes something that allows computers to become much more 00:11:25.340 |
But that became Generally Intelligent, which started as a research lab. 00:11:30.380 |
And so your mission is you aim to rekindle the dream of the personal computer. 00:11:36.340 |
So when did it go wrong and what are like your first products and kind of like a user 00:11:42.940 |
phasing things that you're building to rekindle it? 00:11:47.020 |
So what we do at Imbue is we train large foundation models optimized for reasoning. 00:11:53.340 |
And the reason for that is because reasoning is actually, we believe the biggest blocker 00:11:57.580 |
to agents or systems that can do these larger goals. 00:12:01.140 |
If we think about, you know, something that writes an essay, like when we write an essay, 00:12:06.900 |
we like write it, we don't just output it and then we're done. 00:12:10.180 |
We like write it and then we look at it and we're like, oh, I need to do more research 00:12:14.540 |
I'm going to go do some research and figure it out and come back and, oh, actually it's 00:12:19.380 |
not quite right, the structure of the outline, so I'm going to rearrange the outline, rewrite 00:12:24.540 |
It's this very iterative process and it requires thinking through like, okay, what am I trying 00:12:31.900 |
Also like, has the goal changed as I've learned more? 00:12:35.140 |
Also, you know, as a tool, like when should I ask the user questions? 00:12:39.340 |
I shouldn't ask them questions all the time, but I should ask them questions in higher 00:12:44.860 |
How certain am I about the like flight I'm about to book? 00:12:50.100 |
There are all of these notions of like risk certainty, playing out scenarios, figuring 00:12:53.300 |
out how to make a plan that makes sense, how to change the plan, what the goal should be, 00:12:58.100 |
that are things, you know, that we lump under the bucket of reasoning. 00:13:03.060 |
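As a rough sketch of the "ask questions in higher-risk situations" idea above (the thresholds and the confidence estimate are hypothetical placeholders, not Imbue's implementation), the check might look something like this:

```python
# Minimal sketch of risk/uncertainty gating for an agent action.
# The thresholds and the confidence estimate are hypothetical placeholders,
# not Imbue's implementation.

from dataclasses import dataclass

@dataclass
class Action:
    description: str
    stakes: float       # 0.0 (trivial, reversible) .. 1.0 (costly, irreversible)
    confidence: float   # agent's own estimate that this is what the user wants

def should_ask_user(action: Action,
                    stakes_threshold: float = 0.6,
                    confidence_threshold: float = 0.8) -> bool:
    """Interrupt the user only for high-stakes actions the agent is unsure about."""
    return action.stakes >= stakes_threshold and action.confidence < confidence_threshold

book_flight = Action("Book a nonrefundable SFO->JFK flight", stakes=0.9, confidence=0.6)
rename_file = Action("Rename a draft file", stakes=0.1, confidence=0.7)

print(should_ask_user(book_flight))  # True: costly and uncertain, so ask first
print(should_ask_user(rename_file))  # False: low stakes, just do it
```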
And models today, they're not optimized for reasoning. 00:13:05.260 |
It turns out that there's not actually that much explicit reasoning data on the internet 00:13:09.580 |
as you would expect, and so we get a lot of mileage out of optimizing our models for reasoning 00:13:15.660 |
And then on top of that, we build agents ourselves. 00:13:19.380 |
I can get into, we really believe in serious use, like really seriously using the systems 00:13:23.460 |
and trying to get to an agent that we can use every single day, tons of agents that 00:13:28.780 |
And then we experiment with interfaces that help us better interact with the agents. 00:13:33.380 |
So those are some set of things that we do on the kind of model training and agent side. 00:13:39.420 |
And then the initial agents that we build, a lot of them are trying to help us write 00:13:44.140 |
code better because code is most of what we do every day. 00:13:47.580 |
And then on the infrastructure and theory side, we actually do a fair amount of theory 00:13:51.100 |
work to understand like how do these systems learn? 00:13:53.860 |
And then also like what are the right abstractions for us to build good agents with, which we 00:14:00.180 |
And if you look at our website, we have a lot of tools. 00:14:05.020 |
We have a like really nice automated hyperparameter optimizer. 00:14:10.580 |
And it's all part of the belief of like, okay, let's try to make it so that the humans are 00:14:15.580 |
doing the things humans are good at as much as possible. 00:14:18.620 |
So out of our very small team, we get a lot of leverage. 00:14:21.180 |
And so would you still categorize yourself as a research lab now, or are you now in startup 00:14:25.860 |
Is that a transition that is conscious at all? 00:14:29.860 |
I think we've always intended to build, you know, to try to build the next version of 00:14:34.420 |
the computer, enable the next version of the computer. 00:14:37.820 |
The way I think about it is there is a right time to bring a technology to market. 00:14:43.780 |
Actually, iPhone was under development for 10 years, AirPods for five years. 00:14:48.620 |
And Apple has a story where, you know, iPhone, the first multi-touch screen was created. 00:14:54.240 |
They actually were like, oh, wow, this is cool. 00:14:58.060 |
They actually brought, they like did some work trying to productionize it and realized 00:15:03.760 |
And they put it back into research to try to figure out like, how do we make it better? 00:15:06.580 |
What are the interface pieces that are needed? 00:15:08.480 |
And then they brought it back into production. 00:15:09.700 |
So I think of production and research as kind of like these two separate phases. 00:15:13.940 |
And internally, we have that concept as well, where like things need to be done in order 00:15:21.520 |
And then when it's usable, like eventually we figure out how to productize it. 00:15:24.740 |
What's the culture like to make that happen, to have both like, kind of like product oriented, 00:15:30.940 |
And as you think about building the team, I mean, you just raised 200 million, I'm sure 00:15:36.680 |
What are like the right archetypes of people that work at Imbue? 00:15:41.460 |
Yeah, I would say we have a very unique culture in a lot of ways. 00:15:46.880 |
So how do you design social processes that enable people to be, you know, effective? 00:15:53.080 |
I like to think about team members as creative agents. 00:15:55.900 |
So because most companies, they think of their people as assets. 00:16:02.340 |
And I think about like, okay, what is an asset? 00:16:04.660 |
It's something you own, that provides you value that you can discard at any time. 00:16:12.520 |
And so we try to enable everyone to be a creative agent and to really unlock their superpowers. 00:16:17.760 |
So a lot of the work I do, you know, I was mentioning earlier, I'm like obsessed with 00:16:22.280 |
A lot of the work I do with team members is try to figure out like, you know, what are 00:16:26.760 |
What really gives you energy and where can we put you such that, and how can I help you 00:16:34.120 |
So much of our work, you know, in terms of team structure, like much of our work actually 00:16:39.200 |
CARBS, our hyperparameter optimizer, came from Abe trying to automate his own research process, 00:16:47.880 |
And he actually pulled some ideas from plasma physics. 00:16:49.960 |
He's a plasma physicist to make the local search work. 00:16:53.040 |
A lot of our work on evaluations comes from a couple members of our team who are like 00:16:58.120 |
We do a lot of work trying to figure out like, how do you actually evaluate if the model 00:17:05.640 |
And so a lot of things kind of like, I think of people as making the like them shaped blob 00:17:11.960 |
And I think, you know, yeah, that's the kind of person that we're hiring for. 00:17:17.760 |
We're hiring product engineers and data engineers and research engineers and all these roles. 00:17:22.960 |
You know, we have a project, we have projects, not teams. 00:17:27.000 |
We have a project around data collection and data engineering. 00:17:30.300 |
That's actually one of the key things that improves the model performance. 00:17:34.600 |
We have a pre-training kind of project and with some fine tuning as part of that. 00:17:39.360 |
And then we have an agent's project that's like trying to build on top of our models 00:17:42.960 |
as well as use other models in the outside world to try to make agents that then we actually 00:17:52.640 |
As a founder, you are now sort of a capital allocator among all of these different investments 00:18:00.380 |
And I was interested in how you mentioned that you're optimizing for improving reasoning 00:18:06.760 |
specifically inside of your pre-training, which I assume is just a lot of data collection. 00:18:10.940 |
We are optimizing reasoning inside of our pre-trained models. 00:18:16.400 |
And I can talk more about like what, you know, what exactly does it involve? 00:18:21.540 |
But actually a big part, maybe 50% plus, of the work is figuring out, even if you do have models 00:18:29.040 |
that reason well, like the models are still stochastic. 00:18:32.480 |
The way you prompt them is still kind of random, like it makes them do random things. 00:18:37.600 |
And so how do we get to something that is actually robust and reliable as a user? 00:18:44.000 |
You know, I was mentioning earlier when I talked to other people building agents, they 00:18:47.840 |
have to do so much work, like to try to get to something that they can actually productize. 00:18:54.160 |
And it takes a long time and agents haven't been productized yet for, partly for this 00:19:00.280 |
reason is that like the abstractions are very leaky. 00:19:03.840 |
You know, we can get like 80% of that way there, but like self-driving cars, like the 00:19:10.440 |
We believe that, and we have internally, I think some things that like an interface, 00:19:15.400 |
for example, that lets me really easily like see what the agent execution is, fork it, 00:19:21.120 |
try out different things, modify the prompt, modify like the plan that it is making. 00:19:28.120 |
This type of interface, it makes it so that I feel more like I'm collaborating with the 00:19:32.960 |
agent as it's executing, as opposed to it's just like doing something as a black box. 00:19:37.880 |
That's an example of a type of thing that's like beyond just the model pre-training. 00:19:41.740 |
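To make the "see the agent's execution, fork it, modify the plan" idea concrete, here is a minimal sketch; the class and method names are hypothetical, not Imbue's internal tooling.

```python
# Minimal sketch of an inspectable, forkable agent run.
# Class and method names are hypothetical, not Imbue's internal interfaces.

import copy
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Step:
    prompt: str
    output: Optional[str] = None

@dataclass
class AgentRun:
    goal: str
    steps: list = field(default_factory=list)

    def fork(self, at_step: int) -> "AgentRun":
        """Copy the run up to (but not including) `at_step` so an alternative can be tried."""
        return AgentRun(goal=self.goal, steps=copy.deepcopy(self.steps[:at_step]))

run = AgentRun(goal="Generate tests for parser.py")
run.steps.append(Step(prompt="List public functions in parser.py", output="parse, tokenize"))
run.steps.append(Step(prompt="Write a test for parse()", output="def test_parse(): ..."))

# Inspect why it wrote that test, then fork at step 1 and steer the plan differently.
alt = run.fork(at_step=1)
alt.steps.append(Step(prompt="Write a test for tokenize(), covering empty input"))
print(len(run.steps), len(alt.steps))  # 2 2 -> the original run is untouched
```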
But on the model pre-training side, like reasoning is a thing that we optimize for. 00:19:46.160 |
And a lot of that is about, yeah, what data do we put in? 00:19:51.520 |
It's interesting just because I always think like, you know, out of the levers that you 00:19:55.480 |
have, the resources that you have, I think a lot of people think that running a foundation 00:20:00.680 |
model company or a research lab is going to be primarily compute. 00:20:05.560 |
And I think the share of compute has gone down a lot over the past three years. 00:20:10.120 |
It used to be the main story, like the main way you scale is you just throw more compute 00:20:16.820 |
You need better data, you need better algorithms. 00:20:22.560 |
This is a very vague question, but is it like 30, 30, 30 now? 00:20:27.080 |
So one way I'll put this is people estimate that Llama 2 maybe took about $3 to $4 million 00:20:33.420 |
of compute, but probably $20 to $25 million worth of labeling data. 00:20:39.100 |
And I'm like, okay, well that's a very different story than all these other foundation model 00:20:42.700 |
labs raising hundreds of millions of dollars and spending it on GPUs. 00:20:54.180 |
We generate a lot of data and so that does help. 00:20:58.460 |
The generated data is close to, actually, as good as human-labeled data. 00:21:09.740 |
Do you feel like, and there's certain variations of this, there's the sort of the constitutional 00:21:14.820 |
AI approach from Anthropic and basically models sampling, training on data from other models. 00:21:22.020 |
I feel like there's a little bit of like contamination in there or to put it in a statistical form, 00:21:28.620 |
you're resampling a distribution that you already have that you already know doesn't 00:21:35.460 |
How do you feel about that basically, just philosophically? 00:21:38.620 |
So when we're optimizing models for reasoning, we are actually trying to make a part of the 00:21:46.820 |
So in a sense, this is actually what we want. 00:21:50.180 |
We want to, because the internet is a sample of the human distribution that's also skewed 00:21:56.140 |
in all sorts of ways, that is not the data that we necessarily want these models to be 00:22:05.560 |
What we've seen so far is that it seems to help. 00:22:07.360 |
When we're generating data, we're not really randomly generating data, we generate very 00:22:11.380 |
specific things that are like reasoning traces and that help optimize reasoning. 00:22:17.500 |
Code also is a big piece of improving reasoning. 00:22:19.780 |
So yeah, generated code is not that much worse than like regular human written code. 00:22:27.460 |
You might even say it can be better in a lot of ways. 00:22:32.980 |
What are some of the tools that you saw that you thought were not a good fit? 00:22:37.200 |
So you built Avalon, which is your own simulated world. 00:22:41.600 |
And when you first started, the kind of like metagame was like using games to simulate 00:22:47.580 |
things, using, you know, Minecraft, and then OpenAI's, like, Gym thing and all these 00:22:53.980 |
And your thing, I think in one of your other podcasts, you mentioned like Minecraft is 00:22:57.560 |
like way too slow to actually do any serious work. 00:23:07.320 |
But Avalon is like a hundred times faster than Minecraft for simulation. 00:23:12.360 |
When did you figure that out that you needed to just like build your own thing? 00:23:16.520 |
Was it kind of like your engineering team was like, hey, this is too slow. 00:23:22.760 |
At that time, we built Avalon as a research environment to help us learn particular things. 00:23:28.200 |
And one thing we were trying to learn is like, how do you get an agent that is able to do 00:23:35.880 |
Like RL agents at that time and environments at that time, what we heard from other RL 00:23:39.960 |
researchers was the like biggest thing holding the field back is lack of benchmarks that 00:23:46.420 |
let us kind of explore things like planning and curiosity and things like that and have 00:23:52.760 |
the agent actually perform better if the agent has curiosity. 00:23:57.160 |
And so we were trying to figure out like, okay, how can we have agents that are like 00:24:02.120 |
able to handle lots of different types of tasks without the reward being pretty handcrafted? 00:24:09.280 |
That's a lot of what we had seen is that like these very handcrafted rewards. 00:24:17.320 |
And what it taught us, and it also allowed us to kind of create a curriculum so we could 00:24:26.200 |
And it taught us a lot, maybe two primary things. 00:24:29.720 |
One is with no curriculum, RL algorithms don't work at all. 00:24:36.440 |
For the non-RL specialists, what is a curriculum in your terminology? 00:24:39.960 |
So a curriculum in this particular case is basically the environment Avalon lets us generate 00:24:47.080 |
simpler environments and harder environments for a given tasks. 00:24:50.400 |
What's interesting is that the simpler environments, you know, what you'd expect is the agent succeeds 00:24:57.600 |
And so, you know, kind of my intuitive way of thinking about it is, okay, the reason 00:25:01.300 |
why it learns much faster with a curriculum is it's just getting a lot more signal. 00:25:06.240 |
And that's actually an interesting kind of like general intuition to have about training 00:25:11.220 |
It's like, what kind of signal are they getting and like, how can you help it get a lot more signal? 00:25:16.960 |
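To make the curriculum idea concrete, here is a minimal sketch of a success-rate-driven difficulty schedule; `make_env`, the thresholds, and the update rule are hypothetical placeholders, not Avalon's actual mechanism.

```python
# Minimal sketch of a success-rate-driven curriculum.
# `make_env`, the thresholds, and the update rule are hypothetical placeholders.

import random

class Curriculum:
    def __init__(self, start_difficulty=0.1, step=0.05, target_success=0.7, window=50):
        self.difficulty = start_difficulty
        self.step = step
        self.target_success = target_success
        self.window = window
        self.recent = []          # rolling record of recent episode outcomes

    def sample_difficulty(self) -> float:
        # Jitter around the current level so the agent sees a band of task difficulties.
        return max(0.0, min(1.0, random.gauss(self.difficulty, 0.05)))

    def report(self, success: bool) -> None:
        self.recent = (self.recent + [success])[-self.window:]
        if len(self.recent) == self.window:
            rate = sum(self.recent) / self.window
            # Ramp difficulty up when the agent succeeds often, down when it rarely does,
            # so it keeps getting learnable reward signal instead of constant failure.
            if rate > self.target_success:
                self.difficulty = min(1.0, self.difficulty + self.step)
            elif rate < self.target_success / 2:
                self.difficulty = max(0.0, self.difficulty - self.step)

curriculum = Curriculum()
# Usage sketch: env = make_env(task="open_door", difficulty=curriculum.sample_difficulty())
# ... run the episode, then: curriculum.report(success=episode_succeeded)
```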
The second thing we learned is that reinforcement learning is not a good vehicle, like pure 00:25:21.680 |
reinforcement learning is not a good vehicle for planning and reasoning. 00:25:24.960 |
So these agents were not able to, they were able to learn all sorts of crazy things. 00:25:29.220 |
They could learn to climb, like hand over hand in VR climbing, they can learn to open 00:25:33.760 |
doors, like very complicated, like multiple switches and a lever to open the door. 00:25:40.360 |
But they couldn't do any higher level things and they couldn't do those lower level things 00:25:49.040 |
And as a user, we were like, okay, as a user, I do not want to interact with a pure reinforcement 00:25:55.580 |
As a user, like I need much more control over what that agent is doing. 00:26:00.080 |
And so that actually started to get us on the track of thinking about, okay, how do 00:26:06.980 |
And we were pretty inspired by our friend Chelsea Finn at Stanford was I think working 00:26:11.160 |
on SayCan at the time, where it's basically an experiment where they have robots kind 00:26:19.600 |
of trying to do different tasks and actually do the reasoning for the robot in natural 00:26:27.400 |
And that led us to start experimenting very seriously with reasoning. 00:26:32.760 |
How important is the language part for the agent versus for you to inspect the agent? 00:26:39.200 |
You know, like is it the interface to kind of the human on the loop really important 00:26:46.360 |
I personally think of it as it's much more important for us, the human user. 00:26:49.320 |
So I think you probably could get end-to-end agents that work and are fairly general at 00:27:00.160 |
Like we actually want agents that we can like perturb while they're trying to figure out 00:27:06.400 |
So it's, you know, even a very simple example, internally we have like a type error fixing 00:27:11.320 |
agent and we have like a test generation agent. 00:27:13.960 |
Test generation agent goes off the rails all the time. 00:27:17.760 |
I want to know like, why did it generate this particular test? 00:27:22.440 |
Did it consider, you know, the fact that this is calling out to this other function? 00:27:27.560 |
Like formatter agent, if it ever comes up with anything weird, I want to be able to 00:27:34.200 |
With RL end-to-end stuff, like we couldn't do that. 00:27:36.640 |
So it sounds like you have a bunch of agents that are operating internally within the company. 00:27:41.280 |
What's your most, I guess, successful agent and what's your least successful one? 00:27:46.640 |
A type of agent that works moderately well is like fix the color of this button on the 00:27:51.120 |
website or like change the color of this button. 00:27:59.440 |
I don't know how often you have to fix the color of the button, right? 00:28:02.000 |
Because all of them raise money on the idea that they can go further. 00:28:06.240 |
And my fear when encountering something like that is that there's some kind of unknown 00:28:10.480 |
asymptote ceiling that's going to prevent them, that they're going to run head on into 00:28:21.240 |
I mean, for us, we think of it as reasoning plus these tools. 00:28:28.760 |
I think actually you can get really far with current models and that's why it's so compelling. 00:28:34.360 |
Like we can pile debugging tools on top of these current models, have them critique each 00:28:39.720 |
other and critique themselves and do all of these like, you know, spend more compute at 00:28:45.080 |
inference time, context hack, you know, retrieval augmented generation, et cetera, et cetera, 00:28:52.920 |
Like the pile of hacks actually does get us really far. 00:28:56.440 |
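A minimal sketch of one of the "spend more compute at inference time" hacks mentioned here: sample several candidates, critique them, and keep the best. The `generate` and `critique` functions are placeholders for real model calls, and the scoring below is only there to keep the sketch runnable.

```python
# Minimal sketch of best-of-N sampling with self-critique.
# `generate` and `critique` are placeholders for real model calls;
# the random scoring below is only there to keep the sketch runnable.

import random

def generate(prompt: str) -> str:
    # Placeholder: substitute a real model call.
    return f"candidate answer #{random.randint(0, 9999)} to: {prompt}"

def critique(prompt: str, candidate: str) -> float:
    # Placeholder: in practice, ask the model to score the candidate from 0 to 1.
    return random.random()

def best_of_n(prompt: str, n: int = 5) -> str:
    # Spend extra inference compute: sample N candidates, critique each, keep the best.
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(critique(prompt, c), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]

print(best_of_n("Fix the type error in utils.py"))
```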
And you're kind of like trying to get more signal out of the channel. 00:29:03.400 |
That's what the default approach is, like trying to get more signal out of this noisy channel. 00:29:08.360 |
But the issue with agents is as a user, I want it to be mostly reliable. 00:29:16.080 |
Like it's not as bad as self-driving, like in self-driving, you know, you're like hurtling 00:29:21.000 |
at 70 miles an hour is like the hardest agent problem. 00:29:24.320 |
But I think one thing we learned from Sorceress and one thing we've learned like by using 00:29:28.480 |
these things internally is we actually have a pretty high bar for these agents to work. 00:29:33.760 |
You know, it is actually really annoying if they only work 50% of the time and we can 00:29:38.720 |
make interfaces to make it slightly less annoying. 00:29:40.680 |
But yeah, there's a ceiling that we've encountered so far and we need to make the models better 00:29:46.600 |
and we also need to make the kind of like interface to the user better and also a lot 00:29:49.920 |
of the like, you know, critiquing, we have a lot of like generation methods, kind of 00:29:56.880 |
like spending-compute-at-inference-time generation methods that help things be more robust and 00:30:02.160 |
reliable, but it's still not 100% of the way there. 00:30:05.560 |
So to your question of like what agents work well and what doesn't work well, like most 00:30:09.240 |
of the agents don't work well and we're slowly making them work better by improving the underlying 00:30:14.800 |
I think that that's comforting for a lot of people who are feeling a lot of imposter syndrome 00:30:21.680 |
And I think the fact that you share their struggles, I think also helps people understand 00:30:28.880 |
It's very early and I hope what we can do is help people who are building agents actually 00:30:35.640 |
I think, you know, that's the gap that we see a lot of today is everyone who's trying 00:30:39.400 |
to build agents to get to the point where it's robust enough to be deployable. 00:30:48.440 |
Well, so this goes back into what Embu is going to offer as a product or a platform. 00:30:51.480 |
How are you going to actually help people deploy those agents? 00:30:55.160 |
Yeah, so our current hypothesis, I don't know if this is actually going to end up being 00:31:00.080 |
We've built a lot of tools for ourselves internally around like debugging, around like abstractions 00:31:07.040 |
or techniques after the model generation happens, like after the language model generates the 00:31:13.080 |
text, like interfaces for the user and the underlying model itself, like models talking 00:31:22.200 |
Maybe some set of those things, kind of like an operating system, some set of those things 00:31:30.240 |
And we'll figure out what set of those things is helpful for us to make our agents. 00:31:34.400 |
Like what we want to do is get to a point where we can start making an agent, deploy 00:31:40.120 |
And there's a similar analog to software engineering, like in the early days, in the '70s, in the 00:31:44.480 |
'60s, like to program a computer, you have to go all the way down to the registers and 00:31:54.640 |
Then we wrote programming languages with these higher levels of abstraction, and that allowed 00:31:58.440 |
a lot more people to do this and much faster, and the software created is much less expensive. 00:32:03.240 |
And I think it's basically a similar route here where we're like in the like bare metal 00:32:08.280 |
phase of agent building, and we will eventually get to something with much nicer abstractions. 00:32:14.360 |
So you touched a little bit on the data before. 00:32:17.120 |
We had this conversation with George Hotz, we were like, there's not a lot of reasoning 00:32:21.600 |
data out there, and can the models really understand? 00:32:24.680 |
And his take was like, look, with enough compute, you're not that complicated as a human. 00:32:29.320 |
The model can figure out eventually why certain decisions are made. 00:32:34.600 |
As you think about reasoning data, do you have to do a lot of manual work, or is there 00:32:40.080 |
a way to prompt models to extract the reasoning from actions that they see? 00:32:46.160 |
We don't think of it as, oh, throw enough data at it, and then it will figure out what 00:32:55.800 |
So we have a lot of thoughts internally, like many documents about what reasoning is. 00:32:59.920 |
A way to think about it is as humans, we've learned a lot of reasoning strategies over 00:33:05.040 |
We are better at reasoning now than we were 3,000 years ago. 00:33:08.000 |
An example of a reasoning strategy is noticing you're confused. 00:33:12.060 |
And then when I notice I'm confused, I should ask like, huh, what was the original claim 00:33:18.560 |
What evidence is there for this claim, et cetera, et cetera? 00:33:25.480 |
This is like a reasoning strategy that was developed in like the 1600s, with like the 00:33:33.600 |
We employ all the time, lots of heuristics that help us be better at reasoning. 00:33:40.980 |
And because they're invented, we can generate data that's much more specific to them. 00:33:44.860 |
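As one hypothetical illustration of generating data for a specific, named reasoning strategy (the template and fields are invented for this sketch, not Imbue's data format), the "notice you're confused" strategy could be turned into synthetic traces like this:

```python
# Minimal sketch: turn a named reasoning strategy into synthetic training examples.
# The template and fields are hypothetical, not Imbue's actual data format.

import json

NOTICE_CONFUSION_TEMPLATE = (
    "Claim: {claim}\n"
    "I notice I'm confused: the claim conflicts with something I believe.\n"
    "What was the original claim? {claim}\n"
    "What evidence supports it? {evidence}\n"
    "What evidence cuts against it? {counter}\n"
    "Tentative conclusion: {conclusion}"
)

def make_example(claim, evidence, counter, conclusion):
    return {
        "strategy": "notice_confusion",
        "trace": NOTICE_CONFUSION_TEMPLATE.format(
            claim=claim, evidence=evidence, counter=counter, conclusion=conclusion
        ),
    }

example = make_example(
    claim="This function is never called",
    evidence="No direct call sites in the repo",
    counter="It is exported and referenced from a config file",
    conclusion="It is probably invoked dynamically; keep it",
)
print(json.dumps(example, indent=2))
```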
So I think internally, yeah, we have a lot of thoughts on what reasoning is, and we generate 00:33:49.320 |
We're not just like, oh, it'll figure out reasoning from this black box, or it'll figure 00:33:56.800 |
I mean, the scientific method is like a good example. 00:34:03.160 |
And people are thinking, how do we use these models to do net new scientific research? 00:34:09.240 |
And if you go back in time and the model is like, well, the earth revolves around the 00:34:13.840 |
sun, and people are like, man, this model is crap. 00:34:20.360 |
Like, how do you see the future where like, do you think we can actually, like, if the 00:34:26.120 |
models are actually good enough, but we don't believe them, it's like, how do we make the 00:34:32.760 |
Say you're like, you use Imbue as a scientist to do a lot of your research, and Imbue tells 00:34:37.960 |
you, hey, I think this is like a serious bet. 00:34:43.120 |
Like, how is that trust going to be built, and like, what are some of the tools that 00:34:50.760 |
So like, one element of it is like, as a person, like, I need to basically get information 00:34:57.040 |
out of the model such that I can try to understand what's going on with the model. 00:35:01.160 |
So then the second question is like, okay, how do you do that? 00:35:04.560 |
And that's kind of, some of our debugging tools, they're not necessarily just for debugging. 00:35:10.080 |
They're also for like, interfacing with and interacting with the model. 00:35:12.760 |
So like, if I go back in this reasoning trace and like, change a bunch of things, what's 00:35:19.280 |
So that kind of helps me understand, like, what are its assumptions? 00:35:23.440 |
And it, you know, we think of these things as tools. 00:35:30.120 |
And so it's really about, like, as a user, how do I use this tool effectively? 00:35:33.640 |
Like, I need to be willing to be convinced as well. 00:35:36.400 |
It's like, how do I use this tool effectively, and what can it help me with, and what can 00:35:40.760 |
So there's a lot of mention of code in your process. 00:35:47.200 |
I think we might run the risk of giving people the impression that you view code, or you 00:35:54.560 |
use code, just as like a tool within yourself, within Imbue, just for coding assistance. 00:36:01.560 |
And I think there's a lot of informal understanding about how adding code to language models improves 00:36:08.120 |
I wonder if there's any research or findings that you have to share that talks about the 00:36:15.880 |
Yeah, so the way I think about it intuitively is, like, code is the most explicit example 00:36:23.800 |
And it's not only structured, it's actually very explicit, which is nice. 00:36:27.940 |
You know, it says this variable means this, and then it uses this variable, and then the 00:36:33.240 |
Like, as people, when we talk in language, it takes a lot more to kind of, like, extract 00:36:38.320 |
that, like, explicit structure out of, like, our language. 00:36:43.140 |
And so that's one thing that's really nice about code, is I see it as almost like a curriculum 00:36:47.960 |
I think we use code in all sorts of ways, like, the coding agents are really helpful 00:36:53.800 |
for us to understand, like, what are the limitations of the agents? 00:36:57.800 |
The code is really helpful for the reasoning itself, but also code is a way for models 00:37:04.120 |
So by generating code, it can act on my computer. 00:37:08.080 |
And you know, when we talk about rekindling the dream of the personal computer, kind of 00:37:11.720 |
where I see computers going is, like, computers will eventually become these much more malleable 00:37:17.280 |
things, where I, as a user, today, I have to know how to write software code, like, 00:37:24.160 |
in order to make my computer do exactly what I want it to do. 00:37:28.660 |
But in the future, if the computer is able to generate its own code, then I can actually 00:37:37.000 |
And so we, you know, one way we think about agents is it's kind of like a natural language 00:37:42.640 |
It's a way to program my computer in natural language that's much more intuitive to me 00:37:46.880 |
And these interfaces that we're building are essentially IDEs for users to program our computers. 00:37:54.520 |
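A minimal sketch of the "program the computer in natural language" loop: the model writes code for a request and that code is executed. `call_model` is a stubbed placeholder, and a real system would need sandboxing and review rather than a bare `exec`.

```python
# Minimal sketch of "programming the computer in natural language":
# the model writes code for a request and the code is executed.
# `call_model` is a stubbed placeholder; a real system needs sandboxing and
# review rather than a bare exec().

import contextlib
import io

def call_model(request: str) -> str:
    # Placeholder: substitute a real model call that returns Python source.
    return "total = sum(range(1, 11))\nprint(total)"

def run_natural_language_request(request: str) -> str:
    code = call_model(f"Write Python that does the following:\n{request}")
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})   # sketch only: never exec untrusted model output unsandboxed
    return buffer.getvalue()

print(run_natural_language_request("Add the numbers 1 through 10"))  # prints 55
```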
What do you think about the other, the different approaches people have, kind of like, text 00:38:02.900 |
What do you think the best interface will be, or like, what is your, you know, thinking 00:38:08.840 |
I think chat is very limited as an interface. 00:38:14.760 |
It is sequential, where these agents don't have to be sequential. 00:38:20.760 |
So with a chat interface, if the agent does something wrong, I have to, like, figure out 00:38:26.080 |
how to, like, how do I get it to go back and start from the place I wanted it to start 00:38:31.680 |
So in a lot of ways, like, chat as an interface, I think Linus Lee, who you had on this podcast, 00:38:37.200 |
I really like how he put it, chat as an interface is skeuomorphic. 00:38:41.000 |
So in the early days, when we made word processors on our computers, they had notepad lines, 00:38:47.040 |
because that's what we understood, you know, these, like, objects to be. 00:38:51.480 |
Chat, like texting someone, is something we understand. 00:38:54.600 |
So texting our AI is something that we understand. 00:38:58.000 |
But today's Word documents don't have notepad lines. 00:39:02.080 |
And similarly, the way we want to interact with agents, like, chat is a very primitive 00:39:08.640 |
What we want is to be able to inspect their state and to be able to modify them and fork 00:39:12.840 |
And we internally have, kind of, like, think about what are the right representations for 00:39:18.040 |
that, like, architecturally, like, what are the right representations? 00:39:22.160 |
What kind of abstractions do we need to build? 00:39:24.640 |
And how do we build abstractions that are not leaky? 00:39:27.720 |
Because if the abstractions are leaky, which they are today, like, you know, this stochastic 00:39:31.520 |
generation of text is like a leaky abstraction. 00:39:35.940 |
And that means it's actually really hard to build on top of. 00:39:38.960 |
But our experience and belief is, actually, by building better abstractions and better 00:39:43.760 |
tooling, we can actually make these things non-leaky. 00:39:46.960 |
And now you can build, like, whole things on top of them. 00:39:49.520 |
So these other interfaces, because of where we are, we don't think that much about them. 00:39:54.840 |
Yeah, I mean, you mentioned this is kind of like the Xerox PARC moment for AI. 00:40:00.720 |
And we had a lot of stuff come out of PARC, like, yeah, what you see is what you get, 00:40:11.100 |
We didn't have all these, like, higher things. 00:40:13.380 |
What do you think it's reasonable to expect in, like, this era of AI? 00:40:19.940 |
Like, what are, like, the things we'll build today? 00:40:21.740 |
And what are things that maybe we'll see in, kind of, like, the second wave of products? 00:40:25.820 |
>> I think the waves will be much faster than before. 00:40:29.380 |
Like, what we're seeing right now is basically, like, a continuous wave. 00:40:34.900 |
So people like the Xerox PARC analogy I give, but I think there are many different analogies. 00:40:39.540 |
Like one is the, like, analog to digital computer is another analogy to where we are today. 00:40:45.180 |
The analog computer Vannevar Bush built in the 1930s, I think, and it's like a system 00:40:51.800 |
And it can only calculate one function, like, it can calculate, like, an integral. 00:40:55.740 |
And that was so magical at the time, because you actually did need to calculate this integral 00:41:03.380 |
And so there was actually a set of breakthroughs necessary in order to get to the digital computer. 00:41:08.280 |
Like Turing's decidability, Shannon showing that, like, relay circuits 00:41:18.460 |
can be thought of as, can be mapped to Boolean operators. 00:41:22.180 |
And a set of other, like, theoretical breakthroughs, which essentially, they were creating abstractions 00:41:30.480 |
And digital had this nice property of, like, being error correcting. 00:41:34.180 |
And so when I talk about, like, less leaky abstractions, that's what I mean. 00:41:37.180 |
That's what I'm kind of pointing a little bit to. 00:41:41.340 |
And then the Xerox PARC piece, a lot of that is about, like, how do we get to computers 00:41:51.740 |
And the interface actually helps it unlock so much more power. 00:41:55.700 |
So the sets of things we're working on, like the sets of abstractions and the interfaces, 00:42:00.820 |
like, hopefully that, like, help us unlock a lot more power in these systems. 00:42:04.940 |
Like, hopefully that'll come not too far in the future. 00:42:08.740 |
I could see a next version, like, maybe a little bit farther out. 00:42:15.580 |
So a way for different agents to talk to each other and call each other, kind of like HTTP. 00:42:23.780 |
Yeah, there is a nonprofit that's working on one. 00:42:27.100 |
I think it's a bit early, but it's interesting to think about right now. 00:42:32.620 |
Part of why I think it's early is because the issue with agents is it's not quite like 00:42:39.460 |
the internet where you could, like, make a website and the website would appear. 00:42:44.060 |
The issue with agents is that they don't work. 00:42:46.880 |
And so it may be a bit early to figure out what the protocol is before we really understand 00:42:52.940 |
But, you know, I think that's, I think it's a really interesting question. 00:42:55.300 |
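Purely as a hypothetical illustration of what an agent-to-agent call might carry (no existing protocol or the nonprofit's spec is being described here), a minimal request/response pair could look like:

```python
# Purely hypothetical sketch of a minimal agent-to-agent request/response pair,
# just to make the "HTTP for agents" analogy concrete. No real protocol is implied.

import json
import uuid

request = {
    "id": str(uuid.uuid4()),
    "from_agent": "travel-planner",
    "to_agent": "calendar-agent",
    "goal": "Find a 3-day window in March with no meetings",
    "constraints": {"timezone": "America/Los_Angeles"},
}

response = {
    "in_reply_to": request["id"],
    "status": "ok",            # or "needs_clarification" / "failed"
    "result": {"window": ["2024-03-12", "2024-03-14"]},
    "confidence": 0.8,         # the caller can decide whether to trust it or re-ask
}

print(json.dumps(request, indent=2))
print(json.dumps(response, indent=2))
```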
While we're talking on this agent-to-agent thing, there's been a bit of research recently 00:43:02.380 |
I tend to just call them extremely complicated chain of thoughting, but any perspectives 00:43:09.980 |
on kind of meta-GPT, I think is the name of the paper. 00:43:13.260 |
I don't know if you care about at the level of individual papers coming out, but I did 00:43:18.860 |
read that recently, and TLDR, it beat GPT-4 on HumanEval by role-playing software agents. 00:43:26.900 |
Instead of having a single shot, a single role, you have multiple roles and having all 00:43:31.540 |
of them criticize each other as agents communicating with other agents. 00:43:36.100 |
I think this is an example of an interesting abstraction of like, okay, can I just plop 00:43:40.100 |
in this multi-role critiquing and see how it improves my agent? 00:43:45.020 |
Can I just plop in chain of thought, tree of thought, plop in these other things and 00:43:51.700 |
One issue with this kind of prompting is that it's still not very reliable. 00:43:57.300 |
There's one lens which is like, okay, if you do enough of these techniques, you'll get 00:44:01.100 |
I think actually that's a pretty reasonable lens. 00:44:06.820 |
Then there's another lens that's like, okay, but it's starting to get really messy what's 00:44:11.740 |
in the prompt and how do we deal with that messiness? 00:44:15.900 |
Maybe you need cleaner ways of thinking about and constructing these systems. 00:44:23.100 |
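A minimal sketch of the "plop in" composition being discussed above: a chain-of-thought draft followed by a multi-role critique pass, in the spirit of MetaGPT. `call_model` is a placeholder for a real model call, and the roles are illustrative.

```python
# Minimal sketch of composing "plop-in" techniques: a chain-of-thought draft
# followed by a multi-role critique loop, in the spirit of MetaGPT.
# `call_model` is a placeholder for a real model call; the roles are illustrative.

def call_model(prompt: str) -> str:
    # Placeholder: substitute a real model call.
    return f"<model output for: {prompt[:40]}...>"

def chain_of_thought(task: str) -> str:
    return call_model(f"{task}\nThink step by step, then give the final answer.")

def multi_role_critique(task: str, draft: str,
                        roles=("engineer", "QA", "product manager")) -> str:
    # Each role critiques the current draft in turn, and the draft is revised after each pass.
    for role in roles:
        feedback = call_model(f"As a {role}, critique this draft for the task '{task}':\n{draft}")
        draft = call_model(f"Revise the draft using this feedback:\n{feedback}\n\nDraft:\n{draft}")
    return draft

task = "Implement a date parser"
print(multi_role_critique(task, chain_of_thought(task)))
```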
It's a great question because I feel like this also brought up another question I had 00:44:28.800 |
I noticed that you work a lot with your own benchmarks, your own evaluations of what is 00:44:37.500 |
I would say I would contrast your approach with OpenAI as OpenAI tends to just lean 00:44:41.700 |
on, "Hey, we played StarCraft," or, "Hey, we ran it on the SAT or the AP bio test and 00:45:00.780 |
Because everyone knows what an SAT is and that's fine. 00:45:04.220 |
I think it's important to use both public and internal benchmarks. 00:45:07.420 |
Part of why we build our own benchmarks is that there are not very many good benchmarks 00:45:12.560 |
To evaluate these things, we actually need to think about it in a slightly different 00:45:18.020 |
But we also do use a lot of public benchmarks for is the reasoning capability in this particular 00:45:27.020 |
For example, the Voyager paper coming out of NVIDIA played Minecraft and set their own 00:45:35.340 |
benchmarks on getting the Diamond Axe or whatever and exploring as much of the territory as 00:45:43.940 |
That's obviously fun and novel for the rest of the AI engineer community, the people who are new 00:45:49.260 |
But for people like yourself who you build your own, you build Avalon just because you 00:45:54.620 |
already found deficiencies with using Minecraft, is that valuable as an approach? 00:46:03.940 |
And I really like the Voyager paper and I think it has a lot of really interesting ideas, 00:46:07.180 |
which is like the agent can create tools for itself and then use those tools. 00:46:11.460 |
And he had the idea of the curriculum as well, which is something that we talked about earlier. 00:46:18.060 |
We built Avalon mostly because we couldn't use Minecraft very well to learn the things 00:46:22.420 |
And so it's not that much work to build our own. 00:46:25.500 |
It took us, I don't know, we had eight engineers at the time, took about eight weeks. 00:46:37.740 |
It's just nice to have control over our environment. 00:46:39.180 |
It's really our own sandbox for trying to inspect our own research questions. 00:46:44.140 |
But if you're doing something like experimenting with agents and trying to get them to do things 00:46:47.820 |
like Minecraft is a really interesting environment. 00:46:51.500 |
And so Voyager has a lot of really interesting ideas in it. 00:46:56.260 |
One more element that we had on this list, which is context and memory. 00:47:00.380 |
I think that's kind of like the foundational "RAM" of our era. 00:47:05.660 |
I think Andrej Karpathy has already made this comparison, so there's nothing new here. 00:47:10.860 |
But that's just the amount of working knowledge that we can fit into one of these agents. 00:47:16.260 |
Especially if you need to get them to do long running tasks, if they need to self-correct 00:47:21.500 |
from errors that they observe while operating in their environment. 00:47:26.200 |
Do you think we're going to just trend to infinite context and that'll go away? 00:47:30.540 |
Or how do you think we're going to deal with it? 00:47:33.740 |
When you talked about what's going to happen in the first wave and then in the second wave, 00:47:39.220 |
I think what we'll see is we'll get relatively simplistic agents pretty soon. 00:47:46.180 |
And there's a future wave in which they are able to do these really difficult, really 00:47:52.260 |
And the blocker to that future, one of the blockers is memory. 00:48:00.180 |
I think when von Neumann made the von Neumann architecture, he was like, "The biggest blocker 00:48:06.780 |
We need this amount of memory," which is like, I don't remember exactly, like 32 kilobytes 00:48:14.580 |
He didn't say it this way because he didn't have these terms. 00:48:17.860 |
And then that only really happened in the '70s with the microchip revolution. 00:48:23.600 |
And so it may be the case that we're waiting for some research breakthroughs or some other 00:48:28.620 |
breakthroughs in order for us to have really good long running memory. 00:48:33.300 |
And then in the meantime, agents will be able to do all sorts of things that are a little 00:48:37.620 |
I do think with the pace of the field, we'll probably come up with all sorts of interesting 00:48:50.380 |
I just think about a situation where you want something that's like an AI scientist. 00:48:55.780 |
As a scientist, I have learned so much about my field. 00:49:00.740 |
And a lot of that data is maybe hard to fine tune on or maybe hard to put into pre-training. 00:49:09.300 |
A lot of that data, I don't have a lot of repeats of the data that I'm seeing. 00:49:14.460 |
My understanding is so at the edge that if I'm a scientist, I've accumulated so many 00:49:21.620 |
And ideally, I'd want to store those somehow or use those to fine tune myself as a model 00:49:32.660 |
I don't think RAG is enough for that kind of thing. 00:49:36.020 |
But RAG is certainly enough for user preferences and things like that. 00:49:46.620 |
I have a hard question, if you don't mind me being bold. 00:49:50.820 |
I think the most comparable lab to Imbue is Adept. 00:49:56.060 |
A research lab with some amount of productization on the horizon, but not just yet. 00:50:06.660 |
The way I think about it is I believe in our approach. 00:50:11.140 |
Maybe this is a general question of competitors. 00:50:14.760 |
And the way I think about it is we're in a historic moment. 00:50:28.280 |
And IBM also exists and all of these other big companies exist. 00:50:35.520 |
We're building reasoning foundation models, trying to make agents that actually work reliably. 00:50:45.260 |
And I think we have a really special team and culture. 00:50:50.240 |
I have a sense of where we want to go, of really trying to help the computer be a much 00:50:58.560 |
And the type of thing that we're doing is we're trying to build something that enables 00:51:05.760 |
And build something that really can be maybe something like an operating system for agents. 00:51:12.120 |
I don't really know what everyone else is doing. 00:51:15.360 |
I talk to people and have some sense of what they're doing. 00:51:18.760 |
And I think it's a mistake to focus too much on what other people are doing. 00:51:22.000 |
Because extremely focused execution on the right thing is what matters. 00:51:27.240 |
And so to the question of why us, I think strong focus on reasoning. 00:51:40.840 |
Which we believe is really important for user experience. 00:51:45.500 |
And also for the power and capability of these systems. 00:51:51.440 |
So that which we believe is solving the core issue of agents, which is around reliability 00:51:58.200 |
And then really seriously trying to use these things ourselves. 00:52:04.640 |
And getting to something that we can actually ship to other people, that becomes something 00:52:16.880 |
And you will not be surprised how many agent companies I talk to that don't use their own agents. 00:52:26.480 |
Yeah, I think if we didn't use our own agents, then we would have all of these beliefs about 00:52:33.300 |
The only other follow-up that you had, based on the answer you just gave, was do you see 00:52:39.120 |
yourself releasing models or do you see yourself... 00:52:43.720 |
What is the artifacts that you want to produce that lead up to the general operating system 00:52:52.960 |
And so a lot of people, just as a byproduct of their work, just to say, "Hey, I'm still 00:52:58.480 |
shipping, here's a model along the way," Adept took, I don't know, three years, but they 00:53:08.120 |
Do you think that kind of approach is something on your horizon or do you think there's something 00:53:12.240 |
else that you can release that can show people, "Here's the idea, not the end product, but 00:53:20.640 |
I don't really believe in releasing things to show people, "Oh, here's what we're doing," 00:53:25.960 |
I think as a philosophy, we believe in releasing things that will be helpful to other people. 00:53:30.760 |
And so I think we may release models or we may release tools that we think will help 00:53:36.960 |
Ideally, we would be able to do something like that, but I'm not sure exactly what they 00:53:41.440 |
I think more companies should get into the releasing evals and benchmarks game. 00:53:47.400 |
Something that we have been talking to agent builders about is co-building evals. 00:53:51.200 |
So we build a lot of our own evals and every agent builder tells me basically evals are 00:53:59.640 |
And if you are building agents, this is like a call. 00:54:02.080 |
If you are building agents, please reach out to me because I would love to figure out how 00:54:11.680 |
I know a bunch of people that I can send your way. 00:54:19.160 |
I saw, prepping for the podcast, that you had a lot of interesting questions on your website. 00:54:27.840 |
I'm very jealous of people who have personal websites where they're like, here's the high 00:54:30.640 |
level questions of goals of humanity that I want to set people on. 00:54:41.600 |
There were a few that stuck out as related to your work that maybe you're kind of learning 00:54:47.040 |
One is why are curiosity and goal orientation often at odds? 00:54:51.760 |
And from a human perspective, I get it, it's like, you know, would you want to like go 00:54:54.880 |
explore things or kind of like focus on your career? 00:54:58.000 |
How do you think about that from like an agent perspective, where it's like, should you just 00:55:01.440 |
stick to the task and try and solve it within the guardrails as much as possible? 00:55:05.360 |
Or like, should you look for alternative solutions? 00:55:10.880 |
So the problem with these questions is that I'm still confused about them. 00:55:15.760 |
So in our discussion, I will not have good answers. 00:55:15.760 |
Why are curiosity and goal orientation so at odds? 00:55:24.400 |
I think one thing that's really interesting about agents actually is that they can be forked. 00:55:29.360 |
So like, you know, we can take an agent that's executed to a certain place and say, okay, 00:55:35.240 |
here, like fork this and do a bunch of different things, try a bunch of different things. 00:55:39.280 |
Some of those agents can be goal oriented and some of them can be like more curiosity 00:55:43.640 |
You can prompt them in slightly different ways. 00:55:44.640 |
And something I'm really curious about, like what would happen if in the future, you know, 00:55:51.680 |
As a person, why I have this question on my website is I really find that, like, curiosity and goal orientation are at odds in myself. 00:56:02.320 |
And like, is it inherent in like the kind of context that needs to be held? 00:56:08.560 |
That's why I think from an agent perspective, like forking it is really interesting. 00:56:11.600 |
Like I can't fork myself to do both, but I maybe could fork an agent, like, at a certain point to do both. 00:56:21.240 |
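To make the forking idea concrete, here is a minimal sketch in Python of branching one agent's accumulated state into a goal-oriented copy and a curiosity-driven copy just by swapping the system prompt. All of the names here (AgentState, fork, the example prompts) are hypothetical illustrations, not Imbue's actual API.

```python
# A minimal sketch of "forking" an agent: copy the context it has built up so far,
# then give each branch a different disposition via its system prompt.
from copy import deepcopy
from dataclasses import dataclass, field

@dataclass
class AgentState:
    system_prompt: str
    history: list = field(default_factory=list)  # steps the agent has executed so far

def fork(state: AgentState, new_system_prompt: str) -> AgentState:
    """Copy the accumulated context but swap in a different disposition."""
    child = deepcopy(state)
    child.system_prompt = new_system_prompt
    return child

# One branch stays goal-oriented, another is prompted to explore.
base = AgentState(
    system_prompt="You are a careful assistant.",
    history=["...everything the agent has done up to this point..."],
)
goal_branch = fork(base, "Finish the assigned task; stay within the given constraints.")
curious_branch = fork(base, "Explore alternative approaches, even ones that stray from the plan.")
```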
How has the thinking changed for you as the funding of the company changed? 00:56:28.800 |
That's one thing that I think a lot of people in the space think about, like, oh, should I raise more money or not? 00:56:36.120 |
How do you feel your options to be curious versus, like, goal-oriented have changed as you 00:56:42.600 |
raise more money and kind of like the company has grown? 00:56:49.160 |
So we raised our Series A $20 million in late 2021. 00:56:54.080 |
And our entire philosophy at that time was, and still kind of is, is like, how do we figure 00:57:03.080 |
out the stepping stones, like collect stepping stones that eventually let us build agents, 00:57:09.080 |
these kinds of new computers that help us do bigger things. 00:57:15.600 |
And there was a lot of goal orientation in that. 00:57:17.960 |
Like the curiosity led us to build CARBS, for example, this hyperparameter optimizer. 00:57:33.200 |
So as soon as he came up with the cost-aware idea, he was like, I need to figure out how to make it work. 00:57:39.120 |
But the cost awareness of it was really important. 00:57:40.920 |
So that curiosity led us to this really cool hyperparameter optimizer. 00:57:44.600 |
That's actually a big part of how we do our research: running experiments at smaller scales 00:57:50.000 |
and having those experiment results carry to larger ones. 00:57:53.640 |
Which you also published a scaling laws thing for it, which is great. 00:57:57.800 |
I think the scaling laws paper from OpenAI was the biggest, 00:58:01.240 |
and the one from Google, I think, was the greatest public service to machine learning that anyone has done. 00:58:11.560 |
And I think what was nice about CARBS is it gave us scaling laws for all sorts of hyperparameters. 00:58:17.480 |
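As a rough illustration of the cost-aware framing described here, the toy sketch below scores each hyperparameter configuration by both its compute cost and its loss and keeps the Pareto frontier, which is essentially a scaling curve for the sweep. This is only a sketch of the concept; CARBS itself is a cost-aware Bayesian optimizer, and toy_train, the loss formula, and the config names are made up for illustration.

```python
# Toy cost-aware hyperparameter search: instead of picking a single "best" config,
# keep the Pareto frontier of (compute cost, loss) so you can see how the best
# achievable loss scales with compute.
import random

def toy_train(lr: float, width: int) -> tuple[float, float]:
    """Hypothetical small-scale experiment: returns (compute cost, loss)."""
    cost = float(width)                                   # pretend cost scales with model width
    loss = width ** -0.5 + abs(lr - 3e-4) * 10 + random.gauss(0, 0.01)
    return cost, loss

def pareto_frontier(results):
    """Keep configs where no other config is both cheaper and lower-loss."""
    frontier = []
    for cost, loss, cfg in sorted(results, key=lambda r: (r[0], r[1])):
        if not frontier or loss < frontier[-1][1]:
            frontier.append((cost, loss, cfg))
    return frontier

results = []
for _ in range(50):
    cfg = {"lr": 10 ** random.uniform(-4.5, -2.5), "width": random.choice([64, 128, 256, 512])}
    cost, loss = toy_train(**cfg)
    results.append((cost, loss, cfg))

for cost, loss, cfg in pareto_frontier(results):
    print(f"cost={cost:6.0f}  loss={loss:.3f}  {cfg}")
```

The frontier makes the trade-off explicit: for each compute budget, which configuration is actually worth running before scaling up.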
Like Avalon, it was like a six to eight week sprint for all of us. 00:58:24.580 |
And then now, different projects do more curiosity or more goal orientation at different times. 00:58:32.800 |
Another one of your questions that we highlighted was, how can we enable artificial agents to 00:58:37.500 |
permanently learn new abstractions and processes? 00:58:40.280 |
I think this might be called online learning. 00:58:43.880 |
So I struggle with this because of that scientist example I gave earlier. 00:58:49.440 |
As a scientist, I've permanently learned a lot of new things and I've updated and created 00:58:53.600 |
new abstractions and learned them pretty reliably. 00:58:56.600 |
And you were talking about, OK, we have this RAM that we can store learnings in. 00:59:01.880 |
But how well does online learning actually work? 00:59:05.360 |
And the answer right now seems to be, as models get bigger, they fine tune faster. 00:59:10.720 |
So they're more sample efficient as they get bigger. 00:59:13.360 |
Because they already had that knowledge in there, you're just unlocking it. 00:59:19.400 |
Partly, maybe because they already have some subset of the representation. 00:59:23.240 |
So they just memorize things more, which is good. 00:59:27.600 |
So maybe this question is going to be solved. 00:59:32.360 |
I don't know, maybe we'll have a platform that continually fine-tunes for you as you work on that domain. 00:59:45.320 |
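A minimal sketch of what such a platform might look like, assuming a generic fine-tuning hook: corrections collected while working in a domain accumulate in a buffer and periodically trigger a fine-tuning job. Model, fine_tune, and ContinualLearner are hypothetical placeholders, not a specific product or API.

```python
# Sketch of continual fine-tuning: record (prompt, correction) pairs as you work,
# and fine-tune whenever enough new examples have accumulated.
from dataclasses import dataclass, field

@dataclass
class Model:
    version: int = 0

def fine_tune(model: Model, examples: list[tuple[str, str]]) -> Model:
    """Placeholder: in practice this would run a (parameter-efficient) fine-tuning job."""
    return Model(version=model.version + 1)

@dataclass
class ContinualLearner:
    model: Model
    buffer: list = field(default_factory=list)
    batch_size: int = 32   # bigger models are more sample-efficient, so this could shrink

    def record(self, prompt: str, corrected_output: str) -> None:
        """Store a correction observed while working in the domain; fine-tune when full."""
        self.buffer.append((prompt, corrected_output))
        if len(self.buffer) >= self.batch_size:
            self.model = fine_tune(self.model, self.buffer)
            self.buffer.clear()

learner = ContinualLearner(Model())
for i in range(70):
    learner.record(f"task {i}", f"corrected answer {i}")
print(learner.model)  # Model(version=2) after two batches of 32
```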
So two more questions just about your general activities, and you've just been very active in the community. 00:59:54.360 |
You're a founding member of South Park Commons. 00:59:57.440 |
Tell me more, because by the time I knew about SPC, it was already a very established thing. 01:00:05.960 |
Yeah, the story is Ruchi, who started it, was the VP of operations at Dropbox. 01:00:11.920 |
And I was the chief of staff, and we worked together very closely. 01:00:15.800 |
She's actually one of the investors in Sorceress. 01:00:22.440 |
And at that time, Ruchi was like, "You know, I would like to start a space for people who are figuring out what's next." 01:00:29.320 |
And we were figuring out what's next post-Ember, those three months. 01:00:32.520 |
And she was like, "Do you want to just hang out in this space?" 01:00:35.760 |
And it was a really good group, I think, Wasim and Jeff from Pilot, the folks from Zulip, 01:00:47.240 |
It's much more official than it was at that time. 01:00:55.240 |
At that time, we literally, it was a bunch of friends hanging out in the space together. 01:01:01.480 |
I think we started the archive around the same time. 01:01:08.840 |
And I'm also part of, hopefully, what becomes the next South Park Commons or whatever. 01:01:08.840 |
But what are the principles in organizing communities like that with really exceptional 01:01:23.280 |
Do you have to be really picky about who joins? 01:01:26.440 |
Did all your friends just magically turn out super successful like that? 01:01:41.200 |
And a lot of people want to do that and fail. 01:01:45.120 |
You had the co-authors of GPT-3 in your house. 01:01:48.920 |
And a lot of other really cool people that you'll eventually hear about. 01:01:51.240 |
And co-founders of Pilot and anyone else you want to... 01:01:53.360 |
I don't want you to pick your friends, but there's some magic special sauce in getting 01:01:58.720 |
people together in one workspace, living space, whatever. 01:02:02.400 |
And that's part of why I'm here in San Francisco. 01:02:05.000 |
And I would love for more people to learn about it and also maybe get inspired to build 01:02:10.200 |
One adage we had when we started the archive was you become the average of the five people you spend the most time with. 01:02:20.960 |
One, we were quite picky and it mattered a lot to us. 01:02:27.000 |
Is this someone where, if they're hanging out in the living room, we'd be really excited to come home and hang out with them? 01:02:32.240 |
Two is I think we did a really good job of creating a high-growth environment and an environment where people felt really safe. 01:02:40.120 |
We actually apply these things to our team and it works remarkably well as well. 01:02:43.920 |
So I do a lot of, basically, how do I create safe spaces for people, where it's not just 01:02:49.920 |
like safe in name, but it's a safe space where people really feel inspired by each other. 01:02:56.000 |
And I think at the archive, we really made each other better. 01:02:58.960 |
My friend, Michael Nielsen called it a self-actualization machine. 01:03:04.200 |
And I think, yeah, people came in and- Was he a part of the archive? 01:03:12.880 |
Like the culture was that we learned a lot of things from each other about how to make 01:03:19.020 |
better life systems and how to think about ourselves and psychological debugging. 01:03:23.080 |
And a lot of us were founders, so having other founders going through similar things was 01:03:28.920 |
And a lot of us worked in AI, and so having other people to talk about AI with was really 01:03:34.000 |
And so I think all of those things led to a form of idea flux. 01:03:40.920 |
I think a lot about, like, the idea flux and the kind of default habits or default impulses of a group. 01:03:46.440 |
It led to a set of idea flux and default impulses that led to some really interesting things 01:03:51.760 |
and led to us doing much bigger things, I think, than we otherwise would have decided 01:03:56.880 |
to do because it felt like taking risks was less risky. 01:04:01.560 |
So that's something we do a lot of on the team as well: like, how do we make it so that taking risks feels less risky? 01:04:11.600 |
I was going to feed you that word, but I didn't want to like impress you. 01:04:15.760 |
I think maybe, like, a lot of what I'm interested in is constructing a kind of scenius. 01:04:20.080 |
And the archive was definitely a scenius in a particular way, or, like, getting toward a scenius. 01:04:26.040 |
And Jason Ben, my archive housemate who now runs The Neighborhood, has a good way of putting it: 01:04:32.920 |
if genius is from your genes, scenius is from your scene. 01:04:36.440 |
And yeah, I think, like, maybe a lot of the community-building impulse is from this interest in scenius. 01:04:46.280 |
There's a question of like, why did Xerox PARC come out with all of this interesting 01:04:52.040 |
Why did Bell Labs come out with all this interesting stuff? 01:04:56.240 |
Why didn't the transistor come out of Princeton and the other people working on it at the time? 01:05:01.680 |
I just think it's remarkable how you hear a lot about Alan Kay. 01:05:05.320 |
And I just read a bit, and apparently Alan Kay was, like, the most junior guy at Xerox PARC. 01:05:16.320 |
So I, you know, hopefully I'm also working towards contributing to that scenius. 01:05:19.120 |
I called mine the most provocative name: the Arena. 01:05:26.080 |
So are you fighting other people in the arena? 01:05:31.880 |
We're in the arena trying stuff, as they say. 01:05:36.440 |
You are also a GP at Outset Capital, where you also co-organize the Thursday Nights in 01:05:40.680 |
AI, where hopefully someday I'll eventually speak. 01:05:48.040 |
So why spend time being a VC and organizing all these events? 01:05:52.760 |
You're also a very busy CEO and, you know, why spend time with that? 01:06:01.560 |
So Allie, my investing partner, is fortunately amazing and she does everything for the fund. 01:06:09.840 |
So she, like, hosts the Thursday Night events and she finds folks who we could invest in 01:06:19.280 |
So Allie was our former chief of staff at Sorceress and we just thought she was amazing. 01:06:23.560 |
And she wanted to be an investor and Josh and I also, like, care about helping founders 01:06:28.840 |
and kind of, like, giving back to the community. 01:06:30.640 |
What we didn't realize at the time when we started the fund is that it would actually be really helpful for Imbue. 01:06:36.400 |
So talking to AI founders who are building agents and working on, you know, similar things is really valuable. 01:06:44.000 |
They could potentially be our customers, and they're trying out all sorts of interesting things. 01:06:48.440 |
And I think being an investor, looking at the space from the other side of the table, 01:06:52.920 |
it's just a different hat that I routinely put on and it's helpful to see the space from 01:06:57.760 |
the investor lens as opposed to from the founder lens. 01:07:01.440 |
So I find that kind of, like, hat switching valuable. 01:07:05.040 |
It maybe would lead us to do slightly different things. 01:07:11.600 |
Acceleration, exploration, and then a takeaway. 01:07:14.880 |
So the acceleration question is: what's something that already happened in AI that you thought would take much longer? 01:07:22.400 |
I think the rate at which we discover new capabilities of existing models and kind of, like, build 01:07:27.200 |
hacks on top of them to make them work better is something that has been surprising and exciting. 01:07:32.900 |
And the rate of, kind of, like, the research community building on its own work. 01:07:42.960 |
If you weren't building Imbue, what AI company would you build? 01:07:49.280 |
Every founder has, like, their, like, number two. 01:07:55.840 |
I cannot imagine building any other thing than Imbue. 01:08:02.880 |
It's, like, obviously work on the fundamental platform. 01:08:06.360 |
So that was my attempt at innovating on this question, but the previous 01:08:11.720 |
one was: what is the most interesting unsolved question in AI? 01:08:16.880 |
I think probably the most interesting unsolved question, and my answer is kind of boring, 01:08:22.080 |
but the most interesting unsolved questions are these questions of how do we make these 01:08:26.300 |
stochastic systems into things that we can, like, reliably use and build on top of? 01:08:31.800 |
And, yeah, take away what's one message you want everyone to remember? 01:08:38.760 |
Like, one is I didn't think in my lifetime I would necessarily be, like, able to work 01:08:45.160 |
on the things I'm excited to work on in this moment, but we're in a historic moment, 01:08:49.160 |
one where we'll look back and be like, "Oh, my God." 01:08:53.900 |
There is maybe a set of messages to take away from that. 01:08:56.120 |
One is, like, AI is a tool, like any technology. 01:09:06.680 |
We like to think about it as it's, like, just a better computer. 01:09:10.360 |
It's like a much better, much more powerful computer that gives us a lot of free intellectual 01:09:14.040 |
energy that we can now, like, solve so many problems with. 01:09:17.720 |
You know, there are so many problems in the world where we're like, "Oh, it's not worth 01:09:20.400 |
a person thinking about that," and so things get worse and things get worse. 01:09:23.480 |
No one wants to work on maintenance, and, like, this technology gives us the potential 01:09:28.760 |
to actually be able to, like, allocate intellectual energy to all of those problems, and the world 01:09:33.520 |
could be much better, like, could be much more thoughtful because of that. 01:09:37.280 |
I'm so excited about that, and there are definitely risks and dangers, and we actually do a fair amount of work on safety and policy. 01:09:48.040 |
On the safety side, like, we think about safety and policy in terms of engineering theory 01:09:54.360 |
and also regulation, and kind of comparing to, like, the automobile or the airplane or 01:10:00.640 |
any new technology, there's, like, a set of new possible, like, capabilities and a set 01:10:06.380 |
of new possible dangers that are unlocked with every new technology, and so on the engineering 01:10:10.900 |
side, like, we think a lot about engineering safety, like, how do we actually engineer 01:10:14.800 |
these systems so that they are inspectable and, you know, why we reason in natural language 01:10:19.640 |
so that the systems are very inspectable, so that we can, like, stop things if anything goes wrong. 01:10:25.080 |
That's why we don't think end-to-end black boxes are a good idea. 01:10:28.560 |
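A minimal sketch of the "inspectable by construction" idea: each agent step carries a natural-language rationale that a monitor (human or automated) can read and veto before the action runs. The names here are illustrative only, not how Imbue's systems are built.

```python
# Sketch of inspectable agent execution: reasoning is plain text, and a supervisor
# can halt the run before any given action executes.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    reasoning: str                 # natural-language explanation, readable by humans/monitors
    action: Callable[[], None]

def run_inspectably(steps: list[Step], approve: Callable[[str], bool]) -> None:
    for step in steps:
        print("agent reasoning:", step.reasoning)
        if not approve(step.reasoning):
            print("halted before action")
            return
        step.action()

# Example: a trivial monitor that stops anything mentioning deletion.
steps = [
    Step("I will list the files in the project.", lambda: print("ls ...")),
    Step("I will delete the build directory to save space.", lambda: print("rm -rf build")),
]
run_inspectably(steps, approve=lambda r: "delete" not in r.lower())
```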
On the theoretical side, we, like, really believe in deeply understanding what's going on inside these systems. 01:10:32.800 |
Like, when we actually fine-tune on individual examples, like, what's going on? 01:10:39.080 |
Like, debugging tools for these agents to understand, like, what's going on? 01:10:43.040 |
And then on the regulation side, I think there's actually a lot of regulation that already 01:10:49.220 |
covers many of the dangers, like, that people are talking about, and there are areas where 01:10:57.040 |
there's not much regulation, and so we focus on those areas where there's not much regulation. 01:11:00.540 |
So some of our work is actually, we built an agent that helped us analyze the, like, 01:11:06.660 |
20,000 pages of policy proposals submitted to the Department of Commerce request for 01:11:15.240 |
And we, like, looked at what were the problems people brought up, and what were the solutions 01:11:19.980 |
they presented, and then, like, did a summary analysis and kind of, like, you know, built a tool around it. 01:11:26.660 |
And now the Department of Commerce is, like, interested in using that as a tool to, like, 01:11:31.640 |
And so a lot of what we're trying to do on the regulation side is, like, actually figure 01:11:36.280 |
out where regulation is missing, and how do we actually, in a very targeted way, help fill those gaps. 01:11:45.480 |
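The policy-analysis agent described above follows a fairly standard map-reduce pattern, sketched below under the assumption of a generic call_llm function: chunk the corpus, extract problems and proposed solutions per chunk, then merge the notes into a summary. This is a hypothetical reconstruction of the pattern, not Imbue's actual pipeline.

```python
# Map-reduce over a large document set: extract per-chunk, then aggregate.
from textwrap import dedent

def call_llm(prompt: str) -> str:
    """Placeholder for a real chat/completions call."""
    raise NotImplementedError

def chunk(text: str, max_chars: int = 8000) -> list[str]:
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def extract(chunk_text: str) -> str:
    # "Map" step: pull out the problems raised and solutions proposed in one chunk.
    return call_llm(dedent(f"""
        From the policy comment below, list (1) the problems raised and
        (2) the solutions proposed, as short bullet points.

        {chunk_text}
    """))

def summarize(corpus: str) -> str:
    # "Reduce" step: merge the per-chunk notes into one summary analysis.
    per_chunk = [extract(c) for c in chunk(corpus)]
    return call_llm(
        "Merge these per-document notes into a single summary of recurring "
        "problems and proposed solutions:\n\n" + "\n\n".join(per_chunk)
    )
```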
So I guess if I were to say, like, what are the takeaways, it's like, the future could 01:11:50.000 |
be really exciting if we can actually get agents that are able to do these bigger things. 01:11:55.480 |
Reasoning is the biggest blocker, plus, like, these sets of abstractions to make things more reliable and easier to build on. 01:12:02.200 |
And there are, you know, things where we have to be quite careful and thoughtful about how 01:12:06.800 |
do we deploy these, and what kind of regulation should go along with it, so that this is actually 01:12:11.280 |
a technology that, when we deploy it, is protective of people, and not harmful.