Supervise the Process of AI Research — with Jungwon Byun and Andreas Stuhlmüller of Elicit
Chapters
0:00 Introductions
7:45 How Jungwon and Andreas Joined Forces to Create Elicit
10:26 Why Products Are Better Than Research
15:49 The Evolution of Elicit's Product
19:44 Automating Literature Review Workflow
22:48 How GPT-3 to GPT-4 Changed Things
25:37 Managing LLM Pricing and Performance
31:07 Open vs. Closed: Elicit's Approach to Model Selection
31:56 Moving to Notebooks
39:11 Elicit's Budget for Model Queries and Evaluations
41:44 Impact of Long Context Windows
47:19 Underrated Features and Surprising Applications
51:35 Driving Systematic and Efficient Research
53:00 Elicit's Team Growth and Transition to a Public Benefit Corporation
55:22 Building AI for Good
00:00:02.760 |
This is Alessio, partner and CTO-in-Residence at Decibel Partners. 00:00:06.120 |
And I'm joined by my co-host, Swyx, founder of Smol AI. 00:00:19.060 |
but also we'd love to learn a little bit more about you 00:00:23.320 |
So Andreas, it looks like you started Elicit, or Ought first, 00:00:35.240 |
the Elicit and also the Ought that existed before then 00:00:42.360 |
So I think it's fair to say that she co-founded it. 00:00:46.120 |
And Jungwon, you're a co-founder and COO of Elicit. 00:00:50.200 |
So there's a little bit of a history to this. 00:01:01.040 |
And recently, you turned into sort of like a B Corp-- 00:01:07.280 |
take us through that journey of finding the problem. 00:01:14.960 |
So how do you get together to decide to leave your startup 00:01:22.440 |
I guess, truly, it kind of started in Germany 00:01:25.440 |
So even as a kid, I was always interested in AI. 00:01:32.120 |
There were books about how to write programs in QBasic. 00:01:34.980 |
And some of them talked about how to implement chatbots. 00:01:44.120 |
Dinkelscherben, where it's a very, very idyllic German 00:01:51.680 |
But basically, the main thing is I've kind of always 00:01:56.480 |
and been thinking about, well, at some point, 00:02:02.920 |
And I was thinking about it from when I was a teenager. 00:02:11.400 |
I started a startup with the intention to become rich. 00:02:15.160 |
And then once I'm rich, I can affect the trajectory of AI. 00:02:21.000 |
Decided to go back to college and study cognitive science 00:02:32.360 |
to do a PhD at MIT, working on broadly kind of new programming 00:02:37.960 |
languages for AI, because it kind of seemed like the existing 00:02:40.880 |
languages were not great at expressing world models 00:02:44.100 |
and learning world models doing Bayesian inference. 00:02:47.520 |
Was always thinking about, well, ultimately, the goal 00:02:49.640 |
is to actually build tools that help people reason more 00:03:00.240 |
like the technology to put reasoning in machines 00:03:04.800 |
And so initially, at the end of my postdoc at Stanford, 00:03:12.440 |
I think the standard path is you become an academic 00:03:17.160 |
But it's really hard to actually build interesting tools 00:03:26.760 |
Everything is kind of on a paper-to-paper timeline. 00:03:29.520 |
And so I was like, well, maybe I should start a startup 00:03:37.160 |
because you could have tried to do an AI startup, 00:03:39.900 |
but probably would not have been the kind of AI startup 00:03:44.520 |
So then decided to just start a nonprofit research lab that's 00:03:49.200 |
until we better figure out how to do thinking in machines. 00:03:54.800 |
And then over time, it became clear how to actually build 00:04:02.800 |
And then only over time, we developed a better way to-- 00:04:10.400 |
Yeah, so I guess my story maybe starts around 2015. 00:04:14.360 |
I kind of wanted to be a founder for a long time. 00:04:17.880 |
And I wanted to work on an idea that really tested-- 00:04:23.840 |
like an idea that stuck with me for a long time. 00:04:28.280 |
originally, I became interested in AI-based tools 00:04:41.880 |
before getting hospitalized that could just help her. 00:04:45.640 |
And so luckily, she came and stayed with me for a while. 00:04:48.320 |
And we were just able to talk through some things. 00:04:54.140 |
And something maybe AI-enabled could be much more scalable. 00:05:02.280 |
And I also didn't feel like the technology was ready. 00:05:12.640 |
for me to just jump in and build something on my own 00:05:17.160 |
And at the time, there were two interesting-- 00:05:19.840 |
I looked around at tech and felt not super inspired 00:05:28.840 |
There were two interesting technologies at the time. 00:05:41.320 |
But yeah, I kind of threw my bet in on the AI side. 00:05:52.040 |
was really compatible with what I had envisioned 00:05:57.080 |
that helps kind of take down really complex thinking, 00:05:59.880 |
overwhelming thoughts, and breaks it down into small pieces. 00:06:04.640 |
need AI to help us figure out what we ought to do 00:06:10.520 |
Yeah, because I think it was clear that we were building 00:06:23.640 |
of optimization potential at the wrong thing, 00:06:27.000 |
So the goal of Ought was to make sure that if we build the most 00:06:32.160 |
it can be used for something really impactful, 00:06:34.400 |
like good reasoning, like not just generating ads. 00:06:38.880 |
But so I was like, I want to do more than generate ads 00:06:45.320 |
to be super intelligent enough that they are doing 00:06:47.880 |
this really complex reasoning, that we can trust them, 00:06:51.980 |
have ways of evaluating that they're doing the right thing. 00:06:57.600 |
This was, like Andreas said, before foundation models 00:07:09.720 |
be able to do more kind of logical reasoning, 00:07:12.360 |
not just kind of extrapolate from numerical trends. 00:07:18.800 |
with people, where people stood in as super intelligent 00:07:23.320 |
And we effectively gave them context windows. 00:07:40.600 |
like in 2018, 2019, a world where an AI system could read 00:07:45.960 |
And you, as the person who couldn't read that much, 00:07:53.840 |
And from that, we kind of iterated on this idea, 00:08:00.720 |
like open-ended reasoning, logical reasoning, 00:08:07.080 |
And also so that it's easier to evaluate the work of the AI 00:08:11.880 |
And then also kind of really pioneered this idea, 00:08:15.840 |
the importance of supervising the process of AI systems, 00:08:20.360 |
And so a big part of then how Elicit is built 00:08:23.040 |
is we're very intentional about not just throwing a ton of data 00:08:27.320 |
into a model and training it, and then saying, cool, 00:08:33.520 |
Our approach is very much like, what are the steps 00:08:38.800 |
As granularly as possible, let's break that down. 00:08:41.800 |
And then train AI systems to perform each of those steps 00:08:46.200 |
When you train that from the start, after the fact, 00:08:50.920 |
It's much easier to troubleshoot at each point, 00:08:55.000 |
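To make that step-by-step approach concrete, here is a minimal sketch (not Elicit's actual code; the step names, checks, and stubbed logic are all illustrative) of a pipeline decomposed into small, individually checkable stages:

```python
# Sketch of a process-supervised pipeline: the task is broken into small named
# steps, each with its own cheap check, so a bad answer can be traced to the
# step that produced it. Step bodies are stubs; in a real system each `run`
# would wrap a model call or a search query.

from dataclasses import dataclass
from typing import Any, Callable, Dict

State = Dict[str, Any]

@dataclass
class Step:
    name: str
    run: Callable[[State], State]    # returns updates to merge into the state
    check: Callable[[State], bool]   # cheap per-step sanity check

def find_papers(state: State) -> State:
    # stand-in for a call to a search index
    return {"papers": [f"paper {i} about {state['question']}" for i in range(3)]}

def extract_claims(state: State) -> State:
    # stand-in for one constrained model call per paper
    return {"claims": [f"claim extracted from: {p}" for p in state["papers"]]}

def summarize(state: State) -> State:
    # stand-in for a final summarization call over the extracted claims
    return {"answer": " / ".join(state["claims"])}

PIPELINE = [
    Step("find_papers", find_papers, lambda s: len(s["papers"]) > 0),
    Step("extract_claims", extract_claims, lambda s: len(s["claims"]) == len(s["papers"])),
    Step("summarize", summarize, lambda s: bool(s["answer"])),
]

def run_pipeline(question: str) -> State:
    state: State = {"question": question}
    for step in PIPELINE:
        state.update(step.run(state))
        if not step.check(state):
            # failures are attributable to a specific step, not the whole system
            raise RuntimeError(f"step '{step.name}' failed its check")
    return state

print(run_pipeline("Does creatine improve cognition?")["answer"])
```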
So yeah, we were working on those experiments for a while. 00:08:57.000 |
And then at the start of 2021, decided to build a product. 00:09:00.320 |
Because when you do research, I think maybe-- 00:09:03.280 |
- Do you mind if I, 'cause I think you're about to go 00:09:08.640 |
And I just wanted to, because I think a lot of people 00:09:11.080 |
are in where you were, like sort of 2018, '19, 00:09:15.360 |
where you chose a partner to work with, right? 00:09:26.760 |
I assume you had a lot of other options, right? 00:09:28.880 |
Like how do you advise people to make those choices? 00:09:33.840 |
So we had one of our closest friends introduced us. 00:09:36.880 |
And then Andreas had written a lot on the Ought website, 00:09:47.080 |
And even other people, some of my closest friends 00:09:55.760 |
I wanted someone with a complementary skill set. 00:10:03.640 |
- We also did a pretty lengthy mutual evaluation process 00:10:08.480 |
where we had all kinds of questions for each other. 00:10:11.120 |
And I think it ended up being around 50 pages or so 00:10:14.560 |
of like various like questions and back and forth. 00:10:18.200 |
There's some lists going around for co-founder questions. 00:10:31.320 |
- I shared like all of my past performance reviews. 00:10:47.320 |
- So before we jump into what Elicit is today, 00:11:07.120 |
that are now back to where you were maybe five years ago 00:11:15.160 |
What clicked for you to like move into Elicit 00:11:20.160 |
- I think in many ways, the approach is still the same 00:11:22.400 |
because the way we are building Elicit is not, 00:11:24.960 |
let's train a foundation model to do more stuff. 00:11:29.480 |
such that we can deploy powerful models to good ends. 00:11:32.740 |
So I think it's different now in that we are, 00:11:36.040 |
we actually have some of the models to plug in, 00:11:44.840 |
we did run with humans back then, just with models. 00:11:47.720 |
And so in many ways, our philosophy is always like, 00:11:51.280 |
What models are gonna exist in one, two years or longer? 00:11:55.960 |
And how can we make it so that they can actually be deployed 00:12:06.040 |
and we just want to, the research was really important 00:12:09.600 |
and it didn't make sense to build a product at that time. 00:12:13.480 |
the thing that always motivated us is imagining a world 00:12:16.600 |
where high quality reasoning is really abundant. 00:12:24.880 |
And there's a way to guide that technology with research, 00:12:29.320 |
you can have a more direct effect through product 00:12:36.760 |
and the product felt like a more direct path. 00:12:39.120 |
And we wanted to concretely have an impact on people's lives. 00:12:42.360 |
So I think, yeah, I think the kind of personally, 00:12:45.520 |
the motivation was we want to build for people. 00:12:52.600 |
like the models you were using back then were like, 00:12:55.000 |
I don't know, were they like BERT-type stuff or T5 00:12:59.000 |
or I don't know what timeframe we're talking about here. 00:13:02.120 |
- So I guess to be clear, at the very beginning, 00:13:04.400 |
we had humans do the work and then the initial, 00:13:09.040 |
I think the first models that kind of make sense 00:13:11.280 |
were GPT-2 and T-NLG and like the early generative models. 00:13:18.280 |
We do also use like T5 based models even now, 00:13:26.280 |
- Yeah, cool, I'm just kind of curious about like, 00:13:39.840 |
And I was like, Andreas, you're wasting your time 00:13:52.960 |
after four months, you get to a million in revenue. 00:13:55.080 |
Obviously a lot of people use it, get a lot of value, 00:13:57.040 |
but it would initially kind of like structure data, 00:14:03.000 |
Then you had, yeah, kind of like concept grouping. 00:14:08.760 |
research enabler, kind of like paper understander platform. 00:14:12.040 |
What's the definitive definition of what Elicit is 00:14:17.520 |
- Yeah, we say Elicit is an AI research assistant. 00:14:21.600 |
It has evolved a lot and it will continue research. 00:14:28.800 |
I think the current phase we're in right now, 00:14:30.980 |
we talk about it as really trying to make Elicit 00:14:35.760 |
So it's all a lot about like literature summarization. 00:14:39.360 |
There's a ton of information that the world already knows. 00:14:41.540 |
It's really hard to navigate, hard to make it relevant. 00:14:51.760 |
I kind of want to import some of the incredible 00:14:54.640 |
productivity improvements we've seen in software engineering 00:15:04.080 |
That's why we're launching this new set of features 00:15:07.760 |
It's very much inspired by computational notebooks 00:15:15.520 |
And ultimately when people are trying to get to an answer 00:15:20.120 |
they're kind of like manipulating evidence and information. 00:15:22.900 |
Today that's all packaged in PDFs, which are super brittle, 00:15:27.440 |
we can decompose these PDFs into their underlying claims 00:15:31.480 |
and then let researchers mash them up together, 00:15:40.780 |
Right now we're focused on text-based workflows, 00:15:45.200 |
but long-term really want to kind of go further and further 00:15:55.280 |
So researchers use Elicit as a research assistant. 00:15:58.940 |
It's not a generic, you can research anything type of tool, 00:16:08.960 |
a lot of people use human research assistants to do things. 00:16:18.540 |
see which of these have kind of sufficiently large 00:16:24.420 |
and then write out like, what are the experiments they did? 00:16:31.460 |
And the first phase of understanding what is known 00:16:37.100 |
because a lot of that work is pretty rote work. 00:16:40.480 |
that we need humans to do, language models can do it. 00:16:47.320 |
than a grad student or undergrad research assistant 00:16:58.240 |
or for a mix of personal and professional things. 00:17:01.160 |
People who care a lot about like health or biohacking, 00:17:05.260 |
or parents who have children with a kind of rare disease 00:17:08.880 |
and want to understand the literature directly. 00:17:10.680 |
So there is an individual kind of consumer use case. 00:17:15.600 |
so that's where we're really excited to build. 00:17:18.180 |
So Elicit was very much inspired by this workflow 00:17:21.180 |
in literature called Systematic Reviews or Meta-Analysis, 00:17:24.480 |
which is basically the human state of the art 00:17:33.600 |
and they kind of first start by trying to find 00:17:35.600 |
the maximally comprehensive set of papers possible. 00:17:40.360 |
And they kind of systematically narrow that down 00:17:56.080 |
So in science, in machine learning, in policy. 00:18:01.400 |
because it's so structured and designed to be reproducible, 00:18:08.560 |
And then you make that accessible for any question 00:18:19.000 |
he's building a new company called BrightWave, 00:18:20.600 |
which is an AI research assistant for financial research. 00:18:30.680 |
or is every domain going to have its own thing? 00:18:33.620 |
- I think that's a good and mostly open question. 00:18:36.540 |
I do think there are some differences across domains. 00:18:51.600 |
to the broad generalist reasoning type space. 00:18:59.000 |
to like these equations in economics or something. 00:19:06.640 |
So I think there will be, at least within research, 00:19:09.440 |
I think there will be like one best platform more or less 00:19:15.480 |
I think there may still be like some particular tools 00:19:17.680 |
like for genomics, like particular types of modules 00:19:23.560 |
But for a lot of the kind of high-level reasoning 00:19:35.720 |
I see that in your UI now, but that's as it is today. 00:19:41.000 |
about how it was in 2021 and how it maybe progressed. 00:19:57.560 |
types of reasoning that if we could scale up AI 00:20:06.720 |
but we're like, oh, so many people are gonna build 00:20:08.560 |
literature review tools, so let's not start there. 00:20:11.000 |
And so then we focused on geopolitical forecasting. 00:20:13.880 |
So I don't know if you're familiar with like Manifold or-- 00:20:16.400 |
- Manifold Markets. - Yeah, that kind of stuff. 00:20:18.220 |
- And Manifold.love. - Before Manifold, yeah. 00:20:22.760 |
we're predicting like, is China gonna invade Taiwan? 00:20:33.840 |
I think by that time we kind of realized that the, 00:20:39.240 |
convert their beliefs into probability distributions. 00:20:46.280 |
And then after a few months of iterating on that, 00:20:48.040 |
just realized, oh, the thing that's blocking people 00:21:00.920 |
with the very generalist capabilities of GPT-3 00:21:03.720 |
prompted us to make a more general research assistant. 00:21:11.240 |
So we would embed with different researchers, 00:21:13.080 |
we built data labeling workflows in the beginning, 00:21:18.000 |
We built ways to find like experts in a field 00:21:23.000 |
and like ways to ask good research questions. 00:21:25.640 |
So we just kind of iterated through a lot of workflows 00:21:27.660 |
and it was, yeah, no one else was really building 00:21:32.840 |
and see like what is a task that is at the intersection 00:21:40.600 |
And we had like a very nondescript landing page. 00:21:42.680 |
It said nothing, but somehow people were signing up 00:21:48.080 |
And everyone was like, "I need help with literature review." 00:21:50.000 |
And we're like, "Literature review, that sounds so hard. 00:21:56.040 |
"Okay, everyone is saying literature review." 00:21:58.880 |
- And all domains, not like medicine or physics 00:22:04.720 |
And if you look at the graphs for academic literature 00:22:13.960 |
So we're like, "All right, let's just try it." 00:22:16.920 |
"This is kind of like the right problem space 00:22:19.080 |
"to jump into even if we don't know what we're doing." 00:22:21.640 |
So my take was like, "Fine, this feels really scary, 00:22:24.480 |
"but let's just launch a feature every single week 00:22:37.720 |
So the very first version was actually a weekend prototype 00:22:45.440 |
So the thing I remember is you enter a question 00:23:09.820 |
so that we would kind of feel the constant shame 00:23:16.400 |
And I think over time it has gotten a lot better, 00:23:18.400 |
but I think the initial version was really very bad. 00:23:37.440 |
- And were you using embeddings and cosine similarity, 00:23:51.280 |
the Semantic Scholar API or something similar. 00:24:00.040 |
then built our own search engine that has helped a lot. 00:24:04.280 |
- And then we're gonna go into more recent products stuff, 00:24:08.080 |
but I think you seem the more startup-oriented 00:24:11.640 |
business person, and you seem sort of more ideologically 00:24:14.880 |
interested in research, obviously, 'cause of your PhD. 00:24:17.580 |
What kind of market sizing were you guys thinking? 00:24:21.560 |
'Cause you're here saying, "We have to double every month." 00:24:34.960 |
I felt like in this space where so much was changing 00:24:45.760 |
we just really rested a lot on very, very simple 00:24:48.320 |
fundamental principles, which is research is, 00:24:52.480 |
that is very economically beneficial, valuable, 00:24:58.080 |
- Yeah, research is the key to many breakthroughs 00:25:06.960 |
But that's obviously not true, as you guys have found out. 00:25:09.200 |
But I, you know, you had to have some market insight 00:25:12.600 |
for me to have believed that, but I think you skipped that. 00:25:20.120 |
A lot of VCs were like, "You know, researchers, 00:25:30.720 |
maybe that's true, but I think in the long run, 00:25:42.560 |
or avoid kind of controlled trials that don't go anywhere, 00:25:53.040 |
but I think as long as the fundamental principle is there, 00:25:57.360 |
And I guess we found some investors who also were. 00:26:01.480 |
I mean, I'm sure we can cover the sort of flip later. 00:26:05.680 |
Yeah, I think you were about to start us on GPT-3 00:26:14.320 |
- I think it's a little bit less true for us than for others 00:26:18.240 |
because we always believe that there will basically be 00:26:24.280 |
And so it is definitely true that in practice, 00:26:31.320 |
you can add some features that you couldn't add before. 00:26:33.760 |
But I don't think we really ever had the moment 00:26:37.920 |
where we were like, oh, wow, that is super unanticipated. 00:26:42.200 |
We need to do something entirely different now 00:26:46.600 |
- I think GPT-3 was a big change 'cause it kind of said, 00:26:50.420 |
oh, now is the time that we can use AI to build these tools. 00:26:59.760 |
GPT-3 over GPT-2 was like qualitative level shift. 00:27:10.040 |
But the shape of the product had already taken place 00:27:13.280 |
- I kind of want to ask you about this sort of pivot 00:27:15.120 |
that you've made, but I guess that was just a way 00:27:43.640 |
But that's still, you kind of want to take the analysis 00:27:55.200 |
a list of interventions, a list of techniques. 00:27:57.240 |
And so that's one of the things we're working on 00:28:00.680 |
is now that you've extracted this information 00:28:03.720 |
can you pivot it or group by whatever the information 00:28:06.960 |
that you extracted to have more insight-first information 00:28:13.120 |
- Yeah, that was a big revelation when I saw it. 00:28:14.800 |
Yeah, basically, I think I'm very just impressed 00:28:29.000 |
because actually it's just about improving the workflow 00:28:33.120 |
Today, we might call it an agent, I don't know, 00:28:35.160 |
but you're not reliant on the LLM to drive it. 00:28:48.200 |
Yeah, I think the problem space is still huge. 00:28:57.200 |
So I think about this a lot in the context of moats. 00:29:04.440 |
there's still all of this other space that we can go into. 00:29:07.000 |
And so I think being really obsessed with the problem, 00:29:09.920 |
which is very, very big, has helped us stay robust 00:29:13.120 |
and just kind of directly incorporate model improvements 00:29:16.280 |
- And then I first encountered you guys with Charlie. 00:29:22.000 |
Basically, how much did cost become a concern 00:29:40.400 |
I think he had heard about us on some discord, 00:29:51.040 |
in Barcelona visiting our head of engineering at that time, 00:29:54.040 |
and everyone was talking about this wonder kid. 00:29:58.240 |
he had done the best of anyone to that point. 00:30:05.200 |
So we hired him as an intern, and then we're like, 00:30:06.780 |
"Charlie, what if he just dropped out of school?" 00:30:09.640 |
And so then we convinced him to take a year off, 00:30:13.660 |
And I think the thing you're referring to is, 00:30:17.240 |
Anthropic launched their constitutional AI paper, 00:30:23.080 |
he had basically implemented that in production, 00:30:25.280 |
and then we had it in app a week or so after that. 00:30:28.920 |
And he has since contributed to major improvements, 00:30:31.840 |
like cutting costs down to a tenth of what they were. 00:30:36.920 |
but yeah, you can talk about the technical stuff. 00:30:51.880 |
with respect to your query for you on the fly. 00:30:54.640 |
And that's a really important part of Elicit, 00:31:01.320 |
it'll have done it a few hundred times for you. 00:31:03.560 |
And so we cared a lot about this both being fast, cheap, 00:31:23.000 |
It's like everything in the summary is reflected 00:32:03.200 |
and didn't try hard enough to answer the question 00:32:11.680 |
- How do you monitor the ongoing performance of your models? 00:32:22.480 |
more well-known operations doing NLP at scale. 00:32:25.280 |
I guess, effectively, you have to monitor these things, 00:32:29.100 |
and nobody has a good answer that I can talk to. 00:32:31.920 |
- Yeah, I don't think we have a good answer yet. 00:32:35.080 |
I think the answers are actually a little bit clearer 00:32:49.160 |
kind of latencies and response times and uptime and whatnot. 00:32:57.040 |
where I think there the really important thing 00:33:02.360 |
So we care a lot about having our own internal benchmarks 00:33:07.680 |
for model development that reflect the distribution 00:33:11.920 |
of user queries so that we can know ahead of time 00:33:18.600 |
so the tasks being summarization, question answering, 00:33:25.400 |
what's the distribution of things the model is gonna see 00:33:28.360 |
so that we can have well-calibrated predictions 00:33:32.560 |
on how well the model's gonna do in production. 00:33:36.160 |
that there's distribution shift and actually the things 00:33:49.000 |
- I think we also end up effectively monitoring 00:33:51.260 |
by trying to evaluate new models as they come out. 00:33:58.080 |
And then, yeah, and so every time a new model comes out, 00:34:03.240 |
how is this performing relative to production 00:34:08.800 |
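A rough sketch of what that benchmark-driven evaluation loop can look like (the dataset format, scoring rule, and model-calling functions here are placeholders and assumptions, not Elicit's internals):

```python
# Sketch of evaluating candidate models against a fixed internal benchmark
# whose examples are sampled to reflect the production query distribution,
# broken out per task (summarization, question answering, etc.).

import json
import statistics
from typing import Callable

def load_benchmark(path: str) -> list[dict]:
    # each line: {"task": "summarization", "input": "...", "reference": "..."}
    with open(path) as f:
        return [json.loads(line) for line in f]

def score(prediction: str, reference: str) -> float:
    # placeholder metric; in practice this would be task-specific
    # (exact match, rubric-based judging, etc.)
    return float(prediction.strip().lower() == reference.strip().lower())

def evaluate(call_model: Callable[[str], str], benchmark: list[dict]) -> dict:
    by_task: dict[str, list[float]] = {}
    for example in benchmark:
        s = score(call_model(example["input"]), example["reference"])
        by_task.setdefault(example["task"], []).append(s)
    return {task: statistics.mean(scores) for task, scores in by_task.items()}

# Hypothetical usage: compare a candidate model against the production model,
# per task, and only switch when it wins on the tasks users actually run.
# results_prod = evaluate(call_production_model, load_benchmark("bench.jsonl"))
# results_new  = evaluate(call_candidate_model,  load_benchmark("bench.jsonl"))
```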
any new models have really caught your eye this year? 00:34:13.680 |
I think the team's pretty excited about Claude. 00:34:18.920 |
is like a good point on the kind of Pareto frontier. 00:34:26.160 |
nor is it the most accurate, most high-quality model, 00:34:34.800 |
You apparently have to 10-shot it to make it good. 00:34:48.280 |
- Yeah, we also used, I think, GPT-4 unlocked process. 00:34:53.320 |
- Yeah, yeah, they get unlocked tables for us, 00:35:03.360 |
I guess you can't try Fuyu, 'cause it's non-commercial. 00:35:12.080 |
- I think the interesting insight that we got 00:35:13.880 |
from talking to David Luan, who is CEO of Adept, 00:35:20.960 |
Like one is the, we recognize images from a camera 00:35:26.280 |
And actually, the more important multimodality 00:35:35.760 |
- So we need a new term for that kind of multimodality. 00:35:42.440 |
- No, they're over-indexed, 'cause of the history 00:35:50.640 |
- Yeah, processing weird handwriting and stuff. 00:35:54.120 |
You mentioned a lot of like closed model lab stuff, 00:35:57.840 |
and then you also have like this open source model 00:36:01.560 |
Like what is your workload now between closed and open? 00:36:10.760 |
- It depends a little bit on like how you index, 00:36:21.360 |
I think the closed models make up more of the budget, 00:36:24.800 |
since the main cases where you want to use closed models 00:36:31.820 |
where no existing open source models are quite smart enough. 00:36:36.240 |
- We have a lot of interesting technical questions 00:36:38.520 |
to go in, but just to wrap the kind of like UX evolution, 00:36:52.560 |
which is a very iterative, kind of like interactive 00:36:55.160 |
interface, and yeah, maybe learnings from that? 00:37:01.840 |
I think the first time was probably in early 2021. 00:37:09.600 |
with this idea of task decomposition and like branching, 00:37:13.200 |
we always wanted a way, a tool that could be kind of 00:37:20.780 |
where you could kind of apply language model operations 00:37:26.080 |
So in 2021, we had this thing called composite tasks 00:37:34.240 |
and decompose those further into sub questions. 00:37:37.320 |
And this kind of, again, that like task decomposition 00:37:40.200 |
tree type thing was always very exciting to us, 00:37:46.840 |
Then at the end of '22, I think we tried again. 00:37:51.280 |
okay, we've done a lot with this literature review thing. 00:37:53.720 |
We also want to start helping with kind of adjacent domains 00:37:57.480 |
Like we want to help more with machine learning. 00:38:00.640 |
And as we were thinking about it, we're like, well, 00:38:04.280 |
Like how do we not just build kind of three new workflows 00:38:19.320 |
and then didn't quite narrow the problem space enough 00:38:25.200 |
And then I think it was at the beginning of 2023, 00:38:28.600 |
where we're like, wow, computational notebooks 00:38:30.440 |
kind of enable this, where they have a lot of flexibility, 00:38:39.580 |
It's not like you ask a query, you get an answer, 00:38:42.100 |
You can just constantly keep building on top of that. 00:38:44.600 |
And each little step seems like a really good 00:38:50.240 |
So that's, and also there was just like really helpful 00:38:52.960 |
to have a bit more kind of pre-existing work to emulate. 00:38:57.960 |
So that was, yeah, that's kind of how we ended up 00:39:03.000 |
- Maybe one thing that's worth making explicit 00:39:05.600 |
is the difference between computational notebooks and chat, 00:39:08.120 |
because on the surface, they seem pretty similar. 00:39:11.560 |
where you add stuff and it's almost like in both cases, 00:39:15.640 |
you have a back and forth between you enter stuff 00:39:17.560 |
and then you get some output and then you enter stuff. 00:39:28.920 |
here's like my data analysis process that takes in a CSV 00:39:37.680 |
and then you can run it over a much larger CSV later. 00:39:40.560 |
And similarly, the vision for notebooks in our case 00:39:43.920 |
is to not make it this like one-off chat interaction, 00:39:54.440 |
and see do I get to the correct like conclusions 00:40:04.440 |
now that I've debugged the process using a few papers? 00:40:07.560 |
And that's an interaction that doesn't fit quite as well 00:40:15.560 |
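One way to picture the "notebook as repeatable process" distinction is the sketch below (the `answer_question_about` function is a hypothetical stand-in for a model call, not a real API):

```python
# Sketch of a notebook-style repeatable process: define the steps once,
# debug them on a handful of papers, then rerun the identical process over
# a much larger set. A chat interaction doesn't give you this
# "define once, rerun at scale" property.

def answer_question_about(paper: str, question: str) -> str:
    # placeholder for an LLM call constrained to the paper's text
    return f"answer to '{question}' based on {paper}"

def research_process(papers: list[str], question: str) -> list[dict]:
    # the "notebook": each step is explicit and reusable
    return [
        {"paper": paper, "finding": answer_question_about(paper, question)}
        for paper in papers
    ]

# Debug the process on a few papers...
sample_run = research_process(["paper_1.pdf", "paper_2.pdf"], "What was the sample size?")
# ...then run the same process over hundreds:
# full_run = research_process(all_papers, "What was the sample size?")
```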
- Do you think in notebooks as kind of like structure, 00:40:19.020 |
editable chain of thought, basically step by step, 00:40:22.060 |
like is that kind of where you see this going 00:40:24.500 |
and then are people gonna reuse notebooks as like templates 00:40:30.780 |
You share a cookbook, you can start from there. 00:40:36.500 |
So that's our hope that people will build templates, 00:40:40.760 |
I think chain of thought is maybe still like kind of 00:41:06.740 |
but you don't always want it to be front and center. 00:41:09.500 |
- Yeah, what's the difference between a notebook 00:41:12.420 |
Since everybody always asks me, what's an agent? 00:41:14.460 |
Like, how do you think about where the line is? 00:41:18.220 |
I would generally think of the human as the agent 00:41:22.260 |
So you have the notebook and the human kind of adds 00:41:25.780 |
And then the next point on this kind of progress gradient is, 00:41:30.780 |
okay, now you can use language models to predict 00:41:38.020 |
in some cases I can with 100%, 99.9% accuracy 00:41:48.260 |
that will just look more and more like agents taking actions 00:41:54.440 |
And I think templates are a specific case of this 00:42:02.820 |
that you often wanna chunk and have available as primitives, 00:42:08.220 |
And those are, you can view them as action sequences 00:42:14.140 |
like the normal programming language abstraction thing. 00:42:24.300 |
and you need less and less human actual interfacing 00:42:29.220 |
Like how does the UX and the way people perceive it change? 00:42:34.820 |
- Yeah, I think this kind of interaction paradigms 00:42:42.560 |
the internet has all been about like getting data 00:42:46.680 |
But so increasingly, yeah, I really want kind of evaluation 00:42:57.180 |
superpower for Elicit, 'cause I think over time, 00:43:01.000 |
and people will have to do more and more of the evaluation. 00:43:10.140 |
there's some citation back and we kind of directly, 00:43:13.020 |
we try to highlight the ground truth in the paper 00:43:16.940 |
that is most relevant to whatever Elicit said 00:43:19.260 |
and make it super easy so that you can click on it 00:43:27.300 |
So I think we'd probably want to scale things up like that, 00:43:41.220 |
One of the other things we do is also kind of flag 00:43:44.940 |
So we have models report out, how confident are you 00:43:51.780 |
and so the user knows to prioritize checking that. 00:44:01.740 |
we have an uncertainty flag and the user can go 00:44:04.420 |
that was actually the right thing to do or not. 00:44:07.380 |
- So I've tried to do uncertainty readings from models. 00:44:11.260 |
I don't know if you have this live, you do, okay. 00:44:16.260 |
'cause they just hallucinated their own uncertainty. 00:44:25.440 |
But okay, it sounds like they scale properly for you. 00:44:38.260 |
So one model would say, here's my chain of thought, 00:44:40.820 |
here's my answer, and then a different type of model. 00:44:49.140 |
And then the second model just looks over the results 00:44:54.060 |
and like, okay, how confident are you in this? 00:44:56.980 |
And I think sometimes using a different model 00:45:01.540 |
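A minimal sketch of that two-model pattern, assuming a generic `chat` function that wraps whatever LLM client you use (the prompts and threshold are illustrative, not Elicit's implementation):

```python
# Sketch of the two-model uncertainty pattern described above: one model
# produces reasoning and an answer, a second call only judges how confident
# it is in that answer, and low-confidence rows get flagged for the user.

def chat(prompt: str) -> str:
    raise NotImplementedError("placeholder: wire this to your LLM client")

def answer_with_flag(question: str, context: str, threshold: float = 0.7) -> dict:
    draft = chat(
        f"Context:\n{context}\n\nQuestion: {question}\n"
        "Think step by step, then give a final answer."
    )
    verdict = chat(
        f"Question: {question}\n\nProposed answer and reasoning:\n{draft}\n\n"
        "On a scale from 0 to 1, how confident are you that the final answer "
        "is correct and supported by the context? Reply with just a number."
    )
    try:
        confidence = float(verdict.strip())
    except ValueError:
        confidence = 0.0  # unparseable verdict: surface it for human review
    return {
        "answer": draft,
        "confidence": confidence,
        "needs_review": confidence < threshold,
    }
```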
- Yeah, on the topic of models, evaluating models, 00:45:23.860 |
So if the project is basically a systematic review 00:45:29.860 |
that otherwise human research assistants would do, 00:45:32.100 |
then the project is basically a human equivalent spend 00:45:35.540 |
and the spend can get quite large for those projects. 00:45:45.020 |
So in those cases, you're happier to spend compute 00:45:51.020 |
where someone just enters a question because, 00:45:57.380 |
Probably don't want to spend a lot of compute on that. 00:46:00.380 |
And this sort of being able to invest more or less compute 00:46:07.060 |
is I think one of the core things we care about 00:46:09.540 |
and that I think is currently undervalued in the AI space. 00:46:12.900 |
I think currently, you can choose which model you want 00:46:21.140 |
or you can try various things to get it to work harder. 00:46:36.580 |
you should be able to get really high quality answers, 00:46:44.980 |
So unlike most products, it's not a fixed monthly fee. 00:46:58.180 |
Then you can add more columns which have more extractions 00:47:04.060 |
in high accuracy mode, which also parses the table. 00:47:06.660 |
So we kind of stack the complexity on the cost. 00:47:09.900 |
- You know the fun thing you can do with a credit system, 00:47:12.020 |
which is data for data or I don't know what I mean by that. 00:47:23.460 |
It's like, if you don't have money, but you have time, 00:47:31.700 |
and then there's been some kind of adverse selection. 00:47:37.620 |
So maybe if you were willing to give more robust feedback 00:47:58.300 |
- If you make a lot of typos in your queries, 00:48:10.980 |
All these models that we're talking about these days, 00:48:19.660 |
'cause you're just paying for all those tokens 00:48:27.820 |
we think about kind of a staged pipeline of retrieval 00:48:30.980 |
where first you use a kind of semantic search database 00:48:43.820 |
it becomes pretty interesting to use larger models. 00:48:50.100 |
I think a lot of ranking was kind of per item ranking 00:48:55.300 |
maybe using increasingly expensive scoring methods 00:49:02.180 |
where you have a model that can see all the elements 00:49:06.140 |
because often you can only really tell how good a thing is 00:49:15.660 |
maybe you even care about diversity in your results. 00:49:17.820 |
You don't wanna show like 10 very similar papers 00:49:36.060 |
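As a hedged illustration of that staged approach (the embedding, similarity, and per-item scoring functions are placeholders you would supply, and the final list-wise step uses a simple MMR-style greedy heuristic as a stand-in for the "model that can see all the elements"):

```python
# Sketch of staged retrieval: cheap semantic search over everything, a more
# expensive per-item score on the survivors, then a list-wise pass that
# trades off relevance against redundancy so the final results are diverse.

from typing import Callable, List, Sequence

def staged_retrieval(
    query: str,
    corpus: Sequence[str],
    embed: Callable[[str], List[float]],
    similarity: Callable[[List[float], List[float]], float],
    per_item_score: Callable[[str, str], float],
    k_coarse: int = 200,
    k_rerank: int = 20,
    k_final: int = 10,
    diversity_penalty: float = 0.3,
) -> List[str]:
    # Stage 1: cheap vector similarity over the whole corpus
    q_vec = embed(query)
    coarse = sorted(corpus, key=lambda d: similarity(q_vec, embed(d)), reverse=True)[:k_coarse]

    # Stage 2: more expensive per-item scoring (e.g. a cross-encoder or LLM judge)
    reranked = sorted(coarse, key=lambda d: per_item_score(query, d), reverse=True)[:k_rerank]

    # Stage 3: greedy list-wise selection that penalizes near-duplicates,
    # so the final list isn't ten very similar papers
    selected: List[str] = []
    remaining = list(reranked)
    while remaining and len(selected) < k_final:
        def combined(doc: str) -> float:
            redundancy = max(
                (similarity(embed(doc), embed(s)) for s in selected), default=0.0
            )
            return per_item_score(query, doc) - diversity_penalty * redundancy
        best = max(remaining, key=combined)
        selected.append(best)
        remaining.remove(best)
    return selected
```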
relative to people who just quickly check out things 00:49:41.820 |
I think being able to spend more on longer contexts 00:50:14.740 |
And now if you put the whole text in the paper, 00:50:21.620 |
And so you need kind of like a different set of abilities 00:50:24.380 |
and obviously like a different technology to figure out. 00:50:30.300 |
but then like the interaction is a little different. 00:50:33.060 |
- You like scan through and find some ROUGE score. 00:50:44.060 |
because you would ideally make causal claims. 00:50:49.940 |
And maybe you can do expensive approximations to that 00:50:57.300 |
But hopefully there are better ways of doing that 00:51:00.500 |
where you just get that kind of counterfactual information 00:51:06.700 |
- Do you think at all about the cost of maintaining RAG 00:51:09.980 |
versus just putting more tokens in the window? 00:51:14.300 |
a lot of times people buy developer productivity things 00:51:21.340 |
You have to maintain chunking and like RAG retrieval 00:51:25.580 |
versus I just shove everything into the context 00:51:33.340 |
- I think we still like hit up against context limits enough 00:51:43.460 |
for the scale of the work that we're doing, yeah. 00:51:45.580 |
- And I think there are different kinds of maintainability. 00:51:50.140 |
that the throw everything into the context window thing 00:52:02.220 |
if you know here's the process that we go through 00:52:08.820 |
and they're like little steps and you understand, 00:52:10.660 |
okay, this is the step that finds the relevant paragraph 00:52:15.720 |
You'll know which step breaks if the answers are bad 00:52:20.140 |
whereas if it's just like a new model version came out 00:52:32.700 |
Let's talk a bit about, yeah, needle in a haystack 00:52:41.740 |
but I was using one of these chat-with-your-documents features 00:52:56.280 |
And the response was like, oh, it doesn't say in the specs. 00:53:11.740 |
Like having the context sometimes suppress the knowledge 00:53:17.460 |
because I think sometimes that is exactly what you want. 00:53:21.540 |
you're writing the background section of your paper 00:53:23.240 |
and you're trying to describe what these other papers say. 00:53:41.880 |
that there might be something that's not in the papers, 00:53:46.940 |
you still don't want the model to just tell you. 00:53:49.580 |
I think probably the ideal thing looks a bit more 00:53:51.660 |
like agent control where the model can issue a query 00:54:02.100 |
So I would, that's maybe a reasonable middle ground 00:54:07.900 |
and model being fully limited to the papers you give it. 00:54:11.800 |
they're just kind of different tasks right now. 00:54:13.420 |
And the tasks that Elicit is mostly focused on 00:54:18.980 |
which is like, just give me the best possible answer. 00:54:23.340 |
sometimes depends on what do these papers say, 00:54:29.900 |
and then kind of do this overall task for you 00:54:34.220 |
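A small sketch of the middle ground Andreas describes, where the model is limited to the supplied papers by default but can explicitly fall back to its own background knowledge, with the provenance flagged (the `chat` function and the `NEED_BACKGROUND` protocol are illustrative assumptions, not Elicit's implementation):

```python
# Sketch of paper-grounded answering with an explicit, auditable fallback to
# the model's own background knowledge. The answer's source is always flagged.

def chat(prompt: str) -> str:
    raise NotImplementedError("placeholder: wire this to your LLM client")

def answer_from_papers(question: str, papers: list[str]) -> dict:
    prompt = (
        "Answer ONLY from the papers below. If they don't contain the answer, "
        "reply with exactly: NEED_BACKGROUND\n\n"
        + "\n\n".join(papers)
        + f"\n\nQuestion: {question}"
    )
    reply = chat(prompt)
    if reply.strip() != "NEED_BACKGROUND":
        return {"answer": reply, "source": "papers"}
    # Explicit, separately flagged use of the model's general knowledge
    background = chat(f"Answer from your general knowledge: {question}")
    return {"answer": background, "source": "model background knowledge"}
```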
- All right, this was, we see a lot of details, 00:54:39.500 |
what are maybe the most underrated features of Elicit? 00:54:48.260 |
- I think the most powerful feature of Elicit 00:54:50.300 |
is the ability to extract, add columns to this table, 00:54:59.780 |
but there are kind of many different extensions of that 00:55:04.260 |
So one is we let you give a description of the column. 00:55:23.820 |
that we're using to extract that from our predefined fields. 00:55:28.620 |
oh, actually I don't care about the population of people. 00:55:34.280 |
So I think users are still kind of discovering 00:55:37.000 |
that there's both this predefined, easy to use default, 00:55:41.260 |
but that they can extend it to be much more specific to them. 00:55:48.300 |
you can start to create different column types 00:55:51.220 |
So rather than just creating generative answers, 00:55:58.060 |
into a prospective study, a retrospective study, 00:56:04.420 |
It's like all using the same kind of technology 00:56:06.300 |
and the interface, but it unlocks different workflows. 00:56:09.780 |
So I think that like the ability to ask custom questions, 00:56:17.540 |
like classification columns is still pretty underrated. 00:56:22.980 |
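To illustrate the custom-column and classification-column idea (a sketch only; the `Column` type, the prompts, and the `chat` placeholder are assumptions, not Elicit's API):

```python
# Sketch of custom column types over a table of papers: a column is either a
# free-form question or a classification into a fixed set of options.

from dataclasses import dataclass, field

def chat(prompt: str) -> str:
    raise NotImplementedError("placeholder: wire this to your LLM client")

@dataclass
class Column:
    name: str
    question: str
    options: list[str] = field(default_factory=list)  # empty -> generative answer

    def extract(self, paper_text: str) -> str:
        prompt = f"{self.question}\n\nPaper:\n{paper_text}"
        if self.options:
            # classification column: constrain the answer to the given options
            prompt += "\n\nAnswer with exactly one of: " + ", ".join(self.options)
        return chat(prompt)

# Example columns (names and questions are hypothetical)
columns = [
    Column("sample_size", "What was the sample size of the study?"),
    Column("study_design", "What kind of study is this?",
           options=["prospective", "retrospective", "review"]),
]

def build_row(paper_text: str) -> dict:
    return {col.name: col.extract(paper_text) for col in columns}
```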
I spoke to someone who works in medical affairs 00:56:28.340 |
So they, you know, doctors kind of, you know, 00:56:40.260 |
And this person basically interacts with all the doctors. 00:56:50.440 |
So this person like talks to doctors all day long. 00:56:52.840 |
And one of the things they started using Elicit for 00:56:56.040 |
is like putting the results of their tests as the query. 00:56:59.900 |
Like this test showed, you know, this percentage, 00:57:02.540 |
you know, presence of this and 40% that and whatever. 00:57:08.900 |
what like genes are present here or something 00:57:13.180 |
And getting kind of a list of academic papers 00:57:17.380 |
and using this to help doctors interpret their tests. 00:57:24.020 |
he's pretty interested in kind of doing a survey 00:57:36.340 |
to interpret the results of these diagnostic tests? 00:57:39.520 |
Because the way they ship these tests to doctors 00:57:42.340 |
is they report on a really wide array of things. 00:57:47.900 |
well-resourced hospital, like a city hospital, 00:57:50.580 |
there might be a team of infectious disease specialists 00:57:55.140 |
But at under-resourced hospitals or more rural hospitals, 00:57:57.820 |
the primary care physician can't interpret the test results. 00:58:02.820 |
So then they can't order it, they can't use it, 00:58:07.780 |
kind of an evidence-backed way of interpreting these tests 00:58:10.380 |
is definitely kind of an extension of the product 00:58:20.020 |
and helping them interpret complicated science 00:58:23.140 |
- Yeah, we had Kanjun from Imbue on the podcast 00:58:26.740 |
and we talked about better allocating scientific resources. 00:58:31.540 |
and maybe how Elicit can help drive more research? 00:58:37.600 |
maybe the models actually do some of the research 00:58:55.940 |
I think is like basically the thing that's at stake here. 00:59:04.060 |
I was recently talking to people in longevity 00:59:06.220 |
and I think there isn't really one field of longevity. 00:59:08.620 |
There are kind of different scientific subdomains 00:59:13.180 |
various things that are related to longevity. 00:59:15.140 |
And I think if you could more systematically say, 00:59:17.620 |
look, here are all the different interventions we could do. 00:59:22.580 |
And here's the expected ROI of these experiments. 00:59:39.380 |
Probably, yeah, I'd guess in like 10, 20 years, 00:59:44.800 |
how unsystematic science was back in the day. 00:59:57.260 |
or whatever, start with kind of novice humans 01:00:03.260 |
but we really want the models to kind of like 01:00:07.660 |
So that's why we do things in this very step-by-step way. 01:00:09.820 |
That's why we don't just like throw a bunch of data 01:00:12.300 |
and apply a bunch of compute and hope we get good results. 01:00:20.520 |
But I think that's where making sure that the models 01:00:23.340 |
processes are really explicit and transparent 01:00:26.380 |
and that it's really easy to evaluate is important 01:00:28.660 |
because if it does surpass human understanding, 01:00:31.060 |
people will still need to be able to audit its work somehow 01:00:37.960 |
So yeah, that's kind of why the process-based approach 01:00:42.740 |
- And on the question of will models do their own research, 01:00:47.420 |
I think one feature that most currently don't have 01:00:50.340 |
that will need to be better there is better world models. 01:01:23.340 |
let's see, what are the underlying structures 01:01:25.900 |
of different domains, how are they related or not related, 01:01:30.020 |
for models actually being able to make novel contributions. 01:01:44.100 |
and we're actually organizing our big AI/UX meetup 01:01:48.460 |
around whenever she's in town in San Francisco. 01:01:53.940 |
How have you sort of transitioned your company 01:01:55.920 |
into this sort of PBC and sort of the plan for the future? 01:02:07.620 |
A mix of mostly kind of roles in engineering and product. 01:02:14.240 |
was really not that eventful because I think we were already, 01:02:18.260 |
even as a nonprofit, we were already shipping every week. 01:02:24.020 |
- And then I would say the kind of PBC component 01:02:27.200 |
was to very explicitly say that we have a mission 01:02:33.900 |
We think our mission will make us a lot of money, 01:02:36.020 |
but we are going to be opinionated about how we make money. 01:02:38.360 |
We're gonna take the version of making a lot of money 01:02:42.180 |
But it's like all very, it's very convergent. 01:02:47.220 |
if it doesn't actually help you discover truth 01:02:54.460 |
and the success of the company are very intertwined. 01:02:59.660 |
we're hoping to grow the team quite a lot this year. 01:03:21.860 |
how do you build good orchestration for complex tasks? 01:03:26.860 |
So we talked earlier about these are sort of notebooks, 01:03:38.100 |
than it does look like machine learning research. 01:03:45.520 |
building applications that can kind of survive, 01:03:50.720 |
like making reliable components out of unreliable pieces. 01:03:54.060 |
I think those are the people we're looking for. 01:03:56.460 |
- You know, that's exactly what I used to do. 01:04:02.820 |
Have you explored the existing orchestration frameworks? 01:04:12.220 |
around kind of being able to stream work back very quickly 01:04:16.260 |
to our users, and those could definitely be relevant. 01:04:31.220 |
for humanity, so I hope everyone listening to this podcast 01:04:34.940 |
can think hard about exactly how they want to participate 01:04:41.140 |
There's so much to build, and we can be really intentional 01:04:48.660 |
that are going to be really good for the world, 01:04:51.780 |
and so, yeah, I hope people can take that seriously 01:04:56.620 |
- Yeah, I love how intentional you guys have been.