
Supervise the Process of AI Research — with Jungwon Byun and Andreas Stuhlmüller of Elicit


Chapters

0:00 Introductions
7:45 How Jungwon and Andreas Joined Forces to Create Elicit
10:26 Why Products Are Better Than Research
15:49 The Evolution of Elicit's Product
19:44 Automating Literature Review Workflow
22:48 How GPT-3 to GPT-4 Changed Things
25:37 Managing LLM Pricing and Performance
31:07 Open vs. Closed: Elicit's Approach to Model Selection
31:56 Moving to Notebooks
39:11 Elicit's Budget for Model Queries and Evaluations
41:44 Impact of Long Context Windows
47:19 Underrated Features and Surprising Applications
51:35 Driving Systematic and Efficient Research
53:00 Elicit's Team Growth and Transition to a Public Benefit Corporation
55:22 Building AI for Good

Whisper Transcript

00:00:00.000 | Hey, everyone.
00:00:01.000 | Welcome to the Latent Space Podcast.
00:00:02.760 | This is Alessio, partner and CTO-in-residence at Decibel Partners.
00:00:06.120 | And I'm joined by my co-host, Swyx, founder of Smol AI.
00:00:09.400 | Hey, and today we are back in the studio
00:00:11.520 | with Andreas and Jungwon from Elicit.
00:00:14.480 | Welcome.
00:00:15.240 | Thanks, guys.
00:00:16.000 | It's great to be here.
00:00:16.880 | Yeah.
00:00:17.380 | So I'll introduce you separately,
00:00:19.060 | but also we'd love to learn a little bit more about you
00:00:22.080 | personally.
00:00:23.320 | So Andreas, it looks like you started Elicit, or Ought first,
00:00:28.440 | and Jungwon joined later.
00:00:31.520 | That's right, although you did--
00:00:33.200 | I guess, for all intents and purposes,
00:00:35.240 | the Elicit and also the Ought that existed before then
00:00:39.080 | were very different from what I started.
00:00:42.360 | So I think it's fair to say that she co-founded it.
00:00:46.120 | And Jungwon, you're a co-founder and COO of Elicit.
00:00:49.400 | Yeah, that's right.
00:00:50.200 | So there's a little bit of a history to this.
00:00:52.840 | I'm not super aware of the journey.
00:00:55.920 | I was aware of Ought and Elicit as sort
00:00:59.320 | of a nonprofit-type situation.
00:01:01.040 | And recently, you turned into sort of like a B Corp--
00:01:04.080 | Public Benefit Corporation.
00:01:05.600 | So yeah, maybe if you want, you could
00:01:07.280 | take us through that journey of finding the problem.
00:01:12.400 | Obviously, you're working together now.
00:01:14.960 | So how do you get together to decide to leave your startup
00:01:18.920 | career to join him?
00:01:20.880 | Yeah, it's truly a very long journey.
00:01:22.440 | I guess, truly, it kind of started in Germany
00:01:24.680 | when I was born.
00:01:25.440 | So even as a kid, I was always interested in AI.
00:01:30.720 | I kind of went to the library.
00:01:32.120 | There were books about how to write programs in QBasic.
00:01:34.980 | And some of them talked about how to implement chatbots.
00:01:39.440 | I guess Eliza--
00:01:40.040 | To be clear, he grew up in a tiny village
00:01:42.600 | on the outskirts of Munich called
00:01:44.120 | Dinkelscherben, where it's a very, very idyllic German
00:01:47.880 | village.
00:01:49.000 | Important to the story.
00:01:51.680 | But basically, the main thing is I've kind of always
00:01:54.900 | been thinking about AI my entire life
00:01:56.480 | and been thinking about, well, at some point,
00:01:58.200 | this is going to be a huge deal.
00:01:59.560 | It's going to be transformative.
00:02:00.900 | How can I work on it?
00:02:02.920 | And I was thinking about it from when I was a teenager.
00:02:09.200 | After high school, I did a year where
00:02:11.400 | I started a startup with the intention to become rich.
00:02:15.160 | And then once I'm rich, I can affect the trajectory of AI.
00:02:19.240 | Did not become rich.
00:02:21.000 | Decided to go back to college and study cognitive science
00:02:24.040 | there, which was like the closest thing
00:02:25.720 | I could find at the time to AI.
00:02:29.680 | In the last year of college, moved to the US
00:02:32.360 | to do a PhD at MIT, working on broadly kind of new programming
00:02:37.960 | languages for AI, because it kind of seemed like the existing
00:02:40.880 | languages were not great at expressing world models
00:02:44.100 | and learning world models doing Bayesian inference.
00:02:47.520 | Was always thinking about, well, ultimately, the goal
00:02:49.640 | is to actually build tools that help people reason more
00:02:51.960 | clearly, ask and answer better questions,
00:02:57.600 | and make better decisions.
00:02:58.760 | But for a long time, it just seemed
00:03:00.240 | like the technology to put reasoning in machines
00:03:03.440 | just wasn't there.
00:03:04.800 | And so initially, at the end of my postdoc at Stanford,
00:03:10.820 | I was thinking about, well, what to do?
00:03:12.440 | I think the standard path is you become an academic
00:03:15.600 | and do research.
00:03:17.160 | But it's really hard to actually build interesting tools
00:03:23.040 | as an academic.
00:03:23.920 | You can't really hire great engineers.
00:03:26.760 | Everything is kind of on a paper-to-paper timeline.
00:03:29.520 | And so I was like, well, maybe I should start a startup
00:03:33.440 | and pursue that for a little bit.
00:03:35.120 | But it seemed like it was too early,
00:03:37.160 | because you could have tried to do an AI startup,
00:03:39.900 | but probably would not have been the kind of AI startup
00:03:42.840 | we're seeing now.
00:03:44.520 | So then decided to just start a nonprofit research lab that's
00:03:47.840 | going to do research for a while,
00:03:49.200 | until we better figure out how to do thinking in machines.
00:03:53.000 | And that was Ought.
00:03:54.800 | And then over time, it became clear how to actually build
00:04:01.080 | actual tools for reasoning.
00:04:02.800 | And then only over time, we developed a better way to--
00:04:08.400 | I'll let you fill in some of these.
00:04:10.400 | Yeah, so I guess my story maybe starts around 2015.
00:04:14.360 | I kind of wanted to be a founder for a long time.
00:04:17.880 | And I wanted to work on an idea that really tested--
00:04:22.220 | that stood the test of time for me,
00:04:23.840 | like an idea that stuck with me for a long time.
00:04:26.560 | And then starting in 2015, actually,
00:04:28.280 | originally, I became interested in AI-based tools
00:04:30.840 | from the perspective of mental health.
00:04:32.680 | So there are a bunch of people around me
00:04:33.840 | who are really struggling.
00:04:35.240 | One really close friend in particular
00:04:36.620 | is really struggling with mental health
00:04:38.280 | and didn't have any support.
00:04:39.720 | And it didn't feel like there was anything
00:04:41.880 | before getting hospitalized that could just help her.
00:04:45.640 | And so luckily, she came and stayed with me for a while.
00:04:48.320 | And we were just able to talk through some things.
00:04:50.600 | But it seemed like lots of people
00:04:52.760 | might not have that resource.
00:04:54.140 | And something maybe AI-enabled could be much more scalable.
00:04:57.760 | I didn't feel ready to start a company then.
00:05:00.360 | That's 2015.
00:05:02.280 | And I also didn't feel like the technology was ready.
00:05:05.200 | So then I went into fintech and learned
00:05:07.440 | how to do the tech thing.
00:05:09.200 | And then in 2019, I felt like it was time
00:05:12.640 | for me to just jump in and build something on my own
00:05:15.280 | I really wanted to create.
00:05:17.160 | And at the time, there were two interesting--
00:05:19.840 | I looked around at tech and felt not super inspired
00:05:22.680 | by the options.
00:05:23.720 | I didn't want to have a tech career ladder.
00:05:26.320 | I didn't want to climb the career ladder.
00:05:28.840 | There were two interesting technologies at the time.
00:05:30.860 | There was AI and there was crypto.
00:05:32.800 | And I was like, well, the AI people
00:05:34.240 | seem a little bit more nice.
00:05:37.120 | Maybe slightly more trustworthy.
00:05:40.080 | Both super exciting.
00:05:41.320 | But yeah, I kind of threw my bet in on the AI side.
00:05:46.240 | And then I got connected to Andreas.
00:05:47.780 | And actually, the way he was thinking
00:05:49.760 | about pursuing the research agenda at Ought
00:05:52.040 | was really compatible with what I had envisioned
00:05:54.760 | for an ideal AI product, something
00:05:57.080 | that helps kind of take really complex, overwhelming thinking
00:05:59.880 | and break it down into small pieces.
00:06:02.720 | And then this kind of mission that we
00:06:04.640 | need AI to help us figure out what we ought to do
00:06:08.200 | was really inspiring.
00:06:10.520 | Yeah, because I think it was clear that we were building
00:06:12.880 | the most powerful optimizer of our time.
00:06:16.560 | But as a society, we hadn't figured out
00:06:18.640 | how to direct that optimization potential.
00:06:21.520 | And if you kind of direct tremendous amounts
00:06:23.640 | of optimization potential at the wrong thing,
00:06:25.840 | that's really disastrous.
00:06:27.000 | So the goal of Ought was to make sure that if we build the most
00:06:29.940 | transformative technology of our lifetime,
00:06:32.160 | it can be used for something really impactful,
00:06:34.400 | like good reasoning, like not just generating ads.
00:06:37.480 | My background was in marketing.
00:06:38.880 | But so I was like, I want to do more than generate ads
00:06:41.160 | with this.
00:06:42.160 | And also, if these AI systems get
00:06:45.320 | to be super intelligent enough that they are doing
00:06:47.880 | this really complex reasoning, that we can trust them,
00:06:50.240 | that they are aligned with us and we
00:06:51.980 | have ways of evaluating that they're doing the right thing.
00:06:54.960 | So that's what Ought did.
00:06:55.880 | We did a lot of experiments.
00:06:57.600 | This was, like Andreas said, before foundation models
00:07:00.960 | really took off.
00:07:02.640 | A lot of the issues we were seeing
00:07:04.880 | were more in reinforcement learning.
00:07:06.640 | But we saw a future where AI would
00:07:09.720 | be able to do more kind of logical reasoning,
00:07:12.360 | not just kind of extrapolate from numerical trends.
00:07:15.360 | So we actually kind of set up experiments
00:07:18.800 | with people, where people stood in as super intelligent
00:07:21.960 | systems.
00:07:23.320 | And we effectively gave them context windows.
00:07:25.920 | So they would have to read a bunch of text.
00:07:28.560 | And one person would get less text,
00:07:31.360 | and one person would get all the text,
00:07:32.900 | and the person with less text would
00:07:34.480 | have to evaluate the work of the person who
00:07:37.080 | could read much more.
00:07:38.200 | So in a world, we were basically simulating,
00:07:40.600 | like in 2018, 2019, a world where an AI system could read
00:07:44.520 | significantly more than you.
00:07:45.960 | And you, as the person who couldn't read that much,
00:07:48.280 | had to evaluate the work of the AI system.
00:07:50.640 | Yeah, so there's a lot of the work we did.
00:07:53.840 | And from that, we kind of iterated on this idea,
00:07:56.280 | that the idea of breaking complex tasks down
00:07:58.880 | into smaller tasks, like complex tasks,
00:08:00.720 | like open-ended reasoning, logical reasoning,
00:08:03.640 | into smaller tasks, so that it's easier
00:08:05.520 | to train AI systems on them.
00:08:07.080 | And also so that it's easier to evaluate the work of the AI
00:08:09.960 | system when it's done.
00:08:11.880 | And then also kind of really pioneered this idea,
00:08:15.840 | the importance of supervising the process of AI systems,
00:08:18.800 | not just the outcomes.
00:08:20.360 | And so a big part of then how Elicit is built
00:08:23.040 | is we're very intentional about not just throwing a ton of data
00:08:27.320 | into a model and training it, and then saying, cool,
00:08:29.640 | here's scientific output.
00:08:31.320 | That's not at all what we do.
00:08:33.520 | Our approach is very much like, what are the steps
00:08:35.680 | that an expert human does?
00:08:37.160 | Or what is an ideal process?
00:08:38.800 | As granularly as possible, let's break that down.
00:08:41.800 | And then train AI systems to perform each of those steps
00:08:44.640 | very robustly.
00:08:46.200 | When you train that from the start, after the fact,
00:08:49.200 | it's much easier to evaluate.
00:08:50.920 | It's much easier to troubleshoot at each point,
00:08:53.000 | like where did something break down?
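To make this process-supervision idea concrete: the workflow is decomposed into named steps, each step's output is checked before moving on, and failures point at the step that broke. The sketch below is illustrative only, with made-up step names and checks; it is not Elicit's actual pipeline.

```python
# Toy sketch of process supervision: run a workflow as explicit named steps,
# validate each step's output, and surface exactly where things break down.
# Step names and checks here are made up for illustration.
from typing import Callable, List, Tuple

# (name, step function, validator for that step's output)
Step = Tuple[str, Callable[[object], object], Callable[[object], bool]]

def run_supervised(steps: List[Step], inp: object) -> object:
    x = inp
    for name, fn, check in steps:
        x = fn(x)
        if not check(x):
            raise ValueError(f"Step '{name}' produced an invalid output: {x!r}")
    return x

steps: List[Step] = [
    ("find_papers", lambda q: [f"paper about {q}"], lambda out: len(out) > 0),
    ("screen_papers", lambda papers: [p for p in papers if "paper" in p], lambda out: len(out) > 0),
    ("extract_outcomes", lambda papers: {p: "outcome summary" for p in papers}, lambda out: all(out.values())),
]

print(run_supervised(steps, "creatine and cognition"))
```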
00:08:55.000 | So yeah, we were working on those experiments for a while.
00:08:57.000 | And then at the start of 2021, decided to build a product.
00:09:00.320 | Because when you do research, I think maybe--
00:09:03.280 | - Do you mind if I, 'cause I think you're about to go
00:09:06.000 | into more modern Ought and Elicit.
00:09:08.640 | And I just wanted to, because I think a lot of people
00:09:11.080 | are in where you were, like sort of 2018, '19,
00:09:15.360 | where you chose a partner to work with, right?
00:09:18.360 | And you didn't know him.
00:09:19.760 | You were just kind of cold introduced.
00:09:21.520 | A lot of people are cold introduced.
00:09:23.200 | I've been cold introduced to tons of people
00:09:24.720 | and I never work with them.
00:09:26.760 | I assume you had a lot of other options, right?
00:09:28.880 | Like how do you advise people to make those choices?
00:09:32.160 | - Yeah, we were not totally cold introduced.
00:09:33.840 | So we had one of our closest friends introduce us.
00:09:36.880 | And then Andreas had written a lot on the Ought website,
00:09:41.120 | a lot of blog posts, a lot of publications.
00:09:43.160 | And I just read it and I was like, wow,
00:09:44.920 | this sounds like my writing.
00:09:47.080 | And even other people, some of my closest friends
00:09:49.360 | I asked for advice from, they were like,
00:09:50.920 | oh, this sounds like your writing.
00:09:52.960 | But I think I also had some kind of like
00:09:54.840 | things I was looking for.
00:09:55.760 | I wanted someone with a complementary skill set.
00:09:58.240 | I wanted someone who was very values-aligned.
00:10:00.800 | And yeah, I think that was all a good fit.
00:10:03.640 | - We also did a pretty lengthy mutual evaluation process
00:10:07.120 | where we had a Google doc
00:10:08.480 | where we had all kinds of questions for each other.
00:10:11.120 | And I think it ended up being around 50 pages or so
00:10:14.560 | of like various like questions and back and forth.
00:10:16.720 | - Was it the YC list?
00:10:18.200 | There's some lists going around for co-founder questions.
00:10:20.360 | - No, we just made our own questions.
00:10:22.480 | But I presume, I guess it's probably related
00:10:26.000 | in that you ask yourself, well,
00:10:27.480 | what are the values you care about?
00:10:28.720 | How would you approach various decisions
00:10:30.480 | and things like that?
00:10:31.320 | - I shared like all of my past performance reviews.
00:10:33.880 | - Yeah. - Yeah.
00:10:35.200 | - Yeah, and he had never had any, so.
00:10:36.480 | - No. (all laughing)
00:10:39.240 | - Yeah, sorry, I just had to,
00:10:42.000 | a lot of people are going through that phase
00:10:43.520 | and you kind of skipped over it.
00:10:44.400 | I was like, no, no, no, no,
00:10:45.240 | there's like an interesting story there.
00:10:47.320 | - So before we jump into what Elicit is today,
00:10:51.400 | the history is a bit counterintuitive.
00:10:53.920 | So you start with figuring out,
00:10:55.920 | oh, if we had a super powerful model,
00:10:58.400 | how would we align it, how we use it?
00:11:00.400 | But then you were actually like,
00:11:01.560 | well, let's just build the product
00:11:02.880 | so that people can actually leverage it.
00:11:04.680 | And I think there are a lot of folks today
00:11:07.120 | that are now back to where you were maybe five years ago
00:11:09.320 | that are like, oh, what if this happens
00:11:11.400 | rather than focusing on actually building
00:11:13.080 | something useful with it?
00:11:15.160 | What clicked for you to like move into Elicit
00:11:18.240 | and then we can cover that story too?
00:11:20.160 | - I think in many ways, the approach is still the same
00:11:22.400 | because the way we are building Elicit is not,
00:11:24.960 | let's train a foundation model to do more stuff.
00:11:27.440 | It's like, let's build a scaffolding
00:11:29.480 | such that we can deploy powerful models to good ends.
00:11:32.740 | So I think it's different now in that we are,
00:11:36.040 | we actually have some of the models to plug in,
00:11:37.920 | but if in 2018, '17, we had had the models,
00:11:42.480 | we could have run the same experiments
00:11:44.840 | we did run with humans back then, just with models.
00:11:47.720 | And so in many ways, our philosophy is always like,
00:11:50.120 | let's think ahead to the future.
00:11:51.280 | What models are gonna exist in one, two years or longer?
00:11:55.960 | And how can we make it so that they can actually be deployed
00:11:59.560 | in kind of transparent, controllable ways?
00:12:02.440 | - Yeah, I think motivationally,
00:12:03.840 | we both are kind of product people at heart
00:12:06.040 | and we just want to, the research was really important
00:12:09.600 | and it didn't make sense to build a product at that time.
00:12:12.640 | But at the end of the day,
00:12:13.480 | the thing that always motivated us is imagining a world
00:12:16.600 | where high quality reasoning is really abundant.
00:12:19.600 | And AI was just kind of the most,
00:12:22.640 | is the technology that's gonna get us there.
00:12:24.880 | And there's a way to guide that technology with research,
00:12:27.240 | but it's also really exciting to have,
00:12:29.320 | you can have a more direct effect through product
00:12:31.880 | because with research, you have kind of,
00:12:33.760 | you'd publish the research and someone else
00:12:35.320 | has to implement that into the product
00:12:36.760 | and the product felt like a more direct path.
00:12:39.120 | And we wanted to concretely have an impact on people's lives.
00:12:42.360 | So I think, yeah, I think the kind of personally,
00:12:45.520 | the motivation was we want to build for people.
00:12:49.160 | - Yep, and then just to recap as well,
00:12:52.600 | like the models you were using back then were like,
00:12:55.000 | I don't know, would they like BERT type stuff or T5
00:12:59.000 | or I don't know what timeframe we're talking about here.
00:13:02.120 | - So I guess to be clear, at the very beginning,
00:13:04.400 | we had humans do the work and then the initial,
00:13:09.040 | I think the first models that kind of make sense
00:13:11.280 | were GPT-2 and T-NLG and like the early generative models.
00:13:18.280 | We do also use like T5 based models even now,
00:13:22.420 | but started with GPT-2.
00:13:26.280 | - Yeah, cool, I'm just kind of curious about like,
00:13:27.920 | how do you start so early, you know,
00:13:29.720 | like now it's obvious where to start,
00:13:31.600 | but back then it wasn't.
00:13:33.240 | - Yeah, I used to nag Andreas a lot.
00:13:35.080 | I was like, why are you talking to this?
00:13:37.120 | I don't know, I felt like GPT-2 is like,
00:13:38.720 | clearly can't do anything.
00:13:39.840 | And I was like, Andreas, you're wasting your time
00:13:41.600 | like playing with this toy.
00:13:43.520 | But yeah, he was right.
00:13:46.600 | - So what's the history of what Elicit
00:13:48.840 | actually does as a product?
00:13:50.240 | I think today, you recently announced that
00:13:52.960 | after four months, you get to a million in revenue.
00:13:55.080 | Obviously a lot of people use it, get a lot of value,
00:13:57.040 | but it would initially kind of like structure data,
00:14:01.160 | extraction from papers.
00:14:03.000 | Then you had, yeah, kind of like concept grouping.
00:14:06.480 | And today it's maybe like a more full stack
00:14:08.760 | research enabler, kind of like paper understander platform.
00:14:12.040 | What's the definitive definition of what Elicit is
00:14:16.400 | and how did you get here?
00:14:17.520 | - Yeah, we say Elicit is an AI research assistant.
00:14:20.000 | I think it will continue to evolve.
00:14:21.600 | It has evolved a lot and it will continue to.
00:14:23.640 | And that's part of why we're so excited
00:14:26.080 | about building in research,
00:14:27.000 | 'cause there's just so much space.
00:14:28.800 | I think the current phase we're in right now,
00:14:30.980 | we talk about it as really trying to make Elicit
00:14:34.000 | the best place to understand what is known.
00:14:35.760 | So it's all a lot about like literature summarization.
00:14:39.360 | There's a ton of information that the world already knows.
00:14:41.540 | It's really hard to navigate, hard to make it relevant.
00:14:44.840 | So a lot of it is around document discovery
00:14:47.320 | and processing and analysis.
00:14:49.640 | I really want to make,
00:14:51.760 | I kind of want to import some of the incredible
00:14:54.640 | productivity improvements we've seen in software engineering
00:14:57.920 | and data science and into research.
00:14:59.580 | So it's like, how can we make researchers
00:15:01.760 | like data scientists of text?
00:15:04.080 | That's why we're launching this new set of features
00:15:06.920 | called Notebooks.
00:15:07.760 | It's very much inspired by computational notebooks
00:15:09.960 | like Jupyter Notebooks, Deepnote, or Colab,
00:15:13.440 | because they're so powerful and so flexible.
00:15:15.520 | And ultimately when people are trying to get to an answer
00:15:19.200 | or understand insight,
00:15:20.120 | they're kind of like manipulating evidence and information.
00:15:22.900 | Today that's all packaged in PDFs, which are super brittle,
00:15:26.320 | but with language models,
00:15:27.440 | we can decompose these PDFs into their underlying claims
00:15:30.440 | and evidence and insights,
00:15:31.480 | and then let researchers mash them up together,
00:15:34.360 | remix them and analyze them together.
00:15:35.920 | So yeah, I would say quite simply,
00:15:38.800 | overall Elicit is an AI research assistant.
00:15:40.780 | Right now we're focused on text-based workflows,
00:15:45.200 | but long-term really want to kind of go further and further
00:15:48.120 | into reasoning and decision-making.
00:15:50.480 | - And when you say AI research assistant,
00:15:53.280 | this is kind of meta research.
00:15:55.280 | So researchers use Elicit as a research assistant.
00:15:58.940 | It's not a generic, you can research anything type of tool,
00:16:02.880 | or it could be, but like,
00:16:04.160 | what are people using it for today?
00:16:05.920 | - Yeah, so specifically, I guess in science,
00:16:08.960 | a lot of people use human research assistants to do things.
00:16:12.260 | Like you tell your kind of grad student,
00:16:15.500 | hey, here are a couple of papers.
00:16:16.780 | Can you look at all of these,
00:16:18.540 | see which of these have kind of sufficiently large
00:16:21.260 | populations and actually study the disease
00:16:23.260 | that I'm interested in,
00:16:24.420 | and then write out like, what are the experiments they did?
00:16:26.840 | What are the interventions they did?
00:16:28.720 | What are the outcomes?
00:16:29.620 | And kind of organize that for me.
00:16:31.460 | And the first phase of understanding what is known
00:16:34.580 | really focuses on automating that workflow,
00:16:37.100 | because a lot of that work is pretty rote work.
00:16:39.200 | I think it's not the kind of thing
00:16:40.480 | that we need humans to do, language models can do it.
00:16:43.480 | And then if language models can do it,
00:16:45.200 | you can obviously scale it up much more
00:16:47.320 | than a grad student or undergrad research assistant
00:16:50.520 | would be able to do.
00:16:52.120 | - Yeah, the use cases are pretty broad.
00:16:53.760 | So we do have people who just come,
00:16:55.280 | a very large percent of our users
00:16:56.920 | are just using it personally,
00:16:58.240 | or for a mix of personal and professional things.
00:17:01.160 | People who care a lot about like health or biohacking,
00:17:05.260 | or parents who have children with a kind of rare disease
00:17:08.880 | and want to understand the literature directly.
00:17:10.680 | So there is an individual kind of consumer use case.
00:17:13.720 | We're most focused on the power users,
00:17:15.600 | so that's where we're really excited to build.
00:17:18.180 | So Elicit was very much inspired by this workflow
00:17:21.180 | in literature called Systematic Reviews or Meta-Analysis,
00:17:24.480 | which is basically the human state of the art
00:17:27.340 | for summarizing scientific literature.
00:17:29.480 | It typically involves like five people
00:17:31.440 | working together for over a year,
00:17:33.600 | and they kind of first start by trying to find
00:17:35.600 | the maximally comprehensive set of papers possible.
00:17:38.500 | So it's like 10,000 papers.
00:17:40.360 | And they kind of systematically narrow that down
00:17:42.520 | to like hundreds or 50, and extract key details
00:17:46.160 | from every single paper.
00:17:47.280 | Usually have two people doing it,
00:17:48.760 | like a third person reviewing it.
00:17:50.300 | So it's like an incredibly laborious,
00:17:52.940 | time-consuming process,
00:17:54.160 | but you see it in every single domain.
00:17:56.080 | So in science, in machine learning, in policy.
00:17:59.800 | And so if you can, and it's very,
00:18:01.400 | because it's so structured and designed to be reproducible,
00:18:03.840 | it's really amenable to automation.
00:18:05.580 | So that's kind of the workflow
00:18:07.080 | that we want to automate first.
00:18:08.560 | And then you make that accessible for any question
00:18:12.100 | and make kind of these really robust
00:18:14.080 | living summaries of science.
00:18:15.900 | So yeah, that's one of the workflows
00:18:16.880 | that we're starting with.
00:18:17.720 | - Our previous guest, Mike Conover,
00:18:19.000 | he's building a new company called BrightWave,
00:18:20.600 | which is an AI research assistant for financial research.
00:18:24.380 | How do you see the future of these tools?
00:18:26.360 | Like does everything converge
00:18:27.800 | to like a God researcher assistant,
00:18:30.680 | or is every domain going to have its own thing?
00:18:33.620 | - I think that's a good and mostly open question.
00:18:36.540 | I do think there are some differences across domains.
00:18:40.400 | For example, some research is more
00:18:42.640 | quantitative data analysis,
00:18:44.480 | and other research is more
00:18:46.160 | kind of high-level cross-domain thinking.
00:18:49.360 | And we definitely want to contribute
00:18:51.600 | to the broad generalist reasoning type space.
00:18:53.460 | Like if researchers are making discoveries,
00:18:55.880 | often it's like, hey,
00:18:56.920 | this thing in biology is actually analogous
00:18:59.000 | to like these equations in economics or something.
00:19:01.560 | And that's just fundamentally a thing
00:19:03.840 | where you need to reason across domains.
00:19:06.640 | So I think there will be, at least within research,
00:19:09.440 | I think there will be like one best platform more or less
00:19:12.600 | for this type of generalist research.
00:19:15.480 | I think there may still be like some particular tools
00:19:17.680 | like for genomics, like particular types of modules
00:19:21.360 | of genes and proteins and whatnot.
00:19:23.560 | But for a lot of the kind of high-level reasoning
00:19:25.520 | that humans do, I think that will be more
00:19:27.760 | of a one-platform type of thing.
00:19:29.160 | - I wanted to ask a little bit deeper about,
00:19:31.920 | I guess, the workflow that you mentioned.
00:19:34.020 | I like that phrase.
00:19:35.720 | I see that in your UI now, but that's as it is today.
00:19:39.440 | And I think you were about to tell us
00:19:41.000 | about how it was in 2021 and how it maybe progressed.
00:19:43.600 | Like what, how has this workflow evolved?
00:19:46.280 | - Yeah, so the very first version of Elicit
00:19:48.040 | actually wasn't even a research assistant.
00:19:49.720 | It was like a forecasting assistant.
00:19:53.200 | So we set out and we were thinking about
00:19:55.120 | what are some of the most impactful
00:19:57.560 | types of reasoning that if we could scale up AI
00:20:00.700 | would really transform the world.
00:20:02.000 | And the first thing we started,
00:20:04.040 | we actually started with literature review,
00:20:06.720 | but we're like, oh, so many people are gonna build
00:20:08.560 | literature review tools, so let's not start there.
00:20:11.000 | And so then we focused on geopolitical forecasting.
00:20:13.880 | So I don't know if you're familiar with like Manifold or--
00:20:16.400 | - Manifold Markets. - Yeah, that kind of stuff.
00:20:18.220 | - And Manifold.love. - Before Manifold, yeah.
00:20:20.800 | So we're not predicting relationships,
00:20:22.760 | we're predicting like, is China gonna invade Taiwan?
00:20:26.040 | - Markets for everything. - Yeah.
00:20:27.240 | - That's a relationship. - Yeah, it's fair.
00:20:29.160 | - Yeah, yeah, it's true.
00:20:30.520 | - And then we worked on that for a while
00:20:32.080 | and then after GPT-3 came out,
00:20:33.840 | I think by that time we kind of realized that the,
00:20:37.200 | originally we were trying to help people
00:20:39.240 | convert their beliefs into probability distributions.
00:20:42.320 | And so take fuzzy beliefs,
00:20:43.900 | but like model them more concretely.
00:20:46.280 | And then after a few months of iterating on that,
00:20:48.040 | just realized, oh, the thing that's blocking people
00:20:50.580 | from making interesting predictions
00:20:52.600 | about important events in the world
00:20:54.960 | is less kind of on the probabilistic side
00:20:57.080 | and much more on the research side.
00:20:59.320 | And so that kind of combined
00:21:00.920 | with the very generalist capabilities of GPT-3
00:21:03.720 | prompted us to make a more general research assistant.
00:21:06.640 | Then we spent a few months iterating
00:21:08.320 | on what even is a research assistant.
00:21:11.240 | So we would embed with different researchers,
00:21:13.080 | we built data labeling workflows in the beginning,
00:21:17.040 | kind of right off the bat.
00:21:18.000 | We built ways to find like experts in a field
00:21:23.000 | and like ways to ask good research questions.
00:21:25.640 | So we just kind of iterated through a lot of workflows
00:21:27.660 | and it was, yeah, no one else was really building
00:21:30.000 | at this time and it was like very quick
00:21:31.400 | to just do some prompt engineering
00:21:32.840 | and see like what is a task that is at the intersection
00:21:35.940 | of what's good, technologically capable
00:21:38.160 | and like important for researchers.
00:21:40.600 | And we had like a very nondescript landing page.
00:21:42.680 | It said nothing, but somehow people were signing up
00:21:45.000 | and we had the sign-up form that were like,
00:21:47.000 | it was like, "Why are you here?"
00:21:48.080 | And everyone was like, "I need help with literature review."
00:21:50.000 | And we're like, "Literature review, that sounds so hard.
00:21:52.040 | "I don't even know what that means."
00:21:53.160 | They're like, "We don't want to work on it."
00:21:55.040 | But then eventually we were like,
00:21:56.040 | "Okay, everyone is saying literature review."
00:21:57.520 | It's overwhelmingly people want--
00:21:58.880 | - And all domains, not like medicine or physics
00:22:01.120 | or just all domains.
00:22:02.000 | - Yeah, and we also kind of personally knew
00:22:03.680 | literature review was hard.
00:22:04.720 | And if you look at the graphs for academic literature
00:22:07.360 | being published every single month,
00:22:08.520 | you guys know this in machine learning,
00:22:09.720 | it's like up and to the right,
00:22:11.240 | like superhuman amounts of papers.
00:22:13.960 | So we're like, "All right, let's just try it."
00:22:15.240 | I was really nervous, but Andreas was like,
00:22:16.920 | "This is kind of like the right problem space
00:22:19.080 | "to jump into even if we don't know what we're doing."
00:22:21.640 | So my take was like, "Fine, this feels really scary,
00:22:24.480 | "but let's just launch a feature every single week
00:22:27.440 | "and double our user numbers every month.
00:22:29.480 | "And if we can do that, we'll fail fast
00:22:32.120 | "and we will find something."
00:22:33.440 | I was worried about getting lost
00:22:35.400 | in the kind of academic white space.
00:22:37.720 | So the very first version was actually a weekend prototype
00:22:40.320 | that Andreas made.
00:22:41.240 | Do you want to explain how that worked?
00:22:43.120 | - I mostly remember this was really bad.
00:22:45.440 | So the thing I remember is you enter a question
00:22:49.600 | and it would give you back a list of claims.
00:22:51.480 | So your question could be, I don't know,
00:22:53.280 | "How does creatine affect cognition?"
00:22:55.600 | It would give you back some claims
00:22:57.920 | that are to some extent based on papers,
00:23:01.280 | but they were often irrelevant.
00:23:02.800 | The papers were often irrelevant.
00:23:04.480 | And so we ended up soon just printing out
00:23:07.240 | a bunch of examples of results
00:23:08.700 | and putting them up on the wall
00:23:09.820 | so that we would kind of feel the constant shame
00:23:12.240 | of having such a bad product
00:23:13.880 | and would be incentivized to make it better.
00:23:16.400 | And I think over time it has gotten a lot better,
00:23:18.400 | but I think the initial version was really very bad.
00:23:22.680 | - Yeah, but it was basically like
00:23:24.000 | a natural language summary of an abstract,
00:23:25.800 | like kind of a one-sentence summary,
00:23:27.040 | and which we still have.
00:23:28.360 | And then as we learned kind of more
00:23:30.240 | about this systematic review workflow,
00:23:31.960 | we started expanding the capability
00:23:33.600 | so that you could extract a lot more data
00:23:35.280 | from the papers and do more with that.
00:23:37.440 | - And were you using embeddings and cosine similarity,
00:23:40.960 | that kind of stuff, for retrieval,
00:23:42.360 | or was it keyword-based, or?
00:23:44.880 | - I think the very first version
00:23:46.800 | didn't even have its own search engine.
00:23:48.680 | I think the very first version probably used
00:23:51.280 | the Semantic Scholar API or something similar.
00:23:54.640 | And only later, when we discovered
00:23:57.000 | that that API is not very semantic,
00:24:00.040 | then built our own search engine that has helped a lot.
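For readers wondering what embedding-based retrieval with cosine similarity looks like in practice, here is a minimal sketch. The `embed` function is a toy stand-in for a real text encoder, and none of this reflects Elicit's actual search engine.

```python
# Minimal sketch of embedding-based retrieval ranked by cosine similarity.
# `embed` is a toy character-count encoder; swap in a real embedding model.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Bag-of-characters placeholder so the example runs without a model.
    vecs = np.zeros((len(texts), 256))
    for i, text in enumerate(texts):
        for ch in text.lower():
            vecs[i, ord(ch) % 256] += 1.0
    return vecs

def top_k(query: str, abstracts: list[str], k: int = 2) -> list[tuple[float, str]]:
    q = embed([query])[0]
    docs = embed(abstracts)
    # Cosine similarity = dot product of L2-normalized vectors.
    q = q / (np.linalg.norm(q) + 1e-9)
    docs = docs / (np.linalg.norm(docs, axis=1, keepdims=True) + 1e-9)
    scores = docs @ q
    best = np.argsort(-scores)[:k]
    return [(float(scores[i]), abstracts[i]) for i in best]

abstracts = [
    "Creatine supplementation and working memory in healthy adults.",
    "Effects of caffeine on sprint performance in athletes.",
    "A survey of transformer architectures for language modeling.",
]
print(top_k("How does creatine affect cognition?", abstracts))
```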
00:24:04.280 | - And then we're gonna go into more recent products stuff,
00:24:08.080 | but I think you seem the more startup-oriented
00:24:11.640 | business person, and you seem sort of more ideologically
00:24:14.880 | interested in research, obviously, 'cause of your PhD.
00:24:17.580 | What kind of market sizing were you guys thinking?
00:24:21.560 | 'Cause you're here saying, "We have to double every month."
00:24:24.680 | And I'm like, "I don't know how you make
00:24:26.920 | "that conclusion from this," right?
00:24:29.720 | Especially also as a non-profit at the time.
00:24:31.920 | - Yeah, I think market size-wise,
00:24:34.960 | I felt like in this space where so much was changing
00:24:39.640 | and it was very unclear what of today
00:24:43.440 | was actually gonna be true tomorrow,
00:24:45.760 | we just really rested a lot on very, very simple
00:24:48.320 | fundamental principles, which is research is,
00:24:51.080 | if you can understand the truth,
00:24:52.480 | that is very economically beneficial, valuable,
00:24:55.320 | if you know the truth.
00:24:56.440 | - Just on principle, that's enough for you.
00:24:58.080 | - Yeah, research is the key to many breakthroughs
00:25:01.160 | that are very commercially valuable.
00:25:02.840 | 'Cause my version of it is students are poor
00:25:05.280 | and they don't pay for anything, right?
00:25:06.960 | But that's obviously not true, as you guys have found out.
00:25:09.200 | But I, you know, you had to have some market insight
00:25:12.600 | for me to have believed that, but I think you skipped that.
00:25:15.240 | - Yeah.
00:25:16.080 | - Yeah, we did encounter, I guess,
00:25:18.160 | talking to VCs for our seed round.
00:25:20.120 | A lot of VCs were like, "You know, researchers,
00:25:22.360 | "they don't have any money.
00:25:24.100 | "Why don't you build a legal assistant?"
00:25:27.320 | (laughing)
00:25:28.560 | And I think in some short-sighted way,
00:25:30.720 | maybe that's true, but I think in the long run,
00:25:34.680 | R&D is such a big space of the economy.
00:25:36.600 | I think if you can substantially improve
00:25:39.080 | how quickly people find new discoveries
00:25:42.560 | or avoid kind of controlled trials that don't go anywhere,
00:25:47.560 | I think that's just huge amounts of money.
00:25:49.640 | And there are a lot of questions, obviously,
00:25:51.400 | about between here and there,
00:25:53.040 | but I think as long as the fundamental principle is there,
00:25:55.840 | we were okay with that.
00:25:57.360 | And I guess we found some investors who also were.
00:26:00.200 | - Yeah, yeah, congrats.
00:26:01.480 | I mean, I'm sure we can cover the sort of flip later.
00:26:05.680 | Yeah, I think you were about to start us on GPT-3
00:26:08.240 | and how that changed things for you.
00:26:10.400 | It's funny, I guess every major GPT version,
00:26:12.800 | you have some big insight.
00:26:14.320 | - I think it's a little bit less true for us than for others
00:26:18.240 | because we always believe that there will basically be
00:26:21.880 | human-level machine work.
00:26:24.280 | And so it is definitely true that in practice,
00:26:27.120 | for your product, as new models come out,
00:26:30.000 | your product starts working better,
00:26:31.320 | you can add some features that you couldn't add before.
00:26:33.760 | But I don't think we really ever had the moment
00:26:37.920 | where we were like, oh, wow, that is super unanticipated.
00:26:42.200 | We need to do something entirely different now
00:26:44.080 | from what was on the roadmap.
00:26:46.600 | - I think GPT-3 was a big change 'cause it kind of said,
00:26:50.420 | oh, now is the time that we can use AI to build these tools.
00:26:54.640 | And then GPT-4 was maybe a little bit more
00:26:56.720 | of an extension of GPT-3.
00:26:58.480 | It felt less like a level shift.
00:26:59.760 | GPT-3 over GPT-2 was like qualitative level shift.
00:27:02.960 | And then GPT-4 was like, okay, great.
00:27:05.000 | Now it's like, we're much more accurate
00:27:07.680 | on these things,
00:27:08.800 | we can answer harder questions.
00:27:10.040 | But the shape of the product had already taken place
00:27:12.080 | by that time.
00:27:13.280 | - I kind of want to ask you about this sort of pivot
00:27:15.120 | that you've made, but I guess that was just a way
00:27:17.720 | to sell what you were doing,
00:27:18.920 | which is you're adding extra features
00:27:20.880 | on grouping by concepts.
00:27:22.700 | - When GPT-4-- - The GPT-4 pivot,
00:27:24.640 | quote-unquote pivot that you--
00:27:25.620 | - Oh, yeah, yeah, exactly.
00:27:27.000 | Right, right, right, yeah, yeah.
00:27:28.400 | When we launched this workflow,
00:27:30.200 | now that GPT-4 was available,
00:27:32.960 | basically, Elicit was at a place where,
00:27:36.360 | given a table of papers,
00:27:38.200 | we have very tabular interfaces,
00:27:39.680 | so given a table of papers,
00:27:40.960 | you can extract data across all the tables.
00:27:43.640 | But that's still, you kind of want to take the analysis
00:27:47.600 | a step further.
00:27:49.080 | And sometimes what you'd care about
00:27:50.520 | is not having a list of papers,
00:27:52.040 | but a list of arguments, a list of effects,
00:27:55.200 | a list of interventions, a list of techniques.
00:27:57.240 | And so that's one of the things we're working on
00:28:00.680 | is now that you've extracted this information
00:28:02.840 | in a more structured way,
00:28:03.720 | can you pivot it or group by whatever information
00:28:06.960 | that you extracted to have more insight-first information
00:28:11.040 | still supported by the academic literature?
00:28:13.120 | - Yeah, that was a big revelation when I saw it.
00:28:14.800 | Yeah, basically, I think I'm very just impressed
00:28:18.120 | by how first principles,
00:28:20.280 | your ideas around what the workflow is.
00:28:23.520 | And I think that's why you're not as reliant
00:28:27.400 | on the LLM improving,
00:28:29.000 | because actually it's just about improving the workflow
00:28:31.160 | that you would recommend to people.
00:28:33.120 | Today, we might call it an agent, I don't know,
00:28:35.160 | but you're not reliant on the LLM to drive it.
00:28:39.080 | It's relying on your sort of,
00:28:40.760 | this is the way that Elicit does research,
00:28:43.480 | and this is what we think is most effective
00:28:45.920 | based on talking to our users.
00:28:47.360 | - Yep, that's right.
00:28:48.200 | Yeah, I think the problem space is still huge.
00:28:52.160 | If it's this big, we are all still operating
00:28:55.120 | at this tiny bit of it.
00:28:57.200 | So I think about this a lot in the context of moats.
00:29:00.440 | People are like, "Oh, what's your moat?
00:29:01.440 | "What happens if GPT-5 comes out?"
00:29:03.040 | It's like, if GPT-5 comes out,
00:29:04.440 | there's still all of this other space that we can go into.
00:29:07.000 | And so I think being really obsessed with the problem,
00:29:09.920 | which is very, very big, has helped us stay robust
00:29:13.120 | and just kind of directly incorporate model improvements
00:29:15.440 | and then keep going.
00:29:16.280 | - And then I first encountered you guys with Charlie.
00:29:19.840 | You can tell us about that project.
00:29:22.000 | Basically, how much did cost become a concern
00:29:26.040 | as you're working more and more with OpenAI?
00:29:28.760 | How do you manage that relationship?
00:29:30.240 | - Let me talk about who Charlie is.
00:29:31.440 | - All right. - Sure, sure.
00:29:32.280 | - You can talk about the tech,
00:29:33.100 | 'cause Charlie is a special character.
00:29:34.840 | So Charlie, when we found him,
00:29:37.440 | had just finished his freshman year
00:29:39.000 | at the University of Warwick.
00:29:40.400 | I think he had heard about us on some discord,
00:29:42.560 | and then he applied, and we were like,
00:29:44.240 | "Wow, who is this freshman?"
00:29:45.520 | And then we just saw that he had done
00:29:46.580 | so many incredible side projects.
00:29:49.200 | And we were actually on a team retreat
00:29:51.040 | in Barcelona visiting our head of engineering at that time,
00:29:54.040 | and everyone was talking about this wonder kid.
00:29:56.000 | They're like, "This kid?"
00:29:56.840 | And then on our take-home project,
00:29:58.240 | he had done the best of anyone to that point.
00:30:02.280 | And so we were just so excited to hire him.
00:30:05.200 | So we hired him as an intern, and then we're like,
00:30:06.780 | "Charlie, what if he just dropped out of school?"
00:30:09.640 | And so then we convinced him to take a year off,
00:30:11.840 | and he's just incredibly productive.
00:30:13.660 | And I think the thing you're referring to is,
00:30:15.840 | at the start of 2023,
00:30:17.240 | Anthropic launched their constitutional AI paper,
00:30:20.740 | and within a few days, I think four days,
00:30:23.080 | he had basically implemented that in production,
00:30:25.280 | and then we had it in app a week or so after that.
00:30:28.920 | And he has since contributed to major improvements,
00:30:31.840 | like cutting costs down to a tenth of what they were.
00:30:36.000 | It's really large-scale,
00:30:36.920 | but yeah, you can talk about the technical stuff.
00:30:40.000 | - Yeah, on the constitutional AI project,
00:30:41.840 | this was for abstract summarization,
00:30:44.400 | where in Elicit, if you run a query,
00:30:48.800 | it'll return papers to you,
00:30:50.160 | and then it will summarize each paper
00:30:51.880 | with respect to your query for you on the fly.
00:30:54.640 | And that's a really important part of Elicit,
00:30:57.200 | because Elicit does it so much.
00:30:59.520 | If you run a few searches,
00:31:01.320 | it'll have done it a few hundred times for you.
00:31:03.560 | And so we cared a lot about this both being fast, cheap,
00:31:07.720 | and also very low on hallucination.
00:31:10.760 | I think if Elicit hallucinates something
00:31:12.720 | about the abstract, that's really not good.
00:31:15.040 | And so what Charlie did in that project
00:31:17.420 | was create a constitution that expressed
00:31:20.680 | what are the attributes of a good summary.
00:31:23.000 | It's like everything in the summary is reflected
00:31:26.440 | in the actual abstract,
00:31:28.920 | and it's very concise, et cetera, et cetera.
00:31:32.080 | And then used RLHF with a model
00:31:37.000 | that was trained on the constitution
00:31:39.120 | to basically fine-tune a better summarizer.
00:31:44.120 | - And an open-source model, I think.
00:31:46.280 | - On an open-source model, yeah.
00:31:48.080 | I think that might still be in use.
00:31:51.080 | - Yeah, yeah, definitely.
00:31:52.080 | Yeah, I think at the time,
00:31:53.400 | the models hadn't been trained at all
00:31:55.320 | to be faithful to a text.
00:31:57.360 | So they were just generating.
00:31:58.440 | So then when you ask them a question,
00:32:00.240 | they tried too hard to answer the question
00:32:03.200 | and didn't try hard enough to answer the question
00:32:05.800 | given the text or answer what the text said
00:32:07.780 | about the question.
00:32:08.620 | So we had to basically teach the models
00:32:10.020 | to do that specific task.
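As a rough illustration of how a written constitution can be used to improve summaries, here is a sketch of a critique-and-revise loop in that spirit. The principles, prompts, and `llm` stub are placeholders; the work described above used RLHF against a constitution to fine-tune an open-source summarizer, so treat this only as a conceptual sketch, not the actual implementation.

```python
# Sketch of a constitution-guided critique-and-revise loop for query-focused
# abstract summarization, in the spirit of Constitutional AI. The principles
# and prompts are illustrative; `llm` is a stub for whatever model you call.
CONSTITUTION = [
    "Every claim in the summary must be supported by the abstract.",
    "The summary must address the user's query, not a different question.",
    "The summary must be a single concise sentence.",
]

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in a real model call here")

def summarize_with_revisions(query: str, abstract: str, rounds: int = 2) -> str:
    summary = llm(
        f"Query: {query}\nAbstract: {abstract}\n"
        "Summarize the abstract with respect to the query."
    )
    for _ in range(rounds):
        critique = llm(
            "Critique the summary against each principle:\n"
            + "\n".join(f"- {p}" for p in CONSTITUTION)
            + f"\nAbstract: {abstract}\nSummary: {summary}"
        )
        summary = llm(
            "Revise the summary to address the critique, using only facts "
            f"from the abstract.\nCritique: {critique}\n"
            f"Abstract: {abstract}\nSummary: {summary}"
        )
    # Revised summaries like these can then serve as training targets for a
    # smaller, cheaper summarizer.
    return summary
```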
00:32:11.680 | - How do you monitor the ongoing performance of your models?
00:32:16.680 | Not to get too LLM-opsy,
00:32:18.680 | but you are one of the larger,
00:32:22.480 | more well-known operations doing NLP at scale.
00:32:25.280 | I guess, effectively, you have to monitor these things,
00:32:29.100 | and nobody has a good answer that I can talk to.
00:32:31.920 | - Yeah, I don't think we have a good answer yet.
00:32:33.760 | (all laughing)
00:32:35.080 | I think the answers are actually a little bit clearer
00:32:36.800 | on the just kind of basic robustness side,
00:32:40.240 | so I think where you can import ideas
00:32:42.840 | from normal software engineering
00:32:45.720 | and normal kind of DevOps.
00:32:47.640 | You're like, well, you need to monitor
00:32:49.160 | kind of latencies and response times and uptime and whatnot.
00:32:52.000 | - I think when we say performance,
00:32:53.120 | it's more about hallucination rate.
00:32:54.920 | - And then things like hallucination rate
00:32:57.040 | where I think there the really important thing
00:32:59.880 | is training time.
00:33:02.360 | So we care a lot about having our own internal benchmarks
00:33:07.680 | for model development that reflect the distribution
00:33:11.920 | of user queries so that we can know ahead of time
00:33:15.480 | how well is the model gonna perform
00:33:17.460 | on different types of tasks,
00:33:18.600 | so the tasks being summarization, question answering,
00:33:21.800 | given a paper, ranking.
00:33:23.800 | And for each of those, we wanna know
00:33:25.400 | what's the distribution of things the model is gonna see
00:33:28.360 | so that we can have well-calibrated predictions
00:33:32.560 | on how well the model's gonna do in production.
00:33:34.640 | And I think, yeah, there's some chance
00:33:36.160 | that there's distribution shift and actually the things
00:33:38.520 | users enter are gonna be different,
00:33:40.680 | but I think that's much less important
00:33:42.560 | than getting the kind of training right
00:33:44.520 | and having very high-quality,
00:33:46.560 | well-vetted data sets at training time.
00:33:49.000 | - I think we also end up effectively monitoring
00:33:51.260 | by trying to evaluate new models as they come out.
00:33:53.500 | And so that kind of prompts us to go through
00:33:56.380 | our eval suite every couple of months.
00:33:58.080 | And then, yeah, and so every time a new model comes out,
00:34:01.080 | we have to see like, okay, which one is,
00:34:03.240 | how is this performing relative to production
00:34:04.920 | and what we currently have?
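A minimal sketch of the kind of evaluation harness implied here: run each candidate model over a vetted benchmark that mirrors the production task mix, and compare per-task scores before promoting anything. The exact-match metric and the `Model` type are stand-ins; real metrics would include things like hallucination rate and ranking quality.

```python
# Sketch of a per-task eval harness for comparing candidate models against a
# fixed, well-vetted benchmark. Exact match is a placeholder metric.
from typing import Callable, Dict, List

Example = Dict[str, str]      # {"task": ..., "input": ..., "reference": ...}
Model = Callable[[str], str]  # text in, text out

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())

def evaluate(model: Model, benchmark: List[Example]) -> Dict[str, float]:
    per_task: Dict[str, List[float]] = {}
    for ex in benchmark:
        score = exact_match(model(ex["input"]), ex["reference"])
        per_task.setdefault(ex["task"], []).append(score)
    return {task: sum(scores) / len(scores) for task, scores in per_task.items()}

# Usage (hypothetical models): compare production vs. a new candidate.
# results = {name: evaluate(m, benchmark)
#            for name, m in {"production": prod_model, "candidate": new_model}.items()}
```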
00:34:06.440 | - Yeah, I mean, since we're on this topic,
00:34:08.800 | any new models have really caught your eye this year?
00:34:11.280 | Like, Cloud came out with a bunch.
00:34:12.120 | - Cloud, yeah, I think Cloud is pretty,
00:34:13.680 | I think the team's pretty excited about Cloud.
00:34:15.720 | - Yeah, specifically, I think Cloud Haiku
00:34:18.920 | is like a good point on the kind of Pareto frontier.
00:34:22.600 | So I think it's like, it's not the,
00:34:24.840 | it's neither the cheapest model,
00:34:26.160 | nor is it the most accurate, most high-quality model,
00:34:30.680 | but it's just like a really good trade-off
00:34:32.400 | between cost and accuracy.
00:34:34.800 | You apparently have to 10-shot it to make it good.
00:34:37.920 | I tried using Haiku for summarization,
00:34:40.080 | but zero-shot was not great.
00:34:42.800 | And yeah, then they were like, you know,
00:34:45.240 | it's a skill issue, you have to try harder.
00:34:47.440 | - Interesting.
00:34:48.280 | - Yeah, we also used, I think, GPT-4 unlocked process.
00:34:51.760 | - Turbo?
00:34:53.320 | - Yeah, yeah, it unlocked tables for us,
00:34:58.160 | processing data from tables, which was huge.
00:35:00.200 | - GPT-4 Vision.
00:35:01.040 | - Yeah.
00:35:02.040 | - Yeah, did you try like Fuyu?
00:35:03.360 | I guess you can't try Fuyu, 'cause it's non-commercial.
00:35:06.400 | That's the adept model.
00:35:07.600 | - Yeah, we haven't tried that one.
00:35:08.640 | - Yeah, yeah, yeah.
00:35:09.560 | But Claude is multimodal as well.
00:35:11.240 | - Yeah.
00:35:12.080 | - I think the interesting insight that we got
00:35:13.880 | from talking to David Luan, who is CEO of Adept,
00:35:16.560 | was that multimodality has effectively
00:35:20.120 | two different flavors.
00:35:20.960 | Like one is the, we recognize images from a camera
00:35:24.220 | in the outside natural world.
00:35:26.280 | And actually, the more important multimodality
00:35:28.980 | for knowledge work is screenshots.
00:35:31.200 | And, you know, PDFs and charts and graphs.
00:35:34.240 | - Yeah, yeah, mm-hmm.
00:35:35.760 | - So we need a new term for that kind of multimodality.
00:35:38.240 | - Yeah.
00:35:39.080 | - But is the claim that current models
00:35:40.680 | are good at one or the other?
00:35:42.440 | - No, they're over-indexed, 'cause the history
00:35:44.200 | of computer vision is COCO, right?
00:35:46.760 | So now we're like, oh, actually, you know,
00:35:49.240 | screens are more important.
00:35:50.640 | - Yeah, processing weird handwriting and stuff.
00:35:52.120 | - OCR, yeah, handwriting, yeah.
00:35:54.120 | You mentioned a lot of like closed model lab stuff,
00:35:57.840 | and then you also have like this open source model
00:36:00.720 | fine-tuning stuff.
00:36:01.560 | Like what is your workload now between closed and open?
00:36:04.200 | - It's a good question.
00:36:05.040 | - Is it half and half?
00:36:05.880 | - I think it's--
00:36:07.360 | - Is that even a relevant question,
00:36:08.740 | or is this a nonsensical question?
00:36:10.760 | - It depends a little bit on like how you index,
00:36:12.600 | whether you index by like computer cost
00:36:14.540 | or number of queries.
00:36:15.960 | I'd say like in terms of number of queries,
00:36:18.520 | it's maybe similar.
00:36:19.440 | In terms of like cost and compute,
00:36:21.360 | I think the closed models make up more of the budget,
00:36:24.800 | since the main cases where you want to use closed models
00:36:28.320 | are cases where they're just smarter,
00:36:31.820 | where no existing open source models are quite smart enough.
00:36:36.240 | - We have a lot of interesting technical questions
00:36:38.520 | to go in, but just to wrap the kind of like UX evolution,
00:36:42.680 | now you have the notebooks.
00:36:44.320 | We talked a lot about how chatbots
00:36:46.720 | are not the final frontier, you know?
00:36:50.020 | How did you decide to get into notebooks,
00:36:52.560 | which is a very iterative, kind of like interactive
00:36:55.160 | interface, and yeah, maybe learnings from that?
00:36:57.720 | - Yeah, this is actually our fourth time
00:37:00.000 | trying to make this work.
00:37:01.840 | I think the first time was probably in early 2021.
00:37:06.160 | At the time we built something,
00:37:07.480 | I think because we've always been obsessed
00:37:09.600 | with this idea of task decomposition and like branching,
00:37:13.200 | we always wanted a way, a tool that could be kind of
00:37:17.200 | unbounded where you could keep going,
00:37:19.600 | where you could do a lot of branching,
00:37:20.780 | where you could kind of apply language model operations
00:37:23.980 | or computations on other tasks.
00:37:26.080 | So in 2021, we had this thing called composite tasks
00:37:28.840 | where you could use GPT-3 to brainstorm
00:37:31.180 | a bunch of research questions,
00:37:32.560 | and then take each research question
00:37:34.240 | and decompose those further into sub questions.
00:37:37.320 | And this kind of, again, that like task decomposition
00:37:40.200 | tree type thing was always very exciting to us,
00:37:43.440 | but that was like, it didn't work
00:37:44.600 | and it was kind of overwhelming.
00:37:46.840 | Then at the end of '22, I think we tried again.
00:37:50.080 | And at that point we were thinking,
00:37:51.280 | okay, we've done a lot with this literature review thing.
00:37:53.720 | We also want to start helping with kind of adjacent domains
00:37:56.360 | and different workflows.
00:37:57.480 | Like we want to help more with machine learning.
00:37:59.500 | What does that look like?
00:38:00.640 | And as we were thinking about it, we're like, well,
00:38:02.560 | there are so many research workflows.
00:38:04.280 | Like how do we not just build kind of three new workflows
00:38:07.760 | into Elicit, but make Elicit really generic
00:38:10.200 | to lots of workflows?
00:38:11.120 | What is like a generic composable system
00:38:13.640 | with nice abstractions that can like scale
00:38:16.060 | to all these workflows?
00:38:17.640 | So we like iterated on that a bunch
00:38:19.320 | and then didn't quite narrow the problem space enough
00:38:22.440 | or like quite get to what we wanted.
00:38:25.200 | And then I think it was at the beginning of 2023,
00:38:28.600 | where we're like, wow, computational notebooks
00:38:30.440 | kind of enable this, where they have a lot of flexibility,
00:38:34.040 | but kind of robust primitives,
00:38:35.720 | such that you can extend the workflow.
00:38:38.320 | And it's not limited.
00:38:39.580 | It's not like you ask a query, you get an answer,
00:38:41.260 | you're done.
00:38:42.100 | You can just constantly keep building on top of that.
00:38:44.600 | And each little step seems like a really good
00:38:47.240 | kind of unit of work for the language model.
00:38:50.240 | So that's, and also there was just like really helpful
00:38:52.960 | to have a bit more kind of pre-existing work to emulate.
00:38:57.960 | So that was, yeah, that's kind of how we ended up
00:39:00.480 | at computational notebooks for Elicit.
00:39:03.000 | - Maybe one thing that's worth making explicit
00:39:05.600 | is the difference between computational notebooks and chat,
00:39:08.120 | because on the surface, they seem pretty similar.
00:39:10.040 | It's kind of this iterative interaction
00:39:11.560 | where you add stuff and it's almost like in both cases,
00:39:15.640 | you have a back and forth between you enter stuff
00:39:17.560 | and then you get some output and then you enter stuff.
00:39:20.140 | But the important difference in our minds is
00:39:23.240 | with notebooks, you can define a process.
00:39:26.160 | So in data science, you can be like,
00:39:28.920 | here's like my data analysis process that takes in a CSV
00:39:31.520 | and then does some extraction
00:39:32.720 | and then generates a figure at the end.
00:39:34.960 | And you can prototype it using a small CSV
00:39:37.680 | and then you can run it over a much larger CSV later.
00:39:40.560 | And similarly, the vision for notebooks in our case
00:39:43.920 | is to not make it this like one-off chat interaction,
00:39:47.060 | but to allow you to then say kind of,
00:39:50.400 | if you start and first you're like,
00:39:52.680 | okay, let me just analyze a few papers
00:39:54.440 | and see do I get to the correct like conclusions
00:39:57.560 | for those few papers?
00:39:59.160 | Can I then later go back and say,
00:40:00.640 | now let me run this over 10,000 papers
00:40:04.440 | now that I've debugged the process using a few papers?
00:40:07.560 | And that's an interaction that doesn't fit quite as well
00:40:10.200 | into the chat framework,
00:40:11.320 | because that's more for kind of quick
00:40:13.400 | back and forth interaction.
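A sketch of the "prototype on a few papers, then rerun the same process at scale" pattern described above. This is hypothetical code, not Elicit's implementation; the point is that a process is an ordered list of steps you can debug on a small input and later apply to a much larger one:

```python
# Sketch of the notebook-as-process idea: each step is a small, checkable
# unit of work; the same list of steps runs over 3 papers or 10,000.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    fn: Callable[[dict], dict]  # takes one paper record, returns an enriched record

def run_process(steps: list[Step], papers: list[dict]) -> list[dict]:
    results = []
    for paper in papers:
        record = dict(paper)
        for step in steps:
            record = step.fn(record)  # each step can be inspected and debugged on its own
        results.append(record)
    return results

# Placeholder steps; in practice each would call a model or a search index.
extract_sample_size = Step("sample_size", lambda r: {**r, "sample_size": "n=42 (stub)"})
summarize = Step("summary", lambda r: {**r, "summary": f"Summary of {r['title']} (stub)"})

process = [extract_sample_size, summarize]
pilot = run_process(process, [{"title": "Pilot paper"}])  # debug on a few papers first
# full = run_process(process, ten_thousand_papers)        # then rerun the same process at scale
```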
00:40:15.560 | - Do you think of notebooks as kind of like structured,
00:40:19.020 | editable chain of thought, basically step by step,
00:40:22.060 | like is that kind of where you see this going
00:40:24.500 | and then are people gonna reuse notebooks as like templates
00:40:28.180 | and maybe in traditional notebooks,
00:40:29.700 | it's like cookbooks, right?
00:40:30.780 | You share a cookbook, you can start from there.
00:40:33.220 | Is that similar in Elicit?
00:40:35.180 | - Yeah, that's exactly right.
00:40:36.500 | So that's our hope that people will build templates,
00:40:39.060 | share them with other people.
00:40:40.760 | I think chain of thought is maybe still like kind of
00:40:43.780 | one level lower on the abstraction hierarchy
00:40:46.780 | than we would think of notebooks.
00:40:48.460 | I think we'll probably want to think about
00:40:50.060 | more semantic pieces like a building block
00:40:52.660 | is more like a paper search or an extraction
00:40:56.480 | or a list of concepts.
00:40:59.660 | And then the model's detailed reasoning
00:41:03.420 | will probably often be one level down.
00:41:05.380 | You always want to be able to see it,
00:41:06.740 | but you don't always want it to be front and center.
00:41:09.500 | - Yeah, what's the difference between a notebook
00:41:11.540 | and an agent?
00:41:12.420 | Since everybody always asks me, what's an agent?
00:41:14.460 | Like, how do you think about where the line is?
00:41:17.020 | - In the notebook world,
00:41:18.220 | I would generally think of the human as the agent
00:41:21.380 | in the first iteration.
00:41:22.260 | So you have the notebook and the human kind of adds
00:41:24.460 | little action steps.
00:41:25.780 | And then the next point on this kind of progress gradient is,
00:41:30.780 | okay, now you can use language models to predict
00:41:33.060 | which action would you take as a human.
00:41:35.020 | And at some point,
00:41:35.840 | you're probably gonna be very good at this.
00:41:36.900 | You'll be like, okay, I can like,
00:41:38.020 | in some cases I can with 100%, 99.9% accuracy
00:41:41.460 | predict what you do.
00:41:42.700 | And then you might as well just execute it.
00:41:44.260 | Like, why wait for the human?
00:41:46.060 | And eventually, as you get better at this,
00:41:48.260 | that will just look more and more like agents taking actions
00:41:52.420 | as opposed to you doing the thing.
00:41:54.440 | And I think templates are a specific case of this
00:41:59.580 | where you're like, okay, well,
00:42:00.420 | there's just particular sequences of actions
00:42:02.820 | that you often wanna chunk and have available as primitives,
00:42:06.500 | just like in normal programming.
00:42:08.220 | And those are, you can view them as action sequences
00:42:11.980 | of agents or you can view them as more
00:42:14.140 | like the normal programming language abstraction thing.
00:42:17.540 | And I think those are two valid views.
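One way to picture that progression from human-as-agent to agent-taking-actions is a simple confidence gate: the model proposes the next notebook step, and it only auto-executes once its predictions are reliable enough. The sketch below is hypothetical, with a stubbed proposal function and a made-up threshold:

```python
# Sketch of the "human as the agent, model gradually takes over" gradient.
AUTO_EXECUTE_THRESHOLD = 0.99  # illustrative; a real system would tune this

def propose_next_step(history: list[str]) -> tuple[str, float]:
    # Placeholder: a real system would ask a model to predict the user's next
    # action given the notebook history, with a calibrated confidence score.
    return ("extract sample sizes from the current paper list", 0.62)

def next_action(history: list[str]) -> str:
    step, confidence = propose_next_step(history)
    if confidence >= AUTO_EXECUTE_THRESHOLD:
        return f"auto-executing: {step}"            # agent-like behavior
    return f"suggesting (needs confirmation): {step}"  # human stays in the loop

print(next_action(["searched for papers on creatine", "added a summary column"]))
```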
00:42:20.840 | - Yeah, how do you see this changes?
00:42:22.980 | Like you said, the models get better
00:42:24.300 | and you need less and less human actual interfacing
00:42:27.660 | with the model, you just get the results.
00:42:29.220 | Like how does the UX and the way people perceive it change?
00:42:34.820 | - Yeah, I think this kind of interaction paradigm
00:42:38.320 | for evaluation is not really something
00:42:40.060 | the internet has encountered yet,
00:42:41.420 | because right now, up to now,
00:42:42.560 | the internet has all been about like getting data
00:42:45.020 | and work from people.
00:42:46.680 | But so increasingly, yeah, I really want kind of evaluation
00:42:51.700 | both from an interface perspective
00:42:53.460 | and from like a technical perspective
00:42:55.340 | and operation perspective to be a power,
00:42:57.180 | superpower for Elicit, 'cause I think over time,
00:42:59.180 | models will do more and more of the work
00:43:01.000 | and people will have to do more and more of the evaluation.
00:43:03.980 | So I think, yeah, in terms of the interface,
00:43:06.140 | some of the things we have today are,
00:43:08.380 | for every kind of language model generation,
00:43:10.140 | there's some citation back and we kind of directly,
00:43:13.020 | we try to highlight the ground truth in the paper
00:43:16.940 | that is most relevant to whatever Elicit said
00:43:19.260 | and make it super easy so that you can click on it
00:43:21.020 | and quickly see in context and validate
00:43:24.180 | whether the text actually supports
00:43:25.740 | the answer that Elicit gave.
00:43:27.300 | So I think we'd probably want to scale things up like that,
00:43:30.420 | like the ability to kind of spot check
00:43:32.780 | the model's work super quickly,
00:43:34.300 | scale up interfaces like that and--
00:43:37.100 | - Who would spot check, the user?
00:43:39.140 | - Yeah, to start it would be the user.
00:43:41.220 | One of the other things we do is also kind of flag
00:43:43.460 | the model's uncertainty.
00:43:44.940 | So we have models report out, how confident are you
00:43:47.540 | that this was the sample size of this study?
00:43:50.080 | The model's not sure, we throw a flag
00:43:51.780 | and so the user knows to prioritize checking that.
00:43:54.500 | So again, we can kind of scale that up.
00:43:56.500 | So when the model's like, well,
00:43:57.940 | I went and searched for Google,
00:43:59.300 | I searched this on Google,
00:44:00.460 | I'm not sure if that was the right thing,
00:44:01.740 | we have an uncertainty flag and the user can go
00:44:03.580 | and be like, oh, okay,
00:44:04.420 | that was actually the right thing to do or not.
00:44:07.380 | - So I've tried to do uncertainty readings from models.
00:44:11.260 | I don't know if you have this live, you do, okay.
00:44:14.620 | 'Cause I just didn't find them reliable
00:44:16.260 | 'cause they just hallucinated their own uncertainty.
00:44:18.580 | I would love to base it on log probs
00:44:21.780 | or something more native within the model
00:44:23.940 | rather than generated.
00:44:25.440 | But okay, it sounds like they scale properly for you.
00:44:28.940 | - Yeah, we found it to be pretty calibrated.
00:44:30.820 | It varies by model.
00:44:32.380 | - Okay, yeah, I think in some cases,
00:44:34.080 | we also used two different models
00:44:35.420 | for the uncertainty estimates
00:44:36.740 | than for the question answering.
00:44:38.260 | So one model would say, here's my chain of thought,
00:44:40.820 | here's my answer, and then a different type of model.
00:44:43.180 | Let's say the first model is Llama
00:44:45.700 | and let's say the second model is GPT-3.5,
00:44:48.100 | could be different.
00:44:49.140 | And then the second model just looks over the results
00:44:54.060 | and like, okay, how confident are you in this?
00:44:56.980 | And I think sometimes using a different model
00:44:59.540 | can be better than using the same model.
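A minimal sketch of that two-model pattern, with both calls stubbed out; the prompts, threshold, and flag are illustrative, not Elicit's internals:

```python
# Sketch: one model answers with its reasoning, a *different* model grades
# confidence, and low-confidence extractions are flagged for user spot-checking.

def answer_model(question: str, excerpt: str) -> dict:
    # Placeholder for the model that does the extraction and shows its reasoning.
    return {"reasoning": "The methods section reports 120 participants.",
            "answer": "120"}

def verifier_model(question: str, excerpt: str, answer: str) -> float:
    # Placeholder for a second model that only judges confidence, on a 0-1 scale.
    return 0.55

def extract_with_flag(question: str, excerpt: str, threshold: float = 0.8) -> dict:
    result = answer_model(question, excerpt)
    confidence = verifier_model(question, excerpt, result["answer"])
    result["confidence"] = confidence
    result["needs_review"] = confidence < threshold  # surfaced as a flag in the UI
    return result

print(extract_with_flag("What was the sample size?", "…excerpt from the paper…"))
```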
00:45:01.540 | - Yeah, on the topic of models, evaluating models,
00:45:04.980 | obviously you can do that all day long.
00:45:07.100 | Like what's your budget?
00:45:08.300 | Because your queries fan out a lot
00:45:13.820 | and then you have models evaluating models.
00:45:16.780 | One person typing in a question
00:45:18.740 | can lead to a thousand calls.
00:45:21.340 | - It depends on the project.
00:45:23.860 | So if the project is basically a systematic review
00:45:29.860 | that otherwise human research assistants would do,
00:45:32.100 | then the project is basically a human equivalent spend
00:45:35.540 | and the spend can get quite large for those projects.
00:45:38.740 | Certainly, I don't know, let's say $100,000.
00:45:42.980 | - For the project, yeah.
00:45:43.820 | - Yeah.
00:45:45.020 | So in those cases, you're happier to spend compute
00:45:48.760 | than in the kind of shallow search case
00:45:51.020 | where someone just enters a question because,
00:45:53.180 | I don't know, maybe--
00:45:54.260 | - Feel like it.
00:45:55.100 | - I heard about creatine, what's it about?
00:45:57.380 | Probably don't want to spend a lot of compute on that.
00:46:00.380 | And this sort of being able to invest more or less compute
00:46:05.220 | into getting more or less accurate answers
00:46:07.060 | is I think one of the core things we care about
00:46:09.540 | and that I think is currently undervalued in the AI space.
00:46:12.900 | I think currently, you can choose which model you want
00:46:15.780 | and you can sometimes tell it, I don't know,
00:46:18.380 | you'll tip it and it'll try harder
00:46:21.140 | or you can try various things to get it to work harder.
00:46:24.180 | But you don't have great ways of converting
00:46:27.340 | willingness to spend into better answers
00:46:29.420 | and we really want to build a product
00:46:30.820 | that has this sort of unbounded flavor
00:46:32.980 | where I mean, as much as you care about,
00:46:35.500 | if you care about it a lot,
00:46:36.580 | you should be able to get really high quality answers,
00:46:39.860 | really double checked in every way.
00:46:41.900 | - Yeah.
00:46:42.740 | - And you have a credits-based pricing.
00:46:44.980 | So unlike most products, it's not a fixed monthly fee.
00:46:47.380 | - Right, exactly.
00:46:48.420 | So some of the higher-cost features are tiered.
00:46:51.820 | So for most casual users,
00:46:54.380 | they'll just get the abstract summary,
00:46:55.740 | which is kind of an open source model.
00:46:58.180 | Then you can add more columns which have more extractions
00:47:01.340 | and these uncertainty features
00:47:02.540 | and then you can also add those same columns
00:47:04.060 | in high accuracy mode, which also parses the table.
00:47:06.660 | So we kind of stack the complexity on the cost.
00:47:09.900 | - You know the fun thing you can do with a credit system,
00:47:12.020 | which is data for data or I don't know what I mean by that.
00:47:16.700 | Basically, you can give people more credits
00:47:18.300 | if they give data back to you.
00:47:20.500 | I don't know if you've already done that.
00:47:21.500 | - I've thought about,
00:47:22.340 | we've thought about something like this.
00:47:23.460 | It's like, if you don't have money, but you have time,
00:47:26.540 | how do you exchange that?
00:47:28.460 | - It's a fair trade.
00:47:29.300 | - Yeah, I think it's interesting.
00:47:30.380 | We haven't quite operationalized it
00:47:31.700 | and then there's been some kind of adverse selection.
00:47:35.100 | For example, it would be really valuable
00:47:36.340 | to get feedback on our models.
00:47:37.620 | So maybe if you were willing to give more robust feedback
00:47:40.580 | on our results, we could give you credits
00:47:42.180 | or something like that.
00:47:43.020 | But then there's kind of this,
00:47:44.380 | will people take it seriously?
00:47:45.580 | - Yeah, you want the good people.
00:47:46.420 | - Exactly.
00:47:47.300 | - Can you tell who are the good people?
00:47:49.340 | - Not right now, but yeah,
00:47:50.180 | maybe at the point where we can,
00:47:51.340 | we can offer it.
00:47:52.180 | We can offer it up to them.
00:47:53.020 | - The perplexity of questions asked,
00:47:55.380 | if it's higher perplexity,
00:47:56.220 | these are smarter people.
00:47:57.060 | - Yeah, maybe.
00:47:58.300 | - If you make a lot of typos in your queries,
00:48:00.260 | you're not gonna get the on-the-house exchange.
00:48:02.340 | (all laughing)
00:48:04.380 | - Negative social credit.
00:48:05.980 | It's very topical right now to think about
00:48:08.100 | the threat of long context windows.
00:48:10.980 | All these models that we're talking about these days,
00:48:12.980 | all like a million token plus.
00:48:14.820 | Is that relevant for you?
00:48:16.740 | Can you make use of that?
00:48:17.660 | Is that just prohibitively expensive
00:48:19.660 | 'cause you're just paying for all those tokens
00:48:21.620 | or you're just doing rag?
00:48:22.820 | - It's definitely relevant.
00:48:23.860 | And when we think about search,
00:48:26.460 | I think as many people do,
00:48:27.820 | we think about kind of a staged pipeline of retrieval
00:48:30.980 | where first you use a kind of semantic search database
00:48:35.500 | with embeddings,
00:48:36.340 | get like the, in our case,
00:48:37.740 | maybe 400 or so most relevant papers.
00:48:40.300 | And then you still need to rank those.
00:48:42.180 | And I think at that point,
00:48:43.820 | it becomes pretty interesting to use larger models.
00:48:47.500 | So specifically in the past,
00:48:50.100 | I think a lot of ranking was kind of per item ranking
00:48:53.340 | where you would score each individual item,
00:48:55.300 | maybe using increasingly expensive scoring methods
00:48:58.500 | and then rank based on the scores.
00:49:00.620 | But I think list-wise re-ranking
00:49:02.180 | where you have a model that can see all the elements
00:49:04.580 | is a lot more powerful
00:49:06.140 | because often you can only really tell how good a thing is
00:49:09.220 | in comparison to other things.
00:49:10.980 | And what things should come first,
00:49:13.140 | it really depends on like,
00:49:14.500 | well, what other things are available,
00:49:15.660 | maybe you even care about diversity in your results.
00:49:17.820 | You don't wanna show like 10 very similar papers
00:49:21.140 | as the first 10 results.
00:49:22.460 | So I think the long context models
00:49:24.780 | are quite interesting there.
00:49:26.820 | And especially for our case
00:49:28.580 | where we care more about power users
00:49:31.820 | who are perhaps a little bit more willing
00:49:33.740 | to wait a little bit longer
00:49:34.740 | to get higher quality results
00:49:36.060 | relative to people who just quickly check out things
00:49:40.220 | because why not?
00:49:41.820 | I think being able to spend more on longer contexts
00:49:44.540 | is quite valuable.
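A rough sketch of that staged pipeline: a cheap embedding search narrows 200 million papers down to a few hundred candidates, and then a single long-context call re-ranks the whole list at once. The embedding and ranking functions here are stubs under stated assumptions, not the real system's components:

```python
# Sketch of staged retrieval: embedding search first, list-wise re-ranking second.
import math
import random

def embed(text: str) -> list[float]:
    random.seed(hash(text) % (2**32))  # deterministic stub in place of a real embedding model
    return [random.random() for _ in range(8)]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query: str, papers: list[str], k: int = 400) -> list[str]:
    q = embed(query)
    return sorted(papers, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]

def listwise_rerank(query: str, candidates: list[str]) -> list[str]:
    # Placeholder for one long-context model call that sees *all* candidates
    # together, so it can trade off relevance against diversity across the list.
    return candidates  # a real model would return a reordered list

papers = [f"Paper {i} about creatine and cognition" for i in range(1000)]
query = "does creatine improve memory?"
top = listwise_rerank(query, retrieve(query, papers))
```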
00:49:46.340 | - Yeah, I think one thing
00:49:47.340 | the longer context models changed for us
00:49:49.460 | is maybe a focus from breaking down tasks
00:49:52.700 | to breaking down the evaluation.
00:49:54.860 | So before, if we wanted to answer a question
00:49:59.500 | from the full text of a paper,
00:50:01.380 | we had to figure out how to chunk it
00:50:02.940 | and like find the relevant chunk
00:50:04.300 | and then answer based on that chunk.
00:50:06.020 | And the nice thing was then,
00:50:07.260 | you know kind of which chunk the model used
00:50:09.100 | to answer the question.
00:50:09.940 | So if you want to help the user track it,
00:50:11.660 | yeah, you can be like,
00:50:12.580 | well, this was the chunk that the model got.
00:50:14.740 | And now if you put the whole text in the paper,
00:50:16.860 | you have to go back,
00:50:17.780 | you have to like kind of find the chunk
00:50:19.780 | like more retroactively basically.
00:50:21.620 | And so you need kind of like a different set of abilities
00:50:24.380 | and obviously like a different technology to figure out.
00:50:26.660 | You still want to point the user
00:50:28.660 | to the supporting quotes in the text,
00:50:30.300 | but then like the interaction is a little different.
00:50:33.060 | - You like scan through and find some ROUGE score.
00:50:35.500 | - Yeah.
00:50:36.340 | - Ceiling or floor.
00:50:38.500 | - Yeah, I think there's an interesting space
00:50:41.340 | of almost research problems here
00:50:44.060 | because you would ideally make causal claims.
00:50:46.300 | Like if this hadn't been in the text,
00:50:48.260 | the model wouldn't have said this thing.
00:50:49.940 | And maybe you can do expensive approximations to that
00:50:52.500 | where like, I don't know,
00:50:53.340 | you just throw a chunk off the paper
00:50:55.460 | and re-answer and see what happens.
00:50:57.300 | But hopefully there are better ways of doing that
00:51:00.500 | where you just get that kind of counterfactual information
00:51:05.140 | for free from the model.
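The "expensive approximation" mentioned here would look roughly like the ablation loop below: drop one chunk at a time, re-ask the question, and see whether the answer changes. The answering call is stubbed; in practice each ablation would be another model call:

```python
# Sketch of counterfactual attribution by chunk ablation (hypothetical code).

def answer(question: str, chunks: list[str]) -> str:
    # Placeholder: pretend the answer depends on whether the methods chunk is present.
    return "120" if any("120 participants" in c for c in chunks) else "not reported"

def attribute_by_ablation(question: str, chunks: list[str]) -> list[int]:
    """Return indices of chunks whose removal changes the answer."""
    baseline = answer(question, chunks)
    influential = []
    for i in range(len(chunks)):
        ablated = chunks[:i] + chunks[i + 1:]
        if answer(question, ablated) != baseline:
            influential.append(i)  # this chunk was (likely) load-bearing for the answer
    return influential

chunks = ["Abstract…", "Methods: we recruited 120 participants…", "Discussion…"]
print(attribute_by_ablation("What was the sample size?", chunks))
```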
00:51:06.700 | - Do you think at all about the cost of maintaining RAG
00:51:09.980 | versus just putting more tokens in the window?
00:51:12.860 | I think in software development,
00:51:14.300 | a lot of times people buy developer productivity things
00:51:17.740 | so that we don't have to worry about it.
00:51:19.940 | Context window is kind of the same, right?
00:51:21.340 | You have to maintain chunking and like RAG retrieval
00:51:24.140 | and like re-ranking and all of this
00:51:25.580 | versus I just shove everything into the context
00:51:28.380 | and like it costs a little more,
00:51:29.460 | but at least I don't have to do all of that.
00:51:31.740 | Is that something you thought about at all?
00:51:33.340 | - I think we still like hit up against context limits enough
00:51:38.060 | that like, it's not really,
00:51:40.660 | do we still want to keep this RAG around?
00:51:42.180 | It's like we do still need it
00:51:43.460 | for the scale of the work that we're doing, yeah.
00:51:45.580 | - And I think there are different kinds of maintainability.
00:51:48.300 | In one sense, I think you're right
00:51:50.140 | that the throw everything into the context window thing
00:51:53.140 | is easier to maintain
00:51:54.140 | because you just can swap out a model.
00:51:57.540 | In another sense, it's if things go wrong,
00:52:00.580 | it's harder to debug where like,
00:52:02.220 | if you know here's the process that we go through
00:52:04.940 | to go from 200 million papers to an answer
00:52:08.820 | and they're like little steps and you understand,
00:52:10.660 | okay, this is the step that finds the relevant paragraph
00:52:14.260 | or whatever it may be.
00:52:15.720 | You'll know which step breaks if the answers are bad
00:52:20.140 | whereas if it's just like a new model version came out
00:52:24.500 | and now it suddenly doesn't find your needle
00:52:26.580 | in a haystack anymore,
00:52:27.500 | then you're like, okay, what can you do?
00:52:29.740 | You're kind of at a loss.
00:52:31.660 | - Yeah.
00:52:32.700 | Let's talk a bit about, yeah, needle in a haystack
00:52:35.940 | and like maybe the opposite of it,
00:52:37.740 | which is like hard grounding.
00:52:39.340 | I don't know if that's like the best name
00:52:40.900 | to think about it,
00:52:41.740 | but I was using one of these chat-with-your-documents features
00:52:44.380 | and I put in the AMD MI300 specs
00:52:47.340 | and the new Blackwell chips from NVIDIA
00:52:51.100 | and I was asking questions
00:52:52.120 | and asked, does the AMD chip support NVLink?
00:52:56.280 | And the response was like, oh, it doesn't say in the specs.
00:52:59.620 | But if you ask GPT-4 without the docs,
00:53:02.020 | it would tell you no,
00:53:03.020 | because NVLink, it's an NVIDIA technology.
00:53:05.620 | - Those are NV.
00:53:06.540 | - Yeah, it just says in the thing.
00:53:09.380 | How do you think about that?
00:53:11.740 | Like having the context sometimes suppress the knowledge
00:53:14.860 | that the model has?
00:53:16.220 | - It really depends on the task
00:53:17.460 | because I think sometimes that is exactly what you want.
00:53:19.700 | So imagine you're a researcher,
00:53:21.540 | you're writing the background section of your paper
00:53:23.240 | and you're trying to describe what these other papers say.
00:53:26.980 | You really don't want extra information
00:53:28.580 | to be introduced there.
00:53:29.740 | In other cases where you're just trying
00:53:31.060 | to figure out the truth
00:53:31.940 | and you're giving the documents
00:53:33.740 | because you think they will help the model
00:53:36.340 | figure out what the truth is.
00:53:38.420 | I think you do want,
00:53:40.500 | if the model has a hunch
00:53:41.880 | that there might be something that's not in the papers,
00:53:44.620 | you do want to surface that.
00:53:46.100 | I think ideally,
00:53:46.940 | you still don't want the model to just tell you.
00:53:49.580 | I think probably the ideal thing looks a bit more
00:53:51.660 | like agent control where the model can issue a query
00:53:56.660 | that then is intended to surface documents
00:54:00.700 | that substantiate its hunch.
00:54:02.100 | So I would, that's maybe a reasonable middle ground
00:54:06.060 | between model just telling you
00:54:07.900 | and model being fully limited to the papers you give it.
00:54:10.880 | - Yeah, I would say it's,
00:54:11.800 | they're just kind of different tasks right now.
00:54:13.420 | And the tasks that Elicit is mostly focused on
00:54:15.500 | is what do these papers say?
00:54:17.660 | But there is another task,
00:54:18.980 | which is like, just give me the best possible answer.
00:54:21.660 | And that give me the best possible answer
00:54:23.340 | sometimes depends on what do these papers say,
00:54:25.420 | but it can also depend on other stuff
00:54:27.280 | that's not in the papers.
00:54:28.700 | So ideally we can do both
00:54:29.900 | and then kind of do this overall task for you
00:54:33.060 | more going forward.
00:54:34.220 | - All right, we've seen a lot of details,
00:54:37.220 | but just to zoom back out a little bit,
00:54:39.500 | what are maybe the most underrated features of Elicit?
00:54:43.900 | And what is one thing that maybe the users
00:54:46.340 | surprised you the most by using it?
00:54:48.260 | - I think the most powerful feature of Elicit
00:54:50.300 | is the ability to extract, add columns to this table,
00:54:54.780 | which effectively extracts data
00:54:56.380 | from all of your papers at once.
00:54:58.260 | It's still, it's well used,
00:54:59.780 | but there are kind of many different extensions of that
00:55:02.620 | that I think users are still discovering.
00:55:04.260 | So one is we let you give a description of the column.
00:55:07.900 | We let you give instructions for a column.
00:55:10.260 | We let you create custom columns.
00:55:11.740 | So we have like 30 plus predefined fields
00:55:14.280 | that users can extract.
00:55:15.820 | Like what were the methods?
00:55:16.940 | What were the main findings?
00:55:18.060 | How many people were studied?
00:55:20.420 | And then, and we can,
00:55:21.880 | we actually show you basically the prompts
00:55:23.820 | that we're using to extract that from our predefined fields.
00:55:26.460 | And then you can fork this and you can say,
00:55:28.620 | oh, actually I don't care about the population of people.
00:55:30.980 | I only care about the population of rats.
00:55:32.780 | Like you can change the instruction.
00:55:34.280 | So I think users are still kind of discovering
00:55:37.000 | that there's both this predefined, easy to use default,
00:55:41.260 | but that they can extend it to be much more specific to them.
00:55:44.220 | And then they can also ask custom questions.
00:55:46.460 | One use case of that is you can,
00:55:48.300 | you can start to create different column types
00:55:50.220 | that you might not expect.
00:55:51.220 | So rather than just creating generative answers,
00:55:53.980 | like a description of the methodology,
00:55:55.680 | you can say classify the methodology
00:55:58.060 | into a prospective study, a retrospective study,
00:56:01.340 | or a case study.
00:56:02.660 | And then you can filter based on that.
00:56:04.420 | It's like all using the same kind of technology
00:56:06.300 | and the interface, but it unlocks different workflows.
00:56:09.780 | So I think that like the ability to ask custom questions,
00:56:12.820 | give instructions and specifically use that
00:56:14.940 | to create different types of columns,
00:56:17.540 | like classification columns is still pretty underrated.
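A sketch of the difference between a generative extraction column and a classification column: the mechanism is the same (one prompt per paper), but the classification prompt constrains the output to a fixed label set you can then filter on. The prompts, labels, and model call below are illustrative assumptions, not Elicit's own:

```python
# Hypothetical extraction column vs. classification column over a paper table.
ALLOWED_LABELS = {"prospective study", "retrospective study", "case study"}

def run_column(prompt: str, paper_text: str) -> str:
    # Placeholder for a per-paper model call.
    return "retrospective study"

extraction_prompt = "Describe the methodology of this paper in one sentence."
classification_prompt = (
    "Classify the methodology as exactly one of: "
    + ", ".join(sorted(ALLOWED_LABELS)) + ". Answer with the label only."
)

def classify(paper_text: str) -> str:
    label = run_column(classification_prompt, paper_text).strip().lower()
    return label if label in ALLOWED_LABELS else "unclear"  # guard against free-form output

papers = {"paper_1": "…full text…"}
table = {pid: {"methods": run_column(extraction_prompt, text),
               "study_type": classify(text)} for pid, text in papers.items()}
filtered = {pid: row for pid, row in table.items() if row["study_type"] == "retrospective study"}
```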
00:56:20.860 | In terms of use case,
00:56:22.980 | I spoke to someone who works in medical affairs
00:56:25.660 | at a genomic sequencing company recently.
00:56:28.340 | So they, you know, doctors kind of, you know,
00:56:32.140 | order these genomic tests,
00:56:34.320 | the sequencing tests to kind of identify
00:56:36.460 | if a patient has a particular disease,
00:56:38.340 | this company helps them process it.
00:56:40.260 | And this person basically interacts with all the doctors.
00:56:43.380 | And if the doctors have any questions,
00:56:44.700 | my understanding is that medical affairs
00:56:46.220 | is kind of like customer support
00:56:47.820 | or customer success in pharma.
00:56:50.440 | So this person like talks to doctors all day long.
00:56:52.840 | And one of the things they started using Elicit for
00:56:56.040 | is like putting the results of their tests as the query.
00:56:59.900 | Like this test showed, you know, this percentage,
00:57:02.540 | you know, presence of this and 40% that and whatever.
00:57:06.500 | What do we think is kind of the, you know,
00:57:08.900 | what like genes are present here or something
00:57:10.900 | or what's in the sample?
00:57:13.180 | And getting kind of a list of academic papers
00:57:15.980 | that would support their findings
00:57:17.380 | and using this to help doctors interpret their tests.
00:57:21.100 | So we talked about, okay, cool.
00:57:22.260 | Like if we built, you know,
00:57:24.020 | he's pretty interested in kind of doing a survey
00:57:26.860 | of infectious disease specialists
00:57:29.340 | and getting them to evaluate, you know,
00:57:31.300 | having them write up their answers,
00:57:32.540 | comparing it to Elicit's answers,
00:57:33.860 | trying to see can Elicit start being used
00:57:36.340 | to interpret the results of these diagnostic tests?
00:57:39.520 | Because the way they ship these tests to doctors
00:57:42.340 | is they report on a really wide array of things.
00:57:46.020 | And he was saying that at a large,
00:57:47.900 | well-resourced hospital, like a city hospital,
00:57:50.580 | there might be a team of infectious disease specialists
00:57:52.820 | who can help interpret these results.
00:57:55.140 | But at under-resourced hospitals or more rural hospitals,
00:57:57.820 | the primary care physician can't interpret the test results.
00:58:02.820 | So then they can't order it, they can't use it,
00:58:04.620 | they can't help their patients with it.
00:58:06.540 | So thinking about, you know,
00:58:07.780 | kind of an evidence-backed way of interpreting these tests
00:58:10.380 | is definitely kind of an extension of the product
00:58:12.140 | that I hadn't considered before.
00:58:13.960 | But yeah, the idea of like using that
00:58:15.760 | to bring more access to physicians
00:58:18.240 | in all different parts of the country
00:58:20.020 | and helping them interpret complicated science
00:58:21.860 | is pretty cool.
00:58:23.140 | - Yeah, we had Kanjun from Imbue on the podcast
00:58:26.740 | and we talked about better allocating scientific resources.
00:58:29.820 | How do you think about these use cases
00:58:31.540 | and maybe how Elicit can help drive more research?
00:58:35.340 | And do you see a world in which, you know,
00:58:37.600 | maybe the models actually do some of the research
00:58:39.860 | before suggesting us?
00:58:42.220 | - Yeah, I think that's like very close
00:58:45.140 | to what we care about.
00:58:46.420 | So our product values are systematic,
00:58:49.380 | transparent, and unbounded.
00:58:50.660 | And I think to make research more,
00:58:53.760 | especially more systematic and unbounded,
00:58:55.940 | I think is like basically the thing that's at stake here.
00:58:58.100 | So ideally people would think,
00:59:01.220 | well, what are, for example,
00:59:04.060 | I was recently talking to people in longevity
00:59:06.220 | and I think there isn't really one field of longevity.
00:59:08.620 | There are kind of different scientific subdomains
00:59:11.500 | that are surfacing,
00:59:13.180 | various things that are related to longevity.
00:59:15.140 | And I think if you could more systematically say,
00:59:17.620 | look, here are all the different interventions we could do.
00:59:22.580 | And here's the expected ROI of these experiments.
00:59:25.140 | Here's like the evidence so far
00:59:26.800 | that supports those being either like likely
00:59:30.060 | to surface new information or not.
00:59:32.180 | Here's the cost of these experiments.
00:59:34.180 | I think you could be so much more systematic
00:59:36.940 | than science is today.
00:59:39.380 | Probably, yeah, I'd guess in like 10, 20 years,
00:59:42.380 | we'll look back and it will be incredible
00:59:44.800 | how unsystematic science was back in the day.
00:59:48.100 | - Yeah, and I think this is as we start to,
00:59:51.700 | so like our view is kind of have models
00:59:54.820 | catch up to expert humans today,
00:59:57.260 | or whatever, start with kind of novice humans
00:59:59.780 | and then increasingly expert humans.
01:00:01.740 | And then at some point,
01:00:03.260 | but we really want the models to kind of like
01:00:05.220 | earn their right to the expertise.
01:00:07.660 | So that's why we do things in this very step-by-step way.
01:00:09.820 | That's why we don't just like throw a bunch of data
01:00:12.300 | and apply a bunch of compute and hope we get good results.
01:00:14.980 | But obviously at some point you hope that
01:00:16.220 | once it's kind of earned its stripes,
01:00:17.580 | it can surpass human researchers.
01:00:20.520 | But I think that's where making sure that the models
01:00:23.340 | processes are really explicit and transparent
01:00:26.380 | and that it's really easy to evaluate is important
01:00:28.660 | because if it does surpass human understanding,
01:00:31.060 | people will still need to be able to audit its work somehow
01:00:33.740 | or spot check its work somehow
01:00:35.580 | to be able to reliably trust it and use it.
01:00:37.960 | So yeah, that's kind of why the process-based approach
01:00:41.420 | is really important.
01:00:42.740 | - And on the question of will models do their own research,
01:00:47.420 | I think one feature that most currently don't have
01:00:50.340 | that will need to be better there is better world models.
01:00:54.300 | I think currently models are just not great
01:00:55.940 | at representing kind of what's going on
01:00:59.520 | in a particular situation or domain in a way
01:01:01.960 | that allows them to come to interesting,
01:01:05.380 | surprising conclusions.
01:01:07.340 | I think they're very good at like coming to,
01:01:09.680 | I don't know, conclusions that are nearby
01:01:12.300 | to conclusions that people have come to,
01:01:14.020 | but not as good at kind of reasoning
01:01:16.860 | and making surprising connections maybe.
01:01:19.060 | And so having deeper models of how,
01:01:23.340 | let's see, what are the underlying structures
01:01:25.900 | of different domains, how are they related or not related,
01:01:28.520 | I think will be an important ingredient
01:01:30.020 | for models actually being able to make novel contributions.
01:01:32.860 | - On the topic of hiring more expert humans,
01:01:34.980 | you've hired some very expert humans.
01:01:37.180 | My friend Maggie Appleton joined you guys
01:01:39.300 | I think maybe a year ago-ish.
01:01:41.380 | In fact, I think you're doing an offsite
01:01:44.100 | and we're actually organizing our big AI/UX meetup
01:01:48.460 | around whenever she's in town in San Francisco.
01:01:51.100 | - Oh, amazing.
01:01:52.140 | - How big is the team?
01:01:53.940 | How have you sort of transitioned your company
01:01:55.920 | into this sort of PBC and sort of the plan for the future?
01:02:00.500 | - Yeah, we're 12 people now.
01:02:02.300 | Mostly, about half of us are in the Bay Area
01:02:05.300 | and then distributed across US and Europe.
01:02:07.620 | A mix of mostly kind of roles in engineering and product.
01:02:11.300 | Yeah, and I think that the transition to PBC
01:02:14.240 | was really not that eventful because I think we were already,
01:02:18.260 | even as a nonprofit, we were already shipping every week.
01:02:21.260 | So very much operating as a product.
01:02:22.420 | - Very much like a startup, yeah.
01:02:24.020 | - And then I would say the kind of PBC component
01:02:27.200 | was to very explicitly say that we have a mission
01:02:30.860 | that we care a lot about.
01:02:32.100 | There are a lot of ways to make money.
01:02:33.900 | We think our mission will make us a lot of money,
01:02:36.020 | but we are going to be opinionated about how we make money.
01:02:38.360 | We're gonna take the version of making a lot of money
01:02:40.780 | that's in line with our mission.
01:02:42.180 | But it's like all very, it's very convergent.
01:02:43.940 | Like Elicit is not going to make any money
01:02:45.980 | if it's a bad product,
01:02:47.220 | if it doesn't actually help you discover truth
01:02:49.300 | and do research more rigorously.
01:02:51.640 | So I think for us, the kind of mission
01:02:54.460 | and the success of the company are very intertwined.
01:02:58.700 | So a big part of, yeah,
01:02:59.660 | we're hoping to grow the team quite a lot this year.
01:03:02.280 | Probably some of our highest priority roles
01:03:04.100 | are in engineering,
01:03:05.700 | but also opening up roles more in design
01:03:08.700 | and product marketing, go-to-market.
01:03:10.540 | Yeah, do you want to talk about the roles?
01:03:14.160 | - Yeah, broadly we're just looking
01:03:15.220 | for senior software engineers
01:03:16.580 | and don't need any particular AI expertise.
01:03:19.500 | A lot of it is just, I guess,
01:03:21.860 | how do you build good orchestration for complex tasks?
01:03:26.860 | So we talked earlier about these are sort of notebooks,
01:03:30.660 | scaling up, task orchestration,
01:03:32.700 | and I think a lot of this looks more
01:03:36.180 | like traditional software engineering
01:03:38.100 | than it does look like machine learning research.
01:03:39.920 | And I think the people who are really good
01:03:42.080 | at building good abstractions,
01:03:45.520 | building applications that can kind of survive,
01:03:48.880 | even if some of their pieces break,
01:03:50.720 | like making reliable components out of unreliable pieces.
01:03:54.060 | I think those are the people we're looking for.
01:03:56.460 | - You know, that's exactly what I used to do.
01:03:59.060 | Have you explored any of the existing--
01:03:59.900 | - Do you want to come work with us?
01:04:01.420 | - I can talk about this all day.
01:04:02.820 | Have you explored the existing orchestration frameworks?
01:04:05.220 | Temporal, Airflow, Dagster, Prefect?
01:04:09.060 | - We've looked into them a little bit.
01:04:10.340 | I think we have some specific requirements
01:04:12.220 | around kind of being able to stream work back very quickly
01:04:16.260 | to our users, and those could definitely be relevant.
01:04:20.260 | - Okay, well, you're hiring.
01:04:21.440 | I'm sure we'll plug all the links.
01:04:22.740 | Thank you so much for coming.
01:04:23.660 | Any parting words, any words of wisdom?
01:04:26.340 | Models do you live by?
01:04:27.620 | - No, I think it's a really important time
01:04:31.220 | for humanity, so I hope everyone listening to this podcast
01:04:34.940 | can think hard about exactly how they want to participate
01:04:39.940 | in this story.
01:04:41.140 | There's so much to build, and we can be really intentional
01:04:43.980 | about what we align ourselves with.
01:04:46.500 | I think there are a lot of applications
01:04:48.660 | that are going to be really good for the world,
01:04:50.020 | and a lot of applications that are not,
01:04:51.780 | and so, yeah, I hope people can take that seriously
01:04:54.740 | and kind of seize the moment.
01:04:56.620 | - Yeah, I love how intentional you guys have been.
01:04:58.220 | Thank you for sharing that story.
01:04:59.660 | - Thank you.
01:05:00.500 | - Yeah, thank you for coming on.
01:05:02.540 | (upbeat music)