
Supervise the Process of AI Research — with Jungwon Byun and Andreas Stuhlmüller of Elicit


Chapters

0:00 Introductions
7:45 How Jungwon and Andreas Joined Forces to Create Elicit
10:26 Why Products Are Better Than Research
15:49 The Evolution of Elicit's Product
19:44 Automating Literature Review Workflow
22:48 How GPT-3 to GPT-4 Changed Things
25:37 Managing LLM Pricing and Performance
31:07 Open vs. Closed: Elicit's Approach to Model Selection
31:56 Moving to Notebooks
39:11 Elicit's Budget for Model Queries and Evaluations
41:44 Impact of Long Context Windows
47:19 Underrated Features and Surprising Applications
51:35 Driving Systematic and Efficient Research
53:00 Elicit's Team Growth and Transition to a Public Benefit Corporation
55:22 Building AI for Good

Whisper Transcript

00:00:00.000 | Hey, everyone.
00:00:01.000 | Welcome to the Latent Space Podcast.
00:00:02.760 | This is Alessio, partner and CTO-in-residence at Decibel Partners.
00:00:06.120 | And I'm joined by my co-host, Swyx, founder of Smol AI.
00:00:09.400 | Hey, and today we are back in the studio
00:00:11.520 | with Andreas and Jungwon from Elicit.
00:00:14.480 | Welcome.
00:00:15.240 | Thanks, guys.
00:00:16.000 | It's great to be here.
00:00:16.880 | Yeah.
00:00:17.380 | So I'll introduce you separately,
00:00:19.060 | but also we'd love to learn a little bit more about you
00:00:22.080 | personally.
00:00:23.320 | So Andreas, it looks like you started Elicit, or Ought first,
00:00:28.440 | and Jungwon joined later.
00:00:31.520 | That's right, although you did--
00:00:33.200 | I guess, for all intents and purposes,
00:00:35.240 | the Elicit and also the Ought that existed before then
00:00:39.080 | were very different from what I started.
00:00:42.360 | So I think it's fair to say that she co-founded it.
00:00:46.120 | And Jungwon, you're a co-founder and COO of Elicit.
00:00:49.400 | Yeah, that's right.
00:00:50.200 | So there's a little bit of a history to this.
00:00:52.840 | I'm not super aware of the journey.
00:00:55.920 | I was aware of Ought and Elicit as sort
00:00:59.320 | of a nonprofit-type situation.
00:01:01.040 | And recently, you turned into sort of like a B Corp--
00:01:04.080 | Public Benefit Corporation.
00:01:05.600 | So yeah, maybe if you want, you could
00:01:07.280 | take us through that journey of finding the problem.
00:01:12.400 | Obviously, you're working together now.
00:01:14.960 | So how do you get together to decide to leave your startup
00:01:18.920 | career to join him?
00:01:20.880 | Yeah, it's truly a very long journey.
00:01:22.440 | I guess, truly, it kind of started in Germany
00:01:24.680 | when I was born.
00:01:25.440 | So even as a kid, I was always interested in AI.
00:01:30.720 | I kind of went to the library.
00:01:32.120 | There were books about how to write programs in QBasic.
00:01:34.980 | And some of them talked about how to implement chatbots.
00:01:39.440 | I guess Eliza--
00:01:40.040 | To be clear, he grew up in a tiny village
00:01:42.600 | on the outskirts of Munich called
00:01:44.120 | Dinkelscherben, where it's a very, very idyllic German
00:01:47.880 | village.
00:01:49.000 | Important to the story.
00:01:51.680 | But basically, the main thing is I've kind of always
00:01:54.900 | been thinking about AI my entire life
00:01:56.480 | and been thinking about, well, at some point,
00:01:58.200 | this is going to be a huge deal.
00:01:59.560 | It's going to be transformative.
00:02:00.900 | How can I work on it?
00:02:02.920 | And I was thinking about it from when I was a teenager.
00:02:09.200 | After high school, I did a year where
00:02:11.400 | I started a startup with the intention to become rich.
00:02:15.160 | And then once I'm rich, I can affect the trajectory of AI.
00:02:19.240 | Did not become rich.
00:02:21.000 | Decided to go back to college and study cognitive science
00:02:24.040 | there, which was like the closest thing
00:02:25.720 | I could find at the time to AI.
00:02:29.680 | In the last year of college, moved to the US
00:02:32.360 | to do a PhD at MIT, working on broadly kind of new programming
00:02:37.960 | languages for AI, because it kind of seemed like the existing
00:02:40.880 | languages were not great at expressing world models
00:02:44.100 | and learning world models doing Bayesian inference.
00:02:47.520 | Was always thinking about, well, ultimately, the goal
00:02:49.640 | is to actually build tools that help people reason more
00:02:51.960 | clearly, ask and answer better questions,
00:02:57.600 | and make better decisions.
00:02:58.760 | But for a long time, it just seemed
00:03:00.240 | like the technology to put reasoning in machines
00:03:03.440 | just wasn't there.
00:03:04.800 | And so initially, at the end of my postdoc at Stanford,
00:03:10.820 | I was thinking about, well, what to do?
00:03:12.440 | I think the standard path is you become an academic
00:03:15.600 | and do research.
00:03:17.160 | But it's really hard to actually build interesting tools
00:03:23.040 | as an academic.
00:03:23.920 | You can't really hire great engineers.
00:03:26.760 | Everything is kind of on a paper-to-paper timeline.
00:03:29.520 | And so I was like, well, maybe I should start a startup
00:03:33.440 | and pursue that for a little bit.
00:03:35.120 | But it seemed like it was too early,
00:03:37.160 | because you could have tried to do an AI startup,
00:03:39.900 | but probably would not have been the kind of AI startup
00:03:42.840 | we're seeing now.
00:03:44.520 | So then decided to just start a nonprofit research lab that's
00:03:47.840 | going to do research for a while,
00:03:49.200 | until we better figure out how to do thinking in machines.
00:03:53.000 | And that was Ought.
00:03:54.800 | And then over time, it became clear how to actually build
00:04:01.080 | actual tools for reasoning.
00:04:02.800 | And then only over time, we developed a better way to--
00:04:08.400 | I'll let you fill in some of these.
00:04:10.400 | Yeah, so I guess my story maybe starts around 2015.
00:04:14.360 | I kind of wanted to be a founder for a long time.
00:04:17.880 | And I wanted to work on an idea that really tested--
00:04:22.220 | that stood the test of time for me,
00:04:23.840 | like an idea that stuck with me for a long time.
00:04:26.560 | And then starting in 2015, actually,
00:04:28.280 | originally, I became interested in AI-based tools
00:04:30.840 | from the perspective of mental health.
00:04:32.680 | So there are a bunch of people around me
00:04:33.840 | who are really struggling.
00:04:35.240 | One really close friend in particular
00:04:36.620 | is really struggling with mental health
00:04:38.280 | and didn't have any support.
00:04:39.720 | And it didn't feel like there was anything
00:04:41.880 | before getting hospitalized that could just help her.
00:04:45.640 | And so luckily, she came and stayed with me for a while.
00:04:48.320 | And we were just able to talk through some things.
00:04:50.600 | But it seemed like lots of people
00:04:52.760 | might not have that resource.
00:04:54.140 | And something maybe AI-enabled could be much more scalable.
00:04:57.760 | I didn't feel ready to start a company then.
00:05:00.360 | That's 2015.
00:05:02.280 | And I also didn't feel like the technology was ready.
00:05:05.200 | So then I went into fintech and learned
00:05:07.440 | how to do the tech thing.
00:05:09.200 | And then in 2019, I felt like it was time
00:05:12.640 | for me to just jump in and build something on my own
00:05:15.280 | I really wanted to create.
00:05:17.160 | And at the time, there were two interesting--
00:05:19.840 | I looked around at tech and felt not super inspired
00:05:22.680 | by the options.
00:05:23.720 | I didn't want to have a tech career ladder.
00:05:26.320 | I didn't want to climb the career ladder.
00:05:28.840 | There were two interesting technologies at the time.
00:05:30.860 | There was AI and there was crypto.
00:05:32.800 | And I was like, well, the AI people
00:05:34.240 | seem a little bit more nice.
00:05:37.120 | Maybe slightly more trustworthy.
00:05:40.080 | Both super exciting.
00:05:41.320 | But yeah, I kind of threw my bet in on the AI side.
00:05:46.240 | And then I got connected to Andreas.
00:05:47.780 | And actually, the way he was thinking
00:05:49.760 | about pursuing the research agenda at Ought
00:05:52.040 | was really compatible with what I had envisioned
00:05:54.760 | for an ideal AI product, something
00:05:57.080 | that helps kind of take really complex, overwhelming thinking
00:05:59.880 | and break it down into small pieces.
00:06:02.720 | And then this kind of mission that we
00:06:04.640 | need AI to help us figure out what we ought to do
00:06:08.200 | was really inspiring.
00:06:10.520 | Yeah, because I think it was clear that we were building
00:06:12.880 | the most powerful optimizer of our time.
00:06:16.560 | But as a society, we hadn't figured out
00:06:18.640 | how to direct that optimization potential.
00:06:21.520 | And if you kind of direct tremendous amounts
00:06:23.640 | of optimization potential at the wrong thing,
00:06:25.840 | that's really disastrous.
00:06:27.000 | So the goal of Ought was to make sure that if we build the most
00:06:29.940 | transformative technology of our lifetime,
00:06:32.160 | it can be used for something really impactful,
00:06:34.400 | like good reasoning, like not just generating ads.
00:06:37.480 | My background was in marketing.
00:06:38.880 | But so I was like, I want to do more than generate ads
00:06:41.160 | with this.
00:06:42.160 | And also, if these AI systems get
00:06:45.320 | to be super intelligent enough that they are doing
00:06:47.880 | this really complex reasoning, that we can trust them,
00:06:50.240 | that they are aligned with us and we
00:06:51.980 | have ways of evaluating that they're doing the right thing.
00:06:54.960 | So that's what Ought did.
00:06:55.880 | We did a lot of experiments.
00:06:57.600 | This was, like Andreas said, before foundation models
00:07:00.960 | really took off.
00:07:02.640 | A lot of the issues we were seeing
00:07:04.880 | were more in reinforcement learning.
00:07:06.640 | But we saw a future where AI would
00:07:09.720 | be able to do more kind of logical reasoning,
00:07:12.360 | not just kind of extrapolate from numerical trends.
00:07:15.360 | So we actually kind of set up experiments
00:07:18.800 | with people, where people stood in as super intelligent
00:07:21.960 | systems.
00:07:23.320 | And we effectively gave them context windows.
00:07:25.920 | So they would have to read a bunch of text.
00:07:28.560 | And one person would get less text,
00:07:31.360 | and one person would get all the text,
00:07:32.900 | and the person with less text would
00:07:34.480 | have to evaluate the work of the person who
00:07:37.080 | could read much more.
00:07:38.200 | So in a world, we were basically simulating,
00:07:40.600 | like in 2018, 2019, a world where an AI system could read
00:07:44.520 | significantly more than you.
00:07:45.960 | And you, as the person who couldn't read that much,
00:07:48.280 | had to evaluate the work of the AI system.
00:07:50.640 | Yeah, so there's a lot of the work we did.
00:07:53.840 | And from that, we kind of iterated on this idea,
00:07:56.280 | that the idea of breaking complex tasks down
00:07:58.880 | into smaller tasks, like complex tasks,
00:08:00.720 | like open-ended reasoning, logical reasoning,
00:08:03.640 | into smaller tasks, so that it's easier
00:08:05.520 | to train AI systems on them.
00:08:07.080 | And also so that it's easier to evaluate the work of the AI
00:08:09.960 | system when it's done.
00:08:11.880 | And then also kind of really pioneered this idea,
00:08:15.840 | the importance of supervising the process of AI systems,
00:08:18.800 | not just the outcomes.
00:08:20.360 | And so a big part of then how Elicit is built
00:08:23.040 | is we're very intentional about not just throwing a ton of data
00:08:27.320 | into a model and training it, and then saying, cool,
00:08:29.640 | here's scientific output.
00:08:31.320 | That's not at all what we do.
00:08:33.520 | Our approach is very much like, what are the steps
00:08:35.680 | that an expert human does?
00:08:37.160 | Or what is an ideal process?
00:08:38.800 | As granularly as possible, let's break that down.
00:08:41.800 | And then train AI systems to perform each of those steps
00:08:44.640 | very robustly.
00:08:46.200 | When you train that from the start, after the fact,
00:08:49.200 | it's much easier to evaluate.
00:08:50.920 | It's much easier to troubleshoot at each point,
00:08:53.000 | like where did something break down?
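To make this process-supervision idea concrete: the workflow is decomposed into named steps, each step's output is checked before moving on, and failures point at the step that broke. The sketch below is illustrative only, with made-up step names and checks; it is not Elicit's actual pipeline.

```python
# Toy sketch of process supervision: run a workflow as explicit named steps,
# validate each step's output, and surface exactly where things break down.
# Step names and checks here are made up for illustration.
from typing import Callable, List, Tuple

# (name, step function, validator for that step's output)
Step = Tuple[str, Callable[[object], object], Callable[[object], bool]]

def run_supervised(steps: List[Step], inp: object) -> object:
    x = inp
    for name, fn, check in steps:
        x = fn(x)
        if not check(x):
            raise ValueError(f"Step '{name}' produced an invalid output: {x!r}")
    return x

steps: List[Step] = [
    ("find_papers", lambda q: [f"paper about {q}"], lambda out: len(out) > 0),
    ("screen_papers", lambda papers: [p for p in papers if "paper" in p], lambda out: len(out) > 0),
    ("extract_outcomes", lambda papers: {p: "outcome summary" for p in papers}, lambda out: all(out.values())),
]

print(run_supervised(steps, "creatine and cognition"))
```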
00:08:55.000 | So yeah, we were working on those experiments for a while.
00:08:57.000 | And then at the start of 2021, decided to build a product.
00:09:00.320 | Because when you do research, I think maybe--
00:09:03.280 | - Do you mind if I, 'cause I think you're about to go
00:09:06.000 | into more modern Ought and Elicit.
00:09:08.640 | And I just wanted to, because I think a lot of people
00:09:11.080 | are in where you were, like sort of 2018, '19,
00:09:15.360 | where you chose a partner to work with, right?
00:09:18.360 | And you didn't know him.
00:09:19.760 | You were just kind of cold introduced.
00:09:21.520 | A lot of people are cold introduced.
00:09:23.200 | I've been cold introduced to tons of people
00:09:24.720 | and I never work with them.
00:09:26.760 | I assume you had a lot of other options, right?
00:09:28.880 | Like how do you advise people to make those choices?
00:09:32.160 | - Yeah, we were not totally cold introduced.
00:09:33.840 | So we had one of our closest friends introduce us.
00:09:36.880 | And then Andreas had written a lot on the Ought website,
00:09:41.120 | a lot of blog posts, a lot of publications.
00:09:43.160 | And I just read it and I was like, wow,
00:09:44.920 | this sounds like my writing.
00:09:47.080 | And even other people, some of my closest friends
00:09:49.360 | I asked for advice from, they were like,
00:09:50.920 | oh, this sounds like your writing.
00:09:52.960 | But I think I also had some kind of like
00:09:54.840 | things I was looking for.
00:09:55.760 | I wanted someone with a complementary skill set.
00:09:58.240 | I wanted someone who was very values-aligned.
00:10:00.800 | And yeah, I think that was all a good fit.
00:10:03.640 | - We also did a pretty lengthy mutual evaluation process
00:10:07.120 | where we had a Google doc
00:10:08.480 | where we had all kinds of questions for each other.
00:10:11.120 | And I think it ended up being around 50 pages or so
00:10:14.560 | of like various like questions and back and forth.
00:10:16.720 | - Was it the YC list?
00:10:18.200 | There's some lists going around for co-founder questions.
00:10:20.360 | - No, we just made our own questions.
00:10:22.480 | But I presume, I guess it's probably related
00:10:26.000 | in that you ask yourself, well,
00:10:27.480 | what are the values you care about?
00:10:28.720 | How would you approach various decisions
00:10:30.480 | and things like that?
00:10:31.320 | - I shared like all of my past performance reviews.
00:10:33.880 | - Yeah. - Yeah.
00:10:35.200 | - Yeah, and he had never had any, so.
00:10:36.480 | - No. (all laughing)
00:10:39.240 | - Yeah, sorry, I just had to,
00:10:42.000 | a lot of people are going through that phase
00:10:43.520 | and you kind of skipped over it.
00:10:44.400 | I was like, no, no, no, no,
00:10:45.240 | there's like an interesting story there.
00:10:47.320 | - So before we jump into what Elicit is today,
00:10:51.400 | the history is a bit counterintuitive.
00:10:53.920 | So you start with figuring out,
00:10:55.920 | oh, if we had a super powerful model,
00:10:58.400 | how would we align it, how we use it?
00:11:00.400 | But then you were actually like,
00:11:01.560 | well, let's just build the product
00:11:02.880 | so that people can actually leverage it.
00:11:04.680 | And I think there are a lot of folks today
00:11:07.120 | that are now back to where you were maybe five years ago
00:11:09.320 | that are like, oh, what if this happens
00:11:11.400 | rather than focusing on actually building
00:11:13.080 | something useful with it?
00:11:15.160 | What clicked for you to like move into Elicit
00:11:18.240 | and then we can cover that story too?
00:11:20.160 | - I think in many ways, the approach is still the same
00:11:22.400 | because the way we are building Elicit is not,
00:11:24.960 | let's train a foundation model to do more stuff.
00:11:27.440 | It's like, let's build a scaffolding
00:11:29.480 | such that we can deploy powerful models to good ends.
00:11:32.740 | So I think it's different now in that we are,
00:11:36.040 | we actually have some of the models to plug in,
00:11:37.920 | but if in 2018, '17, we had had the models,
00:11:42.480 | we could have run the same experiments
00:11:44.840 | we did run with humans back then, just with models.
00:11:47.720 | And so in many ways, our philosophy is always like,
00:11:50.120 | let's think ahead to the future.
00:11:51.280 | What models are gonna exist in one, two years or longer?
00:11:55.960 | And how can we make it so that they can actually be deployed
00:11:59.560 | in kind of transparent, controllable ways?
00:12:02.440 | - Yeah, I think motivationally,
00:12:03.840 | we both are kind of product people at heart
00:12:06.040 | and we just want to, the research was really important
00:12:09.600 | and it didn't make sense to build a product at that time.
00:12:12.640 | But at the end of the day,
00:12:13.480 | the thing that always motivated us is imagining a world
00:12:16.600 | where high quality reasoning is really abundant.
00:12:19.600 | And AI was just kind of the most,
00:12:22.640 | is the technology that's gonna get us there.
00:12:24.880 | And there's a way to guide that technology with research,
00:12:27.240 | but it's also really exciting to have,
00:12:29.320 | you can have a more direct effect through product
00:12:31.880 | because with research, you have kind of,
00:12:33.760 | you'd publish the research and someone else
00:12:35.320 | has to implement that into the product
00:12:36.760 | and the product felt like a more direct path.
00:12:39.120 | And we wanted to concretely have an impact on people's lives.
00:12:42.360 | So I think, yeah, I think the kind of personally,
00:12:45.520 | the motivation was we want to build for people.
00:12:49.160 | - Yep, and then just to recap as well,
00:12:52.600 | like the models you were using back then were like,
00:12:55.000 | I don't know, would they like BERT type stuff or T5
00:12:59.000 | or I don't know what timeframe we're talking about here.
00:13:02.120 | - So I guess to be clear, at the very beginning,
00:13:04.400 | we had humans do the work and then the initial,
00:13:09.040 | I think the first models that kind of make sense
00:13:11.280 | were GPT-2 and T-NLG and like the early generative models.
00:13:18.280 | We do also use like T5 based models even now,
00:13:22.420 | but started with GPT-2.
00:13:26.280 | - Yeah, cool, I'm just kind of curious about like,
00:13:27.920 | how do you start so early, you know,
00:13:29.720 | like now it's obvious where to start,
00:13:31.600 | but back then it wasn't.
00:13:33.240 | - Yeah, I used to nag Andreas a lot.
00:13:35.080 | I was like, why are you talking to this?
00:13:37.120 | I don't know, I felt like GPT-2 is like,
00:13:38.720 | clearly can't do anything.
00:13:39.840 | And I was like, Andreas, you're wasting your time
00:13:41.600 | like playing with this toy.
00:13:43.520 | But yeah, he was right.
00:13:46.600 | - So what's the history of what Elicit
00:13:48.840 | actually does as a product?
00:13:50.240 | I think today, you recently announced that
00:13:52.960 | after four months, you get to a million in revenue.
00:13:55.080 | Obviously a lot of people use it, get a lot of value,
00:13:57.040 | but it would initially kind of like structure data,
00:14:01.160 | extraction from papers.
00:14:03.000 | Then you had, yeah, kind of like concept grouping.
00:14:06.480 | And today it's maybe like a more full stack
00:14:08.760 | research enabler, kind of like paper understander platform.
00:14:12.040 | What's the definitive definition of what Elicit is
00:14:16.400 | and how did you get here?
00:14:17.520 | - Yeah, we say Elicit is an AI research assistant.
00:14:20.000 | I think it will continue to evolve.
00:14:21.600 | It has evolved a lot and it will continue to.
00:14:23.640 | And that's part of why we're so excited
00:14:26.080 | about building in research,
00:14:27.000 | 'cause there's just so much space.
00:14:28.800 | I think the current phase we're in right now,
00:14:30.980 | we talk about it as really trying to make Elicit
00:14:34.000 | the best place to understand what is known.
00:14:35.760 | So it's all a lot about like literature summarization.
00:14:39.360 | There's a ton of information that the world already knows.
00:14:41.540 | It's really hard to navigate, hard to make it relevant.
00:14:44.840 | So a lot of it is around document discovery
00:14:47.320 | and processing and analysis.
00:14:49.640 | I really want to make,
00:14:51.760 | I kind of want to import some of the incredible
00:14:54.640 | productivity improvements we've seen in software engineering
00:14:57.920 | and data science and into research.
00:14:59.580 | So it's like, how can we make researchers
00:15:01.760 | like data scientists of text?
00:15:04.080 | That's why we're launching this new set of features
00:15:06.920 | called Notebooks.
00:15:07.760 | It's very much inspired by computational notebooks
00:15:09.960 | like Jupyter Notebooks, Deepnote, or Colab,
00:15:13.440 | because they're so powerful and so flexible.
00:15:15.520 | And ultimately when people are trying to get to an answer
00:15:19.200 | or understand insight,
00:15:20.120 | they're kind of like manipulating evidence and information.
00:15:22.900 | Today that's all packaged in PDFs, which are super brittle,
00:15:26.320 | but with language models,
00:15:27.440 | we can decompose these PDFs into their underlying claims
00:15:30.440 | and evidence and insights,
00:15:31.480 | and then let researchers mash them up together,
00:15:34.360 | remix them and analyze them together.
00:15:35.920 | So yeah, I would say quite simply,
00:15:38.800 | overall Elicit is an AI research assistant.
00:15:40.780 | Right now we're focused on text-based workflows,
00:15:45.200 | but long-term really want to kind of go further and further
00:15:48.120 | into reasoning and decision-making.
00:15:50.480 | - And when you say AI research assistant,
00:15:53.280 | this is kind of meta research.
00:15:55.280 | So researchers use Elicit as a research assistant.
00:15:58.940 | It's not a generic, you can research anything type of tool,
00:16:02.880 | or it could be, but like,
00:16:04.160 | what are people using it for today?
00:16:05.920 | - Yeah, so specifically, I guess in science,
00:16:08.960 | a lot of people use human research assistants to do things.
00:16:12.260 | Like you tell your kind of grad student,
00:16:15.500 | hey, here are a couple of papers.
00:16:16.780 | Can you look at all of these,
00:16:18.540 | see which of these have kind of sufficiently large
00:16:21.260 | populations and actually study the disease
00:16:23.260 | that I'm interested in,
00:16:24.420 | and then write out like, what are the experiments they did?
00:16:26.840 | What are the interventions they did?
00:16:28.720 | What are the outcomes?
00:16:29.620 | And kind of organize that for me.
00:16:31.460 | And the first phase of understanding what is known
00:16:34.580 | really focuses on automating that workflow,
00:16:37.100 | because a lot of that work is pretty rote work.
00:16:39.200 | I think it's not the kind of thing
00:16:40.480 | that we need humans to do, language models can do it.
00:16:43.480 | And then if language models can do it,
00:16:45.200 | you can obviously scale it up much more
00:16:47.320 | than a grad student or undergrad research assistant
00:16:50.520 | would be able to do.
00:16:52.120 | - Yeah, the use cases are pretty broad.
00:16:53.760 | So we do have people who just come,
00:16:55.280 | a very large percent of our users
00:16:56.920 | are just using it personally,
00:16:58.240 | or for a mix of personal and professional things.
00:17:01.160 | People who care a lot about like health or biohacking,
00:17:05.260 | or parents who have children with a kind of rare disease
00:17:08.880 | and want to understand the literature directly.
00:17:10.680 | So there is an individual kind of consumer use case.
00:17:13.720 | We're most focused on the power users,
00:17:15.600 | so that's where we're really excited to build.
00:17:18.180 | So Elicit was very much inspired by this workflow
00:17:21.180 | in literature called Systematic Reviews or Meta-Analysis,
00:17:24.480 | which is basically the human state of the art
00:17:27.340 | for summarizing scientific literature.
00:17:29.480 | It typically involves like five people
00:17:31.440 | working together for over a year,
00:17:33.600 | and they kind of first start by trying to find
00:17:35.600 | the maximally comprehensive set of papers possible.
00:17:38.500 | So it's like 10,000 papers.
00:17:40.360 | And they kind of systematically narrow that down
00:17:42.520 | to like hundreds or 50, and extract key details
00:17:46.160 | from every single paper.
00:17:47.280 | Usually have two people doing it,
00:17:48.760 | like a third person reviewing it.
00:17:50.300 | So it's like an incredibly laborious,
00:17:52.940 | time-consuming process,
00:17:54.160 | but you see it in every single domain.
00:17:56.080 | So in science, in machine learning, in policy.
00:17:59.800 | And so if you can, and it's very,
00:18:01.400 | because it's so structured and designed to be reproducible,
00:18:03.840 | it's really amenable to automation.
00:18:05.580 | So that's kind of the workflow
00:18:07.080 | that we want to automate first.
00:18:08.560 | And then you make that accessible for any question
00:18:12.100 | and make kind of these really robust
00:18:14.080 | living summaries of science.
00:18:15.900 | So yeah, that's one of the workflows
00:18:16.880 | that we're starting with.
00:18:17.720 | - Our previous guest, Mike Conover,
00:18:19.000 | he's building a new company called BrightWave,
00:18:20.600 | which is an AI research assistant for financial research.
00:18:24.380 | How do you see the future of these tools?
00:18:26.360 | Like does everything converge
00:18:27.800 | to like a God researcher assistant,
00:18:30.680 | or is every domain going to have its own thing?
00:18:33.620 | - I think that's a good and mostly open question.
00:18:36.540 | I do think there are some differences across domains.
00:18:40.400 | For example, some research is more
00:18:42.640 | quantitative data analysis,
00:18:44.480 | and other research is more
00:18:46.160 | kind of high-level cross-domain thinking.
00:18:49.360 | And we definitely want to contribute
00:18:51.600 | to the broad generalist reasoning type space.
00:18:53.460 | Like if researchers are making discoveries,
00:18:55.880 | often it's like, hey,
00:18:56.920 | this thing in biology is actually analogous
00:18:59.000 | to like these equations in economics or something.
00:19:01.560 | And that's just fundamentally a thing
00:19:03.840 | where you need to reason across domains.
00:19:06.640 | So I think there will be, at least within research,
00:19:09.440 | I think there will be like one best platform more or less
00:19:12.600 | for this type of generalist research.
00:19:15.480 | I think there may still be like some particular tools
00:19:17.680 | like for genomics, like particular types of modules
00:19:21.360 | of genes and proteins and whatnot.
00:19:23.560 | But for a lot of the kind of high-level reasoning
00:19:25.520 | that humans do, I think that will be more
00:19:27.760 | of a one-platform type of thing.
00:19:29.160 | - I wanted to ask a little bit deeper about,
00:19:31.920 | I guess, the workflow that you mentioned.
00:19:34.020 | I like that phrase.
00:19:35.720 | I see that in your UI now, but that's as it is today.
00:19:39.440 | And I think you were about to tell us
00:19:41.000 | about how it was in 2021 and how it maybe progressed.
00:19:43.600 | Like what, how has this workflow evolved?
00:19:46.280 | - Yeah, so the very first version of Elicit
00:19:48.040 | actually wasn't even a research assistant.
00:19:49.720 | It was like a forecasting assistant.
00:19:53.200 | So we set out and we were thinking about
00:19:55.120 | what are some of the most impactful
00:19:57.560 | types of reasoning that if we could scale up AI
00:20:00.700 | would really transform the world.
00:20:02.000 | And the first thing we started,
00:20:04.040 | we actually started with literature review,
00:20:06.720 | but we're like, oh, so many people are gonna build
00:20:08.560 | literature review tools, so let's not start there.
00:20:11.000 | And so then we focused on geopolitical forecasting.
00:20:13.880 | So I don't know if you're familiar with like Manifold or--
00:20:16.400 | - Manifold Markets. - Yeah, that kind of stuff.
00:20:18.220 | - And Manifold.love. - Before Manifold, yeah.
00:20:20.800 | So we're not predicting relationships,
00:20:22.760 | we're predicting like, is China gonna invade Taiwan?
00:20:26.040 | - Markets for everything. - Yeah.
00:20:27.240 | - That's a relationship. - Yeah, it's fair.
00:20:29.160 | - Yeah, yeah, it's true.
00:20:30.520 | - And then we worked on that for a while
00:20:32.080 | and then after GPT-3 came out,
00:20:33.840 | I think by that time we kind of realized that the,
00:20:37.200 | originally we were trying to help people
00:20:39.240 | convert their beliefs into probability distributions.
00:20:42.320 | And so take fuzzy beliefs,
00:20:43.900 | but like model them more concretely.
00:20:46.280 | And then after a few months of iterating on that,
00:20:48.040 | just realized, oh, the thing that's blocking people
00:20:50.580 | from making interesting predictions
00:20:52.600 | about important events in the world
00:20:54.960 | is less kind of on the probabilistic side
00:20:57.080 | and much more on the research side.
00:20:59.320 | And so that kind of combined
00:21:00.920 | with the very generalist capabilities of GPT-3
00:21:03.720 | prompted us to make a more general research assistant.
00:21:06.640 | Then we spent a few months iterating
00:21:08.320 | on what even is a research assistant.
00:21:11.240 | So we would embed with different researchers,
00:21:13.080 | we built data labeling workflows in the beginning,
00:21:17.040 | kind of right off the bat.
00:21:18.000 | We built ways to find like experts in a field
00:21:23.000 | and like ways to ask good research questions.
00:21:25.640 | So we just kind of iterated through a lot of workflows
00:21:27.660 | and it was, yeah, no one else was really building
00:21:30.000 | at this time and it was like very quick
00:21:31.400 | to just do some prompt engineering
00:21:32.840 | and see like what is a task that is at the intersection
00:21:35.940 | of what's good, technologically capable
00:21:38.160 | and like important for researchers.
00:21:40.600 | And we had like a very nondescript landing page.
00:21:42.680 | It said nothing, but somehow people were signing up
00:21:45.000 | and we had the sign-up form that were like,
00:21:47.000 | it was like, "Why are you here?"
00:21:48.080 | And everyone was like, "I need help with literature review."
00:21:50.000 | And we're like, "Literature review, that sounds so hard.
00:21:52.040 | "I don't even know what that means."
00:21:53.160 | They're like, "We don't want to work on it."
00:21:55.040 | But then eventually we were like,
00:21:56.040 | "Okay, everyone is saying literature review."
00:21:57.520 | It's overwhelmingly people want--
00:21:58.880 | - And all domains, not like medicine or physics
00:22:01.120 | or just all domains.
00:22:02.000 | - Yeah, and we also kind of personally knew
00:22:03.680 | literature review was hard.
00:22:04.720 | And if you look at the graphs for academic literature
00:22:07.360 | being published every single month,
00:22:08.520 | you guys know this in machine learning,
00:22:09.720 | it's like up and to the right,
00:22:11.240 | like superhuman amounts of papers.
00:22:13.960 | So we're like, "All right, let's just try it."
00:22:15.240 | I was really nervous, but Andreas was like,
00:22:16.920 | "This is kind of like the right problem space
00:22:19.080 | "to jump into even if we don't know what we're doing."
00:22:21.640 | So my take was like, "Fine, this feels really scary,
00:22:24.480 | "but let's just launch a feature every single week
00:22:27.440 | "and double our user numbers every month.
00:22:29.480 | "And if we can do that, we'll fail fast
00:22:32.120 | "and we will find something."
00:22:33.440 | I was worried about getting lost
00:22:35.400 | in the kind of academic white space.
00:22:37.720 | So the very first version was actually a weekend prototype
00:22:40.320 | that Andreas made.
00:22:41.240 | Do you want to explain how that worked?
00:22:43.120 | - I mostly remember this was really bad.
00:22:45.440 | So the thing I remember is you enter a question
00:22:49.600 | and it would give you back a list of claims.
00:22:51.480 | So your question could be, I don't know,
00:22:53.280 | "How does creatine affect cognition?"
00:22:55.600 | It would give you back some claims
00:22:57.920 | that are to some extent based on papers,
00:23:01.280 | but they were often irrelevant.
00:23:02.800 | The papers were often irrelevant.
00:23:04.480 | And so we ended up soon just printing out
00:23:07.240 | a bunch of examples of results
00:23:08.700 | and putting them up on the wall
00:23:09.820 | so that we would kind of feel the constant shame
00:23:12.240 | of having such a bad product
00:23:13.880 | and would be incentivized to make it better.
00:23:16.400 | And I think over time it has gotten a lot better,
00:23:18.400 | but I think the initial version was really very bad.
00:23:22.680 | - Yeah, but it was basically like
00:23:24.000 | a natural language summary of an abstract,
00:23:25.800 | like kind of a one-sentence summary,
00:23:27.040 | and which we still have.
00:23:28.360 | And then as we learned kind of more
00:23:30.240 | about this systematic review workflow,
00:23:31.960 | we started expanding the capability
00:23:33.600 | so that you could extract a lot more data
00:23:35.280 | from the papers and do more with that.
00:23:37.440 | - And were you using embeddings and cosine similarity,
00:23:40.960 | that kind of stuff, for retrieval,
00:23:42.360 | or was it keyword-based, or?
00:23:44.880 | - I think the very first version
00:23:46.800 | didn't even have its own search engine.
00:23:48.680 | I think the very first version probably used
00:23:51.280 | the Semantic Scholar API or something similar.
00:23:54.640 | And only later, when we discovered
00:23:57.000 | that that API is not very semantic,
00:24:00.040 | then built our own search engine that has helped a lot.
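For readers wondering what embedding-based retrieval with cosine similarity looks like in practice, here is a minimal sketch. The `embed` function is a toy stand-in for a real text encoder, and none of this reflects Elicit's actual search engine.

```python
# Minimal sketch of embedding-based retrieval ranked by cosine similarity.
# `embed` is a toy character-count encoder; swap in a real embedding model.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Bag-of-characters placeholder so the example runs without a model.
    vecs = np.zeros((len(texts), 256))
    for i, text in enumerate(texts):
        for ch in text.lower():
            vecs[i, ord(ch) % 256] += 1.0
    return vecs

def top_k(query: str, abstracts: list[str], k: int = 2) -> list[tuple[float, str]]:
    q = embed([query])[0]
    docs = embed(abstracts)
    # Cosine similarity = dot product of L2-normalized vectors.
    q = q / (np.linalg.norm(q) + 1e-9)
    docs = docs / (np.linalg.norm(docs, axis=1, keepdims=True) + 1e-9)
    scores = docs @ q
    best = np.argsort(-scores)[:k]
    return [(float(scores[i]), abstracts[i]) for i in best]

abstracts = [
    "Creatine supplementation and working memory in healthy adults.",
    "Effects of caffeine on sprint performance in athletes.",
    "A survey of transformer architectures for language modeling.",
]
print(top_k("How does creatine affect cognition?", abstracts))
```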
00:24:04.280 | - And then we're gonna go into more recent products stuff,
00:24:08.080 | but I think you seem the more startup-oriented
00:24:11.640 | business person, and you seem sort of more ideologically
00:24:14.880 | interested in research, obviously, 'cause of your PhD.
00:24:17.580 | What kind of market sizing were you guys thinking?
00:24:21.560 | 'Cause you're here saying, "We have to double every month."
00:24:24.680 | And I'm like, "I don't know how you make
00:24:26.920 | "that conclusion from this," right?
00:24:29.720 | Especially also as a non-profit at the time.
00:24:31.920 | - Yeah, I think market size-wise,
00:24:34.960 | I felt like in this space where so much was changing
00:24:39.640 | and it was very unclear what of today
00:24:43.440 | was actually gonna be true tomorrow,
00:24:45.760 | we just really rested a lot on very, very simple
00:24:48.320 | fundamental principles, which is research is,
00:24:51.080 | if you can understand the truth,
00:24:52.480 | that is very economically beneficial, valuable,
00:24:55.320 | if you know the truth.
00:24:56.440 | - Just on principle, that's enough for you.
00:24:58.080 | - Yeah, research is the key to many breakthroughs
00:25:01.160 | that are very commercially valuable.
00:25:02.840 | 'Cause my version of it is students are poor
00:25:05.280 | and they don't pay for anything, right?
00:25:06.960 | But that's obviously not true, as you guys have found out.
00:25:09.200 | But I, you know, you had to have some market insight
00:25:12.600 | for me to have believed that, but I think you skipped that.
00:25:15.240 | - Yeah.
00:25:16.080 | - Yeah, we did encounter, I guess,
00:25:18.160 | talking to VCs for our seed round.
00:25:20.120 | A lot of VCs were like, "You know, researchers,
00:25:22.360 | "they don't have any money.
00:25:24.100 | "Why don't you build a legal assistant?"
00:25:27.320 | (laughing)
00:25:28.560 | And I think in some short-sighted way,
00:25:30.720 | maybe that's true, but I think in the long run,
00:25:34.680 | R&D is such a big space of the economy.
00:25:36.600 | I think if you can substantially improve
00:25:39.080 | how quickly people find new discoveries
00:25:42.560 | or avoid kind of controlled trials that don't go anywhere,
00:25:47.560 | I think that's just huge amounts of money.
00:25:49.640 | And there are a lot of questions, obviously,
00:25:51.400 | about between here and there,
00:25:53.040 | but I think as long as the fundamental principle is there,
00:25:55.840 | we were okay with that.
00:25:57.360 | And I guess we found some investors who also were.
00:26:00.200 | - Yeah, yeah, congrats.
00:26:01.480 | I mean, I'm sure we can cover the sort of flip later.
00:26:05.680 | Yeah, I think you were about to start us on GPT-3
00:26:08.240 | and how that changed things for you.
00:26:10.400 | It's funny, I guess every major GPT version,
00:26:12.800 | you have some big insight.
00:26:14.320 | - I think it's a little bit less true for us than for others
00:26:18.240 | because we always believe that there will basically be
00:26:21.880 | human-level machine work.
00:26:24.280 | And so it is definitely true that in practice,
00:26:27.120 | for your product, as new models come out,
00:26:30.000 | your product starts working better,
00:26:31.320 | you can add some features that you couldn't add before.
00:26:33.760 | But I don't think we really ever had the moment
00:26:37.920 | where we were like, oh, wow, that is super unanticipated.
00:26:42.200 | We need to do something entirely different now
00:26:44.080 | from what was on the roadmap.
00:26:46.600 | - I think GPT-3 was a big change 'cause it kind of said,
00:26:50.420 | oh, now is the time that we can use AI to build these tools.
00:26:54.640 | And then GPT-4 was maybe a little bit more
00:26:56.720 | of an extension of GPT-3.
00:26:58.480 | It felt less like a level shift.
00:26:59.760 | GPT-3 over GPT-2 was like qualitative level shift.
00:27:02.960 | And then GPT-4 was like, okay, great.
00:27:05.000 | Now it's like, we're much more accurate
00:27:07.680 | on these things,
00:27:08.800 | we can answer harder questions.
00:27:10.040 | But the shape of the product had already taken place
00:27:12.080 | by that time.
00:27:13.280 | - I kind of want to ask you about this sort of pivot
00:27:15.120 | that you've made, but I guess that was just a way
00:27:17.720 | to sell what you were doing,
00:27:18.920 | which is you're adding extra features
00:27:20.880 | on grouping by concepts.
00:27:22.700 | - When GPT-4-- - The GPT-4 pivot,
00:27:24.640 | quote-unquote pivot that you--
00:27:25.620 | - Oh, yeah, yeah, exactly.
00:27:27.000 | Right, right, right, yeah, yeah.
00:27:28.400 | When we launched this workflow,
00:27:30.200 | now that GPT-4 was available,
00:27:32.960 | basically, Elicit was at a place where,
00:27:36.360 | given a table of papers,
00:27:38.200 | we have very tabular interfaces,
00:27:39.680 | so given a table of papers,
00:27:40.960 | you can extract data across all the tables.
00:27:43.640 | But that's still, you kind of want to take the analysis
00:27:47.600 | a step further.
00:27:49.080 | And sometimes what you'd care about
00:27:50.520 | is not having a list of papers,
00:27:52.040 | but a list of arguments, a list of effects,
00:27:55.200 | a list of interventions, a list of techniques.
00:27:57.240 | And so that's one of the things we're working on
00:28:00.680 | is now that you've extracted this information
00:28:02.840 | in a more structured way,
00:28:03.720 | can you pivot it or group by whatever information
00:28:06.960 | that you extracted to have more insight-first information
00:28:11.040 | still supported by the academic literature?
00:28:13.120 | - Yeah, that was a big revelation when I saw it.
00:28:14.800 | Yeah, basically, I think I'm very just impressed
00:28:18.120 | by how first principles,
00:28:20.280 | your ideas around what the workflow is.
00:28:23.520 | And I think that's why you're not as reliant
00:28:27.400 | on the LLM improving,
00:28:29.000 | because actually it's just about improving the workflow
00:28:31.160 | that you would recommend to people.
00:28:33.120 | Today, we might call it an agent, I don't know,
00:28:35.160 | but you're not reliant on the LLM to drive it.
00:28:39.080 | It's relying on your sort of,
00:28:40.760 | this is the way that Elicit does research,
00:28:43.480 | and this is what we think is most effective
00:28:45.920 | based on talking to our users.
00:28:47.360 | - Yep, that's right.
00:28:48.200 | Yeah, I think the problem space is still huge.
00:28:52.160 | If it's this big, we are all still operating
00:28:55.120 | at this tiny bit of it.
00:28:57.200 | So I think about this a lot in the context of moats.
00:29:00.440 | People are like, "Oh, what's your moat?
00:29:01.440 | "What happens if GPT-5 comes out?"
00:29:03.040 | It's like, if GPT-5 comes out,
00:29:04.440 | there's still all of this other space that we can go into.
00:29:07.000 | And so I think being really obsessed with the problem,
00:29:09.920 | which is very, very big, has helped us stay robust
00:29:13.120 | and just kind of directly incorporate model improvements
00:29:15.440 | and then keep going.
00:29:16.280 | - And then I first encountered you guys with Charlie.
00:29:19.840 | You can tell us about that project.
00:29:22.000 | Basically, how much did cost become a concern
00:29:26.040 | as you're working more and more with OpenAI?
00:29:28.760 | How do you manage that relationship?
00:29:30.240 | - Let me talk about who Charlie is.
00:29:31.440 | - All right. - Sure, sure.
00:29:32.280 | - You can talk about the tech,
00:29:33.100 | 'cause Charlie is a special character.
00:29:34.840 | So Charlie, when we found him,
00:29:37.440 | had just finished his freshman year
00:29:39.000 | at the University of Warwick.
00:29:40.400 | I think he had heard about us on some discord,
00:29:42.560 | and then he applied, and we were like,
00:29:44.240 | "Wow, who is this freshman?"
00:29:45.520 | And then we just saw that he had done
00:29:46.580 | so many incredible side projects.
00:29:49.200 | And we were actually on a team retreat
00:29:51.040 | in Barcelona visiting our head of engineering at that time,
00:29:54.040 | and everyone was talking about this wonder kid.
00:29:56.000 | They're like, "This kid?"
00:29:56.840 | And then on our take-home project,
00:29:58.240 | he had done the best of anyone to that point.
00:30:02.280 | And so we were just so excited to hire him.
00:30:05.200 | So we hired him as an intern, and then we're like,
00:30:06.780 | "Charlie, what if he just dropped out of school?"
00:30:09.640 | And so then we convinced him to take a year off,
00:30:11.840 | and he's just incredibly productive.
00:30:13.660 | And I think the thing you're referring to is,
00:30:15.840 | at the start of 2023,
00:30:17.240 | Anthropic launched their constitutional AI paper,
00:30:20.740 | and within a few days, I think four days,
00:30:23.080 | he had basically implemented that in production,
00:30:25.280 | and then we had it in app a week or so after that.
00:30:28.920 | And he has since contributed to major improvements,
00:30:31.840 | like cutting costs down to a tenth of what they were.
00:30:36.000 | It's really large-scale,
00:30:36.920 | but yeah, you can talk about the technical stuff.
00:30:40.000 | - Yeah, on the constitutional AI project,
00:30:41.840 | this was for abstract summarization,
00:30:44.400 | where in Elicit, if you run a query,
00:30:48.800 | it'll return papers to you,
00:30:50.160 | and then it will summarize each paper
00:30:51.880 | with respect to your query for you on the fly.
00:30:54.640 | And that's a really important part of Elicit,
00:30:57.200 | because Elicit does it so much.
00:30:59.520 | If you run a few searches,
00:31:01.320 | it'll have done it a few hundred times for you.
00:31:03.560 | And so we cared a lot about this both being fast, cheap,
00:31:07.720 | and also very low on hallucination.
00:31:10.760 | I think if Elicit hallucinates something
00:31:12.720 | about the abstract, that's really not good.
00:31:15.040 | And so what Charlie did in that project
00:31:17.420 | was create a constitution that expressed
00:31:20.680 | what are the attributes of a good summary.
00:31:23.000 | It's like everything in the summary is reflected
00:31:26.440 | in the actual abstract,
00:31:28.920 | and it's very concise, et cetera, et cetera.
00:31:32.080 | And then used RLHF with a model
00:31:37.000 | that was trained on the constitution
00:31:39.120 | to basically fine-tune a better summarizer.
00:31:44.120 | - And an open-source model, I think.
00:31:46.280 | - On an open-source model, yeah.
00:31:48.080 | I think that might still be in use.
00:31:51.080 | - Yeah, yeah, definitely.
00:31:52.080 | Yeah, I think at the time,
00:31:53.400 | the models hadn't been trained at all
00:31:55.320 | to be faithful to a text.
00:31:57.360 | So they were just generating.
00:31:58.440 | So then when you ask them a question,
00:32:00.240 | they tried too hard to answer the question
00:32:03.200 | and didn't try hard enough to answer the question
00:32:05.800 | given the text or answer what the text said
00:32:07.780 | about the question.
00:32:08.620 | So we had to basically teach the models
00:32:10.020 | to do that specific task.
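As a rough illustration of how a written constitution can be used to improve summaries, here is a sketch of a critique-and-revise loop in that spirit. The principles, prompts, and `llm` stub are placeholders; the work described above used RLHF against a constitution to fine-tune an open-source summarizer, so treat this only as a conceptual sketch, not the actual implementation.

```python
# Sketch of a constitution-guided critique-and-revise loop for query-focused
# abstract summarization, in the spirit of Constitutional AI. The principles
# and prompts are illustrative; `llm` is a stub for whatever model you call.
CONSTITUTION = [
    "Every claim in the summary must be supported by the abstract.",
    "The summary must address the user's query, not a different question.",
    "The summary must be a single concise sentence.",
]

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in a real model call here")

def summarize_with_revisions(query: str, abstract: str, rounds: int = 2) -> str:
    summary = llm(
        f"Query: {query}\nAbstract: {abstract}\n"
        "Summarize the abstract with respect to the query."
    )
    for _ in range(rounds):
        critique = llm(
            "Critique the summary against each principle:\n"
            + "\n".join(f"- {p}" for p in CONSTITUTION)
            + f"\nAbstract: {abstract}\nSummary: {summary}"
        )
        summary = llm(
            "Revise the summary to address the critique, using only facts "
            f"from the abstract.\nCritique: {critique}\n"
            f"Abstract: {abstract}\nSummary: {summary}"
        )
    # Revised summaries like these can then serve as training targets for a
    # smaller, cheaper summarizer.
    return summary
```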
00:32:11.680 | - How do you monitor the ongoing performance of your models?
00:32:16.680 | Not to get too LLM-opsy,
00:32:18.680 | but you are one of the larger,
00:32:22.480 | more well-known operations doing NLP at scale.
00:32:25.280 | I guess, effectively, you have to monitor these things,
00:32:29.100 | and nobody has a good answer that I can talk to.
00:32:31.920 | - Yeah, I don't think we have a good answer yet.
00:32:33.760 | (all laughing)
00:32:35.080 | I think the answers are actually a little bit clearer
00:32:36.800 | on the just kind of basic robustness side,
00:32:40.240 | so I think where you can import ideas
00:32:42.840 | from normal software engineering
00:32:45.720 | and normal kind of DevOps.
00:32:47.640 | You're like, well, you need to monitor
00:32:49.160 | kind of latencies and response times and uptime and whatnot.
00:32:52.000 | - I think when we say performance,
00:32:53.120 | it's more about hallucination rate.
00:32:54.920 | - And then things like hallucination rate
00:32:57.040 | where I think there the really important thing
00:32:59.880 | is training time.
00:33:02.360 | So we care a lot about having our own internal benchmarks
00:33:07.680 | for model development that reflect the distribution
00:33:11.920 | of user queries so that we can know ahead of time
00:33:15.480 | how well is the model gonna perform
00:33:17.460 | on different types of tasks,
00:33:18.600 | so the tasks being summarization, question answering,
00:33:21.800 | given a paper, ranking.
00:33:23.800 | And for each of those, we wanna know
00:33:25.400 | what's the distribution of things the model is gonna see
00:33:28.360 | so that we can have well-calibrated predictions
00:33:32.560 | on how well the model's gonna do in production.
00:33:34.640 | And I think, yeah, there's some chance
00:33:36.160 | that there's distribution shift and actually the things
00:33:38.520 | users enter are gonna be different,
00:33:40.680 | but I think that's much less important
00:33:42.560 | than getting the kind of training right
00:33:44.520 | and having very high-quality,
00:33:46.560 | well-vetted data sets at training time.
00:33:49.000 | - I think we also end up effectively monitoring
00:33:51.260 | by trying to evaluate new models as they come out.
00:33:53.500 | And so that kind of prompts us to go through
00:33:56.380 | our eval suite every couple of months.
00:33:58.080 | And then, yeah, and so every time a new model comes out,
00:34:01.080 | we have to see like, okay, which one is,
00:34:03.240 | how is this performing relative to production
00:34:04.920 | and what we currently have?
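A minimal sketch of the kind of evaluation harness implied here: run each candidate model over a vetted benchmark that mirrors the production task mix, and compare per-task scores before promoting anything. The exact-match metric and the `Model` type are stand-ins; real metrics would include things like hallucination rate and ranking quality.

```python
# Sketch of a per-task eval harness for comparing candidate models against a
# fixed, well-vetted benchmark. Exact match is a placeholder metric.
from typing import Callable, Dict, List

Example = Dict[str, str]      # {"task": ..., "input": ..., "reference": ...}
Model = Callable[[str], str]  # text in, text out

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())

def evaluate(model: Model, benchmark: List[Example]) -> Dict[str, float]:
    per_task: Dict[str, List[float]] = {}
    for ex in benchmark:
        score = exact_match(model(ex["input"]), ex["reference"])
        per_task.setdefault(ex["task"], []).append(score)
    return {task: sum(scores) / len(scores) for task, scores in per_task.items()}

# Usage (hypothetical models): compare production vs. a new candidate.
# results = {name: evaluate(m, benchmark)
#            for name, m in {"production": prod_model, "candidate": new_model}.items()}
```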
00:34:06.440 | - Yeah, I mean, since we're on this topic,
00:34:08.800 | any new models have really caught your eye this year?
00:34:11.280 | Like, Cloud came out with a bunch.
00:34:12.120 | - Cloud, yeah, I think Cloud is pretty,
00:34:13.680 | I think the team's pretty excited about Cloud.
00:34:15.720 | - Yeah, specifically, I think Cloud Haiku
00:34:18.920 | is like a good point on the kind of Pareto frontier.
00:34:22.600 | So I think it's like, it's not the,
00:34:24.840 | it's neither the cheapest model,
00:34:26.160 | nor is it the most accurate, most high-quality model,
00:34:30.680 | but it's just like a really good trade-off
00:34:32.400 | between cost and accuracy.
00:34:34.800 | You apparently have to 10-shot it to make it good.
00:34:37.920 | I tried using Haiku for summarization,
00:34:40.080 | but zero-shot was not great.
00:34:42.800 | And yeah, then they were like, you know,
00:34:45.240 | it's a skill issue, you have to try harder.
00:34:47.440 | - Interesting.
00:34:48.280 | - Yeah, we also used, I think, GPT-4 unlocked process.
00:34:51.760 | - Turbo?
00:34:53.320 | - Yeah, yeah, it unlocked tables for us,
00:34:58.160 | processing data from tables, which was huge.
00:35:00.200 | - GPT-4 Vision.
00:35:01.040 | - Yeah.
00:35:02.040 | - Yeah, did you try like Fuyu?
00:35:03.360 | I guess you can't try Fuyu, 'cause it's non-commercial.
00:35:06.400 | That's the adept model.
00:35:07.600 | - Yeah, we haven't tried that one.
00:35:08.640 | - Yeah, yeah, yeah.
00:35:09.560 | But Claude is multimodal as well.
00:35:11.240 | - Yeah.
00:35:12.080 | - I think the interesting insight that we got
00:35:13.880 | from talking to David Luan, who is CEO of Adept,
00:35:16.560 | was that multimodality has effectively
00:35:20.120 | two different flavors.
00:35:20.960 | Like one is the, we recognize images from a camera
00:35:24.220 | in the outside natural world.
00:35:26.280 | And actually, the more important multimodality
00:35:28.980 | for knowledge work is screenshots.
00:35:31.200 | And, you know, PDFs and charts and graphs.
00:35:34.240 | - Yeah, yeah, mm-hmm.
00:35:35.760 | - So we need a new term for that kind of multimodality.
00:35:38.240 | - Yeah.
00:35:39.080 | - But is the claim that current models
00:35:40.680 | are good at one or the other?
00:35:42.440 | - No, they're over-indexed, 'cause the history
00:35:44.200 | of computer vision is COCO, right?
00:35:46.760 | So now we're like, oh, actually, you know,
00:35:49.240 | screens are more important.
00:35:50.640 | - Yeah, processing weird handwriting and stuff.
00:35:52.120 | - OCR, yeah, handwriting, yeah.
00:35:54.120 | You mentioned a lot of like closed model lab stuff,
00:35:57.840 | and then you also have like this open source model
00:36:00.720 | fine-tuning stuff.
00:36:01.560 | Like what is your workload now between closed and open?
00:36:04.200 | - It's a good question.
00:36:05.040 | - Is it half and half?
00:36:05.880 | - I think it's--
00:36:07.360 | - Is that even a relevant question,
00:36:08.740 | or is this a nonsensical question?
00:36:10.760 | - It depends a little bit on like how you index,
00:36:12.600 | whether you index by like computer cost
00:36:14.540 | or number of queries.
00:36:15.960 | I'd say like in terms of number of queries,
00:36:18.520 | it's maybe similar.
00:36:19.440 | In terms of like cost and compute,
00:36:21.360 | I think the closed models make up more of the budget,
00:36:24.800 | since the main cases where you want to use closed models
00:36:28.320 | are cases where they're just smarter,
00:36:31.820 | where no existing open source models are quite smart enough.
00:36:36.240 | - We have a lot of interesting technical questions
00:36:38.520 | to go in, but just to wrap the kind of like UX evolution,
00:36:42.680 | now you have the notebooks.
00:36:44.320 | We talked a lot about how chatbots
00:36:46.720 | are not the final frontier, you know?
00:36:50.020 | How did you decide to get into notebooks,
00:36:52.560 | which is a very iterative, kind of like interactive
00:36:55.160 | interface, and yeah, maybe learnings from that?
00:36:57.720 | - Yeah, this is actually our fourth time
00:37:00.000 | trying to make this work.
00:37:01.840 | I think the first time was probably in early 2021.
00:37:06.160 | At the time we built something,
00:37:07.480 | I think because we've always been obsessed
00:37:09.600 | with this idea of task decomposition and like branching,
00:37:13.200 | we always wanted a way, a tool that could be kind of
00:37:17.200 | unbounded where you could keep going,
00:37:19.600 | where you could do a lot of branching,
00:37:20.780 | where you could kind of apply language model operations
00:37:23.980 | or computations on other tasks.
00:37:26.080 | So in 2021, we had this thing called composite tasks
00:37:28.840 | where you could use GPT-3 to brainstorm
00:37:31.180 | a bunch of research questions,
00:37:32.560 | and then take each research question
00:37:34.240 | and decompose those further into sub questions.
00:37:37.320 | And this kind of, again, that like task decomposition
00:37:40.200 | tree type thing was always very exciting to us,
00:37:43.440 | but that was like, it didn't work
00:37:44.600 | and it was kind of overwhelming.
00:37:46.840 | Then at the end of '22, I think we tried again.
00:37:50.080 | And at that point we were thinking,
00:37:51.280 | okay, we've done a lot with this literature review thing.
00:37:53.720 | We also want to start helping with kind of adjacent domains
00:37:56.360 | and different workflows.
00:37:57.480 | Like we want to help more with machine learning.
00:37:59.500 | What does that look like?
00:38:00.640 | And as we were thinking about it, we're like, well,
00:38:02.560 | there are so many research workflows.
00:38:04.280 | Like how do we not just build kind of three new workflows
00:38:07.760 | into Elicit, but make Elicit really generic
00:38:10.200 | to lots of workflows?
00:38:11.120 | What is like a generic composable system
00:38:13.640 | with nice abstractions that can like scale
00:38:16.060 | to all these workflows?
00:38:17.640 | So we like iterated on that a bunch
00:38:19.320 | and then didn't quite narrow the problem space enough
00:38:22.440 | or like quite get to what we wanted.
00:38:25.200 | And then I think it was at the beginning of 2023,
00:38:28.600 | where we're like, wow, computational notebooks
00:38:30.440 | kind of enable this, where they have a lot of flexibility,
00:38:34.040 | but kind of robust primitives,
00:38:35.720 | such that you can extend the workflow.
00:38:38.320 | And it's not limited.
00:38:39.580 | It's not like you ask a query, you get an answer,
00:38:41.260 | you're done.
00:38:42.100 | You can just constantly keep building on top of that.
00:38:44.600 | And each little step seems like a really good
00:38:47.240 | kind of unit of work for the language model.
00:38:50.240 | So that's, and also there was just like really helpful
00:38:52.960 | to have a bit more kind of pre-existing work to emulate.
00:38:57.960 | So that was, yeah, that's kind of how we ended up
00:39:00.480 | at computational notebooks for Elicit.
00:39:03.000 | - Maybe one thing that's worth making explicit
00:39:05.600 | is the difference between computational notebooks and chat,
00:39:08.120 | because on the surface, they seem pretty similar.
00:39:10.040 | It's kind of this iterative interaction
00:39:11.560 | where you add stuff and it's almost like in both cases,
00:39:15.640 | you have a back and forth between you enter stuff
00:39:17.560 | and then you get some output and then you enter stuff.
00:39:20.140 | But the important difference in our minds is
00:39:23.240 | with notebooks, you can define a process.
00:39:26.160 | So in data science, you can be like,
00:39:28.920 | here's like my data analysis process that takes in a CSV
00:39:31.520 | and then does some extraction
00:39:32.720 | and then generates a figure at the end.
00:39:34.960 | And you can prototype it using a small CSV
00:39:37.680 | and then you can run it over a much larger CSV later.
00:39:40.560 | And similarly, the vision for notebooks in our case
00:39:43.920 | is to not make it this like one-off chat interaction,
00:39:47.060 | but to allow you to then say kind of,
00:39:50.400 | if you start and first you're like,
00:39:52.680 | okay, let me just analyze a few papers
00:39:54.440 | and see do I get to the correct like conclusions
00:39:57.560 | for those few papers?
00:39:59.160 | Can I then later go back and say,
00:40:00.640 | now let me run this over 10,000 papers
00:40:04.440 | now that I've debugged the process using a few papers?
00:40:07.560 | And that's an interaction that doesn't fit quite as well
00:40:10.200 | into the chat framework,
00:40:11.320 | because that's more for kind of quick
00:40:13.400 | back and forth interaction.
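A sketch of the "prototype on a few papers, then rerun the same process at scale" pattern described above. This is hypothetical code, not Elicit's implementation; the point is that a process is an ordered list of steps you can debug on a small input and later apply to a much larger one:

```python
# Sketch of the notebook-as-process idea: each step is a small, checkable
# unit of work; the same list of steps runs over 3 papers or 10,000.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    fn: Callable[[dict], dict]  # takes one paper record, returns an enriched record

def run_process(steps: list[Step], papers: list[dict]) -> list[dict]:
    results = []
    for paper in papers:
        record = dict(paper)
        for step in steps:
            record = step.fn(record)  # each step can be inspected and debugged on its own
        results.append(record)
    return results

# Placeholder steps; in practice each would call a model or a search index.
extract_sample_size = Step("sample_size", lambda r: {**r, "sample_size": "n=42 (stub)"})
summarize = Step("summary", lambda r: {**r, "summary": f"Summary of {r['title']} (stub)"})

process = [extract_sample_size, summarize]
pilot = run_process(process, [{"title": "Pilot paper"}])  # debug on a few papers first
# full = run_process(process, ten_thousand_papers)        # then rerun the same process at scale
```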
00:40:15.560 | - Do you think of notebooks as kind of like structured,
00:40:19.020 | editable chain of thought, basically step by step,
00:40:22.060 | like is that kind of where you see this going
00:40:24.500 | and then are people gonna reuse notebooks as like templates
00:40:28.180 | and maybe in traditional notebooks,
00:40:29.700 | it's like cookbooks, right?
00:40:30.780 | You share a cookbook, you can start from there.
00:40:33.220 | Is that similar in Elicit?
00:40:35.180 | - Yeah, that's exactly right.
00:40:36.500 | So that's our hope that people will build templates,
00:40:39.060 | share them with other people.
00:40:40.760 | I think chain of thought is maybe still like kind of
00:40:43.780 | one level lower on the abstraction hierarchy
00:40:46.780 | than we would think of notebooks.
00:40:48.460 | I think we'll probably want to think about
00:40:50.060 | more semantic pieces like a building block
00:40:52.660 | is more like a paper search or an extraction
00:40:56.480 | or a list of concepts.
00:40:59.660 | And then the model's detailed reasoning
00:41:03.420 | will probably often be one level down.
00:41:05.380 | You always want to be able to see it,
00:41:06.740 | but you don't always want it to be front and center.
00:41:09.500 | - Yeah, what's the difference between a notebook
00:41:11.540 | and an agent?
00:41:12.420 | Since everybody always asks me, what's an agent?
00:41:14.460 | Like, how do you think about where the line is?
00:41:17.020 | - In the notebook world,
00:41:18.220 | I would generally think of the human as the agent
00:41:21.380 | in the first iteration.
00:41:22.260 | So you have the notebook and the human kind of adds
00:41:24.460 | little action steps.
00:41:25.780 | And then the next point on this kind of progress gradient is,
00:41:30.780 | okay, now you can use language models to predict
00:41:33.060 | which action would you take as a human.
00:41:35.020 | And at some point,
00:41:35.840 | you're probably gonna be very good at this.
00:41:36.900 | You'll be like, okay, I can like,
00:41:38.020 | in some cases I can with 100%, 99.9% accuracy
00:41:41.460 | predict what you do.
00:41:42.700 | And then you might as well just execute it.
00:41:44.260 | Like, why wait for the human?
00:41:46.060 | And eventually, as you get better at this,
00:41:48.260 | that will just look more and more like agents taking actions
00:41:52.420 | as opposed to you doing the thing.
00:41:54.440 | And I think templates are a specific case of this
00:41:59.580 | where you're like, okay, well,
00:42:00.420 | there's just particular sequences of actions
00:42:02.820 | that you often wanna chunk and have available as primitives,
00:42:06.500 | just like in normal programming.
00:42:08.220 | And those are, you can view them as action sequences
00:42:11.980 | of agents or you can view them as more
00:42:14.140 | like the normal programming language abstraction thing.
00:42:17.540 | And I think those are two valid views.
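One way to picture that progression from human-as-agent to agent-taking-actions is a simple confidence gate: the model proposes the next notebook step, and it only auto-executes once its predictions are reliable enough. The sketch below is hypothetical, with a stubbed proposal function and a made-up threshold:

```python
# Sketch of the "human as the agent, model gradually takes over" gradient.
AUTO_EXECUTE_THRESHOLD = 0.99  # illustrative; a real system would tune this

def propose_next_step(history: list[str]) -> tuple[str, float]:
    # Placeholder: a real system would ask a model to predict the user's next
    # action given the notebook history, with a calibrated confidence score.
    return ("extract sample sizes from the current paper list", 0.62)

def next_action(history: list[str]) -> str:
    step, confidence = propose_next_step(history)
    if confidence >= AUTO_EXECUTE_THRESHOLD:
        return f"auto-executing: {step}"            # agent-like behavior
    return f"suggesting (needs confirmation): {step}"  # human stays in the loop

print(next_action(["searched for papers on creatine", "added a summary column"]))
```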
00:42:20.840 | - Yeah, how do you see this changes?
00:42:22.980 | Like you said, the models get better
00:42:24.300 | and you need less and less human actual interfacing
00:42:27.660 | with the model, you just get the results.
00:42:29.220 | Like how does the UX and the way people perceive it change?
00:42:34.820 | - Yeah, I think this kind of interaction paradigm
00:42:38.320 | for evaluation is not really something
00:42:40.060 | the internet has encountered yet,
00:42:41.420 | because right now, up to now,
00:42:42.560 | the internet has all been about like getting data
00:42:45.020 | and work from people.
00:42:46.680 | But so increasingly, yeah, I really want kind of evaluation
00:42:51.700 | both from an interface perspective
00:42:53.460 | and from like a technical perspective
00:42:55.340 | and operation perspective to be a power,
00:42:57.180 | superpower for Elicit, 'cause I think over time,
00:42:59.180 | models will do more and more of the work
00:43:01.000 | and people will have to do more and more of the evaluation.
00:43:03.980 | So I think, yeah, in terms of the interface,
00:43:06.140 | some of the things we have today are,
00:43:08.380 | for every kind of language model generation,
00:43:10.140 | there's some citation back and we kind of directly,
00:43:13.020 | we try to highlight the ground truth in the paper
00:43:16.940 | that is most relevant to whatever Elicit said
00:43:19.260 | and make it super easy so that you can click on it
00:43:21.020 | and quickly see in context and validate
00:43:24.180 | whether the text actually supports
00:43:25.740 | the answer that Elicit gave.
00:43:27.300 | So I think we'd probably want to scale things up like that,
00:43:30.420 | like the ability to kind of spot check
00:43:32.780 | the model's work super quickly,
00:43:34.300 | scale up interfaces like that and--
00:43:37.100 | - Who would spot check, the user?
00:43:39.140 | - Yeah, to start it would be the user.
00:43:41.220 | One of the other things we do is also kind of flag
00:43:43.460 | the model's uncertainty.
00:43:44.940 | So we have models report out, how confident are you
00:43:47.540 | that this was the sample size of this study?
00:43:50.080 | The model's not sure, we throw a flag
00:43:51.780 | and so the user knows to prioritize checking that.
00:43:54.500 | So again, we can kind of scale that up.
00:43:56.500 | So when the model's like, well,
00:43:57.940 | I went and searched for Google,
00:43:59.300 | I searched this on Google,
00:44:00.460 | I'm not sure if that was the right thing,
00:44:01.740 | we have an uncertainty flag and the user can go
00:44:03.580 | and be like, oh, okay,
00:44:04.420 | that was actually the right thing to do or not.
00:44:07.380 | - So I've tried to do uncertainty readings from models.
00:44:11.260 | I don't know if you have this live, you do, okay.
00:44:14.620 | 'Cause I just didn't find them reliable
00:44:16.260 | 'cause they just hallucinated their own uncertainty.
00:44:18.580 | I would love to base it on log probs
00:44:21.780 | or something more native within the model
00:44:23.940 | rather than generated.
00:44:25.440 | But okay, it sounds like they scale properly for you.
00:44:28.940 | - Yeah, we found it to be pretty calibrated.
00:44:30.820 | It varies by model.
00:44:32.380 | - Okay, yeah, I think in some cases,
00:44:34.080 | we also used two different models
00:44:35.420 | for the uncertainty estimates
00:44:36.740 | than for the question answering.
00:44:38.260 | So one model would say, here's my chain of thought,
00:44:40.820 | here's my answer, and then a different type of model.
00:44:43.180 | Let's say the first model is Llama
00:44:45.700 | and let's say the second model is GPT-3.5,
00:44:48.100 | could be different.
00:44:49.140 | And then the second model just looks over the results
00:44:54.060 | and like, okay, how confident are you in this?
00:44:56.980 | And I think sometimes using a different model
00:44:59.540 | can be better than using the same model.
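A minimal sketch of that two-model pattern, with both calls stubbed out; the prompts, threshold, and flag are illustrative, not Elicit's internals:

```python
# Sketch: one model answers with its reasoning, a *different* model grades
# confidence, and low-confidence extractions are flagged for user spot-checking.

def answer_model(question: str, excerpt: str) -> dict:
    # Placeholder for the model that does the extraction and shows its reasoning.
    return {"reasoning": "The methods section reports 120 participants.",
            "answer": "120"}

def verifier_model(question: str, excerpt: str, answer: str) -> float:
    # Placeholder for a second model that only judges confidence, on a 0-1 scale.
    return 0.55

def extract_with_flag(question: str, excerpt: str, threshold: float = 0.8) -> dict:
    result = answer_model(question, excerpt)
    confidence = verifier_model(question, excerpt, result["answer"])
    result["confidence"] = confidence
    result["needs_review"] = confidence < threshold  # surfaced as a flag in the UI
    return result

print(extract_with_flag("What was the sample size?", "…excerpt from the paper…"))
```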
00:45:01.540 | - Yeah, on the topic of models, evaluating models,
00:45:04.980 | obviously you can do that all day long.
00:45:07.100 | Like what's your budget?
00:45:08.300 | Because your queries fan out a lot
00:45:13.820 | and then you have models evaluating models.
00:45:16.780 | One person typing in a question
00:45:18.740 | can lead to a thousand calls.
00:45:21.340 | - It depends on the project.
00:45:23.860 | So if the project is basically a systematic review
00:45:29.860 | that otherwise human research assistants would do,
00:45:32.100 | then the project is basically a human equivalent spend
00:45:35.540 | and the spend can get quite large for those projects.
00:45:38.740 | Certainly, I don't know, let's say $100,000.
00:45:42.980 | - For the project, yeah.
00:45:43.820 | - Yeah.
00:45:45.020 | So in those cases, you're happier to spend compute
00:45:48.760 | than in the kind of shallow search case
00:45:51.020 | where someone just enters a question because,
00:45:53.180 | I don't know, maybe--
00:45:54.260 | - Feel like it.
00:45:55.100 | - I heard about creatine, what's it about?
00:45:57.380 | Probably don't want to spend a lot of compute on that.
00:46:00.380 | And this sort of being able to invest more or less compute
00:46:05.220 | into getting more or less accurate answers
00:46:07.060 | is I think one of the core things we care about
00:46:09.540 | and that I think is currently undervalued in the AI space.
00:46:12.900 | I think currently, you can choose which model you want
00:46:15.780 | and you can sometimes tell it, I don't know,
00:46:18.380 | you'll tip it and it'll try harder
00:46:21.140 | or you can try various things to get it to work harder.
00:46:24.180 | But you don't have great ways of converting
00:46:27.340 | willingness to spend into better answers
00:46:29.420 | and we really want to build a product
00:46:30.820 | that has this sort of unbounded flavor
00:46:32.980 | where I mean, as much as you care about,
00:46:35.500 | if you care about it a lot,
00:46:36.580 | you should be able to get really high quality answers,
00:46:39.860 | really double checked in every way.
00:46:41.900 | - Yeah.
00:46:42.740 | - And you have a credits-based pricing.
00:46:44.980 | So unlike most products, it's not a fixed monthly fee.
00:46:47.380 | - Right, exactly.
00:46:48.420 | So some of the higher-cost features are tiered.
00:46:51.820 | So for most casual users,
00:46:54.380 | they'll just get the abstract summary,
00:46:55.740 | which is kind of an open source model.
00:46:58.180 | Then you can add more columns which have more extractions
00:47:01.340 | and these uncertainty features
00:47:02.540 | and then you can also add those same columns
00:47:04.060 | in high accuracy mode, which also parses the table.
00:47:06.660 | So we kind of stack the complexity on the cost.
00:47:09.900 | - You know the fun thing you can do with a credit system,
00:47:12.020 | which is data for data or I don't know what I mean by that.
00:47:16.700 | Basically, you can give people more credits
00:47:18.300 | if they give data back to you.
00:47:20.500 | I don't know if you've already done that.
00:47:21.500 | - I've thought about,
00:47:22.340 | we've thought about something like this.
00:47:23.460 | It's like, if you don't have money, but you have time,
00:47:26.540 | how do you exchange that?
00:47:28.460 | - It's a fair trade.
00:47:29.300 | - Yeah, I think it's interesting.
00:47:30.380 | We haven't quite operationalized it
00:47:31.700 | and then there's been some kind of adverse selection.
00:47:35.100 | For example, it would be really valuable
00:47:36.340 | to get feedback on our models.
00:47:37.620 | So maybe if you were willing to give more robust feedback
00:47:40.580 | on our results, we could give you credits
00:47:42.180 | or something like that.
00:47:43.020 | But then there's kind of this,
00:47:44.380 | will people take it seriously?
00:47:45.580 | - Yeah, you want the good people.
00:47:46.420 | - Exactly.
00:47:47.300 | - Can you tell who are the good people?
00:47:49.340 | - Not right now, but yeah,
00:47:50.180 | maybe at the point where we can,
00:47:51.340 | we can offer it.
00:47:52.180 | We can offer it up to them.
00:47:53.020 | - The perplexity of questions asked,
00:47:55.380 | if it's higher perplexity,
00:47:56.220 | these are smarter people.
00:47:57.060 | - Yeah, maybe.
00:47:58.300 | - If you make a lot of typos in your queries,
00:48:00.260 | you're not gonna get the on-the-house exchange.
00:48:02.340 | (all laughing)
00:48:04.380 | - Negative social credit.
00:48:05.980 | It's very topical right now to think about
00:48:08.100 | the threat of long context windows.
00:48:10.980 | All these models that we're talking about these days,
00:48:12.980 | all like a million token plus.
00:48:14.820 | Is that relevant for you?
00:48:16.740 | Can you make use of that?
00:48:17.660 | Is that just prohibitively expensive
00:48:19.660 | 'cause you're just paying for all those tokens
00:48:21.620 | or you're just doing rag?
00:48:22.820 | - It's definitely relevant.
00:48:23.860 | And when we think about search,
00:48:26.460 | I think as many people do,
00:48:27.820 | we think about kind of a staged pipeline of retrieval
00:48:30.980 | where first you use a kind of semantic search database
00:48:35.500 | with embeddings,
00:48:36.340 | get like the, in our case,
00:48:37.740 | maybe 400 or so most relevant papers.
00:48:40.300 | And then you still need to rank those.
00:48:42.180 | And I think at that point,
00:48:43.820 | it becomes pretty interesting to use larger models.
00:48:47.500 | So specifically in the past,
00:48:50.100 | I think a lot of ranking was kind of per item ranking
00:48:53.340 | where you would score each individual item,
00:48:55.300 | maybe using increasingly expensive scoring methods
00:48:58.500 | and then rank based on the scores.
00:49:00.620 | But I think list-wise re-ranking
00:49:02.180 | where you have a model that can see all the elements
00:49:04.580 | is a lot more powerful
00:49:06.140 | because often you can only really tell how good a thing is
00:49:09.220 | in comparison to other things.
00:49:10.980 | And what things should come first,
00:49:13.140 | it really depends on like,
00:49:14.500 | well, what other things are available,
00:49:15.660 | maybe you even care about diversity in your results.
00:49:17.820 | You don't wanna show like 10 very similar papers
00:49:21.140 | as the first 10 results.
00:49:22.460 | So I think the long context models
00:49:24.780 | are quite interesting there.
00:49:26.820 | And especially for our case
00:49:28.580 | where we care more about power users
00:49:31.820 | who are perhaps a little bit more willing
00:49:33.740 | to wait a little bit longer
00:49:34.740 | to get higher quality results
00:49:36.060 | relative to people who just quickly check out things
00:49:40.220 | because why not?
00:49:41.820 | I think being able to spend more on longer contexts
00:49:44.540 | is quite valuable.
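A rough sketch of that staged pipeline: a cheap embedding search narrows 200 million papers down to a few hundred candidates, and then a single long-context call re-ranks the whole list at once. The embedding and ranking functions here are stubs under stated assumptions, not the real system's components:

```python
# Sketch of staged retrieval: embedding search first, list-wise re-ranking second.
import math
import random

def embed(text: str) -> list[float]:
    random.seed(hash(text) % (2**32))  # deterministic stub in place of a real embedding model
    return [random.random() for _ in range(8)]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query: str, papers: list[str], k: int = 400) -> list[str]:
    q = embed(query)
    return sorted(papers, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]

def listwise_rerank(query: str, candidates: list[str]) -> list[str]:
    # Placeholder for one long-context model call that sees *all* candidates
    # together, so it can trade off relevance against diversity across the list.
    return candidates  # a real model would return a reordered list

papers = [f"Paper {i} about creatine and cognition" for i in range(1000)]
query = "does creatine improve memory?"
top = listwise_rerank(query, retrieve(query, papers))
```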
00:49:46.340 | - Yeah, I think one thing
00:49:47.340 | the longer context models changed for us
00:49:49.460 | is maybe a focus from breaking down tasks
00:49:52.700 | to breaking down the evaluation.
00:49:54.860 | So before, if we wanted to answer a question
00:49:59.500 | from the full text of a paper,
00:50:01.380 | we had to figure out how to chunk it
00:50:02.940 | and like find the relevant chunk
00:50:04.300 | and then answer based on that chunk.
00:50:06.020 | And the nice thing was then,
00:50:07.260 | you know kind of which chunk the model used
00:50:09.100 | to answer the question.
00:50:09.940 | So if you want to help the user track it,
00:50:11.660 | yeah, you can be like,
00:50:12.580 | well, this was the chunk that the model got.
00:50:14.740 | And now if you put the whole text in the paper,
00:50:16.860 | you have to go back,
00:50:17.780 | you have to like kind of find the chunk
00:50:19.780 | like more retroactively basically.
00:50:21.620 | And so you need kind of like a different set of abilities
00:50:24.380 | and obviously like a different technology to figure out.
00:50:26.660 | You still want to point the user
00:50:28.660 | to the supporting quotes in the text,
00:50:30.300 | but then like the interaction is a little different.
00:50:33.060 | - You like scan through and find some ROUGE score.
00:50:35.500 | - Yeah.
00:50:36.340 | - Ceiling or floor.
00:50:38.500 | - Yeah, I think there's an interesting space
00:50:41.340 | of almost research problems here
00:50:44.060 | because you would ideally make causal claims.
00:50:46.300 | Like if this hadn't been in the text,
00:50:48.260 | the model wouldn't have said this thing.
00:50:49.940 | And maybe you can do expensive approximations to that
00:50:52.500 | where like, I don't know,
00:50:53.340 | you just throw a chunk off the paper
00:50:55.460 | and re-answer and see what happens.
00:50:57.300 | But hopefully there are better ways of doing that
00:51:00.500 | where you just get that kind of counterfactual information
00:51:05.140 | for free from the model.
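The "expensive approximation" mentioned here would look roughly like the ablation loop below: drop one chunk at a time, re-ask the question, and see whether the answer changes. The answering call is stubbed; in practice each ablation would be another model call:

```python
# Sketch of counterfactual attribution by chunk ablation (hypothetical code).

def answer(question: str, chunks: list[str]) -> str:
    # Placeholder: pretend the answer depends on whether the methods chunk is present.
    return "120" if any("120 participants" in c for c in chunks) else "not reported"

def attribute_by_ablation(question: str, chunks: list[str]) -> list[int]:
    """Return indices of chunks whose removal changes the answer."""
    baseline = answer(question, chunks)
    influential = []
    for i in range(len(chunks)):
        ablated = chunks[:i] + chunks[i + 1:]
        if answer(question, ablated) != baseline:
            influential.append(i)  # this chunk was (likely) load-bearing for the answer
    return influential

chunks = ["Abstract…", "Methods: we recruited 120 participants…", "Discussion…"]
print(attribute_by_ablation("What was the sample size?", chunks))
```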
00:51:06.700 | - Do you think at all about the cost of maintaining RAG
00:51:09.980 | versus just putting more tokens in the window?
00:51:12.860 | I think in software development,
00:51:14.300 | a lot of times people buy developer productivity things
00:51:17.740 | so that we don't have to worry about it.
00:51:19.940 | Context window is kind of the same, right?
00:51:21.340 | You have to maintain chunking and like RAG retrieval
00:51:24.140 | and like re-ranking and all of this
00:51:25.580 | versus I just shove everything into the context
00:51:28.380 | and like it costs a little more,
00:51:29.460 | but at least I don't have to do all of that.
00:51:31.740 | Is that something you thought about at all?
00:51:33.340 | - I think we still like hit up against context limits enough
00:51:38.060 | that like, it's not really,
00:51:40.660 | do we still want to keep this RAG around?
00:51:42.180 | It's like we do still need it
00:51:43.460 | for the scale of the work that we're doing, yeah.
00:51:45.580 | - And I think there are different kinds of maintainability.
00:51:48.300 | In one sense, I think you're right
00:51:50.140 | that the throw everything into the context window thing
00:51:53.140 | is easier to maintain
00:51:54.140 | because you just can swap out a model.
00:51:57.540 | In another sense, it's if things go wrong,
00:52:00.580 | it's harder to debug where like,
00:52:02.220 | if you know here's the process that we go through
00:52:04.940 | to go from 200 million papers to an answer
00:52:08.820 | and they're like little steps and you understand,
00:52:10.660 | okay, this is the step that finds the relevant paragraph
00:52:14.260 | or whatever it may be.
00:52:15.720 | You'll know which step breaks if the answers are bad
00:52:20.140 | whereas if it's just like a new model version came out
00:52:24.500 | and now it suddenly doesn't find your needle
00:52:26.580 | in a haystack anymore,
00:52:27.500 | then you're like, okay, what can you do?
00:52:29.740 | You're kind of at a loss.
00:52:31.660 | - Yeah.
00:52:32.700 | Let's talk a bit about, yeah, needle in a haystack
00:52:35.940 | and like maybe the opposite of it,
00:52:37.740 | which is like hard grounding.
00:52:39.340 | I don't know if that's like the best name
00:52:40.900 | to think about it,
00:52:41.740 | but I was using one of these chat-with-your-documents features
00:52:44.380 | and I put in the AMD MI300 specs
00:52:47.340 | and the new Blackwell chips from NVIDIA
00:52:51.100 | and I was asking questions
00:52:52.120 | and asked, does the AMD chip support NVLink?
00:52:56.280 | And the response was like, oh, it doesn't say in the specs.
00:52:59.620 | But if you ask GPT-4 without the docs,
00:53:02.020 | it would tell you no,
00:53:03.020 | because NVLink, it's an NVIDIA technology.
00:53:05.620 | - Those are NV.
00:53:06.540 | - Yeah, it just says in the thing.
00:53:09.380 | How do you think about that?
00:53:11.740 | Like having the context sometimes suppress the knowledge
00:53:14.860 | that the model has?
00:53:16.220 | - It really depends on the task
00:53:17.460 | because I think sometimes that is exactly what you want.
00:53:19.700 | So imagine you're a researcher,
00:53:21.540 | you're writing the background section of your paper
00:53:23.240 | and you're trying to describe what these other papers say.
00:53:26.980 | You really don't want extra information
00:53:28.580 | to be introduced there.
00:53:29.740 | In other cases where you're just trying
00:53:31.060 | to figure out the truth
00:53:31.940 | and you're giving the documents
00:53:33.740 | because you think they will help the model
00:53:36.340 | figure out what the truth is.
00:53:38.420 | I think you do want,
00:53:40.500 | if the model has a hunch
00:53:41.880 | that there might be something that's not in the papers,
00:53:44.620 | you do want to surface that.
00:53:46.100 | I think ideally,
00:53:46.940 | you still don't want the model to just tell you.
00:53:49.580 | I think probably the ideal thing looks a bit more
00:53:51.660 | like agent control where the model can issue a query
00:53:56.660 | that then is intended to surface documents
00:54:00.700 | that substantiate its hunch.
00:54:02.100 | So I would, that's maybe a reasonable middle ground
00:54:06.060 | between model just telling you
00:54:07.900 | and model being fully limited to the papers you give it.
00:54:10.880 | - Yeah, I would say it's,
00:54:11.800 | they're just kind of different tasks right now.
00:54:13.420 | And the tasks that Elicit is mostly focused on
00:54:15.500 | is what do these papers say?
00:54:17.660 | But there is another task,
00:54:18.980 | which is like, just give me the best possible answer.
00:54:21.660 | And that give me the best possible answer
00:54:23.340 | sometimes depends on what do these papers say,
00:54:25.420 | but it can also depend on other stuff
00:54:27.280 | that's not in the papers.
00:54:28.700 | So ideally we can do both
00:54:29.900 | and then kind of do this overall task for you
00:54:33.060 | more going forward.
00:54:34.220 | - All right, we've seen a lot of details,
00:54:37.220 | but just to zoom back out a little bit,
00:54:39.500 | what are maybe the most underrated features of Elicit?
00:54:43.900 | And what is one thing that maybe the users
00:54:46.340 | surprised you the most by using it?
00:54:48.260 | - I think the most powerful feature of Elicit
00:54:50.300 | is the ability to extract, add columns to this table,
00:54:54.780 | which effectively extracts data
00:54:56.380 | from all of your papers at once.
00:54:58.260 | It's still, it's well used,
00:54:59.780 | but there are kind of many different extensions of that
00:55:02.620 | that I think users are still discovering.
00:55:04.260 | So one is we let you give a description of the column.
00:55:07.900 | We let you give instructions for a column.
00:55:10.260 | We let you create custom columns.
00:55:11.740 | So we have like 30 plus predefined fields
00:55:14.280 | that users can extract.
00:55:15.820 | Like what were the methods?
00:55:16.940 | What were the main findings?
00:55:18.060 | How many people were studied?
00:55:20.420 | And then, and we can,
00:55:21.880 | we actually show you basically the prompts
00:55:23.820 | that we're using to extract that from our predefined fields.
00:55:26.460 | And then you can fork this and you can say,
00:55:28.620 | oh, actually I don't care about the population of people.
00:55:30.980 | I only care about the population of rats.
00:55:32.780 | Like you can change the instruction.
00:55:34.280 | So I think users are still kind of discovering
00:55:37.000 | that there's both this predefined, easy to use default,
00:55:41.260 | but that they can extend it to be much more specific to them.
00:55:44.220 | And then they can also ask custom questions.
00:55:46.460 | One use case of that is you can,
00:55:48.300 | you can start to create different column types
00:55:50.220 | that you might not expect.
00:55:51.220 | So rather than just creating generative answers,
00:55:53.980 | like a description of the methodology,
00:55:55.680 | you can say classify the methodology
00:55:58.060 | into a prospective study, a retrospective study,
00:56:01.340 | or a case study.
00:56:02.660 | And then you can filter based on that.
00:56:04.420 | It's like all using the same kind of technology
00:56:06.300 | and the interface, but it unlocks different workflows.
00:56:09.780 | So I think that like the ability to ask custom questions,
00:56:12.820 | give instructions and specifically use that
00:56:14.940 | to create different types of columns,
00:56:17.540 | like classification columns is still pretty underrated.
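A sketch of the difference between a generative extraction column and a classification column: the mechanism is the same (one prompt per paper), but the classification prompt constrains the output to a fixed label set you can then filter on. The prompts, labels, and model call below are illustrative assumptions, not Elicit's own:

```python
# Hypothetical extraction column vs. classification column over a paper table.
ALLOWED_LABELS = {"prospective study", "retrospective study", "case study"}

def run_column(prompt: str, paper_text: str) -> str:
    # Placeholder for a per-paper model call.
    return "retrospective study"

extraction_prompt = "Describe the methodology of this paper in one sentence."
classification_prompt = (
    "Classify the methodology as exactly one of: "
    + ", ".join(sorted(ALLOWED_LABELS)) + ". Answer with the label only."
)

def classify(paper_text: str) -> str:
    label = run_column(classification_prompt, paper_text).strip().lower()
    return label if label in ALLOWED_LABELS else "unclear"  # guard against free-form output

papers = {"paper_1": "…full text…"}
table = {pid: {"methods": run_column(extraction_prompt, text),
               "study_type": classify(text)} for pid, text in papers.items()}
filtered = {pid: row for pid, row in table.items() if row["study_type"] == "retrospective study"}
```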
00:56:20.860 | In terms of use case,
00:56:22.980 | I spoke to someone who works in medical affairs
00:56:25.660 | at a genomic sequencing company recently.
00:56:28.340 | So they, you know, doctors kind of, you know,
00:56:32.140 | order these genomic tests,
00:56:34.320 | the sequencing tests to kind of identify
00:56:36.460 | if a patient has a particular disease,
00:56:38.340 | this company helps them process it.
00:56:40.260 | And this person basically interacts with all the doctors.
00:56:43.380 | And if the doctors have any questions,
00:56:44.700 | my understanding is that medical affairs
00:56:46.220 | is kind of like customer support
00:56:47.820 | or customer success in pharma.
00:56:50.440 | So this person like talks to doctors all day long.
00:56:52.840 | And one of the things they started using Elicit for
00:56:56.040 | is like putting the results of their tests as the query.
00:56:59.900 | Like this test showed, you know, this percentage,
00:57:02.540 | you know, presence of this and 40% that and whatever.
00:57:06.500 | What do we think is kind of the, you know,
00:57:08.900 | what like genes are present here or something
00:57:10.900 | or what's in the sample?
00:57:13.180 | And getting kind of a list of academic papers
00:57:15.980 | that would support their findings
00:57:17.380 | and using this to help doctors interpret their tests.
00:57:21.100 | So we talked about, okay, cool.
00:57:22.260 | Like if we built, you know,
00:57:24.020 | he's pretty interested in kind of doing a survey
00:57:26.860 | of infectious disease specialists
00:57:29.340 | and getting them to evaluate, you know,
00:57:31.300 | having them write up their answers,
00:57:32.540 | comparing it to Elicit's answers,
00:57:33.860 | trying to see can Elicit start being used
00:57:36.340 | to interpret the results of these diagnostic tests?
00:57:39.520 | Because the way they ship these tests to doctors
00:57:42.340 | is they report on a really wide array of things.
00:57:46.020 | And he was saying that at a large,
00:57:47.900 | well-resourced hospital, like a city hospital,
00:57:50.580 | there might be a team of infectious disease specialists
00:57:52.820 | who can help interpret these results.
00:57:55.140 | But at under-resourced hospitals or more rural hospitals,
00:57:57.820 | the primary care physician can't interpret the test results.
00:58:02.820 | So then they can't order it, they can't use it,
00:58:04.620 | they can't help their patients with it.
00:58:06.540 | So thinking about, you know,
00:58:07.780 | kind of an evidence-backed way of interpreting these tests
00:58:10.380 | is definitely kind of an extension of the product
00:58:12.140 | that I hadn't considered before.
00:58:13.960 | But yeah, the idea of like using that
00:58:15.760 | to bring more access to physicians
00:58:18.240 | in all different parts of the country
00:58:20.020 | and helping them interpret complicated science
00:58:21.860 | is pretty cool.
00:58:23.140 | - Yeah, we had Kanjun from Imbue on the podcast
00:58:26.740 | and we talked about better allocating scientific resources.
00:58:29.820 | How do you think about these use cases
00:58:31.540 | and maybe how Elicit can help drive more research?
00:58:35.340 | And do you see a world in which, you know,
00:58:37.600 | maybe the models actually do some of the research
00:58:39.860 | before suggesting us?
00:58:42.220 | - Yeah, I think that's like very close
00:58:45.140 | to what we care about.
00:58:46.420 | So our product values are systematic,
00:58:49.380 | transparent, and unbounded.
00:58:50.660 | And I think to make research more,
00:58:53.760 | especially more systematic and unbounded,
00:58:55.940 | I think is like basically the thing that's at stake here.
00:58:58.100 | So ideally people would think,
00:59:01.220 | well, what are, for example,
00:59:04.060 | I was recently talking to people in longevity
00:59:06.220 | and I think there isn't really one field of longevity.
00:59:08.620 | There are kind of different scientific subdomains
00:59:11.500 | that are surfacing,
00:59:13.180 | various things that are related to longevity.
00:59:15.140 | And I think if you could more systematically say,
00:59:17.620 | look, here are all the different interventions we could do.
00:59:22.580 | And here's the expected ROI of these experiments.
00:59:25.140 | Here's like the evidence so far
00:59:26.800 | that supports those being either like likely
00:59:30.060 | to surface new information or not.
00:59:32.180 | Here's the cost of these experiments.
00:59:34.180 | I think you could be so much more systematic
00:59:36.940 | than science is today.
00:59:39.380 | Probably, yeah, I'd guess in like 10, 20 years,
00:59:42.380 | we'll look back and it will be incredible
00:59:44.800 | how unsystematic science was back in the day.
00:59:48.100 | - Yeah, and I think this is as we start to,
00:59:51.700 | so like our view is kind of have models
00:59:54.820 | catch up to expert humans today,
00:59:57.260 | or whatever, start with kind of novice humans
00:59:59.780 | and then increasingly expert humans.
01:00:01.740 | And then at some point,
01:00:03.260 | but we really want the models to kind of like
01:00:05.220 | earn their right to the expertise.
01:00:07.660 | So that's why we do things in this very step-by-step way.
01:00:09.820 | That's why we don't just like throw a bunch of data
01:00:12.300 | and apply a bunch of compute and hope we get good results.
01:00:14.980 | But obviously at some point you hope that
01:00:16.220 | once it's kind of earned its stripes,
01:00:17.580 | it can surpass human researchers.
01:00:20.520 | But I think that's where making sure that the models
01:00:23.340 | processes are really explicit and transparent
01:00:26.380 | and that it's really easy to evaluate is important
01:00:28.660 | because if it does surpass human understanding,
01:00:31.060 | people will still need to be able to audit its work somehow
01:00:33.740 | or spot check its work somehow
01:00:35.580 | to be able to reliably trust it and use it.
01:00:37.960 | So yeah, that's kind of why the process-based approach
01:00:41.420 | is really important.
01:00:42.740 | - And on the question of will models do their own research,
01:00:47.420 | I think one feature that most currently don't have
01:00:50.340 | that will need to be better there is better world models.
01:00:54.300 | I think currently models are just not great
01:00:55.940 | at representing kind of what's going on
01:00:59.520 | in a particular situation or domain in a way
01:01:01.960 | that allows them to come to interesting,
01:01:05.380 | surprising conclusions.
01:01:07.340 | I think they're very good at like coming to,
01:01:09.680 | I don't know, conclusions that are nearby
01:01:12.300 | to conclusions that people have come to,
01:01:14.020 | but not as good at kind of reasoning
01:01:16.860 | and making surprising connections maybe.
01:01:19.060 | And so having deeper models of how,
01:01:23.340 | let's see, what are the underlying structures
01:01:25.900 | of different domains, how are they related or not related,
01:01:28.520 | I think will be an important ingredient
01:01:30.020 | for models actually being able to make novel contributions.
01:01:32.860 | - On the topic of hiring more expert humans,
01:01:34.980 | you've hired some very expert humans.
01:01:37.180 | My friend Maggie Appleton joined you guys
01:01:39.300 | I think maybe a year ago-ish.
01:01:41.380 | In fact, I think you're doing an offsite
01:01:44.100 | and we're actually organizing our big AI/UX meetup
01:01:48.460 | around whenever she's in town in San Francisco.
01:01:51.100 | - Oh, amazing.
01:01:52.140 | - How big is the team?
01:01:53.940 | How have you sort of transitioned your company
01:01:55.920 | into this sort of PBC and sort of the plan for the future?
01:02:00.500 | - Yeah, we're 12 people now.
01:02:02.300 | Mostly, about half of us are in the Bay Area
01:02:05.300 | and then distributed across US and Europe.
01:02:07.620 | A mix of mostly kind of roles in engineering and product.
01:02:11.300 | Yeah, and I think that the transition to PBC
01:02:14.240 | was really not that eventful because I think we were already,
01:02:18.260 | even as a nonprofit, we were already shipping every week.
01:02:21.260 | So very much operating as a product.
01:02:22.420 | - Very much like a startup, yeah.
01:02:24.020 | - And then I would say the kind of PBC component
01:02:27.200 | was to very explicitly say that we have a mission
01:02:30.860 | that we care a lot about.
01:02:32.100 | There are a lot of ways to make money.
01:02:33.900 | We think our mission will make us a lot of money,
01:02:36.020 | but we are going to be opinionated about how we make money.
01:02:38.360 | We're gonna take the version of making a lot of money
01:02:40.780 | that's in line with our mission.
01:02:42.180 | But it's like all very, it's very convergent.
01:02:43.940 | Like Elicit is not going to make any money
01:02:45.980 | if it's a bad product,
01:02:47.220 | if it doesn't actually help you discover truth
01:02:49.300 | and do research more rigorously.
01:02:51.640 | So I think for us, the kind of mission
01:02:54.460 | and the success of the company are very intertwined.
01:02:58.700 | So a big part of, yeah,
01:02:59.660 | we're hoping to grow the team quite a lot this year.
01:03:02.280 | Probably some of our highest priority roles
01:03:04.100 | are in engineering,
01:03:05.700 | but also opening up roles more in design
01:03:08.700 | and product marketing, go-to-market.
01:03:10.540 | Yeah, do you want to talk about the roles?
01:03:14.160 | - Yeah, broadly we're just looking
01:03:15.220 | for senior software engineers
01:03:16.580 | and don't need any particular AI expertise.
01:03:19.500 | A lot of it is just, I guess,
01:03:21.860 | how do you build good orchestration for complex tasks?
01:03:26.860 | So we talked earlier about these are sort of notebooks,
01:03:30.660 | scaling up, task orchestration,
01:03:32.700 | and I think a lot of this looks more
01:03:36.180 | like traditional software engineering
01:03:38.100 | than it does look like machine learning research.
01:03:39.920 | And I think the people who are really good
01:03:42.080 | at building good abstractions,
01:03:45.520 | building applications that can kind of survive,
01:03:48.880 | even if some of their pieces break,
01:03:50.720 | like making reliable components out of unreliable pieces.
01:03:54.060 | I think those are the people we're looking for.
01:03:56.460 | - You know, that's exactly what I used to do.
01:03:59.060 | Have you explored any of the existing--
01:03:59.900 | - Do you want to come work with us?
01:04:01.420 | - I can talk about this all day.
01:04:02.820 | Have you explored the existing orchestration frameworks?
01:04:05.220 | Temporal, Airflow, Dagster, Prefect?
01:04:09.060 | - We've looked into them a little bit.
01:04:10.340 | I think we have some specific requirements
01:04:12.220 | around kind of being able to stream work back very quickly
01:04:16.260 | to our users, and those could definitely be relevant.
01:04:20.260 | - Okay, well, you're hiring.
01:04:21.440 | I'm sure we'll plug all the links.
01:04:22.740 | Thank you so much for coming.
01:04:23.660 | Any parting words, any words of wisdom?
01:04:26.340 | Models do you live by?
01:04:27.620 | - No, I think it's a really important time
01:04:31.220 | for humanity, so I hope everyone listening to this podcast
01:04:34.940 | can think hard about exactly how they want to participate
01:04:39.940 | in this story.
01:04:41.140 | There's so much to build, and we can be really intentional
01:04:43.980 | about what we align ourselves with.
01:04:46.500 | I think there are a lot of applications
01:04:48.660 | that are going to be really good for the world,
01:04:50.020 | and a lot of applications that are not,
01:04:51.780 | and so, yeah, I hope people can take that seriously
01:04:54.740 | and kind of seize the moment.
01:04:56.620 | - Yeah, I love how intentional you guys have been.
01:04:58.220 | Thank you for sharing that story.
01:04:59.660 | - Thank you.
01:05:00.500 | - Yeah, thank you for coming on.
01:05:02.540 | (upbeat music)