
Supervise the Process of AI Research — with Jungwon Byun and Andreas Stuhlmüller of Elicit


Chapters

0:00 Introductions
7:45 How Jungwon and Andreas Joined Forces to Create Elicit
10:26 Why Products Are Better Than Research
15:49 The Evolution of Elicit's Product
19:44 Automating Literature Review Workflow
22:48 How GPT-3 to GPT-4 Changed Things
25:37 Managing LLM Pricing and Performance
31:07 Open vs. Closed: Elicit's Approach to Model Selection
31:56 Moving to Notebooks
39:11 Elicit's Budget for Model Queries and Evaluations
41:44 Impact of Long Context Windows
47:19 Underrated Features and Surprising Applications
51:35 Driving Systematic and Efficient Research
53:00 Elicit's Team Growth and Transition to a Public Benefit Corporation
55:22 Building AI for Good

Transcript

Hey, everyone. Welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners. And I'm joined by my co-host, Swyx, founder of Smol AI. Hey, and today we are back in the studio with Andreas and Jungwon from Elicit. Welcome. Thanks, guys. It's great to be here.

Yeah. So I'll introduce you separately, but also we'd love to learn a little bit more about you personally. So Andreas, it looks like you started Elicit, or Ought first, and Jungwon joined later. That's right, although you did-- I guess, for all intents and purposes, the Elicit and also the Ought that existed before then were very different from what I started.

So I think it's fair to say that she co-founded it. And Jungwon, you're a co-founder and COO of Elicit. Yeah, that's right. So there's a little bit of a history to this. I'm not super aware of the journey. I was aware of Ought and Elicit as sort of a nonprofit-type situation.

And recently, you turned into sort of like a B Corp-- Public Benefit Corporation. So yeah, maybe if you want, you could take us through that journey of finding the problem. Obviously, you're working together now. So how do you get together to decide to leave your startup career to join him?

Yeah, it's truly a very long journey. I guess, truly, it kind of started in Germany when I was born. So even as a kid, I was always interested in AI. I kind of went to the library. There were books about how to write programs in QBasic. And some of them talked about how to implement chatbots.

I guess Eliza-- To be clear, he grew up in a tiny village on the outskirts of Munich called Dinkelscherben, which is a very, very idyllic German village. Important to the story. But basically, the main thing is I've kind of always been thinking about AI my entire life and been thinking about, well, at some point, this is going to be a huge deal.

It's going to be transformative. How can I work on it? And I was thinking about it from when I was a teenager. After high school, I did a year where I started a startup with the intention to become rich. And then once I'm rich, I can affect the trajectory of AI.

Did not become rich. Decided to go back to college and study cognitive science there, which was like the closest thing I could find at the time to AI. In the last year of college, moved to the US to do a PhD at MIT, working on broadly kind of new programming languages for AI, because it kind of seemed like the existing languages were not great at expressing world models and learning world models doing Bayesian inference.

Was always thinking about, well, ultimately, the goal is to actually build tools that help people reason more clearly, ask and answer better questions, and make better decisions. But for a long time, it just seemed like the technology to put reasoning in machines just wasn't there. And so initially, at the end of my postdoc at Stanford, I was thinking about, well, what to do?

I think the standard path is you become an academic and do research. But it's really hard to actually build interesting tools as an academic. You can't really hire great engineers. Everything is kind of on a paper-to-paper timeline. And so I was like, well, maybe I should start a startup and pursue that for a little bit.

But it seemed like it was too early, because you could have tried to do an AI startup, but probably would not have been the kind of AI startup we're seeing now. So then decided to just start a nonprofit research lab that's going to do research for a while, until we better figure out how to do thinking in machines.

And that was Ought. And then over time, it became clear how to actually build actual tools for reasoning. And then only over time, we developed a better way to-- I'll let you fill in some of these. Yeah, so I guess my story maybe starts around 2015. I kind of wanted to be a founder for a long time.

And I wanted to work on an idea that really tested-- that stood the test of time for me, like an idea that stuck with me for a long time. And then starting in 2015, actually, originally, I became interested in AI-based tools from the perspective of mental health. So there are a bunch of people around me who are really struggling.

One really close friend in particular is really struggling with mental health and didn't have any support. And it didn't feel like there was anything before getting hospitalized that could just help her. And so luckily, she came and stayed with me for a while. And we were just able to talk through some things.

But it seemed like lots of people might not have that resource. And something maybe AI-enabled could be much more scalable. I didn't feel ready to start a company then. That's 2015. And I also didn't feel like the technology was ready. So then I went into fintech and learned how to do the tech thing.

And then in 2019, I felt like it was time for me to just jump in and build something on my own I really wanted to create. And at the time, there were two interesting-- I looked around at tech and felt not super inspired by the options. I didn't want to have a tech career ladder.

I didn't want to climb the career ladder. There were two interesting technologies at the time. There was AI and there was crypto. And I was like, well, the AI people seem a little bit more nice. Maybe slightly more trustworthy. Both super exciting. But yeah, I kind of threw my bet in on the AI side.

And then I got connected to Andreas. And actually, the way he was thinking about pursuing the research agenda at Ought was really compatible with what I had envisioned for an ideal AI product, something that helps kind of take really complex thinking, overwhelming thoughts, and break it down into small pieces.

And then this kind of mission that we need AI to help us figure out what we ought to do was really inspiring. Yeah, because I think it was clear that we were building the most powerful optimizer of our time. But as a society, we hadn't figured out how to direct that optimization potential.

And if you kind of direct tremendous amounts of optimization potential at the wrong thing, that's really disastrous. So the goal of Ought was to make sure that if we build the most transformative technology of our lifetime, it can be used for something really impactful, like good reasoning, like not just generating ads.

My background was in marketing. But so I was like, I want to do more than generate ads with this. And also, if these AI systems get to be super intelligent enough that they are doing this really complex reasoning, that we can trust them, that they are aligned with us and we have ways of evaluating that they're doing the right thing.

So that's what Ought did. We did a lot of experiments. This was, like Andreas said, before foundation models really took off. A lot of the issues we were seeing were more in reinforcement learning. But we saw a future where AI would be able to do more kind of logical reasoning, not just kind of extrapolate from numerical trends.

So we actually kind of set up experiments with people, where people stood in as super intelligent systems. And we effectively gave them context windows. So they would have to read a bunch of text. And one person would get less text, and one person would get all the text, and the person with less text would have to evaluate the work of the person who could read much more.

So in a world, we were basically simulating, like in 2018, 2019, a world where an AI system could read significantly more than you. And you, as the person who couldn't read that much, had to evaluate the work of the AI system. Yeah, so there's a lot of the work we did.

And from that, we kind of iterated on this idea of breaking complex tasks, like open-ended reasoning and logical reasoning, down into smaller tasks, so that it's easier to train AI systems on them. And also so that it's easier to evaluate the work of the AI system when it's done.

And then also kind of really pioneered this idea, the importance of supervising the process of AI systems, not just the outcomes. And so a big part of then how Elicit is built is we're very intentional about not just throwing a ton of data into a model and training it, and then saying, cool, here's scientific output.

That's not at all what we do. Our approach is very much like, what are the steps that an expert human does? Or what is an ideal process? As granularly as possible, let's break that down. And then train AI systems to perform each of those steps very robustly. When you train that from the start, after the fact, it's much easier to evaluate.

It's much easier to troubleshoot at each point, like where did something break down? So yeah, we were working on those experiments for a while. And then at the start of 2021, decided to build a product. Because when you do research, I think maybe-- - Do you mind if I, 'cause I think you're about to go into more modern Ought and Elicit.

And I just wanted to, because I think a lot of people are in where you were, like sort of 2018, '19, where you chose a partner to work with, right? And you didn't know him. You were just kind of cold introduced. A lot of people are cold introduced. I've been cold introduced to tons of people and I never work with them.

I assume you had a lot of other options, right? Like how do you advise people to make those choices? - Yeah, we were not totally cold introduced. So we had one of our closest friends introduce us. And then Andreas had written a lot on the Ought website, a lot of blog posts, a lot of publications.

And I just read it and I was like, wow, this sounds like my writing. And even other people, some of my closest friends I asked for advice from, they were like, oh, this sounds like your writing. But I think I also had some kind of like things I was looking for.

I wanted someone with a complementary skill set. I wanted someone who was very values-aligned. And yeah, I think that was all a good fit. - We also did a pretty lengthy mutual evaluation process where we had a Google doc where we had all kinds of questions for each other.

And I think it ended up being around 50 pages or so of like various like questions and back and forth. - Was it the YC list? There's some lists going around for co-founder questions. - No, we just made our own questions. But I presume, I guess it's probably related in that you ask yourself, well, what are the values you care about?

How would you approach various decisions and things like that? - I shared like all of my past performance reviews. - Yeah. - Yeah. - Yeah, and he had never had any, so. - No. (all laughing) - Yeah, sorry, I just had to, a lot of people are going through that phase and you kind of skipped over it.

I was like, no, no, no, no, there's like an interesting story there. - So before we jump into what Elicit is today, the history is a bit counterintuitive. So you start with figuring out, oh, if we had a super powerful model, how would we align it, how would we use it?

But then you were actually like, well, let's just build the product so that people can actually leverage it. And I think there are a lot of folks today that are now back to where you were maybe five years ago that are like, oh, what if this happens rather than focusing on actually building something useful with it?

What clicked for you to like move into Elicit and then we can cover that story too? - I think in many ways, the approach is still the same because the way we are building Elicit is not, let's train a foundation model to do more stuff. It's like, let's build a scaffolding such that we can deploy powerful models to good ends.

So I think it's different now in that we are, we actually have some of the models to plug in, but if in 2018, '17, we had had the models, we could have run the same experiments we did run with humans back then, just with models. And so in many ways, our philosophy is always like, let's think ahead to the future.

What models are gonna exist in one, two years or longer? And how can we make it so that they can actually be deployed in kind of transparent, controllable ways? - Yeah, I think motivationally, we both are kind of product people at heart and we just want to, the research was really important and it didn't make sense to build a product at that time.

But at the end of the day, the thing that always motivated us is imagining a world where high quality reasoning is really abundant. And AI was just kind of the most, is the technology that's gonna get us there. And there's a way to guide that technology with research, but it's also really exciting to have, you can have a more direct effect through product because with research, you have kind of, you'd publish the research and someone else has to implement that into the product and the product felt like a more direct path.

And we wanted to concretely have an impact on people's lives. So I think, yeah, I think the kind of personally, the motivation was we want to build for people. - Yep, and then just to recap as well, like the models you were using back then were like, I don't know, would they like BERT type stuff or T5 or I don't know what timeframe we're talking about here.

- So I guess to be clear, at the very beginning, we had humans do the work and then the initial, I think the first models that kind of made sense were GPT-2 and T-NLG and like the early generative models. We do also use like T5-based models even now, but started with GPT-2.

- Yeah, cool, I'm just kind of curious about like, how do you start so early, you know, like now it's obvious where to start, but back then it wasn't. - Yeah, I used to nag Andreas a lot. I was like, why are you talking to this? I don't know, I felt like GPT-2 is like, clearly can't do anything.

And I was like, Andreas, you're wasting your time like playing with this toy. But yeah, he was right. - So what's the history of what Elicit actually does as a product? I think today, you recently announced that after four months, you got to a million in revenue. Obviously a lot of people use it, get a lot of value, but it was initially kind of like structured data extraction from papers.

Then you had, yeah, kind of like concept grouping. And today it's maybe like a more full stack research enabler, kind of like paper understander platform. What's the definitive definition of what Elicit is and how did you get here? - Yeah, we say Elicit is an AI research assistant. I think it will continue to evolve.

It has evolved a lot and it will continue to. And that's part of why we're so excited about building in research, 'cause there's just so much space. I think the current phase we're in right now, we talk about it as really trying to make Elicit the best place to understand what is known.

So it's all a lot about like literature summarization. There's a ton of information that the world already knows. It's really hard to navigate, hard to make it relevant. So a lot of it is around document discovery and processing and analysis. I really want to make, I kind of want to import some of the incredible productivity improvements we've seen in software engineering and data science into research.

So it's like, how can we make researchers like data scientists of text? That's why we're launching this new set of features called Notebooks. It's very much inspired by computational notebooks like Jupyter Notebooks, Deepnote, or Colab, because they're so powerful and so flexible. And ultimately when people are trying to get to an answer or understand insight, they're kind of like manipulating evidence and information.

Today that's all packaged in PDFs, which are super brittle, but with language models, we can decompose these PDFs into their underlying claims and evidence and insights, and then let researchers mash them up together, remix them and analyze them together. So yeah, I would say quite simply, overall Elicit is an AI research assistant.

Right now we're focused on text-based workflows, but long-term really want to kind of go further and further into reasoning and decision-making. - And when you say AI research assistant, this is kind of meta research. So researchers use Elicit as a research assistant. It's not a generic, you can research anything type of tool, or it could be, but like, what are people using it for today?

- Yeah, so specifically, I guess in science, a lot of people use human research assistants to do things. Like you tell your kind of grad student, hey, here are a couple of papers. Can you look at all of these, see which of these have kind of sufficiently large populations and actually study the disease that I'm interested in, and then write out like, what are the experiments they did?

What are the interventions they did? What are the outcomes? And kind of organize that for me. And the first phase of understanding what is known really focuses on automating that workflow, because a lot of that work is pretty rote work. I think it's not the kind of thing that we need humans to do, language models can do it.

And then if language models can do it, you can obviously scale it up much more than a grad student or undergrad research assistant would be able to do. - Yeah, the use cases are pretty broad. So we do have people who just come, a very large percent of our users are just using it personally, or for a mix of personal and professional things.

People who care a lot about like health or biohacking, or parents who have children with a kind of rare disease and want to understand the literature directly. So there is an individual kind of consumer use case. We're most focused on the power users, so that's where we're really excited to build.

So Elicit was very much inspired by this workflow in the literature called systematic reviews or meta-analyses, which is basically the human state of the art for summarizing scientific literature. It typically involves like five people working together for over a year, and they kind of first start by trying to find the maximally comprehensive set of papers possible.

So it's like 10,000 papers. And they kind of systematically narrow that down to like hundreds or 50, and extract key details from every single paper. They usually have two people doing it, and like a third person reviewing it. So it's like an incredibly laborious, time-consuming process, but you see it in every single domain.

So in science, in machine learning, in policy. And because it's so structured and designed to be reproducible, it's really amenable to automation. So that's kind of the workflow that we want to automate first. And then you make that accessible for any question and make kind of these really robust living summaries of science.

So yeah, that's one of the workflows that we're starting with. - Our previous guest, Mike Conover, he's building a new company called BrightWave, which is an AI research assistant for financial research. How do you see the future of these tools? Like does everything converge to like a God researcher assistant, or is every domain going to have its own thing?

- I think that's a good and mostly open question. I do think there are some differences across domains. For example, some research is more quantitative data analysis, and other research is more kind of high-level cross-domain thinking. And we definitely want to contribute to the broad generalist reasoning type space.

Like if researchers are making discoveries, often it's like, hey, this thing in biology is actually analogous to like these equations in economics or something. And that's just fundamentally a thing where you need to reason across domains. So I think there will be, at least within research, I think there will be like one best platform more or less for this type of generalist research.

I think there may still be like some particular tools like for genomics, like particular types of modules of genes and proteins and whatnot. But for a lot of the kind of high-level reasoning that humans do, I think that is a more winner-take-all type of thing. - I wanted to ask a little bit deeper about, I guess, the workflow that you mentioned.

I like that phrase. I see that in your UI now, but that's as it is today. And I think you were about to tell us about how it was in 2021 and how it maybe progressed. Like what, how has this workflow evolved? - Yeah, so the very first version of Elicit actually wasn't even a research assistant.

It was like a forecasting assistant. So we set out and we were thinking about what are some of the most impactful types of reasoning that if we could scale up AI would really transform the world. And the first thing we started, we actually started with literature review, but we're like, oh, so many people are gonna build literature review tools, so let's not start there.

And so then we focused on geopolitical forecasting. So I don't know if you're familiar with like Manifold or-- - Manifold Markets. - Yeah, that kind of stuff. - And Manifold.love. - Before Manifold, yeah. So we're not predicting relationships, we're predicting like, is China gonna invade Taiwan? - Markets for everything.

- Yeah. - That's a relationship. - Yeah, it's fair. - Yeah, yeah, it's true. - And then we worked on that for a while and then after GPT-3 came out, I think by that time we kind of realized that the, originally we were trying to help people convert their beliefs into probability distributions.

And so take fuzzy beliefs, but like model them more concretely. And then after a few months of iterating on that, just realized, oh, the thing that's blocking people from making interesting predictions about important events in the world is less kind of on the probabilistic side and much more on the research side.

And so that kind of combined with the very generalist capabilities of GPT-3 prompted us to make a more general research assistant. Then we spent a few months iterating on what even is a research assistant. So we would embed with different researchers, we built data labeling workflows in the beginning, kind of right off the bat.

We built ways to find like experts in a field and like ways to ask good research questions. So we just kind of iterated through a lot of workflows and it was, yeah, no one else was really building at this time and it was like very quick to just do some prompt engineering and see like what is a task that is at the intersection of what's good, technologically capable and like important for researchers.

And we had like a very nondescript landing page. It said nothing, but somehow people were signing up and we had a sign-up form that was like, "Why are you here?" And everyone was like, "I need help with literature review." And we're like, "Literature review, that sounds so hard.

"I don't even know what that means." They're like, "We don't want to work on it." But then eventually we were like, "Okay, everyone is saying literature review." It's overwhelmingly people want-- - And all domains, not like medicine or physics or just all domains. - Yeah, and we also kind of personally knew literature review was hard.

And if you look at the graphs for academic literature being published every single month, you guys know this in machine learning, it's like up and to the right, like superhuman amounts of papers. So we're like, "All right, let's just try it." I was really nervous, but Andreas was like, "This is kind of like the right problem space "to jump into even if we don't know what we're doing." So my take was like, "Fine, this feels really scary, "but let's just launch a feature every single week "and double our user numbers every month.

"And if we can do that, we'll fail fast "and we will find something." I was worried about getting lost in the kind of academic white space. So the very first version was actually a weekend prototype that Andreas made. Do you want to explain how that worked? - I mostly remember this was really bad.

So the thing I remember is you enter a question and it would give you back a list of claims. So your question could be, I don't know, "How does creatine affect cognition?" It would give you back some claims that are to some extent based on papers, but they were often irrelevant.

The papers were often irrelevant. And so we ended up soon just printing out a bunch of examples of results and putting them up on the wall so that we would kind of feel the constant shame of having such a bad product and would be incentivized to make it better.

And I think over time it has gotten a lot better, but I think the initial version was really very bad. - Yeah, but it was basically like a natural language summary of an abstract, like kind of a one-sentence summary, and which we still have. And then as we learned kind of more about this systematic review workflow, we started expanding the capability so that you could extract a lot more data from the papers and do more with that.

- And were you using embeddings and cosine similarity, that kind of stuff, for retrieval, or was it keyword-based, or? - I think the very first version didn't even have its own search engine. I think the very first version probably used the SemanticSkuller API or something similar. And only later, when we discovered that that API is not very semantic, then built our own search engine that has helped a lot.

- And then we're gonna go into more recent product stuff, but I think you seem like the more startup-oriented business person, and you seem sort of more ideologically interested in research, obviously, 'cause of your PhD. What kind of market sizing were you guys thinking? 'Cause you're here saying, "We have to double every month." And I'm like, "I don't know how you make "that conclusion from this," right?

Especially also as a non-profit at the time. - Yeah, I think market size-wise, I felt like in this space where so much was changing and it was very unclear what of today was actually gonna be true tomorrow, we just really rested a lot on very, very simple fundamental principles, which is research is, if you can understand the truth, that is very economically beneficial, valuable, if you know the truth.

- Just on principle, that's enough for you. - Yeah, research is the key to many breakthroughs that are very commercially valuable. 'Cause my version of it is students are poor and they don't pay for anything, right? But that's obviously not true, as you guys have found out. But I, you know, you had to have some market insight for me to have believed that, but I think you skipped that.

- Yeah. - Yeah, we did encounter, I guess, talking to VCs for our seed round. A lot of VCs were like, "You know, researchers, "they don't have any money. "Why don't you build a legal assistant?" (laughing) And I think in some short-sighted way, maybe that's true, but I think in the long run, R&D is such a big space of the economy.

I think if you can substantially improve how quickly people find new discoveries or avoid kind of controlled trials that don't go anywhere, I think that's just huge amounts of money. And there are a lot of questions, obviously, about between here and there, but I think as long as the fundamental principle is there, we were okay with that.

And I guess we found some investors who also were. - Yeah, yeah, congrats. I mean, I'm sure we can cover the sort of flip later. Yeah, I think you were about to start us on GPT-3 and how that changed things for you. It's funny, I guess every major GPT version, you have some big insight.

- I think it's a little bit less true for us than for others because we always believe that there will basically be human-level machine work. And so it is definitely true that in practice, for your product, as new models come out, your product starts working better, you can add some features that you couldn't add before.

But I don't think we really ever had the moment where we were like, oh, wow, that is super unanticipated. We need to do something entirely different now from what was on the roadmap. - I think GPT-3 was a big change 'cause it kind of said, oh, now is the time that we can use AI to build these tools.

And then GPT-4 was maybe a little bit more of an extension of GPT-3. It felt less like a level shift. GPT-3 over GPT-2 was like a qualitative level shift. And then GPT-4 was like, okay, great. Now it's like, much more accurate, we're more accurate on these things, we can answer harder questions.

But the shape of the product had already taken place by that time. - I kind of want to ask you about this sort of pivot that you've made, but I guess that was just a way to sell what you were doing, which is you're adding extra features on grouping by concepts.

- When GPT-4-- - The GPT-4 pivot, quote-unquote pivot that you-- - Oh, yeah, yeah, exactly. Right, right, right, yeah, yeah. When we launched this workflow, now that GPT-4 was available, basically, Elisa was at a place where, given a table of papers, we have very tabular interfaces, so given a table of papers, you can extract data across all the tables.

But that's still, you kind of want to take the analysis a step further. And sometimes what you'd care about is not having a list of papers, but a list of arguments, a list of effects, a list of interventions, a list of techniques. And so that's one of the things we're working on is now that you've extracted this information in a more structured way, can you pivot it or group by whatever the information that you extracted to have more insight-first information still supported by the academic literature?

- Yeah, that was a big revelation when I saw it. Yeah, basically, I think I'm just very impressed by how first-principles your ideas around the workflow are. And I think that's why you're not as reliant on the LLM improving, because actually it's just about improving the workflow that you would recommend to people.

Today, we might call it an agent, I don't know, but you're not reliant on the LLM to drive it. It's relying on your sort of, this is the way that Elicit does research, and this is what we think is most effective based on talking to our users. - Yep, that's right.

Yeah, I think the problem space is still huge. If it's this big, we are all still operating at this tiny bit of it. So I think about this a lot in the context of moats. People are like, "Oh, what's your moat? "What happens if GPT-5 comes out?" It's like, if GPT-5 comes out, there's still all of this other space that we can go into.

And so I think being really obsessed with the problem, which is very, very big, has helped us stay robust and just kind of directly incorporate model improvements and then keep going. - And then I first encountered you guys with Charlie. You can tell us about that project. Basically, how much did cost become a concern as you're working more and more with OpenAI?

How do you manage that relationship? - Let me talk about who Charlie is. - All right. - Sure, sure. - You can talk about the tech, 'cause Charlie is a special character. So Charlie, when we found him, had just finished his freshman year at the University of Warwick. I think he had heard about us on some discord, and then he applied, and we were like, "Wow, who is this freshman?" And then we just saw that he had done so many incredible side projects.

And we were actually on a team retreat in Barcelona visiting our head of engineering at that time, and everyone was talking about this wonder kid. They're like, "This kid?" And then on our take-home project, he had done the best of anyone to that point. And so we were just so excited to hire him.

So we hired him as an intern, and then we're like, "Charlie, what if he just dropped out of school?" And so then we convinced him to take a year off, and he's just incredibly productive. And I think the thing you're referring to is, at the start of 2023, Anthropic launched their constitutional AI paper, and within a few days, I think four days, he had basically implemented that in production, and then we had it in app a week or so after that.

And he has since contributed to major improvements, like cutting costs down to a tenth of what they were. It's really large-scale, but yeah, you can talk about the technical stuff. - Yeah, on the constitutional AI project, this was for abstract summarization, where in Elicit, if you run a query, it'll return papers to you, and then it will summarize each paper with respect to your query for you on the fly.

And that's a really important part of Elicit, because Elicit does it so much. If you run a few searches, it'll have done it a few hundred times for you. And so we cared a lot about this both being fast, cheap, and also very low on hallucination. I think if Elicit hallucinates something about the abstract, that's really not good.

And so what Charlie did in that project was create a constitution that expressed what are the attributes of a good summary. It's like everything in the summary is reflected in the actual abstract, and it's very concise, et cetera, et cetera. And then used RLHF with a model that was trained on the constitution to basically fine-tune a better summarizer.

- And an open-source model, I think. - On an open-source model, yeah. I think that might still be in use. - Yeah, yeah, definitely. Yeah, I think at the time, the models hadn't been trained at all to be faithful to a text. So they were just generating. So then when you ask them a question, they tried too hard to answer the question and didn't try hard enough to answer the question given the text or answer what the text said about the question.
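
A rough sketch of how a constitution like the one described could be used to critique and revise candidate abstract summaries, producing (draft, revised) pairs for later fine-tuning; the principles below are paraphrased from the description above rather than the actual constitution, and `call_model` is a hypothetical stand-in for an open-source model endpoint.

```python
# Rough sketch: use a constitution to critique and revise candidate summaries,
# yielding (draft, revised) pairs that could later feed preference fine-tuning.
# `call_model` is a hypothetical stand-in for an open-source LLM endpoint.

CONSTITUTION = [
    "Every claim in the summary must be supported by the abstract.",
    "The summary must directly address the user's query.",
    "The summary must be concise (one or two sentences).",
]

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your own open-source model here")

def critique_and_revise(query: str, abstract: str, draft: str) -> str:
    """Check the draft against each principle and rewrite it if needed."""
    revised = draft
    for principle in CONSTITUTION:
        prompt = (
            f"Query: {query}\nAbstract: {abstract}\nSummary: {revised}\n\n"
            f"Principle: {principle}\n"
            "If the summary violates the principle, rewrite it so it complies; "
            "otherwise return it unchanged."
        )
        revised = call_model(prompt)
    return revised

# A (draft, revised) pair from this loop can serve as one preference example
# when fine-tuning the summarizer.
```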

So we had to basically teach the models to do that specific task. - How do you monitor the ongoing performance of your models? Not to get too LLM-opsy, but you are one of the larger, more well-known operations doing NLP at scale. I guess, effectively, you have to monitor these things, and nobody has a good answer that I can talk to.

- Yeah, I don't think we have a good answer yet. (all laughing) I think the answers are actually a little bit clearer on the just kind of basic robustness side, so I think where you can import ideas from normal software engineering and normal kind of DevOps. You're like, well, you need to monitor kind of latencies and response times and uptime and whatnot.

- I think when we say performance, it's more about hallucination rate. - And then things like hallucination rate where I think there the really important thing is training time. So we care a lot about having our own internal benchmarks for model development that reflect the distribution of user queries so that we can know ahead of time how well is the model gonna perform on different types of tasks, so the tasks being summarization, question answering, given a paper, ranking.

And for each of those, we wanna know what's the distribution of things the model is gonna see so that we can have well-calibrated predictions on how well the model's gonna do in production. And I think, yeah, there's some chance that there's distribution shift and actually the things users enter are gonna be different, but I think that's much less important than getting the kind of training right and having very high-quality, well-vetted data sets at training time.

- I think we also end up effectively monitoring by trying to evaluate new models as they come out. And so that kind of prompts us to go through our eval suite every couple of months. And then, yeah, and so every time a new model comes out, we have to see like, okay, which one is, how is this performing relative to production and what we currently have?
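
A minimal sketch of a task-stratified eval harness along these lines: score a candidate model on vetted examples for each task, then weight tasks by how often they appear in production queries. The dataset format, weights, and scoring function are illustrative assumptions, not Elicit's internal benchmarks.

```python
# Sketch of a task-stratified eval harness: score a candidate model on vetted
# examples per task, weighted by the production query mix. Dataset, weights,
# and metric are illustrative assumptions.
from collections import defaultdict

EVAL_SET = [
    {"task": "summarization", "input": "example abstract", "expected": "gold summary"},
    {"task": "qa", "input": "example paper and question", "expected": "gold answer"},
    {"task": "ranking", "input": "example query and candidates", "expected": "gold order"},
]

TASK_WEIGHTS = {"summarization": 0.5, "qa": 0.3, "ranking": 0.2}  # query mix

def score(output: str, expected: str) -> float:
    # Placeholder metric; in practice each task would get its own grader.
    return float(output.strip() == expected.strip())

def evaluate(model_fn) -> dict:
    """model_fn(task, input) -> output; returns per-task and weighted scores."""
    per_task = defaultdict(list)
    for example in EVAL_SET:
        output = model_fn(example["task"], example["input"])
        per_task[example["task"]].append(score(output, example["expected"]))
    task_means = {t: sum(s) / len(s) for t, s in per_task.items()}
    overall = sum(TASK_WEIGHTS[t] * m for t, m in task_means.items())
    return {"per_task": task_means, "weighted_overall": overall}
```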

- Yeah, I mean, since we're on this topic, have any new models really caught your eye this year? Like, Claude came out with a bunch. - Claude, yeah, I think Claude is pretty, I think the team's pretty excited about Claude. - Yeah, specifically, I think Claude Haiku is like a good point on the kind of Pareto frontier.

So I think it's like, it's not the, it's neither the cheapest model, nor is it the most accurate, most high-quality model, but it's just like a really good trade-off between cost and accuracy. You apparently have to 10-shot it to make it good. I tried using Haiku for summarization, but zero-shot was not great.

And yeah, then they were like, you know, it's a skill issue, you have to try harder. - Interesting. - Yeah, we also used, I think, GPT-4 unlocked processing. - Turbo? - Yeah, yeah, it kind of unlocked tables for us, processing data from tables, which was huge. - GPT-4 Vision. - Yeah.

- Yeah, did you try like Fuyu? I guess you can't try Fuyu, 'cause it's non-commercial. That's the Adept model. - Yeah, we haven't tried that one. - Yeah, yeah, yeah. But Claude is multimodal as well. - Yeah. - I think the interesting insight that we got from talking to David Luan, who is CEO of Adept, was that multimodality has effectively two different flavors.

Like one is the, we recognize images from a camera in the outside natural world. And actually, the more important multimodality for knowledge work is screenshots. And, you know, PDFs and charts and graphs. - Yeah, yeah, mm-hmm. - So we need a new term for that kind of multimodality. - Yeah.

- But is the claim that current models are good at one or the other? - No, they're over-indexed, 'cause the history of computer vision is COCO, right? So now we're like, oh, actually, you know, screens are more important. - Yeah, processing weird handwriting and stuff. - OCR, yeah, handwriting, yeah.

You mentioned a lot of like closed model lab stuff, and then you also have like this open source model fine-tuning stuff. Like what is your workload now between closed and open? - It's a good question. - Is it half and half? - I think it's-- - Is that even a relevant question, or is this a nonsensical question?

- It depends a little bit on like how you index, whether you index by like compute cost or number of queries. I'd say like in terms of number of queries, it's maybe similar. In terms of like cost and compute, I think the closed models make up more of the budget, since the main cases where you want to use closed models are cases where they're just smarter, where no existing open source models are quite smart enough.

- We have a lot of interesting technical questions to go in, but just to wrap the kind of like UX evolution, now you have the notebooks. We talked a lot about how chatbots are not the final frontier, you know? How did you decide to get into notebooks, which is a very iterative, kind of like interactive interface, and yeah, maybe learnings from that?

- Yeah, this is actually our fourth time trying to make this work. I think the first time was probably in early 2021. At the time we built something, I think because we've always been obsessed with this idea of task decomposition and like branching, we always wanted a way, a tool that could be kind of unbounded where you could keep going, where you could do a lot of branching, where you could kind of apply language model operations or computations on other tasks.

So in 2021, we had this thing called composite tasks where you could use GPT-3 to brainstorm a bunch of research questions, and then take each research question and decompose those further into sub questions. And this kind of, again, that like task decomposition tree type thing was always very exciting to us, but that was like, it didn't work and it was kind of overwhelming.

Then at the end of '22, I think we tried again. And at that point we were thinking, okay, we've done a lot with this literature review thing. We also want to start helping with kind of adjacent domains and different workflows. Like we want to help more with machine learning.

What does that look like? And as we were thinking about it, we're like, well, there are so many research workflows. Like how do we not just build kind of three new workflows into Elicit, but make Elicit really generic to lots of workflows? What is like a generic composable system with nice abstractions that can like scale to all these workflows?

So we like iterated on that a bunch and then didn't quite narrow the problem space enough or like quite get to what we wanted. And then I think it was at the beginning of 2023, where we're like, wow, computational notebooks kind of enable this, where they have a lot of flexibility, but kind of robust primitives, such that you can extend the workflow.

And it's not limited. It's not like you ask a query, you get an answer, you're done. You can just constantly keep building on top of that. And each little step seems like a really good kind of unit of work for the language model. And also it was just really helpful to have a bit more kind of pre-existing work to emulate.

So that was, yeah, that's kind of how we ended up at computational notebooks for Elicit. - Maybe one thing that's worth making explicit is the difference between computational notebooks and chat, because on the surface, they seem pretty similar. It's kind of this iterative interaction where you add stuff and it's almost like in both cases, you have a back and forth between you enter stuff and then you get some output and then you enter stuff.

But the important difference in our minds is with notebooks, you can define a process. So in data science, you can be like, here's like my data analysis process that takes in a CSV and then does some extraction and then generates a figure at the end. And you can prototype it using a small CSV and then you can run it over a much larger CSV later.

And similarly, the vision for notebooks in our case is to not make it this like one-off chat interaction, but to allow you to then say kind of, if you start and first you're like, okay, let me just analyze a few papers and see do I get to the correct like conclusions for those few papers?

Can I then later go back and say, now let me run this over 10,000 papers now that I've debugged the process using a few papers? And that's an interaction that doesn't fit quite as well into the chat framework, because that's more for kind of quick back and forth interaction.
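
A minimal sketch of the "define a process, debug it small, then scale it" pattern described above; `load_papers` and `analyze_paper` are hypothetical helpers, not Elicit's API.

```python
# Sketch of the notebook-style idea: prototype a per-paper analysis on a
# handful of papers, inspect the output, then run the same process over the
# full corpus. Both helpers below are hypothetical.

def load_papers(limit: int | None = None) -> list[str]:
    raise NotImplementedError("hypothetical corpus loader")

def analyze_paper(paper: str) -> dict:
    raise NotImplementedError("hypothetical LLM-backed analysis step")

def run_process(papers: list[str]) -> list[dict]:
    return [analyze_paper(p) for p in papers]

# Debug the process on a few papers first...
sample_results = run_process(load_papers(limit=3))

# ...then, once the conclusions look right, run it over the whole corpus.
# full_results = run_process(load_papers(limit=10_000))
```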

- Do you think of notebooks as kind of like a structured, editable chain of thought, basically step by step, like is that kind of where you see this going and then are people gonna reuse notebooks as like templates and maybe in traditional notebooks, it's like cookbooks, right? You share a cookbook, you can start from there.

Is that similar in Elicit? - Yeah, that's exactly right. So that's our hope that people will build templates, share them with other people. I think chain of thought is maybe still like kind of one level lower on the abstraction hierarchy than we would think of notebooks. I think we'll probably want to think about more semantic pieces like a building block is more like a paper search or an extraction or a list of concepts.

And then the model's detailed reasoning will probably often be one level down. You always want to be able to see it, but you don't always want it to be front and center. - Yeah, what's the difference between a notebook and an agent? Since everybody always asks me, what's an agent?

Like, how do you think about where the line is? - In the notebook world, I would generally think of the human as the agent in the first iteration. So you have the notebook and the human kind of adds little action steps. And then the next point on this kind of progress gradient is, okay, now you can use language models to predict which action would you take as a human.

And at some point, you're probably gonna be very good at this. You'll be like, okay, I can like, in some cases I can with 100%, 99.9% accuracy predict what you do. And then you might as well just execute it. Like, why wait for the human? And eventually, as you get better at this, that will just look more and more like agents taking actions as opposed to you doing the thing.

And I think templates are a specific case of this where you're like, okay, well, there's just particular sequences of actions that you often wanna chunk and have available as primitives, just like in normal programming. And those are, you can view them as action sequences of agents or you can view them as more like the normal programming language abstraction thing.
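
A rough sketch of the progression Andreas describes, where a model proposes the next notebook step and it only auto-executes above a confidence threshold; all function names here are hypothetical.

```python
# Sketch of the "human as agent" progression: a model proposes the next
# notebook step, and we only auto-execute it when its confidence clears a
# threshold. All function names here are hypothetical.

AUTO_EXECUTE_THRESHOLD = 0.999

def propose_next_step(notebook_state: dict) -> tuple[str, float]:
    """Hypothetical model call returning (proposed_action, confidence)."""
    raise NotImplementedError

def run_step(action: str, notebook_state: dict) -> dict:
    """Hypothetical executor for a single notebook action."""
    raise NotImplementedError

def advance(notebook_state: dict) -> dict:
    action, confidence = propose_next_step(notebook_state)
    if confidence >= AUTO_EXECUTE_THRESHOLD:
        return run_step(action, notebook_state)          # behave like an agent
    if input(f"Run '{action}'? [y/N] ").lower() == "y":  # defer to the human
        return run_step(action, notebook_state)
    return notebook_state
```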

And I think those are two valid views. - Yeah, how do you see this changes? Like you said, the models get better and you need less and less human actual interfacing with the model, you just get the results. Like how does the UX and the way people perceive it change?

- Yeah, I think this kind of interaction paradigm for evaluation is not really something the internet has encountered yet, because right now, up to now, the internet has all been about like getting data and work from people. But so increasingly, yeah, I really want kind of evaluation, both from an interface perspective and from like a technical perspective and operation perspective, to be a superpower for Elicit, 'cause I think over time, models will do more and more of the work and people will have to do more and more of the evaluation.

So I think, yeah, in terms of the interface, some of the things we have today are, for every kind of language model generation, there's some citation back and we kind of directly, we try to highlight the ground truth in the paper that is most relevant to whatever Elicit said and make it super easy so that you can click on it and quickly see in context and validate whether the text actually supports the answer that Elicit gave.

So I think we'd probably want to scale things up like that, like the ability to kind of spot check the model's work super quickly, scale up interfaces like that and-- - Who would spot check, the user? - Yeah, to start it would be the user. One of the other things we do is also kind of flag the model's uncertainty.

So we have models report out, how confident are you that this was the sample size of this study? The model's not sure, we throw a flag and so the user knows to prioritize checking that. So again, we can kind of scale that up. So when the model's like, well, I went and searched for Google, I searched this on Google, I'm not sure if that was the right thing, we have an uncertainty flag and the user can go and be like, oh, okay, that was actually the right thing to do or not.

- So I've tried to do uncertainty readings from models. I don't know if you have this live, you do, okay. 'Cause I just didn't find them reliable 'cause they just hallucinated their own uncertainty. I would love to base it on log probs or something more native within the model rather than generated.

But okay, it sounds like they scale properly for you. - Yeah, we found it to be pretty calibrated. It varies by model. - Okay, yeah, I think in some cases, we also used two different models for the uncertainty estimates than for the question answering. So one model would say, here's my chain of thought, here's my answer, and then a different type of model.

Let's say the first model is Llama and let's say the second model is GPT-3.5, could be different. And then the second model just looks over the results and is like, okay, how confident are you in this? And I think sometimes using a different model can be better than using the same model.
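
A minimal sketch of the two-model setup described above: one model answers, a second model scores confidence, and low scores raise a flag for the user; both model calls are hypothetical placeholders.

```python
# Sketch of the two-model setup: one model answers with its chain of thought,
# a second model scores how confident we should be, and low scores raise a
# flag for the user to check. Both model calls are hypothetical.

UNCERTAINTY_THRESHOLD = 0.7

def answer_model(question: str, paper_text: str) -> str:
    raise NotImplementedError("e.g. an open-source model")

def confidence_model(question: str, answer: str, paper_text: str) -> float:
    """Return a 0-1 confidence score for the given answer."""
    raise NotImplementedError("e.g. a different, cheaper model")

def answer_with_flag(question: str, paper_text: str) -> dict:
    answer = answer_model(question, paper_text)
    confidence = confidence_model(question, answer, paper_text)
    return {
        "answer": answer,
        "confidence": confidence,
        "needs_review": confidence < UNCERTAINTY_THRESHOLD,  # surfaced to user
    }
```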

- Yeah, on the topic of models, evaluating models, obviously you can do that all day long. Like what's your budget? Because your queries fan out a lot and then you have models evaluating models. One person typing in a question can lead to a thousand calls. - It depends on the project.

So if the project is basically a systematic review that otherwise human research assistants would do, then the project is basically a human equivalent spend and the spend can get quite large for those projects. Certainly, I don't know, let's say $100,000. - For the project, yeah. - Yeah. So in those cases, you're happier to spend compute than in the kind of shallow search case where someone just enters a question because, I don't know, maybe-- - Feel like it.

- I heard about creatine, what's it about? Probably don't want to spend a lot of compute on that. And this sort of being able to invest more or less compute into getting more or less accurate answers is I think one of the core things we care about and that I think is currently undervalued in the AI space.

I think currently, you can choose which model you want and you can sometimes tell it, I don't know, you'll tip it and it'll try harder or you can try various things to get it to work harder. But you don't have great ways of converting willingness to spend into better answers and we really want to build a product that has this sort of unbounded flavor where I mean, as much as you care about, if you care about it a lot, you should be able to get really high quality answers, really double checked in every way.

- Yeah. - And you have a credits-based pricing. So unlike most products, it's not a fixed monthly fee. - Right, exactly. So some of the higher costs are tiered. So for most casual users, they'll just get the abstract summary, which is kind of an open source model. Then you can add more columns which have more extractions and these uncertainty features and then you can also add those same columns in high accuracy mode, which also parses the table.

So we kind of stack the complexity on the cost. - You know the fun thing you can do with a credit system, which is data for data or I don't know what I mean by that. Basically, you can give people more credits if they give data back to you.

I don't know if you've already done that. - I've thought about, we've thought about something like this. It's like, if you don't have money, but you have time, how do you exchange that? - It's a fair trade. - Yeah, I think it's interesting. We haven't quite operationalized it and then there's been some kind of adverse selection.

For example, it would be really valuable to get feedback on our models. So maybe if you were willing to give more robust feedback on our results, we could give you credits or something like that. But then there's kind of this, will people take it seriously? - Yeah, you want the good people.

- Exactly. - Can you tell who are the good people? - Not right now, but yeah, maybe at the point where we can, we can offer it. We can offer it up to them. - The perplexity of questions asked, if it's higher perplexity, these are smarter people. - Yeah, maybe.

- If you make a lot of typos in your queries, you're not gonna get off the house exchange. (all laughing) - Negative social credit. It's very topical right now to think about the threat of long context windows. All these models that we're talking about these days, all like a million token plus.

Is that relevant for you? Can you make use of that? Is that just prohibitively expensive 'cause you're just paying for all those tokens or you're just doing rag? - It's definitely relevant. And when we think about search, I think as many people do, we think about kind of a staged pipeline of retrieval where first you use a kind of semantic search database with embeddings, get like the, in our case, maybe 400 or so most relevant papers.

And then you still need to rank those. And I think at that point, it becomes pretty interesting to use larger models. So specifically in the past, I think a lot of ranking was kind of per item ranking where you would score each individual item, maybe using increasingly expensive scoring methods and then rank based on the scores.

But I think list-wise reranking, where you have a model that can see all the elements, is a lot more powerful because often you can only really tell how good a thing is in comparison to other things. And what things should come first, it really depends on like, well, what other things are available, maybe you even care about diversity in your results.

You don't wanna show like 10 very similar papers as the first 10 results. So I think the long context models are quite interesting there. And especially for our case where we care more about power users who are perhaps a little bit more willing to wait a little bit longer to get higher quality results relative to people who just quickly check out things because why not?
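
A rough sketch of the staged pipeline Andreas describes: embedding search narrows to a few hundred candidates, then a single list-wise call ranks them jointly; `semantic_search`, `call_long_context_model`, and the prompt wording are assumptions for illustration.

```python
# Sketch of the staged pipeline: embedding search narrows to a few hundred
# candidates, then one list-wise call ranks them jointly so the model can
# weigh candidates against each other (and favor diversity). Both helpers
# below are hypothetical.

def semantic_search(query: str, k: int = 400) -> list[str]:
    raise NotImplementedError("stage 1: embedding-based retrieval")

def call_long_context_model(prompt: str) -> str:
    raise NotImplementedError("stage 2: long-context reranker")

def listwise_rerank(query: str, top_n: int = 20) -> list[str]:
    candidates = semantic_search(query)
    numbered = "\n".join(f"[{i}] {title}" for i, title in enumerate(candidates))
    prompt = (
        f"Query: {query}\n\nCandidate papers:\n{numbered}\n\n"
        f"Return the indices of the {top_n} most relevant papers, best first, "
        "as a comma-separated list, avoiding near-duplicate papers."
    )
    indices = [int(tok) for tok in call_long_context_model(prompt).split(",")]
    return [candidates[i] for i in indices[:top_n]]
```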

I think being able to spend more on longer contexts is quite valuable. - Yeah, I think one thing the longer context models changed for us is maybe a focus from breaking down tasks to breaking down the evaluation. So before, if we wanted to answer a question from the full text of a paper, we had to figure out how to chunk it and like find the relevant chunk and then answer based on that chunk.

And the nice thing was then, you know kind of which chunk the model used to answer the question. So if you want to help the user track it, yeah, you can be like, well, this was the chunk that the model got. And now if you put in the whole text of the paper, you have to go back, you have to like kind of find the chunk more retroactively basically.

And so you need kind of like a different set of abilities and obviously like a different technology to figure it out. You still want to point the user to the supporting quotes in the text, but then like the interaction is a little different. - You like scan through and find some ROUGE score.

- Yeah. - Ceiling or floor. - Yeah, I think there's an interesting space of almost research problems here because you would ideally make causal claims. Like if this hadn't been in the text, the model wouldn't have said this thing. And maybe you can do expensive approximations to that where like, I don't know, you just throw a chunk off the paper and re-answer and see what happens.

But hopefully there are better ways of doing that where you just get that kind of counterfactual information for free from the model. - Do you think at all about the cost of maintaining RAG versus just putting more tokens in the window? I think in software development, a lot of times people buy developer productivity things so that we don't have to worry about it.

Context window is kind of the same, right? You have to maintain chunking and like RAG retrieval and like re-ranking and all of this versus I just shove everything into the context and like it costs a little more, but at least I don't have to do all of that. Is that something you thought about at all?

- I think we still like hit up against context limits enough that like, it's not really, do we still want to keep this RAG around? It's like we do still need it for the scale of the work that we're doing, yeah. - And I think there are different kinds of maintainability.

In one sense, I think you're right that the throw everything into the context window thing is easier to maintain because you just can swap out a model. In another sense, it's if things go wrong, it's harder to debug where like, if you know here's the process that we go through to go from 200 million papers to an answer and they're like little steps and you understand, okay, this is the step that finds the relevant paragraph or whatever it may be.

You'll know which step breaks if the answers are bad whereas if it's just like a new model version came out and now it suddenly doesn't find your needle in a haystack anymore, then you're like, okay, what can you do? You're kind of at a loss. - Yeah. Let's talk a bit about, yeah, needle in a haystack and like maybe the opposite of it, which is like hard grounding.

I don't know if that's like the best name to think about it, but I was using one of these chat-with-your-documents features and I put in the AMD MI300 specs and the new Blackwell chips from NVIDIA and I was asking questions and asked, does the AMD chip support NVLink?

And the response was like, oh, it doesn't say in the specs. But if you ask GPT-4 without the docs, it would tell you no, because NVLink is an NVIDIA technology. - Those are NVIDIA. - Yeah, it just says it in the thing. How do you think about that? Like having the context sometimes suppress the knowledge that the model has?

- It really depends on the task because I think sometimes that is exactly what you want. So imagine you're a researcher, you're writing the background section of your paper and you're trying to describe what these other papers say. You really don't want extra information to be introduced there. In other cases where you're just trying to figure out the truth and you're giving the documents because you think they will help the model figure out what the truth is.

In that case, if the model has a hunch that there might be something that's not in the papers, you do want to surface that. But ideally, you still don't want the model to just tell you. I think the ideal probably looks a bit more like agent control, where the model can issue a query that is intended to surface documents that substantiate its hunch.
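One crude way to sketch that middle ground (purely illustrative, not how Elicit implements it; `call_model` and `search_corpus` are placeholders) is to let the model either answer strictly from the supplied documents or emit a search query when it has a hunch:

```python
# Illustrative sketch of the "agent issues a query to substantiate its hunch"
# pattern: the model must either answer from the supplied documents or
# request a search whose results would back up its prior knowledge.
# `call_model` and `search_corpus` are placeholders for your own stack.

import json

PROMPT = """Answer the question using ONLY the documents below.
If the answer depends on facts you believe but that are not in the documents,
do not state them as fact; instead return a search query that would surface
sources supporting them.

Return JSON, either {{"answer": "..."}} or {{"search_query": "..."}}.

Documents:
{documents}

Question: {question}"""

def grounded_answer(documents: str, question: str,
                    call_model, search_corpus, max_rounds: int = 2) -> str:
    for _ in range(max_rounds):
        raw = call_model(PROMPT.format(documents=documents, question=question))
        reply = json.loads(raw)
        if "answer" in reply:
            return reply["answer"]
        # The model has a hunch: fetch sources that might substantiate it.
        documents += "\n\n" + search_corpus(reply["search_query"])
    return "No sufficiently grounded answer found."
```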

So that's maybe a reasonable middle ground between the model just telling you and the model being fully limited to the papers you give it. - Yeah, I would say they're just kind of different tasks right now. And the task that Elicit is mostly focused on is: what do these papers say?

But there is another task, which is just, give me the best possible answer. And the best possible answer sometimes depends on what these papers say, but it can also depend on other stuff that's not in the papers. So ideally we can do both, and then kind of do this overall task for you more going forward.

- All right, we've gone into a lot of details, but just to zoom back out a little bit: what are maybe the most underrated features of Elicit, and what is one way users have surprised you the most in how they use it? - I think the most powerful feature of Elicit is the ability to add columns to this table, which effectively extracts data from all of your papers at once.

It's well used, but there are many different extensions of it that I think users are still discovering. So one is that we let you give a description of the column. We let you give instructions for a column. We let you create custom columns. So we have 30-plus predefined fields that users can extract.

Like, what were the methods? What were the main findings? How many people were studied? And we actually show you the prompts that we're using to extract that for our predefined fields. And then you can fork this and say, oh, actually I don't care about the population of people.

I only care about the population of rats. Like you can change the instruction. So I think users are still kind of discovering that there's both this predefined, easy to use default, but that they can extend it to be much more specific to them. And then they can also ask custom questions.

One use case of that is that you can start to create different column types that you might not expect. So rather than just creating generative answers, like a description of the methodology, you can say: classify the methodology into a prospective study, a retrospective study, or a case study.

And then you can filter based on that. It's all using the same kind of technology and interface, but it unlocks different workflows. So I think the ability to ask custom questions, give instructions, and specifically use that to create different types of columns, like classification columns, is still pretty underrated.
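For a sense of what a classification column amounts to conceptually, here's a minimal sketch; the prompt wording, labels, and `call_model` function are illustrative assumptions, not Elicit's actual prompts:

```python
# Conceptual sketch of a "classification column": instead of asking for a
# free-text description of each paper's methodology, constrain the model to a
# fixed set of labels so the resulting column can be filtered on.
# `call_model` is a placeholder for whatever LLM call you use.

LABELS = ["prospective study", "retrospective study", "case study"]

def methodology_column(paper_text: str, call_model) -> str:
    prompt = (
        "Classify the methodology of the following paper as exactly one of: "
        + ", ".join(LABELS) + ".\n\n"
        + paper_text
        + "\n\nAnswer with the label only."
    )
    label = call_model(prompt).strip().lower()
    return label if label in LABELS else "unclear"

# Filtering then becomes ordinary table logic:
# prospective = [p for p in papers
#                if methodology_column(p.text, call_model) == "prospective study"]
```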

In terms of use cases, I spoke to someone who works in medical affairs at a genomic sequencing company recently. So doctors order these genomic sequencing tests to identify if a patient has a particular disease, and this company helps them process the tests.

And this person basically interacts with all the doctors and fields their questions. My understanding is that medical affairs is kind of like customer support or customer success in pharma. So this person talks to doctors all day long. And one of the things they started using Elicit for is putting the results of their tests in as the query.

Like, this test showed this percentage presence of this, and 40% of that, and so on. What genes are present here, or what's in the sample? And getting a list of academic papers that would support their findings, and using this to help doctors interpret their tests.

So we talked about it, and he's pretty interested in doing a survey of infectious disease specialists: having them write up their answers, comparing those to Elicit's answers, and trying to see whether Elicit can start being used to interpret the results of these diagnostic tests.

Because the way they ship these tests to doctors is they report on a really wide array of things. And he was saying that at a large, well-resourced hospital, like a city hospital, there might be a team of infectious disease specialists who can help interpret these results. But at under-resourced hospitals or more rural hospitals, the primary care physician can't interpret the test results.

So then they can't order it, they can't use it, they can't help their patients with it. So thinking about, you know, kind of an evidence-backed way of interpreting these tests is definitely kind of an extension of the product that I hadn't considered before. But yeah, the idea of like using that to bring more access to physicians in all different parts of the country and helping them interpret complicated science is pretty cool.

- Yeah, we had Kanjun from Imbue on the podcast, and we talked about better allocating scientific resources. How do you think about these use cases, and maybe how Elicit can help drive more research? And do you see a world in which maybe the models actually do some of the research before suggesting it to us?

- Yeah, I think that's very close to what we care about. Our product values are systematic, transparent, and unbounded. And making research more systematic and unbounded in particular is basically what's at stake here. For example, I was recently talking to people in longevity, and I think there isn't really one field of longevity.

There are different scientific subdomains that are surfacing various things related to longevity. And I think you could more systematically say, look, here are all the different interventions we could do, and here's the expected ROI of these experiments. Here's the evidence so far that supports each of them being likely to surface new information or not.

Here's the cost of these experiments. I think you could be so much more systematic than science is today. I'd guess that in 10 or 20 years, we'll look back and it will be incredible how unsystematic science was back in the day. - Yeah, and our view is to have models catch up to expert humans: start with novice humans and then increasingly expert humans.

But we really want the models to earn their right to the expertise. So that's why we do things in this very step-by-step way. That's why we don't just throw a bunch of data and a bunch of compute at the problem and hope we get good results.

But obviously at some point you hope that, once it's earned its stripes, it can surpass human researchers. That's where making sure that the models' processes are really explicit and transparent, and really easy to evaluate, is important: if it does surpass human understanding, people will still need to be able to audit or spot-check its work somehow to be able to reliably trust it and use it.

So yeah, that's why the process-based approach is really important. - And on the question of whether models will do their own research, I think one thing most models currently lack, and will need, is better world models. Currently, models are just not great at representing what's going on in a particular situation or domain in a way that allows them to come to interesting, surprising conclusions.

They're very good at coming to conclusions that are nearby to conclusions people have already come to, but not as good at reasoning their way to surprising connections. So having deeper models of the underlying structures of different domains, and of how they are or aren't related, will be an important ingredient for models actually being able to make novel contributions.

- On the topic of hiring more expert humans, you've hired some very expert humans. My friend Maggie Appleton joined you guys I think maybe a year ago-ish. In fact, I think you're doing an offsite and we're actually organizing our big AI/UX meetup around whenever she's in town in San Francisco.

- Oh, amazing. - How big is the team? How have you transitioned the company into this PBC, and what's the plan for the future? - Yeah, we're 12 people now. About half of us are in the Bay Area, and the rest are distributed across the US and Europe.

A mix of roles, mostly in engineering and product. Yeah, and I think the transition to a PBC was really not that eventful, because even as a nonprofit, we were already shipping every week. So very much operating as a product. - Very much like a startup, yeah.

- And then I would say the kind of PBC component was to very explicitly say that we have a mission that we care a lot about. There are a lot of ways to make money. We think our mission will make us a lot of money, but we are going to be opinionated about how we make money.

We're going to take the version of making a lot of money that's in line with our mission. But it's all very convergent: Elicit is not going to make any money if it's a bad product, if it doesn't actually help you discover truth and do research more rigorously.

So for us, the mission and the success of the company are very intertwined. And yeah, we're hoping to grow the team quite a lot this year. Probably some of our highest-priority roles are in engineering, but we're also opening up roles in design, product marketing, and go-to-market.

Yeah, do you want to talk about the roles? - Yeah, broadly we're just looking for senior software engineers, and they don't need any particular AI expertise. A lot of it is just, I guess, how do you build good orchestration for complex tasks? So we talked earlier about notebooks, scaling up, task orchestration, and I think a lot of this looks more like traditional software engineering than it does machine learning research.

And I think the people who are really good at building good abstractions, building applications that can kind of survive, even if some of their pieces break, like making reliable components out of unreliable pieces. I think those are the people we're looking for. - You know, that's exactly what I used to do.

Have you explored any of the existing-- - Do you want to come work with us? - I could talk about this all day. Have you explored the existing orchestration frameworks? Temporal, Airflow, Dagster, Prefect? - We've looked into them a little bit. I think we have some specific requirements around being able to stream work back very quickly to our users, but those could definitely be relevant.
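For a flavor of what "making reliable components out of unreliable pieces" can mean in practice (a generic pattern, not a description of Elicit's internals), here's a small retry-with-fallback wrapper around a flaky step such as a model call:

```python
# Generic sketch: wrap an unreliable step (e.g. an LLM call) with retries, a
# validity check, and a fallback, so the surrounding orchestration can treat
# it as a reliable component. All names here are illustrative.

import time
from typing import Callable, TypeVar

T = TypeVar("T")

def reliable(step: Callable[[str], T],
             is_valid: Callable[[T], bool],
             fallback: Callable[[str], T],
             retries: int = 3,
             backoff_s: float = 1.0) -> Callable[[str], T]:
    def wrapped(inp: str) -> T:
        for attempt in range(retries):
            try:
                out = step(inp)
                if is_valid(out):
                    return out
            except Exception:
                pass                       # swallow the error and retry
            time.sleep(backoff_s * (2 ** attempt))
        return fallback(inp)               # degrade gracefully instead of failing
    return wrapped
```

Degrading to a fallback instead of raising keeps the rest of the pipeline moving, which matters when partial results are being streamed back to users.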

- Okay, well, you're hiring. I'm sure we'll plug all the links. Thank you so much for coming. Any parting words, any words of wisdom, any models you live by? - I think it's a really important time for humanity, so I hope everyone listening to this podcast can think hard about exactly how they want to participate in this story.

There's so much to build, and we can be really intentional about what we align ourselves with. I think there are a lot of applications that are going to be really good for the world, and a lot of applications that are not, and so, yeah, I hope people can take that seriously and kind of seize the moment.

- Yeah, I love how intentional you guys have been. Thank you for sharing that story. - Thank you. - Yeah, thank you for coming on. (upbeat music)