How AI is Eating Finance - with Mike Conover of Brightwave


Chapters

0:00 Introductions
4:52 What's Brightwave?
6:04 How to hire for a vertical AI startup
11:26 How $20B+ hedge funds use Brightwave
12:49 Evolution of context sizes in language models
17:18 Summarizing vs Ideating with AI
22:03 Collecting feedback in a field with no truth
25:10 Evaluation strategies and the importance of custom datasets
28:26 Should more companies make employees label data?
30:40 Retrieval for highly temporal and hierarchical data
35:55 Context-aware prompting for private vs. public data
38:16 Knowledge graph extraction and structured information retrieval
40:17 Fine-tuning vs RAG
43:14 Anthropomorphizing language models
45:39 Why Brightwave doesn't do spreadsheets
50:28 Will there be fully autonomous hedge funds?
56:49 State of open source AI
63:43 Hiring and team expansion at Brightwave

Whisper Transcript

00:00:00.000 | Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO
00:00:04.480 | in Residence at Decibel Partners, and I have no co-host today, as you can see.
00:00:08.240 | Swyx is in Vienna at ICLR having fun in Europe, and we're in the brand new studio.
00:00:14.880 | As you might see if you're on YouTube, there's still no sound panels on the wall. Mike
00:00:20.080 | tried really hard to put them up, but the glue is a little too old for that.
00:00:25.920 | So if you hear any echo or anything like that, sorry, but we're doing the best that we can.
00:00:31.600 | And today we have our first repeat guest, Mike Conover. Welcome Mike,
00:00:36.480 | who is now the founder of Brightwave, not at Databricks anymore.
00:00:39.760 | Our last episode was one of the fan favorites and I think this will be just as good. So for
00:00:50.160 | those that have not listened to the first episode, which might be many because the podcast has grown
00:00:54.640 | a lot since then, thanks to people like Mike who have interesting conversations on it.
00:00:58.640 | You spent a bunch of years doing ML at some of the best companies on the internet. Things like
00:01:06.160 | Workday, you know, SkipFlag, LinkedIn, most recently at Databricks where you were leading
00:01:10.880 | the open source large language models team working on Dolly. And now you're doing Brightwave,
00:01:17.360 | which is in the financial services space, but this is not something new, you know,
00:01:22.320 | I think when you and I first talked about Brightwave, I was like,
00:01:26.080 | why is this guy doing a financial services company? And then you look at your background
00:01:30.160 | and you were doing papers in Nature, in the Nature magazine, about LinkedIn data predicting
00:01:36.720 | S&P 500 stock movement, like many, many years ago. So what's kind of like some of the tying
00:01:44.080 | elements in your background that maybe people are overlooking that brought you to do this?
00:01:47.840 | Yeah, sure. So my PhD research was funded by DARPA, and
00:01:57.840 | we had access to the Twitter dataset early in the natural history of the
00:02:02.240 | availability of that dataset. And it was focused on the large scale structure of
00:02:05.360 | propaganda and misinformation campaigns. And LinkedIn, we had planet scale descriptions of
00:02:13.600 | the structure of the global economy. And so primarily my work was, was homepage newsfeed
00:02:17.680 | relevant. So when you go to linkedin.com, you would see updates from one of our machine learning
00:02:21.200 | models. But additionally, I was a research liaison as part of the economic graph challenge
00:02:26.240 | and had this Nature Communications paper where we demonstrated that 500 million job transitions can
00:02:32.240 | be hierarchically clustered as a network of labor flows and are predictive of next-quarter S&P 500
00:02:37.360 | market cap changes. And at Workday, I was Director of Financials Machine Learning.
00:02:43.840 | And you start to see how organizations are organisms. And I think of the way that
00:02:54.480 | like an accountant or the market encodes information in databases, similar to how
00:03:01.760 | social insects, for example, organize their work and make collective decisions about
00:03:06.080 | where to allocate resources or time and attention. And that, especially with the work on Twitter,
00:03:12.400 | we would see network structures relating to polarization emerge organically out of the
00:03:18.720 | interactions of many individual components. And so like much of my professional work has
00:03:23.600 | been focused on this idea that our lives are governed by systems that we're unable to see
00:03:28.000 | from our locally constrained perspective. And when humans interact
00:03:33.520 | with technology, they create digital trace data that allows us to observe the structure of those
00:03:38.800 | systems as though through a microscope or a telescope. And particularly as regards finance,
00:03:46.080 | I think this is the ultimate, the markets are the ultimate manifestation and record of that
00:03:53.200 | collective decision-making process that humans engage in. Just to start going off script right
00:03:57.200 | away. Sure. How do you think about some of these interactions creating the polarization and how
00:04:02.320 | that reflects in the language models today, because they're trained on this data? Like,
00:04:05.920 | do you think the models pick up on these things on their own as well?
00:04:09.760 | Yeah, I think that they are a compression of the world as it existed at the point in time
00:04:14.400 | when they were pre-trained. And so I think absolutely, and you see this in Word2Vec too.
00:04:20.240 | I mean, just the semantics of how we think about gender as it relates to professions are encoded in
00:04:29.120 | the structure of these models, and language models, I think, are a much more
00:04:35.120 | complete representation of human beliefs. Yeah.
00:04:42.400 | That's awesome. So we left you at Databricks last time you were building Dolly. Tell us a bit more
00:04:46.960 | about Brightwave. This is the first time you're really talking about it publicly.
00:04:50.480 | Yeah. Yeah. It's a pleasure. I mean, so we've raised a $6 million seed round
00:04:56.480 | led by Decibel, who we love working with, and including participation from
00:05:01.840 | Point72, one of the largest hedge funds in the world and Moonfire Ventures. And we are focused
00:05:08.720 | on like, if you think of the job of an active asset manager, the work to be done is to understand
00:05:13.360 | something about the market that nobody else has seen in order to identify a mispriced asset.
00:05:17.120 | And it's our view that that is not a task that is well-suited to human intellect or attention span.
00:05:22.320 | And so, much as I was gesturing towards the ability of these models to perceive
00:05:26.720 | more than a human is able to, we think that there's a historically unique opportunity
00:05:33.920 | to expand individuals' ability to reason about the structure of the economy and the markets.
00:05:39.120 | It's not clear that you get superhuman reasoning capabilities from human level demonstrations of
00:05:45.680 | skill. And by that, I mean the pre-training corpus, but then additionally the fine tuning
00:05:49.920 | corpuses. I think you largely mimic the demonstrations that are present at model
00:05:55.120 | training time. But from a working memory standpoint, these models outclass humans
00:06:00.800 | in their ability to reason about these systems. - Yeah. And you started Brightwave with Brandon.
00:06:07.280 | - Yeah, yeah. - What's the story?
00:06:09.360 | You two worked together at Workday, but he also has a really relevant background.
00:06:13.280 | - Yeah, so Brandon Katara is my co-founder, the CTO, and he's a very special human. So he
00:06:21.760 | has a deep background in finance. He was the former CTO of a federally regulated derivatives
00:06:27.680 | exchange, but his first deep learning patent was filed in 2018. And so he spans worlds. He has
00:06:35.920 | experience building mission critical infrastructure in highly regulated environments for finance use
00:06:41.840 | cases, but also was very early to the deep learning party. At Workday, he
00:06:49.920 | was the tech lead for semantic search over hundreds of millions of resumes and job listings.
00:06:56.560 | And so just has been working with information retrieval and neural information retrieval methods
00:07:04.480 | for a very long time. And so he's an exceptional person, and I'm glad to count him among
00:07:11.760 | the people that we're doing this with. - Yeah, and a great fisherman.
00:07:15.600 | - Yeah, very talented. - That's always important.
00:07:18.400 | - Very talented, very enthusiastic. - And then you have a bunch of
00:07:22.960 | amazing engineers. Then you have folks like JP who used to work at Goldman Sachs.
00:07:26.400 | - Yeah. - How should people think about
00:07:28.000 | team building in this more vertical domain? Obviously you come from a deep ML background,
00:07:33.200 | but you also need some of the industry. So what's the right balance?
00:07:36.000 | - Yeah, I mean, so I think one of the things that's interesting about building
00:07:41.920 | verticalized solutions in AI in 2024 is that historically you need the AI capability. You
00:07:52.320 | need to understand both how the models behave and then how to get them to interact with other kinds
00:07:57.200 | of machine learning subsystems that together perform the work of a system that can reason on
00:08:03.360 | behalf of a human. There are also material systems engineering problems in there. So I forget who
00:08:09.200 | this is attributed to, but there was a tweet that made reference to how all of the traditional software
00:08:15.280 | companies are trying to hire AI talent and all the AI companies are trying to hire systems
00:08:18.960 | engineers. And that is 100% the case. Getting these systems to behave in a predictable and
00:08:25.120 | repeatable and observable way is equally challenging to a lot of the methodological
00:08:30.160 | challenges. But then you bring in, whether it's law or medicine or public policy, or in our case,
00:08:36.960 | finance, I think a lot of the most valuable, like Grammarly is a good example of a company that has
00:08:44.960 | generative work product whose quality can be evaluated by most humans. Whereas in finance,
00:08:54.720 | the character of the insight, the depth of insight and the non-consensusness of the insight
00:08:59.680 | really requires a fairly deep domain expertise. And even operating an exchange,
00:09:04.960 | when we went to raise a round, a lot of people said, "Why don't you start a hedge fund?"
00:09:10.000 | And it's like that is a totally separate, there are many, many separate skills that are unrelated to
00:09:17.600 | AI in that problem. And so we've brought into the fold domain experts in finance who can help us
00:09:25.760 | evaluate the character and sort of steer the system.
00:09:29.440 | - Yep. So that's the team. What does the system actually do? What's the Brightwave product?
00:09:35.840 | - Yeah, I mean, it does many, many things, but it acts as a partner in thought to finance
00:09:44.240 | professionals. So you can ask Brightwave a question like, "How is NVIDIA's position in
00:09:49.680 | the GPU market impacted by rare earth metal shortages?" And it will identify, as thematic
00:09:57.120 | contributors to an investment decision or a developing thesis, that in response to
00:10:04.560 | export controls on A100 cards, China has put in place licensing requirements on the transfer of germanium and
00:10:11.360 | gallium, which are not rare earth metals, but they're semiconductor production inputs,
00:10:14.880 | and has expanded its control of African and South American mining operations.
00:10:18.880 | And so we see, if you think about, we have a $20 billion crossover hedge fund. Their equities team
00:10:29.280 | uses this tool to go deep on a thesis. So I was describing this like multiple steps into the value
00:10:35.360 | chain or supply chain for companies. We see wealth management professionals using Brightwave to get
00:10:43.680 | up to speed extremely quickly as they step into nine conversations tomorrow with clients who are
00:10:50.160 | assessing like, "Do you know something that I don't? Can I trust you to be a steward of my
00:10:55.600 | financial well-being?" We see investor relations teams using Brightwave to...
00:11:04.320 | You just think about the universe of coverage that a person working in finance needs to be
00:11:09.200 | aware of. The ability to rip through filings and transcripts and have a very comprehensive
00:11:15.360 | view of the market. It's extremely rate limited by how quickly a person is able to read and not
00:11:21.600 | just read, but solve the blank page problem of knowing what to say about a fact or finding.
00:11:28.640 | What else can you share about customers that you're working with?
00:11:32.400 | Yeah, so we have seen traction that far exceeded our expectations from the market.
00:11:39.680 | You sit somebody down with a system that can take any question and generate tight,
00:11:46.800 | actionable financial analysis on that subject and the product kind of sells itself. And so we see
00:11:53.440 | many, many different funds, firms, and strategies that are making use of Brightwave. So you've got
00:12:01.040 | 10-person owner-operated registered investment advisor, the classical wealth manager, $500
00:12:06.080 | million in AUM. We have crossover hedge funds that have tens and tens of billions of dollars
00:12:12.560 | in assets under management, very different use case. So that's more investment research,
00:12:16.320 | whereas a wealth manager is going to use this to step into client interactions,
00:12:19.120 | just exceptionally well-prepared. We see investor relations teams. We see
00:12:25.840 | corporate strategy types that are needing to understand very quickly new markets, new themes,
00:12:35.440 | and just the ability to very quickly develop a view on any investment theme or sort of strategic
00:12:43.440 | consideration is broadly applicable to many, many different kinds of personas.
00:12:49.360 | Yep. Yeah, I can attest to the product selling itself, given that I'm a user.
00:12:54.960 | Let's jump into some of the technical challenges and work behind it, because
00:13:00.240 | there are a lot of things. So as I mentioned, you were on the podcast about a year ago.
00:13:05.360 | You had released Dolly from Databricks, which was one of the first open-source LLMs.
00:13:11.200 | Dolly had a whopping 1,024 tokens of context size. And today, I think with 1,000 tokens,
00:13:19.680 | a model would be unusable.
00:13:21.120 | You lose that much out.
00:13:22.240 | Yeah, exactly. How did you think about the evolution of context sizes as you built the
00:13:27.920 | company? And where we are today, what are things that people get wrong? Any commentary there?
00:13:34.080 | Sure. We very much take a systems of systems approach. When I started the company, I think I
00:13:45.600 | had more faith in the ability of large context windows to generally solve problems relating
00:13:50.800 | to synthesis. And actually, if you think about the attention mechanism and the way that it
00:13:56.400 | computes similarities between tokens at a distance, I, on some level, believed that as
00:14:02.640 | you would scale that up, you would have the ability to simultaneously perceive and draw
00:14:07.520 | conclusions across vast disparate bodies of content. And I think that does not empirically
00:14:15.760 | seem to be the case. So, for example, and this is something anybody can try,
00:14:20.640 | take a very long document, like needle in a haystack, I think,
00:14:24.720 | sure, we can do information retrieval on specific fact-finding activities pretty easily.
00:14:32.720 | But I kind of think about it like summarizing: if you write a book report on an
00:14:38.160 | entire book versus a synopsis of each individual chapter, there is a characteristic output length
00:14:45.200 | for these models. Let's say it's about 1200 tokens. It is very difficult to get any of the
00:14:50.160 | commercial LLMs or Llama to write 5,000 tokens. And you think about it as, what is the conditional
00:14:57.200 | probability that I generate an end token? It just gets higher the more tokens are in the context
00:15:04.320 | window prior to that sort of next inference step. And so if I have a thousand words in which to
00:15:13.120 | say something, the level of specificity and the level of depth when I am assessing a very large
00:15:20.960 | body of content is going to necessarily be less than if I am saying something specific about a
00:15:26.400 | sub passage. And if you think about drawing a parallel to consumer internet companies
00:15:34.400 | like LinkedIn or Facebook, there are many different subsystems within them. So let's take
00:15:41.760 | the Facebook example. Facebook almost certainly has, I mean, you can see this in your profile,
00:15:47.520 | your inferred interests. What are the things that it believes that you care about? Those
00:15:52.720 | assessments almost certainly feed into the feed relevance algorithms that would judge what you
00:15:58.640 | are, you know, am I going to show you snowboarding content, am I
00:16:02.160 | going to show you aviation content? It's the outputs of one machine learning system feeding
00:16:09.440 | into another machine learning system. And I think with modern RAG and sort of agent-based reasoning,
00:16:16.320 | it is really about creating subsystems that do specific tasks well. And I think the problem of
00:16:22.800 | deciding how to decompose large documents into more atomic reasoning units is
00:16:30.880 | still very important. Now it's an open question whether that is something
00:16:39.680 | that is addressable by pre-training or instruction tuning, like can you have synthesis-oriented
00:16:49.120 | demonstrations at training time such that this problem is more robustly solved?
00:16:54.400 | Because synthesis is quite different from "complete the next word in The Great Gatsby."
00:17:01.040 | But I think it empirically is not the case that you can just throw all of the SEC filings into,
00:17:08.480 | you know, a million token context window and get deep insight that is useful out the other end.
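Brightwave hasn't published its pipeline, but the decomposition Mike is describing, chapter synopses before the book report, is easy to sketch. In the rough Python below, `call_llm` is a stand-in for whichever model API you use; the chunking and prompts are illustrative assumptions, not Brightwave's actual system.

```python
# Minimal sketch of "synopses per chapter, then the book report": summarize atomic
# units first, then synthesize over the intermediate notes instead of stuffing an
# entire corpus into one context window. `call_llm` is a placeholder for any model API.
from typing import Callable, List


def chunk_document(text: str, max_chars: int = 8000) -> List[str]:
    """Naive fixed-size chunking; in practice you would split on document structure."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def summarize_chunk(chunk: str, call_llm: Callable[[str], str]) -> str:
    prompt = ("Summarize the key facts, figures, and claims in this passage. "
              "Be specific and quote numbers exactly as written.\n\n" + chunk)
    return call_llm(prompt)


def synthesize(question: str, notes: List[str], call_llm: Callable[[str], str]) -> str:
    numbered = "\n\n".join(f"[{i}] {note}" for i, note in enumerate(notes))
    prompt = (f"Question: {question}\n\n"
              "Using only the numbered notes below, write a specific analysis and flag "
              "anything surprising or non-consensus.\n\n" + numbered)
    return call_llm(prompt)


def analyze_filing(question: str, filing: str, call_llm: Callable[[str], str]) -> str:
    notes = [summarize_chunk(c, call_llm) for c in chunk_document(filing)]
    return synthesize(question, notes, call_llm)
```

Because each call only has to be specific about one sub-passage, the characteristic output-length ceiling he mentions bites far less than it would on a single million-token prompt.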
00:17:16.960 | Yeah. And I think that's the main difference about what you're doing. It's not about summarizing,
00:17:21.760 | it's about coming up with different ideas and kind of like thought threads.
00:17:26.800 | Yes. Precisely. Yeah. And I think this specifically like helping a person know,
00:17:31.840 | you know, if I think that GLP ones are going to blow up the diet industry,
00:17:38.160 | identifying and putting in context a negative result from a human clinical trial that's,
00:17:44.880 | or for example, that adherence rates to Ozempic after a year, just 35%,
00:17:48.640 | what are the implications of this? So there's an information retrieval component, and then there's a,
00:17:54.880 | not just presenting me with a summary of like, here's, here are the facts, but like,
00:18:00.000 | what does this entail and how does this fit into my worldview, my fund strategy?
00:18:08.720 | Broadly, I think there's this idea, and someone puts it very eloquently,
00:18:14.080 | and this is not my insight, help
00:18:18.880 | me out if you know who said this, but language models are not tools for
00:18:22.400 | creating new knowledge. They're tools for helping me create new knowledge. They
00:18:26.880 | themselves do not do that work. I think that that's presently the right way to think about
00:18:33.040 | it. Yeah. I read a tweet about needle in the haystack actually being harmful to some of this
00:18:40.400 | work because now the model is like too focused on recalling everything versus saying, oh, that
00:18:45.040 | doesn't matter, like ignoring it. If you think about an S-1 filing, like 85% is boilerplate. It's
00:18:52.080 | like, you know, previous performance doesn't guarantee future performance. Like the company
00:18:57.360 | might not be able to turn a profit in the future. Blah, blah, blah. These things, they always come
00:19:01.040 | up again. We have a large workforce and all of that. Have you had to do any work at the model
00:19:10.480 | level to kind of make it okay to forget these things? Or have you found that
00:19:15.600 | making it a smaller problem and then putting the pieces back together kind of
00:19:19.520 | solves for that? Absolutely. And I think this is where having domain expertise around the structure
00:19:27.600 | of these documents matters. So you look at the different chunking strategies that you can employ to
00:19:31.360 | understand the intent of this clause or phrase, and then you really have to be selective at
00:19:40.160 | retrieval time in order to get the information that is most relevant to a user query based on
00:19:45.520 | the semantics of that unique document. And it's certainly not just a sliding window
00:19:51.680 | over that corpus. And then the flip side of it is obviously factuality. You don't want to forget
00:20:00.320 | things that were there. How do you tackle that? Yeah. I mean, of course, it's a very deep
00:20:06.320 | problem. And I think, you know, I'll be a little circumspect about the specific kinds of methods
00:20:11.920 | we use, but we do multiple passes over the material, asking, how convicted are you
00:20:20.800 | that what you're saying is in fact true? And you, I mean, you can take generations from multiple
00:20:27.200 | different models and compare and contrast and say like, do these both reach the same conclusion?
00:20:31.840 | We, you can treat it like a voting problem. We train our own models to assess, you know,
00:20:40.640 | you can think of this like entailment, like is, is this supported by the underlying primary sources?
00:20:46.640 | And I think that you have methodological approaches to this problem, but then you
00:20:52.400 | also have product affordances. So like there's a great blog post from the Bard team
00:20:57.920 | describing what was sort of a design-led product innovation that allows you to
00:21:05.760 | ask the model to double-check its work. So if you have a surprising finding,
00:21:09.920 | we can let the user discretionarily spend more compute to double check the work.
00:21:16.160 | And I think that you want to build product experiences that are fault tolerant.
00:21:20.320 | And the difference between hallucination and
00:21:25.840 | creativity is fuzzy. And so do you ever get language models with next token prediction
00:21:33.360 | as the loss function that are guaranteed to not contain factual misstatements? That is not clear.
00:21:41.440 | Now, maybe being able to invoke code interpreter like code generation and then execution in a
00:21:47.680 | secure way helps to solve some of these problems, especially for quantitative reasoning. That may
00:21:53.600 | be the case, but for right now, I think you need to have product affordances that allow you to
00:22:02.480 | live with the reality that these things are fallible.
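Mike is deliberately circumspect about Brightwave's methods, but the general shape he gestures at, multiple passes, agreement between models, and a check that claims are supported by the retrieved sources, can be sketched roughly as follows. Everything here is illustrative: `model_a`, `model_b`, and `judge` are placeholders for whatever models you run, and the numeric-grounding check is the heavy-handed kind of heuristic, not a description of their system.

```python
# Rough sketch of post-hoc factuality checks: generate with two models, ask a judge
# whether they agree, and verify that numbers in the answer appear in the sources.
# All model functions are placeholders; this is not Brightwave's implementation.
import re
from typing import Callable, Dict, List


def numbers_grounded(answer: str, sources: List[str]) -> bool:
    """Heavy-handed heuristic: every number in the answer must appear verbatim in a source."""
    source_text = " ".join(sources)
    nums = [n.rstrip(",.") for n in re.findall(r"\d[\d,.]*", answer)]
    return all(n in source_text for n in nums)


def models_agree(a: str, b: str, judge: Callable[[str], str]) -> bool:
    verdict = judge("Do these two analyses reach the same conclusion? Reply YES or NO.\n\n"
                    f"A: {a}\n\nB: {b}")
    return verdict.strip().upper().startswith("YES")


def double_check(question: str, sources: List[str],
                 model_a: Callable[[str], str], model_b: Callable[[str], str],
                 judge: Callable[[str], str]) -> Dict[str, object]:
    context = "\n\n".join(sources)
    prompt = f"Answer using only the sources below.\n\nSources:\n{context}\n\nQuestion: {question}"
    a, b = model_a(prompt), model_b(prompt)
    return {"answer": a,
            "models_agree": models_agree(a, b, judge),
            "numbers_grounded": numbers_grounded(a, sources)}
```

The "double check the work" affordance he describes is then just a user-triggered rerun of these more expensive passes on a single surprising claim.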
00:22:08.160 | Yep. Yeah. We did our RLHF 201 episode, just talking about different methods and whatnot.
00:22:14.640 | How do you think about something like this where it's maybe unclear in the short term,
00:22:19.680 | even if the product is right, you know, it might give an insight that
00:22:23.840 | might be right, but it might not prove out until later. So it's kind of hard for the users to say
00:22:29.440 | that's wrong because actually it might be like, you think it's wrong, like an investment. That's
00:22:33.760 | kind of what it comes down to, you know, some people are wrong. Some people are right.
00:22:38.080 | How do you think about some of the product features that you need in something like this to
00:22:41.840 | bring user feedback into the mix and maybe how you approach it today and how you think about it
00:22:46.560 | long-term? Yeah. Well, I mean, I think that to your point, the model
00:22:52.160 | may make a statement, which is not actually verifiable. It's like this, this may be the
00:22:57.600 | case. I think that is where the reason we think of this as a partner in thought is that humans are
00:23:04.400 | always going to have access to information that has not, not been digitized. And so in finance,
00:23:08.480 | you see that especially with regards to expert call networks, or
00:23:16.080 | the unstated investment theses that a portfolio manager may have, like, we just don't do biotech
00:23:24.560 | or we believe that Eli Lilly is actually very exposed because of how unpleasant
00:23:32.800 | it is to take these drugs, for example. Right. Those are things that are beliefs about the world,
00:23:38.720 | but that may not be falsifiable right now. And so I think
00:23:44.720 | you can again take pages from the consumer web playbook and think about personalization.
00:23:53.280 | So it is getting a person to articulate everything that they believe is not a realistic task.
00:23:59.200 | Netflix doesn't ask you to describe what kinds of movies you like, and they give you the option
00:24:05.440 | to vote, but nobody does this. And so what I think you do is you observe people's revealed
00:24:11.440 | preferences. So one of the capabilities that our system exposes is, given everything that
00:24:18.160 | Brightwave has read and assessed and like the sort of synthesized financial analysis, what are the
00:24:24.480 | natural next questions that a person investigating the subject should ask? And you can
00:24:30.320 | think of this chain of thought and this deepening kind of investigative process and the direction
00:24:38.720 | in which the user steers the attention of this system reveals information about what do they
00:24:46.080 | care about? What do they believe? What kinds of things are important? And so at the individual
00:24:54.480 | level, but then also at the fund and firm level, you can develop like an implicit representation
00:25:02.720 | of your beliefs about the world in a way that you're never going to get
00:25:08.800 | somebody to write everything down. Yeah. Yeah. How does that tie into one of our other favorite
00:25:14.720 | topics, evals? We had David Luan from Adept and he mentioned they don't care about benchmarks
00:25:19.840 | because their customers don't work on benchmarks. They work on, you know, business results. How do
00:25:25.440 | you think about that for you? And maybe as you build a new company, when is the time to like
00:25:30.800 | still focus on the benchmark versus when it's time to like move on to your own evaluation using maybe
00:25:35.600 | labelers or whatnot? So, I mean, we use a fair bit of LLM supervision to evaluate
00:25:44.000 | multiple different subsystems. And, I mean, we pay human
00:25:49.760 | annotators to evaluate the quality of the generative outputs. And I think that that is
00:25:53.920 | always the reference standard, but we frequently first turn to LLM supervision as a way to have,
00:26:02.400 | whether it's at fine tuning time or even for subsystems that are not generative, like what is
00:26:09.840 | the quality of the system? And I think we will generate a small corpus of high quality domain
00:26:16.320 | expert annotations and then always compare that against how well is either LLM supervision or
00:26:21.760 | even just a heuristic, right? Like a simple thing you can do, and this is a technique that we
00:26:27.280 | do not use, but as an example, do not generate any integers or any numbers that are not present
00:26:35.040 | in the underlying source data, right? You know, if you're doing RAG, you can just say you can't
00:26:39.200 | name numbers that are not in the sources. It's very heavy-handed, but you can take the
00:26:45.360 | annotations of a human evaluator and then compare against that. I mean, Snorkel kind of takes a similar
00:26:50.480 | perspective, like multiple different weak supervision datasets can give you substantially
00:26:58.320 | more than any one of them does on their own. And so I think you want to compare the quality of
00:27:02.960 | any evaluation against the human-generated benchmark. But at the end of the day,
00:27:10.480 | like eventually you, especially for things that are nuanced, like is this transcendent poetry?
00:27:16.080 | There's just no way to multiple choice your way out of that, you know? And so really where
00:27:25.040 | I think a lot of the flywheels for some of the large LLM companies are, it's methodological,
00:27:31.920 | obviously, but it's also just data generation. And you think about like, you know, for anybody
00:27:37.680 | who's done crowdsource work, and this I think applies to high skilled human annotators as well.
00:27:44.240 | Like you look at the Google search quality evaluator guidelines, it's like a 90 or 120
00:27:49.280 | page rubric describing like what is a high quality search result. And it's like very difficult to get
00:27:54.400 | on the human level, people to reproducibly follow a rubric. And so what is your process
00:28:03.040 | for orchestrating that motion? Like how do you articulate what is high quality insight? I think
00:28:10.400 | that's where a lot of the work actually happens. And it's sort of the last resort, like
00:28:19.520 | ideally you want to automate everything, but ultimately the most interesting
00:28:23.680 | problems right now are those that are not especially automatable. - One thing you did
00:28:28.480 | at Databricks was the, well, not that you did specifically, but the team there was like the
00:28:33.760 | Dolly 15K data set. You mentioned people undervalue this data. Why has no other company
00:28:43.200 | done anything similar? Like creating this employee-led dataset. You can imagine,
00:28:48.480 | you know, some of these like Goldman Sachs, they got like thousands and thousands of people in
00:28:52.400 | there. Obviously they have different privacy and whatnot requirements. Do you think more companies
00:28:57.520 | should do it? Like, do you think there's like a misunderstanding of how valuable that is or yeah?
00:29:03.360 | - So I think Databricks is a very special company and led by people who are
00:29:09.200 | very sort of courageous, I guess is one word for it. Just like, let's just ship it. And I think
00:29:19.520 | it's unusual and it's also because I think like most companies will recognize, like if they go to
00:29:26.400 | the effort to produce something like that, they recognize that it is competitive advantage to have
00:29:30.560 | it and to be the only company that has it. And I think Databricks is in an unusual position in that
00:29:35.760 | they benefit from more people having access to these kinds of sources, but you also saw Scale.
00:29:42.080 | I guess they haven't released it. - Well, yeah. I'm sure they have it
00:29:45.840 | because they charge people a lot of money. - But they created that alternative to the
00:29:49.360 | GSM8K, I believe is how that's said. I guess they too are not releasing that.
00:30:01.440 | - Yeah. It's interesting because I talked to a lot of enterprises and a lot of them are like,
00:30:06.960 | man, I spent so much money on scale. And I'm like, why don't you just do it? And they're like, what?
00:30:13.120 | - So I think this again gets to the human process orchestration. It's one thing to do
00:30:21.680 | like a single monolithic push to create a training data set like that or an evaluation corpus,
00:30:27.040 | but I think it's another to have a repeatable process. And a lot of that, I think realistically
00:30:33.600 | is pretty unsexy, like people management work. So that's probably a big part of it.
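As a concrete version of the evaluation loop Mike described a few minutes earlier, LLM supervision and a cheap heuristic, always benchmarked against a small human-annotated gold set, a rough sketch might look like this. The judge prompt, the pass/fail labels, and the simple agreement metric are assumptions for illustration only.

```python
# Illustrative only: score generations with an LLM judge and a cheap heuristic, then
# measure how well each weak signal agrees with a small human-labeled gold set.
import re
from typing import Callable, Dict, List


def heuristic_score(output: str, source: str) -> int:
    """Example heuristic: fail outputs that introduce numbers absent from the source."""
    nums = [n.rstrip(",.") for n in re.findall(r"\d[\d,.]*", output)]
    return int(all(n in source for n in nums))  # 1 = pass, 0 = fail


def judge_score(output: str, question: str, judge: Callable[[str], str]) -> int:
    verdict = judge(f"Question: {question}\nAnswer: {output}\n"
                    "Is this answer specific, grounded, and useful to a finance professional? YES or NO.")
    return int(verdict.strip().upper().startswith("YES"))


def agreement(weak: List[int], human: List[int]) -> float:
    """Fraction of examples where the weak signal matches the human gold label."""
    return sum(w == h for w, h in zip(weak, human)) / len(human)


def evaluate(examples: List[Dict], judge: Callable[[str], str]) -> Dict[str, float]:
    """Each example: {'question', 'output', 'source', 'human_label' (0 or 1)}."""
    human = [e["human_label"] for e in examples]
    return {"judge_vs_human": agreement(
                [judge_score(e["output"], e["question"], judge) for e in examples], human),
            "heuristic_vs_human": agreement(
                [heuristic_score(e["output"], e["source"]) for e in examples], human)}
```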
00:30:39.520 | - We have this Four Wars of AI framework. The data quality war we kind of touched on
00:30:45.760 | a little bit. Now about RAG. That's like the other battlefield, RAG and context sizes and kind of
00:30:50.800 | like all these different things. You work in a space that has a couple of different things. One,
00:30:56.880 | temporality of data is important because every quarter there's new data and like the new data
00:31:02.880 | usually overrides the previous one. So you cannot just do semantic search and hope you
00:31:07.440 | get the latest one. And then you obviously have very structured numbers that are very
00:31:13.680 | important at the token level, like 50% gross margins and 30% gross margins are very different,
00:31:19.920 | but the tokenization is not that different. Any thoughts on how to build a system to
00:31:25.680 | handle all of that as much as you can share, of course?
00:31:27.520 | - Yeah, absolutely. So I think this again, rather than having open-ended retrieval,
00:31:35.440 | open-ended reasoning, our approach is to decompose the problem into multiple different
00:31:41.040 | subsystems that have specific goals. And so, I mean, temporality is a great example.
00:31:48.000 | When you think about time, I mean, just look at all of the libraries for managing calendars.
00:31:54.800 | Time is kind of at the intersection of language and math. And this is one of the places where
00:32:04.720 | without taking specific technical measures to ensure that you get high quality narrative
00:32:10.240 | overlays of statistics that are changing over time and have a description of how a PE multiple
00:32:16.800 | is increasing or decreasing, and a retrieval system that is aware of
00:32:24.560 | the temporal intent of the user query, you're not going to get good results. If I'm asking something about breaking news,
00:32:30.320 | like that's going to be very different than if I'm looking for a thematic account of the past
00:32:35.520 | 18 months in Fed interest rate policy. You have to have retrieval systems that are,
00:32:43.600 | to your point, like if I just look for something that is a nearest neighbor without any of that
00:32:48.560 | temporal or other qualitative metadata overlay, you're just going to get a kind of a bag of facts
00:32:57.440 | and that is explicitly not helpful, because the worst failure state for these systems
00:33:04.160 | is that they are wrong in a convincing way. And so I think at least presently you have to have
00:33:11.200 | subsystems that are aware of the semantics of the documents or aware of the semantics of
00:33:18.560 | the intent behind the question, and then we have multiple evaluation
00:33:25.680 | steps. Once you have the generated outputs, we assess them multiple different ways to know,
00:33:30.880 | is this a factual statement given the sort of content that's been retrieved?
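None of the above reveals Brightwave's retrieval stack, but the basic move, filter on temporal and qualitative metadata first, then rank by semantic similarity within the filtered set, is straightforward to sketch. The schema and fields below are assumptions for illustration; embeddings are assumed to be precomputed and unit-normalized.

```python
# Sketch of metadata-aware retrieval: filter chunks by date range and document type
# before nearest-neighbor ranking, so "breaking news" and "the last 18 months of Fed
# policy" hit different slices of the corpus.
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional, Set

import numpy as np


@dataclass
class Chunk:
    text: str
    embedding: np.ndarray    # unit-normalized vector
    published: date
    doc_type: str            # e.g. "10-K", "transcript", "news"


def retrieve(query_embedding: np.ndarray, chunks: List[Chunk], k: int = 10,
             after: Optional[date] = None, before: Optional[date] = None,
             doc_types: Optional[Set[str]] = None) -> List[Chunk]:
    candidates = [c for c in chunks
                  if (after is None or c.published >= after)
                  and (before is None or c.published <= before)
                  and (doc_types is None or c.doc_type in doc_types)]
    # Cosine similarity reduces to a dot product on unit-normalized vectors.
    return sorted(candidates,
                  key=lambda c: float(np.dot(query_embedding, c.embedding)),
                  reverse=True)[:k]


# A query classified as "breaking news" might set after=date.today() - timedelta(days=3),
# while a thematic question about Fed policy might span 18 months of transcripts.
```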
00:33:36.640 | Yep. And what about, I think people think of financial services, they think of privacy,
00:33:42.960 | confidentiality. What's customers' interest in that as far as sharing documents,
00:33:51.360 | and how much of a deal breaker is it if you don't have that? I don't know if you want to
00:33:56.080 | share any about that and how you think about architecting the product. Yeah, so one of the
00:34:02.080 | things that gives our customers a high degree of confidence is the fact that Brandon operated a
00:34:09.680 | federally regulated derivatives exchange. That experience in these highly regulated environments,
00:34:17.520 | I mean, additionally at workday, I worked with the financials product and without going into
00:34:23.680 | specifics, it's exceptionally sensitive data and you have multiple tenants and it's just important
00:34:31.920 | that you take the right approach to being a steward of that material. And so from the start,
00:34:37.680 | we've built in a way that anticipates the need for controls on how that data is managed and
00:34:46.480 | who has access to it and how it is treated throughout the life cycle. And so that for
00:34:51.200 | our customer base, where frequently the most interesting and alpha generating material is not
00:35:00.000 | publicly available, has given them a great degree of confidence in sharing
00:35:04.400 | some of this, the most sensitive and interesting material with systems that are able to combine it
00:35:12.320 | with content that is either publicly or semi-publicly available to create non-consensus
00:35:19.360 | insight into some of the most interesting and challenging problems in finance. Yeah,
00:35:24.320 | we always say RAG is recommendation systems for LLMs. How do you think about that when you
00:35:29.520 | have private versus public data, where sometimes you have public data as one thing, but then the
00:35:34.560 | private is like, well, actually, you know, we've got this insight
00:35:39.360 | that we're going to figure out. How do you think in the RAG system about the value
00:35:45.280 | of these different documents? You know, I know a lot of it is secret sauce, but...
00:35:48.800 | No, no, it's fine. I mean, I think that there is,
00:35:51.120 | so I will gesture towards this by way of saying context-aware prompting. So you can have prompts
00:36:03.280 | that are composable and that have different sort of command units that like may or may not be
00:36:11.440 | present based on the semantics of the content that is being populated into the RAG context window.
00:36:16.800 | And so that's something we make great use of, which is where is this being retrieved from?
00:36:23.920 | What does it represent? And what should be in the instruction set in order to treat and respect the
00:36:31.360 | underlying contents? Not just as like, here's a bunch of text, like you figure it out, but
00:36:36.480 | this is important in the following way, or this aspect of the SEC filings are just categorically
00:36:45.440 | uninteresting, or this is sell-side analysis from a favored source. And so,
00:36:52.480 | much like you have the problem of organizing the work of
00:37:00.320 | humans, you have the problem of organizing the work of all of these different AI subsystems and
00:37:06.640 | getting them to propagate what they know through the rest of the stack so that if you have multiple,
00:37:13.840 | seven, 10 sequence inference calls, that all of the relevant metadata is propagated through
00:37:21.520 | that system and that you are aware of where did this come from? How convicted am I that it is a
00:37:28.240 | source that should be trusted? I mean, you see this also just in analysis, right? So different,
00:37:33.920 | like seeking alpha is a good example of just a lot of people with opinions. And some of them are
00:37:40.960 | great. Some of them are really mid and how do you build a system that is aware of the user's
00:37:50.320 | preferences for different sources? I think this is all related to how we talked about systems
00:37:58.400 | engineering. It's all related to how you actually build the systems. - And then just to kind of wrap
00:38:04.240 | on the right side, how should people think about knowledge graphs and kind of like extraction from
00:38:09.760 | documents versus just like semantic search? - Knowledge graph extraction is an area where
00:38:15.200 | we're making a pretty substantial investment. I think that it is underappreciated how
00:38:23.120 | powerful this is. There's the generative capabilities of language models, but there's also the ability to
00:38:29.280 | program them to function as arbitrary machine learning systems, basically at near-zero marginal
00:38:37.200 | cost. And so the ability to extract structured information from huge, like sort of unfathomably
00:38:47.440 | large bodies of content in a way that is single pass. So rather than having to reanalyze a document
00:38:56.400 | every time that you perform inference or respond to a user query, we believe quite firmly that
00:39:04.000 | you can also in an additive way, perform single pass extraction over this body of text and then
00:39:10.960 | bring that into the RAG context window. And this really sort of levers off of my experience at
00:39:21.040 | LinkedIn where you had this structured graph representation of the global economy where you
00:39:26.640 | said person A works at company B. We believe that there's an opportunity to create a knowledge graph
00:39:33.760 | that has resolution that greatly exceeds what any, whether it's Bloomberg or LinkedIn currently has
00:39:40.480 | access to. We're getting as granular as person X submitted congressional testimony that was
00:39:45.840 | critical of organization Y. And this is the language that is attached to that testimony.
00:39:50.320 | And then you have a structured data artifact that you can pivot through and reason over
00:39:55.760 | that is complementary to the generative capabilities that language models expose.
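As a rough illustration of that single-pass structured extraction (the schema below is invented for the example, and `call_llm` is a placeholder), you might ask the model to emit typed triples with the supporting quote and accumulate an index you can pivot through at query time:

```python
# Rough illustration of single-pass knowledge-graph extraction: ask the model for
# (subject, relation, object) records with the supporting quote, parse them, and
# build an entity index for later pivoting. Schema and prompt are illustrative.
import json
from dataclasses import dataclass
from typing import Callable, Dict, List

EXTRACTION_PROMPT = """Extract relationships from the passage as a JSON list.
Each item: {"subject": ..., "relation": ..., "object": ..., "quote": ...}.
Only include relationships explicitly stated in the passage.

Passage:
"""


@dataclass
class Triple:
    subject: str
    relation: str
    object: str
    quote: str


def extract_triples(passage: str, call_llm: Callable[[str], str]) -> List[Triple]:
    raw = call_llm(EXTRACTION_PROMPT + passage)
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # in practice you would retry or repair the model output
    triples = []
    for item in items:
        if isinstance(item, dict) and {"subject", "relation", "object", "quote"} <= item.keys():
            triples.append(Triple(item["subject"], item["relation"], item["object"], item["quote"]))
    return triples


def index_by_entity(triples: List[Triple]) -> Dict[str, List[Triple]]:
    index: Dict[str, List[Triple]] = {}
    for t in triples:
        index.setdefault(t.subject, []).append(t)
        index.setdefault(t.object, []).append(t)
    return index
```

The extraction runs once per document, and the resulting records are additive: they sit alongside the raw text and can be pulled into the RAG context window without reanalyzing the document at inference time.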
00:40:00.000 | And so it's the same technology being applied to multiple different ends. And this is manifest
00:40:07.360 | in the product surface where it's a highly facetable, pivotable product, but it also
00:40:12.720 | enhances the reasoning capability of the system. Yeah. You know, when you mentioned you don't want
00:40:18.160 | to re-query like the same thing over and over, a lot of people may say, well, I'll just fine tune
00:40:23.520 | this information into the model. How do you think about that? That was one thing when we started
00:40:29.920 | working together, you were like, we're not building foundation models. A lot of other
00:40:33.840 | startups were like, oh, we're building the finance foundation
00:40:37.760 | model or whatever. When is the right time for people to do fine-tuning versus RAG? Are there
00:40:45.120 | any heuristics that you can share that you use to think about it? So, in general,
00:40:52.960 | I'll just say like, I don't have a strong opinion about how much information you can imbue into a
00:41:01.680 | model that is not present in pre-training through large-scale fine-tuning. The benefit of RAG is
00:41:10.480 | the capability around grounded reasoning: forcing it to attend to a
00:41:15.760 | collection of facts that are known and available at inference time and sort of like materially,
00:41:21.600 | like only using these facts. At least in my view, the role of fine-tuning is really more
00:41:29.600 | around, I think of language models kind of like a stem cell. And then under fine-tuning,
00:41:34.720 | they differentiate into different kinds of specific cells, a kidney cell or an eye cell. And
00:41:40.240 | specifically, I don't think that unbounded agentic behaviors are
00:41:50.960 | useful and that instead a useful LLM system is more like a finite state machine where the behavior of
00:42:01.520 | the system is occupying one of many different behavioral regimes and making decisions about
00:42:05.600 | what state should I occupy next in order to satisfy the goal. As you think about the graph
00:42:12.960 | of those states that your system is moving through, once you develop
00:42:18.880 | conviction that one behavior is useful and repeatable and like worthwhile to differentiate
00:42:28.240 | down into a specific kind of subsystem, that's where fine-tuning, and specifically
00:42:32.960 | generating the training data, having human annotators produce a corpus that is useful
00:42:39.920 | enough to get a specific class of behaviors, comes in. That's how we use fine-tuning, rather than
00:42:46.640 | trying to imbue net-new information into these systems. Yeah. But you know,
00:42:55.920 | people are always trying to turn LLMs into humans. It's like, oh, this is my reviewer, this is my
00:43:01.680 | editor. I know you're not in that camp. So any thoughts you have on how people should
00:43:06.880 | think about how to refer to models? And I mean, we've talked a little bit about
00:43:14.400 | this, and it's notable that I think there's a lot of anthropomorphizing going on, and
00:43:20.400 | that it reflects the difficulty of evaluating the systems. Does saying
00:43:27.600 | that you're the journal editor for Nature actually help? Like, you've got the
00:43:35.520 | editor and then you've got the reviewer and you've got, you know, the private
00:43:38.880 | investigator. I think literally we wave our hands and we say,
00:43:45.280 | maybe if I tell you that I'm going to tip you, that's going to help. And it sort of seems to,
00:43:48.960 | and like, maybe it's just like the more cycles, the more compute that is attached to the prompt.
00:43:56.720 | And then the sort of like chain of thought at inference time, it's like, maybe that's all that
00:44:01.840 | we're really doing, and that it's kind of like hidden compute. But our experience
00:44:08.080 | has been that you can get really, really high quality reasoning from a roughly agentic system
00:44:16.800 | without needing to be too cute about it. You can describe the task and you know, within
00:44:25.360 | well-defined bounds, you don't need to treat the LLM like a person to get it to generate high
00:44:34.000 | quality outputs. Yeah. And the other thing is like all these agent frameworks are assuming everything
00:44:41.040 | is an LLM, you know? Yeah, for sure. And I think this is one of the places where
00:44:46.080 | traditional machine learning has a real material role to play in producing a system that hangs
00:44:53.600 | together, and there are, you know, guaranteeable statistical promises that classical machine
00:45:01.840 | learning systems, to include traditional deep learning, can make about, you know, what is the
00:45:07.200 | set of outputs and what is the characteristic distribution of those outputs, that LLMs cannot
00:45:13.040 | afford? And so like one of the things that we do is we, as a philosophy, try to choose the right
00:45:18.160 | tool for the job. And so sometimes that is a de novo model that has nothing to do with LLMs that
00:45:26.560 | does one thing exceptionally well. And whether that's retrieval or critique or multi-class
00:45:33.280 | classification, I think having many, many different tools in your toolbox is always valuable.
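As a purely illustrative example of that "right tool for the job" point, a query-routing subsystem can be an ordinary multi-class classifier with well-understood statistical behavior, no LLM involved. The labels and training queries below are made up for the sketch.

```python
# Purely illustrative: a small non-LLM subsystem. A classical multi-class classifier
# routes incoming queries to a downstream pipeline and gives you probabilities you
# can reason about statistically. Toy data; a real router trains on production queries.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

queries = [
    "summarize NVDA's latest 10-K risk factors",
    "what did management say about margins on the earnings call",
    "breaking news on today's Fed decision",
    "compare gross margins across the last eight quarters",
]
labels = ["filing_analysis", "transcript_analysis", "news", "quantitative"]

router = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
router.fit(queries, labels)

print(router.predict(["what is the latest news on rate cuts"]))   # likely ['news']
print(router.predict_proba(["compare operating margins year over year"]))
```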
00:45:41.760 | This is great. So there's kind of the missing piece that maybe people are wondering about.
00:45:46.400 | You're doing a financial services company and you didn't do anything in Excel. What's the story
00:45:52.400 | behind why you're doing partner in thought versus, hey, this is like an AI enabled model that
00:45:58.640 | understands any stock and all of that. Yeah. And to be clear, we do, Brightwave does a fair
00:46:04.640 | amount of quantitative reasoning. I think what is an explicit non-goal for the company is to
00:46:10.880 | create Excel spreadsheets. And I think when you look at the products that work in that way,
00:46:19.280 | you can spend hours with an Excel spreadsheet and not notice a subtle bug. And that is a
00:46:28.560 | highly non-fault tolerant product experience where you encounter a misstatement in a financial model
00:46:35.200 | in terms of how a formula is composed and all of your assumptions are suddenly violated.
00:46:39.520 | And now it's effectively wasted effort. So as opposed to the partner in thought modality,
00:46:46.160 | which is yes and, like if the model says something that you don't agree with,
00:46:50.720 | you can say, take it under consideration. This is not interesting to me. I'm going to pivot to the
00:46:56.240 | next finding or claim. And it's more like a dialogue. The other piece of this is that
00:47:04.480 | the financial modeling is often very, when we talk to our users, it's very personal. So they
00:47:10.000 | have a specific view of how a company is structured. They have that one key driver
00:47:14.400 | of asset performance that they think is really, really important. It's kind of like the difference
00:47:20.000 | between writing an essay and having an essay written for you, I guess. The purpose of homework is to actually
00:47:27.440 | develop, what do I think about this? And so it's not clear to me that push a button, have
00:47:33.440 | a financial model is solving the actual problem that the financial model affords.
00:47:40.880 | And so that said, we take great efforts to have exceptionally high quality quantitative
00:47:47.920 | reasoning. So you think about, and I won't get into too many specifics about this, but
00:47:55.200 | we deal with a fair number of documents that have tabular data that is really important to making
00:48:01.760 | informed decisions. And so the way that our RAG systems operate over and retrieve from tabular
00:48:09.760 | data sources is something that we place a great degree of emphasis on. I just think
00:48:15.840 | the medium of Excel spreadsheets is not the right play for this class of technologies
00:48:26.160 | as they exist in 2024. - What about 2034? Are people still gonna be making Excel models? I think
00:48:35.280 | to me, the most interesting thing is, how are the models abstracting people away from some of these
00:48:42.480 | more syntax driven thing and making them focus on what matters to them? - I wouldn't be able to tell
00:48:48.640 | you what the future 10 years from now looks like. I think anybody who could convince you of that
00:48:55.760 | is not necessarily somebody to be trusted. I do think that, so let's draw the parallel to
00:49:02.320 | accountants in the '70s. So VisiCalc, I believe, came out in 1979. And historically,
00:49:12.560 | as an accountant, as a finance professional in the '70s, I'm the one who runs the
00:49:18.640 | numbers. I do the arithmetic. That's like my main job. And we think that, I mean, you just look,
00:49:26.320 | now that's not a job anybody wants. And the sophistication of the analysis that a person
00:49:31.920 | is able to perform as a function of having access to powerful tools like computational spreadsheets
00:49:36.960 | is just much greater. And so I think that with regards to language models, it is probably the
00:49:44.480 | case that there is a place in the workflow where it is commenting on your analysis within that
00:49:56.080 | spreadsheet-based context, or it is taking information from those models and sucking
00:50:02.320 | it into a system that does qualitative reasoning on top of that. But I think
00:50:11.280 | it is an open question as to whether the actual production of those models is still a human task,
00:50:16.560 | but I think the sophistication of the analysis that is available to us and the completeness
00:50:22.480 | of that analysis just necessarily increases over time. - Yeah. What about AI hedge funds?
00:50:31.280 | Obviously, I mean, we have quants today, right? But those are more kind of like momentum-driven,
00:50:35.920 | kind of like signal-driven and less about long thesis-driven. Do you think that's a possibility
00:50:40.720 | there? - This is an interesting question. I would put it back to you and say, how different is that
00:50:49.840 | from what hedge funds do now? I think there is, the more that I have learned about how teams at
00:50:58.400 | hedge funds actually behave, and you look at systematic desks or semi-systematic trading groups,
00:51:03.600 | man, it's a lot like a big machine learning team. And it's, I sort of think it's interesting,
00:51:09.040 | right? So like, if you look at video games and traditional like Bay Area tech, there's not a ton
00:51:16.560 | of like talent mobility between those two communities. You have people that work in video
00:51:22.400 | games and people that work in like SaaS software. And it's not that like cognitively, they would not
00:51:28.080 | be able to work together. It's just like a different set of skill sets, a different set
00:51:30.800 | of relationships. And it's kind of like network clusters that don't interact. I think there's
00:51:34.320 | probably a similar phenomenon happening with regards to machine learning within the active
00:51:43.600 | asset allocation community. And so like, it's actually not clear to me that we don't have
00:51:50.480 | AI hedge funds now. The question of whether you have an AI that is operating a trading desk,
00:51:56.480 | like that seems a little, maybe, like I don't have line of sight to something like that existing yet.
00:52:06.800 | - I'm always curious. I think about asset management on a few different ways, but
00:52:13.040 | venture capital is like extremely power law driven. It's really hard to do machine learning
00:52:18.640 | in power law businesses because the distribution of outcomes is so small versus public equities.
00:52:24.960 | Most high-frequency trading is like very bell curve, normal distribution. It's like,
00:52:30.640 | even if you just get 50.5% at the right scale, you're going to make a lot of money.
00:52:35.440 | And I think AI starts there. And today most high-frequency trading is already AI driven.
00:52:42.240 | Renaissance started a long time ago using these models. But I'm curious how it's going to move
00:52:47.920 | closer and closer to like power law businesses. I would say some boutique hedge funds,
00:52:54.160 | their pitch is like, "Hey, we're differentiated because we only do kind of like these
00:52:59.680 | long only strategies that are like thesis driven versus movement driven." And most venture
00:53:06.000 | capitalists will tell you, "Well, our fund is different because we have this unique thesis
00:53:09.760 | on this market." And I think like five years ago, I wrote this blog post about why machine
00:53:16.560 | learning would never work in venture, because the things that you're investing in today,
00:53:20.880 | they're just like no precedent that should tell you this will work. Most new companies,
00:53:25.600 | a model will tell you this is not going to work. Versus the closer you get to the public companies,
00:53:30.960 | the more any innovation is like, "Okay, this is kind of like this thing that happened."
00:53:35.760 | And I feel like these models are quite good at generalizing and thinking,
00:53:40.720 | again, going back to the partner in thought, like thinking about second order.
00:53:44.000 | Yeah, and that's maybe where, so a concrete example, I think it certainly is the case that
00:53:51.440 | we tell retrospective, to your point about venture, we tell retrospective stories where it's
00:53:56.320 | like, "Well, here was the set of observable facts. This was knowable at the time, and these people
00:54:01.120 | made the right call and were able to cross correlate all of these different sources, and
00:54:05.680 | this is the bet we're going to make." I think that process of idea generation is absolutely
00:54:13.440 | automatable. And the question of like, do you ever get somebody who just sets the system running,
00:54:18.880 | and it's making all of its own decisions like that, and it is truly like doing thematic investing,
00:54:26.080 | or more of what a human analyst would be on the hook for, as opposed to like HFT.
00:54:32.160 | But the ability of models to say, "Here is a fact pattern that is noteworthy, and
00:54:42.640 | we should pay more attention here." Because if you think about the matrix of all possible
00:54:48.960 | relationships in the economy, it grows with the square of the number of facts you're evaluating,
00:54:56.800 | polynomially in the number of facts you're evaluating. And so, if I want to make bets on AI,
00:55:07.120 | I think it's like, "What are ways to profit from the rise of AI?" It is very straightforward to
00:55:13.360 | take a model and say, "Parse through all of these documents and find second-order derivative bets,"
00:55:19.760 | and say, "Oh, it turns out that energy is very, very adjacent to investments in AI and may not
00:55:27.360 | be priced in the same way that GPUs are." And a derivative of energy, for example,
00:55:33.360 | is long-duration energy storage. And so, you need a bridge between renewables, which have
00:55:39.200 | fluctuating output, and the compute requirements of these data centers. And I think that,
00:55:45.200 | and I'm telling this story as like having witnessed Brightwave do this work, you can take a
00:55:52.160 | premise and say, "What are second- and third-order bets that we can make on this topic?" And it's
00:55:57.920 | going to come back with, "Here's a set of reasonable theses." And then I think a human's
00:56:03.520 | role in that world is to assess like, "Does this make sense given our fund strategy? Is this
00:56:08.880 | coherent with the calls that I've had with the management teams?" There's this broad body of
00:56:14.240 | knowledge that I think humans are the ultimate synthesizers and deciders. And maybe I'm wrong.
00:56:23.040 | Maybe the world of the future looks like the AI that truly does everything. I think it is kind of
00:56:32.080 | a singularity where it's really hard to reason about what that world looks like. And you asked
00:56:38.320 | me to speculate, but I'm actually kind of hesitant to do so because the forecast, like a
00:56:43.360 | hurricane path, just diverges far too much to have a real conviction about what that looks like.
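To make the "grows with the square of the number of facts" point concrete, here is a minimal sketch of why the relationship matrix explodes: n facts yield n(n-1)/2 candidate pairs, which is why screening them for noteworthy fact patterns is a natural job for a model. The fact list is invented for illustration and is not from Brightwave.

```python
from itertools import combinations

# Toy universe of "facts" an analyst might track (illustrative only).
facts = ["AI capex", "GPU supply", "data-center power demand",
         "long-duration energy storage", "grid interconnect queues"]

pairs = list(combinations(facts, 2))      # every candidate pairwise relationship
n = len(facts)
assert len(pairs) == n * (n - 1) // 2     # quadratic growth: O(n^2)

print(f"{n} facts -> {len(pairs)} pairwise relationships to evaluate")
# Each pair becomes a prompt-sized question for a model, e.g.:
for a, b in pairs[:3]:
    print(f"  How does '{a}' relate to '{b}', and is that relationship priced in?")
```

At 5 facts that is 10 pairs; at 10,000 facts it is roughly 50 million, well past what a human analyst can triage by hand.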
00:56:50.880 | Awesome. I know we've already taken up a lot of your time, but maybe one thing to touch on
00:56:57.440 | before wrapping is open-source LLMs. Obviously you were at the forefront of it. We recorded
00:57:03.520 | our episode the day that RedPajama was open-sourced and we were like, "Oh man, this is mind-blowing.
00:57:08.880 | This is going to be crazy." And now we're going to have an open-source dense transformer model
00:57:15.200 | that is 400 billion parameters. I don't know if one year ago you could have told me that
00:57:19.440 | that was going to happen. So what do you think matters in open-source? What do you think
00:57:24.880 | people should work on? What are things that people should keep in mind to evaluate? Is this model
00:57:31.280 | actually going to be good or is it just cheating some benchmarks to look good? Is there anything
00:57:35.840 | there? This is the part of the podcast where people already dropped off if they wanted to,
00:57:42.240 | so whoever is still listening wants to hear the hot takes right now. I do think that that's another reason to have
00:57:47.280 | your own private evaluation corpuses: so that you can objectively and out of sample
00:57:52.160 | measure the performance of these models. Again, sometimes that just looks like giving everybody
00:57:59.840 | on the team 250 annotations and saying, "We're just going to grind through this."
00:58:04.080 | The other thing about doing the work yourself is that you get to articulate your loss function
00:58:11.680 | precisely. What do I actually want the system to behave like? Do I prefer this system or
00:58:17.120 | this model or this other model? I think that overfitting on the test set is 100% happening.
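As one concrete way to read the private-evaluation point, here is a minimal sketch of turning a few hundred in-house annotations into an out-of-sample preference score between two candidate models. The schema and the numbers are hypothetical, not Brightwave's actual setup.

```python
from collections import Counter

# Hypothetical private annotations: each reviewer saw two anonymized model
# outputs for the same prompt and recorded which one they preferred.
annotations = [
    {"prompt_id": 1, "preferred": "model_a"},
    {"prompt_id": 2, "preferred": "model_b"},
    {"prompt_id": 3, "preferred": "model_a"},
    {"prompt_id": 4, "preferred": "tie"},
    # ...on the order of 250 judgments per team member in practice
]

counts = Counter(a["preferred"] for a in annotations)
decided = counts["model_a"] + counts["model_b"]
win_rate_a = counts["model_a"] / decided if decided else 0.0
print(f"model_a preferred in {win_rate_a:.0%} of decided comparisons "
      f"({counts['tie']} ties, {len(annotations)} judgments total)")
```

Because the prompts and judgments never leave the team, the benchmark stays out of sample and out of any test-set-gaming loop.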
00:58:27.200 | One notable shift, in contrast to a year ago, say, is that the economic incentives for companies to train
00:58:40.080 | their own foundation models, I think, are diminishing. The window in which you are the
00:58:48.720 | dominant pre-train is narrowing. And let's say that you spend $5 to $40 million for a commodity-ish pre-train,
00:58:59.360 | not a 400-billion-parameter model, which would be another sort of... It costs more than $40 million.
00:59:03.760 | Another leap. But the kind of thing that a small, multi-billion dollar mom and pop shop
00:59:10.640 | might be able to pull off. The benefit that you get from that is, I think, diminishing over time.
00:59:20.320 | I think fewer companies are going to make that capital outlay. I think that there's probably
00:59:28.160 | some material negatives to that. But the other piece is that we're seeing that,
00:59:33.200 | at least in the past two and a half, three months, there's a convergence
00:59:38.160 | towards, well, these models all behave fairly similarly. It's probably that the training data
00:59:47.600 | on which they are pre-trained is substantially overlapping. You're getting a model that
00:59:54.560 | generalizes to that shared training data. It's unclear to me that you end up with this sort of balkanization,
01:00:02.560 | where there are many different models, each of which is good in its own unique way, versus
01:00:07.280 | something like Llama becomes, "Listen, this is a fine standard to build off of."
01:00:13.600 | We'll see. It's just that the upfront cost is so high. I think for the people that have the money,
01:00:20.480 | the benefit of doing the pre-train is now less. Where I think it gets really interesting is,
01:00:27.920 | how do you differentiate these in all of these different behavioral regimes? I think the cost
01:00:33.840 | of producing instruction tuning and fine-tuning data that creates specific kinds of behaviors,
01:00:41.520 | I think that's probably where the next generation of really interesting work starts to happen.
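For readers who have not built this kind of data, here is a minimal sketch of a single instruction-tuning record; the instruction/input/response JSONL layout is a common convention, and the example content is invented rather than taken from any Brightwave dataset.

```python
import json

# One hypothetical record targeting a specific behavior: cite the passage
# used when summarizing a filing excerpt.
record = {
    "instruction": "Summarize the key risk factors and cite the passage you relied on.",
    "input": "Excerpt from a 10-K risk-factors section goes here.",
    "response": "Key risk: customer concentration. (Source: second paragraph of the excerpt.)",
}

# "Producing fine-tuning data that creates specific kinds of behaviors" means
# writing or curating thousands of records like this one.
with open("behavior_tuning.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```

The leverage is in deciding which behaviors to demonstrate and how consistently, not in the file format itself.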
01:00:49.600 | If you see that the same model architecture trained on much more training data can exhibit
01:00:56.480 | substantially improved performance, that tells you something about the marginal value of modeling innovations.
01:01:04.160 | For fundamental machine learning and AI research, there is still so much to be done. But I think
01:01:12.320 | that the much lower-hanging fruit is developing new kinds of training data corpuses that elicit
01:01:22.960 | new behaviors from these models in a specific way. That's where, when I think about the
01:01:28.160 | availability, a year ago you had to have access to fairly high-performance GPUs that were hard
01:01:37.360 | to get in order to get the experience of multiple reps fine-tuning these models.
01:01:42.960 | What you're doing when you take a corpus and then fine-tune the model, and then see across many
01:01:51.360 | inference passes what is the qualitative character of the output, you're developing your own internal
01:01:56.160 | mental model for how does the composition of the training corpus shape the behavior of the model
01:02:00.800 | in a qualitative way. A year ago it was very expensive to get that experience. Now you can
01:02:06.480 | just recompose multiple different training corpuses and see what happens if I insert
01:02:12.080 | this set of demonstrations, or I ablate that set of demonstrations. That I think is a very,
01:02:18.240 | very valuable skill and one of the ways that you can have models and products that
01:02:23.040 | other people don't have access to. I think as those sensibilities proliferate,
01:02:30.640 | because more people have that experience, you're going to see teams that
01:02:35.680 | release data corpuses that just imbue the models with new behaviors that are especially interesting
01:02:41.120 | and useful. I think that may be where some of the next sets of innovation and differentiation come from.
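A sketch of the recompose-and-ablate loop described above: fine-tune on different mixes of demonstration sets and compare the qualitative character of the outputs. `fine_tune_and_sample` is a placeholder for whatever training and inference stack you use, and the corpus names are invented.

```python
# Corpus-ablation sketch: drop one demonstration set at a time, retrain,
# and inspect how the output character changes.
def fine_tune_and_sample(corpus, prompts):
    """Placeholder: fine-tune a base model on `corpus` and return sampled outputs."""
    return [f"<output for {p!r} from a model tuned on {len(corpus)} examples>"
            for p in prompts]

demo_sets = {
    "citations":        ["records demonstrating source citation..."],
    "terse_style":      ["records demonstrating short, direct answers..."],
    "chain_of_thought": ["records demonstrating step-by-step reasoning..."],
}
eval_prompts = ["Summarize this filing.", "What changed quarter over quarter?"]

for ablated in demo_sets:
    # Compose the training corpus from every demonstration set except one.
    corpus = [ex for name, exs in demo_sets.items() if name != ablated for ex in exs]
    print(f"--- without {ablated!r} ---")
    for output in fine_tune_and_sample(corpus, eval_prompts):
        print(" ", output)
```

The point is the habit rather than the harness: enough cheap reps of this loop and you build the internal mental model of how corpus composition shapes behavior that the conversation describes.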
01:02:47.920 | Yeah, when people ask me, I always tell them the half-life of a model is much shorter
01:02:52.480 | than the half-life of a dataset. I mean, The Pile is still around and core to most of these training
01:02:58.960 | runs, versus all the models people trained a year ago. It's like they're at the bottom of the
01:03:03.520 | LMSYS leaderboard. It's kind of crazy. Just the parallels to other kinds of computing technology
01:03:11.440 | where the work involved in producing the artifact is so significant and the shelf life is like a
01:03:20.160 | week. I'm sure there's a precedent, but it is remarkable. I remember when Dolly was the best
01:03:30.000 | open-source model. Dolly was never the best open-source model, but it demonstrated something
01:03:36.800 | that was not obvious to many people at the time. But we always were clear that it was never state
01:03:41.120 | of the art. State of the art, whatever that means. This is great, Mike. Anything that we forgot to
01:03:48.400 | cover that you want to add? I know you're thinking about growing the team. We are hiring across the
01:03:55.680 | board. AI, engineering, classical machine learning, systems engineering, distributed systems,
01:04:03.440 | front-end engineering, design. We have many open roles on the team. We hire exceptional people.
01:04:10.800 | We fit the job to the person as a philosophy and would love to work with more incredible humans.
01:04:17.920 | Awesome. Thank you so much for coming on, Mike.
01:04:20.640 | Thanks, Alessio.
01:04:33.200 | [Music]
01:04:47.120 | [End of Audio]