How AI is Eating Finance - with Mike Conover of Brightwave
Chapters
0:00 Introductions
4:52 What's Brightwave?
6:04 How to hire for a vertical AI startup
11:26 How $20B+ hedge funds use Brightwave
12:49 Evolution of context sizes in language models
17:18 Summarizing vs Ideating with AI
22:03 Collecting feedback in a field with no truth
25:10 Evaluation strategies and the importance of custom datasets
28:26 Should more companies make employees label data?
30:40 Retrieval for highly temporal and hierarchical data
35:55 Context-aware prompting for private vs. public data
38:16 Knowledge graph extraction and structured information retrieval
40:17 Fine-tuning vs RAG
43:14 Anthropomorphizing language models
45:39 Why Brightwave doesn't do spreadsheets
50:28 Will there be fully autonomous hedge funds?
56:49 State of open source AI
63:43 Hiring and team expansion at Brightwave
00:00:00.000 |
Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO 00:00:04.480 |
of Residence and Decibel Partners, and I have no co-host today as you can see. 00:00:08.240 |
Swyx is in Vienna at ICLR having fun in Europe. And we're in the brand new studio. 00:00:14.880 |
As you might see if you're on YouTube, there's still no sound panels on the wall. Mike 00:00:20.080 |
tried really hard to put them up, but the glue is a little too old for that. 00:00:25.920 |
So if you hear any echo or anything like that, sorry, but we're doing the best that we can. 00:00:31.600 |
And today we have our first repeat guest, Mike Conover. Welcome Mike, 00:00:36.480 |
who is now the founder of Brightwave, not at Databricks anymore. 00:00:39.760 |
Our last episode was one of the fan favorites and I think this will be just as good. So for 00:00:50.160 |
those that have not listened to the first episode, which might be many because the podcast has grown 00:00:54.640 |
a lot since then, thanks to people like Mike who have interesting conversations on it. 00:00:58.640 |
You spent a bunch of years doing ML at some of the best companies on the internet. Things like 00:01:06.160 |
Workday, you know, Skipflag, LinkedIn, most recently at Databricks where you were leading 00:01:10.880 |
the open source large language models team working on Dolly. And now you're doing Brightwave, 00:01:17.360 |
which is in the financial services space, but this is not something new, you know, 00:01:22.320 |
I think when you and I first talked about Brightwave, I was like, 00:01:26.080 |
why is this guy doing a financial services company? And then you look at your background 00:01:30.160 |
and you were doing papers in Nature, in the Nature journal, about LinkedIn data predicting 00:01:36.720 |
S&P 500 stock movement, like many, many years ago. So what's kind of like some of the tying 00:01:44.080 |
elements in your background that maybe people are overlooking that brought you to do this? 00:01:47.840 |
Yeah, sure. So I would say my PhD research was funded by DARPA, and 00:01:57.840 |
we had access to the Twitter dataset early in the natural history of the 00:02:02.240 |
availability of that dataset. And it was focused on the large scale structure of 00:02:05.360 |
propaganda and misinformation campaigns. And at LinkedIn, we had planet-scale descriptions of 00:02:13.600 |
the structure of the global economy. And so primarily my work was homepage newsfeed 00:02:17.680 |
relevant. So when you go to linkedin.com, you would see updates from one of our machine learning 00:02:21.200 |
models. But additionally, I was a research liaison as part of the economic graph challenge 00:02:26.240 |
and had this Nature Communications paper where we demonstrated that 500 million job transitions can 00:02:32.240 |
be hierarchically clustered as a network of labor flows and are predictive of next-quarter S&P 500 00:02:37.360 |
market cap changes. And at Workday, I was director of financials machine learning. 00:02:43.840 |
And you start to see how organizations are organisms. And I think of the way that 00:02:54.480 |
like an accountant or the market encodes information in databases, similar to how 00:03:01.760 |
social insects, for example, organize their work and make collective decisions about 00:03:06.080 |
where to allocate resources or time and attention. And that, especially with the work on Twitter, 00:03:12.400 |
we would see network structures relating to polarization emerge organically out of the 00:03:18.720 |
interactions of many individual components. And so like much of my professional work has 00:03:23.600 |
been focused on this idea that our lives are governed by systems that we're unable to see 00:03:28.000 |
from our locally constrained perspective. And that we, when we, when humans interact 00:03:33.520 |
with technology, they create digital trace data that allows us to observe the structure of those 00:03:38.800 |
systems as though through a microscope or a telescope. And particularly as regards finance, 00:03:46.080 |
I think this is the ultimate, the markets are the ultimate manifestation and record of that 00:03:53.200 |
collective decision-making process that humans engage in. Just to start going off script right 00:03:57.200 |
away. Sure. How do you think about some of these interactions creating the polarization and how 00:04:02.320 |
that reflects in the language models today, because they're trained on this data? Like, 00:04:05.920 |
do you think the models pick up on these things on their own as well? 00:04:09.760 |
Yeah, I think that they are a compression of the world as it existed at the point in time 00:04:14.400 |
when they were pre-trained. And so I think absolutely the, and you see this in Word2Vec too. 00:04:20.240 |
I mean, just the semantics of how we think about gender as relates to professions are encoded in 00:04:29.120 |
the structure of these models and like language models, I think are, you know, much more sort of 00:04:35.120 |
complete representation of human sort of beliefs. Yeah. 00:04:42.400 |
That's awesome. So we left you at Databricks last time you were building Dolly. Tell us a bit more 00:04:46.960 |
about Brightwave. This is the first time you're really talking about it publicly. 00:04:50.480 |
Yeah. Yeah. It's a pleasure. I mean, so we've raised a $6 million seed round, 00:04:56.480 |
led by Decibel, who we love working with, and including participation from 00:05:01.840 |
Point72, one of the largest hedge funds in the world, and Moonfire Ventures. And we are focused 00:05:08.720 |
on like, if you think of the job of an active asset manager, the work to be done is to understand 00:05:13.360 |
something about the market that nobody else has seen in order to identify a mispriced asset. 00:05:17.120 |
And it's our view that that is not a task that is well-suited to human intellect or attention span. 00:05:22.320 |
And so, much as I was gesturing towards the ability of these models to perceive 00:05:26.720 |
more than a human is able to, we think that there's a historically unique opportunity 00:05:33.920 |
to expand individuals' ability to reason about the structure of the economy and the markets. 00:05:39.120 |
It's not clear that you get superhuman reasoning capabilities from human level demonstrations of 00:05:45.680 |
skill. And by that, I mean the pre-training corpus, but then additionally the fine tuning 00:05:49.920 |
corpuses. I think you largely mimic the demonstrations that are present at model 00:05:55.120 |
training time. But from a working memory standpoint, these models outclass humans 00:06:00.800 |
in their ability to reason about these systems. - Yeah. And you started Brightwave with Brandon. 00:06:09.360 |
You two worked together at Workday, but he also has a really relevant background. 00:06:13.280 |
- Yeah, so Brandon Katara is my co-founder, the CTO, and he's a very special human. So he 00:06:21.760 |
has a deep background in finance. So he was the former CTO of a federally regulated derivatives 00:06:27.680 |
exchange, but his first deep learning patent was filed in 2018. And so he spans worlds. He has 00:06:35.920 |
experience building mission critical infrastructure in highly regulated environments for finance use 00:06:41.840 |
cases, but also was very early to the deep learning party and understands it deeply. At Workday, he 00:06:49.920 |
was the tech lead for semantic search over hundreds of millions of resumes and job listings. 00:06:56.560 |
And so just has been working with information retrieval and neural information retrieval methods 00:07:04.480 |
for a very long time. And so he's an exceptional person, and I'm glad to count him among 00:07:11.760 |
the people that we're doing this with. - Yeah, and a great fisherman. 00:07:15.600 |
- Yeah, very talented. - That's always important. 00:07:18.400 |
- Very talented, very enthusiastic. - And then you have a bunch of 00:07:22.960 |
amazing engineers. Then you have folks like JP who used to work at Goldman Sachs. How do you think about 00:07:28.000 |
team building in this more vertical domain? Obviously you come from a deep ML background, 00:07:33.200 |
but you also need some of the industry expertise. So what's the right balance? 00:07:36.000 |
- Yeah, I mean, so I think one of the things that's interesting about building 00:07:41.920 |
verticalized solutions in AI in 2024 is that historically you need the AI capability. You 00:07:52.320 |
need to understand both how the models behave and then how to get them to interact with other kinds 00:07:57.200 |
of machine learning subsystems that together perform the work of a system that can reason on 00:08:03.360 |
behalf of a human. There are also material systems engineering problems in there. So I forget who 00:08:09.200 |
this is attributed to, but a tweet that sort of made reference to all of the traditional software 00:08:15.280 |
companies are trying to hire AI talent and all the AI companies are trying to hire systems 00:08:18.960 |
engineers. And that is 100% the case. Getting these systems to behave in a predictable and 00:08:25.120 |
repeatable and observable way is equally challenging to a lot of the methodological 00:08:30.160 |
challenges. But then you bring in, whether it's law or medicine or public policy, or in our case, 00:08:36.960 |
finance, I think a lot of the most valuable, like Grammarly is a good example of a company that has 00:08:44.960 |
generative work product that is evaluable by most humans. Whereas in finance, 00:08:54.720 |
the character of the insight, the depth of insight and the non-consensusness of the insight 00:08:59.680 |
really requires a fairly deep domain expertise. And even operating an exchange, 00:09:04.960 |
when we went to raise the round, a lot of people said, "Why don't you start a hedge fund?" 00:09:10.000 |
And it's like that is a totally separate, there are many, many separate skills that are unrelated to 00:09:17.600 |
AI in that problem. And so we've brought into the fold domain experts in finance who can help us 00:09:25.760 |
evaluate the character and sort of steer the system. 00:09:29.440 |
- Yep. So that's the team. What does the system actually do? What's the Brightwave product? 00:09:35.840 |
- Yeah, I mean, it does many, many things, but it acts as a partner in thought to finance 00:09:44.240 |
professionals. So you can ask Brightwave a question like, "How is NVIDIA's position in 00:09:49.680 |
the GPU market impacted by rare earth metal shortages?" And it will identify, as thematic 00:09:57.120 |
contributors to an investment decision or to developing your thesis, that in response to 00:10:04.560 |
export controls on A100 cards, China has put in place licensing requirements on the transfer of germanium and 00:10:11.360 |
gallium, which are not rare earth metals, but they're semiconductor production inputs, 00:10:14.880 |
and has expanded its control of African and South American mining operations. 00:10:18.880 |
And so we see, if you think about, we have a $20 billion crossover hedge fund. Their equities team 00:10:29.280 |
uses this tool to go deep on a thesis. So I was describing this like multiple steps into the value 00:10:35.360 |
chain or supply chain for companies. We see wealth management professionals using Brightwave to get 00:10:43.680 |
up to speed extremely quickly as they step into nine conversations tomorrow with clients who are 00:10:50.160 |
assessing like, "Do you know something that I don't? Can I trust you to be a steward of my 00:10:55.600 |
financial well-being?" We see investor relations teams using Brightwave to... 00:11:04.320 |
You just think about the universe of coverage that a person working in finance needs to be 00:11:09.200 |
aware of. The ability to rip through filings and transcripts and have a very comprehensive 00:11:15.360 |
view of the market. It's extremely rate limited by how quickly a person is able to read and not 00:11:21.600 |
just read, but solve the blank page problem of knowing what to say about a fact or finding. 00:11:28.640 |
What else can you share about customers that you're working with? 00:11:32.400 |
Yeah, so we have seen traction that far exceeded our expectations from the market. 00:11:39.680 |
You sit somebody down with a system that can take any question and generate tight, 00:11:46.800 |
actionable financial analysis on that subject and the product kind of sells itself. And so we see 00:11:53.440 |
many, many different funds, firms, and strategies that are making use of Brightwave. So you've got 00:12:01.040 |
10-person owner-operated registered investment advisor, the classical wealth manager, $500 00:12:06.080 |
million in AUM. We have crossover hedge funds that have tens and tens of billions of dollars 00:12:12.560 |
in assets under management, very different use case. So that's more investment research, 00:12:16.320 |
whereas a wealth manager is going to use this to step into client interactions, 00:12:19.120 |
just exceptionally well-prepared. We see investor relations teams. We see 00:12:25.840 |
corporate strategy types that are needing to understand very quickly new markets, new themes, 00:12:35.440 |
and just the ability to very quickly develop a view on any investment theme or sort of strategic 00:12:43.440 |
consideration is broadly applicable to many, many different kinds of personas. 00:12:49.360 |
Yep. Yeah, I can attest to the product selling itself, given that I'm a user. 00:12:54.960 |
Let's jump into some of the technical challenges and work behind it, because 00:13:00.240 |
there are a lot of things. So as I mentioned, you were on the podcast about a year ago. 00:13:05.360 |
You had released Dolly from Databricks, which was one of the first open-source LLMs. 00:13:11.200 |
Dolly had a whopping 1,024 tokens of context size. And today, I think, 1,000 tokens... 00:13:22.240 |
Yeah, exactly. How did you think about the evolution of context sizes as you built the 00:13:27.920 |
company? And where we are today, what are things that people get wrong? Any commentary there? 00:13:34.080 |
Sure. We very much take a systems of systems approach. When I started the company, I think I 00:13:45.600 |
had more faith in the ability of large context windows to generally solve problems relating 00:13:50.800 |
to synthesis. And actually, if you think about the attention mechanism and the way that it 00:13:56.400 |
computes similarities between tokens at a distance, I, on some level, believed that as 00:14:02.640 |
you would scale that up, you would have the ability to simultaneously perceive and draw 00:14:07.520 |
conclusions across vast disparate bodies of content. And I think that does not empirically 00:14:15.760 |
seem to be the case. So when, for example, you, and this is something anybody can try, 00:14:20.640 |
take a very long document, like needle in a haystack, I think, 00:14:24.720 |
sure, we can do information retrieval on specific fact-finding activities pretty easily. 00:14:32.720 |
But if you, I kind of think about it like summarizing, if you write a book report on an 00:14:38.160 |
entire book versus a synopsis of each individual chapter, there is a characteristic output length 00:14:45.200 |
for these models. Let's say it's about 1200 tokens. It is very difficult to get any of the 00:14:50.160 |
commercial LLMs or Llama to write 5,000 tokens. And you think about it as, what is the conditional 00:14:57.200 |
probability that I generate an end token? It just gets higher the more tokens are in the context 00:15:04.320 |
window prior to that sort of next inference step. And so if I have a thousand words in which to 00:15:13.120 |
say something, the level of specificity and the level of depth when I am assessing a very large 00:15:20.960 |
body of content is going to necessarily be less than if I am saying something specific about a 00:15:26.400 |
sub passage. And so we, and if you think about drawing a parallel to consumer internet companies 00:15:34.400 |
like LinkedIn or Facebook, there are many different subsystems with it. So let's take 00:15:41.760 |
the Facebook example. Facebook almost certainly has, I mean, you can see this in your profile, 00:15:47.520 |
your inferred interests. What are the things that it believes that you care about? Those 00:15:52.720 |
assessments almost certainly feed into the feed relevance algorithms that would judge what you 00:15:58.640 |
are, you know, am I going to show you snowboarding content, am I 00:16:02.160 |
going to show you aviation content? It's the outputs of one machine learning system feeding 00:16:09.440 |
into another machine learning system. And I think with modern RAG and sort of agent-based reasoning, 00:16:16.320 |
it is really about creating subsystems that do specific tasks well. And I think the problem of 00:16:22.800 |
deciding how to decompose large documents into more kind of atomic reasoning units is 00:16:30.880 |
still very important. Now it's an open question whether that is something 00:16:39.680 |
that is addressable by pre-training or instruction tuning, like can you have synthesis-oriented 00:16:49.120 |
demonstrations at training time so that this problem is more robustly solved, 00:16:54.400 |
because synthesis is quite different from completing the next word in The Great Gatsby. 00:17:01.040 |
But I think it empirically is not the case that you can just throw all of the SEC filings into, 00:17:08.480 |
you know, a million-token context window and get deep insight that is useful out the other end. 00:17:16.960 |
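To make the decomposition Mike describes concrete, here is a minimal sketch of the "synopsis per chapter, then synthesize" pattern. The `complete(prompt) -> str` callable is a hypothetical stand-in for whatever model API is actually used, and the chunking is deliberately naive; this illustrates the pattern, not Brightwave's implementation.

```python
# Minimal sketch: summarize sub-passages first, then synthesize across the summaries,
# instead of asking for one report over an entire corpus in a single context window.
from typing import Callable, List

def chunk_document(text: str, max_chars: int = 8000) -> List[str]:
    """Naive fixed-size chunking; a real system would split on document structure."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def hierarchical_summary(text: str, complete: Callable[[str], str]) -> str:
    """Per-chunk synopses followed by a synthesis pass over those synopses.

    Each call only has to say something specific about a sub-passage, which works
    around the characteristic ~1,200-token output length of current models.
    """
    chunk_summaries = [
        complete(f"Summarize the key claims and figures in this passage:\n\n{chunk}")
        for chunk in chunk_document(text)
    ]
    joined = "\n\n".join(f"- {s}" for s in chunk_summaries)
    return complete(
        "Given these per-section syntheses, write an integrated analysis that "
        f"draws conclusions across them:\n\n{joined}"
    )
```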
Yeah. And I think that's the main difference about what you're doing. It's not about summarizing, 00:17:21.760 |
it's about coming up with different ideas and kind of like thought threads. 00:17:26.800 |
Yes. Precisely. Yeah. And I think this specifically like helping a person know, 00:17:31.840 |
you know, if I think that GLP-1s are going to blow up the diet industry, 00:17:38.160 |
identifying and putting in context a negative result from a human clinical trial, 00:17:44.880 |
or, for example, that adherence rates to Ozempic after a year are just 35%, 00:17:48.640 |
what are the implications of this? So there's an information retrieval component, and then there's a, 00:17:54.880 |
not just presenting me with a summary of like, here's, here are the facts, but like, 00:18:00.000 |
what does this entail and how does this fit into my worldview, my fund strategy? 00:18:08.720 |
Broadly, I think that, you know, there's this idea that puts it very eloquently, 00:18:14.080 |
and this is not my insight, help me remember who said this, 00:18:18.880 |
you may be familiar, but language models are not tools for 00:18:22.400 |
creating new knowledge. They're tools for helping me create new knowledge. Like they 00:18:26.880 |
themselves do not do that work. I think that that's presently the right way to think about 00:18:33.040 |
it. Yeah. I read a tweet about needle in the haystack actually being harmful to some of this 00:18:40.400 |
work because now the model is like too focused on recalling everything versus saying, oh, that 00:18:45.040 |
doesn't matter. Like ignoring. If you think about an S-1 filing, like 85% is like boilerplate. It's 00:18:52.080 |
like, you know, previous performance doesn't guarantee future performance. Like the company 00:18:57.360 |
might not be able to turn a profit in the future. Blah, blah, blah. These things, they always come 00:19:01.040 |
up again. We have a large workforce and all of that. Have you had to do any work at the model 00:19:10.480 |
level to kind of like make it okay to forget these things? Or like, have you found that like, 00:19:15.600 |
kind of like making it a smaller problem, then cutting it up and putting it back together, kind of 00:19:19.520 |
solves for that? Absolutely. And I think this is where having domain expertise around the structure 00:19:27.600 |
of these documents. So if you look at the different chunking strategies that you can employ to 00:19:31.360 |
understand like what is the intent of this clause or phrase, and then really be selective at 00:19:40.160 |
retrieval time in order to get the information that is most relevant to a user query based on 00:19:45.520 |
the semantics of that unique document. And it's certainly not just a sliding window 00:19:51.680 |
over that corpus. And then the flip side of it is obviously factuality. You don't want to forget 00:20:00.320 |
things that were there. How do you tackle that? Yeah. I mean, of course that's, it's a very deep 00:20:06.320 |
problem. And I think, you know, I'll be a little circumspect about the specific kinds of methods 00:20:11.920 |
we use, but this sort of multiple passes over the material and saying, how convicted are you 00:20:20.800 |
that what you're saying is in fact true? And you, I mean, you can take generations from multiple 00:20:27.200 |
different models and compare and contrast and say like, do these both reach the same conclusion? 00:20:31.840 |
We, you can treat it like a voting problem. We train our own models to assess, you know, 00:20:40.640 |
you can think of this like entailment, like is, is this supported by the underlying primary sources? 00:20:46.640 |
And I think that you have methodological approaches to this problem, but then you 00:20:52.400 |
also have product affordances. So like there's a great blog post from the Bard team 00:20:57.920 |
describing what was sort of a design-led product innovation that allows you to 00:21:05.760 |
ask the model to double check the work. So if you have a surprising finding, 00:21:09.920 |
we can let the user discretionarily spend more compute to double check the work. 00:21:16.160 |
And I think that you want to build product experiences that are fault tolerant. 00:21:20.320 |
And the difference between like hallucination and 00:21:25.840 |
creativity is fuzzy. And so do you ever get language models with next token prediction 00:21:33.360 |
as the loss function that are guaranteed to not contain factual misstatements? That is not clear. 00:21:41.440 |
Now, maybe being able to invoke code interpreter like code generation and then execution in a 00:21:47.680 |
secure way helps to solve some of these problems, especially for quantitative reasoning. That may 00:21:53.600 |
be the case, but for right now, I think you need to have product affordances that allow you to 00:22:02.480 |
live with the reality that these things are fallible. 00:22:08.160 |
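As a rough illustration of the "treat it like a voting problem" idea Mike mentions, here is a hedged sketch that checks each generated claim for entailment against the retrieved sources using several independent checker models and keeps only claims with quorum support. The `ask_model` callables and the vote threshold are assumptions for illustration, not a description of Brightwave's method.

```python
# Sketch: entailment-style grounding check plus multi-model voting over claims.
from typing import Callable, List

def is_entailed(claim: str, sources: List[str], ask_model: Callable[[str], str]) -> bool:
    """Ask a checker model whether the claim is supported by the underlying sources."""
    prompt = (
        "Answer YES or NO. Is the following claim directly supported by the sources?\n\n"
        f"Claim: {claim}\n\nSources:\n" + "\n---\n".join(sources)
    )
    return ask_model(prompt).strip().upper().startswith("YES")

def vote_on_claims(
    claims: List[str],
    sources: List[str],
    checkers: List[Callable[[str], str]],
    min_votes: int = 2,
) -> List[str]:
    """Keep only claims that a quorum of independent checker models agree are grounded."""
    kept = []
    for claim in claims:
        votes = sum(is_entailed(claim, sources, checker) for checker in checkers)
        if votes >= min_votes:
            kept.append(claim)
    return kept
```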
Yep. Yeah. We did our RLHF 201 episode, just talking about different methods and whatnot. 00:22:14.640 |
How do you think about something like this where it's maybe unclear in the short term, 00:22:19.680 |
even if the product is right, you know, it might give, it might give an insight that 00:22:23.840 |
might be right, but it might not prove until later. So it's kind of hard for the users to say 00:22:29.440 |
that's wrong because actually it might be like, you think it's wrong, like an investment. That's 00:22:33.760 |
kind of what it comes down to, you know, some people are wrong. Some people are right. How, 00:22:38.080 |
how do you think about some of the product features that you need and something like this to 00:22:41.840 |
bring user feedback into the mix and maybe how you approach it today and how you think about it 00:22:46.560 |
long-term? Yeah. Well, I mean, I think that, to your point, the model 00:22:52.160 |
may make a statement which is not actually verifiable. It's like, this may be the 00:22:57.600 |
case. I think that is where the reason we think of this as a partner in thought is that humans are 00:23:04.400 |
always going to have access to information that has not, not been digitized. And so in finance, 00:23:08.480 |
you see that especially with regards to expert call networks, and the sort of 00:23:16.080 |
unstated investment theses that a portfolio manager may have, like, we just don't do biotech, 00:23:24.560 |
or we think that Eli Lilly is actually very exposed because of how unpleasant 00:23:32.800 |
it is to take these drugs, for example. Right. Those are things that are beliefs about the world, 00:23:38.720 |
but that may not be like falsifiable right now. And so I think you have to, 00:23:44.720 |
you can again take pages from the consumer web playbook and think about personalization. 00:23:53.280 |
So it is getting a person to articulate everything that they believe is not a realistic task. 00:23:59.200 |
Netflix doesn't ask you to describe what kinds of movies you like, and they give you the option 00:24:05.440 |
to vote, but nobody does this. And so what I think you do is you observe people's revealed 00:24:11.440 |
preferences, like what, so one of the capabilities that our system exposes is given everything that 00:24:18.160 |
Brightwave has read and assessed and like the sort of synthesized financial analysis, what are the 00:24:24.480 |
natural next questions that this, that a person investigating the subject should ask? And you can 00:24:30.320 |
think of this chain of thought and this deepening kind of investigative process and the direction 00:24:38.720 |
in which the user steers the attention of this system reveals information about what do they 00:24:46.080 |
care about? What do they believe? What kinds of things are important? And so at the individual 00:24:54.480 |
level, but then also at the fund and firm level, you can develop like an implicit representation 00:25:02.720 |
of your beliefs about the world in a way that you just, you're never going to get 00:25:08.800 |
somebody to write everything down. Yeah. Yeah. How does that tie into one of our other favorite 00:25:14.720 |
topics, evals? We had David Luan from Adapt and he mentioned they don't care about benchmarks 00:25:19.840 |
because their customers don't work on benchmarks. They work on, you know, business results. How do 00:25:25.440 |
you think about that for you? And maybe as you build a new company, when is the time to like 00:25:30.800 |
still focus on the benchmark versus when it's time to like move on to your own evaluation using maybe 00:25:35.600 |
labelers or whatnot? So, I mean, we, we use a fair bit of LLM supervision to evaluate 00:25:44.000 |
multiple different subsystems. And I think that one of the reasons that, I mean, we, we pay human 00:25:49.760 |
annotators to evaluate the quality of the generative outputs. And I think that that is 00:25:53.920 |
always the reference standard, but we frequently first turn to LLM supervision as a way to have, 00:26:02.400 |
whether it's at fine tuning time or even for subsystems that are not generative, like what is 00:26:09.840 |
the quality of the system? And I think we will generate a small corpus of high quality domain 00:26:16.320 |
expert annotations and then always compare that against how well is either LLM supervision or 00:26:21.760 |
even just a heuristic, right? Like a simple thing you can do. This is, this is a technique that we 00:26:27.280 |
do not use, but as an example, do not generate any integers or any numbers that are not present 00:26:35.040 |
in the underlying source data, right? You know, if you're doing RAG, you can just say you can't 00:26:39.200 |
name numbers that are not in the source. It's a very sort of heavy-handed heuristic, but you can take the 00:26:45.360 |
annotations of a human evaluator and then compare that. I mean, Snorkel kind of takes a similar 00:26:50.480 |
perspective, like multiple different weak supervision data sets can give you substantially 00:26:58.320 |
more than any one of them does on their own. And so I think you want to compare the quality of 00:27:02.960 |
any evaluation against human generated, the sort of like benchmark. But at the end of the day, 00:27:10.480 |
like eventually you, especially for things that are nuanced, like is this transcendent poetry? 00:27:16.080 |
There's just no way to multiple choice your way out of that, you know? And so really where 00:27:25.040 |
I think a lot of the flywheels for some of the large LLM companies are, it's methodological, 00:27:31.920 |
obviously, but it's also just data generation. And you think about like, you know, for anybody 00:27:37.680 |
who's done crowdsource work, and this I think applies to high skilled human annotators as well. 00:27:44.240 |
Like you look at the Google search quality evaluator guidelines, it's like a 90 or 120 00:27:49.280 |
page rubric describing like what is a high quality search result. And it's like very difficult to get 00:27:54.400 |
on the human level, people to reproducibly follow a rubric. And so what is your process 00:28:03.040 |
for orchestrating that motion? Like how do you articulate what is high quality insight? I think 00:28:10.400 |
that's where a lot of the work actually happens. And that it's sort of the last resort, like 00:28:19.520 |
everything, like ideally you want to automate everything, but ultimately the most interesting 00:28:23.680 |
problems right now are those that are not especially automatable. - One thing you did 00:28:28.480 |
at Databricks was the, well, not that you did specifically, but the team there was like the 00:28:33.760 |
Dolly 15K data set. You mentioned people misjudge the value of this data. Why has no other company 00:28:43.200 |
done anything similar? Like creating this employee-led data set. You can imagine, 00:28:48.480 |
you know, some of these like Goldman Sachs, they got like thousands and thousands of people in 00:28:52.400 |
there. Obviously they have different privacy and whatnot requirements. Do you think more companies 00:28:57.520 |
should do it? Like, do you think there's like a misunderstanding of how valuable that is or yeah? 00:29:03.360 |
- So I think Databricks is a very special company and led by people who are 00:29:09.200 |
very sort of courageous, I guess is one word for it. Just like, let's just ship it. And I think 00:29:19.520 |
it's unusual and it's also because I think like most companies will recognize, like if they go to 00:29:26.400 |
the effort to produce something like that, they recognize that it is competitive advantage to have 00:29:30.560 |
it and to be the only company that has it. And I think Databricks is in an unusual position in that 00:29:35.760 |
they benefit from more people having access to these kinds of sources, but you also saw Scale. 00:29:42.080 |
I guess they haven't released it. - Well, yeah. I'm sure they have it 00:29:45.840 |
because they charge people a lot of money. - But they created that alternative to the 00:29:49.360 |
GSM8K, I believe is how that's said. I guess they too are not releasing that. 00:30:01.440 |
- Yeah. It's interesting because I talked to a lot of enterprises and a lot of them are like, 00:30:06.960 |
man, I spent so much money on Scale. And I'm like, why don't you just do it? And they're like, what? 00:30:13.120 |
- So I think this again gets to the human process orchestration. It's one thing to do 00:30:21.680 |
like a single monolithic push to create a training data set like that or an evaluation corpus, 00:30:27.040 |
but I think it's another to have a repeatable process. And a lot of that, I think realistically 00:30:33.600 |
is pretty unsexy, like people management work. So that's probably a big part of it. 00:30:39.520 |
- We have this Four Wars of AI framework, the data quality war, which we kind of touched on 00:30:45.760 |
a little bit. Now about RAG. That's like the other battlefield, RAG and context sizes and kind of 00:30:50.800 |
like all these different things. You work in a space that has a couple of different things. One, 00:30:56.880 |
temporality of data is important because every quarter there's new data and like the new data 00:31:02.880 |
usually overrides the previous one. So you can not just like do semantic search and hope you 00:31:07.440 |
get the latest one. And then you have obviously very structured numbers, things that are very 00:31:13.680 |
important at the token level, like 50% gross margins and 30% gross margins are very different, 00:31:19.920 |
but the tokenization is not that different. Any thoughts on like how to build a system to 00:31:25.680 |
handle all of that as much as you can share, of course? 00:31:27.520 |
- Yeah, absolutely. So I think this again, rather than having open-ended retrieval, 00:31:35.440 |
open-ended reasoning, our approach is to decompose the problem into multiple different 00:31:41.040 |
subsystems that have specific goals. And so, I mean, temporality is a great example. 00:31:48.000 |
When you think about time, I mean, just look at all of the libraries for managing calendars. 00:31:54.800 |
Time is kind of at the intersection of language and math. And this is one of the places where 00:32:04.720 |
without taking specific technical measures to ensure that you get high quality narrative 00:32:10.240 |
overlays of statistics that are changing over time and have a description of how a PE multiple 00:32:16.800 |
is increasing or decreasing and like a retrieval system that is aware of the time, 00:32:24.560 |
sort of the time intent of the user query, right? If I'm asking something about breaking news, 00:32:30.320 |
like that's going to be very different than if I'm looking for a thematic account of the past 00:32:35.520 |
18 months in Fed interest rate policy. You have to have retrieval systems that are, 00:32:43.600 |
to your point, like if I just look for something that is a nearest neighbor without any of that 00:32:48.560 |
temporal or other qualitative metadata overlay, you're just going to get a kind of a bag of facts 00:32:57.440 |
and that that is like explicitly not helpful because the worst failure state for these systems 00:33:04.160 |
is that they are wrong in a convincing way. And so I think at least presently you have to have 00:33:11.200 |
subsystems that are aware of the semantics of the documents or aware of the semantics of 00:33:18.560 |
the intent behind the question. And then we have multiple evaluation 00:33:25.680 |
steps. Once you have the generated outputs, we assess it multiple different ways to know, 00:33:30.880 |
is this a factual statement given the sort of content that's been retrieved? 00:33:36.640 |
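A minimal sketch of what a retrieval step with a temporal metadata overlay might look like, as opposed to a plain nearest-neighbor lookup. The scoring weights, the recency decay, and the `Chunk` schema are invented for illustration and are not a description of Brightwave's actual ranking.

```python
# Sketch: blend semantic similarity with recency, weighted by the query's time intent.
from dataclasses import dataclass
from datetime import datetime
from typing import List, Sequence
import math

@dataclass
class Chunk:
    text: str
    published: datetime          # filing or transcript date
    doc_type: str                # e.g. "10-Q", "earnings_call", "news"
    embedding: Sequence[float]

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(
    query_embedding: Sequence[float],
    chunks: List[Chunk],
    now: datetime,
    breaking_news_intent: bool,
    top_k: int = 10,
) -> List[Chunk]:
    """Rank chunks by similarity plus a recency term whose weight follows the query's time intent."""
    recency_weight = 0.6 if breaking_news_intent else 0.1

    def score(c: Chunk) -> float:
        age_days = max((now - c.published).days, 0)
        recency = math.exp(-age_days / 30.0)  # illustrative decay over roughly a month
        return (1 - recency_weight) * cosine_similarity(query_embedding, c.embedding) \
               + recency_weight * recency

    return sorted(chunks, key=score, reverse=True)[:top_k]
```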
Yep. And what about, I think people think of financial services, they think of privacy, 00:33:42.960 |
confidentiality. What's kind of like customer's interest in that as far as like sharing documents 00:33:51.360 |
and like how much of a deal breaker is that if you don't have them? I don't know if you want to 00:33:56.080 |
share any about that and how you think about architecting the product. Yeah, so one of the 00:34:02.080 |
things that gives our customers a high degree of confidence is the fact that Brandon operated a 00:34:09.680 |
federally regulated derivatives exchange. That experience in these highly regulated environments, 00:34:17.520 |
I mean, additionally at workday, I worked with the financials product and without going into 00:34:23.680 |
specifics, it's exceptionally sensitive data and you have multiple tenants and it's just important 00:34:31.920 |
that you take the right approach to being a steward of that material. And so from the start, 00:34:37.680 |
we've built in a way that anticipates the need for controls on how that data is managed and 00:34:46.480 |
who has access to it and how it is treated throughout the life cycle. And so that for 00:34:51.200 |
our customer base, where frequently the most interesting and alpha generating material is not 00:35:00.000 |
publicly available, has given them a great degree of confidence in sharing 00:35:04.400 |
some of this, the most sensitive and interesting material with systems that are able to combine it 00:35:12.320 |
with content that is either publicly or semi-publicly available to create non-consensus 00:35:19.360 |
insight into some of the most interesting and challenging problems in finance. Yeah, 00:35:24.320 |
we always say RAG is basically recommendation systems for LLMs. How do you think about that when you 00:35:29.520 |
have private versus public data, where sometimes you have public data as one thing, but then the 00:35:34.560 |
private is like, well, actually, you know, we've got this inside view, like this inside 00:35:39.360 |
scoop that we're going to figure out. How do you think in the RAG system about the value 00:35:45.280 |
of these different documents? You know, I know a lot of it is secret sauce, but... 00:35:48.800 |
No, no, it's fine. I mean, I think that there is, 00:35:51.120 |
so I will gesture towards this by way of saying context-aware prompting. So you can have prompts 00:36:03.280 |
that are composable and that have different sort of command units that like may or may not be 00:36:11.440 |
present based on the semantics of the content that is being populated into the RAG context window. 00:36:16.800 |
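A loose sketch of the composable, context-aware prompting idea being described here: instruction fragments that are included or omitted based on where each retrieved chunk came from. The source labels and instruction text are invented for illustration and are not Brightwave's prompts.

```python
# Sketch: assemble a prompt from retrieved chunks plus per-source instruction fragments.
from typing import Dict, List, Tuple

# Instruction fragments keyed by the provenance of the retrieved content (illustrative).
SOURCE_INSTRUCTIONS: Dict[str, str] = {
    "sec_filing": "Treat risk-factor boilerplate as low signal; focus on changes versus prior filings.",
    "earnings_call": "Distinguish management claims from analyst questions; note hedged language.",
    "sell_side_note": "This is opinionated third-party analysis; attribute views to the author.",
    "client_private": "This material is confidential; never quote it verbatim in the output.",
}

def build_prompt(question: str, chunks: List[Tuple[str, str]]) -> str:
    """Compose instructions conditioned on which source types appear in the retrieved context."""
    source_types = {source for source, _ in chunks}
    instructions = [SOURCE_INSTRUCTIONS[s] for s in sorted(source_types) if s in SOURCE_INSTRUCTIONS]
    context = "\n\n".join(f"[{source}]\n{text}" for source, text in chunks)
    return (
        "You are assisting with financial research.\n"
        + "\n".join(f"- {i}" for i in instructions)
        + f"\n\nContext:\n{context}\n\nQuestion: {question}"
    )
```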
And so that's something we make great use of, which is where is this being retrieved from? 00:36:23.920 |
What does it represent? And what should be in the instruction set in order to treat and respect the 00:36:31.360 |
underlying contents? Not just as like, here's a bunch of text, like you figure it out, but 00:36:36.480 |
this is important in the following way, or this aspect of the SEC filings are just categorically 00:36:45.440 |
uninteresting, or this is sell-side analysis from a favored source. And so, 00:36:52.480 |
much like you have the qualitative problem of organizing the work of 00:37:00.320 |
humans, you have the problem of organizing the work of all of these different AI subsystems and 00:37:06.640 |
getting them to propagate what they know through the rest of the stack so that if you have multiple, 00:37:13.840 |
seven, 10 sequence inference calls, that all of the relevant metadata is propagated through 00:37:21.520 |
that system and that you are aware of where did this come from? How convicted am I that it is a 00:37:28.240 |
source that should be trusted? I mean, you see this also just in analysis, right? So different, 00:37:33.920 |
like Seeking Alpha is a good example of just a lot of people with opinions. And some of them are 00:37:40.960 |
great. Some of them are really mid and how do you build a system that is aware of the user's 00:37:50.320 |
preferences for different sources? I think this is all related to how we talked about systems 00:37:58.400 |
engineering. It's all related to how you actually build the systems. - And then just to kind of wrap 00:38:04.240 |
on the right side, how should people think about knowledge graphs and kind of like extraction from 00:38:09.760 |
documents versus just like semantic search? - Knowledge graph extraction is an area where 00:38:15.200 |
we're making a pretty substantial investment. And so this, I think that it is underappreciated how 00:38:23.120 |
powerful, there's the generative capabilities of language models, but there's also the ability to 00:38:29.280 |
program them to function as arbitrary machine learning systems, basically for marginally zero 00:38:37.200 |
cost. And so the ability to extract structured information from huge, like sort of unfathomably 00:38:47.440 |
large bodies of content in a way that is single pass. So rather than having to reanalyze a document 00:38:56.400 |
every time that you perform inference or respond to a user query, we believe quite firmly that 00:39:04.000 |
you can also in an additive way, perform single pass extraction over this body of text and then 00:39:10.960 |
bring that into the RAG context window. And this really sort of levers off of my experience at 00:39:21.040 |
LinkedIn where you had this structured graph representation of the global economy where you 00:39:26.640 |
said person A works at company B. We believe that there's an opportunity to create a knowledge graph 00:39:33.760 |
that has resolution that greatly exceeds what any, whether it's Bloomberg or LinkedIn currently has 00:39:40.480 |
access to. We're getting as granular as person X submitted congressional testimony that was 00:39:45.840 |
critical of organization Y. And this is the language that is attached to that testimony. 00:39:50.320 |
And then you have a structured data artifact that you can pivot through and reason over 00:39:55.760 |
that is complementary to the generative capabilities that language models expose. 00:40:00.000 |
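As a hedged sketch of the single-pass extraction idea, the snippet below asks a model once per document for structured triples and then pivots them by entity, so later queries can hit the graph instead of re-reading the document. The JSON schema and the `complete` stand-in for the model call are assumptions, not Brightwave's format.

```python
# Sketch: one-time structured extraction per document, pivoted into an entity index.
import json
from typing import Callable, Dict, List

def extract_triples(document: str, complete: Callable[[str], str]) -> List[Dict[str, str]]:
    """Ask the model, once per document, for (subject, relation, object, evidence) records.

    Assumes the model returns valid JSON; a production system would validate and retry.
    """
    prompt = (
        "Extract factual relationships from the text as a JSON list of objects with keys "
        '"subject", "relation", "object", and "evidence" (a short supporting quote).\n\n'
        f"Text:\n{document}"
    )
    return json.loads(complete(prompt))

def index_triples(triples: List[Dict[str, str]]) -> Dict[str, List[Dict[str, str]]]:
    """Pivot the extracted triples by entity so later queries can traverse the graph directly."""
    index: Dict[str, List[Dict[str, str]]] = {}
    for t in triples:
        index.setdefault(t["subject"], []).append(t)
        index.setdefault(t["object"], []).append(t)
    return index
```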
And so it's the same technology being applied to multiple different ends. And this is manifest 00:40:07.360 |
in the product surface where it's a highly facetable, pivotable product, but it also 00:40:12.720 |
enhances the reasoning capability of the system. Yeah. You know, when you mentioned you don't want 00:40:18.160 |
to re-query like the same thing over and over, a lot of people may say, well, I'll just fine tune 00:40:23.520 |
this information in the model. How do you think about that? That was one thing when we started 00:40:29.920 |
working together, you were like, we're not building foundation models. A lot of other 00:40:33.840 |
startups were like, oh, we're building the finance, financial model, the finance foundation 00:40:37.760 |
model or whatever. When is the right time for people to do fine tuning versus rag? It's like 00:40:45.120 |
any heuristics that you can share that you use to think about it? So, in general, 00:40:52.960 |
I'll just say like, I don't have a strong opinion about how much information you can imbue into a 00:41:01.680 |
model that is not present in pre-training through large-scale fine-tuning. The benefit of RAG is 00:41:10.480 |
the capability around grounded reasoning. So, you know, forcing it to attend to a 00:41:15.760 |
collection of facts that are known and available at inference time and sort of like materially, 00:41:21.600 |
like only using these facts. At least in my view, the role of fine-tuning is really more 00:41:29.600 |
around... I think of language models kind of like a stem cell. And then under fine-tuning, 00:41:34.720 |
they differentiate into different kinds of specific cells, so a kidney or an eye cell. And 00:41:40.240 |
we, if you think about like specifically, like, I don't think that unbounded agentic behaviors are 00:41:50.960 |
useful and that instead a useful LLM system is more like a finite state machine where the behavior of 00:42:01.520 |
the system is occupying one of many different behavioral regimes and making decisions about 00:42:05.600 |
what state should I occupy next in order to satisfy the goal. As you think about the graph 00:42:12.960 |
of those states that your system is moving through, once you develop 00:42:18.880 |
conviction that one behavior is useful and repeatable and like worthwhile to differentiate 00:42:28.240 |
down into a specific kind of subsystem, that's where like fine tuning and like specifically 00:42:32.960 |
generating the training data, like having human annotators produce a corpus that is useful 00:42:39.920 |
enough to get a specific class of behaviors, that's kind of how we use fine-tuning, rather than 00:42:46.640 |
trying to imbue net new information into these systems. Yeah. And, you know, 00:42:55.920 |
people are always trying to turn LLMs into humans. It's like, oh, this is my reviewer, this is my 00:43:01.680 |
editor. I know you're not in that camp. So any thoughts you have on like how people should 00:43:06.880 |
think about, yeah, how to refer to models? And I mean, we've talked a little bit about 00:43:14.400 |
this, and it's notable that I think there's a lot of anthropomorphizing going on and 00:43:20.400 |
that it reflects the difficulty of evaluating the systems. Is it like, does saying 00:43:27.600 |
that you're the journal editor for Nature, does that help? Like, you know, you've got the 00:43:35.520 |
editor and then you've got the reviewer and you've got the, you know, you're the private 00:43:38.880 |
investigator, you know, it's like, this is, I think literally we wave our hands and we say, 00:43:45.280 |
maybe if I tell you that I'm going to tip you, that's going to help. And it sort of seems to, 00:43:48.960 |
and like, maybe it's just like the more cycles, the more compute that is attached to the prompt. 00:43:56.720 |
And then the sort of like chain of thought at inference time, it's like, maybe that's all that 00:44:01.840 |
we're really doing and that it's kind of like hidden compute. But our experience 00:44:08.080 |
has been that you can get really, really high quality reasoning from roughly an agentic system 00:44:16.800 |
without needing to be too cute about it. You can describe the task and you know, within 00:44:25.360 |
well-defined bounds, you don't need to treat the LLM like a person to get it to generate high 00:44:34.000 |
quality outputs. Yeah. And the other thing is like all these agent frameworks are assuming everything 00:44:41.040 |
is an LLM, you know? Yeah, for sure. And I think this is one of the places where 00:44:46.080 |
traditional machine learning has a real material role to play in producing a system that hangs 00:44:53.600 |
together and there are, you know, guaranteeable like statistical promises that classical machine 00:45:01.840 |
learning systems, including traditional deep learning, can make about, you know, what is the 00:45:07.200 |
set of outputs and like, what is the characteristic distribution of those outputs that LLMs cannot 00:45:13.040 |
afford? And so like one of the things that we do is we, as a philosophy, try to choose the right 00:45:18.160 |
tool for the job. And so sometimes that is a de novo model that has nothing to do with LLMs that 00:45:26.560 |
does one thing exceptionally well. And whether that's retrieval or critique or multi-class 00:45:33.280 |
classification, I think having many, many different tools in your toolbox is always valuable. 00:45:41.760 |
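To illustrate the finite-state-machine framing from a moment ago, and the point that the router between states can be a cheap classical model rather than an LLM, here is a toy sketch. The state names, handlers, and router are invented for illustration and are not Brightwave's architecture.

```python
# Sketch: an LLM system as a bounded state machine rather than an open-ended agent loop.
from typing import Callable, Dict

STATES = ("RETRIEVE", "ANALYZE", "CRITIQUE", "RESPOND", "DONE")

def run_pipeline(
    query: str,
    handlers: Dict[str, Callable[[dict], str]],
    router: Callable[[str, dict], str],
    max_steps: int = 10,
) -> dict:
    """Step through a fixed graph of behavioral regimes, each handled by its own subsystem."""
    context = {"query": query, "notes": []}
    state = "RETRIEVE"
    for _ in range(max_steps):
        if state == "DONE":
            break
        context["notes"].append(handlers[state](context))  # each state does one job well
        state = router(state, context)                      # could be a small classical classifier
        assert state in STATES, f"router proposed an unknown state: {state}"
    return context
```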
This is great. So there's kind of the missing piece that maybe people are wondering about. 00:45:46.400 |
You do a financial services company and you didn't do anything in Excel. What's the story 00:45:52.400 |
behind why you're doing partner in thought versus, hey, this is like an AI enabled model that 00:45:58.640 |
understands any stock and all of that. Yeah. And to be clear, we do, Brightwave does a fair 00:46:04.640 |
amount of quantitative reasoning. I think what is an explicit non-goal for the company is to 00:46:10.880 |
create Excel spreadsheets. And I think when you look at the products that work in that way, 00:46:19.280 |
you can spend hours with an Excel spreadsheet and not notice a subtle bug. And that is a 00:46:28.560 |
highly non-fault tolerant product experience where you encounter a misstatement in a financial model 00:46:35.200 |
in terms of how a formula is composed and all of your assumptions are suddenly violated. 00:46:39.520 |
And now it's effectively wasted effort. So as opposed to the partner in thought modality, 00:46:46.160 |
which is yes and, like if the model says something that you don't agree with, 00:46:50.720 |
you can say, take it under consideration. This is not interesting to me. I'm going to pivot to the 00:46:56.240 |
next finding or claim. And it's more like a dialogue. The other piece of this is that 00:47:04.480 |
the financial modeling is often very, when we talk to our users, it's very personal. So they 00:47:10.000 |
have a specific view of how a company is structured. They have that one key driver 00:47:14.400 |
of asset performance that they think is really, really important. It's kind of like the difference 00:47:20.000 |
between writing an essay and having an essay written for you, I guess. The purpose of homework is to actually 00:47:27.440 |
develop, what do I think about this? And so it's not clear to me that push a button, have 00:47:33.440 |
a financial model is solving the actual problem that the financial model affords. 00:47:40.880 |
And so that said, we take great efforts to have exceptionally high quality quantitative 00:47:47.920 |
reasoning. So you think about, and I won't get into too many specifics about this, but 00:47:55.200 |
we deal with a fair number of documents that have tabular data that is really important to making 00:48:01.760 |
informed decisions. And so the way that our RAG systems operate over and retrieve from tabular 00:48:09.760 |
data sources is, it's something that we place a great degree of emphasis on. It's just, I think, 00:48:15.840 |
the medium of Excel spreadsheets is just, I think, not the right play for this class of technologies 00:48:26.160 |
as they exist in 2024. - What about 2034? Are people still gonna be making Excel models? I think 00:48:35.280 |
to me, the most interesting thing is, how are the models abstracting people away from some of these 00:48:42.480 |
more syntax driven thing and making them focus on what matters to them? - I wouldn't be able to tell 00:48:48.640 |
you what the future 10 years from now looks like. I think anybody who could convince you of that 00:48:55.760 |
is not necessarily somebody to be trusted. I do think that, so let's draw the parallel to 00:49:02.320 |
accountants in the '70s. So VisiCalc, I believe, came out in 1979. And historically, the core job, 00:49:12.560 |
as an accountant, as a finance professional in the '70s, is that I'm the one who runs the 00:49:18.640 |
numbers. I do the arithmetic. That's like my main job. And we think that, I mean, you just look, 00:49:26.320 |
now that's not a job anybody wants. And the sophistication of the analysis that a person 00:49:31.920 |
is able to perform as a function of having access to powerful tools like computational spreadsheets 00:49:36.960 |
is just much greater. And so I think that with regards to language models, it is probably the 00:49:44.480 |
case that there is a play in the workflow where it is commenting on your analysis within that 00:49:56.080 |
spreadsheet-based context, or it is taking information from those models and sucking 00:50:02.320 |
this into a system that does qualitative reasoning on top of that. But I think the, 00:50:11.280 |
it is an open question as to whether the actual production of those models is still a human task, 00:50:16.560 |
but I think the sophistication of the analysis that is available to us and the completeness 00:50:22.480 |
of that analysis just necessarily increases over time. - Yeah. What about AI hedge funds? 00:50:31.280 |
Obviously, I mean, we have quants today, right? But those are more kind of like momentum-driven, 00:50:35.920 |
kind of like signal-driven and less about long thesis-driven. Do you think that's a possibility 00:50:40.720 |
there? - This is an interesting question. I would put it back to you and say, how different is that 00:50:49.840 |
from what hedge funds do now? I think there is, the more that I have learned about how teams at 00:50:58.400 |
hedge funds actually behave, and you look at like systematic desks or semi-systematic trading groups, 00:51:03.600 |
man, it's a lot like a big machine learning team. And it's, I sort of think it's interesting, 00:51:09.040 |
right? So like, if you look at video games and traditional like Bay Area tech, there's not a ton 00:51:16.560 |
of like talent mobility between those two communities. You have people that work in video 00:51:22.400 |
games and people that work in like SaaS software. And it's not that like cognitively, they would not 00:51:28.080 |
be able to work together. It's just like a different set of skill sets, a different set 00:51:30.800 |
of relationships. And it's kind of like network clusters that don't interact. I think there's 00:51:34.320 |
probably a similar phenomenon happening with regards to machine learning within the active 00:51:43.600 |
asset allocation community. And so like, it's actually not clear to me that we don't have 00:51:50.480 |
AI hedge funds now. The question of whether you have an AI that is operating a trading desk, 00:51:56.480 |
like that seems a little, maybe, like I don't have line of sight to something like that existing yet. 00:52:06.800 |
- I'm always curious. I think about asset management on a few different ways, but 00:52:13.040 |
venture capital is like extremely power law driven. It's really hard to do machine learning 00:52:18.640 |
in power law businesses because the sample of outcomes is so small versus public equities. 00:52:24.960 |
Most high-frequency trading is like very bell curve, normal distribution. It's like, 00:52:30.640 |
even if you just get 50.5% at the right scale, you're going to make a lot of money. 00:52:35.440 |
And I think AI starts there. And today most high-frequency trading is already AI driven. 00:52:42.240 |
Renaissance started a long time ago using these models. But I'm curious how it's going to move 00:52:47.920 |
closer and closer to like power law businesses. I would say some boutique hedge funds, 00:52:54.160 |
their pitch is like, "Hey, we're differentiated because we only do kind of like these 00:52:59.680 |
long only strategies that are like thesis driven versus movement driven." And most venture 00:53:06.000 |
capitalists will tell you, "Well, our fund is different because we have this unique thesis 00:53:09.760 |
on this market." And I think like five years ago, I've wrote this blog post about why machine 00:53:16.560 |
learning would never work in venture, because the things that you're investing in today, 00:53:20.880 |
there's just like no precedent that should tell you this will work. Most new companies, 00:53:25.600 |
a model will tell you this is not going to work. Versus the closer you get to the public companies, 00:53:30.960 |
the more any innovation is like, "Okay, this is kind of like this thing that happened." 00:53:35.760 |
And I feel like these models are quite good at generalizing and thinking, 00:53:40.720 |
again, going back to the partner in thought, like thinking about second order. 00:53:44.000 |
Yeah, and that's maybe where, so a concrete example, I think it certainly is the case that 00:53:51.440 |
we tell retrospective, to your point about venture, we tell retrospective stories where it's 00:53:56.320 |
like, "Well, here was the set of observable facts. This was knowable at the time, and these people 00:54:01.120 |
made the right call and were able to cross correlate all of these different sources, and 00:54:05.680 |
this is the bet we're going to make." I think that process of idea generation is absolutely 00:54:13.440 |
automatable. And the question of like, do you ever get somebody who just sets the system running, 00:54:18.880 |
and it's making all of its own decisions like that, and it is truly like doing thematic investing, 00:54:26.080 |
or more of what a human analyst would be on the hook for, as opposed to like HFT. 00:54:32.160 |
But the ability of models to say, "Here is a fact pattern that is noteworthy, and 00:54:42.640 |
we should pay more attention here." Because if you think about the matrix of all possible 00:54:48.960 |
relationships in the economy, it grows with the square of the number of facts you're evaluating, 00:54:56.800 |
like polynomial with the number of facts you're evaluating. And so, if I want to make bets on AI, 00:55:07.120 |
I think it's like, "What are ways to profit from the rise of AI?" It is very straightforward to 00:55:13.360 |
take a model and say, "Parse through all of these documents and find second-order derivative bets," 00:55:19.760 |
and say, "Oh, it turns out that energy is very, very adjacent to investments in AI and may not 00:55:27.360 |
be priced in the same way that GPUs are." And a derivative of energy, for example, 00:55:33.360 |
is long-duration energy storage. And so, you need a bridge between renewables, which have 00:55:39.200 |
fluctuating output, and the compute requirements of these data centers. And I think that, 00:55:45.200 |
and I'm telling this story having witnessed Brightwave do this work, you can take a 00:55:52.160 |
premise and say, "What are second- and third-order bets that we can make on this topic?" And it's 00:55:57.920 |
going to come back with, "Here's a set of reasonable theses." And then I think a human's 00:56:03.520 |
role in that world is to assess like, "Does this make sense given our fund strategy? Is this 00:56:08.880 |
coherent with the calls that I've had with the management teams?" There's this broad body of 00:56:14.240 |
knowledge, and I think humans are the ultimate synthesizers and deciders of it. And maybe I'm wrong. 00:56:23.040 |
Maybe the world of the future looks like the AI that truly does everything. I think it is kind of 00:56:32.080 |
a singularity where it's really hard to reason about what that world looks like. And you asked 00:56:38.320 |
me to speculate, but I'm actually kind of hesitant to do so because it's like a hurricane forecast: the 00:56:43.360 |
path just diverges far too much to have a real conviction about what that looks like. 00:56:50.880 |
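A rough sketch of the second- and third-order thesis expansion described above. With N facts under consideration, the pairwise relationship matrix has on the order of N*(N-1)/2 entries, which is why delegating the first pass to a model is attractive. Everything below is hypothetical: `complete` stands in for whatever chat-completion API you use, and the prompt wording and data structures are illustrative, not Brightwave's actual pipeline.

```python
# Hypothetical sketch: expanding an investment premise into second- and
# third-order theses with a language model. `complete` is a placeholder for
# any chat-completion API; none of this reflects Brightwave's actual system.
from dataclasses import dataclass


@dataclass
class Thesis:
    claim: str       # e.g. "long-duration energy storage benefits from AI data center demand"
    order: int       # 1 = the premise itself, 2 = derivative bet, 3 = derivative of a derivative
    rationale: str


def complete(prompt: str) -> str:
    """Placeholder for an LLM call (hosted API, local model, etc.)."""
    raise NotImplementedError


def expand_theses(premise: str, depth: int = 3) -> list[Thesis]:
    """Ask the model, level by level, for exposures one causal step removed from the last."""
    theses = [Thesis(claim=premise, order=1, rationale="starting premise")]
    frontier = [premise]
    for order in range(2, depth + 1):
        next_frontier: list[str] = []
        for claim in frontier:
            prompt = (
                f"Premise: {claim}\n"
                "List economically adjacent exposures one causal step removed from this "
                "premise (suppliers, inputs, bottlenecks, substitutes), one per line."
            )
            for line in complete(prompt).splitlines():
                line = line.strip()
                if line:
                    theses.append(Thesis(claim=line, order=order,
                                         rationale=f"derived from: {claim}"))
                    next_frontier.append(line)
        frontier = next_frontier
    return theses  # a human still filters these against fund strategy and diligence calls
```

In practice the interesting work is in ranking and deduplicating what comes back; as the conversation notes, the human remains the one deciding which theses are coherent with the fund's strategy.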
Awesome. I know we've already taken up a lot of your time, but maybe one thing to touch on 00:56:57.440 |
before wrapping is open-source LLMs. Obviously you were at the forefront of it. We recorded 00:57:03.520 |
our episode the day that RedPajama was open-sourced and we were like, "Oh man, this is mind-blowing. 00:57:08.880 |
This is going to be crazy." And now we're going to have an open-source dense transformer model 00:57:15.200 |
that is 400 billion parameters. I don't know if one year ago you could have told me that 00:57:19.440 |
that was going to happen. So what do you think matters in open-source? What do you think 00:57:24.880 |
people should work on? What are things that people should keep in mind to evaluate? Is this model 00:57:31.280 |
actually going to be good or is it just cheating some benchmarks to look good? Is there anything 00:57:35.840 |
there? This is the part of the podcast where people already dropped off if they wanted to, 00:57:42.240 |
so they want to hear the hot takes right now. I do think that that's another reason to have 00:57:47.280 |
your own private evaluation corpuses is so that you can objectively and out of sample 00:57:52.160 |
measure the performance of these models. Again, sometimes that just looks like giving everybody 00:57:59.840 |
on the team 250 annotations and saying, "We're just going to grind through this." 00:58:04.080 |
The other thing about doing the work yourself is that you get to articulate your loss function 00:58:11.680 |
precisely. What do I actually want the system to behave like? Do I prefer this system or 00:58:17.120 |
this model or this other model? I think the overfitting on public test sets is 100% happening. 00:58:27.200 |
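As a concrete, hypothetical version of the private-eval habit described here: pool a few hundred pairwise preference annotations per teammate into one held-out file and compute a simple win rate between two candidate models. The JSONL field names and file layout below are assumptions for illustration, not a standard.

```python
# Hypothetical sketch of a private, out-of-sample pairwise evaluation.
# Each line of annotations.jsonl is assumed to look like:
#   {"prompt": "...", "output_a": "...", "output_b": "...", "preferred": "a" | "b" | "tie"}
import json
from collections import Counter


def win_rate(path: str) -> dict[str, float]:
    """Fraction of annotations preferring model A, model B, or neither."""
    votes: Counter[str] = Counter()
    with open(path) as f:
        for line in f:
            votes[json.loads(line)["preferred"]] += 1
    total = sum(votes.values()) or 1
    return {label: count / total for label, count in votes.items()}


if __name__ == "__main__":
    # e.g. 250 annotations per person on the team, pooled into one file
    print(win_rate("annotations.jsonl"))
```

Because the prompts never leave the team, the comparison stays out of sample no matter how aggressively public benchmarks get gamed.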
One notable change, in contrast to a year ago, say: the economic incentives for companies to train 00:58:40.080 |
their own foundation models, I think, are diminishing. The window in which you are the 00:58:48.720 |
dominant pre-train, and let's say that you spend $5 to $40 million for a commodity-ish pre-train, 00:58:59.360 |
not a 400-billion-parameter model, which would be another sort of... It costs more than $40 million. 00:59:03.760 |
Another leap. But it's the kind of thing that a small, multi-billion-dollar mom-and-pop shop 00:59:10.640 |
might be able to pull off. The benefit that you get from that is, I think, diminishing over time. 00:59:20.320 |
I think fewer companies are going to make that capital outlay. I think that there's probably 00:59:28.160 |
some material negatives to that. But the other piece is that we're seeing that, 00:59:33.200 |
at least in the past two and a half, three months, there's a convergence 00:59:38.160 |
towards, well, these models all behave fairly similarly. It's probably that the training data 00:59:47.600 |
on which they are pre-trained is substantially overlapping. You're generating a model that 00:59:54.560 |
generalizes to that training data. It's unclear to me whether you get this sort of balkanization, 01:00:02.560 |
where there are many different models, each of which is good in its own unique way, versus 01:00:07.280 |
something like Llama becomes, "Listen, this is a fine standard to build off of." 01:00:13.600 |
We'll see. It's just the upfront cost is so high. I think for the people that have the money, 01:00:20.480 |
the benefit of doing the pre-train is now less. Where I think it gets really interesting is, 01:00:27.920 |
how do you differentiate these in all of these different behavioral regimes? I think the cost 01:00:33.840 |
of producing instruction tuning and fine-tuning data that creates specific kinds of behaviors, 01:00:41.520 |
I think that's probably where the next generation of really interesting work starts to happen. 01:00:49.600 |
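A minimal sketch of what that work can look like in practice, anticipating the corpus-recomposition experiments described just below: build several mixes of demonstration data, fine-tune the same base model on each, and score the results on a private eval. The dataset filenames are made up, and `finetune` / `evaluate` are placeholders for your own SFT loop and eval harness, not any particular library's API.

```python
# Hypothetical ablation harness: recompose instruction-tuning corpora, fine-tune
# the same base model on each mix, and compare behavior on a private eval set.
# Dataset files are invented for illustration; `finetune` and `evaluate` are stubs.
from datasets import Dataset, concatenate_datasets, load_dataset


def build_mix(include_reasoning: bool = True, include_tone: bool = True) -> Dataset:
    """Compose a training corpus from separate demonstration sets."""
    parts = [load_dataset("json", data_files="base_demos.jsonl", split="train")]
    if include_reasoning:
        parts.append(load_dataset("json", data_files="reasoning_demos.jsonl", split="train"))
    if include_tone:
        parts.append(load_dataset("json", data_files="tone_demos.jsonl", split="train"))
    return concatenate_datasets(parts).shuffle(seed=0)


def finetune(base_model: str, corpus: Dataset):
    """Placeholder for your SFT loop (TRL, Axolotl, your own trainer, ...)."""
    raise NotImplementedError


def evaluate(model) -> float:
    """Placeholder: score the model on a private, out-of-sample eval set."""
    raise NotImplementedError


if __name__ == "__main__":
    mixes = {
        "full": build_mix(),
        "no_reasoning": build_mix(include_reasoning=False),
        "no_tone": build_mix(include_tone=False),
    }
    for name, mix in mixes.items():
        model = finetune("some-open-base-model", mix)
        print(name, evaluate(model))  # compare which demonstrations drive which behaviors
```

Running that loop a few times is how you build the internal mental model, discussed next, for how corpus composition shapes qualitative behavior.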
If you see that the same model architecture trained on much more training data can exhibit 01:00:56.480 |
substantially improved performance, that says something about the relative value of modeling innovations. 01:01:04.160 |
For fundamental machine learning and AI research, there is still so much to be done. But I think 01:01:12.320 |
that the much lower-hanging fruit is developing new kinds of training data corpuses that elicit 01:01:22.960 |
new behaviors from these models in a specific way. That's where, when I think about the 01:01:28.160 |
availability, a year ago you had to have access to fairly high-performance GPUs that were hard 01:01:37.360 |
to get in order to get the experience of multiple reps fine-tuning these models. 01:01:42.960 |
What you're doing when you take a corpus and then fine-tune the model, and then see across many 01:01:51.360 |
inference passes what is the qualitative character of the output, you're developing your own internal 01:01:56.160 |
mental model for how does the composition of the training corpus shape the behavior of the model 01:02:00.800 |
in a qualitative way. A year ago it was very expensive to get that experience. Now you can 01:02:06.480 |
just recompose multiple different training corpuses and see what happens if I insert 01:02:12.080 |
this set of demonstrations, or ablate that set of demonstrations. That I think is a very, 01:02:18.240 |
very valuable skill and one of the ways that you can have models and products that 01:02:23.040 |
other people don't have access to. I think as those sensibilities proliferate, 01:02:30.640 |
because more people have that experience, you're going to see teams that 01:02:35.680 |
release data corpuses that just imbue the models with new behaviors that are especially interesting 01:02:41.120 |
and useful. I think that may be where some of the next sets of innovation and differentiation come 01:02:47.920 |
from. Yeah, when people ask me, I always tell them the half-life of a model is much shorter 01:02:52.480 |
than the half-life of a data set. I mean, the Pile is still around and core to most of these training 01:02:58.960 |
runs versus all the models people trained a year ago. It's like they're at the bottom of the 01:03:03.520 |
LMSYS leaderboard. It's kind of crazy. Just the parallels to other kinds of computing technology 01:03:11.440 |
where the work involved in producing the artifact is so significant and the shelf life is like a 01:03:20.160 |
week. I'm sure there's a precedent but it is remarkable. I remember when Dolly was the best 01:03:30.000 |
open-source model. Dolly was never the best open-source model, but it demonstrated something 01:03:36.800 |
that was not obvious to many people at the time. But we always were clear that it was never state 01:03:41.120 |
of the art. State of the art, whatever that means. This is great, Mike. Anything that we forgot to 01:03:48.400 |
cover that you want to add? I know you're thinking about growing the team. We are hiring across the 01:03:55.680 |
board. AI, engineering, classical machine learning, systems engineering, distributed systems, 01:04:03.440 |
front-end engineering, design. We have many open roles on the team. We hire exceptional people. 01:04:10.800 |
We fit the job to the person as a philosophy and would love to work with more incredible humans. 01:04:17.920 |
Awesome. Thank you so much for coming on, Mike.