Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I have no co-host today, as you can see. Swyx is in Vienna at ICLR, having fun in Europe. And we're in the brand new studio. As you might see if you're on YouTube, there are still no sound panels on the wall.
Mike tried really hard to put them up, but the glue is a little too old for that. So if you hear any echo or anything like that, sorry, but we're doing the best that we can. And today we have our first repeat guest, Mike Conover, who is now the founder of Brightwave and not at Databricks anymore. Welcome, Mike.
Our last episode was one of the fan favorites and I think this will be just as good. So for those that have not listened to the first episode, which might be many because the podcast has grown a lot since then, thanks to people like Mike who have interesting conversations on it.
You spent a bunch of years doing ML at some of the best companies on the internet. Things like Workday, you know, Skipflag, LinkedIn, most recently at Databricks, where you were leading the open-source large language models team working on Dolly. And now you're doing Brightwave, which is in the financial services space. But this is not something new, you know. I think when you and I first talked about Brightwave, I was like, why is this guy doing a financial services company?
And then you look at your background, and you had a paper in Nature Communications about LinkedIn data predicting S&P 500 stock movement, many, many years ago. So what are some of the tying elements in your background that maybe people are overlooking that brought you to do this?
Yeah, sure. So my PhD research was funded by DARPA, and we had access to the Twitter dataset early in the natural history of the availability of that dataset. It was focused on the large-scale structure of propaganda and misinformation campaigns.
At LinkedIn, we had planet-scale descriptions of the structure of the global economy. Primarily my work was homepage newsfeed relevance, so when you went to linkedin.com, you would see updates from one of our machine learning models. But additionally, I was a research liaison as part of the Economic Graph Challenge and had this Nature Communications paper where we demonstrated that 500 million job transitions can be hierarchically clustered as a network of labor flows and were predictive of next-quarter S&P 500 market cap changes.
And at Workday, I was director of financials machine learning. You start to see how organizations are organisms. I think of the way that an accountant or the market encodes information in databases as similar to how social insects, for example, organize their work and make collective decisions about where to allocate resources or time and attention.
And that, especially with the work on Twitter, we would see network structures relating to polarization emerge organically out of the interactions of many individual components. And so like much of my professional work has been focused on this idea that our lives are governed by systems that we're unable to see from our locally constrained perspective.
And when humans interact with technology, they create digital trace data that allows us to observe the structure of those systems as though through a microscope or a telescope. And particularly as regards finance, I think the markets are the ultimate manifestation and record of that collective decision-making process that humans engage in.
Just to start going off script right away. Sure. How do you think about some of these interactions creating the polarization and how that reflects in the language models today, because they're trained on this data? Like, do you think the models pick up on these things on their own as well?
Yeah, I think that they are a compression of the world as it existed at the point in time when they were pre-trained. So absolutely, and you see this in Word2Vec too. Just the semantics of how we think about gender as it relates to professions are encoded in the structure of these models, and language models, I think, are a much more complete representation of human beliefs.
Yeah. That's awesome. So we left you at Databricks last time, when you were building Dolly. Tell us a bit more about Brightwave. This is the first time you're really talking about it publicly. Yeah, it's a pleasure. So we've raised a $6 million seed round, led by Decibel, who we love working with, and including participation from Point72, one of the largest hedge funds in the world, and Moonfire Ventures. And we are focused on, if you think of the job of an active asset manager, the work to be done: to understand something about the market that nobody else has seen in order to identify a mispriced asset.
And it's our view that that is not a task that is well suited to human intellect or attention span alone. Much as I was gesturing towards the ability of these models to perceive more than a human is able to, we think there's a historically unique opportunity to expand individuals' ability to reason about the structure of the economy and the markets.
It's not clear that you get superhuman reasoning capabilities from human level demonstrations of skill. And by that, I mean the pre-training corpus, but then additionally the fine tuning corpuses. I think you largely mimic the demonstrations that are present at model training time. But from a working memory standpoint, these models outclass humans in their ability to reason about these systems.
- Yeah. And you started Brightwave with Brandon. - Yeah, yeah. - What's the story? You two worked together at Workday, but he also has a really relevant background. - Yeah, so Brandon Katara is my co-founder, the CTO, and he's a very special human. He has a deep background in finance.
He was the former CTO of a federally regulated derivatives exchange, but his first deep learning patent was filed in 2018. And so he spans worlds. He has experience building mission-critical infrastructure in highly regulated environments for finance use cases, but was also very early to the deep learning party. At Workday, he was the tech lead for semantic search over hundreds of millions of resumes and job listings, so he has been working with information retrieval and neural information retrieval methods for a very long time. He's an exceptional person, and I'm glad to count him among the people that we're doing this with.
- Yeah, and a great fisherman. - Yeah, very talented. - That's always important. - Very talented, very enthusiastic. - And then you have a bunch of amazing engineers. Then you have folks like JP who used to work at Goldman Sachs. - Yeah. - How should people think about team building in this more vertical domain?
Obviously you come from a deep ML background, but you also need some of the industry knowledge. So what's the right balance? - Yeah, I mean, I think one of the things that's interesting about building verticalized solutions in AI in 2024 is that, historically, you need the AI capability. You need to understand both how the models behave and then how to get them to interact with other kinds of machine learning subsystems that together perform the work of a system that can reason on behalf of a human.
There are also material systems engineering problems in there. I forget who this is attributed to, but there's a tweet that made reference to how all of the traditional software companies are trying to hire AI talent and all the AI companies are trying to hire systems engineers. And that is 100% the case.
Getting these systems to behave in a predictable and repeatable and observable way is equally challenging to a lot of the methodological challenges. But then you bring in the domain, whether it's law or medicine or public policy, or in our case, finance. Grammarly is a good example of a company whose generative work product can be evaluated by most humans.
Whereas in finance, the character of the insight, the depth of insight, and the non-consensus nature of the insight really require fairly deep domain expertise. And even operating an exchange, when we went to raise a round, a lot of people said, "Why don't you start a hedge fund?" And it's like, that is a totally separate thing; there are many, many separate skills that are unrelated to AI in that problem.
And so we've brought into the fold domain experts in finance who can help us evaluate the character and sort of steer the system. - Yep. So that's the team. What does the system actually do? What's the Brightwave product? - Yeah, I mean, it does many, many things, but it acts as a partner in thought to finance professionals.
So you can ask Brightwave a question like, "How is NVIDIA's position in the GPU market impacted by rare earth metal shortages?" And it will identify, as thematic contributors to an investment decision or a developing thesis, that in response to export controls on A100 cards, China has put in place licensing requirements on the transfer of germanium and gallium, which are not rare earth metals but are semiconductor production inputs, and has expanded its control of African and South American mining operations.
And so we see, for example, we have a $20 billion crossover hedge fund; their equities team uses this tool to go deep on a thesis, like I was describing, multiple steps into the value chain or supply chain for companies. We see wealth management professionals using Brightwave to get up to speed extremely quickly as they step into nine conversations tomorrow with clients who are assessing, "Do you know something that I don't?
Can I trust you to be a steward of my financial well-being?" We see investor relations teams using Brightwave to... You just think about the universe of coverage that a person working in finance needs to be aware of. The ability to rip through filings and transcripts and have a very comprehensive view of the market.
It's extremely rate-limited by how quickly a person is able to read, and not just read, but solve the blank page problem of knowing what to say about a fact or finding. What else can you share about customers that you're working with? Yeah, so we have seen traction that far exceeded our expectations from the market.
You sit somebody down with a system that can take any question and generate tight, actionable financial analysis on that subject, and the product kind of sells itself. And so we see many, many different funds, firms, and strategies that are making use of Brightwave. So you've got a 10-person, owner-operated registered investment advisor, the classical wealth manager, with $500 million in AUM.
We have crossover hedge funds that have tens and tens of billions of dollars in assets under management, very different use case. So that's more investment research, whereas a wealth manager is going to use this to step into client interactions, just exceptionally well-prepared. We see investor relations teams. We see corporate strategy types that are needing to understand very quickly new markets, new themes, and just the ability to very quickly develop a view on any investment theme or sort of strategic consideration is broadly applicable to many, many different kinds of personas.
Yep. Yeah, I can attest to the product selling itself, given that I'm a user. Let's jump into some of the technical challenges and work behind it, because there are a lot of things. So as I mentioned, you were on the podcast about a year ago. Yep. You had released Dolly from Databricks, which was one of the first open-source LLMs.
Yep. Dolly had a whopping 1,024 tokens of context size. And today, I think a model with 1,000 tokens of context would be unusable; you miss out on that much. Yeah, exactly. How did you think about the evolution of context sizes as you built the company? And where we are today, what are things that people get wrong?
Any commentary there? Sure. We very much take a systems of systems approach. When I started the company, I think I had more faith in the ability of large context windows to generally solve problems relating to synthesis. And actually, if you think about the attention mechanism and the way that it computes similarities between tokens at a distance, I, on some level, believed that as you would scale that up, you would have the ability to simultaneously perceive and draw conclusions across vast disparate bodies of content.
And I think that does not empirically seem to be the case. For example, and this is something anybody can try, take a very long document. With needle-in-a-haystack tests, sure, we can do information retrieval on specific fact-finding activities pretty easily. But I kind of think about it like summarizing: if you write a book report on an entire book versus a synopsis of each individual chapter, there is a characteristic output length for these models.
Let's say it's about 1,200 tokens. It is very difficult to get any of the commercial LLMs, or Llama, to write 5,000 tokens. And you can think about it as: what is the conditional probability that I generate an end token? It just gets higher the more tokens are in the context window prior to that next inference step.
And so if I have a thousand words in which to say something, the level of specificity and the level of depth when I am assessing a very large body of content is going to necessarily be less than if I am saying something specific about a sub-passage. And if you draw a parallel to consumer internet companies like LinkedIn or Facebook, there are many different subsystems within them.
So let's take the Facebook example. Facebook almost certainly has, I mean, you can see this in your profile, your inferred interests. What are the things that it believes that you care about? Those assessments almost certainly feed into the feed relevance algorithms that judge what you see: am I going to show you snowboarding content?
Am I going to show you aviation content? It's the outputs of one machine learning system feeding into another machine learning system. And I think with modern RAG and agent-based reasoning, it is really about creating subsystems that do specific tasks well. And the problem of deciding how to decompose large documents into more atomic reasoning units is still very important.
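A minimal sketch of that kind of decomposition, assuming a hypothetical `call_llm` helper and purely illustrative chunking and prompts, not Brightwave's actual pipeline:

```python
# Map-reduce style synthesis: say something specific about each sub-passage,
# then synthesize over the partial findings rather than the raw text.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def chunk(text: str, max_chars: int = 4000) -> list[str]:
    # Naive fixed-size chunking; a real system would split on section semantics.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def synthesize(document: str, question: str) -> str:
    # Map step: one focused pass per chunk keeps each output specific.
    partials = [
        call_llm(f"Given this passage:\n{c}\n\nWhat here is relevant to: {question}?")
        for c in chunk(document)
    ]
    # Reduce step: the model only ever writes ~1,000 tokens at a time,
    # so the final pass works over findings, not the whole corpus.
    joined = "\n".join(f"- {p}" for p in partials)
    return call_llm(f"Synthesize these findings into an answer to: {question}\n{joined}")
```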
Now it's an open question whether that is addressable by pre-training or instruction tuning: can you have synthesis-oriented demonstrations at training time such that this problem is more robustly solved? Because synthesis is quite different from "complete the next word in The Great Gatsby."
But I think it empirically is not the case that you can just throw all of the SEC filings into, you know, a million-token context window and get deep insight that is useful out the other end. Yeah. And I think that's the main difference in what you're doing: it's not about summarizing, it's about coming up with different ideas and thought threads.
Yes, precisely. And I think this is specifically about helping a person know, you know, if I think that GLP-1s are going to blow up the diet industry, identifying and putting in context a negative result from a human clinical trial, or, for example, that adherence rates to Ozempic after a year are just 35%. What are the implications of this?
So there's an information retrieval component, and then there's not just presenting me with a summary of, here are the facts, but what does this entail, and how does this fit into my worldview, my fund strategy? Broadly, I think, and this is not my insight, somebody put this very eloquently, you may be familiar, help me remember who said this: language models are not tools for creating new knowledge. They're tools for helping me create new knowledge. They themselves do not do that work. I think that's presently the right way to think about it. Yeah. I read a tweet about needle-in-the-haystack training actually being harmful to some of this work, because now the model is too focused on recalling everything versus saying, oh, that doesn't matter.
Like ignoring. If you think about an S-1 filing, like 85% is boilerplate. It's, you know, previous performance doesn't guarantee future performance, the company might not be able to turn a profit in the future, blah, blah, blah. These things always come up: we have a large workforce, and all of that.
Have you had to do any work at the model level to make it okay to forget these things? Or have you found that making it a smaller problem, cutting it up and putting it back together, kind of solves for that? Absolutely. And I think this is where having domain expertise around the structure of these documents matters.
So you look at the different chunking strategies that you can employ to understand what the intent of this clause or phrase is, and then you're really selective at retrieval time in order to get the information that is most relevant to a user query, based on the semantics of that unique document.
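A sketch of what structure-aware chunking with selective retrieval could look like, with illustrative section labels and a precomputed similarity score standing in for whatever embedding model is used:

```python
# Tag each chunk with section metadata so retrieval can skip boilerplate
# (e.g., generic risk-factor language) and rank only the substantive passages.
BOILERPLATE_SECTIONS = {"forward-looking statements", "general risk factors"}

def chunk_filing(sections: dict[str, str]) -> list[dict]:
    chunks = []
    for name, body in sections.items():
        chunks.append({
            "section": name,
            "text": body,
            "is_boilerplate": name.lower() in BOILERPLATE_SECTIONS,
        })
    return chunks

def retrieve(chunks: list[dict], scores: dict[int, float], k: int = 5) -> list[dict]:
    # scores: precomputed query-chunk similarities keyed by chunk index.
    ranked = sorted(
        (i for i in scores if not chunks[i]["is_boilerplate"]),
        key=lambda i: scores[i],
        reverse=True,
    )
    return [chunks[i] for i in ranked[:k]]
```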
And it's certainly not just a sliding window over that corpus. And then the flip side of it is obviously factuality: you don't want to forget things that were there. How do you tackle that? Yeah. I mean, of course, that's a very deep problem.
I'll be a little circumspect about the specific kinds of methods we use, but there's this sort of multiple passes over the material, asking, how convicted are you that what you're saying is in fact true? And you can take generations from multiple different models, compare and contrast, and ask, do these both reach the same conclusion?
You can treat it like a voting problem. We train our own models to assess, you can think of this like entailment, is this supported by the underlying primary sources? And I think you have methodological approaches to this problem, but then you also have product affordances.
So there's a great blog post from the Bard team describing what was sort of a design-led product innovation that allows you to ask the model to double-check the work. So if you have a surprising finding, we can let the user discretionarily spend more compute to double-check the work.
And I think you want to build product experiences that are fault-tolerant. The difference between hallucination and creativity is fuzzy. So do you ever get language models, with next-token prediction as the loss function, that are guaranteed to not contain factual misstatements?
That is not clear. Now, maybe being able to invoke a code interpreter, code generation and then execution in a secure way, helps to solve some of these problems, especially for quantitative reasoning. That may be the case, but for right now, I think you need to have product affordances that allow you to live with the reality that these things are fallible.
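One way to picture the voting and entailment idea described above, as a minimal sketch where `judge_fns` are hypothetical wrappers around different models:

```python
# Ask several judges whether a claim is entailed by the retrieved sources
# and only surface the claim if a majority agrees.
def is_supported(claim: str, sources: list[str], judge_fns: list) -> bool:
    votes = []
    for judge in judge_fns:
        verdict = judge(
            f"Claim: {claim}\nSources:\n" + "\n".join(sources) +
            "\nIs the claim entailed by the sources? Answer yes or no."
        )
        votes.append(verdict.strip().lower().startswith("yes"))
    # Majority vote across judges; a real system might weight judges differently.
    return sum(votes) > len(votes) / 2
```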
Yep. Yeah. We did our RLHF 201 episode, just talking about different methods and whatnot. How do you think about something like this, where it's maybe unclear in the short term? Even if the product is right, it might give an insight that is right but might not prove out until later.
So it's kind of hard for the users to say that's wrong, because actually you might just think it's wrong, like an investment. That's kind of what it comes down to: some people are wrong, some people are right. How do you think about some of the product features that you need in something like this to bring user feedback into the mix, and maybe how you approach it today and how you think about it long-term?
Yeah. Well, to your point, the model may make a statement which is not actually verifiable; it may just be the case. I think the reason we think of this as a partner in thought is that humans are always going to have access to information that has not been digitized.
And in finance you see that especially, with regards to expert call networks and the sort of unstated investment theses that a portfolio manager may have, like, we just don't do biotech, or we think that Eli Lilly is actually very exposed because of how unpleasant it is to take these drugs, for example.
Right. Those are beliefs about the world, but they may not be falsifiable right now. And so I think you can again take pages from the consumer web playbook and think about personalization. Getting a person to articulate everything that they believe is not a realistic task.
Netflix doesn't ask you to describe what kinds of movies you like; they give you the option to vote, but nobody does this. So what I think you do is observe people's revealed preferences. One of the capabilities that our system exposes is, given everything that Brightwave has read and assessed and the synthesized financial analysis, what are the natural next questions that a person investigating the subject should ask?
You can think of this as a chain of thought, a deepening investigative process, and the direction in which the user steers the attention of this system reveals information about what they care about, what they believe, what kinds of things are important. And so at the individual level, but then also at the fund and firm level, you can develop an implicit representation of your beliefs about the world in a way that you're just never going to get somebody to write everything down.
Yeah. How does that tie into one of our other favorite topics, evals? We had David Luan from Adept, and he mentioned they don't care about benchmarks because their customers don't work on benchmarks; they work on business results. How do you think about that for you? And maybe, as you build a new company, when is the time to still focus on the benchmark versus when is it time to move on to your own evaluation, using maybe labelers or whatnot?
So we use a fair bit of LLM supervision to evaluate multiple different subsystems. We also pay human annotators to evaluate the quality of the generative outputs, and I think that is always the reference standard, but we frequently first turn to LLM supervision as a way to assess, whether it's at fine-tuning time or even for subsystems that are not generative, what is the quality of the system?
We will generate a small corpus of high-quality domain expert annotations and then always compare that against how well either LLM supervision or even just a heuristic performs. A simple thing you can do, and this is a technique that we do not use, but as an example: do not generate any integers or any numbers that are not present in the underlying source data, right?
If you're doing RAG, you can just say you can't name numbers that are not in the retrieved context. It's very heavy-handed, but you can take the annotations of a human evaluator and then compare against that. Snorkel kind of takes a similar perspective: multiple different weak supervision datasets can give you substantially more than any one of them does on its own.
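To make that heavy-handed heuristic concrete, here is a minimal sketch (illustrative only; the regex and normalization are assumptions, not anything Brightwave uses):

```python
# Flag any number in the generated answer that does not appear in the sources.
import re

NUMBER = re.compile(r"-?\d[\d,]*\.?\d*")

def ungrounded_numbers(answer: str, sources: list[str]) -> set[str]:
    source_numbers = set()
    for s in sources:
        source_numbers.update(m.group().replace(",", "") for m in NUMBER.finditer(s))
    # Anything returned here is a candidate hallucinated figure.
    return {
        m.group().replace(",", "")
        for m in NUMBER.finditer(answer)
        if m.group().replace(",", "") not in source_numbers
    }
```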
And so I think you want to compare the quality of any evaluation against the human-generated benchmark. But at the end of the day, especially for things that are nuanced, like, is this transcendent poetry? There's just no way to multiple-choice your way out of that, you know?
And so really, where I think a lot of the flywheels for some of the large LLM companies are, it's methodological, obviously, but it's also just data generation. And for anybody who's done crowdsourced work, and this I think applies to highly skilled human annotators as well,
you look at the Google search quality evaluator guidelines, and it's like a 90 or 120-page rubric describing what a high-quality search result is. It's very difficult, at the human level, to get people to reproducibly follow a rubric. And so what is your process for orchestrating that motion?
How do you articulate what is high-quality insight? I think that's where a lot of the work actually happens, and the human annotation is sort of the last resort. Ideally you want to automate everything, but ultimately the most interesting problems right now are those that are not especially automatable.
- One thing you did at Databricks, well, not you specifically, but the team there, was the Dolly 15K dataset. You mentioned people misjudge the value of this data. Why has no other company done anything similar, like creating this employee-led dataset?
You can imagine some of these companies, like Goldman Sachs, they've got thousands and thousands of people in there. Obviously they have different privacy and whatnot requirements. Do you think more companies should do it? Do you think there's a misunderstanding of how valuable that is?
- So I think Databricks is a very special company, led by people who are very courageous, I guess is one word for it. Just, let's ship it. And I think it's unusual also because most companies recognize that if they go to the effort to produce something like that, it is a competitive advantage to have it and to be the only company that has it.
And I think Databricks is in an unusual position in that they benefit from more people having access to these kinds of sources. But you also saw Scale, I guess they haven't released it. - Well, yeah. I'm sure they have it, because they charge people a lot of money. - But they created that alternative to GSM8K, I believe is how that's said.
I guess they too are not releasing that. - Yeah. It's interesting, because I talk to a lot of enterprises and a lot of them are like, man, I spent so much money on Scale. And I'm like, why don't you just do it yourselves? And they're like, what? - So I think this again gets to the human process orchestration.
It's one thing to do like a single monolithic push to create a training data set like that or an evaluation corpus, but I think it's another to have a repeatable process. And a lot of that, I think realistically is pretty unsexy, like people management work. So that's probably a big part of it.
- We have this Four Wars of AI framework. The data quality war we kind of touched on a little bit. Now, about RAG: that's the other battlefield, RAG and context sizes and all these different things. You work in a space that has a couple of different things.
One, temporality of data is important, because every quarter there's new data, and the new data usually overrides the previous one, so you cannot just do semantic search and hope you get the latest one. And then you obviously have very structured numbers that are very important at the token level: 50% gross margins and 30% gross margins are very different, but the tokenization is not that different.
Any thoughts on how to build a system to handle all of that, as much as you can share, of course? - Yeah, absolutely. So again, rather than having open-ended retrieval and open-ended reasoning, our approach is to decompose the problem into multiple different subsystems that have specific goals.
Temporality is a great example. When you think about time, just look at all of the libraries for managing calendars; time is kind of at the intersection of language and math. And this is one of the places where you have to take specific technical measures to ensure that you get high-quality narrative overlays of statistics that are changing over time, a description of how a P/E multiple is increasing or decreasing, and a retrieval system that is aware of the time intent of the user query.
If I'm asking something about breaking news, that's going to be very different than if I'm looking for a thematic account of the past 18 months in Fed interest rate policy. To your point, if I just look for something that is a nearest neighbor without any of that temporal or other qualitative metadata overlay, you're just going to get a kind of bag of facts, and that is explicitly not helpful, because the worst failure state for these systems is that they are wrong in a convincing way.
And so I think, at least presently, you have to have subsystems that are aware of the semantics of the documents, or aware of the semantics of the intent behind the question, and then have multiple evaluation steps. Once you have the generated outputs, we assess them in multiple different ways to know: is this a factual statement given the content that's been retrieved?
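A sketch of time-aware retrieval along the lines described above; the intent classifier is a stub and the cutoffs and field names are illustrative assumptions:

```python
# Infer whether the query wants breaking news or a longer thematic window,
# then filter candidates by document date before ranking by semantic score.
from datetime import datetime, timedelta

def time_window(query: str, now: datetime) -> datetime:
    breaking = any(w in query.lower() for w in ("today", "breaking", "this week"))
    return now - (timedelta(days=7) if breaking else timedelta(days=548))  # ~18 months

def retrieve_with_time(candidates: list[dict], query: str, now: datetime, k: int = 10):
    cutoff = time_window(query, now)
    fresh = [c for c in candidates if c["published_at"] >= cutoff]
    # Rank by a precomputed semantic score only within the admissible window.
    return sorted(fresh, key=lambda c: c["score"], reverse=True)[:k]
```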
Yep. And what about, when people think of financial services, they think of privacy and confidentiality. What's customers' interest in that as far as sharing documents, and how much of a deal breaker is it if you don't have that? I don't know how much you want to share about that and how you think about architecting the product.
Yeah, so one of the things that gives our customers a high degree of confidence is the fact that Brandon operated a federally regulated derivatives exchange and has that experience in highly regulated environments. Additionally, at Workday, I worked with the financials product, and without going into specifics, it's exceptionally sensitive data, you have multiple tenants, and it's just important that you take the right approach to being a steward of that material.
And so from the start, we've built in a way that anticipates the need for controls on how that data is managed and who has access to it and how it is treated throughout the life cycle. And so that for our customer base, where frequently the most interesting and alpha generating material is not publicly available, has given them a great degree of confidence in sharing some of this, the most sensitive and interesting material with systems that are able to combine it with content that is either publicly or semi-publicly available to create non-consensus insight into some of the most interesting and challenging problems in finance.
Yeah, we always say RAG is recommendation systems for LLMs. How do you think about that when you have private versus public data, where sometimes the public data is one thing, but the private data is like, well, actually, you know, we've got this inside view, this inside scoop, that we're going to use to figure it out?
How do you think, in the RAG system, about the value of these different documents? I know a lot of it is secret sauce, but... No, no, it's fine. I will gesture towards this by way of saying context-aware prompting.
So you can have prompts that are composable and that have different command units that may or may not be present based on the semantics of the content that is being populated into the RAG context window. That's something we make great use of: where is this being retrieved from?
What does it represent? And what should be in the instruction set in order to treat and respect the underlying contents? Not just, here's a bunch of text, you figure it out, but this is important in the following way, or this aspect of the SEC filing is just categorically uninteresting, or this is sell-side analysis from a favored source.
And so, much like you have the problem of organizing the work of humans, you have the problem of organizing the work of all of these different AI subsystems and getting them to propagate what they know through the rest of the stack, so that if you have multiple, say seven or ten, sequential inference calls, all of the relevant metadata is propagated through that system and you are aware of: where did this come from?
How convicted am I that it is a source that should be trusted? You see this also just in analysis, right? Seeking Alpha is a good example of just a lot of people with opinions. Some of them are great, some of them are really mid, and how do you build a system that is aware of the user's preferences for different sources?
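A minimal sketch of composable, source-aware prompting as described here; the source labels and instruction text are invented for illustration, not Brightwave's prompt library:

```python
# Instruction "units" are included or omitted based on where each retrieved
# passage came from, so the model treats each source appropriately.
INSTRUCTION_UNITS = {
    "sec_filing": "Treat this as primary-source disclosure; quote figures exactly.",
    "sell_side": "This is sell-side analysis; weigh it as opinion and note the source.",
    "retail_opinion": "This is retail commentary; treat claims as unverified.",
}

def build_prompt(question: str, passages: list[dict]) -> str:
    parts = []
    for p in passages:
        unit = INSTRUCTION_UNITS.get(p["source_type"], "")
        parts.append(f"[{p['source_type']}] {unit}\n{p['text']}")
    return "Context:\n" + "\n\n".join(parts) + f"\n\nQuestion: {question}"
```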
I think this is all related to what we talked about with systems engineering; it's all related to how you actually build the systems. - And then, just to kind of wrap on the RAG side, how should people think about knowledge graphs and extraction from documents versus just semantic search?
- Knowledge graph extraction is an area where we're making a pretty substantial investment. I think it is underappreciated how powerful this is: there are the generative capabilities of language models, but there's also the ability to program them to function as arbitrary machine learning systems, basically at zero marginal cost.
And so there's the ability to extract structured information from huge, sort of unfathomably large bodies of content in a way that is single-pass. Rather than having to reanalyze a document every time you perform inference or respond to a user query, we believe quite firmly that you can, in an additive way, perform single-pass extraction over this body of text and then bring that into the RAG context window.
And this really levers off of my experience at LinkedIn, where you had this structured graph representation of the global economy in which you said person A works at company B. We believe there's an opportunity to create a knowledge graph with resolution that greatly exceeds what anyone, whether it's Bloomberg or LinkedIn, currently has access to.
We're getting as granular as person X submitted congressional testimony that was critical of organization Y, and this is the language that is attached to that testimony. And then you have a structured data artifact that you can pivot through and reason over that is complementary to the generative capabilities that language models expose.
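A minimal sketch of single-pass extraction into a reusable graph, assuming a hypothetical `call_llm` helper and an illustrative tuple schema:

```python
# Run extraction over a document once, persist the structured edges, and
# pivot through them at query time instead of re-reading the raw text.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def extract_edges(document: str) -> list[dict]:
    prompt = (
        "Extract (subject, relation, object, evidence) tuples from the text "
        "as a JSON list of objects.\n\n" + document
    )
    return json.loads(call_llm(prompt))

def query_graph(edges: list[dict], entity: str) -> list[dict]:
    # Reuse the stored graph without another pass over the filings.
    return [e for e in edges if entity in (e["subject"], e["object"])]
```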
And so it's the same technology being applied to multiple different ends. This is manifest in the product surface, where it's a highly facetable, pivotable product, but it also enhances the reasoning capability of the system. Yeah. You know, when you mention you don't want to re-query the same thing over and over, a lot of people may say, well, I'll just fine-tune this information into the model.
How do you think about that? That was one thing when we started working together: you were like, we're not building foundation models. A lot of other startups were like, oh, we're building the finance foundation model or whatever. When is the right time for people to do fine-tuning versus RAG?
Any heuristics that you can share that you use to think about it? In general, I'll just say I don't have a strong opinion about how much information you can imbue into a model that is not present in pre-training through large-scale fine-tuning.
The benefit of RAG is the capability around grounded reasoning: forcing the model to attend to a collection of facts that are known and available at inference time, and materially only using those facts. At least in my view, the role of fine-tuning is really more around, I think of language models kind of like a stem cell.
Under fine-tuning, they differentiate into different kinds of specific cells, a kidney cell or an eye cell. I don't think that unbounded agentic behaviors are useful; instead, a useful LLM system is more like a finite state machine, where the behavior of the system occupies one of many different behavioral regimes and makes decisions about what state it should occupy next in order to satisfy the goal.
As you think about the graph of states that your system is moving through, once you develop conviction that one behavior is useful and repeatable and worthwhile to differentiate down into a specific kind of subsystem, that's where fine-tuning comes in: specifically, generating the training data, having human annotators produce a corpus that is useful enough to elicit a specific class of behaviors. That's how we use fine-tuning, rather than trying to imbue net new information into these systems.
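A toy sketch of the finite-state-machine framing, with invented states and a hypothetical `call_llm` helper; the point is that the model only ever picks from an explicit set of allowed transitions:

```python
# The model chooses the next state from a fixed transition graph rather than
# acting as an unbounded agent.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

TRANSITIONS = {
    "triage": ["retrieve", "clarify"],
    "retrieve": ["synthesize"],
    "synthesize": ["verify"],
    "verify": ["done", "retrieve"],
    "clarify": ["triage"],
}

def next_state(current: str, context: str) -> str:
    allowed = TRANSITIONS[current]
    choice = call_llm(
        f"Current state: {current}\nContext: {context}\n"
        f"Choose the next state from {allowed}. Answer with one word."
    ).strip().lower()
    # Fall back to the first allowed transition if the model goes off-script.
    return choice if choice in allowed else allowed[0]
```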
Yeah. And, you know, people are always trying to turn LLMs into humans. It's like, oh, this is my reviewer, this is my editor. I know you're not in that camp. So any thoughts you have on how people should think about, yeah, how to refer to models?
I mean, we've talked a little bit about this, and it's notable that there's a lot of anthropomorphizing going on, and I think it reflects the difficulty of evaluating the systems. Does saying that you're the journal editor for Nature help?
You've got the editor, and then you've got the reviewer, and then you're the private investigator. I think literally we wave our hands and we say, maybe if I tell you that I'm going to tip you, that's going to help.
And it sort of seems to. Maybe it's just the more cycles, the more compute that is attached to the prompt, and then the chain of thought at inference time; maybe that's all that we're really doing, and it's kind of hidden compute.
But our experience has been that you can get really, really high-quality reasoning from a roughly agentic system without needing to be too cute about it. You can describe the task, and within well-defined bounds, you don't need to treat the LLM like a person to get it to generate high-quality outputs.
Yeah. And the other thing is, all these agent frameworks are assuming everything is an LLM, you know? Yeah, for sure. And I think this is one of the places where traditional machine learning has a real material role to play in producing a system that hangs together. There are guaranteeable statistical promises that classical machine learning systems, including traditional deep learning, can make about what the set of outputs is and what the characteristic distribution of those outputs is, that LLMs cannot offer.
And so one of the things that we do, as a philosophy, is try to choose the right tool for the job. Sometimes that is a de novo model that has nothing to do with LLMs and does one thing exceptionally well. Whether that's retrieval or critique or multi-class classification, I think having many, many different tools in your toolbox is always valuable.
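One illustration of the "right tool for the job" point: a small classical classifier that does a single narrow thing with a fixed output space, instead of an LLM call. The routing task and labels here are invented for the example.

```python
# A compact scikit-learn classifier for routing passages to a document type.
# The output space is a fixed label set, so its behavior is easy to characterize.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_router(texts: list[str], labels: list[str]):
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts, labels)
    return model

# Usage (hypothetical labels):
#   router = train_router(passages, ["sec_filing", "sell_side", "retail_opinion", ...])
#   router.predict(["Item 1A. Risk Factors ..."])
```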
This is great. So there's kind of a missing piece that maybe people are wondering about. You run a financial services company, and you didn't do anything in Excel. What's the story behind why you're doing partner in thought versus, hey, this is an AI-enabled model that understands any stock and all of that?
Yeah. And to be clear, we do, Brightwave does a fair amount of quantitative reasoning. I think what is an explicit non-goal for the company is to create Excel spreadsheets. And I think when you look at the products that work in that way, you can spend hours with an Excel spreadsheet and not notice a subtle bug.
And that is a highly non-fault-tolerant product experience, where you encounter a misstatement in a financial model in terms of how a formula is composed and all of your assumptions are suddenly violated. And now it's effectively wasted effort. So as opposed to the partner-in-thought modality, which is "yes, and": if the model says something that you don't agree with, you can say, take it under consideration.
This is not interesting to me. I'm going to pivot to the next finding or claim. And it's more like a dialogue. The other piece of this is that the financial modeling is often very, when we talk to our users, it's very personal. So they have a specific view of how a company is structured.
They have that one key driver of asset performance that they think is really, really important. It's kind of like the difference between writing an essay and having an essay, I guess. The purpose of homework is to actually develop, what do I think about this? And so it's not clear to me that push-a-button, get-a-financial-model is solving the actual problem that the financial model affords.
That said, we take great pains to have exceptionally high-quality quantitative reasoning. I won't get into too many specifics, but we deal with a fair number of documents that have tabular data that is really important to making informed decisions. And so the way that our RAG systems operate over and retrieve from tabular data sources is something that we place a great degree of emphasis on.
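One common way to handle tabular sources in a RAG system, shown as a sketch (the serialization format is an assumption, not a description of Brightwave's approach): serialize each row with its headers so retrieved chunks keep numbers attached to their meaning.

```python
# Serialize table rows with headers and a caption so the figures stay grounded.
def serialize_table(caption: str, headers: list[str], rows: list[list[str]]) -> list[str]:
    chunks = []
    for row in rows:
        cells = ", ".join(f"{h}: {v}" for h, v in zip(headers, row))
        chunks.append(f"{caption} | {cells}")
    return chunks

# Usage with illustrative data:
# serialize_table("Gross margin by segment (FY2023)",
#                 ["Segment", "Gross margin"],
#                 [["Data center", "50%"], ["Gaming", "30%"]])
# -> ["Gross margin by segment (FY2023) | Segment: Data center, Gross margin: 50%", ...]
```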
It's just that I think the medium of Excel spreadsheets is not the right play for this class of technologies as they exist in 2024. - What about 2034? Are people still gonna be making Excel models? To me, the most interesting thing is, how are the models abstracting people away from some of these more syntax-driven things and making them focus on what matters to them?
- I wouldn't be able to tell you what the future 10 years from now looks like. I think anybody who could convince you of that is not necessarily somebody to be trusted. I do think that, so let's draw the parallel to accountants in the '70s. So VisiCalc, I believe, came out in 1979.
And historically, as an accountant, as a finance professional in the '70s, I'm the one who runs the numbers. I do the arithmetic. That's my main job. Now that's not a job anybody wants, and the sophistication of the analysis that a person is able to perform, as a function of having access to powerful tools like computational spreadsheets, is just much greater.
And so I think that with regards to language models, it is probably the case that there is a play in the workflow where it is commenting on your analysis within that spreadsheet-based context, or it is taking information from those models and sucking this into a system that does qualitative reasoning on top of that.
But it is an open question as to whether the actual production of those models is still a human task. I think the sophistication of the analysis that is available to us, and the completeness of that analysis, just necessarily increases over time. - Yeah. What about AI hedge funds?
Obviously, I mean, we have quants today, right? But those are more kind of like momentum-driven, kind of like signal-driven and less about long thesis-driven. Do you think that's a possibility there? - This is an interesting question. I would put it back to you and say, how different is that from what hedge funds do now?
The more that I have learned about how teams at hedge funds actually behave, and you look at systematic desks or semi-systematic trading groups, man, it's a lot like a big machine learning team. And I think it's interesting, right? If you look at video games and traditional Bay Area tech, there's not a ton of talent mobility between those two communities.
You have people that work in video games and people that work in like SaaS software. And it's not that like cognitively, they would not be able to work together. It's just like a different set of skill sets, a different set of relationships. And it's kind of like network clusters that don't interact.
I think there's probably a similar phenomenon happening with regards to machine learning within the active asset allocation community. And so it's actually not clear to me that we don't have AI hedge funds now. The question of whether you have an AI that is operating a trading desk, that seems a little further out; I don't have line of sight to something like that existing yet.
- I'm always curious. I think about asset management in a few different ways, but venture capital is extremely power-law-driven. It's really hard to do machine learning in power-law businesses because the distribution of outcomes is so small, versus public equities: most high-frequency trading is very bell curve, normal distribution.
It's like, even if you just get 50.5% at the right scale, you're going to make a lot of money. And I think AI starts there. And today most high-frequency trading is already AI driven. Renaissance started a long time ago using these models. But I'm curious how it's going to move closer and closer to like power law businesses.
I would say some boutique hedge funds, their pitch is, "Hey, we're differentiated because we only do these long-only strategies that are thesis-driven versus momentum-driven." And most venture capitalists will tell you, "Well, our fund is different because we have this unique thesis on this market." Five years ago, I wrote this blog post about why machine learning would never work in venture, because for the things that you're investing in today, there's just no precedent that would tell you this will work.
For most new companies, a model will tell you this is not going to work. Whereas the closer you get to public companies, the more any innovation is like, "Okay, this is kind of like this thing that happened before." And I feel like these models are quite good at generalizing and, again, going back to the partner in thought, thinking about second-order effects.
Yeah, and that's maybe where, as a concrete example, to your point about venture, I think it certainly is the case that we tell retrospective stories where it's like, "Well, here was the set of observable facts. This was knowable at the time, and these people made the right call and were able to cross-correlate all of these different sources, and this is the bet we're going to make." I think that process of idea generation is absolutely automatable.
And then there's the question of, do you ever get somebody who just sets the system running, and it's making all of its own decisions, and it is truly doing thematic investing, or more of what a human analyst would be on the hook for, as opposed to HFT.
But models have the ability to say, "Here is a fact pattern that is noteworthy, and we should pay more attention here." Because if you think about the matrix of all possible relationships in the economy, it grows with the square of the number of facts you're evaluating, polynomially with the number of facts.
And so, if I want to make bets on AI, I think it's like, "What are ways to profit from the rise of AI?" It is very straightforward to take a model and say, "Parse through all of these documents and find second-order derivative bets," and say, "Oh, it turns out that energy is very, very adjacent to investments in AI and may not be priced in the same way that GPUs are." And a derivative of energy, for example, is long-duration energy storage.
And so you need a bridge between renewables, which have fluctuating output, and the compute requirements of these data centers. And I'm telling this story as having witnessed Brightwave do this work: you can take a premise and say, "What are second- and third-order bets that we can make on this topic?" And it's going to come back with, "Here's a set of reasonable theses." And then I think a human's role in that world is to assess, "Does this make sense given our fund strategy?
Is this coherent with the calls that I've had with the management teams?" There's this broad body of knowledge that I think humans are the ultimate synthesizers and deciders. And maybe I'm wrong. Maybe the world of the future looks like the AI that truly does everything. I think it is kind of a singularity where it's really hard to reason about what that world looks like.
And you asked me to speculate, but I'm actually kind of hesitant to do so, because the forecast, the hurricane path, just diverges far too much to have real conviction about what that looks like. Awesome. I know we've already taken up a lot of your time, but maybe one thing to touch on before wrapping is open-source LLMs.
Obviously you were at the forefront of it. We recorded our episode the day that Red Pajama was open-source and we were like, "Oh man, this is mind-blowing. This is going to be crazy." And now we're going to have an open-source dense transformer model that is 400 billion parameters. I don't know if one year ago you could have told me that that was going to happen.
So what do you think matters in open-source? What do you think people should work on? What are things that people should keep in mind to evaluate? Is this model actually going to be good or is it just cheating some benchmarks to look good? Is there anything there? This is the part of the podcast where people already dropped off if they wanted to, so they want to hear the hot takes right now.
I do think that's another reason to have your own private evaluation corpuses: so that you can objectively, and out of sample, measure the performance of these models. Again, sometimes that just looks like giving everybody on the team 250 annotations and saying, "We're just going to grind through this." The other thing about doing the work yourself is that you get to articulate your loss function precisely.
What do I actually want the system to behave like? Do I prefer this model or that other model? I think overfitting on the test set is 100% happening. One notable contrast to a year ago, say, is that the economic incentives for companies to train their own foundation models, I think, are diminishing.
The window in which you have the dominant pre-train is shrinking. And let's say that you spend $5 to $40 million for a commodity-ish pre-train, not a 400-billion-parameter model, which would be another sort of... It costs more than $40 million. Another leap. But it's the kind of thing that a small, multi-billion-dollar mom-and-pop shop might be able to pull off.
The benefit that you get from that is, I think, diminishing over time. I think fewer companies are going to make that capital outlay. I think that there's probably some material negatives to that. But the other piece is that we're seeing that, at least in the past two and a half, three months, there's a convergence towards, well, these models all behave fairly similarly.
It's probably that the training data on which they are pre-trained is substantially overlapping, so you're getting a model that generalizes to that training data. It's unclear to me whether you get this sort of Balkanization, where there are many different models, each of which is good in its own unique way, versus something like Llama becoming, "Listen, this is a fine standard to build off of." We'll see.
It's just that the upfront cost is so high. I think, for the people that have the money, the benefit of doing the pre-train is now less. Where I think it gets really interesting is, how do you differentiate these models across all of these different behavioral regimes? The cost of producing instruction-tuning and fine-tuning data that creates specific kinds of behaviors, I think that's probably where the next generation of really interesting work starts to happen.
If you see that the same model architecture trained on much more training data can exhibit substantially improved performance, it raises questions about the value of modeling innovations. For fundamental machine learning and AI research, there is still so much to be done. But I think the much lower-hanging fruit is developing new kinds of training data corpuses that elicit new behaviors from these models in a specific way.
When I think about availability: a year ago, you had to have access to fairly high-performance GPUs that were hard to get in order to get the experience of multiple reps fine-tuning these models. What you're doing when you take a corpus, fine-tune the model, and then see across many inference passes what the qualitative character of the output is, is developing your own internal mental model for how the composition of the training corpus shapes the behavior of the model in a qualitative way.
A year ago it was very expensive to get that experience. Now you can just recompose multiple different training corpuses and see what happens if I insert this set of demonstrations, or if I ablate that set of demonstrations. That, I think, is a very, very valuable skill and one of the ways that you can have models and products that other people don't have access to.
I think as those sensibilities proliferate, because more people have that experience, you're going to see teams that release data corpuses that just imbue the models with new behaviors that are especially interesting and useful. I think that may be where some of the next sets of innovation and differentiation come from.
Yeah, when people ask me, I always tell them the half-life of a model is much shorter than the half-life of a dataset. I mean, The Pile is still around and core to most of these training runs, versus all the models people trained a year ago; they're at the bottom of the LMSYS leaderboard.
It's kind of crazy. Just the parallels to other kinds of computing technology, where the work involved in producing the artifact is so significant and the shelf life is like a week. I'm sure there's a precedent, but it is remarkable. I remember when Dolly was the best open-source model. Dolly was never the best open-source model, but it demonstrated something that was not obvious to many people at the time.
But we always were clear that it was never state of the art. State of the art, whatever that means. This is great, Mike. Anything that we forgot to cover that you want to add? I know you're thinking about growing the team. We are hiring across the board. AI, engineering, classical machine learning, systems engineering, distributed systems, front-end engineering, design.
We have many open roles on the team. We hire exceptional people. We fit the job to the person as a philosophy and would love to work with more incredible humans. Awesome. Thank you so much for coming on, Mike. Thanks, Alessio.