Training Albatross, an Expert Finance LLM: Leo Pekelis

Hi, everyone. I'm Leo, Chief Scientist at Gradient, and today I'll be talking about how we trained large language models to be finance experts. Let's go ahead and dive right in.

Before I get into the details, I wanted to make a couple of observations. The first is that foundation models have been growing at an exponential rate. Not only do the bespoke AI companies each have their own foundation models; data companies and general tech companies all have their own flavor of language model, each with its own features and use cases. A related observation is that context length, the number of tokens you can fit into a prompt, has increased quite a bit over the past year. The largest-context models about a year ago were something like 100K tokens; in models released in just the past few months, including one released by Gradient, that has grown to about 40 times that.

Both of these observations are evidence for one point: large language models are not one size fits all. Especially once you get to more complicated use cases, taking a generalist or base language model off the shelf isn't really going to get you too far. I realize I'm talking at the open models track of a conference, so I probably don't need to convince you of this statement. But it's pretty important for us at Gradient, and it's the foundational piece of what we built, which is an AI Foundry. For us, an AI Foundry is a collection of custom language models along with a number of workflow primitives, and we take all of these pieces and components together to create solutions that are a custom fit for our customers.
Today, I'm going to talk specifically about our solutions for the finance domain: building financial experts. For those solutions, two components have been incredibly useful. One, which should be fairly straightforward, is our domain-specific finance language model. The other is a context length extension that we've worked on.

Why are these important specifically for finance? A little while ago, we got together and wrote down six requirements for finance applications of language models, requirements that generalist models tend to fall short on. If you look at them, they're fairly general and apply across industries, but for finance in particular they seem pretty important. Today, I'm going to talk about just two of them, the two that happen to be paired with the two solutions I also want to talk about: the finance language model and the extended context length.
Jumping right into it, the first one is the finance language model. You might be wondering: why even have a domain-specific language model? Why is domain knowledge important? The reason is that general-purpose language models, like the GPTs of the world, are trained on a very broad set of data, broad but not deep, especially for more technical material like technical financial information. As an illustrative example of why this matters, here's a chart from a recent research paper showing that even for very large models (the red line at the top is a 176-billion-parameter model), you need on the order of thousands of relevant documents in the model's pre-training for the model to get even above 50% accuracy on answering a related question. What this implies is that if you ask a language model questions about data in the tails of its training distribution, it's going to do a poor job of answering them.
The natural way to fix this is to say: okay, the base model doesn't know a lot about finance, so let's train it on some finance. Here I'm going to talk about how we trained our finance-specific language model. One issue is that there's a huge amount of financial data out there, way more than you could possibly review manually, so you need an automated data pipeline, and that's what we created. Probably the most compelling part of this pipeline is the automated data curation, where we borrowed ideas from the membership inference literature. We amass a large corpus of training data, and then use membership inference techniques to estimate, for each document, the chance that it was already in the model's training data. So if you have, say, a Llama-based model and a document, you can run these techniques to get a probability of whether or not the model has already seen that document in training. You then keep only the data the model hasn't seen before, which leaves a much smaller set that's manageable for human review. Finally, the data passes through synthetic data augmentation, both to upsample it and to handle variations in data representation and formatting.
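As a rough illustration of the membership inference idea, here's a minimal sketch in the spirit of the Min-K% Prob technique from that literature: score each document by the average log-probability the base model assigns to its least-likely tokens, and treat unusually high scores as a sign the document was probably in the training set. The model id and threshold below are placeholder assumptions, not Gradient's actual pipeline.

```python
# Minimal membership-inference-style filter (Min-K% Prob idea).
# The model id and threshold are illustrative; a real pipeline would
# calibrate the threshold on documents known to be seen/unseen.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def min_k_prob_score(text: str, k: float = 0.2) -> float:
    """Average log-prob of the least-likely k fraction of tokens."""
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
    logits = model(ids).logits
    # Log-prob each token receives given its prefix (shift by one position).
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    n_lowest = max(1, int(k * token_lp.shape[1]))
    return token_lp.topk(n_lowest, largest=False).values.mean().item()

def probably_seen(text: str, threshold: float = -2.5) -> bool:
    # Higher score => even the rarest tokens are well predicted => likely seen.
    return min_k_prob_score(text) > threshold

# Keep only documents the model has probably NOT seen in pre-training.
corpus = ["...a financial filing...", "...an analyst report..."]
novel_docs = [doc for doc in corpus if not probably_seen(doc)]
```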
The last part of the recipe for training a domain-specific language model is to take the dataset you created and pass it through a training pipeline. By now, a training pipeline like this is fairly standard, and it has two main parts. The first is continual pre-training: you take the dataset from the previous slide and do next-token prediction on it, starting from an existing base model, so again we start with a base foundation model like a Llama model. The second part is alignment, where we ran both supervised fine-tuning and preference optimization. The way I like to think about the division between these two stages is that pre-training is like having the model read a stack of textbooks and retain all of that information, while alignment is instructing the model on how to use that information, what the best practices are, and what to do with it. So if pre-training is reading the textbooks, alignment is taking an exam in a class or working on a project. And that's really all I wanted to say about the domain-specific language model.
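To make the two-stage shape of such a pipeline concrete, here's a minimal sketch using the Hugging Face TRL library. The dataset files, column layouts, and hyperparameters are placeholder assumptions rather than Gradient's actual configuration, and TRL's API varies a bit across versions.

```python
# Sketch of the two-stage recipe: continual pre-training, then alignment.
# File names and hyperparameters are placeholders.
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer, SFTConfig, SFTTrainer

BASE = "meta-llama/Llama-2-7b-hf"  # start from an existing base model

# Stage 1: continual pre-training = plain next-token prediction on the
# curated domain corpus ("reading the textbooks"). Expects a "text" column.
corpus = load_dataset("json", data_files="curated_finance.jsonl", split="train")
cpt = SFTTrainer(
    model=BASE,
    train_dataset=corpus,
    args=SFTConfig(output_dir="cpt-out", packing=True),
)
cpt.train()
cpt.save_model("cpt-out")

# Stage 2a: supervised fine-tuning on instruction data ("taking the exam").
sft_data = load_dataset("json", data_files="finance_instructions.jsonl", split="train")
sft = SFTTrainer(model="cpt-out", train_dataset=sft_data, args=SFTConfig(output_dir="sft-out"))
sft.train()
sft.save_model("sft-out")

# Stage 2b: preference optimization (DPO shown here) on preference pairs
# with "prompt", "chosen", and "rejected" columns.
prefs = load_dataset("json", data_files="finance_prefs.jsonl", split="train")
dpo = DPOTrainer(model="sft-out", train_dataset=prefs, args=DPOConfig(output_dir="dpo-out"))
dpo.train()
```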
Now, let me see how much time I have... great. I want to talk about the other part, which is the extended context, and how long-context language models help us address hallucinations. To give a quick refresher, what are hallucinations? It's a pretty broad term, used quite frequently nowadays: it's whenever you run inference on a model, giving it a query, and it generates content that is irrelevant, made up, or inconsistent with the input data. There's been a fair amount of research into the causes of hallucinations, and a lot of it points to deficiencies in the underlying training data. One cause is that the training data is simply outdated: you're asking the model about information that has changed since the data was collected. Another is that many training data practices require automated data collection, and if there are inconsistencies or bugs in that collection, you can get source-reference divergence, meaning the model is trained on data that doesn't quite make sense. And there are a few other causes. All of these can encode information in the model's memory that isn't quite accurate, and that will cause the model to hallucinate.
While alignment or continued training of the model can alleviate hallucinations, at Gradient we find that in-context learning, that is, working directly on the prompt in the execution pipeline, is the most direct and sample-efficient way to reduce them. You can put a relatively small amount of information directly into the prompt at inference time and effectively paper over issues with the model's training data. And that's great: in-context learning works really well. The issue is that it works so well that once you start doing it, you want to do more and more of it, and then you run into one of the biggest pain points, or bottlenecks, of this practice, which is the context length. I'm guessing this is an issue many of you in this room have come across yourselves: you simply run out of prompt for in-context learning.
Here are a few examples of why that can be an issue. If you're putting few-shot examples into the prompt, you run out of prompt space before you run out of examples, so you have to spend a lot of time choosing the particular examples, or rely on some lossy summarization technique. More complex problems may require brittle pre-processing pipelines, each of which can introduce errors. And if you do some kind of external memory management, such as RAG, those systems tend to perform poorly when the retrieved chunks need to be interrelated: if you pull one chunk, and knowing whether another chunk needs to be queried requires referencing the previous chunk, RAG typically does a pretty poor job.
So context length is the bottleneck, and the most natural thing to do is to just extend the context length. That's what we did with some of our models.
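One common recipe for this kind of extension, shown below purely as an illustrative sketch and not necessarily the exact method used here, is to raise the RoPE base frequency (rope_theta) so positional embeddings stay distinguishable at longer ranges, and then continue training on long sequences. All of the values are assumptions.

```python
# Illustrative context-extension sketch: raise RoPE's base frequency and
# continue pre-training on long documents. Not necessarily Gradient's recipe.
from transformers import AutoConfig, AutoModelForCausalLM

BASE = "meta-llama/Meta-Llama-3-8B"
config = AutoConfig.from_pretrained(BASE)

EXTENSION_FACTOR = 128  # e.g. 8K native context -> ~1M target context
config.max_position_embeddings *= EXTENSION_FACTOR
# Stretching rope_theta lengthens the rotary embedding wavelengths so that
# relative positions far beyond the original window remain distinguishable.
config.rope_theta *= EXTENSION_FACTOR  # simple linear bump; NTK-style scaling is similar in spirit

model = AutoModelForCausalLM.from_pretrained(BASE, config=config)
# ...then continue next-token training on long sequences so the model
# adapts to the new positional geometry before being used at 1M tokens.
```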
Here I really just want to talk about a couple of examples of what suddenly becomes possible when you have a context length in the realm of a million tokens. On the left-hand side is an example showing that you can now put thousands of examples directly into the prompt. That gets you back into the domain-learning regime I talked about earlier, except now it happens on the fly at inference time, so it can be very adaptive to the problem. And you do find that for a lot of tasks out there, this thousands-of-examples mark is actually necessary to get production-grade accuracy from a model.
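To picture the many-shot regime, here's a hypothetical sketch that packs labeled examples into a single long prompt; the sentiment-classification task, field names, and example data are all invented for illustration.

```python
# Hypothetical many-shot prompt builder: with a ~1M-token context window,
# thousands of labeled examples can go directly into the prompt.
labeled_examples = [
    {"text": "Revenue grew 40% year over year.", "label": "positive"},
    {"text": "The company missed earnings estimates.", "label": "negative"},
    # ...in practice, thousands of examples drawn from the domain corpus
]

def build_many_shot_prompt(examples, query, max_examples=5000):
    """Concatenate labeled examples, then append the new query."""
    shots = [
        f"Filing excerpt: {ex['text']}\nSentiment: {ex['label']}"
        for ex in examples[:max_examples]
    ]
    return "\n\n".join(shots) + f"\n\nFiling excerpt: {query}\nSentiment:"

# The resulting prompt can be hundreds of thousands of tokens long; the
# model adapts to the task on the fly, with no weight updates needed.
prompt = build_many_shot_prompt(labeled_examples, "Guidance was cut for Q3.")
```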
The other example is that with a long context length, you can leverage what transformer models are natively really good at, which is attending to every single token in the prompt. By doing that, the model can actually perform fairly complicated reasoning implicitly, just by passing the prompt through its attention layers. As an example we cooked up in-house, we took books written by the author Mark Twain, and first we scrubbed them of any identifying information, so no mention of the author or anything like that. We then gave them to the model in its prompt, in its context, and asked the model to generate new stories in the same style. After about five books of reference prompts, the model was able to generate stories that convinced a separate critic model that those short stories could actually have been written by the same author, and in pretty deep and intricate ways: not just stylistic or language similarity, but down to theme, characters, and setting. So the punchline is that long-context language models give you more grounded and robust systems with fewer moving parts; much more is contained in the language model itself, which is the thing we all care about, and that in turn reduces hallucinations.
So those are basically the two components, the two solutions of our platform, that I wanted to describe to you today. One thing we believe in pretty strongly at Gradient is transparent and verifiable benchmarks, and we're also passionate about giving back to the open-source community, because a lot of what we've built rests on open-source models and techniques. So for both of these solutions, we've open-sourced models on our company page on Hugging Face. One of them is the Albatross model, the result of applying our finance domain training to a Llama 2 base model. The benchmarks show that after that training, it ends up competitive with models in its class, its peers, on general open LLM benchmarks, and better on finance-specific benchmarks.
The other model is a 1-million-token context length extension of a Llama 3 base model that we released pretty recently. With it, we were able to get 100% needle-in-a-haystack scores even above a 1-million-token context length (that's the first image), and we also saw a substantial performance improvement over the base model on RULER, a long-context benchmark put out by NVIDIA. That brings this model into the realm of flagship long-context models like Gemini 1.5 Pro, GPT-4, and Command R+.
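For a sense of what a needle-in-a-haystack evaluation measures, here's a minimal sketch: hide one fact at varying depths inside long filler text and check whether the model can recall it. The needle text, filler, and model_generate callable are placeholders for whatever harness you use.

```python
# Minimal needle-in-a-haystack check. The needle, filler text, and
# model_generate() callable are placeholders.
NEEDLE = "The special magic number for this evaluation is 48613."
QUESTION = "What is the special magic number for this evaluation?"

def build_haystack(n_words: int, depth: float, filler: str) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    # Word counts stand in for tokens; real harnesses use the tokenizer.
    repeats = n_words // len(filler.split()) + 1
    words = ((filler + " ") * repeats).split()[:n_words]
    pos = int(depth * len(words))
    return " ".join(words[:pos] + [NEEDLE] + words[pos:])

def niah_score(model_generate, lengths, depths) -> float:
    """Fraction of (length, depth) cells where the model recalls the needle."""
    cells = [(n, d) for n in lengths for d in depths]
    hits = 0
    for n, d in cells:
        prompt = build_haystack(n, d, "The grass is green. The sky is blue.")
        answer = model_generate(prompt + "\n\n" + QUESTION)
        hits += "48613" in answer
    return hits / len(cells)
```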
These models are open source and publicly available, and I invite you all to go check them out. I have about a minute left, so I'll finish off here. There's of course a lot more to building an AI financial expert; these are just two pieces of the puzzle, even though they're two important ones. If you're interested in finding out more, feel free to check us out on our website, or reach out and contact us. Cool. Thank you.