Hi, everyone. I'm Leo. I'm the Chief Scientist at Gradient. And today, I'll be talking about how we trained large language models to be finance experts. Yeah, let's go ahead and dive right into it. So before I start getting into the details here, I wanted to make a couple of observations.
And the first one is that foundational models have been growing at an exponential rate. Not only do the bespoke AI companies each have their own foundational models, but data companies and general tech companies all have their own flavor of language model, each with its own features and use cases.
Another observation, which is pretty related, is that context length, the number of tokens you can fit into a prompt, has increased quite a bit over the past year. The largest-context models about a year ago were something like 100K tokens. In the past year, that's grown to about 40 times that, just counting models released in the past few months, including one released by Gradient.
Both of these observations point to one thing: large language models are not one-size-fits-all. Especially when you get to more complicated use cases, taking a generalist or base language model off the shelf isn't really going to get you very far.
I realize I'm talking at the open models track of a conference, so I probably don't need to convince you of this too much. But it is important for us at Gradient, and it's actually the foundational idea behind what we built, which is an AI Foundry. For us, an AI Foundry is a collection of custom language models along with a number of workflow primitives.
We take all of these pieces and components and put them together to create solutions that are a custom fit for our customers. Today I'm going to talk specifically about our solutions for the finance domain, building financial experts. For those solutions, two components have been incredibly useful.
One, which should be fairly straightforward, is our domain-specific finance language model. The other is the context length extension that we've worked on. So why are these important specifically for finance? Well, a little while ago we got together and wrote down six requirements for finance applications of language models where generalist models tend to fall short.
If you look at these requirements, they're fairly general and apply across industries, but for finance in particular they seem pretty important. Today I'm just going to talk about two of them, the two that happen to pair with the solutions I want to cover: the finance language model and the extended context length.
So jumping right into it, the first one is the finance language model. You might be wondering: why even have a domain-specific language model? Why is domain knowledge important? The reason is that general-purpose language models, the GPTs of the world, are trained on a very broad set of data. Broad, but not deep, especially for more technical material like technical financial information.
As an illustrative example of why this matters, here's a chart from a recent research paper. It shows that even for very large models (the red line at the top is a 176-billion-parameter model), you need something on the order of thousands of relevant documents in the model's pre-training data for the model to reach even 50% accuracy on a related question. What this implies is that if you ask a language model questions about data that sits in the tails of its training distribution, it's going to do a poor job of answering them.
So the natural way to fix this is to say: OK, the base model doesn't know a lot about finance, so let's train it on some finance. And here I'm going to talk about how we trained our finance-specific language model. An issue there is that there's a whole lot of financial data out there.
Way more than you could possibly review or look at manually. So that requires an automated data pipeline, and that's what we built. Probably the most interesting part of this pipeline is the automated data curation, where we borrowed ideas from the membership inference literature.
What we do is amass a large corpus of training data, and then use these techniques to estimate, for a particular document, whether there's a high chance it was already in the model's training data. So maybe you have a Llama-based model and a document, and you can run these techniques to get a probability of whether the model has already seen that data in training. You then filter down to the data the model hasn't seen before, and what you're left with is a much smaller set that's manageable for human review. Finally, it passes through synthetic data augmentation, both to upsample data and to handle variations in data representation and formatting.
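To make that concrete, here's a rough sketch of what a loss-based membership check could look like. This is purely illustrative: the base model name, the 2048-token truncation, and the idea of thresholding on average loss are assumptions for the example, not our actual pipeline.

```python
# Sketch of loss-based "has the model likely seen this?" filtering.
# The model name and threshold are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def doc_loss(text: str) -> float:
    """Average next-token loss of the base model on a document.
    Low loss suggests the document (or something very similar) was
    probably in the model's pre-training data."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).input_ids
    out = model(ids, labels=ids)
    return out.loss.item()

def keep_novel(docs: list[str], threshold: float) -> list[str]:
    """Keep only documents the base model is 'surprised' by, i.e. likely unseen."""
    return [d for d in docs if doc_loss(d) > threshold]
```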
The last part of the recipe for training a domain-specific language model is to take the data set you created and pass it through a training pipeline. By now a pipeline like this is fairly standard, and it has two main parts. The first is continued pre-training: you take the data set from the previous slide and do next-token prediction on it, starting from an existing base model, so again a base foundational model like a Llama model. The second part is alignment, where we ran both supervised fine-tuning and preference optimization.
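As a rough sketch of that first stage, continued pre-training is just next-token prediction on the curated corpus starting from a base checkpoint. Everything here, the base model name, the toy corpus, and the hyperparameters, is an illustrative placeholder rather than our actual training setup.

```python
# Sketch of continued pre-training: causal-LM training on the curated
# finance corpus, starting from an existing base model. All names and
# hyperparameters are illustrative.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"  # illustrative base model
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# `finance_docs` stands in for the curated corpus from the previous step.
finance_docs = ["Example 10-K excerpt ...", "Example earnings-call transcript ..."]
ds = Dataset.from_dict({"text": finance_docs})
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=2048),
            remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cpt-finance", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=1e-5),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # next-token objective
)
trainer.train()
# The alignment stage (supervised fine-tuning plus preference optimization)
# would then run on top of this checkpoint, e.g. with a library like TRL.
```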
The way I like to think about the division between these two stages is that pre-training is like handing the model a stack of textbooks and having it read and retain all of that information, while alignment is then instructing the model on how to use that information, the best practices and what to do with it. So if pre-training is like reading textbooks, alignment is like taking an exam for the class or working on a project. And that's really all I wanted to say about the domain-specific language model. Now I want to talk about the other part, which is the extended context, and how long-context language models help us address hallucinations.
To give a quick refresher: what are hallucinations? It's a pretty broad term, and it's used quite frequently nowadays. It's when you run inference on a model, give it a query, and it generates content that is irrelevant, made up, or inconsistent with the input data. There's been a fair amount of research into the causes of hallucinations, and a lot of it points to deficiencies in the underlying training data. One cause is simply that the training data is outdated: you're asking the model about information that has changed since the training data was collected.
Another is that a lot of training-data practices rely on automated data collection, and if there are inconsistencies or bugs in that collection, you can get source-reference divergence, where the model is trained on data that doesn't quite make sense. And there are a few other reasons.
All of these can encode information in the model's memory that isn't quite accurate, and that causes the model to hallucinate. And while alignment or continued training of the model can alleviate hallucinations, at Gradient we find that in-context learning, working directly on the prompt during the execution pipeline, is the most direct and sample-efficient way to reduce them. What you can do is put a relatively small amount of information directly into the prompt at inference time and effectively paper over issues with the model's training data.
So that's great: in-context learning works really well. The issue is that it works so well that once you start doing it, you want to do more and more of it, and then you run into one of the biggest pain points, or bottlenecks, with this practice, which is the context length. I'm guessing this is an issue many of you in this room have come across yourselves: you simply run out of prompt for in-context learning. A few examples of why that can be a problem.
If you're putting few-shot examples into the prompt, you run out of prompt space before you run out of examples. So now you have to spend a lot of time choosing which examples to include, or work on some kind of lossy summarization technique. For more complex problems, that can require brittle pre-processing pipelines, each of which can introduce errors.
Also, if you do some kind of external memory management such as RAG, those systems tend to perform poorly when the chunks that get pulled need to be interrelated: if one chunk has to reference a previous chunk just to know whether it should be retrieved, RAG typically does a pretty poor job with that. So context length is the bottleneck, and the most natural thing to do is just extend the context length. That's what we did with some of our models.
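For context, one common recipe for this kind of extension, and this is a general sketch under assumptions rather than a description of our exact training recipe, is to raise the RoPE base frequency so the positional encoding covers longer sequences, and then continue training on long documents at the new length.

```python
# Illustrative sketch of one common long-context recipe: increase the RoPE
# base frequency (rope_theta) and the position limit, then continue training
# on long sequences. Values and model name are placeholders, not our settings.
from transformers import AutoConfig, AutoModelForCausalLM

base = "meta-llama/Meta-Llama-3-8B"   # illustrative base model
cfg = AutoConfig.from_pretrained(base)
cfg.rope_theta = 4_000_000.0          # illustrative value; larger base supports longer contexts
cfg.max_position_embeddings = 262_144 # illustrative intermediate target length
model = AutoModelForCausalLM.from_pretrained(base, config=cfg)
# ... then continue pre-training on long documents at the new length,
# typically in progressively larger stages, before extending further.
```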
Here I just wanted to talk about a couple of examples of what suddenly becomes possible when you have a context length in the realm of a million tokens. On the left-hand side is an example showing that you can now put thousands of examples directly into the prompt. That gets you back into the domain-learning regime I talked about earlier, except now it happens on the fly, at inference time, so it can be very adaptive to the problem. And you do find that for a lot of tasks, that thousands-of-examples mark is actually necessary to get production-grade, or even dangerously good, accuracy out of a model.
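Mechanically this is very simple; here's a rough sketch of packing many labeled examples into one prompt. The task, the example format, and the client call are hypothetical and just for illustration.

```python
# Sketch of many-shot in-context learning: with a ~1M-token window you can
# pack thousands of labeled examples into the prompt instead of fine-tuning.
# The task and formatting here are hypothetical.
def build_many_shot_prompt(examples, query, max_examples=2000):
    """examples: list of (text, label) pairs; query: the new input to classify."""
    shots = [f"Filing excerpt: {text}\nSentiment: {label}"
             for text, label in examples[:max_examples]]
    return "\n\n".join(shots) + f"\n\nFiling excerpt: {query}\nSentiment:"

# Usage with a hypothetical long-context client:
# prompt = build_many_shot_prompt(labeled_examples, new_excerpt)
# answer = client.complete(model="long-context-model", prompt=prompt)
```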
The other example is that with a long context length you can leverage what transformer models are natively really good at, which is attending to every single token in the prompt. By doing that, you can have the model perform fairly complicated reasoning implicitly, just by passing through its attention layers.
An example we cooked up in house: we took books written by Mark Twain and first scrubbed them of any identifying information, so no mention of the author or anything like that. Then we put those books into the model's prompt, into its context, and asked it to generate new stories in the same style. After about five books of reference material, the model was able to generate stories that convinced a separate critic model that those short stories could actually have been written by the same author, and in pretty deep and intricate ways: not just stylistic or language-level similarity, but down to theme, characters, setting, and so on. So the punchline is that long-context language models give you more grounded and robust systems with fewer moving parts. Much more is contained in the language model itself, which is the thing we all care about.
And that, in turn, reduces hallucinations. So those are basically the two components, the two solutions of our platform, that I wanted to describe to you today. One of the things we believe in pretty strongly at Gradient is having transparent and verifiable benchmarks. We're also pretty passionate about giving back to the open source community, because a lot of what we've built is itself based on open source models and techniques. So for both of those solutions, we've open-sourced models on our company page on Hugging Face.
One of them is the AlphaTross model, which is the result of applying our finance domain training to a Llama 2 base model. The benchmarks show that after doing that, it ends up competitive with models in its class, its peers, on general Open LLM benchmarks and actually better on finance-specific benchmarks.
The other model is a 1-million-token context length extension of a Llama 3 base model that we released pretty recently. With it, we were able to get 100% needle-in-a-haystack scores at over 1 million tokens of context; that's the first image on the slide. It also showed a pretty substantial improvement over the base model on the RULER long-context benchmark, which is a benchmark put out by NVIDIA. That brings this model into the realm of flagship long-context models like Gemini 1.5 Pro, GPT-4, and Command R+.
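For anyone unfamiliar with the needle-in-a-haystack evaluation, here's a minimal sketch of how a single cell of that heatmap is produced. The filler text, the needle, and the scoring are illustrative stand-ins, not the actual benchmark harness.

```python
# Minimal sketch of a needle-in-a-haystack check: hide a "needle" fact at a
# chosen depth inside filler text, ask the model to retrieve it, and score
# the answer. Needle, question, and scoring are illustrative.
NEEDLE = "The magic number for the audit is 4817."
QUESTION = "What is the magic number for the audit?"

def build_haystack(filler_sentences, needle, depth_fraction):
    """Insert the needle roughly `depth_fraction` of the way through the context."""
    idx = int(len(filler_sentences) * depth_fraction)
    return " ".join(filler_sentences[:idx] + [needle] + filler_sentences[idx:])

def score(answer: str) -> bool:
    """Pass if the model's answer contains the hidden fact."""
    return "4817" in answer

# For each (context length, depth) cell of the heatmap, you would send
# build_haystack(...) plus QUESTION to the long-context model and record score(answer).
```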
These models are open source and publicly available, and I invite you all to go check them out. I have about a minute left, so I'll finish off here. There's of course a lot more to building an AI financial expert; these are just two pieces of the puzzle, even if they're two important ones. If you're interested in finding out more, feel free to check us out on our website or reach out and contact us. Thank you.