Back to Index

How to train a Million Context LLM — with Mark Huang of Gradient.ai


Chapters

0:00 Introductions
1:30 Founding story of Gradient and its mission
4:35 Minimum viable agents
9:19 Differentiating ML and AI, focusing on out-of-domain generalization
10:12 Extending Llama 3 to 1M tokens
14:32 Technical challenges with long context sequences
17:45 Data quality and the importance of diverse datasets
19:45 What's a theta value?
22:42 RoPE vs Ring Attention vs ALiBi vs YaRN
25:06 Why Ring Attention matters
28:01 How to refine datasets for context extension
33:34 Multi-stage training data and avoiding overfitting to recent data
34:27 The potential of using synthetic data in training
38:22 Applying LoRA adapters to extend model capabilities
42:25 Benchmarking long context models and evaluating their performance
47:20 Pushing to 4M context and output quality degradation
50:08 What do you need this context for?
52:57 Impact of long context in chat vs Docs Summarization
56:25 Future directions for long context models and multimodality
59:38 How do you know what research matters?
62:47 Routine for staying updated with AI research and industry news
65:33 Deciding which AI developments to invest time in
70:37 Request for collaboration and data set construction for long context

Transcript

Hey, everyone. Welcome to the Latent Space podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host, Swyx, founder of Smol.ai. Hey, and today we're in the remote studio with Mark Huang from Gradient. Welcome, Mark. Hey, glad to be here. It's really, you know, a great experience to be able to talk with you all.

I know your podcast is really, really interesting, and I always am listening to it every time you guys have a release. He's not a paid actor. He said that out of his own will. We'll give you the check later. So, Mark, you're unusual in the sense that you and I go back to college.

I don't exactly remember where we overlapped, but, you know, we both went to Wharton and went into the sort of quantitative developer realm. Yeah, exactly. Kind of crazy, right? All goes full circle. I was a quant for quite a few years and then made it out into Silicon Valley.

And now we intersect again when it kind of feels like more or less the same, right? Like the AI wars, the trading wars back in the day, too, to a certain extent, and the grab for talent. Yeah, I think there's definitely a few of us ex-finance people moving into tech and then finding ourselves gravitating towards data and AI.

It seems like you did that. You were at a bunch of sort of quant trading shops, but then as you moved to tech, you were a lead data scientist at Box and staff ML scientist at Splunk, before working on the startup that eventually became Gradient. You want to tell that story?

Yeah, I think part of the reason why I came over from the quant finance world is to get more collaboration, learn about what big data and scaling machine learning really looks like when you're not in this bubble, right? And working at Box, I worked mostly in a cross-functional role, helping product analytics and go to market.

And then at Splunk, it was a lot more of a specific role where I was helping with streaming analytics and search and deep learning. And for Gradient, really why we started it was whether it was in finance or whether it was in tech, I always noticed that there was a little bit more to give in terms of what AI or ML could contribute to the business.

And we came at a really good time with respect to wanting to bring the full value of what that could be into the enterprise. And then obviously OpenAI created this huge vacuum into the industry to allow for that, right? So I myself felt really, really empowered to actually ship a product and ship stuff that I could think could really help people.

Maybe just to touch a little bit on Gradient. I know we have a lot of things to go through: Gradient, Llama 3, the context extension, there's a lot. But what exactly is Gradient? And you have an awesome design on your website. It's really retro. And I think people that are watching Fallout on Amazon Prime right now can maybe feel nostalgia just looking at it.

What exactly is it? Because I know you have the foundry, you have the agent SDK, there's a lot of pieces into it. Yeah, for sure. And I appreciate the call out for the design. I know my co-founder, Chris, spent a lot of thought in terms of how he wanted the aesthetic to look like.

And it reminds me a lot about Mad Men. So that was the initial emotional shape that I felt when I saw it. Well, quite simply, Gradient, we're a full stack AI platform. And what we really want to do is we want to enable all of the RPA workloads or the codified automation workloads that existed in enterprise before.

We really want to enable people to transition into more autonomous, agentic workflows that are less brittle, feel more seamless as an interface too. So and able to empower what we really think the new AI workforce should look like. And that kind of required us to build a fairly horizontal platform for those purposes.

We had this discussion at our AI in Action Club on Discord, like the minimum viable agent, or like kind of how you define an agent. Yeah, in your mind, what is the minimum thing that you can call actually an agent and not just like a for loop, you know?

And how do you see the evolution over time, especially as people adopt it more and more? Yeah, so I kind of stage it where everybody, first of all, at the lowest level, thinks about like non-determinism with respect to how the pipeline looks like when it's executed. But even beyond that, this goes back into effectively evaluations.

It's like on each stage of the node, you're going to have to see a marginal improvement in the probability of success for that particular workload because of non-determinism. So yeah, I think it is an overloaded term to a certain extent because like everything is an agent if it calls a language model or any sort of multimodal model these days.

But for us, it's like, you know, my background is statistics. So I want to see like improvements in the probability of the success event or outcome happening because of more nodes. Yeah, I think, you know, the one thing that makes this sort of generative AI era very different from the sort of data science-y type era is that it is very non-deterministic and it's hard to control.
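To put a rough number on that compounding intuition, here is a tiny illustration; the probabilities and pipeline length are made up for the example, not figures Mark quoted:

```python
# Illustrative only: end-to-end reliability of a multi-step agent pipeline
# when each node is non-deterministic. All numbers are hypothetical.
per_node_success = 0.95   # assumed probability that a single node does its job
num_nodes = 10            # assumed pipeline length

end_to_end = per_node_success ** num_nodes
print(f"end-to-end success: {end_to_end:.2f}")  # ~0.60, so each added node has to earn its keep
```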

Yeah, I mean, so like, you know, I think what's the founding story of Gradient? Like how, you know, of all the problems that you chose, like why choose this one? You know, how did you get together your co-founders, anything like that, that bring us up to the present day?

One of my co-founders is Chris and he's a really good friend of mine as well. I don't know if you intersected with him at Penn as well, but yeah, Chris Chang, he was at Penn as well, did banking for maybe one or two years and then, you know, was a software engineer at Meta, also was at Google.

And then most recently, he was like a director at Netflix and product. And we always wanted to do something together, but we felt the, you know, what really came to fruition was wanting to develop something that is enterprise-facing for once, mostly because of our experience with internal tooling and inability for something to like, basically exist through like a migration, right?

Like all the time with every ML platform that I've ever had to experience or he had to experience, it's like a rebuild and you rip it out and you have a new workflow or automation come in. And it's this huge multi-quarter, maybe even multi-year project to do that. And we also teamed up with a former coworker of Chris's from Opendoor, Forrest, who was also on Google Cloud Platform.

And, you know, him seeing the scale and actually the state of the art in terms of how Google was using AI for systems before everybody else too, right? They invented the transformer and their internal set of tooling was just so far superior to everything else. Like it's really hard for people to go back after seeing that.

So what we really wanted was to reduce that friction for like actually shipping workloads in product value when you have all these like types of operational frictions that happen inside of these large enterprises. And then really like the main pivot point for all of it was like you said, things that can handle out of domain problems.

So like out of domain data that comes in, having the flexibility to not fall over and having something that you build over time that continues to improve. Like machine learning is about learning. And I feel like a lot of systems back in the place, they were learning a very specific objective function, but they weren't really natively learning with the user.

So like that's the whole, we use the term assistant all the time, but my vision for the assistant was always for the system to grow alongside me, right? Like almost like an embodied second limb or something that will be able to get better as you also learn yourself. Yeah, I might maybe call it, people always trying to define a difference between ML and AI.

And I think in AI, we definitely care a lot more about out of domain generalization. And that's all under the umbrella of learning, but it is a very specific kind of learning. I'm going to try to make a segue into today's like main topic of conversation that's something that you've been blowing up on, which is the long context learning, right?

Which is also some form of out of topic, out of distribution generalization. And in this context, you're extending the context window of an existing open source model. Maybe if you want to like, just bring us all the way back to it, towards like what got you interested in long context?

Why did you find it like an interesting investment to work on? And then the story of how you did your first extensions. Yeah, I think it came, for Llama 3 specifically, we chose that model because of the main criticisms about it when it first got released: an 8,000 token context length just seemed like it was too short, because Mistral and even Yi had come out with much longer context length models.

But really the inception of all of it was us fine-tuning so many models and working on RAG so much, and having this, and it still exists today, this basically pedagogical debate with everybody who's like, "Hey, is it fine-tuning versus RAG? Is it this versus that?" And like, at the end of the day, it's just all meta learning, right?

Like all we want is like the best meta learning workflow or meta learning setup possible to be able to adapt a model to do anything. So naturally, long context had a place in that, but nobody had really pushed the limits of it, right? Like you would see like 10 shot, maybe 100 shot prompting for improving the model's capabilities, but it wasn't until Google comes out with Gemini with the first 1 million context length model that a lot of people's jaws dropped and that hunger for understanding what that could really facilitate in the new workflows came about.

So we were staged to actually train other open source models to do that. But the moment Llama 3 came out, we just went ham against that specific model, because the two things that were particularly appealing for that was the fact that, like, I see a lot of these language models as compression algorithms to a certain extent, like the way they compressed 15 trillion tokens into a specific model.

That definitely made me feel like it would have a lot of capabilities and be more adaptable towards extending that context length. So we went in there and the 1 million number was always, that was more of just like put the North Star up there and see if we can get there.

And then see what was happening along the way as we did that. So yeah, also shout out to Crusoe who facilitated all that compute because I would be lying if I was to say like, anyone could just go out and do it. It does require quite a bit of compute.

It requires like a lot of preparation, but it just like all the stars kind of aligned for that moment for us to go after that problem. I'll take a side note on Crusoe since you just brought it up. Yeah, like, can you explain what Crusoe is? You know, I have this mental image of putting GPUs on top of oil rigs.

What is it? What do they do? How do you work with them? You know, just anything nice. I'm sure they appreciate nice things that you say about them too. Oh, for sure, for sure. So they came to us through a collaborative effort where we basically were in search of a cloud, you know, a GPU provider.

I don't want to call cloud service provider quite yet because then, you know, you think about hyperscalers, but for them, you know, they're one of the biggest alternative GPU cloud providers. And they were offering up like we want to do a collaboration to showcase their technology. And it just made it really easy for us to like scale up with their L40Ss.

And those were the specific GPU instances we used. And coordinating that effort with them to get, you know, that dedicated cluster first to do the project. It became a really good relationship. And we still work with them today because like we're trying to evaluate more of these models and possibly train more of them.

And anyone could go up to them and basically get your compute from them. And they have a lot of GPUs available for those type of projects. I would love to maybe have you run people through why the models don't come with longer context sequences out of the box. Like, obviously, you know, the TLDR is like self-attention.

It's like quadratic scaling of memory. So the longer the context size, the more compute you have to spend at training time. And that's why you have to get Crusoe to help you. How do you actually train a large language model that has a very long context? And then how does that differ from just tacking it on top later?

And then maybe we'll dive into performance and some of those things. But I think for a lot of folks in our audience that are more engineers, they use models, but don't necessarily build the models themselves. A lot of time, it's hard to understand what goes into actually making a long context model.

Yeah, in terms of, you know, all the literature out there, I would say, honestly, it's probably still TBD as to like the tradeoffs between the approach we did, which is more of a curriculum learning approach after the fact versus inherently training a model with a long context throughout, because I just don't think people have looked at the scaling properties of it in deep, deep detail.

But stylized facts exist out there, with research papers from Meta themselves, actually. It was already shown in a paper that if you train a model on a shorter context and you progressively increase that context to, you know, the final limit that you have, like 32K, which is about as far as they extended Llama 2.

It actually performs better than if you try to train 32K the whole time. And I like to think about it intuitively as if you're trying to learn probability theory. You're not going to go and read the book cover to cover and then do all the exercises afterwards. What you're going to do is you're going to do each chapter, do an exercise, read the chapter, do an exercise, and then finish right with the final set of like holistic exercises or examination.

So attention is exactly what it sounds like to a certain extent. Like you have a bunch of indices and you are making the model attend to localized contexts and concepts across the entirety of its encoding, right? Like whatever text or sequence you're giving it. So when you're doing the curriculum learning aspect of things, you are kind of trying to give it the opportunity to also attend to all the concepts.
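To make the curriculum idea concrete, here is a minimal sketch of staged context extension; the stage lengths, token counts, and the train_stage stub are illustrative assumptions, not Gradient's actual schedule:

```python
# Sketch of curriculum-style context extension: each stage continues training
# from the previous checkpoint at a longer sequence length.
def train_stage(checkpoint: str, max_seq_len: int, num_tokens: int) -> str:
    # Placeholder for a real training loop (e.g. one sharded with ring attention);
    # here it only reports the stage and returns a new checkpoint name.
    print(f"continuing {checkpoint} at seq_len={max_seq_len:,} for {num_tokens:,} tokens")
    return f"{checkpoint}-ctx{max_seq_len}"

checkpoint = "meta-llama/Meta-Llama-3-8B"  # base model with an 8K native context
for seq_len, tokens in [(65_536, 100_000_000),
                        (262_144, 50_000_000),
                        (1_048_576, 50_000_000)]:
    checkpoint = train_stage(checkpoint, seq_len, tokens)
```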

So data actually in the curation, in the creation of that context, plays a huge role because a lot of times people make the mistake of trying to extend the context length by just giving it raw text that doesn't have the necessity for the model to go all the way in the beginning of the sequence and then connect an idea to the end of the sequence.

So data quality is one thing, but it sounds like the base model matters too. Would the one million context work if Llama 3 had been a 2K context size model? Is there like a minimum context size that you need to then be able to generalize? Or does it not really matter and the fine-tuning takes care of it?

There's no minimum, I would say, or at least I can't make such a strong statement as to say that that does not exist. But if you have a 4K model, any regular model out there, you can progressively increase the context length of it, so long as it has shown really good perplexity scores prior to your context length extension.

So if it hasn't shown good perplexity, you basically can't even predict the next token, you're kind of like out of luck, right? But then from there, the other component that we actually just released the blog on maybe last Friday, it's like you got to pay attention to the theta value that the model starts off with.

What was fairly unique about the Llama 3 model was their choice of the theta parameter, which gave some suspicion as to how long the context could be extended for the model. So that aspect of like, we can go into a huge lesson in terms of positional encodings and RoPE scaling and stuff.

But those concepts and that aspect of things enables you to scale out the length much more easily. - What's the TLDR of what the theta is for a model? If I haven't built a model before... - Yeah, yeah. - Not me, obviously I know what it is, but for people that don't know, right?

I'm totally an expert. - Yeah, well, so not all models have it, but some models will employ RoPE scaling, and specifically Llama 3 does that. But there's also other positional encoding and embedding mechanisms that other models employ. But TLDR is, if you think about most architectures, they employ basically, it's kind of like a sine or cosine curve.

And you're thinking about like the different, you have the amplitudes that occur there to allow for the model to see different types of distributions of data. Really what the theta value does is govern how often a pattern is going to appear in the embedding space. So you basically are able to shift that rotational curve by increasing the theta value and allow for different types of distributions to be seen as if they actually occurred in the training data before.

So it's super confusing, but it's like there's positional extrapolation, and then there's interpolation. You want interpolation. It's been shown that just pure extrapolation makes the model a lot worse, and it's harder to attend to stuff. Whereas the interpolation is like you're squeezing everything back in to what the original context length was to a certain extent, and then allowing for it to overlap different sequences that it's already seen as if it actually occurred when you see a million contexts of sequence tokens.
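As a rough sketch of what raising theta does, the snippet below computes RoPE rotation angles for a far-out position under Llama 3's default base (500,000) and under a much larger, purely illustrative base; the larger base shrinks the angles, which is the interpolation effect described above:

```python
import numpy as np

# RoPE inverse frequencies for one attention head (head_dim of 128, as in Llama 3 8B).
# A larger base theta slows every rotation. The 5e8 value is illustrative, not a
# recommended setting.
def rope_angles(position: int, head_dim: int = 128, theta: float = 500_000.0):
    inv_freq = 1.0 / (theta ** (np.arange(0, head_dim, 2) / head_dim))
    return position * inv_freq  # rotation angle per frequency band

pos = 900_000  # far outside the native 8K training window
print(rope_angles(pos, theta=500_000.0)[:3])  # Llama 3's default base
print(rope_angles(pos, theta=5e8)[:3])        # larger base: smaller angles at the same position
```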

So yeah, I think that aspect, we didn't know how well it would scale. I think that's one thing. So I'm not going to lie and tell you right off the bat, we're definitely going to hit a million. It was more like we're getting to 256, and it looked good.

We did our evals. We scaled it more. And then what was really good was that we established the formula at the start. So it's actually a formula that we actually took from the paper. I think it's the rope scaling paper. And we looked at that particular formula, and then we backed out the values.

And it's all empirical. So it's not like a mathematical tautology or proof. It's just an empirical formula that actually worked really well. And then we just kept scaling it up, and it held. It's kind of like the scaling laws. You know the scaling laws exist, but you don't know if they're going to continue.
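For readers who want to try this, one widely cited empirical rule for backing a new theta out of a target length (often called NTK-aware RoPE scaling) is sketched below; whether this is precisely the formula Gradient used is not stated in the conversation, so treat it as an assumption:

```python
# NTK-aware style scaling: theta_new = theta_old * s ** (d / (d - 2)),
# where s is the context extension ratio and d is the rotary head dimension.
def scaled_theta(theta_old: float, old_len: int, new_len: int, head_dim: int = 128) -> float:
    s = new_len / old_len
    return theta_old * s ** (head_dim / (head_dim - 2))

# Llama 3's default base is 500,000 with an 8,192-token native context.
print(f"{scaled_theta(500_000.0, 8_192, 1_048_576):,.0f}")  # candidate base for a 1M-token target
```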

So yeah. Are you able to compare it with other forms of scaling that people have been talking about? ALiBi comes to mind. YaRN is being talked about a lot by Nous Research. And then there's other forms which are not exactly directly related, but ring attention comes up a lot.

We had a really good session with Strong Compute in the Latent Space Discord talking about all these approaches. I was just wondering if you want to compare and contrast RoPE versus the other stuff. Yeah, I think... I can never pronounce it right, but ALiBi. Yeah, ALiBi. We haven't compared with that one specifically, mostly because I've noticed some of the newer architectures don't actually employ it a lot.

I think the last architecture that actually really employed it was the Mosaic MPT model class. And then almost all the models these days are all RoPE scaling. And then effectively, you can use YaRN with that as well. We just did the theta scaling specifically because of its empirical elegance.

It was really easy and it was well understood by us. The other one that I know of in the open source that people are applying, which uses more of a LoRA-based approach, which is really interesting too, is the one that Wing has been employing, which is PoSE. We've sort of helped them evaluate some of the models.

With respect to the performance of it, it does start to break down a little bit more on the longer and longer context. So like 500,000 to a million, it appeared that it doesn't hold as well specifically for like Needle in the Haystack. But it's still TBD as... Evaluations, I call it just like a high...

It's a sparse high dimensional space where you're just evaluating performance across so many different things and then trying to map it back to like, "Hey, here's the thing that I actually cared about from the start." And I have like a thousand different evaluations and they tell me something, but not the entire picture, right?

And as for like Ring-Attention specifically, like we employed Ring-Attention in order to do the training. So we combined Flash-Attention and Ring-Attention together with a really specific network topology on our GPUs to be able to maximize the memory bandwidth. Yeah, as far as I understand, like Ring-Attention, a lot of people credit it for Gemini's million token context, but actually it's just a better utilization of GPUs, right?

Like that's really what it is. You mentioned in our show notes, Zhang Peiyuan's EasyContext repo. I have seen that come up quite a bit. What does that do? Like how important is it as a Ring Attention implementation? I know there's like maybe another one that was done by lucidrains or one of the other open source people.

But like what is EasyContext? Like is that the place to go? Like did you evaluate a bunch of things to implement Ring Attention? Yeah, we evaluated all of them. Like it was, I would say the original authors, you know, Matei and all the folks at Berkeley, they created the JAX implementation for it.

And unfortunately, not to discredit, like, you know, TPUs or whatever, like the JAX implementation just does not work on GPUs very well. Like any naive setup that you do, like it just won't run out of the box very easily. And then unfortunately, that was probably the most mature repo with a lot more configurations to set up interesting network topologies for your cluster.

And then the other PyTorch implementations outside of EasyContext, they just didn't really work. Maybe we were implementing one small aspect incorrectly, but active development on them had stopped at a certain point. Like even lucidrains, I think he's interesting 'cause for once he was actually like, he was taking a job somewhere and then just stopped, you know, doing commits.

And as we were working to try to find it, like we never really want to jump in on a repo where someone's like kind of actively committing breaking changes to it. Otherwise we have to like eat that repo ourselves. And yeah, Easy Context was the first PyTorch implementation that applied it with native libraries that worked pretty well.

And then we adapted it ourselves in order to configure it for our cluster network topology. So, you know, shout out to Zhang Peiyuan for his open source contributions. I think that we look forward to possibly collaborating with him and pushing that further in the future because I think more people, if they do want to get started on it, I would recommend that to be the easiest way.

Like, I don't know how many people know Jax. Me personally, I don't really know it that well. So I'm more of a PyTorch guy. So yeah, I think that he provides a really good introduction to be able to try it out. - And so on one side, you had the technical discovery.

What about the actual customer interest? Customers that you work with? I feel like sometimes the context size can be a bit of a marketing ploy. You know, people are like, "Oh yeah, well, no, 1 million, 2 million, 3 million, 4 million." So that's kind of the algorithm side of it.

How do you actually, you know, how do you power the training? But the other side is obviously the data that goes into it. There's both quantity and quality. I think in one of your tweets, you trained on about 200 million tokens for the 8B model for the context extension.

But what are the tokens? You know, how do you build them? What are like maybe some of the differences between pre-training datasets and context extension datasets? Yeah, any other color you give there would be great. So specifically for us, we actually staged two different updates to the model. So our initial layer that we trained was just basically like a pre-training layer.

So continual pre-training where we took the SlimPajama data and then we filtered it and concatenated it so that it would reach the context lengths that we were trying to extend out to. And then we took the UltraChat dataset, filtered it down, or maybe some other, you know, second order derivative of the UltraChat dataset that was curated, and then filtered it down and reformatted it for our chat use case.

For those two datasets, I think you always have to really keep in mind for the pre-training data, whether or not you may be like cutting off tokens in weird ways, whether or not, you know, the content is actually diverse enough to retain the ability of the model. So SlimPajama tends to be one of the best ones, mostly because it's a diverse dataset and you can use embeddings too as a pre-filtering step as well, right?

Like how diverse is your embedding space relative to the original corpus of the model, and then you train on top of that to retain its abilities. And then finally for the chat dataset, making sure that it's attending to all the information that would be expected to really stretch its capabilities, 'cause you could create like a long context dataset where every single time, the last 200 tokens can answer the entire question.

And that's never gonna make the model attend to anything. So it's even something that we're doing right now is trying to think about like how do we actually improve these models and how do you ablate the datasets such that it can expose like even more nuanced capabilities that aren't easily measurable quite yet.

Is there a ratio between diversity of the dataset versus diversity compared to what the model already knows? Like does the model already need to understand a good part of the new, like the context extension data to function? Like can you put a context extension dataset that is like very far from like what was in the pre-training?

I'm just thinking as models get older, you know, some of the datasets that we have might not be in the knowledge of the existing model that you're trying to extend. - I think that's always a consideration. I think specifically, you really got to know how many tokens were expended on that particular model from the start.

And all models these days are now double digit trillions, right? So it's kind of a drop in the bucket if you really think, I can just put, you know, a billion tokens in there and the model is gonna truly learn new information. There is a lot of research out there on the differences between full fine-tuning, which is what we applied, versus LoRA-based fine-tuning.

It's a trade-off. And my opinion of it is actually that you can test certain capabilities and you can kind of inject new knowledge into the model. But to this day, I've not seen any research that does like a strong, well-scaled out empirical study on how do you increase the model's ability to understand like these decision boundaries with a new novel data.

Most of it is taking, like holding out a portion of the data as like novel and then needing to recycle some of the old knowledge. So it just doesn't forget and get worse at everything else, right? Which was seen, like we do have historical precedent where Code Llama was trained further from Llama 2 and it just lost all its language capabilities, basically, right?

So it's not, I don't wanna call that project, like deem it as like a failure, but it wasn't like a really successful generalization exercise because these models are about like flexibility and being like generic to a certain extent. - So one thing I see in the recent papers that have been coming out is this sort of concept of multi-stage training of data.

And if you're doing full fine tuning, maybe the move or the answer is don't train 500 billion tokens on just code because then yeah, it's gonna massively overfit to just code. Instead, like maybe the move is to slowly change the mix over the different phases, right? So in other words, like you still need to mix in some of your original source dataset to make sure it doesn't deviate too much.

I feel like that is a very crude solution. Like maybe there's some smarter way to adjust like the loss function so that it doesn't like deviate or overfit too much to more recent data. It seems like it's a solvable thing. That's what I'm saying. Like this overfitting to more recent data issue.

- Well, solvable is hard. I think provably solvable is always something that I know is extremely difficult. But from a heuristical standpoint, as well as like having like some sort of statistical efficiency on like how you can converge to the downstream tasks and improve the performance that way in a targeted manner, I do think there are papers that try to do that.

Like the DoReMi paper, I think it was released last year. It was really good about doing an empirical study on that. I think the one thing that people struggle with though is the fact that they always try to do it on pretty naive tasks. Like you target like a naive task and then you create your data mixture and you try to show some sort of algorithm that can retain the performance for those downstream tasks.

But then what we all care about are actually like really, really interesting, complex tasks, right? And we barely have good evaluations for those. Like if you do a deep dive at the Gemini 1.5 technical paper, which they just updated, it was a fantastic paper with new updates.

If you look at all of their long context evaluations there, like a lot of them are just not something that the open community can even do, because they just hired like teachers to evaluate whether or not this model generated a huge lesson plan that is really coherent, or like you hire a bunch of subject matter experts, or they taught the model how to do language translation for a nearly extinct language that only 200 people in the world know.

It's like, it's kind of hard for us to do that same study, right? As an early stage startup. I mean, technically now you can use Gemini as a judge. Gemini is touting a lot of their capabilities in low resource languages. One more thing before on the sort of data topic.

Did you have any exploration of synthetic data at all? You know, use Mistral to rephrase some existing part of your data set to generate more tokens, anything like that, or any other form of synthetic data that you choose to mention. I think you also mentioned the large world model paper, right?

So yeah, yeah. Anything like that? Yeah, yeah. So yeah, we used like GPT-4 to rephrase certain aspects of the chat data, reformatting it or kind of generating new types of tokens and language and types of data that the model could see. And also like trying to take the lower correlated instances of out-of-domain data that we wanted to inject into the model as well.

So I actually think a lot of the moat is in the data pipeline. You'll notice like most papers just don't really go into deep detail about the data set creation because they probably know. I mean, there's some aspects that are like uninteresting, right? Which is like we paid a bunch of people and like generated a lot of good data.

But then the synthetic data generating pipeline itself, sometimes that could be like 25% or 50% of the entire dataset that you end up training on. Yeah, I think it's just for legal deniability rather than... (both laughing) No, it's just too boring. I'm not going to say anything because it's too boring.

No, it's actually really interesting. But in fact, it might be too interesting. So we're not going to say anything about it. Yeah. One more question that I had was on LoRA and taking some of these capabilities out and bringing them to other models. You mentioned Wing's work. He tweeted about, we're going to take this LoRA adapter for the Gradient 1 million context extension and you're going to be able to apply that to other models.

Can you just generally explain to people how these things work with language models? I think people understand that with Stable Diffusion, you have these like LoRA patches for like different types of styles. Does that work similarly with LLMs? And is it about functionality? Can you do LoRA patches with specific knowledge?

Like what's the state of the art there? Yeah, I think there's a huge kind of resurgence in what I would call like model alchemy to a certain extent because you're like taking all of these LoRas and you're mixing them together. And then that's a lot of the model merging stuff that I think Charles Goddard does and a lot of others in the open community, right?

'Cause it's a really easy way, like you don't need training and you can test and evaluate models and take the best skills and mix and match. I don't think there has been as much empirical study, like you're saying, for how it shows the same type of, like it's not as interpretable as like stable diffusion to a certain extent.

'Cause even we have experimented with taking like deltas in the same methodology as Wing, where we'll take a delta of like an already trained model, try to see how that has created in a sense an RLHF layer, right? Taking the Llama instruct layer, subtracting the base model from that and then trying to apply that LoRA adapter to like another model and seeing what it does to it.
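A minimal sketch of that delta idea follows: subtract the base weights from the instruct-tuned weights and add the difference onto another compatible checkpoint. This is generic task-vector arithmetic on toy tensors, not Gradient's exact procedure:

```python
import torch

# Assumes all three "models" share identical architectures and parameter names.
def apply_delta(base_sd, instruct_sd, target_sd, scale: float = 1.0):
    merged = {}
    for name, weight in target_sd.items():
        delta = instruct_sd[name] - base_sd[name]  # e.g. the instruct/RLHF "layer"
        merged[name] = weight + scale * delta
    return merged

# Tiny fake state dicts just to show the mechanics.
base     = {"layer.weight": torch.zeros(4, 4)}
instruct = {"layer.weight": torch.full((4, 4), 0.1)}
target   = {"layer.weight": torch.randn(4, 4)}
print(apply_delta(base, instruct, target)["layer.weight"])
```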

- It does seem to have an effect though. Like I will not lie to say, I'm really surprised how effective it is sometimes. But I do notice that for more complex abilities, other than like more stylistic stuff, it kind of falls through 'cause maybe it requires a much deeper path in the neural network, right?

Like all these things, these weights are just like huge trees of paths where the interesting stuff is the road less traveled to a certain extent. And when you're just merging things brute force together that way, you don't quite know what you'll get out all the time. Like there's a lot of other research, you have model merging, TIES, and all these different types of techniques, to effectively just apply like a singular value decomposition on top of weights and just get the most important ones and prevent interference across all the other layers.

But yeah, I think that that is extremely interesting from the developer community. And I wanna see more of it, except it is to a certain extent kind of polluting the leaderboards these days, 'cause it's so targeted, and like now you can kind of game the metric by just finding all the best models and then just merging them together to do that.

And I'll just add one last bit: basically the most interesting part about all that actually to me is when people are trying to take the LoRAs as a way of like short circuiting the training process. So they take the LoRAs, they merge it in and then they'll fine tune afterwards.

So like the fine tuning and the reinitialization of a little bit of noise into all of the new merged models provides kind of a learning tactic for you to get to that capability a little bit faster. There's a lot there. I really like the comparison of TIES merging to singular value decomposition.

That's something that I like. I looked at the paper and I don't really think I understood it on that high level until you just said it. Very cool. We have to move on to benchmarking. This is a very fun topic. Needle in a haystack. What are your thoughts and feelings?

And then we can discuss the other benchmarks first. Needle in a haystack. You want to put me on the spot with that one. Yeah, I think needle in a haystack is definitely like the standard for presenting the work in a way that people can understand and also proving out.

I would say like, I view it as like a primitive that you have to pass in order to give the model any shot of doing something that combines both like a more holistic language understanding and like instruction following, right? Like, honestly, like it's mostly about if you think about the practical applications of like long context and what people complain most about models when you stuff a lot of context into it is either the language model just doesn't care about what you asked it to do or it cannot differentiate like, you know, context that you want it to use as a source to prevent hallucination versus like instructions.

I think that, you know, when we were doing it, it was to make sure that we were on the right track. I think Greg did a really great job of creating a metric and a benchmark that everybody could understand. It was intuitive. Even he says himself, we have to move past it.

But to that regard, it's a big reason why we did the evaluation on the RULER suite of benchmarks, which are way harder. They actually include needle in the haystack within those benchmarks too. And I would even argue it's more comprehensive than the benchmark that Gemini released for their multi-needle in the haystack.
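For readers who have not seen the setup, a bare-bones needle-in-a-haystack probe looks something like the sketch below; real harnesses such as Greg Kamradt's sweep needle depth and context length over a grid and score the answers with a judge model:

```python
import random

# Build a single needle-in-a-haystack prompt (illustrative, not a full harness).
def build_niah_prompt(filler: str, needle: str, total_sentences: int, depth: float) -> str:
    haystack = [filler] * total_sentences
    haystack.insert(int(depth * total_sentences), needle)  # bury the needle at a given depth
    return " ".join(haystack) + "\n\nQuestion: What is the magic number mentioned above?"

needle = f"The magic number is {random.randint(1000, 9999)}."
prompt = build_niah_prompt("The grass is green and the sky is blue.",
                           needle, total_sentences=2000, depth=0.35)
# Send `prompt` to the model under test and check whether the reply contains the number.
print(len(prompt.split()))
```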

Yeah, you mentioned quite a few. You mentioned RULER, LooGLE, InfiniteBench, BAMBOO, ZeroSCROLLS. Like, do you want to give us maybe two or three of those that you thought were particularly interesting or challenging and what made them stand out for you? There's just so many and they're so nuanced.

I would say like, yeah, ZeroSCROLLS was the first one I'd ever heard of, coming out last year. And it was more of like tracking a variable over long context. I'll go into RULER because that's the freshest in my mind and we were just scrutinizing it so much and running the evaluation over the previous two weeks.

But RULER has four different types of evaluations. So the first one is exactly needle in the haystack. It's like you throw in multiple needles, so you got to retrieve multiple key-value pairs. There's another one where you basically need to differentiate. Multi-key, multi-value, multi-query. Yeah, yeah, multi-value, multi-query.

That's the ablation. There's also a variable tracking one where you go, hey, X equals this, Y equals X, Z equals Y, like, what is this variable? And you have to track it through all of that context. And then finally, there's one that is more of like creating a summary statistic.

So like the common words one, where you choose a word that goes across the entire context, and then you have to like count it. So it's a lot more holistic and a little bit more difficult that way. And then there's a few other ones that escaped me at this moment.

But yeah, RULER really pushes you. If I think about the progression of the evaluations, it pushes it to start to force the model to actually understand the totality of the context, rather than, right, like everybody argues to say, couldn't I just use retrieval to just grab that variable rather than pay $10 for one shot or something?

Although it's not as expensive. Yeah, exactly, exactly. So being able to actually like, I think the main thing that like I struggled with, with even some of our use cases, were like when the context is scattered across multiple documents, and you have like really delicate plumbing for the retrieval step, but it only works for that one, that really specific instance, right?

And then you throw in other documents and you're like, oh, great, like my retrieval doesn't grab the relevant context anymore. So like, that's the dream, right? Of getting one model, a model that can generalize really well that way. Yeah, totally. I think that probably is what Greg mentioned when saying that he has to move beyond Needle and Haystack.

You also mentioned, so you extended from 1 million to 4 million token context recently, and you saw some degradation in the benchmarks too. Like, do you want to discuss that? So if you look at our theta value at that point, it's getting really big. So think about floating point precision and thinking about propagating, like basically now you're starting to run into problems where in a deep enough network and having to do joint probabilities across like so many tokens, you're hitting kind of the upper bound on accuracy there.

And there's probably some aspect of kind of clamping down certain activations that we need to do within training. Maybe it happens at inference time as well with respect to like the theta value that we use and how do we ensure that it doesn't just explode. If you've ever had to come across like the exploding gradients or the vanishing gradient problem, you will know what I'm talking about.

Like a lot of the empirical aspect of that and scaling up these things is experimentation and figuring out how do you kind of marshal these really complicated composite functions such that they don't just hit a divide-by-zero problem at one point. Awesome. Just to wrap on the...

There's the evals and then there's what people care about. There's two things. Do you see people caring about going above 1 million? Because Gemini made the 2 million announcement and I think people were like, "Okay, 1 million, 2 million, it's whatever." Like, do you think we need to get to 10 million to get people to care again?

Yeah. Like, do we need to get to 100 million? Yeah. I mean, that's an open question. I would certainly say a million seemed like the number that got people really excited for us. And then the 4 million is kind of like, "Okay, that's seen as more..." Rather than like a breakthrough milestone, it's like just the next incremental checkpoint.

I do think even Google themselves, they're evaluating and trying to figure out specifically how do you measure the quality of these models and how do you measure and map those to capabilities that you care about going down the line, right? And I think I'm still... Us as a company, we're figuring out how to saturate the context window in a way that's actually adding incremental value.

So the obvious one is code, because code repositories are huge. So can you stuff the entire context of a repo into a model and then make it produce some module that is useful or some suggestion that is useful? However, I would say there are other techniques, like AlphaCodium and flow engineering, where if you do iterative things in a more agentic manner, it may actually produce better quality.

I would preface that and actually counter it by saying maybe start off with the use case that people are a little bit more familiar with right now, which is constantly evolving context in like a session. So like, while you're coding, right? If you can figure out evals that actually work, where you're constantly providing it multiple turns and each incremental turn has a nuanced aspect and you have a targeted generation that you know of, making the model track state and have state management over time is really, really hard.

And it's an incredibly hard evaluation that will probably only really work when you have a huge context. So that's sort of what we're working on trying to figure out those types of aspects. You can also map that. Like it's not just code, state management exists. And like, you know, we work in the finance sector a lot, like investment management, like having a state management of like a concept and stuff that evolves over like a long session.

So yeah, I'm super excited to hear like what other people think about the longer context. I don't think Google is probably investing to try to get a billion quite yet. I think they're trying to figure out how to fully leverage what they've done already. Yeah. And does this change in your mind for very long chats versus a lot of documents?

The chat is kind of interactive, you know, and information changes the documents are just trying to synthesize more and more things. Yeah. Any thoughts on how those two workloads differ? Yeah, I mean, I would say like with the document aspect of things, you probably have like a little bit more ability to tweak other methodologies.

Like you can get around the long context sometimes where you can do retrieval augmented generation or you do like hierarchical, like recursive summarization. Whereas like evolution in like a session, because that state variable could undergo like pretty rapid changes. It's a little bit harder to imagine like you getting around that without codifying like a really specific workflow or like some sort of, you know, state clause that is going back to like determinism, right?

And then finally, like what I really think people are trying to do is like figure out how did all these like shots progress over time? So like, how do you get away from the brittleness of like the retrieval step to like shoving, if you shove in a thousand shots or 2000 shots, will it just make the retrieval aspect of good examples irrelevant?

And like, it's sort of like random sampling is fine at that point. There's actually a paper on that that came out from CMU, where they showed, with respect to a few extraction or classification high-cardinality benchmarks, they tracked fine-tuning versus in-context learning versus many, many shot in-context learning.

And they basically showed that many, many shot in-context learning helps to reduce the sensitivity around the examples themselves. Right? Like the distraction error that a lot of LLMs get, where you give it irrelevant context and it literally can't do the task, because it's sort of like a person too.

Right? Like you got to be very specific about I don't want to distract this person because then, you know, they're going to go down a rabbit hole and not be able to complete the task. Yeah. Well, that's kind of the flip side of the needle in a haystack thing too in a bit.

It's like now the models pay attention to like everything so well. Like sometimes it's hard to get them to like, I just said that once, please do not bring that up again. You know, it happens to me with code. Yeah, it happens to me with like a CSS style.

Sometimes, with things like that, if I have a long conversation, it tries to always reapply certain styles, even though I told it maybe that's not the right way to do it. But yeah, there's a lot again of empirical work that people will do. And just, I know we kind of went through a lot of the technical side, but maybe the flip side is why is it worth doing?

You know, like what are the use cases that people have that make long context really useful? I know you had, I think you have a lot of healthcare use cases I saw on your Twitter. You just mentioned the finance use case. Obviously, some of the filings and documents that companies publish can be quite wordy.

Any other things that you want to bring up? Maybe how people are using gradient, anything like that. I think that will help have a clearer picture for people. Yeah, so beyond like just using the context for, you know, sessions and evolving state management, it really comes down to something that's fairly obvious, which everybody's trying to do and work on is how do you ground the language model better?

So I think when you think pure text, that's one thing. But then multimodality is, in my opinion, going to be pivotal for long context, just because with videos, when you're getting into the frames per second and you're getting into lots of images and things that are a lot more embodied, you need to utilize and leverage way, way more tokens.
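Some quick back-of-the-envelope arithmetic shows why video eats context so fast; the tokens-per-frame figure below is an assumption, since vision encoders vary widely:

```python
# Rough token budget for one hour of video, sampled at one frame per second.
tokens_per_frame = 256   # assumed; depends entirely on the vision encoder
frames_per_second = 1
seconds = 60 * 60

total_tokens = tokens_per_frame * frames_per_second * seconds
print(f"{total_tokens:,} tokens")  # 921,600 tokens, already close to a 1M context
```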

And that is probably where, you know, us as a company, like we're exploring more and trying to open up the doors for a lot more use cases, because I think in financial services, as well as health care, we've done a good job on the tech side, but we still need to push a little bit further when we combine, you know, a picture with words, like a chart with words or somebody's medical image with words, stuff like that.

Like you definitely can do a better job. And, you know, it's timely too, because Meta just released their new Chameleon paper that does multimodal training. And it shows that early fusion helps you to, it's more sample efficient, right? So having that kind of view towards the future is something that, you know, we want to be primed to do because, you know, it's similar to what Sam Altman says himself too, right?

Like you need to just assume that these models are going to be 10x better in the next few years. And if you are primed for that, like that's where you have kind of a business that, you know, you're not just pivoting after every release or every event, you know, that drops.

I think the thing about this 10x issue is that the 10x direction moves all the time. You know, some people were complaining about GPT-4o that, yeah, look, the Elo scores for GPT-4o actually in reality weren't that much higher than GPT-4 Turbo. And really the, you know, so it's not 10x better in reasoning.

It's just 10x better in the integration of multiple modalities. And by the way, look over here, there's a really sexy voice chat app that they accidentally made that they had to deprecate today. It's like the 10x direction keeps moving. Now it's like, you know, fully in like sort of multi-modality land, right?

And like the question is like what next, right? Like, so you can 10x in various ways, but like you guys have 10x context length. But like, you know, are we chasing the last war? Because like now like nobody cares about context length and now it's like multi-modality time, you know.

I'm joking, obviously, people do care about it. I just wonder about this comment about the 10x thing every single time. You know, that's honestly why we kind of have our eye on the community as well as you, right? Like you, you know, with your community and the things that you hear, you know, you want to build, where, you know, we're a product company, we're trying to build for users and trying to listen to understand what they actually need.

Like, obviously, you know, you don't build everything that people ask you to build, but know what's useful, right? Because I think that you're totally right there. If we want to make something 10x better in a certain direction, but nobody cares and it's not useful for somebody, then it wasn't really worth the while.

And if anything, maybe that's like the bitter lesson 2.0 for so many tech startups: build technology that people care about and will actually 10x their value rather than build technology that's just 10x harder. I mean, no, that's not a bitter lesson. That's just Paul Graham. Yeah.

That's just Paul Graham. That's, that's, yeah. One more thing on the chameleon paper. I was actually just about to bring that up, you know? So on AI News, like my sort of daily newsletter, it was literally my most, my most recent featured paper. And I always wonder if you can actually sort of like train images onto the same data space as words.

That was kind of done with, you know, what we now call late fusion models, with like LLaVA and Flamingo and, you know, all the others. But now the early fusion models like Chameleon seem to be the way forward. Like, obviously it's more native. I wonder if you guys can figure out some kind of weird technique where you can take an existing Llama 3 model and, you know, early fuse the images into the text encoder so that we just retroactively have the early fusion models.

Yeah. Even before the early, you know, that the chameleon paper came out, I think that was on our big board of next to do's to possibly explore or our backlog of ideas, right? Because as you said, early fusion, like even before this paper, I can't remember. I think Meta even had like a scaling laws for multimodality paper that does explore more early fusion.

Like the moment we saw that it was just kind of obvious to us that eventually it'll get to the point that becomes a little bit more mainstream. And yeah, like that's a cool twist that we've been thinking about too as well, as well as like other things that are kind of in the works that are a little bit more agentic.

But yeah, if open collaboration interests you, we can always work on that together with the community. Ooh, okay. Shout out there. Cool. Well, you can leave that in the call to action at the end. I just want to, you know, we have a couple more questions to round this out.

You mentioned a lot of papers in your work. You're also building a company. You're also looking at open source projects and community. What is your daily or weekly routine to keep on top of AI? So one, subscribe to AI News. He didn't have to pay me to say that.

I actually think it's a good aggregator. I'll tell you why. Most of the fastest moving research that's being done out there shows up mostly on Twitter. Like my Twitter, I wasn't a power Twitter user at all.

Before three years ago, but I had to use it and I had to always check it in order to keep on top of like early work, right? That people want to talk about or present because nothing against submitting research papers to like ICLR, ICML, like knowing the state of the art, like those are like six months late, right?

Like people have already dropped it on arXiv, or they're just openly talking about it. The submission process. Yeah. Yeah. And then being on Discord to see when the rubber hits the road, right? Like the implementations and the practices that are being done, or like the datasets, like you said, a lot of conversations about really good datasets and how you construct them are done in the open, and figuring that out for people that don't have budgets of like $10 million to just pay a bunch of annotators.

So my routine daily is like, the second thing I do when I wake up is to look on Twitter to see what the latest updates are from specific people that do really, really great work. Armen at Meta, who did the Chameleon paper, like everything he writes on Twitter is gold.

So like anytime he writes something there, like I really try to figure out what he's actually saying there and then tie it to techniques and research papers out there. And then sometimes I try to use certain tools, like I myself use AI itself to search for the latest papers on a specific topic, if that's the thing on the top of my mind.

And at the end of the day, trying out the products too. I think if you do not try out the tooling and some of the products out there, like you are missing out on someone's compression algorithm. Like they compressed all the research out there and all the thought and all the state of the art into a product that they're trying to create for you.

And then really backing out and reverse engineering what it took to build something like that. Like that's huge, right? Like if you can actually understand Perplexity, for instance, you'll already be well ahead on the research. - Oh, by the way, you mentioned, what's a good perplexity score?

If there's like just a number, right? Like it's like five to eight or something. Like do you have a number in mind when you said that? - Yeah, I mean, what was the one that we had? Flipping between train loss and perplexity is actually not native to me quite yet.

But like, yeah, between like, if you can get like a four using the context length extension on LLAMA, like you're in the right direction. And then obviously you'll see spikes. And specifically when the one trick you should pay attention to is, you know that your context length and theta scaling is working right.

If the early steps in the perplexity go straight down. So like when it wasn't correct, it would oscillate a lot in the beginning. And we just knew that we'd cut the training short and then retry a new theta scale. - Because in effect, you're properly continuing the fine-tuning or the full retraining.

- Yeah, yeah. The model just like, it saw something out of domain immediately and was like, I have no idea what to do. And you need it to be able to overlap that positional embedding on top of each other. - One follow up, right? Before we sort of close out.

Like, I think being on Twitter and like looking at all these new headlines is really helpful. But then it only gets you like a very surface level understanding. Then you still need a process to decide which one to invest in. So I'm trying to dig for like, what is your formula for like deciding, you know, what to go deep on and what to kind of skip.

- From a practical standpoint, as a company, like I already know there are like three to five things that will be valuable and useful to us. And then there's other stuff that's like out of scope for different reasons. Some stuff is like out of scope from, hey, this is not going to impact or help us.

And then other things are out of scope because we can't do it. You know, like the stuff like different tech. So a really good instance of that is specific algorithms for, you know, improving extremely large scale distributed training. Like, we're not gonna have the opportunity to get 2,000 H100s.

If we do, it'd be really cool. But like, I'm just saying like, as for now, like you gotta reach for the things that would be useful. Things that would be useful for us, for instance, for everybody actually, to be honest, is like evaluations, different post-training techniques, and then synthetic data construction.

Like we're always on the, I'm always on the look for that. And then how do I figure out where there are these things? You know, which new piece of news is actually novel? Well, that's sort of my like mental cache to a certain extent. Like I've built up like this state of like, I already know like all the things that have already been written for the state of the art for certain topic areas.

And then I know what's being kind of recycled as like an empirical study versus like something that actually is very insightful. Underrated specific instance would be like the DeepSeek paper. I'd never seen it before, but like the multi-head latent attention, like that was really unexpected to me because like I thought I'd seen every type, not every type, obviously, but like every way that people wanted to cut like mixture of experts into interesting ways.

And I never thought something would like catch my eye to be like, oh, this is totally new. And it really does have a lot of value. Yeah, so like, I think that's mainly how I try to do it. And like you talk to your network too. Like I just, you know, talk to the people and then know and make sure like I have certain subject matter experts on SpeedDial that I also like to share information with and understand like, hey, does this catch your eye too?

Do you think this is valuable or real? 'Cause yeah, right, Shawn, it's a noisy space we're in right now, which is cool 'cause it's really interesting and people are excited about it. But at the same time, there is actually a 10x or more explosion of information coming in that all sounds really, really unique and new.

And you could spend like hours, you know, down a rabbit hole that isn't that useful. Awesome, Mark, I know we kept you in the studio for a long time. Any final call to actions for folks that could be roles you're hiring for, requests for startups, anything that comes to mind that you want to share with the audience?

Yeah, I think along those lines, we definitely have a call to action to get more people to work together with us on long context evaluations. That is sort of the "it" topic that everyone, even Meta or Google or any of the other folks, is focusing on.

'Cause I think we lack an understanding of that within the community. And then can we as a community also help to construct other modalities of datasets that would be interesting, like pairwise datasets, right? Like you could get just straight video and then straight text, but getting them together, like for grounding purposes, will be really useful for training the next set of models that I know are coming out.

And the more people we have contributing to that would be really useful. Awesome, thank you so much for coming on, Mark. This was a lot of fun. Yeah, thanks a lot. Yeah, this is great. (upbeat music)