How to train a Million Context LLM — with Mark Huang of Gradient.ai
Chapters
0:00 Introductions
1:30 Founding story of Gradient and its mission
4:35 Minimum viable agents
9:19 Differentiating ML and AI, focusing on out-of-domain generalization
10:12 Extending Llama 3 to 1M tokens
14:32 Technical challenges with long context sequences
17:45 Data quality and the importance of diverse datasets
19:45 What's a theta value?
22:42 RoPE vs Ring Attention vs ALiBi vs YaRN
25:06 Why RingAttention matters
28:01 How to refine datasets for context extension
33:34 Multi-stage training data and avoiding overfitting to recent data
34:27 The potential of using synthetic data in training
38:22 Applying LoRA adapters to extend model capabilities
42:25 Benchmarking long context models and evaluating their performance
47:20 Pushing to 4M context and output quality degradation
50:08 What do you need this context for?
52:57 Impact of long context in chat vs Docs Summarization
56:25 Future directions for long context models and multimodality
59:38 How do you know what research matters?
62:47 Routine for staying updated with AI research and industry news
65:33 Deciding which AI developments to invest time in
70:37 Request for collaboration and dataset construction for long context
00:00:00.880 |
Hey, everyone. Welcome to the Latent Space podcast. 00:00:03.760 |
This is Alessio, partner and CTO-in-Residence at Decibel Partners, 00:00:07.360 |
and I'm joined by my co-host, Swyx, founder of Smol.ai. 00:00:10.960 |
Hey, and today we're in the remote studio with Mark Huang from Gradient. 00:00:17.600 |
It's really, you know, a great experience to be able to talk with you all. 00:00:21.840 |
I know your podcast is really, really interesting, 00:00:24.720 |
and I always am listening to it every time you guys have a release. 00:00:35.040 |
So, Mark, you're unusual in the sense that you and I go back to college. 00:00:39.520 |
I don't exactly remember where we overlapped, 00:00:46.000 |
and went into the sort of quantitative developer realm. 00:01:01.680 |
And now we intersect again when it kind of feels like more or less the same, right? 00:01:07.520 |
Like the AI wars, the trading wars back in the day, too, 00:01:10.720 |
to a certain extent, and the grab for talent. 00:01:13.280 |
Yeah, I think there's definitely a few of us ex-finance people 00:01:23.600 |
You were at a bunch of sort of quant trading shops, 00:01:27.200 |
but then as you moved to tech, you were a lead data scientist at Box 00:01:32.800 |
and then at Splunk, before working on the startup that eventually became Gradient. 00:01:38.880 |
Yeah, I think part of the reason why I came over from the quant finance world 00:01:48.320 |
learn about what big data and scaling machine learning really looks like 00:01:58.800 |
And working at Box, I worked mostly in a cross-functional role, 00:02:08.240 |
And then at Splunk, it was a lot more of a specific role 00:02:13.120 |
where I was helping with streaming analytics and search and deep learning. 00:02:19.600 |
And for Gradient, really why we started it was 00:02:24.720 |
whether it was in finance or whether it was in tech, 00:02:27.440 |
I always noticed that there was a little bit more to give 00:02:31.040 |
in terms of what AI or ML could contribute to the business. 00:02:36.400 |
And we came at a really good time with respect to wanting to 00:02:40.720 |
bring the full value of what that could be into the enterprise. 00:02:47.120 |
And then obviously OpenAI created this huge vacuum 00:02:54.480 |
So I myself felt really, really empowered to actually ship a product 00:03:00.720 |
and ship stuff that I could think could really help people. 00:03:03.760 |
Maybe just to touch a little bit on Gradient, 00:03:06.720 |
I know we have a lot of things to go through about Gradient, 00:03:13.600 |
And you have an awesome design on your website. 00:03:17.520 |
And I think people that are watching Fallout on Amazon Prime right now 00:03:26.880 |
Because I know you have the foundry, you have the agent SDK, 00:03:32.160 |
And I appreciate the call out for the design. 00:03:35.840 |
I know my co-founder, Chris, spent a lot of thought 00:03:39.200 |
in terms of how he wanted the aesthetic to look. 00:03:44.560 |
So that was the initial emotional shape that I felt when I saw it. 00:03:50.640 |
Well, quite simply, Gradient, we're a full stack AI platform. 00:03:56.480 |
And what we really want to do is we want to enable 00:03:59.280 |
all of the RPA workloads or the codified automation workloads 00:04:08.000 |
We really want to enable people to transition 00:04:11.280 |
into more autonomous, agentic workflows that are less brittle, 00:04:32.320 |
a fairly horizontal platform for those purposes. 00:04:35.120 |
We had this discussion at our AI in Action Club on Discord, 00:04:42.160 |
Yeah, in your mind, what is the minimum thing 00:05:06.880 |
with respect to how the pipeline looks when it's executed. 00:05:17.680 |
you're going to have to see a marginal improvement 00:05:20.320 |
in the probability of success for that particular workload 00:05:25.920 |
So yeah, I think it is an overloaded term to a certain extent 00:05:36.960 |
But for us, it's like, you know, my background is statistics. 00:05:40.800 |
So I want to see like improvements in the probability 00:05:49.280 |
the one thing that makes this sort of generative AI era 00:05:54.000 |
very different from the sort of data science-y type era 00:06:04.320 |
I think what's the founding story of Gradient? 00:06:07.680 |
Like how, you know, of all the problems that you chose, 00:06:14.000 |
You know, how did you get together your co-founders, 00:06:16.800 |
anything like that, that bring us up to the present day? 00:06:21.520 |
and he's a really good friend of mine as well. 00:06:23.680 |
I don't know if you intersected with him at Penn as well, 00:06:26.000 |
but yeah, Chris Chang, he was at Penn as well, 00:06:34.080 |
and then, you know, was a software engineer at Meta, 00:06:42.000 |
he was like a director at Netflix and product. 00:06:44.720 |
And we always wanted to do something together, 00:06:48.480 |
but we felt the, you know, what really came to fruition 00:06:57.120 |
mostly because of our experience with internal tooling 00:07:06.880 |
basically exist through like a migration, right? 00:07:13.360 |
that I've ever had to experience or he had to experience, 00:07:18.240 |
and you have a new workflow or automation come in. 00:07:26.960 |
And we also teamed up with a former coworker of Chris's 00:07:59.920 |
So what we really wanted was to reduce that friction 00:08:05.440 |
for like actually shipping workloads in product value 00:08:12.080 |
when you have all these like types of operational frictions 00:08:16.640 |
that happen inside of these large enterprises. 00:08:20.720 |
And then really like the main pivot point for all of it 00:08:27.760 |
things that can handle out of domain problems. 00:08:36.160 |
and having something that you build over time 00:08:45.440 |
And I feel like a lot of systems back in the day, 00:08:48.080 |
they were learning a very specific objective function, 00:08:52.880 |
but they weren't really natively learning with the user. 00:09:06.000 |
was always for the system to grow alongside me, right? 00:09:21.520 |
people always trying to define a difference between ML and AI. 00:09:31.840 |
And that's all under the umbrella of learning, 00:09:42.560 |
that's something that you've been blowing up on, 00:10:01.680 |
towards like what got you interested in long context? 00:10:04.320 |
Why did you find it like an interesting investment 00:10:08.880 |
And then the story of how you did your first extensions. 00:10:27.040 |
8,000 context lengths just seemed like it was too short 00:10:32.720 |
and even Yi came out with like a 200,000 token model, 00:10:38.960 |
But the really the inception of all of it was 00:10:55.520 |
this basically pedagogical debate with everybody 00:10:58.640 |
who's like, "Hey, is it fine-tuning versus RAG? 00:11:06.640 |
Like all we want is like the best meta learning workflow 00:11:17.760 |
So naturally, long context had a place in that, 00:11:22.400 |
but nobody had really pushed the limits of it, right? 00:11:33.520 |
but it wasn't until Google comes out with Gemini 00:11:37.040 |
with the first 1 million context length model 00:12:09.920 |
as compression algorithms to a certain extent, 00:12:32.560 |
that was more of just like put the North Star up there 00:12:38.720 |
And then see what was happening along the way 00:20:17.680 |
And you're thinking about like the different, 00:20:25.360 |
to see different types of distributions of data. 00:20:52.560 |
and allow for different types of distributions 00:21:06.560 |
but it's like there's positional extrapolation, 00:21:36.160 |
when you see a million contexts of sequence tokens. 00:22:06.320 |
was that we established the formula at the start. 00:22:25.520 |
So it's not like a mathematical tautology or proof. 00:22:39.760 |
but you don't know if they're going to continue. 00:22:50.640 |
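To make the theta discussion above concrete: in RoPE, theta is the base of the rotation frequencies, and most context-extension recipes stretch it so far-away positions still land on angles the model saw during training. The sketch below is only an illustration of that idea, not Gradient's exact recipe; the 8K-to-1M proportional scaling heuristic is an assumption for demonstration.

```python
import torch

def rope_freqs(head_dim: int, theta: float) -> torch.Tensor:
    # One rotation frequency per pair of dimensions, as in standard RoPE.
    return 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))

def rope_angles(position: int, head_dim: int, theta: float) -> torch.Tensor:
    # Angles applied to a query/key vector at a given absolute position.
    return position * rope_freqs(head_dim, theta)

head_dim = 128
base_theta = 10_000.0  # base used by the original Llama models
# Assumption for illustration: scale theta roughly with the context-length
# ratio (8K -> 1M here), so distant positions rotate through angles similar
# to ones the model already saw during pre-training.
scaled_theta = base_theta * (1_048_576 / 8_192)

print(rope_angles(8_191, head_dim, base_theta)[:4])        # within original window
print(rope_angles(1_000_000, head_dim, base_theta)[:4])    # far outside the window
print(rope_angles(1_000_000, head_dim, scaled_theta)[:4])  # pulled back toward a familiar range
```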
YaRN is being talked about a lot by Nous Research. 00:23:00.640 |
We had a really good session with Strong Compute 00:23:07.440 |
I was just wondering if you want to compare and contrast 00:23:18.720 |
We haven't compared with that one specifically, 00:23:47.520 |
It was really easy and it was well understood by us. 00:23:50.960 |
The other one that I know that in the open source 00:24:10.480 |
it does start to break down a little bit more 00:24:20.960 |
specifically for like Needle in the Haystack. 00:24:38.640 |
"Hey, here's the thing that I actually cared about 00:24:41.600 |
And I have like a thousand different evaluations 00:24:57.760 |
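As a concrete illustration of the needle-in-a-haystack style check mentioned here, a minimal sketch follows; `query_model` is a hypothetical stand-in for whatever inference endpoint you use, and the filler text, word counts, and depths are arbitrary.

```python
def build_haystack(needle: str, total_words: int, depth: float) -> str:
    """Bury a 'needle' sentence at a given relative depth inside filler text."""
    filler = ("The quick brown fox jumps over the lazy dog. " * (total_words // 9)).split()
    insert_at = int(len(filler) * depth)
    return " ".join(filler[:insert_at] + [needle] + filler[insert_at:])

def needle_recall(query_model, needle: str, question: str, answer: str,
                  total_words: int = 500_000, depths=(0.1, 0.5, 0.9)) -> dict:
    """Pass/fail per insertion depth for one needle/question pair."""
    results = {}
    for depth in depths:
        prompt = build_haystack(needle, total_words, depth) + "\n\n" + question
        results[depth] = answer.lower() in query_model(prompt).lower()
    return results

# query_model is assumed to be any callable(str) -> str wrapping your model or server.
# Hypothetical usage:
# scores = needle_recall(query_model,
#                        needle="The secret ingredient in the recipe is cardamom.",
#                        question="What is the secret ingredient in the recipe?",
#                        answer="cardamom")
```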
with a really specific network topology on our GPUs 00:25:07.920 |
like Ring-Attention, a lot of people credit it 00:25:13.600 |
but actually it's just a better utilization of GPUs, right? 00:25:48.320 |
Like it was, I would say the original authors, 00:25:55.440 |
you know, Matei and all the folks at Berkeley, 00:26:14.960 |
like it just won't run out of the box very easily. 00:26:43.520 |
but like there was an active development on it 00:26:47.040 |
Like even lucidrains, I think he's interesting 00:26:53.120 |
and then just stopped, you know, doing commits. 00:26:58.960 |
like we never really want to jump in on a repo 00:27:01.120 |
where someone's like kind of actively committing 00:27:04.800 |
Otherwise we have to like eat that repo ourselves. 00:27:43.280 |
I would recommend that to be the easiest way. 00:27:48.160 |
Me personally, I don't really know it that well. 00:28:59.120 |
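To ground the ring attention discussion above: the sequence is split into blocks, each device keeps its query block, and key/value blocks rotate around a ring so every query eventually attends to every key. Below is a single-process simulation of that accumulation pattern, assuming a standard online-softmax merge; it skips causal masking and real device communication, and it is a sketch, not the authors' implementation.

```python
import torch

def ring_attention_sim(q, k, v, num_devices: int):
    """Simulate the ring-attention pattern: each 'device' pins one query block
    and receives the rotating KV blocks one step at a time."""
    seq, dim = q.shape
    qs = q.chunk(num_devices)
    kvs = list(zip(k.chunk(num_devices), v.chunk(num_devices)))
    outs = []
    for i, qi in enumerate(qs):
        # Running statistics for a numerically stable online softmax.
        m = torch.full((qi.shape[0], 1), float("-inf"))
        l = torch.zeros(qi.shape[0], 1)
        acc = torch.zeros(qi.shape[0], dim)
        for step in range(num_devices):
            kj, vj = kvs[(i + step) % num_devices]  # KV block arriving around the ring
            s = (qi @ kj.T) / dim ** 0.5
            m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
            scale = torch.exp(m - m_new)
            p = torch.exp(s - m_new)
            l = l * scale + p.sum(dim=-1, keepdim=True)
            acc = acc * scale + p @ vj
            m = m_new
        outs.append(acc / l)
    return torch.cat(outs)

q, k, v = torch.randn(32, 16), torch.randn(32, 16), torch.randn(32, 16)
ref = torch.softmax((q @ k.T) / 16 ** 0.5, dim=-1) @ v
# The rotated blockwise result matches vanilla (non-causal) attention.
assert torch.allclose(ring_attention_sim(q, k, v, num_devices=4), ref, atol=1e-5)
```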
was just basically like a pre-training layer. 00:29:23.120 |
second order derivative of the UltraChat dataset 00:29:29.760 |
and then reformatted it for our chat use case. 00:29:38.160 |
I think you always have to really keep in mind 00:29:59.200 |
So SlimPajama tends to be one of the best ones, 00:30:18.960 |
and then train on top of that to retain its abilities. 00:30:26.000 |
making sure that it's attending to all the information 00:30:31.280 |
that would be expected to really stretch its capabilities 00:30:34.240 |
'cause you could create like a long context dataset 00:30:40.160 |
the last 200 tokens can answer the entire question. 00:30:44.240 |
And that's never gonna make the model attend to anything. 00:30:47.440 |
So it's even something that we're doing right now 00:30:57.840 |
such that it can expose like even more nuanced capabilities 00:31:07.120 |
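A minimal sketch of the data-construction point above: scatter the supporting evidence well away from the end of the document, so the model cannot answer from only the last couple of hundred tokens. The function and field names here are hypothetical, not Gradient's pipeline.

```python
import random

def make_long_context_sample(facts, question, answer, filler_paragraphs, target_paragraphs):
    """Build a long-context training example where the supporting facts are
    spread through the document instead of sitting next to the question."""
    pool = filler_paragraphs * (target_paragraphs // len(filler_paragraphs) + 1)
    doc = random.sample(pool, target_paragraphs)
    for fact in facts:
        # Deliberately keep the evidence out of the tail of the document.
        doc.insert(random.randint(0, max(1, int(len(doc) * 0.8))), fact)
    return {"prompt": "\n\n".join(doc) + "\n\nQuestion: " + question,
            "completion": answer}
```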
Is there a ratio between diversity of the dataset 00:31:11.120 |
versus diversity compared to what the model already knows? 00:31:14.960 |
Like does the model already need to understand 00:31:34.080 |
might not be in the knowledge of the existing model 00:33:15.920 |
all its language capabilities, basically, right? 00:33:18.720 |
So it's not, I don't wanna call that project, 00:33:28.800 |
because these models are about like flexibility 00:33:47.200 |
is don't train 500 billion tokens on just code 00:38:58.080 |
Can you do LoRA patches with specific knowledge? 00:39:02.800 |
Yeah, I think there's a huge kind of resurgence 00:39:13.920 |
because you're like taking all of these LoRAs 00:39:18.160 |
And then that's a lot of the model merging stuff 00:39:24.400 |
and a lot of others in the open community, right? 00:39:46.240 |
as like stable diffusion to a certain extent. 00:49:13.920 |
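For readers following the LoRA-patch idea, here is a hedged sketch using Hugging Face's peft library; the base model and adapter identifiers are placeholders, not actual Gradient artifacts.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Placeholder identifiers; swap in your own base checkpoint and adapter.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Attach a task-specific LoRA, then fold its low-rank deltas back into the
# base weights so the result behaves like an ordinary dense checkpoint.
model = PeftModel.from_pretrained(base, "my-org/long-context-lora")  # hypothetical adapter id
model = model.merge_and_unload()
```

Stacking several adapters this way, or merging the resulting checkpoints, is the spirit of the LoRA-patch and model-merging discussion above.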
Do you see people care about above 1 million? 00:49:23.920 |
Like, do you think we need to get to 10 million 00:49:51.840 |
it's like just the next incremental checkpoint. 00:51:16.000 |
that people are more familiar with right now, 01:02:17.200 |
But yeah, if open collaboration interests you, 01:02:51.760 |
I actually think like it's a good aggregator. 01:04:40.640 |
And then sometimes I try to use certain tools, 01:05:10.080 |
Like they compressed all the research out there 01:05:15.520 |
into a product that they're trying to create for you. 01:05:21.200 |
like what it took to build something like that. 01:05:30.000 |
like you'll already be well ahead on the research. 01:05:34.480 |
you mentioned what's a good perplexity score? 01:05:40.960 |
Like do you have a number in mind when you said that? 01:05:45.200 |
- Yeah, I mean, what was the one that we had? 01:06:16.960 |
If the early steps in the perplexity go straight down. 01:06:23.760 |
And we just knew that we cut the training short 01:06:43.200 |
that positional embedding on top of each other. 01:07:15.360 |
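Since the answer hinges on what the perplexity curve looks like, here is a minimal sketch of how perplexity is typically computed for a causal language model with Hugging Face transformers; the checkpoint name is illustrative, not the one used in the episode.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    """Perplexity = exp(mean negative log-likelihood of next-token predictions)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

# Illustrative checkpoint; any causal LM works the same way.
# tok = AutoTokenizer.from_pretrained("gpt2")
# lm = AutoModelForCausalLM.from_pretrained("gpt2")
# print(perplexity(lm, tok, "long evaluation document goes here ..."))
```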
like I already know there are like three to five things 01:07:23.200 |
And then there's other stuff that's like out of scope 01:07:36.640 |
You know, like the stuff like different tech. 01:07:53.440 |
Like that's that we're not gonna have the opportunity 01:08:46.560 |
And then I know what's being kind of recycled 01:09:15.760 |
like mixture of experts into interesting ways. 01:10:14.800 |
that all sounds really, really unique and new. 01:11:34.160 |
Awesome, thank you so much for coming on, Mark.