
How to train a Million Context LLM — with Mark Huang of Gradient.ai


Chapters

0:00 Introductions
1:30 Founding story of Gradient and its mission
4:35 Minimum viable agents
9:19 Differentiating ML and AI, focusing on out-of-domain generalization
10:12 Extending Llama 3 to 1M tokens
14:32 Technical challenges with long context sequences
17:45 Data quality and the importance of diverse datasets
19:45 What's a theta value?
22:42 RoPE vs Ring Attention vs ALiBi vs YaRN
25:06 Why Ring Attention matters
28:01 How to refine datasets for context extension
33:34 Multi-stage training data and avoiding overfitting to recent data
34:27 The potential of using synthetic data in training
38:22 Applying LoRA adapters to extend model capabilities
42:25 Benchmarking long context models and evaluating their performance
47:20 Pushing to 4M context and output quality degradation
50:08 What do you need this context for?
52:57 Impact of long context in chat vs Docs Summarization
56:25 Future directions for long context models and multimodality
59:38 How do you know what research matters?
62:47 Routine for staying updated with AI research and industry news
65:33 Deciding which AI developments to invest time in
70:37 Request for collaboration and dataset construction for long context

Whisper Transcript

00:00:00.880 | Hey, everyone. Welcome to the Latent Space podcast.
00:00:03.760 | This is Alessio, partner and CTO in Residence at Decibel Partners,
00:00:07.360 | and I'm joined by my co-host, Swyx, founder of Smol.ai.
00:00:10.960 | Hey, and today we're in the remote studio with Mark Huang from Gradient.
00:00:14.240 | Welcome, Mark.
00:00:15.360 | Hey, glad to be here.
00:00:17.600 | It's really, you know, a great experience to be able to talk with you all.
00:00:21.840 | I know your podcast is really, really interesting,
00:00:24.720 | and I always am listening to it every time you guys have a release.
00:00:29.200 | He's not a paid actor.
00:00:30.880 | He said that out of his own will.
00:00:32.240 | We'll give you the check later.
00:00:35.040 | So, Mark, you're unusual in the sense that you and I go back to college.
00:00:39.520 | I don't exactly remember where we overlapped,
00:00:42.640 | but, you know, we both went to Wharton
00:00:46.000 | and went into the sort of quantitative developer realm.
00:00:49.760 | Yeah, exactly.
00:00:50.560 | Kind of crazy, right?
00:00:53.120 | All goes full circle.
00:00:54.480 | I was a quant for quite a few years
00:00:57.680 | and then made it out into Silicon Valley.
00:01:01.680 | And now we intersect again when it kind of feels like more or less the same, right?
00:01:07.520 | Like the AI wars, the trading wars back in the day, too,
00:01:10.720 | to a certain extent, and the grab for talent.
00:01:13.280 | Yeah, I think there's definitely a few of us ex-finance people
00:01:17.200 | moving into tech and then finding ourselves
00:01:19.440 | gravitating towards data and AI.
00:01:22.720 | It seems like you did that.
00:01:23.600 | You were at a bunch of sort of quant trading shops,
00:01:27.200 | but then as you moved to tech, you were a lead data scientist at Box
00:01:30.960 | and staff ML scientist at Splunk.
00:01:32.800 | And then before working on the startup that eventually became Gradient.
00:01:37.920 | You want to tell that story?
00:01:38.880 | Yeah, I think part of the reason why I came over from the quant finance world
00:01:46.240 | is to get more collaboration,
00:01:48.320 | learn about what big data and scaling machine learning really looks like
00:01:56.080 | when you're not in this bubble, right?
00:01:58.800 | And working at Box, I worked mostly in a cross-functional role,
00:02:04.800 | helping product analytics and go to market.
00:02:08.240 | And then at Splunk, it was a lot more of a specific role
00:02:13.120 | where I was helping with streaming analytics and search and deep learning.
00:02:19.600 | And for Gradient, really why we started it was
00:02:24.720 | whether it was in finance or whether it was in tech,
00:02:27.440 | I always noticed that there was a little bit more to give
00:02:31.040 | in terms of what AI or ML could contribute to the business.
00:02:36.400 | And we came at a really good time with respect to wanting to
00:02:40.720 | bring the full value of what that could be into the enterprise.
00:02:47.120 | And then obviously OpenAI created this huge vacuum
00:02:51.680 | into the industry to allow for that, right?
00:02:54.480 | So I myself felt really, really empowered to actually ship a product
00:03:00.720 | and ship stuff that I could think could really help people.
00:03:03.760 | Maybe just to touch a little bit on Gradient,
00:03:06.720 | I know we have a lot of things to go through Gradient,
00:03:09.280 | Llama 3, context extension, there's a lot,
00:03:12.320 | but what exactly is Gradient?
00:03:13.600 | And you have an awesome design on your website.
00:03:16.400 | It's really retro.
00:03:17.520 | And I think people that are watching Fallout on Amazon Prime right now
00:03:21.440 | can maybe feel nostalgia just looking at it.
00:03:24.800 | What exactly is it?
00:03:26.880 | Because I know you have the foundry, you have the agent SDK,
00:03:29.520 | there's a lot of pieces into it.
00:03:31.440 | Yeah, for sure.
00:03:32.160 | And I appreciate the call out for the design.
00:03:35.840 | I know my co-founder, Chris, spent a lot of thought
00:03:39.200 | in terms of how he wanted the aesthetic to look like.
00:03:41.600 | And it reminds me a lot about Mad Men.
00:03:44.560 | So that was the initial emotional shape that I felt when I saw it.
00:03:50.640 | Well, quite simply, Gradient, we're a full stack AI platform.
00:03:56.480 | And what we really want to do is we want to enable
00:03:59.280 | all of the RPA workloads or the codified automation workloads
00:04:06.000 | that existed in enterprise before.
00:04:08.000 | We really want to enable people to transition
00:04:11.280 | into more autonomous, agentic workflows that are less brittle,
00:04:16.320 | feel more seamless as an interface too.
00:04:20.480 | So and able to empower what we really think
00:04:24.160 | the new AI workforce should look like.
00:04:29.040 | And that kind of required us to build
00:04:32.320 | a fairly horizontal platform for those purposes.
00:04:35.120 | We had this discussion at our AI in Action Club on Discord,
00:04:38.480 | like the minimum viable agent,
00:04:40.400 | or like kind of how you define an agent.
00:04:42.160 | Yeah, in your mind, what is the minimum thing
00:04:46.800 | that you can call actually an agent
00:04:48.400 | and not just like a for loop, you know?
00:04:50.880 | And how do you see the evolution over time,
00:04:53.840 | especially as people adopt it more and more?
00:04:55.760 | Yeah, so I kind of stage it where everybody,
00:05:03.200 | first of all, at the lowest level,
00:05:04.960 | thinks about like non-determinism
00:05:06.880 | with respect to how the pipeline looks like when it's executed.
00:05:10.400 | But even beyond that,
00:05:11.520 | this goes back into effectively evaluations.
00:05:15.360 | It's like on each stage of the node,
00:05:17.680 | you're going to have to see a marginal improvement
00:05:20.320 | in the probability of success for that particular workload
00:05:23.520 | because of non-determinism.
00:05:25.920 | So yeah, I think it is an overloaded term to a certain extent
00:05:30.800 | because like everything is an agent
00:05:32.560 | if it calls a language model
00:05:34.720 | or any sort of multimodal model these days.
00:05:36.960 | But for us, it's like, you know, my background is statistics.
00:05:40.800 | So I want to see like improvements in the probability
00:05:43.600 | of the success event or outcome happening
00:05:46.720 | because of more nodes.
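To make the compounding point concrete, here is a minimal sketch of how per-node success probabilities multiply across an agent pipeline; the per-node reliabilities below are made up.

```python
# Toy illustration of how non-determinism compounds across an agent pipeline:
# the end-to-end success probability is the product of per-node reliabilities,
# so every added node has to earn its keep. Numbers below are made up.
per_node_success = [0.95, 0.90, 0.92, 0.97]

pipeline_success = 1.0
for p in per_node_success:
    pipeline_success *= p

print(f"end-to-end success: {pipeline_success:.1%}")  # about 76% despite decent nodes
```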
00:05:48.320 | Yeah, I think, you know,
00:05:49.280 | the one thing that makes this sort of generative AI era
00:05:54.000 | very different from the sort of data science-y type era
00:05:56.880 | is that it is very non-deterministic
00:05:59.200 | and it's hard to control.
00:06:01.040 | Yeah, I mean, so like, you know,
00:06:04.320 | I think what's the founding story of Gradient?
00:06:07.680 | Like how, you know, of all the problems that you chose,
00:06:11.200 | like why choose this one?
00:06:14.000 | You know, how did you get together your co-founders,
00:06:16.800 | anything like that, that bring us up to the present day?
00:06:20.240 | One of my co-founders is Chris
00:06:21.520 | and he's a really good friend of mine as well.
00:06:23.680 | I don't know if you intersected with him at Penn as well,
00:06:26.000 | but yeah, Chris Chang, he was at Penn as well,
00:06:30.480 | did banking for maybe one or two years
00:06:34.080 | and then, you know, was a software engineer at Meta,
00:06:38.160 | also was at Google.
00:06:40.240 | And then most recently,
00:06:42.000 | he was like a director at Netflix in product.
00:06:44.720 | And we always wanted to do something together,
00:06:48.480 | but we felt the, you know, what really came to fruition
00:06:51.920 | was wanting to develop something
00:06:54.080 | that is enterprise-facing for once,
00:06:57.120 | mostly because of our experience with internal tooling
00:07:02.000 | and inability for something to like,
00:07:06.880 | basically exist through like a migration, right?
00:07:10.240 | Like all the time with every ML platform
00:07:13.360 | that I've ever had to experience or he had to experience,
00:07:16.080 | it's like a rebuild and you rip it out
00:07:18.240 | and you have a new workflow or automation come in.
00:07:20.880 | And it's this huge multi-quarter,
00:07:23.440 | maybe even multi-year project to do that.
00:07:26.960 | And we also teamed up with a former coworker of Chris's
00:07:32.400 | from Opendoor, Forest,
00:07:33.680 | who was also on Google Cloud Platform.
00:07:37.600 | And, you know, him seeing the scale
00:07:40.320 | and actually the state of the art
00:07:44.720 | in terms of Google was using AI for systems
00:07:48.400 | before everybody else too, right?
00:07:50.000 | They invented a transformer
00:07:51.600 | and their internal set of tooling
00:07:54.480 | was just so far superior to everything else.
00:07:56.640 | Like it's really hard for people to go back
00:07:58.720 | after seeing that.
00:07:59.920 | So what we really wanted was to reduce that friction
00:08:05.440 | for like actually shipping workloads in product value
00:08:12.080 | when you have all these like types of operational frictions
00:08:16.640 | that happen inside of these large enterprises.
00:08:20.720 | And then really like the main pivot point for all of it
00:08:26.320 | was like you said,
00:08:27.760 | things that can handle out of domain problems.
00:08:30.960 | So like out of domain data that comes in,
00:08:32.960 | having the flexibility to not fall over
00:08:36.160 | and having something that you build over time
00:08:41.280 | that continues to improve.
00:08:42.960 | Like machine learning is about learning.
00:08:45.440 | And I feel like a lot of systems back in the place,
00:08:48.080 | they were learning a very specific objective function,
00:08:52.880 | but they weren't really natively learning with the user.
00:08:56.880 | So like that's the whole,
00:08:58.560 | we use the term assistant all the time,
00:09:01.840 | but my vision for the assistant
00:09:06.000 | was always for the system to grow alongside me, right?
00:09:10.000 | Like almost like an embodied second limb
00:09:15.360 | or something that will be able to get better
00:09:17.040 | as you also learn yourself.
00:09:19.440 | Yeah, I might maybe call it,
00:09:21.520 | people always trying to define a difference between ML and AI.
00:09:26.560 | And I think in AI,
00:09:28.640 | we definitely care a lot more about
00:09:30.320 | out of domain generalization.
00:09:31.840 | And that's all under the umbrella of learning,
00:09:35.120 | but it is a very specific kind of learning.
00:09:36.880 | I'm going to try to make a segue
00:09:39.440 | into today's like main topic of conversation
00:09:42.560 | that's something that you've been blowing up on,
00:09:44.640 | which is the long context learning, right?
00:09:47.680 | Which is also some form of out of topic,
00:09:50.400 | out of distribution generalization.
00:09:52.720 | And in this context,
00:09:54.080 | you're extending the context window
00:09:56.400 | of an existing open source model.
00:09:58.000 | Maybe if you want to like,
00:10:00.400 | just bring us all the way back to it,
00:10:01.680 | towards like what got you interested in long context?
00:10:04.320 | Why did you find it like an interesting investment
00:10:08.320 | to work on?
00:10:08.880 | And then the story of how you did your first extensions.
00:10:12.000 | Yeah, I think it came,
00:10:15.840 | for Llama 3 specifically,
00:10:18.000 | we chose that model
00:10:20.080 | because of the main criticisms about it.
00:10:24.800 | Before when it first got released,
00:10:27.040 | 8,000 context lengths just seemed like it was too short
00:10:30.720 | because it seemed like Mistral
00:10:32.720 | and even Yi came out with like a 200,000 token
00:10:37.360 | context length model.
00:10:38.960 | But the really the inception of all of it was
00:10:45.040 | us like fine tuning so many models
00:10:48.640 | and working on RAG so much
00:10:51.200 | and having this,
00:10:53.200 | and it still exists today,
00:10:55.520 | this basically pedagogical debate with everybody
00:10:58.640 | who's like, "Hey, is it fine-tuning versus RAG?
00:11:00.640 | Is it this versus that?"
00:11:01.840 | And like, at the end of the day,
00:11:04.160 | it's just all meta learning, right?
00:11:06.640 | Like all we want is like the best meta learning workflow
00:11:09.840 | or meta learning setup possible
00:11:11.920 | to be able to adapt a model to do anything.
00:11:17.760 | So naturally, long context had a place in that,
00:11:22.400 | but nobody had really pushed the limits of it, right?
00:11:26.160 | Like you would see like 10 shot,
00:11:27.920 | maybe 100 shot prompting
00:11:29.520 | for improving the model's capabilities,
00:11:33.520 | but it wasn't until Google comes out with Gemini
00:11:37.040 | with the first 1 million context length model
00:11:39.520 | that a lot of people's jaws dropped
00:11:42.720 | and that hunger for understanding
00:11:46.400 | what that could really facilitate
00:11:48.240 | in the new workflows came about.
00:11:50.800 | So we were staged to actually train
00:11:53.360 | other open source models to do that.
00:11:56.800 | But the moment Llama3 came out,
00:11:58.880 | we just went ham against that specific model
00:12:02.480 | because the two things
00:12:04.160 | that were particularly appealing for that
00:12:07.200 | was the fact that like,
00:12:08.480 | I see a lot of these language models
00:12:09.920 | as compression algorithms to a certain extent,
00:12:12.080 | like the way we have like 15 trillion tokens
00:12:14.960 | into a specific model.
00:12:16.240 | That definitely made me feel like
00:12:19.360 | it would have a lot of capabilities
00:12:23.520 | and be more adaptable
00:12:26.400 | towards extending that context length.
00:12:28.160 | So we went in there
00:12:29.440 | and the 1 million number was always,
00:12:32.560 | that was more of just like put the North Star up there
00:12:36.720 | and see if we can get there.
00:12:38.720 | And then see what was happening along the way
00:12:41.840 | as we did that.
00:12:42.880 | So yeah, also shout out to Crusoe
00:12:46.720 | who facilitated all that compute
00:12:48.560 | because I would be lying
00:12:49.920 | if I was to say like,
00:12:51.200 | anyone could just go out and do it.
00:12:52.720 | It does require quite a bit of compute.
00:12:55.680 | It requires like a lot of preparation,
00:12:58.240 | but it just like all the stars
00:13:01.120 | kind of aligned for that moment
00:13:02.720 | for us to go after that problem.
00:13:05.120 | I'll take a side note on Crusoe
00:13:07.440 | since you just brought it up.
00:13:08.400 | Yeah, like, can you explain what Crusoe is?
00:13:11.440 | You know, I have this mental image
00:13:13.520 | of putting GPUs on top of oil rigs.
00:13:15.840 | What is it?
00:13:19.440 | What do they do?
00:13:20.400 | How do you work with them?
00:13:21.360 | You know, just anything nice.
00:13:23.680 | I'm sure they appreciate nice things
00:13:24.640 | that you say about them too.
00:13:25.600 | Oh, for sure, for sure.
00:13:27.040 | So they came to us
00:13:30.640 | through a collaborative effort
00:13:32.640 | where we basically were in search
00:13:34.400 | of a cloud, you know, a GPU provider.
00:13:38.800 | I don't want to call cloud service provider
00:13:40.720 | quite yet because then, you know,
00:13:42.080 | you think about hyperscalers,
00:13:43.200 | but for them, you know,
00:13:44.880 | they're one of the biggest
00:13:45.680 | alternative GPU cloud providers.
00:13:48.960 | And they were offering up
00:13:52.160 | like we want to do a collaboration
00:13:54.000 | to showcase their technology.
00:13:56.560 | And it just made it really easy
00:13:59.200 | for us to like scale up with their L40Ss.
00:14:02.160 | And those were the specific
00:14:04.000 | GPU instances we used.
00:14:05.360 | And coordinating that effort with them
00:14:08.800 | to get, you know,
00:14:10.160 | that dedicated cluster first
00:14:12.320 | to do the project.
00:14:13.760 | It became a really good relationship.
00:14:17.760 | And we still work with them today
00:14:19.200 | because like we're trying to evaluate
00:14:21.360 | more of these models
00:14:22.320 | and possibly train more of them.
00:14:24.160 | And anyone could go up to them
00:14:26.400 | and basically get your compute from them.
00:14:30.080 | And they have a lot of GPUs
00:14:32.320 | available for those type of projects.
00:14:34.560 | I would love to maybe have you run
00:14:37.120 | people through why the models
00:14:38.880 | don't come with longer context
00:14:40.560 | sequences out of the box.
00:14:41.840 | Like, obviously, you know,
00:14:44.400 | the TLDR is like self-attention.
00:14:46.320 | It's like quadratic scaling of memory.
00:14:48.080 | So the longer the context size,
00:14:50.000 | the more compute you have to spend
00:14:51.440 | the training time.
00:14:52.080 | at training time.
00:14:53.680 | Crusoe to help you.
00:14:54.960 | How do you actually train
00:14:58.320 | a large language model
00:14:59.520 | that is like a very long context?
00:15:00.720 | And then how does that differ
00:15:02.000 | from just tacking it on on top later?
00:15:05.040 | And then maybe we'll dive into performance
00:15:06.960 | and some of those things.
00:15:07.760 | But I think for a lot of folks
00:15:10.160 | in our audience that are more
00:15:11.680 | engineers, they use models,
00:15:12.960 | but don't necessarily build
00:15:14.480 | the models themselves.
00:15:15.920 | A lot of time, it's hard to understand
00:15:17.360 | what goes into actually making
00:15:19.040 | a long context model.
00:15:20.320 | Yeah, in terms of, you know,
00:15:22.880 | all the literature out there,
00:15:23.920 | I would say, honestly,
00:15:26.480 | it's probably still TBD
00:15:28.240 | as to like the tradeoffs
00:15:29.920 | between the approach we did,
00:15:31.920 | which is more of a curriculum
00:15:33.600 | learning approach after the fact
00:15:36.240 | versus inherently training
00:15:38.560 | a model with a long context throughout,
00:15:40.640 | because I just don't think people
00:15:42.400 | have looked at the scaling properties
00:15:44.080 | of it in deep, deep detail.
00:15:46.080 | But there are stylized facts
00:15:49.120 | out there in research papers
00:15:52.240 | from Meta themselves, actually.
00:15:53.680 | It was already shown in a paper
00:15:56.960 | that if you train a model
00:15:59.680 | on a shorter context
00:16:01.360 | and you progressively
00:16:02.320 | increase that context to like,
00:16:05.040 | you know, the final limit
00:16:06.960 | that you have, like 32K
00:16:08.800 | which is usually what the limit of Llama 2
00:16:11.280 | was.
00:16:12.240 | It actually performs better
00:16:16.240 | than if you try to train
00:16:17.920 | 32K the whole time.
00:16:19.840 | And I like to think about it
00:16:23.680 | intuitively as if you're trying
00:16:26.080 | to learn probability theory.
00:16:27.760 | You're not going to go
00:16:29.280 | and read the book cover to cover
00:16:30.560 | and then do all the exercises afterwards.
00:16:33.360 | What you're going to do
00:16:34.160 | is you're going to do each chapter,
00:16:36.000 | do an exercise,
00:16:36.960 | read the chapter, do an exercise,
00:16:39.600 | and then finish right
00:16:40.720 | with the final set of like holistic
00:16:43.200 | exercises or examination.
00:16:46.800 | So attention is exactly
00:16:49.680 | what it sounds like to a certain extent.
00:16:51.360 | Like you have a bunch of indices
00:16:54.480 | and you are making the model
00:16:56.000 | attend to localized contexts
00:16:58.080 | and concepts across
00:17:00.720 | the entirety of its encoding, right?
00:17:05.120 | Like whatever the text
00:17:06.240 | that the sequence
00:17:07.040 | that you're giving it.
00:17:08.240 | So when you're doing
00:17:10.880 | the curriculum learning
00:17:11.840 | aspect of things,
00:17:13.040 | you are kind of trying to
00:17:14.640 | give it the opportunity
00:17:17.520 | to also attend to all the concepts.
00:17:20.080 | So data actually in the curation,
00:17:22.400 | in the creation of that context,
00:17:24.160 | plays a huge role
00:17:25.600 | because a lot of times
00:17:28.080 | people make the mistake
00:17:28.960 | of trying to extend the context length
00:17:30.240 | by just giving it raw text
00:17:33.200 | that doesn't have
00:17:35.760 | the necessity for the model
00:17:39.040 | to go all the way
00:17:40.400 | in the beginning of the sequence
00:17:42.080 | and then connect an idea
00:17:43.520 | to the end of the sequence.
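To make the curriculum idea concrete, here is a minimal sketch of a staged context-extension schedule; the stage lengths, theta values, and step counts are illustrative placeholders, not Gradient's actual recipe.

```python
# Sketch of a progressive ("curriculum") context-extension schedule: train a
# few stages of increasing packed sequence length instead of jumping straight
# to the final length. All numbers here are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class Stage:
    context_len: int    # packed sequence length for this stage
    rope_theta: float   # RoPE base used while training at this length
    num_steps: int      # optimizer steps to run at this stage

SCHEDULE = [
    Stage(context_len=65_536,    rope_theta=1.6e7, num_steps=30),
    Stage(context_len=262_144,   rope_theta=2.1e8, num_steps=25),
    Stage(context_len=1_048_576, rope_theta=3.6e9, num_steps=20),
]

def train_stage(stage: Stage) -> None:
    """Placeholder for one stage of continued pre-training: re-pack the corpus
    to stage.context_len, set the model's RoPE base to stage.rope_theta, and
    run stage.num_steps of full fine-tuning (e.g. with ring + flash attention)."""
    print(f"train {stage.num_steps} steps at {stage.context_len:,} tokens "
          f"(theta={stage.rope_theta:.1e})")

for stage in SCHEDULE:
    train_stage(stage)
```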
00:17:45.120 | So data quality is one thing,
00:17:47.760 | but it sounds like
00:17:49.200 | as long as the base model
00:17:51.440 | is at least what is the work,
00:17:53.760 | like the one million contexts
00:17:54.880 | if Lama3 was 2K context size,
00:17:58.080 | is there like a minimum context size
00:17:59.600 | that you need to then
00:18:00.320 | be able to generalize?
00:18:01.600 | Or does it not really matter
00:18:03.840 | and the fine-tuning takes care of it?
00:18:05.680 | There's no minimum, I would say,
00:18:07.600 | or at least I can't make
00:18:11.040 | such a strong statement
00:18:12.000 | as to say that that does not exist.
00:18:13.680 | But if you have a 4K,
00:18:15.840 | any regular model out there,
00:18:17.840 | like if you can progressively increase
00:18:20.240 | the context length of it,
00:18:22.480 | so long as it has shown
00:18:24.800 | really good perplexity scores
00:18:27.040 | prior to your context length extension.
00:18:30.400 | So if it hasn't shown good perplexity,
00:18:33.520 | you basically can't even
00:18:35.040 | predict the next token,
00:18:36.000 | you're kind of like out of luck, right?
00:18:37.840 | But then from there,
00:18:40.480 | the other component
00:18:42.640 | that we actually just released
00:18:44.160 | the blog on maybe last Friday,
00:18:46.400 | it's like you got to pay attention
00:18:47.760 | to the theta value
00:18:50.880 | that the model starts off with.
00:18:52.800 | What was fairly unique
00:18:54.800 | about the Llama 3 model
00:18:56.080 | was their choice of the theta parameter,
00:18:59.840 | which gave some suspicion
00:19:02.560 | as to how long the context
00:19:04.720 | could be extended for the model.
00:19:07.040 | So that aspect of like,
00:19:10.640 | we can go into a huge lesson
00:19:15.120 | in terms of positional encodings
00:19:16.800 | and in rope scaling and stuff.
00:19:19.120 | But those concepts
00:19:22.160 | and that aspect of things
00:19:25.280 | enables you to scale out
00:19:27.760 | the length much more easily.
00:19:29.680 | - What's the TLDR
00:19:32.080 | of what the theta is for a model?
00:19:34.720 | If I haven't built a model before...
00:19:36.320 | - Yeah, yeah.
00:19:37.040 | - Not me, obviously I know what it is,
00:19:40.000 | but for people that don't know, right?
00:19:42.000 | I'm totally an expert.
00:19:44.000 | - Yeah, well,
00:19:46.960 | so not all models have it,
00:19:49.200 | but some models will employ
00:19:51.200 | rope scaling
00:19:55.520 | and specifically Llama 3 does that.
00:19:58.480 | But there's also other positional encoding
00:20:01.600 | and embedding mechanisms
00:20:02.960 | that other models employ.
00:20:04.560 | But TLDR is,
00:20:06.880 | if you think about most architectures,
00:20:10.640 | they employ basically like a,
00:20:14.480 | it's kind of like a sine or cosine
00:20:16.800 | curve.
00:20:17.680 | And you're thinking about like the different,
00:20:20.240 | you have the amplitudes that occur there
00:20:23.120 | to allow for the model
00:20:25.360 | to see different types of distributions of data.
00:20:28.880 | Really what the theta value does,
00:20:32.160 | it governs how often a pattern's
00:20:36.160 | going to appear in the embedding space.
00:20:39.120 | So you basically are able to
00:20:44.240 | shift that rotational curve
00:20:49.600 | by increasing the theta value
00:20:52.560 | and allow for different types of distributions
00:20:57.840 | to be seen as if they actually occurred
00:21:01.280 | in the training data before.
00:21:04.000 | So it's super confusing,
00:21:06.560 | but it's like there's positional extrapolation,
00:21:10.560 | and then there's interpolation.
00:21:11.680 | You want interpolation.
00:21:13.760 | It's been shown that just pure extrapolation
00:21:16.640 | makes the model a lot worse,
00:21:18.080 | and it's harder to attend to stuff.
00:21:20.080 | Whereas the interpolation is like
00:21:21.600 | you're squeezing everything back in
00:21:23.600 | to what the original context length was
00:21:26.080 | to a certain extent,
00:21:27.600 | and then allowing for it to overlap
00:21:31.120 | different sequences that it's already seen
00:21:34.320 | as if it actually occurred
00:21:36.160 | when you see a million contexts of sequence tokens.
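As a rough illustration of what the theta value controls, the sketch below computes the standard RoPE per-band rotation wavelengths for two bases (Llama 3's published base of 500,000 and a larger, made-up one); raising theta stretches the slower bands so a much longer sequence stays closer to rotation ranges the model originally saw, which is the interpolation intuition above.

```python
# Rough sketch of what the RoPE base ("theta") controls. The head dimension and
# the larger base value are illustrative; 500,000 is Llama 3's published base.
import numpy as np

def rope_wavelengths(theta: float, head_dim: int = 128) -> np.ndarray:
    """Per-band rotation wavelengths in tokens: 2*pi / theta^(-2i/d)."""
    i = np.arange(0, head_dim, 2)
    inv_freq = theta ** (-i / head_dim)
    return 2 * np.pi / inv_freq

original = rope_wavelengths(theta=500_000.0)  # base Llama 3
extended = rope_wavelengths(theta=5.0e7)      # made-up larger theta

# Raising theta stretches the slower bands' wavelengths, so positions far
# beyond the original 8K training length map onto rotation angles closer to
# ones the model has already seen, instead of extrapolating to unseen angles.
print(f"slowest band wavelength: {original[-1]:,.0f} -> {extended[-1]:,.0f} tokens")
```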
00:21:41.120 | So yeah, I think that aspect,
00:21:45.920 | we didn't know how well it would scale.
00:21:49.600 | I think that's one thing.
00:21:50.560 | So I'm not going to lie and tell you
00:21:53.520 | right off the bat,
00:21:54.080 | we're definitely going to hit a million.
00:21:56.080 | It was more like we're getting to 256,
00:21:59.200 | and it looked good.
00:22:00.400 | We did our evals.
00:22:01.920 | We scaled it more.
00:22:02.880 | And then what was really good
00:22:06.320 | was that we established the formula at the start.
00:22:10.480 | So it's actually a formula
00:22:11.760 | that we actually took from the paper.
00:22:16.000 | I think it's the rope scaling paper.
00:22:20.000 | And we looked at that particular formula,
00:22:22.160 | and then we backed out the values.
00:22:24.160 | And it's all empirical.
00:22:25.520 | So it's not like a mathematical tautology or proof.
00:22:29.680 | It's just like it's an empirical formula
00:22:31.760 | that actually worked really well.
00:22:33.120 | And then we just kept scaling it up,
00:22:34.400 | and it held.
00:22:35.760 | It's kind of like the scaling laws.
00:22:36.960 | You know the scaling laws exist,
00:22:39.760 | but you don't know if they're going to continue.
00:22:41.680 | So yeah.
00:22:42.480 | Are you able to compare it
00:22:44.480 | with other forms of scaling
00:22:47.040 | that people have been talking about?
00:22:48.560 | ALiBi comes to mind.
00:22:50.640 | YaRN is being talked about a lot by Nous Research.
00:22:54.160 | And then there's other forms
00:22:56.720 | which are not exactly directly related,
00:22:58.960 | but ring attention comes up a lot.
00:23:00.640 | We had a really good session with Strong Compute
00:23:03.360 | in the Latent Space Discord
00:23:05.440 | talking about all these approaches.
00:23:07.440 | I was just wondering if you want to compare and contrast
00:23:09.440 | RoPE versus the other stuff.
00:23:10.640 | Yeah, I think...
00:23:11.140 | I can never pronounce it right, but ALiBi.
00:23:16.880 | Yeah, ALiBi.
00:23:18.720 | We haven't compared with that one specifically,
00:23:22.800 | mostly because I've noticed
00:23:24.720 | some of the newer architectures
00:23:27.120 | don't actually employ it a lot.
00:23:28.720 | I think the last architecture
00:23:29.760 | that actually really employed it
00:23:31.040 | was the Mosaic MPT model class.
00:23:33.600 | And then almost all the models these days
00:23:35.600 | are all RoPE scaling.
00:23:38.000 | And then effectively,
00:23:39.200 | you can use Yarn with that as well.
00:23:41.120 | We just did the Theta scaling specifically
00:23:44.800 | because of its empirical elegance.
00:23:47.520 | It was really easy and it was well understood by us.
00:23:50.960 | The other one that I know that in the open source
00:23:54.480 | that people are applying,
00:23:56.240 | which uses more of a LoRa-based approach,
00:23:58.880 | which is really interesting too,
00:24:00.800 | is the one that Wing has been employing,
00:24:03.440 | which is PoSE.
00:24:04.160 | We've sort of helped them evaluate
00:24:06.640 | some of the models.
00:24:07.440 | With respect to the performance of it,
00:24:10.480 | it does start to break down a little bit more
00:24:13.600 | on the longer and longer context.
00:24:14.960 | So like 500,000 to a million,
00:24:17.280 | it appeared that it doesn't hold as well
00:24:20.960 | specifically for like Needle in the Haystack.
00:24:23.440 | But it's still TBD as...
00:24:26.400 | Evaluations, I call it just like a high...
00:24:30.480 | It's a sparse high dimensional space
00:24:32.240 | where you're just evaluating performance
00:24:34.960 | across so many different things
00:24:37.120 | and then trying to map it back to like,
00:24:38.640 | "Hey, here's the thing that I actually cared about
00:24:41.040 | from the start."
00:24:41.600 | And I have like a thousand different evaluations
00:24:43.520 | and they tell me something,
00:24:45.040 | but not the entire picture, right?
00:24:46.880 | And as for like Ring-Attention specifically,
00:24:50.960 | like we employed Ring-Attention
00:24:52.800 | in order to do the training.
00:24:54.160 | So we combined Flash-Attention
00:24:56.400 | and Ring-Attention together
00:24:57.760 | with a really specific network topology on our GPUs
00:25:02.960 | to be able to maximize the memory bandwidth.
00:25:05.840 | Yeah, as far as I understand,
00:25:07.920 | like Ring-Attention, a lot of people credit it
00:25:09.920 | for Gemini's million token context,
00:25:13.600 | but actually it's just a better utilization of GPUs, right?
00:25:16.240 | Like that's really what it is.
00:25:18.880 | You mentioned in our show notes,
00:25:20.960 | Zhang Peiyuan's Easy Context Repo.
00:25:23.920 | I have seen that come up quite a bit.
00:25:25.280 | What does that do?
00:25:26.880 | Like how important is it
00:25:28.800 | as a Ring-Attention implementation?
00:25:31.440 | I know there's like maybe another one
00:25:33.600 | that was done by lucidrains
00:25:35.520 | or one of the other open source people.
00:25:37.280 | But like what is Easy Context?
00:25:39.920 | Like is that the place to go?
00:25:41.920 | Like did you evaluate a bunch of things
00:25:43.360 | to implement Ring-Attention?
00:25:45.520 | Yeah, we evaluated all of them.
00:25:48.320 | Like it was, I would say the original authors,
00:25:55.440 | you know, Matai and all the folks at Berkeley,
00:25:59.600 | they created the JAX implementation for it.
00:26:02.240 | And unfortunately, not to discredit,
00:26:06.400 | like, you know, TPUs or whatever,
00:26:08.240 | like the JAX implementation
00:26:09.440 | just does not work on GPUs very well.
00:26:12.480 | Like any naive setup that you do,
00:26:14.960 | like it just won't run out of the box very easily.
00:26:17.600 | And then unfortunately,
00:26:20.080 | that was probably the most mature repo
00:26:22.640 | with a lot more configurations
00:26:24.960 | to set up interesting network topologies
00:26:27.680 | for your cluster.
00:26:28.960 | And then the other PyTorch implementations
00:26:33.600 | outside of Easy Context,
00:26:35.600 | they just didn't really work.
00:26:38.560 | Maybe we weren't implementing
00:26:42.160 | one small aspect incorrectly,
00:26:43.520 | but like there was an active development on it
00:26:46.080 | at a certain point.
00:26:47.040 | Like even lucidrains, I think he's interesting
00:26:49.760 | 'cause for once he was actually like,
00:26:51.760 | he was like taking a job somewhere
00:26:53.120 | and then just stopped, you know, doing commits.
00:26:55.840 | And as we were working to try to find it,
00:26:58.960 | like we never really want to jump in on a repo
00:27:01.120 | where someone's like kind of actively committing
00:27:03.040 | breaking changes to it.
00:27:04.800 | Otherwise we have to like eat that repo ourselves.
00:27:07.840 | And yeah, Easy Context
00:27:09.520 | was the first PyTorch implementation
00:27:11.200 | that applied it with native libraries
00:27:14.880 | that worked pretty well.
00:27:18.000 | And then we adapted it ourselves
00:27:19.840 | in order to configure it
00:27:23.040 | for our cluster network topology.
00:27:25.680 | So, you know, shout out to Zhang Peiyuan
00:27:29.760 | for his open source contributions.
00:27:32.800 | I think that we look forward
00:27:34.800 | to possibly collaborating with him
00:27:36.640 | and push that further in the future
00:27:38.720 | because I think more people,
00:27:41.120 | if they do want to get started on it,
00:27:43.280 | I would recommend that to be the easiest way.
00:27:45.920 | Like, I don't know how many people know Jax.
00:27:48.160 | Me personally, I don't really know it that well.
00:27:50.080 | So I'm more of a PyTorch guy.
00:27:52.080 | So yeah, I think that he provides
00:27:55.920 | a really good introduction
00:27:57.520 | to be able to try it out.
00:28:00.240 | - And so on one side,
00:28:01.840 | you had the technical discovery.
00:28:04.560 | What about the actual customer interest?
00:28:08.000 | Customers that you work with?
00:28:09.120 | I feel like sometimes the context size
00:28:10.800 | can be a bit of a marketing ploy.
00:28:12.640 | You know, people are like,
00:28:13.440 | "Oh yeah, well, no, 1 million, 2 million,
00:28:15.440 | 3 million, 4 million."
00:28:16.800 | So that's kind of the algorithm side of it.
00:28:20.320 | How do you actually, you know,
00:28:21.680 | how do you power the training?
00:28:23.200 | But the other side is obviously the data
00:28:25.440 | that goes into it.
00:28:26.160 | There's both quantity and quality.
00:28:28.720 | I think that's how, in one of your tweets,
00:28:30.080 | you trained on about 200 million tokens
00:28:32.960 | for the 8B model for the context extension.
00:28:35.600 | But what are the tokens?
00:28:37.920 | You know, how do you build them?
00:28:38.960 | What are like maybe some of the differences
00:28:41.440 | between pre-training datasets
00:28:43.360 | and context extension datasets?
00:28:45.200 | Yeah, any other color you give there
00:28:47.760 | would be great.
00:28:48.800 | So specifically for us,
00:28:50.880 | we actually staged two different updates
00:28:54.560 | to the model.
00:28:55.200 | So our initial layer that we trained
00:28:59.120 | was just basically like a pre-training layer.
00:29:04.240 | So continual pre-training
00:29:05.520 | where we took the Slim Pajamas data
00:29:07.520 | and then we filtered it and concatenated it
00:29:12.320 | so that it would reach the context lengths
00:29:14.240 | that we were trying to extend out to.
00:29:16.080 | And then we took the UltraChat dataset,
00:29:19.120 | filtered it down,
00:29:20.160 | or maybe some other, you know,
00:29:23.120 | second order derivative of the UltraChat dataset
00:29:26.560 | that was curated and then filtered it down
00:29:29.760 | and then reformatted it for our chat use case.
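A rough sketch of that packing step is below; the toy corpus, separator handling, and lengths are placeholders rather than the actual pipeline, which also does the filtering and reformatting described here.

```python
# Sketch of packing filtered, pre-tokenized documents into fixed-length
# long-context training samples. Corpus and lengths here are placeholders.
from typing import Iterable, Iterator

def pack_documents(
    token_streams: Iterable[list[int]],  # already-tokenized documents
    target_len: int,                     # e.g. 262_144 or 1_048_576 tokens
    eos_id: int,
) -> Iterator[list[int]]:
    buffer: list[int] = []
    for doc in token_streams:
        buffer.extend(doc)
        buffer.append(eos_id)             # keep document boundaries visible
        while len(buffer) >= target_len:
            yield buffer[:target_len]
            buffer = buffer[target_len:]  # carry the remainder into the next sample

# Toy usage; a real run would stream tokenized SlimPajama-style shards.
docs = [[1, 2, 3] * 50, [7, 8] * 80, [4, 5, 6] * 120]
for sample in pack_documents(docs, target_len=128, eos_id=0):
    print(len(sample))
```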
00:29:34.640 | For those two datasets,
00:29:38.160 | I think you always have to really keep in mind
00:29:41.520 | for the pre-training data,
00:29:44.400 | whether or not you may be like
00:29:48.000 | cutting off tokens in weird ways,
00:29:49.760 | whether or not, you know,
00:29:53.280 | the content is actually diverse enough
00:29:56.240 | to retain the ability of the model.
00:29:59.200 | So Slim Pajamas tends to be one of the best ones,
00:30:02.800 | mostly because it's a diverse dataset
00:30:05.200 | and you can use embeddings too
00:30:09.440 | as a pre-filtering step as well, right?
00:30:12.400 | Like how diverse are your embeddings space
00:30:15.280 | to the original corpus of the model
00:30:18.960 | and then train on top of that to retain its abilities.
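A minimal sketch of that kind of embedding-based pre-filter follows; the similarity threshold and the random stand-in embeddings are placeholders.

```python
# Sketch of an embedding-based diversity filter: greedily keep a document only
# if it is not too similar to anything already kept. Threshold and the random
# stand-in embeddings are placeholders for real text embeddings.
import numpy as np

def diversity_filter(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for idx, vec in enumerate(unit):
        if not kept or float((unit[kept] @ vec).max()) < threshold:
            kept.append(idx)
    return kept

rng = np.random.default_rng(0)
candidate_embeddings = rng.normal(size=(1_000, 384))
print(f"kept {len(diversity_filter(candidate_embeddings))} of 1000 documents")
```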
00:30:21.920 | And then finally for the chat dataset,
00:30:26.000 | making sure that it's attending to all the information
00:30:31.280 | that would be expected to really stretch its capabilities
00:30:34.240 | 'cause you could create like a long context dataset
00:30:38.080 | where like every single time,
00:30:40.160 | the last 200 tokens can answer the entire question.
00:30:44.240 | And that's never gonna make the model attend to anything.
00:30:47.440 | So it's even something that we're doing right now
00:30:50.480 | is trying to think about
00:30:51.680 | like how do we actually improve these models
00:30:54.160 | and how do you ablate the datasets
00:30:57.840 | such that it can expose like even more nuanced capabilities
00:31:02.240 | that aren't easily measurable quite yet.
00:31:07.120 | Is there a ratio between diversity of the dataset
00:31:11.120 | versus diversity compared to what the model already knows?
00:31:14.960 | Like does the model already need to understand
00:31:17.040 | a good part of the new,
00:31:19.760 | like the context extension data to function?
00:31:22.880 | Like can you put a context extension dataset
00:31:25.680 | that is like very far
00:31:26.800 | from like what was in the pre-training?
00:31:28.320 | I'm just thinking as the model get older,
00:31:31.120 | you know, some of the datasets that we have
00:31:34.080 | might not be in the knowledge of the existing model
00:31:36.720 | that you're trying to extend.
00:31:37.680 | - I think that's always a consideration.
00:31:40.560 | I think specifically,
00:31:41.840 | you really got to know how much,
00:31:45.040 | how many tokens were expended
00:31:47.040 | into that particular model from the start.
00:31:48.880 | And all models these days
00:31:50.960 | are now double digit trillions, right?
00:31:54.160 | So it's kind of a drop in the bucket
00:31:56.800 | if you really think I can just put,
00:31:59.360 | you know, a billion tokens in there.
00:32:01.280 | And I actually think that the model
00:32:02.720 | is gonna truly learn new information.
00:32:06.400 | There is a lot of research out there
00:32:09.760 | between the differences
00:32:11.680 | with respect to like full fine-tuning,
00:32:13.760 | which we applied full fine-tuning
00:32:15.120 | versus lower base fine-tuning.
00:32:16.560 | It's a trade-off.
00:32:19.440 | And my opinion of it is actually
00:32:22.160 | that you can test certain capabilities
00:32:26.640 | and you can kind of inject
00:32:28.640 | new knowledge into the model.
00:32:31.680 | But to this day,
00:32:32.960 | I've not seen any research
00:32:34.720 | that does like a strong,
00:32:35.840 | well-scaled out empirical study
00:32:38.800 | on how do you increase the model's ability
00:32:43.280 | to understand like these decision boundaries
00:32:45.760 | with a new novel data.
00:32:48.880 | Most of it is taking,
00:32:51.120 | like holding out a portion of the data
00:32:53.360 | as like novel
00:32:56.160 | and then needing to recycle
00:32:57.600 | some of the old knowledge.
00:32:58.960 | So it just doesn't forget
00:33:01.200 | and get worse at everything else, right?
00:33:03.680 | Which was seen,
00:33:04.480 | like we do have historical precedent
00:33:06.720 | where Code Llama,
00:33:10.480 | the original Code Llama,
00:33:12.720 | was trained further from Llama 2
00:33:14.320 | and it just lost
00:33:15.920 | all its language capabilities, basically, right?
00:33:18.720 | So it's not, I don't wanna call that project,
00:33:21.920 | like deem it as like a failure,
00:33:24.720 | but it wasn't like
00:33:25.600 | a really successful generalization exercise
00:33:28.800 | because these models are about like flexibility
00:33:32.000 | and being like generic to a certain extent.
00:33:34.080 | - So one thing I see in the recent papers
00:33:37.040 | that have been coming out
00:33:37.840 | is this sort of concept
00:33:39.920 | of multi-stage training of data.
00:33:42.560 | And if you're doing full fine tuning,
00:33:45.360 | maybe the move or the answer
00:33:47.200 | is don't train 500 billion tokens on just code
00:33:51.120 | because then yeah,
00:33:51.920 | it's gonna massively overfit to just code.
00:33:53.680 | Instead, like maybe the move
00:33:55.360 | is to slowly change the mix
00:33:57.840 | over the different phases, right?
00:34:00.320 | So in other words,
00:34:01.360 | like you still need to mix in
00:34:03.520 | some of your original source dataset
00:34:05.040 | to make sure it doesn't deviate too much.
00:34:07.040 | I feel like that is a very crude solution.
00:34:10.720 | Like maybe there's some smarter way
00:34:13.280 | to adjust like the loss function
00:34:14.640 | so that it doesn't like deviate
00:34:17.120 | or overfit too much to more recent data.
00:34:19.360 | It seems like it's a solvable thing.
00:34:22.640 | That's what I'm saying.
00:34:23.760 | Like this overfitting
00:34:25.360 | to more recent data issue.
00:34:26.880 | - Well, solvable is hard.
00:34:29.920 | I think provably solvable
00:34:32.160 | is always something that I know
00:34:33.200 | is extremely difficult.
00:34:35.680 | But from a heuristical standpoint,
00:34:39.200 | as well as like having
00:34:40.160 | like some sort of statistical efficiency
00:34:44.240 | on like how you can converge
00:34:45.760 | to the downstream tasks
00:34:48.080 | and improve the performance that way
00:34:49.520 | in a targeted manner,
00:34:51.920 | I do think there are papers
00:34:53.840 | that try to do that.
00:34:56.560 | Like the DoReMi paper,
00:34:59.120 | I think it was released last year.
00:35:00.240 | It was really good
00:35:01.200 | about doing an empirical study on that.
00:35:03.360 | I think the one thing
00:35:07.440 | that people struggle with though
00:35:09.920 | is the fact that
00:35:10.720 | they always try to do it
00:35:13.680 | on pretty naive tasks.
00:35:15.760 | Like you target like a naive task
00:35:17.840 | and then you create your data mixture
00:35:20.320 | and you try to show
00:35:21.920 | some sort of algorithm
00:35:23.680 | that can retain the performance
00:35:27.840 | for those downstream tasks.
00:35:30.320 | But then what do we all care about
00:35:33.680 | are actually like really,
00:35:34.640 | really interesting complex tasks, right?
00:35:37.040 | And we barely have
00:35:37.760 | good evaluations for those.
00:35:39.440 | Like if you do a deep dive
00:35:42.080 | at the Gemini 1.5 technical paper,
00:35:45.920 | which they just updated with,
00:35:47.520 | like it was a fantastic paper
00:35:49.280 | with new updates.
00:35:50.240 | If you look at all of their
00:35:52.240 | long context evaluations there,
00:35:54.800 | like a lot of them are just not
00:35:56.640 | something that the open community
00:35:58.240 | can even do
00:35:59.120 | because they just hired
00:36:01.040 | like teachers to evaluate
00:36:04.240 | whether or not this model
00:36:06.080 | generated a huge lesson plan
00:36:08.400 | that is really coherent
00:36:10.160 | or like you hire a bunch
00:36:11.840 | of subject matter experts
00:36:13.120 | or they taught the model
00:36:15.760 | how to do language translation
00:36:18.160 | for an extinct language
00:36:20.160 | where only 200 people in the world know.
00:36:22.160 | It's like, it's kind of hard for us
00:36:24.160 | to do that same study, right?
00:36:25.920 | As an early stage startup.
00:36:28.000 | I mean, technically now
00:36:28.880 | you can use Gemini as a judge.
00:36:30.800 | Gemini is touting a lot
00:36:32.000 | of their capabilities
00:36:33.040 | in low resource languages.
00:36:34.240 | One more thing before
00:36:36.000 | on the sort of data topic.
00:36:37.360 | Did you have any exploration
00:36:40.800 | of synthetic data at all?
00:36:41.920 | You know, use Mistral to rephrase
00:36:45.200 | some existing part of your data set
00:36:46.720 | to generate more tokens,
00:36:48.720 | anything like that,
00:36:49.360 | or any other form of synthetic data
00:36:51.200 | that you choose to mention.
00:36:52.640 | I think you also mentioned
00:36:53.680 | the large world model paper, right?
00:36:55.120 | So yeah, yeah.
00:36:56.560 | Anything like that?
00:36:57.200 | Yeah, yeah.
00:36:58.640 | So yeah, we used like GPT-4
00:37:02.880 | to rephrase certain aspects
00:37:06.160 | of the chat data, reformatting it
00:37:10.400 | or kind of generating
00:37:13.680 | new types of tokens
00:37:16.560 | and language and types of data
00:37:19.440 | that the model could see.
00:37:22.400 | And also like trying to take
00:37:25.760 | the lower correlated instances
00:37:28.800 | of out-of-domain data
00:37:29.920 | that we wanted to inject it
00:37:32.720 | to the model too as well.
00:37:34.320 | So I actually think a lot of the moat
00:37:37.360 | is in the data pipeline.
00:37:39.520 | You'll notice like most papers
00:37:42.880 | just don't really go into deep detail
00:37:44.640 | about the data set creation
00:37:47.520 | because they probably know.
00:37:49.200 | I mean, there's some aspects
00:37:50.560 | that are like uninteresting, right?
00:37:52.240 | Which is like we paid a bunch of people
00:37:53.680 | and like generated a lot of good data.
00:37:56.640 | But then the synthetic data
00:37:57.760 | generating pipeline itself,
00:37:59.280 | sometimes that could be like 25%
00:38:02.640 | or 50% of the entire data set
00:38:04.800 | that you end up using to pre-train.
00:38:06.720 | Yeah, I think it's just
00:38:07.520 | for legal deniability rather than...
00:38:09.520 | (both laughing)
00:38:11.440 | No, it's just too boring.
00:38:13.120 | I'm not going to say anything
00:38:13.840 | because it's too boring.
00:38:14.560 | No, it's actually really interesting.
00:38:15.760 | But in fact, it might be too interesting.
00:38:19.600 | So we're not going to say anything about it.
00:38:21.520 | Yeah.
00:38:22.020 | One more question that I had was on LoRa
00:38:25.680 | and taking some of these capabilities
00:38:27.440 | out and bringing them to other model.
00:38:29.120 | You mentioned Wing's work.
00:38:30.800 | He tweeted about,
00:38:32.880 | we're going to take this LoRA adapter
00:38:34.720 | for the Gradient 1 million context extension
00:38:38.240 | and you're going to be able
00:38:39.120 | to apply that to other models.
00:38:41.120 | Can you just generally explain to people
00:38:44.640 | how these things work with language models?
00:38:47.600 | I think people understand
00:38:48.400 | that with stable diffusion,
00:38:49.440 | you have these like LoRA patches
00:38:50.960 | for like different types of styles.
00:38:52.720 | Does that work similarly with LLMs?
00:38:56.240 | And is it about functionality?
00:38:58.080 | Can you do LoRA patches with specific knowledge?
00:39:00.960 | Like what's the state of the art there?
00:39:02.800 | Yeah, I think there's a huge kind of resurgence
00:39:07.920 | in what I would call like model alchemy
00:39:12.640 | to a certain extent
00:39:13.920 | because you're like taking all of these LoRas
00:39:16.880 | and you're mixing them together.
00:39:18.160 | And then that's a lot of the model merging stuff
00:39:22.080 | that I think Charles Goddard does
00:39:24.400 | and a lot of others in the open community, right?
00:39:28.720 | 'Cause it's a really easy way,
00:39:30.480 | like you don't need training
00:39:31.760 | and you can test and evaluate models
00:39:33.920 | and take the best skills and mix and match.
00:39:36.720 | I don't think there has been
00:39:38.400 | as much empirical study, like you're saying,
00:39:40.720 | for how it shows the same type of,
00:39:44.720 | like it's not as interpretable
00:39:46.240 | as like stable diffusion to a certain extent.
00:39:48.080 | 'Cause even we have experimented
00:39:53.440 | with taking like deltas
00:39:55.520 | in the same methodology as Wing
00:39:57.680 | where we'll take a delta
00:39:58.880 | of like an already trained model,
00:40:00.320 | try to see how that has created
00:40:03.360 | in a sense an RLHF layer, right?
00:40:05.920 | Taking the Llama instruct layer,
00:40:08.080 | subtracting the base model from that
00:40:10.720 | and then trying to apply that LoRa adapter
00:40:12.640 | to like another model
00:40:13.680 | and seeing what it does to it.
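A minimal sketch of the delta experiment being described (toy tensors stand in for real checkpoints; in practice the models have to share an architecture, and the delta is often rank-reduced into an actual LoRA rather than applied raw):

```python
# Toy sketch: compute the "instruct minus base" weight delta and graft it onto
# a third, architecturally identical model. Random tensors stand in for real
# state_dicts; this is an experiment, not a guaranteed recipe.
import torch

def transplant_delta(base_sd: dict, instruct_sd: dict, target_sd: dict) -> dict:
    merged = {}
    for name, weight in target_sd.items():
        delta = instruct_sd[name] - base_sd[name]  # what instruct-tuning changed
        merged[name] = weight + delta              # apply that change to the target
    return merged

shape = (8, 8)
base_sd     = {"layer.weight": torch.randn(shape)}
instruct_sd = {"layer.weight": base_sd["layer.weight"] + 0.01 * torch.randn(shape)}
target_sd   = {"layer.weight": torch.randn(shape)}

merged = transplant_delta(base_sd, instruct_sd, target_sd)
print(merged["layer.weight"].shape)
```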
00:40:16.080 | - It does seem to have an effect though.
00:40:20.480 | Like I will not lie to say,
00:40:22.640 | I'm really surprised
00:40:24.240 | how effective it is sometimes.
00:40:26.240 | But I do notice that
00:40:28.240 | for more complex abilities,
00:40:30.320 | other than like more stylistic stuff,
00:40:34.480 | it kind of falls through
00:40:37.200 | 'cause maybe it requires
00:40:40.560 | a much deeper path
00:40:42.400 | in the neural network, right?
00:40:43.840 | Like all these things,
00:40:44.640 | these weights are just like
00:40:45.680 | huge trees of paths
00:40:48.640 | that the interesting stuff
00:40:50.960 | is like the road less traveled
00:40:54.400 | to a certain extent.
00:40:55.440 | And when you're just like merging things,
00:40:57.440 | brute force together that way,
00:40:59.680 | you don't quite know
00:41:02.720 | what you'll get out all the time.
00:41:04.400 | Like there's a lot of other research
00:41:06.000 | where you have TIES merging
00:41:07.760 | and you have all these different
00:41:09.520 | types of techniques
00:41:10.400 | to effectively just apply
00:41:12.400 | like a singular value decomposition
00:41:15.520 | on top of weights
00:41:16.480 | and just get like the most important ones
00:41:18.560 | and prevent interference
00:41:20.000 | across all the other layers.
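To illustrate the SVD point, here is a sketch of compressing a dense weight delta into a rank-r adapter by keeping only the top singular directions; the rank and matrix shape are arbitrary.

```python
# Sketch: keep only the largest singular directions of a weight delta, which is
# essentially how a dense delta becomes a LoRA-style low-rank adapter.
# Rank and matrix shape below are arbitrary.
import torch

def low_rank_delta(delta: torch.Tensor, rank: int):
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (out_dim, rank) with singular values folded in
    B = Vh[:rank, :]             # (rank, in_dim)
    return A, B                  # delta is approximated by A @ B

delta = torch.randn(512, 512)
A, B = low_rank_delta(delta, rank=16)
rel_err = torch.linalg.norm(delta - A @ B) / torch.linalg.norm(delta)
print(f"relative error of the rank-16 approximation: {rel_err:.3f}")
```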
00:41:22.320 | But yeah, I think that
00:41:25.840 | that is extremely interesting
00:41:30.080 | from the developer community.
00:41:32.480 | And I wanna see more of it
00:41:34.160 | except it is to a certain extent
00:41:37.520 | kind of polluting the leaderboards
00:41:39.360 | these days 'cause it's so targeted
00:41:41.360 | and like now you can
00:41:43.200 | you can kind of game the metric
00:41:45.440 | by just finding all the best models
00:41:47.680 | and then just merging them together
00:41:49.280 | to do that.
00:41:49.840 | And I'll just add one last bit
00:41:53.040 | is basically
00:41:53.760 | the most interesting part
00:41:55.680 | about all that actually to me
00:41:57.040 | is when people are trying
00:41:58.720 | to take the LORAs
00:41:59.840 | as a way of like short circuiting
00:42:02.640 | the training process.
00:42:03.440 | So they take the LORAs,
00:42:04.320 | they merge it in
00:42:05.280 | and then they'll fine tune afterwards.
00:42:06.960 | So like the fine tuning
00:42:08.880 | and the reinitialization
00:42:10.560 | of a little bit of noise
00:42:12.240 | into all of the new merged models
00:42:16.000 | provides like a kind of
00:42:18.320 | kind of a learning tactic
00:42:20.800 | for you to get to that capability
00:42:23.200 | a little bit faster.
00:42:24.080 | There's a lot there.
00:42:25.840 | I really like the comparison
00:42:27.040 | of TIES merging
00:42:29.280 | to singular value decomposition.
00:42:31.600 | That's something that I like it.
00:42:35.280 | I looked at the paper
00:42:36.240 | and I don't really think
00:42:37.760 | I understood it on that high level
00:42:39.280 | until you just said it.
00:42:40.640 | Very cool.
00:42:41.760 | We have to move on to benchmarking.
00:42:44.560 | This is a very fun topic.
00:42:47.440 | Needle in a haystack.
00:42:48.320 | What are your thoughts and feelings?
00:42:49.360 | And then we can discuss
00:42:50.160 | the other benchmarks first.
00:42:51.520 | Needle in a haystack.
00:42:52.640 | You want to put me
00:42:53.600 | on the spot with that one.
00:42:54.480 | Yeah, I think needle in a haystack
00:42:57.360 | is definitely like the standard
00:42:59.840 | for presenting the work
00:43:01.520 | in a way that people can understand
00:43:03.280 | and also proving out.
00:43:04.640 | I would say like,
00:43:05.200 | I view it as like a primitive
00:43:07.760 | that you have to pass
00:43:09.840 | in order to give the model
00:43:11.440 | any shot of doing something
00:43:13.440 | that combines both
00:43:15.440 | like a more holistic
00:43:16.720 | language understanding
00:43:17.760 | and like instruction following, right?
00:43:20.080 | Like, honestly,
00:43:20.880 | like it's mostly about
00:43:21.840 | if you think about
00:43:23.680 | the practical applications
00:43:24.960 | of like long context
00:43:26.480 | and what people complain
00:43:27.440 | most about models
00:43:29.040 | when you stuff
00:43:29.680 | a lot of context into it
00:43:31.040 | is either the language model
00:43:33.120 | just doesn't care about
00:43:34.320 | what you asked it to do
00:43:36.000 | or it cannot differentiate
00:43:37.760 | like, you know,
00:43:39.120 | context that you want
00:43:40.960 | it to use as a source
00:43:42.560 | to prevent hallucination
00:43:44.000 | versus like instructions.
00:43:45.680 | I think that, you know,
00:43:48.400 | when we were doing it,
00:43:49.440 | it was to make sure
00:43:50.240 | that we were on the right track.
00:43:51.920 | I think Greg did a really great job
00:43:54.960 | of creating a metric
00:43:56.240 | and a benchmark
00:43:56.880 | that everybody could understand.
00:43:59.120 | It was intuitive.
00:44:00.160 | Even he says himself,
00:44:01.360 | we have to move past it.
00:44:02.880 | But to that regard,
00:44:05.200 | it's a big reason
00:44:06.800 | why we did the evaluation
00:44:08.800 | on the ruler suite of benchmarks,
00:44:10.880 | which are way harder.
00:44:12.560 | They actually include
00:44:14.880 | needle in the haystack
00:44:16.080 | within those benchmarks too.
00:44:17.360 | And I would even argue
00:44:20.000 | is more comprehensive
00:44:21.840 | than the benchmark
00:44:24.800 | that Gemini released
00:44:27.280 | for their like multi-needle
00:44:28.560 | in the haystack.
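For anyone who has not run it, a bare-bones needle-in-a-haystack probe looks roughly like the sketch below; the filler, needle, and the ask_model stub are placeholders, and a real harness sweeps many context lengths and needle depths and scores the answers.

```python
# Bare-bones needle-in-a-haystack probe. ask_model is a placeholder stub; a
# real harness calls an actual long-context model and sweeps length and depth.
NEEDLE = "The secret passphrase is 'aurora-42'."
QUESTION = "What is the secret passphrase mentioned in the document?"
FILLER = "The quick brown fox jumps over the lazy dog. "

def build_haystack(n_filler_sentences: int, depth: float) -> str:
    sentences = [FILLER] * n_filler_sentences
    sentences.insert(int(depth * n_filler_sentences), NEEDLE + " ")
    return "".join(sentences)

def ask_model(prompt: str) -> str:
    # Placeholder: swap in a real long-context model call here.
    return "aurora-42" if "aurora-42" in prompt else "I don't know."

for depth in (0.1, 0.5, 0.9):  # where in the context the needle is buried
    prompt = build_haystack(2_000, depth) + "\n\n" + QUESTION
    verdict = "PASS" if "aurora-42" in ask_model(prompt) else "FAIL"
    print(f"needle at {depth:.0%} depth: {verdict}")
```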
00:44:29.600 | Yeah, you mentioned quite a few.
00:44:31.040 | You mentioned RULER,
00:44:32.320 | LooGLE, InfiniteBench,
00:44:33.680 | Bamboo, ZeroSCROLLS.
00:44:35.120 | Like, do you want to give us
00:44:38.400 | like maybe two or three of those
00:44:40.320 | that you thought
00:44:41.040 | were particularly interesting
00:44:42.000 | or challenging
00:44:42.640 | and what made them stand out for you?
00:44:44.640 | There's just so many
00:44:45.600 | and they're so nuanced.
00:44:47.040 | I would say like,
00:44:48.880 | yeah, zero scrolls
00:44:49.760 | was the first one I'd ever heard of
00:44:51.200 | coming out last year.
00:44:53.360 | And it was just like the extent,
00:44:54.560 | like making,
00:44:55.200 | it was more of like tracking
00:44:57.040 | variables over long context.
00:45:00.800 | I'll go into ruler
00:45:02.800 | because that's the freshest in my mind
00:45:04.240 | and like we're just scrutinizing it so much
00:45:06.320 | and running the evaluation
00:45:08.000 | in the previous two weeks.
00:45:09.360 | But like ruler has four different
00:45:14.080 | types of evaluations.
00:45:15.840 | So the first one is
00:45:17.360 | exactly needle in the haystack.
00:45:19.520 | It's like you throw multiple needles.
00:45:21.200 | So you got to retrieve
00:45:22.160 | multiple key value pairs.
00:45:24.640 | There's another one
00:45:25.280 | where that basically
00:45:26.400 | you need to differentiate.
00:45:27.680 | Multi-key, multi-value, multi-query.
00:45:30.320 | Yeah, yeah, multi-value, multi-query.
00:45:32.160 | That's the ablation.
00:45:33.920 | There's also a variable tracking one
00:45:39.600 | where you go,
00:45:40.400 | hey, if X equals this,
00:45:41.760 | Y equals this,
00:45:42.480 | Y equals Z,
00:45:45.280 | like what is this variable?
00:45:47.440 | And you have to track it
00:45:48.480 | through all of that context.
00:45:50.720 | And then finally,
00:45:51.520 | there's one that is
00:45:53.280 | more of like creating a summary statistic.
00:45:55.920 | So like the common words one,
00:45:57.680 | where you choose a word
00:46:00.080 | that goes across the entire context,
00:46:03.040 | and then you have to like count it.
00:46:04.480 | So it's a lot more holistic
00:46:07.360 | and a little bit more difficult that way.
00:46:09.040 | And then there's a few other ones
00:46:13.120 | that escaped me at this moment.
00:46:15.200 | But yeah, RULER really pushes you.
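For a concrete feel of the task types described above, here is a minimal sketch of RULER-style synthetic long-context prompts: a variable-tracking chain and a common-words counting task buried in filler text. The filler sentence, templates, and helper names are illustrative assumptions, not the actual RULER implementation.

```python
import random

# Minimal sketch of RULER-style synthetic long-context tasks (illustrative only;
# the real RULER suite has more task variants and careful prompt templating).

FILLER = "The grass is green. The sky is blue. The sun is yellow. Here we go. "

def variable_tracking_prompt(chain_len=4, n_filler=2000, seed=0):
    """A value propagates through a chain of assignments buried in filler text;
    the model must list every variable that ends up holding the value."""
    rng = random.Random(seed)
    value = str(rng.randint(10_000, 99_999))
    names = [f"VAR{rng.randint(100, 999)}" for _ in range(chain_len)]
    assignments = [f"{names[0]} = {value}."] + [
        f"{names[i]} = {names[i - 1]}." for i in range(1, chain_len)
    ]
    filler = [FILLER] * n_filler
    for a in assignments:                      # scatter assignments randomly
        filler.insert(rng.randrange(len(filler)), a + " ")
    question = f"\nWhich variables are equal to {value}? Answer: "
    return "".join(filler) + question, names

def common_words_prompt(target="lighthouse", n_target=17, n_filler=2000, seed=1):
    """Aggregation task: count occurrences of a word sprinkled across the whole
    context, so a single retrieval hit is not enough to answer."""
    rng = random.Random(seed)
    pieces = [FILLER] * n_filler + [f" {target} "] * n_target
    rng.shuffle(pieces)
    question = f"\nHow many times does the word '{target}' appear? Answer: "
    return "".join(pieces) + question, n_target

prompt, expected_vars = variable_tracking_prompt()
print(len(prompt), expected_vars)              # long prompt, chain of variable names
```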
00:46:18.240 | If I think about the progression
00:46:20.400 | of the evaluations,
00:46:21.600 | it pushes it to start
00:46:24.400 | to force the model
00:46:25.520 | to actually understand
00:46:26.960 | like the totality of the context,
00:46:29.200 | rather than right,
00:46:31.360 | like everybody argues to say,
00:46:32.640 | like, couldn't I just use
00:46:34.480 | like a retrieval
00:46:35.840 | to like just grab that variable
00:46:37.360 | rather than like pay $10 for one shot
00:46:41.200 | or something?
00:46:41.760 | Although it's not as expensive.
00:46:43.760 | Yeah, exactly, exactly.
00:46:45.920 | So being able to actually like,
00:46:47.680 | I think the main thing
00:46:48.880 | that like I struggled with,
00:46:50.480 | with even some of our use cases,
00:46:52.480 | were like when the context
00:46:55.680 | is scattered across multiple documents,
00:46:57.920 | and you have like really delicate plumbing
00:47:01.040 | for the retrieval step,
00:47:02.320 | but it only works for that one,
00:47:05.600 | that really specific instance, right?
00:47:07.360 | And then you throw in other documents
00:47:09.040 | and you're like, oh, great,
00:47:10.000 | like my retrieval doesn't grab
00:47:11.920 | the relevant context anymore.
00:47:13.760 | So like, that's the dream, right?
00:47:15.360 | Of getting one model,
00:47:17.040 | a model that can generalize
00:47:18.800 | really well that way.
00:47:20.080 | Yeah, totally.
00:47:20.720 | I think that probably is
00:47:22.880 | what Greg mentioned
00:47:24.240 | when saying that he has to move
00:47:26.000 | beyond needle in the haystack.
00:47:27.520 | You also mentioned,
00:47:29.440 | so you extended from 1 million
00:47:31.280 | to 4 million token context recently,
00:47:33.680 | and you saw some degradation
00:47:35.920 | in the benchmarks too.
00:47:37.200 | Like, do you want to discuss that?
00:47:38.720 | So if you look at our theta value
00:47:40.720 | at that point,
00:47:41.440 | it's getting really big.
00:47:43.120 | So think about floating point precision
00:47:47.440 | and thinking about propagating,
00:47:49.040 | like basically now you're starting
00:47:52.800 | to run into problems
00:47:54.240 | where in a deep enough network
00:47:56.480 | and having to do joint probabilities
00:48:00.720 | across like so many tokens,
00:48:04.320 | you're hitting kind of the upper bound
00:48:08.880 | on accuracy there.
00:48:11.520 | And there's probably some aspect
00:48:15.920 | of kind of clamping down
00:48:21.200 | certain activations
00:48:22.240 | that we need to do within training.
00:48:23.760 | Maybe it happens at inference time as well
00:48:28.240 | with respect to like the theta value
00:48:30.640 | that we use
00:48:31.360 | and how do we ensure
00:48:33.600 | that it doesn't just explode.
00:48:35.200 | If you've ever had to come across
00:48:38.560 | like the exploding gradients
00:48:40.400 | or the vanishing gradient problem,
00:48:41.760 | you will know what I'm talking about.
00:48:44.160 | Like a lot of the empirical aspect of that
00:48:46.720 | and scaling up these things
00:48:49.200 | is experimentation
00:48:52.240 | and figuring out
00:48:52.960 | how do you kind of marshal
00:48:56.320 | these really complicated
00:48:57.520 | composite functions
00:48:59.920 | such that they don't just like
00:49:01.360 | do a divide-by-zero problem at one point.
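To make the theta point concrete, here is a small sketch of how the RoPE base sets per-dimension rotation frequencies and why very large bases leave the slowest dimensions rotating by angles near half-precision resolution. The base values and head dimension are illustrative, not the exact settings used for the 1M or 4M runs.

```python
import numpy as np

# Sketch of RoPE rotation angles under a scaled base ("theta"). The bases and
# head_dim below are illustrative, not the exact values used in these runs.

def rope_angles(position: int, head_dim: int = 128, base: float = 10_000.0):
    """Rotation angle (radians) applied to each even/odd dimension pair."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return position * inv_freq

for base in (1e4, 1e7, 1e9):
    step = rope_angles(1, base=base)[-1]       # slowest dimension's per-token angle
    print(f"base={base:.0e}  slowest per-token angle={step:.3e} rad  "
          f"(fp16 eps={np.finfo(np.float16).eps:.1e})")
# As the base grows, the slowest dimensions advance by vanishingly small angles
# per token, so in low precision nearby positions become hard to tell apart in
# those dimensions, which is one place accuracy can erode at multi-million-token
# lengths.
```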
00:49:04.960 | Awesome.
00:49:06.000 | Just to wrap on the...
00:49:08.240 | There's the evals
00:49:10.640 | and then there's what people care about.
00:49:12.240 | There's two things.
00:49:13.920 | Do you see people care about above 1 million?
00:49:16.320 | Because Gemini made the 2 million
00:49:18.400 | announcement and I think people were like,
00:49:20.320 | "Okay, 1 million, 2 million, it's whatever."
00:49:23.920 | Like, do you think we need to get to 10 million
00:49:25.840 | to get people to care again?
00:49:27.360 | Yeah.
00:49:27.760 | Like, do we need to get to 100 million?
00:49:29.520 | Yeah.
00:49:31.040 | I mean, that's an open question.
00:49:36.400 | I would certainly say
00:49:38.080 | a million seemed like the number
00:49:40.960 | that got people really excited for us.
00:49:43.760 | And then the 4 million is kind of like,
00:49:47.120 | "Okay, that's seen as more..."
00:49:50.000 | Rather than like a breakthrough milestone,
00:49:51.840 | it's like just the next incremental checkpoint.
00:49:55.840 | I do think even Google themselves,
00:50:02.880 | they're evaluating and trying to figure out
00:50:04.800 | specifically how do you measure
00:50:08.800 | the quality of these models
00:50:10.160 | and how do you measure and map those
00:50:12.560 | to capabilities that you care about
00:50:17.120 | going down the line, right?
00:50:18.640 | And I think I'm still...
00:50:22.480 | Us as a company, we're figuring out
00:50:26.240 | how to saturate the context window
00:50:29.840 | in a way that's actually
00:50:31.600 | adding incremental value.
00:50:35.360 | So the obvious one is code
00:50:38.800 | because code repositories are huge.
00:50:41.280 | So can you stuff the entire context
00:50:43.440 | of a repo into a model
00:50:46.560 | and then make it produce some module
00:50:50.080 | that is useful
00:50:50.960 | or some suggestion that is useful?
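As a rough illustration of that repo-stuffing idea, the sketch below walks a repository, prefixes each file with its path, and concatenates everything under a crude token budget. The 4-characters-per-token heuristic and the task string are assumptions for illustration, not a recommended pipeline.

```python
from pathlib import Path

# Rough sketch of "stuff the whole repo into the context": concatenate files
# with path headers under a crude token budget (4 chars/token heuristic, not a
# real tokenizer), then append the actual request at the end.

def repo_to_prompt(root: str, budget_tokens: int = 1_000_000,
                   exts=(".py", ".md", ".toml")) -> str:
    parts, used = [], 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in exts:
            continue
        text = path.read_text(errors="ignore")
        est = len(text) // 4                   # crude token estimate
        if used + est > budget_tokens:
            break
        parts.append(f"### FILE: {path}\n{text}")
        used += est
    parts.append("### TASK: propose a useful new module or refactor for this codebase.")
    return "\n\n".join(parts)

# Usage: prompt = repo_to_prompt("path/to/repo"); send it to a long-context model.
```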
00:50:53.360 | However, I would say
00:50:56.240 | there are other techniques
00:50:58.480 | like AlphaCodium and flow engineering
00:51:01.040 | that if you do iterative things
00:51:03.120 | in a more agentic manner,
00:51:04.240 | it may actually produce better quality.
00:51:06.320 | I would preface and I would actually counter
00:51:09.600 | that maybe start off with the use case
00:51:14.880 | that is a little bit more
00:51:16.000 | that people are more familiar with right now,
00:51:18.480 | which is constantly evolving context
00:51:21.840 | in like a session.
00:51:23.200 | So like, whereas you're coding, right?
00:51:25.680 | If you can figure out evals
00:51:27.440 | that actually work
00:51:30.400 | where you're constantly providing it
00:51:32.560 | multiple turns
00:51:33.600 | and each incremental turn
00:51:34.880 | has a nuanced aspect
00:51:36.400 | and you have a targeted generation
00:51:39.440 | that you know of,
00:51:40.080 | making the model track state
00:51:46.400 | and have state management over time
00:51:48.400 | is really, really hard.
00:51:50.320 | And it's an incredibly hard evaluation
00:51:53.440 | that will probably only really work
00:51:55.680 | when you have a huge context.
00:51:57.440 | So that's sort of what we're working on
00:51:59.920 | trying to figure out those types of aspects.
00:52:01.920 | You can also map that.
00:52:02.880 | Like it's not just code,
00:52:04.720 | state management exists.
00:52:06.480 | And like, you know,
00:52:07.440 | we work in the finance sector a lot,
00:52:08.880 | like investment management,
00:52:09.920 | like having a state management
00:52:14.480 | of like a concept and stuff
00:52:16.320 | that evolves over like a long session.
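A hedged sketch of the kind of session-state evaluation being described: a long multi-turn transcript where a handful of facts keep mutating, with the model finally asked to report the current state. The scenario, names, and scoring are illustrative scaffolding, not Gradient's actual eval.

```python
import random

# Illustrative scaffolding for a multi-turn state-management eval: facts mutate
# turn by turn, and the model is asked for the final state at the end.

def build_session(n_turns=200, n_keys=10, seed=0):
    rng = random.Random(seed)
    keys = [f"project_{i}" for i in range(n_keys)]
    state, turns = {}, []
    for _ in range(n_turns):
        k, v = rng.choice(keys), rng.randint(0, 1_000)
        state[k] = v                           # ground-truth state update
        turns.append(f"User: note that {k} is now budgeted at {v}.")
        turns.append("Assistant: noted.")
    turns.append("User: list the current budget for every project.")
    return "\n".join(turns), state

transcript, expected = build_session()
# A real harness would send `transcript` to the model and score how many of the
# key/value pairs in `expected` it reproduces correctly.
print(len(transcript.splitlines()), "lines;", len(expected), "tracked keys")
```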
00:52:19.200 | So yeah, I'm super excited to hear
00:52:24.400 | like what other people think
00:52:25.840 | about the longer context.
00:52:27.600 | I don't think Google is probably investing
00:52:30.960 | to try to get a billion quite yet.
00:52:34.080 | I think they're trying to figure out
00:52:36.480 | how to fully leverage
00:52:37.840 | what they've done already.
00:52:39.680 | Yeah.
00:52:41.440 | And does this change in your mind
00:52:43.200 | for very long chats
00:52:44.800 | versus a lot of documents?
00:52:46.800 | The chat is kind of interactive,
00:52:48.800 | you know, and information changes
00:52:50.160 | the documents are just trying
00:52:51.280 | to synthesize more and more things.
00:52:53.200 | Yeah.
00:52:54.400 | Any thoughts on how those
00:52:55.760 | two workloads differ?
00:52:56.960 | Yeah, I mean, I would say like
00:52:59.920 | with the document aspect of things,
00:53:02.080 | you probably have like a little bit
00:53:06.080 | more ability to tweak
00:53:08.640 | other methodologies.
00:53:10.400 | Like you can get around
00:53:11.280 | the long context sometimes
00:53:13.520 | where you can do
00:53:15.360 | retrieval augmented generation
00:53:16.880 | or you do like
00:53:17.680 | hierarchical,
00:53:20.640 | like recursive summarization.
00:53:22.160 | Whereas like evolution
00:53:25.120 | in like a session,
00:53:26.000 | because that state variable
00:53:28.800 | could undergo
00:53:29.920 | like pretty rapid changes.
00:53:32.080 | It's a little bit harder
00:53:34.160 | to imagine like you
00:53:36.560 | getting around that
00:53:37.440 | without codifying
00:53:39.200 | like a really specific workflow
00:53:40.960 | or like some sort of,
00:53:42.480 | you know, state clause
00:53:45.760 | that is going back
00:53:47.200 | to like determinism, right?
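For the document-side workaround mentioned above, here is a minimal recursive-summarization sketch: chunk the text, summarize each chunk, then summarize the summaries until one remains. The `summarize` callable is a stand-in for whatever model call you use, and the chunk size is an illustrative character budget.

```python
from typing import Callable, List

# Minimal sketch of hierarchical / recursive summarization. `summarize` is a
# stand-in for a model call; the chunk size is an illustrative character budget.

def chunk(text: str, size: int = 4_000) -> List[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def recursive_summarize(text: str, summarize: Callable[[str], str],
                        chunk_size: int = 4_000) -> str:
    pieces = chunk(text, chunk_size)
    if not pieces:
        return ""
    if len(pieces) == 1:                       # fits in one call: base case
        return summarize(pieces[0])
    partials = [summarize(p) for p in pieces]  # summarize each chunk
    return recursive_summarize(" ".join(partials), summarize, chunk_size)

# Toy usage with a fake summarizer that just truncates:
fake_summarize = lambda t: t[:200]
print(len(recursive_summarize("lorem ipsum " * 5_000, fake_summarize)))
```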
00:53:48.720 | And then finally,
00:53:51.520 | like what I really think
00:53:52.640 | people are trying to do
00:53:55.120 | is like figure out
00:53:56.160 | how did all these like shots
00:53:59.600 | progress over time?
00:54:01.600 | So like,
00:54:02.000 | how do you get away
00:54:03.840 | from the brittleness
00:54:04.640 | of like the retrieval step
00:54:05.680 | to like shoving,
00:54:06.880 | if you shove in a thousand shots
00:54:08.560 | or 2000 shots,
00:54:09.920 | will it just make the retrieval aspect
00:54:12.800 | of good examples irrelevant?
00:54:15.520 | And like, it's sort of
00:54:16.640 | kind of like a
00:54:17.600 | randomly sampling is fine
00:54:19.360 | at that point.
00:54:20.000 | There's actually a paper on that
00:54:21.920 | that came out from CMU
00:54:23.920 | that they showed
00:54:25.360 | with respect to a few extraction
00:54:29.520 | or classification
00:54:30.960 | high cardinality benchmarks.
00:54:33.520 | They tracked like fine tuning
00:54:35.520 | versus in-context learning
00:54:37.760 | versus like many,
00:54:40.480 | many shot in-context learning.
00:54:42.000 | And they basically showed
00:54:42.880 | that like many,
00:54:44.000 | many shot in-context learning
00:54:45.600 | helps to prevent
00:54:48.960 | as much sensitivity
00:54:50.240 | around the examples themselves.
00:54:51.600 | Right?
00:54:52.400 | Like the distraction,
00:54:53.680 | the distraction error
00:54:55.120 | that a lot of LLMs get
00:54:56.400 | where you give it irrelevant context
00:54:58.400 | and it literally can't do the task
00:55:00.880 | because it just is
00:55:02.160 | like it's sort of like a person too.
00:55:03.520 | Right?
00:55:03.760 | Like you got to be very specific about
00:55:06.000 | I don't want to distract this person
00:55:07.520 | because then,
00:55:08.480 | you know,
00:55:08.960 | they're going to go down a rabbit hole
00:55:10.640 | and not be able to complete the task.
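Here is a hedged sketch of the many-shot in-context learning setup referenced around that CMU result: with a big enough window, you randomly sample hundreds or thousands of labeled examples straight into the prompt instead of maintaining a brittle retrieval step. The dataset and labels below are toy assumptions.

```python
import random

# Sketch of many-shot in-context learning: randomly sample a large number of
# labeled examples into the prompt rather than retrieving a curated few.

def many_shot_prompt(train_set, query, n_shots=1_000, seed=0):
    """train_set: list of (text, label) pairs; returns a classification prompt."""
    rng = random.Random(seed)
    shots = rng.sample(train_set, min(n_shots, len(train_set)))
    demos = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in shots)
    return f"{demos}\nInput: {query}\nLabel:"

# Toy usage with a made-up ticket-routing dataset:
train = [(f"ticket {i}: printer jammed", "hardware") for i in range(600)] + \
        [(f"ticket {i}: password reset request", "access") for i in range(600)]
prompt = many_shot_prompt(train, "ticket 9999: laptop will not boot")
print(prompt.count("\nLabel:"), "labeled lines in the prompt")  # shots + the query line
```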
00:55:12.240 | Yeah.
00:55:13.680 | Well, that's kind of the flip side
00:55:14.960 | of the needle in a haystack
00:55:16.640 | thing too in a bit.
00:55:17.520 | It's like now
00:55:19.120 | the models pay attention
00:55:20.240 | to like everything so well.
00:55:22.080 | Like sometimes it's hard
00:55:23.120 | to get them to like,
00:55:24.560 | I just said that once,
00:55:25.600 | please do not bring that up again.
00:55:27.280 | You know, it happens to me with code.
00:55:29.360 | Yeah, it happens to me
00:55:30.400 | with like a CSS style.
00:55:33.120 | Sometimes I like things like that.
00:55:34.400 | If I have a long conversation,
00:55:35.680 | it's like it tries to always
00:55:37.440 | reapply certain styles,
00:55:38.880 | even though I told it
00:55:40.320 | maybe that's not the right
00:55:41.280 | the right way to do it.
00:55:42.160 | But yeah, there's a lot again
00:55:45.760 | of empirical work
00:55:47.520 | that people will do.
00:55:48.320 | And just I know we kind of went through
00:55:51.520 | a lot of the technical side,
00:55:53.280 | but maybe the flip side is
00:55:55.440 | why is it worth doing?
00:55:57.360 | You know, like what are like
00:55:58.560 | the use cases that people have
00:56:00.480 | that make long context really useful?
00:56:03.040 | I know you had,
00:56:04.080 | I think you have a lot of
00:56:05.280 | healthcare use cases
00:56:06.240 | I saw on your Twitter.
00:56:07.120 | You just mentioned the finance use case.
00:56:09.280 | Obviously, some of the filings
00:56:11.440 | and documents that people,
00:56:12.640 | the companies publish
00:56:13.600 | can be quite wordy.
00:56:14.800 | Any other things
00:56:16.960 | that you want to bring up?
00:56:18.320 | Maybe how people are using gradient,
00:56:20.000 | anything like that.
00:56:20.800 | I think that will help
00:56:21.520 | have a clearer picture for people.
00:56:25.120 | Yeah, so beyond like
00:56:27.760 | just using the context for,
00:56:29.360 | you know, sessions
00:56:31.920 | and evolving state management,
00:56:33.600 | it really comes down
00:56:36.000 | to something that's fairly obvious,
00:56:37.280 | which everybody's trying to do
00:56:38.400 | and work on is
00:56:39.280 | how do you ground
00:56:40.080 | the language model better?
00:56:41.840 | So I think when you think pure text,
00:56:43.920 | that's one thing.
00:56:45.680 | But then multimodality
00:56:47.680 | is, in my opinion,
00:56:50.960 | going to be pivotal
00:56:52.320 | for long context,
00:56:53.760 | just because like videos
00:56:55.200 | when you're getting into
00:56:57.120 | the frames per second
00:56:57.920 | and you're getting into lots of images
00:57:01.760 | and like things that are
00:57:04.080 | a lot more like embodied,
00:57:05.360 | you need to utilize
00:57:07.680 | and leverage way more,
00:57:09.680 | way more tokens.
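A back-of-the-envelope on why video drives context requirements; the tokens-per-frame figure is an illustrative assumption in the range of current vision encoders, not a measurement of any particular model.

```python
# Why video pushes context length: token counts below are illustrative
# assumptions (a few hundred visual tokens per frame), not measurements.

def video_tokens(minutes: float, fps: float = 1.0, tokens_per_frame: int = 576) -> int:
    frames = minutes * 60 * fps
    return int(frames * tokens_per_frame)

for mins in (5, 60, 120):
    print(f"{mins:>4} min @ 1 fps -> ~{video_tokens(mins):,} visual tokens")
# Even sampling one frame per second, a two-hour video lands in the millions.
```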
00:57:10.480 | And that is probably where,
00:57:14.240 | you know, us as a company,
00:57:15.600 | like we're exploring more
00:57:17.200 | and trying to open up the doors
00:57:20.880 | for a lot more use cases
00:57:22.480 | because I think in financial services,
00:57:26.560 | as well as health care,
00:57:27.920 | we've done a good job
00:57:29.920 | on the tech side,
00:57:30.640 | but we still need to push
00:57:32.480 | a little bit further
00:57:33.680 | when we combined like,
00:57:34.960 | you know, a picture with words,
00:57:37.920 | like a chart with words
00:57:39.520 | or somebody's medical image
00:57:43.040 | with words, stuff like that.
00:57:44.960 | Like you definitely
00:57:46.160 | can do a better job.
00:57:47.120 | And, you know, it's timely too,
00:57:50.240 | because Meta just released
00:57:51.200 | their Chameleon paper,
00:57:52.160 | the new Chameleon paper
00:57:54.080 | that does multimodal training.
00:57:55.920 | And it shows that early fusion
00:57:57.520 | helps you to,
00:57:58.480 | it's like more sample efficient, right?
00:58:00.320 | So having that kind of
00:58:02.560 | view towards the future
00:58:04.000 | is something that,
00:58:04.880 | you know, we want to be primed to do
00:58:08.800 | because, you know,
00:58:09.760 | it's similar to what Sam Altman
00:58:12.400 | says himself too, right?
00:58:13.680 | Like you need to just assume
00:58:15.360 | that these models
00:58:15.920 | are going to be 10x better
00:58:17.040 | in the next few years.
00:58:19.040 | And if you are primed for that,
00:58:20.560 | like that's where
00:58:21.200 | you have kind of a business
00:58:23.440 | that, you know,
00:58:24.560 | you're not just pivoting
00:58:26.400 | after every release
00:58:27.760 | or every event,
00:58:29.760 | you know, that drops.
00:58:31.760 | I think the thing
00:58:32.720 | about this 10x issue
00:58:34.320 | is that the 10x direction
00:58:37.440 | moves all the time.
00:58:38.560 | You know, some people
00:58:40.240 | were complaining about GPT-4o
00:58:42.160 | that, yeah, look,
00:58:43.840 | like the Elo scores
00:58:45.440 | for GPT-4o actually in reality
00:58:47.280 | weren't that much higher
00:58:48.320 | than GPT-4 Turbo.
00:58:50.000 | And really the, you know,
00:58:51.040 | so it's not 10x better in reasoning.
00:58:52.960 | It's just 10x better
00:58:54.080 | in the integration
00:58:55.440 | of multiple modalities.
00:58:57.440 | And by the way,
00:58:58.720 | look over here,
00:58:59.280 | there's a really sexy voice chat app
00:59:01.440 | that they accidentally made
00:59:03.440 | that they had to deprecate today.
00:59:05.040 | It's like the 10x direction
00:59:09.040 | keeps moving.
00:59:09.840 | Now it's like, you know,
00:59:10.800 | fully in like sort of
00:59:11.600 | multi-modality land, right?
00:59:12.880 | And like the question
00:59:14.480 | is like what next, right?
00:59:15.280 | Like, so you can 10x
00:59:17.120 | in various ways,
00:59:18.320 | but like you guys
00:59:19.600 | have 10x context length.
00:59:22.160 | But like, you know,
00:59:22.960 | are we chasing the last war?
00:59:25.120 | Because like now like nobody cares
00:59:26.720 | about context length
00:59:27.360 | and now it's like
00:59:28.560 | multi-modality time, you know.
00:59:30.240 | I'm joking, obviously,
00:59:31.040 | people do care about it.
00:59:32.000 | I just, I wonder about this,
00:59:33.840 | how this comment
00:59:36.080 | about this 10x thing
00:59:37.040 | every single time.
00:59:37.760 | You know, that's honestly
00:59:39.360 | why we kind of have our eye
00:59:41.280 | on the community
00:59:42.640 | as well as you, right?
00:59:43.600 | Like you, you know,
00:59:45.760 | with your community
00:59:46.800 | and the things that you hear,
00:59:48.160 | you know, you want to build,
00:59:51.360 | where, you know, we're a product company,
00:59:52.960 | we're trying to build for users
00:59:54.720 | and trying to listen
00:59:56.480 | to understand like what they,
00:59:59.280 | what they actually need.
01:00:00.240 | Like, obviously,
01:00:01.040 | you know, you don't,
01:00:02.320 | you don't build everything
01:00:03.200 | that people ask you to build,
01:00:05.360 | but know what's useful, right?
01:00:08.240 | Because I think that
01:00:09.040 | you're totally right there.
01:00:11.280 | If we want to make something
01:00:13.360 | 10x better in a certain direction,
01:00:15.920 | but nobody cares
01:00:16.800 | and it's not useful for somebody,
01:00:18.640 | then it wasn't really worth the,
01:00:21.520 | worth the while.
01:00:22.720 | And if anything, maybe that's like
01:00:24.320 | bitter, the bitter lesson 2.0
01:00:26.800 | for so many tech startups
01:00:28.480 | is like build technology
01:00:29.920 | that people care about
01:00:31.040 | and will actually 10x their value
01:00:33.360 | rather than like build technology
01:00:35.680 | that's just, that's just 10x harder.
01:00:37.840 | I mean, no, that's not,
01:00:38.880 | that's not a bitter lesson.
01:00:39.840 | That's just Paul Graham.
01:00:40.960 | That's, that's, yeah.
01:00:42.080 | One more thing on the Chameleon paper.
01:00:44.960 | I was actually just about
01:00:45.840 | to bring that up, you know?
01:00:46.560 | So on AI News,
01:00:47.760 | like my sort of daily newsletter,
01:00:49.360 | it was literally my most,
01:00:50.640 | my most recent featured paper.
01:00:52.560 | And I always wonder
01:00:54.720 | if you can actually sort of
01:00:56.080 | like train images
01:00:58.240 | onto the same data space as words.
01:00:59.920 | That was kind of done with like,
01:01:01.760 | you know, what we now call
01:01:02.800 | late fusion models with like lava
01:01:04.960 | and flamingo and,
01:01:07.680 | you know, all the others.
01:01:08.880 | But now the early fusion models
01:01:10.720 | like Chameleon
01:01:11.840 | seem to be the way forward.
01:01:13.200 | Like, obviously it's more native.
01:01:15.520 | I wonder if you guys can figure out
01:01:17.680 | some kind of weird technique
01:01:18.800 | where you can take an existing
01:01:20.080 | like Lama 3 model
01:01:21.120 | and like, you know,
01:01:22.560 | early fuse the images
01:01:24.560 | into the text encoder
01:01:26.560 | so that we just retroactively
01:01:29.440 | have the early fusion models.
01:01:31.040 | Yeah.
01:01:32.180 | Even before the early,
01:01:34.640 | you know, the Chameleon paper came out,
01:01:36.320 | I think that was on our big board
01:01:37.600 | of next to do's to possibly explore
01:01:40.880 | or our backlog of ideas, right?
01:01:44.880 | Because as you said, early fusion,
01:01:48.640 | like even before this paper,
01:01:50.080 | I can't remember.
01:01:51.200 | I think Meta even had like a scaling laws
01:01:54.160 | for multimodality paper
01:01:56.240 | that does explore more early fusion.
01:01:58.480 | Like the moment we saw that
01:02:00.400 | it was just kind of obvious to us
01:02:01.840 | that eventually it'll get to the point
01:02:05.280 | that becomes a little bit more mainstream.
01:02:07.280 | And yeah, like that's a cool twist
01:02:10.320 | that we've been thinking about too as well,
01:02:12.880 | as well as like other things
01:02:14.560 | that are kind of in the works
01:02:15.920 | that are a little bit more agentic.
01:02:17.200 | But yeah, if open collaboration interests you,
01:02:21.040 | we can always work on that
01:02:22.560 | together with the community.
01:02:24.080 | Ooh, okay.
01:02:25.120 | Shout out there.
01:02:25.760 | Cool.
01:02:27.840 | Well, you can leave that
01:02:28.960 | in the call to action at the end.
01:02:30.400 | I just want to, you know,
01:02:31.280 | we have a couple more questions
01:02:32.240 | to round this out.
01:02:33.280 | You mentioned a lot of papers in your work.
01:02:36.320 | You're also building a company.
01:02:37.600 | You're also looking at open source
01:02:39.040 | projects and community.
01:02:41.040 | What is your daily or weekly routine
01:02:42.880 | to keep on top of AI?
01:02:43.920 | So one, subscribe to AI News.
01:02:50.480 | He didn't have to pay me to say that.
01:02:51.760 | I actually think like it's a good aggregator.
01:02:54.400 | I think it's a good aggregator.
01:02:56.640 | I'll tell you why.
01:02:57.360 | Most of the fastest moving like
01:03:01.200 | research that's being done out there
01:03:04.560 | is like it's showing up.
01:03:06.240 | It's mostly on Twitter.
01:03:07.200 | Like my Twitter is like,
01:03:08.480 | I wasn't a power Twitter user at all.
01:03:10.880 | Before three years ago,
01:03:12.320 | but I had to use it
01:03:14.480 | and I had to always check it
01:03:15.920 | in order to keep on top of like early work,
01:03:19.040 | right?
01:03:19.280 | That people want to talk about or present
01:03:21.440 | because nothing against
01:03:24.000 | submitting research papers
01:03:26.320 | to like ICLR, ICML,
01:03:28.960 | like knowing the state of the art,
01:03:30.400 | like those are like six months late, right?
01:03:34.800 | Like people have already
01:03:36.160 | dropped it on arXiv,
01:03:37.120 | or they're just openly talking about it.
01:03:38.960 | The submission process.
01:03:40.560 | Yeah.
01:03:40.880 | Yeah.
01:03:41.120 | And then being on discord to see
01:03:43.760 | when the rubber hits the road, right?
01:03:46.800 | Like the implementations
01:03:48.480 | and the practices that are being done
01:03:51.600 | or like the data sets, like you said,
01:03:54.560 | like a lot of conversations
01:03:57.120 | about really good data sets
01:03:58.880 | and how do you construct them
01:04:00.160 | are done in the open
01:04:02.560 | in figuring that out
01:04:03.440 | for people that don't have like
01:04:05.120 | budgets of like $10 million
01:04:06.480 | to just pay a bunch of annotators.
01:04:09.440 | So my routine daily is like,
01:04:12.080 | second thing I do when I wake up
01:04:13.840 | is to look on Twitter
01:04:14.960 | to see what the latest updates are
01:04:20.160 | from specific people
01:04:22.480 | that do really, really great work.
01:04:23.760 | Armen at Meta
01:04:26.720 | who did the Chameleon paper
01:04:28.400 | is like everything he writes
01:04:30.480 | on Twitter is like gold.
01:04:31.760 | So like anytime he writes something there,
01:04:33.440 | like I really try to figure out
01:04:34.800 | what he's actually saying there
01:04:37.360 | and then tie it to techniques
01:04:39.200 | and research papers out there.
01:04:40.640 | And then sometimes I try to use certain tools,
01:04:45.440 | like I myself use AI itself
01:04:47.440 | to search for the latest papers
01:04:52.240 | on a specific topic,
01:04:53.360 | if that's the thing on the top of my mind.
01:04:55.840 | And at the end of the day,
01:04:57.440 | trying out the products too.
01:05:00.240 | I think if you do not try out the tooling
01:05:03.200 | and some of the products out there,
01:05:05.280 | like you are missing out on
01:05:07.200 | someone's compression algorithm.
01:05:10.080 | Like they compressed all the research out there
01:05:13.360 | and all the thought
01:05:14.080 | and all the state of the art
01:05:15.520 | into a product that they're trying to create for you.
01:05:18.480 | And then like really backing out
01:05:20.160 | in reverse engineering,
01:05:21.200 | like what it took to build something like that.
01:05:23.200 | Like that's huge, right?
01:05:26.960 | Like if you can actually understand
01:05:28.240 | like perplexity, for instance,
01:05:30.000 | like you'll already be well ahead on the research.
01:05:33.360 | - Oh, by the way,
01:05:34.480 | you mentioned what's a good perplexity score?
01:05:37.520 | If there's like just a number, right?
01:05:38.880 | Like it's like five to eight or something.
01:05:40.960 | Like do you have a number in mind when you said that?
01:05:45.200 | - Yeah, I mean, what was the one that we had?
01:05:48.800 | Flipping between train loss and perplexity
01:05:51.440 | is actually not native to me quite yet.
01:05:53.600 | But like, yeah, between like,
01:05:55.120 | if you can get like a four
01:05:56.240 | using the context length extension on LLAMA,
01:06:01.600 | using the context length extension on Llama,
01:06:02.880 | And then obviously you'll see spikes.
01:06:04.320 | And specifically when the one trick
01:06:08.080 | you should pay attention to is,
01:06:09.680 | you know that your context length
01:06:14.960 | and theta scaling is working right.
01:06:16.960 | If the early steps in the perplexity go straight down.
01:06:19.600 | So like when it wasn't correct,
01:06:21.040 | it would oscillate a lot in the beginning.
01:06:23.760 | And we just knew that we cut the training short
01:06:26.240 | and then retry a new theta scale.
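For reference on the numbers being discussed: perplexity is just the exponential of the mean token-level cross-entropy, so a loss around 1.39 nats is a perplexity around 4, and the early-step sanity check described here is easy to automate. The oscillation threshold below is an arbitrary illustration, not a tuned value.

```python
import math

# Perplexity is exp(mean token-level cross-entropy), so a train loss of ~1.39
# nats corresponds to a perplexity of ~4, the ballpark mentioned here.

def perplexity(mean_nll: float) -> float:
    return math.exp(mean_nll)

print(perplexity(1.386))   # ~4.0

def theta_scale_looks_right(early_ppls, max_upticks=1):
    """Heuristic from the discussion: with a good theta scale, perplexity should
    fall almost monotonically over the first steps; lots of upward oscillation
    is a signal to stop the run and retry a new scale. The threshold here is an
    arbitrary illustration, not a tuned value."""
    upticks = sum(1 for a, b in zip(early_ppls, early_ppls[1:]) if b > a)
    return upticks <= max_upticks

print(theta_scale_looks_right([9.1, 7.0, 5.8, 5.1, 4.6]))    # True: smooth descent
print(theta_scale_looks_right([9.1, 12.4, 8.0, 13.2, 7.5]))  # False: oscillating
```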
01:06:27.680 | - Because in effect,
01:06:30.000 | you're properly continuing the fine-tuning
01:06:32.480 | or the full retraining.
01:06:33.600 | - Yeah, yeah.
01:06:34.320 | The model just like,
01:06:35.360 | it saw something out of domain immediately
01:06:37.600 | and was like, I have no idea what to do.
01:06:40.160 | And you need it to be able to overlap
01:06:43.200 | that positional embedding on top of each other.
01:06:46.720 | - One follow up, right?
01:06:47.840 | Before we sort of close out.
01:06:49.200 | Like, I think being on Twitter
01:06:53.520 | and like looking at all these new headlines
01:06:56.240 | is really helpful.
01:06:57.120 | But then it only gets you
01:06:59.120 | like a very surface level understanding.
01:07:00.880 | Then you still need a process to decide
01:07:03.200 | which one to invest in.
01:07:04.240 | So I'm trying to dig for like,
01:07:06.880 | what is your formula for like deciding,
01:07:09.840 | you know, what to go deep on
01:07:11.040 | and what to kind of skip.
01:07:12.560 | - From a practical standpoint,
01:07:14.560 | as a company,
01:07:15.360 | like I already know there are like three to five things
01:07:21.280 | that will be valuable and useful to us.
01:07:23.200 | And then there's other stuff that's like out of scope
01:07:25.280 | for different reasons.
01:07:28.320 | Some stuff is like out of scope from,
01:07:30.000 | hey, this is not going to impact or help us.
01:07:34.240 | And then other things are out of scope
01:07:35.600 | because we can't do it.
01:07:36.640 | You know, like the stuff like different tech.
01:07:40.560 | So a really good instance for that is
01:07:43.120 | specific algorithms for,
01:07:47.760 | you know, improving extremely large scale
01:07:52.560 | distributed training.
01:07:53.440 | Like that's one where we're not gonna have the opportunity
01:07:56.560 | to get 2,000 H100s.
01:07:59.520 | If we do, it'd be really cool.
01:08:01.360 | But like, I'm just saying like, as for now,
01:08:03.680 | like you gotta reach for the things
01:08:05.360 | that would be useful.
01:08:06.560 | Things that would be useful for us,
01:08:08.320 | for instance,
01:08:08.960 | for everybody actually, to be honest,
01:08:12.560 | is like evaluations,
01:08:14.720 | different post-training techniques,
01:08:17.520 | and then synthetic data construction.
01:08:22.000 | Like we're always on the,
01:08:23.440 | I'm always on the look for that.
01:08:24.480 | And then how do I figure out
01:08:25.760 | where there are these things?
01:08:26.880 | You know, which new piece of news
01:08:30.640 | is actually novel?
01:08:31.680 | Well, that's sort of my like mental cache
01:08:35.920 | to a certain extent.
01:08:36.720 | Like I've built up like this state of like,
01:08:38.560 | I already know like all the things
01:08:40.640 | that have already been written
01:08:41.760 | for the state of the art
01:08:43.920 | for certain topic areas.
01:08:46.560 | And then I know what's being kind of recycled
01:08:49.280 | as like an empirical study
01:08:50.800 | versus like something that
01:08:52.560 | actually is very insightful.
01:08:54.480 | Underrated specific instance
01:08:57.520 | would be like the DeepSeek paper.
01:09:00.400 | I'd never seen it before,
01:09:01.680 | but like the multi-head latent attention,
01:09:05.280 | like that was really unexpected to me
01:09:08.320 | because like I thought I'd seen every type,
01:09:12.320 | not every type, obviously,
01:09:13.520 | but like every way that people wanted to cut
01:09:15.760 | like mixture of experts into interesting ways.
01:09:18.240 | And I never thought something
01:09:19.280 | would like catch my eye to be like,
01:09:20.800 | oh, this is totally new.
01:09:23.280 | And it really does have a lot of value.
01:09:25.520 | Yeah, so like, I think that's mainly
01:09:30.000 | how I try to do it.
01:09:32.880 | And like you talk to your network too.
01:09:35.920 | Like I just, you know,
01:09:38.160 | talk to the people and then know
01:09:39.520 | and make sure like I have
01:09:41.360 | certain subject matter experts
01:09:43.680 | on SpeedDial that I also like
01:09:48.400 | to share information with
01:09:49.760 | and understand like,
01:09:52.000 | hey, does this catch your eye too?
01:09:56.080 | Do you think this is valuable or real?
01:09:58.560 | 'Cause yeah, right, Shawn,
01:10:00.400 | it's a noisy space we're in right now,
01:10:02.000 | which is cool 'cause it's really interesting
01:10:05.440 | and people are excited about it.
01:10:07.600 | But at the same time,
01:10:08.880 | there is actually a 10X
01:10:11.920 | or more explosion of information coming in
01:10:14.800 | that all sounds really, really unique and new.
01:10:18.320 | And you could spend like hours,
01:10:20.800 | you know, down a rabbit hole
01:10:22.000 | that isn't that useful.
01:10:23.520 | Awesome, Mark, I know we kept you
01:10:25.280 | in the studio for a long time.
01:10:26.480 | Any final call to actions for folks
01:10:29.120 | that could be roles you're hiring for,
01:10:31.440 | requests for startups,
01:10:33.200 | anything that comes to mind
01:10:35.520 | that you want to share with the audience?
01:10:37.280 | Yeah, I think on the line of
01:10:39.280 | we definitely have a call to action
01:10:42.960 | to get more people to work together with us
01:10:45.440 | for long context evaluations.
01:10:49.040 | That is sort of the it topic
01:10:52.720 | that everyone,
01:10:55.200 | like even Meta or Google
01:10:56.880 | or any of the other folks, are focusing on.
01:11:00.080 | 'Cause I think we lack an understanding
01:11:02.560 | of that within the community.
01:11:03.920 | And then can we as a community
01:11:07.040 | also help to construct
01:11:08.800 | like other modalities of datasets
01:11:11.200 | that would be interesting,
01:11:12.800 | like pairwise datasets, right?
01:11:15.600 | Like you could get just straight video
01:11:17.440 | and then straight text,
01:11:18.160 | but like getting them together
01:11:20.080 | that have like for grounding purposes
01:11:23.920 | will be really useful
01:11:25.040 | for training the next set of models
01:11:27.760 | that I know are coming out.
01:11:29.440 | And the more people
01:11:31.520 | we have contributing to that
01:11:32.800 | would be really useful.
01:11:34.160 | Awesome, thank you so much for coming on, Mark.
01:11:37.520 | This was a lot of fun.
01:11:38.480 | Yeah, thanks a lot.
01:11:39.280 | Yeah, this is great.
01:11:41.040 | (upbeat music)