
How to train a Million Context LLM — with Mark Huang of Gradient.ai


Chapters

0:00 Introductions
1:30 Founding story of Gradient and its mission
4:35 Minimum viable agents
9:19 Differentiating ML and AI, focusing on out-of-domain generalization
10:12 Extending Llama 3 to 1M tokens
14:32 Technical challenges with long context sequences
17:45 Data quality and the importance of diverse datasets
19:45 What's a theta value?
22:42 RoPE vs Ring Attention vs ALiBi vs YaRN
25:06 Why Ring Attention matters
28:01 How to refine datasets for context extension
33:34 Multi-stage training data and avoiding overfitting to recent data
34:27 The potential of using synthetic data in training
38:22 Applying LoRA adapters to extend model capabilities
42:25 Benchmarking long context models and evaluating their performance
47:20 Pushing to 4M context and output quality degradation
50:08 What do you need this context for?
52:57 Impact of long context in chat vs Docs Summarization
56:25 Future directions for long context models and multimodality
59:38 How do you know what research matters?
62:47 Routine for staying updated with AI research and industry news
65:33 Deciding which AI developments to invest time in
70:37 Request for collaboration and dataset construction for long context

Whisper Transcript

00:00:00.880 | Hey, everyone. Welcome to the Latent Space podcast.
00:00:03.760 | This is Alessio, partner and CTO in Residence at Decibel Partners,
00:00:07.360 | and I'm joined by my co-host, Swyx, founder of Smol.ai.
00:00:10.960 | Hey, and today we're in the remote studio with Mark Huang from Gradient.
00:00:14.240 | Welcome, Mark.
00:00:15.360 | Hey, glad to be here.
00:00:17.600 | It's really, you know, a great experience to be able to talk with you all.
00:00:21.840 | I know your podcast is really, really interesting,
00:00:24.720 | and I always am listening to it every time you guys have a release.
00:00:29.200 | He's not a paid actor.
00:00:30.880 | He said that out of his own will.
00:00:32.240 | We'll give you the check later.
00:00:35.040 | So, Mark, you're unusual in the sense that you and I go back to college.
00:00:39.520 | I don't exactly remember where we overlapped,
00:00:42.640 | but, you know, we both went to Wharton
00:00:46.000 | and went into the sort of quantitative developer realm.
00:00:49.760 | Yeah, exactly.
00:00:50.560 | Kind of crazy, right?
00:00:53.120 | All goes full circle.
00:00:54.480 | I was a quant for quite a few years
00:00:57.680 | and then made it out into Silicon Valley.
00:01:01.680 | And now we intersect again when it kind of feels like more or less the same, right?
00:01:07.520 | Like the AI wars, the trading wars back in the day, too,
00:01:10.720 | to a certain extent, and the grab for talent.
00:01:13.280 | Yeah, I think there's definitely a few of us ex-finance people
00:01:17.200 | moving into tech and then finding ourselves
00:01:19.440 | gravitating towards data and AI.
00:01:22.720 | It seems like you did that.
00:01:23.600 | You were at a bunch of sort of quant trading shops,
00:01:27.200 | but then as you moved to tech, you were a lead data scientist at Box
00:01:30.960 | and staff ML scientist at Splunk.
00:01:32.800 | And then before working on the startup that eventually became Gradient.
00:01:37.920 | You want to tell that story?
00:01:38.880 | Yeah, I think part of the reason why I came over from the quant finance world
00:01:46.240 | is to get more collaboration,
00:01:48.320 | learn about what big data and scaling machine learning really looks like
00:01:56.080 | when you're not in this bubble, right?
00:01:58.800 | And working at Box, I worked mostly in a cross-functional role,
00:02:04.800 | helping product analytics and go to market.
00:02:08.240 | And then at Splunk, it was a lot more of a specific role
00:02:13.120 | where I was helping with streaming analytics and search and deep learning.
00:02:19.600 | And for Gradient, really why we started it was
00:02:24.720 | whether it was in finance or whether it was in tech,
00:02:27.440 | I always noticed that there was a little bit more to give
00:02:31.040 | in terms of what AI or ML could contribute to the business.
00:02:36.400 | And we came at a really good time with respect to wanting to
00:02:40.720 | bring the full value of what that could be into the enterprise.
00:02:47.120 | And then obviously OpenAI created this huge vacuum
00:02:51.680 | into the industry to allow for that, right?
00:02:54.480 | So I myself felt really, really empowered to actually ship a product
00:03:00.720 | and ship stuff that I could think could really help people.
00:03:03.760 | Maybe just to touch a little bit on Gradient,
00:03:06.720 | I know we have a lot of things to go through Gradient,
00:03:09.280 | Llama 3, context extension, there's a lot,
00:03:12.320 | but what exactly is Gradient?
00:03:13.600 | And you have an awesome design on your website.
00:03:16.400 | It's really retro.
00:03:17.520 | And I think people that are watching Fallout on Amazon Prime right now
00:03:21.440 | can maybe feel nostalgia just looking at it.
00:03:24.800 | What exactly is it?
00:03:26.880 | Because I know you have the foundry, you have the agent SDK,
00:03:29.520 | there's a lot of pieces into it.
00:03:31.440 | Yeah, for sure.
00:03:32.160 | And I appreciate the call out for the design.
00:03:35.840 | I know my co-founder, Chris, spent a lot of thought
00:03:39.200 | in terms of how he wanted the aesthetic to look like.
00:03:41.600 | And it reminds me a lot about Mad Men.
00:03:44.560 | So that was the initial emotional shape that I felt when I saw it.
00:03:50.640 | Well, quite simply, Gradient, we're a full stack AI platform.
00:03:56.480 | And what we really want to do is we want to enable
00:03:59.280 | all of the RPA workloads or the codified automation workloads
00:04:06.000 | that existed in enterprise before.
00:04:08.000 | We really want to enable people to transition
00:04:11.280 | into more autonomous, agentic workflows that are less brittle,
00:04:16.320 | feel more seamless as an interface too.
00:04:20.480 | So and able to empower what we really think
00:04:24.160 | the new AI workforce should look like.
00:04:29.040 | And that kind of required us to build
00:04:32.320 | a fairly horizontal platform for those purposes.
00:04:35.120 | We had this discussion at our AI in Action Club on Discord,
00:04:38.480 | like the minimum viable agent,
00:04:40.400 | or like kind of how you define an agent.
00:04:42.160 | Yeah, in your mind, what is the minimum thing
00:04:46.800 | that you can call actually an agent
00:04:48.400 | and not just like a for loop, you know?
00:04:50.880 | And how do you see the evolution over time,
00:04:53.840 | especially as people adopt it more and more?
00:04:55.760 | Yeah, so I kind of stage it where everybody,
00:05:03.200 | first of all, at the lowest level,
00:05:04.960 | thinks about like non-determinism
00:05:06.880 | with respect to how the pipeline looks like when it's executed.
00:05:10.400 | But even beyond that,
00:05:11.520 | this goes back into effectively evaluations.
00:05:15.360 | It's like on each stage of the node,
00:05:17.680 | you're going to have to see a marginal improvement
00:05:20.320 | in the probability of success for that particular workload
00:05:23.520 | because of non-determinism.
00:05:25.920 | So yeah, I think it is an overloaded term to a certain extent
00:05:30.800 | because like everything is an agent
00:05:32.560 | if it calls a language model
00:05:34.720 | or any sort of multimodal model these days.
00:05:36.960 | But for us, it's like, you know, my background is statistics.
00:05:40.800 | So I want to see like improvements in the probability
00:05:43.600 | of the success event or outcome happening
00:05:46.720 | because of more nodes.
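To make the compounding point concrete, here is a minimal sketch of how per-node success probabilities multiply across an agent pipeline; the per-node reliabilities below are made up.

```python
# Toy illustration of how non-determinism compounds across an agent pipeline:
# the end-to-end success probability is the product of per-node reliabilities,
# so every added node has to earn its keep. Numbers below are made up.
per_node_success = [0.95, 0.90, 0.92, 0.97]

pipeline_success = 1.0
for p in per_node_success:
    pipeline_success *= p

print(f"end-to-end success: {pipeline_success:.1%}")  # about 76% despite decent nodes
```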
00:05:48.320 | Yeah, I think, you know,
00:05:49.280 | the one thing that makes this sort of generative AI era
00:05:54.000 | very different from the sort of data science-y type era
00:05:56.880 | is that it is very non-deterministic
00:05:59.200 | and it's hard to control.
00:06:01.040 | Yeah, I mean, so like, you know,
00:06:04.320 | I think what's the founding story of Gradient?
00:06:07.680 | Like how, you know, of all the problems that you chose,
00:06:11.200 | like why choose this one?
00:06:14.000 | You know, how did you get together your co-founders,
00:06:16.800 | anything like that, that bring us up to the present day?
00:06:20.240 | One of my co-founders is Chris
00:06:21.520 | and he's a really good friend of mine as well.
00:06:23.680 | I don't know if you intersected with him at Penn as well,
00:06:26.000 | but yeah, Chris Chang, he was at Penn as well,
00:06:30.480 | did banking for maybe one or two years
00:06:34.080 | and then, you know, was a software engineer at Meta,
00:06:38.160 | also was at Google.
00:06:40.240 | And then most recently,
00:06:42.000 | he was like a director at Netflix in product.
00:06:44.720 | And we always wanted to do something together,
00:06:48.480 | but we felt the, you know, what really came to fruition
00:06:51.920 | was wanting to develop something
00:06:54.080 | that is enterprise-facing for once,
00:06:57.120 | mostly because of our experience with internal tooling
00:07:02.000 | and inability for something to like,
00:07:06.880 | basically exist through like a migration, right?
00:07:10.240 | Like all the time with every ML platform
00:07:13.360 | that I've ever had to experience or he had to experience,
00:07:16.080 | it's like a rebuild and you rip it out
00:07:18.240 | and you have a new workflow or automation come in.
00:07:20.880 | And it's this huge multi-quarter,
00:07:23.440 | maybe even multi-year project to do that.
00:07:26.960 | And we also teamed up with a former coworker of Chris's
00:07:32.400 | from Opendoor, Forest,
00:07:33.680 | who was also on Google Cloud Platform.
00:07:37.600 | And, you know, him seeing the scale
00:07:40.320 | and actually the state of the art
00:07:44.720 | in terms of Google was using AI for systems
00:07:48.400 | before everybody else too, right?
00:07:50.000 | They invented a transformer
00:07:51.600 | and their internal set of tooling
00:07:54.480 | was just so far superior to everything else.
00:07:56.640 | Like it's really hard for people to go back
00:07:58.720 | after seeing that.
00:07:59.920 | So what we really wanted was to reduce that friction
00:08:05.440 | for like actually shipping workloads in product value
00:08:12.080 | when you have all these like types of operational frictions
00:08:16.640 | that happen inside of these large enterprises.
00:08:20.720 | And then really like the main pivot point for all of it
00:08:26.320 | was like you said,
00:08:27.760 | things that can handle out of domain problems.
00:08:30.960 | So like out of domain data that comes in,
00:08:32.960 | having the flexibility to not fall over
00:08:36.160 | and having something that you build over time
00:08:41.280 | that continues to improve.
00:08:42.960 | Like machine learning is about learning.
00:08:45.440 | And I feel like a lot of systems back in the place,
00:08:48.080 | they were learning a very specific objective function,
00:08:52.880 | but they weren't really natively learning with the user.
00:08:56.880 | So like that's the whole,
00:08:58.560 | we use the term assistant all the time,
00:09:01.840 | but my vision for the assistant
00:09:06.000 | was always for the system to grow alongside me, right?
00:09:10.000 | Like almost like an embodied second limb
00:09:15.360 | or something that will be able to get better
00:09:17.040 | as you also learn yourself.
00:09:19.440 | Yeah, I might maybe call it,
00:09:21.520 | people always trying to define a difference between ML and AI.
00:09:26.560 | And I think in AI,
00:09:28.640 | we definitely care a lot more about
00:09:30.320 | out of domain generalization.
00:09:31.840 | And that's all under the umbrella of learning,
00:09:35.120 | but it is a very specific kind of learning.
00:09:36.880 | I'm going to try to make a segue
00:09:39.440 | into today's like main topic of conversation
00:09:42.560 | that's something that you've been blowing up on,
00:09:44.640 | which is the long context learning, right?
00:09:47.680 | Which is also some form of out of topic,
00:09:50.400 | out of distribution generalization.
00:09:52.720 | And in this context,
00:09:54.080 | you're extending the context window
00:09:56.400 | of an existing open source model.
00:09:58.000 | Maybe if you want to like,
00:10:00.400 | just bring us all the way back to it,
00:10:01.680 | towards like what got you interested in long context?
00:10:04.320 | Why did you find it like an interesting investment
00:10:08.320 | to work on?
00:10:08.880 | And then the story of how you did your first extensions.
00:10:12.000 | Yeah, I think it came,
00:10:15.840 | for Llama 3 specifically,
00:10:18.000 | we chose that model
00:10:20.080 | because of the main criticisms about it.
00:10:24.800 | Before when it first got released,
00:10:27.040 | 8,000 context lengths just seemed like it was too short
00:10:30.720 | because it seemed like Mistral
00:10:32.720 | and even Yi came out with like a 200,000 token
00:10:37.360 | context length model.
00:10:38.960 | But the really the inception of all of it was
00:10:45.040 | us like fine tuning so many models
00:10:48.640 | and working on RAG so much
00:10:51.200 | and having this,
00:10:53.200 | and it still exists today,
00:10:55.520 | this basically pedagogical debate with everybody
00:10:58.640 | who's like, "Hey, is it fine-tuning versus RAG?
00:11:00.640 | Is it this versus that?"
00:11:01.840 | And like, at the end of the day,
00:11:04.160 | it's just all meta learning, right?
00:11:06.640 | Like all we want is like the best meta learning workflow
00:11:09.840 | or meta learning setup possible
00:11:11.920 | to be able to adapt a model to do anything.
00:11:17.760 | So naturally, long context had a place in that,
00:11:22.400 | but nobody had really pushed the limits of it, right?
00:11:26.160 | Like you would see like 10 shot,
00:11:27.920 | maybe 100 shot prompting
00:11:29.520 | for improving the model's capabilities,
00:11:33.520 | but it wasn't until Google comes out with Gemini
00:11:37.040 | with the first 1 million context length model
00:11:39.520 | that a lot of people's jaws dropped
00:11:42.720 | and that hunger for understanding
00:11:46.400 | what that could really facilitate
00:11:48.240 | in the new workflows came about.
00:11:50.800 | So we were staged to actually train
00:11:53.360 | other open source models to do that.
00:11:56.800 | But the moment Llama3 came out,
00:11:58.880 | we just went ham against that specific model
00:12:02.480 | because the two things
00:12:04.160 | that were particularly appealing for that
00:12:07.200 | was the fact that like,
00:12:08.480 | I see a lot of these language models
00:12:09.920 | as compression algorithms to a certain extent,
00:12:12.080 | like the way we have like 15 trillion tokens
00:12:14.960 | into a specific model.
00:12:16.240 | That definitely made me feel like
00:12:19.360 | it would have a lot of capabilities
00:12:23.520 | and be more adaptable
00:12:26.400 | towards extending that context length.
00:12:28.160 | So we went in there
00:12:29.440 | and the 1 million number was always,
00:12:32.560 | that was more of just like put the North Star up there
00:12:36.720 | and see if we can get there.
00:12:38.720 | And then see what was happening along the way
00:12:41.840 | as we did that.
00:12:42.880 | So yeah, also shout out to Crusoe
00:12:46.720 | who facilitated all that compute
00:12:48.560 | because I would be lying
00:12:49.920 | if I was to say like,
00:12:51.200 | anyone could just go out and do it.
00:12:52.720 | It does require quite a bit of compute.
00:12:55.680 | It requires like a lot of preparation,
00:12:58.240 | but it just like all the stars
00:13:01.120 | kind of aligned for that moment
00:13:02.720 | for us to go after that problem.
00:13:05.120 | I'll take a side note on Crusoe
00:13:07.440 | since you just brought it up.
00:13:08.400 | Yeah, like, can you explain what Crusoe is?
00:13:11.440 | You know, I have this mental image
00:13:13.520 | of putting GPUs on top of oil rigs.
00:13:15.840 | What is it?
00:13:19.440 | What do they do?
00:13:20.400 | How do you work with them?
00:13:21.360 | You know, just anything nice.
00:13:23.680 | I'm sure they appreciate nice things
00:13:24.640 | that you say about them too.
00:13:25.600 | Oh, for sure, for sure.
00:13:27.040 | So they came to us
00:13:30.640 | through a collaborative effort
00:13:32.640 | where we basically were in search
00:13:34.400 | of a cloud, you know, a GPU provider.
00:13:38.800 | I don't want to call cloud service provider
00:13:40.720 | quite yet because then, you know,
00:13:42.080 | you think about hyperscalers,
00:13:43.200 | but for them, you know,
00:13:44.880 | they're one of the biggest
00:13:45.680 | alternative GPU cloud providers.
00:13:48.960 | And they were offering up
00:13:52.160 | like we want to do a collaboration
00:13:54.000 | to showcase their technology.
00:13:56.560 | And it just made it really easy
00:13:59.200 | for us to like scale up with their L40Ss.
00:14:02.160 | And those were the specific
00:14:04.000 | GPU instances we used.
00:14:05.360 | And coordinating that effort with them
00:14:08.800 | to get, you know,
00:14:10.160 | that dedicated cluster first
00:14:12.320 | to do the project.
00:14:13.760 | It became a really good relationship.
00:14:17.760 | And we still work with them today
00:14:19.200 | because like we're trying to evaluate
00:14:21.360 | more of these models
00:14:22.320 | and possibly train more of them.
00:14:24.160 | And anyone could go up to them
00:14:26.400 | and basically get your compute from them.
00:14:30.080 | And they have a lot of GPUs
00:14:32.320 | available for those type of projects.
00:14:34.560 | I would love to maybe have you run
00:14:37.120 | people through why the models
00:14:38.880 | don't come with longer context
00:14:40.560 | sequences out of the box.
00:14:41.840 | Like, obviously, you know,
00:14:44.400 | the TLDR is like self-attention.
00:14:46.320 | It's like quadratic scaling of memory.
00:14:48.080 | So the longer the context size,
00:14:50.000 | the more compute you have to spend
00:14:51.440 | the training time.
00:14:52.080 | at training time.
00:14:53.680 | Crusoe to help you.
00:14:54.960 | How do you actually train
00:14:58.320 | a large language model
00:14:59.520 | that is like a very long context?
00:15:00.720 | And then how does that differ
00:15:02.000 | from just tacking it on on top later?
00:15:05.040 | And then maybe we'll dive into performance
00:15:06.960 | and some of those things.
00:15:07.760 | But I think for a lot of folks
00:15:10.160 | in our audience that are more
00:15:11.680 | engineers, they use models,
00:15:12.960 | but don't necessarily build
00:15:14.480 | the models themselves.
00:15:15.920 | A lot of time, it's hard to understand
00:15:17.360 | what goes into actually making
00:15:19.040 | a long context model.
00:15:20.320 | Yeah, in terms of, you know,
00:15:22.880 | all the literature out there,
00:15:23.920 | I would say, honestly,
00:15:26.480 | it's probably still TBD
00:15:28.240 | as to like the tradeoffs
00:15:29.920 | between the approach we did,
00:15:31.920 | which is more of a curriculum
00:15:33.600 | learning approach after the fact
00:15:36.240 | versus inherently training
00:15:38.560 | a model with a long context throughout,
00:15:40.640 | because I just don't think people
00:15:42.400 | have looked at the scaling properties
00:15:44.080 | of it in deep, deep detail.
00:15:46.080 | But there are stylized facts
00:15:49.120 | out there in research papers
00:15:52.240 | from Meta themselves, actually.
00:15:53.680 | It was already shown in a paper
00:15:56.960 | that if you train a model
00:15:59.680 | on a shorter context
00:16:01.360 | and you progressively
00:16:02.320 | increase that context to like,
00:16:05.040 | you know, the final limit
00:16:06.960 | that you have, like 32K
00:16:08.800 | which is usually what the limit of Llama 2
00:16:11.280 | was.
00:16:12.240 | It actually performs better
00:16:16.240 | than if you try to train
00:16:17.920 | 32K the whole time.
00:16:19.840 | And I like to think about it
00:16:23.680 | intuitively as if you're trying
00:16:26.080 | to learn probability theory.
00:16:27.760 | You're not going to go
00:16:29.280 | and read the book cover to cover
00:16:30.560 | and then do all the exercises afterwards.
00:16:33.360 | What you're going to do
00:16:34.160 | is you're going to do each chapter,
00:16:36.000 | do an exercise,
00:16:36.960 | read the chapter, do an exercise,
00:16:39.600 | and then finish right
00:16:40.720 | with the final set of like holistic
00:16:43.200 | exercises or examination.
00:16:46.800 | So attention is exactly
00:16:49.680 | what it sounds like to a certain extent.
00:16:51.360 | Like you have a bunch of indices
00:16:54.480 | and you are making the model
00:16:56.000 | attend to localized contexts
00:16:58.080 | and concepts across
00:17:00.720 | the entirety of its encoding, right?
00:17:05.120 | Like whatever the text
00:17:06.240 | that the sequence
00:17:07.040 | that you're giving it.
00:17:08.240 | So when you're doing
00:17:10.880 | the curriculum learning
00:17:11.840 | aspect of things,
00:17:13.040 | you are kind of trying to
00:17:14.640 | give it the opportunity
00:17:17.520 | to also attend to all the concepts.
00:17:20.080 | So data actually in the curation,
00:17:22.400 | in the creation of that context,
00:17:24.160 | plays a huge role
00:17:25.600 | because a lot of times
00:17:28.080 | people make the mistake
00:17:28.960 | of trying to extend the context length
00:17:30.240 | by just giving it raw text
00:17:33.200 | that doesn't have
00:17:35.760 | the necessity for the model
00:17:39.040 | to go all the way
00:17:40.400 | in the beginning of the sequence
00:17:42.080 | and then connect an idea
00:17:43.520 | to the end of the sequence.
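To make the curriculum idea concrete, here is a minimal sketch of a staged context-extension schedule; the stage lengths, theta values, and step counts are illustrative placeholders, not Gradient's actual recipe.

```python
# Sketch of a progressive ("curriculum") context-extension schedule: train a
# few stages of increasing packed sequence length instead of jumping straight
# to the final length. All numbers here are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class Stage:
    context_len: int    # packed sequence length for this stage
    rope_theta: float   # RoPE base used while training at this length
    num_steps: int      # optimizer steps to run at this stage

SCHEDULE = [
    Stage(context_len=65_536,    rope_theta=1.6e7, num_steps=30),
    Stage(context_len=262_144,   rope_theta=2.1e8, num_steps=25),
    Stage(context_len=1_048_576, rope_theta=3.6e9, num_steps=20),
]

def train_stage(stage: Stage) -> None:
    """Placeholder for one stage of continued pre-training: re-pack the corpus
    to stage.context_len, set the model's RoPE base to stage.rope_theta, and
    run stage.num_steps of full fine-tuning (e.g. with ring + flash attention)."""
    print(f"train {stage.num_steps} steps at {stage.context_len:,} tokens "
          f"(theta={stage.rope_theta:.1e})")

for stage in SCHEDULE:
    train_stage(stage)
```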
00:17:45.120 | So data quality is one thing,
00:17:47.760 | but it sounds like
00:17:49.200 | as long as the base model
00:17:51.440 | is at least what is the work,
00:17:53.760 | like the one million contexts
00:17:54.880 | if Lama3 was 2K context size,
00:17:58.080 | is there like a minimum context size
00:17:59.600 | that you need to then
00:18:00.320 | be able to generalize?
00:18:01.600 | Or does it not really matter
00:18:03.840 | and the fine-tuning takes care of it?
00:18:05.680 | There's no minimum, I would say,
00:18:07.600 | or at least I can't make
00:18:11.040 | such a strong statement
00:18:12.000 | as to say that that does not exist.
00:18:13.680 | But if you have a 4K,
00:18:15.840 | any regular model out there,
00:18:17.840 | like if you can progressively increase
00:18:20.240 | the context length of it,
00:18:22.480 | so long as it has shown
00:18:24.800 | really good perplexity scores
00:18:27.040 | prior to your context length extension.
00:18:30.400 | So if it hasn't shown good perplexity,
00:18:33.520 | you basically can't even
00:18:35.040 | predict the next token,
00:18:36.000 | you're kind of like out of luck, right?
00:18:37.840 | But then from there,
00:18:40.480 | the other component
00:18:42.640 | that we actually just released
00:18:44.160 | the blog on maybe last Friday,
00:18:46.400 | it's like you got to pay attention
00:18:47.760 | to the theta value
00:18:50.880 | that the model starts off with.
00:18:52.800 | What was fairly unique
00:18:54.800 | about the Llama 3 model
00:18:56.080 | was their choice of the theta parameter,
00:18:59.840 | which gave some suspicion
00:19:02.560 | as to how long the context
00:19:04.720 | could be extended for the model.
00:19:07.040 | So that aspect of like,
00:19:10.640 | we can go into a huge lesson
00:19:15.120 | in terms of positional encodings
00:19:16.800 | and in rope scaling and stuff.
00:19:19.120 | But those concepts
00:19:22.160 | and that aspect of things
00:19:25.280 | enables you to scale out
00:19:27.760 | the length much more easily.
00:19:29.680 | - What's the TLDR
00:19:32.080 | of what the theta is for a model?
00:19:34.720 | If I haven't built a model before...
00:19:36.320 | - Yeah, yeah.
00:19:37.040 | - Not me, obviously I know what it is,
00:19:40.000 | but for people that don't know, right?
00:19:42.000 | I'm totally an expert.
00:19:44.000 | - Yeah, well,
00:19:46.960 | so not all models have it,
00:19:49.200 | but some models will employ
00:19:51.200 | rope scaling
00:19:55.520 | and specifically Llama 3 does that.
00:19:58.480 | But there's also other positional encoding
00:20:01.600 | and embedding mechanisms
00:20:02.960 | that other models employ.
00:20:04.560 | But TLDR is,
00:20:06.880 | if you think about most architectures,
00:20:10.640 | they employ basically like a,
00:20:14.480 | it's kind of like a sine or cosine
00:20:16.800 | curve.
00:20:17.680 | And you're thinking about like the different,
00:20:20.240 | you have the amplitudes that occur there
00:20:23.120 | to allow for the model
00:20:25.360 | to see different types of distributions of data.
00:20:28.880 | Really what the theta value does,
00:20:32.160 | it governs how often a pattern's
00:20:36.160 | going to appear in the embedding space.
00:20:39.120 | So you basically are able to
00:20:44.240 | shift that rotational curve
00:20:49.600 | by increasing the theta value
00:20:52.560 | and allow for different types of distributions
00:20:57.840 | to be seen as if they actually occurred
00:21:01.280 | in the training data before.
00:21:04.000 | So it's super confusing,
00:21:06.560 | but it's like there's positional extrapolation,
00:21:10.560 | and then there's interpolation.
00:21:11.680 | You want interpolation.
00:21:13.760 | It's been shown that just pure extrapolation
00:21:16.640 | makes the model a lot worse,
00:21:18.080 | and it's harder to attend to stuff.
00:21:20.080 | Whereas the interpolation is like
00:21:21.600 | you're squeezing everything back in
00:21:23.600 | to what the original context length was
00:21:26.080 | to a certain extent,
00:21:27.600 | and then allowing for it to overlap
00:21:31.120 | different sequences that it's already seen
00:21:34.320 | as if it actually occurred
00:21:36.160 | when you see a million contexts of sequence tokens.
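As a rough illustration of what the theta value controls, the sketch below computes the standard RoPE per-band rotation wavelengths for two bases (Llama 3's published base of 500,000 and a larger, made-up one); raising theta stretches the slower bands so a much longer sequence stays closer to rotation ranges the model originally saw, which is the interpolation intuition above.

```python
# Rough sketch of what the RoPE base ("theta") controls. The head dimension and
# the larger base value are illustrative; 500,000 is Llama 3's published base.
import numpy as np

def rope_wavelengths(theta: float, head_dim: int = 128) -> np.ndarray:
    """Per-band rotation wavelengths in tokens: 2*pi / theta^(-2i/d)."""
    i = np.arange(0, head_dim, 2)
    inv_freq = theta ** (-i / head_dim)
    return 2 * np.pi / inv_freq

original = rope_wavelengths(theta=500_000.0)  # base Llama 3
extended = rope_wavelengths(theta=5.0e7)      # made-up larger theta

# Raising theta stretches the slower bands' wavelengths, so positions far
# beyond the original 8K training length map onto rotation angles closer to
# ones the model has already seen, instead of extrapolating to unseen angles.
print(f"slowest band wavelength: {original[-1]:,.0f} -> {extended[-1]:,.0f} tokens")
```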
00:21:41.120 | So yeah, I think that aspect,
00:21:45.920 | we didn't know how well it would scale.
00:21:49.600 | I think that's one thing.
00:21:50.560 | So I'm not going to lie and tell you
00:21:53.520 | right off the bat,
00:21:54.080 | we're definitely going to hit a million.
00:21:56.080 | It was more like we're getting to 256,
00:21:59.200 | and it looked good.
00:22:00.400 | We did our evals.
00:22:01.920 | We scaled it more.
00:22:02.880 | And then what was really good
00:22:06.320 | was that we established the formula at the start.
00:22:10.480 | So it's actually a formula
00:22:11.760 | that we actually took from the paper.
00:22:16.000 | I think it's the rope scaling paper.
00:22:20.000 | And we looked at that particular formula,
00:22:22.160 | and then we backed out the values.
00:22:24.160 | And it's all empirical.
00:22:25.520 | So it's not like a mathematical tautology or proof.
00:22:29.680 | It's just like it's an empirical formula
00:22:31.760 | that actually worked really well.
00:22:33.120 | And then we just kept scaling it up,
00:22:34.400 | and it held.
00:22:35.760 | It's kind of like the scaling laws.
00:22:36.960 | You know the scaling laws exist,
00:22:39.760 | but you don't know if they're going to continue.
00:22:41.680 | So yeah.
00:22:42.480 | Are you able to compare it
00:22:44.480 | with other forms of scaling
00:22:47.040 | that people have been talking about?
00:22:48.560 | ALiBi comes to mind.
00:22:50.640 | YaRN is being talked about a lot by Nous Research.
00:22:54.160 | And then there's other forms
00:22:56.720 | which are not exactly directly related,
00:22:58.960 | but ring attention comes up a lot.
00:23:00.640 | We had a really good session with Strong Compute
00:23:03.360 | in the Latent Space Discord
00:23:05.440 | talking about all these approaches.
00:23:07.440 | I was just wondering if you want to compare and contrast
00:23:09.440 | RoPE versus the other stuff.
00:23:10.640 | Yeah, I think...
00:23:11.140 | I can never pronounce it right, but ALiBi.
00:23:16.880 | Yeah, ALiBi.
00:23:18.720 | We haven't compared with that one specifically,
00:23:22.800 | mostly because I've noticed
00:23:24.720 | some of the newer architectures
00:23:27.120 | don't actually employ it a lot.
00:23:28.720 | I think the last architecture
00:23:29.760 | that actually really employed it
00:23:31.040 | was the Mosaic MPT model class.
00:23:33.600 | And then almost all the models these days
00:23:35.600 | are all RoPE scaling.
00:23:38.000 | And then effectively,
00:23:39.200 | you can use Yarn with that as well.
00:23:41.120 | We just did the Theta scaling specifically
00:23:44.800 | because of its empirical elegance.
00:23:47.520 | It was really easy and it was well understood by us.
00:23:50.960 | The other one that I know that in the open source
00:23:54.480 | that people are applying,
00:23:56.240 | which uses more of a LoRa-based approach,
00:23:58.880 | which is really interesting too,
00:24:00.800 | is the one that Wing has been employing,
00:24:03.440 | which is PoSE.
00:24:04.160 | We've sort of helped them evaluate
00:24:06.640 | some of the models.
00:24:07.440 | With respect to the performance of it,
00:24:10.480 | it does start to break down a little bit more
00:24:13.600 | on the longer and longer context.
00:24:14.960 | So like 500,000 to a million,
00:24:17.280 | it appeared that it doesn't hold as well
00:24:20.960 | specifically for like Needle in the Haystack.
00:24:23.440 | But it's still TBD as...
00:24:26.400 | Evaluations, I call it just like a high...
00:24:30.480 | It's a sparse high dimensional space
00:24:32.240 | where you're just evaluating performance
00:24:34.960 | across so many different things
00:24:37.120 | and then trying to map it back to like,
00:24:38.640 | "Hey, here's the thing that I actually cared about
00:24:41.040 | from the start."
00:24:41.600 | And I have like a thousand different evaluations
00:24:43.520 | and they tell me something,
00:24:45.040 | but not the entire picture, right?
00:24:46.880 | And as for like Ring-Attention specifically,
00:24:50.960 | like we employed Ring-Attention
00:24:52.800 | in order to do the training.
00:24:54.160 | So we combined Flash-Attention
00:24:56.400 | and Ring-Attention together
00:24:57.760 | with a really specific network topology on our GPUs
00:25:02.960 | to be able to maximize the memory bandwidth.
00:25:05.840 | Yeah, as far as I understand,
00:25:07.920 | like Ring-Attention, a lot of people credit it
00:25:09.920 | for Gemini's million token context,
00:25:13.600 | but actually it's just a better utilization of GPUs, right?
00:25:16.240 | Like that's really what it is.
00:25:18.880 | You mentioned in our show notes,
00:25:20.960 | Zhang Peiyuan's Easy Context Repo.
00:25:23.920 | I have seen that come up quite a bit.
00:25:25.280 | What does that do?
00:25:26.880 | Like how important is it
00:25:28.800 | as a Ring-Attention implementation?
00:25:31.440 | I know there's like maybe another one
00:25:33.600 | that was done by lucidrains
00:25:35.520 | or one of the other open source people.
00:25:37.280 | But like what is Easy Context?
00:25:39.920 | Like is that the place to go?
00:25:41.920 | Like did you evaluate a bunch of things
00:25:43.360 | to implement Ring-Attention?
00:25:45.520 | Yeah, we evaluated all of them.
00:25:48.320 | Like it was, I would say the original authors,
00:25:55.440 | you know, Matai and all the folks at Berkeley,
00:25:59.600 | they created the JAX implementation for it.
00:26:02.240 | And unfortunately, not to discredit,
00:26:06.400 | like, you know, TPUs or whatever,
00:26:08.240 | like the JAX implementation
00:26:09.440 | just does not work on GPUs very well.
00:26:12.480 | Like any naive setup that you do,
00:26:14.960 | like it just won't run out of the box very easily.
00:26:17.600 | And then unfortunately,
00:26:20.080 | that was probably the most mature repo
00:26:22.640 | with a lot more configurations
00:26:24.960 | to set up interesting network topologies
00:26:27.680 | for your cluster.
00:26:28.960 | And then the other PyTorch implementations
00:26:33.600 | outside of Easy Context,
00:26:35.600 | they just didn't really work.
00:26:38.560 | Maybe we weren't implementing
00:26:42.160 | one small aspect incorrectly,
00:26:43.520 | but like there was an active development on it
00:26:46.080 | at a certain point.
00:26:47.040 | Like even lucidrains, I think he's interesting
00:26:49.760 | 'cause for once he was actually like,
00:26:51.760 | he was like taking a job somewhere
00:26:53.120 | and then just stopped, you know, doing commits.
00:26:55.840 | And as we were working to try to find it,
00:26:58.960 | like we never really want to jump in on a repo
00:27:01.120 | where someone's like kind of actively committing
00:27:03.040 | breaking changes to it.
00:27:04.800 | Otherwise we have to like eat that repo ourselves.
00:27:07.840 | And yeah, Easy Context
00:27:09.520 | was the first PyTorch implementation
00:27:11.200 | that applied it with native libraries
00:27:14.880 | that worked pretty well.
00:27:18.000 | And then we adapted it ourselves
00:27:19.840 | in order to configure it
00:27:23.040 | for our cluster network topology.
00:27:25.680 | So, you know, shout out to Zhang Peiyuan
00:27:29.760 | for his open source contributions.
00:27:32.800 | I think that we look forward
00:27:34.800 | to possibly collaborating with him
00:27:36.640 | and push that further in the future
00:27:38.720 | because I think more people,
00:27:41.120 | if they do want to get started on it,
00:27:43.280 | I would recommend that to be the easiest way.
00:27:45.920 | Like, I don't know how many people know Jax.
00:27:48.160 | Me personally, I don't really know it that well.
00:27:50.080 | So I'm more of a PyTorch guy.
00:27:52.080 | So yeah, I think that he provides
00:27:55.920 | a really good introduction
00:27:57.520 | to be able to try it out.
00:28:00.240 | - And so on one side,
00:28:01.840 | you had the technical discovery.
00:28:04.560 | What about the actual customer interest?
00:28:08.000 | Customers that you work with?
00:28:09.120 | I feel like sometimes the context size
00:28:10.800 | can be a bit of a marketing ploy.
00:28:12.640 | You know, people are like,
00:28:13.440 | "Oh yeah, well, no, 1 million, 2 million,
00:28:15.440 | 3 million, 4 million."
00:28:16.800 | So that's kind of the algorithm side of it.
00:28:20.320 | How do you actually, you know,
00:28:21.680 | how do you power the training?
00:28:23.200 | But the other side is obviously the data
00:28:25.440 | that goes into it.
00:28:26.160 | There's both quantity and quality.
00:28:28.720 | I think that's how, in one of your tweets,
00:28:30.080 | you trained on about 200 million tokens
00:28:32.960 | for the 8B model for the context extension.
00:28:35.600 | But what are the tokens?
00:28:37.920 | You know, how do you build them?
00:28:38.960 | What are like maybe some of the differences
00:28:41.440 | between pre-training datasets
00:28:43.360 | and context extension datasets?
00:28:45.200 | Yeah, any other color you give there
00:28:47.760 | would be great.
00:28:48.800 | So specifically for us,
00:28:50.880 | we actually staged two different updates
00:28:54.560 | to the model.
00:28:55.200 | So our initial layer that we trained
00:28:59.120 | was just basically like a pre-training layer.
00:29:04.240 | So continual pre-training
00:29:05.520 | where we took the Slim Pajamas data
00:29:07.520 | and then we filtered it and concatenated it
00:29:12.320 | so that it would reach the context lengths
00:29:14.240 | that we were trying to extend out to.
00:29:16.080 | And then we took the UltraChat dataset,
00:29:19.120 | filtered it down,
00:29:20.160 | or maybe some other, you know,
00:29:23.120 | second order derivative of the UltraChat dataset
00:29:26.560 | that was curated and then filtered it down
00:29:29.760 | and then reformatted it for our chat use case.
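A rough sketch of that packing step is below; the toy corpus, separator handling, and lengths are placeholders rather than the actual pipeline, which also does the filtering and reformatting described here.

```python
# Sketch of packing filtered, pre-tokenized documents into fixed-length
# long-context training samples. Corpus and lengths here are placeholders.
from typing import Iterable, Iterator

def pack_documents(
    token_streams: Iterable[list[int]],  # already-tokenized documents
    target_len: int,                     # e.g. 262_144 or 1_048_576 tokens
    eos_id: int,
) -> Iterator[list[int]]:
    buffer: list[int] = []
    for doc in token_streams:
        buffer.extend(doc)
        buffer.append(eos_id)             # keep document boundaries visible
        while len(buffer) >= target_len:
            yield buffer[:target_len]
            buffer = buffer[target_len:]  # carry the remainder into the next sample

# Toy usage; a real run would stream tokenized SlimPajama-style shards.
docs = [[1, 2, 3] * 50, [7, 8] * 80, [4, 5, 6] * 120]
for sample in pack_documents(docs, target_len=128, eos_id=0):
    print(len(sample))
```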
00:29:34.640 | For those two datasets,
00:29:38.160 | I think you always have to really keep in mind
00:29:41.520 | for the pre-training data,
00:29:44.400 | whether or not you may be like
00:29:48.000 | cutting off tokens in weird ways,
00:29:49.760 | whether or not, you know,
00:29:53.280 | the content is actually diverse enough
00:29:56.240 | to retain the ability of the model.
00:29:59.200 | So Slim Pajamas tends to be one of the best ones,
00:30:02.800 | mostly because it's a diverse dataset
00:30:05.200 | and you can use embeddings too
00:30:09.440 | as a pre-filtering step as well, right?
00:30:12.400 | Like how diverse are your embeddings space
00:30:15.280 | to the original corpus of the model
00:30:18.960 | and then train on top of that to retain its abilities.
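A minimal sketch of that kind of embedding-based pre-filter follows; the similarity threshold and the random stand-in embeddings are placeholders.

```python
# Sketch of an embedding-based diversity filter: greedily keep a document only
# if it is not too similar to anything already kept. Threshold and the random
# stand-in embeddings are placeholders for real text embeddings.
import numpy as np

def diversity_filter(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for idx, vec in enumerate(unit):
        if not kept or float((unit[kept] @ vec).max()) < threshold:
            kept.append(idx)
    return kept

rng = np.random.default_rng(0)
candidate_embeddings = rng.normal(size=(1_000, 384))
print(f"kept {len(diversity_filter(candidate_embeddings))} of 1000 documents")
```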
00:30:21.920 | And then finally for the chat dataset,
00:30:26.000 | making sure that it's attending to all the information
00:30:31.280 | that would be expected to really stretch its capabilities
00:30:34.240 | 'cause you could create like a long context dataset
00:30:38.080 | where like every single time,
00:30:40.160 | the last 200 tokens can answer the entire question.
00:30:44.240 | And that's never gonna make the model attend to anything.
00:30:47.440 | So it's even something that we're doing right now
00:30:50.480 | is trying to think about
00:30:51.680 | like how do we actually improve these models
00:30:54.160 | and how do you ablate the datasets
00:30:57.840 | such that it can expose like even more nuanced capabilities
00:31:02.240 | that aren't easily measurable quite yet.
00:31:07.120 | Is there a ratio between diversity of the dataset
00:31:11.120 | versus diversity compared to what the model already knows?
00:31:14.960 | Like does the model already need to understand
00:31:17.040 | a good part of the new,
00:31:19.760 | like the context extension data to function?
00:31:22.880 | Like can you put a context extension dataset
00:31:25.680 | that is like very far
00:31:26.800 | from like what was in the pre-training?
00:31:28.320 | I'm just thinking as the model get older,
00:31:31.120 | you know, some of the datasets that we have
00:31:34.080 | might not be in the knowledge of the existing model
00:31:36.720 | that you're trying to extend.
00:31:37.680 | - I think that's always a consideration.
00:31:40.560 | I think specifically,
00:31:41.840 | you really got to know how much,
00:31:45.040 | how many tokens were expended
00:31:47.040 | into that particular model from the start.
00:31:48.880 | And all models these days
00:31:50.960 | are now double digit trillions, right?
00:31:54.160 | So it's kind of a drop in the bucket
00:31:56.800 | if you really think I can just put,
00:31:59.360 | you know, a billion tokens in there.
00:32:01.280 | And I actually think that the model
00:32:02.720 | is gonna truly learn new information.
00:32:06.400 | There is a lot of research out there
00:32:09.760 | between the differences
00:32:11.680 | with respect to like full fine-tuning,
00:32:13.760 | which we applied full fine-tuning
00:32:15.120 | versus lower base fine-tuning.
00:32:16.560 | It's a trade-off.
00:32:19.440 | And my opinion of it is actually
00:32:22.160 | that you can test certain capabilities
00:32:26.640 | and you can kind of inject
00:32:28.640 | new knowledge into the model.
00:32:31.680 | But to this day,
00:32:32.960 | I've not seen any research
00:32:34.720 | that does like a strong,
00:32:35.840 | well-scaled out empirical study
00:32:38.800 | on how do you increase the model's ability
00:32:43.280 | to understand like these decision boundaries
00:32:45.760 | with a new novel data.
00:32:48.880 | Most of it is taking,
00:32:51.120 | like holding out a portion of the data
00:32:53.360 | as like novel
00:32:56.160 | and then needing to recycle
00:32:57.600 | some of the old knowledge.
00:32:58.960 | So it just doesn't forget
00:33:01.200 | and get worse at everything else, right?
00:33:03.680 | Which was seen,
00:33:04.480 | like we do have historical precedent
00:33:06.720 | where Code Llama,
00:33:10.480 | the original Code Llama,
00:33:12.720 | was trained further from Llama 2
00:33:14.320 | and it just lost
00:33:15.920 | all its language capabilities, basically, right?
00:33:18.720 | So it's not, I don't wanna call that project,
00:33:21.920 | like deem it as like a failure,
00:33:24.720 | but it wasn't like
00:33:25.600 | a really successful generalization exercise
00:33:28.800 | because these models are about like flexibility
00:33:32.000 | and being like generic to a certain extent.
00:33:34.080 | - So one thing I see in the recent papers
00:33:37.040 | that have been coming out
00:33:37.840 | is this sort of concept
00:33:39.920 | of multi-stage training of data.
00:33:42.560 | And if you're doing full fine tuning,
00:33:45.360 | maybe the move or the answer
00:33:47.200 | is don't train 500 billion tokens on just code
00:33:51.120 | because then yeah,
00:33:51.920 | it's gonna massively overfit to just code.
00:33:53.680 | Instead, like maybe the move
00:33:55.360 | is to slowly change the mix
00:33:57.840 | over the different phases, right?
00:34:00.320 | So in other words,
00:34:01.360 | like you still need to mix in
00:34:03.520 | some of your original source dataset
00:34:05.040 | to make sure it doesn't deviate too much.
00:34:07.040 | I feel like that is a very crude solution.
00:34:10.720 | Like maybe there's some smarter way
00:34:13.280 | to adjust like the loss function
00:34:14.640 | so that it doesn't like deviate
00:34:17.120 | or overfit too much to more recent data.
00:34:19.360 | It seems like it's a solvable thing.
00:34:22.640 | That's what I'm saying.
00:34:23.760 | Like this overfitting
00:34:25.360 | to more recent data issue.
00:34:26.880 | - Well, solvable is hard.
00:34:29.920 | I think provably solvable
00:34:32.160 | is always something that I know
00:34:33.200 | is extremely difficult.
00:34:35.680 | But from a heuristical standpoint,
00:34:39.200 | as well as like having
00:34:40.160 | like some sort of statistical efficiency
00:34:44.240 | on like how you can converge
00:34:45.760 | to the downstream tasks
00:34:48.080 | and improve the performance that way
00:34:49.520 | in a targeted manner,
00:34:51.920 | I do think there are papers
00:34:53.840 | that try to do that.
00:34:56.560 | Like the DoReMi paper,
00:34:59.120 | I think it was released last year.
00:35:00.240 | It was really good
00:35:01.200 | about doing an empirical study on that.
00:35:03.360 | I think the one thing
00:35:07.440 | that people struggle with though
00:35:09.920 | is the fact that
00:35:10.720 | they always try to do it
00:35:13.680 | on pretty naive tasks.
00:35:15.760 | Like you target like a naive task
00:35:17.840 | and then you create your data mixture
00:35:20.320 | and you try to show
00:35:21.920 | some sort of algorithm
00:35:23.680 | that can retain the performance
00:35:27.840 | for those downstream tasks.
00:35:30.320 | But then what do we all care about
00:35:33.680 | are actually like really,
00:35:34.640 | really interesting complex tasks, right?
00:35:37.040 | And we barely have
00:35:37.760 | good evaluations for those.
00:35:39.440 | Like if you do a deep dive
00:35:42.080 | at the Gemini 1.5 technical paper,
00:35:45.920 | which they just updated with,
00:35:47.520 | like it was a fantastic paper
00:35:49.280 | with new updates.
00:35:50.240 | If you look at all of their
00:35:52.240 | long context evaluations there,
00:35:54.800 | like a lot of them are just not
00:35:56.640 | something that the open community
00:35:58.240 | can even do
00:35:59.120 | because they just hired
00:36:01.040 | like teachers to evaluate
00:36:04.240 | whether or not this model
00:36:06.080 | generated a huge lesson plan
00:36:08.400 | that is really coherent
00:36:10.160 | or like you hire a bunch
00:36:11.840 | of subject matter experts
00:36:13.120 | or they taught the model
00:36:15.760 | how to do language translation
00:36:18.160 | for an extinct language
00:36:20.160 | where only 200 people in the world know.
00:36:22.160 | It's like, it's kind of hard for us
00:36:24.160 | to do that same study, right?
00:36:25.920 | As an early stage startup.
00:36:28.000 | I mean, technically now
00:36:28.880 | you can use Gemini as a judge.
00:36:30.800 | Gemini is touting a lot
00:36:32.000 | of their capabilities
00:36:33.040 | in low resource languages.
00:36:34.240 | One more thing before
00:36:36.000 | on the sort of data topic.
00:36:37.360 | Did you have any exploration
00:36:40.800 | of synthetic data at all?
00:36:41.920 | You know, use Mistral to rephrase
00:36:45.200 | some existing part of your data set
00:36:46.720 | to generate more tokens,
00:36:48.720 | anything like that,
00:36:49.360 | or any other form of synthetic data
00:36:51.200 | that you choose to mention.
00:36:52.640 | I think you also mentioned
00:36:53.680 | the large world model paper, right?
00:36:55.120 | So yeah, yeah.
00:36:56.560 | Anything like that?
00:36:57.200 | Yeah, yeah.
00:36:58.640 | So yeah, we used like GPT-4
00:37:02.880 | to rephrase certain aspects
00:37:06.160 | of the chat data, reformatting it
00:37:10.400 | or kind of generating
00:37:13.680 | new types of tokens
00:37:16.560 | and language and types of data
00:37:19.440 | that the model could see.
00:37:22.400 | And also like trying to take
00:37:25.760 | the lower correlated instances
00:37:28.800 | of out-of-domain data
00:37:29.920 | that we wanted to inject it
00:37:32.720 | to the model too as well.
00:37:34.320 | So I actually think a lot of the moat
00:37:37.360 | is in the data pipeline.
00:37:39.520 | You'll notice like most papers
00:37:42.880 | just don't really go into deep detail
00:37:44.640 | about the data set creation
00:37:47.520 | because they probably know.
00:37:49.200 | I mean, there's some aspects
00:37:50.560 | that are like uninteresting, right?
00:37:52.240 | Which is like we paid a bunch of people
00:37:53.680 | and like generated a lot of good data.
00:37:56.640 | But then the synthetic data
00:37:57.760 | generating pipeline itself,
00:37:59.280 | sometimes that could be like 25%
00:38:02.640 | or 50% of the entire data set
00:38:04.800 | that you end up using to pre-train.
00:38:06.720 | Yeah, I think it's just
00:38:07.520 | for legal deniability rather than...
00:38:09.520 | (both laughing)
00:38:11.440 | No, it's just too boring.
00:38:13.120 | I'm not going to say anything
00:38:13.840 | because it's too boring.
00:38:14.560 | No, it's actually really interesting.
00:38:15.760 | But in fact, it might be too interesting.
00:38:19.600 | So we're not going to say anything about it.
00:38:21.520 | Yeah.
00:38:22.020 | One more question that I had was on LoRa
00:38:25.680 | and taking some of these capabilities
00:38:27.440 | out and bringing them to other model.
00:38:29.120 | You mentioned Wing's work.
00:38:30.800 | He tweeted about,
00:38:32.880 | we're going to take this LoRA adapter
00:38:34.720 | for the Gradient 1 million context extension
00:38:38.240 | and you're going to be able
00:38:39.120 | to apply that to other models.
00:38:41.120 | Can you just generally explain to people
00:38:44.640 | how these things work with language models?
00:38:47.600 | I think people understand
00:38:48.400 | that with stable diffusion,
00:38:49.440 | you have these like LoRA patches
00:38:50.960 | for like different types of styles.
00:38:52.720 | Does that work similarly with LLMs?
00:38:56.240 | And is it about functionality?
00:38:58.080 | Can you do LoRA patches with specific knowledge?
00:39:00.960 | Like what's the state of the art there?
00:39:02.800 | Yeah, I think there's a huge kind of resurgence
00:39:07.920 | in what I would call like model alchemy
00:39:12.640 | to a certain extent
00:39:13.920 | because you're like taking all of these LoRas
00:39:16.880 | and you're mixing them together.
00:39:18.160 | And then that's a lot of the model merging stuff
00:39:22.080 | that I think Charles Goddard does
00:39:24.400 | and a lot of others in the open community, right?
00:39:28.720 | 'Cause it's a really easy way,
00:39:30.480 | like you don't need training
00:39:31.760 | and you can test and evaluate models
00:39:33.920 | and take the best skills and mix and match.
00:39:36.720 | I don't think there has been
00:39:38.400 | as much empirical study, like you're saying,
00:39:40.720 | for how it shows the same type of,
00:39:44.720 | like it's not as interpretable
00:39:46.240 | as like stable diffusion to a certain extent.
00:39:48.080 | 'Cause even we have experimented
00:39:53.440 | with taking like deltas
00:39:55.520 | in the same methodology as Wing
00:39:57.680 | where we'll take a delta
00:39:58.880 | of like an already trained model,
00:40:00.320 | try to see how that has created
00:40:03.360 | in a sense an RLHF layer, right?
00:40:05.920 | Taking the Llama instruct layer,
00:40:08.080 | subtracting the base model from that
00:40:10.720 | and then trying to apply that LoRa adapter
00:40:12.640 | to like another model
00:40:13.680 | and seeing what it does to it.
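A minimal sketch of the delta experiment being described (toy tensors stand in for real checkpoints; in practice the models have to share an architecture, and the delta is often rank-reduced into an actual LoRA rather than applied raw):

```python
# Toy sketch: compute the "instruct minus base" weight delta and graft it onto
# a third, architecturally identical model. Random tensors stand in for real
# state_dicts; this is an experiment, not a guaranteed recipe.
import torch

def transplant_delta(base_sd: dict, instruct_sd: dict, target_sd: dict) -> dict:
    merged = {}
    for name, weight in target_sd.items():
        delta = instruct_sd[name] - base_sd[name]  # what instruct-tuning changed
        merged[name] = weight + delta              # apply that change to the target
    return merged

shape = (8, 8)
base_sd     = {"layer.weight": torch.randn(shape)}
instruct_sd = {"layer.weight": base_sd["layer.weight"] + 0.01 * torch.randn(shape)}
target_sd   = {"layer.weight": torch.randn(shape)}

merged = transplant_delta(base_sd, instruct_sd, target_sd)
print(merged["layer.weight"].shape)
```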
00:40:16.080 | - It does seem to have an effect though.
00:40:20.480 | Like I will not lie to say,
00:40:22.640 | I'm really surprised
00:40:24.240 | how effective it is sometimes.
00:40:26.240 | But I do notice that
00:40:28.240 | for more complex abilities,
00:40:30.320 | other than like more stylistic stuff,
00:40:34.480 | it kind of falls through
00:40:37.200 | 'cause maybe it requires
00:40:40.560 | a much deeper path
00:40:42.400 | in the neural network, right?
00:40:43.840 | Like all these things,
00:40:44.640 | these weights are just like
00:40:45.680 | huge trees of paths
00:40:48.640 | that the interesting stuff
00:40:50.960 | is like the road less traveled
00:40:54.400 | to a certain extent.
00:40:55.440 | And when you're just like merging things,
00:40:57.440 | brute force together that way,
00:40:59.680 | you don't quite know
00:41:02.720 | what you'll get out all the time.
00:41:04.400 | Like there's a lot of other research
00:41:06.000 | where you have TIES merging
00:41:07.760 | and you have all these different
00:41:09.520 | types of techniques
00:41:10.400 | to effectively just apply
00:41:12.400 | like a singular value decomposition
00:41:15.520 | on top of weights
00:41:16.480 | and just get like the most important ones
00:41:18.560 | and prevent interference
00:41:20.000 | across all the other layers.
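To illustrate the SVD point, here is a sketch of compressing a dense weight delta into a rank-r adapter by keeping only the top singular directions; the rank and matrix shape are arbitrary.

```python
# Sketch: keep only the largest singular directions of a weight delta, which is
# essentially how a dense delta becomes a LoRA-style low-rank adapter.
# Rank and matrix shape below are arbitrary.
import torch

def low_rank_delta(delta: torch.Tensor, rank: int):
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (out_dim, rank) with singular values folded in
    B = Vh[:rank, :]             # (rank, in_dim)
    return A, B                  # delta is approximated by A @ B

delta = torch.randn(512, 512)
A, B = low_rank_delta(delta, rank=16)
rel_err = torch.linalg.norm(delta - A @ B) / torch.linalg.norm(delta)
print(f"relative error of the rank-16 approximation: {rel_err:.3f}")
```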
00:41:22.320 | But yeah, I think that
00:41:25.840 | that is extremely interesting
00:41:30.080 | from the developer community.
00:41:32.480 | And I wanna see more of it
00:41:34.160 | except it is to a certain extent
00:41:37.520 | kind of polluting the leaderboards
00:41:39.360 | these days 'cause it's so targeted
00:41:41.360 | and like now you can
00:41:43.200 | you can kind of game the metric
00:41:45.440 | by just finding all the best models
00:41:47.680 | and then just merging them together
00:41:49.280 | to do that.
00:41:49.840 | And I'll just add one last bit
00:41:53.040 | is basically
00:41:53.760 | the most interesting part
00:41:55.680 | about all that actually to me
00:41:57.040 | is when people are trying
00:41:58.720 | to take the LORAs
00:41:59.840 | as a way of like short circuiting
00:42:02.640 | the training process.
00:42:03.440 | So they take the LORAs,
00:42:04.320 | they merge it in
00:42:05.280 | and then they'll fine tune afterwards.
00:42:06.960 | So like the fine tuning
00:42:08.880 | and the reinitialization
00:42:10.560 | of a little bit of noise
00:42:12.240 | into all of the new merged models
00:42:16.000 | provides like a kind of
00:42:18.320 | kind of a learning tactic
00:42:20.800 | for you to get to that capability
00:42:23.200 | a little bit faster.
00:42:24.080 | There's a lot there.
00:42:25.840 | I really like the comparison
00:42:27.040 | of TIES merging
00:42:29.280 | to singular value decomposition.
00:42:31.600 | That's something that I like it.
00:42:35.280 | I looked at the paper
00:42:36.240 | and I don't really think
00:42:37.760 | I understood it on that high level
00:42:39.280 | until you just said it.
00:42:40.640 | Very cool.
00:42:41.760 | We have to move on to benchmarking.
00:42:44.560 | This is a very fun topic.
00:42:47.440 | Needle in a haystack.
00:42:48.320 | What are your thoughts and feelings?
00:42:49.360 | And then we can discuss
00:42:50.160 | the other benchmarks first.
00:42:51.520 | Needle in a haystack.
00:42:52.640 | You want to put me
00:42:53.600 | on the spot with that one.
00:42:54.480 | Yeah, I think needle in a haystack
00:42:57.360 | is definitely like the standard
00:42:59.840 | for presenting the work
00:43:01.520 | in a way that people can understand
00:43:03.280 | and also proving out.
00:43:04.640 | I would say like,
00:43:05.200 | I view it as like a primitive
00:43:07.760 | that you have to pass
00:43:09.840 | in order to give the model
00:43:11.440 | any shot of doing something
00:43:13.440 | that combines both
00:43:15.440 | like a more holistic
00:43:16.720 | language understanding
00:43:17.760 | and like instruction following, right?
00:43:20.080 | Like, honestly,
00:43:20.880 | like it's mostly about
00:43:21.840 | if you think about
00:43:23.680 | the practical applications
00:43:24.960 | of like long context
00:43:26.480 | and what people complain
00:43:27.440 | most about models
00:43:29.040 | when you stuff
00:43:29.680 | a lot of context into it
00:43:31.040 | is either the language model
00:43:33.120 | just doesn't care about
00:43:34.320 | what you asked it to do
00:43:36.000 | or it cannot differentiate
00:43:37.760 | like, you know,
00:43:39.120 | context that you want
00:43:40.960 | it to use as a source
00:43:42.560 | to prevent hallucination
00:43:44.000 | versus like instructions.
00:43:45.680 | I think that, you know,
00:43:48.400 | when we were doing it,
00:43:49.440 | it was to make sure
00:43:50.240 | that we were on the right track.
00:43:51.920 | I think Greg did a really great job
00:43:54.960 | of creating a metric
00:43:56.240 | and a benchmark
00:43:56.880 | that everybody could understand.
00:43:59.120 | It was intuitive.
00:44:00.160 | Even he says himself,
00:44:01.360 | we have to move past it.
00:44:02.880 | But to that regard,
00:44:05.200 | it's a big reason
00:44:06.800 | why we did the evaluation
00:44:08.800 | on the ruler suite of benchmarks,
00:44:10.880 | which are way harder.
00:44:12.560 | They actually include
00:44:14.880 | needle in the haystack
00:44:16.080 | within those benchmarks too.
00:44:17.360 | And I would even argue
00:44:20.000 | is more comprehensive
00:44:21.840 | than the benchmark
00:44:24.800 | that Gemini released
00:44:27.280 | for their like multi-needle
00:44:28.560 | in the haystack.
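For anyone who has not run it, a bare-bones needle-in-a-haystack probe looks roughly like the sketch below; the filler, needle, and the ask_model stub are placeholders, and a real harness sweeps many context lengths and needle depths and scores the answers.

```python
# Bare-bones needle-in-a-haystack probe. ask_model is a placeholder stub; a
# real harness calls an actual long-context model and sweeps length and depth.
NEEDLE = "The secret passphrase is 'aurora-42'."
QUESTION = "What is the secret passphrase mentioned in the document?"
FILLER = "The quick brown fox jumps over the lazy dog. "

def build_haystack(n_filler_sentences: int, depth: float) -> str:
    sentences = [FILLER] * n_filler_sentences
    sentences.insert(int(depth * n_filler_sentences), NEEDLE + " ")
    return "".join(sentences)

def ask_model(prompt: str) -> str:
    # Placeholder: swap in a real long-context model call here.
    return "aurora-42" if "aurora-42" in prompt else "I don't know."

for depth in (0.1, 0.5, 0.9):  # where in the context the needle is buried
    prompt = build_haystack(2_000, depth) + "\n\n" + QUESTION
    verdict = "PASS" if "aurora-42" in ask_model(prompt) else "FAIL"
    print(f"needle at {depth:.0%} depth: {verdict}")
```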
00:44:29.600 | Yeah, you mentioned quite a few.
00:44:31.040 | You mentioned RULER,
00:44:32.320 | LooGLE, InfiniteBench,
00:44:33.680 | Bamboo, ZeroSCROLLS.
00:44:35.120 | Like, do you want to give us
00:44:38.400 | like maybe two or three of those
00:44:40.320 | that you thought
00:44:41.040 | were particularly interesting
00:44:42.000 | or challenging
00:44:42.640 | and what made them stand out for you?
00:44:44.640 | There's just so many
00:44:45.600 | and they're so nuanced.
00:44:47.040 | I would say like,
00:44:48.880 | yeah, zero scrolls
00:44:49.760 | was the first one I'd ever heard of
00:44:51.200 | coming out last year.
00:44:53.360 | And it was just like the extent,
00:44:54.560 | like making,
00:44:55.200 | it was more of like tracking
00:44:57.040 | variables over long context.
00:45:00.800 | I'll go into ruler
00:45:02.800 | because that's the freshest in my mind
00:45:04.240 | and like we're just scrutinizing it so much
00:45:06.320 | and running the evaluation
00:45:08.000 | in the previous two weeks.
00:45:09.360 | But like ruler has four different
00:45:14.080 | types of evaluations.
00:45:15.840 | So the first one is
00:45:17.360 | exactly needle in the haystack.
00:45:19.520 | It's like you throw multiple needles.
00:45:21.200 | So you got to retrieve
00:45:22.160 | multiple key value pairs.
00:45:24.640 | There's another one
00:45:25.280 | where that basically
00:45:26.400 | you need to differentiate.
00:45:27.680 | Multi-key, multi-value, multi-query.
00:45:30.320 | Yeah, yeah, multi-value, multi-query.
00:45:32.160 | That's the ablation.
00:45:33.920 | There's also a variable tracking one
00:45:39.600 | where you go,
00:45:40.400 | hey, if X equals this,
00:45:41.760 | Y equals this,
00:45:42.480 | Y equals Z,
00:45:45.280 | like what is this variable?
00:45:47.440 | And you have to track it
00:45:48.480 | through all of that context.
00:45:50.720 | And then finally,
00:45:51.520 | there's one that is
00:45:53.280 | more of like creating a summary statistic.
00:45:55.920 | So like the common words one,
00:45:57.680 | where you choose a word
00:46:00.080 | that goes across the entire context,
00:46:03.040 | and then you have to like count it.
00:46:04.480 | So it's a lot more holistic
00:46:07.360 | and a little bit more difficult that way.
00:46:09.040 | And then there's a few other ones
00:46:13.120 | that escaped me at this moment.
00:46:15.200 | But yeah, RULER really pushes you.
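For a concrete feel of the task types described above, here is a minimal sketch of RULER-style synthetic long-context prompts: a variable-tracking chain and a common-words counting task buried in filler text. The filler sentence, templates, and helper names are illustrative assumptions, not the actual RULER implementation.

```python
import random

# Minimal sketch of RULER-style synthetic long-context tasks (illustrative only;
# the real RULER suite has more task variants and careful prompt templating).

FILLER = "The grass is green. The sky is blue. The sun is yellow. Here we go. "

def variable_tracking_prompt(chain_len=4, n_filler=2000, seed=0):
    """A value propagates through a chain of assignments buried in filler text;
    the model must list every variable that ends up holding the value."""
    rng = random.Random(seed)
    value = str(rng.randint(10_000, 99_999))
    names = [f"VAR{rng.randint(100, 999)}" for _ in range(chain_len)]
    assignments = [f"{names[0]} = {value}."] + [
        f"{names[i]} = {names[i - 1]}." for i in range(1, chain_len)
    ]
    filler = [FILLER] * n_filler
    for a in assignments:                      # scatter assignments randomly
        filler.insert(rng.randrange(len(filler)), a + " ")
    question = f"\nWhich variables are equal to {value}? Answer: "
    return "".join(filler) + question, names

def common_words_prompt(target="lighthouse", n_target=17, n_filler=2000, seed=1):
    """Aggregation task: count occurrences of a word sprinkled across the whole
    context, so a single retrieval hit is not enough to answer."""
    rng = random.Random(seed)
    pieces = [FILLER] * n_filler + [f" {target} "] * n_target
    rng.shuffle(pieces)
    question = f"\nHow many times does the word '{target}' appear? Answer: "
    return "".join(pieces) + question, n_target

prompt, expected_vars = variable_tracking_prompt()
print(len(prompt), expected_vars)              # long prompt, chain of variable names
```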
00:46:18.240 | If I think about the progression
00:46:20.400 | of the evaluations,
00:46:21.600 | it pushes it to start
00:46:24.400 | to force the model
00:46:25.520 | to actually understand
00:46:26.960 | like the totality of the context,
00:46:29.200 | rather than right,
00:46:31.360 | like everybody argues to say,
00:46:32.640 | like, couldn't I just use
00:46:34.480 | like a retrieval
00:46:35.840 | to like just grab that variable
00:46:37.360 | rather than like pay $10 for one shot
00:46:41.200 | or something?
00:46:41.760 | Although it's not as expensive.
00:46:43.760 | Yeah, exactly, exactly.
00:46:45.920 | So being able to actually like,
00:46:47.680 | I think the main thing
00:46:48.880 | that like I struggled with,
00:46:50.480 | with even some of our use cases,
00:46:52.480 | were like when the context
00:46:55.680 | is scattered across multiple documents,
00:46:57.920 | and you have like really delicate plumbing
00:47:01.040 | for the retrieval step,
00:47:02.320 | but it only works for that one,
00:47:05.600 | that really specific instance, right?
00:47:07.360 | And then you throw in other documents
00:47:09.040 | and you're like, oh, great,
00:47:10.000 | like my retrieval doesn't grab
00:47:11.920 | the relevant context anymore.
00:47:13.760 | So like, that's the dream, right?
00:47:15.360 | Of getting one model,
00:47:17.040 | a model that can generalize
00:47:18.800 | really well that way.
00:47:20.080 | Yeah, totally.
00:47:20.720 | I think that probably is
00:47:22.880 | what Greg mentioned
00:47:24.240 | when saying that he has to move
00:47:26.000 | beyond needle in the haystack.
00:47:27.520 | You also mentioned,
00:47:29.440 | so you extended from 1 million
00:47:31.280 | to 4 million token context recently,
00:47:33.680 | and you saw some degradation
00:47:35.920 | in the benchmarks too.
00:47:37.200 | Like, do you want to discuss that?
00:47:38.720 | So if you look at our theta value
00:47:40.720 | at that point,
00:47:41.440 | it's getting really big.
00:47:43.120 | So think about floating point precision
00:47:47.440 | and thinking about propagating,
00:47:49.040 | like basically now you're starting
00:47:52.800 | to run into problems
00:47:54.240 | where in a deep enough network
00:47:56.480 | and having to do joint probabilities
00:48:00.720 | across like so many tokens,
00:48:04.320 | you're hitting kind of the upper bound
00:48:08.880 | on accuracy there.
00:48:11.520 | And there's probably some aspect
00:48:15.920 | of kind of clamping down
00:48:21.200 | certain activations
00:48:22.240 | that we need to do within training.
00:48:23.760 | Maybe it happens at inference time as well
00:48:28.240 | with respect to like the theta value
00:48:30.640 | that we use
00:48:31.360 | and how do we ensure
00:48:33.600 | that it doesn't just explode.
00:48:35.200 | If you've ever had to come across
00:48:38.560 | like the exploding gradients
00:48:40.400 | or the vanishing gradient problem,
00:48:41.760 | you will know what I'm talking about.
00:48:44.160 | Like a lot of the empirical aspect of that
00:48:46.720 | and scaling up these things
00:48:49.200 | is experimentation
00:48:52.240 | and figuring out
00:48:52.960 | how do you kind of marshal
00:48:56.320 | these really complicated
00:48:57.520 | composite functions
00:48:59.920 | such that they don't just like
00:49:01.360 | do a divide-by-zero problem at one point.
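To make the theta point concrete, here is a small sketch of how the RoPE base sets per-dimension rotation frequencies and why very large bases leave the slowest dimensions rotating by angles near half-precision resolution. The base values and head dimension are illustrative, not the exact settings used for the 1M or 4M runs.

```python
import numpy as np

# Sketch of RoPE rotation angles under a scaled base ("theta"). The bases and
# head_dim below are illustrative, not the exact values used in these runs.

def rope_angles(position: int, head_dim: int = 128, base: float = 10_000.0):
    """Rotation angle (radians) applied to each even/odd dimension pair."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return position * inv_freq

for base in (1e4, 1e7, 1e9):
    step = rope_angles(1, base=base)[-1]       # slowest dimension's per-token angle
    print(f"base={base:.0e}  slowest per-token angle={step:.3e} rad  "
          f"(fp16 eps={np.finfo(np.float16).eps:.1e})")
# As the base grows, the slowest dimensions advance by vanishingly small angles
# per token, so in low precision nearby positions become hard to tell apart in
# those dimensions, which is one place accuracy can erode at multi-million-token
# lengths.
```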
00:49:04.960 | Awesome.
00:49:06.000 | Just to wrap on the...
00:49:08.240 | There's the evals
00:49:10.640 | and then there's what people care about.
00:49:12.240 | There's two things.
00:49:13.920 | Do you see people care about above 1 million?
00:49:16.320 | Because Gemini made the 2 million
00:49:18.400 | announcement and I think people were like,
00:49:20.320 | "Okay, 1 million, 2 million, it's whatever."
00:49:23.920 | Like, do you think we need to get to 10 million
00:49:25.840 | to get people to care again?
00:49:27.360 | Yeah.
00:49:27.760 | Like, do we need to get to 100 million?
00:49:29.520 | Yeah.
00:49:31.040 | I mean, that's an open question.
00:49:36.400 | I would certainly say
00:49:38.080 | a million seemed like the number
00:49:40.960 | that got people really excited for us.
00:49:43.760 | And then the 4 million is kind of like,
00:49:47.120 | "Okay, that's seen as more..."
00:49:50.000 | Rather than like a breakthrough milestone,
00:49:51.840 | it's like just the next incremental checkpoint.
00:49:55.840 | I do think even Google themselves,
00:50:02.880 | they're evaluating and trying to figure out
00:50:04.800 | specifically how do you measure
00:50:08.800 | the quality of these models
00:50:10.160 | and how do you measure and map those
00:50:12.560 | to capabilities that you care about
00:50:17.120 | going down the line, right?
00:50:18.640 | And I think I'm still...
00:50:22.480 | Us as a company, we're figuring out
00:50:26.240 | how to saturate the context window
00:50:29.840 | in a way that's actually
00:50:31.600 | adding incremental value.
00:50:35.360 | So the obvious one is code
00:50:38.800 | because code repositories are huge.
00:50:41.280 | So can you stuff the entire context
00:50:43.440 | of a repo into a model
00:50:46.560 | and then make it produce some module
00:50:50.080 | that is useful
00:50:50.960 | or some suggestion that is useful?
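As a rough illustration of that repo-stuffing idea, the sketch below walks a repository, prefixes each file with its path, and concatenates everything under a crude token budget. The 4-characters-per-token heuristic and the task string are assumptions for illustration, not a recommended pipeline.

```python
from pathlib import Path

# Rough sketch of "stuff the whole repo into the context": concatenate files
# with path headers under a crude token budget (4 chars/token heuristic, not a
# real tokenizer), then append the actual request at the end.

def repo_to_prompt(root: str, budget_tokens: int = 1_000_000,
                   exts=(".py", ".md", ".toml")) -> str:
    parts, used = [], 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in exts:
            continue
        text = path.read_text(errors="ignore")
        est = len(text) // 4                   # crude token estimate
        if used + est > budget_tokens:
            break
        parts.append(f"### FILE: {path}\n{text}")
        used += est
    parts.append("### TASK: propose a useful new module or refactor for this codebase.")
    return "\n\n".join(parts)

# Usage: prompt = repo_to_prompt("path/to/repo"); send it to a long-context model.
```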
00:50:53.360 | However, I would say
00:50:56.240 | there are other techniques
00:50:58.480 | like AlphaCodium and flow engineering
00:51:01.040 | that if you do iterative things
00:51:03.120 | in a more agentic manner,
00:51:04.240 | it may actually produce better quality.
00:51:06.320 | I would preface and I would actually counter
00:51:09.600 | that maybe start off with the use case
00:51:14.880 | that is a little bit more
00:51:16.000 | that people are more familiar with right now,
00:51:18.480 | which is constantly evolving context
00:51:21.840 | in like a session.
00:51:23.200 | So like, whereas you're coding, right?
00:51:25.680 | If you can figure out evals
00:51:27.440 | that actually work
00:51:30.400 | where you're constantly providing it
00:51:32.560 | multiple turns
00:51:33.600 | and each incremental turn
00:51:34.880 | has a nuanced aspect
00:51:36.400 | and you have a targeted generation
00:51:39.440 | that you know of,
00:51:40.080 | making the model track state
00:51:46.400 | and have state management over time
00:51:48.400 | is really, really hard.
00:51:50.320 | And it's an incredibly hard evaluation
00:51:53.440 | that will probably only really work
00:51:55.680 | when you have a huge context.
00:51:57.440 | So that's sort of what we're working on
00:51:59.920 | trying to figure out those types of aspects.
00:52:01.920 | You can also map that.
00:52:02.880 | Like it's not just code,
00:52:04.720 | state management exists.
00:52:06.480 | And like, you know,
00:52:07.440 | we work in the finance sector a lot,
00:52:08.880 | like investment management,
00:52:09.920 | like having a state management
00:52:14.480 | of like a concept and stuff
00:52:16.320 | that evolves over like a long session.
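A hedged sketch of the kind of session-state evaluation being described: a long multi-turn transcript where a handful of facts keep mutating, with the model finally asked to report the current state. The scenario, names, and scoring are illustrative scaffolding, not Gradient's actual eval.

```python
import random

# Illustrative scaffolding for a multi-turn state-management eval: facts mutate
# turn by turn, and the model is asked for the final state at the end.

def build_session(n_turns=200, n_keys=10, seed=0):
    rng = random.Random(seed)
    keys = [f"project_{i}" for i in range(n_keys)]
    state, turns = {}, []
    for _ in range(n_turns):
        k, v = rng.choice(keys), rng.randint(0, 1_000)
        state[k] = v                           # ground-truth state update
        turns.append(f"User: note that {k} is now budgeted at {v}.")
        turns.append("Assistant: noted.")
    turns.append("User: list the current budget for every project.")
    return "\n".join(turns), state

transcript, expected = build_session()
# A real harness would send `transcript` to the model and score how many of the
# key/value pairs in `expected` it reproduces correctly.
print(len(transcript.splitlines()), "lines;", len(expected), "tracked keys")
```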
00:52:19.200 | So yeah, I'm super excited to hear
00:52:24.400 | like what other people think
00:52:25.840 | about the longer context.
00:52:27.600 | I don't think Google is probably investing
00:52:30.960 | to try to get a billion quite yet.
00:52:34.080 | I think they're trying to figure out
00:52:36.480 | how to fully leverage
00:52:37.840 | what they've done already.
00:52:39.680 | Yeah.
00:52:41.440 | And does this change in your mind
00:52:43.200 | for very long chats
00:52:44.800 | versus a lot of documents?
00:52:46.800 | The chat is kind of interactive,
00:52:48.800 | you know, and information changes
00:52:50.160 | the documents are just trying
00:52:51.280 | to synthesize more and more things.
00:52:53.200 | Yeah.
00:52:54.400 | Any thoughts on how those
00:52:55.760 | two workloads differ?
00:52:56.960 | Yeah, I mean, I would say like
00:52:59.920 | with the document aspect of things,
00:53:02.080 | you probably have like a little bit
00:53:06.080 | more ability to tweak
00:53:08.640 | other methodologies.
00:53:10.400 | Like you can get around
00:53:11.280 | the long context sometimes
00:53:13.520 | where you can do
00:53:15.360 | retrieval augmented generation
00:53:16.880 | or you do like
00:53:17.680 | hierarchical,
00:53:20.640 | like recursive summarization.
00:53:22.160 | Whereas like evolution
00:53:25.120 | in like a session,
00:53:26.000 | because that state variable
00:53:28.800 | could undergo
00:53:29.920 | like pretty rapid changes.
00:53:32.080 | It's a little bit harder
00:53:34.160 | to imagine like you
00:53:36.560 | getting around that
00:53:37.440 | without codifying
00:53:39.200 | like a really specific workflow
00:53:40.960 | or like some sort of,
00:53:42.480 | you know, state clause
00:53:45.760 | that is going back
00:53:47.200 | to like determinism, right?
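For the document-side workaround mentioned above, here is a minimal recursive-summarization sketch: chunk the text, summarize each chunk, then summarize the summaries until one remains. The `summarize` callable is a stand-in for whatever model call you use, and the chunk size is an illustrative character budget.

```python
from typing import Callable, List

# Minimal sketch of hierarchical / recursive summarization. `summarize` is a
# stand-in for a model call; the chunk size is an illustrative character budget.

def chunk(text: str, size: int = 4_000) -> List[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def recursive_summarize(text: str, summarize: Callable[[str], str],
                        chunk_size: int = 4_000) -> str:
    pieces = chunk(text, chunk_size)
    if not pieces:
        return ""
    if len(pieces) == 1:                       # fits in one call: base case
        return summarize(pieces[0])
    partials = [summarize(p) for p in pieces]  # summarize each chunk
    return recursive_summarize(" ".join(partials), summarize, chunk_size)

# Toy usage with a fake summarizer that just truncates:
fake_summarize = lambda t: t[:200]
print(len(recursive_summarize("lorem ipsum " * 5_000, fake_summarize)))
```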
00:53:48.720 | And then finally,
00:53:51.520 | like what I really think
00:53:52.640 | people are trying to do
00:53:55.120 | is like figure out
00:53:56.160 | how did all these like shots
00:53:59.600 | progress over time?
00:54:01.600 | So like,
00:54:02.000 | how do you get away
00:54:03.840 | from the brittleness
00:54:04.640 | of like the retrieval step
00:54:05.680 | to like shoving,
00:54:06.880 | if you shove in a thousand shots
00:54:08.560 | or 2000 shots,
00:54:09.920 | will it just make the retrieval aspect
00:54:12.800 | of good examples irrelevant?
00:54:15.520 | And like, it's sort of
00:54:16.640 | kind of like a
00:54:17.600 | randomly sampling is fine
00:54:19.360 | at that point.
00:54:20.000 | There's actually a paper on that
00:54:21.920 | that came out from CMU
00:54:23.920 | that they showed
00:54:25.360 | with respect to a few extraction
00:54:29.520 | or classification
00:54:30.960 | high cardinality benchmarks.
00:54:33.520 | They tracked like fine tuning
00:54:35.520 | versus in-context learning
00:54:37.760 | versus like many,
00:54:40.480 | many shot in-context learning.
00:54:42.000 | And they basically showed
00:54:42.880 | that like many,
00:54:44.000 | many shot in-context learning
00:54:45.600 | helps to prevent
00:54:48.960 | as much sensitivity
00:54:50.240 | around the examples themselves.
00:54:51.600 | Right?
00:54:52.400 | Like the distraction,
00:54:53.680 | the distraction error
00:54:55.120 | that a lot of LLMs get
00:54:56.400 | where you give it irrelevant context
00:54:58.400 | and it literally can't do the task
00:55:00.880 | because it just is
00:55:02.160 | like it's sort of like a person too.
00:55:03.520 | Right?
00:55:03.760 | Like you got to be very specific about
00:55:06.000 | I don't want to distract this person
00:55:07.520 | because then,
00:55:08.480 | you know,
00:55:08.960 | they're going to go down a rabbit hole
00:55:10.640 | and not be able to complete the task.
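Here is a hedged sketch of the many-shot in-context learning setup referenced around that CMU result: with a big enough window, you randomly sample hundreds or thousands of labeled examples straight into the prompt instead of maintaining a brittle retrieval step. The dataset and labels below are toy assumptions.

```python
import random

# Sketch of many-shot in-context learning: randomly sample a large number of
# labeled examples into the prompt rather than retrieving a curated few.

def many_shot_prompt(train_set, query, n_shots=1_000, seed=0):
    """train_set: list of (text, label) pairs; returns a classification prompt."""
    rng = random.Random(seed)
    shots = rng.sample(train_set, min(n_shots, len(train_set)))
    demos = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in shots)
    return f"{demos}\nInput: {query}\nLabel:"

# Toy usage with a made-up ticket-routing dataset:
train = [(f"ticket {i}: printer jammed", "hardware") for i in range(600)] + \
        [(f"ticket {i}: password reset request", "access") for i in range(600)]
prompt = many_shot_prompt(train, "ticket 9999: laptop will not boot")
print(prompt.count("\nLabel:"), "labeled lines in the prompt")  # shots + the query line
```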
00:55:12.240 | Yeah.
00:55:13.680 | Well, that's kind of the flip side
00:55:14.960 | of the needle in a haystack
00:55:16.640 | thing too in a bit.
00:55:17.520 | It's like now
00:55:19.120 | the models pay attention
00:55:20.240 | to like everything so well.
00:55:22.080 | Like sometimes it's hard
00:55:23.120 | to get them to like,
00:55:24.560 | I just said that once,
00:55:25.600 | please do not bring that up again.
00:55:27.280 | You know, it happens to me with code.
00:55:29.360 | Yeah, it happens to me
00:55:30.400 | with like a CSS style.
00:55:33.120 | Sometimes I like things like that.
00:55:34.400 | If I have a long conversation,
00:55:35.680 | it's like it tries to always
00:55:37.440 | reapply certain styles,
00:55:38.880 | even though I told it
00:55:40.320 | maybe that's not the right
00:55:41.280 | the right way to do it.
00:55:42.160 | But yeah, there's a lot again
00:55:45.760 | of empirical work
00:55:47.520 | that people will do.
00:55:48.320 | And just I know we kind of went through
00:55:51.520 | a lot of the technical side,
00:55:53.280 | but maybe the flip side is
00:55:55.440 | why is it worth doing?
00:55:57.360 | You know, like what are like
00:55:58.560 | the use cases that people have
00:56:00.480 | that make long context really useful?
00:56:03.040 | I know you had,
00:56:04.080 | I think you have a lot of
00:56:05.280 | healthcare use cases
00:56:06.240 | I saw on your Twitter.
00:56:07.120 | You just mentioned the finance use case.
00:56:09.280 | Obviously, some of the filings
00:56:11.440 | and documents that people,
00:56:12.640 | the companies publish
00:56:13.600 | can be quite wordy.
00:56:14.800 | Any other things
00:56:16.960 | that you want to bring up?
00:56:18.320 | Maybe how people are using gradient,
00:56:20.000 | anything like that.
00:56:20.800 | I think that will help
00:56:21.520 | have a clearer picture for people.
00:56:25.120 | Yeah, so beyond like
00:56:27.760 | just using the context for,
00:56:29.360 | you know, sessions
00:56:31.920 | and evolving state management,
00:56:33.600 | it really comes down
00:56:36.000 | to something that's fairly obvious,
00:56:37.280 | which everybody's trying to do
00:56:38.400 | and work on is
00:56:39.280 | how do you ground
00:56:40.080 | the language model better?
00:56:41.840 | So I think when you think pure text,
00:56:43.920 | that's one thing.
00:56:45.680 | But then multimodality
00:56:47.680 | is, in my opinion,
00:56:50.960 | going to be pivotal
00:56:52.320 | for long context,
00:56:53.760 | just because like videos
00:56:55.200 | when you're getting into
00:56:57.120 | the frames per second
00:56:57.920 | and you're getting into lots of images
00:57:01.760 | and like things that are
00:57:04.080 | a lot more like embodied,
00:57:05.360 | you need to utilize
00:57:07.680 | and leverage way more,
00:57:09.680 | way more tokens.
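A back-of-the-envelope on why video drives context requirements; the tokens-per-frame figure is an illustrative assumption in the range of current vision encoders, not a measurement of any particular model.

```python
# Why video pushes context length: token counts below are illustrative
# assumptions (a few hundred visual tokens per frame), not measurements.

def video_tokens(minutes: float, fps: float = 1.0, tokens_per_frame: int = 576) -> int:
    frames = minutes * 60 * fps
    return int(frames * tokens_per_frame)

for mins in (5, 60, 120):
    print(f"{mins:>4} min @ 1 fps -> ~{video_tokens(mins):,} visual tokens")
# Even sampling one frame per second, a two-hour video lands in the millions.
```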
00:57:10.480 | And that is probably where,
00:57:14.240 | you know, us as a company,
00:57:15.600 | like we're exploring more
00:57:17.200 | and trying to open up the doors
00:57:20.880 | for a lot more use cases
00:57:22.480 | because I think in financial services,
00:57:26.560 | as well as health care,
00:57:27.920 | we've done a good job
00:57:29.920 | on the tech side,
00:57:30.640 | but we still need to push
00:57:32.480 | a little bit further
00:57:33.680 | when we combined like,
00:57:34.960 | you know, a picture with words,
00:57:37.920 | like a chart with words
00:57:39.520 | or somebody's medical image
00:57:43.040 | with words, stuff like that.
00:57:44.960 | Like you definitely
00:57:46.160 | can do a better job.
00:57:47.120 | And, you know, it's timely too,
00:57:50.240 | because Meta just released
00:57:51.200 | their Chameleon paper,
00:57:52.160 | the new Chameleon paper
00:57:54.080 | that does multimodal training.
00:57:55.920 | And it shows that early fusion
00:57:57.520 | helps you to,
00:57:58.480 | it's like more sample efficient, right?
00:58:00.320 | So having that kind of
00:58:02.560 | view towards the future
00:58:04.000 | is something that,
00:58:04.880 | you know, we want to be primed to do
00:58:08.800 | because, you know,
00:58:09.760 | it's similar to what Sam Altman
00:58:12.400 | says himself too, right?
00:58:13.680 | Like you need to just assume
00:58:15.360 | that these models
00:58:15.920 | are going to be 10x better
00:58:17.040 | in the next few years.
00:58:19.040 | And if you are primed for that,
00:58:20.560 | like that's where
00:58:21.200 | you have kind of a business
00:58:23.440 | that, you know,
00:58:24.560 | you're not just pivoting
00:58:26.400 | after every release
00:58:27.760 | or every event,
00:58:29.760 | you know, that drops.
00:58:31.760 | I think the thing
00:58:32.720 | about this 10x issue
00:58:34.320 | is that the 10x direction
00:58:37.440 | moves all the time.
00:58:38.560 | You know, some people
00:58:40.240 | were complaining about GPT-4o
00:58:42.160 | that, yeah, look,
00:58:43.840 | like the Elo scores
00:58:45.440 | for GPT-4o actually in reality
00:58:47.280 | weren't that much higher
00:58:48.320 | than GPT-4 Turbo.
00:58:50.000 | And really the, you know,
00:58:51.040 | so it's not 10x better in reasoning.
00:58:52.960 | It's just 10x better
00:58:54.080 | in the integration
00:58:55.440 | of multiple modalities.
00:58:57.440 | And by the way,
00:58:58.720 | look over here,
00:58:59.280 | there's a really sexy voice chat app
00:59:01.440 | that they accidentally made
00:59:03.440 | that they had to deprecate today.
00:59:05.040 | It's like the 10x direction
00:59:09.040 | keeps moving.
00:59:09.840 | Now it's like, you know,
00:59:10.800 | fully in like sort of
00:59:11.600 | multi-modality land, right?
00:59:12.880 | And like the question
00:59:14.480 | is like what next, right?
00:59:15.280 | Like, so you can 10x
00:59:17.120 | in various ways,
00:59:18.320 | but like you guys
00:59:19.600 | have 10x context length.
00:59:22.160 | But like, you know,
00:59:22.960 | are we chasing the last war?
00:59:25.120 | Because like now like nobody cares
00:59:26.720 | about context length
00:59:27.360 | and now it's like
00:59:28.560 | multi-modality time, you know.
00:59:30.240 | I'm joking, obviously,
00:59:31.040 | people do care about it.
00:59:32.000 | I just, I wonder about this,
00:59:33.840 | how this comment
00:59:36.080 | about this 10x thing
00:59:37.040 | every single time.
00:59:37.760 | You know, that's honestly
00:59:39.360 | why we kind of have our eye
00:59:41.280 | on the community
00:59:42.640 | as well as you, right?
00:59:43.600 | Like you, you know,
00:59:45.760 | with your community
00:59:46.800 | and the things that you hear,
00:59:48.160 | you know, you want to build,
00:59:51.360 | where, you know, we're a product company,
00:59:52.960 | we're trying to build for users
00:59:54.720 | and trying to listen
00:59:56.480 | to understand like what they,
00:59:59.280 | what they actually need.
01:00:00.240 | Like, obviously,
01:00:01.040 | you know, you don't,
01:00:02.320 | you don't build everything
01:00:03.200 | that people ask you to build,
01:00:05.360 | but know what's useful, right?
01:00:08.240 | Because I think that
01:00:09.040 | you're totally right there.
01:00:11.280 | If we want to make something
01:00:13.360 | 10x better in a certain direction,
01:00:15.920 | but nobody cares
01:00:16.800 | and it's not useful for somebody,
01:00:18.640 | then it wasn't really worth the,
01:00:21.520 | worth the while.
01:00:22.720 | And if anything, maybe that's like
01:00:24.320 | bitter, the bitter lesson 2.0
01:00:26.800 | for so many tech startups
01:00:28.480 | is like build technology
01:00:29.920 | that people care about
01:00:31.040 | and will actually 10x their value
01:00:33.360 | rather than like build technology
01:00:35.680 | that's just, that's just 10x harder.
01:00:37.840 | I mean, no, that's not,
01:00:38.880 | that's not a bitter lesson.
01:00:39.840 | That's just Paul Graham.
01:00:40.960 | That's, that's, yeah.
01:00:42.080 | One more thing on the Chameleon paper.
01:00:44.960 | I was actually just about
01:00:45.840 | to bring that up, you know?
01:00:46.560 | So on AI News,
01:00:47.760 | like my sort of daily newsletter,
01:00:49.360 | it was literally my most,
01:00:50.640 | my most recent featured paper.
01:00:52.560 | And I always wonder
01:00:54.720 | if you can actually sort of
01:00:56.080 | like train images
01:00:58.240 | onto the same data space as words.
01:00:59.920 | That was kind of done with like,
01:01:01.760 | you know, what we now call
01:01:02.800 | late fusion models with like lava
01:01:04.960 | and flamingo and,
01:01:07.680 | you know, all the others.
01:01:08.880 | But now the early fusion models
01:01:10.720 | like Chameleon
01:01:11.840 | seem to be the way forward.
01:01:13.200 | Like, obviously it's more native.
01:01:15.520 | I wonder if you guys can figure out
01:01:17.680 | some kind of weird technique
01:01:18.800 | where you can take an existing
01:01:20.080 | like Lama 3 model
01:01:21.120 | and like, you know,
01:01:22.560 | early fuse the images
01:01:24.560 | into the text encoder
01:01:26.560 | so that we just retroactively
01:01:29.440 | have the early fusion models.
01:01:31.040 | Yeah.
01:01:32.180 | Even before the early,
01:01:34.640 | you know, the Chameleon paper came out,
01:01:36.320 | I think that was on our big board
01:01:37.600 | of next to do's to possibly explore
01:01:40.880 | or our backlog of ideas, right?
01:01:44.880 | Because as you said, early fusion,
01:01:48.640 | like even before this paper,
01:01:50.080 | I can't remember.
01:01:51.200 | I think Meta even had like a scaling laws
01:01:54.160 | for multimodality paper
01:01:56.240 | that does explore more early fusion.
01:01:58.480 | Like the moment we saw that
01:02:00.400 | it was just kind of obvious to us
01:02:01.840 | that eventually it'll get to the point
01:02:05.280 | that becomes a little bit more mainstream.
01:02:07.280 | And yeah, like that's a cool twist
01:02:10.320 | that we've been thinking about too as well,
01:02:12.880 | as well as like other things
01:02:14.560 | that are kind of in the works
01:02:15.920 | that are a little bit more agentic.
01:02:17.200 | But yeah, if open collaboration interests you,
01:02:21.040 | we can always work on that
01:02:22.560 | together with the community.
01:02:24.080 | Ooh, okay.
01:02:25.120 | Shout out there.
01:02:25.760 | Cool.
01:02:27.840 | Well, you can leave that
01:02:28.960 | in the call to action at the end.
01:02:30.400 | I just want to, you know,
01:02:31.280 | we have a couple more questions
01:02:32.240 | to round this out.
01:02:33.280 | You mentioned a lot of papers in your work.
01:02:36.320 | You're also building a company.
01:02:37.600 | You're also looking at open source
01:02:39.040 | projects and community.
01:02:41.040 | What is your daily or weekly routine
01:02:42.880 | to keep on top of AI?
01:02:43.920 | So one, subscribe to AI News.
01:02:50.480 | He didn't have to pay me to say that.
01:02:51.760 | I actually think like it's a good aggregator.
01:02:54.400 | I think it's a good aggregator.
01:02:56.640 | I'll tell you why.
01:02:57.360 | Most of the fastest moving like
01:03:01.200 | research that's being done out there
01:03:04.560 | is like it's showing up.
01:03:06.240 | It's mostly on Twitter.
01:03:07.200 | Like my Twitter is like,
01:03:08.480 | I wasn't a power Twitter user at all.
01:03:10.880 | Before three years ago,
01:03:12.320 | but I had to use it
01:03:14.480 | and I had to always check it
01:03:15.920 | in order to keep on top of like early work,
01:03:19.040 | right?
01:03:19.280 | That people want to talk about or present
01:03:21.440 | because nothing against
01:03:24.000 | submitting research papers
01:03:26.320 | to like ICLR, ICML,
01:03:28.960 | like knowing the state of the art,
01:03:30.400 | like those are like six months late, right?
01:03:34.800 | Like people have already
01:03:36.160 | dropped it on arXiv,
01:03:37.120 | or they're just openly talking about it.
01:03:38.960 | The submission process.
01:03:40.560 | Yeah.
01:03:40.880 | Yeah.
01:03:41.120 | And then being on discord to see
01:03:43.760 | when the rubber hits the road, right?
01:03:46.800 | Like the implementations
01:03:48.480 | and the practices that are being done
01:03:51.600 | or like the data sets, like you said,
01:03:54.560 | like a lot of conversations
01:03:57.120 | about really good data sets
01:03:58.880 | and how do you construct them
01:04:00.160 | are done in the open
01:04:02.560 | in figuring that out
01:04:03.440 | for people that don't have like
01:04:05.120 | budgets of like $10 million
01:04:06.480 | to just pay a bunch of annotators.
01:04:09.440 | So my routine daily is like,
01:04:12.080 | second thing I do when I wake up
01:04:13.840 | is to look on Twitter
01:04:14.960 | to see what the latest updates are
01:04:20.160 | from specific people
01:04:22.480 | that do really, really great work.
01:04:23.760 | Armen at Meta
01:04:26.720 | who did the Chameleon paper
01:04:28.400 | is like everything he writes
01:04:30.480 | on Twitter is like gold.
01:04:31.760 | So like anytime he writes something there,
01:04:33.440 | like I really try to figure out
01:04:34.800 | what he's actually saying there
01:04:37.360 | and then tie it to techniques
01:04:39.200 | and research papers out there.
01:04:40.640 | And then sometimes I try to use certain tools,
01:04:45.440 | like I myself use AI itself
01:04:47.440 | to search for the latest papers
01:04:52.240 | on a specific topic,
01:04:53.360 | if that's the thing on the top of my mind.
01:04:55.840 | And at the end of the day,
01:04:57.440 | trying out the products too.
01:05:00.240 | I think if you do not try out the tooling
01:05:03.200 | and some of the products out there,
01:05:05.280 | like you are missing out on
01:05:07.200 | someone's compression algorithm.
01:05:10.080 | Like they compressed all the research out there
01:05:13.360 | and all the thought
01:05:14.080 | and all the state of the art
01:05:15.520 | into a product that they're trying to create for you.
01:05:18.480 | And then like really backing out
01:05:20.160 | in reverse engineering,
01:05:21.200 | like what it took to build something like that.
01:05:23.200 | Like that's huge, right?
01:05:26.960 | Like if you can actually understand
01:05:28.240 | like perplexity, for instance,
01:05:30.000 | like you'll already be well ahead on the research.
01:05:33.360 | - Oh, by the way,
01:05:34.480 | you mentioned what's a good perplexity score?
01:05:37.520 | If there's like just a number, right?
01:05:38.880 | Like it's like five to eight or something.
01:05:40.960 | Like do you have a number in mind when you said that?
01:05:45.200 | - Yeah, I mean, what was the one that we had?
01:05:48.800 | Flipping between train loss and perplexity
01:05:51.440 | is actually not native to me quite yet.
01:05:53.600 | But like, yeah, between like,
01:05:55.120 | if you can get like a four
01:05:56.240 | using the context length extension on LLAMA,
01:06:01.600 | using the context length extension on Llama,
01:06:02.880 | And then obviously you'll see spikes.
01:06:04.320 | And specifically when the one trick
01:06:08.080 | you should pay attention to is,
01:06:09.680 | you know that your context length
01:06:14.960 | and theta scaling is working right.
01:06:16.960 | If the early steps in the perplexity go straight down.
01:06:19.600 | So like when it wasn't correct,
01:06:21.040 | it would oscillate a lot in the beginning.
01:06:23.760 | And we just knew that we cut the training short
01:06:26.240 | and then retry a new theta scale.
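For reference on the numbers being discussed: perplexity is just the exponential of the mean token-level cross-entropy, so a loss around 1.39 nats is a perplexity around 4, and the early-step sanity check described here is easy to automate. The oscillation threshold below is an arbitrary illustration, not a tuned value.

```python
import math

# Perplexity is exp(mean token-level cross-entropy), so a train loss of ~1.39
# nats corresponds to a perplexity of ~4, the ballpark mentioned here.

def perplexity(mean_nll: float) -> float:
    return math.exp(mean_nll)

print(perplexity(1.386))   # ~4.0

def theta_scale_looks_right(early_ppls, max_upticks=1):
    """Heuristic from the discussion: with a good theta scale, perplexity should
    fall almost monotonically over the first steps; lots of upward oscillation
    is a signal to stop the run and retry a new scale. The threshold here is an
    arbitrary illustration, not a tuned value."""
    upticks = sum(1 for a, b in zip(early_ppls, early_ppls[1:]) if b > a)
    return upticks <= max_upticks

print(theta_scale_looks_right([9.1, 7.0, 5.8, 5.1, 4.6]))    # True: smooth descent
print(theta_scale_looks_right([9.1, 12.4, 8.0, 13.2, 7.5]))  # False: oscillating
```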
01:06:27.680 | - Because in effect,
01:06:30.000 | you're properly continuing the fine-tuning
01:06:32.480 | or the full retraining.
01:06:33.600 | - Yeah, yeah.
01:06:34.320 | The model just like,
01:06:35.360 | it saw something out of domain immediately
01:06:37.600 | and was like, I have no idea what to do.
01:06:40.160 | And you need it to be able to overlap
01:06:43.200 | that positional embedding on top of each other.
01:06:46.720 | - One follow up, right?
01:06:47.840 | Before we sort of close out.
01:06:49.200 | Like, I think being on Twitter
01:06:53.520 | and like looking at all these new headlines
01:06:56.240 | is really helpful.
01:06:57.120 | But then it only gets you
01:06:59.120 | like a very surface level understanding.
01:07:00.880 | Then you still need a process to decide
01:07:03.200 | which one to invest in.
01:07:04.240 | So I'm trying to dig for like,
01:07:06.880 | what is your formula for like deciding,
01:07:09.840 | you know, what to go deep on
01:07:11.040 | and what to kind of skip.
01:07:12.560 | - From a practical standpoint,
01:07:14.560 | as a company,
01:07:15.360 | like I already know there are like three to five things
01:07:21.280 | that will be valuable and useful to us.
01:07:23.200 | And then there's other stuff that's like out of scope
01:07:25.280 | for different reasons.
01:07:28.320 | Some stuff is like out of scope from,
01:07:30.000 | hey, this is not going to impact or help us.
01:07:34.240 | And then other things are out of scope
01:07:35.600 | because we can't do it.
01:07:36.640 | You know, like the stuff like different tech.
01:07:40.560 | So a really good instance for that is
01:07:43.120 | specific algorithms for,
01:07:47.760 | you know, improving extremely large scale
01:07:52.560 | distributed training.
01:07:53.440 | Like that's one where we're not gonna have the opportunity
01:07:56.560 | to get 2,000 H100s.
01:07:59.520 | If we do, it'd be really cool.
01:08:01.360 | But like, I'm just saying like, as for now,
01:08:03.680 | like you gotta reach for the things
01:08:05.360 | that would be useful.
01:08:06.560 | Things that would be useful for us,
01:08:08.320 | for instance,
01:08:08.960 | for everybody actually, to be honest,
01:08:12.560 | is like evaluations,
01:08:14.720 | different post-training techniques,
01:08:17.520 | and then synthetic data construction.
01:08:22.000 | Like we're always on the,
01:08:23.440 | I'm always on the look for that.
01:08:24.480 | And then how do I figure out
01:08:25.760 | where there are these things?
01:08:26.880 | You know, which new piece of news
01:08:30.640 | is actually novel?
01:08:31.680 | Well, that's sort of my like mental cache
01:08:35.920 | to a certain extent.
01:08:36.720 | Like I've built up like this state of like,
01:08:38.560 | I already know like all the things
01:08:40.640 | that have already been written
01:08:41.760 | for the state of the art
01:08:43.920 | for certain topic areas.
01:08:46.560 | And then I know what's being kind of recycled
01:08:49.280 | as like an empirical study
01:08:50.800 | versus like something that
01:08:52.560 | actually is very insightful.
01:08:54.480 | Underrated specific instance
01:08:57.520 | would be like the DeepSeek paper.
01:09:00.400 | I'd never seen it before,
01:09:01.680 | but like the multi-head latent attention,
01:09:05.280 | like that was really unexpected to me
01:09:08.320 | because like I thought I'd seen every type,
01:09:12.320 | not every type, obviously,
01:09:13.520 | but like every way that people wanted to cut
01:09:15.760 | like mixture of experts into interesting ways.
01:09:18.240 | And I never thought something
01:09:19.280 | would like catch my eye to be like,
01:09:20.800 | oh, this is totally new.
01:09:23.280 | And it really does have a lot of value.
01:09:25.520 | Yeah, so like, I think that's mainly
01:09:30.000 | how I try to do it.
01:09:32.880 | And like you talk to your network too.
01:09:35.920 | Like I just, you know,
01:09:38.160 | talk to the people and then know
01:09:39.520 | and make sure like I have
01:09:41.360 | certain subject matter experts
01:09:43.680 | on SpeedDial that I also like
01:09:48.400 | to share information with
01:09:49.760 | and understand like,
01:09:52.000 | hey, does this catch your eye too?
01:09:56.080 | Do you think this is valuable or real?
01:09:58.560 | 'Cause yeah, right, Shawn,
01:10:00.400 | it's a noisy space we're in right now,
01:10:02.000 | which is cool 'cause it's really interesting
01:10:05.440 | and people are excited about it.
01:10:07.600 | But at the same time,
01:10:08.880 | there is actually a 10X
01:10:11.920 | or more explosion of information coming in
01:10:14.800 | that all sounds really, really unique and new.
01:10:18.320 | And you could spend like hours,
01:10:20.800 | you know, down a rabbit hole
01:10:22.000 | that isn't that useful.
01:10:23.520 | Awesome, Mark, I know we kept you
01:10:25.280 | in the studio for a long time.
01:10:26.480 | Any final call to actions for folks
01:10:29.120 | that could be roles you're hiring for,
01:10:31.440 | requests for startups,
01:10:33.200 | anything that comes to mind
01:10:35.520 | that you want to share with the audience?
01:10:37.280 | Yeah, I think on the line of
01:10:39.280 | we definitely have a call to action
01:10:42.960 | to get more people to work together with us
01:10:45.440 | for long context evaluations.
01:10:49.040 | That is sort of the it topic
01:10:52.720 | that everyone,
01:10:55.200 | like even Meta or Google
01:10:56.880 | or any of the other folks, are focusing on.
01:11:00.080 | 'Cause I think we lack an understanding
01:11:02.560 | of that within the community.
01:11:03.920 | And then can we as a community
01:11:07.040 | also help to construct
01:11:08.800 | like other modalities of datasets
01:11:11.200 | that would be interesting,
01:11:12.800 | like pairwise datasets, right?
01:11:15.600 | Like you could get just straight video
01:11:17.440 | and then straight text,
01:11:18.160 | but like getting them together
01:11:20.080 | that have like for grounding purposes
01:11:23.920 | will be really useful
01:11:25.040 | for training the next set of models
01:11:27.760 | that I know are coming out.
01:11:29.440 | And the more people
01:11:31.520 | we have contributing to that
01:11:32.800 | would be really useful.
01:11:34.160 | Awesome, thank you so much for coming on, Mark.
01:11:37.520 | This was a lot of fun.
01:11:38.480 | Yeah, thanks a lot.
01:11:39.280 | Yeah, this is great.
01:11:41.040 | (upbeat music)