The Hierarchy of Needs for Training Dataset Development: Chang She and Noah Shpak

We're excited to be here, and we're excited to be talking to you about training dataset development for LLMs. My name is Chang She. I'm the CEO and co-founder of LanceDB. I've been creating data tools for data science and machine learning for almost two decades, starting with being one of the co-authors of the Pandas library a long time ago. I spent a bunch of years in big data systems and recommender systems, and most recently I started this company, LanceDB, which is the database for multimodal AI. And these days I spend about...
And yeah, hi, everyone. I'm Noah. I currently lead the AI data platform at Character. Character AI is one of the leading personalized AI platforms, so we train our own foundation models as well as run a direct-to-consumer online platform. I focus on data research: since we train our own foundation models, we need to learn what we need to train on to engage our users. So we're focused both on academic benchmarks, as well as things like AP tests, and on trying to get more engagement on our platform. My team is focused on research acceleration as well, so we tend to build a lot of tools, which is what led to this collaboration with LanceDB.
So if there's one thing I want to convey with this whole talk, it's that you should really care about what you're training on. And you should care for it by giving it a nice format that does a lot of nice things for it. I wanted to start with just broad strokes about how we think about pre-training and how we think about post-training. There's definitely a lot of overlap, but at least in pre-training you tend to think wider: you want to think more about what domains you're training on. Are you thinking about books or more chat data? And then you also want to think about quantity: how big is your model, how many tokens do you need? Compare that to post-training, where you're looking at very specific tasks, and maybe not just looking at the context of that task but also at how difficult that math problem is, or how easy that multiple-choice problem is. So you have to get much deeper and more granular in terms of the things you understand about your data at scale.
In the middle, I've grouped together some of my favorite problems right now that a lot of people are looking into. They range from data-efficient learning (how do we reduce the amount of data we need to get good results from a similarly sized model?) to sampling (how do we sample from data, and what kind of metrics do we need?) to diversity. Measuring diversity is very difficult, and we look at some of the automated ways people do that in industry and at all the different papers that are out there.
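To make the diversity point concrete, here is a minimal sketch of one automated proxy: mean pairwise cosine distance between sample embeddings. The sentence-transformers model and the metric itself are illustrative assumptions, not the specific method used at Character.

```python
# Crude diversity proxy: mean pairwise cosine distance between sample embeddings.
# Assumes the sentence-transformers package and a small off-the-shelf model;
# a real pipeline would use whatever embedding service the platform provides.
import numpy as np
from sentence_transformers import SentenceTransformer

def diversity_score(texts: list[str], model_name: str = "all-MiniLM-L6-v2") -> float:
    model = SentenceTransformer(model_name)
    emb = model.encode(texts, normalize_embeddings=True)  # (n, d), unit-norm rows
    sims = emb @ emb.T                                     # cosine similarities
    n = len(texts)
    off_diag = sims.sum() - np.trace(sims)                 # drop self-similarity
    mean_sim = off_diag / (n * (n - 1))
    return float(1.0 - mean_sim)                           # higher = more diverse

if __name__ == "__main__":
    sample = [
        "def add(a, b): return a + b",
        "Explain photosynthesis to a child.",
        "What is the capital of France?",
    ]
    print(f"diversity ~ {diversity_score(sample):.3f}")
```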
So everyone loves a good hierarchy of needs. For us, we always start with clean data and quickly go right up to evaluations. We always start there because it's hard to measure anything without a compass. And since we're focused a lot on post-training nowadays, having systems for dataset management is becoming more and more of a problem. When we're thinking about mixtures, these collections of datasets, you usually have different ways of including them in your batches and in your training sets, and you want to understand those collections not just in terms of which dataset they are (is this Wikipedia or is this something else?) but also what's in there. So it naturally rolls up into analytics: we want token counts and an understanding of length, and you might even want things that are more complicated. You might be classifying your code data into not just, say, Python versus Java, but also how difficult it is, how many functions are in this problem, how many classes you're supposed to generate. Having more and more analytics lets you understand your data more.
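As a sketch of what those per-sample analytics can look like, the snippet below computes token counts and a couple of rough code-structure features. tiktoken and the regex-based features are illustrative stand-ins for whatever tokenizer and classifiers the platform actually uses.

```python
# Roll up per-sample analytics (token counts, lengths, rough code features)
# over a dataset; the feature set here is deliberately simple.
import re
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def analyze(sample: dict) -> dict:
    text = sample["text"]
    return {
        "n_tokens": len(enc.encode(text)),
        "n_chars": len(text),
        "n_functions": len(re.findall(r"^\s*def \w+", text, re.MULTILINE)),
        "n_classes": len(re.findall(r"^\s*class \w+", text, re.MULTILINE)),
    }

rows = [
    {"text": "def f(x):\n    return x * 2\n"},
    {"text": "class Greeter:\n    def hello(self):\n        return 'hi'\n"},
]
print([analyze(r) for r in rows])
```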
I think more than anything, reading data has probably been the biggest win. These are just ways of automating things that we've learned from looking at data, looking at outputs, looking at performance, and trying to understand what is going on.
Everything in the top half of this is more about using language models to improve language models: things like synthetic data, quality scoring, and dataset selection. Dataset selection is probably the simplest and one of my favorites: you're looking for ways to match the distribution of the behavior you want from your model against the data that you do have. So a lot of what we do is retrieval or clustering. You can embed the web pretty quickly nowadays, and then we pick the data that we like according to whichever evaluations we're looking at.
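A minimal sketch of that selection step, assuming you already have an embedding function: score each candidate document by its similarity to examples of the behavior you want (for instance, an eval set) and keep the closest ones. embed() is a placeholder here, and the nearest-neighbour criterion is a simplification of the real selection logic.

```python
# Dataset selection by distribution matching: keep the candidate documents whose
# embeddings are closest to examples of the behavior you want.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Placeholder: swap in your embedding model; returns unit-norm vectors.
    rng = np.random.default_rng(0)
    v = rng.normal(size=(len(texts), 384))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def select(candidates: list[str], eval_examples: list[str], top_k: int) -> list[str]:
    cand_emb = embed(candidates)
    eval_emb = embed(eval_examples)
    # Score each candidate by its best cosine similarity to any eval example.
    scores = (cand_emb @ eval_emb.T).max(axis=1)
    keep = np.argsort(-scores)[:top_k]
    return [candidates[i] for i in keep]

print(select(["doc a", "doc b", "doc c"], ["target behaviour"], top_k=2))
```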
Quality scoring is similarly simple. We build a lot of classifiers in-house for a variety of things, and there's a lot of cool work around how people are doing this with just prompted classification. So you can do it even more simply than, say, having to go down the route of actually building a classifier, evaluating it, and doing that whole loop.
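Here is roughly what prompted quality scoring can look like, as a sketch: a rubric prompt and a one-token answer instead of a trained classifier. The OpenAI client and model name are stand-ins for whichever hosted or in-house endpoint you would actually use.

```python
# Quality scoring via prompting rather than a trained classifier.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "Rate the following document for use as LLM training data on a 1-5 scale, "
    "where 5 is clean, informative, and well-formed. Reply with a single digit.\n\n"
)

def quality_score(text: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",            # placeholder model choice
        messages=[{"role": "user", "content": RUBRIC + text}],
        max_tokens=1,
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())

print(quality_score("The mitochondria is the powerhouse of the cell."))
```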
And synthetics, given the way that we've structured our platform, are also super powerful for us, because we have this ecosystem of big data tools like Spark and Trino alongside some GPU-backed services for prompting, for embedding, and for classifying things. So we can enrich our datasets and we can augment them. You can generate quick examples of, say, preference pairs (see the sketch below) and explore a method before it's at its peak of quality. Synthetics are going to have problems, but you can start getting signal for what types of data work, what shape that data should be, and how you can start looping in human labeling to make it even better. So at the top, we use human labeling a lot for improving these classifiers, and we also want to use it for rewriting synthetic data that has issues, or rewriting data that just has issues in itself. And so all of this comes together to motivate a lot of our platform tooling, and (I'll go on to the next slide) to talk about how we try to make all of this easy for researchers working in this domain.
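For the synthetic preference-pair idea specifically, a minimal sketch might look like the following; generate() is a placeholder for a call to a generation service, and in practice the "rejected" side might come from a weaker model or an earlier checkpoint rather than a deliberately degraded prompt.

```python
# Bootstrapping preference pairs synthetically so a method (e.g. DPO) can be
# explored before investing in human labels.
def generate(prompt: str, style: str) -> str:
    # Placeholder for a call to your generation service.
    return f"[{style} answer to: {prompt}]"

def make_pair(prompt: str) -> dict:
    return {
        "prompt": prompt,
        "chosen": generate(prompt, style="helpful, detailed"),
        "rejected": generate(prompt, style="terse, low effort"),
    }

pairs = [make_pair(p) for p in ["Explain entropy.", "Write a haiku about GPUs."]]
print(pairs[0])
```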
So, as I said at the beginning, accelerating research is a big part of this. I've included some beautiful YAML here, and hopefully people can see that there's a SQL block over there. I think this is pretty motivating in terms of how we materialize datasets. If you've worked in machine learning at all, you know that you usually have a specific training format, maybe TFRecords, maybe JSON Lines, depending on where you're coming from. And at least in my experience, it's one of the most error-prone components of training: I don't know what data this is, I'm training on it, I'm getting weird results. So for our team, since we're doing so much iteration around data, making materialization part of your training job and separating concerns between how your data is materialized and what your training job is doing is really, really nice for us.
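A sketch of what that separation of concerns can look like: a config with a SQL block is read, the SQL is executed, and the result is materialized to a format the training job can load. The config schema, field names, paths, and the choice of DuckDB and Lance here are illustrative assumptions, not Character's actual materialization service.

```python
# Materialize a dataset from a config with a SQL block, in the spirit of the
# YAML shown on the slide. DuckDB runs the SQL; the result is written as Lance.
import duckdb
import lance
import yaml

CONFIG = """
name: chat_high_quality_v3
sql: |
  SELECT conversation, n_tokens, quality_score
  FROM read_parquet('/data/raw/chat/*.parquet')
  WHERE quality_score >= 4
output: /data/materialized/chat_high_quality_v3.lance
"""

cfg = yaml.safe_load(CONFIG)
table = duckdb.sql(cfg["sql"]).arrow()                    # run the SQL block, get pyarrow
lance.write_dataset(table, cfg["output"], mode="overwrite")
print(f"materialized {cfg['name']} -> {cfg['output']}")
```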
And this is where Lance started becoming a big deal for us, especially as we start thinking about multimodal, where the data volumes are much, much larger and the problems that we're trying to solve become much more complicated. The materialization service aside, it's a nice interface: you send it some request and it gives you back some list of files. Where the rubber really starts hitting the road is when we think about data loading, which is its own problem in and of itself, especially once data volume becomes really large. Lance has this nice property, which Chang will talk about a lot more, that allows for quick random access, and it lets us shuffle data very cheaply. It essentially lets you shuffle references to rows rather than shuffling the rows themselves, which saves a lot of time in iteration speed.
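A minimal sketch of that loading pattern with the pylance package: permute integer row ids once, then fetch each batch by random access, so only references ever get shuffled. Paths, column names, and the batch size are illustrative.

```python
# Shuffling references instead of rows: permute row ids once, then fetch each
# training batch with a random-access take().
import numpy as np
import lance

ds = lance.dataset("/data/materialized/chat_high_quality_v3.lance")

rng = np.random.default_rng(seed=42)
order = rng.permutation(ds.count_rows())          # cheap: shuffles ids, not rows

batch_size = 32
for start in range(0, len(order), batch_size):
    idx = order[start:start + batch_size].tolist()
    batch = ds.take(idx, columns=["conversation"])  # pyarrow Table for this batch
    # ...tokenize and feed to the model here...
```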
And at the end of the day for us, we just want to watch the GPUs go brr and the numbers go up. So I'll pass it over to Chang, who can talk a lot more about the Lance format in detail.
Cool. So you've heard from Noah about the importance of data in developing models. And if data is critical, then it's also critical to have the right data infrastructure. Now, AI workloads tend to be a little bit different from your traditional data warehousing, OLAP, and analytics workloads in a couple of ways, but let me give you just one motivating example.
If you think about a distributed training workload, it typically breaks down into three steps. You want to select the right samples from your raw dataset. Then you'll have a shuffle step, where you draw random rows from the filtered set. And then, if the dataset is large, you'll typically be streaming those observations, whether they're text or images or videos, from object storage into your GPUs. So in that one workload, you needed fast scans to run the filter, you need fast random access to do the shuffling, and you need to be able to deal with potentially very large binary data, large blobs, to quickly stream data directly into your GPUs.
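Putting those three steps together in a sketch (again with pylance, and with made-up column names): a metadata-only scan for the filter, a shuffle of the surviving row indices, and a streamed fetch of the heavy blob column.

```python
# The three steps of the workload: filtered scan, shuffle of references, and
# streaming the large blob column toward the GPU.
import numpy as np
import lance

ds = lance.dataset("s3://bucket/pretrain_corpus.lance")

# 1. Fast scan of the lightweight metadata columns only.
meta = ds.to_table(columns=["language", "quality"])
lang = np.array(meta["language"].to_pylist())
qual = np.array(meta["quality"].to_pylist())
keep = np.flatnonzero((lang == "en") & (qual >= 3))   # positional row indices

# 2. Shuffle the surviving row indices (references, not data).
order = np.random.default_rng(0).permutation(keep)

# 3. Stream the heavy blob column for those rows, batch by batch.
for start in range(0, len(order), 64):
    rows = order[start:start + 64].tolist()
    batch = ds.take(rows, columns=["image"])          # blobs fetched on demand
    # ...decode and move to the accelerator here...
```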
These three properties are often all required in a single AI workload, from training to search and retrieval. But existing data formats and data infrastructure are good for at most two of the three. And so this is what I'm calling the new CAP theorem for AI data, and that's the motivation behind designing the Lance format.
This problem is, of course, exacerbated by the scale of AI data. If you look at tabular data from the past, one row of tabular data with just simple scalar columns is, on average, about 150 bytes. If you add embeddings to that, it gets about 20 to 25 times larger. And if you add videos, it gets pretty astronomical.
And with generative AI, data isn't limited by the speed at which manual human interaction can generate observations; new rows of data are being generated at thousands of tokens per second. In the past, and I've been in data for a long time, if you were in the tens of terabytes, you were a fairly large company. These days, if you're working in generative AI, it's not unheard of for 10- or 20-person teams to be managing tens of terabytes to even petabytes of data.
So what does the Lance format do to solve these problems? Well, first, Lance is a columnar file format, so it gives you the ability to do fast scans like Parquet. And we've actually gotten rid of a big limiting factor so that we can allow you to store blobs inline.
Lance is also a lightweight table format. As you add data, it's automatically versioned, and you can add new columns without having to copy the original dataset. So it makes it a lot easier, if you're working with large multimodal datasets, to add experimental features. We call this zero-copy schema evolution.
And then finally, of course, it supports time travel, so that if you make a mistake, or there's an error or bad data, it's instantaneous to roll back to a previous version so that it doesn't corrupt downstream model training processes.
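A small sketch of these table-format features with the pylance API, noting that exact method names can differ between versions: write a dataset, derive a new column without rewriting existing data, then open an earlier version.

```python
# Versioning, zero-copy column adds, and time travel; paths and the column
# expression are illustrative.
import lance
import pyarrow as pa

uri = "/data/experiments/samples.lance"
lance.write_dataset(pa.table({"text": ["a", "b"], "score": [3, 5]}), uri, mode="overwrite")

ds = lance.dataset(uri)

# Zero-copy schema evolution: derive a new column without rewriting existing data.
ds.add_columns({"is_high_quality": "score >= 4"})

# Every write created a new version; time travel back if the new column is wrong.
print([v["version"] for v in ds.versions()])
old = lance.dataset(uri, version=1)     # the dataset as it was before add_columns
print(old.schema.names)
```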
And the third aspect of the Lance format that's really interesting is fast random access combined with indexing. The indices can quickly tell you which rows you need. But with Parquet, because it doesn't support random access, even if you know which rows you need to fetch, you can't retrieve just those rows efficiently.
So with Lance, we've added indexing extensions: for embeddings, you can do essentially billion-scale vector search; there are scalar indices to make filtering on metadata columns fast; and there are full-text search indices for keyword or fuzzy search, so you don't really need that Elasticsearch cluster anymore.
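In LanceDB terms, a sketch of the same table serving scalar, full-text, and vector queries might look like this; the data and index parameters are illustrative, and on a tiny table the vector search simply runs exactly without an ANN index.

```python
# One table, three kinds of queries: scalar filters, full-text search, vectors.
import lancedb

db = lancedb.connect("/data/lancedb")
tbl = db.create_table(
    "docs",
    data=[
        {"id": 1, "text": "shuffle rows cheaply", "vector": [0.1, 0.9]},
        {"id": 2, "text": "stream blobs to GPUs", "vector": [0.8, 0.2]},
    ],
)

tbl.create_scalar_index("id")     # fast filtering on metadata columns
tbl.create_fts_index("text")      # keyword search, no Elasticsearch cluster
# On a real, large table you would also build an ANN index on "vector",
# e.g. tbl.create_index(metric="cosine"); tiny tables are searched exactly.

print(tbl.search([0.1, 0.8]).where("id > 0").limit(1).to_list())  # vector + filter
print(tbl.search("blobs", query_type="fts").limit(1).to_list())   # full-text
```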
So what Lance gives you is the ability to have a single table that serves all of these workloads. If you have metadata columns or time-series columns, you can plug Lance directly into, say, DuckDB or Trino or Spark. And if you're storing large blobs and tensors, like videos or text or images, you can plug your Lance data straight into your training pipelines, and you can use the embedded vector index to do similarity search.
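For the analytics side, here is a sketch of handing the scalar columns of a Lance dataset to DuckDB for SQL; the path and column names are carried over from the earlier hypothetical examples.

```python
# One copy of the data, two engines: scalar columns of a Lance dataset queried
# with DuckDB.
import duckdb
import lance

ds = lance.dataset("/data/materialized/chat_high_quality_v3.lance")
meta = ds.to_table(columns=["n_tokens", "quality_score"])   # pyarrow Table

# DuckDB's replacement scan picks up the local pyarrow table by name.
print(duckdb.sql(
    "SELECT quality_score, count(*) AS n, avg(n_tokens) AS avg_tokens "
    "FROM meta GROUP BY quality_score ORDER BY quality_score"
).df())
```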
And so this makes it a lot easier for a full AI workflow, from analyzing and exploring your dataset, to searching and retrieving throughout your dataset, to fine-tuning and training your model.
Around this format, we've built the LanceDB vector database, and more generally a database for multimodal AI. One big feature is distributed vector search: search through billions of vectors at low latency and very high QPS, with an order of magnitude less infrastructure.
When we talk about multimodal, we often think narrowly about just image generation or video generation. But when you look at the data, multimodal, I think, is much broader than that. Unlike traditional tabular data, we can store features, and then audio waveforms, images, and all of that, as well as embeddings, whether they're image embeddings or text embeddings, and then support other sorts of data frame and SQL workloads on top.
So it can serve operational scenarios, where you're in a production service for RAG or search and retrieval and personalization, or it can be part of your data lake to analyze and explore all that multimodal data that you have.
Yeah. So I think that, at least from my team's experience, we just think that speed is probably our best bet. A lot of the tools that we've worked with really slow down, and we're looking to develop out what the future...