We're excited to be here, and we're excited to be talking to you about training dataset development for LLMs. So my name is Chang She. I'm the CEO and co-founder of LanceDB. I've been creating data tools for data science and machine learning for almost two decades, starting with being one of the co-authors of the Pandas library a long time ago.
I spent a bunch of years in big data systems and recommender systems, and most recently I started this company, LanceDB, which is the database for multimodal AI. And these days I spend roughly equal time tweeting and on GitHub. And yeah, hi, everyone. I'm Noah. I currently lead the AI data platform at Character.
And Character AI is one of the leading personalized AI platforms. So we train our own foundation models as well as run a direct-to-consumer online platform. And I focus on data research. So since we train our own foundation models, we need to learn what we need to train on to engage our users.
And so we're focused both on academic benchmarks as well as things like AP tests and trying to get more engagement on our platform. My team is focused on research acceleration as well, so we tend to build a lot of tools, which led to this collaboration with Lance and how we think about storing our data.
So I think if there's one thing I want to convey with this whole talk is that you should really care about what you're training on. And you should care for it by giving it a nice format that does a lot of nice things for it. I wanted to start just kind of broad strokes talking about how we think about pre-training and how we think of post-training.
There's definitely a lot of overlap. But at least in terms of pre-training, you tend to think wider. Right? You want to think about more like what domains you're training on. Are you thinking about books or more chat data? And then you want to also think about quantity. Right? How big is your model?
How many tokens do you need? Compared to post-training where you're looking at very specific tasks and maybe not just looking at the context of that task, but also how difficult is that math problem? How easy is that multiple choice problem? So you kind of have to get much deeper and more granular in terms of the things that you understand about your data at scale.
In the middle, I guess I've grouped together some of my favorite problems right now that a lot of people are looking into. So ranging from data-efficient learning, right, how do we reduce the amount of data we need to get good results from a similarly sized model? How do we sample from data?
Right? Like what kind of metrics do we need? And then how do we look at diversity? Right? Measuring diversity is very difficult and looking at some of the automated ways that we do that in industry and all the different papers that are out there. So everyone loves a good hierarchy of needs.
I think that for us, we always start with clean data and quickly go right up to evaluations. For us, we always start there because it's hard to measure anything without a compass. And since we're focused a lot on post-training nowadays, having systems for dataset management is becoming more and more of a problem.
So when we're thinking about mixtures, right, these collections of datasets and usually you have different ways of how you're including them in your batches and in your training sets, you want to understand those collections, not just in terms of what dataset they are. You know, is this Wikipedia or is this some other thing?
But also what's in there, right? So it naturally rolls up into analytics. So we want token counts and understanding of length. And you might even want things that are more complicated, right? So you might be classifying your code data into not just, say, is this Python or is this Java, but also how difficult is it?
How many functions are in this problem? How many classes are you supposed to generate? And really having more and more analytics lets you understand your data more. I think more than anything, reading data has probably been the biggest win. So these are kind of just ways of automating things that we've learned from looking at data, looking at outputs, looking at performance, and trying to understand what is going on.
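Just to give a flavor of what those analytics look like, here's a rough sketch in Python; the column names, sources, and tokenizer choice are illustrative assumptions, not our actual pipeline.

    # Rough sketch of per-record analytics for a mixture: token counts and
    # length stats, rolled up per source dataset. Column names, sources, and
    # the tiktoken encoding are illustrative assumptions.
    import pandas as pd
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    df = pd.DataFrame({
        "source": ["wikipedia", "code", "chat"],
        "text": [
            "Alan Turing was an English mathematician.",
            "def add(a, b):\n    return a + b",
            "Hi! How hard is this multiple choice problem?",
        ],
    })
    df["n_tokens"] = df["text"].map(lambda t: len(enc.encode(t)))
    df["n_chars"] = df["text"].str.len()

    # Mixture-level view: how many tokens does each source contribute?
    print(df.groupby("source")["n_tokens"].sum())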
Everything in kind of the top half, I guess, of this is more about using language models to improve language models. So things like synthetic data, things like quality scoring, things like dataset selection. And dataset selection is probably the simplest and one of my favorites, right? You're just kind of looking at ways to match distributions for the behavior you want from your model and the data that you do have.
And so a lot of what we do is do retrieval or do clustering. You know, you can embed the web nowadays pretty quickly. And how do we pick the data that we like according to what kind of evaluations we're looking at? Quality scoring is similarly simple. Like we build a lot of classifiers in-house for a variety of things.
And there's a lot of cool work around how people are actually doing this with just prompting classifications. So you can do it even more simply than, say, having to go down the route of actually building a classifier and evaluating it and doing that whole loop. And synthetics, given the way that we've structured our platform, is also super powerful for us because we have this ecosystem of big data tools like Spark and Trino alongside some GPU-backed services for doing prompting, for doing embedding, and for classifying things.
And so we can enrich our data sets. We can augment them. You can generate quick examples of, say, preference pairs and try to explore a method not at its peak of quality, right? Synthetics are going to have problems. But you can start getting signal for what types of data, what shape is that data, and how can you kind of start looping in human labeling to make it even better?
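To make the dataset-selection idea from a moment ago a bit more concrete, here's a minimal sketch of the matching step, assuming the embeddings are already computed and normalized; the shapes and the keep fraction are made up.

    # Minimal sketch of embedding-based dataset selection: score each candidate
    # document by its best cosine similarity to a small target set (e.g. an eval
    # set), then keep the top fraction. Assumes precomputed, unit-normalized
    # embeddings; shapes and the keep fraction are illustrative.
    import numpy as np

    def select_by_similarity(candidates, targets, keep_frac=0.1):
        sims = candidates @ targets.T        # (N, M) cosine similarities
        scores = sims.max(axis=1)            # best match against the target set
        k = max(1, int(len(candidates) * keep_frac))
        return np.argsort(-scores)[:k]       # row indices of the selected data

    rng = np.random.default_rng(0)
    cand = rng.normal(size=(10_000, 256))
    cand /= np.linalg.norm(cand, axis=1, keepdims=True)
    tgt = rng.normal(size=(100, 256))
    tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
    picked = select_by_similarity(cand, tgt)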
So at the top, right, we use human labeling a lot for improving these classifiers. And we also want to use them for rewriting synthetic data that maybe has issues or rewriting data that just has issues in itself. And so all of this kind of comes together to motivate a lot of our platform tooling and -- I'll go on to the next slide -- to talk about kind of how we try to make all of this easy for researchers working in this domain.
So I said at the beginning, accelerating research is a big part of this. I've included some beautiful YAML here; hopefully people can see that there's a SQL block in there. And I think that this is pretty motivating in terms of how we materialize datasets. So if you've worked in machine learning at all, you know that usually you have a specific training format.
Maybe it's TFRecords, maybe it's JSON lines, depending on where you're coming from. And at least in my experience, it's one of the most error-prone components of training, right? I don't know what data this is. I'm training on it. I'm getting weird results. So for our team, since we're doing so much iteration around data, making it part of your training job and separating concerns in terms of how your data is materialized and what your training job is doing is really, really nice for us.
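Since the slide isn't reproduced here, here's roughly the shape of such a spec, written as a Python dict instead of the actual YAML; every key and the stub function are hypothetical, just to show how materialization stays separate from the training job.

    # Hypothetical sketch of a materialization request: a declarative spec (ours
    # is YAML with a SQL block) goes in, a list of files in the training format
    # comes out. All keys and the stub function are made up for illustration.
    spec = {
        "name": "math_sft_v3",
        "sources": [
            {"dataset": "math_problems",
             "sql": "SELECT prompt, response FROM math_problems WHERE difficulty <= 3"},
            {"dataset": "chat_logs",
             "sql": "SELECT prompt, response FROM chat_logs WHERE quality_score > 0.8"},
        ],
        "output": {"format": "jsonl", "shard_size_rows": 100_000},
    }

    def materialize(spec: dict) -> list[str]:
        """Run each SQL block, write shards in the training format, return paths."""
        ...  # handled by the materialization service, not the training job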
And this is kind of where Lance started becoming a big deal, especially as we start thinking about multimodal and how the data volumes are much, much larger. And the problems that we're trying to solve become much more complicated. So the materialization service aside, you know, it's kind of this nice interface that you send it some request and it gives you some list of files.
The rubber really starts hitting the road when we think about data loading, which is its own problem in and of itself, especially once data volume becomes really large. So Lance has this nice property that Chang will talk about a lot more that allows for quick random access, and it lets us shuffle data very cheaply, right?
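Concretely, the pattern looks something like the sketch below, assuming the current Lance Python API; the dataset path, column name, and batch size are made up.

    # Sketch of shuffling row references instead of rows: permute indices cheaply,
    # then use Lance's fast random access to fetch each batch on demand.
    # The path, column name, and batch size are illustrative.
    import numpy as np
    import lance

    ds = lance.dataset("s3://my-bucket/pretraining_mix.lance")  # hypothetical path
    indices = np.random.permutation(ds.count_rows())

    batch_size = 1024
    for start in range(0, len(indices), batch_size):
        batch_ids = indices[start:start + batch_size].tolist()
        batch = ds.take(batch_ids, columns=["text"])  # a small pyarrow Table
        # ... tokenize and feed the GPUs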
So it essentially lets you shuffle references to rows rather than shuffling the rows themselves, which allows you to save a lot of time in iteration speed. And at the end of the day for us, we just want to watch the GPUs go brr and the numbers go up. So I'll pass it over to Chang, who can talk a lot more about the Lance format in detail.
Thanks. Cool. So you've heard from Noah about the importance of data in developing models. And so if data is critical, then it's also critical to have the right data infrastructure for your workloads. Now, AI workloads tend to be a little bit different from your traditional data warehousing, OLAP, and analytics workloads in a couple of different ways.
But let me give you just one motivating example. If you think about a distributed training workload, typically it breaks down into three steps. You have a filter. You want to select the right samples from your raw data set. Then you'll have a shuffle step where you will then draw random rows from the filtered set.
And then you'll stream -- typically, if the data set is large, you'll be streaming those observations, whether they're text or images or videos, from object storage into your GPUs. So in that one workload, you need fast scans to run the filter. You need fast random access to do the shuffling.
And then you need to be able to deal with potentially very large binary data, large blobs, to be able to quickly stream data directly into your GPUs. So these three properties are often all required in one workload, across AI workloads from training to search and retrieval. But existing data formats and data infrastructure are good for at most two, and often just one, of the three.
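As a rough sketch of the scan-and-stream side of that workload against a Lance dataset: push the filter into the scan, then stream the large blob column in batches straight from object storage. The path, columns, and filter are made up, and the calls assume the current Lance Python bindings.

    # Rough sketch: filter pushed into the scan, large blobs streamed in batches.
    # Path, columns, and filter are illustrative; API details assume the current
    # Lance Python bindings.
    import lance

    ds = lance.dataset("s3://my-bucket/images.lance")  # hypothetical dataset
    scanner = ds.scanner(
        columns=["image", "caption"],
        filter="split = 'train' AND quality_score > 0.8",
        batch_size=64,
    )
    for record_batch in scanner.to_batches():
        pass  # decode images / tokenize captions and feed the GPUs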
And so this is what I'm calling the new CAP theorem for AI data. And that's the motivation for us in designing the Lance format, around which we've built LanceDB. This problem is, of course, exacerbated by the scale of AI data, and especially multimodal data. So if you look at tabular data from the past, one row of tabular data with just simple scalar columns is, on average, about 150 bytes per row.
If you add embeddings to that, that gets about 20 to 25 times larger, depending on the number of dimensions. If you add images, that's another 20 times. And if you add videos, that gets pretty astronomical. And that's one single row. And with generative AI, data isn't limited by the speed at which, you know, manual human interaction can generate observations.
New rows of data are being generated at thousands of tokens per second. So scale often blows up. I've been in data for a long time, and in the past, if you were in the tens of terabytes, you were a fairly large company. And I think these days, if you're working in generative AI, it's not unheard of for, you know, 10-person, 20-person teams to be managing tens of terabytes to even petabytes of data.
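The back-of-the-envelope math behind those per-row numbers, with illustrative dimensions and file sizes, looks something like this:

    # Back-of-the-envelope per-row sizes (illustrative dimensions and file sizes):
    scalar_row = 150                 # bytes: a handful of scalar columns
    embedding = 768 * 4              # a 768-dim float32 vector ~ 3 KB, ~20x the scalars
    image = 64 * 1024                # a modest JPEG, ~20x more again
    video = 50 * 1024 * 1024         # even a short clip is tens of MB

    print(scalar_row, scalar_row + embedding, scalar_row + embedding + image)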
So what does the Lance format do to solve these problems? Well, first, Lance is a columnar file format, like Parquet, but optimized for AI. So it gives you the ability to do fast scans like Parquet. It supports fast lookups, unlike Parquet. And we've actually gotten rid of a big limiting factor in Parquet called row groups, so that we can allow you to store blobs inline.
The Lance format is also a lightweight table format. So as you add data, it's automatically versioned. You can also add additional columns without having to copy the original dataset. So it makes it a lot easier, if you're working with large multimodal datasets, to add experimental features and then roll them back later on.
We call this zero-copy schema evolution. And then finally, of course, it supports time travel. So oftentimes, if you make a mistake or there's an error or there's bad data, it's instantaneous to roll back to a previously known-good version, so that it doesn't corrupt downstream model training processes.
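A minimal sketch of what that looks like from Python; the path and data are toy examples, and the calls assume the current Lance Python API.

    # Minimal sketch of versioning and time travel; path and data are toy
    # examples, and the calls assume the current Lance Python API.
    import lance
    import pyarrow as pa

    uri = "/tmp/demo.lance"
    lance.write_dataset(pa.table({"id": [1, 2], "text": ["a", "b"]}), uri, mode="create")
    lance.write_dataset(pa.table({"id": [3], "text": ["c"]}), uri, mode="append")

    ds = lance.dataset(uri)
    print(ds.version, ds.count_rows())   # every write creates a new version

    # Rolling back is just reading an older version -- no data gets rewritten.
    old = lance.dataset(uri, version=1)
    print(old.count_rows())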
And the third aspect of the Lance format that's really interesting is indexing extensions. Parquet does have indices, and those indices can quickly tell you which rows you need. But with Parquet, because it doesn't support random access, even if you know which rows you need to fetch, it's really slow to fetch those rows.
Not so with Lance. With Lance, we've added indexing extensions for embeddings, so you can do, you know, essentially billion-scale vector search directly off of S3. We have scalar indices to make filtering on metadata columns really quick. And then full-text search indices to do keyword or fuzzy search directly from your S3 dataset.
And you don't really need that Elasticsearch cluster anymore. So what Lance gives you is the ability to have a single table for many, many different workloads. So if you have metadata columns or time series columns, you can run SQL. So you can plug Lance directly into, say, DuckDB or Trino or Spark.
And you can run SQL on that. And if you're storing large blobs and tensors, like the videos or text or images, you can plug your Lance data, the same table, into PyTorch training. And if you have embedding vectors, you can use the vector index to do similarity search.
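Here's a rough sketch of that single-table idea; the dataset, columns, and query vector are made up, and the calls assume current Lance and DuckDB Python APIs.

    # Sketch of one table, many workloads: the same Lance dataset queried with
    # SQL via DuckDB and with a nearest-neighbor vector search. Dataset, columns,
    # and the query vector are illustrative; calls assume current Lance/DuckDB.
    import duckdb
    import lance
    import numpy as np

    ds = lance.dataset("s3://my-bucket/docs.lance")  # hypothetical table

    # Analytics: DuckDB can query the Arrow data it gets back from Lance.
    arrow_tbl = ds.to_table(columns=["source", "n_tokens"])
    print(duckdb.sql("SELECT source, sum(n_tokens) FROM arrow_tbl GROUP BY source"))

    # Retrieval: vector search over the same table's embedding column.
    q = np.random.rand(768).astype("float32")
    hits = ds.to_table(nearest={"column": "vector", "q": q, "k": 10})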
And so this makes it a lot easier for a full AI workflow, from analyzing and exploring your dataset, to searching and retrieving throughout your dataset, to fine-tuning and training your model. Around this format, we've built the LanceDB vector database, and more generally, a database for multimodal AI. So one big feature is distributed vector search.
So you can search through billions of vectors at low latency and very high QPS with an order of magnitude less infra than other vector databases. And it provides data infrastructure for all of your multimodal data needs. When we talk about multimodal, we often think narrowly about just image generation or video generation. But when you look at the data, multimodal, I think, has many different meanings.
One, of course, is the data itself. The data can be multimodal. So unlike traditional tabular data, we can store features alongside audio waveforms, images, and all of that. We're familiar with that already. And, of course, vectors. A vector is a vector, whether it's an image embedding or a text embedding.
Now, the workload can also be multimodal. So, you know, not just running OLAP SQL, but you can run vector search. You can run full-text search, filtering, and then other sorts of data frame and SQL workloads. And then, finally, the use case and the scenario can also be multimodal. So there are operational scenarios where you're in a production service for RAG or search and retrieval and personalization, or Lance can be used in training, or it can be part of your data lake to analyze and explore all that multimodal data that you have.
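And on the LanceDB side, a serving-style query mixing vector search with a metadata filter looks roughly like this; the connection URI, table name, schema, and filter are made up.

    # Minimal LanceDB sketch: vector search combined with a scalar filter.
    # The connection URI, table name, schema, and filter are illustrative.
    import lancedb

    db = lancedb.connect("s3://my-bucket/lancedb")   # or a local directory
    tbl = db.open_table("characters")                # hypothetical table with a "vector" column

    results = (
        tbl.search([0.1] * 768)                      # query embedding
           .where("lang = 'en'")                     # metadata filter
           .limit(5)
           .to_pandas()
    )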
Yeah. So I think that from at least my team's experience and a lot of what Chang is describing, we just think that speed is probably our best bet in terms of strategy. And a lot of the tools that we've worked with really slow down under load, under new multimodal needs, and we're looking to develop out what the future for those data systems looks like.
So thanks so much for listening to our talk. We'll see you next time. Thank you.