The Hierarchy of Needs for Training Dataset Development: Chang She and Noah Shpak

We're excited to be here, and we're excited to be talking to you about training dataset development for LLMs. My name is Chang She. I'm the CEO and co-founder of LanceDB. I've been creating data tools for data science and machine learning for almost two decades, starting with being one of the co-authors of the Pandas library a long time ago. I spent a bunch of years in big data systems and recommender systems, and most recently I started this company, LanceDB, which is the database for multimodal AI. And these days I spend about...
And yeah, hi, everyone. I'm Noah. I currently lead the AI data platform at Character. Character AI is one of the leading personalized AI platforms, so we train our own foundation models as well as run a direct-to-consumer online platform. I focus on data research: since we train our own foundation models, we need to learn what we need to train on to engage our users. So we're focused both on academic benchmarks, as well as things like AP tests, and on trying to get more engagement on our platform. My team is focused on research acceleration as well, so we tend to build a lot of tools, which is what led to this collaboration with LanceDB.
So if there's one thing I want to convey with this whole talk, it's that you should really care about what you're training on. And you should care for it by giving it a nice format that does a lot of nice things for it. I wanted to start with just broad strokes about how we think about pre-training and how we think about post-training. There's definitely a lot of overlap, but at least in pre-training you tend to think wider: you want to think more about what domains you're training on. Are you thinking about books or more chat data? And then you also want to think about quantity: how big is your model, how many tokens do you need? Compare that to post-training, where you're looking at very specific tasks, and maybe not just looking at the context of that task but also at how difficult that math problem is, or how easy that multiple-choice problem is. So you have to get much deeper and more granular in terms of the things you understand about your data at scale.
In the middle, I've grouped together some of my favorite problems right now that a lot of people are looking into. They range from data-efficient learning (how do we reduce the amount of data we need to get good results from a similarly sized model?) to sampling (how do we sample from data, and what kind of metrics do we need?) to diversity. Measuring diversity is very difficult, and we look at some of the automated ways people do that in industry and at all the different papers that are out there.
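To make the diversity point concrete, here is a minimal sketch of one automated proxy: mean pairwise cosine distance between sample embeddings. The sentence-transformers model and the metric itself are illustrative assumptions, not the specific method used at Character.

```python
# Crude diversity proxy: mean pairwise cosine distance between sample embeddings.
# Assumes the sentence-transformers package and a small off-the-shelf model;
# a real pipeline would use whatever embedding service the platform provides.
import numpy as np
from sentence_transformers import SentenceTransformer

def diversity_score(texts: list[str], model_name: str = "all-MiniLM-L6-v2") -> float:
    model = SentenceTransformer(model_name)
    emb = model.encode(texts, normalize_embeddings=True)  # (n, d), unit-norm rows
    sims = emb @ emb.T                                     # cosine similarities
    n = len(texts)
    off_diag = sims.sum() - np.trace(sims)                 # drop self-similarity
    mean_sim = off_diag / (n * (n - 1))
    return float(1.0 - mean_sim)                           # higher = more diverse

if __name__ == "__main__":
    sample = [
        "def add(a, b): return a + b",
        "Explain photosynthesis to a child.",
        "What is the capital of France?",
    ]
    print(f"diversity ~ {diversity_score(sample):.3f}")
```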
So everyone loves a good hierarchy of needs. For us, we always start with clean data and quickly go right up to evaluations. We always start there because it's hard to measure anything without a compass. And since we're focused a lot on post-training nowadays, having systems for dataset management is becoming more and more of a problem. When we're thinking about mixtures, these collections of datasets, you usually have different ways of including them in your batches and in your training sets, and you want to understand those collections not just in terms of which dataset they are (is this Wikipedia or is this something else?) but also what's in there. So it naturally rolls up into analytics: we want token counts and an understanding of length, and you might even want things that are more complicated. You might be classifying your code data into not just, say, Python versus Java, but also how difficult it is, how many functions are in this problem, how many classes you're supposed to generate. Having more and more analytics lets you understand your data more.
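As a sketch of what those per-sample analytics can look like, the snippet below computes token counts and a couple of rough code-structure features. tiktoken and the regex-based features are illustrative stand-ins for whatever tokenizer and classifiers the platform actually uses.

```python
# Roll up per-sample analytics (token counts, lengths, rough code features)
# over a dataset; the feature set here is deliberately simple.
import re
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def analyze(sample: dict) -> dict:
    text = sample["text"]
    return {
        "n_tokens": len(enc.encode(text)),
        "n_chars": len(text),
        "n_functions": len(re.findall(r"^\s*def \w+", text, re.MULTILINE)),
        "n_classes": len(re.findall(r"^\s*class \w+", text, re.MULTILINE)),
    }

rows = [
    {"text": "def f(x):\n    return x * 2\n"},
    {"text": "class Greeter:\n    def hello(self):\n        return 'hi'\n"},
]
print([analyze(r) for r in rows])
```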
I think more than anything, reading data has probably been the biggest win. These are just ways of automating things that we've learned from looking at data, looking at outputs, looking at performance, and trying to understand what is going on.
Everything in the top half of this is more about using language models to improve language models: things like synthetic data, quality scoring, and dataset selection. Dataset selection is probably the simplest and one of my favorites: you're looking for ways to match the distribution of the behavior you want from your model against the data that you do have. So a lot of what we do is retrieval or clustering. You can embed the web pretty quickly nowadays, and then we pick the data that we like according to whichever evaluations we're looking at.
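A minimal sketch of that selection step, assuming you already have an embedding function: score each candidate document by its similarity to examples of the behavior you want (for instance, an eval set) and keep the closest ones. embed() is a placeholder here, and the nearest-neighbour criterion is a simplification of the real selection logic.

```python
# Dataset selection by distribution matching: keep the candidate documents whose
# embeddings are closest to examples of the behavior you want.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Placeholder: swap in your embedding model; returns unit-norm vectors.
    rng = np.random.default_rng(0)
    v = rng.normal(size=(len(texts), 384))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def select(candidates: list[str], eval_examples: list[str], top_k: int) -> list[str]:
    cand_emb = embed(candidates)
    eval_emb = embed(eval_examples)
    # Score each candidate by its best cosine similarity to any eval example.
    scores = (cand_emb @ eval_emb.T).max(axis=1)
    keep = np.argsort(-scores)[:top_k]
    return [candidates[i] for i in keep]

print(select(["doc a", "doc b", "doc c"], ["target behaviour"], top_k=2))
```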
Quality scoring is similarly simple. We build a lot of classifiers in-house for a variety of things, and there's a lot of cool work around how people are doing this with just prompted classification. So you can do it even more simply than, say, having to go down the route of actually building a classifier, evaluating it, and doing that whole loop.
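Here is roughly what prompted quality scoring can look like, as a sketch: a rubric prompt and a one-token answer instead of a trained classifier. The OpenAI client and model name are stand-ins for whichever hosted or in-house endpoint you would actually use.

```python
# Quality scoring via prompting rather than a trained classifier.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "Rate the following document for use as LLM training data on a 1-5 scale, "
    "where 5 is clean, informative, and well-formed. Reply with a single digit.\n\n"
)

def quality_score(text: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",            # placeholder model choice
        messages=[{"role": "user", "content": RUBRIC + text}],
        max_tokens=1,
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())

print(quality_score("The mitochondria is the powerhouse of the cell."))
```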
And synthetics, given the way that we've structured our platform, are also super powerful for us, because we have this ecosystem of big data tools like Spark and Trino alongside some GPU-backed services for prompting, for embedding, and for classifying things. So we can enrich our datasets and we can augment them. You can generate quick examples of, say, preference pairs (see the sketch below) and explore a method before it's at its peak of quality. Synthetics are going to have problems, but you can start getting signal for what types of data work, what shape that data should be, and how you can start looping in human labeling to make it even better. So at the top, we use human labeling a lot for improving these classifiers, and we also want to use it for rewriting synthetic data that has issues, or rewriting data that just has issues in itself. And so all of this comes together to motivate a lot of our platform tooling, and (I'll go on to the next slide) to talk about how we try to make all of this easy for researchers working in this domain.
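For the synthetic preference-pair idea specifically, a minimal sketch might look like the following; generate() is a placeholder for a call to a generation service, and in practice the "rejected" side might come from a weaker model or an earlier checkpoint rather than a deliberately degraded prompt.

```python
# Bootstrapping preference pairs synthetically so a method (e.g. DPO) can be
# explored before investing in human labels.
def generate(prompt: str, style: str) -> str:
    # Placeholder for a call to your generation service.
    return f"[{style} answer to: {prompt}]"

def make_pair(prompt: str) -> dict:
    return {
        "prompt": prompt,
        "chosen": generate(prompt, style="helpful, detailed"),
        "rejected": generate(prompt, style="terse, low effort"),
    }

pairs = [make_pair(p) for p in ["Explain entropy.", "Write a haiku about GPUs."]]
print(pairs[0])
```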
So, as I said at the beginning, accelerating research is a big part of this. I've included some beautiful YAML here, and hopefully people can see that there's a SQL block over there. I think this is pretty motivating in terms of how we materialize datasets. If you've worked in machine learning at all, you know that you usually have a specific training format, maybe TFRecords, maybe JSON Lines, depending on where you're coming from. And at least in my experience, it's one of the most error-prone components of training: I don't know what data this is, I'm training on it, I'm getting weird results. So for our team, since we're doing so much iteration around data, making materialization part of your training job and separating concerns between how your data is materialized and what your training job is doing is really, really nice for us.
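A sketch of what that separation of concerns can look like: a config with a SQL block is read, the SQL is executed, and the result is materialized to a format the training job can load. The config schema, field names, paths, and the choice of DuckDB and Lance here are illustrative assumptions, not Character's actual materialization service.

```python
# Materialize a dataset from a config with a SQL block, in the spirit of the
# YAML shown on the slide. DuckDB runs the SQL; the result is written as Lance.
import duckdb
import lance
import yaml

CONFIG = """
name: chat_high_quality_v3
sql: |
  SELECT conversation, n_tokens, quality_score
  FROM read_parquet('/data/raw/chat/*.parquet')
  WHERE quality_score >= 4
output: /data/materialized/chat_high_quality_v3.lance
"""

cfg = yaml.safe_load(CONFIG)
table = duckdb.sql(cfg["sql"]).arrow()                    # run the SQL block, get pyarrow
lance.write_dataset(table, cfg["output"], mode="overwrite")
print(f"materialized {cfg['name']} -> {cfg['output']}")
```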
And this is where Lance started becoming a big deal for us, especially as we start thinking about multimodal, where the data volumes are much, much larger and the problems that we're trying to solve become much more complicated. The materialization service aside, it's a nice interface: you send it some request and it gives you back some list of files. Where the rubber really starts hitting the road is when we think about data loading, which is its own problem in and of itself, especially once data volume becomes really large. Lance has this nice property, which Chang will talk about a lot more, that allows for quick random access, and it lets us shuffle data very cheaply. It essentially lets you shuffle references to rows rather than shuffling the rows themselves, which saves a lot of time in iteration speed.
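A minimal sketch of that loading pattern with the pylance package: permute integer row ids once, then fetch each batch by random access, so only references ever get shuffled. Paths, column names, and the batch size are illustrative.

```python
# Shuffling references instead of rows: permute row ids once, then fetch each
# training batch with a random-access take().
import numpy as np
import lance

ds = lance.dataset("/data/materialized/chat_high_quality_v3.lance")

rng = np.random.default_rng(seed=42)
order = rng.permutation(ds.count_rows())          # cheap: shuffles ids, not rows

batch_size = 32
for start in range(0, len(order), batch_size):
    idx = order[start:start + batch_size].tolist()
    batch = ds.take(idx, columns=["conversation"])  # pyarrow Table for this batch
    # ...tokenize and feed to the model here...
```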
And at the end of the day for us, we just want to watch the GPUs go brr and the numbers go up. So I'll pass it over to Chang, who can talk a lot more about the Lance format in detail.
Cool. So you've heard from Noah about the importance of data in developing models. And if data is critical, then it's also critical to have the right data infrastructure. Now, AI workloads tend to be a little bit different from your traditional data warehousing, OLAP, and analytics workloads in a couple of ways, but let me give you just one motivating example.
If you think about a distributed training workload, it typically breaks down into three steps. You want to select the right samples from your raw dataset. Then you'll have a shuffle step, where you draw random rows from the filtered set. And then, if the dataset is large, you'll typically be streaming those observations, whether they're text or images or videos, from object storage into your GPUs. So in that one workload, you needed fast scans to run the filter, you need fast random access to do the shuffling, and you need to be able to deal with potentially very large binary data, large blobs, to quickly stream data directly into your GPUs.
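Putting those three steps together in a sketch (again with pylance, and with made-up column names): a metadata-only scan for the filter, a shuffle of the surviving row indices, and a streamed fetch of the heavy blob column.

```python
# The three steps of the workload: filtered scan, shuffle of references, and
# streaming the large blob column toward the GPU.
import numpy as np
import lance

ds = lance.dataset("s3://bucket/pretrain_corpus.lance")

# 1. Fast scan of the lightweight metadata columns only.
meta = ds.to_table(columns=["language", "quality"])
lang = np.array(meta["language"].to_pylist())
qual = np.array(meta["quality"].to_pylist())
keep = np.flatnonzero((lang == "en") & (qual >= 3))   # positional row indices

# 2. Shuffle the surviving row indices (references, not data).
order = np.random.default_rng(0).permutation(keep)

# 3. Stream the heavy blob column for those rows, batch by batch.
for start in range(0, len(order), 64):
    rows = order[start:start + 64].tolist()
    batch = ds.take(rows, columns=["image"])          # blobs fetched on demand
    # ...decode and move to the accelerator here...
```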
These three properties are often all required in a single AI workload, from training to search and retrieval. But existing data formats and data infrastructure are good for at most two of the three. And so this is what I'm calling the new CAP theorem for AI data, and that's the motivation behind designing the Lance format.
This problem is, of course, exacerbated by the scale of AI data. If you look at tabular data from the past, one row of tabular data with just simple scalar columns is, on average, about 150 bytes. If you add embeddings to that, it gets about 20 to 25 times larger. And if you add videos, it gets pretty astronomical.
And with generative AI, data isn't limited by the speed at which manual human interaction can generate observations; new rows of data are being generated at thousands of tokens per second. In the past, and I've been in data for a long time, if you were in the tens of terabytes, you were a fairly large company. These days, if you're working in generative AI, it's not unheard of for 10- or 20-person teams to be managing tens of terabytes to even petabytes of data.
So what does the Lance format do to solve these problems? Well, first, Lance is a columnar file format, so it gives you the ability to do fast scans like Parquet. And we've actually gotten rid of a big limiting factor so that we can allow you to store blobs inline.
Lance is also a lightweight table format. As you add data, it's automatically versioned, and you can add new columns without having to copy the original dataset. So it makes it a lot easier, if you're working with large multimodal datasets, to add experimental features. We call this zero-copy schema evolution.
And then finally, of course, it supports time travel, so that if you make a mistake, or there's an error or bad data, it's instantaneous to roll back to a previous version so that it doesn't corrupt downstream model training processes.
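A small sketch of these table-format features with the pylance API, noting that exact method names can differ between versions: write a dataset, derive a new column without rewriting existing data, then open an earlier version.

```python
# Versioning, zero-copy column adds, and time travel; paths and the column
# expression are illustrative.
import lance
import pyarrow as pa

uri = "/data/experiments/samples.lance"
lance.write_dataset(pa.table({"text": ["a", "b"], "score": [3, 5]}), uri, mode="overwrite")

ds = lance.dataset(uri)

# Zero-copy schema evolution: derive a new column without rewriting existing data.
ds.add_columns({"is_high_quality": "score >= 4"})

# Every write created a new version; time travel back if the new column is wrong.
print([v["version"] for v in ds.versions()])
old = lance.dataset(uri, version=1)     # the dataset as it was before add_columns
print(old.schema.names)
```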
And the third aspect of the Lance format that's really interesting is fast random access combined with indexing. The indices can quickly tell you which rows you need. But with Parquet, because it doesn't support random access, even if you know which rows you need to fetch, you can't retrieve just those rows efficiently.
So with Lance, we've added indexing extensions: for embeddings, you can do essentially billion-scale vector search; there are scalar indices to make filtering on metadata columns fast; and there are full-text search indices for keyword or fuzzy search, so you don't really need that Elasticsearch cluster anymore.
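In LanceDB terms, a sketch of the same table serving scalar, full-text, and vector queries might look like this; the data and index parameters are illustrative, and on a tiny table the vector search simply runs exactly without an ANN index.

```python
# One table, three kinds of queries: scalar filters, full-text search, vectors.
import lancedb

db = lancedb.connect("/data/lancedb")
tbl = db.create_table(
    "docs",
    data=[
        {"id": 1, "text": "shuffle rows cheaply", "vector": [0.1, 0.9]},
        {"id": 2, "text": "stream blobs to GPUs", "vector": [0.8, 0.2]},
    ],
)

tbl.create_scalar_index("id")     # fast filtering on metadata columns
tbl.create_fts_index("text")      # keyword search, no Elasticsearch cluster
# On a real, large table you would also build an ANN index on "vector",
# e.g. tbl.create_index(metric="cosine"); tiny tables are searched exactly.

print(tbl.search([0.1, 0.8]).where("id > 0").limit(1).to_list())  # vector + filter
print(tbl.search("blobs", query_type="fts").limit(1).to_list())   # full-text
```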
So what Lance gives you is the ability to have a single table that serves all of these workloads. If you have metadata columns or time-series columns, you can plug Lance directly into, say, DuckDB or Trino or Spark. And if you're storing large blobs and tensors, like videos or text or images, you can plug your Lance data straight into your training pipelines, and you can use the embedded vector index to do similarity search.
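For the analytics side, here is a sketch of handing the scalar columns of a Lance dataset to DuckDB for SQL; the path and column names are carried over from the earlier hypothetical examples.

```python
# One copy of the data, two engines: scalar columns of a Lance dataset queried
# with DuckDB.
import duckdb
import lance

ds = lance.dataset("/data/materialized/chat_high_quality_v3.lance")
meta = ds.to_table(columns=["n_tokens", "quality_score"])   # pyarrow Table

# DuckDB's replacement scan picks up the local pyarrow table by name.
print(duckdb.sql(
    "SELECT quality_score, count(*) AS n, avg(n_tokens) AS avg_tokens "
    "FROM meta GROUP BY quality_score ORDER BY quality_score"
).df())
```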
And so this makes it a lot easier for a full AI workflow, from analyzing and exploring your dataset, to searching and retrieving throughout your dataset, to fine-tuning and training your model.
Around this format, we've built the LanceDB vector database, and more generally a database for multimodal AI. One big feature is distributed vector search: search through billions of vectors at low latency and very high QPS, with an order of magnitude less infrastructure.
When we talk about multimodal, we often think narrowly about just image generation or video generation. But when you look at the data, multimodal, I think, is much broader than that. Unlike traditional tabular data, we can store features, and then audio waveforms, images, and all of that, as well as embeddings, whether they're image embeddings or text embeddings, and then support other sorts of data frame and SQL workloads on top.
So it can serve operational scenarios, where you're in a production service for RAG or search and retrieval and personalization, or it can be part of your data lake to analyze and explore all that multimodal data that you have.
Yeah. So I think that, at least from my team's experience, we just think that speed is probably our best bet. A lot of the tools that we've worked with really slow down, and we're looking to develop out what the future...