No more bad outputs with structured generation: Remi Louf

So yeah, my name is Remi. I'm the co-author and co-maintainer of the open source library Outlines, which some of you might know. And I'm also the CEO and co-founder of .txt, pronounced however you prefer. We're more traditional machine learning people, and the motivation for our work is the very simple observation that large language models are fundamentally flawed.

I'll give you a very simple example. You're trying to extract flight information from a bunch of emails. Of course, you want the output to be a JSON object with origin, destination, etc. So you go to OpenAI, you prompt the model to death, you threaten it, you use function calling, and what you sometimes get as an answer is a JSONDecodeError.

That's a very simple example, but it has very fundamental implications, because computing rests on interfaces. We're able to build modular, very complex infrastructure because we can trust the APIs of other pieces of code. And here, as you've probably witnessed, you can't actually trust a large language model to return consistent outputs. In short, the technology for agents is currently not there.

The good news is that structured generation, which is the ability to guide the model so that its output follows a specific structure, fixes this. And as we'll see, it allows you to beat GPT-4 as a sort of byproduct.
The goals for today are, first, to introduce the open source library Outlines for those of you who don't know about it; then to very briefly explain how it works, without getting into the technical details; then to try to convince you that you should use it today for most of the workflows you have to deal with; and finally to give a very short glimpse into the near future.
So, Outlines is a Python library, emphasis on library. You can actually include Outlines in your workflow; it's not like frameworks, where you have to make your workflow fit inside the framework. I think that's why it's been adopted by the serving frameworks vLLM and TGI: if you use function calling in either of those libraries, you're actually using Outlines under the hood.
I'm a co-author, but Outlines would be nothing without its contributors. Today there are 87 of them, or maybe 88; I think I merged a PR this morning, I don't remember. Outlines would be nothing without all these people, and I thank them a lot. People thought we were crazy about a year ago when we were talking about structured generation, but since then, I'm pretty happy, because it looks like people have caught up with the topic and realized that you can actually get structured output.

Now, a quick run through Outlines. Usually, generating text happens in three stages. The first stage is that you need to choose the model and instantiate it. Outlines is purely focused on open source models: we have integrations with six different model providers, among them transformers, llama.cpp, and, added recently, MLX. We do have an integration with OpenAI, but that's mostly for us to compare the results we get with open models against the results given by OpenAI.
The second step is to actually generate text. You instantiate a generator using generate.text. Here we just want a single sentence back, so we tell the generator to stop whenever it encounters a period. Then you call the generator with your prompt, which here is "Describe the benefits of structured generation in one sentence". For the answer, you'll have to wait ten more minutes, hopefully less.
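As a rough sketch of what those first two stages look like in code (the model name is just an example, and the exact API can differ between Outlines versions):

```python
import outlines

# Stage 1: choose an open source model and instantiate it.
model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")

# Stage 2: build a plain-text generator; stop at the first period
# so that we get back a single sentence.
generator = outlines.generate.text(model)
answer = generator(
    "Describe the benefits of structured generation in one sentence. ",
    stop_at=".",
)
print(answer)
```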
Okay, now we get into structured generation. Without Outlines, if you ask "What is the IP address of the public Google DNS servers?" and you just generate text, you just let the LLM do its thing, then generally it will yap for a long time, a hundred tokens, five hundred tokens, and the answer will be somewhere in there. The way you extract the answer is generally with a regular expression. What you can do with Outlines is take that regular expression you would have used to extract the answer and instead use it to guide the model, to tell the model: this is the structure the output should follow. As you'll see, that removes the yapping. You call generate.regex to build the generator, you call the generator, and what you get is just the result. And it's actually the correct answer. That was with Mistral 7B v0.1.
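A minimal sketch of that regex-guided generation (the IPv4 pattern is a simplified one I wrote for illustration):

```python
import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")

# A (simplified) regular expression for a dotted IPv4 address.
ip_regex = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

# Instead of extracting the answer with this regex after the fact,
# use it to constrain what the model is allowed to generate.
generator = outlines.generate.regex(model, ip_regex)
answer = generator("What is the IP address of the public Google DNS servers? ")
print(answer)  # e.g. "8.8.8.8" -- no yapping, just the answer
```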
Regular expressions are not the only way to define structure. Something people need a lot in practice is JSON, and Outlines allows you to generate text that is a JSON object with a given structure. You specify the structure with a JSON schema, or you can pass a Pydantic model as well. Now, you might notice something about the flight-information case, the example I used at the beginning, where you're extracting flight information from an email. I could have used string as the type for origin and destination, but I did not; I used a custom type that we implemented in Outlines. The reason is that origin and destination have way more structure than just text: an airport code is three capitalized letters. And you can keep specifying more and more structure, all the structure that you have in your problem.
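Here is a sketch with a Pydantic model. I constrain the airport codes with a plain regex pattern rather than the dedicated airport-code type mentioned in the talk, since the pattern version doesn't depend on a specific Outlines release:

```python
from pydantic import BaseModel, Field

import outlines

class Flight(BaseModel):
    # Airport codes have structure: exactly three capital letters.
    origin: str = Field(pattern=r"[A-Z]{3}")
    destination: str = Field(pattern=r"[A-Z]{3}")

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")
generator = outlines.generate.json(model, Flight)

flight = generator(
    "Extract the flight information from this email: "
    "'Your trip from Paris (CDG) to New York (JFK) is confirmed.' "
)
print(flight)  # a Flight instance, e.g. origin='CDG', destination='JFK'
```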
You can also use this with vision models; that's something we merged recently. Here we took a picture of a dish, from Wikipedia I think, we tell the model what JSON we expect as output, then we instantiate the generator, pass the image along with the prompt, and we get valid JSON.
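Roughly, that looks like the sketch below. I'm assuming the transformers_vision loader and a LLaVA-style model here, so treat the loader arguments and the call signature as illustrative rather than exact:

```python
from PIL import Image
from pydantic import BaseModel
from transformers import LlavaNextForConditionalGeneration

import outlines

class Dish(BaseModel):
    name: str
    ingredients: list[str]

# Assumed vision-model loader; check the Outlines documentation for
# the exact class and arguments in your version.
model = outlines.models.transformers_vision(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    model_class=LlavaNextForConditionalGeneration,
)
generator = outlines.generate.json(model, Dish)

image = Image.open("dish.jpg")  # hypothetical local file
dish = generator("Describe this dish as JSON. <image>", [image])
print(dish)
```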
If you think you could benefit from structured generation and want to install Outlines, it's very simple:
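```
pip install outlines
```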
Now I'm going to try to very quickly explain how it works. The models themselves, what companies like Mistral and Cohere are actually training, are model weights. What a model does is: you send it a prompt, as token IDs, and what you get as output is not text, it's logits, a probability distribution over the next token. When you want to generate text, what happens after that is, first, a logits processor that biases the logits. You probably use this every day without noticing it: when you use temperature, or top-k or top-p sampling, you're actually biasing the logits. Once you have your biased logits, you use a sampling algorithm to get a token. And once you have your token, you append it to the prompt and feed it back to the LLM.
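As a toy sketch of that loop, with temperature scaling as the everyday example of logit biasing (purely illustrative, not Outlines internals):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.7) -> int:
    # Bias the logits: temperature scaling is one everyday example.
    biased = logits / temperature
    # Turn the biased logits into a distribution and sample one token.
    probs = torch.softmax(biased, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

def generate(model, token_ids: list[int], n_tokens: int) -> list[int]:
    # `model` stands in for any callable mapping token IDs to
    # next-token logits (one score per vocabulary entry).
    for _ in range(n_tokens):
        logits = model(torch.tensor(token_ids))
        # Append the sampled token and feed the sequence back in.
        token_ids.append(sample_next_token(logits))
    return token_ids
```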
Where we fit is right there. Whenever the model outputs logits, we look at every token and ask: if I add this token to the current generation, is it going to violate the structure? If the answer is yes, we mask it so that it doesn't get generated. Now, that story is very simple; what is really hard is doing it efficiently, and that's what we figured out at .txt. It's what makes us different from the other libraries that do structured generation, like Guidance and LMQL.
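A sketch of that masking step, assuming the set of allowed next tokens is already known (computing that set efficiently is exactly the hard part):

```python
import torch

def mask_logits(logits: torch.Tensor, allowed_token_ids: list[int]) -> torch.Tensor:
    # Start from "every token forbidden", then re-enable the tokens
    # that keep the generation consistent with the target structure.
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0
    # Masked tokens end up with probability ~0 after the softmax.
    return logits + mask
```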
Now I'm going to convince you that there's absolutely no reason not to use structured generation (sorry for the double negation). The first reason is that most text is structured. I talked about JSON earlier and we talked about regular expressions, but here I just took the GSM8K dataset. If you look at it, you can see that it's highly structured: it's always "Q:", then text up to a question mark, and so on and so forth, then arithmetic operations, which are defined by a context-free grammar. You could actually express this in Outlines and just get the answer at the end, which here is 6.
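For instance, a sketch of pinning down that question-and-answer shape with a regex (the pattern is mine, for illustration; the arithmetic part would more properly be a context-free grammar):

```python
import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")

# Force a GSM8K-style answer: a short chain of arithmetic steps,
# then a final numeric answer.
answer_regex = r"A: [0-9 ()+*/=.-]{1,80}\. The answer is [0-9]{1,4}\."
generator = outlines.generate.regex(model, answer_regex)

prompt = "Q: Tom has 2 apples and buys 4 more. How many apples does he have?\n"
print(generator(prompt))  # e.g. "A: 2 + 4 = 6. The answer is 6."
```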
So there's a lot of structured text out there, not just JSON. (Thank you, I'll be quick.) The second benefit, of course, is that you get valid structure. That's an obvious one; it's the whole point. I like the meme at the bottom of this slide: it's the kind of crazy stuff people are currently doing to get valid JSON as output, and it's not even guaranteed. With Outlines, you just sample what you want; it's as simple as that. As an experiment, and this is actually an experiment that Predibase ran, they took Mistral 7B v0.1 and used a version of CoNLL that they modified so that it gives structured JSON output. What they found is that Mistral 7B v0.1 only gives valid JSON 17% of the time. When you add structured generation on top of it, you get 99.9%, and that's without optimizing the prompt, so you can actually do better than this.
The nice thing is that it also adds negligible overhead, so you don't have to fear it hurting inference time, which is a highly non-trivial result. Here we compared the overhead introduced by Guidance when it does structured generation, as a function of the number of generated tokens; at the bottom is Outlines, at approximately zero all the way through. The trade-off is a compilation time up front, but during inference nothing is slowed down. We're now at a point where we could integrate this in Groq and you wouldn't see the difference between structured and unstructured.
So, no overhead. But even better than no overhead: it is faster to generate text with structured generation. The first reason is that with JSON, you don't need to generate the tokens that correspond to the brackets and the field names. I know those in advance; I don't need to ask the model to return those tokens. In this very simple example, only five out of ten tokens need to be generated, so only one half.
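To make that concrete, here is a toy breakdown (token boundaries are simplified; real tokenizers split text differently):

```python
# Target structure:  {"answer": <integer>}
#
# Forced by the structure (no model forward passes needed):
#     {"answer":
# Sampled from the model:
#     42
# Forced by the structure:
#     }
```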
But there's an even more subtle way in which it accelerates inference, and it's the example we took at the beginning. I asked ChatGPT, a good model, the same question about the IP address of Google's public DNS servers, and ChatGPT took 50 tokens: it yapped, and yapped, and got to the answer after 50 tokens. That's not even bad; it can get a lot worse with lesser models. When you use structured generation, you generate just eight tokens. So that's a subtle win that accelerates inference by a lot.
Then, it improves efficiency, and that's probably the most mind-blowing result we've had. What you're looking at here is accuracy on GSM8K, again with Mistral 7B v0.1, structured versus unstructured, as a function of the number of shots, that is, the number of examples you give the model before asking the question. What we found for unstructured, normal generation is that one shot is worse than eight shots; that's completely expected. But with structured generation, and this really surprised us, you get accuracy in the same ballpark with one shot as you do with eight shots, which is surprising for machine learning: you would think the examples are there to teach the model about the task, but it looks like they're actually there to teach the model about the structure of the problem. There are more investigations to do along this line, but that was mind-blowing.
And the last one, which, after "faster", is probably what most people here care about: it improves the performance of open source models. What you're looking at is the Berkeley Function Calling Leaderboard, the simple-function benchmark, and we look at accuracy. The first thing we did was take Microsoft's Phi-3-medium, which is a small model, and measure its accuracy without structured generation: 86%, which is pretty good for an open model; Phi-3 is actually a pretty good model. When you add structured generation, you get 96.5%. As a comparison, the best version of GPT-4 on this benchmark gets 93.5%. There are two things to note. The first is that 96.5% starts getting dangerously useful. The second is that we have open models available today that can beat much larger models, without fine-tuning. So there's pretty huge headroom for open models, and that's why I'm really bullish on them: I think, as a community, we can actually extract a lot more out of these models.
And this is just a glimpse. The work I just showed you is what we did at .txt about a year ago. Since then, we've generalized from regular expressions to context-free grammars. Context-free grammars are used to define code, to define protein structure, and to define things like what I showed you earlier in the GSM8K example. So we can do the same thing, structured generation with no overhead, with context-free grammars.
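A sketch of what that looks like, with a toy arithmetic grammar written in Lark syntax (the grammar itself is mine, for illustration):

```python
import outlines

arithmetic_grammar = """
?start: expression
?expression: term (("+" | "-") term)*
?term: factor (("*" | "/") factor)*
?factor: NUMBER | "(" expression ")"
%import common.NUMBER
"""

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")
generator = outlines.generate.cfg(model, arithmetic_grammar)

# The output is guaranteed to be a well-formed arithmetic expression.
print(generator("Write two plus four times three as a calculation: "))
```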
We've also started working on semantics, adding semantic constraints to the generation. One very popular example of this is text-to-SQL. Most models get SQL syntax right; what they usually get wrong is hallucinated table or column names. Internally, we're able to get perfect text-to-SQL in this sense: I can't guarantee that the query will be correct and give you the answer you expect, but I can guarantee that it will run. That's a pretty huge advance in text-to-SQL.
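One way to picture the idea: constrain generation with a grammar whose terminals are the real table and column names, so identifier hallucination is impossible by construction (the schema and grammar below are made up for illustration):

```python
import outlines

# Only tables and columns that actually exist in the database appear
# as terminals, so the model cannot hallucinate identifiers.
sql_grammar = """
start: "SELECT " columns " FROM " TABLE ";"
columns: COLUMN (", " COLUMN)*
TABLE: "flights" | "airports"
COLUMN: "origin" | "destination" | "departure_time"
"""

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")
generator = outlines.generate.cfg(model, sql_grammar)

print(generator("Write a SQL query listing every flight's origin: "))
```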
And what else? Oh yes: we're also starting to bubble computations from the structured generation up into the model architecture. Because when you think about it, we're biasing logits, and when you bias a logit away, the model did that computation for nothing. So you can gain even more efficiency by preventing the model from doing those computations in the first place. That's all work we'll publish in a blog post, I think in the next couple of weeks.
All that to say: if you're not building a chatbot, there's a really good chance that you will be using structured generation; it's just a matter of time until you adopt it, I think. Our users are pretty, pretty happy. So, thank you for your attention. And for all the crazy claims I made, you can go to the QR code; there's a link to all the blog posts.