
No more bad outputs with structured generation: Remi Louf



00:00:00.000 | So yeah, my name is Remi. I'm the co-author and co-maintainer of the open source library
00:00:18.720 | Outlines, which some of you might know. And I'm also the CEO and co-founder of .txt, "dot-txt" or "dot-text",
00:00:24.960 | whichever you prefer. We're more traditional machine learning people. And the motivation for our work
00:00:32.240 | is the very simple observation that large language models are fundamentally flawed.
00:00:37.520 | I'll give you a very simple example. You're trying to extract flight information from a bunch of emails.
00:00:44.480 | Of course, you want the output to be a JSON object with origin, destination, etc.
00:00:51.680 | So you go to OpenAI, you prompt the model to death, you threaten it, you use function calling,
00:00:57.600 | and what you sometimes get as an answer is a JSON decode error. That's a very simple example,
00:01:03.680 | but it has very fundamental implications, because computing rests on interfaces. We're able
00:01:10.000 | to build modular, very complex infrastructure because we can trust the APIs of
00:01:16.240 | other pieces of code. And here, as you've probably witnessed, you can't actually trust
00:01:22.160 | large language models to return consistent outputs. In short, the technology for
00:01:28.480 | agents is currently not there. The good news is that structured generation, which is the ability to
00:01:37.280 | guide the model to return output that follows a specific structure, actually allows you, as we'll see, to beat GPT-4 as a sort of byproduct.
00:01:46.480 | The goals for today are, first, to introduce the open source library Outlines for those of you who don't know about it,
00:01:52.800 | then very briefly explain how it works, without getting into the technical details, then try to convince you that you should use it today
00:02:00.880 | for most of the workflows that you have to deal with, and end with a very short glimpse into the near future.
00:02:07.440 | So Outlines is a Python library, emphasis on library. You can actually include Outlines in your workflow;
00:02:16.320 | it's not like frameworks where you have to make your workflow fit inside the framework. Partly as a result, I think,
00:02:23.280 | it's been adopted by vLLM and TGI among the serving frameworks. And if you use function calling in
00:02:30.080 | either of these libraries, you're actually using Outlines under the hood.
00:02:34.960 | I'm a co-author, but Outlines would be nothing without its contributors. Today there are 87, or maybe 88;
00:02:43.120 | I think I merged a PR this morning, I don't remember. Outlines would be nothing without all these
00:02:48.400 | people, and I thank them a lot. People thought we were crazy about a year ago when
00:02:54.960 | we were talking about structured generation, but since then I'm pretty happy, because it looks like
00:03:00.480 | people have sort of caught up with the topic and realized that you can
00:03:05.600 | actually do structured output. So now, a quick run through Outlines.
00:03:12.000 | Usually, generating text happens in three stages. The first stage is that you need to choose the
00:03:18.080 | model and instantiate it. Outlines is purely focused on open source models. We have
00:03:23.520 | integrations with six different model providers, including transformers, llama.cpp, and, recently
00:03:29.440 | added, MLX. We do have an integration with OpenAI, but that's mostly for us to compare
00:03:36.080 | the results that we get with open models against the results given by OpenAI. The second step
00:03:42.000 | is to, well, generate text. What you do is instantiate a generator using generate.text.
00:03:48.400 | Here we just want to return a single sentence, so we're telling the generator to stop
00:03:53.360 | whenever it encounters a period. Then you call the generator with
00:03:59.040 | your prompt, which here is "Describe the benefits of structured generation in one sentence." And
00:04:03.840 | you'll have to wait 10 more minutes for the answer, hopefully less.
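As a rough illustration of those three stages, here is a minimal sketch of the Outlines workflow just described, assuming the transformers backend and a model name chosen purely for illustration (the exact code on the slide is not reproduced in the transcript):

```python
import outlines

# Stage 1: choose and instantiate an open source model (illustrative choice).
model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")

# Stage 2: build a plain text generator.
generator = outlines.generate.text(model)

# Stage 3: call the generator with the prompt, stopping at the first period
# so that we get back a single sentence.
sentence = generator(
    "Describe the benefits of structured generation in one sentence.",
    stop_at=".",
)
print(sentence)
```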
00:04:12.480 | Okay, now we get into structured generation. Without Outlines, if you ask what is the IP address of the public Google DNS servers
00:04:18.240 | and you just generate text, you just let the LLM do its thing, then generally it will yap for a long
00:04:23.680 | time, you know, a hundred tokens, 500 tokens, and the answer will be somewhere in there. And the way
00:04:29.600 | you extract the answer is generally using regular expressions. What you can do with Outlines is
00:04:35.360 | take that regular expression that you would use to extract the answer and use it to
00:04:40.480 | guide the model, to tell the model: this is the structure that the output should follow. And as you
00:04:45.360 | see, you kind of remove the yapping. You just call generate.regex to build the generator, call the generator, and
00:04:51.760 | what you get is just the result. And it's actually the correct answer. That was with Mistral 7B v0.1.
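A hedged sketch of what the regex-guided call might look like; the IP-address pattern below is an illustrative stand-in, not the exact regex shown on the slide:

```python
import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")

# Simplified dotted-quad pattern: four groups of 0-255 separated by dots.
ip_regex = r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)"

generator = outlines.generate.regex(model, ip_regex)
answer = generator("What is the IP address of the public Google DNS servers?")
print(answer)  # e.g. "8.8.8.8"
```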
00:04:57.920 | Regular expressions are not the only way to define structure. Something that people need a lot in
00:05:03.920 | practice is JSON, and Outlines allows you to generate text that is a
00:05:12.800 | JSON object with a given structure. The way you specify the structure is using JSON Schema, or you can pass
00:05:19.280 | Pydantic models as well. Now, you might notice something in the flight information example. Here,
00:05:25.120 | in the example that I used at the beginning, you're extracting flight information from an email.
00:05:29.200 | I could have used string as the type for origin and destination, but I did not. I actually used a custom
00:05:34.560 | type that we implemented in Outlines. The reason is that origin and destination have way more structure
00:05:40.240 | than just text: it's an airport code, three letters,
00:05:45.200 | capitalized. And you can specify more and more structure, all the structure that you have in
00:05:50.080 | your problem, basically. You can also use this with vision models; that's something that we merged
00:05:55.840 | recently. So here we took, I think it was a picture from Wikipedia, of a dish. We
00:06:03.520 | tell the model what JSON we expect as an output, then we instantiate
00:06:10.960 | the generator, pass the image along with the prompt to the generator, and we get valid JSON.
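A minimal sketch of the flight-extraction example described above, assuming the Pydantic path; the talk uses a custom airport-code type shipped with Outlines, which is approximated here with a three-capital-letter string constraint, and the model name and email text are placeholders:

```python
from pydantic import BaseModel, constr
import outlines

class Flight(BaseModel):
    # Stand-in for the custom airport-code type: exactly three capital letters.
    origin: constr(pattern=r"[A-Z]{3}")
    destination: constr(pattern=r"[A-Z]{3}")

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")
generator = outlines.generate.json(model, Flight)

# The generator returns a validated Flight instance rather than raw text.
flight = generator("Extract the flight information from this email: ...")
print(flight.origin, flight.destination)
```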
00:06:16.080 | If you want to install Outlines, and you think you could benefit from structured generation,
00:06:20.880 | it's very simple: just pip install outlines. Now I'm going to try to very quickly explain how it works.
00:06:27.440 | So the models themselves, what Mistral, Cohere and the like are doing, is actually training model
00:06:34.880 | weights. What a model does is: you input a prompt, you send the prompt as token IDs,
00:06:41.280 | and what you get as an output is not text, it's logits: a probability distribution over the next token.
00:06:46.080 | Now, what happens after that, when you want to generate text? The first step is that you have a
00:06:50.320 | logits processor that biases the logits. You probably use this every day, actually, without noticing it:
00:06:56.160 | when you use temperature or top-k or top-p sampling, you're actually biasing the logits.
00:07:00.720 | And once you have your biased logits, you use a sampling algorithm, and then you get a token.
00:07:04.640 | And once you have your token, you add it to the prompt and then feed it back to the LLM.
00:07:08.480 | And where we fit is here: whenever the model generates logits,
00:07:17.440 | we look at every token and we ask, if I add this token to the current generation, is it going to violate
00:07:24.720 | the structure? If the answer is yes, we mask it so that it doesn't get generated.
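To make the idea concrete, here is a deliberately naive toy sketch of that masking step; this is not how Outlines implements it (the point of the library is precisely to avoid this per-token loop), it only illustrates the logit-masking described above. The tokenizer and is_valid_prefix callable are assumed inputs:

```python
import torch

def mask_logits(logits: torch.Tensor, tokenizer, generated_so_far: str,
                is_valid_prefix) -> torch.Tensor:
    """Set to -inf the logit of every token that, once appended to the text
    generated so far, can no longer lead to a valid structured output."""
    for token_id in range(logits.shape[-1]):
        candidate = generated_so_far + tokenizer.decode([token_id])
        if not is_valid_prefix(candidate):
            logits[token_id] = float("-inf")
    return logits
```

Checking every vocabulary token at every step like this would be very slow, which is exactly the efficiency problem the next lines talk about.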
00:07:30.240 | Now, that story is very simple. What is really hard is doing it efficiently, and that's what we
00:07:35.360 | figured out at .txt. That's what makes us different from the other libraries that
00:07:39.440 | do structured generation, like Guidance and LMQL. And now I'm going to convince you that there's
00:07:48.720 | absolutely no reason to not use, sorry for the double negation here, structured generation.
00:07:54.800 | The first reason is that most text is structured. I talked to you about JSON earlier, we talked
00:08:01.120 | about regular expressions, but here I just took the GSM8K dataset. If you're not me
00:08:07.120 | and don't see structure everywhere immediately, look at the right-hand side: you can actually see
00:08:13.760 | that it's highly structured. It's always "Q", a period, text until a question mark, then so on and
00:08:20.320 | so forth, then an arithmetic operation, which is defined by a context-free grammar. And you could actually
00:08:24.800 | express this in Outlines and just get the answer at the end, which is, you know, 6. So there's a lot of
00:08:30.880 | structured text out there, not just JSON. Thank you, I'll be quick.
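A hedged sketch of what constraining a GSM8K-style answer might look like with a plain regex; the pattern below is only an illustration, not the structure used in the talk (which, as noted above, involves a context-free grammar for the arithmetic part):

```python
import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")

# Illustrative pattern: a short stretch of reasoning text, then a final
# sentence of the form "The answer is <number>."
gsm8k_like = r"[A-Za-z0-9 ,+\-*/=().]{1,250}\. The answer is \d{1,4}\."

generator = outlines.generate.regex(model, gsm8k_like)
answer = generator("Q: If I have 2 apples and buy 4 more, how many apples do I have?\nA: ")
print(answer)
```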
00:08:41.120 | Of course, the second benefit is that you get valid structure. I mean, that's an obvious thing; that's why we're
00:08:45.680 | doing it. I like the meme at the bottom: this is what people are currently doing, just
00:08:50.720 | crazy stuff to get valid JSON as an output, and it's not even guaranteed. And here, with Outlines, you just
00:08:56.080 | sample what you want. It's as simple as that. And as for experiments, this is actually an experiment
00:09:01.120 | that Predibase did: they took Mistral 7B v0.1 and used a version of CoNLL that they modified
00:09:06.480 | so that it gives structured JSON output. What they found is that Mistral 7B v0.1 only gets valid JSON 17%
00:09:13.520 | of the time. When you add structured generation on top of it, you get 99.9%, and that's without optimizing
00:09:19.360 | the prompt. So you can actually get even better than this.
00:09:23.440 | The nice thing is, it also adds negligible overhead. So you actually
00:09:30.640 | don't have to fear that it will affect inference time, which is a highly
00:09:35.200 | non-trivial thing. Here we compared the overhead introduced by Guidance when it does
00:09:41.760 | structured generation, as a function of the number of generated tokens, and at the bottom
00:09:46.480 | is Outlines: Outlines' overhead is approximately zero all the way through. As a trade-off, there's a compilation
00:09:52.000 | time, but it doesn't slow down inference. Now we're at a point where we could
00:09:55.760 | integrate this in Groq and you wouldn't see the difference between structured and unstructured.
00:09:59.600 | So, no overhead, but even more than no overhead: it is faster to generate text with structured
00:10:07.520 | generation. The first reason is that when you take JSON, you don't need to generate the tokens that
00:10:13.280 | correspond to the brackets and to the field names. You know those in advance; you don't need to ask the model
00:10:18.000 | to return those tokens. So in this very simple example,
00:10:21.680 | only five out of 10 tokens need to be generated, so only one half. But there's an even more subtle
00:10:28.160 | way in which it accelerates inference, and this is the example that we took at the beginning. Here,
00:10:35.680 | I asked a good model, ChatGPT, the same question: what is the IP address
00:10:42.880 | of Google's public DNS servers? And ChatGPT took 50 tokens; it yapped and yapped and yapped,
00:10:50.640 | and the answer came after 50 tokens. That's not so bad; it could get a lot worse with lesser models.
00:10:56.480 | But when you use structured generation, you just generate eight tokens. So that's a subtle way
00:11:00.640 | in which it accelerates inference by a lot.
00:11:03.120 | Then, it improves efficiency, and that's probably the most mind-blowing result that we've
00:11:10.720 | had. Here, what you're looking at is the accuracy on GSM8K, again with Mistral 7B v0.1,
00:11:19.520 | structured and unstructured. We look at the accuracy as a function of the number of shots, so the
00:11:25.040 | number of examples that you give to the model before asking the question. And what we found is
00:11:31.040 | that, yeah, for unstructured, normal generation, one shot is worse than eight shots. That's completely expected.
00:11:36.320 | But what we find with structured generation, and this really surprised us, is that you
00:11:40.720 | actually get in the same ballpark in terms of accuracy with one shot as you do with eight shots,
00:11:45.440 | which is surprising for machine learning. You would think that examples are there to teach
00:11:50.960 | the model about the task, but it looks like they're actually there to teach the model about the
00:11:54.800 | structure of the problem. There are more investigations to do along this line, but that was very mind-blowing.
00:12:00.960 | And the last one, which, after "faster", is probably what a lot of people here care about,
00:12:06.000 | is that it does improve the performance of open source models. Here, what you're looking at
00:12:14.000 | is the Berkeley Function Calling Leaderboard, the simple function benchmark, and we look at the accuracy.
00:12:20.480 | The first thing we did is that we took Microsoft's Phi-3-medium model, which is a small model,
00:12:26.160 | and we looked at its accuracy without structured generation. It's 86%, which is pretty good for an open
00:12:32.240 | model; Phi-3 is actually a pretty good model. When you add structured generation, you get 96.5%.
00:12:39.200 | And as a comparison, GPT-4, the best version of GPT-4 on this task, gets 93.5% on this benchmark.
00:12:49.360 | Now, there are two things to note. One is that 96.5% gets dangerously useful. And the second thing is that
00:12:58.320 | we have open models available today that can beat much larger models, without fine-tuning.
00:13:07.840 | So there's pretty huge room for open models, and that's why I'm really bullish on open models.
00:13:12.640 | I think, as a community, we can actually extract a lot more out of these models.
00:13:17.840 | And this is just a glimpse. The work that I just showed you is what we did at .txt about a year ago.
00:13:26.240 | Since then we've generalized from regular expressions to context-free grammars.
00:13:31.520 | Context-free grammars are used to define code, they're used to define protein structure,
00:13:36.400 | and they define, as well, what I showed you earlier on the GSM8K example. So we can do the same thing,
00:13:41.840 | structured generation with no overhead, with context-free grammars. We also started working
00:13:49.280 | on semantics, like adding some semantic constraints to the generation. One very popular
00:13:55.120 | example of this is text-to-SQL. Most models get the SQL syntax right; usually what they get wrong
00:14:02.160 | is that they hallucinate table or column names. Internally, we're able to get perfectly valid text-to-SQL.
00:14:09.360 | I can't guarantee you that the query will be correct and give you the answer that you expect,
00:14:13.520 | but I can guarantee you that it will run. So that's a pretty huge advance in text-to-SQL.
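As a rough sketch of the context-free-grammar mode mentioned above, here is what a grammar-constrained call might look like, assuming a tiny Lark-style arithmetic grammar written purely for illustration (the grammars used internally at .txt are not shown in the talk):

```python
import outlines

# Tiny Lark-style grammar for arithmetic expressions (illustrative only).
arithmetic_grammar = """
?start: expr
?expr: expr "+" term | expr "-" term | term
?term: term "*" factor | term "/" factor | factor
?factor: NUMBER | "(" expr ")"
%import common.NUMBER
"""

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")
generator = outlines.generate.cfg(model, arithmetic_grammar)

expression = generator("Write an arithmetic expression for 'three times two plus one': ")
print(expression)  # e.g. "3*2+1"
```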
00:14:19.600 | And what else? Oh yeah. We're also starting to bubble the computations from structured
00:14:25.840 | generation up into the model architecture, because when you think about it, we're biasing logits,
00:14:30.720 | and when you're biasing logits, the model is actually doing some computation for nothing. So you can gain
00:14:35.280 | even more efficiency by preventing the model from doing those computations in the first place. That's
00:14:40.480 | all work that we'll actually publish in a blog post, I think, in the next couple of weeks. So all that to
00:14:47.120 | say that if you're not just doing a chatbot, there's a really good chance that you will be
00:14:52.880 | using structured generation. It's just a matter of time until you adopt it, I think. Our users are pretty,
00:14:59.360 | pretty happy. So yeah, thank you for your attention. And for all the crazy claims
00:15:07.840 | that I made, you can go to the QR code; there's a link to all the blog posts.
00:15:21.120 | I'll see you next time.