No more bad outputs with structured generation: Remi Louf


Transcript

So yeah, my name is Remi. I'm the co-author and co-maintainer of the open source library Outlines, which some of you might know, and I'm also the CEO and co-founder of .txt ("dot txt" or "dottxt", whichever you prefer). We're more traditional machine learning people, and the motivation for our work is the very simple observation that large language models are fundamentally flawed.

I'll give you a very simple example. You're trying to extract flight information from a bunch of emails. Of course, you want the result to be a JSON object with origin, destination, etc. So you go to OpenAI, you prompt the model to death, you threaten it, you use function calling, and what you sometimes get as an answer is a JSON decode error.

That was a very simple example, but it has very fundamental implications, because computing rests on interfaces. We're able to build modular, very complex infrastructure because code can trust the APIs of other pieces of code. And here, as you've probably witnessed, you can't actually trust a large language model to return consistent outputs.

In short, the technology for agents is currently not there. The good news is structured generation, which is the ability to guide the model to return a specific structure. As we'll see, it even allows you to beat GPT-4 as a sort of byproduct.

The goals for today are, first, to introduce the open source library Outlines for those of you who don't know about it; then to very briefly explain how it works, without getting into the technical details; then to try to convince you that you should use it today for most of the workflows you have to deal with; and finally to give a very short glimpse into the near future.

So Outlines is a Python library, emphasis on library. You can include Outlines in your workflow; it's not like frameworks where you have to make your workflow fit inside the framework. I think as a result it's been adopted by the serving frameworks vLLM and TGI, and if you use function calling in either of those, you're actually using Outlines under the hood.

So I'm a co-author, but Outlines would be nothing without its contributors. Today there are 87 of them; it might be 88, I think I merged a PR this morning, I don't remember. Outlines would be nothing without all these people, and I thank them a lot.

People thought we were crazy about a year ago when we were talking about structured generation. But since then, I'm pretty happy, because it looks like people have caught up with the topic and realized that you can actually get structured output.

Now, a quick run through Outlines. Generating text usually happens in three stages. The first stage is that you choose the model and instantiate it. Outlines is purely focused on open source models; we have integrations with six different model providers, including transformers, llama.cpp, and, recently added, MLX.

We do have an OpenAI integration, but that's mostly for us to compare the results we get with open models against the results given by OpenAI. The second step is to actually generate text: you instantiate a generator using generate.text.

Here we just want to return a single sentence, so we tell the generator to stop whenever it encounters a period. Then you call the generator with your prompt, here "Describe the benefits of structured generation in one sentence", and you wait, hopefully not too long.
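As a minimal sketch of those three stages, assuming the Outlines 0.x API (the model name is illustrative):

```python
import outlines

# Stage 1: choose and instantiate an open source model.
model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")

# Stage 2: build a plain-text generator.
generator = outlines.generate.text(model)

# Stage 3: call it, stopping at the first period to get a single sentence.
answer = generator(
    "Describe the benefits of structured generation in one sentence.",
    stop_at=".",
)
print(answer)
```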

Okay, now we get into structured generation. Without Outlines, if you ask "What is the IP address of the public Google DNS servers?" and you just generate text, you just let the LLM do its thing, it will generally yap for a long time, a hundred tokens, five hundred tokens, and the answer will be somewhere in there.

And the way you extract the answer is generally with a regular expression. What you can do with Outlines is take that regular expression you would have used to extract the answer and instead use it to guide the model, to tell the model: this is the structure the output should follow.

And as you can see, you remove the yapping. You just call generate.regex to build the generator, call the generator, and what you get is just the result. And it's actually the correct answer; that was with Mistral 7B v0.1. Regular expressions are not the only way to define structure.
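A sketch of that regex-guided call, again assuming the 0.x API (the IPv4 regex below is a standard one, used here for illustration):

```python
import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")

# The same regex you would have used to *extract* an IPv4 address
# is used instead to *guide* generation.
ip_regex = r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)"
generator = outlines.generate.regex(model, ip_regex)

answer = generator("What is the IP address of the public Google DNS servers? ")
print(answer)  # e.g. "8.8.8.8"
```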

Something people need a lot in practice is JSON, and Outlines allows you to generate text that is a JSON object with a given structure. The way you specify the structure is with a JSON schema, or you can pass Pydantic models as well.

Now, you might notice something about the flight information. This is the example I used at the beginning: you're extracting flight information from an email. I could have used string as the type for origin and destination, but I did not. I actually used a custom type that we implemented in Outlines.

And the reason is that origin and destination have way more structure than just text: an airport code is three capitalized letters. And you can specify more and more structure, all the structure you have in your problem. You can use this with vision models as well.
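A sketch of the flight example with a Pydantic model; here I'm approximating the custom airport type from the talk with a regex-constrained string (three capital letters), since the exact type name isn't given:

```python
from pydantic import BaseModel, constr
import outlines

class Flight(BaseModel):
    # Stand-in for the custom airport-code type mentioned in the talk:
    # exactly three capital letters, e.g. "CDG" or "JFK".
    origin: constr(pattern=r"[A-Z]{3}")
    destination: constr(pattern=r"[A-Z]{3}")

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")
generator = outlines.generate.json(model, Flight)

flight = generator(
    "Extract the flight information: 'Your flight departs Paris Charles "
    "de Gaulle at 9am and lands in New York JFK at noon.'"
)
print(flight)  # a validated Flight instance
```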

That's something we merged recently. Here we took a picture of a dish, I think from Wikipedia, we tell the model what JSON we expect as an output, then we instantiate the generator, pass the image along with the prompt, and we get valid JSON.
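A hedged sketch of what that might look like; the transformers_vision constructor, the LLaVA checkpoint, the image URL, and the call signature are all assumptions based on the Outlines 0.x vision integration, so check the documentation for your version:

```python
from io import BytesIO

import requests
from PIL import Image
from pydantic import BaseModel
from transformers import LlavaNextForConditionalGeneration
import outlines

class Dish(BaseModel):
    name: str
    ingredients: list[str]

model = outlines.models.transformers_vision(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    model_class=LlavaNextForConditionalGeneration,
)
generator = outlines.generate.json(model, Dish)

# Hypothetical image URL; substitute any picture of a dish.
url = "https://example.com/dish.jpg"
image = Image.open(BytesIO(requests.get(url).content))

dish = generator("Describe the dish in the image as JSON.<image>", [image])
print(dish)
```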

If you want to install Outlines, and you think you could benefit from structured generation, it's very simple: just pip install outlines. Now I'm going to try to very quickly explain how it works. The models themselves, what Mistral, Cohere, and the like are doing, is training model weights.

What a model does is: you input a prompt, sent as token IDs, and what you get as an output is not text, it's logits, a probability distribution over the next token. Now, when you want to generate text, the first step after that is a logits processor that biases the logits.

You probably use this every day without noticing it: when you use temperature, or top-k or top-p sampling, you're actually biasing the logits. Once you have your biased logits, you use a sampling algorithm and get a token. And once you have your token, you append it to the prompt and feed it back to the LLM.

And where we fit is here: whenever the model generates logits, we look at every token and ask, if I add this token to the current generation, is it going to violate the structure? If the answer is yes, we mask it so that it can never be sampled.
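A toy sketch of one step of that decoding loop, not Outlines' actual implementation (which compiles the structure into an efficient index); here the set of valid token IDs is simply handed in:

```python
import torch

def decode_step(logits: torch.Tensor, allowed: list[int],
                temperature: float = 1.0) -> int:
    """One decoding step with structural masking.

    `allowed` holds the token IDs that do not violate the structure
    given what has been generated so far.
    """
    # Ordinary biasing, e.g. temperature scaling.
    logits = logits / temperature
    # Structural biasing: every structure-violating token goes to -inf,
    # i.e. zero probability after the softmax.
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed] = 0.0
    probs = torch.softmax(logits + mask, dim=-1)
    # Sample the next token; the caller appends it to the prompt and
    # feeds the sequence back into the LLM.
    return torch.multinomial(probs, num_samples=1).item()
```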

Now, that story is very simple. What is really hard is doing it efficiently, and that's what we figured out at .txt, and that's what makes us different from the other libraries that do structured generation, like Guidance and LMQL. And now I'm going to convince you that there's absolutely no reason to not use structured generation, sorry for the double negation.

The first reason is that most text is structured. I talked about JSON earlier, and we talked about regular expressions, but here I just took the GSM8K dataset. If you look at the examples, you can immediately see that they're highly structured.

It's always "Q:", then text until a question mark, then "A:", and so on and so forth, with arithmetic operations, which are defined by a context-free grammar. You could actually express this in Outlines and just get the answer at the end, which here is 6. So there's a lot of structured text out there.
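As an illustration, here is a hypothetical regex for that "A: ..." answer shape; a faithful version of the arithmetic part would use a context-free grammar, as noted above:

```python
import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")

# Hypothetical sketch of the GSM8K answer shape: one or more reasoning
# lines, then a final line carrying the integer answer.
gsm8k_regex = r"A: ([^\n]*\n)+The answer is \d+\."
generator = outlines.generate.regex(model, gsm8k_regex)

answer = generator(
    "Q: Tom has 2 apples and buys 4 more. How many apples does he have?\n"
)
```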

Thank you, I'll be quick. Of course, the second benefit is that you get valid structure. That's the obvious one; it's why we're doing this. I like the meme at the bottom: this is what people are currently doing, crazy stuff just to get valid JSON as an output, and it's not even guaranteed.

With Outlines, you just sample what you want; it's as simple as that. As an experiment, and this is actually an experiment that Predibase did, they took Mistral 7B v0.1 and used a version of CoNLL that they modified so that it gives structured JSON output. What they found is that Mistral 7B v0.1 alone only returns valid JSON 17% of the time.

When you add structured generation on top of it, you get 99.9%, and that's without optimizing the prompt, so you can actually do even better than this. The nice thing is that it also adds negligible overhead, so you don't have to fear it affecting inference time, which is a highly non-trivial result.

Here we compared the overhead introduced by Guidance when it does structured generation, as a function of the number of generated tokens; at the bottom is Outlines, which stays at approximately zero until the end. As a trade-off there's a compilation time, but during inference nothing is slowed down.

We're now at a point where we could integrate this in Groq and you wouldn't see the difference between structured and unstructured. So, no overhead; but even better than no overhead, it is actually faster to generate text with structured generation. The first reason is that with JSON you don't need to generate the tokens that correspond to the brackets and the field names.

I know those in advance; I don't need to ask the model to return those tokens. So in this very simple example, only five out of ten tokens need to be generated, only one half. But there's an even more subtle way in which it accelerates inference.
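To make that concrete (my own illustration, not the one from the slides): for the flight schema from earlier, everything except the field values is fixed by the schema, so those characters can be emitted without running the model at all:

```python
# Fixed by the JSON schema, emitted without any model forward pass:
#   {"origin": "", "destination": ""}
# Sampled from the model (the only real generation work):
#   the two three-letter values, e.g. CDG and JFK
```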

And this is the example we took at the beginning. Here I asked a good model, ChatGPT, the same question: what is the IP address of Google's public DNS servers? And ChatGPT yapped and yapped and took 50 tokens to get to the answer.

And that's not even bad; it can get a lot worse with lesser models. But when you use structured generation, you generate just eight tokens. So that's a subtle win by which it accelerates inference by a lot. Then it improves efficiency, and that's probably the most mind-blowing result we've had.

Here, what you're looking at is accuracy on GSM8K, again with Mistral 7B v0.1, structured versus unstructured, as a function of the number of shots, that is, the number of examples you give the model before asking the question.

What we found is that, for unstructured generation, one shot is worse than eight shots; that's completely expected. But what really surprised us with structured generation is that you get in the same ballpark in terms of accuracy with one shot as you do with eight shots, which is surprising for machine learning.

You would think the examples are there to teach the model about the task, but it looks like they're actually there to teach the model about the structure of the problem. There are more investigations to do along these lines, but that was mind-blowing. And the last one, which a lot of people probably care about, is that it improves the performance of open source models.

Here, what you're looking at is the Berkeley Function Calling Leaderboard, the simple function benchmark, and we look at accuracy. The first thing we did is take Microsoft's Phi-3-medium model, which is a small model, and measure its accuracy without structured generation.

It's 86%, which is pretty good for an open model; Phi-3 is actually a pretty good model. When you add structured generation, you get 96.5%. As a comparison, the best version of GPT-4 gets 93.5% on this benchmark. Now, there are two things to note. The first is that 96.5% starts getting dangerously useful.

The second is that we have open models available today that can beat much larger models, without fine-tuning. So there's pretty huge room for open models, and that's why I'm really bullish on them: I think, as a community, we can extract a lot more out of these models.

And this is just a glimpse. The work I just showed you is what we did at .txt about a year ago. Since then we've generalized from regular expressions to so-called context-free grammars. Context-free grammars are used to define code; they're used to define protein structure;

and they define what I showed you earlier in the GSM8K example. So we can do the same thing, structured generation with no overhead, with context-free grammars.
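A sketch of grammar-guided generation, assuming the 0.x generate.cfg API with a Lark-style grammar; the tiny arithmetic grammar below is illustrative:

```python
import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")

# A tiny arithmetic grammar in Lark's EBNF syntax, in the spirit of
# the GSM8K arithmetic mentioned earlier.
arithmetic_grammar = """
?start: expr
?expr: term (("+" | "-") term)*
?term: factor (("*" | "/") factor)*
?factor: NUMBER | "(" expr ")"
%import common.NUMBER
"""
generator = outlines.generate.cfg(model, arithmetic_grammar)

expression = generator("Write the calculation for 3 apples at 2 dollars each: ")
```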

We've also started working on semantics, adding semantic constraints to the generation. One very popular example is text-to-SQL: most models get the SQL syntax right; what they usually get wrong is hallucinating table or column names. Internally, we're able to get perfect text-to-SQL: I can't guarantee that the query will give you the answer you expect, but I can guarantee that it will run. That's a pretty huge advance in text-to-SQL.

And what else? Oh yeah, we're also starting to bubble computations up from structured generation into the model architecture, because when you think about it, when we're biasing logits, the model is doing computation for nothing. So you can gain even more efficiency by preventing the model from doing those computations in the first place.

That's all work we'll publish in a blog post, I think in the next couple of weeks. All that to say: if you're not building a chatbot, there's a really good chance that you'll be using structured generation. It's just a matter of time until you adopt it, I think.

Our users are pretty, pretty happy. So yeah, thank you for your attention. For all the crazy claims I made, you can go to the QR code; there's a link to all the blog posts. I'll see you next time.