Back to Index

RFT, DPO, SFT: Fine-tuning with OpenAI — Ilan Bigio, OpenAI


Transcript

- Amazing, I think there's still some people coming in toward the back. So I'll do like the quick, maybe like administrative stuff in the meantime. This is gonna be super interactive. There's also like mics around, so if at any point you guys have questions, just raise your hand, someone will go to you, or I think we also have mics nearby.

Cool, I think that's pretty much it. I'll let a few people finish trickling in and then we'll get started. Okay, I think that's long enough. Okay, so welcome everyone to Model Maxing with OpenAI. We're gonna be talking about RFT, SFT, DPO. My name is Ilan Bigio. I'm on the developer experience team at OpenAI.

And so I do a lot of very early testing on new products and new directions that we're taking the API. So whenever there's a new feature coming out and it's still changing, like every day, they're like, can you tell us how good this is? And make a demo with it.

So this is a little bit of what we're gonna be doing today. It's gonna be a bit of a cross between a presentation. If you've been to any of my talks before, they're very code heavy. This one's a little bit less so because it's fine tuning and I don't want to have you waiting for multiple hours watching a progress bar.

So I have a lot of pre-compiled data and some stories to share. Cool, so this is you. You might be asking, you know, SFT, RFT, DPO, what the heck are all these letters? That's a great question. You might have some experience with fine-tuning but still be asking, like, you know, will fine-tuning solve all of my problems?

That's another good question. You might also be wondering if you're in the right room. You are, most likely. If you want to learn about fine tuning, stay here. Cool, so yeah, we're gonna be talking about optimization at a high level and then the specific subset that is fine tuning.

We're gonna go into each of the different fine tuning options that OpenAI provides and then some stories, examples, best practices, and then we're gonna do some Q&A. I think I have like one hour and a half or two hours. I have no idea how long this is gonna take.

We'll take as long as we need. If we end early, you can all have some food. Cool, so this is optimization. This is everything that it is, right? It's like, it's a graph with a line going up and to the right in some nonlinear path. This is what you're going for whenever we talk about optimization.

How you actually accomplish this can vary a lot. It can vary, you know, you can do like manual optimization, fine tuning, changing your prompt. There's a million ways to do this in a million different industries. But when you say optimization, this is all it means. Specifically when it comes to LLM optimization, we have a few different ways we can do it.

But every LLM system is sort of composed of these three events or parts, right? You have your input, which involves the prompt and the context. You have your model, which is actually represented by the weights and whatever was pre-trained. And then you also have this system that goes around the model.

It's the scaffolding, it's the tools. I will go ahead and say this is every single LLM application ever. So, everything we say will cover all of this. Now, today we're going to be focusing mostly on this middle part: optimizing the weights and optimizing the model choice. Now, how to optimize, like, prompt and context and tools and scaffolding.

There's like a million ways to do that. I think there's other talks today that are going to go a little bit more in-depth into that. But we're going to be focusing on this middle one, especially since it's a little bit nebulous. And I want to help clarify some of the topics and especially how you can do it with the different fine-tuning offerings that you can use on the OpenAI platform.

So this might be your initial reaction to fine-tuning. It's like a gross idea. It's hard. It's confusing. I would rather just write prompts like this and have the model do what I want. Now, that's fair. I think for a lot of problems, that's actually the case. Fine-tuning is really just continued training that optimizes a model for a given domain.

So it might be useful for your case. You might be better off sticking with prompting. And this is one of the things that we're going to be talking about today. But I think the idea is to open up your eyes to all the options that you have. So let's do a quick comparison between prompting and fine-tuning and then move on.

Prompting is kind of like a set of tools, like a hammer, pliers, whatever. I think those are some made-up tools in the back that don't exist, but we're going to ignore those. It has a very low barrier to entry. Anybody can do it. It's very easy and quick to do small changes to quickly progress.

And it's actually enough for, I'd say, most jobs. Now, fine-tuning is like a CNC machine. You have -- it's a way higher upfront investment. You have to collect data, make sure it's clean. The iteration loops are longer. You can't just make a quick tweak and rerun because the rerun is going to be multiple hours.

However, they're more automated. If you have a pipeline set up for fine-tuning, it's a little bit more hands-off. You don't have to do all this manual thinking yourself. And finally, it is a specialized tool. It's not useful for every single use case, but there are certain use cases where it really, really shines.

And with certain kinds of fine-tuning, you can actually push the boundary of what is possible and do things prompting cannot accomplish. So these are the three main types of fine-tuning that we support at OpenAI. If you're going to take pictures, there's another slide later that has this and more.

So the first one is SFT, supervised fine-tuning. And what is this? It's just imitation, right? You just show the model what you want it to output, and it'll learn to copy exactly that. The next one is direct preference optimization. And it's about showing the model two samples. And you say, one of them I don't like, one of them I like.

Do it more like the one I like, and less like the one I don't like. And then finally, we recently introduced reinforcement fine-tuning, which is actually the magic sauce that went into making o1 and o3 and o4-mini and all these reasoning models. You can use this algorithm as well on the platform.

And what you're doing is you're essentially providing a way to grade the output, and the model essentially figures out how to think about your given problem. So let's get a little bit deeper into each one. What does the data for supervised fine-tuning look like? Like I mentioned before, you just collect a set of inputs and outputs that you want.

For direct preference optimization, you have inputs. And then for each input, you have a positive output and a negative output. I think we call them preferred and non-preferred. And then finally, for reinforcement fine-tuning, you provide a set of inputs, and then, slightly differently, you provide a grader. Optionally, along with the inputs, you can provide some reference so the grader has a ground truth if you want. But really, this grader can be one of a variety of things that we're going to be talking about later, which is why reinforcement fine-tuning is so powerful: it does not constrain what you can do as much as the other two, but it still has its caveats.
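To make those shapes concrete, here is a rough schematic as Python dicts. The field names are simplified stand-ins, not the exact API schema, which we look at later.

```python
# Rough schematic of the three data shapes (simplified; not the exact API schema).

# SFT: input messages plus the exact assistant output you want imitated.
sft_example = {
    "messages": [
        {"role": "user", "content": "I'm still waiting on my card."},
        {"role": "assistant", "content": "card_arrival"},
    ]
}

# DPO: one input plus a preferred and a non-preferred completion.
dpo_example = {
    "input": {"messages": [{"role": "user", "content": "Tell me a joke about cats."}]},
    "preferred_output": [{"role": "assistant", "content": "<the joke you liked>"}],
    "non_preferred_output": [{"role": "assistant", "content": "<the joke you didn't>"}],
}

# RFT: inputs plus optional reference fields; a separately defined grader scores
# each sampled answer, optionally against that reference.
rft_example = {
    "messages": [{"role": "user", "content": "<the problem to reason about>"}],
    "reference_answer": "42",
}
```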

So we're going to be going into all that. Now, what does each of these cases learn? Like, you can look at the data, maybe you have some idea, but at first glance, there are a lot of cases where it looks like supervised fine-tuning might be able to do it, versus direct preference optimization, versus reinforcement fine-tuning.

Like, how do you think about what each of them is doing? And the way that I find it most useful to think about this is thinking about what is the model learning in each of these cases. So supervised fine-tuning is essentially learning a continuous mapping, or like a soft mapping, from your inputs to the outputs.

And so, you know, it's not exactly like a direct mapping like a table, but in essence, that is what you're teaching the model. It's very, very direct. It's just going to learn to imitate the data that you provided. Direct preference optimization is a little bit more interesting. It doesn't actually learn, like, each of the examples.

It learns the delta between the examples. And so if you imagine if you can, like, pull up your, like, mental embedding, like, latent space, and you project these two ideas there, the positive output and the negative output, if you can imagine, like, what that difference vector is between them, that is essentially what you're trying to teach the model.

And then, finally -- and this is also, it looks straightforward, but it's actually pretty interesting -- is learning to reason, right? So reinforcement fine-tuning learns to reason about a problem by learning to tune and change its chain of thought so that it is able to get this, like, higher success rate on whatever problem that you're giving it.

So, what is each -- what is each of these good for? So, supervised fine-tuning is the simplest of the bunch. It's great for classification use cases. It's great for formatting and structured data extraction. Pretty much any case where you really want to constrain what the model is doing and have a very specific kind of output, supervised fine-tuning is it.

And this is helpful for, like, distillation, if you want to teach, like, a very, very, very small model how to do this, like, structured output example and, like, prompting isn't cutting it. Supervised fine-tuning is the way to go. Now, direct preference optimization is a little bit more -- I'd say it's the hardest to think about of the three.

It's very good at learning tone-matching, right, because tone and style are these kind of little bit more intangible things that are hard to evaluate but easy to see if you put two comparisons together, right? And so, this was really made in order to do tone-matching and also -- I mean, primarily, it is when you have A/B tests, right?

Like, when you have positive data and negative data, preferred data, not preferred data, if you've been using ChatGPT and you get, like, both -- like, two responses, this is why, right? Like, this is the algorithm that is being used in order to, like, tune for those preferences. And so, it's not saying, "Output exactly the one that I prefer." It's saying, "Output slightly more in that direction," whereas supervised fine-tuning is much more exact.

And then, reinforcement fine-tuning is for gradable hard problems. Some examples are, like, medical, legal, coding, just, like, things that are pretty hard to do but pretty simple or straightforward to verify. You really, really, really want unambiguous solutions, and I feel like in a lot of these talks, we always talk about successes.

I snuck in a little failure here later, so you're going to get to see that. And it's also a really interesting candidate for training an LLM judge model, right? Because grading a problem is actually a problem itself, and grading that problem has a pretty verifiable answer if you have these, like, golden results that you want your LLM judge to be able to mimic.

Cool. So, one thing to note before I get into each of these is that, each of these approaches, while they're very different, we are training only a subset of the model in each of these cases. And so, this is why they're not amazing for learning, like, tons of new data, right?

Like, RAG and other solutions like agentic retrieval are much, much better if you want to teach the model new information. So, this is called, I think, low-rank adaptation. Can anybody check on that? Yeah, I see some nods. Cool. And essentially, for those who are interested, you should look more into it.

It's a really, really clever way where you take the model weights and you decompose them into two matrices. And then that essentially gives you a much smaller set of weights that you need to update in order to affect the overall model behavior. And so, you don't fully readjust the entire model.
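As a purely illustrative toy sketch of the low-rank idea (not how any of this is actually implemented on the platform): instead of updating a full weight matrix W, you train two small factors and add their product on top.

```python
import numpy as np

d, k, r = 4096, 4096, 16            # r << d, k

W = np.random.randn(d, k) * 0.02    # frozen pretrained weights
B = np.zeros((d, r))                # trainable low-rank factor, initialized to zero
A = np.random.randn(r, k) * 0.02    # trainable low-rank factor

def adapted_forward(x):
    """x: (batch, d). Base projection plus the learned low-rank update."""
    return x @ W + (x @ B) @ A

# Only B and A get gradient updates: d*r + r*k parameters vs d*k for the full matrix.
print(f"trainable: {B.size + A.size:,} vs full: {W.size:,}")
```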

You don't make it forget the things it did before, but you still impart some new behaviors into it. Cool. So, let's start with a story. When I joined OpenAI, I was working as a solutions architect. And what this meant was I was working directly with companies, helping them deploy their, like, AI use cases and applications into whatever domains they are working in.

Now, one of our partners, one of our customers, was working on a low-latency function calling use case, kind of assistant style, based on, like, taking action. Right? And so, you would -- the user would say something, and the model had to act. But it was in an extremely latency-constrained scenario.

Right? So, they wanted to use -- back then, it was GPT-3.5. GPT-3.5 was very, very fast, but not very accurate. GPT-4 was roughly the level of performance that they were expecting in terms of accuracy, but it was way too slow. And so, they asked us, what can we do?

So, ooh, actually -- before I get into this, this would actually be fairly straightforward. And I feel like a lot of you could imagine how to do this if we had a lot of existing examples with, like, inputs and outputs, which they did not have. And so, we were left to figure out, how do you fine-tune a model when you don't have, like, this great amount of data -- or these, like, good examples.

And so, we took a few different approaches -- but I'm going to cover the two main ones that we used -- that proved to be pretty effective. So, the first one is, we did have all of the functions that they wanted the agent to use. One more thing I didn't say, they were trying to do this with 120 functions, which was unheard of at the time.

I remember asking some researchers, like, you know, what is, like -- how many do we support usually? And they were like, you know, five to ten, maybe. It's like, okay, okay, cool, cool. So, this is actually the first time that anybody fine-tuned a model with over 100 functions, because the capability was not possible before.

We had to have this built for this use case. So, what you're seeing here is the first step that we did. We actually took the function schemas that the customer provided and figured out a way to create every possible, like, permutation of the invocations of that function. So, for example, if we have, like, a toggle -- ooh, typo -- if we have, like, set lights and it takes one parameter and it's an enum, here I'm essentially showing that the three permutations that you can have are set lights off, set lights on, and set lights red.

This was actually done completely deterministically. This was just, like, a Python script that I spent way too long working on. Then search song is kind of similar. There's more than one parameter. So, here I skipped the cases where you provide, like, multiple parameters, but essentially what this function did is it took a schema from a function, and then it spat out essentially every possible permutation of that function.

And if it was, like, more -- if it exploded in complexity, we would, like, randomly sample from those. So, this was the first step. Oh, and then if there were some parameter values that were non-deterministic or open-ended -- like, Katy Perry -- we would ask a model to, like, provide plausible values.

The next step was once we have those function calls, then we asked GPT-4 to go essentially from here's a function call, give me a command that would have resulted in that function call. And we did this for a large number of the permutations. And so, some of you might already be noticing what we're doing, but essentially, if you, like, flip this in your head, now we have these input-output pairs, where the input is the GPT-4 generated command, and the output is the function call itself.
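A compressed sketch of those two steps, with made-up function names and a placeholder model string; the real pipeline had more filtering and sampling logic than this.

```python
import itertools
import json
from openai import OpenAI

client = OpenAI()

# Step 1: deterministically enumerate call permutations from a schema's enum parameters.
def enumerate_calls(name, enum_params):
    """enum_params: {"param": [allowed values]} -> one call dict per combination."""
    keys = list(enum_params)
    for combo in itertools.product(*(enum_params[k] for k in keys)):
        yield {"name": name, "arguments": dict(zip(keys, combo))}

calls = list(enumerate_calls("set_lights", {"state": ["off", "on", "red"]}))

# Step 2: ask a strong model to invert each call into a plausible user command,
# then flip the pair: the command becomes the training input, the call the target.
pairs = []
for call in calls:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the original work used GPT-4
        messages=[{
            "role": "user",
            "content": "Write a short user command that would result in this function call:\n"
                       + json.dumps(call),
        }],
    )
    pairs.append({"input": resp.choices[0].message.content, "output": call})
```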

So, we're already on a pretty good track, right? We have a lot of data, but it's pretty -- it's synthetic, and we don't just want it to be synthetic. We don't know if these are the kinds of inputs that the users will be giving, and if we are very off-base, if we are, like, very out of distribution, we won't have as good of a model in the end.

I will say, the models are pretty good at generalizing, even if you do stuff that is slightly out of distribution, because they have such a large set of weights and, like, a pre-existing world model. The second way we did it was distilling some data. So, the engineers we were working with did actually have some unlabeled inputs, and we were like, okay.

Well, here's the thing. You said GPT-4 was good enough for most of these cases, so let's do something. Let's just run them through GPT-4 and get what outputs GPT-4 would give us. After this -- I didn't include this in here -- we actually did some filtering as well, and, like, we did a couple stages in this pipeline to essentially generate a lot of, like, distilled -- still synthetic data, but hopefully very high quality.

And in the end, we took these two data sets, passed them into GPT-3.5, and we achieved GPT-4 level performance at GPT-3.5 latency. And so, this was one of the first, like, major successes that I saw with supervised fine-tuning. Now, note that this is a very constrained use case, so it's actually perfect for fine-tuning.

The model doesn't have to do this, like, open-ended response. You're not trying to strive for a certain personality. You just want it to call one function, and you want it to do it well, and you want it to do it fast. So this is a perfect use case to do distillation with supervised fine-tuning.

Actually, I want to keep this interactive. I'll pause every now and then. Feel free to ask any questions. If I don't see any hands, I'll just, like, barrel along. Okay. I see a couple back there. Yeah. I think they're coming to you with a mic. Six? I can just, yeah.

Could you have distilled without synthetic? So the question is, could we have distilled without synthetic data? I think we would have preferred to distill without synthetic data, right? I think the use case that we -- like, the perfect case here is the engineers come to us, and they're like, hey, we want to get this model to be really, really fast, and we have 1,000 labeled examples.

And we're like, great! We'll get back to you in, like, a day. Instead, this is what we had to do. But yeah, that's a good question. I think there's one back there. Yeah. So the question was, what was the ratio between synthetic and slightly less synthetic data? And how many examples did we need?

I think I'll answer the second one first, which is -- and I have this in a demo, and we'll look at the performance. So, like, at around, like, 50 to 100 examples, you start to see signs of life. And that might be, like, already enough for some smaller problems.

Then if you start, like, pushing more toward production, you might want to scale up to, like, a few hundred, 500 examples. In the demos I have later, I used 150 examples and 500 examples. But this is for supervised fine-tuning. For the other two kinds, you can actually use fewer, and it's more forgiving.

And we will talk about that in a second. And then the first one was, what is the ratio? I don't remember. I think I tried some ablations where, like, I only provided one of the -- one of each. And, like, it was best when I provided both together. But I can't tell you what -- I don't remember what the ratio was.

Yep. Okay. So you -- you said that there was around 100 functions? Yes. Okay. Yeah. I pretty much had the same question. So you're -- you're basically saying for each function you had around 50 to 100 synthetic examples. I don't remember -- it was -- I've said a lot of numbers around 100 and 500.

But I think they're being maybe conflated. We had 100 functions, around 100 functions. Each one we generated -- I think it was, like, between 20 and 200 -- like, we tried different amounts -- like, between 20 and 200 examples, depending on how many permutations each function would give us.

Mm-hmm. And then also depending on which -- which of the inputs. So all in all, I think it was in the, like, like, high hundreds, low thousands. Maybe, like, yeah, low to mid thousands of examples that we used. Yeah. And -- sorry if I'm skipping ahead here, but was there -- did that improve the function call accuracy for, like, choosing which function to call?

Or did it just improve, like, how well it would call a function? So, like, choosing the function versus, like, populating the function parameters? Yes. Both. It improved both. Awesome. Yeah, yeah. I mean, we -- if I remember correctly, we got to, like, within 2% of GPT-4 at, like, a very small fraction of the latency.

Yeah. Something else that we tried was, like, what happens if you remove all the context -- all the functions from context, or, like, the parameters from context? Like, can it learn the parameters, like -- because we wanted to, like, really squeeze out as much context as we could to, like, reduce latency?

And you know when you're cutting input context for latency reductions, you're, like -- like, you're out of options, right? You should never start with that. But we were trying everything. And that was actually not very good results. We realized, like, having things in context -- like, removing them from context -- makes it much harder to learn through fine-tuning.

It's not as good. Thanks. Yeah. Yeah. So, uh, a follow-up question on that -- in the future, if they have more functions, then do they need to retrain everything? Uh -- let's say they have 100, and, you know, they just add two more. Yes. So it would not scale, right?

I believe. This -- this does not scale in that way. If you, like, add new functions, you -- like, in this exact setup that we have, you would need to -- to retrain. There are ways to just pick up training from where you left off. Like, you can fine-tune an already fine-tuned model.

And it just continues the fine-tune process, essentially. And would that be okay? Like, if you fine-tune on top of this with a few examples? I feel like it -- it could be. Especially for this case. It's so constrained. I think you can get away with a lot in cases that are this constrained.

Got it. Yeah. I'll take maybe one more, and then we will keep going. Yeah? How did the prompts change when you did this? How did the prompts change when I did this? Like, did you have -- were you able to reduce context, and, like, what does each tool do?

I see. Uh, yeah, in terms of -- so I think, um, the question is around, like, how much we managed to reduce the prompt size and, like, context that we put into the model. The answer is a bit, but not that much. Like, we were able to get rid of most of the prompt and just say, like, classify into, like, one of these, and then here's all the functions.

But we weren't able to remove, like, types or the different parts of the functions themselves from the context without losing performance. Yeah. Which was much more of an issue then, because we were working with, like, a 4K context window. Maybe -- I think we, like, had to, like, push -- like, apply to, like, let us fine-tune a 16K model, and it was -- whoa, it was crazy.

Now it's, like, you're fine-tuning, like, one-million-token-context models. No problem. Okay. Um, cool. This is this case. I'll move on, and we'll have -- we have plenty of time for questions. Um, and if you're curious to learn more, we did actually write a lot of this up in a cookbook called Fine-Tuning for Function Calling. You can run it. You can play with it. It has most of what I talked about.

Okay. Time for the first live demo. So let's say we have a customer -- I wanted to find a data set that would be easy to find.

So here I have Cursor with, like, a very basic, you know, loading in OpenAI keys, yada, yada, yada, but then I'm loading in this banking data set. And essentially what it does -- if we, like, print out the first thing in the data... it is not happy. Oh. Cool. Um, we have some input text.

You know, I'm still waiting on my card. And a label. And the label is, uh, some, like, index into an array of labels, right? Uh, of, like, actual text labels -- like, um, the user is, not waiting for a card, but, like, card arrival time or something like that, right?

And so what we want to do is we want to get a model that can, like, classify this really, really well. And this is a good case for fine-tuning, or for supervised fine-tuning. It's very straightforward, um, it's very direct, uh, and it's very constrained, right? So, okay, so here I load in the data.

I load in the label names. I guess I could have just printed these. Cursor abandoned me today. Okay. Yeah, activate card. Oh, yeah, yeah. We got, like, many of them. And so as you can see, this is not, like, a super easy task for the models. This is, like, a lot of different labels they can pick from.

Um, and this is as hard as I tried in my prompt. That's it. That's it. Um, and then I passed it all the labels. Um, and I guess this is a good time to say, in supervised fine-tuning, your prompt does not matter as much, right? You are showing it, like, direct examples, and so, like, you can skip a lot of the prompt engineering.

Not all of it, right? If you give it, like, nothing to work on, it's not going to, like, learn very well. Um, but, like -- or, sorry -- you could actually completely remove your prompt, and if you have a constrained enough example, it will still learn to do the right thing.

This is not true about DPO and RFT. You still need good prompts for both of those. Um, even for SFT, I would recommend having at least a decent prompt. Okay. So we have our data. We have our label names. I loaded in the test data. Uh, and I wrote my prompt.

So now I have my messages. Uh, I'm using the chat completions API because fine-tuning right now, I think, works in that format. So I want to keep the format consistent. Um, so we have our system prompt, which is this beauty. And then the user is just the input text -- uh, the user message is just the input text, right?

And then I write this very simple function that calls the API and gives it a classification, right? And so if I, like, test this out right now, um, look at that. So how do I look at my card -- card arrival? I guess that's correct. But, like, if I run it a few more times, you know, card not working -- like, it's not good, to be fair.
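For reference, the classify helper is roughly this shape (reconstructed, not the exact notebook code; label_names is the list loaded earlier):

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Classify the user's banking message into exactly one of these intents "
    "and reply with the label only:\n" + "\n".join(label_names)
)

def classify_intent(text, model="gpt-4o-mini"):
    # Chat Completions, so the message shape matches the fine-tuning data format.
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content.strip()
```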

I don't know if any of us could do this with this prompt. Uh, it's pretty bad. But that's part of the point. Okay. So we have this sort of, like, inconsistent thing. Um, now, before we even start doing fine-tuning, we have to ask, like, you know, should we do it?

Right? And this is one of the most important questions in fine-tuning. I think this is part of why a lot of you are here. Should I do it? The answer is, how well is it doing on your evals? Right? So I wrote this very small, like, parallel evaluation function, um, that essentially takes in a model string, a system prompt, uh, some, like, input samples to run over in parallel, and then a number of workers, since I'm, like, shooting them off in parallel, sort of, like, asynchronously.

Um, I have a semaphore, just so I don't, like, violate any rate limits -- like, the number of workers that I have is, like, fixed. Um, I'm, like, logging the progress, and then I have my score function, which just calls classify intent, um, updates the progress, and then returns whether it equals the correct label.

And I shoot them off in parallel. And there we go. So, uh, this is my evaluation function. You know, people talk about, like, different, uh, eval platforms. We have a way to do evals on our platform as well. Um, I think I'm going to stick to this for this example, since I want to keep things, like, mostly in this notebook.
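A simplified version of that evaluation loop (no progress bars, and the system prompt is baked into classify_intent above); a thread pool caps concurrency the same way the semaphore does.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(model, samples, workers=16):
    """samples: list of (input_text, correct_label) pairs. Returns accuracy."""
    def score(sample):
        text, label = sample
        return classify_intent(text, model=model) == label

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(score, samples))
    return sum(results) / len(results)

# usage: evaluate("gpt-4o-mini", test_samples)  # test_samples = the test split loaded earlier
```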

Um, and we can, like, see little pretty progress bars that I made. Okay. So now we have GPT-4o mini and GPT-4o. This will come in later. Uh, and we can run it, right? And so if we run these evals, we're essentially taking the testing set, um, made up of 150 examples, uh, running it.

My progress bars aren't working correctly. That's fine. Uh, it still finished, right? So it just ran 4o mini, and it ran 4o. And now we get the relative accuracies, which is 75% accuracy for 4o mini and 83% for 4o. So, you know, not bad. Let's see if we can do better.

Um, cool. So this is just, I'm taking the actual data sets and then dumping them out into files that we can use for training. Now, the important part here is the format. So the format of the fine-tuning data -- and, like, let me see if I can just -- I just love pulling up docs in talks.

Uh, it's what everyone else will be doing. Um, let's see. Oh, by the way, you should all check this out. Agents SDK TypeScript came out today. Uh, where are we? There we go. So if we look at the format -- by the way, I know this stuff. I'm not doing this for me, okay?

This is for everyone to see. Uh, cool. So here we go. This is the JSON L, which is just, like, JSON Lines data format that you'll want to be passing, which is, um, a messages, uh, array -- like, a field. It's an array that you're used to, the one you know and love.

Um, and all it's going to do is it's going to take the last assistant message and learn from it. Um, here the last assistant message is a function call, so it has, like, the whole schema. Um, but if we look at the data that we're producing here, these are, like, you know, this is my amazing prompt.

Uh, these are, like -- this is all part of the prompt, right? And then when I get to the very end -- oh, boy. Okay. Uh, then we just have -- so that's the developer message. Then we have the user message, which is the input, and then the assistant message, which is what we want it to respond.
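Concretely, one line of the training JSONL for this task looks roughly like this (SYSTEM_PROMPT and training_examples stand in for the prompt and examples built above; writing the file is just one JSON object per line):

```python
import json

# One JSONL line per example; training only learns from the final assistant message.
line = {
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},             # the "developer" message
        {"role": "user", "content": "I am still waiting on my card?"},
        {"role": "assistant", "content": "card_arrival"},
    ]
}

with open("train.jsonl", "w") as f:
    for ex in training_examples:   # list of dicts shaped like `line`, built from the dataset
        f.write(json.dumps(ex) + "\n")
```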

Um, and that's the whole setup. And what it's going to do is it's going to learn from that last message. Uh, so we can either shoot this off from the, uh, fine-tuning UI or through the API -- I'll show you how to do both. Um, cool.

So now what we're doing is we're taking the dataset, putting it in this format, and saving it, and calling it train. Uh, we then, like, upload the files. So it's important that we -- like, for us to reference files within a fine-tuning job, we have to upload them to the OpenAI API first.

So client.files.create -- we have to set the purpose to fine-tune, give it the file, and here are the file IDs that we have. And now, fine-tuning is as simple as, like, this API call, right? We pass in the base model. Uh, the reason I had the mappings between, like, the base model alias, like, 4o, and, like, the full snapshot is because it can't take in, um, aliases.

It has to take in, like, the full snapshot. Then we pass in the training file, which it's going to actually train on, and the validation file, which is going to tell us how well it's doing, uh, so it doesn't overfit. And then I'm just passing in the number of epochs. There's many more parameters you can play with.
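In code, that's roughly the following; the snapshot name, file names, and epoch count are placeholders, and the exact hyperparameter surface may differ from what's shown here.

```python
# Upload the JSONL files first; fine-tuning jobs reference uploaded file IDs.
train_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
val_file = client.files.create(file=open("val.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",        # full snapshot, not an alias like "gpt-4o-mini"
    training_file=train_file.id,
    validation_file=val_file.id,
    hyperparameters={"n_epochs": 3},
)

# Later: poll the job and grab the resulting model name once it succeeds.
job = client.fine_tuning.jobs.retrieve(job.id)
print(job.status, job.fine_tuned_model)
```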

I just want to give you a sense. Um, so cool. I already ran this. I don't want you all to sit here waiting. Uh, fine-tuning jobs can take anywhere from 30 minutes -- and this is on the OpenAI API -- anywhere from 30 minutes to 24 hours. So give yourself time.

Never do this right before a talk. Uh, cool. So we have our fine-tuning job. Then we load the model, right? We, like, check the model. Uh, we check the job. We see if it succeeded. We load them in. I actually already saved them here. And now the question is, how well did it do?

I fine-tuned, um, as you can see -- I guess I skipped over this -- but I made some datasets with 150 examples and some with 500. And I trained both GPT-4o and GPT-4o mini with each one, just to do, like, a little comparison.

Uh, and so I ran it, and I also compared it against o4-mini and o3. And the final results at the very, very end, this is what they look like. So, uh, baseline, 83 percent, 75 percent. Uh, reasoning models, a little bit better, 80 and 90. Um, but the fine-tuned models blow everything else out of the water.

Um, and this is kind of to show, like, how convenient fine-tuning can be, right? Like, it's doing better than o3 at a task that, like, I barely wrote a prompt for. And I think even 4o mini -- yeah, 4o mini with 500 examples is, like, actually technically better than 4o.

But, like, they sort of, like, are reaching much higher levels. So this is where I wanted to show you how you can, like, use fine-tuning and, like, when it's worth it. Now, is this something I could have done with more prompt engineering? Maybe. But, like, this is a case where I have the data.

I want to run it. Pretty straightforward. Uh, maybe -- yeah, I'll pause here again for -- for questions. Yeah? Is the fine -- oh -- Do the fine-tune runs check for overfitting? And if they do, do they stop? Or do you have to be, like, watching it while you're, like -- Nope.

So the question is, do the fine-tuning runs check for overfitting? No, they don't. I mean, you can see it. You can see it on, like, the diagram. Let me see if I can actually pull up these jobs. Um, maybe this last one. I guess the second question, then, is do you just kill it after the diagrams show you that it's crappy?

Or, like, how do you know what to do next? Um, am I in a different org? Okay. I may have done this in a different org. So I'll pull it up at a different time. But, uh, yeah. Um, I don't want to leak anything. Sorry. Uh, sorry. What was the question again?

Do you have -- you do have to -- Do you see the overfitting? Like, what do you do? Well, I mean, yeah, you can stop it and try again. Uh, I think you're asking me a very, like, data science general question. What do you do when you see overfitting?

So many things you can do. In this case, specifically, it's like, you know, you don't want to have a 24-hour job. Yeah. Yeah, you can stop it. Uh, you can stop it, and I think you'll still be able to test with the latest snapshot, maybe.

Uh, here? Yeah. Yeah. Hey, um, a couple of questions. How do you decide between, like, when do you want to fine-tune versus not, like, just leaning to always using o3? Like, when do you sort of decide to, like, actually go fine-tune? I'll get into this more later. But I think the short of it is always start with prompt engineering and see how far you can go, and then run evals, right?

And so if it's as good as you want it to be, that's great, right? Um, and if you are still changing the prompt and you are still getting better results, there's no reason to fine-tune yet, right? I think fine-tuning is something you want to consider, like, when you have enough data for it to be worth it and the data is, like, pretty clean -- and, like, you kind of have to have, like, the resources ready to do the fine-tuning.

Fair enough. And then, like, which models do you, like, support fine-tuning for right now? Like, does it keep changing over time? Like, I know you support, like, you know, 3.5 and then 4. Like, is everything, like, when it comes out, like, instantly going to support fine-tuning? Or is there, like, what's the lag period behind it?

I think we have... oop. Oh, man. Uh, where are we? I think we now, like, say, yeah, which ones support fine-tuning on the models page. This is why I like docs. And live browsers. Uh, yeah. Did you have to do any, uh, hyper-parameter tuning? Not for this. Uh, not for this.

Yeah. Does it have any impact on the results? Like, tuning the learning rate? Sure. But I'd say that falls into, like, the field of, it's gonna be, uh, very case-by-case. You have to, like, run different things. There's not a lot of good intuition that I have built that I can share with you.

Um, but I'm sure if you, like, look online, everyone will have their own opinions. Yeah. Thank you. Yeah, I'm curious on, um, after you fine-tune, can you still tweak the system prompt? Or if you do, do you need to then fine-tune again? Yeah, that's a good question. You can tweak the system prompt, but, um, you are now out of distribution.

And you are sort of hoping the model learned, like, general enough behavior that being out of distribution won't hurt. I would recommend, if you fine-tune with a prompt, always use that prompt. If you're gonna change the prompt, fine-tune again. Um, or just, like, use many different prompts during your fine-tuning.

You don't have to use just one. Uh, if you use multiple different prompts, you might, like, you might prevent it from collapsing too much into, like, one specific kind of behavior. Um, but, good question. Uh, yeah? Uh, can you go back to the notebook for one second? Yep. It looks like, uh, the fine-tune for 4o did worse.

Is that true? And if that happens, do you do more fine-tuning? Fine-tuning for 4o mini? For 4o. Oh, for 4o. Uh, how do you know when it's gonna start to go up? It's really a lot of things. Well, yes. That's a good question. I don't know what happened.

I think, um, either I made a mistake, or I, um, I would love to pull up the graphs right now. Uh, the reason I'm not is because they, I think I ran these on the OpenAI org, and I don't want to pull that up on screen live. Um, but that's what I recommend you do.

Yeah. Uh, yeah. Um, regarding, like, overfitting, do you have, like, uh, ergonomics to do that in the SDK? Like, defining the test set, and then? Oh, yeah, yeah, yeah. I mean, you can define a test set and a validation set, and it'll, like, compute both. And do you determine -- when you see, like, the curve of the test set go up, like, when you see them cross -- to stop the job, or do you have to check it manually?

I guess I can just pull up, like, an old, older one. Um, I just don't know how successful they'll be. But, like, you know, you get these... What was I doing? Yeah, but, like, the blue line is validation loss, and the green one is. This is clearly, like, not an amazing fine-tuned run.

I think this was pretty old, but this is, like, what the graphs look like. Yeah. Okay. I'm realizing, uh, we still have a lot left, and I'm running out of time. So let's keep it chugging along. Um, cool. Okay. So rules of thumb for supervised fine-tuning. It's best for simple tasks.

Models can regress on other tasks. So, I mean, this regression that you called out, confusing. I'm not sure what it was. But the thing is, if you then test it on different things that you did not fine-tune for, you may see regressions. So this is for very constrained cases.

Data diversity is critical for that reason. Anything you don't include might regress. Um, 100 samples to start; 500+ is best. So, okay. DPO. Uh, training a model on jokes. Uh, so my thinking here was, like, if we take, like, jokes -- this funniness principle is, like, very hard to evaluate for, but between two options you know whether something is funnier or not.

So, this feels like a good candidate for DPO. Uh, so I generated a bunch of jokes with GPT-4.5, hoping they were going to be good. Um, most of them were not. Some of them were. Uh, I really wanted a data set. And this is really what it comes down to.

All of these are about the quality of data. I did not have high-quality data. I had some medium-quality data with a, um, like, mediocre validator. So, like, uh, then did some manual filtering. Like, did I laugh? Did I, like, exhale a little bit more? And then, um, okay. The interesting part is, once I have this subset of good jokes, which really should have been, like, a nice data set.

Uh, remember what we're trying to do, right? We're trying to get some inputs and then, for each, a pair of outputs. So I went backwards again from each joke. I extracted the topic. Um, and once I had a topic for each joke, I once again went forward, uh, with 4o and asked it to, like, generate a corny joke for that topic.

Um, and so what we're left with is, once again, input topics and then good jokes and bad jokes for that topic. Um, and so hopefully the, like, main delta between those jokes is going to be one is good and one is bad. They're about the same topic. So, like, that delta should be stronger, um, as opposed to just giving it two unrelated jokes.

Maybe it's, like, ah, like, you know, he's looking for, like, cat jokes or something. No, it's, like, similar jokes. One's funny. One is not. Um, did it work? Sort of. Um, this is one of the results that I got. So two guys are hunting when one collapses suddenly, stops breathing, and appears lifeless.

Panicked, his friend calls emergency services. Help! My friend collapsed. He's not breathing. I think he's dead. What should I do? The operator calmly replies, stay calm. First, let's make sure he's really dead. And there's a short silence followed by a loud gunshot. And then the guy comes back on the phone: okay, now what?

Okay, so, is it original? I don't think so. But, like, you know, it's funny. Uh, cool. So, um -- yeah? But, like, can you tell me an original joke right now? It's exactly that joke. Exactly that joke? From ten years ago -- I never read that. I agree. Me too. I would say almost all the jokes that the models generate, like, you are hard-pressed to find, like, a good one.

But I think it's still, like, it's not a null result. Because the fact that it was able to, like, give me a good joke on command is a hard thing to do, right? Like, I think if you ask ChatGPT for a good joke now, it's, like, harder to get that to happen.

So even though these are, like, not new jokes, the idea is to, like, elicit the good jokes. Um, what was I going to show you? Where's my notebook? Um, yeah. So here is the data in the format that we said, right? So we have the user input, then tell me a joke about topic, and then we have the preferred output, and then the non-preferred output.
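In JSONL form, each preference example looks roughly like this; the field names are as I understand the preference fine-tuning format, so check the current docs before copying them.

```python
# One preference example: the shared input, plus a preferred and a non-preferred completion.
dpo_line = {
    "input": {
        "messages": [
            {"role": "user", "content": "Tell me a joke about airports."}
        ]
    },
    "preferred_output": [
        {"role": "assistant", "content": "<the joke that actually landed>"}
    ],
    "non_preferred_output": [
        {"role": "assistant", "content": "<the corny generated joke>"}
    ],
}

# The job itself is created like an SFT job, but selecting the DPO method --
# something along the lines of method={"type": "dpo", ...}; see the docs.
```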

Amazing. Cool. Let's keep going. Um, okay. So DPO, it's more forgiving about data diversity, because if you give it, like, a bunch of examples that are all specifically jokes, and then I ask for, like, a speech about something, um, it's not going to give me a joke necessarily, right?

Because it didn't learn to say jokes. It learned to get funnier, hopefully. Um, so it's, like, that delta is important, and it is less constraining than SFT. Um, it's much more for, like, these "what should you prefer" scenarios, and, like, it's harder to evaluate unless you have, like, humans or, like, some, like, natural signal.

Um, and you can actually run it after SFT. Okay. Reinforcement fine-tuning. The one I wanted to save time for, which I didn't. Eh, I still have some time. Um, the shape of it, as we talked about before -- this is, I think, a pretty exciting one. As time goes on, um, I think more and more use of this is going to start happening, because this is the first way, um, that I see that, like, we are really making this reinforcement learning algorithm public, and you can, like, get, like, state-of-the-art results on whatever your task is, um, like, beyond o1 and o3.

Um, cool. So what does this algorithm look like? You have a bunch of inputs. Give it to the model. Model, for each of the inputs, generates this, like, chain of thought to the output. Then you have your grader. Your grader evaluates the results, right? Um, how does it do this?

However you want. Uh, and we'll get to that in the next slide. But importantly, when it performs well, it reinforces those behaviors specifically and whatever led to that chain of thought. And so, good chains of thought, um, that, like, get high scores are more likely, like, to keep going.

Like -- it's more likely to generate the chains of thought that lead to high results, which is exactly how, like, o1, o3 and these reasoning models are trained in the first place. So you can do this. Um, now, these are the very, very important things to keep in mind.

You want to have unambiguous grading. Um, in an example you'll see later, the grading was a little bit less, uh, unambiguous, and the results varied. Um, and by unambiguous, it just means, like, if you ask a bunch of people, uh, who are, like, fairly smart and have the same rubric how to get to the answer, they will all give you the exact same answer.

Um, you want very low noise data. So if you have, like, any mistakes in your data, they're going to, like, magnify in a way that is not normal for, like, SFT and DPO. So it's very, very, very important for you to clean your data and not have, like, noisy signal.

You want it to be high signal -- and this kind of goes with the last one -- you want it to be able to actually get to the solution, um, because it's learning to reason. It's not necessarily learning, like, new facts or, like, new things. It's, like, learning how to get to the solution.

Um, and because it is, like, so -- it's very data efficient. You can give it, like, 40 to 80 very high-quality, very high-signal examples, and it can, like, learn to reason over them and generalize, uh, to other examples in that set. Uh, cool. So now let's talk about graders.

Uh, there's a whole myriad of graders you can pick from. There's string, like, string checks. Um, there's text similarity. Um, you can write Python code, sandboxed, no internet access, a few libraries, preloaded. Um, and you can have that Python code be the grader itself. You can have, uh, score, like, model graders.

So model graders are just, like, the -- like, using the LLM with a prompt. Um, one that outputs a score, one that outputs a label. And we have a multi-grader. You can nest these as much as you want. And then you write a combination function. Um, and so this is how you can define the grader.

I'd say it covers most of the cases where you would want to have something graded. Um, and you can get pretty, pretty good results. So, uh, this was my idea for the talk. Um, I hate email. Let's make a model that can, uh, do the thing -- like, I -- it's not that I wanted to respond to the email.

It's that I want to know whether it has to be shown to me or whether I can ignore it and never have read it in the first place. Uh, so that was the idea. So I made this little, like, email labeler for myself. And I became a little data labeling person for a day.

Um, turns out if you make an interface that is nice, you can go so much faster. Um, so I labeled around 600 emails. Uh, my categories were, like: am I going to glance at it; ignore it; archive it, which is, like, receipts; store it, which is, like, I want to see it later; respond within different time frames; take action within different time frames.

I actually didn't end up using the topic. Um, and I can download it as a JSON-L. Now, what do the instructions look like? I have some formatting. I have some classification instructions with, like, a little bit of extra instructions, but I didn't try very, very hard with this prompt.

This will come back to bite me. Um, and then the input itself, I formatted as, um, like, this XML-ish format. So I wrapped each one in an email tag. The input is an email with sender, subject, date, and then the stringified body, uh, the plain text of the email. Um, yeah.

So this is -- then I formatted the data for reinforcement fine-tuning, which is a set of messages in the same way that you would for SFT or all the others. Um, and then I use a reference answer. You can use any other fields that you want -- those are essentially extra fields that are accessible to your grader.
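Roughly what that looked like for the email task. The template syntax and exact grader fields are from memory of the grader docs, so treat them as approximate; CLASSIFICATION_INSTRUCTIONS is a stand-in for the instructions shown on the slide.

```python
# One RFT training line: normal messages, plus extra fields (here reference_answer)
# that only the grader can see.
rft_line = {
    "messages": [
        {"role": "developer", "content": CLASSIFICATION_INSTRUCTIONS},
        {"role": "user", "content": "<email><sender>...</sender><subject>...</subject>"
                                    "<body>...</body></email>"},
    ],
    "reference_answer": {"action": "archive"},
}

# String-check grader: exact match between the model's output field and the reference.
grader = {
    "type": "string_check",
    "name": "action_match",
    "operation": "eq",
    "input": "{{sample.output_json.action}}",
    "reference": "{{item.reference_answer.action}}",
}
```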

And then the grader, which I defined below, uh, is just a string check grader that takes the reference answer action and compares it to the sample output. And if they're equal, we're good. Um, now how'd it go? Like, okay. Not amazing. Um, and I ha -- I think I know why, right?

So first of all, like, this is just my whim. Like, whimsy deciding whether an email is important or not. It's a very hard function to model. If I were -- if I was asked to label the same emails again, would I hit, like, a hundred percent the same classifications for all of them?

Probably not. And that's exactly the scenario you want to avoid. Um, it was somewhat noisy. I think I, like, screwed up, like, one or two. Uh, and that, like, uh, compounds. Um, 150 examples, that was actually fine. Like, if they were a very high signal, this would have been, like, a better result, but they weren't.

Um, and I didn't really try with a prompt. The model doesn't know what I want, and it's trying to figure it out. It's like -- it's like as if I gave you this task with the emails, and I locked you in a room, and you get to, like, maybe write some notes, uh, and I, like, slip you some food and water.

Um, and then the grader, uh, you know, string equality -- that was fine, but there's only, like, a few categories, so the model could guess some, and you want to make it not very guessable. Um, okay, so that was my failure. Now let's take a look at something that, like, reinforcement fine-tuning is actually quite good at, right?

So, um, predicting the number of hydrogen bond donors and acceptors. Um, this is actually fully outside of what I understand, um, but the model can learn to do this quite well. So the task is to take in some chemicals, um, formatted in a specific format, and then predict the donors and acceptors, uh, to analyze it.

And this is an example of one of the inputs. Um, now the output schema is defined on the left. We have acceptors and donors, uh, which are just two numbers, uh, and it just has to output these two fields. And then on the right, uh, I have my grader, which is a multi-grader.
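In config form, that multi-grader is roughly this shape (approximate field names; the calculate_output formula is what encodes the half-and-half weighting described next):

```python
multi_grader = {
    "type": "multi",
    "graders": {
        "donors": {
            "type": "string_check",
            "name": "donors_match",
            "operation": "eq",
            "input": "{{sample.output_json.donors}}",
            "reference": "{{item.reference_answer.donors}}",
        },
        "acceptors": {
            "type": "string_check",
            "name": "acceptors_match",
            "operation": "eq",
            "input": "{{sample.output_json.acceptors}}",
            "reference": "{{item.reference_answer.acceptors}}",
        },
    },
    # Half credit for each field, as described on the slide.
    "calculate_output": "0.5 * donors + 0.5 * acceptors",
}
```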

It just checks for each of them, right? And the weight is, it gets half if it gets the donor right, half if it gets the acceptor right. Um, I'm gonna pause here, because this is an important slide. If there's any clarifying questions, otherwise I'll move on. Amazing. Um, so what were the results?

Much better. This started at roughly, like, 65 percent. Um, it ended up, I think, like, touching 80 percent, right? And this is o4-mini. Um, o4-mini, like, does not have very vast world knowledge. It really has to, like, learn to reason through things.

Um, I'd say even this result is not, like, fully exemplary of how good it can get. The reason it's not as good is because it does need some world knowledge to understand, um, like, this sort of simulation that you have to do with the hydrogen bonds. Um, but you still see this, like, very, very high performance that, like, you would not get with something like supervised fine-tuning.

Yes? Can you use DPO for this? Could you, like, do the email example with DPO first and then do RFT second? Yeah. One thing I forgot to put in the slides. SFT and DPO you can do with non-reasoning models. RFT you can do with reasoning models, and there's no interplay.

Yes. Sorry. He asked, um -- I'm blanking -- whether you can do DPO before RFT. And the answer was, uh, SFT and DPO are for non-reasoning models. RFT is for reasoning models. And it's like an if-and-only-if. Um, cool. So, some notes on reinforcement fine-tuning.

Only for reasoning models. It's extremely sensitive to noisy data. Um, agentic use cases are limited to a single turn. Right? So, you can't yet have these, like, long-horizon tasks with multiple turns and function calls, uh, and reinforcement fine-tuning. I would love to see that. We'll see if we ship it at some point.

And then, um, you want the data to be high signal, uh, and you want a solid grader. And if you provide both, you are likely going to reach, like, state-of-the-art performance on the task that you are going after. Um, cool. So, this, I think, is one of, like, the most telling slides.

I believe that is most of what I have for today. Um, I have a notebook. I have a ton of stuff that I didn't quite go through. Um, I'm happy to dive into that. But I kind of want to leave this open for questions and, like, poking around, um, different parts.

Actually, how much time do I have left? An hour? Oh, great. We can do so much in this time. Uh, you guys are captive. Okay. Yeah, I do want to do a little quick questions and then we'll move on. One thing I started and didn't do is another kind of tuning, um, which I can do live, which is prompt tuning, um, which I'm going to make up on the spot.

So, if you want to see that, stick around. Um, but, yeah, questions. Yeah. Yeah. Hi. Uh, so, I had a small question. Uh, so, you're saying that if I have a non-reasoning model and let's say I do an SFT step with it, uh, can I induce thinking into it?

For example, if I create a dataset of input-output pairs which have, let's say, thinking in them as well. So, would you call it the same kind of thinking as you'd get in a, uh, reasoning model? Um, you would get thinking. It's a very different kind of thinking, because it's purely learning by imitation.

Um, so, it's not learning to, like, uh, reason in order to get to an answer. It's learning to imitate the thought process, um, regardless of whether it's actually useful to it or not. Right? Um, whereas with reinforcement fine-tuning, like, it is the one producing the chain.

And so, like, if it gets to the right answer, it means that chain was useful, like, in practice. And that is true for, like, every single time that it gets reward. It's because, like, it managed to get there. So, it's, like, it's pretty different. I'd say, like, yeah, SFT with, um, reasoning is, like, what we started to see a little bit, like, maybe a year ago.

Um, but reinforcement, uh, learning is really how you get, like, high-quality learning -- yeah, like, chains of thought. Uh, yeah. Yep. Oh, sorry, somebody has a mic. Yeah. Here we go, yeah. Um, thanks, Ilan. Could you go back to the slide, um, where you showed the inputs to the RFT?

Yeah, this one. Um, this one. This one. Does it, um -- so for the calculate output, can I, um, basically, instead of just, like, a lookup into a dictionary, can I have functions there? Like, so, can I pass in, uh, a function to check if the code runs, and output a number, like, a one or a zero, to see if it runs, for example?

Um, can you, can you say it again? So, if, um, I'm just thinking, um, one of the guys in the team is fine-tuning a model at the moment to write, uh, Triton code, and one of the checks he has is, like, does the code actually run? Like, is it valid Python?

Yeah. Is it possible to encode that as a grader? You could do that in a hacky way. Yeah. Right? So, you could use, like, a, um, a Python grader and inject the model response. Like, just do, like, um, eval or, like, exec, which, if you've come to my previous AI engineer talk, I did the same thing.
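As a rough sketch of that hacky approach (the grader fields and the grading entry point here are from memory, so verify them against the grader docs before using anything like this):

```python
# A Python grader whose body exec()s the model's output and returns 1.0 only if it
# runs without raising. It runs in the sandbox: no internet, limited preloaded libraries.
python_grader = {
    "type": "python",
    "name": "code_runs",
    "source": '''
def grade(sample, item):
    code = sample["output_text"]
    try:
        exec(code, {})          # careful: this is exactly the "don't do this" part
        return 1.0
    except Exception:
        return 0.0
''',
}
```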

Don't do this. But you could, you could, uh, get it to work. Yeah. Uh, if it has the right libraries and doesn't need internet access, which it might not. Yeah. And, um, then on the weights you have there, does the magnitude matter, I guess, of, like, what you're calculating?

Like, should you, like, aim for it to sum to, to one, or, or hundred, or, like, it doesn't really, like, match a massive difference? I don't think it matters. Yeah. Yeah. Um, it, asterisk in the way that, like, it, it doesn't really matter. Asterisk. You can look into the asterisk.

Yeah. Yeah. Cool. Thanks. Um, I want to do some in the back because I think I've been missing some in the back. Yeah. Yeah. You, you mentioned that it wouldn't, the RFT wouldn't work for multi-step, uh, agent stuff. Could you talk a little bit more on, like, why? And if, if you, if OpenAI were to make it work for that, what that would look like?

I just kind of didn't understand. Yeah. Yeah. So, so the reason I said that, um, is because right now it only does a single turn and only evaluates the output of the model. If the model, say, like, calls a function, there's no way to provide it the result and let it keep going.

Right. And so that's really what you want for, like, an agentic evaluation loop, which RFT does not currently have. Um, I think that would be awesome if, like, we did have a way to, like, um, provide the output of the model and, like, do this kind of more end-to-end training.

Um, but this is what we have right now. Gotcha. Yeah. Thank you. I want to do in the back first a little bit. Yeah. For, for the HydroBond example with RFT, you can do a generic string check, but you can inject more domain area as a type of balance files if it's only using, like, chemical examples.

You know, when do you need to design that kind of custom grader, and when is a generic grader sufficient? Yeah. So the beginning of the question was: in the hydrogen bond example we used these string checks, so when is it worth using more complex graders?

I think the answer is: wherever you think it would benefit. Is it a more accurate score? Is it a higher-signal grade? Because if the answer is yes, you might actually just get better results. The training is going to maximize whatever the grader says is good.

And so if you can better tune what that means, what "good" means, you will just get better results. Yeah. Or does someone else have the mic already? Yeah. Can you tell us more about the RFT algorithm, or some details about it? Yeah, yeah, here, just come here.

No, I'm kidding. I mean, I obviously can't share the exact details, but it is fundamentally a very similar algorithm to the one we used for o1 and o3. This reinforcement learning with language models is exactly how we built our reasoning models.

And we are putting out as much as we can. So you can actually get really, really impressive results if you have clean data in your scenarios. Yeah. I really wanted to come with a lot of really good examples.

The reason I couldn't bring many is, A, a lot of them are proprietary and I can't share them, and B, I wasn't able to do many myself, because I don't have a lot of really good data to work on. Yeah. So, back there. So, our time is our most precious resource.

How do you determine when fine-tuning is worth it, versus integrating a new model or getting better data? Yeah. Maybe this is a bit of a hot take, but I would avoid fine-tuning until you need it. If you can get away with prompting, you should just use prompting.

And the way you actually know -- sorry, I blanked for a second -- is that you really want to evaluate. If your evals are showing that you're reaching the limits of what you can accomplish through prompting, but you know it's possible to do better, that's where you pull out fine-tuning.

If you have the data, if you have everything set up. It is a big upfront cost, right? But if you're in a space like legal or the hard sciences, it can really be worth it to do something like reinforcement fine-tuning, because the results you get are impossible to achieve otherwise.

I have a question. What's happening on the OpenAI platform side when we do fine-tuning? Is it generating a new set of weights, and then when you actually go to run inference, does it load those weights into memory and run on that? How does that work?

Yeah. So it is generating new weights. It's generating a LoRA component, this little piece that we use alongside the model. The rest of the model's weights stay the same. And so when you run inference, it loads in that part of the weights and uses it.
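A minimal sketch of the idea behind that, using standard LoRA math rather than anything specific to OpenAI's internals: the base weights stay frozen and a small low-rank update is added alongside them.

```python
import numpy as np

# Standard LoRA sketch: y = x @ W + (x @ B) @ A, where A and B are tiny
# compared to W. Only A and B are produced by fine-tuning.
d_model, rank = 1024, 8
W = np.random.randn(d_model, d_model) * 0.02  # frozen base weights
B = np.zeros((d_model, rank))                 # trained adapter, d_model x rank
A = np.random.randn(rank, d_model) * 0.02     # trained adapter, rank x d_model

def forward(x: np.ndarray) -> np.ndarray:
    # Base path plus the low-rank adapter path.
    return x @ W + (x @ B) @ A

y = forward(np.random.randn(1, d_model))
```

Only the small A and B matrices are specific to your fine-tune; the shared base weights are untouched, which is why inference just needs to load that extra piece.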

Yeah. I had a question on DPO. How bad should your bad examples be? With a joke, should it be just barely bad, non-obvious, so you're squeezing out nuance? Or should it just be really bad? I think you sort of answered your own question, right?

If you make it too bad, then you're teaching the model how to go from really bad to good, which might not be what you want. You might want to teach it how to go from mediocre to really good. So you want to mimic the kind of contrast you'd actually expect to see.

Or think about what that delta is. I actually haven't heard of many people doing synthetic DPO with generated bad examples. Most often it's used on A/B tests, because the signal from an A/B test is real and pure, right?

The signal from synthetic DPO is only as good as your imagination, and your intuition. So the best thing would be to actually have people sit in front of it. I was thinking about doing this: sending everyone a little website and having you label a bunch of examples, which would give preference pairs roughly like the sketch below.
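A rough sketch of what one labeled preference pair could look like as a DPO fine-tuning record; the field names (input, preferred_output, non_preferred_output) follow the preference format as I understand it, so treat the exact schema as an assumption and check the fine-tuning docs.

```python
# Hypothetical DPO preference record: the non-preferred output is mediocre
# rather than absurdly bad, matching the "mediocre to really good" advice.
dpo_record = {
    "input": {
        "messages": [{"role": "user", "content": "Tell me a short joke about coffee."}]
    },
    "preferred_output": [
        {"role": "assistant", "content": "I told my coffee I needed space. Now it's a cold brew."}
    ],
    "non_preferred_output": [
        {"role": "assistant", "content": "Coffee is funny because it wakes you up."}
    ],
}
```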

But then I realized that even if we did that, I wouldn't be able to train it live. Still, that would be good, real signal, perfect signal for DPO. Yes? You're gonna have to yell. So, I'm kind of curious why RFT doesn't work for non-reasoning models.

Say I wanted a model that produced output with precisely 30% emojis. I don't think I could do that with DPO, or could I? So the question is why RFT doesn't work with non-reasoning models. What was the bit about emojis? I missed that, sorry. If I wanted to get a non-reasoning model to output exactly 30% emojis, I don't think I could do that with DPO, correct?

Exactly is tough, but I would bet you could get somewhat close. I mean, it depends; is it a very specific, constrained example? But the main reason reinforcement fine-tuning doesn't work with non-reasoning models is simply that we don't support it.

Yes? Yeah. I have a couple of questions. First, do Custom GPTs use one of these fine-tuning methods? Custom GPTs? No. No, it's just a prompt. Okay. And the second question is that for certain applications it's also important to have good thinking, a good chain of thought, you know?

Yeah. But it looks like there's no reward or penalty for that. Yeah, so we released a paper recently about why you should avoid directly rewarding or penalizing specific behaviors in the chain of thought: if you don't touch the chain-of-thought portion and only put pressure on the results, you essentially get faithful chains of thought.

You can actually see something hopefully close to what the model is really thinking through. So if it's doing something you might not want, it'll be visible. But if you, for example, inspect the chain of thought and penalize it whenever it starts thinking about bad things, it turns out you reduce the percentage of bad things by a little bit, but you reduce the cases where you can catch them by looking at the chain of thought to essentially zero.

So it's not a good idea to do it. There's a paper on this that we released; you should look it up. So, the next question: if you have really good training data, then you can just distill it into any small model, even open-source ones, right, I believe.

So with supervised training, and you also showed DPO, if you have really good quality data and things are deterministic, especially the questions, then I believe there's no point in going to bigger models like OpenAI's, right? Any open-source model would do it. What do you think about that?

Well, if it's doing what you want, that's awesome. Go with whatever gets you the performance you're looking for. I think in practice the real consideration is: do you want to fine-tune an open-source model and deal with everything that involves? If you have the data for it, that's amazing.

If an open-source model does what you want, use that. But if the choice is between fine-tuning an open-source model, with all the pitfalls I just talked about, versus just asking GPT-4o with no fine-tuning, one of those is going to let you move faster.

You can always start with a bigger model and then tune a smaller one, right? So I'd say always start with the biggest, most expensive thing that will make your thing work, and then work on getting more efficient. Yeah. Given the knowledge you shared with us today, can you give us a little insight into why o3 scores higher but hallucinates more than o1?

In general? Why o3 hallucinates more than o1? Yeah. You have that in the system card, right? Yeah, yeah. I mean, I don't have a reason I can give you. I think that's something we're investigating. Obviously it's something we want to avoid, but, yeah.

I don't think we have a super solid understanding we can share about why that happens. And speaking for myself, I don't know. Yeah. Hi, thanks a lot, this is great. It seems like the hardest thing to do for reinforcement learning is the reward modeling, or defining the reward functions.

Do you have any heuristics, tips, or resources on the number of reward functions, or the type of problems it suits or doesn't suit? I just feel like there's not much guidance on when to reach for reward functions; it almost feels like a solution in search of a problem, you know?

I don't know when I would do this, when I would use reinforcement learning and find five perfect reward functions that model some behavior I want to capture. Yeah. I think it's not for the beginning of the problem, right?

Let's say you already have something in place, it's already working, and you're wondering whether you could do even better. The fact that you're asking whether you can do even better means you already have a way to evaluate it and a way to grade it. And that's the beauty of it: by the time you're asking yourself whether you should use reinforcement fine-tuning, you should already have all the data ready, or close to ready, because all you need is whatever eval setup you have.
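As a sketch of that reuse, here is roughly how an existing LLM-judge eval could be wrapped as a 0-to-1 reward signal. The judge prompt, model name, and parsing are placeholders for illustration; in managed RFT the equivalent would be a model-based grader rather than client-side code, but the shape of the signal is the same.

```python
from openai import OpenAI

client = OpenAI()

# Sketch: reuse an existing LLM-judge eval as a reward in [0, 1].
def judge_reward(question: str, answer: str) -> float:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[
            {"role": "system", "content": "Score the answer from 0 to 10 for correctness. Reply with only the number."},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
    )
    try:
        score = float(resp.choices[0].message.content.strip())
        return max(0.0, min(1.0, score / 10))
    except ValueError:
        return 0.0  # unparsable judge output counts as zero reward
```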

Yeah, so I have a good LLM judge that I'm using for evals, and then I take that LLM judge and use it as a reward function. You can. To fine-tune. You can, yes. Okay, so that's the kind of path: start off by building a good LLM judge that captures the task you want to do well on, and then use RL against it to make the model better.

Yeah, I mean, that's one way for sure, and if that's the path that works for you, that's great. I wouldn't say there's necessarily one path. Okay. So, actually, in the crowd is Teo. Can you wave? He wrote the cookbook that I based a lot of these examples on.

You should check it out. What's it called? Can you yell out the name? I think it's Exploring Model Graders for Reinforcement Fine-Tuning. Yeah, yeah. Incredible cookbook. You should all check it out. It goes into the nitty-gritty of how to do each of these different parts.

It talks about the different graders, all the different pieces. So you should really check out that cookbook, and the function calling one if you're curious. I'll see if there's a way for us to send it out. Maybe I can just pull it up real quick.

I think it's this one. Yeah. So you should really check this out. It's long, it's comprehensive, and it's really good. Cool. Any... Yeah, I had a question here. Does the fine-tuning API support different modalities as well, or is it just text? Oh, we do have image fine-tuning.

I forgot. Yes. Wow. Okay. We have image fine-tuning, so you can do image input with, I believe, text output. So it's really good for things like bounding boxes. Wow, thank you. It's like you're a plant. Stand up. Yes? So how do you teach a model new knowledge, or domain knowledge?

And how does that work with reasoning? One example: if I have my personal experience, my personal preferences, should I do fine-tuning, or should I put that in context? I would say always try starting by putting things in context.

Fine-tuning you want to leave for a later stage, once you already have some data to show one way or the other. For giving models new knowledge, you have all the different ways you can do RAG and search.

Technically, when you fine-tune, you can put small amounts of new information into the model. But in general you should treat fine-tuning as teaching a methodology. If you think of an algorithm as data, in that it takes some bits to describe the procedure, that's the kind of thing you're teaching.

But you can't really give it too much data that way. It can learn a bit, though; SFT does start to memorize some things you give it. I just wouldn't say it's the best way to give the model correct, referenced information.

I'd focus it more on formatting. There are uses, though. About two years ago, when we released fine-tuning for GPT-3.5, one interesting thing you could do was improve retrieval. With embedding search, you need to embed the query.

Sometimes you first want to transform the user's query into something that looks more like your search results; I think that's called hypothetical document embeddings. You can fine-tune a model to essentially hallucinate that better. If you train it on actual passages from your knowledge base, it's more likely to output something in distribution for that knowledge base, so you get more direct mappings, if that makes sense.
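A minimal sketch of that query-rewriting pipeline; the model names are placeholders, and the fine-tuned rewriter is an assumption rather than something trained here.

```python
from openai import OpenAI

client = OpenAI()

# HyDE-style retrieval sketch: rewrite the query to look like a knowledge-base
# passage, then embed the rewrite instead of the raw query. A fine-tuned
# rewriter trained on real passages from your corpus could replace the base
# model below; the "ft:..." name is just a placeholder.
def search_query_embedding(user_query: str) -> list:
    rewrite = client.chat.completions.create(
        model="gpt-4o-mini",  # or e.g. a fine-tuned "ft:..." model
        messages=[
            {"role": "system", "content": "Write a short passage, in the style of our docs, that would answer this question."},
            {"role": "user", "content": user_query},
        ],
    ).choices[0].message.content

    emb = client.embeddings.create(model="text-embedding-3-small", input=rewrite)
    return emb.data[0].embedding
```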

But you're not really teaching it much new information; it's just helping it find things, giving it better intuition for finding them. Yeah, so for a reasoning model, if I want it to reference custom definitions or custom knowledge in a certain domain, is it good to combine that with RAG, generate that kind of training data, and train with RL? Or, in general, how do you solve that problem?

For that case, I would just let the model do search: give the model some search functions. If you have explicit search steps that you do beforehand, you can maybe fine-tune specific parts. But because search is inherently a two-step process, searching and then interpreting, you can't do that with reinforcement fine-tuning right now.
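For reference, "give the model some search functions" just means exposing search as a tool it can call. A minimal sketch below; search_docs and its implementation are placeholders for your own retrieval layer.

```python
from openai import OpenAI

client = OpenAI()

# Sketch: expose a search function as a tool; the model decides when to call it.
tools = [{
    "type": "function",
    "function": {
        "name": "search_docs",  # placeholder tool backed by your own retrieval
        "description": "Search the internal knowledge base and return relevant passages.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "What does our internal style guide say about units?"}],
    tools=tools,
)
# If resp.choices[0].message.tool_calls is set, run the search, append the tool
# result as a "tool" message, and call the API again with the full history.
```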

I see. Thank you. Yeah. So for your email classifier, did you do the reinforcement learning just to show us a bad example? No, no, I really hoped it would work. Oh, okay. That was me learning. Well, it's so ambiguous, right?

And then you were like, well, don't be ambiguous with it, and I'm like, okay. Yeah, thanks for calling me out publicly in a talk, but keep going. Well, I'm not trying to do that. I'm joking, I'm joking. For the supervised fine-tuning, what would the correct answer have been there, then?

Like, supervised fine-tuning, given your dataset? So what I wanted to try, and maybe if we have enough time I'll do a little bit of this, is that there are a lot of nuances around my email preferences that are extremely low signal.

I might see a pattern once. I might get two very similar emails from two very similar people or sources, and one I decide to glance at and one I decide to ignore. How is the model going to learn that? What I was imagining is: if you find cases that are similar, and you could do that with embeddings, but that have different results, you might be able to give them to a reasoning model and say, figure out what's different here, and then make a note of it.

And then you can compile this set of principles, essentially, that hopefully guide what it does. This is what I would consider almost prompt tuning, where you're finding the right context to provide the model based on a lot of examples. We can try it in a little bit.
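A sketch of that pairing idea, assuming a toy list of labeled emails: find nearest neighbours by embedding, keep pairs whose labels disagree, and ask a model to write a one-line note explaining the difference. The data shape and the helper are hypothetical.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# emails: list of {"text": str, "action": str}, e.g. action in {"ignore", "glance"}.
def mine_principles(emails: list, top_k: int = 3) -> list:
    vecs = client.embeddings.create(
        model="text-embedding-3-small",
        input=[e["text"] for e in emails],
    )
    X = np.array([d.embedding for d in vecs.data])
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T  # cosine similarity matrix

    notes = []
    for i, e in enumerate(emails):
        # nearest neighbours, skipping the email itself at position 0
        for j in np.argsort(-sims[i])[1 : top_k + 1]:
            if emails[j]["action"] != e["action"]:
                prompt = (
                    "These two emails look similar but were handled differently.\n"
                    f"A ({e['action']}): {e['text']}\n"
                    f"B ({emails[j]['action']}): {emails[j]['text']}\n"
                    "In one sentence, what preference explains the difference?"
                )
                note = client.chat.completions.create(
                    model="gpt-4o-mini",  # placeholder; a reasoning model also works
                    messages=[{"role": "user", "content": prompt}],
                ).choices[0].message.content
                notes.append(note)
    return notes
```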

That would be cool to see. I feel like fine-tuning the embedding model itself would almost be critical then, right, for an individualized... Why? How so? Because embeddings naturally don't understand the nuance of how you... Oh, no, no.

I'm not saying to use embeddings to find the nuance. I'm saying to use embeddings to find similar ones. Right, but how would it know what's similar based on your preferences? Not based on my preferences; based on whatever the embeddings consider similar.

Finding similar in the naive sense, then showing the model naively similar cases that are actually different, and saying: look, the embedding model thought these were similar, but they're actually profoundly different for some nuanced reason, figure out what that is. Does that make sense? Yeah. I feel like people kind of want to see me try this. Okay. So, I did start trying this. I'm also going to try not to flash too much of the data, because these are my actual emails.

Okay. So, I have the email dataset, I'm loading it, I have this eval function, I have my instructions, and then I made a little helper: essentially a fancy evaluation loop where you can define an evaluator. Where's my code?

Oh, yeah. Sorry. The eval helper takes in a run function that just returns a result; this is where I'm going to pass the model. It takes an evaluate function, and then a dataset of records. It's just a generic version of the other function I implemented.
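The helper being described could look roughly like this; the names and signatures are a reconstruction, not the actual notebook code.

```python
from typing import Callable, Iterable

# Generic eval loop sketch: run_fn produces a prediction for a sample,
# evaluate_fn scores it against the sample's label, and we report the mean.
def eval_model(
    run_fn: Callable[[dict], str],
    evaluate_fn: Callable[[str, dict], float],
    dataset: Iterable,
) -> float:
    scores = []
    for sample in dataset:
        prediction = run_fn(sample)
        scores.append(evaluate_fn(prediction, sample))
    return sum(scores) / max(len(scores), 1)

# Usage sketch for the email task, where sample = {"text": ..., "action": ...}
# and classify_email is a hypothetical model call:
# accuracy = eval_model(
#     run_fn=lambda s: classify_email(s["text"]),
#     evaluate_fn=lambda pred, s: float(pred == s["action"]),
#     dataset=emails,
# )
```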

Cool. And let's see how it was doing. So I have this prompt, I try it with all the models, and what am I getting right now? For the mini model it's 58%, then 51%, worse, 56, 57. So this kind of goes to show there's not much signal to learn right now.

Right? It's close to random, or there's very little signal there. So let's think about this. How would you do this sort of prompt tuning, maybe even without the embedding models? Oh, I think I started writing an algorithm somewhere. Where is it? Ha, there we go.

Look at that. So this was the idea I had for prompt tuning: split the data into mini-batches, each of which has a training subset and a test subset.

Run a forward pass on the training subset, then reason over it to see how it could have done better. Take that, put it back in the prompt, run it forward again on the test subset, then update the prompt and keep going.
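As a sketch of that loop, reusing the eval_model helper sketched earlier; classify_email and extract_insights are hypothetical helpers standing in for the notebook's model call and insight-extraction prompt.

```python
import random

def prompt_tune(emails: list, base_prompt: str, rounds: int = 5, batch: int = 20) -> str:
    notes = ""
    for _ in range(rounds):
        mini = random.sample(emails, batch)
        train, test = mini[: batch // 2], mini[batch // 2 :]

        # Forward pass on the training half with the current notes in the prompt.
        results = [
            {"sample": s, "prediction": classify_email(s["text"], base_prompt + "\n" + notes)}  # hypothetical
            for s in train
        ]

        # Ask a reasoning-capable model to turn the misses into general guidance.
        notes = extract_insights(results, previous_notes=notes)  # hypothetical helper

        # Check whether the new notes actually help on held-out samples.
        score = eval_model(
            run_fn=lambda s: classify_email(s["text"], base_prompt + "\n" + notes),
            evaluate_fn=lambda pred, s: float(pred == s["action"]),
            dataset=test,
        )
        print(f"held-out accuracy with current notes: {score:.2f}")
    return base_prompt + "\n" + notes
```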

Does this make sense? I'm kind of coming up with this on the spot, so we can see how it goes. Before I jump in, I want to make sure there are no more questions, in case I lose myself here. No? Okay. People want to see me try?

Okay. So what are we going to do? Let's do a training loop. Should we just do this? Let's do one run, right? We have some mini-batch, and then we split it into train and test. Now, if we have that, I want to run a forward pass.

I'll do this in a lazy way right now: for each sample in train, collect results. Okay. Sorry, the model was so fast. Cursor is amazing. Okay, let's pretend that's correct. So we get train, we get test.

We sample from train. We should also have some labels. Should we print one of these? Let's do this more precisely: the samples have an action and a text, so for each sample we want to run the model on it. Does the model take in the whole thing?

Yeah, it takes in a sample. Okay. Let's start here: process a mini-batch of, I don't know, 20. Do I want to print this? Well, let's see if this runs correctly, if this works. Okay, I'll risk it.

This is one of the worst ideas I've had. I didn't know you could do top-level awaits in notebooks, by the way; it's really nice. Yeah, this should be in parallel; I'll just do 10 for now. Was it happy? Okay, so it's just the results.

So maybe I want to append the sample, and then the model output. Okay. Wow, very topical. And then, what did it say? What was the output, was it ignore? That feels correct. Okay, so now we have... oh, and then what was the actual label? Okay, wait.

Let's print it, let's dump it, right? Uh-huh. We have some blank inputs. Okay, no, this is good, right? Let me just quickly... there are a lot of blank ones. Hmm. I don't know what it's doing.

I don't know what it's doing. Whoa, whoa, whoa, why is it still running? Oh, this is so much more fun with people debugging with me. So how should we do this? I guess we could just give the entire mini-batch to a model.

That's the easiest thing, and then just say: think about what you got wrong and give me some notes. So we'll have results, and then we can make an extract-insights step. Okay.

Any prompt engineers want to shout out some prompts? Let's see. So we have the model, we have the results. We'll say: you are provided a set of results from a forward pass of a model; given these results, your task is to... let's see. It doesn't need to know the format.

Not that fast. You know, there's an app that's really good at making prompts for you: ChatGPT. Yeah. Well, okay, here's the funny part: I actually helped create that prompt generator. I just don't want to go that far, because then it's more for me to read.

Okay: generalize to emails outside of this batch of results, avoid duplicates, in general the output should be at most... what's the last one? Oh, do not... I'll do this here: make sure the insights do not overfit to the current batch of results. Should it be a JSON array? I'll keep it as a simple string for now.

Okay, so now we have that. Let's see how it does. What did I call it? Extract insights. Maybe I'll give it reasoning, I'll do it like that, and then results. Okie dokie. How do we feel about this prompt? I don't know. I don't think this is going to work, but it'll be an interesting experiment. Okay. I'd say this is too long.

But, like, it'll be an interesting experiment. Okay. Okay. I'd say this is too long. Process. process. Okay. Okay. Okay. So, I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. You're not going to be able to do it.

Okay. It's going to be able to do it. It's going to be able to do it. It's going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. Okay. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it. I'm not going to be able to do it.

I don't know. What do people think? What's going to happen? Okay. Very good call: we could just make the notes this. Maybe I should print out what the notes are each time. Okay. So, while this runs, I think the idea here would be... oh.

That's pretty surprising; we're already doing pretty well. So, yeah, I have zero confidence this particular run is going to work. But I think this style of approach can work, right, where you can take these mini-batches, maybe subsections, and you can do this in a tree-like structure as well, then let the model reason over the failures and produce these insights.

Then you have a choice about how you want to aggregate those insights. Here, I'm just replacing the notes within each mini-batch, and the plan was to append the notes from all the different mini-batches together, put them in the prompt, and see if that performs better.

Why is it stopping? Thank you. I'll leave this up in case anyone sees any other mistakes; that's good to have. But yeah, this is the skeleton of something that could be very interesting, obviously with a better prompt, obviously with better pieces. I don't know.

Any questions, ideas, thoughts? If not, I might just call it after we see what the results are. For this? I'm making this up in front of you. Anything, yeah. Any examples that have worked for you? Can I share examples that have worked for this technique?

Well, so you're asking... are the numbers moving? Not significantly. Okay, I'm going to call it. This doesn't work right now, but it's good enough to serve as an example. To answer your question about good techniques for prompt engineering, if I can share some:

things have just been changing so much, and there are many, many prompt engineering guides. I guess you're asking both about prompt engineering and maybe also about something like this. For prompt engineering there are so many resources; I'd say the biggest thing is just be clear and don't have any contradictions.

And then few-shot examples are really, really good. For something like this, I'm making it up as we go. Again, I think the shape of it can make sense, but I would have been very surprised if I'd gotten good results on the first try.

Maybe I'll stay a few minutes after and try to get this working. But yeah, question back there? Can you explain what you're doing? Yes, so the question is: can I explain what the hell I'm doing here. The idea is that instead of tuning the model itself, I'm tuning the prompt.

Specifically, I'm tuning a subsection of the prompt that I'm calling notes, which hopefully contains jottings about the emails it has seen, and anything it found unintuitive about classifying them, things that maybe should go in the prompt. The hope is that as it sees more emails, with what it actually classified them as versus what it was supposed to, these reasoning models can extract explicit insights like: oh, maybe Ilan wants to glance at a "your package has been delivered" email.

This is what I mean about my data being very noisy. I might want to glance at that instead of archiving it; oh, I said ignore. But the question is: why, right?

I didn't say why. I'm just saying what the correct answer is. And this is why I thought RFT might be good, but this task is way too noisy and subjective. That doesn't mean you can't use reasoning models to still extract interesting insights, though. So the idea is to give enough examples of the model's own successes and failures to a reasoning model, and have it come up with notes or techniques the model can use to do better next time.

Yeah, so the question is: can I use this to identify pieces of information that are missing? Yes, absolutely. And I think that's very important, even if you do want to fine-tune, but also if you don't; this approach can tell you a little bit about your data.

It's like having someone read through it and write these things down. If there are more questions, I'm happy to keep answering. Otherwise, I'm going to officially call this. You have now seen two failures today, but that's okay, we learned from them. So, thank you.

We'll see you next time.