
LLM Quality Optimization Bootcamp: Thierry Moreau and Pedro Torruella


Transcript

So welcome, everyone, and thanks for making it to this lunch and learn. My goal today is to share my knowledge and experience on LLM fine-tuning. Just to get a quick poll from the audience: how many of you have heard of the concept of fine-tuning? Okay, quite a few people. How many of you have actually had hands-on experience fine-tuning LLMs? All right, that's pretty good; that's more than I'm usually used to. It's fantastic that at this conference the makeup of AI engineers is close to 100 percent. That's not something I'm generally used to when presenting at other hackathons and conferences, so I feel like I'm speaking to the right crowd.

To contextualize this talk, I'm trying to address two pains that a lot of Gen AI engineers face, and to get a sense of where you are in your journey: how many of you can relate to the first one, "my Gen AI spend has gone through the roof"? Okay, yeah, all right. And how many of you are in this other segment of the journey: you've built POCs, they're showing promise, but you haven't quite met the quality bar to go to production? Can I get a sense? All right, so I think a good fraction of the audience can relate to one of these two problems. Myself, I'm a co-founder at OctoAI, and I'm going to talk a little more about what we do, but the customers I've been working with feel those pains in a very real way: we're talking about tens of thousands, if not hundreds of thousands, of dollars in monthly bills, and perhaps even having issues going to production because the quality bar hasn't yet been met.

The overview of this 15-minute talk: we'll spend time understanding the why of fine-tuning, really trying to understand when to use it. It's not a silver bullet for all the problems you're going to face, but when used right, in the right context for the right problem, it can really deliver results. I'm also going to contextualize this notion of fine-tuning within the crawl, walk, and run of LLM quality optimization, because there are different techniques you should attempt before trying fine-tuning. Finally, once you're convinced this is the right thing for you, I'm going to talk about the continuous deployment cycle of fine-tuned LLMs. We're going to go through a whole crank of that wheel today: dataset collection, model fine-tuning, deployment, and evaluation. Really, I'm trying to demystify this whole journey, because in the next 15 minutes we're actually going to go through the entire process, and hopefully it's something you'll feel comfortable applying to your own datasets and your own problems.

To illustrate today's use case, we're going to use personally identifiable information (PII) redaction. That's a pretty traditional data-scrubbing type of application, but we're going to use LLMs, and we're going to see that we can achieve state-of-the-art accuracy while keeping efficiency at its highest, using very compact, very lightweight models that have been fine-tuned for that very task.

Again, to motivate this talk: what limits Gen AI adoption in most businesses today? Based on the conversations I've had in the field with customers and developers,
the first is the limited availability of GPUs. I think we're all familiar with this problem; it's one of the reasons Nvidia has been so successful lately. Everyone wants access to those precious resources that allow us to run Gen AI at scale, and that can drive costs up, so we have to be smart about how we use those GPU resources. And also, when people build POCs, they show promise, but sometimes you don't reach the expected quality bar to go to production. So on this chart, where the Y axis is cost and the X axis is quality, many people start at the green cross in the upper quadrant: very high cost, maybe not having met the quality bar. That's your first POC. But to go to production you need to end up in the opposite quadrant: lower cost, higher quality, where you've met the bar and you're able to run this in a way that is margin positive. Many of us are on this journey to reach that point of profitability.

So we're going to learn today how to fine-tune an LLM. Fine-tuning is the method we'll use to improve LLM quality, and as a bonus we're also going to show how to improve cost efficiency significantly. I use quality in the title of this talk because I think many of us AI engineers really care about reaching a high quality bar when using LLMs, and the goal of today's talk is to instill some knowledge on how to tackle this journey.

In terms of tools, we're going to use OpenPipe, a SaaS solution for fine-tuning that really lowers the barrier to entry for people to run their own fine-tunes; you don't need hardware or cloud instances to get started. We're going to use it to deliver quality improvements over state-of-the-art LLMs. And of course, since I work at OctoAI, I'm also going to use OctoAI for the LLM deployments; that's the solution we'll use to achieve cost efficiency at scale. The key here is to build on a solution that is designed to serve models at production-scale volumes.

To give you a little sneak peek at the results we're going to showcase today, after you go through this whole tutorial (and this is something you can reproduce independently; all the code is there for you), we're going to show that we can achieve 47 percent better accuracy on the task I'm going to showcase using OpenPipe fine-tuning, and by deploying the model on OctoAI we're going to achieve what seems kind of ridiculous: a 99.5 percent reduction in cost. This is really a 200x reduction in cost going from GPT-4 Turbo to Llama 3, mostly because this is a much smaller model, it's open source, and we've optimized the hell out of it to serve it cheaply on OctoAI. I'm going to explain how this is achieved, but I hope your interest has at least been piqued by results you can reproduce yourself.

So, when to use fine-tuning? Again, it's not a silver bullet for all your quality problems; it has its right place and time, so I like to contextualize it within the crawl, walk, run of quality optimization. As Gen AI engineers, many of us have embarked on this journey and are at different stages of it, and really, it should always start with prompt engineering. Many of you are familiar with this concept: you start with a model, you're trying to have it accomplish a task, and sometimes you don't see the results you expect, so you try prompt engineering. There are different techniques of varying levels of sophistication, and this talk is not about prompt engineering, but you can improve prompt specificity; there's few-shot prompting, where you provide examples to improve the quality of your output; and there's also chain-of-thought prompting. Some of you have probably heard these concepts. This is where you should get started: given the model, just try to improve the prompt to get the right results. (A minimal few-shot sketch, purely for illustration, follows below.)
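[Editor's note: as a quick illustration of few-shot prompting (this example is not from the talk's notebook), the idea is simply to prepend worked input/output pairs to the conversation so the model infers the task and the expected output format:]

```python
# Few-shot prompting: show the model worked examples before the real query.
messages = [
    {"role": "system", "content": "Extract the city the sentence refers to. Reply with the city name only."},
    # Two "shots" demonstrating the task and the expected output format:
    {"role": "user", "content": "I flew into Narita and took the train downtown."},
    {"role": "assistant", "content": "Tokyo"},
    {"role": "user", "content": "We watched the fog roll in over the Golden Gate Bridge."},
    {"role": "assistant", "content": "San Francisco"},
    # The actual query; the model should answer "Paris" in the same terse format:
    {"role": "user", "content": "The Eiffel Tower sparkled as we walked along the Seine."},
]
```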
Sometimes that's not enough, and there's a second class of solutions, which I like to map to the walk stage: retrieval-augmented generation (RAG). We've probably seen a lot of talks on RAG today and throughout this conference. Sometimes there are hallucinated results, sometimes the answer is not truthful. Why is that? Because the weights of the model, which are really the parametric memory of your model, are limited to the point in time at which the model was trained. So when you ask questions about data it hasn't seen, or information more recent than its training, it's not going to know how to respond. The key here is to provide the right amount of context, and this is achieved through similarity search, for instance in a vector database; through function calling, to bring in the right context by invoking an API; through search; through querying a database. This is something many of us engineers have been diving into in order to provide the right context and generate truthful answers: complement the parametric memory of your model with non-parametric information. That's RAG in a nutshell.

So you've tried prompt engineering, you've tried RAG, you've eliminated quality problems and hallucinations, but that's still not enough. What do you try next? Fine-tuning, I think, is the next stage. Again, I'm generalizing a very complicated and complex journey, but in spite of your best efforts, you've tried these techniques for maybe days, weeks, or even months, and you still don't get to where you need to be to hit production. That's the journey we're going to talk about today: fine-tuning.

So when should you fine-tune a model? Again, after you've spent a lot of time in the first two phases of the journey, on prompt engineering and on retrieval-augmented generation, and you don't see the results improve. Generally, what helps is using an LLM for a very specific task, something very focused: for instance classification, information extraction, formatting a prompt, or function calling. If you can narrow the use case down to something highly specific, then you have an interesting case for applying fine-tuning. Another requirement is to have a lot of your own high-quality data to work with, because that's going to be your fine-tuning dataset. It goes without saying, but a model is only as good as the data it was trained on, and we're going to apply that principle in this tutorial. And finally, as an added incentive, we're often driven by economic incentives in the work we do.
For those of you feeling the pains of high Gen AI bills, whether with OpenAI, a cloud vendor, or a third party, that is generally a good reason to explore fine-tuning.

Now that we've contextualized why and when to consider fine-tuning, we're going to go over all the steps in this continuous deployment cycle. It starts with building your dataset, then running the fine-tuning of the model, then deploying that fine-tuned LLM into production so you can achieve scale and serve your customer needs, or internal needs, at high volumes, and finally evaluating quality. And this is an iterative process: it's not a single crank of the wheel, it's not a fire-and-forget situation, because the data your model sees in production is going to drift and evolve. So this is something you're going to have to monitor; you're going to have to update your dataset and fine-tune your model again. I don't want to scare you away from doing this because it sounds fairly daunting: by the end of this talk we'll have gone through a full crank of that wheel, and hopefully, through the SaaS tooling I'm going to introduce you to, it's going to feel a lot more approachable, and I'll have demystified the whole process of fine-tuning models.

Let's start with step one, which is to build a fine-tuning dataset. The data the model is trained on should ideally be real-world data; it has to be as close as possible to what you're going to see in production. There's a spectrum of ways to build and generate a dataset. Ideally, you build a dataset out of real-world prompts and real-world human responses. For instance, in customer service, you've logged calls between a customer and an agent; you have an interaction between two humans, and that's a very good dataset to work with, because it's human-generated on both ends. This is very high quality, but not everyone has the ability to acquire this kind of dataset; sometimes you're starting from scratch, so not everyone has the luxury to start there. There's also an intermediary between real-world and synthetic, where you have real-world prompts but AI-generated responses. This is a good middle ground between cost and quality, because you're starting from actual ground truth derived from real data, but the responses are generated by a high-quality LLM, say GPT-4 or Claude. And OpenPipe is actually a solution that allows you to log the inputs and outputs of an LLM like GPT-4 to build your dataset for fine-tuning, so this is something a lot of practitioners use. Finally, there's the fully synthetic dataset, using fully AI-generated labels. Often, when you go on Hugging Face or Kaggle, you'll encounter datasets that have been built entirely synthetically, and that's a great way to get started on this journey. One of the datasets we're going to use today is from that latter category.

And of course, it probably goes without saying, but in case people aren't fully familiar with this notion: you want to split your dataset into a training and validation set, because you don't want to evaluate your model on data your fine-tune has seen. Many of you who are ML and AI engineers are already familiar with this, but I just want to reiterate that it's important. The validation set is used for hyperparameter tuning, and when you're deploying and actually testing on real-world examples, you want a third set, outside of training and validation, which is your test set. That's a good way to do it. (A minimal sketch of such a split appears below.)
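[Editor's note: as a small illustration of that three-way split (my own sketch, not from the talk's notebook; the dataset id is assumed from the demo section later), the Hugging Face `datasets` library makes this a two-step `train_test_split`:]

```python
from datasets import load_dataset

# Load the PII masking dataset from Hugging Face (id assumed from the talk).
ds = load_dataset("ai4privacy/pii-masking-200k", split="train")

# 80% train, then split the held-out 20% evenly into validation and test.
splits = ds.train_test_split(test_size=0.2, seed=42)
heldout = splits["test"].train_test_split(test_size=0.5, seed=42)

train_set = splits["train"]        # used for fine-tuning
validation_set = heldout["train"]  # used for hyperparameter tuning
test_set = heldout["test"]         # held out for final evaluation
```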
Now you've built your dataset and you're ready to fine-tune your model, and there are a lot of decisions we need to make at this point. The first one is open source versus closed source. So, just by a raise of hands: who here is using proprietary LLMs or Gen AI models today, from OpenAI, Anthropic, Mistral AI? Okay, a good amount of the crowd. Who here has been using open source LLMs like Llama, or some of the free Mistral AI models? Okay, maybe a smaller crowd, and maybe that's because these models are not as capable and sophisticated, but I'm going to walk you through how you can achieve better results if you do fine-tuning. Of course, the benefit of open source (and this is why, being obviously biased, I'm an open source advocate) is that you get ownership over your model weights. Once you've done the fine-tuning, you are the proprietor of the weights that result from this process, which means you can choose how you deploy it and how you serve it; this is part of your IP, and I find that's a great thing for anyone who wants to embark on this fine-tuning journey. With proprietary solutions, you're not quite the owner; you don't have the flexibility to go with another vendor or to host the models yourself, so you're somewhat locked into an ecosystem. Some people are comfortable with that, others less so. Many of the customers we talk to are very eager to jump on the open source train, but they don't really know how or where to get started, so hopefully this can help inform your first steps into the world of open source.

Then there's the question: do I use a small model or a large model? Even in the world of open source, you have models on the order of 8 billion parameters, like Llama 3 8B, and then you have the large models, like Mixtral 8x22B, a mixture-of-experts model with over 100 billion parameters. Very different beasts, and we're going to see even larger models from Meta. Generally, my recommendation here is: look, the large models are amazing, they have broader context windows and higher reasoning capabilities, but they're also more expensive to fine-tune and more expensive to serve, and typically, for deployment, you're going to have to acquire resources like H100s to run them. So generally, start with a smaller model like Llama 3 8B, and sometimes you'll be surprised by its ability to learn specific problems. That's my recommendation: start with a smaller Llama 3 8B or Mistral 7B, and if that doesn't work out for you, then move toward larger and larger models. Today we're going to use the Llama 3 8-billion-parameter model.

There are also different techniques for fine-tuning. I'm going to go over this fairly quickly, but there are two classes of fine-tuning techniques: parameter-efficient fine-tuning, which produces a LoRA, and full-parameter fine-tuning, which produces a checkpoint. A LoRA is much smaller and more efficient in terms of memory footprint; we're talking about 50 megabytes, versus a checkpoint that is 15 gigabytes. So you can guess that, because of its more compact representation, you're able to serve it on a GPU that doesn't require as much onboard memory.
You can even serve multiple LoRAs, meaning multiple fine-tunes, on one GPU for inference, as opposed to checkpoints, which require a dedicated GPU for every single fine-tune. So there's more flexibility in deployment, and we're going to use that today: we're actually going to serve these LoRAs, the result of parameter-efficient fine-tuning, on a shared-tenancy endpoint with other users who have their own LoRAs, all running on the same server, and that allows us to really reduce the cost of inference. There is a benefit to checkpoints and full-parameter fine-tuning, though, which is that there are more parameters to tune, so it's a more flexible fine-tuning technique; it allows the model to achieve better results on more demanding tasks like logical reasoning. But for very specialized tasks, which is what we're looking at today, like classification, labeling, or function calling, a LoRA is just fine, so we're going to use parameter-efficient fine-tuning.

Also, when you're doing fine-tuning, you have to decide: am I going to DIY it, or am I going to use SaaS? I'm sure some of you only like to DIY things, while others like the convenience of SaaS, and here I'm not going to take a side. There are some great tools right now to DIY your own fine-tuning, for instance the open source project Axolotl, and the creator behind Axolotl is actually at the conference, so you might be able to catch them. The challenge here is that you have to find your own GPU resources, you have to understand how to use these libraries (even though they're easier than ever to adopt), and you have to tune and tinker with settings and hyperparameters. (If you want a feel for what parameter-efficient fine-tuning looks like in code when you DIY it, there's a sketch just below.) Then there's SaaS, which really aims to make it easy to embark on this journey: companies like OpenPipe, and there are many folks from OpenPipe at this conference today, so if you can catch them, please do talk to them. They're trying to lower the barrier to entry for fine-tuning, to make it easy, bringing all the tooling and all the libraries together to make it as seamless as possible to, for instance, move from a GPT-4 model to a fine-tune with the fewest steps in collecting your data, fine-tuning, and so on. We're going to use SaaS today; if it fits how you work, you might want to start with SaaS and then evolve into DIY-ing it.

When it comes to deployment, you have to navigate the same options. Once you have a fine-tuned model, you need to decide how you're going to serve it, because you may need to generate thousands, millions, or billions of tokens a day, and so you need infrastructure, you need GPUs, you need inference libraries. Some people like to DIY it using libraries like vLLM, MLC LLM, TensorRT-LLM, or Hugging Face TGI; these may be things you've heard of, and they're all solutions for running models on your own infrastructure. But you need to provision the resources and build the infrastructure to scale with demand, and that can get tricky, especially achieving high reliability under load; that's a challenge many people face as they scale their business up. With SaaS, you can work with a third party like OctoAI (and obviously I'm a bit biased again, I work there, so I'm going to insert a shameless plug for OctoAI), which allows users to get these fine-tunes deployed on SaaS-based endpoints, very similar to the endpoints from OpenAI or Claude, if you're familiar with those, and it offers the ability to serve different kinds of customizations as well.
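[Editor's note: for the DIY-curious, here's a minimal sketch of what parameter-efficient fine-tuning means in code, using the open-source Hugging Face `peft` library. This is an illustration of the general technique, not what OpenPipe runs internally; the target module names are the usual attention projections for Llama-style models.]

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model (a full checkpoint: ~15 GB in fp16 for an 8B model).
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# LoRA: train small low-rank adapter matrices instead of all 8B weights.
config = LoraConfig(
    r=16,                 # rank of the adapter matrices
    lora_alpha=32,        # scaling factor applied to the adapters
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)

# Typically well under 1% of parameters end up trainable, which is why the
# saved adapter is tens of megabytes rather than gigabytes.
model.print_trainable_parameters()
```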
Very quickly, I want to go over the advantages of OctoAI here. First of all, you get speed: on the Llama 3 8-billion-parameter model you get around 150 tokens per second, and we keep improving that number, because we've been applying our own in-house optimizations to the model-serving layer. It also has a significant cost advantage, because it costs about 15 cents per million tokens, compared to, say, GPT-4, which costs 30 dollars per million tokens; that's where the 200x comes from. And we don't charge a tax for customization: whether you're serving the base model or a fine-tune, it's the same cost. There's customization, as I mentioned: you can load your own LoRA and serve it. And finally, scale: some of our customers generate up to billions of tokens per day on our endpoints (I think we're serving over 20 billion tokens per day overall), so we've focused and spent a lot of time on improving robustness. Also worth mentioning: if SaaS doesn't cut it for you, if you're working for a Fortune 500, a software company, a healthcare company, the banking sector, or government, and you need to deploy your LLMs inside your own environment, either on-prem or in a VPC, we also have a solution called OctoStack; come talk to us at the booth. So that's it for the shameless plug section.

Let's go over to section four, which is evaluating quality. We've talked about dataset collection, fine-tuning, and deployment; now, quality evaluation, and we could have an entire conference dedicated just to that. I'm going to try to summarize it into two classes of evaluation techniques that I've seen. First of all: can your quality be evaluated in a precise way that can be automated? For instance, you generate a program or a SQL command that can be run, or you label, extract, or classify information in a way that can be checked exactly; that's a pass-or-fail scenario. Or formatting the output into a specific JSON format: this is something you can easily test as a pass-or-fail test. (A minimal sketch of that kind of check follows below.) Then there's the softer kind of evaluation: for instance, if I were to take two answers and ask which output is written in a more polite or professional way, you can't really write a program to evaluate that, unless you're using an LLM, of course. Put yourself in a 2020 or 2021 mindset, before GPT was around: it would be hard to build a program that can assess this, so generally you'd need a human in the loop to say which of A or B is the better answer. Thankfully, today we can use LLMs to automate that evaluation, but keep in mind that if you're using GPT-4 to evaluate two answers, and one of them came from GPT-4, it might favor its own answer; people have observed that in these kinds of evaluations. This is a whole science; we could have a whole conference just on this. I just wanted to present the high-level guidelines of this whole cycle of deploying fine-tuned LLMs. And really, there is no finish line; that's what I want to convey to you all. Going through a single iteration is something you might have to do on a regular basis, maybe once a week, maybe once a year; it all depends on your use case and constraints.
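[Editor's note: as an illustration of the pass-or-fail style of check (my own sketch; the field names anticipate the redaction tool call defined later in the demo and are assumptions), a JSON-format test is just a parse plus a schema check:]

```python
import json

def passes_format_check(raw_arguments: str) -> bool:
    """Pass/fail eval: does the model's tool-call argument string parse as
    JSON and match the expected shape?"""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return False
    fields = args.get("fields_to_redact")
    if not isinstance(fields, list):
        return False
    # Every entry must carry both the matched string and its PII class.
    return all(
        isinstance(entry, dict) and {"string", "pii_type"} <= set(entry)
        for entry in fields
    )

# Example: a well-formed response passes, a malformed one fails.
good = '{"fields_to_redact": [{"string": "4111 1111 1111 1111", "pii_type": "CREDITCARDNUMBER"}]}'
assert passes_format_check(good)
assert not passes_format_check("not json at all")
```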
Now let's get a bit more practical and switch over to our demo. For those of you who came in a little late, there's a QR code here you can scan that points to our Google Colab, and we also have a Slack channel; let me see if I can pull it up. If you're in the Slack workspace for AI Engineer World's Fair, there's this LLM quality optimization bootcamp channel where you can ask questions if you want to follow along. We're going to try to get through the practical component in the next 25 minutes.

I just want to provide some context here. The use case is personally identifiable information redaction. We've taken this from a dataset composed by AI4Privacy called PII Masking 200K, one of the largest datasets of its kind. It has 54 different PII classes, so different kinds of sensitive data: a person's name, email address, physical address, credit card information, and so on, across 229 discussion subjects, which include conversations from customer ticket resolution, conversations with a banker, conversations between individuals, and so on. What this dataset looks like is as follows: you have a message, here something that looks like it came out of an email, that contains credit card information, an IP address, maybe even a mention of a role, anything that is personally identifiable, and I've highlighted those in red because they will need to be redacted. After redaction we should get text where this information is redacted and anonymized, but instead of just masking it, we're actually labeling what category the information belongs to: a credit card number, an IP address, a job title. This is how we're going to redact this text.

So where do LLMs come in? The way we'd use them is through function calling. Who here has used LLMs with tool calls or function calls? Okay, quite a few people. As many of us are aware, this is what powers a lot of agentic applications, so this is a great use case for people who want to do function calling and aren't seeing the results they'd like out of the box from, say, GPT-4. In this case we're actually going to see that these state-of-the-art models don't do all that well on fairly large and complex function-calling use cases. To achieve this redaction use case, we pass in a system prompt along with a tool specification. The system prompt says: look, you're an expert model trained to do redaction, you can call this function, and here are all the sensitive PII categories for you to redact. Then, as the user prompt, we pass in that email or message, and the output is a tool call: not the redacted text itself, but a call to that redact function that contains all the arguments for us to perform the redaction. Why do this, as opposed to spitting out the redacted text? It gives us flexibility in how we want to redact: we could choose to just replace the information with the PII class, we could completely obfuscate it, or we could use a database that maps each PII entry to a fake substitute, so we get an email that reads normally, except the credit cards, the names, and the addresses are all made up but always map to the same individual, which then allows us to do more interesting processing on our dataset. So that's why we're going to use function calling here. (A condensed sketch of the system prompt and tool specification is below.)

Now let's start building the dataset. I'm going to switch over to our notebook here. This notebook is meant to be fairly self-explanatory, so there's a bit of redundant context.
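[Editor's note: to make the shape of this concrete, here's a condensed sketch of the system prompt and the redact tool specification. The names and field names are a reconstruction from the talk, and the category lists are abbreviated where the real notebook enumerates all 54 PII classes.]

```python
SYSTEM_PROMPT = (
    "You are an expert model trained to redact personally identifiable "
    "information from text. Call the redact function on every PII entity "
    "you find. The PII categories are: FIRSTNAME (a person's first name), "
    "EMAIL (an email address), CREDITCARDNUMBER (a payment card number), "
    "IP (an IP address), JOBTITLE (a person's role), ..."  # abbreviated
)

REDACT_TOOL = {
    "type": "function",
    "function": {
        "name": "redact",
        "description": "Record every piece of PII found in the text.",
        "parameters": {
            "type": "object",
            "properties": {
                "fields_to_redact": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            # Exact text span to redact:
                            "string": {"type": "string"},
                            # One of the PII classes, provided as an enum
                            # (abbreviated here):
                            "pii_type": {
                                "type": "string",
                                "enum": ["FIRSTNAME", "EMAIL",
                                         "CREDITCARDNUMBER", "IP", "JOBTITLE"],
                            },
                        },
                        "required": ["string", "pii_type"],
                    },
                }
            },
            "required": ["fields_to_redact"],
        },
    },
}
```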
As part of the prerequisites, you're going to need an account on OctoAI and on OpenPipe, the tools we're going to use, and if you want to run the evaluation, also provide your OpenAI key, because we're going to compare against GPT-4. We install the Python packages, initially just openai and datasets from Hugging Face; you can ignore the pip dependency error that appears when you pip install datasets in a Colab notebook, we can get past that. You enter your OctoAI token and OpenAI API key at the beginning, which I've already done.

So we start with the first phase, which is to build the fine-tuning dataset. We have this PII masking dataset from Hugging Face, and you can see what it looks like: it has the source text (as you can see, these are snippets from emails, for instance), the target text that is redacted, and the privacy mask that contains each piece of PII and the class associated with it. So this contains all the data, the inputs and labels, that we need to build our dataset for fine-tuning. What we do is define our system prompt, which again tells the model: you're an expert model trained to redact information, and here are the categories, with a short description next to each explaining what it corresponds to. This is really the beauty of LLMs and natural language input: in the old world of PII redaction we had to write complex regular expressions, and here it's all done by providing a category and a bit of description, and the LLM naturally infers how to do the redaction. We also define the tool to call, essentially as a dictionary, a JSON object, and as you can see there's an array of dictionaries, each containing a string and a PII type; the string is the PII text, and the type is one of the categories that we provide as an enum. So right off the bat you can see this tool call is a pretty large function specification.

Now let's load our dataset from Hugging Face; it takes maybe a few seconds to load 200,000 entries. What I have in the next cell, after downloading this dataset, is what I use to build my fine-tuning training set. And here's the thing about fine-tuning: to build your dataset, you need to make it look as though you've logged conversations with an LLM. You're logging the prompts and the responses, because that's how you fine-tune it: you tell it, here's the input, with system prompt, tool specification, and user prompt, and here's the tool-call response I expect to see. This cell just sets things up so each training sample looks like a logged LLM exchange. We're going to build a 10,000-entry training dataset for OpenPipe, downloaded as openpipe_dataset.jsonl. Each line is, roughly, one logged request/response pair, along the lines of the sketch below.
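[Editor's note: this sketch is a reconstruction under the assumption that OpenPipe ingests OpenAI-style chat records; check OpenPipe's documentation for the exact import schema. `SYSTEM_PROMPT`, `REDACT_TOOL`, and `train_set` carry over from the earlier sketches, and the `privacy_mask` field shape is assumed from the dataset description.]

```python
import json

def to_training_row(sample: dict) -> dict:
    """Format one pii-masking sample as a logged chat exchange:
    the inputs (system + user + tool spec) and the expected tool-call output."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": sample["source_text"]},
            {
                "role": "assistant",
                "content": None,
                "tool_calls": [{
                    "type": "function",
                    "function": {
                        "name": "redact",
                        # Ground-truth labels from the dataset's privacy mask.
                        "arguments": json.dumps(
                            {"fields_to_redact": sample["privacy_mask"]}
                        ),
                    },
                }],
            },
        ],
        "tools": [REDACT_TOOL],
    }

# Write 10,000 training samples to a JSONL file for upload to OpenPipe.
with open("openpipe_dataset.jsonl", "w") as f:
    for sample in train_set.select(range(10_000)):
        f.write(json.dumps(to_training_row(sample)) + "\n")
```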
As I run the cell, it downloads the file from Colab. Now, switching over to OpenPipe, we're going to create a new dataset. Once you're on the OpenPipe console, you have a project (I've generically named mine "project one") and you can access Datasets. As you can see, I've already built a few datasets before, but if you're a first-time user, you won't see anything under Datasets yet. You can create a new dataset by clicking this button, and under Settings we can name it. I'm going to call it "lunch and learn", and today is June 26.

All right, so this is today's lunch and learn; I'm going to call this my dataset. Under General, I can upload the data I just downloaded from my notebook, openpipe_dataset.jsonl. This upload operation is going to take a few seconds or maybe a couple of minutes, because OpenPipe isn't just uploading the dataset; it's doing some preprocessing to split it into training and validation sets, and it's also formatting everything nicely so we can look into the dataset. You can see this little window showing that the dataset is uploading and being processed.

While this is happening (we've prepared our dataset and we'll take a look at it in a second, while it's being processed on OpenPipe), let's see how we're going to do the fine-tuning in the next stage. Once we have our dataset uploaded, we get this view showing every single entry, which we can peek into, and how it's split into training and test sets, generally a 90/10 split. From that UI we can launch a fine-tune, and this is where we get to choose our base model. We're going to choose a Llama 3 8-billion-parameter model with 32K context width, which is a fine-tune from Nous Research called the Theta model. You can see a price being estimated for this fine-tune. We have a substantial training set; it can range from hundreds of samples to thousands to hundreds of thousands, and the cost scales up as you feed in more training samples, but more data will improve accuracy. It gives an estimated training price of forty dollars. That might seem like a lot, especially when you're just tinkering with fine-tuning, but keep in mind that some of the people we work with tend to spend tens of thousands or maybe hundreds of thousands of dollars a month on Gen AI, so this is absolutely an upfront cost that will pay off. And I believe that on OpenPipe, when you get started, you get a hundred dollars of credit, which allows you to run some fine-tunes off the bat without necessarily having to pay.

So let's go over to OpenPipe. It's still uploading; I think maybe the network is a bit slow. We're going to start training at this point, and once training is done we'll deploy the fine-tuned LLM. What happens on OpenPipe is that you get an email when the training job is done. It can take a few minutes, so I'm going to pull a Julia Child here: I'm going to stick the turkey in the oven, and in the second oven I'll have a pre-baked turkey, just so we don't lose time. As you're going through this on your own, keep in mind it's going to take a little bit of time to kick off the whole fine-tuning process, but it's not that long, because you're training a fairly small model.

All right, this is still saving, so let's take a look at what we've done so far. We've built our dataset using a synthetic dataset from Hugging Face; we format each input/output pair from the dataset as logged LLM messages, stored as a JSON file that we upload to OpenPipe; and we produce 10,000 training samples. We're fine-tuning a model on OpenPipe.
OpenPipe uses parameter-efficient fine-tuning, which produces a LoRA, and we choose the Llama 3 8-billion-parameter model as the base; for deployment, we're going to use OctoAI.

Let's see: this didn't finish uploading, so I'm going to go into a dataset I uploaded a couple of days ago, just to show you what you should see in the user interface. As you peruse the training samples, you'll see an input column and an output column. On the left you have the input with the system prompt (as you can see, it's a big boy, because it has all these different categories it needs to classify), plus the user prompt, which is the message we need to redact, the tool choice, and the tool specification with all the different categories of PII types. The output is the tool call from the assistant's response, with the redact call and its arguments: the fields to redact, as a list of dictionary entries containing string and PII-type information. This is what we've passed into OpenPipe as our fine-tuning dataset.

This is still saving, so I'm just going to go ahead and go to the model. Once you have the dataset uploaded, you hit this Fine Tune button, which lets you launch a fine-tuning job. I can call it "blah", and this is where you select, under the dropdown, the model you want to fine-tune; this is again what we saw before, and the training size is substantial. I'm not going to hit Start Training, because I already have a trained model, but when you do, it kicks off the training, and when it's done you get notified by email.

Now let's fast-forward and assume I've already trained my model. I have this model here that's been fine-tuned from this dataset. I'll click on it: as we can see, it's a Llama 3 8B model, fine-tuned over these 10,000 samples, split into 9,000 training samples and 1,000 test samples. We can even look at the evaluation. Going back to the model: the nice thing is that it takes care of hyperparameters like learning rate and number of epochs; it figures them out for you, so you don't really have to tweak those settings, and I find that very convenient, especially for people who haven't yet built an understanding of how to tweak those values. And the beauty of using OpenPipe is that you can export the weights and be the owner of those weights. Remember, when we talked about open source: it's really important to own the result of the fine-tuning. You can download the weights in any format you want; you have LoRAs but also merged checkpoints, so you can have a parameter-efficient representation as well as a checkpoint. We've selected to export our model as an fp16 LoRA, which is what we'll use to upload our model to OctoAI, where we're going to deploy it. Now I can download the weights as a zip file (it's fairly small, only 50 megabytes), but I can also copy the URL, and that's what we're going to need in this tutorial.

To deploy the model, we copy this URL, and in the next cell I download the OctoAI CLI, a command-line interface for users to upload their own fine-tunes to what we call our asset library. This is a place where you can store your own checkpoints and your own LoRAs, not just for LLMs but also for models like Stable Diffusion, if some of you are developers who also work in the image-gen space.
We can serve these customized models on our platform, so we're going to upload this LoRA from OpenPipe to OctoAI. We log in just to make sure credentials are good, and here we have confirmation that our token is valid. In the cell, we have to replace the LoRA URL, from "SET ME" to the URL I just copied from Download Weights. Keep in mind it can take a couple of minutes for the link to appear, but once you have it (and again, I'm skipping ahead here, because when you run this on your own it may take a few minutes to run the fine-tune and a few minutes to download the weights, but everything I'm running here is exactly the steps you'll take yourself), what I'm doing is passing in this URL and setting a LoRA asset name in my OctoAI asset library, so I can create this asset from the LoRA, as a safetensors file based on the Llama 3 8B model, and name it.

Let's see: it seems something has failed here, so let's try to run it again. Usually that should have worked. What should happen at this point is that, once you've taken the URL of your fine-tuned asset, you should be able to host it in our asset library and then serve it to start running some inferences. So this LoRA upload step didn't quite work here. Pedro, are you able to double-check with product whether this capability is working? It isn't a good demo unless something fails; I tested it earlier today and it was working flawlessly. Let me list my assets so I can pull an old one. Pedro, can you tell me the command to list the assets? I think it might be "octoai asset list". All right, there we go. I'm going to pull from an asset that I uploaded earlier. I'm not sure why this didn't work, but I'll make sure it's working for you all to reproduce this step, and I'm going to set the LoRA asset name to this one. These other LoRAs were uploaded using the exact same steps as in this tutorial, so we'll get to the bottom of this and use the Slack channel for folks who want to run through this step.

Now I'm just going to run an example inference on this asset that I pulled from OpenPipe. Again, we have our system prompt, we pass in this email as our test prompt, and when we invoke the OctoAI endpoint we use the standard chat completions API from OpenAI. We pass in this OpenPipe Llama 3 8B 32K model, along with an argument for the parameter-efficient fine-tune, passing the LoRA asset name we just uploaded to the asset library. As we can see, the response contains the tool calls, with the call to the function that will do the redaction, so this is behaving exactly as we intended. (A sketch of that call, under stated assumptions, is below.)
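[Editor's note: a rough sketch of that inference call. The base URL, model id, and the parameter used to select the LoRA are assumptions from what's shown on screen, not a documented OctoAI API reference, so treat those names as placeholders; `SYSTEM_PROMPT` and `REDACT_TOOL` carry over from the earlier sketches.]

```python
import os
from openai import OpenAI

# OctoAI exposes an OpenAI-compatible endpoint; base URL assumed.
client = OpenAI(
    base_url="https://text.octoai.run/v1",
    api_key=os.environ["OCTOAI_TOKEN"],
)

test_email = "Hi, my card 4111 1111 1111 1111 was charged twice..."  # sample input

response = client.chat.completions.create(
    model="openpipe-llama-3-8b-32k",  # hypothetical model id
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": test_email},
    ],
    tools=[REDACT_TOOL],
    # Hypothetical vendor extension selecting the uploaded LoRA asset:
    extra_body={"peft": "my-pii-redaction-lora"},
)

# The answer comes back as a tool call, not as redacted text.
print(response.choices[0].message.tool_calls)
```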
So now we can move on to quality evaluation. For quality evaluation, what we've done is use an accuracy metric. Thankfully, we have ground truth from our dataset: all the exchanges have been labeled with privacy mask information that we can use as ground truth, and that makes evaluating and scoring the results fairly easy. We don't have to use an LLM for that; we can use more traditional techniques of accuracy evaluation. We have a metric we've built that assigns a score, penalized when PII information was missed or mistakenly added (i.e., false negatives and false positives), and we use a similarity-distance measure to match the responses from the LLMs against our ground truth. (A simplified stand-in for this kind of metric is sketched below.) For illustration purposes: here's PII that's been redacted perfectly, so that's a score of 1.0, a perfect match. Our fine-tune might, for instance, miss the fact that "Billy" was a middle name and interpret it as a first name; in that case we still attribute a high score, because it's close enough, and for a practical use case that would probably be good enough. But GPT-4, upon being called, fails to identify two out of the three pieces of information that have to be redacted, so its score is about a third.

So here's what we're going to do: I'm going to reduce the test size to 100 samples and run the evaluation in this cell. It brings in 100 test samples that we then run our evaluation metric on to get our overall score. If we look at the output from the cell, we're invoking, back to back, the fine-tune running on OctoAI and GPT-4 on OpenAI to do the results collection. Once we've collected the results (we're getting pretty close to 100 here), we can run the quality evaluation metric, and of course I invite you to run it on more samples, maybe a thousand or ten thousand; it just gets more expensive with GPT-4. Running a hundred samples costs about a dollar in inference, so a thousand samples cost ten dollars. Now we score it: we go through every single entry, with our ground truth information, our eval and labels from GPT-4, and our eval and labels from our fine-tune, and we can see right off the bat that the fine-tune is actually better at finding the PII to redact. Here, GPT-4 scored only 0.49, whereas our fine-tune achieves 0.85; here, 0.3 for GPT-4 and 1.0 for the fine-tune. So the fine-tune is performing better overall, and once we aggregate and average the scores, GPT-4 achieves 0.68 out of 1, whereas our fine-tune achieves 0.97. That's the difference between prototype and production: you're expected to achieve somewhere in the single-nine or two-nines range of accuracy, and that's what this technique allows. And again, I want to reiterate the cost: GPT-4 costs upwards of 30 dollars per million tokens generated, whereas Llama 3 8B on OctoAI costs just 15 cents. That's a 200x difference.
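[Editor's note: the talk doesn't show the exact scoring code, so this is an illustrative reconstruction of such a metric: fuzzy-match predicted PII spans against the ground-truth mask, penalizing misses (false negatives) and spurious additions (false positives).]

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Cheap string-similarity proxy in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score_sample(predicted: list[dict], ground_truth: list[dict],
                 threshold: float = 0.8) -> float:
    """Score one sample with an F1-style value in [0, 1]: ground-truth
    entries left unmatched lower recall; extra predictions lower precision."""
    if not ground_truth:
        return 1.0 if not predicted else 0.0
    matched_gt = sum(
        1 for gt in ground_truth
        if any(similarity(p["string"], gt["string"]) >= threshold
               for p in predicted)
    )
    matched_pred = sum(
        1 for p in predicted
        if any(similarity(p["string"], gt["string"]) >= threshold
               for gt in ground_truth)
    )
    recall = matched_gt / len(ground_truth)
    precision = matched_pred / len(predicted) if predicted else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Aggregate over the test set by averaging per-sample scores, e.g.:
# scores = [score_sample(pred, gt) for pred, gt in zip(predictions, labels)]
# print(sum(scores) / len(scores))
```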
With that, I just want to conclude with some takeaways on fine-tuning. Fine-tuning is a journey, but a very rewarding one, and there's truly no finish line here. You should attempt fine-tuning after you've already tried other techniques like prompt engineering and retrieval-augmented generation. But once you decide to embark, data is very important: collect your dataset carefully, because your model is only as good as the data it's trained on. You need to make sure to continuously monitor quality and retune your model as needed. Thankfully, we have solutions like OctoAI and OpenPipe to make this more approachable and easy to do. It's easier than ever, and it's only getting easier; maybe a year ago it was reserved for the most adventurous and sophisticated users, but now the barrier to entry has really been lowered, and when you do it right, you can achieve really significant improvements in accuracy as well as great reductions in cost.

I want to thank you for sitting here with me over the last 50 minutes, and I want to reiterate a few calls to action. Go to octoai.cloud to learn how to use our solutions and endpoints, but also come to our booth: we're located at booth G7, and we'll be here today and tomorrow if you want to chat about our SaaS endpoints or about our ability to deploy in an enterprise environment. I also want to give a shout-out to my colleague Pedro here. If you're curious about all the know-how that goes into how we optimize our models in production: our background is in compiler optimization, systems optimization, and infrastructure optimization, and we've applied all of that to serving our models with positive margins. We're not doing this at a loss; we're not wasting our VC money here. We're building all this know-how into making AI inference as efficient as it can be, and there's going to be a talk on that. Also, assuming you've joined our Slack channel (if you're on the Slack org for the event, go to the LLM quality optimization bootcamp channel), you can ask us any questions, and if you fill out the survey that Pedro is going to post, we'll give you an additional ten dollars in credits. That may not seem like a lot, but at 15 cents per million tokens, that's a lot of tokens you can generate for free, and the survey should take about 20 to 30 seconds. I'm going to be around, and you can also find me at the booth this afternoon if you have any questions. I'd like to thank you all for sitting through this talk; hopefully you've learned something, and hopefully you feel like I've demystified this idea of trying fine-tuning on your own. Give this notebook a try (assuming, of course, we've fixed the LoRA upload issue), and thank you all; feel free to ask me some questions after the talk. Thanks.