
LLM Quality Optimization Bootcamp: Thierry Moreau and Pedro Torruella


Whisper Transcript

00:00:00.000 | so welcome everyone thanks for making it to this lunch and learn my goal today is to make sure that
00:00:19.360 | I get to share my knowledge and experience on LLM fine-tuning and just to get a quick sort of poll
00:00:28.920 | from the audience here how many of you have heard of the concept of fine-tuning here okay so
00:00:36.240 | quite a few people how many of you have actually had hands-on experience in fine-tuning LLMs okay
00:00:42.540 | all right that's pretty good that's more than I'm usually used to I mean it's quite fantastic
00:00:48.180 | that in this conference the makeup of AI engineers is close to 100% that's not something I'm generally
00:00:54.660 | used to when presenting at other you know hackathons and conferences so I feel like I'm speaking to the
00:01:01.120 | right crowd so just to kind of contextualize this talk really I'm trying to address two pains that
00:01:10.020 | a lot of Gen AI engineers face and to get a sense of where you are in your journey how many really
00:01:17.340 | identify and can relate to the first one which is my Gen AI spend has gone through the roof okay yeah all
00:01:25.680 | right and how many of you are in this other segment of this journey which is you know you've built POC's
00:01:31.980 | it's showing promise but you haven't yet quite met this quality bar to go to production can I get a
00:01:39.480 | sense of that all right so I think you know we have a good fraction of the audience that can
00:01:45.960 | relate to one of these two problems myself I'm a co-founder at Octo AI and I'm going to talk a little
00:01:52.380 | bit more about what we do but the customers I've been working with they feel those pains in a very real way we're
00:01:58.980 | talking about tens of thousands if not hundreds of thousands of dollars in monthly bills and perhaps
00:02:06.360 | even having issues trying to go to production because the quality bar hasn't yet been met so the
00:02:12.960 | overview of this 15-minute talk is going to be spent on understanding the why of fine-tuning really try to
00:02:19.800 | understand when to use fine-tuning it's not really a silver bullet for all the problems you're going to
00:02:24.960 | face but when used right in the right context for the right problem it can really deliver results
00:02:30.340 | I'm also going to try to contextualize this notion of fine-tuning within the crawl walk and
00:02:36.340 | run of LLM quality optimization because there's different techniques that you should attempt before trying
00:02:42.340 | to do fine-tuning but finally when you're convinced that this is the right thing for you I'm going to
00:02:47.340 | talk about this continuous deployment cycle of fine-tuned LLMs so we're going to go through today
00:02:53.660 | over a whole crank of that wheel of this deployment cycle composed of data set
00:03:01.200 | collection model fine-tuning deployment and evaluation and really I'm trying to demystify this
00:03:07.560 | whole journey to you all because in the next 15 minutes we're actually going to go through this whole
00:03:11.700 | process and hopefully that's something that you're going to feel comfortable going through and you know
00:03:15.540 | applying to your own data set to your own problems and so for illustrating today's use case we're going
00:03:21.240 | to use this personally identifiable information redaction use case now that's a pretty traditional
00:03:26.760 | sort of data scrubbing type of application but we're going to use LLMs and we're going to see that we can
00:03:32.460 | essentially achieve state-of-the-art accuracy while keeping efficiency at the highest using essentially very
00:03:41.560 | compact very lightweight models that have been fine-tuned for that very task so again trying to motivate this
00:03:48.160 | talk what limits gen AI adoption in most businesses today based on the conversations that I've had in the
00:03:53.640 | field discussions I've had with customers and developers the first one is there's a limited availability of GPUs I think we're all
00:04:02.220 | familiar with this problem it's one of the reasons why Nvidia is so successful lately I mean everyone wants
00:04:07.920 | to have access to those precious resources that allow us to run gen AI at scale and that can also drive costs up right so we have to be smart about how to use those GPU resources
00:04:19.840 | and also when people build POCs they show promise but sometimes you don't reach the expected quality bar to go to production
00:04:29.540 | and so on this chart where the Y axis is cost and the X axis symbolizes quality maybe many people start on that green cross here right in this upper quadrant at a very high cost maybe not having met the quality bar that's your first POC
00:04:49.600 | but really to go to production you need to end on the opposite quadrant right lower cost higher quality where you've met the bar you're able to run this in a way that essentially is margin positive and many of us are on this journey to reach that point of profitability
00:05:07.240 | so we're going to learn today how to use and how to fine-tune an LLM now fine-tuning is a method that we're going to use to improve LLM quality but as a bonus we're also going to show how to improve cost efficiency significantly and I use quality as the title of this talk because really I think many of us AI engineers really care about reaching the high quality bar when we're using LLMs and hopefully the goal of today's talk is to
00:05:37.000 | instill in you some knowledge on how to tackle this journey and so in terms of tools that we're going to use today we're going to use OpenPipe which is a SaaS solution for fine tuning that really lowers the barrier of entry for people to run their own fine tunes
00:05:51.640 | you don't need hardware or cloud instances to get started and we're going to use this to deliver quality improvements over state-of-the-art LLMs and of course since I work at Octo AI I'm going to also be using Octo AI here for the LLM deployments
00:06:06.280 | and that's going to be the solution that we're going to use to achieve cost efficiency at scale and really the key here is to be able to build on a solution that is designed to serve models at production scale volumes
00:06:22.120 | and just to give you a little bit of a sneak peek in terms of the results that we're going to showcase today after you go through this whole tutorial and this is something that you're going to be able to reproduce independently so you know all the code
00:06:34.120 | is there for you to go through we're going to be able to show that we can achieve 47 percent better accuracy at the tasks that I'm going to showcase today using this OpenPipe fine tuning
00:06:44.120 | and by deploying the model on Octo AI we're going to achieve this I mean it seems kind of ridiculous 99.5 percent reduction in cost this is really a 200x reduction in cost here from GPT-4 Turbo to Llama 3 and mostly because this is a much smaller model it's open source and we've optimized the hell out of this model to serve it cheaply on Octo AI so I'm going to explain how this is achieved but I hope your interest at least has been piqued by those results that you yourself
00:07:14.040 | can reproduce so when to use fine tuning again it's not really a silver bullet for all your quality problems it has its right place in time so I like to contextualize it within the crawl walk run of quality optimization right and as Gen AI engineers many of us have embarked on this journey we're at different stages of this journey and really it should always start with prompt engineering right and many of you are familiar with this concept you start with a model you're
00:07:43.960 | you're trying to have it accomplish a task and sometimes you don't really manage to see the result you expect to see so you're going to try prompt engineering and there's different techniques of varying levels of sophistication
00:07:55.320 | this talk is not about prompt engineering so you know you can improve prompt specificity there's few-shot prompting where you can provide examples to improve essentially the quality of your output there's also chain-of-thought prompting I mean some of you probably have heard these concepts but this is where you should get started right make sure that given the model and its fixed weights
00:08:13.880 | you just try to improve the prompt to get the right results sometimes that's not enough and there's a second class of solutions which I like to map to the walk stage
00:08:23.720 | retrieval augmented generation right we've probably seen a lot of talks on RAG today and throughout this conference so you know there's hallucinated results sometimes the answer is not truthful well why is that it's because the weights of the model that is really the parametric memory of your model are
00:08:43.800 | limited to you know the point in time in which the model was trained so when you try to ask questions on data it hasn't seen or information that's more recent than when the model was trained it's not going to know how to respond right so the key here is to provide the right amount of context
00:08:59.800 | and so this is achieved through similarity search for instance in a vector database through function calling to bring the right context by invoking an API through search through querying a database
00:09:11.640 | and so this is something that I think many of us engineers have been diving into in order to provide the right context to generate truthful answers right complement the parametric memory of your model with non-parametric information (a minimal retrieval sketch is shown below)
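As a minimal sketch of that retrieval step (the embedding model, the tiny corpus, and the prompt template here are illustrative assumptions, not something from the talk):

```python
# a minimal sketch of the retrieval step behind RAG, assuming the
# sentence-transformers package; model name and corpus are illustrative
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "Acme's refund policy allows returns within 30 days of purchase.",
    "Acme support is available Monday through Friday, 9am to 5pm.",
]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

question = "How long do I have to return an item?"
query_embedding = embedder.encode(question, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)[0]
context = "\n".join(corpus[hit["corpus_id"]] for hit in hits)

# the retrieved context is injected into the prompt, complementing the model's
# parametric memory with non-parametric information
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```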
00:09:23.800 | and that's RAG in a nutshell right so you've tried prompt engineering you've tried RAG you've eliminated quality
00:09:28.920 | problems and hallucinations but that's still not enough right so what do you try next
00:09:33.640 | well fine-tuning I think is the next stage and again I'm generalizing a very complicated and complex journey
00:09:40.840 | but in spite of your best efforts you've tried these techniques for maybe days weeks or even months
00:09:46.440 | and you still don't get to where you need to be to hit production and we're going to talk about this journey
00:09:51.560 | today right fine-tuning so when should you fine-tune a model again after you spend a lot of time in the
00:09:59.160 | first two phases of this journey so spending time on prompt engineering spending time on retrieval augmented
00:10:05.480 | generation and you don't see the results improve and generally what helps is whenever you use an LLM for a very
00:10:13.080 | specific task something that's very focused for instance classification information extraction
00:10:18.840 | trying to format a prompt using it for function calling if you can narrow the use case to something
00:10:26.200 | that is highly specific then you have an interesting use case for applying fine-tuning here and another
00:10:33.640 | requirement is to have a lot of your own high quality data to work with because that's going to be your
00:10:37.880 | fine-tuning data set that goes without saying but a model is only as good as the data that the
00:10:43.560 | model was trained on and we're going to apply this principle here in this tutorial and finally I think
00:10:48.760 | as an added incentive oftentimes we're all driven by economic incentive in the work we do for those of
00:10:54.520 | you who are feeling the pains of high gen ai bills whether it is with open ai or with a cloud vendor or a third
00:11:03.320 | party well this is generally a good reason to explore fine-tuning so we're going to go over all the steps
00:11:10.040 | now that we've kind of contextualized why fine-tuning and when to consider fine-tuning we're going to
00:11:15.720 | consider all the steps here in this continuous deployment cycle it starts with building your data
00:11:21.560 | set then running the fine-tuning of the model deploying that fine-tune llm into production so
00:11:28.520 | you can achieve scale and serve your your you know your customer needs or internal needs at high volumes
00:11:35.320 | and also evaluate quality and this is an iterative process there's not a single crank of the wheel this
00:11:40.920 | is not a fire and forget situation because data that your model sees in production is going to drift and
00:11:47.400 | evolve and so this is something that you're going to have to monitor you're going to have to update your data
00:11:51.240 | set you're going to have to fine-tune your model and i don't want to scare you away from doing this
00:11:55.640 | because it sounds fairly daunting and so by the end of this talk we'll have gone through a full crank of
00:12:02.280 | that wheel and hopefully through the saas tooling that i'm going to introduce you to it's going
00:12:08.840 | to feel a lot more approachable and hopefully i'll demystify the whole process of fine-tuning models
00:12:13.720 | so let's start with step one which is to build a fine-tuning data set now the data the model
00:12:21.080 | is trained on should ideally be real world data right it has to be as close as possible to what you're
00:12:27.000 | going to see in production so there's kind of a spectrum of ways to build and generate a data set
00:12:32.280 | ideally you build a data set out of real world prompts and real world human
00:12:38.920 | responses so for instance you have customer service you've logged calls with a customer agent you have
00:12:44.760 | an interaction between two humans that's a very good data set to work with right because it's human
00:12:49.480 | generated on both ends this is very high quality but not everyone has the ability to acquire
00:12:54.840 | this data set sometimes you're starting from scratch so not everyone has the luxury to start there
00:13:00.360 | there's also kind of an intermediary between real world and synthetic where you have real world prompts
00:13:05.640 | but ai generated responses and so this is kind of a good middle ground between cost and quality because
00:13:11.080 | you're starting from actual ground truth information that is derived from real data but the responses are
00:13:19.320 | generated by a high quality llm say gpt-4 or claude and actually openpipe is a solution that allows you
00:13:26.600 | to log the inputs and outputs of an llm like gpt-4 to build your data set for fine-tuning an llm so this is
00:13:34.680 | something that you know a lot of practitioners use and finally there's the fully synthetic data set using
00:13:41.560 | fully ai generated labels and oftentimes when you go on hugging face or kaggle you'll encounter data sets that have
00:13:48.680 | been built entirely synthetically and that's a great way to kind of get started on this journey
00:13:53.960 | and actually one of the data sets we're going to use today is from that latter category
00:13:58.440 | and of course i mean it probably goes without saying but in case people are not fully uh familiar with this
00:14:05.160 | notion you want to split your data set into a training and validation set because you don't want to
00:14:11.160 | evaluate your model on data that your fine tune has seen right and so many of you who are ml and ai
00:14:19.800 | engineers are already familiar with this but i just want to reiterate that this is important and finally
00:14:24.120 | you know the validation set is used for hyperparameter tuning and when you're deploying it and actually testing it
00:14:28.520 | on real world examples you want to have a third set outside of training and validation which is your test set (a minimal split sketch is shown below)
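A minimal sketch of such a split (the 80/10/10 ratio and the placeholder samples are illustrative):

```python
# a minimal sketch of a train/validation/test split; the ratio is illustrative
import random

samples = [{"id": i} for i in range(10_000)]  # stand-in for your fine-tuning examples
random.seed(42)
random.shuffle(samples)

n = len(samples)
train = samples[:int(0.8 * n)]                   # used to fit the fine-tune
validation = samples[int(0.8 * n):int(0.9 * n)]  # used for hyperparameter tuning
test = samples[int(0.9 * n):]                    # held out for the final quality evaluation
print(len(train), len(validation), len(test))    # 8000 1000 1000
```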
00:14:34.440 | that's a good way to do it now you've built your data set you're ready to fine tune your model and there's a lot of
00:14:40.120 | decisions that we need to make at this point and the first one is going to be open source versus closed source right
00:14:45.720 | and so who here just a raise of hands is using proprietary llms or gen ai models today from open ai
00:14:53.160 | anthropic mistral ai okay good amount of crowd who here has been using open source llms like llama some of the free mistral ai models
00:15:03.800 | okay so maybe a smaller crowd right and maybe that's because these models are not as capable and
00:15:10.520 | sophisticated but i'm going to walk you through how you can achieve better results if you do fine
00:15:18.280 | tuning right so of course the benefit of open source and this is why you know i'm obviously biased but i'm
00:15:24.200 | an open source advocate is that you get to have ownership over your model weights so once you've done the fine tuning you are
00:15:33.160 | the proprietor of the weights that are the result of this fine tuning process which means that you can
00:15:38.680 | choose how you deploy it how you serve it this is part of your ip and i find that this is a great
00:15:43.080 | thing for anyone who wants to embark on this fine tuning journey with proprietary solutions you're not
00:15:49.560 | quite the owner or you don't have the flexibility to decide to go with another vendor or to host the
00:15:54.680 | models yourself and so you're kind of locked into an ecosystem some people are comfortable with that others are
00:16:00.440 | less comfortable with it and many of the customers that we talk to they're very eager to jump on the
00:16:06.120 | open source train but they don't really know how to get started or you know where to start on this
00:16:11.640 | journey so hopefully this can help inform you how to take your first steps here into the
00:16:16.040 | world of open source then there's a question of like do i use a small model or a large model
00:16:21.400 | because for instance even in the world of open source you have models that are in the order of 8 billion
00:16:26.360 | parameters like llama 3 8b and then you have the large models like a mixtral 8x22b so this
00:16:33.080 | is a mixture-of-experts model with over 100 billion parameters very different beasts and we're going
00:16:38.840 | to see even larger models from meta and generally my recommendation here is well look the large models
00:16:45.320 | are amazing they have broader context windows they have higher capabilities of reasoning but they're also
00:16:52.440 | more expensive to fine tune and more expensive to serve and typically when you have to do a deployment
00:16:57.400 | you're going to have to acquire resources like h100s to run these models so generally start with a smaller
00:17:03.080 | model like a llama 3 8b and sometimes you'll be surprised by its ability to learn specific problems
00:17:08.920 | so that's my recommendation start with a smaller llama 3 8b or mistral 7b and if that doesn't work out for you then
00:17:18.440 | move towards larger and larger models and today we're going to be using this llama 3 8 billion parameter
00:17:23.960 | model there's also different techniques for fine tuning i'm going to go over this one fairly quickly
00:17:29.880 | but there are two classes of fine-tuning techniques one which is parameter-efficient fine-tuning it produces
00:17:37.880 | a lora and the other one is full-parameter fine-tuning which produces a checkpoint a lora is much
00:17:43.960 | smaller and more efficient in terms of memory footprint we're talking about 50 megabytes versus
00:17:50.440 | a checkpoint that is 15 gigabytes and so you can guess that because of its more compact representation
00:17:58.200 | you're able to serve it on a gpu that doesn't require as much onboard memory and you can even serve
00:18:06.280 | multiple loras at the same time so multiple fine-tunes on a gpu for inference as opposed to the
00:18:13.240 | checkpoints which require a dedicated gpu for every single fine tune so there's more flexibility in
00:18:19.240 | deployment and we're going to use that today we're actually going to serve these loras which are the
00:18:23.640 | result of parameter-efficient fine-tuning on a shared-tenancy endpoint with other users who have their own
00:18:30.360 | loras all running on the same server and that allows us to really reduce the cost of inference
00:18:36.280 | and there is a benefit to checkpoints though and full parameter fine tuning which is that there are more
00:18:43.400 | parameters to tune so it's a more flexible fine-tuning technique it allows the model to
00:18:49.640 | essentially achieve better results at more expensive tasks like logical reasoning
00:18:57.000 | but for very specialized tasks which is what we're going to look at today like classification or
00:19:01.000 | labeling or function calling a lora is just fine so we're going to use parameter-efficient fine-tuning (a minimal sketch of configuring a lora adapter is shown below)
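For a feel of what parameter-efficient fine-tuning looks like in code, here is a rough sketch using the Hugging Face peft library; the model id and LoRA hyperparameters are illustrative, and this is not necessarily what OpenPipe runs internally:

```python
# a rough sketch of attaching a LoRA adapter with the Hugging Face peft library;
# model id and hyperparameters are illustrative
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapters only on the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
# only a few million of the ~8 billion parameters end up trainable, which is why
# the exported adapter is tens of megabytes instead of a multi-gigabyte checkpoint
model.print_trainable_parameters()
```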
00:19:06.120 | and also when you're doing fine tuning you have to decide am i going to diy it or am i going to use
00:19:12.840 | saas so i'm sure some of you only like to diy things others like the convenience of saas and here i'm not
00:19:18.920 | going to take a side i think there's some great tools right now to diy your own fine tuning for
00:19:24.840 | for instance the open source project axolotl and actually at the conference there's the creator
00:19:29.960 | behind axolotl who you might be able to catch and you know the challenge here is that you have to
00:19:36.680 | find your own gpu resources you have to understand how to use these libraries even though they're
00:19:41.960 | easier than ever to adopt and you have to tune and tinker with settings and hyper
00:19:49.000 | parameters then there's saas which really aims to make it easy to embark on this journey companies like
00:19:54.760 | openpipe and there's many folks from the openpipe team at this conference today so if you can catch
00:20:00.520 | them please do talk to them and they're trying to lower the barrier of entry to fine tuning right to
00:20:04.920 | make it easy and to bring all this tooling all these libraries to make it as seamless as possible to
00:20:10.280 | for instance move from a gpt4 model to a fine tune with the least amount of steps in collecting your
00:20:15.880 | data fine tuning etc and so we're going to use saas today but if you feel more comfortable in this journey
00:20:21.800 | you might want to start with saas and then evolve into diy-ing it when it comes to deployment you
00:20:28.120 | have to navigate the same options right once you have a fine-tuned model now you need to decide well
00:20:32.120 | how am i going to serve it right because i need to generate maybe thousands millions or billions of
00:20:37.400 | tokens a day and so you need infrastructure you need gpus you need inference libraries some people like to
00:20:43.480 | diy it using libraries like vllm mlc llm tensorrt-llm hugging face tgi these are all things that you
00:20:52.760 | might have heard of these are all solutions to run models on your own infrastructure
00:21:00.040 | but you need to provision the resources you need to build the infrastructure to scale with demand and
00:21:06.920 | that can get tricky especially achieving high reliability under load that's a challenge that
00:21:12.200 | many people face as they scale their business up with saas you can essentially work with a third party
00:21:18.040 | like octo ai and obviously i'm a bit biased again i work there so i'm gonna insert a shameless plug for
00:21:24.680 | octo ai which allows users to get these fine tunes deployed on saas-based endpoints so endpoints very
00:21:32.840 | similar to the ones from open ai for instance if you're familiar with that or claude and it offers
00:21:40.120 | the ability to serve different kinds of customizations as well and so very quickly i want to go over the
00:21:45.880 | advantages of octo ai here first of all you get speed so with the llama 3 8 billion parameter model you can achieve
00:21:53.400 | around 150 tokens per second and we keep on improving that number because we've been applying our
00:21:57.960 | own in-house optimizations to the model serving layer it also has a significant cost advantage
00:22:03.880 | because it costs about 15 cents per million tokens compared to say gpt4 which costs 30 dollars per million tokens
00:22:11.240 | so that's where the 200x comes from and we don't charge a tax for customization so whether you're serving
00:22:16.120 | the base model or a fine-tune it's the same cost there's customization as i mentioned you can load your own lora and serve it
00:22:25.080 | and finally scale some of our customers generate up to billions of tokens per day on our endpoints
00:22:31.480 | i think we're serving over 20 billion tokens per day and so we've focused and spent a lot of time
00:22:38.280 | on improving robustness and also worth mentioning if saas doesn't cut it for you because you are working for a fortune 500
00:22:47.960 | or you're a software company or a healthcare company banking sector government and you need to deploy your llms inside of your
00:22:55.560 | environment either on-prem or in a vpc we also have a solution called octostack come talk to us at the booth
00:23:02.040 | so that's it for the shameless plug section let's go over to section four which is evaluating quality right
00:23:08.680 | we've talked about data set collection fine tuning deployment now quality evaluation and we could have an entire conference just dedicated on that
00:23:16.360 | i'm going to try to summarize it into kind of two classes of evaluation techniques that i've seen
00:23:22.360 | first of all you know can your quality be evaluated in a precise way that can be automated for instance
00:23:29.960 | you generate a program or sql command that can be run or can you for instance label or extract information
00:23:37.240 | or classify information in an accurate way that's a kind of pass or fail scenario right or formatting the output
00:23:43.560 | into a specific json format this is something that you can easily test as a pass-or-fail check (a minimal sketch is shown below)
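A minimal sketch of such an automated pass-or-fail check; the expected keys mirror the redaction use case used later in this talk and are assumptions:

```python
# a minimal sketch of an automated pass/fail check on output formatting;
# the expected keys are illustrative for the redaction use case
import json

def passes_format_check(raw_output: str) -> bool:
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # not valid JSON at all
    fields = payload.get("fields_to_redact")
    if not isinstance(fields, list):
        return False
    # every entry must name the string to redact and its PII class
    return all(isinstance(f, dict) and "string" in f and "pii_type" in f for f in fields)

print(passes_format_check('{"fields_to_redact": [{"string": "10.0.0.1", "pii_type": "IPADDRESS"}]}'))  # True
print(passes_format_check("not json"))  # False
```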
00:23:49.960 | then there's more of the soft evaluation for instance if i were to take an answer and say well which output
00:23:55.640 | is written in a more polite or professional way you can't really write a program to evaluate this unless
00:24:01.720 | you're using an llm of course right but you have to put yourself into maybe a 2020 or 2021 mindset
00:24:09.480 | before gpt was around well it'd be hard to build a program that can assess this right so generally you'd
00:24:15.640 | need a human in the loop to say which out of a or b is a better answer thankfully today we can use llms
00:24:23.800 | to automate that evaluation but keep in mind that for instance if you're using gpt4 to evaluate two
00:24:29.400 | answers well if you're comparing against gpt-4 it might favor its own answer and people have seen that
00:24:34.440 | in these kinds of evaluations so this is a whole science i mean we could have a whole conference
00:24:38.920 | just on this i just wanted to present the high level uh guidelines of this whole cycle of deploying
00:24:46.120 | fine-tuned llms and so really there is no finish line that's what i want to convey to you all that
00:24:52.520 | going through a single iteration is something that you might have to do on a regular basis maybe
00:24:58.760 | once a week maybe once a year it all depends on your use case and constraints now let's get a bit
00:25:06.120 | more practical let's switch over to our demo and so for those of you who came a little bit late
00:25:13.480 | there's a qr code here that you can scan and that will point you to our google colab and we also have
00:25:22.680 | a slack channel, let me see if i can pull it up, if you're in the slack channel for the ai engineer world's
00:25:29.800 | fair there is this quality optimization boot camp where you can ask questions here if you want to
00:25:35.080 | follow along and so we're going to try to go over the practical component in the next
00:25:40.520 | 25 minutes i just want to provide some context here the use case is uh personally identifiable
00:25:48.360 | information redaction we've taken this from a data set composed by ai4privacy called pii masking
00:25:55.960 | 200k it's one of the largest data sets of its kind it has 54 different pii classes so different kinds
00:26:03.880 | of sensitive data like the name the email address the physical address of someone
00:26:10.760 | their credit card information etc across 229 discussion subjects so that includes
00:26:18.120 | conversations from a customer ticket resolution conversations with a banker conversations between
00:26:24.440 | individuals etc what this data set looks like is as follows you're going to have a message an email
00:26:32.120 | here we have you know something that looks like it came out of an email that contains credit card
00:26:37.800 | information ip address maybe even a mention of a role or anything that is essentially personally
00:26:45.640 | identifiable and i've highlighted those in red because they will need to be redacted
00:26:51.160 | and after redaction we should get the following text that shows look here is this information that is now
00:26:57.880 | redacted anonymized but instead of just masking it we're actually telling it what kind of category this
00:27:03.880 | information belongs to right a credit card number an ip address or job title and this is how we're going to redact this text
00:27:10.760 | so where do llms come in the way we would use it is through function calling who here has used llms
00:27:19.400 | with tool calls or function calls okay so quite a few people you know and as many of us are aware this
00:27:27.480 | kind of what powers a lot of the agentic applications so this is a great use case for people who want to do
00:27:33.320 | function calling and are not seeing the results you know out of the box from say gpt4 that they would like
00:27:40.440 | to to see and in this case we're actually going to see that that these kind of state-of-the-art models
00:27:44.760 | aren't doing quite well at fairly large and complex function call use cases so to achieve this redaction
00:27:53.000 | use case we're going to pass in a system prompt we can also pass in a tool specification the system
00:27:58.920 | prompt says look you're an expert model trained to do redaction and you can call this function
00:28:03.320 | here are all the sensitive pii categories for you to redact and then as a user prompt we're going to
00:28:10.040 | pass in that email or that message and then the output is a tools call so it's not the redacted text
00:28:17.320 | it's actually a tools call to that redact function that's going to contain all the arguments for us to
00:28:23.080 | perform the redaction why am i doing this as opposed to spitting out the redacted text well that gives us
00:28:28.760 | flexibility in terms of how we want to redact this text we could choose to just replace that
00:28:34.520 | information with the pii class we can also completely obfuscate it or we could choose to use for instance a
00:28:42.120 | database that maps each pii entry to a fake substitute so that we have an email that kind of reads normally
00:28:51.160 | except the credit card the names the addresses are all made up but they will always map to the same individual (a minimal sketch of the simplest option is shown below)
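A minimal sketch of the simplest option, replacing each span with its PII class; the argument names and class labels are assumptions:

```python
# a minimal sketch of applying the redact tool call's arguments to the text;
# field names and PII class labels are illustrative
def apply_redaction(text: str, fields_to_redact: list[dict]) -> str:
    for field in fields_to_redact:
        text = text.replace(field["string"], f"[{field['pii_type']}]")
    return text

email = "Hi, my card number is 4111 1111 1111 1111 and my IP is 10.0.0.1."
args = [
    {"string": "4111 1111 1111 1111", "pii_type": "CREDITCARDNUMBER"},
    {"string": "10.0.0.1", "pii_type": "IPADDRESS"},
]
print(apply_redaction(email, args))
# Hi, my card number is [CREDITCARDNUMBER] and my IP is [IPADDRESS].
```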
00:28:58.600 | and so that allows us to do then more interesting processing on our data set right so that's why
00:29:04.680 | we're going to use function calling here and let's start to build the data set so i'm going to switch
00:29:09.000 | over to our notebook here this notebook is meant to be sort of self-explanatory so there's a bit of
00:29:15.640 | redundant context as part of the prerequisites you're going to have to get an account on octo ai and
00:29:22.280 | openpipe and these are the tools that we're going to use and if you want to run the evaluation function also
00:29:28.440 | provide your open ai key because we're going to compare against gpt4 so we're going to install the python packages
00:29:35.400 | initially only open ai and data sets from hugging face you can ignore this pip dependency error here
00:29:42.200 | which happens when you pip install data sets in a colab notebook but that's okay we can get past that
00:29:48.200 | you can enter your octo ai token and open ai api key at the beginning
00:29:54.440 | and i've already done this so we're going to start with the first phase which is to build a fine-tuning
00:29:58.520 | data set so we have this pii masking data set i'm going to pull it up from hugging face the pii
00:30:04.360 | masking 200k data set and you can see what it looks like it has the source text information as you can see
00:30:12.520 | these are snippets from emails for instance you have the target text that is
00:30:18.360 | redacted and the privacy mask that contains each one of the pii spans and the classes associated to them (a minimal loading sketch is shown below)
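A minimal loading sketch; the Hugging Face dataset id and the column names are assumptions based on what is shown on screen:

```python
# a minimal sketch of pulling the dataset; the dataset id and column names
# are assumptions based on what is shown on screen
from datasets import load_dataset

ds = load_dataset("ai4privacy/pii-masking-200k", split="train")
example = ds[0]
print(example["source_text"])   # the raw message containing PII
print(example["target_text"])   # the redacted version
print(example["privacy_mask"])  # the ground-truth PII spans and their classes
```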
00:30:25.080 | so this contains all the data the inputs and labels that we need to build our
00:30:30.600 | data set for fine-tuning and so really what we're going to do
00:30:35.720 | is start with the system prompt
00:30:41.480 | we're going to define our system prompt here which is again telling the model you're an
00:30:46.120 | expert model trained to redact information and here are the 56 categories explaining next to each
00:30:53.000 | category what that corresponds to and this is really the beauty of llm and sort of natural language entry
00:30:59.400 | is that in the old world when we're doing pii redaction we had to write complex regular expressions
00:31:05.480 | and here this is all done through just providing a category and a bit of a description here
00:31:10.920 | and the llm will naturally infer how to do the redaction we're also going to define the tool to call
00:31:19.080 | right so this is done essentially as a dictionary a json object and as you can see there is an array
00:31:26.360 | that contains dictionaries containing a string and a pii type and the string is the pii information the type is
00:31:35.000 | essentially one of 56 categories that we provide as an enum so right off the bat you can see that this
00:31:40.520 | tool call is you know a bit of a large function specification (a rough sketch of its shape is shown below)
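A rough sketch of that tool specification in the OpenAI tools format; the argument names are assumptions and the enum is truncated to three classes:

```python
# a rough sketch of the redact tool specification; argument names are assumptions
# and the enum is truncated rather than listing every PII class
redact_tool = {
    "type": "function",
    "function": {
        "name": "redact",
        "description": "Redact personally identifiable information found in the text.",
        "parameters": {
            "type": "object",
            "properties": {
                "fields_to_redact": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "string": {"type": "string"},
                            "pii_type": {
                                "type": "string",
                                "enum": ["CREDITCARDNUMBER", "IPADDRESS", "JOBTITLE"],  # ...and the rest
                            },
                        },
                        "required": ["string", "pii_type"],
                    },
                }
            },
            "required": ["fields_to_redact"],
        },
    },
}
```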
00:31:48.440 | and so let's load our data set from hugging face in this case it's going to take maybe a few seconds to load in that data set of 200 000
00:31:54.600 | entries and then what i have in the next cell when i'm downloading this data set is what i'm going to use
00:32:01.800 | to build my fine tuning training data set and here's the thing about fine tuning is that to build your
00:32:10.120 | data set you need to make it seem like you've essentially logged conversations with an llm right
00:32:15.800 | you're logging the prompts and the responses because that's how you're going to fine tune it you need to
00:32:20.200 | tell it this is the input with system prompt tools specification user prompt and here's the
00:32:27.720 | tool call response that i expect to see and so this cell here just sets it up so that we essentially
00:32:35.320 | have each training sample as a logged llm exchange we're going to see what that looks
00:32:41.800 | like in a second so we're going to build a 10 000 entry training data set for openpipe and that's
00:32:51.480 | going to be downloaded as this openpipe_dataset.jsonl file and so as i run the cell it's going to
00:32:58.040 | download this from colab (a rough sketch of how each training sample can be assembled is shown below) and then we'll switch over to openpipe to create a new data set
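A rough sketch of assembling one training sample and writing the JSONL file; the system prompt is truncated, redact_tool and ds come from the sketches above, and the exact layout OpenPipe expects may differ from this:

```python
# a rough sketch of turning each dataset row into a "logged" chat exchange and
# writing it out as JSONL; the exact schema OpenPipe expects may differ
import json

SYSTEM_PROMPT = "You are an expert model trained to redact PII..."  # truncated placeholder

def to_training_sample(row: dict) -> dict:
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": row["source_text"]},
            {
                # the ground-truth answer, expressed as the tool call we want the model to emit
                "role": "assistant",
                "tool_calls": [{
                    "type": "function",
                    "function": {
                        "name": "redact",
                        "arguments": json.dumps({"fields_to_redact": row["privacy_mask"]}),
                    },
                }],
            },
        ],
        "tools": [redact_tool],  # the tool specification sketched earlier
    }

with open("openpipe_dataset.jsonl", "w") as f:
    for row in ds.select(range(10_000)):  # ds is the dataset loaded earlier
        f.write(json.dumps(to_training_sample(row)) + "\n")
```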
00:33:07.400 | so once you're on open pipe console you have a project here i've generically named it project one
00:33:13.480 | you can access data sets and already as you can see i already have built a few data sets before
00:33:20.280 | but if you're a first time user you're not going to see anything under data sets so you can create
00:33:25.320 | a new data set here by clicking on this button and if you go under settings we can name our data set so
00:33:32.040 | i'm going to call it lunch and learn and today is june 26 all right so this is today's lunch and learn
00:33:42.920 | i'm going to call this my data set and under general i can upload the data that i just
00:33:49.560 | downloaded from my notebook openpipe_dataset.jsonl so this upload operation is going to take
00:33:56.280 | a few seconds or maybe a couple of minutes because what's going to happen on open pipe is not only we're
00:34:03.400 | uploading this data set but it's going to do some pre-processing here to split it into training and
00:34:09.880 | validation set it's also going to get it all formatted in a nice way so we can essentially look into the data
00:34:17.160 | set so you can see there's this little window here that shows that you're uploading the data set and
00:34:22.520 | that it is essentially being processed so while this is happening right we've prepared our data set and
00:34:30.600 | we're going to take a look at it in a second while it's being processed on open pipe but let's see how
00:34:35.160 | we're going to do the fine tuning in the next stage right so once we have our data set uploaded we're going
00:34:41.400 | to have this view on the data set that shows every single entry that we can peek into and how it's split
00:34:46.600 | into training and test set generally a 90 10 split and from that ui we can launch a fine tune
00:34:55.400 | and this is where we get to choose our base model and what we're going to choose is a llama 3 8 billion
00:35:00.360 | parameter model with 32k context width which is a fine-tune from nous research called the theta model
00:35:11.160 | and you can see that there's essentially a pricing here that is being estimated for this fine tune we
00:35:17.400 | have a substantial training set because it can range from say hundreds of samples to thousands to hundreds
00:35:23.160 | of thousands and the cost can scale up as you feed in more training samples but it will improve the
00:35:31.480 | accuracy and it also provides an estimated training price of forty dollars now that might seem like a lot
00:35:37.000 | especially when you're tinkering with fine-tuning but keep in mind some of the people that we work
00:35:41.160 | with they tend to spend tens of thousands or maybe hundreds of thousands of dollars a month on gen ai
00:35:46.680 | spend so this is absolutely something that you can do up front that will pay off and i believe that on
00:35:51.800 | openpipe if you get started you get a hundred dollars in credit so that allows you to run some fine
00:35:57.960 | tunes off the bat without necessarily having to pay so let's go over to openpipe
00:36:06.760 | and it is still uploading i think maybe the network is a bit slow but we're going to essentially
00:36:15.800 | start training at this point and once the training is happening we're going to then deploy the fine tune llm
00:36:23.160 | when training is done and what happens on open pipe is when you're done with training you're going to
00:36:28.040 | get an email when that training job is done it can take a few minutes so i'm going to pull a julia
00:36:32.600 | child here i'm going to stick the turkey in the oven and in the second oven i'm going to
00:36:37.160 | have a pre-baked turkey just so that we don't lose time but as you're going through this on your own
00:36:42.840 | keep in mind it's going to take a little bit of time to just kick off that whole fine tuning process but
00:36:47.400 | it's not that long because um you know you're training a fairly small model here all right so
00:36:54.360 | this is still uh saving but let's kind of take a look at what we've done so far right so we've built
00:37:00.040 | our data set using a synthetic data set from hugging face we format each input output pair from the data
00:37:05.960 | set as logged llm messages and this is essentially stored as a json file that we upload to openpipe
00:37:13.640 | and we produce 10 000 training samples we're fine-tuning a model on openpipe and openpipe
00:37:19.640 | uses parameter-efficient fine-tuning which produces a lora and we choose the llama 3 8 billion parameter
00:37:25.400 | model as the base and when we deploy what we're going to use here is octo ai so let's see this didn't
00:37:33.960 | finish uploading so i'm going to go into the one that i uploaded just a couple days ago just to
00:37:39.640 | essentially show you what you should see on the user interface so as you peruse through the training
00:37:47.960 | samples what you're going to see is an input column and output column and so on the left you have the
00:37:53.240 | input with the system prompt as you can see it's a big boy because it has all these different
00:37:58.760 | categories right that it needs to classify it also has the user prompt which is the message that we need to
00:38:05.160 | redact the tool choice and the tool specification here with all the different categories of pii types
00:38:11.000 | and then the output will be this tool call from the assistant's response
00:38:15.880 | and that will have this redact call along with its arguments the fields to redact as a list of dictionary
00:38:23.160 | entries containing string and pii type information right and so this is what we've passed into our fine-tuning
00:38:32.360 | data set into open pipe and this is still saving so i'm just going to go ahead and go to the model so
00:38:40.360 | once you have the data set uploaded again you hit this fine tune button and this is what's going to
00:38:46.760 | allow you to launch a fine tuning job right i can call this blah and this is where you select under
00:38:52.680 | this drop down the model that you want to fine tune this is again what we saw before training size is
00:38:58.600 | substantial i'm not going to hit start training because i already have a trained model but when
00:39:02.440 | you do that it's going to kick off the training and when it's done you'll get notified by email
00:39:06.120 | now let's fast forward let's assume i've already trained my model so i'm going to have this
00:39:11.400 | model here that's been fine-tuned from this data set i'm going to click on it as we can see it's
00:39:17.720 | a llama 3 8b model it's been fine-tuned over this 10 000 sample data set split into 9 000 training samples and
00:39:27.080 | a thousand test samples we can even look at the evaluation but going back to the model
00:39:35.160 | and the nice thing is that it's taking care of the hyperparameters like learning rate and number of epochs it kind
00:39:42.520 | of figures it out for you so you don't really have to tweak those settings and i find that to be very
00:39:46.920 | convenient especially for people who haven't yet built an understanding of how to tweak those values
00:39:51.000 | and the beauty of using open pipe is that you can now export the weights and be the owner of those
00:39:57.960 | weights right remember when we talked about open source it's really important to own the result of
00:40:02.280 | the fine tuning so you can download the weights in any format you want you have loras but also merged
00:40:08.040 | checkpoints so you can have a parameter-efficient representation as well as a checkpoint and so we've
00:40:14.360 | selected to export our model as an fp16 lora which is what we're going to use to upload our model on
00:40:20.280 | octo ai which is what we're going to use to deploy the model so now i can download the weights as a zip
00:40:25.400 | file and it's fairly small only 50 megabytes but i can also copy the link copy the url and this is what
00:40:33.800 | we're going to need to do in this tutorial so to deploy the model what we need to do is copy this url
00:40:41.000 | i'm going to download in the cell the octo ai cli this is a command line interface for users to upload
00:40:49.480 | their own fine tunes to what we call our asset library so this is a place where you can store your
00:40:54.760 | own checkpoints your own loras for not just llms but also models like stable diffusion if some of you are
00:41:00.760 | developers who also work in the image gen space and so we can serve these customized models on our platform
00:41:08.760 | and so we're going to upload this lora from openpipe to octo ai so we're going to log in just to
00:41:17.720 | make sure credentials are good and here we have a confirmation that our token is valid and in the cell we have to
00:41:25.960 | replace the lora url from the set-me placeholder to that url that i just copied here from download weights
00:41:31.160 | and keep in mind this might take a couple minutes to get the link to appear but once you have that link
00:41:38.040 | and again i'm kind of skipping ahead because when you're going to run this at your own time it might
00:41:43.480 | take a you know a few minutes to run the fine tune it might take a few minutes to download the weights
00:41:47.720 | but everything that i'm running here is essentially the steps that you'll take yourself
00:41:51.960 | and what i'm doing here is passing in this url here and setting a lora asset name in my octo ai asset
00:42:01.560 | library so i can then create this asset from this lora as a safetensors file based on the llama 3 8b model
00:42:14.040 | i'm going to name it let's see
00:42:16.120 | it seems like something has failed here so let's try to run it again
00:42:38.520 | and so what this is doing is uh let's see
00:42:40.920 | usually that should have worked so what should happen here is at this point
00:42:54.680 | once you've taken the url of your fine-tuned asset you should be able to host it on our asset library and then
00:43:04.680 | from there serve it to start running some inferences so this lora upload step didn't quite
00:43:15.800 | work here so pedro are you able to maybe double check with product whether this capability is working
00:43:23.160 | uh this isn't a good demo unless something fails and so uh yeah you know i just tested it earlier today
00:43:29.800 | and it was working flawlessly so
00:43:32.200 | uh let's see i might have to list my assets so i can pull an old one
00:43:41.560 | actually one second pedro can you tell me what the command is
00:43:47.880 | to list the assets that are on there i think it might be octoai asset list all right let's
00:43:56.280 | okay there we go so i'm going to pull from an asset that i uploaded earlier
00:44:08.280 | okay so i'm gonna have to look into why that step failed but let's
00:44:34.280 | let's try this okay so i'm gonna use an asset that i uploaded earlier i'm not sure why this
00:44:40.600 | didn't work but i'll make sure that this is working for you all to reproduce this step
00:44:43.960 | and i'm gonna set the lora asset name equal to this all right so these other loras i uploaded using
00:44:55.960 | the exact same steps as i used for this tutorial so we'll make sure to get to the bottom of this and
00:45:02.280 | we'll use the slack channel here for folks who want to run through this step but i'm just going
00:45:06.840 | to run an example inference here on this asset that i pulled from open pipe
00:45:11.800 | and so again we have our system prompt we're going to pass in this message this
00:45:20.440 | email as our test prompt and then when we're invoking this octo ai endpoint we're using the standard chat
00:45:27.160 | completions api from openai and what we're passing here is this openpipe llama 3 8b 32k model and we pass
00:45:36.920 | in this extra argument for parameter-efficient fine-tuning with the lora asset name that we just uploaded to the asset library (a rough sketch of that call is shown below)
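A rough sketch of that request; the base_url, model name, and the peft extra argument used to select the LoRA are assumptions based on what is shown in the demo:

```python
# a rough sketch of invoking the fine-tune through an OpenAI-compatible endpoint;
# base_url, model name and the "peft" extra argument are assumptions
from openai import OpenAI

OCTOAI_TOKEN = "<your octo ai token>"
LORA_ASSET_NAME = "<the lora asset uploaded above>"
test_email = "Hi, my card number is 4111 1111 1111 1111 and my IP is 10.0.0.1."

client = OpenAI(base_url="https://text.octoai.run/v1", api_key=OCTOAI_TOKEN)
response = client.chat.completions.create(
    model="openpipe-llama-3-8b-32k",       # assumed name of the base model on the endpoint
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # the redaction system prompt from earlier
        {"role": "user", "content": test_email},
    ],
    tools=[redact_tool],                   # the tool specification sketched earlier
    extra_body={"peft": LORA_ASSET_NAME},  # point the request at the uploaded LoRA asset
)
print(response.choices[0].message.tool_calls)
```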
00:45:44.600 | and as we can see the response here contains the tool calls and the call to the function
00:45:51.800 | that will do the redaction so this is behaving exactly as we intended to so now we can move on to the quality
00:45:58.600 | evaluation for quality evaluation what we've done is use essentially an accuracy metric thankfully we
00:46:06.600 | have a ground truth right from our data set all the exchanges have been labeled with privacy mask information
00:46:13.880 | that we can use as ground truth so that makes evaluating or scoring our results fairly easy we don't have to
00:46:19.800 | use an llm for instance for that we can actually use more traditional techniques of accuracy evaluation
00:46:25.640 | and so we have a metric that we've built it assigns a score that gets penalized when pii information was
00:46:33.320 | missed or mistakenly added i.e. false negatives and false positives and then we use a similarity distance metric
00:46:39.880 | to match the responses from the llms against our ground truth (a rough sketch of such a scorer is shown below) so for illustration purposes we have
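A rough sketch of that kind of scorer; this is an illustrative metric, not the exact one in the notebook:

```python
# a rough sketch of an accuracy metric for redaction: match predicted PII entries
# against the ground truth with a string-similarity measure and penalize both
# missed and spurious entries; illustrative, not the notebook's exact metric
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def score_redaction(predicted: list[dict], ground_truth: list[dict]) -> float:
    if not ground_truth:
        return 1.0 if not predicted else 0.0
    total = 0.0
    for truth in ground_truth:
        # best match among the predictions for this ground-truth PII span
        best = max((similarity(p["string"], truth["string"]) for p in predicted), default=0.0)
        total += best
    # dividing by the longer list penalizes both misses and false positives
    return total / max(len(ground_truth), len(predicted))

truth = [{"string": "Billy", "pii_type": "MIDDLENAME"}, {"string": "10.0.0.1", "pii_type": "IPADDRESS"}]
pred = [{"string": "Billy", "pii_type": "FIRSTNAME"}]  # found the name span, missed the IP
print(round(score_redaction(pred, truth), 2))  # 0.5
```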
00:46:47.320 | for instance this pii information that's been redacted that's a score of 1.0 because it's the perfect match
00:46:53.560 | our fine-tune might for instance miss the fact that billy was the middle name and might interpret it as
00:46:59.720 | first name in that case we're still attributing a high score because it's close enough and probably
00:47:04.920 | for a practical use case that would be good enough but for instance upon calling gpt-4 it fails to identify
00:47:11.240 | two of the three pieces of information that we have to redact and so the score is about a third here right
00:47:16.680 | so in this case what we're going to do here i'm just going to reduce the test size to 100 samples
00:47:23.880 | and i am going to run this evaluation inside of this cell it's going to bring us 100 test samples
00:47:34.520 | that we can then run our evaluation metric and get our overall scoring out of so if we look at
00:47:43.960 | you know the output from the cell essentially we're just invoking back to back the fine-tune
00:47:52.120 | running on octo ai and we're invoking gpt-4 on openai to do the results collection so we're going to
00:47:59.640 | collect some results here and uh once we've collected the results once we get to 100 i think we're getting
00:48:06.200 | pretty close here we can run the quality evaluation metric and of course i invite you to run it on more
00:48:11.720 | samples maybe a thousand or ten thousand it just gets more expensive as you're using gpt4 you know to
00:48:18.840 | run a hundred samples it costs about a dollar in inference so then a thousand samples cost ten dollars
00:48:26.120 | and now we're going to score it all right so we're going to go through every single entry we have our
00:48:32.040 | ground truth information we have our eval and labels from gpt4 and our eval and labels from our fine
00:48:40.520 | tune and we can see that right off the bat the fine-tune is actually better at finding the pii to redact
00:48:47.080 | here gpt-4 scored only 0.49 whereas our fine-tune achieves 0.85 and here 0.3 for gpt-4 versus 1.0 for
00:48:57.400 | the fine-tune so the fine-tune overall is performing better and once we aggregate and average the score
00:49:02.600 | gpt4 achieved 0.68 out of 1 whereas our fine tune achieves 0.97 and so that's the difference between
00:49:13.400 | prototype and production right you're expected to achieve somewhere in the single nine or two nines
00:49:18.440 | of accuracy and this is what this technique allows you to achieve and again i want to reiterate that
00:49:25.000 | in terms of cost gpt-4 costs upwards of 30 dollars per million tokens generated whereas llama 3 8b on octo ai
00:49:33.240 | costs just 15 cents that's a 200x difference right so with that i just want to conclude
00:49:40.440 | with some takeaways on fine-tuning right fine-tuning is a journey but a very rewarding
00:49:47.880 | journey there's truly no finish line here you need to attempt fine tuning after you already tried other
00:49:53.880 | techniques like prompt engineering retrieval augmented generation but once you decide to embark data is
00:49:59.960 | very important collecting your data set because your model is only as good as the data it's trained on
00:50:05.320 | you need to make sure to continuously monitor quality to retune your model as needed
00:50:10.200 | and thankfully we have solutions like octo ai and openpipe to really make
00:50:17.080 | this more approachable and easy to do and it's easier than ever it's only getting easier but
00:50:22.840 | maybe a year ago it was only reserved for the most adventurous and sophisticated users and now we've
00:50:27.800 | really lowered the barrier of entry and when you do it right you can achieve really significant improvements
00:50:33.000 | in accuracy as well as great reduction in costs i wanted to thank you for sitting here with me over the last
00:50:40.280 | 50 minutes i want to reiterate a few calls to action so go to octoai.cloud to learn how to use our solutions
00:50:47.240 | and endpoints but also come to our booth and so we're located at this g7 booth and we're going to be here
00:50:55.640 | today and tomorrow if you want to chat about our sas endpoints about our ability to deploy in an enterprise
00:51:02.360 | environment and also i want to give a shout out to my colleague here pedro if you're curious about all the know-how
00:51:08.360 | that goes behind how we optimize our models in production because our background is in compiler
00:51:12.920 | optimization in system optimization infrastructure optimization we've applied all of this to be able
00:51:18.760 | to serve our models you know with positive margins we're not doing this at a loss sorry we're not wasting
00:51:25.080 | our vc money here we're actually building all this know-how into making sure that ai inference is as efficient
00:51:32.440 | as it could be so there's going to be a talk on that and also make sure if you get a chance assuming
00:51:39.960 | you've joined our slack channel which is the following one so if you're on the slack
00:51:48.440 | org for the event go to llm quality optimization boot camp you can ask us any questions and if you fill out
00:51:56.920 | the survey that pedro is going to post we're going to give you an additional 10 dollars in credits so
00:52:03.400 | that doesn't seem like a lot but that's a ton you know if it's 15 cents per million tokens that's a lot
00:52:09.160 | of tokens that you can generate for free so we can give you an additional 10 dollars for filling out
00:52:15.400 | the survey which should take about you know 20 to 30 seconds so i'm going to be around and also you
00:52:21.080 | can find me at the booth this afternoon in case you have any questions but i'd like to thank
00:52:26.040 | you all for sitting through this talk and hopefully you've learned something from this and
00:52:30.040 | hopefully you feel like i've demystified this idea of trying fine-tuning on your own give this notebook
00:52:35.880 | a try assuming of course we've fixed this lora upload issue and yeah thank you all and maybe ask
00:52:42.200 | me some questions after this talk thanks