
LLM Quality Optimization Bootcamp: Thierry Moreau and Pedro Torruella


Whisper Transcript

00:00:00.000 | so welcome everyone thanks for making it to this lunch and learn my goal today is to make sure that
00:00:19.360 | I get to share my knowledge and experience on LLM fine-tuning and just to get a quick sort of poll
00:00:28.920 | from the audience here how many of you have heard of the concept of fine-tuning here okay so
00:00:36.240 | quite a few people how many of you have actually had hands-on experience in fine-tuning LLMs okay
00:00:42.540 | all right that's pretty good that's more than I'm usually used to I mean it's quite fantastic
00:00:48.180 | that in this conference the makeup of AI engineers is close to 100% that's not something I'm generally
00:00:54.660 | used to when presenting at other you know hackathons and conferences so I feel like I'm speaking to the
00:01:01.120 | right crowd so just to kind of contextualize this talk really I'm trying to address two pains that
00:01:10.020 | a lot of Gen AI engineers face and to get a sense of where you are in your journey how many really
00:01:17.340 | identify and can relate to the first one which is my Gen AI spend has gone through the roof okay yeah all
00:01:25.680 | right and how many of you are in this other segment of this journey which is you know you've built POC's
00:01:31.980 | it's showing promise but you haven't yet quite met this quality bar to go to production can I get a
00:01:39.480 | sense of that all right so I think you know we have a good fraction of the audience that can
00:01:45.960 | relate to one of these two problems myself I'm a co-founder at Octo AI and I'm going to talk a little
00:01:52.380 | bit more about what we do but the customers I've been working with they feel those pains in a very real way we're
00:01:58.980 | talking about tens of thousands if not hundreds of thousands of dollars in monthly bills and perhaps
00:02:06.360 | even having issues trying to go to production because the quality bar hasn't yet been met so the
00:02:12.960 | overview of this 15-minute talk is going to be spent on understanding the why of fine-tuning really try to
00:02:19.800 | understand when to use fine-tuning it's not really a silver bullet for all the problems you're going to
00:02:24.960 | face but when used right in the right context for the right problem it can really deliver results
00:02:30.340 | I'm also going to try to contextualize this notion of fine-tuning within the crawl walk and
00:02:36.340 | run of LLM quality optimization because there's different techniques that you should attempt before trying
00:02:42.340 | to do fine-tuning but finally when you're convinced that this is the right thing for you I'm going to
00:02:47.340 | talk about this continuous deployment cycle of fine-tuned LLMs so we're going to go through today
00:02:53.660 | over a whole crank of that wheel of this deployment cycle composed of data set
00:03:01.200 | collection model fine-tuning deployment and evaluation and really I'm trying to demystify this
00:03:07.560 | whole journey to you all because in the next 15 minutes we're actually going to go through this whole
00:03:11.700 | process and hopefully that's something that you're going to feel comfortable going through and you know
00:03:15.540 | applying to your own data set to your own problems and so for illustrating today's use case we're going
00:03:21.240 | to use this personally identifiable information redaction use case now that's a pretty traditional
00:03:26.760 | sort of data scrubbing type of application but we're going to use LLMs and we're going to see that we can
00:03:32.460 | essentially achieve state-of-the-art accuracy while keeping efficiency at the highest using essentially very
00:03:41.560 | compact very lightweight models that have been fine-tuned for that very task so again trying to motivate this
00:03:48.160 | talk what limits gen AI adoption in most businesses today based on the conversations that I've had in the
00:03:53.640 | field discussions I've had with customers and developers the first one is there's a limited availability of GPUs I think we're all
00:04:02.220 | familiar with this problem it's one of the reasons why Nvidia is so successful lately I mean everyone wants
00:04:07.920 | to have access to those precious resources that allow us to run gen AI at scale and that can also drive costs up right so we have to be smart about how to use those GPU resources
00:04:19.840 | and also when people build POCs they show promise but sometimes you don't reach the expected quality bar to go to production
00:04:29.540 | and so on this chart where the Y axis is cost and the X axis symbolizes quality maybe many people start on that green cross here right in this upper quadrant at a very high cost maybe not having met the quality bar that's your first POC
00:04:49.600 | but really to go to production you need to end on the opposite quadrant right lower cost higher quality where you've met the bar you're able to run this in a way that essentially is margin positive and many of us are on this journey to reach that point of profitability
00:05:07.240 | so we're going to learn today how to use and how to fine-tune an LLM now fine-tuning is a method that we're going to use to improve LLM quality but as a bonus we're also going to show how to improve cost efficiency significantly and I use quality as the title of this talk because really I think many of us AI engineers really care about reaching the high quality bar when we're using LLMs and hopefully the goal of today's talk is to
00:05:37.000 | instill in you some knowledge on how to tackle this journey and so in terms of tools that we're going to use today we're going to use OpenPipe which is a SaaS solution for fine tuning that really lowers the barrier of entry for people to run their own fine tunes
00:05:51.640 | you don't need hardware or cloud instances to get started and we're going to use this to deliver quality improvements over state-of-the-art LLMs and of course since I work at Octo AI I'm going to also be using Octo AI here for the LLM deployments
00:06:06.280 | and that's going to be the solution that we're going to use to achieve cost efficiency at scale and really the key here is to be able to build on a solution that is designed to serve models at production scale volumes
00:06:22.120 | and just to give you a little bit of a sneak peek in terms of the results that we're going to showcase today after you go through this whole tutorial and this is something that you're going to be able to reproduce independently so you know all the code
00:06:34.120 | is there for you to go through we're going to be able to show that we can achieve 47 percent better accuracy at the tasks that I'm going to showcase today using this OpenPipe fine tuning
00:06:44.120 | and by deploying the model on Octo AI we're going to achieve this I mean it seems kind of ridiculous 99.5 percent reduction in cost this is really a 200x reduction in cost here from GPT-4 Turbo to Llama 3 and mostly because this is a much smaller model it's open source and we've optimized the hell out of this model to serve it cheaply on Octo AI so I'm going to explain how this is achieved but I hope your interest at least has been piqued by those results that you yourself
00:07:14.040 | can reproduce so when to use fine tuning again it's not really a silver bullet for all your quality problems it has its right place in time so I like to contextualize it within the crawl walk run of quality optimization right and as Gen AI engineers many of us have embarked on this journey we're at different stages of this journey and really it should always start with prompt engineering right and many of you are familiar with this concept you start with a model you're
00:07:43.960 | you're trying to have it accomplish a task and sometimes you don't really manage to see the result you expect to see so you're going to try prompt engineering and there's different techniques of varying levels of sophistication
00:07:55.320 | this talk is not about prompt engineering so you know you can improve prompt specificity there's few-shot prompting where you can provide examples to improve essentially the quality of your output there's also chain-of-thought prompting I mean some of you probably have heard these concepts but this is where you should get started right make sure that given the model and its fixed weights
00:08:13.880 | you just try to improve the prompt to get the right results sometimes that's not enough and there's a second class of solutions which I like to map to the walk stage
00:08:23.720 | retrieval augmented generation right we've probably seen a lot of talks on RAG today and throughout this conference so you know there's hallucinated results sometimes the answer is not truthful well why is that it's because the weights of the model that is really the parametric memory of your model are
00:08:43.800 | limited to you know the point in time in which the model was trained so when you try to ask questions on data it hasn't seen or information that's more recent than when the model was trained it's not going to know how to respond right so the key here is to provide the right amount of context
00:08:59.800 | and so this is achieved through similarity search for instance in a vector database through function calling to bring the right context by invoking an API through search through querying a database
00:09:11.640 | and so this is something that I think many of us engineers have been diving into in order to provide the right context to generate truthful answers right complement the parametric memory of your model with non-parametric information (a minimal retrieval sketch is shown below)
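As a minimal sketch of that retrieval step (the embedding model, the tiny corpus, and the prompt template here are illustrative assumptions, not something from the talk):

```python
# a minimal sketch of the retrieval step behind RAG, assuming the
# sentence-transformers package; model name and corpus are illustrative
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "Acme's refund policy allows returns within 30 days of purchase.",
    "Acme support is available Monday through Friday, 9am to 5pm.",
]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

question = "How long do I have to return an item?"
query_embedding = embedder.encode(question, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)[0]
context = "\n".join(corpus[hit["corpus_id"]] for hit in hits)

# the retrieved context is injected into the prompt, complementing the model's
# parametric memory with non-parametric information
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```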
00:09:23.800 | and that's RAG in a nutshell right so you've tried prompt engineering you've tried RAG you've eliminated quality
00:09:28.920 | problems and hallucinations but that's still not enough right so what do you try next
00:09:33.640 | well fine-tuning I think is the next stage and again I'm generalizing a very complicated and complex journey
00:09:40.840 | but in spite of your best efforts you've tried these techniques for maybe days weeks or even months
00:09:46.440 | and you still don't get to where you need to be to hit production and we're going to talk about this journey
00:09:51.560 | today right fine-tuning so when should you fine-tune a model again after you spend a lot of time in the
00:09:59.160 | first two phases of this journey so spending time on prompt engineering spending time on retrieval augmented
00:10:05.480 | generation and you don't see the results improve and generally what helps is whenever you use an LLM for a very
00:10:13.080 | specific task something that's very focused for instance classification information extraction
00:10:18.840 | trying to format a prompt using it for function calling if you can narrow the use case to something
00:10:26.200 | that is highly specific then you have an interesting use case for applying fine-tuning here and another
00:10:33.640 | requirement is to have a lot of your own high quality data to work with because that's going to be your
00:10:37.880 | fine-tuning data set that goes without saying but a model is only as good as the data that the
00:10:43.560 | model was trained on and we're going to apply this principle here in this tutorial and finally I think
00:10:48.760 | as an added incentive oftentimes we're all driven by economic incentive in the work we do for those of
00:10:54.520 | you who are feeling the pains of high gen ai bills whether it is with open ai or with a cloud vendor or a third
00:11:03.320 | party well this is generally a good reason to explore fine-tuning so we're going to go over all the steps
00:11:10.040 | now that we've kind of contextualized why fine-tuning and when to consider fine-tuning we're going to
00:11:15.720 | consider all the steps here in this continuous deployment cycle it starts with building your data
00:11:21.560 | set then running the fine-tuning of the model deploying that fine-tune llm into production so
00:11:28.520 | you can achieve scale and serve your your you know your customer needs or internal needs at high volumes
00:11:35.320 | and also evaluate quality and this is an iterative process there's not a single crank of the wheel this
00:11:40.920 | is not a fire and forget situation because data that your model sees in production is going to drift and
00:11:47.400 | evolve and so this is something that you're going to have to monitor you're going to have to update your data
00:11:51.240 | set you're going to have to fine-tune your model and i don't want to scare you away from doing this
00:11:55.640 | because it sounds fairly daunting and so by the end of this talk we'll have gone through a full crank of
00:12:02.280 | that wheel and hopefully through the saas tooling that i'm going to introduce you to it's going
00:12:08.840 | to feel a lot more approachable and hopefully i'll demystify the whole process of fine-tuning models
00:12:13.720 | so let's start with step one which is to build a fine-tuning data set now the data the model
00:12:21.080 | is trained on should ideally be real world data right it has to be as close as possible to what you're
00:12:27.000 | going to see in production so there's kind of a spectrum of ways to build and generate a data set
00:12:32.280 | ideally you build a data set out of real world prompts and real world human
00:12:38.920 | responses so for instance you have customer service you've logged calls with a customer agent you have
00:12:44.760 | an interaction between two humans that's a very good data set to work with right because it's human
00:12:49.480 | generated on both ends this is very high quality but not everyone has the ability to acquire
00:12:54.840 | this data set sometimes you're starting from scratch so not everyone has the luxury to start there
00:13:00.360 | there's also kind of an intermediary between real world and synthetic where you have real world prompts
00:13:05.640 | but ai generated responses and so this is kind of a good middle ground between cost and quality because
00:13:11.080 | you're starting from actual ground truth information that is derived from real data but the responses are
00:13:19.320 | generated by a high quality llm say gpt-4 or claude and actually openpipe is a solution that allows you
00:13:26.600 | to log the inputs and outputs of an llm like gpt-4 to build your data set for fine-tuning an llm so this is
00:13:34.680 | something that you know a lot of practitioners use and finally there's the fully synthetic data set using
00:13:41.560 | fully ai generated labels and oftentimes when you go on hugging face or kaggle you'll encounter data sets that have
00:13:48.680 | been built entirely synthetically and that's a great way to kind of get started on this journey
00:13:53.960 | and actually one of the data sets we're going to use today is from that latter category
00:13:58.440 | and of course i mean it probably goes without saying but in case people are not fully uh familiar with this
00:14:05.160 | notion you want to split your data set into a training and validation set because you don't want to
00:14:11.160 | evaluate your model on data that your fine tune has seen right and so many of you who are ml and ai
00:14:19.800 | engineers are already familiar with this but i just want to reiterate that this is important and finally
00:14:24.120 | you know the validation set is used for hyperparameter tuning and when you're deploying it and actually testing it
00:14:28.520 | on real world examples you want to have a third set outside of training and validation which is your test set (a minimal split sketch is shown below)
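A minimal sketch of such a split (the 80/10/10 ratio and the placeholder samples are illustrative):

```python
# a minimal sketch of a train/validation/test split; the ratio is illustrative
import random

samples = [{"id": i} for i in range(10_000)]  # stand-in for your fine-tuning examples
random.seed(42)
random.shuffle(samples)

n = len(samples)
train = samples[:int(0.8 * n)]                   # used to fit the fine-tune
validation = samples[int(0.8 * n):int(0.9 * n)]  # used for hyperparameter tuning
test = samples[int(0.9 * n):]                    # held out for the final quality evaluation
print(len(train), len(validation), len(test))    # 8000 1000 1000
```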
00:14:34.440 | that's a good way to do it now you've built your data set you're ready to fine tune your model and there's a lot of
00:14:40.120 | decisions that we need to make at this point and the first one is going to be open source versus closed source right
00:14:45.720 | and so who here just a raise of hands is using proprietary llms or gen ai models today from open ai
00:14:53.160 | anthropic mistral ai okay good amount of crowd who here has been using open source llms like llama some of the free mistral ai models
00:15:03.800 | okay so maybe a smaller crowd right and maybe that's because these models are not as capable and
00:15:10.520 | sophisticated but i'm going to walk you through how you can achieve better results if you do fine
00:15:18.280 | tuning right so of course the benefit of open source and this is why you know i'm obviously biased but i'm
00:15:24.200 | an open source advocate is that you get to have ownership over your model weights so once you've done the fine tuning you are
00:15:33.160 | the proprietor of the weights that are the result of this fine tuning process which means that you can
00:15:38.680 | choose how you deploy it how you serve it this is part of your ip and i find that this is a great
00:15:43.080 | thing for anyone who wants to embark on this fine tuning journey with proprietary solutions you're not
00:15:49.560 | quite the owner or you don't have the flexibility to decide to go with another vendor or to host the
00:15:54.680 | models yourself and so you're kind of locked into an ecosystem some people are comfortable with that others are
00:16:00.440 | less comfortable with it and many of the customers that we talk to they're very eager to jump on the
00:16:06.120 | open source train but they don't really know how to get started or you know where to start on this
00:16:11.640 | journey so hopefully this can help inform you how to take your first steps here into the
00:16:16.040 | world of open source then there's a question of like do i use a small model or a large model
00:16:21.400 | because for instance even in the world of open source you have models that are in the order of 8 billion
00:16:26.360 | parameters like llama 3 8b and then you have the large models like a mixtral 8x22b so this
00:16:33.080 | is a mixture-of-experts model with over 100 billion parameters very different beasts and we're going
00:16:38.840 | to see even larger models from meta and generally my recommendation here is well look the large models
00:16:45.320 | are amazing they have broader context windows they have higher capabilities of reasoning but they're also
00:16:52.440 | more expensive to fine tune and more expensive to serve and typically when you have to do a deployment
00:16:57.400 | you're going to have to acquire resources like h100s to run these models so generally start with a smaller
00:17:03.080 | model like a llama 3 8b and sometimes you'll be surprised by its ability to learn specific problems
00:17:08.920 | so that's my recommendation start with a smaller llama 3 8b or mistral 7b and if that doesn't work out for you then
00:17:18.440 | move towards larger and larger models and today we're going to be using this llama 3 8 billion parameter
00:17:23.960 | model there's also different techniques for fine tuning i'm going to go over this one fairly quickly
00:17:29.880 | but there are two classes of fine-tuning techniques one which is parameter-efficient fine-tuning it produces
00:17:37.880 | a lora and the other one is full-parameter fine-tuning which produces a checkpoint a lora is much
00:17:43.960 | smaller and more efficient in terms of memory footprint we're talking about 50 megabytes versus
00:17:50.440 | a checkpoint that is 15 gigabytes and so you can guess that because of its more compact representation
00:17:58.200 | you're able to serve it on a gpu that doesn't require as much onboard memory and you can even serve
00:18:06.280 | multiple loras at the same time so multiple fine-tunes on a gpu for inference as opposed to the
00:18:13.240 | checkpoints which require a dedicated gpu for every single fine tune so there's more flexibility in
00:18:19.240 | deployment and we're going to use that today we're actually going to serve these loras which are the
00:18:23.640 | result of parameter-efficient fine-tuning on a shared-tenancy endpoint with other users who have their own
00:18:30.360 | loras all running on the same server and that allows us to really reduce the cost of inference
00:18:36.280 | and there is a benefit to checkpoints though and full parameter fine tuning which is that there are more
00:18:43.400 | parameters to tune so it's a more flexible fine-tuning technique it allows the model to
00:18:49.640 | essentially achieve better results at more expensive tasks like logical reasoning
00:18:57.000 | but for very specialized tasks which is what we're going to look at today like classification or
00:19:01.000 | labeling or function calling a lora is just fine so we're going to use parameter-efficient fine-tuning (a minimal sketch of configuring a lora adapter is shown below)
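For a feel of what parameter-efficient fine-tuning looks like in code, here is a rough sketch using the Hugging Face peft library; the model id and LoRA hyperparameters are illustrative, and this is not necessarily what OpenPipe runs internally:

```python
# a rough sketch of attaching a LoRA adapter with the Hugging Face peft library;
# model id and hyperparameters are illustrative
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapters only on the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
# only a few million of the ~8 billion parameters end up trainable, which is why
# the exported adapter is tens of megabytes instead of a multi-gigabyte checkpoint
model.print_trainable_parameters()
```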
00:19:06.120 | and also when you're doing fine tuning you have to decide am i going to diy it or am i going to use
00:19:12.840 | saas so i'm sure some of you only like to diy things others like the convenience of saas and here i'm not
00:19:18.920 | going to take a side i think there's some great tools right now to diy your own fine tuning for
00:19:24.840 | for instance the open source project axolotl and actually at the conference there's the creator
00:19:29.960 | behind axolotl who you might be able to catch and you know the challenge here is that you have to
00:19:36.680 | find your own gpu resources you have to understand how to use these libraries even though they're
00:19:41.960 | easier than ever to adopt and you have to tune and tinker with settings and hyper
00:19:49.000 | parameters then there's saas which really aims to make it easy to embark on this journey companies like
00:19:54.760 | openpipe and there's many folks from the openpipe team at this conference today so if you can catch
00:20:00.520 | them please do talk to them and they're trying to lower the barrier of entry to fine tuning right to
00:20:04.920 | make it easy and to bring all this tooling all these libraries to make it as seamless as possible to
00:20:10.280 | for instance move from a gpt4 model to a fine tune with the least amount of steps in collecting your
00:20:15.880 | data fine tuning etc and so we're going to use saas today but if you feel more comfortable in this journey
00:20:21.800 | you might want to start with saas and then evolve into diy-ing it when it comes to deployment you
00:20:28.120 | have to navigate the same options right once you have a fine-tuned model now you need to decide well
00:20:32.120 | how am i going to serve it right because i need to generate maybe thousands millions or billions of
00:20:37.400 | tokens a day and so you need infrastructure you need gpus you need inference libraries some people like to
00:20:43.480 | diy it using libraries like vllm mlc llm tensorrt-llm hugging face tgi these are all things that you
00:20:52.760 | might have heard of these are all solutions to run models on your own infrastructure
00:21:00.040 | but you need to provision the resources you need to build the infrastructure to scale with demand and
00:21:06.920 | that can get tricky especially achieving high reliability under load that's a challenge that
00:21:12.200 | many people face as they scale their business up with saas you can essentially work with a third party
00:21:18.040 | like octo ai and obviously i'm a bit biased again i work there so i'm gonna insert a shameless plug for
00:21:24.680 | octo ai which allows users to get these fine tunes deployed on saas-based endpoints so endpoints very
00:21:32.840 | similar to the ones from open ai for instance if you're familiar with that or claude and it offers
00:21:40.120 | the ability to serve different kinds of customizations as well and so very quickly i want to go over the
00:21:45.880 | advantages of octo ai here first of all you get speed so with the llama 3 8 billion parameter model you can achieve
00:21:53.400 | around 150 tokens per second and we keep on improving that number because we've been applying our
00:21:57.960 | own in-house optimizations to the model serving layer it also has a significant cost advantage
00:22:03.880 | because it costs about 15 cents per million tokens compared to say gpt4 which costs 30 dollars per million tokens
00:22:11.240 | so that's where the 200x comes from and we don't charge a tax for customization so whether you're serving
00:22:16.120 | the base model or a fine-tune it's the same cost there's customization as i mentioned you can load your own lora and serve it
00:22:25.080 | and finally scale some of our customers generate up to billions of tokens per day on our endpoints
00:22:31.480 | i think we're serving over 20 billion tokens per day and so we've focused and spent a lot of time
00:22:38.280 | on improving robustness and also worth mentioning if saas doesn't cut it for you because you are working for a fortune 500
00:22:47.960 | or you're a software company or a healthcare company banking sector government and you need to deploy your llms inside of your
00:22:55.560 | environment either on-prem or in a vpc we also have a solution called octostack come talk to us at the booth
00:23:02.040 | so that's it for the shameless plug section let's go over to section four which is evaluating quality right
00:23:08.680 | we've talked about data set collection fine tuning deployment now quality evaluation and we could have an entire conference just dedicated on that
00:23:16.360 | i'm going to try to summarize it into kind of two classes of evaluation techniques that i've seen
00:23:22.360 | first of all you know can your quality be evaluated in a precise way that can be automated for instance
00:23:29.960 | you generate a program or sql command that can be run or can you for instance label or extract information
00:23:37.240 | or classify information in an accurate way that's a kind of pass or fail scenario right or formatting the output
00:23:43.560 | into a specific json format this is something that you can easily test as a pass-or-fail check (a minimal sketch is shown below)
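A minimal sketch of such an automated pass-or-fail check; the expected keys mirror the redaction use case used later in this talk and are assumptions:

```python
# a minimal sketch of an automated pass/fail check on output formatting;
# the expected keys are illustrative for the redaction use case
import json

def passes_format_check(raw_output: str) -> bool:
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # not valid JSON at all
    fields = payload.get("fields_to_redact")
    if not isinstance(fields, list):
        return False
    # every entry must name the string to redact and its PII class
    return all(isinstance(f, dict) and "string" in f and "pii_type" in f for f in fields)

print(passes_format_check('{"fields_to_redact": [{"string": "10.0.0.1", "pii_type": "IPADDRESS"}]}'))  # True
print(passes_format_check("not json"))  # False
```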
00:23:49.960 | then there's more of the soft evaluation for instance if i were to take an answer and say well which output
00:23:55.640 | is written in a more polite or professional way you can't really write a program to evaluate this unless
00:24:01.720 | you're using an llm of course right but you have to put yourself into maybe a 2020 or 2021 mindset
00:24:09.480 | before gpt was around well it'd be hard to build a program that can assess this right so generally you'd
00:24:15.640 | need a human in the loop to say which out of a or b is a better answer thankfully today we can use llms
00:24:23.800 | to automate that evaluation but keep in mind that for instance if you're using gpt4 to evaluate two
00:24:29.400 | answers well if you're comparing against gpt-4 it might favor its own answer and people have seen that
00:24:34.440 | in these kinds of evaluations so this is a whole science i mean we could have a whole conference
00:24:38.920 | just on this i just wanted to present the high level uh guidelines of this whole cycle of deploying
00:24:46.120 | fine-tuned llms and so really there is no finish line that's what i want to convey to you all that
00:24:52.520 | going through a single iteration is something that you might have to do on a regular basis maybe
00:24:58.760 | once a week maybe once a year it all depends on your use case and constraints now let's get a bit
00:25:06.120 | more practical let's switch over to our demo and so for those of you who came a little bit late
00:25:13.480 | there's a qr code here that you can scan and that will point you to our google colab and we also have
00:25:22.680 | a slack channel, let me see if i can pull it up, if you're in the slack channel for the ai engineer world's
00:25:29.800 | fair there is this quality optimization boot camp where you can ask questions here if you want to
00:25:35.080 | follow along and so we're going to try to go over the practical component in the next
00:25:40.520 | 25 minutes i just want to provide some context here the use case is uh personally identifiable
00:25:48.360 | information redaction we've taken this from a data set composed by ai4privacy called pii masking
00:25:55.960 | 200k it's one of the largest data sets of its kind it has 54 different pii classes so different kinds
00:26:03.880 | of sensitive data like the name the email address the physical address of someone
00:26:10.760 | their credit card information etc across 229 discussion subjects so that includes
00:26:18.120 | conversations from a customer ticket resolution conversations with a banker conversations between
00:26:24.440 | individuals etc what this data set looks like is as follows you're going to have a message an email
00:26:32.120 | here we have you know something that looks like it came out of an email that contains credit card
00:26:37.800 | information ip address maybe even a mention of a role or anything that is essentially personally
00:26:45.640 | identifiable and i've highlighted those in red because they will need to be redacted
00:26:51.160 | and after redaction we should get the following text that shows look here is this information that is now
00:26:57.880 | redacted anonymized but instead of just masking it we're actually telling it what kind of category this
00:27:03.880 | information belongs to right a credit card number an ip address or job title and this is how we're going to redact this text
00:27:10.760 | so where do llms come in the way we would use it is through function calling who here has used llms
00:27:19.400 | with tool calls or function calls okay so quite a few people you know and as many of us are aware this
00:27:27.480 | kind of what powers a lot of the agentic applications so this is a great use case for people who want to do
00:27:33.320 | function calling and are not seeing the results you know out of the box from say gpt4 that they would like
00:27:40.440 | to to see and in this case we're actually going to see that that these kind of state-of-the-art models
00:27:44.760 | aren't doing quite well at fairly large and complex function call use cases so to achieve this redaction
00:27:53.000 | use case we're going to pass in a system prompt we can also pass in a tool specification the system
00:27:58.920 | prompt says look you're an expert model trained to do redaction and you can call this function
00:28:03.320 | here are all the sensitive pii categories for you to redact and then as a user prompt we're going to
00:28:10.040 | pass in that email or that message and then the output is a tools call so it's not the redacted text
00:28:17.320 | it's actually a tools call to that redact function that's going to contain all the arguments for us to
00:28:23.080 | perform the redaction why am i doing this as opposed to spitting out the redacted text well that gives us
00:28:28.760 | flexibility in terms of how we want to redact this text we could choose to just replace that
00:28:34.520 | information with the pii class we can also completely obfuscate it or we could choose to use for instance a
00:28:42.120 | database that maps each pii entry to a fake substitute so that we have an email that kind of reads normally
00:28:51.160 | except the credit card the names the addresses are all made up but they will always map to the same individual (a minimal sketch of the simplest option is shown below)
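A minimal sketch of the simplest option, replacing each span with its PII class; the argument names and class labels are assumptions:

```python
# a minimal sketch of applying the redact tool call's arguments to the text;
# field names and PII class labels are illustrative
def apply_redaction(text: str, fields_to_redact: list[dict]) -> str:
    for field in fields_to_redact:
        text = text.replace(field["string"], f"[{field['pii_type']}]")
    return text

email = "Hi, my card number is 4111 1111 1111 1111 and my IP is 10.0.0.1."
args = [
    {"string": "4111 1111 1111 1111", "pii_type": "CREDITCARDNUMBER"},
    {"string": "10.0.0.1", "pii_type": "IPADDRESS"},
]
print(apply_redaction(email, args))
# Hi, my card number is [CREDITCARDNUMBER] and my IP is [IPADDRESS].
```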
00:28:58.600 | and so that allows us to do then more interesting processing on our data set right so that's why
00:29:04.680 | we're going to use function calling here and let's start to build the data set so i'm going to switch
00:29:09.000 | over to our notebook here this notebook is meant to be sort of self-explanatory so there's a bit of
00:29:15.640 | redundant context as part of the prerequisites you're going to have to get an account on octo ai and
00:29:22.280 | openpipe and these are the tools that we're going to use and if you want to run the evaluation function also
00:29:28.440 | provide your open ai key because we're going to compare against gpt4 so we're going to install the python packages
00:29:35.400 | initially only open ai and data sets from hugging face you can ignore this pip dependency error here
00:29:42.200 | which happens when you pip install data sets in a colab notebook but that's okay we can get past that
00:29:48.200 | you can enter your octo ai token and open ai api key at the beginning
00:29:54.440 | and i've already done this so we're going to start with the first phase which is to build a fine-tuning
00:29:58.520 | data set so we have this pii masking data set i'm going to pull it up from hugging face the pii
00:30:04.360 | masking 200k data set and you can see what it looks like it has the source text information as you can see
00:30:12.520 | these are snippets from emails for instance you have the target text that is
00:30:18.360 | redacted and the privacy mask that contains each one of the pii spans and the classes associated to them (a minimal loading sketch is shown below)
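A minimal loading sketch; the Hugging Face dataset id and the column names are assumptions based on what is shown on screen:

```python
# a minimal sketch of pulling the dataset; the dataset id and column names
# are assumptions based on what is shown on screen
from datasets import load_dataset

ds = load_dataset("ai4privacy/pii-masking-200k", split="train")
example = ds[0]
print(example["source_text"])   # the raw message containing PII
print(example["target_text"])   # the redacted version
print(example["privacy_mask"])  # the ground-truth PII spans and their classes
```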
00:30:25.080 | so this contains all the data the inputs and labels that we need to build our
00:30:30.600 | data set for fine-tuning and so really what we're going to do
00:30:35.720 | is start with the system prompt
00:30:41.480 | we're going to define our system prompt here which is again telling the model you're an
00:30:46.120 | expert model trained to redact information and here are the 56 categories explaining next to each
00:30:53.000 | category what that corresponds to and this is really the beauty of llm and sort of natural language entry
00:30:59.400 | is that in the old world when we're doing pii redaction we had to write complex regular expressions
00:31:05.480 | and here this is all done through just providing a category and a bit of a description here
00:31:10.920 | and the llm will naturally infer how to do the redaction we're also going to define the tool to call
00:31:19.080 | right so this is done essentially as a dictionary a json object and as you can see there is an array
00:31:26.360 | that contains dictionaries containing a string and a pii type and the string is the pii information the type is
00:31:35.000 | essentially one of 56 categories that we provide as an enum so right off the bat you can see that this
00:31:40.520 | tool call is you know a bit of a large function specification (a rough sketch of its shape is shown below)
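A rough sketch of that tool specification in the OpenAI tools format; the argument names are assumptions and the enum is truncated to three classes:

```python
# a rough sketch of the redact tool specification; argument names are assumptions
# and the enum is truncated rather than listing every PII class
redact_tool = {
    "type": "function",
    "function": {
        "name": "redact",
        "description": "Redact personally identifiable information found in the text.",
        "parameters": {
            "type": "object",
            "properties": {
                "fields_to_redact": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "string": {"type": "string"},
                            "pii_type": {
                                "type": "string",
                                "enum": ["CREDITCARDNUMBER", "IPADDRESS", "JOBTITLE"],  # ...and the rest
                            },
                        },
                        "required": ["string", "pii_type"],
                    },
                }
            },
            "required": ["fields_to_redact"],
        },
    },
}
```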
00:31:48.440 | and so let's load our data set from hugging face in this case it's going to take maybe a few seconds to load in that data set of 200 000
00:31:54.600 | entries and then what i have in the next cell when i'm downloading this data set is what i'm going to use
00:32:01.800 | to build my fine tuning training data set and here's the thing about fine tuning is that to build your
00:32:10.120 | data set you need to make it seem like you've essentially logged conversations with an llm right
00:32:15.800 | you're logging the prompts and the responses because that's how you're going to fine tune it you need to
00:32:20.200 | tell it this is the input with system prompt tools specification user prompt and here's the
00:32:27.720 | tool call response that i expect to see and so this cell here just sets it up so that we essentially
00:32:35.320 | have each training sample as a logged llm exchange we're going to see what that looks
00:32:41.800 | like in a second so we're going to build a 10 000 entry training data set for openpipe and that's
00:32:51.480 | going to be downloaded as this openpipe_dataset.jsonl file and so as i run the cell it's going to
00:32:58.040 | download this from colab (a rough sketch of how each training sample can be assembled is shown below) and then we'll switch over to openpipe to create a new data set
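A rough sketch of assembling one training sample and writing the JSONL file; the system prompt is truncated, redact_tool and ds come from the sketches above, and the exact layout OpenPipe expects may differ from this:

```python
# a rough sketch of turning each dataset row into a "logged" chat exchange and
# writing it out as JSONL; the exact schema OpenPipe expects may differ
import json

SYSTEM_PROMPT = "You are an expert model trained to redact PII..."  # truncated placeholder

def to_training_sample(row: dict) -> dict:
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": row["source_text"]},
            {
                # the ground-truth answer, expressed as the tool call we want the model to emit
                "role": "assistant",
                "tool_calls": [{
                    "type": "function",
                    "function": {
                        "name": "redact",
                        "arguments": json.dumps({"fields_to_redact": row["privacy_mask"]}),
                    },
                }],
            },
        ],
        "tools": [redact_tool],  # the tool specification sketched earlier
    }

with open("openpipe_dataset.jsonl", "w") as f:
    for row in ds.select(range(10_000)):  # ds is the dataset loaded earlier
        f.write(json.dumps(to_training_sample(row)) + "\n")
```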
00:33:07.400 | so once you're on open pipe console you have a project here i've generically named it project one
00:33:13.480 | you can access data sets and already as you can see i already have built a few data sets before
00:33:20.280 | but if you're a first time user you're not going to see anything under data sets so you can create
00:33:25.320 | a new data set here by clicking on this button and if you go under settings we can name our data set so
00:33:32.040 | i'm going to call it lunch and learn and today is june 26 all right so this is today's lunch and learn
00:33:42.920 | i'm going to call this my data set and under general i can upload the data that i just
00:33:49.560 | downloaded from my notebook openpipe_dataset.jsonl so this upload operation is going to take
00:33:56.280 | a few seconds or maybe a couple of minutes because what's going to happen on open pipe is not only we're
00:34:03.400 | uploading this data set but it's going to do some pre-processing here to split it into training and
00:34:09.880 | validation set it's also going to get it all formatted in a nice way so we can essentially look into the data
00:34:17.160 | set so you can see there's this little window here that shows that you're uploading the data set and
00:34:22.520 | that it is essentially being processed so while this is happening right we've prepared our data set and
00:34:30.600 | we're going to take a look at it in a second while it's being processed on open pipe but let's see how
00:34:35.160 | we're going to do the fine tuning in the next stage right so once we have our data set uploaded we're going
00:34:41.400 | to have this view on the data set that shows every single entry that we can peek into and how it's split
00:34:46.600 | into training and test set generally a 90 10 split and from that ui we can launch a fine tune
00:34:55.400 | and this is where we get to choose our base model and what we're going to choose is a llama 3 8 billion
00:35:00.360 | parameter model with 32k context width which is a fine-tune from nous research called the theta model
00:35:11.160 | and you can see that there's essentially a pricing here that is being estimated for this fine tune we
00:35:17.400 | have a substantial training set because it can range from say hundreds of samples to thousands to hundreds
00:35:23.160 | of thousands and the cost can scale up as you feed in more training samples but it will improve the
00:35:31.480 | accuracy and it also provides an estimated training price of forty dollars now that might seem like a lot
00:35:37.000 | especially when you're tinkering with fine-tuning but keep in mind some of the people that we work
00:35:41.160 | with they tend to spend tens of thousands or maybe hundreds of thousands of dollars a month on gen ai
00:35:46.680 | spend so this is absolutely something that you can do up front that will pay off and i believe that on
00:35:51.800 | openpipe if you get started you get a hundred dollars in credit so that allows you to run some fine
00:35:57.960 | tunes off the bat without necessarily having to pay so let's go over to openpipe
00:36:06.760 | and it is still uploading i think maybe the network is a bit slow but we're going to essentially
00:36:15.800 | start training at this point and once the training is happening we're going to then deploy the fine tune llm
00:36:23.160 | when training is done and what happens on open pipe is when you're done with training you're going to
00:36:28.040 | get an email when that training job is done it can take a few minutes so i'm going to pull a julia
00:36:32.600 | child here i'm going to stick the turkey in the oven and in the second oven i'm going to
00:36:37.160 | have a pre-baked turkey just so that we don't lose time but as you're going through this on your own
00:36:42.840 | keep in mind it's going to take a little bit of time to just kick off that whole fine tuning process but
00:36:47.400 | it's not that long because um you know you're training a fairly small model here all right so
00:36:54.360 | this is still uh saving but let's kind of take a look at what we've done so far right so we've built
00:37:00.040 | our data set using a synthetic data set from hugging face we format each input output pair from the data
00:37:05.960 | set as logged llm messages and this is essentially stored as a json file that we upload to openpipe
00:37:13.640 | and we produce 10 000 training samples we're fine-tuning a model on openpipe and openpipe
00:37:19.640 | uses parameter-efficient fine-tuning which produces a lora and we choose the llama 3 8 billion parameter
00:37:25.400 | model as the base and when we deploy what we're going to use here is octo ai so let's see this didn't
00:37:33.960 | finish uploading so i'm going to go into the one that i uploaded just a couple days ago just to
00:37:39.640 | essentially show you what you should see on the user interface so as you peruse through the training
00:37:47.960 | samples what you're going to see is an input column and output column and so on the left you have the
00:37:53.240 | input with the system prompt as you can see it's a big boy because it has all these different
00:37:58.760 | categories right that it needs to classify it also has the user prompt which is the message that we need to
00:38:05.160 | redact the tool choice and the tool specification here with all the different categories of pii types
00:38:11.000 | and then the output will be this tool call from the assistant's response
00:38:15.880 | and that will have this redact call along with its arguments the fields to redact as a list of dictionary
00:38:23.160 | entries containing string and pii type information right and so this is what we've passed into our fine-tuning
00:38:32.360 | data set into open pipe and this is still saving so i'm just going to go ahead and go to the model so
00:38:40.360 | once you have the data set uploaded again you hit this fine tune button and this is what's going to
00:38:46.760 | allow you to launch a fine tuning job right i can call this blah and this is where you select under
00:38:52.680 | this drop down the model that you want to fine tune this is again what we saw before training size is
00:38:58.600 | substantial i'm not going to hit start training because i already have a trained model but when
00:39:02.440 | you do that it's going to kick off the training and when it's done you'll get notified by email
00:39:06.120 | now let's fast forward let's assume i've already trained my model so i'm going to have this
00:39:11.400 | model here that's been fine-tuned from this data set i'm going to click on it as we can see it's
00:39:17.720 | a llama 3 8b model it's been fine-tuned over this 10 000 sample data set split into 9 000 training samples and
00:39:27.080 | a thousand test samples we can even look at the evaluation but going back to the model
00:39:35.160 | and the nice thing is that it's taking care of the hyperparameters like learning rate and number of epochs it kind
00:39:42.520 | of figures it out for you so you don't really have to tweak those settings and i find that to be very
00:39:46.920 | convenient especially for people who haven't yet built an understanding of how to tweak those values
00:39:51.000 | and the beauty of using open pipe is that you can now export the weights and be the owner of those
00:39:57.960 | weights right remember when we talked about open source it's really important to own the result of
00:40:02.280 | the fine tuning so you can download the weights in any format you want you have loras but also merged
00:40:08.040 | checkpoints so you can have a parameter-efficient representation as well as a checkpoint and so we've
00:40:14.360 | selected to export our model as an fp16 lora which is what we're going to use to upload our model on
00:40:20.280 | octo ai which is what we're going to use to deploy the model so now i can download the weights as a zip
00:40:25.400 | file and it's fairly small only 50 megabytes but i can also copy the link copy the url and this is what
00:40:33.800 | we're going to need to do in this tutorial so to deploy the model what we need to do is copy this url
00:40:41.000 | i'm going to download in the cell the octo ai cli this is a command line interface for users to upload
00:40:49.480 | their own fine tunes to what we call our asset library so this is a place where you can store your
00:40:54.760 | own checkpoints your own loras for not just llms but also models like stable diffusion if some of you are
00:41:00.760 | developers who also work in the image gen space and so we can serve these customized models on our platform
00:41:08.760 | and so we're going to upload this lora from openpipe to octo ai so we're going to log in just to
00:41:17.720 | make sure credentials are good and here we have a confirmation that our token is valid and in the cell we have to
00:41:25.960 | replace the lora url from the set-me placeholder to that url that i just copied here from download weights
00:41:31.160 | and keep in mind this might take a couple minutes to get the link to appear but once you have that link
00:41:38.040 | and again i'm kind of skipping ahead because when you're going to run this at your own time it might
00:41:43.480 | take a you know a few minutes to run the fine tune it might take a few minutes to download the weights
00:41:47.720 | but everything that i'm running here is essentially the steps that you'll take yourself
00:41:51.960 | and what i'm doing here is passing in this url here and setting a lora asset name in my octo ai asset
00:42:01.560 | library so i can then create this asset from this lora as a safetensors file based on the llama 3 8b model
00:42:14.040 | i'm going to name it let's see
00:42:16.120 | it seems like something has failed here so let's try to run it again
00:42:38.520 | and so what this is doing is uh let's see
00:42:40.920 | usually that should have worked so what should happen here is at this point
00:42:54.680 | once you've taken the url of your fine-tuned asset you should be able to host it on our asset library and then
00:43:04.680 | from there serve it to start running some inferences so this lora upload step didn't quite
00:43:15.800 | work here so pedro are you able to maybe double check with product whether this capability is working
00:43:23.160 | uh this isn't a good demo unless something fails and so uh yeah you know i just tested it earlier today
00:43:29.800 | and it was working flawlessly so
00:43:32.200 | uh let's see i might have to list my assets so i can pull an old one
00:43:41.560 | actually one second pedro can you tell me what the command is
00:43:47.880 | to list the assets that are on there i think it might be octoai asset list all right let's
00:43:56.280 | okay there we go so i'm going to pull from an asset that i uploaded earlier
00:44:08.280 | okay so i'm gonna have to look into why that step failed but let's
00:44:34.280 | let's try this okay so i'm gonna use an asset that i uploaded earlier i'm not sure why this
00:44:40.600 | didn't work but i'll make sure that this is working for you all to reproduce this step
00:44:43.960 | and i'm gonna set the lora asset name equal to this all right so these other loras i uploaded using
00:44:55.960 | the exact same steps as i used for this tutorial so we'll make sure to get to the bottom of this and
00:45:02.280 | we'll use the slack channel here for folks who want to run through this step but i'm just going
00:45:06.840 | to run an example inference here on this asset that i pulled from open pipe
00:45:11.800 | and so again we have our system prompt we're going to pass in this message this
00:45:20.440 | email as our test prompt and then when we're invoking this octo ai endpoint we're using the standard chat
00:45:27.160 | completions api from openai and what we're passing here is this openpipe llama 3 8b 32k model and we pass
00:45:36.920 | in this extra argument for parameter-efficient fine-tuning with the lora asset name that we just uploaded to the asset library (a rough sketch of that call is shown below)
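A rough sketch of that request; the base_url, model name, and the peft extra argument used to select the LoRA are assumptions based on what is shown in the demo:

```python
# a rough sketch of invoking the fine-tune through an OpenAI-compatible endpoint;
# base_url, model name and the "peft" extra argument are assumptions
from openai import OpenAI

OCTOAI_TOKEN = "<your octo ai token>"
LORA_ASSET_NAME = "<the lora asset uploaded above>"
test_email = "Hi, my card number is 4111 1111 1111 1111 and my IP is 10.0.0.1."

client = OpenAI(base_url="https://text.octoai.run/v1", api_key=OCTOAI_TOKEN)
response = client.chat.completions.create(
    model="openpipe-llama-3-8b-32k",       # assumed name of the base model on the endpoint
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # the redaction system prompt from earlier
        {"role": "user", "content": test_email},
    ],
    tools=[redact_tool],                   # the tool specification sketched earlier
    extra_body={"peft": LORA_ASSET_NAME},  # point the request at the uploaded LoRA asset
)
print(response.choices[0].message.tool_calls)
```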
00:45:44.600 | and as we can see the response here contains the tool calls and the call to the function
00:45:51.800 | that will do the redaction so this is behaving exactly as we intended to so now we can move on to the quality
00:45:58.600 | evaluation for quality evaluation what we've done is use essentially an accuracy metric thankfully we
00:46:06.600 | have a ground truth right from our data set all the exchanges have been labeled with privacy mask information
00:46:13.880 | that we can use as ground truth so that makes evaluating or scoring our results fairly easy we don't have to
00:46:19.800 | use an llm for instance for that we can actually use more traditional techniques of accuracy evaluation
00:46:25.640 | and so we have a metric that we've built it assigns a score that gets penalized when pii information was
00:46:33.320 | missed or mistakenly added i.e. false negatives and false positives and then we use a similarity distance metric
00:46:39.880 | to match the responses from the llms against our ground truth (a rough sketch of such a scorer is shown below) so for illustration purposes we have
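A rough sketch of that kind of scorer; this is an illustrative metric, not the exact one in the notebook:

```python
# a rough sketch of an accuracy metric for redaction: match predicted PII entries
# against the ground truth with a string-similarity measure and penalize both
# missed and spurious entries; illustrative, not the notebook's exact metric
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def score_redaction(predicted: list[dict], ground_truth: list[dict]) -> float:
    if not ground_truth:
        return 1.0 if not predicted else 0.0
    total = 0.0
    for truth in ground_truth:
        # best match among the predictions for this ground-truth PII span
        best = max((similarity(p["string"], truth["string"]) for p in predicted), default=0.0)
        total += best
    # dividing by the longer list penalizes both misses and false positives
    return total / max(len(ground_truth), len(predicted))

truth = [{"string": "Billy", "pii_type": "MIDDLENAME"}, {"string": "10.0.0.1", "pii_type": "IPADDRESS"}]
pred = [{"string": "Billy", "pii_type": "FIRSTNAME"}]  # found the name span, missed the IP
print(round(score_redaction(pred, truth), 2))  # 0.5
```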
00:46:47.320 | for instance this pii information that's been redacted that's a score of 1.0 because it's the perfect match
00:46:53.560 | our fine-tune might for instance miss the fact that billy was the middle name and might interpret it as
00:46:59.720 | first name in that case we're still attributing a high score because it's close enough and probably
00:47:04.920 | for a practical use case that would be good enough but for instance upon calling gpt-4 it fails to identify
00:47:11.240 | two of the three pieces of information that we have to redact and so the score is about a third here right
00:47:16.680 | so in this case what we're going to do here i'm just going to reduce the test size to 100 samples
00:47:23.880 | and i am going to run this evaluation inside of this cell it's going to bring us 100 test samples
00:47:34.520 | that we can then run our evaluation metric and get our overall scoring out of so if we look at
00:47:43.960 | you know the output from the cell essentially we're just invoking back to back the fine-tune
00:47:52.120 | running on octo ai and we're invoking gpt-4 on openai to do the results collection so we're going to
00:47:59.640 | collect some results here and uh once we've collected the results once we get to 100 i think we're getting
00:48:06.200 | pretty close here we can run the quality evaluation metric and of course i invite you to run it on more
00:48:11.720 | samples maybe a thousand or ten thousand it just gets more expensive as you're using gpt4 you know to
00:48:18.840 | run a hundred samples it costs about a dollar in inference so then a thousand samples cost ten dollars
00:48:26.120 | and now we're going to score it all right so we're going to go through every single entry we have our
00:48:32.040 | ground truth information we have our eval and labels from gpt4 and our eval and labels from our fine
00:48:40.520 | tune and we can see that right off the bat the fine-tune is actually better at finding the pii to redact
00:48:47.080 | here gpt-4 scored only 0.49 whereas our fine-tune achieves 0.85 and here 0.3 for gpt-4 versus 1.0 for
00:48:57.400 | the fine-tune so the fine-tune overall is performing better and once we aggregate and average the score
00:49:02.600 | gpt4 achieved 0.68 out of 1 whereas our fine tune achieves 0.97 and so that's the difference between
00:49:13.400 | prototype and production right you're expected to achieve somewhere in the single nine or two nines
00:49:18.440 | of accuracy and this is what this technique allows you to achieve and again i want to reiterate that
00:49:25.000 | in terms of cost gpt-4 costs upwards of 30 dollars per million tokens generated whereas llama 3 8b on octo ai
00:49:33.240 | costs just 15 cents that's a 200x difference right so with that i just want to conclude
00:49:40.440 | with some takeaways on fine-tuning right fine-tuning is a journey but a very rewarding
00:49:47.880 | journey there's truly no finish line here you need to attempt fine tuning after you already tried other
00:49:53.880 | techniques like prompt engineering retrieval augmented generation but once you decide to embark data is
00:49:59.960 | very important collecting your data set because your model is only as good as the data it's trained on
00:50:05.320 | you need to make sure to continuously monitor quality to retune your model as needed
00:50:10.200 | and thankfully we have solutions like octo ai and openpipe to really make
00:50:17.080 | this more approachable and easy to do and it's easier than ever it's only getting easier but
00:50:22.840 | maybe a year ago it was only reserved for the most adventurous and sophisticated users and now we've
00:50:27.800 | really lowered the barrier of entry and when you do it right you can achieve really significant improvements
00:50:33.000 | in accuracy as well as great reduction in costs i wanted to thank you for sitting here with me over the last
00:50:40.280 | 50 minutes i want to reiterate a few calls to action so go to octoai.cloud to learn how to use our solutions
00:50:47.240 | and endpoints but also come to our booth and so we're located at this g7 booth and we're going to be here
00:50:55.640 | today and tomorrow if you want to chat about our sas endpoints about our ability to deploy in an enterprise
00:51:02.360 | environment and also i want to give a shout out to my colleague here pedro if you're curious about all the know-how
00:51:08.360 | that goes behind how we optimize our models in production because our background is in compiler
00:51:12.920 | optimization in system optimization infrastructure optimization we've applied all of this to be able
00:51:18.760 | to serve our models you know with positive margins we're not doing this at a loss sorry we're not wasting
00:51:25.080 | our vc money here we're actually building all this know-how into making sure that ai inference is as efficient
00:51:32.440 | as it could be so there's going to be a talk on that and also make sure if you get a chance assuming
00:51:39.960 | you've joined our slack channel which is the following one so if you're on the slack
00:51:48.440 | org for the event go to llm quality optimization boot camp you can ask us any questions and if you fill out
00:51:56.920 | the survey that pedro is going to post we're going to give you an additional 10 dollars in credits so
00:52:03.400 | that doesn't seem like a lot but that's a ton you know if it's 15 cents per million tokens that's a lot
00:52:09.160 | of tokens that you can generate for free so we can give you an additional 10 dollars for filling out
00:52:15.400 | the survey which should take about you know 20 to 30 seconds so i'm going to be around and also you
00:52:21.080 | can find me at the booth this afternoon in case you have any questions but i'd like to thank
00:52:26.040 | you all for sitting through this talk and hopefully you've learned something from this and
00:52:30.040 | hopefully you feel like i've demystified this idea of trying fine-tuning on your own give this notebook
00:52:35.880 | a try assuming of course we've fixed this lora upload issue and yeah thank you all and maybe ask
00:52:42.200 | me some questions after this talk thanks