LLM Quality Optimization Bootcamp: Thierry Moreau and Pedro Torruella

00:00:00.000 |
so welcome everyone thanks for making it to this lunch and learn my goal today is to make sure that 00:00:19.360 |
I get to share my knowledge and experience on LLM fine-tuning and just to get a quick sort of pull 00:00:28.920 |
from the audience here how many of you have heard of the concept of fine-tuning here okay so 00:00:36.240 |
quite a few people how many of you have actually had hands-on experience in fine-tuning LLMs okay 00:00:42.540 |
all right that's pretty good that's more than I'm usually used to I mean it's quite fantastic 00:00:48.180 |
that in this conference the makeup of AI engineers is close to 100% that's not something I'm generally 00:00:54.660 |
used to when presenting at other you know hackathons and conferences so I feel like I'm speaking to the 00:01:01.120 |
right crowd so just to kind of contextualize this talk really I'm trying to address two pains that 00:01:10.020 |
a lot of Gen AI engineers face and to get a sense of where you are in your journey how many really 00:01:17.340 |
identify and can relate to the first one which is my Gen AI spend has gone through the roof okay yeah all 00:01:25.680 |
right and how many of you are in this other segment of this journey which is you know you've built POC's 00:01:31.980 |
it's showing promise but you haven't yet quite met this quality bar to go to production can I get a 00:01:39.480 |
sense of all right so so I think you know we have a good amount of good fraction of the audience that can 00:01:45.960 |
relate to one of these two problems myself I'm a co-founder at Octo AI and I'm going to talk a little 00:01:52.380 |
bit more about what we do but the customers I've been working with they feel those pains in a very real way we're 00:01:58.980 |
talking about tens of thousands if not hundreds of thousands of dollars in monthly bills and perhaps 00:02:06.360 |
even having issues trying to go to production because the quality bar hasn't yet been met so the 00:02:12.960 |
overview of this 15-minute talk is going to be spent on understanding the why of fine-tuning really try to 00:02:19.800 |
understand when to use fine-tuning it's not really a silver bullet for all the problems you're going to 00:02:24.960 |
face but when used right in the right context for the right problem it can really deliver results 00:02:30.340 |
I'm also going to try to contextualize this notion of fine-tuning within the crawl walk and 00:02:36.340 |
run of LLM quality optimization because there's different techniques that you should attempt before trying 00:02:42.340 |
to do fine-tuning but finally when you're convinced that this is the right thing for you I'm going to 00:02:47.340 |
talk about this continuous deployment cycle of fine-tuned LLMs so we're going to go through today 00:02:53.660 |
over a whole crank of that wheel of this deployment cycle composed of you know data set 00:03:01.200 |
collection model fine-tuning deployment and evaluation and really I'm trying to demystify this 00:03:07.560 |
whole journey to you all because in the next 15 minutes we're actually going to go through this whole 00:03:11.700 |
process and hopefully that's something that you're going to feel comfortable going through and you know 00:03:15.540 |
applying to your own data set to your own problems and so for illustrating today's use case we're going 00:03:21.240 |
to use this personally identifiable information redaction use case now that's a pretty traditional 00:03:26.760 |
sort of data scrubbing type of application but we're going to use LLMs and we're going to see that we can 00:03:32.460 |
essentially achieve state-of-the-art accuracy while keeping efficiency at the highest using essentially very 00:03:41.560 |
compact very lightweight models that have been fine-tuned for that very task so again trying to motivate this 00:03:48.160 |
talk what limits gen AI adoption in most businesses today based on the conversations that I've had in the 00:03:53.640 |
field discussions I've had customers and developers the first one is there's a limited availability of GPUs I think we're all 00:04:02.220 |
familiar with this problem it's one of the reasons why Nvidia is so successful lately I mean everyone wants 00:04:07.920 |
to have access to those precious resources that allow us to run gen AI at scale and that can also drive costs up right so we have to be smart about how to use those GPU resources 00:04:19.840 |
and also when people build POCs it displays and shows promise but sometimes you don't reach the expected quality bar to go to production 00:04:29.540 |
and so on this chart where the Y axis is cost and the X axis symbolizes quality maybe many people start on that green cross here right in this upper quadrant at very high cost maybe not having met the quality bar that's your first POC 00:04:49.600 |
but really to go to production you need to end on the opposite quadrant right lower cost higher quality where you met the bar you're able to run this in a way that essentially is margin positive and many of us are on this journey to reach that point of you know profitability 00:05:07.240 |
so we're going to learn today how to use and how to fine tune an LLM now fine tuning is a method that we're going to use to improve LLM quality but as a bonus we're also going to be showing how to reduce cost significantly and I use quality as the title of this talk because really I think many of us AI engineers really care about reaching the high quality bar when we're using LLMs and you know the goal of today's talk is to instill 00:05:37.000 |
you with some knowledge on how to tackle this journey and so in terms of tools that we're going to use today we're going to use OpenPipe which is a SaaS solution for fine tuning that really lowers the barrier of entry for people to run their own fine tunes 00:05:51.640 |
you don't need hardware or cloud instances to get started and we're going to use this to deliver quality improvements over state-of-the-art LLMs and of course since I work at Octo AI I'm going to also be using Octo AI here for the LLM deployments 00:06:06.280 |
and that's going to be the solution that we're going to use to achieve cost efficiency at scale and really the key here is to be able to build on a solution that is designed to serve models at production scale volumes 00:06:22.120 |
and just to give you a little bit of a sneak peek in terms of the results that we're going to showcase today after you go through this whole tutorial and this is something that you're going to be able to reproduce independently so you know all the code 00:06:34.120 |
is there for you to go through we're going to be able to show that we can achieve 47 percent better accuracy at the tasks that I'm going to showcase today using this OpenPipe fine tuning 00:06:44.120 |
and by deploying the model on Octo AI we're going to achieve this I mean it seems kind of ridiculous 99.5 percent reduction in cost this is really a 200x reduction in cost here from GPT-4 Turbo to Llama 3 and mostly because this is a much smaller model it's open source and we've optimized the hell out of this model to serve it cheaply on Octo AI so I'm going to explain how this is achieved but I hope your interest at least has been piqued by those results that you yourself 00:07:14.040 |
can reproduce so when to use fine tuning again it's not really a silver bullet for all your quality problems it has its right place in time so I like to contextualize it within the crawl walk run of quality optimization right and as Gen AI engineers many of us have embarked on this journey we're at different stages of this journey and really it should always start with prompt engineering right and many of you are familiar with this concept you start with a model you're 00:07:43.960 |
trying to have it accomplish a task and sometimes you don't really manage to see the result you expect to see so you're going to try prompt engineering and there's different techniques of varying levels of sophistication 00:07:55.320 |
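as a concrete sketch of the kind of prompt engineering meant here, a minimal few-shot prompting example with the OpenAI-compatible chat completions API; the model name and the toy examples are illustrative assumptions, not from the talk:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Few-shot prompting: show the model a couple of worked examples before the real input.
messages = [
    {"role": "system", "content": "Classify support tickets as 'billing', 'bug', or 'other'."},
    {"role": "user", "content": "I was charged twice this month."},
    {"role": "assistant", "content": "billing"},
    {"role": "user", "content": "The export button crashes the app."},
    {"role": "assistant", "content": "bug"},
    {"role": "user", "content": "Can I change the email on my account?"},
]

response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)  # expected: "other"
```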
this talk is not about prompt engineering so you know you can improve prompt specificity there's few-shot prompting where you can provide examples to improve essentially the quality of your output there's also chain of thought prompting I mean some of you probably have heard these concepts but this is where you should get started right make sure that given the model given those weights 00:08:13.880 |
you just try to improve the prompt to get the right results sometimes that's not enough and there's a second class of solutions which I like to map to the walk stage 00:08:23.720 |
retrieval augmented generation right we've probably seen a lot of talks on RAG today and throughout this conference so you know there's hallucinated results sometimes the answer is not truthful well why is that it's because the weights of the model that is really the parametric memory of your model is 00:08:43.800 |
limited to you know the point in time in which the model was trained so when you try to ask questions on data it hasn't seen or information that's more recent than when the model was trained it's not going to know how to respond right so the key here is to provide the right amount of context 00:08:59.800 |
and so this is achieved through similarity search for instance in a vector database through function calling to bring the right context by invoking an API through search through querying a database 00:09:11.640 |
and so this is something that I think many of us engineers have been diving into in order to provide the right context to generate truthful answers right complement the parametric memory of your model with non parametric information 00:09:23.800 |
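a minimal retrieval augmented generation sketch to make that concrete; the embedding model, the toy document store, and the chat model here are illustrative assumptions:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

# Toy non-parametric memory: a handful of documents embedded up front.
docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support hours are 9am to 5pm Pacific, Monday through Friday.",
    "The Pro plan includes priority support and SSO.",
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    # Similarity search: cosine similarity is a dot product on normalized vectors.
    q = embedder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(doc_vecs @ q)[::-1][:k]
    return [docs[i] for i in top]

query = "Can I get my money back after three weeks?"
context = "\n".join(retrieve(query))

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": query},
    ],
)
print(response.choices[0].message.content)
```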
and that's RAG in a nutshell right so you've tried prompt engineering you've tried RAG you've eliminated quality 00:09:28.920 |
problems and hallucinations but that's still not enough right so what do you try next 00:09:33.640 |
well fine-tuning I think is the next stage and again I'm generalizing a very complicated and complex journey 00:09:40.840 |
but in spite of your best efforts you've tried these techniques for maybe days weeks or even months 00:09:46.440 |
and you still don't get to where you need to be to hit production and we're going to talk about this journey 00:09:51.560 |
today right fine-tuning so when should you fine-tune a model again after you spend a lot of time in the 00:09:59.160 |
first two phases of this journey so spending time on prompt engineering spending time on retrieval augmented 00:10:05.480 |
generation and you don't see the results improve and generally what helps is whenever you use an LLM for a very 00:10:13.080 |
specific task something that's very focused for instance classification information extraction 00:10:18.840 |
trying to format a prompt using it for function calling if you can narrow the use case to something 00:10:26.200 |
that is highly specific then you have an interesting use case for applying fine-tuning here and another 00:10:33.640 |
requirement is to have a lot of your own high quality data to work with because that's going to be your 00:10:37.880 |
fine-tuning data set that goes without saying but a model is only as good as the data that the 00:10:43.560 |
model was trained on and we're going to apply this principle here in this tutorial and finally I think 00:10:48.760 |
as an added incentive oftentimes we're all driven by economic incentive in the work we do for those of 00:10:54.520 |
you who are feeling the pains of high gen ai bills whether it is with open ai or with a cloud vendor or a third 00:11:03.320 |
party well this is generally a good reason to explore fine-tuning so we're going to go over all the steps 00:11:10.040 |
now that we've kind of contextualized why fine-tuning and when to consider fine-tuning we're going to 00:11:15.720 |
consider all the steps here in this continuous deployment cycle it starts with building your data 00:11:21.560 |
set then running the fine-tuning of the model deploying that fine-tune llm into production so 00:11:28.520 |
you can achieve scale and serve your customer needs or internal needs at high volumes 00:11:35.320 |
and also evaluate quality and this is an iterative process there's not a single crank of the wheel this 00:11:40.920 |
is not a fire and forget situation because data that your model sees in production is going to drift and 00:11:47.400 |
evolve and so this is something that you're going to have to monitor you're going to have to update your data 00:11:51.240 |
set you're going to have to fine-tune your model and i don't want to scare you away from doing this 00:11:55.640 |
because it sounds fairly daunting and so by the end of this talk we'll have gone through a full crank of 00:12:02.280 |
that wheel and hopefully you know through this saas tooling that i'm going to introduce you to it's going 00:12:08.840 |
to feel a lot more approachable and hopefully i'll demystify the whole process of fine-tuning models 00:12:13.720 |
so let's start with step one which is to build a fine-tuning data set now the data that the model 00:12:21.080 |
is trained on should ideally be real world data right it has to be as close as possible to what you're 00:12:27.000 |
going to see in production so there's kind of a spectrum of ways to build and generate a data set 00:12:32.280 |
ideally you build a data set out of real world prompts and real world human 00:12:38.920 |
responses so for instance you have customer service you've logged calls with a customer agent you have 00:12:44.760 |
an interaction between two humans that's a very good data set to work with right because it's human 00:12:49.480 |
generated on both ends this is very high quality but not everyone has the ability to acquire 00:12:54.840 |
this data set sometimes you're starting from scratch so not everyone has a luxury to start there 00:13:00.360 |
there's also kind of an intermediary between real world and synthetic where you have real world prompts 00:13:05.640 |
but ai generated responses and so this is kind of a good middle ground between cost and quality because 00:13:11.080 |
you're starting from actual ground truth information that is derived from real data but the responses are 00:13:19.320 |
generated by a high quality llm say gpt4 or claude and actually open pipe is a solution that allows you 00:13:26.600 |
to log the inputs and outputs of an llm like gpt4 to build your data set for fine-tuning an llm so this is 00:13:34.680 |
something that you know a lot of practitioners use and finally there's the fully synthetic data set using 00:13:41.560 |
fully ai generated labels and oftentimes when you go on hugging face or kaggle you'll encounter data sets that have 00:13:48.680 |
been built entirely synthetically and that's a great way to kind of get started on this journey 00:13:53.960 |
and actually one of the data sets we're going to use today is from that latter category 00:13:58.440 |
and of course i mean it probably goes without saying but in case people are not fully uh familiar with this 00:14:05.160 |
notion you want to split your data set into a training and validation set because you don't want to 00:14:11.160 |
evaluate your model on data that your fine tune has seen right and so many of you who are ml and ai 00:14:19.800 |
engineers are already familiar with this but i just want to reiterate that this is important and finally 00:14:24.120 |
you know this is used for hyper parameter tuning and when you're deploying it and actually testing it 00:14:28.520 |
on real world examples you want to have a third set outside of training and validation which is your test set 00:14:34.440 |
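a minimal sketch of that three-way split with the Hugging Face datasets library; the dataset id is assumed from the talk and the split ratios are just an example:

```python
from datasets import load_dataset

# Load the PII masking dataset (id assumed; check the dataset card on Hugging Face).
ds = load_dataset("ai4privacy/pii-masking-200k", split="train")

# Carve out a held-out test set first, then split the rest into train and validation.
split = ds.train_test_split(test_size=0.1, seed=42)
train_val, test = split["train"], split["test"]
split = train_val.train_test_split(test_size=0.1, seed=42)
train, validation = split["train"], split["test"]

print(len(train), len(validation), len(test))
```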
that's a good way to do it now you've built your data set you're ready to fine tune your model and there's a lot of 00:14:40.120 |
decisions that we need to make at this point and the first one is going to be open source versus closed source right 00:14:45.720 |
and so who here just like raise of hands is using proprietary llms or gen ai models today from open ai 00:14:53.160 |
anthropic mistral ai okay good amount of crowd who here has been using open source llms like llama some of the free mistral ai models 00:15:03.800 |
okay so maybe a smaller crowd right and maybe that's because these models are not as capable and 00:15:10.520 |
sophisticated and but i'm going to walk you through how you can achieve better results if you do fine 00:15:18.280 |
tuning right so of course the benefit of open source and this is why you know i'm obviously biased but i'm 00:15:24.200 |
an open source advocate is that you get to have ownership over your model weights so once you've done the fine tuning you are 00:15:33.160 |
the proprietor of the weights that are the result of this fine tuning process which means that you can 00:15:38.680 |
choose how you deploy it how you serve it this is part of your ip and i find that this is a great 00:15:43.080 |
thing for anyone who wants to embark on this fine tuning journey with proprietary solutions you're not 00:15:49.560 |
quite the owner or you don't have the flexibility to decide to go with another vendor to host the 00:15:54.680 |
models yourself and so you're kind of locked into an ecosystem some people are comfortable with that others are 00:16:00.440 |
less comfortable with it and many of the customers that we talk to they're very eager to jump on the 00:16:06.120 |
open source train but they don't really know how to get started or you know where to start on this 00:16:11.640 |
journey so hopefully this can help inform you how to take your first steps here into the 00:16:16.040 |
world of open source then there's a question of like do i use a small model or a large model 00:16:21.400 |
because for instance even in the world of open source you have models that are in the order of 8 billion 00:16:26.360 |
parameters like llama 3 8b and then you have the large models like mixtral 8x22b so this 00:16:33.080 |
is a mixture of expert model with over 100 billion parameters very different beasts and we're going 00:16:38.840 |
to see even larger models from meta and generally my recommendation here is well look the large models 00:16:45.320 |
are amazing they have broader context windows they have higher capabilities of reasoning but they're also 00:16:52.440 |
more expensive to fine tune and more expensive to serve and typically when you have to do a deployment 00:16:57.400 |
you're going to have to acquire resources like h100s to run these models so generally start with a smaller 00:17:03.080 |
model like a llama 3 8b and sometimes you'll be surprised by its ability to learn specific problems 00:17:08.920 |
so that's my recommendation start with a smaller llama 3 8b or mistral 7b and if that doesn't work out for you then 00:17:18.440 |
move towards larger and larger models and today we're going to be using this llama 3 8 billion parameter 00:17:23.960 |
model there's also different techniques for fine tuning i'm going to go over this one fairly quickly 00:17:29.880 |
but there's two classes of fine tuning techniques one which is parameter efficient fine tuning it produces 00:17:37.880 |
a lora and the other one is a full parameter fine tuning which produces a checkpoint a lora is much 00:17:43.960 |
smaller and more efficient in terms of memory footprint we're talking about 50 megabytes versus 00:17:50.440 |
a checkpoint that is 15 gigabytes and so you can guess that because of its more compact representation 00:17:58.200 |
you're able to serve it on a gpu that doesn't require as much onboard memory and you can even serve 00:18:06.280 |
multiple loras at the same time so multiple fine tunes on a gpu for inference as opposed to the 00:18:13.240 |
checkpoints which require a dedicated gpu for every single fine tune so there's more flexibility in 00:18:19.240 |
deployment and we're going to use that today we're actually going to serve these loras which are the 00:18:23.640 |
result of parameter efficient fine tuning on a shared tenancy endpoint with other users who have their own 00:18:30.360 |
loras all running on the same server and that allows us to really reduce the cost of inference 00:18:36.280 |
and there is a benefit to checkpoints though and full parameter fine tuning which is that there are more 00:18:43.400 |
parameters to tune so it's a more flexible fine tuning technique it allows the model to have 00:18:49.640 |
essentially achieve better results at more expensive tasks like logical reasoning 00:18:57.000 |
but for very specialized tasks which is what we're going to look at today like classification or 00:19:01.000 |
labeling or function calling a lora is just fine so we're going to use parameter efficient fine tuning 00:19:06.120 |
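for the curious, this is roughly what the parameter efficient path looks like if you wire it up yourself with the Hugging Face peft library; OpenPipe takes care of this for you, and the rank and target modules below are just illustrative defaults:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Frozen base model; only the small LoRA adapter matrices get trained.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

lora_cfg = LoraConfig(
    r=16,                                 # adapter rank, controls adapter size
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # which attention projections get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the 8B parameters
```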
and also when you're doing fine tuning you have to decide am i going to diy it or am i going to use 00:19:12.840 |
saas so i'm sure some of you only like to diy things others like the convenience of saas and here i'm not 00:19:18.920 |
going to take a side i think there's some great tools right now to diy your own fine tuning for 00:19:24.840 |
instance the open source project axolotl and actually at the conference there's the creator 00:19:29.960 |
behind axolotl who you might be able to catch and you know the challenge here is that you have to 00:19:36.680 |
find your own gpu resources you have to understand how to use these libraries even though they're 00:19:41.960 |
easier than ever to adopt and you have to tune and tinker with you know settings and hyper 00:19:49.000 |
parameters then there's saas which really aims to make it easy to embark on this journey companies like 00:19:54.760 |
open pipe and there's many folks from open pipe at this conference today so if you can catch 00:20:00.520 |
them please do talk to them and they're trying to lower the barrier of entry to fine tuning right to 00:20:04.920 |
make it easy and to bring all this tooling all these libraries to make it as seamless as possible to 00:20:10.280 |
for instance move from a gpt4 model to a fine tune with the least amount of steps in collecting your 00:20:15.880 |
data fine tuning etc and so we're going to use saas today but as you feel more comfortable in this journey 00:20:21.800 |
you might want to start with saas and then evolve into diy-ing it when it comes to deployment you 00:20:28.120 |
have to navigate the same options right once you have a fine-tuned model now you need to decide well 00:20:32.120 |
how am i going to serve it right because i need to generate maybe thousands millions or billions of 00:20:37.400 |
tokens a day and so you need infrastructure you need gpus you need inference libraries some people like to 00:20:43.480 |
diy it using libraries like vllm mlc llm tensorrt-llm hugging face tgi these are all things that you 00:20:52.760 |
might have heard of these are all solutions to run models on your own infrastructure 00:21:00.040 |
but you need to provision the resources you need to build the infrastructure to scale with demand and 00:21:06.920 |
that can get tricky especially achieving high reliability under load that's a challenge that 00:21:12.200 |
many people face as they scale their business up with saas you can essentially work with a third party 00:21:18.040 |
like octo ai and obviously i'm a bit biased again i work there so i'm gonna insert a shameless plug for 00:21:24.680 |
octo ai which allows users to get these fine tunes deployed on saas based endpoints so endpoints very 00:21:32.840 |
similar to the ones from open ai for instance if you're familiar with that or claude and it offers 00:21:40.120 |
the ability to serve different kinds of customizations as well and so very quickly i want to go over the 00:21:45.880 |
advantages of octo ai here first of all you get speed so llama 3 8 billion parameter model you can achieve 00:21:53.400 |
around 150 tokens per second and we keep on improving that number because we've been applying 00:21:57.960 |
our own in-house optimizations to the model serving layer it also has a significant cost advantage 00:22:03.880 |
because it costs about 15 cents per million tokens compared to say gpt4 which costs 30 dollars per million tokens 00:22:11.240 |
so that's where the 200x comes from and we don't charge a tax for customization so whether you're serving 00:22:16.120 |
the base model or a fine-tune it's the same cost there's customization as i mentioned you can load your own laura and serve it 00:22:25.080 |
and finally scale our customers some of our customers generate up to billions of tokens per day on our endpoints 00:22:31.480 |
i think we're serving around over 20 billion tokens per day and so we've focused and spent a lot of time 00:22:38.280 |
on improving robustness and also worth mentioning if saas doesn't cut it for you if you are working for a fortune 500 00:22:47.960 |
if you're a software company or you know a healthcare company banking sector government you need to deploy your llms inside of your 00:22:55.560 |
environment either on-prem or in vpc we also have a solution called octo stack come talk to us at the booth 00:23:02.040 |
so that's it for the shameless plug section let's go over to section four which is evaluating quality right 00:23:08.680 |
we've talked about data set collection fine tuning deployment now quality evaluation and we could have an entire conference just dedicated on that 00:23:16.360 |
i'm going to try to summarize it into kind of two classes of evaluation techniques that i've seen 00:23:22.360 |
first of all you know can your quality be evaluated in a precise way that can be automated for instance 00:23:29.960 |
you generate a program or sql command that can run uh can you for instance label or extract information 00:23:37.240 |
or classify information in an accurate way that's a kind of pass or fail scenario right or formatting the output 00:23:43.560 |
into a specific json formatting this is something that you can easily test as a pass or fail test and 00:23:49.960 |
then there's more of the soft evaluation for instance if i were to take an answer and say well which output 00:23:55.640 |
is written in a more polite or professional way you can't really write a program to evaluate this unless 00:24:01.720 |
you're using an llm of course right but you have to put yourself into maybe a 2020 2021 mindset 00:24:09.480 |
before gpt was around well it'd be hard to build a program that can assess this right so generally you'd 00:24:15.640 |
need a human in the loop to say which out of a or b is a better answer thankfully today we can use llms 00:24:23.800 |
to automate that evaluation but keep in mind that for instance if you're using gpt4 to evaluate two 00:24:29.400 |
answers well if you're comparing against gpt4 it might favor its own answer and people have seen that in 00:24:34.440 |
these kinds of evaluations so this is a whole science i mean we could have a whole conference 00:24:38.920 |
just on this i just wanted to present the high level uh guidelines of this whole cycle of deploying 00:24:46.120 |
fine-tuned llms and so really there is no finish line that's what i want to convey to you all that 00:24:52.520 |
going through a single iteration is something that you might have to do on a regular basis maybe 00:24:58.760 |
once a week maybe once a year it all depends on your use case and constraints now let's get a bit 00:25:06.120 |
more practical let's switch over to our demo and so for those of you who came a little bit late 00:25:13.480 |
there's a qr code here that you can scan and that will point you to our google colab and we also have 00:25:22.680 |
a slack channel now let me see if i can pull it up if you're in the slack channel for ai engineers world 00:25:29.800 |
fair there is this quality optimization boot camp where you can ask questions here if you want to 00:25:35.080 |
follow along and so we're going to go we're going to try to go over the practical component in the next 00:25:40.520 |
25 minutes i just want to provide some context here the use case is uh personally identifiable 00:25:48.360 |
information redaction we've taken this from a data set composed by ai4privacy called pii masking 00:25:55.960 |
200k it's one of the largest data sets of its kind it has 54 different pii classes so different kinds 00:26:03.880 |
of sensitive data like the name the email address the physical address of someone 00:26:10.760 |
uh their credit card information etc etc across 229 discussion subjects so that includes 00:26:18.120 |
conversations from a customer ticket resolution conversations with a banker conversations between 00:26:24.440 |
individuals etc what this data set looks like is as follows you're going to have a message an email 00:26:32.120 |
here we have you know something that looks like it came out of an email that contains credit card 00:26:37.800 |
information ip address maybe even a mention of a role or anything that is essentially personally 00:26:45.640 |
identifiable and i've highlighted those in red because they will need to be redacted 00:26:51.160 |
and after redaction we should get the following text that shows look here is this information that is now 00:26:57.880 |
redacted anonymized but instead of just masking it we're actually telling it what kind of category this 00:27:03.880 |
information belongs to right a credit card number an ip address or job title and this is how we're going to redact this text 00:27:10.760 |
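as a rough illustration of that target output, this is how the redacted text can be produced from the dataset's privacy mask labels; the field names here are assumptions based on the dataset card:

```python
def redact(source_text: str, privacy_mask: list[dict]) -> str:
    # privacy_mask entries assumed to look like {"value": "...", "label": "CREDITCARDNUMBER"}
    for entry in privacy_mask:
        source_text = source_text.replace(entry["value"], f'[{entry["label"]}]')
    return source_text

print(redact(
    "Please charge card 4111 1111 1111 1111 and email me a receipt.",
    [{"value": "4111 1111 1111 1111", "label": "CREDITCARDNUMBER"}],
))
# -> "Please charge card [CREDITCARDNUMBER] and email me a receipt."
```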
so where do llms come in the way we would use it is through function calling who here has used llms 00:27:19.400 |
with tool calls or function calls okay so quite a few people you know and as many of us are aware this 00:27:27.480 |
is kind of what powers a lot of the agentic applications so this is a great use case for people who want to do 00:27:33.320 |
function calling and are not seeing the results you know out of the box from say gpt4 that they would like 00:27:40.440 |
to see and in this case we're actually going to see that these kinds of state-of-the-art models 00:27:44.760 |
aren't doing quite well at fairly large and complex function call use cases so to achieve this redaction 00:27:53.000 |
use case we're going to pass in a system prompt we can also pass in a tool specification the system 00:27:58.920 |
prompt says look you're an expert model trained to do redaction and you can call this function 00:28:03.320 |
here are all the sensitive pii categories for you to redact and then as a user prompt we're going to 00:28:10.040 |
pass in that email or that message and then the output is a tools call so it's not the redacted text 00:28:17.320 |
it's actually a tools call to that redact function that's going to contain all the arguments for us to 00:28:23.080 |
perform the redaction why am i doing this as opposed to spitting out the redacted text well that gives us 00:28:28.760 |
flexibility in terms of how we want to redact this text we could choose to just replace that 00:28:34.520 |
information with the pii class we can also completely obfuscate it or we could choose to use for instance a 00:28:42.120 |
database that maps each pii entry to a fake substitute so that we have an email that kind of reads normally 00:28:51.160 |
except the credit card the names the addresses are all made up but they will always map to the same individual 00:28:58.600 |
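a sketch of that substitution idea: parse the redact tool call returned by the model and swap each PII string for a stable placeholder so the same value always maps to the same substitute; the argument and field names (fields_to_redact, string, pii_type) are assumptions for illustration:

```python
import json

fake_db: dict[tuple[str, str], str] = {}  # (original string, pii type) -> stable placeholder

def substitute(text: str, tool_call) -> str:
    # tool_call.function.arguments is a JSON string in OpenAI-style responses.
    args = json.loads(tool_call.function.arguments)
    for field in args["fields_to_redact"]:            # assumed argument name
        key = (field["string"], field["pii_type"])    # assumed field names
        if key not in fake_db:
            fake_db[key] = f'<{field["pii_type"]}_{len(fake_db) + 1}>'
        text = text.replace(field["string"], fake_db[key])
    return text
```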
and so that allows us to do then more interesting processing on our data set right so that's why 00:29:04.680 |
we're going to use function calling here and let's start to build the data set so i'm going to switch 00:29:09.000 |
over to our notebook here this notebook is meant to be sort of self explainable so there's a bit of 00:29:15.640 |
redundant context as part of the prerequisites you're going to have to get an account on octo ai and 00:29:22.280 |
open pipe and these are the tools that we're going to use and if you want to run the evaluation function also 00:29:28.440 |
provide your open ai key because we're going to compare against gpt4 so we're going to install the python packages 00:29:35.400 |
initially only open ai and data sets from hugging face you can ignore this pip dependency error here 00:29:42.200 |
which happens when you pip install data sets in a colab notebook but that's okay we can get past that 00:29:48.200 |
you can enter your octo ai token and open ai api key at the beginning 00:29:54.440 |
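for reference, the setup amounts to something like this in the Colab notebook; key handling via getpass is just one option:

```python
!pip install --quiet openai datasets

import os
from getpass import getpass

os.environ["OCTOAI_TOKEN"] = getpass("OctoAI token: ")
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key (optional, for the GPT-4 comparison): ")
```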
and i've already done this so we're going to start with the first phase which is to build a fine-tuning 00:29:58.520 |
data set so we have this pii masking data set i'm going to show it from hugging face the pii 00:30:04.360 |
masking data set and you can see what it looks like it has the source text information as you can see 00:30:12.520 |
these are you know snippets from emails for instance you have the target text that is 00:30:18.360 |
redacted and the privacy mask that contains each one of the pii and the classes associated to it 00:30:25.080 |
so this contains all the data all the information input and labels that we need to build our 00:30:30.600 |
our data set for fine-tuning and so really what we're going to do 00:30:41.480 |
here we're going to define our system prompt which is again telling the model you're an 00:30:46.120 |
expert model trained to redact information and here are the 56 categories explaining next to each 00:30:53.000 |
category what that corresponds to and this is really the beauty of llms and sort of natural language entry 00:30:59.400 |
is that in the old world when we were doing pii redaction we had to write complex regular expressions 00:31:05.480 |
and here this is all done through just providing a category and a bit of a description here 00:31:10.920 |
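roughly, that system prompt is assembled like this; the category names and one-line descriptions below are a small illustrative subset, not the full list from the dataset:

```python
PII_CATEGORIES = {
    "CREDITCARDNUMBER": "a payment card number",
    "IPADDRESS": "an IPv4 or IPv6 address",
    "JOBTITLE": "a person's job title or role",
    # ... the remaining categories, one short description each
}

SYSTEM_PROMPT = (
    "You are an expert model trained to redact personally identifiable information. "
    "Call the redact function with every sensitive string you find and its category.\n"
    "Categories:\n"
    + "\n".join(f"- {name}: {desc}" for name, desc in PII_CATEGORIES.items())
)
```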
and the llm will naturally infer how to do the redaction we're also going to define the tool to call 00:31:19.080 |
right so this is done as essentially a dictionary a json object and as you can see there is an array 00:31:26.360 |
that contains dictionaries containing a string and a pii type and the string is the pii information the type is 00:31:35.000 |
essentially one of 56 categories that we provide as an enum so right off the bat you can see that this 00:31:40.520 |
tool call is you know a bit of a large function specification and so let's load our data set from 00:31:48.440 |
hugging face in this case it's going to take maybe a few seconds to load in that data set of 200 000 00:31:54.600 |
entries and then what i have in the next cell when i'm downloading this data set is what i'm going to use 00:32:01.800 |
to build my fine tuning training data set and here's the thing about fine tuning is that to build your 00:32:10.120 |
data set you need to make it seem like you've essentially logged conversations with an llm right 00:32:15.800 |
you're logging the prompts and the responses because that's how you're going to fine tune it you need to 00:32:20.200 |
tell it this is the input with system prompt tools specification user prompt and here's the 00:32:27.720 |
tools call response that i expect to see and so this cell here just sets it up so that we essentially 00:32:35.320 |
have each training sample as a message from an llm that's been logged we're going to see what that looks 00:32:41.800 |
like in a second so we're going to build a 10 000 entry training data set for open pipe and that's 00:32:51.480 |
going to be downloaded as this open pipe data set dot jsonl and so as i run the cell it's going to 00:32:58.040 |
download this from colab and now when you switch over to open pipe we're going to create a new data set 00:33:07.400 |
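before switching over, here is a sketch of that tool specification and of the cell that turns each dataset row into a logged, OpenAI-style training record and writes the JSONL file; it reuses PII_CATEGORIES and SYSTEM_PROMPT from the sketch above, and the dataset field names and the exact record schema OpenPipe expects are assumptions to check against the docs:

```python
import json

REDACT_TOOL = {
    "type": "function",
    "function": {
        "name": "redact",
        "description": "Report every piece of PII found in the message.",
        "parameters": {
            "type": "object",
            "properties": {
                "fields_to_redact": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "string": {"type": "string"},
                            "pii_type": {"type": "string", "enum": list(PII_CATEGORIES)},
                        },
                        "required": ["string", "pii_type"],
                    },
                }
            },
            "required": ["fields_to_redact"],
        },
    },
}

def to_training_record(row: dict) -> dict:
    # Each sample looks like a logged chat completion: the inputs plus the expected tool call.
    arguments = {
        "fields_to_redact": [
            {"string": m["value"], "pii_type": m["label"]} for m in row["privacy_mask"]
        ]
    }
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": row["source_text"]},
            {
                "role": "assistant",
                "tool_calls": [{
                    "type": "function",
                    "function": {"name": "redact", "arguments": json.dumps(arguments)},
                }],
            },
        ],
        "tools": [REDACT_TOOL],
    }

# train is the training split loaded earlier; take 10,000 samples for the fine-tuning set.
with open("openpipe_dataset.jsonl", "w") as f:
    for row in train.select(range(10_000)):
        f.write(json.dumps(to_training_record(row)) + "\n")
```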
so once you're on open pipe console you have a project here i've generically named it project one 00:33:13.480 |
you can access data sets and already as you can see i already have built a few data sets before 00:33:20.280 |
but if you're a first time user you're not going to see anything under data sets so you can create 00:33:25.320 |
a new data set here by clicking on this button and if you go under settings we can name our data set so 00:33:32.040 |
i'm going to call it lunch and learn and today is june 26 all right so this is today's lunch and learn 00:33:42.920 |
i'm going to call this my data set and under general i can upload the data that i just 00:33:49.560 |
downloaded from my notebook open pipe data set dot jsonl so this upload operation is going to take 00:33:56.280 |
a few seconds or maybe a couple of minutes because what's going to happen on open pipe is not only we're 00:34:03.400 |
uploading this data set but it's going to do some pre-processing here to split it into training and 00:34:09.880 |
validation set it's also going to get it all formatted in a nice way so we can essentially look into the data 00:34:17.160 |
set so you can see there's this little window here that shows that you're uploading the data set and 00:34:22.520 |
that it is essentially being processed so while this is happening right we've prepared our data set and 00:34:30.600 |
we're going to take a look at it in a second while it's being processed on open pipe but let's see how 00:34:35.160 |
we're going to do the fine tuning in the next stage right so once we have our data set uploaded we're going 00:34:41.400 |
to have this view on the data set that shows every single entry that we can peek into and how it's split 00:34:46.600 |
into training and test set generally a 90 10 split and from that ui we can launch a fine tune 00:34:55.400 |
and this is where we get to choose our base model and what we're going to choose is a llama 3 8 billion 00:35:00.360 |
parameter model with 32k context width which is a fine tune from nous research called the theta model 00:35:11.160 |
and you can see that there's essentially a pricing here that is being estimated for this fine tune we 00:35:17.400 |
have a substantial training set because it can range from say hundreds of samples to thousands to hundreds 00:35:23.160 |
of thousands and the cost can scale up as you feed in more training samples but it will improve the 00:35:31.480 |
accuracy and it also provides an estimated training price of forty dollars now that might seem like a lot 00:35:37.000 |
especially when you're tinkering with fine tuning but keep in mind some of the people that we work 00:35:41.160 |
with they tend to spend tens of thousands or maybe hundreds of thousands of dollars a month on gen ai 00:35:46.680 |
spend so this is absolutely something that you can do up front that will pay off and i believe that on 00:35:51.800 |
open pipe if you get started you get a hundred dollars credit so that allows you to run some fine 00:35:57.960 |
tunes off the bat without necessarily having to pay so let's go over to open pipe 00:36:06.760 |
and it is still uploading i think maybe the network is a bit slow but we're going to essentially 00:36:15.800 |
start training at this point and once the training is happening we're going to then deploy the fine tune llm 00:36:23.160 |
when training is done and what happens on open pipe is when you're done with training you're going to 00:36:28.040 |
get an email when that training job is done it can take a few minutes so i'm going to pull a julia 00:36:32.600 |
child here i'm going to stick the turkey in the oven and in the second oven i'm going to 00:36:37.160 |
have a pre-baked turkey just so that we don't lose time but as you're going through this on your own 00:36:42.840 |
keep in mind it's going to take a little bit of time to just kick off that whole fine tuning process but 00:36:47.400 |
it's not that long because um you know you're training a fairly small model here all right so 00:36:54.360 |
this is still uh saving but let's kind of take a look at what we've done so far right so we've built 00:37:00.040 |
our data set using a synthetic data set from hugging face we format each input output pair from the data 00:37:05.960 |
set as logged llm messages and this is essentially stored as a jsonl file that we upload to open pipe 00:37:13.640 |
and we produce 10 000 training samples we're fine tuning a model on open pipe and open pipe 00:37:19.640 |
uses parameter efficient fine tuning which produces a lora and we choose the llama 3 8 billion parameter 00:37:25.400 |
model as the base and when we deploy what we're going to use here is octo ai so let's see this didn't 00:37:33.960 |
finish uploading so i'm going to go into the one that i uploaded just a couple days ago just to 00:37:39.640 |
essentially show you what you should see on the user interface so as you peruse through the training 00:37:47.960 |
samples what you're going to see is an input column and output column and so on the left you have the 00:37:53.240 |
input with the system prompt as you can see it's a big boy because it has all these different 00:37:58.760 |
categories right that it needs to classify it also has the user prompt which is the message that we need to 00:38:05.160 |
redact the tool choice and the tool specification here with all the different categories of pii types 00:38:11.000 |
and then the output will be this tool call from the assistant's response 00:38:15.880 |
and that will have this redact call along with this fields to redact argument as a list of dictionary 00:38:23.160 |
entries containing string and pii type information right and so this is what we've passed into our fine tuning 00:38:32.360 |
data set into open pipe and this is still saving so i'm just going to go ahead and go to the model so 00:38:40.360 |
once you have the data set uploaded again you hit this fine tune button and this is what's going to 00:38:46.760 |
allow you to launch a fine tuning job right i can call this blah and this is where you select under 00:38:52.680 |
this drop down the model that you want to fine tune this is again what we saw before training size is 00:38:58.600 |
substantial i'm not going to hit start training because i already have a trained model but when 00:39:02.440 |
you do that it's going to kick off the training and when it's done you'll get notified by email 00:39:06.120 |
now let's fast forward let's assume i've already trained my model so i'm going to have this 00:39:11.400 |
model here that's been fine-tuned from this data set i'm going to click on it as we can see it's 00:39:17.720 |
a llama 3 8b model it's been fine-tuned over this 10 000 sample data set split into 9 000 training samples and 00:39:27.080 |
a thousand test samples we can even look at the evaluation but going back to the model 00:39:35.160 |
and the nice thing is that it's taking care of the hyper parameters like learning rate and number of epochs it kind 00:39:42.520 |
of figures it out for you so you don't really have to tweak those settings and i find that to be very 00:39:46.920 |
convenient especially for people who haven't yet built an understanding of how to tweak those values 00:39:51.000 |
and the beauty of using open pipe is that you can now export the weights and be the owner of those 00:39:57.960 |
weights right remember when we talked about open source it's really important to own the result of 00:40:02.280 |
the fine tuning so you can download the weights in any format you want you have loras but also merged 00:40:08.040 |
checkpoints so you can have a parameter efficient representation as well as a checkpoint and so we've 00:40:14.360 |
selected to export our model as an fp16 lora which is what we're going to use to upload our model on 00:40:20.280 |
octo ai which is what we're going to use to deploy the model so now i can download the weights as a zip 00:40:25.400 |
file and it's fairly small only 50 megabytes but i can also copy the link copy the url and this is what 00:40:33.800 |
we're going to need to do in this tutorial so to deploy the model what we need to do is copy this url 00:40:41.000 |
i'm going to download in the cell the octo ai cli this is a command line interface for users to upload 00:40:49.480 |
their own fine tunes to what we call our asset library so this is a place where you can store your 00:40:54.760 |
own checkpoints your own loras for not just llms but also models like stable diffusion if some of you are 00:41:00.760 |
developers who also work in the image gen space and so we can serve these customized models on our platform 00:41:08.760 |
and so we're going to upload this lora from open pipe to octo ai so we're going to log in just to 00:41:17.720 |
make sure credentials are good and here we have a confirmation that our token is valid and in the cell we have to 00:41:25.960 |
replace the lora url from set me to that url that i just copied here from download weights 00:41:31.160 |
and keep in mind this might take a couple minutes to get the link to appear but once you have that link 00:41:38.040 |
and again i'm kind of skipping ahead because when you're going to run this at your own time it might 00:41:43.480 |
take a you know a few minutes to run the fine tune it might take a few minutes to download the weights 00:41:47.720 |
but everything that i'm running here is essentially the steps that you'll take yourself 00:41:51.960 |
and what i'm doing here is passing in this url here and setting a lora asset name in my octo ai asset 00:42:01.560 |
library so i can then create this asset from this lora as a safetensors file based on the llama 3 8b model 00:42:16.120 |
it seems like something has failed here so let's try to run it again 00:42:40.920 |
usually that should have worked so what should happen here is at this point 00:42:54.680 |
once you've taken the url of your fine-tuned asset you should be able to host it on our asset library and then 00:43:04.680 |
from there serve it to start running some inferences so this lora upload step didn't quite 00:43:15.800 |
work here so pedro are you able to maybe double check with product whether this capability is working 00:43:23.160 |
this isn't a good demo unless something fails and so yeah you know i just tested it earlier today 00:43:32.200 |
let's see i might have to list my assets so i can pull an old one 00:43:41.560 |
actually one second pedro can you tell me what the command is to 00:43:47.880 |
list the assets that are on there i think it might be octo ai asset list all right let's 00:43:56.280 |
okay there we go so i'm going to pull from an asset that i uploaded earlier 00:44:08.280 |
okay so i'm gonna have to look into why that step failed but let's 00:44:34.280 |
try this okay so i'm gonna use an asset that i uploaded earlier i'm not sure why this 00:44:40.600 |
didn't work but i'll make sure that this is working for you all to reproduce this step 00:44:43.960 |
and i'm gonna set lora asset name equals this all right so these other loras i uploaded using 00:44:55.960 |
the exact same steps as i used for this tutorial so we'll make sure to get to the bottom of this and 00:45:02.280 |
we'll use the slack channel here for folks who want to run through this step but i'm just going 00:45:06.840 |
to run an example inference here on this asset that i pulled from open pipe 00:45:11.800 |
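the inference cell boils down to the standard OpenAI client pointed at OctoAI with the LoRA attached by name; the base URL, the model id, and the peft parameter passed through extra_body are assumptions based on this walkthrough rather than exact API names, and SYSTEM_PROMPT and REDACT_TOOL come from the sketches above:

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://text.octoai.run/v1",  # assumed OctoAI OpenAI-compatible endpoint
    api_key=os.environ["OCTOAI_TOKEN"],
)

test_email = "Hi, my card 4111 1111 1111 1111 was declined, can you check? - Billy"

response = client.chat.completions.create(
    model="openpipe-llama-3-8b-32k",        # base model named in the walkthrough
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": test_email},
    ],
    tools=[REDACT_TOOL],
    # Attach the fine-tuned LoRA from the asset library; parameter name assumed.
    extra_body={"peft": "pii-redaction-lora"},
)

print(response.choices[0].message.tool_calls)
```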
and so again we have our system prompt we're going to pass in this you know this message this 00:45:20.440 |
email as our test prompt and then when we're invoking this octo ai endpoint we're using the standard chat 00:45:27.160 |
completions from open ai and what we're passing here is this open pipe llama 3 8b 32k model and we pass 00:45:36.920 |
in this argument for the parameter efficient fine tune passing the lora asset name that we just uploaded to 00:45:44.600 |
the asset library and as we can see the response here contains the tool calls and the call to the function 00:45:51.800 |
that will do the redaction so this is behaving exactly as we intended to so now we can move on to the quality 00:45:58.600 |
evaluation for quality evaluation what we've done is use essentially an accuracy metric thankfully we 00:46:06.600 |
have a ground truth right from our data set all the exchanges have been labeled with privacy mask information 00:46:13.880 |
that we can use as ground truth so that makes evaluating and scoring our results fairly easy we don't have to 00:46:19.800 |
use an llm for instance for that we can actually use more traditional techniques of accuracy evaluation 00:46:25.640 |
and so we have a metric that we've built it assigns a score that can be penalized when pii information was 00:46:33.320 |
missed or mistakenly added i.e. false negative or false positive and then we use a similarity distance metric 00:46:39.880 |
to kind of match the responses from the llms compared to our ground truth so for illustration purposes we have 00:46:47.320 |
for instance this pii information that's been redacted that's a score of 1.0 because it's the perfect match 00:46:53.560 |
our fine-tune might for instance miss the fact that billy was the middle name and might interpret it as 00:46:59.720 |
first name in that case we're still attributing a high score because it's close enough and probably 00:47:04.920 |
for a practical use case that would be good enough but for instance upon calling gpt4 it fails to identify 00:47:11.240 |
two out of the three pieces of information that we have to redact and so the score is about a third here right 00:47:16.680 |
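a minimal sketch of that kind of scoring, assuming ground truth and predictions are both lists of (string, pii type) pairs: misses contribute nothing, spurious additions inflate the denominator, and a near-miss label gets partial credit through a simple string similarity ratio:

```python
from difflib import SequenceMatcher

def score(ground_truth: list[tuple[str, str]], predicted: list[tuple[str, str]]) -> float:
    if not ground_truth and not predicted:
        return 1.0
    remaining = list(predicted)
    total = 0.0
    for gt_string, gt_type in ground_truth:
        best, best_sim = None, 0.0
        for cand_string, cand_type in remaining:
            if cand_string == gt_string:
                sim = SequenceMatcher(None, gt_type, cand_type).ratio()
                if sim > best_sim:
                    best, best_sim = (cand_string, cand_type), sim
        if best is not None:
            total += best_sim          # exact string match, partial credit on the label
            remaining.remove(best)
        # a missed entry (false negative) simply adds nothing
    # leftover predictions are false positives; they grow the denominator and pull the score down
    return total / (len(ground_truth) + len(remaining))
```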
so in this case what we're going to do here i'm just going to reduce the test size to 100 samples 00:47:23.880 |
and i am going to run this evaluation inside of this cell it's going to bring us 100 test samples 00:47:34.520 |
that we can then run our evaluation metric and get our overall scoring out of so if we look at 00:47:43.960 |
you know the output from the cell essentially we're just invoking back to back the fine-tune 00:47:52.120 |
running on octo ai and we're invoking gpt4 on open ai to do the results collection so we're going to 00:47:59.640 |
collect some results here and uh once we've collected the results once we get to 100 i think we're getting 00:48:06.200 |
pretty close here we can run the quality evaluation metric and of course i invite you to run it on more 00:48:11.720 |
samples maybe a thousand or ten thousand it just gets more expensive as you're using gpt4 you know to 00:48:18.840 |
run a hundred samples it costs about a dollar in inference so then a thousand samples cost ten dollars 00:48:26.120 |
and now we're going to score it all right so we're going to go through every single entry we have our 00:48:32.040 |
ground truth information we have our eval and labels from gpt4 and our eval and labels from our fine 00:48:40.520 |
tune and we can see that right off the bat the fine tune is actually better at finding the pii to redact 00:48:47.080 |
here gpt4 scored only 0.49 whereas our fine tune achieves 0.85 and here 0.3 for gpt4 and 1.0 for 00:48:57.400 |
the fine tune so the fine tune overall is performing better and once we aggregate and average the score 00:49:02.600 |
gpt4 achieved 0.68 out of 1 whereas our fine tune achieves 0.97 and so that's the difference between 00:49:13.400 |
prototype and production right you're expected to achieve somewhere in the single nine or two nines 00:49:18.440 |
of accuracy and this is what this technique shows it allows you to achieve and again i want to reiterate that 00:49:25.000 |
in terms of cost gpt4 costs upwards of 30 dollars per million tokens generated whereas llama 3 8b on octo ai 00:49:33.240 |
cost just 15 cents that's a 200x difference right so with that i just want to conclude 00:49:40.440 |
with some takeaways on fine tuning right fine tuning is a journey but a very rewarding 00:49:47.880 |
journey there's truly no finish line here you need to attempt fine tuning after you already tried other 00:49:53.880 |
techniques like prompt engineering retrieval augmented generation but once you decide to embark data is 00:49:59.960 |
very important collecting your data set because your model is only as good as the data it's trained on 00:50:05.320 |
you need to make sure to continuously monitor quality to retune your model as needed 00:50:10.200 |
and you know thankfully we have solutions like octo ai and open pipe to really make 00:50:17.080 |
this more approachable and easy to do and it's easier than ever it's only getting easier but 00:50:22.840 |
maybe a year ago it was only reserved for the most adventurous and sophisticated users and now we've 00:50:27.800 |
really lowered the barrier of entry and when you do it right you can achieve really significant improvements 00:50:33.000 |
in accuracy as well as great reduction in costs i wanted to thank you for sitting here with me over the last 00:50:40.280 |
50 minutes i want to reiterate a few calls to action so go to octoai.cloud to learn how to use our solutions 00:50:47.240 |
and endpoints but also come to our booth and so we're located at this g7 booth and we're going to be here 00:50:55.640 |
today and tomorrow if you want to chat about our saas endpoints about our ability to deploy in an enterprise 00:51:02.360 |
environment and also i want to give a shout out to my colleague here pedro if you're curious about all the know-how 00:51:08.360 |
that goes behind how we optimize our model and production because our background is in compiler 00:51:12.920 |
optimization in system optimization in infrastructure optimization we've applied all of this to be able 00:51:18.760 |
to serve our models you know with positive margins we're not doing this at a loss sorry we're not wasting 00:51:25.080 |
our vc money here we're actually building all this know-how into making sure that ai inference is as efficient 00:51:32.440 |
as it could be so there's going to be a talk on that and also make sure if you if you get a chance assuming 00:51:39.960 |
you've joined our our slack channel which is the following one so if you're on the slack 00:51:48.440 |
org for the event go to llm quality optimization boot camp you can ask us any questions and if you fill out 00:51:56.920 |
the survey that pedro is going to post we're going to give you an additional 10 dollars in credits so 00:52:03.400 |
that doesn't seem like a lot but that's a ton you know if it's 15 cents per million tokens that's a lot 00:52:09.160 |
of tokens that you can generate for free so we can give you an additional 10 dollars for filling out 00:52:15.400 |
the survey which should take about you know 20 to 30 seconds so i'm going to be around and also you 00:52:21.080 |
can find me at the booth this afternoon in case you have any questions but i'd like to thank 00:52:26.040 |
you all for sitting through this talk and hopefully you've learned something from this and 00:52:30.040 |
hopefully you feel like i've demystified this idea of trying fine-tuning on your own give this notebook 00:52:35.880 |
a try assuming of course we've fixed this lora upload issue and yeah thank you all and maybe ask 00:52:42.200 |
me some questions after this talk thanks