
Productionizing GenAI Models – Lessons from the world's best AI teams: Lukas Biewald



00:00:00.000 | I guess as we're getting started, this talk is about productionizing AI, but I'm curious,
00:00:18.560 | I keep asking this question and it keeps changing, how many of you folks have LLM applications
00:00:24.200 | in production in your company? Wow, that's probably the highest I've ever seen, more than 70%.
00:00:32.000 | So why did you come to this talk? You figured it out, right?
00:00:37.520 | Not going great.
00:00:39.660 | Not going great?
00:00:40.460 | All right. You want to put it in slideshow mode? Oh, it is in slideshow mode? Okay.
00:00:52.400 | Okay. Hey, you're showing my IP to the whole audience.
00:00:59.240 | You might ask, are there custom solutions or purchased LLM solutions?
00:01:09.680 | Oh, good question. Yeah. How many are custom solutions?
00:01:15.480 | Interesting. That's maybe 30%? And how many purchased solutions? Wow, way, way less. What are you purchasing?
00:01:28.320 | Yeah. Okay. Well, if you put Copilot in that category, how many have purchased solutions?
00:01:31.600 | It's got to be most, right? Interesting. Interesting. Thanks for the... Any more, should we crowdsource some more questions?
00:01:41.440 | No, we can begin. Awesome. All right. So I wanted to talk briefly about our experience with our customers,
00:01:49.280 | productionizing Gen AI models. But I want to give you a little bit of context. First of all,
00:01:54.240 | this is clearly preaching to the choir, but it's interesting to see that democratization of AI is here.
00:02:00.200 | I don't know. How many people were working in AI 10 years ago?
00:02:02.200 | Nice. 15 years ago?
00:02:05.280 | So yeah, ML totally counts. 100% ML counts. ML definitely counts as a type of AI. Absolutely.
00:02:12.160 | You know, I think people talked about the democratization of AI, you know, back when AI was kind of a lame term to use.
00:02:20.720 | But I think we thought that it might take the form of, you know, something like auto ML or more, you know, graphical user interfaces.
00:02:26.680 | But clearly, democratization of AI has happened in a way that we didn't expect.
00:02:29.680 | Clearly, through these kind of chat and conversational interfaces and LLMs, we now see AI in literally every company that we talk to in some way.
00:02:38.680 | And not just purchased solutions, although I think probably every company in this room purchases what I would call an AI solution.
00:02:44.640 | We also see quite a lot of custom solutions, and probably every Fortune 500 right now is investing in custom AI solutions.
00:02:52.640 | More LLMs, more Gen AI, but I would say most also some traditional ML.
00:02:58.240 | And it's moving into production, right?
00:03:00.200 | You see it in this audience, you see 70% of this audience, but everywhere I go to give a talk, even with companies that you wouldn't necessarily think of as the most forward-thinking,
00:03:08.040 | they are starting to have Gen AI applications in production.
00:03:10.640 | So people keep asking me, "When is Gen AI going to move into production?"
00:03:14.800 | It really has moved into production, right?
00:03:16.800 | There's a lot more that could happen, but Gen AI is in production right now in a lot of applications that you use today.
00:03:23.080 | But I think the challenge everybody has, and people keep saying it because it's important to say, is that AI is so easy to demo and so hard to productionize.
00:03:31.600 | You see this with a lot of AI apps getting released, there's something about AI that makes CEOs stupid.
00:03:37.080 | It makes them stop listening to their customers, it makes them willing to put things into production that are absolutely terrible.
00:03:42.760 | And I think the fundamental reason is that it's so easy to make an incredibly compelling demo.
00:03:48.840 | I get these all the time from my own team, I get them from startups, you know, pitching me on potentially angel investing.
00:03:54.360 | You can make just an absolutely astonishing demo, and it's so much further from production than we're used to.
00:04:00.520 | Even compared with software, which has this characteristic, AI is so much easier to demo and so much harder to productionize.
00:04:08.120 | So let's talk about how to productionize it.
00:04:09.720 | I'll state, I come from Weights and Biases.
00:04:12.040 | This is a company I started that I feel very proud of.
00:04:14.760 | We've built an AI developer platform that helps AI engineers develop ML applications or Gen AI applications.
00:04:24.040 | And we started this company because we saw this massive tools gap for AI ops.
00:04:29.560 | We now see that we have three types of customers.
00:04:34.360 | We see foundation model builders, which we've long served since the beginning.
00:04:37.720 | We see a bigger group of AI engineers that are doing a blend of ML applications and Gen AI applications.
00:04:44.280 | And we see a new type of audience for our software in software developers, who are now getting used to the fact that AI applications are so easy to demo and so hard to productionize, and who are trying to figure out how to productionize those applications.
00:04:57.800 | What's exciting for us and exciting for everyone, I think, is that the software developers are so much more numerous than ML engineers and foundation model builders.
00:05:07.960 | Now that companies can use software developers to do what traditionally would have taken a custom model, it opens up so many more AI applications.
00:05:17.640 | And from our perspective, this is a really exciting development because the market is so much bigger.
00:05:21.640 | And that's why we see a whole LLM ops tools track at this conference and so many companies popping up to help with this problem.
00:05:27.720 | I feel really proud, and I always want to say, we work with nearly all the foundation model companies.
00:05:32.760 | They have special needs, but they do push our scale.
00:05:35.720 | So almost all the LLMs that you would have used, from OpenAI's GPT to Mistral to Llama, were built using the Weights and Biases platform.
00:05:43.720 | So we have years of experience dealing with large scale.
00:05:47.800 | But it's also important to say almost all of our customers, and we have thousands of customers, are not foundation model builders.
00:05:53.560 | They're companies in nearly every industry.
00:05:55.880 | And the really fun part about being in the AI space right now is that there's applications to every industry.
00:06:01.560 | So we see healthcare, we see ag tech, we see manufacturing.
00:06:04.680 | I remember when I came out of school and wanted to get into the AI industry, it was really kind of ranking search results or going to Wall Street or targeting ads.
00:06:12.120 | Which aren't the most exciting applications now.
00:06:14.760 | It's things that you can touch and feel.
00:06:16.200 | It's things that you can explain to your parents.
00:06:17.880 | This is an incredibly exciting development.
00:06:19.640 | But what makes this hard?
00:06:22.840 | And I've thought a lot about this, what makes it hard?
00:06:24.920 | And what makes kind of AI engineering a different type of process with a different set of tools needed?
00:06:30.920 | Because every software company has now launched an AI tool version of itself, right?
00:06:35.800 | They call me up every day and they say, you know, what do you think?
00:06:37.960 | We're launching, you know, AI version of observability, we're launching AI version of CI/CD, we're launching AI version of, you know, code versioning.
00:06:46.920 | Why don't those take, right?
00:06:49.320 | Why do we need a new set of tools?
00:06:50.600 | Why is Weights and Biases able to compete against companies that are, you know, much bigger and have lots of experience?
00:06:55.240 | And I think it's fundamentally that the software development process, which I love, is a linear process.
00:07:00.200 | You add features, you add code, and generally things improve over time.
00:07:05.400 | And the AI process is experimental.
00:07:09.000 | Almost everything you do when you're using an LLM is trying it and seeing what happens, right?
00:07:15.000 | And it's fundamentally non-deterministic.
00:07:17.720 | You don't know what's going to happen.
00:07:19.000 | You're never going to have a CI/CD test that's meaningful and passes 100% of the time.
00:07:23.720 | So instead of kind of adding things all day long and like what you do when you write code, you're doing lots of experiments.
00:07:28.600 | I think this is a fundamentally different workflow.
00:07:30.920 | And what that means is that, you know, when you develop software, your code is your IP.
00:07:34.760 | Companies recognize this and they take a lot of pain to protect their IP.
00:07:38.760 | But when you're developing AI, it's the learning that's your IP.
00:07:41.640 | It's not the model that you build, right? The model is kind of an outcome, right?
00:07:45.400 | The prompt that you discover is kind of an outcome.
00:07:47.720 | But the things that you learned along the way, all the prompts that you tried,
00:07:51.160 | all the workflows that you tried that didn't work, that's really the learning.
00:07:54.120 | That's really the IP that you have.
00:07:55.480 | And if you're not saving that, when the person that figured it out walks out the door,
00:07:59.240 | your IP walks out the door with them because you're not going to be able to iterate further from there.
00:08:03.320 | And so what that means is reproducibility matters, right?
00:08:06.760 | You don't really save your IP, you don't save your learnings if you can't reproduce your learnings, right?
00:08:11.320 | And this is the obvious thing. I think everybody asks for reproducibility.
00:08:14.520 | We've always talked about a reproducibility crisis in AI, but reproducibility is incredibly hard.
00:08:19.000 | It's incredibly painful to keep track of everything so you can reproduce your stuff,
00:08:22.600 | especially with these non-deterministic endpoints.
00:08:25.000 | And you see a reproducibility crisis in many fields, not just software.
00:08:28.280 | And so my perspective is what you should be doing is tracking everything.
00:08:33.720 | And the tracking needs to happen with outside processes.
00:08:36.680 | You can't rely on humans to write everything down. No matter how much they want to, they're going to
00:08:40.520 | forget. You need to track everything in the background passively to have real reproducibility.
00:08:44.680 | And the real reason that you want to do this is not just protecting your IP,
00:08:48.520 | but because now you can collaborate, right? Now someone else can pick up the learnings that some
00:08:52.520 | engineer had. And when you can collaborate well, you can iterate faster. And the iteration time is
00:08:59.160 | really the key to bringing things to market, right? If you want to go from demo to production quickly,
00:09:03.880 | you need to bring down that iteration time, you need to have good collaboration,
00:09:07.800 | and you need speed. So that's the real ROI, I think, of all these tools that we might be building.
00:09:13.160 | Now, we work with lots of enterprises, and I wanted to give you an enterprise example,
00:09:19.080 | but more than that, I wanted to give you a real-world example that mattered to me. So this is
00:09:23.000 | very similar to what many of our enterprise customers are doing, but this is actually a project that I did
00:09:27.800 | in the few weeks I took off after my second child was born. And actually, this is my first kid, Matilda,
00:09:37.400 | and she was really demanding my attention after the baby was born. We were talking a lot, and she was
00:09:42.760 | actually asking Alexa to play her favorite song. And it's really interesting. I mean, Alexa is an
00:09:47.720 | interesting product. I really love it. My daughter is kind of connected to it, talks to her a lot, but
00:09:53.400 | she was really kind of shocked when she asked Alexa to play her favorite song. And Alexa didn't know
00:09:57.000 | what her favorite song is, because she asked Alexa to play Baby Shark literally five times a day. So
00:10:01.640 | it's like there's no data science needed to know what the favorite song is, but Alexa didn't know this.
00:10:08.280 | And I think there's kind of data privacy reasons that Alexa doesn't keep this, but it's kind of
00:10:11.400 | remarkable with all these amazing applications out there to see Alexa not able to do it. And so my
00:10:15.880 | daughter is asking me, "Hey, could you build an Alexa?" And I was like, "I think maybe I could,
00:10:19.880 | actually, with LLMs today." So I kind of looked around with the open source Alexa alternatives,
00:10:24.440 | where there's something called Mycroft that's out there, but kind of out of date, doesn't really use LLMs.
00:10:30.360 | And then my neighbor actually was showing me that he got a seven-billion-parameter model, I think Llama 2,
00:10:36.840 | running on a Rock Pi, which is like a $200 device, similar price to Alexa. So I thought, "Okay,
00:10:41.240 | if you have that, maybe we could build an architecture that would kind of do the things that Alexa does."
00:10:46.120 | So built a library of skills, things that we asked Alexa to do, like play music, ask what the weather is,
00:10:53.000 | do math problems, check the news, things like that. And these skills are really simple, right? So, you know,
00:10:58.360 | like getting what the weather is, it's like a five line thing that actually Copilot can generate
00:11:03.400 | instantly and works phenomenally well. There's APIs here. The challenge is actually understanding the
00:11:08.360 | speech, and then getting the natural language into an API call. So kind of a simple architecture here,
00:11:14.520 | right? So the weather comes in, you ask what's weather in Boston, it goes through an on-device LLM to
00:11:18.680 | translate it to kind of Python looking code that says weather location equals Boston, has to look like
00:11:23.560 | that because it's going to call that function on the weather skill that I wrote that it calls a weather
00:11:28.520 | API. You know, I think Llama and Whisper are a fantastic open source stack here. llama.cpp is one of the
00:11:36.120 | many options to run Llama on cheap hardware. There's vLLM and others, and lots of companies that'll do it
00:11:42.200 | for you. I kind of wanted to do it myself, so llama.cpp was a great choice. And you can run large LLMs on a MacBook in real time.
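
To make that architecture concrete, here is a minimal sketch of the skill-routing pipeline described above, assuming llama-cpp-python and openai-whisper, a local GGUF model file, and a toy weather() skill. The model path, prompt, and parsing are illustrative, not the exact code from the project.

import re
import whisper                   # pip install openai-whisper
from llama_cpp import Llama      # pip install llama-cpp-python

asr = whisper.load_model("base")                        # speech -> text
llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf")   # text -> function call (path is a placeholder)

def weather(location: str) -> str:
    # Toy skill: a real version would call a weather API here.
    return f"It's sunny in {location}."

SKILLS = {"weather": weather}

PROMPT = (
    "Translate the user request into exactly one call from this list:\n"
    'weather(location="...")\n'
    "Request: {request}\nCall:"
)

def handle(audio_path: str) -> str:
    text = asr.transcribe(audio_path)["text"]           # e.g. "what's the weather in Boston"
    out = llm(PROMPT.format(request=text), max_tokens=32, stop=["\n"])
    call = out["choices"][0]["text"].strip()            # e.g. weather(location="Boston")
    m = re.match(r'(\w+)\(location="([^"]+)"\)', call)
    if not m or m.group(1) not in SKILLS:
        return "Sorry, I didn't catch that."
    return SKILLS[m.group(1)](m.group(2))
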
00:11:53.240 | We'll have to skip the demo since I'm not sure the audio is working. What you will find, if you actually do try
00:11:59.240 | to do this yourselves, and a lot of our customers do try to do this themselves and they discover the
00:12:02.360 | same thing, is that latency really matters a lot, right? Because you're actually transcribing audio,
00:12:07.000 | and then you're running the transcription through an LLM to make this function call. And so if that
00:12:13.320 | takes longer than a couple hundred milliseconds, it's incredibly painful. So you need to go kind of
00:12:17.800 | small model here, and you need to run it fast on hardware. So that's why we went with Llama 2 at the
00:12:23.080 | time, plugged it into a default prompt, and it didn't work. Zero percent accuracy. And so every kind of
00:12:28.920 | customer goes through this journey to try to improve the accuracy, or whatever the real kind of user
00:12:33.240 | facing metric is. I'll tell you briefly about my journey to improve this. The first thing I did was
00:12:38.440 | I did some prompt engineering, just kind of common sense, make the prompt better, kind of lay out what
00:12:43.480 | functions are available. The model kind of gets closer, and now it's saying call: weather location
00:12:49.240 | equals Boston, but still actually not in any kind of format that I can run. Here,
00:12:52.920 | I need really exact accuracy because I'm generating code function calls. This is Llama 2 chat. There's
00:13:00.040 | so many models to try these days. Llama 2 chat was trained on conversations, performed better in this
00:13:05.800 | case, and got the accuracy up to 11 percent. Still not possible to ship this, but I kind of looked at the
00:13:12.600 | errors it was making. Looking at the errors, incorporating feedback, got the accuracy up to 75 percent. So pretty
00:13:17.800 | satisfying, pretty iterative process, probably typical to what you're doing or what your engineers are doing.
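
For concreteness, the "looking at the errors" step here is basically an exact-match accuracy check over a small labeled set of request-to-call pairs, with the failures folded back into the prompt as examples. A minimal sketch; the eval set is illustrative, and generate() stands in for the llama.cpp call shown earlier.

EVAL_SET = [
    ("what's the weather in Boston", 'weather(location="Boston")'),
    ("play baby shark",              'play_music(song="Baby Shark")'),
]

def accuracy(generate) -> float:
    failures = []
    for request, expected in EVAL_SET:
        got = generate(request).strip()
        if got != expected:                  # exact match only: the call has to be runnable as-is
            failures.append((request, expected, got))
    for request, expected, got in failures:  # candidates for new few-shot examples in the prompt
        print(f"FAIL {request!r}: expected {expected}, got {got}")
    return 1 - len(failures) / len(EVAL_SET)
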
00:13:23.080 | Now it's kind of a cool demo. My daughter is like mildly impressed if she says the thing just right,
00:13:28.920 | but actually 75 percent accuracy is incredibly annoying. It's still not going to work well. Switched to Mistral.
00:13:38.920 | Mistral came out mid-project here, and Mistral's model is free, with the exact same API as Llama 2, and got the
00:13:45.400 | accuracy up to 79 percent. So this is sort of happening in real time, probably happening to you also. If you're
00:13:50.040 | ready to switch the models quickly, you can get free accuracy improvements as new models come out.
00:13:55.080 | 79 percent accuracy is still really annoying, trust me. So I thought, okay, I need to improve this. Still
00:14:03.320 | not impressing my daughter. And so I thought, okay, let's try some fine-tuning. And now, you know,
00:14:08.760 | fine-tuning used to be a hard thing to do, especially on these really large models. LoRA and QLoRA have made
00:14:14.280 | this a lot more tractable. Now there's probably 20 companies here that actually do this for you for super cheap.
00:14:20.040 | So, you know, no excuse for not fine-tuning once you get to a point where you want to make that kind
00:14:25.240 | of last improvement of accuracy. I was able to run QLoRA on my GPU in my basement. This runs on kind
00:14:31.480 | of standard hardware. I think it's a 4080. And ran a data set in a few hours. But you need the data set.
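
Setting the data aside for a moment, the QLoRA setup itself is only a few lines with Hugging Face Transformers and PEFT. Here is a rough sketch of running it on a single consumer GPU; the model name and hyperparameters are illustrative, not the exact configuration from the project.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit so a 7B model fits on a single consumer GPU.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")

# Attach small trainable LoRA adapters instead of updating all 7B weights.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # typically well under 1% of the parameters

# From here, train on the (request, call) pairs with your usual Trainer / SFT setup.
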
00:14:39.720 | So I manually created a couple examples here. You know, using a larger model, in this case ChatGPT,
00:14:43.960 | to create more examples worked phenomenally well. I think 95 percent of them were good. You know,
00:14:47.880 | I kind of went through by hand and filtered some out. But that, you know, in 15 minutes,
00:14:52.360 | I have a couple thousand examples to fine-tune on. I gave it a schema to format the answer in. And
00:15:00.520 | once it saw more examples, it was actually able to generate more interesting examples kind of from that
00:15:06.040 | data set. Fine-tuning plus Mistral actually led to 98 percent accuracy. So this is a super satisfying
00:15:12.520 | project. 98 percent accuracy actually starts to feel like, hey, maybe we could ship this thing.
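
The data bootstrapping step looks roughly like this: hand-write a few seed examples, ask a larger model to generate more in the same schema, then spot-check them by hand. A sketch assuming the openai>=1.0 client; the seed examples, prompt, and model name are illustrative.

import json
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment
SEED = [{"request": "what's the weather in Boston", "call": 'weather(location="Boston")'}]

prompt = (
    "Here are voice-assistant requests paired with the function call they should map to:\n"
    + "\n".join(json.dumps(ex) for ex in SEED)
    + "\nGenerate 20 more examples in exactly the same JSON format, one per line."
)
resp = client.chat.completions.create(model="gpt-4o-mini",
                                      messages=[{"role": "user", "content": prompt}])

examples = []
for line in resp.choices[0].message.content.splitlines():
    line = line.strip()
    if line.startswith("{"):
        try:
            examples.append(json.loads(line))
        except json.JSONDecodeError:
            pass   # skip anything the model formatted badly

# Spot-check by hand, drop the bad ones, then write out the fine-tuning file.
with open("finetune.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
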
00:15:16.760 | And as a surprising byproduct, it actually worked for other languages. So I was using
00:15:22.920 | the Whisper model, which works really well in other languages. Japanese is the one that I tested on.
00:15:27.800 | I think this is French here. But the LLMs also work in other languages. So we've kind of built this amazing
00:15:33.880 | multilingual Alexa as kind of a fun side project. Which, by the way, I think is exciting because I think
00:15:39.320 | there was a time when these models got really big. And there's always been kind of a vibrant hacker
00:15:43.880 | culture, I think, in AI. But started to wonder if you could only do really expensive projects to get
00:15:49.240 | anything interesting done. I think now there's just an incredible amount of things that people can do.
00:15:54.360 | So you can find this project at github/luga/auto. And the lessons learned stepping back are exactly the
00:16:03.080 | things that our customers are discovering. It's exactly the things that you're probably discovering.
00:16:06.600 | Prompt engineering works really well, especially when you start with a terrible prompt. And then
00:16:11.960 | fine tuning improves the performance at the end. People often talk about, hey, should I be using
00:16:15.720 | RAG? Should I be using fine tuning? Should I be using prompt engineering? I think almost always in the end,
00:16:19.880 | to get something shipped into production and working well, you end up using a whole bunch of these
00:16:24.440 | different methods. Everything is making iterative improvement. Gradually, we get to the accuracy that we need.
00:16:31.720 | The other thing that I kind of hid from you in this project is this did take me a couple weeks to get
00:16:36.280 | done. And most experiments didn't work. So in the real world, if I gave this to a colleague and they
00:16:42.200 | tried to iterate on top of it, if they just saw this talk here, they wouldn't realize all the things
00:16:49.160 | that I tried that failed. And they'd probably start to do them over and kind of have to experience the same
00:16:53.880 | failures to catch up to where I was. And so we actually kept track in W&B of all the failures,
00:17:00.600 | all the experiments. And you can go in there and look at them yourself. But I think this is generally best
00:17:03.960 | practice.
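
For reference, passively logging each run like that takes only a couple of lines with the wandb library; the project name, config values, and metric name here are illustrative.

import wandb

run = wandb.init(project="voice-assistant",
                 config={"model": "mistral-7b", "method": "qlora", "prompt_version": 3})
acc = accuracy(generate)                 # the exact-match eval loop sketched earlier
wandb.log({"eval/exact_match": acc})     # failed experiments get recorded too, not just the wins
run.finish()
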
00:17:08.120 | So what are the lessons? People come to me now every day asking, hey, how do I get LLM apps from
00:17:15.480 | demo into production? What are the kind of broader lessons? Besides using Weights and Biases, which you
00:17:19.960 | obviously are already doing. And I think that there's kind of four things that we see our customers do
00:17:28.760 | where they actually get things into production, they actually make them successful. The first is that they
00:17:34.280 | build an evaluation framework and they take this really seriously. We've had people, I do a podcast
00:17:39.480 | called Gradient Descent where I talk to a lot of people. And one prominent CEO of a well-known company
00:17:45.400 | came on the podcast and I asked him, you know, how do you do testing? How do you test your Gen AI
00:17:49.880 | applications in production? And he said, I test these applications by vibes.
00:17:54.520 | And I asked him, you know, hey, do you want me to remove that? You know, from the, like,
00:18:00.920 | we're not trying to like embarrass people on our podcast, you know? And he's like, no, no, it's great.
00:18:05.160 | I love it, you know? And I think that there actually is a kind of a value to testing by vibes. Like, I think
00:18:10.920 | if more CEOs tested by vibes, you might actually not have some of the failures that you see. I mean,
00:18:17.000 | some of these things that get released are so embarrassing that I think a vibes check would have actually
00:18:20.920 | caused better decision making. But when you're only testing by vibes, the problem is you can't
00:18:27.000 | release a V2, right? Because what happens is, unlike in software development where you're typically adding
00:18:31.160 | features and mostly moving forward, any of these new things that you try are going to make some
00:18:36.520 | things better and some things worse. And the vibes check is not going to tell you the difference between
00:18:40.520 | a 75% accuracy and a 79% accuracy. So you never would have been able to make that switch from Llama 2 to
00:18:45.960 | Mistral because it just wouldn't have been clear whether it's better or worse. And so taking time to build an
00:18:50.680 | evaluation framework is really the foundation of getting these things into production and working
00:18:56.360 | well. I actually checked back with that CEO a year later, and he was doing quite a lot of evaluation
00:19:00.520 | because he had kind of run into that thing where you want to get that second model out and you don't know
00:19:04.520 | how to do it. The other thing is starting with a lightweight prototype. We typically see -- this is
00:19:09.800 | kind of an enterprise anti-pattern, but we see it a lot at Weights and Biases where people kind of want to
00:19:13.560 | build the first step and kind of nail that and then build the second step. And they kind of never get
00:19:17.480 | something into the hands of the end user, which might be an internal audience or it might be a
00:19:21.160 | customer. I think this is kind of basic agile product development, but somehow people forget
00:19:25.880 | this with Gen AI. So it's worth mentioning because we see this failure mode so frequently. And then,
00:19:31.320 | of course, incorporating end user feedback in lots of different ways is critical.
00:19:34.600 | And then iterate. So I'll say, you know, evaluation best practices because I think it's such a cornerstone.
00:19:44.280 | And I think, well, actually, let me ask you this. How many people in the audience feel like they're doing
00:19:47.960 | solid evaluations of their Gen AI models? Gen AI applications? Interesting. So a lot less.
00:19:56.280 | It seems like you all are not feeling great about your evaluation frameworks. I think what we see,
00:20:03.160 | where people kind of end up, is lots of different evaluation sets and techniques. A lot of our
00:20:10.680 | customers will do things like, okay, these are things we can never get wrong. And we want to
00:20:14.040 | have like a hundred percent in this simple test. And there's lots of tricks to getting that right.
00:20:17.400 | Then people want to have, you know, something that's sort of very quick to run. So they can,
00:20:21.640 | they can kind of run it in real time. They can test something new in 20 or 30 seconds. And then
00:20:26.040 | our customers typically will have much bigger evaluation sets that run nightly and give feedback.
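
Concretely, that tiered setup can be as simple as three eval sets of different sizes run at different cadences. A minimal sketch; the file names and the exact-match metric are illustrative and would be application-specific in practice.

import json

def run_eval(path: str, generate) -> float:
    cases = [json.loads(line) for line in open(path)]
    hits = sum(generate(c["request"]).strip() == c["call"] for c in cases)
    return hits / len(cases)

def must_pass(generate):      # the things we can never get wrong: require 100%
    assert run_eval("evals/must_pass.jsonl", generate) == 1.0

def smoke(generate):          # small set, runs in 20-30 seconds while iterating
    return run_eval("evals/smoke.jsonl", generate)

def nightly(generate):        # the big set, run on a schedule with results logged
    return run_eval("evals/full.jsonl", generate)
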
00:20:31.480 | That's where people end up. And I think that's really important to do all the different kinds of
00:20:35.320 | evaluations that might matter for your application. And then, you know, the other thing that's very
00:20:39.080 | hard to do, but I mean, MLOps has been doing this for decades, is trying to make sure that the metrics
00:20:43.880 | correlate with the user experience or the value that you actually provide for your customer. Totally
00:20:48.200 | application dependent. Very hard to do. I mean, when we started Weights and Biases, we kind of thought
00:20:53.000 | people might want to track, I don't know, like 10 or 20 metrics per model that they build or per
00:20:57.960 | application. And we see with, with people that are really in production, like a typical, you know,
00:21:03.000 | self-driving car company or like a typical bank with a mission-critical application, we see
00:21:07.560 | thousands of metrics. We see tens of thousands of metrics. We get requests like, "Hey, can I do
00:21:11.560 | regex search across my metrics? Because I have like so many metrics, I can't even, I can't find, you know,
00:21:17.000 | the metrics." And I think that's because you really want to get these metrics working and connected to
00:21:22.600 | your user experience. And there's so many different ways that something could fail. But then the most
00:21:26.520 | important thing is actually do it, right? It seems like a lot of people in this room maybe haven't started to do it yet,
00:21:30.680 | but doing evaluation is so critical because you can't improve any of these other things without it.
00:21:36.360 | And when people come to me and they ask me questions like, you know, should I be doing fine tuning?
00:21:40.440 | Should I be doing RAG? Should I be doing XYZ? What that tells me is they don't have a good evaluation
00:21:46.840 | system in place, right? Because it shouldn't take you very long to have an answer for yourself,
00:21:51.000 | for your own application if one of these techniques works or not.
00:21:54.680 | So look, you know, we're here at this conference and this is a lot of fun. If you want to come by our
00:22:00.920 | booth, the Weights and Biases booth, we can show you Weave, which can help you with some of your
00:22:05.000 | problems getting AI applications into production. Not all of them today, but we'd love to talk to you
00:22:12.040 | about what we're doing and we'd love to hear from you about how your experience has been. Thank you.
00:22:17.400 | *Applause*