Understanding AI Stakes to Break Production Code: Philip Rathle


00:00:00.000 | So the idea here is a lot of you are solving some of the most leading-edge AI,
00:00:17.700 | gen AI problems in the world in this room, and then a lot of you are trying ambitious things. So
00:00:23.560 | main goal was to get folks learning from each other, so think of this as a logical
00:00:29.400 | roundtable, not very round physically. The goal is to come away with insights, specifically
00:00:36.840 | about getting-to-production challenges, by learning from each other. So two rules
00:00:41.420 | of this AI roundtable. One is we'll all be stronger if we figure this out
00:00:46.320 | together. Number two is Chatham House rules. That's for Peter's benefit. For us
00:00:52.140 | Americans in the room, what happens in Nob Hill stays in Nob Hill. So let me actually
00:00:58.800 | kick it off by sharing some of my personal learnings. I'll go through maybe
00:01:02.240 | five minutes or so of what I've learned in relation to getting to production, but
00:01:08.800 | specifically around the topic of stakes. And by stakes, I mean like how high stakes is
00:01:14.560 | the problem that you're trying to solve. And I've found there's a relationship between
00:01:17.760 | the stakes and the bars and obstacles that you run into. So I'll just
00:01:22.320 | share my own journey and hopefully this is useful and food for thought.
00:01:25.760 | But as I'm going through it, think about what you're experiencing and the hurdles. Because
00:01:30.000 | what I'm going to ask after I go through my own section is what are some of the hurdles you
00:01:36.560 | are running into? What are some of the ones that you haven't solved? What are some of the ones you have
00:01:41.200 | solved so that we can learn from each other? All right. So it all started like this.
00:01:46.240 | Gen AI is a hockey stick, right? And then we kind of ended up with actual project experience where
00:01:53.440 | you can do, I think Lucas said this earlier, lots of amazing prototypes. It's so easy to build an
00:01:59.520 | amazing prototype. But then getting to production is harder. You have a higher bar. And so what a lot of
00:02:05.840 | folks have figured out is you can use RAG with vectors. And all right, now it looks like a hockey stick.
00:02:11.520 | And in some cases, it does. In some cases, you can get to production with this. But then there's
00:02:16.000 | still a class of use cases where that's not enough. And so what, you know, you'll hear from the Neo4j folks
00:02:24.240 | here, and if you go to the booth and attend Emil's talk and some of the other ones later is there's another
00:02:28.960 | unlock here with knowledge graphs and with graph rag. Again, like, will this be a hockey stick? I think the
00:02:35.520 | world moves in S-curves. So this is probably more realistic. And there's probably going to be some
00:02:39.760 | other thing. I don't know what that looks like. But why this pattern? So I've come to see a spectrum
00:02:48.320 | of application stakes. And it's a whole spectrum, right? But to kind of paint opposite ends and maybe
00:02:53.120 | something in the middle, if something is low stakes, then maybe it's more business peripheral.
00:02:58.960 | It's things like summarizing text. Here I have someone, you know, asking for a children's bedtime story.
00:03:05.360 | There's no single right answer. The stakes are a bit lower. And then on the high end of things,
00:03:13.440 | like a wrong answer can have significant impact in terms of dollar value, brand and reputational impact,
00:03:20.720 | health and human safety, regulation, fines, bias, all that stuff. And then there's this in-between zone
00:03:28.560 | where, you know, it might be higher business stakes, but I've got some mitigations. Like,
00:03:33.680 | I've got a human in the loop. So here you have a difference between, at the far right,
00:03:39.440 | pilot, and in the middle, co-pilot. So there's kind of an interesting relationship between pilot,
00:03:47.280 | co-pilot, and stakes.
00:03:50.640 | And, you know, the problem with high-stakes use cases is you bump into the meme, right? Which you've
00:03:55.440 | all seen. And specifically, it's things like needing enterprise domain knowledge, up to the moment
00:04:02.080 | knowledge. This is what RAG solves. Solving for hallucinations. Again, RAG solved this, and depending
00:04:09.120 | on how you solve it, you're loading the dice, loading the dice, loading the dice. And then,
00:04:13.520 | depending on the stakes, your tolerance for error, you know, potentially gets very
00:04:19.600 | low, right? How tightly should I tighten the bolts on my 737 MAX 9? Like, let's not have a 0.001% chance
00:04:27.840 | of error, because that's not good enough. Cool. So let's look at how we're representing data. So here's an
00:04:36.480 | apple. And think about how we conceptualize an apple as a human. It's, you know, this multi-dimensional,
00:04:42.160 | multi-sensory kind of thing. Now, let's look at a vector view of an apple. It looks something like this.
00:04:49.200 | So I can find other fruit by doing a, you know, cosine similarity or spatial distance
00:04:57.040 | operation with this representation of an apple. It's useful. Again, good for certain things,
00:05:04.960 | not as good for other things. And if I characterize the behavior that you get with
00:05:11.920 | LLMs generally and with vector-based RAG, it's this, you know, kind of world of statistically computing
00:05:18.080 | the real world based on word proximity statistics and word frequency statistics. So you get this
00:05:26.000 | behavior that's kind of a little bit like the right brain. You know, it's creative. It's mostly right.
00:05:31.280 | Sometimes really, really wrong. I don't really know why it's wrong. It just does its thing. A bit of a black box.
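(An illustrative aside, not from the talk: a minimal sketch of the cosine-similarity lookup just described. The three-number vectors are made up; a real system would use embeddings from an embedding model.)

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors only; real embeddings would come from a model.
embeddings = {
    "apple":   np.array([0.9, 0.1, 0.8]),
    "pear":    np.array([0.85, 0.15, 0.75]),
    "tractor": np.array([0.1, 0.9, 0.2]),
}

query = embeddings["apple"]
neighbors = sorted(
    ((name, cosine_similarity(query, vec))
     for name, vec in embeddings.items() if name != "apple"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(neighbors)  # "pear" ranks well above "tractor"
```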
00:05:35.680 | Enter knowledge graphs as a way to unlock higher-stakes use cases. So this gives
00:05:42.720 | you a more structured representation that is understandable by humans, can be reasoned upon
00:05:50.320 | by machines. And this gives you a bit more left-brain behavior. So it can't do all the things that LLMs can
00:05:56.160 | by any means. But when you combine it with an LLM, being able to bring in facts, reasoning, discernment, and
00:06:03.520 | long-term memory can create a better-together story. So that's my quick sharing around stakes and the
00:06:13.280 | different levels of unlock. So let me turn it over to folks in the room. And Peter will be coming around
00:06:19.360 | with a microphone. So let me start with a question. What's one learning that would
00:06:27.840 | have saved you a lot of time if you'd known it sooner? Come on, all at once. Anyone just raise your hand,
00:06:35.120 | blurt it out if you're... Well, maybe if no one's got one, maybe you have one,
00:06:40.640 | like just off the cuff. It might inspire other people to have a story, get the scope. I mean,
00:06:47.040 | it comes back to stakes and, you know, understanding what those are up front for different applications.
00:06:54.880 | So we've heard a lot today about iteration and how important that is. So definitely iterate, you know,
00:07:00.320 | don't overthink things initially. But, you know, also think and plan ahead. And that can guide
00:07:07.760 | things like what tools you use. Also, locus of reasoning is something I think about: if there's
00:07:14.480 | a question that has an exact answer, a deterministic question,
00:07:18.880 | then, you know, maybe you can use the LLM, which is inherently non-deterministic, to complement some
00:07:25.680 | other technology that can give you a deterministic answer, which you already have. You've got databases
00:07:29.680 | and other kinds of reasoning engines. And then use the LLM to do what it's exceptionally good at,
00:07:36.160 | which is maybe take the question, turn it into a query, take the result of that query, formulate it back into some
00:07:43.840 | text, and then maybe modulate that text depending on who the audience is. If it's going to a regulator,
00:07:48.960 | then very serious, no jokes. If it's going into, you know, an outbound marketing email, then it can
00:07:55.120 | maybe be phrased differently depending on who the target audience is. Is it someone in high
00:08:01.360 | school? Is it a retiree? Yes, I see a hand, though, over there.
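(An illustrative aside, not from the talk: a rough sketch of that locus-of-reasoning split, question to query to deterministic engine and back to prose. `call_llm` and `run_query` are hypothetical stand-ins for whatever LLM client and database or reasoning engine you actually use.)

```python
# Hypothetical stand-ins: swap in your real LLM client and database driver.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def run_query(query: str) -> list:
    raise NotImplementedError("plug in your database / reasoning engine here")

def answer(question: str, audience: str) -> str:
    # 1. The LLM translates the natural-language question into a query.
    query = call_llm(
        "Translate this question into a single SQL query and return only the query:\n"
        + question
    )
    # 2. The database, not the LLM, computes the exact, deterministic answer.
    rows = run_query(query)
    # 3. The LLM turns the exact result back into prose, modulated for the audience.
    return call_llm(
        f"Rewrite this result as an answer for a {audience}. "
        "Stick to the facts given; do not invent numbers.\n"
        f"Question: {question}\nResult: {rows}"
    )
```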
00:08:04.480 | Oh, cool. Let's go. Here we go. Ah, so I'll go here.
00:08:11.680 | I have one learning that I wish I had had before is, instead of asking the LLM to solve a problem,
00:08:17.680 | right, like, paste in an Excel sheet and ask which one's the outlier, just ask it to write the code to solve
00:08:23.200 | that problem, which it can easily do, and then you can leave the LLM out of the loop, right? You say,
00:08:29.120 | find the outlier, we'll do some Python outlier machine learning thing, and then you have the answer.
00:08:34.160 | Cool, yeah, so the LLM might not be good at certain kinds of reasoning, but it can write code that will
00:08:39.440 | do the reasoning, which is a pretty amazing hack, actually. Thanks.
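(An illustrative aside, not from the talk: the kind of deterministic code the LLM might be asked to write instead of being asked for the answer directly, here a simple z-score outlier check on made-up numbers.)

```python
import statistics

def find_outliers(values: list[float], z_threshold: float = 2.0) -> list[float]:
    # Flag values whose z-score exceeds the threshold; runs outside the model.
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > z_threshold]

print(find_outliers([10.1, 9.8, 10.3, 10.0, 9.9, 42.0]))  # [42.0]
```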
00:08:43.440 | Everyone's getting involved now. Wait, I'm sorry, who had their hands up? Let's go here.
00:08:51.120 | So one of the things we learned was, when we were getting excited about the RAG and indexing and doing
00:09:00.240 | the semantic search, and when we put it into production, our customers were not used to doing queries
00:09:07.120 | like this. They always resorted back to keywords. It's almost like they were trained to do this,
00:09:14.320 | and so we had to bring them along, and we had to ask, like, is this question actually a keyword query,
00:09:21.840 | or is it a query they're asking as a question? So that was a learning that we made while putting it into
00:09:30.720 | production, and we put a lot of effort into making the search work well, but we didn't ask upfront, like,
00:09:37.280 | well, is this actually how you want to use it? So that was one learning there.
00:09:41.440 | And so did you have a sort of curriculum to help people understand how to ask questions, and that they
00:09:46.320 | can do more than just keywords? So yeah, we didn't pay attention to this. We were just assuming, like,
00:09:52.720 | yeah, everybody's now going to start putting questions in, so that was, you know, the surprise and
00:09:57.120 | the learning there: we needed to ease them in. Yeah, that's cool. You need to understand what the
00:09:59.920 | technology's capable of, and that it can do more than just what you're used to; it's not a Google search box.
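(An illustrative aside, not from the talk: one way the keyword-versus-question routing just described might look. The heuristic and the `keyword_search` / `semantic_search` stubs are assumptions, not anyone's production code.)

```python
# Hypothetical search backends; swap in whatever you already have.
def keyword_search(text: str) -> list:
    raise NotImplementedError("classic keyword / BM25 search goes here")

def semantic_search(text: str) -> list:
    raise NotImplementedError("vector / RAG-style search goes here")

QUESTION_STARTERS = {"what", "why", "how", "when", "where", "who", "which",
                     "is", "are", "can", "does", "do", "should"}

def looks_like_a_question(text: str) -> bool:
    tokens = text.lower().split()
    return text.strip().endswith("?") or (
        len(tokens) >= 4 and tokens[0] in QUESTION_STARTERS
    )

def route(text: str) -> list:
    # Keyword-style inputs ("invoice 2023 refund") go to the search users
    # already know; question-style inputs go down the semantic path.
    if looks_like_a_question(text):
        return semantic_search(text)
    return keyword_search(text)
```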
00:10:03.920 | I think the lesson is to kill more projects during experimentation. So right now, I think one of
00:10:19.840 | the big lessons we learned is we did a lot of experimentation, and if I had to look back,
00:10:26.400 | I'd focus on the few that will go all the way to production, because a lot of these experiments,
00:10:31.920 | even with the best effort, will never meet the user expectation on accuracy. So if we can kill
00:10:38.160 | a lot of projects but focus on one or two that we can take all the way to production, that's
00:10:44.160 | probably where I would focus a lot more. People spend too much time on not being focused.
00:10:49.040 | And when deciding what project to focus on, I imagine there's an element of how achievable is it,
00:10:55.200 | and how valuable is it? How do you...
00:10:57.360 | Yeah.
00:11:03.600 | Yeah, so matching up with user expectation of accuracy, and then those projects are bound to succeed, and go with those. Cool. Thanks.
00:11:21.600 | My learning is expect less and use more common sense.
00:11:26.960 | One learning that I've experienced from our customers when they're building deployments is, you know, it's really easy to walk in and say, "I want the best model, and I want it to run as fast as possible." But when you're making, you know, actual cost-to-quality trade-offs and cost-to-speed trade-offs, you know, you can...
00:11:47.200 | You can really overspend versus what you really need if you don't have a great understanding of what you're actually trying to build, how fast it actually needs to run, that kind of stuff.
00:11:53.440 | You know, in a lab with a small number of questions, you're naturally going to go for the best accuracy, but then enter production.
00:11:59.680 | You know, if you're burning a year's budget within a day, which is very possible with this stuff, then maybe you need smaller models or multiple...
00:12:05.920 | Yeah, great one.
00:12:07.920 | Actually, I just had one pop up in my own mind that I remember seeing, which was when Bloomberg were training, I think it was Llama 2 at the time, to handle a lot of their financial data.
00:12:26.160 | And it performed really well on all the evals and everything, and then they found that when Llama 3 came out, it actually outperformed their fine-tuned model that they spent all this time producing.
00:12:46.400 | And so I'm not entirely sure what the takeaway from that is, but it's almost this idea that if you just wait for the models to get better, then you kind of save work.
00:12:55.960 | But then, I guess that's not necessarily going to fly in the enterprise, where it's like we want something working today.
00:13:00.320 | Well, this is a great... I'd love to hear what people in the room are experiencing, because there's this real tension of, all right, if I just wait a couple years, like, this stuff is all going to figure itself out, and it'll be way better, but who wants to wait a couple years?
00:13:12.760 | Like, we want to get out and get competitive advantage.
00:13:15.320 | We're seeing crazy advances.
00:13:18.000 | Yeah, yeah, you wait a few weeks.
00:13:20.640 | We're just focused on, like, the full path and getting it end-to-end, and then, like, in parallel tracks, optimizing every part of that, right?
00:13:29.000 | Yeah.
00:13:29.400 | And shipping customer value, and then, oh, our data engineering part of that was, like, a kludge, and that's not going to scale, so let's get people working on that.
00:13:37.480 | It's like an interface between every step of this chain, so, like, if I encounter an issue where my LLMs aren't performing, and it's like, oh, no, we've got to use, like, Claude 3.5 Sonnet, and it's going to cost us a lot, we'll just wait, you know, it's going to come down a little bit.
00:13:51.880 | Yeah, yeah, yeah, the bars I was showing earlier are naturally all rising, like, across all those dimensions, and if you understand what the pipeline looks like, then you can just drop stuff in, right?
00:14:07.880 | Yeah.
00:14:08.880 | Yeah.
00:14:09.880 | Just iterate on each one of those components.
00:14:11.880 | Yeah.
00:14:12.880 | That's what's been working really well for us.
00:14:13.880 | Cool.
00:14:14.880 | I like -- so, one question to drill into that is, I can see pipelines getting more complex with more agentic stuff and so on.
00:14:22.880 | Like, how do you -- like, that one is a moving target, too.
00:14:25.880 | How do you see handling -- how are you handling that?
00:14:27.880 | It's not something we've figured out yet.
00:14:29.880 | I'm honestly looking for tool -- better tools to help orchestrate that.
00:14:33.880 | Yeah.
00:14:34.880 | Yeah.
00:14:35.880 | I think we are making improvements on various parts of this thing, but, like, I've had to slow down the team.
00:14:39.880 | Like, you're making too many changes at once now across the different things.
00:14:42.880 | So, it sounds like we have to slow down a little bit and put interfaces and contracts between the layers.
00:14:46.880 | So, that would let us iterate on one side without affecting the downstream on the other.
00:14:49.880 | So, I don't know that.
00:14:51.880 | Cool.
00:14:52.880 | Great.
00:14:53.880 | Anyone else on this question?
00:14:54.880 | Just to piggyback.
00:14:56.880 | When you're thinking of the latency improvements you can get, if you have a solution that might take you months to implement, you have to evaluate that with a trade-off and say,
00:15:04.880 | by the time we finish this, will a new model come out?
00:15:07.880 | Yeah.
00:15:08.880 | So, I think the time it takes to implement these solutions, there should be, like, a max, like a ceiling.
00:15:11.880 | Like, if it takes more than however many months, you're just wasting your time because it's going to be throwaway.
00:15:17.880 | I love that.
00:15:18.880 | Yeah.
00:15:19.880 | If it's a two-year project, then you're -- why bother?
00:15:21.880 | Basically --
00:15:22.880 | Like, anything you do now is going to be irrelevant.
00:15:23.880 | Right.
00:15:24.880 | I need another solution.
00:15:25.880 | Yeah.
00:15:26.880 | Basically.
00:15:27.880 | Yeah.
00:15:28.880 | Love it.
00:15:29.880 | I'll dash over here.
00:15:30.880 | I could also just shout.
00:15:31.880 | Yeah.
00:15:32.880 | Yeah.
00:15:33.880 | Wait.
00:15:34.880 | Sometimes.
00:15:35.880 | Thank you.
00:15:36.880 | Yeah.
00:15:37.880 | One -- all of that I really resonate with.
00:15:38.880 | And one thing that we've seen is companies that have succeeded have avoided premature optimization.
00:15:43.880 | So, obviously, you want to be sort of aware of latency in production and, you know, make something
00:15:48.880 | that will be realistic.
00:15:49.880 | But the teams that have sort of started with that and gone, "We're going to use a small fine-tuned
00:15:53.880 | model," often incorrectly conclude that a task isn't possible.
00:15:57.880 | And also, people are often able to take data from a larger model and fine-tune a smaller one later.
00:16:02.880 | And so, I always think of, like, moving down the model size stack or fine-tuning as optimizations.
00:16:07.880 | And you want to validate the product need and whether it works, whether it's possible, usually
00:16:12.880 | with a better model first.
00:16:13.880 | At least, we've seen that be much more successful.
00:16:15.880 | That's a great -- I've heard that, too.
00:16:17.880 | Like, start with GPT-4o or, I guess, now maybe Claude 3.5.
00:16:22.880 | And if that can't do it, then the smaller models probably won't.
00:16:27.880 | And I guess you have an exception for data that's highly proprietary.
00:16:30.880 | And that's where fine-tuning and RAG come in.
00:16:33.880 | I think, building on something that you mentioned earlier as well, when, for example,
00:16:49.880 | you're using a large language model to turn unstructured data into structured data to then feed
00:16:54.880 | it into, like, traditional machine learning models, one of the key learnings for us was,
00:16:58.880 | if we're building our eval pipeline from the get-go, which is something that's really been echoed today,
00:17:04.880 | you know, trying to figure out what your eval should be and getting it ready from the start,
00:17:08.880 | to understand: do you have the right data to build your eval pipeline?
00:17:12.880 | And if not, start from day zero to make sure that you generate that data for your eval pipeline.
00:17:17.880 | Because if you leave it too long and then realize, oh, we've built a model,
00:17:25.880 | and it's in our app, but we cannot evaluate whether the fine-tuning or the prompts make it better or worse,
00:17:31.880 | you're then spending a lot of long lead time figuring out what your eval pipeline should be.
00:17:36.880 | So our key learning was really start, as you start building your app,
00:17:40.880 | also start with your eval pipeline and make sure that you have the data that's there.
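(An illustrative aside, not from the talk: a minimal sketch of an eval harness you could stand up from day zero, assuming a hypothetical `generate` callable for whichever prompt, model, or RAG variant is being compared.)

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalCase:
    question: str
    expected: str   # a substring the answer must contain

# A tiny, made-up eval set; in practice this is the data you start
# collecting on day zero.
EVAL_SET: List[EvalCase] = [
    EvalCase("What is the capital of France?", "Paris"),
    EvalCase("How many days are in a leap year?", "366"),
]

def run_eval(generate: Callable[[str], str],
             cases: List[EvalCase] = EVAL_SET) -> float:
    # Fraction of cases whose output contains the expected answer.
    hits = sum(1 for c in cases
               if c.expected.lower() in generate(c.question).lower())
    return hits / len(cases)

# Usage: score two prompt or model variants against the same fixed set,
# so "better or worse" becomes a number rather than a feeling.
# score_a = run_eval(pipeline_a)
# score_b = run_eval(pipeline_b)
```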
00:17:44.880 | Cool. That's maybe a corollary of the, okay, you need your DevOps kind of framework,
00:17:49.880 | but then you need your eval framework, and then your data framework is closely related to that,
00:17:54.880 | which is making sure you have the right data, yeah, and that the quality's there.
00:18:06.880 | One of the things we learned is not hiding AI from the end user.
00:18:11.880 | So we're, like, explicitly, for example, generating the responses slowly,
00:18:17.880 | so as a hint for the user that they should be more vigilant,
00:18:20.880 | we're using specific colors in the UI to make it clear,
00:18:24.880 | the icons that they click on,
00:18:26.880 | so kind of make sure that they understand that there is a special thing they need to do,
00:18:31.880 | and then when they want to check,
00:18:33.880 | we give them as much tools for them to validate the answer that was generated,
00:18:37.880 | to see what went wrong, what were the sources, what kind of went in there.
00:18:42.880 | So that's one of the lessons: don't just abstract this away and kind of hope for the best.
00:18:47.880 | This is related to your moderate expectations point.
00:18:49.880 | Yeah.
00:18:50.880 | If the user can actually see the thing coming out, then that's a trigger that,
00:18:53.880 | all right, it's not instant, it's coming from some machine, I need to consider things.
00:18:58.880 | Yeah.
00:18:59.880 | Cool.
00:19:00.880 | I want to ask: is anyone dealing with regulation, or something where explainability is a big issue,
00:19:10.880 | and if so, how are you solving that?
00:19:12.880 | What sort of industry are we talking, like a medical something?
00:19:19.880 | Yeah, I could think about medical, like if the end game is to prescribe something or provide some treatment.
00:19:25.880 | Equipment maintenance is another one I can think of.
00:19:28.880 | We work in the behavioral health space, and there's a lot of regulation, and the one way we determined to solve it, as kind of a core principle, is just always human in the loop right now.
00:19:41.880 | Yeah.
00:19:42.880 | Like we're only in the co-pilot stage.
00:19:43.880 | In fact, I tell people we'll never get to the pilot stage, but that's just to not scare them.
00:19:47.880 | Yeah.
00:19:48.880 | But we're very focused on co-pilot right now and making sure there's a human in the loop.
00:19:53.880 | But as part of that, we have to give transparency to the inputs that generate the outputs so they can kind of see things side by side.
00:20:00.880 | If we make recommendations or we pick some data out that we think is interesting for the particular case that they're working with,
00:20:08.880 | they can trace it back to the root data.
00:20:12.880 | Yeah.
00:20:13.880 | And so do you find the inputs are sufficient for getting the humans in the loop to trust the answer enough?
00:20:20.880 | Because, so, you've got inputs.
00:20:23.880 | Another way to do it is to say, here are the references, but then they're not going to read the 350-page papers, maybe.
00:20:29.880 | So if it's doing a lookup, a knowledge lookup, we're showing the source documentation that we find based on their query.
00:20:37.880 | So they can see it kind of side by side with those citations.
00:20:41.880 | If it's doing a generation, so let's say we're summarizing data that was collected in a therapy session.
00:20:47.880 | We show the data side by side with the actual summarization of that.
00:20:51.880 | It's kind of on a paragraph-by-paragraph basis, so that the users have complete control of the resulting summary.
00:20:58.880 | They can change it, edit it.
00:20:59.880 | It's theirs at that point.
00:21:01.880 | And then are you, last question, are you using the judgment of the human in the loop as like RLHF to improve?
00:21:11.880 | Yeah.
00:21:12.880 | We run our own eval on any of our output with a separate stage, just to make sure that the output matches the inputs.
00:21:20.880 | And that's just a quick one.
00:21:21.880 | It's also for a safety perspective.
00:21:23.880 | But then we use whether or not the user edited it and/or just gave us a thumbs up or thumbs down.
00:21:28.880 | Yeah.
00:21:29.880 | And then we do our own clinical evaluation at the end on our own, with our own team.
00:21:34.880 | Just making sure that everywhere we need to, we're putting in that eval stage, so that we know we're comfortable with what's coming out.
00:21:44.880 | Yeah.
00:21:45.880 | Cool.
00:21:46.880 | I'm just going to go over here.
00:21:47.880 | Just one thing that affects whether we can do one more or multiple more: is our next speaker here?
00:21:53.880 | Because I've not.
00:21:55.880 | Great.
00:21:56.880 | Oh, fantastic.
00:21:57.880 | Cool.
00:21:58.880 | Okay.
00:21:59.880 | Right.
00:22:00.880 | Unfortunately, this has to be the last one then for now, but I assume this discussion can continue
00:22:01.880 | at lunchtime as well, because it seems there are so many shared concerns and stories
00:22:06.880 | that people have got in the room.
00:22:07.880 | So it would be a great thing for people to discuss at lunch.
00:22:10.880 | Absolutely.
00:22:11.880 | Here we go.
00:22:12.880 | I'll keep this short.
00:22:14.880 | So a human in the loop is clearly a solution, but that's not always possible.
00:22:18.880 | So what we do is give options.
00:22:21.880 | So this is, you know, one of our use cases is personal loan recommendation.
00:22:27.880 | It's regulated.
00:22:28.880 | And so, rather than ever saying, this is the best loan for you,
00:22:32.880 | we give three options and give explanations for these.
00:22:35.880 | And so it becomes less deterministic, and that's how we take care of the regulation.
00:22:41.880 | Okay, cool.
00:22:42.880 | Yeah.
00:22:43.880 | So give a few options.
00:22:44.880 | And then do you do that directly to the end user or to a human in the loop?
00:22:47.880 | End user, because we cannot have human in the loop.
00:22:48.880 | Right, right.
00:22:49.880 | But I think the three options that we select clearly have some more logic to them than
00:22:54.880 | just what's coming out of the LLM.
00:22:56.880 | Cool.
00:22:57.880 | Thanks.
00:22:58.880 | Well, let's definitely keep this one alive throughout the day as you're mingling with each other
00:23:03.880 | and lunch and so on.
00:23:04.880 | But thanks.
00:23:05.880 | I appreciate the participation.
00:23:07.880 | Thank you.