So, Echo AI. Echo AI is bringing this fundamentally new technology into the world of customer support and customer-facing teams. You've probably all seen this metaphor before: just the tip of the iceberg, with so much lying underneath. And this is especially true for companies dealing with exceptionally high volumes of customer interactions.
So whether that's customer support, sales, or customer success teams, any conversation you're having with your customers is an opportunity to learn more about your business and what your customers need from it. At the surface, you probably get the signals around the things that are going wrong or the things that are routine to handle.
And most enterprises have a good idea of the approximate set of categories they generally have to deal with on a normal basis. But there's just so much that lies underneath, and it's stayed virtually locked away because of the sheer manpower required to look at these conversations. And there's this cycle that happens with these types of companies.
Everyone's heard the startup mantra of being obsessed with your customer, right? Do everything you can to listen to the customer and understand them to the best degree you're capable of at your scale. But as you grow and get bigger and get more customers, a lot of that turns into a cycle where you just lose touch with what they're saying, what they're telling you every day.
And that only really comes with being successful, right? At a manageable number of customers, you can typically use your staff, your sales team, or your customer support team to derive those insights and better navigate your company toward greater revenue growth.
But when you get too big, to the point where that's virtually impossible just due to the scale of conversations you have to deal with, you end up with so much happening right underneath you without your knowing. So the way we think about it, the first three columns here are effectively what every company at scale is trying to do: they do manual reviews.
They take small samples, trying to collect maybe 5% of the conversations. They run an evaluation over them to check for certain compliance things, like how well the agent handled it, or whether there are common subjects or themes in these conversations.
At the end of the day, you're just sampling, and it leaves everyone unhappy, because they understand it's not a very accurate process. So then you involve engineers and start building scripts. You start doing retroactive analysis. You pull data from all the different systems your customers are interacting with you on, you do these long-form analyses, and that can typically surface insights you then want to keep tracking going forward.
So then you move into this situation where you start building software to look for very specific things in every single conversation, and everything is just so retroactive. Everything is after the fact; the fires have already formed and you have no sense of where the smoke is. That's where generative AI can come in and really transform this process, because generative AI unlocks this amazing capability of 100% coverage.
So now, rather than doing that sampling, or just looking for the things you know to look for, you can use generative AI to surface all the things you didn't know to look for and look at everything all at once. Here's a great example. This is just a small interaction between an agent and a chat from one of their customers.
And in this one message that comes from the customer, you're able to find things like: why is this customer talking to me? That would be your intent. What are the aspects of our business that are really at the root of this? Here it's routers, and it was broken on delivery.
So maybe you have a supply chain issue. You've got the basics, things like sentiment: understanding not only the sentiment of your customers but also the sentiment of your representatives, which is maybe more important. And at the end of the day, with each of these messages, you can extract far more depth than has ever been available from a single conversation.
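To make that concrete, here is a minimal sketch of extracting intent, business aspects, and sentiment from one customer message with a chat-completion call. The model name, field names, and prompt wording are illustrative assumptions, not Echo AI's actual pipeline.

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK; any chat-completion API would work

client = OpenAI()

EXTRACTION_PROMPT = """You will be given one customer message from a support conversation.
Return a JSON object with these fields:
- intent: why the customer is reaching out, in a few words
- aspects: list of product/business areas mentioned (e.g. "router", "delivery")
- customer_sentiment: one of "positive", "neutral", "negative"
"""

def extract_signals(message: str) -> dict:
    """Run a single extraction pass over one customer message."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": message},
        ],
    )
    return json.loads(resp.choices[0].message.content)

# e.g. extract_signals("The router you shipped me arrived broken, I need a replacement.")
# -> {"intent": "replacement request", "aspects": ["router", "delivery"], "customer_sentiment": "negative"}
```

Running one such pass per message is what makes the 100% coverage described above feasible without human sampling.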
And that's what our platform seeks to provide. Here's a great example from one of our customers, Wine Enthusiast. We don't have to read all of this, but they shipped a brand new version of one of their units; they sell really high-end wine refrigerators.
So their customers are spending a lot of money. It's a very special relationship they have with those customers, and they want them to buy more fridges. These customers are retail locations. One of the insights we were able to surface for them was a defect in their manufacturing process.
Something that could have gone on for weeks and weeks and become a much bigger problem, versus what our platform was able to surface in real time. So how does this all work? It all really starts with gathering all of those conversations.
This is the non-AI, boring stuff: we connect to a bunch of different contact systems, ticket systems, and so forth. We pull that in, we normalize it, we make it clean and ready to go, and compress it to pass into LLM prompts. Then we have dozens of pipelines assessing these conversations in a variety of different ways, all of which are configurable by the user. So the customer can come in, tell us what they care about and what they're looking for, and they'll actually work with us to write these prompts.
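As a rough illustration of the "normalize it" step, here is a sketch of mapping a source system's ticket payload onto a common conversation schema before it reaches the analysis prompts. The dataclasses and the payload field names are assumptions for illustration, not Echo AI's or any vendor's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    speaker: str      # "customer" or "agent"
    text: str
    timestamp: str    # ISO 8601

@dataclass
class Conversation:
    conversation_id: str
    channel: str      # "chat", "email", "phone", ...
    turns: List[Turn]

def normalize_ticket(ticket: dict) -> Conversation:
    """Map one (hypothetical) ticket-system payload onto the common schema."""
    return Conversation(
        conversation_id=str(ticket["id"]),
        channel=ticket.get("channel", "email"),
        turns=[
            Turn(
                speaker="customer" if c["author_role"] == "end-user" else "agent",
                text=c["body"],
                timestamp=c["created_at"],
            )
            for c in ticket["comments"]
        ],
    )
```

Every downstream pipeline can then assume one shape, regardless of which contact or ticket system the conversation came from.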
And eventually they write the prompts themselves and manage them over time. Why this ultimately matters most is that when you're dealing at enterprise scale, customers are most concerned about accuracy. There's, I think, a huge hesitation in the market right now around accuracy.
I can deploy generative AI to try to understand these conversations, but do I really trust the insights? Is it going to be better than what my business analysts are doing, or my CX leaders, or my VP of sales? Is this technology really capable of giving me insights that I ultimately trust?
So for us, it's important that we establish trust from the very beginning. When we bring on a customer, within seven days we try to show them: okay, here are the insights. And from there we work toward a place where we're, we like to say, 95% accurate. It takes a lot of sampling and a lot of figuring out, but fundamentally we want to create that trust with the customers.
Because that's ultimately what's going to get them to keep renewing and be a customer for a longer period of time. Log10 plays a huge part in our ability to do this. Not only have they created a huge set of features and capabilities that let our engineering team build faster and understand the quality of the code they're writing, but maybe more importantly, our solutions engineers and implementation engineers work hand in hand with the customer who is telling us that something isn't exactly right.
And Log10 is kind of the go-to tool for managing that process. So, thanks Ray. I'm really excited to share with you what we've built at Log10. We're basically an infra layer to improve LLM accuracy for your AI applications. We started with this vision of building self-improving systems.
So: LLM applications that can improve prompts and models themselves to ultimately drive accuracy improvements. Obviously we're not there as a field yet, but that's the vision we're driving toward, and I'm excited to share some of the work we've done along that path. Today, measuring and improving LLM accuracy is hard.
You've probably had this experience if you've tried to deploy a prompt and just YOLO'd it, watching accuracy in prod. And you've probably heard in the news of these instances, like the Air Canada chatbot that hallucinated a refund policy; a judge in Canada forced them to honor the policy it made up.
So obviously that caused financial problems as well as brand image problems. There was the case of the Chevrolet dealership chatbot that was convinced to sell a Chevy Tahoe for a dollar. We've run into similar issues ourselves: we were having some issues with a product, and a chatbot told us to play a game while we wait, kind of missing our emotional state in that moment.
And there are also issues with semantic search engines where, because they're doing this kind of RAG-based lookup, they sometimes just miss out on common sense if it isn't explicitly present in the source documents they pull up. And so people have tried to use human review as a way to look at the output of the LLMs.
That's ultimately become the gold standard for getting LLM apps into deployment, but it's time-consuming and expensive. As an alternative, people have tried AI-based review, for example LLMs as a judge, but obviously this also has a lot of issues with accuracy.
People have found that models tend to prefer their own output. They exhibit positional bias: if you present one option first, the judge might prefer it over the second one simply because of the ordering, even if the second one is better. They have verbosity bias, a bias toward diversity of tokens, and so forth.
So they fail in these fairly trivial ways. With Log10, we did a bunch of algorithms research to try to address this issue and asked the question: could we get the accuracy of human review with the speed and cost advantages of model-based review?
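For context on the positional-bias problem just mentioned, a common mitigation (not necessarily what Log10 does internally) is to run a pairwise LLM-as-judge comparison in both orders and only trust a verdict that survives the swap. A minimal sketch, with the judge model and prompt wording as assumptions:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are comparing two candidate answers to the same question.
Question: {question}

Answer A:
{a}

Answer B:
{b}

Reply with exactly "A" or "B" for the better answer."""

def judge_once(question: str, a: str, b: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, a=a, b=b)}],
    )
    return resp.choices[0].message.content.strip()

def judge_pair(question: str, first: str, second: str) -> str:
    """Run the comparison in both orders; inconsistent verdicts are treated as a tie."""
    v1 = judge_once(question, first, second)   # first shown as A
    v2 = judge_once(question, second, first)   # order swapped
    if v1 == "A" and v2 == "B":
        return "first"
    if v1 == "B" and v2 == "A":
        return "second"
    return "tie"  # the judge flipped with ordering, i.e. positional bias; don't trust it
```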
And that's exactly what we've solved for with our auto feedback system. In the previous talk you saw a graph a bit like this: when you use LLMs as a judge, the predicted feedback can often just sit at one score regardless of what the ground truth is.
But with our auto feedback system, you get a much nicer, much better correlation between the predicted feedback and the actual feedback. And just to motivate what you do once you have this measure of accuracy, some of the downstream ways you can use the auto feedback include monitoring.
So you get an ongoing quality signal on how well your LLM application is doing. It can be used for triaging, so you make the most of the limited human reviewers you have. And it can be used for curating high-quality datasets for things like automated prompt improvement and fine-tuning, which can ultimately improve the accuracy of your LLM application.
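The triage use case is simple to picture in code: route the lowest auto-feedback scores to the human review queue, up to whatever review budget you have. A minimal sketch; the record shape and threshold are illustrative assumptions.

```python
def triage(records, score_threshold=5, review_budget=50):
    """Send the lowest-scoring completions to the human review queue, up to a budget.

    `records` is assumed to be a list of dicts like
    {"completion_id": ..., "auto_feedback_score": ...} produced by whatever
    auto-feedback model you run; the field names are illustrative.
    """
    flagged = [r for r in records if r["auto_feedback_score"] < score_threshold]
    # Worst-scoring first, so limited human time goes where it matters most.
    flagged.sort(key=lambda r: r["auto_feedback_score"])
    return flagged[:review_budget]
```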
Next I'll say a little more about what the system is. For some of the experiments we did as part of the research, we came up with three different ways of building auto feedback models. On the left we have the ground truth datasets, which consist of the input and output, some kind of grading rubric you might give to a human reviewer, and their feedback.
And we had three variations: building these models with few-shot learning, with some kind of fine-tuning, or by generating bootstrapped synthetic data and then fine-tuning the auto feedback model on it. I won't have time to go into all the details.
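To illustrate the simplest of the three variations, here is a rough sketch of assembling a few-shot grading prompt from ground-truth reviews (input, output, and the human's feedback under a rubric). The structure and field names are assumptions for illustration; the fine-tuning and bootstrapped-synthetic-data variants replace this prompt assembly with training data preparation.

```python
def build_feedback_prompt(rubric: str, examples: list, new_input: str, new_output: str) -> str:
    """Assemble a few-shot grading prompt from ground-truth human reviews.

    Each example is assumed to be a dict with "input", "output", and "feedback"
    (the score/notes a human reviewer gave under the rubric).
    """
    parts = [f"Grading rubric:\n{rubric}\n"]
    for ex in examples:
        parts.append(
            f"Input:\n{ex['input']}\nOutput:\n{ex['output']}\nFeedback:\n{ex['feedback']}\n"
        )
    # The new case to grade goes last, with the feedback left for the model to fill in.
    parts.append(f"Input:\n{new_input}\nOutput:\n{new_output}\nFeedback:")
    return "\n".join(parts)
```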
We published some of this work in a blog post, which is available on our Substack; there's a QR code there. In summary, for a summary-grading task using the TL;DR dataset, we were able to get a 45% improvement in evaluation accuracy by going from aggregate to annotator-specific models, going from GPT-3.5 to GPT-4 as the base model, going from few-shot learning to fine-tuned models, and using the bootstrapped synthetic data. Our approach was also very sample-efficient: with the bootstrapping approach, we achieved accuracy almost as if we had 1,000 ground-truth labeled examples while using just 50. So it's much faster to get started, and you don't need as much data to reach a high level of accuracy with the evaluation model. We also extended this in a follow-up blog post, where we were able to match the accuracy of GPT-4 and fine-tuned GPT-3.5 evaluation models using Mistral 7B and Llama 70B Chat.
That's in a follow-up blog post, accessible at this link. Once we were able to show that we could get high confidence in the eval models we were building this way, we set them up for deployment within the auto feedback module on our platform.
Zooming out: as part of the Log10 platform, we have a fundamental LLMOps offering, which includes things like logging, debugging, and evaluation, as well as auto-tuning features to do prompt optimization and manage your fine-tunes. And we have a seamless one-line integration that sits between your LLM application and your LLM SDK.
We have integrations with many of the common LLM SDKs, including OpenAI, Anthropic, Gemini, and a few of the open-source ones, and we integrate with frameworks as well. Next I'll hand it back to Trey for a demo. All righty. Let me just make this slightly bigger.
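For reference, the one-line integration mentioned above looks roughly like this, based on Log10's documented setup; the exact import and patching pattern may differ by SDK version, and the model and prompt here are illustrative, so check the Log10 docs before using this.

```python
import openai
from log10.load import log10  # Log10's Python package; see their docs for your SDK version

log10(openai)  # patch the SDK once; subsequent completions are logged to the Log10 platform

client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model
    messages=[{"role": "user", "content": "Summarize this support call: ..."}],
)
print(response.choices[0].message.content)
```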
Okay. So here's Echo AI. Like I said earlier, we connect into all of the different channels your customer conversations come in on. We transcribe them, we clean them, and then we allow you to basically codify all the different things you're looking to analyze these conversations against.
We also offer a product that is purely generative and is tasked with surfacing insights in cases where you basically know the question, like: why are my customers canceling? And then we generate an ontology of different reasons for why that's happening.
So let me give you an example of how we're making use of this feedback tool. If I jump into one of these: this is a demo account, so all these conversations are generated and they're kind of silly. But nonetheless, you can see here we've got this example phone conversation that comes from a customer.
Their TV was broken. One of the simplest features, at least on the surface, is summarization of the transcript. We actually rely on summarization for a variety of different downstream analyses, so it's really important to us to ensure high levels of accuracy.
Not only that, but because of that technical reason, we have an immense number of prompts and a lot of throughput that has to get through the LLMs. So we do quite a bit of self-hosting and are constantly training new models to better handle the different domains of our customer base.
So here's an example. I can see that this summary came in, and it's pretty decent. All of the other insights you see here are being generated by various models and pipelines, all of which are driven by LLMs. Each of these is a question or a specific requirement that the customer has provided us.
So it's really hard to stay on top of accuracy and quality as a result; basically every customer is different. What we've done is leverage Log10, not only for the ability to very quickly let our engineers and solutions engineers go in and actually understand what the generating prompt for all of this was.
That's pretty standard stuff, as many of you know, but maybe more interesting is the ability to automatically generate feedback for these things. We've created a criterion that analyzes each of the summaries we produce against very specific criteria, defined by us with the user, for how good the quality of the summarization was.
So in this case, the scores actually look pretty good. There's one point deducted, and you can actually read down here why that point was deducted. But let's say I want to provide a human override here: I can just come in, change the point value, and accept it.
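Conceptually, that override is just a human correction recorded on top of the auto-generated score, which later feeds the evaluation and fine-tuning datasets. A minimal sketch; the record shape and field names are assumptions, not Log10's actual data model.

```python
def apply_human_override(feedback: dict, new_score: int, reviewer: str) -> dict:
    """Record a human correction on top of an auto-generated score.

    `feedback` is assumed to look like {"completion_id": ..., "score": ..., "source": "auto"};
    the corrected record is what later gets collected for evaluation and fine-tuning.
    """
    return {
        **feedback,
        "score": new_score,
        "source": "human_override",
        "reviewer": reviewer,
        "previous_auto_score": feedback["score"],
    }
```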
And why that's useful for us is, like I said, it's critical that we maintain a process that lets our solutions engineers give us really high-fidelity, human-provided feedback in a way that is virtually effortless for them.
Because what we're ultimately trying to do is collect as much of that as possible. Log10 has been a great tool, not only in making that possible for the solutions engineers and our engineers, but also in doing it automatically behind the scenes.
It's really changed our processes around building our fine-tuning datasets. Thanks. Actually, I'll show one more thing, to give an example of how it's maybe more useful to an engineer. Here's another conversation where the summarization failed: we just got a reiteration of the instructions.
That looks like the system prompt. No idea why. We can actually go in here and look at why. Okay, well, there's a good reason why. And you can see that it's been graded accordingly, which is exactly what we would expect.
So we've been able to track hallucinations via this process. We've been able to see model drift in a meaningful, data-driven way that we previously couldn't. A lot of that used to require humans sampling these things and giving feedback on them.
So this has been a huge tool in maintaining the utmost trust we can retain with our customers. Thanks. Great. Maybe in the interest of time, I'll skip forward. One of the big achievements was that, using this auto feedback approach, we got a 20-point F1 improvement in accuracy in one of the use cases with Echo AI.
We published all of the details in a case study that's accessible there. And just in terms of getting started: obviously we can't share customer data here, so we created this new summarization app, which shows an example of a summary-grading application, and there's a live version of the website that you can check out.
If you want to take a picture of those QR codes, you can adapt this for your own use cases and tasks and try out auto feedback yourself. We also have an SDK, so everything shown in the UI is available programmatically from Python, with SDKs in other languages as well.
The feedback type is pretty flexible. We have a bunch of recipes you can get from our GitHub, as well as notebooks to get started. I'll skip over some of the mechanics, but it basically goes over how you create the task, create feedback, run the auto feedback locally on a simpler model, and then fetch feedback from a more complex model that runs on our cloud.
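For a rough picture of that flow, here is a self-contained sketch using hypothetical names throughout; the real calls and signatures live in the Log10 SDK docs and GitHub recipes, so treat this only as an outline of the steps just described.

```python
# Hypothetical names throughout -- consult the Log10 SDK docs/recipes for the real calls.

class FeedbackTask:
    """What to grade (a rubric) and how: the unit that feedback is attached to."""
    def __init__(self, name: str, rubric: str):
        self.name, self.rubric = name, rubric

def add_human_feedback(task: FeedbackTask, completion_id: str, score: int) -> dict:
    """A human reviewer attaches a score to a logged completion under a task."""
    return {"task": task.name, "completion_id": completion_id, "score": score, "source": "human"}

def auto_feedback(task: FeedbackTask, completion_id: str, where: str = "local") -> dict:
    """Run auto feedback locally on a smaller model, or against a larger hosted model."""
    model = "small-local-model" if where == "local" else "hosted-large-model"
    # ...in a real SDK, this is where the rubric and completion are sent to the model...
    return {"task": task.name, "completion_id": completion_id, "model": model, "source": "auto"}

task = FeedbackTask("summary-grading", "Score 1-7 for coverage, accuracy, and concision.")
add_human_feedback(task, "abc123", 6)
auto_feedback(task, "abc123", where="local")
auto_feedback(task, "abc123", where="cloud")
```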
And maybe just in the interest of time, I think, Trey, you already covered this, but we invite you to use our platform. For a limited time, you can use even the more advanced models on our platform for free. Thanks a lot. Thank you.