We titled this talk The Build-Operate Divide because we've observed a troubling trend where good AI product concepts fail to reach their full potential due to operational challenges. So today Chris and I are going to talk about some of the lessons we've learned from our experiences on the front lines operationalizing AI products at scale.
By the end of this talk, we hope you'll leave with an understanding of how to bridge the gap between product concept and operational reality: how to deliver quality through evals and human review, and how to build your teams around that. So, a little bit about ourselves.
My name is Jeremy. I lead product at a company called FreePlay, which exists to help solve a lot of these operational problems and help companies ship great AI products. Hi, everyone. My name is Chris Hernandez. I lead our speech analytics team at Chime. I've been in the industry for about 10 years from a CX perspective, and about nine years in the ML space.
So I'm excited to be here. Both Chris and I came from the traditional ML world into the Gen AI world, and I thought it would be helpful to take a moment and talk about what we feel has changed in this transition. The biggest difference in my mind is the decreased barrier to entry, right?
In the traditional ML world, you need tons of data to even get started, and there are long model training cycle times. That barrier to entry goes down significantly in a Gen AI world: you can now use the intelligence of the base models to leverage smaller data assets within your organization.
What comes along with that is increased iteration speed: as the barrier to entry goes down, the speed at which you can iterate goes up. And that increased iteration speed accentuates the need for a high-quality ops function.
So we want to take a look at why. Working across dozens of different enterprise teams at FreePlay, we've noticed a common trend emerge: companies will build an initial prototype, and maybe they'll even ship a V1 into production, but they inevitably hit a quality chasm as they try to go from a V1 to a V2 that drives true value for the customer.
There's a reliability problem in there, and what we've seen is that the only way you really cross that quality chasm and get to a reliable V2 is through iteration. So we talk about this iteration loop: you move from monitoring to experimentation to testing and evaluation, and I use that word broadly; human review, auto evaluation, all of these come into play.
What happens is your product quality becomes a direct function of your ability to move through this loop, right? The faster you can iterate, the more times you move through the loop, the better your product quality becomes. So ops sits at the foundation of this, especially as you start to scale: your ability to move through the loop ends up being an ops function.
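To make that concrete, here's a minimal sketch of one pass through the loop. It's illustrative only: `run_model` and `score_output` are hypothetical stand-ins for your own application and eval criteria, not any specific vendor API, and the 0.5 failure cutoff is arbitrary.

```python
from dataclasses import dataclass, field

@dataclass
class EvalRun:
    prompt_version: str
    avg_score: float
    failures: list = field(default_factory=list)  # cases to study in the next experiment

def evaluate(prompt_version: str, test_cases: list[dict], run_model, score_output) -> EvalRun:
    """Replay a fixed test set against one prompt version and score every output."""
    scores, failures = [], []
    for case in test_cases:
        output = run_model(prompt_version, case["input"])
        score = score_output(output, case.get("expected"))  # auto-eval or a human label
        scores.append(score)
        if score < 0.5:  # arbitrary cutoff for "worth a human look"
            failures.append({"case": case, "output": output, "score": score})
    return EvalRun(prompt_version, sum(scores) / len(scores), failures)

# One turn of the loop: promote v2 only if it beats v1 on the same test set.
# baseline  = evaluate("v1", test_cases, run_model, score_output)
# candidate = evaluate("v2", test_cases, run_model, score_output)
# if candidate.avg_score > baseline.avg_score: promote("v2")
```

The point isn't this particular code; it's that every time you tighten this loop, quality compounds.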
And the not-so-subtle irony here is that to deliver high-quality AI products, you actually need a ton of human elbow grease. I'll pass it to Chris to talk a little more about the importance of human experts in this process.
Chris Hernandez: Cool. Awesome. You know, I always joke that the first time I saw LLMs, we started playing around with them, and I wondered if they actually knew basic things, like who invented Wi-Fi. And it spit out Abraham Lincoln.
I was like, oh my goodness, we're in some trouble. But all jokes aside, even if they're usually not that extreme, it's a good reminder that LLMs, even the smartest ones, can make mistakes. And they often do it with a lot of confidence. That's, of course, what we call hallucination.
In the gen AI space, you've got to make sure we don't leave it unchecked, because these hallucinations can be tricky and risky. They can mislead customers and misinform decisions. And when you think of industries like healthcare, even a small hallucination carries a lot of risk.
So that's where human-in-the-loop comes in, and that's why it's important to talk about this today. Human-in-the-loop ensures that while gen AI does the heavy lifting, humans are still there to steer the ship in the right direction. So today we're going to talk a little more about human-in-the-loop and why it's important.
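As one rough illustration of what "steering the ship" can mean in practice, here's a minimal sketch of a human-in-the-loop gate at a decision point. The topic list, the confidence field, and the 0.8 threshold are all assumptions about what your pipeline produces, not a prescription.

```python
# Hypothetical high-risk topics where a human should always look first.
RISKY_TOPICS = {"medical", "legal", "financial"}

def route_output(output: str, topic: str, confidence: float) -> str:
    """Decide whether a model output ships directly or goes to a human reviewer."""
    if topic in RISKY_TOPICS or confidence < 0.8:
        return "human_review"  # a person steers before anything reaches the customer
    return "auto_send"

# route_output("Your dispute was approved.", topic="financial", confidence=0.95)
# -> "human_review": high-risk topic, regardless of how confident the model is
```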
I don't think this is groundbreaking; everyone understands human-in-the-loop has been around forever. But it's worth reiterating its importance. We know LLMs are good at generating content, and they do it really, really quickly. But without humans, they often fall short of real-world reliability, especially when nuance, empathy, and context are required.
Without a human in the loop, you're not scaling productivity, you're scaling risk. A single hallucination in isolation might not seem like a big deal, but at scale, it can be dangerous. The last point I want to make on that is that each time a human flags or corrects an output, that's the signal we need to be using to help retrain and reinforce our models.
That's, of course, how models evolve over time. So I want everyone to think of human-in-the-loop not just as a safeguard, but as a feedback mechanism, a feedback engine, if you will. Every model output that goes through human review produces feedback that gets fed back for refinement and, of course, improving the model.
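Here's a minimal sketch of what that feedback engine can look like, assuming a simple JSONL log and a made-up review schema. The point is just that every human verdict becomes a durable record you can replay as eval or training signal.

```python
import json
import time

def record_review(output_id: str, model_output: str, verdict: str,
                  corrected_output: str | None = None, notes: str = "") -> dict:
    """Persist one human review so it can feed evals and model refinement later."""
    record = {
        "output_id": output_id,
        "model_output": model_output,
        "verdict": verdict,                    # e.g. "pass", "hallucination", "wrong_tone"
        "corrected_output": corrected_output,  # the human's fix, if they made one
        "notes": notes,
        "reviewed_at": time.time(),
    }
    with open("review_log.jsonl", "a") as f:   # stand-in for a real datastore
        f.write(json.dumps(record) + "\n")
    return record

# Corrected records become golden examples; flagged ones become regression tests.
```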
Over time, this loop brings AI closer to real human expectations and behaviors. But there is a challenge. Speaking with many different teams, the challenge is that we don't have enough people to do all these reviews. Model-graded evals, and evals in general, are great, but you still need the human element in there.
Most teams don't have enough people to review the thousands of outputs they have, and that makes it really hard to improve the models, or even measure their current state.
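One common way to stretch a small review team, sketched below with invented scores and thresholds: model-grade everything, then route to humans only the low scorers plus a random spot-check sample. `model_grade` here is a hypothetical stand-in for an LLM-as-judge call.

```python
import random

def triage_for_review(outputs: list[dict], model_grade,
                      sample_rate: float = 0.10, low_score: float = 0.6) -> list[dict]:
    """Return the subset of outputs a human should actually look at."""
    for item in outputs:
        item["auto_score"] = model_grade(item["output"])  # LLM-as-judge score, 0..1
    flagged = [o for o in outputs if o["auto_score"] < low_score]
    k = min(len(outputs), max(1, int(len(outputs) * sample_rate)))
    spot_checks = random.sample(outputs, k) if outputs else []
    seen, queue = set(), []
    for item in flagged + spot_checks:  # de-duplicate, preserving order
        if id(item) not in seen:
            seen.add(id(item))
            queue.append(item)
    return queue
```

The spot-check sample matters as much as the flagged items: it's how you catch the auto-eval itself drifting.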
But there is good news, even if the path might not seem clear in your organization: we already have teams that are trained to do this type of human-in-the-loop work. Those teams are your quality and CX folks on the operations side of the business. Especially when you look at call centers, they're already experts at exactly what you need: evaluating interactions at scale, spotting edge cases, and defining what good looks like.
So in this age of Gen AI, their role, from my perspective, is not shrinking. It's continuing to evolve. They're not just measuring outputs or outcomes anymore; I think they're really shaping what's next. As Gen AI becomes more embedded in operations, quality is also evolving. It's no longer just about auditing what's already happened; it's about shaping what's going to happen next.
So there's a shift happening in QA teams and in this ops space: they're not just scorekeepers anymore. They're becoming model shapers, prompt testers, and AI performance monitors. And if there's any group that's already built for this kind of work, as I mentioned before, it's these teams, so you might want to look at your ops org and see if anyone in QA or the contact centers could help expand your projects even further.
They already know how to evaluate nuanced conversations, and they know how to surface edge cases, like I mentioned before. To talk a little about the evolution of quality and CX in general: they've been, for the most part, focused on auditing, coaching, and compliance. But as automation continues to scale, these roles aren't disappearing; they're transforming.
When you look back 25 years, the main role of QA professionals was listening to phone calls and evaluating interactions. We see that as a continued evolution: automation is now coming into play, and these folks are transforming their skill sets to help solve larger problems.
The diagram we have here shows where the QA scope is today and how it's expanding into the Gen AI space. At the core, we have what QA has always done: auditing interactions and capturing quality behaviors. But now, as you can see, the scope itself is expanding.
QA professionals are testing prompts now, tagging outputs, and really helping shape the model behavior we expect. And there's a beauty in the Gen AI space, which is that it opens the door to non-technical folks. Traditionally, only the ML teams and engineers were heavily involved in the outputs and in deciding what comes from them.
But now, you don't need to know how to build the model pipeline to know what a good output looks like, just like you don't need to know how to make wine to be a good wine connoisseur. That expertise still matters and is just as valuable. And contact center CX teams are, again, already equipped with it.
So it might be something to look into, to see how you can expand that reach even further. One of the things Chris is talking about here is something we've observed working across a number of customers as well, which is the emerging role of the AI quality lead.
Importantly, I've almost never seen it actually called this, but it seems to exist at companies that are having a lot of success in the Gen AI space, and I expect this role to become more formalized and gain more traction. Chris and his team are an example of this role.
People in this role can come from a variety of backgrounds: product, ops, engineering. But the key attributes of a good AI quality lead are, first and foremost, a deep understanding of the customer need in the domain, and then, importantly, being a systems thinker who can systematically diagnose and solve these quality problems.
What that looks like day to day is this person exercising a new skill set: labeling data, writing evaluation criteria, running experiments and tests, and prompt engineering. And you'll notice there's something notably missing from that list: these are often not the people writing production code.
But what has changed so significantly is that writing production code is no longer the only way to contribute in a really hands-on way. Given the right tool set and the right team structure, you can contribute to this iteration loop, the prompt engineering, the evaluation, all of it, without necessarily being the one writing the production code day to day.
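As one illustration of that division of labor, here's a sketch where a quality lead owns the rubric as plain data and engineering owns the thin harness that consumes it. The criteria names and the `judge` callable are invented for the example.

```python
# Owned by the AI quality lead: editing this requires no production code changes.
RUBRIC = [
    {"name": "grounded_in_source",
     "question": "Does the reply only cite policies that appear in the source doc?",
     "blocking": True},
    {"name": "empathetic_tone",
     "question": "Does the reply acknowledge the customer's frustration?",
     "blocking": False},
    {"name": "answers_the_question",
     "question": "Does the reply address what was actually asked?",
     "blocking": True},
]

# Owned by engineering: applies whatever rubric it is given.
def grade(output: str, context: str, judge) -> dict:
    """Run every rubric item through a judge (human or model) and aggregate."""
    results = {item["name"]: judge(item["question"], output, context) for item in RUBRIC}
    passed = all(results[item["name"]] for item in RUBRIC if item["blocking"])
    return {"passed": passed, "detail": results}
```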
Chris is talking about how you do this at scale, but we've also seen a lot of success where companies have just one or two people in this role, especially with a smaller footprint, and that goes a long way. What Chris is painting is that in larger enterprises, as you start to scale this up, you really need a meaningful quality team; but you can start by empowering one or two individuals in this role.
Cool. So I'm just going to leave you all with a few last points. While human-in-the-loop is important, and we just spent the last few minutes talking about it, if you don't have the resources, focusing on high-risk, high-trust areas is an absolute must.
So insert human-in-the-loop at decision points, and not just for show. The next piece is to continue to bring your ops and CX teams into the lifecycle early to help define what good looks like, so they can help build out golden sets and tests against real-world edge cases.
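For example, a golden set can start as something as small as a test file that replays flagged real-world cases against your pipeline. The cases below are made up, and `generate_reply` is a stub standing in for the real app.

```python
# Each entry would come from a real interaction your CX team flagged (these are invented).
GOLDEN_SET = [
    {"input": "I was charged twice, fix this now.",
     "must_include": "refund", "must_not_include": "guarantee"},
    {"input": "Why was my account frozen?",
     "must_include": "verify", "must_not_include": "definitely fraud"},
]

def generate_reply(text: str) -> str:
    """Stub for the production pipeline under test."""
    return "We'll verify the duplicate charge and issue a refund."

def test_golden_set():  # runs under pytest
    for case in GOLDEN_SET:
        reply = generate_reply(case["input"]).lower()
        assert case["must_include"] in reply, f"missing '{case['must_include']}': {case['input']}"
        assert case["must_not_include"] not in reply, f"forbidden phrase: {case['input']}"
```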
Another point I wanted to make: launch is not the finish line. Track performance, flag hallucinations, measure impact, and iterate. Time and time again, I see teams celebrating a product or solution actually launching, which is great to see.
But the important part is that it's not the end; it's the beginning, where you need to set up your teams to make sure the outputs are performing how you expect them to perform. And last but not least, scale is not just about tech anymore. I think it's about people.
Leveraging QA, ops, support, and frontline teams as strategic partners in the gen AI space is going to be what makes teams successful. And if there's just one key takeaway today, it's that scaling gen AI isn't just a technical challenge anymore. It's an operational reliability and responsibility challenge.
When you embed quality, human feedback, and the right people into your gen AI systems, you're not just building faster, you're also building better. So anyway, thank you all. We appreciate it.