
Do You Trust Your AI’s Inferences? — Sahil Yadav, Hariharan Ganesan, Telemetrak


Transcript

Hey guys, thanks for being here. I'm Sahil. I am here with Hari. We're presenting on AI. Obviously, we're talking about trust, but let me give you a little background about us. Over the past 10 years, we have deployed AI in various industries from health monitoring to industrial IoT to network automation in telecom networks.

And there's been one question that has been asked all along: can we trust AI? Some of these systems are used in mission-critical applications, but the real question is, can we trust the inferences of these AI systems? Because they are impacting businesses, business decisions, and, at the end of the day, the bottom line.

So we're going to explore that topic today. With that, let me get us started. All right. So, just like any presentation, we'll start with some stats. McKinsey says 78% of companies are adopting AI. Another piece of research, from Evi, says 95% are investing in AI. But here's the problem.

Only 11% of companies are focused on AI governance, on ensuring safe practices with AI. That 67% gap is going to be a huge problem, because it's not just about implementing AI the right way, it's also about understanding the impact of that AI in the long run.

And let me quantify that with some examples. You see a couple of examples here. Telecom disruption: what really happened here was that an AI made a decision and, based on that, there was a network disruption. Now, if you look at the AT&Ts and Verizons of the world, they're losing millions of dollars for every minute the network is down.

Another example here is a gas sensor whose misinterpreted data put human lives at risk. A third company lost millions of dollars because its supply chain AI screwed up the SKUs and ended up causing losses. So what we're trying to say here is that these are silent failures.

You cannot quantify the impact of these failures ahead of time. You cannot see them coming. But they add up to millions, even billions, of dollars over time. So this is extremely impactful as AI gets adopted. Taking a view of what trustworthy AI looks like, there are three main pillars.

First, explainability. When you're talking about explainability, the most important thing is having a view of what's really under the hood. Otherwise, you're just flying blind. You have to understand why those inferences are being made and on what basis. The other pillar is traceability. It's like a flight recorder.

It captures all the audit trails and ensures you can retrace the steps. Based on that, you can understand a particular situation, recreate it, and solve it again. Guardrails are extremely important. They are there to ensure you don't end up with millions of dollars in losses.

There is some threshold at which the AI has got to stop. So, together, all of these build trust in a real system. More importantly, when you talk about real-world scenarios where you're implementing this, you're talking about scalability. These are the pillars to think about when you're scaling in the real world.
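To make the guardrail idea concrete, here is a minimal Python sketch of a threshold check that stops an automated action and escalates to a human. The thresholds, field names, and function are illustrative assumptions, not the speakers' actual system.

    from dataclasses import dataclass

    # Illustrative thresholds; real values would come from a risk/policy review.
    MIN_CONFIDENCE = 0.85      # below this, do not act automatically
    MAX_IMPACT_USD = 50_000    # above this, always ask a human

    @dataclass
    class Inference:
        action: str             # what the AI wants to do
        confidence: float       # the model's own confidence score
        est_impact_usd: float   # estimated cost if the decision is wrong

    def apply_guardrail(inference: Inference) -> str:
        """Return 'execute', or 'escalate' to stop and call a human."""
        if inference.confidence < MIN_CONFIDENCE:
            return "escalate"   # the AI has got to stop here
        if inference.est_impact_usd > MAX_IMPACT_USD:
            return "escalate"   # too much money at stake to automate
        return "execute"

    risky = Inference(action="reroute_core_traffic", confidence=0.91,
                      est_impact_usd=2_000_000)
    print(apply_guardrail(risky))  # -> escalate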

I'll have Hari talk about the pillars of trust. Hey, thanks, Sahil. So, every mission-critical system that we rely on today, be it aircraft, energy grids, or even simple banking and financial systems, is built on principles of safety and understanding. Right? Our AI systems should be no different. So, looking at the first pillar, right?

First, AI has to show its work. No important decision should be a mystery. It should come with a plain-English explanation so that an end user, a decision maker, or somebody who is auditing the system is able to act on the information and not go looking for a data scientist to explain or translate what the system actually means.

That's the first pillar. The second pillar is adaptive control. What do we mean by that? It's about building smart guardrails. If the AI system starts to veer off and makes a wrong decision, the system should be able to slow down, change its course, or at least call a human for help.

Think of it as lane assist for your AI. The third pillar is to always have a human in the loop. What do we mean by that? This is basically setting up the roles and the playbooks so that the right experts get pinged at the right time with the right information

without causing overhead for either the system or the person, right? But all of these things are built on the bedrock foundation of traceability. Every piece of data, every change is digitally signed and trackable. Think of it like a software bill of materials, or something even simpler.

Think of it like your FedEx package. From the time it leaves the warehouse till it reaches your doorstep, you can track every single step. So those are the three pillars of trustworthy AI. But with these pillars in place, the larger question is, how do we actually weave them into the AI systems we are building and running today?
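As an illustration of that traceability idea, here is a rough Python sketch of a hash-chained audit trail where every change is recorded and any later tampering is detectable. This is a generic pattern, not the speakers' implementation; the field names are assumptions.

    import hashlib
    import json
    import time

    def record_change(trail: list, actor: str, change: str, payload: dict) -> dict:
        """Append a tamper-evident audit entry; each entry hashes the previous one."""
        prev_hash = trail[-1]["entry_hash"] if trail else "GENESIS"
        entry = {
            "timestamp": time.time(),
            "actor": actor,          # who made the change
            "change": change,        # what changed (data, model, config)
            "payload": payload,      # details of the change
            "prev_hash": prev_hash,  # link to the previous entry
        }
        entry["entry_hash"] = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()
        ).hexdigest()
        trail.append(entry)
        return entry

    trail: list = []
    record_change(trail, "pipeline", "ingest", {"source": "sensor_feed_v2"})
    record_change(trail, "ml_engineer", "retrain", {"model": "incident_v7"})
    # Changing any earlier entry breaks the hash chain, so every step can be retraced.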

Right? Let's look into that journey. So, like I said, how do we make these pillars a reality in day-to-day AI operations? This is where we move beyond standard MLOps into what we call X-Tops. Think of it as MLOps, but with a built-in conscience and a direct line of human oversight.

This diagram isn't just a flow chart. It's the blueprint for the entire life cycle of the AI. Let's begin with verifiable traceability. Right from the data stage, know where your data comes from. Understand what all the changes are and how the data is changing. No more guesswork. When we train the models, we don't just train them for accuracy.

Right? We are embedding actionable intelligibility. What that means is the model also learns to explain itself, so that we can spot when its reasoning starts to drift. When we deploy, right, we put in those adaptive cruise controls that we talked about. This is where the guardrails kick in: automatically adjusting to new situations, new data, and pausing to take a look at things if they drift.

And once the model is in production, this is where the human-AI teaming comes in. Right? This is where the actual real-world feedback kicks in, so that we can quickly improve the system and humans can step in when needed. X-Tops is about creating a system where every AI decision has a clear why, when, and who attached to it.
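As a minimal sketch of what attaching a why, a when, and a who to each decision could look like, assuming a simple attribution-style explanation is available; the schema and names are our assumptions, not a prescribed X-Tops format.

    from datetime import datetime, timezone
    from typing import Optional

    def log_decision(decision: str, top_features: dict, model_version: str,
                     approver: Optional[str] = None) -> dict:
        """Attach the why (feature attributions), when, and who to a decision."""
        return {
            "decision": decision,
            "why": top_features,                        # e.g. attribution scores
            "when": datetime.now(timezone.utc).isoformat(),
            "who": {"model": model_version,             # which model decided
                    "human_approver": approver},        # who signed off, if anyone
        }

    record = log_decision(
        decision="raise_gas_alert",
        top_features={"co_ppm": 0.62, "heart_rate": 0.21, "gps_zone": 0.17},
        model_version="incident_model_v7",
        approver="safety_officer_on_call",
    )
    print(record["why"])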

It is about moving from just launching an AI system to launching an AI we can truly trust. So, let's pause here. Right? Now, you all might be thinking, hey, we do MLOps day in and day out, and most of these modules that we spoke about are already there. So, what is unique here, right?

What is the big difference in doing this? The thing is, adopting X-Tops is a journey. It's not a flip of a switch. X-Tops is about taking all those foundational pieces that we already have and giving them a serious, integrated upgrade, especially for trust. I'm not going to go through all of this, but let me touch upon a couple of things.

Let's think guardrails and policies. We have IAM policies, we have security policies, MLOps providers, everything. Right? But X-Tops gives you dynamic, AI-aware guardrails that can actually understand the context and block a risky AI decision. Let's talk about monitoring and metrics. We do have standard MLOps metrics. Right? But X-Tops gives you dedicated, trust-specific dashboards that both your leadership and the board can understand.

Human-in-the-loop feedback. We do have a human in the loop, but it is mostly ad hoc when it comes to MLOps. With X-Tops, think of it as creating a fast lane: click-to-fix workflows where a human can look at some of these quick changes and go back and fix them.
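A rough sketch of that fast lane, assuming a simple review queue where a reviewer's one-click verdict becomes labeled feedback for retraining; the names and labels are hypothetical.

    from collections import deque

    review_queue: deque = deque()   # flagged inferences waiting for a human
    feedback_log: list = []         # labeled outcomes fed back into retraining

    def flag_for_review(inference_id: str, reason: str) -> None:
        """Route a suspicious inference into the human fast lane."""
        review_queue.append({"id": inference_id, "reason": reason})

    def click_to_fix(verdict: str) -> None:
        """Reviewer confirms or corrects the oldest flagged inference."""
        item = review_queue.popleft()
        feedback_log.append({**item, "verdict": verdict})  # e.g. "false_positive"

    flag_for_review("alert-1042", "GPS position jumped 300 m in 2 s")
    click_to_fix("false_positive")
    print(feedback_log)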

Right? The larger context is that X-Tops is not reinventing the wheel. Right? It's about adding the advanced safety and transparency features needed for high-stakes enterprise AI. And what's in it for us? We spend less time firefighting unpredictable AI behaviors and more time actually building innovative products. So, if you are serious about managing AI trust, you also need to measure what matters.

Right? So, we talk about two metrics here: MTTRE and trust-adjusted risk in dollars. MTTRE stands for Mean Time to Resolve Explainable Errors. Fancy name, but a very simple idea. It's basically the time it takes for us to fix something unexpected when it happens: how quickly can we understand the why and respond with the fix?

The faster your MTTRE, the more agile the team, the fewer the defects in the product, and the quicker problems get solved. Second, trust-adjusted risk in dollars. The idea is basically to put a price tag on what happens when trust breaks. Right? What is the actual business cost? Is it fines?

Is it lost customers? Is it damaged reputation? Right? And if your AI system keeps failing or remains a black box, this metric puts a value on that trust. So, let me pause here again. We spoke about metrics. So, why obsess about all these metrics? Right? We have enough metrics in MLOps already.

We have enough metrics. But why obsess? The challenge is this. Look at the first table. Right? On average, in some of these cases MTTRE runs to several months before a resolution is even found. Right? Now, imagine this. Imagine your AI making biased decisions for months. The damage escalates quickly, and sometimes it escalates exponentially.

Now, look at the second table. It actually shows the fallout. Right? It is not just one parameter. It starts with direct fines, then engineering effort, regulatory scrutiny, and above all, the loss of trust and brand value in the products we stand behind day in and day out.

Right? These are not small figures. A serious incident, like a privacy bug or bias in a credit card system, could quickly escalate to 700 million dollars. Right? So, this is why these metrics are not just about defense. They are about building resilient, reliable AI-powered products that end users can ultimately trust.
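As a back-of-the-envelope sketch of how these two metrics could be computed, with illustrative incident dates, cost categories, and probabilities that are assumptions rather than a standard formula.

    from datetime import datetime

    def mttre_days(incidents: list) -> float:
        """Mean Time to Resolve Explainable Errors: detection to explained fix."""
        spans = [(i["resolved"] - i["detected"]).days for i in incidents]
        return sum(spans) / len(spans)

    def trust_adjusted_risk_usd(fines: float, lost_customers: float,
                                reputation: float, prob_incident: float) -> float:
        """Expected dollar cost when trust breaks, weighted by incident likelihood."""
        return prob_incident * (fines + lost_customers + reputation)

    incidents = [
        {"detected": datetime(2024, 1, 5), "resolved": datetime(2024, 3, 20)},
        {"detected": datetime(2024, 4, 2), "resolved": datetime(2024, 4, 30)},
    ]
    print(mttre_days(incidents))                        # ~51.5 days
    print(trust_adjusted_risk_usd(500_000, 1_200_000,   # illustrative numbers
                                  800_000, 0.10))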

All said, we are not talking out of thin air. So, Sahil is going to present a case study on a real incident and how we went about building this whole framework. All right. Perfect. So, that sets the stage to bring it all together. I'm going to talk about a company called GuardHat.

This is a company I used to work for, focused on worker safety. More specifically, it has an AI-driven platform geared towards solving worker safety problems in hazardous environments. What we did was build IoT devices, wearable devices, that would be worn by the workers.

At some point these devices would get deployed and activated, and they would collect data: health data as well as environmental data. That data is sent to the backend system, where the AI analyzes it in real time. Based on that, it is able to predict when an incident is about to happen, so you can prevent that incident from happening.
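As a simplified illustration of the kind of real-time check such a backend might run, here is a toy Python risk score with made-up sensor names and thresholds; it is not GuardHat's actual model.

    def incident_risk(reading: dict) -> float:
        """Toy risk score from a wearable reading; higher means closer to an incident."""
        score = 0.0
        if reading["gas_ppm"] > 35:           # hazardous gas concentration
            score += 0.5
        if reading["heart_rate"] > 140:       # worker under physical stress
            score += 0.3
        if reading["in_restricted_zone"]:     # position-based geofence check
            score += 0.2
        return score

    reading = {"gas_ppm": 42, "heart_rate": 128, "in_restricted_zone": True}
    if incident_risk(reading) >= 0.6:
        print("predictive alert: potential incident, notify supervisor")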

So, a very mission-critical application. It was great because we were saving lives in a way, but there were enormous challenges. One of the inputs to the AI platform was GPS, and because of GPS drift, 70% of the alerts were false positives. It's easier to say this now, after the fact, but back then, we didn't know that.

And so users stopped reacting to the alerts. They started ignoring them, and that caused a huge safety risk, not just for the workers, whose lives were at stake, but also for the company from a liability point of view, because workers were not reacting to alerts.

So, we went back to the drawing board and started identifying the issues. And, you know, if we were to do this without the X-Tops framework, we would probably just measure an MTTR, mean time to resolution. If you look at it, 70% of the time is spent identifying the problem.

Another 20% is spent finding a solution, and then you deploy it. But the most critical part is that there is no system to identify the GPS drift. We wouldn't know about it. And because the model and code are so complicated, it's really hard to identify what's causing the problem.

So, if you were to apply this framework: day zero, you get an alert that was ignored during an incident. Day two, attribution telemetry flags the anomaly. And day seven, you have a solution deployed that fixes the GPS drift, or at least routes around it.

Now, having said that, to be real, this problem did not get solved in seven days. It took us eight months. But this was the model problem that actually helped us build this framework. And once we were able to build the framework, these kinds of problems could be solved in seven days.
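As a hedged sketch of what automated GPS-drift detection might look like, flagging position fixes that imply physically implausible speeds; the threshold and helper functions are assumptions, not the system GuardHat actually built.

    import math

    MAX_PLAUSIBLE_SPEED_MPS = 3.0   # a worker on foot; tune per site

    def distance_m(p1: tuple, p2: tuple) -> float:
        """Rough planar distance in metres between two (lat, lon) points."""
        dlat = (p2[0] - p1[0]) * 111_000
        dlon = (p2[1] - p1[1]) * 111_000 * math.cos(math.radians(p1[0]))
        return math.hypot(dlat, dlon)

    def gps_drift_suspected(prev: tuple, curr: tuple, dt_s: float) -> bool:
        """Flag a fix whose implied speed is physically implausible."""
        return distance_m(prev, curr) / dt_s > MAX_PLAUSIBLE_SPEED_MPS

    # A 300 m jump in 2 seconds is drift, not a sprinting worker.
    print(gps_drift_suspected((42.3300, -83.0450), (42.3327, -83.0450), 2.0))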

We tested it across our enterprise, and it eventually became an enterprise standard. So, all this is great. You have the impact, and you can see the value in this. But here's the big question: how do you convince the CIOs? How do they look at all of this and find value in it?

What is the language they speak? The answer is money. You've got to convince the CIOs that this is saving money. If you look at the left side of the slide, you'll see that the risk exposure we're looking at is approximately $2.5 million per site per year.

One direct impact is that, with this product, with this structure, we were able to cut the fines: we saved $500,000 in fines every year per site. Beyond this, an indirect benefit was that this system, if it were working correctly, was supposed to prevent incidents, all of them.

But it was only preventing X percent of incidents because it wasn't working correctly. With this structure, it did work correctly, and after that you got the remaining value as well. So, I just want to wrap it up real quick. In terms of outcomes, you can see a lot of value there: false alerts came down.

The trust score went up. That means people started using those alerts again; they were able to see value in them. I think the more important things were related to the telemetry itself: understanding why a particular inference was made, and having the control to switch over when there's a GPS drift.

And then the most important thing: human in the loop. If these things happen, someone is notified. We created a dashboard where someone is notified, is able to take action, and can retrain the model. So, with that, this slide is just a high-level overview of what we presented to you today.

Thank you for being here. We'll just leave it at that.

Thank you so much for the fantastic talk on the trust gap. I actually had a question for you, because we have a minute or so. The phrasing you had around the trust-adjusted risk cost premium: how do you advise people to think about reputational damage?

Is this something that you have thought about measuring or investigating at all? You want to take it? So, like I was saying in the beginning, these are silent failures. It's really hard to quantify the impact, and reputational damage is right at the top of that list.

So it's really hard to measure, to be honest. All you can do in this kind of case is get creative; you can find ways to put some dollar figure on it. But it's really hard to predict. There's no short answer to it, let me just say that.
