From Copilot to Colleague: Trustworthy Agents for High-Stakes - Joel Hron, CTO Thomson Reuters

You know, probably two, two and a half years ago, 00:00:38.640 |
And obviously we wanted them to be as accurate as they could 00:00:45.060 |
But at the end of the day, we wanted it to be helpful. 00:00:46.980 |
And I think over the last two, two and a half years, 00:00:50.780 |
and certainly like within the last six months, 00:00:52.920 |
like that North Star has shifted from helpfulness 00:00:56.540 |
to productivity, like we're not asking assistants to just be helpful anymore. 00:01:01.920 |
We're asking them to actually like produce output, 00:01:05.300 |
to make judgments and decisions on behalf of users. 00:01:18.040 |
And in areas like risk and fraud investigations, the risks of being wrong 00:01:22.380 |
are not particularly acceptable to our end users. 00:01:26.400 |
So doing that in those kinds of environments is really challenging. 00:01:29.460 |
And that's hopefully what we'll talk about today. 00:01:32.800 |
A little context on Thomson Reuters as a company. 00:01:36.980 |
Maybe different from any of you who like started a company and grew to tens of thousands of users in a couple of weeks. 00:01:45.740 |
We, like I said, represent legal, tax, compliance, audit, risk. 00:01:51.660 |
97% of the top 100 US law firms are customers of ours. 00:01:57.040 |
99% of the Fortune 100 are corporate customers of ours, as are the top 100 US CPA firms. 00:02:07.340 |
So we've had a longstanding and a pretty significant presence in many of these industries for a long time. 00:02:14.220 |
And really what underpins that is our domain expertise and content. 00:02:23.940 |
I think we're the largest employer of lawyers in the world, as an example. 00:02:29.900 |
And our proprietary content really underpins most of our software products that our customers use. 00:02:36.520 |
And it's north of one and a half terabytes of proprietary content across those industries that, you know, 00:02:42.160 |
we serve to our customers through our software. 00:02:44.480 |
You know, we're heavily acquisitive as a company. 00:02:48.040 |
We've spent over three billion in acquisitions over the last couple of years. 00:02:52.840 |
We have an applied research lab with a little more than 200 scientists and engineers that work closely with our development teams. 00:03:02.280 |
And as a company, we spend north of 200 million a year in capital on AI product development. 00:03:08.920 |
So it's a little background on who we are as TR. 00:03:14.560 |
So I'll switch gears and just talk a bit to maybe ground us in the evolution of where AI has been and where it's come to. 00:03:19.760 |
So I think this quote from Y Combinator in their summer 2025 sort of request for startups is pretty good grounding. 00:03:27.560 |
And they said, you know, paraphrase a little bit here, but this is pretty much what they said. 00:03:32.200 |
They said, "Don't build agentic tools for law firms, build law firms of agents." 00:03:37.200 |
And I think that, like, signifies the profound shift of moving from helpfulness to productivity. 00:03:43.200 |
Like, we're asking AI systems to now produce output and produce judgments and decisions and not just be helpful to people who are doing those kinds of tasks. 00:03:53.840 |
And that's the shift that we're experiencing with agentic AI. 00:03:58.840 |
I think, you know, what does agentic AI actually mean? 00:04:03.840 |
I think we've been talking about this a little bit. 00:04:06.480 |
I think we like to define it more as a spectrum. 00:04:08.680 |
It is not that this system is agentic or it is not. 00:04:13.320 |
But in fact, these are dials that can be used to sort of tune the experience and how much agency the experience has for the user, depending on the use case. 00:04:28.600 |
There's some use cases where it's very exploratory. 00:04:32.560 |
And you may want to dial these agency dials far up. 00:04:37.400 |
There are other situations where there's a high degree of precision. 00:04:40.400 |
And there's sort of an expectation of certainty around how a certain workflow might need to be executed. 00:04:47.600 |
And you may not want to dial the agency up in those situations. 00:04:51.600 |
And so we view these things as levers that we are able to kind of move up and down, depending on what our users are willing to tolerate in terms of the risk of the system getting it wrong. 00:05:04.840 |
And each of these dials you can think of somewhat discretely. 00:05:08.120 |
So like the autonomy dial, this is the ability of an AI assistant to go do a discrete task, like summarize this document all the way down the spectrum to very variable kind of self-evolving workflows where the AI assistant is planning its own work, it's executing its own work, and it's re-planning that work along the way based on what it is observing or learning along the path. 00:05:35.680 |
So context is also a dial. Like, the first, most simple examples were using the parametric knowledge of the models directly. 00:05:46.240 |
And we added one knowledge source, we added another knowledge source, and then the models then need to sort of rationalize between, let's say, a controlled knowledge source and the web. 00:05:56.720 |
And it needs to use both of these sources of information and it needs to understand which one is better under which context, all the way to perhaps the models even permuting the data sources themselves and updating not just the data, but perhaps the schemas of the data to make better use of them for future types of questions that may get asked. 00:06:18.280 |
Memory is another dial, like, you know, the earliest systems of RAG that we had were somewhat stateless, they retrieved the context at the point in time. 00:06:27.120 |
And what we're seeing now is that memory needs to be shared throughout the workflow. 00:06:33.960 |
It may need to be shared across a series of execution steps in that workflow and it may need and likely does need to be persistent across many sessions of users. 00:06:45.960 |
And so these are also dials that we can use from a memory perspective. 00:06:51.960 |
Coordination ranges from an LLM or an AI assistant just atomically executing a task, like I mentioned, summarizing documents, to delegation to tools, to full agent systems collaborating with each other. 00:07:08.800 |
So, again, just to sum up, like, these are levers that we're able to kind of pull up and down depending on the type of use case and what sort of agency we want to give the system. 00:07:19.800 |
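To make the dials concrete, here is a minimal sketch of how this kind of agency configuration could be expressed in code. The names (AgencyConfig, Level, the example presets) are hypothetical illustrations of the idea, not Thomson Reuters' actual implementation.

```python
# A minimal sketch of the "agency dials" idea; the class names and presets
# are hypothetical, not the actual product architecture.
from dataclasses import dataclass
from enum import IntEnum

class Level(IntEnum):
    LOW = 0      # fixed, predictable behavior
    MEDIUM = 1   # some delegation within guardrails
    HIGH = 2     # self-planning, self-evolving behavior

@dataclass
class AgencyConfig:
    autonomy: Level      # discrete task vs. plan-execute-replan loops
    context: Level       # parametric knowledge vs. many reconciled sources
    memory: Level        # stateless retrieval vs. persistent cross-session memory
    coordination: Level  # single atomic task vs. collaborating agent systems

# Exploratory research use case: dial the agency up.
research_preset = AgencyConfig(Level.HIGH, Level.HIGH, Level.HIGH, Level.HIGH)

# Precise, high-certainty workflow (e.g. a tax calculation): dial it down.
tax_preset = AgencyConfig(Level.LOW, Level.MEDIUM, Level.LOW, Level.LOW)
```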
So I'll switch gears and just share kind of some lessons that we've learned along the way from the last two and a half years of building this. 00:07:28.800 |
And some of this may be obvious, some of it may be not. 00:07:36.800 |
And I think for our users, one of the things that is most challenging is that, like, to build trust in the system, they almost expect determinism. 00:07:47.800 |
Like, sort of by definition, trust comes through, like, you know, having certainty and an expected outcome when you give a certain input. 00:07:56.800 |
And that is just not the way that these systems work. 00:07:59.800 |
And that has been, I think, a really challenging bar to clear, not just for our users, but also for our own internal SMEs who evaluate these systems alongside of us. 00:08:11.800 |
What we see in our own development is that even with highly trained domain experts in legal, I could give the same set of data, like question-response pairs, to the same people a week later. 00:08:24.800 |
And we see 10 plus percent swings in accuracy by the same people and the same questions. 00:08:29.800 |
And so their own judgments are highly variable as well. 00:08:32.800 |
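As a hypothetical illustration of that variability (not our real data), here is how a ten-point accuracy swing can come purely from the same grader scoring the same ten responses differently a week apart:

```python
# Hypothetical illustration: one expert re-grades the same ten
# question/response pairs a week later. 1 = judged correct, 0 = incorrect.
week_1 = {f"q{i}": g for i, g in enumerate([1, 1, 0, 1, 1, 0, 1, 1, 1, 0], 1)}
week_2 = {f"q{i}": g for i, g in enumerate([1, 0, 0, 1, 1, 1, 1, 1, 0, 0], 1)}

def accuracy(grades: dict) -> float:
    """System accuracy implied by one grading pass."""
    return sum(grades.values()) / len(grades)

def self_agreement(a: dict, b: dict) -> float:
    """Fraction of items the grader scored the same way both times."""
    return sum(a[q] == b[q] for q in a) / len(a)

print(accuracy(week_1), accuracy(week_2))  # 0.7 vs 0.6 -> a 10-point swing
print(self_agreement(week_1, week_2))      # 0.7 -> the grader disagrees with themselves
```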
And it's quite difficult to sort of understand whether you're climbing that hill or not. 00:08:40.800 |
The other challenge is that, you know, it's quite expensive. 00:08:43.800 |
These are highly trained lawyers or tax professionals, whatever it may be. 00:08:48.800 |
And if you're iterating on a system every week, like, you know, it's quite expensive to leverage this amount of human judgment. 00:08:59.800 |
We see these challenges sort of amplified by agentic systems. 00:09:03.800 |
One of the challenges is that referencing source material, which is probably one of the most important things for any of our applications, becomes more challenging as you start to build these systems with higher levels of agency. 00:09:17.800 |
We see these agents sort of drift and identifying why they have drifted and where they have drifted along the trajectory becomes more challenging. 00:09:29.800 |
And building the guardrail systems themselves requires, you know, a deep level of expert knowledge. 00:09:34.800 |
I think, you know, as we've approached our evals, like, we have really focused on developing pretty rigorous rubrics for how we eval. 00:09:43.800 |
But at the end of the day, I do think we need sort of north stars that guide us. 00:09:47.800 |
And in many ways, like, we really look at preference at the end of the day to really drive an understanding of are we getting better or are we getting worse? 00:09:56.800 |
But we do have, like, deeper levels of rubric that we use to sort of hill climb on certain components of the system. 00:10:02.800 |
The other thing we've learned is that our legacy applications are, you know, in many ways a handicap, to be honest with you. 00:10:09.800 |
But in a lot of ways, I think they're really enabling. 00:10:11.800 |
And I'll show you a couple of demos of that in just a minute. 00:10:14.800 |
But we have 100-plus years of building software systems that have highly tuned domain logic. 00:10:21.800 |
And our users expect this sort of logic in the way that they work. 00:10:25.800 |
And, you know, early on in the age of building assistants, you know, we were kind of just starting over. 00:10:31.800 |
We were leaving all that behind us and building something new, somewhat from scratch. 00:10:35.800 |
But what agents have allowed us to do is to really decompose these legacy applications and expose their components as tools that agents can now use. 00:10:46.800 |
And so we're finding new ways to leverage a lot of these legacy applications and infrastructure that, you know, previously we might have thought of as baggage, 00:10:55.800 |
but I think are really unique assets for us to build on going forward. 00:11:00.800 |
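As a rough sketch of what decomposing a legacy application into agent tools can look like, here is one way legacy functions could be exposed through tool definitions. The function names and schemas are illustrative assumptions, not the actual product code; most LLM tool-calling APIs accept definitions shaped roughly like this.

```python
# Legacy application components wrapped as tools an agent can call (sketch).
def legacy_citation_check(citation: str) -> dict:
    """Imagine this wraps an existing citation-validation service."""
    raise NotImplementedError  # stands in for the real legacy code path

def legacy_tax_calculation(form_fields: dict) -> dict:
    """Imagine this wraps a highly tuned, decades-old tax calculation engine."""
    raise NotImplementedError

TOOLS = [
    {
        "name": "check_citation",
        "description": "Validate a legal citation and return its treatment and risk flags.",
        "parameters": {
            "type": "object",
            "properties": {"citation": {"type": "string"}},
            "required": ["citation"],
        },
    },
    {
        "name": "run_tax_calculation",
        "description": "Run the tax engine on mapped form fields and return results plus validation errors.",
        "parameters": {
            "type": "object",
            "properties": {"form_fields": {"type": "object"}},
            "required": ["form_fields"],
        },
    },
]

# The agent loop dispatches the model's tool calls back into the legacy code.
DISPATCH = {
    "check_citation": legacy_citation_check,
    "run_tax_calculation": legacy_tax_calculation,
}
```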
And then the last thing I would say as a learning is, which may be somewhat non-intuitive, is, you know, this whole idea of MVPs, 00:11:09.800 |
which sort of, like, centers in everybody's mind when they're building a new product. 00:11:13.800 |
I think many times we've over-indexed on the word "minimal." 00:11:17.800 |
Like, and we've sort of, like, chased rabbit holes in development, trying to optimize what we thought of as, like, sort of the smallest, most valuable piece of, you know, code that we could build. 00:11:28.800 |
And it wasn't until we actually, like, built the whole system that we could see the whole system operate. 00:11:33.800 |
And we could understand, you know, what components of that system do we need to go spend time on versus what is just healed by the agentic sort of nature of the system itself. 00:11:45.800 |
And it was really, I would say, like, a mindset shift for many of our teams to not ground themselves in this MVP concept, 00:11:52.800 |
but to try to just go build the whole thing first and then learn from there rather than starting at a smaller component. 00:12:00.800 |
So with that, I'll show you just a couple quick demos of some applications that sort of do this work. 00:12:08.800 |
This is, like, obviously fake data, but, you know, you can imagine a tax professional getting a bunch of documents, 00:12:14.800 |
going through these documents, extracting data, you know, mapping it to a tax calculation engine, et cetera, et cetera, et cetera. 00:12:21.800 |
So what this product does now is basically take source documents, like a W2 or 1099 or whatever, and, you know, end to end does the process of generating tax returns. 00:12:35.800 |
So we use AI to extract data from the particular documents. 00:12:40.800 |
We use AI to take that data and understand how to map it to what fields in a tax engine, 00:12:46.800 |
what those sort of tax laws say about the rules and conditions of those numerical values 00:12:52.800 |
and whether they should apply in this case or that case or to this line or to that line, 00:13:01.800 |
And this is a good example of a couple of things I just mentioned. 00:13:04.800 |
The first is, you know, this is really only possible because we have the tools, like a tax engine, 00:13:11.800 |
to be able to give to the model to leverage to do these calculations. 00:13:16.800 |
We also have a validation engine that's built into that tax engine that the AI system can use to validate the work that it's doing, 00:13:25.800 |
can inspect the errors, it can go look for more information from the documents when it needs it. 00:13:34.800 |
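A simplified sketch of that extract, map, calculate, validate loop might look like the following. The helper names here (extract_fields, map_to_tax_engine, repair_mapping) are hypothetical stand-ins, assuming an LLM wrapper and a legacy tax engine with a built-in validator, not the real system's API.

```python
# Sketch of the extract -> map -> calculate -> validate loop described above.
def prepare_return(source_documents, llm, tax_engine, max_rounds=3):
    extracted = llm.extract_fields(source_documents)   # pull values from W-2s, 1099s, ...
    mapped = llm.map_to_tax_engine(extracted)          # decide which engine field each value feeds
    result = None
    for _ in range(max_rounds):
        result = tax_engine.calculate(mapped)
        errors = tax_engine.validate(result)           # legacy validation rules act as a guardrail
        if not errors:
            return result
        # Let the model inspect the errors and go back to the documents for
        # missing or inconsistent information before retrying.
        mapped = llm.repair_mapping(mapped, errors, source_documents)
    return result  # surface any remaining errors for human review
```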
So I think this is a good example of how we're able to decompose our legacy systems and kind of bring new life to them. 00:13:49.800 |
This is like a legal research use case, like where a lawyer might go in and prepare for litigation. 00:13:56.800 |
And so, as I mentioned, we have one and a half plus terabytes of proprietary content that we build our products on. 00:14:03.800 |
And so this is really like a deep research implementation that is tuned for legal. 00:14:09.800 |
And what we're doing in this particular case is having an AI assistant that uses the tools of our litigation research products. 00:14:20.800 |
So those things would be like searching for documents, fetching documents, comparing citations across cases, validating citations within cases. 00:14:29.800 |
And it is using the components of that application as tools to go out and search content, retrieve content. 00:14:37.800 |
It's looking at various different sources of content, whether that be case law or statutes or regulations or legal know-how articles that we have 00:14:47.800 |
or other blogs or other content that we've licensed in some way to reason to an appropriate answer to a legal research type question. 00:14:57.800 |
And what you're seeing here is not necessarily just the product, but what's sort of under the product: the trajectories 00:15:04.800 |
that the model would be following along its path of answering this particular type of legal question. 00:15:10.800 |
And at the end of the flow, the model will, or along the flow rather, the model will write notes to itself about what it is learning, what it's finding. 00:15:19.800 |
And at the end of the flow, we'll sort of rationalize those notes together into like a final report 00:15:26.800 |
that sort of sums up all of the information that was found throughout the research. 00:15:30.800 |
And I think most importantly, what you'll see is it links to hard citations in our product. 00:15:37.800 |
So every sort of blue hyperlink links to like a true case or a true statute. 00:15:43.800 |
And it flags the sort of risk associated with that, with these flags that you can see. 00:15:49.800 |
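Under the hood, a deep-research trajectory like that could be sketched roughly as below. The planning, note-taking, and report-writing calls are illustrative assumptions; the key point is that citations in the final report are constrained to documents actually returned by the research tools.

```python
# Sketch of a deep-research loop over legal research tools (illustrative only).
def legal_deep_research(question, llm, research_tools, max_steps=20):
    notes, citations = [], []
    plan = llm.plan(question)
    for _ in range(max_steps):
        action = llm.next_action(question, plan, notes)   # e.g. search, fetch, compare citations
        if action.name == "finish":
            break
        observation = research_tools[action.name](**action.args)
        notes.append(llm.take_notes(question, action, observation))
        citations.extend(observation.get("citations", []))
        plan = llm.replan(question, plan, notes)          # re-plan based on what was learned
    # Rationalize the notes into a final report whose hyperlinks point only at
    # real documents returned by the tools (cases, statutes, regulations).
    return llm.write_report(question, notes, allowed_citations=citations)
```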
So these are two examples that I think kind of highlight pretty well some of those lessons learned that I mentioned around decomposing applications. 00:16:01.800 |
And these are really things, like I said, that we've learned the hard way in many cases. 00:16:07.800 |
So I think just to wrap up, we've got a few minutes and we can take a couple questions as well. 00:16:12.800 |
But I think beginning with the whole problem in mind is the right strategy when you're thinking about agentic systems. 00:16:20.800 |
I think the way to think about agency is not as a binary thing, but as a lever that you can dial up or down depending on the risk or the use case 00:16:31.800 |
or the tolerance of your users for certain situations. 00:16:35.800 |
I think one way to think about agents is to bring life back to old systems and to sort of break those old systems down into components that can be leveraged uniquely by an agentic system. 00:16:49.800 |
I think focusing on where humans are in the loop in terms of evaluation is really important. 00:16:55.800 |
Like I said, those SMEs that we have internally are extremely important for us. 00:17:01.800 |
And then lastly, I think the reason we've done what we've done is because we looked at our company. 00:17:06.800 |
We said, what are the assets that we have that nobody else has? 00:17:09.800 |
And it's 4,500 domain experts, terabytes of content. 00:17:13.800 |
We really ask ourselves, like, how can we use those to create the most amount of differentiation in our product? 00:17:19.800 |
And so I would certainly challenge you guys to do the same for yourselves. 00:17:25.800 |
And how can you perhaps best leverage those to build uniqueness into whatever applications it is that you may be doing? 00:17:33.800 |
So with that, I think we've got a couple of minutes for a couple of questions and some mics. 00:17:46.800 |
I'm a PhD student and I work for the Department of Defense, which sponsored me. 00:18:09.800 |
If I have to take the product to my firm, which is the Department of Defense, or to any financial firms, how would you describe the cybersecurity postures mandated recently by CISA and the government, such as an LLM firewall or LLM governance? 00:18:32.800 |
Or automated agents for scanning vulnerabilities or any SCM security posture management? 00:18:44.800 |
How would you define the cybersecurity posture for the entire architecture? 00:18:52.800 |
I mean, there's certainly a lot of technical documentation on this that I can point you to online. 00:18:56.800 |
But I would just say that we're heavily focused on not just compliance with the standards of FedRAMP and these other things when we work with the government, but also really trying to conform to the latest standards that are coming out, like the ISO standard that recently came out. 00:19:16.800 |
And several of our products are now sort of compliant with that as well. 00:19:20.800 |
And so it's a pretty quickly evolving space, though. 00:19:26.800 |
Anyway, I think I'm getting the hand, but I appreciate the time.