Building an Agentic Platform — Ben Kus, CTO Box

Chapters
0:00 Box's Content Platform and Enterprise Focus
1:50 Initial AI Deployment in 2023
2:54 The Challenge of Unstructured Data in Enterprises
3:56 Limitations of Pre-Generative AI Data Extraction
4:54 First Version: LLM-Based Extraction
7:05 Challenges with the Pure LLM Approach
8:58 Despair and the Need for a New Architecture
9:30 Introducing Agentic Architecture
10:04 AI Agent Reasoning Framework
10:45 Agentic Routine for Data Extraction
12:28 Advantages of Agentic Architecture
14:05 Key Lesson Learned: Build Agentic Architecture Early
18:37 Approach to Fine-tuning and Model Support
Hello, I'm Ben Kus, CTO of Box, and I'm going to talk today about our journey through AI, and in particular our agentic AI journey. If you don't know much about Box, a little background: Box is an unstructured content platform. We've been around for a while, more than 15 years, and we concentrate on large enterprises. We have over 115,000 enterprise customers, including two-thirds of the Fortune 500, and our job is to bring everything you'd want to do with your content to these customers and provide all the capabilities they might want. For many of these customers, their first AI deployment was actually with Box, because enterprises worry a lot about data security and data leakage with AI and want to make sure they do safe and secure AI, and that is one thing we have specialized in over time.

The way we think about AI is at a platform level. There is the historic version of Box: global infrastructure, everything you need to manage and maintain content at scale. We have over an exabyte of data and hundreds of billions of files that our customers have trusted us with, and we protect them in addition to providing the kind of services you expect from an unstructured data platform. For the last few years, one of the key things we've been investing in is AI on top of that platform, and I'm here to tell you a bit about that journey.

We started in 2023, shortly after generative AI became production-ready. Everything I'm talking about today is generative AI, of course.
We ended up with a set of features: Q&A across documents, data extraction, AI-powered workflows. I'm happy to talk about these in general, but today I'm going to focus on one of them: data extraction. This is the idea of taking structured data from your unstructured data and using it in an enterprise setting. I'm focusing on this partly because it is, interestingly, maybe the least agentic-seeming thing you might think of; it is much less like a standard chatbot-style integration. But what we learned, and what I'll tell you about, is how the concepts of agentic capabilities apply well beyond end-user interactions.
Since we'll be talking about data extraction, a quick background. When we talk about metadata or data, we mean the things inside unstructured content, be it documents, contracts, or project proposals, that can be turned into structured data. This is a very common challenge in enterprises: roughly 90% of their data is unstructured, and only 10% sits in databases as structured data. Historically it has been hard to utilize the unstructured part, so many customers have wished for a very long time that they had better ways to automate it. There's a lot of it, and it's critical; in some cases it's the most critical data in the enterprise. The things you do with it include querying your data, kicking off workflows, and better search and filtering across everything. The prototypical example is a contract: an authoritative unstructured piece of data whose key fields are also very important.
This is not a new thing. For many, many years, the world, Box included, has been interested in pulling structured data out of unstructured data, and there were a lot of techniques for it. There's a whole industry: if you've ever heard of IDP, intelligent document processing, it is a multi-billion-dollar industry whose job in life was this kind of extraction. But it was really hard. You had to build specialized AI models, focus on specific types of content, and assemble a huge corpus of training data. Often you needed custom vendors and custom ML models, and it was quite brittle. To the point that most companies never really thought about automating their most critical unstructured data. That was the state of the industry for a very long time: don't bother trying too hard with unstructured data; do everything you can to get it into some structured format instead. Until generative AI came along.

This is where our journey begins. We had been using ML models in different ways for a long time, and the first thing we tried, when confronted with GPT-2 and GPT-3 style models, was simply to ask: AI model, can you extract this kind of data? As most of us know, AI is not only great at generating content, it's also great at understanding the nuances of content. So we started with some pre-processing, classic OCR steps, and then a standard AI call: extract these fields, single-shot, or with some decoration on the prompts.

And this worked great. This was amazing. Suddenly a standard, generic, off-the-shelf AI model from multiple vendors could outperform even the best specialized models we had seen in the past. We supported multiple models just in case, and they kept getting better and better. It was flexible, it worked across any kind of data, and it performed well. Yes, you had to OCR and pre-process the document, but that was straightforward. We were thrilled; for us this was a new generation of AI.
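[Editor's sketch] That first version — OCR the document, then one single-shot prompt listing every field — might look roughly like this. The prompt shape, the JSON contract, and the `call_model` stub are illustrative assumptions, not Box's actual implementation:

```python
import json

def build_extraction_prompt(document_text: str, fields: list) -> str:
    """Single-shot prompt: hand the model the OCR'd text and every field at once."""
    field_lines = "\n".join(f"- {f['name']}: {f['description']}" for f in fields)
    return (
        "Extract the following fields from the document below.\n"
        f"Fields:\n{field_lines}\n"
        "Respond with a single JSON object mapping field names to values; "
        "use null when a field is absent.\n\n"
        f"Document:\n{document_text}"
    )

def extract_fields(document_text, fields, call_model):
    """call_model stands in for whatever real LLM API is used."""
    raw = call_model(build_extraction_prompt(document_text, fields))
    return json.loads(raw)  # v1 simply trusted the model's output

# Example with a fake model that "reads" a contract.
fields = [
    {"name": "party_a", "description": "first contracting party"},
    {"name": "effective_date", "description": "date the contract takes effect"},
]
fake_model = lambda prompt: '{"party_a": "Acme Corp", "effective_date": "2023-05-01"}'
result = extract_fields("AGREEMENT between Acme Corp ...", fields, fake_model)
```

Note that everything rides on one model call and one parse, which is exactly the fragility the talk goes on to describe.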
Interestingly, we would go to our customers and say: we can do this across any data. They would give us some, it would work, and we'd think: great, AI models are awesome. Until they said: now that you do that well, what about this one? What about this 300-page lease document with 300 fields? What about this really complex set of digital assets, with really complex questions attached? What if I want not just data extraction but risk assessments and other more complex derived fields? You start to realize: as a human, if you asked me that question, I would struggle to answer it, and in the same way the AI struggled to answer it.

So suddenly we had more complex documents. And OCR is just a hard problem; there is seemingly no end to the heuristics and tricks you need to get it right. A scanned document, somebody writes on it, somebody crosses things out; it's just hard. Anyone who has dealt with different file formats, like PDFs, knows it's a challenge. And whenever the OCR broke, it would feed that broken text straight to the AI. Languages were a big pain too, and we kept accumulating challenges like this, because we have an international set of customers across different use cases.

There was also a clear limit to how much attention the AI could pay to many different fields. If you say: here are 10 fields and a 10-page document, figure it out, most models are great. If you say: here is a 100-page document and 100 fields, each of them complex with separate instructions, the model loses track. I have sympathy, because people would lose track too. This became very problematic, because if you want high accuracy in an enterprise setting, that just starts to not work.

And then: what is accuracy? What does it mean? The old ML world gives you confidence scores: this one is 0.865. Large language models don't really know their own accuracy. So we implemented things like LLM-as-a-judge, and we would come back and say: here's your extraction, but we're not quite sure it's right. Our enterprise customers would respond: well, that's helpful to know, but I want it to work right, not just for you to tell me it doesn't. So this became the set of challenges we focused on. Customers were looking for speed and affordability, and for this to actually work on more complex documents: if AI is this awesome future thing, then show it to me.

At this point we hit our despair moment. We had thought LLMs were the solution to everything, that we could have AI models that just worked, but we actually struggled. What do you do now? How do you fix this? I know: let's just wait for the next Gemini model, or OpenAI seems to be on top of this, so wait for the next one. And that is part of it; the models do get better. But the fragility of the architecture was not something we were going to be able to solve that way on our own.

So one of the answers we came up with was bringing agentic approaches to everything we do. This is really one of the key things I want to bring out in this session: it was certainly not obvious that the way to fix all these problems in something like data extraction was an agentic style of interaction. When I say agentic, I mean an AI agent that has instructions and objectives, a model, background tools it can access securely, and memory for advancing through the task and looking up information inside the system, all orchestrated as a full directed graph. You can say: do this, then this, then this; either the agent comes up with its own plan, or we orchestrate it ourselves, because we have knowledge of what we want to do.

This was controversial for us. Our engineers said: what are you talking about? Let's just make the OCR better. Let's just add another step somewhere, another post-processing regular-expression check. And of course everybody had a way to do it based on the old way of doing things: why don't we train an ML model? Why don't we fine-tune? But with each of those fixes, all the genericness would get lost in the process.
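[Editor's sketch] A minimal version of that kind of directed-graph agent, in plain Python rather than an actual framework: node functions pass a shared state dict along edges, and a conditional edge from a judge node sends the flow back for another attempt. Node names and the retry rule are invented for illustration:

```python
def run_graph(nodes, edges, state, start, max_steps=20):
    """Tiny directed-graph runner: each node updates the state and the
    edge function picks the next node (None means stop)."""
    current = start
    for _ in range(max_steps):
        if current is None:
            return state
        state = nodes[current](state)
        current = edges[current](state)
    raise RuntimeError("graph did not terminate")

def group_fields(state):
    state["groups"] = [state["fields"]]  # real grouping would cluster related fields
    return state

def extract(state):
    state["answers"] = {f: f"value-of-{f}" for g in state["groups"] for f in g}
    state["attempts"] = state.get("attempts", 0) + 1
    return state

def judge(state):
    # Stand-in for an LLM-as-a-judge scoring the extraction and giving feedback.
    state["approved"] = state["attempts"] >= 2
    return state

nodes = {"group": group_fields, "extract": extract, "judge": judge}
edges = {
    "group": lambda s: "extract",
    "extract": lambda s: "judge",
    "judge": lambda s: None if s["approved"] else "extract",  # feedback loop
}
final = run_graph(nodes, edges, {"fields": ["party_a", "term"]}, "group")
```

The point of the shape is that orchestration lives in the edges: changing the plan means changing one lambda, not rewriting the nodes.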
So we came up with a mechanism, a LangGraph-style set of agentic capabilities. We still had the same inputs and outputs: in, a document with fields; out, answers. But the approach was agentic. We played with all the models, with reflection and criticism back and forth, and with separating the work into multiple tasks that different multi-agent systems could work on.

We ended up with something like this. There's a step where you prepare the fields and group them. We learned quickly that if a contract has a set of fields like the parties, and somewhere else the addresses of those parties, the AI needs to handle them together; otherwise you end up with three parties and two sets of addresses that don't match. So we had to break up the set of fields intelligently and run multiple queries on a document. After that, we use a set of tools to check and double-check the results: in some cases OCR, double-checking by looking at pictures of the pages, and using multiple models. Sometimes they vote: this is a hard question, three models from different vendors, two of them think this is the answer, so that's probably the answer. And then there's the LLM-as-a-judge, not just a judge that tells you whether this is the answer, but one that says: here's some feedback, keep trying. Of course this takes a little longer, but it leads to the kind of accuracy you want overall.

For us, this was the architecture that solved that set of problems. And it became interesting, because every time there was a new set of challenges, the answer was not to rethink everything, or to say give us six months and we'll come up with a new idea. It was: I wonder if we change the prompt on that one node, or add another double-check at the end. We bring the power of AI intelligence to solve something we used to think of as a standard function. And not only that, it helped us
in other ways. As an unstructured content store, one of the first demos people always want is: I have a bunch of documents, I have a question. We had the same judge idea there: it would tell us whether an answer was good or not. So why not, if it's not a good answer, take another beat and tell the AI: before you show the user this answer, reflect on it for a second and try again. That kind of thing just leads to higher accuracy.

It also leads to much more complexity. We just announced our deep research capabilities on your content: in the same way that OpenAI or Gemini does deep research on the internet, we let you do deep research on your data in Box. It looks something like this: roughly a directed graph where you first search for the data, do that for a while, figure out what's relevant, double-check, then make an outline, prepare a plan, and go produce the report. This is all agentic thinking, and it wouldn't really have been possible if we hadn't laid the groundwork of an agentic foundation overall.

So I will leave you with a few lessons learned, based on our last few years. The first: it wasn't obvious to us at first, but the agentic abstraction layer is actually quite clean from an architecture perspective. Once you start to think this way, it's very natural to say: I'm going to run an intelligent workflow, an intelligent directed graph powered by AI models at every step, to accomplish a task. Not everything fits that, but sometimes it's a great approach. And this is independent of high-scale distributed-system design; both are important. At some point you have to deal with a hundred million documents in a day, and at another point with just one. Separating these two systems, so that one team thinks about the agentic framework and another about how to scale a generic process, is very helpful; keep these distinct.
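[Editor's sketch] The deep research flow described a moment ago (search, filter for relevance, outline, then draft) fits the same abstraction. A hypothetical sketch, with step names guessed from the talk, showing why evolving such a flow is a local change:

```python
def run_pipeline(steps, state):
    """Run each step of a research flow in order over a shared state dict."""
    for step in steps:
        state = step(state)
    return state

def search(state):
    state["hits"] = [f"doc-{i}" for i in range(3)]  # stand-in for content search
    return state

def filter_relevant(state):
    state["relevant"] = state["hits"][:2]  # stand-in for relevance checking
    return state

def outline(state):
    state["outline"] = ["intro", "findings"]
    return state

def draft(state):
    state["report"] = " / ".join(state["outline"])
    return state

deep_research = [search, filter_relevant, outline, draft]

# Evolving the flow is local: per the talk, when the output came out sloppy,
# appending one summarize node at the end fixed it without redesigning anything.
def summarize(state):
    state["report"] = state["report"].upper()  # stand-in for a restyling pass
    return state

deep_research.append(summarize)
result = run_pipeline(deep_research, {"query": "supplier risk"})
```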
Also, it's just easy to evolve. In that deep research example, we built it and it worked really well, except the output was kind of sloppy. We thought we might have to redesign the whole thing, but instead we added one node at the end: summarize this, according to this style. It took that in and just redid the output. Not that long to fix.

Something else that was not obvious to me until later: if you're going to be building agentic AI with a team that's been around for a while, you need to get them into agentic-first, AI-first thinking. One way to do that is to let them build something, so they start to see not only how we can build more things, but also, because we're a platform for our enterprise customers, how to make it better for them: really doubling down on publishing MCP servers, asking what the tools look like for customers, what we can do to make things easier, how to do agent-to-agent communication, and so on.

This is all summed up in the lesson we learned: if you're confronted with a challenge, and it's plausible that a set of AI models could help you solve it, build the agentic architecture early. If I could go back in time, I would have done this sooner, because then we could have kept taking advantage of it all along. So that's my journey and those are my lessons. Thank you. Ankur, how are we on time? Two minutes? Two questions? Okay. If anybody has any questions, I'm happy to answer them.
The question being: is this available as an API? Yes. We're very API-first oriented, so we have an agent API that lets you call on these agents to do things and give them their arguments. We provide agent APIs, and tools, across everything.
[Audience question about how the agents are evaluated, and whether a more manual approach is used as well.]

In terms of evaluating our agents: we not only use LLM-as-a-judge, we also create eval sets. We have our standard eval sets, and then, since the AI gets so good over time, we created a challenge set of evals so we can explore things that not everybody asks, but that would be really hard if they did. That way we can decide not only whether we're prepared for now, but whether we can keep up as people bring us more challenging things. So: a mixture of eval sets, plus LLM-as-a-judge, plus having people give feedback. As an enterprise company we have limited ability to look at what's happening in customer data, but customers telling us what worked is valuable.
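[Editor's sketch] An eval harness along those lines — a standard set, a harder challenge set, and a pluggable judge — could be wired up like this. The data and the naive extractor are invented for illustration, and exact match stands in for an LLM judge:

```python
def run_eval(extract, judge, eval_set):
    """Score an extraction function against labeled examples; the judge
    decides whether an extraction matches the expected answer."""
    passed = sum(judge(extract(ex["doc"]), ex["expected"]) for ex in eval_set)
    return passed / len(eval_set)

standard_set = [
    {"doc": "Invoice #17 total $250", "expected": {"total": "$250"}},
    {"doc": "Invoice #18 total $90", "expected": {"total": "$90"}},
]
challenge_set = [  # things nobody asks yet, but that would be hard if they did
    {"doc": "Total: two hundred fifty dollars", "expected": {"total": "$250"}},
]

def naive_extract(doc):
    last = doc.split()[-1]
    return {"total": last if last.startswith("$") else None}

exact = lambda got, want: got == want
standard_score = run_eval(naive_extract, exact, standard_set)
challenge_score = run_eval(naive_extract, exact, challenge_set)
```

The gap between the two scores is the point of the challenge set: it shows where the system breaks before customers find out.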
You can yell your question if you want, I'll hear you. And it's the first time you've talked, so apologies. [Audience asks: it seems like you're mostly building agents; have you tried fine-tuning instead?] So the question being: why bother with agents if you can fine-tune a model, or rather, have we tried fine-tuning? We're pretty anti-fine-tuning at this moment, because of the challenge that once you fine-tune something, you then have to fine-tune all of its evolutions going forward. We support multiple models: Gemini, Llama, OpenAI, Anthropic. It's just hard to consistently fine-tune across the board, especially when usually the next version of the model simply gets better. So we've gotten to the point where we use prompts, cached prompts, and agenticness as opposed to fine-tuning. That's the approach for our particular use cases, and it works quite well.

Okay. Thank you, everyone.