
Building an Agentic Platform — Ben Kus, CTO Box


Chapters

0:00 Box's Content Platform and Enterprise Focus
1:50 Initial AI Deployment in 2023
2:54 The Challenge of Unstructured Data in Enterprises
3:56 Limitations of Pre-Generative AI Data Extraction
4:54 First Version: LLM-Based Extraction
7:05 Challenges with the Pure LLM Approach
8:58 Despair and the Need for a New Architecture
9:30 Introducing Agentic Architecture
10:04 AI Agent Reasoning Framework
10:45 Agentic Routine for Data Extraction
12:28 Advantages of Agentic Architecture
14:05 Key Lesson Learned: Build Agentic Architecture Early
18:37 Approach to Fine-tuning and Model Support

Transcript

Hello, so I'm Ben Kus, I'm CTO of Box, and I'm going to talk today about our journey through AI, and in particular our agentic AI journey. If you don't know much about Box, a little bit of background: at Box we are an unstructured content platform, we've been around for a while, more than 15 years, and we very much concentrate on large enterprises.

We've got over 115,000 enterprise customers, we've got two-thirds of the Fortune 500, and our job really is to bring everything you'd want to do with your content to these customers and to provide them all the capabilities they might want. For many of these customers, their first AI deployment was actually with Box because, of course, many enterprises worry a lot about data security and data leakage with AI and want to make sure they do safe and secure AI, and this is one thing we have specialized in over time.

But the way that we think about AI is at a platform level. So we have the historic version of Box, which is the global infrastructure, everything you need to manage and maintain content at scale. We've got over an exabyte of data, we have hundreds of billions of files that our customers have trusted us with, and we protect them in addition to providing the kind of services you expect from an unstructured data platform.

But for the last few years, one of the key things we've been investing in has been AI on top of the platform, and I'm here to tell you a bit about our journey. We started that journey in 2023, shortly after AI became production-ready in the generative AI sense.

And everything I'm talking about here today will be generative AI, of course. So we ended up with a set of features: things like Q&A across documents, being able to extract data, being able to do AI-powered workflows. I'm happy to talk about these in general, but today I'm going to focus on one aspect of the features that we built, which is data extraction.

This is the idea of taking structured data from your unstructured data and using that in an enterprise setting. And I'm going to focus on this one partly because it is, interestingly, maybe the least agentic sort of thing you might think of compared to the other examples of how you interact with AI.

This is much less like a standard chatbot-style integration, but what we learned, and what I'll tell you about, is how the concepts of agentic capabilities apply well beyond just end-user interactions. So we'll be talking about data extraction for a moment; just a quick background.

When we talk about metadata or data, we mean the things in unstructured data, be it documents, contracts, or project proposals, anything that then turns into structured data. A very common challenge in enterprises is that roughly 90% of their data is unstructured and only 10% is structured data in databases.

And historically, the challenge has been that it was kind of hard to utilize this. Many customers have for a very long time wished they had better ways to automate their unstructured data. And there's a lot of it, and it's really critical; in some cases, it's the most critical thing in an enterprise.

The things you'd do with it would be querying your data, kicking off workflows, and doing better search and filtering across all of your data. The prototypical example is something like a contract, where you have an authoritative unstructured piece of data, but the key fields inside it are also very important.

So this is not a new thing. For many, many years, the world, Box included, has been interested in pulling structured data out of unstructured data. There were a lot of techniques to do this, and there's a whole industry; if you've ever heard of IDP, intelligent document processing, it's a multi-billion-dollar industry whose job in life was to do this kind of extraction.

But it was really hard: you had to build specialized AI models, you had to focus on specific types of content, you had to have a huge corpus of training data. Oftentimes you needed custom vendors and custom ML models. And it was quite brittle.

And so, to the point, not a lot of companies ever thought about automating most of their critical unstructured data. That was the state of the industry for a very long time: just don't bother trying too hard with unstructured data; do everything you can to get it into some sort of structured format, but don't try too hard to deal with the unstructured side.

Until generative AI came along. And this is where our journey with AI begins. For a long time, we'd been using ML models in different ways, and the first thing we tried when confronted with this GPT-2, GPT-3 style of AI model was to simply say: I have a question for you, AI model, can you extract this kind of data?

And as we mostly all know, AI is not only great at generating content, it's also great at understanding the nuances of content. So what we did first was some pre-processing, the classic OCR steps, and then we'd say, I want to extract these fields: standard AI calls, single-shot, or with some decoration on the prompts.
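To make that first version concrete, here's a minimal sketch of what a single-shot extraction call can look like. The call_llm helper, the field list, and the prompt wording are hypothetical placeholders for illustration, not Box's actual implementation.

```python
import json

def call_llm(prompt: str) -> str:
    # Placeholder: wrap your chat-completion API of choice (OpenAI, Anthropic,
    # Gemini, ...) here. Returning an empty JSON object keeps the sketch runnable.
    return "{}"

# The fields we want, each with a short instruction (illustrative only).
FIELDS = {
    "party_names": "Names of all contracting parties",
    "effective_date": "Date the agreement takes effect (ISO 8601)",
    "total_value": "Total contract value, with currency",
}

def extract_single_shot(document_text: str) -> dict:
    """One prompt, one answer: OCR'd / pre-processed text in, JSON fields out."""
    field_spec = "\n".join(f"- {name}: {desc}" for name, desc in FIELDS.items())
    prompt = (
        "Extract the following fields from the document below. "
        "Respond with a single JSON object keyed by field name.\n\n"
        f"Fields:\n{field_spec}\n\nDocument:\n{document_text}"
    )
    return json.loads(call_llm(prompt))
```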

And this worked great. This was amazing. This was something where suddenly, a standard, generic, off-the-shelf AI model from multiple vendors could outperform even the best sort of models that you had seen in the past. And we supported multiple models just in case, and then it got better and better.

This was wonderful. It was flexible, you could do it across any kind of data, and it performed well. Yes, you had to OCR and pre-process the content, but that was straightforward. We were just thrilled; for us, this was a new generation of AI.

And interestingly, we would go to our customers and say, we can do this across any data. They would give us some, and it would work, and we'd be like, great, AI models are awesome. Until they said, oh, now that you do that well, I get it.

Now, what about this one? What about this 300-page lease document with 300 fields? What about this really complex set of digital assets where you want to answer really complex questions about them? What about, I don't want to just extract data, I want to do risk assessments and these more complex kinds of fields.

You start to realize: as a human, if you ask me that question, I struggle to answer it, and in the same way the AI struggled to answer it. So suddenly we ended up with more complex documents. Also, OCR is just a hard problem.

There's seemingly no end to the heuristics and tricks you have to apply to get OCR right. I've got a scanned document, somebody writes things in it, somebody crosses things out; it's just hard. And for anyone who has dealt with different file formats and PDFs, it's a challenge.

So whenever the OCR broke, it would just pass that bad info along to the AI. Languages were a big pain too, and we started to get more and more challenges because we have an international set of customers across different use cases. Also, there was a clear limit to how much the AI could handle in terms of attention across so many different fields.

If you say, here's 10 fields, here's a 10-page document, figure it out, most models are great. If you say, here's a 100-page document, and here are 100 fields, each of them complex with separate instructions, then it loses track. And I have sympathy, because people would lose track too.

And this became very problematic, because if you want high accuracy in an enterprise setting, this just starts to not work. Then there's the question: what is accuracy, what does it even mean? In the old ML world, models give you confidence scores: this one is 0.865.

Large language models, of course, don't really know their own accuracy. So we would implement things like LLM as a judge, and we'd come back and tell you: here's your extraction, but we're not quite sure this is right. And our enterprise customers would kind of be like, well, that's helpful to know.

But I want it to work right, not just have you tell me it doesn't work right. So this became the set of challenges we focused on. Customers were looking for speed, they were looking for affordability, they wanted this to work. They were saying, if AI is this awesome future thing, then show it to me.

And so we were stuck on these more complex documents. At this point, we kind of hit our despair moment. We thought LLMs were the solution to everything, we thought we could have these AI models that just worked, but then we actually struggled. What do you do now?

How do you fix this? One answer is, I know, let's just wait until the next Gemini model, or OpenAI seems to be on top of this, so wait for the next one. And that's part of it, right? The models do get better. But the fragility of the architecture was something we weren't really going to be able to solve on our own that way.

So naturally, one of the answers we came up with was bringing agentic approaches to everything we do. And this is really one of the key things I want to bring out in this session: it certainly was not obvious that the way to fix all these problems in something like data extraction was an agentic style of interaction.

And when I say agentic, I mean an AI agent that works something like this: it runs on instructions and objectives, with the model behind it; it has tools, which we can give secure access to, of course; it has memory, for the purposes of advancing the task and looking up information inside the system; but it also has a full directed graph.

So there's the ability to orchestrate it, to do things like: do this, then this, then this. Either it comes up with its own plan, or we orchestrate it ourselves, because we have knowledge of what we want to do. And for us this was, I mean, controversial. Our engineers were like, what are you talking about?

Let's just make the OCR better, let's just add another step somewhere, let's just add post-processing regular-expression checks. And then, of course, everybody always had a way to do this based on the old way of doing things: why don't we train an ML model?

Why don't we fine-tune? And suddenly all of the genericness would get lost in that process. So we came up with a mechanism, a kind of LangGraph-style set of agentic capabilities.

We still had the same inputs and outputs: in, a document with fields; out, answers. However, the approach was an agentic approach. We played with all the models, with reflection and criticism back and forth, and with separating the work into multiple tasks so that different multi-agent systems could work on it.

And we ended up with something like this, where you have a step where you prepare the fields and group them. We learned quickly that if there's a set of fields from a contract, like the parties, and then somewhere else there are the addresses of the parties, you need the AI to handle those together.

Otherwise you end up with three parties and two sets of addresses that don't match. So we had to break up the set of fields intelligently, and we had to do multiple queries on a document. Then, after we got that, we would use a set of tools to check and double-check the results.

In some cases we use OCR, and we would then double-check it by looking at pictures of the pages and by using multiple models. Sometimes they vote: wow, this is a hard question, so three models from different vendors, and two of them think this is the answer.

That's probably a good answer. And then there's the idea of the LLM as a judge: not just a judge that tells you this is the answer, but a judge that tells you, hey, here's some feedback, keep trying. Now, of course, this takes a little bit longer.
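To make that routine concrete, here is a minimal sketch in the LangGraph style mentioned above, with field grouping, extraction, cross-checking, and an LLM judge that can send the work back for another pass. The node bodies, state shape, and retry limit are illustrative assumptions rather than Box's production graph.

```python
from typing import TypedDict, List, Dict
from langgraph.graph import StateGraph, END

class ExtractionState(TypedDict):
    document: str
    fields: List[str]
    groups: List[List[str]]   # related fields handled together
    answers: Dict[str, str]
    feedback: str
    attempts: int

# Each node is an ordinary function that reads and updates the shared state.
def group_fields(state: ExtractionState) -> dict:
    # Hypothetical: ask a model to cluster related fields (e.g. parties plus their addresses).
    return {"groups": [state["fields"]], "attempts": 0}

def extract(state: ExtractionState) -> dict:
    # Hypothetical: one query per group, optionally informed by earlier judge feedback.
    return {"answers": {f: "..." for g in state["groups"] for f in g},
            "attempts": state["attempts"] + 1}

def cross_check(state: ExtractionState) -> dict:
    # Hypothetical: re-verify against page images or alternate OCR, let several
    # models vote, and keep the majority answer for contested fields.
    return {}

def judge(state: ExtractionState) -> dict:
    # Hypothetical LLM-as-a-judge: empty feedback means satisfied, otherwise retry.
    return {"feedback": ""}

def should_retry(state: ExtractionState) -> str:
    return "retry" if state["feedback"] and state["attempts"] < 3 else "done"

graph = StateGraph(ExtractionState)
graph.add_node("group_fields", group_fields)
graph.add_node("extract", extract)
graph.add_node("cross_check", cross_check)
graph.add_node("judge", judge)
graph.set_entry_point("group_fields")
graph.add_edge("group_fields", "extract")
graph.add_edge("extract", "cross_check")
graph.add_edge("cross_check", "judge")
graph.add_conditional_edges("judge", should_retry, {"retry": "extract", "done": END})
app = graph.compile()

result = app.invoke({"document": "lease text here", "fields": ["party_names", "addresses"]})
```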

But it's something that leads to the kind of accuracy you'd want overall. For us, this was the architecture that helped us solve a whole set of problems. And it became interesting, because every time there was a new set of challenges, the answer was not to rethink everything, or to say, give us six months and we'll come up with a new idea.

Instead it was: I wonder if we change that prompt on that one node, or I wonder if we add another double-check at the end, and then we can actually start to solve this problem. So we bring the power of AI intelligence to help us solve something we used to think of as a standard function.

And not only that, it helped us in other ways. Naturally, as an unstructured content store, one of the first things people always want, if I could give you a demo right now, is: I have a bunch of documents, and I have a question.

And we had the same thing with a judge, which would tell us, oh, that was a good answer, or that wasn't. So why not, if it's not a good answer, take another beat and tell the AI: before you give the user this answer, try again, reflect on it for a second.
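A minimal sketch of that reflect-before-answering loop; the answer and judge helpers here are hypothetical placeholders for the underlying LLM calls.

```python
def answer(question: str, documents: list[str], feedback: str = "") -> str:
    # Placeholder for the real LLM call that drafts an answer from the documents.
    return "draft answer"

def judge(question: str, draft: str) -> tuple[bool, str]:
    # Placeholder LLM-as-a-judge: returns (is_good, feedback). Always "good" here.
    return True, ""

def answer_with_reflection(question: str, documents: list[str], max_rounds: int = 2) -> str:
    draft = answer(question, documents)
    for _ in range(max_rounds):
        ok, feedback = judge(question, draft)
        if ok:
            break
        # Take another beat: feed the critique back in before showing the user anything.
        draft = answer(question, documents, feedback=feedback)
    return draft
```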

This kind of thing just leads to higher accuracy, and it also lets you take on much more complexity. So we just announced our deep research capabilities on your content. In the same way that OpenAI or Gemini does deep research on the internet, we let you do deep research on your data in Box, and it would look something like this.

This would be roughly the directed graph you'd have: first you search for the data and do that for a while, figure out what's relevant, double-check, then make an outline, prepare a plan, and go through and produce the result.
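As a rough paraphrase, that directed graph can be written down as plain data. The node names and the loop back from double-checking to search are my assumptions for illustration, not the actual implementation.

```python
# Illustrative only: each node would be an LLM- or search-backed step in the graph.
DEEP_RESEARCH_GRAPH = {
    "nodes": ["search", "assess_relevance", "double_check", "outline", "plan", "write_report"],
    "edges": [
        ("search", "assess_relevance"),
        ("assess_relevance", "double_check"),
        ("double_check", "search"),        # loop back while the findings still look thin
        ("double_check", "outline"),
        ("outline", "plan"),
        ("plan", "write_report"),
    ],
}
```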

And this is all agentic thinking. This kind of thing wouldn't really be possible if we hadn't laid the groundwork of having an agentic foundation overall. So I will leave you with a few lessons learned, based on our time over the last few years.

The first is that it wasn't obvious to us at first, but the agentic abstraction layer, from an architecture perspective, is actually quite clean. Once you start to think this way, it is very natural to think: I'm going to run an intelligent workflow, an intelligent directed graph powered by AI models at every step, to accomplish a task.

Not for everything, but sometimes that's a great approach. And this is independent of high-scale distributed system design, and both are important. At some point you have to deal with a hundred million documents in a day; at another point you have to deal with that one document.

So being able to separate these two systems, with somebody who thinks about the agentic framework and somebody who thinks about how to scale a generic process, is very helpful; keep these distinct. It's also just easy to evolve. In that deep research example, we built it and it worked really well, except the output was kind of sloppy.

And we were like, ah, I guess we've got to redesign the whole thing. Or we could just add another node at the end that says, summarize this according to this, and it would take that in and redo the output. So it was not that long to fix. And here's something that was not obvious to me until later: if you're going to be using agentic AI with a team that's been around for a while, you need to get them to think agentic-first, AI-first.

And one way to do that is to let them build something, so that they start to think not only about how we can build more things this way, but also, because we're a platform for our enterprise customers, about how to make it better for them.

So things like really doubling down on the idea that we publish MCP servers: what are the tools like for them, what can we do to make it easier, how can we do agent-to-agent communication, and so on. This all sums up to: if you're confronted with a challenge, the lesson we learned is that if it's plausible that a set of AI models could help you solve that problem, then you should build this agentic AI architecture early.

If I could go back in time, I would wish we had done this sooner, because then we would have been able to keep taking advantage of it. So that's my journey, and those are my lessons. Thank you. Ankur, do we have two minutes?

Okay. So, if anybody, what? Two questions. Okay. If anybody has any questions, I'm happy to answer them. The question being: is this available as an API? Yes. We're very API-first oriented, so we have an agent API where you can call upon these agents to do things and give them the arguments.

So yes, we provide agent APIs across everything, and tools to call our APIs. Okay. [Audience question, partly inaudible, about how the agents are evaluated and whether a more manual approach is used as well.]

In terms of evaluating our agents and how we do that: we not only use LLM as a judge, but we also create eval sets. We have our standard set of eval sets, and then, because the AI gets so good over time, we created a challenge set of evals so that we can better explore things that not everybody asks, but that would be really hard if they did.

That way you can better decide whether you're prepared not only for now, but also, as people ask more challenging things, we know we can grow into that. So it's a mixture of eval sets, plus LLM as a judge, plus the idea of just having people give feedback.
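A minimal sketch of what such an eval harness can look like, with a standard set and a harder challenge set; the case format and exact-match scoring are assumptions for illustration (in practice an LLM judge could grade free-form answers instead).

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    document: str
    fields: dict[str, str]   # field name -> gold answer

# Two tiers: everyday cases, plus deliberately hard ones nobody has asked for yet.
STANDARD_SET: list[EvalCase] = []
CHALLENGE_SET: list[EvalCase] = []

def run_evals(extract, cases: list[EvalCase]) -> float:
    """Exact-match accuracy over all fields for a given extraction function."""
    correct = total = 0
    for case in cases:
        predicted = extract(case.document, list(case.fields))
        for name, gold in case.fields.items():
            total += 1
            correct += int(predicted.get(name) == gold)
    return correct / total if total else 0.0
```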

As an enterprise company we have limited ability to look at what's happening, but the idea of customers telling us is still useful in all cases. You can yell if you want, I'll hear you. And it's the first time you've spoken, so apologies about that.

Yeah. [Audience question, partly inaudible: it seems like you're mostly building agents, but what about fine-tuning instead?] So the question being: why bother with agents if you can fine-tune a model?

No, no, I'm just asking, have you tried fine-tuning? Yeah, we're pretty anti-fine-tuning at this moment, because of the challenge that once you fine-tune something, you then have to fine-tune all of the evolutions of it going forward.

We support multiple models, Gemini, Llama, OpenAI, Anthropic, and it's just hard to consistently fine-tune across the board, and usually the next version of the model just gets better anyway. So we've gotten to the point where we use prompts, or cached prompts, or agenticness, as opposed to fine-tuning.

That's the approach for our particular use cases. It works quite well. Okay. Thank you, everyone.