
Building an Agentic Platform — Ben Kus, CTO Box


Chapters

0:00 Box's Content Platform and Enterprise Focus
1:50 Initial AI Deployment in 2023
2:54 The Challenge of Unstructured Data in Enterprises
3:56 Limitations of Pre-Generative AI Data Extraction
4:54 First Version: LLM-Based Extraction
7:05 Challenges with the Pure LLM Approach
8:58 Despair and the Need for a New Architecture
9:30 Introducing Agentic Architecture
10:04 AI Agent Reasoning Framework
10:45 Agentic Routine for Data Extraction
12:28 Advantages of Agentic Architecture
14:05 Key Lesson Learned: Build Agentic Architecture Early
18:37 Approach to Fine-tuning and Model Support

Whisper Transcript | Transcript Only Page

00:00:00.000 | Hello, so I'm Ben Kus, I'm CTO of Box, and I'm going to talk today about our journey through
00:00:22.640 | AI and in particular our AI agentic journey. And if you don't know much about Box,
00:00:29.440 | a little bit of background. So at Box we are an unstructured content platform, we've been around
00:00:36.720 | for a while, more than 15 years, and we very much concentrate on large enterprises. So we've got
00:00:44.960 | over 115,000 enterprise customers, we've got two-thirds of the Fortune 500, and our job really
00:00:52.640 | is to bring everything you'd want to do with your content to these customers and to provide them all
00:00:57.040 | the capabilities they might want. In many cases, for AI, many of these customers, their first AI
00:01:01.680 | deployment was actually with Box because, of course, many enterprises worry a lot about data
00:01:08.720 | security and worry about data leakage with AI, and want to make sure they do safe and secure AI,
00:01:13.120 | and this is one thing that we have specialized in over time. But the way that we think about
00:01:18.320 | AI is at a platform level. So we have sort of the historic version of Box, which has the idea of the
00:01:24.880 | global infrastructure, sort of everything you need to manage and maintain content at scale. We've got
00:01:29.280 | over an exabyte of data, we have hundreds of billions of files that our customers have
00:01:34.640 | trusted us with, and we have natural ways to protect them in addition to the type of services that
00:01:39.280 | you provide when you're an unstructured data platform. But then for the last few years, one of the key
00:01:44.080 | things we've been investing in has been in AI on top of the platform, and I'm here to tell you a bit
00:01:48.640 | about our journey here. So we started our journey in 2023, shortly after AI became sort of production
00:01:58.800 | ready from a generative AI sense. And everything I'm talking about here today will be generative AI, of course.
00:02:03.280 | So we ended up with a set of features, things like QA across documents, things like being able
00:02:08.160 | to extract data, things like being able to do AI-powered workflows. Happy to talk about these in general,
00:02:12.640 | but today I'm going to focus on one aspect of the features that we built, which is the idea of data
00:02:18.880 | extraction. This is the idea of taking structured data from your unstructured data and using that in
00:02:24.320 | an enterprise setting. And partly I'm going to focus on this one because this is interestingly,
00:02:30.640 | like maybe the least agentic sort of thing that you might think of when you're thinking about these
00:02:36.400 | other examples about how you interact with AI. This is much less like a standard chatbot style
00:02:42.240 | integration, but what we learned, and what I'll tell you about, is how the concepts of agentic
00:02:47.760 | capabilities apply well beyond just sort of end-user interactions.
00:02:52.560 | So we'll be talking about data extraction for a moment, just a quick background. When we talk about
00:02:58.720 | metadata or data, we talk about the things in unstructured data, be it documents, be it contracts,
00:03:04.560 | be it project proposals, anything that then turns into structured data. This is a very common
00:03:10.400 | challenge in enterprises: like 90% of their data is unstructured, and 10% of their data is
00:03:16.160 | in databases, structured data. And historically, there has been this challenge that it was kind of
00:03:22.720 | hard to utilize this. So many customers have for a very long time wished they had better ways to automate
00:03:28.720 | their unstructured data. And there's a lot of it. And it's really critical. In some cases, it's the most
00:03:32.560 | critical thing in an enterprise. So the things you do with it would be to like query your data, being able to
00:03:40.480 | kick off workflows, being able to do just better search and filtering across all of your data. And
00:03:46.320 | so the prototypical example is something like a contract, where you have an
00:03:50.480 | authoritative unstructured piece of data, but then also the key fields in there are very important.
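The contract example here can be made concrete with a small schema; the field names below are hypothetical, purely to illustrate the kind of structured record you would pull out of an unstructured contract:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class ContractFields:
    """Hypothetical key fields extracted from a contract (names are illustrative)."""
    party_a: str
    party_b: str
    effective_date: str              # e.g. "2024-01-15"
    termination_date: Optional[str]  # None if the contract is open-ended
    total_value_usd: Optional[float]

# Once extracted, the structured record can drive search filters,
# workflow triggers, and queries that the raw document text cannot.
record = ContractFields("Acme Corp", "Globex LLC", "2024-01-15", None, 250000.0)
print(asdict(record))
```

The point is simply that the authoritative document stays unstructured, while a typed record like this becomes queryable alongside it.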
00:03:57.040 | So this is not a new thing. For many, many years, the world, Box included, has been interested in
00:04:04.800 | pulling structured data out of unstructured data. And there were a lot of techniques
00:04:10.160 | to do this. And there's a whole industry: if you've ever heard of IDP, this is like a multi-billion dollar
00:04:14.240 | industry whose job in life was to do this kind of extraction. But it was really hard: you had to build
00:04:21.680 | these specialized AI models, you had to like focus on specific types of content, you had to have this huge
00:04:26.800 | corpus of training data. Oftentimes you needed custom vendors and custom ML models.
00:04:32.320 | And it was quite brittle. And so, to the point that not a lot of companies ever thought about automating
00:04:37.120 | most of their critical unstructured data. So this was sort of the state of the industry for a
00:04:42.400 | very long time. Like just don't bother trying too hard with unstructured data, do everything you can to
00:04:48.880 | get it in some sort of structured format, but don't try too hard to deal with unstructured data. Until
00:04:55.040 | generative AI came along. And so this is where our journey sort of begins with AI is for a long time,
00:05:01.360 | we've been using ML models in different ways. And the first thing that we tried when confronted
00:05:08.720 | with sort of a GPT-2, GPT-3 style of AI models is that you just say, I have a question for you, AI model,
00:05:18.560 | can you extract this kind of data? And as we mostly all know, AI is not only great at generating content,
00:05:28.000 | it's also great at understanding the nuances of content. So what we did, we first started out with
00:05:34.240 | some pre-processing, doing sort of OCR steps, classic ways to do this, and then being able to then say,
00:05:43.120 | I want to extract these fields, standard AI calls, single shot, or with some decoration on the prompts.
00:05:49.360 | And this worked great. This was amazing. This was something where suddenly, a standard, generic,
00:05:57.280 | off-the-shelf AI model from multiple vendors could outperform even the best sort of models that you
00:06:02.480 | had seen in the past. And we supported multiple models just in case, and then it got better and
00:06:08.000 | better. This was wonderful. So this was flexible. You could do it across any kind of data, and
00:06:11.840 | it performed well. Yes, you had to do OCR and pre-process it, but that was straightforward.
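That single-shot style of call can be sketched roughly like this; the prompt wording and helper names are assumptions for illustration, not Box's actual implementation:

```python
import json

def build_extraction_prompt(document_text: str, fields: list) -> str:
    """Compose a single-shot extraction prompt: field specs plus the OCR'd text."""
    field_lines = "\n".join(
        f"- {f['name']} ({f['type']}): {f['description']}" for f in fields
    )
    return (
        "Extract the following fields from the document below.\n"
        "Return ONLY a JSON object keyed by field name; use null for missing fields.\n\n"
        f"Fields:\n{field_lines}\n\nDocument:\n{document_text}"
    )

def parse_extraction(raw_model_output: str, fields: list) -> dict:
    """Parse the model's JSON reply, filling None for any field it dropped."""
    data = json.loads(raw_model_output)
    return {f["name"]: data.get(f["name"]) for f in fields}

fields = [
    {"name": "party_a", "type": "string", "description": "First contracting party"},
    {"name": "effective_date", "type": "date", "description": "Date the contract takes effect"},
]
prompt = build_extraction_prompt("This agreement between Acme Corp and ...", fields)
# A real deployment would send `prompt` to an LLM here; we parse a canned reply instead.
answers = parse_extraction('{"party_a": "Acme Corp"}', fields)
print(answers)  # → {'party_a': 'Acme Corp', 'effective_date': None}
```

One generic prompt like this, pointed at any pre-processed document, is what replaced the per-document-type custom ML models.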
00:06:18.960 | And so we were just thrilled. This was like, for us, it was like, this is a new generation of AI.
00:06:25.760 | And interestingly, we would go to our customers and say, we can do this across any data. And then
00:06:29.920 | they would give us some, and it would work. And then we'd be like, great, AI models are awesome.
00:06:33.200 | Until they said, oh, now that you do that well, and I get it, now what about this one? What about this
00:06:40.080 | 300-page lease document with 300 fields? What about this really complex set of digital assets,
00:06:45.920 | where you want to answer these really complex questions associated with it? What about, I don't want to
00:06:49.680 | just extract data, I want to do risk assessments and things that are these more complex fields? You
00:06:53.280 | start to realize, huh, as a human, if you ask me that question, I'm struggling to
00:06:58.400 | answer it. And in the same way, the AI struggled to answer it. So suddenly, we ended up with more
00:07:07.200 | complex documents. Also, OCR is just a hard problem. Like, there's seemingly no end of
00:07:14.720 | heuristics and tricks that you do on OCR to get it right. So I've got a scanned document, somebody writes
00:07:20.000 | stuff in it, somebody crosses stuff out. It's just hard. And then for people who have dealt
00:07:24.800 | with things like different file formats, PDFs, it's a challenge. So whenever the OCR broke,
00:07:31.840 | it would just naturally feed that broken info to the AI. And then languages were a big pain. And so we started
00:07:37.760 | to get more and more challenges as we have an international set of customers across different
00:07:40.960 | use cases. Also, there was a clear limit to the AI in terms of how much it could handle the attention
00:07:49.040 | to so many different fields. So if you say, here's 10 fields, here's a 10 page document, figure it out.
00:07:54.320 | They're great; most of them are great. If you say, here's a 100-page document, and here's 100
00:07:59.280 | fields that are each complex with separate instructions, then it loses track. And I have
00:08:04.240 | sympathy because people would lose track too. And so this became very problematic because if you want
00:08:10.000 | high accuracy in an enterprise setting, like this just starts to not work. And then also it's just like,
00:08:15.520 | well, what is accuracy? What does it mean? In the old ML world, they give you confidence scores:
00:08:19.360 | 0.865 for this one. And then, of course, large language models don't really know their own accuracy.
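One common workaround is the LLM-as-a-judge pattern the talk turns to next: ask a second model to grade the first. A sketch, with the judge call stubbed out as a stand-in for a real LLM:

```python
def judge_extraction(document_text: str, extraction: dict, ask_judge) -> dict:
    """Grade an extraction with a second model; `ask_judge` stands in for a real LLM call."""
    prompt = (
        "You are grading a data extraction. Reply with JSON: "
        '{"verdict": "pass" or "fail", "feedback": "..."}\n\n'
        f"Document:\n{document_text}\n\nExtraction:\n{extraction}"
    )
    return ask_judge(prompt)

# Stub judge so the sketch is self-contained; a real one would parse an LLM reply.
def fake_judge(prompt: str) -> dict:
    return {"verdict": "fail", "feedback": "effective_date is missing"}

verdict = judge_extraction("This agreement ...", {"party_a": "Acme Corp"}, fake_judge)
print(verdict["verdict"])  # → fail
```

As the talk notes, a verdict alone just tells the customer the answer might be wrong; the agentic rework later feeds the feedback string back into another attempt.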
00:08:25.440 | So we would implement things like LLM as a judge. And we'd come back and tell you, here's your
00:08:29.600 | extraction; also, we're not quite sure this is right. And then our enterprise customers would kind
00:08:34.640 | of be like, well, that's helpful to know, but I want it to work right, not just for you to tell me
00:08:39.360 | it doesn't work right. And so this became the kind of set of challenges that we focused on. And so
00:08:44.720 | customers were looking for speed. They were looking for affordability. They wanted to make this work. They were
00:08:47.840 | saying, if AI is this future awesome thing, then like, you know, show it to me, and show it to me on
00:08:53.520 | these more complex documents. So at this point, we kind of hit our despair moment. We thought
00:09:00.800 | LLMs were the solution to everything. We thought that like, we could have these AI models that worked.
00:09:04.560 | But then we actually struggled. Like, what do you do now? How do you fix this? I know, let's just
00:09:09.200 | wait until the next Gemini model, or, you know, OpenAI seems to be on top of this, so
00:09:13.920 | wait till the next one. Which is part of it, right? The models do get better. But the fragility of the
00:09:19.680 | architecture was something we weren't really going to be able to solve on our own. So naturally,
00:09:26.320 | one of the answers that we came up with was bringing agentic approaches to everything that we do. And this
00:09:35.600 | is really one of the key things that I want to bring out in this session: it certainly was
00:09:41.600 | not obvious that the way to fix all these problems in something like data extraction was to do agentic
00:09:47.360 | style of interactions. And when I say agentic, I mean an AI agent that runs on instructions and
00:09:51.600 | objectives, with models and background tools it can have secure access to. Of course, it has memory,
00:09:57.840 | for the purposes of advancing and being able to look up information
00:10:01.520 | inside of the system, but also a full directed graph: the ability to orchestrate it to do
00:10:10.880 | things like, do this, then this, then this, where either it comes up with its own plan, or we
00:10:14.960 | can actually orchestrate it ourselves, because we have knowledge of what we want to do. And this was, for us,
00:10:20.640 | I mean, it was controversial. It was like, our engineers were like, what are you talking about?
00:10:24.000 | Let's just make the OCR better, let's just add another step somewhere, let's just add post-processing
00:10:29.360 | regular expression checks. And then, of course, everybody always had an: I have a way to do
00:10:33.760 | this, based on the old way of doing this. Why don't we train an ML model? Why don't we
00:10:38.160 | fine-tune? And so suddenly, all of the genericness of it would get lost in this process.
00:10:42.960 | So we came up with a mechanism, which is, I think, kind of a LangGraph
00:10:51.120 | style of agentic capabilities. And we still had the same inputs and outputs:
00:10:59.440 | in, a document with fields; out, answers. However, the approach was an agentic approach.
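A plain-Python stand-in for that kind of LangGraph-style routine, with every model call passed in as a stub; the node structure is an assumption based on the steps described in this talk, not Box's actual code:

```python
def run_extraction_graph(document, fields, extract, cross_check, judge, max_rounds=3):
    """Directed-graph extraction: group fields, extract, double-check, judge, retry."""
    # Node 1: group related fields (e.g. parties plus their addresses) so a
    # single model call sees them together instead of producing mismatched lists.
    groups = {}
    for field in fields:
        groups.setdefault(field.get("group", field["name"]), []).append(field)

    answers, feedback = {}, ""
    for _ in range(max_rounds):
        # Node 2: one extraction query per field group, carrying judge feedback forward.
        for group in groups.values():
            answers.update(extract(document, group, feedback))
        # Node 3: double-check with tools (OCR re-read, page images, model voting).
        answers = cross_check(document, answers)
        # Node 4: judge — not just a verdict, but feedback that drives another round.
        result = judge(document, answers)
        if result["verdict"] == "pass":
            break
        feedback = result["feedback"]
    return answers

# Stub nodes so the sketch runs end to end; real nodes would call LLMs and OCR tools.
extract = lambda doc, group, fb: {f["name"]: "Acme Corp" for f in group}
cross_check = lambda doc, ans: ans
judge = lambda doc, ans: {"verdict": "pass", "feedback": ""}
print(run_extraction_graph("contract text", [{"name": "party_a"}], extract, cross_check, judge))
```

Each box in a diagram like this becomes one small function, which is why later fixes can be a prompt tweak on one node rather than a redesign.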
00:11:05.040 | And so, you know, we played with all the models, reflecting back and forth with criticism,
00:11:11.600 | being able to separate it into multiple tasks, to be able to have different multi-agent
00:11:17.520 | systems work on this. And we ended up with something like this, where you have a step where you prepare
00:11:21.680 | the fields, you go through, you group the fields. We learned quickly that if there's a set of
00:11:26.320 | fields from a contract that are like the parties,
00:11:29.840 | and then somewhere else there's the address of the parties, you need the AI to handle those
00:11:32.960 | together. Otherwise you end up with three parties and two sets of addresses, which don't
00:11:37.680 | match. So we had to break up the set of fields intelligently. We had to go through and
00:11:43.360 | do multiple queries on a document. Then after we got that, we would then
00:11:49.040 | use a set of tools to check in and double-check the results. In some cases we used OCR. We would then
00:11:54.000 | double-check it by looking at pictures of the pages, and by using multiple models. Sometimes
00:11:58.320 | they vote: wow, this is a hard question; three models from different
00:12:02.400 | vendors, and two of them think this is the answer, so that was probably a good answer. And then on to the
00:12:07.120 | idea of the LLM as a judge, not just a judge to tell you that this is the answer,
00:12:12.560 | but a judge to tell you, hey, here's some feedback, keep trying. Now, of course this takes a
00:12:18.560 | little bit longer, but this is something that then leads to the kind of accuracy that you'd
00:12:22.800 | want overall. And so for us, this was the architecture that then helped us solve
00:12:30.400 | a set of problems. And it became interesting, because every time there was a new set of challenges,
00:12:36.160 | the answer was not to rethink everything, or, let's try a whole new approach,
00:12:41.840 | give us six months and we'll come up with a new idea. It was: I wonder if we change
00:12:46.240 | that prompt on that one node, or I wonder if we add another double-check at the end; then we can
00:12:50.320 | actually start to solve this problem. So we bring the power of AI intelligence to help us solve
00:12:54.320 | something that we used to think of as a standard function. And then not only that, it helped us
00:13:00.480 | in other ways. Naturally, as an unstructured content store, one of the first things
00:13:04.480 | you always see people want, if I give you a demo right now, is: I have a bunch of documents,
00:13:08.320 | I have a question. And then we did the same thing with a judge, and it'd tell us,
00:13:11.600 | oh, that was a good answer, or that wasn't. And then, why not: if it's not a good answer,
00:13:16.560 | we'll take another beat and tell the AI, try again; before you tell the user this answer,
00:13:21.920 | I want you to reflect on it for a second. And this kind of thing just leads to higher
00:13:26.880 | accuracy. And it also leads to much more complexity. So we just announced our deep research capabilities on
00:13:32.480 | your content. So in the same way that OpenAI or Gemini does deep research on the internet,
00:13:36.240 | we let you do deep research on your data in Box, and it would look something like this. This would be
00:13:41.040 | roughly the directed graph that you'd have, where you'd go through: first we search for
00:13:45.600 | the data, kind of do that for a while, figure out what's relevant, double-check, then make an outline,
00:13:49.440 | kind of prepare a plan, and go through the process. And this is all agentic thinking.
00:13:54.560 | And this kind of thing wouldn't really be possible if we hadn't kind of laid
00:13:59.520 | the framework of having an agentic foundation overall. So I will leave you with
00:14:06.400 | a few lessons learned here. This is based on our time in the last few years.
00:14:11.280 | The first is that it wasn't obvious to us at first, but the agentic abstraction layer,
00:14:19.040 | from an architecture perspective, is actually quite clean. Once you start to
00:14:24.240 | think this way, it is very natural to think: I'm going to run an intelligent workflow, an intelligent
00:14:28.800 | directed graph powered by AI models at every step, to be able to accomplish a task. Not for everything,
00:14:33.520 | but sometimes that's a great approach. And this is independent of the
00:14:38.720 | high-scale sort of distributed system design, and both are important.
00:14:43.440 | Like, at some point you have to deal with a hundred million documents in a day, and at some
00:14:46.720 | other point you have to deal with that one document. And so being able to separate these two systems,
00:14:51.040 | so that somebody thinks about the agentic framework and somebody thinks about
00:14:54.160 | how to scale a generic process, is very helpful; keep these distinct.
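The deep research flow mentioned a moment ago fits the same shape; each step below is a stub, purely to show the orchestration, and the step names are assumptions based on the talk's description:

```python
def deep_research(question, search, is_relevant, outline, write_section, summarize):
    """Deep-research pipeline over a content store: search, filter, outline, draft, polish."""
    hits = search(question)                                   # search for the data
    sources = [h for h in hits if is_relevant(question, h)]   # figure out what's relevant
    plan = outline(question, sources)                         # make an outline / plan
    draft = [write_section(step, sources) for step in plan]   # work through the plan
    return summarize(question, draft)  # an extra node at the end can fix sloppy output

# Stubs so the sketch runs; each would really be a model or tool call.
report = deep_research(
    "What are our contract renewal risks?",
    search=lambda q: ["doc1", "doc2"],
    is_relevant=lambda q, h: h == "doc1",
    outline=lambda q, s: ["risks", "recommendations"],
    write_section=lambda step, s: f"{step}: based on {s}",
    summarize=lambda q, d: " | ".join(d),
)
print(report)
```

Because each step is its own node, evolving the flow means adding or editing one function, not redesigning the whole pipeline.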
00:14:59.200 | Also, it's just easy to evolve. In that deep research example,
00:15:03.440 | we did it and then it worked really well, except the output
00:15:07.280 | was kind of sloppy. And so we were like, ah, I guess we've got to redesign the whole thing; or, add another
00:15:12.080 | node at the end to say, summarize this according to this. And it would just take that in and
00:15:16.880 | redo the output. So not that long to fix. And this was something that was not obvious to me until later,
00:15:22.880 | which is that if you're going to be using agentic AI with a team who's been around
00:15:29.520 | for a while, you start to need to get them to think in an agentic-first,
00:15:33.360 | AI-first kind of way. And one way to do that is to let them build something,
00:15:37.920 | so that they can start to think: oh, this is not only how we can build more things, but also,
00:15:42.480 | because we're also a platform for our enterprise customers, they can think about how to make it
00:15:46.320 | better for them. So, things like really doubling down on the idea that
00:15:51.200 | we publish MCP servers: what are the tools like for them? What can we do to make it easier? How can we
00:15:56.240 | do agent-to-agent communications, and so on? So this is all kind of summed up with: if
00:16:04.560 | you're confronted with a challenge, the lesson that we learned is that if it's plausible that a set of AI
00:16:10.560 | models, uh, could help you solve that problem, then you should build this AI agentic architecture early.
00:16:16.240 | If I could go back in time, I would wish to have done this sooner, because then we would have been able to
00:16:20.480 | continue to take advantage of that. And so that's my journey and those are my lessons.
00:16:25.760 | So thank you. Ankur, are we, um, two minutes? Okay. So, if anybody, what? Two questions.
00:16:40.080 | Okay. If anybody has any questions, I'm happy to answer them.
00:16:42.080 | The question being, is this available as an API? Yes. So we're very API-first oriented, so we have an
00:16:50.240 | agent API where you can call upon these agents to do things and give them the arguments. So yes,
00:16:53.760 | we provide agent APIs across everything, and tools
00:16:59.840 | to call our APIs. Okay.
00:17:07.840 | Were you primarily just evaluating your agents to develop on this dashboard or on their links?
00:17:15.680 | Can you explain when you start using, uh, a more manual approach as well? Um,
00:17:20.000 | In terms of evaluating our agents, and how do we do that? So we not only use LLM as a
00:17:24.160 | judge, but we also create an eval set. So we have our standard set of eval sets. And then we've
00:17:27.760 | learned that, since the AI gets so good over time, we created a challenge set of eval sets,
00:17:31.920 | so that we can better explore things that not everybody asks, but if they did, it would be really
00:17:35.920 | hard. And that way you can better decide whether or not you're not only prepared for now,
00:17:40.400 | but, as people ask more challenging things, we know that we can grow across that. So a mixture of eval sets,
00:17:45.680 | plus LLM as a judge, plus the idea of just having people give feedback. As an enterprise company, we have limited
00:17:50.560 | ability to look at what's happening, but the idea of them telling us is
00:17:54.640 | still useful in all cases.
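A standard set plus a challenge set can share one scoring harness; the cases and the stub extractor below are made up, just to show the two tiers being scored separately:

```python
def run_evals(extractor, cases):
    """Score an extractor over eval cases, reporting accuracy per tier."""
    scores = {}
    for case in cases:
        got = extractor(case["document"], case["fields"])
        ok = all(got.get(k) == v for k, v in case["expected"].items())
        tier = scores.setdefault(case["tier"], [0, 0])
        tier[0] += ok   # passed count
        tier[1] += 1    # total count
    return {t: passed / total for t, (passed, total) in scores.items()}

cases = [
    {"tier": "standard", "document": "...", "fields": ["party_a"],
     "expected": {"party_a": "Acme Corp"}},
    {"tier": "challenge", "document": "...", "fields": ["party_a"],
     "expected": {"party_a": "Globex LLC"}},
]
# A fixed stub extractor: passes the standard case, fails the challenge case.
stub = lambda doc, fields: {"party_a": "Acme Corp"}
print(run_evals(stub, cases))  # → {'standard': 1.0, 'challenge': 0.0}
```

Tracking the challenge tier separately is what tells you whether you are ready for the harder questions customers haven't asked yet.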
00:18:01.600 | you can yell if you want, I'll hear you. Uh, so, uh, and I, it's the first time you talk, so apologies
00:18:06.960 | about it. Yeah. Asked a bunch of the floor and you would talk about that. It seems like you're
00:18:10.400 | mostly building agents, but he's going to get out of the box, you know, center fine tuning. Why are you
00:18:15.600 | supposed to put it? So the question being, why bother with agents if you can fine-tune a model?
00:18:20.240 | No, no, I'm just saying, have you tried fine-tuning? Yeah, we're
00:18:25.520 | pretty anti-fine-tuning at this moment because of the challenge that once you
00:18:30.800 | fine-tune something, you have to then fine-tune all of the evolutions of it going forward. We support
00:18:35.280 | multiple models, Gemini, Llama, OpenAI, Anthropic, and it's just hard to consistently
00:18:41.200 | fine-tune across the board, especially when usually just the next version of
00:18:45.760 | the model gets better. So we've gotten to the point where we use these prompts, or
00:18:48.960 | cached prompts, or agenticness, as opposed to fine-tuning. That's the approach for our particular use cases.
00:18:53.280 | It works quite well. Okay. Thank you, everyone.