Building Reliable Agents: Raising the Bar


Transcript

The last time I was here was actually in the crowd at a music show, so it's pretty wild for me to be up here on stage talking to you all. I'm super excited. Anyway, my name is Benny Weld, I lead engineering at Harvey, and today I'd like to talk to you about how we build and evaluate legal AI.

This is the outline of the talk. There are five parts: I'll talk a little bit about Harvey for those of you who are not familiar with the product or the company, then about quality in legal and why it's difficult, how we build and evaluate products, and some learnings and hot takes. I was told they had to be hot takes. All right, let's dive in.

Harvey is domain-specific AI for legal and professional services. We offer a suite of products, from a general-purpose assistant for drafting and summarizing documents, to tools for large-scale document extraction, to many domain-specific agents and workflows. The vision we have for the product is twofold: we want you to do all of your work in Harvey, and we want Harvey to be available wherever you do your work. "You" here being lawyers, legal professionals, and professional service providers.

As an example, you can use Harvey to summarize documents or draft new ones. Harvey can leverage firm-specific information, such as internal knowledge bases or templates, to customize the output. We also offer tools for large-scale document analysis, which is a really important use case in legal. Think about due diligence or legal discovery tasks, where you're typically dealing with thousands of contracts or documents, thousands of emails that need to be analyzed, work that is typically done manually and is really, really tedious. Harvey can analyze hundreds of thousands of documents at once and output the results to a table or summarize them. This literally saves hours, sometimes weeks, of work. And of course we offer many workflows that enable users to accomplish complex tasks such as redline analysis, drafting certain types of documents, and more, and customers can tailor these workflows to their own needs.

We're at an agent conference, so naturally I want to talk a little bit about the agentic capabilities we've added to the product as well, such as multi-step agentic search, more personalization and memory, and the ability to execute long-running tasks, and we have a lot more cooking there that will be launching soon.

We're trusted by law firms and large enterprises around the world. We have just under 400 customers, on, I think, all continents except maybe Antarctica at this point.

In the U.S., one third of the largest 100 law firms and, I think, eight of the ten largest law firms use Harvey.

All right, let's talk about quality and why it's difficult to build and evaluate high-quality products in this domain. This may not come as a surprise to you, but lawyers deal with lots and lots of documents, many of them very complex, often hundreds, sometimes thousands of pages in length. Typically those documents don't exist in a vacuum: they're part of large corpora of case law, legislation, or other case-related documents, and they often contain extensive references to other parts of the same document or to other documents in the same corpus. The documents themselves can be pretty complex too. It's not at all unheard of to have documents with lots of handwriting, scanned notes, multi-column layouts, multiple mini pages on the same page, embedded tables, and so on. So there's a lot of complexity in the document understanding piece.

The outputs we need to generate are pretty complex too: long text, obviously, complex tables, and sometimes even diagrams or charts for things like reports, not to mention the complex language that legal professionals are used to. And mistakes can literally be career-impacting, so verification is key. This isn't really just about hallucinations, completely made-up statements, but more about slightly misconstrued or misinterpreted statements that are just not quite factually correct. Harvey has a citation feature to ground all statements in verifiable sources and to allow our users to verify that the summary provided by the AI is indeed correct and acceptable.
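To make the grounding idea concrete, here is a minimal sketch of what checking a citation against its source could look like. This is not Harvey's implementation; it assumes a hypothetical answer format in which each claim carries a quoted span and a document ID, and it simply flags claims whose quoted span cannot be found in the cited document.

```python
from dataclasses import dataclass

@dataclass
class Citation:
    doc_id: str   # which source document the claim points to
    quote: str    # the exact span the claim is grounded in

@dataclass
class Claim:
    text: str
    citation: Citation

def find_unsupported(claims: list[Claim], corpus: dict[str, str]) -> list[Claim]:
    """Return claims whose cited quote does not appear verbatim in the
    cited source document; these become candidates for human review."""
    unsupported = []
    for claim in claims:
        source_text = corpus.get(claim.citation.doc_id, "")
        if claim.citation.quote not in source_text:
            unsupported.append(claim)
    return unsupported
```

A verbatim string match is obviously a crude proxy (real systems would normalize or fuzzily match spans), but the shape of the check, every statement pointing back to a retrievable span, is the important part.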
And importantly, quality is a really nuanced and subjective concept in this domain. I don't know if you can read this, and I wouldn't expect you to read all of it, but basically these are two answers to the same question, a document understanding question, in this case asking about a specific clause in a specific contract, I think it's called a materiality scrape and indemnification, don't ask me what exactly that means. The point I'm trying to get across is that they look similar, they're both factually correct, and neither of them contains any hallucinations, take my word for it. But answer two was strongly preferred by our in-house lawyers when they looked at both of these answers, and the reason is that there's additional nuance in the write-up and more detail in some of the definitions that they really appreciated. So the point is, it's really difficult to assess automatically which of these is better, or what quality even means.

And then, last but not least, our customers' work is obviously very sensitive in nature, so obtaining reliable data sets, product feedback, or even bug reports can be pretty challenging for us. All of that combined makes it really challenging to build high-quality products in legal AI.

So how do we do it? Before evaluation, I wanted to briefly touch on how we build products. We believe, and I think Harrison actually just talked about this, that the best evals are tightly integrated into the product development process, and the best teams approach evals holistically with the rest of product development. So here are some product development principles that are important to us.

First off, we're an applied AI company. What this really means is that we need to combine state-of-the-art AI with best-in-class UI. It's not just about having the best AI, but about having the best AI packaged up in such a way that it meets our customers where they are and helps them solve their real-world problems.

The second principle, and this is something we've talked a lot about and that's very key to the way we operate, is lawyer in the loop. We include lawyers at every stage of the product development process. As I mentioned before, there's an incredible amount of complexity and nuance in legal, and so their domain expertise and their user empathy are really critical in helping us build great products. Lawyers work side by side with engineers, designers, product managers, and so on, on all aspects of building the product, from identifying use cases to data set collection to eval rubric creation to UI iteration and end-to-end testing. They're truly embedded. Lawyers also play a really important part in our go-to-market strategy: they're involved in demoing to customers, collecting customer feedback, and translating that back to our product development teams as well.

And then third, prototype over PRD. A PRD is a product requirements doc, or any kind of spec doc really. We believe that the actual work of building great products in this domain, and probably many other domains, happens through frequent prototyping and iteration. Spec docs can be helpful, but prototypes make the work tangible and easier to grok, and the quicker we can build them, the quicker we can iterate and learn. So we've invested a ton in building out our own AI prototyping stack to iterate on prompts, on all aspects of the algorithm, and on the UI.

I wanted to share an example to make this come to life a little bit. Let's say we wanted to build a specific workflow to help our customers draft a specific type of document, say a client alert. In this case, lawyers would provide the initial context: what is this document, what is it even used for, when does it typically come up in a lawyer's day-to-day work, and what else is important to know about it? Then lawyers collaborate with engineers and product to build out the algorithm and the eval data set. Engineers build a prototype, and then we typically go through many iterations where we look at initial outputs, look at results, decide whether we like them, and continue to iterate until it looks good to us as a team of experts. In parallel, we build out a final version that's embedded in our actual product, where we can iterate on the UI as well. This has really worked well for us: we've built dozens of workflows this way, and it's one of the things that really stands out in how we build product.

OK, let's talk about evaluation. We think about evals in three ways, and Harrison actually alluded to some of these as well. For us, the most important way by far is still: how can we efficiently collect human preference judgments? I already talked about how nuance and complexity are very prevalent in this domain, and so human preference judgments and human evals remain our highest-quality signal. A lot of what we spend our time on here is improving throughput and streamlining operations to collect this data, so that we can run more of these evals, more quickly, at lower cost, and so on. Second, how can we build model-based auto-evaluations, or LLM-as-a-judge, that approximate the quality of human review? And then, for a lot of our complex multi-step workflows and agents, how can we break the problem down into steps so that we can evaluate each step and keep each of them in the loop?

Let's talk a little bit about human preference ratings, or human eval. One classic tool we use here is the classic side-by-side. We curate a standardized query data set of common questions our customers might ask, or common things that come up in a workflow, and then we ask human raters to evaluate two responses to the same query. In this instance the query is "write an outline of all hearsay exemptions based on the Federal Rules of Evidence," and so on. Two different versions of a model generate two separate responses, we put them in front of raters, and we ask them to evaluate. We'll typically ask: which of these do you prefer, relatively speaking? Then, on a scale of one to seven, one being very bad and seven being very good, how would you rate each response? Plus any qualitative feedback they may have. We then use this to make launch decisions: whether to ship a new model, a new prompt, or a new algorithm. We've invested quite a bit of time in our own tool chain for this, and that's really allowed us to scale these kinds of evals over the course of the last few years; we use them routinely for many different tasks.
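As a rough illustration of what comes out of a side-by-side like this, here is a minimal sketch of the kind of record you might collect per rater and how it could be summarized into a win rate and average scores to support a launch decision. The field names and the launch threshold in the comment are hypothetical, not Harvey's actual tooling.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class SideBySideJudgment:
    query_id: str
    preferred: str      # "A", "B", or "tie"
    score_a: int        # 1 (very bad) .. 7 (very good)
    score_b: int
    notes: str = ""     # free-form qualitative feedback from the rater

def summarize(judgments: list[SideBySideJudgment]) -> dict:
    """Aggregate rater judgments into the numbers a launch review looks at."""
    wins_b = sum(j.preferred == "B" for j in judgments)
    ties = sum(j.preferred == "tie" for j in judgments)
    return {
        "n": len(judgments),
        "win_rate_b": wins_b / len(judgments),
        "tie_rate": ties / len(judgments),
        "mean_score_a": mean(j.score_a for j in judgments),
        "mean_score_b": mean(j.score_b for j in judgments),
    }

# A launch gate might then require, say, win_rate_b >= 0.55 and
# mean_score_b >= mean_score_a, with the qualitative notes reviewed by hand.
```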
But of course, human eval is very time-consuming and expensive, especially since we're leveraging domain experts, trained attorneys, to answer most of these questions. So we want to leverage automated, model-driven evals wherever possible. However, there are a number of challenges when it comes to real-world complexity, and I think Harrison just talked about this as well. Here's an example from one of the academic benchmarks in the field for legal questions, called LegalBench. You'll see that the question here is fairly simple: it has a simple yes/no answer, with a question at the end simply asking "is there hearsay?", and there's no reference to any material outside of the question itself. That's really quite simplistic, and most real-world work just doesn't look like that at all.

So we built our own eval benchmark called BigLaw Bench, which contains complex, open-ended tasks with subjective answers that much more closely mirror how lawyers work in the real world. An example question might be: analyze these trial documents and draft an analysis of conflicts, gaps, contradictions, and so on. The output here is probably paragraphs of text. So how do we get an LLM to evaluate these automatically? We have to come up with a rubric and break it down into a few categories. This is an example of what the rubric for a single question in BigLaw Bench might look like. We might look at structure: for example, is the response formatted as a table with columns X, Y, and Z? We might evaluate style: does the response emphasize actionable advice? We'll ask about substance: does the response state certain facts? In this particular case the question pertains to a document, so does the response actually mention certain facts from that document? And finally, does the response contain hallucinations or misconstrued information? Importantly, all of the exact evaluation criteria here were crafted by our in-house domain experts, the lawyers I just mentioned, and they're distinct for each QA pair. So there's a lot of work that goes into crafting these evals and the rubrics for them.
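To sketch how a rubric like this can drive an LLM-as-a-judge, here is a minimal, illustrative harness. The rubric items are paraphrased from the talk, and the `llm` callable is a placeholder for any chat-completion wrapper that takes a prompt and returns text; Harvey's actual prompts and scoring are certainly more involved.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricItem:
    category: str    # "structure", "style", "substance", or "hallucination"
    criterion: str   # a yes/no question crafted by a domain expert

def judge_response(question: str, response: str,
                   rubric: list[RubricItem],
                   llm: Callable[[str], str]) -> float:
    """Ask the judge model one rubric item at a time and return the pass rate."""
    passed = 0
    for item in rubric:
        prompt = (
            "You are grading a legal AI response against one criterion.\n"
            f"Task: {question}\n\nResponse:\n{response}\n\n"
            f"Criterion ({item.category}): {item.criterion}\n"
            "Answer strictly YES or NO."
        )
        verdict = llm(prompt).strip().upper()
        passed += verdict.startswith("YES")
    return passed / len(rubric)

# Example rubric, loosely following the categories described above:
rubric = [
    RubricItem("structure", "Is the response formatted as a table with the requested columns?"),
    RubricItem("style", "Does the response emphasize actionable advice?"),
    RubricItem("substance", "Does the response mention the key facts from the underlying document?"),
    RubricItem("hallucination", "Is the response free of fabricated or misconstrued statements?"),
]
```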
OK, the last eval principle: breaking the problem down. Workflows and agents are really multi-step processes, and breaking the problem down into components enables us to evaluate each of these steps separately, which really helps make the problem more tractable. One canonical example of this is RAG, say for QA over a large corpus. Typical steps for RAG might include: first you rewrite the query, then you find the matching chunks and documents using a search or retrieval system, then you generate the answer from the sources, and last you may want to create citations to ground the answer in facts. Each of these can be evaluated as its own step, and the same idea applies to complex workflows, citations, and so on. The more we can do this, the more we can leverage automated evals.
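As an illustration of step-level evaluation for a RAG pipeline like the one just described, here is a minimal sketch that scores the retrieval step and the citation step separately, assuming you have human-labeled relevant documents and gold citations for each eval query. This is a generic sketch under those assumptions, not Harvey's pipeline.

```python
from dataclasses import dataclass

@dataclass
class RagTrace:
    query: str
    rewritten_query: str
    retrieved_doc_ids: list[str]
    answer: str
    cited_doc_ids: list[str]

@dataclass
class RagLabel:
    relevant_doc_ids: set[str]    # docs a human marked as needed to answer
    gold_citation_ids: set[str]   # docs the answer should cite

def eval_retrieval(trace: RagTrace, label: RagLabel, k: int = 10) -> float:
    """Recall@k for the retrieval step, independent of the final answer."""
    top_k = set(trace.retrieved_doc_ids[:k])
    return len(top_k & label.relevant_doc_ids) / max(len(label.relevant_doc_ids), 1)

def eval_citations(trace: RagTrace, label: RagLabel) -> float:
    """Fraction of expected citations actually present in the answer."""
    cited = set(trace.cited_doc_ids)
    return len(cited & label.gold_citation_ids) / max(len(label.gold_citation_ids), 1)

# Query rewriting and answer quality can be judged per step as well, e.g. with
# the rubric-style LLM judge sketched earlier, so a regression in any single
# stage shows up in its own metric instead of only in the end-to-end result.
```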
To put this all together, I wanted to give an example of a recent launch. In April, OpenAI released GPT-4.1, and we were fortunate to get an early look at the model before it came out to GA. We first ran BigLaw Bench to get a rough idea of its quality. You can see on the chart, on the far left, that GPT-4.1 in the context of Harvey's AI systems performed better than other foundation models, so we felt the results were pretty promising. We then moved on to human rater evaluations to further assess quality. In this chart you can see the performance of our baseline system and of the new system using 4.1 on the set of human rater evals I was talking about earlier. Again, we're asking raters to evaluate the answer to a given question on a scale from one to seven, one being very bad and seven being very good, and you can see that the new system skews much more to the right. So clearly the results here looked much more promising and much higher quality.

This looked great, and we could have just launched at this point, but in addition we ran a lot of tests on more product-specific data sets to help us understand where the model worked well and where it had shortcomings, and we also ran a bunch of internal dogfooding to collect qualitative feedback from our in-house teams. This actually helped us catch a few regressions. For example, 4.1 was much more likely to start every response with the word "Certainly!", which is not really what we were going for and is kind of off-brand for us. So we first had to address those issues before we could roll it out to customers.

OK, so what are some things we learned? The first learning is: sharpen your axe. At the end of the day, a lot of evaluation is, in my mind, really an engineering problem. The more time we invest in building strong tooling, great processes, and documentation, the more quickly it pays back. In our case I'd say it paid back tenfold: it became much easier to run evals, which meant more teams started using them and used them more often, and as a result our iteration speed and product quality really improved, as did our confidence in that quality, which meant we could ship to customers more quickly. I didn't mention this earlier, but we leverage LangSmith extensively for a subset of our evals, especially the routine evals that come up when we break tasks down into steps, and we've also built some of our own tools for the more human-rater-focused evaluations. So I would say don't be afraid to mix and match, and find what works best for you.

Learning number two is kind of the flip side of this: evals matter, but taste really does too. Obviously, having rigorous and repeatable evaluations is critical; we wouldn't be able to make product progress without them. But human judgment, qualitative feedback, and taste really matter too. We learn a ton from the qualitative feedback we get from our raters, from our internal dogfooding, and from our customers, and we constantly make improvements to the product that don't move eval metrics in any meaningful way but clearly make the product better, for example by making it faster, more consistent, or easier to use.

My last learning, and maybe this is a little more forward-looking and a bit of a hot take since we're here talking about agents, is about data: the most important data doesn't exist yet. One reductive or simplistic take on AI progress in the last decade is that we've made a ton of progress by taking more and more publicly available data and training larger and larger models. That's of course been very successful; it's led to the amazingly capable foundation models that we all know and love and use every day, and they continue to improve. But I would argue that to build domain-specific agentic workflows for real-world tasks, we actually need more process data: the kind of data that shows you how work gets done inside of those firms today. Think about an M&A transaction, a merger between two firms. This is typically many months, sometimes years, of work, and it's typically broken down into hundreds of subtasks or projects. There's usually no written playbook for all of this; it's not summarized neatly in a single spreadsheet. It's often captured in hallway conversations, or maybe in handwritten notes in the margins of a document that say "this is how we do this here." If we can extract that kind of process data and apply it to models, I think it has the potential to lead to the next breakthroughs in building agentic systems. This is something I'm really excited about and that I'm looking forward to spending more time on over the next few years.

And with that, thank you. It was a real pleasure speaking here today, and enjoy the rest of the conference.