
Building Reliable Agents: Raising the Bar



00:00:07.000 | The last time I was here was actually in the crowd at a music show, so it's pretty
00:00:11.800 | wild for me to be up here on stage talking to you all. I'm super excited.
00:00:15.440 | Anyway, my name is Benny Weld, I lead engineering at Harvey,
00:00:18.580 | and today I'd like to talk to you about how we build and evaluate legal AI
00:00:23.580 | so this is the outline of the talk five parts to it
00:00:26.580 | talk a little bit about Harvey for those of you who are not familiar with the
00:00:29.300 | product or the company, and I'll talk about quality in legal and why it's
00:00:33.800 | difficult, how we build and evaluate products, and some learnings and hot takes.
00:00:38.540 | I was told they had to be hot takes. All right, let's dive in. So Harvey is really
00:00:46.580 | domain-specific AI for legal and professional services we offer a suite of
00:00:51.020 | products from a general-purpose assistant for drafting and summarizing docs to
00:00:55.420 | tools for large-scale document extraction to many domain-specific
00:00:58.600 | agents and workflows and the vision we have for the product is twofold we want
00:01:04.360 | you to do all of your work in Harvey and we want Harvey to be available wherever
00:01:08.420 | you do your work, "you" here being lawyers, legal professionals, and professional
00:01:12.520 | service providers. so as an example, you can use Harvey to summarize documents or
00:01:18.340 | draft new ones. Harvey can leverage firm-specific information such as firm-internal
00:01:23.100 | knowledge bases or templates to customize the output. we
00:01:29.340 | also offer tools for large-scale document analysis which is a really important use
00:01:33.100 | case in legal think about a lot of due diligence or legal discovery tasks where
00:01:37.980 | you're typically dealing with thousands of contracts or documents thousands of emails that
00:01:41.740 | need to be analyzed which typically is done manually and is really really tedious so Harvey can analyze hundreds of thousands of documents at once and output to a table or summarize the results
00:01:52.860 | this literally saves hours, sometimes weeks, of work. and of course we offer many
00:01:58.620 | workflows that enable users to accomplish complex tasks such as redline analysis, drafting certain types of documents,
00:02:04.380 | and more, and customers can tailor these workflows to their own needs. we're at an agent conference, so naturally I want to talk a little bit about the agentic capabilities we've added to the product as well, such as multi-step agentic search, more personalization and memory, and the ability to execute long-running tasks, and we have a lot more cooking there that will be launching soon
00:02:04.380 | we're trusted by law firms and large enterprises around the world. we have
00:02:34.140 | just under 400 customers on, i think, all continents, maybe except Antarctica, at this point, and in the
00:02:39.900 | u.s. one-third of the largest 100 law firms, and i think eight out of the ten largest, use Harvey
00:02:45.820 | all right let's talk about quality and why it's difficult to build and evaluate high quality products
00:02:53.980 | in this domain so this may not come as a surprise to you but lawyers deal with lots and lots and lots of documents
00:03:00.780 | many of them very complex often hundreds sometimes thousands of pages in length and typically those
00:03:06.540 | documents don't exist in a vacuum they're part of large corpora of case law legislation or other case
00:03:12.860 | related documents often those documents contain extensive references to other parts of the document
00:03:19.340 | or other documents in the same corpus and the documents themselves can be pretty complex it's not at all
00:03:25.500 | unheard of to have documents with lots of handwriting, scanned notes, multi-column layouts, multiple mini-pages on the
00:03:32.860 | same page embedded tables etc etc so lots of complexity in the document understanding piece
00:03:40.540 | the outputs we need to generate are pretty complex too long text obviously complex tables and sometimes
00:03:47.580 | even diagrams or charts for things like reports not to mention the complex language that legal
00:03:53.260 | professionals are used to
00:03:57.660 | and mistakes can literally be career impacting so verification is key and this isn't really just
00:04:03.660 | about hallucinations completely made up statements but really more about slightly misconstrued or
00:04:08.780 | misinterpreted statements that are just not quite factually correct so harvey has a citation feature to
00:04:15.500 | ground all statements in verifiable sources and to allow our users to verify that the summary
00:04:20.860 | provided by the ai is indeed correct and acceptable and importantly quality is a really nuanced and
00:04:28.940 | subjective concept in this domain i don't know if you can read this i wouldn't expect you to read all of
00:04:33.820 | it but basically this is two answers to the same question a document understanding question in this case
00:04:39.580 | asking about a specific clause in a specific contract i think it's called materiality scrape and indemnification
00:04:46.220 | don't ask me what exactly that means but the point i'm trying to get across is they look similar
00:04:52.620 | they're actually both factually correct neither of them have any hallucinations take my word for it
00:04:57.820 | but answer two was actually strongly preferred by our in-house lawyers when they looked at both of
00:05:03.580 | these answers and the reason is that there's additional nuance in the write-up and more details in
00:05:08.460 | some of the definitions that they really appreciated so the point is it's really difficult to assess
00:05:14.380 | automatically which of these is better, or what quality even means
00:05:18.860 | and then last but not least obviously our customers work is very sensitive in nature
00:05:24.940 | obtaining reliable data sets product feedback or even bug reports can be pretty challenging for us
00:05:30.540 | and so all of that combined makes it really challenging to build high quality products in legal ai
00:05:40.780 | so how do we do it before evaluation i wanted to actually briefly touch on how we build products
00:05:46.380 | we believe and i think harrison actually just talked about this that the best evals are tightly
00:05:51.740 | integrated into the product development process and the best teams approach eval holistically with
00:05:56.940 | the rest of product development so here are some product development principles that are important to us
00:06:02.300 | and how we're going to do it. first off, we're an applied ai company, so what this really means is that we need to
00:06:09.660 | combine state-of-the-art ai with best-in-class ui it's really not just about having the best ai
00:06:16.060 | but really about having the best ai that's packaged up in such a way that it meets our customers where they
00:06:21.660 | are and helps them solve their real-world problems
00:06:26.860 | the second principle and this is something that we've talked a lot about and that's very very
00:06:30.620 | key to the way that we operate is lawyer in the loop
00:06:33.020 | so we really include lawyers at every stage of the product development process
00:06:37.740 | as i mentioned before there's an incredible amount of complexity and nuance in legal
00:06:42.460 | and so their domain expertise and their user empathy are really critical in
00:06:48.780 | helping us build great products
00:06:52.220 | so lawyers work side by side with engineers designers product managers and so on on all
00:06:56.860 | aspects of building the product from identifying use cases to data set collection to eval rubric
00:07:03.020 | creation to ui iteration and end-to-end testing; they're truly embedded. lawyers also play a really
00:07:09.660 | important part in our go-to-market strategy: they're involved in demoing to customers, collecting customer
00:07:14.700 | feedback and translating that back to our product development teams as well
00:07:20.460 | and then third, prototype over PRD. a PRD is a product requirements doc, or any kind of spec doc really
00:07:25.900 | we really believe that the actual work of building great products in this domain and probably many
00:07:30.860 | other domains happens through frequent prototyping and iteration spec docs can be helpful but prototypes
00:07:37.900 | really make the work tangible and easier to grok and the quicker we can build these the quicker we can
00:07:42.220 | iterate and learn so we've invested a ton in building out our own ai prototyping stack to iterate on prompts
00:07:49.740 | all aspects of the algorithm as well as the ui
00:07:52.620 | so i wanted to share an example to make this come to life a little bit let's say we wanted
00:07:59.500 | to build out a specific workflow to help our customers draft a specific type of document
00:08:06.140 | let's say a client alert now in this case lawyers would provide the initial context what is this document
00:08:12.460 | what is it even used for when does this typically come up in a typical lawyer's day-to-day work
00:08:17.900 | and what else is important to know about it then lawyers would collaborate with engineers and product
00:08:23.420 | to build out the algorithm and the eval data set engineers build a prototype and then we typically
00:08:28.860 | go through many iterations of this where we look at initial outputs and results, ask whether we like them, and
00:08:34.220 | continue to iterate until it looks good to us as a team of experts. in parallel, we build out a final product
00:08:40.620 | that's more embedded in our actual product where we can iterate on the ui as well this has really worked
00:08:47.500 | well for us we've built dozens of workflows this way and it's it's one of the things that that really
00:08:52.380 | stands out for us in terms of how we build product
00:08:57.020 | okay let's talk about evaluation
00:08:59.100 | so we think about eval in three ways and harrison actually alluded to some of these as well but
00:09:06.540 | for us the most important way by far still is how can we efficiently collect human preference
00:09:12.940 | judgments. i already talked about how nuance and complexity are really important and
00:09:19.660 | prevalent in this domain, and so human preference judgments and human evals remain our highest quality signal
00:09:26.140 | and so a lot of what we spend our time on and how we think about this here is how can we improve the
00:09:30.780 | throughput how can we improve and streamline operations to collect this data
00:09:34.380 | so that we can run more of them more quickly at lower cost etc
00:09:40.940 | second how can we build model-based auto evaluations or llm as a judge that approximate the quality of human
00:09:47.980 | review and then for a lot of our complex multi-step workflows and agents how can we break the problem
00:09:55.180 | down into steps so that we can evaluate each step individually
00:09:59.900 | okay, let's talk a little bit about human preference ratings, or human eval
00:10:08.700 | so one classic tool that we use here is the classic side-by-side: basically, we curate
00:10:15.100 | a standardized query data set of common questions that our customers might ask or common things that
00:10:20.300 | come up in a workflow and then we ask human raters to evaluate two responses to the same query so in this
00:10:25.900 | instance the query is: write an outline of all hearsay exemptions based on the Federal Rules of Evidence, etc.
00:10:31.660 | and then the model, or two different versions of a model, generate two separate responses and we put
00:10:36.860 | this in front of raters and ask them to evaluate it we'll typically ask them okay which of these do you
00:10:42.140 | prefer just relatively speaking and then on a scale of say one to seven from one being very bad to seven
00:10:48.380 | being very good, how would you rate each response, as well as some qualitative feedback that they may have
00:10:55.420 | in addition and then we use this to make launch decisions whether to ship a new model a new prompt or algorithm
00:11:02.780 | we've invested quite a bit of time in our own tool chain for this and that's really allowed us to
00:11:06.380 | scale these kinds of evals over the course of the last few years, and we use them routinely for many
00:11:10.940 | different tasks
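To make the shape of this concrete, here is a minimal sketch of what a side-by-side preference record and the resulting launch signal could look like. This is a hypothetical illustration, not Harvey's actual tooling; the `SideBySideRating` schema, the 1-7 score fields, and the 60% win-rate threshold are assumptions made up for the example.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class SideBySideRating:
    """One rater's judgment of two responses to the same query (hypothetical schema)."""
    query_id: str
    preferred: str          # "A", "B", or "tie"
    score_a: int            # 1 (very bad) .. 7 (very good)
    score_b: int
    comment: str = ""       # free-form qualitative feedback

def summarize(ratings: list[SideBySideRating]) -> dict:
    """Aggregate ratings into the kind of signals a launch decision might look at."""
    wins_b = sum(r.preferred == "B" for r in ratings)
    return {
        "n": len(ratings),
        "win_rate_b": wins_b / len(ratings),
        "mean_score_a": mean(r.score_a for r in ratings),
        "mean_score_b": mean(r.score_b for r in ratings),
    }

# Example: decide whether candidate system B looks better than baseline A.
ratings = [
    SideBySideRating("q1", "B", 5, 6),
    SideBySideRating("q2", "tie", 6, 6),
    SideBySideRating("q3", "B", 4, 6, "more nuance in the definitions"),
]
summary = summarize(ratings)
ship_candidate = summary["win_rate_b"] >= 0.6 and summary["mean_score_b"] >= summary["mean_score_a"]
print(summary, ship_candidate)
```

In practice the hard part is the rater tooling and operations around records like these, not the aggregation itself.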
00:11:13.980 | okay but of course human eval is very time consuming and expensive especially since we're
00:11:19.660 | leveraging domain experts like trained attorneys to answer most of these questions and so we want to
00:11:26.140 | leverage automated and model driven evals wherever possible however there are really a number of
00:11:31.900 | challenges when it comes to real world complexity there i think harrison actually just talked about this as
00:11:36.700 | well. so here's an example of one of the academic benchmarks out there in the field for legal
00:11:44.540 | questions, it's called LegalBench, and you'll see that the question here is fairly simple in that it's a
00:11:50.700 | simple yes/no question at the end asking "is there hearsay?", and there's no reference
00:11:57.740 | to any other material outside of the question itself. that's really quite simplistic, and most of the real world
00:12:04.460 | work just doesn't look like that at all
00:12:06.140 | so we actually built our own eval benchmark called BigLaw Bench, which contains complex open-ended tasks
00:12:16.300 | with subjective answers that much more closely mirror how lawyers do work in the real world. so in this
00:12:21.740 | instance, as an example question: analyze these trial documents and draft an analysis of
00:12:27.660 | conflicts, gaps, contradictions, etc., and the output here is probably paragraphs of text
00:12:36.300 | so how do we get an llm to evaluate these automatically well we have to come up with a rubric
00:12:41.980 | and break it down into a few different categories so this is an example rubric for what this single
00:12:48.540 | question in BigLaw Bench might look like. we might look at structure, so for example:
00:12:56.620 | is the response formatted as a table with columns x, y, and z? we might evaluate style: does the response
00:13:04.300 | emphasize actionable advice? we'll ask about substance: does the response state certain facts? in this
00:13:11.740 | particular case the question pertains to a document, so does the response actually
00:13:17.580 | mention certain facts from the document? and finally, does the response contain hallucinations or
00:13:23.820 | misconstrued information? and importantly, all of the exact evaluation criteria here were crafted by our in-house
00:13:32.220 | domain experts, the lawyers that i just mentioned, and they're really distinct for each qa pair
00:13:37.580 | so there's a lot of work that goes into crafting these evals and the rubrics for them
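As an illustration of the LLM-as-a-judge idea described above, here is a minimal sketch of grading a single answer against a hand-written rubric. It assumes the OpenAI Python SDK (v1+) and an OpenAI-style chat model; the model name, the rubric criteria, and the JSON output shape are illustrative assumptions, not Harvey's actual implementation.

```python
# Minimal LLM-as-judge sketch: score one answer against a hand-written rubric.
# Assumes the OpenAI Python SDK >= 1.0 and OPENAI_API_KEY in the environment.
# Model name and rubric criteria are illustrative only.
import json
from openai import OpenAI

client = OpenAI()

rubric = [
    {"id": "structure", "check": "Is the response formatted as a table with columns Issue, Evidence, Analysis?"},
    {"id": "style", "check": "Does the response emphasize actionable advice?"},
    {"id": "substance", "check": "Does the response cite the termination-for-convenience clause in the contract?"},
    {"id": "faithfulness", "check": "Is the response free of hallucinated or misconstrued statements?"},
]

def judge(question: str, answer: str, sources: str) -> dict:
    """Ask the judge model to grade each rubric criterion as pass/fail with a reason."""
    prompt = (
        "You are grading a legal AI answer against a rubric.\n"
        f"Question:\n{question}\n\nSources:\n{sources}\n\nAnswer:\n{answer}\n\n"
        "For each criterion, reply with JSON of the form "
        '{"scores": [{"id": ..., "pass": true/false, "reason": ...}]}\n'
        f"Rubric: {json.dumps(rubric)}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

# Usage: result = judge(question, answer, sources)
# pass_rate = sum(s["pass"] for s in result["scores"]) / len(result["scores"])
```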
00:13:41.100 | okay, last eval principle: breaking the problem down
00:13:46.540 | so workflows and agents are really multi-step processes
00:13:51.100 | and breaking the problem down into components enables us to evaluate each of these steps
00:13:56.620 | separately which really helps make the problem more tractable
00:13:59.340 | so one canonical example of this is rag,
00:14:03.500 | let's say for qa over a large corpus
00:14:06.620 | typical steps for rag may include first you rewrite the query
00:14:10.620 | then you find the matching chunks and docs using a search or retrieval system
00:14:15.340 | then you generate the answer from the sources, and last you maybe want to create citations
00:14:20.220 | to ground the answer in verifiable sources
00:14:22.540 | each of these can be evaluated as its own step
00:14:27.020 | and the same idea applies to complex workflows citations etc etc and so the more we can do this
00:14:33.900 | the more we can leverage automated evals
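To show what step-level evaluation can look like in code, here is a rough sketch for a RAG pipeline broken into the steps above. The `pipeline` object and its method names (`rewrite_query`, `retrieve`, `generate`, `cite`) are hypothetical stand-ins, and the two metrics shown (retrieval recall@k and citation precision) are just common per-step scores, not a specific Harvey implementation.

```python
# Sketch of step-level evaluation for a RAG pipeline (hypothetical helper names).
# Each stage is scored on its own, so regressions can be localized to a step.

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """Retrieval step: fraction of known-relevant chunks found in the top k."""
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / max(len(relevant_ids), 1)

def citation_precision(cited_ids: list[str], supporting_ids: set[str]) -> float:
    """Citation step: fraction of citations that point at sources known to support the answer."""
    if not cited_ids:
        return 0.0
    return sum(1 for c in cited_ids if c in supporting_ids) / len(cited_ids)

def evaluate_example(example: dict, pipeline) -> dict:
    """Run one labeled example through the pipeline and score each step separately."""
    rewritten = pipeline.rewrite_query(example["query"])
    chunks = pipeline.retrieve(rewritten)
    answer = pipeline.generate(example["query"], chunks)
    citations = pipeline.cite(answer, chunks)
    relevant = set(example["relevant_ids"])  # labels from domain experts
    return {
        "retrieval_recall": recall_at_k([c.id for c in chunks], relevant),
        # Simplification: in practice citations are checked against the chunks
        # that actually support each statement, not just the relevant set.
        "citation_precision": citation_precision([c.id for c in citations], relevant),
        # The generated answer itself can then go to an LLM judge or a human rater.
    }
```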
00:14:35.740 | so to put this all together i wanted to give an example of a recent launch
00:14:43.020 | in april openai actually released gpt 4.1
00:14:46.060 | we were fortunate to get an early look at the model before it came out to ga
00:14:50.780 | and so we first ran big law bench to get a rough idea of its quality
00:14:55.500 | you can see on the chart here, on the far left, that gpt-4.1 in the context of harvey's
00:15:01.100 | ai systems performed better than other foundation models, so we felt that the results were pretty promising
00:15:07.100 | here, and so we moved on to human rater evaluations to further assess the quality
00:15:12.460 | in this chart you can see the performance of our baseline system and of the new system using 4.1
00:15:19.260 | on the set of human rater evals that i was just talking about earlier
00:15:22.860 | so again we're asking raters to evaluate the answer on a given question on a scale from one to seven one
00:15:29.660 | being very bad seven being very good and you can see that in the new system it skews much more to the right
00:15:35.180 | so clearly the results here look much more promising and much higher quality
00:15:38.940 | so this looked great we could have just launched it at this point but in addition to that we ran
00:15:45.260 | a lot of additional tests on more product specific data sets to really help us understand where it worked
00:15:50.780 | well and where it had shortcomings and also ran a bunch of additional internal dog fooding to
00:15:57.340 | collect qualitative feedback from our in-house teams this actually helped us catch a few regressions
00:16:03.820 | so for example, 4.1 was much more likely to start every response with "certainly!",
00:16:10.540 | which is not really what we were going for and is kind of off-brand for us, so we first had to address
00:16:16.380 | those issues before we could roll it out to customers
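A check for the kind of regression mentioned above can be as simple as a string heuristic run over dogfooding outputs. This is a toy sketch; the opener list and the 2% threshold are made-up values for illustration, not production settings.

```python
# Toy style-regression check of the kind dogfooding can surface.
# The opener list and threshold are illustrative assumptions, not production values.
def has_off_brand_openers(responses: list[str], max_rate: float = 0.02) -> bool:
    """Flag a candidate system if too many responses start with a boilerplate opener."""
    openers = ("certainly!", "sure!", "of course!")
    hits = sum(r.strip().lower().startswith(openers) for r in responses)
    return hits / max(len(responses), 1) > max_rate

# Example: one of two sampled responses starts with "Certainly!", so the check fires.
sampled = ["Certainly! Here is the summary...", "The indemnification clause provides..."]
assert has_off_brand_openers(sampled)
```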
00:16:18.940 | okay so what are some things we learned
00:16:24.700 | well, the first learning really is: sharpen your axe
00:16:30.940 | at the end of the day a lot of evaluation is in my mind really an engineering problem
00:16:37.260 | so the more time we invest in building out strong tooling great processes and documentation
00:16:41.580 | the more quickly it will pay back; in our case i could say it paid back tenfold
00:16:46.700 | it became much easier to run evals which meant that more teams started using them
00:16:51.980 | and use them more often and as such the iteration speed and our product quality really improved as well
00:16:59.100 | as our own confidence in our product quality, which meant that we were confident in shipping it to
00:17:03.420 | customers more quickly i didn't mention this earlier but we leveraged langsmith extensively
00:17:09.420 | for some subset of our evals, especially a lot of the routine evals that come up when we break tasks down,
00:17:15.580 | but we've also built some of our own tools for some of the more human-rater-focused evaluations, so i
00:17:22.540 | would say don't be afraid to mix and match and find what works best for you
00:17:26.620 | learning number two, this is kind of the flip side of this, which is that evals matter but taste
00:17:34.620 | really does too obviously having rigorous and repeatable evaluations is critical we wouldn't
00:17:39.660 | be able to make product progress without them but human judgment qualitative feedback and taste really
00:17:45.260 | matter too. we learn a ton from the qualitative feedback we get from our raters, from our internal
00:17:50.380 | dog fooding and from our customers and we constantly make improvements to the product that don't
00:17:55.660 | really impact eval metrics in any meaningful way but clearly make the product better, for example
00:18:00.860 | by making it faster, more consistent, or easier to use
00:18:04.300 | and my last learning and maybe this is a little bit more forward-looking and a bit of a hot take but
00:18:12.620 | as we're here talking about agents, i wanted to talk a little bit about data, and the take here is that
00:18:18.540 | the most important data doesn't exist yet. so maybe one reductive or simplistic take on ai progress in
00:18:26.460 | the last decade has been that we've made a ton of progress by just taking more and more publicly available
00:18:31.580 | data and creating larger and larger models and that's of course been very very successful it's led to the
00:18:38.540 | amazingly capable foundation models that we all know and love and use every day and they continue to improve
00:18:45.820 | but i would argue that to build domain-specific agentic workflows for real world tasks we actually
00:18:51.820 | need more process data the kind of data that shows you how to get things done inside of those firms today
00:18:58.460 | so think about an mna transaction a merger between two firms this is typically many months sometimes
00:19:04.620 | years of work and it's typically broken down into hundreds of subtasks or projects and there's usually
00:19:11.580 | no written playbook for all this; it's not all summarized neatly in a single spreadsheet.
00:19:15.820 | it's often captured in hallway conversations or maybe handwritten notes in the margins of a document that says
00:19:22.380 | this is how we do this here and so if we can extract that kind of data that kind of process data
00:19:28.060 | and apply that to models, then i think it has the potential to really lead to the
00:19:34.620 | next breakthroughs when it comes to building agentic systems and this is something i'm really excited about
00:19:40.700 | and that i'm looking forward to spending more time on over the next few years
00:19:44.380 | and with that thank you it was a real pleasure speaking here today and enjoy the rest of the conference