Building Reliable Agents: Raising the Bar

The last time I was here I was actually in the crowd at a music show, so it's pretty wild for me to be up here on stage talking to you all. I'm super excited. My name is Ben Liebald, I lead engineering at Harvey, and today I'd like to talk to you about how we build and evaluate legal AI.

This is the outline of the talk. There are five parts to it: a little bit about Harvey, for those of you who are not familiar with the product or the company; quality in legal and why it's difficult; how we build products; how we evaluate them; and some learnings and hot takes. I was told they had to be hot takes. All right, let's dive in.
Harvey is domain-specific AI for legal and professional services. We offer a suite of products, from a general-purpose assistant for drafting and summarizing documents, to tools for large-scale document extraction, to many domain-specific agents and workflows. The vision we have for the product is twofold: we want you to do all of your work in Harvey, and we want Harvey to be available wherever you do your work. "You" here being lawyers, legal professionals, and professional service providers.

As an example, you can use Harvey to summarize documents or draft new ones. Harvey can leverage firm-specific information, such as firm-internal knowledge bases or templates, to customize the output.
We also offer tools for large-scale document analysis, which is a really important use case in legal. Think about due diligence or legal discovery tasks, where you're typically dealing with thousands of contracts, documents, or emails that need to be analyzed, which is typically done manually and is really, really tedious. Harvey can analyze hundreds of thousands of documents at once and output the results to a table or summarize them. This literally saves hours, sometimes weeks, of work.

And of course we offer many workflows that enable users to accomplish complex tasks, such as redline analysis, drafting certain types of documents, and more, and customers can tailor these workflows to their own needs. We're at an agents conference, so naturally I want to talk a little bit about the agentic capabilities we've added to the product as well, such as multi-step agentic search, more personalization and memory, and the ability to execute long-running tasks. And we have a lot more cooking there that will be launching soon.
We're trusted by law firms and large enterprises around the world. We have just under 400 customers at this point, on I think all continents, maybe except Antarctica. In the U.S., one third of the largest 100 law firms, and I think eight of the largest ten, use Harvey.
All right, let's talk about quality, and why it's difficult to build and evaluate high-quality products in this domain. This may not come as a surprise to you, but lawyers deal with lots and lots and lots of documents, many of them very complex, often hundreds, sometimes thousands of pages in length. And typically those documents don't exist in a vacuum: they're part of large corpora of case law, legislation, or other case-related documents, and they often contain extensive references to other parts of the same document or to other documents in the same corpus. The documents themselves can be pretty complex, too. It's not at all unheard of to have documents with lots of handwriting, scanned notes, multi-column layouts, multiple mini pages on the same page, embedded tables, and so on. So there's a lot of complexity in the document understanding piece.
The outputs we need to generate are pretty complex too: long text, obviously, complex tables, and sometimes even diagrams or charts for things like reports, not to mention the complex language that legal writing demands. And mistakes can literally be career-impacting, so verification is key. This isn't really just about hallucinations, completely made-up statements, but more about slightly misconstrued or misinterpreted statements that are just not quite factually correct. So Harvey has a citation feature to ground all statements in verifiable sources and to allow our users to verify that the summary provided by the AI is indeed correct and acceptable.
And importantly, quality is a really nuanced and subjective concept in this domain. I don't know if you can read this, and I wouldn't expect you to read all of it, but basically these are two answers to the same question, a document understanding question in this case, asking about a specific clause in a specific contract. I think it's called a materiality scrape and indemnification; don't ask me what exactly that means. The point I'm trying to get across is that they look similar, and they're actually both factually correct: neither of them has any hallucinations, take my word for it. But answer two was strongly preferred by our in-house lawyers when they looked at both answers, and the reason is that there's additional nuance in the write-up and more detail in some of the definitions, which they really appreciated. So the point is, it's really difficult to automatically assess which of these is better, or what quality even means.
And last but not least, our customers' work is obviously very sensitive in nature, so obtaining reliable data sets, product feedback, or even bug reports can be pretty challenging for us. All of that combined makes it really challenging to build high-quality products in legal AI.
So how do we do it? Before evaluation, I wanted to briefly touch on how we build products. We believe, and I think Harrison actually just talked about this, that the best evals are tightly integrated into the product development process, and that the best teams approach evals holistically with the rest of product development. So here are some product development principles that are important to us.

First off, we're an applied AI company. What this really means is that we need to combine state-of-the-art AI with best-in-class UI. It's really not just about having the best AI, but about having the best AI packaged up in such a way that it meets our customers where they are and helps them solve their real-world problems.
The second principle, and this is something we've talked a lot about and that's very key to the way we operate, is lawyer-in-the-loop. We really include lawyers at every stage of the product development process. As I mentioned before, there's an incredible amount of complexity and nuance in legal, so their domain expertise and their user empathy are critical in helping us create products. Lawyers work side by side with engineers, designers, product managers, and so on, on all aspects of building the product, from identifying use cases, to data set collection, to eval rubric creation, to UI iteration and end-to-end testing. They're truly embedded. Lawyers also play a really important part in our go-to-market strategy: they're involved in demoing to customers, collecting customer feedback, and translating it back to our product development teams.
And third: prototype over PRD. A PRD is a product requirements doc, or really any kind of spec doc. We believe that the actual work of building great products in this domain, and probably many other domains, happens through frequent prototyping and iteration. Spec docs can be helpful, but prototypes really make the work tangible and easier to grok, and the quicker we can build them, the quicker we can iterate and learn. So we've invested a ton in building out our own AI prototyping stack to iterate on prompts, on all aspects of the algorithm, and on the UI.
I wanted to share an example to make this come to life a little bit. Let's say we wanted to build out a specific workflow to help our customers draft a specific type of document, say a client alert. In this case, lawyers would provide the initial context: what is this document, what is it even used for, when does it typically come up in a lawyer's day-to-day work, and what else is important to know about it? Then lawyers collaborate with engineers and product to build out the algorithm and the eval data set. Engineers build a prototype, and then we typically go through many iterations where we look at the outputs, decide whether we like them, and continue to iterate until it looks good to us as a team of experts. In parallel, we build out a final version that's embedded in our actual product, where we can iterate on the UI as well. This has really worked well for us: we've built dozens of workflows this way, and it's one of the things that really stands out in terms of how we build product.
So we think about eval in three ways, and Harrison actually alluded to some of these as well. For us, the most important way, by far, is still: how can we efficiently collect human preference judgments? I already talked about how prevalent nuance and complexity are in this domain, and so human preference judgments and human evals remain our highest-quality signal. A lot of what we spend our time on here is how to improve throughput and how to streamline the operations of collecting this data, so that we can run more of these evals, more quickly, at lower cost.

Second, how can we build model-based auto-evaluations, or LLM-as-a-judge, that approximate the quality of human review? And then, for a lot of our complex multi-step workflows and agents, how can we break the problem down into steps so that we can evaluate each step on its own?
Okay, let's talk a little bit about human preference ratings, or human eval. One classic tool we use here is the side-by-side. Basically, we curate a standardized query data set of common questions our customers might ask, or common things that come up in a workflow, and then we ask human raters to evaluate two responses to the same query. In this instance, the query is: write an outline of all hearsay exemptions based on the Federal Rules of Evidence, and so on. Two different versions of a model then generate two separate responses, and we put them in front of raters and ask them to evaluate both. We'll typically ask: which of these do you prefer, relatively speaking? And on a scale of one to seven, one being very bad and seven being very good, how would you rate each response? We also collect any qualitative feedback they may have. We then use this to make launch decisions: whether to ship a new model, a new prompt, or a new algorithm. We've invested quite a bit of time in our own toolchain for this, and that's really allowed us to scale these kinds of evals over the course of the last few years; we use them routinely for many launches.
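To make the mechanics concrete, here is a minimal sketch of how side-by-side judgments on the one-to-seven scale might be collected and aggregated into launch-decision numbers. The `Judgment` record and `summarize` helper are hypothetical illustrations, not Harvey's actual toolchain.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Judgment:
    """One rater's verdict on a single side-by-side comparison."""
    query_id: str
    preferred: str   # "A" (baseline), "B" (candidate), or "tie"
    score_a: int     # 1-7 rating for response A, 1 = very bad, 7 = very good
    score_b: int     # 1-7 rating for response B
    notes: str = ""  # free-form qualitative feedback from the rater

def summarize(judgments: list[Judgment]) -> dict:
    """Aggregate raw judgments into the numbers a launch decision needs."""
    n = len(judgments)
    return {
        "n": n,
        "candidate_win_rate": sum(j.preferred == "B" for j in judgments) / n,
        "tie_rate": sum(j.preferred == "tie" for j in judgments) / n,
        "mean_score_baseline": mean(j.score_a for j in judgments),
        "mean_score_candidate": mean(j.score_b for j in judgments),
    }

judgments = [
    Judgment("q1", "B", 4, 6, "B cites the clause directly"),
    Judgment("q2", "B", 5, 6),
    Judgment("q3", "tie", 5, 5),
]
print(summarize(judgments))  # candidate wins 2/3, mean 5.67 vs 4.67
```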
Okay, but of course human eval is very time-consuming and expensive, especially since we're leveraging domain experts, trained attorneys, to answer most of these questions. So we want to leverage automated and model-driven evals wherever possible. However, there are a number of challenges there when it comes to real-world complexity; I think Harrison actually just talked about this as well. Here's an example from one of the academic benchmarks out there for legal questions, called LegalBench. You'll see that the question here is fairly simple, in that it calls for a simple yes/no answer; it simply has a question at the end asking "is there hearsay?", and there's no reference to any material outside the question itself. That's really quite simplistic, and most real-world legal work looks nothing like it.
So we actually built our own eval benchmark, called BigLaw Bench, which contains complex, open-ended tasks with subjective answers that match much more closely how lawyers do work in the real world. An example question here would be: analyze these trial documents and draft an analysis of conflicts, gaps, contradictions, and so on. The output is probably paragraphs of text.
So how do we get an LLM to evaluate these automatically? Well, we have to come up with a rubric and break it down into a few different categories. This is an example of what the rubric for a single question in BigLaw Bench might look like. We might look at structure: for example, is the response formatted as a table with columns X, Y, and Z? We might evaluate style: does the response emphasize actionable advice? We ask about substance: does the response state certain facts? In this particular case the question pertains to a document, so does the response actually mention certain facts from that document? And finally, does the response contain hallucinations or misconstrued information? Importantly, all of the exact evaluation criteria here were crafted by our in-house domain experts, the lawyers I just mentioned, and they're distinct for each QA pair. So there's a lot of work that goes into crafting these evals and their rubrics.
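As an illustration of how such a rubric can drive an LLM-as-a-judge, here is a hedged sketch. The `complete` callable stands in for whatever LLM client you use, and the specific criteria are invented examples in the spirit of the categories above, not actual BigLaw Bench rubric items.

```python
# Each rubric item is a yes/no criterion the judge model answers independently.
# The four categories mirror the talk: structure, style, substance, hallucination.
RUBRIC = {
    "structure": "Is the response formatted as a table with columns X, Y, and Z?",
    "style": "Does the response emphasize actionable advice?",
    "substance": "Does the response mention the facts stated in the source document?",
    "hallucination": "Is the response free of made-up or misconstrued statements?",
}

JUDGE_PROMPT = """You are grading an AI answer to a legal task.
Criterion: {criterion}

Answer to grade:
{answer}

Reply with exactly one word: PASS or FAIL."""

def grade(answer: str, rubric: dict[str, str], complete) -> dict[str, bool]:
    """Ask the judge model one criterion at a time; return pass/fail per category."""
    return {
        category: complete(JUDGE_PROMPT.format(criterion=criterion, answer=answer))
                  .strip().upper().startswith("PASS")
        for category, criterion in rubric.items()
    }

# Score a response as the fraction of rubric items passed:
# score = sum(grade(answer, RUBRIC, complete).values()) / len(RUBRIC)
```

Asking one criterion per call keeps each judgment simple and auditable; batching all criteria into one prompt is cheaper but tends to be less reliable.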
Okay, the last eval principle: breaking the problem down. Workflows and agents are really multi-step processes, and breaking the problem down into components enables us to evaluate each of these steps separately, which really helps make the problem more tractable. One canonical example of this is RAG. Typical steps for RAG may include: first, you rewrite the query; then you find the matching chunks and documents using a search or retrieval system; then you generate the answer from the sources; and last, you may create citations to ground the answer in facts. Each of these can be evaluated as its own step, and the same idea applies to complex workflows, citations, and so on. The more we can break problems down this way, the more tractable evaluation becomes.
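To illustrate, here is a minimal sketch of scoring a RAG pipeline step by step rather than only end to end. The metric choices (keyword preservation for the rewrite, recall for retrieval, grounding for citations) are my assumptions for the sake of the example, not necessarily what Harvey uses.

```python
# Each stage gets its own gold labels and its own metric, so a quality drop
# can be pinpointed to rewriting, retrieval, generation, or citation.

def eval_query_rewrite(rewritten: str, gold_keywords: list[str]) -> float:
    """Fraction of must-keep keywords the rewritten query preserves."""
    text = rewritten.lower()
    return sum(kw.lower() in text for kw in gold_keywords) / len(gold_keywords)

def eval_retrieval(retrieved_ids: list[str], gold_ids: set[str]) -> float:
    """Recall: how many of the known-relevant chunks did retrieval fetch?"""
    return len(set(retrieved_ids) & gold_ids) / len(gold_ids)

def eval_citations(cited_ids: list[str], retrieved_ids: list[str]) -> float:
    """Grounding: every citation should point at a source that was retrieved."""
    if not cited_ids:
        return 0.0
    return sum(c in retrieved_ids for c in cited_ids) / len(cited_ids)

# Answer quality itself goes to a human rater or the LLM judge sketched above;
# the point is that each step passes or fails on its own.
retrieved = ["doc1#p3", "doc2#p1", "doc9#p4"]
print(eval_retrieval(retrieved, gold_ids={"doc1#p3", "doc2#p1", "doc5#p2"}))  # ~0.67
```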
To put this all together, I wanted to give an example of a recent launch: GPT-4.1. We were fortunate to get an early look at the model before it went GA, and so we first ran BigLaw Bench to get a rough idea of its quality. You can see on the chart here, on the far left, that GPT-4.1 in the context of Harvey's AI systems performed better than the other foundation models, so we felt the results were pretty promising, and we moved on to human rater evaluations to further assess quality. In this chart you can see the performance of our baseline system and of the new system using 4.1 on the set of human rater evals I was just talking about. Again, we ask raters to evaluate the answer to a given question on a scale from one to seven, one being very bad and seven being very good, and you can see that the distribution for the new system skews much more to the right. So clearly the results here look much more promising and much higher quality.
This looked great, and we could have just launched at this point. But in addition, we ran a lot of tests on more product-specific data sets to really help us understand where the model worked well and where it had shortcomings, and we also ran a bunch of internal dogfooding to collect qualitative feedback from our in-house teams. This actually helped us catch a few regressions. For example, 4.1 was much more likely to start every response with the word "Certainly!", which is not really what we were going for and is kind of off-brand for us. So we first had to address those issues before we could roll the model out to customers.
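The "Certainly!" regression is also a good example of how cheap an automated check becomes once dogfooding has surfaced the failure mode. A tiny sketch, with an invented prefix list and sample data:

```python
# Flag responses that open with boilerplate pleasantries, the kind of tone
# regression dogfooding caught with 4.1. The prefix list is illustrative.
OFF_BRAND_PREFIXES = ("certainly", "sure!", "great question")

def tone_regressions(responses: list[str]) -> list[str]:
    return [r for r in responses if r.strip().lower().startswith(OFF_BRAND_PREFIXES)]

batch = [
    "Certainly! Here is the summary you asked for...",
    "The indemnification clause caps liability at the purchase price.",
]
print(f"{len(tone_regressions(batch))}/{len(batch)} responses start off-brand")  # 1/2
```

Once a check like this exists, it can run on every candidate model or prompt change essentially for free.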
All right, on to the learnings and hot takes. The first learning really is: sharpen your axe. At the end of the day, a lot of evaluation is, in my mind, really an engineering problem. The more time we invest in building out strong tooling, great processes, and documentation, the more quickly it pays back; in our case, I'd say it paid back tenfold. It became much easier to run evals, which meant that more teams started using them, and used them more often, and as such our iteration speed and our product quality really improved, as did our confidence in that quality, which meant we could ship to customers more quickly. I didn't mention this earlier, but we leverage LangSmith extensively for a subset of our evals, especially a lot of the routine evals we run when we break tasks down into steps, but we've also built some of our own tools for the more human-rater-focused evaluations. So I would say: don't be afraid to mix and match, and find what works best for you.
Learning number two is kind of the flip side of this: evals matter, but taste really does too. Obviously, having rigorous and repeatable evaluations is critical; we wouldn't be able to make product progress without them. But human judgment, qualitative feedback, and taste really matter too. We learn a ton from the qualitative feedback we get from our raters, from our internal dogfooding, and from our customers, and we constantly make improvements to the product that don't really move eval metrics in any meaningful way but clearly make the product better, for example by making it faster, more consistent, or easier to use.
And my last learning, which is maybe a bit more forward-looking and a bit of a hot take: since we're here talking about agents, I wanted to talk a little bit about data. The take here is that the most important data doesn't exist yet. One reductive or simplistic take on AI progress over the last decade is that we've made a ton of progress by just taking more and more publicly available data and training larger and larger models. That has of course been very successful; it's led to the amazingly capable foundation models that we all know and love and use every day, and they continue to improve. But I would argue that to build domain-specific agentic workflows for real-world tasks, we actually need more process data: the kind of data that shows you how work gets done inside firms today.

Think about an M&A transaction, a merger between two firms. This is typically many months, sometimes years, of work, and it's typically broken down into hundreds of subtasks or projects, and there's usually no written playbook for all of it. It's not summarized neatly in a single spreadsheet. It's often captured in hallway conversations, or maybe in handwritten notes in the margins of a document, that say "this is how we do this here." If we can extract that kind of process data and apply it to models, I think it has the potential to lead to the next breakthroughs in building agentic systems. This is something I'm really excited about and that I'm looking forward to spending more time on over the next few years.
And with that, thank you. It was a real pleasure speaking here today, and enjoy the rest of the conference!