Building Reliable Agents: Raising the Bar

The last time I was here I was actually in the crowd at a music show, so it's pretty wild for me to be up here on stage talking to you all. I'm super excited. My name is Ben Liebald, I lead engineering at Harvey, and today I'd like to talk to you about how we build and evaluate legal AI.

This is the outline of the talk. There are five parts to it: a little bit about Harvey, for those of you who are not familiar with the product or the company; quality in legal and why it's difficult; how we build products; how we evaluate them; and some learnings and hot takes. I was told they had to be hot takes. All right, let's dive in.
Harvey is domain-specific AI for legal and professional services. We offer a suite of products, from a general-purpose assistant for drafting and summarizing documents, to tools for large-scale document extraction, to many domain-specific agents and workflows. The vision we have for the product is twofold: we want you to do all of your work in Harvey, and we want Harvey to be available wherever you do your work. "You" here being lawyers, legal professionals, and professional service providers.

As an example, you can use Harvey to summarize documents or draft new ones. Harvey can leverage firm-specific information, such as firm-internal knowledge bases or templates, to customize the output.
We also offer tools for large-scale document analysis, which is a really important use case in legal. Think about due diligence or legal discovery tasks, where you're typically dealing with thousands of contracts, documents, or emails that need to be analyzed, which is typically done manually and is really, really tedious. Harvey can analyze hundreds of thousands of documents at once and output the results to a table or summarize them. This literally saves hours, sometimes weeks, of work.

And of course we offer many workflows that enable users to accomplish complex tasks, such as redline analysis, drafting certain types of documents, and more, and customers can tailor these workflows to their own needs. We're at an agents conference, so naturally I want to talk a little bit about the agentic capabilities we've added to the product as well, such as multi-step agentic search, more personalization and memory, and the ability to execute long-running tasks. And we have a lot more cooking there that will be launching soon.
We're trusted by law firms and large enterprises around the world. We have just under 400 customers at this point, on I think all continents, maybe except Antarctica. In the U.S., one third of the largest 100 law firms, and I think eight of the largest ten, use Harvey.
All right, let's talk about quality, and why it's difficult to build and evaluate high-quality products in this domain. This may not come as a surprise to you, but lawyers deal with lots and lots and lots of documents, many of them very complex, often hundreds, sometimes thousands of pages in length. And typically those documents don't exist in a vacuum: they're part of large corpora of case law, legislation, or other case-related documents, and they often contain extensive references to other parts of the same document or to other documents in the same corpus. The documents themselves can be pretty complex, too. It's not at all unheard of to have documents with lots of handwriting, scanned notes, multi-column layouts, multiple mini pages on the same page, embedded tables, and so on. So there's a lot of complexity in the document understanding piece.
The outputs we need to generate are pretty complex too: long text, obviously, complex tables, and sometimes even diagrams or charts for things like reports, not to mention the complex language that legal writing demands. And mistakes can literally be career-impacting, so verification is key. This isn't really just about hallucinations, completely made-up statements, but more about slightly misconstrued or misinterpreted statements that are just not quite factually correct. So Harvey has a citation feature to ground all statements in verifiable sources and to allow our users to verify that the summary provided by the AI is indeed correct and acceptable.
And importantly, quality is a really nuanced and subjective concept in this domain. I don't know if you can read this, and I wouldn't expect you to read all of it, but basically these are two answers to the same question, a document understanding question in this case, asking about a specific clause in a specific contract. I think it's called a materiality scrape and indemnification; don't ask me what exactly that means. The point I'm trying to get across is that they look similar, and they're actually both factually correct: neither of them has any hallucinations, take my word for it. But answer two was strongly preferred by our in-house lawyers when they looked at both answers, and the reason is that there's additional nuance in the write-up and more detail in some of the definitions, which they really appreciated. So the point is, it's really difficult to automatically assess which of these is better, or what quality even means.
And last but not least, our customers' work is obviously very sensitive in nature, so obtaining reliable data sets, product feedback, or even bug reports can be pretty challenging for us. All of that combined makes it really challenging to build high-quality products in legal AI.
So how do we do it? Before evaluation, I wanted to briefly touch on how we build products. We believe, and I think Harrison actually just talked about this, that the best evals are tightly integrated into the product development process, and that the best teams approach evals holistically with the rest of product development. So here are some product development principles that are important to us.

First off, we're an applied AI company. What this really means is that we need to combine state-of-the-art AI with best-in-class UI. It's really not just about having the best AI, but about having the best AI packaged up in such a way that it meets our customers where they are and helps them solve their real-world problems.
The second principle, and this is something we've talked a lot about and that's very key to the way we operate, is lawyer-in-the-loop. We really include lawyers at every stage of the product development process. As I mentioned before, there's an incredible amount of complexity and nuance in legal, so their domain expertise and their user empathy are critical in helping us create products. Lawyers work side by side with engineers, designers, product managers, and so on, on all aspects of building the product, from identifying use cases, to data set collection, to eval rubric creation, to UI iteration and end-to-end testing. They're truly embedded. Lawyers also play a really important part in our go-to-market strategy: they're involved in demoing to customers, collecting customer feedback, and translating it back to our product development teams.
And third: prototype over PRD. A PRD is a product requirements doc, or really any kind of spec doc. We believe that the actual work of building great products in this domain, and probably many other domains, happens through frequent prototyping and iteration. Spec docs can be helpful, but prototypes really make the work tangible and easier to grok, and the quicker we can build them, the quicker we can iterate and learn. So we've invested a ton in building out our own AI prototyping stack to iterate on prompts, on all aspects of the algorithm, and on the UI.
I wanted to share an example to make this come to life a little bit. Let's say we wanted to build out a specific workflow to help our customers draft a specific type of document, say a client alert. In this case, lawyers would provide the initial context: what is this document, what is it even used for, when does it typically come up in a lawyer's day-to-day work, and what else is important to know about it? Then lawyers collaborate with engineers and product to build out the algorithm and the eval data set. Engineers build a prototype, and then we typically go through many iterations where we look at the outputs, decide whether we like them, and continue to iterate until it looks good to us as a team of experts. In parallel, we build out a final version that's embedded in our actual product, where we can iterate on the UI as well. This has really worked well for us: we've built dozens of workflows this way, and it's one of the things that really stands out in terms of how we build product.
So we think about eval in three ways, and Harrison actually alluded to some of these as well. For us, the most important way, by far, is still: how can we efficiently collect human preference judgments? I already talked about how prevalent nuance and complexity are in this domain, and so human preference judgments and human evals remain our highest-quality signal. A lot of what we spend our time on here is how to improve throughput and how to streamline the operations of collecting this data, so that we can run more of these evals, more quickly, at lower cost.

Second, how can we build model-based auto-evaluations, or LLM-as-a-judge, that approximate the quality of human review? And then, for a lot of our complex multi-step workflows and agents, how can we break the problem down into steps so that we can evaluate each step on its own?
Okay, let's talk a little bit about human preference ratings, or human eval. One classic tool we use here is the side-by-side. Basically, we curate a standardized query data set of common questions our customers might ask, or common things that come up in a workflow, and then we ask human raters to evaluate two responses to the same query. In this instance, the query is: write an outline of all hearsay exemptions based on the Federal Rules of Evidence, and so on. Two different versions of a model then generate two separate responses, and we put them in front of raters and ask them to evaluate both. We'll typically ask: which of these do you prefer, relatively speaking? And on a scale of one to seven, one being very bad and seven being very good, how would you rate each response? We also collect any qualitative feedback they may have. We then use this to make launch decisions: whether to ship a new model, a new prompt, or a new algorithm. We've invested quite a bit of time in our own toolchain for this, and that's really allowed us to scale these kinds of evals over the course of the last few years; we use them routinely for many launches.
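To make the mechanics concrete, here is a minimal sketch of how side-by-side judgments on the one-to-seven scale might be collected and aggregated into launch-decision numbers. The `Judgment` record and `summarize` helper are hypothetical illustrations, not Harvey's actual toolchain.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Judgment:
    """One rater's verdict on a single side-by-side comparison."""
    query_id: str
    preferred: str   # "A" (baseline), "B" (candidate), or "tie"
    score_a: int     # 1-7 rating for response A, 1 = very bad, 7 = very good
    score_b: int     # 1-7 rating for response B
    notes: str = ""  # free-form qualitative feedback from the rater

def summarize(judgments: list[Judgment]) -> dict:
    """Aggregate raw judgments into the numbers a launch decision needs."""
    n = len(judgments)
    return {
        "n": n,
        "candidate_win_rate": sum(j.preferred == "B" for j in judgments) / n,
        "tie_rate": sum(j.preferred == "tie" for j in judgments) / n,
        "mean_score_baseline": mean(j.score_a for j in judgments),
        "mean_score_candidate": mean(j.score_b for j in judgments),
    }

judgments = [
    Judgment("q1", "B", 4, 6, "B cites the clause directly"),
    Judgment("q2", "B", 5, 6),
    Judgment("q3", "tie", 5, 5),
]
print(summarize(judgments))  # candidate wins 2/3, mean 5.67 vs 4.67
```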
Okay, but of course human eval is very time-consuming and expensive, especially since we're leveraging domain experts, trained attorneys, to answer most of these questions. So we want to leverage automated and model-driven evals wherever possible. However, there are a number of challenges there when it comes to real-world complexity; I think Harrison actually just talked about this as well. Here's an example from one of the academic benchmarks out there for legal questions, called LegalBench. You'll see that the question here is fairly simple, in that it calls for a simple yes/no answer; it simply has a question at the end asking "is there hearsay?", and there's no reference to any material outside the question itself. That's really quite simplistic, and most real-world legal work looks nothing like it.
So we actually built our own eval benchmark, called BigLaw Bench, which contains complex, open-ended tasks with subjective answers that match much more closely how lawyers do work in the real world. An example question here would be: analyze these trial documents and draft an analysis of conflicts, gaps, contradictions, and so on. The output is probably paragraphs of text.
So how do we get an LLM to evaluate these automatically? Well, we have to come up with a rubric and break it down into a few different categories. This is an example of what the rubric for a single question in BigLaw Bench might look like. We might look at structure: for example, is the response formatted as a table with columns X, Y, and Z? We might evaluate style: does the response emphasize actionable advice? We ask about substance: does the response state certain facts? In this particular case the question pertains to a document, so does the response actually mention certain facts from that document? And finally, does the response contain hallucinations or misconstrued information? Importantly, all of the exact evaluation criteria here were crafted by our in-house domain experts, the lawyers I just mentioned, and they're distinct for each QA pair. So there's a lot of work that goes into crafting these evals and their rubrics.
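As an illustration of how such a rubric can drive an LLM-as-a-judge, here is a hedged sketch. The `complete` callable stands in for whatever LLM client you use, and the specific criteria are invented examples in the spirit of the categories above, not actual BigLaw Bench rubric items.

```python
# Each rubric item is a yes/no criterion the judge model answers independently.
# The four categories mirror the talk: structure, style, substance, hallucination.
RUBRIC = {
    "structure": "Is the response formatted as a table with columns X, Y, and Z?",
    "style": "Does the response emphasize actionable advice?",
    "substance": "Does the response mention the facts stated in the source document?",
    "hallucination": "Is the response free of made-up or misconstrued statements?",
}

JUDGE_PROMPT = """You are grading an AI answer to a legal task.
Criterion: {criterion}

Answer to grade:
{answer}

Reply with exactly one word: PASS or FAIL."""

def grade(answer: str, rubric: dict[str, str], complete) -> dict[str, bool]:
    """Ask the judge model one criterion at a time; return pass/fail per category."""
    return {
        category: complete(JUDGE_PROMPT.format(criterion=criterion, answer=answer))
                  .strip().upper().startswith("PASS")
        for category, criterion in rubric.items()
    }

# Score a response as the fraction of rubric items passed:
# score = sum(grade(answer, RUBRIC, complete).values()) / len(RUBRIC)
```

Asking one criterion per call keeps each judgment simple and auditable; batching all criteria into one prompt is cheaper but tends to be less reliable.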
Okay, the last eval principle: breaking the problem down. Workflows and agents are really multi-step processes, and breaking the problem down into components enables us to evaluate each of these steps separately, which really helps make the problem more tractable. One canonical example of this is RAG. Typical steps for RAG may include: first, you rewrite the query; then you find the matching chunks and documents using a search or retrieval system; then you generate the answer from the sources; and last, you may create citations to ground the answer in facts. Each of these can be evaluated as its own step, and the same idea applies to complex workflows, citations, and so on. The more we can break problems down this way, the more tractable evaluation becomes.
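To illustrate, here is a minimal sketch of scoring a RAG pipeline step by step rather than only end to end. The metric choices (keyword preservation for the rewrite, recall for retrieval, grounding for citations) are my assumptions for the sake of the example, not necessarily what Harvey uses.

```python
# Each stage gets its own gold labels and its own metric, so a quality drop
# can be pinpointed to rewriting, retrieval, generation, or citation.

def eval_query_rewrite(rewritten: str, gold_keywords: list[str]) -> float:
    """Fraction of must-keep keywords the rewritten query preserves."""
    text = rewritten.lower()
    return sum(kw.lower() in text for kw in gold_keywords) / len(gold_keywords)

def eval_retrieval(retrieved_ids: list[str], gold_ids: set[str]) -> float:
    """Recall: how many of the known-relevant chunks did retrieval fetch?"""
    return len(set(retrieved_ids) & gold_ids) / len(gold_ids)

def eval_citations(cited_ids: list[str], retrieved_ids: list[str]) -> float:
    """Grounding: every citation should point at a source that was retrieved."""
    if not cited_ids:
        return 0.0
    return sum(c in retrieved_ids for c in cited_ids) / len(cited_ids)

# Answer quality itself goes to a human rater or the LLM judge sketched above;
# the point is that each step passes or fails on its own.
retrieved = ["doc1#p3", "doc2#p1", "doc9#p4"]
print(eval_retrieval(retrieved, gold_ids={"doc1#p3", "doc2#p1", "doc5#p2"}))  # ~0.67
```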
To put this all together, I wanted to give an example of a recent launch: GPT-4.1. We were fortunate to get an early look at the model before it went GA, and so we first ran BigLaw Bench to get a rough idea of its quality. You can see on the chart here, on the far left, that GPT-4.1 in the context of Harvey's AI systems performed better than the other foundation models, so we felt the results were pretty promising, and we moved on to human rater evaluations to further assess quality. In this chart you can see the performance of our baseline system and of the new system using 4.1 on the set of human rater evals I was just talking about. Again, we ask raters to evaluate the answer to a given question on a scale from one to seven, one being very bad and seven being very good, and you can see that the distribution for the new system skews much more to the right. So clearly the results here look much more promising and much higher quality.
This looked great, and we could have just launched at this point. But in addition, we ran a lot of tests on more product-specific data sets to really help us understand where the model worked well and where it had shortcomings, and we also ran a bunch of internal dogfooding to collect qualitative feedback from our in-house teams. This actually helped us catch a few regressions. For example, 4.1 was much more likely to start every response with the word "Certainly!", which is not really what we were going for and is kind of off-brand for us. So we first had to address those issues before we could roll the model out to customers.
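The "Certainly!" regression is also a good example of how cheap an automated check becomes once dogfooding has surfaced the failure mode. A tiny sketch, with an invented prefix list and sample data:

```python
# Flag responses that open with boilerplate pleasantries, the kind of tone
# regression dogfooding caught with 4.1. The prefix list is illustrative.
OFF_BRAND_PREFIXES = ("certainly", "sure!", "great question")

def tone_regressions(responses: list[str]) -> list[str]:
    return [r for r in responses if r.strip().lower().startswith(OFF_BRAND_PREFIXES)]

batch = [
    "Certainly! Here is the summary you asked for...",
    "The indemnification clause caps liability at the purchase price.",
]
print(f"{len(tone_regressions(batch))}/{len(batch)} responses start off-brand")  # 1/2
```

Once a check like this exists, it can run on every candidate model or prompt change essentially for free.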
All right, on to the learnings and hot takes. The first learning really is: sharpen your axe. At the end of the day, a lot of evaluation is, in my mind, really an engineering problem. The more time we invest in building out strong tooling, great processes, and documentation, the more quickly it pays back; in our case, I'd say it paid back tenfold. It became much easier to run evals, which meant that more teams started using them, and used them more often, and as such our iteration speed and our product quality really improved, as did our confidence in that quality, which meant we could ship to customers more quickly. I didn't mention this earlier, but we leverage LangSmith extensively for a subset of our evals, especially a lot of the routine evals we run when we break tasks down into steps, but we've also built some of our own tools for the more human-rater-focused evaluations. So I would say: don't be afraid to mix and match, and find what works best for you.
Learning number two is kind of the flip side of this: evals matter, but taste really does too. Obviously, having rigorous and repeatable evaluations is critical; we wouldn't be able to make product progress without them. But human judgment, qualitative feedback, and taste really matter too. We learn a ton from the qualitative feedback we get from our raters, from our internal dogfooding, and from our customers, and we constantly make improvements to the product that don't really move eval metrics in any meaningful way but clearly make the product better, for example by making it faster, more consistent, or easier to use.
And my last learning, which is maybe a bit more forward-looking and a bit of a hot take: since we're here talking about agents, I wanted to talk a little bit about data. The take here is that the most important data doesn't exist yet. One reductive or simplistic take on AI progress over the last decade is that we've made a ton of progress by just taking more and more publicly available data and training larger and larger models. That has of course been very successful; it's led to the amazingly capable foundation models that we all know and love and use every day, and they continue to improve. But I would argue that to build domain-specific agentic workflows for real-world tasks, we actually need more process data: the kind of data that shows you how work gets done inside firms today.

Think about an M&A transaction, a merger between two firms. This is typically many months, sometimes years, of work, and it's typically broken down into hundreds of subtasks or projects, and there's usually no written playbook for all of it. It's not summarized neatly in a single spreadsheet. It's often captured in hallway conversations, or maybe in handwritten notes in the margins of a document, that say "this is how we do this here." If we can extract that kind of process data and apply it to models, I think it has the potential to lead to the next breakthroughs in building agentic systems. This is something I'm really excited about and that I'm looking forward to spending more time on over the next few years.
And with that, thank you. It was a real pleasure speaking here today, and enjoy the rest of the conference!