
LangChain Interrupt 2025 Harvey Raising the Bar – Ben Liebald


Whisper Transcript

00:00:00.000 | All right, how are you all doing?
00:00:02.000 | How are you all doing?
00:00:04.000 | The last time I was in a crowd this big, I want to say it was actually
00:00:08.000 | at a music show.
00:00:10.000 | So it's a pretty big blast for me to be up here on stage talking to you all.
00:00:12.000 | I'm super excited.
00:00:14.000 | Anyway, my name is Ben Liebald.
00:00:16.000 | I lead engineering at Harvey.
00:00:18.000 | And today I'd like to talk to you about how we build and evaluate Harvey.
00:00:22.000 | So here's a rough outline of the talk.
00:00:24.000 | There are five parts. First, I'll talk a little bit about Harvey
00:00:26.000 | for those of you who are not familiar with the product or the company.
00:00:29.000 | Then I'll talk about quality and legal and why it's difficult.
00:00:33.000 | How we build and evaluate products.
00:00:35.000 | And some learnings at the end.
00:00:37.000 | I was told I should end on learnings.
00:00:39.000 | All right, let's dive in.
00:00:43.000 | So Harvey is a domain-specific AI for legal and professional services.
00:00:48.000 | We offer a suite of products from a general-purpose system
00:00:51.000 | for drafting and summarizing docs,
00:00:53.000 | to tools for large-scale document extraction,
00:00:55.000 | to many domain-specific agents and workflows.
00:00:58.000 | And the vision we have for the product is twofold.
00:01:02.000 | We want you to do all of your work in Harvey.
00:01:04.000 | And we want Harvey to be available wherever you do your work.
00:01:08.000 | "You" being lawyers, legal professionals, and professional services.
00:01:12.000 | So as an example, you can use Harvey to summarize documents or draft new ones.
00:01:18.000 | Harvey can leverage firm-specific information, such as firm-internal knowledge bases or templates,
00:01:24.000 | to customize the output.
00:01:26.000 | We also offer tools for large-scale document analysis,
00:01:30.000 | which is a really important use case in legal.
00:01:32.000 | Think about a lot of due diligence or legal discovery tasks,
00:01:36.000 | where you're typically dealing with thousands of contracts or documents,
00:01:40.000 | thousands of emails that need to be analyzed,
00:01:42.000 | which typically is done manually and is really, really tedious.
00:01:44.000 | So Harvey can analyze hundreds of thousands of documents at once
00:01:48.000 | and output them to a table or summarize the results.
00:01:51.000 | This literally saves hours, sometimes weeks of work.
00:01:54.000 | And, of course, we offer many workflows that enable users to accomplish complex tasks,
00:02:00.000 | such as redline analysis, drafting certain types of documents, and more.
00:02:04.000 | And we work with customers to tailor workflows to their own needs.
00:02:08.000 | We're at an agent conference, so naturally we want to talk a little bit about the agentic capabilities
00:02:14.000 | we've added to the product as well, such as multi-step agentic search,
00:02:18.000 | more personalization and memory, and the ability to execute long-running tasks.
00:02:22.000 | And we have a lot more cooking there that we'll be announcing soon.
00:02:26.000 | We're trusted by law firms and large enterprises around the world.
00:02:32.000 | We have just under 400 customers on, I think, all continents, maybe except Antarctica.
00:02:38.000 | And in the U.S., one-third of the largest 100 law firms,
00:02:41.000 | and I think eight of the largest 10 law firms, use Harvey.
00:02:45.000 | Alright, let's talk about quality and why it's difficult to build and evaluate high-quality products in this domain.
00:02:53.000 | So this may not come as a surprise to you, but lawyers deal with lots and lots and lots of documents,
00:02:59.000 | many of them very complex, hundreds, sometimes thousands of pages in length.
00:03:04.000 | And typically those documents don't exist in a vacuum.
00:03:06.000 | They're part of a large corpus of case law, legislation, or other case-related documents.
00:03:12.000 | Often those documents contain extensive references to other parts of the document, or other documents in the same corpus.
00:03:20.000 | And the documents themselves can be pretty complex.
00:03:23.000 | It's not at all unheard of to have documents with lots of handwriting, scanned notes, multi-column,
00:03:29.000 | multiple mini pages on the same page, embedded tables, et cetera, et cetera.
00:03:34.000 | So there's lots of complexity in the documents we're dealing with.
00:03:37.000 | The outputs we need to generate are pretty complex, too.
00:03:42.000 | Long text, obviously, complex tables, and sometimes even diagrams and charts for things like reports.
00:03:48.000 | Not to mention the complex language that legal professionals are used to.
00:03:53.000 | And mistakes can literally be career-impacting, so verification is key.
00:04:01.000 | And this isn't really just about hallucinations, completely made-up statements,
00:04:05.000 | but really more about slightly misconstrued or misinterpreted statements that are just not quite factually correct.
00:04:11.000 | So we've built a citation feature to ground all statements in sources and verify them,
00:04:16.000 | and to allow our users to verify that the summary provided by the AI is indeed correct and acceptable.
00:04:23.000 | And importantly, quality is a really nuanced and subjective concept in this domain.
00:04:29.000 | I don't know if you can read this.
00:04:31.000 | I wouldn't expect you to read all of it, but basically this is two answers to the same question.
00:04:36.000 | a document understanding question in this case, asking about a specific clause in a specific contract.
00:04:41.000 | I think it's about materiality scrape and indemnification.
00:04:45.000 | You don't need to know what exactly that means.
00:04:47.000 | But the point I'm trying to get across is they look similar.
00:04:51.000 | They're actually both factually correct.
00:04:53.000 | Neither of them have any hallucinations.
00:04:55.000 | Take my word for it.
00:04:57.000 | But answer two is actually strongly preferred by our in-house lawyers when they looked at both of these answers.
00:05:03.000 | And the reason is that there's additional nuance in the write-up and more details in some of the definitions that they really appreciated.
00:05:09.000 | So the point is, it's really difficult to assess automatically which of these is better, or what quality even means here.
00:05:18.000 | And then last but not least, obviously our customers' work is very sensitive in nature.
00:05:24.000 | Obtaining reliable data sets, product feedback, even market reports can be pretty challenging for us.
00:05:30.000 | And so all of that combined makes it really challenging to build high-quality products in the legal area.
00:05:37.000 | So how do we do it?
00:05:41.000 | Before evaluation, I wanted to briefly touch on how we build legal products.
00:05:45.000 | We believe, and I think Harrison actually just talked about this, that the best evals are tightly integrated into the product development process.
00:05:52.000 | And the best teams approach eval holistically with the rest of the product development.
00:05:57.000 | So here are some product development principles that are important to us.
00:06:00.000 | First off, we're an applied AI company.
00:06:05.000 | So what this really means is that we need to combine state-of-the-art AI with best-in-class UI.
00:06:12.000 | It's really not just about having the best AI, but really about having the best AI that's packaged up in such a way that it meets our customers where they are,
00:06:20.000 | and helps them solve their legal problems.
00:06:25.000 | The second principle, and this is something that we've talked about a lot and that's very, very key to the way that we operate,
00:06:30.000 | is lawyer in the loop.
00:06:32.000 | We really include lawyers at every stage of the product development process.
00:06:36.000 | As I mentioned before, there's an incredible amount of complexity and nuance in legal,
00:06:41.000 | and so their domain expertise and their user empathy is really critical in helping us build a great product.
00:06:50.000 | So lawyers work side by side with engineers, designers, product managers, and so on, on all aspects of building the product,
00:06:57.000 | from identifying use cases, to dataset collection, to eval rubric creation, to UI iteration, and end-to-end testing.
00:07:04.000 | They're truly embedded.
00:07:06.000 | Lawyers are also a really important part of our go-to-market strategy.
00:07:10.000 | They're involved in demoing to customers, collecting customer feedback, and translating that back to our product development teams as well.
00:07:19.000 | And then third, prototype over PRD.
00:07:21.000 | PRD is a product requirement doc or any kind of spec document.
00:07:25.000 | We really believe that the actual work of building great products in this domain and probably many other domains
00:07:30.000 | happens through frequent prototyping and iteration.
00:07:33.000 | Spec docs can be helpful, but prototypes really make the work tangible and easier to grasp.
00:07:39.000 | And the quicker we can build these, the quicker we can iterate and learn.
00:07:43.000 | So we've invested a ton in building out our own AI prototyping stack to iterate on prompts,
00:07:48.000 | all aspects of the algorithm, as well as the UI.
00:07:54.000 | So I wanted to share an example to make this come to life a little bit.
00:07:57.000 | Let's say we wanted to build out a specific tool to help our customers draft a specific type of document, say, a client alert.
00:08:06.000 | Now in this case, lawyers would provide the initial context.
00:08:09.000 | What is this document?
00:08:10.000 | What is it being used for?
00:08:11.000 | When does it typically come up in a typical lawyer's day-to-day work?
00:08:15.000 | And what else is important to know about it?
00:08:19.000 | Then lawyers would collaborate with engineers and product to build out the algorithm and the eval dataset, and engineers would build the prototype.
00:08:26.000 | And then we typically go through many iterations of this, where we look at initial outputs, ask whether we like them, and continue to iterate until it looks good to us as human experts.
00:08:36.000 | In parallel, we build out a more polished version that's embedded in the actual product, where we iterate on the UI and the product experience.
00:08:44.000 | This has really worked well for us.
00:08:46.000 | We've built dozens of workflows this way, and it's one of the things that really stands out for us in terms of how we build the product.
00:08:53.000 | Okay, let's talk about evaluation.
00:08:58.000 | So we think about eval in three ways.
00:09:02.000 | And Harrison actually alluded to some of these as well.
00:09:05.000 | But for us, the most important way by far is how can we efficiently collect human preference judgments.
00:09:12.000 | Already talked about how nuance and complexity is really important in this domain or very prevalent in this domain.
00:09:19.000 | And so human preference judgments and human evals remain our highest quality signal.
00:09:25.000 | And so a lot of what we spend our time on and how we think about this here is how can we improve the throughput, how can we improve and streamline operations to collect this data so that we can run more of them.
00:09:34.000 | More quickly, at lower cost, et cetera.
00:09:39.000 | Second, how can we build model-based auto-evaluations, or LLM-as-a-judge, that approximate the quality of human review?
00:09:47.000 | And then, for a lot of our complex multi-step workflows and agents, how can we break the problem down into steps so that we can evaluate each step on its own?
00:10:00.000 | Okay, let's talk a little bit about human preference ratings, or human eval.
00:10:07.000 | So one classic tool that we use here is a classic side-by-side.
00:10:11.000 | This is basically, we curate a standardized query dataset of common questions that our customers might ask or common things that come up in the workflow.
00:10:20.000 | And then we ask raters to evaluate two responses to the same query.
00:10:23.000 | So in this instance, the query asks to outline the exemptions under the rules of evidence, et cetera, et cetera.
00:10:29.000 | And then the model, or two different versions of the model, will generate two separate responses.
00:10:34.000 | And we put this in front of our raters and ask them to evaluate it.
00:10:37.000 | We'll typically ask them, okay, which of these do you prefer, just relatively speaking?
00:10:42.000 | And then on a scale of, say, one to seven, from one being very bad to seven being very good, how would you rate each response?
00:10:49.000 | As well as some qualitative feedback that they may have in addition.
00:10:55.000 | And then we use this to make launch decisions, whether to ship a new model, a new prompt, or a new algorithm.
00:11:01.000 | We've invested quite a bit of time in our own toolchain for this, and that's really allowed us to scale these kinds of evals over the course of the last few years.
00:11:08.000 | We've used them routinely for many different tasks.
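To make the side-by-side setup a bit more concrete, here is a minimal sketch of what a rating record and a launch-decision aggregation over the 1-to-7 scores could look like. The field names, decision rule, and thresholds are illustrative assumptions, not Harvey's actual tooling.

```python
# Hypothetical sketch of a side-by-side human preference eval.
# Field names, decision rule, and thresholds are illustrative only.
from dataclasses import dataclass
from statistics import mean

@dataclass
class SideBySideRating:
    query_id: str
    preferred: str          # "baseline", "candidate", or "tie"
    baseline_score: int     # 1 (very bad) .. 7 (very good)
    candidate_score: int    # 1 (very bad) .. 7 (very good)
    comments: str = ""      # free-form qualitative feedback from the rater

def launch_signal(ratings: list[SideBySideRating]) -> dict:
    """Aggregate rater judgments into a rough go/no-go launch signal."""
    win_rate = mean(r.preferred == "candidate" for r in ratings)
    baseline_avg = mean(r.baseline_score for r in ratings)
    candidate_avg = mean(r.candidate_score for r in ratings)
    return {
        "win_rate": win_rate,
        "baseline_avg": baseline_avg,
        "candidate_avg": candidate_avg,
        # Illustrative rule: the candidate must be preferred more often
        # and not regress on the absolute 1-7 scale.
        "ship": win_rate > 0.5 and candidate_avg >= baseline_avg,
    }
```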
00:11:11.000 | Okay, but of course, human eval is very time-consuming and expensive, especially since we're leveraging domain experts, like trained attorneys, to answer multiple questions.
00:11:23.000 | And so we want to leverage automated and model-driven evals wherever possible.
00:11:28.000 | However, there are really a number of challenges when it comes to capturing real-world complexity in eval data.
00:11:33.000 | I think Harrison actually just talked about this as well.
00:11:36.000 | So here's an example of one of the academic benchmarks out there in the field for legal questions.
00:11:43.000 | It's called LegalBench.
00:11:45.000 | And you'll see that the question here is fairly simple in that it's a simple yes/no answer, or a simple yes/no question at the end saying,
00:11:52.000 | "Is there hearsay?"
00:11:54.000 | And there's no reference to any other material outside of the question itself.
00:11:59.000 | And that's really quite simplistic, and most of the real-world work just doesn't look like that at all.
00:12:07.000 | So we actually built our own eval benchmark called Big Law Bench, which contains complex, open-ended tasks with subjective answers that mirror much more closely how lawyers do work in the real world.
00:12:20.000 | So in this instance, as an example question, it will say: analyze these trial documents and draft an analysis of conflicts, gaps, contradictions, et cetera, et cetera.
00:12:29.000 | And the output here is probably paragraphs of text.
00:12:32.000 | So how would we get an LLM to evaluate these automatically?
00:12:37.000 | Well, we have to come up with a rubric and break it down into a few different categories.
00:12:42.000 | So this is an example rubric for what this single question in Big Law Bench might look like.
00:12:52.000 | We might look at structure.
00:12:53.000 | So for example, is the response formatted as a table with columns X, Y, and Z?
00:12:59.000 | We might evaluate style.
00:13:01.000 | Does the response emphasize actionable advice?
00:13:05.000 | We'll ask about substance.
00:13:07.000 | Does the response state certain facts?
00:13:08.000 | Like, in this particular question, the question pertains to a document, so does the response actually mention certain facts from the document?
00:13:19.000 | And finally, does the response contain hallucinations or misconstrued information?
00:13:24.000 | And importantly, like, all of the exact evaluation criteria here were crafted by in-house domain experts, the lawyers that I just mentioned.
00:13:33.000 | And they're really distinct for each individual task.
00:13:36.000 | So there's a lot of work that goes into crafting these evals and the rubrics.
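As a rough sketch of how a rubric like the one above could be scored automatically, here is what an LLM-as-a-judge call might look like. The rubric wording, judge model, prompt, and binary scoring scheme are all assumptions for illustration, not Harvey's actual implementation.

```python
# Hypothetical LLM-as-a-judge scoring along the four rubric categories
# mentioned above (structure, style, substance, hallucination).
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = {
    "structure": "Is the response formatted as a table with columns X, Y, and Z?",
    "style": "Does the response emphasize actionable advice?",
    "substance": "Does the response state the key facts from the source documents?",
    "hallucination": "Does the response avoid made-up or misconstrued statements?",
}

def judge(task: str, response: str) -> dict:
    """Score one response against each rubric criterion (1 = satisfied, 0 = not)."""
    prompt = (
        "You are grading a legal AI response against a rubric.\n"
        f"Task: {task}\n\nResponse:\n{response}\n\n"
        "For each criterion below, answer 1 if it is satisfied and 0 otherwise. "
        "Return a JSON object with keys: " + ", ".join(RUBRIC) + ".\n"
        + "\n".join(f"- {name}: {question}" for name, question in RUBRIC.items())
    )
    completion = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model, chosen for illustration
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)
```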
00:13:39.000 | Okay, last eval principle.
00:13:44.000 | Breaking the problem down.
00:13:46.000 | So workflows and agents are really long-step processes.
00:13:49.000 | And breaking the problem down into components enables us to evaluate each of these steps separately,
00:13:55.000 | which really helps make the problem more tractable.
00:13:58.000 | So one canonical example for this is RAG.
00:14:02.000 | Let's say for QA over a large corpus.
00:14:06.000 | Typical steps for RAG may include:
00:14:08.000 | First you rewrite the query.
00:14:09.000 | Then you find the matching chunks of docs using a search or retrieval system.
00:14:14.000 | Then you generate the answer from the sources.
00:14:16.000 | And last, you maybe want to create citations to ground the answer in the sources.
00:14:21.000 | Each of these can be evaluated as its own step.
00:14:26.000 | And the same idea applies to complex workflows, agents, et cetera.
00:14:30.000 | And the more we can do this, the more we can leverage automated evals.
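As an illustration of evaluating each RAG step on its own, here is a minimal sketch with simple stand-in metrics: an exact-match check for query rewriting, recall@k for retrieval, and a grounding rate for citations. The function names and metrics are assumptions for illustration; answer generation itself would still go through LLM-as-a-judge or human review.

```python
# Hypothetical per-step evaluation of a RAG pipeline; metrics are stand-ins.

def eval_query_rewrite(rewritten: str, gold_rewrite: str) -> bool:
    # Step 1: did the rewrite match the expected reformulation?
    # (exact match as a simple stand-in for a semantic check)
    return rewritten.strip().lower() == gold_rewrite.strip().lower()

def eval_retrieval(retrieved_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    # Step 2: recall@k -- how many of the relevant chunks were retrieved?
    top_k = retrieved_ids[:k]
    return len(relevant_ids.intersection(top_k)) / max(len(relevant_ids), 1)

def eval_citations(cited_ids: list[str], retrieved_ids: list[str]) -> float:
    # Step 4: fraction of citations that point at actually retrieved sources.
    if not cited_ids:
        return 0.0
    return sum(c in retrieved_ids for c in cited_ids) / len(cited_ids)
```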
00:14:36.000 | So to put this all together, I wanted to give you an example of a recent launch.
00:14:40.000 | In April, OpenAI actually released GPT 4.1.
00:14:46.000 | We were fortunate to get an early look at the model before it came out to GA.
00:14:50.000 | And so we first ran Big Law Bench to get a rough idea of its quality.
00:14:54.000 | You can see on the chart here, it's on the far left, GPT 4.1 in the context of Harvey's AI systems
00:15:00.000 | performed better than other foundation models.
00:15:03.000 | So we felt that the results were pretty promising here.
00:15:06.000 | And so we moved on to human rater evaluations to further assess quality.
00:15:12.000 | In this chart, you can see the performance of our baseline system and the new system using 4.1
00:15:18.000 | on the set of human rater evals that I was just talking about earlier.
00:15:22.000 | So again, we're asking raters to evaluate an answer to a given question on a scale from 1 to 7,
00:15:28.000 | 1 being very bad, 7 being very good.
00:15:30.000 | And you can see that for the new system, the distribution is shifted much more to the right.
00:15:34.000 | So clearly, the results here look much more promising and much higher quality.
00:15:38.000 | So this looked great.
00:15:40.000 | We could have just launched it at this point.
00:15:42.000 | But in addition to that, we ran a lot of additional tests on more product-specific data sets
00:15:47.000 | to really help us understand where it worked well and where it had shortcomings.
00:15:51.000 | And we also ran a bunch of additional internal dogfooding to collect qualitative feedback from our internal teams.
00:15:59.000 | This actually helped us catch a few regressions.
00:16:02.000 | So for example, 4.1 was much more likely to start every response with "Certainly!"
00:16:09.000 | Which is not really what we were going for and kind of off-brand for us.
00:16:13.000 | So we first had to address those issues before we could roll it out to customers.
00:16:18.000 | Okay, so what are some things we learned?
00:16:25.000 | Well, first learning really is sharpen your axe.
00:16:30.000 | At the end of the day, a lot of evaluation is, in my mind, really an engineering problem.
00:16:35.000 | So the more time we invest in building out strong tooling, great processes, and documentation, the more it pays back.
00:16:42.000 | In our case, I would say it paid back tenfold.
00:16:45.000 | It became much easier to run evals, which meant that more teams started using them and used them more often.
00:16:52.000 | And as such, the iteration speed and our product quality really improved, as well as our own confidence in our product quality.
00:16:59.000 | Which meant that we were confident shipping to customers more quickly.
00:17:04.000 | I didn't mention this earlier, but we leverage LangSmith extensively for some subsets of our evals.
00:17:09.000 | Especially a lot of the routine evals that pertain to breaking tasks down.
00:17:14.000 | But we've also built some of our own tools for some of the more human-rater-focused evaluations.
00:17:20.000 | So I would say, don't be afraid to mix and match, and find what works best for you.
00:17:25.000 | Learning number two, this is kind of the flip side of this, which is that evals matter, but taste really does too.
00:17:33.000 | Obviously having rigorous and repeatable evaluations is critical.
00:17:37.000 | We wouldn't be able to make product progress without them.
00:17:40.000 | But human judgment, qualitative feedback, and taste really matter too.
00:17:44.000 | We learn a ton from the qualitative feedback we get from our raters, from our internal dogfooding, and from our customers.
00:17:50.000 | And we constantly make improvements to the product that don't really impact eval metrics in any meaningful way.
00:17:56.000 | But they clearly make the product better.
00:17:58.000 | For example, by making it faster, more consistent, or easier to use.
00:18:02.000 | And my last learning, and maybe this is a little bit more forward looking and a bit of a hot take.
00:18:11.000 | But as we're here talking about agents, I wanted to talk a little bit about data.
00:18:15.000 | And the take here is the most important data doesn't exist yet.
00:18:20.000 | So maybe one reductive or simplistic take on AI progress in the last decade has been that we've made a ton of progress by just taking more and more publicly available data and creating larger and larger models.
00:18:33.000 | And that's, of course, been very, very successful.
00:18:36.000 | It's led to the amazingly capable foundation models that we all know and love and use every day.
00:18:42.000 | And they continue to improve.
00:18:44.000 | But I would argue that to build domain-specific, agentic workflows for real-world tasks, we actually need more process data.
00:18:53.000 | The kind of data that shows you how to get things done inside of those firms today.
00:18:57.000 | So think about an M&A transaction, a merger between two firms.
00:19:01.000 | This is typically many months, sometimes years of work.
00:19:04.000 | And it's typically broken down into hundreds of sub-tasks or projects.
00:19:09.000 | And there's usually no written playbook for all of this.
00:19:12.000 | This is not all summarized neatly in a single spreadsheet.
00:19:15.000 | It's often captured in hallway conversations or maybe even in handwritten margins in a document that says, "This is how we do this here."
00:19:22.000 | And so if we can extract that kind of data, that kind of process data, and apply it to models,
00:19:31.000 | then I think it has the potential to really lead to the next breakthroughs when it comes to building agentic systems.
00:19:37.000 | And this is something I'm really excited about, and I'm looking forward to spending more time on it over the next few years.
00:19:44.000 | And with that, thank you.
00:19:47.000 | It was really great speaking with you today.
00:19:49.000 | Thank you very much.