JASPREET BOOTH: Hi, I'm Jaspreet. I'm a senior staff engineer at Intuit. I work on Gen AI for TurboTax. And today, we'll be talking about how we use LLMs at Intuit to help you understand your taxes better. So I think just to understand the scale, right? Intuit TurboTax successfully processed 44 million tax returns for tax year 23.
And that's really the scale we're going for. We want everybody to have high confidence in how their taxes are filed and understand them, that they're getting the best deductions that they can. So this is the experience that we work on. So you go into TurboTax. You enter your information.
Then you go through what credits you are eligible for and so on. And we basically help you expand on to how you are getting the tax breaks that you are, help you understand them better, and so on. And this is another example. This is basically the overall tax outcome.
Like, what is your overall refund for this year? Now, Intuit's Gen AI experiences are built on top of our proprietary GenOS. That's the generative OS that we have built as a platform capability. And it has a lot of different pieces that you see over here. The key goal is that we found a lot of the Gen AI tooling that comes out of the box does not support all our use cases.
We want to-- most prominently, we work in tax. We are in a regulated business. Safety and security are very, very important, so we want to focus on that. At the same time, we want to build a platform that a company at the scale of Intuit can use end to end, at really large scale.
So that's where GenOS comes in. We have different pieces. On the UI side, there's GenUX. Then there's the orchestrator. That's basically the piece where different teams are working on different components, different pieces, different LLM solutions, and it figures out how to find the right solution to answer the right question.
And Intuit calls the entire experience that we power through this Intuit Assist. So I'm going to deep dive into specific pieces that our team used to build out the experience for TurboTax. So as I said earlier, we have millions and millions of customers who are coming in, so we're trying to build a scalable solution that can work end to end.
So on the slide here, I'm basically going to talk about different pieces that are powering the experience. Of course, to begin with, the first iteration was the prompt tooling, basically a prompt-based solution to walk through what's going on in your tax situation. Let's take an example of what I was showing earlier, which was your tax refund.
So your tax refund has many constituents. These are your deductions. These are your credits. The standard deduction, W-2 withholding, and so on. So we want to make sure that you understand all of that. So we built a prompt-based solution around it and worked from there. The production model that we went with is Claude for this use case.
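A minimal sketch of the prompt-based approach described here; the field names, prompt wording, and numbers are illustrative assumptions, not Intuit's actual GenOS templates.

```python
# Hypothetical "prepared statement" style prompt for explaining a refund.
# All numbers come from the tax engine; the LLM only explains them.

REFUND_EXPLANATION_PROMPT = """\
You are a tax assistant. Explain the user's federal refund in plain language.
Use ONLY the numbers provided below; do not compute or invent any figures.

Refund breakdown (from the tax knowledge engine):
{refund_breakdown}

Write 3-4 sentences covering withholding, deductions, and credits.
"""

def build_static_prompt(tax_engine_output: dict) -> str:
    """Fill the static prompt with numbers produced by the tax engine."""
    lines = [f"- {item['label']}: ${item['amount']:,.2f}"
             for item in tax_engine_output["refund_breakdown"]]
    return REFUND_EXPLANATION_PROMPT.format(refund_breakdown="\n".join(lines))

# Example with made-up figures:
example = {"refund_breakdown": [
    {"label": "Federal income tax withheld (W-2)", "amount": 8200.00},
    {"label": "Standard deduction", "amount": 14600.00},
    {"label": "Child tax credit", "amount": 2000.00},
    {"label": "Refund", "amount": 1850.00},
]}
print(build_static_prompt(example))
```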
Intuit is one of the biggest users of Claude. We had a multimillion-dollar contract for this year as well. And you'll also see OpenAI over there. OpenAI is what we use for other question answering. So you'll see on the slide, we're talking about static and dynamic types of queries.
So static queries would be what I was showing earlier: we know you are looking at your summary and want to see what happened overall. So that would be a static prompt. Think of it like a prepared statement. The additional information that we're gathering is the tax info for the user who comes in.
Now, a dynamic query would be a user having questions about their tax situation. Can I deduct my dog? Well, you can't, but you can try. Things like that, that's what we're trying to answer more dynamically.
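A rough sketch of how a request might be routed between the static, prepared-statement path and the dynamic Q&A path; the routing rule, screen names, and return labels are assumptions for illustration, not the actual GenOS planner.

```python
# Hypothetical router: static screens get prepared prompts, free-form questions
# go to a dynamic Q&A path, everything else falls back to a deterministic,
# non-LLM explanation.

STATIC_SCREENS = {"refund_summary", "deductions_review", "credits_review"}

def route_query(screen: str | None, user_question: str | None) -> str:
    """Pick a handling path for an incoming request."""
    if user_question:                 # free-form question typed by the user
        return "dynamic"
    if screen in STATIC_SCREENS:      # known screen, known explanation need
        return "static"
    return "fallback"

assert route_query("refund_summary", None) == "static"
assert route_query(None, "Can I deduct my dog?") == "dynamic"
```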
OpenAI's GPT-4o mini had been the model of choice until a few months ago. We are now iterating on the newer versions. Of course, models change every year-- every month, I should say. So we're trying to focus on that. Same for the dynamic piece again. Another important aspect is tax information. The IRS changes forms every year, and Intuit has proprietary tax information and tax engines that we want to use.
So we have RAG-based and, of course, GraphRAG-based solutions around it as well. They help us answer users' questions much better.
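A toy sketch of the RAG idea: retrieve relevant tax content and ground the answer in it. A real system would use a vector index (or a graph, for GraphRAG) over IRS and Intuit tax content; the term-overlap scoring and snippets below are stand-ins so the example stays self-contained.

```python
# Stand-in retrieval: score snippets by word overlap instead of embeddings.

TAX_SNIPPETS = [
    "The standard deduction for tax year 2024 depends on filing status.",
    "Education credits such as the American Opportunity Credit have income limits.",
    "Pet care costs are generally not deductible unless the animal is a service animal.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k snippets sharing the most words with the question."""
    q_terms = set(question.lower().split())
    ranked = sorted(TAX_SNIPPETS,
                    key=lambda s: len(q_terms & set(s.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_rag_prompt(question: str) -> str:
    """Stuff the retrieved context into the prompt sent to the LLM."""
    context = "\n".join(f"- {s}" for s in retrieve(question))
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

print(build_rag_prompt("Can I deduct my dog?"))
```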
And one thing that we also piloted recently was actually having a fine-tuned LLM. We went with Claude because that's the primary one we are using there. We stuck to static queries, and we tested it out. And it does well. It definitely does well. The quality is there. It takes effort to fine-tune the model, however, and we found that it was a little too specialized to the specific use case. And one thing I want to highlight and deep dive further on is evals.
So we want to make sure that we evaluate everything we do. You want to know what's happening in production. You want to make sure that in the development lifecycle, you're doing everything you need to do so that you have the best prompts out there. And with that, moving on to the next slide.
So I'll just summarize it a little bit. These are the key pillars that we have. I already spoke about some of them before. I want to highlight here the bottom part of this slide, actually: the human domain expert. So Intuit has a lot of tax analysts who work with us, decoding tax law changes year over year, making changes, and so on.
So they are the experts who provide us the information and make sure the evaluations are done correctly. We have a phased evaluation system, with manual evaluations initially in the development lifecycle. Another thing that we have done is actually use the tax analysts as the prompt engineers. That allows us, the folks in the data science and ML world, to focus on quality: defining the metrics and making sure we have a nice data set that we can iterate on and test on.
As we go along, as I said, models change. We want to try out different models. We want to see the law changes from the IRS, say from tax year 23 to tax year 24, what happened. So we focus on those changes. And the human experts bring their expertise and are able to both help with prompt engineering and get the initial evaluations done.
That then becomes the basis for automated evaluations. LLM as a judge is what we use as well. I'm going to talk a little bit more about that. Going back then to what I was saying earlier about Claude 3 Haiku and fine-tuning: as part of GenOS, we built out a lot of tool sets.
One more thing that we want to do is support fine-tuning. So for our use case, we actually stuck to just fine-tuning Claude 3 Haiku, powered by Amazon Bedrock. And the goal there was that we wanted to see if we could actually improve the quality of responses. The biggest driver, of course, is that fewer instructions are needed once you have fine-tuned the model.
Latencies are a big concern, so we want to see if we can squeeze down the prompt size and at the same time keep the quality that we need. So this is roughly what it looks like. We have different test AWS accounts and different environments that are provided by the platform teams that we work with.
We look at the data and adhere to regulations, the Section 7216 regulations, so we only use consented data from users and make sure we're on the right side of that.
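A hedged sketch of kicking off a Claude 3 Haiku customization job on Amazon Bedrock with boto3, roughly matching the setup described. Bucket names, the IAM role, and hyperparameter values are placeholders; the exact base-model identifier and supported hyperparameters should be checked against the current Bedrock documentation for your region.

```python
import boto3

# Placeholder ARNs, bucket names, and hyperparameters; training data would be
# consented user data only, per the Section 7216 constraints mentioned above.
bedrock = boto3.client("bedrock", region_name="us-west-2")

response = bedrock.create_model_customization_job(
    jobName="tax-explanations-haiku-ft",
    customModelName="tax-explanations-haiku",
    roleArn="arn:aws:iam::123456789012:role/BedrockFineTuneRole",
    baseModelIdentifier="anthropic.claude-3-haiku-20240307-v1:0",
    customizationType="FINE_TUNING",
    trainingDataConfig={"s3Uri": "s3://example-bucket/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://example-bucket/output/"},
    hyperParameters={"epochCount": "2", "batchSize": "8"},
)
print(response["jobArn"])
```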
And to double down on the evaluation part: you want to evaluate everything. The key pillars are accuracy, relevancy, and coherence. We have both manual and automated systems. We also have broad automated monitoring that basically looks at sample data of what the LLM is giving real users in real time. And for the tooling that we've built out here, LLM as a judge comes in on the auto-eval side.
We've also developed some tooling in-house to do some automated prompt tuning, and that really helps us update our LLM as a judge. Basically, LLM as a judge operates on top of a prompt. It needs different information. It needs some manual samples, which are the golden data set.
We use SageMaker Ground Truth for that. One more thing that I want to highlight here is models. We made the move from Anthropic's Claude Instant to Claude 3 Haiku for the next year, tax year 24. And that takes some effort. The only way it's possible is because we have clear evals in place so that we can test out whatever we are changing.
And model changes are not as smooth as you would think. These are some more details on the automated evals. As you can see, the key output is that we want to make sure it's tax accurate. That's the main thing we want to aim for and focus on.
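A minimal LLM-as-a-judge sketch along the lines described: a judge prompt scores candidate explanations against analyst-written references from the golden data set. The rubric, examples, and `call_llm` stub are illustrative; in practice the call would go to the judge model behind the model gateway.

```python
import json

JUDGE_PROMPT = """\
You are grading a tax explanation against a reference written by a tax analyst.
Score accuracy, relevancy, and coherence from 1 to 5.
Reference: {reference}
Candidate: {candidate}
Reply as JSON: {{"accuracy": n, "relevancy": n, "coherence": n}}
"""

GOLDEN_SET = [
    {"reference": "Your refund dropped because your W-2 withholding decreased.",
     "candidate": "Your refund is lower mainly because less tax was withheld from your paychecks."},
]

def call_llm(prompt: str) -> str:
    # Stub standing in for a call to the judge model.
    return '{"accuracy": 5, "relevancy": 5, "coherence": 4}'

def run_auto_eval(golden_set: list[dict]) -> list[dict]:
    """Score each golden example with the judge prompt."""
    return [json.loads(call_llm(JUDGE_PROMPT.format(**ex))) for ex in golden_set]

print(run_auto_eval(GOLDEN_SET))
```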
I'm going to move on here. So let's talk about some major learnings that we have. The LLM contracts are really expensive, and the only way they get slightly cheaper is if you sign long-term contracts. So you are tied into the vendor, and it helps to have strong partners on the vendor side who work with you to help iterate and improve.
And I think I was at this conference last year, and this was one thing called out then as well: essentially, vendors are a form of lock-in. The prompts are a form of lock-in. It's not easy. We found out it's not even easy to upgrade to the next model from the same vendor going into the next year.
So we want to focus on that. Another thing I really want to highlight here is latency. LLMs, of course, don't have the SLAs of back-end services. We're not looking at 100 milliseconds, 200 milliseconds; we're talking about three seconds, five seconds, 10 seconds. So as the user's tax information comes in, maybe they have a complicated situation like mine: they own a home.
They have maybe something in stocks. They're trying to file, their spouse has a job as well, a lot of things going on. So the prompts really balloon up if you're trying to figure out the outcome. And as you go into tax day, everybody's trying to file on tax day, right?
April 15. So latency really shoots through the roof. So we design the product around that. We want to make sure we have the right fallback mechanisms and the right user and product design, so that the user experience is seamless and useful. We want to make sure that the explanations are helpful, more than anything else.
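A small sketch of the timeout-and-fallback pattern this implies: if the LLM explanation does not come back within budget (seconds, not milliseconds), serve a deterministic, template-based explanation instead. The timeout value and the simulated slow call are illustrative.

```python
import asyncio

async def call_llm_explanation(tax_data: dict) -> str:
    await asyncio.sleep(4)  # stands in for a slow peak-season LLM call
    return "LLM-generated explanation..."

def template_fallback(tax_data: dict) -> str:
    """Deterministic explanation built only from tax engine numbers."""
    return (f"Your refund of ${tax_data['refund']:,} reflects your withholding, "
            f"deductions, and credits.")

async def explain_with_fallback(tax_data: dict, budget_s: float = 3.0) -> str:
    try:
        return await asyncio.wait_for(call_llm_explanation(tax_data), timeout=budget_s)
    except asyncio.TimeoutError:
        return template_fallback(tax_data)

print(asyncio.run(explain_with_fallback({"refund": 1850})))
```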
And I think I covered all the other pieces. But once again, I cannot say it enough: evals are a must before launch. Focus on evals. Make sure you have clear guidelines on what you're building. Have a clear golden data set. I've heard that from other talks as well. That's really a key point.
That's all. I'm going to pause here for questions. If you're going to be asking questions, please come to one of the microphones so that we can capture the audio. Thank you. Yeah, hi. You said evaluate everything, right? But with Gen AI systems, there could be very small changes. You're going to make a small change to a prompt.
And evaluations can get very expensive or slow down your whole development process, right? So maybe could you dive a little deeper into when you bring in different types of evaluations? Is there anything where you just say, we ran some regression tests and it looks fine, so you launch?
Or do you always go with an expert? Sure, sure. Thank you for the question. So just to reiterate, the evaluations are of different types. I would say when we are in the initial phase of development, we lean more on manual evaluations with tax experts so we can get a baseline in place.
Then as we are tweaking different things in the prompts, that's where auto-evaluation comes in. We basically take the input from the tax experts and use that to train a judge prompt for the LLM. That judge LLM is, once again, expensive; we went with the GPT-4 series until recently for that one.
And then minor iterations we can do with auto-eval. We have a clear understanding with product; we want to make sure the quality is there. And when we have major changes, for example going from tax year 23 to tax year 24, then we definitely re-evaluate. If the prompt changes a lot, we would go for manual evaluations.
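An illustrative policy for choosing an evaluation tier, mirroring the answer above; the change categories and threshold are assumptions, not Intuit's actual process.

```python
def pick_eval_tier(change: dict) -> str:
    """Decide which evaluation to run for a given change."""
    if change.get("new_tax_year") or change.get("model_swap"):
        return "manual+auto"            # expert review plus full automated eval
    if change.get("prompt_diff_ratio", 0.0) > 0.3:
        return "manual+auto"            # large prompt rewrite: bring experts back in
    if change.get("prompt_diff_ratio", 0.0) > 0.0:
        return "auto"                   # minor prompt tweak: LLM-as-judge regression run
    return "none"                       # non-prompt change, e.g. logging only

assert pick_eval_tier({"prompt_diff_ratio": 0.05}) == "auto"
assert pick_eval_tier({"new_tax_year": True}) == "manual+auto"
```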
Thank you for the technical deep dive. I was more interested in the product side of it. Sure. We also do taxes, so I was curious. What are the kind of LLM interactions that the users are having? What are the kind of questions they're asking? Is it more like critical parts of the workflow, or more like what are my taxes?
So we have question answering for all types of questions. That includes both product questions, as in how do I do this in TurboTax, and questions about their tax situation. For example, I paid the tuition for my grandchild; can I claim that on my taxes? Things like that.
So we have different teams going after different pieces, and our goal is to answer all of these questions. Accordingly, different types of questions need different solutions. And that's where, maybe I would reiterate, going back to here: this piece here, the planner. Essentially, this is where it comes in.
We want to make sure that when the query comes in, we understand what the user is trying to ask. And then we have different kinds of solutions for different kinds of questions and go through that. So you mentioned the evaluation. One quick question: TurboTax, I'm sure, involves a lot of numbers in the answers.
So how do you verify those numbers in terms of the evaluation? Let's say the actual tax number is 11,235 and the answer says something like 11,100. It's quite difficult to catch this with a manual evaluation. Yes. Yes. Yes. Thank you for the question. So that's the key thing that we work on.
So TurboTax, of course, has a proprietary tax knowledge engine that we have built in-house and developed over the years. And that's really what's providing these numbers. The tax profile information is all coming from there. We are not having LLMs do the calculations at all. We're basically using the ground truth that already exists in our systems as the numbers that we show.
And we have safety guardrails. Maybe this piece here I would call out: we have a lot of safety guardrails on the raw LLM response to make sure we are not hallucinating numbers before we send it to the user. Got it. So the data is coming from the tax engine itself.
But when you formulate the final explanation, the answer itself, how do you make sure that the numbers actually in the final answer are, you know, the same as what's coming from the data? So basically, we have ML models working under the hood, as part of the security aspect that you see here, that make sure we did not hallucinate any numbers.
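A simple sketch of the numeric grounding check in the spirit of this answer: every dollar figure in the LLM's draft must already exist in the tax engine output, otherwise the response is rejected. The real guardrails also involve ML models; this regex check only shows the core idea.

```python
import re

def extract_amounts(text: str) -> set[float]:
    """Pull dollar amounts like $1,850 or $8,200.00 out of a response."""
    return {float(m.replace(",", "")) for m in re.findall(r"\$([\d,]+(?:\.\d{2})?)", text)}

def numbers_are_grounded(llm_response: str, tax_engine_values: set[float]) -> bool:
    """True only if every number in the response came from the tax engine."""
    return extract_amounts(llm_response) <= tax_engine_values

engine_values = {1850.0, 8200.0, 14600.0}
assert numbers_are_grounded("Your refund is $1,850 thanks to $8,200 withheld.", engine_values)
assert not numbers_are_grounded("Your refund is $2,000.", engine_values)
```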
Got it. Yeah. Thank you. Could you give an overview of how you use both traditional RAG and GraphRAG, like a hybrid, in your workflow? Sure, sure. And sorry, one more question: now with the new Claude 4 model coming out, do you think the fine-tuning might be getting easier, or whether it's even needed?
I'll take the first one. So GraphRAG, I think we've definitely seen better response quality with GraphRAG. Even more than that, though, I think for end-user helpfulness, getting a personalized answer is the key piece, I would say. GraphRAG definitely outperforms regular RAG, and what outperforms even more is personalizing the answers.
And to your second question, we are constantly evaluating the models. This is really the time that April is just behind us. We are trying to look at what new things we can do. We also have some in-house models that Intuit trains and develops. So we are constantly evaluating, and I don't have an answer now what we'll do for the next tax year, but yes, we keep working on that.
You mentioned you have different situations, tax situations, and you come up with an answer. So if I describe my situation, it's complicated and it comes up with an answer. Is that answer being generated using the LLM, or is it going back to the tax engine? And how do you explain how you came up with that answer?
And I assume there are going to be a lot of legal challenges to wrong answers. Right, right, right, absolutely. I mean, Intuit focuses heavily on legal and privacy controls. So the solution for this one, what we worked on here, is specific; this is more of the static variety of questions.
So once again, what I was saying earlier: the underlying numbers are coming in from the tax knowledge engine, and we have tax experts who actually crafted these prompts. They are specifically tested for each piece that you see here. So basically, when we do the evals, we make sure what you're suggesting doesn't happen.
Okay, great. Thank you. Thank you so much. What a great talk. Thank you.