
The Billable Hour is Dead; Long Live the Billable Hour — Kevin Madura + Mo Bhasin, Alix Partners


Chapters

0:00 Introduction to Alix Partners and the AI Shift
1:05 How AI is Reshaping Knowledge Work
2:19 The Future of Professional Services Models with AI
3:36 AI's Impact on the Three Phases of Engagements
5:07 Scaling Data Analysis Beyond Human Limitations
6:36 The Paradox of AI Investment and Productivity
7:22 Use Case 1: Categorization with Structured Outputs
10:34 Use Case 2: Retrieval-Augmented Generation (RAG)
12:46 Use Case 3: Structured Data Extraction from Unstructured Data
15:54 Key Requirements for Scaling GenAI Initiatives
16:48 Final Thoughts: The Future of LLMs in the Enterprise

Transcript

I'm Mo. I'm director of AI products at Alix Partners. Prior to this, I was a co-founder of an anomaly detection startup, and prior to that, I was a data scientist at Google. Together, Kevin and I co-lead the development of an internal gen AI platform that we've been building for the last two years.

We have 20 engineers. We've scaled it to 50 deployments and hundreds of users, and we're excited to tell you everything we've learned on that journey. Kevin Madura: Great. I'm Kevin Madura. I help companies, courts, and regulators understand new technologies like AI and LLMs. As Mo mentioned, both of us work at a company called Alix Partners.

It's a global management consulting firm. I realize lots of you in this room might be rolling your eyes at that, rightfully so, but I like to think our firm does a little bit more than deliver PowerPoints. We actually roll up our sleeves and solve problems, whether that's coding or getting into the weeds of things.

We're here to talk to you today about three things. One is how we see AI reshaping knowledge work today, so a lot of how it's impacting professional services, advisory services, that sort of thing. Then we'll walk through three real-life use cases showing how we've actually deployed it, concretely, within the way we work in our business, and then wrap up with what doesn't work and where we see things going.

So some of you here might recognize this chart from an organization called METR, which evaluates the ability of LLMs to complete a certain set of tasks. Specifically, it measures the length of task that LLMs can complete with at least a 50% success rate. And the takeoff rate is pretty significant here.

Now, we think that's mostly because it's a verifiable domain, and as we all know, model capabilities are a little bit jagged, so they perform very, very well in software development, maybe not so well in non-verifiable or more messy domains like knowledge work. So we think it's a rough proxy for the coming disruption for professional services and knowledge work more broadly.

Do we think the takeoff will be as steep as software engineering? Probably not, just because of the messiness of the real world. And for those of you not familiar, there are typically two main models for professional services. One is the junior-led model, where a few very senior individuals set direction and more junior individuals provide the leverage.

So it's a lot of directing, okay, do this, and you throw 50 people at a problem, and they kind of figure it out and probably waste some time in doing so. There's also the senior-led model, which is more senior folks who have 15, 20 years of experience. They are much more involved in the day-to-day.

They're actually doing the work, delivering the work. This is the Alix Partners model: a little less leverage, but we can deliver results faster and more impactfully because it's senior-led. We think the future is probably somewhat of a hybrid, but because of how quickly model capabilities are advancing, the real leverage accrues to those more experienced folks, people who have been in a particular domain or industry for 15 or 20 years.

If you've listened to Dwarkesh Patel's podcast, a fantastic podcast, he has this concept of an AI-first firm, where you can basically take the knowledge and start to replicate it out, so you can have 50 copies of the CEO, as an example. We think the future is something like that: you're basically replicating the knowledge and experience of more senior individuals, and you scale out that leverage below them using AI.

And so the way we think about typical engagements, they roughly fall into three buckets, not always, but for demonstration purposes. There's a lot of upfront work initially, whether it's an M&A transaction, a corporate investigation, or some type of due diligence: oftentimes you're handed a bunch of PDFs, databases, Excels, whatever it might be.

There's a lot of upfront work just to understand what you've got: ingest the data, normalize it, categorize things, put it into a framework that you can then use to do what you do best, whatever that might be. If you're a private equity expert or an investigator, you typically have some type of playbook, and that's phase two, the black part, which is the analysis and the hypothesis generation.

You're basically getting all that data into a format you can use to derive insights. And all of that, of course, is in support of the last piece, which is what clients actually care about: you solving their business problem.

That's the recommendation, the deliverable, the output, whatever that might be. That's the reason they hired you in the first place. We're seeing AI today significantly compressing, at minimum, that first part. If it was 50% of the effort, maybe it's 10 to 20% today in terms of what's required from a human just to get up to speed on the contents of a data room or whatever it might be.

And it's not only that, because to date you're largely limited by the throughput of human beings. Think of doc review as an example; Box was a perfect precursor to this talk, because that's exactly what they do. If you have 5,000 contracts and it takes 30 minutes to review each one, that's 2,500 hours of review time; think of how many people that takes.

You want to extract some type of information from them, but you're inherently limited by either time or cost. So inevitably some prioritization occurs: you focus only on the top 20% or so, the most valuable pieces of the data.

With AI, that's completely changed. You can now look at 100% of the corpus, whatever it might be, and apply your same methodology, your analysis, your insights to all of the data, because you're able to extract that information across 100% of the data set.

So now you can look at 100% of the vendor contracts, 100% of the customer base, and start to derive those insights, identify savings opportunities, and free up time to do more interviews or other high-value work. And because it's run across 100% of the data instead of just the first 20 or so percent, the output is that much better.

So to bring this to life a little bit, I'll turn it over to Mo to talk through some real-life examples. Thanks, Kevin. To motivate the use cases, I want to start with a paradox that we face. Everyone's investing in AI.

89% of CEOs said they're planning to implement agentic AI, according to Deloitte. But we find ourselves in this paradox where the National Bureau of Economic Research says there's been no significant impact on earnings or recorded hours, and BCG says that three quarters of companies struggle to achieve and scale value from their gen AI initiatives.

And then finally, S&P Global said that almost half of companies were abandoning their AI initiatives this year. So how is it that everyone's spending, but no one's seeing the value? We think there's a difference between employee productivity and enterprise productivity, and we want to talk about the use cases we've found that drive enterprise productivity.

The first example I want to start with is categorization, maybe trying to put a square peg in a round hole. How does this show up for us? Think of IT support tickets: your laptop keeps restarting, and that ticket needs to be triaged to the hardware department.

You need to categorize those support tickets accordingly. Something closer to home: we analyze companies a lot, so we look at accounts payable or spend data across companies, and we need to say what United Airlines is, right, whether it falls under travel. How was this done before?

Does anyone remember word clouds? You'd have to build a machine learning model: stem your data, remove stop words, build a classifier, support vector machines, Naive Bayes. It's a lot of work. Enter the new way: structured outputs. With structured outputs, you can get the answer a lot more easily.

This is unsupervised learning, and this is literally what that would look like. Say you have a list of companies, JD Factors for instance, and you have to categorize each one into a taxonomy. Here, the taxonomy would be the North American Industry Classification System, the NAICS codes. Each code has a description, and in this case it would be other cash management, for instance.

Typically, JD Factors is probably not part of the foundation model's knowledge. So how do we ensure that the classification works well? Enter tool calls. You can run a web query to append information to each of these companies and then categorize enormous volumes.
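To make the pattern concrete, here is a minimal sketch of structured-output classification, assuming the OpenAI Python SDK and Pydantic. The taxonomy slice, model name, and field names are illustrative, and the web-search enrichment step is reduced to a context string.

```python
# Minimal sketch: zero-shot classification via structured outputs.
# Assumes the OpenAI Python SDK and Pydantic; taxonomy entries are illustrative.
from enum import Enum
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class Category(str, Enum):
    # Illustrative slice of a taxonomy; a real run would load the full NAICS list.
    AIR_TRAVEL = "Scheduled passenger air transportation"
    FINANCIAL_SERVICES = "Factoring and other cash management services"
    UNKNOWN = "Unknown / needs human review"

class VendorClassification(BaseModel):
    vendor_name: str
    category: Category
    rationale: str  # short justification, handy when business partners audit results

def classify_vendor(vendor: str, context: str = "") -> VendorClassification:
    # `context` can carry web-search results for vendors the model may not know,
    # e.g. JD Factors; that enrichment step is omitted here.
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Classify the vendor into exactly one category."},
            {"role": "user", "content": f"Vendor: {vendor}\nContext: {context}"},
        ],
        response_format=VendorClassification,
    )
    return completion.choices[0].message.parsed

print(classify_vendor("United Airlines"))
```

Fanning the same call out over an entire accounts-payable file is what makes the volumes described later in this section practical.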

So this is what we've been doing, and we've had huge wins from it. What this has done is democratize access to text classification for us. I want to talk about the learnings we've had from deploying this surgically at our company. There have been enormous wins in speed and accuracy.

Those accuracy gains have not come cheaply. This might be unsupervised learning, but it's not unchecked: we've had to build the right relationships with business partners, who've worked hand in hand with us to ensure we get to the accuracy we want. What this does is convert skeptics into champions.

We don't become snake oil salesmen peddling AI. It becomes a pull from the firm, which asks us, "Hey, can you apply gen AI for us in these other initiatives?" That's really powerful. It's also important to have business context, which gets embedded for us in the taxonomies used for classification.

Everyone's talking about agents, but you need to get the individual steps right. What this does is build each individual step to a high level of robustness and accuracy so that we can daisy-chain them into the agentic workflows we want. And finally, a callout: these results are stochastic, not deterministic.

That comes with some risks, and Kevin will talk more about those. The punchline here: we've been able to categorize 10,000 vendors at 95% accuracy, doing in minutes what would have taken days, at an order of magnitude less cost. All right, next use case. This wouldn't be an AI conference if we didn't talk about RAG.

So how do we see RAG at our firm? You get dumped with a bunch of data: here's 80 gigs of internal documents, what did ACME release in 2020? Let's say you've got a court filing you have to submit on Monday and it's Friday.

You might get asked, "What are ACME's escalation procedures for reporting safety violations?" How did we do this in the past? You'd have an index, a literal index: someone would track in an Excel file which documents have been received, which haven't, and where they are. Or, hopefully not, you'd use SharePoint search or something like that, which probably wouldn't find what you're looking for.

Well, what do we do now? We have an enterprise-scale RAG app. It has to handle hundreds of gigabytes of data, PowerPoints, documents, Excel, CSVs, all sorts of formats, and huge volumes. What can you append to that? You can append tool calls to third-party proprietary databases. Let me talk about that for a second.
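At its core, the pattern is retrieve-then-generate. Here is a minimal sketch, assuming the OpenAI SDK and a toy in-memory index; a production system like the one described adds a vector store, format-specific parsers for PowerPoints, Excel, and PDFs, and access controls.

```python
# Minimal retrieve-then-generate loop. Assumes the OpenAI SDK; the corpus,
# model names, and index are toy stand-ins for an enterprise pipeline.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Toy corpus; a real pipeline would chunk hundreds of gigabytes of documents.
chunks = [
    "ACME 2020 annual report: three new product lines were released in 2020 ...",
    "ACME safety policy: violations are escalated first to the site lead, then to compliance ...",
]
index = embed(chunks)

def answer(question: str, k: int = 2) -> str:
    q = embed([question])[0]
    # Cosine similarity between the question and every chunk.
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    context = "\n\n".join(chunks[i] for i in np.argsort(-sims)[:k])
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("What are ACME's escalation procedures for reporting safety violations?"))
```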

What are the trade-offs that we've had, the wins and the losses? Sorry, I'm going really fast, we're short on time. RAG is invaluable at consulting companies because you get dropped onto a project really quickly and have to get up to speed, so it ends up being really valuable.

But I want to call out the part about teaching LLMs to call APIs. Typically, certain data sources are siloed: a team with a license pulls information from a web UI, emails it to another team, and that team analyzes the Excel. What we did was take the API spec, embed it, and teach the LLM how to call the API.
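As a rough illustration of that pattern, assuming the OpenAI chat completions tool-calling API: the data provider, endpoint, and parameters below are hypothetical, and the retrieval of the embedded spec is collapsed into a single hand-written tool definition. The model picks the endpoint and arguments; the application executes the call.

```python
# Rough sketch of letting an LLM drive a licensed third-party API via tool calling.
# Assumes the OpenAI SDK; the provider, URL, and parameters are hypothetical.
import json
import requests
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "query_financials",  # hand-written stand-in for an endpoint from the spec
        "description": "Fetch a company financial metric from a licensed data provider.",
        "parameters": {
            "type": "object",
            "properties": {
                "company": {"type": "string"},
                "metric": {"type": "string", "enum": ["revenue", "ebitda"]},
                "year": {"type": "integer"},
            },
            "required": ["company", "metric", "year"],
        },
    },
}]

messages = [{"role": "user", "content": "What was ACME's revenue in 2020?"}]
first = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)

# The application executes the request; the model never holds credentials.
data = requests.get("https://api.example-provider.com/financials", params=args).json()

messages += [
    first.choices[0].message,
    {"role": "tool", "tool_call_id": call.id, "content": json.dumps(data)},
]
final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
print(final.choices[0].message.content)
```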

We've democratized access to information that would otherwise have taken days for people to get, really condensing the time, as Kevin said before, and freeing people up for the high-value work. The last thing to call out about RAG is that it serves as a substrate onto which you can tack a number of gen AI features, which has proven really valuable for us at our firm.

A number of call-outs: people have high expectations of what they can get from a prompt box. If you ask it to reason across all documents, that's just not how RAG works. So we have to build those solutions step by step; it's a long journey, and we're excited to be on it.

With that, over to Kevin for the third use case. Yeah, thanks. It's a good thing Box went before us, because they covered a lot of the advantages of the ability, fundamentally, to take unstructured data and create structure from it. It is an unbelievably powerful concept, very simple on its face, but incredibly powerful in an enterprise context, because you can take something like this credit agreement, a PDF 50 or so pages long, and very quickly extract useful information: contract parties, maturity date, senior lenders, whoever that might be.

And so you see folks like Jason Liu saying Pydantic is all you need. It is still true; it is still all you need. Fundamentally, what this looks like, and Box went through a lot of it, is combining a document with a schema, an LLM, and some validation and scaffolding around it to make sure you're pulling out the values you need.
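A minimal sketch of that document-plus-schema combination, assuming the OpenAI Python SDK with Pydantic structured outputs; the field names are illustrative rather than the schema actually used on engagements.

```python
# Minimal sketch of schema-driven extraction from a credit agreement, assuming
# the OpenAI Python SDK with Pydantic; field names are illustrative only.
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class CreditAgreement(BaseModel):
    borrower: str
    senior_lenders: list[str]
    maturity_date: str      # ISO date string, e.g. "2027-06-30"
    interest_rate: str      # e.g. "LIBOR plus 1% per annum"

def extract(agreement_text: str) -> CreditAgreement:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Extract the requested fields from this credit agreement."},
            {"role": "user", "content": agreement_text},
        ],
        response_format=CreditAgreement,
    )
    return completion.choices[0].message.parsed  # validated against the schema
```

Mapping the same function over a folder of thousands of agreements, with retries and validation on top, is the scaffolding referred to above.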

And the business value really is in the schema: what you're extracting and why you're extracting that information. The flexibility is what's really powerful here, because you can reapply it across different types of engagements. An investigation might be looking at something entirely different from an M&A transaction.

This fundamental capability spans all of those. And the power is there at the bottom, where you can do this repeatedly across thousands, tens of thousands, hundreds of thousands of documents. A human review might take days or weeks; using an LLM, you can get it down to minutes.

It's incredibly powerful. In terms of user trust, we not only use external tools like Box and others, we've also rolled our own internally. To expose some of the model internals to users, as an off-ramp for them to understand where the model is more or less confident, we use the logprobs returned from the OpenAI API and align them with the output schema from structured outputs.

We ignore the JSON syntax and the field names themselves and home in on just the values. So in this case, the green box above the interest rate of LIBOR plus one percent per annum, that's the field we want. We take the geometric mean of the token probabilities, computed from the logprobs of the tokens that make up that value, and use it as a rough proxy for the model's confidence in producing that output.
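As a rough sketch of that heuristic, assuming the chat completions API with logprobs enabled; the prompt is a placeholder, and the alignment of tokens to an extracted value is simplified here to a substring match.

```python
# Sketch of the confidence heuristic: geometric mean of token probabilities for
# the tokens that spell out one extracted value. Assumes logprobs are enabled;
# token-to-value alignment is simplified to a substring match.
import math
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract the interest rate from this clause as JSON: ..."}],
    response_format={"type": "json_object"},
    logprobs=True,
)
token_logprobs = resp.choices[0].logprobs.content  # one entry per generated token

def value_confidence(value: str) -> float:
    """Confidence for an extracted value, ignoring JSON syntax and field names."""
    text, spans = "", []
    for t in token_logprobs:
        spans.append((len(text), len(text) + len(t.token), t.logprob))
        text += t.token
    start = text.find(value)
    if start == -1:
        return 0.0
    end = start + len(value)
    overlapping = [lp for s, e, lp in spans if s < end and e > start]
    # exp(mean logprob) is the geometric mean of the token probabilities.
    return math.exp(sum(overlapping) / len(overlapping))

print(value_confidence("LIBOR plus 1% per annum"))
```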

So the green and yellow boxes you saw at the beginning are a direct reflection of that confidence level. It's a relatively intuitive way for users to get a sense of the model's confidence, again, for human review to the extent that's needed. I won't go through all of these, but fundamentally, like I said, it is magic when it works, and it works at scale.

It is a total unlock, particularly for non-technical folks who are not up to speed on the capabilities of LLMs. Seeing this work is a light bulb moment for them, and it really is a game changer. That being said, there's a lot of work to be done on validation.

You saw all the work that Box and others have done to get it to a level of rigor that users can trust, and that's really a key tenet of all this. Finally, I'll turn it back to Mo for the must-haves. Just a couple of quick call-outs.

I know this is a tech conference, but getting a lot of this to work at the enterprise level requires people skills and working closely with the organization. There are a couple of things I want to call out that have been really important for scaling our gen AI initiatives at our firm.

The first one is demos. We prototype in Streamlit, but we build in React. We have a constant cadence, once a month, where we show the latest and greatest of what we're building. This inspires the firm about what we're able to build and keeps it investing in our initiatives.

And the second thing is, there's always the next shiny thing: agents, MCP, the latest model. But NPS is our metric, ROI is our metric, and those are hard-earned, one bug fix at a time. I'll skip the other one, except to say that partnerships are really important; it's a shared journey.

And I think we're out of time, but I'll leave you with this. Once Excel-powered LLMs actually work, we will be at AGI. So I'm looking forward to that next talk. Thank you. Thank you.