2025 is the Year of Evals! Just like 2024, and 2023, and … — John Dickerson, CEO Mozilla AI


Chapters

0:00 Introduction to Arthur AI and Mozilla AI
0:46 2025: The Year of Evals
1:15 AI/ML monitoring and evaluation
2:48 The Year of the Agent
3:26 The need for 'evals' wasn't obvious to the C-suite
4:15 Pre-ChatGPT launch
6:06 Venture capitalists' predictions
7:03 Macroeconomic side of things
8:06 OpenAI launching ChatGPT
9:15 2023: The Year of GenAI
9:39 2024: GenAI applications in production
10:22 2025: Scaling and autonomy
11:35 Definition of an agent
12:06 Connecting to downstream business KPIs
14:40 Shift to multi-agent systems monitoring
15:42 Q&A
16:16 Discussion on domain expertise in evaluations
18:13 Discussion on LLMs as judges

Transcript

Thanks everyone for being here. I'm going to give this talk mostly from the point of view of being a co-founder and chief scientist at Arthur AI for six years prior to joining Mozilla AI as CEO. I do want to say Mozilla AI operates in the open source world where we're providing open source AI tooling and we're supporting the open source AI stack.

Our end goal is to enable the open source community to be at the same table as a Sam Altman when talking about AI moving forward. If you're interested in that, that's not what this talk is going to be about, but we can talk about it offline. This talk is going to be about 2025 finally being the year of evals.

And as was written, and as was spoken in my introduction, I've been in the space for a very long time. Arthur AI, for example, was, and still is, in observability, evaluation, and security, in both traditional ML and AI, and then into the deep learning revolution, and then into the GenAI revolution, and then into the agentic revolution.

I think we're finally at the point where all of these companies are going to start seeing hockey-stick growth, which is exciting. So, the themes of this talk. One thing is that I see AI/ML monitoring and evaluation as two sides of the same sword, or ruler, right? You can't do monitoring or observability without being able to measure, and measurement is the core functionality for evaluation.

This was not top of mind really with the C-suite until two things happened concurrently. One is AI became a thing that people who aren't a CIO or CTO could understand. So the CEOs, the CFOs, the CISOs, began to understand it basically when ChatGPT came out. And simultaneously there was a perfectly timed budget freeze across enterprise at least in the US that happened due to a fear of an impending recession.

So this is right before ChatGPT launched. This is like October, November when most enterprises would set up budget for the next year. At that time there was a freeze except for money that could be opened up for a specific pet project. And that pet project because CEOs and CFOs then knew about it was Gen AI.

So that happened and then now we have the final sort of vertex on this triangle which is going to force evaluation to be top of mind this year, which is that we have systems that are now acting for humans, acting for teams as opposed to just providing inputs into larger systems.

So these three things together, as you saw with Braintrust, as we're seeing at Arthur, as you're seeing with Arize AI, Galileo, and other big players in this space, are driving a big takeoff.

Cool. So right now, this is what's happening. Year of the agent. We all hear about it. We're hearing about it here at this conference. Agents are starting to make decisions and take actions, complex steps that lead toward an action, either autonomously or semi-autonomously. As a question in the last talk brought up, you know, bringing humans into the loop is still obviously a very good idea in many systems, but we're getting closer and closer to full automation.

And agentic systems are going into deployment now, okay? And that's in enterprises, that's in SMBs, that's in pet projects, and so on. And what that means is, by that last slide, this is also the year that we need evals. Flash back to, up to like a year ago, where every year we were asking, "Hey, is this the year of ML?

Is this the year of evaluations?" And prior to sort of these agentic systems coming out, we would have machine learning models basically spitting out numbers that would then be ingested into a more complex system. And that complexity would sort of erase the top-of-mind need to think about what's coming out of the model itself, except for people in this audience.

We know that's very important. But when it comes to decision makers, it would often get wrapped up into this sort of opaque box. And that meant that the ML part didn't really bubble up beyond like the CIO or whatever org the system was going into, which meant that typically that year was not the year of evals, because the need for evals was not obvious to the entire C-suite.

Cool. So let's take a quick step back in time. Before November 30th of 2022, the ChatGPT launch, ML monitoring was certainly a thing, right? Like data science teams have long used statistical methods as part of larger systems to understand what's going on, right? This is core. Like I mentioned, though, there's a tenuous connection to sort of downstream business KPIs.

And at the end of the day, that's what gets your product bought in the enterprise: being able to make a case about dollars saved or dollars earned. So, being able to connect the machine learning components specifically to a downstream business KPI. There was a lot of lip service around AI/ML, around the ROI, from the C-suite, including CEOs.

But that was just lip service, in our experience, at least. It was still basically selling into the CIO. So basically, it made it hard to sell outside of that. Now, obviously, this is a large space. This has been happening since, you know, about 2012, I would say, which is when AI/ML monitoring really started up.

With like H2O and Algorithmia and Seldon, sort of the first generation of these companies coming around. WhyLabs, Aporia, Arize, Arthur, Galileo, Fiddler, Protect AI, and so on and so on and so on. I put the cutoff here at like mid-2022, sort of like before the GenAI revolution happened.

There have obviously been companies founded after that. You know, we just saw Braintrust talking as well. And then, you know, the big players here as well, right? Snowflake, Databricks, Datadog, SageMaker, Vertex, you know, Microsoft's products and so on. So people have been thinking about it. But it was never the thing.

Again, rarely top of mind for the CEO, the CFO, and the CISO. It's never the issue. So when we would talk to people, it's always, yes, we understand that we need this. But security is going to be a bigger issue. Or latency is going to be a bigger issue.

Or some of these more traditional technology sort of problems are going to be the issue. It wasn't the machine learning model itself. So basically, you know, I do a lot of due diligence for venture capitalists as well. Now as a, you know, a multi-time founder and so on. And basically every pitch deck in this space from like the mid-2010s onward had a slide that said, "This is the year that a CEO is going to get fired because of an ML-related screw-up." And to my knowledge, it just still hasn't happened.

There have been some forward-thinking leaders. So I have here the annual report from Jamie Dimon, head of JPMC. This came out in April of 2022. So it covered basically JPMC up through their fiscal year in 2021. And he's talking about the spend that they have going into AI. But if you squint and you look at these numbers, they're still like comically small.

So basically one of these is in the consumer world. He makes the statement that from 2017 up through the end of 2021, they had put $100 million into AI/ML, right? That's not a huge amount of money for JPMC. So we're still sitting back in time, pre-ChatGPT and so on.

But let's now flip to the macroeconomic side of things. So the economy started getting pretty dicey right up until about when ChatGPT launched, right? So that was the end of November for ChatGPT. A lot of enterprise budgets are set in October, November for the following year. And toward the mid to end of 2022, there were very deep fears about an impending recession that didn't end up happening.

But those fears basically made it so that most enterprises either froze or shrunk their IT budgets for 2023, okay? So what that meant is, were it not for a particular tailwind called ChatGPT, we probably wouldn't have seen a lot of new technology being developed in the IT departments at these large enterprises in 2023.

Right? So this is sort of a bittersweet mix. I guess I just talked about a lot of this. This was sort of a bittersweet mix in the sense that it did set us up to put the eye of Sauron on a particular specific small pet project where a small amount of budget could be applied.

And that small pet project came from our friends at OpenAI launching ChatGPT right before the holiday break. And I'm convinced, just from talking to some C-suite folks across the enterprises here in the U.S., that basically what happened is that CEOs and CFOs, you know, folks less technical in the computer science sense of the word, were able to interact swiftly and easily with a single UI on the internet and get wowed by AI.

Right? So I was also wowed by AI. In fact, we were hosting, Arthur was hosting, with our Series A lead, Index Ventures, an event at NeurIPS, the major machine learning conference, the night that ChatGPT came out. And it basically took over a bunch of nerds in a room being like, wow, this is very, very impressive.

And that happened to everybody else, right? Flip back to November 30th, you know, you can make Eminem rap like Taylor Swift. Oh, hey, isn't this funny? Oh, hey, my mom did this sort of like joke about, you know, I want to have poetry, but written in the, you know, the words of a rapper or whatever.

This started to happen over and over again. And what that meant is that that discretionary budget, which exists, was unlocked specifically for now the CEO's pet projects, which were called GenAI. So 2023, we still had austerity forced on us because of those frozen budgets, or even reduced budgets. But the thing that was happening here is that now the only money going around that could be allocated was going to specifically GenAI.

And so everybody focused on this. Hey, it's a cool technology. Everybody focused on this. And the science projects started to sort of float around within enterprise. 2024, we started to see GenAI based applications going into production, right? Chat applications are the obvious one, internal chat applications, internal hiring tools, things like that.

And that's because basically the only budget going into new projects in 2023 was going to GenAI. And now as things go into production, primarily internally in 2024, we have the folks who tend to dress in business suits, asking questions around ROI, governance, risk, compliance, brand optics, and that kind of thing.

So now we're starting to get a little bit closer to people outside of the machine learning, the data science, the computer science world, the CIO's office, caring about evaluation, right? If I need to have a quantitative estimate of risk, then I need to do evaluation. 2025, we've seen scale ups, right?

Look at the revenue numbers for any frontier model provider. Look at the revenue numbers for a lot of us in this room. Just everything is really going up right now. And that's because of usage, which is great. That also means that's a function of the C-suite basically becoming comfortable, basically talking about and putting large real budget into AI, right?

So 2023's IT budgets, 2024's IT budgets set for the following year, those weren't frozen, right? Those are earmarked specifically for AI applications and things along those lines. So we had science projects in 2023 go into production in 2024, and they're now shipping and scaling in 2025. And also like, frankly, the technology has just gotten really amazing, right?

Like all of us in this room are technical, but even I'm just amazed every time a new model is dropped. The community has also really gotten behind this, you know, open source has gotten behind this, venture capital, big tech is writing huge checks into frontier model providers and so on.

So everything is sort of coming together in 2025. And also remember that third vertex, we have machine learning systems now moving toward autonomy. Okay, so 2025, we all hear it, it's the year of the agent. No longer is a question mark needed here, it's clearly the year of the agent.

Now, quick 30 second definition of an agent, as defined, you know, in the late 50s onward. Agents need to perceive the environment, they need to learn, they need to abstract and generalize. And unlike traditional machine learning, they're going to reason and act, right? We have reasoning models out there.

We have systems that are acting in virtual environments or cyber-physical environments. And what that means is you have a lot of complexity introduced into the system, and you have a lot of risk introduced into the system. And that's great for those of us in evals. So at the end of the day, the thing that really matters, like I mentioned, when you're selling any product, not just our products in this room, into an enterprise or an SMB, is being able to attach your product to some sort of downstream business KPI: risk mitigation, revenue gains, you know, losing less money, whatever.
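As a rough, self-contained sketch of that perceive, reason, act loop from the classic definition, here is a toy version. Every name here is illustrative, not any real agent framework; a real agent would put an LLM behind the "reason" step and real tools behind the "act" step.

```python
# Toy sketch of the classic agent loop: perceive, remember, reason, act.

class ToyEnvironment:
    """A minimal environment the agent can observe and act on."""
    def __init__(self):
        self.state = 0

    def observe(self):
        return self.state

    def apply(self, action):
        self.state += action


class ToyAgent:
    """Perceives state, keeps a memory, reasons with a fixed policy, acts."""
    def __init__(self, goal):
        self.goal = goal
        self.memory = []  # the 'learn / abstract and generalize' part, heavily simplified

    def step(self, env):
        observation = env.observe()      # perceive
        self.memory.append(observation)  # remember
        if observation >= self.goal:     # reason: has the goal been met?
            return False                 # stop acting
        env.apply(1)                     # act on the environment
        return True


env = ToyEnvironment()
agent = ToyAgent(goal=3)
while agent.step(env):
    pass
print(env.state)  # 3
```

The control loop has the same shape whether the policy is a one-line rule, as here, or a reasoning model choosing among tools, which is exactly why the whole loop, not just the model inside it, is what you want to evaluate.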

So now evaluations: you need them because you're quantifying things. And they're finally a first-class discussion point, which is fantastic. So we have the CEO, like I mentioned, November 30th, 2022 and onward, now at least knows what the tech is. And I'm not saying they know, you know, what attention means, right?

But I am saying that they know some of the capabilities around these generative models. They know some of the capabilities around agentic and multi-agent systems. And they're comfortable, you know, talking to experts about it, allocating budget for it, and talking to their board of directors and shareholders about it as well.

We have the CFO, who, because the CEO cares about this stuff, also obviously needs to care about it. But she's going to care about the impact to the bottom line, right? That's what a CFO does. They're doing allocation, they're doing budget planning, and they're going to need to basically write some numbers into an Excel spreadsheet.

And those numbers have to come from, in part, quantitative evaluation. CISOs now see this as, you know, a huge security risk and opportunity. And for those who haven't sold into enterprises before, CISOs are typically willing to write checks, smaller checks, especially for startups, more quickly and with less overhead than like a CIO would.

CIOs tend to have a bigger org, for one, and also tend to have a lot more process. CISOs tend to be a little bit more scrappy and willing to try out tools and so on. And so this actually happened before the agentic revolution. This actually happened when GenAI started coming up, where CISOs were like, hey, hallucination detection, prompt injection, things like that.

That's firmly in the security space, which is why you've seen a lot of guardrail products, including Arthur's, including the ones that come from our competitors, going into the CISOs office and basically being able to sign a lot of deals. The CIO, corporate CIO, has been on board the entire time, and they're just trying to keep the ship sailing.

So that's great. They're still on board. They want to keep their job. And the CTOs now, they always want standards, right? They need to make these decisions based on numbers. Those numbers are coming in part from like OTEL standards, standards like that. And that's great, right? So I've listed a lot of the C-suite here.

I haven't talked about chief strategy officers or otherwise, but like the CEO, the CFO, the CTO, the CIO, the CISO, they control a lot of budget. And now they are all willing to talk about, and they're all aligned on, basically the need to understand evaluation for AI. Great. So, a quick aside, and you should hold me truthful here.

All the evaluation companies, observability companies, monitoring companies, security companies, whatever you want to call them, have shifted into agentic and multi-agent systems monitoring. The point around you should monitor the whole system. You shouldn't just monitor the one model that is being used by one particular agent. That's well taken and well understood in industry and in government.

And I think that's great to have at that top-line discussion point. But, you know, keep me honest here. There was an article that came out in mid-April in The Information showing some leaked revenue numbers for a variety of startups in the evaluation space. Weights & Biases, Galileo, Braintrust, and so on.

But they were lagged by about six or eight months. And just from talking to friends in the space, those numbers are no longer representative of what folks in this area are making. So let's see what The Information leaks in early 2026 about this. And maybe we'll see something like "revenue no longer lags at AI evaluation startups."

Because this is the year for AI evaluation. Great. So I'll leave some time for questions. I did mention Mozilla AI is not firmly in the evaluation space. We do have a very nice open source, not monetized at all, what we're calling a LiteLLM for multi-agent systems. So if you're playing around with different multi-agent system frameworks, check out any-agent.

We implement a lot of them for you under a unified interface. So for people in this room, that might be a fun project to play around with. So thank you. I have three minutes for questions. Okay. Thanks. Thanks. Great presentation. I just have a question really about the enterprise value.

Most of the evaluations in GenAI require domain expertise. So for example, if you're building a multi-agent system to do financial investment analysis, to do something called a discounted cash flow spreadsheet, is the agent doing it correctly or not? I'm just trying to understand, how is that problem getting solved?

Because most of them are coming from an ML background where it was structured data. But this is a lot of unstructured data, and you have to measure the quality. Like, is it acting like a human, right? Yeah. Yeah, it's a great question. Actually, I have a paper in Nature Machine Intelligence talking about some of the problems that can come up when you do persona-based agents, where I say, act like a farmer in Ohio in your mid-40s and so on.

There's value in that, but you can't do it perfectly. And my gut reaction to this is: there was a leaked spreadsheet from Mercor, which is a company that hires in experts, showing $50 an hour, $100 an hour, $200 an hour for experts to be hired by, for example, Google or Meta or large banks, to do kind of what you're saying, which is you're going to have an expert sitting alongside the multi-agent system, basically sitting next to, you know, the intern who's going to come in and take your job in like a year or two, or change your job; maybe take is not the right word.

But they're basically doing that expensive human validation in lockstep with the multi-agent system, which, you know, if you're going to be doing discounted cash flow analysis, right, the kind of thing where A, you can either make or lose a lot of money, and B, lose your job if you get it wrong, it's worth spending that large amount of money doing the human validation.

It's a question for everyone, though: what does that look like in five years, once that data is incorporated into the systems themselves? And that can be a moat as well, right? When you talk to anyone in the eval space, like, it's the data set creation and the environment creation that matters more than anything, which is the point you're getting at as well.

So if I spend a bunch of money to have a very good competitive, like, DCF environment or whatever, that can help me versus my competitors. So there is like, there is CapEx going into that. Yeah. Thanks for the presentation. Do you have a rough, maybe timeline on when you think evals will primarily be driven by maybe gen AI or even LLMs?

Yeah, the LLM-as-a-judge paradigm, and I think this was talked about in the previous talk as well, we see it getting used in practice even though there are issues with it, right? We have a paper in ICLR from last month talking about some of the biases that LLMs as judges have versus humans, on things like conciseness or helpfulness and some of those anthropic words.

But the long and short of it is that it solves the data set creation problem in some sense. And, like, you can give a persona to an LLM and it is like a poor man's version of a human doing the judging. And so we see a lot of people using that as a crutch right now.

But you need to, toward the last question that was asked, you do need to make sure you're validating this and making sure that you're not going off in some weird bias direction. That's time? Oh, happy to chat offline.
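As a footnote to that last answer, the LLM-as-a-judge pattern, plus the validate-against-humans caveat, can be sketched roughly like this. Here `call_llm` is a hypothetical stand-in (stubbed so the example runs offline), and the rubric wording and labeled examples are purely illustrative, not a recommended setup:

```python
# Rough sketch of LLM-as-a-judge: score a model's answer against a rubric by
# prompting a second model, then check agreement with human labels.

JUDGE_PROMPT = """You are a strict evaluator with the persona: {persona}.
Question: {question}
Answer: {answer}
Score the answer from 1 to 5 for factual accuracy only.
Do not reward length or a confident tone. Reply with just the number."""


def call_llm(prompt):
    # Hypothetical stand-in for a real model API call. This stub simply
    # rewards prompts that mention the expected fact, so the sketch runs.
    return "5" if "Paris" in prompt else "2"


def judge(question, answer, persona="a careful domain expert"):
    prompt = JUDGE_PROMPT.format(persona=persona, question=question, answer=answer)
    return int(call_llm(prompt))


# The caveat from the talk: before trusting judge scores at scale, validate
# them against a small human-labeled set to catch bias drift.
labeled = [
    ("What is the capital of France?", "Paris", 5),
    ("What is the capital of France?", "It is definitely London.", 2),
]
agreement = sum(judge(q, a) == score for q, a, score in labeled) / len(labeled)
print(agreement)  # 1.0
```

The persona line is the "poor man's human judge" idea from the answer; the agreement check at the end is the cheap guard against the judge drifting toward biases like verbosity or confident tone.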