Thanks, Ally, for the great intro. Indeed, we're working on what I believe to be the central open problem in AI: how do you validate, verify, audit, and steer something as subjective and unstructured as LLM slop? So today we're going to be talking a lot about this.
I should point out that while we're ostensibly part of the AI security track, I would really consider us more of a QA company, an eval company in some sense, although there are a lot of shared similarities in how we approach the problem technically. We are essentially a property-based testing company, or a fuzz-testing company, or, as I like to call it, a haizing company.
Cool, so just to set the context a little bit: why did we start Haize, and what does haizing mean? Haizing, to us, ultimately comes down to this: we know that AI systems are extremely unreliable and hard to trust in practice, and you need to pressure-test them before you put them out into the wild.
Our solution is basically: let's run large-scale optimization, simulation, and search before deployment, and figure out through a battery of tests whether your system will behave as expected before it actually goes into production. And I'm sure any of you who have tried to build LLM apps in the past understand viscerally what I mean when I say the last-mile problem in AI.
Right: at this point in 2025, it's extremely easy to get something that is demo-ready or POC-ready. You can whip together a cool product over the weekend and impress your PM and whatnot, but it's really hard to get that same product into production at a point where it's truly robust, enterprise-grade, and reliable.
And, you know, this has been the case for the past two-plus years at this point. We've been promised the allure of autonomy and agents and full gen AI enterprise transformation for two-plus years since ChatGPT launched, and we're still not quite there. And I think, ultimately, it's because we haven't solved this last-mile problem around trust, reliability, and risk.
So, I think a big part of the reason we haven't solved this is that people still think about evals and measuring AI systems in a very straightforward and naive sense, which is easiest to explain as follows. I'm sure everybody has seen this idea: go out, get human subject matter experts, collect a finite, static golden dataset of inputs and expected ground-truth outputs from those humans, then run the inputs through the application, get the actual outputs, and compare them somehow with the ground-truth golden answers.
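Just to make that concrete, here's a minimal sketch of that static golden-dataset loop; `call_app`, the example data points, and the exact-match comparison are all hypothetical stand-ins, not anything from our platform.

```python
# Minimal sketch of the classic static golden-dataset eval loop.
# `call_app` and the data points are hypothetical stand-ins for your own app and data.

golden_set = [
    {"input": "What is the daily wire transfer limit?", "expected": "10,000 USD per day"},
    {"input": "How do I reset my password?", "expected": "Use the 'Forgot password' link"},
]

def call_app(prompt: str) -> str:
    return "stub response"  # your LLM application goes here

def exact_match(actual: str, expected: str) -> bool:
    # The weakest possible similarity measure; see the judging discussion later.
    return actual.strip().lower() == expected.strip().lower()

hits = [exact_match(call_app(ex["input"]), ex["expected"]) for ex in golden_set]
print(f"golden-set accuracy: {sum(hits) / len(hits):.0%}")
```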
This is how evals have been done forever, since the birth of deep learning and before. But it doesn't quite hold up in the gen AI era, specifically because of a property of gen AI systems that I like to call brittleness, or, more technically, a lack of Lipschitz continuity. And what I mean by this is: people say AI is sensitive, AI is brittle, AI is non-deterministic, which is true.
This is all true. But that's really not the main problem that makes AI so hard to deal with. Non-determinism is really fine if you set the temperature to zero. Yes, there's caching and weird systems quirks at all of the LLM providers that make things somewhat non-deterministic, even at scale.
But for the most part, non-determinism really doesn't bite you too much when you're building AI apps. Right? You, for the most part, are constraining your outputs to temperature zero. You're running things through a workflow. It's fairly deterministic. What does bite you a lot when you're building AI apps, though, is when you send two ostensibly similar inputs to your AI application with maybe slight variance in the syntax or the semantics or the appearance of the text, but all of a sudden you get wildly different outputs on the other side.
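That kind of probe is easy to sketch: send a few ostensibly equivalent paraphrases through the app and flag divergent answers. In the snippet below, `call_app` and the naive string-equality check are hypothetical stand-ins; in practice you would use an embedding- or judge-based comparison.

```python
# Sketch: probe brittleness by sending semantically equivalent paraphrases
# and flagging divergent answers. `call_app` is a hypothetical stand-in.

paraphrases = [
    "Can I get a refund for my cancelled flight?",
    "My flight was cancelled. Am I eligible for a refund?",
    "refund for canceled flight??",
]

def call_app(prompt: str) -> str:
    return "stub response"  # your LLM application goes here

def answers_agree(a: str, b: str) -> bool:
    # Naive check; swap in an embedding- or judge-based comparison in practice.
    return a.strip().lower() == b.strip().lower()

baseline = call_app(paraphrases[0])
for prompt in paraphrases[1:]:
    if not answers_agree(baseline, call_app(prompt)):
        print(f"BRITTLE: {prompt!r} produced a divergent answer")
```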
This is what I mean when I say gen AI apps are incredibly brittle. And I think this is the actual core property that makes building with gen AI so difficult. And of course, we see this brittleness manifest itself in all sorts of fun ways. I'm sure we don't have to belabor this point too much, but you've got everything from Air Canada's customer support bot hallucinating policy, to Character.AI telling teenagers to commit suicide, to someone buying a pickup truck for one dollar on a Chevy dealership's customer portal.
I don't think we need to go through more examples of this; it happens more or less every single week, and more and more examples keep popping up. And again, this all comes back to gen AI being extremely sensitive and brittle to perturbations in the input space. Cool. So, standard evals, of course, don't cover this brittleness property.
And I would say they're insufficient in two primary senses. One is coverage. With a static dataset, you only know how good your AI system is with respect to that dataset. It might look like your AI system scores 100% on all your unit tests, on all your golden dataset points.
But if you just look around the corner for more inputs that cover your space more densely, it is entirely possible that you get perturbations that tell a very, very different story about how your AI application actually does in the wild. So, point number one: standard evals don't have sufficient coverage.
The second point is that it's actually really difficult to come up with a good measure of quality, or even of similarity between the outputs of your AI application and your ground-truth outputs. Really, what we would want is almost a human subject matter expert constantly overseeing your AI application: someone with all the right taste and sensitivity, who is also able to translate that sensitivity into a quantitative metric.
This is by no means a trivial task. I think this is the core challenge the field of AI has faced around reward modeling for the past five, six, seven-plus years. And the key challenge is: how do you take the sensitivity of a subject matter expert from a non-technical domain and translate their criteria into quantitative measures?
This is not even close to being solved with standard evals today. People are using things like exact match, classifiers, LLM-as-a-judge, and semantic similarity, and all of these have their own sets of quirks and undesirable behaviors. We'll see how this pans out in a second. Long story short, the way we think about tackling this eval problem is essentially through haizing.
Fuzz testing in the AI era. What haizing comprises is very simple in the abstract: we simulate large-scale stimuli to send to your AI application, we get the responses to those stimuli, we judge, analyze, and score the outputs of your AI application, and we use that as a signal to help guide the next round of search.
And we essentially just do this iteratively until we discover bugs and corner cases that break your AI application. And if we don't discover anything and we exhaust our search budget, that means you're essentially ready for production. So this is haizing in a nutshell; a rough sketch of the loop is below.
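This sketch is only to fix ideas: the function names, the scoring convention (lower score means worse behavior), and the stopping rule are my assumptions here, not the actual platform.

```python
# Rough sketch of the haizing loop: propose stimuli, collect responses, judge them,
# and feed the scores back into the next round of search. All callables are
# hypothetical stand-ins; lower judge scores mean worse behavior here.

def haize(app, judge, propose_stimulus, budget=1000, fail_threshold=0.2):
    history = []  # (stimulus, response, score) triples guiding the search
    for _ in range(budget):
        stimulus = propose_stimulus(history)      # next input, guided by past scores
        response = app(stimulus)
        score = judge(stimulus, response)
        history.append((stimulus, response, score))
        if score < fail_threshold:                # found a bug / corner case
            return {"status": "bug found", "stimulus": stimulus, "response": response}
    return {"status": "budget exhausted", "cases_tested": len(history)}
```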
Easy to describe, but actually really difficult to execute in practice. Both sides of the equation, scoring the output and generating the input stimuli, are quite difficult technically. I'll first talk about how we think about scoring the output, again translating from subjective criteria into quantitative metrics. We call this judging, more broadly.
You're probably familiar with the idea of using an LLM as a judge: essentially having an LLM look at the output of your AI application and decide, based on some prompt or rubric that you give to the judge, whether it's a good response or a bad response.
Tell me on a scale from 1 to 5, or 1 to 10, or what have you. Very simple to do; a bare-bones version looks something like the sketch below.
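This is just a minimal sketch assuming the OpenAI Python SDK; the rubric text, the scale, and the model choice are illustrative placeholders, not anything we actually ship.

```python
# Bare-bones LLM-as-a-judge sketch (assumes the OpenAI Python SDK).
# The rubric, scale, and model are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You are grading a customer-support answer. "
    "Score it 1-5, where 5 = fully correct, grounded, and on-policy, "
    "and 1 = wrong or off-policy. Reply with the number only."
)

def judge(question: str, answer: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
    )
    return int(resp.choices[0].message.content.strip())
```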
But this has a whole array of failure modes. First, the judge is obviously itself an LLM, so it's prone to hallucinations. It's unstable: you can have a really good articulation of your criteria, and it still doesn't operationalize well in the model. It's uncalibrated in its outputs: what a 1 means to an LLM is very different from what a 1 means to a human, and what a 5 means to a human is very different from what a 5 means to an LLM. And it has all sorts of biases: if you present one response first and the other second and then flip the order, that alone often changes the result.
If you provide context, or change some part of your rubric, that changes the result of the judge too. So it's extremely biased and extremely fickle. And TL;DR, LLM-as-a-judge, as a single off-the-shelf call to an LLM, is oftentimes not going to solve your reliability issues.
So the key question in my mind is, how do you actually QA the judge itself? Right? How do you get to a point where you can judge the judge and say that this is the best gold standard metric that I can use to then actually iterate my underlying AI application against?
So how do you judge the judge? The broad philosophy we've been taking over the past few months is essentially pushing the idea of inference-time scaling, or more broadly compute scaling, to the judging stage. We call this scaling judge-time compute. And there are two ends of the spectrum of this philosophy.
One end of the spectrum is basically: just rip RL from scratch, with no inductive biases, and train reasoning models that get really, really good at this evaluation task. The other end of the spectrum is to be very structured: don't train any models, just use off-the-shelf LLMs.
Have really strong inductive priors, but basically build agents as judges. So this is one approach: we build agent frameworks, pipelines, and workflows to do the judging task. And we have this nice little library called Verdict that does this. Very on-the-nose name, I know. But the idea behind Verdict is that there's a lot of great intuition from the scalable oversight community, which is a subfield of AI safety.
The goal of scalable oversight is basically: how do you take smaller language models and have them audit, correct, and steer stronger models? Originally this was an AI safety concept, because people were worried about how, in the age of superhuman AI, you'd have weaker models, i.e. humans, control the stronger models.
And that's how the field got started. But as a result of scalable oversight, there's been a lot of great intuition around the architectures, primitives, and units you would use to probe, reason about, and critique what a stronger model is doing. And so we baked a lot of those primitives and architectures into the Verdict library.
One example is debate: having the weaker LLMs debate each other about what the stronger model is saying and seeing if it holds up. Another example is self-verification: having the weaker LLMs verify the results of their own responses. So, have an LLM say, okay, this response from the stronger model is good or bad, and it's bad for this reason, and then maybe have an LLM critique its own reasoning. Ensembling is, of course, another classic primitive here, and so on and so forth. TL;DR: scaling judge-time compute in this particular way, by building agents as judges, actually lets you come up with extremely powerful judging systems that are also quite cheap and low latency.
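To give a flavor of what composing those primitives looks like, here's an illustrative sketch of an ensemble of weak judges, each of which self-verifies its own rationale before its vote counts. To be clear, this is not the Verdict API; it's a simplification built around a generic `llm(prompt) -> str` call you would supply.

```python
# Illustrative composition of scalable-oversight primitives: ensembling plus
# self-verification. NOT the Verdict API; `llm` is a hypothetical stand-in for
# a call to a weaker, cheaper judge model.
import statistics

def llm(prompt: str) -> str:
    raise NotImplementedError("call your weak judge model here")

def judge_once(question: str, answer: str) -> tuple[int, str]:
    rationale = llm(
        f"Grade the answer 1-5 and explain briefly. Start with the number.\n"
        f"Q: {question}\nA: {answer}"
    )
    return int(rationale.split()[0]), rationale

def self_verify(question: str, answer: str, rationale: str) -> bool:
    check = llm(
        f"Grading rationale:\n{rationale}\n"
        f"For Q: {question} / A: {answer}, is this rationale sound? Answer yes or no."
    )
    return check.strip().lower().startswith("yes")

def ensemble_judge(question: str, answer: str, n: int = 5) -> float:
    votes = []
    for _ in range(n):
        score, rationale = judge_once(question, answer)
        if self_verify(question, answer, rationale):  # keep only self-consistent votes
            votes.append(score)
    return statistics.mean(votes) if votes else float("nan")
```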
So here's a plot of price, latency, and accuracy of Verdict systems vis-à-vis some of the frontier labs' reasoning models. You can see that Verdict is beating o1 and o3-mini and, of course, GPT-4o and Claude 3.5 Sonnet on the task of expert QA verification. So this is subjective-criteria grading in expert domains.
Critically, Verdict here is powered by a GPT-4o-mini backbone. So we've basically stacked GPT-4o-mini aggressively into what is, in this case, a self-verified debate-ensemble architecture, and we're able to beat o1 for a fraction of the cost, less than a third of the cost,
and also less than a third of the latency. And this is all because we've chosen the priors in a pretty careful and intelligent way. So that's one way to scale judge-time compute: building agents to do the task. Another way to do it, and this is a lot more fun in my opinion, is to just rip RL from scratch and train models to do the judging task.
And this is something we've also been pretty excited about over the past few months. Again, standard LLM judges have a whole host of issues, but two in particular are solved by RL. One: there's a lack of coherent rationales that explain why an LLM judge thinks something is a 5 out of 5, or thinks something is good or bad.
Two: a standard LLM judge doesn't provide really fine-grained, tailored criteria for whatever idiosyncratic task and data you're looking at. But both of these can be solved by RL tuning, or specifically GRPO tuning. One paper that recently came out in this general flavor is from DeepSeek: SPCT, Self-Principled Critique Tuning.
The idea here is essentially: can you get an LLM to first propose data-point-specific criteria about what to test for? It's almost like coming up with unit tests for the specific data point you're looking at, and then having the LLM look at each of those criteria and critique the data point against each of them.
So it's instance-specific rubrics and then instance-specific rubric critiques. This is one way to train judge models with RL. We ran a pretty simple experiment using a variant of this technique to GRPO-train 600-million-parameter and 1.7-billion-parameter models. And TL;DR, this gets us to competitive performance on the RewardBench task with Claude 3 Opus, which is at 80%;
GPT-4o-mini, which is at 80%; and Llama 3 70B, at 77%. And J1-Micro, this 1.7-billion-parameter reward model, is at 80.7% accuracy on the RewardBench task. And this is all because of judge-time scaling; this is all because we did GRPO to come up with better rubric proposals and better critiques on the specific tasks we're looking at.
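To make the propose-then-critique idea concrete, here's a rough sketch of the inference-time flow. This is my own simplification, not DeepSeek's recipe or our exact training setup; `llm` is a hypothetical stand-in, and in the RL setting GRPO would reward criteria and critiques whose final score agrees with the preference labels.

```python
# Simplified sketch of an SPCT-style judge: propose instance-specific criteria
# ("unit tests" for this data point), then critique the answer against each one.
# `llm` is a hypothetical stand-in for the small judge backbone.

def llm(prompt: str) -> str:
    raise NotImplementedError("your judge model goes here")

def propose_criteria(question: str, k: int = 4) -> list[str]:
    raw = llm(
        f"List {k} concrete criteria a good answer to the following must satisfy, "
        f"one per line:\n{question}"
    )
    return [line.strip() for line in raw.splitlines() if line.strip()][:k]

def meets_criterion(question: str, answer: str, criterion: str) -> bool:
    verdict = llm(
        f"Criterion: {criterion}\nQ: {question}\nA: {answer}\n"
        f"Does the answer satisfy the criterion? Answer yes or no."
    )
    return verdict.strip().lower().startswith("yes")

def spct_score(question: str, answer: str) -> float:
    criteria = propose_criteria(question)
    if not criteria:
        return 0.0
    return sum(meets_criterion(question, answer, c) for c in criteria) / len(criteria)
```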
So training an essentially much smaller model, and spending more compute, gets you this much better performance. And similar numbers hold for the 600-million-parameter model. Cool. So that's all on judging and scoring the outputs. Equally important, though, is how you come up with inputs to throw at the AI system,
and how you run the search over time. TL;DR, there are two ways we think about this. There's fuzzing in the general sense, which is essentially: okay, I just want to come up with some variants of a customer happy path and test my system under reasonable, in-distribution user inputs.
Then there's the more fun part, which is how do you do adversarial testing, right? How do you basically emulate some person trying to sit down and prompt inject and jailbreak and mess with your AI systems at large? And this is much more aggressive in terms of how we pursue the optimization problem.
Long story short, fuzzing in the AI sense is much more structured and optimization-driven than in classical security or software or hardware testing. It's impossible to brute-force search over the input space of natural language in any reasonably short amount of time. Say you're dealing with the Llama 3 tokenizer: that's a vocabulary of 128,000 tokens per position, so the number of possible inputs explodes combinatorially with prompt length (even a ten-token prompt already has 128,000^10, on the order of 10^51, possibilities), and it's literally impossible to scan the entire input space. So you have to be very clever and guided, and prune the search space as you do haizing and fuzzing.
We treat this task essentially as an optimization problem, right? This is, long story short, just discrete optimization. There's plenty of rich literature over the past 60, 70 years of discrete math research to go and support how to do this sort of task. We have to massage it, of course, to work for the LLM domain.
But TL;DR, the search space is just natural language, and the objective we're trying to minimize is essentially whatever judge we're using to score the output. We basically want to find inputs that break your AI application vis-à-vis the judge, that is, inputs whose outputs score very low on whatever measure the judge assigns.
And yeah, we can rip and throw a bunch of fun optimization algorithms at this. We can use gradient-based methods to backprop all the way from the judge loss through the model to the input space and use that to guide what tokens we want to flip. We can use various forms of tree search and MCTS.
We can search over the latent space of embedding models and then map from the embedding models to text and throw that at the underlying AI application or the application under test. We can use DSPy. We can use all sorts of other great tools and tricks to solve this optimization problem.
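As a toy illustration of the optimization framing, nothing like the real search machinery: treat the judge score as the objective and keep the mutations that drive it down. `app`, `judge`, and `mutate` below are hypothetical stand-ins, and the hill climb is a deliberately naive baseline.

```python
# Toy sketch of the optimization framing: the judge score is the objective,
# and we search for inputs that minimize it. A naive mutation-based hill climb;
# the real thing would use gradient-guided, tree-search, or latent-space methods.
# `app`, `judge`, and `mutate` are hypothetical stand-ins.

def attack(app, judge, mutate, seed_prompt: str, steps: int = 200):
    best_prompt = seed_prompt
    best_score = judge(seed_prompt, app(seed_prompt))
    for _ in range(steps):
        candidate = mutate(best_prompt)   # paraphrase, inject instructions, perturb formatting...
        score = judge(candidate, app(candidate))
        if score < best_score:            # lower judge score == worse behavior found
            best_prompt, best_score = candidate, score
    return best_prompt, best_score
```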
Some fun case studies in the last few minutes. TL;DR, you can probably imagine that this haizing thing matters a lot for people in regulated industries, and indeed we work a lot with banks, financial services, healthcare, and so on. We did something recently where we haized the largest bank in Hungary.
They had a loan-calculation AI application that they're showing to customers, and it had to follow what they called an 18-line code of conduct. We basically threw everything under the sun at it from our platform, in terms of optimization and scoring, to emulate adversaries, and we were able to discover a ton of prompt injections and jailbreaks,
and honestly just unexpected corner cases that they hadn't accounted for in their code of conduct. They were able to patch this up and finally unblock their path to production. We're doing this right now for a Fortune 500 bank that wants to do outbound debt collection with voice agents.
This is actually a bit more complex a problem, because now we're not just testing in the text space; we're also introducing a lot of variance into the audio signal itself: adding background noise, stacking weird static into the input, changing frequencies, and so on. But it's still an optimization problem at the end of the day.
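For the audio side, here's a sketch of the kind of signal-level perturbations I'm describing, using NumPy; the noise levels, the 60 Hz hum, and the sample rate are illustrative choices, not what we actually use.

```python
# Sketch of simple audio-domain perturbations: background noise, sparse static
# clicks, and a low-frequency hum. Levels and sample rate are illustrative.
import numpy as np

def perturb_audio(samples: np.ndarray, sample_rate: int = 16_000) -> np.ndarray:
    perturbed = samples + np.random.normal(0.0, 0.01, size=samples.shape)    # background noise
    clicks = np.random.uniform(-0.5, 0.5, size=samples.shape)
    perturbed += clicks * (np.random.rand(*samples.shape) < 0.001)           # sparse static clicks
    t = np.arange(len(samples)) / sample_rate
    perturbed += 0.005 * np.sin(2 * np.pi * 60.0 * t)                        # 60 Hz hum
    return np.clip(perturbed, -1.0, 1.0)
```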
TL;DR, what took this team three months or so to do with their internal ops teams took, in their own words, only five minutes for our platform to do. So scaled-up adversary emulation works for this task as well. In a slightly different vein, for another voice agent company, we've been helping them scale up their eval suite.
So not so much haizing, but basically scaling up their subjective human annotation through Verdict. They've seen a 38% increase in agreement with ground-truth human labels using Verdict, as opposed to using their internal ops teams. And what we're using here is essentially a tried-and-true architecture from the Verdict library, which we call a rubric fan-out.
It basically proposes individual unit tests and criteria for any particular data point, critiques the response against them, self-verifies the critique, and then aggregates the results at the very end. Cool. So we've got a few minutes left for questions. But yeah, haizing is a ton of fun, and I think it matters a lot for this new era of software we're building.
We're hiring very aggressively. We're facing what I would deem to be insurmountable enterprise demand, and we're only a team of four people, so we really need to scale up. And yeah, we're based in New York, in case you guys want to move out to the city.
And yeah, any last questions for me? For the haizing input, is it multi-shot or single-shot? Great question. We do both: single-turn, multi-turn, and persistent conversations if you're doing voice. All sorts of modalities, all sorts of inputs. Yeah.