E-Values: Evaluating the Values of AI — Sheila Gulati and Nischal Nadhamuni

Sheila Gulati: Well, thank you all for being here. It is a small but excited group of people around evals, and evals may not be the sexiest talk at this conference, but it might be one of the most important. Nischal and I will spend time today speaking through why we think right now is a seminal moment for evals, and the import of the changes we're seeing in the market as we move into these agentic frameworks. As things become more automated, as things happen in a more agentic way, we need to make sure we understand what's happening with evals. We're all here, obviously, because we're very long AI; we care a lot about AI. But we have to really understand what's going on with it, and the crux of it is: can we evaluate the performance of our systems in relationship to our goals for those systems? That's what we'll be walking through today.

So, about us. My name is Sheila, and I'm joined by Nischal. We're an investor-investee pair, VC and portfolio company, and I think there should be more talks like this, because some of the most fabulous portfolio companies are not great at giving shoutouts to themselves, so I get to give the shoutout to Klarity. Nischal is the co-founder and CTO of Klarity. He's sitting up here, and he'll be speaking very soon as well. Klarity just announced their massive $70 million Series B financing on Monday, so we're going to hear about that journey. It hasn't been a short journey, and I think that can be quite inspirational for a lot of you as you think about what you're building and the size and scope of what you're building.

That company builds what we call exponential organizations. What does that mean? Software transformed how we all work: the always-synced, always-on nature of software systems for internal work changed the efficiency and effectiveness of the businesses we build. Klarity is doing that for your external world. Most of your relationships with your customers and your partners are handled through documents, and those documents are one-off negotiated, one-off pieces of paper. Klarity is automating all of that and allowing you to build exponential organizations through that type of real-time relationship with those documents. So you'll hear more about that, and how Klarity has implemented a number of eval systems, later in this presentation.

Myself: I founded a venture firm called Tola Capital well over a decade ago now.
The reason I founded the firm was that I was working at Microsoft, running the database and developer platforms, and fighting the good fight against team Windows to launch Azure. Being team cloud at a company that was built on Windows was a really difficult thing, but it was fabulous as well, and obviously Azure has gone on to do pretty okay for itself. So the genesis of Tola Capital really was: how do we think about this next generation of applications that would be cloud-based? And now we're even more excited by the opportunity to bring the next generation of AI-enabled applications to the fore.

I'm going to walk down memory lane for one quick moment. What we saw in the advent of the cloud was very clear: it would favor scale. The capex requirements, the physicality of building out those data centers, were so expensive that you had to have a search business or an office business or a retail business to fund that development, and then we would build on top of it. The evaluation of those platforms was more straightforward: am I offering you speeds and feeds? Am I offering you performance? And then, of course, what functionality at what price? The evaluation of the physicality of that world was more straightforward.

Now, as we enter this AI world, we see some similar characteristics: large capex and large mega-cap participation, with applications running on top of models, models running on top of clouds and being trained by the clouds, the compute cost of inference, the ever-so-difficult-to-get-your-hands-on chips, and the talent of all of you in the room, the AI engineers who bring this to be. But in addition, you have the proliferation of open source and open-source models, and the ability of those models to do an incredible job at delivering incredibly complicated scenarios; they are really, really strong contenders. So you have a proliferation of players, a proliferation of models, a deep academic heritage behind open-source AI development, and great mega-cap partnerships for those models as well.

So where are we going? It's so important to think about this: AI is more than just another tool. It is a reflection of us, of our understanding of the world, our intentions, our preferences, and, at the end of the day, our society. We'll talk a little bit about what that means individually and collectively, especially in a world where agents will represent more of us as individuals.

So let's talk about some of the shifts here. We started with AI eating everything on the internet; I like to say all the garbage and all the gold on the internet was consumed by AI. Doing that gave us emergent behaviors, which is of course a fancy way of saying we're not exactly sure how it knows what it knows, but we know it knows it. Then we had trained, curated data sets. But now we're moving into the era of self-taught, self-learning, and self-sufficient models. And if you pause and think about that for a second: before we get into a truly self-sufficient era, we really need to fix evals, because in that era it gets to be too late. That transition to full automation is happening faster and more aggressively than any of us thought.
So what does this mean? Narrow AI evaluation was: I'm a hammer, you're a nail, I'm hitting you; am I doing that right? You're an image; am I classifying that image correctly? I could tell whether I was performing that task in a pretty straightforward way.

Then you get into broad AI, and this radar chart speaks to how evaluation becomes more multifaceted. What's my capability and intelligence? Am I serving a domain, and do I understand that domain? What are my values, and whose values do I care about: mine, yours, the user's, society's, today's, tomorrow's? All of these questions come together. Safety: okay, do no harm, but what does that look like? That could be different for different people. And of course, context and end-user awareness, which is often not discussed in the eval world: your end user is who you're delivering and developing these solutions for, but they're often an afterthought in terms of evals and evaluation. And then we have to do all of that across the different modalities of image and text and video, and all of these together create a much larger problem around evals.

Today, the tools are still simplistic. We're going to dive into each of these areas of simplicity and talk about some of the innovation happening to move this forward, and then Nischal will show us how Klarity has dealt with each of these issues.

First, benchmark hacking. I like to call this "solve for x," because that's how a lot of these benchmarks and leaderboards work today from an eval perspective. It's interesting, because scoring high on the benchmarks is pretty easy to do if you know what we're solving for: if you understand x, you can do everything to solve for x, and you can look as intelligent as you want at solving for x. But the reality is that you may not understand anything about what's happening; you may just be solving for x, and that's a pretty scary reality. So we say: these LLMs passed an AP exam on a particular subject. Does that mean they were trained well on the questions that have heretofore appeared on that AP exam, or does that mean they actually understand the subject? It's a really interesting question. But what we're seeing on a lot of the AI leaderboards is that the best solvers-for-x are at the top, and that's a problem.
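To make the memorization worry concrete: one crude probe for benchmark contamination is to check how much of a test question already appears verbatim in training text. A toy, illustrative sketch of such a probe follows; nothing here is an actual leaderboard's tooling.

```python
# Toy contamination probe (illustrative only): if most of a benchmark
# question's n-grams appear verbatim in the training corpus, a high score
# may reflect memorization of "x" rather than understanding of the subject.

def ngrams(text: str, n: int) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_score(question: str, corpus: list, n: int = 5) -> float:
    """Fraction of the question's n-grams found verbatim in the corpus."""
    q_grams = ngrams(question, n)
    if not q_grams:
        return 0.0
    corpus_grams = set().union(*(ngrams(doc, n) for doc in corpus))
    return len(q_grams & corpus_grams) / len(q_grams)

if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog every single day"]
    q = "does the quick brown fox jumps over the lazy dog sound familiar"
    print(f"overlap: {contamination_score(q, corpus):.2f}")  # high -> suspect
```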
One area where we're seeing research happen, from a Microsoft Research paper, is dynamic benchmarks. Rather than saying "solve for x: can you identify this image?", they're creating dynamic data sets with synthetic data, where objects move around. The images are not published, they are not public, so you don't know what the answer is coming into it; and then they can test you on spatial reasoning, visual prompting, and object recognition. The images keep changing, so the models have no ability to memorize those benchmarks. Dynamic benchmarks in general are an area where I think we'll see a lot more work.
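The gist of that idea fits in a few lines. Here is a minimal sketch in the same spirit (not the Microsoft Research implementation): every example and its ground truth are generated procedurally, so there is no static answer key to memorize.

```python
# Minimal sketch of a dynamic benchmark: scenes and spatial-reasoning
# questions are generated on the fly with a known ground truth.
import random

OBJECTS = ["cube", "sphere", "cone", "cylinder"]

def make_example(seed: int):
    rng = random.Random(seed)
    a, b = rng.sample(OBJECTS, 2)
    ax, bx = rng.sample(range(10), 2)  # distinct positions on a 0-9 axis
    question = f"A {a} is at x={ax} and a {b} is at x={bx}. Which is farther left?"
    answer = a if ax < bx else b
    return question, answer

def evaluate(model_fn, n: int = 100) -> float:
    """Score a model on n generated examples; in practice you would draw
    seeds from fresh entropy each run so no static answer key ever exists."""
    pairs = [make_example(seed) for seed in range(n)]
    correct = sum(model_fn(q) == ans for q, ans in pairs)
    return correct / n
```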
Element two is benchmarks versus real-world scenarios. It's interesting: a lot of model creators claim their models perform very well and are very generalizable, and then you actually go and ask them a specific set of questions, even if they've done super well on MMLU and the like, and you say, okay, I'm going to test this logic and this reasoning, and the answers are simply wrong, and they're wrong much of the time. There's a great example from FinanceBench, work by Patronus AI and Stanford, showing that basic financial questions were not answered correctly when models were actually benchmarked on those real-world scenarios.

And the user is not at the center of our evaluation universe. We have to put the user at the center; we have to revamp UX and feedback systems in order to really understand and capture user needs. Evaluation of UX is super difficult, and Nischal will talk a lot more about this in the case of Klarity. But that's why we're all actually here: to deliver that end-user value. And if we're not doing that evaluation, what are we doing?

Onto black-box models. This is a problem that's sort of hiding in plain sight, obviously. Evaluation is a proxy for a task; evaluation is seeking truth, it is not truth. So how do we really understand what we're doing in a world where we can't mathematically represent what neural networks have learned and how they have learned it?
The area of AI interpretability is not new, but it's super important as we think about this. Researchers are trying to open up these black-box models and show us the how and the steps. With transformer-based LLMs, that means tracing information flowing through the network. This is a good example of asking questions and seeing, hey, how am I arriving at this answer? We'll see a lot more of this interpretability work, alongside the reinforcement learning work that's happening. It's super, super, super important that we get this right now.
When we create AI models, we instill our own values in them, whether we want to or not. And these evaluations have to understand the technical capabilities, but also the underlying values. I have way more questions than answers here, as I think we all do. But: are we creating values that benefit humanity? As we go into this agentic world, you have your own model, right? The model of Sheila is going to believe more of what Sheila believes and get deeper and deeper into that echo chamber. We've seen the echo chamber of ourselves in news; we've seen the polarization that this has caused in our society. We should be asking questions about where we're going to get to as we enter this agentic world.

And this is a question, right? This is a simple two-by-two to ask it. Do we want you to choose whether you are appeased or challenged? Do we want to ground AI in present or future societal values, aspirational values versus present values? Do we want to encourage users to select their own values?
Nischal Nadhamuni: Now we're going to turn to a real-world example, using Klarity, to talk about how we've dealt with these questions in practice.
As Sheila mentioned, what we do is automate back-office workflows: things that traditionally required large teams of offshore humans, and that throughout human history have been impossible to automate because they're cognitive and they're non-repetitive. You can't just write down a simple set of steps and automate these workflows. That's what we've been working on for the last eight-odd years.

Predominantly, these are document-oriented workflows: some kind of PDF that a human being has to read as part of a company's back office. One example is revenue recognition, typically part of an accounting team: matching invoices to purchase orders, actually aligning two documents and matching them to each other, and processing tax withholdings that often come in many different languages. You probably get the sense: you get PDFs that are completely unstructured, and somebody has to go through them, because it's a tightly regulated process that sits inside a finance organization.
There was actually a really great keynote yesterday by a gentleman who pointed out that document processing tasks are generally super tough for LLMs, for a variety of reasons: the page is rotated, the scan quality is bad, there are graphs or images. It is very, very tough to get this to work, and it's not the kind of thing you can just give to ChatGPT and have it happen out of the box. We spent a long, long time, and tens of millions of dollars, building a stack that handles this.

These are some of our customers: mostly B2B SaaS companies, mostly in a five-mile radius of here. We predominantly serve their finance and accounting teams, although we're expanding beyond that.
Our journey has been a little bit unorthodox. You've probably seen the startup curve: you get the TechCrunch article in the beginning, then the trough of disillusionment. We pivoted four times between 2016 and 2020, and we were kind of at our wits' end. We did one final pivot, focused on finance and accounting teams, and found pretty strong traction. In the last couple of years, we've completely re-platformed around generative AI, and that's really been a shot in the arm for the company; we've gone on to raise $90 million plus off the back of it.

But before generative AI, we were in traditional ML for more than six years. We like to say that we started an AI company five years too late. And so a lot of the concepts around evaluations seemed pretty natural to us, and we didn't really see why generative AI had to be any different.
In supervised learning, which is the majority of what we did pre-generative-AI, you have your train split and your test split, and the metrics are pretty well defined: F1, ROC curves. There were shared benchmark tasks like sequence labeling and SQuAD, and these were all very well understood. So it wasn't immediately apparent to us why generative AI had to change any of these things.
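For contrast, that old loop really was that mechanical. A sketch of it with scikit-learn, using a synthetic dataset and an off-the-shelf model as stand-ins:

```python
# Sketch of the classic supervised eval loop: fixed splits, well-defined
# metrics. Dataset and model here are placeholders, not Klarity's.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("F1:     ", f1_score(y_te, model.predict(X_te)))
print("ROC AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```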
In the two years since then, as I mentioned, we've re-platformed around generative AI. We process more than half a million documents for customers. We have more than 15 unique LLM use cases running in production today, and more than 10 LLMs under the hood; oftentimes a single use case has multiple LLMs working together. But all this begs the question: why does eval for generative AI have to be any different? And this is the journey that we've been on in the last couple of years.
The first reason, which a number of speakers have spoken about, so I'm not going to go into too much detail, is non-deterministic performance. This is very challenging for us because ours is a very cognitively demanding task: if you upload the same PDF multiple times, it's very likely you'll get completely different responses.
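A minimal sketch of how you might quantify that instability, assuming a hypothetical `extract` callable that wraps the LLM call and returns a dict of scalar field values:

```python
# Sketch: run the same document N times and measure per-field agreement.
# `extract` is a stand-in for an LLM-backed extraction call.
from collections import Counter

def stability(extract, document: str, fields: list, runs: int = 5) -> dict:
    """For each field, the fraction of runs agreeing with the modal answer.
    Assumes field values are hashable scalars (strings, numbers)."""
    outputs = [extract(document) for _ in range(runs)]
    report = {}
    for field in fields:
        values = [out.get(field) for out in outputs]
        _, top_count = Counter(values).most_common(1)[0]
        report[field] = top_count / runs
    return report

# e.g. {"invoice_total": 1.0, "payment_terms": 0.6} -> the terms field is flaky
```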
I think, fundamentally, when you move from discriminative models (classification, regression, random forests) to generative models, the types of experiences you can provide your users expand quite a bit, and these new experiences are just much harder to evaluate. We have a tool called the Architect, where users record their business workflow, just them doing their job, and upload it to Klarity, and then we create a business requirements document. It has a flowchart and images; it's very comprehensive, like what a McKinsey would build for you. It's not at all intuitive how to eval something like that.
Another example: part of our product is natural language analytics, so you don't have to write queries. You can just ask questions like, "Hey, how has my contract population evolved over time?" Cool feature, but how do you eval something like this? And then there's the more traditional document extraction task: you're trying to find certain parts of a document that are consequential in some way. They could be legalese buried inside a document, and you need to know what accuracy you're extracting them with.
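For a task like that, the eval can stay close to classic information-retrieval metrics. A minimal sketch (the clause labels below are invented for illustration):

```python
# Sketch: score a document extraction task by comparing predicted items
# against hand labels, using exact-match precision/recall/F1.

def prf(predicted: set, gold: set) -> tuple:
    """Precision, recall, F1 over extracted (label, location) items."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return precision, recall, f1

gold = {("auto_renewal", "clause 7.2"), ("termination_fee", "clause 9.1")}
pred = {("auto_renewal", "clause 7.2"), ("governing_law", "clause 12")}
print(prf(pred, gold))  # (0.5, 0.5, 0.5)
```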
So that's why new experiences, while very rewarding to our customers, have been very challenging to evaluate.

The second reason is the rate of feature development. In our previous deep-neural-net world, a feature took five to six months to ship. We literally had dedicated teams that would annotate data, and GPUs, most of which are gathering dust now, to train on. So if it took two to three weeks to build thoughtful evals, that was completely acceptable: a small fraction of the feature development time. Now we can get features out the door in days; within 12 hours of the ChatGPT launch, we launched document chat as a feature. In that world, it's unacceptable for evals to take a week or two to build. Evals become the bottleneck to feature development.
And the third is benchmarks diverging from real performance. What we've seen at Klarity is that even slight differences in MMLU actually make a very big difference to us, because we're at the frontiers of human cognition, doing something very demanding. We've seen many cases where (I'm not going to name names) a model is supposed to be better in terms of MMLU, and then it's totally not better on our internal benchmarks. A variety of factors go into that, but it's made testing very chaotic and challenging.

So the question then is: what can be done? I'm pretty sure most application developers are running into these problems, or something like them. I'm not going to say that we have the silver bullet, and I would say we are nascent in our journey. But here are a couple of things that have worked for us.
The first is: really give yourself the gift of imperfection. Don't put the threshold of high-quality evals at the beginning of the feature development process. Good evals are not trivially cheap to build today, and they probably won't be for the foreseeable future. So we try to front-load user testing. What we've found with these generative AI features is that we have to think of each feature almost as its own product-market fit, because we're delivering experiences people have never had before: chatting with a document, natural language analytics, watching videos automatically. So there's a lot of user experience risk baked into each of these features, compared to traditional machine learning, like a recommendation engine, where you have fairly high conviction that the form factor is correct. What we try to do from a development perspective is front-load the UX risk and back-load building out the evals. Of course, I'm not advocating that you go into production and scale without building evals, but a lot of features die at the UX stage itself.
But once you decide to go into production, our framework for this is: you need to think backwards from the user experience, not forwards from what is easy to measure. Let's not just say we want to measure F1 and hope that user experience correlates to it; we want to look at the end-user value and move backwards from there. Not every eval can or should reflect the entirety of the experience, but you want your evals, in aggregate, to be reflective of the user experience. A simple exercise we do when building features is to walk down the stack: what is the end-user outcome and the business value we're driving? What is a good indicator of adoption and utilization? And at the lowest level, how are we measuring the health of the feature? That can be as simple as JSON adherence, variability in the output, and so on.
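A sketch of what that lowest rung of the stack can look like; the schema below is an invented example, not our production checks:

```python
# Sketch: a cheap "health" metric such as JSON/schema adherence,
# computed over a window of recent model outputs.
import json

REQUIRED_KEYS = {"customer", "amount", "currency"}  # illustrative schema

def json_adherence(raw_outputs: list) -> float:
    """Fraction of outputs that parse as JSON objects with required keys."""
    ok = 0
    for raw in raw_outputs:
        try:
            parsed = json.loads(raw)
            ok += REQUIRED_KEYS <= set(parsed)
        except (json.JSONDecodeError, TypeError):
            pass
    return ok / len(raw_outputs) if raw_outputs else 0.0

print(json_adherence(['{"customer": "Acme", "amount": 5, "currency": "USD"}',
                      'not json at all']))  # 0.5
```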
Every customer is basically giving us their own set of labels, so you can think of each customer as its own eval problem. We'll actually annotate data for that customer as part of our UAT process and build out customer-specific evals. And then there's still user feedback to close the loop, but we think it's very dangerous to assume that the absence of user feedback is positive feedback. So we try to put the majority of the onus on ourselves to be rigorous about metrics, and to watch for drift: are we getting documents that are very different from the population we've seen so far?
We've invested a lot in our synthetic data generation stack. We actually surveyed six-plus providers in the market and tried a bunch of them out; we weren't too happy, and so we ended up building our own synthetic data generation stack. Everything that you see here was synthetically generated. The way we think about this is: once a use case becomes large enough within the company, once we have enough customers doing it, we want to invest in customer-agnostic evals, and that's when we go down the path of synthetic data. We have a team where part of their job is just monitoring that the synthetic data is distributionally similar to what we're seeing from customers. We haven't yet cracked the problem of doing this in an automated, fully quantitative way. The other sanity check we have is, of course, looking at accuracy scores and making sure our models are not excessively over- or under-performing on synthetic data.
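A toy sketch of what an automated version of that distributional check could look like, using a single crude feature (token count) and a two-sample KS test; this is an illustration, not the stack described above:

```python
# Sketch: flag synthetic data that drifts from the real customer population
# by comparing a simple document feature with a two-sample KS test.
from scipy.stats import ks_2samp

def token_counts(docs: list) -> list:
    return [len(d.split()) for d in docs]  # crude single-feature summary

def synthetic_drift_alert(synthetic_docs: list, customer_docs: list,
                          alpha: float = 0.05) -> bool:
    """True if the synthetic docs look distributionally different from
    the customer docs (reject the same-distribution hypothesis)."""
    result = ks_2samp(token_counts(synthetic_docs), token_counts(customer_docs))
    return result.pvalue < alpha
```

A real check would compare richer features (layout statistics, embeddings, label distributions), but the shape of the test is the same.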
The next little trick we use is reducing the degrees of freedom. This is part of our architecture: we have numerous features, each of which requires its own prompts. In this case, there are four features: free text, tabular extraction, matching, and table composition. Each of these has a prompt per customer, so across a hundred-plus customers you could end up with literally thousands of prompts. So instead of manual prompt engineering, we have APEs, automated prompt engineers. Now, the trouble is that different LLMs have different levels of performance on different prompts and tasks, so you get an exponential explosion in complexity. What we did instead is say: well, we're seeing quite a bit of commonality in which LLMs do best on which use cases, so we fixed the LLM per use case. Not permanently, and we'll continuously re-evaluate as new LLMs come out, but if we can fix this dimension of freedom, it gets a lot easier to iterate. Could we eke out another percentage point of accuracy if we didn't do this? Perhaps, but building that grid-search infrastructure is just too expensive, and we don't think it's worth it. So this is another trick we use: something almost like dimensionality reduction for evals.
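To see why pinning the model helps, here is a back-of-the-envelope sketch; the feature names come from the slide, while the model names and counts are invented for illustration:

```python
# Sketch: pinning one LLM per use case collapses the (use case x customer
# prompt x LLM) eval grid, so the automated prompt engineer only has to
# search over prompts. Model names below are placeholders.

USE_CASES = ["free_text", "tabular_extraction", "matching", "table_composition"]
MODELS = ["model_a", "model_b", "model_c"]  # candidate LLMs
CUSTOMERS = 100

full_grid = len(USE_CASES) * CUSTOMERS * len(MODELS)  # every combination
pinned = {"free_text": "model_a", "tabular_extraction": "model_b",
          "matching": "model_a", "table_composition": "model_c"}
reduced_grid = len(USE_CASES) * CUSTOMERS             # model fixed per task

def pick_model(use_case: str) -> str:
    """Routing becomes a lookup; revisit the table when new LLMs ship."""
    return pinned[use_case]

print(full_grid, "->", reduced_grid)  # 1200 -> 400 eval cells
```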
And the last thing I'll mention is: identify future potential. People spend a lot of time, and rightly so, building evals for what their company does today. But you should almost have a wish list of additional use cases that you want to grow into over time, three, four, five years down the line, because we are frequently surprised that technology is evolving faster than what we can see. It is too high a bar to say we want MMLU- or BBH-level metrics for future workflows that nobody has asked us for yet. So what we keep instead are very scrappy, future-facing evals. For example, when GPT-4V came out, in about an hour we were able to say: all right, here's roughly what this can do; it's maybe not so good at pie charts, et cetera, et cetera. That muscle of building an organizational mental model very quickly, in a scrappy way, is also something that's been helpful to us.
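A scrappy probe set like that can be just a handful of hand-picked cases. A sketch, with invented probe files and a hypothetical `ask_model` callable standing in for the new model's API:

```python
# Sketch: a tiny hand-built probe set for forming a quick mental model of
# a newly released model's capabilities. All cases below are placeholders.

PROBES = [
    {"file": "bar_chart.png",  "question": "Which quarter had peak revenue?", "expect": "Q3"},
    {"file": "pie_chart.png",  "question": "What share is segment B?",        "expect": "25%"},
    {"file": "scanned_po.png", "question": "What is the PO number?",          "expect": "PO-4417"},
]

def smoke_test(ask_model) -> None:
    """Print per-probe pass/fail; enough for a one-hour capability read."""
    for probe in PROBES:
        answer = ask_model(probe["file"], probe["question"])
        verdict = "PASS" if probe["expect"].lower() in str(answer).lower() else "FAIL"
        print(f'{verdict}  {probe["file"]}: {answer!r}')
```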
I wouldn't be a startup founder if I didn't end with a shameless plug. As Sheila said, we just raised a $70 million Series B, and we are hiring across the board. So if any of this sounds interesting to you, I'd love to chat; anyone who's interested, please see either one of us afterwards.
Sheila Gulati: I've seen these paradigm shifts happen in the past, and I think one thing that's important to remember is that the people who are early to this sort of AI revolution matter disproportionately. So understand and own your power as you think about what happens with the next generation of evals, of integrating values into these systems. It's more than just that: there is a real empowerment and opportunity in spending time thinking about this, to get it right for the ecosystem and for the industry.
Let's not let the leaderboard benchmarks that we know are just kind of BS stick, right? There's too much happening today that isn't truly understanding how these systems work. Let's bring the multifaceted nature of these evaluations together. I think it's incredibly important that we bring curiosity and empathy, and help one another. I love the fact that Klarity wanted to come and talk about their journey, what they did that was right and what they did that was wrong, and that we share with one another on that. And let's introspect our own value systems, right? Today we are looking at such a pace of innovation that we really need to think: how are we the trailblazers for what's happening?
You know, I don't know if folks have read this book, The Alignment Problem. Brian Christian wrote it, and it was one of my favorite reads of the year. He had this quote: if we do see AGI, it will be an interesting mirror of society; it will tell us which values are uniquely human in nature. How do we want to drive a society that has values we can be proud of, as these trailblazers? So feel empowered; be curious; stay empathetic, kind, fair, intelligent. We're printing new intelligence for the world, and this is both your opportunity and your accountability. So thank you all for leading the future of AI.