
E-Values Evaluating the Values of AI: Sheila Gulati and Nischal Nadhamuni



00:00:00.240 | Well, thank you all for being here. It is a small but excited group of people around evals, and you
00:00:21.000 | know, evals may not be the most sexy talk at this conference, but it might be one of the most important,
00:00:26.280 | and Nischal and I will spend time today speaking through why right now we think it's a seminal
00:00:32.520 | moment for evals, and the import of the changes that we're seeing in the market as we move
00:00:38.880 | into these agentic frameworks. As things are more automated, as things happen in a more
00:00:45.340 | agentic way, we need to make sure that we understand what's happening with evals. We're all here,
00:00:50.720 | obviously, because we're very long AI, we care a lot about AI, but we have to really understand what's
00:00:56.040 | going on with it. And really, if you think about the crux of it: can we evaluate the performance
00:01:00.900 | of our systems in relation to our goals for those systems? That's what we'll be walking
00:01:06.240 | through today. And so, about us: my name is Sheila and I'm joined by Nischal. It's great, we're an
00:01:12.800 | investor-investee pair, right, VC and portfolio company, and I think there should be more talks
00:01:19.120 | like this, because some of the most fabulous portfolio companies are not great at giving shoutouts to
00:01:25.160 | themselves, so I get to give the shoutout to Klarity. Nischal is the co-founder and CTO of Klarity.
00:01:31.980 | He's sitting up here, but he'll be speaking very soon as well. And Klarity just announced their massive
00:01:37.520 | 70 million dollar Series B financing on Monday, so we're going to hear about that journey. It hasn't
00:01:43.020 | been a short journey, so I think that can be quite inspirational for a lot of you as you think about what
00:01:47.780 | you're building and the size and the scope of what you're building. And so that company builds what we call
00:01:52.140 | exponential organizations. Well, what does that mean? Software transformed how we all work, right?
00:01:58.960 | The always-synced, always-on nature of software systems for internal work changed
00:02:05.720 | the nature of the efficiency and effectiveness of the businesses that we build. Klarity is
00:02:12.380 | doing that for your external world. Most of your relationships with your customers and your partners
00:02:17.960 | are dealt with through documents. Those documents are one-off negotiated, they're one-off pieces of paper,
00:02:23.780 | right? Klarity is automating all of that and then allowing you to build exponential organizations
00:02:30.500 | through that type of real-time relationship with those documents. So you'll hear more about that and how Klarity
00:02:37.780 | has implemented a number of eval systems later in this presentation. Myself, I founded a venture firm called Tola Capital well over a decade ago now,
00:02:47.780 | and the reason I founded the firm was I was working at Microsoft, I was running the database and developer platforms,
00:02:53.600 | and was fighting the good fight against team Windows to launch Azure. So being team cloud at a company that was based off of Windows was a really difficult thing, but it was fabulous as well, and obviously Azure has gone on to do pretty okay for itself. So the genesis of the firm, Tola Capital, really was, hey, how do we think about this next generation of applications that would be cloud-based? And now we're even more excited by
00:03:23.580 | the opportunity to bring this next generation of AI-enabled applications to the fore. So I'm going to walk down memory lane for one quick moment and say, you know, what we saw in the advent of the cloud was very clear: it would favor scale. The capex requirements, the physicality of building out those data centers, was so expensive that you had to have a search business or an office business or a retail business to go fund that development, and then we would build on top of that, right?
00:03:53.400 | The evaluation of those platforms was more straightforward: am I offering you speeds and feeds, am I offering you performance, and then of course you're layering on what functionality at what price I'm offering. But the evaluation of the physicality of that world was more straightforward. Now, as we enter this AI world, we're saying, wow, there are some of the similar characteristics of large capex and large mega-cap participation, right, where we have, obviously, these systems running on top of AI
00:04:23.380 | models, models running on top of clouds and being trained by those clouds, and that compute cost, that inference, the ever-so-difficult-to-get-your-hands-on chips, the talent of all of you in the room, right, the AI engineers that bring this to be. But in addition to that, you have the proliferation of open source and open source models, and just the ability of those models to do an incredible job at delivering incredibly complicated scenarios that are, you know, really,
00:04:53.360 | really strong contenders. And so you have a proliferation of players, a proliferation of models, you have a deep academic heritage of open source and AI development, and you have great mega-cap partnerships for those models as well. So where are we going? It's so important to think about AI as more than just another tool, right? It is a reflection of us. It is a reflection of our understanding of the world, our intentions, our preferences,
00:05:23.340 | and at the end of the day, our society. And we'll talk a little bit about what that means individually and collectively, especially in this world where agents will represent more of us as individuals. So let's talk about some of the shifts here. You know, we started with AI eating everything on the internet; I like to call this all the garbage and all the gold on the internet was consumed by AI. Then we moved into, you know, doing that gave us emergent behaviors. That's of course a
00:05:53.320 | great fancy way of saying we're not exactly sure how it knows what it knows, but we know it knows it, right? In this era, then, we have trained data sets, curated data sets, but we're moving into the era of self-taught, self-learning, and self-sufficient models. And so if you pause and think about that for a second: before we get into a truly self-sufficient era, we really need to fix evals, right, because it kind of gets to be too late in that era. And so that transition to full automation is happening faster and more
00:06:23.300 | aggressively than any of us thought. So what does this mean? Narrow AI evaluation was: I'm a hammer, you're a nail, I'm hitting you, am I doing that right? You're an image, am I classifying that image? And I could tell whether I was performing
00:06:39.060 | that task in a pretty straightforward way. Then you get into broad AI, right, and this radar chart speaks a little bit to how this evaluation becomes more multifaceted. What's my capability and intelligence? Am I serving a domain, do I understand that domain? What are my values, and whose values do I care about: mine, yours, the user's, society's, today's, tomorrow's? All of these questions come together. Safety, okay, do
00:07:09.040 | no harm, what does that look like? That could be different for different people. And of course the context and end-user awareness, which is often not discussed in an eval world, right? Your end user is who you're delivering and developing these solutions for, but they're often an afterthought in terms of evals and evaluation. And then we have to do that across all of the different modalities of image and text and video, and kind of all of these together is creating a much larger problem around evals.
00:07:39.020 | So today the tools are still simplistic. We're going to dive into each of these areas where tools are simplistic and talk about some of the innovation happening to move this forward, and then what Nischal will do is show us how Klarity has dealt with each of these issues related to evals as well. So, benchmark hacking. I like to call this, you know, what am I solving for? Solve for x is how a lot of these benchmarks and leaderboards work today from an eval perspective, and it's interesting,
00:08:09.000 | because scoring high on the benchmarks is pretty easy to do if you know what we're solving for. If you understand x, you can do everything to solve for x, and you can look as intelligent as you want at solving for x. But the reality is you may not understand anything about what's happening; you may be just solving for x, and that's a pretty scary reality on some of these things. And so we say, well, these LLMs passed an AP exam on a particular subject. Does that mean it was trained well on the questions that have heretofore
00:08:38.980 | come up on that AP exam, or does that mean it actually understands the subject? It's a really interesting question, but what we're seeing on a lot of the AI leaderboards is the best solvers for x are at the top of those leaderboards, and that's a problem.
00:08:52.340 | So one area where we're seeing research happen: this is a Microsoft Research paper around dynamic benchmarks. Rather than saying solve for x, can you identify this image, they're creating dynamic data sets, basically with synthetic data, where you can say, hey, we're moving objects around. These are not published, they are not public, right, so you don't know what the answers are coming into it, but then I can test you on spatial reasoning, visual prompting,
00:09:22.320 | object recognition. The images are changing, so the models have no ability to memorize those benchmarks. This is an area, dynamic benchmarks in general, where I think we'll see a lot more work.
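As a rough illustration of the idea (a toy text-based stand-in, not the Microsoft Research implementation; `ask_model` is a hypothetical callable), a dynamic benchmark can generate each test item fresh and derive the ground truth from the generation parameters themselves, so there is no fixed answer key to memorize:

```python
import random

OBJECTS = ["cube", "sphere", "cone"]

def make_spatial_item(rng: random.Random):
    """Generate one synthetic spatial-reasoning item with a known answer."""
    a, b = rng.sample(OBJECTS, 2)
    ax, bx = rng.sample(range(10), 2)          # random positions on a 10-unit line
    question = f"A {a} is at x={ax} and a {b} is at x={bx}. Which object is further left?"
    answer = a if ax < bx else b               # ground truth falls out of the generator
    return question, answer

def run_dynamic_benchmark(ask_model, n_items=100, seed=None):
    """Score a model on freshly generated items; a new seed means a new, unseen test set."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_items):
        question, answer = make_spatial_item(rng)
        if answer.lower() in ask_model(question).lower():
            correct += 1
    return correct / n_items
```

Because a new seed yields a new, unseen test set, a high score has to come from actually doing the reasoning rather than from having seen the benchmark before.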
00:09:32.320 | Benchmarks versus real-world scenarios is element two on this. You know, it's interesting, there are a lot of model creators that claim that their models perform very well and are very generalizable, and then you actually go and ask them a specific set of questions, even if they've done super well on MMLU and these things,
00:09:52.300 | and you say, okay, I'm going to go test this logic and test this reasoning, and the answers are simply wrong, and they're wrong much of the time. And this is a great example from FinanceBench, which was Patronus AI's work and Stanford's work, around saying, hey, you know, these basic financial questions were not answered when you actually benchmarked it on those real-world scenarios. And then the user is not at the center of our evaluation universe, right? We have to put the user at the
00:10:22.280 | center. We have to revamp the UX and feedback systems in order to really understand and capture more user needs. Evaluation of UX is super difficult, right, and Nischal will talk a lot more about this in the case of Klarity, but that's why we're all actually here: to deliver that end-user value. And so if we're not doing that evaluation, what are we doing?
00:10:45.280 | Black box models, right?
00:10:47.200 | So this is a problem that's sort of hiding in plain sight, obviously.
00:10:52.000 | Evaluation is a proxy for a task.
00:10:55.040 | Evaluation is seeking truth.
00:10:56.040 | It is not truth.
00:10:57.420 | And so how do we really understand what we're doing in a world where we can't mathematically
00:11:03.440 | represent what neural networks have learned and how they have learned that?
00:11:08.100 | And so the area of AI interpretability is not new, but it's super important as we think
00:11:14.040 | about the rise of these complex AI systems.
00:11:17.480 | And so researchers are trying to open up these black box models, show us the how and the steps
00:11:22.500 | in this.
00:11:23.500 | And with transformer-based LLMs, that means tracing information flowing through the network.
00:11:29.020 | And so this is a good example of sort of asking questions and seeing, hey, how am I answering?
00:11:34.900 | How are these pieces coming together?
00:11:36.240 | We'll see a lot more of this interpretability work in the reinforcement learning work that's
00:11:41.220 | happening with the current GPT models.
00:11:42.800 | It's super, super, super important that we get this right now.
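As a very rough way to see what "tracing information flowing through the network" can look like in practice (a toy attention probe with GPT-2 via Hugging Face transformers, not the method from the interpretability work referenced here), you can inspect which earlier tokens the model attends to at each layer:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_attentions=True)
model.eval()

text = "The capital of France is"
inputs = tokenizer(text, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq, seq) tensor per layer.
# Average over heads and look at where the final position attends in each layer.
for layer, attn in enumerate(outputs.attentions):
    last_token_attn = attn[0].mean(dim=0)[-1]        # attention from the last token
    top = torch.argmax(last_token_attn).item()
    print(f"layer {layer:2d}: last token attends most to {tokens[top]!r}")
```

Attention weights are only a crude proxy for information flow, but even a probe like this makes the model a little less of a black box than a single benchmark score does.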
00:11:48.360 | Now, values, right?
00:11:51.260 | When we create AI models, we do instill our own values in them, whether we want to or not,
00:11:56.260 | right?
00:11:57.260 | And these evaluations have to understand the technical capabilities, but also the underlying
00:12:02.340 | values that we're putting into the models.
00:12:05.260 | And there's a lot of questions on this.
00:12:06.820 | And I have way more questions than answers, as I think we all do.
00:12:10.820 | But it's, you know, are we creating values that benefit humanity?
00:12:13.720 | As we go into this agentic world and you have your own model, right?
00:12:19.280 | So the model of Sheila is going to believe more of what Sheila believes and get deeper and deeper
00:12:30.340 | into that Sheila-ness.
00:12:32.740 | Is that a good thing, right?
00:12:33.740 | We've seen the echo chamber of ourselves in news.
00:12:36.300 | We've seen the polarization that this has caused in our society.
00:12:41.000 | We should be asking questions about where we're going to get to as we enter this agentic world.
00:12:47.640 | And this is kind of a, you know, it's a question, right?
00:12:51.140 | This is a, you know, a simple two-by-two to ask the question.
00:12:54.420 | Do we want to appease or challenge users?
00:12:56.740 | Do we want you to choose whether you're appeased or challenged?
00:13:00.920 | Do we want to ground AI in present or future societal values, aspirational values versus present
00:13:08.140 | values?
00:13:09.140 | Do we want to encourage users to select their own values?
00:13:12.060 | Should model makers be responsible for this?
00:13:15.100 | Lots of questions, fewer answers.
00:13:17.060 | Now we're going to turn it into a real-world example, leveraging Klarity to talk about how
00:13:23.260 | the company is replicating human cognition.
00:13:25.520 | Nischal?
00:13:26.520 | Thanks, Sheila.
00:13:28.020 | Thank you for having me here.
00:13:32.980 | My name's Nischal.
00:13:33.980 | I'm co-founder and CTO of Klarity.
00:13:36.120 | As Sheila mentioned, what we do is we automate back office workflows.
00:13:40.420 | Things that traditionally required large teams of offshore humans and throughout human history
00:13:46.400 | have been impossible to automate because they're cognitive and they're non-repetitive.
00:13:50.580 | You can't just write down a simple set of steps and automate these workflows.
00:13:54.000 | So that's what we've been working on for the last eight-odd years.
00:13:57.960 | Predominantly, these are document-oriented workflows.
00:14:00.540 | So it's some kind of PDF that a human being is having to read as part of a company's back office.
00:14:06.420 | One example of this is revenue recognition, typically part of an accounting team, matching
00:14:11.420 | invoices to purchase orders, actually aligning two documents and matching them to each other,
00:14:17.500 | processing tax withholdings that often come in many different languages.
00:14:20.800 | But you probably get the sense, you get PDFs that are completely unstructured.
00:14:24.740 | Somebody has to go through them because it's a tightly regulated process, part of a finance
00:14:28.360 | and accounting team.
00:14:29.580 | And today this is done completely manually.
00:14:31.680 | There was actually a really great keynote yesterday by a gentleman who pointed out that document
00:14:37.800 | processing tasks are generally super tough for LLMs for a variety of reasons.
00:14:43.760 | If the page is rotated or the scan quality is bad, if you have graphs or images, if you
00:14:49.320 | have tables inside of these.
00:14:51.320 | It is very, very tough to get this to work, and it's not the kind of thing you can just
00:14:54.640 | give to ChatGPT and it happens out of the box.
00:14:57.320 | This is basically exactly what we do.
00:14:58.700 | And we spent a long, long time and tens of millions of dollars building a stack that's
00:15:03.080 | able to deal with these kinds of documents.
00:15:05.260 | These are some of our customers, mostly B2B SaaS companies, mostly in a five-mile radius
00:15:11.600 | around where we are right now.
00:15:13.680 | And we predominantly serve their finance and accounting teams, although we're expanding
00:15:16.540 | quite a bit beyond that.
00:15:19.000 | Our journey has been a little bit unorthodox.
00:15:20.940 | It's kind of like that, you've probably seen the startup curve of like the trough of disillusionment
00:15:25.080 | and you get the TechCrunch article in the beginning.
00:15:27.140 | Something like that.
00:15:28.140 | We founded the company in 2016.
00:15:30.600 | We pivoted four times between 2016 and 2020, and we were kind of at our wits end, about
00:15:35.380 | to give up.
00:15:36.380 | We did one final pivot and focused on finance and accounting teams and found pretty strong
00:15:42.040 | product market fit there.
00:15:43.480 | In the last couple of years, we've completely re-platformed around generative AI, and that's
00:15:47.140 | really been a shot in the arm for the company, and we've gone on to raise 90 million plus off
00:15:51.960 | of that.
00:15:54.460 | But before generative AI, we were in traditional ML for more than six years.
00:15:58.880 | We like to say that we started an AI company five years too late.
00:16:02.580 | And so a lot of the concepts around evaluations were, you know, they seemed pretty natural to
00:16:06.960 | us and we didn't really see why generative AI had to be any different.
00:16:10.580 | In supervised learning, which is a majority of what we did pre-gen AI, you have your train
00:16:14.660 | split, your test split, the metrics are pretty well defined, F1, ROC curves, there are these
00:16:20.520 | like shared benchmark tasks like sequence labeling and SQuAD, and these were all very well understood.
00:16:24.940 | So it wasn't immediately apparent to us why generative AI has to change any of these things.
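For contrast, the classic supervised-learning eval loop described here fits in a few lines; a minimal sketch with scikit-learn on toy data (a generic classifier, not Klarity's actual models):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Toy stand-in for a labeled dataset; real workflows would load annotated documents.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Well-defined, task-agnostic metrics on a held-out split.
print("F1:     ", f1_score(y_test, model.predict(X_test)))
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```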
00:16:31.760 | And in the two years since then, we've, as I mentioned, re-platformed around generative AI, and
00:16:35.960 | that's gotten us quite a bit more scale.
00:16:37.940 | We process more than half a million documents for customers.
00:16:40.940 | We have more than 15 unique LLM use cases running in production today, and more than 10 LLMs under
00:16:48.440 | the hood.
00:16:49.440 | Oftentimes, a single use case has multiple LLMs working together.
00:16:54.260 | But all this kind of begs the question, why does eval for generative AI have to be any different
00:16:59.060 | than traditional ML?
00:17:00.060 | And this is kind of the journey that we've been on in the last couple of years.
00:17:03.340 | The first, and a number of speakers have spoken about this, so I'm not going to go into
00:17:06.520 | really too much detail, is non-deterministic performance.
00:17:09.660 | Very challenging for us because it's a very cognitively demanding task, so if you upload
00:17:12.700 | the same PDF multiple times, very likely you'll get completely different responses.
00:17:18.100 | New user experiences.
00:17:19.360 | I think fundamentally when you move from discriminative models, classification, regression, random forests,
00:17:25.460 | to generative models, the types of experiences you can provide your users expand quite a bit.
00:17:30.660 | And these new experiences are just much harder to evaluate.
00:17:33.300 | So, a couple of examples from what we do.
00:17:35.760 | We have this tool called the Architect, where users will record their business workflow, like
00:17:40.700 | just them doing their job, upload it to Clarity, and then we'll create a business requirements
00:17:44.720 | document out of that.
00:17:45.800 | It's typically like a 10-page Word document.
00:17:47.800 | It has a flowchart, images, very comprehensive, like what a McKinsey would build for you.
00:17:53.300 | It's not at all intuitive how to eval something like that.
00:17:56.760 | Another example, part of our product is you can do natural language analytics, so you don't
00:18:00.640 | need like a BI specialist.
00:18:01.860 | You could just ask it questions like, "Hey, how's my contract population evolved over time?
00:18:05.660 | Give it to me in a stacked bar chart."
00:18:07.220 | It'll do that for you.
00:18:08.220 | Cool feature, but like, how do you eval something like this?
00:18:11.660 | And then a more traditional document extraction task.
00:18:14.380 | You are trying to find certain parts of a document that are consequential in some way.
00:18:18.560 | They could be tabular.
00:18:19.560 | They could be legalese buried inside of a document.
00:18:21.400 | And you need to know what accuracy you're doing that with.
00:18:24.600 | So this is why new experiences, while very rewarding to our customers, have been very
00:18:28.920 | challenging to us from an eval perspective.
00:18:31.420 | The second is the rate of feature development.
00:18:33.400 | So in our previous deep neural net world, DNN world, a feature took like five to six months
00:18:37.980 | to build.
00:18:38.980 | We literally had teams that would annotate data, dedicated teams to annotate data, GPUs,
00:18:43.160 | most of which are gathering dust now, to train.
00:18:45.480 | And so end to end it was like six months.
00:18:47.420 | So if it took like two to three weeks to build evals, thoughtful evals, that was completely
00:18:51.380 | acceptable.
00:18:52.380 | That was a small fraction of the feature development time.
00:18:54.960 | Now what we're seeing is we can get features out of the door in days.
00:18:58.360 | Within 12 hours of the ChatGPT launch, we launched document chat as a feature.
00:19:02.440 | And so in that world, it's unacceptable that it takes a week, two weeks to build evals.
00:19:07.280 | It now becomes the bottleneck to feature development.
00:19:11.540 | And the third is benchmarks diverging from performance.
00:19:14.040 | Sheila talked about this a little bit.
00:19:16.360 | What we've seen at Clarity is even slight differences in MMLU actually make a very big difference
00:19:20.900 | to us because we're kind of at the frontiers of human cognition doing something that's very
00:19:25.000 | challenging for most human beings.
00:19:27.360 | We've seen many cases where, I'm not going to name names, a model is supposed to be better
00:19:31.300 | in terms of MMLU, and then it's totally not on our internal benchmarks.
00:19:34.880 | There's a variety of factors that go into it, but it's made testing very chaotic and challenging
00:19:39.100 | for us.
00:19:40.100 | And so the question then is, what can be done?
00:19:42.860 | And I'm pretty sure most application developers are running into these problems or something
00:19:46.760 | similar.
00:19:47.760 | I'm not going to say that we have the silver bullet, and I would say we are nascent in our
00:19:50.840 | eval journey.
00:19:51.840 | But here's a couple of things that have worked for us.
00:19:54.780 | So the first is, really give yourself the gift of imperfection.
00:19:59.040 | Don't put the threshold of high quality evals at the beginning of the feature development
00:20:03.100 | lifecycle.
00:20:05.100 | Good evals are not trivially cheap to build today.
00:20:07.340 | They probably are not going to be for the foreseeable future.
00:20:09.880 | And so we really try to front load user testing.
00:20:12.520 | What we found with these generative AI features is, we have to think of each feature almost as
00:20:17.240 | its own product market fit, because we're delivering experiences that people have never had before.
00:20:21.480 | chatting with the document, natural language analytics, watching videos automatically.
00:20:26.720 | And so there's a lot of user experience risk actually baked into each of these features.
00:20:31.260 | Compared to traditional machine learning like a recommendation engine where you have fairly
00:20:35.020 | high conviction that the form factor is correct.
00:20:37.600 | So what we try to do from a development perspective is front load the UX risk, back load building out
00:20:42.900 | evals.
00:20:43.900 | Of course, I'm not advocating that you go into production and scale without building evals,
00:20:47.240 | but a lot of features die at the UX stage itself.
00:20:50.120 | Let them die before you build out evals.
00:20:53.360 | But once you decide to go into production, our framework for this is: you need to
00:20:57.960 | think backwards from the user experience, not forwards from what is easy to measure.
00:21:02.580 | So let's not just say we want to measure F1 and hope that user experience correlates to
00:21:06.360 | that.
00:21:07.360 | We want to look at the end user value, move backwards from there,
00:21:09.360 | where not every eval can or should reflect the entirety of the experience, but you want your
00:21:14.280 | evals in aggregate to be reflective of the user experience.
00:21:17.200 | A simple exercise that we do for this when we're building features is we kind of walk down
00:21:21.200 | the stack of what is the end user outcome, the business value we're driving, what is a good
00:21:25.200 | indicator of adoption/utilization, and at the lowest level, how are we measuring health of this
00:21:29.480 | feature?
00:21:30.480 | It's as simple as like JSON adherence, variability in the output, and so on.
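At that lowest, health-of-the-feature level, a JSON-adherence check can be as simple as the sketch below; the required fields and the `run_use_case` call are hypothetical stand-ins, and a production pipeline would more likely use a schema validator such as jsonschema or pydantic:

```python
import json

REQUIRED_FIELDS = {"invoice_id", "po_number", "amount", "currency"}   # hypothetical schema

def json_adherence(run_use_case, prompts):
    """Fraction of LLM responses that are valid JSON and contain every required field."""
    ok = 0
    for prompt in prompts:
        raw = run_use_case(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue                                  # not even parseable JSON
        if isinstance(parsed, dict) and REQUIRED_FIELDS <= parsed.keys():
            ok += 1
    return ok / len(prompts)

# Tracked over time, a drop in this number flags a regression in the feature's health
# before any user-visible accuracy metric moves.
```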
00:21:35.680 | So in practice, this is what it looks like.
00:21:37.540 | Every customer is basically giving us their own set of labels, so you could think of this
00:21:40.480 | as like bespoke enterprise AI.
00:21:42.840 | And so we'll actually annotate data for that customer as part of our UAT process, build out
00:21:47.400 | use case-specific accuracy metrics for them.
00:21:49.720 | Are they trying to do matching?
00:21:50.720 | Are they trying to do extraction?
00:21:53.140 | And then there's still user feedback there to kind of close the loop, but we think it's
00:21:59.420 | very dangerous to assume that the absence of user feedback is positive feedback.
00:22:03.720 | So we try to put the majority of the onus on ourselves to be rigorous about metrics, and
00:22:08.080 | we have various tools to monitor data drift.
00:22:10.580 | Are we getting documents that are very different from the population we've seen so far?
00:22:15.220 | We've invested a lot in our synthetic data generation stack.
00:22:17.900 | We actually surveyed like six-plus providers in the market, tried a bunch of them out.
00:22:22.440 | We weren't too happy, and so I ended up building our own synthetic data generation stack.
00:22:26.440 | Everything that you see was synthetically generated.
00:22:29.040 | The way that we think about this is once a use case becomes large enough within the company,
00:22:33.620 | we have enough customers doing it, we want to invest in customer agnostic evals.
00:22:39.060 | That's when we go down the kind of path of synthetic data.
00:22:42.900 | And we have a team where part of their job is just monitoring that the synthetic data is
00:22:47.140 | distributionally similar to what we're seeing from customers.
00:22:49.680 | So we haven't yet cracked the problem of doing this in an automated fully quantitative way.
00:22:54.560 | The other kind of sanity check that we have is of course looking at accuracy scores and making
00:22:57.740 | sure that our models are not excessively over- or underperforming on synthetic data.
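One way to sanity-check that synthetic documents stay distributionally similar to customer documents is to compare simple summary features of the two corpora with a two-sample test; a rough sketch using SciPy, where `featurize` is a hypothetical helper that turns a document into numeric features:

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_report(customer_docs, synthetic_docs, featurize, alpha=0.01):
    """Compare per-feature distributions of two corpora with a Kolmogorov-Smirnov test.

    `featurize(doc)` is a hypothetical helper returning a dict of numeric features,
    e.g. {"n_pages": 3, "n_tokens": 1840, "n_tables": 2}.
    """
    cust = [featurize(d) for d in customer_docs]
    synth = [featurize(d) for d in synthetic_docs]
    report = {}
    for feature in cust[0]:
        a = np.array([f[feature] for f in cust], dtype=float)
        b = np.array([f[feature] for f in synth], dtype=float)
        stat, p_value = ks_2samp(a, b)
        report[feature] = {"ks_stat": stat, "p_value": p_value, "drifted": p_value < alpha}
    return report

# Features flagged as drifted point at where the synthetic generator needs retuning,
# or where incoming customer documents have shifted away from what the evals cover.
```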
00:23:03.820 | The next little trick that we use is kind of reducing the degrees of freedom.
00:23:07.680 | So this is kind of part of our architecture, where we have numerous features, each of which requires
00:23:12.840 | a custom prompt for each customer.
00:23:15.100 | So in this case, these are four features: free text, tabular extraction, matching, table composition.
00:23:19.640 | Each of these has a prompt for each customer.
00:23:21.640 | So you can imagine over a hundred customers, you could end up with like literally thousands
00:23:25.440 | or tens of thousands of prompts.
00:23:26.520 | So instead of manual prompt engineering, we have these APE things, automated prompt engineers.
00:23:31.580 | Now the trouble is, different LLMs have different levels of performance on different
00:23:37.040 | APE tasks and on different customers.
00:23:39.480 | So you get like this exponential explosion in complexity.
00:23:43.380 | And what we did instead is we just said, well, we're seeing quite a bit of commonality in which
00:23:48.340 | LLMs do well on APE tasks.
00:23:50.120 | So let's just use one LLM for APE tasks.
00:23:52.640 | Not permanently, and we'll continuously reevaluate as new LLMs come out, but if we can fix this
00:23:56.960 | dimension of freedom, it gets a lot easier to iterate.
00:23:59.840 | So could we eke out another percentage point of accuracy if we didn't do this? Maybe.
00:24:04.280 | But building that grid search infrastructure is just too expensive and we don't think
00:24:07.280 | it's a good investment of time.
00:24:08.820 | So this is kind of another trick that we use, almost like dimensionality reduction
00:24:12.600 | at a project management level.
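To make the degrees-of-freedom point concrete, here is a rough sketch (feature, customer, and model names are all hypothetical stand-ins) of how pinning a single APE model collapses one whole dimension of the evaluation grid:

```python
from itertools import product

FEATURES = ["free_text", "tabular_extraction", "matching", "table_composition"]
CUSTOMERS = ["acme", "globex", "initech"]               # stand-ins for ~100 customers
APE_MODELS = ["model_a", "model_b", "model_c"]          # candidate prompt-writing LLMs
TARGET_MODELS = ["model_x", "model_y"]                  # candidate prompt-running LLMs

def generate_prompt(feature, customer, ape_model):
    """Hypothetical automated prompt engineer: writes a per-feature, per-customer prompt."""
    return f"[{ape_model}] prompt for {feature} / {customer}"

# Full grid search: every dimension is free, so the eval matrix explodes.
full_grid = list(product(FEATURES, CUSTOMERS, APE_MODELS, TARGET_MODELS))
print("configs to evaluate with everything free:", len(full_grid))        # 72 here, huge at scale

# Pinning one APE model removes a whole dimension; revisit this choice only
# when a promising new LLM is released, not on every iteration.
PINNED_APE = "model_a"
prompts = {(f, c): generate_prompt(f, c, PINNED_APE) for f, c in product(FEATURES, CUSTOMERS)}
print("configs to evaluate with APE pinned:", len(prompts) * len(TARGET_MODELS))  # 24 here
```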
00:24:16.060 | And the last thing I'll mention is like identify future potential.
00:24:18.300 | I think people spend a lot of time, and rightly so, building evals for what their company
00:24:22.400 | does today.
00:24:23.400 | But you should almost have a wish list of what are additional use cases that you want to
00:24:27.600 | grow into over time, three, four, five years down the line.
00:24:30.880 | Because we are frequently surprised that technology is evolving faster than what we can see.
00:24:34.920 | But it is too high of a bar to say, we want to have like MMLU or BBH level metrics for
00:24:39.620 | future workflows that nobody has asked us for yet.
00:24:42.280 | And so what we do is we have these very scrappy kind of future facing evals.
00:24:47.200 | For example, when GPT-4V came out, in about an hour we were able to say, all right, this
00:24:50.620 | is our mental model of how it's going to do.
00:24:52.300 | It's good at check marks.
00:24:53.300 | It's maybe not so good at pie charts, et cetera, et cetera.
00:24:56.420 | And so that, I think, muscle of just building this organizational mental model very quickly
00:25:00.920 | in a scrappy way is also something that's been helpful to us.
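A scrappy capability probe like that can be very small; a sketch assuming a hypothetical `ask_vision_model(image_path, question)` call and a handful of hand-picked sample images:

```python
from collections import defaultdict

# A few hand-picked probes per capability we might care about in future workflows.
PROBES = [
    {"category": "checkmarks", "image": "samples/w9_checkbox.png",
     "question": "Is the 'Individual/sole proprietor' box checked?", "expected": "yes"},
    {"category": "pie_charts", "image": "samples/revenue_pie.png",
     "question": "Which segment is largest?", "expected": "subscriptions"},
    {"category": "tables",     "image": "samples/rate_table.png",
     "question": "What is the withholding rate for Germany?", "expected": "5%"},
]

def capability_snapshot(ask_vision_model):
    """Build a rough per-category pass rate to form a quick mental model of a new model."""
    hits, totals = defaultdict(int), defaultdict(int)
    for probe in PROBES:
        answer = ask_vision_model(probe["image"], probe["question"])
        totals[probe["category"]] += 1
        hits[probe["category"]] += probe["expected"].lower() in answer.lower()
    return {cat: hits[cat] / totals[cat] for cat in totals}

# Run within an hour of a model release; the output is a mental model, not a benchmark.
```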
00:25:05.300 | I wouldn't be a startup founder if I didn't end with a shameless plug.
00:25:08.720 | As Sheila said, we just raised a $70 million Series B. We are hiring across the board.
00:25:13.400 | AI, back-end, front-end, go-to-market roles.
00:25:15.520 | So if any of this sounds interesting to you, I'd love to chat.
00:25:18.400 | All right.
00:25:19.400 | Thank you so much.
00:25:20.400 | And I'll hand it back to Sheila.
00:25:21.820 | Thank you, Nischal.
00:25:26.500 | I have to say working at Klarity is a dream.
00:25:30.000 | So anyone who's interested, please see either one of us afterwards.
00:25:34.380 | So this is my call to action.
00:25:36.640 | I've seen these paradigm shifts happen in the past.
00:25:39.480 | And I think one thing that's important is to remember the people that are early to this
00:25:45.380 | sort of AI revolution that's happening, these people matter disproportionately.
00:25:50.180 | And these people are all of you.
00:25:52.160 | And so understanding and owning your power as you think about what happens with the next
00:25:57.740 | generation of evals, of integrating values into things, it's actually, it's more than just
00:26:04.380 | sort of words on a slide, right?
00:26:06.160 | There is a real empowerment and opportunity to spend time thinking about this to get this
00:26:12.140 | right for the ecosystem and for the industry.
00:26:15.740 | So what do we do?
00:26:17.180 | We innovate.
00:26:18.180 | We reinvent benchmarks.
00:26:19.640 | We don't let the leaderboard benchmarks stick that we know are just kind of BS, right?
00:26:24.340 | There's too much happening today that isn't truly understanding how these systems work.
00:26:31.660 | We bring depth into the evaluations.
00:26:35.080 | We bring multifaceted nature of these evaluations together.
00:26:38.920 | We do so as a community, right?
00:26:40.720 | I think it's incredibly important that we bring curiosity, empathy, help to one another.
00:26:47.760 | I love the fact that Klarity wanted to come and talk about their journey, what they did that
00:26:51.820 | was right, what they did that was wrong; we share with one another on that.
00:26:55.580 | And I think that we introspect our own value systems, right?
00:26:58.480 | I think that today we are looking at such a pace of innovation that we really need to think,
00:27:06.000 | what does it mean to drive this?
00:27:08.280 | How are we the trailblazers for this happening?
00:27:11.320 | You know, I don't know if folks have read this book, The Alignment Problem.
00:27:14.440 | Brian Christian wrote a great book, I think it was last year.
00:27:18.160 | And it was one of my favorite reads of the year, and he had this quote around sort of,
00:27:21.920 | if we do see AGI, it will be an interesting mirror of society.
00:27:25.980 | It will tell us which values are uniquely human in nature.
00:27:29.920 | Well, those change over time.
00:27:32.080 | How do we want to drive a society that has values that we can be proud of as these trailblazers?
00:27:39.160 | So feel empowered, be curious, stay empathetic, kind, fair, intelligent, right?
00:27:45.100 | That's what we're doing.
00:27:46.100 | We're printing new intelligence for the world.
00:27:47.880 | And this is both your opportunity and your accountability.
00:27:50.180 | And so thank you all for leading the future of AI.
00:27:52.640 | Thank you.
00:28:08.640 | We'll see you next time.