In the Arena: How LMSys changed LLM Benchmarking Forever
Chapters
0:00 Introductions
1:16 Origin and development of Chatbot Arena
5:41 Static benchmarks vs. Arenas
9:03 Community building
13:32 Biases in human preference evaluation
18:27 Style Control and Model Categories
26:06 Impact of o1
29:15 Collaborating with AI labs
34:51 RouteLLM and router models
38:09 Future of LMSys / Arena
00:00:00.000 |
Hey everyone, welcome to the Latent Space Podcast. 00:00:07.140 |
This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host. 00:00:14.020 |
Hey, and today we're very happy and excited to welcome Anastasios and Wei Lin from LMSYS. 00:00:24.020 |
Anastasios, I actually saw you, I think at last year's NeurIPS. 00:00:27.420 |
You were presenting a paper which I don't really super understand, but it was some theory 00:00:32.220 |
paper about how your method was very dominating over other sort of search methods. 00:00:36.600 |
I don't remember what it was, but I remember that you were a very confident speaker. 00:00:42.340 |
Didn't ever connect that, but yes, that's definitely true. 00:00:47.820 |
I was frantically looking for the name of your paper and I couldn't find it. 00:00:49.140 |
Basically, I had to cut it because I didn't understand it. 00:00:51.540 |
Is this conformal PID control or was this the online control? 00:00:58.700 |
It's always interesting how NeurIPS and all these academic conferences are sort of six 00:01:02.700 |
months behind what people are actually doing, but conformal risk control, I would recommend 00:01:08.860 |
I have the recording, I just never published it just because I was like, "I don't understand 00:01:16.780 |
But ELO scores, ELO scores are very easy to understand. 00:01:19.660 |
You guys are responsible for the biggest revolution in language model benchmarking in the last 00:01:26.940 |
Maybe you guys want to introduce yourselves and maybe tell a little bit of the brief history 00:01:32.540 |
I'm a fifth year PhD student at UC Berkeley, working on Chatbot Arena these days, doing crowdsourcing 00:01:43.820 |
I'm a sixth year PhD student here at Berkeley. 00:01:46.620 |
I did most of my PhD on theoretical statistics and sort of foundations of model evaluation 00:01:55.940 |
And now I'm working 150% on this Chatbot Arena stuff. 00:02:05.700 |
And then maybe what were one or two of the pivotal moments early on that kind of made 00:02:12.940 |
The Chatbot Arena project was started last year, in April, May. 00:02:19.100 |
Before that, we were basically experimenting in the lab with how to fine-tune an open-source chatbot. 00:02:29.060 |
At that time, Llama 1 was like a base model and people didn't really know how to fine-tune it. 00:02:38.420 |
We were inspired by Stanford's Alpaca project. 00:02:41.740 |
So we basically, yeah, grew a dataset from the internet which is called the ShareGPT dataset, 00:02:47.420 |
which is like a dialogue dataset of conversations between users and ChatGPT. 00:02:52.620 |
It turns out to be like pretty high quality data, dialogue data. 00:02:56.940 |
So we fine-tuned on it and then we trained and released the model called Vicuna. 00:03:01.460 |
And people were very excited about it because it kind of demonstrated that an open-weight model 00:03:07.060 |
can reach conversation capability similar to ChatGPT. 00:03:12.260 |
And then we basically released the model with and also built a demo website for the model. 00:03:21.580 |
But during the development, the biggest challenge to us at that time was like how do we even 00:03:27.820 |
How do we even argue this model we trained is better than others? 00:03:32.420 |
And what's the gap between this open-source model and other proprietary offerings? 00:03:36.740 |
At that time, GPT-4 was just announced and there was Claude 1. 00:03:43.740 |
And then after that, like every week, there's a new model being fine tuned, released. 00:03:51.620 |
And then we have that demo website for Vicuna and then we thought like, okay, maybe we can 00:03:55.420 |
add a few more open models as well, and API models as well. 00:04:00.240 |
And then we quickly realized that people need a tool to compare between different models. 00:04:06.040 |
So we have like a side-by-side UI implemented on the website that people choose to compare. 00:04:12.820 |
And we quickly realized that maybe we can do something like a battle on top of these LLMs, 00:04:19.120 |
like just anonymize it, anonymize the identity and let people vote which one is better. 00:04:25.140 |
So the community decides which one is better, not us arguing our model is better or what. 00:04:31.300 |
And that turns out to be like, people are very excited about this idea. 00:04:35.000 |
And then we tweeted, we launched, and that was April, May. 00:04:39.760 |
And then in the first two, three weeks, there were just a few hundred thousand views on the tweet. 00:04:47.460 |
And then we regularly updated the leaderboard, weekly at the beginning, adding new models. 00:04:58.600 |
Another pivotal moment just to jump in would be private models, like the GPT, I'm a little 00:05:08.880 |
In the beginning, I saw the initial release was May 3rd of the leaderboard. 00:05:13.660 |
On April 6th, we did a Benchmarks 101 episode for the podcast. 00:05:18.040 |
Just kind of talking about how so much of the data is like in the pre-training corpus 00:05:23.160 |
And like the benchmarks are really not what we need to evaluate whether or not a model 00:05:30.040 |
Maybe at the time, you know, it was just like, Hey, let's just put together a whole bunch 00:05:35.780 |
That seems much easier than coming out with a whole website where like users need to vote. 00:05:41.040 |
I think it's more like fundamentally, we don't know how to automate this kind of benchmarks 00:05:46.080 |
when it's more like, you know, conversational, multi-turn and more open-ended task that may 00:05:54.720 |
So let's say, if you ask a model to help you write an email for you or whatever purpose, 00:06:02.920 |
Or write a story, a creative story, or many other things, like how we use ChatGPT these days. 00:06:10.960 |
It's oftentimes more open-ended, you know, we need a human in the loop to give us feedback. 00:06:18.760 |
And I think the nuance here is, sometimes it's also hard for humans to give an absolute rating. 00:06:24.600 |
So that's why we have this kind of pairwise comparison, which is easier for people, to choose which one is better. 00:06:30.560 |
So from that, we use these pairwise comparisons to calculate the leaderboard. 00:06:41.140 |
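To make that concrete, here is a minimal sketch of how pairwise votes can be turned into Bradley-Terry style scores with a logistic regression. The DataFrame columns and the Elo-style rescaling are assumptions for illustration, not the exact Arena schema or pipeline.

```python
# Minimal sketch: pairwise votes -> Bradley-Terry scores via logistic
# regression. Column names ("model_a", "model_b", "winner") and the
# Elo-like rescaling are illustrative, not the actual Arena code.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def bradley_terry_scores(votes: pd.DataFrame) -> pd.Series:
    models = sorted(set(votes["model_a"]) | set(votes["model_b"]))
    idx = {m: i for i, m in enumerate(models)}

    # One row per battle: +1 in model_a's column, -1 in model_b's column.
    X = np.zeros((len(votes), len(models)))
    rows = np.arange(len(votes))
    X[rows, votes["model_a"].map(idx).to_numpy()] = 1.0
    X[rows, votes["model_b"].map(idx).to_numpy()] = -1.0
    y = (votes["winner"] == "model_a").astype(int)  # assume ties dropped upstream

    # Logistic regression of the vote on model identity
    # (no intercept, essentially unregularized).
    lr = LogisticRegression(fit_intercept=False, C=1e6, max_iter=1000)
    lr.fit(X, y)

    # Rescale coefficients to a familiar Elo-like scale for display.
    ratings = 400 / np.log(10) * lr.coef_[0] + 1000
    return pd.Series(ratings, index=models).sort_values(ascending=False)

# Toy usage:
votes = pd.DataFrame({
    "model_a": ["gpt-x", "gpt-x", "llama-y", "claude-z", "gpt-x"],
    "model_b": ["llama-y", "claude-z", "claude-z", "gpt-x", "llama-y"],
    "winner":  ["model_a", "model_a", "model_b", "model_a", "model_b"],
})
print(bradley_terry_scores(votes))
```

The per-model coefficient here is the "predictive effect of model identity on the vote" discussed later in the episode; confidence intervals can be obtained by bootstrapping over battles.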
I mean, I think the point is that, and you guys probably also talked about this at some 00:06:46.400 |
point, but static benchmarks are intrinsically, to some extent, unable to measure generative 00:06:53.920 |
And the reason is because you cannot pre-annotate all the outputs of a generative model. 00:06:59.960 |
You change the model, it's like the distribution of your data is changing. 00:07:05.680 |
You'd need new labels, or automated labeling, right? 00:07:10.520 |
And yeah, static benchmarks, they allow you to zoom in to particular types of information 00:07:19.160 |
We can build the best benchmark of historical facts, and we will then know that the model 00:07:25.880 |
But ultimately, that's not the only axis, right? 00:07:28.760 |
And we can build 50 of them, and we can evaluate 50 axes, but it's just so, the problem of 00:07:34.040 |
generative model evaluation is just so expansive, and it's so subjective, that it's just maybe 00:07:40.040 |
not intrinsically impossible, but at least we don't see a way, or we didn't see a way 00:07:47.920 |
But on the other hand, I think there is a challenge where this kind of online dynamic 00:07:52.480 |
benchmark is more expensive than static benchmark, offline benchmark, where people still need 00:07:59.440 |
Like when they build models, they need static benchmark to track where they are. 00:08:03.320 |
It's not like our benchmark is uniformly better than all other benchmarks, right? 00:08:07.440 |
It just measures a different kind of performance that has proved to be useful. 00:08:14.240 |
You guys also publish MT-Bench as well, which is a static version, let's say, of Chatbot Arena. 00:08:22.280 |
That people can actually use in their development of models. 00:08:26.240 |
I think one of the reasons we still do this static benchmark, we still wanted to explore, 00:08:32.280 |
experiment with whether we can automate this, because eventually model developers need it 00:08:40.240 |
So that's why we explored LLM-as-a-judge, and Arena-Hard, trying to filter, select high-quality 00:08:47.920 |
data we collected from Chatbot Arena, the high-quality subset, and use that as the questions, 00:08:53.640 |
and then automate the judge pipeline, so that people can quickly get high-quality signal, 00:09:00.280 |
benchmark signals, using this offline benchmark. 00:09:03.360 |
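As a rough illustration of the LLM-as-a-judge idea (not the actual MT-Bench or Arena-Hard code), here is a sketch of a pairwise judge using the OpenAI Python client; the judge model name and the prompt are placeholders.

```python
# Rough sketch of a pairwise LLM judge. Assumes the OpenAI Python client
# and an OPENAI_API_KEY in the environment; prompt and model name are
# placeholders, not the MT-Bench / Arena-Hard pipeline.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial judge. Given a user question and two
answers (A and B), decide which answer is better. Reply with exactly one of:
"A", "B", or "tie".

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}
"""

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,
    )
    verdict = resp.choices[0].message.content.strip()
    return verdict if verdict in {"A", "B", "tie"} else "tie"
```

In practice a judge like this is typically run in both answer orders to mitigate position bias, and its verdicts are aggregated the same way as human votes.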
As a community builder, I'm curious about just the initial early days. 00:09:07.600 |
Obviously, when you offer effectively free A/B testing inference for people, people will 00:09:14.160 |
What do you think were the key unlocks for you? 00:09:21.920 |
When people came in, do you see a noticeable skew in the data? 00:09:25.400 |
Which obviously now you have enough data sets, you can separate things out, like coding and 00:09:30.100 |
But in the early days, it was just all sorts of things. 00:09:32.920 |
Maybe one thing to establish at first is that our philosophy has always been to maximize 00:09:40.860 |
I think that really does speak to your point, which is, yeah, why do people come? 00:09:48.760 |
And also, a lot of users just come to the website to use direct chat, because you can 00:09:54.040 |
And then you could think about it like, "Hey, let's just be more on the selfish or conservative 00:09:59.640 |
or protectionist side, and say, 'No, we're only giving credits for people that battle,'" 00:10:09.040 |
Because what we're trying to build is a big funnel, a big funnel that can direct people. 00:10:14.020 |
And some people are passionate and interested, and they battle. 00:10:17.620 |
And yes, the distribution of the people that do that is different. 00:10:21.040 |
It's like, as you're pointing out, it's like, that's not as true. 00:10:27.500 |
Or they like games, you know, people like this. 00:10:31.240 |
And we've run a couple of surveys that indicate this as well, of our user base. 00:10:36.620 |
We do see a lot of developers come to the site asking coding questions, 20-30%. 00:10:43.460 |
And that's not reflective of the general population. 00:10:46.460 |
But it's like reflective of some corner of the world of people that really care. 00:10:52.100 |
And to some extent, maybe that's all right, because those are like the power users. 00:10:56.180 |
And you know, we're not trying to claim that we represent the world, right? 00:11:12.300 |
As you can imagine, this leaderboard depends on community engagement and participation. 00:11:19.540 |
If no one comes to vote tomorrow, then no leaderboard. 00:11:23.220 |
So we had some period of time when the number of users was just declining after the initial launch. 00:11:29.940 |
And you know, at some point, it did not look promising. 00:11:33.140 |
Actually, I joined the project a couple months in to do the statistical aspects, right? 00:11:39.540 |
As you can imagine, that's how it kind of hooked into my previous work. 00:11:42.180 |
But at that time, it definitely wasn't clear that this was going to be the eval or something. 00:11:49.460 |
It was just like, "Oh, this is a cool project. 00:11:52.500 |
Like Wei Lin seems awesome," you know, and that's it. 00:11:57.780 |
In the beginning, because people didn't know us, people didn't know what this was for. 00:12:04.780 |
But I think we were lucky enough that we had some initial momentum, and as well, the competition 00:12:12.100 |
between model providers, you know, became very intense, and then that put the focus of 00:12:18.940 |
the eval onto us, because number one is always number one. 00:12:25.060 |
Our main priority in everything we do is trust. 00:12:28.540 |
We want to make sure we're doing everything, like all the I's are dotted and the T's are 00:12:32.060 |
crossed and nobody gets unfair treatment and people can see from our profiles and from 00:12:37.460 |
our previous work and from whatever that, you know, we're trustworthy people. 00:12:40.340 |
We're not like trying to make a buck and we're not trying to become famous off of this or 00:12:46.860 |
It's just, we're trying to provide a great public leaderboard. 00:12:54.140 |
I don't, you know, that'd be very nice of me, but that's fine. 00:12:59.340 |
Just to dive in more into biases and, you know, some of this is like statistical control. 00:13:04.700 |
The classic one for human preference evaluation is humans demonstrably prefer longer contexts 00:13:11.260 |
or longer outputs, which is actually something that we don't necessarily want. 00:13:15.580 |
You guys, I think maybe two months ago put out some length control studies. 00:13:20.220 |
Apart from that, there are just other documented biases. 00:13:23.220 |
Like I'd just be interested in your review of the, what you've learned about biases and 00:13:30.460 |
maybe a little bit about how you've controlled for them. 00:13:40.660 |
You know, we try not to make value judgments about these things. 00:13:48.460 |
We collect organic data and then we take that data and we mine it to get whatever insights 00:13:56.020 |
And, you know, we have many millions of data points that we can now use to extract insights 00:14:01.780 |
One of those insights is to ask the question, what is the effect of style, right? 00:14:08.900 |
People are voting either which way we have all the conversations we can say, what components 00:14:14.020 |
of style contribute to human preference and how did they contribute? 00:14:21.260 |
It's important because some people want to see which model would be better if the lengths 00:14:27.140 |
of the responses were the same, were to be the same, right? 00:14:30.180 |
People want to see the causal effect of the model's identity controlled for length or 00:14:37.980 |
The number of headers, bulleted lists, is the text bold? 00:14:40.780 |
Some people don't, they just don't care about that. 00:14:43.020 |
The idea is not to impose the judgment that this is not important, but rather to say, ex 00:14:47.900 |
post facto, can we analyze our data in a way that decouples all the different factors that 00:14:53.900 |
Now, the way we do this is via statistical regression. 00:14:57.060 |
That is to say, the arena score that we show on our leaderboard is a particular type of 00:15:03.740 |
It's a linear model that takes, it's a logistic regression, that takes model identities and 00:15:12.060 |
So, it regresses human preference against model identity. 00:15:15.420 |
What you get at the end of that logistic regression is a parameter vector of coefficients. 00:15:20.100 |
And when the coefficient is large, it tells you that GPT-4o or whatever, very large coefficient, 00:15:27.500 |
And that's exactly what we report in the table. 00:15:29.780 |
It's just the predictive effect of the model identity on the vote. 00:15:33.300 |
Another thing that you can do is you can take that vector, let's say we have M models, that 00:15:40.300 |
What you can do is you say, hey, I also want to understand what the effect of length is. 00:15:45.100 |
So I'll add another entry to that vector, which is trying to predict the vote, right? 00:15:50.660 |
That tells me the difference in length between two model responses. 00:15:55.520 |
We can compute it ex post facto, we add it as a regressor, and we look at that predictive 00:16:01.260 |
And then the idea, and this is formally true under certain conditions, not always verifiable 00:16:07.220 |
ones, but the idea is that adding that extra coefficient to this vector will kind of suck 00:16:12.500 |
out the predictive power of length and put it into that M plus first coefficient and 00:16:17.460 |
"de-bias" the rest, so that the effect of length is not included. 00:16:26.360 |
We have, you know, five, six different style components that have to do with markdown headers 00:16:31.180 |
and bulleted lists and so on that we add here. 00:16:39.660 |
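To sketch what adding those style regressors looks like, continuing the earlier logistic-regression example: the normalization and column names below are assumptions, not the exact style-control recipe.

```python
# Sketch: append style covariates (here just a normalized length
# difference) to the model-identity design matrix, so their coefficients
# absorb the predictive power of style and the model coefficients are
# "de-biased". The real style control uses several such features
# (markdown headers, bold text, bulleted lists, ...).
import numpy as np
import pandas as pd

def add_style_columns(X_models: np.ndarray, votes: pd.DataFrame) -> np.ndarray:
    len_a = votes["response_a"].str.len().to_numpy(dtype=float)
    len_b = votes["response_b"].str.len().to_numpy(dtype=float)
    # Normalized difference in response length between the two sides.
    length_diff = ((len_a - len_b) / (len_a + len_b + 1e-9)).reshape(-1, 1)
    return np.hstack([X_models, length_diff])  # (M+1)-th column = length effect
```

Fitting the same logistic regression on this wider matrix yields one extra coefficient for length, which can be reported separately, while the first M coefficients give the style-controlled scores.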
If you have something that's sort of like a nuisance parameter, something that exists 00:16:44.820 |
and provides predictive value, but you really don't want to estimate that, you want to remove 00:16:50.660 |
its effect, in causal inference these things are called confounders often. 00:16:57.220 |
You can put them into your model and try to adjust for them. 00:17:00.660 |
So another one of those things might be cost. 00:17:02.700 |
You know, what if I want to look at the cost-adjusted performance of my model? 00:17:05.780 |
Which models are punching above their weight? 00:17:08.460 |
Which models are punching above their weight in terms of parameter count? 00:17:13.080 |
We can do it without introducing anything that compromises the organic nature of the data. 00:17:21.660 |
For someone with a background in econometrics, this is super familiar. 00:17:25.020 |
You're probably better at this than me, for sure. 00:17:27.820 |
Well, I mean, so I used to be a quantitative trader and so controlling for multiple effects 00:17:40.580 |
Obviously the problem is proving causation, which is hard, but you don't have to do that. 00:17:47.700 |
Causal inference is a hard problem and it goes beyond statistics, right? 00:17:50.300 |
It's like, you have to build the right causal model and so on and so forth. 00:17:53.740 |
But we think that this is a good first step and we're sort of looking forward to learning 00:17:59.940 |
You know, there's some good people at Berkeley that work on causal inference. 00:18:02.260 |
We're looking forward to learning from them on like, what are the really most contemporary 00:18:05.500 |
techniques that we can use in order to estimate true causal effects, if possible. 00:18:10.660 |
Maybe we could take a step through the other categories. 00:18:16.780 |
I have thought that when you wrote that blog post, actually, I thought it would be the 00:18:20.740 |
new default because it seems like the most obvious thing to control for. 00:18:23.820 |
But you also have other categories, you have coding, you have hard prompts. 00:18:26.660 |
We consider that; we're still actively considering it. 00:18:28.980 |
It's just, you know, once you take that step, you're introducing 00:18:34.180 |
And I'm not, you know, why should our opinion be the one? 00:18:49.700 |
It's just like pick a few of your favorite categories that you'd like to talk about. 00:18:52.180 |
Maybe tell a little bit of the stories, tell a little bit of like the hard choices that 00:19:00.220 |
I think the, initially the reason why we want to add these new categories is essentially 00:19:06.020 |
to answer, I mean, some of the questions from our community, which is, we can't have a single 00:19:10.940 |
leaderboard for everything, since these models behave very differently in different domains. 00:19:15.860 |
Let's say this model is trained for coding, this model is trained for more technical questions. 00:19:22.260 |
On the other hand, to answer people's question about like, okay, what if all these low quality, 00:19:27.500 |
you know, because we crowdsource data from the internet, there will be noise. 00:19:33.780 |
How do we filter out these low quality data effectively? 00:19:37.380 |
So that was like, you know, some questions we want to answer. 00:19:40.540 |
So basically we spent a few months, like really diving into these questions to understand 00:19:46.020 |
how do we filter all this data, because these are like millions of data points. 00:19:49.420 |
And then if you want to re-label it yourself, it's possible, but we need to kind of 00:19:54.180 |
automate this kind of data classification pipeline for us to effectively categorize 00:20:00.800 |
them into different categories, say coding, math, instruction following, and also harder problems. 00:20:06.520 |
So that was like, the hope is when we slice the data into these meaningful categories 00:20:11.740 |
to give people more like better signals, more direct signals, and that's also to clarify 00:20:17.860 |
what we are actually measuring for, because I think that's the core part of the benchmark. 00:20:27.700 |
Also, I'll just say this does like get back to the point that the philosophy is to like 00:20:32.060 |
mine organic, to take organic data and then mine it ex post facto. 00:20:32.060 |
And all of these efforts are like open source, like we open source all of the data cleaning 00:20:52.180 |
Actually really good just for learning statistics. 00:20:59.340 |
I agree on the initial premise of, hey, writing an email, writing a story, there's like no 00:21:04.060 |
ground truth, but I think as you move into like coding and like red teaming, some of 00:21:08.380 |
these things, there's like kind of like skill levels. 00:21:11.460 |
So I'm curious how you think about the distribution of skill of the users, like maybe the top 00:21:17.700 |
1% of red teamers is just not participating in the arena. 00:21:21.960 |
So how do you guys think about adjusting for it? 00:21:24.100 |
And it feels like in areas like this, there are kind of big differences between the average 00:21:29.820 |
Red teaming, of course, red teaming is quite challenging. 00:21:31.620 |
So, okay, moving back, there's definitely like some tasks that are not as subjective 00:21:37.540 |
that like pairwise human preference feedback is not the only signal that you would want 00:21:43.260 |
And to some extent, maybe it's useful, but it may be more useful if you give people better 00:21:49.300 |
For example, it'd be great if we could execute code within Arena, that'd be fantastic. 00:21:53.820 |
There's also this idea of constructing a user leaderboard. 00:21:57.580 |
That means some users are better than others. 00:21:59.980 |
How do we quantify that? It's hard in Chatbot Arena. 00:22:04.660 |
But where it is easier is in red teaming, because in red teaming, there's an explicit 00:22:13.660 |
So what you can do is you can say, Hey, what's really happening here is that the models and 00:22:17.380 |
humans are playing a game against one another. 00:22:19.740 |
And then you can use the same sort of Bradley-Terry methodology, with some extensions 00:22:24.180 |
that we came up with; you can read one of our recent blog posts for the details. 00:22:30.500 |
You can attribute like strength back to individual players and jointly attribute strength to 00:22:37.380 |
like the models that are in this jailbreaking game, along with the target tasks, like what 00:22:45.500 |
And I think that this is a hugely important and interesting avenue that we want 00:22:50.580 |
We have some initial ideas, but you know, all thoughts are welcome. 00:22:55.220 |
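A hedged sketch of that idea, attributing strength jointly to attackers and models in a jailbreaking game (the exact extension is in their blog post; the schema and scoring here are illustrative):

```python
# Illustrative sketch: each attempt pits an attacker against a model,
# and one logistic regression attributes skill to attackers and
# robustness to models. Column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def game_scores(attempts: pd.DataFrame):
    attackers = sorted(attempts["attacker"].unique())
    models = sorted(attempts["model"].unique())
    a_idx = {a: i for i, a in enumerate(attackers)}
    m_idx = {m: len(attackers) + i for i, m in enumerate(models)}

    # +1 in the attacker's column, -1 in the model's column:
    # P(jailbreak) = sigmoid(attacker_skill - model_robustness).
    X = np.zeros((len(attempts), len(attackers) + len(models)))
    rows = np.arange(len(attempts))
    X[rows, attempts["attacker"].map(a_idx).to_numpy()] = 1.0
    X[rows, attempts["model"].map(m_idx).to_numpy()] = -1.0
    y = attempts["jailbroken"].astype(int)  # 1 if the attack succeeded

    # The small L2 penalty also pins down the otherwise unidentified
    # common offset between the two groups of coefficients.
    lr = LogisticRegression(fit_intercept=False, C=100.0, max_iter=1000)
    lr.fit(X, y)
    coefs = lr.coef_[0]
    attacker_skill = pd.Series(coefs[: len(attackers)], index=attackers)
    model_robustness = pd.Series(coefs[len(attackers):], index=models)
    return attacker_skill, model_robustness
```

Target tasks could be added as a third block of columns in the same way, so strength is attributed jointly to attacker, model, and task.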
So first of all, on the code execution, the E2B guys, I'm sure they'll be happy to help 00:23:02.580 |
We're investors in a company called Dreadnode, which does a lot in AI red teaming. 00:23:06.180 |
I think to me, the most interesting thing has been, how do you do, sure. 00:23:12.340 |
We also had Nicholas Carlini from DeepMind on the podcast, and he was talking about, 00:23:16.140 |
for example, like, you know, context stealing and like a weight stealing. 00:23:19.860 |
So there's kind of like a lot more that goes around it. 00:23:22.580 |
I'm curious just how you think about the model and then maybe like the broader system, even 00:23:28.460 |
with Red Team Arena, you're just focused on like jailbreaking of the model, right? 00:23:32.700 |
You're not doing kind of like any testing on the more system level thing of the model 00:23:37.660 |
where like, maybe you can get the training data back, you're going to exfiltrate some 00:23:40.980 |
of the layers and the weights and things like that. 00:23:43.940 |
So right now, as you can see, the Red Team Arena is at a very early stage and we are 00:23:48.180 |
still exploring what could be the potential new games we can introduce to the platform. 00:23:56.020 |
We built a community-driven platform for people, where they can have fun with this website. 00:24:04.020 |
That's one thing, and then help everyone to test these models. 00:24:07.800 |
So one of the aspects you mentioned is stealing secrets, right? 00:24:12.140 |
That could be one, you know, it could be designed as a game, say, can you steal user credentials, 00:24:19.620 |
you know, maybe we can hide the credentials in the system prompt and so on. 00:24:23.300 |
So there are like a few potential ideas we want to explore for sure. 00:24:32.480 |
There's a lot of great ideas in the Red Teaming space. 00:24:35.220 |
You know, I'm not personally like a Red Teamer. 00:24:37.900 |
I don't like go around and Red Team models, but there are people that do that and they're 00:24:44.300 |
And when I think about the Red Team arena, I think those are really the people that we're 00:24:49.380 |
Like we want to make them excited and happy, build tools that they like. 00:24:54.180 |
And just like chatbot arena, we'll trust that this will end up being useful for the world. 00:24:59.100 |
And all these people are, you know, I won't say all these people in this community are 00:25:03.500 |
They're not doing it because they want to like see the world burn. 00:25:06.100 |
They're doing it because they like think it's fun and cool. 00:25:10.380 |
Yeah, okay, maybe they want to see, maybe they want a little bit. 00:25:17.380 |
So, you know, trying to figure out how to serve them best. 00:25:26.780 |
So I'm not trying to express any particular value judgment here as to whether that's the 00:25:31.100 |
It's just, that's, that's sort of the way that I think we would think about it. 00:25:36.100 |
I talked to Sander Schulhoff of the HackAPrompt competition and he's pretty interested in 00:25:46.380 |
We wanted to cover a little, a few topical things and then go into the other stuff that 00:25:51.420 |
You know, you're not just running Chatbot Arena. 00:25:53.660 |
We can also talk about the new website and your future plans, but I just wanted to briefly 00:26:01.900 |
Obviously, you guys already have it on the leaderboard. 00:26:10.740 |
Because it needs like 30, 60 seconds, sometimes even more, so the latency is higher. 00:26:19.460 |
But I think we observed very interesting things from this model as well. 00:26:24.820 |
Like we observed like significant improvement in certain categories, like more technical 00:26:32.780 |
I think actually like one takeaway that was encouraging is that I think a lot of people 00:26:38.140 |
before the O1 release were thinking, oh, like this benchmark is saturated. 00:26:44.380 |
They were thinking that because there was a bunch of models that were kind of at the 00:26:48.340 |
They were just kind of like incrementally competing and it sort of wasn't immediately 00:26:55.820 |
Nobody, including any individual person, it's hard to tell. 00:26:59.420 |
But what O1 did is it was, it's clearly a better model for certain tasks. 00:27:03.780 |
I mean, I used it for like proving some theorems and you know, there's some theorems that like 00:27:08.020 |
only I know because I still do a little bit of theory, right? 00:27:11.320 |
So it's like, I can go in there and ask like, oh, how would you prove this exact thing? 00:27:14.980 |
Which I can tell you has never been in the public domain. 00:27:20.900 |
So there's this model and it crushed the benchmark. 00:27:23.420 |
You know, it's just like really like a big gap. 00:27:27.220 |
But what that's telling us is that it's not saturated yet. 00:27:33.700 |
The takeaway is that the benchmark is comparative. 00:27:40.100 |
It's just like, if you're better than the rest, then you win. 00:27:44.100 |
I think that was actually quite helpful to us. 00:27:46.180 |
I think people were criticizing, I saw some of the academics criticizing it as not apples to apples. 00:27:52.340 |
Like because it can take more time to reason, it's basically doing some search, doing some 00:27:57.980 |
chain of thought that if you actually let the other models do that same thing, they 00:28:04.420 |
But I mean, to be clear, none of the leaderboard currently is apples to apples because you 00:28:08.500 |
have like Gemini Flash, you have, you know, all sorts of tiny models, like Llama 8B, like 00:28:20.100 |
They have different latencies. 00:28:25.620 |
We can do style control, latency control, you know, things like this are important if 00:28:29.780 |
you want to understand the trade-offs involved in using AI. 00:28:36.180 |
We still haven't seen the full model yet, but you know, it's definitely a very exciting 00:28:41.780 |
I think one community controversy I just wanted to give you guys space to address is the collaboration 00:28:50.780 |
People have been suspicious, let's just say, about how they choose to A/B test on you. 00:28:56.620 |
I'll state the argument and let you respond, which is basically they run like five anonymous 00:29:02.060 |
models and basically argmax their ELO on LMSYS or Chatbot Arena and they release the best one. 00:29:09.740 |
Like what has been your end of the controversy? 00:29:12.420 |
How have you decided to clarify your policy going forward? 00:29:15.460 |
On a high level, I think our goal here is to build a fast eval for everyone, and including 00:29:25.420 |
everyone in the community, so they can see the leaderboard and understand and compare the models. 00:29:31.060 |
More importantly, I think we want to build the best eval also for model builders, like all 00:29:37.900 |
They're also internally facing a challenge, which is, you know, how do they eval the model? 00:29:42.980 |
So that's the reason why we want to partner with all the frontier lab people and to help 00:29:49.260 |
So that's one of the, we want to solve this technical challenge, which is eval. 00:29:58.540 |
And people also are interested in like seeing the leading edge of the models. 00:30:03.620 |
People in the community seem to like that, you know, "Oh, there's a new model up, is 00:30:12.380 |
So there's this question that you bring up of, is it actually causing harm, right? 00:30:16.480 |
Is it causing harm to the benchmark that we are allowing this private testing to happen? 00:30:22.100 |
Maybe like stepping back, why do you have that instinct? 00:30:24.860 |
The reason why you and others in the community have that instinct is because when you look 00:30:30.260 |
at something like a benchmark, like ImageNet, a static benchmark, what happens is that 00:30:36.340 |
if I give you a million different models that are all slightly different, and I pick the 00:30:41.340 |
best one, there's something called selection bias that plays in, which is that the performance of the selected model is overestimated. 00:30:49.620 |
This is also sometimes called the winner's curse. 00:30:51.860 |
And that's because statistical fluctuations in the evaluation are driving which model gets selected. 00:31:02.900 |
Now there's a couple of things that make this benchmark slightly different. 00:31:07.580 |
So first of all, the selection bias that you incur when you're only testing five models 00:31:12.500 |
And that's why we have this confidence interval constructed. 00:31:17.820 |
Yeah, our confidence intervals are actually not multiplicity adjusted. 00:31:20.520 |
But one thing that we could do immediately tomorrow in order to address this concern 00:31:26.460 |
is if a model provider is testing five models and they want to release one, and we're constructing 00:31:31.260 |
the intervals at level one minus alpha, we can just construct the intervals instead at level one minus alpha over five. 00:31:40.300 |
What that'll tell you is that like the final performance of the model, like the interval 00:31:44.620 |
that gets constructed, is actually formally correct. 00:31:47.260 |
We don't do that right now, partially because we kind of know from simulations that the 00:31:53.220 |
amount of selection bias you incur with these five things is just not huge. 00:31:57.820 |
It's not huge in comparison to the variability that you get from just regular human voters. 00:32:05.280 |
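A toy simulation of that intuition (the numbers are purely illustrative, not the actual Arena analysis): take five private variants with identical true strength, evaluate each on a couple thousand noisy votes, and see how much the best observed score overstates the truth.

```python
# Toy illustration of the winner's curse from testing several private
# variants and releasing the argmax. All numbers are made up.
import numpy as np

rng = np.random.default_rng(0)
k, n_votes, true_winrate = 5, 2000, 0.5   # 5 identical variants vs. a reference model
n_trials = 10_000

best_observed = np.empty(n_trials)
for t in range(n_trials):
    # Observed win rate of each variant over n_votes noisy battles.
    winrates = rng.binomial(n_votes, true_winrate, size=k) / n_votes
    best_observed[t] = winrates.max()

selection_bias = best_observed.mean() - true_winrate
single_model_se = np.sqrt(true_winrate * (1 - true_winrate) / n_votes)
print(f"bias of the best-of-{k} win rate: {selection_bias:.4f}")
print(f"ordinary sampling noise (one model): {single_model_se:.4f}")
```

The Bonferroni-style fix mentioned above is simply to build each interval at level 1 - α/5 instead of 1 - α, so the interval for the released model remains formally valid.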
But then the second thing is the benchmark is live, right? 00:32:08.820 |
So what ends up happening is it'll be a small magnitude, but even if you suffer from the 00:32:12.960 |
winner's curse after testing these five models, what'll happen is that over time, because 00:32:17.660 |
we're getting new data, it'll get adjusted down. 00:32:20.140 |
So if there's any bias that gets introduced at that stage, in the long run, it actually 00:32:24.620 |
Because asymptotically, basically like in the long run, there's way more fresh data 00:32:28.740 |
than there is data that was used to compare these five models against these like private 00:32:35.380 |
The announcement effect is only just the first phase and it has a long tail. 00:32:40.380 |
And it's sort of like automatically corrects itself for this selection adjustment. 00:32:45.620 |
Every month, I do a little chart of LLM ELO versus cost, just to track the price per 00:32:50.820 |
dollar, the amount of like, how much money do I have to pay for one incremental point of ELO. 00:32:57.640 |
And so I actually observe an interesting stability in most of the ELO numbers, except for some 00:33:03.980 |
For example, GPT-4o August has fallen from 1290 to 1260 over the past few months. 00:33:11.660 |
You're saying like a new version of GPT-4o versus the version in May? 00:33:19.100 |
I could have made some data entry error, but it'd be interesting to track these things 00:33:24.060 |
Anyway, I observed like numbers go up, numbers go down. 00:33:28.620 |
So there are two different track points and the ELO has fallen. 00:33:36.140 |
That's one of the things, by the way, the community is always suspicious about, like, 00:33:39.580 |
"Hey, did this same endpoint get dumber after release?" 00:33:54.060 |
But if it's for, like, you know, fixed endpoint versions, usually we observe small variation 00:34:03.540 |
I mean, you can quantify the variations that you would expect in an ELO. 00:34:08.900 |
That's a closed-form number that you can calculate. 00:34:11.340 |
So if the variations are larger than we would expect, then that indicates that we should 00:34:19.580 |
That's an important thing for us to know, but maybe you should send us a reply. 00:34:26.780 |
And I know we only got a few minutes before we wrap, but there are two things I would 00:34:33.940 |
So talking about models maybe getting dumber over time, blah, blah, blah. 00:34:38.420 |
Are routers actually helpful in your experience? 00:34:41.100 |
And Sean pointed out that MOEs are technically routers too. 00:34:44.660 |
So how do you kind of think about the router being part of the model versus routing different 00:34:49.620 |
And yeah, overall learnings from building it. 00:34:52.580 |
So RouteLLM is a project we released a few months ago, I think. 00:34:56.660 |
And our goal was to basically understand whether we can use the preference data we collect to route 00:35:04.620 |
models based on the question, conditional on the question, because we make the assumption 00:35:09.060 |
that some models are good at math, some models are good at coding, things like that. 00:35:18.540 |
Our first phase with this project was pretty much to open-source the framework that we built. 00:35:26.580 |
So for anyone, they're interested in this problem, they can use the framework and then 00:35:32.180 |
they can train their own router model and then do evaluation and benchmarking. 00:35:37.260 |
So that's our goal, the reason why we released this framework. 00:35:41.660 |
And I think there are a couple of future things we are thinking about. 00:35:46.020 |
One is, can we just scale this, do even more data, even more preference data, and then 00:35:52.460 |
train a reward model, train a router model, better router model. 00:35:57.340 |
Another thing is to release a benchmark, because right now, currently, one 00:36:03.020 |
of the pain points when we developed this project was there's just no good benchmark for a router. 00:36:09.820 |
So that would be another thing we think could be a useful contribution to community. 00:36:14.700 |
And there's still, for sure, new methodology we're thinking about. 00:36:18.540 |
Yeah, I think my fundamental philosophical doubt is, does the router model have to be 00:36:26.900 |
What's the minimum required intelligence of a router model, right? 00:36:29.340 |
Like, if it's too dumb, it's not going to route properly. 00:36:32.740 |
Well, I think that you can build a very, very simple router that is very effective. 00:36:39.820 |
You can build a great router with one parameter, and the parameter is just like, I'm going 00:36:45.060 |
to check if my question is hard, and if it's hard, then I'm going to go to the big model, 00:36:50.300 |
and if it's easy, I'm going to go to the little model. 00:36:52.660 |
You know, there's various ways of measuring hard that are like, pretty trivial, right? 00:37:02.500 |
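For instance, a deliberately simple version of that one-parameter router might look like the sketch below; the hardness proxy and the model names are placeholders, and a real router would swap in a small classifier trained on preference data.

```python
# Minimal sketch of the "one-parameter router" idea: score how hard the
# question looks, and send hard ones to the strong (expensive) model.
# The hardness heuristic and model names are placeholders.

STRONG_MODEL = "big-model"    # e.g. a frontier API model
CHEAP_MODEL = "small-model"   # e.g. a small open-weight model

def hardness(prompt: str) -> float:
    # Trivial proxy for difficulty; a real router might use a small
    # classifier trained on preference data instead.
    return len(prompt.split())

def route(prompt: str, threshold: float = 50.0) -> str:
    return STRONG_MODEL if hardness(prompt) > threshold else CHEAP_MODEL

print(route("What is 2 + 2?"))                     # -> small-model
print(route("Prove the convergence of ... " * 30))  # -> big-model
```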
Because ultimately, at the end of the day, you're competing with a weak baseline, which 00:37:07.580 |
is any individual model, and you're trying to ask the question, how do I improve cost? 00:37:18.420 |
Now, you can also get into that extension, which is, what models are good at what particular 00:37:25.620 |
And then, you know, I think that's where your concern starts to take effect, which is, can we actually do 00:37:30.980 |
Can we estimate which models are good in which parts of the space in a way that doesn't introduce 00:37:35.860 |
more variability and more variation in error into our final pipeline than just using the 00:37:45.060 |
Your approach is really interesting compared to the commercial approaches where you use 00:37:48.620 |
information from the chat arena to inform your model, which is, I mean, smart, and it's 00:37:56.580 |
As we wrap, can we just talk about LMSYS and what that's going to be going forward, like 00:38:02.940 |
I saw you announced yesterday you're graduating. 00:38:05.740 |
I think maybe that was confusing since you're PhD students, but this is a different type 00:38:10.900 |
Just for context, LMSYS started as like a student club. 00:38:17.780 |
Student-driven, like research projects of, you know, many different research projects 00:38:22.300 |
Sort of chatbot arena has, of course, like kind of become its own thing. 00:38:28.580 |
And Lianmin and Ying, who, you know, created LMSYS, have kind of moved on to working 00:38:34.980 |
on SGLang, and now they're doing other projects that sort of originated from LMSYS. 00:38:41.500 |
And for that reason, we thought it made sense to kind of decouple the two. 00:38:45.340 |
Just so, A, the LMSYS thing, it's not like when someone says LMSYS, they think of Chatbot Arena. 00:38:54.100 |
And we want to support new projects and so on and so forth. 00:38:57.260 |
But of course, these are all like, you know, our friends, so that's why we call it graduation. 00:39:03.620 |
I think that's one thing that people were maybe a little confused by, where LMSYS kind 00:39:08.420 |
of starts and ends and where arena starts and ends. 00:39:10.740 |
So I think you've reached escape velocity now that you're kind of like your own thing. 00:39:19.660 |
Like, what do you want people to approach you with? 00:39:23.020 |
One thing would be like, we're obviously expanding into like other kinds of arenas, right? 00:39:28.420 |
We definitely need like active help on red teaming. 00:39:30.700 |
We definitely need active help on different modalities: vision, 00:39:36.100 |
copilot, coding. You know, if somebody could help us implement this, like, a REPL 00:39:41.620 |
in Chatbot Arena, that would be a massive delta. 00:39:46.060 |
And I know that there's people out there who are passionate and capable of doing it. 00:39:50.300 |
It's just, we don't have enough hands on deck. 00:39:52.700 |
We're just like an academic research lab, right? 00:39:54.580 |
We're not equipped to support this kind of project. 00:40:01.500 |
We also need just like general back-end dev and new ideas, new conceptual ideas. 00:40:07.180 |
I mean, honestly, the work that we do spans everything from like foundational statistics, 00:40:15.220 |
And anybody who wants to contribute something to that pipeline should definitely reach out. 00:40:22.820 |
And it's an open source project anyways, anyone can make a PR. 00:40:26.380 |
And we're happy to, you know, whoever wants to contribute, we'll give them credit. 00:40:29.180 |
You know, we're not trying to keep all the credit for ourselves. 00:40:35.060 |
And that fits the spirit of everything you've been doing over there. 00:40:38.660 |
Well, thank you so much for taking the time and we'll put all the links in the show notes 00:40:43.180 |
so that people can find you and reach out if they need it.