
In the Arena: How LMSYS Changed LLM Benchmarking Forever


Chapters

0:00 Introductions
1:16 Origin and development of Chatbot Arena
5:41 Static benchmarks vs. Arenas
9:03 Community building
13:32 Biases in human preference evaluation
18:27 Style Control and Model Categories
26:06 Impact of o1
29:15 Collaborating with AI labs
34:51 RouteLLM and router models
38:09 Future of LMSYS / Arena

Whisper Transcript

00:00:00.000 | Hey everyone, welcome to the Latent Space Podcast.
00:00:07.140 | This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my
00:00:11.260 | co-host Swyx, founder of Smol AI.
00:00:14.020 | Hey, and today we're very happy and excited to welcome Anastasios and Wei Lin from LMSYS.
00:00:20.020 | Welcome guys.
00:00:21.020 | Hey, how's it going?
00:00:22.020 | Nice to see you.
00:00:23.020 | Thanks for having us.
00:00:24.020 | Anastasios, I actually saw you, I think at last year's NeurIPS.
00:00:27.420 | You were presenting a paper which I don't really super understand, but it was some theory
00:00:32.220 | paper about how your method was very dominating over other sort of search methods.
00:00:36.600 | I don't remember what it was, but I remember that you were a very confident speaker.
00:00:40.300 | Oh, I totally remember you.
00:00:42.340 | Didn't ever connect that, but yes, that's definitely true.
00:00:44.820 | Yeah.
00:00:45.820 | Nice to see you again.
00:00:46.820 | Yeah.
00:00:47.820 | I was frantically looking for the name of your paper and I couldn't find it.
00:00:49.140 | Basically, I had to cut it because I didn't understand it.
00:00:51.540 | Is this conformal PID control or was this the online control?
00:00:55.180 | Conformal, yes.
00:00:56.180 | Blast from the past, man.
00:00:57.700 | Blast from the past.
00:00:58.700 | It's always interesting how NeurIPS and all these academic conferences are sort of six
00:01:02.700 | months behind what people are actually doing, but conformal risk control, I would recommend
00:01:07.500 | people check it out.
00:01:08.860 | I have the recording, I just never published it just because I was like, "I don't understand
00:01:12.300 | this enough to explain it."
00:01:13.300 | People won't be interested, it's all good.
00:01:16.780 | But ELO scores, ELO scores are very easy to understand.
00:01:19.660 | You guys are responsible for the biggest revolution in language model benchmarking in the last
00:01:25.940 | few years.
00:01:26.940 | Maybe you guys want to introduce yourselves and maybe tell a little bit of the brief history
00:01:30.540 | of LMSYS.
00:01:31.540 | Hey, I'm Wei Lin.
00:01:32.540 | I'm a fifth year PhD student at UC Berkeley, working on Chatbot Arena these days, doing crowdsourced
00:01:41.020 | AI benchmarking.
00:01:42.820 | I'm Anastasios.
00:01:43.820 | I'm a sixth year PhD student here at Berkeley.
00:01:46.620 | I did most of my PhD on theoretical statistics and sort of foundations of model evaluation
00:01:53.780 | and testing.
00:01:55.940 | And now I'm working 150% on this Chatbot Arena stuff.
00:01:59.340 | It's great.
00:02:00.340 | And what was the origin of it?
00:02:02.500 | How did you come up with the idea?
00:02:04.140 | How did you get people to buy in?
00:02:05.700 | And then maybe what were one or two of the pivotal moments early on that kind of made
00:02:09.620 | it the standard for these things?
00:02:11.940 | Yeah, yeah.
00:02:12.940 | The Chatbot Arena project was started last year, in April, May.
00:02:19.100 | Before that, we were basically experimenting in the lab with how to fine-tune an open-source chatbot
00:02:26.060 | based on the Llama 1 model that Meta released.
00:02:29.060 | At that time, Llama 1 was like a base model and people didn't really know how to fine-tune
00:02:35.060 | it.
00:02:36.060 | So we were doing some explorations.
00:02:38.420 | We were inspired by Stanford's Alpaca project.
00:02:41.740 | So we basically, yeah, grew a dataset from the internet which is called the ShareGPT
00:02:47.420 | dataset, which is like a dialogue dataset of conversations between users and ChatGPT.
00:02:52.620 | It turns out to be like pretty high quality data, dialogue data.
00:02:56.940 | So we fine-tuned on it, and then we trained and released the model called Vicuna.
00:03:01.460 | And people were very excited about it because it kind of demonstrated that an open-weight model
00:03:07.060 | can reach conversation capability similar to ChatGPT.
00:03:12.260 | And then we basically released the model weights and also built a demo website for the model.
00:03:19.460 | People were very excited about it.
00:03:21.580 | But during the development, the biggest challenge to us at that time was like how do we even
00:03:26.820 | evaluate it?
00:03:27.820 | How do we even argue this model we trained is better than others?
00:03:32.420 | And what's the gap between this open-source model and other proprietary offerings?
00:03:36.740 | At that time, GPT-4 had just been announced, and there was Claude 1.
00:03:42.420 | What's the difference between them?
00:03:43.740 | And then after that, like every week, there's a new model being fine tuned, released.
00:03:49.140 | So even until still now, right?
00:03:51.620 | And then we had that demo website for Vicuna, and then we thought like, okay, maybe we can
00:03:55.420 | add a few more open models, and API models as well.
00:04:00.240 | And then we quickly realized that people need a tool to compare between different models.
00:04:06.040 | So we have like a side-by-side UI implemented on the website that people choose to compare.
00:04:12.820 | And we quickly realized that maybe we can do something like a battle on top of these LLMs,
00:04:19.120 | like just anonymize the identity and let people vote which one is better.
00:04:25.140 | So the community decides which one is better, not us arguing our model is better or what.
00:04:31.300 | And that turns out to be like, people are very excited about this idea.
00:04:35.000 | And then we tweet, we launch and that's April, May.
00:04:39.760 | And then in the first two, three weeks, we got a few hundred thousand views
00:04:45.880 | on our launch tweets.
00:04:47.460 | And then we had regular updates, weekly at the beginning, adding new models,
00:04:53.560 | GPT-4 as well.
00:04:54.560 | So that was like, that was the initial.
00:04:58.600 | Another pivotal moment, just to jump in, would be private models, like the gpt2-chatbot.
00:05:04.880 | Yeah, that was this year.
00:05:05.880 | That was this year.
00:05:06.880 | That was also huge.
00:05:08.880 | In the beginning, I saw the initial release of the leaderboard was May 3rd.
00:05:13.660 | On April 6th, we did a Benchmarks 101 episode for the podcast.
00:05:18.040 | Just kind of talking about how so much of the data is like in the pre-training corpus
00:05:22.160 | and blah, blah, blah.
00:05:23.160 | And like the benchmarks are really not what we need to evaluate whether or not a model
00:05:26.520 | is good.
00:05:27.520 | Why did you not make a benchmark?
00:05:30.040 | Maybe at the time, you know, it was just like, Hey, let's just put together a whole bunch
00:05:33.060 | of data again, run a mega score.
00:05:35.780 | That seems much easier than coming out with a whole website where like users need to vote.
00:05:40.040 | Any thoughts behind that?
00:05:41.040 | I think it's more that, fundamentally, we don't know how to automate this kind of benchmark
00:05:46.080 | when it's more like, you know, conversational, multi-turn and more open-ended tasks that may
00:05:52.880 | not come with a ground truth.
00:05:54.720 | So let's say, if you ask a model to help you write an email for you or whatever purpose,
00:06:00.160 | there's no ground truth.
00:06:01.160 | How do you score them?
00:06:02.920 | Or write a story, a creative story, or many other things, like how we use ChatGPT these
00:06:09.960 | days.
00:06:10.960 | It's oftentimes more open-ended, you know; we need a human in the loop to give us feedback.
00:06:17.320 | Which one is better?
00:06:18.760 | And I think the nuance here is, sometimes it's also hard for humans to give an absolute
00:06:23.600 | rating.
00:06:24.600 | So that's why we have this kind of pairwise comparison; it's easier for people to choose which
00:06:29.320 | one is better.
00:06:30.560 | So from that, we use these pairwise comparisons, those votes, to calculate the leaderboard.
00:06:36.300 | Yeah.
00:06:37.300 | You can answer.
00:06:38.300 | I can add more about this methodology.
00:06:40.140 | Yeah.
00:06:41.140 | I mean, I think the point is that, and you guys probably also talked about this at some
00:06:46.400 | point, but static benchmarks are intrinsically, to some extent, unable to measure generative
00:06:52.920 | model performance.
00:06:53.920 | And the reason is because you cannot pre-annotate all the outputs of a generative model.
00:06:59.960 | You change the model, it's like the distribution of your data is changing.
00:07:03.360 | You need new labels to deal with that.
00:07:05.680 | New labels, or automated labeling, right?
00:07:08.000 | Which is why people are pursuing both.
00:07:10.520 | And yeah, static benchmarks, they allow you to zoom in to particular types of information
00:07:16.480 | like factuality, historical facts.
00:07:19.160 | We can build the best benchmark of historical facts, and we will then know that the model
00:07:23.560 | is great at historical facts.
00:07:25.880 | But ultimately, that's not the only axis, right?
00:07:28.760 | And we can build 50 of them, and we can evaluate 50 axes, but the problem of
00:07:34.040 | generative model evaluation is just so expansive, and it's so subjective, that it's maybe
00:07:40.040 | not intrinsically impossible, but at least we don't see a way, or we didn't see a way,
00:07:45.040 | of encoding that into a fixed benchmark.
00:07:47.920 | But on the other hand, I think there is a challenge, where this kind of online dynamic
00:07:52.480 | benchmark is more expensive than a static, offline benchmark, and people still need those.
00:07:59.440 | Like when they build models, they need static benchmarks to track where they are.
00:08:03.320 | It's not like our benchmark is uniformly better than all other benchmarks, right?
00:08:07.440 | It just measures a different kind of performance that has proved to be useful.
00:08:14.240 | You guys also publish MT-Bench as well, which is a static version, let's say, of Chatbot
00:08:20.560 | Arena, right?
00:08:22.280 | That people can actually use in their development of models.
00:08:25.240 | Right.
00:08:26.240 | I think one of the reasons we still do this static benchmark is that we still wanted to explore,
00:08:32.280 | to experiment with whether we can automate this, because eventually, model developers need it
00:08:37.600 | to iterate fast on their models.
00:08:40.240 | So that's why we explored LLM-as-a-judge, and Arena-Hard, trying to filter and select high-quality
00:08:47.920 | data we collected from Chatbot Arena, the high-quality subset, and use that as the questions,
00:08:53.640 | and then automate the judge pipeline, so that people can quickly get high-quality
00:09:00.280 | benchmark signals, using this offline benchmark.
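For readers who want the mechanics: the LLM-as-a-judge pattern described here boils down to asking a strong model to pick the better of two responses. Below is a minimal sketch of that pattern, not MT-Bench's exact prompt or pipeline; `call_llm` is a hypothetical placeholder for whatever judge-model API you use.

```python
JUDGE_PROMPT = """You are an impartial judge. Compare the two assistant
responses to the user question below. Answer with exactly "A", "B", or "tie".

[Question]: {question}
[Response A]: {answer_a}
[Response B]: {answer_b}
"""

def judge_pair(question: str, answer_a: str, answer_b: str, call_llm) -> str:
    # call_llm is a hypothetical callable that sends a prompt to a strong
    # judge model and returns its text response.
    verdict = call_llm(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b))
    return verdict.strip().lower()

# Judging each pair twice with A/B swapped is a common mitigation for
# position bias in LLM judges.
```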
00:09:03.360 | As a community builder, I'm curious about just the initial early days.
00:09:07.600 | Obviously, when you offer effectively free A/B testing inference for people, people will
00:09:13.000 | come and use your Arena.
00:09:14.160 | What do you think were the key unlocks for you?
00:09:17.360 | Was it funding for this Arena?
00:09:20.080 | Was it marketing?
00:09:21.920 | When people came in, do you see a noticeable skew in the data?
00:09:25.400 | Which obviously now you have enough data that you can separate things out, like coding and
00:09:29.100 | hard prompts.
00:09:30.100 | But in the early days, it was just all sorts of things.
00:09:31.920 | Yeah.
00:09:32.920 | Maybe one thing to establish at first is that our philosophy has always been to maximize
00:09:39.500 | organic use.
00:09:40.860 | I think that really does speak to your point, which is, yeah, why do people come?
00:09:45.840 | They came to use free LLM inference, right?
00:09:48.760 | And also, a lot of users just come to the website to use direct chat, because you can
00:09:52.580 | chat with a model for free.
00:09:54.040 | And then you could think about it like, "Hey, let's just be more on the selfish or conservative
00:09:59.640 | or protectionist side, and say, 'No, we're only giving credits for people that battle,'"
00:10:04.880 | or so on and so forth.
00:10:07.040 | The strategy wouldn't have worked, right?
00:10:09.040 | Because what we're trying to build is a big funnel, a big funnel that can direct people.
00:10:14.020 | And some people are passionate and interested, and they battle.
00:10:17.620 | And yes, the distribution of the people that do that is different.
00:10:21.040 | It's like, as you're pointing out, they're more...
00:10:23.580 | Enthusiastic.
00:10:24.580 | They're enthusiastic.
00:10:25.580 | They're already adopters of this technology.
00:10:27.500 | Or they like games, you know, people like this.
00:10:31.240 | And we've run a couple of surveys that indicate this as well, of our user base.
00:10:35.620 | Yeah.
00:10:36.620 | We do see a lot of developers come to the site asking coding questions, 20-30%.
00:10:41.460 | Yeah.
00:10:42.460 | 20-30%.
00:10:43.460 | And that's not reflective of the general population.
00:10:45.460 | Yeah, for sure.
00:10:46.460 | But it's like reflective of some corner of the world of people that really care.
00:10:52.100 | And to some extent, maybe that's all right, because those are like the power users.
00:10:56.180 | And you know, we're not trying to claim that we represent the world, right?
00:10:59.540 | We represent the people that come and vote.
00:11:02.820 | Did you have to do anything marketing-wise?
00:11:04.880 | Was anything effective?
00:11:05.880 | Did you struggle at all?
00:11:07.900 | Was it success from day one?
00:11:09.300 | At some point, almost.
00:11:11.300 | Okay.
00:11:12.300 | As you can imagine, this leaderboard depends on community engagement and participation.
00:11:19.540 | If no one comes to vote tomorrow, then no leaderboard.
00:11:23.220 | So we had some period of time when the number of users was just, after the initial launch,
00:11:27.940 | it went lower.
00:11:28.940 | Yeah.
00:11:29.940 | And you know, at some point, it did not look promising.
00:11:33.140 | Actually, I joined the project a couple months in to do the statistical aspects, right?
00:11:39.540 | As you can imagine, that's how it kind of hooked into my previous work.
00:11:42.180 | But at that time, it definitely wasn't clear that this was going to be the eval or something.
00:11:49.460 | It was just like, "Oh, this is a cool project.
00:11:52.500 | Like Wei Lin seems awesome," you know, and that's it.
00:11:56.780 | Definitely.
00:11:57.780 | There's in the beginning, because people don't know us, people don't know what this is for.
00:12:02.980 | So we had a hard time.
00:12:04.780 | But I think we were lucky enough that we had some initial momentum, and as well, the competition
00:12:12.100 | between model providers, you know, became very intense, and then that put the spotlight on
00:12:18.940 | the eval, on us, because always number one is number one.
00:12:23.340 | There's also an element of trust.
00:12:25.060 | Our main priority in everything we do is trust.
00:12:28.540 | We want to make sure we're doing everything, like all the I's are dotted and the T's are
00:12:32.060 | crossed and nobody gets unfair treatment and people can see from our profiles and from
00:12:37.460 | our previous work and from whatever that, you know, we're trustworthy people.
00:12:40.340 | We're not like trying to make a buck and we're not trying to become famous off of this or
00:12:45.860 | something.
00:12:46.860 | It's just, we're trying to provide a great public leaderboard.
00:12:49.260 | Community driven project.
00:12:52.340 | I mean, you are kind of famous now.
00:12:54.140 | I don't know, that's very nice of you, but that's fine.
00:12:59.340 | Just to dive in more into biases and, you know, some of this is like statistical control.
00:13:04.700 | The classic one for human preference evaluation is humans demonstrably prefer longer contexts
00:13:11.260 | or longer outputs, which is actually something that we don't necessarily want.
00:13:15.580 | You guys, I think maybe two months ago put out some length control studies.
00:13:20.220 | Apart from that, there are just other documented biases.
00:13:23.220 | Like, I'd just be interested in your review of what you've learned about biases, and
00:13:30.460 | maybe a little bit about how you've controlled for them.
00:13:32.820 | At a very high level.
00:13:34.500 | Yeah.
00:13:35.500 | Humans are biased.
00:13:36.500 | Totally agree.
00:13:37.500 | Like in various ways.
00:13:38.500 | It's not clear whether that's good or bad.
00:13:40.660 | You know, we try not to make value judgments about these things.
00:13:43.660 | We just try to describe them as they are.
00:13:45.980 | And our approach is always as follows.
00:13:48.460 | We collect organic data and then we take that data and we mine it to get whatever insights
00:13:55.020 | we can get.
00:13:56.020 | And, you know, we have many millions of data points that we can now use to extract insights
00:14:00.780 | from.
00:14:01.780 | One of those insights is to ask the question, what is the effect of style, right?
00:14:06.860 | You have a bunch of data.
00:14:07.900 | You have votes.
00:14:08.900 | People are voting either which way, we have all the conversations, and we can ask: what components
00:14:14.020 | of style contribute to human preference, and how do they contribute?
00:14:18.140 | Now, that's an important question.
00:14:19.660 | Why is that an important question?
00:14:21.260 | It's important because some people want to see which model would be better if the lengths
00:14:27.140 | of the responses were the same, were to be the same, right?
00:14:30.180 | People want to see the causal effect of the model's identity controlled for length or
00:14:36.980 | controlled for markdown.
00:14:37.980 | The number of headers, bulleted lists, is the text bold?
00:14:40.780 | Some people don't, they just don't care about that.
00:14:43.020 | The idea is not to impose the judgment that this is not important, but rather to say, ex
00:14:47.900 | post facto, can we analyze our data in a way that decouples all the different factors that
00:14:52.460 | go into human preference?
00:14:53.900 | Now, the way we do this is via statistical regression.
00:14:57.060 | That is to say, the arena score that we show on our leaderboard is a particular type of
00:15:02.740 | linear model, right?
00:15:03.740 | It's a linear model that takes, it's a logistic regression, that takes model identities and
00:15:09.760 | fits them against human preference, right?
00:15:12.060 | So, it regresses human preference against model identity.
00:15:15.420 | What you get at the end of that logistic regression is a parameter vector of coefficients.
00:15:20.100 | And when the coefficient is large, it tells you that GPT-4o or whatever has a very large coefficient,
00:15:26.500 | and that means it's strong.
00:15:27.500 | And that's exactly what we report in the table.
00:15:29.780 | It's just the predictive effect of the model identity on the vote.
00:15:33.300 | Another thing that you can do is you can take that vector, let's say we have M models, that
00:15:37.500 | is an M dimensional vector of coefficients.
00:15:40.300 | What you can do is you say, hey, I also want to understand what the effect of length is.
00:15:45.100 | So I'll add another entry, another covariate, which also tries to predict the vote, right?
00:15:50.660 | That covariate tells me the difference in length between the two model responses.
00:15:53.740 | So we have that for all of our data.
00:15:55.520 | We can compute it ex post facto, we add it to the regression, and we look at that predictive
00:16:00.260 | effect.
00:16:01.260 | And then the idea, and this is formally true under certain conditions, not always verifiable
00:16:07.220 | ones, but the idea is that adding that extra coefficient to this vector will kind of suck
00:16:12.500 | out the predictive power of length and put it into that M plus first coefficient and
00:16:17.460 | "de-bias" the rest, so that the effect of length is not included.
00:16:22.780 | And that's what we do in style control.
00:16:24.380 | Now we don't just do it for M plus one.
00:16:26.360 | We have, you know, five, six different style components that have to do with markdown headers
00:16:31.180 | and bulleted lists and so on that we add here.
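Continuing the sketch above, appending one style column looks like the following. The length numbers are hypothetical, and the single normalized length-difference feature here stands in for the several markdown and length features (with their own normalization) that the actual style-control release uses.

```python
# Hypothetical (len_a, len_b) response lengths for the six toy battles.
lengths = [(820, 310), (500, 480), (450, 700),
           (900, 250), (600, 580), (300, 650)]

# One extra "nuisance" column: normalized length difference per battle.
len_col = np.array([(la - lb) / (la + lb) for la, lb in lengths]).reshape(-1, 1)
X_styled = np.hstack([X, len_col])

clf_style = LogisticRegression(fit_intercept=False, C=1e6).fit(X_styled, y)

# The first M coefficients are now model strengths with length's predictive
# power pulled out into the final, (M+1)-th coefficient.
print("length coefficient:", clf_style.coef_[0][-1])
elo_controlled = 400 / np.log(10) * clf_style.coef_[0][:-1] + 1000
```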
00:16:34.820 | Now where is this going?
00:16:36.940 | You guys see the idea.
00:16:37.940 | It's a general methodology.
00:16:39.660 | If you have something that's sort of like a nuisance parameter, something that exists
00:16:44.820 | and provides predictive value, but you really don't want to estimate that, you want to remove
00:16:50.660 | its effect, in causal inference these things are called confounders often.
00:16:55.580 | What you can do is you can model the effect.
00:16:57.220 | You can put them into your model and try to adjust for them.
00:17:00.660 | So another one of those things might be cost.
00:17:02.700 | You know, what if I want to look at the cost-adjusted performance of my model?
00:17:05.780 | Which models are punching above their weight?
00:17:07.380 | Parameter count.
00:17:08.460 | Which models are punching above their weight in terms of parameter count?
00:17:11.120 | We can ex post facto measure that.
00:17:13.080 | We can do it without introducing anything that compromises the organic nature of the
00:17:17.260 | data that we collect.
00:17:18.260 | Hopefully that answers the question.
00:17:20.660 | It does.
00:17:21.660 | For someone with a background in econometrics, this is super familiar.
00:17:25.020 | You're probably better at this than me, for sure.
00:17:27.820 | Well, I mean, so I used to be a quantitative trader and so controlling for multiple effects
00:17:34.780 | on stock price is effectively the job.
00:17:37.660 | So it's interesting.
00:17:40.580 | Obviously the problem is proving causation, which is hard, but you don't have to do that.
00:17:46.700 | Yes, that's right.
00:17:47.700 | Causal inference is a hard problem and it goes beyond statistics, right?
00:17:50.300 | It's like, you have to build the right causal model and so on and so forth.
00:17:53.740 | But we think that this is a good first step and we're sort of looking forward to learning
00:17:58.940 | from more people.
00:17:59.940 | You know, there's some good people at Berkeley that work on causal inference.
00:18:02.260 | We're looking forward to learning from them on like, what are the really most contemporary
00:18:05.500 | techniques that we can use in order to estimate true causal effects, if possible.
00:18:10.660 | Maybe we could take a step through the other categories.
00:18:14.060 | So style control is a category.
00:18:15.500 | It is not a default.
00:18:16.780 | I have thought that when you wrote that blog post, actually, I thought it would be the
00:18:20.740 | new default because it seems like the most obvious thing to control for.
00:18:23.820 | But you also have other categories, you have coding, you have hard prompts.
00:18:26.660 | We consider that we're still actively considering it.
00:18:28.980 | It's just, you know, once you take that step, you're introducing
00:18:33.180 | your opinion.
00:18:34.180 | And I'm not, you know, why should our opinion be the one?
00:18:37.700 | That's kind of a community choice.
00:18:38.700 | We could put it to a vote.
00:18:39.700 | We could pass.
00:18:40.700 | Yeah, maybe do a poll.
00:18:41.700 | Maybe do a poll.
00:18:42.700 | I don't know.
00:18:43.700 | No opinion is an opinion.
00:18:44.700 | You know what I mean?
00:18:45.700 | It's a culture choice here.
00:18:46.700 | Yeah.
00:18:47.700 | You have all these others.
00:18:48.700 | You have instruction following too.
00:18:49.700 | It's just like pick a few of your favorite categories that you'd like to talk about.
00:18:52.180 | Maybe tell a little bit of the stories, tell a little bit of like the hard choices that
00:18:56.220 | you had to make.
00:18:57.220 | Yeah.
00:18:58.220 | Yeah.
00:18:59.220 | Yeah.
00:19:00.220 | I think, initially, the reason why we wanted to add these new categories is essentially
00:19:06.020 | to answer, I mean, some of the questions from our community, which is, we can't have a single
00:19:10.940 | leaderboard for everything, because these models behave very differently in different domains.
00:19:15.860 | Let's say this model is tuned for coding, this model is tuned for more technical questions,
00:19:21.260 | and so on.
00:19:22.260 | On the other hand, to answer people's questions about, like, okay, what about all this low-quality data,
00:19:27.500 | you know, because we crowdsource data from the internet, there will be noise.
00:19:32.060 | So how do we de-noise?
00:19:33.780 | How do we filter out this low-quality data effectively?
00:19:37.380 | So that was like, you know, some questions we want to answer.
00:19:40.540 | So basically we spent a few months, like, really diving into these questions to understand
00:19:46.020 | how we filter all this data, because these are like millions of data points.
00:19:49.420 | And then if you want to re-label it yourself, it's possible, but we needed to
00:19:54.180 | automate this kind of data classification pipeline for us to effectively categorize
00:20:00.800 | them into different categories, say coding, math, instruction following, and also harder problems.
00:20:06.520 | So the hope is that when we slice the data into these meaningful categories,
00:20:11.740 | we give people better, more direct signals, and that's also to clarify
00:20:17.860 | what we are actually measuring, because I think that's the core part of the benchmark.
00:20:24.700 | That was the initial motivation.
00:20:25.700 | Does that make sense?
00:20:26.700 | Yeah.
00:20:27.700 | Also, I'll just say this does get back to the point that the philosophy is to
00:20:32.060 | take organic data and then mine it ex post facto.
00:20:35.940 | Is the data cage-free too, or just organic?
00:20:38.980 | It's cage-free.
00:20:39.980 | No GMO.
00:20:40.980 | Yeah.
00:20:41.980 | And all of these efforts are like open source, like we open source all of the data cleaning
00:20:47.860 | pipeline, filtering pipeline.
00:20:49.380 | Yeah.
00:20:50.380 | I love the notebooks you guys publish.
00:20:52.180 | Actually really good just for learning statistics.
00:20:54.540 | Yeah.
00:20:55.540 | I'll share these insights with everyone.
00:20:59.340 | I agree on the initial premise of, hey, writing an email, writing a story, there's like no
00:21:04.060 | ground truth, but I think as you move into like coding and like red teaming, some of
00:21:08.380 | these things, there's like kind of like skill levels.
00:21:11.460 | So I'm curious how you think about the distribution of skill of the users, like maybe the top
00:21:17.700 | 1% of red teamers is just not participating in the arena.
00:21:21.960 | So how do you guys think about adjusting for it?
00:21:24.100 | And like feels like this where there's kind of like big differences between the average
00:21:27.820 | and the top.
00:21:28.820 | Yeah.
00:21:29.820 | Red teaming, of course, red teaming is quite challenging.
00:21:31.620 | So, okay, moving back, there are definitely some tasks that are not as subjective,
00:21:37.540 | where pairwise human preference feedback is not the only signal that you would want
00:21:42.060 | to measure.
00:21:43.260 | And to some extent, maybe it's useful, but it may be more useful if you give people better
00:21:48.300 | tools.
00:21:49.300 | For example, it'd be great if we could execute code within Arena; that would be fantastic.
00:21:52.820 | We want to do it.
00:21:53.820 | There's also this idea of constructing a user leaderboard.
00:21:56.580 | What does that mean?
00:21:57.580 | That means some users are better than others.
00:21:58.980 | And how do we measure that?
00:21:59.980 | How do we quantify that? That's hard in Chatbot Arena.
00:22:04.660 | But where it is easier is in red teaming, because in red teaming, there's an explicit
00:22:09.660 | game.
00:22:10.660 | You're trying to break the model.
00:22:11.660 | Either you win, or you lose.
00:22:13.660 | So what you can do is you can say, Hey, what's really happening here is that the models and
00:22:17.380 | humans are playing a game against one another.
00:22:19.740 | And then you can use the same sort of Bradley-Terry methodology with some extensions
00:22:24.180 | that we came up with; you can read one of our recent blog posts for the
00:22:28.620 | sort of theoretical extensions.
00:22:30.500 | You can attribute like strength back to individual players and jointly attribute strength to
00:22:37.380 | like the models that are in this jailbreaking game, along with the target tasks, like what
00:22:41.860 | types of jailbreaks you've launched.
00:22:44.500 | So yeah.
00:22:45.500 | And I think that this is, this is a hugely important and interesting avenue that we want
00:22:49.580 | to continue researching.
00:22:50.580 | We have some initial ideas, but you know, all thoughts are welcome.
00:22:54.220 | Yeah.
00:22:55.220 | So first of all, on the code execution, the E2B guys, I'm sure they'll be happy to help
00:23:00.580 | set that up.
00:23:01.580 | They're big fans.
00:23:02.580 | We're investors in a company called Dreadnode, which does a lot in AI red teaming.
00:23:06.180 | I think to me, the most interesting thing has been, how do you do this? Sure,
00:23:10.100 | the model jailbreak is one side.
00:23:12.340 | We also had Nicholas Carlini from DeepMind on the podcast, and he was talking about,
00:23:16.140 | for example, like, you know, context stealing and weight stealing.
00:23:19.860 | So there's kind of like a lot more that goes around it.
00:23:22.580 | I'm curious just how you think about the model and then maybe like the broader system, even
00:23:28.460 | with Red Team Arena, you're just focused on like jailbreaking of the model, right?
00:23:32.700 | You're not doing kind of like any testing on the more system-level things around the model,
00:23:37.660 | where, like, maybe you can get the training data back, or exfiltrate some
00:23:40.980 | of the layers and the weights, and things like that.
00:23:43.940 | So right now, as you can see, the Red Team Arena is at a very early stage and we are
00:23:48.180 | still exploring what could be the potential new games we can introduce to the platform.
00:23:53.820 | So the idea is still the same, right?
00:23:56.020 | We build a community driven project platform for people, they can have fun with this website
00:24:03.020 | for sure.
00:24:04.020 | That's one thing, and then help everyone to test these models.
00:24:07.800 | So one of the aspects you mentioned is stealing secrets, right?
00:24:11.140 | Stealing training sets.
00:24:12.140 | That could be one; you know, it could be designed as a game, say, can you steal a user credential,
00:24:19.620 | you know, maybe we can hide the credential in the system prompts and so on.
00:24:23.300 | So there are like a few potential ideas we want to explore for sure.
00:24:27.980 | You want to add more?
00:24:28.980 | I think that this is great.
00:24:30.580 | This idea is a great one.
00:24:32.480 | There's a lot of great ideas in the Red Teaming space.
00:24:35.220 | You know, I'm not personally like a Red Teamer.
00:24:37.900 | I don't like go around and Red Team models, but there are people that do that and they're
00:24:41.900 | awesome.
00:24:42.900 | They're super skilled.
00:24:44.300 | And when I think about the Red Team arena, I think those are really the people that we're
00:24:48.020 | building it for.
00:24:49.380 | Like we want to make them excited and happy, build tools that they like.
00:24:54.180 | And just like chatbot arena, we'll trust that this will end up being useful for the world.
00:24:59.100 | And all these people are, you know, I would say the people in this community are
00:25:02.020 | actually good-hearted, right?
00:25:03.500 | They're not doing it because they want to like see the world burn.
00:25:06.100 | They're doing it because they like think it's fun and cool.
00:25:10.380 | Yeah, okay, maybe they want to see, maybe they want a little bit.
00:25:13.060 | Not all.
00:25:14.060 | Some.
00:25:15.060 | Majority.
00:25:16.060 | You know what I'm saying.
00:25:17.380 | So, you know, trying to figure out how to serve them best.
00:25:20.980 | I think, I don't know where that fits.
00:25:22.780 | I just, I'm not.
00:25:23.780 | And give them credits.
00:25:24.780 | And give them credit.
00:25:25.780 | Yeah.
00:25:26.780 | So I'm not trying to express any particular value judgment here as to whether that's the
00:25:30.100 | right next step.
00:25:31.100 | It's just, that's, that's sort of the way that I think we would think about it.
00:25:35.100 | Yeah.
00:25:36.100 | I talked to Sander Schulhoff of the HackAPrompt competition and he's pretty interested in
00:25:41.060 | red teaming at scale.
00:25:42.060 | Let's just call it that.
00:25:43.060 | You guys maybe want to talk with him.
00:25:45.380 | Oh, nice.
00:25:46.380 | We wanted to cover a little, a few topical things and then go into the other stuff that
00:25:50.420 | your group is doing.
00:25:51.420 | You know, you're not just running Chatbot Arena.
00:25:53.660 | We can also talk about the new website and your future plans, but I just wanted to briefly
00:25:57.700 | focus on o1.
00:25:59.220 | It is the hottest, latest model.
00:26:01.900 | Obviously, you guys already have it on the leaderboard.
00:26:04.620 | What is the impact of o1 on your evals?
00:26:06.740 | Made our interface slower.
00:26:07.740 | Yeah, exactly.
00:26:08.740 | And made it slower.
00:26:09.740 | Yeah.
00:26:10.740 | Because it needs like 30, 60 seconds, sometimes even more, to respond; the latency is higher.
00:26:17.780 | So that's one issue, sure.
00:26:19.460 | But I think we observed very interesting things from this model as well.
00:26:24.820 | Like we observed like significant improvement in certain categories, like more technical
00:26:30.780 | or math.
00:26:31.780 | Yeah.
00:26:32.780 | I think actually like one takeaway that was encouraging is that I think a lot of people
00:26:38.140 | before the o1 release were thinking, oh, like this benchmark is saturated.
00:26:43.380 | And why were they thinking that?
00:26:44.380 | They were thinking that because there was a bunch of models that were kind of at the
00:26:47.340 | same level.
00:26:48.340 | They were just kind of like incrementally competing and it sort of wasn't immediately
00:26:52.700 | obvious that any of them were any better.
00:26:55.820 | Nobody, including any individual person, it's hard to tell.
00:26:59.420 | But what o1 did is it was, it's clearly a better model for certain tasks.
00:27:03.780 | I mean, I used it for like proving some theorems and you know, there's some theorems that like
00:27:08.020 | only I know because I still do a little bit of theory, right?
00:27:11.320 | So it's like, I can go in there and ask like, oh, how would you prove this exact thing?
00:27:14.980 | Which I can tell you has never been in the public domain.
00:27:17.500 | It'll do it.
00:27:18.500 | It's like, what?
00:27:19.900 | Okay.
00:27:20.900 | So there's this model and it crushed the benchmark.
00:27:23.420 | You know, it's just like really like a big gap.
00:27:27.220 | But what that's telling us is that it's not saturated yet.
00:27:30.340 | It's still measuring some signal.
00:27:32.700 | That was encouraging.
00:27:33.700 | The takeaway is that the benchmark is comparative.
00:27:37.180 | It's not, there's no absolute number.
00:27:38.500 | There's no maximum ELO.
00:27:40.100 | It's just like, if you're better than the rest, then you win.
00:27:44.100 | I think that was actually quite helpful to us.
00:27:46.180 | I think people were criticizing, I saw some of the academics criticizing it as not apples
00:27:51.120 | to apples, right?
00:27:52.340 | Like because it can take more time to reason, it's basically doing some search, doing some
00:27:57.980 | chain of thought that if you actually let the other models do that same thing, they
00:28:02.420 | might do better.
00:28:03.420 | Absolutely.
00:28:04.420 | But I mean, to be clear, none of the leaderboard currently is apples to apples because you
00:28:08.500 | have like Gemini Flash, you have, you know, all sorts of tiny models, like Llama 8B, like,
00:28:16.060 | 8B and 405B are not apples to apples.
00:28:20.100 | They have different latencies, different latencies.
00:28:22.340 | So latency control, that's another thing.
00:28:25.620 | We can do style control, latency control, you know, things like this are important if
00:28:29.780 | you want to understand the trade-offs involved in using AI.
00:28:34.500 | o1 is a developing story.
00:28:36.180 | We still haven't seen the full model yet, but you know, it's definitely a very exciting
00:28:40.200 | new paradigm.
00:28:41.780 | I think one community controversy I just wanted to give you guys space to address is the collaboration
00:28:48.500 | between you and the large model labs.
00:28:50.780 | People have been suspicious, let's just say, about how they choose to A/B test on you.
00:28:56.620 | I'll state the argument and let you respond, which is basically they run like five anonymous
00:29:02.060 | models and basically argmax their ELO on LMSYS or chatbot arena and they release the best
00:29:08.740 | one, right?
00:29:09.740 | Like what has been your end of the controversy?
00:29:12.420 | How have you decided to clarify your policy going forward?
00:29:15.460 | On a high level, I think our goal here is to build a fast eval for everyone, including
00:29:25.420 | everyone in the community, who can see the leaderboard and understand and compare the models.
00:29:31.060 | More importantly, I think we want to build best eval also for model builders, like all
00:29:36.100 | these frontier labs building models.
00:29:37.900 | They're also internally facing a challenge, which is, you know, how do they eval the model?
00:29:42.980 | So that's the reason why we want to partner with all the frontier lab people and to help
00:29:48.140 | them testing.
00:29:49.260 | So that's one of the, we want to solve this technical challenge, which is eval.
00:29:53.900 | Yeah.
00:29:54.900 | I mean, ideally it benefits everyone, right?
00:29:56.540 | Yeah.
00:29:57.540 | And ideally, if model...
00:29:58.540 | And people also are interested in like seeing the leading edge of the models.
00:30:03.620 | People in the community seem to like that, you know, "Oh, there's a new model up, is
00:30:08.380 | this strawberry?"
00:30:09.380 | People are excited.
00:30:10.380 | People are interested.
00:30:11.380 | Yeah.
00:30:12.380 | So there's this question that you bring up of, is it actually causing harm, right?
00:30:16.480 | Is it causing harm to the benchmark that we are allowing this private testing to happen?
00:30:22.100 | Maybe like stepping back, why do you have that instinct?
00:30:24.860 | The reason why you and others in the community have that instinct is because when you look
00:30:30.260 | at something like a benchmark, like an image net, a static benchmark, what happens is that
00:30:36.340 | if I give you a million different models that are all slightly different, and I pick the
00:30:41.340 | best one, there's something called selection bias that plays in, which is that the performance
00:30:47.060 | of the winning model is overstated.
00:30:49.620 | This is also sometimes called the winner's curse.
00:30:51.860 | And that's because statistical fluctuations in the evaluation, they're driving which model
00:30:57.960 | gets selected as the top.
00:30:59.860 | So this selection bias can be a problem.
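A quick simulation makes the winner's curse concrete. Here five candidate models are identical by construction, yet the released (best-measured) one looks better than it is; all the numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
true_skill, noise_sd, k, trials = 1200.0, 10.0, 5, 10_000

# k identical models, each measured with independent evaluation noise;
# we always "release" the one with the highest measured score.
measured = true_skill + noise_sd * rng.standard_normal((trials, k))
winners = measured.max(axis=1)

# The winner's mean measured score exceeds the true skill by roughly
# 1.16 * noise_sd (the expected maximum of five standard normals).
print(f"true skill: {true_skill}, mean score of released model: {winners.mean():.1f}")
```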
00:31:02.900 | Now there's a couple of things that make this benchmark slightly different.
00:31:07.580 | So first of all, the selection bias that you incur when you're only testing five models
00:31:10.940 | is normally empirically small.
00:31:12.500 | And that's why we have this confidence interval constructed.
00:31:16.820 | That's right.
00:31:17.820 | Yeah, our confidence intervals are actually not multiplicity adjusted.
00:31:20.520 | But one thing that we could do immediately tomorrow in order to address this concern
00:31:26.460 | is if a model provider is testing five models and they want to release one, and we're constructing
00:31:31.260 | the intervals at level one minus alpha, we can just construct the intervals instead at level
00:31:36.620 | one minus alpha divided by five.
00:31:38.700 | That's called Bonferroni correction.
00:31:40.300 | What that'll tell you is that like the final performance of the model, like the interval
00:31:44.620 | that gets constructed, is actually formally correct.
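As a toy illustration of that adjustment (a generic normal-approximation interval, not the arena's actual bootstrap machinery; the point estimate and standard error are made up):

```python
from scipy.stats import norm

alpha, k = 0.05, 5           # nominal level; number of private models tested
elo_hat, se = 1260.0, 8.0    # hypothetical ELO estimate and standard error

z_plain = norm.ppf(1 - alpha / 2)        # unadjusted 95% interval
z_bonf = norm.ppf(1 - alpha / (2 * k))   # Bonferroni: level 1 - alpha/5

print(f"plain:      [{elo_hat - z_plain * se:.1f}, {elo_hat + z_plain * se:.1f}]")
print(f"Bonferroni: [{elo_hat - z_bonf * se:.1f}, {elo_hat + z_bonf * se:.1f}]")
```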
00:31:47.260 | We don't do that right now, partially because we kind of know from simulations that the
00:31:53.220 | amount of selection bias you incur with these five things is just not huge.
00:31:57.820 | It's not huge in comparison to the variability that you get from just regular human voters.
00:32:03.940 | So that's one thing.
00:32:05.280 | But then the second thing is the benchmark is live, right?
00:32:08.820 | So what ends up happening is it'll be a small magnitude, but even if you suffer from the
00:32:12.960 | winner's curse after testing these five models, what'll happen is that over time, because
00:32:17.660 | we're getting new data, it'll get adjusted down.
00:32:20.140 | So if there's any bias that gets introduced at that stage, in the long run, it actually
00:32:23.620 | doesn't matter.
00:32:24.620 | Because asymptotically, basically like in the long run, there's way more fresh data
00:32:28.740 | than there is data that was used to compare these five models against these like private
00:32:34.380 | models.
00:32:35.380 | The announcement effect is only just the first phase and it has a long tail.
00:32:39.380 | Yeah, that's right.
00:32:40.380 | And it sort of automatically corrects itself for this selection bias.
00:32:45.620 | Every month, I do a little chart of LLM ELO versus cost, just to track the performance per
00:32:50.820 | dollar, like, how much money do I have to pay for one incremental point
00:32:56.500 | in ELO?
00:32:57.640 | And so I actually observe an interesting stability in most of the ELO numbers, except for some
00:33:02.980 | of them.
00:33:03.980 | For example, GPT-4o August has fallen from 1290 to 1260 over the past few months.
00:33:10.660 | And it's surprising.
00:33:11.660 | You're saying like a new version of GPT-4o versus the version in May?
00:33:17.100 | There was May.
00:33:18.100 | May is 1285.
00:33:19.100 | I could have made some data entry error, but it'd be interesting to track these things
00:33:23.060 | over time.
00:33:24.060 | Anyway, I observed like numbers go up, numbers go down.
00:33:26.620 | It's remarkably stable.
00:33:27.620 | Gotcha.
00:33:28.620 | So there are two different checkpoints, and the ELO has fallen.
00:33:32.780 | And sometimes ELOs rise as well.
00:33:33.780 | And then the core rose from 1200 to 1230.
00:33:36.140 | That's one of the things, by the way, the community is always suspicious about, like,
00:33:39.580 | "Hey, did this same endpoint get dumber after release?"
00:33:43.540 | Right?
00:33:44.540 | It's such a meme.
00:33:45.540 | That's funny.
00:33:46.540 | But those are different endpoints, right?
00:33:47.540 | Yeah.
00:33:48.540 | Those are different API endpoints, I think.
00:33:49.540 | Yeah, yeah, yeah.
00:33:50.540 | For GPT-4o, August and May.
00:33:54.060 | But if it's for, you know, fixed endpoint versions, usually we observe small variation
00:34:02.540 | after release.
00:34:03.540 | I mean, you can quantify the variations that you would expect in an ELO.
00:34:08.900 | That's a closed-form number that you can calculate.
00:34:11.340 | So if the variations are larger than we would expect, then that indicates that we should
00:34:17.580 | look into that.
00:34:18.580 | For sure.
00:34:19.580 | That's an important thing for us to know, but maybe you should send that to us.
00:34:23.900 | Yeah, please.
00:34:24.900 | I'll send you some data, yeah.
00:34:26.780 | And I know we only got a few minutes before we wrap, but there are two things I would
00:34:31.200 | definitely love to talk about.
00:34:32.420 | One is RouteLLM.
00:34:33.940 | So talking about models maybe getting dumber over time, blah, blah, blah.
00:34:38.420 | Are routers actually helpful in your experience?
00:34:41.100 | And Sean pointed out that MOEs are technically routers too.
00:34:44.660 | So how do you kind of think about the router being part of the model versus routing different
00:34:48.620 | models?
00:34:49.620 | And yeah, overall learnings from building it.
00:34:51.580 | Yeah.
00:34:52.580 | So RouteLLM is a project we released a few months ago, I think.
00:34:56.660 | And our goal was to basically understand, can we use the preference data we collect to route
00:35:04.620 | models based on the question, conditional on the question, because we may make the assumption
00:35:09.060 | that some models are good at math, some models are good at coding, things like that.
00:35:13.380 | So we found it somewhat useful.
00:35:15.980 | For sure, this is like ongoing effort.
00:35:18.540 | Our first phase with this project is pretty much like open source, the framework that
00:35:25.380 | we developed.
00:35:26.580 | So for anyone, they're interested in this problem, they can use the framework and then
00:35:32.180 | they can train their own router model and then to do evaluation to benchmark.
00:35:37.260 | So that's our goal, the reason why we released this framework.
00:35:41.660 | And I think there are a couple of future stuff we are thinking.
00:35:46.020 | One is, can we just scale this, do even more data, even more preference data, and then
00:35:52.460 | train a reward model, train a router model, better router model.
00:35:57.340 | Another thing is to release a benchmark, because right now, currently, one
00:36:03.020 | of the pain points when we developed this project was that there's just no good benchmark for routers.
00:36:09.820 | So that would be another thing we think could be a useful contribution to the community.
00:36:14.700 | And there's still, for sure, new methodology to explore, we think.
00:36:18.540 | Yeah, I think my fundamental philosophical doubt is, does the router model have to be
00:36:24.700 | at least as smart as the smartest model?
00:36:26.900 | What's the minimum required intelligence of a router model, right?
00:36:29.340 | Like, if it's too dumb, it's not going to route properly.
00:36:32.740 | Well, I think that you can build a very, very simple router that is very effective.
00:36:38.220 | So let me give you an example.
00:36:39.820 | You can build a great router with one parameter, and the parameter is just like, I'm going
00:36:45.060 | to check if my question is hard, and if it's hard, then I'm going to go to the big model,
00:36:50.300 | and if it's easy, I'm going to go to the little model.
00:36:52.660 | You know, there's various ways of measuring hard that are like, pretty trivial, right?
00:36:56.300 | Like, does it have code?
00:36:57.700 | Does it have math?
00:36:58.780 | Is it long?
00:36:59.940 | That's already a great first step, right?
00:37:02.500 | Because ultimately, at the end of the day, you're competing with a weak baseline, which
00:37:07.580 | is any individual model, and you're trying to ask the question, how do I improve cost?
00:37:13.420 | And that's like a one-dimensional trade-off.
00:37:15.680 | It's like performance versus cost, and it's great.
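Here is a sketch of that one-signal router; the difficulty heuristics, length threshold, and model names are all placeholder assumptions, not RouteLLM's trained routers.

```python
import re

def is_hard(prompt: str) -> bool:
    """Crude difficulty signals: code markers, math-ish tokens, or length."""
    has_code = bool(re.search(r"```|def |class |#include", prompt))
    has_math = bool(re.search(r"[=+*/^]|\bprove\b|\bintegral\b", prompt, re.I))
    return has_code or has_math or len(prompt) > 500

def route(prompt: str) -> str:
    # One-parameter router: hard questions go to the big (expensive) model,
    # easy ones to the small (cheap) model.
    return "big-model" if is_hard(prompt) else "small-model"

print(route("What's the capital of France?"))                     # small-model
print(route("Prove that the sum of two even numbers is even."))   # big-model
```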
00:37:18.420 | Now, you can also get into that extension, which is, what models are good at what particular
00:37:23.820 | types of queries?
00:37:25.620 | And then, you know, I think your concern starts taking into effect is, can we actually do
00:37:29.980 | that?
00:37:30.980 | Can we estimate which models are good in which parts of the space in a way that doesn't introduce
00:37:35.860 | more variability and more error into our final pipeline than just using the
00:37:42.540 | best of them?
00:37:43.740 | That's kind of how I see it.
00:37:45.060 | Your approach is really interesting compared to the commercial approaches where you use
00:37:48.620 | information from the Chatbot Arena to inform your model, which is, I mean, smart, and it's
00:37:53.860 | the foundation of everything you do.
00:37:56.580 | As we wrap, can we just talk about LMSYS and what that's going to be going forward, like
00:38:01.100 | LMArena becoming its own thing, or something?
00:38:02.940 | I saw you announced yesterday you're graduating.
00:38:05.740 | I think maybe that was confusing since you're PhD students, but this is a different type
00:38:09.240 | of graduation.
00:38:10.900 | Just for context, LMSYS started as like a student club.
00:38:15.780 | Student-driven.
00:38:16.780 | Yeah.
00:38:17.780 | Student-driven; you know, many different research projects
00:38:21.300 | are part of LMSYS.
00:38:22.300 | Sort of chatbot arena has, of course, like kind of become its own thing.
00:38:28.580 | And Lianmin and Ying, who, you know, created LMSYS, have kind of moved on to working
00:38:34.980 | on SGLang, and now they're doing other projects that sort of originated from LMSYS.
00:38:41.500 | And for that reason, we thought it made sense to kind of decouple the two.
00:38:45.340 | Just so, A, the LMSYS thing, it's not like when someone says LMSYS, they think of Chatbot
00:38:49.620 | Arena.
00:38:50.620 | That's not fair, so to speak.
00:38:52.300 | And we want to support new projects, and so on and so forth.
00:38:57.260 | But of course, these are all like, you know, our friends, so that's why we call it graduation.
00:39:02.620 | I agree.
00:39:03.620 | I think that's one thing that people were maybe a little confused by, where LMSYS kind
00:39:08.420 | of starts and ends and where Arena starts and ends.
00:39:10.740 | So I think you've reached escape velocity now that you're kind of your own thing.
00:39:16.940 | I have one parting question.
00:39:17.940 | Like, what do you want more of?
00:39:19.660 | Like, what do you want people to approach you with?
00:39:21.180 | Oh, my God, we need so much help.
00:39:23.020 | One thing would be like, we're obviously expanding into like other kinds of arenas, right?
00:39:28.420 | We definitely need like active help on red teaming.
00:39:30.700 | We definitely need active help on different modalities: vision,
00:39:36.100 | copilot, coding, you know; if somebody could help us implement this, like, a REPL
00:39:41.620 | in Chatbot Arena, that would be a massive delta.
00:39:46.060 | And I know that there's people out there who are passionate and capable of doing it.
00:39:50.300 | It's just, we don't have enough hands on deck.
00:39:52.700 | We're just like an academic research lab, right?
00:39:54.580 | We're not equipped to support this kind of project.
00:39:58.460 | So yeah, we need help with that.
00:40:01.500 | We also need just like general back-end dev and new ideas, new conceptual ideas.
00:40:07.180 | I mean, honestly, the work that we do spans everything from foundational statistics,
00:40:11.020 | like new proofs, to full-stack dev.
00:40:15.220 | And anybody who wants to contribute something to that pipeline should definitely
00:40:20.820 | reach out.
00:40:21.820 | We need it.
00:40:22.820 | And it's an open source project anyways, anyone can make a PR.
00:40:26.380 | And we're happy to, you know, whoever wants to contribute, we'll give them credit.
00:40:29.180 | You know, we're not trying to keep all the credit for ourselves.
00:40:31.780 | We want it to be a community project.
00:40:34.060 | That's great.
00:40:35.060 | And it fits the spirit of everything you've been doing over there.
00:40:37.660 | So awesome guys.
00:40:38.660 | Well, thank you so much for taking the time and we'll put all the links in the show notes
00:40:43.180 | so that people can find you and reach out if they need it.
00:40:46.260 | Thank you so much.
00:40:47.260 | It's very nice to talk to you.
00:40:48.260 | And thank you for the wonderful questions.
00:40:49.260 | Thank you so much.