[Paper Club] Berkeley Function Calling Paper Club! — Sam Julien, Writer
00:00:05.960 |
Today we're going to be talking about the Berkeley Function Calling Leaderboard or BFCL. 00:00:16.960 |
I'm Sam Julien, if you don't know me — I'm in the Discord, I'm around. 00:00:21.120 |
In my day job, I lead developer relations for Writer, which is an enterprise AI company. 00:00:26.400 |
And then I also write at samjulien.com and have a newsletter and things like that. 00:00:32.840 |
So BFCL is the Berkeley Function Calling Leaderboard that we've come to know and love recently. 00:00:42.320 |
And it basically ranks models according to how well they do function calling, also known 00:00:52.080 |
as tool use. And they evaluate on different things and post these scores, and it's great. 00:00:58.080 |
And so the main thing is the leaderboard, but behind the leaderboard, there are also 00:01:04.040 |
three releases, three blog articles that they've come out with. 00:01:10.300 |
And I'm going to open the chat to the side here so I can see it. 00:01:16.320 |
And when I was preparing for this, I didn't realize how quickly all of these came out. 00:01:21.680 |
The first blog article was in March, the second version was in August, and the third version was 00:01:26.760 |
in September, which was a little mind-blowing to me. 00:01:31.120 |
They've done a lot in a very short amount of time. 00:01:34.600 |
And then I listed out the folks who are on the team, some of whom you'll recognize. 00:01:40.320 |
I think Shishir has been involved in a lot of different avenues. 00:01:44.360 |
And then they also have a Discord server if you're super interested in chatting with these folks. 00:01:53.040 |
So first up, we're just going to walk through the three blog articles and kind of go through them. 00:02:00.760 |
And the third one we'll spend a little bit more time on, since it's the most recent and 00:02:04.480 |
also kind of the biggest evolution as they've been going. 00:02:09.920 |
So the first one that came out, like I said, in March, was really the first of its kind. 00:02:21.960 |
The purpose of it is to evaluate the LLM's ability to call functions and provide a comprehensive evaluation. 00:02:32.480 |
Oh, and I see that I need to make Vibhu co-host. 00:02:37.680 |
Let me look at the participants and make Vibhu a co-host. 00:03:09.440 |
They had a pretty diverse dataset: 2,000 question-function-answer pairs across multiple languages 00:03:18.160 |
and domains, including simple, multiple, parallel, and parallel-multiple function calling categories. 00:03:25.160 |
And so the first version of the leaderboard looked like this. 00:03:29.680 |
Not terribly different than what we see today, but there are some new things that you'll notice. 00:03:35.880 |
The initial dataset composition is really interesting, because it's very code-oriented. 00:03:41.880 |
A lot of this initial set of function calling was very focused on specific language tasks. 00:03:52.040 |
And then the main innovation of this first version of the leaderboard 00:04:00.740 |
was that it introduced this abstract syntax tree evaluation, where it 00:04:07.400 |
would look at the executable functions and evaluate them. 00:04:15.740 |
It's pretty interesting how they created this: it will evaluate the function and the 00:04:22.940 |
parameters, see if they match what was expected, and then kind of run it through their checks. 00:04:31.660 |
And they go into more detail in the actual blog article about how they 00:04:39.660 |
parse through the functions with abstract syntax tree navigation and match the parameters. 00:04:46.620 |
So if you want to dig deeper into that, you can go through there. 00:04:52.560 |
Like I said, it was only six or seven or eight months ago, so not that long ago. 00:04:58.140 |
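To make that concrete, here's a minimal sketch of what AST-based checking of a function call could look like. This is not BFCL's actual implementation; the `get_weather` function and the expected-answer format below are hypothetical illustrations of the idea.

```python
# A minimal sketch of AST-based function-call checking, assuming the model
# emits a Python-style call string. Not BFCL's actual code; `get_weather`
# and the expected-answer format are hypothetical.
import ast

def check_call(model_output: str, expected: dict) -> bool:
    """Parse a model's call string and compare name/parameters to a spec.

    `expected` maps a function name to {param: [acceptable values]}.
    """
    tree = ast.parse(model_output, mode="eval")
    call = tree.body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        return False
    if call.func.id not in expected:
        return False

    # Collect keyword arguments as plain Python values.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}

    # Every expected parameter must be present with an acceptable value.
    spec = expected[call.func.id]
    return all(p in kwargs and kwargs[p] in vals for p, vals in spec.items())

print(check_call(
    'get_weather(city="Paris", unit="celsius")',
    {"get_weather": {"city": ["Paris"], "unit": ["celsius", "fahrenheit"]}},
))  # True
```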
And then with the second version, they were really focused on improving the data set. 00:05:06.780 |
So they came out with this second version, titled the live dataset, and they were really 00:05:12.980 |
focused on getting more real-world scenarios, with live user-contributed data, covering 00:05:20.020 |
more rare use cases, like lots and lots of nested parameters, and fixing 00:05:27.600 |
some of the issues from the first version by working through areas of data quality. 00:05:36.700 |
So they spent a lot of time on that; most of the second article is just about all of the 00:05:43.380 |
different issues that they worked on with the data: pre-processing, filtering, and standardization. 00:05:50.980 |
And I'll share these slides in the Discord as well. 00:05:53.700 |
So you don't have to worry about taking notes. 00:05:56.620 |
And they ended up with a dataset of 2,251 question-function-answer pairs. 00:06:05.400 |
So they did a lot of deduplication using ROUGE scores. 00:06:10.020 |
They did a bunch of filtering, standardized the function documents to better 00:06:14.680 |
fit their format, and enhanced their user prompting. 00:06:18.740 |
So if you look at the way they display the dataset composition in the second 00:06:30.780 |
version, you can see that it's a lot less focused on specific coding languages and things 00:06:37.060 |
like that, and more about what they were trying to accomplish with it. 00:06:41.140 |
So like relevance detection, irrelevance detection, and maintaining some of these things with 00:06:46.980 |
the abstract syntax tree and parallel multiple function calling and things like that. 00:06:52.580 |
So a lot of the underlying methodology of how they did the benchmarking didn't really 00:06:56.980 |
change in the second version, so much as the data process did. 00:07:01.580 |
So in that second version, they have another nice visual. 00:07:04.900 |
I really like how many visuals they include in these blog articles. 00:07:09.140 |
So they have this one for how they actually did the flow for all of the live data processing. 00:07:17.440 |
So they took the user queries and did this data pre-processing with them, and then did 00:07:22.260 |
some filtering and some standardization and things like that to end up with this dataset. 00:07:33.660 |
And then the third version was a pretty big leap forward, going into multi-turn and multi-step function calling. 00:07:44.780 |
So previously it was single-turn or some parallel function calling, but this is 00:07:52.900 |
really digging into multi-turn and multi-step function calling, which I'll show a diagram of. 00:07:59.020 |
But really, it gets us to testing for agentic behaviors — mimicking the 00:08:06.820 |
agentic behaviors we're all kind of trying to get to by using function calling. 00:08:13.080 |
So one of the main innovations of this, which we'll talk about in a second, 00:08:19.460 |
is what they called state-based evaluation. 00:08:24.060 |
So they really updated their evaluation from the abstract syntax tree to something called 00:08:29.740 |
state-based evaluation, which we'll talk about in just a second. 00:08:34.020 |
So they define the terms in the third version: single-turn, multi-step, and multi-turn. 00:08:41.520 |
And that's basically the difference between being able to say something like, you know, 00:08:44.860 |
"find me flights from New York to Tokyo tomorrow" versus "I want to plan a flight" or "I 00:08:50.700 |
want to find multiple flights", with multi-turn being about maintaining a context. 00:09:00.380 |
And that's where having to evaluate state comes into play, because the model is going 00:09:05.100 |
to have to remember that it's got these different responses rolling: "I want 00:09:10.100 |
to book this one" — "okay, book me the 10 a.m. version" — "okay, what's the confirmation number?" 00:09:22.860 |
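If it helps to see the distinction laid out, here's a toy illustration of the three settings as message and tool-call traces; the function names and arguments are made up, not from the benchmark:

```python
# Toy traces illustrating single-turn vs. multi-step vs. multi-turn;
# the function names are hypothetical.
single_turn = [
    {"user": "Find me flights from New York to Tokyo tomorrow"},
    {"call": "search_flights(origin='NYC', dest='TYO', date='tomorrow')"},
]

multi_step = [  # one user request, several dependent calls
    {"user": "Book me the cheapest flight from New York to Tokyo tomorrow"},
    {"call": "search_flights(origin='NYC', dest='TYO', date='tomorrow')"},
    {"call": "book_flight(flight_id='NH101')"},
]

multi_turn = [  # context and state carry across user turns
    {"user": "Find me flights from New York to Tokyo tomorrow"},
    {"call": "search_flights(origin='NYC', dest='TYO', date='tomorrow')"},
    {"user": "Okay, book me the 10 a.m. one"},
    {"call": "book_flight(flight_id='NH101')"},  # must remember earlier results
    {"user": "What's the confirmation number?"},
    {"call": "get_confirmation(booking_id='B123')"},
]
```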
So for the dataset composition, they also made some pretty interesting improvements on how they categorize things. 00:09:32.060 |
They started categorizing them according to things like multi-turn and augmented. 00:09:38.660 |
I see a question in the chat, go full screen or slideshow. 00:09:41.580 |
I was leaving it like this so that we could bounce back and forth to the actual articles. 00:09:47.620 |
But for now I could — oh, actually, then I'd have to stop sharing and re-share. 00:09:51.060 |
So I'm going to leave this for now, unless it's really, really hard for people to see. 00:09:55.340 |
I can also hide the navigation and zoom in a little bit probably. 00:10:12.120 |
I'll also share these slides; let me just dump the link in here. 00:10:22.260 |
But if you want to look at it close up, you can. 00:10:30.300 |
Okay, so they had to change their approach to actually curating the data, which was really 00:10:37.300 |
interesting, because they wanted to be able to capture the multi-turn and the long-context aspects. 00:10:46.340 |
And so they did this really interesting approach where they basically created their own APIs 00:10:51.420 |
for the sake of this testing, and then did this mapping to create a graph through edge construction. 00:11:03.260 |
And so I'll just show really quick — I'm jumping ahead a little bit — but what was cool is they 00:11:13.060 |
built out this whole API code base where they had vehicle control and stock trading 00:11:21.520 |
and travel booking, and then a file system. 00:11:25.220 |
And then also a couple of different cross-functional APIs, for messaging and 00:11:29.980 |
simulating Twitter and creating tickets and doing math and that kind of thing. 00:11:35.060 |
And they built out this system where they would build these APIs, generate 00:11:43.060 |
a graph from them, and then use that to derive a bunch of example queries and function lists, 00:11:50.460 |
and generate the dataset that way, which I thought was pretty ingenious. 00:11:56.820 |
So you can now see the third version's dataset composition looks really, really different. 00:12:03.140 |
And that's because they're really homing in on what they're trying to test for there. 00:12:08.100 |
And they actually go deeper in the article into what all of these categories mean. 00:12:21.580 |
But you can see that they're doing things like having a specific set of questions for 00:12:25.540 |
testing for long context or testing for missing params, that kind of thing. 00:12:31.100 |
And so, yeah, like I said, they turned that into this nice system with an API that they built. 00:12:41.660 |
And then they have this interesting validation process where they check for the questions 00:12:47.180 |
and they use this human ground truth as a way to check these things. 00:12:54.780 |
And they go into a little bit more on that validation side. 00:13:00.460 |
Yeah, so they actually have humans label what would be the correct answer for 00:13:09.180 |
these different scenarios, and that way they could check to see how close the model gets. 00:13:22.460 |
And so that gets us into state-based evaluation. 00:13:26.820 |
They use this as the primary metric to assess the performance of the models. 00:13:31.440 |
And this is really interesting: they basically compare the final state 00:13:36.420 |
after all the function calls are executed to see whether the system's internal changes 00:13:41.420 |
after each step align with what you're expecting. 00:13:44.940 |
This is better for reflecting real-world performance, because in multi-turn interactions 00:13:50.220 |
you're trying to update a system state, whether it's a vehicle or a file system or something else. 00:13:55.780 |
You want to be able to check whether you've successfully deleted a file, for example. 00:14:03.020 |
So it's not as straightforward as just answering a math problem or calling an API. 00:14:09.660 |
So they compare the attributes of the system state after every turn against the expected state. 00:14:15.860 |
So, for example, if you had a series of function calls asking the LLM to create a file, 00:14:23.100 |
write data to the file, and close it, it would basically check after each turn whether the 00:14:28.060 |
file exists, whether the correct data was written, and whether the file was properly closed. 00:14:34.580 |
And if they're present and correct, then the evaluation succeeds. 00:14:39.540 |
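To sketch that idea — this is my toy version, not their backend; the MockFileSystem class and its attributes are hypothetical stand-ins:

```python
# A toy sketch of state-based evaluation: execute calls against an instance,
# then compare its attributes to an instance built from the ground truth.
class MockFileSystem:
    def __init__(self):
        self.files: dict[str, str] = {}

    def create_file(self, name: str) -> None:
        self.files[name] = ""

    def write(self, name: str, data: str) -> None:
        self.files[name] += data

def state_matches(actual: object, expected: object) -> bool:
    # Compare all tracked attributes of the two instances.
    return vars(actual) == vars(expected)

# Expected state: what the backend should look like after the turn.
expected = MockFileSystem()
expected.create_file("report.txt")
expected.write("report.txt", "Q3 numbers")

# The model's function calls get executed against a fresh instance.
actual = MockFileSystem()
actual.create_file("report.txt")
actual.write("report.txt", "Q3 numbers")

print(state_matches(actual, expected))  # True: evaluation succeeds
```

Note that this checks the resulting state rather than the exact call sequence, so a model can take a different path to the same correct end state.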
And so, then they also have a section at the end on the results and error analysis. 00:14:44.580 |
I thought this was a super interesting section on what the models struggled with. 00:14:51.540 |
So, the first one was failure to perform implicit actions. 00:14:55.380 |
So they had this example of asking the model to fill a fuel tank. 00:15:04.300 |
The model thought that the tank was already full or nearly full and didn't 00:15:10.100 |
understand that it still needed to do the filling. 00:15:19.020 |
And then the second one they gave was failure to understand the current state before performing an action. 00:15:24.860 |
So if, for example, the present working directory is already Alex, 00:15:31.620 |
and you tell it to go into the directory named Alex, it didn't know that it could just 00:15:36.520 |
check whether you were already in the present working directory named Alex. 00:15:46.240 |
And then the last example that they give is LLMs incur unnecessary planning and thinking, 00:15:52.340 |
which I thought was like a really funny way of putting that. 00:15:56.140 |
But so, they give the example where you're already authenticated in the Twitter API, 00:16:01.740 |
and you ask the model to do something with Twitter, and it goes ahead and tries 00:16:10.980 |
to authenticate again. It didn't know that you were already authenticated with Twitter. 00:16:15.020 |
And so it unnecessarily added a step. 00:16:17.980 |
And it's interesting, because all three of these tie into knowing the 00:16:22.260 |
current state, or the current context that the question is asked in. 00:16:26.980 |
And that turns out to be more challenging than you would think for these models. 00:16:33.860 |
And so that's all I had for the actual overview. 00:16:36.780 |
And then I figured anybody who wanted to could just chime in and talk about any of these things. 00:16:42.420 |
I also posted the data set on Hugging Face, which is pretty interesting. 00:16:50.700 |
There's some sort of error where you can't see all of it, but you can see some of it. 00:16:55.660 |
A lot of them are the math problems, but there are also some travel questions and other 00:17:00.940 |
planning questions and things like that, which are kind of just interesting inspiration. 00:17:06.780 |
But yeah, that's all I had for the actual like overview of everything. 00:17:14.860 |
Since nobody's jumping in here: I found what you mentioned about using the syntax tree interesting — 00:17:28.820 |
I don't know if I understood this correctly, but they're generating 00:17:40.580 |
sort of test cases that are a sequence of API calls, basically. 00:17:46.620 |
And then from that, generating a scenario that sort of implements it, so that 00:17:54.180 |
you can check it against an absolute correct answer — a scenario that you can 00:18:02.460 |
have the function-calling LLM follow through. 00:18:11.700 |
Caveat that I'm still wrapping my head around it too, but yeah, I think that's basically 00:18:16.420 |
what they did, where they had these different categories that they wanted to cover. 00:18:24.700 |
And then they used the source code and everything to create the graph and generate the questions. 00:18:36.540 |
But then I still wonder — did you grok how they did the graph edge construction? 00:18:46.500 |
We'll talk about that a little bit more down here. 00:18:51.740 |
So there's this section here where they talk about how each function represents a node, and 00:18:57.540 |
then they manually map out directed edges, meaning a function's output is an input of the downstream function. 00:19:04.900 |
I guess they've got this example here of placing an order: getting the available stock is 00:19:13.100 |
necessary in order to place the order for the stock. 00:19:17.700 |
Whenever we need a dataset entry, we sample a node on the graph and randomly traverse the graph. 00:19:32.380 |
So they basically go: okay, manually create the graph for the API, 00:19:36.660 |
then sample a node in the graph, and then traverse the graph according to the edges. 00:19:48.300 |
So you have a list of function calls, and probably parameters, and then generate 00:19:54.340 |
a scenario based on that that sort of matches with it. 00:20:00.780 |
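Here's a toy version of that sample-and-traverse idea — the edge map below is hypothetical; BFCL's real graphs come from their hand-built APIs:

```python
# A toy sketch of graph-based sampling: directed edges mean a function's
# output feeds a downstream function's input. The edge map is hypothetical.
import random

EDGES = {
    "get_available_stocks": ["get_stock_price"],
    "get_stock_price": ["place_order"],
    "place_order": ["get_order_status"],
    "get_order_status": [],
}

def sample_call_sequence(max_len: int = 3) -> list[str]:
    node = random.choice(list(EDGES))
    seq = [node]
    # Randomly walk downstream edges to build a dependent call chain.
    while len(seq) < max_len and EDGES[node]:
        node = random.choice(EDGES[node])
        seq.append(node)
    return seq

print(sample_call_sequence())
# e.g. ['get_available_stocks', 'get_stock_price', 'place_order']
```

Each sampled chain then becomes the skeleton for a natural-language scenario with a known correct answer.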
And I forgot to mention this part, which is really interesting. 00:20:03.140 |
They use this dataset from PersonaHub to take a few examples of different personas 00:20:12.620 |
and then generate the different queries based on those personas, basically. 00:20:19.540 |
So like stock trading by an elderly hermit, for example. 00:20:26.100 |
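As a rough sketch of how that persona step could work — the prompt wording and personas here are mine, not from the paper; in the real pipeline something like this would be fed to an LLM to draft the queries:

```python
# Hypothetical persona-grounded query generation; the prompt text and
# persona/function names are illustrative, not from the BFCL pipeline.
PROMPT_TEMPLATE = (
    "You are {persona}. Using only these functions: {functions}, "
    "write a realistic multi-turn request you might make."
)

personas = ["an elderly hermit who trades stocks", "a busy travel agent"]
functions = ["get_available_stocks", "get_stock_price", "place_order"]

for persona in personas:
    # Each rendered prompt would be sent to an LLM to generate candidate queries.
    print(PROMPT_TEMPLATE.format(persona=persona, functions=", ".join(functions)))
```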
But then once they had that — oh yeah, so they've got a triplet of question, 00:20:31.420 |
function list, and initial config, and then they had humans actually look at it and verify 00:20:37.380 |
everything and validate that these are good questions that actually make sense. 00:20:46.580 |
That is a pretty cool way to generate a dataset like this, where you kind of get a good 00:20:51.780 |
covering set and generate the language around it, and then I guess it's small enough 00:21:00.660 |
that you can label the whole thing — or did they label the whole thing? 00:21:06.620 |
I think so, because there were only like 2,200 questions. 00:21:17.460 |
And they did a series of validations for each one, and also built 00:21:23.380 |
unit tests and error-handling tests and things like that around them. 00:21:29.140 |
Because, I mean, I guess at the end of the day, the whole idea here, if I understand 00:21:33.580 |
correctly, is that because I'm tracking my state, I can know whether the series of function 00:21:40.260 |
calls the model made left me in the correct state or not. 00:21:52.140 |
I think that's exactly what they were designing around because of the state-based evaluation. 00:22:00.980 |
There was also a whole section I didn't really dive into, where they go through 00:22:08.820 |
a lengthy discussion of response-based evaluation and why they don't use the 00:22:14.380 |
ReAct technique, and things like that that I didn't really dig into. 00:22:18.380 |
But there are other things that are just interesting, like the limitations of evaluating these models. 00:22:40.740 |
And so it's not like a crazy amount of stuff to read through, but it's definitely really 00:22:46.540 |
interesting and then helps you kind of understand things a little bit better. 00:22:50.180 |
The only thing I didn't see a lot of information on was how they derived the hallucination measurement. 00:22:55.180 |
I don't know if anybody else read through it and caught that, but it was the only thing 00:23:00.580 |
that I didn't see like a lot of information on how they did the hallucination measurement. 00:23:08.140 |
I believe — and this is not from the paper, just a thing that I think 00:23:15.260 |
it would be — it's because they have a dataset called 00:23:22.740 |
APIZoo, which is basically an open-source repository of the correct 00:23:28.500 |
documentation for a large number of APIs. 00:23:34.180 |
So presumably that could be a ground truth to benchmark against, measuring how 00:23:38.860 |
far, you know, you are from it — probably semantically, or state-wise. 00:23:48.940 |
I don't suppose anybody tried to run this locally? Because they link out to the repo. 00:23:59.580 |
I did not try to do it, but just so people know, you actually can — you can just 00:24:07.660 |
clone this and try it yourself, although you need like a thousand API keys, like any of these things. 00:24:17.900 |
I've not tried the leaderboard specifically, but there are a few things in that 00:24:23.180 |
repo that are quite useful: OpenFunctions, the Gorilla CLI, et cetera. 00:24:30.300 |
Sam, is there any description of how they provide the state 00:24:38.340 |
to the model — like what API queries it has already executed, or what the original request was? 00:24:47.300 |
I think there was, in this section. I mean, they don't go into a ton of detail, but basically... 00:24:56.580 |
Ah, so yeah, it says: as the model interacts with the API, the backend tracks how the state changes. 00:25:05.220 |
Each function call alters the state, such as creating files. At every turn, we compare 00:25:09.380 |
the instance's current state from executing the model's function calls with the expected state. 00:25:14.700 |
So they don't go into any technical detail. 00:25:21.660 |
But it sounds like it's done in the API itself. 00:25:25.420 |
I'd be curious to know more detail about how that actually happens and how it's implemented. 00:25:34.940 |
It seems like that would need to be specific to the API, right? 00:25:38.540 |
Because each API is going to have its own set of state variables that it needs to track. 00:25:45.980 |
Like a file system — oh yeah, Twitter posts or whatever. 00:25:48.540 |
But it seems like their mock APIs probably implement some sort of state. 00:25:55.820 |
That might be a good — I think swyx said they're going to have somebody from BFCL on the podcast. 00:26:01.420 |
That would probably be a good question for them. 00:26:04.760 |
So do they have an API that orchestrates the requests to other 00:26:10.160 |
APIs, or is this just referring to any API that you hit? 00:26:16.060 |
I think, yeah, they must have had sort of one collective API that 00:26:24.660 |
was a collection of those three or four single-domain APIs. 00:26:35.540 |
So they must have had something that was orchestrating those different APIs. 00:26:39.900 |
I mean, if you're not familiar with Gorilla itself, that's what it does: 00:26:46.020 |
it's a model that specifically generates API calls and nothing else. 00:26:49.900 |
Wait a minute — but we're talking about a model, and then we're talking about an API that 00:26:56.180 |
orchestrates requests toward other APIs. 00:26:58.900 |
So you can make an API call that calls this model, but that's where I'm confused: 00:27:03.860 |
are you saying it was a model generating API calls that it was trained on? 00:27:08.820 |
I showed up to this meeting late, so I might have missed this, but I don't know. 00:27:19.020 |
I'm not sure how they did that. 00:27:22.980 |
If they had each API separate and isolated so that the model would 00:27:30.100 |
call that API directly — I mean, that would be 00:27:37.620 |
my guess, that they kept them separate, because then they would know whether the model was 00:27:43.540 |
actually calling the right API and not just a gateway. 00:27:54.980 |
So, Yikes, you're having somebody from the Gorilla team come to speak? 00:28:00.140 |
I think that was — I saw swyx mentioning something about that. 00:28:07.980 |
I think he's going to have somebody come on the podcast, was what I think he said. 00:28:12.620 |
I don't remember who, yeah, there's something in discord about it. 00:28:17.300 |
I am not organizing anything at the moment, but, you know, if you have 00:28:21.900 |
questions, they tend to be super responsive in Discord. 00:28:24.420 |
So just go jump into the Gorilla LLM Discord and pick some brains. 00:28:31.540 |
It would be so much fun if he hosted the podcast with us as an audience; I would love that. 00:28:41.180 |
Yeah, I would have loved to have spoken to the person behind structured 00:28:45.180 |
generation at OpenAI, and a lot of other people. 00:28:48.760 |
It'd be interesting to at least have a questions-to-ask drop box. 00:28:58.100 |
Or, something we could do is an AI in Action where we actually try to get it working. 00:29:04.460 |
I don't know, Sam — would you be open to doing that? 00:29:23.700 |
I'm talking about this system that they're describing here. 00:29:40.620 |
Like, if that's something that you are interested in trying to make happen, I presume that 00:29:47.380 |
we would have an open slot on AI in Action at some point. 00:29:50.300 |
And that sounds interesting — just trying to get the thing working. 00:30:00.060 |
That is an interesting idea: we could do an AI in Action that's 00:30:05.260 |
"let's try and get the thing working" as opposed to "here is how the working thing works." 00:30:13.900 |
No — trying to get it working in the span of one hour is not feasible for me. 00:30:21.740 |
If somebody wants to do that, I'm going to be on there. 00:30:32.620 |
Yeah, I've had Gorilla come up very often, partially from Yikes and other people. 00:30:39.920 |
I'm a huge Gorilla shill and kind of a Shishir fan, you know — but about the Gorilla CLI: 00:30:49.180 |
I was using the GitHub Copilot CLI for quite some time. 00:30:54.260 |
But Gorilla CLI is open, and I don't have to worry about GitHub connections; 00:30:57.940 |
I can just grab it, and it does the same thing, basically. 00:31:04.460 |
You just put in a natural language query of the CLI command 00:31:09.540 |
that you want to run, and it'll give you the CLI command to do it. 00:31:14.220 |
If you've used Copilot CLI, it's the same thing, basically. 00:31:21.220 |
I mean, I've wanted to implement it myself, but it sounds like, you know, Gorilla 00:31:28.700 |
already has — they already tied up their model with a little front end. 00:31:37.660 |
No — the other interesting thing, I guess, to go over the cool stuff 00:31:41.980 |
in the repo: there's a thing called Gorilla OpenFunctions where you can 00:31:46.260 |
more or less staple Gorilla to whatever model you're playing with. 00:31:51.340 |
So if it doesn't have function calling, you can just implement OpenFunctions and it'll do it. 00:31:58.960 |
And then you ended up with no files in your file system. 00:32:07.820 |
Well, you know, it does hallucinate from time to time, so yeah. I almost 00:32:16.780 |
wonder if that'd be another either AI in Action or Paper Club: just the Gorilla repo itself. 00:32:22.780 |
There's a paper for Gorilla and everything that we didn't cover, but yeah. 00:32:33.580 |
I do feel like I would like to see a demo of the Gorilla CLI. 00:32:44.900 |
Like, I know I've seen a bunch of these sorts of things in use, 00:32:51.900 |
but, you know, I haven't gotten started with them myself. I wanted to, and never 00:32:56.540 |
had the time to get started and figure out what's actually useful. 00:33:20.200 |
So actually, I've had an idea in my head for a while, and I wanted to share it. 00:33:27.980 |
It's relevant to this topic, so I want to share it with people, 00:33:33.100 |
if there are no other topics. The idea is basically to build a leaderboard of the 00:33:41.060 |
quality of predictions of an LLM, or LLMs. 00:33:51.940 |
So the idea would be to have questions that are around 00:34:00.600 |
some future event, like who will be elected president. 00:34:04.520 |
And you have a list of them, and then you ask LLMs 00:34:10.980 |
to predict, based on some sort of canonical data that you also bring in for the LLM to 00:34:17.860 |
use, plus any other data that it wants — like, if it's an agent system, then it can 00:34:22.580 |
go and search the web and do whatever it wants and bring in whatever context it wants — 00:34:28.380 |
and then it makes a prediction on a given day about that event. 00:34:31.740 |
And then that series of questions sort of evolves every day. 00:34:42.100 |
And then over time you track the quality of prediction of those LLMs or agent systems. 00:34:48.420 |
And the reason why I think this is interesting is because judging an AI's or a person's 00:34:58.540 |
prediction quality is actually one of the only ways that you can assess the 00:35:06.060 |
quality of information on the internet. 00:35:10.500 |
So if you have a person or an AI that can accurately predict the future, 00:35:20.140 |
and do it better in a sort of stack-ranked way — because 00:35:23.860 |
I can very accurately predict that the sun will come up tomorrow, 00:35:27.860 |
so it has to be a relative ranking — 00:35:32.420 |
and if you have an AI or a person that can predict better than, you know, 00:35:38.060 |
the crowd, then that person likely has a good model of the world. 00:35:46.260 |
And that good model of the world represents, you know, at least somebody 00:35:52.740 |
who you probably want to listen to more than the crowd. 00:35:57.020 |
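To sketch how the scoring could work — this is just one option I can imagine, using Brier scores rather than anything specified in the discussion, and all the names and numbers are made up:

```python
# A toy prediction leaderboard: score probabilistic forecasts with the
# Brier score (lower is better) and rank forecasters relative to the crowd.
def brier(prob: float, outcome: int) -> float:
    return (prob - outcome) ** 2

# (predicted probability, actual outcome 0/1) per resolved question.
forecasts = {
    "llm_a": [(0.7, 1), (0.2, 0), (0.9, 1)],
    "llm_b": [(0.5, 1), (0.5, 0), (0.5, 1)],
    "crowd": [(0.6, 1), (0.3, 0), (0.8, 1)],
}

scores = {name: sum(brier(p, o) for p, o in fs) / len(fs)
          for name, fs in forecasts.items()}

for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: {score:.3f}")  # ranking relative to the crowd baseline
```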
So I feel like this would be a really cool start to developing a way to judge both AIs and people. 00:36:05.140 |
Like, you can imagine crawling the news sites, looking at which people are 00:36:11.140 |
making predictions, and then giving everybody track scores. 00:36:20.300 |
So I think, I feel like this would be a really cool way to get a project started. 00:36:24.220 |
I wonder if people have thoughts about this, and if anyone would be interested in working on it. 00:36:36.500 |
Uh, I was busy trying to get my Gorilla CLI working. 00:36:47.180 |
What is the thing that you want help on, that you were describing? 00:36:51.700 |
So, a leaderboard of the quality of predictions that AIs and agents and people make. 00:36:59.140 |
So you take news items or whatever — certain current events — and then you predict, and that 00:37:06.020 |
becomes a bet, kind of sort of, and I can explain a betting mechanism that is fair. 00:37:10.420 |
So, like just a general prediction market? 00:37:14.220 |
Well, it is like a prediction market, but it probably looks a little different, 00:37:20.500 |
because I don't think it's particularly useful to have 00:37:27.260 |
account maintenance and betting — that's a separate ability, I think. 00:37:33.760 |
So, something like a prediction market, but for a 00:37:43.380 |
limited set of things that you track over time, and whichever 00:37:48.580 |
LLM and/or person does the best kind of filters to the top. 00:37:54.580 |
That sounds interesting and probably related to things that I'm interested in. 00:37:57.460 |
So if you start cranking on it or throw up a repo, shoot me a link. 00:38:02.220 |
I'll probably take a look and play around with it. 00:38:07.620 |
If you're not familiar with it — and I don't know if they have an SDK or something 00:38:11.060 |
so you can rig up your own — you can check out Polymarket. 00:38:17.420 |
Well, I mean, I don't want to place real bets. 00:38:22.580 |
And I'm not sure — maybe there are ways. I'm thinking 00:38:29.380 |
just throw up a Polymarket instance on a testnet or something and just use monopoly money. 00:38:42.220 |
I thought I was familiar with Polymarket, but maybe I'm thinking of a different one. 00:38:43.940 |
Is that the one that Nate set up? There's a Discord thread about this; 00:38:52.780 |
get in there if you're interested. 00:38:54.780 |
Uh, back on Paper Club — do we have any last questions for Sam on Gorilla? 00:39:01.540 |
I think some stuff popped up in the chat, but it looks like you kind of answered 00:39:06.860 |
it in chat too. Basically that. And then for next week's paper, if anyone wants to 00:39:13.020 |
volunteer — okay, has anyone seen Molmo? 00:39:21.700 |
It's the open-weight, open-data, multimodal state-of-the-art model. 00:39:30.480 |
I don't know how good the paper is, but the model is pretty good. 00:39:36.620 |
I'm hoping the paper is good because I'm down to cover it. 00:39:39.800 |
They do go a lot into the data — how they get the data. 00:39:43.520 |
I think they pre-train it from CLIP, but it's a pretty good recent multimodal foundation model. 00:39:51.720 |
Um, I can cover that unless someone else wants to volunteer. 00:40:00.060 |
Um, I was going to say I can be a backup, if I pick a Monte Carlo paper and it actually works out. 00:40:05.760 |
But I would actually be pretty interested to get a download on Molmo too. 00:40:15.440 |
Um, here's a link — I'll throw it in Paper Club. But thanks for sharing, Sam. 00:40:22.880 |
Is there anybody that's new to Paper Club that has not presented, or is kind of interested? 00:40:31.460 |
If so, frickin' raise your hand, I guess. 00:40:34.920 |
And, you know, if you need a little bit of help or support or whatever 00:40:39.800 |
trying to get something together — because I want to facilitate new people that haven't presented, 00:40:46.040 |
since I know you and I have done it a bunch of times. 00:40:48.360 |
But yeah, feel free to ping me in Discord or whatever if you have questions 00:40:51.320 |
or something, or if you want to do it — yeah, let one of us know. 00:40:56.120 |
And we would, we would happily hand over the reins. 00:40:59.840 |
I'll make an announcement that this is the paper. 00:41:02.960 |
If anyone else wants to sub in, we can always take it. 00:41:09.560 |
Thanks, Sam, for covering. Any last thoughts on this? 00:41:13.200 |
I know some questions have popped up. 00:41:19.160 |
Thanks to everybody for listening and joining.