
[Paper Club] Berkeley Function Calling Paper Club! — Sam Julien, Writer



00:00:00.000 | All right.
00:00:03.640 | Hello everybody.
00:00:05.960 | Today we're going to be talking about the Berkeley Function Calling Leaderboard or BFCL.
00:00:16.960 | I'm Sam Julien, if you don't know me, I'm in the Discord, I'm around.
00:00:21.120 | In my day job, I lead developer relations for Writer, which is an enterprise AI company.
00:00:26.400 | And then I also write at samjulien.com and have a newsletter and things like that.
00:00:32.840 | So BFCL is the Berkeley Function Calling Leaderboard that we've come to know and love recently.
00:00:42.320 | And it basically ranks models according to how well they do function calling, also known
00:00:49.040 | as tool calling, depending on who you are.
00:00:52.080 | And they evaluate on different things and post these scores and it's great.
00:00:58.080 | And so the main thing is the leaderboard, but behind the leaderboard, there's also these
00:01:04.040 | three releases, three blog articles that they've come out with.
00:01:10.300 | And I can't, I'm going to open the chat to the side here so I can see.
00:01:16.320 | And when I was preparing for this, I didn't realize how quickly all of this came out, like, this
00:01:20.600 | is all super recent.
00:01:21.680 | The first blog article was in March, second version was in August, and third version was
00:01:26.760 | in September, which was a little mind blowing to me.
00:01:31.120 | They've done a lot in a very short amount of time.
00:01:34.600 | And then I listed out the folks who are on the team, some of them you'll notice from
00:01:39.320 | other projects.
00:01:40.320 | I think Shishir has been involved in a lot of different projects.
00:01:44.360 | And then they also have a Discord server if you're super interested in chatting with these
00:01:48.080 | folks.
00:01:49.080 | They have a Gorilla LLM Discord.
00:01:53.040 | So first up, we're just going to walk through sort of the three blog articles and just kind
00:01:58.840 | of go over what's in each of them.
00:02:00.760 | And the third one, we'll have a little bit more time, since it's the most recent and
00:02:04.480 | also the most, kind of the biggest evolution as they've been doing it.
00:02:09.920 | So the first one that came out, like I said, in March, was really the first of its kind
00:02:17.120 | to make a function calling leaderboard.
00:02:21.960 | So the purpose of it is to evaluate the LLM's ability to call functions and provide a comprehensive
00:02:30.520 | benchmark for function calling capabilities.
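For readers who want something concrete, here is a rough sketch of the kind of item such a benchmark scores: a question, the function docs the model is allowed to call, and the call you expect back. The names and layout are illustrative, not BFCL's actual schema.

```python
# A hypothetical function-calling benchmark item (illustrative names/format,
# not BFCL's actual schema): the model sees the question plus the function
# docs and is scored on whether it produces the expected call.
item = {
    "question": "What's the weather in Tokyo in celsius?",
    "functions": [
        {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        }
    ],
    # The model is scored on whether it emits a call equivalent to this:
    "expected_call": "get_weather(city='Tokyo', unit='celsius')",
}
```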
00:02:32.480 | Oh, and I see that I need to make Vibhu co-host.
00:02:35.520 | So let me see if I can do that really quick.
00:02:37.680 | If I can look at the participants, and make Vibhu co-host as a person.
00:02:55.360 | There we go.
00:02:57.520 | Okay.
00:02:58.520 | Awesome.
00:02:59.520 | Done and done.
00:03:02.520 | Okay.
00:03:03.520 | So V1 basically was just the starting point.
00:03:07.560 | A pretty solid starting point.
00:03:09.440 | They had a pretty diverse data set, 2,000 question-function-answer pairs across multiple languages
00:03:18.160 | and domains, and it includes simple, multiple, parallel, and parallel-multiple function calling
00:03:24.160 | scenarios.
00:03:25.160 | And so the first version of the leaderboard looked like this.
00:03:29.680 | Not terribly different than what we see today, but there are some new things that you'll
00:03:33.360 | see as we evolve.
00:03:35.880 | The initial data set composition is really interesting, because it's very code oriented.
00:03:41.880 | A lot of this initial set of function calling was very focused on specific language tasks
00:03:48.880 | across Python and other languages.
00:03:52.040 | And then the main, I think sort of the main innovation of this first version of the leaderboard
00:04:00.740 | was that it introduced this abstract syntax tree evaluation, where it
00:04:07.400 | would look at the executable functions and evaluate them.
00:04:11.260 | And they have this nice flow chart in here.
00:04:15.740 | It's pretty interesting how they created this, where it will evaluate the functions and the
00:04:22.940 | parameters and see if it matches what was expected, and then kind of run it through
00:04:30.380 | these different scenarios.
00:04:31.660 | And they go into more detail in the actual blog article of how they
00:04:39.660 | parse through the functions with abstract syntax tree navigation and match the parameters
00:04:45.620 | and that kind of thing.
00:04:46.620 | So if you want to dig deeper into that, you can go through there.
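As a rough illustration of the idea (not the leaderboard's actual matcher), an AST-style check can parse the model's call string and compare the function name and parameters against the expected call. The optional-parameter handling below, and the fact that it only looks at keyword arguments, are my simplifications.

```python
import ast

def parse_call(call_str: str):
    """Parse a single Python-style call like "get_weather(city='Tokyo')"."""
    node = ast.parse(call_str, mode="eval").body
    assert isinstance(node, ast.Call), "expected a function call"
    name = node.func.id if isinstance(node.func, ast.Name) else ast.unparse(node.func)
    # Only keyword arguments are handled in this sketch.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs

def ast_match(model_call: str, expected_call: str, optional=frozenset()) -> bool:
    """Accept the model's call if the function name matches and every expected
    parameter has the expected value; omitted optional params are tolerated."""
    got_name, got_kwargs = parse_call(model_call)
    exp_name, exp_kwargs = parse_call(expected_call)
    if got_name != exp_name:
        return False
    for param, value in exp_kwargs.items():
        if param in optional and param not in got_kwargs:
            continue
        if got_kwargs.get(param) != value:
            return False
    return True

print(ast_match("get_weather(city='Tokyo', unit='celsius')",
                "get_weather(city='Tokyo', unit='celsius')"))  # True
```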
00:04:51.560 | So that was sort of the first version.
00:04:52.560 | Like I said, it was only six or seven or eight months ago, so not that long ago.
00:04:58.140 | And then with the second version, they were really focused on improving the data set.
00:05:06.780 | So they came out with this second version titled live data set, and they were really
00:05:12.980 | focused on getting more real-world scenarios, like with live user-contributed data, having
00:05:20.020 | more rare use cases, like lots and lots of nested parameters, and then fixing
00:05:27.600 | some of the issues from the first version, just figuring out some areas of like data
00:05:32.980 | contamination and bias and things like that.
00:05:36.700 | So they spent a lot of time, like most of the second article is just about all of the
00:05:43.380 | different issues that they've worked on with the data pre-processing, filtering,
00:05:48.500 | quality improvement, that kind of thing.
00:05:50.980 | And I'll share these slides in the, in the discord as well.
00:05:53.700 | So you don't have to worry about taking notes.
00:05:56.620 | And they ended up with a data set of 2,251 question-function-answer pairs so that they
00:06:03.580 | could address some of these issues.
00:06:05.400 | So they did a lot of like deduplication using ROUGE scores.
00:06:10.020 | They did a bunch of filtering and just like standardized the function documents to better
00:06:14.680 | fit their format and enhance their user prompting.
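Here is a small sketch of what ROUGE-based near-duplicate filtering can look like; the metric variant and the 0.7 threshold are assumptions for illustration, not values from the blog post. It uses the rouge-score package (pip install rouge-score).

```python
# A sketch of ROUGE-based deduplication, roughly the kind of filtering
# described; threshold and metric choice are my assumptions.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def deduplicate(questions, threshold=0.7):
    kept = []
    for q in questions:
        # Drop q if it overlaps too heavily with anything we've already kept.
        if any(scorer.score(k, q)["rougeL"].fmeasure >= threshold for k in kept):
            continue
        kept.append(q)
    return kept

print(deduplicate([
    "Find me flights from New York to Tokyo tomorrow",
    "Find flights from New York to Tokyo for tomorrow",
    "What is the weather in Paris?",
]))
```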
00:06:18.740 | So then if you look at the way they display the dataset composition in the second
00:06:30.780 | version, you can see that it's a lot less focused on specific coding languages and things
00:06:37.060 | like that, and more about what they were trying to accomplish with it.
00:06:41.140 | So like relevance detection, irrelevance detection, and maintaining some of these things with
00:06:46.980 | the abstract syntax tree and parallel multiple function calling and things like that.
00:06:52.580 | So a lot of the sort of like underlying methodology of how they did the benchmarking didn't really
00:06:56.980 | change in the second version so much as the data process.
00:07:01.580 | So in that second version, they have this nice, another nice visual.
00:07:04.900 | I really like how many, how many visuals they include in these, in these blog articles.
00:07:09.140 | So they have this one for how they actually did kind of the flow for all of the live data
00:07:16.180 | that they were getting.
00:07:17.440 | So they took the user queries and did this data pre-processing with them, and then did
00:07:22.260 | some filtering and some standardization and things like that to end up with this dataset.
00:07:30.940 | So that's the second version.
00:07:33.660 | And then the third version was a pretty big leap forward, to go into multi-turn and
00:07:43.180 | multi-step function calling.
00:07:44.780 | So previously it was single turn or some parallel function calling, but this is
00:07:52.900 | really digging into multi-turn and multi-step function calling, which I'll show a diagram
00:07:57.220 | of in a second.
00:07:59.020 | But really to get to where we're testing for, you know, mimicking
00:08:06.820 | agentic behaviors, which is where we're all kind of trying to get to by using function calling.
00:08:13.080 | So one of the sort of main innovations of this that we'll talk about in a second
00:08:19.460 | is what they called state-based evaluation.
00:08:24.060 | So they really updated their evaluation from the abstract syntax tree to something called
00:08:29.740 | state-based evaluation, which we'll, we'll talk about in just a second.
00:08:34.020 | So they define the terms in the third version of single turn, multi-step, and multi-turn.
00:08:41.520 | And that's basically the difference in being able to say something like just, you know,
00:08:44.860 | find me flights from New York to Tokyo tomorrow versus like, I want to plan a flight or I
00:08:50.700 | want to find multiple flights, or multi-turn being sort of like maintaining a context and
00:08:58.060 | asking follow-up questions.
00:09:00.380 | And that's where like having to evaluate state comes into play because the model is going
00:09:05.100 | to have to remember that it's got these different responses kind of rolling of like, I want
00:09:10.100 | to book this one, okay, book me the 10 a.m. version, okay, like what's the confirmation
00:09:15.460 | ID, that kind of thing.
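To make that concrete, here is an illustrative (not BFCL-format) multi-turn episode; the function names are made up, but it shows why later turns only make sense if the harness carries state forward from earlier ones.

```python
# An illustrative multi-turn episode: each user turn may require one or more
# function calls, and later turns depend on state created by earlier ones,
# which is why the harness has to track state across turns. All names here
# are hypothetical.
episode = [
    {"user": "Find me flights from New York to Tokyo tomorrow",
     "expected_calls": ["search_flights(origin='JFK', dest='NRT', date='2024-09-21')"]},
    {"user": "Book me the 10 a.m. one",
     # Depends on the search results from turn 1 still being in context.
     "expected_calls": ["book_flight(flight_id='NH109')"]},
    {"user": "What's the confirmation ID?",
     # No new booking should happen; the model should read existing state.
     "expected_calls": ["get_booking(confirmation_for='NH109')"]},
]
```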
00:09:18.540 | So really interesting how they set this up.
00:09:22.860 | So the dataset composition, they also made some pretty interesting improvements on how
00:09:29.060 | they did the data composition.
00:09:32.060 | They started categorizing them according to like multi-turn and augmented.
00:09:38.660 | I see a question in the chat, go full screen or slideshow.
00:09:41.580 | I can, I was leaving it like this so that we could bounce back and forth to the actual
00:09:46.620 | blog articles.
00:09:47.620 | But for now I could do, oh, actually then I'd have to stop sharing and re-sharing.
00:09:51.060 | So I'm going to leave this for now, unless it's like really, really hard for people to see.
00:09:55.340 | I can also hide the navigation and zoom in a little bit probably.
00:10:04.300 | Okay.
00:10:05.300 | Oh, that's going to be too much.
00:10:09.680 | Let's go back to fit.
00:10:12.120 | I'll also share these slides and see if I can just, let me just dump the link in here
00:10:21.260 | as well.
00:10:22.260 | But if you want to look at it close up, you can.
00:10:30.300 | Okay, so they had to change their approach to actually curating the data, which was really
00:10:37.300 | interesting because they wanted to be able to capture the multi-turn and the long context
00:10:43.460 | multi-turn and that kind of thing.
00:10:46.340 | And so they did this really interesting approach where they basically created their own API
00:10:51.420 | for the sake of this testing, and then did this like mapping to create a graph edge construction
00:10:59.320 | and like generate tasks through that.
00:11:03.260 | And so I'll just show really quick, I'm jumping ahead a little bit, but what was cool is they
00:11:13.060 | built out this like whole API code base where they had like vehicle control and stock trading
00:11:21.520 | and travel booking, and then like a file system.
00:11:25.220 | And then also like a couple of different like cross-functional APIs like for messaging and
00:11:29.980 | simulating Twitter and creating tickets and doing math and that kind of thing.
00:11:35.060 | And kind of like built out this system where they would build out these APIs and generate
00:11:43.060 | a graph from it and then use that to derive a bunch of example queries and function lists
00:11:49.460 | and that kind of thing.
00:11:50.460 | And then generate this dataset that way, which I thought was like pretty ingenious.
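As a toy illustration of what "building out an API code base" can look like, here are two tiny stateful mock APIs; the class and method names are mine, not the benchmark's, but the point is that every call mutates state the evaluator can later inspect.

```python
# Toy stand-ins for the kind of stateful mock APIs described (vehicle control,
# stock trading, etc.); names are illustrative, not the benchmark's.
class VehicleControlAPI:
    def __init__(self, fuel_level=5.0, tank_capacity=50.0):
        self.fuel_level = fuel_level
        self.tank_capacity = tank_capacity
        self.engine_on = False

    def fill_fuel_tank(self, amount: float) -> float:
        """Add fuel, capped at tank capacity; returns the new level."""
        self.fuel_level = min(self.tank_capacity, self.fuel_level + amount)
        return self.fuel_level

    def start_engine(self) -> bool:
        self.engine_on = self.fuel_level > 0
        return self.engine_on


class TradingAPI:
    def __init__(self):
        self.orders = []

    def get_available_stock(self, symbol: str) -> dict:
        # In a real harness this would consult seeded scenario data.
        return {"symbol": symbol, "price": 123.45}

    def place_order(self, symbol: str, qty: int) -> dict:
        order = {"id": len(self.orders) + 1, "symbol": symbol, "qty": qty}
        self.orders.append(order)
        return order
```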
00:11:56.820 | So you can now see the third version, the dataset composition looks really, really different
00:12:01.860 | than it did in the first couple.
00:12:03.140 | And that's because they're really honing in on like what they're trying to test for there.
00:12:08.100 | And so, and they actually go into, in the article, they go deeper into like what all
00:12:17.300 | of these mean and what they do.
00:12:21.580 | But you can see that they're doing things like having a specific set of questions for
00:12:25.540 | testing for long context or testing for missing params, that kind of thing.
00:12:31.100 | And so, yeah, like I said, they turned that into this nice system with an API that they
00:12:37.100 | created that they could use to evaluate.
00:12:41.660 | And then they have this interesting validation process where they check for the questions
00:12:47.180 | and they use this human ground truth as a way to check for these things.
00:12:54.780 | And they go into a little bit more on that validation side.
00:13:00.460 | Yeah, so they actually like have humans label like what would be the correct answer for
00:13:09.180 | these different scenarios and that way they could check to see how close the model gets
00:13:13.780 | to aligning with the human response.
00:13:22.460 | And so, then that gets us into state-based evaluation.
00:13:26.820 | And so, they use this as the primary metric to assess the performance of the models.
00:13:31.440 | And this is really interesting because they basically compare the final state
00:13:36.420 | after all the function calls are executed to see whether the system's internal changes
00:13:41.420 | after each step align with what you're expecting.
00:13:44.940 | Which is better for reflecting real world performance because in multi-turn interactions,
00:13:50.220 | you're trying to update a system state, whether it's with a vehicle or a file system or something
00:13:54.560 | like that.
00:13:55.780 | You want to be able to check to see whether you've successfully deleted a file or something
00:14:02.020 | like that.
00:14:03.020 | So, it's not as straightforward as just, like, answering a math problem or calling an API or something
00:14:08.660 | like that.
00:14:09.660 | So, they compare the attributes of the system state after every turn against the expected
00:14:14.860 | state.
00:14:15.860 | So, for example, if you had a series of function calls of asking the LLM to like create a file,
00:14:23.100 | write data to the file and close it, it would basically check after each turn whether the
00:14:28.060 | file exists, whether the correct data was written, whether the file was properly closed.
00:14:34.580 | And if they're present and correct, then the evaluation succeeds.
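Here is a minimal sketch of that create/write/close example as a state-based check: run the calls against a tiny in-memory file system and compare its attributes to the expected end state, rather than string-matching the calls themselves. The class is a stand-in, not the benchmark's actual backend.

```python
# A toy state-based check: execute the calls, then compare the backend's
# attributes against the expected state.
class TinyFileSystem:
    def __init__(self):
        self.files = {}        # name -> contents
        self.open_files = set()

    def create(self, name):
        self.files[name] = ""
        self.open_files.add(name)

    def write(self, name, data):
        self.files[name] += data

    def close(self, name):
        self.open_files.discard(name)

fs = TinyFileSystem()
# Pretend these are the function calls the model chose to make:
fs.create("report.txt")
fs.write("report.txt", "q3 numbers")
fs.close("report.txt")

expected = {"files": {"report.txt": "q3 numbers"}, "open_files": set()}
passed = vars(fs) == expected   # state matches -> the episode passes
print(passed)
```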
00:14:39.540 | And so, then they also have a section at the end on the results and error analysis.
00:14:44.580 | I thought this was like a super interesting section of what the models struggled with.
00:14:50.460 | And they gave a few examples.
00:14:51.540 | So, the first one was failure to perform implicit actions.
00:14:55.380 | So, they had this example here of like asking the model to fill a fuel tank.
00:15:04.300 | And so, the model thought that the tank was already full or nearly full and didn't like
00:15:10.100 | understand that they didn't actually need to do the filling.
00:15:15.740 | So, that was pretty interesting.
00:15:19.020 | And then the second one they gave was failure to understand the current state before performing
00:15:23.860 | the action.
00:15:24.860 | So, if, for example, the present working directory is already Alex,
00:15:31.620 | and you tell it to go into the directory named Alex, it didn't know that it could just
00:15:36.520 | check to see whether you were already in the present working directory named Alex.
00:15:41.460 | It would just create it anyway.
00:15:43.860 | So, pretty interesting.
00:15:46.240 | And then the last example that they give is LLMs incur unnecessary planning and thinking,
00:15:52.340 | which I thought was like a really funny way of putting that.
00:15:56.140 | But so, they give the example of like you're already authenticated in the Twitter API,
00:16:01.740 | and you ask the model to do something with Twitter, and it like goes ahead and tries
00:16:09.540 | to authenticate you again.
00:16:10.980 | It didn't know that you were already authenticated with Twitter.
00:16:15.020 | And so, it's sort of like unnecessarily added a step.
00:16:17.980 | And so, it's interesting because all three of these sort of tie into like knowing the
00:16:22.260 | current state or like current context that the question is asked in.
00:16:26.980 | And that turns out to be like more challenging than you would think for these models.
00:16:33.860 | And so, that's like, that's all I had like for the actual overview.
00:16:36.780 | And then I figured if anybody wanted to just like chime in and talk about any of these
00:16:41.420 | things we could.
00:16:42.420 | I also posted the data set on Hugging Face, which is pretty interesting.
00:16:48.420 | It doesn't quite work.
00:16:50.700 | There's some sort of error where you can't like see all of it, but you can see some of it.
00:16:55.660 | A lot of them are the math problems, but there's also like some travel questions and other
00:17:00.940 | like planning questions and things like that, which are kind of just like interesting inspiration
00:17:04.740 | for function calling in general.
00:17:06.780 | But yeah, that's all I had for the actual like overview of everything.
00:17:14.860 | Since nobody's jumping in here, I found what you mentioned about using the syntax tree
00:17:28.820 | to generate sort of like, I think, I don't know if I understood this correctly, but generating
00:17:40.580 | sort of test cases that are like a sequence of API calls basically.
00:17:46.620 | And then from that, generating a scenario that sort of implements that, so that
00:17:54.180 | you can check it against an absolute correct answer, and then, like, a scenario that you can
00:18:02.460 | have the function calling LLM follow through.
00:18:06.780 | Is that, did I understand that correctly?
00:18:09.700 | I think so.
00:18:10.700 | Yeah.
00:18:11.700 | Caveat that I'm still wrapping my head around it too, but yeah, I think that's basically
00:18:16.420 | what they did, where they kind of had these different categories of things that they wanted
00:18:23.700 | to accomplish.
00:18:24.700 | And then they used the source code and everything to create the graph and like generate the
00:18:33.980 | tasks to do it.
00:18:36.540 | But then I still wonder, did you grok how they did the graph edge
00:18:43.500 | construction?
00:18:44.500 | Yeah.
00:18:45.500 | Let's see.
00:18:46.500 | We'll talk about that a little bit more down here.
00:18:49.740 | Yeah.
00:18:50.740 | Yeah.
00:18:51.740 | So there's this section here where they talk about each function represents a node and
00:18:57.540 | then they manually map out directed edges, meaning a function's output is an input of the downstream
00:19:03.900 | functions.
00:19:04.900 | I guess they've got this example here of placing order, like getting the available stock is
00:19:13.100 | necessary in order to place the order for the stock.
00:19:17.700 | Whenever we need a data set, we sample a node on the graph and randomly traverse the graph
00:19:24.180 | to generate an execution path.
00:19:26.380 | I see.
00:19:27.380 | Execution path.
00:19:28.380 | We're able to extrapolate.
00:19:29.380 | Yeah.
00:19:30.380 | Oh, right.
00:19:31.380 | Okay.
00:19:32.380 | So they basically go something like, okay, manually create the graph for the API and
00:19:36.660 | then sample a node in the graph and then traverse the graph according to the edges.
00:19:44.820 | And then from that, then generate.
00:19:48.300 | So you have function calls, a list of function calls and probably parameters, and then generate
00:19:54.340 | a scenario based on that, that sort of matches with that.
00:19:58.780 | Is that kind of what I'm understanding?
00:19:59.780 | Yeah, I think so.
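A sketch of that graph idea as described: functions are nodes, an edge from A to B means A's output feeds B's input, and you sample a node and walk the edges to get an execution path. The edge list here is illustrative, not the benchmark's real API graph.

```python
# Functions as nodes, edges where one function's output feeds another's input;
# an episode is generated by sampling a start node and walking the edges.
import random

edges = {
    "get_available_stock": ["place_order"],
    "place_order": ["get_order_status", "cancel_order"],
    "get_order_status": [],
    "cancel_order": [],
}

def sample_execution_path(edges, max_len=4, seed=None):
    rng = random.Random(seed)
    node = rng.choice(list(edges))
    path = [node]
    while len(path) < max_len and edges[node]:
        node = rng.choice(edges[node])
        path.append(node)
    return path

# Prints a function-call sequence such as ['get_available_stock', 'place_order', ...]
print(sample_execution_path(edges, seed=0))
```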
00:20:00.780 | And I forgot to mention this part, which is really interesting.
00:20:03.140 | They use this data set from Persona Hub to take a few examples of different personas
00:20:12.620 | and then generate the different queries based on those personas, basically.
00:20:19.540 | So like stock trading by an elderly hermit, for example.
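A hedged sketch of how a persona plus an execution path could be turned into a generation prompt; the prompt wording is mine, not the authors' actual pipeline, and you'd feed the result to whatever LLM you use for data generation.

```python
# Hypothetical persona-conditioned query generation: combine a persona with a
# sampled execution path and ask a data-generation LLM for a natural request.
def build_generation_prompt(persona: str, execution_path: list) -> str:
    return (
        f"You are writing a request from this persona: {persona}.\n"
        f"Write one natural request that would require calling, in order: "
        f"{', '.join(execution_path)}.\n"
        f"Do not mention the function names explicitly."
    )

prompt = build_generation_prompt(
    "an elderly hermit who trades stocks",
    ["get_available_stock", "place_order"],
)
print(prompt)  # then send `prompt` to whatever LLM you use for data generation
```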
00:20:23.100 | Yeah.
00:20:24.100 | I saw that.
00:20:25.100 | Yeah.
00:20:26.100 | But then I think after, once they had that, oh yeah, so they've got a triplet of question,
00:20:31.420 | function list, and initial config, and then they had humans actually look at it and verify
00:20:37.380 | everything and validate that these are good questions that actually make sense.
00:20:42.580 | Right.
00:20:43.580 | Okay.
00:20:44.580 | Yeah.
00:20:45.580 | I agree.
00:20:46.580 | That is a pretty cool way to generate a data set like this, where you kind of get a good
00:20:51.780 | covering set and sort of generate the language around it, and then I guess it's small enough
00:21:00.660 | that you can label the whole thing, or did they label the whole thing?
00:21:06.620 | I think so, because there were only like 2,200 questions.
00:21:10.300 | Yeah.
00:21:11.300 | So you can probably...
00:21:12.300 | So it seemed like they labeled everything.
00:21:14.460 | Yeah.
00:21:15.460 | Yeah.
00:21:16.460 | Okay.
00:21:17.460 | And they did a series of like validations for each one, and also built tests around it,
00:21:23.380 | unit tests and error handling tests and things like that.
00:21:26.140 | All right.
00:21:27.140 | Got it.
00:21:28.140 | Yeah.
00:21:29.140 | Because, I mean, I guess at the end of the day, the whole idea here, if I understand
00:21:33.580 | correctly, is that I can, because I'm tracking my state, I can know that the series of function
00:21:40.260 | calls that it made, I ended up in the correct state because of them, or I didn't.
00:21:47.580 | And so you get a hard yes/no on that.
00:21:51.140 | Yeah.
00:21:52.140 | I think that's exactly what they were designing around because of the state-based evaluation.
00:21:55.980 | Yeah.
00:21:56.980 | Yeah.
00:21:57.980 | Yeah.
00:21:58.980 | That's pretty cool.
00:21:59.980 | Yeah.
00:22:00.980 | There was also a whole section we didn't really dive into, where they go through
00:22:08.820 | a lengthy discussion on response-based evaluation and why they don't use the
00:22:14.380 | ReAct technique and things like that.
00:22:18.380 | But other things that are just interesting around like the limitations for evaluating
00:22:25.460 | with multi-turn scenarios.
00:22:28.060 | Yeah.
00:22:29.060 | Yeah.
00:22:30.060 | Cool.
00:22:31.060 | Well, yeah.
00:22:32.060 | Yeah.
00:22:33.060 | Thank you.
00:22:34.060 | Yeah.
00:22:35.060 | That's all I got.
00:22:39.140 | These aren't like super long.
00:22:40.740 | And so it's not like a crazy amount of stuff to read through, but it's definitely really
00:22:46.540 | interesting and then helps you kind of understand things a little bit better.
00:22:50.180 | The only thing I didn't, I didn't see a lot of information on was how they derived this
00:22:54.180 | hallucination measurement.
00:22:55.180 | I don't know if anybody else read through it and caught that, but it was the only thing
00:23:00.580 | that I didn't see like a lot of information on how they did the hallucination measurement.
00:23:08.140 | I believe, and this is not from the paper, but just like a thing that I think
00:23:15.260 | it would be, is because they have a data set called
00:23:22.740 | APIZoo, where basically it's sort of the open source repository of like the correct
00:23:28.500 | documentation of a large amount of APIs.
00:23:34.180 | So presumably that could be like a ground truth to benchmark against and then like how
00:23:38.860 | far, you know, you are, I don't know, probably semantically or something from that or state
00:23:43.780 | base wise.
00:23:44.780 | I don't know.
00:23:45.780 | Oh, interesting.
00:23:46.780 | I guess.
00:23:47.780 | That's interesting.
00:23:48.940 | I don't suppose anybody tried to run this locally? Because they link out to
00:23:57.100 | the code.
00:23:58.100 | You can actually like run it locally.
00:23:59.580 | I did not try to do it, but just so people know, you actually can, you can just
00:24:07.660 | clone this and try it yourself, although you need like a thousand API keys, like
00:24:13.780 | any other leaderboard thing.
00:24:16.900 | But yeah.
00:24:17.900 | I've not tried the leaderboard specifically, but there are a few things in that
00:24:23.180 | repo that are quite useful: Gorilla OpenFunctions, Gorilla CLI, et cetera, et cetera.
00:24:28.300 | Check them out.
00:24:29.300 | Oh, cool.
00:24:30.300 | Sam, is there any description on how they provide the state
00:24:38.340 | of the model, like what API queries it has already executed or what the original request was?
00:24:45.300 | How do they go about that?
00:24:46.300 | Yeah.
00:24:47.300 | I think there was, in this section. I mean, they don't go into a ton of detail, but basically,
00:24:55.580 | let's see.
00:24:56.580 | Ah, so yeah, it says: as the model interacts with the API, the backend tracks how the
00:25:02.940 | state evolves.
00:25:05.220 | Each function call alters the state, such as creating files. At every turn, we compare
00:25:09.380 | the instance's current state from executing the model's function calls with the expected
00:25:13.700 | ground truth.
00:25:14.700 | So they don't go into like any technical detail.
00:25:17.780 | I bet, I bet it's in the code somewhere.
00:25:21.660 | But it sounds like it's done in the API itself.
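Reading that, a plausible (assumed, not confirmed) harness shape is: keep two copies of the stateful mock backend, apply the model's calls to one and the labeled ground-truth calls to the other, and diff their state after every turn. `make_backend` and `apply_calls` below are placeholders for however the real code constructs backends and executes call strings.

```python
# A sketch of per-turn state comparison. `make_backend` builds a fresh mock
# backend instance; `apply_calls` executes a list of call strings against it.
def states_match(instance_a, instance_b) -> bool:
    # Compare every tracked attribute of the two backend instances.
    return vars(instance_a) == vars(instance_b)

def evaluate_episode(model_calls_per_turn, truth_calls_per_turn,
                     make_backend, apply_calls) -> bool:
    model_env, truth_env = make_backend(), make_backend()
    for model_calls, truth_calls in zip(model_calls_per_turn, truth_calls_per_turn):
        apply_calls(model_env, model_calls)   # what the model chose to do
        apply_calls(truth_env, truth_calls)   # what the labelers said should happen
        if not states_match(model_env, truth_env):
            return False  # state diverged on this turn, so the episode fails
    return True
```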
00:25:25.420 | I'd be curious to know more detail about like, what, like how that actually happens and how
00:25:31.020 | they read it out.
00:25:34.940 | It seems like that would need to be specific to the API, right?
00:25:38.540 | Because each API is going to have its own set of state variables that it needs to track,
00:25:44.980 | right?
00:25:45.980 | Like a file system is, oh yeah, Twitter posts or whatever.
00:25:48.540 | But it seems like their mock APIs probably implement some sort of state.
00:25:54.820 | That's my guess.
00:25:55.820 | That might be a good, I think Swyx said they're going to have somebody from BFCL on the podcast
00:26:00.420 | soon.
00:26:01.420 | That would probably be a good question for them.
00:26:02.420 | I'd be super curious about that.
00:26:04.760 | So do they have like an API that orchestrates the requests to other
00:26:10.160 | APIs, or is this just like referring to any API that you hit?
00:26:16.060 | I think it's, yeah, they must've had, they must've had sort of one collective API that
00:26:24.660 | was a collection of the, of the like three or four or no, it was like four single things
00:26:32.860 | in like four cross-functional APIs.
00:26:35.540 | So they must've had something that was orchestrating those different.
00:26:38.900 | Yeah.
00:26:39.900 | I mean, if you're not familiar with Gorilla itself, that's what it does, is it's
00:26:46.020 | only, it's a model that specifically generates API calls and nothing else.
00:26:49.900 | Wait a minute, but we're talking about a model and then we're talking about like an API that
00:26:56.180 | orchestrates requests to other APIs.
00:26:58.900 | So you can make an API call that calls this model, but that's where I'm like,
00:27:03.860 | did you say it was a model, um, making API calls that it was trained on?
00:27:08.820 | I showed up to this meeting late, so I might not know this, but I don't know, I'm not
00:27:14.340 | really sure, I can't picture this.
00:27:16.420 | Yeah, that's a good question.
00:27:19.020 | I'm not, I'm not sure how they, how they did that.
00:27:22.980 | If they had each API just separate, um, and isolated so that the model would just
00:27:30.100 | call that API directly, or if, probably, I mean, that would be
00:27:37.620 | my guess, is that they kept them separate, because then they would know whether the model was
00:27:43.540 | actually calling the right API and not just like a gateway.
00:27:48.180 | I don't know.
00:27:50.340 | That's a good question.
00:27:54.980 | So Yikes, you're having somebody from the gorilla team come to speak.
00:28:00.140 | Um, I think that is, I think that was, I saw Swyx mentioning something about that.
00:28:07.980 | Um, I think he's going to have somebody come on the podcast was what I think he said.
00:28:12.620 | I don't remember who, yeah, there's something in discord about it.
00:28:17.300 | Uh, I, I am not organizing anything at the moment, but, uh, you know, if, if you have
00:28:21.900 | questions, they, they tend to be like super responsive in discord.
00:28:24.420 | So just go jump in the Gorilla LLM Discord and pick some brains.
00:28:28.540 | Yeah.
00:28:29.540 | Yeah.
00:28:30.540 | It looks like a pretty active server.
00:28:31.540 | It would be so much fun if he hosted the podcasts with us as an audience, I would love that.
00:28:39.180 | I could dig it.
00:28:40.180 | That could be fun.
00:28:41.180 | Yeah, I would have loved to have spoken to like the, um, the person behind structured
00:28:45.180 | generation at OpenAI and a lot of other people.
00:28:48.760 | It'd be interesting to, like, at least a questions-to-ask drop-off box
00:28:54.380 | is worth doing or something.
00:28:57.100 | Yeah.
00:28:58.100 | Or something we can do is like, we could do like an AI in action where we actually like,
00:29:03.320 | you know, deploy this.
00:29:04.460 | Um, I don't know, Sam, would you be like open to doing that?
00:29:09.260 | I I'm not in charge here.
00:29:10.700 | I don't know.
00:29:11.700 | I don't know.
00:29:13.700 | You just said deploy this.
00:29:13.700 | What do you mean?
00:29:14.700 | Like deploy this?
00:29:15.700 | No, no, no.
00:29:17.700 | It's not deploy.
00:29:17.700 | Just, just get it running.
00:29:18.700 | Um, just to like clear up any doubts.
00:29:20.500 | Like the leaderboard.
00:29:21.500 | Do you mean Gorilla?
00:29:23.700 | I'm talking about this, this system that they're describing here.
00:29:29.500 | Oh, well, uh, okay.
00:29:33.460 | So yeah.
00:29:34.460 | So that's the leaderboard.
00:29:35.460 | I think, um, I mean, uh, I mean, yeah.
00:29:40.620 | Like, if that's something that you are interested in trying to, to make happen, I presume that
00:29:47.380 | we would have an open slot on AI in action at some point.
00:29:50.300 | And that like, sounds interesting, just like try to get the thing working.
00:29:54.140 | Um, I'm not committing to anything.
00:29:56.940 | Not committing.
00:29:57.940 | I mean, that's, that's fair.
00:30:00.060 | That is an interesting idea of, like, we could do an AI in action that's more of a, like,
00:30:05.260 | let's try and get the thing working as opposed to like, here is how the working thing that
00:30:08.900 | I use works.
00:30:10.900 | No, no, no.
00:30:11.900 | Interesting.
00:30:12.900 | Or I already, I already have it working.
00:30:13.900 | No, trying to get it working in a span of like one hour is not feasible for
00:30:17.740 | me at least.
00:30:18.740 | Yeah.
00:30:19.740 | I mean, you know, cool idea.
00:30:20.740 | I think, um, I'm just throwing it out there.
00:30:21.740 | If somebody wants to do that, I'm, I'm gonna be on there.
00:30:28.380 | For sure.
00:30:29.380 | I'm going to be an audience.
00:30:32.620 | Yeah, I've had Gorilla come up very often, partially from Yikes and other people.
00:30:39.920 | I'm a huge Gorilla shill and kind of a Shishir fan, you know, um, but that's, uh, the Gorilla
00:30:47.980 | CLI is really helpful.
00:30:49.180 | Um, cause I, I was using GitHub CLI or the GitHub copilot CLI for quite some time.
00:30:54.260 | Um, but Gorilla CLI is like open and I don't have to worry about GitHub connections.
00:30:57.940 | I can just like grab it and it does the same thing basically.
00:31:00.700 | What is the Gorilla CLI do?
00:31:04.460 | I don't, I'm not, um, you just put in a natural language query of the CLI command
00:31:09.540 | that you want to run, um, and it'll give you the CLI command to do it.
00:31:14.220 | It's like, uh, if you used Copilot CLI, it's the same thing basically.
00:31:17.220 | I see.
00:31:18.220 | Okay.
00:31:19.220 | I haven't used that.
00:31:20.220 | That's, that's a cool idea.
00:31:21.220 | I mean, I've wanted to implement it myself, but it sounds like, you know, Gorilla
00:31:28.700 | already has, uh, they already tied up their model with a little front end.
00:31:35.660 | That's cool.
00:31:36.660 | Yeah.
00:31:37.660 | No, there's, uh, the, the other interesting thing, I guess, to go over the cool stuff
00:31:41.980 | in the repo, there's a thing called, uh, Gorilla OpenFunctions where you can sort
00:31:46.260 | of more or less like staple a Gorilla to whatever model that you're playing with.
00:31:51.340 | So if it doesn't have function calling, you can just implement OpenFunctions and it'll
00:31:54.820 | give it function calling basically.
00:31:57.020 | Oh, cool.
00:31:58.960 | And then you ended up with no files in your file system.
00:32:06.820 | Yeah.
00:32:07.820 | Well, you know, it does hallucinate from time to time, so yeah, we didn't actually, I almost
00:32:16.780 | wonder if that'd be another, either AI in action or paper club, just like the Gorilla
00:32:21.780 | stuff in general.
00:32:22.780 | Like we didn't, there's a paper for Gorilla and everything, but yeah, like it was going
00:32:26.780 | to be too much to try to cram into this.
00:32:31.580 | That's really interesting.
00:32:32.580 | Yeah.
00:32:33.580 | I do feel like I would like to see a demo of, like, that whole CLI
00:32:43.900 | tool chain.
00:32:44.900 | Like I know that, um, uh, I've, I've seen a bunch of like, sort of these things in use
00:32:51.900 | and, but I, you know, I haven't gotten started with them myself and I wanted to, and never
00:32:56.540 | had the time to get started and know what's actually useful.
00:33:00.700 | So yeah.
00:33:01.700 | Yeah.
00:33:02.700 | It would be a cool AI in action.
00:33:10.100 | That would.
00:33:15.100 | Cool.
00:33:16.100 | Anything else?
00:33:17.100 | Anybody other, other discussion topics?
00:33:20.200 | So I actually, I've had an idea in my head for a while and I wanted
00:33:26.980 | to share it with people.
00:33:27.980 | It's relevant to this topic, so I want to share it with people in case
00:33:33.100 | there are no other topics, um, but the idea is basically to build a leaderboard of the
00:33:41.060 | quality of predictions of, um, an LLM or LLMs using, um, something like an Elo,
00:33:51.940 | like, uh, so the idea would be to have questions that are around
00:34:00.600 | some future event, like who will be elected president.
00:34:03.520 | Right.
00:34:04.520 | And you have a list of them, and then you ask LLMs
00:34:10.980 | to predict based on some sort of canonical data that you also bring in for the LLM to
00:34:17.860 | use plus any other data that it wants to, like if it's an agent system, then it can
00:34:22.580 | go and search the web and do whatever it wants and brings in like whatever context it wants
00:34:28.380 | and then makes prediction on a given day about that event.
00:34:31.740 | And then that series of questions, like, sort of evolves every day, and you basically
00:34:38.700 | have the LLMs placing bets basically on that.
00:34:42.100 | And then over time you track the quality of prediction of those LLMs or agent systems.
00:34:48.420 | And the reason why I think this is interesting is because, um, judging a, a, an AI or a person's
00:34:58.540 | prediction quality is actually one of the only ways that you can actually assess like
00:35:06.060 | quality of information on the, on the internet.
00:35:09.500 | Right.
00:35:10.500 | So if, if you have a, you have a person or an AI that can accurately predict the future
00:35:20.140 | and that, and, and do it better in sort of, in like sort of stacked rank, right, because
00:35:23.860 | I can always predict, I can very accurately predict that the sun will come up tomorrow.
00:35:27.860 | So it's not, it has, you have to be, it has to be a relative ranking.
00:35:32.420 | And if you have a, a, an AI or a person that can predict better than, you know, sort of
00:35:38.060 | the, than the crowd, then that person likely has a good model of the world.
00:35:45.260 | Right.
00:35:46.260 | And that, that good model of the world represents, you know, sort of like at least, um, somebody
00:35:52.740 | who you probably want to listen to more than the crowd.
00:35:56.020 | Right.
00:35:57.020 | So I feel like this would be a really cool, um, start to developing a way to judge both
00:36:04.020 | people.
00:36:05.140 | Like you can imagine, like, crawling the news sites and looking at which people are
00:36:11.140 | making predictions, and then sort of keeping track scores for different people, um,
00:36:18.100 | and their predictions, in addition to LLMs.
00:36:20.300 | So I think, I feel like this would be a really cool way to get a project started.
00:36:24.220 | I wonder if people have thoughts about this and if anyone would be interested in working
00:36:27.580 | together on it. Deafening silence.
00:36:36.500 | Uh, I was, uh, I was, I was busy trying to get my Gorilla CLI, uh, working.
00:36:43.900 | Um, but I, uh, so I was distracted.
00:36:47.180 | What is the thing that you want, uh, help on that you were doing?
00:36:50.700 | Okay.
00:36:51.700 | So, a leaderboard of the quality of predictions that AIs and agents and people make.
00:36:59.140 | So like news items or whatever, like certain current events and then you predict and that
00:37:06.020 | becomes a bet kind of sorta, and I can explain the betting mechanism that is fair.
00:37:10.420 | And then, um, like a, just a general prediction market.
00:37:14.220 | Well, it's like, it is like a prediction market, but it probably looks a little different.
00:37:20.500 | Um, because I don't think it's useful to have, it's particularly useful to have it like,
00:37:27.260 | like maintaining an account and the ability, like that's a separate ability that I think
00:37:31.700 | is not necessary to track.
00:37:33.760 | So it, but something like a prediction market, but for us, um, a set of things that, um,
00:37:43.380 | you know, you sort of like a limited set of things that you track over time and whichever
00:37:48.580 | LLM and or person does the best kind of filters to the top.
00:37:54.580 | That sounds interesting and probably related to things that I'm interested in.
00:37:57.460 | So if you like, I don't know if you'd start cranking on it or throw up a repo and shoot
00:38:01.220 | me a link or whatever.
00:38:02.220 | I'll probably take a look and play around with it.
00:38:06.620 | Yeah.
00:38:07.620 | Um, if you're not familiar, and I don't know if they have like an SDK or something
00:38:11.060 | so you can rig up your own, but, uh, you can check out Polymarket.
00:38:14.180 | It's like one of the biggest, uh, yeah.
00:38:16.420 | Yeah.
00:38:17.420 | So, I mean, I don't, I don't want to place real bets.
00:38:22.580 | Um, and I'm not sure that the, maybe, maybe, I'm sure there are, like, I'm thinking like
00:38:29.380 | just throw up a Polymarket instance on like a testnet or something and just use monopoly
00:38:33.220 | money or whatever.
00:38:34.220 | Yeah.
00:38:35.220 | Yeah.
00:38:36.220 | Monopoly.
00:38:37.220 | Okay.
00:38:38.220 | Wait.
00:38:39.220 | So Polymarket.
00:38:40.220 | Well, okay.
00:38:41.220 | Maybe I'm not familiar.
00:38:42.220 | I thought I was familiar with Polymarket, but maybe I'm thinking of a different one.
00:38:43.940 | Is that the one that Nate's on? Uh, there's a Discord thread about this.
00:38:48.780 | Yeah, sure.
00:38:49.780 | Yeah.
00:38:50.780 | Yeah.
00:38:51.780 | Yeah.
00:38:52.780 | Get in there and see who's interested.
00:38:53.780 | Uh, yeah.
00:38:54.780 | Uh, back on paper club, do we have any last questions for Sam on gorilla?
00:39:01.540 | I think some stuff popped up in the chat, but it looks like you kind of answered
00:39:06.860 | it in chat too, um, basically that, and then for next week's paper, if anyone wants to
00:39:13.020 | volunteer, okay, has anyone, uh, have you guys seen Molmo?
00:39:21.700 | It's the open weight, open data, multimodal state of the art model.
00:39:27.480 | So it's from.
00:39:28.480 | Yeah.
00:39:30.480 | I think, I don't know how good, but the model is pretty good.
00:39:34.180 | It's a pretty decent technical report.
00:39:36.620 | I'm hoping the paper is good because I'm down to cover it.
00:39:39.800 | They do go a lot into data, how they, um, yeah, how they get the data.
00:39:43.520 | I think they pre-train it from flip, but it's, it's a pretty good recent multimodal foundation
00:39:50.720 | model.
00:39:51.720 | Um, I can cover that unless someone else wants to volunteer.
00:39:58.060 | That sounds super interesting.
00:39:59.060 | Yeah.
00:40:00.060 | Um, I was going to say I can be a backup if I pick a Monte Carlo paper and it actually,
00:40:04.760 | and everything works out.
00:40:05.760 | Um, but I would actually be pretty interested to get a download on Molmo too.
00:40:12.420 | Okay.
00:40:13.420 | I'll do, I'll do Molmo next week.
00:40:15.440 | Um, here's a link, I'll, I'll throw it in paper club, but thanks for, um, sharing Sam.
00:40:20.840 | Real quick.
00:40:21.840 | I did want to check in.
00:40:22.880 | Is there anybody that's like new to paper club that has not presented or is like kind
00:40:28.560 | of maybe on board with it?
00:40:30.460 | Maybe not.
00:40:31.460 | If so, uh, fricking, uh, raise your hand, I guess.
00:40:34.920 | And, uh, you know, if you need, uh, like a little bit of help or support or whatever,
00:41:39.800 | trying to like get something going, because I want to try to facilitate new people that haven't
00:40:45.040 | done it before.
00:40:46.040 | Cause I know you and I have done it a bunch of times.
00:40:48.360 | Um, but yeah, feel free to like ping me in discord or whatever, if you have questions
00:40:51.320 | or something, or if you want to do it, um, feel free to, yeah, let's let one of us know.
00:40:56.120 | And we would, we would happily hand over the reins.
00:40:58.840 | Yeah.
00:40:59.840 | I'll make an announcement that this is the paper.
00:41:02.960 | If anyone else wants to sub in, we can always take it.
00:41:06.560 | Yeah.
00:41:07.560 | Cool.
00:41:08.560 | Thanks.
00:41:09.560 | Uh, Sam, for covering. Any last thoughts on this?
00:41:13.200 | I know, I know some questions have popped up.
00:41:15.160 | You're answering.
00:41:17.160 | Yeah.
00:41:18.160 | This has been great.
00:41:19.160 | Thanks to everybody for listening and joining.
00:41:21.880 | Awesome.
00:41:22.880 | Thanks guys.
00:41:23.880 | Take care.
00:41:24.880 | Bye everyone.
00:41:25.880 | Thanks.