
Building State of the Art Open Weights Tool Use: The Command R Family: Sandra Kublik


Chapters

0:00 Introduction
2:11 Community Response
4:18 The Journey
6:32 Post Training
8:20 Open Source UI
10:22 Optimizing Tool Use
11:01 SingleStep vs MultiStep
12:14 MultiStep API


00:00:02.000 | What if I told you
00:00:15.240 | that we have just handed you the keys
00:00:18.280 | to a state-of-the-art model,
00:00:21.280 | which excels at structured,
00:00:25.520 | advanced RAG and sequential reasoning,
00:00:30.520 | and you can run it locally on your machine.
00:00:33.660 | It's competitive against GPT-4 Turbo, Claude Opus,
00:00:39.980 | and it's much smaller.
00:00:43.360 | We've been really hard at work at Cohere,
00:00:47.800 | working on our family of models,
00:00:50.760 | and today I'd like to talk to you
00:00:53.600 | about some of the stuff that we've done,
00:00:55.640 | the decisions that we've made
00:00:59.820 | when it comes to the model design,
00:01:01.940 | and also what we're cooking
00:01:03.860 | when it comes to the future of the models.
00:01:06.600 | So this year we've been working really hard
00:01:11.660 | to push the boundaries of what's possible with LLMs,
00:01:16.400 | and here's a quick look at our timeline.
00:01:21.760 | Three months ago, on March 11th,
00:01:23.620 | we released Command R.
00:01:24.940 | We opened the weights to the model.
00:01:27.980 | Command R is a model optimized
00:01:32.860 | for retrieval augmented generation,
00:01:35.020 | and it's scalable.
00:01:36.440 | It's small enough to be scale-friendly.
00:01:39.940 | We followed it up with Command R+.
00:01:45.600 | And this model is optimized for tool use,
00:01:51.860 | advanced retrieval augmented generation,
00:01:54.820 | and has become a very popular model
00:01:58.280 | in the open-source community.
00:01:59.700 | Within a few days of the release,
00:02:02.640 | we climbed the LMSYS arena.
00:02:05.820 | We're really proud of that.
00:02:07.320 | A really great achievement.
00:02:11.460 | Your response, as a community using the model,
00:02:15.180 | has been incredible.
00:02:16.180 | Here's some of the zeitgeist.
00:02:19.560 | We started trending on OpenRouter.
00:02:22.100 | Within two weeks of the release,
00:02:25.480 | the model has been downloaded
00:02:27.600 | a hundred and fifty thousand times from Hugging Face,
00:02:30.640 | which is wild.
00:02:34.200 | Folks at Hugging Face actually liked the model so much,
00:02:37.660 | especially when it comes to the tool use,
00:02:40.040 | that they decided to use it as a base model
00:02:42.460 | for HuggingChat.
00:02:43.380 | So now you can
00:02:47.060 | play with HuggingChat.
00:02:50.220 | It has a doc parser,
00:02:52.200 | an image editor.
00:02:53.340 | It even has a calculator.
00:02:54.840 | It had it before the iPad.
00:02:57.540 | So today, almost half a million
00:03:01.680 | developers and researchers
00:03:04.000 | are using the R family.
00:03:05.760 | We're really proud of that.
00:03:07.300 | It looks like you guys got really excited
00:03:12.180 | to get your hands on the model
00:03:14.220 | and to be able to play with the weights
00:03:17.640 | and look under the hood.
00:03:19.280 | We keep hearing your feedback
00:03:22.720 | and the love and support keeps pouring in.
00:03:25.060 | It really gets us going.
00:03:27.400 | And I've seen some super cool stuff
00:03:30.660 | built with R+ since then.
00:03:32.180 | Some of my favorite ones
00:03:34.200 | I want to shout out here
00:03:35.200 | are the Coding Assistant by Daniel San
00:03:37.640 | and a new generative search demo by Complexity.
00:03:42.940 | I'll try to demo it later.
00:03:45.880 | We'll see how the tech goes,
00:03:47.160 | but I'll give you a sneak peek.
00:03:49.640 | Another one that's my favorite
00:03:52.340 | is two Discord server bots that are powering
00:03:56.060 | our Discord community.
00:03:59.060 | I invite you to go and check it out.
00:04:01.240 | One of them is fine-tuned to be playful
00:04:06.020 | and to demo the model capabilities.
00:04:08.120 | And the other one is made to be helpful.
00:04:11.060 | It's grounded in our docs
00:04:13.200 | and it's focused on the information
00:04:15.440 | coming from the API.
00:04:17.060 | So I want to share the journey of building the R models,
00:04:24.920 | the decisions we've made along the way,
00:04:28.180 | and to show you that we've committed ourselves
00:04:32.480 | to building the top RAG tools for AI builders.
00:04:36.940 | We know firsthand that building RAG
00:04:42.600 | is excruciatingly hard.
00:04:46.900 | Tough word.
00:04:48.340 | When you set out to do that,
00:04:50.400 | you're going to face challenges,
00:04:52.140 | and they are numerous.
00:04:56.080 | Challenge number one is that models
00:04:57.660 | are highly prompt sensitive,
00:05:00.280 | and when you want to use the model in the RAG context,
00:05:04.600 | you need to prompt it to not only look for the information,
00:05:09.040 | but also know where to look,
00:05:12.160 | and know how to differentiate between the conversation history
00:05:16.700 | that the model has with the user
00:05:18.140 | and the retrieved information.
00:05:19.880 | It's not a trivial task.
00:05:21.160 | Another problem is overcoming the models' natural bias
00:05:27.060 | towards focusing on the beginning of the document.
00:05:30.720 | You've seen it with multiple RAG benchmarks
00:05:33.460 | and evaluation tests,
00:05:37.060 | you know, needle-in-a-haystack and whatnot,
00:05:39.060 | that are really showing the problem of models
00:05:43.020 | not focusing on the most accurate
00:05:45.560 | information retrieval,
00:05:46.680 | but rather becoming a little bit lazy
00:05:48.960 | and focusing on the beginning, mostly.
00:05:54.000 | Another challenge is steering an ongoing battle
00:05:58.500 | that's happening within the model
00:06:00.060 | between its pre-training knowledge
00:06:03.640 | and what it encounters in prompts.
00:06:07.000 | For RAG use cases,
00:06:09.520 | you want the model to be able to tap into the knowledge
00:06:13.540 | that's not baked into the model parameters,
00:06:16.060 | and temporal information is a great example,
00:06:21.120 | when you're asking the model
00:06:22.620 | about who the current president of the United States is.
00:06:27.540 | You want the model to be able to tap
00:06:29.100 | into the up-to-date information.
00:06:31.480 | So through post-training,
00:06:36.040 | we've been able to optimize the model behavior
00:06:39.100 | to be able to address these
00:06:41.940 | and to decide when the external information is needed
00:06:46.200 | in the first place.
00:06:48.040 | Sometimes it isn't.
00:06:49.040 | Sometimes the pre-trained knowledge is enough.
00:06:51.200 | Then operate the retrieval system smoothly
00:06:56.700 | to be able to run search queries successfully,
00:07:00.600 | retrieve the information,
00:07:02.140 | hopefully the most accurate one,
00:07:05.240 | and then use that information
00:07:07.080 | as a grounded context for the conversation
00:07:09.740 | that the model is having with the user.
00:07:13.600 | We optimize all of this for you,
00:07:16.540 | the model behavior,
00:07:17.560 | so that you don't really have to think about it.
00:07:20.100 | It's really good at it out of the box,
00:07:22.320 | but it was hard work.
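
As a sketch of that first step, deciding whether external information is needed and what to search for: the Cohere v1 Python SDK has a `search_queries_only` mode in `co.chat` that returns only the model's retrieval plan. The API key and query below are placeholders.

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

# Ask the model only to plan retrieval: should we search, and for what?
response = co.chat(
    model="command-r-plus",
    message="Who is the current president of the United States?",
    search_queries_only=True,
)

if response.search_queries:
    # The model decided up-to-date external information is needed.
    for q in response.search_queries:
        print("search for:", q.text)
else:
    # The model judged its pre-trained knowledge sufficient.
    print("no retrieval needed")
```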
00:07:23.400 | Our major focus was working on citations.
00:07:29.340 | We're big on citations.
00:07:31.240 | We believe that allowing the user
00:07:33.740 | to verify where the information comes from
00:07:36.600 | and whether it's trustworthy
00:07:37.860 | is really important.
00:07:39.840 | So we're spending extra time
00:07:41.540 | to make these citations very fine-grained.
00:07:44.540 | And thanks to that,
00:07:46.580 | you can experience low hallucination
00:07:48.300 | and reliable context use.
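
To make the citation behavior concrete, here's a minimal sketch of grounded chat with the Cohere v1 Python SDK: documents go in as title/snippet dicts, and the response carries fine-grained citations as character spans over the generated text. The documents and key below are made up for illustration.

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

# Illustrative documents; in practice these come from your retriever.
docs = [
    {"title": "Penguin habitats", "snippet": "Emperor penguins breed on Antarctic sea ice."},
    {"title": "Penguin diets", "snippet": "Emperor penguins feed mainly on fish, squid, and krill."},
]

response = co.chat(
    model="command-r-plus",
    message="Where do emperor penguins breed, and what do they eat?",
    documents=docs,
)

print(response.text)
# Each citation is a character span in response.text plus the ids of the
# documents that support it, so every claim can be traced to its source.
for c in response.citations:
    print(c.start, c.end, repr(c.text), c.document_ids)
```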
00:07:50.040 | We tested Command R and R+
00:07:54.280 | on some standard RAG datasets like KILT,
00:07:58.600 | and they exhibit best-in-class performance.
00:08:01.780 | They're small enough to be affordable,
00:08:04.620 | but powerful enough to cover a lot of your use cases.
00:08:09.820 | They have a great balance of token efficiency,
00:08:12.820 | and to achieve this level of performance,
00:08:16.580 | normally you would have to line up a big pipeline of LLMs.
00:08:20.280 | We've also heard from you that creating a UX and UI
00:08:27.060 | for RAG and tool use is super painful.
00:08:31.620 | It's not a small feat, and we know it first-hand
00:08:35.380 | because we've spent a considerable amount of time working on it ourselves.
00:08:40.880 | We're really proud of it at the moment.
00:08:43.960 | I think it has everything
00:08:46.680 | a modern chat UI needs to have.
00:08:49.660 | So you're able to have a conversation history.
00:08:54.380 | You're able to have fine-grained citations.
00:08:57.380 | You're able to upload documents there.
00:08:59.380 | You're able to plug it into different types of tools.
00:09:02.380 | So spending so much time on it
00:09:05.880 | and knowing how much you're struggling either way,
00:09:08.840 | we decided that it's going to be a good idea to open source the UI,
00:09:13.720 | and that's what we did in April 2024.
00:09:17.720 | I feel like not many people know about it,
00:09:19.560 | but our UI is out there,
00:09:21.380 | and you can now load it and start building with it.
00:09:24.260 | So this is the toolkit repo.
00:09:27.760 | That's what we call it.
00:09:29.760 | It has plug-and-play components and source code
00:09:32.880 | for an interface app that we've built with Next.js.
00:09:36.720 | It has a small SQL database for conversation history.
00:09:41.720 | There is a model component,
00:09:43.720 | which lets you customize how you're accessing Command R models.
00:09:47.720 | You can do it via cloud providers.
00:09:49.720 | You can do it via Coher platform.
00:09:51.720 | You can do it locally.
00:09:53.720 | You can do it via Hugging Face, your pick.
00:09:55.720 | Then there is the retrieval component,
00:09:58.720 | and here you can customize access to tools and data sources.
00:10:04.720 | Out of the box, we've built an example
00:10:06.720 | data retriever built off of LangChain.
00:10:10.720 | It has document upload, and it's using web search,
00:10:15.720 | but honestly, you can add support for any tools
00:10:18.720 | and any data sources that you're interested in.
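
The toolkit's real interfaces live in the repo, but the plug-and-play idea looks roughly like this: a retriever component is anything that maps a query to the title/snippet documents the model grounds on. The class name, method name, and stub backend below are hypothetical, not the toolkit's actual API.

```python
from typing import Any


def search_backend(query: str) -> list[dict[str, Any]]:
    """Stub: swap in your search index, web-search API, or vector database."""
    return [{"title": "Example result",
             "text": f"Snippet matching {query!r}",
             "url": "https://example.com"}]


class CustomRetriever:
    """Hypothetical data-source component in the spirit of the toolkit."""

    def retrieve_documents(self, query: str) -> list[dict[str, Any]]:
        # Normalize backend hits into the title/snippet shape the model expects.
        return [
            {"title": hit["title"], "snippet": hit["text"], "url": hit["url"]}
            for hit in search_backend(query)
        ]
```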
00:10:20.720 | Lately, we've been focused on optimizing tool use,
00:10:27.720 | particularly in the enterprise context.
00:10:29.720 | That's our game.
00:10:32.720 | It's kind of an extension of this RAG formula I mentioned earlier,
00:10:36.720 | where we began by training the models to be really good
00:10:39.720 | with vector databases and retrieval systems,
00:10:43.720 | and then it naturally progressed into broader tool use.
00:10:49.720 | Training the model to use any tools,
00:10:53.720 | and ideally in a zero-shot context.
00:10:55.720 | That's kind of our ideal scenario that we're working towards.
00:11:00.720 | Tool use comes in two flavors.
00:11:04.720 | There is single step.
00:11:06.720 | It's really useful for situations where you have a single action to be performed,
00:11:13.720 | or a set of independent actions.
00:11:15.720 | It could be searching for documents or sending out an email.
00:11:19.720 | Multistep, on the other hand, is really good for scenarios where you have to carry out a sequence of actions,
00:11:31.720 | with each action building on top of the previous ones.
00:11:34.720 | So, in the same example, it would be searching for that document,
00:11:40.720 | being able to compare it against another document,
00:11:43.720 | creating a summary of that comparison,
00:11:46.720 | and then sending it out via an email.
00:11:48.720 | That's possible with multistep tools today.
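
At the API level, both flavors start from the same tool descriptions. Here's a sketch in the Cohere v1 Python SDK's tool format; the tool names and parameter schemas are made up for illustration.

```python
# Hypothetical tool definitions: a name, a description, and parameter schemas.
search_tool = {
    "name": "search_documents",
    "description": "Searches the document store for passages matching a query.",
    "parameter_definitions": {
        "query": {"description": "The search query.", "type": "str", "required": True},
    },
}

email_tool = {
    "name": "send_email",
    "description": "Sends an email with the given subject and body.",
    "parameter_definitions": {
        "subject": {"description": "Email subject line.", "type": "str", "required": True},
        "body": {"description": "Email body text.", "type": "str", "required": True},
    },
}

# Single step: the model emits one round of independent calls (a search, an
# email). Multistep: it plans a sequence, feeding each tool's output into
# the next action, as in the loop sketched further down.
```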
00:11:51.720 | In sequential reasoning, in multistep,
00:11:56.720 | you want the system to be able to reflect and correct errors,
00:12:00.720 | if there are any on the way.
00:12:02.720 | And we are teaching the models to retrieve the information
00:12:06.720 | many times over from these different data sources.
00:12:09.720 | Kind of a loop to be able to do that.
00:12:12.720 | You know this behavior from the term agents.
00:12:16.720 | Most of the time when people use the term agents and multistep,
00:12:20.720 | they mean the same thing.
00:12:22.720 | It's essentially a scenario where software is performing a sequence of actions,
00:12:27.720 | with each action building on the previous steps.
00:12:30.720 | Last week, we released the multistep API. Super hyped about it.
00:12:37.720 | We want it to be user friendly.
00:12:40.720 | And so, all you need to do is describe the tools that the model has at hand,
00:12:45.720 | what these tools do, and their parameters.
00:12:50.720 | After a user request is made, the model is going to create a plan.
00:12:55.720 | And it's going to figure out how to use these tools to fulfill the user request.
00:13:01.720 | And once it calls each tool, it's going to reflect on the contents,
00:13:06.720 | and it's going to adapt the initial plan if it's necessary.
00:13:09.720 | So, for example, if the model is calling an API and it returns an error,
00:13:14.720 | it's going to automatically retry the call and come up with a new plan.
00:13:20.720 | We've outlined this behavior in this huge multistep preamble.
00:13:25.720 | You can find it on Hugging Face.
00:13:28.720 | Essentially, it's a massive prompt that explains to the model what it needs to do in order to get the job done.
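
Putting that together, here's a hedged sketch of the multistep loop with the Cohere v1 Python SDK: describe the tools, let the model plan and emit tool calls, execute them, hand the outputs back, and repeat until the plan is complete. The tool and its dispatcher are hypothetical stand-ins.

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

# One hypothetical tool; see the definitions sketched earlier for the format.
tools = [{
    "name": "search_documents",
    "description": "Searches the document store for passages matching a query.",
    "parameter_definitions": {
        "query": {"description": "The search query.", "type": "str", "required": True},
    },
}]

def run_tool(name: str, parameters: dict) -> dict:
    """Hypothetical dispatcher: execute the named tool, return its output."""
    if name == "search_documents":
        return {"text": f"Stub search result for {parameters.get('query')!r}"}
    return {"text": "unknown tool"}

# force_single_step=False lets the model plan across multiple steps.
res = co.chat(
    model="command-r-plus",
    message="Find the Q3 report, compare it to Q2, and summarize the differences.",
    tools=tools,
    force_single_step=False,
)

while res.tool_calls:
    # Execute each call the model planned, then hand the outputs back so it
    # can reflect on them and adapt the plan if necessary.
    tool_results = [
        {"call": call, "outputs": [run_tool(call.name, call.parameters)]}
        for call in res.tool_calls
    ]
    res = co.chat(
        model="command-r-plus",
        message="",
        chat_history=res.chat_history,
        tools=tools,
        tool_results=tool_results,
        force_single_step=False,
    )

print(res.text)  # final grounded answer once no more tool calls are planned
```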
00:13:36.720 | A unique advantage here is the transparency.
00:13:41.720 | We've trained Command R and R+ to generate claims that are verifiable through citations.
00:13:48.720 | And again, big on citations, we really believe that when you can explain which tool has been used by the model for each response,
00:13:59.720 | it's going to make a difference and it's going to make the system better.
00:14:04.720 | Command R+ has competitive performance with Claude Opus and GPT-4 Turbo, but it is three to five times cheaper.
00:14:14.720 | So that's a massive difference when it comes to scalability and being able to use it in production.
00:14:20.720 | We tested the R family on standard complex reasoning benchmarks, and Command R+ is close to or on par with GPT-4 Turbo.
00:14:31.720 | I'm super excited for the upcoming releases.
00:14:36.720 | We're going to keep hammering on the multistep.
00:14:39.720 | And yeah, stay tuned.
00:14:40.720 | Thanks a lot.
00:14:42.720 | Thank you.
00:14:57.720 | We'll see you next time.
00:15:01.420 | Thank you.