
Building State of the Art Open Weights Tool Use: The Command R Family: Sandra Kublik


Chapters

0:00 Introduction
2:11 Community Response
4:18 The Journey
6:32 Post Training
8:20 Open Source UI
10:22 Optimizing Tool Use
11:01 SingleStep vs MultiStep
12:14 MultiStep API


00:00:02.000 | What if I told you
00:00:15.240 | that we have just handed you the keys
00:00:18.280 | to a state-of-the-art model,
00:00:21.280 | which excels at structured,
00:00:25.520 | advanced RAG and sequential reasoning,
00:00:30.520 | and you can run it locally on your machine.
00:00:33.660 | It's competitive against GPT-4 Turbo, Claude Opus,
00:00:39.980 | and it's much smaller.
00:00:43.360 | We've been really hard at work at Cohere,
00:00:47.800 | working on our family of models,
00:00:50.760 | and today I'd like to talk to you
00:00:53.600 | about some of the stuff that we've done,
00:00:55.640 | the decisions that we've made
00:00:59.820 | when it comes to the model design,
00:01:01.940 | and also what we're cooking
00:01:03.860 | when it comes to the future of the models.
00:01:06.600 | So this year we've been working really hard
00:01:11.660 | to push the boundaries of what's possible with LLMs,
00:01:16.400 | and here's a quick look at our timeline.
00:01:21.760 | Three months ago, on March 11th,
00:01:23.620 | we released Command R.
00:01:24.940 | We opened the weights to the model.
00:01:27.980 | Command R is a model optimized
00:01:32.860 | for retrieval augmented generation,
00:01:35.020 | and it's scalable.
00:01:36.440 | It's small enough to be scale-friendly.
00:01:39.940 | We followed it up with Command R+.
00:01:45.600 | And this model is optimized for tool use,
00:01:51.860 | advanced retrieval augmented generation,
00:01:54.820 | and has become a very popular model
00:01:58.280 | in the open-source community.
00:01:59.700 | Within a few days of the release,
00:02:02.640 | we climbed the LMSYS arena.
00:02:05.820 | We're really proud of that.
00:02:07.320 | A really great achievement.
00:02:11.460 | Your response, as a community using the model,
00:02:15.180 | has been incredible.
00:02:16.180 | Here's some of the zeitgeist.
00:02:19.560 | We started trending on OpenRouter.
00:02:22.100 | Within two weeks of the release,
00:02:25.480 | the model has been downloaded
00:02:27.600 | a hundred and fifty thousand times from Hugging Face,
00:02:30.640 | which is wild.
00:02:34.200 | Folks at Hugging Face actually liked the model so much,
00:02:37.660 | especially when it comes to the tool use,
00:02:40.040 | that they decided to use it as a base model
00:02:42.460 | for HuggingChat.
00:02:43.380 | So now you can
00:02:47.060 | play with HuggingChat.
00:02:50.220 | It has a doc parser,
00:02:52.200 | an image editor.
00:02:53.340 | It even has a calculator.
00:02:54.840 | It had it before the iPad.
00:02:57.540 | So today, almost half a million
00:03:01.680 | developers and researchers
00:03:04.000 | are using the R family.
00:03:05.760 | We're really proud of that.
00:03:07.300 | It looks like you guys got really excited
00:03:12.180 | to get your hands on the model
00:03:14.220 | and to be able to play with the weights
00:03:17.640 | and look under the hood.
00:03:19.280 | We keep hearing your feedback
00:03:22.720 | and the love and support keeps pouring in.
00:03:25.060 | It really gets us going.
00:03:27.400 | And I've seen some super cool stuff
00:03:30.660 | built with R+ since then.
00:03:32.180 | Some of my favorite ones
00:03:34.200 | I want to shout out here
00:03:35.200 | are the Coding Assistant by Daniel San
00:03:37.640 | and a new generative search demo by Complexity.
00:03:42.940 | I'll try to demo it later.
00:03:45.880 | We'll see how the tech goes,
00:03:47.160 | but I'll give you a sneak peek.
00:03:49.640 | Another one that's my favorite
00:03:52.340 | is two Discord server bots that are powering
00:03:56.060 | our Discord community.
00:03:59.060 | I invite you to go and check it out.
00:04:01.240 | One of them is fine-tuned to be playful
00:04:06.020 | and to demo the model capabilities.
00:04:08.120 | And the other one is made to be helpful.
00:04:11.060 | It's grounded in our docs
00:04:13.200 | and it's focused on the information
00:04:15.440 | coming from the API.
00:04:17.060 | So I want to share the journey of building the R models,
00:04:24.920 | the decisions we've made along the way,
00:04:28.180 | and to show you that we've committed ourselves
00:04:32.480 | to building the top RAG tools for AI builders.
00:04:36.940 | We know firsthand that building RAG
00:04:42.600 | is excruciatingly hard.
00:04:46.900 | Tough word.
00:04:48.340 | When you set out to do that,
00:04:50.400 | you're going to face challenges,
00:04:52.140 | and they are numerous.
00:04:56.080 | Challenge number one is that models
00:04:57.660 | are highly prompt sensitive,
00:05:00.280 | and when you want to use the model in the RAG context,
00:05:04.600 | you need to prompt it to not only look for the information,
00:05:09.040 | but also know where to look,
00:05:12.160 | and know how to differentiate between the conversation history
00:05:16.700 | that the model has with the user
00:05:18.140 | and the retrieved information.
00:05:19.880 | It's not a trivial task.
00:05:21.160 | Another problem is overcoming the models' natural bias
00:05:27.060 | towards focusing on the beginning of the document.
00:05:30.720 | You've seen it with multiple RAG benchmarks
00:05:33.460 | and evaluation tests,
00:05:37.060 | you know, needle-in-a-haystack and whatnot,
00:05:39.060 | that are really showing the problem of models
00:05:43.020 | not focusing on the most accurate
00:05:45.560 | information retrieval,
00:05:46.680 | but rather becoming a little bit lazy
00:05:48.960 | and focusing on the beginning, mostly.
00:05:54.000 | Another challenge is steering an ongoing battle
00:05:58.500 | that's happening within the model
00:06:00.060 | between its pre-training knowledge
00:06:03.640 | and what it encounters in prompts.
00:06:07.000 | For RAG use cases,
00:06:09.520 | you want the model to be able to tap into the knowledge
00:06:13.540 | that's not baked into the model parameters,
00:06:16.060 | and temporal information is a great example,
00:06:21.120 | when you're asking the model
00:06:22.620 | about who the current president of the United States is.
00:06:27.540 | You want the model to be able to tap
00:06:29.100 | into the up-to-date information.
00:06:31.480 | So through post-training,
00:06:36.040 | we've been able to optimize the model behavior
00:06:39.100 | to be able to address these
00:06:41.940 | and to decide when the external information is needed
00:06:46.200 | in the first place.
00:06:48.040 | Sometimes it isn't.
00:06:49.040 | Sometimes the pre-trained knowledge is enough.
00:06:51.200 | Then operate the retrieval system smoothly
00:06:56.700 | to be able to run search queries successfully,
00:07:00.600 | retrieve the information,
00:07:02.140 | hopefully the most accurate one,
00:07:05.240 | and then use that information
00:07:07.080 | as a grounded context for the conversation
00:07:09.740 | that the model is having with the user.
00:07:13.600 | We optimize all of this for you,
00:07:16.540 | the model behavior,
00:07:17.560 | so that you don't really have to think about it.
00:07:20.100 | It's really good at it out of the box,
00:07:22.320 | but it was hard work.
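
As a sketch of that first step, deciding whether external information is needed and what to search for: the Cohere v1 Python SDK has a `search_queries_only` mode in `co.chat` that returns only the model's retrieval plan. The API key and query below are placeholders.

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

# Ask the model only to plan retrieval: should we search, and for what?
response = co.chat(
    model="command-r-plus",
    message="Who is the current president of the United States?",
    search_queries_only=True,
)

if response.search_queries:
    # The model decided up-to-date external information is needed.
    for q in response.search_queries:
        print("search for:", q.text)
else:
    # The model judged its pre-trained knowledge sufficient.
    print("no retrieval needed")
```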
00:07:23.400 | Our major focus was working on citations.
00:07:29.340 | We're big on citations.
00:07:31.240 | We believe that allowing the user
00:07:33.740 | to verify where the information comes from
00:07:36.600 | and whether it's trustworthy
00:07:37.860 | is really important.
00:07:39.840 | So we're spending extra time
00:07:41.540 | to make these citations very fine-grained.
00:07:44.540 | And thanks to that,
00:07:46.580 | you can experience low hallucination
00:07:48.300 | and reliable context use.
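
To make the citation behavior concrete, here's a minimal sketch of grounded chat with the Cohere v1 Python SDK: documents go in as title/snippet dicts, and the response carries fine-grained citations as character spans over the generated text. The documents and key below are made up for illustration.

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

# Illustrative documents; in practice these come from your retriever.
docs = [
    {"title": "Penguin habitats", "snippet": "Emperor penguins breed on Antarctic sea ice."},
    {"title": "Penguin diets", "snippet": "Emperor penguins feed mainly on fish, squid, and krill."},
]

response = co.chat(
    model="command-r-plus",
    message="Where do emperor penguins breed, and what do they eat?",
    documents=docs,
)

print(response.text)
# Each citation is a character span in response.text plus the ids of the
# documents that support it, so every claim can be traced to its source.
for c in response.citations:
    print(c.start, c.end, repr(c.text), c.document_ids)
```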
00:07:50.040 | We tested Command R and R+
00:07:54.280 | on some standard RAG datasets like KILT,
00:07:58.600 | and they exhibit best-in-class performance.
00:08:01.780 | They're small enough to be affordable,
00:08:04.620 | but powerful enough to cover a lot of your use cases.
00:08:09.820 | They have a great balance of token efficiency,
00:08:12.820 | and to achieve this level of performance,
00:08:16.580 | normally you would have to line up a big pipeline of LLMs.
00:08:20.280 | We've also heard from you that creating a UX and UI
00:08:27.060 | for RAG and tool use is super painful.
00:08:31.620 | It's not a small feat, and we know it first-hand
00:08:35.380 | because we've spent a considerable amount of time working on it ourselves.
00:08:40.880 | We're really proud of it at the moment.
00:08:43.960 | I think it has everything
00:08:46.680 | a modern chat UI needs to have.
00:08:49.660 | So you're able to have a conversation history.
00:08:54.380 | You're able to have fine-grained citations.
00:08:57.380 | You're able to upload documents there.
00:08:59.380 | You're able to plug it into different types of tools.
00:09:02.380 | So spending so much time on it
00:09:05.880 | and knowing how much you're struggling either way,
00:09:08.840 | we decided that it's going to be a good idea to open source the UI,
00:09:13.720 | and that's what we did in April 2024.
00:09:17.720 | I feel like not many people know about it,
00:09:19.560 | but our UI is out there,
00:09:21.380 | and you can now load it and start building with it.
00:09:24.260 | So this is the toolkit repo.
00:09:27.760 | That's what we call it.
00:09:29.760 | It has plug-and-play components and source code
00:09:32.880 | for an interface app that we've built with Next.js.
00:09:36.720 | It has a small SQL database for conversation history.
00:09:41.720 | There is a model component,
00:09:43.720 | which lets you customize how you're accessing Command R models.
00:09:47.720 | You can do it via cloud providers.
00:09:49.720 | You can do it via Coher platform.
00:09:51.720 | You can do it locally.
00:09:53.720 | You can do it via Hugging Face, your pick.
00:09:55.720 | Then there is the retrieval component,
00:09:58.720 | and here you can customize access to tools and data sources.
00:10:04.720 | Out of the box, we've built an example
00:10:06.720 | data retriever built off of LangChain.
00:10:10.720 | It has document upload, and it's using web search,
00:10:15.720 | but honestly, you can add support for any tools
00:10:18.720 | and any data sources that you're interested in.
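
The toolkit's real interfaces live in the repo, but the plug-and-play idea looks roughly like this: a retriever component is anything that maps a query to the title/snippet documents the model grounds on. The class name, method name, and stub backend below are hypothetical, not the toolkit's actual API.

```python
from typing import Any


def search_backend(query: str) -> list[dict[str, Any]]:
    """Stub: swap in your search index, web-search API, or vector database."""
    return [{"title": "Example result",
             "text": f"Snippet matching {query!r}",
             "url": "https://example.com"}]


class CustomRetriever:
    """Hypothetical data-source component in the spirit of the toolkit."""

    def retrieve_documents(self, query: str) -> list[dict[str, Any]]:
        # Normalize backend hits into the title/snippet shape the model expects.
        return [
            {"title": hit["title"], "snippet": hit["text"], "url": hit["url"]}
            for hit in search_backend(query)
        ]
```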
00:10:20.720 | Lately, we've been focused on optimizing tool use,
00:10:27.720 | particularly in the enterprise context.
00:10:29.720 | That's our game.
00:10:32.720 | It's kind of an extension of this RAG formula I mentioned earlier,
00:10:36.720 | where we began by training the models to be really good
00:10:39.720 | with vector databases and retrieval systems,
00:10:43.720 | and then it naturally progressed into broader tool use.
00:10:49.720 | Training the model to use any tools,
00:10:53.720 | and ideally in a zero-shot context.
00:10:55.720 | That's kind of our ideal scenario that we're working towards.
00:11:00.720 | Tool use comes in two flavors.
00:11:04.720 | There is single step.
00:11:06.720 | It's really useful for situations where you have a single action to be performed,
00:11:13.720 | or a set of independent actions.
00:11:15.720 | It could be searching for documents or sending out an email.
00:11:19.720 | Multistep, on the other hand, is really good for scenarios where you have to carry out a sequence of actions,
00:11:31.720 | with each action building on top of the previous ones.
00:11:34.720 | So, in the same example, it would be searching for that document,
00:11:40.720 | being able to compare it against another document,
00:11:43.720 | creating a summary of that comparison,
00:11:46.720 | and then sending it out via an email.
00:11:48.720 | That's possible with multistep tools today.
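
At the API level, both flavors start from the same tool descriptions. Here's a sketch in the Cohere v1 Python SDK's tool format; the tool names and parameter schemas are made up for illustration.

```python
# Hypothetical tool definitions: a name, a description, and parameter schemas.
search_tool = {
    "name": "search_documents",
    "description": "Searches the document store for passages matching a query.",
    "parameter_definitions": {
        "query": {"description": "The search query.", "type": "str", "required": True},
    },
}

email_tool = {
    "name": "send_email",
    "description": "Sends an email with the given subject and body.",
    "parameter_definitions": {
        "subject": {"description": "Email subject line.", "type": "str", "required": True},
        "body": {"description": "Email body text.", "type": "str", "required": True},
    },
}

# Single step: the model emits one round of independent calls (a search, an
# email). Multistep: it plans a sequence, feeding each tool's output into
# the next action, as in the loop sketched further down.
```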
00:11:51.720 | In sequential reasoning, in multistep,
00:11:56.720 | you want the system to be able to reflect and correct errors,
00:12:00.720 | if there are any on the way.
00:12:02.720 | And we are teaching the models to retrieve the information
00:12:06.720 | many times over from these different data sources.
00:12:09.720 | Kind of a loop to be able to do that.
00:12:12.720 | You know this behavior from the term agents.
00:12:16.720 | Most of the time when people use the term agents and multistep,
00:12:20.720 | they mean the same thing.
00:12:22.720 | It's essentially a scenario where software is performing a sequence of actions,
00:12:27.720 | with each action building on the previous steps.
00:12:30.720 | Last week, we released the multistep API. Super hyped about it.
00:12:37.720 | We want it to be user friendly.
00:12:40.720 | And so, all you need to do is describe the tools that the model has at hand,
00:12:45.720 | what these tools do, and their parameters.
00:12:50.720 | After a user request is made, the model is going to create a plan.
00:12:55.720 | And it's going to figure out how to use these tools to fulfill the user request.
00:13:01.720 | And once it calls each tool, it's going to reflect on the contents,
00:13:06.720 | and it's going to adapt the initial plan if it's necessary.
00:13:09.720 | So, for example, if the model is calling an API and it returns an error,
00:13:14.720 | it's going to automatically retry the call and come up with a new plan.
00:13:20.720 | We've outlined this behavior in this huge multistep preamble.
00:13:25.720 | You can find it on Hugging Face.
00:13:28.720 | Essentially, it's a massive prompt that explains to the model what it needs to do in order to get the job done.
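
Putting that together, here's a hedged sketch of the multistep loop with the Cohere v1 Python SDK: describe the tools, let the model plan and emit tool calls, execute them, hand the outputs back, and repeat until the plan is complete. The tool and its dispatcher are hypothetical stand-ins.

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

# One hypothetical tool; see the definitions sketched earlier for the format.
tools = [{
    "name": "search_documents",
    "description": "Searches the document store for passages matching a query.",
    "parameter_definitions": {
        "query": {"description": "The search query.", "type": "str", "required": True},
    },
}]

def run_tool(name: str, parameters: dict) -> dict:
    """Hypothetical dispatcher: execute the named tool, return its output."""
    if name == "search_documents":
        return {"text": f"Stub search result for {parameters.get('query')!r}"}
    return {"text": "unknown tool"}

# force_single_step=False lets the model plan across multiple steps.
res = co.chat(
    model="command-r-plus",
    message="Find the Q3 report, compare it to Q2, and summarize the differences.",
    tools=tools,
    force_single_step=False,
)

while res.tool_calls:
    # Execute each call the model planned, then hand the outputs back so it
    # can reflect on them and adapt the plan if necessary.
    tool_results = [
        {"call": call, "outputs": [run_tool(call.name, call.parameters)]}
        for call in res.tool_calls
    ]
    res = co.chat(
        model="command-r-plus",
        message="",
        chat_history=res.chat_history,
        tools=tools,
        tool_results=tool_results,
        force_single_step=False,
    )

print(res.text)  # final grounded answer once no more tool calls are planned
```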
00:13:36.720 | A unique advantage here is the transparency.
00:13:41.720 | We've trained Command R and R+ to generate claims that are verifiable through citations.
00:13:48.720 | And again, big on citations, we really believe that when you can explain which tool has been used by the model for each response,
00:13:59.720 | it's going to make a difference and it's going to make the system better.
00:14:04.720 | Command R+ has competitive performance with Claude Opus and GPT-4 Turbo, but it is three to five times cheaper.
00:14:14.720 | So that's a massive difference when it comes to scalability and being able to use it in production.
00:14:20.720 | We tested the R family on standard complex reasoning benchmarks, and Command R+ is close to or on par with GPT-4 Turbo.
00:14:31.720 | I'm super excited for the upcoming releases.
00:14:36.720 | We're going to keep hammering on the multistep.
00:14:39.720 | And yeah, stay tuned.
00:14:40.720 | Thanks a lot.
00:14:42.720 | Thank you.
00:14:57.720 | We'll see you next time.
00:15:01.420 | Thank you.