What if I told you that we have just handed you the keys to a state-of-the-art model that excels at structured, advanced RAG and at sequential reasoning, and that you can run it locally on your machine? It's competitive with GPT-4 Turbo and Claude Opus, and it's much smaller. We've been hard at work at Cohere on our family of models, and today I'd like to talk to you about some of the things we've done, the decisions we've made when it comes to model design, and also what we're cooking when it comes to the future of the models.
So this year we've been working really hard to push the boundaries of what's possible with LLMs, and here's a quick look at our timeline. Three months ago, on March 11th, we released Command R and opened the weights to the model. Command R is a model optimized for retrieval-augmented generation, and it's scalable.
It's small enough to be scale-friendly. We followed it up with Command R+. This model is optimized for tool use and advanced retrieval-augmented generation, and it has become a very popular model in the open-source community. Within a few days of the release, we climbed the LMSYS arena. We're really proud of that.
A really great achievement. Your response, as a community using the model, has been incredible. Some of the zeitgeist: we started trending on OpenRouter. Within two weeks of the release, the model had been downloaded a hundred and fifty thousand times from Hugging Face, which is wild. The folks at Hugging Face actually liked the model so much, especially when it comes to tool use, that they decided to use it as the base model for HuggingChat.
So now you can play with HuggingChat. It has a doc parser, an image editor. It even has a calculator. It had one before the iPad. So today almost half a million developers and researchers are using the R family. We're really proud of that. It looks like you got really excited to get your hands on the model, to be able to play with the weights and look under the hood.
We keep hearing your feedback, and the love and support keeps pouring in. It really gets us going. And I've seen some super cool stuff built with R+ since then. Some of my favorites I want to shout out here are the coding assistant by Daniel San and a new generative search demo by Complexity.
I'll try to demo it later. We'll see how the tech goes, but I'll give you a sneak peek. Another favorite of mine is the pair of Discord server bots that are powering our Discord community. I invite you to go and check it out. One of them is fine-tuned to be playful and to demo the model's capabilities.
And the other one is made to be helpful. It's grounded in our docs and focused on information coming from the API. So I want to share the journey of building the R models, the decisions we've made along the way, and to show you that we've committed ourselves to building the top RAG tools for AI builders.
We know firsthand that building RAG is excruciatingly hard. When you set out to do it, you're going to face challenges, and they are numerous. Challenge number one is that models are highly prompt-sensitive, and when you want to use a model in the RAG context, you need to prompt it to not only look for the information, but also know where to look, and know how to differentiate between the conversation history the model has with the user and the retrieved information.
It's not a trivial task. Another problem is overcoming the models' natural bias towards focusing on the beginning of the document. You've seen it with multiple RAG benchmarks and evaluation tests, the needle-in-a-haystack ones and whatnot, which really show that models don't focus on the most relevant retrieved information, but rather get a little lazy and focus mostly on the beginning.
Another challenge is steering an ongoing battle that's happening within the model between its pre-training knowledge and what it encounters in prompts. For RAG use cases, you want the model to be able to tap into knowledge that's not baked into the model parameters. Temporal information is a great example: asking the model who the current president of the United States is.
You want the model to be able to tap into up-to-date information. So through post-training, we've been able to optimize the model's behavior to address these challenges and to decide when external information is needed in the first place. Sometimes it isn't. Sometimes the pre-trained knowledge is enough.
Then the model has to operate the retrieval system smoothly: run search queries successfully, retrieve the information, hopefully the most accurate, and then use that information as grounded context for the conversation it's having with the user. We optimize all of this model behavior for you, so that you don't really have to think about it.
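To make that flow concrete, here's a minimal sketch using the Cohere Python SDK (a v1-style Client): the model first decides whether it needs to search at all and what to search for, then answers grounded in whatever comes back. The `run_my_search` retriever and the API key are placeholders, and exact parameter names may differ between SDK versions.

```python
# Sketch only: decide-when-to-retrieve, then answer grounded in documents.
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key
question = "Who is the current president of the United States?"

# Step 1: ask the model whether retrieval is needed and, if so, for what.
plan = co.chat(
    model="command-r-plus",
    message=question,
    search_queries_only=True,
)

if plan.search_queries:
    # Step 2: run the generated queries against your own search stack
    # (web search, vector DB, ...). `run_my_search` is a placeholder that
    # returns a list of {"title": ..., "snippet": ...} dicts.
    documents = [
        doc
        for query in plan.search_queries
        for doc in run_my_search(query.text)
    ]
    # Step 3: answer, grounded in the retrieved context.
    response = co.chat(model="command-r-plus", message=question, documents=documents)
else:
    # The model decided its parametric knowledge is enough.
    response = co.chat(model="command-r-plus", message=question)

print(response.text)
```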
It's really good at this out of the box, but it was hard work. A major focus was working on citations. We're big on citations. We believe that allowing the user to verify where the information comes from and whether it's trustworthy is really important, so we spent extra time making these citations very fine-grained.
And thanks to that, you get low hallucination and reliable context use. We tested Command R and R+ on some standard RAG datasets like KILT, and they exhibit best-in-class performance. They're small enough to be affordable, but powerful enough to cover a lot of your use cases. They have a great balance of token efficiency, and to achieve this level of performance you would normally have to line up a big pipeline of LLMs.
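To show what those fine-grained citations look like in practice, here's a rough sketch with the same v1-style Python SDK. When documents are passed in, the response carries span-level citations you can render in a UI; field names may vary slightly across SDK versions.

```python
# Grounded answer with fine-grained, span-level citations (sketch).
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

docs = [
    {"title": "Company handbook", "snippet": "Employees get 25 vacation days per year."},
    {"title": "HR update 2024", "snippet": "Unused vacation days can be carried over once."},
]

response = co.chat(
    model="command-r-plus",
    message="How many vacation days do I get, and can I carry them over?",
    documents=docs,
)

print(response.text)
# Each citation covers a character span of the answer plus the document IDs
# that back it up, so the UI can highlight exactly which claim came from where.
for cite in response.citations or []:
    print(response.text[cite.start:cite.end], "<-", cite.document_ids)
```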
We've also heard from you that creating a UX and UI for RAG and tool use is super painful. It's not a small feat, and we know it first-hand because we've spent a considerable amount of time working on it ourselves. We're really proud of where it is at the moment. I think it has everything a modern chat UI needs to have.
You're able to have conversation history. You're able to have fine-grained citations. You're able to upload documents. You're able to plug it into different types of tools. So, having spent so much time on it and knowing how much you're struggling either way, we decided it would be a good idea to open-source the UI, and that's what we did in April 2024.
I feel like not many people know about it, but our UI is out there, and you can load it and start building with it today. This is the toolkit repo, as we call it. It has plug-and-play components and source code for an interface app that we built with Next.js.
It has a small SQL database for conversation history. There is a model component, which lets you customize how you access the Command R models. You can do it via cloud providers, via the Cohere platform, locally, or via Hugging Face, your pick.
Then there is a retrieval component, where you can customize access to tools and data sources. Out of the box, we've built an example data retriever on top of LangChain. It has document upload, and it uses web search, but honestly, you can add support for any tools and any data sources that you're interested in.
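Just to illustrate the idea (the toolkit defines its own interfaces in the repo, so treat this as a hypothetical stand-in rather than the actual component API), a custom data source only needs to turn a query into documents in the shape the grounded chat call expects.

```python
# Hypothetical custom data source: "query in, documents out".
# The real toolkit retrieval component has its own interface; this only
# sketches the general shape that a grounded chat call can consume.
from dataclasses import dataclass
import requests

@dataclass
class WikiRetriever:
    base_url: str = "https://en.wikipedia.org/w/api.php"

    def retrieve(self, query: str, limit: int = 3) -> list[dict]:
        params = {
            "action": "query", "list": "search", "srsearch": query,
            "format": "json", "srlimit": limit,
        }
        hits = requests.get(self.base_url, params=params).json()["query"]["search"]
        return [{"title": h["title"], "snippet": h["snippet"]} for h in hits]

# documents = WikiRetriever().retrieve("Command R+ context window")
# co.chat(model="command-r-plus", message="...", documents=documents)
```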
Lately, we've been focused on optimizing tool use, particularly in the enterprise context. That's our game. It's an extension of the RAG formula I mentioned earlier: we began by training the models to be really good with vector databases and retrieval systems, and that naturally progressed into broader tool use.
Training the model to use any tools, ideally zero-shot, is the scenario we're working towards. Tool use comes in two flavors. There is single-step, which is really useful for situations where you have a single action to perform, or a set of independent actions.
It could be searching for documents or sending out an email. Multistep, on the other hand, is really good for scenarios where you have to carry out a sequence of actions, with each action building on top of the previous ones. In the same example, it would be searching for that document, comparing it against another document, creating a summary of that comparison, and then sending it out via email.
That's possible with multistep tools today. In sequential reasoning, in multistep, you want the system to be able to reflect and correct errors if there are any along the way. And we are teaching the models to retrieve information many times over from these different data sources, in a kind of loop, to be able to do that.
You know this behavior from the term "agents". Most of the time, when people say agents and multistep, they mean the same thing: a scenario where software performs a sequence of actions, with each action building on the previous steps. Last week, we released the multistep API, and we're super hyped about it.
We want it to be user-friendly. All you need to do is describe the tools the model has at its disposal, what these tools do, and their parameters. After a user request comes in, the model creates a plan and figures out how to use these tools to fulfill the request.
And after it calls each tool, it reflects on the results and adapts the initial plan if necessary. So, for example, if the model calls an API and gets an error back, it will automatically retry the call and come up with a new plan.
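Roughly, that loop looks like the sketch below with the Python SDK: define the tools, let the model plan and emit tool calls, execute them, feed the results back, and repeat until it answers in text. The tool here ("search_inbox") and its implementation are made up for illustration, and the exact fields and flow may differ from the documented multistep API.

```python
# Multistep tool use, sketched with a v1-style Cohere Python client.
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

tools = [{
    "name": "search_inbox",  # hypothetical tool
    "description": "Search the user's inbox and return matching emails.",
    "parameter_definitions": {
        "query": {"description": "Search terms", "type": "str", "required": True},
    },
}]

def search_inbox(query: str) -> list[dict]:
    # Placeholder: call your real email backend here.
    return [{"subject": "Contract draft v2", "body": "..."}]

message = "Find last week's contract drafts, compare them, and summarize the differences."
chat_history, tool_results = [], None

while True:
    response = co.chat(
        model="command-r-plus",
        message=message if tool_results is None else "",
        chat_history=chat_history,
        tools=tools,
        tool_results=tool_results,
    )
    if not response.tool_calls:
        # No more tool calls: the plan is complete and this is the final answer.
        print(response.text)
        break
    # Keep the model's plan and intermediate steps in context for the next turn.
    chat_history = response.chat_history
    # Execute each requested call; a real app would dispatch on call.name.
    tool_results = [
        {"call": call, "outputs": search_inbox(**call.parameters)}
        for call in response.tool_calls
    ]
```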
We've outlined this behavior in a huge multistep preamble, which you can find on Hugging Face. Essentially, it's a massive prompt that explains to the model what it needs to do in order to get the job done. The unique advantage here is transparency. We've trained Command R and R+ to generate claims that are verifiable through citations.
And again, we're big on citations. We really believe that when you can explain which tool the model used for each response, it makes a difference and it makes the system better. Command R+ has performance competitive with Claude Opus and GPT-4 Turbo, but it is three to five times cheaper.
That's a massive difference when it comes to scalability and being able to use it in production. We test the R family on standard complex reasoning benchmarks, and Command R+ is close to or on par with GPT-4 Turbo. I'm super excited for the upcoming releases. We're going to keep hammering on multistep.
And yeah, stay tuned. Thanks a lot. Thank you.