Hey everyone, I'm very excited to be here, and I'm very happy that there is an open models track. I'm going to talk about the open models of Mistral AI and go a little deeper into why we do open source and how we do open source. First of all, Mistral AI started last June, about one year ago.
We released our first open model, Mistral 7B, in September 2023. Then in December we released our first sparse mixture of experts open model, Mixtral 8x7B, and along with it our platform with model APIs as well as commercial models, Mistral Medium and Mistral Embed. Earlier this year, in February, we released Mistral Large, our flagship model, which has best-in-class reasoning and math ability.
In April we released a new open model, Mixtral 8x22B, and very recently, in June, we released a code-specific model called Codestral 22B. It's also available, free to use, in le Chat, the chat interface we built, alongside Mistral Large.
So our mission is to put frontier AI in everyone's hands, and we specifically focus on building cutting-edge AI for developers. We have certain principles behind how we train and release models. The first is openness: we want to train best-in-class open models and release them to the open source community.
We want our models to be portable. All our models are available on Azure, AWS, GCP, and virtual private cloud, and they can also be deployed on-prem, which means you can license the model weights and use them on your own servers, with full control over the security and privacy of your data.
We try to optimize the performance-to-speed ratio; our models are particularly good at getting the best performance out of a particular size. And we want our models to be customizable: we are building our platform with all the libraries and tools to customize our models for your application.
We recently released mistral-finetune, an open source library that can be used to fine-tune any of our open models, and we also have a fine-tuning API on our platform. Before that, we released mistral-inference, our inference library, which is again open source.
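To make this concrete, here is a rough sketch of the kind of instruction-format JSONL data that fine-tuning tooling generally expects, one prompt/response exchange per line. The exact field names and schema here are an assumption on my part, so check the mistral-finetune documentation for the current format.

```python
import json

# A minimal instruction-tuning dataset in JSONL form: one JSON object per line,
# each holding a user/assistant exchange. The "messages" schema is assumed;
# consult the mistral-finetune docs for the exact expected fields.
examples = [
    {
        "messages": [
            {"role": "user",
             "content": "Write a Python function that checks whether a number is prime."},
            {"role": "assistant",
             "content": "def is_prime(n):\n    if n < 2:\n        return False\n"
                        "    for i in range(2, int(n ** 0.5) + 1):\n"
                        "        if n % i == 0:\n            return False\n    return True"},
        ]
    },
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```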
So I talked about the three models we have open sourced in the last year. The first, Mistral 7B, is a dense transformer model. It was the first 7B model to reach 60 on MMLU, and we see 60 MMLU as roughly the bare minimum at which models become useful.
People have been using this model for many different applications, and in particular we have seen that it can be deployed on laptops and phones while still getting reasonable speed on device.
We released our first sparse mixture of experts model, Mixtral 8x7B, in December. It's based on the mixture of experts architecture, which lets us push the performance of a model while keeping the inference budget in check. The idea is that a higher total parameter count lets the model store more knowledge in its weights,
but only a small subset of the parameters is used for every token, which makes it really fast and cost efficient at inference time. In April we released a bigger version of this sparse mixture of experts architecture, Mixtral 8x22B, with even better performance, a longer context window, and multilingual support: it handles English, French, Italian, German, Spanish, and many other languages.
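To illustrate the idea, here is a minimal, self-contained sketch of sparse top-2 routing over eight feed-forward experts. This is a toy illustration of the general technique, not Mixtral's actual implementation; all dimensions and names are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Toy sparse mixture-of-experts layer: 8 experts, 2 active per token."""

    def __init__(self, dim=512, hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, n_experts, bias=False)  # router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        ])

    def forward(self, x):                      # x: (tokens, dim)
        scores = self.gate(x)                  # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the chosen experts
        out = torch.zeros_like(x)
        # Only the selected experts run for each token, so the active parameter
        # count (and compute) per token is a fraction of the total parameters.
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * self.experts[e](x[mask])
        return out

x = torch.randn(4, 512)
print(SparseMoELayer()(x).shape)  # torch.Size([4, 512])
```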
A lot of people ask me: if you open source your models, how do you make money? I think there is a common misconception that open source is somehow in competition with profit.
That's actually not the case. We see open source as something that goes hand in hand with profit; it doesn't have to be competitive, it can be complementary. We want to be in the quadrant where we can open source our models and still build long-term business value with them.
So why do we open source? The first reason is that it serves as a very good branding and marketing tool for us. We believe in open source and open science and we want to contribute to the community, but it's not a one-way thing.
We benefit from open source just as the community benefits from our models. It does a lot of our branding and marketing for us: a lot of people like our models and tell other people that our models are good, and the model performance speaks for itself.
We do not have an in-house marketing team, and simply open sourcing the models creates awareness about our products. It also helps with customer acquisition: if people try our open source models and really like them, they come to us to upgrade to the proprietary models, and they pay for that upgrade.
It also helps with customization and portability. For example, people try to run the 7B model on laptops and phones. We benefit from this because we don't have to do it ourselves out of the box: the community builds around our models, and we learn from the community how our models can be customized or deployed in new settings.
So how are these open source models trained? I'll give you a very high-level overview of the different stages of LLM training. LLMs are typically trained in three stages: pre-training, instruction tuning, and learning from human feedback. The idea behind pre-training is very simple.
You take a piece of text and pass it word by word, or rather token by token, through the large language model, asking the model to predict the next token. The idea itself is very simple: the task is next-token prediction, and each token is roughly 0.75 words.
The vocabulary size is on the order of tens of thousands of tokens, sometimes hundreds of thousands. Each token is represented as an integer and has an embedding associated with it, so the task of the model is to take in a sequence of token embeddings and predict the next token.
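As a concrete illustration, here is a minimal next-token-prediction sketch in PyTorch. The vocabulary size, embedding dimension, and token ids are made up, and a single embedding matrix stands in for the whole transformer stack.

```python
import torch
import torch.nn.functional as F

vocab_size, dim = 32000, 512           # illustrative sizes, not Mistral's real config
embed = torch.nn.Embedding(vocab_size, dim)

# A "document" is just a sequence of integer token ids.
tokens = torch.tensor([[5, 912, 88, 4107, 2, 77]])

# Inputs are all tokens but the last; targets are the sequence shifted left by one:
# at every position the model is asked to predict the *next* token.
inputs, targets = tokens[:, :-1], tokens[:, 1:]

hidden = embed(inputs)                 # stand-in for the transformer layers
logits = hidden @ embed.weight.T       # (batch, seq, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```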
Although the concept is very simple, in practice it's actually very hard. Why is it hard? Because it requires a lot of effort to build the data sets. The data sets are huge, on the order of trillions or tens of trillions of tokens, and they require pre-processing, cleaning, deduplication, and curation.
There's also a common belief that more data leads to better performance, but that's not necessarily the case: noise in your data can actually hurt model performance. It also requires a lot of investment. These models are huge; they can go up to hundreds of billions or even trillions of parameters.
Each model takes tens to hundreds of millions of dollars to train, and the hardest part is that you don't get multiple chances to train it. Because it's so expensive, if something goes wrong in your training, it's very difficult to get the investment for another training run; small companies typically don't get that kind of budget.
If you do a training run and it's not successful, it becomes harder to get the funding for the next one. And this is hard because the best hyperparameters for a smaller model might not be the best for a larger model. Here I'm showing some hyperparameters for the Llama 1 model family sizes.
You might ask why the number of layers is 80 and not 82 in Llama 65B, and the answer is: we don't know. A lot of these choices are made by intuition; it's not an exact science. You need a lot of experience and intuition working with these models to come up with settings that are very likely to work, but we're still not very mature in the science of the best way to train a model, the best architecture, or the best data mixture.
So can we use this pre-trained model directly? Let's say you ask it to write a Python function to find whether an input number is prime. The model might give you a response like this: it continues the text, gives an example, and describes the approach, but it might not give you the code.
This is because that is exactly what the model is trained to do: predict the next token, that is, the most likely continuation of the text data it has been trained on. But there is a way to trick the model. If you give the input as a Python function definition and a docstring asking for the same function, the model actually produces the code.
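To make the contrast concrete, here is a sketch of the two ways of prompting a base model; the prime-number function is just the running example from the talk.

```python
# Prompt a raw pre-trained (base) model in chat style and it tends to keep
# "writing the document" rather than answering:
chat_style = "Write a Python function to check whether the input number is prime."

# Phrase the same request as the start of a code file and the likeliest
# continuation *is* the code:
completion_style = '''def is_prime(n: int) -> bool:
    """Return True if n is a prime number."""
'''
# A base model will typically continue completion_style with the function body,
# because that is the most probable next-token sequence given its training data.
```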
This shows that the model actually knows the answer, but it is not aligned with human preferences: it is not trained to interact with humans the way humans want it to. That is why we need the next two stages. In the instruction tuning stage, instead of just a string of text, we have prompt/response pairs.
Here we give the prompt in the way humans want to interact with the model. For example, the prompt is to write a Python function, and the response is directly the code, because that's what humans want as the response. The technique is very simple: we are again doing next-token prediction, the only difference being that we mask the prompt itself.
We compute the prediction only on the response. So the data set consists of paired prompts and responses, typically hundreds to hundreds of thousands of instructions. The task is still next-word prediction; we just mask the input instruction. It requires far less compute: on the order of a hundred GPUs for a few hours or days is typically sufficient for instruction tuning.
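Here is a minimal sketch of that loss masking in PyTorch: prompt positions get an ignore label so only response tokens contribute to the loss. The token ids and the random logits are placeholders for a real tokenizer and model.

```python
import torch
import torch.nn.functional as F

IGNORE = -100  # cross_entropy skips positions labelled with ignore_index

prompt_ids   = torch.tensor([12, 87, 403, 9])      # tokenized instruction (made-up ids)
response_ids = torch.tensor([55, 671, 23, 8, 2])   # tokenized answer

input_ids = torch.cat([prompt_ids, response_ids])
# Mask the prompt: loss is computed only where the label is a real token id,
# i.e. only on the response tokens.
labels = torch.cat([torch.full_like(prompt_ids, IGNORE), response_ids])

vocab_size = 32000
logits = torch.randn(len(input_ids), vocab_size)   # stand-in for model outputs
# As in pre-training, targets are shifted by one position (predict the next token).
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE)
print(loss.item())
```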
The last stage is learning from human feedback. The idea here is that human preferences are cheaper and easier to obtain than full human annotations. If I give you a prompt and two candidate responses, it's much easier for a human to decide which response is better than to write the whole response from scratch.
This allows us to scale the data faster. There are two main techniques, reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), in which we use this kind of preference data to fine-tune the model further.
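As an illustration of the second technique, here is a minimal sketch of the DPO loss on a single preference pair, assuming we already have the summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model; the numbers are made up.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO on one preference pair: push the policy to prefer the chosen
    response more strongly than the frozen reference model does."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin))

# Toy numbers: summed log-probabilities of each full response under each model.
loss = dpo_loss(torch.tensor(-42.0), torch.tensor(-55.0),
                torch.tensor(-44.0), torch.tensor(-50.0))
print(loss.item())
```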
So just to summarize, these are the three stages. They have different orders of magnitude of data and compute requirements, and the tasks are slightly different. All of our open models have been trained using these techniques. I won't go into the details of the model architecture itself, but I'll show you this nice graph of the performance-to-cost ratio, which shows that we really try to optimize this metric: we try to get the best performance out of a model of a particular size.
On the x-axis we have active parameters, which is directly proportional to the cost of running the model, and on the y-axis we have a popular benchmark, MMLU. We try to be in the top-left corner: more performance at a lower cost. We recently released Codestral 22B.
It's a dense transformer model trained specifically for code, and again we try to optimize performance and speed. It's fluent in 80-plus programming languages and has both an instruct mode and a fill-in-the-middle mode, which means you can use it for code completion in your code editor, just like GitHub Copilot, but you can also ask it questions about the bugs or errors you're facing, just like you would in ChatGPT.
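As a sketch of what the fill-in-the-middle mode looks like through the hosted API, here is a short example using the mistralai Python client. The exact method name, parameters, and model alias are assumptions on my part, so check the current API reference before relying on them.

```python
import os
from mistralai import Mistral  # assumes the `mistralai` Python client is installed

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Fill-in-the-middle: the model receives the code before and after the cursor
# and generates what belongs in between, the same mode an editor plugin uses.
prefix = "def is_prime(n: int) -> bool:\n"
suffix = "\n    return True"

response = client.fim.complete(   # method and model names may differ by client version
    model="codestral-latest",
    prompt=prefix,
    suffix=suffix,
)
print(response.choices[0].message.content)
```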
It outperforms Code Llama 70B, DeepSeek Coder 33B, and Llama 3 70B while being a significantly smaller model, so again we are getting more performance out of a model of a particular size. It also has a longer context window compared with the other open source code models.
It is multilingual: we trained it on more than 80 programming languages, and across these languages it tends to perform better than the other models. It's free to use on our chat interface, chat.mistral.ai, and we also have API access available on la Plateforme, our platform API endpoint.
There it's also free to use until, I believe, the end of July. We also have integrations with VS Code and JetBrains, so you can download a plugin and use it as a coding assistant for code completion. To finish, I'll go over some practical tips, because these are commonly asked questions: how to use open source models, and when to use open source versus commercial models.
If you have a particular application in mind and you want to try commercial models, you can do things like prompt engineering, few-shot prompting, and chain of thought, and you can also do retrieval augmented generation, because commercial models typically don't allow you to do fine-tuning.
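For example, here is a small sketch of building a retrieval-augmented, few-shot prompt by hand; the retrieved snippets and the example question/answer pair are hypothetical.

```python
# Hypothetical snippets standing in for documents returned by a retriever.
retrieved = [
    "Codestral supports an instruct mode and a fill-in-the-middle mode.",
    "Fill-in-the-middle completion takes a prefix and a suffix around the cursor.",
]

few_shot_examples = [
    ("What is instruction tuning?",
     "Fine-tuning on prompt/response pairs so the model answers the way people expect."),
]

question = "Which Codestral mode should an editor plugin use for completion?"

# Retrieval-augmented, few-shot prompt: context first, then worked examples,
# then the actual question.
prompt = "Answer using only the context below.\n\nContext:\n"
prompt += "\n".join(f"- {doc}" for doc in retrieved) + "\n\n"
for q, a in few_shot_examples:
    prompt += f"Q: {q}\nA: {a}\n\n"
prompt += f"Q: {question}\nA:"

print(prompt)
```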
For open models, you can do task-specific fine-tuning as well; you need a bit of data and compute for this. In the end, the choice comes down to how you balance performance versus cost. Commercial models have higher general-purpose performance,
so they are much easier to get started with if you are trying to build a new application. But once you get into production, or once you have high volume, open models can beat commercial models on specific tasks with fine-tuning. What we typically see is that people prototype with the highest-end models,
and then, once they have figured out the task they want to solve, they take an open model like Mistral 7B or Mixtral 8x7B and fine-tune it for their task. This optimizes the performance-to-cost ratio. We have offices in Paris, London, and the Bay Area.
We are always looking for talented researchers, engineers, and business and marketing people, so please do apply. Thank you. I don't know if we're taking questions, but I'm happy to. No? Okay, thank you so much.