Decoding Mistral AI's Large Language Models: Devendra Chaplot

Hey everyone, I'm very excited to be here, and I'm very happy that there is an open models track.
I'm going to talk about the open models of Mistral AI and go a little deeper into why we do open source and how we do open source. First of all, Mistral AI: we started last June, about one year ago. We released our first open model, Mistral 7B, in September 2023. Then in December, we released our first mixture-of-experts open model, 8x7B. Along with that, we released our platform with model APIs, as well as commercial models, Mistral Medium and Mistral Embed. Earlier this year, in February, we released Mistral Large, our flagship model, which has best-in-class reasoning and math ability. In April, we released a new open model, 8x22B. And very recently, in June, we released a code-specific model called Codestral 22B. It's also available in the chat interface we built, alongside Mistral Large, and it's free to use.
Our mission is to bring frontier AI into everyone's hands, and we specifically focus on building cutting-edge AI for developers. We have certain principles behind how we train models and release them. The first is openness: we want to train best-in-class open models and release them for the open-source community. We want our models to be portable: all our models are available on Azure, AWS, GCP, and virtual private cloud, and they can also be deployed on-prem, which means you can license the model weights and use them on your own servers, with full control over the security and privacy of your data. We try to optimize the performance-to-speed ratio: our models are particularly good at getting the best performance out of a particular size. And we want our models to be customizable: we are building our platform with all the libraries and tools to customize our models for your application. We recently released the Mistral fine-tune open-source library, which can be used to fine-tune any of our open-source models, and we also have a fine-tuning API on our platform. Before that, we released Mistral inference, our inference library, which is again open source.
So, the three models we have open-sourced in the last year. The first, Mistral 7B, is a dense transformer model. It was the first 7B model to achieve 60 on MMLU, and we saw that 60 on MMLU is roughly the bare minimum at which models become useful. People have been using this model for many different applications, and in particular we have seen that it can be deployed on laptops and phones and still run at reasonable speed on device.
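As a rough illustration of running the open weights locally (not the speaker's own tooling), the 7B model can be loaded with the Hugging Face transformers library; the model ID and generation settings below are assumptions for the sketch.

```python
# Minimal sketch: running Mistral 7B locally via Hugging Face transformers.
# The model ID and generation settings are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```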
In December we released our first sparse mixture-of-experts model, 8x7B. It's based on the mixture-of-experts architecture, which basically allows us to push the performance of a model while keeping the inference budget in check. The idea is that the model has a higher number of total parameters, which lets it store more knowledge in its weights, but only a small subset of the parameters is used for every token, which makes it really fast and cost-efficient at inference time.
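The talk describes the sparse MoE idea only at a high level; below is a toy PyTorch sketch of top-2-of-8 routing in that spirit, not Mistral's actual implementation. For scale, the released 8x7B has roughly 47B total parameters but only about 13B active per token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Toy sparse mixture-of-experts layer: all experts hold parameters,
    but each token is routed to only the top-k experts (here 2 of 8)."""

    def __init__(self, dim, hidden_dim, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, num_experts, bias=False)  # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: (tokens, dim)
        scores = self.gate(x)                    # (tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e     # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Only the selected experts ever run for a given token, which is what keeps inference cost tied to the active, not the total, parameter count.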
Then in April we released a bigger version of this sparse mixture-of-experts architecture, 8x22B. It has even better performance, a larger context window, and it's multilingual: it supports English, French, Italian, German, Spanish, and many other languages.
A lot of people ask me: if you open source your models, how do you make money? I think there's a common misconception that open source is somehow in competition with profit. That's actually not the case. We see open source as something that goes hand in hand with profit; it doesn't have to be competitive, it can be complementary. We want to be in the quadrant where we can open source our models and still build long-term business value with them.
So why do we open source? The first reason is that it serves as a very good branding and marketing tool for us. We believe in open source and open science and want to contribute to the community, but it's not a one-way thing: we benefit from open source just as the community benefits from our models. It does a lot of branding and marketing for us. A lot of people like our models and tell other people that our models are good; the model performance speaks for itself. We do not have an in-house marketing team, and simply open-sourcing the models creates awareness about our products. If people try our open-source models and really like them, they come to us for an upgrade to the proprietary models, and they pay for that upgrade.
It also helps with customization and portability. For example, with the 7B model, people try to run it on laptops and phones. This is the kind of thing we benefit from: we don't necessarily have to support it out of the box, but the community builds around our models, and we learn from the community how our models can be customized or deployed in new settings.
So how are these open-source models trained? I'll give you a very high-level overview of the different stages of LLM training. Typically, LLMs are trained in three stages: pre-training, instruction tuning, and learning from human feedback.
The idea behind pre-training is very simple. You take a piece of text, pass it word by word, or token by token, through the large language model, and ask the model to predict the next token. The task is next-token prediction. Each token is roughly 0.75 words, the vocabulary size is roughly tens of thousands of tokens, sometimes hundreds of thousands, and each token is represented as an integer with an embedding associated with it. So the task of the model is to take in a sequence of embeddings, or tokens, and predict the next token.
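To make the objective concrete, here is a minimal sketch of one pre-training step in PyTorch; `model` is a placeholder for any network that maps token IDs to vocabulary logits.

```python
import torch
import torch.nn.functional as F

def pretraining_step(model, token_ids):
    """Toy next-token prediction step.
    token_ids: (batch, seq_len) tensor of integer token IDs."""
    inputs = token_ids[:, :-1]     # the model sees tokens 0..n-1
    targets = token_ids[:, 1:]     # and must predict tokens 1..n
    logits = model(inputs)         # (batch, seq_len-1, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return loss
```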
Although the concept is very simple, in practice it's actually very hard. Why is it hard? Because it requires a lot of effort to build the datasets. The datasets are huge, on the order of trillions or tens of trillions of tokens, and they require pre-processing, cleaning, deduplication, and curation. There's a common belief that more data leads to better performance, but that's not necessarily the case: noise in your data can actually hurt the model's performance.
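As one small illustration of the preprocessing involved (a toy sketch, not Mistral's pipeline), exact deduplication can be as simple as hashing normalized documents; real pipelines also need near-duplicate detection and quality filtering, which are not shown here.

```python
import hashlib

def deduplicate(documents):
    """Toy exact deduplication: drop documents whose normalized text
    has already been seen, using a hash of the whitespace-collapsed,
    lowercased content as the key."""
    seen, kept = set(), []
    for doc in documents:
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```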
Pre-training also requires a lot of investment. These models are huge; they can go up to hundreds of billions or even trillions of parameters, and each model takes tens to hundreds of millions of dollars to train. The hardest part is that you don't get multiple chances to train the model. Because it's so expensive, if something goes wrong in your training, it's very difficult to get the investment for another run; small companies typically don't get that kind of budget, and if a training run isn't successful, it becomes harder to get funding for the next one. And this is hard because the best hyperparameters for a smaller model might not be the best for a larger model.
Here I'm showing some hyperparameters for the Llama 1 model family sizes. You might ask: why is the number of layers 80 and not 82 in Llama 65B? The answer is, we don't know. A lot of these things are decided by intuition; it's not an exact science. You need a lot of experience and intuition working with these models to come up with choices that are very likely to work, but we're still not very mature on the science of what the best way to train a model is, what the best architecture is, or what the best dataset
mixture is. So can we use this pre-trained model directly? Say you ask it to write a Python function to find whether an input number is prime. The model might give you a response like this: it continues the text, gives an example, and describes the approach, but it might not give you the code. This is because that's what the model is trained to do, predict the next token, so it predicts the most likely continuation based on the text data it was trained on.
But there is a way to trick the model. If you give it an input shaped like a Python function definition with a docstring asking for the same function, the model actually produces the code. This shows that the model actually knows the answer, but it is not aligned with human preferences: it's not trained to interact with humans in the way humans want it to.
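To illustrate the trick described here, the prompt is itself code: a function signature plus a docstring, which a base model will simply continue. The function name below is just an example.

```python
# The kind of prompt that "tricks" a base model into producing code:
# a function signature plus a docstring, which the model then continues.
prompt = '''def is_prime(n: int) -> bool:
    """Return True if the input number n is prime, False otherwise."""
'''
# A pretrained (non-instruction-tuned) model tends to complete this with a
# plausible body, e.g. trial division up to sqrt(n), because that is the most
# likely continuation of such text in its training data.
```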
And this is why we need the next two stages. In the instruction tuning stage, instead of just a string of text, we have prompt-response pairs. Here we give the prompt in the way humans want to interact with the model; for example, the prompt is "write a Python function" and the response is directly the code, because that's what humans want as the response. The technique is very simple: again we do next-token prediction, but the only difference is that we mask the prompt itself and compute the prediction only on the response. So the dataset is prompt-response pairs, typically hundreds to hundreds of thousands of instructions; the task is still next-token prediction, just with the input instruction masked; and it requires far less compute: on the order of a hundred GPUs for a few hours or days is typically sufficient for instruction tuning.
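A minimal sketch of that masking, in the same spirit as the pre-training step earlier: prompt positions are set to the loss's ignore index so only response tokens contribute. The fixed per-example `prompt_len` is a simplification for illustration.

```python
import torch
import torch.nn.functional as F

def instruction_tuning_step(model, token_ids, prompt_len):
    """Toy instruction-tuning step: same next-token objective,
    but the loss is computed only on the response tokens."""
    inputs = token_ids[:, :-1]
    targets = token_ids[:, 1:].clone()
    targets[:, : prompt_len - 1] = -100   # ignore_index: no loss on prompt positions
    logits = model(inputs)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), ignore_index=-100
    )
```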
The last stage is learning from human feedback. The idea is that human preferences are cheaper and easier to obtain than full human annotations. If I give you a prompt and two responses, it's much easier for a human to decide which response is better than to write the whole response from scratch, and that allows us to scale the data faster. There are two main techniques, reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), where we use this kind of preference data to fine-tune the model further.
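As an illustration of the preference-based objective, here is a minimal sketch of the standard DPO loss; the inputs are per-example log-probabilities of the chosen and rejected responses under the model being tuned ("policy") and under a frozen reference copy.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Toy DPO loss. Each argument is a tensor of summed log-probs per example;
    beta controls how far the policy may drift from the reference model."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```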
So just to summarize: these are the three stages. They differ in the order of magnitude of their dataset and compute requirements, and the task is slightly different in each. All the open-source models we have released were trained using these techniques.
I won't go into the details of the model architecture itself, but I'll show you this nice graph of the performance-to-cost ratio, which shows that we really try to optimize this metric: we try to get the best performance out of a model of a particular size. On the x-axis we have the active parameters, which are directly proportional to the cost of running the model, and on the y-axis we have a popular benchmark, MMLU. We try to be in the top-left corner, getting more performance at
lower cost. We recently released the Codestral model, Codestral 22B. It's a dense transformer model trained specifically for code, and again we are trying to optimize performance and speed. It's fluent in 80-plus programming languages, and it has both an instruct mode and a fill-in-the-middle mode, which means you can use it for code completion in your code editor, just like GitHub Copilot, and you can also use it to ask questions about the bugs or errors you're facing.
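Conceptually, fill-in-the-middle means the editor sends the code before and after the cursor and the model generates the span in between. The snippet below only illustrates the shape of the task; it does not show the exact prompt or token format Codestral expects.

```python
# Conceptual fill-in-the-middle example: the editor supplies the code before
# and after the cursor, and the model generates the missing middle.
prefix = "def mean(values: list[float]) -> float:\n    "
suffix = "\n    return total / len(values)\n"
# The model is asked to produce the span between prefix and suffix,
# e.g. "total = sum(values)".
```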
It outperforms Code Llama 70B, DeepSeek Coder 33B, and Llama 3 70B while being a significantly smaller model, so again we are getting more performance out of a model of a particular size. It also has a longer context window compared with the other open-source code models. And it is multilingual: we trained it on more than 80 programming languages, and across all these languages it tends to perform better than the other models.
It's free to use on our chat interface, chat.mistral.ai. We also have API access available on la Plateforme, our API platform, where it's also free to use until, I believe, the end of July. We also have integrations with VS Code and JetBrains, so you can download a plugin for VS Code or JetBrains and use it as a coding assistant for code completion.
In the end, I'd like to share some practical tips, because these are commonly asked questions about how to use open-source models and when to use open source versus commercial models. If you have a particular application in mind and you want to try commercial models, you can do things like prompt engineering, few-shot prompting, chain-of-thought, and retrieval-augmented generation, because commercial models typically don't allow you to do fine-tuning. With open models, you can also do task-specific fine-tuning; you need a little bit of data and compute for this.
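For example, few-shot prompting against a chat-style API just means packing a couple of worked examples into the message list to steer the model's output format without any fine-tuning. The schema below is the common role/content convention, not a specific vendor SDK.

```python
# Few-shot prompting sketch: two worked examples precede the real query.
messages = [
    {"role": "system", "content": "Classify the sentiment of each review as positive or negative."},
    {"role": "user", "content": "Review: 'Great battery life and a sharp screen.'"},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Review: 'Stopped working after two days.'"},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "Review: 'Setup was painless and support was helpful.'"},
]
# Send `messages` to the chat endpoint of your chosen model and read the reply.
```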
But in the end, the choice is about how you balance performance versus cost. Commercial models have higher general-purpose performance, so they are much easier to get started with if you are building a new application. But once you get into production, or once you have high volume, open models can beat commercial models on specific tasks with fine-tuning.
Typically, what we have seen is that people prototype with the highest-end models, and once they have figured out the task they want to solve, they take an open-source model like Mistral 7B or 8x7B and fine-tune it for their task, which optimizes the performance-to-cost ratio. We have offices in Paris, London, and the Bay Area, and we are always looking for talented researchers, engineers, and business and marketing people, so please do apply. Thank you. I don't know if we're taking questions, but I'm happy to. No.