
Decoding Mistral AI's Large Language Models: Devendra Chaplot



00:00:00.000 | Hey everyone, I'm very excited to be here. I am very happy that there is an open models track.
00:00:21.960 | So I'm going to talk about the open models of Mistral AI and go a little bit deeper into
00:00:31.460 | why we do open source and how we do open source. So first of all, Mistral AI, we started last
00:00:38.840 | June, about one year ago. We released our first open model, Mistral 7B, in September 2023. And
00:00:46.840 | then after that in December, we released our first mixture-of-experts open model, Mixtral 8x7B.
00:00:53.720 | And along with that, we released our platform with model APIs, and also commercial models,
00:01:00.600 | Mistral Medium and Mistral Embed. And then earlier this year in February, we released Mistral
00:01:06.460 | Large, which is our flagship model, which has best-in-class reasoning and math ability.
00:01:15.840 | And also in April, we released a new open model, Mixtral 8x22B. And then very recently
00:01:24.920 | in June, we released a code-specific model called Codestral 22B. It's also available
00:01:31.920 | in the chat interface that we built, along with Mistral Large, and it's free to
00:01:38.720 | use. So our mission is to bring frontier AI into everyone's hands. And we specifically focus on
00:01:51.840 | building cutting-edge AI for developers. And we have certain principles behind how we go about training
00:02:00.800 | models and releasing them. So the first is openness. We want to train best-in-class open models and
00:02:09.880 | release them to the open source community. We want our models to be portable. All our models
00:02:16.880 | are available on Azure, AWS, GCP, virtual private cloud, and they can also be deployed
00:02:22.960 | on-prem, which means you can license the model weights and use them on your own servers,
00:02:31.120 | with full control over the security and privacy of your data. We try to optimize for the performance-to-
00:02:38.960 | speed ratio. Our models are particularly good at getting the best performance out of a particular
00:02:45.280 | size. And we want our models to be customizable. We are building our platform with all the
00:02:56.000 | libraries and tools to customize our models depending on your application. We recently
00:03:01.840 | released the mistral-finetune open source library, which can be used to fine-tune any of our open source models.
00:03:07.920 | And we also have a fine-tuning API on our platform. And before that, we also released
00:03:14.960 | mistral-inference, which is the inference library, again open source. So I talked about
00:03:21.040 | these three models that we have open sourced in the last one year. The first model, Mistral 7B, is a dense
00:03:29.920 | transformer model. It was the first 7B model to achieve 60 on MMLU.
00:03:36.880 | And we saw that 60 MMLU is like a bare minimum where the models become useful. And this
00:03:43.520 | was the first 7B model to achieve this. And people have been using this model
00:03:51.040 | for many, many different applications. And particularly we have seen that this model can be deployed on
00:03:55.920 | laptops and phones and still get reasonable speed on device.
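
As an illustration of running the 7B model on local hardware, here is a minimal sketch using the Hugging Face transformers library; the checkpoint name and generation settings are my own assumptions rather than anything from the talk, and on a laptop you would typically use a quantized build instead.

```python
# Minimal sketch (not from the talk): loading Mistral 7B with Hugging Face transformers.
# Assumes the "mistralai/Mistral-7B-Instruct-v0.2" checkpoint; laptops and phones usually
# need a quantized variant (e.g. 4-bit) to fit in memory.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Write a haiku about open models."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```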
00:04:05.840 | In December, we released our first sparse mixture-of-experts model, Mixtral 8x7B. It's based on the mixture-of-experts
00:04:13.680 | architecture, which basically allows us to push the performance of a model while keeping the
00:04:21.120 | inference budget in check. The idea here is that we have a higher number of total parameters in the model,
00:04:26.640 | which allows the model to still have the knowledge stored in the model weights. But at the same time, we use only a small subset of the parameters
00:04:34.800 | for every token, which makes it really fast and cost-efficient at inference time.
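
To make the "small subset of parameters per token" idea concrete, here is a toy sketch of a sparse mixture-of-experts layer with top-2 routing. It illustrates the general technique only; it is not Mixtral's actual implementation, and all sizes are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySparseMoE(nn.Module):
    """Toy sparse MoE layer: 8 expert MLPs, but only the top-2 experts run per token."""

    def __init__(self, dim=512, hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(dim, n_experts)  # scores each expert for each token
        self.top_k = top_k

    def forward(self, x):                        # x: (n_tokens, dim)
        scores = self.router(x)                  # (n_tokens, n_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # mixing weights for the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, k] == e         # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

# All expert parameters exist in memory, but each token only pays for 2 of the 8 expert MLPs.
layer = ToySparseMoE()
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```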
00:04:40.400 | And then we released a bigger version of this sparse mixture-of-experts architecture, Mixtral 8x22B, in April.
00:04:47.120 | It has even better performance, a larger context window, and it's also multilingual.
00:04:53.760 | It supports English, French, Italian, German, Spanish, and also many other languages.
00:05:00.320 | So a lot of people ask me, if you open source your models, how do you make money? And I think this is a common misconception that people have, that open source is somehow competitive with profit.
00:05:19.440 | It's actually not the case. We see open source as something that goes hand in hand with profit.
00:05:29.120 | It doesn't necessarily have to be competitive. It can be complementary. And we want to be in this quadrant
00:05:38.000 | where we can open source our models and still have long-term business value with the models.
00:05:42.800 | So why do we open source? The first reason is that it serves as a very good branding and marketing tool for us.
00:05:50.400 | We believe in open source and open science, and we want to contribute to the community, but it's not a one-way thing.
00:06:00.400 | We are also benefiting from open source just as the community is benefiting from our models.
00:06:06.400 | So it helps us do a lot of branding and marketing.
00:06:10.160 | A lot of people like our models. They tell other people that our models are good.
00:06:15.200 | The model performance speaks for itself. We do not have a marketing team in-house,
00:06:19.920 | and just open sourcing the models allows us to create awareness about our products.
00:06:27.200 | It also helps us in customer acquisition.
00:06:30.640 | If people try out our open source models and they really like them, they come to us for an upgrade to
00:06:37.120 | proprietary models, and they pay for the upgrade.
00:06:39.920 | And it also helps in customization and portability.
00:06:45.440 | For example, with the 7B model, people can try to run it on
00:06:54.320 | laptops and phones. And this is the kind of stuff we benefit from, because we don't necessarily have to
00:07:00.480 | do this out of the box, but the community works around our models, and we learn from the community
00:07:05.760 | how our models can be customized or deployed in new settings.
00:07:09.760 | So how are these open source models trained?
00:07:13.840 | I'll give you a very high-level overview of the different stages of LLM training.
00:07:21.120 | Typically, LLMs are trained in three stages:
00:07:24.080 | pre-training, instruction tuning, and learning from human feedback.
00:07:27.760 | So the idea behind pre-training is very simple. You take a piece of text,
00:07:34.480 | you pass it word by word, or token by token, through the large language model,
00:07:42.400 | and ask the model to predict the next token. So the idea itself is very simple.
00:07:49.760 | The task is next-token prediction. Each token is roughly 0.75 words. The vocabulary size is
00:07:57.040 | roughly tens of thousands of tokens, or sometimes hundreds of thousands. And each token is basically
00:08:02.800 | represented as an integer, and it has an embedding associated with it. And so the task of the model
00:08:08.000 | is to take in a sequence of embeddings, or tokens, and predict the next token.
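
As a rough sketch of what this objective looks like in code (my own illustration, not Mistral's training code): the model sees a window of token IDs and is trained with cross-entropy to predict, at every position, the token that comes next.

```python
import torch
import torch.nn.functional as F

# Toy illustration of the pre-training objective: given a sequence of token IDs,
# the model predicts a distribution over the vocabulary at every position, and
# the target is simply the same sequence shifted by one.
vocab_size, dim, seq_len = 50_000, 16, 8
tokens = torch.randint(0, vocab_size, (1, seq_len))      # e.g. "The cat sat on ..."

embedding = torch.nn.Embedding(vocab_size, dim)           # each integer token has an embedding
lm_head = torch.nn.Linear(dim, vocab_size)                # maps hidden states back to token scores

hidden = embedding(tokens)                                 # (1, seq_len, dim); a real model applies
logits = lm_head(hidden)                                    # many transformer layers in between

# Predict token t+1 from positions 0..t: shift inputs and targets by one.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
print(loss.item())
```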
00:08:12.240 | Although the concept is very simple, in practice, it's actually very hard.
00:08:17.840 | Why is it hard? Because it requires a lot of effort in building the data sets. The data sets are huge.
00:08:26.400 | They are on the order of trillions of tokens, tens of trillions of tokens, which requires pre-processing,
00:08:32.640 | cleaning, deduplication, curation. And there's, again, a common belief that more data leads to better
00:08:40.320 | performance, but that's not necessarily the case. If you have noise in your data, that can
00:08:46.320 | actually hurt the model performance. It also requires a lot of investment. These models are huge, you know,
00:08:54.000 | they can go up to hundreds of billions or even trillions of parameters. Each model takes
00:09:00.720 | tens to hundreds of millions of dollars to train. And the hardest part is you don't get multiple chances
00:09:09.040 | to train the model. Because it's so expensive, if something goes wrong in your training, it's very
00:09:19.200 | difficult to get the investment to do another training run, because typically for small
00:09:25.680 | companies, you don't get that kind of budget. If you do a model run and it's not successful,
00:09:31.040 | it becomes harder to get the funding for the next run. And this is hard because the best hyper-
00:09:40.000 | parameters for a smaller model might not be the best for a larger model.
00:09:46.560 | Here I'm showing you some hyperparameters for the Llama 1 model family sizes. And you might ask,
00:09:53.840 | why is the number of layers 80 and not 82 in Llama 65B? And the answer is, we don't know.
00:10:04.640 | There are a lot of things that are decided by intuition, and it's not an exact science.
00:10:14.400 | So you need a lot of experience and intuition working with these models to come up with things
00:10:20.960 | that are very likely to work, but we're still not very mature with the science of
00:10:30.160 | what is the best way to train the model, or what's the best architecture, what's the best data set
00:10:35.120 | mixture. So, can we use this pre-trained model? Let's say you want to use this pre-trained
00:10:42.560 | model and ask it to write a Python function to find whether the input number is prime or not.
00:10:47.440 | The model might give you a response like this: it continues the text, gives an example, and
00:10:54.640 | describes the approach, but it might not give you the code. And this is because
00:10:59.360 | the model is trained to do exactly this; it's trained to predict the next token. So it
00:11:02.880 | predicts the most likely token from the text data it's been trained on.
00:11:07.520 | But there is a way to trick the model. If you give this input as a Python function definition
00:11:16.880 | and a docstring for the same function, the model actually produces the code.
00:11:23.360 | And so this shows you that the model actually knows the answer, but it is not aligned with
00:11:29.280 | human preferences. It's not trained to interact with humans in the way humans want it to.
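
As a concrete illustration of this trick (my own example of the kind of prompt the speaker describes, not the exact one from the slides): feeding a base model a function signature plus a docstring turns plain next-token prediction into code generation.

```python
# Prompt given to the *base* (pre-trained only) model -- just a signature and a docstring:
prompt = '''def is_prime(n: int) -> bool:
    """Return True if the input number n is prime, False otherwise."""
'''

# Because code like this is common in the pre-training data, continuing the text
# token by token typically yields a working body, for example:
#
#     if n < 2:
#         return False
#     for i in range(2, int(n ** 0.5) + 1):
#         if n % i == 0:
#             return False
#     return True
```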
00:11:34.480 | And this is why we need the next two stages. So in the instruction tuning stage, instead of
00:11:42.080 | just a string of text, we have prompt-response pairs. So here we are giving the prompt,
00:11:50.160 | but in the way humans want to interact with the model. So for example, this prompt to write a Python
00:11:55.760 | function, and the response is directly the code, because that's what humans want as the response.
00:11:59.840 | And the technique is very simple. Again, we are doing next-token prediction, but the only difference
00:12:07.120 | is we are going to mask the prompt itself. We are going to do prediction only for the response.
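
A minimal sketch of that masking trick (an illustration, not Mistral's fine-tuning code): concatenate prompt and response tokens, and set the labels for the prompt positions to an ignore index so the loss only covers the response.

```python
import torch
import torch.nn.functional as F

# Toy illustration of instruction tuning with prompt masking. The sequence is prompt
# tokens followed by response tokens; the objective is still next-token prediction,
# but prompt positions are masked out of the loss with ignore_index.
prompt_ids = torch.tensor([11, 42, 7, 99])        # "Write a Python function ..."
response_ids = torch.tensor([3, 18, 25, 6, 2])    # "def is_prime(n): ..."

input_ids = torch.cat([prompt_ids, response_ids])
labels = input_ids.clone()
labels[: len(prompt_ids)] = -100                   # -100 is ignored by cross_entropy

# Logits would come from the model; random here just to show the shapes.
vocab_size = 128
logits = torch.randn(len(input_ids), vocab_size)

loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)
print(loss.item())
```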
00:12:12.880 | So the data set is paired prompt-response pairs. We typically use hundreds to hundreds of thousands
00:12:21.680 | of instructions. The task is next-word prediction, but we just mask the input instruction.
00:12:30.560 | It requires way less compute; on the order of a hundred GPUs for a few hours or days is typically sufficient to do
00:12:38.720 | instruction tuning. And then the last step is learning from human feedback. And here the idea is
00:12:46.800 | that human preferences are cheaper or easier to obtain than full human annotation. If I give you
00:12:52.320 | a prompt like this and two responses, it's much easier for a human to decide which response is better
00:12:58.080 | than to write the whole response from scratch. And so this allows us to scale data
00:13:05.520 | faster. And there are two main techniques: reinforcement learning from human
00:13:12.560 | feedback and direct preference optimization, where we use this kind of preference data to fine-tune
00:13:18.080 | the model further.
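
For the direct preference optimization part, here is a toy sketch of the published DPO objective on a single preference pair (my own illustration, not Mistral's training code). The log-probabilities would come from the policy being fine-tuned and from a frozen reference copy of it.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Toy DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response
    (chosen = preferred by the human, rejected = the other one) under
    either the policy being trained or the frozen reference model.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # Push the policy to prefer the chosen response more strongly than the reference does.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin))

# Made-up numbers just to show the call.
print(dpo_loss(torch.tensor(-12.0), torch.tensor(-15.0),
               torch.tensor(-13.0), torch.tensor(-14.0)))
```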
00:13:26.880 | So just to summarize, these are the three stages. They have different orders of data set and compute requirements,
00:13:34.800 | and the task is slightly different. And all the open source models we have released were trained using these techniques. And so I won't go into the details of
00:13:41.600 | the model architecture itself, but I'll show you this nice graph of the performance-to-cost ratio,
00:13:50.800 | which kind of shows that we really try to optimize this metric; we try to get the best
00:14:01.120 | performance out of our models at a particular size. So here on the x-axis, we have the active parameters,
00:14:07.840 | which is directly proportional to the cost of running the model. And on the y-axis,
00:14:12.080 | we have a popular benchmark, MMLU. So we try to be in the top-left corner, to get more performance at a
00:14:19.360 | lower cost. We recently released the Codestral model, Codestral 22B. It's a dense transformer
00:14:27.680 | model trained specifically for code. And again, we are trying to optimize performance and speed.
00:14:34.400 | It's fluent in 80-plus programming languages, and it has both instruct and fill-in-the-middle
00:14:41.040 | modes, which means that you can use it for code completion in your code editor,
00:14:46.800 | just like GitHub Copilot, but you can also use it to ask questions about the bugs or errors you're facing,
00:14:52.400 | just like you would put them in ChatGPT.
00:14:56.800 | It outperforms Code Llama 70B, DeepSeek Coder 33B, and Llama 3 70B, while being a significantly
00:15:04.000 | smaller model. So again, we are getting more performance out of a model of a particular size.
00:15:09.440 | And it also has a longer context window compared to the other open source code models.
00:15:16.720 | It is multilingual. We trained it with more than 80 programming languages, and
00:15:22.080 | across all these different languages it tends to perform better than the other models.
00:15:28.720 | So it's free to use on our chat interface, chat.mistral.ai. We also have API access
00:15:37.440 | available on la Plateforme, which is our platform API endpoint. And here it's also free
00:15:46.080 | to use till, I believe, the end of July. We also have integration with VS Code and JetBrains.
00:15:53.680 | So you can download a plugin in VS Code or JetBrains and use it as a coding assistant for code completion.
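
As a sketch of calling Codestral through the API, a plain HTTP request might look like the following. The endpoint path and the "codestral-latest" model name are assumptions based on Mistral's public, OpenAI-compatible API conventions rather than anything stated in the talk, so check the current docs before relying on them.

```python
import os
import requests

# Hypothetical sketch: asking Codestral about a bug via Mistral's chat completions API.
resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "codestral-latest",          # assumed model alias
        "messages": [
            {"role": "user", "content": "Why does this fail?\n\nprint(int('3.5'))"}
        ],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```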
00:16:03.520 | So, in the end, I'll just discuss some practical tips, because these are some commonly
00:16:11.840 | asked questions about how to use open source models and when to use open source versus when
00:16:16.800 | to use commercial models. So if you have a particular application in mind and you want to try
00:16:23.760 | out commercial models, you could do things like prompt engineering, few-shot prompting, chain of thought,
00:16:28.880 | and you could also do retrieval-augmented generation, because commercial
00:16:33.360 | models typically don't allow you to do fine-tuning. But for open models, you can do task-specific
00:16:41.520 | fine-tuning as well. You need a little bit of data and compute for this. But in the end, the
00:16:48.320 | choice is about how you balance performance versus cost. Commercial models have
00:16:54.960 | higher general-purpose performance, so they are much easier to get started with if you are trying to
00:16:59.360 | build a new application. But once you get into production, or once you have high volume,
00:17:05.680 | open models can beat commercial models on specific tasks with fine-tuning.
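
As a small example of the few-shot prompting mentioned above (my own illustration, not from the talk): you prepend a handful of worked examples to the prompt so the model picks up the task and the output format without any fine-tuning.

```python
# Toy few-shot prompt (illustration only): a couple of worked examples teach the
# model the task and the output format before the real input.
few_shot_prompt = """Classify the sentiment of each review as positive or negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: positive

Review: It broke after two days and support never answered.
Sentiment: negative

Review: Setup took five minutes and it just works.
Sentiment:"""

# Send `few_shot_prompt` to any chat or completion endpoint; the expected
# continuation is simply " positive".
```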
00:17:10.560 | And typically what we have seen is that people prototype with the highest-end models. And then once
00:17:19.360 | they've figured out that this is the task they want to solve, they take an open source model like
00:17:26.640 | Mistral 7B or Mixtral 8x7B and then fine-tune it for their tasks. And this optimizes the
00:17:31.200 | performance-to-cost ratio. We have offices in Paris, London, and the Bay Area. We are always looking
00:17:40.880 | for talented researchers, engineers, business and marketing people. So please,
00:17:50.880 | please, please do apply. And thank you. I don't know if we're taking questions, but happy to. No?
00:17:56.160 | Okay. Thank you so much.
00:17:57.760 | Thank you.