
Best of 2024: Open Models [LS LIVE! at NeurIPS 2024]


Transcript

All right. All right. Cool. Yeah, thanks for having me over. I'm Luca. I'm a research scientist at the Allen Institute for AI. I threw together a few slides as sort of a recap of interesting themes in open models for 2024. I have about maybe 20, 25 minutes of slides, and then we can chat if there are any questions.

If I can advance to the next slide. Okay, cool. So I did a quick check to sort of get a sense of how much 2024 was different from 2023. I went on Hugging Face and tried to get a picture of what kind of models were released in 2023 and what we got in 2024.

In 2023, we got things like both Llama 1 and 2. We got Mistral, we got MPT, the Falcon models. I think the Yi model came at the tail end of the year. It was a pretty good year. But then I did the same for 2024, and it's actually quite a stark difference.

You have models that are, you know, rivaling the frontier-level performance you can get from closed models, from Qwen, from DeepSeek. We got Llama 3, we got all sorts of different models. I added our own OLMo at the bottom. There's this growing group of fully open models that I'm going to touch on a little bit later.

But, you know, just looking at the slides, it feels like 2024 was just smooth sailing, happy news, much better than the previous year. And you can pick your favorite benchmark, or least favorite, I don't know, depending on what point you're trying to make, and plot your closed model and your open model, and sort of spin it in ways that show that, oh, you know, open models are much closer to where closed models are today, versus last year, where the gap was fairly significant.

So one thing that, I don't know if I have to convince people in this room, but usually when I give these talks about open models, there is always this background question in people's minds: why should we use open models? Why not just use model APIs?

You know, it's just an HTTP request to get output from one of the best models out there. Why do I have to set up infra and use local models? And there are really two answers. There is the more researchy answer, which is where my background lies, which is just research.

If you wanna do research on language models, research thrives on open models. There is a large body of research on modeling, on how these models behave, on evaluation, on inference, on mechanistic interpretability that could not happen at all if you didn't have open models. And for AI builders, there are also good use cases for using local models.

You know, this is a very non-comprehensive slide, but there are some applications where local models just blow closed models out of the water. Retrieval is a very clear example. You might have constraints like edge AI applications where it makes sense.

But even just in terms of stability, being able to say this model is not changing under the hood, there are plenty of good cases for open models. And the community is not just models. I stole this slide from one of the Qwen announcement blog posts, but it's super cool to see how much tech exists around open models, serving them, making them efficient, and hosting them.

It's pretty cool. And if you think about where the term "open" comes from, it comes from open source, and open models really meet the core tenets of open source, specifically when it comes to collaboration. There is truly a spirit that, through these open models, you can build on top of other people's innovation.

We see a lot of this even in our own work. As we iterate on the various versions of OLMo, it's not like we collect all the data from scratch every time. Now the first step is, okay, what are the cool data sources and datasets people have put together for language model training?

Or when it comes to our post-training pipeline, one of the steps is that you wanna do some DPO, and you use a lot of outputs of other models to improve your preference data. So having an open ecosystem really benefits and accelerates the development of open models.
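
To make that step a bit more concrete, here is a minimal sketch (not AI2's actual pipeline) of how preference pairs for DPO might be assembled from the outputs of other open models; the `generate` and `judge` functions are hypothetical placeholders for your own sampling and ranking logic.

```python
# Minimal sketch (not AI2's actual pipeline): building DPO preference pairs
# from completions sampled from other open models. `generate` and `judge`
# are hypothetical placeholders.

def generate(model_name: str, prompt: str) -> str:
    """Placeholder: sample one completion from the named open model."""
    raise NotImplementedError

def judge(prompt: str, completion: str) -> float:
    """Placeholder: score a completion, e.g. with a reward model or an LLM judge."""
    raise NotImplementedError

def build_preference_pairs(prompts, model_names):
    pairs = []
    for prompt in prompts:
        # Sample one completion per model and score each.
        scored = [(judge(prompt, c), c)
                  for c in (generate(m, prompt) for m in model_names)]
        scored.sort(reverse=True)
        best, worst = scored[0][1], scored[-1][1]
        if best != worst:
            # DPO expects (prompt, chosen, rejected) triples.
            pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs
```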

One thing that we got in 2024, which is not a specific model but I thought was really significant, is that we got our first open source AI definition. This is from the Open Source Initiative. They've generally been the steward of a lot of the open source licenses when it comes to software.

And so they embarked on this journey of trying to figure out, okay, what does an open source license for a model look like? The majority of the work is very dry because licenses are dry. So I'm not gonna walk through the definition step by step, but I'm just gonna pick out one aspect that is very good.

And then one aspect that personally feels like it needs improvement. On the good side, this open source AI definition is actually very intuitive. If you have built open source software and you have some expectations around what open source looks like for software, for AI it sort of matches your intuition.

So the weights need to be freely available, the code must be released with an open source license, and there shouldn't be license clauses that block specific use cases. Under this definition, for example, Llama or some of the Qwen models are not open source, because the license says you can't use this model for certain things, or it says if you use this model, you have to name the output this way or derivatives need to be named that way.

Those clauses don't meet the open source definition, and so they will not be covered; the Llama license will not be covered under the open source definition. It's not perfect, though. One of the things that, internally and in discussions with OSI, we were sort of disappointed by is the language around data. You might imagine that an open source AI model means a model where the data is freely available.

There were discussions around that, but at the end of the day, they decided to go with a softened stance where they say a model is open source if you provide sufficiently detailed information on how to replicate the data pipeline, so that you can build an equivalent system. Sufficiently detailed?

It's very fuzzy. I don't like that. "An equivalent system" is also very fuzzy. And this doesn't take into account the accessibility of the process, right? It might be that you provide enough information, but the process costs, I don't know, $10 million to run. Now, the open source definition, like any open source license, has never been about accessibility; how accessible software is has never been a factor in open source software.

I can make a piece of open source software, put it on my hard drive, and never distribute it. That software is still open source; the fact that it's not widely distributed doesn't change the license. But practically, there is an expectation of what we want good open source to be, so it's kind of sad to see that the data component in this definition is not as open as some of us would like it to be.

And I linked a blog post that Nathan wrote on the topic that is less rambly and easier to follow. One thing that, in general, I think it's fair to say about the state of open models in 2024 is that we know a lot more than we knew in 2023.

Both on the training data, the pre-training data you curate, and on how to do all the post-training, especially on the RL side. 2023 was a lot of throwing random darts at the board. In 2024, we have clear recipes that, sure, don't get the same results as a closed lab, because there is a cost in actually matching what they do, but at least we have a good sense of, okay, this is the path to get a state-of-the-art language model.

I think one thing that is a downside of 2024 is that we are more resource-constrained than in 2023. It feels like the bar for the compute that you need to move innovation along has just been rising and rising. So if you go back to this slide, there is now this cluster of models that are released by the compute-rich club.

Membership is hotly debated. Some people don't want to be called rich because it comes with expectations. Some people want to be called rich, but I don't know, there's debate. These are players that have 10,000 to 50,000 GPUs at minimum, and so they can do a lot of work and a lot of exploration in improving models that is not very accessible to others.

To give you a sense of how I personally think about research budgets for each part of the language model pipeline: on the pre-training side, you can maybe do something with 1,000 GPUs. Really, you want 10,000. And if you want real state of the art, your DeepSeek minimum is like 50,000.

You can scale to infinity. The more you have, the better it gets. Everyone on that side still complains that they don't have enough GPUs. Post-training is a super wide spectrum. You can do something with as little as eight GPUs. As long as you're able to run, say, a good version of a Llama model, you can do a lot of work there.

You can scale up; a lot of the methodology just scales with compute. If you're interested in an open replication of what OpenAI's o1 is, you're going to be on the 10K end of the GPU spectrum. Inference, you can do a lot with very few resources. Evaluation, you can do a lot with, well, I should say, at least one GPU if you want to evaluate open models.

But in general, if you care a lot about interventions to do on these models, which is my preferred area of research, then the resources that you need are quite significant. One of the trends that has emerged in 2024 is this cluster of fully open models, with OLMo, the model that we build at AI2, being one of them.

And it's nice that it's not just us. There's a cluster of other, mostly research, efforts working on this. So it's good to give you a primer on what fully open means. The easy way to think about it is that instead of just releasing a model checkpoint that you run, you release a full recipe, so that other people working in that space can pick and choose whatever they want from your recipe and create their own model or improve on top of your model.

You're giving out the full pipeline and all the details there instead of just the end output. So I pulled up a screenshot from our recent MoE model. For this model, for example, we released the model itself, the data it was trained on, the code for both training and inference, all the logs that we got through the training run, as well as every intermediate checkpoint.
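
As a rough illustration of what releasing every intermediate checkpoint enables, here is a minimal sketch of pulling a specific training-step checkpoint through the standard Hugging Face `revision` mechanism; the repo id and revision name below are illustrative assumptions, not necessarily the exact names used for these releases.

```python
# Minimal sketch: loading an intermediate checkpoint of a fully open model via
# the Hugging Face `revision` argument. The repo id and revision name are
# illustrative placeholders; check the model card for the real branch names.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "allenai/OLMo-2-1124-7B"   # placeholder fully open model repo
revision = "stage1-step10000"        # placeholder intermediate-checkpoint branch

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, revision=revision)

# From here you can study training dynamics, resume training, or run your own
# interventions on a mid-training checkpoint instead of only the final weights.
```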

And the fact that you release different parts of the pipeline allows others to do really cool things. For example, in this tweet from early this year, folks at Nous Research used our pre-training data to do a replication of the BitNet paper in the open. So they took just the initial part of the pipeline and then did their own thing on top of it.

It goes both ways. For example, for the OLMo 2 model, a lot of our pre-training data for the first stage of pre-training was from this DCLM initiative that was led by folks at a variety of institutions. It was a really nice group effort. And for us it was nice to be able to say, okay, the state of the art in terms of what is done in the open has improved.

We don't have to do all this work from scratch to catch up to the state of the art. We can just take it directly, integrate it, and do our own improvements on top of that. I'm going to spend a few minutes doing a shameless plug for some of our fully open recipes.

So indulge me in this. A few things that we released this year: as I was mentioning, this OLMoE model, which I think is still the state-of-the-art MoE model in its size class, and it's also fully open, so every component of this model is available. We released a multimodal model called Molmo.

Molmo is not just a model, it's a full recipe for how you go from a text-only model to a multimodal model. And we applied this recipe on top of Qwen checkpoints, on top of OLMo checkpoints, as well as on top of OLMoE. And I think there have been replications doing that on top of Mistral as well.

On the post-training side, we recently released Tulu 3. Same story: this is a recipe for how you go from a base model to a state-of-the-art post-trained model. We used the Tulu recipe on top of OLMo, on top of Llama, and there has been an open replication effort to do that on top of Qwen as well.

It's really nice to see when your recipe is kind of turnkey: you can apply it to different models, and it kind of just works. And finally, the last thing we released this year was OLMo 2, which so far is the state-of-the-art fully open language model. It sort of combines aspects from all three of these previous releases.

What we learned on the data side from OLMoE, and what we learned on making models that are easy to adapt from the Molmo project and the Tulu project. I will close with a little bit of reflection on the ways this ecosystem of open models is not all roses, not all happy news.

It feels like, day to day, it's always in peril. I talked a little bit about the compute issues earlier, but it's really not just compute. One thing that is on top of my mind is that, due to the environment and growing feelings about how AI is treated, it's actually harder to get access to a lot of the data that was used to train a lot of the models up to last year.

This is a screenshot from really fabulous work from Shayne Longpre and collaborators about diminishing access to data for language model pre-training. What they did is they went through every snapshot of Common Crawl. Common Crawl is this publicly available scrape of a subset of the internet.

And they looked at whether, for any given website that was crawlable in, say, 2017, it was still crawlable in 2024. What they found is that, as a reaction to the existence of closed models like GPT or Claude, a lot of content owners have blanket-blocked any type of crawling of their website.

And this is something that we also see internally at AI2. One project that we started this year is that we wanted to understand: if you're a good citizen of the internet and you crawl following the norms and policies that have been established in the last 25 years, what can you crawl?

And we found that there are a lot of websites where the norms of how you express a preference about whether to crawl or not are broken. A lot of people block a lot of crawling but do not advertise that in robots.txt. You can only tell that they're blocking you from crawling when you try doing it.

Sometimes you can't even fetch the robots.txt to check whether you're allowed or not. And then on a lot of websites, all these technologies that have historically existed to make website serving easier, such as Cloudflare or DNS providers, are now being repurposed for blocking AI or any type of crawling in a way that is very opaque to the content owners themselves.

So you go to these websites, you try to access them, and they're not available. And you get the feeling that, oh, something changed on the DNS side that's blocking this. And likely the content owner has no idea; they're just using Cloudflare for better load balancing, and this is something that was sort of sprung on them with very little notice.
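
For reference, here is a minimal sketch of the kind of "good citizen" robots.txt check described above, using only the Python standard library; the crawler name and URL are placeholders, and, as noted, a clean robots.txt check can still be overridden by opaque CDN or DNS-level blocking.

```python
# Minimal sketch of a polite robots.txt check before crawling a page.
# The user agent and URL are placeholders. As discussed above, this check can
# come back clean and the request can still be blocked at the CDN/DNS layer.
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def allowed_to_crawl(url: str, user_agent: str = "my-research-crawler") -> bool:
    parts = urlparse(url)
    robots_url = urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt")
    parser = RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()          # fetch and parse robots.txt
    except OSError:
        return False           # can't even fetch robots.txt; err on the side of not crawling
    return parser.can_fetch(user_agent, url)

print(allowed_to_crawl("https://example.com/some/page"))
```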

I think the problem is that this blocking really impacts people in different ways. It disproportionately helps companies that have a head start, which are usually the closed labs. And it hurts newcomers, who either have to do things in a sketchy way or are never going to get the content that the closed labs might already have.

There was a lot of coverage of this. I'm going to plug Nathan's blog post again; I think the title of this one is very succinct: before worrying about running out of training data, we're actually running out of open training data. And so if we want better open models, this should be top of mind.

The other thing that has emerged is that there are strong lobbying efforts trying to define any kind of open source AI as a new, extremely risky danger. And I want to be precise here: the problem is not that people consider the risks of this technology. Every technology has risks that should always be considered.

The thing that, to me, is disingenuous is putting this AI on a pedestal and calling it an unknown alien technology with new and undiscovered potential to destroy humanity, when in reality all the dangers, I think, are rooted in dangers that we know from the existing software industry, or existing issues that come up when using software in a lot of sensitive domains, like medical settings.

And I've also noticed a lot of effort that has actually been going into trying to make these open models safe. I pasted one example here from AI2, but there's a lot of work going on around, okay, if you're distributing this model openly, how do you make it safe?

What's the right balance between accessibility of open models and safety? And then there's also this annoying brushing under the rug of concerns that are later proved to be unfounded. If you remember, at the beginning of this year it was all about the bio-risk of these open models. The whole thing fizzled out because finally there's been rigorous research, not just this paper from the Cohere folks, showing that this is really not a concern that we should be worried about.

Again, there are a lot of dangerous uses of AI applications, but this one was just a lobbying ploy to make things sound scarier than they actually are. I've got to preface this part by saying this is my personal opinion, not my employer's. But I look at things like SB 1047 from California.

And I think we kind of dodged a bullet with this legislation. A lot of the open source community came together at sort of the last minute and made a very good effort to explain all the negative impacts of this bill. But I feel like there's a lot of excitement about building these open models or researching these open models.

And lobbying is not sexy. It's kind of boring, but it's sort of necessary to make sure that this ecosystem can really thrive. At the end of the presentation I have some links and emails, the sort of standard thing, in case anyone wants to reach out. And if folks have questions or anything they want to discuss, I'll open the floor.

I'm very curious how we should build incentives to build open models, things like François Chollet's ARC Prize and other initiatives like that. What is your opinion on how we should better align incentives in the community so that open models stay open? The incentives bit is really hard. It's something that we actually think a lot about internally.

Because building open models is risky and very expensive, and so people don't want to take risky bets. I think challenges like the ARC Prize are definitely very valid approaches for it. And then, in general, promoting any kind of effort to participate in those challenges, if we can promote doing that on top of open models.

And really leaning into this multiplier effect, I think that is a good way to go. It would also help if there were more money for research efforts around open models. I think there's a lot of investment in companies that at the moment are releasing their models in the open, which is really cool.

But it's usually more because of commercial interest than a desire to support these open models in the long term. It's a really hard problem, because everyone is operating at their local maximum, right, in ways that really optimize their own position in the market, and the global maximum is harder to achieve.

Yeah, I'm super excited to be here to talk to you guys about Mistral. A really short and quick recap of what we have done, what kind of models and products we have released in the past year and a half. Most of you probably already know that we are a small startup founded about a year and a half ago in Paris.

It was founded in May 2023 by our three co-founders. And in September 2023, we released our first open source model, Mistral 7B. How many of you have used or heard about Mistral 7B? Hey, pretty much everyone. Thank you. Yeah, it's pretty popular, and our community really loved this model.

And in December 2023, we released another popular model with the MoE architecture, Mixtral 8x7B. And going into this year, you can see we have released a lot of things. First of all, in February 2024, we released Mistral Small, Mistral Large, and Le Chat, which is our chat interface.

I will show it to you in a little bit. We released an embedding model for converting your text into embedding vectors. And all of our models are available through cloud providers, so you can use our models on Google Cloud, AWS, Azure, Snowflake, and IBM. Very useful for enterprises who want to use our models through the cloud.

And in April and May this year, we released another powerful open source MoE model, Mixtral 8x22B. We also released our first code model, Codestral, which is amazing at 80-plus programming languages. And then we launched a fine-tuning service for customization, because we know the community loves to fine-tune our models, so we provide a very nice and easy option for you to fine-tune our models on our platform.

And also, we released our fine-tuning codebase, mistral-finetune. It's open source, so feel free to take a look. And more models: from July to November this year, we released many, many other models. First of all, the two new best-in-class small models. We have Ministral 3B, great for deploying on edge devices.

We have Ministral 8B. If you used to use Mistral 7B, Ministral 8B is a great replacement with much stronger performance than Mistral 7B. We also collaborated with NVIDIA and open sourced another model, Mistral NeMo 12B, another great model. And just a few weeks ago, we updated Mistral Large to version 2, with updated state-of-the-art features and really great function calling capabilities.

It supports function calling natively. And we released two multimodal models: Pixtral 12B, which is open source, and Pixtral Large, just amazing models for not only understanding images but also text. A lot of image models are not so good at text understanding, but Pixtral Large and Pixtral 12B are good at both image understanding and text understanding.
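
As a rough sketch of what the native function calling mentioned just above looks like in practice, here is a hedged example assuming the OpenAI-style tools format that Mistral's chat completions endpoint documents; the model alias, tool name, and schema are illustrative assumptions, so check the official docs before relying on them.

```python
# Hedged sketch of a function-calling request to the chat completions API.
# The model alias, tool schema, and example question are illustrative assumptions.
import os
import requests

payload = {
    "model": "mistral-large-latest",  # assumed model alias
    "messages": [{"role": "user", "content": "What's the weather in Vancouver?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",    # hypothetical tool we expose to the model
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json=payload,
    timeout=30,
)
# If the model decides to call the tool, the call and its JSON arguments appear
# under the first choice's message (tool_calls) in the response.
print(resp.json()["choices"][0]["message"])
```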

And of course, we have models for research: Codestral Mamba, which is built on the Mamba architecture, and Mathstral, which is great at working with math problems. So yeah, those are other models. Here's another view of our model lineup. We have several premier models, which means these models are mostly available through our API. I mean, all of the models are available through our API except for Ministral 3B.

But the premier models have a special license, the Mistral Research License. You can use them for free for exploration, but if you want to use them for enterprise or production use, you will need to purchase a license from us. So on the top row here, we have Ministral 3B and 8B as our premier models.

Mistral Small is best for low-latency use cases. Mistral Large is great for your most sophisticated use cases. Pixtral Large is the frontier-class multimodal model. And we have Codestral, great for coding, and then again the Mistral embedding model. At the bottom of the slide here, we have several Apache 2.0 licensed open-weight models, free for the community to use.

And also, if you want to fine-tune them or use them for customization or production, feel free to do so. The latest is Pixtral 12B. We also have Mistral NeMo, Codestral Mamba, and Mathstral, as I mentioned. And we have three legacy models that we don't update anymore, so we recommend you move to our newer models if you are still using them.

And then just a few weeks ago, we made a lot of improvements to our chat interface, Le Chat. How many of you have used Le Chat? Oh, no, only a few. Okay, I highly recommend Le Chat. It's chat.mistral.ai. It's free to use, and it has all the amazing capabilities I'm going to show you right now.

But before that: "le chat" in French means "the cat", so this is actually a cat logo. Yeah, if you can tell, these are cat eyes. So first of all, I want to show you something. Maybe let's take a look at image understanding. Here I have a receipt, and I want to ask... sorry, just going to get the prompts.

Going back. What's going on? Yeah, I had an issue with the Wi-Fi here, so hopefully it will work. Cool. So basically, I have a receipt, and I said: I ordered coffee and a sausage, how much do I owe with an 18% tip? So hopefully it was able to get the cost of the coffee and the sausage and ignore the other things.

And, yeah, I don't really understand this, but I think this is the coffee. It's, yeah, nine. And then the cost of the sausage, we have 22 here. Yep, and then it was able to add the costs, calculate the tip, and all that. Great, so it's great at image understanding. It's great at OCR tasks.
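
Just to make the arithmetic in that demo concrete, here is a quick check using the prices read off the receipt above (9 for the coffee, 22 for the sausage), in whatever currency the receipt uses:

```python
# Quick check of the tip arithmetic from the receipt demo above
# (prices as read off the receipt: 9 for the coffee, 22 for the sausage).
coffee = 9
sausage = 22
subtotal = coffee + sausage      # 31
tip = subtotal * 0.18            # 5.58
total = subtotal + tip           # 36.58
print(f"subtotal={subtotal}, tip={tip:.2f}, total={total:.2f}")
```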

So if you have OCR tasks, please use it. It's free on Le Chat, and it's also available through our API. I also want to show you a canvas example. A lot of you may have used canvas with other tools before, but with Le Chat it's completely free again. Here I'm asking it to create a canvas that uses PyScript to execute Python in my browser.

So, ooh, what's going on? Okay, let's see if it works. "import this". Oh. Yep, okay. So basically it's executing Python here, exactly what we wanted. And the other day I was trying to ask Le Chat to create a game for me. Let's see if we can make it work.

Yeah, the Tetris game. Yep. Let's just get one row, maybe. Ah! Oh, no. Okay, never mind. You get the idea. I failed my mission. Okay, here we go. Yay! Cool. So as you can see, Le Chat can write the code for a simple game pretty easily, and you can ask Le Chat to explain the code or make updates, however you like.

Another example: there is a bar here I want to move. Okay. Right, okay. And let's go back. Another one: we also have web search capabilities, so you can ask what's the latest AI news. Image generation is pretty cool: generate an image of researchers in Vancouver. Yeah, that's Black Forest Labs' Flux Pro.

Again, this is free. Oh, cool. I guess researchers here are mostly from the University of British Columbia. That's smart. So this is Le Chat. Please feel free to use it and let me know if you have any feedback. We're always looking for improvements, and we're going to release a lot more powerful features in the coming years.

Thank you.