Best of 2024: Open Models [LS LIVE! at NeurIPS 2024]

00:00:13.000 |
I'm a research scientist at the Allen Institute for AI. 00:00:17.000 |
I threw together a few slides as a recap 00:00:21.000 |
of interesting themes in open models for 2024. 00:00:30.000 |
and then we can chat if there are any questions. 00:00:37.000 |
So I did a quick check to get a sense 00:00:42.000 |
of how much 2024 was different from 2023. 00:00:46.000 |
So I went on Hugging Face and tried to get a picture 00:01:03.000 |
I think the Yi model came at the tail end of the year. 00:01:18.000 |
rivaling frontier-level performance of what you can get 00:01:22.000 |
from closed models, from like Qwen, from DeepSeek. 00:01:26.000 |
We got Llama 3, we got all sorts of different models. 00:01:34.000 |
There's this growing group of like fully open models 00:01:37.000 |
that I'm going to touch on a little bit later. 00:01:58.000 |
depending on what point you're trying to make, 00:02:00.000 |
and plot, you know, your closed model, your open model, 00:02:14.000 |
versus last year where the gap was fairly significant. 00:02:24.000 |
I don't know if I have to convince people in this room, 00:02:27.000 |
but usually when I give these talks about open models, 00:02:31.000 |
there is always this background question 00:02:33.000 |
in people's minds of, why should we use open models? 00:02:43.000 |
to get output from one of the best models out there. 00:02:46.000 |
Why do I have to set up infra and use local models? 00:03:07.000 |
There is a large wealth of research on modeling, 00:03:26.000 |
there are also good use cases for using local models. 00:03:33.000 |
This is a very non-comprehensive slide, 00:03:36.000 |
but there are some applications 00:03:38.000 |
where local models just blow closed models out of the water. 00:03:43.000 |
Retrieval is a very clear example. 00:03:46.000 |
You might have constraints like edge AI applications 00:03:52.000 |
But even just in terms of stability, 00:03:54.000 |
being able to say this model is not changing under the hood, 00:03:57.000 |
there are plenty of good cases for open models. 00:04:09.000 |
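To make the retrieval point concrete, here is a minimal local-embedding sketch. It assumes the sentence-transformers package, and the model name is just a placeholder for whatever open embedding model you prefer:

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder model; any locally hosted open embedding model works the same way.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Open models can be pinned to an exact checkpoint.",
    "Closed APIs can change behavior under the hood.",
    "Edge devices often cannot reach a remote API at all.",
]
doc_emb = model.encode(docs, convert_to_tensor=True)

query_emb = model.encode("Why use a model that never changes underneath you?",
                         convert_to_tensor=True)

# Cosine similarity between the query and every document; print the best match.
scores = util.cos_sim(query_emb, doc_emb)[0]
print(docs[int(scores.argmax())])
```

Everything here runs on your own hardware, which is exactly why retrieval is such a natural fit for open models.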
from one of the Qwen announcement blog posts, 00:04:14.000 |
but it's super cool to see how much tech exists 00:04:36.000 |
really open models meet the core tenets of open source, 00:04:44.000 |
specifically when it comes around collaboration. 00:04:47.000 |
There is truly a spirit that, through these open models, 00:04:50.000 |
you can build on top of other people's innovation. 00:04:54.000 |
We see a lot of this, even in our own work: 00:04:58.000 |
as we iterate on the various versions of Olmo, 00:05:01.000 |
it's not like every time we start from scratch, 00:05:14.000 |
Or when it comes to our post-training pipeline, 00:05:31.000 |
So really, having an open ecosystem benefits 00:05:36.000 |
and accelerates the development of open models. 00:05:48.000 |
is we got our first open source AI definition. 00:06:18.000 |
So I'm not gonna walk through the definition step-by-step, 00:06:21.000 |
but I'm just gonna pick out one aspect that is very good. 00:06:31.000 |
On the good side, this open source AI definition, 00:06:43.000 |
like what open source looks like for software, 00:06:56.000 |
the code must be released with an open source license, 00:07:24.000 |
Those clauses don't meet the open source definition. 00:07:43.000 |
in discussion with OSI, we were sort of disappointed, 00:07:53.000 |
So you might imagine that an open source AI model 00:07:57.000 |
means a model where the data is freely available. 00:08:12.000 |
on how to sort of replicate the data pipeline 00:08:33.000 |
It might be that you provide enough information, 00:08:46.000 |
so that's never a factor in open source software, 00:09:13.000 |
is not as open as some of us would like it to be. 00:09:53.000 |
2023 was a lot of throwing random darts at the board. 00:10:01.000 |
that don't get the same results as a closed lab 00:10:10.000 |
okay, this is the path to get a state-of-the-art language model. 00:10:16.000 |
I think that one thing that is a downside of 2024 00:10:20.000 |
is that I think we are more research-constrained than in 2023. 00:10:57.000 |
These are players that have 10,000, 50,000 GPUs at minimum, 00:11:43.000 |
Post-training is a super wide sort of spectrum. 00:11:52.000 |
As long as you're able to run a good version 00:11:59.000 |
of, let's say, a Llama model, you can do a lot of work there. 00:12:04.000 |
A lot of the methodology just scales with compute. 00:12:07.000 |
If you're interested in doing an open replication, 00:12:16.000 |
you're going to be on the 10K end of the GPU spectrum. 00:12:20.000 |
Inference, you can do a lot with very few resources. 00:12:32.000 |
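To give a sense of how little the inference side needs, here is a minimal sketch using the llama-cpp-python bindings; the GGUF filename is a placeholder for whatever quantized checkpoint you have locally:

```python
from llama_cpp import Llama

# Placeholder file; a 4-bit quantized 7-8B GGUF typically runs on a laptop CPU.
llm = Llama(model_path="llama-3-8b-instruct.Q4_K_M.gguf", n_ctx=2048)

out = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "In one sentence, why do open models matter?"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```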
But in general, if you care a lot about intervention 00:12:42.000 |
then the resources that you need are quite significant. 00:12:58.000 |
So Olmo, the model that we build at AI2, is one of them. 00:13:06.000 |
There's a cluster of other, mostly research efforts 00:13:22.000 |
So fully open, the easy way to think about it is 00:13:26.000 |
instead of just releasing a model checkpoint that you run, 00:13:36.000 |
you release everything so people can pick and choose whatever they want from your recipe 00:13:51.000 |
So I pulled up a screenshot from our recent MoE model. 00:13:58.000 |
we released the model itself, the data it was trained on, 00:14:04.000 |
all the logs that we got through the training run, 00:14:20.000 |
So for example, this tweet from early this year 00:14:28.000 |
to do a replication of the BitNet paper in the open. 00:14:55.000 |
that was led by folks at a variety of institutions. 00:15:02.000 |
But for them it was nice to be able to say, okay, 00:15:11.000 |
We don't have to like do all this work from scratch 00:15:16.000 |
We can just take it directly and integrate it 00:15:24.000 |
I'm going to spend a few minutes doing a shameless plug 00:15:35.000 |
So a few things that we released this year: 00:15:41.000 |
OlmoE, which I think is still the state-of-the-art MoE model 00:15:50.000 |
So every component of this model is available. 00:15:57.000 |
Molmo is not just a model, but it's a full recipe 00:16:05.000 |
And we applied this recipe on top of Qwen checkpoints, 00:16:12.000 |
And I think there have been replications doing that 00:16:21.000 |
On the post-training side, we recently released Tulu 3. 00:16:26.000 |
This is a recipe for how you go from a base model 00:16:33.000 |
We used the Tulu recipe on top of Olmo, on top of Llama, 00:16:42.000 |
It's really nice to see when your recipe is kind of turnkey. 00:16:50.000 |
And finally, the last thing we released this year 00:16:53.000 |
was Olmo 2, which so far is the best state-of-the-art 00:17:04.000 |
What we learned on the data side from OlmoE, 00:17:25.000 |
It feels like day to day, it's always in peril. 00:17:30.000 |
And I talked a little bit about the compute issues 00:17:33.000 |
over there, but it's really not just compute. 00:17:50.000 |
to a lot of the data that was used to train a lot of the models 00:17:55.000 |
So this is a screenshot from really fabulous work 00:17:58.000 |
from Shayne Longpre, who I think is in Europe, about just access 00:18:06.000 |
like diminishing access to data for language model pre-training. 00:18:10.000 |
So what they did is they went through every snapshot 00:18:17.000 |
Common Crawl is this publicly available scrape 00:18:22.000 |
And they looked at, for any given website, 00:18:27.000 |
whether a website that was accessible in, say, 2017 is still crawlable today. 00:18:47.000 |
A lot of content owners have blanket blocked any type of crawling. 00:18:54.000 |
And this is something that we see also internally at AI2. 00:19:06.000 |
and you crawl following norms and policies that 00:19:11.000 |
have been established in the last 25 years, what can you crawl? 00:19:18.000 |
where the norms of how you express preferences 00:19:24.000 |
A lot of people would block a lot of crawling, 00:19:32.000 |
but you only find out they're blocking you from crawling when you try doing it. 00:19:35.000 |
Sometimes you can't even crawl the robots.txt. 00:19:45.000 |
like all these technologies that historically have existed 00:19:49.000 |
to make website serving easier, such as Cloudflare or DNS. 00:19:55.000 |
They're now being repurposed for blocking AI or any type 00:20:00.000 |
of crawling in a way that is very opaque to the content owners. 00:20:07.000 |
So you go to these websites, you try to access them, 00:20:13.000 |
And you get a feeling it's like, oh, something changed 00:20:22.000 |
They're just using Cloudflare for better load balancing. 00:20:27.000 |
And this is something that was sort of sprung on them 00:20:31.000 |
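As an aside, the robots.txt preferences this kind of study tracks are easy to inspect yourself with Python's standard library; a minimal sketch, with example.com standing in for any site:

```python
from urllib.robotparser import RobotFileParser

# example.com is a placeholder; point this at any site's robots.txt.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the file

# Check whether common AI crawler user agents may fetch a given page.
for agent in ("GPTBot", "CCBot", "*"):
    print(agent, rp.can_fetch(agent, "https://example.com/some/page"))
```

Note that this only reads the stated preference; the server-level blocking described above is invisible to this kind of check, which is exactly the opacity problem.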
I think the problem is this blocking really impacts people 00:20:46.000 |
doing things in the open more than those that have a head start, which are usually the closed labs. 00:20:55.000 |
where you either have to do things in a sketchy way, 00:21:11.000 |
I think the title of this one is very succinct, 00:21:16.000 |
before thinking about running out of training data, 00:21:19.000 |
we're actually running out of open training data. 00:21:53.000 |
Every technology has risks that should always be considered. 00:21:59.000 |
sorry, is disingenuous-- is just putting this AI 00:22:03.000 |
on a pedestal and calling it an unknown alien technology that 00:22:09.000 |
has new and undiscovered potentials to destroy humanity. 00:22:21.000 |
from existing software industry or existing issues that 00:22:27.000 |
come when using software on a lot of sensitive domains, 00:22:47.000 |
that has been going on about, like, OK, how do you make-- 00:22:55.000 |
What's the right balance between accessibility 00:23:03.000 |
of concerns that are then proved to be unfounded under the rug. 00:23:09.000 |
If you remember at the beginning of this year, 00:23:11.000 |
it was all about bio-risk of these open models. 00:23:15.000 |
The whole thing fizzled out because there's been-- 00:23:31.000 |
Again, there are a lot of dangerous uses of AI applications. 00:23:38.000 |
to just make things sound scarier than they actually are. 00:23:48.000 |
But I look at things like SB-1047 from California. 00:23:53.000 |
And I think we kind of dodged a bullet on this legislation. 00:23:59.000 |
The open source community, a lot of the community 00:24:08.000 |
to explain all the negative impact of this bill. 00:24:28.000 |
necessary to make sure that this ecosystem can really thrive. 00:24:34.000 |
This is the end of the presentation. I have some links, emails, 00:24:38.000 |
sort of the standard thing in case anyone wants to reach out. 00:24:46.000 |
If there's anything they wanted to discuss, I'll sort of open the floor. 00:24:50.000 |
I'm very curious how we should build incentives 00:24:53.000 |
to build open models, things like Francois Chollet's ARC 00:25:07.000 |
Like even-- it's something that actually even we 00:25:22.000 |
I think definitely the challenges, like the ARC challenge, 00:25:28.000 |
I think those are very valid approaches for it. 00:25:32.000 |
And then I think in general, promoting, building, 00:25:38.000 |
any kind of effort to participate in this challenge, 00:25:46.000 |
And sort of really lean into this multiplier effect, 00:25:55.000 |
If there were more money for efforts, like research 00:26:01.000 |
efforts around open models, there's a lot of-- 00:26:04.000 |
I think there's a lot of investments in companies 00:26:06.000 |
that at the moment are releasing their models in the open, which 00:26:11.000 |
But it's usually more because of commercial interest 00:26:15.000 |
and not wanting to support these open models in the long term. 00:26:21.000 |
It's a really hard problem because I think everyone 00:26:29.000 |
In ways that really optimize their position on the market, 00:26:43.000 |
A really short and quick recap of what we have done, 00:26:47.000 |
what kind of models and products we have released 00:26:56.000 |
that we are a small startup founded about a year and a half ago. 00:27:02.000 |
It was founded in May 2023 by three of our co-founders. 00:27:06.000 |
And in September 2023, we released our first open source model, Mistral 7B. 00:27:13.000 |
Yeah, how many of you have used or heard about Mistral 7B? 00:27:27.000 |
And in December 2023, we released another popular model, Mixtral 8x7B. 00:27:40.000 |
You can see we have released a lot of things this year. 00:27:45.000 |
released Mistral Small, Mistral Large, Le Chat, 00:27:53.000 |
We released an embedding model for converting your text into embeddings. 00:28:01.000 |
And all of our models are available on the cloud 00:28:07.000 |
So you can use our models on Google Cloud, AWS, Azure, 00:28:21.000 |
released another powerful open source MoE model, Mixtral 8x22B. 00:28:30.000 |
Codestral, which is amazing at 80-plus programming languages. 00:28:34.000 |
And then we provided another fine-tuning service 00:28:39.000 |
Because we know the community loves to fine-tune our models, 00:28:42.000 |
we provide a very nice and easy option 00:28:45.000 |
for you to fine-tune our models on our platform. 00:28:48.000 |
And also, we released our fine-tuning codebase. 00:29:06.000 |
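For a rough idea of what the hosted fine-tuning flow looks like, here is a hedged sketch. The endpoint paths, field names, and model id are assumptions modeled on Mistral's public docs, so check the current API reference before relying on any of them:

```python
import os
import requests

API = "https://api.mistral.ai/v1"
headers = {"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"}

# Upload a JSONL file of training examples (assumed upload endpoint).
with open("train.jsonl", "rb") as f:
    uploaded = requests.post(
        f"{API}/files",
        headers=headers,
        files={"file": f},
        data={"purpose": "fine-tune"},
    ).json()

# Start a fine-tuning job on an open-weight base model (assumed payload shape).
job = requests.post(
    f"{API}/fine_tuning/jobs",
    headers=headers,
    json={"model": "open-mistral-7b", "training_files": [uploaded["id"]]},
).json()
print(job)
```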
First of all, the two new best small models. 00:29:10.000 |
We have Ministral 3B, great for deploying on edge devices. 00:29:21.000 |
is a great replacement with much stronger performance 00:29:28.000 |
and open-sourced Mistral Nemo 12B, another great model. 00:29:33.000 |
And just a few weeks ago, we updated Mistral Large 00:29:37.000 |
to version 2 with updated state-of-the-art features 00:29:42.000 |
and really great function calling capabilities. 00:30:10.000 |
are good at both image understanding and text understanding. 00:30:17.000 |
Codestral Mamba is built on the Mamba architecture, and Mathstral, 00:30:39.000 |
means these models are mostly available through our API. 00:30:44.000 |
I mean, all of the models are available through our API 00:30:51.000 |
But for the premier models, they have a special license, 00:30:59.000 |
But if you want to use it for enterprise, for production use, 00:31:06.000 |
So on the top row here, we have Ministral 3B and 8B 00:31:12.000 |
Mistral Small for best low latency use cases. 00:31:16.000 |
Mistral Large is great for your most sophisticated use cases. 00:31:20.000 |
Pixtral Large is the frontier-class multimodal model. 00:31:20.000 |
we have several Apache 2.0 licensed open-weight models 00:31:40.000 |
if you want to use them for customization or production, feel free to do so. 00:31:47.000 |
We also have Mistral Nemo, Codestral Mamba, and Mathstral, 00:31:55.000 |
And we have three legacy models that we don't update anymore. 00:31:59.000 |
So we recommend you move to our newer models 00:32:09.000 |
we did a lot of improvements to our chat interface, Le Chat. 00:32:44.000 |
Yeah, so first of all, I want to show you something. 00:32:48.000 |
Maybe let's take a look at image understanding. 00:33:00.000 |
So here I have a receipt, and I want to ask-- 00:33:49.000 |
So hopefully it was able to get the cost of the coffee 00:34:28.000 |
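For anyone who wants to script the same receipt question outside the UI, here is a rough sketch against the chat completions endpoint. The model id and the exact payload shape are assumptions based on Mistral's public docs, so verify them before use:

```python
import base64
import os
import requests

# Encode a local receipt photo as a data URL (the filename is a placeholder).
with open("receipt.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "pixtral-large-latest",  # assumed model id
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "How much did the coffee on this receipt cost?"},
                {"type": "image_url",
                 "image_url": f"data:image/jpeg;base64,{image_b64}"},
            ],
        }],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```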
And also I want to show you a Canvas example. 00:34:31.000 |
A lot of you may have used Canvas with other tools before, 00:34:45.000 |
use PyScript to execute Python in my browser. 00:35:06.000 |
So, yeah, so basically it's executing Python here, 00:36:05.000 |
Yeah, so as you can see, Le Chat can write, like, 00:36:44.000 |
Generate an image about researchers in Vancouver. 00:37:05.000 |
I guess researchers here are mostly from University of British Columbia. 00:37:14.000 |
Please feel free to use it and let me know if you have any feedback. 00:37:21.000 |
And we're going to release a lot more powerful features in the coming years.