Building AI For All: Amjad Masad & Michele Catasta

Chapters
0:00 Introduction - Amjad Masad
0:42 Historical perspective
2:22 How AI can change software
4:29 📢 Announcing AI for all!
6:33 A tale of Code LLM & GPU-Poor - Michele Catasta
7:29 How Replit's code completion works
8:39 📢 Announcing Replit's new model!
13:45 📢 Announcing the new model is open source!
14:06 Model training
15:26 Model evaluation
17:31 Model data & training
18:45 Model evaluation
19:51 Model inference
22:10 Why open source?
23:50 Morph Labs Collaboration
00:00:16.000 |
I agree with swyx and Ben that it feels like a moment. 00:00:26.000 |
where we aspire to be the fastest way to get from an idea 00:00:50.000 |
The way you programmed it is like you literally punched cards. 00:00:54.000 |
Not physically, but you had a machine that sort of punched these cards. 00:00:58.000 |
These are sort of binary code for the machine to interpret. 00:01:02.000 |
There wasn't really a software industry because this was really difficult. 00:01:05.000 |
It automated some tasks that human computers did at the time, 00:01:09.000 |
but it didn't create the software industry yet. 00:01:16.000 |
And we first had assembly, and then we had compilers and higher-level languages such as C, 00:01:29.000 |
But text editors were really -- or like text-based programming was at minimum a 10x improvement, 00:01:40.000 |
So we've had these moments where we've had orders of magnitude improvements in programming before. 00:01:49.000 |
And then, you know, the IDE became a thing because, you know, we had large-scale software. 00:01:52.000 |
This is a screenshot from like 2017 or '18 when we added LSP to every programming environment on Replit, 00:02:01.000 |
so anyone with an account can get IntelliSense. 00:02:04.000 |
And we were really proud of that at the time. 00:02:07.000 |
We're burning a lot of CPU doing sort of inference. 00:02:10.000 |
And, you know, if you've run TypeScript server, that's like a lot of RAM. 00:02:14.000 |
But we're really proud that we're giving everyone in the world tools to create professional-grade software. 00:02:20.000 |
About three, four years ago, we started kind of thinking about how AI could change software. 00:02:32.000 |
But with GPT-2, you know, you could give it some code and have it complete part of it. 00:02:39.000 |
And we're like, okay, this thing is actually happening, and we better be part of it. 00:02:44.000 |
And so we started building, and we built this product called Ghostwriter, which does auto-complete, chat, 00:02:54.000 |
And in just those two years, I mean, the pace of progress across the industry, the tools, basically AI, you know, was deployed, 00:03:06.000 |
and a lot of different engineers were using it. 00:03:09.000 |
The AI-enhanced engineer, as swyx called it: everyone is sort of using these tools. 00:03:14.000 |
And so we have a world now where a lot of people are gaining a huge amount of productivity improvement. 00:03:20.000 |
I don't think we're at an order of magnitude improvement yet. 00:03:24.000 |
We're probably at the 50%, 80%, perhaps 100% improvement level for some people. 00:03:31.000 |
And we think that's going to be 10x, 100x, perhaps 1,000x over the next decade. 00:03:39.000 |
The problem, however, is that Replit's mission has always been about access. 00:03:42.000 |
Our mission is to empower the next billion developers. 00:03:46.000 |
And so we really didn't want to create this world where some people have access to Ghostwriter and others don't. 00:03:54.000 |
And we started thinking about, okay, what is it, if you really take to heart everything that the AI Engineer conference is about, 00:04:01.000 |
that we're at a moment where software is changing, where AI is going to be part of the software stack, 00:04:06.000 |
then you have to really step back a little bit and try to rethink how programming changes. 00:04:11.000 |
So our view is these programming add-ons such as Copilot and Cody and Ghostwriter and all these things, 00:04:17.000 |
we're giving them cute names, and we think that's not the way forward. 00:04:21.000 |
We think that AI needs to be really infused in every programming interaction that you have. 00:04:27.000 |
And it needs to be part of the default experience of Replit and I'm sure other products in the future. 00:04:31.000 |
That's why we're announcing today that we're giving AI to all of our millions of users that are coding on Replit. 00:04:37.000 |
And so we think this is going to be the biggest deployment of AI-enhanced coding in the world. 00:04:46.000 |
We're going to be burning as much GPU as we're burning CPU. 00:04:52.000 |
We have people all over the world coding on all sorts of devices. 00:05:11.000 |
So they're all going to be AI-enhanced engineers. 00:05:14.000 |
But as we showed, it's not just about AI-enhanced engineering. 00:05:20.000 |
So AI being part of the software creation stack makes sense. 00:05:24.000 |
But AI part of the call stack is also where a lot of value is created. 00:05:28.000 |
So that's why we're also -- we have this new product called Model Farm. 00:05:36.000 |
And Model Farm basically gives you access to models right into your IDE. 00:05:43.000 |
So all it takes is three lines of code to start doing inference. 00:05:46.000 |
We launched with Google Cloud LLMs, but we're adding Llama pretty soon. 00:05:55.000 |
And if you're an LLM provider and want to work with us and provide this on our platform, 00:06:00.000 |
But basically, everyone will get -- there's some free tier here. 00:06:05.000 |
Everyone will get free access, at least until the end of the year, to Model Farm 00:06:10.000 |
so you can start doing inference and start building AI-based products. 00:06:16.000 |
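As a rough illustration of the "three lines of code" claim, here is a hedged sketch of what a Model Farm completion call from inside a Repl might look like; the package path, class name, model identifier, and response shape below are assumptions for illustration, not a documented API.

```python
# Hypothetical sketch of calling Model Farm from a Repl -- the import path,
# class, model name, and response fields are assumptions, not the official API.
from replit.ai.modelfarm import CompletionModel  # assumed import path

model = CompletionModel("text-bison")  # a Google Cloud LLM assumed to be exposed via Model Farm
response = model.complete(["Write a one-line docstring for a bubble sort function."])
print(response.responses[0].choices[0].content)
```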
So next up, I'm going to bring up my colleague, the head of AI, 00:06:21.000 |
Michele Catasta, to talk about how we train our own AI models. 00:06:25.000 |
And we have one more announcement for you coming up. 00:06:50.000 |
So today I'm going to be talking about how we're training LLMs for code at Replit. 00:06:58.000 |
If you've been around Twitter, I think a bit more than a month ago, 00:07:01.000 |
you must have read this study from SemiAnalysis. 00:07:04.000 |
And their point was it's meaningless to work on small models, 00:07:12.000 |
And that came as a shock to us because we had a very good success story back in May 00:07:17.000 |
where we started to train our models from scratch. 00:07:19.000 |
And then, you know, Amjad and I and the AI team started to think, 00:07:26.000 |
I'm going to try to convince this actually is not the case. 00:07:29.000 |
So our code completion feature on Replit is powered by our own bespoke large language model. 00:07:36.000 |
We train on open source code, both code published on GitHub and code developed by the Replit user base. 00:07:44.000 |
So we try to find a different sweet spot compared to what you might use with other plugins. 00:07:49.000 |
We try to keep our P95 latency below 250 milliseconds, so that the developer experience is almost instantaneous. 00:07:56.000 |
You don't even have to think about it, and the code is going to be completed for you. 00:07:59.000 |
At the model size that we're using, we have been state of the art across the past few months. 00:08:08.000 |
Who has heard about our V1 model back in May? 00:08:17.000 |
Jokes aside, so we released replit-code-v1-3b back in May. 00:08:22.000 |
We got a lot of adoption, a lot of love, and also a lot of contribution. 00:08:25.000 |
And that's one of the key reasons why we decided to give it back. 00:08:29.000 |
Replit's history has been built on the shoulders of giants, on all the people contributing to the open source space. 00:08:36.000 |
So we thought we should do exactly the same here. 00:08:40.000 |
And today, I'm announcing replit-code-v1.5-3b. 00:08:45.000 |
So the evolution of the model that we released back in May. 00:08:52.000 |
So the next 10 minutes, we're going to do a technical deep dive, 00:08:55.000 |
and I'm going to tell you how we built it and why it's so powerful. 00:08:58.000 |
So first of all, we followed a slightly different recipe compared to the last time. 00:09:02.000 |
If you recall, back in May our V1 was a Llama-style code model, 00:09:08.000 |
which means we followed a lot of the best recipes that Meta pioneered. 00:09:11.000 |
Now we went, you know, one level up, and we are training up to 300 tokens per parameter. 00:09:17.000 |
So if you have been following the history of LLMs, even just, you know, two years ago, 00:09:26.000 |
It's not exactly correct, technically speaking. 00:09:29.000 |
But the truth is, you know, mid-2022, the Chinchilla paper from DeepMind came out, 00:09:34.000 |
and it was like a big warning for the whole field. 00:09:37.000 |
Basically, what the paper tells us is that we were under-training our models, 00:09:40.000 |
we should give them way more high-quality data, 00:09:43.000 |
and in exchange, we could train smaller models. 00:09:46.000 |
So in a sense, we're amortizing training time for inference time. 00:09:50.000 |
Spending more compute to train a smaller, more powerful model, 00:09:53.000 |
and then at inference time, the latency would be lower. 00:09:56.000 |
And that's the key insight that we're going to be carrying along, you know, 00:10:01.000 |
Now, unlike with the V1, this time we also doubled the amount of high-quality data. 00:10:07.000 |
So we trained it on up to one trillion tokens of code. 00:10:10.000 |
The data mixture is roughly 200 billion tokens across five epochs, 00:10:14.000 |
plus a linear cooldown at the end that really allows us to squeeze the best possible performance for the model. 00:10:20.000 |
And Replit Code V1.5 this time supports 30 programming languages, 00:10:25.000 |
and we also added a mixture coming from Stack Exchange, 00:10:31.000 |
So questions about coding, questions about software engineering, and so forth. 00:10:39.000 |
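To sanity-check those figures, here is a quick back-of-the-envelope sketch using only the numbers quoted in the talk:

```python
# Back-of-the-envelope check of the training mixture described above.
params = 3.3e9               # roughly 3B-parameter model
unique_tokens = 200e9        # ~200B tokens of high-quality data
epochs = 5                   # repeated ~5 times, plus a linear cooldown at the end

total_tokens = unique_tokens * epochs     # ~1 trillion tokens seen during training
print(total_tokens / params)              # ~300 tokens per parameter, far beyond the
                                          # ~20 tokens/parameter of a Chinchilla-optimal run
```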
Now let's go ahead and take a look inside of the dataset that we used. 00:10:42.000 |
So we started from the Stack, which is an initiative led by BigCode. 00:10:46.000 |
It's a group, you know, under the Hugging Face umbrella. 00:10:49.000 |
We're very grateful for the work that these people have been doing. 00:10:53.000 |
Basically, they have built a big pipeline, getting data from GitHub, 00:10:57.000 |
selecting top repositories, cleaning up parts of the data, 00:11:00.000 |
and then especially leaving only code that is licensed under permissive licenses, 00:11:09.000 |
Out of this mixture, we selected 30 top languages. 00:11:13.000 |
And then, really, the key secret ingredient here is how much time we spent working on the data. 00:11:21.000 |
You must have been hearing this again and again. 00:11:23.000 |
And every time you go to an LLM talk, there is someone on stage saying, 00:11:26.000 |
"Hey, you should pay attention to data quality." 00:11:28.000 |
I'm here to tell you exactly the same once again. 00:11:30.000 |
That's probably the most important thing that you could be spending your time on, 00:11:34.000 |
especially because the model I'm talking about today is trained from scratch. 00:11:39.000 |
All the models that we released have been trained from the very first token prepared by us. 00:11:44.000 |
So it's extremely important to have high data quality. 00:11:47.000 |
So we took inspiration from the initial quality pipelines built for Codex and in the PaLM paper, 00:11:54.000 |
and then we applied way more heuristics there. 00:11:57.000 |
So we're filtering out code that is auto-generated, minified, or non-parseable, 00:12:01.000 |
basically all the code that you wouldn't want your model to recommend back to you 00:12:05.000 |
because it's not something that you would be writing yourself. 00:12:08.000 |
We also removed toxic content, and this whole pipeline has been built on Spark. 00:12:13.000 |
So I'm trying to encourage you to also think of working on your own models, 00:12:17.000 |
because pretty much a lot of the base components are out there available open source. 00:12:22.000 |
So you could really build the whole pipeline to train and serve an LLM with a lot of open source components. 00:12:28.000 |
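To make that kind of filtering concrete, here is a minimal PySpark sketch of the heuristics described above; the column names, marker strings, thresholds, and file paths are illustrative assumptions rather than Replit's actual pipeline.

```python
# Sketch of heuristic code filtering on Spark (illustrative only).
import ast
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.appName("code-quality-filter").getOrCreate()
files = spark.read.parquet("stack_subset.parquet")  # hypothetical input with a "content" column

def parses_as_python(source):
    """Crude proxy for 'non-parseable': does the file parse as Python at all?"""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False

parseable = F.udf(parses_as_python, BooleanType())

filtered = (
    files
    # drop likely-minified files: any single line longer than ~1000 characters
    .withColumn("max_line_len", F.array_max(F.transform(F.split("content", "\n"), F.length)))
    .filter(F.col("max_line_len") < 1000)
    # crude auto-generation check on a common marker string
    .filter(~F.col("content").contains("Auto-generated"))
    # drop files that do not parse at all
    .filter(parseable("content"))
)
filtered.write.parquet("filtered_subset.parquet")
```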
And as swyx was saying, you have seen this crazy acceleration in the last nine months. 00:12:32.000 |
If you wanted to do this in 2022, good luck with that. 00:12:36.000 |
It feels like we're a decade ahead compared to last year, so it's pretty amazing, 00:12:40.000 |
and I didn't even expect, myself, the field to move this fast. 00:12:44.000 |
The other insight that we kind of pioneered for our V1 model 00:12:49.000 |
turns out to be very powerful also for this new one. 00:12:52.000 |
So a few weeks after we released the V1, coincidentally, 00:12:56.000 |
a very interesting paper was published called Scaling Data-Constrained Language Models. 00:13:01.000 |
And I highly recommend it. It's a great read, 00:13:03.000 |
and it's probably one of the most interesting results in LLM, in my opinion. 00:13:08.000 |
And this intuition allowed us to basically train the model to completion. 00:13:12.000 |
Rather than making trade-offs on the data quality, 00:13:15.000 |
it allowed us to select a small, high-quality subset of data, 00:13:21.000 |
The key finding of this paper is basically in these two plots. 00:13:23.000 |
I'm going to be sharing the slides so you can go and check the links. 00:13:26.000 |
And the idea is your loss curve, after you repeat data four or five times, 00:13:31.000 |
is going to be comparable to training on a novel data set. 00:13:35.000 |
Now, not only is this very useful because it allowed us to work only on high-quality data, 00:13:39.000 |
it also allowed us to work with data that is exclusively released under permissive license. 00:13:44.000 |
Therefore, once again, for our 1.5 model, we're going to be releasing it open source, 00:13:50.000 |
and it's going to be released with a commercially permissive license. 00:13:56.000 |
Just shoot us an email when you use it, because I'm very curious if you're having a good time. 00:14:15.000 |
We trained a new domain-specific vocabulary, 32K tokens, so a small one. 00:14:20.000 |
It helps us to achieve even higher compression on the data. 00:14:24.000 |
If you've been reading, again, about LLMs, you know that from a simplistic point of view, 00:14:31.000 |
So if your vocabulary allows you to pack even more data into fewer tokens, then you're basically 00:14:37.000 |
bringing more signals to the model while you're training. 00:14:40.000 |
And with this new vocabulary, we're squeezing a few percent extra, and it's a better vocabulary 00:14:44.000 |
for code compared to what StarCoder or Code Llama are using today. 00:14:49.000 |
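For context, training a small code-specific vocabulary like this is possible with off-the-shelf open source tooling; the sketch below uses the Hugging Face tokenizers library, with an assumed corpus path and assumed special tokens.

```python
# Sketch of training a 32K BPE vocabulary on a code corpus (assumed paths/tokens).
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    special_tokens=["<unk>", "<fim_prefix>", "<fim_suffix>", "<fim_middle>", "<eos>"],
)
tokenizer.train(files=["code_corpus.txt"], trainer=trainer)  # hypothetical corpus file
tokenizer.save("code-32k-vocab.json")

# A code-specific vocabulary packs the same source into fewer tokens, so each
# training batch carries more signal per token.
```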
We trained on 128 H100 80GB GPUs, which are as rare as, you know, gold at this point. 00:14:56.000 |
We have been on the MosaicML platform for a week, and to our knowledge, this is the first 00:15:01.000 |
model officially announced to be trained on H100s and released open source. 00:15:13.000 |
We have grouped-query attention, which allows us to achieve better inference performance, 00:15:18.000 |
ALiBi position embeddings, the latest optimizers in the game, and that, you know, is really the 00:15:23.000 |
reason why at the end you will see very exciting numbers that I don't want to spoil right away. 00:15:27.000 |
So let's start from the base model, and then there is a surprise coming. 00:15:32.000 |
So, this is the evaluation, pass@1, on HumanEval. 00:15:35.000 |
For those of you who never heard about it, HumanEval is a benchmark released back in 2021 by OpenAI. 00:15:43.000 |
You have a natural language description of a task in English, and you expect the model 00:15:48.000 |
to generate a self-contained Python snippet that then is going to be tested with a test harness. 00:15:55.000 |
So you generate code, and then you execute it, and you see if the values in output are exactly the ones expected. 00:16:02.000 |
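A simplified sketch of that generate-then-execute loop (the real HumanEval harness adds sandboxing and per-problem entry points; `generate` below stands in for any model call):

```python
# Minimal pass@1 scoring sketch: run each completion against the benchmark's tests.
import subprocess, tempfile

def passes(candidate_code: str, test_code: str, timeout: int = 10) -> bool:
    """Run the generated snippet together with the benchmark's unit tests."""
    program = candidate_code + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(["python", path], timeout=timeout, capture_output=True)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def pass_at_1(problems, generate) -> float:
    """problems: dicts with 'prompt' and 'test' fields; generate: prompt -> completion."""
    solved = sum(passes(p["prompt"] + generate(p["prompt"]), p["test"]) for p in problems)
    return solved / len(problems)
```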
Now, an interesting evolution in the last few months in the field is that we were not content with evaluating only on Python. 00:16:10.000 |
So we're also doing that across several different programming languages. 00:16:14.000 |
And this is coming from the multilingual code eval harness, again, built by BigCode. 00:16:19.000 |
And they also maintain a very interesting leaderboard. 00:16:21.000 |
So what they do is they take models across, you know, several companies and several open source 00:16:27.000 |
They run the evals themselves, and then they compile this very interesting leaderboard. 00:16:31.000 |
So you will find us there, I guess, in a few days. 00:16:35.000 |
So in the left column, we have StarCoder 3B, which, as of yesterday, was the state-of-the-art 00:16:40.000 |
model at the 3b parameter size across languages. 00:16:45.000 |
And today, our V1.5 is basically ahead on every single language that you see 00:16:52.000 |
But what gets me excited is not so much the fact that we are more powerful than StarCoder. 00:16:59.000 |
So what got me hyped, you know, when we were training it is that we're very, very close to Code Llama 7B. 00:17:05.000 |
So as a reminder, Code Llama 7B is a Llama 2 model from Meta, the 7B version, which has 00:17:11.000 |
been trained on 2 trillion tokens of natural language. 00:17:14.000 |
And then it has an additional pre-training phase of 500 billion tokens exclusively on code. 00:17:26.000 |
So you see where I'm going, you know, we're getting very close. 00:17:33.000 |
This is the other model that we have been training in parallel, and this is the Replit-tuned version. 00:17:40.000 |
We further pre-trained it on 200 billion tokens of code, this time coming from our own developers. 00:17:47.000 |
So on Replit, when you create a public Repl, it's automatically published under the MIT license, 00:17:54.000 |
so we use this code to further pre-train our model. 00:17:57.000 |
And we extract, again, 30 billion tokens of code, same languages, same data filtering pipeline 00:18:06.000 |
We do three epochs, then we also do a linear cooldown, and we are using basically the languages 00:18:12.000 |
that are predominantly popular for Replit users. 00:18:18.000 |
If you go on Replit, I would say 95% of the people are mostly writing Python and JavaScript. 00:18:26.000 |
Another key insight is our cutoff for this model is literally a few weeks ago. 00:18:32.000 |
So if there is a cool new library that everyone is writing software for in the last month, 00:18:38.000 |
our model is going to be capable of generating code that follows that library. 00:18:42.000 |
And we are going to keep, basically, these models up to date so that we can follow the trends, 00:18:53.000 |
So we are back to this back-to-back comparison. 00:19:00.000 |
We didn't add StarCoder here for the sake of space. 00:19:03.000 |
And also, the base model is already topping it on every other language, 00:19:09.000 |
Now we have Code Llama in between, and you can see why. 00:19:12.000 |
We are, on pretty much every language, substantially better. 00:19:16.000 |
So we have 36% on the OpenAI HumanEval benchmark. 00:19:21.000 |
As a reminder, when I was working on PaLM Coder, for example, 00:19:26.000 |
that was our pass@1 result that we published in early 2022. 00:19:38.000 |
and it achieves exactly the same HumanEval pass@1 performance. 00:19:41.000 |
Same with code-davinci-001: if you go back to the paper, it gets exactly 36%. 00:19:48.000 |
So we were pretty much amazed when this happened. 00:19:52.000 |
Now, why do we go through all this struggle of training our models? 00:19:56.000 |
Not only because it's cool, you know, we love to do this stuff, 00:20:06.000 |
with the most powerful small model we could train. 00:20:09.000 |
And the reason is, all of our models are actually optimized for inference, 00:20:17.000 |
The fact that that happens gives us a lot of pride, 00:20:20.000 |
and also makes us feel good when we do a vibe check with the model, 00:20:23.000 |
and it performs as we expect, or even better. 00:20:30.000 |
we're generating above 200 tokens per second. 00:20:34.000 |
And we tune the architecture for speed in every possible way. 00:20:38.000 |
We're training a smaller vocabulary, as I was saying before. 00:20:40.000 |
We're using flash attention with a Triton kernel. 00:20:46.000 |
So every single aspect is there to make sure that we can go as fast as we can. 00:20:50.000 |
And we optimize, basically, for usage on the Triton Inference Server 00:20:54.000 |
and acceleration frameworks such as TensorRT-LLM. 00:20:57.000 |
They really squeeze, you know, the last drop out of NVIDIA GPUs. 00:21:04.000 |
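Putting those numbers together, a rough latency budget looks something like the sketch below; the overhead figure is an assumption, not a measured value.

```python
# Rough latency budget implied by the numbers above -- a sketch, not a benchmark.
tokens_per_second = 200     # quoted generation speed
p95_budget_ms = 250         # target end-to-end P95 latency for a completion
overhead_ms = 60            # assumed network + queueing + prefill overhead

per_token_ms = 1000 / tokens_per_second          # ~5 ms per generated token
max_tokens = (p95_budget_ms - overhead_ms) / per_token_ms
print(round(max_tokens))    # ~38 tokens fit in the budget: roughly a line or two of code
```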
We also work very hard to make the model deployment go much faster. 00:21:09.000 |
So if you've ever had the bad luck to work with Kubernetes in your life, 00:21:13.000 |
you know how painful it can get to get your pod up, 00:21:18.000 |
download all the dependencies, and build it, and so on. 00:21:21.000 |
You know, so the very first time we brought this infrastructure up, 00:21:24.000 |
it took 18 minutes to go, you know, from clicking until the model was deployed. 00:21:28.000 |
Now, if you want to, you know, adapt to the load that the application is receiving, 00:21:32.000 |
18 minutes, you know, looks like an eternity. 00:21:35.000 |
Like, if there is a traffic spike, good luck with that. 00:21:38.000 |
So one of our awesome engineers, Bradley, you're going to find him at the booth later today, 00:21:42.000 |
brought this number from 18 minutes to just two minutes. 00:21:48.000 |
I'm not going to go through all the details, just talk to Brad. 00:21:51.000 |
The cool insight here is the fact, now, whenever we get more load, 00:21:55.000 |
we can react very quickly, and that's how we serve a very large user base. 00:21:59.000 |
So the moment that Amjad announced AI for All literally 10 minutes ago, 00:22:03.000 |
we flipped the switch, and our code completion is in front of our users. 00:22:09.000 |
Now, I've been asked several times, guys, why are you releasing your model open source? 00:22:15.000 |
You put so much effort into it; isn't keeping it closed an advantage for a company? 00:22:19.000 |
It turns out that the moment we did it, we got a lot of adoption. 00:22:24.000 |
And apart from a lot of love, which always feels good, 00:22:27.000 |
and it feels good to chat with other people in AI that are using what we build, 00:22:31.000 |
we also started to get fine-tuned versions, instruct-tuned versions of that. 00:22:35.000 |
And we have seen a lot of people using our small model deployed locally, 00:22:40.000 |
say with GGML, which goes super fast on Apple Silicon, 00:22:44.000 |
and they built their own custom privacy-aware GitHub Copilot alternative with Replit V1. 00:22:51.000 |
So we expect the same to happen with V1.5 in the next few days. 00:22:56.000 |
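For reference, here is a hedged sketch of the kind of local setup people built, using the ctransformers bindings for GGML; the checkpoint path is hypothetical and assumes you already have a GGML-converted copy of the weights.

```python
# Hedged sketch of running a GGML build of a small code model locally
# (e.g. on Apple Silicon). The path is hypothetical.
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "path/to/replit-code-ggml",   # hypothetical local path or repo with GGML weights
    model_type="replit",
)
print(llm("def fibonacci(n):", max_new_tokens=64, temperature=0.2))
```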
As we speak, if you go on Hugging Face, the model is already available. 00:23:01.000 |
Come talk to Madhava at the booth; he is the mastermind behind it, 00:23:04.000 |
so he's going to tell you every single detail on how to make it run in production. 00:23:07.000 |
And we're going to be here until tonight, so more than happy to play with the model together. 00:23:11.000 |
Now, in the last minute that I've left, I want to give you like a teaser 00:23:15.000 |
of what we're going to be doing in the next few weeks. 00:23:18.000 |
So we're announcing a few very exciting collaborations. 00:23:21.000 |
The first one is with Glaive AI, and it's a company that is building synthetic datasets. 00:23:26.000 |
And we're working on an IFT version of our model, so an instruct fine-tuned version, 00:23:40.000 |
We want to triple-check the results, so, you know, follow us on Twitter, 00:23:43.000 |
and the moment that we're sure that this is performing as we expect, 00:23:47.000 |
it's going to be out there, and you are going to be able to play with it. 00:23:50.000 |
Second announcement, we're also collaborating with Morph Labs. 00:23:54.000 |
I think Jesse is here today, and he's going to run a session later, 00:23:58.000 |
explaining exactly what this new format does. 00:24:01.000 |
I'm going to give you a teaser, and then, you know, go to Jesse's talk, 00:24:03.000 |
and he's going to explain all the details to you. 00:24:05.000 |
So we are design partners on the FIST format, which is fill-in-the-syntax-tree. 00:24:11.000 |
You might have heard of fill in the middle, this concept where you can take your file, 00:24:15.000 |
split it in half, and then basically if you're writing code in between, 00:24:19.000 |
you can tell the LLM that the top of the file is your prefix and the bottom is your suffix, 00:24:25.000 |
and you give this context to the model so that it knows which part it should fill. 00:24:29.000 |
Now, we found that this format is even more powerful, because it 00:24:32.000 |
is aware of the abstract syntax tree underlying the source code. 00:24:36.000 |
We're seeing very promising results already, and again, this will be out, you know, 00:24:41.000 |
in just a matter of like a few days or weeks. 00:24:43.000 |
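To make the difference concrete, here is plain fill-in-the-middle prompting as a baseline; the sentinel token names follow common FIM conventions and are assumptions, not necessarily the ones Replit's model uses.

```python
# Plain fill-in-the-middle: split the file, let the model produce the missing span.
prefix = "def mean(xs):\n    total = "
suffix = "\n    return total / len(xs)\n"

fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
# A good completion here would be something like: "sum(xs)"

# FIST generalizes this: instead of cutting the raw text at arbitrary character
# positions, the hole is cut along abstract-syntax-tree nodes, so the model is
# asked to fill a syntactically meaningful unit (an expression, a block, a body).
```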
Last thing, we have collaborations with the Perplexity AI guys. 00:24:49.000 |
So it's a place where they host models incredibly fast, 00:24:56.000 |
and you can start to play with it and get a vibe check by tonight.