Excited to be here. I agree with swyx and Ben that it feels like a moment. It feels like a historical moment here. My name is Amjad. I'm the co-founder of Replit, where we aspire to be the fastest way to get from an idea to deployed software that you can scale.
So I'm going to take you back a little bit, not like swyx all the way to 600 AD, but perhaps to the start of computing. All right, so very early computers: the ENIAC was the first Turing-complete, programmable von Neumann machine computer. The way you programmed it is you literally punched cards.
Not physically, but you had a machine that sort of punched these cards. These are sort of binary code for the machine to interpret. It was really hard. There wasn't really a software industry because this was really difficult. It automated some tasks that human computers did at the time, but it didn't create the software industry yet.
But then we moved from punch cards to text. We had first assembly, and then we had compilers and higher-level languages such as C, and then someone invented JavaScript, and it's all been downhill since then. But text-based programming was at minimum a 10x improvement, if not a 100x improvement, in programming.
So we've had these moments where we've had orders of magnitude improvements in programming before. And then, you know, the IDE became a thing because, you know, we had large-scale software. This is a screenshot from like 2017 or '18 when we added LSP to every programming environment on Replit, so anyone with an account can get IntelliSense.
And we were really proud of that at the time. We were burning a lot of CPU doing that sort of inference. And, you know, if you've run a TypeScript language server, that's a lot of RAM. But we were really proud that we were giving everyone in the world tools to create professional-grade software. About three, four years ago, we started thinking about how AI could change software.
It actually started much sooner than that. But with GPT-2, you know, you could sort of kind of, you know, give it some code and kind of complete part of it. And we're like, okay, this thing is actually happening, and we better be part of it. And so we started building, and we built this product called Ghostwriter, which does auto-complete, chat, and all sorts of things inside the IDE.
And in just those two years, I mean, the pace of progress across the industry, the tools, basically AI, you know, was deployed, and a lot of different engineers were using it. The AI-enhanced engineer, as swyx kind of called it, everyone is sort of using these tools. And so we have a world now where a lot of people are gaining a huge amount of productivity improvement.
I don't think we're at an order of magnitude improvement yet. We're probably at 50, 80, perhaps 100% improvement for some people. But we're still at the start of this. And we think that's going to be 10x, 100x, perhaps 1,000x over the next decade. The problem, however, is that Replit's mission has always been about access.
Our mission is to empower the next billion developers. And so we really didn't want to create this world where some people have access to Ghostwriter and other people don't. And we started thinking: if you really take to heart everything that the AI Engineer conference is about, that we're at a moment where software is changing, where AI is going to be part of the software stack, then you have to step back a little bit and try to rethink how programming changes.
So our view is these programming add-ons such as Copilot and Coding and Ghostwriter and all these things, we're giving them cute names, we think that's not the way forward. We think that AI needs to be really infused in every programming interaction that you have. And it needs to be part of the default experience of Replit and I'm sure other products in the future.
That's why we're announcing today that we're giving AI to all of our millions of users that are coding on Replit. And so we think this is going to be the biggest deployment of AI-enhanced coding in the world. We're going to be burning as much GPU as we're burning CPU. So pray for us.
We have people all over the world coding on all sorts of devices. We have people coding on Android phones. And they're all going to get AI now. So they're all going to be AI-enhanced engineers. But as we showed, it's not just about AI-enhanced engineering. There's also product. So AI being part of the software creation stack makes sense.
But AI as part of the call stack is also where a lot of value is created. So that's why we also have this new product called ModelFarm. And ModelFarm basically gives you access to models right inside your IDE. So all it takes is three lines of code to start doing inference.
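To make that concrete, here is a minimal sketch of what those three lines could look like in Python. The import path, class name, and model id below are illustrative assumptions rather than the actual ModelFarm API, so check Replit's documentation for the real interface.

```python
# A minimal sketch of the "three lines of code" idea. The import path, class
# name, and model id below are illustrative assumptions, not necessarily the
# real ModelFarm API; check Replit's documentation for the actual interface.
from replit.ai.modelfarm import ChatModel  # hypothetical import path

model = ChatModel("chat-bison")  # hypothetical Google Cloud chat model id
response = model.chat([{"role": "user", "content": "Explain what a REPL is."}])
print(response)
```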
We launched with Google Cloud LLMs, but we're adding Llama pretty soon, and we're adding Stable Diffusion. And if you're an LLM provider and want to work with us and provide this on our platform, we'd love to talk to you. But basically, there's a free tier here.
Everyone will get free access, at least until the end of the year, to ModelFarm so you can start doing inference and start building AI-based products. So next up, I'm going to bring up my colleague, our head of AI, Michele Catasta, to talk about how we train our own AI models.
And we have one more announcement for you coming up. Thank you. All right. Hi, everyone. So today I'm going to be talking about how we're training LLMs for code at Replit, and I will explain the slightly weird title. If you've been around Twitter, I think a bit more than a month ago, you must have read this study from SemiAnalysis.
And their point was that it's meaningless to work on small models trained on a limited number of GPUs. And that came as a shock to us, because we had a very good success story back in May, when we started training our own models from scratch. And then, you know, Amjad and I and the AI team started to think: are we really wasting our time here?
I'm going to try to convince you that this is actually not the case. So our code completion feature on Replit is powered by our own bespoke large language model. We train on open-source code, both published on GitHub and developed by the Replit user base. It's a very low-latency feature.
So we try to find a different sweet spot compared to what you might use with other plugins. We try to keep our P95 latency below 250 milliseconds, such that the developer experience is almost instantaneous. You don't even have to think about it, and the code is going to be completed for you.
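As an aside, a P95 target like that is easy to check against recorded request latencies; here is a minimal sketch with made-up numbers.

```python
# Minimal sketch: checking a P95 latency target against recorded samples.
# The latency values here are made up purely for illustration.
import numpy as np

latencies_ms = np.array([120, 180, 95, 240, 210, 160, 230, 140, 175, 205])
p95 = np.percentile(latencies_ms, 95)
print(f"P95 latency: {p95:.0f} ms (target: < 250 ms, met: {p95 < 250})")
```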
At the model size that we're using, we have been state of the art over the past few months. And let's do a show of hands: who has heard about our v1 model from back in May? All right, that feels good. For a second I felt like an AI star. Jokes aside, we released replit-code-v1-3b back in May.
We got a lot of adoption, a lot of love, and also a lot of contributions. And that's one of the key reasons why we decided to give back. Replit's history has been built on the shoulders of giants, of all the people contributing to the open-source space. So we thought we should do exactly the same here.
We should give back our model. And today, I'm announcing replit-code-v1.5-3b, the evolution of the model that we released back in May. Let's go into detail, as Amjad was saying. Over the next 10 minutes, we're going to do a technical deep dive, and I'm going to tell you how we built it and why it's so powerful.
So first of all, we followed a slightly different recipe compared to last time. If you recall, back in May our v1 was a Llama-style code model, which means we followed a lot of the best recipes that Meta pioneered. Now we went, you know, one level up, and we are training up to 300 tokens per parameter.
So if you have been following the history of LLMs, even just two years ago, most models were under-trained. Pardon me for the word; technically speaking, it's not exactly correct. But the truth is, in mid-2022 the Chinchilla paper from DeepMind came out, and it was like a big warning for the whole field.
Basically, what the paper tells us is that we were under-training our models; we should give them way more high-quality data, and in exchange, we could train smaller models. So in a sense, we're trading training time for inference time: spending more compute to train a smaller, more powerful model, so that at inference time the latency is lower.
And that's the key insight that we're going to be carrying through this whole keynote today. Now, differently from the v1, this time we also doubled the amount of high-quality data, so we trained it on up to one trillion tokens of code. The data mixture is roughly 200 billion tokens across five epochs, plus a linear cooldown at the end that really allows us to squeeze the best possible performance out of the model.
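To make those numbers concrete, here is a quick back-of-the-envelope check using only the figures quoted in the talk (200 billion unique tokens, five epochs, and the 3.3B parameter count mentioned later).

```python
# Back-of-the-envelope check of the training budget quoted above.
params = 3.3e9          # parameter count mentioned later in the talk
unique_tokens = 200e9   # high-quality data mixture
epochs = 5              # passes over that mixture

total_tokens = unique_tokens * epochs
print(f"total training tokens: {total_tokens:.1e}")           # 1.0e+12, i.e. one trillion
print(f"tokens per parameter:  {total_tokens / params:.0f}")  # roughly 300
```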
And replit-code-v1.5 this time supports 30 programming languages, and we also added a mixture coming from Stack Exchange: posts that are oriented towards developers, so questions about coding, questions about software engineering, and so forth. So this is the basis of our data. Now let's go ahead and take a look inside the dataset that we used.
So we started from The Stack, which is an initiative led by BigCode, a group under the Hugging Face umbrella. We're very grateful for the work that these people have been doing. Basically, they have built a big pipeline: getting data from GitHub, selecting top repositories, cleaning up parts of the data, and especially keeping only code that is released under permissive licenses, such as MIT, BSD, and Apache 2.
Out of this mixture, we selected the top 30 languages. And then, really, the key secret ingredient here is how much time we spent working on the data. You must have been hearing this again and again; every time you go to an LLM talk, there is someone on stage saying, "Hey, you should pay attention to data quality." I'm here to tell you exactly the same once again.
That's probably the most important thing that you could be spending your time on, especially because the model I'm talking about today is trained from scratch. So this is not a fine-tuning. All the models that we released have been trained from the very first token prepared by us. So it's extremely important to have high data quality.
So we took inspiration from the initial quality pipelines built for Codex and the PaLM paper, and then we applied way more heuristics on top. So we're filtering out code that is auto-generated, minified, or non-parseable, basically all the code that you wouldn't want your model to recommend back to you, because it's not something that you would write yourself.
We also removed toxic content, and this whole pipeline has been built on Spark. So I'm trying to encourage you to also think about working on your own models, because pretty much all of the base components are out there, available open source. So you could really build the whole pipeline to train and serve an LLM with a lot of open-source components.
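For a sense of what heuristic filtering like this looks like, here is a minimal PySpark sketch. The thresholds, column names, and paths are illustrative assumptions, not Replit's actual pipeline.

```python
# Minimal PySpark sketch of heuristic code filtering, in the spirit of the
# pipeline described above. Thresholds, column names, and paths are
# illustrative assumptions, not Replit's actual rules.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("code-filtering").getOrCreate()
files = spark.read.parquet("s3://my-bucket/raw_code/")  # hypothetical input path

max_line_len = F.array_max(F.transform(F.split("content", "\n"), F.length))
alpha_fraction = (
    F.length(F.regexp_replace("content", r"[^A-Za-z]", "")) / F.length("content")
)

filtered = (
    files
    .withColumn("max_line_len", max_line_len)
    .withColumn("alpha_fraction", alpha_fraction)
    # Very long lines usually indicate minified or auto-generated code.
    .filter(F.col("max_line_len") < 1000)
    # Files with almost no letters tend to be data blobs rather than code.
    .filter(F.col("alpha_fraction") > 0.25)
)
filtered.write.parquet("s3://my-bucket/clean_code/")  # hypothetical output path
```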
And as swyx was saying, you have seen this crazy acceleration in the last nine months. If you wanted to do this in 2022, good luck with that. It feels like we're a decade ahead compared to last year, so it's pretty amazing, and even I didn't expect things to move this fast.
The other insight is one that we kind of pioneered for our v1 model, and it turns out to be very powerful for this new one as well. When we released the v1, a few weeks after, coincidentally, a very interesting paper was published, called Scaling Data-Constrained Language Models. I highly recommend it.
It's a great read, and it's probably one of the most interesting results in LLM research, in my opinion. And this intuition allowed us to basically train the model to completion. Rather than making trade-offs on data quality, it allowed us to select a small, high-quality subset of data, and then repeat it several times.
The key finding of this paper is basically in these two plots. I'm going to be sharing the slides so you can go and check the links. And the idea is your loss curve, after you repeat data four or five times, is going to be comparable to training on a novel data set.
Okay? Now, not only is this very useful because it allowed us to work only with high-quality data, it also allowed us to work with data that is exclusively released under permissive licenses. Therefore, once again, our 1.5 model is going to be released open source, under a commercially permissive license.
So you can use it. There we go. Just shoot us an email when you use it, because I'm very curious whether you're having a good time. So, details about the model training. We changed a few things here and there. It's a slightly larger model: 3.3B parameters, with a 4K context window.
The old one was 2K. We trained a new domain-specific vocabulary of 32K tokens, so a small one. It helps us achieve even higher compression on the data. If you've been reading about LLMs, you know that, from a simplistic point of view, they are data compressors, lossless data compressors.
So if your vocabulary allows you to pack more data into fewer tokens, then you're bringing more signal to the model while you're training. And with this new vocabulary, we're squeezing out a few extra percent, and it's a better vocabulary for code compared to what StarCoder or Code Llama are using today.
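Here is a small sketch of how you could measure that kind of compression yourself with Hugging Face tokenizers; the repo ids are assumptions based on the public model names, so verify them before running.

```python
# Minimal sketch: comparing how many tokens different vocabularies spend on
# the same code snippet. Fewer tokens per byte means better "compression".
# The Hugging Face repo ids below are assumptions; double-check them.
from transformers import AutoTokenizer

snippet = "def fibonacci(n: int) -> int:\n    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)\n"

for repo in ["replit/replit-code-v1_5-3b", "bigcode/starcoderbase"]:
    tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
    n_tokens = len(tok(snippet)["input_ids"])
    print(f"{repo}: {n_tokens} tokens, {len(snippet.encode()) / n_tokens:.2f} bytes/token")
```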
We trained on 128 H100 80GB GPUs, which are as rare as, you know, gold at this point. We trained on the MosaicML platform for a week, and to our knowledge, this is the first model officially announced to be trained on H100s and released open source. So we're very excited about it.
And we follow a list of LLM best practices. So, of course, we use FlashAttention. We have grouped-query attention (GQA), which allows us to achieve better inference performance, ALiBi position embeddings, and the latest optimizers in the game. And that, you know, is really the reason why at the end you will see very exciting numbers that I don't want to spoil right away.
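As one example of those ingredients, ALiBi replaces learned position embeddings with a linear, head-specific bias added to the attention scores. The sketch below illustrates that bias, assuming the usual geometric slope schedule for a power-of-two head count; a causal mask is still applied separately.

```python
# Minimal sketch of ALiBi: instead of learned position embeddings, a linear,
# head-specific bias is added to the attention scores based on how far back
# each key is from the query. A causal mask is still applied separately.
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Geometric slope schedule 2^(-8*h/n_heads), h = 1..n_heads (power-of-two heads).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    # distance[i, j] = j - i, clamped so future positions (j > i) get no bias here.
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).clamp(max=0)
    return slopes[:, None, None] * distance  # shape: (n_heads, seq_len, seq_len)

print(alibi_bias(n_heads=4, seq_len=5)[0])  # bias matrix for the first head
```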
So let's start from the base model, and then there is a surprise coming. This is the evaluation, pass@1 on HumanEval. For those of you who have never heard of it, HumanEval is a benchmark released back in 2021 by OpenAI, if I recall correctly. The format is the following: you have a natural language description of a task in English, and then you expect the model to generate a self-contained Python snippet that is then tested with a test harness.
So you generate code, and then you execute it, and you see if the output values are exactly what you expect. Now, an interesting evolution in the field over the last few months is that we were not content with benchmarking exclusively on Python, so we're also doing that across several different programming languages.
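In essence, the harness executes each generated candidate against the task's unit tests and counts it as a pass if the asserts hold. Here is a toy sketch of that idea; the real evaluation harness sandboxes execution far more carefully.

```python
# Toy sketch of HumanEval-style functional scoring: execute the model's
# candidate solution together with the task's unit tests, and count a pass
# if no assert fails. The real harness sandboxes execution far more carefully.
candidate = """
def add(a, b):
    return a + b
"""

unit_tests = """
assert add(2, 3) == 5
assert add(-1, 1) == 0
"""

def passes(candidate_code: str, test_code: str) -> bool:
    scope: dict = {}
    try:
        exec(candidate_code, scope)  # define the generated function
        exec(test_code, scope)       # run the benchmark's asserts against it
        return True
    except Exception:
        return False

print(passes(candidate, unit_tests))  # True; pass@1 averages this over all tasks
```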
And this is coming from the multilingual code evaluation harness, again built by BigCode. They also maintain a very interesting leaderboard: they take models from several companies and several open-source contributors, run the evals themselves, and then compile the leaderboard.
So you will find us there, I guess, in a few days. In the left column, we have StarCoder 3B, which, as of yesterday, was the state-of-the-art model at the 3B parameter size across languages. And today, our v1.5 is basically the best across every single language that you see on the list.
But what gets me excited is not so much the fact that we are more powerful than StarCoder, which was released a few months ago. What got me hyped, you know, when we were training it, is that we're very, very close to Code Llama 7B. As a reminder, Code Llama 7B is a Llama 2 model from Meta, the 7B version, which has been trained on 2 trillion tokens of natural language.
And then it has an additional pre-training phase of 500 billion tokens exclusively on code. So it's a model that is twice the size, trained on 2.5x more data, with way more GPU compute. So you see where I'm going: we're getting very close. How do we surpass Code Llama? Here is the trick.
This is the other model that we have been training in parallel, and this is the Replit-tuned version. It means the following: we further pre-trained it on 200 billion tokens of code, this time coming from our own developers. On Replit, when you create a public Repl, it's automatically published under the MIT license, so we use this code to further pre-train our model.
And we extracted, again, 30 billion tokens of code, same languages, same data filtering pipeline to retain only the top-quality ones. We did three epochs, then also a linear cooldown, and we are using basically the languages that are predominantly popular with Replit users. So not the same list as we saw before.
If you go on Replit, I would say 95% of people are mostly writing Python and JavaScript; these are the cool languages of today. Another key insight is that our cutoff for this model is literally a few weeks ago. So if there is a cool new library that everyone has been writing software with in the last month, our model is going to be capable of generating code that follows that library.
And we are going to keep these models up to date so that we can follow the trends and make our developers happier. Here is the table that I love. So we are back to this back-to-back comparison. On the very left, we have our base model.
We didn't add StarCoder here for the sake of space, and also because the base model is already topping it on every other language, so it didn't make sense. Now we have Code Llama in between, and you can see why: we are substantially better on pretty much every language. So we have 36% on the OpenAI HumanEval benchmark.
As a reminder, when I was working on PaLM-Coder, for example, that was the pass@1 result that we published in early 2022. That model was 540 billion parameters, so almost 200x larger than this model, and it achieves exactly the same HumanEval pass@1 performance. Same with code-davinci-001: if you go back to the paper, it gets exactly 36%.
So we were pretty much amazed when this happened. Now, why do we go through all this struggle of training our models? Not only because it's cool, you know, we love to do this stuff, but there is a rationale behind it. So we really want to go as fast as possible with the most powerful small model we could train.
And the reason is, all of our models are actually optimized for inference, rather than for being awesome at benchmarks. The fact that that happens gives us a lot of pride, and also makes us feel good when we do a vibe check with the model, and it performs as we expect, or even better.
But it turns out that our key result is, on a single model, with no batching, we're generating above 200 tokens per second. And we tuned the architecture for speed in every possible way: we trained a smaller vocabulary, as I was saying before, and we're using FlashAttention with a Triton kernel.
We're using the latest GQA. So every single aspect is there to make sure that we can go as fast as we can. And we optimize, basically, for usage on the Triton Inference Server and acceleration frameworks such as TensorRT-LLM, which really squeeze, you know, the last drop of performance out of NVIDIA GPUs.
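If you want a rough feel for single-stream decode throughput yourself, here is a simple sketch with Hugging Face transformers; it does not reproduce the Triton plus TensorRT-LLM production setup described above, and the repo id is an assumption.

```python
# Rough sketch of measuring single-stream decode throughput (tokens/second)
# with Hugging Face transformers. The production setup described above uses
# Triton Inference Server with TensorRT-LLM, which this script does not reproduce.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "replit/replit-code-v1_5-3b"  # assumed Hugging Face repo id
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tok("def quicksort(arr):", return_tensors="pt").to(model.device)
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/second (single request, no batching)")
```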
But the other very interesting insight is, we work very hard also to make the model deployment go much faster. So if you ever, you know, had the bad luck to work with Kubernetes in your life, you know, you know how painful it can get, you know, to get your pod, download all the dependencies, and build it, and yada, yada.
You know, so the very first time we brought this infrastructure up, it took 18 minutes to go, you know, from clicking until the model was deployed. Now, if you want to, you know, adapt to the load that the application is receiving, 18 minutes, you know, looks like an eternity.
Like, if there is a traffic spike, good luck with that. So one of our awesome engineers, Bradley, you're going to find him at the booth later today, brought this number from 18 minutes to just two minutes. There is a long list of tricks that he used. I'm not going to go through them, just talk to Brad.
The cool insight here is that now, whenever we get more load, we can react very quickly, and that's how we serve a very large user base. So the moment that Amjad announced AI for all, literally 10 minutes ago, we flipped the switch, and our code completion is in front of our users.
And that's the way we made this happen. Now, I've been asked several times: guys, why are you releasing your model open source? You put so much effort into it; isn't that an advantage for the company to keep? It turns out that the moment we did it, we got a lot of adoption.
And apart from a lot of love, which always feels good, and it feels good to chat with other people in AI who are using what we build, we also started to see fine-tuned versions and instruct-tuned versions of it. And we have seen a lot of people running our small model locally, say with GGML, which goes super fast on Apple Silicon, and they built their own custom, privacy-aware GitHub Copilot alternative with Replit v1.
So we expect the same to happen with v1.5 in the next few days. As we speak, if you go on Hugging Face, the model is already available; we're working on the README. Come talk to Madhava at the booth, he's the mastermind behind it, and he's going to tell you every single detail on how to make it run in production.
And we're going to be here until tonight, so we're more than happy to play with the model together. Now, in the last minute that I have left, I want to give you a teaser of what we're going to be doing in the next few weeks. So we're announcing a few very exciting collaborations.
The first one is with Glaive AI, a company that is building synthetic datasets. We're working on an IFT version of our model, an instruction fine-tuned version, over 210,000 coding instructions. We're already seeing very exciting results. We want to triple-check them, so, you know, follow us on Twitter, and the moment we're sure that this is performing as we expect, it's going to be out there and you are going to be able to play with it.
Second announcement: we're also collaborating with Morph Labs. I think Jesse is here today, and he's going to run a session later explaining exactly what this new format does. I'm going to give you a teaser, and then, you know, go to Jesse's talk, and he's going to explain all the details.
So we are design partners on the FIST format, which is fill in the syntax tree. You might have heard of fill in the middle, this concept where you can take your file, split it in half, and then basically if you're writing code in between, you can tell the LLM that the top of the file is your prefix, the bottom of the file is your suffix, and you give this context to the model so that it knows which part it should fill.
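For reference, a plain fill-in-the-middle prompt is usually assembled with sentinel tokens around the prefix and suffix, as in the minimal sketch below; the sentinel strings are illustrative, since each model family defines its own, and FIST differs by choosing the split points along the abstract syntax tree rather than at arbitrary character offsets.

```python
# Minimal sketch of assembling a plain fill-in-the-middle (FIM) prompt.
# The sentinel strings are illustrative; each model family defines its own.
# FIST instead picks the split points along the abstract syntax tree.
PREFIX, SUFFIX, MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def build_fim_prompt(file_text: str, cursor: int) -> str:
    prefix, suffix = file_text[:cursor], file_text[cursor:]
    # The model generates the missing middle after the <fim_middle> sentinel.
    return f"{PREFIX}{prefix}{SUFFIX}{suffix}{MIDDLE}"

source = "def greet(name):\n    \n    return message\n"
cursor = source.index("\n    return")  # cursor sits on the empty body line
print(build_fim_prompt(source, cursor))
```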
Now, we found that this format is even more powerful; it is aware of the abstract syntax tree underlying the source code. We're seeing very promising results already, and again, this will be out, you know, in just a matter of a few days or weeks. Last thing: we have a collaboration with the Perplexity AI guys.
You might have used their Labs. It's a place where they host models incredibly fast, and Replit v1.5 will appear there, so you can start to play with it and get a vibe check by tonight. Thanks, everyone. Thank you.
We'll see you next time.