
Building AI For All: Amjad Masad & Michele Catasta


Chapters

0:00 Introduction - Amjad Masad
0:42 Historical perspective
2:22 How AI can change software
4:29 📢 Announcing AI for All!
6:33 A tale of Code LLM & GPU-Poor - Michele Catasta
7:29 How Replit's code completion works
8:39 📢 Announcing Replit's new model!
13:45 📢 Announcing the new model is open source!
14:06 Model training
15:26 Model evaluation
17:31 Model data & training
18:45 Model evaluation
19:51 Model inference
22:10 Why open source?
23:50 Morph Labs Collaboration

Whisper Transcript

00:00:15.000 | Excited to be here.
00:00:16.000 | I agree with Swyx and Ben that it feels like a moment.
00:00:21.000 | It feels like a historical moment here.
00:00:24.000 | My name is Amjad.
00:00:25.000 | I'm the co-founder of Replit,
00:00:26.000 | where we aspire to be the fastest way to get from an idea
00:00:29.000 | to a deployed software that you can scale.
00:00:32.000 | So I'm going to take you back a little bit,
00:00:34.000 | not like Swyx to 600 AD,
00:00:37.000 | but perhaps to the start of computing.
00:00:41.000 | All right, so very early computers,
00:00:44.000 | the ENIAC was the first Turing-complete,
00:00:47.000 | programmable Von Neumann machine computer.
00:00:50.000 | The way you programmed it is like you literally punched cards.
00:00:54.000 | Not physically, but you had a machine that sort of punched these cards.
00:00:58.000 | These are sort of binary code for the machine to interpret.
00:01:01.000 | It was really hard.
00:01:02.000 | There wasn't really a software industry because this was really difficult.
00:01:05.000 | It automated some tasks that human computers did at the time,
00:01:09.000 | but it didn't create the software industry yet.
00:01:12.000 | But then we moved to texts from punch cards.
00:01:16.000 | And we had first assembly, and then we had compilers and higher-level languages such as C,
00:01:25.000 | and then someone invented JavaScript,
00:01:27.000 | and it's all been downhill since then.
00:01:29.000 | But text editors were really -- or like text-based programming was at minimum a 10x improvement,
00:01:38.000 | if not a 100x improvement in programming.
00:01:40.000 | So we've had these moments where we've had orders of magnitude improvements in programming before.
00:01:49.000 | And then, you know, the IDE became a thing because, you know, we had large-scale software.
00:01:52.000 | This is a screenshot from like 2017 or '18 when we added LSP to every programming environment on Replit,
00:02:01.000 | so anyone with an account can get IntelliSense.
00:02:04.000 | And we were really proud of that at the time.
00:02:07.000 | We were burning a lot of CPU doing, sort of, inference.
00:02:10.000 | And, you know, if you've run a TypeScript server, that's like a lot of RAM.
00:02:14.000 | But we're really proud that we're giving everyone in the world tools to create professional-grade software.
00:02:20.000 | About three, four years ago, we started kind of thinking about how AI could change software.
00:02:29.000 | It actually started much sooner than that.
00:02:32.000 | But with GPT-2, you know, you could sort of kind of, you know, give it some code and kind of complete part of it.
00:02:39.000 | And we're like, okay, this thing is actually happening, and we better be part of it.
00:02:44.000 | And so we started building, and we built this product called Ghostwriter, which does auto-complete, chat,
00:02:51.000 | and all sorts of things inside the IDE.
00:02:54.000 | And in just those two years, I mean, the pace of progress across the industry, the tools, basically AI, you know, was deployed,
00:03:06.000 | and a lot of different engineers were using it.
00:03:09.000 | The AI-enhanced engineer, as Swyx kind of called it, everyone is sort of using these tools.
00:03:14.000 | And so we have a world now where a lot of people are gaining huge amount of productivity improvement.
00:03:20.000 | I don't think we're at an order of magnitude improvement yet.
00:03:24.000 | We're probably in the 50, 80, perhaps 100% improvement for some people.
00:03:29.000 | But we're still at the start of this.
00:03:31.000 | And we think that's going to be 10x, 100x, perhaps 1,000x over the next decade.
00:03:39.000 | The problem, however, Replit's mission has always been about access.
00:03:42.000 | Our mission is to empower the next billion developers.
00:03:46.000 | And so we really didn't want to create this world where some people have access to Ghostwriter
00:03:51.000 | and other people don't have access to it.
00:03:54.000 | And we started thinking about, okay, what is it, if you really take to heart everything that the AI Engineer conference is about,
00:04:01.000 | that we're at a moment where software is changing, where AI is going to be part of the software stack,
00:04:06.000 | then you have to really step back a little bit and try to rethink how programming changes.
00:04:11.000 | So our view is these programming add-ons such as Copilot and Coding and Ghostwriter and all these things,
00:04:17.000 | we're giving them cute names, we think that's not the way forward.
00:04:21.000 | We think that AI needs to be really infused in every programming interaction that you have.
00:04:27.000 | And it needs to be part of the default experience of Replit and I'm sure other products in the future.
00:04:31.000 | That's why we're announcing today that we're giving AI to our millions of users that are coding on Replit.
00:04:37.000 | And so we think this is going to be the biggest deployment of AI-enhanced coding in the world.
00:04:46.000 | We're going to be burning as much GPU as we're burning CPU.
00:04:50.000 | So pray for us.
00:04:52.000 | We have people all over the world coding on all sorts of devices.
00:05:06.000 | We have people coding on Android phones.
00:05:08.000 | And they're all going to get AI now.
00:05:11.000 | So they're all going to be AI-enhanced engineers.
00:05:14.000 | But as we showed, it's not just about AI-enhanced engineering.
00:05:18.000 | There's also product.
00:05:20.000 | So AI being part of the software creation stack makes sense.
00:05:24.000 | But AI as part of the call stack is also where a lot of value is created.
00:05:28.000 | So that's why we're also -- we have this new product called Model Farm.
00:05:36.000 | And Model Farm basically gives you access to models right into your IDE.
00:05:43.000 | So all it takes is three lines of code to start doing inference.
00:05:46.000 | We launched with Google Cloud LLMs, but we're adding Llama pretty soon.
00:05:53.000 | We're adding stable diffusion.
00:05:55.000 | And if you're an LLM provider and want to work with us and provide this on our platform,
00:05:59.000 | we'd love to talk to you.
00:06:00.000 | But basically, everyone will get -- there's some free tier here.
00:06:05.000 | Everyone will get free access, at least until the end of the year, to Model Farm
00:06:10.000 | so you can start doing inference and start building AI-based products.
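As a rough illustration of what "three lines of code" could look like, here is a hypothetical Python sketch; the import path, class name, model id, and call signature are illustrative assumptions, not the actual ModelFarm SDK, so check Replit's documentation for the real API.

    # Hypothetical sketch of "three lines of code to start doing inference".
    # The module, class, and method names below are assumptions, not the real ModelFarm SDK.
    from modelfarm import TextModel                      # assumed import path

    model = TextModel("text-bison")                      # assumed id for a Google Cloud LLM
    print(model.generate("Write a haiku about Replit"))  # assumed call signature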
00:06:16.000 | So next up, I'm going to bring up my colleague, the head of AI,
00:06:21.000 | Michele Catasta, to talk about how we train our own AI models.
00:06:25.000 | And we have one more announcement for you coming up.
00:06:29.000 | Thank you.
00:06:48.000 | All right.
00:06:49.000 | Hi, everyone.
00:06:50.000 | So today I'm going to be talking about how we're training LLMs for code at Replit.
00:06:55.000 | And I will explain why it has this weird title.
00:06:58.000 | If you've been around Twitter, I think a bit more than a month ago,
00:07:01.000 | you must have read this study from SemiAnalysis.
00:07:04.000 | And their point was it's meaningless to work on small models,
00:07:09.000 | train on a limited amount of GPUs.
00:07:12.000 | And that came as a shock to us because we had a very good success story back in May
00:07:17.000 | where we started to train our models from scratch.
00:07:19.000 | And then, you know, Amjad and I and the AI team started to think,
00:07:23.000 | are we really wasting our time here?
00:07:26.000 | I'm going to try to convince this actually is not the case.
00:07:29.000 | So our code completion feature on Replit is powered by our own bespoke large language model.
00:07:36.000 | We train it on open source code, both published on GitHub and also developed by the Replit user base.
00:07:42.000 | It's a very low latency feature.
00:07:44.000 | So we try to find a different sweet spot compared to what you might use with other plugins.
00:07:49.000 | We try to keep our P95 latency below 250 milliseconds, so that the developer experience is almost instantaneous.
00:07:56.000 | You don't even have to think about it, and the code is going to be completed for you.
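For context on that target, here is a minimal sketch of how a P95 latency figure can be computed from measured completion latencies; the sample values are made up and this is not Replit's monitoring code.

    # Sketch: computing a P95 latency (in milliseconds) from measured samples.
    import numpy as np

    latencies_ms = [120, 95, 180, 240, 210, 130, 160, 90, 230, 110]  # made-up samples
    p95 = np.percentile(latencies_ms, 95)
    print(f"P95 = {p95:.0f} ms, under the 250 ms target: {p95 < 250}")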
00:07:59.000 | At the model size that we're using, we have been state of the art across the past few months.
00:08:05.000 | And let's do a show of hands.
00:08:08.000 | Who has heard about our V1 model back in May?
00:08:12.000 | All right, that feels good.
00:08:14.000 | For a second I feel like an AI star.
00:08:17.000 | Jokes aside, so we released replit-code-v1-3b back in May.
00:08:22.000 | We got a lot of adoption, a lot of love, and also a lot of contribution.
00:08:25.000 | And that's one of the key reasons why we decided to give it back.
00:08:29.000 | Replit has been built on the shoulders of giants, of all the people contributing to the open source space.
00:08:36.000 | So we thought we should do exactly the same here.
00:08:38.000 | We should give back our model.
00:08:40.000 | And today, I'm going to be announcing replit-code-v1.5-3b.
00:08:45.000 | So the evolution of the model that we released back in May.
00:08:49.000 | Let's go in detail, as Amjad was saying.
00:08:52.000 | So the next 10 minutes, we're going to do a technical deep dive,
00:08:55.000 | and I'm going to tell you how we built it and why it's so powerful.
00:08:58.000 | So first of all, we followed a slightly different recipe compared to the last time.
00:09:02.000 | If you recall, back in May our V1 was a Llama-style code model,
00:09:08.000 | which means we followed a lot of the best recipes that Meta pioneered.
00:09:11.000 | Now we went, you know, one level up, and we are training up to 300 tokens per parameter.
00:09:17.000 | So if you have been following the history of LLMs, even, you know, two years ago,
00:09:22.000 | most of the models were under-trained.
00:09:24.000 | Pardon me for the word.
00:09:26.000 | It's not exactly, you know, technically speaking, it's not correct.
00:09:29.000 | But the truth is, you know, mid-2022, the Chinchilla paper from DeepMind came out,
00:09:34.000 | and it was like a big warning for the old field.
00:09:37.000 | Basically, what the paper tells us is that we were under-training our models,
00:09:40.000 | we should give them way more high-quality data,
00:09:43.000 | and in exchange, we could train smaller models.
00:09:46.000 | So in a sense, we're amortizing training time for inference time.
00:09:50.000 | Spending more compute to train a smaller, more powerful model,
00:09:53.000 | and then at inference time, the latency would be lower.
00:09:56.000 | And that's the key insight that we're going to be carrying along, you know,
00:09:59.000 | this whole keynote today.
00:10:01.000 | Now, differently from the V1, this time we also doubled the amount of high-quality data.
00:10:07.000 | So we trained it on up to one trillion tokens of code.
00:10:10.000 | The data mixture is roughly 200 billion tokens across five epochs,
00:10:14.000 | plus a linear cooldown at the end that really allows us to squeeze the best possible performance for the model.
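As a sanity check, the arithmetic below uses only the figures quoted in the talk (a 3.3B-parameter model, roughly 200 billion unique tokens, five epochs) to recover the one-trillion-token and roughly 300 tokens-per-parameter numbers.

    # Back-of-the-envelope check using the figures from the talk.
    params = 3.3e9                         # model parameters (3.3B, mentioned later in the talk)
    unique_tokens = 200e9                  # high-quality code mixture
    epochs = 5
    total_tokens = unique_tokens * epochs  # ~1 trillion tokens
    print(total_tokens / params)           # ~303 tokens per parameter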
00:10:20.000 | And Replit Code V1.5 this time supports 30 programming languages,
00:10:25.000 | and we also added a mixture coming from Stack Exchange,
00:10:29.000 | posts that are oriented towards developers.
00:10:31.000 | So questions about coding, questions about software engineering, and so forth.
00:10:36.000 | So this is the basis of our data.
00:10:39.000 | Now let's go ahead and take a look inside of the dataset that we used.
00:10:42.000 | So we started from the Stack, which is an initiative led by BigCode.
00:10:46.000 | It's a group, you know, under the Hagen-Phase umbrella.
00:10:49.000 | Very grateful about the work that these people have been doing.
00:10:53.000 | Basically, they have built a big pipeline, getting data from GitHub,
00:10:57.000 | selecting top repositories, cleaning up parts of the data,
00:11:00.000 | and then especially leaving only code that is licensed under permissive licenses,
00:11:05.000 | such as MIT, BSD, Apache 2, and so forth.
00:11:09.000 | Out of this mixture, we selected 30 top languages.
00:11:13.000 | And then, really, the key secret ingredient here is how much time we spent working on the data.
00:11:21.000 | You must have been hearing this again and again.
00:11:23.000 | And every time you go to an LLM talk, there is someone on stage saying,
00:11:26.000 | "Hey, you should pay attention to data quality."
00:11:28.000 | I'm here to tell you exactly the same once again.
00:11:30.000 | That's probably the most important thing that you could be spending your time on,
00:11:34.000 | especially because the model I'm talking about today is trained from scratch.
00:11:38.000 | So this is not a fine-tuning.
00:11:39.000 | All the models that we released have been trained from the very first token prepared by us.
00:11:44.000 | So it's extremely important to have high data quality.
00:11:47.000 | So we took inspiration from the initial quality pipelines built by the Codex and PaLM papers,
00:11:54.000 | and then we applied way more heuristics there.
00:11:57.000 | So we're filtering out code that is auto-generated, minified, or non-parseable,
00:12:01.000 | basically all the code that you wouldn't want your model to recommend back to you
00:12:05.000 | because it's not something that you would be writing yourself.
00:12:08.000 | We also removed toxic content, and all this pipeline had been built on Spark.
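To make those heuristics concrete, here is a minimal Python sketch of file-level filters in the same spirit; the markers and thresholds are assumptions for illustration, and the real pipeline runs checks like these at scale on Spark.

    # Illustrative file-level filters (auto-generated, minified, non-parseable code).
    # Markers and thresholds are made up; the real pipeline runs on Spark.
    import ast

    AUTOGEN_MARKERS = ("auto-generated", "autogenerated", "do not edit")

    def keep_python_file(source: str) -> bool:
        lines = source.splitlines()
        if not lines:
            return False
        head = "\n".join(lines[:10]).lower()
        if any(marker in head for marker in AUTOGEN_MARKERS):
            return False                        # likely auto-generated
        if max(len(line) for line in lines) > 1000:
            return False                        # likely minified
        try:
            ast.parse(source)                   # drop non-parseable files
        except SyntaxError:
            return False
        return True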
00:12:13.000 | So I'm trying to encourage you to also think of working on your own models,
00:12:17.000 | because pretty much a lot of the base components are out there available open source.
00:12:22.000 | So you could really build the whole pipeline to train and serve an LLM with a lot of open source components.
00:12:28.000 | And as Swyx was saying, you have seen this crazy acceleration in the last nine months.
00:12:32.000 | If you wanted to do this in 2022, good luck with that.
00:12:36.000 | It feels like we're a decade ahead compared to last year, so it's pretty amazing,
00:12:40.000 | and even I didn't expect things to move this fast.
00:12:44.000 | The other insight that we kind of pioneered for our V1 model,
00:12:49.000 | and it turns out to be very powerful also for this new one.
00:12:52.000 | So when we released the V1, a few weeks after, coincidentally,
00:12:56.000 | a very interesting paper was published, called Scaling Data-Constrained Language Models.
00:13:01.000 | And I highly recommend it. It's a great read,
00:13:03.000 | and it's probably one of the most interesting results in LLM, in my opinion.
00:13:08.000 | And this intuition allowed us to basically train the model to completion.
00:13:12.000 | Rather than making trade-offs on the data quality,
00:13:15.000 | it allowed us to select a small, high-quality subset of data,
00:13:19.000 | and then repeat it several times.
00:13:21.000 | The key finding of this paper is basically in these two plots.
00:13:23.000 | I'm going to be sharing the slides so you can go and check the links.
00:13:26.000 | And the idea is your loss curve, after you repeat data four or five times,
00:13:31.000 | is going to be comparable to training on a novel data set.
00:13:34.000 | Okay?
00:13:35.000 | Now, not only is this very useful because it allowed us to work only on high-quality data,
00:13:39.000 | it also allowed us to work with data that is exclusively released under permissive license.
00:13:44.000 | Therefore, once again, for our 1.5 model, we're going to be releasing it open source,
00:13:50.000 | and it's going to be released with a commercially permissive license.
00:13:53.000 | So you can use it.
00:13:55.000 | There we go.
00:13:56.000 | Just shoot us an email when you use it, because I'm very curious if you're having a good time.
00:14:06.000 | So, details about the model training.
00:14:08.000 | We changed a few things here and there.
00:14:10.000 | It's a slightly larger model.
00:14:11.000 | It's a 3.3B.
00:14:12.000 | It's 4K context.
00:14:13.000 | The old one was a 2K.
00:14:15.000 | We train a new domain-specific vocabulary, 32K, so a small one.
00:14:20.000 | It helps us to achieve even higher compression on the data.
00:14:24.000 | If you've been reading, again, about LLMs, you know that from a simplistic point of view,
00:14:29.000 | they are data compressors.
00:14:31.000 | So if your vocabulary allows you to pack even more data on fewer tokens, then you're basically
00:14:37.000 | bringing more signals to the model while you're training.
00:14:40.000 | And with this new vocabulary, we're squeezing a few percent extra, and it's a better vocabulary
00:14:44.000 | for code compared to what StarCoder or CodeLlama are using today.
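Here is a minimal sketch of training a small, domain-specific BPE vocabulary and measuring its compression with the Hugging Face tokenizers library; the corpus file, exact vocabulary size, and special tokens are assumptions, not Replit's actual tokenizer recipe.

    # Sketch: train a ~32K BPE vocabulary on code and measure tokens per byte.
    # Corpus path, vocab size, and special tokens are assumptions for illustration.
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import ByteLevel
    from tokenizers.trainers import BpeTrainer

    tokenizer = Tokenizer(BPE())
    tokenizer.pre_tokenizer = ByteLevel()
    trainer = BpeTrainer(vocab_size=32_768, special_tokens=["<|endoftext|>"])
    tokenizer.train(files=["code_corpus.txt"], trainer=trainer)   # assumed corpus file

    sample = "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)\n"
    ids = tokenizer.encode(sample).ids
    print(len(ids) / len(sample.encode()))   # lower = better compression on code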
00:14:49.000 | We trained on 128 H100 80GB GPUs, which are as rare as, you know, gold at this point.
00:14:56.000 | We have been on the MosaicML platform for a week, and to our knowledge, this is the first
00:15:01.000 | model officially announced to be trained on H100s and released open source.
00:15:05.000 | So we're very excited about it.
00:15:08.000 | And we follow a list of LLM best practices.
00:15:11.000 | So, of course, we support flash attention.
00:15:13.000 | We have grouped-query attention, which allows us to achieve better inference performance.
00:15:18.000 | ALiBi position embeddings, the latest optimizers in the game, and that, you know, is really the
00:15:23.000 | reason why at the end you will see very exciting numbers that I don't want to spoil right away.
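Since grouped-query attention is one of the inference tricks named here, below is a minimal PyTorch sketch of the idea: fewer key/value heads than query heads, so the key/value projections and the KV cache shrink. Dimensions are illustrative, there is no causal mask, and this is not the model's actual attention code.

    # Minimal grouped-query attention sketch: n_kv_heads < n_heads.
    import torch
    import torch.nn.functional as F

    def gqa(x, wq, wk, wv, n_heads=24, n_kv_heads=8):
        b, t, d = x.shape
        hd = d // n_heads                                   # per-head dimension
        q = (x @ wq).view(b, t, n_heads, hd).transpose(1, 2)
        k = (x @ wk).view(b, t, n_kv_heads, hd).transpose(1, 2)
        v = (x @ wv).view(b, t, n_kv_heads, hd).transpose(1, 2)
        # each group of query heads shares one key/value head
        k = k.repeat_interleave(n_heads // n_kv_heads, dim=1)
        v = v.repeat_interleave(n_heads // n_kv_heads, dim=1)
        att = F.softmax((q @ k.transpose(-2, -1)) / hd ** 0.5, dim=-1)  # no causal mask, for brevity
        return (att @ v).transpose(1, 2).reshape(b, t, d)

    d, n_heads, n_kv_heads = 2304, 24, 8                    # illustrative sizes only
    x = torch.randn(1, 16, d)
    wq = torch.randn(d, d)
    wk = torch.randn(d, n_kv_heads * (d // n_heads))        # far smaller K/V projections
    wv = torch.randn(d, n_kv_heads * (d // n_heads))
    print(gqa(x, wq, wk, wv).shape)                         # torch.Size([1, 16, 2304])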
00:15:27.000 | So let's start from the base model, and then there is a surprise coming.
00:15:32.000 | So, this is the evaluation, pass@1 on HumanEval.
00:15:35.000 | For those of you who never heard about it, HumanEval is a benchmark released back in 2021
00:15:40.000 | by OpenAI, if I recall correctly.
00:15:42.000 | The format is the following.
00:15:43.000 | You have a natural language description of a task in English, and then expect the model
00:15:48.000 | to generate a self-contained Python snippet that then is going to be tested with a test harness.
00:15:55.000 | So you generate code, and then you execute it, and you see if the values in output are exactly
00:16:01.000 | what you expect.
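A toy version of that execution-based check, with one completion and one test, might look like the sketch below; the real harness runs untrusted code in a sandbox and computes pass@k over many samples, which this deliberately skips.

    # Toy illustration of an execution-based check in the style of HumanEval.
    completion = "def add(a, b):\n    return a + b\n"   # pretend this came from the model
    test = "assert add(2, 3) == 5"

    def passes(code: str, test_code: str) -> bool:
        scope = {}
        try:
            exec(code, scope)       # the real harness sandboxes this step
            exec(test_code, scope)
            return True
        except Exception:
            return False

    print(passes(completion, test))   # True, so it would count toward pass@1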
00:16:02.000 | Now, an interesting evolution in the last few months in the field is we were not content
00:16:07.000 | on benchmarking exclusively on Python.
00:16:10.000 | So we're also doing that across several different programming languages.
00:16:14.000 | And this is coming from the multilingual code eval harness, again, built by BigCode.
00:16:19.000 | And they also maintain a very interesting leaderboard.
00:16:21.000 | So what they do is they take models across, you know, several companies and several open source
00:16:26.000 | contributors.
00:16:27.000 | They run the evals themselves, and then they compile this very interesting leaderboard.
00:16:31.000 | So you will find us there, I guess, in a few days.
00:16:35.000 | So in the left column, we have StarCoder-3B, which, as of yesterday, was the state-of-the-art
00:16:40.000 | model at the 3B parameter size across languages.
00:16:45.000 | And today, our V1.5 is basically optimal across every single language that you see
00:16:51.000 | on the list.
00:16:52.000 | But what gets me excited is not so much the fact that we are more powerful than StarCoder,
00:16:57.000 | which has been released a few months ago.
00:16:59.000 | So what got me hyped, you know, when we were training it is that we're very, very close
00:17:03.000 | to Code Llama 7B.
00:17:05.000 | So as a reminder, Code Llama 7B is a Llama 2 model from Meta, the 7B version, which has
00:17:11.000 | been trained on 2 trillion tokens of natural language.
00:17:14.000 | And then it has an additional pre-training phase of 500 billion tokens exclusively on code.
00:17:19.000 | So it's a model that is twice the size.
00:17:22.000 | It's 2.5x more data, way more GPU compute.
00:17:26.000 | So you see where I'm going, you know, we're getting very close.
00:17:29.000 | How do we surpass Code Llama?
00:17:32.000 | Here is the trick.
00:17:33.000 | This is the other model that we have been training in parallel, and this is the Repl-tuned version.
00:17:39.000 | And it means the following.
00:17:40.000 | We further pre-trained it on 200 billion tokens of code, this time coming from our own developers.
00:17:47.000 | So on Replit, when you create a public Repl, it's automatically published under the MIT license,
00:17:54.000 | so we use this code to further pre-train our model.
00:17:57.000 | And we extract, again, 30 billion tokens of code, same languages, same data filtering pipeline
00:18:03.000 | to retain only the top quality ones.
00:18:06.000 | We do this for three epochs, then we also do a linear cooldown, and we are using basically the languages
00:18:12.000 | that are predominantly popular for Replit users.
00:18:15.000 | So not the same list as we saw before.
00:18:18.000 | If you go on Replit, I would say 95% of the people are mostly writing Python and JavaScript.
00:18:23.000 | These are the cool languages of today.
00:18:26.000 | Another key insight is our cutoff for this model is literally a few weeks ago.
00:18:32.000 | So if there is a cool new library that everyone is writing software for in the last month,
00:18:38.000 | our model is going to be capable of generating code that follows that library.
00:18:42.000 | And we are going to keep, basically, these models up to date so that we can follow the trends,
00:18:47.000 | and we can make our developers more happy.
00:18:51.000 | Here is the table that I love.
00:18:53.000 | So we are back to this back-to-back comparison.
00:18:57.000 | On the very left, we have our base model.
00:19:00.000 | We didn't add StarCoder here for the sake of space.
00:19:03.000 | And also, the base model is already topping it on every other language,
00:19:08.000 | so it didn't make sense.
00:19:09.000 | Now we have Code Llama in between, and you can see why.
00:19:12.000 | We are, on pretty much every language, substantially better.
00:19:16.000 | So we have 36% on the OpenAI HumanEval benchmark.
00:19:21.000 | As a reminder, when I was working on PaLM-Coder, for example,
00:19:26.000 | that was our pass@1 result that we published in early 2022.
00:19:32.000 | That model was 540 billion parameters,
00:19:35.000 | so almost 200x larger than this model,
00:19:38.000 | and it achieves exactly the same HumanEval pass@1 performance.
00:19:41.000 | Same for code-davinci-001: if you go back to the paper, it's getting exactly 36%.
00:19:48.000 | So we were pretty much amazed when this happened.
00:19:52.000 | Now, why do we go through all this struggle of training our models?
00:19:56.000 | Not only because it's cool, you know, we love to do this stuff,
00:19:59.000 | but there is a rationale behind it.
00:20:02.000 | So we really want to go as fast as possible
00:20:06.000 | with the most powerful small model we could train.
00:20:09.000 | And the reason is, all of our models are actually optimized for inference,
00:20:14.000 | rather than for being awesome at benchmarks.
00:20:17.000 | The fact that that happens gives us a lot of pride,
00:20:20.000 | and also makes us feel good when we do a vibe check with the model,
00:20:23.000 | and it performs as we expect, or even better.
00:20:26.000 | But it turns out that our key result is,
00:20:28.000 | on a single model, with no batching,
00:20:30.000 | we're generating above 200 tokens per second.
00:20:34.000 | And we tune the architecture for speed in every possible way.
00:20:38.000 | We're training a smaller vocabulary, as I was saying before.
00:20:40.000 | We're using flash attention with a Triton kernel.
00:20:44.000 | We're using the latest GQA.
00:20:46.000 | So every single aspect is there to make sure that we can go as fast as we can.
00:20:50.000 | And we optimize, basically, for usage on the Triton Inference Server
00:20:54.000 | and acceleration frameworks such as TensorRT-LLM.
00:20:57.000 | They really squeeze, you know, the last drop out of NVIDIA GPUs.
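For a rough sense of how one could sanity-check a tokens-per-second number on the released checkpoint, here is a sketch using Hugging Face transformers; the checkpoint id and the trust_remote_code flag are assumptions about the open-source release, and production serving goes through Triton and TensorRT-LLM rather than this path.

    # Rough throughput check (tokens/second) with Hugging Face transformers.
    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "replit/replit-code-v1_5-3b"                  # assumed Hugging Face checkpoint id
    tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        name, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto")

    inputs = tok("def fibonacci(n):", return_tensors="pt").to(model.device)
    start = time.time()
    out = model.generate(**inputs, max_new_tokens=128)
    new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
    print(f"{new_tokens / (time.time() - start):.1f} tokens/sec")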
00:21:01.000 | But the other very interesting insight is,
00:21:04.000 | we work very hard also to make the model deployment go much faster.
00:21:09.000 | So if you ever, you know, had the bad luck to work with Kubernetes in your life,
00:21:13.000 | you know, you know how painful it can get, you know, to get your pod,
00:21:18.000 | download all the dependencies, and build it, and yada, yada.
00:21:21.000 | You know, so the very first time we brought this infrastructure up,
00:21:24.000 | it took 18 minutes to go, you know, from clicking until the model was deployed.
00:21:28.000 | Now, if you want to, you know, adapt to the load that the application is receiving,
00:21:32.000 | 18 minutes, you know, looks like an eternity.
00:21:35.000 | Like, if there is a traffic spike, good luck with that.
00:21:38.000 | So one of our awesome engineers, Bradley, you're going to find him at the booth later today,
00:21:42.000 | brought this number from 18 minutes to just two minutes.
00:21:46.000 | There is a long list of tricks that he used.
00:21:48.000 | I'm not going to go through them, just talk to Brad.
00:21:51.000 | The cool insight here is the fact, now, whenever we get more load,
00:21:55.000 | we can react very quickly, and that's how we serve a very large user base.
00:21:59.000 | So the moment that Amjad announced AI for All literally 10 minutes ago,
00:22:03.000 | we flipped the switch, and our code completion is in front of our users.
00:22:07.000 | And that's the way we made this happen.
00:22:09.000 | Now, I've been asked several times: guys, why are you releasing your model open source?
00:22:15.000 | You put so much effort in; maybe keeping it is an advantage for a company.
00:22:19.000 | It turns out that the moment we did it, we got a lot of adoption.
00:22:24.000 | And apart from a lot of log, which always feels good,
00:22:27.000 | and it feels good to chat with other people in AI that are using what we build,
00:22:31.000 | we also started to get fine-tuned versions, instruct-tuned versions of that.
00:22:35.000 | And we have seen a lot of people using our small model deployed locally,
00:22:40.000 | say with GGML, which goes super fast on Apple Silicon,
00:22:44.000 | and they built their own custom privacy-aware GitHub Copilot alternative with Replit V1.
00:22:51.000 | So we expect the same to happen with V1.5 in the next few days.
00:22:56.000 | As we speak, also, if you go on Hugging Face, the model is available.
00:23:00.000 | We're working on the readme.
00:23:01.000 | Come talk to Madhav at the booth; he's the mastermind behind it,
00:23:04.000 | so he's going to tell you every single detail on how to make it run in production.
00:23:07.000 | And we're going to be here until tonight, so more than happy to play with the model together.
00:23:11.000 | Now, in the last minute that I've left, I want to give you like a teaser
00:23:15.000 | of what we're going to be doing in the next few weeks.
00:23:18.000 | So we're announcing a few very exciting collaborations.
00:23:21.000 | The first one is with Glaive AI, and it's a company that is building synthetic datasets.
00:23:26.000 | And we're working on an IFT version of our model, so an instruction fine-tuned version,
00:23:32.000 | over 210,000 coding instructions.
00:23:37.000 | We're already seeing very exciting results.
00:23:40.000 | We want to triple-check them and, you know, follow our Twitters,
00:23:43.000 | and the moment that we're sure that this is performing as we expect,
00:23:47.000 | it's going to be out there, and you are going to be able to play with it.
00:23:50.000 | Second announcement, we're also collaborating with Morph Labs.
00:23:54.000 | I think Jesse is here today, and he's going to run a session later,
00:23:58.000 | explaining exactly what this new format does.
00:24:01.000 | I'm going to give you a teaser, and then, you know, go to Jesse's talk,
00:24:05.000 | and he's going to explain all the details.
00:24:05.000 | So we are design partners on the FIST format, which is fill in the syntax tree.
00:24:11.000 | You might have heard of fill in the middle, this concept where you can take your file,
00:24:15.000 | split it in half, and then basically if you're writing code in between,
00:24:19.000 | you can tell the LLM that the top of the file is your prefix,
00:24:23.000 | the bottom of the file is your suffix,
00:24:25.000 | and you give this context to the model so that it knows which part it should fill.
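To make the fill-in-the-middle setup concrete, here is a small sketch of how such a prompt is typically assembled; the sentinel strings are the StarCoder-style ones and are an assumption here, since the exact tokens vary by model.

    # Sketch: assembling a fill-in-the-middle prompt from a prefix and a suffix.
    # The sentinel strings are StarCoder-style and assumed; models differ in the exact tokens.
    def fim_prompt(prefix: str, suffix: str) -> str:
        return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

    before = "def mean(xs):\n    total = "
    after = "\n    return total / len(xs)\n"
    print(fim_prompt(before, after))   # the model is asked to generate the middle, e.g. "sum(xs)"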
00:24:29.000 | Now, we found that this format is even more powerful:
00:24:32.000 | it is aware of the abstract syntax tree underlying the source code.
00:24:36.000 | We're seeing very promising results already, and again, this will be out, you know,
00:24:41.000 | in just a matter of like a few days or weeks.
00:24:43.000 | Last thing, we have collaborations with the Perplexity AI guys.
00:24:47.000 | You might have used their labs.
00:24:53.000 | So it's a place where they host models incredibly fast,
00:24:56.000 | and Replit V1.5 will appear there,
00:24:56.000 | and you can start to play with it and get a vibe check by tonight.
00:24:59.000 | Thanks, everyone.
00:25:00.000 | Thank you.
00:25:08.000 | We'll see you next time.