
How Codeium Breaks Through the Ceiling for Retrieval: Kevin Hou


Transcript

KEVIN HOU: So my name is Kevin, and I'm going to be talking about how embeddings are stunting AI agents. I'm going to let you in on some secrets about how we build the product and exactly what we're doing behind the scenes to improve your code gen experience. So at Codeium, we are building AI developer tools, and we're starting with an IDE plug-in.

And as mentioned before, we've been downloaded over a million and a half times. We're one of the top-rated extensions across the different marketplaces. And to reiterate, we offer free, unlimited autocomplete, chat, and search across 70 different languages and 40 different IDEs. So we plug into all the popular IDEs.

We are the highest-rated developer tool, as voted by developers in the most recent Stack Overflow survey. And you'll note that this is even higher than tools like ChatGPT and GitHub Copilot. And importantly, we are trusted by Fortune 500s to deliver high-quality code that actually makes it into production.

And we do this with top-grade security, licensing, and attribution for some of the largest enterprises on the planet. Our goal at Codeium is to empower every developer to have superpowers both inside of the IDE and beyond. And today I'm going to let you in on some secrets about how we've been able to build a tool like this, and why users choose us over the other AI tools on the market.

And the short answer is context awareness. So here's a quick overview about what context looks like today. We're all familiar, since we're at an AI conference, with the basics of retrieval augmented generation. The idea being that a user puts in a query, you accumulate context from a variety of different sources, you throw it into your LLM, and then you get a response, whether that be a code generation or a chat message.
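To make that flow concrete, here's a minimal sketch of a retrieval-augmented generation loop. The helpers named here (retrieve_context, call_llm, and the per-source search method) are illustrative placeholders for this talk, not Codeium's actual pipeline.

```python
# Minimal retrieval-augmented generation loop (illustrative only; the helpers
# below are hypothetical placeholders, not a real product pipeline).

def retrieve_context(query: str, sources: list) -> list[str]:
    """Gather candidate snippets (open files, design-system code, docs, ...)."""
    candidates = []
    for source in sources:
        candidates.extend(source.search(query))  # each source ranks its own items
    return candidates[:20]  # keep only the top-ranked snippets

def answer(query: str, sources: list, call_llm) -> str:
    """Assemble retrieved context into a prompt and ask the LLM for a response."""
    context = retrieve_context(query, sources)
    prompt = "Context:\n" + "\n---\n".join(context) + f"\n\nTask: {query}"
    return call_llm(prompt)  # the response: generated code or a chat message
```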

Here's a concrete example of how retrieval can be used in code generation. So let's say we want to build a contact form in React. Now you could go to ChatGPT, you could ask it to generate a contact form, but in reality, on a moderately large code base, this is really not going to work.

It's not going to give you things that are personalized to you. And this is really where context retrieval comes in. We need to build a contact form that is in line with our design system components. Let's say you already have buttons and inputs, it has to be able to pattern match with local instances of other forms inside of your code base.

It has to ingest your style guide. For example, if you're using Tailwind, it has to detect that and make the form look and feel like everything else on your site. And then, of course, there's documentation, both locally and externally, for packages and other dependencies. So the question becomes, how do you collect and rank these items so that code generation can be both fast and accurate for your use case?

So to dive into a couple of different methods for how people are tackling this today, there are really three main pillars. The first one is long context. This is the idea that if you expand the prompt window of your LLM, it can read more input and therefore be a bit more personalized to what you're trying to generate.

This is very ergonomically easy to use, right? You just shove more items into your prompt. But this comes at the cost of latency and financial cost. One of the most recent examples was Gemini. Gemini actually takes 36 seconds to ingest 325k tokens. To put this into perspective, a moderately sized or even small repo is easily over 1 million tokens, which amounts to about 100k lines of code.

So in this instance, most enterprises have over a billion tokens of code. It's simply not feasible to be throwing everything into a long context model. The second method is fine tuning. So for those that are familiar, fine tuning is the idea of actually tweaking the weights of your model to reflect the distribution of the data that your consumer expects.

And so this requires continuous updates. It's rather expensive computationally. You have to have one model per customer. And it's honestly prohibitively expensive for most applications. And finally, we have embeddings. And for all of you, hopefully you're familiar, this is a relatively proven technology today. It's pretty inexpensive to compute and store.

But the difficulty that we're about to dive into is that it is hard to reason over multiple items. It also has a low-dimensional space, and I'll talk about that shortly. So to dive deeper into embeddings, the whole concept is that you take your objects, throw them through an embedding model, and end up with some sort of vector, some sort of array of numerical values, in a fixed dimension.

And so by chunking code, we can map each chunk to an embedding. And that allows us to quickly search over our functions, our documents, whatever you decide to chunk by. This is what's called embedding search. Embedding search, like I said, is not a new concept. There are a bunch of models that have tried to optimize for it.
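As a rough illustration of what embedding search boils down to, here's a small sketch: embed each chunk once up front, then rank chunks by cosine similarity to the embedded query. The embedding model itself is abstracted away; any model that returns fixed-dimension vectors would slot in.

```python
import numpy as np

# Toy embedding search: chunk_vecs holds one fixed-dimension vector per code
# chunk (produced by whatever embedding model you use); query_vec is the
# embedded query. We rank chunks by cosine similarity and return the top k.
def cosine_rank(query_vec: np.ndarray, chunk_vecs: np.ndarray, top_k: int = 10):
    q = query_vec / np.linalg.norm(query_vec)                           # normalize query
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)  # normalize chunks
    scores = c @ q                                                      # cosine similarities
    top = np.argsort(-scores)[:top_k]                                   # indices of best chunks
    return top, scores[top]
```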

And in this example, we're looking at one of the kind of North Star eval benchmarks. It's become increasingly popular. And the question becomes, how do we fit millions of lines of code into an LLM so that we can actually generate useful results? And so it's evident through the years that we're actually hitting a ceiling on what is possible using these traditional vector embeddings.

And over time, even the biggest models are converging to around the same level of performance. As you can see, everything's kind of within plus or minus five. And at Codeium, we kind of believe that this is because, fundamentally, we cannot distill the dimension space of all possible questions, all possible English queries, down into the embedding dimension space that our vectors are going to occupy.

And so at Codeium, we've thought very critically about what retrieval means to us. Are we measuring the right things? And does semantic distance between these vectors really equate to things like function relevance in the concrete example that I showed earlier? And so what we landed on is that benchmarks like the one that I showed you before heavily skew towards this idea of needle in a haystack.

It's the idea that you can sift through a corpus of text and find some instance of something that is relevant to you. Note, it is only one single needle. So in reality, code search requires multiple different needles, right? We showed that slide earlier. When you're building a contact form, you need all these different things in order to actually have a good generation.

And these benchmarks really don't touch that. And so we decided to use a different metric, called recall@50. The definition is: what fraction of your ground truth is in the top 50 items retrieved? So the idea being, now we have multiple documents, and we're looking at the top 50 documents that we retrieved.

How many of those are part of our ground truth set? So this is really helpful for understanding multi-document context, especially again for those large, large code bases. And now we actually have to build a data set around this. And so this is where we did a little bit of magic.
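To pin the metric down, here's a tiny sketch of how recall@50 can be computed for one query; retrieved is a ranked list of file paths and ground_truth is the set of files that should have been found.

```python
def recall_at_k(retrieved: list[str], ground_truth: set[str], k: int = 50) -> float:
    """Fraction of the ground-truth items that appear in the top-k retrieved items."""
    hits = sum(1 for item in retrieved[:k] if item in ground_truth)
    return hits / len(ground_truth) if ground_truth else 0.0

# Example: if 3 of the 4 relevant files show up in the top 50, recall@50 = 0.75.
```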

We wanted to make the eval as close as possible to our end user distribution, so we had to compile our own data set. So what we did: this is a PR that I put out a few months ago. We looked at PRs like this, broken down into commits.

Those commits we can extract and actually match them with the modified files, right? So now we have this mapping from something in English to a list of files that are relevant to that change. And you can imagine we can hash this in many different ways. But ultimately the point I'm trying to make is we are creating an eval set that mimics our production usage of something like a code gen product.
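As a sketch of that idea, here's roughly how version-control history could be turned into retrieval eval examples: each commit message becomes a natural-language query, and the files that commit touched become its ground-truth set. This only illustrates the concept; the exact pipeline described in the talk isn't shown here.

```python
import subprocess

# Build one eval example from a commit: the commit message is the "query" and
# the files it modified are the ground-truth retrieval targets.
def commit_to_example(repo_path: str, sha: str) -> dict:
    message = subprocess.run(
        ["git", "-C", repo_path, "log", "-1", "--format=%s", sha],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    files = subprocess.run(
        ["git", "-C", repo_path, "diff-tree", "--no-commit-id", "--name-only", "-r", sha],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    return {"query": message, "ground_truth": set(files)}
```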

And so these commit messages serve as the backing for this new type of eval, where now we can run, at scale, this idea of product-led benchmarks. It gets us closer to the ground truth of what our users are actually experiencing and what retrieval tweaks actually mean to the end product.

And so we threw some of the currently publicly available models at this notion of retrieval, this idea of using commit messages. And we found that there is reduced performance. They're unable to reason over code specifically, but also over this kind of real-world notion of English and commits, right?

And so at Codeium, we've been able to actually break through the ceiling. This is something that we've worked very hard at. We had to redefine exactly how we are approaching retrieval in order to be in a class of our own, so that when you are typing in your IDE, when you're chatting with our assistant, when you're generating autocompletes, we're retrieving the things that are most relevant for your intent.

So now the question becomes, how do we actually get this kind of best-in-class retrieval? And so I'm here to give you the very short and sweet answer, which is we throw more compute at it, right? But of course, that can't come with absurd, absurd cost, right? Financial cost. So how do we do this actually in production?

How do we actually do this without incurring an unreasonable cost? And so this goes back to a little bit of Codeium secret sauce, right? We are vertically integrated. And what this means is that we train our own models. So number one, we train our own models. This means that these are custom to our own workflows.

So when you're using our product, you're touching Codeium's models. Number two, we build our own custom infrastructure. This is actually a very important point and connects to the whole Exafunction-to-Codeium pivot that we discussed earlier. Exafunction was an ML infrastructure company. And so what we've been able to do is build our own custom infrastructure down to the metal.

This means that our speed and efficiency are unmatched by any other competitor on the market, so that we can serve more completions at a cheaper cost. And finally, we are product driven, not research driven. Now, what this means is we look at things like actual end user results. When we actually ship a feature, we're looking at real world usage.

And we're always thinking about how this impacts the end user experience, not just some local benchmark tweaking. And so we could spend all day talking about, you know, kind of why Codeium has done this and yada yada, but that's a talk for a different time. So I'm going to talk about something that I find very cool.

And this is the reason why we've taken this vertical integration approach and been able to turn it into something that we call mQuery. So mQuery is this way of taking your query, your retrieval query: you have your code base, and let's just say you have n different items.

And because we own our own infrastructure and train our own models, we're now making parallel calls to an LLM to actually reason over each one of those items. We're not looking at vectors. We're not looking at small dimension space. We're literally taking models and running them on each one of those items. You can imagine, you know, running ChatGPT on an item and telling it to say yes or no, for example.

That is going to give you the highest quality, highest dimension space of reasoning. This leads to very, very high confidence ranking, and we can then take into account things like your active files, your neighboring directories, your most recent commits. You know, what is the ticket that you're working on currently?

And we can compile all this to give you, you know, the top n documents that are relevant for your generation, so that we can start streaming in higher quality generations, higher quality chat messages, things of that nature. And the reason behind this is, again, that vertical integration. It's that idea that our computation is 1/100 of the cost of the competitors.
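Here's a conceptual sketch of that LLM-as-ranker idea: ask a model a yes/no relevance question about every candidate item in parallel, then keep the items judged relevant. This only illustrates the approach; call_llm is a hypothetical client, and Codeium's actual mQuery models and infrastructure aren't shown here.

```python
from concurrent.futures import ThreadPoolExecutor

# Ask the model a yes/no relevance question for a single candidate item.
# call_llm is a hypothetical "prompt in, text out" client.
def judge(call_llm, query: str, item: str) -> bool:
    prompt = (f"Is the following code relevant to the task '{query}'? "
              f"Answer yes or no.\n\n{item}")
    return call_llm(prompt).strip().lower().startswith("yes")

# Fan the question out over every candidate in parallel, then keep the items
# the model judged relevant.
def rank_with_llm(call_llm, query: str, items: list[str], workers: int = 64) -> list[str]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        verdicts = list(pool.map(lambda item: judge(call_llm, query, item), items))
    return [item for item, relevant in zip(items, verdicts) if relevant]
```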

We are not using APIs. And as a result, our customers and our users actually get 100x the amount of compute that they would on another product. And so we're willing to do that. We're willing to spend more compute per user because it leads to a better experience. And so, like I mentioned earlier, I lead our product engineering team.

So we always want to anchor ourselves around these three different things. One, we have to build a performant product. It has to be really fast. For those of you who have used the product, you can probably attest to this. mQuery runs thousands of LLM calls in parallel so that the user can start streaming in code within seconds, not minutes, not hours, seconds, and oftentimes milliseconds.

It has to be powerful, right? None of this matters if the actual quality and the actual generations that you're building are wrong, right? And finally, it has to be easy to use. We're building an end user product for people today that's in the IDE. Tomorrow, it might not be in the IDE.

How do we actually build something that is intuitive to understand that people can grapple with and see exactly what my model is thinking? And so, because we have the benefit of distribution, we were able to roll this out to a small percentage of our users. And by small percentage, we're dealing in the order of, you know, a million plus downloads.

This actually reached a surprising number of people. And what we've been able to see is that we were able to successfully reason over these thousands of files in people's mono repos, in people's remote repos, and select what was relevant, right? We can very accurately deem which files are relevant for the generation that you're trying to have.

And the result, as you can see, this is a real-time GIF, is both fast and accurate. So I'm asking for usage of an alert dialog. It's going through. And I think I've panned down here. This is kind of a shadcn component that I've modified internally. We're pulling in, basically, the source code of what is relevant for our generation.

And ultimately, the results of this experiment were that users were happy. They had more thumbs up on chat messages. They were accepting more generations. And we were able to see that, ultimately, we were writing more code for the user, which is the ultimate goal. It's that idea of how much value we are providing to our end users.

And so we built this context engine, right? This idea of mQuery. This idea of ingesting context and deciding what is relevant to your query to give you coding superpowers. And so today our users are generating autocompletes. They're generating chats, search messages. But in the future, they're going to generate documentation.

They're going to generate commit messages, code reviews, code scanning. They're going to take Figma artboards and convert them into UIs built with your own components. The possibilities are endless. But it starts with this bedrock, this very hard problem of retrieval. And it brings us to, again, one of the reasons why Codeium is approaching this problem a little bit differently.

Our iteration cycle starts with product-driven data and eval. So we're starting with the end problem. We're building a product for millions of people. How do we start with what they're asking for? And how do we build a data set and eval system locally so that we can iterate on the metrics that matter?

Secondly, because we're vertically integrated, we're taking that massive amount of compute, and we're going to throw it at our users. You know, paying or not paying, we're going to throw it at our users so that they can get the best product experience and the highest quality results. And then finally, we're actually going to be able to push this out to our users in real time, overnight, and be able to get a pulse check on how this is going.

You know, this is what we did for mQuery. And when we evaluate in production, we can say, you know, thumbs up, thumbs down, and then hit the drawing board again, back to that same cycle. Repetition. And so you can start seeing how these pieces of compounding technology come together.

Right? We've alluded to some of them today. Modeling, infrastructure, being able to retrieve. But then it also includes things like AST parsing, indexing massive amounts of repos, knowledge graphs, parsing documentation, looking at websites online. The list can go on and on and on. But we're confident that we're solving these problems one piece at a time using that same iteration cycle, that same idea that we're going to take the distribution and knowledge that we have, and that additional compute that we're willing to afford each user to solve each one of these puzzle pieces.

And I want to leave you with a parallel analogy. So in my past life, I worked in the autonomous driving industry. To bring over a metaphor from that industry: in 2015, TechCrunch boldly predicted that that was going to be the year of the self-driving vehicle. That was largely, you know, now we're in 2024, so we can look back in hindsight, largely untrue, right?

We were doing things like sensor fusion. We were decreasing our polling rates. We were running off-board models. All this in the effort of making heuristics that would compensate for the lack of compute that was available because consumer graphics cards were not as popular or not as powerful as they are today.

Fast forward to today, and we're seeing 100x the amount of compute available to a vehicle. You can take a Waymo around San Francisco, which I encourage you to do. It's a wonderful experience. But that means that we're actually able to throw larger models at these problems, right? More sensors, higher frequency.

And now, in 2024, TechCrunch has released another article asking, will 2024 finally be the year of the self-driving vehicle? And we can look at this pattern and say driving performance got substantially better by throwing larger models at the problem and handling more and more data. And so, at Codeium, we believe that this embedding-based retrieval is the heuristic.

We should be planning for AI-first products, throwing large models at these problems so that AI is a first-class citizen. We're planning for the future. And finally, we also believe that ideas are cheap. You know, I could sit up here and tell you all these different ideas about how we're going to transform coding and the theory behind possible solutions.

But what we believe at Codeium is that actually shipping, actually showcasing this technology through a product, is the best way to go. And so, if you agree with these beliefs, you can come join our team. We're based in San Francisco. And you can download our extension. It's free.

Obviously, I'm not advertising the core product nearly as much here; we're mostly talking about the technology. But you can experience this technology firsthand today by downloading our extension. It's available as a plugin for all the different IDEs: VS Code, JetBrains, Vim, Emacs. And you can see how this infrastructure and the way that we've approached product development has shaped the experience for you as a user.

And then, of course, you can reach out to me on Twitter. I put my handle up there. I'll be kind of floating around outside if you have other questions or are interested in what I had to say. But I hope that you learned something today. I hope that, you know, you use Codeium, you try it out, and see what the magic can do for yourself.

Thank you.