
The "Normsky" architecture for AI coding agents — with Beyang Liu + Steve Yegge of SourceGraph


Chapters

0:00 Intros & Backgrounds
6:20 How Steve's work on Grok inspired Sourcegraph for Beyang
8:53 From code search to AI coding assistant
13:18 Comparison of coding assistants and the capabilities of Cody
16:49 The importance of context (RAG) in AI coding tools
20:33 The debate between Chomsky and Norvig approaches in AI
25:02 Code completion vs Agents as the UX
30:06 Normsky: the Norvig + Chomsky models collision
36:00 How to build the right context for coding
42:00 The death of the DSL?
46:15 LSP, SCIP, Kythe, BFG, and all that fun stuff
62:00 The Sourcegraph internal stack
68:46 Building on open source models
74:35 Sourcegraph for engineering managers?
86:00 Lightning Round

Transcript

Hey, everyone. Welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners. And I'm joined by my co-host, Swyx, founder of Smol.ai. Hey, and today we're christening our new podcast studio in the Newton. And we have Beyang and Steve from Sourcegraph. Welcome. Hey, thanks for having us.

So this has been a long time coming. I'm very excited to have you. We also are just celebrating the one-year anniversary of ChatGPT yesterday. But also we'll be talking about the GA of Cody later on today. But we'll just do quick intros of both of you. Obviously, people can research you and check the show notes for more.

But Beyang, you worked in computer vision at Stanford, and then you worked at Palantir. I did, yeah. You also interned at Google, which is-- I did back in the day, where I got to use Steve's system, his dev tool. Right. What was it called? It was called Grok. Well, the end user thing was Google Code Search.

That's what everyone called it, or just like CS. But the brains of it were really the Trigram index and then Grok, which provided the reference graph. Today it's called Kythe, the open source Google one. It's sort of like Grok v3. On your podcast, which you've had me on, you've interviewed a bunch of other code search developers, including the current developer of Kythe, right?

No, we didn't have any Kythe people on, although we would love to if they're up for it. We had Kelly Norton, who built a similar system at Etsy. It's an open source project called Hound. We also had Han-Wen Nienhuys, who created Zoekt, which is-- That's the name I'm thinking about.

--I think heavily inspired by the Trigram index that powered Google's original code search, and that we also now use at Sourcegraph. Yeah. So you teamed up with Quinn over 10 years ago to start Sourcegraph. And I kind of view it like-- we'll talk more about this. You were indexing all code on the internet, and now you're in the perfect spot to create a coding intelligence startup.

Yeah, yeah. I guess the back story was, I used Google Code Search while I was an intern. And then after I left that internship and worked elsewhere, it was the single dev tool that I missed the most. I felt like my job was just a lot more tedious and much more of a hassle without it.

And so when Quinn and I started working together at Palantir, he had also used various code search engines in open source over the years. And it was just a pain point that we both felt, both working on code at Palantir and also working within Palantir's clients, which were a lot of Fortune 500 companies, large financial institutions, folks like that.

And if anything, the pains they felt in dealing with large, complex code bases made our pain points feel small by comparison. And so that was really the impetus for starting Sourcegraph. Yeah, excellent. Steve, you famously worked at Amazon. I did, yep. And revealed-- and you've told many, many stories.

I want every single listener of "Latent Space" to check out Steve's YouTube, because he effectively had a podcast that you didn't tell anyone about or something. You just hit record and just went on a few rants. I'm always here for a Stevie rant. Then you moved to Google, where you also had some interesting thoughts on just the overall Google culture versus Amazon.

You joined Grab as head of Eng for a couple of years. I'm from Singapore, so I have actually personally used a lot of Grab's features. And it was very interesting to see you talk so highly of Grab's engineering and overall prospects, because-- Because as a customer, it sucked. No, it's just like-- no, well, being from a smaller country, you never see anyone from our home country being on a global stage or talked about as a good startup that people admire or look up to, at a level that you, with all your legendary experience, would consider equivalent.

Yeah, no, absolutely. They actually didn't even know that they were as good as they were, in a sense. They started hiring a bunch of people from Silicon Valley to come in and fix it. And we came in and we were like, oh, things could be a little better, operational excellence and stuff.

But by and large, they're really sharp. And the only thing about Grab is that they get criticized a lot for being too Westernized. Oh, by who? By Singaporeans who don't want to work there. OK, well, I guess I'm biased because I'm here, but I don't see that as a problem.

And if anything, they've had their success because they were more Westernized than the standard Singaporean tech company. I mean, they had their success because they are laser-focused. They copied Amazon. I mean, they're executing really, really, really well. For a giant-- I was on a Slack with 2,500 engineers.

It was like this giant waterfall that you could dip your toe into. You'd never catch up with them. Actually, the AI summarizers would have been really helpful there. But yeah, no, I think Grab is successful because they're just out there with their sleeves rolled up, just making it happen.

Yeah. And for those who don't know, it's not just like Uber of Southeast Asia. It's also a super app. PayPal plus. Yeah, in the way that super apps don't exist in the West. It's one of the greatest mysteries, enduring mysteries of B2C, that super apps work in the East and don't work in the West.

Don't understand it. Yeah, it's just kind of curious. They didn't work in India either. And it was primarily because of bandwidth reasons and smaller phones. That should change now. Should. And maybe we'll see a super app here. Yeah. Yeah. You worked on-- you retired-ish? I did, yeah. You worked on your own video game.

Which-- any fun stories about that? Any-- I think-- and that's also where you discover some need for code search, right? Yeah. Sure, a need for a lot of stuff. Better programming languages, better databases, better everything. I mean, I started in '95, where there was kind of nothing. Yeah. I just want to say, I remember when you first went to Grab, because you wrote that blog post, talking about why you were excited about it, about the expanding Asian market.

And our reaction was like, oh, man. Why didn't-- how did we miss stealing it? Hiring you. Yeah, I was like, miss that. Wow, I'm tired. Can we tell that story? So how did this happen? So you were inspired by Grok. Yeah, so I guess the back story, from my point of view, is I had used Code Search and Grok while at Google.

But I didn't actually know that it was connected to you, Steve. Like, I knew you from your blog posts, which were always excellent, kind of like inside, very thoughtful takes on-- from an engineer's perspective, on some of the challenges facing tech companies, and tech culture, and that sort of thing.

But my first introduction to you, within the context of code intelligence and code understanding, was I watched a talk that you gave, I think, at Stanford about Grok when you were first building it. And that was very eye-opening. And I was like, oh, that guy, the guy who writes the extremely thoughtful, ranty blog posts, also built that system.

And so that's how I knew you were kind of involved in that. And then it was kind of like, we always kind of wanted to hire you, but never knew quite how to approach you or get that conversation started. Well, we got introduced by Max, right? Yeah. He's the head of Temporal.

Temporal, yeah. And yeah, I mean, it was a no-brainer. They called me up. And I noticed when Sourcegraph had come out. Of course, when they first came out, I had this dagger of jealousy stabbed through me, piercingly, which I remember, because I am not a jealous person by any means, ever.

But boy, I was like, rah, rah, rah. But I was kind of busy, right? And just one thing led to another. I got sucked back into the ads vortex and whatever. So thank god, Sourcegraph actually kind of rescued me. Here's a chance to build DevTools. Yeah. That's the best.

DevTools are the best. Cool. Well, so that's the overall intro. I guess we can get into Cody. Is there anything else that people should know about you before we get started? I mean, everybody knows I'm a musician. So I can juggle five balls. Five is good. Five is good.

I've only ever managed three. Five's hard. And six, a little bit. Wow. That's impressive. So yeah, to jump into Sourcegraph, this has been a company 10 years in the making. And as Sean said, now you're at the right place. Phase two. Now exactly, you spent 10 years collecting all this code, indexing, making it easy to surface it, and how-- And also learning how to work with enterprises and having them trust you with their code bases.

Because initially, you were only doing on-prem, right, like VPC, a lot of VPC deployments. So in the very early days, we were cloud only. But the first major customers we landed were all on-prem, self-hosted. And that was, I think, related to the nature of the problem that we're solving, which becomes just a critical, unignorable pain point once you're above 100 devs or so.

Yeah. And now Cody is going to be GA by the time this releases. So congrats. Congrats to your future self for launching this in two weeks. Can you give a quick overview of just what Cody is? I think everybody understands that it's an AI coding agent. But a lot of companies say they have an AI coding agent.

So yeah, what does Cody do? How do people interface with it? Yeah, so basically, how is it different from the several dozen other AI coding agents that exist in the market now? I think our take-- when we thought about building a coding assistant that would do things like code generation and question answering about your code base, I think we came at it from the perspective of we've spent the past decade building the world's best code understanding engine for human developers, right?

So it's kind of your guide as a human dev if you want to go and dive into a large, complex code base. And so our intuition was that a lot of the context that we're providing to human developers would also be useful context for AI developers to consume. And so in terms of the feature set, Cody is very similar to a lot of other assistants.

It does inline autocompletion. It does codebase-aware chat. It does specific commands that automate tasks that you'd rather not do, like generating unit tests or adding detailed documentation. But we think the core differentiator is really the quality of the context, which is hard to describe succinctly.

It's a bit like saying, what's the difference between Google and AltaVista? There's not a quick checkbox list of features that you can rattle off. But it really just comes down to all the attention and detail that we've paid to making that context work well and be high quality and fast.

That context we built for human devs, we're now kind of plugging into the AI coding assistant as well. Yeah. I mean, just to add, just to add my own perspective onto what Beyang just described, I'd say RAG is kind of like a consultant that the LLM has available that knows about your code.

RAG provides basically a bridge to a lookup system for the LLM, right? Whereas fine-tuning would be more like on-the-job training for somebody. If the LLM is a person, and you send them to a new job, and you do on-the-job training, that's what fine-tuning is like, right? So tuned to a specific task.

You're always going to need that expert, even if you get the on-the-job training, because the expert knows your particular code base, your task, right? And that expert has to know your code. And there's a chicken-and-egg problem, because we're like, well, I'm going to ask the LLM about my code.

But first, I have to explain it, right? It's this chicken-and-egg problem. That's where RAG comes in. And we have the best consultants, right? The best assistant who knows your code. And so when you sit down with Cody, right? What Beyang said earlier about going to Google and using code search, and then starting to feel like without it, his job was super tedious, yeah?

Once you start using these-- do you guys use coding assistants? Yeah, right? I mean, we're getting to the point very quickly, right? Where you feel like you're kind of like-- almost like you're programming without the internet, right? Or something. It's like you're programming back in the '90s without the coding assistant, yeah?

So hopefully that helps for people who have no idea about coding assistants, what they are. Yeah. And I mean, going back to using them, we had a lot of them on the podcast already. We had Cursor. We had Codium and Codeium, very similar names. Yeah. Griblet, Phind, and then, of course, there's Copilot.

Tabnine. Oh, RIP. No, Kite is the one that died, right? Oh, right. I don't know. I'm starting to get drunk. So you had a Copilot versus Cody blog post. And I think it really shows the context improvement. So you had two examples that stuck with me. One was, what does this application do?

And the Copilot answer was like, oh, it uses JavaScript and NPM and this. And it's like, but that's not what it does. That's what it's built with. Versus Cody was like, oh, these are the major functions and these are the functionalities and things like that. And then the other one was, how do I start this up?

And Copilot just said, npm start, even though there was no start command in the package.json. But, you know, mode collapse, right? Most projects use npm start, so maybe this does too. How do you think about open source models and private-- because Copilot has their own private thing. And I think you guys use StarCoder, if I remember right.

- Yeah, that's correct. I think Copilot uses some variant of Codex. They're kind of cagey about it. I don't think they've officially announced what model they use. - And I think they use a range of models based on what you're doing. - Yeah, so everyone uses a range of models.

No one uses the same model for inline completion versus chat, because the latency requirements for-- - Oh, OK. - Well, there's fill-in-the-middle. There's also what the model's trained on. So we actually had completions powered by Claude Instant for a while. But you had to kind of prompt hack your way to get it to output just the code and not, like, hey, here's the code you asked for, like that sort of text.

So everyone uses a range of models. We've kind of designed Cody to be especially model-- not agnostic, but pluggable. So one of our design considerations was, as the ecosystem evolves, we want to be able to integrate the best-in-class models, whether they're proprietary or open source, into Cody, because the pace of innovation in the space is just so quick.

And I think that's been to our advantage. Like today, Cody uses StarCoder for inline completions. And with the benefit of the context that we provide, we actually show comparable completion acceptance rate metrics. It's kind of like the standard metric that folks use to evaluate inline completion quality. It's like, if I show you a completion, what's the chance that you actually accept the completion versus you reject it?

And so we're at par with Copilot, which is at the head of the industry right now. And we've been able to do that with the StarCoder model, which is open source, and the benefit of the context fetching stuff that we provide. And of course, a lot of like prompt engineering and other stuff along the way.
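For readers unfamiliar with the metric, it is simple to state; here is a toy illustration (the event names are made up and this is not Sourcegraph's actual telemetry schema):

```python
# Toy illustration of completion acceptance rate: of the completions shown to
# the user, what fraction did they accept? (Event names here are made up.)
events = ["shown", "accepted", "shown", "rejected", "shown", "accepted"]

shown = events.count("shown")
accepted = events.count("accepted")
acceptance_rate = accepted / shown if shown else 0.0
print(f"completion acceptance rate: {acceptance_rate:.0%}")  # -> 67%
```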

Yeah. And Steve, you wrote a post called "Cheating is All You Need" about what you're building. And one of the points you made is that everybody's fighting on the same axis, which is better UI in the IDE, maybe like a better chat response. But data moats are kind of the most important thing.

And you guys have like a 10-year-old moat with all the data you've been collecting. How do you kind of think about what other companies are doing wrong, right? Like, why is nobody doing this in terms of like really focusing on RAG? I feel like you see so many people, oh, we just got a new model, and it's like a bit better on HumanEval.

And it's like, wow, but maybe like that's not what we should really be doing, you know? Do you think most people underestimate the importance of like the actual RAG in code? Yeah, I mean, I think that people weren't doing it much. It's kind of at the edges of AI.

It's not in the center. I know that when ChatGPT launched, so within the last year, I've heard a lot of rumblings from inside of Google, right? Because they're undergoing a huge transformation to try to, of course, get into the new world. And I heard that they told a bunch of teams to go and train their own models or fine-tune their own models, both.

And it was a shit show, right? Because nobody knew how to do it. And they launched two coding assistants. One was called Codey, with an E-Y. And then there was-- I don't know what happened in that one. And then there's Duet, right? Google loves to compete with themselves, right?

They do this all the time. And they had a paper on Duet, like, from a year ago. And they were doing exactly what Copilot was doing, which was just pulling in the local context, right? But fundamentally, I thought of this because we were talking about the splitting of the models.

In the early days, it was the LLM did everything. And then we realized that for certain use cases, like completions, that a different, smaller, faster model would be better. And that fragmentation of models, actually, we expected to continue and proliferate, right? Because fundamentally, we're a recommender engine right now.

We're recommending code to the LLM. We're saying, may I interest you in this code right here so that you can answer my question? And being good at recommender engine-- I mean, who are the best recommenders, right? There's YouTube, and Spotify, and Amazon, or whatever, right? Yeah, and they all have many, many, many, many, many models, right?

All fine-tuned for very specific-- and that's where we're headed in code, too, absolutely. Yeah, we just did an episode we released on Wednesday, in which we said RAG is like RecSys for LLMs. You're basically just suggesting good content. It's like, what? Recommendation systems. Oh, got it. Yeah, yeah, yeah.

RecSys. Yeah. So the naive implementation of RAG is you embed everything into a vector database. You embed your query, and then you find the nearest neighbors, and that's your RAG. But actually, you need to rank it. And actually, you need to make sure there's sample diversity and that kind of stuff.

And then you're slowly gradient-descending yourself towards rediscovering proper RecSys, which has been traditional ML for a long time, but approaching it from an LLM perspective. Yeah, I almost think of it as a generalized search problem, because it's a lot of the same things. You want your layer 1 to have high recall and get all the potential things that could be relevant, and then there's typically a layer 2 re-ranking mechanism that bumps up the precision, tries to get the relevant stuff to the top of the results list.
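To make that two-layer framing concrete, here is a minimal sketch of recall-then-rerank retrieval. The scoring functions are placeholders (for example, BM25 or embedding similarity for layer 1 and a heavier re-ranker for layer 2); this is not Sourcegraph's implementation.

```python
# Two-stage retrieval: a cheap, high-recall first pass, then an expensive
# re-ranker for precision. Both scorers are hypothetical placeholders.
from typing import Callable

def retrieve(query: str,
             corpus: list[str],
             recall_score: Callable[[str, str], float],   # cheap, high-recall scorer
             rerank_score: Callable[[str, str], float],   # expensive, high-precision scorer
             k_recall: int = 100,
             k_final: int = 10) -> list[str]:
    # Layer 1: cast a wide net so nothing relevant is missed (recall).
    candidates = sorted(corpus,
                        key=lambda doc: recall_score(query, doc),
                        reverse=True)[:k_recall]
    # Layer 2: re-rank the shortlist so the best items end up on top (precision).
    return sorted(candidates,
                  key=lambda doc: rerank_score(query, doc),
                  reverse=True)[:k_final]
```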

Have you discovered that ranking matters a lot? So the context is that I think a lot of research shows that one, context utilization matters based on model. GPT uses the top of the context window, and then apparently, Claude uses the bottom better. And it's lossy in the middle. So ranking matters.

No, it really does. The skill with which models are able to take advantage of context is always going to be dependent on how that factors into the impact on the training loss. So if you want long context window models to work well, then you have to have a ton of data where it's like, here's a billion lines of text, and I'm going to ask a question about something that's embedded deeply into it, and give me the right answer.

And unless you have that training set, then of course you're going to have variability in terms of where it attends to. And in most naturally occurring data, the thing that you're talking about right now, the thing I'm asking about, is going to be something that we talked about recently.

Did you really just say gradient descending yourself? Actually, I love that it's entered the casual lexicon. My favorite version of that is how you have to p-hack papers. So when you throw humans at the problem, that's called graduate student descent. That's great. Yeah, it's really awesome. I think the other interesting thing that you have is inline-assist UX that is, I wouldn't say async, but it works while you can also do work.

So you can ask Cody to make changes on a code block, and you can still edit the same file at the same time. How do you see that in the future? Do you see a lot of Codys running together at the same time? How do you validate also that they're not messing each other up as they make changes in the code?

And maybe what are the limitations today, and what do you think about where the tech is going? I want to start with a little history, and then I'm going to turn it over to Beyang. So we actually had this feature in the very first launch back in June. Dominic wrote it.

It was called Nonstop Cody. And you could have multiple basically LLM requests in parallel modifying your source file. And he wrote a bunch of code to handle all of the diffing logic, and you could see the regions of code that the LLM was going to change. And he was showing me demos of it.

And it just felt like it was just a little before its time. But a bunch of that stuff, that scaffolding was able to be reused for where inline's sitting today. How would you characterize it today? Yeah, so that interface has really evolved from a like, hey, general purpose, like request anything inline in the code and have the code update, to really like targeted features like fix the bug that exists at this line, or request a very specific change.

And the reason for that is, I think the challenge that we ran into with inline fixes-- and we do want to get to the point where you could just fire and forget, and have half a dozen of these running in parallel. But I think we ran into the challenge early on that a lot of people are running into now when they're trying to construct agents, which is the reliability of working code generation is just not quite there yet in today's language models.

And so that kind of constrains you to an interaction where the human is always like in the inner loop, like checking the output of each response. And if you want that to work in a way where you can be asynchronous, you kind of have to constrain it to a domain where today's language models can generate reliable code well enough.

So generating unit tests, that's like a well-constrained problem, or fixing a bug that shows up as a compiler error or a test error, that's a well-constrained problem. But the more general, like, hey, write me this class that does x, y, and z using the libraries that I have, that is not quite there yet, even with the benefit of really good context.

It definitely moves the needle a lot, but we're not quite there yet to the point where you can just fire and forget. And I actually think that this is something that people don't broadly appreciate yet, because I think that everyone's chasing this dream of agentic execution. And if we're to really define that down, I think it implies a couple of things.

You have a multi-step process where each step is fully automated, where you don't have to have a human in the loop every time. And there's also kind of like an LLM call at each stage, or nearly every stage in that chain. And based on all the work that we've done with the inline interactions, with kind of like general Cody features for implementing longer chains of thought, we're actually a little bit more bearish than the average AI hypefluencer out there on the feasibility of agents with purely kind of like transformer-based models.

To your original question, like the inline interactions with Cody, we've actually constrained it to be more targeted, like fix the current error or make this quick fix. I think that that does differentiate us from a lot of the other tools on the market, because a lot of people are going after this snazzy inline edit interaction, whereas I think where we've moved-- and this is based on the user feedback that we've gotten-- it's like that sort of thing, it demos well, but when you're actually coding day-to-day, you don't want to have a long chat conversation inline with the code base.

That's a waste of time. You'd rather just have it write the right thing and then move on with your life or not have to think about it. And that's what we're trying to work towards. I mean, yeah, we're not going in the agent direction. I mean, I'll believe in agents when somebody shows me one that works.

Instead, we're working on sort of solidifying our strength, which is bringing the right context in. So new context sources, ways for you to plug in your own context, ways for you to control or influence the context, the mixing that happens before the request goes out, et cetera. And there's just so much low-hanging fruit left in that space that agents seems like a little bit of a boondoggle.

Just to dive into that a little bit further, I think at a very high level, what do people mean when they say agents? They really mean greater automation, fully automated. The dream is, here's an issue. Go implement that. And I don't have to think about it as a human.

And I think we are working towards that. That is the eventual goal. I think it's specifically the approach of, hey, can we have a transformer-based LLM alone be the backbone or the orchestrator of these agentic flows? We're a little bit more bearish today. You want a human in the loop.

I mean, you kind of have to. It's just a reality of the behavior of language models that are purely transformer-based. And I think that's just a reflection of reality. And I don't think people realize that yet. Because if you look at the way that a lot of other AI tools have implemented context fetching, for instance, you see this in the Copilot approach, where if you use the @workspace thing that supposedly provides codebase-level context, it has an agentic approach, where you kind of look at how it's behaving.

And it feels like they're making multiple requests to the LLM, being like, what would you do in this case? Would you search for stuff? What sort of files would you gather? Go and read those files. And it's a multi-hop step, so it takes a long while. It's also non-deterministic.

Because any sort of LLM invocation, it's like a dice roll. And then at the end of the day, the context it fetches is not that good. Whereas our approach is just like, OK, let's do some code searches that make sense, and then maybe crawl through the reference graph a little bit.

That is fast. That doesn't require any sort of LLM invocation at all. And we can pull in much better context very quickly. So it's faster, it's more reliable, it's deterministic, and it yields better context quality. And so that's what we think. We just don't think you should cargo cult or naively go, agents are the future, let's just try to implement agents on top of the LLMs that exist today.

I think there are a couple of other technologies or approaches that need to be refined first before we can get into these multi-stage, fully automated workflows. We're very much focused on developer inner loop right now. But you do see things eventually moving towards developer outer loop. So would you basically say that they're tackling the agents problem that you don't want to tackle?

No, I would say at a high level, we are after maybe like the same high level problem, which is like, hey, I want some code written. I want to develop some software. And can an automated system go build that software for me? I think the approaches might be different.

So I think the analogy in my mind is, think about the AI chess players. Coding in some senses, it's similar and dissimilar to chess. I think one question I ask is, do you think producing code is more difficult than playing chess or less difficult than playing chess? More? I think more.

And if you look at the best AI chess players, yes, you can use an LLM to play chess. People have showed demos where it's like, oh, yeah, GPT-4 is actually a pretty decent chess move suggester. But you would never build a best-in-class chess player off of GPT-4 alone. The way that people design chess players is you have a search space.

And then you have a way to explore that search space efficiently. There's a bunch of search algorithms, essentially, where you're doing tree search in various ways. And you can have heuristic functions, which might be powered by an LLM. You might use an LLM to generate proposals in that space that you can efficiently explore.

But the backbone is still this more formalized tree search based approach rather than the LLM itself. And so I think my high level intuition is that the way that we get to this more reliable multi-step workflows that can do things beyond generate unit test, it's really going to be like a search-based approach, where you use an LLM as kind of like an advisor or a proposal function, sort of your heuristic function, like the A* search algorithm.
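As a rough illustration of that idea, here is a purely illustrative sketch: a classical best-first search supplies the backbone, and the LLM only appears behind the propose and score callables (both hypothetical stand-ins, for example "suggest candidate edits" and "estimate how promising this state is"), not as the orchestrator.

```python
import heapq

def best_first_search(start_state, is_goal, propose, score, max_expansions=100):
    """Classical best-first search with the LLM hidden behind two callables:
    propose(state) -> candidate next states (e.g., LLM-suggested edits);
    score(state)   -> heuristic estimate of how promising a state is."""
    counter = 0                                   # tie-breaker so states are never compared
    frontier = [(-score(start_state), counter, start_state)]
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, state = heapq.heappop(frontier)     # most promising state first
        if is_goal(state):                        # e.g., tests pass / code compiles
            return state
        for nxt in propose(state):
            counter += 1
            heapq.heappush(frontier, (-score(nxt), counter, nxt))
    return None                                   # gave up within the budget
```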

But it's probably not going to be the thing that is the backbone. Because I guess it's not the right tool for that. Yeah, yeah. You also have-- I can see yourself thinking through this, but not saying the words, the philosophical Peter Norvig type discussion. Maybe you want to introduce that divide in software.

Yeah, definitely. Your listeners are savvy. They're probably familiar with the classic Chomsky versus Norvig debate. No, actually, I was prompting you to introduce that. Oh, got it. So if you look at the history of artificial intelligence, it goes way back to-- I don't know, it's probably as old as modern computers, like '50s, '60s, '70s.

People are debating on what is the path to producing a general human level of intelligence. And two schools of thought emerged. One is the Norvig school of thought, which, roughly speaking, includes large language models, regression, SVMs. Basically, any model that you learn from data and that is data-driven, machine learning-- most of machine learning would fall under this umbrella.

And that school of thought says, just learn from the data. That's the approach to reaching intelligence. And then the Chomsky approach is more things like compilers, and parsers, and formal systems. So basically, let's think very carefully about how to construct a formal, precise system. And that will be the approach to how we build a truly intelligent system.

Lisp, for instance, was originally an attempt to-- I think Lisp was invented so that you could create rules-based systems that you would call AI. As a language, yeah. Yeah, and for a long time, there was this debate. There were certain AI research labs that were more in the Chomsky camp, and others that were more in the Norvig camp.

And it's a debate that rages on today. And I feel like the consensus right now is that Norvig definitely has the upper hand right now with the advent of LLMs, and diffusion models, and all the other recent progress in machine learning. But the Chomsky-based stuff is still really useful, in my view.

I mean, it's like parsers, compilers. Basically, a lot of the stuff that provides really good context, it provides kind of like the knowledge graph backbone that you want to explore with your AI dev tool. That will come from Chomsky-based tools, like compilers and parsers. It's a lot of what we've invested in the past decade at Sourcegraph, and what you built with Grok.

Basically, these formal systems that construct these very precise knowledge graphs that are great context providers, and great guardrails enforcers, and safety checkers for the output of a more data-driven, fuzzier system that uses like the Norvig-based models. Beyang was talking about this stuff like it happened in the Middle Ages.

Basically, it's like, OK, so when I was in college, I was in college learning Lisp, and Prolog, and Planning, and all the deterministic Chomsky approaches to AI. And I was there when Norvig basically declared it dead. I was there 3,000 years ago when Norvig and Chomsky fought on the volcano.

When did he declare it dead? What do you mean he declared it dead? Late '90s, yeah, when I went to Google, Peter Norvig was already there. And he had basically like-- I forget exactly where. He's got so many famous short posts, amazing things. He had a famous talk, "The Unreasonable Effectiveness of Data." Yeah, maybe that was it.

But at some point, basically, he basically convinced everybody that the deterministic approaches had failed, and that heuristic-based, data-driven, statistical approaches, stochastic were better. The primary reason-- I can tell you this because I was there-- was that-- --was that, well, the steam-powered engine-- no. The reason was that the deterministic stuff didn't scale, right?

They were using Prolog, man, constraint systems and stuff like that. Well, that was a long time ago, right? Today, actually, these Chomsky-style systems do scale. And that's, in fact, exactly what Sourcegraph has built. And so we have a very unique-- I love the framing that Beyang's made, the marriage of the Chomsky and the Norvig models, conceptual models, because we have both of them.

And they're both really important. And, in fact, there's this really interesting overlap between them, where the AI or our graph or our search engine could potentially provide the right context for any given query, which is, of course, why ranking is important. But what we've really signed ourselves up for is an extraordinary amount of testing, yeah?

Because, like you were saying, Swyx, you were saying that GPT-4 attends to the front of the context window, and maybe other LLMs to the back, and maybe all the way to the middle. Yeah, and so that means that if we're actually verifying whether some change we've made has improved things, we're going to have to test putting it at the beginning of the window and at the end of the window, and maybe make the right decision based on the LLM that you've chosen.

Which some of our competitors, that's a problem that they don't have. But we meet you where you are, yeah? And just to finish, we're writing thousands, tens of thousands. We're generating tests, fill-in-the-middle type tests and things, and then using our graph to basically fine-tune Cody's behavior there, yeah?
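For readers curious what those tests look like in spirit, here is a toy harness (hypothetical; model_answer stands in for an actual LLM call) that plants a known fact at different positions in the context and checks whether the model can still answer from it:

```python
def build_prompt(filler_snippets: list[str], fact: str, position: float) -> str:
    """Place `fact` at a relative position (0.0 = start, 1.0 = end) among filler text."""
    idx = int(position * len(filler_snippets))
    parts = filler_snippets[:idx] + [fact] + filler_snippets[idx:]
    return "\n\n".join(parts)

def position_sweep(filler_snippets, fact, question, expected_answer, model_answer):
    """model_answer(prompt) -> str is a stand-in for an actual LLM call."""
    results = {}
    for position in (0.0, 0.25, 0.5, 0.75, 1.0):
        prompt = build_prompt(filler_snippets, fact, position) + "\n\n" + question
        results[position] = expected_answer in model_answer(prompt)
    return results  # e.g., {0.0: True, 0.5: False, 1.0: True} -> "lossy in the middle"
```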

Yeah. I also want to add, I have an internal pet name for this hybrid architecture that I'm trying to make catch on. Maybe I'll just say it here. Saying it publicly makes it more real. But I call the architecture that we've developed the Normsky architecture. And it's kind of like-- I mean, it's obviously a portmanteau of Norvig and Chomsky, but the acronym, it stands for non-agentic, rapid, multi-source code intelligence.

So non-agentic, because-- Rolls right off the tongue. Wow! And Normsky. Yeah. Yeah. But it's non-agentic in the sense that we're not trying to pitch you on agent hype, right? The things it does really just use developer tools that developers have been using for decades now, like parsers and really good search indexes and things like that.

Rapid, because we place an emphasis on speed. We don't want to sit there waiting for multiple LLM requests to return to complete a simple user request. Multi-source, because we're thinking broadly about what pieces of information and knowledge are useful context. So obviously starting with things that you can search in your code base, and then you add in the reference graph, which kind of allows you to crawl outward from those initial results.

But then even beyond that, sources of information, like there's a lot of knowledge that's embedded in docs, in PRDs, or product specs, in your production logging system, in your chat, in your Slack channel, right? Like there's so much context that's embedded there. And when you're a human developer and you're trying to be productive in your code base, you're going to go to all these different systems to collect the context that you need to figure out what code you need to write.

And I don't think the AI developer will be any different. It will need to pull context from all these different sources. So we're thinking broadly about how to integrate these into Cody. We hope through kind of like an open protocol that others can extend and implement. And this is something else that should be, I guess, like accessible by December 14th in kind of like a preview stage.

But that's really about like broadening this notion of the code graph beyond your Git repository to all the other sources where technical knowledge and valuable context can live. Yeah, it becomes an artifact graph, right? It can link into your logs and your wikis and any data source, right? How do you guys think about the importance of-- it's almost like data pre-processing in a way, which is bring it all together, tie it together, make it ready.

Yeah, any thoughts on how to actually make that good, what some of the innovation you guys have made? We talk a lot about the context fetching, right? I mean, there's a lot of ways you could answer this question. But we've spent a lot of time just in this podcast here talking about context fetching.

But stuffing the context into the window is also the bin packing problem, right? Because the window is not big enough and you've got more context than you can fit. You've got a ranker maybe. But what is that context? Is it a function that was returned by an embedding or a graph call or something?

Do you need the whole function? Or do you just need the top part of the function, this expression here, right? So that art, the golf game of trying to get each piece of context down into its smallest state, possibly even summarized by another model before it even goes to the LLM, becomes the game that we're in, yeah?
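Here is a greedy sketch of that packing game under a token budget (assumptions: the snippets arrive already ranked, and count_tokens and the optional summarize helper are hypothetical stand-ins for a tokenizer and a summarization step):

```python
def pack_context(ranked_snippets: list[str],
                 budget_tokens: int,
                 count_tokens,          # str -> int, e.g., a tokenizer's length function
                 summarize=None):       # (str, int) -> str, optional shrinking fallback
    packed, used = [], 0
    for snippet in ranked_snippets:     # highest-ranked context first
        cost = count_tokens(snippet)
        if used + cost <= budget_tokens:
            packed.append(snippet)
            used += cost
        elif summarize is not None:     # shrink instead of dropping outright
            short = summarize(snippet, budget_tokens - used)
            if short and used + count_tokens(short) <= budget_tokens:
                packed.append(short)
                used += count_tokens(short)
    return packed
```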

And so recursive summarization and all the other techniques that you've got to use to stuff stuff into that context window become critically important. And you have to test them across every configuration of models that you could possibly need. I think data preprocessing is probably the unsexy, way underappreciated secret to a lot of the cool stuff that people are shipping today, whether you're doing like RAG or fine tuning or pre-training.

The preprocessing step matters so much because it is basically garbage in, garbage out, right? Like if you're feeding in garbage to the model, then it's going to output garbage. Concretely, for code RAG, if you're not doing some sort of preprocessing that takes advantage of a parser and is able to extract the key components of a particular file of code, separate the function signature from the body, from the doc string, what are you even doing?

That's like table stakes. And it opens up so much more possibilities with which you can tune your system to take advantage of the signals that come from those different parts of the code. We've had a tool since computers were invented that understands the structure of source code to 100% precision.

The compiler knows everything there is to know about the code in terms of structure. Why would you not want to use that in a system that's trying to generate code, answer questions about code? You shouldn't throw that out the window just because now we have really good data-driven models that can do other things.
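As a small illustration of the kind of parser-driven splitting being described, here is a sketch using Python's built-in ast module to separate a function's signature, docstring, and body. Sourcegraph's actual indexers use their own parsers and support many languages; this only shows the idea.

```python
import ast
import textwrap

source = textwrap.dedent('''
    def moving_average(xs: list[float], window: int) -> list[float]:
        """Return the simple moving average of xs."""
        out = []
        for i in range(window - 1, len(xs)):
            out.append(sum(xs[i - window + 1 : i + 1]) / window)
        return out
''')

for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.FunctionDef):
        # Reassemble the signature from the parsed pieces.
        returns = f" -> {ast.unparse(node.returns)}" if node.returns else ""
        signature = f"def {node.name}({ast.unparse(node.args)}){returns}"
        docstring = ast.get_docstring(node) or ""
        body_nodes = node.body[1:] if docstring else node.body   # drop leading docstring
        body = "\n".join(ast.unparse(stmt) for stmt in body_nodes)
        print("SIGNATURE:", signature)
        print("DOCSTRING:", docstring)
        print("BODY:")
        print(body)
```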

Yeah. When I called it a data moat in my cheating post, a lot of people were confused about-- because data moat sort of sounds like data lake because there's data and water and stuff. I don't know. And so they thought that we were sitting on this giant mountain of data that we had collected.

But that's not what our data moat is. It's really a data preprocessing engine that can very quickly and scalably basically dissect your entire code base into very small, fine-grained semantic units and then serve it up. And so it's really-- it's not a data moat. It's a data preprocessing moat, I guess.

Yeah, if anything, we're hypersensitive to customer data privacy requirements. So it's not like we've taken a bunch of private data and trained a generally available model. In fact, exactly the opposite. A lot of our customers are choosing Cody over Copilot and other competitors because we have an explicit guarantee that we don't do any of that.

And we've done that from day one. Yeah. I think that's a very real concern in today's day and age. Because if your proprietary IP finds its way into the training set of any model, it's very easy both to extract that knowledge from the model and also use it to build systems that work on top of the institutional knowledge that you've built up.

About a year ago, I wrote a post on LLMs for developers. And one of the points I had was maybe the death of the DSL. I spent most of my career writing Ruby. And I love Ruby. It's so nice to use. It's not as performant, but it's really easy to read.

And then you look at other languages, maybe they're faster, but they're more verbose. And when you think about efficiency of the context window, that actually matters. But I haven't really seen a DSL for models. I haven't seen code being optimized to be easier to put in a model context.

And it seems like your pre-processing is kind of doing that. Do you see in the future the way we think about DSL and APIs and service interfaces be more focused on being context-friendly? Whereas maybe it's harder to read for the human, but the human is never going to write it anyway.

We were talking on the "Hacks" podcast. There are some data science things, like spin-up the spandex. Humans are never going to write again, because the models can just do very easily. Yeah, curious to hear your thoughts. Well, so DSLs, they involve writing a grammar and a parser. And they're like little languages, right?

And we do them that way because we need them to compile, and humans need to be able to read them, and so on. The LLMs don't need that level of structure. You can throw any pile of crap at them, more or less unstructured, and they'll deal with it. So I think that's why a DSL hasn't emerged for communicating with the LLM or packaging up the context or anything.

Maybe it will at some point, right? We've got tagging of context and things like that that are sort of peeking into DSL territory, right? But your point on do users, do people have to learn DSLs, like regular expressions, or pick your favorite, right? XPath. I think you're absolutely right that the LLMs are really, really good at that.

And I think you're going to see a lot less of people having to slave away learning these things. They just have to know the broad capabilities, and then the LLM will take care of the rest. Yeah, I'd agree with that. I think we will see kind of like a revisiting of-- basically, the value profit of DSL is that it makes it easier to work with a lower level language, but at the expense of introducing an abstraction layer.

And in many cases today, without the benefit of AI code generation, that's totally worth it, right? With the benefit of AI code generation, I mean, I don't think all DSLs will go away. I think there's still places where that trade-off is going to be worthwhile. But it's kind of like, how much of source code do you think is going to be generated through natural language prompting in the future?

Because in a way, any programming language is just a DSL on top of assembly, right? And so if people can do that, then yeah. Maybe for a large portion of the code that's written, people don't actually have to understand the DSL that is Ruby, or Python, or basically any other programming language that exists today.

I mean, seriously, do you guys ever write SQL queries now without using a model of some sort? At least at JavaScript. Ever? Yeah, right? And so we have kind of passed that bridge, right? Yeah, I think to me, the long-term thing is like, is there ever going to be-- you don't actually see the code.

It's like, hey-- the basic thing is like, hey, I need a function to sum two numbers. And that's it. I don't need you to generate the code. And the follow-on question, do you need the engineer or the paycheck? I mean, right? That's kind of the agents discussion in a way: you can't fully automate it with agents yet, but slowly you're getting more of the atomic units of the work done.

I kind of think of it as like, do you need a punch card operator to answer that for you? And so I think we're still going to have people in the role of a software engineer, but the portion of time they spend on these kind of low-level, tedious tasks versus the higher-level, more creative tasks is going to shift.

No, I haven't used punch cards. It looks over here. Yeah. Yeah, I've been talking about-- so we've kind of made this podcast about the sort of rise of the AI engineer. And the first step is the AI-enhanced engineer that is that software developer that is no longer doing these routine boilerplate-y type tasks, because they're just enhanced by tools like yours.

So you mentioned OpenCodeGraph. I mean, that is a kind of DSL, maybe. And because we're releasing this as you go GA, you hope for other people to take advantage of that? Oh, yeah. I would say-- so OpenCodeGraph is not a DSL. It's more of a protocol. It's basically like, hey, if you want to make your system, whether it's chat, or logging, or whatever, accessible to an AI developer tool like Cody, here is kind of like the schema by which you can provide that context and offer hints.

So for comparisons, LSP obviously did this for kind of like standard code intelligence. It's kind of like a lingua franca for providing find references and go-to-definition. There's kind of like analogs to that. There might be also analogs to kind of the original OpenAI plugins API, where it's like, hey, there's all this context out there that might be useful for an LLM-based system to consume.
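To make that a bit more tangible, here is a purely hypothetical sketch of what a context-provider response in such a protocol might look like. The field names and URLs below are invented for illustration; they are not the actual OpenCodeGraph schema.

```python
# Invented field names, for illustration only -- not the actual OpenCodeGraph schema.
example_provider_response = {
    "provider": "issue-tracker",
    "items": [
        {
            "title": "BUG-1234: login retries loop forever",
            "url": "https://issues.example.com/BUG-1234",
            "content": "Retry logic in auth/session.go never backs off ...",
            "hints": {"kind": "issue", "relevance": "mentions auth/session.go"},
        }
    ],
}
```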

And so at a high level, what we're trying to do is define a common language for context providers to provide context to other tools in the software development lifecycle. Yeah. Do you have any critiques of LSP, by the way, since this is very much very close to home? One of the authors wrote a really good critique recently.

Yeah. Oh, LSP? I don't think I saw that. Yeah, yeah. How LSP could have been better. It just came out a couple of weeks ago. It was a good article. Yeah. I don't know if I-- I think LSP is great for what it did for the developer ecosystem. It's absolutely fantastic.

Nowadays, it's very easy-- it's much easier now to get code navigation up and running in-- A bunch of editors. --in a bunch of editors by speaking this protocol. I think maybe the interesting question is looking at the different design decisions made, comparing LSP basically with Kythe. Because Kythe has more of a-- I don't know, how would you describe it?

A storage format. I think the critique of LSP from a Kythe point of view would be, with LSP, you don't actually have an actual model, a symbolic model, of the code. It's not like LSP models, hey, this function calls this other function. LSP is all range-based. Like, hey, your token is at line 32-- your cursor is at line 32, column 1.

And that's the thing you feed into the language server. And then it's like, OK, here's the range that you should jump to if you click on that range. So it kind of is intentionally ignorant of the fact that there's a thing called a reference underneath your cursor, and that's linked to a symbol definition.
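To illustrate the contrast being drawn, here are two simplified data shapes side by side: an LSP-style definition lookup, which speaks only in positions and ranges, versus a schematic symbol-graph edge in the spirit of Kythe. The LSP shapes follow the real protocol in spirit (0-based lines and characters); the graph edge and the symbol names are invented for illustration, not Kythe's actual wire format.

```python
# LSP: the request and response are expressed purely in file positions and ranges;
# there is no notion of symbol identity in the exchange.
lsp_definition_request = {
    "method": "textDocument/definition",
    "params": {
        "textDocument": {"uri": "file:///src/auth/session.go"},
        "position": {"line": 31, "character": 12},    # just a cursor location
    },
}
lsp_definition_response = {
    "uri": "file:///src/auth/token.go",
    "range": {"start": {"line": 7, "character": 5},
              "end": {"line": 7, "character": 17}},
}

# A symbolic model instead records named nodes and explicit edges between them
# (schematic only; invented symbol names).
symbolic_graph_edge = {
    "edge": "ref",
    "source": {"symbol": "example.com/auth/session#refreshSession", "kind": "reference"},
    "target": {"symbol": "example.com/auth/token#RefreshToken", "kind": "function"},
}
```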

Well, actually, that's the worst example you could have used. You're right, but that's the one thing that it actually did bake in, is following references. Sure. But it's sort of hardwired. Yeah. Whereas Kythe attempts to model all these things explicitly. And so-- Well, so LSP's a protocol, right? And so Google's internal protocol is gRPC-based.

And it's a different approach than LSP. Basically, you make a heavy query to the back end, and you get a lot of data back, and then you render the whole page. So we've looked at LSP, and we think that it's just-- it's a little long in the tooth, right?

I mean, it's a great protocol, lots and lots of support for it. But we need to push into the domain of exposing the intelligence through the protocol. Yeah. And so I would say, I mean, we've developed a protocol of our own called Skip, which is, I think, at a very high level, trying to take some of the good ideas from LSP and from Kithe, and merge that into a system that, in the near term, is useful for SourceGraph, but I think in the long term, we hope it will be useful for the ecosystem.

And I would say, OK, so here's what LSP did well. LSP, by virtue of being intentionally dumb-- "dumb" in air quotes, because I'm not ragging on it-- Yeah. But what it allowed it to do is it allowed language server developers to kind of bypass the hard problem of modeling language semantics precisely.

So if all you want to do is jump to definition, you don't have to come up with a universally unique naming scheme for each symbol, which is actually quite challenging. Because you have to think about, OK, what's the top scope of this name? Is it the source code repository?

Is it the package? Does it depend on what package server you're fetching this from, whether it's the public one or the one inside your-- anyways, naming is hard, right? And by just going from a location-to-location-based approach, you basically just throw that out the window. All I care about is jumping to definition.

Just make that work, and you can make that work without having to deal with all the complex global naming things. The limitation of that approach is that it's harder to build on top of that to build a true knowledge graph. If you actually want a system that says, OK, here's the web of functions, and here's how they reference each other.

And I want to incorporate that semantic model of how the code operates, or how the code relates to each other at a static level, you can't do that with LSP, because you have to deal with line ranges. And concretely, the pain point that we found in using LSP for Sourcegraph is, in order to do a find references and then jump to definition, it's like a multi-hop process, because you have to jump to the range, and then you find the symbol at that range.

And it just adds a lot of latency and complexity to these operations. Whereas as a human, you're like, well, this thing clearly references this other thing. Why can't you just jump me to that? And I think that's the thing that Kythe does well. But then I think the issue that Kythe has had with adoption is, because it's a more sophisticated schema, I think.

And so there's basically more things that you have to implement to get a Kythe implementation up and running. I hope I'm not like-- correct me if I'm wrong about any of this. 100%. Kythe also has the problem-- all these systems have the problem, even SCIP, or at least the way that we implemented the indexers, that they have to integrate with your build system in order to build that knowledge graph, because you have to basically compile the code in a special mode to generate artifacts instead of binaries.

And I would say-- by the way, earlier I was saying that xrefs were in LSP, but it's actually-- I was thinking of LSP plus LSIF. Ugh, LSIF. That's another-- Which is actually bad. We can say that's bad, right? LSIF was not good. It's like SCIP or Kythe. It's supposed to be sort of a model, a serialization format for the code graph.

But it basically just does what LSP needs, the bare minimum. LSIF is basically if you took LSP and turned that into a serialization format. So you build an index for language servers to kind of quickly bootstrap from cold start. But it's a graph model with all of the inconvenience of the API without an actual graph.

And so, yeah, it's not great. So one of the things that we try to do with SCIP is try to capture the best of both worlds. So make it easy to write an indexer, make the schema simple, but also model some of the more symbolic characteristics of the code that would allow us to essentially construct this knowledge graph that we can then make useful for both the human developer through Sourcegraph and the AI developer through Cody.

So anyway, just to finish off the graph comment: we've got a new graph that's SCIP-based. We call it BFG internally, right? Beautiful something graph. Big friendly graph. Big friendly graph. It's a blazing fast-- Blazing fast. Blazingly fast graph. And it is blazing fast, actually. It's really, really interesting.

I should probably do a blog post about it to walk you through exactly how they're doing it. Oh, please. But it's a very AI-like, iterative, experimentation sort of approach, where we're building a code graph based on all of our 10 years of knowledge about building code graphs.

But we're building it quickly with zero configuration, and it doesn't have to integrate with your build system through some magic tricks that we have. And so it just happens when you install the plug-in that it'll be there and indexing your code and providing that knowledge graph in the background without all that build system integration.

This is a bit of secret sauce that we haven't really-- I don't know, we haven't advertised it very much lately. But I am super excited about it, because what they do is they say, all right, let's tackle function parameters today. Cody's not doing a very good job of completing function call arguments or function parameters in the definition, right?

Yeah, we generate those thousands of tests. And then we can actually reuse those tests for the AI context as well. So fortunately, things are kind of converging. We have half a dozen really, really good context sources. And we mix them all together. So anyway, BFG, you're going to hear more about it probably, I would say, probably over the holidays?

Yeah, I think it'll be online for December 14th. We'll probably mention it. BFG is probably not the public name we're going to go with. I think we might call it Graph Context or something like that. We're officially calling it BFG. You're going to hear first. BFG is just kind of like the working name.

And it's interesting. So the impetus for BFG was, if you look at current AI inline code completion tools and the errors that they make, a lot of the errors that they make, even in kind of the easy single line case, are essentially type errors, right? You're trying to complete a function call.

And it suggests a variable that you define earlier, but that variable is the wrong type. And that's the sort of thing where it's like, well, like a first year freshman CS student would not make that error, right? So why does the AI make that error? And the reason is, I mean, the AI is just suggesting things that are plausible without the context of the types or any other broader files in the code.

And so the kind of intuition here is, why don't we just do the basic thing that any baseline intelligent human developer would do, which is click jump to definition, click some find references, and pull in that graph context into the context window, and then have it generate the completion.
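A sketch of that intuition follows. The helper callables symbols_near and definition_of are hypothetical stand-ins for the graph lookups being described, not Cody's actual APIs; the point is just that definition context gets prepended to the completion prompt.

```python
def build_completion_prompt(file_text: str,
                            cursor_offset: int,
                            symbols_near,            # (file_text, offset) -> iterable of symbol names
                            definition_of) -> str:   # symbol name -> its definition source, or None
    """Prepend definitions of nearby symbols so the model can see their types."""
    context_blocks = []
    for symbol in symbols_near(file_text, cursor_offset):
        definition = definition_of(symbol)           # "jump to definition" as a context source
        if definition:
            context_blocks.append(f"# definition of {symbol}\n{definition}")
    prefix = file_text[:cursor_offset]               # code up to the cursor
    return "\n\n".join(context_blocks) + "\n\n" + prefix
```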

So that's sort of like the MVP of what BFG was. And it turns out that works really well. You can eliminate a lot of type errors that AI coding tools make just by pulling in that context. Yeah, but the graph is definitely our Chomsky side. Yeah, exactly. So this Chomsky-Norvig thing, I think, pops up in a bunch of different layers.

And I think it's just a very useful and also kind of nicely nerdy way to describe the system that we're trying to build. By the way, I remember the point I was trying to make earlier to your question, Alessio, about, is AI going to replace programmers? And I was talking about how compilers-- they thought, oh, are compilers going to replace programming?

And what it did was it just changed kind of what programmers have to focus on. And I think AI is just going to level us up again. So programmers are still going to be building stuff until agents come along, but I don't believe that's coming soon. And so, yeah. Yeah, to be clear, again, with the agent stuff at a high level, I think we will get there.

I think that's still the kind of long-term target. And I think also with Cody, it's like, you can have Cody draft up an execution plan. It's just not going to be the sort of thing where you don't have to attend to what it's doing. Like, we think that with Cody, it's like, you ask Cody, like, hey, I have this bug.

Help me solve it. It would do a reasonable job of fetching context and saying, here are the files you should modify. And if you prompt it further, it can actually suggest code changes to make to those files. And that's a very nice way to resolve issues, because you're kind of on the rails for most of the time, but then now and then you have to intervene as a human.

I just think that if we're trying to get to complete automation, where it's like the sort of thing where a non-software engineer, someone who has no technical expertise, can just speak a non-trivial feature into existence, that is still, I think, several key innovations away from happening right now. And I don't think the pure transformer-based LLM orchestrator model of agents that is kind of dominant today is going to get us there.

FRANCESC CAMPOY: Yeah. Just what you're talking about triggered a thread I've been working on for a little bit, which is we're very much reacting to developments in models on a month-to-month basis. You had a post about, we're going to need a bigger moat, which is a great Jaws reference for those who didn't catch it.

About how quickly-- MARK MANDEL: I forgot all about that. FRANCESC CAMPOY: --how quickly models are evolving. But I think if you kind of look out, I actually caught Sam Altman on the podcast yesterday talking about GPT-10. MARK MANDEL: Ooh, wow. Things are accelerating. FRANCESC CAMPOY: And actually, there's a pretty good cadence from GPT-2, 3, and 4 that you can-- if you project out.

So 4 is based on George Hotz's concept of 20 petaflops being a human's worth of compute. GPT-4 took about 100 years in terms of human years to train, in terms of the amount of compute. So that's one living person. And every generation of GPT increases two orders of magnitude.

So 5 is 100 people. And if you just project it out, 9 is every human on Earth, and 10 is every human ever. And he thinks he'll reach there by the end of the decade. MARK MANDEL: George Hotz does? FRANCESC CAMPOY: No, Sam Altman. MARK MANDEL: Oh, Sam Altman, OK.
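
Writing out that back-of-envelope math (this is purely the projection being described in the conversation, not a forecast of ours):

```typescript
// Assumptions from the conversation: GPT-4 ~ 100 human-years of compute (one lifetime
// at ~20 petaflops per "human"), and each GPT generation is ~2 orders of magnitude more.
const gpt4HumanYears = 100;
for (let gen = 4; gen <= 10; gen++) {
  const humanYears = gpt4HumanYears * Math.pow(100, gen - 4);
  console.log(`GPT-${gen}: ~${humanYears.toExponential(0)} human-years of compute`);
}
// GPT-9 comes out around 10^12 human-years (roughly "every human on Earth" territory),
// and GPT-10 around 10^14, the "every human ever" scale mentioned above.
```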

FRANCESC CAMPOY: Yeah. So I just like setting those high-level-- you have dots on the line. We're at the start of the curve with Moore's law. Gordon Moore, I think, thought it would last 10 years. And he just kept drawing for another 50. And I think we have all these data points.

And we're just trying to extrapolate the curve out to where this goes. So all I'm saying is this agent stuff that we dismissed might come here by 2030. And I don't know how you plan when things are not possible today. And you're like, it's not worth doing. But we're going to be here in 2030.

And what do we do then? MARK MANDEL: So is the question like-- FRANCESC CAMPOY: There's no question. It's just me sharing a comment, because at the back of my head, anytime we hear things like things are not practical today, I'm just like, all right, but how do we-- MARK MANDEL: So here's a question, maybe.

I get the whole scaling argument. I do think that there will be something like a Moore's law for AI inference. I mean, definitely, I think, at the hardware level, like GPUs. I think it gets a little fuzzier the higher you move up in the stack. But for instance, going back to the chess analogy, at what point do we think that GPT-X or whatever, a pure transformer-based LLM model will be state of the art or outperform the best chess-playing algorithm today?

Because I think that is one milestone on-- FRANCESC CAMPOY: Where you completely overlap search and symbolic models. MARK MANDEL: Yeah, exactly, because I think that would be-- I mean, just to put my cards on the table, I think that would kind of disprove the thesis that I just stated, which is kind of like the pure transformer, just scale the transformer-based approach.

That would be a proof point where like, hey, maybe that is the right approach, versus, oh, we actually have to take a step back and think-- you get what I'm saying, right? Is the transformer going to be the end-all-be-all of architectures, and it's just a matter of scaling it?

Or are there other algorithms, and the transformer is going to be just one piece of a system of intelligence that will have to take advantage of many other algorithms and approaches? FRANCESC CAMPOY: Yeah, we shall see. Maybe John Carmack will find it. MARK MANDEL: Yeah.

FRANCESC CAMPOY: All right, sorry for that digression. I'm just very curious. So one thing I did actually want to check in on, because we talked a little bit about code graphs and reference graphs and all that. Do you actually use a graph database? No, right? MARK MANDEL: No. FRANCESC CAMPOY: Isn't it weird?

MARK MANDEL: Well, I mean, why would you need a graph database? FRANCESC CAMPOY: We use Postgres. And yeah, I saw a paper actually right after I joined Sourcegraph. There was some joint study between IBM and some other company that basically showed that Postgres was performing as well as most of the graph databases for most graph workloads.

MARK MANDEL: Wow. In V0 of Sourcegraph, we're like, we're building a code graph. Let's use a graph database. I won't name the database, because I mean, it was like 10 years ago. So they're probably much better now. But we basically tried to dump a non-trivially sized data set, but also not the whole universe of code, right?

It was a relatively small data set compared to what we're indexing now into the database. And we let it run for a week. And I think it segfaulted or something. And we're like, OK, let's try another approach. Let's just put everything in Postgres. And these days, the graph data, I mean, it's partially in Postgres.

It's partially just-- I mean, you could store them as flat files. FRANCESC CAMPOY: Yeah. I mean, at the end of the day, all the databases just need to get me the data I want and answer the queries that I need, right? If all your queries are single hops in this-- MARK MANDEL: Which they will be if you denormalize for other use cases.

FRANCESC CAMPOY: Yeah, exactly. MARK MANDEL: Interesting. FRANCESC CAMPOY: So, yeah. MARK MANDEL: Seventh normal form is just a bunch of files. FRANCESC CAMPOY: Yeah, yeah. And I don't know, I feel like there's a bunch of stuff like that, where it's like, if you look past the marketing and think about the actual query load, or the traffic patterns, or the end user use cases you need to serve, just go with the tried and true, dumb, classic tools over the new-agey stuff.
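
To make the single-hop point concrete, here's a hedged sketch (the table schema is invented for illustration; the `pg` client usage is standard node-postgres): once references are denormalized into one table, "find references" is a plain indexed lookup rather than a graph traversal.

```typescript
import { Client } from "pg";

// Hypothetical denormalized table: code_references(symbol, file, line), indexed on symbol.
async function findReferences(symbol: string): Promise<{ file: string; line: number }[]> {
  const db = new Client(); // connection settings come from the usual PG* env vars
  await db.connect();
  // One hop: symbol -> reference sites. No traversal, no graph database required.
  const res = await db.query(
    "SELECT file, line FROM code_references WHERE symbol = $1 LIMIT 100",
    [symbol]
  );
  await db.end();
  return res.rows;
}
```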

MARK MANDEL: Choose boring technology, yeah. FRANCESC CAMPOY: I mean, there's a bunch of stuff like that in the search domain, too, especially right now, with embeddings, and vector search, and all that. But classic search techniques still go very far. And I don't know, I think in the next year or two maybe, as we get past the peak AI hype, we'll start to see the gap emerge, or become more obvious to more people about how many of the newfangled techniques actually work in practice, and yield a better product experience day to day.

MARK MANDEL: Yeah. So speaking of which, obviously there's a bunch of other people trying to build AI tooling. What can you say about your AI stack? Obviously, you build a lot of proprietary stuff in-house, but what approaches do you take? So prompt engineering, do you have a prompt engineering management tool?

Pre-processing orchestration, do you use Airflow? Do you use something else? That kind of stuff. FRANCESC CAMPOY: Yeah. Ours is very duct-taped together at the moment. So in terms of stack, it's essentially Go and TypeScript, and now Rust. There's the knowledge graph, the code knowledge graph that we built, which is using indexers, many of which are open source, that speak the SCIP protocol.

And we have the code search back end. Traditionally, we supported regular expression search and string literal search with a trigram index. And we're also building more fuzzy search on top of that now, kind of like natural language or keyword-based search on top of that. And we use a variety of open source and proprietary models.
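
For readers unfamiliar with the trigram trick, here's a toy sketch of the general idea (not Sourcegraph's implementation): index every three-character substring of each document, then only scan documents that contain all the trigrams of the query.

```typescript
function trigrams(s: string): Set<string> {
  const out = new Set<string>();
  for (let i = 0; i + 3 <= s.length; i++) out.add(s.slice(i, i + 3));
  return out;
}

// trigram -> set of document ids containing it
function buildIndex(docs: Map<string, string>): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const [id, text] of docs) {
    for (const t of trigrams(text)) {
      if (!index.has(t)) index.set(t, new Set());
      index.get(t)!.add(id);
    }
  }
  return index;
}

// A document can only match a literal query if it contains every trigram of the query;
// the much smaller candidate set is then scanned exactly (or checked against a regex).
function candidates(index: Map<string, Set<string>>, query: string): Set<string> {
  let result: Set<string> | null = null;
  for (const t of trigrams(query)) {
    const withT = index.get(t) ?? new Set<string>();
    result = result === null ? new Set(withT) : new Set([...result].filter(id => withT.has(id)));
  }
  return result ?? new Set();
}
```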

We try to be pluggable with respect to different models, so we can easily swap the latest model in and out as they come online. I'm just hunting for, is there anything out there that you're like, these guys are really good. Everyone should check them out. So for example, you talked about recursive summarization, which is something that LangChain and LlamaIndex do.

I presume you wrote your own. I presume-- Yeah, we wrote our own. I think the stuff that LlamaIndex and LangChain are doing are super interesting. I think, from our point of view, it's like we're still in the application end user use case discovery phase. And so adopting an external infrastructure or middleware tool just seems overly constraining right now.

We need full control. Yeah, we need full control, because we need to be able to iterate rapidly up and down the stack. But maybe at some point, there'll be a convergence, and we can actually merge some of our stuff into theirs and turn that into a common resource. In terms of other vendors that we use, I mean, obviously, nothing but good things to say about Anthropic and OpenAI, which we both kind of partner with and use.

Also, plug for Fireworks as an inference platform. Their team was kind of like ex-Meta people who basically know all the bag of tricks for making inference fast. I met Lin. So she was-- Lin is great. She was with Soumith. She was the co-manager of PyTorch for five years. Yeah, yeah, yeah.

But is their main thing just that they do the fastest inference on Earth? Is that what it is? I think that's the pitch. And it keeps getting faster somehow. We run StarCoder on top of Fireworks. And that's made it so that we just don't have to think about building up an inference stack.

And so that's great for us, because it allows us to focus more on the data fetching, the knowledge graph, and model fine-tuning, which we've also invested a bit in. That's right. We've got multiple AI workstreams in progress now, because we hired a head of AI, finally. We spent close to a year, actually.

I talked to probably 75 candidates. And the guy we hired, Rishabh, is absolutely world-class. And he immediately started multiple workstreams. He's fine-tuned StarCoder already. He's got a prompt engineering workstream. He's got the embeddings workstream. He's got evaluation and experimentation. Benchmarking-- wouldn't it be nice if Cody was on Hugging Face with a benchmark, so anybody could say, well, we'll run against the benchmark, or we'll make our own benchmark if we don't like yours.

But we'll be forcing people into the quantitative comparisons. And that's all happening under the AI program that he's building for us. Yeah. I should mention, by the way, I've heard that there's a v2 of StarCoder coming out. So you guys should talk to Hugging Face. Cool. Awesome. Great. I actually visited their offices in Paris, which is where I heard it.

That's awesome. Can you guys believe how amazing it is that the open source models are competitive with GPT and Anthropic? I mean, it's nuts, right? I mean, that one Googler that was predicting that open source would catch up, at least he was right for completions. Yeah, I mean, for completions, open source is state of the art right now.

You were on OpenAI, then you went to Claude, and now you've shifted again. Yeah, for completions. We still use Claude and GPT-4 for chat and also commands. But the ecosystem is going to continue to evolve. We obviously love the open source ecosystem. And a huge shout out to Hugging Face.

And also Meta Research, we love the work that they're doing in kind of driving the ecosystem forward. Yeah, you didn't mention Code Llama. We're not using Code Llama currently. It's always kind of like a constant evaluation process. I don't want to come out and say, hey, this model's the best because we chose it.

It's basically like we did a bunch of tests for the sorts of context that we're fetching now and given the way that our prompt's constructed now. And at the end of the day, it was like a judgment call. Like, StarCoder seemed to work the best, and that's why we adopted it.

But it's sort of like a continual process of revisitation. Like, if someone comes up with a neat new context fetching mechanism-- and we have a couple coming online soon-- then it's always like, OK, let's try that against the kind of array of models that are available and see how this moves the needle across that set.

Yeah. What do you wish someone else built? What did we have to build that we wish we could have used? Is that the question? Interesting. This is a request for startups. I mean, if someone could just provide like a very nice, clean data set of both naturally occurring and synthetic code data out there.

Yeah, could someone please give us their data moat? Well, not even the data moat. It's just like, I feel like most models today, they still use a combination of the Stack and the Pile as their training corpus. But you can only stretch that so far. At some point, we need more data.

And I don't know. I think there's still more alpha in synthetic data. We have a couple efforts where we think fine-tuning some models on specific coding tasks will yield alpha, will yield more kind of like reliable code generation of the sort where it's reliable enough that we can fully automate it, at least like the one hop thing.

And synthetic data is playing a part of that. But I mean, if there were like a synthetic data provider-- I don't think you could construct a provider that has access to some proprietary code base. No company in the world would be able to sell that to you. But anyone who's just providing clean data sets off of the publicly available data, that would be nice.

I don't know if there's a business around that. But that's something that we'd definitely love to use. Oh, for sure. My god. I mean, but that's also like the secret weapon, right? For any AI, it's the data that you've curated. So I doubt people are going to be, oh, we'll see.

But we can maybe contribute if we want to have a benchmark of our own. Yeah. Yeah. I would say that would be the bull case for Repl.it, that you want to be a coding platform where you also offer bounties. And then you eventually bootstrap your own proprietary set of coding data.

I don't think they'll ever share it. And the rumor is-- this is from nobody at Repl.it that I'm hearing. But also, they're just not leveraging that actively. They're actually just betting on OpenAI to do a lot of that, which banking on OpenAI, I think, has been a winning strategy so far.

Yeah, they're definitely great at executing and-- Executing their CEO. Ooh. And then bringing him back in four days. Yeah. He won. That was a whole, like, I don't know. Did you guys-- yeah, was the company just obsessed by the drama? We were unable to work. I just walked in after it happened.

And this whole room in the new room was just like, everyone's just staring at their phones. I mean, it's a bit difficult to ignore. I mean, it would have real implications for us, too. Because we're using them. And so there's a very real question of, do we have to do a quick-- Yeah, did you-- yeah, Microsoft.

You just moved to Microsoft, right? Yeah, I mean, that would have been the break glass plan. If the worst case played out, then I think we'd have a lot of customers the day after being like, how can you guarantee the reliability of your services if the company itself isn't stable?

But I'm really happy they got things sorted out and things are stable now. Because they build really cool stuff, and we love using their tech. Yeah, awesome. So we kind of went through everything, right? Sourcegraph, Cody, why agents don't work, why inline completion is better, all of these things.

How does that bubble up to who manages the people, right? Because as an engineering manager, I never-- I didn't write much code. I was mostly helping people write their own code. So even if you have the best inline completion, it doesn't help me do my job. What's kind of the future of Sourcegraph in the engineering org?

Yeah, so that's a really interesting question. And I think it sort of gets at this issue, which is I think basically every AI dev tools creator or producer these days, I think us included, we're kind of focusing on the wrong problem in a way. Because the real problem of modern software development, I think, is not how quickly can you write more lines of code.

It's really about managing the emergent complexity of code bases as they evolve and grow, and how to make efficient development tractable again. Because the bulk of your time becomes more about understanding how the system works and how the pieces fit together currently so that you can update it in a way that gets you your added functionality, doesn't break anything, and doesn't introduce a lot of additional complexity that will slow you down in the future.

And if anything, the inner loop developer tools that are all about generating lines of code, yes, they help you get your feature done faster. They generate a lot of boilerplate for you. But they might make this problem of managing large complex code bases more challenging. Just because now, instead of having a pistol, you'll have a machine gun in terms of being able to write code.

And there's going to be a bunch of natural language prompted code that is generated in the future that was produced by someone who doesn't even have an understanding of source code. And so how are you going to verify the quality of that and make sure it not only checks the low-level boxes, but also fits architecturally into your code base in a way that's sensible?

And so I think as we look forward to the future of the next year, we have a lot of ideas around how to make code bases, as they evolve, more understandable and manageable to the people who really care about the code base as a whole-- tech leads, engineering leaders, folks like that.

And it is kind of like a return to our ultimate mission at Sourcegraph, which is to make code accessible to all. It's not really about enabling people to write code. And if anything, the original version of Sourcegraph was a rejection of, hey, let's stop trying to build the next best editor, because there's already enough people doing that.

The real problem that we're facing-- I mean, Quinn, myself, and you, Steve, at Google-- was how do we make sense of the code that exists so we can understand enough to know what code needs to be written? Yeah. Well, I'll tell you what customers want-- what they're going to get.

What they want is for Cody to have a monitor for developer productivity. And any developer who falls below a threshold, a button lights up where the admin can fire them. Or Cody will even press that button for you as the time passes. But I'm kind of only half tongue-in-cheek here.

We've got some prospects who are kind of sniffing down that avenue. And we're like, no. But what they're going to get is much-- like Beyang was saying-- much greater whole codebase understanding, which is actually something that Cody is, I would argue, the best at today in the coding assistant space, right, because of our search engine and the techniques that we're using.

And that whole codebase understanding is so important for any sort of a manager who just wants to get a feel for the architecture or potential security vulnerabilities or whether people are writing code that's well-tested and et cetera, et cetera, right? And solving that problem is tricky, right? This is not the developer inner loop or outer loop.

It's like the manager inner loop? No, outer loop. The manager inner loop is staring at your belly button, I guess. So in any case-- Waiting for the next Slack message to arrive? Yes. What they really want is a batch mode for these assistants where you can actually take the coding assistant and shove its face into your code base.

And 6 billion lines of code later, it's told you all the security vulnerabilities. That's what they really actually want. It's an insanely expensive proposition, right? You know, just the GPU cost, especially if you're doing it on a regular basis. So it's better to do it at the point the code enters the system.

And so now we're starting to get into developer outer loop stuff. And I think that's where a lot of the-- to your question, right? A lot of the admins and managers and the decision makers, anybody who just kind of isn't coding but is involved, they're going to have, I think, well, a set of tools, right?

And a set of-- just like with code search today. Our code search actually serves that audience as well, the CIO types, right? Because they're just like, oh, hey, I want to see how we do SAML auth. And they use our search engine and they go find it. And AI is just going to make that so much easier for them.

Yeah, I have a-- this is my perfect place to put my anecdote of how I used Cody yesterday. I was actually trying to build this Twitter scraper thing. And Twitter is notoriously very challenging to work with because they don't want to work with anyone. And there's a repo that I wanted to inspect.

It was a really big repo that had the Twitter scraper thing in it. And I pulled it into Copilot, didn't work. But then I noticed that on your landing page, you had a web version. Like, I typically think of Cody as a VS Code extension. But you have a web version where you just plug in any repo in there and just talk to it.

And that's what I used to figure it out. Wow, Cody web is wild. Yeah. I mean, we've done a very poor job of making the existence of that feature-- It's not easy to find. It's not easy to find. The search thing is like, oh, this is old Sourcegraph.

You don't want to look at old Sourcegraph. But you can use Sourcegraph for all the AI stuff. Old Sourcegraph has AI stuff, and it's Cody web. Yeah, there's a little Ask Cody button that's hidden in the upper right hand corner. We should make that more visible. It's definitely one of those aha moments when you can ask a question of-- Of any repo, right?

Because you already indexed it. Well, you didn't embed it, but you indexed it. And there's actually some use cases that have emerged among power users where they kind of do-- like, you're familiar with v0.dev. You can kind of replicate that, but for arbitrary frameworks and libraries, with Cody web.

Because there's also an equally hidden toggle, which you may not have discovered yet, where you can actually tag in multiple repositories as context. And so you can do things like-- we have a demo path where it's like, OK, let's say you want to build a stock ticker that's React-based, but uses this one tick data fetching API.

It's like, you tag both repositories in. You ask it-- it's like two sentences. Like, build a stock ticker app. Track the tick data of Bank of America and Wells Fargo over the past week. And it generates the code. You can paste that in. And it works magically. We'll probably invest in that more, just because the wow factor of that is just pretty incredible.

It's like, what if you can speak apps into existence that use the frameworks and packages that you want to use? It's not even fine-tuning. It's just taking advantage of your RAG pipeline. Yeah, it's just RAG. RAG is all you need for many things. It's not just RAG. It's good RAG, right? RAG done well, not as a fallback.

RAG's good, not a fallback. Yeah, but I guess getting back to the original question, I think there's a couple of things I think would be interesting for engineering leaders. One is the use case that you called out, is all the stuff that you currently don't do that you really ought to be doing with respect to, like, ensuring code quality, or updating dependencies, or keeping things up to date, the things that humans find toilsome and tedious and just don't want to do, but would really help uplevel the quality, security, and robustness of your code base.

Now we potentially have a way to do that with machines. I think there's also this other thing, and this gets back to the point of, how do you measure developer productivity? It's like the perennial age-old question. Every CFO in the world would love to do it in the same way that you can measure marketing, or sales, or other parts of the organization.

And I think, what is the actual way you would do this that is good, if you had all the time in the world? I think, as an engineering manager or an engineering leader, what you would do is you would go read through the Git log, maybe like line by line.

Be like, OK, you, Sean, these are the features that you built over the past six months or a year. These are the things that you delivered, that you helped drive. Here's the stuff that you did to help your teammates. Here are the reviews that you did that helped ensure that we have maintained a coherent and high-quality code base.

Now connect that to the things that matter to the business. Like, what were we trying to drive with this? Was it engagement? Was it revenue? Was it adoption of some new product line? And really weave that story together. The work that you did had this impact on the metrics that moved the needle for the business and ultimately shows up in revenue, or stock price, or whatever it is that's at the very top of any for-profit organization.

And you could, in theory, do all that today if you had all the time in the world. But as an engineering leader-- You're too busy building. Yeah, you're too busy building. You're too busy with a bunch of other stuff. Plus, it's also tedious, like reading through the Git log and trying to understand what a change does and summarizing that.

Yeah. It's just-- it's not the most exciting work in the world. But with the benefit of AI, I think you could conceive of a system that actually does a lot of the tedium and helps you actually tell that story. And I think that is maybe the ultimate answer to how we get at developer productivity in a way that a CFO would be like, OK, I can buy that.
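
Sketching that out (hedged: `summarize` is a stand-in for whatever model call you use; this is just the recursive-summarization pattern mentioned earlier applied to `git log`):

```typescript
import { execSync } from "node:child_process";

// Pull one author's git log, summarize it in chunks, then summarize the summaries
// into a short narrative an engineering leader could connect to business metrics.
async function narrateAuthor(
  author: string,
  since: string,
  summarize: (prompt: string) => Promise<string> // assumed LLM hook, not a real API
): Promise<string> {
  const log = execSync(
    `git log --author="${author}" --since="${since}" --stat --no-color`,
    { encoding: "utf8", maxBuffer: 64 * 1024 * 1024 }
  );
  const chunks = log.match(/[\s\S]{1,12000}/g) ?? []; // keep each piece inside the context window
  const partials: string[] = [];
  for (const chunk of chunks) {
    partials.push(await summarize(`Summarize the features, fixes, and reviews in this git log:\n${chunk}`));
  }
  return summarize(
    `Combine these into a short narrative of what ${author} delivered and why it mattered:\n${partials.join("\n")}`
  );
}
```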

The work that you did impacted these core metrics because these features were tied to those. And therefore, we can afford to invest more in this part of the organization. And that's what we really want to drive towards. I think that's what we've been trying to build all along, in a way, with Sourcegraph.

It's this codebase level of understanding. And the availability of LLMs and AI now just puts that much sooner in reach, I think. Yeah. But I mean, we have to focus-- we're also a small company. And so our short-term focus is lovability, right? Yeah. We absolutely have to make Cody something that everybody wants, right?

But absolutely, Sourcegraph is all about enabling all of the non-engineering roles, decision makers, and so on. And as Beyang says, I mean, I think there's just a lot of opportunity there once we've built a lovable Cody. Awesome. We want to jump into lightning round? Lightning round. OK. Which we always forget to send the questions ahead of time.

So we usually have three, one around acceleration, exploration, and then a final takeaway. So the acceleration one is, what's something that already happened in AI that is possible today that you thought would take much longer? I mean, just LLMs and how good the vision models are now. Like, I got my start-- Oh, vision.

OK. Yeah. Well, I mean, back in the day, I got my start machine learning in computer vision, but circa 2009, 2010. And in those days, everything was statistical-based. Neural nets had not yet made their comeback. And so nothing really worked. And so I was very bearish after that experience on the future of computer vision.

But man, the progress that's been made just in the past three or four years has just been absolutely astounding. So yeah, it came up faster than I expected it to. Yeah, multimodal in general, I think there's a lot more capability there that we're not tapping into, potentially even in the coding assistant space.

And honestly, I think that the form factor that coding assistants have today is probably not the steady state that we're seeing long-term. I mean, you'll always have completions, and you'll always have chat, and commands, and so on. But I think we're going to discover a lot more. And I think multimodal potentially opens up some kind of new ways to get your stuff done.

So yeah, I think the capabilities are there today. And it's just shocking. I mean, I still am astonished. When I sit down, and I have a conversation with the LLM with the context, and it's like I'm talking to a senior engineer, or an architect, or somebody. And I can bounce ideas off it.

And I think that people have very different working models with these assistants today. Some people are just completion, completion, completion. That's it. And if they want some code generated, they write a comment telling it what to do. But I truly think that there are other modalities that we're going to stumble across, that are just kind of latently, inherently built into the LLMs today.

We just haven't found them yet. They're more of a discovery than invention. Like other usage patterns? Absolutely. I mean, the one we talked about earlier, nonstop coding is one, where you could just kick off a whole bunch of requests to refactor, and so on. But there could be any number of others.

We talk about agents, that's kind of out there. But I think there are kind of more inner loop type ones to be found. And we haven't looked at multimodal at all yet. Yeah. For sure, there's two that come to mind, just off the top of my head. One, which is effectively architecture diagrams and entity relationship diagrams.

There's probably more alpha in synthesizing them for management to see, which is, you don't need AI for that. You can just use your reference graph. But then also doing it the other way around, when someone draws stuff on a whiteboard and actually generating code. Well, you can generate the diagram, and then explanations, as well.

Yeah. And then the other one is, there was a demo that went pretty viral two, three weeks ago, about how someone just had an always-on script, just screenshotting and sending it to GPT-4 Vision on some kind of time interval. And it would just autonomously suggest stuff. Yeah. So no trigger, just watching your screen, and just being a real co-pilot, rather than having you initiate with the chat.

Yeah. So there's some-- It's like the return of Clippy, right? Return of Clippy. But actually good. So the reason I know this is we actually did a hackathon, where we wrote that project, but it roasted you while you did it, so it's like, hey, you're on Twitter right now.

You should be coding. And that can be a fun co-pilot thing, as well. Yeah. OK, so I'll jump on. Exploration, what do you think is the most interesting unsolved question in AI? It used to be scaling, right, with CNNs and RNNs, and Transformer solved that. So what's the next big hurdle that's keeping GPT-10 from emerging?

I mean, do you mean that like-- Ooh, this is like a safetyist argument. I feel like-- do you mean like the pure model, like AI layer? No, it doesn't have to be-- I mean, for me personally, it's like, how do you get reliable first try working code generation? Even like a single hop, like write a function that does this.

Because I think if you want to get to the point where you can actually be truly agentic or multi-step automated, a necessary part of that is the single step has to be robust and reliable. And so I think that's the problem that we're focused on solving right now. Because once you have that, it's a building block that you can then compose into longer chains.
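
A tiny sketch of what "robust single step as a building block" can mean in practice (the `generate` and `runTests` hooks are assumptions, not any particular product's API): verify each generation before you let a chain depend on it.

```typescript
type Generate = (task: string, feedback?: string) => Promise<string>;
type RunTests = (code: string) => Promise<{ ok: boolean; errors: string }>;

// One "hop": generate, verify, and retry with the failure output as feedback.
async function reliableStep(task: string, generate: Generate, runTests: RunTests, maxAttempts = 3): Promise<string> {
  let feedback: string | undefined;
  for (let i = 0; i < maxAttempts; i++) {
    const code = await generate(task, feedback);
    const result = await runTests(code);
    if (result.ok) return code; // only verified code leaves this step
    feedback = result.errors;   // feed the failure back into the next attempt
  }
  throw new Error(`Could not produce verified code for: ${task}`);
}

// Longer chains only make sense once each individual hop is this trustworthy.
async function pipeline(tasks: string[], generate: Generate, runTests: RunTests): Promise<string[]> {
  const out: string[] = [];
  for (const t of tasks) out.push(await reliableStep(t, generate, runTests));
  return out;
}
```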

And just to wrap things up, what's one message, takeaway that you want people to remember and think about? I mean, I think for me it's just like the best DevTools in the future are going to have to leverage many different forms of intelligence. Calling back to that Normsky architecture, trying to make it catch on.

You should call it something cool like S* or R*. Yes, yes, yes. Just one letter and then just let people speculate. Yeah, yeah, what could he mean? But I don't know, like in terms of trying to describe what we're building, we try to be a little bit more down to earth and straightforward.

And I think Normsky encapsulates the two big technology areas that we're investing in that we think will be very important for producing really good DevTools. And I think it's a big differentiator that we view that Cody has right now. Yeah, and mine would be I know for a fact that not all developers today are using coding assistants.

And that's probably because they tried it and it didn't immediately write a bunch of beautiful code for them. And they were like, ah, too much effort, and they left. Well, my big takeaway from this talk would be if you're one of those engineers, you better start planning another career.

Because this stuff is in the future. And honestly, it takes some effort to actually make coding assistants work today. You have to-- just like talking to GPT, they'll give you the runaround, just like doing a Google search sometimes. But if you're not putting that effort in and learning the sort of footprint and the characteristics of how LLMs behave under different query conditions and so on, if you're not getting a feel for the coding assistant, then you're letting this whole train just pull out of the station and leave you behind.

Yeah. Cool. Absolutely. Yeah, thank you guys so much for coming on and being the first guest in the new studio. Our pleasure. Thanks for having us. (upbeat music)