It's great to be here. I'm Josh Albrecht, the CTO of Imbue, and our focus is on making more robust, useful AI agents, in particular focusing on software agents right now. The main product that we're working on today is called Sculptor.

The purpose of Sculptor is to help with something that we've all experienced. We've all tried these vibe coding tools: you tell one to go off and do something, it goes off and creates a bunch of code for you, and voila, you're done, right? Well, not quite. At least today, there's a big gap between the stuff that comes back and what you want to ship to production, especially as you move past prototyping into larger, more established code bases.

So today I'm going to go over some of the technical decisions that went into the design of Sculptor, our experimental coding agent environment, and walk through the context and motivations for the various ideas that we've explored and the features that we've implemented. It's still a research preview, so these features may change before we actually release it, but whether you're an individual using these tools or someone who's developing the tools yourself, I hope you'll find the learnings from our experiments useful.

If you're thinking about how you can make coding agents better, there are a million different things that you could build. You could build something that helps improve performance on really large context windows, something that makes agents cheaper or faster, something that does a better job of parsing the outputs. But I don't think we should be building any of these things. What we really want to be building is things that are much more specific to the use case, to the problem domain, to the thing that you are really specialized in. Most of the things I just mentioned are going to get solved over the next, call it three to twelve to twenty-four months, as models get better, coding agents get better, and so on. Just like you wouldn't want to make your own database, I don't think we want to spend a lot of time working on problems that are going to get solved anyway. Instead, we want to focus on the particular part of the problem that really matters for us and for our business.

At Imbue, the problem that we're focusing on is basically this: what is wrong with this diff? You get a coding agent's output and it tells you, okay, I've added 59 new lines. Are those good? Right now you have an awkward choice between looking at each of the lines yourself, or just hitting merge and hoping for the best, and neither of those is a great place to be. So we try to give you a third option. The goal is to help build user trust by allowing another AI system to come take a look and understand: hey, are there any race conditions? Did you leave your API key in there? And so on. We want to think about how to leverage AI tools not just to generate the code, but to help us build trust in that code. The way that we think about it is identifying problems with the code, because if there are no problems, that's probably high quality code.
And that's kind of the definition of high quality code, if you think about it from an academic perspective: the way people normally measure software quality is by looking at the number of defects, how long it takes to fix a particular defect, or how many defects are caught by a particular technique. So this is the definition that we're working from when we're thinking about making high quality software.

If we think about the software development process, what you want is to get to a place where you've identified these problems as early as possible. Sculptor does not work as a pull request review tool, because that's much, much later in the process. Rather, we want something that's synchronous and immediate, giving you feedback as soon as you've generated that code, as soon as you've changed that line. You want to know right away if something is wrong with it; that's easier both for you to fix and for the agent to fix.

So what are some ways that you can prevent problems in AI generated code? We're going to go through four: learning, planning, writing specs, and having a really strict style guide. We'll see how those manifest in Sculptor.

The first thing you want to do when you're using coding agents, if you're trying to prevent problems, is learn what's out there. We try to make this as easy as possible in Sculptor by letting you ask questions, have the agent do research, and get answers about what technologies exist and how other people have solved similar problems, so that you don't end up reproducing a bunch of work that's already out there.

Next, we want to think about how we can encourage people to start by planning. Here's a little example workflow: you kick off the agent to go do something simple, like implement this solver, and you change the system prompt to force the AI agent to first make a plan without writing any code at all. Then you wait a little while, it generates the plan, and you go change the system prompt again to say, okay, now we can actually create some code. So we make it really easy to change these kinds of meta-parameters of the coding agent itself. Of course, you can just tell the agent to do that, but by changing its system prompt you force it in a much stronger way to change its behavior, and you can build up larger workflows by making customized agents: always plan first, then always do the code, then always run the checks, and so on. (A minimal sketch of this prompt-swapping pattern appears below.)

Third, you want to think about writing specs and docs as a first-class part of the workflow. One of the main reasons why at least I haven't normally written lots of specs and docs in the past is that it's kind of annoying to keep them all up to date, to spend all this time typing everything out if I already know what the code is supposed to be. But this is really important to do if you want the coding agents to actually have context on the project, because they don't necessarily have access to your email, your Slack, and so on, and even if they did, they might not know exactly how to turn that into code. In Sculptor, one of the ways that we try to make this easier is by helping detect if the code and the docs have become outdated. That reduces the barrier to writing and maintaining documentation and docstrings, because now you have a way of fixing the inconsistencies more automatically. It can also highlight inconsistencies, or parts of the specifications that conflict with each other, making it easier to make sure that your system makes sense from the very beginning.
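Here's that plan-then-implement workflow as a minimal sketch, using the Anthropic Python SDK as one concrete stand-in for a coding agent that accepts a system prompt. The prompts, model name, and two-phase structure are illustrative only, not Sculptor's actual internals:

```python
# Minimal sketch of the plan-first, then implement, workflow described above.
# Assumes ANTHROPIC_API_KEY is set; the model name is illustrative.
import anthropic

client = anthropic.Anthropic()

PLAN_ONLY = "Produce a step-by-step implementation plan for the task. Do NOT write any code."
IMPLEMENT = "Implement the task, following the provided plan. You may write code now."

def ask(system_prompt: str, message: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model name
        max_tokens=4096,
        system=system_prompt,
        messages=[{"role": "user", "content": message}],
    )
    return response.content[0].text

def plan_then_implement(task: str) -> str:
    plan = ask(PLAN_ONLY, task)  # phase 1: the agent can only plan
    return ask(IMPLEMENT, f"{task}\n\nPlan:\n{plan}")  # phase 2: now it can code
```

The point is that swapping the system prompt between phases constrains the agent far more strongly than just asking it nicely to plan first.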
And finally, you want to have a really strict style guide and try to enforce it. This is important even if you're just doing regular coding without AI agents, just with other human software engineers. But one of the things that is special in Sculptor is that we make suggestions, which you can see towards the bottom here, that help keep the AI system on a reasonable path. Here it's highlighting that you could make this particular class immutable to prevent race conditions. This is something that comes from our style guide, where we try to encourage both the coding agents and our teammates to write things in a more functional, immutable style to prevent certain classes of errors. We're also working on developing a style guide that's custom tailored to AI agents, to make it even easier for them to avoid some of the most egregious mistakes that they normally make.

But no matter how many things you do to prevent the AI system from making mistakes in the first place, it's going to make some mistakes, and there are many things that we can do to detect those problems and prevent them from getting into production. We'll go through three here: first, running linters; second, writing and running tests; third, asking an LLM. We'll dig into each and see how it manifests in Sculptor.

For the first one, running linters: there are many automated tools out there, like Ruff, mypy, Pylint, and Pyre, that you can use to automatically detect certain classes of errors. In normal development this is sort of obnoxious, because you have to go fix all these really small errors that don't necessarily cause problems; it's a lot of churn and extra work. But one of the great things about AI systems is that they're really good at fixing these. So one of the things that we've built into Sculptor is the ability for the system to very easily detect these types of issues and automatically fix them for you, without you having to get involved.

Another thing that we've done is make it easy to use these tools in practice. Maybe a show of hands: how many people here have a linter set up at all? Okay. How many people have zero linting errors in their code base? Two? Great, we'll hire you. It's not easy. So one of the things that we've done in Sculptor is make it so that the AI system understands what issues were there before it started and what issues were there after it ran. That way, at the very least, you can prevent the AI system from creating new errors, even if it isn't working in a perfectly clean code base.
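As a rough sketch of that before/after idea, you can snapshot the linter's findings, let the agent run, and then flag only what's new. This shells out to Ruff; the field names match Ruff's JSON output, but treat the details as an assumption:

```python
# Sketch: flag only the lint errors an agent *introduced*, by diffing
# Ruff's findings before and after the agent runs. Assumes `ruff` is on PATH.
import json
import subprocess

def ruff_findings(path: str) -> set[tuple[str, int, str]]:
    """Run Ruff and return a set of (filename, row, rule code) findings."""
    result = subprocess.run(
        ["ruff", "check", "--output-format", "json", path],
        capture_output=True, text=True,
    )
    findings = json.loads(result.stdout or "[]")
    return {(f["filename"], f["location"]["row"], f["code"]) for f in findings}

before = ruff_findings("src/")
# ... let the coding agent make its changes here ...
after = ruff_findings("src/")

# Line numbers shift as code changes, so a real tool would match findings
# more robustly; this naive set difference is just to show the shape.
for filename, row, code in sorted(after - before):
    print(f"new lint error: {filename}:{row} {code}")
```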
Okay, third: testing. Why should you write tests at all? I was a pretty lazy developer for a long time and did not want to write tests, because it took a lot of effort, you have to maintain them, and hey, I already wrote the code, it works. But one of the major objections to writing tests has kind of disappeared now that we have AI systems: generating tests is now so easy that you might as well write them. Especially if you have correct code, you can tell the agent, hey, just write a bunch of tests, throw out the ones that don't pass, and keep the rest. So there's no real reason to not write tests at all. And, as they say at Google, channeling Beyoncé: if you liked it, you should have put a test on it. This becomes much more important with coding agents, because you don't want your coding agent to go change the behavior of your system in a way that you don't understand, don't expect, and don't want to see happen. At Google this matters a lot for their infrastructure, because they don't want your site to crash when someone changes something. If you really care about the behavior of your system, you want to make sure that it's fully tested.

So how do we actually write good tests? I'll go through a bunch of different components of this.

First, one of the things that you can do is write code in a functional style, by which I mean code that has no side effects. This makes it much, much easier to run LLMs and understand whether the code is actually successful. You really don't want to be running a test that has access to, say, your live Gmail environment, where if you make a single mistake you can delete all of your email. You really want to isolate those types of side effects and focus most of the code on the functional transformations that matter for your program.

Second, you can write two different types of unit tests. Happy path unit tests are the ones that show you that your code is working: it's happy, hooray, it worked. You don't need that many of those, just a small number to show that things are working as you hope. The unhappy path unit tests are the ones that help us find bugs, and here LLMs can be really, really helpful. Especially if you've written your code in a functional style, you can have the LLM generate hundreds or even thousands of potential inputs, see what happens to those inputs, and then ask the LLM: does that look weird? Often, when it says yes, that will be a bug, and now you have a perfect test case replicating it.

Third, after you've written your unit tests, it's maybe a good idea to throw them away in some cases. This is a little bit counterintuitive: in the past, we took all this effort and spent all this time trying to write good unit tests, so we feel some aversion to throwing them away. But now that it's so easy to run LLMs and generate the test suite again from scratch, there's a good reason to not keep around too many unit tests of behavior you don't care about too much. You might also want to refactor the ones that you generated into something that's slightly more maintainable. When you do keep them around, they can confuse the LLM when you come back and change that behavior, so it's at least worth thinking about which of the originally generated tests to keep, how to clean them up, and how many of them you need.

Fourth, you should probably focus on integration tests, as opposed to testing only the code-level functional behavior of your program. Integration tests are the ones that show you that your program actually works from the user's perspective: when the user clicks on this thing, does this other thing happen? AI systems can be extremely good at writing these, especially if you create nice test plans. You can write: when the user clicks on the button to add the item to the shopping cart, then the item is in the shopping cart. You write that out, you write the test, then you write another test plan: when the user clicks the button to remove the item from the shopping cart, then it is gone. AI systems can almost always get this right, and it allows you to work at the level of meaning for your testing, which can be much more efficient. (A sketch of both the happy/unhappy split and a test-plan-style test follows below.)
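To make those shapes concrete, here's a small pytest sketch. The ShoppingCart class and the test plan wording are hypothetical, invented just for this illustration:

```python
# Sketch: happy-path, unhappy-path, and test-plan-style tests.
# ShoppingCart is a hypothetical class invented for this example.
import pytest

class ShoppingCart:
    def __init__(self) -> None:
        self._items: dict[str, int] = {}

    def add(self, item: str, qty: int = 1) -> None:
        if qty <= 0:
            raise ValueError("quantity must be positive")
        self._items[item] = self._items.get(item, 0) + qty

    def remove(self, item: str) -> None:
        del self._items[item]

    def contains(self, item: str) -> bool:
        return item in self._items

# Happy path: one test showing things work as hoped.
def test_add_puts_item_in_cart():
    cart = ShoppingCart()
    cart.add("apple")
    assert cart.contains("apple")

# Unhappy path: edge-case inputs (the kind an LLM can generate by the hundred).
@pytest.mark.parametrize("qty", [0, -1, -999])
def test_add_rejects_nonpositive_quantities(qty):
    cart = ShoppingCart()
    with pytest.raises(ValueError):
        cart.add("apple", qty)

# Test plan: "when the user removes the item, it is gone from the cart."
def test_remove_takes_item_out_of_cart():
    cart = ShoppingCart()
    cart.add("apple")
    cart.remove("apple")
    assert not cart.contains("apple")
```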
Fifth, you want to think about test coverage as a core part of your testing suite. If you're having Claude Code write things for you, you don't care just about the tests passing on their own; you also care whether there are enough tests in the first place. Think back to the original screenshot, where we get back a PR with some number of lines changed. If I tell you how many lines have changed, that's not that helpful. If I tell you how many lines have changed, and also that there's one hundred percent test coverage, and also that all the tests pass, and also that an LLM looked at the tests and thought they were reasonable, now you can probably click on that merge button without quite as much fear.

And sixth, we try to make it easy to run tests in sandboxes and without secrets as much as possible. This makes it a lot easier to actually fix things, and makes it a lot easier to make sure that you're not accidentally causing problems or creating flaky tests.

The third thing that we can do to detect errors is ask an LLM.
There are many different things that we can check for, including if there are issues before you commit with your current change, if the thing that you're trying to do even makes sense, if there are issues in the current branch you're working on, if there are violations of rules in your style guide or in your architecture documents, if there are details that are missing from the specs, if the specs aren't implemented, if they're not well tested, or whatever other custom things that you want to check for.
One of the things that we're trying to enable in Sculptor is for people to extend the checks that we have, so that they can add their own best practices into the code base and make sure that they are continually checked.
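For a flavor of what such a check could look like, here's a sketch of an "ask an LLM" check run against the current diff. This is not Sculptor's actual extension API; the check prompt, the PASS convention, and the model name are all invented for illustration:

```python
# Sketch: a custom LLM-based check over the working-tree diff.
# Not Sculptor's real API; prompt and pass/fail convention are invented.
import subprocess
import anthropic

STYLE_CHECK = (
    "You are a code reviewer. Given this diff, list any violations of the "
    "rule 'prefer immutable, functional-style classes'. If there are none, "
    "reply with exactly: PASS."
)

def check_diff(check_prompt: str) -> tuple[bool, str]:
    diff = subprocess.run(
        ["git", "diff", "HEAD"], capture_output=True, text=True
    ).stdout
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative
        max_tokens=1024,
        system=check_prompt,
        messages=[{"role": "user", "content": diff or "(empty diff)"}],
    )
    verdict = response.content[0].text.strip()
    return verdict == "PASS", verdict

ok, details = check_diff(STYLE_CHECK)
print("style check:", "PASS" if ok else f"FAIL\n{details}")
```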
After you've found issues, then you have to fix them. Very little of this talk is about fixing the issues, because it ends up being a lot easier for these systems to fix issues than you would expect. I think this quote captures it relatively well: a problem well stated is half solved. If you really understand what went wrong, then it's much easier to solve the problem.
This is especially true for coding agents, because really simple strategies work really well. Even just trying multiple times, say a hundred times with different agents, ends up working out quite well. One of the things that enables this is having really good sandboxing. If you have agents that can run safely, then you can run an almost unlimited number in parallel, subject to cost constraints, and if any one of them succeeds, you can use that solution. (A minimal sketch of this pattern follows below.)
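Here's that parallel, first-success pattern as a minimal sketch. run_agent_attempt is a hypothetical stand-in for launching one sandboxed agent and returning its solution, or None on failure:

```python
# Sketch: run many sandboxed agent attempts in parallel, keep the first success.
# `run_agent_attempt` is hypothetical; a real one would launch an agent in an
# isolated sandbox and return its patch, or None if it failed.
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_agent_attempt(task: str, seed: int) -> str | None:
    rng = random.Random(seed)  # stand-in for a real agent run
    return f"patch for {task!r} (attempt {seed})" if rng.random() < 0.1 else None

def first_success(task: str, attempts: int = 100) -> str | None:
    pool = ThreadPoolExecutor(max_workers=16)
    futures = [pool.submit(run_agent_attempt, task, i) for i in range(attempts)]
    for future in as_completed(futures):
        solution = future.result()
        if solution is not None:
            # Cancel attempts that haven't started; a real system would also
            # tear down the still-running sandboxes to save cost.
            pool.shutdown(wait=False, cancel_futures=True)
            return solution
    pool.shutdown()
    return None

print(first_success("fix the race condition in cache.py"))
```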
And this is really just the beginning. There are going to be so many more tools that are released over the next year or two and many of the people in this room are working on those tools. There will be things that are not just for writing code like we've been talking about, but for after deployment, for debugging, logging, tracing, profiling, etc.
There are tools for doing automated quality assurance, where you can have an AI system click around on your website and check if it can actually do the thing that you want the user to do. There are tools for generating code from visual designs. There are tons of dev tools coming out every week.
You will have much better contextual search systems that are useful both for you and for the agent. And of course, we'll get better AI models as well. If anyone is working on these other sorts of tools that are adjacent to the developer experience, helping fix some smaller piece of the process, we would love to work together and find a way to integrate that into Sculptor so that people can take advantage of it.
I think what we'll see over the next year or two is that most of these things will be accessible and it'll make the development experience just a lot easier once all these things are working together. So that's pretty much all that I have for today. If you're interested, feel free to take a look at the QR code, go to our website at imbue.com and sign up to try out Sculptor.
And of course, if you're interested in working on things like this, we're always hiring, we're always happy to chat, so feel free to reach out. Thank you.