Spotlight on Shopify | Code w/ Claude


Transcript

Hi, everyone. I'm Obie Fernandez, and I'm going to be talking to you today about some of the awesome things we're doing at Shopify, leveraging the power of Claude and Claude Code at massive scale. I'll talk about that in a second. First of all, I just want to introduce myself.

My name is Obie Fernandez. I'm a Principal Engineer in our Augmented Engineering Group. I work on everything that has to do with augmented engineering, in other words, using AI to improve developer experience at Shopify. I'm also the author of a pretty interesting book that you might like called Patterns of Application Development Using AI.

So I hope that you look that up if you get a chance. Today we're going to be talking about the challenges of scale at Shopify engineering. We are a fairly large organization, probably one of the largest Ruby on Rails shops in the world. Our main application, which we've been working on for almost 20 years, has millions upon millions of lines of code.

There's probably about 5,000 repos in our organization last time I checked. And we're generating about half a million PRs a year at last count, which is a significant amount of stuff to take into account when you're doing anything with AI, in terms of your context and whatnot. Our core challenge is how do we maintain productivity?

Or rather, that's what my group focuses on. And I'm here to kind of tell you some of the solutions we've developed and how those interact with Claude. The key, really, to understanding the point of what I'm presenting is to understand that there are two very, very different ways of using AI.

One, which we've been immersed in throughout this conference today, is agentic tools. When we're talking about how to leverage AI as assistance and tooling, there's this agentic approach, which is ideal for scenarios that require adaptive decision-making, iteration, and autonomy. They shine when the tasks that you're trying to accomplish with AI help are exploratory or ambiguous in some way.

And you're relying on the LLM's reasoning and judgment because your path to the solution is not known in advance. It might not be known in advance because it's a complex domain that you're dealing with that has factors that are changing all the time. It might be very, very complicated and just kind of beyond what you know.

Or it might be like something as simple as feature development, where you're going to do some exploratory work to figure out how to implement that feature. Those kinds of use cases, as we've seen again and again today in the various sessions, are perfect for tools like Claude Code.

Anything that involves ongoing adaptation, debugging and iteration is perfectly fantastic to give to an agentic tool and see what it can do. In contrast, structured workflow orchestration, including what we can do with this open source tool that I'm going to present to you in this presentation, which we call Roast, is better for tasks that have predictable, well-defined steps.

Cases where you seek consistency, repeatability, and clear oversight. You want to leverage AI in this kind of work for intelligent completion of components of that bigger workflow. So far, I don't think I'm saying anything that is super exotic or wild. It's really the difference between non-deterministic and deterministic kinds of behavior.

And it turns out that, like peanut butter and chocolate, you know, these make a great combination. Sometimes you want one, sometimes you want the other. Examples of what these structured workflows are great for are things like migrating legacy code bases. So, for instance, going from Python 2 to Python 3, or going from whatever your current, you know, JavaScript implementation is based on to whatever the new flavor of the month is.

Or, as is the case with a lot of things that my team deals with, refactoring large systems, where it isn't really an exploratory task. Like, we know what we want to do. Maybe we're addressing performance. Maybe we're addressing technical debt where we kind of understand what the basis is.

So, we know that we want to go through a certain number of steps. Specifically, the kinds of things that we do at Shopify using Roast, which is really what we extracted this open source library out of, started with automated test generation and test optimization. So, we looked at the over half a million tests associated with our main monolith and said, we would really like to address some of the coverage gaps in this code base.

So, how do we go about doing that? Of course, one approach would be to simply open up that project in Claude Code and say, hey, I want to address coverage gaps in this place. However, in practice, it's really helpful to break down that problem the way that you would do it manually and say, okay, well, what would I do if I was going to work on test coverage?

Well, first of all, I need to know what the test coverage is. If we know that every time we're going to do a series of steps, that calls for a structured workflow. For instance, running the coverage tool, running the tests, you know, to generate the report of what needs to be covered.

So, taking a step back for a second, we've been using Claude Code for a while. We were one of the early shops that actually adopted it. As soon as it launched, there was interest in using it. And as soon as people started using it, we started seeing a lot of excitement in our Slack, right?

So, I copied some of the earliest comments I could find there from March, you know, from some of our folks. And I pulled this graph from our AI proxy that Claude Code runs through. And I think it's actually a fairly impressive amount of usage. We have at peak now about 500 daily active users, and that number is growing rapidly.

And as of late, we've had 250,000 requests per second at peak, which is an impressive amount, I believe. And in fact, Roast itself, which is this open source framework that I was telling you about, is called Roast because it helps you set your money on fire. Yeah, think about it.

So, anyway, what does it look like? I'll let this video kind of sit here on a workflow definition. So, this is a workflow orchestration tool. There's nothing super, super exotic about it. Probably one of the most interesting things about it is that it's implemented in Ruby, which is a bit of an oddity in this world where everyone uses Python and TypeScript, unfortunately.
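To give you a rough idea of the shape, a workflow definition for that test coverage case I mentioned might look something like this. To be clear, this is just a sketch from me: the keys, step names, and commands are illustrative, so check the Roast README for the real syntax.

```yaml
# Illustrative sketch only; exact keys, tool names, and commands may differ from real Roast syntax.
name: improve_test_coverage
tools:
  - Roast::Tools::ReadFile   # assumed built-in tool names
  - Roast::Tools::Cmd

# The file (or glob) the workflow runs against -- assumed option name.
target: test/models/order_test.rb

steps:
  - $(COVERAGE=1 bin/rails test {{file}})  # deterministic step: run the tests with coverage enabled
  - analyze_coverage_gaps                  # prompt step: read the coverage report and list uncovered branches
  - generate_missing_tests                 # prompt step: draft tests for those gaps
  - $(bin/rails test {{file}})             # deterministic step: verify the generated tests actually pass
```

The shell steps run the same way every single time, which is the deterministic part, and the prompt steps are where Claude fills in the intelligence.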

So, you don't need to implement anything in Ruby to use Roast for your own things. This can actually help you interleave prompt-oriented kind of tasks with bash scripts or, you know, whatever you want to invoke. So, anyway, why did we go through the trouble of writing Roast and open sourcing it on my team?

Well, the thing is, our illustrious CEO, Tobi Lütke, has instilled throughout the years a culture of tinkering in the company. So, even without AI, we have a culture where people are constantly working on homegrown projects, little skunkworks, little research, you know, things within their departments. And this is not only limited to engineering, it's across the board, you know, I've seen people in sales and support and things like that working on their own tooling.

AI exploded that. So, you know, as soon as, you know, vibe coding became a thing and different kinds of tools came about and were available, like Cursor and Claude Code, and all the different kinds of, you know, chat completion, all of a sudden, everyone was coding across the company.

And specifically, when it came to anything that looks like a structured workflow, or essentially a script that puts together a number of prompts or chains them together, I think that it's probably safe to say that there are hundreds of different ways that this has been implemented across your company.

And in fact, I see some of you nodding. Like, if you work at big companies, you probably have seen this, like, constant reinventing of the wheel. You know, some people are using, you know, one framework or another. Some people are using LangChain. Some people are just writing their own scripts, using their own scripts, et cetera, et cetera, et cetera.

That's cool and all, but it's better, you know, if you start identifying the common needs across the organization and you put something together to really help them out. So, that's where Roast came from. And I want to tell you about the relationship between Claude Code and Roast because it's a bi-directional thing, which is really, really cool.

So, like I said earlier, you could try to get Claude Code to execute a workflow, a predefined workflow. You could set up commands. You could set up a bunch of CLAUDE.md files. All that's well and good, and I'm not here to tell you, hey, don't try to do that.

It's just that, no matter how good the state-of-the-art models get at following instructions, they're still inherently non-deterministic. And you have something else, which is the accumulation of entropy. What I mean by that is that at every step of a given workflow that you just give an agent to work on independently, errors and, you know, misdirection and problems with judgment, mistakes, add up, right?

I'm sure if you've done any amount of prompt chaining you've seen this. Like, basically, something goes slightly wrong that makes the next step work a little bit worse, or the model has to do more work to recover. It's not ideal. What we're finding is that interleaving non-deterministic agentic workflows with deterministic structured workflows and scripts is actually the perfect combination.

So, what I mean by that is that you take a workflow, a big workflow, like let's say optimizing a test suite, and you break it down into component parts and you minimize the amount of instructions that you have to give the agent to work on at any given step.

So, that looks like giving Claude Code Roast workflows on the one side. So, on the left side of the slide here, what I'm describing is, like, basically you tell Claude, hey, I want to work on optimizing my tests, but I have a workflow tool that handles the grading. So, go ahead and call roast test grade with this file or this directory and then take its recommendations and work on them.
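To sketch what that looks like on the Claude Code side, you might drop something like this into a CLAUDE.md file. The exact command and wording here are just illustrative, not a copy of what we ship internally:

```markdown
## Test grading

When asked to optimize or improve a test, do not grade it yourself.
Run the Roast grading workflow and act on its output instead:

    roast test grade path/to/some_test.rb

Apply the recommendations it prints to the test file, then re-run the
same command to confirm the grade improved.
```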

So, that's one way of using Roast: Roast as a tool for Claude Code. On the other side, Roast includes a coding agent tool that you can add to your workflows as part of its configuration, which wraps Claude Code. So, you could kick off a workflow in an automated fashion that, let's say, grades a test.

And as part of the steps in that workflow, you can kick off Claude Code in SDK mode and provide something that you want the agent to work on, but on a narrower scope than giving it the whole thing. I have already talked about test grading, but to give you another example, the main application that we use at Shopify, like I said before, is a big Ruby on Rails monolith.

Ruby is a dynamic language that doesn't natively have typing. So, we use an add-on typing system called Sorbet. Sorbet is not something that is super, super well-known by the models. They certainly have a little bit of knowledge of it, but the kinds of tools that you invoke when you're doing type checking and preparation of type files and things like that are not something that is, let's call it, quote, unquote, intuitive to the models.

It's very, very helpful to break up anything that has to do with type checking or improving the application of types in our code base into these Roast workflows, where we interleave calls to the Sorbet tools that are predefined, like we're always going to run the type checking in this way with a command line, and then we interleave that with giving the results of the type checking to Claude and saying, hey, please address the deficiencies that we found in this manual step.
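Sketched out, that kind of Sorbet workflow ends up looking roughly like the following. Again, the step names and the coding agent configuration here are my illustration of the pattern rather than our exact internal workflow, so treat the specifics as assumptions:

```yaml
# Illustrative sketch: interleave deterministic Sorbet commands with Claude-powered steps.
name: tighten_sorbet_types
tools:
  - Roast::Tools::ReadFile
  - Roast::Tools::CodingAgent   # assumed name for the tool that wraps Claude Code

target: app/services/checkout_service.rb

steps:
  - $(bundle exec srb tc)   # deterministic: run the Sorbet type checker the exact same way every time
  - summarize_type_errors   # prompt step: turn the raw srb output into a concrete list of fixes
  - fix_type_errors         # coding agent step: hand that list to Claude Code, scoped to just this file
  - $(bundle exec srb tc)   # deterministic: confirm the file now type-checks cleanly
```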

This is not a super compelling video. I didn't have a ton of time to prepare this talk, but basically, if the video starts here, what you'll see is the result of running one of these workflows to generate tests. So, it gets stuck here running the coding agent. I'm talking to the Claude Code team about maybe giving us some ability to output what the coding agent is doing, but we see that it's generating tests.

I'm actually flipping over and running the tests to verify that they run, or rather to show you that they run. But this is kind of what it looks like at scale. It's a bit messy. I should add, if you want to try Roast, it is a very early version.

It does work. It has a cool set of tools. It has cool features like being able to save your session every time you run a workflow. So, if any of you have tried to do workflow kinds of things, one of the key benefits of using a tool like Roast is that, for instance, if you have a five-step workflow, you don't have to run the first four steps over and over again just to debug the fifth step.

You can just go ahead and replay from the fourth step afterwards and then, you know, work on it. Big, big time saver. We also do things like incorporate tool function caching. A lot of times when you're developing these workflows, you're kind of working on the same dataset. If you're only working on an agentic tool, you kind of have to give it the whole thing and let it run from the beginning and do all the tool things that it's going to do, all the function calling.
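For what it's worth, the replay flow is just a flag on the command line. Something along these lines, although the exact flag name is from memory, so double-check it against the Roast README:

```
# First run: execute the whole workflow; the session state for each step gets saved.
roast execute workflow.yml

# Later: resume from a specific step instead of re-running everything before it (flag name assumed).
roast execute workflow.yml --replay generate_missing_tests
```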

If you're using a tool like Roast, you can do that and have all your function calls, you know, cached at the Roast level so that they just execute super, super fast. I mentioned it before, but just to bring it home, we are using the Claude Code SDK as a tool for Roast.

So, specifically, the kind of thing that we're using that for is that the configured Roast workflow one-shots a code migration, for instance, because it's kind of like we know exactly what we want to do. We don't want to beat around the bush or have to discuss what it is we're going to do with Claude.

So, we're just going to go ahead and do that using regular chat completion style prompting. And then, once we have a starting place, we hand that over to Claude using the SDK command line and say, hey, run the tests for this, and then, if it's broken, fix it, and iterate on it.
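As a sketch, that migration pattern might be wired up something like this. The workflow keys and step names are illustrative, but claude -p really is the headless, non-interactive way to invoke Claude Code from a script:

```yaml
# Illustrative sketch: one-shot the migration with a plain prompt step, then let Claude Code iterate.
name: migrate_and_stabilize
target: app/models/legacy_payment.rb

steps:
  - apply_migration   # plain chat-completion prompt step: rewrite the file to the new API, no discussion
  - $(bin/rails test test/models/legacy_payment_test.rb)   # deterministic: run the relevant tests once
  - $(claude -p "Run bin/rails test test/models/legacy_payment_test.rb. If anything fails, fix the code and re-run until it passes.")   # hand off to Claude Code in headless mode to iterate
```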

Again, these are things that are not necessarily that useful to the individual developer, like as they're going about their day; probably they're just going to use Claude Code. But if you're doing this at scale, or as part of repeatable processes, or as part of reacting to PRs or anything like that, you know, as part of data pipelines, it becomes super, super useful.

I want to leave some time for questions, so I'm just going to move a little bit faster. I wanted to mention that, from experience, one of the things that's a little bit tricky when using the Claude Code SDK is figuring out what tools you need. However, an option that doesn't, I think, get enough love or get mentioned, especially when you're prototyping, is that you can just pass --dangerously-skip-permissions, which just kind of lets it do whatever it wants.

And, you know, when you're prototyping and figuring out how, you know, how you're going to use your coding agent, that's often very useful. Very useful. And as I was kind of giving some initial versions of this talk to my colleagues, they said, hey, it would probably be useful to put an example prompt of what it looks like when you include a coding agent in your workflow.

So I put an example prompt in there: use your coding agent tool function to raise the branch coverage level of the following test above 90%. After each modification, run rake test with coverage, you know, the path to the test, et cetera. You get the picture. So, finally, hopefully you've liked this introduction.
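Written down the way it might appear in a workflow, that prompt is roughly the following. The interpolation syntax and the exact test command are assumptions on my part; the substance is the prompt I just read:

```yaml
# Illustrative inline prompt step that delegates work to the coding agent tool.
steps:
  - >
    Use your coding agent tool function to raise the branch coverage of
    {{file}} above 90%. After each modification, run the test-with-coverage
    rake task against {{file}} and check the reported branch coverage.
    Keep iterating until the threshold is met, then stop.
```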

I know that maybe to some of you this might seem a little bit boring, but to us, making the discovery that interleaving these deterministic and non-deterministic kinds of processes together and leveraging the power of Claude Code was actually a magical combination. It's taking off like wildfire within Shopify.

You know, this is something we just launched. We've had it internally within our development kind of environment for test grading and optimization probably for five or six weeks now. And we launched it as open source, I think, two or three weeks ago, and it's starting to take off like wildfire at this point.

Now that people realize, hey, there's a standardized solution. Also, because of time pressure, I wasn't able to show you all the features of Roast. It actually has a lot of cool things like just being able to declare inline prompts within your workflows, being able to declare inline bash commands.

And it has a lot of convention-oriented things. Raise your hand if you've ever used Ruby on Rails or you like Ruby on Rails. Yay! All right. We've got some people in the house. I'm a Rails guy. I wrote the book The Rails Way back in the day. And I really like Ruby on Rails.

And it takes a convention-oriented approach. That's kind of what you get with Roast as well. So, it has things like the ability to define your prompts and then put an output template alongside it where, you know, you're able to transform the output using ERB. Very, very Rails-like. So, if you like Ruby on Rails, I think you'll like Roast.
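To make the output template idea concrete, here's a tiny sketch of what one might look like. Assume the step returned a grade and a list of recommendations; the variable name and the file naming convention (an output template sitting alongside the prompt) are from memory, so verify them against the docs:

```erb
<%# Illustrative ERB output template: reshape a grading step's raw result into a short report. %>
Test grade: <%= response["grade"] %>

Recommendations:
<% response["recommendations"].each do |rec| %>
  - <%= rec %>
<% end %>
```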

It looks like we have about four minutes for questions. So, if anyone wants to step up to the mic, I'll give you a chance. Have you tried having an agent generate Python code to engage another agent? Well, first of all, no, because I don't write Python on principle. But have I tried agent-generated code to invoke an agent?

Correct. Yeah. So, basically using Python either in your interpreter or in code execution to orchestrate sub-agents and through that do the same things as you do, like migrations and test coverage and whatnot. No. And if I understand the thrust of your question correctly, I'm not sure that we would in the context of doing Roast.

So, the direction that we're going with Roast is the introduction of things that you would normally associate with workflows. So, the ability to put in, like, control flow, conditionals, branching, looping, things like that, which are kind of quality-of-life features if you're a workflow developer. What makes it unique, though, is that this is very much written for the AI age and for LLMs.

So, for instance, your conditionals allow you to put in a prompt or a bash script or a step, you know, like a, let's call it, full-featured prompt versus an inline prompt. And the results of invoking that prompt can be coerced into, for instance, a true or a false.

Or if it's in the context of something that expects a collection to iterate over, the result of the prompt is coerced into a list and then iterates over it. So, I know that you asked about code generation. That's a cool thing. I might actually have to think about that and see if it fits into the picture.
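In workflow terms, that direction might end up looking something like this, where the result of a prompt gets coerced into a boolean for a conditional or into a list for iteration. The if and each keys here are my sketch of the idea, not necessarily the exact shipped syntax:

```yaml
# Sketch of prompt-driven control flow; key names are illustrative, not confirmed syntax.
steps:
  - if: "Does {{file}} have any public methods still marked T.untyped? Answer true or false."
    then:
      - add_signatures   # prompt or coding agent step that only runs when the condition holds
  - each: "List the test files under test/models that still lack branch coverage, one per line."
    as: test_file
    steps:
      - grade_test       # runs once per item, with {{test_file}} available for interpolation
```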

Cool. Thanks. All right. Thank you very much.