Hello everyone, welcome. Today we'll be talking about CISO-approved agent fleet architecture. My name is Lou, and I'm a field CTO at a company called Gitpod. In a past life, software engineering and platform engineering were my background for many years, but for the last nearly four years I've been working at Gitpod, building a platform for automated development environments.
What I'm going to talk about today is the journey over the last six years, which extends beyond my own tenure at the company, of building our product infrastructure and architecture for secure environments, particularly for regulated industries. And, in fact, I think I'm already on the next slide.
So, in terms of who this talk is for, there are a few different audiences. The first is anyone thinking about how the technical architecture of a product affects the business. If you're a vendor, this is obviously for you. There are probably also a bunch of people here who are building, consuming, or thinking about AI tools, maybe also in these secure environments.
And if you're a buyer, I think this talk will also be interesting: if you're buying, using, or consuming AI tools, it covers which architectures might be right for you and what considerations you might want to take into account throughout that process. So, let's jump into it.
For a little bit of context, it's hard to give this talk without a bit of background on the company and what we do. Gitpod is a platform for secure dev environments. People sometimes mix this up with things like staging or integration environments.
But to give you an idea, developers actually work inside of Gitpod for around 36 to 37 hours a week. It's almost a replacement for your machine or your laptop. As a consequence, it's highly mission-critical software.
If Gitpod goes down, that has a huge impact on our customers, so it's really important that it's incredibly reliable. In addition, because of the automated aspects of our dev environments, we mostly work with highly secure, regulated organizations: banks, pharmaceutical companies, healthcare.
A lot of the companies we work with are quite particular about logo rights and publishing their names, but it's a lot of companies you will know, a lot of household names. As you can imagine, building this type of product and bringing it to market is not an insignificant task, especially for a smaller company with limited resources.
In addition, and I'll talk about this a little later, we launched and talked more publicly about our agent offering two days ago, but we've been building it for months with some of our design customers. I'll get to that and how it interfaces with the architecture as well.
Just as an overview, and I'll deep-dive into these as we go: we've gone through various iterations of different architectural models, and they have different implications for us as a company, but also for our customers: how much ownership they have over the product, the cost of running it, and the security when they run it.
Very quickly, because I will deep-dive on these: we started off in 2019 with a managed SaaS product. It's where a lot of products start: hosted on the internet, multi-tenant. You click on the website, you swipe a credit card, you can use the product.
Wonderful. But this doesn't work for secure companies. So we went through several iterations, first with a fully self-hosted architectural approach. We then adjusted that, and I'll talk about exactly how, and then we arrived at what we have today, which is also the platform that runs our agent fleet architecture, and we'll talk about how that works as well.
If we start at the beginning with our fully managed SaaS: I don't know if any of you know Sean, but it's funny, because I remembered a tweet he put out several years ago saying Gitpod really inspired him to do a video, because it has a nice time-to-wow, which Gitpod was very well known for.
You click on a link and you end up in a dev environment from a one-click interaction. At the time, this was hosted on top of GCP with a Kubernetes architecture. We chose GCP for its connection with Kubernetes, and dev environments were orchestrated as pods running inside of Kubernetes.
So, multi-tenant infrastructure. But if you know anything about Kubernetes, it's not really designed for that type of workload, and we'll talk a bit about that as we go on. There are several challenges with this approach. It's great for getting started and makes the product really easy to use, but we had a ton of crypto mining and abuse on it.
It's one of those things: we can't have nice things on the internet. Where we give away free compute, somebody always wants to abuse it. But it's also just insufficient for enterprise, and one of the goals of Gitpod was to bring our product to professional developers doing real-world work in real companies.
So, what do we do? What's the next step? The almost logical thing is: can we take our existing architecture, package it up as a self-hosted installation, and just give that to our customers? So we tried it. We did exactly that.
We didn't think too much about the specifics of cloud provider support. We opened this up for Google, and I think we had AWS at the time and some others. We had all manner of requests from different companies with all their different flavors of Kubernetes and different configurations, trying to figure out how to self-host on those platforms.
In the fullness of time, though, what this created, both for us as a vendor and for our customers, was significant day-two effects. Self-hosting is great as a general model, but it comes with huge overhead: typically, you have to set it up, and you have to run it.
And as a business, that is ultimately your cost of ownership. As a company, we're selling a product to our customers, and a product that is very difficult to set up and run effectively erodes its ROI. If you need to allocate two, three, maybe even a whole team of people to set up and run the infrastructure, then the benefit you're getting from the product is being eroded.
The difficulty with that as well is that we don't have a strong relationship with the people actually using or running the product. That's really important, especially in an enterprise situation: how you adopt and roll out the product can be critical to its success. And we often found that, in a self-hosted model without support from our side, some of the people trying to adopt the product were unsuccessful.
So, given those issues with the self-hosted model, what did we do next? We looked at this ownership and overhead challenge and said: okay, we can solve this by providing what we call the substrate, so that customers could still self-host, but we could manage the service.
That way we could take away some of that overhead. But in order to do that, we also needed to reduce variance. So one of the things we did was double down on AWS as the single provider: the infrastructure runs there, and we manage it.
That meant small pieces of telemetry data and other things we needed to operate that infrastructure were still being emitted to us as the provider. But the workload, in our case source code, data, and integrations, all still lived on the customer's infrastructure, which still ticks the box for these regulated companies.
But despite this being a success, certainly from an operational standpoint, it still had fundamental issues, because it was still built on Kubernetes-based infrastructure: still highly complicated to install, running multiple clusters with a high fixed cost, and still a lot of complexity.
This ultimately culminated in us moving away from Kubernetes for our specific use case. I put this blog post up because it goes into a lot of detail about why Kubernetes was particularly challenging for us, our use case, and our workloads. We decided to move away from it and think from first principles about what an architecture looks like that solves the challenge of running inside regulated companies, running highly securely, but without creating a huge operational burden for our customers, which is bad for them and bad for us.
So ultimately, this is the architecture we came up with and that we have today, which is also running our AI agent fleet. I'll run you through it real quick. We have what we call a runner, and if you're an engineer, you're probably familiar with this concept; tools like GitHub Actions have a similar runner.
The runner takes everything that's secure, in our case source code and access to data, and runs it on the customer's infrastructure. A runner is actually very, very simple in our case: it's a single ECS task, so a single container running, and it costs single-digit dollars to run every month.
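To give a feel for how small that footprint is, here's a minimal sketch of what a single-container ECS task definition of this kind might look like. Every name, image, and sizing value here is my own illustrative assumption, not Gitpod's actual configuration:

```python
def runner_task_definition(image: str, region: str) -> dict:
    """Sketch of a tiny, single-container ECS task definition.

    Shapes follow the ECS RegisterTaskDefinition API; the family name,
    sizing, and log group are hypothetical placeholders.
    """
    return {
        "family": "example-runner",           # hypothetical task family
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",
        "cpu": "256",                         # 0.25 vCPU: deliberately tiny
        "memory": "512",                      # 512 MiB
        "containerDefinitions": [
            {
                "name": "runner",
                "image": image,
                "essential": True,
                "logConfiguration": {
                    "logDriver": "awslogs",
                    "options": {
                        "awslogs-group": "/ecs/example-runner",
                        "awslogs-region": region,
                        "awslogs-stream-prefix": "runner",
                    },
                },
            }
        ],
    }
```

At that size, a task running continuously on Fargate really does land in the single-digit-dollars-per-month range, which is the point: the always-on control-plane footprint in the customer account stays negligible.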
The core workload we then spin up is dev environments. Each runner targets a specific platform, in our case AWS, and runners inherit the qualities of the platform they're running on. For AWS, we run workloads on top of EC2, backed by EBS.
So we can use the native functionality and features of the cloud provider to build the solution, rather than something like Kubernetes, which is theoretically portable but comes with all the overhead and complexity of maintaining that portability across different platforms. Then we also started to think about what we can keep on our side as a vendor, so that our customers aren't taking on a significant amount of operational overhead.
A lot of that is metadata: IDs, information about users, et cetera. Not personally identifiable information, but things like user IDs, stuff that isn't particularly significant for our customers if it were ever exposed. They're not bothered about losing that; they're bothered about losing the IP, the source code, the things actually running inside their infrastructure.
I threw in a quick video here, with the audio turned off, just so you can see what this looks like from a user-experience point of view. This is the user choosing their workload, and when they choose it, it actually runs inside their network, inside their cloud account.
To set up and install on their side, we can now provide them with this runner interface, which creates a very simple CloudFormation template and asks the customer to go through and put in all of their network details. That bit is all on their side.
Running it takes as little as three minutes. The overhead, or the challenge, is usually gathering all of that network configuration, the VPC information and subnets, and finding out who in your organization owns those and can actually allocate them to you. But fundamentally, this makes it exceptionally simple for customers to deploy.
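Since gathering those network details is the real friction, a setup flow like this typically sanity-checks them before anything is deployed. Here's a hypothetical pre-flight check in that spirit; the id formats follow general AWS conventions, and the function itself is an illustrative sketch, not Gitpod's actual tooling:

```python
import re

# AWS resource ids follow predictable shapes: vpc-<8 or 17 hex chars>,
# subnet-<8 or 17 hex chars>. These patterns reflect that convention.
VPC_ID = re.compile(r"vpc-[0-9a-f]{8,17}")
SUBNET_ID = re.compile(r"subnet-[0-9a-f]{8,17}")

def validate_network_config(vpc_id: str, subnet_ids: list[str]) -> list[str]:
    """Return a list of problems with the supplied network details.

    An empty list means the config looks plausible and can be passed
    on as stack parameters (e.g. to CloudFormation's CreateStack).
    """
    problems = []
    if not VPC_ID.fullmatch(vpc_id):
        problems.append(f"not a valid VPC id: {vpc_id!r}")
    if not subnet_ids:
        problems.append("at least one subnet id is required")
    for subnet in subnet_ids:
        if not SUBNET_ID.fullmatch(subnet):
            problems.append(f"not a valid subnet id: {subnet!r}")
    return problems
```

Catching a mistyped subnet id at this stage, rather than mid-deployment, is a big part of why the actual stack creation can then complete in a few minutes.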
And the overhead of this is also significantly reduced. So, we're at an agent conference, so I presume I should certainly be talking about AI and agents.
Two days ago, as I mentioned, we launched our agent offering, which we've been working on for months now, since the start of the year, with customers and our design partners. Because we already have that existing infrastructure within those customers, we can run the exact same workloads.
Dev environments are effectively there for humans to be productive, but everything an autonomous agent needs to operate is exactly what a developer needs: the source code, and access to internal systems, whether that's an internal database, a cluster, et cetera.
So when you give an agent the ability to access source code and iterate, it can do so inside that same dev environment, with the same access and privileges an individual human would have. Because of our target market and the customers we serve, our agent offering is also privacy-first: what we've built allows us not only to run inside the customer's infrastructure, but brings lots of additional benefits as well.
Take audit logs. Once we re-architected, and I'll just jump back a few slides, one of the first principles we came back to was building this in an API-first way. Every action and interaction, from spinning up a dev environment onwards, is audit-logged, which lets us give our customers an entire audit history on the platform. And once you start running these workloads with agents, you benefit from that as well.
So that audit log also covers those agent tasks. I have two more minutes, and the next speaker is straight after, so that was a very whirlwind run-through. We have a booth downstairs; I'll be there if you want to dig into this a little deeper.
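The nice property of API-first auditing is that the same record shape works whether the actor is a human developer or an autonomous agent. Here's an illustrative sketch of what such a record might look like; the field names and values are assumptions for the sake of the example, not Gitpod's actual schema:

```python
from datetime import datetime, timezone

def audit_event(actor: str, action: str, resource: str, **details) -> dict:
    """Build a single audit-log record for a platform action.

    The actor can be a human user id or an agent id; because every
    action goes through the same API, both produce identical records.
    """
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,        # e.g. "user-12" or "agent-7" (hypothetical ids)
        "action": action,      # e.g. "environment.start" (hypothetical name)
        "resource": resource,  # e.g. a dev environment id
        "details": details,    # free-form context for the action
    }
```

An agent starting an environment would then emit the same kind of record as a human doing it from the UI, which is what makes a single audit history across humans and agents possible.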
I can show you how this works in our context, but ultimately, if you're purchasing AI tools, these are the types of considerations you want to make: what is the architecture and infrastructure, and what qualities do they have?
And if you're a vendor, there are probably a lot of learnings here from our side that you can take to build a more simplified technical architecture, rather than building on top of difficult and complicated platforms like Kubernetes, and ultimately to keep things simple for your customers as well.
But that's it. Thank you very much. Hope you enjoyed the conference.