Building agent fleet architectures your CISO doesn't hate — Lou Bichard, Gitpod

Hello everyone, welcome. So, today we'll be talking about CISO-approved agent fleet architecture. My name is Lou. I'm a field CTO at a company called Gitpod. In a past life, for many years, my background was software engineering and platform engineering, but for the last four years, nearly four years actually, I've been working at Gitpod, effectively building a platform for automated development environments. What I'm going to talk about today is the journey, one that extends beyond my tenure at the company, over the last six years or so of building our product infrastructure and architecture for secure environments, particularly around regulated industries. In fact, I think I'm already on the next slide. So, yeah.

In terms of who this talk is for, there are a few different angles. The first is about how we start to think about how the technical architecture of a product affects the business; if you're a vendor, this is obviously something for you. There are probably a bunch of people here as well who are building, consuming, or thinking about AI tools, maybe also in these secure environments. And if you're a buyer, using or consuming AI tools, this talk should also be interesting for thinking about what architectures might be right for you and what considerations you might want to take throughout that process. So, let's go ahead and jump into it. So,
for a little bit of context: it's hard to give this talk without a bit of background on the company and what we do. Gitpod is a platform for secure dev environments. People sometimes mix this up with things like staging or integration environments, but to give you an idea, developers actually work inside of Gitpod for around 36 or 37 hours a week. It's almost a replacement for your machine, your laptop. As a consequence, it's highly mission-critical software: if Gitpod goes down, that has a huge impact on our customers, so it's really important that it's incredibly reliable. In addition, because of the automated aspects of our dev environments, we mostly work with highly secure, regulated organizations: banks, pharmaceuticals, healthcare. A lot of the companies we work with are quite particular about their logo rights and publishing their names, but it's a lot of companies you will know, a lot of household names. As you can imagine, building this type of product and actually bringing it to market is not an insignificant task, especially for a smaller company with limited resources. In addition, and I'll talk about this a little later, we launched and talked more publicly about our agent offering two days ago, but we've been building it for months with some of our design customers. And I'll get onto how that interfaces with the architecture as well.

Just as an overview, and I'll deep dive into these as we go through: we've gone through various iterations of different architectural models, and they have different implications for us as a company, but also for our customers: implications for how much ownership they have over those products, the cost of running them, and their security when they run them. Very quickly, because I will deep dive on these: we started off in 2019 with a managed SaaS product. It's where a lot of products start: hosted on the internet, multi-tenant. You click on the website, you swipe a credit card, you can use the product. Wonderful. But this doesn't work for secure companies, right? So we went through several iterations, first with a fully self-hosted architectural approach. We then adjusted that, and I'll talk about exactly how, and then we arrived at what we effectively have today, which is also the platform that we run our agent fleet architecture on, and
we'll talk about how that works as well. So, starting at the beginning with our fully managed SaaS: I don't know if any of you know Sean, but it's funny, because I remembered a tweet he put out several years ago saying that Gitpod inspired him to do a video because it has a nice time-to-wow. Gitpod was very well known for this: you click on a link and you end up in a dev environment from literally a one-click interaction. At the time, this was hosted on top of GCP, on a Kubernetes architecture. We chose GCP for the connection with Kubernetes, and dev environments were orchestrated as pods running inside of Kubernetes. So, multi-tenant infrastructure; but if you know anything about Kubernetes, it's not really designed for that type of workload, and we'll talk a bit about that as we go on. There are several challenges with this type of approach. It's great for getting started, really easy to use the product, but we had a ton of crypto mining and abuse. It's one of those things: we can't have nice things on the internet where we give away free compute, because somebody always wants to abuse it. But it's also just insufficient for enterprise, and one of the goals of Gitpod was really to bring our product to professional
developers doing real-world work in real companies. So, what do we do? What's the next step? The obvious, almost logical thing is: okay, can we take our existing architecture, package it up as a self-hosted installation, and just give that to our customers? So, we tried it; we did exactly that. We didn't think too much about the specifics of cloud provider support, so we opened this up for Google, and I think we had AWS at the time and some of the others. And we had all manner of requests from different companies, with all their different flavors of Kubernetes and different configurations, trying to figure out how to self-host on those platforms. In the fullness of time, though, what this created, both for us as a vendor and for our customers, was significant day-two effects. Self-hosting is great as a general model, but it comes with a huge overhead: typically you have to set it up, and you have to run it. And as a business, that is ultimately your cost of ownership. As a company, we're ultimately selling a product to our customers, and a product that is very difficult to set up and run effectively erodes your ROI: if you need to allocate two or three people, maybe even a whole team, to set up and run this infrastructure, then the benefit you're getting from the product is being eroded. The other difficulty is that we then don't have a strong relationship with the people actually using or running the product. That's also really important, especially in an enterprise situation: how you adopt and roll out the product can be critical to its success. And oftentimes we found that, especially in a self-hosted model without support from our side, some of the people trying to adopt the product were unsuccessful.
So, considering that, what did we do next after the self-hosted model? We looked at this ownership and overhead challenge and said, okay, we can solve this by providing what we call the substrate, so that customers could still self-host, but we could manage the service and take away some of that overhead. In order to do that, we also needed to reduce variance, so one of the things we did was effectively double down on AWS as the single provider: the customer runs the infrastructure in their own account, and we manage it. That meant small pieces of telemetry data, and other things we needed to operate that infrastructure, were still being emitted to us as the provider. But the workload, in our case source code, data, and integrations, all still lived on the customer's infrastructure, which still ticks the box for these regulated companies. But despite this being a success, certainly from an operational standpoint, it continued to have fundamental issues, because it was still built on Kubernetes-based infrastructure: still highly complicated, even just to install, running multiple clusters with a high fixed cost, and still a lot of complexity. This ultimately culminated in us moving away from Kubernetes for our specific use case. I put this blog post up because it goes into a lot of detail about why Kubernetes was particularly challenging for us, our use case, and our workloads.
But we decided to move away from that and to think from first principles about what an architecture looks like that solves this challenge: running inside regulated companies, running highly securely, but doing so in a way that doesn't create a huge operational burden for our customers. So ultimately, this is the architecture that we came up with and that we have today, and it's also running our AI agent fleet. I'll run you through it real quick. We have what we call a runner. If you're an engineer, you're probably familiar with this concept; CI tools like GitHub Actions have runners too. The runner takes everything that's secure, in our case source code and access to data, and runs it on the customer's infrastructure. The runner itself is actually very, very simple in our case: a single ECS task, so a single container running, and it costs single figures of dollars to run every month. The core workload we then spin up is dev environments. Each runner inherits the qualities of the platform it's running on; in our case, for our AWS runner, we run workloads on top of EC2, and those are backed up to EBS. So we can use the native functionality and features of the cloud provider to build the solution, rather than something like Kubernetes, which is theoretically portable but comes with all the overhead and complexity of having portability across different platforms.
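To make the shape of this concrete, here's a minimal sketch of the runner pattern described above, not Gitpod's actual protocol: a control plane on the vendor side holds only metadata about requested environments, and a runner inside the customer's network polls it and launches the workloads locally. All names and message shapes here are hypothetical.

```python
# Hypothetical sketch of the runner pattern: a lightweight agent inside the
# customer's network polls the vendor's control plane for work, then launches
# workloads locally. Illustrative only, not Gitpod's actual API.
from dataclasses import dataclass, field


@dataclass
class ControlPlane:
    """Vendor-side: holds only metadata, never source code or secrets."""
    pending: list = field(default_factory=list)

    def request_environment(self, env_id: str, repo_url: str) -> None:
        # Only an opaque ID and a repo reference cross the boundary;
        # cloning happens on the customer side.
        self.pending.append({"env_id": env_id, "repo_url": repo_url})

    def poll(self) -> list:
        work, self.pending = self.pending, []
        return work


@dataclass
class Runner:
    """Customer-side: launches dev environments on local infrastructure."""
    launched: list = field(default_factory=list)

    def tick(self, control_plane: ControlPlane) -> None:
        for job in control_plane.poll():
            # A real AWS runner might call the cloud provider's native
            # APIs here; we just record the launch.
            self.launched.append(job["env_id"])


cp = ControlPlane()
cp.request_environment("env-1", "https://example.com/repo.git")
runner = Runner()
runner.tick(cp)
print(runner.launched)  # ['env-1']
```

The key property of the pattern is the direction of data flow: the runner pulls work and keeps everything sensitive local, rather than the vendor pushing into the customer's network.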
Then we also started to think about what we can keep on our side as a vendor, so that our customers aren't taking on a significant amount of operational overhead. A lot of that is metadata: IDs, information about users, et cetera. Not personally identifiable information, but things like user IDs, data that wouldn't be particularly significant for our customers if someone got their hands on it. They're not bothered about losing that; they're bothered about losing the IP, the source code, the things actually running inside their infrastructure.
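That metadata split might be sketched like this; it's an illustration of the principle rather than Gitpod's actual schema, and every field name here is made up:

```python
# Illustrative metadata split: given an event about a dev environment, keep
# only non-sensitive metadata for the vendor's control plane. Field names
# are hypothetical.

# Fields safe to emit to the vendor: opaque IDs and lifecycle state.
METADATA_FIELDS = {"env_id", "user_id", "status", "created_at"}


def vendor_view(event: dict) -> dict:
    """Return only the metadata portion of an event; repo contents and
    credentials never leave the customer's side."""
    return {k: v for k, v in event.items() if k in METADATA_FIELDS}


event = {
    "env_id": "env-42",
    "user_id": "u-7",
    "status": "running",
    "created_at": "2024-01-01T00:00:00Z",
    "repo_contents": "<proprietary source code>",
    "db_password": "hunter2",
}
print(vendor_view(event))
```

The filter is deliberately an allow-list rather than a deny-list: anything not explicitly classified as metadata stays inside the customer's infrastructure by default.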
So I did actually throw in a quick video here because I'm going to turn off the audio on this, 00:10:24.000 |
just as you actually see what this looks like from a user experience point of view. So this is actually 00:10:28.800 |
the user choosing their workload. When they choose those, those actually do run inside of their network, 00:10:34.160 |
inside of their cloud account. To then set up and install on their side, what we're able to now do 00:10:39.600 |
is provide them with this runner interface, creates a very simple cloud formation, ask the customer to go 00:10:44.800 |
through and actually put in all of their different network details. That bit is all on their side, create 00:10:49.760 |
this. The process of running this, it takes as little as three minutes. The overhead or the challenge is 00:10:55.360 |
usually getting all of that network configuration, VPC information, subnets, finding out who in your 00:10:59.920 |
organization owns those who can actually allocate them to you. But fundamentally, this makes it then 00:11:04.880 |
exceptionally simple to then deploy that for customers. And the overhead of this is then is also 00:11:09.600 |
significantly reduced. Just one second. I think I have to back out of this to change it. There we go. 00:11:22.160 |
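As a rough illustration of the kind of network inputs a customer gathers for such an install (the parameter names here are hypothetical, not taken from Gitpod's actual CloudFormation template), a pre-flight check could look like:

```python
# Hypothetical pre-flight check for the network details a customer collects
# before launching an installer stack: a VPC and at least two subnets.
def validate_network_params(params: dict) -> list:
    """Return a list of problems; an empty list means the inputs look complete."""
    problems = []
    if not params.get("VpcId", "").startswith("vpc-"):
        problems.append("VpcId missing or malformed")
    subnets = params.get("SubnetIds", [])
    if len(subnets) < 2:
        problems.append("need at least two subnets for availability")
    if any(not s.startswith("subnet-") for s in subnets):
        problems.append("malformed subnet ID")
    return problems


ok = validate_network_params(
    {"VpcId": "vpc-0abc", "SubnetIds": ["subnet-1", "subnet-2"]}
)
bad = validate_network_params({"SubnetIds": ["subnet-1"]})
print(ok)   # []
print(bad)  # ['VpcId missing or malformed', 'need at least two subnets for availability']
```

Checking these inputs up front is exactly the "three minutes versus three weeks" difference the talk describes: the stack launch is fast once the networking details are known.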
Cool. So, we're at an agent conference, so I presume I should certainly be talking about AI and agents. Two days ago, as I mentioned, we launched our agent offering, which we've been working on for months now, since the start of the year actually, with customers and our design partners. Because we already have that existing infrastructure within those customers, we can run the exact same workloads. We create dev environments, and dev environments are effectively there for humans to be productive; but everything an autonomous agent needs to operate is exactly what a dev needs. They need the source code, and they need access to internal systems, whether that's an internal database, a cluster, et cetera. So when you give an agent the ability to access source code and iterate, it can do so inside that same dev environment, with the same access and privileges an individual human would have. And because of our target market and the customers we go after, our agent offering is also privacy-first: what we've effectively built allows us not only to run inside the customer's infrastructure, but brings lots of additional benefits as well, things like audit logs. Once we re-architected to this platform (I'll just jump back a few slides), one of the other first principles we came back to was building it in an API-first way. Every action and interaction, everything you do when you spin up a dev environment, is audit-logged, which lets us give our customers an entire audit history on the platform. That also extends once you start running these workloads with agents: the audit log also runs for those agent tasks.
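A minimal sketch of that API-first audit idea, with illustrative names rather than Gitpod's real API: if every action, human or agent, flows through a single audited entry point, agent tasks get the same audit trail for free.

```python
# Sketch of an API-first audit pattern: every action goes through one code
# path that records an audit entry, so agent actions are logged exactly like
# human ones. Names are illustrative.
import datetime

AUDIT_LOG = []


def api_call(actor: str, actor_type: str, action: str, resource: str) -> None:
    """Single entry point for all platform actions; auditing is not optional."""
    AUDIT_LOG.append({
        "actor": actor,
        "actor_type": actor_type,  # "human" or "agent"
        "action": action,
        "resource": resource,
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    # ... actually perform the action here ...


# A human and an agent hit the same path, so both are audited identically.
api_call("lou", "human", "environment.start", "env-1")
api_call("agent-7", "agent", "environment.exec", "env-1")
print([e["actor_type"] for e in AUDIT_LOG])  # ['human', 'agent']
```

The design choice worth noting is that the audit record is written inside the only entry point, rather than left to each caller, which is what makes the guarantee extend automatically to agents.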
I have two more minutes, and the next speaker is straight after, so that was a very whirlwind run-through. We do have a booth downstairs; I'll be there, and we can dig into this a little deeper if you want, and I can show you how this works in our context. But ultimately, when you're looking at purchasing other tools, AI tools, these are the types of considerations you want to make: what is the architecture and infrastructure, and what qualities do they have? And if you're a vendor, there are probably a lot of learnings here from our side that you can take to build out a simpler technical architecture, rather than building on top of difficult and complicated platforms like Kubernetes, ultimately keeping things simple for your customers as well.

But that's it. Thank you very much. Hope you enjoyed the conference.