back to index

Building agent fleet architectures your CISO doesn't hate — Lou Bichard, Gitpod


Whisper Transcript | Transcript Only Page

00:00:00.000 | Hello everyone, welcome. So, today we'll be talking basically about CISO-approved Agent
00:00:20.200 | Fleet Architecture. So, my name is Lou. I'll actually jump into that. So, my name is Lou.
00:00:25.680 | I am a field TTO at a company called Gitpod. In a past life, you know, for many years, basically
00:00:31.600 | software engineering and platform engineering is my background, but for the last four years,
00:00:35.360 | actually nearly four years, I've been working at Gitpod effectively building a platform for
00:00:39.440 | automated development environments. What I'm going to talk about today is actually the journey over
00:00:44.400 | the last, even extends beyond my tenure at the company for the last sort of six years, building
00:00:50.160 | our sort of product infrastructure and architecture for secure environments, particularly around
00:00:55.040 | regulated industries and that kind of thing. I assume they'll probably, in fact, actually,
00:00:59.520 | I think I'm already on the next slide. So, yeah. So, in terms of like who this talk is for,
00:01:05.280 | so different aspects. So, first aspect is really about how do we start to think about how technical
00:01:10.800 | architectures of product affect our business. If you're a vendor, this is obviously something for
00:01:14.560 | you. There are probably a bunch of people here as well that are either building or consuming or
00:01:18.480 | thinking about AI tools, maybe also in these secure environments. But then also, if you're a buyer of
00:01:22.720 | this as well, I think this talk will also be interesting if you're buying, using or consuming
00:01:26.720 | AI tools, thinking about like what architectures might be right for you and what considerations you
00:01:31.760 | might want to take throughout that process. So, let's go ahead and actually just jump into it. So,
00:01:38.000 | for a little bit of context, it's very hard to talk about this talk without actually giving a bit of
00:01:42.400 | background to the company and what we do. So, Gitpod is basically a platform for secure dev environments.
00:01:47.440 | A lot of people sometimes mix this up with things like staging or integration environments,
00:01:51.440 | things like that. But to give you an idea like developers actually work inside of Gitpod for
00:01:56.160 | around 37 hours a week, 36, 37 hours. So, like you're, it's almost like a replacement, I suppose,
00:02:00.880 | like for your machine or your laptop, something like that. So, as a consequence of that as well,
00:02:05.520 | it's highly mission critical software. So, if Gitpod goes down, that has a huge impact for our customers.
00:02:11.280 | So, it's really important that this is incredibly reliable. In addition to that, what we effectively
00:02:16.480 | do is because the automated aspects of our dev environments, we run also mostly with highly
00:02:22.960 | secure regulated organizations. So, banks, pharmaceutical, healthcare. A lot of the companies
00:02:28.400 | that we work with as well, like are quite particular about like their logo rights and publishing some of
00:02:32.160 | their names. But for sure, it's a lot of companies that you will know, a lot of household names and
00:02:35.760 | things like that as well. So, as you can imagine, building this type of product and actually bringing
00:02:41.680 | it to market is not an insignificant task, especially if you're a smaller company with sort of limited
00:02:47.120 | resources trying to do that. And in addition to that as well, I'll talk about this a little bit later,
00:02:51.360 | but we actually launched and talked more publicly about our agent offering two days ago, but we've
00:02:56.560 | been building this for months with some of our design customers as well. And I'll get onto that and how
00:03:01.680 | how that interfaces with the architecture as well. So, just as an overview, and I'll actually deep dive
00:03:08.320 | into these a little bit as we go through, is we've gone through various iterations of different
00:03:13.200 | architectural models and they have different implications for us as a company, but also for
00:03:16.960 | our customers, the implications on, for instance, how much ownership they have over those products,
00:03:21.840 | like the cost of running those products, and the security as well of actually when they run them.
00:03:27.600 | So, very quickly, because I will actually deep dive on these. So, we started off 2019 with a managed
00:03:33.440 | SaaS product. It's where a lot of products start, hosted on the internet, multi-tenant. You can go,
00:03:38.080 | you click on the website, you swipe a credit card, you can use the product. Wonderful. But this doesn't
00:03:42.320 | work for secure companies, right? So, we went through several iterations, first with a very self-hosted
00:03:48.000 | architectural approach. We then adjusted that, and I'll talk about exactly how, and then we have what we
00:03:53.360 | effectively have today, which is also the platform that we run our agent fleet architecture on, and
00:03:58.640 | we'll talk about how that works as well. So, if we start at the beginning with our sort of fully
00:04:04.800 | managed SaaS, and I don't know if any of you know Sean, but it's funny, because I remembered this tweet
00:04:08.560 | the other day that he put out several years ago. It's like, Gitpod really inspired him to do a video,
00:04:14.160 | because it has a nice time to wow, which is, you know, Gitpod was very well known. You click on a link,
00:04:18.960 | and you end up in a dev environment literally from one-click interaction. This was, at the time,
00:04:24.160 | hosted on top of GCP, Kubernetes architecture. We chose GCP, obviously, for the connection with
00:04:29.600 | Kubernetes. And the way the dev environments were then orchestrated was pods running inside of
00:04:34.560 | Kubernetes. So, multi-tenant infrastructure, but if you know anything about Kubernetes, it's not really
00:04:39.200 | designed for that type of workload, and we will talk a bit about that as we go on. But obviously,
00:04:44.880 | there are several challenges with this type of approach. One, it's great to get started really
00:04:48.640 | easy to use your product, but we had a ton of crypto mining and abuse on this. It's one of these things,
00:04:53.600 | like, we can't have nice things on the internet, like, where we give away free compute, because somebody
00:04:57.200 | always wants to do something. Always wants to abuse it, effectively. But it's not also just insufficient
00:05:02.480 | for enterprise. And one of the goals of Gitpod was really to bring our product to professional
00:05:06.640 | developers doing real-world work in real companies. So, what do we do? What's the next step? The obvious,
00:05:15.280 | almost like the logical thing there is just, okay, can we take our existing architecture, take that,
00:05:20.720 | package that up as a self-hosted installation, and just give that to our customers? So, we tried it.
00:05:25.840 | We did do exactly that. We didn't think too much necessarily about, let's say, the specifics of
00:05:30.320 | cloud provider support. So, we kind of opened this up for Google. I think we had AWS at the time and some
00:05:35.840 | of the others. And, you know, we had all manner of them requests from different companies with all
00:05:39.840 | their different flavors of Kubernetes and different configurations to try and figure out how do we then
00:05:44.720 | self-host on those platforms. In the fullness of time, though, what this came up with, both for us as
00:05:50.720 | a vendor, but also for our customers, is significant day two effects. So, self-hosting is great as a
00:05:56.560 | general model, but it comes also with this huge overhead. Typically, you have to set it up. You have to run it.
00:06:02.240 | And then, as a business also, that ultimately is, like, your cost of ownership. So, as a company,
00:06:08.320 | like, we're ultimately selling a product to our customers. Having a very difficult product to set
00:06:13.440 | up and run is effectively eroding your ROI. If you need to allocate two, three, maybe even a whole team
00:06:19.280 | to actually set up and run this infrastructure, then the benefit that you're getting from that product is
00:06:23.520 | also being eroded. The difficulty with that as well is then we don't have a strong relationship with the
00:06:28.080 | people that are actually using or running that product. That's also really important, especially
00:06:31.760 | in an enterprise sort of situation. How you adopt and roll out the product can be critical to its
00:06:36.480 | success. And oftentimes, we found that, you know, especially in a self-hosted model without support
00:06:41.280 | from our side, some of those people that are trying to adopt the product also are unsuccessful.
00:06:48.640 | So, considering that with the self-hosted model, what did we do next? So, we looked at this ownership
00:06:54.320 | and this sort of overhead challenge and said, okay, we can solve this by effectively providing what we
00:06:59.280 | call the substrate so that we could allow customers to self-host, but then we could also manage the
00:07:04.400 | service. So, we could take away some of that overhead. But in order to do that, we also needed to reduce
00:07:08.880 | variance. So, one of the things that we did is we effectively doubled down on AWS as the individual
00:07:13.840 | provider, where we run the infrastructure, but then we would then manage it. So, that meant small
00:07:19.760 | pieces of telemetry data, other things that we needed to operate that infrastructure were still
00:07:24.160 | being emitted to us as a provider. But what it still meant is the workload, or in our case, source code,
00:07:29.760 | data, integrations, all still live on the customer's infrastructure, which is then still ticking the
00:07:35.440 | box when it comes to these regulated companies. But despite this being a success, certainly from an
00:07:42.000 | operational standpoint, it still continued to have effectively fundamental issues because it's still
00:07:46.880 | built on this sort of Kubernetes-based infrastructure, still highly complicated, let's say even when you
00:07:52.880 | want to install. In this case, we're running multiple clusters with a high fixed cost, let's say, and also
00:07:57.840 | still a lot of complexity. This ultimately culminated in its moving away for Kubernetes for our specific use case.
00:08:06.640 | I put this blog post up because it goes into a lot of a lot of detail specifically about why
00:08:11.120 | Kubernetes was particularly challenging for us and our use case and our workloads.
00:08:15.360 | But we decided to move away from that and start to think from first principles about what an
00:08:19.680 | architecture looks like that solves this challenge of running inside of regulated companies, running
00:08:24.320 | highly securely, but doing so in a way that doesn't create a huge operational burden for our customers,
00:08:29.360 | which is bad for them and bad for us.
00:08:30.800 | So ultimately, this is actually the architecture that we came up with and that we have today that's
00:08:38.240 | also running our AI agent fleet. So I'll just run you through this real quick. So we have what we call
00:08:43.920 | effectively like a runner. And if you're an engineer, maybe you're probably familiar with this, like tools like
00:08:48.080 | GitHub CI as well has like this runner. The runner takes everything that's secure. So in our case, source code and
00:08:54.880 | access to data and runs that on the customer's infrastructure. A runner is actually very,
00:08:59.520 | very simple in our case. The runner is actually a single ECS task. So it's a single container running.
00:09:04.880 | And it obviously then costs like, we're talking dollars, single figures of dollars to run every month.
00:09:09.920 | The core workload that we spin up is then dev environments. Each runner runs on a different,
00:09:18.000 | in our case, this is AWS. But if we, you know, for our Linux runner, they inherit the qualities of the
00:09:23.360 | platform they're running on. So for AWS, we run workloads on top of EC2, and those are also then
00:09:28.320 | backed up to EBS. So we can use the native functionality and features of the cloud provider to build that
00:09:33.840 | solution rather than something like Kubernetes, which is theoretically portable, but it also comes
00:09:38.800 | with all of that overhead and complexity of having the portability across different platforms.
00:09:44.640 | Then what we did is then also start to think about, okay, so what are the things that we can keep on
00:09:48.080 | our side as a vendor so that our customers aren't taking on a significant amount of operational
00:09:53.120 | overhead? So a lot of that is like metadata, IDs, information about users, et cetera. Not personally
00:09:59.120 | identify information, but like user IDs, like stuff that if you were to get your hands on it, it's not
00:10:04.160 | particularly significant for our customers. They're not bothered about losing that. They're bothered about
00:10:08.000 | losing the IP, the source code things that's running actually inside their infrastructure.
00:10:11.840 | So I did actually throw in a quick video here because I'm going to turn off the audio on this,
00:10:24.000 | just as you actually see what this looks like from a user experience point of view. So this is actually
00:10:28.800 | the user choosing their workload. When they choose those, those actually do run inside of their network,
00:10:34.160 | inside of their cloud account. To then set up and install on their side, what we're able to now do
00:10:39.600 | is provide them with this runner interface, creates a very simple cloud formation, ask the customer to go
00:10:44.800 | through and actually put in all of their different network details. That bit is all on their side, create
00:10:49.760 | this. The process of running this, it takes as little as three minutes. The overhead or the challenge is
00:10:55.360 | usually getting all of that network configuration, VPC information, subnets, finding out who in your
00:10:59.920 | organization owns those who can actually allocate them to you. But fundamentally, this makes it then
00:11:04.880 | exceptionally simple to then deploy that for customers. And the overhead of this is then is also
00:11:09.600 | significantly reduced. Just one second. I think I have to back out of this to change it. There we go.
00:11:22.160 | Cool. So I mean, we're at an agent conference. So I also presume I should certainly be talking about AI
00:11:28.240 | and agents. So two days ago, I mentioned we basically launched our agent offering, which we've been
00:11:32.960 | working on for months now since the start of the year, actually, with customers and our design partners.
00:11:37.200 | Now, because we have that existing infrastructure already within those customers, we can actually run the
00:11:42.000 | exact same workloads. So because we create dev environments, dev environments are effectively for
00:11:46.960 | humans to be productive. But everything that an autonomous agent needs to operate is exactly
00:11:51.520 | the same thing that a dev does. They need the source code. They need access to internal systems,
00:11:56.080 | whether that be an internal database, a cluster, etc. So when you give an agent the ability to access
00:12:01.760 | source code and iterate, it can do it inside of that same dev environment with the same access and
00:12:06.160 | privileges that an individual human would as well. Because of like our target market and the customers we go
00:12:13.520 | forward that our agent offering as well is also privacy first, because what we've effectively built here allows us to
00:12:19.200 | not only run inside the customers infrastructure, but there's lots of other additional benefits that come from that as well.
00:12:25.280 | Let's say things like audit logs. Once we did that, and I'll just jump back a few slides. Once we re-architected to this
00:12:30.880 | architecture, one of the other first principles that we came back to is building this in an API first way.
00:12:36.640 | So every action interaction when you spin up a dev environment, everything that you do is then audit logs and allows us to give
00:12:43.360 | our customers also an entire sort of audit history on the platform, which also then extends to once you
00:12:49.440 | start running these workloads with agents, you also benefit from that as well. So that audit log also runs for those agent tasks.
00:12:56.080 | What time is it? I have two more minutes and the other guy is straight after. So that is a very sort of
00:13:07.040 | whirlwind run through. We do have a booth downstairs. I will be here and we can dig into this a little bit deeper if you want to as well.
00:13:14.080 | I can show you how this works in our context as well, but like when you're then ultimately looking
00:13:18.000 | at if you're purchasing other tools, AI tools, these are the types of considerations you want to make
00:13:22.400 | about what is the architecture and infrastructure and what sort of different qualities they do have.
00:13:26.640 | But then also if you're a vendor as well, there's probably a lot of learnings here from our side that
00:13:30.640 | you can take from this to build out a more simplified effectively technical architecture
00:13:35.040 | rather than, you know, building on top of like difficult and complicated platforms like Kubernetes and
00:13:40.160 | ultimately keeping things simple for your customers as well.
00:13:42.480 | But that's it. Thank you very much. Hope you enjoyed the conference.