How to defend your sites from AI bots — David Mytton, Arcjet



00:00:00.040 | Hi everyone. So my name is David. I'm the founder of Arcjet. We provide a security SDK for developers.
00:00:21.780 | So everything I'm going to be talking to you about today is what we've been building for
00:00:25.800 | the last few years, but how you can do it yourself. So if you haven't had bots visiting your website
00:00:33.620 | and felt the pain, then you might be thinking, well, is this really a problem? Well, as you
00:00:38.460 | just heard in the introduction, almost 50% of web traffic today is automated clients.
00:00:45.600 | And that varies depending on the industry. In gaming, almost 60% of all traffic
00:00:50.100 | is automated. And that's before the agent revolution has really kicked off. This isn't
00:00:56.380 | a new problem. It's been going on since the invention of the internet. And there are bots
00:01:02.160 | that you want to visit your website, like Googlebot, but there are also a lot of malicious crawlers.
00:01:08.140 | And this causes a problem. The first incident you might experience is around expensive requests.
00:01:16.100 | So think through what happens on your website. If it's a static site, then maybe it's not doing
00:01:21.020 | much on your infrastructure. But if you're generating any content from a database or you're reading
00:01:26.280 | some dynamic content in some way, then each request is going to cost something, particularly
00:01:31.920 | if you're using a serverless platform, paying per request. If you have huge numbers of automated
00:01:36.720 | clients coming in, making requests, making hundreds of thousands of requests, then this starts to
00:01:41.320 | build up as a cost problem, and also being able to deal with that on your infrastructure.
00:01:48.700 | These clients can also be requesting all the assets. So downloading large files, that's going to start
00:01:53.240 | eating into your bandwidth costs and eating into the available resources you have to serve legitimate
00:01:59.040 | users on your site. This can show up as a denial of service attack. So your service just might not be
00:02:06.440 | available to others, and even the largest website doesn't have infinite resources. Serverless means
00:02:13.960 | that you don't have to think about that for the most part, but then you end up handling it as part of your bill instead.
00:02:19.560 | This has been a problem for decades. And so the real question is, well, is AI making this worse?
00:02:29.400 | And we see complaints in the media, websites talking about the traffic that they're getting, and there's just an
00:02:36.280 | automatic assumption that this is AI. And on the face of it, there's no real evidence that that is the case.
00:02:41.720 | But when you start looking into the details about the kind of requests that these sites are seeing, then AI is making it worse.
00:02:48.920 | So, for instance, Diaspora, which is an online open source community, they saw that 24% of their traffic was from GPTBot,
00:02:58.440 | which is OpenAI's crawler. And then there's ReadTheDocs, which is an online documentation platform for code projects.
00:03:08.440 | They found that by blocking all AI crawlers, they reduced their bandwidth from 800 gigabytes a day to 200
00:03:15.160 | gigabytes a day. And even Wikipedia is having this problem. Up to 35% of their traffic is just
00:03:22.840 | serving automated clients. And they're seeing this increasing significantly and attributing that to AI crawlers.
00:03:29.480 | So AI is making this worse. Scrapers are coming onto sites and pulling down the content,
00:03:36.280 | and they're not behaving nicely. They're not doing it in a gradual way. And they're making hundreds of
00:03:41.400 | thousands of requests and just pulling down content without following the rules.
00:03:46.040 | In the old days, we had this idea of good bots and bad bots. And the challenge was always
00:03:53.320 | distinguishing between them. If you want your website to show up in a search index like Google,
00:03:59.000 | then Google has to know about your site, has to visit and understand your site. But you get a benefit
00:04:03.880 | from that because you're going to appear in the search index and you're going to get traffic as a
00:04:07.720 | result. And so most people consider Google to be a good bot. And then there's the bad bots,
00:04:14.040 | which are obviously bad. Scrapers come into your site, downloading all the images, downloading all the
00:04:18.120 | content, downloading files. It was very easy to understand that those are the bad bots. But in the
00:04:24.440 | middle, we've got these AI crawlers. And sometimes they're good, sometimes they're bad. And it depends
00:04:30.040 | sometimes on your philosophical approach to AI, but also on what you want from your website. Because
00:04:36.600 | the first kinds of AI bots we were seeing were for training, just to build up the models. And in theory,
00:04:42.920 | there's no benefit to the site owner for that because it's just being built into the model. You're not
00:04:46.920 | necessarily getting any traffic. But things have started to change with multiple bots coming from the
00:04:53.560 | different AI providers. So for instance, with OpenAI, they have at least four different types of bots.
00:04:59.640 | So the first one is OAI-SearchBot, the OpenAI search bot. This is a kind of classic Googlebot-type crawler, which will
00:05:07.560 | come to your site. It will understand what's going on and it will index it so that when someone makes a
00:05:12.440 | query into ChatGPT using the search functionality, you show up in OpenAI's index. Now, in most cases,
00:05:19.160 | you're going to want that. It's going to do the same thing as Google. You're going to appear in a search
00:05:22.840 | index. You're probably going to get citations. And that wasn't the case at the very beginning, but now
00:05:28.440 | you're getting citations. And this is becoming a real source of traffic for sites and for services. People
00:05:34.440 | are getting signups as a result. And so there's a win-win. It's the same as the old Google crawlers.
00:05:39.640 | Then there's ChatGPT-User. And this is a little more nuanced. It's where ChatGPT may show up at your
00:05:47.560 | website as a result of a real-time query that a user is making. Maybe you drop the actual URL into
00:05:53.800 | the chat and ask it to summarize the content. Or it's a documentation link and you want to understand
00:05:58.840 | how to implement something and it's going out and getting that content. It's not used for training,
00:06:03.480 | but it may not cite the response. But if you've given it the URL, then perhaps you're a legitimate
00:06:10.760 | user. And so maybe you do want that because it's actually your users making use of LLMs.
00:06:15.720 | And then there's GPTBot, which is the one that we saw was taking up a huge amount of
00:06:20.440 | traffic on Wikipedia and Diaspora. And this is the original one that is part of the training.
00:06:26.920 | It doesn't benefit you directly. You're being brought into the model. And there's often no
00:06:32.520 | citation as a result. These are kind of the three crawler bots that you might see on your site.
00:06:38.360 | And then what we're seeing more of now is the computer use operator type bots, which are acting
00:06:44.840 | on behalf of a real person, possibly with a web browser that's running in a VM that is taking an
00:06:51.240 | action as an agent, an autonomous agent. And this becomes challenging to understand, well, do you want
00:06:56.280 | that or not? Maybe it's a legitimate use case. Maybe the agent is doing triage of your inbox. Maybe Google
00:07:02.360 | would want that if it's Gmail. But if you've asked an agent to go out and buy 500 concert tickets,
00:07:08.200 | so you can then go and sell them for a profit, that's probably something you don't want to allow.
00:07:13.160 | And so being able to detect these is really challenging. The OpenAI crawlers identify
00:07:18.760 | themselves as such. You can verify that. And so you can allow or block them. But something like Operator
00:07:24.920 | just shows up as a Chrome browser. And it's much more challenging to understand and detect that.
00:07:30.440 | So let's walk through some of the defenses that you can implement to decide, as a site owner, how you
00:07:38.280 | can control the kind of traffic that's coming to your site. So the first one of these isn't really a
00:07:43.560 | defense because it's entirely voluntary. Everyone's probably heard of robots.txt. It's how you can describe
00:07:49.960 | the structure of your website and tell different crawlers what you want them to do. You can allow
00:07:55.240 | or disallow. You can control particular crawlers. And this gives you a good understanding of your own
00:08:02.120 | site to think through the steps that you want to take to allow or disallow. But it's entirely voluntary.
00:08:07.080 | Crawlers don't have to follow it, but the good ones will. Googlebot will follow this, as will all the search
00:08:13.160 | crawlers. OpenAI claims to follow it and does for the most part as well. But for the types of bots that are
00:08:19.160 | causing these problems, they're not following this. And in some cases, they're actually using this
00:08:24.360 | to find pages on your site that you've disallowed other bots to go to and deliberately going out and
00:08:29.240 | getting that content. Even so, this is a good place to start because it helps you start to think through
00:08:34.840 | what you want different bots to be doing on your site.
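As a concrete illustration, a minimal robots.txt along these lines might look like the following. The user agent tokens are the publicly documented ones for Google and OpenAI; the paths are placeholders, not anything from the talk.

```
# Allow classic search crawlers everywhere
User-agent: Googlebot
Allow: /

# Allow OpenAI's search crawler, but block the training crawler
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

# Keep everyone else out of private paths (placeholder path)
User-agent: *
Disallow: /admin/
```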
00:08:37.800 | Every request that comes into your site is going to identify itself. This is the User-Agent HTTP header
00:08:47.000 | and it's just a string. It is a name that the crawler is going to give itself. And you'll see
00:08:52.840 | that in your request logs. It's just a string because any client can set whatever they like for
00:08:59.320 | this. But it's surprising how many will actually just tell you who they are. And you can use open
00:09:05.000 | source libraries to detect this and create rules around it. At Arcjet, we've got an open source project with
00:09:12.200 | several thousand different user agents that you can download and use to build your own rules to identify
00:09:18.040 | who you want to access your site. But it's just a string in an HTTP header and you can set it to whatever
00:09:24.280 | you want. And so the bad bots will just change this. They'll pretend to be Google or they'll pretend to be
00:09:29.880 | Chrome. And so it's not always a good signal about who's actually visiting your site. And so the next
00:09:37.000 | thing you can do is to verify that. If a request is made to your site and it claims to be Apple's crawler,
00:09:45.080 | Bing, Google, OpenAI, all of these services support verification. So you can look at the source IP address
00:09:54.600 | and you can query those services using a reverse DNS lookup to check whether it is actually who it
00:09:59.880 | claims to be. So if you see a request coming from Google, you can ask Google, is this actually Google?
00:10:06.040 | And they'll give you a response back saying whether it is or not. And this makes it quite straightforward
00:10:12.040 | to use the combination of the user agent string plus IP verification to check whether it's the good
00:10:18.680 | bots that are visiting your site, and to set up some simple rules to allow the crawlers that you actually want to be on the site.
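As a rough sketch of how those two signals can be combined in code, the TypeScript below checks whether a request claims to be Googlebot and then runs the reverse-then-forward DNS check that Google documents for verifying its crawler. The function name and example values are illustrative, not from the talk.

```typescript
import { promises as dns } from "node:dns";

// Verify a request that claims to be Googlebot using the documented
// reverse DNS -> forward DNS check. Hostnames for genuine Googlebot
// requests end in googlebot.com or google.com.
async function isVerifiedGooglebot(ip: string, userAgent: string): Promise<boolean> {
  if (!/Googlebot/i.test(userAgent)) return false; // doesn't even claim to be Googlebot

  try {
    // Reverse lookup: IP -> hostname(s)
    const hostnames = await dns.reverse(ip);
    for (const hostname of hostnames) {
      if (!/\.(googlebot|google)\.com$/.test(hostname)) continue;

      // Forward lookup: hostname -> IPs, which must include the original IP
      const addresses = await dns.resolve(hostname);
      if (addresses.includes(ip)) return true;
    }
  } catch {
    // DNS failures are treated as "not verified"
  }
  return false;
}

// Hypothetical example values
isVerifiedGooglebot("66.249.66.1", "Mozilla/5.0 (compatible; Googlebot/2.1)")
  .then((ok) => console.log(ok ? "verified Googlebot" : "unverified"));
```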
00:10:23.560 | Things start to get a bit more complicated if those signals don't provide you
00:10:31.480 | with sufficient information. Bot detection is not 100 percent accurate. And so you have to build up
00:10:37.720 | these layers. And so the next thing you can do is look at IP addresses. The idea is to build up a
00:10:45.240 | pattern to understand what is normal from each IP address. And not just a single IP address, but the
00:10:52.520 | different IP address ranges, how they associate with different networks and different network operators,
00:10:58.120 | whether the request is coming from a data center or not, and the country level information. And you can
00:11:04.440 | get this from various databases. You have to pay for access to most of them, but there are also some free
00:11:10.120 | APIs you can use to query the metadata associated with a particular IP address. MaxMind and IPinfo are two of the
00:11:17.800 | more popular ones. And you want to be looking at things like, well, where's the traffic coming from,
00:11:22.760 | and what's the association with the network? Is this coming from a VPN or a proxy? Is it a residential or a
00:11:29.480 | mobile IP address? And last year, 12 percent of all bot traffic that hit the Cloudflare network was from the AWS
00:11:39.400 | network. And so you can start to ask yourself, well, are the normal users of our site or application going to come from a data center?
00:11:47.880 | Maybe if you're allowing crawlers on your site, then that's expected.
00:11:53.880 | But if you have a signup form that you're expecting only humans to submit, then it's unlikely that a request that's
00:12:00.440 | coming from a data center IP address is going to be traffic that you want to accept.
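Here is a hedged sketch of what that kind of rule could look like in TypeScript. It assumes you have already looked up metadata for the source IP from a provider such as MaxMind or IPinfo and mapped it into a small record; the field names, thresholds, and decision logic are illustrative rather than any particular vendor's API.

```typescript
// Shape of the IP metadata we assume you've already looked up
// (e.g. from MaxMind or IPinfo); field names are illustrative.
interface IpMetadata {
  country: string;        // ISO country code
  isDatacenter: boolean;  // request originates from a hosting/cloud ASN
  isVpnOrProxy: boolean;  // known VPN or proxy exit
  asn: number;            // autonomous system number
}

type Decision = "allow" | "challenge" | "deny";

// Decide what to do with a signup request based on IP reputation signals.
// On a route only humans should hit, data center and proxy traffic is suspicious.
function decideSignupRequest(meta: IpMetadata): Decision {
  if (meta.isDatacenter) return "deny";       // real users rarely sign up from a cloud IP
  if (meta.isVpnOrProxy) return "challenge";  // could be legitimate, so challenge rather than block
  return "allow";
}

// Hypothetical example values
console.log(decideSignupRequest({ country: "US", isDatacenter: true, isVpnOrProxy: false, asn: 16509 }));
// -> "deny"
```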
00:12:07.000 | The challenge with looking at geodata, like blocking a single country, for instance,
00:12:11.000 | is that the geodata is notoriously inaccurate and has become more inaccurate over time as people are
00:12:17.640 | using satellite and cell phone connectivity, 5G, because the IP address will be geolocated to the
00:12:24.920 | owner of the IP rather than necessarily the user of it. And also, even when
00:12:30.600 | the database is saying that the IP address is coming from a residential network,
00:12:35.560 | there are proxy services that you can just buy access to which will route your traffic through those
00:12:41.560 | residential networks to appear like it's coming from a home ISP or a mobile device.
00:12:47.160 | So you can't always trust these, and you have to build up
00:12:50.840 | signals and build your own database to understand where this traffic is coming from and what
00:12:55.240 | the likelihood is that it's an automated client.
00:12:57.320 | CAPTCHAs are the standard thing that we've been using now for decades to try and distinguish between
00:13:07.880 | humans and automated clients, solving puzzles and moving things around on the screen. But it's becoming
00:13:14.280 | increasingly easy for AI to solve those. Putting them into an LLM, or downloading the audio version
00:13:20.760 | and transcribing it, can be done in just a couple of seconds, and it's trivial and cheap to breach these kinds of defenses.
00:13:29.160 | There are newer approaches to this. Proof of work, which has come from the crypto side of things,
00:13:38.200 | means that you require a computer to do a certain number of calculations and provide the answer to a puzzle
00:13:47.080 | before they can access the resource. And this usually takes a certain amount of time. It costs CPU time.
00:13:53.240 | And on an individual basis, on your laptop or on your phone, it might take a second or two to calculate
00:13:59.160 | it, and it makes no real difference to an individual. But if you have a crawler that's going to tens of thousands
00:14:05.160 | or millions of websites and is having to solve this puzzle every single time, it becomes very expensive to do that.
00:14:11.800 | And so deploying these proof of work options on your website can be a way to prevent those crawlers.
00:14:17.880 | But then it becomes a question of incentives. So if you're crawling millions of websites, then maybe that is a good defense.
00:14:26.520 | But if we go back to that ticket example, if it costs someone a couple of dollars to solve a CAPTCHA or a proof of work challenge,
00:14:34.360 | but they're then going to sell a ticket for $200 or $300, the profit is still there. And so these may not even be a defense
00:14:42.360 | against certain types of attacks. You can scale the difficulty. So if you bring in all these different signals and see that something is coming from an unverified IP address
00:14:53.320 | and has suspicious characteristics, then maybe you could give them a harder puzzle. But then you start to have accessibility problems,
00:15:01.960 | and I'm sure we've all seen those really annoying CAPTCHAs that you can't solve and you have to keep refreshing.
00:15:06.920 | That becomes a problem as well.
00:15:09.160 | There are a few interesting open source projects that implement these. Anubis is a good one,
00:15:17.000 | as are GoAway and Nepenthes. These are all proxies that you can install on a Kubernetes cluster or put in front of your application.
00:15:24.280 | You can run them yourself and they will implement these proof of work challenges and put them in front of the users they think are suspicious.
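To make the mechanism concrete, here is a minimal hashcash-style proof of work sketch in TypeScript: the server issues a random challenge and a difficulty, the client searches for a nonce whose SHA-256 hash over the challenge and nonce starts with that many zero bits, and the server verifies the answer with a single hash. It is a toy illustration of the idea, not how Anubis or the other projects actually implement it.

```typescript
import { createHash, randomBytes } from "node:crypto";

// Count leading zero bits of a SHA-256 digest over challenge + nonce.
function leadingZeroBits(challenge: string, nonce: number): number {
  const digest = createHash("sha256").update(`${challenge}:${nonce}`).digest();
  let bits = 0;
  for (const byte of digest) {
    if (byte === 0) { bits += 8; continue; }
    bits += Math.clz32(byte) - 24; // clz32 counts over 32 bits; a byte only uses the low 8
    break;
  }
  return bits;
}

// Server side: issue a random challenge with a difficulty (in zero bits).
function issueChallenge(difficulty = 20) {
  return { challenge: randomBytes(16).toString("hex"), difficulty };
}

// Client side: brute-force a nonce that satisfies the difficulty.
// Cheap for one page view, expensive across millions of crawled pages.
function solveChallenge(challenge: string, difficulty: number): number {
  let nonce = 0;
  while (leadingZeroBits(challenge, nonce) < difficulty) nonce++;
  return nonce;
}

// Server side: verification is a single hash.
function verifySolution(challenge: string, difficulty: number, nonce: number): boolean {
  return leadingZeroBits(challenge, nonce) >= difficulty;
}

const { challenge, difficulty } = issueChallenge();
const nonce = solveChallenge(challenge, difficulty);
console.log(verifySolution(challenge, difficulty, nonce)); // true
```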
00:15:30.600 | And there are also some emerging standards around introducing signatures into requests.
00:15:39.640 | Because what we're trying to do is to prove that a particular client is who they say they are and is someone you actually want on the website.
00:15:47.160 | Now, Cloudflare has suggested this idea of HTTP message signatures for automated clients where every request will include a cryptographic signature,
00:15:57.240 | which you can then verify very quickly. And then you can understand which client is coming to your site.
00:16:04.280 | It's only just been announced a couple of weeks ago, so it's still being developed.
00:16:07.960 | There's some questions around whether it's any better than just verifying the IP address, but it's a way of verifying automated clients.
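For a rough sense of the mechanics, here is a heavily simplified TypeScript sketch that just verifies an Ed25519 signature over some agreed-upon string using a public key the bot operator has published. The real proposal builds on RFC 9421 HTTP Message Signatures, which defines exactly which request components are signed and how the headers are encoded; none of that detail is reproduced here, and the function shown is purely illustrative.

```typescript
import { createPublicKey, verify } from "node:crypto";

// Verify that `signatureB64` is a valid Ed25519 signature over `signedData`,
// using a public key the bot operator has published (PEM format assumed here).
// What exactly goes into `signedData` is defined by RFC 9421 in the real scheme;
// this sketch treats it as an opaque agreed-upon string.
function verifyBotSignature(signedData: string, signatureB64: string, publicKeyPem: string): boolean {
  try {
    const key = createPublicKey(publicKeyPem);
    return verify(null, Buffer.from(signedData), key, Buffer.from(signatureB64, "base64"));
  } catch {
    return false; // malformed key or signature counts as unverified
  }
}
```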
00:16:14.840 | And then a couple of years ago, Apple announced Private Access Tokens and what they call Privacy Pass,
00:16:21.320 | which allowed website owners to verify that a request was coming from a browser that was owned by an iCloud subscriber.
00:16:31.320 | This has been implemented across all Apple devices. And if you're using Safari, this is on.
00:16:36.120 | And it will reduce the number of CAPTCHAs that you might see because you can verify that someone's actually a paying subscriber to iCloud.
00:16:42.920 | But it's had limited adoption elsewhere. Not many sites are using it. And it's only on the Apple ecosystem, even though it's almost an approved standard.
00:16:56.520 | And then we have to implement fingerprints as well. So fingerprinting is looking at the network requests
00:17:02.360 | to generate a hash to be able to identify that client because it's quite trivial to change the IP address that your requests are coming from.
00:17:11.400 | And you'll often see crawlers using banks of tens or hundreds of thousands of different IP addresses,
00:17:17.160 | particularly with IPv6, which means implementing signatures based just on an IP address isn't sufficient.
00:17:24.760 | But the client stays the same. The client characteristics stay the same across multiple requests.
00:17:30.760 | And you can build up a fingerprint of that. This is the open source JA4 hash, which is based on the TLS fingerprint,
00:17:37.880 | looking at the network level and looking at the configuration of SSL.
00:17:43.160 | And then there's a proprietary version for HTTP, which looks at the different headers that are sent by a client
00:17:50.440 | and the characteristics of an HTTP request to build up a fingerprint. And then you can use those fingerprints
00:17:55.560 | as part of your block rules. So you could look at all the hundreds of thousands of requests coming from a single fingerprint.
00:18:01.640 | You could just block that fingerprint, regardless of how many IP addresses it's coming across.
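As an illustration of the HTTP side of that idea, the sketch below builds a very crude header-based fingerprint, a hash over the header names in the order they were sent plus a couple of relatively stable values, and checks it against a blocklist. This is not the JA4 or JA4H algorithm, just a simplified stand-in to show how a fingerprint can become a block key that survives IP rotation.

```typescript
import { createHash } from "node:crypto";

// Build a crude fingerprint from the order of header names plus a few
// relatively stable values. Real fingerprints (JA4, JA4H) are far more
// precise and also look at the TLS handshake; this is just the shape of the idea.
function httpFingerprint(headers: [name: string, value: string][]): string {
  const names = headers.map(([name]) => name.toLowerCase()).join(",");
  const stable = headers
    .filter(([name]) => ["accept-language", "accept-encoding"].includes(name.toLowerCase()))
    .map(([, value]) => value)
    .join("|");
  return createHash("sha256").update(`${names}#${stable}`).digest("hex").slice(0, 16);
}

// Fingerprints you've decided to block, however many IPs they come from.
const blockedFingerprints = new Set<string>(["deadbeefdeadbeef"]); // placeholder value

function shouldBlock(headers: [string, string][]): boolean {
  return blockedFingerprints.has(httpFingerprint(headers));
}

// Hypothetical request headers, in the order the client sent them
console.log(shouldBlock([
  ["Host", "example.com"],
  ["User-Agent", "Mozilla/5.0"],
  ["Accept-Language", "en-GB"],
  ["Accept-Encoding", "gzip, br"],
]));
```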
00:18:10.040 | And then rate limiting is used in conjunction with a fingerprint. Once you can fingerprint the client,
00:18:15.320 | you can apply quotas or a limit to it. And the choice of key is really important there. You can't just rate limit on an IP address,
00:18:21.480 | because people have different IPs that change all the time, and malicious crawlers can just change them themselves.
00:18:28.040 | And so keying off user session ID is a good way to do it. If the user is logged in and you want to apply your rate limits,
00:18:35.480 | or if you've got the fingerprint, the JA4 hash, you can implement rate limits on that.
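Here is a minimal sketch of that kind of keyed rate limit: a fixed-window counter in memory, keyed by whatever identifier you trust most, the session ID for logged-in users or otherwise a fingerprint such as the JA4 hash. A production limiter would live in shared storage such as Redis and would likely use a sliding window or token bucket; the window and quota values here are arbitrary.

```typescript
// Fixed-window counter keyed by session ID or fingerprint.
// In-memory only: fine for a single process, use Redis or similar in production.
const WINDOW_MS = 60_000;  // 1 minute window (arbitrary)
const MAX_REQUESTS = 100;  // allowed requests per key per window (arbitrary)

const windows = new Map<string, { start: number; count: number }>();

function allowRequest(key: string, now = Date.now()): boolean {
  const current = windows.get(key);
  if (!current || now - current.start >= WINDOW_MS) {
    windows.set(key, { start: now, count: 1 }); // start a new window for this key
    return true;
  }
  current.count++;
  return current.count <= MAX_REQUESTS;
}

// Key on the most reliable identifier available: session ID if logged in,
// otherwise the client fingerprint (e.g. a JA4 hash), not the raw IP.
function rateLimitKey(sessionId?: string, fingerprint?: string): string {
  return sessionId ? `session:${sessionId}` : `fp:${fingerprint ?? "unknown"}`;
}

console.log(allowRequest(rateLimitKey(undefined, "deadbeefdeadbeef"))); // true until the quota runs out
```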
00:18:42.120 | So these are the eight defenses. Robots.txt is where you start. It's not where you finish though,
00:18:51.720 | because it's not going to prevent all the bots. It's a voluntary standard. It's where you start because
00:18:56.920 | it helps with the good bots. At the very least, you need to be looking at user agents. There are various
00:19:02.680 | open source options for looking at that and setting up rules, and then verifying that the user agent for
00:19:08.600 | clients that you actually want on your site are the ones that are actually making the requests.
00:19:13.000 | That gets you most of the way. For most sites, that will deal with everything you need. But for the more
00:19:19.560 | popular ones, or sites with particularly interesting resources, or things that people might want to buy
00:19:27.000 | in large numbers or that are in restricted quantities, you need to go further: looking at IP
00:19:32.040 | reputation, setting up proof of work, considering these experimental HTTP signatures, and certainly
00:19:39.000 | the fingerprint side of things is where most people land in combination with the rate limits.
00:19:44.760 | You can implement all these yourselves in code. That's what we do at Arcjet. There's a much more
00:19:50.680 | detailed write-up of this talk on the blog that I just published earlier today. So if you have a look
00:19:55.480 | at blog.arcjet.com, there's a full write-up of this talk with much more detailed examples. I'm happy to
00:20:01.320 | answer any questions via email, and we also have a booth down in the expo. Thank you very much.