🚀 A New Era of Bots and Crawlers

How They’re Draining Your Website’s Resources
08/15/2024 · Matheus Gaudêncio


Your website’s traffic doesn’t just come from human visitors; bots play a significant role too. Search engines, social media platforms, and even AI systems deploy automated tools (robots, or 'bots') to crawl your site, extracting content and valuable information. To rank well on Google, for example, your content must be well-structured, with clear titles, readable text, and highly relevant information. This entire analysis is conducted by bots crawling your site!

But here’s the catch: Every time a bot crawls your site, it’s not “free.” Each request made by a bot consumes resources—whether it's computational power or bandwidth. Most major bots respect a special file (robots.txt) that tells them which parts of your site they can or cannot access. As a site owner, you can control which bots are allowed to crawl your site.


User-agent: *
Allow: /

A simple rule that allows all bots to access all pages.

Let’s look at the impact this can have.

In May, across various Deco sites, bots were responsible for over 50% of the bandwidth consumed, even though they didn’t make up the majority of the requests.

Bandwidth consumption by bot across Deco sites in May.

Despite accounting for less than 20% of requests, bots often consume significantly more bandwidth because of the pages they target. They gravitate toward larger pages, such as category pages in online stores, which naturally return more data. These pages often include product variations and filters, making them even heavier.

While Googlebot respects the nofollow attribute, which signals that a link should not be followed or crawled, not all bots do. This means that pages generated by filter variations also need a noindex meta tag or a more specific robots.txt configuration.
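As an illustration, filter-variation URLs can be excluded either with a robots meta tag in the page's <head>, such as <meta name="robots" content="noindex">, or with a wildcard rule in robots.txt. The sketch below assumes a hypothetical ?filter= query parameter; adapt the pattern to your site's actual URL structure.


User-agent: *
Disallow: /*?filter=

A rule like this keeps compliant crawlers away from any URL containing the filter parameter. Note that the * wildcard is honored by Google and Bing but not by every bot, and that a URL blocked in robots.txt will never have its noindex tag read, so pick one mechanism per page.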

AI: The New Gold Rush

AI is changing the game when it comes to how data is extracted, shared, and valued.

The demand for massive amounts of data for processing has led to the creation of more bots, particularly crawlers operating on the web. Data is more valuable than ever, yet there’s no guarantee of immediate returns for those who hand over their data to third parties. The third-largest consumer of bandwidth (Amazonbot) and several others (Ahrefs, Semrush, Bing) are known as “good bots.” These verified bots respect the robots.txt file, allowing you to control how and what they access. A possible configuration for managing these bots is shown below:


User-agent: googlebot
User-agent: bingbot
Allow: /
Disallow: /search

User-agent: *
Allow: /$
Disallow: /

This allows Google and Bing bots to crawl your site, except for search pages, while restricting all other bots to the site’s root.

This setup grants broad access to valuable, known bots but limits overly aggressive crawling of all your site’s pages. However, notice how the second-highest bandwidth consumer is ClaudeBot—an AI bot notorious for consuming large amounts of data while disregarding the robots.txt file. In this new AI-driven world, we’re seeing more of these kinds of bots.

At deco.cx, we ship a standard robots.txt similar to the example above with our sites, but for bots that don't respect this standard, the only way to control access is to block them at the CDN (in our case, Cloudflare). We use three approaches to block these bots (a sketch of what such rules can look like follows the list):

- Block by User-Agent: Bots that ignore robots.txt but have a consistent user-agent can be blocked directly at our CDN.

- Challenge by ASN: Some bots, especially malicious ones, originate from networks (ASNs) known for abusive traffic. We serve requests from these networks a challenge that is difficult for machines to solve.

- Limit Requests by IP: After a certain number of requests from a single IP, we present a challenge that must be solved correctly to avoid a temporary block.
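To make this more concrete, the sketch below shows what the first two controls can look like as Cloudflare custom-rule expressions; the user-agent string and the ASN are placeholders rather than our production values. The first expression would be paired with a Block action and the second with a Managed Challenge action, while the third control maps to a rate limiting rule that counts requests per client IP and challenges or blocks above a chosen threshold.


(http.user_agent contains "BadBot")

(ip.geoip.asnum eq 64500)

Illustrative expressions only; any CDN or WAF that can match on user-agent, ASN, and request rate can enforce the same three controls.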

These rules have effectively controlled most bots…

…except Facebook.

“Facebook, Are You Okay?”

We’ve discussed bots that respect robots.txt. And then there’s Facebook.

Just before Facebook's new privacy policy went into effect, allowing user data to be used for AI training, we noticed a significant spike in activity from Facebook's bot on our networks. This resulted in a substantial increase in data consumption, as shown in the graph below.

More details on Facebook’s new privacy policy

Aggregate data traffic for a set of sites in June 2024.

The Facebook bot typically fetches data when a link is shared on the platform, pulling the page's image and site information to build the link preview. However, we discovered that the bot wasn't just fetching this data: it was performing a full crawl of sites, aggressively and without respecting robots.txt!

Moreover, Facebook crawls from a wide range of IPv6 addresses, so the traffic doesn't come from a single IP or a small set of IPs, which made it difficult to block with our existing controls. We didn't want to block Facebook entirely, as this would break link sharing, but we also didn't want its bots consuming excessive resources. To address this, we implemented more specific control rules, limiting access across Facebook's entire network…

Aggregate data traffic for a set of sites in July 2024.

…which proved to be highly effective.
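We haven't reproduced our exact rules here, but one way to express this kind of network-wide limit in Cloudflare is to match on Meta's network and crawler identity instead of individual IPs, and feed that match into a rate limiting or challenge rule. In the sketch below, AS32934 and the facebookexternalhit user-agent are the publicly documented values for Meta's network and link-preview crawler; treat them as assumptions to verify before deploying anything similar.


(ip.geoip.asnum eq 32934) or (http.user_agent contains "facebookexternalhit")

Paired with a rate limit rather than an outright block, a match like this lets ordinary link-preview fetches through while throttling sustained crawling from the whole network.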

Blocking Too Much Could Hurt Your Visibility to Emerging Bots and Technologies

A final word of caution: adopting an overly aggressive approach has its downsides. Restricting access to unknown bots might prevent new technologies and tools that could benefit your site from interacting with it. For example, a new AI that could recommend specific products to visitors might be inadvertently blocked. It’s crucial to strike a balance, allowing selective bot access in line with market evolution and your business needs.

In summary, bots and crawlers are valuable allies, but managing their access requires strategic thinking. The key is to allow only beneficial bots to interact with your site while staying alert to new technologies as they emerge. This balanced approach helps ensure your business gets the most value from its traffic relative to the resources that traffic consumes.