Your website’s traffic doesn’t just come from human visitors; bots play a significant role too. Search engines, social media platforms, and even AI systems deploy automated tools (robots, or 'bots') to crawl your site, extracting content and valuable information. To rank well on Google, for example, your content must be well-structured, with clear titles, readable text, and highly relevant information. This entire analysis is conducted by bots crawling your site!
But here’s the catch: Every time a bot crawls your site, it’s not “free.” Each request made by a bot consumes resources—whether it’s computational power or bandwidth. Most major bots respect a special file (`robots.txt`) that tells them which parts of your site they can or cannot access. As a site owner, you can use this file to control which bots are allowed to crawl your site.
```
User-agent: *
Allow: /
```
A simple rule that allows all bots to access all pages.
Let’s look at the impact this can have.
In May, across various Deco sites, bots were responsible for over 50% of the bandwidth consumed, even though they didn’t make up the majority of the requests.
Despite accounting for less than 20% of traffic, bots often consume significantly more bandwidth due to the specific pages they access. They tend to spend more time on larger pages, such as category pages in online stores, which naturally load more data. These pages often feature product variations and filters, making them even more data-heavy.
While Google’s bot respects the `nofollow` attribute, which tells crawlers not to follow a link, not all bots do. This means that pages with filter variations also need a `noindex` meta tag or a more specialized `robots.txt` configuration.
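As a concrete illustration: the meta-tag route is a single line in each filtered page’s `<head>` (`<meta name="robots" content="noindex">`), while the `robots.txt` route might look like the sketch below. The `filter` and `sort` parameter names are hypothetical; substitute whatever query parameters your store uses for variations. Note that `*` wildcards are honored by major crawlers such as Googlebot and Bingbot, but they are not part of the original robots.txt standard.

```
# Keep compliant crawlers away from filtered category variations.
# The parameter names below are only examples.
User-agent: *
Disallow: /*?*filter=
Disallow: /*?*sort=
```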
AI is changing the game when it comes to data extraction, release, and value.
The demand for massive amounts of data for processing has led to the creation of more bots, particularly crawlers operating on the web. Data is more valuable than ever, yet there’s no guarantee of immediate returns for those who hand over their data to third parties. The third-largest consumer of bandwidth (Amazonbot) and several others (Ahrefs, Semrush, Bing) are known as “good bots.” These verified bots respect the `robots.txt` file, allowing you to control how and what they access. A possible configuration for managing these bots is shown below:
```
User-agent: googlebot
User-agent: bingbot
Allow: /
Disallow: /search

User-agent: *
Allow: /$
Disallow: /
```
This allows Google and Bing bots to crawl your site, except for search pages, while restricting all other bots to the site’s root.
This setup grants broad access to valuable, known bots but limits overly aggressive crawling of all your site’s pages. However, notice how the second-highest bandwidth consumer is ClaudeBot—an AI bot notorious for consuming large amounts of data while disregarding the `robots.txt` file. In this new AI-driven world, we’re seeing more of these kinds of bots.
At deco.cx, we offer a standard `robots.txt` similar to the example above for our sites, but for bots that don’t respect this standard, the only way to control access is to block them at the CDN (in our case, Cloudflare). At Deco, we use three approaches to block these bots (a rough sketch of the logic follows the list):
- Block by User-Agent: Bots that ignore `robots.txt` but have a consistent user-agent can be blocked directly at our CDN.
- Challenge by ASN: Some bots, especially malicious ones, come from networks (ASNs) known for abusive automated traffic. We present requests from these networks with a challenge that is difficult for machines to solve.
- Limit Requests by IP: After a certain number of requests from a single origin, we present a challenge; clients that fail it are temporarily blocked.
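To make these three approaches concrete, here is a minimal sketch of the same logic written as generic TypeScript edge middleware. It is an illustration rather than our actual Cloudflare configuration, and the user-agent names, ASN numbers, and thresholds are placeholders.

```ts
// Illustrative only: the same three checks we apply at the CDN,
// expressed as generic edge middleware. All values below are placeholders.
const BLOCKED_USER_AGENTS = ["ExampleBadBot", "AnotherScraper"]; // hypothetical bot names
const CHALLENGED_ASNS = new Set([64496, 64511]); // placeholder ASNs (range reserved for documentation)
const MAX_REQUESTS_PER_WINDOW = 300; // placeholder threshold
const WINDOW_MS = 60_000; // one-minute window

type Verdict = "allow" | "block" | "challenge";

// Naive in-memory counter; a real deployment would rely on the CDN's own rate limiting.
const hitsByIp = new Map<string, { count: number; windowStart: number }>();

export function classifyRequest(
  req: Request,
  client: { ip: string; asn: number },
): Verdict {
  const ua = req.headers.get("user-agent") ?? "";

  // 1. Block by User-Agent: bots that ignore robots.txt but identify themselves consistently.
  if (BLOCKED_USER_AGENTS.some((bot) => ua.includes(bot))) return "block";

  // 2. Challenge by ASN: networks known for abusive automated traffic.
  if (CHALLENGED_ASNS.has(client.asn)) return "challenge";

  // 3. Limit requests by IP: challenge once a single origin exceeds the threshold.
  const now = Date.now();
  const entry = hitsByIp.get(client.ip);
  if (!entry || now - entry.windowStart > WINDOW_MS) {
    hitsByIp.set(client.ip, { count: 1, windowStart: now });
    return "allow";
  }
  entry.count++;
  return entry.count > MAX_REQUESTS_PER_WINDOW ? "challenge" : "allow";
}
```

In practice these checks run as rules at the CDN edge, and the challenge itself (for example a managed challenge or CAPTCHA) is served by the CDN rather than by application code.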
These rules have effectively controlled most bots…
…except Facebook.
We’ve discussed bots that respect `robots.txt`. And then there’s Facebook.
Just before Facebook’s new privacy policy went into effect—allowing user data to be used for AI training—we noticed a significant spike in activity from Facebook’s bot on our networks. This resulted in a substantial increase in data consumption, as shown in the graph below.
More details on Facebook’s new privacy policy
Aggregate data traffic for a set of sites in June 2024.
The Facebook bot typically fetches data when a link is shared on the platform, including details about the image and site information. However, we discovered that the bot wasn’t just fetching this data—it was performing a full crawl of sites, aggressively and without respecting `robots.txt`!
Moreover, Facebook’s crawler uses a wide range of IPv6 addresses, so the crawl doesn’t come from one or a few IPs, which made it difficult to block with our existing controls. We didn’t want to block Facebook entirely, as this would disrupt link sharing, but we also didn’t want to let its bots consume excessive resources. To address this, we implemented more specific control rules, limiting access across Facebook’s entire network…
Aggregate data traffic for a set of sites in July 2024.
…which proved to be highly effective.
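One way to picture such a rule: instead of counting requests per IP, count them per network (ASN), so that the many individual IPv6 addresses Facebook’s crawler uses all draw from the same budget. The sketch below is a hypothetical TypeScript illustration of that idea, not our production configuration; AS32934 belongs to Meta/Facebook, and the budget and window are placeholders.

```ts
// Hypothetical sketch: rate-limit crawler traffic per network (ASN) instead of per IP,
// so many distinct IPv6 addresses still share a single budget.
const LIMITED_ASNS = new Set([32934]); // Meta/Facebook's ASN
const MAX_REQUESTS_PER_WINDOW = 1_000; // placeholder budget for the whole network
const WINDOW_MS = 60_000; // one-minute window

const hitsByAsn = new Map<number, { count: number; windowStart: number }>();

export function allowCrawlerRequest(client: { asn: number }): boolean {
  // Traffic from networks we don't limit passes straight through.
  if (!LIMITED_ASNS.has(client.asn)) return true;

  const now = Date.now();
  const entry = hitsByAsn.get(client.asn);
  if (!entry || now - entry.windowStart > WINDOW_MS) {
    hitsByAsn.set(client.asn, { count: 1, windowStart: now });
    return true;
  }
  entry.count++;
  // Beyond the budget, further requests from this network are challenged or dropped.
  return entry.count <= MAX_REQUESTS_PER_WINDOW;
}
```

This keeps ordinary link-preview fetches, which amount to only a handful of requests per shared URL, well under the limit, while capping aggressive full-site crawls coming from the same network.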
A final word of caution: adopting an overly aggressive approach has its downsides. Restricting access to unknown bots might prevent new technologies and tools that could benefit your site from interacting with it. For example, a new AI that could recommend specific products to visitors might be inadvertently blocked. It’s crucial to strike a balance, allowing selective bot access in line with market evolution and your business needs.
In summary, bots and crawlers are valuable allies, but managing their access requires strategic thinking. The key is to allow only beneficial bots to interact with your site while staying alert to new technologies that might emerge. This balanced approach will ensure that your business maximizes return on traffic and resource consumption.