
Increasing the volume of visits is one of the basic objectives of most websites, and in most cases the largest share of those visits comes from the main internet search engines, such as Google and Bing.

For the pages of our site to appear in the result pages of these search engines, they must first have been added to their indexes.

To index a site, search engines run specialized applications commonly referred to as “bots”, “crawlers” or “spiders”. These bots navigate web sites, reading the content of the pages they find.

In principle, being actively crawled by the different search engines is a good sign for a site. But the number of requests made by crawlers can become excessive, degrading the performance of the server and making it appear slow or unresponsive to users. In this post we review different ways to limit the crawl rate of the main internet search engines.

1. Identify which bots are crawling the site at an excessive crawl rate

The most straightforward way to evaluate the load imposed by bots on the server is to analyze the content of the site’s access logs and count the requests made by each user agent whose value contains the string “bot”, “spider” or “crawler”.
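A quick way to obtain these counts is a small script that extracts the user agent from each log line and tallies the ones that look like bots. The following is a minimal sketch in Python, assuming the Apache combined log format (where the user agent is the last quoted field) and a log file named access.log, which is an illustrative name:

    #!/usr/bin/env python
    # Count the requests made by bot user agents in an Apache access log.
    # Assumes the combined log format and a log file named "access.log"
    # (an illustrative name; adjust the path to your own log).
    import re
    from collections import Counter

    ua_re = re.compile(r'"([^"]*)"\s*$')              # last quoted field = user agent
    bot_re = re.compile(r'bot|spider|crawler', re.I)  # typical bot identifiers

    counts = Counter()
    with open("access.log") as log:
        for line in log:
            match = ua_re.search(line)
            if match and bot_re.search(match.group(1)):
                counts[match.group(1)] += 1

    # Print the ten most active bots and the grand total
    for agent, hits in counts.most_common(10):
        print("%8d  %s" % (hits, agent))
    print("Total bot requests: %d" % sum(counts.values()))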

To illustrate this post, we analyzed one day of the access log of one of our sites.

In this case the load imposed by bots does not seem to be excessive: 67,253 bot requests in one day is an average of about 0.78 requests per second, which most reasonably configured web servers should be able to handle.

Nevertheless, these requests consume CPU, RAM and bandwidth, and it is always worth knowing the possible ways to limit the crawl rate of the bots.

2. Forbid access to specific bots by means of rewrite rules

The most drastic measure that can be taken is to completely ban some bots from accessing the site.

This restriction can be established by adding rewrite rules to the configuration file of the web server, or to the .htaccess file in the root directory of the site.

Example:

The analysis of the site’s access log shows that the bot identified as “Baiduspider” makes a large number of requests.

But Baiduspider is the crawler used by Baidu, a Chinese search engine similar to Google or Bing. If the content of our site is not aimed at the average Baidu user, we may well decide to prevent this search engine from indexing our pages.

We can do this by adding a couple of mod_rewrite directives to the .htaccess file in the root directory of the site.
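A minimal sketch of such rules, assuming mod_rewrite is enabled (the pattern can be adjusted to match any other bot to be blocked):

    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} baiduspider [NC]
    RewriteRule ^ - [F]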

The [NC] (No Case) flag in the RewriteCond makes the pattern “baiduspider” case insensitive, so that it matches user agents containing “baiduspider”, “Baiduspider”, “BaiduSpider”, etc.

The [F] (Forbidden) flag in the RewriteRule tells Apache to return a “403 Forbidden” status code for requests that match the RewriteCond above.

3. Limiting the crawl rate of Bingbot and other crawlers in robots.txt

A “robots.txt” file can be placed at the root directory of the web site to be read by “well-behaved” crawlers. This file may include directives to allow, disallow or limit the accesses made by bots. In particular, Bingbot can be asked to limit its crawl rate by means of the “Crawl-delay” directive.
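For example, an entry along these lines (a sketch; the ten-second delay is illustrative):

    User-agent: bingbot
    Crawl-delay: 10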

Here, “Crawl-delay: 10” requests the Bing crawler to make no more than one request every ten seconds. The directive can also be written so that it applies to any crawler accessing the site.
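For instance, using the wildcard user agent (again a sketch with an illustrative delay):

    User-agent: *
    Crawl-delay: 10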

Unfortunately, many bots either do not recognize the “Crawl-delay” directive or are not willing to abide by it. Some of the other means explained in this post have to be used to limit the crawl rate of those badly behaved bots.

4. Limiting the crawl rate of Googlebot using Google Webmaster Tools

Googlebot does not use the “Crawl-delay” directive in the robots.txt file. Instead, Google Webmaster Tools includes a setting that lets webmasters limit its crawl rate.

5. Limiting the crawl rate of Baidu using Baidu Webmaster Tools

Like Google, Baidu does not recognize the “Crawl-delay” directive either, but it also provides a tool, Baidu Webmaster Tools, where webmasters can register their sites and limit the crawl rate of Baiduspider.

