Mar 10, 2013
 

Nowadays, many websites implement dynamic pages: When the page is loaded, or when the user interacts with the page, some of the page elements are updated with new content retrieved from the server by means of ajax javascript calls. This technique avoids having to send all the content that does not change (html code of the header and sidebar menu, css, js, images,…) with every request.

The problem with dynamic pages is that Googlebot (and the crawlers of other search engines) does not know about the existence of that dynamically loaded content, so it does not get crawled or indexed and does not appear in the search results pages. To solve this issue, Google has established a way to set up the website to allow crawling of dynamic content. The details of this setup are explained in this post.

Processing of a conventional dynamic web page

In a dynamic web page, the urls of the links in the page start with the ‘#’ symbol, and the “onclick” event triggers the execution of some javascript code that performs the intended action. For instance, a page in an e-commerce website could load the detail of a product with a link like the following.
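A sketch of such a link (the href value, the link text and the product id 36 are illustrative; load_detail() is the javascript function discussed below):

    <a href="#detail_36" onclick="load_detail(36)">View product detail</a>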

The “#” symbol is interpreted by browsers as a reference to a “named anchor” in the current page, and does not cause a page load action by the browser. The javascript function “load_detail()” in the above example would make an ajax request to the server to retrieve the detail information about the product with PROD_ID=36, and would assign it to the destination element (it could be an element such as <div class="productdetail">).
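A minimal sketch of what load_detail() could do with jQuery (the server endpoint “product_detail.php” is an assumption used only for illustration):

    function load_detail(prod_id) {
        // ask the server for the html fragment with the product detail
        // (the endpoint name is illustrative)
        $.get('product_detail.php', { PROD_ID: prod_id }, function (html) {
            // insert the returned html into the destination element
            $('div.productdetail').html(html);
        });
    }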


Note: In modern browsers, it is possible to use the “hashchange” event. With jQuery, we can assign an event handler to the hashchange event if it is supported by the browser, and fall back to assigning an event handler to the onclick event for older browsers. First, add the “hash-changer” class to the action links.
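For instance, reusing the illustrative product link from the example above:

    <a href="#detail_36" class="hash-changer">View product detail</a>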

Then, in the jQuery initialization code, assign an event handler to those links.
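A sketch of that initialization code, assuming the “detail_<id>” hash format used in the illustrative link above:

    $(function () {
        // illustrative helper: extract the product id from a hash or href such as "#detail_36"
        function detail_id(value) {
            var match = /detail_(\d+)/.exec(value);
            return match ? match[1] : null;
        }

        if ('onhashchange' in window) {
            // modern browsers: react whenever the url hash changes
            $(window).on('hashchange', function () {
                var id = detail_id(window.location.hash);
                if (id) { load_detail(id); }
            });
        } else {
            // older browsers: fall back to a click handler on the action links
            $('a.hash-changer').on('click', function () {
                var id = detail_id($(this).attr('href'));
                if (id) { load_detail(id); }
            });
        }
    });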


Google’s mechanism to index dynamic content

The dynamic content can be made accessible to Googlebot using the following setup:

1. Action links that trigger the load of dynamic content must start with “#!” (hash + exclamation mark). Google will identify those urls, and will make requests to the web server for the url of the page where they are found, adding an “_escaped_fragment_” argument whose value is the fragment of the action link. For instance, if the page is “http://www.domain.com/product.php” and there is a dynamic link inside the page “<a href="#!detail" …>”, Google will request the url:
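    http://www.domain.com/product.php?_escaped_fragment_=detail

(the value of the “_escaped_fragment_” argument is the part of the action link after the “#!”)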

2. The web server “www.domain.com” must understand this kind of request, and send back a page “snapshot”, with the same content that would result after having performed the dynamic loading. In the previous example, the page would be sent with the product detail loaded inside the <div class="productdetail"> element.

How to index pages that use the “onload” event to generate content

Some pages use the “onload” event to run javascript code that alters the content of the page. To allow Google to index the final content, a metatag has to be added to the page header:
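    <meta name="fragment" content="!">

(this is the metatag defined in Google’s specification for crawlable ajax applications)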

If Googlebot detects this metatag, it will request the page adding an empty “_escaped_fragment_=” argument. For instance, if the tag is added to the home page of the “www.domain.com” site, Googlebot will request the page:
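    http://www.domain.com/?_escaped_fragment_=

(the argument is empty because in this case there is no “#!” fragment in the url)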

The web server must understand this special request, and deliver the html code of the page that would result after the execution of the javascript code in the “onload” event.

How to generate page snapshots for Googlebot

In the previous sections we have seen that the web server must be able to generate page snapshots with the same content that would result if the javascript code were executed in a browser. To do this, there are several possibilities:

  • If most of the content is generated with javascript, the best option is to use a “headless” browser able to run javascript, such as HtmlUnit (a sketch of this approach is shown after this list). There are other similar tools such as crawljax or watij.com.
  • If most of the content is generated in the server using PHP, ASP.NET or similar, the existing code could be adapted to reproduce the processing done with javascript on the client.
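As an illustration of the headless browser approach, this is a sketch of a snapshot script for PhantomJS (a headless WebKit browser used for the same purpose as HtmlUnit; the fixed 1 second wait is a simplification, a real script would detect when the ajax requests have finished):

    // snapshot.js -- usage: phantomjs snapshot.js "http://www.domain.com/product.php#!detail"
    var page = require('webpage').create();
    var system = require('system');
    var url = system.args[1];

    page.open(url, function (status) {
        if (status !== 'success') {
            phantom.exit(1);
        }
        // give the javascript of the page (onload handlers, ajax calls) some time to run
        window.setTimeout(function () {
            // page.content holds the html of the page after javascript execution:
            // this is the "snapshot" to return for _escaped_fragment_ requests
            console.log(page.content);
            phantom.exit();
        }, 1000);
    });

The web server can run this script (or serve a cached copy of its output) when it receives a request with the “_escaped_fragment_” argument.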

 

References:

Google – Webmasters – Making Ajax applications crawlable
