Mar 18 2013
 
Article Java

Sometimes an application needs to access the content of a web page. The simplest solutions to this requirement download the page by establishing a TCP connection to the web server, sending an HTTP request and reading the HTML code that the server sends in response.
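As an illustration of this simple approach, here is a minimal sketch that uses only the standard java.net classes. The class and method names (PlainDownload, readAll, fetch) and the URL are placeholders chosen for this example, and the content is assumed to be UTF-8. The string returned is the HTML exactly as the server delivered it; no JavaScript is executed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;

class PlainDownload {

    // Reads the whole stream into a String using the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }

    // Sends a GET request and returns the response body as delivered by the server.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try (InputStream in = conn.getInputStream()) {
            return readAll(in, Charset.forName("UTF-8")); // assumption: UTF-8 content
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetch("http://www.example.com/")); // placeholder URL
    }
}
```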

But this procedure fails when a server with dynamic content is accessed. A server of this kind generates part of the content of the page using JavaScript code that runs on the client once the page has been downloaded (normally as part of the “onload” event handling). This JavaScript code might interact with the DOM, retrieve additional content by issuing AJAX requests, etc. In this case the final content of the page can be quite different from the content initially delivered by the server.

This post explains how to use the Java library HtmlUnit. This library implements a headless browser with a JavaScript interpreter. The HtmlUnit browser can be fully controlled from a Java program. In this way, pages from a dynamic web site can be downloaded, and the program can retrieve the final content of the pages after the JavaScript code has been executed.

HtmlUnit download and installation

The HtmlUnit home page at SourceForge includes a download link in the left menu. At the time of writing, the latest stable version is 2.11. The downloaded file is htmlunit-2.11-bin.zip, and it is 12090067 bytes (11.5 MB) in size.

Note: On the same download page it is also possible to download the source code, and since version 2.11 a file with OSGi integration is also available.

After downloading and uncompressing the file, a directory “htmlunit-2.11” is created, with two subdirectories, “apidocs” and “lib”, in it. The libraries required to run HtmlUnit are located under the “lib” directory.

Downloading a web page with HtmlUnit

The program must first create an object of class com.gargoylesoftware.htmlunit.WebClient. This object implements the browser and provides a method “getPage()” to perform an HTTP request.

The object returned by getPage() is usually an object of class com.gargoylesoftware.htmlunit.html.HtmlPage, containing the HTML-encoded web page retrieved.

Below is the source code of a small “TestHtmlUnit.java” program, to test the download of a page:
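A minimal sketch of such a program, assuming HtmlUnit 2.11 is on the classpath, could look like the following. The URL is a placeholder and the helper name fetchAsXml is chosen here for illustration only.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

class TestHtmlUnit {

    // Loads the URL with the headless browser and returns the resulting
    // document (after the page's onload JavaScript has run) as XML text.
    static String fetchAsXml(WebClient webClient, String url) throws Exception {
        HtmlPage page = webClient.getPage(url);
        return page.getDocumentElement().asXml();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL; pass a real one as the first argument.
        String url = args.length > 0 ? args[0] : "http://www.example.com/";
        WebClient webClient = new WebClient();
        try {
            System.out.println(fetchAsXml(webClient, url));
        } finally {
            // Releases the browser windows in 2.11; newer releases offer close().
            webClient.closeAllWindows();
        }
    }
}
```

Scripts that keep working after the onload event (timers, asynchronous requests) may need extra time; WebClient offers waitForBackgroundJavaScript() for that case.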

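To compile and execute it, the required libraries must be specified in the classpath, for example by passing the wildcard entry “htmlunit-2.11/lib/*” with the -cp option of both javac and java (together with the current directory when running the program).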

Specify a User Agent

The web client can be set up to identify itself with different User-Agent strings. This is done in the statement that creates the object. For instance:

The predefined constants that can be used for this purpose can be found in the BrowserVersion documentation.
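A short sketch follows. It assumes that the constant BrowserVersion.FIREFOX_10 is available, which I believe is the case in HtmlUnit 2.11; the set of constants changes between releases, so another one may be needed. The class and method names are illustrative.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;

class UserAgentExample {

    // Creates a web client that identifies itself as Firefox 10.
    // Assumption: FIREFOX_10 exists in the HtmlUnit release in use.
    static WebClient newFirefoxLikeClient() {
        return new WebClient(BrowserVersion.FIREFOX_10);
    }

    public static void main(String[] args) {
        WebClient webClient = newFirefoxLikeClient();
        // Shows the User-Agent string that will be sent with each request.
        System.out.println(webClient.getBrowserVersion().getUserAgent());
        webClient.closeAllWindows();
    }
}
```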

Extract elements from the page

In the previous example, the whole HTML document has been retrieved with a call to “getDocumentElement()”.

We can also extract an element identified by the HTML attribute “id”, with a call to “getElementById()”:
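A minimal sketch, assuming HtmlUnit 2.11: the id “content” and the URL are hypothetical. The result is held as a DomNode, a base class that offers asXml(), so that the code does not depend on the exact return type declared by getElementById() in a given release.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

class ExtractElement {

    // Returns the XML of the element with the given id,
    // or null when the page has no such element.
    static String elementXml(HtmlPage page, String id) {
        DomNode element = page.getElementById(id);
        return element == null ? null : element.asXml();
    }

    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        try {
            HtmlPage page = webClient.getPage("http://www.example.com/"); // placeholder URL
            System.out.println(elementXml(page, "content")); // hypothetical id
        } finally {
            webClient.closeAllWindows();
        }
    }
}
```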

How to avoid error messages being written to stderr

HtmlUnit works as a full-blown browser, downloading and parsing the CSS style sheets for the pages as well as the HTML and JavaScript code.

Often, these style sheets have rules that do not conform to the standard. The HtmlUnit CSS error handler prints an error message to stderr when it finds those rules.

We can suppress these messages by telling the web client to use a silent error handler:
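A sketch of this setting, assuming HtmlUnit 2.11, where the class SilentCssErrorHandler ships with the library; the wrapper class and method names are illustrative.

```java
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.WebClient;

class SilentCss {

    // Creates a web client that discards CSS parsing errors
    // instead of printing them to stderr.
    static WebClient newQuietCssClient() {
        WebClient webClient = new WebClient();
        webClient.setCssErrorHandler(new SilentCssErrorHandler());
        return webClient;
    }
}
```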

In the same way, HtmlUnit writes to stderr the content it reads when it receives an error status code from the web server. We can avoid this output with the statement:
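A sketch of this setting, assuming HtmlUnit 2.11, where the options are reached through webClient.getOptions(); the wrapper class and method names are illustrative.

```java
import com.gargoylesoftware.htmlunit.WebClient;

class QuietFailingStatus {

    // Stops HtmlUnit from dumping the response body to stderr
    // when the server answers with an error status code.
    static WebClient newClient() {
        WebClient webClient = new WebClient();
        webClient.getOptions().setPrintContentOnFailingStatusCode(false);
        return webClient;
    }
}
```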

If we want finer control over the error handling procedure, it is possible to define custom error handlers for each of the HtmlUnit modules:
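The sketch below installs two such handlers, assuming HtmlUnit 2.11, where the CSS handler is an org.w3c.css.sac.ErrorHandler. Instead of printing, the handlers record each message in a list supplied by the caller; that list and the class name CustomHandlers are choices made for this example. WebClient also has setJavaScriptErrorListener() and setHTMLParserListener(); they are left out here because the methods of those interfaces differ between releases.

```java
import com.gargoylesoftware.htmlunit.IncorrectnessListener;
import com.gargoylesoftware.htmlunit.WebClient;
import java.util.List;
import org.w3c.css.sac.CSSException;
import org.w3c.css.sac.CSSParseException;
import org.w3c.css.sac.ErrorHandler;

class CustomHandlers {

    // Installs handlers that append messages to 'log' instead of printing them.
    static void install(WebClient webClient, final List<String> log) {
        // Called when HtmlUnit meets markup or headers it considers incorrect.
        webClient.setIncorrectnessListener(new IncorrectnessListener() {
            @Override
            public void notify(String message, Object origin) {
                log.add("incorrectness: " + message);
            }
        });

        // Called by the CSS parser for non-conforming style sheet rules.
        webClient.setCssErrorHandler(new ErrorHandler() {
            @Override
            public void warning(CSSParseException e) throws CSSException {
                log.add("css warning: " + e.getMessage());
            }

            @Override
            public void error(CSSParseException e) throws CSSException {
                log.add("css error: " + e.getMessage());
            }

            @Override
            public void fatalError(CSSParseException e) throws CSSException {
                log.add("css fatal error: " + e.getMessage());
            }
        });
    }
}
```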

Finally, if we just want to silence all error messages, but keep the default error handling:
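One way to do this is at the logging level. The sketch below assumes that HtmlUnit logs through Apache Commons Logging (its jar ships in the “lib” directory) and that, by default, the messages end up in java.util.logging. The class name QuietLogging is illustrative.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class QuietLogging {

    // java.util.logging keeps only weak references to loggers, so a strong
    // reference is held here to make sure the level setting is not lost.
    private static final Logger HTMLUNIT_LOGGER =
            Logger.getLogger("com.gargoylesoftware.htmlunit");

    // Call this before the first HtmlUnit class is used.
    static void silence() {
        // Tells Commons Logging to use its no-op implementation.
        System.setProperty("org.apache.commons.logging.Log",
                "org.apache.commons.logging.impl.NoOpLog");
        // Also switches off the java.util.logging logger for the HtmlUnit packages.
        HTMLUNIT_LOGGER.setLevel(Level.OFF);
    }
}
```

The system property only takes effect for loggers created after it is set, which is why silence() should run at the start of the program.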

How to work with persistent cookies

The HtmlUnit web client keeps the cookies received from the web servers it accesses, and sends the matching cookies back to those servers, as other browsers do.

To make this information persistent, the cookies must be saved when the webClient object is destroyed, and loaded when a new webClient object is created. The following is a sample “saveCookies()” method to write the cookies to a file:
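A minimal sketch of such a method, assuming HtmlUnit 2.11, where the Cookie class is serializable. It uses standard Java serialization; the class name CookieSaver is illustrative.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.util.Cookie;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.HashSet;

class CookieSaver {

    // Serializes all cookies currently held by the web client to a file.
    static void saveCookies(WebClient webClient, String filename) throws IOException {
        // Copy into a HashSet so that a plain, serializable collection is written.
        HashSet<Cookie> cookies =
                new HashSet<Cookie>(webClient.getCookieManager().getCookies());
        try (ObjectOutputStream out =
                new ObjectOutputStream(new FileOutputStream(filename))) {
            out.writeObject(cookies);
        }
    }
}
```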

A sample “loadCookies()” method to read the cookies from a file and load them in the webClient can be written as:
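A minimal sketch of such a method, assuming the file contains a Set of HtmlUnit Cookie objects written with standard Java serialization (ObjectOutputStream). The class name CookieLoader is illustrative, and a missing file is treated as “no cookies yet”.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.util.Cookie;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.util.Set;

class CookieLoader {

    // Reads a serialized set of cookies and adds each one to the web client.
    @SuppressWarnings("unchecked")
    static void loadCookies(WebClient webClient, String filename)
            throws IOException, ClassNotFoundException {
        File file = new File(filename);
        if (!file.exists()) {
            return; // first run: nothing has been saved yet
        }
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            Set<Cookie> cookies = (Set<Cookie>) in.readObject();
            for (Cookie cookie : cookies) {
                webClient.getCookieManager().addCookie(cookie);
            }
        }
    }
}
```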

How to use a proxy server in HtmlUnit

The HtmlUnit web client can be configured to establish connections through a proxy server:
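A sketch of this configuration, assuming HtmlUnit 2.11, where the proxy is set through webClient.getOptions(). The variable names proxyname and proxyport and the wrapper class are illustrative.

```java
import com.gargoylesoftware.htmlunit.ProxyConfig;
import com.gargoylesoftware.htmlunit.WebClient;

class ProxyExample {

    // Creates a web client that sends all its requests through the given proxy.
    static WebClient newProxiedClient(String proxyname, int proxyport) {
        WebClient webClient = new WebClient();
        webClient.getOptions().setProxyConfig(new ProxyConfig(proxyname, proxyport));
        return webClient;
    }
}
```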

The value of the proxyname variable is a String with the domain name or the IP address of the proxy.

If the proxy requires authentication, the credentials can be added to the webClient with the statements:

The last argument in the call to the addCredentials() method (null in the example code above) can be used to specify a “Realm” where the authentication credentials apply. The value null means that the credentials are applicable to any “Realm”.
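A sketch, assuming HtmlUnit 2.11 and that the web client still uses its default credentials provider (an instance of DefaultCredentialsProvider). The parameter names and the wrapper class are illustrative.

```java
import com.gargoylesoftware.htmlunit.DefaultCredentialsProvider;
import com.gargoylesoftware.htmlunit.WebClient;

class ProxyAuthExample {

    // Registers a user name and password for the given proxy host and port.
    static void addProxyCredentials(WebClient webClient, String username,
            String password, String proxyname, int proxyport) {
        // Assumption: no custom credentials provider has been installed.
        DefaultCredentialsProvider credentialsProvider =
                (DefaultCredentialsProvider) webClient.getCredentialsProvider();
        // null as the last argument: the credentials apply to any realm.
        credentialsProvider.addCredentials(username, password, proxyname, proxyport, null);
    }
}
```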

Note: The addCredentials() method is used to configure the authentication of both web servers and proxies.

 Posted by at 10:38 am
