Sometimes an application needs to access the content of a web page. The simplest solutions download the page by establishing a TCP connection to the web server, sending an HTTP request, and reading the HTML that the server sends in response.
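As a rough illustration of this simple approach, the following sketch opens a socket, sends a GET request, and reads back whatever the server returns. The class name `RawHttpFetch`, its helper methods, and the host `example.com` are made up for this example. It assumes plain HTTP on port 80 (no TLS, no redirects), and it uses HTTP/1.0 so that a compliant server closes the connection after the response instead of using chunked encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, written only to illustrate the "raw" download approach.
class RawHttpFetch {

    // Builds a minimal HTTP/1.0 GET request for the given host and path.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    // Returns everything after the blank line that terminates the response headers,
    // or an empty string if no such blank line is found.
    static String extractBody(String rawResponse) {
        int headerEnd = rawResponse.indexOf("\r\n\r\n");
        return headerEnd < 0 ? "" : rawResponse.substring(headerEnd + 4);
    }

    // Connects to host:port, sends the request, and reads until the server closes the connection.
    static String fetch(String host, int port, String path) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, path).getBytes(StandardCharsets.US_ASCII));
            out.flush();

            InputStream in = socket.getInputStream();
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            // Assumption: ISO-8859-1 is used only because it can decode any byte sequence;
            // a real client would honor the charset declared by the server.
            String raw = new String(buffer.toByteArray(), StandardCharsets.ISO_8859_1);
            return extractBody(raw);
        }
    }

    public static void main(String[] args) throws IOException {
        // example.com is a placeholder host.
        System.out.println(fetch("example.com", 80, "/"));
    }
}
```

The string returned by `fetch` is only the HTML as delivered by the server; no script contained in it is ever executed.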
However, this procedure fails when a site with dynamic content is accessed. A site of this kind delivers JavaScript code that generates part of the page content on the client once the page has been downloaded (normally while handling the “onload” event). This JavaScript code might manipulate the DOM, retrieve additional content by issuing Ajax requests, and so on. In this case, the final content of the page can be quite different from the HTML initially delivered by the server.
This post explains how to use the Java library HtmlUnit, which implements a headless browser with a JavaScript interpreter. The HtmlUnit browser can be fully controlled from a Java program. In this way, a program can download pages from a dynamic web site and retrieve their final content after the JavaScript code has run.