Jun 27, 2013
 
Article Perl

A web site whose pages are dynamically generated can be slow and appear unresponsive if generating the content involves heavy database queries or other CPU-intensive tasks.

This post explains a procedure to save the pre-processed pages in a disk cache, and to configure the web server to deliver the cached pages on subsequent requests. This avoids the overhead of generating the same content again each time a new request is received.

Introduction

When a client issues a request to a dynamic web site, the web server launches the execution of a CGI script that outputs the HTML code of the page. The web server just reads the output from the script and delivers it to the client.

The basic idea behind the implementation of a cache is redirecting the output from the script to a variable. Once the page has been fully generated, the content of the variable is saved in a cache file, and then it is sent to standard output, to be delivered to the client.

In subsequent requests for the same page, the web server first checks if the page is in cache. In that case, it just reads the content of the cached file, and sends it to the client without executing the CGI script.

If the page is not in cache, the web server executes the CGI script that generates the page and saves it in cache.

The code samples in this post assume that the dynamic web site is implemented using CGI scripts written in Perl, and an Apache web server running on a Debian Linux platform. However, with minor modifications they can be used for other types of server.

Redirect the output from the script to a variable

At the beginning of its execution, a CGI script normally prints an HTTP header, and then starts printing the HTML code.

For instance, a sample CGI script could include the statements:

To implement the cache, we are going to modify this code to make it store the generated output in a Perl variable, instead of directly sending it to standard output.
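A minimal sketch of such a script is shown here. The page content and the `$title` variable are placeholders invented for the example, not taken from a real site:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $title = 'Sample page';   # placeholder content for the example

# The HTTP header comes first, ended by an empty line
print "Content-Type: text/html\n\n";

# Then the HTML code of the page is printed piece by piece
print "<html>\n<head><title>$title</title></head>\n";
print "<body>\n<h1>$title</h1>\n";
print "<p>Content generated from database queries goes here.</p>\n";
print "</body>\n</html>\n";
```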

To do this, we will redirect the output immediately after the HTTP header has been sent, and before any HTML code has been printed:
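One possible way to do it, assuming the script writes its output with plain `print` statements, is to open a filehandle on a scalar and make it the default output handle with `select`. The names `$page` and `$capture` are our own choice; `$old_stdout` keeps the previous default handle so that it can be restored later:

```perl
use strict;
use warnings;

# The HTTP header still goes straight to the client
print "Content-Type: text/html\n\n";

# Open a filehandle that writes into the scalar $page (in-memory file)
my $page = '';
open(my $capture, '>', \$page)
    or die "Cannot open in-memory filehandle: $!";

# Make it the default handle for print. select() returns the previous
# default handle (STDOUT), which is kept to be restored later.
my $old_stdout = select($capture);

# From here on, plain print statements append to $page
print "<html>\n<body>\n<h1>Sample page</h1>\n</body>\n</html>\n";
```

Note that `select` only changes the default handle: a statement that names the handle explicitly, such as `print STDOUT ...`, would still bypass the variable.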

Write to cache the pre-processed page

Once the script has generated all the HTML code of the page, it must send it first of all to the client that requested it. To do this, it restores the original stdout handle that had been saved in the $old_stdout variable:
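The following sketch shows this step. The capture setup is repeated in a shortened form so that the example runs on its own, and it assumes the output was redirected with `select` on an in-memory filehandle:

```perl
use strict;
use warnings;

# Capture setup, repeated here so that the sketch is self-contained
my $page = '';
open(my $capture, '>', \$page)
    or die "Cannot open in-memory filehandle: $!";
my $old_stdout = select($capture);
print "<html><body><h1>Sample page</h1></body></html>\n";

# Restore the original default handle and close the in-memory one
select($old_stdout);
close($capture);

# Deliver the captured page to the client
print $page;
```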

Next, the page content is saved under a “cache” directory located under the DocumentRoot of the web server. The filename is formed from the URI of the requested page, followed by the extension “.html”.

In the following sample, it is assumed that the DocumentRoot is “/web”, and there is already a “cache” folder under it:
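A sketch of this step could look as follows. The helper name `write_cache_file` is hypothetical, and the safety checks (no query string, no “..” segments) and the write-then-rename step are our own additions, so that a partially written file is never visible to the web server:

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Path qw(make_path);

# Save $content as <cache_dir><uri>.html and return the file name,
# or undef if the URI should not be cached (hypothetical helper).
sub write_cache_file {
    my ($cache_dir, $uri, $content) = @_;

    # Skip URIs with a query string, with ".." segments, or ending in "/"
    return if $uri =~ /\?/
           or $uri =~ m{(?:^|/)\.\.(?:/|$)}
           or $uri !~ m{^/.*[^/]$};

    my $file = $cache_dir . $uri . '.html';
    make_path(dirname($file));    # create intermediate directories

    # Write under a temporary name, then rename to the final name
    my $tmp = "$file.tmp.$$";
    open(my $out, '>', $tmp) or die "Cannot write $tmp: $!";
    print {$out} $content;
    close($out) or die "Cannot close $tmp: $!";
    rename($tmp, $file) or die "Cannot rename $tmp: $!";
    return $file;
}

# In the CGI script, after the page has been sent to the client:
# write_cache_file('/web/cache', $ENV{REQUEST_URI}, $page);
```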

A compressed version of the file is also written to cache. This version will be used if the client requesting the page accepts compressed data. The name of the compressed file is the same as the name of the uncompressed file, with the extension “.html_gzip”:
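One way to write the compressed copy is the `gzip` function of the core module IO::Compress::Gzip. The helper name `write_gzip_copy` is hypothetical, and the sketch assumes the uncompressed file name already ends in “.html”:

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

# Write a gzip-compressed copy of $content next to the uncompressed
# cache file, using the same name with "_gzip" appended.
sub write_gzip_copy {
    my ($file, $content) = @_;       # $file ends in ".html"
    my $gz_file = $file . '_gzip';   # e.g. /web/cache/page.html_gzip

    # Input is an in-memory buffer, output is a file name
    gzip(\$content => $gz_file)
        or die "gzip failed for $gz_file: $GzipError";
    return $gz_file;
}

# In the CGI script, right after the uncompressed file is written:
# write_gzip_copy('/web/cache' . $ENV{REQUEST_URI} . '.html', $page);
```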

Configure the web server to serve pages from cache

Now, the web server must be configured to use the cached content to serve requests.

For an Apache web server, this is done by adding the following directives to the configuration file (or to the .htaccess file, if the site makes use of this type of file):
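The block below is a reconstruction sketched from the description that follows, not a verbatim listing. It assumes that mod_mime and mod_rewrite are enabled and that the cache lives in “/cache” under the DocumentRoot. The lines are arranged so that their positions match the line numbers used in the explanation:

```
# Serve pre-generated pages from the cache directory when they exist
# (sketch; adjust paths and extensions to the site layout)
AddEncoding gzip .html_gzip
<FilesMatch "\.html_gzip$">
    ForceType text/html
</FilesMatch>

RewriteEngine On
RewriteCond %{HTTP:Accept-Encoding} gzip
RewriteRule ^ - [E=PG_ENC:_gzip]
RewriteCond %{REQUEST_METHOD} !^POST$
RewriteCond %{QUERY_STRING} ^$
RewriteCond %{DOCUMENT_ROOT}/cache%{REQUEST_URI}.html%{ENV:PG_ENC} -f
RewriteRule ^ /cache%{REQUEST_URI}.html%{ENV:PG_ENC} [L]
```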

We can see that:

  • In line 3 the web server is made aware that files with extension “.html_gzip” are gzip-compressed files.
  • In lines 4-6 the web server is told that the content type for those files is “text/html”. This information will be included by the web server in the HTTP header sent to the client.
  • Lines 9 and 10 define an environment variable “PG_ENC” if the client accepts gzip-compressed data.
  • Finally, lines 11-14 rewrite the request to send to the client the cached file, if three conditions are met:
    • The request is not of type POST
    • The URL does not include arguments (it is not something like “http://www.example.com/page?argument=value”)
    • The file exists in cache in its compressed or uncompressed version, according to the previously set value of the PG_ENC variable.

And this completes the basic setting of the cache mechanism.

Additional considerations

Although the described implementation could be valid for a large number of web sites, in some cases extra considerations may need to be taken into account:

Registered users

Some web sites offer additional services to registered users. Normally, once the user is validated after entering a username and password, the web site displays different content and gives the user access to other pages.

In most cases, the web site implements this functionality by sending a cookie to the client at the end of the validation, and reading the cookie sent back by the client to identify the user.

The easiest way to take this into account in our cache implementation is to prevent pages requested by authenticated users from being served from cache. To do this, an additional condition is added to the block of rewrite rules in the configuration file:
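A condition of this kind could be written as shown below. It is a sketch that must be placed together with the other RewriteCond lines, before the final RewriteRule that points to the cache:

```
# Do not serve from cache when the request carries the login cookie
RewriteCond %{HTTP_COOKIE} !(^|;\s*)registered_user=
```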

The sample directive above assumes that “registered_user” is the name of the cookie that the web server sends to track authenticated users.

In Perl, received cookies can be accessed using the variable $ENV{'HTTP_COOKIE'}. A basic way to detect the presence of the cookie in the request, and disable the caching of the page, could be coded as:

Cache lifetime
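A minimal sketch, assuming the cookie is named “registered_user” as in the directive above. The helper `is_registered_user` and the flag `$cache_enabled` are hypothetical names:

```perl
use strict;
use warnings;

# Return 1 if the Cookie header contains the "registered_user" cookie,
# 0 otherwise (including when no cookie was sent at all).
sub is_registered_user {
    my ($cookie_header) = @_;
    return 0 unless defined $cookie_header;
    return $cookie_header =~ /(?:^|;\s*)registered_user=/ ? 1 : 0;
}

# Cache the page only for anonymous visitors. The script would check
# this flag before writing the cache files.
my $cache_enabled = is_registered_user($ENV{HTTP_COOKIE}) ? 0 : 1;
```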

In the implementation of the cache functionality above, once a page has been stored in cache, the web server will continue sending to clients the cached content indefinitely.

But it might happen that the page content should be updated due to changes in the database, or for any other reason. The site developer will have to implement a mechanism to identify which pages would be affected by a change, and remove the cache files for those pages. The next request for one of those pages will result in the execution of the script and the generation of an updated cache file.

The most basic way to implement a cache refresh is probably a cron process that wipes all the cached content on a daily or weekly basis.
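As an illustration, the code that updates the database could call a small helper to remove both cache files of an affected page. The name `invalidate_cached_page` is hypothetical, and the sketch assumes the file naming used earlier in this post:

```perl
use strict;
use warnings;

# Remove the uncompressed and compressed cache files of one page.
# Returns the number of files actually deleted.
sub invalidate_cached_page {
    my ($cache_dir, $uri) = @_;
    my $removed = 0;
    for my $suffix ('.html', '.html_gzip') {
        my $file = $cache_dir . $uri . $suffix;
        $removed += unlink($file) if -f $file;
    }
    return $removed;
}

# For example, after updating the data shown in /news/today:
# invalidate_cached_page('/web/cache', '/news/today');
```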
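For example, an entry like the following could be added with `crontab -e` for the account that owns the cache files. It is a sketch that assumes the cache is in “/web/cache” and that GNU find is available, as on Debian; the schedule is an arbitrary choice:

```
# Every day at 04:00, delete all cached pages (plain and gzip versions)
0 4 * * * find /web/cache -type f -name '*.html*' -delete
```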

 Posted by at 11:59 am
