Apr 04 2014
 
Article Apache

Many web sites offer dynamic content that results from the execution of a script, in many cases written in PHP. Every time the web server receives a request, the script that generates the HTML code returned to the client runs, often performing several database queries or other processing that is expensive in time and resources. The result is a slow response time and a poor user experience.

This post explains how to implement a cache mechanism that may alleviate this problem: the HTML code resulting from the execution of the script is saved to disk, and every time a new request is received for the same page, this pre-processed content is served directly from the disk cache, which avoids running the whole process again.

Introduction

When a client browser issues a request to a web server that serves dynamically generated content, the server launches a CGI script (often written in PHP) that generates the HTML code of the page. The output from the script is collected by the web server and delivered to the client.

Implementing a cache basically requires capturing the output of the script (which normally goes to standard output) in a variable. The content of the variable is then both saved to a cache file and sent to the browser.

For each new request, the web server first checks whether the page exists in the cache. If that is the case, the cached page is sent to the client. Otherwise, the CGI script is launched.

The sample code snippets in this post are intended for a dynamic content web site whose CGI scripts are written in PHP. The platform used is a Debian Linux system running an Apache web server. With small changes, it should be possible to adapt these code samples to run on a different type of server.

Redirect the output of the script to a variable

PHP implements the functionality required to perform this redirection, by means of a call to the “ob_start()” function (where ob stands for “output buffering”).

ob_start() must be called in the first lines of code of the CGI script, before any output has been generated.

Then, at the end of the script, after all output has been generated, a call to ob_get_contents() retrieves the content of the buffer so that the resulting HTML code can be stored in a variable.

Finally, a call to ob_end_flush() sends the content of the buffer to standard output and turns output buffering off.

Note: A call to ob_end_clean() can replace the call to ob_end_flush() if the logic of the script decides to discard the generated output instead of sending it to the client.
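The three calls described above fit together roughly as follows. This is a minimal sketch: the echo statement is only a placeholder for the real page generation, and “$html” is the variable name used in the rest of this post.

```php
<?php
// Start output buffering before any output has been generated.
ob_start();

// Placeholder for the normal page generation (database queries, templates, ...).
echo '<html><body><h1>Hello</h1></body></html>';

// Copy the buffered output into a variable so it can be cached later.
$html = ob_get_contents();

// Send the buffered output to the client and turn buffering off.
// Use ob_end_clean() here instead to discard the output.
ob_end_flush();
```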

Saving the pre-processed page in a disk cache

At this point, the “$html” variable holds the HTML code of the page generated and sent to the client. Now the script will save that code in a file under a “cache” subdirectory of the root of the web server (specified in the DocumentRoot directive of the web server configuration file). The file name will be the URI used in the client request, adding a “.html” extension.

In the example below, it is assumed that the DocumentRoot of the web server is “/web”, and there is a previously created “cache” directory under it:
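A minimal sketch of this step, under the assumptions stated above (“/web” as DocumentRoot and an existing “/web/cache” directory). The function name savePageToCache and its parameters are hypothetical, chosen for illustration, and the check against “..” in the path is an extra precaution that is not part of the description.

```php
<?php
/**
 * Save the generated HTML under the cache directory.
 * Hypothetical helper: name and parameters are illustrative.
 * Relies on createPath(), a user-defined function (not a PHP built-in).
 * Returns the cache file name, or null if nothing was written.
 */
function savePageToCache(string $html, string $requestUri, string $cacheRoot = '/web/cache'): ?string
{
    // Keep only the path part of the URI (drops any "?name=value" arguments).
    $path = parse_url($requestUri, PHP_URL_PATH);

    // Refuse anything that could point outside the cache directory.
    if (!is_string($path) || $path === '' || strpos($path, '..') !== false) {
        return null;
    }

    // File name = request URI + ".html", e.g. /web/cache/news/today.html
    $cacheFile = $cacheRoot . $path . '.html';

    // Create the destination directory if it does not exist yet.
    if (!createPath(dirname($cacheFile))) {
        return null;
    }

    // LOCK_EX takes an exclusive lock on the file while it is written.
    if (file_put_contents($cacheFile, $html, LOCK_EX) === false) {
        return null;
    }
    return $cacheFile;
}
```

At the end of the script it could be called as savePageToCache($html, $_SERVER['REQUEST_URI']).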

In the example above, a “createPath” function is used to recursively create the destination directory. This is not a built-in function, but it can be easily coded as follows:
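One possible way to write it is sketched below. Only the name createPath comes from the text; the signature, the default permissions and the return value are assumptions.

```php
<?php
/**
 * Recursively create a directory path if it does not exist yet.
 * Returns true when the directory exists at the end of the call.
 */
function createPath(string $path, int $mode = 0755): bool
{
    if (is_dir($path)) {
        return true;
    }

    // Make sure the parent directory exists before creating this level.
    $parent = dirname($path);
    if ($parent !== $path && !createPath($parent, $mode)) {
        return false;
    }

    // Another request may create the directory at the same moment,
    // so a failed mkdir() is still fine if the directory now exists.
    return @mkdir($path, $mode) || is_dir($path);
}
```

PHP's built-in mkdir() also accepts a third argument that makes it recursive, so mkdir($path, 0755, true) is a shorter alternative when no extra logic is needed.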

The script also needs to create a compressed version of the cache file. If a client requesting the page accepts compressed files, this will be the version that will be served from cache. The filename of the compressed file is formed by adding “_gzip” to the name of the uncompressed file:
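A sketch of how the compressed copy could be written. The function name saveCompressedCopy is hypothetical, and gzencode() requires PHP's zlib extension.

```php
<?php
/**
 * Write a gzip-compressed copy of the page next to the uncompressed cache file.
 * Hypothetical helper: $cacheFile is the name of the uncompressed file.
 */
function saveCompressedCopy(string $html, string $cacheFile): bool
{
    // gzencode() returns data in gzip format, which is what browsers
    // expect for "Content-Encoding: gzip". Level 9 = best compression.
    $compressed = gzencode($html, 9);
    if ($compressed === false) {
        return false;
    }

    // Compressed file name = uncompressed name + "_gzip".
    return file_put_contents($cacheFile . '_gzip', $compressed, LOCK_EX) !== false;
}
```

The maximum compression level is a reasonable choice here because each page is compressed once and then served many times.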

Configure the web server to serve pages from cache

The next step is to configure the web server to serve from cache those pages that have been cached.

This can be done in an Apache web server by adding the following lines to the configuration file (alternatively, these lines can be added to the .htaccess file, if one is being used).
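A configuration along these lines might look as follows. It is a sketch rather than a definitive listing: it assumes that mod_mime and mod_rewrite are enabled and that the cache lives in a “cache” directory under the DocumentRoot. The lines are arranged so that their positions match the line numbers mentioned in the description that follows.

```apache
# Page cache (sketch): cached files live under DocumentRoot/cache
# Assumes mod_mime and mod_rewrite are enabled
AddEncoding gzip .html_gzip
<FilesMatch "\.html_gzip$">
    ForceType "text/html; charset=UTF-8"
</FilesMatch>

RewriteEngine On
RewriteCond %{HTTP:Accept-Encoding} gzip
RewriteRule .* - [E=PG_ENC:_gzip]
RewriteCond %{REQUEST_METHOD} !POST
RewriteCond %{QUERY_STRING} ^$
RewriteCond %{DOCUMENT_ROOT}/cache%{REQUEST_URI}.html%{ENV:PG_ENC} -f
RewriteRule .* /cache%{REQUEST_URI}.html%{ENV:PG_ENC} [L]
```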

In this configuration:

  • at line 3 the web server is notified that files with extension “.html_gzip” are actually files compressed in the gzip format.
  • at lines 4-6 the web server is notified that these files hold UTF-8 encoded HTML code. The web server will add this information to the HTTP header of the response sent to the client.
  • at lines 9 and 10, an environment variable “PG_ENC”, with value “_gzip”, is defined if the client accepts gzip encoded responses
  • finally, at lines 11-14 the web server is told to retrieve from cache the file to be delivered to the client, if three conditions are met:
    • The incoming request is not of type POST
    • There are no arguments in the request URL (i.e., there are no trailing arguments “?name=value”)
    • The file exists in cache, either compressed or uncompressed, as suitable for the client

And that completes the implementation of a basic page cache.

Additional considerations

The basic implementation described above could be suitable for a fair number of web sites. However, extra considerations are required in some cases:

Registered users

Some web sites allow users to sign up, and be granted access to additional services/content as registered users.

To check if an access comes from a registered user, the web site typically uses a cookie that is sent to the browser when the user logs in with a username and password. The browser will send back the cookie in subsequent accesses to the server.

The easiest way to take this into account is to avoid delivering pages from cache for accesses coming from validated users. This is done by adding an additional condition to the RewriteRule in the web server configuration:
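As a sketch, the rule that serves pages from cache could be extended with one more RewriteCond that tests the Cookie header. The other directives are assumptions consistent with the three conditions listed earlier, and the cookie name is matched as a plain substring.

```apache
RewriteCond %{REQUEST_METHOD} !POST
RewriteCond %{QUERY_STRING} ^$
# New condition: skip the cache when the login cookie is present
RewriteCond %{HTTP_COOKIE} !registered_user
RewriteCond %{DOCUMENT_ROOT}/cache%{REQUEST_URI}.html%{ENV:PG_ENC} -f
RewriteRule .* /cache%{REQUEST_URI}.html%{ENV:PG_ENC} [L]
```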

The example above assumes that “registered_user” is the name of the cookie that the web server sends to the client at the time the validation is done.

Furthermore, if the CGI script is run to serve a page to a validated user, it should not save the page to the disk cache.

In PHP, received cookies can be accessed through the $_SERVER['HTTP_COOKIE'] variable. Therefore, the implementation of this last requirement could be done with the following code:
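A minimal sketch of such a check, assuming the cookie is named “registered_user” as in the web server configuration. The function name isRegisteredUser is hypothetical.

```php
<?php
/**
 * Tell whether the request carries the login cookie.
 * Hypothetical helper: $server is normally the $_SERVER array.
 */
function isRegisteredUser(array $server, string $cookieName = 'registered_user'): bool
{
    // Raw Cookie header, e.g. "lang=en; registered_user=abc123"
    $cookies = $server['HTTP_COOKIE'] ?? '';

    // Look for "name=" at the start of the header or right after a ";".
    $pattern = '/(^|;\s*)' . preg_quote($cookieName, '/') . '=/';
    return preg_match($pattern, $cookies) === 1;
}
```

The script would then write to the cache only when isRegisteredUser($_SERVER) returns false.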

If the web site is using a standard CMS, there will most likely exist a function in the CMS library to check if a user is validated. For instance, in WordPress, the function is_user_logged_in() can be used to perform this check. In this case, the sample code could be rewritten as:
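For WordPress, the check could be sketched as below. The wrapper name mayCachePage is hypothetical, and the fallback for the case where WordPress is not loaded is an assumption made so that the snippet is safe to run anywhere.

```php
<?php
/**
 * Decide whether the current page may be written to the cache.
 * Assumes WordPress is loaded; is_user_logged_in() is part of its API.
 */
function mayCachePage(): bool
{
    // Without WordPress we cannot tell who the visitor is,
    // so play safe and do not cache.
    if (!function_exists('is_user_logged_in')) {
        return false;
    }

    // Cache only pages generated for anonymous visitors.
    return !is_user_logged_in();
}
```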

Expiration of cache

In the cache implementation discussed so far, cached content continues to be served to clients indefinitely.

But it may happen that the page content has changed due to changes in the database, or for any other reason. The web developer must implement a mechanism to detect this event and remove the affected files from the cache, so that the cache can be rebuilt on the next accesses.

The easiest way to implement this is to write a script that simply wipes the whole cache and runs periodically (for instance, as a cron entry on a Linux system). Finer-grained implementations can remove only those pages that have actually changed.
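A wipe script of this kind could be sketched in PHP as follows. The function name wipeCache is hypothetical; it removes everything below the cache directory but keeps the directory itself.

```php
<?php
/**
 * Remove every file and subdirectory below the cache directory.
 * Returns the number of files deleted.
 */
function wipeCache(string $cacheRoot): int
{
    if (!is_dir($cacheRoot)) {
        return 0;
    }

    $removed = 0;
    // CHILD_FIRST visits the content of a directory before the directory
    // itself, so each directory is already empty when rmdir() is called.
    $items = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($cacheRoot, FilesystemIterator::SKIP_DOTS),
        RecursiveIteratorIterator::CHILD_FIRST
    );

    foreach ($items as $item) {
        if ($item->isDir() && !$item->isLink()) {
            @rmdir($item->getPathname());
        } elseif (@unlink($item->getPathname())) {
            $removed++;
        }
    }
    return $removed;
}
```

Saved in a file that ends with a call such as wipeCache('/web/cache'), it could be run hourly from cron with an entry like “0 * * * * php /path/to/wipe_cache.php”, where the path and the schedule are placeholders.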


