
There are some good open source tools, such as AWStats, that perform an in-depth analysis of the information in the access logs of a web server.

We will dedicate another post to talk about the possibilities of AWStats, and how to install and configure it.

However, sometimes we may be interested in an ad-hoc processing of the log files of our web server, to extract information specific to the structure of our site, in a way that is not covered by a generic tool such as AWStats.

In this post we will explain how to perform this kind of processing by means of a home-made Perl script.

Format of the access log file

The most commonly used format for an access log is the “combined” format that comes pre-defined in an Apache Web Server installation. In this format, there is one line per access. Each line is composed of a series of fields separated by spaces. Fields whose values can contain spaces are enclosed in double quotes. For instance (the data that follows is a single line in the access log file, but we have split it across several lines to improve readability):
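
(The sample line below has been reconstructed from the field values listed next; the referrer and agent strings are illustrative placeholders.)

    190.121.30.129 - - [01/Jun/2012:04:59:02 +0200]
        "GET /templates/beez_20/css/general.css HTTP/1.1" 200 1447
        "http://www.example.com/index.php/some-article"
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko)
         Chrome/18.0.1025.168 Safari/535.19"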

The fields in the “combined” log format, and their values in the sample above are:

  1. ip: 190.121.30.129   The IP address of the client browser accessing the web server.
  2. client_id:  The identifier of the machine where the client is being run. In the sample, this datum is not available, and therefore its value is presented as a hyphen.
  3. user_id:  The id of the user as validated by the web server. In the sample, the value is again a hyphen because the user did not perform any validation during the session.
  4. date: [01/Jun/2012:04:59:02 +0200]   The date and time when the access happened.
  5. request: “GET /templates/beez_20/css/general.css HTTP/1.1”   The request sent by the client, with indication of the type (normally GET or POST), the requested URL and the version of the HTTP protocol used (mostly HTTP/1.1).
  6. http_return_code: 200   Code 200 is returned for successful requests. Other common codes are 404 (page not found), 302 (temporary redirect) and 301 (permanent redirect).
  7. response_size: 1447   The number of bytes sent in the response.
  8. referrer:   The URL of the page where the link that generated the request to our server was found. The referrer can be another page in our site, a results page from a search engine, a page from another site linking to our site, etc.
  9. agent:   A string that identifies the browser. In the sample, we can see that it is a Chrome browser.

Reading the log file and parsing the fields

The code below reads the log file line by line and parses each line into the different fields:
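
A minimal sketch of such a script could be the following (the file name access.log and the variable names are just examples of ours; lines that do not match the “combined” format are skipped):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $logfile = shift @ARGV || 'access.log';     # path to the access log file
    open(my $fh, '<', $logfile) or die "Cannot open $logfile: $!";

    my %pages;                                     # per-page statistics, filled in below

    while (my $line = <$fh>) {
        chomp $line;

        # Split the line into the nine fields of the "combined" log format
        my ($ip, $client_id, $user_id, $date, $request,
            $http_return_code, $response_size, $referrer, $agent) =
            $line =~ m/^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/
            or next;                               # skip lines in an unexpected format

        # ... processing of the parsed fields goes here ...
    }

    close($fh);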

Once the script above is ready, we can add code to perform some processing. For instance, let’s say we want to get the ten most accessed pages by number of “organic” accesses from Google, and for each of them we want to know the searches users performed on the search engine that resulted in our pages being visited.

To do this, we must identify log lines where the access comes from organic searches on Google. Those are the lines where the hostname in the referrer field contains the string “google”, and the referrer URL also contains an argument “&q=” whose value is the query string submitted to the search engine.
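
A sketch of this filtering, placed inside the read loop of the previous script (at the point of the “... processing ...” comment, and using the %pages hash declared before the loop), could look like this:

    next if $agent =~ /Googlebot|Crawler/;                      # discard accesses from crawlers
    my ($ref_host) = $referrer =~ m{^https?://([^/:]+)}i;       # hostname part of the referrer
    next unless defined $ref_host
             && $ref_host =~ /google/                           # organic accesses from google
             && $referrer =~ /[?&]q=([^&]*)/;                   # query string argument
    my $query = $1;                                             # value captured by the &q= match
    $query =~ tr/+/ /;                                          # crude decoding of '+' as space

    my ($url) = $request =~ m/^\S+ (\S+)/;                      # URL of the requested page
    next unless defined $url;

    $pages{$url}{count}++;                                      # total organic accesses to this page
    $pages{$url}{queries}{$query}++;                            # accesses coming from this query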

As we can see, in the first line we discard all accesses from Googlebot and other crawlers that identify themselves as ‘Crawler’ in the ‘agent’ sent to the server (we should also include ‘bingbot’ and the many other bots, crawlers, spiders, etc. that identify themselves in different ways).

In lines 2 to 5 we make sure that the referrer hostname includes the string “google”, and the referrer URL includes an argument “&q=”.

Once we know the URL of the page being accessed and the query, we add them to a %pages Perl hash. There is one entry in this %pages hash per page, the page URL being the key; the value holds the total number of accesses to the page and a hash of queries, mapping each query string to the number of times that query resulted in a page access.

Finally, we just have to add the code to print the result:
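
A possible sketch of this reporting code, placed after the read loop, is:

    # Sort the pages by number of organic accesses and keep the ten most accessed
    my @top_pages = (sort { $pages{$b}{count} <=> $pages{$a}{count} } keys %pages)[0 .. 9];

    foreach my $url (grep { defined } @top_pages) {
        printf "%6d  %s\n", $pages{$url}{count}, $url;
        my $queries = $pages{$url}{queries};
        foreach my $query (sort { $queries->{$b} <=> $queries->{$a} } keys %$queries) {
            printf "        %4d  %s\n", $queries->{$query}, $query;
        }
    }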

With small modifications, the code presented in this post can be used to gather other statistics of interest, such as the IP addresses performing the greatest number of accesses. With this report, we could easily identify crawlers that are not being detected by means of the ‘agent’ they send.
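
For instance, a per-IP count could be sketched by adding a %ips hash (an addition of this sketch, declared next to %pages) and the following code:

    $ips{$ip}++;    # inside the read loop, right after parsing the fields

    # after the loop: the twenty most active IP addresses
    foreach my $ip ((sort { $ips{$b} <=> $ips{$a} } keys %ips)[0 .. 19]) {
        last unless defined $ip;
        printf "%6d  %s\n", $ips{$ip}, $ip;
    }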

Another possibility is to geolocate the IP addresses (as explained in our previous post How to geolocate IP addresses in Perl), and get a listing of the top countries/cities visiting our site.
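
A possible sketch of a per-country report, using the Geo::IP module from CPAN (just one way of geolocating an address; see the previous post for the details), could be:

    use Geo::IP;

    my $gi = Geo::IP->new(GEOIP_MEMORY_CACHE);      # uses the default country database
    my %countries;

    # inside the read loop, right after parsing the fields
    my $country = $gi->country_name_by_addr($ip) || 'Unknown';
    $countries{$country}++;

    # after the loop
    foreach my $country (sort { $countries{$b} <=> $countries{$a} } keys %countries) {
        printf "%6d  %s\n", $countries{$country}, $country;
    }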

 
