Jan 14, 2013
Article Perl

Sometimes, we need to process large text files (for instance, the log files of a web server).

For some purposes, however, such as generating statistical reports, there is no need to process the whole file. A representative sample is enough to produce a meaningful result, while reducing the processing time and resources involved.

For the result to be meaningful, the sample normally has to be fully random. This post presents a couple of methods to obtain one.

Using a perl “one-liner”

Let’s say we want to obtain a 1% sample of the lines in a file named “access.log” (most likely, the access log of an Apache web server). To do so, we can use a perl “one-liner” like this:
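A minimal form of the command, consistent with the description that follows, is sketched here. The file “demo_access.log” and its generated contents are assumptions made so that the example can be run anywhere; with a real log, skip the first command and pass “access.log” instead.

```shell
# Demo input (hypothetical file): 100,000 numbered lines standing in for a log.
seq 1 100000 > demo_access.log

# -n: loop over the input line by line; -e: the code to run for each line.
# rand() returns a value in [0, 1), so each line is printed with probability 0.01.
perl -ne 'print if rand() < 0.01' demo_access.log > sample.txt

# Show how many lines ended up in the sample.
wc -l < sample.txt
```

Because each line is kept or dropped independently, the size of “sample.txt” changes from run to run; for the 100,000-line demo file it should come out close to 1,000 lines.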

Option “-n” tells the perl interpreter to process the input file line-by-line.

Option “-e” is followed by the statement(s) to run on each line: the call to “rand()” generates a random number between zero and one. The line being processed is printed if the generated number is lower than “0.01” (that is, statistically 1% of the lines are printed).

Finally, the printed lines are written to the file “sample.txt”.

Using awk

An equivalent way to achieve the same result is by means of the “awk” command, available in almost all Unix/Linux systems:
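A command along these lines matches the description below. The exact spelling is a sketch, and “demo_access.log” (with two empty lines appended on purpose) is a hypothetical stand-in for the real log.

```shell
# Demo input (hypothetical file): 100,000 numbered lines plus two empty ones.
{ seq 1 100000; echo; echo; } > demo_access.log

# BEGIN runs once before any input is read: srand() seeds the generator.
# !/^$/ selects the non-empty lines; each one is printed with probability 0.01.
awk 'BEGIN { srand() } !/^$/ { if (rand() <= 0.01) print $0 }' demo_access.log > sample.txt

# Show how many lines ended up in the sample.
wc -l < sample.txt
```

Called without an argument, “srand()” seeds from the time of day, so two runs started within the same second can produce the same sample.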

In this example, the call to “srand()” is used to initialize the random number generator.

The “!/^$/” expression discards empty lines. Finally, the “if” statement works as in the previous perl example, writing 1% of the lines processed.

How to extract an exact number of lines

In the previous examples, the result obtained is a percentage of the total number of lines in the input file, and the actual size of the sample can be slightly above or below the exact percentage requested.

If we want to extract a fixed number of lines, independent of the size of the input file, the following perl script can be used:

The algorithm implemented in this script is as follows:

The input file has N lines, and we want to select M lines randomly (assuming M <= N).

  1. The first M lines are read and stored in an array “@list”.
  2. The next line is read into the special perl variable “$_”. The line number is also available in the special perl variable “$.” (dollar dot).
  3. A random number between zero and ($./$M) is generated.
  4. If the value is below one (which happens with probability M divided by the current line number), the line is stored in the array, replacing one of the lines already there, which is also chosen at random.
  5. Repeat steps 2-4 until the input file is exhausted.
  6. Print the array “@list”, which holds the final sample of M lines.
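
A quick empirical check of the steps above, under the assumed values N = 20 and M = 5: if every line is equally likely to be chosen, each one should appear in about M/N = 25% of the samples. The sketch below packs the same steps into a one-liner, draws 1,000 samples and counts how often each line number was selected; the file name “counts.txt” is hypothetical.

```shell
# Draw 1,000 samples of M=5 lines from a 20-line input.
for run in $(seq 1 1000); do
  seq 1 20 | perl -ne '
    BEGIN { $M = 5 }
    if ($. <= $M)             { push @list, $_ }          # step 1: fill the array
    elsif (rand($. / $M) < 1) { $list[rand @list] = $_ }  # steps 3-4: random replacement
    END { print @list }                                   # step 6: emit the sample
  '
done | sort -n | uniq -c > counts.txt

# One row per line number: how many samples it appeared in (expected near 250).
cat counts.txt
```

Counts that stay close to 250 for all 20 line numbers suggest that early and late lines are selected with the same frequency.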

