Oct 242013
Article Perl

CSV is one of the most common formats used to exchange tabular information between different systems. This post  is about how to read, process and write files in this format.

The CSV format

CSV stands for “Comma Separated Values”. However, it is often used as a generic denomination for other similar format, such as TSV (Tab Separated Values). In a CSV file, each line holds data for a single record. The record is divided into a series of fields, and the value of each field is separated from the previous and next fields by the separator character (a comma by default). It is common that the first line in the file is a header with the names of the fields, also separated by the separator character. Example CSV file:

In this example the CSV format looks like quite easy to process. However, some special cases must be taken into account:

Values with commas, quotes and line feeds

If there is a separator character (comma, tab,…) is found inside the value of a field, the value is enclosed in double quotes. If a double quote is itself inside the value, another double quote is placed preceding it as an escape character:

Finally, it is possible to find newlines inside the value it it spans several lines. Also in this case, the value is enclosed in double quotes:

In the example above, the value of the Profession field in the first record contains a newline character, and therefore the record is split in two lines in the CSV file.

Reading a CSV file in Perl

The “Text::CSV” module available in CPAN can be used in Perl to read a CSV file, taking into account all the special cases that can happen.

A CSV file “test.csv” can be read using the code below:

Reading a file with tabs as separator characters

The Text::CSV module can also be used to read files that use a tab character (“\t”), pipe character (“|”), or other character as separator. This is done specifying the “sep_char” parameter when the CSV object is created. The value of this parameter is the separator character to use. For instance, to read a TSV file, the object is created as:

parameters quote_char and escape_char

As already explained, double quotes are used to enclose a value if it contains the separator, newline or double quote character. Also, a double quote is added as escape character before each double quote contained in the value.

The quote and escape characters to use can be changed adding the parameters “quote_char” and “escape_char” when the CSV object is created.

For instance, if the value of a field as it appears in the CSV file is ‘O\’Reilly books’, then we can see that the single quote is used as the quote character enclosing the value, and the backslash is the escape character in front of a quote character that appears inside the value. This format can be correctly parsed if the Text::CSV object is created as:

Writing a CSV file

Text::CSV can also be used to generate a CSV file. The Text::CSV object inserts the quote and escape characters as required, to generate a syntactically  correct output.

If data to be written are in an array @records, each of whose values is a reference to an array holding a single record, the CSV file can be written easily using the sentences:

Index of posts related to Perl programming

 Posted by at 12:39 pm

 Leave a Reply