
CSV is one of the most common formats used to exchange tabular information between different systems. This post is about how to read, process and write files in this format.
The CSV format
CSV stands for “Comma Separated Values”. However, it is often used as a generic denomination for other similar format, such as TSV (Tab Separated Values). In a CSV file, each line holds data for a single record. The record is divided into a series of fields, and the value of each field is separated from the previous and next fields by the separator character (a comma by default). It is common that the first line in the file is a header with the names of the fields, also separated by the separator character. Example CSV file:
1 2 3 4 5 6 |
Name,First Name,Age,Profession Smith,Peter,28,Plumber Doe,John,43,Electrian ... |
In this example the CSV format looks like quite easy to process. However, some special cases must be taken into account:
Values with commas, quotes and line feeds
If there is a separator character (comma, tab,…) is found inside the value of a field, the value is enclosed in double quotes. If a double quote is itself inside the value, another double quote is placed preceding it as an escape character:
1 2 3 4 5 6 |
Name,First Name,Age,Profession Smith,Peter,28,"Plumber, Locksmith" Doe,John,32,"Electrician ""Painter""" ... |
Finally, it is possible to find newlines inside the value it it spans several lines. Also in this case, the value is enclosed in double quotes:
1 2 3 4 5 6 7 |
Name,First Name,Age,Profession Smith,Peter,28,"Plumber, Locksmith" Doe,John,32,"Electrician ""Painter""" ... |
In the example above, the value of the Profession field in the first record contains a newline character, and therefore the record is split in two lines in the CSV file.
Reading a CSV file in Perl
The “Text::CSV” module available in CPAN can be used in Perl to read a CSV file, taking into account all the special cases that can happen.
A CSV file “test.csv” can be read using the code below:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
use Text::CSV; my $csv = Text::CSV->new({ binary => 1 }) # it is always recommended to set the binary flag or die "Error creating CSV object: ".Text::CSV->error_diag (); open my $fh, "<:encoding(utf8)", "test.csv" or die "Error reading CSV file: $!"; my $linea = <$fh>; # Skip the first (header) line with the field names while ( my $record = $csv->getline( $fh ) ) { # Process record # Field values are in $record->[0], $record[1], etc... } $csv->eof or $csv->error_diag(); close $fh; |
Reading a file with tabs as separator characters
The Text::CSV module can also be used to read files that use a tab character (“\t”), pipe character (“|”), or other character as separator. This is done specifying the “sep_char” parameter when the CSV object is created. The value of this parameter is the separator character to use. For instance, to read a TSV file, the object is created as:
1 2 3 |
my $csv = Text::CSV->new ( { binary => 1, sep_char => "\t" } ) |
parameters quote_char and escape_char
As already explained, double quotes are used to enclose a value if it contains the separator, newline or double quote character. Also, a double quote is added as escape character before each double quote contained in the value.
The quote and escape characters to use can be changed adding the parameters “quote_char” and “escape_char” when the CSV object is created.
For instance, if the value of a field as it appears in the CSV file is ‘O\’Reilly books’, then we can see that the single quote is used as the quote character enclosing the value, and the backslash is the escape character in front of a quote character that appears inside the value. This format can be correctly parsed if the Text::CSV object is created as:
1 2 3 |
my $csv = Text::CSV->new ( { binary => 1, quote_char => "'", escape_char => "\\" } ) |
Writing a CSV file
Text::CSV can also be used to generate a CSV file. The Text::CSV object inserts the quote and escape characters as required, to generate a syntactically correct output.
If data to be written are in an array @records, each of whose values is a reference to an array holding a single record, the CSV file can be written easily using the sentences:
1 2 3 4 5 6 7 8 9 |
# Specify newline as CR+LF $csv->eol ("\r\n"); # Open a handle to the file "new.csv" open $fh, ">:encoding(utf8)", "new.csv" or die " new.csv: $!"; $csv->print ($fh, $_) for @records; close $fh or die "new.csv: $!"; |
—
Index of posts related to Perl programming
—