Oct 20, 2014
 
Article Perl

Most Perl scripts read the data they process from, and write their results to, several kinds of sources: files, databases, the keyboard, literal strings in the script's source code…

Text data read from these sources may use different character encodings. The most common ones are utf8 and latin1 (ISO 8859-1). To make sure the data is processed correctly, it is highly recommended to always state explicitly the expected encoding of input data and the desired encoding of output data, as explained in this post.

1. Text literals in the source code of the script

The source code of a Perl script may contain statements that assign literal text values to variables. For instance, a “test.pl” script could contain an assignment statement like this:
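A minimal sketch of such an assignment follows; the variable name and the text are illustrative and are not taken from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A literal containing a non-ASCII character. Without further hints, perl
# treats the literal as a sequence of bytes, and the byte values depend on
# the encoding in which test.pl itself was saved.
my $text = "opción";
print "$text\n";
```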

But the file “test.pl” itself may be encoded in latin1, utf8, or some other character encoding.

To avoid encoding issues, we can convert the file to utf8 (using a suitable text editor, or the ‘iconv’ utility available in most Linux distros, …) and add the “use utf8” pragma at the start of the script, to tell the perl interpreter that the text literals inside the script are utf8-encoded:

But, even after this change, the output of the script may differ from one machine to another. This is because the ‘print’ statement sends data to the STDOUT standard I/O handle, and the perl interpreter may have wrongly guessed the encoding used by the terminal where the script is run. The next section explains how to fix this issue.
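As a hedged illustration, the script might look like the sketch below once the file has been saved as utf8. The variable name and the sample text are assumptions made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;   # the source file itself is saved as utf8

my $text = "opción";

# With "use utf8" the literal is decoded into characters, so length()
# counts "ó" as one character instead of the two bytes it takes in utf8.
print length($text), "\n";
print "$text\n";
```

The character count is now independent of how many bytes each character needs on disk, but the bytes that ‘print’ sends to the terminal are still not under explicit control.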

2. I/O operations to a terminal: STDIN, STDOUT, STDERR

Perl uses the STDIN handle to read data from the keyboard, and the STDOUT and STDERR handles to write to the screen of the terminal where it is run.

But the terminal uses a particular character encoding. It is good practice to tell the perl interpreter explicitly which character encoding terminal I/O requires. This is done by calling binmode on each of the standard I/O handles:
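A minimal sketch of those calls, assuming the terminal uses UTF-8 (replace the layer name if your terminal uses another encoding):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

# Assumption: the terminal is configured for UTF-8
binmode(STDIN,  ':encoding(UTF-8)');   # decode what is typed on the keyboard
binmode(STDOUT, ':encoding(UTF-8)');   # encode normal output
binmode(STDERR, ':encoding(UTF-8)');   # encode warnings and error messages

print "opción\n";
```

For a latin1 terminal, the layer would be ‘:encoding(ISO-8859-1)’ instead.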

3. File I/O operations

“binmode” can also be used to set explicitly the character encoding for I/O operations on a file.
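The following sketch writes a line to a scratch file and reads it back, calling binmode on each file handle. The use of File::Temp and the sample text are choices made for this example only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use File::Temp qw(tempfile);

my ($out, $path) = tempfile();        # scratch file for the demo
binmode($out, ':encoding(UTF-8)');    # encode characters as UTF-8 on write
print {$out} "opción\n";
close($out) or die "Cannot close $path: $!";

open(my $in, '<', $path) or die "Cannot open $path: $!";
binmode($in, ':encoding(UTF-8)');     # decode UTF-8 bytes on read
my $line = <$in>;
close($in);
unlink $path;                         # remove the scratch file

chomp $line;
print length($line), "\n";            # counts characters, not bytes
```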

Alternatively, the desired character encoding can be specified in the call to “open”:
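A sketch of this variant, where the encoding layer is part of the mode argument of “open”. The file name and the choice of latin1 are assumptions for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);     # scratch directory for the demo
my $path = "$dir/latin1.txt";         # hypothetical file name

# Write the text as latin1: "ó" is stored as the single byte 0xF3
open(my $out, '>:encoding(ISO-8859-1)', $path) or die "Cannot open $path: $!";
print {$out} "opción\n";
close($out) or die "Cannot close $path: $!";

# Read it back, declaring the same encoding so the bytes are decoded correctly
open(my $in, '<:encoding(ISO-8859-1)', $path) or die "Cannot open $path: $!";
my $line = <$in>;
close($in);
chomp $line;
```

Declaring the layer in “open” keeps the encoding next to the file name, so there is no window in which the handle exists without an encoding.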

4. Reading from and writing to a database

The behaviour of read/write operations on a database from a perl script depends mainly on the database interface library being used, and on the type of database itself.

As a general rule, the character encoding to use in DB transactions should be explicitly set.

Example 1:

  • Connecting to a MySQL DB using the DBI module, for DBD::mysql versions 4.004 or newer

After the establishment of the DB connection, set it to utf8:
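A hedged sketch of this step. The helper name connect_utf8 and the connection parameters are hypothetical; mysql_enable_utf8 is the DBD::mysql attribute mentioned later in this post, and the DBD::mysql documentation notes that “SET NAMES utf8” is also needed when the flag is turned on after connecting.

```perl
use strict;
use warnings;
use DBI;

# Hypothetical helper: connect and switch the connection to utf8.
sub connect_utf8 {
    my ($dsn, $user, $password) = @_;
    my $dbh = DBI->connect($dsn, $user, $password, { RaiseError => 1 });

    $dbh->{mysql_enable_utf8} = 1;   # treat text exchanged with the DB as utf8
    $dbh->do('SET NAMES utf8');      # required because the flag is set after connect()
    return $dbh;
}

# Usage, with made-up parameters:
# my $dbh = connect_utf8('DBI:mysql:database=test;host=localhost', 'user', 'secret');
```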

Next, perform the desired read/write operations:
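A sketch of such operations, assuming $dbh is a DBI handle already set to utf8 as described above. The table “people”, its “name” column and the helper name are hypothetical.

```perl
use strict;
use warnings;
use utf8;

binmode(STDOUT, ':encoding(UTF-8)');   # assumption: UTF-8 terminal

# Hypothetical helper: insert one row, then print every name in the table.
sub insert_and_list {
    my ($dbh) = @_;

    my $ins = $dbh->prepare('INSERT INTO people (name) VALUES (?)');
    $ins->execute('Ramón');

    my $sel = $dbh->prepare('SELECT name FROM people');
    $sel->execute();
    while (my ($name) = $sel->fetchrow_array()) {
        print "$name\n";
    }
}
```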

Example 2:

  • Connecting to a MySQL DB using the DBI module, for DBD::mysql versions prior to 4.004

In this case, we must explicitly set the conversion to use with a call to $dbh->do().

In addition, decode_utf8() must be used to set the utf8 flag on texts retrieved from the database:
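A sketch of both steps, assuming “SET NAMES utf8” is the statement passed to $dbh->do(). The table, the column and the helper name are hypothetical; decode_utf8() comes from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical helper: $dbh is an already connected DBI handle.
sub fetch_names {
    my ($dbh) = @_;
    $dbh->do('SET NAMES utf8');              # ask the server to exchange utf8 text

    my $sth = $dbh->prepare('SELECT name FROM people');
    $sth->execute();

    my @names;
    while (my ($raw) = $sth->fetchrow_array()) {
        push @names, decode_utf8($raw);      # bytes -> characters, utf8 flag set
    }
    return @names;
}

# What decode_utf8() does, shown on literal bytes: the two bytes C3 B3
# become the single character "ó".
my $decoded = decode_utf8("opci\xC3\xB3n");
```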

Example 3:

  • Reading from a table that uses latin1 encoding, but stores utf8-encoded values

This is a rather common case. For instance, the table might be defined as:

Two records have been inserted in it. The values stored can be examined from the mysql client:

We can see that the first record is encoded in latin1, but the second record is encoded in utf8, and the utf8 sequence “ó” is printed instead of the accented character “ó”.
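The “ó” artifact can be reproduced in Perl without a database. This sketch only illustrates the mechanism: the two bytes that encode “ó” in utf8 are each read as a separate latin1 character.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

binmode(STDOUT, ':encoding(UTF-8)');           # assumption: UTF-8 terminal

my $bytes   = encode('UTF-8', "\x{f3}");       # "ó" as utf8: the bytes C3 B3
my $misread = decode('ISO-8859-1', $bytes);    # each byte taken as one latin1 character

print "$misread\n";                            # two characters: U+00C3 and U+00B3
```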

But, if the content of the table is read using the Perl code from Example 1, the resulting output is:

As we can see, once the mysql_enable_utf8 flag has been set on the database connection, DBI detects the presence of UTF-8 sequences in the data it reads and transparently performs the required conversion.

Posted at 6:25 pm