Oct 31, 2012
 
Article Perl

In our previous post we explained how to process an XML file using the XML::Simple module from CPAN.

However, that module works by reading the whole file into memory. This is not suitable when the file to be processed is large and the available RAM is limited.

Instead, we can use the XML::Parser::PerlSAX module (SAX stands for “Simple API for XML”). With this module, the file is read as a data stream, and events such as “start of element”, “end of element”, etc. are generated as the parser advances. The programmer only needs to provide an event handler package that implements methods to process these standard events.

Example:

#!/usr/bin/perl
#
use strict;
use warnings;

use XML::Parser::PerlSAX;
# Create an instance of the Parser, passing as argument
# an instance of the event handler "XMLReader" that we are
# going to implement
my $parser = XML::Parser::PerlSAX->new( Handler => XMLReader->new( ) );

my $file= "customers.xml";
$parser->parse( Source => {SystemId => $file} );
exit;

In the “XMLReader” package, the methods “start_element”, “end_element” and “characters” are implemented to process the corresponding events:

###
### Event handler package XMLReader
###
package XMLReader;

#
# initialization
#
sub new {
    my $type = shift;
    return bless {}, $type;
}

#
# On receiving a start-of-element event, print the XML opening tag
# and the element attributes, passed as a hash reference argument
sub start_element {
    my( $self, $properties ) = @_;
    # note: as the attributes are received as a hashref, the order of
    # attributes in the input file is lost.

    print "<" . $properties->{'Name'};
    my %attributes = %{$properties->{'Attributes'}};
    foreach( keys( %attributes )) {
        print " $_=\"" . $attributes{$_} . "\"";
    }
    print ">";
}

#
# On receiving an end-of-element event, print the XML closing tag
#
sub end_element {
    my( $self, $properties) = @_;
    print "</" . $properties->{'Name'} . ">";
}

#
# On receiving text data, print them.
# Note that in order to generate valid XML, we must convert some characters
# into escape sequences: For instance, '<' must be converted into '&lt;'
#
sub characters {
    my( $self, $properties ) = @_;
    my $data = $properties->{'Data'};
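    # '&' must be escaped first, so that the '&' of the entities
    # produced below is not escaped a second time
    $data =~ s/&/&amp;/g;
    $data =~ s/</&lt;/g;
    $data =~ s/>/&gt;/g;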
    print $data;
}
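
The escaping done in “characters” can also be written as a single pass over the data. The following is only a sketch: the helper name “xml_escape” and the lookup table are our own choices for illustration and are not part of XML::Parser::PerlSAX. It also escapes the double quote, so the same helper could be applied to the attribute values printed in “start_element”, which the listing above prints unescaped.

```perl
use strict;
use warnings;

# Hypothetical helper (not provided by XML::Parser::PerlSAX):
# maps each special character to its predefined XML entity.
my %ENTITY = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
);

sub xml_escape {
    my( $text ) = @_;
    # A single pass with /g replaces every occurrence, and an '&'
    # that belongs to an inserted entity is never escaped again.
    $text =~ s/([&<>"])/$ENTITY{$1}/g;
    return $text;
}

print xml_escape( 'Tom & Jerry <tom@example.org> said "hi"' ), "\n";
# prints: Tom &amp; Jerry &lt;tom@example.org&gt; said &quot;hi&quot;
```

With such a helper, the body of “characters” would reduce to printing xml_escape( $properties->{'Data'} ).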

If we execute this sample program to read the “customers.xml” file introduced in our previous post, the output is:

<customers timestamp="2002-05-13 15:33:45" version="3.5">
 <client identifier="62520">
  <name>John</name>
  <surname>Williams</surname>
  <address>
    <street>17 Liberty Ave.</street>
    <locality>Birmingham</locality>
    <province>Birmingham</province>
    <zip>82649</zip>
  </address>
  <email>john.williams@expensive-mail.org</email>
  <age>42</age>
 </client>
 <client identifier="62521">
  <name>Helen</name>
  <surname>Hightower</surname>
   <address>
    <street>2 Flying Saucer</street>
    <locality>Southampton</locality>
    <province>Southampton</province>
    <zip>28001</zip>
   </address>
   <email>elerovw@cyb.org</email>
   <age>37</age>
  </client>
</customers>
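
Echoing the document back is only a demonstration. In practice, a handler usually keeps some state between events in order to extract the data it needs. The following sketch assumes that we only want the surnames: the package name “SurnameCollector” and its fields are hypothetical, and a shortened version of the document is parsed from a string (Source => {String => ...}) rather than from “customers.xml”, so that the example is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Parser::PerlSAX;

###
### Hypothetical event handler: collects the text of every <surname>
###
package SurnameCollector;

sub new {
    my $type = shift;
    return bless { in_surname => 0, buffer => '', surnames => [] }, $type;
}

# Remember that we are inside a <surname> element
sub start_element {
    my( $self, $properties ) = @_;
    if( $properties->{'Name'} eq 'surname' ) {
        $self->{in_surname} = 1;
        $self->{buffer}     = '';
    }
}

# The text of one element may arrive in several "characters" events,
# so it is accumulated until the closing tag is seen
sub characters {
    my( $self, $properties ) = @_;
    $self->{buffer} .= $properties->{'Data'} if $self->{in_surname};
}

# On the closing tag, store the accumulated text
sub end_element {
    my( $self, $properties ) = @_;
    if( $properties->{'Name'} eq 'surname' ) {
        push @{ $self->{surnames} }, $self->{buffer};
        $self->{in_surname} = 0;
    }
}

package main;

# Shortened sample document, inlined for this sketch
my $xml = <<'XML';
<customers>
 <client identifier="62520"><name>John</name><surname>Williams</surname></client>
 <client identifier="62521"><name>Helen</name><surname>Hightower</surname></client>
</customers>
XML

my $handler = SurnameCollector->new();
my $parser  = XML::Parser::PerlSAX->new( Handler => $handler );
$parser->parse( Source => { String => $xml } );

# One surname per line
print "$_\n" foreach @{ $handler->{surnames} };
```

Only the current surname is held in the buffer at any moment, so memory use does not grow with the size of the document, apart from the list of results itself.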

 Posted by at 8:05 pm
