Nov 03 2014
 
Article PHP

XML is a format commonly used for the interchange of data between software applications.

The easiest way to process XML data is to read the whole document into a data structure native to the programming language in use. In PHP, the result would be an associative array of (key, value) pairs, where each value is either another associative array or a primitive value of numeric or string type.

But sometimes the file holding the XML document is several gigabytes in size, and the volume of data is too big to hold the entire document in memory. In those cases, the file must be processed as a stream: the elements in the file are read and processed one at a time.

This post explains how to work in PHP with large XML documents, by reading them as streams.

1. Opening the XML file

Since PHP 5.1, the base distribution includes the XMLReader class, which is intended to read XML documents as streams.

The XML document is opened with a call to the “open” method implemented in XMLReader:

$xmlfile = "data.xml";
$reader = new XMLReader();
$reader->open($xmlfile);
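
The call to open() returns false when the file cannot be opened, so it is usually worth checking the result before entering a read loop. A minimal sketch of that check follows; the helper name openXmlReader is hypothetical and not part of XMLReader, and the exception type is just one possible choice:

```php
<?php
// Hypothetical helper (not part of XMLReader): open a file or fail loudly.
function openXmlReader(string $path): XMLReader
{
    $reader = new XMLReader();
    // open() returns false on failure; "@" silences the accompanying warning.
    if (!@$reader->open($path)) {
        throw new RuntimeException("Cannot open XML file: $path");
    }
    return $reader;
}

try {
    $reader = openXmlReader('data.xml');
    // ... the read loop would go here ...
    $reader->close();
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";
}
```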

Often, the file that contains the XML document is compressed to save disk space (XML compresses very well with standard zip, gzip or bz2 compression). In these cases, PHP's standard compression stream wrappers can be used to read the compressed file directly:

$xmlfile = 'compress.zlib://data.xml.gz';
$reader = new XMLReader();
$reader->open($xmlfile);
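
To try this without a real data file, the sketch below first writes a tiny gzip-compressed document to the temporary directory and then reads it back through the compress.zlib:// wrapper. The file name and the sample data are made up for illustration, and the zlib extension is assumed to be enabled:

```php
<?php
// Create a small gzip-compressed XML file (illustrative data only).
$path = sys_get_temp_dir() . '/sample_' . uniqid() . '.xml.gz';
file_put_contents($path, gzencode('<items><item>one</item><item>two</item></items>'));

// Read it back as a stream; decompression happens on the fly.
$reader = new XMLReader();
$reader->open('compress.zlib://' . $path);

$items = [];
while ($reader->read()) {
    // Collect the content of every text node.
    if ($reader->nodeType === XMLReader::TEXT) {
        $items[] = $reader->value;
    }
}
$reader->close();
unlink($path);

print_r($items);
```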

2. XML element read loop

The “read()” method in XMLReader advances the reader to the next node in the document, one node at a time, once the document has been opened with “open()”:

while ($reader->read()) {
    echo "Element name: " . $reader->name . ", type: " . $reader->nodeType . "\n";
}

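In the sample code above, the name and type of each node, stored in the “name” and “nodeType” properties of the $reader object, are printed to standard output. Note that read() stops at every node, so text nodes, whitespace and closing tags appear in the output alongside the opening tags.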
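
To see which nodes the loop actually visits, the following sketch runs it over a three-line document held in a string (loaded with the XML() method instead of open()) and labels the most common node types. The sample document and the $types lookup table are illustrative assumptions:

```php
<?php
$xml = "<root>\n  <item>text</item>\n</root>";

$reader = new XMLReader();
$reader->XML($xml);   // load the document from a string instead of a file

// Readable labels for a few of the node type constants.
$types = [
    XMLReader::ELEMENT     => 'ELEMENT',
    XMLReader::TEXT        => 'TEXT',
    XMLReader::END_ELEMENT => 'END_ELEMENT',
];

$seen = [];
while ($reader->read()) {
    // Any other type (whitespace between tags, for example) is shown by number.
    $label  = $types[$reader->nodeType] ?? ('type ' . $reader->nodeType);
    $seen[] = $label . ' ' . $reader->name;
}
$reader->close();

echo implode("\n", $seen), "\n";
```

A loop that only cares about opening tags therefore needs to test nodeType against XMLReader::ELEMENT before using the node.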

The $reader object makes all the information about the current node available through properties:

  • name – node name
  • nodeType – node type, as an integer. The types are defined as constants in the XMLReader class (1: XMLReader::ELEMENT, 2: XMLReader::ATTRIBUTE, 3: XMLReader::TEXT, etc.). The complete set of node types can be found in the official XMLReader documentation.
  • isEmptyElement – Boolean. True if the element is written as an empty-element tag ( <element /> ), and so has no value and no subelements
  • hasAttributes – Boolean. True if the element has attributes
  • attributeCount – number of element attributes
  • hasValue – Boolean. True if the current node carries a text value, as text, CDATA, comment and attribute nodes do. For <element>value</element>, it is the text node inside the element that has the value; the element node itself reports false
  • value – text value of the current node (an empty string if the node has none)
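
The properties are easier to understand when inspected on a concrete document. The sketch below records them for every node of a small, made-up document; it is meant to show that an element node reports its attributes while the text inside it is delivered as a separate node that carries the value:

```php
<?php
$xml = '<doc><a id="1" lang="es">hola</a><b/></doc>';

$reader = new XMLReader();
$reader->XML($xml);

$rows = [];
while ($reader->read()) {
    // Snapshot of the properties of the current node.
    $rows[] = [
        'name'           => $reader->name,
        'nodeType'       => $reader->nodeType,
        'isEmptyElement' => $reader->isEmptyElement,
        'hasAttributes'  => $reader->hasAttributes,
        'attributeCount' => $reader->attributeCount,
        'hasValue'       => $reader->hasValue,
        'value'          => $reader->value,
    ];
}
$reader->close();

foreach ($rows as $row) {
    echo json_encode($row), "\n";
}
```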

XMLReader also implements a set of methods to access information about the current node:

  • getAttribute($name) – Returns the value of the named attribute, or NULL if the attribute does not exist
  • readInnerXml() – Returns the content of the node, including markup, as a string
  • readOuterXml() – Returns the node itself plus its content, including markup, as a string
  • etc…
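
As a short illustration of the three methods listed above, the sketch below stops at the first “address” element of a made-up, one-line document and captures its attribute, its inner XML and its outer XML:

```php
<?php
$xml = '<addresses><address id="1"><town>Monterrey</town></address></addresses>';

$reader = new XMLReader();
$reader->XML($xml);

$id = $inner = $outer = null;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'address') {
        $id    = $reader->getAttribute('id');   // attribute value as a string
        $inner = $reader->readInnerXml();       // children only
        $outer = $reader->readOuterXml();       // the element plus its children
        break;
    }
}
$reader->close();

echo "id: $id\ninner: $inner\nouter: $outer\n";
```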

3. Processing individual elements with SimpleXML

The XML document can be completely processed using only the properties and methods implemented in XMLReader. However, the resulting PHP code can be difficult to read and maintain.

Besides, large XML documents usually consist of a long collection of nodes of the same type, each of which is small enough to be converted into a PHP data structure without difficulty. In these cases, it is worth combining the streaming of XMLReader with the simplicity of SimpleXML.

As an example, consider a document that holds a large collection of elements of type “address”. Each of them includes the subelements “country”, “town”, “street”, etc. Each “address” element is also uniquely identified by an “id” attribute:

<?xml version="1.0" encoding="utf-8"?>
<addresses>
    <address id="1">
        <country><![CDATA[Mexico]]></country>
        <town><![CDATA[Monterrey]]></town>
        <street><![CDATA[Benito Juárez]]></street>
    </address>
    <address id="2">
        ...
    </address>
      ...
</addresses>

The whole document can be processed in a loop where XMLReader reads each “address” element, and SimpleXMLElement converts it into a PHP object:

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'address') {
        // For each node of type "address": convert it into a SimpleXMLElement
        $address = new SimpleXMLElement($reader->readOuterXml());
        $attributes = $address->attributes();
        echo "ID: " . $attributes->id .
             ", country: " . $address->country .
             ", town: " . $address->town .
             ", street: " . $address->street . "\n";
    }
}

If the above loop is executed on the sample XML data above, the resulting output is:

ID: 1, country: Mexico, town: Monterrey, street: Benito Juárez
ID: 2, ...
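
One possible refinement, shown here only as a sketch, is to wrap the loop in a generator and to use the next() method of XMLReader, which jumps to the following element with a given name without visiting the subtree of the current one. The function name readAddresses and the shortened sample data are assumptions made for this illustration:

```php
<?php
// Hypothetical helper: yields each <address> element as a SimpleXMLElement.
function readAddresses(XMLReader $reader): Generator
{
    // Advance to the first <address> element, if there is one.
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'address') {
            break;
        }
    }
    while ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'address') {
        yield new SimpleXMLElement($reader->readOuterXml());
        // Skip the subtree just converted and move to the next <address>.
        if (!$reader->next('address')) {
            break;
        }
    }
}

$xml = '<?xml version="1.0" encoding="utf-8"?>
<addresses>
    <address id="1"><town><![CDATA[Monterrey]]></town></address>
    <address id="2"><town>Madrid</town></address>
</addresses>';

$reader = new XMLReader();
$reader->XML($xml);

$lines = [];
foreach (readAddresses($reader) as $address) {
    $lines[] = 'ID: ' . $address['id'] . ', town: ' . $address->town;
}
$reader->close();

echo implode("\n", $lines), "\n";
```

Only one “address” element is held in memory at a time, so the calling code can use a plain foreach while the document is still read as a stream.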


 Posted by at 9:05 am
