PHP 5 introduced XMLReader, a new class for
reading Extensible Markup Language (XML). Discover the XMLReader
library, which is bundled with PHP 5 and enables PHP pages to process
XML documents in an efficient streaming mode.
Pull parsing XML in PHP
PHP 5 introduced XMLReader, a new class for reading Extensible Markup Language (XML). Unlike SimpleXML or the Document Object Model (DOM), XMLReader
operates in streaming mode. That is, it reads the document from start
to finish. You can begin to work with the content at the beginning
before you see the content at the end. This makes it very fast, very
efficient, and very parsimonious with memory. The larger the documents
you need to process, the more important this is.
Unlike the Simple API for XML (SAX), XMLReader is a
pull parser rather than a push parser. This means that your program is
in control. Rather than being told what the parser sees when the parser
sees it, you tell the parser when to go fetch the next piece of the
document. You request content rather than react to it. Another way of
thinking about it: XMLReader is an implementation of the Iterator design pattern rather than the Observer design pattern.
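To make the pull style concrete, here is a minimal sketch, assuming a small document held in a string. The helper name firstElementText() is hypothetical and not part of XMLReader. The point is that the loop, not the parser, decides when to advance and when to stop.

```php
<?php
// Hypothetical helper: return the text of the first element called $name,
// or null if no such element is seen. Assumes the element holds plain text.
function firstElementText(string $xml, string $name): ?string
{
    $reader = new XMLReader();
    $reader->XML($xml);
    $text = null;
    // Pull style: the program asks for each node in turn...
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $name) {
            $reader->read();          // step onto the element's first child
            $text = $reader->value;
            break;                    // ...and stops asking once it has what it needs
        }
    }
    $reader->close();
    return $text;
}

$doc = "<methodCall><methodName>sqrt</methodName><params/></methodCall>";
echo firstElementText($doc, "methodName"), "\n";
```

Because the loop breaks as soon as the method name is found, the rest of the document is never requested from the parser.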
A sample problem
Let's begin with a simple example. Suppose you're writing
a PHP script that receives XML-RPC requests and generates
responses. More specifically, suppose the requests
look like Listing 1. The root element of the document is methodCall, which contains a methodName element and a params element. The method name is sqrt.
The params element contains one
param element that contains a double whose square root is desired.
Namespaces aren't used.
Listing 1. An XML-RPC request
<?xml version="1.0"?>
<methodCall>
<methodName>sqrt</methodName>
<params>
<param>
<value><double>36.0</double></value>
</param>
</params>
</methodCall>
Here's what the PHP script needs to do:
- Check the method name, and generate a fault response if it's not sqrt (the only method this script knows how to handle).
- Find the argument, and generate a fault response
if it's not present or has the wrong type.
- Otherwise, calculate the square root.
- Return the result in the form shown in Listing 2.
Listing 2. An XML-RPC response
<?xml version="1.0"?>
<methodResponse>
<params>
<param>
<value><double>6.0</double></value>
</param>
</params>
</methodResponse>
Let's develop this step by step.
Initialize the parser and load the document
The first step is to create a new parser object. Doing so is straightforward:
$reader = new XMLReader();
Next, you need to give it some data to parse. For XML-RPC, this is the
raw body of the Hypertext Transfer Protocol (HTTP) request. This string
can then be passed to the reader's XML() function:
$request = $HTTP_RAW_POST_DATA;
$reader->XML($request);
You can parse any string, wherever you get it. For instance, it can be
a string literal in the program or read from a local file. You can also
load data from an external URL with the open() function. For example, this statement prepares to parse one of my Atom feeds:
$reader->open('http://www.cafeaulait.org/today.atom');
Wherever you get your raw data, the reader is now set up and ready to parse.
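The two loading styles can be compared side by side. This is a sketch under a few assumptions: countElements() is a hypothetical stand-in for real processing, and a temporary file stands in for a remote URL so the example does not depend on the network.

```php
<?php
// Hypothetical stand-in for real processing: count element start tags.
function countElements(XMLReader $reader): int
{
    $count = 0;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT) {
            $count++;
        }
    }
    $reader->close();
    return $count;
}

$xml = "<methodCall><methodName>sqrt</methodName></methodCall>";

// 1. Parse a string already in memory. On PHP versions that no longer set
//    $HTTP_RAW_POST_DATA, file_get_contents('php://input') yields the raw
//    request body as a string instead.
$fromString = new XMLReader();
$fromString->XML($xml);
$stringCount = countElements($fromString);

// 2. Parse from a file name or URL with open(). A temporary file is used
//    here in place of a remote feed.
$path = tempnam(sys_get_temp_dir(), "xr");
file_put_contents($path, $xml);
$fromFile = new XMLReader();
$fromFile->open($path);
$fileCount = countElements($fromFile);
unlink($path);

echo "$stringCount $fileCount\n";
```

Either way, the code that consumes the reader is identical; only the setup call differs.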
Read the document
The read() function advances the parser to the next token. The simplest approach is to iterate through the entire document in a while loop:
while ($reader->read()) {
    // processing code goes here...
}
After you're finished, call $reader->close() to release any resources the parser is holding onto and reset it for the next document.
Inside the loop, the parser is positioned on a particular node: the
start of an element, the end of an element, a text node, a comment, and
so forth. You can find out what the parser is looking at right now by
inspecting these properties:
- localName is the local, unprefixed name of the node.
- name is the possibly prefixed name of the node. For nodes such as comments that don't have names, it's #comment, #text, #document, and so forth, as in DOM.
- namespaceURI is the Uniform Resource Identifier (URI) for the node's namespace.
- nodeType is an integer representing the node type -- for example, 2 for an attribute node and 7 for a processing instruction.
- prefix is the node's namespace prefix.
- value is the node's text content.
- hasValue is true if the node has a text value or false otherwise.
Of course, not all node types have all these properties. For instance,
text nodes, CDATA sections, comments, processing instructions,
attributes, whitespace, document types, and XML declarations have
values. Other node types (most significantly, elements and documents)
don't.
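As a rough illustration of these properties, the following sketch labels each node the reader stops on. The $labels table and the tiny sample document are assumptions made for this example; the constants themselves are the ones XMLReader defines.

```php
<?php
// Readable labels for the node types this sketch expects to meet.
$labels = [
    XMLReader::ELEMENT                => "ELEMENT",
    XMLReader::END_ELEMENT            => "END_ELEMENT",
    XMLReader::TEXT                   => "TEXT",
    XMLReader::CDATA                  => "CDATA",
    XMLReader::COMMENT                => "COMMENT",
    XMLReader::PI                     => "PI",
    XMLReader::WHITESPACE             => "WHITESPACE",
    XMLReader::SIGNIFICANT_WHITESPACE => "SIGNIFICANT_WHITESPACE",
];

$reader = new XMLReader();
$reader->XML("<a><!--note--><b>text</b><?target data?></a>");

$seen = [];
while ($reader->read()) {
    // Fall back to the raw integer for any type not in the table.
    $label = $labels[$reader->nodeType] ?? "OTHER(" . $reader->nodeType . ")";
    $line = $label . " " . $reader->name;
    if ($reader->hasValue) {
        $line .= " = " . $reader->value;
    }
    $seen[] = $line;
    echo $line, "\n";
}
$reader->close();
```

Comparing nodeType against the named constants rather than bare integers keeps the intent readable.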
Generally, a program uses the nodeType property to figure out what it's looking at and then responds appropriately. Listing 3 shows a simple while
loop that uses these properties to print what it sees. Listing 4 shows
the output from this program when Listing 1 is fed into it.
Listing 3. What the parser sees
while ($reader->read()) {
    echo $reader->name;
    if ($reader->hasValue) {
        echo ": " . $reader->value;
    }
    echo "\n";
}
Listing 4. Output from Listing 3
methodCall
#text:
methodName
#text: sqrt
methodName
#text:
params
#text:
param
#text:
value
double
#text: 36.0
double
value
#text:
param
#text:
params
#text:
methodCall
Most programs aren't so generic. They accept input
in a particular form and process it in some way. In the XML-RPC
example, you need to read only one thing in the input: the double element, of which there should be
exactly one. To do that, you look for the start of an element with the name double:
if ($reader->name == "double"
&& $reader->nodeType == XMLReader::ELEMENT) {
// ...
}
This element likely has a single text node child, which you can read by advancing the parser to the next node like so:
if ($reader->name == "double" && $reader->nodeType == XMLReader::ELEMENT) {
    $reader->read();
    respond($reader->value);
}
Here the respond()
function builds the XML-RPC response and sends it to the client.
However, before I show that, there's something else I need to address.
It's not absolutely guaranteed that the double element in
the request document contains exactly one text node. It might contain
several, as well as comments and processing instructions. For instance,
it could look like this:
<value><double>
<!--value follows-->6.<!--fractional part next-->0
</double></value>
A robust solution needs to get all the text node children of the double element, concatenate them, and only then convert the result to a double.
It needs to carefully avoid any comments or other non-text nodes that
might appear. This is a little more complex, but not excessively so, as
Listing 5 shows.
Listing 5. Accumulate all text content from an element
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::TEXT
        || $reader->nodeType == XMLReader::CDATA
        || $reader->nodeType == XMLReader::WHITESPACE
        || $reader->nodeType == XMLReader::SIGNIFICANT_WHITESPACE) {
        $input .= $reader->value;
    }
    else if ($reader->nodeType == XMLReader::END_ELEMENT
        && $reader->name == "double") {
        break;
    }
}
You can ignore everything else in the document for the moment. (I'll add more error-handling later.)
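The accumulation loop can also be wrapped in a reusable function. This is a sketch, not the article's final code: collectText() is a hypothetical helper that assumes the reader is already positioned on the element's start tag, and the sample input is the comment-laden double element shown earlier.

```php
<?php
// Hypothetical helper: assumes $reader is positioned on the start tag of
// $name, and returns the concatenated text up to the matching end tag.
function collectText(XMLReader $reader, string $name): string
{
    $text = "";
    if ($reader->isEmptyElement) {
        return $text;   // an empty element has no end tag to wait for
    }
    while ($reader->read()) {
        switch ($reader->nodeType) {
            case XMLReader::TEXT:
            case XMLReader::CDATA:
            case XMLReader::WHITESPACE:
            case XMLReader::SIGNIFICANT_WHITESPACE:
                $text .= $reader->value;
                break;
            case XMLReader::END_ELEMENT:
                if ($reader->name === $name) {
                    return $text;
                }
                break;
            // comments, processing instructions, and so on are skipped
        }
    }
    return $text;
}

$xml = "<value><double>\n<!--value follows-->6.<!--fractional part next-->0\n</double></value>";
$reader = new XMLReader();
$reader->XML($xml);
$raw = "";
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === "double") {
        $raw = collectText($reader, "double");
        break;
    }
}
$reader->close();

// Surrounding whitespace is part of the text content, so trim before converting.
$number = (float) trim($raw);
echo $number, "\n";
```

Trimming before the conversion matters here because the line breaks inside the element are reported as text along with the digits.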
Build the response
As its name implies, XMLReader is purely for reading. A corresponding XMLWriter
class is in development but isn't yet ready for production.
Fortunately, writing XML is much easier than reading it. First, you
should set the media type of the response using the header() function. The XML-RPC specification calls for text/xml; this article uses the closely related application/xml. For example:
header('Content-type: application/xml');
The content can usually be echoed straight onto the page, as shown in the respond() function in Listing 6.
Listing 6. Echo XML
function respond($input) {
echo "<?xml version='1.0'?>
<methodResponse>
<params>
<param>
<value><double>" .
sqrt($input)
. "</double></value>
</param>
</params>
</methodResponse>";
}
You can even embed the literal parts of the response directly in the
PHP page, just as you would with HTML. Listing 7 demonstrates this
technique.
Listing 7. Literal XML
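function respond($input) {
    // Echo the XML declaration from PHP code; written literally, its
    // opening characters can be taken for a PHP short open tag.
    echo "<?xml version='1.0'?>\n";
?>
<methodResponse>
<params>
<param>
<value><double><?php
    echo sqrt($input);
?></double></value>
</param>
</params>
</methodResponse>
<?php
}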
Error handling
Until now, I implicitly assumed that the input document was well-formed.
However, there's no guarantee of that. Like any XML parser, XMLReader is required to stop processing as soon as it detects a well-formedness error. If it does so, the read() function returns false.
Theoretically, the parser could report data up to the first error it
finds. In my experiments with small documents, however, it errors out
almost immediately. The underlying parser is preparsing a large chunk
of the document, caching it, and then doling it out a piece at a time.
Thus it tends to detect errors prematurely. For safety's sake, don't
assume you'll be able to parse content before the first well-formedness
error. Furthermore, don't assume you won't see any content before the
parser error. If you want to accept only complete, well-formed
documents, then make sure your script doesn't do anything irreversible
until the end of the document is seen.
If the parser detects a well-formedness error, then the read()
function echoes an error message such as this one (if verbose error
reporting is turned on, as it should be on a development server):
<br />
<b>Warning</b>: XMLReader::read() [<a href='function.read'>function.read</a>]:
< value><double>10</double></value> in <b>/var/www/root.php</b>
on line <b>35</b><br />
You probably don't want to copy this into the HTML page the user sees. A better approach is to capture the error message in the $php_errormsg variable. To do this, you need to turn on the track_errors configuration option in your php.ini file:
track_errors = On
The track_errors option is off by default; this is explicitly specified in php.ini, so make sure you change that line.
If you add the previous line early in php.ini, as I initially did, the later track_errors = Off line will override it.
This program should send responses only to
complete, well-formed input. (Valid too, but I'll get to that.) Thus
you need to wait until you're finished parsing the document
(you've broken out of the while loop). At that point, you check to see whether $php_errormsg
is set. If it isn't, the document is well-formed, and you send an
XML-RPC response message. If the variable is set, the document is not
well-formed, and you instead send an XML-RPC fault response. You also
send a fault response if someone requests the square root of a negative
number. Listing 8 demonstrates.
Listing 8. Check for well-formedness
// set up the request
$request = $HTTP_RAW_POST_DATA;
error_reporting(E_ERROR | E_WARNING | E_PARSE);
// clear any stale message so a later isset() reflects this parse only
if (isset($php_errormsg)) unset($php_errormsg);
// create the reader
$reader = new XMLReader();
// $reader->setRelaxNGSchema("request.rng");
$reader->XML($request);
$input = "";
while ($reader->read()) {
    if ($reader->name == "double" && $reader->nodeType == XMLReader::ELEMENT) {
        while ($reader->read()) {
            if ($reader->nodeType == XMLReader::TEXT
                || $reader->nodeType == XMLReader::CDATA
                || $reader->nodeType == XMLReader::WHITESPACE
                || $reader->nodeType == XMLReader::SIGNIFICANT_WHITESPACE) {
                $input .= $reader->value;
            }
            else if ($reader->nodeType == XMLReader::END_ELEMENT
                && $reader->name == "double") {
                break;
            }
        }
        break;
    }
}
// make sure the input was well-formed
if (isset($php_errormsg)) fault(21, $php_errormsg);
else if ($input < 0) fault(20, "Cannot take square root of negative number");
else respond($input);
This is a simple
version of a common pattern in streaming processing of XML. The parser
fills a data structure that is acted on when the document is finished.
Usually the data structure is simpler than the document itself. Here
the data structure is especially simple: a single string.
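The same fill-then-act pattern can be sketched with libxml's own error buffer in place of track_errors. This is an alternative technique rather than the approach used in Listing 8: libxml_use_internal_errors() and libxml_get_errors() are standard PHP functions, while parseCollectingErrors() is a hypothetical helper written for this example.

```php
<?php
// Hypothetical helper: returns [names of elements seen, list of LibXMLError objects].
function parseCollectingErrors(string $xml): array
{
    $previous = libxml_use_internal_errors(true);   // buffer errors instead of printing them
    libxml_clear_errors();

    $reader = new XMLReader();
    $reader->XML($xml);
    $names = [];
    // @ silences the generic warning read() itself raises on failure;
    // the specific messages are still collected by libxml.
    while (@$reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT) {
            $names[] = $reader->name;
        }
    }
    $reader->close();

    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);          // restore the caller's setting
    return [$names, $errors];
}

[$names, $errors] = parseCollectingErrors("<a><b>1</b></a>");
[$partial, $problems] = parseCollectingErrors("<a><b>1</b>< c></a>");

// Act only after parsing has finished: respond if clean, report otherwise.
foreach ($problems as $e) {
    echo trim($e->message), " (line ", $e->line, ")\n";
}
```

Nothing is sent to the client until the error list has been inspected, which keeps the script from acting on a document that later turns out to be malformed.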
Validation
Until now, I've been cavalier about verifying that the data was where I thought it was. The easiest way
to accomplish this verification is to check the document against a schema. XMLReader supports the RELAX NG schema language; Listing 9 shows a simple RELAX NG schema for this specific form of XML-RPC request.
Listing 9. A RELAX NG schema for the XML-RPC request
<element name="methodCall" xmlns="http://relaxng.org/ns/structure/1.0"
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
<element name="methodName">
<value>sqrt</value>
</element>
<element name="params">
<element name="param">
<element name="value">
<element name="double">
<data type="double"/>
</element>
</element>
</element>
</element>
</element>
You can embed the schema directly in the PHP script as a string literal using
setRelaxNGSchemaSource() or read it from an external file or URL using
setRelaxNGSchema(). For example, assuming Listing 9 is in the file sqrt.rng, here's how you load the schema:
$reader->setRelaxNGSchema("sqrt.rng");
Do this after you've given the reader its input with XML() or open() but
before the first call to read(). The parser
checks the document against the schema as it reads. To
check whether the document is valid, you call isValid(),
which returns true if the document is valid (so far) and false
if it isn't. Listing 10 demonstrates the complete finished
program, including all error handling.
This should accept any legal input and
return a correct value, and reject all incorrect requests. I've
also added a fault() function that sends an XML-RPC fault response
when something goes wrong.
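The validation step on its own can be sketched as follows. This is a minimal example under stated assumptions: isValidRequest() is a hypothetical helper, the schema is the one from Listing 9 embedded as a string, and libxml's internal error buffer is used both to keep messages off the page and to catch documents that are not well-formed.

```php
<?php
// The schema from Listing 9, embedded as a string.
$schema = <<<'RNG'
<element name="methodCall" xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <element name="methodName"><value>sqrt</value></element>
  <element name="params">
    <element name="param">
      <element name="value">
        <element name="double"><data type="double"/></element>
      </element>
    </element>
  </element>
</element>
RNG;

// Hypothetical helper: true if $xml is well-formed and matches $schema.
function isValidRequest(string $xml, string $schema): bool
{
    $previous = libxml_use_internal_errors(true);  // keep messages off the page
    libxml_clear_errors();
    $reader = new XMLReader();
    $reader->XML($xml);                        // give the reader its input first...
    $reader->setRelaxNGSchemaSource($schema);  // ...then attach the schema, before read()
    while (@$reader->read()) {
        // Read to the end so the whole document is checked.
    }
    // Any buffered error (well-formedness or validity) counts as a failure.
    $valid = $reader->isValid() && count(libxml_get_errors()) === 0;
    $reader->close();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $valid;
}

$good = "<methodCall><methodName>sqrt</methodName><params><param>"
      . "<value><double>36.0</double></value></param></params></methodCall>";
$bad  = "<methodCall><methodName>sqrt</methodName></methodCall>"; // params missing

var_dump(isValidRequest($good, $schema), isValidRequest($bad, $schema));
```

Reading to the end before calling isValid() matters because the result only reflects the part of the document the reader has seen so far.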
Listing 10. The complete XML-RPC square root server
<?php
header('Content-type: application/xml');
// try grammar
$schema = "<element name='methodCall'
xmlns='http://relaxng.org/ns/structure/1.0'
datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'>
<element name='methodName'>
<value>sqrt</value>
</element>
<element name='params'>
<element name='param'>
<element name='value'>
<element name='double'>
<data type='double'/>
</element>
</element>
</element>
</element>
</element>";
if (!isset($HTTP_RAW_POST_DATA)) {
    fault(22, "Please make sure always_populate_raw_post_data = On in php.ini");
}
else {
    // set up the request
    $request = $HTTP_RAW_POST_DATA;
    error_reporting(E_ERROR | E_WARNING | E_PARSE);
    // create the reader and hand it the request body
    $reader = new XMLReader();
    $reader->XML($request);
    // attach the embedded schema once the input is set, before the first read()
    $reader->setRelaxNGSchemaSource($schema);
    $input = "";
    while ($reader->read()) {
        if ($reader->name == "double" && $reader->nodeType == XMLReader::ELEMENT) {
            while ($reader->read()) {
                if ($reader->nodeType == XMLReader::TEXT
                    || $reader->nodeType == XMLReader::CDATA
                    || $reader->nodeType == XMLReader::WHITESPACE
                    || $reader->nodeType == XMLReader::SIGNIFICANT_WHITESPACE) {
                    $input .= $reader->value;
                }
                else if ($reader->nodeType == XMLReader::END_ELEMENT
                    && $reader->name == "double") {
                    break;
                }
            }
            break;
        }
    }
    if (isset($php_errormsg)) fault(21, $php_errormsg);
    else if (! $reader->isValid()) fault(19, "Invalid request");
    else if ($input < 0) fault(20, "Cannot take square root of negative number");
    else respond($input);
    $reader->close();
}
function respond($input)
{
?>
<methodResponse>
<params>
<param>
<value><double><?php
echo sqrt($input);
?></double></value>
</param>
</params>
</methodResponse>
<?php
}
function fault($code, $message)
{
    // escape the message: parser errors often quote markup from the request
    echo "<?xml version='1.0'?>
<methodResponse>
<fault>
<value>
<struct>
<member>
<name>faultCode</name>
<value><int>" . $code . "</int></value>
</member>
<member>
<name>faultString</name>
<value>
<string>" . htmlspecialchars($message) . "</string>
</value>
</member>
</struct>
</value>
</fault>
</methodResponse>";
}
Attributes
Attributes aren't seen during the normal course of pull parsing.
To read attributes, you stop at the start of an element and request a specific attribute, either by name or number.
Pass the name of the attribute you want to getAttribute() to find the value of that attribute on the current element. For example, this statement asks for the id attribute of the current element:
$id = $reader->getAttribute("id");
If the attribute is in a namespace -- for example, xlink:href -- call getAttributeNS(),
passing the local name and namespace URI as the first and second
arguments, respectively. (The prefix doesn't matter.) For example, this
statement requests the value of the xlink:href attribute in the http://www.w3.org/1999/xlink namespace:
$href = $reader->getAttributeNS("href", "http://www.w3.org/1999/xlink");
In the PHP release this article was written against, both of these
methods return an empty string if the attribute doesn't
exist. (This is wrong. They should return null. The design
makes it hard to distinguish between an attribute whose value is the
empty string and one that isn't present at all.) Current PHP releases
return null for a missing attribute instead.
If you just want to know all the attributes on an element, and you don't know their names in advance, then call moveToNextAttribute()
when the reader is positioned on the element. Once the parser is
positioned on an attribute node, you can read its name, namespace, and
value with the same properties used for elements. For example, this
code fragment prints out all the attributes of the current element:
if ($reader->hasAttributes && $reader->nodeType == XMLReader::ELEMENT) {
    while ($reader->moveToNextAttribute()) {
        echo $reader->name . "='" . $reader->value . "'\n";
    }
    echo "\n";
}
Very unusually for an XML API, XMLReader lets you read the attributes from either the beginning or the end of the element. To avoid double counting, it's important to check that the node type is XMLReader::ELEMENT and not XMLReader::END_ELEMENT, which can also have attributes.
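The attribute calls above can be combined in one short sketch. The helper attributesOf() and the sample link element are assumptions made for illustration; getAttribute(), getAttributeNS(), moveToNextAttribute(), and moveToElement() are the XMLReader methods themselves.

```php
<?php
// Hypothetical helper: assumes $reader is on an element start tag and
// returns its attributes as name => value pairs.
function attributesOf(XMLReader $reader): array
{
    $attributes = [];
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->hasAttributes) {
        while ($reader->moveToNextAttribute()) {
            $attributes[$reader->name] = $reader->value;
        }
        $reader->moveToElement();   // step back from the last attribute to its element
    }
    return $attributes;
}

$xml = '<link id="a1" xlink:href="next.xml" xmlns:xlink="http://www.w3.org/1999/xlink"/>';
$reader = new XMLReader();
$reader->XML($xml);
$reader->read();                    // now positioned on the link element

// Look up individual attributes by name, and by local name plus namespace URI.
$id   = $reader->getAttribute("id");
$href = $reader->getAttributeNS("href", "http://www.w3.org/1999/xlink");

// Or walk every attribute without knowing the names in advance.
$all  = attributesOf($reader);
$reader->close();

foreach ($all as $name => $value) {
    echo "$name='$value'\n";
}
```

Namespace declarations such as xmlns:xlink may also show up when iterating, so code that walks all attributes should be prepared to see them. Calling moveToElement() afterward leaves the reader where the caller expects it.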
In conclusion
XMLReader is a useful addition to the PHP programmer's toolkit. Unlike SimpleXML,
it's a full XML parser that handles all documents, not just some of
them. Unlike DOM, it can handle documents larger than available memory.
Unlike SAX, it puts your program in control. If your PHP programs need
to accept XML input, XMLReader is well worth your consideration.