@chrisguitarguy
Created February 27, 2013 04:15
How to use a combination of XMLReader and SimpleXMLElement to parse large XML files in PHP.
<?php
/**
 * An example of how to read huge XML files relatively quickly and efficiently
 * using a few core PHP libraries.
 */

// Assume your file is very large, 140MB or something like that.
$fn = __DIR__ . '/some_file.xml';

// The tag we want to extract from the file.
$tag = 'item';

// We'll use XMLReader to "parse" the large XML file directly because it doesn't
// load the entire tree into memory; it just tokenizes the file one node at a time.
$reader = new \XMLReader();

// Now open our file.
if (!$reader->open($fn)) {
    throw new \RuntimeException("Could not open {$fn} with XMLReader");
}

// Loop through the file. read() just advances to the next node.
// XMLReader doesn't build a document tree, so nodes get
// iterated over as they appear in the file. We'll just read until
// the end of the file.
while ($reader->read()) {
    // XMLReader::$name will contain the current tag name. Check to see if it
    // matches the tag you're looking for. If it does, we can just iterate
    // over those tags using XMLReader::next().
    while ($tag === $reader->name) {
        // Since XMLReader doesn't really supply us with much of a usable
        // API, we can convert the current node to an instance of `SimpleXMLElement`.
        $elem = new \SimpleXMLElement($reader->readOuterXML());

        // Now use SimpleXMLElement as you normally would.
        foreach ($elem->children() as $child) {
            echo $child->getName(), ': ', $child, PHP_EOL;
        }

        // Children in a certain namespace, even.
        foreach ($elem->children('http://purl.org/dc/elements/1.1/') as $child) {
            echo "{http://purl.org/dc/elements/1.1/}", $child->getName(), ': ', $child, PHP_EOL;
        }

        // Move on to the next one, skipping the current node's subtree.
        $reader->next($tag);
    }
}

// Release the underlying file handle once we're done.
$reader->close();
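
The same read-then-next loop can be wrapped in a generator so that calling code only sees one `SimpleXMLElement` at a time. This is a minimal sketch, not part of the original gist: `streamElements()` is a hypothetical helper name, and the `nodeType` check is an added guard so that only start tags are matched.

```php
<?php
/**
 * Yield each <$tag> element of an XML file as a SimpleXMLElement.
 * streamElements() is a hypothetical helper, not a PHP built-in.
 */
function streamElements(string $path, string $tag): \Generator
{
    $reader = new \XMLReader();
    if (!$reader->open($path)) {
        throw new \RuntimeException("Could not open {$path} with XMLReader");
    }

    try {
        // Advance node by node until a matching start tag shows up.
        while ($reader->read()) {
            while ($reader->nodeType === \XMLReader::ELEMENT && $reader->name === $tag) {
                // Only the current node is turned into a string and parsed.
                yield new \SimpleXMLElement($reader->readOuterXml());

                // Skip the current subtree and look for the next <$tag>.
                $reader->next($tag);
            }
        }
    } finally {
        // Runs when the generator finishes or is destroyed early.
        $reader->close();
    }
}
```

With this in place, `foreach (streamElements($fn, $tag) as $elem) { ... }` replaces the nested `while` loops, and the body of the loop works with `$elem` exactly as before.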
@dearsina

This is no good if the XML is large, because you're still loading the whole string into memory here:

$elem = new \SimpleXMLElement($reader->readOuterXML());

@chrisguitarguy
Author

Yeah, that's not at all how XMLReader works @dearsina: https://www.php.net/manual/en/xmlreader.readouterxml.php

It only reads the outer XML of the current node. If that node is large, then you'll have an issue, of course. Don't do that.
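
A quick way to see this for yourself is to collect what `readOuterXml()` returns for each matching node. The sketch below uses a tiny in-memory document as a stand-in for a large file; the sample XML and the `$chunks` variable are made up for illustration.

```php
<?php
// Made-up sample data standing in for a much larger file.
$xml = '<feed><item><title>First</title></item><item><title>Second</title></item></feed>';

$reader = new \XMLReader();
$reader->XML($xml);

$chunks = [];
while ($reader->read()) {
    while ($reader->nodeType === \XMLReader::ELEMENT && $reader->name === 'item') {
        // Serializes the current <item> and its descendants only.
        $chunks[] = $reader->readOuterXml();
        $reader->next('item');
    }
}
$reader->close();

// Each chunk holds a single <item>, never the whole document.
foreach ($chunks as $chunk) {
    echo strlen($chunk), ' bytes: ', $chunk, PHP_EOL;
}
```

Each string covers one `<item>` subtree, so memory use per iteration depends on the size of that node, not on the size of the file.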


@dearsina

dearsina commented Nov 20, 2022

Sure, you're reading the current node only. But that node could be massive. Here's an example of an XML with a bunch of nodes, most of them tiny, and then one massive one towards the end:

https://home.treasury.gov/policy-issues/financial-sanctions/specially-designated-nationals-list-data-formats-data-schemas

In fact, most large XMLs will have large nodes.
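
One possible way around a single massive node is to never call `readOuterXml()` on it at all: step inside it with `read()` and convert only its direct children. The following is a rough sketch under assumptions, not something from the gist. `streamChildren()` is a hypothetical helper, the `<big>` and `<entry>` names are invented sample data and do not reflect the structure of the Treasury file, and each child is still assumed to be small enough to fit in memory.

```php
<?php
/**
 * Yield the direct child elements of every <$container> element, one at a time,
 * without ever serializing the container itself.
 * streamChildren() is a hypothetical helper, not a PHP built-in.
 */
function streamChildren(\XMLReader $reader, string $container): \Generator
{
    while ($reader->read()) {
        if ($reader->nodeType !== \XMLReader::ELEMENT || $reader->name !== $container) {
            continue;
        }
        if ($reader->isEmptyElement) {
            continue; // <container/> has no children to walk.
        }

        $childDepth = $reader->depth + 1;
        $more = $reader->read(); // Step inside the big node.

        // Stay in this loop until we climb back out of the container.
        while ($more && $reader->depth >= $childDepth) {
            if ($reader->nodeType === \XMLReader::ELEMENT && $reader->depth === $childDepth) {
                yield new \SimpleXMLElement($reader->readOuterXml());
                $more = $reader->next(); // Skip this child's subtree.
            } else {
                $more = $reader->read(); // Whitespace, text, comments, etc.
            }
        }
    }
}

// Made-up sample: <big> plays the role of the one massive node.
$reader = new \XMLReader();
$reader->XML('<list><big><entry><id>1</id></entry><entry><id>2</id></entry></big><small>x</small></list>');

$ids = [];
foreach (streamChildren($reader, 'big') as $entry) {
    $ids[] = (string) $entry->id;
}
$reader->close();

echo implode(', ', $ids), PHP_EOL;
```

The trade-off is that you need to know which tag is the oversized one and pick a level of the tree where individual nodes are reasonably small. If the children are themselves huge, the same descent would have to go one level deeper.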
