Last active
February 14, 2021 06:38
Testing the performance of searching a lump of HTML with Regular Expressions, vs creating a DOMDocument.
<?php
$html = <<<EOT
<p><strong>Lorem #ipsum dolor sit amet</strong>, consectetur adipiscing elit. In in elit euismod, laoreet sapien eget, tristique ipsum. In #aliquam eros tortor, sit amet aliquet turpis suscipit eget. Maecenas eget vulputate metus. Phasellus at ligula ut nulla placerat imperdiet. Duis laoreet mauris <strong>eget dolor #egestas suscipit</strong>. In et #sodales elit. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. In tristique sit amet nisl ultrices rhoncus. Phasellus eget sem vitae urna pulvinar tristique non at velit. Integer eget nulla dolor. Vivamus quis iaculis massa, et faucibus mi. Quisque pretium dapibus massa, id imperdiet quam. #Morbi mollis ipsum eu mauris ultrices, <em>vel #pharetra quam sagittis</em>. Pellentesque auctor lacus massa, in tempor leo viverra id. Cras nisl ante, vehicula nec felis vitae, dictum sollicitudin eros. Donec sagittis id lorem ac tristique.</p>
<p>Duis quis consequat sapien. <a href="http://google.com/">Quisque porta nunc nec #nisi sollicitudin elementum</a>. Vestibulum facilisis tempus tristique. Nullam sed tristique nulla. In #egestas nec sapien quis tincidunt. Phasellus cursus lacinia mi, dictum bibendum dolor mollis condimentum. Suspendisse elementum, est sit amet luctus dapibus, orci tellus rutrum lacus, sit amet facilisis nisi lacus varius arcu.</p>
EOT;

// Benchmark 1: run the regular expression directly over the raw HTML string.
$start = microtime( true );
for ( $i = 0; $i < 1000; $i++ ) {
	$tags = array();
	preg_match_all( '/(?:^|\s)#([\w-]+)\b/', $html, $tags );
}
$end = microtime( true );
echo "preg_match_all: " . ( $end - $start ) . "\n";

// Benchmark 2: parse the HTML into a new DOMDocument on every iteration,
// then run the regular expression over each text node.
$start = microtime( true );
for ( $i = 0; $i < 1000; $i++ ) {
	$tags = array();
	$dom = new DOMDocument;
	$dom->loadHTML( '<?xml encoding="UTF-8">' . $html );
	$xpath = new DOMXPath( $dom );
	$textNodes = $xpath->query( '//text()' );
	foreach ( $textNodes as $textNode ) {
		$matches = array();
		if ( preg_match_all( '/(?:^|\s)#([\w-]+)\b/', $textNode->nodeValue, $matches ) ) {
			$tags = array_merge( $tags, $matches[1] );
		}
	}
}
$end = microtime( true );
echo "DOM (new document): " . ( $end - $start ) . "\n";

// Benchmark 3: parse the HTML once (still inside the timed region) and
// repeat only the XPath query and the per-node matching.
$start = microtime( true );
$dom = new DOMDocument;
$dom->loadHTML( '<?xml encoding="UTF-8">' . $html );
$xpath = new DOMXPath( $dom );
for ( $i = 0; $i < 1000; $i++ ) {
	$tags = array();
	$textNodes = $xpath->query( '//text()' );
	foreach ( $textNodes as $textNode ) {
		$matches = array();
		if ( preg_match_all( '/(?:^|\s)#([\w-]+)\b/', $textNode->nodeValue, $matches ) ) {
			$tags = array_merge( $tags, $matches[1] );
		}
	}
}
$end = microtime( true );
echo "DOM (cached document): " . ( $end - $start ) . "\n";
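
The three benchmarks above repeat the same timing boilerplate. As a rough sketch only, the loop could be factored into a small helper. The names `time_iterations()` and `extract_tags_regex()` and the shortened `$sample` string are hypothetical and not part of the original file, and `hrtime()` assumes PHP 7.3 or later (the original uses `microtime( true )`).

```php
<?php
// Hypothetical helper (not in the original gist): run $fn $iterations times
// and return the elapsed wall-clock time in seconds.
function time_iterations( callable $fn, int $iterations = 1000 ): float {
	$start = hrtime( true ); // Monotonic clock in nanoseconds (assumes PHP 7.3+).
	for ( $i = 0; $i < $iterations; $i++ ) {
		$fn();
	}
	return ( hrtime( true ) - $start ) / 1e9;
}

// Hypothetical wrapper around the same pattern the benchmarks use;
// returns only the captured tag names (capture group 1).
function extract_tags_regex( string $html ): array {
	$matches = array();
	preg_match_all( '/(?:^|\s)#([\w-]+)\b/', $html, $matches );
	return $matches[1];
}

// Shortened stand-in for the $html sample, for illustration only.
$sample = '<p>Lorem #ipsum dolor <em>sit #amet</em>.</p>';

$elapsed = time_iterations( function () use ( $sample ) {
	extract_tags_regex( $sample );
} );
echo "preg_match_all: " . $elapsed . "\n";
```

The DOM variants could be passed to the same helper as closures, which would keep the timed region identical across all three cases.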
Sample output, in case anyone wants to see it (Ubuntu 64-bit / Core i3)