XPath regular expression matching in PHP 5.3

Recently I needed to do some text pattern matching in an XML XPath query, and XPath’s built-in sub-string matching capabilities were not good enough.

While XPath 2.0 defines regular expression matching capabilities, it is still not widely implemented and in most available tools there is no easy way to do complex pattern matching on XML nodes.
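To make that limitation concrete, here is a minimal sketch of what XPath 1.0's built-in string functions can do. The inline XML document and its element names are made up for illustration. starts-with() and contains() handle simple cases, but there is no way to anchor a test to just the host part of a URL.

```php
<?php
// Hypothetical sample document, used only for this illustration
$xml = <<<XML
<links>
  <a href="http://en.wikipedia.org/wiki/PHP">Wikipedia</a>
  <a href="http://php.net/manual/">Manual</a>
  <a href="/wiki/Perl">Perl</a>
</links>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// starts-with() is enough to select absolute links...
$absolute = $xpath->query("//a[starts-with(@href, 'http://')]");

// ...and contains() can crudely exclude a host, but it would also exclude
// any URL that merely mentions "wikipedia.org" somewhere in its path
$external = $xpath->query(
    "//a[starts-with(@href, 'http://') and not(contains(@href, 'wikipedia.org'))]"
);

echo "Absolute links: " . $absolute->length . "\n";
echo "Links without 'wikipedia.org': " . $external->length . "\n";
```

Anything more precise than "contains this substring" quickly becomes unwieldy with these functions alone.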

Or is there?

In his blog, Thomas Weinert recently gave an intro to using DOM and its XPath capabilities in PHP. One of the cool features of DOM’s XPath, available starting from PHP 5.3.0 (have you upgraded yet?), is that the DOM extension supports registering pretty much any PHP function with the XPath engine and using it inside XPath queries.

Here is a quick example showing usage of PHP’s own preg_match() in an XPath query, to find all the external links in Wikipedia’s PHP article:

// Suppress XML parsing errors (this is needed to parse Wikipedia's XHTML)
libxml_use_internal_errors(true);

// Load the PHP Wikipedia article (the URL below is an assumption)
$domDoc = new DOMDocument();
$domDoc->load('http://en.wikipedia.org/wiki/PHP');

// Create XPath object and register the XHTML namespace
$xPath = new DOMXPath($domDoc);
$xPath->registerNamespace('html', 'http://www.w3.org/1999/xhtml');

// Register the PHP namespace if you want to call PHP functions
$xPath->registerNamespace('php', 'http://php.net/xpath');

// Register preg_match to be available in XPath queries
// You can also pass an array to register multiple functions, or call
// registerPhpFunctions() with no parameters to register all PHP functions
$xPath->registerPhpFunctions('preg_match');

// Find all external links in the article  
$regex = '@^http://[^/]+(?<!wikipedia.org)/@';
$links = $xPath->query("//html:a[ php:functionString('preg_match', '$regex', @href) > 0 ]");

// Print out matched entries
echo "Found " . (int) $links->length . " external links\n\n";
foreach ($links as $linkDom) { /* @var $linkDom DOMElement */
    $link = simplexml_import_dom($linkDom);
    $desc = (string) $link;
    $href = (string) $link['href'];
    echo " - ";
    if ($desc && $desc != $href) {
        echo "$desc: ";
    }
    echo "$href\n";
}

Note the use of php:functionString() as an XPath function, calling preg_match(). functionString() passes XML nodes such as @href into the function as strings. This is different from php:function() which, as far as I have seen, passes parameters without casting them to a string first (however, I am not sure what exactly they are passed as… maybe someone who knows can elaborate?).
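One way to look into that question is to register a small probe function and record what it receives. The sketch below assumes a made-up inline document and a hypothetical helper named describe_arg(). Going by the examples in the PHP manual, php:function() should hand a node set over as an array of DOM objects (a DOMAttr for an attribute), while php:functionString() hands over a plain string.

```php
<?php
// Hypothetical sample document, used only for this illustration
$xml = <<<XML
<links>
  <a href="http://en.wikipedia.org/wiki/PHP">Wikipedia</a>
  <a href="http://php.net/manual/">Manual</a>
</links>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('php', 'http://php.net/xpath');

// Hypothetical helper: records the type of whatever the XPath engine passes in
$seen = array();
function describe_arg($arg)
{
    global $seen;
    $seen[] = is_array($arg) ? get_class($arg[0]) : gettype($arg);
    return true; // keep every node in the result set
}
$xpath->registerPhpFunctions('describe_arg');

// Same attribute, passed once as a string and once as a node set
$xpath->query("//a[php:functionString('describe_arg', @href)]");
$xpath->query("//a[php:function('describe_arg', @href)]");

print_r($seen);
```

So if you need the node itself (its name, parent, or other attributes) inside the callback, php:function() is the one to use. For plain text matching with something like preg_match(), php:functionString() is simpler.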

Pretty useful, huh?