XPaths with PHP by example

A guide to using XPath with PHP to scrape web pages.

XPath is a little bit like a SQL `select` query for XML documents. Essentially, you query an XML document with a string, and a list of matches is returned.
After learning about using PHP with XPath, I was initially going to write an article on how to scrape a web page. I intended to expand on something like this short piece of code, which grabs the bbc.co.uk home page and extracts the contents of the title tag.

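// Fetch and parse the page; @ hides the warnings raised by malformed HTML.
$html = new DOMDocument();
@$html->loadHTMLFile('http://www.bbc.co.uk');
// Run an XPath query against the parsed document.
$xpath = new DOMXPath($html);
$nodelist = $xpath->query("//title");
foreach ($nodelist as $n) {
    echo $n->nodeValue . "\n";
}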

The above code simply returns
BBC - Homepage

Some pretty good tutorials have already been written, and if you are new to XPath with PHP you might want to take a look at them:

http://www.metatitan.com/php/27/how-to-build-a-scraper-using-php-curl.html
http://www.merchantos.com/makebeta/php/scraping-links-with-php/

However, these tutorials are only the beginning; the real fun comes when you try to create your own queries. Many tutorials approach XPath from an XML point of view (which is of course what it is designed for), but I have used it mainly for scraping HTML web pages, so my examples are designed with that task in mind.

Tools

There are some tools to assist you, but I've found that the XPath queries they create don't always work without modification when run against HTML. Still, these tools are a great way to get a feel for XPath queries, and they will also help you break down the more complex scrapes into a query.
Firefox has a couple of plugins:

  • Xpather – https://addons.mozilla.org/en-US/firefox/addon/1192
  • Xpath – https://addons.mozilla.org/en-US/firefox/addon/1095

Once installed, both of these plugins allow you to select an area of text in a web page, right-click, and select the plugin from the popup menu. This pops up a window with the calculated XPath you need to access the selected data.

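As I said before, these queries often don't work when dropped into PHP. The most common problem is unnecessary <tbody> tags: the browser adds them to its own view of the page even when the HTML source has none, so PHP never sees them. Simply strip them out of the query (by removing "/tbody").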
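
As a minimal sketch of that clean-up, str_replace() can strip the /tbody steps before the query is run. The table markup and the browser-style query below are made up for illustration; they are not taken from a real page or plugin.

```php
<?php
// Hypothetical page fragment: note that the source has no <tbody> tag.
$html = '<html><body><table>
<tr><td>a1</td><td>a2</td></tr>
<tr><td>b1</td><td>b2</td></tr>
</table></body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// A query in the style a browser plugin might produce (assumed example).
$browserQuery = '/html/body/table/tbody/tr[2]/td[2]';
// Remove every "/tbody" step so the path matches the tree PHP built.
$phpQuery = str_replace('/tbody', '', $browserQuery);

$before = $xpath->query($browserQuery)->length; // counts matches with tbody
$after  = $xpath->query($phpQuery);

echo "Matches with /tbody: $before\n";
echo "Matches without /tbody: {$after->length}\n";
foreach ($after as $n) {
    echo $n->nodeValue . "\n";
}
```

A plain str_replace() is only safe when the tbody steps carry no index; a query containing something like /tbody[1] would need a regular expression instead.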

Examples

Test page:

This is the test page I am using. See the code at the bottom if you want to test the examples as we take each one in turn.

<html>
<head>
<title>A test page</title>
</head>
<body>
<div id=part1>
Part1
<b>First - Part1</b>
<b>Second - Part1</b>
</div><div id=part2>
Part2
<b>First - Part2</b>
<b>Second - Part2</b>
<a href="http://www.yahoo.com">yahoo</a>
<a href="http://www.bbc.co.uk">bbc</a>
<a href="http://www.google.com">google</a>
</div>

<div>
Part3
</div>
</body>
</html>

  • Match a specific tag with its exact path.

To match the title, we can reference it using the absolute path. The XPath query we use is:

/html/head/title

When run in the test program (see Trying it yourself below), this returns the contents of the title tag:

A test page

  • Extract all matches no matter where they are

We can also match the title this way:

//title

This returns all the title tags in the entire page (in this case there is only one, of course), so the output is the same as in the first example.

Extract all the bold tags inside the body

//body//b

This returns all four <b> tags: the two in the part1 div and the two in the part2 div.

  • Reference a specific instance of a tag

Extract the second instance of the <b> tag inside each parent.

//b[2]

The matches are the <b> tags whose text starts with "Second".

Note: this will retrieve two results, as there are 2 second instances, one in the Part1 div and another in the Part2 div.
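
To make the per-parent counting concrete, here is a small sketch that compares //b[2] with the parenthesised forms (//b)[2] and (//b)[3]. The markup is a cut-down version of the test page, and texts() is a hypothetical helper written just for this example.

```php
<?php
// Cut-down version of the test page: two divs, each with two <b> tags.
$html = '<html><body>
<div id="part1"><b>First - Part1</b><b>Second - Part1</b></div>
<div id="part2"><b>First - Part2</b><b>Second - Part2</b></div>
</body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Hypothetical helper: return the text of every node a query matches.
function texts(DOMXPath $xpath, string $query): array
{
    $out = [];
    foreach ($xpath->query($query) as $node) {
        $out[] = $node->nodeValue;
    }
    return $out;
}

// [2] is applied per parent: the second <b> child of each parent.
$perParent = texts($xpath, '//b[2]');
// Parentheses collect every <b> first, then the index picks from that list.
$overall = texts($xpath, '(//b)[2]');
$third   = texts($xpath, '(//b)[3]');

print_r($perParent);
print_r($overall);
print_r($third);
```

Without the parentheses there is no third match at all, because no parent has three <b> children.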

  • Match tags based on their attributes

Extract all the bold tags inside the div whose id is part2:

/html/body/div[@id='part2']/b

This returns the two <b> tags inside the part2 div.

There is only one instance of <div id=part2>, so we can shorten that to

//div[@id='part2']/b

which will produce exactly the same result.
As you can see, we use [@attribute='value'] to select tags whose attributes have specific values.
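
The same predicate form works for any attribute, not just id. The sketch below, on assumed cut-down markup and with a hypothetical texts() helper, matches a div by its id and a link by its href.

```php
<?php
// Cut-down markup based on the test page (an assumption for this example).
$html = '<html><body>
<div id="part1"><b>First - Part1</b></div>
<div id="part2"><b>First - Part2</b><b>Second - Part2</b>
<a href="http://www.yahoo.com">yahoo</a>
<a href="http://www.bbc.co.uk">bbc</a></div>
</body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Hypothetical helper: return the text of every node a query matches.
function texts(DOMXPath $xpath, string $query): array
{
    $out = [];
    foreach ($xpath->query($query) as $node) {
        $out[] = $node->nodeValue;
    }
    return $out;
}

// Match on the id attribute of the div, then take its <b> children.
$bold = texts($xpath, "//div[@id='part2']/b");
// Match on the href attribute of a link.
$link = texts($xpath, "//a[@href='http://www.bbc.co.uk']");

print_r($bold);
print_r($link);
```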

  • Extract the 3rd link in the part2 div

//div[@id='part2']/a[3]

This query returns the 3rd link in the div with id=part2, whose text is "google". Sometimes, however, we want the value of the href attribute of the anchor tag instead. To get it we use "@attribute":

//div[@id='part2']/a[3]/@href

which now gives us the link address itself “http://www.google.com”
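
As a sketch of the two ways to reach that address, the code below reads it once by selecting the attribute node with /@href and once by selecting the element and calling getAttribute(). The markup is a cut-down copy of the part2 div.

```php
<?php
// Cut-down copy of the part2 div from the test page.
$html = '<html><body><div id="part2">
<a href="http://www.yahoo.com">yahoo</a>
<a href="http://www.bbc.co.uk">bbc</a>
<a href="http://www.google.com">google</a>
</div></body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Selecting the attribute node: nodeValue holds the attribute's value.
$attr = $xpath->query("//div[@id='part2']/a[3]/@href")->item(0);
$fromAttr = $attr->nodeValue;

// Selecting the element: nodeValue is the link text,
// and getAttribute() reads the href from the same node.
$link = $xpath->query("//div[@id='part2']/a[3]")->item(0);
$text = $link->nodeValue;
$fromElement = $link->getAttribute('href');

echo "$text => $fromAttr\n";
```

Selecting the element is handy when both the link text and the address are needed from one query.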

  • Extract any divs which don't have an id attribute set

We can use the not() function to match all divs which do not have an id attribute.

//div[not(@id)]

This returns the last div, the one containing Part3.
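
A short sketch of the attribute test in both directions: [@id] keeps the divs that have an id, and [not(@id)] keeps the ones that don't. The markup is a simplified version of the test page.

```php
<?php
// Simplified version of the test page: two divs with an id, one without.
$html = '<html><body>
<div id="part1">Part1</div>
<div id="part2">Part2</div>
<div>Part3</div>
</body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// [@id] is true when the attribute exists, whatever its value.
$withId = $xpath->query('//div[@id]')->length;

// not(@id) inverts the test: only divs lacking the attribute remain.
$withoutId = [];
foreach ($xpath->query('//div[not(@id)]') as $div) {
    $withoutId[] = $div->nodeValue;
}

echo "Divs with an id: $withId\n";
print_r($withoutId);
```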

  • Combining paths with "|", the union operator

Here we use the pipe "|" to join together two queries. So to get the div with id=part1 and the div without an id attribute, use:

//div[not(@id)]|//div[@id='part1']

This returns the part1 div and the div containing Part3.
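
One detail worth seeing in code is that "|" produces a set: a node matched by both sides appears only once in the result. The sketch below uses simplified markup based on the test page.

```php
<?php
// Simplified version of the test page: two divs with an id, one without.
$html = '<html><body>
<div id="part1">Part1</div>
<div id="part2">Part2</div>
<div>Part3</div>
</body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Union of two separate matches: the div without an id, and the part1 div.
$union = [];
foreach ($xpath->query("//div[not(@id)]|//div[@id='part1']") as $div) {
    $union[] = $div->nodeValue;
}

// Here the part1 div matches both sides, but it is only returned once.
$overlap = $xpath->query("//div|//div[@id='part1']")->length;

print_r($union);
echo "Nodes in the overlapping union: $overlap\n";
```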

Trying it yourself

You can test the examples using this code (change the query string passed to $xpath->query()):

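$html = '
<html>
  <head>
    <title>A test page</title>
  </head>
  <body>
    <div id=part1>
      Part1
      <b>First - Part1</b>
      <b>Second - Part1</b>
    </div><div id=part2>
      Part2
      <b>First - Part2</b>
      <b>Second - Part2</b>
      <a href="http://www.yahoo.com">yahoo</a>
      <a href="http://www.bbc.co.uk">bbc</a>
      <a href="http://www.google.com">google</a>
    </div>

    <div>
      Part3
    </div>
  </body>
</html>';

// Parse the HTML; @ hides the warnings libxml raises for sloppy markup.
$htmlDoc = new DOMDocument();
@$htmlDoc->loadHTML($html);
$xpath = new DOMXPath($htmlDoc);
// Change this query to try the other examples.
$nodelist = $xpath->query("//title");
foreach ($nodelist as $n) {
    echo $n->nodeValue . "\n";
}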
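
If you end up running many queries, the test code can be wrapped in a function. The following is only a sketch: xpathTexts() is a hypothetical helper name, and it swaps the @ operator for libxml_use_internal_errors(), which keeps parser warnings out of the output without hiding other errors.

```php
<?php
/**
 * Hypothetical helper (not part of the article's test code): run an XPath
 * query against an HTML string and return the trimmed text of each match.
 */
function xpathTexts(string $html, string $query): array
{
    // Collect libxml parse warnings internally instead of hiding them with @.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($query);
    if ($nodes === false) {
        // query() returns false when the expression is malformed.
        return [];
    }

    $texts = [];
    foreach ($nodes as $node) {
        $texts[] = trim($node->nodeValue);
    }
    return $texts;
}

// Shortened test page, used here only to exercise the helper.
$page = '<html><head><title>A test page</title></head><body>
<div id="part1">Part1 <b>First - Part1</b> <b>Second - Part1</b></div>
<div>Part3</div>
</body></html>';

print_r(xpathTexts($page, '//title'));
print_r(xpathTexts($page, '//div[not(@id)]'));
```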

Using the tools

As I said at the beginning, the tools (such as the Firefox plugins) can make life easier, but the XPath queries they produce don't always work without some modification. This is most probably because the HTML in the pages we are creating the queries for is not properly formed XML.

But they are still helpful, and I often use the queries they produce as a starting point. I then chop out most of the initial parts of the query and use the last few bits.
So when a query produced by Xpather looks like this

/html/body/div[2]/div[2]/div[5]/div/div[1]/div[2]/div/table/tbody/tr[2]/td[2]/p

trim it back to the last few steps and add another / at the beginning

//tr[2]/td[2]/p

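If this doesn't produce a unique match, we can add a little more to reference a specific numerical instance. So if the query matches 3 paragraphs in the same cell and we only want the second, we reference it by putting [2] at the end of the query, like this:

//tr[2]/td[2]/p[2]

Note that [2] counts within each parent, as in the //b[2] example above. If the matches come from different cells, wrap the whole query in parentheses before indexing: (//tr[2]/td[2]/p)[2]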
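
To check that a trimmed query still points at the same place, the long and short forms can be compared directly. This is a sketch on made-up markup (two nested divs around a table, with no tbody in the source); isSameNode() confirms that both queries land on the same element.

```php
<?php
// Made-up markup for illustration: a table nested inside two divs.
$html = '<html><body><div><div><table>
<tr><td>r1c1</td><td><p>r1c2</p></td></tr>
<tr><td>r2c1</td><td><p>r2c2</p></td></tr>
</table></div></div></body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// The full absolute path, with the tbody step already removed.
$long = $xpath->query('/html/body/div/div/table/tr[2]/td[2]/p');
// The trimmed version: only the last few steps, anchored with //.
$short = $xpath->query('//tr[2]/td[2]/p');

// Both queries should find exactly one node, and it should be the same one.
$same = $long->length === 1
    && $short->length === 1
    && $long->item(0)->isSameNode($short->item(0));

echo $short->item(0)->nodeValue . "\n";
echo $same ? "Same node\n" : "Different nodes\n";
```

The trimmed form is also less brittle: it keeps working if the page's outer layout divs change, as long as the table itself stays the same.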

These basic examples are most probably all you will need to extract data from a web page. Here are a few pages to deepen your knowledge.

9 comments to XPaths with PHP by example

  • alin

    my Xpath expression for a value from web is :

    /html/body/div[@id='jive-forumpage']/table[3]/tbody/tr/td[1]/div[1]/div/table/tbody/tr[1]/td[3]/a[@id='jive-thread-1']

    I cannot query it. Must I remove tbody?

  • Thank you for writing this. It helped me work out how to use DOMXPath!

    I wanted to extract an rss feed from a html document if it existed. Here’s what I came up with:

    $html = new DOMDocument();
    @$html->loadHtmlFile('http://www.example.com');
    $xpath = new DOMXPath( $html );
    $nodelist = $xpath->query( "//link[@rel='alternate']/@href" );
    foreach ($nodelist as $n){
    echo $n->nodeValue."\n";
    }

  • Christian

    Thanks for the great article… i have been playing with this for hours and yours is the only example that talks about trimming back the xpath query and removing the front elements… this worked a treat first time! Thanks you saved me hours of work…

  • George Hafiz

    I used this for a little project to have at work – a live webcam image of the skyline across my city along with the current temperature, scraped from the met office website. Because the table grows throughout the day to display the temperature by hour, I wanted a position relative to the end of the table.

    The penultimate row (one before last) contained the cell with the latest temperature in. Accessing relative positions is easy in XPath!

    http://www.metoffice.gov.uk/weather/uk/se/solent_latest_temp.html

    $xpath->query("((//div[@id='obsTable']/table/tr)[last()-1])/td[3]");

    This finds the div with id 'obsTable' and selects the fourth cell of the 'last - 1'th row. Job done!

  • Filip

    Many thanks for this!!!

    "As I said before these queries often don't work when dropped into PHP, the most common problem are unecessary <tbody> tags. Simply strip them out of the query (by removing "/tbody")"

  • Vic

    Very clean and nice article…thanks.

    In paragraph 7, Combining paths with “|” the or operator.

    “So to get the divs with id=part2 and the div without an id attirbute use”…I think it’s “with id=part1″.

    Thanks again for this great article.

  • Thanks Vic, I’ve corrected the article with the mistake you found.
    I’m glad you found it useful.

  • Adrian Barrett

    Having some trouble :(

    http://stackoverflow.com/questions/24976537/beginner-php-xpath-text-display

    Can’t seem to transfer the xpath from Firefox addon to my PHP code.. not sure what’s going on.

    Thanks!
