XPaths with PHP by example
PHP
A guide to using XPath with PHP to scrape web pages.
XPath is a little bit like a SQL `select` query for XML documents. Essentially you can query an XML document with a string, and a list of matches is returned.
After learning about using PHP with XPath, I was initially going to write an article on how to scrape a web page. I intended to expand on something like this short piece of code, which grabs the bbc.co.uk home page and extracts the contents of the title tag.
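$html = new DOMDocument();
// @ suppresses the warnings libxml raises for malformed HTML
@$html->loadHTMLFile('http://www.bbc.co.uk');
$xpath = new DOMXPath($html);
// Select every <title> element in the document
$nodelist = $xpath->query("//title");
foreach ($nodelist as $n) {
    echo $n->nodeValue . "\n";
}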
The above code simply returns
BBC - Homepage
Some pretty good tutorials have already been written, and if you are new to XPath with PHP, you might want to take a look at them, even if you’re away from your PC. Absorbing a few techniques and tips on your phone while commuting will benefit your PHP and XPath skills.
http://www.metatitan.com/php/27/how-to-build-a-scraper-using-php-curl.html
http://www.merchantos.com/makebeta/php/scraping-links-with-php/
However, these tutorials are only the beginning. The real fun comes when you try to create your own queries. Many tutorials approach XPath from an XML point of view (which is of course what it is designed for), but I have used it mainly for scraping HTML web pages, and so my examples are designed with that task in mind.
Tools
There are some tools to assist you, but I’ve found the XPath queries they create don’t always work without modification when run against HTML in PHP. Still, these tools are a great way to get a feel for XPath queries, and they will also help you break down the more complex scrapes into a query.
Firefox has a couple of plugins:
- Xpather – https://addons.mozilla.org/en-US/firefox/addon/1192
- Xpath – https://addons.mozilla.org/en-US/firefox/addon/1095
Once installed, both of these plugins let you select an area of text in a web page, right-click, and choose the plugin from the popup menu. This opens a window with the calculated XPath you need to access the selected data.
As I said before, these queries often don’t work when dropped into PHP. The most common problem is unnecessary <tbody> tags. Simply strip them out of the query (by removing “/tbody”).
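Stripping the tbody steps by hand gets tedious, so here is a minimal sketch of doing it in code. The function name stripTbody is my own invention for illustration, and it assumes the HTML you are scraping really has no <tbody> tags in its source (browsers usually add them to their in-memory DOM, which is why the plugins report them).

```php
<?php
// Hypothetical helper (not part of PHP or of any plugin): removes every
// "/tbody" step, with or without a numeric index, from an XPath query.
function stripTbody(string $query): string
{
    // Match "/tbody" or "/tbody[1]" only when it is a complete path step.
    return preg_replace('~/tbody(\[\d+\])?(?=/|$)~', '', $query);
}

$fromBrowser = '/html/body/table/tbody/tr[2]/td[2]/p';
echo stripTbody($fromBrowser), "\n";
```

If the page source does contain real <tbody> tags, leave the query alone, otherwise the trimmed version will stop matching.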
Examples
Test page:
This is the test page I am using, see the code at the bottom if you want to test the examples as we take each one in turn.
<html>
<head>
<title>A test page</title>
</head>
<body>
<div id=part1>
Part1
<b>First - Part1</b>
<b>Second - Part1</b>
</div>
<div id=part2>
Part2
<b>First - Part2</b>
<b>Second - Part2</b>
<a href="http://www.yahoo.com">yahoo</a>
<a href="http://www.bbc.co.uk">bbc</a>
<a href="http://www.google.com">google</a>
</div>
<div>
Part3
</div>
</body>
</html>
- Match a specific tag with its exact path
- Extract all matches no matter where they are
- Reference a specific instance of a tag
- Match tags based on their attributes
- Extract the 3rd link in the part2 div
- Extract any divs which don’t have an id attribute set
- Combining paths with “|”, the union operator
To match the title, we can reference it using the absolute path. The XPath query we use is:
/html/head/title
When run in the test program (see Trying it yourself below), the query returns the following part of the test page:
<title>A test page</title>
We can also match the title this way:
//title
This returns all the title tags in the entire page (in this case there is only one, of course), so the output is the same as in the first example.
Extract all the bold tags inside the body
//body//b
This returns all four <b> tags:
<b>First - Part1</b>
<b>Second - Part1</b>
<b>First - Part2</b>
<b>Second - Part2</b>
Extract the second instances of the <b> tags.
//b[2]
This returns:
<b>Second - Part1</b>
<b>Second - Part2</b>
Note: this will retrieve two results, as there are 2 second instances, one in the Part1 div and another in the Part2 div.
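To see why two results come back, it helps to compare //b[2] with the parenthesised form (//b)[2]. The following is a small sketch using a cut-down copy of the test page; the helper name matchText is just something I made up for the example.

```php
<?php
$html = '<html><body>
<div id="part1"><b>First - Part1</b><b>Second - Part1</b></div>
<div id="part2"><b>First - Part2</b><b>Second - Part2</b></div>
</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Hypothetical helper: collect the text of every node a query matches.
function matchText(DOMXPath $xpath, string $query): array
{
    $out = [];
    foreach ($xpath->query($query) as $node) {
        $out[] = $node->nodeValue;
    }
    return $out;
}

// [2] applies per parent: the second <b> inside each div.
$perParent = matchText($xpath, '//b[2]');
// Parentheses apply [2] to the whole result: the second <b> in the page.
$overall = matchText($xpath, '(//b)[2]');

print_r($perParent);
print_r($overall);
```

So the index counts siblings under the same parent unless you wrap the query in parentheses first.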
Extract all the bold tags inside the div named part2
/html/body/div[@id='part2']/b
This returns:
<b>First - Part2</b>
<b>Second - Part2</b>
There is only one instance of <div id=part2>, so we can shorten that to
//div[@id='part2']/b
which will produce exactly the same result.
As you can see, we use [@attribute='value'] to select tags whose attributes have specific values.
//div[@id='part2']/a[3]
This query will return the 3rd link in the div with id=part2, which has the text “google”. Sometimes, however, we want the value of the href attribute of the anchor tag. To do this we use “@attribute”:
//div[@id='part2']/a[3]/@href
This now gives us the link address itself: http://www.google.com
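If you want both the link text and the address, an alternative to running two queries is to select the <a> element once and read the attribute from it in PHP. This is only a sketch on a cut-down copy of the test page; the variable names are mine.

```php
<?php
$html = '<html><body><div id="part2">
<a href="http://www.yahoo.com">yahoo</a>
<a href="http://www.bbc.co.uk">bbc</a>
<a href="http://www.google.com">google</a>
</div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Option 1: select the attribute node directly with @href.
$attr = $xpath->query("//div[@id='part2']/a[3]/@href")->item(0);
$viaAttribute = $attr->nodeValue;

// Option 2: select the element, then read its text and attribute.
$link = $xpath->query("//div[@id='part2']/a[3]")->item(0);
$linkText = $link->nodeValue;
$viaElement = $link->getAttribute('href');

echo $viaAttribute, "\n", $linkText, "\n", $viaElement, "\n";
```

Both options give the same address; the second is handy when you need several pieces of data from the same tag.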
We can use the not() function to match all divs which do not have an id attribute.
//div[not(@id)]
This returns:
<div>
Part3
</div>
Here we use the pipe “|” to join together two queries; the result is the union of their matches. So to get the div with id=part1 and the div without an id attribute use:
//div[not(@id)]|//div[@id='part1']
This returns:
<div id=part1>
Part1
<b>First - Part1</b>
<b>Second - Part1</b>
</div>
<div>
Part3
</div>
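Here is a rough sketch that runs the last two examples side by side on a simplified copy of the test page. It also shows that, when both halves of the union select the same kind of tag, the same set can be written with “or” inside a single predicate. The helper name divText is my own, and the results are sorted so the check does not depend on the order matches come back in.

```php
<?php
$html = '<html><body>
<div id="part1">Part1</div>
<div id="part2">Part2</div>
<div>Part3</div>
</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Hypothetical helper: trimmed text of each matched node, sorted.
function divText(DOMXPath $xpath, string $query): array
{
    $out = [];
    foreach ($xpath->query($query) as $node) {
        $out[] = trim($node->nodeValue);
    }
    sort($out); // sort so the result does not depend on match order
    return $out;
}

$withoutId = divText($xpath, '//div[not(@id)]');
$combined  = divText($xpath, "//div[not(@id)]|//div[@id='part1']");
$single    = divText($xpath, "//div[not(@id) or @id='part1']");

print_r($withoutId);
print_r($combined);
print_r($single);
```

The pipe is the more general tool, since the two queries it joins can select completely different tags.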
Trying it yourself
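You can test the examples using this code (change the query passed to $xpath->query() near the bottom)
$html = '
<html>
<head>
<title>A test page</title>
</head>
<body>
<div id=part1>
Part1
<b>First - Part1</b>
<b>Second - Part1</b>
</div>
<div id=part2>
Part2
<b>First - Part2</b>
<b>Second - Part2</b>
<a href="http://www.yahoo.com">yahoo</a>
<a href="http://www.bbc.co.uk">bbc</a>
<a href="http://www.google.com">google</a>
</div>
<div>
Part3
</div>
</body>
</html>';
$htmlDoc = new DOMDocument();
// @ suppresses the warnings libxml raises for malformed HTML
@$htmlDoc->loadHTML($html);
$xpath = new DOMXPath($htmlDoc);
// Change this query to try each example
$nodelist = $xpath->query("//title");
foreach ($nodelist as $n) {
    echo $n->nodeValue . "\n";
}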
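If you find yourself running lots of queries, the test program can be wrapped in a function. This is one possible sketch rather than the only way to do it: xpathText is a name I made up, and it uses libxml_use_internal_errors() to silence parser warnings instead of the @ operator.

```php
<?php
// Hypothetical helper: parse an HTML string and return the trimmed text
// of every node matched by an XPath query.
function xpathText(string $html, string $query): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($query);
    if ($nodes === false) {
        // query() returns false when the expression is malformed.
        throw new InvalidArgumentException("Invalid XPath query: $query");
    }

    $out = [];
    foreach ($nodes as $node) {
        $out[] = trim($node->nodeValue);
    }
    return $out;
}

$page = '<html><head><title>A test page</title></head>
<body><div id=part1><b>First - Part1</b></div><div>Part3</div></body></html>';

print_r(xpathText($page, '//title'));
print_r(xpathText($page, '//div[not(@id)]'));
```

Returning an array rather than echoing makes it easier to feed the matches into whatever you do next with the scraped data.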
Using the tools
As I said in the beginning, the tools (such as the Firefox plugins) can make life easier, but the XPath queries they produce don’t always work without some modifications. This is most probably because the HTML in the pages we are creating the queries for is not properly formed XML.
But they are still helpful, and I often use the queries they produce as a starting point. I then chop out most of the initial parts of the query and use the last few bits.
So when a query produced by Xpather looks like this:
/html/body/div[2]/div[2]/div[5]/div/div[1]/div[2]/div/table/tbody/tr[2]/td[2]/p
trim it back to the last few steps and add another / at the beginning:
//tr[2]/td[2]/p
If this doesn’t produce a unique match, we can add a numeric index to reference a specific instance. So if the cell contains 3 paragraphs and we only want the second, we put [2] at the end of the query, like this:
//tr[2]/td[2]/p[2]
Note that [2] here means the second <p> within its parent. To pick the second match out of the whole result set instead, wrap the query in parentheses: (//tr[2]/td[2]/p)[2]
These basic examples are most probably all you will need to extract data from a web page. Here are a few pages to deepen your knowledge.
- http://www.zvon.org/xxl/XPathTutorial/General/examples.html
- http://www.w3schools.com/Xpath/default.asp

















My XPath expression for a value from the web is:
/html/body/div[@id='jive-forumpage']/table[3]/tbody/tr/td[1]/div[1]/div/table/tbody/tr[1]/td[3]/a[@id='jive-thread-1']
I cannot query it. Must I remove tbody?
Thank you for writing this. It helped me work out how to use DOMXPath!
I wanted to extract an rss feed from a html document if it existed. Here’s what I came up with:
$html = new DOMDocument();
@$html->loadHTMLFile('http://www.example.com');
$xpath = new DOMXPath($html);
$nodelist = $xpath->query("//link[@rel='alternate']/@href");
foreach ($nodelist as $n) {
    echo $n->nodeValue . "\n";
}
Thanks for the great article… i have been playing with this for hours and yours is the only example that talks about trimming back the xpath query and removing the front elements… this worked a treat first time! Thanks you saved me hours of work…
I used this for a little project to have at work – a live webcam image of the skyline across my city along with the current temperature, scraped from the met office website. Because the table grows throughout the day to display the temperature by hour, I wanted a position relative to the end of the table.
The penultimate row (one before last) contained the cell with the latest temperature in. Accessing relative positions is easy in XPath!
http://www.metoffice.gov.uk/weather/uk/se/solent_latest_temp.html
$xpath->query("((//div[@id='obsTable']/table/tr)[last()-1])/td[3]");
This finds the div with id ‘obsTable’ and selects the third cell of the penultimate (last() - 1) row. Job done!
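For anyone who wants to try the last()-1 trick without hitting the live site, here is a small sketch against a made-up table. The times and temperatures are invented for illustration and are not taken from the Met Office page.

```php
<?php
// Made-up table standing in for the weather page (values are invented).
$html = '<html><body><div id="obsTable"><table>
<tr><td>09:00</td><td>N</td><td>11.2</td></tr>
<tr><td>10:00</td><td>N</td><td>12.5</td></tr>
<tr><td>11:00</td><td>NE</td><td>13.1</td></tr>
</table></div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Take all rows, keep the one before last, then its third cell.
$query = "((//div[@id='obsTable']/table/tr)[last()-1])/td[3]";
$cell = $xpath->query($query)->item(0);
$temperature = $cell->nodeValue;

echo $temperature, "\n";
```

Because the row is counted from the end, the query keeps working as more rows are added to the table.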