How to parse web pages using XPath

Date: 20th Apr 2011 | Author: admin
Posted in: PHP

parsing pages with XPath

Today I will show you how to build parsers for remote HTML pages in PHP by performing XPath queries against them. XPath is a query language for selecting elements of an XML or XHTML document; to obtain the data we need, we just have to write the right query. For this work we also need the Mozilla Firefox browser with the Firebug and FirePath plugins. For our experiment I suggest the Google Sci/Tech News page, though of course you can choose any other web page.
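Before working with a live page, here is a minimal sketch of the PHP side: loading HTML into a DOM and running one XPath query. The markup and the `news` id are invented purely for illustration.

```php
<?php
// Minimal sketch: run an XPath query against an inline HTML string.
$sHtml = '<div id="news"><h2><a href="/a">First</a></h2>'
       . '<h2><a href="/b">Second</a></h2></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid, so collect errors
$dom->loadHTML($sHtml);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$aTitles = array();
foreach ($xpath->query(".//*[@id='news']/h2/a") as $obj) {
    $aTitles[] = $obj->nodeValue; // text content of each link
}
print_r($aTitles); // Array ( [0] => First [1] => Second )
```

The rest of the article only changes the source of the HTML (a remote page) and the queries themselves.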

Here are the samples and a downloadable package:

Demo 1
Demo 2
download in package

OK, let's start. First, make sure that both plugins are installed in your browser, then open our news page (the Google Sci/Tech News page). Next, right-click any description text, for example inside the text 'A Samsung Electronics Co. Galaxy S smartphone, top…', and select 'Inspect in FirePath' in the popup menu.

firepath - step 1

As a result we will see the following:

firepath - step 2

Note the XPath expression: .//*[@id='top-stories']/div[2]/div[3]/div[1]

The matched element will be highlighted with a dashed line. After cleaning up the unnecessary indexes and making a few small corrections, we get the following query: .//*[@id='top-stories']/div/div[@class='body']/div[1]
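To see why dropping the numeric indexes matters, here is a sketch on invented markup that merely resembles the news page (the real Google markup differs): the raw FirePath query pins exact positions and matches a single node, while the cleaned query matches every story.

```php
<?php
// Invented two-story markup, for illustration only.
$sHtml = '<div id="top-stories">'
       . '<div><div class="body"><div>desc 1</div></div></div>'
       . '<div><div class="body"><div>desc 2</div></div></div>'
       . '</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($sHtml);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Positional indexes pin one exact node (here the second story's description):
$vOne = $xpath->query(".//*[@id='top-stories']/div[2]/div[1]/div[1]");
// The cleaned query matches the first description DIV of every story:
$vAll = $xpath->query(".//*[@id='top-stories']/div/div[@class='body']/div[1]");
echo $vOne->length . ' vs ' . $vAll->length; // 1 vs 2
```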

Try to execute it; as a result we will see the following:

firepath - step 3

So we have selected all the descriptions (the FirePath plugin highlights every query result with a dashed line and also selects these nodes in the DOM tree – look at the FirePath tab of Firebug).

Next, let's pick up the URLs of these posts. I built the following query (using FirePath, of course): .//*[@id='top-stories']/div/h2/a[1]

And the last query, for the titles: .//*[@id='top-stories']/div/h2/a[1]/span


Step 1. PHP

So, here is the resulting PHP file:

index.php

<?php
$sUrl = 'http://news.google.com/news/section?pz=1&cf=all&topic=t&ict=ln';
$sUrlSrc = getWebsiteContent($sUrl);

// Load the source
$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect libxml warnings instead of suppressing them with @
$dom->loadHTML($sUrlSrc);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// step 1 - links:
$aLinks = array();
$vRes = $xpath->query(".//*[@id='top-stories']/div/h2/a[1]");
foreach ($vRes as $obj) {
    $aLinks[] = $obj->getAttribute('href');
}

// step 2 - titles:
$aTitles = array();
$vRes = $xpath->query(".//*[@id='top-stories']/div/h2/a[1]/span");
foreach ($vRes as $obj) {
    $aTitles[] = $obj->nodeValue;
}

// step 3 - descriptions:
$aDescriptions = array();
$vRes = $xpath->query(".//*[@id='top-stories']/div/div[@class='body']/div[1]");
foreach ($vRes as $obj) {
    $aDescriptions[] = $obj->nodeValue;
}

echo '<link href="css/styles.css" type="text/css" rel="stylesheet"/><div class="main">';
echo '<h1>Using xpath for dom html</h1>';

$entries = $xpath->evaluate('.//*[@id="headline-wrapper"]/div[1]/div/h2/span');
echo '<h1>' . $entries->item(0)->nodeValue . ' google news</h1>';

$i = 0;
foreach ($aLinks as $sLink) {
    echo <<<EOF
<div class="unit">
    <a href="{$sLink}">{$aTitles[$i]}</a>
    <div>{$aDescriptions[$i]}</div>
</div>
EOF;
    $i++;
}
echo '</div>';

// returns page content, using a cache (the original source is fetched at most once per hour)
function getWebsiteContent($sUrl) {
    // our folder with cache files
    $sCacheFolder = 'cache/';

    // cache filename
    $sFilename = date('YmdH').'.html';

    if (! is_dir($sCacheFolder)) {
        mkdir($sCacheFolder, 0777, true); // create the cache folder on first run
    }
    if (! file_exists($sCacheFolder.$sFilename)) {
        $ch = curl_init($sUrl);
        $fp = fopen($sCacheFolder.$sFilename, 'w');
        curl_setopt($ch, CURLOPT_FILE, $fp);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_HTTPHEADER, array('User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15'));
        curl_exec($ch);
        curl_close($ch);
        fclose($fp);
    }
    return file_get_contents($sCacheFolder.$sFilename);
}

?>

Note that I always cache the results (in the 'cache' folder). This stops us from hitting the source page on every call of our script: a fresh copy of the page is fetched at most once per hour. As you can see, the PHP collects the titles, URLs and descriptions of the posts from the Google News page, and then we render these collections to the screen.
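The hourly caching works because the cache filename embeds the current date and hour via date('YmdH'): within the same hour the filename is unchanged (cache hit), and a new hour produces a new filename, which triggers a re-fetch. A small sketch of just that idea (the helper function is mine, not part of the article's code):

```php
<?php
// Sketch of the cache-key idea from getWebsiteContent():
// date('YmdH') changes once per hour, so each hour gets its own cache file.
function cacheFilename($iTimestamp) {
    return date('YmdH', $iTimestamp) . '.html';
}

$iNow = mktime(14, 5, 0, 4, 20, 2011);  // 2011-04-20 14:05 local time
echo cacheFilename($iNow);              // 2011042014.html
echo cacheFilename($iNow + 600);        // 2011042014.html (same hour -> cache hit)
echo cacheFilename($iNow + 3600);       // 2011042015.html (next hour -> re-fetch)
```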


XPath details (syntax, functions and arrays)

Let's review the syntax of an XPath query, taking this one as an example: .//*[@id='top-stories']/div[2]/div[3]/div[1]

Here .//* means all (*) elements one or more levels deep in the current context. Instead of * we can use any element name, for example DIV (.//div). The / separators describe the path between elements. Everything inside [] is a predicate. It can be a number (in that case it is the element's index: .//div[5] means the 5th DIV) or other filtering logic, which is more interesting. Samples:

  • .//div[@id='top-stories'] – filter elements by ID (or any other attribute) using the @ symbol
  • .//div[contains(text(),'iPhone')] – select only the DIV elements whose text contains the word 'iPhone' (searching)
  • .//div[@id='top-stories']/div[last()] – the last() function returns the last element of the collection
  • .//*[@id='top-stories']/div[position() > 5][position() < 10] – the position() function returns the element's position. This sample takes the elements between the 5th and the 10th (i.e. the 6th through 9th).
  • .//div[@id='top-stories']/div[position() mod 2 = 0] – mod is the modulo operator; this selects all even-positioned elements.

And of course we can use comparison operators (>, <, <=, >=) too.
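The predicate forms above can be tried out without any remote page at all. Here is a hedged sketch against invented inline markup (four child DIVs named one..four):

```php
<?php
// Invented markup just to exercise last(), contains() and mod predicates.
$sHtml = '<div id="top-stories">'
       . '<div>one</div><div>two</div><div>three</div><div>four</div>'
       . '</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($sHtml);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// last() - the final child DIV
$vRes = $xpath->query(".//div[@id='top-stories']/div[last()]");
echo $vRes->item(0)->nodeValue; // four

// contains() - only the DIVs whose text contains 'our'
$vRes2 = $xpath->query(".//div[@id='top-stories']/div[contains(text(),'our')]");
echo $vRes2->item(0)->nodeValue; // four

// position() mod 2 = 0 - every even-positioned child DIV
$aEven = array();
foreach ($xpath->query(".//div[@id='top-stories']/div[position() mod 2 = 0]") as $obj) {
    $aEven[] = $obj->nodeValue;
}
print_r($aEven); // two, four
```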


The second sample file demonstrates a few more possibilities:

index-2.php

<?php
$sUrl = 'http://news.google.com/news/section?pz=1&cf=all&topic=t&ict=ln';
$sUrlSrc = getWebsiteContent($sUrl);

// Load the source
$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect libxml warnings instead of suppressing them with @
$dom->loadHTML($sUrlSrc);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// all links contain 'iPhone':
$aLinks = array();
$vRes = $xpath->query(".//*[@id='top-stories']/div/h2/a[1]/span[contains(text(),'iPhone')]");
foreach ($vRes as $obj) {
    $aLinks[$obj->nodeValue] = $obj->parentNode->getAttribute('href');
}

// all links from 5 to 10
$aLinks2 = array();
$vRes = $xpath->query(".//*[@id='top-stories']/div[position() > 5][position() < 10]/h2/a[1]/span");
foreach ($vRes as $obj) {
    $aLinks2[$obj->nodeValue] = $obj->parentNode->getAttribute('href');
}

echo '<link href="css/styles.css" type="text/css" rel="stylesheet"/><div class="main">';
echo '<h1>Using xpath for dom html (2)</h1>';

echo '<h1>All links contain "iPhone" word</h1>';
foreach ($aLinks as $sTitle => $sLink) {
    echo <<<EOF
<div class="unit">
    <a href="{$sLink}">{$sTitle}</a>
</div>
EOF;
}
echo '<hr /><h1>Links from 5 to 10</h1>';
foreach ($aLinks2 as $sTitle => $sLink) {
    echo <<<EOF
<div class="unit">
    <a href="{$sLink}">{$sTitle}</a>
</div>
EOF;
}
echo '</div>';

// returns page content, using a cache (the original source is fetched at most once per hour)
function getWebsiteContent($sUrl) {
    // our folder with cache files
    $sCacheFolder = 'cache/';

    // cache filename
    $sFilename = date('YmdH').'.html';

    if (! is_dir($sCacheFolder)) {
        mkdir($sCacheFolder, 0777, true); // create the cache folder on first run
    }
    if (! file_exists($sCacheFolder.$sFilename)) {
        $ch = curl_init($sUrl);
        $fp = fopen($sCacheFolder.$sFilename, 'w');
        curl_setopt($ch, CURLOPT_FILE, $fp);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_HTTPHEADER, array('User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15'));
        curl_exec($ch);
        curl_close($ch);
        fclose($fp);
    }
    return file_get_contents($sCacheFolder.$sFilename);
}

?>

In the first case I collect all links whose title contains the word 'iPhone'; in the second, all links from positions 5 to 10.

You can find a fuller reference of XPath functions here or here.

Step 2. CSS

Here is the CSS file we use:

css/styles.css

body{background:#eee;font-family:Verdana, Helvetica, Arial, sans-serif;margin:0;padding:0}
.main{background:#FFF;width:800px;font-size:80%;border:1px #000 solid;margin:3.5em auto 2em;padding:1em 2em 2em}
h1{text-align:center}
.unit {margin:5px;padding:5px;border:1px solid #DDD;}

These are just a few styles for our demo.


Conclusion

Using a simple example I have demonstrated the possibilities of XPath. Perhaps some of you are still parsing web pages with regular expressions; I hope this article is a good step toward using XPath in your projects. At the very least it will make your code cleaner and easier to read. Good luck!
