Wednesday, 19 September 2018

Find all anchor tags in a page with PHP and the Simple HTML DOM Parser

This post shows how to download a web page and find all of its link anchor tags using PHP and the Simple HTML DOM Parser, which has a jQuery-like selector syntax.

PHP Simple HTML DOM Parser

The PHP Simple HTML DOM Parser makes it easy to find particular elements within an HTML page in a similar way to jQuery. It can be downloaded from http://simplehtmldom.sourceforge.net/ where there are also several examples.

Finding the <a> tags from a web page

First, include the Simple HTML DOM Parser using include, require, include_once or require_once:
require_once('/path/to/simple_html_dom.php');
Then load the web page into the DOM using either the file_get_html() or the str_get_html() helper function. The filename passed to file_get_html() can be either the URL of a web page or the name of a local file. str_get_html() takes a string of HTML instead of a filename. Use one or the other, not both:
// Load the page from a URL (or a local filename)
$dom = file_get_html('http://www.google.com/');
// Alternatively, parse HTML that is already held in a string
$dom = str_get_html('... some html string ...');
Now call find() on the DOM object with the selector 'a', as in the following example, which echoes the "href" attribute of each anchor that has one, followed by a newline:
foreach($dom->find('a') as $a) {
    if($a->href) {
        echo $a->href . "\n";
    }
}
Using www.google.com as an example, the above would output this:
http://images.google.co.nz/imghp?hl=en&tab=wi
http://maps.google.co.nz/maps?hl=en&tab=wl
http://news.google.co.nz/nwshp?hl=en&tab=wn
http://groups.google.co.nz/grphp?hl=en&tab=wg
http://books.google.co.nz/bkshp?hl=en&tab=wp
http://mail.google.com/mail/?hl=en&tab=wm
http://www.google.co.nz/intl/en/options/
http://scholar.google.co.nz/schhp?hl=en&tab=ws
http://blogsearch.google.co.nz/?hl=en&tab=wb
http://translate.google.co.nz/?hl=en&tab=wT
http://www.youtube.com/?hl=en&tab=w1&gl=NZ
http://www.google.com/calendar/render?hl=en&tab=wc
http://docs.google.com/?hl=en&tab=wo
http://www.google.co.nz/reader/view/?hl=en&tab=wy
http://sites.google.com/?hl=en&tab=w3
http://www.google.co.nz/intl/en/options/
/url?sa=p&pref=ig&pval=3&q=http://www.google.co.nz/ig%3Fhl%3Den%26source%3Diglk&usg=AFQjCNGi5EQv2pmx9Kd5MdCX46heegpxAw
/preferences?hl=en
https://www.google.com/accounts/Login?hl=en&continue=http://www.google.co.nz/
/advanced_search?hl=en
/language_tools?hl=en
http://www.google.co.nz/setprefs?sig=0_Va9MAZW7LCKUpGRFXj4-Xh78Tkc=&hl=mi
/intl/en/ads/
/services/
/intl/en/about.html
http://www.google.com/ncr
/intl/en/privacy.html
Notice that these are the hrefs exactly as they appear in the HTML source, so some are relative to the current document or domain and others are absolute, containing a full http:// or https:// URL.
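To make the distinction concrete, here is a minimal sketch that sorts a few of the hrefs above into absolute and relative using PHP's built-in parse_url(). The helper name is_absolute_href() is hypothetical and not part of the Simple HTML DOM Parser; it simply treats an href as absolute when it has both a scheme and a host.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): returns true when the
// href carries its own scheme and host, e.g. "http://www.google.com/ncr".
function is_absolute_href(string $href): bool
{
    $parts = parse_url($href);

    // parse_url() returns false for seriously malformed input.
    return is_array($parts) && isset($parts['scheme'], $parts['host']);
}

// A few hrefs copied from the output above.
$hrefs = [
    'http://maps.google.co.nz/maps?hl=en&tab=wl',
    '/preferences?hl=en',
    '/intl/en/about.html',
    'http://www.google.com/ncr',
];

foreach ($hrefs as $href) {
    echo (is_absolute_href($href) ? 'absolute' : 'relative') . ': ' . $href . "\n";
}
```

In the loop from the previous section, the same check could be applied to $a->href to decide which links still need resolving.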

Resolving the paths

I showed how to resolve the paths to full http:// URLs using the url_to_absolute library from Nadeau Software Consulting in my earlier post titled "Extract images from a web page with PHP and the Simple HTML DOM Parser".
I will write a standalone post about this later this week. It will also deal with a slight issue with the returned URLs: by default they are partially encoded using rawurlencode(), which is not ideal. That post will show the modification needed to resolve this, along with some additional examples.
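In the meantime, for a rough idea of what resolving involves, here is a minimal sketch that does not use the url_to_absolute library. The function resolve_href() is a hypothetical, deliberately simplified helper: it handles absolute, protocol-relative, root-relative and plain document-relative hrefs, but it does not normalise "../" or "./" segments and does not treat query-only or fragment-only hrefs specially, so a full library is the better choice for real pages.

```php
<?php
// Hypothetical, simplified resolver. This is NOT the url_to_absolute library
// and it does not handle "../", "./", or query-only / fragment-only hrefs.
function resolve_href(string $base, string $href): string
{
    // An href with its own scheme is already absolute.
    if (is_string(parse_url($href, PHP_URL_SCHEME))) {
        return $href;
    }

    $b = parse_url($base);
    if (!is_array($b) || !isset($b['scheme'], $b['host'])) {
        return $href; // Unusable base URL: leave the href untouched.
    }

    $origin = $b['scheme'] . '://' . $b['host']
            . (isset($b['port']) ? ':' . $b['port'] : '');

    // Protocol-relative, e.g. "//example.com/x": reuse the base scheme.
    if (strncmp($href, '//', 2) === 0) {
        return $b['scheme'] . ':' . $href;
    }

    // Root-relative, e.g. "/preferences?hl=en".
    if (strncmp($href, '/', 1) === 0) {
        return $origin . $href;
    }

    // Document-relative: append to the directory part of the base path.
    $path = $b['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $dir . $href;
}

echo resolve_href('http://www.google.com/', '/intl/en/ads/') . "\n";
// prints "http://www.google.com/intl/en/ads/"
```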
