Wednesday, 19 September 2018

Extract images from a web page with PHP and the Simple HTML DOM Parser

After posting about how to get the meta tags from an HTML web page with PHP, I was asked how to get the images from an HTML page with PHP, in the way Facebook does when a link is posted. This post looks at how to get the image URLs from a page using the Simple HTML DOM Parser library; in a later post I'll look at how to download the images and make thumbnails.

PHP Simple HTML DOM Parser

The PHP Simple HTML DOM Parser makes it easy to find particular elements within an HTML page in a similar way to jQuery. It can be downloaded from http://simplehtmldom.sourceforge.net/ where there are also several examples.

Getting the images from an HTML page

The following PHP code echoes the src of every image found on the page passed to file_get_html(), which in this example is my earlier post about getting meta tags.
Note that you need to change '/path/to/simple_html_dom.php' to the path where you saved the simple_html_dom.php file, which you downloaded from SourceForge using the link above.
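// Load the parser library (adjust the path to suit)
require_once('/path/to/simple_html_dom.php');
// Fetch and parse the page
$html = file_get_html('https://www.electrictoolbox.com/php-get-meta-tags-html-file/');
// Echo the src attribute of each img element, exactly as written in the HTML
foreach($html->find('img') as $element) {
    echo $element->src, "\n";
}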
At the time of writing this post, the output is as follows:
/images/icons/php.gif
http://manage.aff.biz/42/2882/189/
http://static.addtoany.com/buttons/subscribe_171_16.gif
http://static.addtoany.com/buttons/share_save_171_16.gif
/images/gui/logo.gif
/images/feed.16x16.gif
/images/facebook.png
/images/twitter.png
/images/email.16x16.gif
http://feedproxy.google.com/~fc/ElectricToolboxBlog?bg=ffaf5a&fg=333333&anim=0
/images/gui/bottom.gif
Note that each path is returned exactly as it appears in the HTML; relative paths are not resolved to absolute URLs.
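To see this behaviour without downloading a page or installing anything, here is a minimal sketch that uses PHP's built-in DOMDocument class instead of the Simple HTML DOM Parser. It is a swapped-in technique, not the library's API, and the sample markup and the extract_image_srcs() function name are made up for illustration.

```php
<?php
// Hypothetical sample markup, standing in for a downloaded page.
$sample = '<html><body>'
    . '<img src="/images/gui/logo.gif" alt="logo">'
    . '<img src="http://static.addtoany.com/buttons/share_save_171_16.gif">'
    . '<img alt="no src here">'
    . '</body></html>';

/**
 * Return the src attribute of every img element, exactly as written
 * in the markup. Images without a src attribute are skipped.
 */
function extract_image_srcs(string $htmlSource): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect libxml parse
    // warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($htmlSource);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $srcs = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        // getAttribute() returns an empty string when the attribute is missing.
        $src = $img->getAttribute('src');
        if ($src !== '') {
            $srcs[] = $src;
        }
    }
    return $srcs;
}

foreach (extract_image_srcs($sample) as $src) {
    echo $src, "\n";
}
```

As with the library, the root-relative path comes back unchanged, so it still needs to be resolved against the page URL before it can be downloaded.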

Resolving the paths

So that it's possible to download the images, the relative URLs need to be turned into absolute URLs. I found a library to do this on the Nadeau Software Consulting blog, but their site no longer appears to be available, so I have made the library available for download here.
Download and extract the zipped file url_to_absolute.zip which contains three PHP files. The url_to_absolute.php file requires the other two files.
Here's a modified version of the above code which resolves all image URLs to absolute URLs, which can then be used to download the images:
require_once('simplehtmldom/simple_html_dom.php');
require_once('url_to_absolute.php');

// The page URL doubles as the base for resolving relative image paths
$url = 'https://www.electrictoolbox.com/php-get-meta-tags-html-file/';

$html = file_get_html($url);
foreach($html->find('img') as $element) {
    // Resolve each src against the page URL before echoing it
    echo url_to_absolute($url, $element->src), "\n";
}
This will now output the following from the same page, with absolute URLs:
https://www.electrictoolbox.com/images/icons/php.gif
http://manage.aff.biz/42/2882/189/
http://static.addtoany.com/buttons/subscribe_171_16.gif
http://static.addtoany.com/buttons/share_save_171_16.gif
https://www.electrictoolbox.com/images/gui/logo.gif
https://www.electrictoolbox.com/images/feed.16x16.gif
https://www.electrictoolbox.com/images/facebook.png
https://www.electrictoolbox.com/images/twitter.png
https://www.electrictoolbox.com/images/email.16x16.gif
http://feedproxy.google.com/%7Efc/ElectricToolboxBlog?bg%3Dffaf5a%26amp%3Bfg%3D333333%26amp%3Banim%3D0
https://www.electrictoolbox.com/images/gui/bottom.gif
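
To give a feel for what resolving a URL involves, here is a simplified sketch of the idea. It is not the code from url_to_absolute.php, and the resolve_url_sketch() name is made up for this example. It only handles the common cases (already absolute, protocol-relative, root-relative and document-relative paths), ignores the query string of the base page, and does no percent-encoding, so treat it as an illustration rather than a replacement for the library.

```php
<?php
/**
 * Simplified illustration of resolving a possibly relative URL
 * against the URL of the page it was found on.
 */
function resolve_url_sketch(string $base, string $relative): string
{
    // Already absolute (starts with a scheme such as http: or https:).
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $relative)) {
        return $relative;
    }

    $parts    = parse_url($base);
    $scheme   = $parts['scheme'] ?? 'http';
    $origin   = $scheme . '://' . ($parts['host'] ?? '')
        . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $basePath = $parts['path'] ?? '/';

    // Protocol-relative URL: borrow only the scheme from the base.
    if (substr($relative, 0, 2) === '//') {
        return $scheme . ':' . $relative;
    }

    // Split off any query string or fragment so only the path is normalised.
    $pathLength = strcspn($relative, '?#');
    $path       = substr($relative, 0, $pathLength);
    $suffix     = substr($relative, $pathLength);

    if ($path === '') {
        // Only a query string or fragment was given: reuse the base path.
        $path = $basePath;
    } elseif ($path[0] !== '/') {
        // Document-relative: prepend the directory of the base page.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $path;
    }

    // Collapse "." and ".." segments.
    $segments = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;
        }
    }

    $normalised = '/' . implode('/', $segments);
    if ($segments && substr($path, -1) === '/') {
        $normalised .= '/'; // keep a trailing slash if the path had one
    }

    return $origin . $normalised . $suffix;
}

$page = 'https://www.electrictoolbox.com/php-get-meta-tags-html-file/';
echo resolve_url_sketch($page, '/images/icons/php.gif'), "\n";
echo resolve_url_sketch($page, '../images/twitter.png'), "\n";
```

A full resolver has many more edge cases to deal with, which is why using a tested library is the better choice for real pages.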

Downloading the images

Instead of echoing the image URLs as shown in the above example, they could be stored in an array, put into a database, and so on. In a later post I will look at how to download the images and make thumbnails from them.
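As a rough sketch of the array option, the function below collects already-resolved URLs into an array, dropping duplicates and anything that is not an http or https URL. The collect_image_urls() name and the sample input are made up for this example; in practice the input would be the values produced by url_to_absolute() in the loop above.

```php
<?php
/**
 * Collect image URLs into a de-duplicated list, keeping only
 * http and https URLs (hypothetical helper for illustration).
 */
function collect_image_urls(array $candidates): array
{
    $seen = [];
    foreach ($candidates as $candidate) {
        // parse_url() can return null or false, so cast before comparing.
        $scheme = strtolower((string) parse_url($candidate, PHP_URL_SCHEME));
        if ($scheme === 'http' || $scheme === 'https') {
            // Using the URL as the array key makes repeats collapse.
            $seen[$candidate] = true;
        }
    }
    return array_keys($seen);
}

// Sample input: two real-looking URLs, one repeat and one inline data URI.
$imageUrls = collect_image_urls([
    'https://www.electrictoolbox.com/images/gui/logo.gif',
    'https://www.electrictoolbox.com/images/twitter.png',
    'https://www.electrictoolbox.com/images/gui/logo.gif',
    'data:image/gif;base64,R0lGODlhAQABAAAAACw=',
]);

print_r($imageUrls);
```

The resulting array can then be looped over to insert rows into a database or to download each image in turn.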
