When scraping content with the PHP Simple HTML DOM Parser, it is useful to resolve relative URLs in a page to absolute URLs so that additional web pages or images can be downloaded. I do this with the url_to_absolute library by Nadeau Software Consulting. This post shows how to use it, along with a minor fix that needs to be made to their code.
Download the library
Visit the Nadeau Software Consulting article "How to convert a relative URL to an absolute URL" to download the url_to_absolute library; it is in the downloads section of that page as a zip file.
The zip file contains url_to_absolute.php, split_url.php and join_url.php. The first of these contains the url_to_absolute() function and it require()s the other two files, which contain helper functions. These need to be somewhere in your include path, or you can modify the require() calls in url_to_absolute.php so that they point to wherever you have placed the two helper files.
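As a minimal sketch of the include path option, the snippet below appends a directory to PHP's include path at runtime. The directory name lib/url_to_absolute is an assumption made for illustration; use whatever location you unzipped the three files to.

```php
<?php
// Hypothetical location of the unzipped library files (an assumption, adjust to suit).
$libDir = __DIR__ . '/lib/url_to_absolute';

// Append the directory to the existing include path so that require() calls
// using bare file names can be resolved from it.
set_include_path(get_include_path() . PATH_SEPARATOR . $libDir);

echo get_include_path(), "\n";
```

Once the path is set, a later require 'url_to_absolute.php'; is looked up in that directory as well as in the locations PHP already searches.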
Resolving URLs
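To work out the absolute URL of aboutus.html relative to the page http://www.example.com/sitemap.html, call url_to_absolute() with the page URL as the first argument and the relative URL as the second: url_to_absolute('http://www.example.com/sitemap.html', 'aboutus.html'). This returns http://www.example.com/aboutus.html.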
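To work out the absolute URL of ../images/somephoto.jpg relative to http://www.example.com/content/sitemap.html, call url_to_absolute('http://www.example.com/content/sitemap.html', '../images/somephoto.jpg'). This returns http://www.example.com/images/somephoto.jpg.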
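To illustrate what resolving a relative URL involves in these two cases, here is a simplified stand-in written for this post. It is not the library's code: the function name resolve_simple is hypothetical, and it only handles relative paths, ../ segments and root-relative paths against an http or https base. Protocol-relative, query-only and fragment-only references are not handled; use url_to_absolute() for real work.

```php
<?php
// Simplified illustration only; NOT the library's url_to_absolute().
function resolve_simple(string $baseUrl, string $relativeUrl): string
{
    // A URL that already has a scheme is absolute: return it unchanged.
    if (is_string(parse_url($relativeUrl, PHP_URL_SCHEME))) {
        return $relativeUrl;
    }

    $base = parse_url($baseUrl);
    $origin = $base['scheme'] . '://' . $base['host'];
    if (isset($base['port'])) {
        $origin .= ':' . $base['port'];
    }

    if (strpos($relativeUrl, '/') === 0) {
        // Root-relative: replaces the whole base path.
        $path = $relativeUrl;
    } else {
        // Path-relative: append to the directory part of the base path.
        $basePath = $base['path'] ?? '/';
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $relativeUrl;
    }

    // Split off any query string or fragment before touching the path.
    $cut = strcspn($path, '?#');
    $suffix = substr($path, $cut);
    $path = substr($path, 0, $cut);

    // Collapse "." and ".." segments.
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;
        }
    }
    // Keep a trailing slash if the path had one.
    $trailing = (substr($path, -1) === '/' && $out) ? '/' : '';

    return $origin . '/' . implode('/', $out) . $trailing . $suffix;
}

echo resolve_simple('http://www.example.com/sitemap.html', 'aboutus.html'), "\n";
echo resolve_simple('http://www.example.com/content/sitemap.html', '../images/somephoto.jpg'), "\n";
```

The two calls print http://www.example.com/aboutus.html and http://www.example.com/images/somephoto.jpg.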
URLs are encoded/decoded by default
The join_url and split_url helper functions encode and decode URL parts by default, using rawurlencode and rawurldecode. This isn't ideal if the resulting URL is going to be used to download another web page, an image file and so on.
For example, if we resolve /somepage.php?foo=bar&baz=bat against a page on http://www.example.com, the value returned has its query string encoded: rawurlencode turns = into %3D and & into %26.
This is obviously not what we want to see, because the query string no longer separates the foo and baz parameters.
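The effect is easy to reproduce with PHP's built-in functions alone. This short sketch does not call the library; it just shows what rawurlencode does to the example query string and that rawurldecode reverses it.

```php
<?php
// rawurlencode() escapes every character except letters, digits and - _ . ~
// so the "=" and "&" that structure a query string are escaped too.
$query = 'foo=bar&baz=bat';

$encoded = rawurlencode($query);   // foo%3Dbar%26baz%3Dbat
$decoded = rawurldecode($encoded); // foo=bar&baz=bat

echo $encoded, "\n", $decoded, "\n";
```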
To solve this problem, a minor modification is needed, either to the url_to_absolute function or to the split_url and join_url functions.
If changing url_to_absolute, locate all calls to split_url and join_url and pass false as a second argument, so that a call such as split_url($baseUrl) becomes split_url($baseUrl, false).
Alternatively, change the default value of the second parameter in the definitions of join_url and split_url from true to false.
Once one of these changes has been made, the URLs will no longer be encoded and will be converted to absolute URLs correctly.
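The difference the second argument makes can be sketched with a toy function. demo_join_url below is a hypothetical, heavily simplified stand-in and not the library's join_url(); it only assumes, as described above, that the second parameter switches encoding on or off and defaults to on.

```php
<?php
// Hypothetical stand-in, NOT the library's join_url(): joins URL parts and
// optionally rawurlencode()s the query string.
function demo_join_url(array $parts, bool $encode = true): string
{
    $url = $parts['scheme'] . '://' . $parts['host'] . $parts['path'];
    if (isset($parts['query'])) {
        $url .= '?' . ($encode ? rawurlencode($parts['query']) : $parts['query']);
    }
    return $url;
}

$parts = [
    'scheme' => 'http',
    'host'   => 'www.example.com',
    'path'   => '/somepage.php',
    'query'  => 'foo=bar&baz=bat',
];

$encoded = demo_join_url($parts);        // default: query string is encoded
$plain   = demo_join_url($parts, false); // encoding switched off

echo $encoded, "\n", $plain, "\n";
```

Passing false at each call site and changing the default in the function definition have the same effect; the second is less typing but changes the behaviour for any other code that uses the helpers.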