Tuesday 14 August 2018

Parse Web Pages with PHP Simple HTML DOM Parser

For those of you who have had the pleasure of following me on Twitter (...), you probably know that I'm a complete soccer (football) fanatic.  I even started a separate Twitter account to voice my footy musings.  If you follow football yourself, you'll know that we've just started the international transfer window and there are a billion rumors about a billion players going to a billion clubs.  It's enough to drive you mad but I simply HAVE TO KNOW who will be in the Arsenal and Liverpool first teams next season.


The problem I run into, besides all of the rubbish reports making waved, is that I don't have time to check every website on the hour.  Twitter is a big help, but there's nothing better during this time than an official report from each club's website.  To keep an eye on those reports, I'm using the power of PHP Simple HTML DOM Parser to write a tiny PHP script that shoots me an email whenever a specific page is updated.

PHP Simple HTML DOM Parser
PHP Simple HTML DOM Parser is a dream utility for developers that work with both PHP and the DOM because developers can easily find DOM elements using PHP. Here are a few sample uses of PHP Simple HTML DOM Parser:

// Include the library
include('simple_html_dom.php');

// Retrieve the DOM from a given URL
$html = file_get_html('https://davidwalsh.name/');

// Find all "A" tags and print their HREFs
foreach($html->find('a') as $e)
    echo $e->href . '<br>';

// Retrieve all images and print their SRCs
foreach($html->find('img') as $e)
    echo $e->src . '<br>';

// Find all images, print their text with the "<>" included
foreach($html->find('img') as $e)
    echo $e->outertext . '<br>';

// Find the DIV tag with an id of "myId"
foreach($html->find('div#myId') as $e)
    echo $e->innertext . '<br>';

// Find all SPAN tags that have a class of "myClass"
foreach($html->find('span.myClass') as $e)
    echo $e->outertext . '<br>';

// Find all TD tags with "align=center"
foreach($html->find('td[align=center]') as $e)
    echo $e->innertext . '<br>';
   
// Extract all text from a given cell
echo $html->find('td[align="center"]', 1)->plaintext.'<br><hr>';
Like I said earlier, this library is a dream for finding elements, just as the early JavaScript frameworks and selector engines have become. Armed with the ability to pick content from DOM nodes with PHP, it's time to analyze websites for changes.

The Script
The following script checks two websites for changes:

// Pull in PHP Simple HTML DOM Parser
include("simplehtmldom/simple_html_dom.php");

// Settings on top
$sitesToCheck = array(
// id is the page ID for selector
array("url" => "http://www.arsenal.com/first-team/players", "selector" => "#squad"),
array("url" => "http://www.liverpoolfc.tv/news", "selector" => "ul[style='height:400px;']")
);
$savePath = "cachedPages/";
$emailContent = "";

// For every page to check...
foreach($sitesToCheck as $site) {
$url = $site["url"];

// Calculate the cachedPage name, set oldContent = "";
$fileName = md5($url);
$oldContent = "";

// Get the URL's current page content
$html = file_get_html($url);

// Find content by querying with a selector, just like a selector engine!
foreach($html->find($site["selector"]) as $element) {
$currentContent = $element->plaintext;;
}

// If a cached file exists
if(file_exists($savePath.$fileName)) {
// Retrieve the old content
$oldContent = file_get_contents($savePath.$fileName);
}

// If different, notify!
if($oldContent && $currentContent != $oldContent) {
// Here's where we can do a whoooooooooooooole lotta stuff
// We could tweet to an address
// We can send a simple email
// We can text ourselves

// Build simple email content
$emailContent = "David, the following page has changed!\n\n".$url."\n\n";
}

// Save new content
file_put_contents($savePath.$fileName,$currentContent);
}

// Send the email if there's content!
if($emailContent) {
// Sendmail!
mail("david@davidwalsh.name","Sites Have Changed!",$emailContent,"From: alerts@davidwalsh.name","\r\n");
// Debug
echo $emailContent;
}
The code and comments are self-explanatory.  I've set the script up such that I get one "digest" alert if many of the pages change.  The script is the hard part -- to enact the script, I've set up a CRON job to run the script every 20 minutes.

This solution isn't specific to just spying on footy -- you could use this type of script on any number of sites.  This script, however, is a bit simplistic in all cases.  If you wanted to spy on a website that had extremely dynamic code (i.e. a timestamp was in the code), you would want to create a regular expressions that would isolate the content to just the block you're looking for. Since each website is constructed differently, I'll leave it up to you to create page-specific isolators. Have fun spying on websites though...and be sure to let me know if you hear a good, reliable footy rumor!

0 comments:

Post a Comment