Wednesday, 12 September 2018

Parsing Google Analytics data with PHP - CSV files

This post shows how to extract data from an email attachment from data sent by Google Analytics with PHP. I've covered how to download and extract attachments with PHP's IMAP functions in previous posts and am now coming to the end of this series finally showing how to get the data out of the attachments.
The last post showed how to connect to an IMAP mail server and get the attachments for a specific Google Analytics report. It called a function called process_attachment() which was a dummy placeholder function in the example code. The code in this post would be inside that function.
The CSV data from Google Analytics varies depending on the report but generally has a summary at the top, graph data and then table data. The first row of each section contains the column names and then the data itself. View an example CSV attachment here. This is an HTML formatted version of the CSV file so you can easily view it in your web browser.
Without further ado, here's the code followed by a brief explanation of what it's doing, although there's a fair amount of comments in there already explaining it:
// extract the lines from the attachment into an array so we can loop through them
$lines = explode("\n", $attachment);
// cache the count so function doesn't keep getting called in the loop
$count = count($lines);
// this var will store the section we are currently processing
// if it's false then don't do anything with the line
$processing = false;
// initilise the data array
$data = array(
    'graph' => array(),
    'table' => array()
);
// loop through the lines
for($i = 0; $i < $count; $i++) {
    // if the line starts with graph/table then extract the fieldnames two lines down
    if($lines[$i] == '# Graph' || $lines[$i] == '# Table') {
        $processing = strtolower(str_replace('# ', '', $lines[$i]));
        $i+=2;
        $fields = str_getcsv(strtolower($lines[$i]));
        $i++;
    }
    // if it's a blank line or starts with # --- then the section has ended
    if($lines[$i] == '' || preg_match('/# ---+/', $lines[$i])) {
        $processing = false;
    }
    // if processing then load the CSV data into the array using the fieldnames as index keys
    if($processing) {
        $parts = str_getcsv($lines[$i]);
        $row = array();
        for($j = 0; $j < count($parts); $j++) {
            $row[$fields[$j]] = $parts[$j];
        }
        $data[$processing][] = $row;
    }
}
The code above is fairly straight forward. We look for lines that start with # Graph and # Table and extract the fieldnames from two lines further down. If the line should be processed, the str_getcsv function is called to extact the data and then it's mapped into the $data array into either 'graph' or 'table' using the fieldnames.
Note that the str_getcsv function is only available from PHP 5.3.0 so if using an earlier version of PHP you'll need to create a custom function instead. I have provided one here earlier today.
Doing print_r on the example data would show something along these lines:
Array
(
    [graph] => Array
        (
            [0] => Array
                (
                    [day] => Friday, February 13, 2009
                    [pageviews] => 3,771
                )

            [1] => Array
                (
                    [day] => Saturday, February 14, 2009
                    [pageviews] => 2,856
                )

            ...

            [30] => Array
                (
                    [day] => Sunday, March 15, 2009
                    [pageviews] => 1,826
                )

        )

    [table] => Array
        (
            [0] => Array
                (
                    [page] => /article/apache/restart-apache/
                    [pageviews] => 2975
                    [unique pageviews] => 2836
                    [avg. time on page] => 388.93975903614455
                    [bounce rate] => 0.9520255863539445
                    [% exit] => 0.944201680672269
                    [$ index] => 0.0
                )

            [1] => Array
                (
                    [page] => /article/mysql/delete-all-data-mysql/
                    [pageviews] => 2891
                    [unique pageviews] => 2732
                    [avg. time on page] => 267.13524590163934
                    [bounce rate] => 0.9355805243445693
                    [% exit] => 0.915600138360429
                    [$ index] => 0.0
                )

            ...
               
            [99] => Array
                (
                    [page] => /view-email-message-headers-outlook-2007/
                    [pageviews] => 248
                    [unique pageviews] => 242
                    [avg. time on page] => 730.875
                    [bounce rate] => 0.9710743801652892
                    [% exit] => 0.967741935483871
                    [$ index] => 0.0
                )

        )

)
The next post in this series will show how to do the same thing but with TSV files, better known as tab-separated values. The code is virtually identical. And finally on Sunday I'll show how to do this with XML files. Read about the other posts in this series here, including a full list of posts in the series.

Related posts:

0 comments:

Post a Comment