Thursday, 20 September 2018

PHP: get keywords from search engine referer url

This post shows how to use PHP to extract the keywords a user searched for when they found your website through a search engine. Bing, Google and Yahoo are covered here, and you can easily add your own to the PHP code supplied.

PHP functions used

The code example here uses the parse_url function to extract the parts from the referer URL and then the parse_str function to extract the parts of the query string into array variables. I've covered those functions before in an article titled "Extract query string into an associative array with PHP".
The referer URL is stored in the $_SERVER PHP superglobal as $_SERVER['HTTP_REFERER'], but only if it was set by the web browser. I have covered this value in some detail in the tutorial titled "Using the HTTP_REFERER variable with PHP".
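As a quick refresher, here is a minimal sketch of the two functions working together. The URL is made up for illustration and is not one of the real referers shown later in this post.

```php
<?php
// Illustrative URL only; the host and parameters are invented for this sketch.
$url = 'http://www.example.com/search?q=apache+restart&lang=en';

// parse_url() splits the URL into scheme, host, path, query and so on.
$parts = parse_url($url);

// parse_str() breaks the query string into an associative array.
parse_str($parts['query'], $query);

echo $parts['host'], "\n";               // www.example.com
echo implode(',', array_keys($query)), "\n"; // q,lang
```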

Referer URL examples

Here are some example referer URLs from Bing, Google and Yahoo, taken from people reaching this blog.
http://www.bing.com/search?q=javascript+date+to+timestamp&src=IE-SearchBox&FORM=IE8SRC
http://www.google.de/search?q=apache+restart&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:de:official&client=firefox-a
http://us.yhs.search.yahoo.com/avg/search?fr=yhs-avg-chrome&type=yahoo_avg_hs2-tb-web_chrome_us&p=concatenation+in+mysql
You can see from the URLs that Bing and Google store the keywords in the "q" variable and Yahoo uses "p".

The code

Here's the PHP code to extract the keywords entered from the above examples:
function search_engine_query_string($url = false) {

    if(!$url) {
        $url = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : false;
    }
    if($url == false) {
        return '';
    }

    $parts = parse_url($url);
    parse_str(isset($parts['query']) ? $parts['query'] : '', $query);

    $search_engines = array(
        'bing' => 'q',
        'google' => 'q',
        'yahoo' => 'p'
    );

    preg_match('/(' . implode('|', array_keys($search_engines)) . ')\./', isset($parts['host']) ? $parts['host'] : '', $matches);

    return isset($matches[1]) && isset($query[$search_engines[$matches[1]]]) ? $query[$search_engines[$matches[1]]] : '';

}
It works by using either the URL passed in or, if none is passed, $_SERVER['HTTP_REFERER']. It then extracts the parts from the URL (line 10) and breaks the pieces of the query string into values in an associative array (line 11).
A list of search engines is defined from lines 13 to 17 as an associative array containing the main part of the domain (i.e. in www.google.com the 'google' bit) mapped to the variable name in the query string. You can add additional search engines to this array.
Note that the array index (i.e. the 'google' bit) is used to match against the search engine's domain using this index value plus a period/dot. Therefore 'google' would match www.google.com, www.google.co.nz and even notgoogle.com.
The regular expression could be modified to ensure there's a period/dot at the start of the host OR the host starts with the domain, but I'm personally happy to leave it as-is for the moment; you are free of course to modify the code if you prefer to ensure a more exact match.
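For anyone who does want the stricter match, here is one possible sketch. The helper name match_search_engine_host is hypothetical and not part of the function above; it only shows a pattern that requires the engine name to be at the start of the host or directly after a dot.

```php
<?php
// Hypothetical helper: returns the matching search engine name for a host, or ''.
function match_search_engine_host($host, array $names) {
    // Quote each name so it is safe to embed in the pattern.
    $quoted = array_map(function ($name) {
        return preg_quote($name, '/');
    }, $names);

    // (?:^|\.) requires the name to start the host or to follow a dot.
    $pattern = '/(?:^|\.)(' . implode('|', $quoted) . ')\./';

    return preg_match($pattern, $host, $matches) ? $matches[1] : '';
}

$names = array('bing', 'google', 'yahoo');
echo match_search_engine_host('www.google.co.nz', $names), "\n"; // google
echo match_search_engine_host('notgoogle.com', $names), "\n";    // (empty line)
```

This still accepts a host such as google.example.com, so it narrows the match rather than making it exact.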
The regular expression on line 19 matches the search engine name into the $matches array, and line 21 returns the keywords if the search engine domain matched and a keyword variable was found.
Note that parse_str will remove any URL encoding so e.g. "javascript+date+to+timestamp" will be returned as "javascript date to timestamp".
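A small sketch of that decoding, using two query strings in the style of the referers in this post. Both "+" and "%20" come back as spaces.

```php
<?php
// "+" encoded spaces, as in the Bing example.
parse_str('q=javascript+date+to+timestamp', $plus);

// "%20" encoded spaces, as some Google referers use.
parse_str('q=javascript%20settimeout', $percent);

echo $plus['q'], "\n";    // javascript date to timestamp
echo $percent['q'], "\n"; // javascript settimeout
```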

Examples

Here are some examples of running the above function with the referer URLs from the beginning of the post:
echo search_engine_query_string('http://www.bing.com/search?q=javascript+date+to+timestamp&src=IE-SearchBox&FORM=IE8SRC');
// echoes "javascript date to timestamp"
echo search_engine_query_string('http://www.google.de/search?q=apache+restart&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:de:official&client=firefox-a');
// echoes "apache restart"
echo search_engine_query_string('http://us.yhs.search.yahoo.com/avg/search?fr=yhs-avg-chrome&type=yahoo_avg_hs2-tb-web_chrome_us&p=concatenation+in+mysql');
// echoes "concatenation in mysql"

A note about HTTP_REFERER
You cannot ever guarantee that the $_SERVER['HTTP_REFERER'] variable is passed along by the browser and is available to your PHP script.
There are a variety of reasons why it may not be set, such as browser configuration settings, local proxy software that blocks it, clicking a link that moves you from HTTPS to HTTP, right-click and opening in a new browser tab/window, etc.
Having said that, it is available some of the time and you can then capture this information using PHP's parse_url and parse_str functions.

Referer URL examples

Here are some examples that will be used with the example function. The first two are from Google, one with a regular query string and the second where it's passed as a #fragment. The other two are from Bing and Yahoo.
http://www.google.co.nz/url?sa=t&source=web&cd=3&sqi=2&ved=0CCkQFjAC&url=http%3A%2F%2Fwww.electrictoolbox.com%2Fusing-settimeout-javascript%2F&rct=j&q=javascript%20settimeout&ei=IijsTIzYAYLCcfeB2fYO&usg=AFQjCNFJ5Fn8pm2lVcZCt46Jn6A7v_S4TQ
http://www.google.com/#hl=en&biw=1440&bih=688&q=javascript+settimeout&aq=f&aqi=g10&aql=&oq=&gs_rfai=&fp=1b219014ca3fb4b2
http://www.bing.com/search?q=javascript+date+to+timestamp&src=IE-SearchBox&FORM=IE8SRC
http://us.yhs.search.yahoo.com/avg/search?fr=yhs-avg-chrome&type=yahoo_avg_hs2-tb-web_chrome_us&p=concatenation+in+mysql
You can see from the URLs that Bing and Google store the keywords in the "q" variable and Yahoo uses "p".

The code

Here's the PHP code to extract the keywords entered from the above examples. I will explain it on a line by line basis below.
function search_engine_query_string($url = false) {

    if(!$url && !$url = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : false) {
        return '';
    }

    $parts_url = parse_url($url);
    $query = isset($parts_url['query']) ? $parts_url['query'] : (isset($parts_url['fragment']) ? $parts_url['fragment'] : '');
    if(!$query) {
        return '';
    }
    parse_str($query, $parts_query);
    return isset($parts_query['q']) ? $parts_query['q'] : (isset($parts_query['p']) ? $parts_query['p'] : '');

}

How it works

1. Optionally passing in a url, or getting it from HTTP_REFERER
The full url is optionally passed to the function. If it does not contain a value the first few lines get it from $_SERVER['HTTP_REFERER'] as shown below. At this stage if nothing is available it returns an empty string.
    if(!$url && !$url = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : false) {
        return '';
    }
My original post took a few extra lines to do the above so I have consolidated it here into fewer lines. It was suggested by one commenter to do the return as part of the assignment (e.g. like this: $url = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : return '';) but it results in a parse error.
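The commenter's version fails because return is a statement rather than an expression, so it cannot appear as a branch of the ternary operator. If the condensed condition is hard to read, the same logic can be spelled out in a longer form. The sketch below isolates it in a hypothetical helper, resolve_referer_url, which takes a $_SERVER-style array as a parameter so it can be tried outside a web request.

```php
<?php
// Hypothetical helper that isolates step 1 of the function.
// $server stands in for $_SERVER so the logic can run from the command line.
function resolve_referer_url($url, array $server) {
    if (!$url) {
        // Fall back to the referer when the caller did not pass a URL.
        $url = isset($server['HTTP_REFERER']) ? $server['HTTP_REFERER'] : false;
    }
    // Return '' when neither source supplied a URL.
    return $url ? $url : '';
}

echo resolve_referer_url(false, array('HTTP_REFERER' => 'http://www.bing.com/search?q=php')), "\n";
// http://www.bing.com/search?q=php
```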
2. Using parse_url to get the parts from the URL
The next line of code uses the parse_url function. This extracts the various parts of the URL into an associative array, which is stored in the $parts_url variable.
    $parts_url = parse_url($url);
If print_r($parts_url) were run on the first URL in my examples, it would output this:
Array
(
    [scheme] => http
    [host] => www.google.co.nz
    [path] => /url
    [query] => sa=t&source=web&cd=3&sqi=2&ved=0CCkQFjAC&url=http%3A%2F%2Fwww.electrictoolbox.com%2Fusing-settimeout-javascript%2F&rct=j&q=javascript%20settimeout&ei=IijsTIzYAYLCcfeB2fYO&usg=AFQjCNFJ5Fn8pm2lVcZCt46Jn6A7v_S4TQ
)
You can see the array item we want to use is the "query" one. In the case of the 2nd URL example which has Google sending through the query in the #fragment in the URL it would look like this:
Array
(
    [scheme] => http
    [host] => www.google.com
    [path] => /
    [fragment] => hl=en&biw=1440&bih=688&q=javascript+settimeout&aq=f&aqi=g10&aql=&oq=&gs_rfai=&fp=1b219014ca3fb4b2
)
3. Getting the query string or fragment
Because the query string can effectively be in either the [query] or [fragment] the next line of code works out which one it is in and assigns it to the $query variable:
    $query = isset($parts_url['query']) ? $parts_url['query'] : (isset($parts_url['fragment']) ? $parts_url['fragment'] : '');
If $query is empty at this stage then the URL contains neither a query string nor a fragment, so the function returns an empty string:
    if(!$query) {
        return '';
    }
4. Using parse_str to explode the query string
The next line uses parse_str to explode the query string into an associative array and store it in the $parts_query array:
    parse_str($query, $parts_query);
Using the Google example again, doing print_r($parts_query) would output this:
Array
(
    [sa] => t
    [source] => web
    [cd] => 3
    [sqi] => 2
    [ved] => 0CCkQFjAC
    [url] => http://www.electrictoolbox.com/using-settimeout-javascript/
    [rct] => j
    [q] => javascript settimeout
    [ei] => IijsTIzYAYLCcfeB2fYO
    [usg] => AFQjCNFJ5Fn8pm2lVcZCt46Jn6A7v_S4TQ
)
You can see the element in the array that has the actual query string searched on at Google is in [q].
5. Return the search engine query
The final line in the function checks [q] and then [p] in the $parts_query array and returns whichever is set, or an empty string if neither is set. You can easily add additional isset and value clauses if a different search engine sends through the query in a different variable.
    return isset($parts_query['q']) ? $parts_query['q'] : (isset($parts_query['p']) ? $parts_query['p'] : '');
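If the list of variable names grows, nesting more ternaries gets unwieldy. One alternative, sketched here with a hypothetical helper named first_query_value, is to loop over the candidate names in order. Only "q" and "p" come from this post; the extra name "query" is purely illustrative and not tied to any particular search engine.

```php
<?php
// Hypothetical variant of the final step: try each candidate name in order.
function first_query_value(array $parts_query, array $names) {
    foreach ($names as $name) {
        if (isset($parts_query[$name])) {
            return $parts_query[$name];
        }
    }
    // None of the candidate variables was present.
    return '';
}

parse_str('fr=yhs-avg-chrome&p=concatenation+in+mysql', $parts_query);

// 'q' and 'p' are from the post; 'query' is an illustrative extra.
echo first_query_value($parts_query, array('q', 'p', 'query')), "\n";
// concatenation in mysql
```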

Example output

Using the examples at the top of this post, here is some example output:
echo search_engine_query_string('http://www.google.co.nz/url?sa=t&source=web&cd=3&sqi=2&ved=0CCkQFjAC&url=http%3A%2F%2Fwww.electrictoolbox.com%2Fusing-settimeout-javascript%2F&rct=j&q=javascript%20settimeout&ei=IijsTIzYAYLCcfeB2fYO&usg=AFQjCNFJ5Fn8pm2lVcZCt46Jn6A7v_S4TQ');
Result: javascript settimeout
echo search_engine_query_string('http://www.google.com/#hl=en&biw=1440&bih=688&q=javascript+settimeout&aq=f&aqi=g10&aql=&oq=&gs_rfai=&fp=1b219014ca3fb4b2');
Result: javascript settimeout
echo search_engine_query_string('http://www.bing.com/search?q=javascript+date+to+timestamp&src=IE-SearchBox&FORM=IE8SRC');
Result: javascript date to timestamp
echo search_engine_query_string('http://us.yhs.search.yahoo.com/avg/search?fr=yhs-avg-chrome&type=yahoo_avg_hs2-tb-web_chrome_us&p=concatenation+in+mysql');
Result: concatenation in mysql

A note on $_SERVER['QUERY_STRING']

There is also a $_SERVER['QUERY_STRING'] variable, but it holds the query string of the page currently being requested, not that of the referer URL. The search keywords therefore still have to be extracted from HTTP_REFERER using parse_url, which also lets us check for a #fragment.
