Monday 2 February 2015

Regular expressions in PHP

In this part of the PHP tutorial, we will talk about regular expressions in PHP.
Regular expressions are used for text searching and more advanced text manipulation. They are built into tools like grep and sed, text editors like vi and emacs, and programming languages like Tcl, Perl, and Python. PHP has built-in support for regular expressions too.
In PHP, there are two modules for regular expressions: the POSIX Regex and the PCRE. The POSIX Regex is deprecated and was removed in PHP 7. In this chapter, we will use the PCRE examples. PCRE stands for Perl Compatible Regular Expressions.
Two things are needed when we work with regular expressions: regex functions and a pattern.
A pattern is a regular expression that defines the text we are searching for or manipulating. It consists of text literals and metacharacters. The pattern is placed inside two delimiters. These are usually the /, #, or @ characters. They inform the regex function where the pattern starts and ends.
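As a small sketch of how delimiters work, the following snippet applies the same expression with the three delimiter styles mentioned above. The variable names and the sample path are only illustrative.

```php
<?php

// The same regular expression wrapped in three different delimiter pairs.
$patterns = array("/Jane/", "#Jane#", "@Jane@");

$results = array();

foreach ($patterns as $pattern) {
    // preg_match() returns 1 when the pattern is found, 0 otherwise.
    $results[$pattern] = preg_match($pattern, "I saw Jane.");
    echo "$pattern => {$results[$pattern]}\n";
}

// A pattern that contains slashes is easier to read with another delimiter,
// because the slashes then need no escaping.
$hasPath = preg_match("#^/usr/bin/#", "/usr/bin/php");
echo "path match: $hasPath\n";
```

Choosing a delimiter that does not occur inside the pattern is usually the simplest option.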
Here is a partial list of metacharacters used in PCRE.
.     Matches any single character.
*     Matches the preceding element zero or more times.
[ ]   Bracket expression. Matches a character within the brackets.
[^ ]  Matches a single character that is not contained within the brackets.
^     Matches the starting position within the string.
$     Matches the ending position within the string.
|     Alternation operator.

PCRE functions

We introduce some PCRE regex functions. They all have a preg_ prefix.
  • preg_split() - splits a string by regex pattern
  • preg_match() - performs a regex match
  • preg_replace() - search and replace string by regex pattern
  • preg_grep() - returns array entries that match the regex pattern
Next we will have an example for each function.
php > print_r(preg_split("@\s@", "Jane\tKate\nLucy Marion"));
Array
(
    [0] => Jane
    [1] => Kate
    [2] => Lucy
    [3] => Marion
)
We have four names separated by whitespace characters: a tab, a newline, and a space. The \s is a shorthand character class that stands for any whitespace character. The preg_split() function returns the split strings.
php > echo preg_match("#[a-z]#", "s");
1
The preg_match() function checks whether the 's' character is in the character class [a-z]. The class stands for all lowercase characters from a to z. The function returns 1 if the pattern matches and 0 if it does not.
php > echo preg_replace("/Jane/","Beky","I saw Jane. Jane was beautiful.");
I saw Beky. Beky was beautiful.
The preg_replace() function replaces all occurrences of the word 'Jane' with the word 'Beky'.
php > print_r(preg_grep("#Jane#", array("Jane", "jane", "Joan", "JANE")));
Array
(
    [0] => Jane
)
The preg_grep() function returns an array of words that match the given pattern. In the script example, only one word is returned in the array. This is because by default, the search is case sensitive.
php > print_r(preg_grep("#Jane#i", array("Jane", "jane", "Joan", "JANE")));
Array
(
    [0] => Jane
    [1] => jane
    [3] => JANE
)
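In this example, we perform a case insensitive grep. We put the i modifier after the right delimiter. The returned array has three words. The preg_grep() function preserves the keys of the input array, which is why the index 2 is missing.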
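The functions above also accept optional arguments that were not used in the examples. The following sketch shows two of them: the third argument of preg_match(), which receives the matched text, and the fourth argument of preg_replace(), which limits the number of replacements. The variable names are only illustrative.

```php
<?php

$text = "I saw Jane. Jane was beautiful.";

// The optional third argument of preg_match() receives the matched text.
$found = preg_match("#J.ne#", $text, $matches);
echo "found: $found, matched text: {$matches[0]}\n";

// The optional fourth argument of preg_replace() limits the number
// of replacements; here only the first 'Jane' is replaced.
$once = preg_replace("/Jane/", "Beky", $text, 1);
echo "$once\n";
```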

The dot metacharacter

The . (dot) metacharacter stands for any single character in the text.
<?php

$words = array("Seven", "even", "Maven", "Amen", "Leven");

$pattern = "/.even/";

foreach ($words as $word) {
    if (preg_match($pattern, $word)) {
        echo "$word matches the pattern\n";
    } else {
        echo "$word does not match the pattern\n";
    }
}

?>
In the $words array, we have five words.
$pattern = "/.even/";
Here we define the search pattern. The pattern is a string. The regular expression is placed within delimiters. The delimiters are not optional. They must be present. In our case, we use forward slashes / / as delimiters. Note, we can use different delimiters if we want. The dot character stands for any single character.
if (preg_match($pattern, $word)) {
    echo "$word matches the pattern\n";
} else {
    echo "$word does not match the pattern\n";
}
We test each of the five words against the pattern.
$ php single.php 
Seven matches the pattern
even does not match the pattern
Maven does not match the pattern
Amen does not match the pattern
Leven matches the pattern
The Seven and Leven words match our search pattern.

Anchors

Anchors match positions of characters inside a given text.
In the next example, we check whether a string is located at the beginning of a sentence.
<?php

$sentence1 = "Everywhere I look I see Jane";
$sentence2 = "Jane is the best thing that happened to me";

if (preg_match("/^Jane/", $sentence1)) {
    echo "Jane is at the beginning of the \$sentence1\n";
} else {
    echo "Jane is not at the beginning of the \$sentence1\n";
}

if (preg_match("/^Jane/", $sentence2)) {
    echo "Jane is at the beginning of the \$sentence2\n";
} else {
    echo "Jane is not at the beginning of the \$sentence2\n";
}

?>
We have two sentences. The pattern is ^Jane. The pattern asks, is the 'Jane' string located at the beginning of the text?
$ php begin.php 
Jane is not at the beginning of the $sentence1
Jane is at the beginning of the $sentence2
php > echo preg_match("#Jane$#", "I love Jane");
1
php > echo preg_match("#Jane$#", "Jane does not love me.");
0
The Jane$ pattern matches a string, in which the word Jane is at the end.
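One detail of the $ anchor is easy to overlook: by default it also matches just before a newline that ends the string. The sketch below compares the default behaviour with the D modifier, which restricts $ to the very end of the subject. The sample string with a trailing newline is an assumption made for the demonstration.

```php
<?php

// Without modifiers, $ also matches just before a trailing newline.
$plain = preg_match('/Jane$/', "I love Jane\n");

// The D modifier makes $ match only at the very end of the string.
$strict = preg_match('/Jane$/D', "I love Jane\n");

echo "plain: $plain, strict: $strict\n";
```

This matters mostly when the text comes from a file or from user input, where a trailing newline is common.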

Exact word match

In the following examples, we show how to look for exact word matches.
php > echo preg_match("/mother/", "mother");
1
php > echo preg_match("/mother/", "motherboard");
1
php > echo preg_match("/mother/", "motherland");
1
The mother pattern matches the words mother, motherboard, and motherland. Say we want to look just for exact word matches. We will use the aforementioned anchor characters ^ and $.
php > echo preg_match("/^mother$/", "motherland");
0
php > echo preg_match("/^mother$/", "Who is your mother?");
0
php > echo preg_match("/^mother$/", "mother");
1
Using the anchor characters, we get a match only when the whole string is exactly the word 'mother'.
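The anchors compare the pattern against the whole string, so the word is not found inside a longer sentence. If the goal is to find a whole word anywhere in the text, one possible approach is the \b word boundary assertion, which the examples above do not use. The following is a minimal sketch with illustrative variable names.

```php
<?php

// \b matches at a position between a word character and a non-word character.
// Single quotes keep the backslash exactly as written.
$pattern = '/\bmother\b/';

$subjects = array("mother", "motherboard", "Who is your mother?");

$hits = array();

foreach ($subjects as $subject) {
    $hits[$subject] = preg_match($pattern, $subject);
    echo "$subject => {$hits[$subject]}\n";
}
```

Here the sentence matches, while motherboard does not, because the letter b that follows mother is a word character and therefore no boundary exists at that position.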

Quantifiers

A quantifier after a token or group specifies how often that preceding element is allowed to occur.
 ?     - 0 or 1 match
 *     - 0 or more
 +     - 1 or more
 {n}   - exactly n
 {n,}  - n or more
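 {0,n} - n or fewer
 {n,m} - range n to m
The above is a list of common quantifiers. Note that PCRE has traditionally not treated {,n} as a quantifier but as literal text, so the lower bound is written explicitly as {0,n}.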
The question mark ? indicates there is zero or one of the preceding element.
<?php

$words = array("jar", "jazz", "jay", "java", "jet");

$pattern = "/ja.?/";

foreach ($words as $word) {
    if (preg_match($pattern, $word)) {
        echo "$word matches the pattern\n";
    } else {
        echo "$word does not match pattern\n";
    }
}

?>
We have five words in the $words array.
$pattern = "/ja.?/";
This is the pattern. The .? combination means zero or one arbitrary single character.
Running the script gives the following output:
jar matches the pattern
jazz matches the pattern
jay matches the pattern
java matches the pattern
jet does not match pattern
The * metacharacter matches the preceding element zero or more times.
<?php

$words = array("Seven", "even", "Maven", "Amen", "Leven");

$pattern = "/.*even/";

foreach ($words as $word) {
    if (preg_match($pattern, $word)) {
        echo "$word matches the pattern\n";
    } else {
        echo "$word does not match the pattern\n";
    }
}

?>
In the above script, we have added the * metacharacter. The .* combination means zero or more arbitrary characters.
$ php zeroormore.php 
Seven matches the pattern
even matches the pattern
Maven does not match the pattern
Amen does not match the pattern
Leven matches the pattern
Now the pattern matches three words: Seven, even and Leven.
php > print_r(preg_grep("#o{2}#", array("gool", "root", "foot", "dog")));
Array
(
    [0] => gool
    [1] => root
    [2] => foot
)
The o{2} pattern matches strings that contain two consecutive 'o' characters.
php > print_r(preg_grep("#^\d{2,4}$#", array("1", "12", "123", "1234", "12345")));
Array
(
    [1] => 12
    [2] => 123
    [3] => 1234
)
We have this ^\d{2,4}$ pattern. The \d is a shorthand character class that stands for digits. Because the pattern is anchored at both ends, it matches only strings that consist of 2, 3, or 4 digits.
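The + and {n,} quantifiers from the list have no example of their own, so here is a short sketch of both. The word list is made up for the demonstration.

```php
<?php

$words = array("ggle", "gogle", "google", "gooogle");

// o+ requires at least one 'o' between 'g' and 'gle'.
$oneOrMore = preg_grep('/^go+gle$/', $words);

// o{2,} requires at least two 'o' characters.
$twoOrMore = preg_grep('/^go{2,}gle$/', $words);

print_r($oneOrMore);
print_r($twoOrMore);
```

The first call keeps gogle, google, and gooogle. The second call keeps only google and gooogle. In both cases the original array keys are preserved.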

Alternation

The next example explains the alternation operator |. This operator lets us create a regular expression with several choices.
<?php

$names = array("Jane", "Thomas", "Robert", "Lucy", 
    "Beky", "John", "Peter", "Andy");

$pattern = "/Jane|Beky|Robert/";

foreach ($names as $name) {

    if (preg_match($pattern, $name)) {
        echo "$name is my friend\n";
    } else {
        echo "$name is not my friend\n";
    }
}

?>
We have 8 names in the $names array.
$pattern = "/Jane|Beky|Robert/";
This is the search pattern. It says, Jane, Beky, and Robert are my friends. If you find any of them, you have found my friend.
$ php friends.php 
Jane is my friend
Thomas is not my friend
Robert is my friend
Lucy is not my friend
Beky is my friend
John is not my friend
Peter is not my friend
Andy is not my friend
Output of the script.
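The alternation pattern above is not anchored, so it also accepts longer names that merely contain one of the alternatives. A possible way to require a whole-name match is to group the alternatives in parentheses (subpatterns are covered in the next section) and add the ^ and $ anchors. The names Janet and Roberta are made up for this sketch.

```php
<?php

$names = array("Jane", "Janet", "Roberta", "Beky");

// The alternatives may match anywhere inside the name.
$loose = '/Jane|Beky|Robert/';

// The whole name must be exactly one of the alternatives.
$strict = '/^(Jane|Beky|Robert)$/';

$looseHits  = preg_grep($loose, $names);
$strictHits = preg_grep($strict, $names);

print_r($looseHits);
print_r($strictHits);
```

The loose pattern accepts all four names, while the strict one keeps only Jane and Beky.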

Subpatterns

We can use parentheses () to create subpatterns inside patterns.
php > echo preg_match("/book(worm)?$/", "bookworm");
1
php > echo preg_match("/book(worm)?$/", "book");
1
php > echo preg_match("/book(worm)?$/", "worm");
0
We have the following regex pattern: book(worm)?$. The (worm) is a subpattern. The ? character follows the subpattern, which means that the subpattern may appear zero or one time in the final pattern. The $ character anchors the match to the end of the string. Without it, words like bookstore and bookmania would match too.
php > echo preg_match("/book(shelf|worm)?$/", "book");
1
php > echo preg_match("/book(shelf|worm)?$/", "bookshelf");
1
php > echo preg_match("/book(shelf|worm)?$/", "bookworm");
1
php > echo preg_match("/book(shelf|worm)?$/", "bookstore");
0
Subpatterns are often used with alternation. The (shelf|worm) subpattern enables us to create several word combinations.
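Subpatterns also capture the text they match. As a sketch, the snippet below passes a third argument to preg_match() to see what the (shelf|worm) subpattern captured. The variable names are only illustrative.

```php
<?php

$pattern = '/book(shelf|worm)?$/';

// Index 0 holds the whole match, index 1 the text of the first subpattern.
preg_match($pattern, "bookworm", $withSuffix);

// When the optional subpattern does not take part in the match,
// no entry is created for it.
preg_match($pattern, "book", $withoutSuffix);

print_r($withSuffix);
print_r($withoutSuffix);
```

For bookworm the array contains bookworm and worm. For book it contains only the whole match.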

Character classes

We can combine characters into character classes with the square brackets. A character class matches any character that is specified in the brackets.
<?php

$words = array("sit", "MIT", "fit", "fat", "lot");

$pattern = "/[fs]it/";

foreach ($words as $word) {

    if (preg_match($pattern, $word)) {
        echo "$word matches the pattern\n";
    } else {
        echo "$word does not match the pattern\n";
    }
}

?>
We define a character set with two characters.
$pattern = "/[fs]it/";
This is our pattern. The [fs] is the character class. Note that we work only with one character at a time. We either consider f, or s, but not both.
$ php chclass.php 
sit matches the pattern
MIT does not match the pattern
fit matches the pattern
fat does not match the pattern
lot does not match the pattern
This is the outcome of the script.
We can also use shorthand metacharacters for character classes. The \w stands for word characters (letters, digits, and the underscore), \d for digits, and \s for whitespace characters.
<?php

$words = array("Prague", "111978", "terry2", "mitt##");

$pattern = "/\w{6}/";

foreach ($words as $word) {

    if (preg_match($pattern, $word)) {
        echo "$word matches the pattern\n";
    } else {
        echo "$word does not match the pattern\n";
    }
}

?>
In the above script, we test for words consisting of word characters. The \w{6} pattern matches six consecutive word characters. Only the word mitt## does not match, because it contains only four word characters in a row.
php > echo preg_match("#[^a-z]{3}#", "ABC");
1
The #[^a-z]{3}# pattern stands for three characters that are not in the class a-z. The "ABC" characters match the condition.
php > print_r(preg_grep("#\d{2,4}#", array("32", "234", "2345", "3d3", "2")));
Array
(
    [0] => 32
    [1] => 234
    [2] => 2345
)
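In the above example, we have a pattern that matches strings containing a run of at least two digits. The pattern is not anchored, so a longer number such as 12345 would match as well; 3d3 and 2 are rejected because they have no two adjacent digits.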
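Two points about the shorthand classes and quantifiers are easy to check in a short sketch: \w also accepts the underscore, and an unanchored \d{2,4} finds its digits inside a longer number. The sample strings are assumptions chosen for the demonstration.

```php
<?php

// \w covers letters, digits, and the underscore.
$underscore = preg_match('/^\w+$/', "snake_case");
$symbols    = preg_match('/^\w+$/', "mitt##");

// Without anchors, the pattern finds four digits inside a longer number.
$unanchored = preg_match('/\d{2,4}/', "12345");

// With anchors, the whole string must consist of 2 to 4 digits.
$anchored = preg_match('/^\d{2,4}$/', "12345");

echo "underscore: $underscore, symbols: $symbols\n";
echo "unanchored: $unanchored, anchored: $anchored\n";
```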

Email example

Next we have a practical example. We create a regex pattern for checking email addresses.
<?php

$emails = array("luke@gmail.com", "andy@yahoocom", "34234sdfa#2345", "f344@gmail.com");

# regular expression for emails
$pattern = "/^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,5}$/";

foreach ($emails as $email) {

    if (preg_match($pattern, $email)) {
        echo "$email matches \n";
    } else {
        echo "$email does not match\n";
    }
}

?>
Note that this example provides only one solution. It does not have to be the best one.
$pattern = "/^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,5}$/";
This is the pattern. The first ^ and the last $ characters are here to get an exact pattern match. No characters before or after the pattern are allowed. The email is divided into five parts.
The first part is the local part. This is usually the name of a company, an individual, or a nickname. The [a-zA-Z0-9._-]+ lists all possible characters that we can use in the local part. They can be used one or more times.
The second part is the literal @ character.
The third part is the domain part. It is usually the domain name of the email provider, like yahoo or gmail. The [a-zA-Z0-9-]+ is a character set providing all characters that can be used in the domain name. The + quantifier requires one or more of these characters.
The fourth part is the dot character. It is preceded by the escape character (\.). This is because the dot character is a metacharacter and has a special meaning. By escaping it, we get a literal dot.
The final part is the top level domain. The pattern is as follows: [a-zA-Z.]{2,5}. With this pattern, top level domains can have from 2 to 5 characters, like sk, net, or info. A longer top level domain such as travel, which has six characters, is rejected by this pattern. There is also a dot character in the class. This is because some top level domains have two parts, for example co.uk.
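To see the limits of the pattern, the following sketch runs it next to PHP's built-in filter_var() function with the FILTER_VALIDATE_EMAIL filter. This is a different technique from the one described above, shown only for comparison. The address joe@example.travel is a made-up value used to probe the length limit of the top level domain.

```php
<?php

// The pattern from the example above.
$pattern = '/^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,5}$/';

// joe@example.travel is a hypothetical address with a six-letter top level domain.
$emails = array("luke@gmail.com", "andy@yahoocom", "joe@example.travel");

$byRegex = array();
$byFilter = array();

foreach ($emails as $email) {
    $byRegex[$email] = preg_match($pattern, $email) === 1;

    // filter_var() returns the address on success and false on failure.
    $byFilter[$email] = filter_var($email, FILTER_VALIDATE_EMAIL) !== false;

    echo $email, ": regex ", var_export($byRegex[$email], true),
        ", filter_var ", var_export($byFilter[$email], true), "\n";
}
```

The regex accepts luke@gmail.com and rejects the other two addresses. The built-in filter is generally expected to be more permissive about the length of the top level domain, so it may be the more practical choice when the exact rules of the hand-written pattern are not needed.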

Recap

Finally, we provide a quick recap of the regex patterns.
Jane          the 'Jane' string
^Jane         'Jane' at the start of a string
Jane$         'Jane' at the end of a string
^Jane$        exact match of the string 'Jane'
[abc]         a, b, or c
[a-z]         any lowercase letter
[^A-Z]        any character that is not an uppercase letter
(Jane|Becky)  either 'Jane' or 'Becky'
[a-z]+        one or more lowercase letters
^[98]?$       digit 9, digit 8, or an empty string
([wx])([yz])  wy, wz, xy, or xz
[0-9]         any digit
[^A-Za-z0-9]  any symbol (not a number or a letter)
