Tuesday 14 August 2018

Match Special Letters with PHP Regular Expressions

Regular expressions come with all sorts of peculiarities, one of which I recently ran into while writing a regex for PHP's preg_match. I was trying to parse strings with the format "Real Name (:username)" when I hit a problem I would see a lot at Mozilla: my regular expression wasn't properly catching "special" or "international" letters, like à, é, ü, and dozens of others.


My regular expression was using A-Za-z in the real name matching piece of the regex, which I assumed would match special letters, but it did not; those ranges only cover the unaccented ASCII letters:

preg_match( "/([A-Za-z -]+)?\s?\[?\(?:([A-Za-z0-9\-\_]+)\)?\]?/", "Yep Nopé [:ynope]", $matches); 
 // 0 => ' [:ynope]', 1 => ' ', 2 => 'ynope' (the real name is lost)
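Why the ASCII ranges fall short can be seen in a small sketch. It assumes the input is UTF-8, where "é" is stored as the two bytes 0xC3 0xA9; the variable names are illustrative only. Without the u modifier, preg_match compares byte by byte, and neither of those bytes falls inside A-Za-z.

```php
<?php
// "Nopé" written with explicit UTF-8 bytes so the example does not
// depend on the encoding of this source file (assumption: UTF-8 input).
$name = "Nop\xC3\xA9";

// strlen() counts bytes, not characters: N, o, p, plus two bytes for "é".
$byteLength = strlen($name);

// No u modifier: the class is applied to single bytes, so the match
// stops right before the first byte of "é".
preg_match('/^[A-Za-z]+/', $name, $asciiMatch);

echo $byteLength, "\n";    // 5
echo $asciiMatch[0], "\n"; // Nop
```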


To match international letters, I needed to update my regular expression in two ways:
Change A-Za-z to \pL (the Unicode property that matches any letter) within the matching piece
Add the u modifier, which makes the pattern and the subject string be treated as UTF-8

The updated regex would be:
preg_match( "/([\pL -]+)?\s?\[?\(?:([\pL0-9\-\_]+)\)?\]?/u", "Yep Nopé [:ynope]", $matches); 
 // 0 => 'Yep Nopé [:ynope]', 1 => 'Yep Nopé ' (with a trailing space), 2 => 'ynope'
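For reuse, the updated pattern could be wrapped in a small helper. This is only a sketch: the function name parseNameAndUsername and the returned array keys are hypothetical and not part of the original code, and the type declarations assume PHP 7.1 or later. Because the name group allows spaces, it can capture the space before the bracket, so the helper trims it.

```php
<?php
// Hypothetical helper (not from the original post): applies the updated
// pattern and returns the two captured pieces, or null when nothing matches.
function parseNameAndUsername(string $input): ?array
{
    $pattern = '/([\pL -]+)?\s?\[?\(?:([\pL0-9\-\_]+)\)?\]?/u';

    // preg_match() returns 1 on a match, 0 on no match, false on error
    // (for example, when the subject is not valid UTF-8 under the u modifier).
    if (preg_match($pattern, $input, $matches) !== 1) {
        return null;
    }

    return [
        'name'     => trim($matches[1]), // drop the captured trailing space
        'username' => $matches[2],
    ];
}

// "Yep Nopé [:ynope]" with "é" spelled out as UTF-8 bytes.
$parsed = parseNameAndUsername("Yep Nop\xC3\xA9 [:ynope]");
print_r($parsed);
```

The same helper should also accept the parenthesized form, such as "Real Name (:username)", since the pattern treats both bracket styles as optional.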


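You can see my simple test bed here. If you're afraid that other characters might seep in, or don't trust \pL, you could list every special letter manually (e.g. [A-Za-zàáâä....]). Keep the u modifier in that case too; without it, each multi-byte letter in the list is treated as separate bytes rather than as one character.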
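As a rough comparison of the two approaches, the sketch below checks one name against a manual list and against \pL. The list is a made-up, deliberately short example (à, á, â, ä, é), written with \x{...} escapes so it does not depend on the file's encoding; a real list would need many more letters.

```php
<?php
// Hypothetical allow-list: ASCII letters plus à, á, â, ä and é only.
// With the u modifier, \x{e9} means the code point U+00E9 ("é").
$allowList = '/^[A-Za-z\x{e0}\x{e1}\x{e2}\x{e4}\x{e9} -]+$/u';

// "é" (bytes C3 A9) is on the list, "ñ" (bytes C3 B1) is not.
$listed   = preg_match($allowList, "Yep Nop\xC3\xA9");
$unlisted = preg_match($allowList, "Yep Nop\xC3\xB1");

// \pL accepts any Unicode letter, so "ñ" passes here.
$anyLetter = preg_match('/^[\pL -]+$/u', "Yep Nop\xC3\xB1");

var_dump($listed, $unlisted, $anyLetter); // int(1), int(0), int(1)
```

The manual list is stricter but easy to leave incomplete, which is the trade-off against \pL.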

One of the nice parts of working at a truly global organization like Mozilla is that I'm exposed to many edge cases; in this case, a few special letters!
