Tuesday, 30 July 2019

How the REGEXP_INSTR() Function Works in MySQL

In MySQL, the REGEXP_INSTR() function returns the starting index of a substring that matches the regular expression pattern.
The index starts at 1. If there’s no match, the result is 0.

Syntax

The syntax goes like this:
REGEXP_INSTR(expr, pat[, pos[, occurrence[, return_option[, match_type]]]])
Where expr is the input string and pat is the regular expression pattern for the substring.
The optional pos argument allows you to specify a position within the string to start the search. If omitted, it starts at position 1.
The optional occurrence argument allows you to specify which occurrence of the match to search for. If omitted, the first occurrence is used (occurrence 1).
The optional return_option argument allows you to specify which type of position to return. If you use 0, it will return the position of the first character in the matching substring. If you use 1 it returns the position of the first character following the matching substring. If omitted, the value is 0.
The optional match_type argument allows you to refine the regular expression. For example, you can use this argument to specify case-sensitive matching or not.

Example 1 – Basic Usage

Here’s a basic example:
SELECT REGEXP_INSTR('Cat', 'at') Result;
Result:
+--------+
| Result |
+--------+
|      2 |
+--------+
In this case there’s a match, and the substring starts at position 2.

Example 2 – No Match

Here’s an example where there’s no match:
SELECT REGEXP_INSTR('Cat', '^at') Result;
Result:
+--------+
| Result |
+--------+
|      0 |
+--------+
There’s no match, so the result is 0. There’s no match because I specified that the string must start with the substring.
Let’s change it so that it does start with that substring:
SELECT REGEXP_INSTR('at', '^at') Result;
Result:
+--------+
| Result |
+--------+
|      1 |
+--------+

Example 3 – The pos Argument

Here’s an example of specifying a starting position:
SELECT REGEXP_INSTR('Cat Cat', 'Cat', 2) Result;
Result:
+--------+
| Result |
+--------+
|      5 |
+--------+
So we get the index of the second occurrence.
Note that the index still starts counting from position 1 regardless of where you specify the starting position.
The following example demonstrates this more clearly:
SELECT 
  REGEXP_INSTR('Cat Cat', 'Cat', 2) AS 'Pos 2', 
  REGEXP_INSTR('Cat Cat', 'Cat', 3) AS 'Pos 3', 
  REGEXP_INSTR('Cat Cat', 'Cat', 5) AS 'Pos 5';
Result:
+-------+-------+-------+
| Pos 2 | Pos 3 | Pos 5 |
+-------+-------+-------+
|     5 |     5 |     5 |
+-------+-------+-------+
Of course, depending on your regex pattern, this can return the index of completely different substrings. Example:
SELECT 
  REGEXP_INSTR('Cat City is SO Cute!', 'C.t', 1) 'Pos 1',
  REGEXP_INSTR('Cat City is SO Cute!', 'C.t', 2) 'Pos 2',
  REGEXP_INSTR('Cat City is SO Cute!', 'C.t', 6) 'Pos 6';
Result:
+-------+-------+-------+
| Pos 1 | Pos 2 | Pos 6 |
+-------+-------+-------+
|     1 |     5 |    16 |
+-------+-------+-------+
We can check the substrings with the REGEXP_SUBSTR() function:
SELECT 
  REGEXP_SUBSTR('Cat City is SO Cute!', 'C.t', 1) 'Pos 1',
  REGEXP_SUBSTR('Cat City is SO Cute!', 'C.t', 2) 'Pos 2',
  REGEXP_SUBSTR('Cat City is SO Cute!', 'C.t', 6) 'Pos 6';
Result:
+-------+-------+-------+
| Pos 1 | Pos 2 | Pos 6 |
+-------+-------+-------+
| Cat   | Cit   | Cut   |
+-------+-------+-------+

Example 4 – The occurrence Argument

Here’s an example of using the occurrence argument. In all cases, we start at position 1:
SELECT 
  REGEXP_INSTR('Cat City is SO Cute!', 'C.t', 1, 1) 'Occurrence 1',
  REGEXP_INSTR('Cat City is SO Cute!', 'C.t', 1, 2) 'Occurrence 2',
  REGEXP_INSTR('Cat City is SO Cute!', 'C.t', 1, 3) 'Occurrence 3';
Result:
+--------------+--------------+--------------+
| Occurrence 1 | Occurrence 2 | Occurrence 3 |
+--------------+--------------+--------------+
|            1 |            5 |           16 |
+--------------+--------------+--------------+
However, if we start at a different position, the result is different:
SELECT 
  REGEXP_INSTR('Cat City is SO Cute!', 'C.t', 2, 1) 'Occurrence 1',
  REGEXP_INSTR('Cat City is SO Cute!', 'C.t', 2, 2) 'Occurrence 2',
  REGEXP_INSTR('Cat City is SO Cute!', 'C.t', 2, 3) 'Occurrence 3';
Result:
+--------------+--------------+--------------+
| Occurrence 1 | Occurrence 2 | Occurrence 3 |
+--------------+--------------+--------------+
|            5 |           16 |            0 |
+--------------+--------------+--------------+
This happened because our starting position came after the first occurrence had started. Therefore, occurrence 2 became occurrence 1, and occurrence 3 became occurrence 2. And because there were no more occurrences, the result of occurrence 3 was negative (i.e. there was no occurrence 3).

Example 5 – The return_option Argument

Here’s an example of using the return_option argument:
SELECT 
  REGEXP_INSTR('Cat City is SO Cute!', 'C.t', 1, 1, 0) 'Option 0',
  REGEXP_INSTR('Cat City is SO Cute!', 'C.t', 1, 1, 1) 'Option 1';
Result:
+----------+----------+
| Option 0 | Option 1 |
+----------+----------+
|        1 |        4 |
+----------+----------+
Option 0 returned the matched substring’s first character. Option 1 returned the position following the matched substring.
Here’s what it looks like if we apply it to the previous example:
SELECT 
  REGEXP_INSTR('Cat City is SO Cute!', 'C.t', 1, 1, 0) 'Occurrence 1',
  REGEXP_INSTR('Cat City is SO Cute!', 'C.t', 1, 2, 0) 'Occurrence 2',
  REGEXP_INSTR('Cat City is SO Cute!', 'C.t', 1, 3, 0) 'Occurrence 3'
UNION ALL
SELECT
  REGEXP_INSTR('Cat City is SO Cute!', 'C.t', 1, 1, 1),
  REGEXP_INSTR('Cat City is SO Cute!', 'C.t', 1, 2, 1),
  REGEXP_INSTR('Cat City is SO Cute!', 'C.t', 1, 3, 1);

Result:
+--------------+--------------+--------------+
| Occurrence 1 | Occurrence 2 | Occurrence 3 |
+--------------+--------------+--------------+
|            1 |            5 |           16 |
|            4 |            8 |           19 |
+--------------+--------------+--------------+
In this case we did one set using option 0, and another using option 1, then joined them together using UNION ALL.

Example 6 – The match_type Argument

You can provide an additional argument to determine the match type. This allows you to specify things like whether or not the match is case-sensitive, whether or not to include line terminators, etc.
Here’s an example of specifying a case-sensitive match and a case-insensitive match:
SELECT 
  REGEXP_INSTR('Cat City is SO Cute!', 'c.t', 1, 1, 0, 'c') 'Case-Sensitive',
  REGEXP_INSTR('Cat City is SO Cute!', 'c.t', 1, 1, 0, 'i') 'Case-Insensitive';
Result:
+----------------+------------------+
| Case-Sensitive | Case-Insensitive |
+----------------+------------------+
|              0 |                1 |
+----------------+------------------+
The match_type argument can contain the following characters:
c
Case sensitive matching.
i
Case insensitive matching.
m
Multiple-line mode. Recognize line terminators within the string. The default behavior is to match line terminators only at the start and end of the string expression.
n
The . character matches line terminators. The default is for . matching to stop at the end of a line.
u
Unix-only line endings. Only the newline character is recognized as a line ending by the ., ^, and $ match operators.

0 comments:

Post a Comment