Thursday, 8 November 2018

How to do a regular expression replace in MySQL?

I have a table with ~500k rows; varchar(255) UTF8 column filename contains a file name;
I'm trying to strip out various strange characters out of the filename - thought I'd use a character class: [^a-zA-Z0-9()_ .\-]
Now, is there a function in MySQL that lets you replace through a regular expression? I'm looking for a similar functionality to REPLACE() function - simplified example follows:
SELECT REPLACE('stackowerflow', 'ower', 'over');

Output: ""

/* does something like this exist? */
SELECT X_REG_REPLACE('','/[A-Zf]/','-'); 

Output: "-tackover-low"
I know about REGEXP/RLIKE, but those only check if there is a match, not what the match is.
(I could do a "SELECT pkey_id,filename FROM foo WHERE filename RLIKE '[^a-zA-Z0-9()_ .\-]'" from a PHP script, do a preg_replace and then "UPDATE foo ... WHERE pkey_id=...", but that looks like a last-resort slow & ugly hack)

 Answers


MySQL 8.0+ you could use natively REGEXP_REPLACE.
REGEXP_REPLACE(expr, pat, repl[, pos[, occurrence[, match_type]]])
Replaces occurrences in the string expr that match the regular expression specified by the pattern pat with the replacement string repl, and returns the resulting string. If expr, pat, or repl is NULL, the return value is NULL.
Previously, MySQL used the Henry Spencer regular expression library to support regular expression operators (REGEXP, RLIKE).
Regular expression support has been reimplemented using International Components for Unicode (ICU), which provides full Unicode support and is multibyte safe. The REGEXP_LIKE() function performs regular expression matching in the manner of the REGEXP and RLIKE operators, which now are synonyms for that function. In addition, the REGEXP_INSTR(), REGEXP_REPLACE(), and REGEXP_SUBSTR() functions are available to find match positions and perform substring substitution and extraction, respectively.



Use MariaDB instead. It has a function
REGEXP_REPLACE(col, regexp, replace)
Note that you can use regexp grouping as well (I found that very useful):
SELECT REGEXP_REPLACE("", "(stack)(over)(flow)", '\\2 - \\1 - \\3')
returns
over - stack - flow



I recently wrote a MySQL function to replace strings using regular expressions. You could find my post at the following location:
Here is the function code:
DELIMITER $$

CREATE FUNCTION  `regex_replace`(pattern VARCHAR(1000),replacement VARCHAR(1000),original VARCHAR(1000))
RETURNS VARCHAR(1000)
DETERMINISTIC
BEGIN 
 DECLARE temp VARCHAR(1000); 
 DECLARE ch VARCHAR(1); 
 DECLARE i INT;
 SET i = 1;
 SET temp = '';
 IF original REGEXP pattern THEN 
  loop_label: LOOP 
   IF i>CHAR_LENGTH(original) THEN
    LEAVE loop_label;  
   END IF;
   SET ch = SUBSTRING(original,i,1);
   IF NOT ch REGEXP pattern THEN
    SET temp = CONCAT(temp,ch);
   ELSE
    SET temp = CONCAT(temp,replacement);
   END IF;
   SET i=i+1;
  END LOOP;
 ELSE
  SET temp = original;
 END IF;
 RETURN temp;
END$$

DELIMITER ;
Example execution:
mysql> select regex_replace('[^a-zA-Z0-9\-]','','2my test3_text-to. check \\ my- sql (regular) ,expressions ._,');



I'm happy to report that since this question was asked, now there is a satisfactory answer! Take a look at this terrific package:
Sample SQL:
SELECT PREG_REPLACE('/(.*?)(fox)/' , 'dog' , 'the quick brown fox' ) AS demo;




Through all necessary substrings rather than just testing single characters:
-- ------------------------------------------------------------------------------------
-- USAGE
-- ------------------------------------------------------------------------------------
-- SELECT reg_replace(<subject>,
--                    <pattern>,
--                    <replacement>,
--                    <greedy>,
--                    <minMatchLen>,
--                    <maxMatchLen>);
-- where:
-- <subject> is the string to look in for doing the replacements
-- <pattern> is the regular expression to match against
-- <replacement> is the replacement string
-- <greedy> is TRUE for greedy matching or FALSE for non-greedy matching
-- <minMatchLen> specifies the minimum match length
-- <maxMatchLen> specifies the maximum match length
-- (minMatchLen and maxMatchLen are used to improve efficiency but are
--  optional and can be set to 0 or NULL if not known/required)
-- Example:
-- SELECT reg_replace(txt, '^[Tt][^ ]* ', 'a', TRUE, 2, 0) FROM tbl;
DROP FUNCTION IF EXISTS reg_replace;
DELIMITER //
CREATE FUNCTION reg_replace(subject VARCHAR(21845), pattern VARCHAR(21845),
  replacement VARCHAR(21845), greedy BOOLEAN, minMatchLen INT, maxMatchLen INT)
RETURNS VARCHAR(21845) DETERMINISTIC BEGIN 
  DECLARE result, subStr, usePattern VARCHAR(21845); 
  DECLARE startPos, prevStartPos, startInc, len, lenInc INT;
  IF subject REGEXP pattern THEN
    SET result = '';
    -- Sanitize input parameter values
    SET minMatchLen = IF(minMatchLen < 1, 1, minMatchLen);
    SET maxMatchLen = IF(maxMatchLen < 1 OR maxMatchLen > CHAR_LENGTH(subject),
                         CHAR_LENGTH(subject), maxMatchLen);
    -- Set the pattern to use to match an entire string rather than part of a string
    SET usePattern = IF (LEFT(pattern, 1) = '^', pattern, CONCAT('^', pattern));
    SET usePattern = IF (RIGHT(pattern, 1) = '$', usePattern, CONCAT(usePattern, '$'));
    -- Set start position to 1 if pattern starts with ^ or doesn't end with $.
    IF LEFT(pattern, 1) = '^' OR RIGHT(pattern, 1) <> '$' THEN
      SET startPos = 1, startInc = 1;
    -- Otherwise (i.e. pattern ends with $ but doesn't start with ^): Set start pos
    -- to the min or max match length from the end (depending on "greedy" flag).
    ELSEIF greedy THEN
      SET startPos = CHAR_LENGTH(subject) - maxMatchLen + 1, startInc = 1;
    ELSE
      SET startPos = CHAR_LENGTH(subject) - minMatchLen + 1, startInc = -1;
    END IF;
    WHILE startPos >= 1 AND startPos <= CHAR_LENGTH(subject)
      AND startPos + minMatchLen - 1 <= CHAR_LENGTH(subject)
      AND !(LEFT(pattern, 1) = '^' AND startPos <> 1)
      AND !(RIGHT(pattern, 1) = '$'
            AND startPos + maxMatchLen - 1 < CHAR_LENGTH(subject)) DO
      -- Set start length to maximum if matching greedily or pattern ends with $.
      -- Otherwise set starting length to the minimum match length.
      IF greedy OR RIGHT(pattern, 1) = '$' THEN
        SET len = LEAST(CHAR_LENGTH(subject) - startPos + 1, maxMatchLen), lenInc = -1;
      ELSE
        SET len = minMatchLen, lenInc = 1;
      END IF;
      SET prevStartPos = startPos;
      lenLoop: WHILE len >= 1 AND len <= maxMatchLen
                 AND startPos + len - 1 <= CHAR_LENGTH(subject)
                 AND !(RIGHT(pattern, 1) = '$' 
                       AND startPos + len - 1 <> CHAR_LENGTH(subject)) DO
        SET subStr = SUBSTRING(subject, startPos, len);
        IF subStr REGEXP usePattern THEN
          SET result = IF(startInc = 1,
                          CONCAT(result, replacement), CONCAT(replacement, result));
          SET startPos = startPos + startInc * len;
          LEAVE lenLoop;
        END IF;
        SET len = len + lenInc;
      END WHILE;
      IF (startPos = prevStartPos) THEN
        SET result = IF(startInc = 1, CONCAT(result, SUBSTRING(subject, startPos, 1)),
                        CONCAT(SUBSTRING(subject, startPos, 1), result));
        SET startPos = startPos + startInc;
      END IF;
    END WHILE;
    IF startInc = 1 AND startPos <= CHAR_LENGTH(subject) THEN
      SET result = CONCAT(result, RIGHT(subject, CHAR_LENGTH(subject) + 1 - startPos));
    ELSEIF startInc = -1 AND startPos >= 1 THEN
      SET result = CONCAT(LEFT(subject, startPos), result);
    END IF;
  ELSE
    SET result = subject;
  END IF;
  RETURN result;
END//
DELIMITER ;
Demo
Limitations
  1. This method is of course going to take a while when the subject string is large. Update: Have now added minimum and maximum match length parameters for improved efficiency when these are known (zero = unknown/unlimited).
  2. It won't allow substitution of backreferences (e.g. \1\2 etc.) to replace capturing groups. If this functionality is needed, please see this answer which attempts to provide a workaround by updating the function to allow a secondary find and replace within each found match (at the expense of increased complexity).
  3. If ^and/or $ is used in the pattern, they must be at the very start and very end respectively - e.g. patterns such as (^start|end$) are not supported.
  4. There is a "greedy" flag to specify whether the overall matching should be greedy or non-greedy. Combining greedy and lazy matching within a single regular expression (e.g. a.*?b.*) is not supported.

0 comments:

Post a Comment