Boosting Performance of Regular Expressions

Billy Buchanan

Senior Research Scientist

SAG Corporation

https://wbuchanan.github.io/stataConference2025

BLUF

  • Use Boost for Speed and Unicode for Robustness
  • Avoid Greedy Quantifiers When Possible
  • Avoid Capture Groups if You Do Not Need to Capture the Match
  • Factor Out Common Elements
  • Use Lazy Quantifiers When Possible
  • Use Possessive Quantifiers Carefully
  • The Order of Alternatives Matters
  • Never use nested greedy quantifiers (e.g., (xyz.*)*

Why is any of this important or useful?

  • Determine if a command includes an if or in statement -
    " i[fn]{1}\s+.*?(?=, *[a-zA-Z]|\$)"
  • Capture all text until the start of options -
    "^(.*?)(?![^()]*\)),"
  • Determine if a command includes options -
    `"(?:(?!["\(]).)*?,\s*(?![^\(]*\))(.*)\$"'
  • All are used in crossvalidate (Buchanan & Brownell, 2024)

Current State of Regex in Stata

  • regexm - Replaced in Stata 18
  • ustrregexm - Use if you need Unicode support
  • regexmatch - Use if you don't need Unicode support

Components

  • Groups - Capturing and named capturing groups and Non-capturing groups
  • Positions - Word boundaries and look ahead/behind
  • Quantifiers - Greedy, reluctant, & possessive
  • Characters - Metacharacters and String Literals
  • Types - Character sets, categories, and properties

Groups

SyntaxGroup TypeFunctionality
( ... )CaptureCan reference captured content with \#
(?<name> ... )Named CaptureAssigns names to capture groups
(?: ... )Non-CapturingPrevents capturing the group contents
(?> ... )Atomic GroupPrevents backtracking in the group

Positions

SyntaxTypeDirection/Position
(?= ... )PositiveAhead
(?<= ... )PositiveBehind
\bPositiveWord Boundary
(?! ... )NegativeAhead
(?<! ... )NegativeBehind
\BNegativeWord Boundary

Word Boundary Example: @b#cd = [@, b, #, c, d]

Quantifiers

TraitGreedyLazyPossessive
BacktrackingYesYesNo
ConsumptionAs long as possibleAs short as possibleAs much as possible
Zero or One????+
Zero or More**?*+
One or More++?++
Exactly #{#}{#}?{#}+
At Least #{#,}{#,}?{#,}+
In [#, #]{#, #}{#, #}?{#, #}+

Matching (adapted from Maddock (2013)

  1. Find the first match and return it if there is only one.
  2. If more than one match, find the longest match including all ties.
  3. If there is only one longest match, return it. Else, if there are no groups, return the first match.
  4. If there are groups, find the first match for the first group and any ties and return it if there is only one.
  5. If more than one match, find the longest match for the group and any ties, return it if there is only one.
  6. Repeat the process for each group.
  7. If there is still more than one match, return the first one found.

Backtracking

Characters

MetacharacterMeaningNotes
^Start of StringFirst Character in Regex
^Not/NegationFirst Character in Character Set
$End of StringUse the escape character to avoid eval as global
\EscapeInterpret metacharacter as literal
()GroupingSee notes on grouping above
.Any Character
|AlternatesLogical OR
?Matching[0, 1] times
*Matching[0, ∞) times
+Matching[1, ∞) times
{#}MatchingExactly # times
{#, }Matching[#, ∞) times
{#a, #b}Matching[#a, #b] times

More Characters

MetacharacterMeaningNotes
[ ]Character setTo match any of the characters in that position
[: :]Character ClassPlace term/ID between colons to use predefined sets of characters
[. .]Collating ElementsFor digraphs or other locale dependent collating elements (e.g., named unicode characters)
[= =]Equivalence ClassTo reference any character or collating element with the same primary sort key as the collating element referenced
\p{ ... }Positive Character PropertyMatches any character with the specified property
\P{ ... }Negated Character PropertyMatches any character that does not have the specified property

Types

Character sets

  • [ ... ] - defines the character set
  • All members of the set are equivalent for matching purposes
  • DO NOT USE THE PIPETTE CHARACTER '|' WITHOUT AN ESCAPE
  • Character sets match one (1) character in the regex and one character only. Quantifiers modify this.
  • May contain individual characters, character classes, properties (general and unicode), ranges, unicode code points, and unicode blocks

Types

Character categories

ClassDefinitionAliases
alnumAny alphabetic character or number
alphaAny alphabetic character
blankWhite space excluding line separators
cntrlControl characters (typically non-printable)
digitNumbersd
graphAny graphical character
lowerAny lower case characterl
printAny printable character
punctAny punctuation character
spaceAny white space characters
unicodeAny character with a code point > 255
upperAny upper case characteru
wordAny word character (alnum + _)w
xdigitAny hexadecimal digit character

Note: These are always supported when using Boost-based regex.

Types

Properties

  • Will only apply to ustrregex* commands.
  • See documentation in ICU library for unicode properties
  • Properties can include scripts/alphabetic systems, emojis, and letters with diacritics, among many others
  • This can be particularly useful when working with string data from multiple countries and social media 😉

How can you boost the performance of your regexs?

Use Boost for Speed* and Unicode for Robustness

. #d ;
. loc word `: word 5 of `t''; timer clear; timer on 1;
. count if ustrregexm(lyrics, "\b(`word')\s\1\b");
884
. timer off 1; timer on 2;
. count if regexmatch(lyrics, "\b(`word')\s\1\b");
884
. timer off 2; timer on 3;
. count if ustrregexm(lyrics, "\b(\w+)\s\1\b");
905,421
. timer off 3; timer on 4;
. count if regexmatch(lyrics, "\b(\w+)\s\1\b");
887,740
. timer off 4; timer list;
   1:     19.20 /        1 =      19.1990
   2:     16.61 /        1 =      16.6140
   3:    252.99 /        1 =     252.9850
   4:    247.90 /        1 =     247.8950
. count if ustrregexm(lyrics, "\P{ASCII}")
  2,745,169

Avoid Greedy Quantifiers When Possible

  • Computational cost increases as a function of n, length of the strings, and the amount of backtracking needed.
  • As an alternative, when * or + appears in your regex, you could consider using:
    1. Lazy/Non-Greedy or Possessive Quantifiers
    2. Multiple Regular Expressions

Avoid Capture Groups if You Do Not Need to Capture the Match

					. timer clear; version 18; timer on 1;
. count if ustrregexm(lyrics, `"(`: subinstr loc t " " ")|(", all')"');
1,074,729
. timer off 1; timer on 2;
. count if regexmatch(lyrics, `"(`: subinstr loc t " " ")|(", all')"');
1,074,729
. timer off 2; version 16; timer on 3;
. count if regexm(lyrics, `"(`: subinstr loc t " " ")|(", all')"');
1,074,729
. timer off 3; version 18; timer on 4;
. count if ustrregexm(lyrics, `"(?:`: subinstr loc t " " ")|(?:", all')"');
1,074,729
. timer off 4; timer on 5;
. count if regexmatch(lyrics, `"(?:`: subinstr loc t " " ")|(?:", all')"');
1,074,729
. timer off 5; version 16; timer on 6;
. count if regexm(lyrics, `"(?:`: subinstr loc t " " ")|(?:", all')"');
regexp: ?+* follows nothing
. timer off 6; version 18; timer on 7;
. count if ustrregexm(lyrics, `"`: subinstr loc t " " "|", all'"');
1,074,729
. timer off 7; timer on 8;
. count if regexmatch(lyrics, `"`: subinstr loc t " " "|", all'"');
1,074,729
. timer off 8; version 16; timer on 9;
. count if regexm(lyrics, `"`: subinstr loc t " " "|", all'"');
1,074,729
. timer off 9; version 18; timer list;
   1:    113.23 /        1 =     113.2280
   2:     90.53 /        1 =      90.5250
   3:    325.08 /        1 =     325.0800
   4:     79.10 /        1 =      79.0960
   5:     81.22 /        1 =      81.2160
   6:      1.67 /        1 =       1.6660
   7:     78.57 /        1 =      78.5720
   8:     75.17 /        1 =      75.1670
   9:    173.61 /        1 =     173.6060

					
				

Factor out common elements

. timer clear; timer on 1;
. count if ustrregexm(lyrics, "(?:losing|loving|looser|lost|lottery)");
473,024
. timer off 1; timer on 2;
. count if regexmatch(lyrics, "(?:losing|loving|looser|lost|lottery)");
473,024
. timer off 2; timer on 3;
. count if ustrregexm(lyrics, "lo(?:sing|ving|oser|st|ttery)");
473,024
. timer off 3; timer on 4;
. count if regexmatch(lyrics, "lo(?:sing|ving|oser|st|ttery)");
473,024
. timer off 4; timer list;
   1:     25.25 /        1 =      25.2450
   2:     43.42 /        1 =      43.4170
   3:     20.40 /        1 =      20.4030
   4:     25.85 /        1 =      25.8520
				

Use Lazy Quantifiers When Possible

. loc w1 $bandwidthex ; loc w2 `: word 5 of `t'' ;
. timer clear; timer on 1;
. count if ustrregexm(lyrics, "\b(?:`w1'\W+(?:\w+\W+){0,10}`w2'|`w2'\W+(?:\w+\W+){0,10}`w1')\b");
235,289
. timer off 1; timer on 2;
. count if regexmatch(lyrics, "\b(?:`w1'\W+(?:\w+\W+){0,10}`w2'|`w2'\W+(?:\w+\W+){0,10}`w1')\b");
235,702
. timer off 2; timer on 3;
. count if ustrregexm(lyrics, "\b(?:`w1'\W+(?:\w+\W+){0,10}?`w2'|`w2'\W+(?:\w+\W+){0,10}?`w1')\b");
235,289
. timer off 3; timer on 4;
. count if regexmatch(lyrics, "\b(?:`w1'\W+(?:\w+\W+){0,10}?`w2'|`w2'\W+(?:\w+\W+){0,10}?`w1')\b");
235,702
. timer off 4; timer list;
   1:     48.15 /        1 =      48.1450
   2:     99.40 /        1 =      99.3960
   3:     44.17 /        1 =      44.1680
   4:     99.23 /        1 =      99.2300
				

Use Possessive Quantifiers Carefully

python
# import a library to generate random objects
from faker import Faker
# This will vary the number of levels of domains
import random
# Import the Stata API module
from sfi import Data
# Get the number of observations in the dataset
obs = Data.getObsTotal()
# Initialize the Faker class object
fake = Faker()
# Set the pseudorandom object generator seed
fake.seed(7779311)
# Set the pseudorandom number generator seed
random.seed(8675309)
# Pre-allocate a list that will store the email addresses
emails = [None] * obs
# Loop over observation indices
for i in range(obs):
    # Select a random number of domain & subdomain levels to test
    levels = random.randint(1, 5)
    # Generate the email address and store it in the list object
    emails[i] = fake.user_name() + '@' + fake.domain_name(levels)

# Add a variable to store the email addresses
Data.addVarStrL('email')

# Store the email addresses
Data.store('email', None, emails)

# End the python interpreter
end

// Compress the email variable
compress email

// Add the random email addresses to the end of the lyrics
g strL elyrics = lyrics + " " + email

// Get the storage type for email
loc t : type email

#d ;
timer clear; timer on 1;
g `t' etest1 = ustrregexs(0) if ustrregexm(elyrics,
"\b[\w\p{P}]+?@[[:alnum:]][-[:alnum:]]{1,62}\.(?:[[:alnum:]][-[:alnum:]]{1,62}\.?){1,5}\b");
timer off 1; timer on 2;
g `t' etest2 = regexcapture(0) if regexmatch(elyrics,
"\b[\w\p{P}]+?@[[:alnum:]][-[:alnum:]]{1,62}\.(?:[[:alnum:]][-[:alnum:]]{1,62}\.?){1,5}\b");
timer off 2; timer on 3;
g `t' etest3 = ustrregexs(0) if ustrregexm(elyrics,
"\b[\w\p{P}]++@[[:alnum:]][-[:alnum:]]{1,62}\.(?:[[:alnum:]][-[:alnum:]]{1,62}\.?){1,5}+\b");
timer off 3; timer on 4;
g `t' etest4 = regexcapture(0) if regexmatch(elyrics,
"\b[\w\p{P}]++@[[:alnum:]][-[:alnum:]]{1,62}\.(?:[[:alnum:]][-[:alnum:]]{1,62}\.?){1,5}+\b");
timer off 4; timer list;
   1:    277.55 /        1 =     277.5490
   2:    201.91 /        1 =     201.9100
   3:    314.44 /        1 =     314.4410
   4:    164.02 /        1 =     164.0230

. cap noi assert etest1 == email
1,168 contradictions in 5,134,856 observations
assertion is false
. cap noi assert etest2 == email
428 contradictions in 5,134,856 observations
assertion is false
. cap noi assert etest3 == email
5,134,856 contradictions in 5,134,856 observations
assertion is false
. cap noi assert etest4 == email
379 contradictions in 5,134,856 observations
assertion is false

						

Order Alternatives in the Most likely Order in the data

					. timer clear
. timer on 1;
. qui: g ustralt1 = ustrregexs(1) if _n == 105519 &				 ///
		ustrregexm(lyrics, "(`: word 7 of `t''|`: word 4 of `t'')");
. timer off 1; timer on 2;
. qui: g ustralt2 = ustrregexs(1) if _n == 105519 &				 ///
		ustrregexm(lyrics, "(`: word 4 of `t''|`: word 7 of `t'')");
. timer off 2; timer on 3;
. qui: g bstalt1 = regexcapture(1) if _n == 105519 &///
		regexmatch(lyrics, "(`: word 7 of `t''|`: word 4 of `t'')");
. timer off 3; timer on 4;
. qui: g bstalt2 = regexcapture(1) if _n == 105519 &///
		regexmatch(lyrics, "(`: word 4 of `t''|`: word 7 of `t'')");
. timer off 4;
. di strlen(lyrics[105519]);
8714
. if ustralt1[105519] == "`: word 7 of `t''" di "word 7 matched";
. if ustralt1[105519] == "`: word 4 of `t''" di "word 4 matched";
word 4 matched
. if ustralt2[105519] == "`: word 7 of `t''" di "word 7 matched";
. if ustralt2[105519] == "`: word 4 of `t''" di "word 4 matched";
word 4 matched
. if bstalt1[105519] == "`: word 7 of `t''" di "word 7 matched";
. if bstalt1[105519] == "`: word 4 of `t''" di "word 4 matched";
word 4 matched
. if bstalt2[105519] == "`: word 7 of `t''" di "word 7 matched";
. if bstalt2[105519] == "`: word 4 of `t''" di "word 4 matched";
word 4 matched
. timer list;
   1:     25.37 /        1 =      25.3660
   2:     25.33 /        1 =      25.3320
   3:     22.85 /        1 =      22.8500
   4:     22.73 /        1 =      22.7270
					
				

If all else fails...

Try one of the following:

Example Data

https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information/data

References

Friedl, J. E. F. (2006). Mastering regular expressions. 3rd Ed. O'Reilly

Goyvaerts, J., & Levithan, S. (2012). Regular expressions cookbook. 2nd Ed.
O'Reilly

Maddock, J. (2013). The leftmost longest rule.

Other Resources

Regex 101 Debugger