Boosting Performance of Regular Expressions

Billy Buchanan

Senior Research Scientist

SAG Corporation

https://wbuchanan.github.io/stataConference2025

BLUF

Use Boost for Speed and Unicode for Robustness
Avoid Greedy Quantifiers When Possible
Avoid Capture Groups if You Do Not Need to Capture the Match
Factor Out Common Elements
Use Lazy Quantifiers When Possible
Use Possessive Quantifiers Carefully
The Order of Alternatives Matters
Never use nested greedy quantifiers (e.g., (xyz.*)*

Why is any of this important or useful?

Determine if a command includes an if or in statement -
```
" i[fn]{1}\s+.*?(?=, *[a-zA-Z]|\$)"
```
Capture all text until the start of options -
```
"^(.*?)(?![^()]*\)),"
```

Determine if a command includes options -

`"(?:(?!["\(]).)*?,\s*(?![^\(]*\))(.*)\$"'

All are used in crossvalidate (Buchanan & Brownell, 2024)

Current State of Regex in Stata

regexm - Replaced in Stata 18
ustrregexm - Use if you need Unicode support
regexmatch - Use if you don't need Unicode support

Components

Groups - Capturing and named capturing groups and Non-capturing groups
Positions - Word boundaries and look ahead/behind
Quantifiers - Greedy, reluctant, & possessive
Characters - Metacharacters and String Literals
Types - Character sets, categories, and properties

I'm primarily going to focus on the Boost library, so some implementation specifics may be different using ICU.
I've included slides with information about characters and types in the deck for anyone interested.
However, for the sake of brevity, I am going to focus on groups, quantifiers, and positions since they affect performance.
Groups should generally be used when you need to capture specific parts of the string.
Quantifiers provide the way to define the number of times a patter repeats, if at all, as well as how the matching is performed.
Positions are used to match locations in a string.
Characters are individual characters that are either literal or convey additional meaning
Types are ways of referencing groups of characters instead of specifying the characters individually.

Groups

Syntax	Group Type	Functionality
( ... )	Capture	Can reference captured content with \#
(?<name> ... )	Named Capture	Assigns names to capture groups
(?: ... )	Non-Capturing	Prevents capturing the group contents
(?> ... )	Atomic Group	Prevents backtracking in the group

Positions

Syntax	Type	Direction/Position
(?= ... )	Positive	Ahead
(?<= ... )	Positive	Behind
\b	Positive	Word Boundary
(?! ... )	Negative	Ahead
(?<! ... )	Negative	Behind
\B	Negative	Word Boundary

Word Boundary Example: @b#cd = [@, b, #, c, d]

Quantifiers

Trait	Greedy	Lazy	Possessive
Backtracking	Yes	Yes	No
Consumption	As long as possible	As short as possible	As much as possible
Zero or One	?	??	?+
Zero or More	*	*?	*+
One or More	+	+?	++
Exactly #	{#}	{#}?	{#}+
At Least #	{#,}	{#,}?	{#,}+
In [#, #]	{#, #}	{#, #}?	{#, #}+

Matching (adapted from Maddock (2013)

Find the first match and return it if there is only one.
If more than one match, find the longest match including all ties.
If there is only one longest match, return it. Else, if there are no groups, return the first match.
If there are groups, find the first match for the first group and any ties and return it if there is only one.
If more than one match, find the longest match for the group and any ties, return it if there is only one.
Repeat the process for each group.
If there is still more than one match, return the first one found.

Backtracking

Characters

Metacharacter	Meaning	Notes
^	Start of String	First Character in Regex
^	Not/Negation	First Character in Character Set
$	End of String	Use the escape character to avoid eval as global
\	Escape	Interpret metacharacter as literal
()	Grouping	See notes on grouping above
.	Any Character
\|	Alternates	Logical OR
?	Matching	[0, 1] times
*	Matching	[0, ∞) times
+	Matching	[1, ∞) times
{#}	Matching	Exactly # times
{#, }	Matching	[#, ∞) times
{#a, #b}	Matching	[#a, #b] times

More Characters

Metacharacter	Meaning	Notes
[ ]	Character set	To match any of the characters in that position
[: :]	Character Class	Place term/ID between colons to use predefined sets of characters
[. .]	Collating Elements	For digraphs or other locale dependent collating elements (e.g., named unicode characters)
[= =]	Equivalence Class	To reference any character or collating element with the same primary sort key as the collating element referenced
\p{ ... }	Positive Character Property	Matches any character with the specified property
\P{ ... }	Negated Character Property	Matches any character that does not have the specified property

Types

Character sets

[ ... ] - defines the character set
All members of the set are equivalent for matching purposes
DO NOT USE THE PIPETTE CHARACTER '|' WITHOUT AN ESCAPE
Character sets match one (1) character in the regex and one character only. Quantifiers modify this.
May contain individual characters, character classes, properties (general and unicode), ranges, unicode code points, and unicode blocks

Types

Character categories

Class	Definition	Aliases
alnum	Any alphabetic character or number
alpha	Any alphabetic character
blank	White space excluding line separators
cntrl	Control characters (typically non-printable)
digit	Numbers	d
graph	Any graphical character
lower	Any lower case character	l
print	Any printable character
punct	Any punctuation character
space	Any white space character	s
unicode	Any character with a code point > 255
upper	Any upper case character	u
word	Any word character (alnum + _)	w
xdigit	Any hexadecimal digit character

Note: These are always supported when using Boost-based regex.

Types

Properties

Will only apply to ustrregex* commands.
See documentation in ICU library for unicode properties
Properties can include scripts/alphabetic systems, emojis, and letters with diacritics, among many others
This can be particularly useful when working with string data from multiple countries and social media 😉

How can you boost the performance of your regexs?

Use Boost for Speed* and Unicode for Robustness

. #d ;
. loc word `: word 5 of `t''; timer clear; timer on 1;
. count if ustrregexm(lyrics, "\b(`word')\s\1\b");
884
. timer off 1; timer on 2;
. count if regexmatch(lyrics, "\b(`word')\s\1\b");
884
. timer off 2; timer on 3;
. count if ustrregexm(lyrics, "\b(\w+)\s\1\b");
905,421
. timer off 3; timer on 4;
. count if regexmatch(lyrics, "\b(\w+)\s\1\b");
887,740
. timer off 4; timer list;
   1:     19.20 /        1 =      19.1990
   2:     16.61 /        1 =      16.6140
   3:    252.99 /        1 =     252.9850
   4:    247.90 /        1 =     247.8950
. count if ustrregexm(lyrics, "\P{ASCII}")
  2,745,169

Avoid Greedy Quantifiers When Possible

Computational cost increases as a function of n, length of the strings, and the amount of backtracking needed.
As an alternative, when * or + appears in your regex, you could consider using:

Lazy/Non-Greedy or Possessive Quantifiers
Multiple Regular Expressions

Avoid Capture Groups if You Do Not Need to Capture the Match

					. timer clear; version 18; timer on 1;
. count if ustrregexm(lyrics, `"(`: subinstr loc t " " ")|(", all')"');
1,074,729
. timer off 1; timer on 2;
. count if regexmatch(lyrics, `"(`: subinstr loc t " " ")|(", all')"');
1,074,729
. timer off 2; version 16; timer on 3;
. count if regexm(lyrics, `"(`: subinstr loc t " " ")|(", all')"');
1,074,729
. timer off 3; version 18; timer on 4;
. count if ustrregexm(lyrics, `"(?:`: subinstr loc t " " ")|(?:", all')"');
1,074,729
. timer off 4; timer on 5;
. count if regexmatch(lyrics, `"(?:`: subinstr loc t " " ")|(?:", all')"');
1,074,729
. timer off 5; version 16; timer on 6;
. count if regexm(lyrics, `"(?:`: subinstr loc t " " ")|(?:", all')"');
regexp: ?+* follows nothing
. timer off 6; version 18; timer on 7;
. count if ustrregexm(lyrics, `"`: subinstr loc t " " "|", all'"');
1,074,729
. timer off 7; timer on 8;
. count if regexmatch(lyrics, `"`: subinstr loc t " " "|", all'"');
1,074,729
. timer off 8; version 16; timer on 9;
. count if regexm(lyrics, `"`: subinstr loc t " " "|", all'"');
1,074,729
. timer off 9; version 18; timer list;
   1:    113.23 /        1 =     113.2280
   2:     90.53 /        1 =      90.5250
   3:    325.08 /        1 =     325.0800
   4:     79.10 /        1 =      79.0960
   5:     81.22 /        1 =      81.2160
   6:      1.67 /        1 =       1.6660
   7:     78.57 /        1 =      78.5720
   8:     75.17 /        1 =      75.1670
   9:    173.61 /        1 =     173.6060

Factor out common elements

. timer clear; timer on 1;
. count if ustrregexm(lyrics, "(?:losing|loving|looser|lost|lottery)");
473,024
. timer off 1; timer on 2;
. count if regexmatch(lyrics, "(?:losing|loving|looser|lost|lottery)");
473,024
. timer off 2; timer on 3;
. count if ustrregexm(lyrics, "lo(?:sing|ving|oser|st|ttery)");
473,024
. timer off 3; timer on 4;
. count if regexmatch(lyrics, "lo(?:sing|ving|oser|st|ttery)");
473,024
. timer off 4; timer list;
   1:     25.25 /        1 =      25.2450
   2:     43.42 /        1 =      43.4170
   3:     20.40 /        1 =      20.4030
   4:     25.85 /        1 =      25.8520

Use Lazy Quantifiers When Possible

. loc w1 $bandwidthex ; loc w2 `: word 5 of `t'' ;
. timer clear; timer on 1;
. count if ustrregexm(lyrics, "\b(?:`w1'\W+(?:\w+\W+){0,10}`w2'|`w2'\W+(?:\w+\W+){0,10}`w1')\b");
235,289
. timer off 1; timer on 2;
. count if regexmatch(lyrics, "\b(?:`w1'\W+(?:\w+\W+){0,10}`w2'|`w2'\W+(?:\w+\W+){0,10}`w1')\b");
235,702
. timer off 2; timer on 3;
. count if ustrregexm(lyrics, "\b(?:`w1'\W+(?:\w+\W+){0,10}?`w2'|`w2'\W+(?:\w+\W+){0,10}?`w1')\b");
235,289
. timer off 3; timer on 4;
. count if regexmatch(lyrics, "\b(?:`w1'\W+(?:\w+\W+){0,10}?`w2'|`w2'\W+(?:\w+\W+){0,10}?`w1')\b");
235,702
. timer off 4; timer list;
   1:     48.15 /        1 =      48.1450
   2:     99.40 /        1 =      99.3960
   3:     44.17 /        1 =      44.1680
   4:     99.23 /        1 =      99.2300

Use Possessive Quantifiers Carefully

python
# import a library to generate random objects
from faker import Faker
# This will vary the number of levels of domains
import random
# Import the Stata API module
from sfi import Data
# Get the number of observations in the dataset
obs = Data.getObsTotal()
# Initialize the Faker class object
fake = Faker()
# Set the pseudorandom object generator seed
fake.seed(7779311)
# Set the pseudorandom number generator seed
random.seed(8675309)
# Pre-allocate a list that will store the email addresses
emails = [None] * obs
# Loop over observation indices
for i in range(obs):
    # Select a random number of domain & subdomain levels to test
    levels = random.randint(1, 5)
    # Generate the email address and store it in the list object
    emails[i] = fake.user_name() + '@' + fake.domain_name(levels)

# Add a variable to store the email addresses
Data.addVarStrL('email')

# Store the email addresses
Data.store('email', None, emails)

# End the python interpreter
end

// Compress the email variable
compress email

// Add the random email addresses to the end of the lyrics
g strL elyrics = lyrics + " " + email

// Get the storage type for email
loc t : type email

#d ;
timer clear; timer on 1;
g `t' etest1 = ustrregexs(0) if ustrregexm(elyrics,
"\b[\w\p{P}]+?@[[:alnum:]][-[:alnum:]]{1,62}\.(?:[[:alnum:]][-[:alnum:]]{1,62}\.?){1,5}\b");
timer off 1; timer on 2;
g `t' etest2 = regexcapture(0) if regexmatch(elyrics,
"\b[\w\p{P}]+?@[[:alnum:]][-[:alnum:]]{1,62}\.(?:[[:alnum:]][-[:alnum:]]{1,62}\.?){1,5}\b");
timer off 2; timer on 3;
g `t' etest3 = ustrregexs(0) if ustrregexm(elyrics,
"\b[\w\p{P}]++@[[:alnum:]][-[:alnum:]]{1,62}\.(?:[[:alnum:]][-[:alnum:]]{1,62}\.?){1,5}+\b");
timer off 3; timer on 4;
g `t' etest4 = regexcapture(0) if regexmatch(elyrics,
"\b[\w\p{P}]++@[[:alnum:]][-[:alnum:]]{1,62}\.(?:[[:alnum:]][-[:alnum:]]{1,62}\.?){1,5}+\b");
timer off 4; timer list;
   1:    277.55 /        1 =     277.5490
   2:    201.91 /        1 =     201.9100
   3:    314.44 /        1 =     314.4410
   4:    164.02 /        1 =     164.0230

. cap noi assert etest1 == email
1,168 contradictions in 5,134,856 observations
assertion is false
. cap noi assert etest2 == email
428 contradictions in 5,134,856 observations
assertion is false
. cap noi assert etest3 == email
5,134,856 contradictions in 5,134,856 observations
assertion is false
. cap noi assert etest4 == email
379 contradictions in 5,134,856 observations
assertion is false

Order Alternatives in the Most likely Order in the data

					. timer clear
. timer on 1;
. qui: g ustralt1 = ustrregexs(1) if _n == 105519 &				 ///
		ustrregexm(lyrics, "(`: word 7 of `t''|`: word 4 of `t'')");
. timer off 1; timer on 2;
. qui: g ustralt2 = ustrregexs(1) if _n == 105519 &				 ///
		ustrregexm(lyrics, "(`: word 4 of `t''|`: word 7 of `t'')");
. timer off 2; timer on 3;
. qui: g bstalt1 = regexcapture(1) if _n == 105519 &///
		regexmatch(lyrics, "(`: word 7 of `t''|`: word 4 of `t'')");
. timer off 3; timer on 4;
. qui: g bstalt2 = regexcapture(1) if _n == 105519 &///
		regexmatch(lyrics, "(`: word 4 of `t''|`: word 7 of `t'')");
. timer off 4;
. di strlen(lyrics[105519]);
8714
. if ustralt1[105519] == "`: word 7 of `t''" di "word 7 matched";
. if ustralt1[105519] == "`: word 4 of `t''" di "word 4 matched";
word 4 matched
. if ustralt2[105519] == "`: word 7 of `t''" di "word 7 matched";
. if ustralt2[105519] == "`: word 4 of `t''" di "word 4 matched";
word 4 matched
. if bstalt1[105519] == "`: word 7 of `t''" di "word 7 matched";
. if bstalt1[105519] == "`: word 4 of `t''" di "word 4 matched";
word 4 matched
. if bstalt2[105519] == "`: word 7 of `t''" di "word 7 matched";
. if bstalt2[105519] == "`: word 4 of `t''" di "word 4 matched";
word 4 matched
. timer list;
   1:     25.37 /        1 =      25.3660
   2:     25.33 /        1 =      25.3320
   3:     22.85 /        1 =      22.8500
   4:     22.73 /        1 =      22.7270

If all else fails...

Try one of the following:

IBM Granite: https://www.ibm.com/granite/playground/
Microsoft Copilot: https://copilot.microsoft.com/
Google Gemini: https://gemini.google.com
Meta Code Llama: https://www.llama.com/code-llama/

Example Data

https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information/data

References

Friedl, J. E. F. (2006). Mastering regular expressions. 3rd Ed. O'Reilly

Goyvaerts, J., & Levithan, S. (2012). Regular expressions cookbook. 2nd Ed.
O'Reilly

Maddock, J. (2013). The leftmost longest rule.

Other Resources

Regex 101 Debugger