" i[fn]{1}\s+.*?(?=, *[a-zA-Z]|\$)"
"^(.*?)(?![^()]*\)),"
`"(?:(?!["\(]).)*?,\s*(?![^\(]*\))(.*)\$"'
Syntax | Group Type | Functionality |
( ... ) | Capture | Can reference captured content with \# |
(?<name> ... ) | Named Capture | Assigns names to capture groups |
(?: ... ) | Non-Capturing | Prevents capturing the group contents |
(?> ... ) | Atomic Group | Prevents backtracking in the group |
Syntax | Type | Direction/Position |
(?= ... ) | Positive | Ahead |
(?<= ... ) | Positive | Behind |
\b | Positive | Word Boundary |
(?! ... ) | Negative | Ahead |
(?<! ... ) | Negative | Behind |
\B | Negative | Word Boundary |
Word Boundary Example: @b#cd = [@, b, #, c, d]
Trait | Greedy | Lazy | Possessive |
Backtracking | Yes | Yes | No |
Consumption | As long as possible | As short as possible | As much as possible |
Zero or One | ? | ?? | ?+ |
Zero or More | * | *? | *+ |
One or More | + | +? | ++ |
Exactly # | {#} | {#}? | {#}+ |
At Least # | {#,} | {#,}? | {#,}+ |
In [#, #] | {#, #} | {#, #}? | {#, #}+ |
Metacharacter | Meaning | Notes |
^ | Start of String | First Character in Regex |
^ | Not/Negation | First Character in Character Set |
$ | End of String | Use the escape character to avoid eval as global |
\ | Escape | Interpret metacharacter as literal |
() | Grouping | See notes on grouping above |
. | Any Character | |
| | Alternates | Logical OR |
? | Matching | [0, 1] times |
* | Matching | [0, ∞) times |
+ | Matching | [1, ∞) times |
{#} | Matching | Exactly # times |
{#, } | Matching | [#, ∞) times |
{#a, #b} | Matching | [#a, #b] times |
Metacharacter | Meaning | Notes |
[ ] | Character set | To match any of the characters in that position |
[: :] | Character Class | Place term/ID between colons to use predefined sets of characters |
[. .] | Collating Elements | For digraphs or other locale dependent collating elements (e.g., named unicode characters) |
[= =] | Equivalence Class | To reference any character or collating element with the same primary sort key as the collating element referenced |
\p{ ... } | Positive Character Property | Matches any character with the specified property |
\P{ ... } | Negated Character Property | Matches any character that does not have the specified property |
Class | Definition | Aliases |
alnum | Any alphabetic character or number | |
alpha | Any alphabetic character | |
blank | White space excluding line separators | |
cntrl | Control characters (typically non-printable) | |
digit | Numbers | d |
graph | Any graphical character | |
lower | Any lower case character | l |
Any printable character | ||
punct | Any punctuation character | |
space | Any white space character | s |
unicode | Any character with a code point > 255 | |
upper | Any upper case character | u |
word | Any word character (alnum + _) | w |
xdigit | Any hexadecimal digit character |
Note: These are always supported when using Boost-based regex.
ustrregex*
commands.. #d ;
. loc word `: word 5 of `t''; timer clear; timer on 1;
. count if ustrregexm(lyrics, "\b(`word')\s\1\b");
884
. timer off 1; timer on 2;
. count if regexmatch(lyrics, "\b(`word')\s\1\b");
884
. timer off 2; timer on 3;
. count if ustrregexm(lyrics, "\b(\w+)\s\1\b");
905,421
. timer off 3; timer on 4;
. count if regexmatch(lyrics, "\b(\w+)\s\1\b");
887,740
. timer off 4; timer list;
1: 19.20 / 1 = 19.1990
2: 16.61 / 1 = 16.6140
3: 252.99 / 1 = 252.9850
4: 247.90 / 1 = 247.8950
. count if ustrregexm(lyrics, "\P{ASCII}")
2,745,169
*
or +
appears in your regex, you could consider using:
. timer clear; version 18; timer on 1;
. count if ustrregexm(lyrics, `"(`: subinstr loc t " " ")|(", all')"');
1,074,729
. timer off 1; timer on 2;
. count if regexmatch(lyrics, `"(`: subinstr loc t " " ")|(", all')"');
1,074,729
. timer off 2; version 16; timer on 3;
. count if regexm(lyrics, `"(`: subinstr loc t " " ")|(", all')"');
1,074,729
. timer off 3; version 18; timer on 4;
. count if ustrregexm(lyrics, `"(?:`: subinstr loc t " " ")|(?:", all')"');
1,074,729
. timer off 4; timer on 5;
. count if regexmatch(lyrics, `"(?:`: subinstr loc t " " ")|(?:", all')"');
1,074,729
. timer off 5; version 16; timer on 6;
. count if regexm(lyrics, `"(?:`: subinstr loc t " " ")|(?:", all')"');
regexp: ?+* follows nothing
. timer off 6; version 18; timer on 7;
. count if ustrregexm(lyrics, `"`: subinstr loc t " " "|", all'"');
1,074,729
. timer off 7; timer on 8;
. count if regexmatch(lyrics, `"`: subinstr loc t " " "|", all'"');
1,074,729
. timer off 8; version 16; timer on 9;
. count if regexm(lyrics, `"`: subinstr loc t " " "|", all'"');
1,074,729
. timer off 9; version 18; timer list;
1: 113.23 / 1 = 113.2280
2: 90.53 / 1 = 90.5250
3: 325.08 / 1 = 325.0800
4: 79.10 / 1 = 79.0960
5: 81.22 / 1 = 81.2160
6: 1.67 / 1 = 1.6660
7: 78.57 / 1 = 78.5720
8: 75.17 / 1 = 75.1670
9: 173.61 / 1 = 173.6060
. timer clear; timer on 1;
. count if ustrregexm(lyrics, "(?:losing|loving|looser|lost|lottery)");
473,024
. timer off 1; timer on 2;
. count if regexmatch(lyrics, "(?:losing|loving|looser|lost|lottery)");
473,024
. timer off 2; timer on 3;
. count if ustrregexm(lyrics, "lo(?:sing|ving|oser|st|ttery)");
473,024
. timer off 3; timer on 4;
. count if regexmatch(lyrics, "lo(?:sing|ving|oser|st|ttery)");
473,024
. timer off 4; timer list;
1: 25.25 / 1 = 25.2450
2: 43.42 / 1 = 43.4170
3: 20.40 / 1 = 20.4030
4: 25.85 / 1 = 25.8520
. loc w1 $bandwidthex ; loc w2 `: word 5 of `t'' ;
. timer clear; timer on 1;
. count if ustrregexm(lyrics, "\b(?:`w1'\W+(?:\w+\W+){0,10}`w2'|`w2'\W+(?:\w+\W+){0,10}`w1')\b");
235,289
. timer off 1; timer on 2;
. count if regexmatch(lyrics, "\b(?:`w1'\W+(?:\w+\W+){0,10}`w2'|`w2'\W+(?:\w+\W+){0,10}`w1')\b");
235,702
. timer off 2; timer on 3;
. count if ustrregexm(lyrics, "\b(?:`w1'\W+(?:\w+\W+){0,10}?`w2'|`w2'\W+(?:\w+\W+){0,10}?`w1')\b");
235,289
. timer off 3; timer on 4;
. count if regexmatch(lyrics, "\b(?:`w1'\W+(?:\w+\W+){0,10}?`w2'|`w2'\W+(?:\w+\W+){0,10}?`w1')\b");
235,702
. timer off 4; timer list;
1: 48.15 / 1 = 48.1450
2: 99.40 / 1 = 99.3960
3: 44.17 / 1 = 44.1680
4: 99.23 / 1 = 99.2300
python
# import a library to generate random objects
from faker import Faker
# This will vary the number of levels of domains
import random
# Import the Stata API module
from sfi import Data
# Get the number of observations in the dataset
obs = Data.getObsTotal()
# Initialize the Faker class object
fake = Faker()
# Set the pseudorandom object generator seed
fake.seed(7779311)
# Set the pseudorandom number generator seed
random.seed(8675309)
# Pre-allocate a list that will store the email addresses
emails = [None] * obs
# Loop over observation indices
for i in range(obs):
# Select a random number of domain & subdomain levels to test
levels = random.randint(1, 5)
# Generate the email address and store it in the list object
emails[i] = fake.user_name() + '@' + fake.domain_name(levels)
# Add a variable to store the email addresses
Data.addVarStrL('email')
# Store the email addresses
Data.store('email', None, emails)
# End the python interpreter
end
// Compress the email variable
compress email
// Add the random email addresses to the end of the lyrics
g strL elyrics = lyrics + " " + email
// Get the storage type for email
loc t : type email
#d ;
timer clear; timer on 1;
g `t' etest1 = ustrregexs(0) if ustrregexm(elyrics,
"\b[\w\p{P}]+?@[[:alnum:]][-[:alnum:]]{1,62}\.(?:[[:alnum:]][-[:alnum:]]{1,62}\.?){1,5}\b");
timer off 1; timer on 2;
g `t' etest2 = regexcapture(0) if regexmatch(elyrics,
"\b[\w\p{P}]+?@[[:alnum:]][-[:alnum:]]{1,62}\.(?:[[:alnum:]][-[:alnum:]]{1,62}\.?){1,5}\b");
timer off 2; timer on 3;
g `t' etest3 = ustrregexs(0) if ustrregexm(elyrics,
"\b[\w\p{P}]++@[[:alnum:]][-[:alnum:]]{1,62}\.(?:[[:alnum:]][-[:alnum:]]{1,62}\.?){1,5}+\b");
timer off 3; timer on 4;
g `t' etest4 = regexcapture(0) if regexmatch(elyrics,
"\b[\w\p{P}]++@[[:alnum:]][-[:alnum:]]{1,62}\.(?:[[:alnum:]][-[:alnum:]]{1,62}\.?){1,5}+\b");
timer off 4; timer list;
1: 277.55 / 1 = 277.5490
2: 201.91 / 1 = 201.9100
3: 314.44 / 1 = 314.4410
4: 164.02 / 1 = 164.0230
. cap noi assert etest1 == email
1,168 contradictions in 5,134,856 observations
assertion is false
. cap noi assert etest2 == email
428 contradictions in 5,134,856 observations
assertion is false
. cap noi assert etest3 == email
5,134,856 contradictions in 5,134,856 observations
assertion is false
. cap noi assert etest4 == email
379 contradictions in 5,134,856 observations
assertion is false
. timer clear
. timer on 1;
. qui: g ustralt1 = ustrregexs(1) if _n == 105519 & ///
ustrregexm(lyrics, "(`: word 7 of `t''|`: word 4 of `t'')");
. timer off 1; timer on 2;
. qui: g ustralt2 = ustrregexs(1) if _n == 105519 & ///
ustrregexm(lyrics, "(`: word 4 of `t''|`: word 7 of `t'')");
. timer off 2; timer on 3;
. qui: g bstalt1 = regexcapture(1) if _n == 105519 &///
regexmatch(lyrics, "(`: word 7 of `t''|`: word 4 of `t'')");
. timer off 3; timer on 4;
. qui: g bstalt2 = regexcapture(1) if _n == 105519 &///
regexmatch(lyrics, "(`: word 4 of `t''|`: word 7 of `t'')");
. timer off 4;
. di strlen(lyrics[105519]);
8714
. if ustralt1[105519] == "`: word 7 of `t''" di "word 7 matched";
. if ustralt1[105519] == "`: word 4 of `t''" di "word 4 matched";
word 4 matched
. if ustralt2[105519] == "`: word 7 of `t''" di "word 7 matched";
. if ustralt2[105519] == "`: word 4 of `t''" di "word 4 matched";
word 4 matched
. if bstalt1[105519] == "`: word 7 of `t''" di "word 7 matched";
. if bstalt1[105519] == "`: word 4 of `t''" di "word 4 matched";
word 4 matched
. if bstalt2[105519] == "`: word 7 of `t''" di "word 7 matched";
. if bstalt2[105519] == "`: word 4 of `t''" di "word 4 matched";
word 4 matched
. timer list;
1: 25.37 / 1 = 25.3660
2: 25.33 / 1 = 25.3320
3: 22.85 / 1 = 22.8500
4: 22.73 / 1 = 22.7270
https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information/data
Friedl, J. E. F. (2006). Mastering regular expressions. 3rd Ed. O'Reilly
Goyvaerts, J., & Levithan, S. (2012). Regular expressions cookbook. 2nd Ed.
O'Reilly
Maddock, J. (2013). The leftmost longest rule.