Support for regular expressions was added to ECMAScript 3 in 1999.
Sixteen years later, ES6/ES2015 introduced Unicode mode (the u flag), sticky mode (the y flag), and the RegExp.prototype.flags getter.
This article highlights what’s happening in the world of JavaScript regular expressions right now. Spoiler: it’s quite a lot. There are more RegExp-related proposals currently advancing through the TC39 standardization process than there have been updates to RegExp in the history of ECMAScript!
We’ll discuss the following ES2018 features and ECMAScript proposals:
- dotAll mode (the s flag)
- Lookbehind assertions
- Named capture groups
- Unicode property escapes
- Unicode properties of strings
- Set notation
- String.prototype.matchAll
- Legacy RegExp features
dotAll mode (the s flag)
By default, . matches any character except for line terminators:
/foo.bar/u.test('foo\nbar');
// → false
(It doesn’t match astral Unicode symbols either, but we fixed that by enabling the u flag.)
ES2018 introduces dotAll mode, enabled through the s flag. In dotAll mode, . matches line terminators as well.
/foo.bar/su.test('foo\nbar');
// → true
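To put the two modes side by side, here is a small sketch. The [\s\S] character class is a widely used pre-ES2018 workaround rather than something taken from the examples above, and the variable names are purely illustrative.

```javascript
const input = 'foo\nbar';

// Without the s flag, . stops at line terminators.
const withoutDotAll = /foo.bar/u;

// A common pre-ES2018 workaround: a class that matches any character.
const workaround = /foo[\s\S]bar/u;

// With the s flag, . itself matches the line terminator.
const withDotAll = /foo.bar/su;

console.log(withoutDotAll.test(input)); // false
console.log(workaround.test(input));    // true
console.log(withDotAll.test(input));    // true

// The flag is reflected on the RegExp instance.
console.log(withDotAll.dotAll); // true
console.log(withDotAll.flags);  // 'su'
```

The workaround still works everywhere, but the s flag states the intent directly in the flags instead of hiding it inside the pattern.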
Lookbehind assertions
Lookarounds are zero-width assertions that match a string without consuming anything. ECMAScript currently supports lookahead assertions, which do this in the forward direction. Positive lookahead ensures a pattern is followed by another pattern:
const pattern = /\d+(?= dollars)/u;
const result = pattern.exec('42 dollars');
// → result[0] === '42'
Negative lookahead ensures a pattern is not followed by another pattern:
const pattern = /\d+(?! dollars)/u;
const result = pattern.exec('42 pesos');
// → result[0] === '42'
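Positive and negative lookahead are often combined. As an illustration, the sketch below inserts thousands separators into a string of digits; the helper name addThousandsSeparators is hypothetical and the input is assumed to contain digits only.

```javascript
// Hypothetical helper: insert thousands separators into a string of digits.
function addThousandsSeparators(digits) {
  // \B          : a position between two word characters (here: two digits)
  // (?=(\d{3})+ : followed by one or more groups of exactly three digits...
  // (?!\d))     : ...after which no further digit may follow
  return digits.replace(/\B(?=(\d{3})+(?!\d))/g, ',');
}

console.log(addThousandsSeparators('1234567')); // '1,234,567'
console.log(addThousandsSeparators('1000'));    // '1,000'
console.log(addThousandsSeparators('999'));     // '999'
```

Because both assertions are zero-width, the pattern matches only positions, so the replacement inserts commas without removing any digits.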
ES2018 adds support for lookbehind assertions. Positive lookbehind ensures a pattern is preceded by another pattern:
const pattern = /(?<=\$)\d+/u;
const result = pattern.exec('$42');
// → result[0] === '42'
Negative lookbehind ensures a pattern is not preceded by another pattern:
const pattern = /(?<!\$)\d+/u;
const result = pattern.exec('€42');
// → result[0] === '42'
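One subtlety is worth illustrating: a negative lookbehind only constrains the position where a match starts, so the engine is free to find a match that starts one character later. The sketch below shows this on '$42', along with one possible way to tighten the pattern. The names naive, strict, and extractDollarAmounts are illustrative.

```javascript
// The lookbehind rejects a start at '4' (preceded by '$'),
// but a start at '2' is fine because '2' is preceded by '4'.
const naive = /(?<!\$)\d+/u;
console.log(naive.exec('$42')[0]); // '2'

// Also forbidding a preceding digit prevents starting mid-number.
const strict = /(?<![$\d])\d+/u;
console.log(strict.exec('$42'));    // null
console.log(strict.exec('€42')[0]); // '42'

// Hypothetical helper: collect amounts that directly follow a dollar sign.
function extractDollarAmounts(text) {
  return text.match(/(?<=\$)\d+(?:\.\d+)?/gu) || [];
}

console.log(extractDollarAmounts('Costs: $42, €17, $3.50'));
// ['42', '3.50']
```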
Named capture groups
Currently, each capture group in a regular expression is numbered and can be referenced using that number:
const pattern = /(\d{4})-(\d{2})-(\d{2})/u;
const result = pattern.exec('2017-01-25');
// → result[0] === '2017-01-25'
// → result[1] === '2017'
// → result[2] === '01'
// → result[3] === '25'
This is useful, but not very readable or maintainable. Whenever the order of capture groups in the pattern changes, the indices need to be updated accordingly.
ES2018 adds support for named capture groups, enabling more readable and maintainable code.
const pattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u;
const result = pattern.exec('2017-01-25');
// → result.groups.year === '2017'
// → result.groups.month === '01'
// → result.groups.day === '25'
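Named groups can also be used in replacement strings (as $<name>), in backreferences (as \k<name>), and with destructuring. The following sketch shows all three; the helper name toUsDate and the other identifiers are illustrative.

```javascript
const isoDate = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u;

// Hypothetical helper: reorder the date parts using $<name> in the replacement.
function toUsDate(text) {
  return text.replace(isoDate, '$<month>/$<day>/$<year>');
}
console.log(toUsDate('2017-01-25')); // '01/25/2017'

// The groups object works well with destructuring.
const { groups: { year, month, day } } = isoDate.exec('2017-01-25');
console.log(year, month, day); // '2017' '01' '25'

// \k<name> refers back to whatever the named group matched.
const repeatedWord = /^(?<word>[a-z]+) \k<word>$/u;
console.log(repeatedWord.test('hello hello')); // true
console.log(repeatedWord.test('hello world')); // false
```

Reordering the groups in the pattern does not affect any of this code, which is the maintainability benefit described above.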
Unicode property escapes
The Unicode Standard assigns various properties and property values to every symbol. For example, to get the set of symbols that are used in the Greek script, search the Unicode database for symbols whose Script_Extensions property is set to Greek.
Unicode property escapes make it possible to access these Unicode character properties natively in ECMAScript regular expressions. For example, the pattern \p{Script_Extensions=Greek} matches every symbol that is used in the Greek script.
const regexGreekSymbol = /\p{Script_Extensions=Greek}/u;
regexGreekSymbol.test('π');
// → true
Previously, developers wishing to use equivalent regular expressions in JavaScript had to resort to large run-time dependencies or build scripts, both of which lead to performance and maintainability problems. With built-in support for Unicode property escapes, creating regular expressions based on Unicode properties couldn’t be easier.
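As a rough sketch of how property escapes can be used beyond a single test() call, the example below counts Greek symbols, checks for letters-only input with the General_Category value Letter, and uses the negated form \P{…}. The helper name countGreekSymbols and the sample strings are illustrative.

```javascript
const greekSymbol = /\p{Script_Extensions=Greek}/gu;

// Hypothetical helper: count the symbols in text that are used in the Greek script.
function countGreekSymbols(text) {
  const matches = text.match(greekSymbol);
  return matches === null ? 0 : matches.length;
}
console.log(countGreekSymbols('π ≈ 3.14, α and β')); // 3

// \p{Letter} matches letters from any script.
const lettersOnly = /^\p{Letter}+$/u;
console.log(lettersOnly.test('λογος'));  // true
console.log(lettersOnly.test('abc123')); // false

// \P{…} (uppercase P) matches every symbol that lacks the property.
const noGreek = /^\P{Script_Extensions=Greek}+$/u;
console.log(noGreek.test('pi')); // true
console.log(noGreek.test('π'));  // false
```

Note that property escapes require the u flag (or the proposed v flag); without it, \p simply matches the letter p.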
Unicode properties of strings
A separate proposal extends Unicode property escapes functionality to Unicode properties that expand to sequences of characters, such as RGI_Emoji (which encompasses all emoji, regardless of whether they consist of a single code point or a sequence of code points):
const regexRgiEmoji = /\p{RGI_Emoji}/v;
// Note: although 4️⃣ looks like a single symbol, it consists
// of multiple Unicode code points.
regexRgiEmoji.test('4️⃣');
// → true
// Flag emoji consist of multiple code points.
regexRgiEmoji.test('🇧🇪');
// → true
This proposal would make it easier to match emoji (which can consist of multiple code points) and hashtags (which can contain emoji) using regular expressions. As the Unicode Standard defines more sequence properties over time, JavaScript regular expressions could support those as well.
Note: This proposal is still in the process of being standardized, and as such, its syntax is subject to change. The descriptions and code examples in this article match the latest versions of the proposal at the time of writing. This proposal is currently at stage 3 and can make it into ES2023, at the earliest.
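Since support for the v flag cannot be assumed everywhere, a defensive sketch might construct such patterns at run time and fall back gracefully. The helper tryCreateRegExp is hypothetical; Emoji_Keycap_Sequence and RGI_Emoji_Flag_Sequence are two of the properties of strings named in the proposal, and the emoji are written with escapes so the code points are visible.

```javascript
// Hypothetical helper: returns a RegExp, or null when the engine
// rejects the pattern or the flags (e.g. no support for the v flag).
function tryCreateRegExp(source, flags) {
  try {
    return new RegExp(source, flags);
  } catch (error) {
    return null;
  }
}

const keycap = tryCreateRegExp('^\\p{Emoji_Keycap_Sequence}$', 'v');
const flag = tryCreateRegExp('^\\p{RGI_Emoji_Flag_Sequence}$', 'v');

if (keycap !== null && flag !== null) {
  // U+0034 U+FE0F U+20E3 is the keycap emoji 4️⃣.
  console.log(keycap.test('4\uFE0F\u20E3')); // true
  console.log(keycap.test('4'));             // false

  // U+1F1E7 U+1F1EA is the flag of Belgium.
  console.log(flag.test('\u{1F1E7}\u{1F1EA}')); // true
  // A lone regional indicator is not a flag sequence.
  console.log(flag.test('\u{1F1E7}'));          // false
} else {
  console.log('The v flag is not supported in this environment.');
}
```

Anchoring with ^ and $ makes each test check that the whole string is exactly one sequence, rather than merely containing one.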
Set notation
The proposed unicodeSets mode, enabled using the v flag, unlocks support for extended character classes, including not only properties of strings but also set notation, string literal syntax, and improved case-insensitive matching.
Set notation includes the -- syntax for difference/subtraction:
// Match all Greek symbols except for “π”:
/[\p{Script_Extensions=Greek}--π]/v.test('π'); // → false
// Match all Greek symbols except for “α”, “β”, and “γ”:
/[\p{Script_Extensions=Greek}--[αβγ]]/v.test('α'); // → false
/[\p{Script_Extensions=Greek}--[α-γ]]/v.test('β'); // → false
// Match all RGI emoji tag sequences except for the flag of Scotland:
/^[\p{RGI_Emoji_Tag_Sequence}--\q{🏴}]$/v.test('🏴'); // → false
Intersection is done with the new && syntax:
// Match all Greek letters:
const re = /[\p{Script_Extensions=Greek}&&\p{Letter}]/v;
// U+03C0 GREEK SMALL LETTER PI
re.test('π'); // → true
// U+1018A GREEK ZERO SIGN
re.test('𐆊'); // → false
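To show difference and intersection on a slightly more everyday task, here is a sketch that matches non-ASCII decimal digits and ASCII-only letters. It builds the patterns at run time because support for the v flag is an assumption about the environment; the helper tryCreateRegExp and the variable names are hypothetical.

```javascript
// Hypothetical helper: returns a RegExp, or null when the engine
// rejects the pattern or the flags (e.g. no support for the v flag).
function tryCreateRegExp(source, flags) {
  try {
    return new RegExp(source, flags);
  } catch (error) {
    return null;
  }
}

// Difference: every decimal digit except the ASCII digits 0-9.
const nonAsciiDigits = tryCreateRegExp('^[\\p{Decimal_Number}--[0-9]]+$', 'v');

// Intersection: symbols that are both ASCII and letters.
const asciiLetters = tryCreateRegExp('^[\\p{ASCII}&&\\p{Letter}]+$', 'v');

if (nonAsciiDigits !== null && asciiLetters !== null) {
  // U+0664 U+0662 are the Arabic-Indic digits four and two.
  console.log(nonAsciiDigits.test('\u0664\u0662')); // true
  console.log(nonAsciiDigits.test('42'));           // false

  console.log(asciiLetters.test('abc'));    // true
  console.log(asciiLetters.test('abc1'));   // false
  console.log(asciiLetters.test('\u03C0')); // false: π is a letter, but not ASCII
} else {
  console.log('The v flag is not supported in this environment.');
}
```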
String.prototype.matchAll
A common use case of global (g) or sticky (y) regular expressions is applying them to a string and iterating through all of the matches, including capturing groups. The String.prototype.matchAll proposal makes this easier than ever before.
const string = 'Magic hex numbers: DEADBEEF CAFE 8BADF00D';
const regex = /\b[0-9a-fA-F]+\b/g;
for (const match of string.matchAll(regex)) {
  console.log(match);
}
The match object for each loop iteration is equivalent to what regex.exec(string) would return.
// Iteration 1:
[
'DEADBEEF',
index: 19,
input: 'Magic hex numbers: DEADBEEF CAFE 8BADF00D'
]
// Iteration 2:
[
'CAFE',
index: 28,
input: 'Magic hex numbers: DEADBEEF CAFE 8BADF00D'
]
// Iteration 3:
[
'8BADF00D',
index: 33,
input: 'Magic hex numbers: DEADBEEF CAFE 8BADF00D'
]
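For comparison, this is roughly what the same iteration looks like with a manual exec() loop. The helper collectMatchesWithExec is a hypothetical sketch that assumes a regular expression with the g flag; it is not how matchAll is specified, only an approximation of the pattern it replaces.

```javascript
// Hypothetical helper: collect all matches using RegExp.prototype.exec in a loop.
function collectMatchesWithExec(regex, string) {
  if (!regex.global) {
    throw new TypeError('This sketch expects a regular expression with the g flag.');
  }
  // Work on a copy so the caller's lastIndex is left untouched.
  const copy = new RegExp(regex.source, regex.flags);
  const matches = [];
  let match;
  while ((match = copy.exec(string)) !== null) {
    matches.push(match);
    // Step past empty matches to avoid looping forever.
    if (match[0] === '') {
      copy.lastIndex++;
    }
  }
  return matches;
}

const hexInput = 'Magic hex numbers: DEADBEEF CAFE 8BADF00D';
const hexRegex = /\b[0-9a-fA-F]+\b/g;

const viaExec = collectMatchesWithExec(hexRegex, hexInput);
const viaMatchAll = [...hexInput.matchAll(hexRegex)];

console.log(viaExec.map((match) => match[0]));
// ['DEADBEEF', 'CAFE', '8BADF00D']
console.log(viaMatchAll.map((match) => match.index));
// [19, 28, 33]
```

The manual version has to manage lastIndex and guard against empty matches, which is exactly the bookkeeping matchAll takes care of.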
String.prototype.matchAll is especially useful for regular expressions with capture groups:
const string = 'Favorite GitHub repos: tc39/ecma262 v8/v8.dev tc39/test262';
const regex = /\b(?<owner>[a-z0-9]+)\/(?<repo>[a-z0-9\.]+)\b/g;
for (const match of string.matchAll(regex)) {
  console.log(`${match[0]} at ${match.index} with '${match.input}'`);
  console.log(`→ owner: ${match.groups.owner}`);
  console.log(`→ repo: ${match.groups.repo}`);
}
// Output:
//
// tc39/ecma262 at 23 with 'Favorite GitHub repos: tc39/ecma262 v8/v8.dev tc39/test262'
// → owner: tc39
// → repo: ecma262
// v8/v8.dev at 36 with 'Favorite GitHub repos: tc39/ecma262 v8/v8.dev tc39/test262'
// → owner: v8
// → repo: v8.dev
// tc39/test262 at 46 with 'Favorite GitHub repos: tc39/ecma262 v8/v8.dev tc39/test262'
// → owner: tc39
// → repo: test262
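Because matchAll returns an iterator, it also combines well with Array.from and its mapping callback. The sketch below turns the matches into plain objects and then groups them by owner; the names repos and reposByOwner are illustrative.

```javascript
const repoInput = 'Favorite GitHub repos: tc39/ecma262 v8/v8.dev tc39/test262';
const repoRegex = /\b(?<owner>[a-z0-9]+)\/(?<repo>[a-z0-9\.]+)\b/g;

// Map each match to a plain object while consuming the iterator.
const repos = Array.from(repoInput.matchAll(repoRegex), (match) => ({
  owner: match.groups.owner,
  repo: match.groups.repo,
  index: match.index,
}));
console.log(repos.length); // 3
console.log(repos[1]);     // { owner: 'v8', repo: 'v8.dev', index: 36 }

// Group the repository names by owner.
const reposByOwner = new Map();
for (const { owner, repo } of repos) {
  if (!reposByOwner.has(owner)) {
    reposByOwner.set(owner, []);
  }
  reposByOwner.get(owner).push(repo);
}
console.log(reposByOwner.get('tc39')); // ['ecma262', 'test262']
```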
Legacy RegExp features
Another proposal specifies certain legacy RegExp features, such as the RegExp.prototype.compile method and the static properties from RegExp.$1 to RegExp.$9. Although these features are deprecated, unfortunately they cannot be removed from the web platform without introducing compatibility issues. Thus, standardizing their behavior and getting engines to align their implementations is the best way forward. This proposal is important for web compatibility.
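For readers who have never run into these features, the sketch below shows what they look like. It assumes an engine that implements them (web browsers and Node.js generally do), and it is meant as an illustration of legacy behavior, not as a recommendation; the variable names are illustrative.

```javascript
// Legacy: the static properties are updated as a side effect of a successful match.
/(\d+)-(\d+)/.exec('12-34');
console.log(RegExp.$1); // '12'
console.log(RegExp.$2); // '34'

// Preferred: read the captures from the match result instead.
const result = /(\d+)-(\d+)/.exec('12-34');
console.log(result[1]); // '12'
console.log(result[2]); // '34'

// Legacy: compile() re-initializes an existing RegExp instance in place.
const reusable = /a/;
reusable.compile('b', 'g');
console.log(reusable.source); // 'b'
console.log(reusable.flags);  // 'g'
```

The static properties are global state shared by all regular expressions in the realm, which is one reason code that relies on them is fragile.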
Comments
Wyatt wrote on :
I’m not exactly sure what to call it, but are there any proposals for something like what Twitter does in their twitter-text library? They have a function called regexSupplant() which allows them to reference existing RegExp instances in new RegExp instances. Here’s an example taken from their repo:
Obviously their solution isn’t ideal since it depends on all of the RegExp instances being defined on the twttr.txt.regexen object, but I would love to see something like this make it into the spec. It really makes reading complex regular expressions much easier.
Mathias wrote on :
Wyatt: That seems similar to Perl 6’s subrules. I haven’t seen an ECMAScript proposal for such functionality.
Brian wrote on :
You could pull something similar off with a tagged template (ES6). You won’t get regexp syntax highlighting on the template literal but you’d get all the regular IDE support on the different parts (which could be modularized). I came up with a version that probably isn’t the most streamlined but does the job.
Test: https://codepen.io/anon/pen/wgqbaN
Ben Nadel wrote on :
I'm very excited for look-behinds. Been wanting those forever!