Mathias Bynens

ECMAScript regular expressions are getting better!

Published · tagged with JavaScript, Unicode

Support for regular expressions was added to ECMAScript 3 in 1999.

Sixteen years later, ES6/ES2015 introduced Unicode mode (the u flag), sticky mode (the y flag), and the RegExp.prototype.flags getter.

This article highlights what’s happening in the world of JavaScript regular expressions right now. Spoiler: it’s quite a lot — there are more RegExp-related proposals currently advancing through the TC39 standardization process than there have been updates to RegExp in the history of ECMAScript!

We’ll discuss the following ES2018 features and ECMAScript proposals:

dotAll mode (the s flag)

By default, . matches any character except for line terminators:

/foo.bar/u.test('foo\nbar');
// → false

(It doesn’t match astral Unicode symbols either, but we fixed that by enabling the u flag.)

ES2018 introduces dotAll mode, enabled through the s flag. In dotAll mode, . matches line terminators as well.

/foo.bar/su.test('foo\nbar');
// → true
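
As a rough sketch of how this compares with older code (the variable names here are illustrative and not part of the proposal): before the s flag, a common workaround was a character class such as [\s\S]. The example also checks the dotAll getter, and shows that the s flag leaves ^ and $ alone, which remains the job of the m flag.

```javascript
const text = 'foo\nbar';

// Pre-ES2018 workaround: a class that matches any character.
const workaround = /foo[\s\S]bar/u;
// ES2018: let `.` itself match line terminators.
const dotAllPattern = /foo.bar/su;

const results = {
  workaround: workaround.test(text),       // true
  dotAll: dotAllPattern.test(text),        // true
  dotAllGetter: dotAllPattern.dotAll,      // true
  flags: dotAllPattern.flags,              // 'su'
  // The s flag only affects `.`; anchors still need the m flag.
  anchored: /^bar$/s.test(text),           // false
  anchoredMultiline: /^bar$/ms.test(text), // true
};
console.log(results);
```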

Lookbehind assertions

Lookarounds are zero-width assertions that match a string without consuming anything. ECMAScript currently supports lookahead assertions that do this in the forward direction. Positive lookahead ensures a pattern is followed by another pattern:

const pattern = /\d+(?= dollars)/u;
const result = pattern.exec('42 dollars');
// → result[0] === '42'

Negative lookahead ensures a pattern is not followed by another pattern:

const pattern = /\d+(?! dollars)/u;
const result = pattern.exec('42 pesos');
// → result[0] === '42'

ES2018 adds support for lookbehind assertions. Positive lookbehind ensures a pattern is preceded by another pattern:

const pattern = /(?<=\$)\d+/u;
const result = pattern.exec('$42');
// → result[0] === '42'

Negative lookbehind ensures a pattern is not preceded by another pattern:

const pattern = /(?<!\$)\d+/u;
const result = pattern.exec('€42');
// → result[0] === '42'
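
One thing to watch for with negative lookbehind is that the engine is free to start the match one position later. The sketch below uses a made-up input string to show that effect, along with one possible way to guard against it by also rejecting digits and dots in the lookbehind. Treat the guard as an assumption that fits this input, not as a general-purpose price parser.

```javascript
const receipt = 'Total: $42.50 (was €45)';

// Positive lookbehind: take the amount only when a dollar sign precedes it.
const dollarAmount = /(?<=\$)\d+(?:\.\d+)?/u;
const price = dollarAmount.exec(receipt);

// Naive negative lookbehind: "4" is rejected, but "2" is preceded by "4",
// not by "$", so the match simply starts one character later.
const naive = /(?<!\$)\d+/u.exec('$42');

// Guarded version: also refuse to start in the middle of a number.
const nonDollarNumber = /(?<![$\d.])\d+(?:\.\d+)?/u;
const other = nonDollarNumber.exec(receipt);

console.log(price[0], price.index); // '42.50' 8
console.log(naive[0]);              // '2'
console.log(other[0], other.index); // '45' 20
```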

Named capture groups

Currently, each capture group in a regular expression is numbered and can be referenced using that number:

const pattern = /(\d{4})-(\d{2})-(\d{2})/u;
const result = pattern.exec('2017-01-25');
// → result[0] === '2017-01-25'
// → result[1] === '2017'
// → result[2] === '01'
// → result[3] === '25'

This is useful, but not very readable or maintainable. Whenever the order of capture groups in the pattern changes, the indices need to be updated accordingly.

ES2018 adds support for named capture groups, enabling more readable and maintainable code.

const pattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u;
const result = pattern.exec('2017-01-25');
// → result.groups.year === '2017'
// → result.groups.month === '01'
// → result.groups.day === '25'
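
Named groups can also be referenced from replacement strings ($<name>) and from within the pattern itself (\k<name>). The following sketch shows both, plus destructuring of the groups object. The doubledWord pattern and the sample strings are illustrative additions, not taken from the proposal text.

```javascript
const isoDate = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u;

// Destructure the groups object instead of indexing by position.
const { year, month, day } = isoDate.exec('2017-01-25').groups;

// In a replacement string, $<name> refers to a named group.
const european = '2017-01-25'.replace(isoDate, '$<day>/$<month>/$<year>');

// Inside the pattern, \k<name> is a backreference to a named group.
const doubledWord = /\b(?<word>\w+) \k<word>\b/u;
const repeated = doubledWord.exec('this is is a test');

console.log(year, month, day);     // 2017 01 25
console.log(european);             // 25/01/2017
console.log(repeated.groups.word); // is
```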

Unicode property escapes

The Unicode Standard assigns various properties and property values to every symbol. For example, to get the set of symbols that are used in the Greek script, search the Unicode database for symbols whose Script_Extensions property is set to Greek.

Unicode property escapes make it possible to access these Unicode character properties natively in ECMAScript regular expressions. For example, the pattern \p{Script_Extensions=Greek} matches every symbol that is used in the Greek script.

const regexGreekSymbol = /\p{Script_Extensions=Greek}/u;
regexGreekSymbol.test('π');
// → true

Previously, developers wishing to use equivalent regular expressions in JavaScript had to resort to large run-time dependencies or build scripts, both of which lead to performance and maintainability problems. With built-in support for Unicode property escapes, creating regular expressions based on Unicode properties couldn’t be easier.
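
A few more variations, as a sketch: \P{…} negates a property, General_Category values such as Letter or Nd can be written without a property name, and the escapes only take effect in Unicode mode. The sample strings are arbitrary choices for illustration.

```javascript
const checks = {
  // General_Category values work without the "General_Category=" prefix.
  greekWord: /^\p{Letter}+$/u.test('αβγ'),
  cyrillicWord: /^\p{Letter}+$/u.test('Привет'),
  // \P{…} (uppercase P) matches symbols that lack the property.
  nonLetter: /\P{Letter}/u.test('abc1'),
  // Short aliases are accepted: Nd is Decimal_Number.
  // '\u0664\u0662' is "42" in Arabic-Indic digits.
  shortAlias: /^\p{Nd}+$/u.test('\u0664\u0662'),
  // Script matches on the primary script only.
  script: /\p{Script=Greek}/u.test('π'),
  // Without the u flag, \p is an escaped "p" followed by literal text.
  withoutFlag: /\p{Letter}/.test('π'),
  literalText: /\p{Letter}/.test('p{Letter}'),
};
console.log(checks);
```

In this sketch every entry is expected to be true except withoutFlag, which is false.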

Unicode properties of strings

A separate proposal extends Unicode property escapes functionality to Unicode properties that expand to sequences of characters, such as RGI_Emoji (which encompasses all emoji recommended for general interchange, regardless of whether they consist of a single code point or a sequence of code points):

const regexEmoji = /\p{RGI_Emoji}/v;

// Note: although 4️⃣ looks like a single symbol, it consists
// of multiple Unicode code points.
regexEmoji.test('4️⃣');
// → true

// Flag emoji consist of multiple code points.
regexEmoji.test('🇧🇪');
// → true

This proposal would make it easier to match emoji (which can consist of multiple code points) and hashtags (which can contain emoji) using regular expressions. As the Unicode Standard defines more sequence properties over time, JavaScript regular expressions could support those as well.

Note: This proposal is still in the process of being standardized, and as such, its syntax is subject to change. The descriptions and code examples in this article match the latest version of the proposal at the time of writing. The proposal is currently at stage 3 and could make it into ES2023 at the earliest.

Set notation

The proposed unicodeSets mode, enabled using the v flag, unlocks support for extended character classes, including not only properties of strings but also set notation, string literal syntax, and improved case-insensitive matching.

Set notation includes the -- syntax for difference/subtraction:

// Match all Greek symbols except for “π”:
/[\p{Script_Extensions=Greek}--π]/v.test('π'); // → false

// Match all Greek symbols except for “α”, “β”, and “γ”:
/[\p{Script_Extensions=Greek}--[αβγ]]/v.test('α'); // → false
/[\p{Script_Extensions=Greek}--[α-γ]]/v.test('β'); // → false

// Match all RGI emoji tag sequences except for the flag of Scotland:
/^[\p{RGI_Emoji_Tag_Sequence}--\q{🏴󠁧󠁢󠁳󠁣󠁴󠁿}]$/v.test('🏴󠁧󠁢󠁳󠁣󠁴󠁿'); // → false

Intersection is done with the new && syntax:

// Match all Greek letters:
const re = /[\p{Script_Extensions=Greek}&&\p{Letter}]/v;
// U+03C0 GREEK SMALL LETTER PI
re.test('π'); // → true
// U+1018A GREEK ZERO SIGN
re.test('𐆊'); // → false
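
Because the v flag is not available in every engine, the sketch below builds its patterns through the RegExp constructor, so that a missing feature surfaces as a catchable error instead of a parse error. The helper name tryUnicodeSets and the sample patterns are hypothetical and only serve to illustrate nested classes, intersection, and \q{…} string literals.

```javascript
// Returns a v-mode RegExp, or null if the engine rejects the v flag.
function tryUnicodeSets(source) {
  try {
    return new RegExp(source, 'v');
  } catch {
    return null;
  }
}

// ASCII lowercase letters minus vowels, i.e. consonants.
const consonant = tryUnicodeSets('^[[a-z]--[aeiou]]+$');
// Decimal digits that are also ASCII.
const asciiDigit = tryUnicodeSets('^[\\p{Decimal_Number}&&\\p{ASCII}]+$');
// \q{…} adds whole strings to a class, next to ordinary ranges.
const token = tryUnicodeSets('^[\\q{abc|def}x-z]$');

const supported = Boolean(consonant && asciiDigit && token);
if (supported) {
  console.log(consonant.test('rhythm'));  // true
  console.log(consonant.test('regex'));   // false
  console.log(asciiDigit.test('2024'));   // true
  console.log(asciiDigit.test('\u0664')); // false (Arabic-Indic digit four)
  console.log(token.test('def'));         // true
  console.log(token.test('y'));           // true
}
```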

String.prototype.matchAll

A common use case of global (g) or sticky (y) regular expressions is applying them to a string and iterating through all of the matches, including their capture groups. The String.prototype.matchAll proposal makes this easier than ever before.

const string = 'Magic hex numbers: DEADBEEF CAFE 8BADF00D';
const regex = /\b[0-9a-fA-F]+\b/g;
for (const match of string.matchAll(regex)) {
  console.log(match);
}

The match object for each loop iteration is equivalent to what regex.exec(string) would return.

// Iteration 1:
[
  'DEADBEEF',
  index: 19,
  input: 'Magic hex numbers: DEADBEEF CAFE 8BADF00D'
]

// Iteration 2:
[
  'CAFE',
  index: 28,
  input: 'Magic hex numbers: DEADBEEF CAFE 8BADF00D'
]

// Iteration 3:
[
  '8BADF00D',
  index: 33,
  input: 'Magic hex numbers: DEADBEEF CAFE 8BADF00D'
]

String.prototype.matchAll is especially useful for regular expressions with capture groups:

const string = 'Favorite GitHub repos: tc39/ecma262 v8/v8.dev tc39/test262';
const regex = /\b(?<owner>[a-z0-9]+)\/(?<repo>[a-z0-9\.]+)\b/g;

for (const match of string.matchAll(regex)) {
  console.log(`${match[0]} at ${match.index} with '${match.input}'`);
  console.log(`→ owner: ${match.groups.owner}`);
  console.log(`→ repo: ${match.groups.repo}`);
}

// Output:
//
// tc39/ecma262 at 23 with 'Favorite GitHub repos: tc39/ecma262 v8/v8.dev tc39/test262'
// → owner: tc39
// → repo: ecma262
// v8/v8.dev at 36 with 'Favorite GitHub repos: tc39/ecma262 v8/v8.dev tc39/test262'
// → owner: v8
// → repo: v8.dev
// tc39/test262 at 46 with 'Favorite GitHub repos: tc39/ecma262 v8/v8.dev tc39/test262'
// → owner: tc39
// → repo: test262
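
For comparison, here is a sketch of the exec() loop that matchAll replaces, reusing the hex example from earlier under different variable names. It also illustrates two details of the specified behavior: matchAll works on a clone of the regular expression, so the original lastIndex is untouched, and a regular expression without the g flag is rejected with a TypeError.

```javascript
const input = 'Magic hex numbers: DEADBEEF CAFE 8BADF00D';
const hex = /\b[0-9a-fA-F]+\b/g;

// Older idiom: call exec() until it returns null. This relies on
// (and mutates) hex.lastIndex between iterations.
const viaExec = [];
let m;
while ((m = hex.exec(input)) !== null) {
  viaExec.push([m[0], m.index]);
}

// matchAll iterates over a clone of the regex, leaving hex.lastIndex alone.
const viaMatchAll = Array.from(
  input.matchAll(hex),
  (match) => [match[0], match.index]
);

console.log(viaMatchAll);
// [['DEADBEEF', 19], ['CAFE', 28], ['8BADF00D', 33]]
console.log(hex.lastIndex); // 0

// A regex without the g flag makes matchAll throw.
let nonGlobalError = null;
try {
  input.matchAll(/\b[0-9a-fA-F]+\b/);
} catch (error) {
  nonGlobalError = error;
}
console.log(nonGlobalError instanceof TypeError); // true
```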

Legacy RegExp features

Another proposal specifies certain legacy RegExp features, such as the RegExp.prototype.compile method and the static properties from RegExp.$1 to RegExp.$9. Although these features are deprecated, they unfortunately cannot be removed from the web platform without introducing compatibility issues. Standardizing their behavior and getting engines to align their implementations is therefore the best way forward for web compatibility.
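
To make the static properties concrete, here is a small sketch of what they do in engines that implement them, next to the non-legacy equivalent. The pattern and variable names are made up for this illustration, and the legacy half is shown for recognition only, not as a recommendation.

```javascript
const datePattern = /(\d{4})-(\d{2})/;
const dateMatch = datePattern.exec('2017-01');

// Legacy: the most recent successful match is exposed as global state
// on the RegExp constructor, so any later match overwrites it.
const legacyYear = RegExp.$1;
const legacyMonth = RegExp.$2;

// Preferred: read the captures from the match object itself.
const [, matchYear, matchMonth] = dateMatch;

console.log(legacyYear === matchYear);   // true
console.log(legacyMonth === matchMonth); // true
```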

About me

Hi there! I’m Mathias. I work on Chrome DevTools and the V8 JavaScript engine at Google. HTML, CSS, JavaScript, Unicode, performance, and security get me excited. Follow me on Twitter, Mastodon, and GitHub.

Comments

Wyatt wrote:

I’m not exactly sure what to call it, but are there any proposals for something like what Twitter does in their twitter-text library? They have a function called regexSupplant() which allows them to reference existing RegExp instances in new RegExp instances. Here’s an example taken from their repo:

twttr.txt.regexen.validMentionPrecedingChars = /(?:^|[^a-zA-Z0-9_!#$%&*@＠]|(?:^|[^a-zA-Z0-9_+~.-])(?:rt|RT|rT|Rt):?)/;
twttr.txt.regexen.atSigns = /[@＠]/;
twttr.txt.regexen.validMentionOrList = regexSupplant(
  '(#{validMentionPrecedingChars})' + // $1: Preceding character
  '(#{atSigns})' + // $2: At mark
  '([a-zA-Z0-9_]{1,20})' + // $3: Screen name
  '(\/[a-zA-Z][a-zA-Z0-9_\-]{0,24})?' // $4: List (optional)
, 'g');

Obviously their solution isn’t ideal since it depends on all of the RegExp instances being defined on the twttr.txt.regexen object, but I would love to see something like this make it into the spec. It really makes reading complex regular expressions much easier.

Brian wrote:

You could pull something similar off with a tagged template (ES6). You won’t get regexp syntax highlighting on the template literal but you’d get all the regular IDE support on the different parts (which could be modularized). I came up with a version that probably isn’t the most streamlined but does the job.

Test: https://codepen.io/anon/pen/wgqbaN

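function regexSupplant (strings, ...values) {
  // The strings array of a tagged template is frozen, so shift()
  // would throw on it; work on a mutable copy instead.
  const literals = [...strings]
  const parts = []
  while (literals.length || values.length) {
    if (literals.length) {
      // Drop line breaks and indentation from the literal parts.
      parts.push(literals.shift()
        .split('\n')
        .map((val) => val.trim())
        .join('')
      )
    }
    if (values.length) {
      // Strip the surrounding slashes from the interpolated regex.
      parts.push(values.shift()
        .toString()
        .replace(/(^\/)|(\/$)/g, '')
      )
    }
  }
  const regexpish = parts.join('')
  const pattern = regexpish.replace(/^\/|\/\w*$/g, '')
  const flags = (/\/(\w*)$/g).exec(regexpish)[1]
  return new RegExp(pattern, flags)
}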

const validMentionPrecedingChars = /(?:^|[^a-zA-Z0-9_!#$%&*@＠]|(?:^|[^a-zA-Z0-9_+~.-])(?:rt|RT|rT|Rt):?)/;
const atSigns = /[@＠]/;
const screenName = /[a-zA-Z0-9_]{1,20}/;
const list = /\/[a-zA-Z][a-zA-Z0-9_\-]{0,24}/;

const validMentionOrList = regexSupplant`/
(${validMentionPrecedingChars})
(${atSigns})
(${screenName})
(${list})?
/g`;

console.log(validMentionOrList);
