Mathias Bynens

ECMAScript regular expressions are getting better!

Tagged with JavaScript, Unicode

Support for regular expressions was added to ECMAScript 3 in 1999.

Sixteen years later, ES6/ES2015 introduced Unicode mode (the u flag), sticky mode (the y flag), and the RegExp.prototype.flags getter.

This article highlights what’s happening in the world of JavaScript regular expressions right now. Spoiler: it’s quite a lot — there are more RegExp-related proposals currently advancing through the TC39 standardization process than there have been updates to RegExp in the history of ECMAScript!

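We’ll discuss the following proposals: dotAll mode (the s flag), lookbehind assertions, named capture groups, Unicode property escapes, and legacy RegExp features.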

Keep in mind that these proposals are still in the process of being standardized, and as such, their syntax and APIs are subject to change. The descriptions and code examples in this article match the latest versions of the proposals at the time of writing. ES2018 is the earliest edition of the standard that could include them.

dotAll mode (the s flag)

By default, . matches any character except for line terminators:

/foo.bar/u.test('foo\nbar');
// → false

(It doesn’t match astral Unicode symbols either, but we fixed that by enabling the u flag.)

A proposal introduces dotAll mode, enabled through the s flag. In dotAll mode, . matches line terminators as well.

/foo.bar/su.test('foo\nbar');
// → true
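
As a minimal sketch of the difference, the snippet below runs the same pattern with and without the s flag against each of the four line terminators ECMAScript recognizes. The variable names are ours, for illustration only; the dotAll getter comes from the proposal and may be missing in engines that have not implemented it yet.

```javascript
// The four code points ECMAScript treats as line terminators.
const lineTerminators = ['\n', '\r', '\u2028', '\u2029'];

const withoutDotAll = /^foo.bar$/u;
const withDotAll = /^foo.bar$/su;

const results = lineTerminators.map((terminator) => ({
  withoutDotAll: withoutDotAll.test(`foo${terminator}bar`),
  withDotAll: withDotAll.test(`foo${terminator}bar`),
}));
// Every entry is { withoutDotAll: false, withDotAll: true }.

// The proposal also adds a getter that reflects the flag.
const flagIsSet = withDotAll.dotAll;
// → true

// Pre-proposal workaround: a character class that matches everything.
const workaround = /^foo[\s\S]bar$/u.test('foo\nbar');
// → true
```

The [\s\S] workaround keeps working, but the s flag states the intent directly.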

Lookbehind assertions

Lookarounds are zero-width assertions: they check whether a pattern matches at the current position without consuming any characters. ECMAScript currently supports only lookahead assertions, which do this in the forward direction. Positive lookahead ensures a pattern is followed by another pattern:

const pattern = /\d+(?= dollars)/u;
const result = pattern.exec('42 dollars');
// → result[0] === '42'

Negative lookahead ensures a pattern is not followed by another pattern:

const pattern = /\d+(?! dollars)/u;
const result = pattern.exec('42 pesos');
// → result[0] === '42'

A proposal adds support for lookbehind assertions. Positive lookbehind ensures a pattern is preceded by another pattern:

const pattern = /(?<=\$)\d+/u;
const result = pattern.exec('$42');
// → result[0] === '42'

Negative lookbehind ensures a pattern is not preceded by another pattern:

const pattern = /(?<!\$)\d+/u;
const result = pattern.exec('€42');
// → result[0] === '42'
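
One subtlety is worth a sketch: a negative lookbehind is checked at every position, so with the g flag /(?<!\$)\d+/ can still match the tail of a dollar amount. The sample string and the tightened pattern below are our own illustration, not part of the proposal.

```javascript
const prices = '$10 and €20 and $30';

// Positive lookbehind: digits directly preceded by a dollar sign.
const dollarAmounts = prices.match(/(?<=\$)\d+/gu);
// → ['10', '30']

// Negative lookbehind alone also matches the trailing digits of the
// dollar amounts, because the '0' in '$10' is preceded by '1', not '$'.
const naive = prices.match(/(?<!\$)\d+/gu);
// → ['0', '20', '0']

// Also excluding a preceding digit keeps a match from starting mid-number.
const nonDollarAmounts = prices.match(/(?<![$\d])\d+/gu);
// → ['20']
```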

Named capture groups

Currently, each capture group in a regular expression is numbered and can be referenced using that number:

const pattern = /(\d{4})-(\d{2})-(\d{2})/u;
const result = pattern.exec('2017-01-25');
// → result[0] === '2017-01-25'
// → result[1] === '2017'
// → result[2] === '01'
// → result[3] === '25'

This is useful, but not very readable or maintainable. Whenever the order of capture groups in the pattern changes, the indices need to be updated accordingly.

A proposal adds support for named capture groups, enabling more readable and maintainable code.

const pattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u;
const result = pattern.exec('2017-01-25');
// → result.groups.year === '2017'
// → result.groups.month === '01'
// → result.groups.day === '25'
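
The groups object also works with destructuring, and the proposal lets a name be referenced from within the pattern via \k<name> and from a replacement string via $<name>. A minimal sketch, with variable names and sample strings of our choosing:

```javascript
const pattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u;

// Destructure the named groups instead of indexing into the result.
const { year, month, day } = pattern.exec('2017-01-25').groups;
// → year === '2017', month === '01', day === '25'

// Refer to groups by name in a replacement string.
const reordered = '2017-01-25'.replace(pattern, '$<day>/$<month>/$<year>');
// → '25/01/2017'

// Refer to an earlier group by name inside the pattern itself.
const repeatedWord = /\b(?<word>\w+) \k<word>\b/u;
const hasRepeat = repeatedWord.test('it is is fine');
// → true
```

Because every reference goes through a name, reordering the groups in the pattern does not require touching the code that consumes them.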

Unicode property escapes

The Unicode Standard assigns various properties and property values to every symbol. For example, to get the set of symbols that are used in the Greek script, search the Unicode database for symbols whose Script_Extensions property includes Greek.

Unicode property escapes make it possible to access these Unicode character properties natively in ECMAScript regular expressions. For example, the pattern \p{Script_Extensions=Greek} matches every symbol that is used in the Greek script.

const regexGreekSymbol = /\p{Script_Extensions=Greek}/u;
regexGreekSymbol.test('π');
// → true

Previously, developers wishing to use equivalent regular expressions in JavaScript had to resort to large run-time dependencies or build scripts, both of which lead to performance and maintainability problems. With built-in support for Unicode property escapes, creating regular expressions based on Unicode properties couldn’t be easier.
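
Beyond Script_Extensions, the proposal covers General_Category values, Script, and binary properties, and \P{…} negates a property. A short sketch follows; the sample strings are ours, and the exact set of supported properties depends on the version of the proposal an engine implements.

```javascript
// Pull the Greek run out of a mixed-script string.
const greekRun = 'abc αβγ def'.match(/\p{Script_Extensions=Greek}+/u)[0];
// → 'αβγ'

// \P{…} (uppercase P) is the negated form.
const notGreek = /^\P{Script_Extensions=Greek}+$/u.test('abc');
// → true

// General_Category values can be written with or without the property name.
const startsUppercase = /^\p{General_Category=Uppercase_Letter}/u.test('\u00C9mile');
const startsUppercaseShort = /^\p{Lu}/u.test('\u00C9mile');
// → both true

// Binary properties take no value.
const allAlphabetic = /^\p{Alphabetic}+$/u.test('na\u00EFve');
// → true
```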

Legacy RegExp features

Another proposal specifies certain legacy RegExp features, such as the RegExp.prototype.compile method and the static properties from RegExp.$1 to RegExp.$9. Although these features are deprecated, they cannot be removed from the web platform without breaking existing content. Standardizing their behavior and getting engines to align their implementations is therefore the best way forward for web compatibility.
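
For readers who have not run into these features, here is a sketch of what they look like in engines that ship them today. It only illustrates the kind of behavior the proposal pins down and is not a recommendation to use them; the sample strings and variable names are ours.

```javascript
// After a successful match, the static properties RegExp.$1 to RegExp.$9
// hold the captures of the most recent match, program-wide.
/(\d{4})-(\d{2})/.exec('2017-01');
const legacyYear = RegExp.$1;
const legacyMonth = RegExp.$2;
// → '2017' and '01' in engines that ship these properties

// compile() re-initializes an existing RegExp object in place.
const regex = /foo/;
regex.compile('bar', 'g');
const recompiled = [regex.source, regex.flags];
// → ['bar', 'g']
```

Both features rely on hidden shared or mutable state, which is a large part of why they are deprecated.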

About me

Hi there! I’m Mathias, a web standards enthusiast from Belgium. HTML, CSS, JavaScript, Unicode, performance, and security get me excited. If you managed to read this far without falling asleep, you should follow me on Twitter and GitHub.

Comments

Wyatt wrote:

I’m not exactly sure what to call it, but are there any proposals for something like what Twitter does in their twitter-text library? They have a function called regexSupplant() which allows them to reference existing RegExp instances in new RegExp instances. Here’s an example taken from their repo:

twttr.txt.regexen.validMentionPrecedingChars = /(?:^|[^a-zA-Z0-9_!#$%&*@＠]|(?:^|[^a-zA-Z0-9_+~.-])(?:rt|RT|rT|Rt):?)/;
twttr.txt.regexen.atSigns = /[@＠]/;
twttr.txt.regexen.validMentionOrList = regexSupplant(
  '(#{validMentionPrecedingChars})' + // $1: Preceding character
  '(#{atSigns})' + // $2: At mark
  '([a-zA-Z0-9_]{1,20})' + // $3: Screen name
  '(\/[a-zA-Z][a-zA-Z0-9_\-]{0,24})?' // $4: List (optional)
, 'g');

Obviously their solution isn’t ideal since it depends on all of the RegExp instances being defined on the twttr.txt.regexen object, but I would love to see something like this make it into the spec. It really makes reading complex regular expressions much easier.

Brian wrote:

You could pull something similar off with a tagged template (ES6). You won’t get regexp syntax highlighting on the template literal but you’d get all the regular IDE support on the different parts (which could be modularized). I came up with a version that probably isn’t the most streamlined but does the job.

Test: https://codepen.io/anon/pen/wgqbaN

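function regexSupplant (strings, ...values) {
  // The strings array passed to a tag function is frozen, so walk it by
  // index rather than consuming it with shift().
  const parts = []
  strings.forEach((string, index) => {
    // Literal part: drop newlines and surrounding whitespace on each line.
    parts.push(string
      .split('\n')
      .map((val) => val.trim())
      .join('')
    )
    // Interpolated RegExp: strip its delimiting slashes.
    if (index < values.length) {
      parts.push(values[index]
        .toString()
        .replace(/(^\/)|(\/$)/g, '')
      )
    }
  })
  const regexpish = parts.join('')
  // Split the assembled /pattern/flags text into its two halves.
  const pattern = regexpish.replace(/^\/|\/\w*$/g, '')
  const flags = (/\/(\w*)$/).exec(regexpish)[1]
  return new RegExp(pattern, flags)
}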

const validMentionPrecedingChars = /(?:^|[^a-zA-Z0-9_!#$%&*@＠]|(?:^|[^a-zA-Z0-9_+~.-])(?:rt|RT|rT|Rt):?)/;
const atSigns = /[@＠]/;
const screenName = /[a-zA-Z0-9_]{1,20}/;
const list = /\/[a-zA-Z][a-zA-Z0-9_\-]{0,24}/;

const validMentionOrList = regexSupplant`/
(${validMentionPrecedingChars})
(${atSigns})
(${screenName})
(${list})?
/g`;

console.log(validMentionOrList);
