Unicode-aware regular expressions in ES2015

Published 26th August 2014 · tagged with HTML, JavaScript, Unicode

ECMAScript 2015 introduces two new flags for regular expressions:

y enables ‘sticky’ matching.
u enables various Unicode-related features.

This article explains the effects of the u flag. It helps to have read ‘JavaScript has a Unicode problem’ first.

Impact on syntax

Setting the u flag on a regular expression enables the use of ES2015 Unicode code point escapes (\u{…}) in the pattern.

// Note: `a` is U+0061 LATIN SMALL LETTER A, a BMP symbol.
console.log(/\u{61}/u.test('a'));
// → true

// Note: `𝌆` is U+1D306 TETRAGRAM FOR CENTRE, an astral symbol.
console.log(/\u{1D306}/u.test('𝌆'));
// → true

Without the flag, things like \u{1234} can technically still occur in patterns, but they won’t be interpreted as Unicode code point escapes. /\u{1234}/ is equivalent to /u{1234}/, which matches 1234 consecutive u symbols rather than the symbol with code point U+1234.

Engines do this for compatibility reasons. But with the u flag set, this changes too: things like \a (where a is not an escape character) won’t be equivalent to a anymore. So even though /\a/ is treated as /a/, /\a/u throws an error, because \a is not a reserved escape sequence. This makes it possible to extend u regular expressions in a future version of ECMAScript. For example, /\p{Script=Greek}/u throws an exception per ES2015, but could become a regular expression that matches all symbols in the Greek script according to the Unicode database once syntax for Unicode property escapes is added to the spec.
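Both behaviors can be checked with a short sketch. It builds the stricter pattern through the RegExp constructor so that the invalid one throws at run time instead of preventing the script from parsing. The helper name `compiles` is made up for this illustration.

```javascript
// Hypothetical helper: reports whether a pattern compiles with the given flags.
function compiles(pattern, flags) {
	try {
		new RegExp(pattern, flags);
		return true;
	} catch (error) {
		// Invalid patterns throw a SyntaxError.
		return false;
	}
}

// Without `u`, `\u{3}` is just `u{3}`: three consecutive `u` symbols.
const withoutFlag = /^\u{3}$/.test('uuu');
// With `u`, `\u{3}` is the symbol with code point U+0003.
const withFlag = /^\u{3}$/u.test('\u0003');
console.log(withoutFlag, withFlag);
// → true true

// `\a` is tolerated without `u`, but rejected with it.
const looseIdentityEscape = compiles('\\a', '');
const strictIdentityEscape = compiles('\\a', 'u');
console.log(looseIdentityEscape, strictIdentityEscape);
// → true false
```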

Impact on the `.` operator

Without the u flag, . matches any BMP symbol except line terminators. When the ES2015 u flag is set, . matches astral symbols too.

// Note: `𝌆` is U+1D306 TETRAGRAM FOR CENTRE, an astral symbol.
const string = 'a𝌆b';

console.log(/a.b/.test(string));
// → false

console.log(/a.b/u.test(string));
// → true

const match = string.match(/a(.)b/u);
console.log(match[1]);
// → '𝌆'
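One practical consequence, sketched here with made-up variable names: a global `.` match splits a string into code units without the u flag, and into whole symbols with it. Line terminators remain excluded either way.

```javascript
// Note: `𝌆` is an astral symbol, stored as two code units.
const text = 'a𝌆b';

// Without `u`, each surrogate half is matched separately.
const codeUnits = text.match(/./g);
// With `u`, the surrogate pair is matched as one symbol.
const symbols = text.match(/./gu);

console.log(codeUnits.length);
// → 4
console.log(symbols.length);
// → 3

// The `u` flag does not make `.` match line terminators.
const matchesNewline = /^.$/u.test('\n');
console.log(matchesNewline);
// → false
```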

Impact on quantifiers

The available quantifiers in JavaScript regular expressions are *, +, ?, and {2}, {2,}, {2,4}, and variations of those. Without the u flag, if a quantifier follows an atom that consists of an astral symbol, it applies only to the low surrogate of that symbol.

// Note: `a` is a BMP symbol.
console.log(/a{2}/.test('aa'));
// → true

// Note: `𝌆` is an astral symbol.
console.log(/𝌆{2}/.test('𝌆𝌆'));
// → false

// Explanation: the previous example is equivalent to the following.
console.log(/\uD834\uDF06{2}/.test('\uD834\uDF06\uD834\uDF06'));
// → false

With the ES2015 u flag, quantifiers apply to whole symbols, even for astral symbols.

// Note: `a` is a BMP symbol.
console.log(/a{2}/u.test('aa'));
// → true

// Note: `𝌆` is an astral symbol.
console.log(/𝌆{2}/u.test('𝌆𝌆'));
// → true
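To make the ‘low surrogate only’ behavior concrete, the following sketch feeds both variants a string in which only the low surrogate is repeated. The variable names are chosen for this illustration.

```javascript
// A high surrogate followed by the same low surrogate twice.
const strayLowSurrogates = '\uD834\uDF06\uDF06';

const withoutFlag = /^𝌆{2}$/;
const withFlag = /^𝌆{2}$/u;

// Without `u`, `{2}` repeats only the low surrogate `\uDF06`.
const looseResults = [
	withoutFlag.test(strayLowSurrogates),
	withoutFlag.test('𝌆𝌆')
];
console.log(looseResults);
// → [ true, false ]

// With `u`, `{2}` repeats the whole symbol.
const strictResults = [
	withFlag.test(strayLowSurrogates),
	withFlag.test('𝌆𝌆')
];
console.log(strictResults);
// → [ false, true ]
```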

Impact on character classes

Without the u flag, any given character class can only match BMP symbols. Things like [bcd] work as expected:

const regex = /^[bcd]$/;
console.log(
	regex.test('a'), // false
	regex.test('b'), // true
	regex.test('c'), // true
	regex.test('d'), // true
	regex.test('e')  // false
);

However, when an astral symbol is used in a character class, the JavaScript engine treats it as two separate ‘characters’: one for each of its surrogate halves.

// Note: `𝌆` is an astral symbol.
const regex = /^[bc𝌆]$/;
console.log(
	regex.test('a'), // false
	regex.test('b'), // true
	regex.test('c'), // true
	regex.test('𝌆')  // false
);

// Explanation: the regular expression is equivalent to the following.
// const regex = /^[bc\uD834\uDF06]$/;

The ES2015 u flag enables the use of whole astral symbols in character classes.

// Note: `𝌆` is an astral symbol.
const regex = /^[bc𝌆]$/u; // Or, `/^[bc\u{1D306}]$/u`.
console.log(
	regex.test('a'), // false
	regex.test('b'), // true
	regex.test('c'), // true
	regex.test('𝌆')  // true
);

Consequently, whole astral symbols can also be used in character class ranges, and everything will work as expected as long as the u flag is set.

// Match any symbol from U+1F4A9 PILE OF POO to U+1F4AB DIZZY SYMBOL.
const regex = /[💩-💫]/u; // Or, `/[\u{1F4A9}-\u{1F4AB}]/u`.
console.log(
	regex.test('💨'), // false
	regex.test('💩'), // true
	regex.test('💪'), // true
	regex.test('💫'), // true
	regex.test('💬')  // false
);
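For comparison, the same range without the u flag does not even compile: the class is read as surrogate halves, and the range from the low surrogate of 💩 to the high surrogate of 💫 is out of order. A minimal sketch, using a hypothetical `tryCompile` helper:

```javascript
// Hypothetical helper: returns the compiled RegExp, or the error it threw.
function tryCompile(source, flags) {
	try {
		return new RegExp(source, flags);
	} catch (error) {
		return error;
	}
}

// Read as `[\uD83D\uDCA9-\uD83D\uDCAB]`; `\uDCA9-\uD83D` is out of order.
const withoutFlag = tryCompile('[💩-💫]', '');
// Read as the code point range U+1F4A9 to U+1F4AB.
const withFlag = tryCompile('[💩-💫]', 'u');

console.log(withoutFlag instanceof SyntaxError);
// → true
console.log(withFlag instanceof RegExp);
// → true
```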

The u flag also affects negated character classes. For example, /[^a]/ is equivalent to /[\0-\x60\x62-\uFFFF]/, which would match any BMP symbol except a. But with the u flag, /[^a]/u matches the much bigger set of all Unicode symbols except a.

const regex = /^[^a]$/u;
console.log(
	regex.test('a'), // false
	regex.test('b'), // true
	regex.test('☃'), // true
	regex.test('𝌆'), // true
	regex.test('💩')  // true
);

Impact on character class escapes

The u flag affects the meaning of the character class escapes \D, \S, and \W. Without the u flag, \D, \S, and \W match any BMP symbols that are not matched by \d, \s, and \w, respectively.

const regex = /^\S$/;
console.log(
	regex.test(' '), // false
	regex.test('a'), // true
	// Note: `𝌆` is an astral symbol.
	regex.test('𝌆')  // false
);

With the u flag, \D, \S, and \W match astral symbols too.

const regex = /^\S$/u;
console.log(
	regex.test(' '), // false
	regex.test('a'), // true
	// Note: `𝌆` is an astral symbol.
	regex.test('𝌆')  // true
);

Their inverse counterparts \d, \s, and \w are not affected by the u flag. There was a proposal to make \d and \w (and \b) more Unicode-aware, but it was rejected.
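A quick sketch of what ‘not affected’ means in practice: \d and \w stay ASCII-only with the u flag, and \s matches the same whitespace symbols as before. The sample symbols are picked for this illustration.

```javascript
// U+0663 ARABIC-INDIC DIGIT THREE is a decimal digit in Unicode,
// but `\d` only matches [0-9], with or without `u`.
const arabicDigit = [/^\d$/.test('\u0663'), /^\d$/u.test('\u0663')];
console.log(arabicDigit);
// → [ false, false ]

// U+00E9 LATIN SMALL LETTER E WITH ACUTE is a letter,
// but `\w` only matches [0-9A-Z_a-z] (ignoring the `i` flag).
const accentedLetter = [/^\w$/.test('\u00E9'), /^\w$/u.test('\u00E9')];
console.log(accentedLetter);
// → [ false, false ]

// U+00A0 NO-BREAK SPACE is matched by `\s` in both modes.
const noBreakSpace = [/^\s$/.test('\u00A0'), /^\s$/u.test('\u00A0')];
console.log(noBreakSpace);
// → [ true, true ]
```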

Impact on the `i` flag

When both the i and u flags are set, all symbols are implicitly case-folded using the simple mapping provided by the Unicode standard immediately before they are compared.

const es5regex = /[a-z]/i;
const es6regex = /[a-z]/iu;
console.log(
	es5regex.test('s'),      es6regex.test('s'),      // true true
	es5regex.test('S'),      es6regex.test('S'),      // true true
	// Note: U+017F canonicalizes to `S`.
	es5regex.test('\u017F'), es6regex.test('\u017F'), // false true
	// Note: U+212A canonicalizes to `K`.
	es5regex.test('\u212A'), es6regex.test('\u212A')  // false true
);

The case folding applies to the symbols in the regular expression pattern as well as the symbols in the string to be matched.

console.log(
	/\u212A/iu.test('K'), // true
	/\u212A/iu.test('k'), // true
	/\u017F/iu.test('S'), // true
	/\u017F/iu.test('s')  // true
);

This case-folding logic applies to the \w and \W character escapes as well, which also affects \b and \B. /\w/iu matches [0-9A-Z_a-z] but also U+017F because U+017F canonicalizes to S which is in the match set. The same goes for U+212A and K.

console.log(
	/\w/iu.test('\u017F'), // true
	/\w/iu.test('\u212A'), // true
	/\W/iu.test('\u017F'), // false
	/\W/iu.test('\u212A'), // false
	/\W/iu.test('s'),      // false
	/\W/iu.test('S'),      // false
	/\W/iu.test('K'),      // false
	/\W/iu.test('k'),      // false
	/\b/iu.test('\u017F'), // true
	/\b/iu.test('\u212A'), // true
	/\b/iu.test('s'),      // true
	/\b/iu.test('S'),      // true
	/\B/iu.test('\u017F'), // false
	/\B/iu.test('\u212A'), // false
	/\B/iu.test('s'),      // false
	/\B/iu.test('S'),      // false
	/\B/iu.test('K'),      // false
	/\B/iu.test('k')       // false
);

Note: An annoying result of this case-folding logic is that, per the original ES2015 spec, /\w/iu was no longer the inverse of /\W/iu. Remember how /\w/iu matches [0-9A-Z_a-z] but also U+017F and U+212A? This makes sense. However, in ES2015, /\W/iu also matched U+017F, and strangely, S, because \W includes U+017F which matches either the U+017F symbol itself or its canonicalized version S. The same applied to U+212A and K. In other words, /\W/iu was equivalent to /[^0-9a-jl-rt-zA-JL-RT-Z_]/u. 😕 This was rectified in June 2016. Now, /\W/iu doesn’t match S, K, U+017F, or U+212A anymore, making /\W/iu the inverse of /\w/iu again. /\W/iu is now equivalent to /[^0-9a-zA-Z_\u{017F}\u{212A}]/u. Whew.
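One way to check which behavior an engine implements is to test whether \w and \W disagree on every symbol involved. This is only a sketch; the sample list and the name `isInverse` are made up, and the expected output assumes an engine that includes the June 2016 fix.

```javascript
// The symbols affected by the original ES2015 wording, plus two controls.
const samples = ['s', 'S', 'k', 'K', '\u017F', '\u212A', '!', ' '];

// For a true inverse, exactly one of `\w` and `\W` matches each symbol.
const isInverse = samples.every(function(symbol) {
	return /^\w$/iu.test(symbol) !== /^\W$/iu.test(symbol);
});

console.log(isInverse);
// → true (in engines that implement the fix)
```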

Impact on HTML documents

Believe it or not, the existence of the u flag has an effect on HTML documents as well.

The pattern attribute for input elements allows you to specify a regular expression to validate the user’s input against. The browser then provides you with styling and scripting hooks to make stuff happen based on the input’s validity.

<style>
	:invalid { background: red; }
	:valid { background: green; }
</style>
<input pattern="a.b" value="aXXb"><!-- gets a red background -->
<input pattern="a.b" value="a𝌆b"><!-- gets a green background -->

The u flag is always enabled for regular expressions compiled through the HTML pattern attribute. Here’s a demo / test case.
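The browser’s check can be approximated in JavaScript. The sketch below assumes the behavior described above (the u flag is always set) plus the HTML rule that the pattern must match the entire value, which amounts to wrapping it in ^(?: and )$. The function name `matchesPattern` is hypothetical, and real browsers also handle cases this sketch ignores, such as empty values and invalid patterns.

```javascript
// Hypothetical approximation of how a `pattern` attribute value is applied.
function matchesPattern(pattern, value) {
	// Anchor the pattern to the whole value and always enable `u`.
	const regex = new RegExp('^(?:' + pattern + ')$', 'u');
	return regex.test(value);
}

const twoSymbols = matchesPattern('a.b', 'aXXb');
const astralSymbol = matchesPattern('a.b', 'a𝌆b');

console.log(twoSymbols);
// → false
console.log(astralSymbol);
// → true
```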

Support

At the moment, the ES2015 u flag for regular expressions is available in stable releases of every major browser. Browsers are slowly starting to enable this functionality for the HTML pattern attribute.

Browser(s)	JavaScript engine	`u` flag	`u` flag for `pattern` attribute
Edge	Chakra	✅ issue #1102227 + ✅ issue #517 + ✅ issue #1181 + ✅ issue #4368	❌ issue #7113940
Firefox	SpiderMonkey	✅ bug #1135377 + ✅ bug #1281739	✅ bug #1227906
Chrome/Opera	V8	✅ V8 issue #2952 + ✅ issue #5080	✅ issue #535441
WebKit	JavaScriptCore	✅ bug #154842 + ✅ bug #151597 + ✅ bug #158505	✅ bug #151598
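Rather than relying on a support table, code can detect the flag at run time. A minimal sketch follows; the function name is made up, and it goes through the RegExp constructor because a /…/u literal would be a parse error in engines that lack the flag.

```javascript
// Hypothetical feature test for the `u` flag.
function supportsUnicodeFlag() {
	try {
		// `.` should match the whole astral symbol U+1D306, written here
		// as a surrogate pair so that the string itself is valid ES5.
		return new RegExp('^.$', 'u').test('\uD834\uDF06');
	} catch (error) {
		// Engines without the flag throw on the unknown flag.
		return false;
	}
}

const hasUnicodeFlag = supportsUnicodeFlag();
console.log(hasUnicodeFlag);
// → true (in the engines listed above)
```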

Recommendations for developers

Use the u flag for every regular expression you write from now on.
…but don’t blindly add the u flag to existing regular expressions, as it might change their meaning in subtle ways.
Avoid combining the u and i flags. It’s better to be explicit and include all letter cases in your regular expression itself than to be surprised by implicit case folding.
Use a transpiler to make sure your code runs everywhere, including legacy environments.
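Two of these recommendations can be illustrated with a small sketch. The first part compares implicit case folding with an explicit character class; the second shows one way adding the flag changes an existing pattern, since `\-` outside a character class is tolerated without u but rejected with it. The helper name `compiles` is made up.

```javascript
// Implicit folding: U+212A KELVIN SIGN folds to `k`, so it matches.
const implicit = /^[a-z]$/iu;
// Explicit letter cases: only the listed symbols match.
const explicit = /^[A-Za-z]$/u;

const kelvinSign = [implicit.test('\u212A'), explicit.test('\u212A')];
console.log(kelvinSign);
// → [ true, false ]

// Hypothetical helper: reports whether a pattern compiles with the given flags.
function compiles(pattern, flags) {
	try {
		new RegExp(pattern, flags);
		return true;
	} catch (error) {
		return false;
	}
}

// An escaped hyphen outside a character class is only valid without `u`.
const escapedHyphen = [compiles('\\-', ''), compiles('\\-', 'u')];
console.log(escapedHyphen);
// → [ true, false ]
```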

Transpiling ES6 Unicode regular expressions to ES5

I created regexpu, a transpiler that rewrites ES6 Unicode regular expressions into equivalent ES5 code that works today. This enables you to play around with these upcoming new features. Try it out now!

The regexpu demo page transpiles ES6 Unicode regular expressions as you type.

Full-blown ES6/ES7 transpilers like Traceur and Babel depend on regexpu for their u transpilation. Let me know if you manage to break it.
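To give a feel for what such a rewrite involves, here is a simplified, hand-written ES5-style pattern for /^a.b$/u. It is not regexpu’s actual output, and it deliberately ignores edge cases such as lone surrogates; the variable names are made up.

```javascript
const es2015 = /^a.b$/u;

// `.` with `u` becomes: a surrogate pair, or any single code unit
// that is not a line terminator. The pair must be tried first.
const es5 = /^a(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[^\n\r\u2028\u2029])b$/;

const inputs = ['a𝌆b', 'aXb', 'aXXb', 'a\nb'];

// Both patterns should agree on every sample input.
const agree = inputs.every(function(input) {
	return es2015.test(input) === es5.test(input);
});

console.log(agree);
// → true
console.log(es5.test('a𝌆b'));
// → true
```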

Comments

Zheng Yin Bo wrote on 22nd March 2016 at 18:48:

Awesome post! I have checked the CaseFolding mapping and I found that '\u017F' with 'S' and '\u212A' with 'K' are the only pairs both in status: C which implies common case folding from a to z. In such pairs, code like es6regex.test('\u017F') returns true. But there’s also status F (full case folding) and S (simple case folding). Do you know what happens for such cases with the iu flags set? Or is there any documentation about case folding in JavaScript in such cases I could refer to? Thanks.

Mathias wrote on 23rd March 2016 at 10:48:

Zheng: As mentioned in the post, the simple mapping is used with the iu flags set.

Mathias Bynens
