Mathias Bynens

Unicode property escapes in JavaScript regular expressions

Tagged with JavaScript, Unicode

There is a formal proposal to add support for Unicode property escapes of the form \p{…} and \P{…} to JavaScript regular expressions.

The proposal is currently in stage 3 of the TC39 Process.

This article explains what Unicode property escapes are, how they work, and why they’re useful.

Introduction

The Unicode Standard assigns various properties and property values to every symbol. For example, to get the set of symbols that are used in the Greek script, search the Unicode database for symbols whose Script_Extensions property includes Greek.

Unicode property escapes make it possible to access these Unicode character properties natively in ECMAScript regular expressions. For example, the pattern \p{Script_Extensions=Greek} matches every symbol that is used in the Greek script.

const regexGreekSymbol = /\p{Script_Extensions=Greek}/u;
regexGreekSymbol.test('π');
// → true

Previously, developers wishing to use equivalent regular expressions in JavaScript had to resort to large run-time dependencies or build scripts, both of which lead to performance and maintainability problems. With built-in support for Unicode property escapes, creating regular expressions based on Unicode properties couldn’t be easier.

API

For optimal backwards compatibility, Unicode property escapes are only available in regular expressions with the u flag set.
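To make the role of the u flag concrete, here is a small sketch. It assumes an engine that supports property escapes and implements the usual web-compatible (Annex B) regular expression syntax, where \p without the u flag is simply an escaped letter p.

```javascript
// With the u flag, \p{…} is a Unicode property escape.
const withUnicodeFlag = /\p{Uppercase}/u;
console.log(withUnicodeFlag.test('Δ')); // → true
console.log(withUnicodeFlag.test('δ')); // → false

// Without the u flag, \p is an escaped letter "p" and the braces are
// literal characters, so the pattern matches the text "p{Uppercase}".
const withoutUnicodeFlag = /\p{Uppercase}/;
console.log(withoutUnicodeFlag.test('Δ')); // → false
console.log(withoutUnicodeFlag.test('p{Uppercase}')); // → true
```

Because existing patterns without the u flag keep their old meaning, adding the feature cannot break them.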

Non-binary properties

Unicode property escapes for non-binary Unicode properties use the following syntax:

\p{UnicodePropertyName=UnicodePropertyValue}

The current proposal guarantees support for the following non-binary Unicode properties and their values: General_Category, Script, and Script_Extensions.

\p{General_Category=Decimal_Number}
\p{Script=Greek}
\p{Script_Extensions=Greek}
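The difference between Script and Script_Extensions is easiest to see with a symbol that is used in Greek text but whose Script value is not Greek. The sketch below assumes the Unicode data for U+0342 COMBINING GREEK PERISPOMENI (Script is Inherited, Script_Extensions is Greek); the variable names are mine.

```javascript
// U+0342 COMBINING GREEK PERISPOMENI is a combining mark: its Script value
// is Inherited, while its Script_Extensions value is Greek (assumption
// based on the Unicode Character Database).
const perispomeni = '\u0342';

const regexScriptGreek = /\p{Script=Greek}/u;
const regexScriptExtensionsGreek = /\p{Script_Extensions=Greek}/u;

// An ordinary Greek letter matches both patterns.
console.log(regexScriptGreek.test('π')); // → true
console.log(regexScriptExtensionsGreek.test('π')); // → true

// The combining mark only matches via Script_Extensions.
console.log(regexScriptGreek.test(perispomeni)); // → false
console.log(regexScriptExtensionsGreek.test(perispomeni)); // → true
```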

For General_Category values, the General_Category= part may be omitted.

\p{General_Category=Decimal_Number}
\p{Decimal_Number}
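As a quick check that the two spellings behave identically, the following sketch runs both against a few sample strings. The sample strings and variable names are only illustrative.

```javascript
const longForm = /^\p{General_Category=Decimal_Number}+$/u;
const shortForm = /^\p{Decimal_Number}+$/u;

// ASCII digits, Arabic-Indic digits, Latin letters, and a vulgar fraction.
const samples = ['2016', '٢٠١٦', 'MMXVI', '½'];

// Collect [sample, long-form result, short-form result] triples.
const results = samples.map((sample) => [
  sample,
  longForm.test(sample),
  shortForm.test(sample),
]);

for (const [sample, long, short] of results) {
  console.log(`${ sample }: long form ${ long }, short form ${ short }`);
}
```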

Binary properties

Specifying a value for a binary property using the syntax above triggers a SyntaxError. Since binary Unicode properties only have two possible values (Yes or No), you only need to specify the property name. Use \p{…} to match symbols having the property (Yes) and \P{…} to match the negated set (No).

\p{White_Space}
\P{White_Space}
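Here is a minimal sketch of both forms in action, plus the error you get when supplying a value for a binary property. The RegExp constructor is used for the invalid pattern so that the error can be caught at run time rather than breaking the whole script at parse time.

```javascript
const regexWhiteSpace = /\p{White_Space}/u;
const regexNotWhiteSpace = /\P{White_Space}/u;

// U+00A0 NO-BREAK SPACE has the White_Space property.
console.log(regexWhiteSpace.test('\u00A0')); // → true
console.log(regexNotWhiteSpace.test('\u00A0')); // → false
console.log(regexNotWhiteSpace.test('a')); // → true

// Binary properties do not accept a value.
let binaryValueError = null;
try {
  new RegExp('\\p{White_Space=Yes}', 'u');
} catch (error) {
  binaryValueError = error;
}
console.log(binaryValueError instanceof SyntaxError); // → true
```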

The current proposal guarantees support for a subset of the available binary Unicode properties, including (but not limited to) the ones required by UTS18 RL1.2: ASCII, Alphabetic, Any, Assigned, Default_Ignorable_Code_Point, Lowercase, Noncharacter_Code_Point, Uppercase, and White_Space. Note that this includes the binary properties defined in UTR51: Emoji, Emoji_Component, Emoji_Presentation, Emoji_Modifier, and Emoji_Modifier_Base.

Property and value aliases

The aliases defined in PropertyAliases.txt and PropertyValueAliases.txt may be used instead of the canonical property and value names. I wouldn’t recommend doing so, as it makes the patterns harder to read.

The use of an unknown property name or value triggers a SyntaxError.
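The sketch below shows an alias next to its canonical spelling and a way to observe the SyntaxError for unknown names. The helper isValidUnicodePattern is a hypothetical name of mine, not part of the proposal; scx and Grek are the aliases for Script_Extensions and Greek.

```javascript
// Canonical names and their aliases describe the same set of symbols.
const canonical = /^\p{Script_Extensions=Greek}+$/u;
const aliased = /^\p{scx=Grek}+$/u;
console.log(canonical.test('αβγ')); // → true
console.log(aliased.test('αβγ')); // → true

// Hypothetical helper: reports whether a pattern source compiles with
// the u flag. Only SyntaxError is treated as "invalid pattern".
function isValidUnicodePattern(source) {
  try {
    new RegExp(source, 'u');
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) {
      return false;
    }
    throw error;
  }
}

console.log(isValidUnicodePattern('\\p{Script=Greek}')); // → true
console.log(isValidUnicodePattern('\\p{Script=Not_A_Script}')); // → false
console.log(isValidUnicodePattern('\\p{Not_A_Property}')); // → false
```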

Examples

Matching emoji

To match emoji symbols, the binary properties from UTR51 come in handy.

const regex = /\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu;

This regular expression matches, from left to right:

  1. emoji with optional modifiers (\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?);
  2. any remaining symbols that render as emoji rather than text by default (\p{Emoji_Presentation});
  3. symbols that render as text by default, but are forced to render as emoji using U+FE0F VARIATION SELECTOR-16 (\p{Emoji}\uFE0F).

const regex = /\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu;
const text = `
\u{231A}: ⌚ default emoji presentation character (Emoji_Presentation)
\u{2194}\u{FE0F}: ↔️ default text presentation character rendered as emoji
\u{1F469}: 👩 emoji modifier base (Emoji_Modifier_Base)
\u{1F469}\u{1F3FF}: 👩🏿 emoji modifier base followed by a modifier
`;

let match;
while (match = regex.exec(text)) {
  const emoji = match[0];
  console.log(`Matched sequence ${ emoji } — code points: ${ [...emoji].length }`);
}

Console output:

Matched sequence ⌚ — code points: 1
Matched sequence ⌚ — code points: 1
Matched sequence ↔️ — code points: 2
Matched sequence ↔️ — code points: 2
Matched sequence 👩 — code points: 1
Matched sequence 👩 — code points: 1
Matched sequence 👩🏿 — code points: 2
Matched sequence 👩🏿 — code points: 2

For a more complete solution that matches emoji sequences & ZWJ sequences as well, see emoji-regex.

Unicode-aware version of \w

To match any word symbol in Unicode rather than just ASCII [a-zA-Z0-9_], use [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}] as per UTS18.

const regex = /([\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]+)/gu;
const text = `
Amharic: የኔ ማንዣበቢያ መኪና በዓሣዎች ተሞልቷል
Bengali: আমার হভারক্রাফ্ট কুঁচে মাছ-এ ভরা হয়ে গেছে
Georgian: ჩემი ხომალდი საჰაერო ბალიშზე სავსეა გველთევზებით
Macedonian: Моето летачко возило е полно со јагули
Vietnamese: Tàu cánh ngầm của tôi đầy lươn
`;

let match;
while (match = regex.exec(text)) {
  const word = match[1];
  console.log(`Matched word with length ${ word.length }: ${ word }`);
}

Console output:

Matched word with length 7: Amharic
Matched word with length 2: የኔ
Matched word with length 6: ማንዣበቢያ
Matched word with length 3: መኪና
Matched word with length 5: በዓሣዎች
Matched word with length 5: ተሞልቷል
Matched word with length 7: Bengali
Matched word with length 4: আমার
Matched word with length 11: হভারক্রাফ্ট
Matched word with length 5: কুঁচে
Matched word with length 3: মাছ
Matched word with length 1: এ
Matched word with length 3: ভরা
Matched word with length 3: হয়ে
Matched word with length 4: গেছে
Matched word with length 8: Georgian
Matched word with length 4: ჩემი
Matched word with length 7: ხომალდი
Matched word with length 7: საჰაერო
Matched word with length 7: ბალიშზე
Matched word with length 6: სავსეა
Matched word with length 12: გველთევზებით
Matched word with length 10: Macedonian
Matched word with length 5: Моето
Matched word with length 7: летачко
Matched word with length 6: возило
Matched word with length 1: е
Matched word with length 5: полно
Matched word with length 2: со
Matched word with length 6: јагули
Matched word with length 10: Vietnamese
Matched word with length 3: Tàu
Matched word with length 4: cánh
Matched word with length 4: ngầm
Matched word with length 3: của
Matched word with length 3: tôi
Matched word with length 3: đầy
Matched word with length 4: lươn
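
For contrast, here is a small sketch of how plain \w splits accented words that the Unicode-aware character class keeps intact. The extractWords helper is a hypothetical name used only for this illustration, and the input uses precomposed letters written as escapes to avoid any ambiguity about normalization.

```javascript
const regexAsciiWord = /\w+/g;
const regexUnicodeWord = /[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]+/gu;

// Hypothetical helper: returns all matches of a global regex as an array.
function extractWords(regex, text) {
  return text.match(regex) || [];
}

// "naïve café", with U+00EF and U+00E9 as precomposed letters.
const sample = 'na\u00EFve caf\u00E9';

console.log(extractWords(regexAsciiWord, sample));
// → ['na', 've', 'caf']
console.log(extractWords(regexUnicodeWord, sample));
// → ['naïve', 'café']
```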

Unicode-aware version of \d

To match any decimal digit in Unicode rather than just ASCII [0-9], use \p{Decimal_Number} instead of \d as per UTS18.

const regex = /^\p{Decimal_Number}+$/u;
regex.test('𝟏𝟐𝟑𝟜𝟝𝟞𝟩𝟪𝟫𝟬𝟭𝟮𝟯𝟺𝟻𝟼');
// → true

To match any numeric symbol in Unicode, including non-decimal symbols such as Roman numerals, use \p{Number}:

const regex = /^\p{Number}+$/u;
regex.test('²³¹¼½¾𝟏𝟐𝟑𝟜𝟝𝟞𝟩𝟪𝟫𝟬𝟭𝟮𝟯𝟺𝟻𝟼㉛㉜㉝ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫⅬⅭⅮⅯⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹⅺⅻⅼⅽⅾⅿ');
// → true
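
To see where the three options differ, the following sketch classifies a few sample strings with \d, \p{Decimal_Number}, and \p{Number}. The classify helper and the sample strings are assumptions made for illustration.

```javascript
const regexAsciiDigits = /^\d+$/u;
const regexDecimalNumber = /^\p{Decimal_Number}+$/u;
const regexNumber = /^\p{Number}+$/u;

// Hypothetical helper: reports which of the three patterns match.
function classify(text) {
  return {
    asciiDigits: regexAsciiDigits.test(text),
    decimalNumber: regexDecimalNumber.test(text),
    number: regexNumber.test(text),
  };
}

// ASCII digits match all three patterns.
console.log(classify('42'));
// → { asciiDigits: true, decimalNumber: true, number: true }

// Arabic-Indic digits are decimal numbers, but not ASCII.
console.log(classify('٤٢'));
// → { asciiDigits: false, decimalNumber: true, number: true }

// The Roman numeral twelve (U+216B) is a letter number, not a decimal digit.
console.log(classify('Ⅻ'));
// → { asciiDigits: false, decimalNumber: false, number: true }
```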

Support

V8 ships with support for Unicode property escapes. JavaScriptCore has an implementation in Safari Technology Preview 42.

Browser(s)    | JavaScript engine | Support for \p{…} & \P{…}
Edge          | Chakra            | ❌ ChakraCore issue #2969
Firefox       | SpiderMonkey      | ❌ SpiderMonkey issue #1361876
Chrome/Opera  | V8                | ✅ V8 issue #4743
WebKit        | JavaScriptCore    | ✅ in Safari Technology Preview 42
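
Since support varies between engines, code that wants to use property escapes directly may need a run-time check. Below is a minimal feature-detection sketch; supportsUnicodePropertyEscapes is a hypothetical helper name, and the approach simply relies on unsupported engines throwing when the pattern is compiled with the u flag.

```javascript
// Hypothetical helper: detects support for \p{…} at run time.
// The RegExp constructor is used so that engines without support throw
// a catchable error instead of failing to parse the whole script.
function supportsUnicodePropertyEscapes() {
  try {
    return new RegExp('\\p{Script_Extensions=Greek}', 'u').test('π');
  } catch (error) {
    return false;
  }
}

const hasPropertyEscapes = supportsUnicodePropertyEscapes();
console.log(`Unicode property escapes supported: ${ hasPropertyEscapes }`);
```

A check like this only helps for patterns created with the RegExp constructor; a regular expression literal containing \p{…} with the u flag is a parse-time error in engines without support.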

My regexpu transpiler supports Unicode property escapes when the { unicodePropertyEscape: true } option is enabled. It translates such regular expressions to equivalent ES5 or ES2015 code that runs in today’s environments. Check out the interactive demo, or view the exhaustive list of supported properties. There’s a Babel plugin, too.


More information

For more details, including the proposed changes to the ECMAScript specification, refer to the formal proposal on GitHub.

About me

Hi there! I’m Mathias. I work on V8 at Google. HTML, CSS, JavaScript, Unicode, performance, and security get me excited. If you managed to read this far without falling asleep, you should follow me on Twitter and GitHub.

Comments

MaxArt wrote:

It’s both nice — because it allows a lot of new stuff — and ugly — because ugh, that syntax. Just today I needed a regex that would correctly parse Twitter hashtags. A lot of people suggested to use \w, but it fails with a simple é.

This is something I’ll follow closely in order to implement it as soon as it’s ready (stage 3, maybe?) in my regex building library.

JavaScript is finally doing something for regular expressions after years of nothing. Look-behinds are next?
