ES2018 adds support for Unicode property escapes of the form \p{…}
and \P{…}
to JavaScript regular expressions.
This article explains what Unicode property escapes are, how they work, and why they’re useful.
Introduction
The Unicode Standard assigns various properties and property values to every symbol. For example, to get the set of symbols that are used in the Greek script, search the Unicode database for symbols whose Script_Extensions
property value includes Greek
.
Unicode property escapes make it possible to access these Unicode character properties natively in ECMAScript regular expressions. For example, the pattern \p{Script_Extensions=Greek}
matches every symbol that is used in the Greek script.
const regexGreekSymbol = /\p{Script_Extensions=Greek}/u;
regexGreekSymbol.test('π');
// → true
Previously, developers wishing to use equivalent regular expressions in JavaScript had to resort to large run-time dependencies or build scripts, both of which lead to performance and maintainability problems. With built-in support for Unicode property escapes, creating regular expressions based on Unicode properties couldn’t be easier.
API
For optimal backwards compatibility, Unicode property escapes are only available in regular expressions with the u
flag set.
Non-binary properties
Unicode property escapes for non-binary Unicode properties use the following syntax:
\p{UnicodePropertyName=UnicodePropertyValue}
The current proposal guarantees support for the following non-binary Unicode properties and their values: General_Category
, Script
, and Script_Extensions
.
\p{General_Category=Decimal_Number}
\p{Script=Greek}
\p{Script_Extensions=Greek}
For General_Category
values, the General_Category=
part may be omitted.
\p{General_Category=Decimal_Number}
\p{Decimal_Number}
Binary properties
Trying to use the abovementioned syntax by specifying a value for a binary property triggers a syntax error. Since binary Unicode properties only have two possible values (Yes
or No
), you only need to specify the property name. Use \p{…}
to match symbols having the property (Yes
) and \P{…}
to match the negated set (No
).
\p{White_Space}
\P{White_Space}
The current proposal guarantees support for a subset of the available binary Unicode properties, including (but not limited to) the ones required by UTS18 RL1.2: ASCII
, Alphabetic
, Any
, Assigned
, Default_Ignorable_Code_Point
, Lowercase
, Noncharacter_Code_Point
, Uppercase
White_Space
, et cetera. Note that this includes the binary properties defined in UTR51: Emoji
, Emoji_Component
, Emoji_Presentation
, Emoji_Modifier
, and Emoji_Modifier_Base
.
Property and value aliases
The aliases defined in PropertyAliases.txt
and PropertyValueAliases.txt
may be used instead of the canonical property and value names. I wouldn’t recommend doing so, as it makes the patterns harder to read.
The use of an unknown property name or value triggers a SyntaxError
.
Examples
Matching emoji
To match emoji symbols, the binary properties from UTR51 come in handy.
const regex = /\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu;
This regular expression matches, from left to right:
- emoji with optional modifiers (
\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?
); - any remaining symbols that render as emoji rather than text by default (
\p{Emoji_Presentation}
); - symbols that render as text by default, but are forced to render as emoji using U+FE0F VARIATION SELECTOR-16 (
\p{Emoji}\uFE0F
).
const regex = /\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu;
const text = `
\u{231A}: ⌚ default emoji presentation character (Emoji_Presentation)
\u{2194}\u{FE0F}: ↔️ default text presentation character rendered as emoji
\u{1F469}: 👩 emoji modifier base (Emoji_Modifier_Base)
\u{1F469}\u{1F3FF}: 👩🏿 emoji modifier base followed by a modifier
`;
let match;
while (match = regex.exec(text)) {
const emoji = match[0];
console.log(`Matched sequence ${ emoji } — code points: ${ [...emoji].length }`);
}
Console output:
Matched sequence ⌚ — code points: 1
Matched sequence ⌚ — code points: 1
Matched sequence ↔️ — code points: 2
Matched sequence ↔️ — code points: 2
Matched sequence 👩 — code points: 1
Matched sequence 👩 — code points: 1
Matched sequence 👩🏿 — code points: 2
Matched sequence 👩🏿 — code points: 2
For a more complete solution that matches emoji sequences & ZWJ sequences as well, see emoji-regex
. A follow-up proposal extends Unicode property escapes with new capabilities to make it easier to do things like matching emoji.
Unicode-aware version of \w
To match any word symbol in Unicode rather than just ASCII [a-zA-Z0-9_]
, use [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]
as per UTS18.
const regex = /([\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]+)/gu;
const text = `
Amharic: የኔ ማንዣበቢያ መኪና በዓሣዎች ተሞልቷል
Bengali: আমার হভারক্রাফ্ট কুঁচে মাছ-এ ভরা হয়ে গেছে
Georgian: ჩემი ხომალდი საჰაერო ბალიშზე სავსეა გველთევზებით
Macedonian: Моето летачко возило е полно со јагули
Vietnamese: Tàu cánh ngầm của tôi đầy lươn
`;
let match;
while (match = regex.exec(text)) {
const word = match[1];
console.log(`Matched word with length ${ word.length }: ${ word }`);
}
Console output:
Matched word with length 7: Amharic
Matched word with length 2: የኔ
Matched word with length 6: ማንዣበቢያ
Matched word with length 3: መኪና
Matched word with length 5: በዓሣዎች
Matched word with length 5: ተሞልቷል
Matched word with length 7: Bengali
Matched word with length 4: আমার
Matched word with length 11: হভারক্রাফ্ট
Matched word with length 5: কুঁচে
Matched word with length 3: মাছ
Matched word with length 1: এ
Matched word with length 3: ভরা
Matched word with length 3: হয়ে
Matched word with length 4: গেছে
Matched word with length 8: Georgian
Matched word with length 4: ჩემი
Matched word with length 7: ხომალდი
Matched word with length 7: საჰაერო
Matched word with length 7: ბალიშზე
Matched word with length 6: სავსეა
Matched word with length 12: გველთევზებით
Matched word with length 10: Macedonian
Matched word with length 5: Моето
Matched word with length 7: летачко
Matched word with length 6: возило
Matched word with length 1: е
Matched word with length 5: полно
Matched word with length 2: со
Matched word with length 6: јагули
Matched word with length 10: Vietnamese
Matched word with length 3: Tàu
Matched word with length 4: cánh
Matched word with length 4: ngầm
Matched word with length 3: của
Matched word with length 3: tôi
Matched word with length 3: đầy
Matched word with length 4: lươn
Unicode-aware version of \d
To match any decimal number in Unicode rather than just ASCII [0-9]
, use \p{Decimal_Number}
instead of \d
as per UTS18.
const regex = /^\p{Decimal_Number}+$/u;
regex.test('𝟏𝟐𝟑𝟜𝟝𝟞𝟩𝟪𝟫𝟬𝟭𝟮𝟯𝟺𝟻𝟼');
// → true
To match any numeric symbol in Unicode, including non-decimal symbols such as Roman numerals, use \p{Number}
:
const regex = /^\p{Number}+$/u;
regex.test('²³¹¼½¾𝟏𝟐𝟑𝟜𝟝𝟞𝟩𝟪𝟫𝟬𝟭𝟮𝟯𝟺𝟻𝟼㉛㉜㉝ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫⅬⅭⅮⅯⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹⅺⅻⅼⅽⅾⅿ');
// → true
Support
V8 ships with support for Unicode property escapes. JavaScriptCore has an implementation in Safari Technology Preview 42.
Browser(s) | JavaScript engine | Support for \p{…} & \P{…} |
---|---|---|
Edge | Chakra | ❌ ChakraCore issue #2969 |
Firefox | SpiderMonkey | ✅ in Firefox 78 |
Chrome/Opera | V8 | ✅ in Chrome 64 |
WebKit | JavaScriptCore | ✅ in Safari Technology Preview 42 |
My regexpu transpiler supports Unicode property escapes when the { unicodePropertyEscape: true }
option is enabled. It translates such regular expressions to equivalent ES5 or ES2015 code that runs in today’s environments. Check out the interactive demo, or view the exhaustive list of supported properties. There’s a Babel plugin, too.
More information
For more details, including the proposed changes to the ECMAScript specification, refer to the formal proposal on GitHub.
Comments
Álvaro González wrote on :
Right into my bookmarks!
MaxArt wrote on :
It’s both nice — because it allows a lot of new stuff — and ugly — because ugh, that syntax. Just today I needed a regex that would correctly parse Twitter hashtags. A lot of people suggested to use
\w
, but it fails with a simpleé
.This is something I’ll follow closely in order to implement it as soon as it’s ready (stage 3, maybe?) in my regex building library.
JavaScript is finally doing something for regular expressions after years of nothing. Look-behinds are next?