Does JavaScript use UCS-2 or UTF-16 encoding? Since I couldn’t find a definitive answer to this question anywhere, I decided to look into it. The answer depends on what you’re referring to: the JavaScript engine, or JavaScript at the language level.
Let’s start with the basics…
The notorious BMP
Unicode identifies characters by an unambiguous name and an integer number called its code point. For example, the © character is named “copyright sign” and has U+00A9 — 0xA9 can be written as 169 in decimal — as its code point.
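As a quick illustration (my own snippet, not part of the original example), you can check these values in any JavaScript console:

// U+00A9, the copyright sign, in its different notations
'\u00A9' == '©'; // true
'©'.charCodeAt(0) == 0xA9; // true
0xA9 == 169; // true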
The Unicode code space is divided into seventeen planes of 2^16 (65,536) code points each. Some of these code points have not yet been assigned character values, some are reserved for private use, and some are permanently reserved as non-characters. The code points in each plane have the hexadecimal values xy0000 to xyFFFF, where xy is a hex value from 00 to 10, signifying which plane the values belong to.
The first plane (xy is 00) is called the Basic Multilingual Plane or BMP. It contains the code points from U+0000 to U+FFFF, which are the most frequently used characters.
The other sixteen planes (U+010000 → U+10FFFF) are called supplementary planes or astral planes. I won’t discuss them here; just remember that there are BMP characters and non-BMP characters, the latter of which are also known as supplementary characters or astral characters.
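As a small aside (a sketch of my own, not from the standard), the plane a code point belongs to is simply its value integer-divided by 0x10000:

// BMP vs. astral: divide the code point by 0x10000 (65,536) and round down
Math.floor(0x00A9 / 0x10000); // 0, i.e. the BMP
Math.floor(0x1D306 / 0x10000); // 1, i.e. the first supplementary (astral) plane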
Differences between UCS-2 and UTF-16
Both UCS-2 and UTF-16 are character encodings for Unicode.
UCS-2 (2-byte Universal Character Set) produces a fixed-length format by simply using the code point as the 16-bit code unit. This produces exactly the same result as UTF-16 for the majority of all code points in the range from 0 to 0xFFFF (i.e. the BMP).
UTF-16 (16-bit Unicode Transformation Format) is an extension of UCS-2 that allows representing code points outside the BMP. It produces a variable-length result of either one or two 16-bit code units per code point. This way, it can encode code points in the range from 0 to 0x10FFFF.
For example, in both UCS-2 and UTF-16, the BMP character U+00A9 copyright sign (©) is encoded as 0x00A9.
Surrogate pairs
Characters outside the BMP, e.g. U+1D306 tetragram for centre (𝌆), can only be encoded in UTF-16 using two 16-bit code units: 0xD834 0xDF06. This is called a surrogate pair. Note that a surrogate pair only represents a single character.
The first code unit of a surrogate pair is always in the range from 0xD800 to 0xDBFF, and is called a high surrogate or a lead surrogate.
The second code unit of a surrogate pair is always in the range from 0xDC00 to 0xDFFF, and is called a low surrogate or a trail surrogate.
UCS-2 lacks the concept of surrogate pairs, and therefore interprets 0xD834 0xDF06 (the previous UTF-16 encoding) as two separate characters.
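Jumping ahead to JavaScript for a moment, you can observe these two code units directly (an illustrative snippet of my own):

// The surrogate halves of U+1D306, as 16-bit code units
'𝌆'.charCodeAt(0).toString(16); // 'd834' (high/lead surrogate)
'𝌆'.charCodeAt(1).toString(16); // 'df06' (low/trail surrogate)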
Converting between code points and surrogate pairs
Section 3.7 of The Unicode Standard 3.0 defines the algorithms for converting to and from surrogate pairs.
A code point C greater than 0xFFFF corresponds to a surrogate pair <H, L> as per the following formula:
H = Math.floor((C - 0x10000) / 0x400) + 0xD800
L = (C - 0x10000) % 0x400 + 0xDC00
The reverse mapping, i.e. from a surrogate pair <H, L> to a Unicode code point C, is given by:
C = (H - 0xD800) * 0x400 + L - 0xDC00 + 0x10000
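Wrapped up as plain JavaScript functions (just a sketch of the formulas above; the function names are my own), the two mappings look like this:

// Code point → surrogate pair, for code points above 0xFFFF
function getSurrogatePair(codePoint) {
  var high = Math.floor((codePoint - 0x10000) / 0x400) + 0xD800;
  var low = (codePoint - 0x10000) % 0x400 + 0xDC00;
  return [high, low];
}
// Surrogate pair → code point
function getCodePoint(high, low) {
  return (high - 0xD800) * 0x400 + low - 0xDC00 + 0x10000;
}
getSurrogatePair(0x1D306); // [55348, 57094], i.e. [0xD834, 0xDF06]
getCodePoint(0xD834, 0xDF06); // 119558, i.e. 0x1D306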
Ok, so what about JavaScript?
ECMAScript, the standardized version of JavaScript, defines how characters should be interpreted:
A conforming implementation of this International standard shall interpret characters in conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the BMP subset, collection 300. If the adopted encoding form is not otherwise specified, it is presumed to be the UTF-16 encoding form.
In other words, JavaScript engines are allowed to use either UCS-2 or UTF-16.
However, specific parts of the specification require some UTF-16 knowledge, regardless of the engine’s internal encoding.
Of course, internal engine specifics don’t really matter to the average JavaScript developer. What’s far more interesting is what JavaScript considers to be “characters”, and how it exposes those:
Throughout the rest of this document, the phrase code unit and the word character will be used to refer to a 16-bit unsigned value used to represent a single 16-bit unit of text.
The phrase Unicode character will be used to refer to the abstract linguistic or typographical unit represented by a single Unicode scalar value (which may be longer than 16 bits and thus may be represented by more than one code unit).
The phrase code point refers to such a Unicode scalar value.
Unicode character only refers to entities represented by single Unicode scalar values: the components of a combining character sequence are still individual “Unicode characters”, even though a user might think of the whole sequence as a single character.
JavaScript treats code units as individual characters, while humans generally think in terms of Unicode characters. This has some unfortunate consequences for Unicode characters outside the BMP. Since surrogate pairs consist of two code units, '𝌆'.length == 2, even though there’s only one Unicode character there. The individual surrogate halves are being exposed as if they were characters: '𝌆' == '\uD834\uDF06'.
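A minimal illustration (assuming a plain ES5 environment; the variable name is just for the example):

var tetragram = '\uD834\uDF06'; // the single Unicode character 𝌆
tetragram.length; // 2 (one per surrogate half)
tetragram.charAt(0); // '\uD834', a lone surrogate half posing as a character
tetragram.charCodeAt(1); // 57094, i.e. 0xDF06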
Remind you of something? It should, ’cause this is almost exactly how UCS-2 works. (The only difference is that technically, UCS-2 doesn’t allow surrogate characters, while JavaScript strings do.)
You could argue that it resembles UTF-16, except unmatched surrogate halves are allowed, surrogates in the wrong order are allowed, and surrogate halves are exposed as separate characters. I think you’ll agree that it’s easier to think of this behavior as “UCS-2 with surrogates”.
This UCS-2-like behavior affects the entire language — for example, regular expressions for ranges of supplementary characters are much harder to write than in languages that do support UTF-16.
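For example (a sketch of my own, not from the original text), matching “any astral character” means spelling out the surrogate ranges by hand, and a naive character class quietly operates on code units instead:

// Any supplementary character: a lead surrogate followed by a trail surrogate
var astralRegex = /[\uD800-\uDBFF][\uDC00-\uDFFF]/;
astralRegex.test('𝌆'); // true
// A character class built from an astral character only matches single code units
/^[𝌆]$/.test('𝌆'); // false: the class contains the two surrogate halves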
Surrogate pairs are only recombined into a single Unicode character when they’re displayed by the browser (during layout). This happens outside of the JavaScript engine. To demonstrate this, you could write out the high surrogate and the low surrogate in separate document.write() calls: document.write('\uD834'); document.write('\uDF06');. This ends up getting rendered as 𝌆 — one glyph.
Conclusion
JavaScript engines are free to use UCS-2 or UTF-16 internally. Most engines that I know of use UTF-16, but whichever encoding they choose, it’s just an implementation detail that won’t affect the language’s characteristics.
The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16.
If you ever need to escape a Unicode character, breaking it up into surrogate halves when necessary, feel free to use my JavaScript escaper tool.
If you want to count the number of Unicode characters in a JavaScript string, or create a string based on a non-BMP Unicode code point, you could use Punycode.js’s utility functions to convert between UCS-2 strings and UTF-16 code points:
// `String.length` replacement that only counts full Unicode characters
punycode.ucs2.decode('𝌆').length; // 1
// `String.fromCharCode` replacement that doesn’t make you enter the surrogate halves separately
punycode.ucs2.encode([0x1D306]); // '𝌆'
punycode.ucs2.encode([119558]); // '𝌆'
ECMAScript 6 will support a new kind of escape sequence in strings, namely Unicode code point escapes, e.g. \u{1D306}. Additionally, it will define String.fromCodePoint and String#codePointAt, both of which accept code points rather than code units.
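In an environment that implements these proposals, they would behave roughly like this (a hedged sketch based on the ES6 drafts):

'\u{1D306}' == '\uD834\uDF06'; // true: a code point escape for U+1D306
String.fromCodePoint(0x1D306); // '𝌆', no need to enter the surrogate halves
'𝌆'.codePointAt(0); // 119558, i.e. 0x1D306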
Thanks to Jonas ‘nlogax’ Westerlund, Andrew ‘bobince’ Clover, and Tab Atkins Jr. for inspiring me to look into this, and for helping me out with their explanations.
Note: If you liked reading about JavaScript’s internal character encoding, check out JavaScript has a Unicode problem, which explains the practical problems caused by this behavior in more detail, and offers solutions.
Comments
Ash Berlin wrote on :
Given that '𝌆'.length == 2 in both Firefox (SpiderMonkey) and Safari (Nitro/JSCore), doesn’t that mean they use UCS-2 and not UTF-16?
Mathias wrote on :
Ash: No, it just means these engines follow the spec and expose “characters” in the standardized way (that happens to match UCS-2).
The character encoding JavaScript engines use internally is nothing more than an implementation detail.
Masklinn wrote on :
Actually, humans generally think in terms of graphemes, which may or may not be composed of multiple Unicode code points (irrespective of the normalization form being used). Using the term “character” should really be avoided when talking about Unicode strings: unless you define precisely what you mean by that word, it’s extremely ambiguous.
Mathias wrote on :
Masklinn: I was using the Unicode characters term as defined in the ECMAScript 5.1 specification — see the quoted block right above the paragraph you cited.
That said, you make a fair point, and provided interesting links. Thanks!
Henri Sivonen wrote on :
In practice, JS strings are the same as DOM strings. They are like UCS-2, except the surrogate range is not out of use. But they aren’t guaranteed to be valid UTF-16 either: there can be unpaired surrogates, or surrogates in the wrong order.
So the correct answer is that JS/DOM strings are neither UCS-2 nor UTF-16 in the sense of being valid.
In practice, for purposes of length and addressing, DOM strings are strings of 16-bit code units.
For the purposes of interpretation, DOM strings are potentially non-conforming UTF-16 strings.
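To illustrate Henri’s point (an editorial snippet, not part of his comment), JavaScript accepts lone or misordered surrogates without complaint:

var lone = '\uD834'; // an unpaired high surrogate
var swapped = '\uDF06\uD834'; // surrogates in the wrong order
lone.length; // 1
swapped.length; // 2 (two 16-bit code units, but not valid UTF-16)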
FWIW, in Gecko, if the most significant byte of all code units in a DOM text node is zero, Gecko stores the text in 8 bits per character. So if you have one non-ISO-8859-1 character in a comment in an inline <script>, the storage of the whole text node gets doubled. (I’m talking about real ISO-8859-1 — not Windows-1252.)
Mathias wrote on :
Henri: I had no idea DOM strings are so similar to JS strings. It makes a lot of sense, though. Thanks for the explanation!
Andrew Paprocki wrote on :
For all the gory details about TC-39 work to possibly get rid of this restriction in ECMAScript and support full Unicode, venture to the TC-39 wiki: http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings
Mathias wrote on :
Andrew: Thanks for that link.
I don’t understand why it mentions “up to eight hex digits”, though. Seems like six hexadecimal digits should be enough.
Interesting note, by the way:
Also, the proposed new syntax for Unicode escape sequences can (almost) be emulated using Punycode.js’s UCS-2 encoding function:
Those last two examples assume an engine that doesn’t error on \u{.
Update: ES6 will allow Unicode code point escapes, supporting code points with up to six (not eight) hexadecimal digits.
Peter wrote on :
Interesting blog post and comments! The JavaScript Unicode behaviour looks somewhat bogus (with '𝌆'.length == 2). But maybe there is no simple alternative to that handling (also because of the existence of complex graphemes). It would be interesting to check the Unicode internals of Java and see whether they handle this any better (i.e. if '𝌆'.length == 1 there).
Mathias wrote on :
Peter: According to Wikipedia, Java originally used UCS-2, and added UTF-16 supplementary character support in J2SE 5.0. However, non-BMP characters still require the surrogate halves to be entered individually, for example: "\uD834\uDD1E" for U+1D11E.
Peter wrote on :
Mathias: Ah, thanks! And I just checked, in Java "\uD834\uDD1E" is displayed as 1 char but length() == 2, so no progress there… :)
Han Guokai wrote on :
In Java 5, ‘char’ means code unit (16-bit), not character (code point). One character uses one or two chars. String’s length method returns the char count. And there are some additional methods based on code points, e.g. codePointCount. See: http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/String.html
Peter wrote on :
Han: Thanks for the info, I should read the docs once in a while. :)
Vivien Blasquez wrote on :
Thank you for this article. For my part I prefer to use UTF-16 :)
Nicolas Labbé wrote on :
Relevant — a French summary of your presentation during the dotJS event on November 30th: Love/Hate: JavaScript and Unicode
timeless wrote on :
I presume that the # in String#codePointAt is a typo for . as in String.fromCodePoint.
timeless: No, the # is a shorthand for .prototype., e.g. String#codePointAt means String.prototype.codePointAt.
Ben Nadel wrote on :
I’m currently struggling to wrap my head around the utf8 vs. utf8mb4 requirements for MySQL. So, while my concerns aren’t JavaScript-related, this article is really helping me just understand the astral plane character stuff, which is why utf8mb4 is needed (not that I need to tell you — I see you already have an article on it).