JavaScript’s internal character encoding: UCS-2 or UTF-16?

Published 20th January 2012 · tagged with JavaScript, Unicode

Does JavaScript use UCS-2 or UTF-16 encoding? Since I couldn’t find a definitive answer to this question anywhere, I decided to look into it. The answer depends on what you’re referring to: the JavaScript engine, or JavaScript at the language level.

Let’s start with the basics…

The notorious BMP

Unicode identifies each character by an unambiguous name and an integer called its code point. For example, the © character is named “copyright sign” and has the code point U+00A9 (0xA9, which is 169 in decimal).

The Unicode code space is divided into seventeen planes of 2^16 (65,536) code points each. Some of these code points have not yet been assigned character values, some are reserved for private use, and some are permanently reserved as non-characters. The code points in each plane have the hexadecimal values xy0000 to xyFFFF, where xy is a hex value from 00 to 10, signifying which plane the values belong to.

The first plane (xy is 00) is called the Basic Multilingual Plane or BMP. It contains the code points from U+0000 to U+FFFF, which are the most frequently used characters.

The other sixteen planes (U+010000 → U+10FFFF) are called supplementary planes or astral planes. I won’t discuss them here; just remember that there are BMP characters and non-BMP characters, the latter of which are also known as supplementary characters or astral characters.
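To make the plane layout a bit more concrete, here is a minimal sketch that derives the plane number from a code point. The helper name `planeOf` is made up for this illustration; it is not part of any standard API.

```javascript
// Hypothetical helper: returns the plane number (0 to 16) of a code point.
function planeOf(codePoint) {
  var isInteger = typeof codePoint == 'number' && Math.floor(codePoint) === codePoint;
  if (!isInteger || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a Unicode code point: ' + codePoint);
  }
  // Each plane spans 0x10000 (65,536) code points.
  return Math.floor(codePoint / 0x10000);
}

planeOf(0x00A9);   // 0, i.e. the BMP
planeOf(0x1D306);  // 1, a supplementary plane
planeOf(0x10FFFF); // 16, the last plane
```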

Differences between UCS-2 and UTF-16

Both UCS-2 and UTF-16 are character encodings for Unicode.

UCS-2 (2-byte Universal Character Set) produces a fixed-length format by simply using the code point as the 16-bit code unit. This produces exactly the same result as UTF-16 for the majority of all code points in the range from 0 to 0xFFFF (i.e. the BMP).

UTF-16 (16-bit Unicode Transformation Format) is an extension of UCS-2 that allows representing code points outside the BMP. It produces a variable-length result of either one or two 16-bit code units per code point. This way, it can encode code points in the range from 0 to 0x10FFFF.

For example, in both UCS-2 and UTF-16, the BMP character U+00A9 copyright sign (©) is encoded as 0x00A9.

Surrogate pairs

Characters outside the BMP, e.g. U+1D306 tetragram for centre (𝌆), can only be encoded in UTF-16 using two 16-bit code units: 0xD834 0xDF06. This is called a surrogate pair. Note that a surrogate pair only represents a single character.

The first code unit of a surrogate pair is always in the range from 0xD800 to 0xDBFF, and is called a high surrogate or a lead surrogate.

The second code unit of a surrogate pair is always in the range from 0xDC00 to 0xDFFF, and is called a low surrogate or a trail surrogate.

UCS-2 lacks the concept of surrogate pairs, and therefore interprets 0xD834 0xDF06 (the previous UTF-16 encoding) as two separate characters.

Converting between code points and surrogate pairs

Section 3.7 of The Unicode Standard 3.0 defines the algorithms for converting to and from surrogate pairs.

A code point C greater than 0xFFFF corresponds to a surrogate pair <H, L> as per the following formula:

H = Math.floor((C - 0x10000) / 0x400) + 0xD800
L = (C - 0x10000) % 0x400 + 0xDC00

The reverse mapping, i.e. from a surrogate pair <H, L> to a Unicode code point C, is given by:

C = (H - 0xD800) * 0x400 + L - 0xDC00 + 0x10000
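As a rough sketch, the two formulas above translate to JavaScript as follows. The function names `toSurrogatePair` and `fromSurrogatePair` are my own, chosen for this example; the range check is an addition that the formulas themselves don’t spell out.

```javascript
// Hypothetical helper: code point (above 0xFFFF) -> [high, low] surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Expected a supplementary code point');
  }
  var offset = codePoint - 0x10000;
  var high = Math.floor(offset / 0x400) + 0xD800;
  var low = (offset % 0x400) + 0xDC00;
  return [high, low];
}

// Hypothetical helper: surrogate pair -> code point.
function fromSurrogatePair(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

toSurrogatePair(0x1D306);          // [0xD834, 0xDF06]
fromSurrogatePair(0xD834, 0xDF06); // 0x1D306
```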

Ok, so what about JavaScript?

ECMAScript, the standardized version of JavaScript, defines how characters should be interpreted:

A conforming implementation of this International standard shall interpret characters in conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the BMP subset, collection 300. If the adopted encoding form is not otherwise specified, it is presumed to be the UTF-16 encoding form.

In other words, JavaScript engines are allowed to use either UCS-2 or UTF-16.

However, specific parts of the specification require some UTF-16 knowledge, regardless of the engine’s internal encoding.

Of course, internal engine specifics don’t really matter to the average JavaScript developer. What’s far more interesting is what JavaScript considers to be “characters”, and how it exposes those:

Throughout the rest of this document, the phrase code unit and the word character will be used to refer to a 16-bit unsigned value used to represent a single 16-bit unit of text.
The phrase Unicode character will be used to refer to the abstract linguistic or typographical unit represented by a single Unicode scalar value (which may be longer than 16 bits and thus may be represented by more than one code unit).
The phrase code point refers to such a Unicode scalar value.
Unicode character only refers to entities represented by single Unicode scalar values: the components of a combining character sequence are still individual “Unicode characters”, even though a user might think of the whole sequence as a single character.

JavaScript treats code units as individual characters, while humans generally think in terms of Unicode characters. This has some unfortunate consequences for Unicode characters outside the BMP. Since surrogate pairs consist of two code units, '𝌆'.length == 2, even though there’s only one Unicode character there. The individual surrogate halves are being exposed as if they were characters: '𝌆' == '\uD834\uDF06'.
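A quick way to see this for yourself is to poke at the string with the standard String methods. The variable name `tetragram` is just for this example, and the string is written with escapes so that it doesn’t depend on the file’s encoding.

```javascript
var tetragram = '\uD834\uDF06'; // U+1D306, written as its two surrogate halves

tetragram.length;                     // 2
tetragram.charCodeAt(0).toString(16); // 'd834' (the high surrogate)
tetragram.charCodeAt(1).toString(16); // 'df06' (the low surrogate)
tetragram.charAt(0) === '\uD834';     // true
tetragram.split('').length;           // 2
```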

Remind you of something? It should, ’cause this is almost exactly how UCS-2 works. (The only difference is that technically, UCS-2 doesn’t allow surrogate characters, while JavaScript strings do.)

You could argue that it resembles UTF-16, except unmatched surrogate halves are allowed, surrogates in the wrong order are allowed, and surrogate halves are exposed as separate characters. I think you’ll agree that it’s easier to think of this behavior as “UCS-2 with surrogates”.
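To illustrate, JavaScript happily accepts strings that aren’t valid UTF-16. Here is a minimal sketch of a check for such strings; `hasLoneSurrogate` is a hypothetical helper written for this example, not a built-in.

```javascript
// Hypothetical helper: true if the string contains a surrogate that is not
// part of a correctly ordered high + low pair.
function hasLoneSurrogate(string) {
  for (var i = 0; i < string.length; i++) {
    var unit = string.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // `charCodeAt` returns NaN past the end, which fails both comparisons.
      var next = string.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low surrogate of a valid pair
        continue;
      }
      return true; // high surrogate without a low surrogate after it
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return true; // low surrogate without a high surrogate before it
    }
  }
  return false;
}

hasLoneSurrogate('\uD834\uDF06'); // false (a valid pair)
hasLoneSurrogate('\uD834');       // true (unmatched high surrogate)
hasLoneSurrogate('\uDF06\uD834'); // true (wrong order)
```

None of these strings throw when created; the engine just stores the code units as given.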

This UCS-2-like behavior affects the entire language — for example, regular expressions for ranges of supplementary characters are much harder to write than in languages that do support UTF-16.
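Here is a small sketch of what that means in practice. Since the regular expression engine matches code units, a supplementary character has to be spelled out as a high surrogate followed by a low surrogate. The variable names are made up for this example.

```javascript
// `.` matches a single code unit, so it cannot match a whole astral character.
var dotMatchesTetragram = /^.$/.test('\uD834\uDF06'); // false

// Matches any one supplementary character (a valid surrogate pair).
var anyAstral = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;
var astralCount = 'a\uD834\uDF06b'.match(anyAstral).length; // 1

// Matches the range U+1D300 to U+1D35F. All of these code points share the
// high surrogate 0xD834; their low surrogates run from 0xDF00 to 0xDF5F.
var rangeU1D300toU1D35F = /\uD834[\uDF00-\uDF5F]/;
var tetragramInRange = rangeU1D300toU1D35F.test('\uD834\uDF06'); // true
```

This range is an easy case because every code point in it has the same high surrogate. Ranges that span several high surrogates need multiple alternatives, which is where it gets painful.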

Surrogate pairs are only recombined into a single Unicode character when they’re displayed by the browser (during layout). This happens outside of the JavaScript engine. To demonstrate this, you could write out the high surrogate and the low surrogate in separate document.write() calls: document.write('\uD834'); document.write('\uDF06');. This ends up getting rendered as 𝌆 — one glyph.

Conclusion

JavaScript engines are free to use UCS-2 or UTF-16 internally. Most engines that I know of use UTF-16, but whatever choice they made, it’s just an implementation detail that won’t affect the language’s characteristics.

The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16.

If you ever need to escape a Unicode character, breaking it up into surrogate halves when necessary, feel free to use my JavaScript escaper tool.

If you want to count the number of Unicode characters in a JavaScript string, or create a string based on a non-BMP Unicode code point, you could use Punycode.js’s utility functions to convert between UCS-2 strings and Unicode code points:

// `String.length` replacement that only counts full Unicode characters
punycode.ucs2.decode('𝌆').length; // 1
// `String.fromCharCode` replacement that doesn’t make you enter the surrogate halves separately
punycode.ucs2.encode([0x1D306]); // '𝌆'
punycode.ucs2.encode([119558]); // '𝌆'
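If you’d rather not pull in a library, the same idea can be sketched in a few lines of plain JavaScript. To be clear, this is my own illustration of the technique and not Punycode.js’s actual source; the names `decodeCodePoints` and `encodeCodePoints` are made up for this example.

```javascript
// Hypothetical helper: string -> array of code points.
// Unpaired surrogates are passed through as-is.
function decodeCodePoints(string) {
  var output = [];
  var i = 0;
  while (i < string.length) {
    var unit = string.charCodeAt(i++);
    if (unit >= 0xD800 && unit <= 0xDBFF && i < string.length) {
      var next = string.charCodeAt(i);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // Valid surrogate pair: combine both halves into one code point.
        output.push((unit - 0xD800) * 0x400 + (next - 0xDC00) + 0x10000);
        i++;
        continue;
      }
    }
    output.push(unit);
  }
  return output;
}

// Hypothetical helper: array of code points -> string.
function encodeCodePoints(codePoints) {
  var output = '';
  for (var i = 0; i < codePoints.length; i++) {
    var codePoint = codePoints[i];
    if (codePoint > 0xFFFF) {
      var offset = codePoint - 0x10000;
      output += String.fromCharCode(0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF));
    } else {
      output += String.fromCharCode(codePoint);
    }
  }
  return output;
}

decodeCodePoints('\uD834\uDF06').length;         // 1
encodeCodePoints([0x1D306]) === '\uD834\uDF06';  // true
```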

ECMAScript 6 will support a new kind of escape sequence in strings, namely Unicode code point escapes, e.g. \u{1D306}. Additionally, it will define String.fromCodePoint, which accepts code points rather than code units, and String#codePointAt, which returns a code point rather than a code unit.
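For illustration, here is how these features look in use, assuming an engine that implements them. Older engines will throw a syntax error on the \u{…} escape. The variable names are just for this example.

```javascript
// Code point escape: one escape sequence instead of two surrogate halves.
var fromEscape = '\u{1D306}';
fromEscape === '\uD834\uDF06'; // true

// Takes a code point and produces the surrogate pair for you.
var fromFunction = String.fromCodePoint(0x1D306);
fromFunction === '\uD834\uDF06'; // true

// Reads a whole code point when the index points at a high surrogate.
var wholeCodePoint = fromEscape.codePointAt(0); // 0x1D306

// The index still counts code units: index 1 is the low surrogate.
var secondHalf = fromEscape.codePointAt(1); // 0xDF06

// `length` is unchanged by any of this.
fromEscape.length; // 2
```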

Thanks to Jonas ‘nlogax’ Westerlund, Andrew ‘bobince’ Clover, and Tab Atkins Jr. for inspiring me to look into this, and for helping me out with their explanations.

Note: If you liked reading about JavaScript’s internal character encoding, check out JavaScript has a Unicode problem, which explains the practical problems caused by this behavior in more detail, and offers solutions.

Comments

Ash Berlin wrote on 20th January 2012 at 12:33:

Given that '𝌆'.length == 2 in both Firefox (SpiderMonkey) and Safari (Nitro/JSCore), doesn’t that mean they use UCS-2 and not UTF-16?

Mathias wrote on 20th January 2012 at 12:47:

Ash: No, it just means these engines follow the spec and expose “characters” in the standardized way (that happens to match UCS-2).

The character encoding JavaScript engines use internally is nothing more than an implementation detail.

Masklinn wrote on 20th January 2012 at 14:05:

JavaScript treats code units as individual characters, while humans generally think in terms of Unicode characters.

Actually, humans generally think in terms of graphemes, which may or may not be composed of multiple Unicode code points (irrespective of the normalization form being used). Using the term “character” should really be avoided when talking about Unicode strings: unless you define precisely what you mean by that word, it’s extremely ambiguous.

Mathias wrote on 20th January 2012 at 14:09:

Masklinn: I was using the Unicode characters term as defined in the ECMAScript 5.1 specification — see the quoted block right above the paragraph you cited.

That said, you make a fair point, and provided interesting links. Thanks!

Henri Sivonen wrote on 20th January 2012 at 14:26:

In practice, JS strings are the same as DOM strings. They are like UCS-2, except the surrogate range is not out of use. But they aren’t guaranteed to be valid UTF-16 either: there can be unpaired surrogates, or surrogates in the wrong order.

So the correct answer is that JS/DOM strings are neither UCS-2 nor UTF-16 in the sense of being valid.

In practice, for purposes of length and addressing, DOM strings are strings of 16-bit code units.

For the purposes of interpretation, DOM strings are potentially non-conforming UTF-16 strings.

FWIW, in Gecko, if the most significant byte of all code units in a DOM text node is zero, Gecko stores the text in 8 bits per character. So if you have one non-ISO-8859-1 character in a comment in an inline <script>, the storage of the whole text node gets doubled. (I’m talking about real ISO-8859-1 — not Windows-1252.)

Mathias wrote on 20th January 2012 at 14:50:

Henri: I had no idea DOM strings are so similar to JS strings. It makes a lot of sense, though. Thanks for the explanation!

Andrew Paprocki wrote on 20th January 2012 at 16:10:

For all the gory details about TC-39 work to possibly get rid of this restriction in ECMAScript and support full Unicode, venture to the TC-39 wiki: http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings

Mathias wrote on 20th January 2012 at 16:33:

Andrew: Thanks for that link.

I don’t understand why it mentions “up to eight hex digits”, though. Seems like six hexadecimal digits should be enough.

Interesting note, by the way:

Note that the current usage of UTF-16 in the above ES5.1 clause is an editorial error and dates back to at least ES3. It probably was intended to mean the same as UCS-2. ES3-5.1 did not intend to imply that the ECMAScript strings perform any sort of automatic UTF-16 encoding or interpretation of code points that are outside of the BMP.

Also, the proposed new syntax for Unicode escape sequences can (almost) be emulated using Punycode.js’s UCS-2 encoding function:

function unicodeEscape(string) {
  // note: this will match `u{123}` (no leading `\`) as well
  return string.replace(/u\{([0-9a-fA-F]{1,8})\}/g, function($0, $1) {
    return punycode.ucs2.encode([parseInt($1, 16)]);
  });
}

unicodeEscape('\u{48}\u{65}\u{6c}\u{6c}\u{6f}\u{20}\u{77}\u{6f}\u{72}\u{6c}\u{64}'); // 'Hello world'
unicodeEscape('\u{1d306}'); // '𝌆'

Those last two examples assume an engine that doesn’t error on \u{.

Update: ES6 will allow Unicode code point escapes, supporting code points with up to six (not eight) hexadecimal digits.

Peter wrote on 20th January 2012 at 16:47:

Interesting blog post and comments! The JavaScript Unicode behaviour looks somewhat bogus (with '𝌆'.length == 2). But maybe there is no simple alternative to that handling (also because of the existence of complex graphemes). It would be interesting to check the Unicode internals of Java and see if they work any better (if '𝌆'.length == 1 there).

Mathias wrote on 20th January 2012 at 17:13:

Peter: According to Wikipedia, Java originally used UCS-2, and added UTF-16 supplementary character support in J2SE 5.0. However, non-BMP characters require the individual surrogate halves to be entered individually, for example: "\uD834\uDD1E" for U+1D11E.

Peter wrote on 20th January 2012 at 17:29:

Mathias: Ah, thanks! And I just checked, in Java "\uD834\uDD1E" is displayed as 1 char but length() == 2, so no progress there… :)

Han Guokai wrote on 20th January 2012 at 18:28:

In Java 5, ‘char’ means code unit (16-bit), not character (code point). One character uses one or two chars. String’s length method returns the char count. And there are some additional methods based on code points, e.g. codePointCount. See: http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/String.html

Peter wrote on 20th January 2012 at 19:36:

Han: Thanks for the info, I should read the docs once in a while. :)

Vivien Blasquez wrote on 25th January 2012 at 21:50:

Thank you for this article. For my part I prefer to use UTF-16 :)

Nicolas Labbé wrote on 4th December 2012 at 22:38:

Relevant — a French summary of your presentation during the dotJS event on November 30th: Love/Hate: JavaScript and Unicode

timeless wrote on 13th December 2013 at 01:44:

I presume that the # in String#codePointAt is a typo for . as in String.fromCodePoint.

Mathias wrote on 13th December 2013 at 09:16:

timeless: No, the # is a shorthand for .prototype., e.g. String#codePointAt means String.prototype.codePointAt.

Ben Nadel wrote on 17th April 2016 at 11:51:

I’m currently struggling to wrap my head around the utf8 vs. utf8mb4 requirements for MySQL. So, while my concerns aren’t JavaScript-related, this article is really helping me just understand the astral plane character stuff, which is why utf8mb4 is needed (not that I need to tell you — I see you already have an article on it).

Mathias Bynens