Mathias Bynens

JavaScript character escape sequences

Tagged with JavaScript, Unicode

Having recently written about character references in HTML and escape sequences in CSS, I figured it would be interesting to look into JavaScript character escapes as well.

Character codes and code points

A code point (also known as “character code”) is a numerical representation of a specific Unicode character.

For example, the character code of the copyright symbol © is 169.

In JavaScript, String#charCodeAt() can be used to get the numeric Unicode code point of any character up to U+FFFF (i.e. the character with code point 65535).

Since JavaScript uses UCS-2 encoding internally, higher code points are represented by a pair of (lower valued) “surrogate” pseudo-characters that together make up the real character. String#charCodeAt() returns each surrogate separately, so to get the actual code point of such a character in JavaScript, you’ll have to do some extra work.
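The “extra work” could look something like the sketch below. The helper names (codePointFromSurrogates, getCodePoint) are made up for illustration and are not part of any library; the arithmetic is the standard UTF-16 surrogate pair formula. Environments that support ES2015 also offer String#codePointAt(), which should give the same result.

```javascript
// Combine a high and a low surrogate into the code point they represent.
function codePointFromSurrogates(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

// Hypothetical helper: read the full code point that starts at `index`.
function getCodePoint(string, index) {
  const first = string.charCodeAt(index);
  // High surrogates are 0xD800-0xDBFF, low surrogates are 0xDC00-0xDFFF.
  if (first >= 0xD800 && first <= 0xDBFF && index + 1 < string.length) {
    const second = string.charCodeAt(index + 1);
    if (second >= 0xDC00 && second <= 0xDFFF) {
      return codePointFromSurrogates(first, second);
    }
  }
  // Not a surrogate pair: the code unit is the code point.
  return first;
}

const copyright = '\xa9'; // the copyright symbol
const tetragram = '\uD834\uDF06'; // U+1D306, stored as two surrogates

console.log(copyright.charCodeAt(0)); // 169
console.log(tetragram.length); // 2
console.log(tetragram.charCodeAt(0).toString(16)); // d834
console.log(getCodePoint(tetragram, 0).toString(16)); // 1d306
```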

Now that’s out of the way, let’s take a look at the different types of character escape sequences in JavaScript strings.

Single character escape sequences

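There are some reserved single character escape sequences: \b (backspace), \t (horizontal tab), \n (new line), \v (vertical tab), \f (form feed), \r (carriage return), \" (double quote), \' (single quote), \\ (backslash), and \0 (null character).

All single character escapes can easily be memorized using the following regular expression: \\[bfnrtv0'"\\].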
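As a quick sanity check, the snippet below pairs each single character escape with the character code it should produce. The array name and its layout are only for illustration.

```javascript
// Each entry: the escape as typed in source, the resulting character,
// and the character code it is expected to have.
const singleCharacterEscapes = [
  { source: '\\0', value: '\0', code: 0 },   // null character
  { source: '\\b', value: '\b', code: 8 },   // backspace
  { source: '\\t', value: '\t', code: 9 },   // horizontal tab
  { source: '\\n', value: '\n', code: 10 },  // new line
  { source: '\\v', value: '\v', code: 11 },  // vertical tab
  { source: '\\f', value: '\f', code: 12 },  // form feed
  { source: '\\r', value: '\r', code: 13 },  // carriage return
  { source: '\\"', value: '\"', code: 34 },  // double quote
  { source: "\\'", value: '\'', code: 39 },  // single quote
  { source: '\\\\', value: '\\', code: 92 }, // backslash
];

for (const { source, value, code } of singleCharacterEscapes) {
  // Logs the escape, its actual character code, and whether it matches.
  console.log(source, value.charCodeAt(0), value.charCodeAt(0) === code);
}
```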

Note that the escape character \ makes special characters literal. (Characters without special meaning can be escaped as well (e.g. '\a' == 'a'), but this is of course not needed.)

There’s only one exception to this rule:

'abc\
def' == 'abcdef'; // true

The \ followed by a new line is not a character escape sequence, but a LineContinuation. The new line doesn’t become part of the string. This is simply a way to spread a string over multiple lines (for easier code editing, for example), without the string actually including any new line characters.

Note: IE < 9 treats '\v' as 'v' instead of a vertical tab ('\x0b'). If cross-browser compatibility is a concern, use \x0b instead of \v.

Octal escape sequences

Any character with a character code lower than 256 (i.e. any character in the extended ASCII range) can be escaped using its octal-encoded character code, prefixed with \. (Note that this is the same range of characters that can be escaped through hexadecimal escapes.)

To use the same example, the copyright symbol ('©') has character code 169, which gives 251 in octal notation, so you could write it as '\251'.

Octal escapes can consist of two, three or four characters. '\1', '\01' and '\001' are equivalent; zero padding is not required. However, if the octal escape (e.g. '\1') is part of a larger string, and it’s immediately followed by a character in the range [0-7] (e.g. 1), the next character will be considered part of the escape sequence until at most three digits are matched. In other words, '\12' (a single octal escape equivalent to '\012') is not the same as '\0012' (an octal escape '\001' followed by an unescaped character '2'). By simply zero padding octal escapes, you can avoid this problem.

Note that there’s one exception here: by itself, \0 is not an octal escape sequence. It looks like one, and it’s even equal to \00 and \000, both of which are octal escape sequences — but unless it’s followed by a decimal digit, it acts like a single character escape sequence. Or, in spec lingo: EscapeSequence :: 0 [lookahead ∉ DecimalDigit]. It’s probably easiest to define octal escape syntax using the following regular expression: \\(?:[1-7][0-7]{0,2}|[0-7]{2,3}).
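The behavior described above can be checked with a small sketch. Because octal escapes are syntax errors in strict mode (and in ES modules), the example evaluates the literals through the Function constructor, which creates non-strict code unless the body opts in. The helper name evaluateSloppyLiteral is made up for this example.

```javascript
// Hypothetical helper: evaluate the source text of a string literal
// as non-strict code, so that octal escapes are accepted.
function evaluateSloppyLiteral(literalSource) {
  return new Function('return ' + literalSource + ';')();
}

// '\251' is the copyright symbol (251 in octal is 169 in decimal).
console.log(evaluateSloppyLiteral("'\\251'") === '\xa9'); // true

// Zero padding is optional: '\1', '\01' and '\001' are equivalent.
console.log(evaluateSloppyLiteral("'\\1'") === evaluateSloppyLiteral("'\\001'")); // true

// '\12' is a single escape; '\0012' is '\001' followed by the character '2'.
console.log(evaluateSloppyLiteral("'\\12'").length); // 1
console.log(evaluateSloppyLiteral("'\\0012'").length); // 2

// '\0' on its own yields the same character as '\00' and '\000'.
console.log(evaluateSloppyLiteral("'\\0'") === evaluateSloppyLiteral("'\\000'")); // true
```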

Note that octal escapes have been deprecated in ES5:

Past editions of ECMAScript have included additional syntax and semantics for specifying octal literals and octal escape sequences. These have been removed from this edition of ECMAScript. This non-normative annex presents uniform syntax and semantics for octal literals and octal escape sequences for compatibility with some older ECMAScript programs.

Additionally, they produce syntax errors in strict mode:

A conforming implementation, when processing strict mode code (see 10.1.1), may not extend the syntax of EscapeSequence to include OctalEscapeSequence as described in B.1.2.
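To see the strict mode restriction in action without breaking the surrounding script, one option is to let the Function constructor parse a function body and catch the resulting SyntaxError. This is only a sketch; parsesWithoutError is a hypothetical helper name.

```javascript
// Hypothetical helper: returns true if the given function body parses,
// false if the engine rejects it with a SyntaxError.
function parsesWithoutError(functionBody) {
  try {
    new Function(functionBody);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) {
      return false;
    }
    throw error; // anything else is unexpected
  }
}

// Non-strict code accepts the octal escape.
console.log(parsesWithoutError("return '\\251';")); // true

// Strict mode code rejects it at parse time.
console.log(parsesWithoutError("'use strict'; return '\\251';")); // false

// The hexadecimal equivalent is fine in strict mode.
console.log(parsesWithoutError("'use strict'; return '\\xa9';")); // true
```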

TL;DR Don’t use octal escapes; use hexadecimal escapes instead.

Hexadecimal escape sequences

Any character with a character code lower than 256 (i.e. any character in the extended ASCII range) can be escaped using its hex-encoded character code, prefixed with \x. (Note that this is the same range of characters that can be escaped through octal escapes.)

Hexadecimal escapes are four characters long. They require exactly two characters following \x. If the hexadecimal character code is only one character long (this is the case for all character codes smaller than 16, or 10 in hex), you’ll need to pad it with a leading 0.

For example, the copyright symbol ('©') has character code 169, which gives a9 in hex, so you could write it as '\xa9'.

The hexadecimal part of this escape is case-insensitive; in other words, '\xa9' and '\xA9' are equivalent.

You could define hexadecimal escape syntax using the following regular expression: \\x[a-fA-F0-9]{2}.
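Here is a minimal sketch of how a hexadecimal escape could be generated and validated. The function name toHexEscape is invented for this example, String#padStart() assumes an ES2017 environment, and the regular expression is the one above with ^ and $ anchors added so that it must match the whole input.

```javascript
// Hypothetical helper: build the \xHH escape for a single character.
function toHexEscape(character) {
  const code = character.charCodeAt(0);
  if (code > 0xff) {
    throw new RangeError('Hexadecimal escapes only cover character codes below 256.');
  }
  // Pad to exactly two hexadecimal digits.
  return '\\x' + code.toString(16).padStart(2, '0');
}

// The pattern from the text, anchored to match a complete escape.
const hexEscapePattern = /^\\x[a-fA-F0-9]{2}$/;

console.log('\xa9' === '\xA9'); // true (case-insensitive)
console.log(toHexEscape('\xa9')); // \xa9
console.log(toHexEscape('\t')); // \x09 (padded with a leading 0)
console.log(hexEscapePattern.test('\\xa9')); // true
console.log(hexEscapePattern.test('\\x9')); // false (only one digit)
```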

It’s a bit confusing that the spec refers to this kind of escape sequence as “hexadecimal”, since Unicode escapes use hex as well.

Unicode escape sequences

Any character with a character code lower than 65536 can be escaped using the hexadecimal value of its character code, prefixed with \u. (As mentioned before, higher character codes are represented by a pair of surrogate characters.)

Unicode escapes are six characters long. They require exactly four characters following \u. If the hexadecimal character code is only one, two or three characters long, you’ll need to pad it with leading zeroes.

The copyright symbol ('©') has character code 169, which gives a9 in hexadecimal notation, so you could write it as '\u00a9'. Similarly, '♥' could be written as '\u2665'.

The hexadecimal part of this kind of character escape is case-insensitive; in other words, '\u00a9' and '\u00A9' are equivalent.

You could define Unicode escape syntax using the following regular expression: \\u[a-fA-F0-9]{4}.
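The same idea works for Unicode escapes. The sketch below uses a made-up helper, toUnicodeEscape, and assumes String#padStart() (ES2017) is available. It works on a single UTF-16 code unit, so a character above U+FFFF would need one escape per surrogate.

```javascript
// Hypothetical helper: build the \uHHHH escape for a single code unit.
function toUnicodeEscape(character) {
  const code = character.charCodeAt(0);
  // Pad to exactly four hexadecimal digits.
  return '\\u' + code.toString(16).padStart(4, '0');
}

console.log('\u00a9' === '\xa9'); // true (same character, different escape)
console.log('\u00a9' === '\u00A9'); // true (case-insensitive)
console.log(toUnicodeEscape('\u2665')); // \u2665
console.log(toUnicodeEscape('\xa9')); // \u00a9
console.log(toUnicodeEscape('a')); // \u0061
```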

Mothereffing character escapes

I wrote a JavaScript string escaper that combines these different kinds of escapes (except the deprecated octal escapes) and returns the smallest possible result string. Try it at mothereff.in/js-escapes!


You can use it to escape any character, but there’s an option to only escape non-ASCII and unprintable ASCII characters (which is probably the most useful). This way, you can easily turn a string such as 'Ich ♥ Bücher' into its smallest possible ASCII-only equivalent 'Ich \u2665 B\xfccher'. Back when I was working on Punycode.js unit tests, this tool saved me quite some time.
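To illustrate the idea (this is not the code behind the tool, just a rough sketch under simplifying assumptions), the function below leaves printable ASCII alone and escapes everything else with the shortest of \xHH and \uHHHH. The name escapeNonPrintableAscii is hypothetical. It assumes the result will go inside a single-quoted string literal, so it also escapes backslashes and single quotes, and it does not try to use the shorter single character escapes such as \n.

```javascript
// Hypothetical sketch: escape everything outside printable ASCII.
function escapeNonPrintableAscii(input) {
  let result = '';
  for (let i = 0; i < input.length; i++) {
    const character = input.charAt(i);
    const code = input.charCodeAt(i);
    if (character === '\\' || character === "'") {
      // Keep the output valid inside a single-quoted string literal.
      result += '\\' + character;
    } else if (code >= 0x20 && code <= 0x7e) {
      // Printable ASCII stays as-is.
      result += character;
    } else if (code < 0x100) {
      // Two hexadecimal digits are enough: use the shorter \xHH form.
      result += '\\x' + code.toString(16).padStart(2, '0');
    } else {
      // Everything else (including surrogates) needs \uHHHH.
      result += '\\u' + code.toString(16).padStart(4, '0');
    }
  }
  return result;
}

console.log(escapeNonPrintableAscii('Ich \u2665 B\xfccher')); // Ich \u2665 B\xfccher
```

Because each surrogate is escaped separately, characters above U+FFFF come out as two consecutive \u escapes.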

Comments

Deian wrote on :

You are one of the most REALLY useful developers around. Thank you for all of your articles Mathias! Wish you a Merry Christmas & Happy New Year
