Mathias Bynens

The end-tag open (ETAGO) delimiter

Tagged with CSS, DOM, HTML, JavaScript

Disclaimer: Many thanks to Juriy ‘kangax’ Zaytsev (Юрий Зайцев) for writing the test case that inspired me to investigate this further, and everyone in #whatwg for helping me parse the specification correctly.

ETAGO in HTML4

The HTML 4.01 spec says:

Although the <style> and <script> elements use CDATA for their data model, for these elements, CDATA must be handled differently by user agents. Markup and entities must be treated as raw text and passed to the application as is. The first occurrence of the character sequence </ (ETAGO or end-tag open delimiter) is treated as terminating the end of the element’s content. In valid documents, this would be the end tag for the element.

Section 3.2.1 in Appendix B is more specific:

When script or style data is the content of an element (<script> and <style>), the data begins immediately after the element start tag and ends at the first ETAGO (</) delimiter followed by a name start character ([a-zA-Z]); note that this may not be the element’s end tag. Authors should therefore escape </ within the content. Escape mechanisms are specific to each scripting or style sheet language.

(Note that this only applies to inline styles and scripts in HTML documents, not external files that are referenced from the HTML.)

This means that technically the following code is invalid HTML4, and it shouldn’t work:

<!-- Remember, this is HTML4 we’re talking about. Redundant @type attributes ftw! -->
<style type="text/css">
p {
background: red;
}
</style>
<style type="text/css">
p {
content: '</abc';
background: green;
}
</style>

The second <style> element would be closed as soon as the parser reaches the ETAGO delimiter, and none of the style rules in it would be applied. Paragraphs would get a red background color (see the first <style> element). It would be equivalent to the following non-conformant markup:

<style type="text/css">
p {
background: red;
}
</style>
<style type="text/css">
p {
content: '</style>bc'; background: green; }</style>

The same goes for <script> elements:

<script type="text/javascript">
document.write('<p>Foo</p>');
</script>

As per HTML4, the <script> element should be closed prematurely, resulting in a JavaScript SyntaxError, since it would be interpreted as follows:

<script type="text/javascript">
document.write('<p>Foo</script>');</script>
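To make the HTML4 rule concrete, here’s a minimal JavaScript sketch of where an HTML4-conformant parser would end the content of a <style> or <script> element. The function name html4ContentEnd is made up for this illustration; the regular expression is a direct transcription of “ETAGO followed by a name start character ([a-zA-Z])” from Appendix B.

```javascript
// Hypothetical helper, not part of any spec or library: returns the index at
// which HTML4 (Appendix B, section 3.2.1) says script/style data ends, or -1
// if the data contains no ETAGO followed by a name start character.
function html4ContentEnd(data) {
  const match = /<\/[a-zA-Z]/.exec(data);
  return match ? match.index : -1;
}

const styleData = "p { content: '</abc'; background: green; }";
const scriptData = "document.write('<p>Foo</p>');";

console.log(styleData.slice(0, html4ContentEnd(styleData)));
// → p { content: '
console.log(scriptData.slice(0, html4ContentEnd(scriptData)));
// → document.write('<p>Foo
console.log(html4ContentEnd("a </ b"));
// → -1 (a bare </ is just data)
```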

Well, that’s the theory. In reality, no browser ever implemented this. The ETAGO delimiter isn’t respected as a terminating sequence for <style> and <script> elements in any browser. You can easily confirm this yourself by viewing the test cases based on the above code examples: ETAGO delimiter inside a <style> element and ETAGO delimiter inside a <script> element.

Back to reality with HTML5

Rather than expecting existing implementations to change, ‘HTML5’ a.k.a. the HTML Living Standard standardizes the behavior that browsers had implemented (with a few security improvements). This is described in the spec as part of the full tokenization algorithm.

This means the above examples are now valid HTML. And of course, they continue to work correctly, as they always did. Generally, ETAGO delimiters can be used inside of <style> and <script> elements. Just keep in mind that the full </style and </script strings, when followed by a whitespace character, >, or /, will close their respective elements.
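As a rough illustration of that last rule, the following sketch checks whether a chunk of text would terminate the <script> or <style> element it’s embedded in. closesRawTextElement is a hypothetical name, and the check is deliberately simplified: it ignores the <!-- exception for <script> elements, and it assumes tag names are matched ASCII case-insensitively, as the HTML Standard’s tokenizer does.

```javascript
// Simplified sketch, assuming the HTML Standard's tokenizer rules: inside a
// raw text element, `</name` only acts as an end tag when it's followed by
// whitespace, `/`, or `>`. The `<!--` special case for <script> is ignored.
function closesRawTextElement(content, tagName) {
  if (tagName !== 'script' && tagName !== 'style') {
    throw new TypeError('Expected "script" or "style"');
  }
  // \r is included because the parser normalizes CR to LF before tokenizing.
  const endTag = new RegExp('</' + tagName + '[\\t\\n\\f\\r />]', 'i');
  return endTag.test(content);
}

console.log(closesRawTextElement("content: '</abc';", 'style'));      // → false
console.log(closesRawTextElement("content: '</style>';", 'style'));   // → true
console.log(closesRawTextElement("content: '<\\/style>';", 'style')); // → false
console.log(closesRawTextElement("x = '</SCRIPT >';", 'script'));     // → true
console.log(closesRawTextElement("x = '</scripts>';", 'script'));     // → false
```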

Semi-related fun fact: Since the <title> element is an RCDATA element that uses the text content model, there’s no need to encode < inside of it unless you want to use </title followed by any of those characters. <title>foo < bar</title> and <title><i>foo</i></title> are perfectly valid markup as per HTML. The same goes for <textarea>. In spec lingo: <script> and <style> are raw text elements, <textarea> and <title> are RCDATA elements.

For backwards compatibility, there’s an interesting exception to this rule for <script> elements that contain <!-- with a later occurrence of -->: in that case, e.g. </script> is allowed in the <script> element’s content, as long as it’s preceded by a matching <script inside that <!-- … --> section. Here’s a valid, working example:

<script>
<!--
document.write("<script>alert('LOLWAT')</script>")
-->
</script>

While this is good to know, luckily there are better solutions than this old-school ’90s-style pattern (that only works for <script> elements anyway). Whenever you need to use </style> inside a <style> element, or </script> inside a <script> element, just escape these strings. In both CSS and JavaScript there are various ways of doing this, but using a backslash (\, also known as “reverse solidus character”) is by far the simplest:

<style>
p {
/* Using the Unicode code point for the solidus character (see http://mths.be/bax): */
content: '<\00002Fstyle>';
/* Using the shorthand notation for Unicode code points (see http://mths.be/bax): */
content: '<\2F style>';
/* Simply escaping the solidus character with a reverse solidus (\): */
content: '<\/style>';
background: green;
}
</style>
<script>
// Using `unescape()`:
document.write(unescape('<script>alert("wtf")%3C/script>')); // Überlame.
// Using string concatenation:
document.write('<script>alert("heh")<' + '/script>'); // Lame.
// Using the octal escape sequence for the solidus character (/):
document.write('<script>alert("hah")<\57script>'); // Lame, deprecated, and disallowed in ES5 strict mode.
// Using the Unicode escape sequence:
document.write('<script>alert("hoh")<\u002Fscript>'); // Lame.
// Using the hexadecimal escape sequence:
document.write('<script>alert("huh")<\x2Fscript>'); // Lame.
// Simply escaping the solidus character:
document.write('<script>alert("O HAI")<\/script>'); // Awesome!
</script>

Both these examples are valid HTML, and of course they work as expected in any browser.
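If the script or style content is generated by code rather than written by hand, the backslash approach is easy to automate. Here’s a minimal sketch; escapeEndTags is a made-up helper name, and it assumes its input is the body of a JavaScript or CSS string literal, where \/ means the same as /.

```javascript
// Hypothetical helper: inserts a backslash before the solidus in every
// `</script` or `</style` (any letter case), so the HTML parser no longer
// sees an end tag. Only safe for text that ends up inside a JS or CSS
// string literal, where `\/` is just another way of writing `/`.
function escapeEndTags(literalBody) {
  return literalBody.replace(/<\/(script|style)/gi, '<\\/$1');
}

const escaped = escapeEndTags('<script>alert("O HAI")</script>');
console.log(escaped);
// → <script>alert("O HAI")<\/script>
```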

Note that while it’s an edge case, the </script character sequence can theoretically be used outside of strings in JavaScript, e.g. 42 </script/. Of course, the simple \/ escape won’t work here. In that case, make sure to use a space before the regex literal: 42 < /script/. (I can’t think of such a case for CSS though. Can you?)

Comments

David Håsäther wrote:

Actually, your first example is perfectly fine. Only ETAGO delimiter-in-context — that is, ETAGO followed by a name start character (for HTML 4 this is [a-zA-Z]) — would be recognized by the parser as an end-tag. </ alone is treated as data.

E.g. content: "</a" would trigger it though.

Jordi Boggiano wrote:

Another related bit of trivia: as far as I know, this is the reason JSON lets you escape forward slashes (as \/). It makes any string safe for inlining, even if it contains </script>.

zcorpan wrote:

If we’re going to be pedantic here, HTML4 doesn’t require the element to be closed (even if the prose gives the appearance that it does), because it doesn’t use any RFC 2119 keywords. It would be inappropriate for it to do so, since it’s the business of SGML to define. AFAIK, SGML just says that an ETAGO that doesn’t match the end tag is invalid, without specifying any behavior for when it occurs (so not closing the element would be as compliant as closing the element or indeed fatally aborting parsing).

Also, HTML5 doesn’t specify what browsers already did. When browsers hit the end of the file while inside a <script> or <style> (or indeed <textarea>, <title> and others), they would rewind the input stream to the start tag and reparse with different parsing rules, under which a matching close tag would close the element despite coming after <!--. HTML5 doesn’t specify this because reparsing is a security problem. Now, Web content of course uses stuff like the document.write example in this post on one hand, and on the other hand uses <script><!-- here but only </script> there and expects the page to work fine (since it did in browsers, thanks to reparsing). Given the constraint to never reparse, the complexity in the spec is what was needed to be compatible enough (which is basically as compatible as possible) with existing Web content. It’s based on extensive research by myself, with data and help from Philip Taylor. Henri Sivonen also had a few ideas on how to solve this but didn’t have enough research to come up with a compatible solution.

Yahel Carmon wrote:

It seems like rigid enforcement of ETAGO rules is a problem with DOMDocument, PHP’s major DOM parsing library.

See this test case: http://static.bwerp.net/~adam/2010/10/23/dom.php

Presence of </ causes the parser to enforce the rules rigidly, and prematurely terminate the script block, ignoring the rest of the block. In theory, this is the “correct” behavior under previous versions of the spec, but, considering no browser follows that rigidity, it’s a big pain. There doesn’t appear to be any way to override the behavior in a setting.

TallahasseeJames wrote:

Actually, those of us who were thorough in the ’90s added one more layer to that closing line (just prior to the ETAGO):

<script language="javascript">
<!-- // Hide from elder browsers…
var thisCode = "We walked uphill both ways, and we LIKED it!";
// …stop hiding. -->
</script>

The idea was that modern browsers (like Nutscrape 4.7 and Internet Exploder 5.0) that could run JS would see the single-line comment // and ignore the rest, while lame old browsers like IE 4 could suck it (although my manager in 2000 did insist on complete IE 4 and partial IE 3 support until we showed her the server logs and talked her into dropping them…).

Henri Sivonen wrote:

In string literals in inline scripts, </script> isn’t the only dangerous substring. <!-- is a dangerous substring, too, because it might mask a following </script> that’s really meant to act as the end tag of the inline script. Therefore, you should never let the substring <!-- or the substring </script> appear in string literals (or regexp literals but those rarely come from untrusted sources) in inline scripts. The way to deal with both in a way that’s safe to even automate is to replace < with \u003C or \x3C in string literals. (If you include untrusted content in string literals, you should also escape line breaks, including weird Unicode breaks, and quotes, of course.)


Update by Mathias: Here’s an example of what Henri is talking about: http://software.hixie.ch/utilities/js/live-dom-viewer/saved/2490 Because <!-- is not escaped here, browsers with an HTML5 parser treat this code differently than browsers without an HTML5 parser.

By escaping <!--, old browsers without an HTML5 parser behave the same as modern browsers with an HTML5 parser: http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=2493. (The alert message should be displayed.)

More info here: http://krijnhoetmer.nl/irc-logs/whatwg/20130826#l-287.
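A minimal sketch of the escaping Henri describes, for the common case of embedding data in an inline script. toInlineScriptLiteral is a hypothetical name, and the sketch assumes a JSON-serializable value. It leans on JSON.stringify for quotes and ordinary line breaks, then escapes < plus the two Unicode line separators (U+2028 and U+2029), which older JavaScript engines treat as line terminators inside string literals.

```javascript
// Hypothetical helper: serializes a JSON-serializable value as a JavaScript
// literal that contains no `<` at all, so neither `</script>` nor `<!--`
// can appear in the inline script's source.
function toInlineScriptLiteral(value) {
  return JSON.stringify(value)
    .replace(/</g, '\\u003C')
    .replace(/\u2028/g, '\\u2028')
    .replace(/\u2029/g, '\\u2029');
}

const untrusted = '<!-- <script>alert(1)</script>';
const literal = toInlineScriptLiteral(untrusted);
console.log(literal);
// → "\u003C!-- \u003Cscript>alert(1)\u003C/script>"
console.log(JSON.parse(literal) === untrusted);
// → true
```

The output is still valid JSON, so the same string can be parsed back on the server or evaluated as a literal in the page.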
