The end-tag open (ETAGO) delimiter

Published 27th June 2011 · tagged with CSS, DOM, HTML, JavaScript

Disclaimer: Many thanks to Juriy ‘kangax’ Zaytsev (Юрий Зайцев) for writing the test case that inspired me to investigate this further, and everyone in #whatwg for helping me parse the specification correctly.

ETAGO in HTML4

The HTML 4.01 spec says:

Although the <style> and <script> elements use CDATA for their data model, for these elements, CDATA must be handled differently by user agents. Markup and entities must be treated as raw text and passed to the application as is. The first occurrence of the character sequence </ (ETAGO or end-tag open delimiter) is treated as terminating the end of the element’s content. In valid documents, this would be the end tag for the element.

Section 3.2.1 in Appendix B is more specific:

When script or style data is the content of an element (<script> and <style>), the data begins immediately after the element start tag and ends at the first ETAGO (</) delimiter followed by a name start character ([a-zA-Z]); note that this may not be the element’s end tag. Authors should therefore escape </ within the content. Escape mechanisms are specific to each scripting or style sheet language.

(Note that this only applies to inline styles and scripts in HTML documents, not external files that are referenced from the HTML.)
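
As a rough illustration of the Appendix B rule, the following JavaScript sketch finds the point at which an HTML4-conforming parser would end the content of a <style> or <script> element. The function name html4ContentEnd is made up for this example and is not part of any spec or library.

```javascript
// Hypothetical helper: returns the index at which HTML4 (Appendix B.3.2.1)
// would end script or style data, or -1 if the content contains no
// ETAGO delimiter followed by a name start character.
function html4ContentEnd(content) {
	// ETAGO (`</`) followed by a name start character ([a-zA-Z]).
	return content.search(/<\/[a-zA-Z]/);
}

const css = "p { content: '</abc'; background: green; }";
const end = html4ContentEnd(css);
// Everything before `end` is what HTML4 would treat as the element's content.
const html4Content = css.slice(0, end);
```

A bare </ that is not followed by a letter, as in 1 </ 2, yields -1 here, because only an ETAGO followed by a name start character counts.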

This means that technically the following code is invalid HTML4, and it shouldn’t work:

<!-- Remember, this is HTML4 we’re talking about. Redundant @type attributes ftw! -->
<style type="text/css">
	p {
		background: red;
	}
</style>
<style type="text/css">
	p {
		content: '</abc';
		background: green;
	}
</style>

The second <style> element would be closed as soon as the parser reaches the ETAGO delimiter, and none of the style rules in it would be applied. Paragraphs would get a red background color (see the first <style> element). It would be equivalent to the following non-conformant markup:

<style type="text/css">
	p {
		background: red;
	}
</style>
<style type="text/css">
	p {
		content: '
</style>
bc'; background: green; }</style>

The same goes for <script> elements:

<script type="text/javascript">
	document.write('<p>Foo</p>');
</script>

As per HTML4, the <script> element should be closed prematurely, resulting in a JavaScript SyntaxError, since it would be interpreted as follows:

<script type="text/javascript">
	document.write('<p>Foo
</script>
');</script>

Well, that’s the theory. In reality, no browser ever implemented this. The ETAGO delimiter isn’t respected as a terminating sequence for <style> and <script> elements in any browser. You can easily confirm this yourself by viewing the test cases based on the above code examples: ETAGO delimiter inside a <style> element and ETAGO delimiter inside a <script> element.

Back to reality with HTML5

Rather than expecting existing implementations to change, ‘HTML5’ a.k.a. the HTML Living Standard standardizes the behavior that browsers had implemented (with a few security improvements). This is described in the spec as part of the full tokenization algorithm.

This means the above examples are now valid HTML. And of course, they continue to work correctly, as they always did. Generally, ETAGO delimiters can be used inside <style> and <script> elements. Just keep in mind that the full </style and </script strings followed by whitespace, >, or / will close their respective elements.
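
To make that rule concrete, here is a small JavaScript sketch that checks whether some text would close a <script> or <style> element early. The helper name closesRawTextElement is hypothetical, and the check deliberately ignores the backwards-compatibility exception for <script> described further down.

```javascript
// Hypothetical helper: does `content` contain a sequence that ends a
// raw text element named `tagName` ('script' or 'style')?
// Assumes `tagName` contains no regular expression metacharacters.
function closesRawTextElement(tagName, content) {
	// The end tag is recognized when `</` plus the tag name (matched
	// case-insensitively) is followed by whitespace, `/`, or `>`.
	// CR is listed because it is normalized to LF before tokenization.
	const pattern = new RegExp('</' + tagName + '[\\t\\n\\f\\r />]', 'i');
	return pattern.test(content);
}

// Other ETAGO delimiters are harmless:
const harmless = closesRawTextElement('script', "document.write('<p>Foo</p>');");
// The element's own end tag is not:
const harmful = closesRawTextElement('script', 'document.write("</script>");');
```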

Semi-related fun fact: Since the <title> element is an RCDATA element that uses the text content model, there’s no need to encode < inside of it unless you want to use </title followed by any of those characters. <title>foo < bar</title> and <title><i>foo</i></title> are perfectly valid markup as per HTML. The same goes for <textarea>. In spec lingo: <script> and <style> are raw text elements, <textarea> and <title> are RCDATA elements.

For backwards compatibility, there’s an interesting exception to this rule for <script> elements that contain <!--: in that case, e.g. </script> is allowed in the <script> element’s content. Here’s a valid, working example:

<script>
	<!--
		document.write("<script>alert('LOLWAT')</script>")
	-->
</script>

While this is good to know, luckily there are better solutions than this old-school ’90s-style pattern (that only works for <script> elements anyway). Whenever you need to use </style> inside a <style> element, or </script> inside a <script> element, just escape these strings. In both CSS and JavaScript there are various ways of doing this, but using a backslash (\, also known as “reverse solidus character”) is by far the simplest:

<style>
	p {
		/* Using the Unicode code point for the solidus character (see https://mths.be/bax): */
		content: '<\00002Fstyle>';
		/* Using the shorthand notation for Unicode code points (see https://mths.be/bax): */
		content: '<\2F style>';
		/* Simply escaping the solidus character with a reverse solidus (\): */
		content: '<\/style>';
		background: green;
	}
</style>

<script>
	// Using `unescape()`:
	document.write(unescape('<script>alert("wtf")%3C/script>')); // Überlame.
	// Using string concatenation:
	document.write('<script>alert("heh")<' + '/script>'); // Lame.
	// Using the octal escape sequence for the solidus character (/):
	document.write('<script>alert("hah")<\57script>'); // Lame, deprecated, and disallowed in ES5 strict mode.
	// Using the Unicode escape sequence:
	document.write('<script>alert("hoh")<\u002Fscript>'); // Lame.
	// Using the hexadecimal escape sequence:
	document.write('<script>alert("huh")<\x2Fscript>'); // Lame.
	// Simply escaping the solidus character:
	document.write('<script>alert("O HAI")<\/script>'); // Awesome!
</script>

Both these examples are valid HTML, and of course they work as expected in any browser.

Note that while it’s an edge case, the </script character sequence can theoretically be used outside of strings in JavaScript, e.g. 42 </script/. Of course, the simple \/ escape won’t work here. In that case, make sure to use a space before the regex literal: 42 < /script/. (I can’t think of such a case for CSS though. Can you?)

Recommendations

The HTML Standard now has a section about the restrictions for contents of script elements, which includes the following piece of advice:

The easiest and safest way to avoid the rather strange restrictions described in this section is to always escape "<!--" as "<\!--", "<script" as "<\script", and "</script" as "<\/script" when these sequences appear in literals in scripts (e.g. in strings, regular expressions, or comments), and to avoid writing code that uses such constructs in expressions.
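
A minimal sketch of that advice in JavaScript, assuming the input is the raw source text of a string literal’s contents. It does not cover regular expression literals, where \s has a meaning of its own. The function name escapeForInlineScript is an assumption made for this example.

```javascript
// Hypothetical helper: inserts a backslash after `<` in the three
// sequences the HTML Standard warns about, so that
//   <!--      becomes  <\!--
//   <script   becomes  <\script
//   </script  becomes  <\/script
// Inside a JavaScript string literal, `\!`, `\s`, and `\/` evaluate to
// `!`, `s`, and `/`, so the string's value is unchanged.
function escapeForInlineScript(literalText) {
	return literalText.replace(/<(!--|\/?script)/gi, '<\\$1');
}

const escaped = escapeForInlineScript('<!--<script></script>');
```

The match is case-insensitive because HTML tag names are, so <SCRIPT and </SCRIPT are escaped as well.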

Comments

David Håsäther wrote on 27th June 2011 at 16:58:

Actually, your first example is perfectly fine. Only ETAGO delimiter-in-context — that is, ETAGO followed by a name start character (for HTML 4 this is [a-zA-Z]) — would be recognized by the parser as an end-tag. </ alone is treated as data.

E.g. content: "</a" would trigger it though.

Mathias wrote on 27th June 2011 at 17:01:

David Håsäther: Thanks, I’ve tweaked the example and updated the article with a link to section B.3.2.1 of the HTML4 spec.

Jordi Boggiano wrote on 27th June 2011 at 17:49:

Another related piece of trivia: as far as I know, this is the reason JSON lets you escape forward slashes as \/. Doing so makes any string safe for inlining, even if it contains </script>.

zcorpan wrote on 27th June 2011 at 18:38:

If we’re going to be pedantic here, HTML4 doesn’t require the element to be closed (even if the prose gives the appearance that it does), because it doesn’t use any RFC 2119 keywords. It would be inappropriate for it to do so, since it’s the business of SGML to define. AFAIK, SGML just says that an ETAGO that doesn’t match the end tag is invalid, without specifying any behavior for when it would occur (so not closing the element would be as compliant as closing the element or indeed fatally aborting parsing).

Also, HTML5 doesn’t specify what browsers already did. When browsers hit the end of the file while inside a <script> or <style> (or indeed <textarea>, <title> and others), they rewound the input stream to the start tag and reparsed with different parsing rules, under which a matching close tag would close the element despite coming after <!--. HTML5 doesn’t specify this because reparsing is a security problem. Web content, of course, on one hand uses things like the document.write example in this post, and on the other hand uses <script><!-- in one place but only </script> in another, and expects the page to work fine (since it did in browsers, thanks to reparsing). With the constraint to never reparse, the complexity in the spec now is what was needed to be compatible enough (which is basically as compatible as possible) with existing Web content. It is based on extensive research by myself, with data and help from Philip Taylor. Henri Sivonen also had a few ideas on how to solve this but didn’t have enough research to come up with a compatible solution.

Mathias wrote on 27th June 2011 at 18:53:

zcorpan: Thanks for the detailed explanation. I’ve updated the article with a link to your comment.

Yahel Carmon wrote on 31st July 2011 at 18:03:

It seems like rigid enforcement of ETAGO rules is a problem with DOMDocument, PHP’s major DOM parsing library.

See this test case: http://static.bwerp.net/~adam/2010/10/23/dom.php

Presence of </ causes the parser to enforce the rules rigidly, and prematurely terminate the script block, ignoring the rest of the block. In theory, this is the “correct” behavior under previous versions of the spec, but, considering no browser follows that rigidity, it’s a big pain. There doesn’t appear to be any way to override the behavior in a setting.

TallahasseeJames wrote on 14th December 2011 at 00:14:

Actually, those of us who were thorough in the 90’s added one more layer to that closing tagline (just prior to the ETAGO):

<script language="javascript">
<!-- // Hide from elder browsers…
var thisCode = "We walked uphill both ways, and we LIKED it!";
// …stop hiding. -->
</script>

The idea was that modern browsers (like Nutscrape 4.7 and Internet Exploder 5.0) that could run JS would see the single-line comment // and ignore the rest, while lame old browsers like IE 4 could suck it (although my manager in 2000 did insist on complete IE 4 and partial IE 3 support until we showed her the server logs and talked her into dropping them…).

Henri Sivonen wrote on 28th March 2012 at 08:26:

In string literals in inline scripts, </script> isn’t the only dangerous substring. <!-- is a dangerous substring, too, because it might mask a following </script> that’s really meant to act as the end tag of the inline script. Therefore, you should never let the substring <!-- or the substring </script> appear in string literals (or regexp literals but those rarely come from untrusted sources) in inline scripts. The way to deal with both in a way that’s safe to even automate is to replace < with \u003C or \x3C in string literals. (If you include untrusted content in string literals, you should also escape line breaks, including weird Unicode breaks, and quotes, of course.)

Update by Mathias: Here’s an example of what Henri is talking about: http://software.hixie.ch/utilities/js/live-dom-viewer/saved/2490

<!DOCTYPE html>
<script>"<!--<script><\/script>";
alert(1)
</script>
<!-- LOL -->
<script></script>

Because <!-- is not escaped here, browsers with an HTML5 parser treat this code differently than browsers without an HTML5 parser.

By escaping <!--, old browsers without an HTML5 parser behave the same as modern browsers with an HTML5 parser: http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=2493. (The alert message should be displayed.)

More info here: http://krijnhoetmer.nl/irc-logs/whatwg/20130826#l-287.

Dylan wrote on 27th March 2015 at 04:14:

</script> inside <script></script> only seems to not end the tag if you also have a matching <script>. But at least a couple of well-known HTML parsers can’t handle this construct anyway, so I doubt too many people would use it.

Jacek wrote on 30th June 2015 at 14:08:

There is one more problem with --> in script, which is harder to work around. Unlike </script>, --> makes sense outside strings and comments.

Mathias Bynens
