<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en-US-x-Hixie" ><head><title>8.2 Parsing HTML documents &#8212; HTML5 </title><style type="text/css">
pre { margin-left: 2em; white-space: pre-wrap; }
h2 { margin: 3em 0 1em 0; }
h3 { margin: 2.5em 0 1em 0; }
h4 { margin: 2.5em 0 0.75em 0; }
h5, h6 { margin: 2.5em 0 1em; }
h1 + h2, h1 + h2 + h2 { margin: 0.75em 0 0.75em; }
h2 + h3, h3 + h4, h4 + h5, h5 + h6 { margin-top: 0.5em; }
p { margin: 1em 0; }
hr:not(.top) { display: block; background: none; border: none; padding: 0; margin: 2em 0; height: auto; }
dl, dd { margin-top: 0; margin-bottom: 0; }
dt { margin-top: 0.75em; margin-bottom: 0.25em; clear: left; }
dt + dt { margin-top: 0; }
dd dt { margin-top: 0.25em; margin-bottom: 0; }
dd p { margin-top: 0; }
dd dl + p { margin-top: 1em; }
dd table + p { margin-top: 1em; }
p + * > li, dd li { margin: 1em 0; }
dt, dfn { font-weight: bold; font-style: normal; }
dt dfn { font-style: italic; }
pre, code { font-size: inherit; font-family: monospace; font-variant: normal; }
pre strong { color: black; font: inherit; font-weight: bold; background: yellow; }
pre em { font-weight: bolder; font-style: normal; }
@media screen { code { color: orangered; } code :link, code :visited { color: inherit; } }
var sub { vertical-align: bottom; font-size: smaller; position: relative; top: 0.1em; }
table { border-collapse: collapse; border-style: hidden hidden none hidden; }
table thead, table tbody { border-bottom: solid; }
table tbody th:first-child { border-left: solid; }
table tbody th { text-align: left; }
table td, table th { border-left: solid; border-right: solid; border-bottom: solid thin; vertical-align: top; padding: 0.2em; }
blockquote { margin: 0 0 0 2em; border: 0; padding: 0; font-style: italic; }
.bad, .bad *:not(.XXX) { color: gray; border-color: gray; background: transparent; }
.matrix, .matrix td { border: none; text-align: right; }
.matrix { margin-left: 2em; }
.dice-example { border-collapse: collapse; border-style: hidden solid solid hidden; border-width: thin; margin-left: 3em; }
.dice-example caption { width: 30em; font-size: smaller; font-style: italic; padding: 0.75em 0; text-align: left; }
.dice-example td, .dice-example th { border: solid thin; width: 1.35em; height: 1.05em; text-align: center; padding: 0; }
.toc dfn, h1 dfn, h2 dfn, h3 dfn, h4 dfn, h5 dfn, h6 dfn { font: inherit; }
img.extra { float: right; }
pre.idl { border: solid thin; background: #EEEEEE; color: black; padding: 0.5em 1em; }
pre.idl :link, pre.idl :visited { color: inherit; background: transparent; }
pre.css { border: solid thin; background: #FFFFEE; color: black; padding: 0.5em 1em; }
pre.css:first-line { color: #AAAA50; }
dl.domintro { color: green; margin: 2em 0 2em 2em; padding: 0.5em 1em; border: none; background: #DDFFDD; }
hr + dl.domintro, div.impl + dl.domintro { margin-top: 2.5em; margin-bottom: 1.5em; }
dl.domintro dt, dl.domintro dt * { color: black; text-decoration: none; }
dl.domintro dd { margin: 0.5em 0 1em 2em; padding: 0; }
dl.domintro dd p { margin: 0.5em 0; }
dl.switch { padding-left: 2em; }
dl.switch > dt { text-indent: -1.5em; }
dl.switch > dt:before { content: '\21AA'; padding: 0 0.5em 0 0; display: inline-block; width: 1em; text-align: right; line-height: 0.5em; }
dl.triple { padding: 0 0 0 1em; }
dl.triple dt, dl.triple dd { margin: 0; display: inline }
dl.triple dt:after { content: ':'; }
dl.triple dd:after { content: '\A'; white-space: pre; }
.diff-old { text-decoration: line-through; color: silver; background: transparent; }
.diff-chg, .diff-new { text-decoration: underline; color: green; background: transparent; }
a .diff-new { border-bottom: 1px blue solid; }
h2 { page-break-before: always; }
h1, h2, h3, h4, h5, h6 { page-break-after: avoid; }
h1 + h2, hr + h2.no-toc { page-break-before: auto; }
p > span:not([title=""]):not([class="XXX"]):not([class="impl"]):not([class="note"]),
li > span:not([title=""]):not([class="XXX"]):not([class="impl"]):not([class="note"]), { border-bottom: solid #9999CC; }
div.head { margin: 0 0 1em; padding: 1em 0 0 0; }
div.head p { margin: 0; }
div.head h1 { margin: 0; }
div.head .logo { float: right; margin: 0 1em; }
div.head .logo img { border: none } /* remove border from top image */
div.head dl { margin: 1em 0; }
div.head p.copyright, div.head p.alt { font-size: x-small; font-style: oblique; margin: 0; }
body > .toc > li { margin-top: 1em; margin-bottom: 1em; }
body > .toc.brief > li { margin-top: 0.35em; margin-bottom: 0.35em; }
body > .toc > li > * { margin-bottom: 0.5em; }
body > .toc > li > * > li > * { margin-bottom: 0.25em; }
.toc, .toc li { list-style: none; }
.brief { margin-top: 1em; margin-bottom: 1em; line-height: 1.1; }
.brief li { margin: 0; padding: 0; }
.brief li p { margin: 0; padding: 0; }
.category-list { margin-top: -0.75em; margin-bottom: 1em; line-height: 1.5; }
.category-list::before { content: '\21D2\A0'; font-size: 1.2em; font-weight: 900; }
.category-list li { display: inline; }
.category-list li:not(:last-child)::after { content: ', '; }
.category-list li > span, .category-list li > a { text-transform: lowercase; }
.category-list li * { text-transform: none; } /* don't affect <code> nested in <a> */
.XXX { color: #E50000; background: white; border: solid red; padding: 0.5em; margin: 1em 0; }
.XXX > :first-child { margin-top: 0; }
p .XXX { line-height: 3em; }
.annotation { border: solid thin black; background: #0C479D; color: white; position: relative; margin: 8px 0 20px 0; }
.annotation:before { position: absolute; left: 0; top: 0; width: 100%; height: 100%; margin: 6px -6px -6px 6px; background: #333333; z-index: -1; content: ''; }
.annotation :link, .annotation :visited { color: inherit; }
.annotation :link:hover, .annotation :visited:hover { background: transparent; }
.annotation span { border: none ! important; }
.note { color: green; background: transparent; font-family: sans-serif; }
.warning { color: red; background: transparent; }
.note, .warning { font-weight: bolder; font-style: italic; }
p.note, div.note { padding: 0.5em 2em; }
span.note { padding: 0 2em; }
.note p:first-child, .warning p:first-child { margin-top: 0; }
.note p:last-child, .warning p:last-child { margin-bottom: 0; }
.warning:before { font-style: normal; }
p.note:before { content: 'Note: '; }
p.warning:before { content: '\26A0 Warning! '; }
.bookkeeping:before { display: block; content: 'Bookkeeping details'; font-weight: bolder; font-style: italic; }
.bookkeeping { font-size: 0.8em; margin: 2em 0; }
.bookkeeping p { margin: 0.5em 2em; display: list-item; list-style: square; }
.bookkeeping dt { margin: 0.5em 2em 0; }
.bookkeeping dd { margin: 0 3em 0.5em; }
h4 { position: relative; z-index: 3; }
h4 + .element, h4 + div + .element { margin-top: -2.5em; padding-top: 2em; }
.element {
background: #EEEEFF;
color: black;
margin: 0 0 1em 0.15em;
padding: 0 1em 0.25em 0.75em;
border-left: solid #9999FF 0.25em;
position: relative;
z-index: 1;
}
.element:before {
position: absolute;
z-index: 2;
top: 0;
left: -1.15em;
height: 2em;
width: 0.9em;
background: #EEEEFF;
content: ' ';
border-style: none none solid solid;
border-color: #9999FF;
border-width: 0.25em;
}
.example { display: block; color: #222222; background: #FCFCFC; border-left: double; margin-left: 2em; padding-left: 1em; }
td > .example:only-child { margin: 0 0 0 0.1em; }
ul.domTree, ul.domTree ul { padding: 0 0 0 1em; margin: 0; }
ul.domTree li { padding: 0; margin: 0; list-style: none; position: relative; }
ul.domTree li li { list-style: none; }
ul.domTree li:first-child::before { position: absolute; top: 0; height: 0.6em; left: -0.75em; width: 0.5em; border-style: none none solid solid; content: ''; border-width: 0.1em; }
ul.domTree li:not(:last-child)::after { position: absolute; top: 0; bottom: -0.6em; left: -0.75em; width: 0.5em; border-style: none none solid solid; content: ''; border-width: 0.1em; }
ul.domTree span { font-style: italic; font-family: serif; }
ul.domTree .t1 code { color: purple; font-weight: bold; }
ul.domTree .t2 { font-style: normal; font-family: monospace; }
ul.domTree .t2 .name { color: black; font-weight: bold; }
ul.domTree .t2 .value { color: blue; font-weight: normal; }
ul.domTree .t3 code, .domTree .t4 code, .domTree .t5 code { color: gray; }
ul.domTree .t7 code, .domTree .t8 code { color: green; }
ul.domTree .t10 code { color: teal; }
body.dfnEnabled dfn { cursor: pointer; }
.dfnPanel {
display: inline;
position: absolute;
z-index: 10;
height: auto;
width: auto;
padding: 0.5em 0.75em;
font: small sans-serif, Droid Sans Fallback;
background: #DDDDDD;
color: black;
border: outset 0.2em;
}
.dfnPanel * { margin: 0; padding: 0; font: inherit; text-indent: 0; }
.dfnPanel :link, .dfnPanel :visited { color: black; }
.dfnPanel p { font-weight: bolder; }
.dfnPanel * + p { margin-top: 0.25em; }
.dfnPanel li { list-style-position: inside; }
#configUI { position: absolute; z-index: 20; top: 10em; right: 1em; width: 11em; font-size: small; }
#configUI p { margin: 0.5em 0; padding: 0.3em; background: #EEEEEE; color: black; border: inset thin; }
#configUI p label { display: block; }
#configUI #updateUI, #configUI .loginUI { text-align: center; }
#configUI input[type=button] { display: block; margin: auto; }
fieldset { margin: 1em; padding: 0.5em 1em; }
fieldset > legend + * { margin-top: 0; }
fieldset > :last-child { margin-bottom: 0; }
fieldset p { margin: 0.5em 0; }
.stability {
position: fixed;
bottom: 0;
left: 0; right: 0;
margin: 0 auto 0 auto !important;
z-index: 1000;
width: 50%;
background: maroon; color: yellow;
-webkit-border-radius: 1em 1em 0 0;
-moz-border-radius: 1em 1em 0 0;
border-radius: 1em 1em 0 0;
-moz-box-shadow: 0 0 1em #500;
-webkit-box-shadow: 0 0 1em #500;
box-shadow: 0 0 1em red;
padding: 0.5em 1em;
text-align: center;
}
.stability strong {
display: block;
}
.stability input {
appearance: none; margin: 0; border: 0; padding: 0.25em 0.5em; background: transparent; color: black;
position: absolute; top: -0.5em; right: 0; font: 1.25em sans-serif; text-align: center;
}
.stability input:hover {
color: white;
text-shadow: 0 0 2px black;
}
.stability input:active {
padding: 0.3em 0.45em 0.2em 0.55em;
}
.stability :link, .stability :visited,
.stability :link:hover, .stability :visited:hover {
background: transparent;
color: white;
}
</style><link href="data:text/css,.impl%20%7B%20display:%20none;%20%7D%0Ahtml%20%7B%20border:%20solid%20yellow;%20%7D%20.domintro:before%20%7B%20display:%20none;%20%7D" id="author" rel="alternate stylesheet" title="Author documentation only"><link href="data:text/css,.impl%20%7B%20background:%20%23FFEEEE;%20%7D%20.domintro:before%20%7B%20background:%20%23FFEEEE;%20%7D" id="highlight" rel="alternate stylesheet" title="Highlight implementation
requirements"><link href="http://www.w3.org/StyleSheets/TR/W3C-WD" rel="stylesheet" type="text/css"><style type="text/css">
.applies thead th > * { display: block; }
.applies thead code { display: block; }
.applies tbody th { white-space: nowrap; }
.applies td { text-align: center; }
.applies .yes { background: yellow; }
.matrix, .matrix td { border: hidden; text-align: right; }
.matrix { margin-left: 2em; }
.dice-example { border-collapse: collapse; border-style: hidden solid solid hidden; border-width: thin; margin-left: 3em; }
.dice-example caption { width: 30em; font-size: smaller; font-style: italic; padding: 0.75em 0; text-align: left; }
.dice-example td, .dice-example th { border: solid thin; width: 1.35em; height: 1.05em; text-align: center; padding: 0; }
td.eg { border-width: thin; text-align: center; }
#table-example-1 { border: solid thin; border-collapse: collapse; margin-left: 3em; }
#table-example-1 * { font-family: "Essays1743", serif; line-height: 1.01em; }
#table-example-1 caption { padding-bottom: 0.5em; }
#table-example-1 thead, #table-example-1 tbody { border: none; }
#table-example-1 th, #table-example-1 td { border: solid thin; }
#table-example-1 th { font-weight: normal; }
#table-example-1 td { border-style: none solid; vertical-align: top; }
#table-example-1 th { padding: 0.5em; vertical-align: middle; text-align: center; }
#table-example-1 tbody tr:first-child td { padding-top: 0.5em; }
#table-example-1 tbody tr:last-child td { padding-bottom: 1.5em; }
#table-example-1 tbody td:first-child { padding-left: 2.5em; padding-right: 0; width: 9em; }
#table-example-1 tbody td:first-child::after { content: leader(". "); }
#table-example-1 tbody td { padding-left: 2em; padding-right: 2em; }
#table-example-1 tbody td:first-child + td { width: 10em; }
#table-example-1 tbody td:first-child + td ~ td { width: 2.5em; }
#table-example-1 tbody td:first-child + td + td + td ~ td { width: 1.25em; }
.apple-table-examples { border: none; border-collapse: separate; border-spacing: 1.5em 0em; width: 40em; margin-left: 3em; }
.apple-table-examples * { font-family: "Times", serif; }
.apple-table-examples td, .apple-table-examples th { border: none; white-space: nowrap; padding-top: 0; padding-bottom: 0; }
.apple-table-examples tbody th:first-child { border-left: none; width: 100%; }
.apple-table-examples thead th:first-child ~ th { font-size: smaller; font-weight: bolder; border-bottom: solid 2px; text-align: center; }
.apple-table-examples tbody th::after, .apple-table-examples tfoot th::after { content: leader(". ") }
.apple-table-examples tbody th, .apple-table-examples tfoot th { font: inherit; text-align: left; }
.apple-table-examples td { text-align: right; vertical-align: top; }
.apple-table-examples.e1 tbody tr:last-child td { border-bottom: solid 1px; }
.apple-table-examples.e1 tbody + tbody tr:last-child td { border-bottom: double 3px; }
.apple-table-examples.e2 th[scope=row] { padding-left: 1em; }
.apple-table-examples sup { line-height: 0; }
.details-example img { vertical-align: top; }
#base64-table {
white-space: nowrap;
font-size: 0.6em;
column-width: 6em;
column-count: 5;
column-gap: 1em;
-moz-column-width: 6em;
-moz-column-count: 5;
-moz-column-gap: 1em;
-webkit-column-width: 6em;
-webkit-column-count: 5;
-webkit-column-gap: 1em;
}
#base64-table thead { display: none; }
#base64-table * { border: none; }
#base64-table tbody td:first-child:after { content: ':'; }
#base64-table tbody td:last-child { text-align: right; }
#named-character-references-table {
white-space: nowrap;
font-size: 0.6em;
column-width: 30em;
column-gap: 1em;
-moz-column-width: 30em;
-moz-column-gap: 1em;
-webkit-column-width: 30em;
-webkit-column-gap: 1em;
}
#named-character-references-table > table > tbody > tr > td:first-child + td,
#named-character-references-table > table > tbody > tr > td:last-child { text-align: center; }
#named-character-references-table > table > tbody > tr > td:last-child:hover > span { position: absolute; top: auto; left: auto; margin-left: 0.5em; line-height: 1.2; font-size: 5em; border: outset; padding: 0.25em 0.5em; background: white; width: 1.25em; height: auto; text-align: center; }
#named-character-references-table > table > tbody > tr#entity-CounterClockwiseContourIntegral > td:first-child { font-size: 0.5em; }
.glyph.control { color: red; }
@font-face {
font-family: 'Essays1743';
src: url('http://www.whatwg.org/specs/web-apps/current-work/fonts/Essays1743.ttf');
}
@font-face {
font-family: 'Essays1743';
font-weight: bold;
src: url('http://www.whatwg.org/specs/web-apps/current-work/fonts/Essays1743-Bold.ttf');
}
@font-face {
font-family: 'Essays1743';
font-style: italic;
src: url('http://www.whatwg.org/specs/web-apps/current-work/fonts/Essays1743-Italic.ttf');
}
@font-face {
font-family: 'Essays1743';
font-style: italic;
font-weight: bold;
src: url('http://www.whatwg.org/specs/web-apps/current-work/fonts/Essays1743-BoldItalic.ttf');
}
</style><style type="text/css">
.domintro:before { display: table; margin: -1em -0.5em -0.5em auto; width: auto; content: 'This box is non-normative. Implementation requirements are given below this box.'; color: black; font-style: italic; border: solid 2px; background: white; padding: 0 0.25em; }
</style><script type="text/javascript">
function getCookie(name) {
var params = location.search.substr(1).split("&");
for (var index = 0; index < params.length; index++) {
if (params[index] == name)
return "1";
var data = params[index].split("=");
if (data[0] == name)
return unescape(data[1]);
}
var cookies = document.cookie.split("; ");
for (var index = 0; index < cookies.length; index++) {
var data = cookies[index].split("=");
if (data[0] == name)
return unescape(data[1]);
}
return null;
}
</script>
<script src="link-fixup.js" type="text/javascript"></script>
<link href="style.css" rel="stylesheet"><link href="syntax.html" title="8 The HTML syntax" rel="prev">
<link href="spec.html#contents" title="Table of contents" rel="index">
<link href="tokenization.html" title="8.2.4 Tokenization" rel="next">
</head><body><div class="head" id="head">
<div id="multipage-common">
<p class="stability" id="wip"><strong>This is a work in
progress!</strong> For the latest updates from the HTML WG, possibly
including important bug fixes, please look at the <a href="http://dev.w3.org/html5/spec/Overview.html">editor's draft</a> instead.
There may also be a more
<a href="http://www.w3.org/TR/html5">up-to-date Working Draft</a>
with changes based on resolution of Last Call issues.
<input onclick="closeWarning(this.parentNode)" type="button" value="&#9587;&#8413;"></p>
<script type="text/javascript">
function closeWarning(element) {
element.parentNode.removeChild(element);
var date = new Date();
date.setDate(date.getDate()+4);
document.cookie = 'hide-obsolescence-warning=1; expires=' + date.toGMTString();
}
if (getCookie('hide-obsolescence-warning') == '1')
setTimeout(function () { document.getElementById('wip').parentNode.removeChild(document.getElementById('wip')); }, 2000);
</script></div>
<p><a href="http://www.w3.org/"><img alt="W3C" height="48" src="http://www.w3.org/Icons/w3c_home" width="72"></a></p>
<h1>HTML5</h1>
</div><div>
<a href="syntax.html" class="prev">8 The HTML syntax</a> &#8211;
<a href="spec.html#contents">Table of contents</a> &#8211;
<a href="tokenization.html" class="next">8.2.4 Tokenization</a>
<ol class="toc"><li><ol><li><a href="parsing.html#parsing"><span class="secno">8.2 </span>Parsing HTML documents</a>
<ol><li><a href="parsing.html#overview-of-the-parsing-model"><span class="secno">8.2.1 </span>Overview of the parsing model</a></li><li><a href="parsing.html#the-input-stream"><span class="secno">8.2.2 </span>The input stream</a>
<ol><li><a href="parsing.html#determining-the-character-encoding"><span class="secno">8.2.2.1 </span>Determining the character encoding</a></li><li><a href="parsing.html#character-encodings-0"><span class="secno">8.2.2.2 </span>Character encodings</a></li><li><a href="parsing.html#preprocessing-the-input-stream"><span class="secno">8.2.2.3 </span>Preprocessing the input stream</a></li><li><a href="parsing.html#changing-the-encoding-while-parsing"><span class="secno">8.2.2.4 </span>Changing the encoding while parsing</a></li></ol></li><li><a href="parsing.html#parse-state"><span class="secno">8.2.3 </span>Parse state</a>
<ol><li><a href="parsing.html#the-insertion-mode"><span class="secno">8.2.3.1 </span>The insertion mode</a></li><li><a href="parsing.html#the-stack-of-open-elements"><span class="secno">8.2.3.2 </span>The stack of open elements</a></li><li><a href="parsing.html#the-list-of-active-formatting-elements"><span class="secno">8.2.3.3 </span>The list of active formatting elements</a></li><li><a href="parsing.html#the-element-pointers"><span class="secno">8.2.3.4 </span>The element pointers</a></li><li><a href="parsing.html#other-parsing-state-flags"><span class="secno">8.2.3.5 </span>Other parsing state flags</a></li></ol></li></ol></li></ol></li></ol></div>
<div class="impl">
<h3 id="parsing"><span class="secno">8.2 </span>Parsing HTML documents</h3>
<p><i>This section only applies to user agents, data mining tools,
and conformance checkers.</i></p>
<p class="note">The rules for parsing XML documents into DOM trees
are covered by the next section, entitled "<a href="the-xhtml-syntax.html#the-xhtml-syntax">The XHTML
syntax</a>".</p>
<p>For <a href="dom.html#html-documents">HTML documents</a>, user agents must use the parsing
rules described in this section to generate the DOM trees. Together,
these rules define what is referred to as the <dfn id="html-parser">HTML
parser</dfn>.</p>
<div class="note">
<p>While the HTML syntax described in this specification bears a
close resemblance to SGML and XML, it is a separate language with
its own parsing rules.</p>
<p>Some earlier versions of HTML (in particular from HTML2 to
HTML4) were based on SGML and used SGML parsing rules. However, few
(if any) Web browsers ever implemented true SGML parsing for HTML
documents; the only user agents to strictly handle HTML as an SGML
application have historically been validators. The resulting
confusion &#8212; with validators claiming documents to have one
representation while widely deployed Web browsers interoperably
implemented a different representation &#8212; has wasted decades
of productivity. This version of HTML thus returns to a non-SGML
basis.</p>
<p>Authors interested in using SGML tools in their authoring
pipeline are encouraged to use XML tools and the XML serialization
of HTML.</p>
</div>
<p>This specification defines the parsing rules for HTML documents,
whether they are syntactically correct or not. Certain points in the
parsing algorithm are said to be <dfn id="parse-error" title="parse error">parse
errors</dfn>. The error handling for parse errors is well-defined:
user agents must either act as described below when encountering
such problems, or must abort processing at the first error that they
encounter for which they do not wish to apply the rules described
below.</p>
<p>Conformance checkers must report at least one parse error
condition to the user if one or more parse error conditions exist in
the document and must not report parse error conditions if none
exist in the document. Conformance checkers may report more than one
parse error condition if more than one parse error condition exists
in the document. Conformance checkers are not required to recover
from parse errors.</p>
<p class="note">Parse errors are only errors with the
<em>syntax</em> of HTML. In addition to checking for parse errors,
conformance checkers will also verify that the document obeys all
the other conformance requirements described in this
specification.</p>
<p>For the purposes of conformance checkers, if a resource is
determined to be in <a href="syntax.html#syntax">the HTML syntax</a>, then it is an
<a href="dom.html#html-documents" title="HTML documents">HTML document</a>.</p>
</div><div class="impl">
<h4 id="overview-of-the-parsing-model"><span class="secno">8.2.1 </span>Overview of the parsing model</h4>
<p>The input to the HTML parsing process consists of a stream of
Unicode characters, which is passed through a
<a href="tokenization.html#tokenization">tokenization</a> stage followed by a <a href="tree-construction.html#tree-construction">tree
construction</a> stage. The output is a <code><a href="infrastructure.html#document">Document</a></code>
object.</p>
<p class="note">Implementations that <a href="infrastructure.html#non-scripted">do not
support scripting</a> do not have to actually create a DOM
<code><a href="infrastructure.html#document">Document</a></code> object, but the DOM tree in such cases is
still used as the model for the rest of the specification.</p>
<p>In the common case, the data handled by the tokenization stage
comes from the network, but <a href="apis-in-html-documents.html#dynamic-markup-insertion" title="dynamic markup
insertion">it can also come from script</a> running in the user
agent, e.g. using the <code title="dom-document-write"><a href="apis-in-html-documents.html#dom-document-write">document.write()</a></code> API.</p>
<p><img alt="" height="554" src="parsing-model-overview.png" width="427"></p>
<p id="nestedParsing">There is only one set of states for the
tokenizer stage and the tree construction stage, but the tree
construction stage is reentrant, meaning that while the tree
construction stage is handling one token, the tokenizer might be
resumed, causing further tokens to be emitted and processed before
the first token's processing is complete.</p>
<div class="example">
<p>In the following example, the tree construction stage will be
called upon to handle a "p" start tag token while handling the
"script" end tag token:</p>
<pre>...
&lt;script&gt;
document.write('&lt;p&gt;');
&lt;/script&gt;
...</pre>
</div>
<p>To handle these cases, parsers have a <dfn id="script-nesting-level">script nesting
level</dfn>, which must be initially set to zero, and a <dfn id="parser-pause-flag">parser
pause flag</dfn>, which must be initially set to false.</p>
</div><div class="impl">
<h4 id="the-input-stream"><span class="secno">8.2.2 </span>The <dfn>input stream</dfn></h4>
<p>The stream of Unicode characters that comprises the input to the
tokenization stage will be initially seen by the user agent as a
stream of bytes (typically coming over the network or from the local
file system). The bytes encode the actual characters according to a
particular <em>character encoding</em>, which the user agent must
use to decode the bytes into characters.</p>
<p class="note">For XML documents, the algorithm user agents must
use to determine the character encoding is given by the XML
specification. This section does not apply to XML documents. <a href="references.html#refsXML">[XML]</a></p>
<h5 id="determining-the-character-encoding"><span class="secno">8.2.2.1 </span>Determining the character encoding</h5>
<p>In some cases, it might be impractical to unambiguously determine
the encoding before parsing the document. Because of this, this
specification provides for a two-pass mechanism with an optional
pre-scan. Implementations are allowed, as described below, to apply
a simplified parsing algorithm to whatever bytes they have available
before beginning to parse the document. Then, the real parser is
started, using a tentative encoding derived from this pre-parse and
other out-of-band metadata. If, while the document is being loaded,
the user agent discovers an encoding declaration that conflicts with
this information, then the parser can be reinvoked to perform a
parse of the document with the real encoding.</p>
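<div class="example">
<p>As a rough illustration of the mechanism described above (the
document and the encoding label shown here are hypothetical and are
not taken from this specification), suppose a resource arrives with
no encoding information from the transport layer and begins as
follows:</p>
<pre>&lt;!DOCTYPE html&gt;
&lt;html&gt;
 &lt;head&gt;
  &lt;!-- hypothetical encoding declaration, placed early in the file --&gt;
  &lt;meta charset="windows-1252"&gt;
  &lt;title&gt;Example&lt;/title&gt;
 ...</pre>
<p>If the bytes available to the pre-scan include the declaration,
the real parser can be started with that encoding as its tentative
encoding. If the declaration instead lay beyond the bytes that were
pre-scanned, the parser would start with some other tentative
encoding, and the conflicting declaration discovered during loading
could cause the parser to be reinvoked.</p>
</div>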
<p id="documentEncoding">User agents must use the following
algorithm (the <dfn id="encoding-sniffing-algorithm">encoding sniffing algorithm</dfn>) to determine
the character encoding to use when decoding a document in the first
pass. This algorithm takes as input any out-of-band metadata
available to the user agent (e.g. the <a href="fetching-resources.html#content-type" title="Content-Type">Content-Type metadata</a> of the document)
and all the bytes available so far, and returns an encoding and a
<dfn id="concept-encoding-confidence" title="concept-encoding-confidence">confidence</dfn>. The
confidence is either <i>tentative</i>, <i>certain</i>, or
<i>irrelevant</i>. The encoding used, and whether the confidence in
that encoding is <i>tentative</i> or <i>certain</i>, is <a href="tree-construction.html#meta-charset-during-parse">used during the parsing</a> to
determine whether to <a href="#change-the-encoding">change the encoding</a>. If no
encoding is necessary, e.g. because the parser is operating on a
stream of Unicode characters and doesn't have to use an encoding at
all, then the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> is
<i>irrelevant</i>.</p>
<ol><li><p>If the user has explicitly instructed the user agent to
override the document's character encoding with a specific
encoding, optionally return that encoding with the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a>
<i>certain</i> and abort these steps.</p></li>
<li><p>If the transport layer specifies an encoding, and it is
supported, return that encoding with the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a>
<i>certain</i>, and abort these steps.</p></li>
<li>
<p>The user agent may wait for more bytes of the resource to be
available, either in this step or at any later step in this
algorithm. For instance, a user agent might wait 500ms or 1024
bytes, whichever comes first. In general, preparsing the source to
find the encoding improves performance, as it reduces the need to
throw away the data structures used when parsing upon finding the
encoding information. However, if the user agent delays too long
to obtain data to determine the encoding, then the cost of the
delay could outweigh any performance improvements from the
preparse.</p>
<p class="note">The authoring conformance requirements for
character encoding declarations limit them to only appearing <a href="semantics.html#charset1024">in the first 1024 bytes</a>. User agents are
therefore encouraged to use the preparse algorithm below (part of
these steps) on the first 1024 bytes, but not to stall beyond
that.</p>
</li>
<li><p>For each of the rows in the following table, starting with
the first one and going down, if there are as many or more bytes
available than the number of bytes in the first column, and the
first bytes of the file match the bytes given in the first column,
then return the encoding given in the cell in the second column of
that row, with the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a>
<i>certain</i>, and abort these steps:</p>
<table><thead><tr><th>Bytes in Hexadecimal
</th><th>Encoding
</th></tr></thead><tbody><tr><td>FE FF
</td><td>Big-endian UTF-16
</td></tr><tr><td>FF FE
</td><td>Little-endian UTF-16
</td></tr><tr><td>EF BB BF
</td><td>UTF-8
</td></tr></tbody></table><p class="note">This step looks for Unicode Byte Order Marks
(BOMs).</p></li>
<li><p>Otherwise, the user agent will have to search for explicit
character encoding information in the file itself. This should
proceed as follows:
</p><p>Let <var title="">position</var> be a pointer to a byte in the
input stream, initially pointing at the first byte. If at any
point during these substeps the user agent either runs out of
bytes or decides that scanning further bytes would not be
efficient, then skip to the next step of the overall character
encoding detection algorithm. User agents may decide that scanning
<em>any</em> bytes is not efficient, in which case these substeps
are entirely skipped.</p>
<p>Now, repeat the following "two" steps until the algorithm
aborts (either because the user agent aborts, as described above, or
because a character encoding is found):</p>
<ol><li><p>If <var title="">position</var> points to:</p>
<dl class="switch"><dt>A sequence of bytes starting with: 0x3C 0x21 0x2D 0x2D (ASCII '&lt;!--')</dt>
<dd>
<p>Advance the <var title="">position</var> pointer so that it
points at the first 0x3E byte which is preceded by two 0x2D
bytes (i.e. at the end of an ASCII '--&gt;' sequence) and comes
after the 0x3C byte that was found. (The two 0x2D bytes can be
the same as those in the '&lt;!--' sequence.)</p>
</dd>
<dt>A sequence of bytes starting with: 0x3C, 0x4D or 0x6D, 0x45 or 0x65, 0x54 or 0x74, 0x41 or 0x61, and finally one of 0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F (case-insensitive ASCII '&lt;meta' followed by a space or slash)</dt>
<dd>
<ol><li><p>Advance the <var title="">position</var> pointer so
that it points at the next 0x09, 0x0A, 0x0C, 0x0D, 0x20, or
0x2F byte (the one in the sequence of characters matched
above).</p></li>
<li><p>Let <var title="">attribute list</var> be an empty
list of strings.</p></li>
<li><p>Let <var title="">got pragma</var> be false.</p></li>
<li><p>Let <var title="">need pragma</var> be null.</p></li>
<li><p>Let <var title="">charset</var> be the null value
(which, for the purposes of this algorithm, is distinct from
an unrecognised encoding or the empty string).</p></li>
<li><p><i>Attributes</i>: <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">Get an
attribute</a> and its value. If no attribute was sniffed,
then jump to the <i>processing</i> step below.</p></li>
<li><p>If the attribute's name is already in <var title="">attribute list</var>, then return to the step
labeled <i>attributes</i>.</p>
</li><li><p>Add the attribute's name to <var title="">attribute
list</var>.</p>
</li><li>
<p>Run the appropriate step from the following list, if one
applies:</p>
<dl class="switch"><dt>If the attribute's name is "<code title="">http-equiv</code>"</dt>
<dd><p>If the attribute's value is "<code title="">content-type</code>", then set <var title="">got
pragma</var> to true.</p></dd>
<dt>If the attribute's name is "<code title="">content</code>"</dt>
<dd><p>Apply the <a href="fetching-resources.html#algorithm-for-extracting-an-encoding-from-a-meta-element">algorithm for extracting an encoding
from a <code>meta</code> element</a>, giving the
attribute's value as the string to parse. If an encoding is
returned, and if <var title="">charset</var> is still set
to null, let <var title="">charset</var> be the encoding
returned, and set <var title="">need pragma</var> to
true.</p></dd>
<dt>If the attribute's name is "<code title="">charset</code>"</dt>
<dd><p>Let <var title="">charset</var> be the encoding
corresponding to the attribute's value, and set <var title="">need pragma</var> to false.</p></dd>
</dl></li>
<li><p>Return to the step labeled <i>attributes</i>.</p></li>
<li><p><i>Processing</i>: If <var title="">need pragma</var>
is null, then jump to the second step of the overall "two
step" algorithm.</p></li>
<li><p>If <var title="">need pragma</var> is true but <var title="">got pragma</var> is false, then jump to the second
step of the overall "two step" algorithm.</p></li>
<li><p>If <var title="">charset</var> is a UTF-16 encoding,
change the value of <var title="">charset</var> to
UTF-8.</p></li>
<li><p>If <var title="">charset</var> is not a supported
character encoding, then jump to the second step of the
overall "two step" algorithm.</p></li>
<li><p>Return the encoding given by <var title="">charset</var>, with <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a>
<i>tentative</i>, and abort all these steps.</p></li>
</ol></dd>
<dt>A sequence of bytes starting with a 0x3C byte (ASCII &lt;), optionally a 0x2F byte (ASCII /), and finally a byte in the range 0x41-0x5A or 0x61-0x7A (an ASCII letter)</dt>
<dd>
<ol><li><p>Advance the <var title="">position</var> pointer so
that it points at the next 0x09 (ASCII TAB), 0x0A (ASCII LF),
0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E
(ASCII &gt;) byte.</p></li>
<li><p>Repeatedly <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an
attribute</a> until no further attributes can be found,
then jump to the second step in the overall "two step"
algorithm.</p></li>
</ol></dd>
<dt>A sequence of bytes starting with: 0x3C 0x21 (ASCII '&lt;!')</dt>
<dt>A sequence of bytes starting with: 0x3C 0x2F (ASCII '&lt;/')</dt>
<dt>A sequence of bytes starting with: 0x3C 0x3F (ASCII '&lt;?')</dt>
<dd>
<p>Advance the <var title="">position</var> pointer so that it
points at the first 0x3E byte (ASCII &gt;) that comes after the
0x3C byte that was found.</p>
</dd>
<dt>Any other byte</dt>
<dd>
<p>Do nothing with that byte.</p>
</dd>
</dl></li>
<li>Move <var title="">position</var> so it points at the next
byte in the input stream, and return to the first step of this
"two step" algorithm.</li>
</ol><p>When the above "two step" algorithm says to <dfn id="concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an
attribute</dfn>, it means doing this:</p>
<ol><li><p>If the byte at <var title="">position</var> is one of 0x09
(ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR),
0x20 (ASCII space), or 0x2F (ASCII /) then advance <var title="">position</var> to the next byte and redo this
substep.</p></li>
<li><p>If the byte at <var title="">position</var> is 0x3E (ASCII
&gt;), then abort the "get an attribute" algorithm. There is no
attribute to return.</p></li>
<li><p>Otherwise, the byte at <var title="">position</var> is the
start of the attribute name. Let <var title="">attribute
name</var> and <var title="">attribute value</var> be the empty
string.</p></li>
<li><p><i>Attribute name</i>: Process the byte at <var title="">position</var> as follows:</p>
<dl class="switch"><dt>If it is 0x3D (ASCII =), and the <var title="">attribute
name</var> is longer than the empty string</dt>
<dd>Advance <var title="">position</var> to the next byte and
jump to the step below labeled <i>value</i>.</dd>
<dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
FF), 0x0D (ASCII CR), or 0x20 (ASCII space)</dt>
<dd>Jump to the step below labeled <i>spaces</i>.</dd>
<dt>If it is 0x2F (ASCII /) or 0x3E (ASCII &gt;)</dt>
<dd>Abort the "get an attribute" algorithm. The attribute's
name is the value of <var title="">attribute name</var>, its
value is the empty string.</dd>
<dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
Z)</dt>
<dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute
name</var> (where <var title="">b</var> is the value of the
byte at <var title="">position</var>).</dd>
<dt>Anything else</dt>
<dd>Append the Unicode character with the same code point as the
value of the byte at <var title="">position</var> to <var title="">attribute name</var>. (It doesn't actually matter how
bytes outside the ASCII range are handled here, since only
ASCII characters can contribute to the detection of a character
encoding.)</dd>
</dl></li>
<li><p>Advance <var title="">position</var> to the next byte and
return to the previous step.</p></li>
<li><p><i>Spaces</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
advance <var title="">position</var> to the next byte and
repeat this step.</p></li>
<li><p>If the byte at <var title="">position</var> is
<em>not</em> 0x3D (ASCII =), abort the "get an attribute"
algorithm. The attribute's name is the value of <var title="">attribute name</var>, its value is the empty
string.</p></li>
<li><p>Advance <var title="">position</var> past the 0x3D (ASCII
=) byte.</p></li>
<li><p><i>Value</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
advance <var title="">position</var> to the next byte and
repeat this step.</p></li>
<li><p>Process the byte at <var title="">position</var> as
follows:</p>
<dl class="switch"><dt>If it is 0x22 (ASCII ") or 0x27 (ASCII ')</dt>
<dd>
<ol><li>Let <var title="">b</var> be the value of the byte at
<var title="">position</var>.</li>
<li>Advance <var title="">position</var> to the next
byte.</li>
<li>If the value of the byte at <var title="">position</var>
is the value of <var title="">b</var>, then advance <var title="">position</var> to the next byte and abort the "get
an attribute" algorithm. The attribute's name is the value of
<var title="">attribute name</var>, and its value is the
value of <var title="">attribute value</var>.</li>
<li>Otherwise, if the value of the byte at <var title="">position</var> is in the range 0x41 (ASCII A) to
0x5A (ASCII Z), then append a Unicode character to <var title="">attribute value</var> whose code point is 0x20 more
than the value of the byte at <var title="">position</var>.</li>
<li>Otherwise, append a Unicode character to <var title="">attribute value</var> whose code point is the same as
the value of the byte at <var title="">position</var>.</li>
<li>Return to the second step in these substeps.</li>
</ol></dd>
<dt>If it is 0x3E (ASCII &gt;)</dt>
<dd>Abort the "get an attribute" algorithm. The attribute's
name is the value of <var title="">attribute name</var>, its
value is the empty string.</dd>
<dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
Z)</dt>
<dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute
value</var> (where <var title="">b</var> is the value of the
byte at <var title="">position</var>). Advance <var title="">position</var> to the next byte.</dd>
<dt>Anything else</dt>
<dd>Append the Unicode character with the same code point as the
value of the byte at <var title="">position</var> to <var title="">attribute value</var>. Advance <var title="">position</var> to the next byte.</dd>
</dl></li>
<li><p>Process the byte at <var title="">position</var> as
follows:</p>
<dl class="switch"><dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E (ASCII
&gt;)</dt>
<dd>Abort the "get an attribute" algorithm. The attribute's
name is the value of <var title="">attribute name</var> and its
value is the value of <var title="">attribute value</var>.</dd>
<dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
Z)</dt>
<dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute
value</var> (where <var title="">b</var> is the value of the
byte at <var title="">position</var>).</dd>
<dt>Anything else</dt>
<dd>Append the Unicode character with the same code point as the
value of the byte at <var title="">position</var> to <var title="">attribute value</var>.</dd>
</dl></li>
<li><p>Advance <var title="">position</var> to the next byte and
return to the previous step.</p></li>
</ol><p>For the sake of interoperability, user agents should not use a
pre-scan algorithm that returns different results than the one
described above. (But, if you do, please at least let us know, so
that we can improve this algorithm and benefit everyone...)</p>
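<p class="note">The "get an attribute" steps above can be sketched
in Python roughly as follows. This is an illustrative, non-normative
sketch: the names <code title="">get_attribute</code> and
<code title="">_lower</code>, the tuple return value, and the use of
<code title="">IndexError</code> to model "the user agent runs out
of bytes" are assumptions of the example, not part of this
specification.</p>

```python
WHITESPACE = b"\t\n\x0c\r "  # 0x09, 0x0A, 0x0C, 0x0D, 0x20


def _lower(byte):
    """Map one byte value to a character, lowercasing ASCII A-Z."""
    return chr(byte + 0x20) if 0x41 <= byte <= 0x5A else chr(byte)


def get_attribute(data, pos):
    """Sniff one attribute from data, starting at index pos.

    Returns (name, value, new_pos), or (None, None, new_pos) when there
    is no attribute. Indexing past the end raises IndexError, which
    stands in for running out of bytes (an assumption of this sketch).
    """
    while data[pos] in WHITESPACE or data[pos] == 0x2F:  # skip spaces and "/"
        pos += 1
    if data[pos] == 0x3E:  # ">": no attribute
        return None, None, pos
    name, value = "", ""
    while True:  # "Attribute name" step
        byte = data[pos]
        if byte == 0x3D and name:  # "=" after a non-empty name
            pos += 1
            break
        if byte in WHITESPACE:  # "Spaces" step
            while data[pos] in WHITESPACE:
                pos += 1
            if data[pos] != 0x3D:
                return name, "", pos
            pos += 1  # move past "="
            break
        if byte in (0x2F, 0x3E):  # "/" or ">"
            return name, "", pos
        name += _lower(byte)
        pos += 1
    while data[pos] in WHITESPACE:  # "Value" step
        pos += 1
    byte = data[pos]
    if byte in (0x22, 0x27):  # quoted value, delimited by the same quote byte
        quote = byte
        while True:
            pos += 1
            if data[pos] == quote:
                return name, value, pos + 1
            value += _lower(data[pos])
    if byte == 0x3E:  # ">" directly after "="
        return name, "", pos
    value += _lower(byte)  # first byte of an unquoted value
    pos += 1
    while data[pos] not in b"\t\n\x0c\r >":
        value += _lower(data[pos])
        pos += 1
    return name, value, pos
```

<p class="note">A caller would invoke this repeatedly, feeding the
returned position back in, until no attribute is returned. Both the
name and the value come back lowercased, as the steps above
require.</p>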
</li>
<li><p>If the user agent has information on the likely encoding for
this page, e.g. based on the encoding of the page when it was last
visited, then return that encoding, with the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a>
<i>tentative</i>, and abort these steps.</p></li>
<li>
<p>The user agent may attempt to autodetect the character encoding
from applying frequency analysis or other algorithms to the data
stream. Such algorithms may use information about the resource
other than the resource's contents, including the address of the
resource. If autodetection succeeds in determining a character
encoding, then return that encoding, with the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a>
<i>tentative</i>, and abort these steps. <a href="references.html#refsUNIVCHARDET">[UNIVCHARDET]</a></p>
<p class="note">The UTF-8 encoding has a highly detectable bit
pattern. Documents that contain bytes with values greater than
0x7F which match the UTF-8 pattern are very likely to be UTF-8,
while documents with byte sequences that do not match it are very
likely not. User agents are therefore encouraged to search for
this common encoding. <a href="references.html#refsPPUTF8">[PPUTF8]</a> <a href="references.html#refsUTF8DET">[UTF8DET]</a></p>
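<p class="note">As a rough illustration of the note above, a UTF-8
check can be approximated with a strict decode. The function name
<code title="">looks_like_utf8</code> is hypothetical, and this
sketch is only one possible heuristic, not an algorithm defined by
this specification. It treats pure ASCII input as inconclusive,
which is a design choice of the example.</p>

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if data has bytes above 0x7F and is well-formed UTF-8."""
    if not any(byte > 0x7F for byte in data):
        return False  # pure ASCII carries no evidence either way
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True
```

<p class="note">A buffer that ends in the middle of a multi-byte
sequence would be rejected by this sketch, so a real detector would
need to tolerate a truncated final sequence.</p>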
</li>
<li>
<p>Otherwise, return an implementation-defined or user-specified
default character encoding, with the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a>
<i>tentative</i>.</p>
<p>In controlled environments or in environments where the
encoding of documents can be prescribed (for example, for user
agents intended for dedicated use in new networks), the
comprehensive <code title="">UTF-8</code> encoding is
suggested.</p>
<p>In other environments, the default encoding is typically
dependent on the user's locale (an approximation of the languages,
and thus often encodings, of the pages that the user is likely to
frequent). The following table gives suggested defaults based on
the user's locale, for compatibility with legacy content. Locales
are identified by BCP 47 language tags. <a href="references.html#refsBCP47">[BCP47]</a></p>
<table><thead><tr><th>Locale language
</th><th>Suggested default encoding
</th></tr></thead><tbody><tr><td>ar
</td><td>UTF-8
</td></tr><tr><td>be
</td><td>ISO-8859-5
</td></tr><tr><td>bg
</td><td>windows-1251
</td></tr><tr><td>cs<!-- -CZ -->
</td><td>ISO-8859-2
</td></tr><tr><td>cy
</td><td>UTF-8
</td></tr><tr><td>fa<!-- -IR -->
</td><td>UTF-8
</td></tr><tr><td>he<!-- -IL -->
</td><td>windows-1255
</td></tr><tr><td>hr
</td><td>UTF-8
</td></tr><tr><td>hu<!-- -HU -->
</td><td>ISO-8859-2
</td></tr><tr><td>ja
</td><td>Windows-31J
</td></tr><tr><td>kk
</td><td>UTF-8
</td></tr><tr><td>ko<!-- -KR -->
</td><td>windows-949 <!-- EUC-KR -->
</td></tr><tr><td>ku
</td><td>windows-1254 <!-- ISO-8859-9 -->
</td></tr><tr><td>lt
</td><td>windows-1257
</td></tr><tr><td>lv<!-- -LV -->
</td><td>ISO-8859-13
</td></tr><tr><td>mk<!-- -MK -->
</td><td>UTF-8
</td></tr><tr><td>or
</td><td>UTF-8
</td></tr><tr><td>pl<!-- -PL -->
</td><td>ISO-8859-2
</td></tr><tr><td>ro
</td><td>UTF-8
</td></tr><tr><td>ru
</td><td>windows-1251
</td></tr><tr><td>sk
</td><td>windows-1250
</td></tr><tr><td>sl
</td><td>ISO-8859-2
</td></tr><tr><td>sr
</td><td>UTF-8
</td></tr><tr><td>th
</td><td>windows-874 <!-- TIS-620 -->
</td></tr><tr><td>tr<!-- -TR -->
</td><td>windows-1254 <!-- ISO-8859-9 -->
</td></tr><tr><td>uk
</td><td>windows-1251
</td></tr><tr><td>vi
</td><td>UTF-8
</td></tr><tr><td>zh-CN
</td><td>GB18030
</td></tr><tr><td>zh-TW
</td><td>Big5
</td></tr><tr><td>All other locales
</td><td>windows-1252
</td></tr></tbody></table></li>
</ol><p>The <a href="dom.html#document-s-character-encoding">document's character encoding</a> must immediately
be set to the value returned from this algorithm, at the same time
as the user agent uses the returned value to select the decoder to
use for the input stream.</p>
<p class="note">This algorithm is a <a href="introduction.html#willful-violation">willful violation</a>
of the HTTP specification, which requires that the encoding be
assumed to be ISO-8859-1 in the absence of a <a href="semantics.html#character-encoding-declaration">character
encoding declaration</a> to the contrary, and of RFC 2046,
which requires that the encoding be assumed to be US-ASCII in the
absence of a <a href="semantics.html#character-encoding-declaration">character encoding declaration</a> to the
contrary. This specification's third approach is motivated by a
desire to be maximally compatible with legacy content. <a href="references.html#refsHTTP">[HTTP]</a> <a href="references.html#refsRFC2046">[RFC2046]</a></p>
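<p class="note">Two of the table-driven steps of the algorithm
above, the BOM check and the locale-based fallback, can be sketched
as follows. This is a non-normative sketch: the function names, the
labels "UTF-16BE" and "UTF-16LE" for the big-endian and little-endian
rows, and the rule of matching the full language tag first and then
its primary subtag are assumptions of the example.</p>

```python
# Rows of the BOM table, in order. The encoding labels are this sketch's choice.
BOM_TABLE = [
    (b"\xFE\xFF", "UTF-16BE"),
    (b"\xFF\xFE", "UTF-16LE"),
    (b"\xEF\xBB\xBF", "UTF-8"),
]

# Suggested defaults from the locale table, keyed by lowercased language tag.
LOCALE_DEFAULTS = {
    "ar": "UTF-8", "be": "ISO-8859-5", "bg": "windows-1251",
    "cs": "ISO-8859-2", "cy": "UTF-8", "fa": "UTF-8",
    "he": "windows-1255", "hr": "UTF-8", "hu": "ISO-8859-2",
    "ja": "Windows-31J", "kk": "UTF-8", "ko": "windows-949",
    "ku": "windows-1254", "lt": "windows-1257", "lv": "ISO-8859-13",
    "mk": "UTF-8", "or": "UTF-8", "pl": "ISO-8859-2",
    "ro": "UTF-8", "ru": "windows-1251", "sk": "windows-1250",
    "sl": "ISO-8859-2", "sr": "UTF-8", "th": "windows-874",
    "tr": "windows-1254", "uk": "windows-1251", "vi": "UTF-8",
    "zh-cn": "GB18030", "zh-tw": "Big5",
}


def sniff_bom(data):
    """Return the encoding for a leading BOM, or None if there is none."""
    for bom, encoding in BOM_TABLE:
        if data.startswith(bom):  # also false when too few bytes are available
            return encoding
    return None


def suggested_default(locale):
    """Return the suggested default encoding for a BCP 47 language tag."""
    tag = locale.lower()
    # Assumption: try the full tag first, then the primary language subtag.
    for key in (tag, tag.split("-")[0]):
        if key in LOCALE_DEFAULTS:
            return LOCALE_DEFAULTS[key]
    return "windows-1252"  # "All other locales"
```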
<h5 id="character-encodings-0"><span class="secno">8.2.2.2 </span>Character encodings</h5>
<p>User agents must at a minimum support the UTF-8 and Windows-1252
encodings, but may support more. <a href="references.html#refsRFC3629">[RFC3629]</a> <a href="references.html#refsWIN1252">[WIN1252]</a></p>
<p class="note">It is not unusual for Web browsers to support dozens
if not upwards of a hundred distinct character encodings.</p>
<p>User agents must support the <a href="infrastructure.html#preferred-mime-name">preferred MIME name</a> of
every character encoding they support, and should support all the
IANA-registered names and aliases of every character encoding they
support. <a href="references.html#refsIANACHARSET">[IANACHARSET]</a></p>
<p>When comparing a string specifying a character encoding with the
name or alias of a character encoding to determine if they are
equal, user agents must remove any leading or trailing <a href="common-microsyntaxes.html#space-character" title="space character">space characters</a> in both names, and
then perform the comparison in an <a href="infrastructure.html#ascii-case-insensitive">ASCII
case-insensitive</a> manner.</p>
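<p class="note">The comparison rule above can be sketched as
follows. The function name
<code title="">encoding_names_match</code> is hypothetical. The
sketch lowercases only the ASCII letters A to Z rather than calling
a general Unicode lowercasing routine, so that, for example, U+212A
KELVIN SIGN is not treated as equal to "k".</p>

```python
import string

# The space characters: U+0020, U+0009, U+000A, U+000C, U+000D.
SPACE_CHARACTERS = " \t\n\x0c\r"

# Translation table that lowercases ASCII A-Z only.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def encoding_names_match(a, b):
    """Compare two encoding names as described in the paragraph above."""
    def normalize(name):
        # Strip leading/trailing space characters, then ASCII-lowercase.
        return name.strip(SPACE_CHARACTERS).translate(_ASCII_LOWER)
    return normalize(a) == normalize(b)
```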
<hr><p>When a user agent would otherwise use an encoding given in the
first column of the following table to either convert content to
Unicode characters or convert Unicode characters to bytes, it must
instead use the encoding given in the cell in the second column of
the same row. When a byte or sequence of bytes is treated
differently due to this encoding aliasing, it is said to have been
<dfn id="misinterpreted-for-compatibility">misinterpreted for compatibility</dfn>.</p>
<table id="table-encoding-overrides"><caption>Character encoding overrides</caption>
<thead><tr><th> Input encoding </th><th> Replacement encoding </th><th> References
</th></tr></thead><tbody><tr><td> EUC-KR </td><td> windows-949 </td><td>
<a href="references.html#refsEUCKR">[EUCKR]</a>
<a href="references.html#refsWIN949">[WIN949]</a>
</td></tr><tr><td> EUC-JP </td><td> CP51932 </td><td>
<a href="references.html#refsEUCJP">[EUCJP]</a>
<a href="references.html#refsCP51932">[CP51932]</a>
</td></tr><tr><td> GB2312 </td><td> GBK </td><td>
<a href="references.html#refsRFC1345">[RFC1345]</a>
<a href="references.html#refsGBK">[GBK]</a>
</td></tr><tr><td> GB_2312-80 </td><td> GBK </td><td>
<a href="references.html#refsRFC1345">[RFC1345]</a>
<a href="references.html#refsGBK">[GBK]</a>
</td></tr><tr><td> ISO-8859-1 </td><td> windows-1252 </td><td>
<a href="references.html#refsRFC1345">[RFC1345]</a>
<a href="references.html#refsWIN1252">[WIN1252]</a>
</td></tr><tr><td> ISO-8859-9 </td><td> windows-1254 </td><td>
<a href="references.html#refsRFC1345">[RFC1345]</a>
<a href="references.html#refsWIN1254">[WIN1254]</a>
</td></tr><tr><td> ISO-8859-11 </td><td> windows-874 </td><td>
<a href="references.html#refsISO885911">[ISO885911]</a>
<a href="references.html#refsWIN874">[WIN874]</a>
</td></tr><tr><td> KS_C_5601-1987 </td><td> windows-949 </td><td>
<a href="references.html#refsRFC1345">[RFC1345]</a>
<a href="references.html#refsWIN949">[WIN949]</a>
</td></tr><tr><td> Shift_JIS </td><td> Windows-31J </td><td>
<a href="references.html#refsSHIFTJIS">[SHIFTJIS]</a>
<a href="references.html#refsWIN31J">[WIN31J]</a>
</td></tr><tr><td> TIS-620 </td><td> windows-874 </td><td>
<a href="references.html#refsTIS620">[TIS620]</a>
<a href="references.html#refsWIN874">[WIN874]</a>
</td></tr><tr><td> US-ASCII </td><td> windows-1252 </td><td>
<a href="references.html#refsRFC1345">[RFC1345]</a>
<a href="references.html#refsWIN1252">[WIN1252]</a>
</td></tr></tbody></table><p class="note">The requirement to treat certain encodings as other
encodings according to the table above is a <a href="introduction.html#willful-violation">willful
violation</a> of the W3C Character Model specification, motivated
by a desire for compatibility with legacy content. <a href="references.html#refsCHARMOD">[CHARMOD]</a></p>
<p>When a user agent is to use the UTF-16 encoding but no BOM has
been found, user agents must default to UTF-16LE.</p>
<p class="note">The requirement to default UTF-16 to LE rather than
BE is a <a href="introduction.html#willful-violation">willful violation</a> of RFC 2781, motivated by a
desire for compatibility with legacy content. <a href="references.html#refsRFC2781">[RFC2781]</a></p>
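<p class="note">The override table and the UTF-16 default above
amount to a lookup performed before a decoder is chosen. The
following non-normative sketch shows one way to express that. The
names <code title="">ENCODING_OVERRIDES</code> and
<code title="">effective_encoding</code> and the
<code title="">bom_found</code> parameter are assumptions of the
example, and only the names listed in the first column of the table
are recognized (aliases are not resolved).</p>

```python
import string

_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

# Input encoding (ASCII-lowercased) -> replacement encoding, from the table.
ENCODING_OVERRIDES = {
    "euc-kr": "windows-949",
    "euc-jp": "CP51932",
    "gb2312": "GBK",
    "gb_2312-80": "GBK",
    "iso-8859-1": "windows-1252",
    "iso-8859-9": "windows-1254",
    "iso-8859-11": "windows-874",
    "ks_c_5601-1987": "windows-949",
    "shift_jis": "Windows-31J",
    "tis-620": "windows-874",
    "us-ascii": "windows-1252",
}


def effective_encoding(name, bom_found=False):
    """Return the encoding to actually use for a declared encoding name."""
    stripped = name.strip(" \t\n\x0c\r")
    key = stripped.translate(_ASCII_LOWER)
    if key == "utf-16" and not bom_found:
        return "UTF-16LE"  # default to little-endian when no BOM was seen
    return ENCODING_OVERRIDES.get(key, stripped)
```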
<hr><p>User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU
encodings. <a href="references.html#refsCESU8">[CESU8]</a> <a href="references.html#refsUTF7">[UTF7]</a> <a href="references.html#refsBOCU1">[BOCU1]</a> <a href="references.html#refsSCSU">[SCSU]</a></p>
<p>Support for encodings based on EBCDIC is not recommended. These
encodings are rarely used for publicly-facing Web content.</p>
<p>Support for UTF-32 is not recommended. This encoding is rarely
used, and frequently implemented incorrectly.</p>
<p class="note">This specification does not make any attempt to
support EBCDIC-based encodings and UTF-32 in its algorithms; support
and use of these encodings can thus lead to unexpected behavior in
implementations of this specification.</p>
<h5 id="preprocessing-the-input-stream"><span class="secno">8.2.2.3 </span>Preprocessing the input stream</h5>
<p>Given an encoding, the bytes in the input stream must be
converted to Unicode characters for the tokenizer, as described by
the rules for that encoding, except that the leading U+FEFF BYTE
ORDER MARK character, if any, must not be stripped by the encoding
layer (it is stripped by the rule below).</p>
<p>Bytes or sequences of bytes in the original byte stream that
could not be converted to Unicode code points must be converted to
U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is
UTF-8, the bytes must be <a href="infrastructure.html#decoded-as-utf-8-with-error-handling" title="decoded as UTF-8, with error
handling">decoded with the error handling</a> defined in this
specification.</p>
<p class="note">Bytes or sequences of bytes in the original byte
stream that did not conform to the encoding specification
(e.g. invalid UTF-8 byte sequences in a UTF-8 input stream) are
errors that conformance checkers are expected to report.</p>
<p>Any byte or sequence of bytes in the original byte stream that is
<a href="#misinterpreted-for-compatibility">misinterpreted for compatibility</a> is a <a href="#parse-error">parse
error</a>.</p>
<p>One leading U+FEFF BYTE ORDER MARK character must be ignored if
any are present.</p>
<p class="note">The requirement to strip a U+FEFF BYTE ORDER MARK
character regardless of whether that character was used to determine
the byte order is a <a href="introduction.html#willful-violation">willful violation</a> of Unicode,
motivated by a desire to increase the resilience of user agents in
the face of na&#239;ve transcoders.</p>
<p>Any occurrences of any characters in the ranges U+0001 to U+0008,
U+000E to U+001F, U+007F
to U+009F, U+FDD0
to U+FDEF, and characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF,
U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE,
U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF,
U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE,
U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF,
U+10FFFE, and U+10FFFF are <a href="#parse-error" title="parse error">parse
errors</a>. These are all control characters or permanently
undefined Unicode characters (noncharacters).</p>
<p>U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
characters are treated specially. Any CR characters that are
followed by LF characters must be removed, and any CR characters not
followed by LF characters must be converted to LF characters. Thus,
newlines in HTML DOMs are represented by LF characters, and there
are never any CR characters in the input to the
<a href="tokenization.html#tokenization">tokenization</a> stage.</p>
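<p class="note">The character-level preprocessing rules above (BOM
removal, parse-error characters, and newline normalization) can be
sketched as follows, operating on an already decoded string. The
function names and the choice to return the offending characters as
a list are assumptions of the example. The sketch only reports
parse errors; it does not remove the characters.</p>

```python
def is_parse_error_character(cp):
    """True if code point cp is a control character or noncharacter listed above."""
    return (
        0x0001 <= cp <= 0x0008
        or cp == 0x000B
        or 0x000E <= cp <= 0x001F
        or 0x007F <= cp <= 0x009F
        or 0xFDD0 <= cp <= 0xFDEF
        or (cp & 0xFFFE) == 0xFFFE  # U+FFFE, U+FFFF, ..., U+10FFFE, U+10FFFF
    )


def preprocess(text):
    """Return (normalized_text, parse_error_characters) for decoded input."""
    if text.startswith("\ufeff"):
        text = text[1:]  # ignore exactly one leading BYTE ORDER MARK
    errors = [ch for ch in text if is_parse_error_character(ord(ch))]
    # CR followed by LF is removed; any remaining CR becomes LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text, errors
```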
<p>The <dfn id="next-input-character">next input character</dfn> is the first character in the
input stream that has not yet been <dfn id="consumed">consumed</dfn>. Initially,
the <i><a href="#next-input-character">next input character</a></i> is the first character in the
input. The <dfn id="current-input-character">current input character</dfn> is the last character
to have been <i><a href="#consumed">consumed</a></i>.</p>
<p>The <dfn id="insertion-point">insertion point</dfn> is the position (just before a
character or just before the end of the input stream) where content
inserted using <code title="dom-document-write"><a href="apis-in-html-documents.html#dom-document-write">document.write()</a></code> is actually
inserted. The insertion point is relative to the position of the
character immediately after it; it is not an absolute offset into
the input stream. Initially, the insertion point is
undefined.</p>
<p>The "EOF" character in the tables below is a conceptual character
representing the end of the <a href="#the-input-stream">input stream</a>. If the parser
is a <a href="apis-in-html-documents.html#script-created-parser">script-created parser</a>, then the end of the
<a href="#the-input-stream">input stream</a> is reached when an <dfn id="explicit-eof-character">explicit "EOF"
character</dfn> (inserted by the <code title="dom-document-close"><a href="apis-in-html-documents.html#dom-document-close">document.close()</a></code> method) is
consumed. Otherwise, the "EOF" character is not a real character in
the stream, but rather the lack of any further characters.</p>
<h5 id="changing-the-encoding-while-parsing"><span class="secno">8.2.2.4 </span>Changing the encoding while parsing</h5>
<p>When the parser requires the user agent to <dfn id="change-the-encoding">change the
encoding</dfn>, it must run the following steps. This might happen
if the <a href="#encoding-sniffing-algorithm">encoding sniffing algorithm</a> described above
failed to find an encoding, or if it found an encoding that was not
the actual encoding of the file.</p>
<ol><li>If the new encoding is identical or equivalent to the encoding
that is already being used to interpret the input stream, then set
the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> to
<i>certain</i> and abort these steps. This happens when the
encoding information found in the file matches what the
<a href="#encoding-sniffing-algorithm">encoding sniffing algorithm</a> determined to be the
encoding, and in the second pass through the parser if the first
pass found that the encoding sniffing algorithm described in the
earlier section failed to find the right encoding.</li>
<li>If the encoding that is already being used to interpret the
input stream is a UTF-16 encoding, then set the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> to
<i>certain</i> and abort these steps. The new encoding is ignored;
if it was anything but the same encoding, then it would be clearly
incorrect.</li>
<li>If the new encoding is a UTF-16 encoding, change it to
UTF-8.</li>
<li>If all the bytes up to the last byte converted by the current
decoder have the same Unicode interpretations in both the current
encoding and the new encoding, and if the user agent supports
changing the converter on the fly, then the user agent may change
to the new converter for the encoding on the fly. Set the
<a href="dom.html#document-s-character-encoding">document's character encoding</a> and the encoding used to
convert the input stream to the new encoding, set the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> to
<i>certain</i>, and abort these steps.</li>
<li>Otherwise, <a href="history.html#navigate">navigate</a> to the
document again, with <a href="history.html#replacement-enabled">replacement enabled</a>, and using
the same <a href="history.html#source-browsing-context">source browsing context</a>, but this time skip
the <a href="#encoding-sniffing-algorithm">encoding sniffing algorithm</a> and instead just set
the encoding to the new encoding and the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> to
<i>certain</i>. Whenever possible, this should be done without
actually contacting the network layer (the bytes should be
re-parsed from memory), even if, e.g., the document is marked as
not being cacheable. If this is not possible and contacting the
network layer would involve repeating a request that uses a method
other than HTTP GET (<a href="fetching-resources.html#concept-http-equivalent-get" title="concept-http-equivalent-get">or
equivalent</a> for non-HTTP URLs), then instead set the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> to
<i>certain</i> and ignore the new encoding. The resource will be
misinterpreted. User agents may notify the user of the situation,
to aid in application development.</li>
</ol></div><div class="impl">
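<p class="note">The decision order of the "change the encoding"
steps above can be sketched as a pure function. This is a
non-normative illustration: the function and parameter names, the
action labels, the list of names treated as "a UTF-16 encoding", and
the use of a case-insensitive name match for "identical or
equivalent" are all assumptions of the example.</p>

```python
def is_utf16(name):
    # Assumption: these three names are the UTF-16 encodings.
    return name.upper() in ("UTF-16", "UTF-16LE", "UTF-16BE")


def change_the_encoding(current, new, *, same_so_far,
                        can_switch_on_the_fly, can_reparse_safely):
    """Return (encoding, confidence, action) following the steps above.

    same_so_far: bytes decoded so far mean the same in both encodings.
    can_switch_on_the_fly: the user agent supports, and chooses to use,
        changing the converter mid-stream.
    can_reparse_safely: re-parsing needs no network access, or only a
        repeat of an HTTP GET (or equivalent).
    """
    if new.upper() == current.upper():
        return current, "certain", "keep"        # step 1
    if is_utf16(current):
        return current, "certain", "keep"        # step 2: new encoding ignored
    if is_utf16(new):
        new = "UTF-8"                            # step 3
    if same_so_far and can_switch_on_the_fly:
        return new, "certain", "switch"          # step 4
    if can_reparse_safely:
        return new, "certain", "renavigate"      # step 5
    return current, "certain", "keep"            # step 5 fallback: misinterpreted
```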
<h4 id="parse-state"><span class="secno">8.2.3 </span>Parse state</h4>
<h5 id="the-insertion-mode"><span class="secno">8.2.3.1 </span>The insertion mode</h5>
<p>The <dfn id="insertion-mode">insertion mode</dfn> is a state variable that controls
the primary operation of the tree construction stage.</p>
<p>Initially, the <a href="#insertion-mode">insertion mode</a> is "<a href="tree-construction.html#the-initial-insertion-mode" title="insertion mode: initial">initial</a>". It can change to
"<a href="tree-construction.html#the-before-html-insertion-mode" title="insertion mode: before html">before html</a>",
"<a href="tree-construction.html#the-before-head-insertion-mode" title="insertion mode: before head">before head</a>",
"<a href="tree-construction.html#parsing-main-inhead" title="insertion mode: in head">in head</a>", "<a href="tree-construction.html#parsing-main-inheadnoscript" title="insertion mode: in head noscript">in head noscript</a>",
"<a href="tree-construction.html#the-after-head-insertion-mode" title="insertion mode: after head">after head</a>", "<a href="tree-construction.html#parsing-main-inbody" title="insertion mode: in body">in body</a>", "<a href="tree-construction.html#parsing-main-incdata" title="insertion mode: text">text</a>", "<a href="tree-construction.html#parsing-main-intable" title="insertion
mode: in table">in table</a>", "<a href="tree-construction.html#parsing-main-intabletext" title="insertion mode: in
table text">in table text</a>", "<a href="tree-construction.html#parsing-main-incaption" title="insertion mode: in
caption">in caption</a>", "<a href="tree-construction.html#parsing-main-incolgroup" title="insertion mode: in column
group">in column group</a>", "<a href="tree-construction.html#parsing-main-intbody" title="insertion mode: in
table body">in table body</a>", "<a href="tree-construction.html#parsing-main-intr" title="insertion mode: in
row">in row</a>", "<a href="tree-construction.html#parsing-main-intd" title="insertion mode: in cell">in
cell</a>", "<a href="tree-construction.html#parsing-main-inselect" title="insertion mode: in select">in
select</a>", "<a href="tree-construction.html#parsing-main-inselectintable" title="insertion mode: in select in table">in
select in table</a>", "<a href="tree-construction.html#parsing-main-afterbody" title="insertion mode: after
body">after body</a>", "<a href="tree-construction.html#parsing-main-inframeset" title="insertion mode: in
frameset">in frameset</a>", "<a href="tree-construction.html#parsing-main-afterframeset" title="insertion mode: after
frameset">after frameset</a>", "<a href="tree-construction.html#the-after-after-body-insertion-mode" title="insertion mode:
after after body">after after body</a>", and "<a href="tree-construction.html#the-after-after-frameset-insertion-mode" title="insertion mode: after after frameset">after after
frameset</a>" during the course of the parsing, as described in
the <a href="tree-construction.html#tree-construction">tree construction</a> stage. The insertion mode affects
how tokens are processed and whether CDATA sections are
supported.</p>
<p>Several of these modes, namely "<a href="tree-construction.html#parsing-main-inhead" title="insertion mode: in
head">in head</a>", "<a href="tree-construction.html#parsing-main-inbody" title="insertion mode: in body">in
body</a>", "<a href="tree-construction.html#parsing-main-intable" title="insertion mode: in table">in
table</a>", and "<a href="tree-construction.html#parsing-main-inselect" title="insertion mode: in select">in
select</a>", are special, in that the other modes defer to them
at various times. When the algorithm below says that the user agent
is to do something "<dfn id="using-the-rules-for">using the rules for</dfn> the <var title="">m</var> insertion mode", where <var title="">m</var> is one
of these modes, the user agent must use the rules described under
the <var title="">m</var> <a href="#insertion-mode">insertion mode</a>'s section, but
must leave the <a href="#insertion-mode">insertion mode</a> unchanged unless the
rules in <var title="">m</var> themselves switch the <a href="#insertion-mode">insertion
mode</a> to a new value.</p>
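<p class="note">As a non-normative illustration, the "using the rules for" mechanism can be sketched as a dispatch that borrows another mode's handler without assigning the insertion mode itself. The <code title="">Parser</code> class, its handler table, and the token strings <code title="">"plain"</code> and <code title="">"switch"</code> are assumptions made for this sketch; they are not part of this specification.</p>

```python
from typing import Callable, Dict, List


class Parser:
    """Toy parser state: only the insertion mode and a log of handled tokens."""

    def __init__(self) -> None:
        self.insertion_mode = "in table"
        self.log: List[str] = []
        # Hypothetical rule table: one handler per insertion mode.
        self.rules: Dict[str, Callable[[str], None]] = {
            "in body": self._in_body,
            "in table": self._in_table,
        }

    def process(self, token: str) -> None:
        """Handle a token under the current insertion mode."""
        self.rules[self.insertion_mode](token)

    def process_using_rules_for(self, mode: str, token: str) -> None:
        """Apply the rules of `mode` without assigning insertion_mode here.

        The insertion mode only changes if the borrowed rules switch it.
        """
        self.rules[mode](token)

    def _in_body(self, token: str) -> None:
        self.log.append("in body handled " + token)
        if token == "switch":  # invented trigger, for illustration only
            self.insertion_mode = "after body"

    def _in_table(self, token: str) -> None:
        # Invented behaviour: this toy mode defers every token to "in body".
        self.process_using_rules_for("in body", token)
```

<p class="note">In this sketch, processing <code title="">"plain"</code> leaves the parser in "in table" even though the "in body" handler ran; only the <code title="">"switch"</code> token, whose borrowed rule assigns a new mode, changes it.</p>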
<p>When the insertion mode is switched to "<a href="tree-construction.html#parsing-main-incdata" title="insertion
mode: text">text</a>" or "<a href="tree-construction.html#parsing-main-intabletext" title="insertion mode: in table
text">in table text</a>", the <dfn id="original-insertion-mode">original insertion mode</dfn>
is also set. This is the insertion mode to which the tree
construction stage will return.</p>
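<p class="note">A minimal, non-normative sketch of this bookkeeping follows. The class and method names are hypothetical; only the two mode names "text" and "in table text" come from the paragraph above.</p>

```python
from typing import Optional


class ModeState:
    """Tracks the insertion mode and the original insertion mode."""

    def __init__(self, mode: str) -> None:
        self.insertion_mode = mode
        self.original_insertion_mode: Optional[str] = None

    def switch_to_text_like(self, new_mode: str) -> None:
        """Switch to "text" or "in table text", remembering where to return."""
        if new_mode not in ("text", "in table text"):
            raise ValueError("only 'text' and 'in table text' record an original mode")
        self.original_insertion_mode = self.insertion_mode
        self.insertion_mode = new_mode

    def return_to_original(self) -> None:
        """Go back to the mode that was active before the switch."""
        if self.original_insertion_mode is None:
            raise RuntimeError("no original insertion mode recorded")
        self.insertion_mode = self.original_insertion_mode
```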
<hr><p>When the steps below require the UA to <dfn id="reset-the-insertion-mode-appropriately">reset the insertion
mode appropriately</dfn>, it means the UA must follow these
steps:</p>
<ol><li>Let <var title="">last</var> be false.</li>
<li>Let <var title="">node</var> be the last node in the
<a href="#stack-of-open-elements">stack of open elements</a>.</li>
<li><i>Loop</i>: If <var title="">node</var> is the first node in
the stack of open elements, then set <var title="">last</var> to
true and set <var title="">node</var> to the <var title="concept-frag-parse-context"><a href="the-end.html#concept-frag-parse-context">context</a></var> element.
(<a href="the-end.html#fragment-case">fragment case</a>)</li>
<li>If <var title="">node</var> is a <code><a href="the-button-element.html#the-select-element">select</a></code> element,
then switch the <a href="#insertion-mode">insertion mode</a> to "<a href="tree-construction.html#parsing-main-inselect" title="insertion mode: in select">in select</a>" and abort these
steps. (<a href="the-end.html#fragment-case">fragment case</a>)</li>
<li>If <var title="">node</var> is a <code><a href="tabular-data.html#the-td-element">td</a></code> or
<code><a href="tabular-data.html#the-th-element">th</a></code> element and <var title="">last</var> is false, then
switch the <a href="#insertion-mode">insertion mode</a> to "<a href="tree-construction.html#parsing-main-intd" title="insertion
mode: in cell">in cell</a>" and abort these steps.</li>
<li>If <var title="">node</var> is a <code><a href="tabular-data.html#the-tr-element">tr</a></code> element, then
switch the <a href="#insertion-mode">insertion mode</a> to "<a href="tree-construction.html#parsing-main-intr" title="insertion
mode: in row">in row</a>" and abort these steps.</li>
<li>If <var title="">node</var> is a <code><a href="tabular-data.html#the-tbody-element">tbody</a></code>,
<code><a href="tabular-data.html#the-thead-element">thead</a></code>, or <code><a href="tabular-data.html#the-tfoot-element">tfoot</a></code> element, then switch the
<a href="#insertion-mode">insertion mode</a> to "<a href="tree-construction.html#parsing-main-intbody" title="insertion mode: in
table body">in table body</a>" and abort these steps.</li>
<li>If <var title="">node</var> is a <code><a href="tabular-data.html#the-caption-element">caption</a></code> element,
then switch the <a href="#insertion-mode">insertion mode</a> to "<a href="tree-construction.html#parsing-main-incaption" title="insertion mode: in caption">in caption</a>" and abort
these steps.</li>
<li>If <var title="">node</var> is a <code><a href="tabular-data.html#the-colgroup-element">colgroup</a></code> element,
then switch the <a href="#insertion-mode">insertion mode</a> to "<a href="tree-construction.html#parsing-main-incolgroup" title="insertion mode: in column group">in column group</a>" and
abort these steps. (<a href="the-end.html#fragment-case">fragment case</a>)</li>
<li>If <var title="">node</var> is a <code><a href="tabular-data.html#the-table-element">table</a></code> element,
then switch the <a href="#insertion-mode">insertion mode</a> to "<a href="tree-construction.html#parsing-main-intable" title="insertion mode: in table">in table</a>" and abort these
steps.</li>
<li>If <var title="">node</var> is a <code><a href="semantics.html#the-head-element">head</a></code> element,
then switch the <a href="#insertion-mode">insertion mode</a> to "<a href="tree-construction.html#parsing-main-inbody" title="insertion mode: in body">in body</a>" ("<a href="tree-construction.html#parsing-main-inbody" title="insertion mode: in body">in body</a>"! <em> not "<a href="tree-construction.html#parsing-main-inhead" title="insertion mode: in head">in head</a>"</em>!) and abort
these steps. (<a href="the-end.html#fragment-case">fragment case</a>)</li>
<li>If <var title="">node</var> is a <code><a href="sections.html#the-body-element">body</a></code> element,
then switch the <a href="#insertion-mode">insertion mode</a> to "<a href="tree-construction.html#parsing-main-inbody" title="insertion mode: in body">in body</a>" and abort these
steps.</li>
<li>If <var title="">node</var> is a <code><a href="obsolete.html#frameset">frameset</a></code> element,
then switch the <a href="#insertion-mode">insertion mode</a> to "<a href="tree-construction.html#parsing-main-inframeset" title="insertion mode: in frameset">in frameset</a>" and abort
these steps. (<a href="the-end.html#fragment-case">fragment case</a>)</li>
<li>If <var title="">node</var> is an <code><a href="semantics.html#the-html-element">html</a></code> element,
then switch the <a href="#insertion-mode">insertion mode</a>
to "<a href="tree-construction.html#the-before-head-insertion-mode" title="insertion mode: before head">before
head</a>" and abort these steps. (<a href="the-end.html#fragment-case">fragment
case</a>)</li>
<li>If <var title="">last</var> is true, then switch the
<a href="#insertion-mode">insertion mode</a> to "<a href="tree-construction.html#parsing-main-inbody" title="insertion mode: in
body">in body</a>" and abort these steps. (<a href="the-end.html#fragment-case">fragment
case</a>)</li>
<li>Let <var title="">node</var> now be the node before <var title="">node</var> in the <a href="#stack-of-open-elements">stack of open
elements</a>.</li>
<li>Return to the step labeled <i>loop</i>.</li>
</ol><h5 id="the-stack-of-open-elements"><span class="secno">8.2.3.2 </span>The stack of open elements</h5>
<p>Initially, the <dfn id="stack-of-open-elements">stack of open elements</dfn> is empty. The
stack grows downwards; the topmost node on the stack is the first
one added to the stack, and the bottommost node of the stack is the
most recently added node in the stack (except when the
stack is manipulated in a random-access fashion as part of <a href="tree-construction.html#adoptionAgency">the handling for misnested tags</a>).</p>
<p>The "<a href="tree-construction.html#the-before-html-insertion-mode" title="insertion mode: before html">before
html</a>" <a href="#insertion-mode">insertion mode</a> creates the
<code><a href="semantics.html#the-html-element">html</a></code> root element node, which is then added to the
stack.</p>
<p>In the <a href="the-end.html#fragment-case">fragment case</a>, the <a href="#stack-of-open-elements">stack of open
elements</a> is initialized to contain an <code><a href="semantics.html#the-html-element">html</a></code>
element that is created as part of <a href="the-end.html#html-fragment-parsing-algorithm" title="html fragment
parsing algorithm">that algorithm</a>. (The <a href="the-end.html#fragment-case">fragment
case</a> skips the "<a href="tree-construction.html#the-before-html-insertion-mode" title="insertion mode: before
html">before html</a>" <a href="#insertion-mode">insertion mode</a>.)</p>
<p>The <code><a href="semantics.html#the-html-element">html</a></code> node, however it is created, is the topmost
node of the stack. It only gets popped off the stack when the parser
<a href="the-end.html#stop-parsing" title="stop parsing">finishes</a>.</p>
<p>The <dfn id="current-node">current node</dfn> is the bottommost node in this
stack.</p>
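<p class="note">With the current node defined, the steps to "reset the insertion mode appropriately" given earlier can be sketched, non-normatively, as a walk from the current node toward the <code title="">html</code> element. This sketch makes several assumptions: the stack is a non-empty Python list of tag names with the <code title="">html</code> element at index 0, the function returns the chosen mode instead of switching it, and when no context element is supplied the first node is examined as itself.</p>

```python
from typing import List, Optional

# Tag name to insertion mode, for the steps that carry no extra condition.
SIMPLE_MODES = {
    "select": "in select",
    "tr": "in row",
    "tbody": "in table body",
    "thead": "in table body",
    "tfoot": "in table body",
    "caption": "in caption",
    "colgroup": "in column group",
    "table": "in table",
    "head": "in body",  # deliberately "in body", not "in head"
    "body": "in body",
    "frameset": "in frameset",
    "html": "before head",
}


def reset_insertion_mode(stack: List[str], context: Optional[str] = None) -> str:
    """Return the insertion mode the reset steps would select."""
    last = False
    index = len(stack) - 1  # start at the current node (bottommost)
    while True:
        node = stack[index]
        if index == 0:
            # First node in the stack: in the fragment case, examine the
            # context element instead.
            last = True
            if context is not None:
                node = context
        if node in ("td", "th") and not last:
            return "in cell"
        if node in SIMPLE_MODES:
            return SIMPLE_MODES[node]
        if last:
            return "in body"
        index -= 1  # move to the node before this one in the stack
```

<p class="note">Because the tag names tested by the individual steps are mutually exclusive, looking them up in a table gives the same result as testing them in the listed order.</p>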
<p>The <dfn id="current-table">current table</dfn> is the last <code><a href="tabular-data.html#the-table-element">table</a></code>
element in the <a href="#stack-of-open-elements">stack of open elements</a>, if there is
one. If there is no <code><a href="tabular-data.html#the-table-element">table</a></code> element in the <a href="#stack-of-open-elements">stack of
open elements</a> (<a href="the-end.html#fragment-case">fragment case</a>), then the
<a href="#current-table">current table</a> is the first element in the <a href="#stack-of-open-elements">stack
of open elements</a> (the <code><a href="semantics.html#the-html-element">html</a></code> element).</p>
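<p class="note">The definitions of the current node and the current table can be illustrated with a small, non-normative model. The <code title="">Element</code> and <code title="">OpenElements</code> classes are hypothetical; the stack is held as a Python list whose first item is the topmost node and whose last item is the bottommost, and it is assumed to be non-empty when the properties are read.</p>

```python
from dataclasses import dataclass
from typing import List


@dataclass(eq=False)  # compare by identity, as DOM nodes would be
class Element:
    name: str


class OpenElements:
    """The stack of open elements; index 0 is the topmost node."""

    def __init__(self) -> None:
        self.items: List[Element] = []

    def push(self, element: Element) -> None:
        self.items.append(element)

    def pop(self) -> Element:
        return self.items.pop()

    @property
    def current_node(self) -> Element:
        # The bottommost node is the most recently added one.
        return self.items[-1]

    @property
    def current_table(self) -> Element:
        # The last table element in the stack, if there is one ...
        for element in reversed(self.items):
            if element.name == "table":
                return element
        # ... otherwise the first element in the stack (the html element).
        return self.items[0]
```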
<p>Elements in the stack fall into the following categories:</p>
<dl><dt><dfn id="special">Special</dfn></dt>
<dd><p>The following elements have varying levels of special
parsing rules: HTML's <code><a href="sections.html#the-address-element">address</a></code>, <code><a href="obsolete.html#the-applet-element">applet</a></code>,
<code><a href="the-map-element.html#the-area-element">area</a></code>, <code><a href="sections.html#the-article-element">article</a></code>, <code><a href="sections.html#the-aside-element">aside</a></code>,
<code><a href="semantics.html#the-base-element">base</a></code>, <code><a href="obsolete.html#basefont">basefont</a></code>, <code><a href="obsolete.html#bgsound">bgsound</a></code>,
<code><a href="grouping-content.html#the-blockquote-element">blockquote</a></code>, <code><a href="sections.html#the-body-element">body</a></code>, <code><a href="text-level-semantics.html#the-br-element">br</a></code>,
<code><a href="the-button-element.html#the-button-element">button</a></code>, <code><a href="tabular-data.html#the-caption-element">caption</a></code>, <code><a href="obsolete.html#center">center</a></code>,
<code><a href="tabular-data.html#the-col-element">col</a></code>, <code><a href="tabular-data.html#the-colgroup-element">colgroup</a></code>, <code><a href="interactive-elements.html#the-command-element">command</a></code>,
<code><a href="grouping-content.html#the-dd-element">dd</a></code>, <code><a href="interactive-elements.html#the-details-element">details</a></code>, <code><a href="obsolete.html#dir">dir</a></code>,
<code><a href="grouping-content.html#the-div-element">div</a></code>, <code><a href="grouping-content.html#the-dl-element">dl</a></code>, <code><a href="grouping-content.html#the-dt-element">dt</a></code>,
<code><a href="the-iframe-element.html#the-embed-element">embed</a></code>, <code><a href="forms.html#the-fieldset-element">fieldset</a></code>, <code><a href="grouping-content.html#the-figcaption-element">figcaption</a></code>,
<code><a href="grouping-content.html#the-figure-element">figure</a></code>, <code><a href="sections.html#the-footer-element">footer</a></code>, <code><a href="forms.html#the-form-element">form</a></code>,
<code><a href="obsolete.html#frame">frame</a></code>, <code><a href="obsolete.html#frameset">frameset</a></code>, <code><a href="sections.html#the-h1-h2-h3-h4-h5-and-h6-elements">h1</a></code>,
<code><a href="sections.html#the-h1-h2-h3-h4-h5-and-h6-elements">h2</a></code>, <code><a href="sections.html#the-h1-h2-h3-h4-h5-and-h6-elements">h3</a></code>, <code><a href="sections.html#the-h1-h2-h3-h4-h5-and-h6-elements">h4</a></code>, <code><a href="sections.html#the-h1-h2-h3-h4-h5-and-h6-elements">h5</a></code>,
<code><a href="sections.html#the-h1-h2-h3-h4-h5-and-h6-elements">h6</a></code>, <code><a href="semantics.html#the-head-element">head</a></code>, <code><a href="sections.html#the-header-element">header</a></code>,
<code><a href="sections.html#the-hgroup-element">hgroup</a></code>, <code><a href="grouping-content.html#the-hr-element">hr</a></code>, <code><a href="semantics.html#the-html-element">html</a></code>,
<code><a href="the-iframe-element.html#the-iframe-element">iframe</a></code>, <code><a href="embedded-content-1.html#the-img-element">img</a></code>, <code><a href="the-input-element.html#the-input-element">input</a></code>,
<code><a href="obsolete.html#isindex-0">isindex</a></code>, <code><a href="grouping-content.html#the-li-element">li</a></code>, <code><a href="semantics.html#the-link-element">link</a></code>,
<code><a href="obsolete.html#listing">listing</a></code>, <code><a href="obsolete.html#the-marquee-element">marquee</a></code>, <code><a href="interactive-elements.html#the-menu-element">menu</a></code>,
<code><a href="semantics.html#the-meta-element">meta</a></code>, <code><a href="sections.html#the-nav-element">nav</a></code>, <code><a href="obsolete.html#noembed">noembed</a></code>,
<code><a href="obsolete.html#noframes">noframes</a></code>, <code><a href="scripting-1.html#the-noscript-element">noscript</a></code>, <code><a href="the-iframe-element.html#the-object-element">object</a></code>,
<code><a href="grouping-content.html#the-ol-element">ol</a></code>, <code><a href="grouping-content.html#the-p-element">p</a></code>, <code><a href="the-iframe-element.html#the-param-element">param</a></code>,
<code><a href="obsolete.html#plaintext">plaintext</a></code>, <code><a href="grouping-content.html#the-pre-element">pre</a></code>, <code><a href="scripting-1.html#the-script-element">script</a></code>,
<code><a href="sections.html#the-section-element">section</a></code>, <code><a href="the-button-element.html#the-select-element">select</a></code>, <code><a href="semantics.html#the-style-element">style</a></code>,
<code><a href="interactive-elements.html#the-summary-element">summary</a></code>, <code><a href="tabular-data.html#the-table-element">table</a></code>, <code><a href="tabular-data.html#the-tbody-element">tbody</a></code>,
<code><a href="tabular-data.html#the-td-element">td</a></code>, <code><a href="the-button-element.html#the-textarea-element">textarea</a></code>, <code><a href="tabular-data.html#the-tfoot-element">tfoot</a></code>,
<code><a href="tabular-data.html#the-th-element">th</a></code>, <code><a href="tabular-data.html#the-thead-element">thead</a></code>, <code><a href="semantics.html#the-title-element">title</a></code>,
<code><a href="tabular-data.html#the-tr-element">tr</a></code>, <code><a href="grouping-content.html#the-ul-element">ul</a></code>, <code><a href="text-level-semantics.html#the-wbr-element">wbr</a></code>, and
<code><a href="obsolete.html#xmp">xmp</a></code>; MathML's <code title="">mi</code>, <code title="">mo</code>, <code title="">mn</code>, <code title="">ms</code>, <code title="">mtext</code>, and <code title="">annotation-xml</code>; and SVG's <code title="">foreignObject</code>, <code title="">desc</code>, and
<code title="">title</code>.</p></dd>
<dt><dfn id="formatting">Formatting</dfn></dt>
<dd><p>The following HTML elements are those that end up in the
<a href="#list-of-active-formatting-elements">list of active formatting elements</a>: <code><a href="text-level-semantics.html#the-a-element">a</a></code>,
<code><a href="text-level-semantics.html#the-b-element">b</a></code>, <code><a href="obsolete.html#big">big</a></code>, <code><a href="text-level-semantics.html#the-code-element">code</a></code>,
<code><a href="text-level-semantics.html#the-em-element">em</a></code>, <code><a href="obsolete.html#font">font</a></code>, <code><a href="text-level-semantics.html#the-i-element">i</a></code>,
<code><a href="obsolete.html#nobr">nobr</a></code>, <code><a href="text-level-semantics.html#the-s-element">s</a></code>, <code><a href="text-level-semantics.html#the-small-element">small</a></code>,
<code><a href="obsolete.html#strike">strike</a></code>, <code><a href="text-level-semantics.html#the-strong-element">strong</a></code>, <code><a href="obsolete.html#tt">tt</a></code>, and
<code><a href="text-level-semantics.html#the-u-element">u</a></code>.</p></dd>
<dt><dfn id="ordinary">Ordinary</dfn></dt>
<dd><p>All other elements found while parsing an HTML
document.</p></dd>
</dl><p>The <a href="#stack-of-open-elements">stack of open elements</a> is said to <dfn id="has-an-element-in-the-specific-scope" title="has an element in the specific scope">have an element in a
specific scope</dfn> consisting of a list of element types <var title="">list</var> when the following algorithm terminates in a
match state:</p>
<ol><li><p>Initialize <var title="">node</var> to be the <a href="#current-node">current
node</a> (the bottommost node of the stack).</p></li>
<li><p>If <var title="">node</var> is the target node (the element
being tested for), terminate in a match state.</p></li>
<li><p>Otherwise, if <var title="">node</var> is one of the element
types in <var title="">list</var>, terminate in a failure
state.</p></li>
<li><p>Otherwise, set <var title="">node</var> to the previous
entry in the <a href="#stack-of-open-elements">stack of open elements</a> and return to step
2. (This will never fail, since the loop will always terminate in
the previous step if the top of the stack &#8212; an
<code><a href="semantics.html#the-html-element">html</a></code> element &#8212; is reached.)</p></li>
</ol><p>The <a href="#stack-of-open-elements">stack of open elements</a> is said to <dfn id="has-an-element-in-scope" title="has an element in scope">have an element in scope</dfn> when
it <a href="#has-an-element-in-the-specific-scope">has an element in the specific scope</a> consisting
of the following element types:</p>
<ul class="brief"><li><code><a href="obsolete.html#the-applet-element">applet</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
<li><code><a href="tabular-data.html#the-caption-element">caption</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
<li><code><a href="semantics.html#the-html-element">html</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
<li><code><a href="tabular-data.html#the-table-element">table</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
<li><code><a href="tabular-data.html#the-td-element">td</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
<li><code><a href="tabular-data.html#the-th-element">th</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
<li><code><a href="obsolete.html#the-marquee-element">marquee</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
<li><code><a href="the-iframe-element.html#the-object-element">object</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
<li><code title="">mi</code> in the <a href="namespaces.html#mathml-namespace">MathML namespace</a></li>
<li><code title="">mo</code> in the <a href="namespaces.html#mathml-namespace">MathML namespace</a></li>
<li><code title="">mn</code> in the <a href="namespaces.html#mathml-namespace">MathML namespace</a></li>
<li><code title="">ms</code> in the <a href="namespaces.html#mathml-namespace">MathML namespace</a></li>
<li><code title="">mtext</code> in the <a href="namespaces.html#mathml-namespace">MathML namespace</a></li>
<li><code title="">annotation-xml</code> in the <a href="namespaces.html#mathml-namespace">MathML namespace</a></li>
<li><code title="">foreignObject</code> in the <a href="namespaces.html#svg-namespace">SVG namespace</a></li>
<li><code title="">desc</code> in the <a href="namespaces.html#svg-namespace">SVG namespace</a></li>
<li><code title="">title</code> in the <a href="namespaces.html#svg-namespace">SVG namespace</a></li>
</ul><p>The <a href="#stack-of-open-elements">stack of open elements</a> is said to <dfn id="has-an-element-in-list-item-scope" title="has an element in list item scope">have an element in list
item scope</dfn> when it <a href="#has-an-element-in-the-specific-scope">has an element in the specific
scope</a> consisting of the following element types:</p>
<ul class="brief"><li>All the element types listed above for the <i><a href="#has-an-element-in-scope">has an element
in scope</a></i> algorithm.</li>
<li><code><a href="grouping-content.html#the-ol-element">ol</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
<li><code><a href="grouping-content.html#the-ul-element">ul</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
</ul><p>The <a href="#stack-of-open-elements">stack of open elements</a> is said to <dfn id="has-an-element-in-button-scope" title="has an element in button scope">have an element in button
scope</dfn> when it <a href="#has-an-element-in-the-specific-scope">has an element in the specific
scope</a> consisting of the following element types:</p>
<ul class="brief"><li>All the element types listed above for the <i><a href="#has-an-element-in-scope">has an element
in scope</a></i> algorithm.</li>
<li><code><a href="the-button-element.html#the-button-element">button</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
</ul><p>The <a href="#stack-of-open-elements">stack of open elements</a> is said to <dfn id="has-an-element-in-table-scope" title="has an element in table scope">have an element in table
scope</dfn> when it <a href="#has-an-element-in-the-specific-scope">has an element in the specific
scope</a> consisting of the following element types:</p>
<ul class="brief"><li><code><a href="semantics.html#the-html-element">html</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
<li><code><a href="tabular-data.html#the-table-element">table</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
</ul><p>The <a href="#stack-of-open-elements">stack of open elements</a> is said to <dfn id="has-an-element-in-select-scope" title="has an element in select scope">have an element in select
scope</dfn> when it <a href="#has-an-element-in-the-specific-scope">has an element in the specific
scope</a> consisting of all element types <em>except</em> the
following:</p>
<ul class="brief"><li><code><a href="the-button-element.html#the-optgroup-element">optgroup</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
<li><code><a href="the-button-element.html#the-option-element">option</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
</ul><p>Nothing happens if at any time any of the elements in the
<a href="#stack-of-open-elements">stack of open elements</a> are moved to a new location in,
or removed from, the <code><a href="infrastructure.html#document">Document</a></code> tree. In particular, the
stack is not changed in this situation. This can cause, amongst
other strange effects, content to be appended to nodes that are no
longer in the DOM.</p>
<p class="note">In some cases (namely, when <a href="tree-construction.html#adoptionAgency">closing misnested formatting elements</a>),
the stack is manipulated in a random-access fashion.</p>
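<p class="note">The scope checks defined in this section can be sketched, non-normatively, as one generic walk plus a boundary test per scope. In this sketch elements are simplified to <code title="">(namespace, local name)</code> tuples, the target is identified by element type, and all function and constant names are assumptions of the example.</p>

```python
from typing import Callable, List, Tuple

Elem = Tuple[str, str]  # (namespace, local name); a simplification
HTML, MATHML, SVG = "html", "mathml", "svg"

IN_SCOPE = frozenset(
    [(HTML, n) for n in ("applet", "caption", "html", "table", "td", "th",
                         "marquee", "object")]
    + [(MATHML, n) for n in ("mi", "mo", "mn", "ms", "mtext", "annotation-xml")]
    + [(SVG, n) for n in ("foreignObject", "desc", "title")]
)
LIST_ITEM_SCOPE = IN_SCOPE | {(HTML, "ol"), (HTML, "ul")}
BUTTON_SCOPE = IN_SCOPE | {(HTML, "button")}
TABLE_SCOPE = frozenset({(HTML, "html"), (HTML, "table")})
SELECT_SCOPE_EXCEPTIONS = frozenset({(HTML, "optgroup"), (HTML, "option")})


def has_element_in_specific_scope(
    stack: List[Elem], target: Elem, is_boundary: Callable[[Elem], bool]
) -> bool:
    """Walk from the current node upward; match before any boundary wins."""
    for node in reversed(stack):
        if node == target:
            return True  # match state
        if is_boundary(node):
            return False  # failure state
    # Only reached in this sketch if the stack lacks its html element.
    return False


def has_element_in_scope(stack: List[Elem], target: Elem) -> bool:
    return has_element_in_specific_scope(stack, target, lambda n: n in IN_SCOPE)


def has_element_in_list_item_scope(stack: List[Elem], target: Elem) -> bool:
    return has_element_in_specific_scope(stack, target, lambda n: n in LIST_ITEM_SCOPE)


def has_element_in_button_scope(stack: List[Elem], target: Elem) -> bool:
    return has_element_in_specific_scope(stack, target, lambda n: n in BUTTON_SCOPE)


def has_element_in_table_scope(stack: List[Elem], target: Elem) -> bool:
    return has_element_in_specific_scope(stack, target, lambda n: n in TABLE_SCOPE)


def has_element_in_select_scope(stack: List[Elem], target: Elem) -> bool:
    # Every element type except optgroup and option is a boundary.
    return has_element_in_specific_scope(
        stack, target, lambda n: n not in SELECT_SCOPE_EXCEPTIONS
    )
```

<p class="note">For example, with a stack of <code title="">html</code>, <code title="">body</code>, <code title="">p</code>, <code title="">button</code>, <code title="">span</code>, the <code title="">p</code> element is in scope but not in button scope, because the <code title="">button</code> element is reached first and acts as a boundary only in the latter.</p>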
<h5 id="the-list-of-active-formatting-elements"><span class="secno">8.2.3.3 </span>The list of active formatting elements</h5>
<p>Initially, the <dfn id="list-of-active-formatting-elements">list of active formatting elements</dfn> is
empty. It is used to handle misnested <a href="#formatting" title="formatting">formatting element tags</a>.</p>
<p>The list contains elements in the <a href="#formatting">formatting</a>
category, and scope markers. The scope markers are inserted when
entering <code><a href="obsolete.html#the-applet-element">applet</a></code> elements, buttons, <code><a href="the-iframe-element.html#the-object-element">object</a></code>
elements, marquees, table cells, and table captions, and are used to
prevent formatting from "leaking" <em>into</em> <code><a href="obsolete.html#the-applet-element">applet</a></code>
elements, buttons, <code><a href="the-iframe-element.html#the-object-element">object</a></code> elements, marquees, and
tables.</p>
<p class="note">The scope markers are unrelated to the concept of an
element being <a href="#has-an-element-in-scope" title="has an element in scope">in
scope</a>.</p>
<p>In addition, each element in the <a href="#list-of-active-formatting-elements">list of active formatting
elements</a> is associated with the token for which it was
created, so that further elements can be created for that token if
necessary.</p>
<p>When the steps below require the UA to <dfn id="push-onto-the-list-of-active-formatting-elements">push onto the list of
active formatting elements</dfn> an element <var title="">element</var>, the UA must perform the following steps:</p>
<ol><li><p>If there are already three elements in the <a href="#list-of-active-formatting-elements">list of
active formatting elements</a> after the last scope marker, if
any, or anywhere in the list if there are no scope markers, that
have the same tag name, namespace, and attributes as <var title="">element</var>, then remove the earliest such element from
the <a href="#list-of-active-formatting-elements">list of active formatting elements</a>. For these
purposes, the attributes must be compared as they were when the
elements were created by the parser; two elements have the same
attributes if all their parsed attributes can be paired such that
the two attributes in each pair have identical names, namespaces,
and values (the order of the attributes does not matter).</p>
<p class="note">This is the Noah's Ark clause, but with three per
family instead of two.</p></li>
<li><p>Add <var title="">element</var> to the <a href="#list-of-active-formatting-elements">list of active
formatting elements</a>.</p></li>
</ol><p>When the steps below require the UA to <dfn id="reconstruct-the-active-formatting-elements">reconstruct the
active formatting elements</dfn>, the UA must perform the following
steps:</p>
<ol><li>If there are no entries in the <a href="#list-of-active-formatting-elements">list of active formatting
elements</a>, then there is nothing to reconstruct; stop this
algorithm.</li>
<li>If the last (most recently added) entry in the <a href="#list-of-active-formatting-elements">list of
active formatting elements</a> is a marker, or if it is an
element that is in the <a href="#stack-of-open-elements">stack of open elements</a>, then
there is nothing to reconstruct; stop this algorithm.</li>
<li>Let <var title="">entry</var> be the last (most recently added)
element in the <a href="#list-of-active-formatting-elements">list of active formatting
elements</a>.</li>
<li>If there are no entries before <var title="">entry</var> in the
<a href="#list-of-active-formatting-elements">list of active formatting elements</a>, then jump to step
8.</li>
<li>Let <var title="">entry</var> be the entry one earlier than
<var title="">entry</var> in the <a href="#list-of-active-formatting-elements">list of active formatting
elements</a>.</li>
<li>If <var title="">entry</var> is neither a marker nor an element
that is also in the <a href="#stack-of-open-elements">stack of open elements</a>, go to step
4.</li>
<li>Let <var title="">entry</var> be the element one later than
<var title="">entry</var> in the <a href="#list-of-active-formatting-elements">list of active formatting
elements</a>.</li>
<li><a href="tree-construction.html#create-an-element-for-the-token">Create an element for the token</a> for which the
element <var title="">entry</var> was created, to obtain <var title="">new element</var>.</li>
<li>Append <var title="">new element</var> to the <a href="#current-node">current
node</a> and push it onto the <a href="#stack-of-open-elements">stack of open
elements</a> so that it is the new <a href="#current-node">current
node</a>.</li>
<li>Replace the entry for <var title="">entry</var> in the list
with an entry for <var title="">new element</var>.</li>
<li>If the entry for <var title="">new element</var> in the
<a href="#list-of-active-formatting-elements">list of active formatting elements</a> is not the last
entry in the list, return to step 7.</li>
</ol><p>This has the effect of reopening all the formatting elements that
were opened in the current body, cell, or caption (whichever is
youngest) that haven't been explicitly closed.</p>
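<p class="note">The reconstruction steps above can be sketched, non-normatively, as a rewind followed by an advance. The sketch assumes a simplified model: a module-level <code title="">MARKER</code> object stands in for a scope marker, each element remembers its token as a plain string, "create an element for the token" is reduced to constructing a new <code title="">Element</code>, and the stack of open elements is a non-empty list.</p>

```python
from dataclasses import dataclass, field
from typing import List

MARKER = object()  # stands in for a scope marker


@dataclass(eq=False)  # compare by identity
class Element:
    token: str  # the token the element was created for, simplified to a name
    children: List["Element"] = field(default_factory=list)


def reconstruct_active_formatting_elements(
    formatting: list, open_elements: List[Element]
) -> None:
    def settled(entry: object) -> bool:
        # A marker, or an element that is still in the stack of open elements.
        return entry is MARKER or any(entry is e for e in open_elements)

    # Steps 1 and 2: nothing to reconstruct.
    if not formatting or settled(formatting[-1]):
        return

    # Steps 3 to 7 (rewind): find the earliest entry of the trailing run
    # of entries that are neither markers nor open elements.
    index = len(formatting) - 1
    while index > 0 and not settled(formatting[index - 1]):
        index -= 1

    # Steps 8 to 11 (advance): recreate each entry through to the end.
    for i in range(index, len(formatting)):
        new_element = Element(formatting[i].token)
        open_elements[-1].children.append(new_element)  # append to current node
        open_elements.append(new_element)  # it becomes the new current node
        formatting[i] = new_element  # replace the entry in the list
```

<p class="note">Calling the function a second time on the same state does nothing, since the last entry is then an element in the stack of open elements.</p>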
<p class="note">The way this specification is written, the
<a href="#list-of-active-formatting-elements">list of active formatting elements</a> always consists of
elements in chronological order with the least recently added
element first and the most recently added element last (except
while steps 8 to 11 of the above algorithm are being executed).</p>
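<p class="note">The push steps given earlier in this section (the Noah's Ark clause) and the steps to clear the list up to the last marker, defined next, can be sketched together as follows. This is a non-normative sketch under stated assumptions: <code title="">MARKER</code> stands in for a scope marker, attributes are stored as a dictionary keyed by <code title="">(namespace, name)</code> so that order is irrelevant, and all names are hypothetical. Both operations only add or remove entries at known positions, so the chronological order of the list is preserved.</p>

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

MARKER = object()  # stands in for a scope marker


@dataclass(eq=False)  # compare by identity
class FormattingElement:
    name: str
    namespace: str
    # Attributes as parsed: (namespace, name) -> value.
    attributes: Dict[Tuple[Optional[str], str], str] = field(default_factory=dict)


def same_family(a: FormattingElement, b: FormattingElement) -> bool:
    """Same tag name, namespace, and attributes (in any order)."""
    return (a.name == b.name and a.namespace == b.namespace
            and a.attributes == b.attributes)


def push_formatting_element(formatting: list, element: FormattingElement) -> None:
    # Only entries after the last marker count; the whole list if none.
    start = 0
    for i in range(len(formatting) - 1, -1, -1):
        if formatting[i] is MARKER:
            start = i + 1
            break
    matches = [i for i in range(start, len(formatting))
               if same_family(formatting[i], element)]
    if len(matches) >= 3:
        del formatting[matches[0]]  # remove the earliest such element
    formatting.append(element)


def clear_to_last_marker(formatting: list) -> None:
    # Remove entries from the end until a marker has been removed.
    # Stopping on an empty list is an assumption of this sketch.
    while formatting:
        if formatting.pop() is MARKER:
            return
```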
<p>When the steps below require the UA to <dfn id="clear-the-list-of-active-formatting-elements-up-to-the-last-marker">clear the list of
active formatting elements up to the last marker</dfn>, the UA must
perform the following steps:</p>
<ol><li>Let <var title="">entry</var> be the last (most recently added)
entry in the <a href="#list-of-active-formatting-elements">list of active formatting elements</a>.</li>
<li>Remove <var title="">entry</var> from the <a href="#list-of-active-formatting-elements">list of active
formatting elements</a>.</li>
<li>If <var title="">entry</var> was a marker, then stop the
algorithm at this point. The list has been cleared up to the last
marker.</li>
<li>Go to step 1.</li>
</ol><h5 id="the-element-pointers"><span class="secno">8.2.3.4 </span>The element pointers</h5>
<p>Initially, the <dfn id="head-element-pointer"><code title="">head</code> element
pointer</dfn> and the <dfn id="form-element-pointer"><code title="">form</code> element
pointer</dfn> are both null.</p>
<p>Once a <code><a href="semantics.html#the-head-element">head</a></code> element has been parsed (whether
implicitly or explicitly) the <a href="#head-element-pointer"><code title="">head</code>
element pointer</a> gets set to point to this node.</p>
<p>The <a href="#form-element-pointer"><code title="">form</code> element pointer</a>
points to the last <code><a href="forms.html#the-form-element">form</a></code> element that was opened and
whose end tag has not yet been seen. It is used to make form
controls associate with forms in the face of dramatically bad
markup, for historical reasons.</p>
<h5 id="other-parsing-state-flags"><span class="secno">8.2.3.5 </span>Other parsing state flags</h5>
<p>The <dfn id="scripting-flag">scripting flag</dfn> is set to "enabled" if <a href="webappapis.html#concept-n-script" title="concept-n-script">scripting was enabled</a> for the
<code><a href="infrastructure.html#document">Document</a></code> with which the parser is associated when the
parser was created, and "disabled" otherwise.</p>
<p class="note">The <a href="#scripting-flag">scripting flag</a> can be enabled even
when the parser was originally created for the <a href="the-end.html#html-fragment-parsing-algorithm">HTML fragment
parsing algorithm</a>, even though <code><a href="scripting-1.html#the-script-element">script</a></code> elements
don't execute in that case.</p>
<p>The <dfn id="frameset-ok-flag">frameset-ok flag</dfn> is set to "ok" when the parser is
created. It is set to "not ok" after certain tokens are seen.</p>
</div></body></html>