<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
|
|
<html lang="en-US-x-Hixie" ><head><title>8.2 Parsing HTML documents — HTML5 </title><style type="text/css">
|
|
pre { margin-left: 2em; white-space: pre-wrap; }
|
|
h2 { margin: 3em 0 1em 0; }
|
|
h3 { margin: 2.5em 0 1em 0; }
|
|
h4 { margin: 2.5em 0 0.75em 0; }
|
|
h5, h6 { margin: 2.5em 0 1em; }
|
|
h1 + h2, h1 + h2 + h2 { margin: 0.75em 0 0.75em; }
|
|
h2 + h3, h3 + h4, h4 + h5, h5 + h6 { margin-top: 0.5em; }
|
|
p { margin: 1em 0; }
|
|
hr:not(.top) { display: block; background: none; border: none; padding: 0; margin: 2em 0; height: auto; }
|
|
dl, dd { margin-top: 0; margin-bottom: 0; }
|
|
dt { margin-top: 0.75em; margin-bottom: 0.25em; clear: left; }
|
|
dt + dt { margin-top: 0; }
|
|
dd dt { margin-top: 0.25em; margin-bottom: 0; }
|
|
dd p { margin-top: 0; }
|
|
dd dl + p { margin-top: 1em; }
|
|
dd table + p { margin-top: 1em; }
|
|
p + * > li, dd li { margin: 1em 0; }
|
|
dt, dfn { font-weight: bold; font-style: normal; }
|
|
dt dfn { font-style: italic; }
|
|
pre, code { font-size: inherit; font-family: monospace; font-variant: normal; }
|
|
pre strong { color: black; font: inherit; font-weight: bold; background: yellow; }
|
|
pre em { font-weight: bolder; font-style: normal; }
|
|
@media screen { code { color: orangered; } code :link, code :visited { color: inherit; } }
|
|
var sub { vertical-align: bottom; font-size: smaller; position: relative; top: 0.1em; }
|
|
table { border-collapse: collapse; border-style: hidden hidden none hidden; }
|
|
table thead, table tbody { border-bottom: solid; }
|
|
table tbody th:first-child { border-left: solid; }
|
|
table tbody th { text-align: left; }
|
|
table td, table th { border-left: solid; border-right: solid; border-bottom: solid thin; vertical-align: top; padding: 0.2em; }
|
|
blockquote { margin: 0 0 0 2em; border: 0; padding: 0; font-style: italic; }
|
|
|
|
.bad, .bad *:not(.XXX) { color: gray; border-color: gray; background: transparent; }
|
|
.matrix, .matrix td { border: none; text-align: right; }
|
|
.matrix { margin-left: 2em; }
|
|
.dice-example { border-collapse: collapse; border-style: hidden solid solid hidden; border-width: thin; margin-left: 3em; }
|
|
.dice-example caption { width: 30em; font-size: smaller; font-style: italic; padding: 0.75em 0; text-align: left; }
|
|
.dice-example td, .dice-example th { border: solid thin; width: 1.35em; height: 1.05em; text-align: center; padding: 0; }
|
|
|
|
.toc dfn, h1 dfn, h2 dfn, h3 dfn, h4 dfn, h5 dfn, h6 dfn { font: inherit; }
|
|
img.extra { float: right; }
|
|
pre.idl { border: solid thin; background: #EEEEEE; color: black; padding: 0.5em 1em; }
|
|
pre.idl :link, pre.idl :visited { color: inherit; background: transparent; }
|
|
pre.css { border: solid thin; background: #FFFFEE; color: black; padding: 0.5em 1em; }
|
|
pre.css:first-line { color: #AAAA50; }
|
|
dl.domintro { color: green; margin: 2em 0 2em 2em; padding: 0.5em 1em; border: none; background: #DDFFDD; }
|
|
hr + dl.domintro, div.impl + dl.domintro { margin-top: 2.5em; margin-bottom: 1.5em; }
|
|
dl.domintro dt, dl.domintro dt * { color: black; text-decoration: none; }
|
|
dl.domintro dd { margin: 0.5em 0 1em 2em; padding: 0; }
|
|
dl.domintro dd p { margin: 0.5em 0; }
|
|
dl.switch { padding-left: 2em; }
|
|
dl.switch > dt { text-indent: -1.5em; }
|
|
dl.switch > dt:before { content: '\21AA'; padding: 0 0.5em 0 0; display: inline-block; width: 1em; text-align: right; line-height: 0.5em; }
|
|
dl.triple { padding: 0 0 0 1em; }
|
|
dl.triple dt, dl.triple dd { margin: 0; display: inline }
|
|
dl.triple dt:after { content: ':'; }
|
|
dl.triple dd:after { content: '\A'; white-space: pre; }
|
|
.diff-old { text-decoration: line-through; color: silver; background: transparent; }
|
|
.diff-chg, .diff-new { text-decoration: underline; color: green; background: transparent; }
|
|
a .diff-new { border-bottom: 1px blue solid; }
|
|
|
|
h2 { page-break-before: always; }
|
|
h1, h2, h3, h4, h5, h6 { page-break-after: avoid; }
|
|
h1 + h2, hr + h2.no-toc { page-break-before: auto; }
|
|
|
|
p > span:not([title=""]):not([class="XXX"]):not([class="impl"]):not([class="note"]),
li > span:not([title=""]):not([class="XXX"]):not([class="impl"]):not([class="note"]) { border-bottom: solid #9999CC; }
|
|
|
|
div.head { margin: 0 0 1em; padding: 1em 0 0 0; }
|
|
div.head p { margin: 0; }
|
|
div.head h1 { margin: 0; }
|
|
div.head .logo { float: right; margin: 0 1em; }
|
|
div.head .logo img { border: none } /* remove border from top image */
|
|
div.head dl { margin: 1em 0; }
|
|
div.head p.copyright, div.head p.alt { font-size: x-small; font-style: oblique; margin: 0; }
|
|
|
|
body > .toc > li { margin-top: 1em; margin-bottom: 1em; }
|
|
body > .toc.brief > li { margin-top: 0.35em; margin-bottom: 0.35em; }
|
|
body > .toc > li > * { margin-bottom: 0.5em; }
|
|
body > .toc > li > * > li > * { margin-bottom: 0.25em; }
|
|
.toc, .toc li { list-style: none; }
|
|
|
|
.brief { margin-top: 1em; margin-bottom: 1em; line-height: 1.1; }
|
|
.brief li { margin: 0; padding: 0; }
|
|
.brief li p { margin: 0; padding: 0; }
|
|
|
|
.category-list { margin-top: -0.75em; margin-bottom: 1em; line-height: 1.5; }
|
|
.category-list::before { content: '\21D2\A0'; font-size: 1.2em; font-weight: 900; }
|
|
.category-list li { display: inline; }
|
|
.category-list li:not(:last-child)::after { content: ', '; }
|
|
.category-list li > span, .category-list li > a { text-transform: lowercase; }
|
|
.category-list li * { text-transform: none; } /* don't affect <code> nested in <a> */
|
|
|
|
.XXX { color: #E50000; background: white; border: solid red; padding: 0.5em; margin: 1em 0; }
|
|
.XXX > :first-child { margin-top: 0; }
|
|
p .XXX { line-height: 3em; }
|
|
.annotation { border: solid thin black; background: #0C479D; color: white; position: relative; margin: 8px 0 20px 0; }
|
|
.annotation:before { position: absolute; left: 0; top: 0; width: 100%; height: 100%; margin: 6px -6px -6px 6px; background: #333333; z-index: -1; content: ''; }
|
|
.annotation :link, .annotation :visited { color: inherit; }
|
|
.annotation :link:hover, .annotation :visited:hover { background: transparent; }
|
|
.annotation span { border: none ! important; }
|
|
.note { color: green; background: transparent; font-family: sans-serif; }
|
|
.warning { color: red; background: transparent; }
|
|
.note, .warning { font-weight: bolder; font-style: italic; }
|
|
p.note, div.note { padding: 0.5em 2em; }
|
|
span.note { padding: 0 2em; }
|
|
.note p:first-child, .warning p:first-child { margin-top: 0; }
|
|
.note p:last-child, .warning p:last-child { margin-bottom: 0; }
|
|
.warning:before { font-style: normal; }
|
|
p.note:before { content: 'Note: '; }
|
|
p.warning:before { content: '\26A0 Warning! '; }
|
|
|
|
.bookkeeping:before { display: block; content: 'Bookkeeping details'; font-weight: bolder; font-style: italic; }
|
|
.bookkeeping { font-size: 0.8em; margin: 2em 0; }
|
|
.bookkeeping p { margin: 0.5em 2em; display: list-item; list-style: square; }
|
|
.bookkeeping dt { margin: 0.5em 2em 0; }
|
|
.bookkeeping dd { margin: 0 3em 0.5em; }
|
|
|
|
h4 { position: relative; z-index: 3; }
|
|
h4 + .element, h4 + div + .element { margin-top: -2.5em; padding-top: 2em; }
|
|
.element {
|
|
background: #EEEEFF;
|
|
color: black;
|
|
margin: 0 0 1em 0.15em;
|
|
padding: 0 1em 0.25em 0.75em;
|
|
border-left: solid #9999FF 0.25em;
|
|
position: relative;
|
|
z-index: 1;
|
|
}
|
|
.element:before {
|
|
position: absolute;
|
|
z-index: 2;
|
|
top: 0;
|
|
left: -1.15em;
|
|
height: 2em;
|
|
width: 0.9em;
|
|
background: #EEEEFF;
|
|
content: ' ';
|
|
border-style: none none solid solid;
|
|
border-color: #9999FF;
|
|
border-width: 0.25em;
|
|
}
|
|
|
|
.example { display: block; color: #222222; background: #FCFCFC; border-left: double; margin-left: 2em; padding-left: 1em; }
|
|
td > .example:only-child { margin: 0 0 0 0.1em; }
|
|
|
|
ul.domTree, ul.domTree ul { padding: 0 0 0 1em; margin: 0; }
|
|
ul.domTree li { padding: 0; margin: 0; list-style: none; position: relative; }
|
|
ul.domTree li li { list-style: none; }
|
|
ul.domTree li:first-child::before { position: absolute; top: 0; height: 0.6em; left: -0.75em; width: 0.5em; border-style: none none solid solid; content: ''; border-width: 0.1em; }
|
|
ul.domTree li:not(:last-child)::after { position: absolute; top: 0; bottom: -0.6em; left: -0.75em; width: 0.5em; border-style: none none solid solid; content: ''; border-width: 0.1em; }
|
|
ul.domTree span { font-style: italic; font-family: serif; }
|
|
ul.domTree .t1 code { color: purple; font-weight: bold; }
|
|
ul.domTree .t2 { font-style: normal; font-family: monospace; }
|
|
ul.domTree .t2 .name { color: black; font-weight: bold; }
|
|
ul.domTree .t2 .value { color: blue; font-weight: normal; }
|
|
ul.domTree .t3 code, .domTree .t4 code, .domTree .t5 code { color: gray; }
|
|
ul.domTree .t7 code, .domTree .t8 code { color: green; }
|
|
ul.domTree .t10 code { color: teal; }
|
|
|
|
body.dfnEnabled dfn { cursor: pointer; }
|
|
.dfnPanel {
|
|
display: inline;
|
|
position: absolute;
|
|
z-index: 10;
|
|
height: auto;
|
|
width: auto;
|
|
padding: 0.5em 0.75em;
|
|
font: small sans-serif, Droid Sans Fallback;
|
|
background: #DDDDDD;
|
|
color: black;
|
|
border: outset 0.2em;
|
|
}
|
|
.dfnPanel * { margin: 0; padding: 0; font: inherit; text-indent: 0; }
|
|
.dfnPanel :link, .dfnPanel :visited { color: black; }
|
|
.dfnPanel p { font-weight: bolder; }
|
|
.dfnPanel * + p { margin-top: 0.25em; }
|
|
.dfnPanel li { list-style-position: inside; }
|
|
|
|
#configUI { position: absolute; z-index: 20; top: 10em; right: 1em; width: 11em; font-size: small; }
|
|
#configUI p { margin: 0.5em 0; padding: 0.3em; background: #EEEEEE; color: black; border: inset thin; }
|
|
#configUI p label { display: block; }
|
|
#configUI #updateUI, #configUI .loginUI { text-align: center; }
|
|
#configUI input[type=button] { display: block; margin: auto; }
|
|
|
|
fieldset { margin: 1em; padding: 0.5em 1em; }
|
|
fieldset > legend + * { margin-top: 0; }
|
|
fieldset > :last-child { margin-bottom: 0; }
|
|
fieldset p { margin: 0.5em 0; }
|
|
|
|
.stability {
|
|
position: fixed;
|
|
bottom: 0;
|
|
left: 0; right: 0;
|
|
margin: 0 auto 0 auto !important;
|
|
z-index: 1000;
|
|
width: 50%;
|
|
background: maroon; color: yellow;
|
|
-webkit-border-radius: 1em 1em 0 0;
|
|
-moz-border-radius: 1em 1em 0 0;
|
|
border-radius: 1em 1em 0 0;
|
|
-moz-box-shadow: 0 0 1em #500;
|
|
-webkit-box-shadow: 0 0 1em #500;
|
|
box-shadow: 0 0 1em red;
|
|
padding: 0.5em 1em;
|
|
text-align: center;
|
|
}
|
|
.stability strong {
|
|
display: block;
|
|
}
|
|
.stability input {
|
|
appearance: none; margin: 0; border: 0; padding: 0.25em 0.5em; background: transparent; color: black;
|
|
position: absolute; top: -0.5em; right: 0; font: 1.25em sans-serif; text-align: center;
|
|
}
|
|
.stability input:hover {
|
|
color: white;
|
|
text-shadow: 0 0 2px black;
|
|
}
|
|
.stability input:active {
|
|
padding: 0.3em 0.45em 0.2em 0.55em;
|
|
}
|
|
.stability :link, .stability :visited,
|
|
.stability :link:hover, .stability :visited:hover {
|
|
background: transparent;
|
|
color: white;
|
|
}
|
|
|
|
</style><link href="data:text/css,.impl%20%7B%20display:%20none;%20%7D%0Ahtml%20%7B%20border:%20solid%20yellow;%20%7D%20.domintro:before%20%7B%20display:%20none;%20%7D" id="author" rel="alternate stylesheet" title="Author documentation only"><link href="data:text/css,.impl%20%7B%20background:%20%23FFEEEE;%20%7D%20.domintro:before%20%7B%20background:%20%23FFEEEE;%20%7D" id="highlight" rel="alternate stylesheet" title="Highlight implementation
|
|
requirements"><link href="http://www.w3.org/StyleSheets/TR/W3C-WD" rel="stylesheet" type="text/css"><style type="text/css">
|
|
|
|
.applies thead th > * { display: block; }
|
|
.applies thead code { display: block; }
|
|
.applies tbody th { white-space: nowrap; }
|
|
.applies td { text-align: center; }
|
|
.applies .yes { background: yellow; }
|
|
|
|
.matrix, .matrix td { border: hidden; text-align: right; }
|
|
.matrix { margin-left: 2em; }
|
|
|
|
.dice-example { border-collapse: collapse; border-style: hidden solid solid hidden; border-width: thin; margin-left: 3em; }
|
|
.dice-example caption { width: 30em; font-size: smaller; font-style: italic; padding: 0.75em 0; text-align: left; }
|
|
.dice-example td, .dice-example th { border: solid thin; width: 1.35em; height: 1.05em; text-align: center; padding: 0; }
|
|
|
|
td.eg { border-width: thin; text-align: center; }
|
|
|
|
#table-example-1 { border: solid thin; border-collapse: collapse; margin-left: 3em; }
|
|
#table-example-1 * { font-family: "Essays1743", serif; line-height: 1.01em; }
|
|
#table-example-1 caption { padding-bottom: 0.5em; }
|
|
#table-example-1 thead, #table-example-1 tbody { border: none; }
|
|
#table-example-1 th, #table-example-1 td { border: solid thin; }
|
|
#table-example-1 th { font-weight: normal; }
|
|
#table-example-1 td { border-style: none solid; vertical-align: top; }
|
|
#table-example-1 th { padding: 0.5em; vertical-align: middle; text-align: center; }
|
|
#table-example-1 tbody tr:first-child td { padding-top: 0.5em; }
|
|
#table-example-1 tbody tr:last-child td { padding-bottom: 1.5em; }
|
|
#table-example-1 tbody td:first-child { padding-left: 2.5em; padding-right: 0; width: 9em; }
|
|
#table-example-1 tbody td:first-child::after { content: leader(". "); }
|
|
#table-example-1 tbody td { padding-left: 2em; padding-right: 2em; }
|
|
#table-example-1 tbody td:first-child + td { width: 10em; }
|
|
#table-example-1 tbody td:first-child + td ~ td { width: 2.5em; }
|
|
#table-example-1 tbody td:first-child + td + td + td ~ td { width: 1.25em; }
|
|
|
|
.apple-table-examples { border: none; border-collapse: separate; border-spacing: 1.5em 0em; width: 40em; margin-left: 3em; }
|
|
.apple-table-examples * { font-family: "Times", serif; }
|
|
.apple-table-examples td, .apple-table-examples th { border: none; white-space: nowrap; padding-top: 0; padding-bottom: 0; }
|
|
.apple-table-examples tbody th:first-child { border-left: none; width: 100%; }
|
|
.apple-table-examples thead th:first-child ~ th { font-size: smaller; font-weight: bolder; border-bottom: solid 2px; text-align: center; }
|
|
.apple-table-examples tbody th::after, .apple-table-examples tfoot th::after { content: leader(". ") }
|
|
.apple-table-examples tbody th, .apple-table-examples tfoot th { font: inherit; text-align: left; }
|
|
.apple-table-examples td { text-align: right; vertical-align: top; }
|
|
.apple-table-examples.e1 tbody tr:last-child td { border-bottom: solid 1px; }
|
|
.apple-table-examples.e1 tbody + tbody tr:last-child td { border-bottom: double 3px; }
|
|
.apple-table-examples.e2 th[scope=row] { padding-left: 1em; }
|
|
.apple-table-examples sup { line-height: 0; }
|
|
|
|
.details-example img { vertical-align: top; }
|
|
|
|
#base64-table {
|
|
white-space: nowrap;
|
|
font-size: 0.6em;
|
|
column-width: 6em;
|
|
column-count: 5;
|
|
column-gap: 1em;
|
|
-moz-column-width: 6em;
|
|
-moz-column-count: 5;
|
|
-moz-column-gap: 1em;
|
|
-webkit-column-width: 6em;
|
|
-webkit-column-count: 5;
|
|
-webkit-column-gap: 1em;
|
|
}
|
|
#base64-table thead { display: none; }
|
|
#base64-table * { border: none; }
|
|
#base64-table tbody td:first-child:after { content: ':'; }
|
|
#base64-table tbody td:last-child { text-align: right; }
|
|
|
|
#named-character-references-table {
|
|
white-space: nowrap;
|
|
font-size: 0.6em;
|
|
column-width: 30em;
|
|
column-gap: 1em;
|
|
-moz-column-width: 30em;
|
|
-moz-column-gap: 1em;
|
|
-webkit-column-width: 30em;
|
|
-webkit-column-gap: 1em;
|
|
}
|
|
#named-character-references-table > table > tbody > tr > td:first-child + td,
|
|
#named-character-references-table > table > tbody > tr > td:last-child { text-align: center; }
|
|
#named-character-references-table > table > tbody > tr > td:last-child:hover > span { position: absolute; top: auto; left: auto; margin-left: 0.5em; line-height: 1.2; font-size: 5em; border: outset; padding: 0.25em 0.5em; background: white; width: 1.25em; height: auto; text-align: center; }
|
|
#named-character-references-table > table > tbody > tr#entity-CounterClockwiseContourIntegral > td:first-child { font-size: 0.5em; }
|
|
|
|
.glyph.control { color: red; }
|
|
|
|
@font-face {
|
|
font-family: 'Essays1743';
|
|
src: url('http://www.whatwg.org/specs/web-apps/current-work/fonts/Essays1743.ttf');
|
|
}
|
|
@font-face {
|
|
font-family: 'Essays1743';
|
|
font-weight: bold;
|
|
src: url('http://www.whatwg.org/specs/web-apps/current-work/fonts/Essays1743-Bold.ttf');
|
|
}
|
|
@font-face {
|
|
font-family: 'Essays1743';
|
|
font-style: italic;
|
|
src: url('http://www.whatwg.org/specs/web-apps/current-work/fonts/Essays1743-Italic.ttf');
|
|
}
|
|
@font-face {
|
|
font-family: 'Essays1743';
|
|
font-style: italic;
|
|
font-weight: bold;
|
|
src: url('http://www.whatwg.org/specs/web-apps/current-work/fonts/Essays1743-BoldItalic.ttf');
|
|
}
|
|
|
|
</style><style type="text/css">
|
|
.domintro:before { display: table; margin: -1em -0.5em -0.5em auto; width: auto; content: 'This box is non-normative. Implementation requirements are given below this box.'; color: black; font-style: italic; border: solid 2px; background: white; padding: 0 0.25em; }
|
|
</style><script type="text/javascript">
 // Return the value stored under the given name, checking the query string first and then the cookies.
 function getCookie(name) {
   var params = location.search.substr(1).split("&");
   for (var index = 0; index < params.length; index++) {
     // A bare query parameter with no "=value" part counts as "1".
     if (params[index] == name)
       return "1";
     var data = params[index].split("=");
     if (data[0] == name)
       return unescape(data[1]);
   }
   var cookies = document.cookie.split("; ");
   for (var index = 0; index < cookies.length; index++) {
     var data = cookies[index].split("=");
     if (data[0] == name)
       return unescape(data[1]);
   }
   return null;
 }
</script>
|
|
<script src="link-fixup.js" type="text/javascript"></script>
|
|
<link href="style.css" rel="stylesheet"><link href="syntax.html" title="8 The HTML syntax" rel="prev">
|
|
<link href="spec.html#contents" title="Table of contents" rel="index">
|
|
<link href="tokenization.html" title="8.2.4 Tokenization" rel="next">
|
|
</head><body><div class="head" id="head">
|
|
<div id="multipage-common">
|
|
<p class="stability" id="wip"><strong>This is a work in
progress!</strong> For the latest updates from the HTML WG, possibly
including important bug fixes, please look at the <a href="http://dev.w3.org/html5/spec/Overview.html">editor's draft</a> instead.
There may also be a more
<a href="http://www.w3.org/TR/html5">up-to-date Working Draft</a>
with changes based on resolution of Last Call issues.
<input onclick="closeWarning(this.parentNode)" type="button" value="╳⃝"></p>
<script type="text/javascript">
 // Remove the warning banner and remember the choice for four days.
 function closeWarning(element) {
   element.parentNode.removeChild(element);
   var date = new Date();
   date.setDate(date.getDate()+4);
   document.cookie = 'hide-obsolescence-warning=1; expires=' + date.toUTCString();
 }
 if (getCookie('hide-obsolescence-warning') == '1')
   setTimeout(function () { document.getElementById('wip').parentNode.removeChild(document.getElementById('wip')); }, 2000);
</script></div>
<p><a href="http://www.w3.org/"><img alt="W3C" height="48" src="http://www.w3.org/Icons/w3c_home" width="72"></a></p>
<h1>HTML5</h1>
</div><div>
<a href="syntax.html" class="prev">8 The HTML syntax</a> –
<a href="spec.html#contents">Table of contents</a> –
<a href="tokenization.html" class="next">8.2.4 Tokenization</a>
<ol class="toc"><li><ol><li><a href="parsing.html#parsing"><span class="secno">8.2 </span>Parsing HTML documents</a>
<ol><li><a href="parsing.html#overview-of-the-parsing-model"><span class="secno">8.2.1 </span>Overview of the parsing model</a></li><li><a href="parsing.html#the-input-stream"><span class="secno">8.2.2 </span>The input stream</a>
<ol><li><a href="parsing.html#determining-the-character-encoding"><span class="secno">8.2.2.1 </span>Determining the character encoding</a></li><li><a href="parsing.html#character-encodings-0"><span class="secno">8.2.2.2 </span>Character encodings</a></li><li><a href="parsing.html#preprocessing-the-input-stream"><span class="secno">8.2.2.3 </span>Preprocessing the input stream</a></li><li><a href="parsing.html#changing-the-encoding-while-parsing"><span class="secno">8.2.2.4 </span>Changing the encoding while parsing</a></li></ol></li><li><a href="parsing.html#parse-state"><span class="secno">8.2.3 </span>Parse state</a>
<ol><li><a href="parsing.html#the-insertion-mode"><span class="secno">8.2.3.1 </span>The insertion mode</a></li><li><a href="parsing.html#the-stack-of-open-elements"><span class="secno">8.2.3.2 </span>The stack of open elements</a></li><li><a href="parsing.html#the-list-of-active-formatting-elements"><span class="secno">8.2.3.3 </span>The list of active formatting elements</a></li><li><a href="parsing.html#the-element-pointers"><span class="secno">8.2.3.4 </span>The element pointers</a></li><li><a href="parsing.html#other-parsing-state-flags"><span class="secno">8.2.3.5 </span>Other parsing state flags</a></li></ol></li></ol></li></ol></li></ol></div>
<div class="impl">
<h3 id="parsing"><span class="secno">8.2 </span>Parsing HTML documents</h3>
<p><i>This section only applies to user agents, data mining tools,
and conformance checkers.</i></p>
<p class="note">The rules for parsing XML documents into DOM trees
are covered by the next section, entitled "<a href="the-xhtml-syntax.html#the-xhtml-syntax">The XHTML
syntax</a>".</p>
<p>For <a href="dom.html#html-documents">HTML documents</a>, user agents must use the parsing
rules described in this section to generate the DOM trees. Together,
these rules define what is referred to as the <dfn id="html-parser">HTML
parser</dfn>.</p>
<div class="note">
<p>While the HTML syntax described in this specification bears a
close resemblance to SGML and XML, it is a separate language with
its own parsing rules.</p>
<p>Some earlier versions of HTML (in particular from HTML2 to
HTML4) were based on SGML and used SGML parsing rules. However, few
(if any) Web browsers ever implemented true SGML parsing for HTML
documents; the only user agents to strictly handle HTML as an SGML
application have historically been validators. The resulting
confusion (with validators claiming documents to have one
representation while widely deployed Web browsers interoperably
implemented a different representation) has wasted decades
of productivity. This version of HTML thus returns to a non-SGML
basis.</p>
<p>Authors interested in using SGML tools in their authoring
pipeline are encouraged to use XML tools and the XML serialization
of HTML.</p>
</div>
<p>This specification defines the parsing rules for HTML documents,
whether they are syntactically correct or not. Certain points in the
parsing algorithm are said to be <dfn id="parse-error" title="parse error">parse
errors</dfn>. The error handling for parse errors is well-defined:
user agents must either act as described below when encountering
such problems, or must abort processing at the first error that they
encounter for which they do not wish to apply the rules described
below.</p>
<p>Conformance checkers must report at least one parse error
condition to the user if one or more parse error conditions exist in
the document and must not report parse error conditions if none
exist in the document. Conformance checkers may report more than one
parse error condition if more than one parse error condition exists
in the document. Conformance checkers are not required to recover
from parse errors.</p>
<p class="note">Parse errors are only errors with the
<em>syntax</em> of HTML. In addition to checking for parse errors,
conformance checkers will also verify that the document obeys all
the other conformance requirements described in this
specification.</p>
<p>For the purposes of conformance checkers, if a resource is
determined to be in <a href="syntax.html#syntax">the HTML syntax</a>, then it is an
<a href="dom.html#html-documents" title="HTML documents">HTML document</a>.</p>
</div><div class="impl">
<h4 id="overview-of-the-parsing-model"><span class="secno">8.2.1 </span>Overview of the parsing model</h4>
<p>The input to the HTML parsing process consists of a stream of
Unicode characters, which is passed through a
<a href="tokenization.html#tokenization">tokenization</a> stage followed by a <a href="tree-construction.html#tree-construction">tree
construction</a> stage. The output is a <code><a href="infrastructure.html#document">Document</a></code>
object.</p>
<p class="note">Implementations that <a href="infrastructure.html#non-scripted">do not
support scripting</a> do not have to actually create a DOM
<code><a href="infrastructure.html#document">Document</a></code> object, but the DOM tree in such cases is
still used as the model for the rest of the specification.</p>
<p>In the common case, the data handled by the tokenization stage
comes from the network, but <a href="apis-in-html-documents.html#dynamic-markup-insertion" title="dynamic markup insertion">it can also come from script</a> running in the user
agent, e.g. using the <code title="dom-document-write"><a href="apis-in-html-documents.html#dom-document-write">document.write()</a></code> API.</p>
<p><img alt="" height="554" src="parsing-model-overview.png" width="427"></p>
<p id="nestedParsing">There is only one set of states for the
tokenizer stage and the tree construction stage, but the tree
construction stage is reentrant, meaning that while the tree
construction stage is handling one token, the tokenizer might be
resumed, causing further tokens to be emitted and processed before
the first token's processing is complete.</p>
<div class="example">
<p>In the following example, the tree construction stage will be
called upon to handle a "p" start tag token while handling the
"script" end tag token:</p>
<pre>...
&lt;script&gt;
 document.write('&lt;p&gt;');
&lt;/script&gt;
...</pre>
</div>
<p>To handle these cases, parsers have a <dfn id="script-nesting-level">script nesting
level</dfn>, which must be initially set to zero, and a <dfn id="parser-pause-flag">parser
pause flag</dfn>, which must be initially set to false.</p>
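<div class="example">
<p>The reentrancy described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the <code>ToyParser</code> class, its descriptive token strings, and its <code>write()</code> method are hypothetical stand-ins, not the tokenizer or tree construction stage defined by this specification. It shows how a script nesting level is raised while a script runs, so that a token written by the script is fully processed before handling of the "script" end tag token completes.</p>

```javascript
// Toy model only: tokens are descriptive strings and a "script" is a callback.
class ToyParser {
  constructor() {
    this.scriptNestingLevel = 0;  // initially zero
    this.parserPauseFlag = false; // initially false (not exercised in this sketch)
    this.log = [];
  }

  // Hypothetical stand-in for the tree construction stage handling one token.
  processToken(token, script) {
    this.log.push("begin " + token);
    if (token === "script end tag" && script) {
      this.scriptNestingLevel += 1;
      script(this); // the script may call write(), re-entering the parser
      this.scriptNestingLevel -= 1;
    }
    this.log.push("end " + token);
  }

  // Hypothetical stand-in for document.write(): the token is handled at once.
  write(token) {
    this.log.push("write at nesting level " + this.scriptNestingLevel);
    this.processToken(token);
  }
}

const parser = new ToyParser();
parser.processToken("script start tag");
parser.processToken("script end tag", function (p) { p.write("p start tag"); });
console.log(parser.log.join("\n"));
```

<p>In the printed log, the "p start tag" entries appear between the "begin" and "end" entries for the "script end tag" token, which is the nesting the example above describes.</p>
</div>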
</div><div class="impl">
<h4 id="the-input-stream"><span class="secno">8.2.2 </span>The <dfn>input stream</dfn></h4>
<p>The stream of Unicode characters that comprises the input to the
tokenization stage will be initially seen by the user agent as a
stream of bytes (typically coming over the network or from the local
file system). The bytes encode the actual characters according to a
particular <em>character encoding</em>, which the user agent must
use to decode the bytes into characters.</p>
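<div class="example">
<p>To see why the choice of character encoding matters, the following sketch decodes the same four bytes in two ways. It assumes an environment that provides the <code>TextDecoder</code> interface (for example a current browser or Node.js); that interface is not defined by this section and is used here purely for illustration.</p>

```javascript
// The same four bytes, decoded under two different character encodings.
const bytes = new Uint8Array([0x48, 0x00, 0x69, 0x00]);

const asUtf16le = new TextDecoder("utf-16le").decode(bytes);
const asUtf8 = new TextDecoder("utf-8").decode(bytes);

// JSON.stringify makes the otherwise invisible U+0000 characters visible.
console.log(JSON.stringify(asUtf16le)); // "Hi"
console.log(JSON.stringify(asUtf8));    // "H\u0000i\u0000"
```

<p>The bytes are identical in both calls; only the encoding used to interpret them differs, and so does the resulting stream of characters.</p>
</div>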
<p class="note">For XML documents, the algorithm user agents must
use to determine the character encoding is given by the XML
specification. This section does not apply to XML documents. <a href="references.html#refsXML">[XML]</a></p>
<h5 id="determining-the-character-encoding"><span class="secno">8.2.2.1 </span>Determining the character encoding</h5>
<p>In some cases, it might be impractical to unambiguously determine
the encoding before parsing the document. Because of this, this
specification provides for a two-pass mechanism with an optional
pre-scan. Implementations are allowed, as described below, to apply
a simplified parsing algorithm to whatever bytes they have available
before beginning to parse the document. Then, the real parser is
started, using a tentative encoding derived from this pre-parse and
other out-of-band metadata. If, while the document is being loaded,
the user agent discovers an encoding declaration that conflicts with
this information, then the parser can get reinvoked to perform a
parse of the document with the real encoding.</p>
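<div class="example">
<p>The two-pass idea can be sketched roughly as follows. Everything named here is an assumption made for illustration: <code>decodeLatin1</code>, the <code>decoders</code> table, and <code>parseWithTentativeEncoding</code> are hypothetical, and the regular expression is a crude substitute for the real handling of encoding declarations, which this specification defines separately. The sketch only shows a first pass with a tentative encoding followed by a restart when the document declares a different one.</p>

```javascript
// Hypothetical helper: decode bytes as ISO-8859-1 (one byte per character).
function decodeLatin1(bytes) {
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

// Hypothetical decoder table for this sketch; real user agents support far more.
const decoders = {
  "iso-8859-1": decodeLatin1,
  "utf-8": (bytes) => new TextDecoder("utf-8").decode(bytes),
};

// Decode with a tentative encoding; if the document itself declares a
// different encoding that is supported, start over with that encoding.
function parseWithTentativeEncoding(bytes, tentative) {
  const firstPass = decoders[tentative](bytes);
  const match = /<meta\s+charset=["']?([\w-]+)/i.exec(firstPass);
  const declared = match ? match[1].toLowerCase() : null;
  if (declared && declared !== tentative && decoders[declared]) {
    return { encoding: declared, text: decoders[declared](bytes), reparsed: true };
  }
  return { encoding: tentative, text: firstPass, reparsed: false };
}

// A UTF-8 document that out-of-band metadata wrongly suggested was ISO-8859-1.
const source = new TextEncoder().encode('<meta charset="utf-8"><p>café</p>');
const result = parseWithTentativeEncoding(source, "iso-8859-1");
console.log(result.encoding, result.reparsed, result.text.includes("café"));
```

<p>The first pass misreads the non-ASCII character, but the declaration itself is plain ASCII and survives, so the second pass can recover the intended text.</p>
</div>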
<p id="documentEncoding">User agents must use the following
algorithm (the <dfn id="encoding-sniffing-algorithm">encoding sniffing algorithm</dfn>) to determine
the character encoding to use when decoding a document in the first
pass. This algorithm takes as input any out-of-band metadata
available to the user agent (e.g. the <a href="fetching-resources.html#content-type" title="Content-Type">Content-Type metadata</a> of the document)
and all the bytes available so far, and returns an encoding and a
<dfn id="concept-encoding-confidence" title="concept-encoding-confidence">confidence</dfn>. The
confidence is either <i>tentative</i>, <i>certain</i>, or
<i>irrelevant</i>. The encoding used, and whether the confidence in
that encoding is <i>tentative</i> or <i>certain</i>, is <a href="tree-construction.html#meta-charset-during-parse">used during the parsing</a> to
determine whether to <a href="#change-the-encoding">change the encoding</a>. If no
encoding is necessary, e.g. because the parser is operating on a
stream of Unicode characters and doesn't have to use an encoding at
all, then the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> is
<i>irrelevant</i>.</p>
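<div class="example">
<p>As a rough illustration of the inputs and outputs just described, the sketch below returns an encoding together with a confidence. The function name <code>sniffEncoding</code>, its <code>transportCharset</code> parameter, the <code>SUPPORTED</code> set, and the encoding labels are all hypothetical. Only the transport-layer check and the byte order mark check from the steps of this algorithm are modelled; the user override, the optional wait, the search within the file, and any later steps are deliberately left out.</p>

```javascript
// Byte order marks and the encodings they indicate (labels chosen for this sketch).
const BOMS = [
  { bytes: [0xfe, 0xff], encoding: "UTF-16BE" },
  { bytes: [0xff, 0xfe], encoding: "UTF-16LE" },
  { bytes: [0xef, 0xbb, 0xbf], encoding: "UTF-8" },
];

// Hypothetical set of encodings this toy user agent supports.
const SUPPORTED = new Set(["UTF-8", "UTF-16BE", "UTF-16LE"]);

function sniffEncoding(bytes, transportCharset) {
  // An encoding specified by the transport layer wins if it is supported.
  if (transportCharset && SUPPORTED.has(transportCharset.toUpperCase())) {
    return { encoding: transportCharset.toUpperCase(), confidence: "certain" };
  }
  // Otherwise compare the first bytes against each byte order mark in turn.
  for (const bom of BOMS) {
    const enough = bytes.length >= bom.bytes.length;
    if (enough && bom.bytes.every((b, i) => bytes[i] === b)) {
      return { encoding: bom.encoding, confidence: "certain" };
    }
  }
  return null; // the remaining steps are not modelled here
}

console.log(sniffEncoding([0xef, 0xbb, 0xbf, 0x3c], null));
console.log(sniffEncoding([0x3c, 0x68], "utf-8"));
console.log(sniffEncoding([0x3c, 0x68], null));
```

<p>Returning <code>null</code> is just this sketch's way of saying that no answer was found in the modelled steps; a user agent would continue with the rest of the algorithm at that point.</p>
</div>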
<ol><li><p>If the user has explicitly instructed the user agent to
override the document's character encoding with a specific
encoding, optionally return that encoding with the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a>
<i>certain</i> and abort these steps.</p></li>
<li><p>If the transport layer specifies an encoding, and it is
supported, return that encoding with the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a>
<i>certain</i>, and abort these steps.</p></li>
<li>
<p>The user agent may wait for more bytes of the resource to be
available, either in this step or at any later step in this
algorithm. For instance, a user agent might wait 500ms or 1024
bytes, whichever comes first. In general, preparsing the source to
find the encoding improves performance, as it reduces the need to
throw away the data structures used when parsing upon finding the
encoding information. However, if the user agent delays too long
to obtain data to determine the encoding, then the cost of the
delay could outweigh any performance improvements from the
preparse.</p>
<p class="note">The authoring conformance requirements for
character encoding declarations limit them to only appearing <a href="semantics.html#charset1024">in the first 1024 bytes</a>. User agents are
therefore encouraged to use the preparse algorithm below (part of
these steps) on the first 1024 bytes, but not to stall beyond
that.</p>
</li>
<li><p>For each of the rows in the following table, starting with
the first one and going down, if there are as many or more bytes
available than the number of bytes in the first column, and the
first bytes of the file match the bytes given in the first column,
then return the encoding given in the cell in the second column of
that row, with the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a>
<i>certain</i>, and abort these steps:</p>
<table><thead><tr><th>Bytes in Hexadecimal
</th><th>Encoding
</th></tr></thead><tbody><tr><td>FE FF
</td><td>Big-endian UTF-16
</td></tr><tr><td>FF FE
</td><td>Little-endian UTF-16
</td></tr><tr><td>EF BB BF
</td><td>UTF-8
</td></tr></tbody></table><p class="note">This step looks for Unicode Byte Order Marks
(BOMs).</p></li>
<li><p>Otherwise, the user agent will have to search for explicit
|
|
character encoding information in the file itself. This should
|
|
proceed as follows:
|
|
|
|
</p><p>Let <var title="">position</var> be a pointer to a byte in the
|
|
input stream, initially pointing at the first byte. If at any
|
|
point during these substeps the user agent either runs out of
|
|
bytes or decides that scanning further bytes would not be
|
|
efficient, then skip to the next step of the overall character
|
|
encoding detection algorithm. User agents may decide that scanning
|
|
<em>any</em> bytes is not efficient, in which case these substeps
|
|
are entirely skipped.</p>
|
|
|
|
<p>Now, repeat the following "two step" algorithm until the algorithm
aborts (either because the user agent aborts, as described above, or
because a character encoding is found):</p>
|
|
|
|
<ol><li><p>If <var title="">position</var> points to:</p>
|
|
|
|
<dl class="switch"><dt>A sequence of bytes starting with: 0x3C 0x21 0x2D 0x2D (ASCII '&lt;!--')</dt>
|
|
<dd>
|
|
|
|
<p>Advance the <var title="">position</var> pointer so that it
|
|
points at the first 0x3E byte which is preceded by two 0x2D
|
|
bytes (i.e. at the end of an ASCII '--&gt;' sequence) and comes
after the 0x3C byte that was found. (The two 0x2D bytes can be
the same as those in the '&lt;!--' sequence.)</p>
|
|
|
|
</dd>
|
|
|
|
<dt>A sequence of bytes starting with: 0x3C, 0x4D or 0x6D, 0x45 or 0x65, 0x54 or 0x74, 0x41 or 0x61, and finally one of 0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F (case-insensitive ASCII '&lt;meta' followed by a space or slash)</dt>
|
|
<dd>
|
|
|
|
<ol><li><p>Advance the <var title="">position</var> pointer so
|
|
that it points at the next 0x09, 0x0A, 0x0C, 0x0D, 0x20, or
|
|
0x2F byte (the one in the sequence of bytes matched
above).</p></li>
|
|
|
|
<li><p>Let <var title="">attribute list</var> be an empty
|
|
list of strings.</p></li>
|
|
<li><p>Let <var title="">got pragma</var> be false.</p></li>
|
|
|
|
<li><p>Let <var title="">need pragma</var> be null.</p></li>
|
|
|
|
<li><p>Let <var title="">charset</var> be the null value
|
|
(which, for the purposes of this algorithm, is distinct from
|
|
an unrecognised encoding or the empty string).</p></li>
|
|
|
|
<li><p><i>Attributes</i>: <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">Get an
|
|
attribute</a> and its value. If no attribute was sniffed,
|
|
then jump to the <i>processing</i> step below.</p></li>
|
|
|
|
<li><p>If the attribute's name is already in <var title="">attribute list</var>, then return to the step
|
|
labeled <i>attributes</i>.</p>
|
|
|
|
</li><li><p>Add the attribute's name to <var title="">attribute
|
|
list</var>.</p>
|
|
|
|
</li><li>
|
|
|
|
<p>Run the appropriate step from the following list, if one
|
|
applies:</p>
|
|
|
|
<dl class="switch"><dt>If the attribute's name is "<code title="">http-equiv</code>"</dt>
|
|
|
|
<dd><p>If the attribute's value is "<code title="">content-type</code>", then set <var title="">got
|
|
pragma</var> to true.</p></dd>
|
|
|
|
<dt>If the attribute's name is "<code title="">content</code>"</dt>
|
|
|
|
<dd><p>Apply the <a href="fetching-resources.html#algorithm-for-extracting-an-encoding-from-a-meta-element">algorithm for extracting an encoding
|
|
from a <code>meta</code> element</a>, giving the
|
|
attribute's value as the string to parse. If an encoding is
|
|
returned, and if <var title="">charset</var> is still set
|
|
to null, let <var title="">charset</var> be the encoding
|
|
returned, and set <var title="">need pragma</var> to
|
|
true.</p></dd>
|
|
|
|
<dt>If the attribute's name is "<code title="">charset</code>"</dt>
|
|
|
|
<dd><p>Let <var title="">charset</var> be the encoding
|
|
corresponding to the attribute's value, and set <var title="">need pragma</var> to false.</p></dd>
|
|
|
|
</dl></li>
|
|
|
|
<li><p>Return to the step labeled <i>attributes</i>.</p></li>
|
|
|
|
<li><p><i>Processing</i>: If <var title="">need pragma</var>
|
|
is null, then jump to the second step of the overall "two
|
|
step" algorithm.</p></li>
|
|
|
|
<li><p>If <var title="">need pragma</var> is true but <var title="">got pragma</var> is false, then jump to the second
|
|
step of the overall "two step" algorithm.</p></li>
|
|
|
|
<li><p>If <var title="">charset</var> is a UTF-16 encoding,
|
|
change the value of <var title="">charset</var> to
|
|
UTF-8.</p></li>
|
|
|
|
<li><p>If <var title="">charset</var> is not a supported
|
|
character encoding, then jump to the second step of the
|
|
overall "two step" algorithm.</p></li>
|
|
|
|
<li><p>Return the encoding given by <var title="">charset</var>, with <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a>
|
|
<i>tentative</i>, and abort all these steps.</p></li>
|
|
|
|
</ol></dd>
|
|
|
|
<dt>A sequence of bytes starting with a 0x3C byte (ASCII &lt;), optionally a 0x2F byte (ASCII /), and finally a byte in the range 0x41-0x5A or 0x61-0x7A (an ASCII letter)</dt>
|
|
<dd>
|
|
|
|
<ol><li><p>Advance the <var title="">position</var> pointer so
|
|
that it points at the next 0x09 (ASCII TAB), 0x0A (ASCII LF),
|
|
0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E
|
|
(ASCII >) byte.</p></li>
|
|
|
|
<li><p>Repeatedly <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an
|
|
attribute</a> until no further attributes can be found,
|
|
then jump to the second step in the overall "two step"
|
|
algorithm.</p></li>
|
|
|
|
</ol></dd>
|
|
|
|
<dt>A sequence of bytes starting with: 0x3C 0x21 (ASCII '&lt;!')</dt>
<dt>A sequence of bytes starting with: 0x3C 0x2F (ASCII '&lt;/')</dt>
<dt>A sequence of bytes starting with: 0x3C 0x3F (ASCII '&lt;?')</dt>
|
|
<dd>
|
|
|
|
<p>Advance the <var title="">position</var> pointer so that it
|
|
points at the first 0x3E byte (ASCII >) that comes after the
|
|
0x3C byte that was found.</p>
|
|
|
|
</dd>
|
|
|
|
<dt>Any other byte</dt>
|
|
<dd>
|
|
|
|
<p>Do nothing with that byte.</p>
|
|
|
|
</dd>
|
|
|
|
</dl></li>
|
|
|
|
<li>Move <var title="">position</var> so it points at the next
|
|
byte in the input stream, and return to the first step of this
|
|
"two step" algorithm.</li>
|
|
|
|
</ol><p>When the above "two step" algorithm says to <dfn id="concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an
|
|
attribute</dfn>, it means doing this:</p>
|
|
|
|
<ol><li><p>If the byte at <var title="">position</var> is one of 0x09
|
|
(ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR),
|
|
0x20 (ASCII space), or 0x2F (ASCII /) then advance <var title="">position</var> to the next byte and redo this
|
|
substep.</p></li>
|
|
|
|
<li><p>If the byte at <var title="">position</var> is 0x3E (ASCII
|
|
>), then abort the "get an attribute" algorithm. There isn't
|
|
one.</p></li>
|
|
|
|
<li><p>Otherwise, the byte at <var title="">position</var> is the
|
|
start of the attribute name. Let <var title="">attribute
|
|
name</var> and <var title="">attribute value</var> be the empty
|
|
string.</p></li>
|
|
|
|
<li><p><i>Attribute name</i>: Process the byte at <var title="">position</var> as follows:</p>
|
|
|
|
<dl class="switch"><dt>If it is 0x3D (ASCII =), and the <var title="">attribute
|
|
name</var> is longer than the empty string</dt>
|
|
|
|
<dd>Advance <var title="">position</var> to the next byte and
|
|
jump to the step below labeled <i>value</i>.</dd>
|
|
|
|
<dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
|
|
FF), 0x0D (ASCII CR), or 0x20 (ASCII space)</dt>
|
|
|
|
<dd>Jump to the step below labeled <i>spaces</i>.</dd>
|
|
|
|
<dt>If it is 0x2F (ASCII /) or 0x3E (ASCII >)</dt>
|
|
|
|
<dd>Abort the "get an attribute" algorithm. The attribute's
|
|
name is the value of <var title="">attribute name</var>, its
|
|
value is the empty string.</dd>
|
|
|
|
<dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
|
|
Z)</dt>
|
|
|
|
<dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute
|
|
name</var> (where <var title="">b</var> is the value of the
|
|
byte at <var title="">position</var>).</dd>
|
|
|
|
<dt>Anything else</dt>
|
|
|
|
<dd>Append the Unicode character with the same code point as the
|
|
value of the byte at <var title="">position</var> to <var title="">attribute name</var>. (It doesn't actually matter how
|
|
bytes outside the ASCII range are handled here, since only
|
|
ASCII characters can contribute to the detection of a character
|
|
encoding.)</dd>
|
|
|
|
</dl></li>
|
|
|
|
<li><p>Advance <var title="">position</var> to the next byte and
|
|
return to the previous step.</p></li>
|
|
|
|
<li><p><i>Spaces</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
|
|
LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
|
|
advance <var title="">position</var> to the next byte, then,
|
|
repeat this step.</p></li>
|
|
|
|
<li><p>If the byte at <var title="">position</var> is
|
|
<em>not</em> 0x3D (ASCII =), abort the "get an attribute"
|
|
algorithm. The attribute's name is the value of <var title="">attribute name</var>, its value is the empty
|
|
string.</p></li>
|
|
|
|
<li><p>Advance <var title="">position</var> past the 0x3D (ASCII
|
|
=) byte.</p></li>
|
|
|
|
<li><p><i>Value</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
|
|
LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
|
|
advance <var title="">position</var> to the next byte, then,
|
|
repeat this step.</p></li>
|
|
|
|
<li><p>Process the byte at <var title="">position</var> as
|
|
follows:</p>
|
|
|
|
<dl class="switch"><dt>If it is 0x22 (ASCII ") or 0x27 (ASCII ')</dt>
|
|
|
|
<dd>
|
|
|
|
<ol><li>Let <var title="">b</var> be the value of the byte at
|
|
<var title="">position</var>.</li>
|
|
|
|
<li>Advance <var title="">position</var> to the next
|
|
byte.</li>
|
|
|
|
<li>If the value of the byte at <var title="">position</var>
|
|
is the value of <var title="">b</var>, then advance <var title="">position</var> to the next byte and abort the "get
|
|
an attribute" algorithm. The attribute's name is the value of
|
|
<var title="">attribute name</var>, and its value is the
|
|
value of <var title="">attribute value</var>.</li>
|
|
|
|
<li>Otherwise, if the value of the byte at <var title="">position</var> is in the range 0x41 (ASCII A) to
|
|
0x5A (ASCII Z), then append a Unicode character to <var title="">attribute value</var> whose code point is 0x20 more
|
|
than the value of the byte at <var title="">position</var>.</li>
|
|
|
|
<li>Otherwise, append a Unicode character to <var title="">attribute value</var> whose code point is the same as
|
|
the value of the byte at <var title="">position</var>.</li>
|
|
|
|
<li>Return to the second step in these substeps.</li>
|
|
|
|
</ol></dd>
|
|
|
|
<dt>If it is 0x3E (ASCII >)</dt>
|
|
|
|
<dd>Abort the "get an attribute" algorithm. The attribute's
|
|
name is the value of <var title="">attribute name</var>, its
|
|
value is the empty string.</dd>
|
|
|
|
|
|
<dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
|
|
Z)</dt>
|
|
|
|
<dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute
|
|
value</var> (where <var title="">b</var> is the value of the
|
|
byte at <var title="">position</var>). Advance <var title="">position</var> to the next byte.</dd>
|
|
|
|
<dt>Anything else</dt>
|
|
|
|
<dd>Append the Unicode character with the same code point as the
|
|
value of the byte at <var title="">position</var> to <var title="">attribute value</var>. Advance <var title="">position</var> to the next byte.</dd>
|
|
|
|
</dl></li>
|
|
|
|
<li><p>Process the byte at <var title="">position</var> as
|
|
follows:</p>
|
|
|
|
<dl class="switch"><dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
|
|
FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E (ASCII
|
|
>)</dt>
|
|
|
|
<dd>Abort the "get an attribute" algorithm. The attribute's
|
|
name is the value of <var title="">attribute name</var> and its
|
|
value is the value of <var title="">attribute value</var>.</dd>
|
|
|
|
<dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
|
|
Z)</dt>
|
|
|
|
<dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute
|
|
value</var> (where <var title="">b</var> is the value of the
|
|
byte at <var title="">position</var>).</dd>
|
|
|
|
<dt>Anything else</dt>
|
|
|
|
<dd>Append the Unicode character with the same code point as the
|
|
value of the byte at <var title="">position</var> to <var title="">attribute value</var>.</dd>
|
|
|
|
</dl></li>
|
|
|
|
<li><p>Advance <var title="">position</var> to the next byte and
|
|
return to the previous step.</p></li>
|
|
|
|
</ol><p>For the sake of interoperability, user agents should not use a
|
|
pre-scan algorithm that returns different results than the one
|
|
described above. (But, if you do, please at least let us know, so
|
|
that we can improve this algorithm and benefit everyone...)</p>
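<p class="note">The "get an attribute" steps above can be sketched in Python as follows. This is a non-normative illustration only: the function name <code title="">get_attribute</code>, its return shape, and the choice to return <code title="">None</code> when the bytes run out (where the algorithm would instead skip to the next step of the overall detection algorithm) are assumptions made for this sketch.</p>

```python
WHITESPACE = {0x09, 0x0A, 0x0C, 0x0D, 0x20}  # TAB, LF, FF, CR, space


def _lower(b: int) -> str:
    """Map one byte to a character, adding 0x20 to ASCII A-Z."""
    return chr(b + 0x20) if 0x41 <= b <= 0x5A else chr(b)


def get_attribute(data: bytes, pos: int):
    """Return (name, value, new_pos), or None if no attribute is found.

    Hypothetical helper: None is also returned when the input runs out.
    """
    n = len(data)
    # Skip leading whitespace and slashes.
    while pos < n and (data[pos] in WHITESPACE or data[pos] == 0x2F):
        pos += 1
    # A '>' here means there is no attribute.
    if pos >= n or data[pos] == 0x3E:
        return None
    name, value = "", ""
    # "Attribute name" step.
    while True:
        if pos >= n:
            return None
        b = data[pos]
        if b == 0x3D and name:
            pos += 1  # move past '=' and continue at "Value"
            break
        if b in WHITESPACE:
            # "Spaces" step: skip whitespace, then require '='.
            while pos < n and data[pos] in WHITESPACE:
                pos += 1
            if pos >= n:
                return None
            if data[pos] != 0x3D:
                return name, "", pos
            pos += 1
            break
        if b in (0x2F, 0x3E):
            return name, "", pos
        name += _lower(b)
        pos += 1
    # "Value" step: skip whitespace before the value.
    while pos < n and data[pos] in WHITESPACE:
        pos += 1
    if pos >= n:
        return None
    b = data[pos]
    if b in (0x22, 0x27):
        # Quoted value: read up to the matching quote byte.
        quote = b
        pos += 1
        while pos < n:
            if data[pos] == quote:
                return name, value, pos + 1
            value += _lower(data[pos])
            pos += 1
        return None
    if b == 0x3E:
        return name, "", pos
    value += _lower(b)
    pos += 1
    # Unquoted value: read up to whitespace or '>'.
    while pos < n:
        b = data[pos]
        if b in WHITESPACE or b == 0x3E:
            return name, value, pos
        value += _lower(b)
        pos += 1
    return None
```

<p class="note">A caller would invoke this helper repeatedly, passing the returned position back in, until it returns <code title="">None</code>. Names and values come back lowercased, which is why the prescan can compare them directly against "<code title="">http-equiv</code>", "<code title="">content</code>" and "<code title="">charset</code>".</p>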
|
|
|
|
</li>
|
|
|
|
<li><p>If the user agent has information on the likely encoding for
|
|
this page, e.g. based on the encoding of the page when it was last
|
|
visited, then return that encoding, with the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a>
|
|
<i>tentative</i>, and abort these steps.</p></li>
|
|
|
|
<li>
|
|
|
|
<p>The user agent may attempt to autodetect the character encoding
|
|
by applying frequency analysis or other algorithms to the data
|
|
stream. Such algorithms may use information about the resource
|
|
other than the resource's contents, including the address of the
|
|
resource. If autodetection succeeds in determining a character
|
|
encoding, then return that encoding, with the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a>
|
|
<i>tentative</i>, and abort these steps. <a href="references.html#refsUNIVCHARDET">[UNIVCHARDET]</a></p>
|
|
|
|
<p class="note">The UTF-8 encoding has a highly detectable bit
|
|
pattern. Documents that contain bytes with values greater than
|
|
0x7F which match the UTF-8 pattern are very likely to be UTF-8,
|
|
while documents with byte sequences that do not match it are very
|
|
likely not. User agents are therefore encouraged to search for
|
|
this common encoding. <a href="references.html#refsPPUTF8">[PPUTF8]</a> <a href="references.html#refsUTF8DET">[UTF8DET]</a></p>
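<p class="note">One simple way to test for the UTF-8 bit pattern is to attempt a strict decode, as in the following non-normative Python sketch. The function name <code title="">looks_like_utf8</code> is hypothetical, and the sketch assumes the buffer does not end in the middle of a multi-byte sequence; a real implementation would need to tolerate a truncated final sequence.</p>

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if data has non-ASCII bytes and all of it is valid UTF-8."""
    # Pure ASCII input carries no evidence for or against UTF-8.
    if not any(b > 0x7F for b in data):
        return False
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

<p class="note">A positive result here would only justify returning UTF-8 with the confidence <i>tentative</i>, as for any other autodetection result.</p>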
|
|
|
|
</li>
|
|
|
|
<li>
|
|
|
|
<p>Otherwise, return an implementation-defined or user-specified
|
|
default character encoding, with the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a>
|
|
<i>tentative</i>.</p>
|
|
|
|
<p>In controlled environments or in environments where the
|
|
encoding of documents can be prescribed (for example, for user
|
|
agents intended for dedicated use in new networks), the
|
|
comprehensive <code title="">UTF-8</code> encoding is
|
|
suggested.</p>
|
|
|
|
<p>In other environments, the default encoding is typically
|
|
dependent on the user's locale (an approximation of the languages,
|
|
and thus often encodings, of the pages that the user is likely to
|
|
frequent). The following table gives suggested defaults based on
|
|
the user's locale, for compatibility with legacy content. Locales
|
|
are identified by BCP 47 language tags. <a href="references.html#refsBCP47">[BCP47]</a></p>
|
|
|
|
|
|
<table><thead><tr><th>Locale language
|
|
</th><th>Suggested default encoding
|
|
</th></tr></thead><tbody><tr><td>ar
|
|
</td><td>UTF-8
|
|
|
|
</td></tr><tr><td>be
|
|
</td><td>ISO-8859-5
|
|
|
|
</td></tr><tr><td>bg
|
|
</td><td>windows-1251
|
|
|
|
</td></tr><tr><td>cs<!-- -CZ -->
|
|
</td><td>ISO-8859-2
|
|
|
|
</td></tr><tr><td>cy
|
|
</td><td>UTF-8
|
|
|
|
</td></tr><tr><td>fa<!-- -IR -->
|
|
</td><td>UTF-8
|
|
|
|
</td></tr><tr><td>he<!-- -IL -->
|
|
</td><td>windows-1255
|
|
|
|
</td></tr><tr><td>hr
|
|
</td><td>UTF-8
|
|
|
|
</td></tr><tr><td>hu<!-- -HU -->
|
|
</td><td>ISO-8859-2
|
|
|
|
</td></tr><tr><td>ja
|
|
</td><td>Windows-31J
|
|
|
|
</td></tr><tr><td>kk
|
|
</td><td>UTF-8
|
|
|
|
</td></tr><tr><td>ko<!-- -KR -->
|
|
</td><td>windows-949 <!-- EUC-KR -->
|
|
|
|
</td></tr><tr><td>ku
|
|
</td><td>windows-1254 <!-- ISO-8859-9 -->
|
|
|
|
</td></tr><tr><td>lt
|
|
</td><td>windows-1257
|
|
|
|
</td></tr><tr><td>lv<!-- -LV -->
|
|
</td><td>ISO-8859-13
|
|
|
|
</td></tr><tr><td>mk<!-- -MK -->
|
|
</td><td>UTF-8
|
|
|
|
</td></tr><tr><td>or
|
|
</td><td>UTF-8
|
|
|
|
</td></tr><tr><td>pl<!-- -PL -->
|
|
</td><td>ISO-8859-2
|
|
|
|
</td></tr><tr><td>ro
|
|
</td><td>UTF-8
|
|
|
|
</td></tr><tr><td>ru
|
|
</td><td>windows-1251
|
|
|
|
</td></tr><tr><td>sk
|
|
</td><td>windows-1250
|
|
|
|
</td></tr><tr><td>sl
|
|
</td><td>ISO-8859-2
|
|
|
|
</td></tr><tr><td>sr
|
|
</td><td>UTF-8
|
|
|
|
</td></tr><tr><td>th
|
|
</td><td>windows-874 <!-- TIS-620 -->
|
|
|
|
</td></tr><tr><td>tr<!-- -TR -->
|
|
</td><td>windows-1254 <!-- ISO-8859-9 -->
|
|
|
|
</td></tr><tr><td>uk
|
|
</td><td>windows-1251
|
|
|
|
</td></tr><tr><td>vi
|
|
</td><td>UTF-8
|
|
|
|
</td></tr><tr><td>zh-CN
|
|
</td><td>GB18030
|
|
|
|
</td></tr><tr><td>zh-TW
|
|
</td><td>Big5
|
|
|
|
</td></tr><tr><td>All other locales
|
|
</td><td>windows-1252
|
|
|
|
</td></tr></tbody></table></li>
|
|
|
|
</ol><p>The <a href="dom.html#document-s-character-encoding">document's character encoding</a> must immediately
|
|
be set to the value returned from this algorithm, at the same time
|
|
as the user agent uses the returned value to select the decoder to
|
|
use for the input stream.</p>
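<p class="note">The Byte Order Mark check in the steps above can be sketched in Python as follows. This is a non-normative illustration: the names <code title="">BOM_TABLE</code> and <code title="">sniff_bom</code>, and the labels "UTF-16BE" and "UTF-16LE" used for big-endian and little-endian UTF-16, are assumptions of this sketch.</p>

```python
# Rows are checked in order, mirroring the table in the sniffing steps.
BOM_TABLE = [
    (b"\xFE\xFF", "UTF-16BE"),   # Big-endian UTF-16
    (b"\xFF\xFE", "UTF-16LE"),   # Little-endian UTF-16
    (b"\xEF\xBB\xBF", "UTF-8"),
]


def sniff_bom(data: bytes):
    """Return (encoding, "certain") if data starts with a BOM, else None."""
    for bom, encoding in BOM_TABLE:
        # startswith() is False when fewer bytes are available than the BOM
        # needs, which covers the "as many or more bytes" condition.
        if data.startswith(bom):
            return encoding, "certain"
    return None
```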
|
|
|
|
<p class="note">This algorithm is a <a href="introduction.html#willful-violation">willful violation</a>
|
|
of the HTTP specification, which requires that the encoding be
|
|
assumed to be ISO-8859-1 in the absence of a <a href="semantics.html#character-encoding-declaration">character
|
|
encoding declaration</a> to the contrary, and of RFC 2046,
|
|
which requires that the encoding be assumed to be US-ASCII in the
|
|
absence of a <a href="semantics.html#character-encoding-declaration">character encoding declaration</a> to the
|
|
contrary. This specification's third approach is motivated by a
|
|
desire to be maximally compatible with legacy content. <a href="references.html#refsHTTP">[HTTP]</a> <a href="references.html#refsRFC2046">[RFC2046]</a></p>
|
|
|
|
|
|
<h5 id="character-encodings-0"><span class="secno">8.2.2.2 </span>Character encodings</h5>
|
|
|
|
<p>User agents must at a minimum support the UTF-8 and Windows-1252
|
|
encodings, but may support more. <a href="references.html#refsRFC3629">[RFC3629]</a> <a href="references.html#refsWIN1252">[WIN1252]</a></p>
|
|
|
|
<p class="note">It is not unusual for Web browsers to support dozens
|
|
if not upwards of a hundred distinct character encodings.</p>
|
|
|
|
<p>User agents must support the <a href="infrastructure.html#preferred-mime-name">preferred MIME name</a> of
|
|
every character encoding they support, and should support all the
|
|
IANA-registered names and aliases of every character encoding they
|
|
support. <a href="references.html#refsIANACHARSET">[IANACHARSET]</a></p>
|
|
|
|
<p>When comparing a string specifying a character encoding with the
|
|
name or alias of a character encoding to determine if they are
|
|
equal, user agents must remove any leading or trailing <a href="common-microsyntaxes.html#space-character" title="space character">space characters</a> in both names, and
|
|
then perform the comparison in an <a href="infrastructure.html#ascii-case-insensitive">ASCII
|
|
case-insensitive</a> manner.</p>
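<p class="note">The comparison described above can be sketched in Python as follows. This is a non-normative illustration with hypothetical names. It assumes the space characters are U+0020, U+0009, U+000A, U+000C and U+000D, and it lowercases only ASCII A to Z rather than using <code title="">str.lower()</code>, which also maps some non-ASCII characters onto ASCII letters.</p>

```python
SPACE_CHARACTERS = " \t\n\f\r"


def _ascii_lower(s: str) -> str:
    """Lowercase only the ASCII letters A-Z, leaving everything else alone."""
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s)


def encoding_names_match(a: str, b: str) -> bool:
    """Compare two encoding names after trimming space characters."""
    a = a.strip(SPACE_CHARACTERS)
    b = b.strip(SPACE_CHARACTERS)
    return _ascii_lower(a) == _ascii_lower(b)
```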
|
|
|
|
<hr><p>When a user agent would otherwise use an encoding given in the
|
|
first column of the following table to either convert content to
|
|
Unicode characters or convert Unicode characters to bytes, it must
|
|
instead use the encoding given in the cell in the second column of
|
|
the same row. When a byte or sequence of bytes is treated
|
|
differently due to this encoding aliasing, it is said to have been
|
|
<dfn id="misinterpreted-for-compatibility">misinterpreted for compatibility</dfn>.</p>
|
|
|
|
<table id="table-encoding-overrides"><caption>Character encoding overrides</caption>
|
|
<thead><tr><th> Input encoding </th><th> Replacement encoding </th><th> References
|
|
</th></tr></thead><tbody><tr><td> EUC-KR </td><td> windows-949 </td><td>
|
|
<a href="references.html#refsEUCKR">[EUCKR]</a>
|
|
<a href="references.html#refsWIN949">[WIN949]</a>
|
|
</td></tr><tr><td> EUC-JP </td><td> CP51932 </td><td>
|
|
<a href="references.html#refsEUCJP">[EUCJP]</a>
|
|
<a href="references.html#refsCP51932">[CP51932]</a>
|
|
</td></tr><tr><td> GB2312 </td><td> GBK </td><td>
|
|
<a href="references.html#refsRFC1345">[RFC1345]</a>
|
|
<a href="references.html#refsGBK">[GBK]</a>
|
|
</td></tr><tr><td> GB_2312-80 </td><td> GBK </td><td>
|
|
<a href="references.html#refsRFC1345">[RFC1345]</a>
|
|
<a href="references.html#refsGBK">[GBK]</a>
|
|
</td></tr><tr><td> ISO-8859-1 </td><td> windows-1252 </td><td>
|
|
<a href="references.html#refsRFC1345">[RFC1345]</a>
|
|
<a href="references.html#refsWIN1252">[WIN1252]</a>
|
|
</td></tr><tr><td> ISO-8859-9 </td><td> windows-1254 </td><td>
|
|
<a href="references.html#refsRFC1345">[RFC1345]</a>
|
|
<a href="references.html#refsWIN1254">[WIN1254]</a>
|
|
</td></tr><tr><td> ISO-8859-11 </td><td> windows-874 </td><td>
|
|
<a href="references.html#refsISO885911">[ISO885911]</a>
|
|
<a href="references.html#refsWIN874">[WIN874]</a>
|
|
</td></tr><tr><td> KS_C_5601-1987 </td><td> windows-949 </td><td>
|
|
<a href="references.html#refsRFC1345">[RFC1345]</a>
|
|
<a href="references.html#refsWIN949">[WIN949]</a>
|
|
</td></tr><tr><td> Shift_JIS </td><td> Windows-31J </td><td>
|
|
<a href="references.html#refsSHIFTJIS">[SHIFTJIS]</a>
|
|
<a href="references.html#refsWIN31J">[WIN31J]</a>
|
|
</td></tr><tr><td> TIS-620 </td><td> windows-874 </td><td>
|
|
<a href="references.html#refsTIS620">[TIS620]</a>
|
|
<a href="references.html#refsWIN874">[WIN874]</a>
|
|
</td></tr><tr><td> US-ASCII </td><td> windows-1252 </td><td>
|
|
<a href="references.html#refsRFC1345">[RFC1345]</a>
|
|
<a href="references.html#refsWIN1252">[WIN1252]</a>
|
|
</td></tr></tbody></table><p class="note">The requirement to treat certain encodings as other
|
|
encodings according to the table above is a <a href="introduction.html#willful-violation">willful
|
|
violation</a> of the W3C Character Model specification, motivated
|
|
by a desire for compatibility with legacy content. <a href="references.html#refsCHARMOD">[CHARMOD]</a></p>
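<p class="note">The override table above amounts to a lookup performed before a decoder or encoder is selected. The following non-normative Python sketch shows one way to express it. The names <code title="">ENCODING_OVERRIDES</code> and <code title="">apply_override</code> are hypothetical, and the sketch assumes the name passed in is an ASCII string that has already been trimmed.</p>

```python
# Keys are the input encodings from the table, lowercased for lookup;
# values are the replacement encodings as spelled in the table.
ENCODING_OVERRIDES = {
    "euc-kr": "windows-949",
    "euc-jp": "CP51932",
    "gb2312": "GBK",
    "gb_2312-80": "GBK",
    "iso-8859-1": "windows-1252",
    "iso-8859-9": "windows-1254",
    "iso-8859-11": "windows-874",
    "ks_c_5601-1987": "windows-949",
    "shift_jis": "Windows-31J",
    "tis-620": "windows-874",
    "us-ascii": "windows-1252",
}


def apply_override(name: str) -> str:
    """Return the replacement encoding, or name unchanged if none applies."""
    return ENCODING_OVERRIDES.get(name.lower(), name)
```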
|
|
|
|
<p>When a user agent is to use the UTF-16 encoding but no BOM has
|
|
been found, user agents must default to UTF-16LE.</p>
|
|
|
|
<p class="note">The requirement to default UTF-16 to LE rather than
|
|
BE is a <a href="introduction.html#willful-violation">willful violation</a> of RFC 2781, motivated by a
|
|
desire for compatibility with legacy content. <a href="references.html#refsRFC2781">[RFC2781]</a></p>
|
|
|
|
<hr><p>User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU
|
|
encodings. <a href="references.html#refsCESU8">[CESU8]</a> <a href="references.html#refsUTF7">[UTF7]</a> <a href="references.html#refsBOCU1">[BOCU1]</a> <a href="references.html#refsSCSU">[SCSU]</a></p>
|
|
|
|
<p>Support for encodings based on EBCDIC is not recommended. These
encodings are rarely used for publicly-facing Web content.</p>
|
|
|
|
<p>Support for UTF-32 is not recommended. This encoding is rarely
|
|
used, and frequently implemented incorrectly.</p>
|
|
|
|
<p class="note">This specification does not make any attempt to
|
|
support EBCDIC-based encodings and UTF-32 in its algorithms; support
|
|
and use of these encodings can thus lead to unexpected behavior in
|
|
implementations of this specification.</p>
|
|
|
|
|
|
|
|
<h5 id="preprocessing-the-input-stream"><span class="secno">8.2.2.3 </span>Preprocessing the input stream</h5>
|
|
|
|
<p>Given an encoding, the bytes in the input stream must be
|
|
converted to Unicode characters for the tokenizer, as described by
|
|
the rules for that encoding, except that the leading U+FEFF BYTE
|
|
ORDER MARK character, if any, must not be stripped by the encoding
|
|
layer (it is stripped by the rule below).</p>
|
|
<p>Bytes or sequences of bytes in the original byte stream that
|
|
could not be converted to Unicode code points must be converted to
|
|
U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is
|
|
UTF-8, the bytes must be <a href="infrastructure.html#decoded-as-utf-8-with-error-handling" title="decoded as UTF-8, with error handling">decoded with the error handling</a> defined in this
|
|
specification.</p>
|
|
|
|
<p class="note">Bytes or sequences of bytes in the original byte
|
|
stream that did not conform to the encoding specification
|
|
(e.g. invalid UTF-8 byte sequences in a UTF-8 input stream) are
|
|
errors that conformance checkers are expected to report.</p>
|
|
|
|
<p>Any byte or sequence of bytes in the original byte stream that is
|
|
<a href="#misinterpreted-for-compatibility">misinterpreted for compatibility</a> is a <a href="#parse-error">parse
|
|
error</a>.</p>
|
|
|
|
<p>One leading U+FEFF BYTE ORDER MARK character must be ignored if
|
|
any are present.</p>
|
|
|
|
<p class="note">The requirement to strip a U+FEFF BYTE ORDER MARK
|
|
character regardless of whether that character was used to determine
|
|
the byte order is a <a href="introduction.html#willful-violation">willful violation</a> of Unicode,
|
|
motivated by a desire to increase the resilience of user agents in
|
|
the face of naïve transcoders.</p>
|
|
|
|
<p>Any occurrences of any characters in the ranges U+0001 to U+0008,
|
|
U+000E to U+001F, U+007F
|
|
to U+009F, U+FDD0
|
|
to U+FDEF, and characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF,
|
|
U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE,
|
|
U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF,
|
|
U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE,
|
|
U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF,
|
|
U+10FFFE, and U+10FFFF are <a href="#parse-error" title="parse error">parse
|
|
errors</a>. These are all control characters or permanently
|
|
undefined Unicode characters (noncharacters).</p>
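<p class="note">The list of code points above can be tested compactly, as in the following non-normative Python sketch. The function name <code title="">is_parse_error_code_point</code> is hypothetical. The last check relies on the observation that every listed noncharacter from U+FFFE to U+10FFFF has its low 16 bits equal to FFFE or FFFF.</p>

```python
def is_parse_error_code_point(cp: int) -> bool:
    """True if cp is one of the code points listed as a parse error."""
    # Control character ranges.
    if 0x0001 <= cp <= 0x0008 or 0x000E <= cp <= 0x001F:
        return True
    if 0x007F <= cp <= 0x009F:
        return True
    # U+000B is listed on its own.
    if cp == 0x000B:
        return True
    # Noncharacters U+FDD0 to U+FDEF.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of each plane: U+FFFE, U+FFFF, ..., U+10FFFF.
    return (cp & 0xFFFE) == 0xFFFE and cp <= 0x10FFFF
```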
|
|
|
|
<p>U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
|
|
characters are treated specially. Any CR characters that are
|
|
followed by LF characters must be removed, and any CR characters not
|
|
followed by LF characters must be converted to LF characters. Thus,
|
|
newlines in HTML DOMs are represented by LF characters, and there
|
|
are never any CR characters in the input to the
|
|
<a href="tokenization.html#tokenization">tokenization</a> stage.</p>
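<p class="note">The newline handling above can be sketched in Python as two replacements applied in order. This is a non-normative illustration, the function name <code title="">normalize_newlines</code> is hypothetical, and it operates on a complete string rather than on a stream, where a CR at the end of a chunk would need to be held back until the next character is known.</p>

```python
def normalize_newlines(text: str) -> str:
    """Remove CRs that precede an LF, then turn remaining CRs into LFs."""
    # First pass: each CR LF pair collapses to a single LF.
    # Second pass: any CR still present was not followed by LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```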
|
|
|
|
<p>The <dfn id="next-input-character">next input character</dfn> is the first character in the
|
|
input stream that has not yet been <dfn id="consumed">consumed</dfn>. Initially,
|
|
the <i><a href="#next-input-character">next input character</a></i> is the first character in the
|
|
input. The <dfn id="current-input-character">current input character</dfn> is the last character
|
|
to have been <i><a href="#consumed">consumed</a></i>.</p>
|
|
|
|
<p>The <dfn id="insertion-point">insertion point</dfn> is the position (just before a
|
|
character or just before the end of the input stream) where content
|
|
inserted using <code title="dom-document-write"><a href="apis-in-html-documents.html#dom-document-write">document.write()</a></code> is actually
|
|
inserted. The insertion point is relative to the position of the
|
|
character immediately after it; it is not an absolute offset into
|
|
the input stream. Initially, the insertion point is
|
|
undefined.</p>
|
|
|
|
<p>The "EOF" character in the tables below is a conceptual character
|
|
representing the end of the <a href="#the-input-stream">input stream</a>. If the parser
|
|
is a <a href="apis-in-html-documents.html#script-created-parser">script-created parser</a>, then the end of the
|
|
<a href="#the-input-stream">input stream</a> is reached when an <dfn id="explicit-eof-character">explicit "EOF"
|
|
character</dfn> (inserted by the <code title="dom-document-close"><a href="apis-in-html-documents.html#dom-document-close">document.close()</a></code> method) is
|
|
consumed. Otherwise, the "EOF" character is not a real character in
|
|
the stream, but rather the lack of any further characters.</p>
|
|
|
|
|
|
<h5 id="changing-the-encoding-while-parsing"><span class="secno">8.2.2.4 </span>Changing the encoding while parsing</h5>
|
|
|
|
<p>When the parser requires the user agent to <dfn id="change-the-encoding">change the
|
|
encoding</dfn>, it must run the following steps. This might happen
|
|
if the <a href="#encoding-sniffing-algorithm">encoding sniffing algorithm</a> described above
|
|
failed to find an encoding, or if it found an encoding that was not
|
|
the actual encoding of the file.</p>
|
|
|
|
<ol><li>If the new encoding is identical or equivalent to the encoding
|
|
that is already being used to interpret the input stream, then set
|
|
the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> to
|
|
<i>certain</i> and abort these steps. This happens when the
|
|
encoding information found in the file matches what the
|
|
<a href="#encoding-sniffing-algorithm">encoding sniffing algorithm</a> determined to be the
|
|
encoding, and in the second pass through the parser if the first
|
|
pass found that the encoding sniffing algorithm described in the
|
|
earlier section failed to find the right encoding.</li>
|
|
|
|
<li>If the encoding that is already being used to interpret the
|
|
input stream is a UTF-16 encoding, then set the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> to
|
|
<i>certain</i> and abort these steps. The new encoding is ignored;
|
|
if it was anything but the same encoding, then it would be clearly
|
|
incorrect.</li>
|
|
|
|
<li>If the new encoding is a UTF-16 encoding, change it to
|
|
UTF-8.</li>
|
|
|
|
<li>If all the bytes up to the last byte converted by the current
|
|
decoder have the same Unicode interpretations in both the current
|
|
encoding and the new encoding, and if the user agent supports
|
|
changing the converter on the fly, then the user agent may change
|
|
to the new converter for the encoding on the fly. Set the
|
|
<a href="dom.html#document-s-character-encoding">document's character encoding</a> and the encoding used to
|
|
convert the input stream to the new encoding, set the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> to
|
|
<i>certain</i>, and abort these steps.</li>
|
|
|
|
<li>Otherwise, <a href="history.html#navigate">navigate</a> to the
|
|
document again, with <a href="history.html#replacement-enabled">replacement enabled</a>, and using
|
|
the same <a href="history.html#source-browsing-context">source browsing context</a>, but this time skip
|
|
the <a href="#encoding-sniffing-algorithm">encoding sniffing algorithm</a> and instead just set
|
|
the encoding to the new encoding and the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> to
|
|
<i>certain</i>. Whenever possible, this should be done without
|
|
actually contacting the network layer (the bytes should be
|
|
re-parsed from memory), even if, e.g., the document is marked as
|
|
not being cacheable. If this is not possible and contacting the
|
|
network layer would involve repeating a request that uses a method
|
|
other than HTTP GET (<a href="fetching-resources.html#concept-http-equivalent-get" title="concept-http-equivalent-get">or
|
|
equivalent</a> for non-HTTP URLs), then instead set the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> to
|
|
<i>certain</i> and ignore the new encoding. The resource will be
|
|
misinterpreted. User agents may notify the user of the situation,
|
|
to aid in application development.</li>
|
|
|
|
</ol></div><div class="impl">
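<p class="note">The closing steps of the algorithm above can be sketched as a small decision function. This is an illustrative sketch only: the function name, its boolean parameters, and the action labels it returns are assumptions made for this example and are not defined by this specification. It covers only the steps from the UTF-16 check onwards.</p>

```python
def change_encoding(current, new, *, same_prefix, can_swap, can_reparse, is_get):
    """Return (encoding, confidence, action) for the final steps.

    All parameter names are hypothetical:
      same_prefix: bytes decoded so far mean the same in both encodings
      can_swap:    the user agent can change converter on the fly
      can_reparse: the bytes can be re-parsed from memory
      is_get:      the original request used HTTP GET (or equivalent)
    """
    def is_utf16(name):
        return name.lower().startswith("utf-16")

    # Already decoding as UTF-16: the new encoding is ignored.
    if is_utf16(current):
        return current, "certain", "ignore-new"
    # A declared UTF-16 encoding is changed to UTF-8.
    if is_utf16(new):
        new = "UTF-8"
    # The converter may be swapped without re-parsing.
    if same_prefix and can_swap:
        return new, "certain", "swap-decoder"
    # Otherwise navigate again, unless that would repeat a non-GET request.
    if can_reparse or is_get:
        return new, "certain", "re-navigate"
    return current, "certain", "ignore-new"
```

<p class="note">In every branch the confidence ends up <i>certain</i>, so the steps cannot be triggered a second time for the same document.</p>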
|
|
|
|
<h4 id="parse-state"><span class="secno">8.2.3 </span>Parse state</h4>
|
|
|
|
<h5 id="the-insertion-mode"><span class="secno">8.2.3.1 </span>The insertion mode</h5>
|
|
|
|
<p>The <dfn id="insertion-mode">insertion mode</dfn> is a state variable that controls
|
|
the primary operation of the tree construction stage.</p>
|
|
|
|
<p>Initially, the <a href="#insertion-mode">insertion mode</a> is "<a href="tree-construction.html#the-initial-insertion-mode" title="insertion mode: initial">initial</a>". It can change to
|
|
"<a href="tree-construction.html#the-before-html-insertion-mode" title="insertion mode: before html">before html</a>",
|
|
"<a href="tree-construction.html#the-before-head-insertion-mode" title="insertion mode: before head">before head</a>",
|
|
"<a href="tree-construction.html#parsing-main-inhead" title="insertion mode: in head">in head</a>", "<a href="tree-construction.html#parsing-main-inheadnoscript" title="insertion mode: in head noscript">in head noscript</a>",
|
|
"<a href="tree-construction.html#the-after-head-insertion-mode" title="insertion mode: after head">after head</a>", "<a href="tree-construction.html#parsing-main-inbody" title="insertion mode: in body">in body</a>", "<a href="tree-construction.html#parsing-main-incdata" title="insertion mode: text">text</a>", "<a href="tree-construction.html#parsing-main-intable" title="insertion
|
|
mode: in table">in table</a>", "<a href="tree-construction.html#parsing-main-intabletext" title="insertion mode: in
|
|
table text">in table text</a>", "<a href="tree-construction.html#parsing-main-incaption" title="insertion mode: in
|
|
caption">in caption</a>", "<a href="tree-construction.html#parsing-main-incolgroup" title="insertion mode: in column
|
|
group">in column group</a>", "<a href="tree-construction.html#parsing-main-intbody" title="insertion mode: in
|
|
table body">in table body</a>", "<a href="tree-construction.html#parsing-main-intr" title="insertion mode: in
|
|
row">in row</a>", "<a href="tree-construction.html#parsing-main-intd" title="insertion mode: in cell">in
|
|
cell</a>", "<a href="tree-construction.html#parsing-main-inselect" title="insertion mode: in select">in
|
|
select</a>", "<a href="tree-construction.html#parsing-main-inselectintable" title="insertion mode: in select in table">in
|
|
select in table</a>", "<a href="tree-construction.html#parsing-main-afterbody" title="insertion mode: after
|
|
body">after body</a>", "<a href="tree-construction.html#parsing-main-inframeset" title="insertion mode: in
|
|
frameset">in frameset</a>", "<a href="tree-construction.html#parsing-main-afterframeset" title="insertion mode: after
|
|
frameset">after frameset</a>", "<a href="tree-construction.html#the-after-after-body-insertion-mode" title="insertion mode:
|
|
after after body">after after body</a>", and "<a href="tree-construction.html#the-after-after-frameset-insertion-mode" title="insertion mode: after after frameset">after after
|
|
frameset</a>" during the course of parsing, as described in
|
|
the <a href="tree-construction.html#tree-construction">tree construction</a> stage. The insertion mode affects
|
|
how tokens are processed and whether CDATA sections are
|
|
supported.</p>
|
|
|
|
<p>Several of these modes, namely "<a href="tree-construction.html#parsing-main-inhead" title="insertion mode: in
|
|
head">in head</a>", "<a href="tree-construction.html#parsing-main-inbody" title="insertion mode: in body">in
|
|
body</a>", "<a href="tree-construction.html#parsing-main-intable" title="insertion mode: in table">in
|
|
table</a>", and "<a href="tree-construction.html#parsing-main-inselect" title="insertion mode: in select">in
|
|
select</a>", are special, in that the other modes defer to them
|
|
at various times. When the algorithm below says that the user agent
|
|
is to do something "<dfn id="using-the-rules-for">using the rules for</dfn> the <var title="">m</var> insertion mode", where <var title="">m</var> is one
|
|
of these modes, the user agent must use the rules described under
|
|
the <var title="">m</var> <a href="#insertion-mode">insertion mode</a>'s section, but
|
|
must leave the <a href="#insertion-mode">insertion mode</a> unchanged unless the
|
|
rules in <var title="">m</var> themselves switch the <a href="#insertion-mode">insertion
|
|
mode</a> to a new value.</p>
|
|
|
|
<p>When the insertion mode is switched to "<a href="tree-construction.html#parsing-main-incdata" title="insertion
|
|
mode: text">text</a>" or "<a href="tree-construction.html#parsing-main-intabletext" title="insertion mode: in table
|
|
text">in table text</a>", the <dfn id="original-insertion-mode">original insertion mode</dfn>
|
|
is also set. This is the insertion mode to which the tree
|
|
construction stage will return.</p>
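<p class="note">A minimal sketch of how an implementation might model "using the rules for" another mode and the original insertion mode. The <code>Parser</code> class, its method names, and the toy rule handlers are assumptions made for this example; real rule handlers are far larger.</p>

```python
class Parser:
    def __init__(self):
        self.insertion_mode = "initial"
        self.original_insertion_mode = None
        self.log = []

    def switch_to_text(self):
        # Remember where to return to before entering "text".
        self.original_insertion_mode = self.insertion_mode
        self.insertion_mode = "text"

    def leave_text(self):
        # Return to the mode that was active before "text".
        self.insertion_mode = self.original_insertion_mode

    def using_the_rules_for(self, mode, token):
        # Run another mode's rules WITHOUT assigning insertion_mode here;
        # it changes only if those rules switch it themselves.
        RULES[mode](self, token)


def in_head_rules(parser, token):
    # Toy handler: only <title> switches mode in this sketch.
    parser.log.append(("in head", token))
    if token == "<title>":
        parser.switch_to_text()


def in_body_rules(parser, token):
    parser.log.append(("in body", token))


RULES = {"in head": in_head_rules, "in body": in_body_rules}

parser = Parser()
parser.insertion_mode = "in body"
parser.using_the_rules_for("in head", "<meta>")   # mode stays "in body"
parser.using_the_rules_for("in head", "<title>")  # rules switch to "text"
```

<p class="note">After the second call the insertion mode is "text" and the original insertion mode is "in body", which is where <code>leave_text()</code> returns.</p>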
|
|
|
|
<hr><p>When the steps below require the UA to <dfn id="reset-the-insertion-mode-appropriately">reset the insertion
|
|
mode appropriately</dfn>, it means the UA must follow these
|
|
steps:</p>
|
|
|
|
<ol><li>Let <var title="">last</var> be false.</li>
|
|
|
|
<li>Let <var title="">node</var> be the last node in the
|
|
<a href="#stack-of-open-elements">stack of open elements</a>.</li>
|
|
|
|
<li><i>Loop</i>: If <var title="">node</var> is the first node in
|
|
the stack of open elements, then set <var title="">last</var> to
|
|
true and set <var title="">node</var> to the <var title="concept-frag-parse-context"><a href="the-end.html#concept-frag-parse-context">context</a></var> element.
|
|
(<a href="the-end.html#fragment-case">fragment case</a>)</li>
|
|
|
|
<li>If <var title="">node</var> is a <code><a href="the-button-element.html#the-select-element">select</a></code> element,
|
|
then switch the <a href="#insertion-mode">insertion mode</a> to "<a href="tree-construction.html#parsing-main-inselect" title="insertion mode: in select">in select</a>" and abort these
|
|
steps. (<a href="the-end.html#fragment-case">fragment case</a>)</li>
|
|
|
|
<li>If <var title="">node</var> is a <code><a href="tabular-data.html#the-td-element">td</a></code> or
|
|
<code><a href="tabular-data.html#the-th-element">th</a></code> element and <var title="">last</var> is false, then
|
|
switch the <a href="#insertion-mode">insertion mode</a> to "<a href="tree-construction.html#parsing-main-intd" title="insertion
|
|
mode: in cell">in cell</a>" and abort these steps.</li>
|
|
|
|
<li>If <var title="">node</var> is a <code><a href="tabular-data.html#the-tr-element">tr</a></code> element, then
|
|
switch the <a href="#insertion-mode">insertion mode</a> to "<a href="tree-construction.html#parsing-main-intr" title="insertion
|
|
mode: in row">in row</a>" and abort these steps.</li>
|
|
|
|
<li>If <var title="">node</var> is a <code><a href="tabular-data.html#the-tbody-element">tbody</a></code>,
|
|
<code><a href="tabular-data.html#the-thead-element">thead</a></code>, or <code><a href="tabular-data.html#the-tfoot-element">tfoot</a></code> element, then switch the
|
|
<a href="#insertion-mode">insertion mode</a> to "<a href="tree-construction.html#parsing-main-intbody" title="insertion mode: in
|
|
table body">in table body</a>" and abort these steps.</li>
|
|
|
|
<li>If <var title="">node</var> is a <code><a href="tabular-data.html#the-caption-element">caption</a></code> element,
|
|
then switch the <a href="#insertion-mode">insertion mode</a> to "<a href="tree-construction.html#parsing-main-incaption" title="insertion mode: in caption">in caption</a>" and abort
|
|
these steps.</li>
|
|
|
|
<li>If <var title="">node</var> is a <code><a href="tabular-data.html#the-colgroup-element">colgroup</a></code> element,
|
|
then switch the <a href="#insertion-mode">insertion mode</a> to "<a href="tree-construction.html#parsing-main-incolgroup" title="insertion mode: in column group">in column group</a>" and
|
|
abort these steps. (<a href="the-end.html#fragment-case">fragment case</a>)</li>
|
|
|
|
<li>If <var title="">node</var> is a <code><a href="tabular-data.html#the-table-element">table</a></code> element,
|
|
then switch the <a href="#insertion-mode">insertion mode</a> to "<a href="tree-construction.html#parsing-main-intable" title="insertion mode: in table">in table</a>" and abort these
|
|
steps.</li>
|
|
|
|
<li>If <var title="">node</var> is a <code><a href="semantics.html#the-head-element">head</a></code> element,
|
|
then switch the <a href="#insertion-mode">insertion mode</a> to "<a href="tree-construction.html#parsing-main-inbody" title="insertion mode: in body">in body</a>" (<em>not</em> "<a href="tree-construction.html#parsing-main-inhead" title="insertion mode: in head">in head</a>") and abort
|
|
these steps. (<a href="the-end.html#fragment-case">fragment case</a>)</li>
|
|
<li>If <var title="">node</var> is a <code><a href="sections.html#the-body-element">body</a></code> element,
|
|
then switch the <a href="#insertion-mode">insertion mode</a> to "<a href="tree-construction.html#parsing-main-inbody" title="insertion mode: in body">in body</a>" and abort these
|
|
steps.</li>
|
|
|
|
<li>If <var title="">node</var> is a <code><a href="obsolete.html#frameset">frameset</a></code> element,
|
|
then switch the <a href="#insertion-mode">insertion mode</a> to "<a href="tree-construction.html#parsing-main-inframeset" title="insertion mode: in frameset">in frameset</a>" and abort
|
|
these steps. (<a href="the-end.html#fragment-case">fragment case</a>)</li>
|
|
|
|
<li>If <var title="">node</var> is an <code><a href="semantics.html#the-html-element">html</a></code> element,
|
|
then switch the <a href="#insertion-mode">insertion mode</a>
|
|
to "<a href="tree-construction.html#the-before-head-insertion-mode" title="insertion mode: before head">before
|
|
head</a>" and abort these steps. (<a href="the-end.html#fragment-case">fragment
|
|
case</a>)</li>
|
|
<li>If <var title="">last</var> is true, then switch the
|
|
<a href="#insertion-mode">insertion mode</a> to "<a href="tree-construction.html#parsing-main-inbody" title="insertion mode: in
|
|
body">in body</a>" and abort these steps. (<a href="the-end.html#fragment-case">fragment
|
|
case</a>)</li>
|
|
|
|
<li>Let <var title="">node</var> now be the node before <var title="">node</var> in the <a href="#stack-of-open-elements">stack of open
|
|
elements</a>.</li>
|
|
|
|
<li>Return to the step labeled <i>loop</i>.</li>
|
|
|
|
</ol><h5 id="the-stack-of-open-elements"><span class="secno">8.2.3.2 </span>The stack of open elements</h5>
|
|
|
|
<p>Initially, the <dfn id="stack-of-open-elements">stack of open elements</dfn> is empty. The
|
|
stack grows downwards; the topmost node on the stack is the first
|
|
one added to the stack, and the bottommost node of the stack is the
|
|
most recently added node in the stack (except when the
stack is manipulated in a random-access fashion as part of <a href="tree-construction.html#adoptionAgency">the handling for misnested tags</a>).</p>
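<p class="note">Because the stack grows downwards, the steps to reset the insertion mode appropriately (defined in the previous section) start at the most recently added node and walk towards the first. The following is a minimal sketch under simplifying assumptions: the stack is a non-empty Python list of tag-name strings with index 0 as the first node, and <code>context</code> is the tag name of the fragment-case context element or <code>None</code>. The function and table names are hypothetical.</p>

```python
MODE_FOR = {
    "select": "in select",
    "tr": "in row",
    "tbody": "in table body",
    "thead": "in table body",
    "tfoot": "in table body",
    "caption": "in caption",
    "colgroup": "in column group",
    "table": "in table",
    "head": "in body",       # "in body", not "in head"
    "body": "in body",
    "frameset": "in frameset",
    "html": "before head",
}


def reset_insertion_mode(stack, context=None):
    """Return the insertion mode chosen for the given stack."""
    last = False
    index = len(stack) - 1          # start at the current node
    while True:
        node = stack[index]
        if index == 0:
            last = True
            if context is not None:  # fragment case
                node = context
        if node in ("td", "th"):
            if not last:
                return "in cell"
        elif node in MODE_FOR:
            return MODE_FOR[node]
        if last:
            return "in body"
        index -= 1                   # move to the node before this one


print(reset_insertion_mode(["html", "body", "table", "tbody", "tr", "td"]))
```

<p class="note">A <code>td</code> context element in the fragment case falls through to "in body", because <var title="">last</var> is true by the time it is examined.</p>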
|
|
|
|
<p>The "<a href="tree-construction.html#the-before-html-insertion-mode" title="insertion mode: before html">before
|
|
html</a>" <a href="#insertion-mode">insertion mode</a> creates the
|
|
<code><a href="semantics.html#the-html-element">html</a></code> root element node, which is then added to the
|
|
stack.</p>
|
|
|
|
<p>In the <a href="the-end.html#fragment-case">fragment case</a>, the <a href="#stack-of-open-elements">stack of open
|
|
elements</a> is initialized to contain an <code><a href="semantics.html#the-html-element">html</a></code>
|
|
element that is created as part of <a href="the-end.html#html-fragment-parsing-algorithm" title="html fragment
|
|
parsing algorithm">that algorithm</a>. (The <a href="the-end.html#fragment-case">fragment
|
|
case</a> skips the "<a href="tree-construction.html#the-before-html-insertion-mode" title="insertion mode: before
|
|
html">before html</a>" <a href="#insertion-mode">insertion mode</a>.)</p>
|
|
|
|
<p>The <code><a href="semantics.html#the-html-element">html</a></code> node, however it is created, is the topmost
|
|
node of the stack. It only gets popped off the stack when the parser
|
|
<a href="the-end.html#stop-parsing" title="stop parsing">finishes</a>.</p>
|
|
|
|
<p>The <dfn id="current-node">current node</dfn> is the bottommost node in this
|
|
stack.</p>
|
|
|
|
<p>The <dfn id="current-table">current table</dfn> is the last <code><a href="tabular-data.html#the-table-element">table</a></code>
|
|
element in the <a href="#stack-of-open-elements">stack of open elements</a>, if there is
|
|
one. If there is no <code><a href="tabular-data.html#the-table-element">table</a></code> element in the <a href="#stack-of-open-elements">stack of
|
|
open elements</a> (<a href="the-end.html#fragment-case">fragment case</a>), then the
|
|
<a href="#current-table">current table</a> is the first element in the <a href="#stack-of-open-elements">stack
|
|
of open elements</a> (the <code><a href="semantics.html#the-html-element">html</a></code> element).</p>
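<p class="note">As a small illustration, assuming the stack is a Python list whose first item is the topmost node and whose last item is the bottommost, the current node and current table could be looked up as follows. The <code>Node</code> class and function names are hypothetical.</p>

```python
from dataclasses import dataclass


@dataclass(eq=False)  # nodes are compared by identity, not by tag
class Node:
    tag: str


def current_node(stack):
    # The bottommost node is the most recently added one.
    return stack[-1]


def current_table(stack):
    # The last table on the stack, or the first element (html) if none.
    for node in reversed(stack):
        if node.tag == "table":
            return node
    return stack[0]


html, body, outer, cell, inner, para = (
    Node("html"), Node("body"), Node("table"),
    Node("td"), Node("table"), Node("p"),
)
stack = [html, body, outer, cell, inner, para]
```

<p class="note">With nested tables, <code>current_table(stack)</code> yields the inner table; with no table on the stack it yields the <code>html</code> node.</p>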
|
|
|
|
<p>Elements in the stack fall into the following categories:</p>
|
|
|
|
<dl><dt><dfn id="special">Special</dfn></dt>
|
|
<dd><p>The following elements have varying levels of special
|
|
parsing rules: HTML's <code><a href="sections.html#the-address-element">address</a></code>, <code><a href="obsolete.html#the-applet-element">applet</a></code>,
|
|
<code><a href="the-map-element.html#the-area-element">area</a></code>, <code><a href="sections.html#the-article-element">article</a></code>, <code><a href="sections.html#the-aside-element">aside</a></code>,
|
|
<code><a href="semantics.html#the-base-element">base</a></code>, <code><a href="obsolete.html#basefont">basefont</a></code>, <code><a href="obsolete.html#bgsound">bgsound</a></code>,
|
|
<code><a href="grouping-content.html#the-blockquote-element">blockquote</a></code>, <code><a href="sections.html#the-body-element">body</a></code>, <code><a href="text-level-semantics.html#the-br-element">br</a></code>,
|
|
<code><a href="the-button-element.html#the-button-element">button</a></code>, <code><a href="tabular-data.html#the-caption-element">caption</a></code>, <code><a href="obsolete.html#center">center</a></code>,
|
|
<code><a href="tabular-data.html#the-col-element">col</a></code>, <code><a href="tabular-data.html#the-colgroup-element">colgroup</a></code>, <code><a href="interactive-elements.html#the-command-element">command</a></code>,
|
|
<code><a href="grouping-content.html#the-dd-element">dd</a></code>, <code><a href="interactive-elements.html#the-details-element">details</a></code>, <code><a href="obsolete.html#dir">dir</a></code>,
|
|
<code><a href="grouping-content.html#the-div-element">div</a></code>, <code><a href="grouping-content.html#the-dl-element">dl</a></code>, <code><a href="grouping-content.html#the-dt-element">dt</a></code>,
|
|
<code><a href="the-iframe-element.html#the-embed-element">embed</a></code>, <code><a href="forms.html#the-fieldset-element">fieldset</a></code>, <code><a href="grouping-content.html#the-figcaption-element">figcaption</a></code>,
|
|
<code><a href="grouping-content.html#the-figure-element">figure</a></code>, <code><a href="sections.html#the-footer-element">footer</a></code>, <code><a href="forms.html#the-form-element">form</a></code>,
|
|
<code><a href="obsolete.html#frame">frame</a></code>, <code><a href="obsolete.html#frameset">frameset</a></code>, <code><a href="sections.html#the-h1-h2-h3-h4-h5-and-h6-elements">h1</a></code>,
|
|
<code><a href="sections.html#the-h1-h2-h3-h4-h5-and-h6-elements">h2</a></code>, <code><a href="sections.html#the-h1-h2-h3-h4-h5-and-h6-elements">h3</a></code>, <code><a href="sections.html#the-h1-h2-h3-h4-h5-and-h6-elements">h4</a></code>, <code><a href="sections.html#the-h1-h2-h3-h4-h5-and-h6-elements">h5</a></code>,
|
|
<code><a href="sections.html#the-h1-h2-h3-h4-h5-and-h6-elements">h6</a></code>, <code><a href="semantics.html#the-head-element">head</a></code>, <code><a href="sections.html#the-header-element">header</a></code>,
|
|
<code><a href="sections.html#the-hgroup-element">hgroup</a></code>, <code><a href="grouping-content.html#the-hr-element">hr</a></code>, <code><a href="semantics.html#the-html-element">html</a></code>,
|
|
<code><a href="the-iframe-element.html#the-iframe-element">iframe</a></code>, <code><a href="embedded-content-1.html#the-img-element">img</a></code>, <code><a href="the-input-element.html#the-input-element">input</a></code>,
|
|
<code><a href="obsolete.html#isindex-0">isindex</a></code>, <code><a href="grouping-content.html#the-li-element">li</a></code>, <code><a href="semantics.html#the-link-element">link</a></code>,
|
|
<code><a href="obsolete.html#listing">listing</a></code>, <code><a href="obsolete.html#the-marquee-element">marquee</a></code>, <code><a href="interactive-elements.html#the-menu-element">menu</a></code>,
|
|
<code><a href="semantics.html#the-meta-element">meta</a></code>, <code><a href="sections.html#the-nav-element">nav</a></code>, <code><a href="obsolete.html#noembed">noembed</a></code>,
|
|
<code><a href="obsolete.html#noframes">noframes</a></code>, <code><a href="scripting-1.html#the-noscript-element">noscript</a></code>, <code><a href="the-iframe-element.html#the-object-element">object</a></code>,
|
|
<code><a href="grouping-content.html#the-ol-element">ol</a></code>, <code><a href="grouping-content.html#the-p-element">p</a></code>, <code><a href="the-iframe-element.html#the-param-element">param</a></code>,
|
|
<code><a href="obsolete.html#plaintext">plaintext</a></code>, <code><a href="grouping-content.html#the-pre-element">pre</a></code>, <code><a href="scripting-1.html#the-script-element">script</a></code>,
|
|
<code><a href="sections.html#the-section-element">section</a></code>, <code><a href="the-button-element.html#the-select-element">select</a></code>, <code><a href="semantics.html#the-style-element">style</a></code>,
|
|
<code><a href="interactive-elements.html#the-summary-element">summary</a></code>, <code><a href="tabular-data.html#the-table-element">table</a></code>, <code><a href="tabular-data.html#the-tbody-element">tbody</a></code>,
|
|
<code><a href="tabular-data.html#the-td-element">td</a></code>, <code><a href="the-button-element.html#the-textarea-element">textarea</a></code>, <code><a href="tabular-data.html#the-tfoot-element">tfoot</a></code>,
|
|
<code><a href="tabular-data.html#the-th-element">th</a></code>, <code><a href="tabular-data.html#the-thead-element">thead</a></code>, <code><a href="semantics.html#the-title-element">title</a></code>,
|
|
<code><a href="tabular-data.html#the-tr-element">tr</a></code>, <code><a href="grouping-content.html#the-ul-element">ul</a></code>, <code><a href="text-level-semantics.html#the-wbr-element">wbr</a></code>, and
|
|
<code><a href="obsolete.html#xmp">xmp</a></code>; MathML's <code title="">mi</code>, <code title="">mo</code>, <code title="">mn</code>, <code title="">ms</code>, <code title="">mtext</code>, and <code title="">annotation-xml</code>; and SVG's <code title="">foreignObject</code>, <code title="">desc</code>, and
|
|
<code title="">title</code>.</p></dd>
|
|
<dt><dfn id="formatting">Formatting</dfn></dt>
|
|
<dd><p>The following HTML elements are those that end up in the
|
|
<a href="#list-of-active-formatting-elements">list of active formatting elements</a>: <code><a href="text-level-semantics.html#the-a-element">a</a></code>,
|
|
<code><a href="text-level-semantics.html#the-b-element">b</a></code>, <code><a href="obsolete.html#big">big</a></code>, <code><a href="text-level-semantics.html#the-code-element">code</a></code>,
|
|
<code><a href="text-level-semantics.html#the-em-element">em</a></code>, <code><a href="obsolete.html#font">font</a></code>, <code><a href="text-level-semantics.html#the-i-element">i</a></code>,
|
|
<code><a href="obsolete.html#nobr">nobr</a></code>, <code><a href="text-level-semantics.html#the-s-element">s</a></code>, <code><a href="text-level-semantics.html#the-small-element">small</a></code>,
|
|
<code><a href="obsolete.html#strike">strike</a></code>, <code><a href="text-level-semantics.html#the-strong-element">strong</a></code>, <code><a href="obsolete.html#tt">tt</a></code>, and
|
|
<code><a href="text-level-semantics.html#the-u-element">u</a></code>.</p></dd>
|
|
|
|
<dt><dfn id="ordinary">Ordinary</dfn></dt>
|
|
<dd><p>All other elements found while parsing an HTML
|
|
document.</p></dd>
|
|
|
|
</dl><p>The <a href="#stack-of-open-elements">stack of open elements</a> is said to <dfn id="has-an-element-in-the-specific-scope" title="has an element in the specific scope">have an element in a
|
|
specific scope</dfn> consisting of a list of element types <var title="">list</var> when the following algorithm terminates in a
|
|
match state:</p>
|
|
|
|
<ol><li><p>Initialize <var title="">node</var> to be the <a href="#current-node">current
|
|
node</a> (the bottommost node of the stack).</p></li>
|
|
|
|
<li><p>If <var title="">node</var> is the target node, terminate in
|
|
a match state.</p></li>
|
|
|
|
<li><p>Otherwise, if <var title="">node</var> is one of the element
|
|
types in <var title="">list</var>, terminate in a failure
|
|
state.</p></li>
|
|
|
|
<li><p>Otherwise, set <var title="">node</var> to the previous
|
|
entry in the <a href="#stack-of-open-elements">stack of open elements</a> and return to step
|
|
2. (This will never fail, since the loop will always terminate in
|
|
the previous step if the top of the stack — an
|
|
<code><a href="semantics.html#the-html-element">html</a></code> element — is reached.)</p></li>
|
|
|
|
</ol><p>The <a href="#stack-of-open-elements">stack of open elements</a> is said to <dfn id="has-an-element-in-scope" title="has an element in scope">have an element in scope</dfn> when
|
|
it <a href="#has-an-element-in-the-specific-scope">has an element in the specific scope</a> consisting
|
|
of the following element types:</p>
|
|
|
|
<ul class="brief"><li><code><a href="obsolete.html#the-applet-element">applet</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
|
|
<li><code><a href="tabular-data.html#the-caption-element">caption</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
|
|
<li><code><a href="semantics.html#the-html-element">html</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
|
|
<li><code><a href="tabular-data.html#the-table-element">table</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
|
|
<li><code><a href="tabular-data.html#the-td-element">td</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
|
|
<li><code><a href="tabular-data.html#the-th-element">th</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
|
|
<li><code><a href="obsolete.html#the-marquee-element">marquee</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
|
|
<li><code><a href="the-iframe-element.html#the-object-element">object</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
|
|
<li><code title="">mi</code> in the <a href="namespaces.html#mathml-namespace">MathML namespace</a></li>
|
|
<li><code title="">mo</code> in the <a href="namespaces.html#mathml-namespace">MathML namespace</a></li>
|
|
<li><code title="">mn</code> in the <a href="namespaces.html#mathml-namespace">MathML namespace</a></li>
|
|
<li><code title="">ms</code> in the <a href="namespaces.html#mathml-namespace">MathML namespace</a></li>
|
|
<li><code title="">mtext</code> in the <a href="namespaces.html#mathml-namespace">MathML namespace</a></li>
|
|
<li><code title="">annotation-xml</code> in the <a href="namespaces.html#mathml-namespace">MathML namespace</a></li>
|
|
<li><code title="">foreignObject</code> in the <a href="namespaces.html#svg-namespace">SVG namespace</a></li>
|
|
<li><code title="">desc</code> in the <a href="namespaces.html#svg-namespace">SVG namespace</a></li>
|
|
<li><code title="">title</code> in the <a href="namespaces.html#svg-namespace">SVG namespace</a></li>
|
|
</ul><p>The <a href="#stack-of-open-elements">stack of open elements</a> is said to <dfn id="has-an-element-in-list-item-scope" title="has an element in list item scope">have an element in list
|
|
item scope</dfn> when it <a href="#has-an-element-in-the-specific-scope">has an element in the specific
|
|
scope</a> consisting of the following element types:</p>
|
|
|
|
<ul class="brief"><li>All the element types listed above for the <i><a href="#has-an-element-in-scope">has an element
|
|
in scope</a></i> algorithm.</li>
|
|
<li><code><a href="grouping-content.html#the-ol-element">ol</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
|
|
<li><code><a href="grouping-content.html#the-ul-element">ul</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
|
|
</ul><p>The <a href="#stack-of-open-elements">stack of open elements</a> is said to <dfn id="has-an-element-in-button-scope" title="has an element in button scope">have an element in button
|
|
scope</dfn> when it <a href="#has-an-element-in-the-specific-scope">has an element in the specific
|
|
scope</a> consisting of the following element types:</p>
|
|
|
|
<ul class="brief"><li>All the element types listed above for the <i><a href="#has-an-element-in-scope">has an element
|
|
in scope</a></i> algorithm.</li>
|
|
<li><code><a href="the-button-element.html#the-button-element">button</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
|
|
</ul><p>The <a href="#stack-of-open-elements">stack of open elements</a> is said to <dfn id="has-an-element-in-table-scope" title="has an element in table scope">have an element in table
|
|
scope</dfn> when it <a href="#has-an-element-in-the-specific-scope">has an element in the specific
|
|
scope</a> consisting of the following element types:</p>
|
|
|
|
<ul class="brief"><li><code><a href="semantics.html#the-html-element">html</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
|
|
<li><code><a href="tabular-data.html#the-table-element">table</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
|
|
</ul><p>The <a href="#stack-of-open-elements">stack of open elements</a> is said to <dfn id="has-an-element-in-select-scope" title="has an element in select scope">have an element in select
|
|
scope</dfn> when it <a href="#has-an-element-in-the-specific-scope">has an element in the specific
|
|
scope</a> consisting of all element types <em>except</em> the
|
|
following:</p>
|
|
|
|
<ul class="brief"><li><code><a href="the-button-element.html#the-optgroup-element">optgroup</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
|
|
<li><code><a href="the-button-element.html#the-option-element">option</a></code> in the <a href="namespaces.html#html-namespace-0">HTML namespace</a></li>
|
|
</ul><p>Nothing happens if at any time any of the elements in the
|
|
<a href="#stack-of-open-elements">stack of open elements</a> are moved to a new location in,
|
|
or removed from, the <code><a href="infrastructure.html#document">Document</a></code> tree. In particular, the
|
|
stack is not changed in this situation. This can cause, amongst
|
|
other strange effects, content to be appended to nodes that are no
|
|
longer in the DOM.</p>
|
|
|
|
<p class="note">In some cases (namely, when <a href="tree-construction.html#adoptionAgency">closing misnested formatting elements</a>),
|
|
the stack is manipulated in a random-access fashion.</p>
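<p class="note">The scope definitions in this section share one walk over the stack and differ only in which element types stop the walk. A minimal sketch follows, assuming each stack entry is a <code>(namespace, tag)</code> tuple and that the target is matched by that pair; the constant and function names are hypothetical.</p>

```python
HTML, MATHML, SVG = "html", "mathml", "svg"

DEFAULT_SCOPE = frozenset(
    [(HTML, t) for t in ("applet", "caption", "html", "table", "td", "th",
                         "marquee", "object")]
    + [(MATHML, t) for t in ("mi", "mo", "mn", "ms", "mtext", "annotation-xml")]
    + [(SVG, t) for t in ("foreignObject", "desc", "title")]
)
LIST_ITEM_SCOPE = DEFAULT_SCOPE | {(HTML, "ol"), (HTML, "ul")}
BUTTON_SCOPE = DEFAULT_SCOPE | {(HTML, "button")}
TABLE_SCOPE = frozenset({(HTML, "html"), (HTML, "table")})
SELECT_SCOPE_EXCEPTIONS = frozenset({(HTML, "optgroup"), (HTML, "option")})


def has_element_in_specific_scope(stack, target, is_boundary):
    # Walk from the current node (end of the list) towards the html element.
    for node in reversed(stack):
        if node == target:
            return True           # match state
        if is_boundary(node):
            return False          # failure state
    return False                  # only reached if the stack is empty


def in_scope(stack, target):
    return has_element_in_specific_scope(stack, target, DEFAULT_SCOPE.__contains__)


def in_list_item_scope(stack, target):
    return has_element_in_specific_scope(stack, target, LIST_ITEM_SCOPE.__contains__)


def in_button_scope(stack, target):
    return has_element_in_specific_scope(stack, target, BUTTON_SCOPE.__contains__)


def in_table_scope(stack, target):
    return has_element_in_specific_scope(stack, target, TABLE_SCOPE.__contains__)


def in_select_scope(stack, target):
    # Every type is a boundary except optgroup and option.
    return has_element_in_specific_scope(
        stack, target, lambda node: node not in SELECT_SCOPE_EXCEPTIONS)
```

<p class="note">Passing the boundary test as a predicate keeps the select case, which is defined by exclusion, in the same shape as the others.</p>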
|
|
|
|
|
|
<h5 id="the-list-of-active-formatting-elements"><span class="secno">8.2.3.3 </span>The list of active formatting elements</h5>
|
|
|
|
<p>Initially, the <dfn id="list-of-active-formatting-elements">list of active formatting elements</dfn> is
|
|
empty. It is used to handle misnested <a href="#formatting" title="formatting">formatting element tags</a>.</p>
|
|
|
|
<p>The list contains elements in the <a href="#formatting">formatting</a>
|
|
category, and scope markers. The scope markers are inserted when
|
|
entering <code><a href="obsolete.html#the-applet-element">applet</a></code> elements, buttons, <code><a href="the-iframe-element.html#the-object-element">object</a></code>
|
|
elements, marquees, table cells, and table captions, and are used to
|
|
prevent formatting from "leaking" <em>into</em> <code><a href="obsolete.html#the-applet-element">applet</a></code>
|
|
elements, buttons, <code><a href="the-iframe-element.html#the-object-element">object</a></code> elements, marquees, and
|
|
tables.</p>
|
|
|
|
<p class="note">The scope markers are unrelated to the concept of an
|
|
element being <a href="#has-an-element-in-scope" title="has an element in scope">in
|
|
scope</a>.</p>
|
|
|
|
<p>In addition, each element in the <a href="#list-of-active-formatting-elements">list of active formatting
|
|
elements</a> is associated with the token for which it was
|
|
created, so that further elements can be created for that token if
|
|
necessary.</p>
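<p class="note">One possible way to model the list is sketched here: each entry is either a scope marker or a formatting element paired with the token that created it. The <code>push_formatting_element</code> function anticipates the push steps defined immediately after this point, including the limit of three matching entries. All class, function, and field names are assumptions made for this example.</p>

```python
from dataclasses import dataclass

MARKER = object()  # sentinel standing in for a scope marker


@dataclass(frozen=True)
class Token:
    name: str
    namespace: str = "html"
    # Set of (name, namespace, value) triples, so order does not matter.
    attributes: frozenset = frozenset()


@dataclass(eq=False)
class Entry:
    element: object   # the element node
    token: Token      # the token the element was created for


def push_formatting_element(entries, element, token):
    # Only entries after the last marker are considered.
    start = 0
    for i in range(len(entries) - 1, -1, -1):
        if entries[i] is MARKER:
            start = i + 1
            break
    # Tokens compare equal when tag name, namespace and attributes match.
    matches = [i for i in range(start, len(entries))
               if entries[i].token == token]
    if len(matches) >= 3:
        del entries[matches[0]]   # drop the earliest matching entry
    entries.append(Entry(element, token))


entries = []
bold = Token("b")
for n in range(4):
    push_formatting_element(entries, f"b{n}", bold)
```

<p class="note">Comparing stored tokens rather than live elements reflects the requirement that attributes be compared as they were when the parser created the elements.</p>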
|
|
|
|
<p>When the steps below require the UA to <dfn id="push-onto-the-list-of-active-formatting-elements">push onto the list of
|
|
active formatting elements</dfn> an element <var title="">element</var>, the UA must perform the following steps:</p>
|
|
|
|
<ol><li><p>If there are already three elements in the <a href="#list-of-active-formatting-elements">list of
|
|
active formatting elements</a> after the last scope marker, if
any, or anywhere in the list if there are no scope markers, that
|
|
have the same tag name, namespace, and attributes as <var title="">element</var>, then remove the earliest such element from
|
|
the <a href="#list-of-active-formatting-elements">list of active formatting elements</a>. For these
|
|
purposes, the attributes must be compared as they were when the
|
|
elements were created by the parser; two elements have the same
|
|
attributes if all their parsed attributes can be paired such that
|
|
the two attributes in each pair have identical names, namespaces,
|
|
and values (the order of the attributes does not matter).</p>
|
|
|
|
<p class="note">This is the Noah's Ark clause, but with three per
family instead of two.</p></li>
|
|
<li><p>Add <var title="">element</var> to the <a href="#list-of-active-formatting-elements">list of active
|
|
formatting elements</a>.</p></li>
|
|
|
|
</ol><p>When the steps below require the UA to <dfn id="reconstruct-the-active-formatting-elements">reconstruct the
|
|
active formatting elements</dfn>, the UA must perform the following
|
|
steps:</p>
|
|
|
|
<ol><li>If there are no entries in the <a href="#list-of-active-formatting-elements">list of active formatting
elements</a>, then there is nothing to reconstruct; stop this
algorithm.</li>

<li>If the last (most recently added) entry in the <a href="#list-of-active-formatting-elements">list of
active formatting elements</a> is a marker, or if it is an
element that is in the <a href="#stack-of-open-elements">stack of open elements</a>, then
there is nothing to reconstruct; stop this algorithm.</li>

<li>Let <var title="">entry</var> be the last (most recently added)
element in the <a href="#list-of-active-formatting-elements">list of active formatting
elements</a>.</li>

<li>If there are no entries before <var title="">entry</var> in the
<a href="#list-of-active-formatting-elements">list of active formatting elements</a>, then jump to step
8.</li>

<li>Let <var title="">entry</var> be the entry one earlier than
<var title="">entry</var> in the <a href="#list-of-active-formatting-elements">list of active formatting
elements</a>.</li>

<li>If <var title="">entry</var> is neither a marker nor an element
that is also in the <a href="#stack-of-open-elements">stack of open elements</a>, go to step
4.</li>

<li>Let <var title="">entry</var> be the element one later than
<var title="">entry</var> in the <a href="#list-of-active-formatting-elements">list of active formatting
elements</a>.</li>

<li><a href="tree-construction.html#create-an-element-for-the-token">Create an element for the token</a> for which the
element <var title="">entry</var> was created, to obtain <var title="">new element</var>.</li>

<li>Append <var title="">new element</var> to the <a href="#current-node">current
node</a> and push it onto the <a href="#stack-of-open-elements">stack of open
elements</a> so that it is the new <a href="#current-node">current
node</a>.</li>

<li>Replace the entry for <var title="">entry</var> in the list
with an entry for <var title="">new element</var>.</li>

<li>If the entry for <var title="">new element</var> in the
<a href="#list-of-active-formatting-elements">list of active formatting elements</a> is not the last
entry in the list, return to step 7.</li>

</ol><p>This has the effect of reopening all the formatting elements that
were opened in the current body, cell, or caption (whichever is
youngest) that haven't been explicitly closed.</p>
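
<p>The reconstruction steps above can be sketched in Python as follows.
This is a minimal, non-normative illustration under stated assumptions: the
<code title="">Element</code> class, the <code title="">MARKER</code> sentinel
and the <code title="">reconstruct</code> function are hypothetical names, both
lists are modelled as plain Python lists, and "create an element for the token"
is reduced to copying the tag name.</p>

```python
from dataclasses import dataclass, field

MARKER = object()  # hypothetical sentinel standing in for a list marker


@dataclass(eq=False)  # eq=False: elements are compared by identity
class Element:
    """Hypothetical stand-in for an element node created by the parser."""
    tag: str
    children: list = field(default_factory=list)


def reconstruct(active, open_elements):
    """Reopen unclosed formatting elements; both lists are mutated in place."""

    def is_boundary(entry):
        # A marker, or an element that is still on the stack of open elements.
        return entry is MARKER or entry in open_elements

    # Steps 1 and 2: empty list, or last entry is a boundary: nothing to do.
    if not active or is_boundary(active[-1]):
        return

    # Steps 3 to 7: rewind from the last entry until a boundary is found
    # (then advance one entry) or the start of the list is reached.
    i = len(active) - 1
    while i > 0:
        i -= 1
        if is_boundary(active[i]):
            i += 1
            break

    # Steps 8 to 11: create, append, push and replace, up to the last entry.
    while True:
        new_element = Element(active[i].tag)            # step 8 (simplified)
        open_elements[-1].children.append(new_element)  # step 9: append to current node
        open_elements.append(new_element)               # step 9: new current node
        active[i] = new_element                         # step 10
        if i == len(active) - 1:                        # step 11
            return
        i += 1
```

<p>In this sketch the last item of <code title="">open_elements</code> plays the
role of the current node, so the stack is assumed to be non-empty when the
function is called.</p>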

<p class="note">The way this specification is written, the
<a href="#list-of-active-formatting-elements">list of active formatting elements</a> always consists of
elements in chronological order with the least recently added
element first and the most recently added element last (except
while steps 8 to 11 of the above algorithm are being executed, of
course).</p>
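
<p>Because the list is kept in chronological order, the "earliest such
element" in the push steps defined earlier in this section (the Noah's Ark
clause) is simply the first match after the last marker. A minimal,
non-normative Python sketch follows; <code title="">Element</code>,
<code title="">MARKER</code>, <code title="">same_family</code> and
<code title="">push_active</code> are hypothetical names, and storing the
parsed attributes in a dictionary keyed by (namespace, name) is an assumption
made for the illustration.</p>

```python
from dataclasses import dataclass, field

MARKER = object()  # hypothetical sentinel standing in for a list marker


@dataclass(eq=False)  # eq=False: elements are compared by identity
class Element:
    """Hypothetical stand-in for an element as created by the parser."""
    tag: str
    namespace: str = "http://www.w3.org/1999/xhtml"
    # Attributes as parsed, keyed by (namespace, name); dictionary equality
    # ignores insertion order, so attribute order does not matter.
    attrs: dict = field(default_factory=dict)


def same_family(a, b):
    """Same tag name, namespace, and parsed attributes."""
    return (a.tag, a.namespace, a.attrs) == (b.tag, b.namespace, b.attrs)


def push_active(active, element):
    """Push element onto the list, keeping at most three per family."""
    # Only entries after the last marker count; with no marker, the whole list.
    start = 0
    for idx in range(len(active) - 1, -1, -1):
        if active[idx] is MARKER:
            start = idx + 1
            break

    # Step 1: if three matching elements are already present, drop the earliest.
    matches = [i for i in range(start, len(active))
               if same_family(active[i], element)]
    if len(matches) >= 3:
        del active[matches[0]]

    # Step 2: add the element to the end of the list.
    active.append(element)
```

<p>Pushing a fourth identical element therefore leaves three entries, with the
oldest one gone; elements that differ in any attribute, or that sit before the
last marker, are not counted.</p>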

<p>When the steps below require the UA to <dfn id="clear-the-list-of-active-formatting-elements-up-to-the-last-marker">clear the list of
active formatting elements up to the last marker</dfn>, the UA must
perform the following steps:</p>

<ol><li>Let <var title="">entry</var> be the last (most recently added)
entry in the <a href="#list-of-active-formatting-elements">list of active formatting elements</a>.</li>

<li>Remove <var title="">entry</var> from the <a href="#list-of-active-formatting-elements">list of active
formatting elements</a>.</li>

<li>If <var title="">entry</var> was a marker, then stop the
algorithm at this point. The list has been cleared up to the last
marker.</li>

<li>Go to step 1.</li>

</ol><h5 id="the-element-pointers"><span class="secno">8.2.3.4 </span>The element pointers</h5>

<p>Initially, the <dfn id="head-element-pointer"><code title="">head</code> element
pointer</dfn> and the <dfn id="form-element-pointer"><code title="">form</code> element
pointer</dfn> are both null.</p>

<p>Once a <code><a href="semantics.html#the-head-element">head</a></code> element has been parsed (whether
implicitly or explicitly), the <a href="#head-element-pointer"><code title="">head</code>
element pointer</a> gets set to point to this node.</p>

<p>The <a href="#form-element-pointer"><code title="">form</code> element pointer</a>
points to the last <code><a href="forms.html#the-form-element">form</a></code> element that was opened and
whose end tag has not yet been seen. It is used to make form
controls associate with forms in the face of dramatically bad
markup, for historical reasons.</p>
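
<p>As a rough illustration, the two pointers can be modelled as optional
references that start out as null. The Python sketch below is non-normative;
the <code title="">ElementPointers</code> class and its method names are
hypothetical, and the exact points at which the tree construction rules update
the pointers are not modelled here.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Element:
    """Hypothetical stand-in for an element node."""
    tag: str


@dataclass
class ElementPointers:
    """Hypothetical holder for the two element pointers."""
    head: Optional[Element] = None  # initially null
    form: Optional[Element] = None  # initially null

    def head_parsed(self, element: Element) -> None:
        # A head element was parsed, implicitly or explicitly.
        self.head = element

    def form_opened(self, element: Element) -> None:
        # A form element was opened and its end tag has not been seen yet.
        self.form = element

    def form_end_tag_seen(self) -> None:
        # The end tag of the pointed-to form was seen; nothing is pointed to.
        self.form = None
```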

<h5 id="other-parsing-state-flags"><span class="secno">8.2.3.5 </span>Other parsing state flags</h5>

<p>The <dfn id="scripting-flag">scripting flag</dfn> is set to "enabled" if <a href="webappapis.html#concept-n-script" title="concept-n-script">scripting was enabled</a> for the
<code><a href="infrastructure.html#document">Document</a></code> with which the parser is associated when the
parser was created, and "disabled" otherwise.</p>

<p class="note">The <a href="#scripting-flag">scripting flag</a> can be enabled even
when the parser was originally created for the <a href="the-end.html#html-fragment-parsing-algorithm">HTML fragment
parsing algorithm</a>, even though <code><a href="scripting-1.html#the-script-element">script</a></code> elements
don't execute in that case.</p>

<p>The <dfn id="frameset-ok-flag">frameset-ok flag</dfn> is set to "ok" when the parser is
created. It is set to "not ok" after certain tokens are seen.</p>
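
<p>The initial values of the two flags can be illustrated with a small,
non-normative Python sketch. The <code title="">ParserFlags</code> class and
the <code title="">create_parser_flags</code> function are hypothetical names;
which tokens switch the frameset-ok flag to "not ok" is decided by the tree
construction rules and is not modelled here.</p>

```python
from dataclasses import dataclass


@dataclass
class ParserFlags:
    """Hypothetical holder for the two parsing state flags."""
    scripting: str            # "enabled" or "disabled", fixed at creation
    frameset_ok: str = "ok"   # starts as "ok"; may later become "not ok"


def create_parser_flags(scripting_enabled_for_document: bool) -> ParserFlags:
    # The scripting flag reflects the Document's state when the parser is
    # created; it is not re-evaluated afterwards in this sketch.
    scripting = "enabled" if scripting_enabled_for_document else "disabled"
    return ParserFlags(scripting=scripting)
```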

</div></body></html>