<!-- $Id: html-essay.html,v 1.2 1994/02/15 20:07:12 connolly Exp $ -->
<html>
<head>
<title>Toward a Formalism for Communication On the Web</title>
</head>
<body>

<ADDRESS>Daniel W. Connolly &lt;connolly@hal.com&gt; <P>
$Id: html-essay.html,v 1.2 1994/02/15 20:07:12 connolly Exp $
</ADDRESS>

<H2>Status</H2>

<P>I had hoped to polish this more before publishing it, but I can't seem
to get caught up... there's so much new stuff all the time!

<H1>Some Background on SGML for the World-Wide Web
</H1>

<p>In late 1992 and early 1993, I did quite a bit of work on the HTML DTD
while I was working at Convex in the online documentation group.

<p>When I began, there was the LineMode browser and the NeXT
implementation, and a few nodes in The Web describing HTML with some
oblique references to SGML. I was not intimately familiar with SGML, but
I was quite familiar with the problems of document interchange, and I
was eager to apply some of my formal systems background to the problem.

<H2>On Formally Unconvertible Document Formats
</H2>

<P>My experience with document interchange led me to classify document
formats using the essential distinction that some are "programmable" and
some are not. Most widely used source forms are programmable: TeX,
troff, PostScript, and the like. On the other hand, there are several "static"
formats: plain text, Microsoft RTF, FrameMaker MIF, and GNU's TeXinfo.

<P>The reason that this distinction is essential with respect to document
interchange is that extracting information from documents in
"programmable" document formats is equivalent to the halting problem.
That is, it is arbitrarily difficult and cannot be automated in a
general fashion.

<P>For example, I conjecture that it is impossible to write a program that
will extract the third word from a TeX document. It would be an easy
task for 80% of the TeX documents out there -- just skip over some
formatting stuff and grab the third bunch of characters surrounded by
whitespace. But that "formatting stuff" might be a program that
generates 100 words from the hyphenation dictionary. So the simple
lexical scan of the TeX source would find a word that is <em>not</em> the
third word of the document when printed.

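<P>A minimal sketch of that failure, in Python. The scanner, the two
sample sources, and the macro name \aside are illustrative assumptions,
not taken from any real tool: the scan drops comments and control
sequences and splits on whitespace, which works for the plain source and
picks the wrong word as soon as a macro definition sits ahead of the text.

```python
import re

def naive_third_word(tex_source):
    """Lexical scan only: strip comments, control sequences and braces, then split."""
    no_comments = re.sub(r"(?<!\\)%.*", "", tex_source)
    no_commands = re.sub(r"\\[A-Za-z]+|\\.", " ", no_comments)
    words = re.sub(r"[{}]", " ", no_commands).split()
    return words[2] if len(words) >= 3 else None

# A plain source: the scan and the printed page agree.
plain = r"\noindent The quick brown fox."

# A source that defines a (hypothetical) macro and never invokes it.
# The printed page still begins "The quick brown fox."
programmed = r"\def\aside{alpha beta gamma} The quick brown fox."

print(naive_third_word(plain))       # brown
print(naive_third_word(programmed))  # gamma
```

<P>The second source would print with "brown" as its third word, but the
scan cannot know that without actually running the macro processor.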
<P>This may seem like an obscure and unimportant problem, but I assure you
that the problem of converting TeX tables to FrameMaker MIF is just as
unsolvable.

<P>So while "programmable" document formats have the advantage that
features can be added on a per-document basis, they suffer the
disadvantage that these features cannot be recovered by the machine and
translated in an automated fashion.


<H2>Document Formats as Communications Media
</H2>

<P>If we look at document formats in light of the conventional
sender/message/medium/receiver communications model, we see that
document formats capture the message at various levels of
"concreteness".

<P>The message begins as a collection of concepts and ideas in the mind of
the sender. In order to communicate, the sender and receiver must share
some language. That is, they must both understand some common set of
symbols and the way those symbols combine to represent ideas. The
sender's job is to express the message in terms of the common symbols and
express them on the medium -- that is, "render" or "present" them. Then
the medium stimulates the receiver to reconstruct the symbols in his/her
brain -- that is, the receiver "interprets" or "recognizes" the symbols
from the medium. Those symbols interact with other symbols in the
receiver's brain, and the receiver "gets the message."

<P>The communications medium is often a layered combination of more and
less concrete media. For example, folks first render their ideas in the
symbology of the English language, and then render those symbols as
sequences of spoken phonemes or written characters. Those written
characters are in turn combinations of lines, curves, strokes, and
points. The receiving folks then assemble the strokes into characters,
the characters into words, the words into phrases, sentences, thoughts,
ideas, and so on.

<P>The most common and ubiquitous document format, plain ASCII text,
captures or digitizes messages at the level of written characters.
PostScript captures the characters as lines, curves, and paths. The GIF
format captures a document as an array of pixels. GIF is in many ways
infinitely more expressive than plain text, which is limited to
arrangements of the 96 ASCII characters.

<P>The RTF, TeX, nroff, etc. document formats provide very sophisticated
automated techniques for authors of documents to express their ideas. It
seems strange at first to see that plain text is still so widely used.
It would seem that PostScript is the ultimate document format, in that
its expressive capabilities include essentially anything that the human
eye is capable of perceiving, and yet it is device-independent.

<P>And yet if we take a look at the task of interpreting data back into
the ideas that they represent, we find that plain text is much to be
preferred, since reading plain text is so much easier to automate than
reading GIF files (optical character recognition) or PostScript
documents (halting problem). In the end, while the source of various
TeX or troff documents may correspond closely to the structure of the
ideas of the author, and while PostScript allows the author very precise
control and tremendous expressive capability, all these documents
ultimately capture an image of a document for presentation to the human
eye. They don't capture the original information as symbols that can be
processed by machine.

<P>To put it another way, rendering ideas in PostScript is not going to
help solve the problem of information overload -- it will only compound
the situation.

<P>As a real world example, suppose you had a 5000 page document in
PostScript, and you wanted to find a particular piece of information
inside it. The author may have organized the document very well, but
you'd have to print it to use those clues. If the characters aren't
kerned much, you might be able to use grep or sic a WAIS indexing
engine on it. Then, once you've found what looks like PostScript code
for some relevant information, you'd pray that the document adheres to
the Adobe Document Structuring Conventions so that you could pick out
the page containing the information you need and view that page.

<P>If that's too perverse, look at the problem of navigating a large
collection of technical papers coded in TeX. Many of the authors use
LaTeX, and you may be able to convince the indexing engine to filter
out common LaTeX formatting idioms -- or better yet, weight headings,
abstracts, etc. more heavily than other sections based on the
formatting idioms. While there are heuristic solutions to this problem
that will work in the typical 80%/20% fashion, the general solution is
once again equivalent to the halting problem; for example, individual
documents might have bits of TeX programming that change the
significance of words in a way that the indexing engine won't be able
to understand.


<H2>SGML as a Layered Communications Medium
</H2>

<P>So where does SGML fit into the sender/message/medium/receiver game?

<P>I'll use PostScript as a basis of comparison. The PostScript model
consists of a fairly powerful and general purpose two dimensional
imaging model, that is, a set of primitive symbols for specifying sets
of points in two dimensions using handy computational techniques, and a
general purpose programming model for building complex symbols out of
those primitives. That model is applied extensively to the problem of
typography, and there is an architecture (that is, a set of well known
symbols derived from the primitives) for using and building fonts.

<P>So to communicate a message consisting of symbols from human
communications in PostScript, one may choose from a well known set of
typefaces, or create a new typeface using the well known font
architecture, or free-hand draw some characters using PostScript
primitives, or draw lines, boxes, circles and such using PostScript
primitives, or scribble on a piece of paper, scan it, and convert the
bits to use the PostScript image operator. The space of symbols is
nearly limitless, as long as those symbols can be expressed ultimately
as pixels on a page.

<P>The distinctive feature of PostScript (an advantage at times, and a
disadvantage at others) is that whether you print it and deliver the
paper or you deliver the PostScript and the receiver prints it out, the
result is the same bunch of images.

<P>The SGML model, on the other hand, specifies no general purpose
programming model where complex symbols can be defined in terms of
primitive symbols. The meaning of a symbol is either found in the SGML
standard itself, or in some PUBLIC document (which may or may not be
machine readable), or in some SYSTEM specific manner, or defined by an
SGML application. The only real primitives are the character and the
"non-SGML data entity".

<P>The model prescribes that a document consist of a declaration, a
prologue, and an instance. The declaration is expressed in ASCII and
specifies the character sets and syntactic symbols used by the prologue
and instance. The prologue is expressed in a standard language using the
syntactic symbols from the declaration, and specifies a set of entities
and a grammar of element types available to the instance.

<P>The instance is a sequence of elements, character data, and entities
constrained by the grammar set forth in the prologue, and the SGML
standard does not specify any semantics or meaning for the instance.

<P>So to communicate using SGML, the sender first chooses a character set
and certain processing quantities and capacities. For example, "I'm
writing in ASCII, and I'll never use an element name more than 40
characters long" is some information that can be expressed in the SGML
declaration. [The standard allows the SGML declaration to be implicitly
agreed upon by sender and receiver, and this is generally the case.]

<P>The tricky part is the prologue, where the sender gives a grammar that
constrains the structure of the document. Along with the information
actually expressed in SGML in the prologue, there is usually some amount
of application defined semantics attached to the element types. For
example, the prologue may express in SGML that an H2 element must occur
within the content of an H1 element. But the convention that text in an
H1 is usually displayed larger and considered more important is
application defined.

<P>Once the prologue is determined (this usually involves considerable
discussion between a collection of authors and consumers in some
domain -- in the end, there may be some "parameter entities" in the
prologue which allow some variation on a per-document basis), the sender
is constrained to a rigorous structure for the organization of the
symbols and character data of the document. On the other hand, s/he has
an automated technique for verifying that s/he has not violated the
structure, and hence there is some confidence that the document can be
consumed and processed by machine.


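<P>The kind of automated check described above can be sketched in a few
lines of Python. This is a toy under stated assumptions, not an SGML
parser: the element names and content models are hypothetical (they
follow the H1/H2 example), and each content model is written as a regular
expression over the names of an element's children rather than in DTD
syntax.

```python
import re

# Hypothetical grammar: element name -> regular expression over child names.
# Each child name is followed by one space so that names cannot run together.
CONTENT_MODELS = {
    "DOC": r"(H1 )+",
    "H1": r"(P |H2 )*",
    "H2": r"(P )*",
    "P": r"",
}

def validate(element):
    """element is (name, [child elements]); return a list of violations."""
    name, children = element
    errors = []
    child_names = "".join(child[0] + " " for child in children)
    model = CONTENT_MODELS.get(name)
    if model is None:
        errors.append(f"undeclared element {name}")
    elif re.fullmatch(model, child_names) is None:
        errors.append(f"{name} may not contain: {child_names.strip()}")
    for child in children:
        errors.extend(validate(child))
    return errors

good = ("DOC", [("H1", [("P", []), ("H2", [("P", [])])])])
bad = ("DOC", [("H2", [("P", [])])])  # an H2 outside of any H1

print(validate(good))  # []
print(validate(bad))   # ['DOC may not contain: H2']
```

<P>Note that the check says nothing about what an H1 means or how it is
displayed; that part stays application defined.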
<H1>The HTML DTD: Conforming, though Expedient
</H1>

<H2>Design Constraints of the HTML DTD
</H2>

<P>Tim's original conception of HTML is that it should be about as
expressive as RTF. In contrast to traditional SGML applications where
documents might be batch processed and complex structure is the norm,
HTML documents are intended to be processed interactively. And the
widespread success of WYSIWYG word processors based on fairly flat
paragraph structure was proof that something like RTF was suitable for a
fairly wide variety of tasks.

<P>As I learned a little about SGML, it was clear that the WWW browser
implementation of HTML sorely lacked anything resembling an SGML entity
manager. And there were some syntactic inconsistencies with the SGML
standard. And it didn't use the ID/IDREF feature where it should have...

<P>Then, as I began to comprehend SGML with all its warts (whose idea was
it to attach the significance of a newline character to the phase of the
moon anyway?), I was less gung-ho about declaring all the HTML out there
to be blasphemy to the One True SGML Way.

<P>Thus I chose for my battle to find some formal relationship between the
SGML standard and the HTML that was "out there." The quest was:

<H3>Find some DTD such that the vast majority of HTML documents are
instances of that DTD and, conversely, such that all its instances make
sense to the existing WWW clients.
</H3>

<P>I struggled mightily with such issues as:

<UL>
<LI>Should we be sticking &lt;!DOCTYPE HTML SYSTEM&gt; in .html files? What
if somebody puts an entity declaration in there? (And does that mean
that WWW clients have to be able to parse SGML prologues in general?)

<LI>What's the syntax of an attribute value? If we allow SHORTTAG YES,
does that mean we have to parse <CODE>&lt;em/this/</CODE> style of
markup too?

<LI>Can we put some short reference maps in the DTD that will cause real
SGML parsers and current WWW browsers to do the same thing w.r.t.
newlines? (i.e., can we make all that phase-of-the-moon processing with
newlines a moot issue?)

<LI>What about marked sections? Short reference maps?

<LI>What character set should we be using? How do I express ISO-Latin-1
in the SGML declaration? How should authors express the '&lt;' character?
How should this be expressed in the DTD?

<LI>How do you put quotes in an attribute value literal?

<LI>How can I deal with the current paragraph element idioms without
using minimization?

<LI>Can I stick base64 encoded stuff in a CDATA element? Do I have to
watch out for &lt;'s and such?

<LI>How do we combine SGML and multimedia data in the same data stream?

</UL>


<P>I found solutions to some problems, and punted on others. I probably
should have put more comments in the DTD regarding the compromises. But
I wanted to keep the DTD stripped down to the normative information and
keep the informative information in other documents.

<P>I did, by the way, draft a series of 4 or 5 documents demonstrating
various structural and syntactic features of SGML -- a sort of
validation suite. I'm not sure where it went.

<P>I'd like to respond to Eliot Kimber's critique of the HTML DTD that I
posted.

<pre>
>At the bottom of this posting is a slightly modified copy of the
>HTML DTD that conforms to the HyTime standard. I have not modified
>the elements or content models in any way. I have not added any
>new elements. I have only added to the attribute lists of a few
>elements.
>
>The biggest change I made was to the way URL addresses are handled.
>In order to use HyTime (as opposed to application-specific)
>methods for doing addressing, I had to change the URL address
>from a direct reference into an entity reference where the
>entity's system identifier is its URL address.
</pre>

<P>I suggested this long ago, but Tim shot the idea down. As I recall, he
said that all that extra markup was a waste. On the one hand, I agree
with him -- the purpose of a language is to be able to express common
idioms succinctly, and SGML/HyTime are poor in that respect. On the
other hand, once you've chosen SGML, you might as well do as the Romans do.

<pre>
> This makes
>the link elements conform to the architectural forms and puts
>in enough indirection to allow other addressing methods to
>be used to locate the objects without having to modify the
>links, only the entity declarations.
</pre>

<P>Why is it easier to modify entity declarations than links? Six of one,
half-dozen of the other if you ask me.

<pre>
> I use SUBDOC entities
>for refering to other complete documents, although I'm not
>sure this the best thing, but there's no other construct in
>SGML that works as well. Note that nowwhere in 8879 does it
>define what must happen as the result of a SUBDOC reference,
>except that a new parsing context is established. The actual
>result of a SUBDOC reference is a matter of style and presumably
>in a WWW context it would result in the retrieval of the document
>and its presentation in a seperate window. The key is that
>the subdoc reference establishes a specific relationship between
>the source of the link and the target, namely one document
>refering to another. The target document could also be defined
>as a data entity with whatever notation is appropriate (possibly
>even SGML if it's another SGML document). This may be the better
>approach, I don't know.
</pre>

<P>I don't expect that the data entity/subdocument entity distinction
matters one hill of beans to contemporary WWW clients. I'm interested to
know if it means anything to HyTime engines.

<pre>
>If I were re-designing the HTML, I would add direct support
>for HyTime location ladders using at a minimum the nameloc,
>notloc, and dataloc addressing elements. However, if these
>elements are needed for interchange they could be generated
>from the information contained in WWW documents using the
>DTD below, so it's not critical.
>
</pre>

<P>Could you expand on that? If we'll be "generating" compliant SGML for
interchange, we might as well use TeXinfo or something practical like
that for application-specific purposes.

<pre>
>This is just one attempt at applying HyTime to the HTML.
>I'm sure there are other equally-valid (or more valid)
>ways it could be done. Given the current functionality
>of the WWW, I'm sure there are ways to express that functionality
>using HyTime constructs. HyTime constructs may also suggest
>useful ways to extend the WWW functionality, who knows.
</pre>

<P>I finally got to actually read the HyTime standard the other day, and
the clink and notloc forms looked most useful. I'm also interested in
expressing some of the "relative link" idioms used in HTML
(e.g., how would we express HREF="../foo/bar.html#zabc" using HyTime? The
object of the game is to do it in such a way that the markup can be
copied verbatim from one system to another (say, unix to VMS) and have
the right meaning).

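<P>For reference, the resolution that the relative link idiom relies on
can be sketched with Python's standard urllib.parse module. The base
address below is a hypothetical one; the sketch shows only how the
reference is resolved against the address of the containing document,
with no file system involved, and does not answer the HyTime question.

```python
from urllib.parse import urljoin, urldefrag

# Hypothetical address of the document that contains the link.
base = "http://host/a/b/doc.html"
href = "../foo/bar.html#zabc"

# Resolve the relative reference, then split off the fragment (the local anchor).
absolute = urljoin(base, href)
target, fragment = urldefrag(absolute)

print(target)    # http://host/a/foo/bar.html
print(fragment)  # zabc
```

<P>Because the slashes and ".." are interpreted by URL rules rather than
by the host's path syntax, the same markup resolves the same way on unix
and on VMS.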
<pre>
>&lt;!ENTITY % URL "CDATA"
> -- The term URL means a CDATA attribute
> whose value is a Universal Resource Locator,
> as defined in ftp://info.cern.ch/pub/www/doc/url3.txt
> -->
>&lt;!--=====================================================================
> WEK: I have defined URL addresses as a notation so that they can
> be then used in a notloc element.
> =====================================================================-->
>&lt;!NOTATION url PUBLIC "-//WWW//NOTATION URL/Universal Resource Locator
> /'ftp: info.cern.ch/pub/www/doc/url3.txt'
> //EN"
>>
</pre>

<P>Cool, good idea.

<pre>
>
>&lt;!ENTITY % linkattributes
> "NAME NMTOKEN #IMPLIED
> HREF ENTITY #IMPLIED
>
> --=== WEK =======================================================
>
> HREF is now an entity attribute rather than containing a
> URL address directly. To create a link using a URL address,
> declare a SUBDOC or data entity and make the system
> identifier the URL address of the object:
>
> &lt;!ENTITY mydoc SYSTEM "URL address of document " SUBDOC >
>
> This indirection gives to things:
>
> 1. A way to protect links in the source from changes in the
> location of a document since the physical address is only
> specified once.
</pre>

<P>Ah... now I get it... in case you have lots of links to mydoc or parts
of mydoc, you only have one place that defines where mydoc is. Nifty.

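<P>A small Python sketch of that indirection. The entity table, the
anchor names, and the URLs are made up for illustration: links name the
entity rather than the address, so moving the document means changing one
table entry and leaving every link alone.

```python
# Hypothetical entity table: entity name -> system identifier (a URL).
entities = {"mydoc": "http://old.example.org/docs/mydoc.html"}

# Links refer to the entity name plus an optional local anchor, never to the URL.
links = [("mydoc", None), ("mydoc", "section2"), ("mydoc", "refs")]

def resolve(link, table):
    """Look up the entity's current address and append the anchor, if any."""
    entity, anchor = link
    url = table[entity]
    return url if anchor is None else f"{url}#{anchor}"

# The document moves: only the single declaration changes.
entities["mydoc"] = "http://new.example.org/mydoc.html"

for link in links:
    print(resolve(link, entities))
```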
<pre>
>
> 2. An opportunity to use other addressing methods, including
> possibly replacing the URL with an ISO formal public
> identifier.
> =================================================================-->
>
> TYPE NAME #IMPLIED -- type of relashionship to referent data:
> PARENT CHILD, SIBLING, NEXT, TOP,
> DEFINITION, UPDATE, ORIGINAL etc. --
> URN CDATA #IMPLIED -- universal resource number. unique doc id --
> TITLE CDATA #IMPLIED -- advisory only --
> METHODS NAMES #IMPLIED -- supported methods of the object:
> TEXTSEARCH, GET, HEAD, ... --
> -- WEK: --
> LINKENDS NAMES #IMPLIED
> -- Linkends takes one or more NAME= values for local links--
> HyNames CDATA #FIXED 'TYPE ANCHROLE URN DOCORSUB'
> ">
</pre>

<P>I thought the ANCHROLEs of a clink were defined by HyTime to be
REFsomething and REFSUB. Or are those just defaults? Also... does the
HyNames thing work locally like this? What a HACK!

<pre>
>
>&lt;!--=== WEK ==========================
>
> The HyNames= attribute maps the local attribute names to their
> cooresponding HyTime forms.
>
> The Methods= attribute is bit of a puzzle since it is really
> a part of the hyperlink presentation/processing style, not
> a property of the anchors, but there's nothing wrong with
> having application-specific stuff in your HyTime application.
</pre>

The Methods= attribute has been stricken :-(. It was motivated by the
observation that textsearch interactions in WWW go like this:

<OL>
<LI>Doc A says "click here[23] to see the index"
<LI>user clicks
<LI>client fetches link 23, "http://host/index"
<LI>displays "cover page" document
<LI>user enters FIND abc
<LI>client fetches "http://host/index?abc"
<LI>search results are displayed
</OL>

Whereas in gopher, you get to save a step if you like:

<OL>
<LI>Doc A says "click here[23] to search the index"
<LI>user clicks
<LI>client displays "enter search words here: " dialog
<LI>user enters FIND abc
<LI>client fetches "http://host/index?abc"
<LI>search results are displayed
</OL>

So to specify the latter, you would create a link with Methods=textsearch.

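<P>The two interaction patterns can be sketched as the list of URLs a
client ends up fetching. This Python sketch is illustrative only: the
function names are made up, and the lowercase "textsearch" test stands in
for however a client would actually read the Methods= attribute.

```python
from urllib.parse import quote_plus

def search_url(index_url, words):
    """Append the search words to the index address as a query string."""
    return index_url + "?" + quote_plus(words)

def fetches_for(link_url, words, methods=()):
    """Return the URLs fetched when the user follows the link and searches."""
    if "textsearch" in methods:
        # The link says up front that it is searchable: prompt first, fetch once.
        return [search_url(link_url, words)]
    # Otherwise fetch the "cover page" first, then the search results.
    return [link_url, search_url(link_url, words)]

print(fetches_for("http://host/index", "abc"))
# ['http://host/index', 'http://host/index?abc']
print(fetches_for("http://host/index", "abc", methods=("textsearch",)))
# ['http://host/index?abc']
```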
<pre>
> I added LinkEnds= so that the various linking elements will
> completely conform to the clink and ilink forms. The presence
> of the LinkEnds= attribute does not imply required support
> for this type of linking, but it does make HTML more consistent
> with other DTDs that do use the LinkEnds= attribute form.
>
> Note that 10744 shows the attribute name for the ILINK form
> to be 'linkend', not 'linkends'. I consider this to be a
> typo, as there's no logical reason to disallow multiple anchors
> from a clink and lack of it puts an undue requirement of
> specifying otherwise unneeded nameloc elements. In any case,
> an application can transform linkends= to linkend= plus a
> nameloc, so it doesn't matter in practice.
</pre>

Are there <EM>any</EM> HyTime implementations out there? Do they use
'linkend' or 'linkends'? It's hard to believe that HyTime became a
standard without a proof-of-concept implementation.

<pre>
>
>&lt;!ELEMENT P - O EMPTY -- separates paragraphs -->
>&lt;!--=== WEK ==========================================================
>
> Design note: This seems like a clumsy way to structure information.
> One would expect paragraphs to be containing.
>
> ==================================================================-->
</pre>

Yeah, well, try implementing end tag inference in &lt;1000 or so lines of code.
Maybe we'll get it right next time...

<pre>
>&lt;!ELEMENT DL - - (DT | DD | P | %hypertext;)*>
>&lt;!-- Content should match ((DT,(%hypertext;)+)+,(DD,(%hypertext;)+))
> But mixed content is messy.
> -->
>&lt;!--=== WEK ============================================================
>
> Design note: This content should be:
>
> &lt;!ELEMENT DL - - (DT+, DD)+ >
> &lt;!ELEMENT (DT | DD) - O (%hypertext;)* >
>
> There's no reason for DT and DD to be empty. Perhaps there was
> some confusion about the problems with mixed content? There are
> none here.
>
> These comments apply to the other list elements as well.
>
> ====================================================================-->
</pre>

The problem is that DL, DT, DD, UL, OL, and LI were marked up in extant
HTML documents as if minimization were supported. But I didn't want to
introduce minimization into the implementation, so I made the DT, DD,
and LI elements empty.
<p>

It's possible I'm confused about mixed content, but the way I understand
it, you don't want to use mixed content except in repeatable OR groups,
because authors will stick whitespace in where it is meant to be ignored
but it won't be.

<pre>
>
>&lt;!-- Character entities omitted. These should be separate from
> the main DTD so specific applications can define their values.
> ISO entity sets could be used for this.
> -->
</pre>

Another point I should have explained in the DTD: the WWW application
specifies that HTML uses the Latin-1 character set, and that the Ouml
entity represents exactly that character from the Latin-1 character set and
not some system specific thingy. Translation to system character sets is
done <em>outside</em> of the SGML parser.

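<P>A sketch of that division of labor in Python, assuming Python's codec
names stand in for system character sets: the entity name maps to a fixed
Latin-1 code point, and conversion to whatever the local system uses
happens as a separate step afterwards. The helper name is made up for
this example.

```python
from html.entities import name2codepoint

# The Ouml entity always denotes Latin-1 code point 0xD6, whatever the host system is.
code_point = name2codepoint["Ouml"]
character = bytes([code_point]).decode("latin-1")

def to_system_charset(text, charset):
    """Translate already-parsed text to a system character set.

    This runs outside the parser; characters the target set lacks become '?'.
    """
    return text.encode(charset, errors="replace")

print(hex(code_point))                            # 0xd6
print(to_system_charset(character, "mac_roman"))  # a different byte than Latin-1 uses
print(to_system_charset(character, "ascii"))      # b'?'
```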
</body>
</html>