You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
417 lines
14 KiB
417 lines
14 KiB
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
|
|
<HTML>
|
|
<HEAD>
|
|
<TITLE>"Character Set" Considered Harmful</TITLE>
|
|
<META NAME="author" CONTENT="Connolly">
|
|
<META NAME="status" CONTENT="Internet Draft">
|
|
<META NAME="title" CONTENT="Character Terminology">
|
|
<META NAME="date" CONTENT="May, 1995">
|
|
<base href="http://www.w3.org/hypertext/WWW/MarkUp/html-spec/charset-harmful.html">
|
|
<!-- $Id: charset-harmful.html,v 1.6 2004/12/02 15:18:12 liam Exp $ -->
|
|
</HEAD>
|
|
|
|
<body>
|
|
<PRE>
|
|
HTML Working Group <a href="http://www.w3.org/hypertext/WWW/People/Connolly/">D. Connolly</a>
|
|
INTERNET-DRAFT <A HREF="../../Consortium/Prospectus/Overview.html">MIT/W3C</a>
|
|
draft-ietf-html-charset-harmful-00.txt May 2, 1995
|
|
Expires November, 1995
|
|
|
|
</pre>
|
|
<h1><em>Character Set</em> Considered Harmful</h1>
|
|
<h2>Status of this Document</h2>
|
|
|
|
<p>This document is an Internet-Draft. Internet-Drafts are working
|
|
documents of the Internet Engineering Task Force (IETF), its areas,
|
|
and its working groups. Note that other groups may also distribute
|
|
working documents as Internet-Drafts.
|
|
|
|
<p>Internet-Drafts are draft documents valid for a maximum of six
|
|
months and may be updated, replaced, or obsoleted by other
|
|
documents at any time. It is inappropriate to use Internet-Drafts
|
|
as reference material or to cite them other than as "work in
|
|
progress."
|
|
|
|
<p>To learn the current status of any Internet-Draft, please check the
|
|
"1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
|
|
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
|
|
munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
|
|
ftp.isi.edu (US West Coast).
|
|
|
|
<p>Distribution of this document is unlimited. Please send comments to
|
|
the HTML working group (HTML-WG) of the Internet Engineering Task
|
|
Force (IETF) at <tt><html-wg@oclc.org></tt>;. Discussions of
|
|
the group are
|
|
archived at <tt><a
|
|
href="http://www.acl.lanl.gov/HTML_WG/archives.html">http://www.acl.lanl.gov/HTML_WG/archives.html</a></tt>.
|
|
|
|
<h2>Abstract</h2>
|
|
|
|
<p> The term <em>character set</em> is often used to describe a ditigal
|
|
representation of text. ASCII is perhaps the most widely deployed
|
|
representation of text, and in the interest of interoperability,
|
|
information systems on the Internet traditionally rely on it
|
|
exclusively.
|
|
|
|
<p> The Multipurpose Internet Mail Extensions (MIME) introduces
|
|
Internet Media Types, including text representations besides
|
|
ASCII. The Hypertext Markup Language (HTML) used in the World-Wide Web
|
|
is a proposed Internet Media Type. But HTML is also an application of
|
|
Standard Generalized Markup Language (SGML).
|
|
|
|
<p> In the MIME and SGML specifications, the discussion of characters
|
|
representation is notoriously complex, and apparently subtly
|
|
inconsistent or incompatible. This document presents a collection of
|
|
terms intended to reconcile the two specifications and serve as a
|
|
basis for rigorous discussion of characters and their digital
|
|
representations.
|
|
|
|
|
|
<h2>Introduction</h2>
|
|
|
|
<p> The term <em>character set</em> is often used to describe a ditigal
|
|
representation of text. The specification of such a representation
|
|
typically involves identifying a sufficiently expressive collection of
|
|
characters, and giving each of them a number.
|
|
|
|
<p> In conventional mathematics terminology then, a "character set"
|
|
is not just a set of characters, but a <em>function</em> whose domain
|
|
is a set of integers, and whose range is a set of characters.
|
|
|
|
<p> Some standards documents, including the SGML standard, make little
|
|
or no use of such conventional mathematical terms as function, domain
|
|
and range. Perhaps the authors of those documents intend the documents
|
|
to be comprehensible without a prior understanding of mathematics. But
|
|
the specification of notions such as the conformance of an SGML
|
|
document or SGML system are much more complex than the basics of logic
|
|
and mathematics.
|
|
|
|
<p> In his text on Calculus<a href="#Spivak">[Spivak]</a>, Michael
|
|
Spivak writes:
|
|
|
|
<blockquote>
|
|
|
|
<p> Every aspect of this book was influenced by the desire to present
|
|
calculus not merely as a prelude to but as the first real encounter
|
|
with mathematics. Since the foundation of analysis provided the arena
|
|
in which modern modes of mathematical thinking developed, calculus
|
|
ought to be the place in which to expect, rather than avoid, the
|
|
strengthening of insight with logic. In addition to developing the
|
|
students' intuition about the beautiful concepts of analysis, it is
|
|
surely equally important to persuade them that precision and rigor are
|
|
neither deterrents to intuition, nor ends in themselves, but the
|
|
natural medium in which to formulate and think about mathematical
|
|
questions.
|
|
|
|
</blockquote>
|
|
|
|
<p> This document is not intended as the first real encounter with
|
|
mathematics. But neither will we make any effort to avoid or apologize
|
|
for mathematical terminology. The reader is referred to the large body
|
|
of literature on logic and set theory, including a history of writings
|
|
on math and logic[SET] and Douglas Hofstadter's fascinating book<a
|
|
href="#GEB">[GEB]</a>.
|
|
|
|
<h2>Coded Character Sets</h2>
|
|
|
|
<p> Using "character set" rather than something such as <em>character
|
|
table</em> or even <em>character sequence</em> to denote the functions
|
|
that maps integers to characters is unfortunate, but it is water under
|
|
the bridge, and a lot of it by now. Rather than attempting to divert
|
|
all that water at this point, we introduce the primitive notion of
|
|
character and use it to define the term <em>coded character set</em>
|
|
from [ISO10646] and other standards:
|
|
|
|
<dl>
|
|
|
|
<dt> character
|
|
|
|
<dd> An atom of information
|
|
|
|
<dt> coded character set
|
|
|
|
<dd> A function whose domain is a subset of the integers, and whose
|
|
range is a set of characters.
|
|
|
|
</dl>
|
|
|
|
<p> Note that by the term character, we do not mean a glyph, a name, a
|
|
phoneme, nor a bit combination. A character is simply an atomic unit
|
|
of communication. It is typically a symbol whose various
|
|
representations are understood to mean the same thing by a community
|
|
of people.
|
|
|
|
<p> It might seem more intuitive to map from characters to integers,
|
|
rather than the way it is defined here. But in practice there are
|
|
some coded character sets that assign two different numbers to the
|
|
same character<a href="lee">[Lee]</a>, and so the inverse is not a
|
|
function in the general case.
|
|
|
|
<p> There are two other terms used in standards such as [ISO10646]
|
|
that we define in relation to the first two:
|
|
|
|
<dl>
|
|
|
|
<dt> code position
|
|
|
|
<dd> An integer. A coded character set and a code position from
|
|
its domain determine a character.
|
|
|
|
<dt> character repertoire
|
|
|
|
<dd> A set of characters; that is, the range of a coded character set.
|
|
|
|
</dl>
|
|
|
|
<h2>Character Encoding Schemes</h2>
|
|
|
|
<p> The only practical means for exchanging information on the
|
|
Internet is to represent it as a sequence of octets (bytes).
|
|
|
|
<p> One way to transmit a sequence of characters is to agree on
|
|
a coded character set and transmit the character numbers of each
|
|
of the characters.
|
|
|
|
<p> But in practice, characters are encoded using a variety of
|
|
optimizations of this brute-force approach: code switching techniques,
|
|
escape sequences, etc. The encoding of a sequence of characters is
|
|
not, in general, the result of encoding each character independently
|
|
and then concatenating them. But it is sufficiently general to note
|
|
that sequences of characters are encoded as a sequence of bytes. So we
|
|
define:
|
|
|
|
<dl>
|
|
|
|
<dt>octet
|
|
|
|
<dd>an element of the set {0, 1, 2, ..., 255}
|
|
|
|
<dt>character encoding scheme
|
|
|
|
<dd>a function whose domain is the set of sequences of octets, and
|
|
whose range is the set of sequences of characters over some character
|
|
repertoire.
|
|
|
|
</dl>
|
|
|
|
|
|
<h2>Representation of SGML Text Entities</h2>
|
|
|
|
<p> An SGML document is made up of entities: a text entity called the
|
|
document entity, and possibly some other text entities and data
|
|
entities.
|
|
|
|
<p> A text entity is a sequence of characters. The representation of a
|
|
text entity is not specified by the SGML standard. For the purpose of
|
|
MIME-based interchange of SGML text entities, we define the following:
|
|
|
|
<dl>
|
|
|
|
<dt>text entity
|
|
|
|
<dd>a sequence of characters
|
|
|
|
<dt>message entity
|
|
|
|
<dd>a pair (T, OS) where T is an Internet Media Type and OS is a
|
|
sequence of octets.
|
|
|
|
</dl>
|
|
|
|
<p> Note that each <tt>text/*</tt> media type has an associated
|
|
<tt>charset</tt> parameter, which designates a character encoding
|
|
scheme. The character encoding scheme maps the body -- a sequence of
|
|
octets -- to a text entity -- a sequence of characters. Hence any
|
|
message entity of type <tt>text/*</tt> is equivalent to a text entity.
|
|
|
|
|
|
<h2>Numeric Character References</h2>
|
|
|
|
<p> Numeric character references are a great source of confusion. The
|
|
key insights are that:
|
|
|
|
<ul>
|
|
|
|
<li> Every SGML document has exactly one document character set, which
|
|
is a coded character set
|
|
|
|
<li> Numeric character references give code positions in the document
|
|
character set
|
|
|
|
</ul>
|
|
|
|
<h3>Example: ISO2022 Encoding with ISO10646 Coded Character Set</h3>
|
|
|
|
<p> Consider the following message entity:
|
|
|
|
<pre>
|
|
Date: Saturday, 29-Apr-95 03:53:33 GMT
|
|
MIME-version: 1.0
|
|
Content-Type: text/html; charset=iso-2022-jp
|
|
|
|
<TITLE>...</TITLE>
|
|
<BODY>
|
|
Here is some normal text.
|
|
Here is a 10646 numeric character reference ঀ.
|
|
Here is some ISO-2022-JP text: ...
|
|
</BODY>
|
|
|
|
</pre>
|
|
|
|
<p> To interpret the message entity, we notice that the
|
|
<tt>Content-Type</tt> is <tt>text/html</tt>, so this represents a text
|
|
entity. The <tt>charset</tt> parameter <tt>iso-2022-jp</tt>, along
|
|
with the octet sequence of the body, determines a sequence of
|
|
characters. The octets denoted above by '...' represent characters,
|
|
as per <tt>iso-2022-jp</tt>.
|
|
|
|
<p> To parse the resulting text entity as per SGML, the sender and
|
|
receiver must agree on an SGML declaration, since none is present in
|
|
the document entity. For this example, we assume that SGML declaration
|
|
specifies ISO10646 as the document character set. So the numeric
|
|
character reference <tt>ঀ</tt> is resolved with respect to
|
|
ISO10646.
|
|
|
|
<p> It may seem contradictory that the <tt>ISO-2022-JP</tt> character
|
|
encoding scheme is defined in terms of a collection of coded character
|
|
sets, none of which is ISO10646. But there is no contradiction. Each
|
|
character encoded by <tt>ISO-2022-JP</tt> is in the repertoire of one
|
|
of those coded character sets, each of which is a subset of the
|
|
repertoire of ISO10646.
|
|
|
|
<p> So while <tt>ISO-2022-JP</tt> is not sufficient for every ISO10646
|
|
document, it is the case that ISO10646 is a sufficient document
|
|
character set for any entity encoded with <tt>ISO-2022-JP</tt>.
|
|
|
|
<h3>Example: Reducing the Repertoire of an Entity</h3>
|
|
|
|
<p> Suppose we have an SGML document D whose document character set is
|
|
the coded character set ISO10646. We find the document entity DE in
|
|
the form of sequence of octets OS in a disk file, encoded using the
|
|
Unicode-UCS-2 character encoding scheme.
|
|
|
|
<pre>
|
|
Unicode-UCS-2(OS) = DE
|
|
</pre>
|
|
|
|
<p> We can reduce the character repertoire necessary to represent the
|
|
document entity by replacing characters outside the ISO-646-IRV
|
|
character repertoire with numeric character references:
|
|
|
|
<pre>
|
|
DE' = reduce(DE, ISO10646, ISO-646-IRV)
|
|
|
|
where
|
|
|
|
reduce : SEQ(char) X Coded Character Set X Character Repertoire -> SEQ(char)
|
|
|
|
and
|
|
|
|
reduce(c . rest, CCS, R) = if c in R, c . reduce(rest, CCS, R)
|
|
else &#N; . reduce(rest, CCS, R)
|
|
where CCS(N) = c
|
|
</pre>
|
|
|
|
<p> The resulting entity, DE' can then be endoded using US-ASCII
|
|
|
|
|
|
<pre>
|
|
US-ASCII(OS') = DE' = reduce(DE, ISO10646, ISO-646-IRV)
|
|
</pre>
|
|
|
|
<p> Hence, we can represent the document D as a message entity whose
|
|
content type is "text/plain; charset=US-ASCII" and whose body is OS'.
|
|
|
|
|
|
<h2>Conclusion</h2>
|
|
|
|
<p> It is critical to keep separate the notion of a simple table of
|
|
characters and their numbers, i.e. a coded character set, separate
|
|
from the various algorithms to encoded sequences of characters,
|
|
i.e. character encoding schemes. This separation allows a
|
|
representation of a text entity which is consistent with both the MIME
|
|
and SGML specifications.
|
|
|
|
<h2>Acknowledgements</h2>
|
|
|
|
|
|
<p> The idea for the title of this document actually came from John
|
|
Klensin. The notion of character encoding scheme was inspired by the
|
|
MIME specification by Ned Freed. James Clark, Ed Levinson, and several
|
|
other members of the MIMESGML working group collaborated in
|
|
discussions leading up to this draft. Liam Quin from SoftQuad
|
|
and Gavin Nicol from EBT have provided guidance on these issues in the
|
|
past. Erik Naggum has provided invaluable aid in understanding the
|
|
SGML standard.
|
|
|
|
<h2>References</h2>
|
|
|
|
<dl>
|
|
|
|
<dt><a name="MIME">[MIME]</a>
|
|
|
|
<dd>N. Borenstein and N. Freed. "MIME (Multipurpose Internet Mail
|
|
Extensions) Part One: Mechanisms for Specifying and Describing the
|
|
Format of Internet Message Bodies." RFC 1521, Bellcore, Innosoft,
|
|
September 1993.
|
|
|
|
<dt><a name="ASCII">[ASCII]</a>
|
|
|
|
<dd> US-ASCII. Coded Character Set - 7-Bit American Standard Code for
|
|
Information Interchange. Standard ANSI X3.4-1986, ANSI, 1986.
|
|
|
|
<dt><a name="ISO8859">[ISO-8859]</a>
|
|
|
|
<dd> ISO 8859. International Standard -- Information Processing --
|
|
8-bit Single-Byte Coded Graphic Character Sets -- Part 1: Latin
|
|
Alphabet No. 1, ISO 8859-1:1987. Part 2: Latin alphabet No. 2,
|
|
ISO 8859-2, 1987. Part 3: Latin alphabet No. 3, ISO 8859-3,
|
|
1988. Part 4: Latin alphabet No. 4, ISO 8859-4, 1988. Part 5:
|
|
Latin/Cyrillic alphabet, ISO 8859-5, 1988. Part 6: Latin/Arabic
|
|
alphabet, ISO 8859-6, 1987. Part 7: Latin/Greek alphabet, ISO
|
|
8859-7, 1987. Part 8: Latin/Hebrew alphabet, ISO 8859-8, 1988.
|
|
Part 9: Latin alphabet No. 5, ISO 8859-9, 1990.
|
|
|
|
<dt><a name="SGML">[SGML]</a>
|
|
|
|
<dd> ISO 8879. Information Processing -- Text and Office Systems --
|
|
Standard Generalized Markup Language (SGML), 1986.
|
|
|
|
<dt><a name="Nicol">[Nicol]</a>
|
|
|
|
<dd> <a href="http://www.ebt.com:8080/docs/multilingual-www.html">The
|
|
Multilingual World Wide Web</a>, Gavin T. Nicol, Electronic Book
|
|
Technologies, Japan <tt>gtn@ebt.com</tt>
|
|
|
|
<!-- @@ DSSSL draft spec -->
|
|
|
|
<!-- @@ O'Reilly's internationalization book -->
|
|
|
|
<!-- @@ ISO 10646 -->
|
|
|
|
<!-- @@ ISO 2022-jp RFC -->
|
|
|
|
<dt> <a name="lee">[Lee]</a>
|
|
|
|
<dd> Private communication with Liam Quin, from SoftQuad.
|
|
|
|
<dt> <a name="Spivak">[Spivak]</a>
|
|
<dd> Spivak, Michael. Calculus. 2nd Ed. 1967 ISBN 0-914098-77-2
|
|
|
|
<dt> <a name="GEB">[GEB]</a>
|
|
|
|
<dd> Hofstadter, Douglas R.
|
|
Gödel, Escher, Bach: An Eternal Golden Braid, 1979
|
|
ISBN 0-394-75682-7
|
|
|
|
<dt>[SET]
|
|
<dd>
|
|
"Investigations in the foundations of set theory I", in Jean van
|
|
Heijenoort (ed.) _From Frege to Godel: A Source Book in Mathematical
|
|
Logic, 1879-1931_ (Harvard U.P., 1967)
|
|
|
|
|
|
</dl>
|
|
|
|
</body>
|
|
</html>
|