<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
|
<html lang="EN">
|
|
<head>
|
|
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
|
|
<meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type">
|
|
<title>Mappings and identity in URIs and IRIs</title>
|
|
<style type="text/css">
|
|
code { font-family: monospace; }
|
|
|
|
div.constraint,
|
|
div.issue,
|
|
div.note,
|
|
div.notice { margin-left: 2em; }
|
|
|
|
li p { margin-top: 0.3em;
|
|
margin-bottom: 0.3em; }
|
|
|
|
div.exampleInner pre { margin-left: 1em;
|
|
margin-top: 0em; margin-bottom: 0em}
|
|
div.exampleOuter {border: 4px double gray;
|
|
margin: 0em; padding: 0em}
|
|
div.exampleInner { background-color: #d5dee3;
|
|
border-top-width: 4px;
|
|
border-top-style: double;
|
|
border-top-color: #d3d3d3;
|
|
border-bottom-width: 4px;
|
|
border-bottom-style: double;
|
|
border-bottom-color: #d3d3d3;
|
|
padding: 4px; margin: 0em }
|
|
div.exampleWrapper { margin: 4px }
|
|
div.exampleHeader { font-weight: bold;
|
|
margin: 4px}</style>
|
|
<link type="text/css" rel="stylesheet"
|
|
href="http://www.w3.org/StyleSheets/TR/base.css">
|
|
</head>
|
|
|
|
<body>
|
|
|
|
<div class="head">
|
|
<p><a href="http://www.w3.org/"><img width="72" height="48" alt="W3C"
|
|
src="http://www.w3.org/Icons/w3c_home"></a></p>
|
|
|
|
<h1><a id="title" name="title"></a>Mappings and identity in URIs and IRIs</h1>
|
|
|
|
<p>Preface: This document was originally written in 2003, before the IRI spec
|
|
was an RFC. Some of this has since been addressed in the RFC.</p>
|
|
|
|
<p></p>
|
|
|
|
<div class="div1">
|
|
|
|
<div class="div2">
|
|
<p>Summary: There is a discrepancy between the namespaces and URI specs about
which identifiers are equivalent. The only reason this has not caused a problem
is that the test case (two equivalent but not equal Unicode character
sequences being used) has not occurred in practice. A malicious use of IRIs
could, however, deliberately trigger the resulting bug and so cause a
security problem.</p>
|
|
|
|
<p>This note uses relationship notation (why not use N3?) to discuss the
inconsistencies between some current thinking about IRIs, URIs, and, for
example, namespace names.</p>
|
|
|
|
<h2><a name="Requiremen">Requirements</a>:</h2>
|
|
|
|
<p></p>
|
|
|
|
<p>1. URI identity is shared by all parties. Within a given context (<a
href="#context">*</a>), there is a single (inverse functional) relationship
uri(x, a) between a thing x and the ASCII string a which, taken as a URI,
identifies it.</p>
|
|
</div>
|
|
|
|
<div class="div2">
|
|
<p>2. For the users of any specification which mentions URIs: when one can
prove, by reading [scheme-independent] specs, that two URIs are equivalent,
then one can use one in place of the other. That is, when URI (or IRI)
strings are deemed "equivalent" then they must refer to the same object.</p>
|
|
</div>
|
|
|
|
<p>3. We should be able to use the same software to parse and compare URIs
wherever they are used, e.g. in namespace names or in hypertext links.</p>
|
|
</div>
|
|
|
|
<h2>What do we get from the specs?</h2>
|
|
|
|
<p>Let us formalize the concepts in the documents we are talking about.</p>
|
|
|
|
<h3>The URI spec</h3>
|
|
|
|
<p>uri(x,a) => A(a)</p>
|
|
|
|
<p>where A(a) means that it is a sequence of ASCII characters (grounded in
|
|
ANSI X3.4-1986).</p>
|
|
|
|
<p>The ANSI spec gives a 1:1 mapping ascii(a, s) from the set A of ASCII
characters to the set S of septets (integers between 0 and 127 inclusive).</p>
|
|
|
|
<p>Let sames(s1, s2) be the "strcmp" relation between two strings which are
|
|
septet for septet identical.</p>
|
|
|
|
<p>Consider the equivalence relation ea(a1, a2) which we use here to indicate
|
|
that two uris identify the same thing. It (is symmetric and transitive and)
|
|
has properties</p>
|
|
|
|
<p>ea(a1, a2) & uri(t1, a1) => uri(t1, a2)</p>
|
|
|
|
<p>for all a1, a2 (for some t uri(t, a1) & uri(t, a2)) <=>
|
|
ea(a1, a2)</p>
|
|
|
|
<p>A(a) <=> ea(a,a)</p>
|
|
|
|
<p>Now in fact we are going to deal with the ASCII encoded septets for which
a similar equivalence holds</p>
|
|
|
|
<p>es(s1, s2) <=> Exists a1, a2 such that ascii(a1, s1) &
ascii(a2, s2) & ea(a1, a2)</p>
|
|
|
|
<p>The URI spec mentions two uses of hexadecimal encoding. Hex encoding
|
|
relates octet strings to septet strings. When the URI spec was written, the
|
|
significance of the octets greater than 127 was not defined.</p>
|
|
|
|
<p>It implies that if you see %HH in a URI you should consider it as an
encoding of an octet. There is (at the level of this spec) the notion that
the URI is an encoding of a string of octets. Those from 0-127 are
considered as representing ASCII characters. There is no assumption about
what the others represent. The IRI spec will later take advantage of
this.</p>
|
|
|
|
<p>hexify(s1, s2) is true if the only difference, if any, between s1 and s2
is that one or more characters in s1 are replaced in s2 by their %HH or %hh
encoding, and ascii(s2).</p>
|
|
|
|
<p>ascii(s) => hexify(s, s)</p>
|
|
|
|
<p>hexify(s, s)</p>
|
|
|
|
<p>There are another 128 characters in this notional "extended" set, each of
|
|
which has a hex encoding.</p>
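<p>As a rough illustration of the hexify relation just described, here is a
minimal Python sketch. The function names <code>hexify</code> and
<code>is_hexified</code> and the <code>keep</code> parameter are assumptions
made for this example only; they are not taken from the URI spec.</p>

```python
import re


def hexify(t: bytes, keep: bytes = b"") -> str:
    """Return one hexified form of the octet string t.

    Every octet is written as %HH except ASCII octets listed in `keep`,
    which are left as literal characters. (`keep` is a hypothetical knob
    for this sketch; any subset of the ASCII octets may be left literal.)
    """
    out = []
    for octet in t:
        if octet < 128 and octet in keep:
            out.append(chr(octet))
        else:
            out.append("%{:02X}".format(octet))
    return "".join(out)


def is_hexified(t: bytes, s: str) -> bool:
    """True if s is ASCII and differs from t only by %HH or %hh replacements."""
    if not s.isascii():
        return False
    i = 0  # current position in s
    for octet in t:
        chunk = s[i:i + 3]
        if re.fullmatch(r"%[0-9A-Fa-f]{2}", chunk) and int(chunk[1:], 16) == octet:
            i += 3  # this octet appears in its hex-encoded form
        elif octet < 128 and s[i:i + 1] == chr(octet):
            i += 1  # this octet appears as a literal ASCII character
        else:
            return False
    return i == len(s)
```

<p>Under these assumptions, <code>is_hexified(b"abc", "abc")</code> holds
(the ascii(s) => hexify(s, s) case), and both the %HH and %hh spellings of
an octet above 127 are accepted.</p>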
|
|
</div>
|
|
|
|
<div class="head">
|
|
<p>(DanC: hexify(s+ c, s+hexify(c))</p>
|
|
|
|
<p>hexify('A') = '%41'</p>
|
|
|
|
<p>corollary: hexify(s1, s2) => ascii(s2))</p>
|
|
|
|
<p>I take hexify to be a subrelation of equality. That is, the URI spec
authorizes one to use s2 where one would have used s1. In some cases, such as
7-bit transports like HTTP, one has to. It is important that hexification
preserves the identity of the resource.</p>
|
|
|
|
<p>hexify(t, s1) & hexify(t, s2) => es(s1, s2)</p>
|
|
|
|
<p>{ for some s, hexify(t1, s) & hexify(t2, s) } <=> et(t1, t2)</p>
|
|
</div>
|
|
|
|
<p>Note that equivalence is preserved by the interchange of "%20" with " ",
but not by the interchange of "%2F" with "/".</p>
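<p>The distinction in this note can be sketched in Python as follows. The
<code>RESERVED</code> set and the names <code>normalize_escapes</code> and
<code>es</code> are assumptions for illustration; the exact set of
delimiters whose escaped form is significant is a matter for the URI spec,
not for this sketch.</p>

```python
import re

# Assumed set of delimiter characters: for these, the escaped form and the
# literal form are NOT treated as interchangeable.
RESERVED = set(";/?:@&=+$,#%")


def normalize_escapes(s: str) -> str:
    """Rewrite every %HH escape in s to one chosen form.

    Escapes of printable ASCII characters outside RESERVED are replaced by
    the literal character; every other escape is kept, with upper-case hex.
    """
    def fix(match):
        octet = int(match.group(1), 16)
        ch = chr(octet)
        if 0x20 <= octet < 0x7F and ch not in RESERVED:
            return ch
        return "%{:02X}".format(octet)

    return re.sub(r"%([0-9A-Fa-f]{2})", fix, s)


def es(s1: str, s2: str) -> bool:
    """Equivalence of two septet strings under the assumptions above."""
    return normalize_escapes(s1) == normalize_escapes(s2)
```

<p>With this sketch, "/a%20b" and "/a b" compare as equivalent, while
"/a%2Fb" and "/a/b" do not.</p>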
|
|
|
|
<p></p>
|
|
|
|
<p>URI encoding maps octets into URIs</p>
|
|
|
|
<p>@@ relative</p>
|
|
|
|
<p>rel(s, b, r) is a many-many relation between ASCII strings, meaning that r
is a relative URI reference for s relative to b. An implication of the spec is</p>
|
|
|
|
<p>rel(s1, b, r) . rel(s2, b, r) => e(s1, s2)</p>
|
|
|
|
<p>abs(s) <=> forAll b: rel(s, b, s)</p>
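<p>One way to experiment with rel and abs is Python's standard
<code>urllib.parse.urljoin</code>. This is only a sketch: the names
<code>rel</code> and <code>is_abs_for</code> are assumptions, and the
"for all b" in the definition of abs is approximated by checking a finite
list of sample bases.</p>

```python
from urllib.parse import urljoin


def rel(s: str, b: str, r: str) -> bool:
    """True if resolving the reference r against the base b yields s."""
    return urljoin(b, r) == s


def is_abs_for(s: str, bases) -> bool:
    """Approximate abs(s): s resolves to itself against every sample base."""
    return all(rel(s, b, s) for b in bases)
```

<p>For example, with the base "http://example.org/a/b", both "c" and
"../a/c" are relative references for "http://example.org/a/c", which shows
the many-many nature of the relation.</p>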
|
|
|
|
<div class="body">
|
|
<h3>The Unicode spec</h3>
|
|
|
|
<p>UTF-8 <a href="#Unicode32">[Unicode 3.2]</a> gives us a relation utf8(i,
|
|
s)</p>
|
|
|
|
<p>Note by the way that</p>
|
|
|
|
<p>ascii(s) => utf8(s,s)</p>
|
|
|
|
<p>utf8(i, s) is true if i is a string of Unicode characters, and s is an
extended ASCII string of octets, and the relationship is as specified in the
UTF-8 specification.</p>
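<p>The utf8 relation can be checked directly with Python's built-in codec.
The function name <code>utf8</code> simply mirrors the relation above and is
otherwise an assumption of this sketch.</p>

```python
def utf8(i: str, s: bytes) -> bool:
    """True if the octet string s is the UTF-8 encoding of the Unicode string i."""
    return i.encode("utf-8") == s


# For pure ASCII input the UTF-8 octets are the ASCII septets themselves,
# which is the "ascii(s) => utf8(s, s)" observation.
ascii_text = "abc"
ascii_is_fixed_point = utf8(ascii_text, ascii_text.encode("ascii"))
```

<p>A non-ASCII character such as U+00E9 encodes to the two octets C3 A9, so
<code>utf8("ros\u00e9", b"ros\xc3\xa9")</code> holds.</p>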
|
|
|
|
<p>sameu(i1, i2)</p>
|
|
|
|
<p>is true whenever the two unicode strings convey exactly the same series of
|
|
glyphs and/or control characters. There are strings which are not
|
|
identical</p>
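<p>A concrete pair of such strings can be built in Python. This sketch
assumes that sameu can be approximated by comparing Unicode NFC normal
forms; that choice is an assumption of the example, not something this
document specifies.</p>

```python
import unicodedata


def sameu(i1: str, i2: str) -> bool:
    """Approximate sameu by comparing NFC-normalized strings (an assumption)."""
    return unicodedata.normalize("NFC", i1) == unicodedata.normalize("NFC", i2)


composed = "ros\u00e9"      # "e with acute" as a single code point
decomposed = "rose\u0301"   # "e" followed by a combining acute accent
```

<p>Here <code>composed</code> and <code>decomposed</code> are different
code point sequences, yet <code>sameu(composed, decomposed)</code> holds
under the NFC assumption.</p>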
|
|
|
|
<h3>The IRI spec</h3>
|
|
|
|
<p>This says that (basically, with some work on corner cases etc) there
should be a convention that any 8-bit string which is not ASCII but which can
be interpreted as a UTF-8 encoding should be interpreted as a UTF-8
encoding.</p>
|
|
</div>
|
|
|
|
<p>What does that mean? I take it to mean that you can encode it and
|
|
de-encode it.</p>
|
|
|
|
<div class="body">
|
|
<p>There is a canonicalization function which the IRI spec uses, defined in
@@, which allows a particular</p>
|
|
|
|
<p>ucan(i,i)</p>
|
|
|
|
<p>Axioms are that it is a function:</p>
|
|
|
|
<p>ucan(i, j1) . ucan(i, j2) => strcmp(j1, j2)</p>
|
|
|
|
<p>for all i: can(i,i)</p>
|
|
|
|
<p>e(s1, s2).</p>
|
|
|
|
<p>There is a function (not 1:1) which we define as</p>
|
|
|
|
<p>iri_uri(i, s) <=> for some j, t: ucan(i, j). utf8(j, t).
|
|
hexify(t, s)</p>
|
|
|
|
<p>IRIs are defined as the domain of that function, where the range is URIs.
An IRI is any Unicode string which when canonicalized and UTF-8 encoded and
hexified is a URI.</p>
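<p>The composition of the three steps can be sketched in Python. Two
assumptions are made loudly here: <code>ucan</code> is taken to be Unicode
NFC normalization (the text above leaves its definition as "@@"), and of the
many permitted hexifications this sketch picks the one that escapes only
octets above 127, using upper-case hex.</p>

```python
import unicodedata


def ucan(i: str) -> str:
    """Canonicalize a Unicode string. ASSUMPTION: NFC stands in for ucan."""
    return unicodedata.normalize("NFC", i)


def iri_uri(i: str) -> str:
    """Map an IRI to one corresponding URI: canonicalize, UTF-8 encode, hexify."""
    t = ucan(i).encode("utf-8")
    # Hexify only the octets outside the ASCII range; ASCII stays literal.
    return "".join(chr(o) if o < 128 else "%{:02X}".format(o) for o in t)
```

<p>A string that is already a URI maps to itself, and the composed and
decomposed spellings of an accented character map to the same URI, which is
what requirement 2 asks for.</p>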
|
|
|
|
<p>There is a URI equivalent to every IRI. There is NOT an IRI for every
8-bit string t. There is at least one IRI for every URI: itself.</p>
|
|
|
|
<p>For requirement 2, equivalent IRIs must identify the same thing:</p>
|
|
|
|
<p>iri_uri(i1, s1). iri_uri(i2, s2). sameu(i1, i2) => e(s1, s2)</p>
|
|
|
|
<p></p>
|
|
|
|
<h3>The namespace spec</h3>
|
|
|
|
<p></p>
|
|
|
|
<div class="div2">
|
|
<p>The namespaces specification 2.3 talks about identifiers being different.
|
|
Specifically, "http://www.example.org/ros%c3%a9" and
|
|
"http://www.example.org/ros%C3%a9" are different. Let's call these constant
|
|
strings D1 and D2 for short.</p>
|
|
|
|
<p>ne(D1, D2)</p>
|
|
|
|
<p>Now "difference" is something which allows them for example to occur as
different attributes in an XML element. It seems to me that this ne is
the negation of e. It is the common understanding of differentness such that
two things can't be both different and the same. To make it otherwise would
be very confusing and would prevent (3).</p>
|
|
|
|
<p>ne(s1, s2) => ~e(s1, s2)</p>
|
|
</div>
|
|
|
|
<p>Ouch. We have one spec saying that these are different, and another
|
|
saying that they are the same.</p>
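<p>The clash can be reproduced with the two constants D1 and D2. In this
sketch <code>strcmp_equal</code> stands for the character-for-character
comparison of the namespace spec, and <code>same_octets</code> stands for
the URI view that %HH and %hh encode the same octet; both function names are
assumptions for illustration.</p>

```python
from urllib.parse import unquote_to_bytes

D1 = "http://www.example.org/ros%c3%a9"
D2 = "http://www.example.org/ros%C3%a9"


def strcmp_equal(s1: str, s2: str) -> bool:
    """Character-for-character comparison."""
    return s1 == s2


def same_octets(s1: str, s2: str) -> bool:
    """Compare the octet strings obtained by decoding every %HH escape."""
    return unquote_to_bytes(s1) == unquote_to_bytes(s2)
```

<p>The first comparison reports D1 and D2 as different; the second reports
them as the same octet string.</p>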
|
|
|
|
<p>That isn't logically compatible. The whole layering of the different
|
|
forms of equality described in Tim Bray's draft finding is of the form</p>
|
|
|
|
<p>e_uri(s,t) => e(s,t)</p>
|
|
|
|
<p>e_http(s,t) => e_uri(s,t)</p>
|
|
</div>
|
|
|
|
<p>and so on. None of the specs until namespaces say "these are
|
|
different".</p>
|
|
|
|
<p>So if you accept the requirements above, and you accept any of the
equivalences, we have to throw out that part of XML namespaces.</p>
|
|
|
|
<h2>Choices</h2>
|
|
|
|
<p>In general there are three ways of operating:</p>
|
|
|
|
<p>1. Ignore the equivalences, like the namespace spec. This causes a bug if
anyone uses two identifiers which are different strings but equivalent. The
only practical way of doing that is to make any non-canonical IRIs or URIs
illegal. This means IRIs cannot be used except in their trivial URI form.</p>
|
|
|
|
<p>2. Transmit in any form, receiver makes right. The receiver must compare
equivalence-aware or must canonicalize before internal use (which has the same
effect).</p>
|
|
|
|
<p>3. Make IRIs be just Unicode strings. Scratch the axiom that hexifying
leaves a valid and equivalent IRI. Allow the hexified forms to be used to
identify quite different things, in IRIs. Allow IRIs to be converted into
URIs, but do NOT allow any place where URIs and IRIs can be used
interchangeably. This works toward a DanC-proposed world of Unicode character
string comparison. It does not allow a smooth transition for existing
browsers etc which mix URIs and IRIs.</p>
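<p>Option 2, "receiver makes right", might look like the following Python
sketch, in which the receiver canonicalizes every identifier before using it
as a key. The <code>canonicalize</code> function and the
<code>Receiver</code> class are hypothetical, and the canonical form chosen
(NFC, UTF-8, escapes for non-ASCII octets, upper-case hex) is an assumption,
not a form defined by any of the specs discussed.</p>

```python
import re
import unicodedata


def canonicalize(identifier: str) -> str:
    """Reduce an IRI or URI string to one assumed canonical URI form."""
    octets = unicodedata.normalize("NFC", identifier).encode("utf-8")
    # Escape non-ASCII octets, leave ASCII characters as they are.
    ascii_only = "".join(chr(o) if o < 128 else "%{:02X}".format(o) for o in octets)
    # Give every existing escape the same (upper-case) spelling.
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), ascii_only)


class Receiver:
    """Stores values under canonicalized identifiers, so senders may use any form."""

    def __init__(self):
        self._table = {}

    def store(self, identifier: str, value) -> None:
        self._table[canonicalize(identifier)] = value

    def lookup(self, identifier: str):
        return self._table.get(canonicalize(identifier))
```

<p>After canonicalization the internal comparison is a plain string
comparison, which is why canonicalizing before internal use has the same
effect as an equivalence-aware compare.</p>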
|
|
|
|
<h2>Reality factors</h2>
|
|
|
|
<p>There are NOT very many actual uses of D1 and D2, because there aren't
|
|
really any motivations for making them.</p>
|
|
|
|
<p>-This is why we haven't had a big problem recently.</p>
|
|
|
|
<p>There ARE motivations for using (non-URI) IRIs. People are in fact using
them, though maybe not for namespaces yet.</p>
|
|
|
|
<p>- This is why endorsing IRIs forces us to fix this.</p>
|
|
|
|
<p>There ARE lots of applications which canonicalize URIs in various ways.</p>

<p>There IS software which compares namespaces character-for-character.</p>
|
|
|
|
<p>There are NOT many if any uses of different IRIs or different URIs for the
|
|
same namespace.</p>
|
|
|
|
<h2>Conclusion</h2>
|
|
|
|
<p>We should continue the recommendation <strong>not</strong> to use URIs or
IRIs which are equivalent but arbitrarily different strings. The easiest way
of ensuring this is to use a canonical form. We can therefore deprecate the
transmission or use of non-canonical forms.</p>
|
|
|
|
<p>We should switch as soon as possible to canonicalizing IRIs in all
applications before comparison (or using equivalence-aware comparisons). The
Namespaces spec should change to say when things are the same. The
constraint in XML that attributes cannot occur twice should be made more
complicated. It should say that you can't have two occurrences which are the
same attribute name, or two attributes which are equivalent in any way,
leaving, I regret, some fuzziness. For example, you can't use the xhtml1.0
and xml1.1 namespaces in the same document to put two src attributes on an
image! They are not even the same namespace, but clearly they are equivalent
at the application level. It should be clear that the fact that strings are
different is not a guarantee that the namespaces are different. The parser
just isn't expected to spot this. But I think the parser ought to be allowed
to consistently canonicalize. That makes life much easier for the
application. DanC wanted to be able to do strcmp, and he can if the parser
canonicalizes.</p>
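<p>A parser-side duplicate check along these lines could be sketched as
follows. Attributes are modelled as (namespace name, local name) pairs, and
<code>canonical_ns</code> only unifies the case of hex escapes; both the
data model and that deliberately narrow canonical form are assumptions of
this example.</p>

```python
import re


def canonical_ns(ns: str) -> str:
    """ASSUMED canonical form: upper-case the hex digits of every %HH escape."""
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), ns)


def find_duplicate_attributes(attributes):
    """Return pairs of attributes that collide once namespaces are canonicalized.

    `attributes` is a list of (namespace_name, local_name) tuples in document
    order; each result pairs the first occurrence with a later duplicate.
    """
    seen = {}
    duplicates = []
    for ns, local in attributes:
        key = (canonical_ns(ns), local)
        if key in seen:
            duplicates.append((seen[key], (ns, local)))
        else:
            seen[key] = (ns, local)
    return duplicates
```

<p>Because the keys are canonical strings, the comparison inside the
dictionary is an ordinary strcmp-style equality test. Application-level
equivalences, such as the src example above, are outside what this check can
see.</p>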
|
|
|
|
<p>We should then in a few years be able to relax the constraint on not
|
|
transmitting multiple different forms.</p>
|
|
|
|
<p>We need a very good IRI canonicalization test suite.</p>
|
|
|
|
<p>We should formalize with names the various functions above, and make sure
there are good working coded implementations of them in the major languages. A
standard API will help. URI working group stuff.</p>
|
|
|
|
<p>timbl</p>
|
|
|
|
<p>2003/04</p>
|
|
<hr>
|
|
|
|
<h2 id="References">References</h2>
|
|
<dl>
|
|
<dt>IRI</dt>
|
|
<dd>foobar<cite></cite></dd>
|
|
</dl>
|
|
|
|
<h3>Footnotes:</h3>
|
|
|
|
<p><a name="context">context</a></p>
|
|
|
|
<p>The foundational architecture of the web is that there is a global context
common to all publicly published documents, in which each URI is agreed by
everyone to identify the same thing. In practice of course, things break and
people are confused and misled. Those making formal systems often restrict
the scope of data to that in which this ideal approximation can be taken to
hold in practice as well as in theory.</p>
|
|
|
|
<p>The fact that the use of URIs varies with time (sad but true) (we are NOT
talking about living documents or concepts whose representations change,
here, but really reuse of the same URI for a totally different concept) means
that to model things over a relatively long time one might want to model the
time varying nature:</p>
|
|
|
|
<p>u(x, s, t)</p>
|
|
|
|
<p>This time modelling can be done and has been done in many ways, but is not
|
|
addressed here.</p>
|
|
|
|
<p></p>
|
|
<hr>
|
|
</body>
|
|
</html>
|