

								<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

								<html lang="EN">

								<head>

								  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">


								  <title>Mappings and identity in URIs and IRIs</title>

								  <style type="text/css">

								code           { font-family: monospace; }


								div.constraint,

								div.issue,

								div.note,

								div.notice     { margin-left: 2em; }


								li p           { margin-top: 0.3em;

								                 margin-bottom: 0.3em; }


								div.exampleInner pre { margin-left: 1em;

								                       margin-top: 0em; margin-bottom: 0em}

								div.exampleOuter {border: 4px double gray;

								                  margin: 0em; padding: 0em}

								div.exampleInner { background-color: #d5dee3;

								                   border-top-width: 4px;

								                   border-top-style: double;

								                   border-top-color: #d3d3d3;

								                   border-bottom-width: 4px;

								                   border-bottom-style: double;

								                   border-bottom-color: #d3d3d3;

								                   padding: 4px; margin: 0em }

								div.exampleWrapper { margin: 4px }

								div.exampleHeader { font-weight: bold;

								                    margin: 4px}</style>

								  <link type="text/css" rel="stylesheet"

								  href="http://www.w3.org/StyleSheets/TR/base.css">

								</head>


								<body>


								<div class="head">

								<p><a href="http://www.w3.org/"><img width="72" height="48" alt="W3C"

								src="http://www.w3.org/Icons/w3c_home"></a></p>


								<h1><a id="title" name="title"></a>Mappings and identity in URIs and IRIs</h1>


								<p>Preface: This document was originally written in 2003, before the IRI spec

								was an RFC. Some of this has since been addressed in the RFC.</p>


								<p></p>


								<div class="div1">


								<div class="div2">

								<p>Summary: There is a discrepancy between the namespaces and URI specs about
								which identifiers are equivalent. The only reason this has not caused a problem
								is that the test case (two equivalent but not equal Unicode character
								sequences being used) has not occurred in practice. Using IRIs maliciously
								could, however, deliberately introduce a bug which could cause a security
								problem.</p>


								<p>This note uses relationship notation (why not use N3?) to discuss the
								inconsistencies between some current thinking about IRIs, URIs, and, for
								example, namespace names.</p>


								<h2><a name="Requiremen">Requirements</a>:</h2>


								<p></p>


								<p>1. URI identity is shared by all parties.  Within a given context (<a
								href="#context">*</a>), there is a single (inverse functional) relationship
								uri(x, a) between a thing x and the ASCII string a which, taken as a URI,
								identifies it.</p>

								</div>


								<div class="div2">

								<p>2. For users of any specification which mentions URIs: when one can prove,
								by reading [scheme-independent] specs, that two URIs are equivalent, then one
								can use one in place of the other.  That is, when URI (or IRI) strings are
								deemed "equivalent" then they must refer to the same object.</p>

								</div>


								<p>3.  We should be able to use the same software to parse and compare URIs
								wherever they are used, e.g. in namespace names or in hypertext links.</p>

								</div>


								<h2>What do we get from the  specs?</h2>


								<p>Let us formalize the concepts in the documents we are talking about.</p>


								<h3>The URI spec</h3>


								<p>uri(x,a) =&gt; A(a)</p>


								<p>where A(a) means that it is a sequence of ASCII  characters (grounded in

								ANSI X3.4-1986).</p>


								<p>The ANSI spec gives a 1:1  mapping  ascii(a, s) from the set A of ASCII
								characters to the set S of septets (integers between 0 and 127 inclusive).</p>


								<p>Let sames(s1, s2) be the "strcmp" relation between two strings which are

								septet for septet identical.</p>


								<p>Consider the equivalence relation ea(a1, a2) which we use here to indicate
								that two URIs identify the same thing. It (is symmetric and transitive and)
								has the properties</p>


								<p>ea(a1, a2)   &amp;  uri(t1, a1)   =&gt;   uri(t1, a2)</p>


								<p>for all a1, a2   (for some t uri(t, a1)  &amp;  uri(t, a2))   &lt;=&gt;

								ea(a1, a2)</p>


								<p>A(a) &lt;=&gt;  ea(a,a)</p>


								<p>Now in fact we are going to deal with the ASCII encoded septets for which
								a similar equivalence holds</p>


								<p>es(s1, s2) &lt;=&gt;  Exists a1, a2 such that   ascii(a1, s1) &amp;
								ascii(a2, s2) &amp; ea(a1, a2)</p>


								<p>The URI spec mentions two uses of hexadecimal encoding. Hex encoding

								relates octet strings to septet strings. When the URI spec was written, the

								significance of the octets greater than 127 was not defined.</p>


								<p>It  implies that if you see %HH in a URI you should consider it as an

								encoding of an octet.  There is (at the level of this spec) the notion that

								the URI is an encoding of a string of octets.  Those from 0-127 are

								considered as representing ASCII characters.  There is no assumption about

								what the others represent.  The IRI spec will later take advantage of

								this.</p>


								<p>hexify(s1, s2)  is true if the only difference, if any, between s1 and s2
								is that one or more characters in s1 are replaced in s2 by their %HH  or %hh
								encoding, and ascii(s2).</p>


								<p>ascii(s)   =&gt; hexify(s, s)</p>


								<p>hexify(s, s)</p>


								<p>There are another 128 characters in this notional "extended" set, each of

								which has a hex encoding.</p>

								</div>


								<div class="head">

								<p>(DanC: hexify(s+ c, s+hexify(c))</p>


								<p>hexify('A') = '%41'</p>


								<p>corollary: hexify(s1, s2) =&gt; ascii(s2))</p>


								<p>I take hexify to be a subrelation of equality. That is, the URI spec
								authorizes one to use s2 where you would have used s1.  In some cases, such
								as 7-bit transports like HTTP, you have to.  It is important that
								hexification preserves the identity of the resource.</p>


								<p>hexify(t, s1) &amp;  hexify(t, s2) =&gt; es(s1, s2)</p>


								<p>{ for some s, hexify(t1, s) &amp;  hexify(t2, s) } &lt;=&gt; et(t1, t2)</p>

								</div>


								<p>Note that equivalence is preserved by the  interchange of  "%20"  with "

								", but not by interchange of  "%2F" with "/".</p>


								<p></p>
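<p>The note above about "%20" and "%2F" can be made concrete with a small
sketch. It assumes Python's standard <code>urllib.parse.unquote</code> as the
decoder, and the helper name <code>path_segments</code> is hypothetical, not
something defined by the URI spec. The idea is that "/" is a delimiter, so a
path must be split on literal "/" before the %HH escapes are decoded.</p>

```python
from urllib.parse import unquote


def path_segments(path):
    """Split on literal '/' first, then decode %HH escapes in each segment."""
    return [unquote(segment) for segment in path.split("/")]


# "%20" and " " decode to the same single segment.
space_same = path_segments("a%20b") == path_segments("a b")

# "%2F" stays inside one segment, while a literal "/" separates two segments.
slash_same = path_segments("a%2Fb") == path_segments("a/b")

print(space_same, slash_same)
```

<p>Under these assumptions the escaped and unescaped space compare equal,
while the escaped slash names one segment and the literal slash names two.</p>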


								<p>URI encoding maps octets into URIs</p>


								<p>@@ relative</p>


								<p>rel(s, b, r)  is a many-to-many relation between ASCII strings: r is a
								relative URI reference for s relative to b.  An implication of the spec is</p>


								<p>rel(s1, b, r) . rel(s2, b, r)  =&gt; e(s1, s2)</p>


								<p>abs(s)   &lt;=&gt;   forAll b:    rel(s, b, s)</p>
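<p>As a rough illustration of rel and abs, the sketch below uses Python's
standard <code>urllib.parse.urljoin</code> to resolve a reference against a
base. The function name <code>resolves_to</code> and the example.org URIs are
assumptions made for the example, and string equality after resolution is
only a sufficient test for rel, not a full definition of it.</p>

```python
from urllib.parse import urljoin


def resolves_to(s, b, r):
    """Sufficient check for rel(s, b, r): r resolved against base b yields s."""
    return urljoin(b, r) == s


target = "http://example.org/a/c"

# Many-to-many: different (base, reference) pairs can yield the same target.
pair_one = resolves_to(target, "http://example.org/a/b", "c")
pair_two = resolves_to(target, "http://example.org/a/x/b", "../c")

# abs(s): an absolute URI resolves to itself whatever the base is.
bases = ["http://example.org/a/b", "ftp://ftp.example.net/pub/"]
is_absolute = all(resolves_to(target, b, target) for b in bases)

print(pair_one, pair_two, is_absolute)
```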


								<div class="body">

								<h3>The Unicode spec</h3>


								<p>UTF-8 <a href="#Unicode32">[Unicode 3.2]</a> gives us a relation utf8(i,

								s)</p>


								<p>Note by the way that</p>


								<p>ascii(s) =&gt; utf8(s,s)</p>


								<p>utf8(i, s) is true if i is a string of Unicode characters, and s is an
								extended ASCII string of octets, and the relationship is as specified in the
								UTF-8 specification.</p>
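<p>A minimal sketch of the utf8 relation, using Python's built-in
<code>str.encode</code>. The function name <code>utf8</code> mirrors the
relation above but is written here as a function from characters to octets,
and the sample strings are assumptions chosen for illustration.</p>

```python
def utf8(i):
    """Return the octet string s such that utf8(i, s) holds."""
    return i.encode("utf-8")


# For pure ASCII input the octets are the ASCII septets unchanged.
ascii_text = "http://www.example.org/rose"
ascii_unchanged = utf8(ascii_text) == ascii_text.encode("ascii")

# A non-ASCII character becomes more than one octet, each above 127.
rose_octets = utf8("ros\u00e9")

print(ascii_unchanged, rose_octets)
```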


								<p>sameu(i1, i2)</p>


								<p>is true whenever the two Unicode strings convey exactly the same series of
								glyphs and/or control characters. There are strings which are not
								identical but for which sameu nevertheless holds.</p>
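<p>One way to picture sameu is through Unicode normalization. The sketch
below assumes that sameu can be approximated by comparing strings after
normalization to form NFC with Python's standard <code>unicodedata</code>
module. That choice of normalization form is an assumption of this example,
not something this document fixes.</p>

```python
import unicodedata

# Two spellings of the same word: precomposed e-acute versus
# plain "e" followed by a combining acute accent.
composed = "ros\u00e9"
decomposed = "rose\u0301"


def sameu(i1, i2):
    """Approximate sameu by comparing NFC-normalized strings (an assumption)."""
    return unicodedata.normalize("NFC", i1) == unicodedata.normalize("NFC", i2)


identical = composed == decomposed
equivalent = sameu(composed, decomposed)

print(identical, equivalent)
```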


								<h3>The IRI spec</h3>


								<p>This says that (basically, with some work on corner cases etc)  there
								should be a convention that any 8-bit string which is not ASCII which can be
								interpreted as a UTF-8 encoding should be interpreted as a UTF-8
								encoding.</p>

								</div>


								<p>What does that mean?  I take it to mean that you can encode it and
								decode it.</p>


								<div class="body">

								<p>There is a canonicalization function which the IRI spec uses, defined in
								@@, which allows a particular</p>


								<p>ucan(i,i)</p>


								<p>Axioms are that it is a function:</p>


								<p>ucan(i, j1) .  ucan(i, j2)  =&gt;  strcmp(j1, j2)</p>


								<p>for all i: can(i,i)</p>


								<p>e(s1, s2).</p>


								<p>There is a function (not 1:1) which we define as</p>


								<p>iri_uri(i, s)  &lt;=&gt;  for some j, t:   ucan(i, j).  utf8(j, t).

								hexify(t, s)</p>


								<p>IRIs are defined as the domain of that function, where the range is URIs.  An
								IRI is any Unicode string which when canonicalized and UTF-8 encoded and
								hexified is a URI.</p>
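<p>The composition ucan, then utf8, then hexify can be sketched as follows.
Two assumptions are made loudly here: ucan is taken to be Unicode
normalization form NFC (the document leaves its definition open), and hexify
escapes only octets above 127, always with upper-case hex digits. The
function names follow the relations above, but the code is an illustration,
not the IRI spec's algorithm.</p>

```python
import unicodedata


def ucan(i):
    """Canonicalize a Unicode string; NFC is an assumption of this sketch."""
    return unicodedata.normalize("NFC", i)


def hexify(t):
    """Replace every octet above 127 by %HH and keep ASCII octets as they are."""
    return "".join("%{:02X}".format(b) if b > 0x7F else chr(b) for b in t)


def iri_uri(i):
    """Map an IRI to a URI: canonicalize, UTF-8 encode, then hexify."""
    return hexify(ucan(i).encode("utf-8"))


from_composed = iri_uri("http://www.example.org/ros\u00e9")
from_decomposed = iri_uri("http://www.example.org/rose\u0301")
from_plain_uri = iri_uri("http://www.example.org/rose")

print(from_composed)
```

<p>With these assumptions both spellings of the accented name map to the same
URI, and a string that is already a URI maps to itself.</p>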


								<p>There is a URI equivalent to every IRI.  There is NOT an IRI for every
								8-bit string t. There is at least one IRI for every URI: itself.</p>


								<p>For requirement 2, equivalent IRIs must identify the same thing:</p>


								<p>iri_uri(i1, s1).  iri_uri(i2, s2).  sameu(i1, i2)  =&gt;  e(s1, s2)</p>


								<p></p>


								<h3>The namespace spec</h3>


								<p></p>


								<div class="div2">

								<p>The namespaces specification 2.3 talks about identifiers being different.

								Specifically, "http://www.example.org/ros%c3%a9" and

								"http://www.example.org/ros%C3%a9" are different.  Let's call these constant

								strings D1 and D2 for short.</p>


								<p>ne(D1, D2)</p>


								<p>Now "difference" is something which allows them for example to occur as
								different attributes in an XML element.  It seems to me that this ne is
								the negation of e.  It is the common understanding of differentness such that
								two things can't be both different and the same.   To make it otherwise would
								be very confusing and would prevent (3).</p>


								<p>ne(s1, s2) =&gt; ~e(s1, s2)</p>

								</div>


								<p>Ouch.  We have one spec saying that these are different, and another

								saying that they are the same.</p>
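<p>The clash can be seen directly with the two constant strings. The sketch
below compares D1 and D2 character for character, as the namespace spec
does, and then again after upper-casing the hex digits of each %HH escape.
The helper name <code>uppercase_escapes</code> is hypothetical, and treating
hex-digit case as insignificant is the URI-level equivalence assumed by this
example.</p>

```python
import re

D1 = "http://www.example.org/ros%c3%a9"
D2 = "http://www.example.org/ros%C3%a9"


def uppercase_escapes(uri):
    """Upper-case the two hex digits of every %HH escape and nothing else."""
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), uri)


# Character-for-character comparison: the strings differ.
strcmp_equal = D1 == D2

# Escape-aware comparison: the strings are equivalent.
escape_equal = uppercase_escapes(D1) == uppercase_escapes(D2)

print(strcmp_equal, escape_equal)
```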


								<p>That isn't logically compatible.   The whole layering of the different

								forms of equality described in Tim Bray's draft finding is of the form</p>


								<p>e_uri(s,t) =&gt; e(s,t)</p>


								<p>e_http(s,t) =&gt; e_uri(s,t)</p>

								</div>


								<p>and so on.  None of the specs until namespaces say "these are

								different".</p>


								<p>So if you accept the requirements above, and you accept any of the
								equivalences, we have to throw out that part of XML namespaces.</p>


								<h2>Choices</h2>


								<p>In general there are three ways of operating:</p>


								<p>1.  Ignore the equivalences, like the namespace spec does. This causes a bug if
								anyone uses two identifiers which are different strings but equivalent.  The
								only practical way of doing that is to make any non-canonical IRIs or URIs
								illegal.  This means IRIs cannot be used except in their trivial URI form.</p>


								<p>2. Transmit in any form, receiver makes right. The receiver must compare in an
								equivalence-aware way or must canonicalize before internal use (which has the
								same effect).</p>


								<p>3. Make IRIs be just Unicode strings.  Scratch the axiom that hexifying
								leaves a valid and equivalent IRI.  Allow the hexified forms to be used to
								identify quite different things, in IRIs.   Allow IRIs to be converted into
								URIs, but do NOT allow any place where URIs and IRIs can be used interchangeably.
								This works toward a DanC-proposed world of Unicode character string
								comparison.  It does not allow a smooth transition for existing browsers etc
								which mix URIs and IRIs.</p>
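<p>Option 2, "receiver makes right", might look like the sketch below: the
receiver reduces every identifier to one canonical spelling on arrival and
then uses plain string comparison internally. The function name
<code>canonical</code> is hypothetical, and the canonical form chosen (NFC,
UTF-8, %HH escapes for octets above 127, upper-case hex digits) is an
assumption of this example rather than a form mandated by any of the
specs.</p>

```python
import re
import unicodedata


def canonical(identifier):
    """Reduce a URI or IRI to one spelling (the form is an assumption)."""
    octets = unicodedata.normalize("NFC", identifier).encode("utf-8")
    hexified = "".join("%{:02X}".format(b) if b > 0x7F else chr(b) for b in octets)
    # Upper-case the hex digits of escapes that were already present.
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), hexified)


received = [
    "http://www.example.org/ros\u00e9",   # IRI, precomposed character
    "http://www.example.org/rose\u0301",  # IRI, combining accent
    "http://www.example.org/ros%c3%a9",   # URI, lower-case hex digits
]

# After canonicalizing on receipt, ordinary string comparison is enough.
distinct = {canonical(item) for item in received}

print(len(distinct))
```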


								<h2>Reality factors</h2>


								<p>There are NOT very many actual uses of  D1 and D2, because there aren't

								really any motivations for making them.</p>


								<p>-This is why we haven't had a big problem recently.</p>


								<p>There ARE motivations for using (non-URI) IRIs.  People are in fact using
								them, though maybe not for namespaces yet.</p>


								<p>- This is why endorsing IRIs forces us to fix this.</p>


								<p>There ARE lots of applications which canonicalize URIs in various ways.</p>


								<p>There IS software which compares namespaces character-for-character.</p>


								<p>There are NOT many if any uses of different IRIs or different URIs for the

								same namespace.</p>


								<h2>Conclusion</h2>


								<p>We should continue the recommendation <strong>not</strong> to use  URIs or
								IRIs which are equivalent but arbitrarily different strings.  The easiest way
								of ensuring this is to use a canonical form.  We can therefore deprecate the
								transmission or use of non-canonical forms.</p>


								<p>We should switch as soon as possible to canonicalizing IRIs in all
								applications before comparison (or using equivalence-aware comparisons).  The
								Namespaces spec should change to say when things are the same.  The
								constraint in XML that attributes cannot occur twice should be
								made more complicated.   It should say that you can't have two occurrences
								of the same attribute name, or two attributes which are equivalent in
								any  way, leaving, I regret, some fuzziness. For example, you can't use the
								xhtml1.0 and xml1.1 namespaces in the same document to put two src attributes
								on an image!  They are not even the same namespace, but clearly they are
								equivalent at the application level.  It should be clear that the fact that
								strings are different is not a guarantee that the namespaces are different.
								The parser just isn't expected to spot this.  But I think the parser ought to
								be allowed to consistently canonicalize.  That makes life much easier for
								the application.  DanC wanted to be able to do strcmp, and he can if the
								parser canonicalizes.</p>


								<p>We should then in a few years be able to relax the constraint on not

								transmitting multiple different forms.</p>


								<p>We need a  very good IRI canonicalization test suite.</p>


								<p>We should formalize with names the various functions above, and make sure
								there are good working coded implementations of them in the major languages. A
								standard API will help.  URI working group stuff.</p>


								<p>timbl</p>


								<p>2003/04</p>

								<hr>


								<h2 id="References">References</h2>

								<dl>

								  <dt>IRI</dt>

								    <dd>foobar<cite></cite></dd>

								</dl>


								<h3>Footnotes:</h3>


								<p><a name="context">context</a></p>


								<p>The foundational architecture of the web is that there is a global context
								common to all publicly published documents, in which each URI is agreed by
								everyone to identify the same thing.  In practice of course, things break and
								people are confused and misled.   Those making formal systems often restrict
								the scope of data to that in which this ideal approximation can be taken to
								hold in practice as well as in theory.</p>


								<p>The fact that the use of URIs varies with time (sad but true) (we are NOT
								talking about living documents or concepts whose representations change,
								here, but really reuse of the same URI for a totally different concept) means
								that to model things over a relatively long time one might want to model the
								time varying nature:</p>


								<p>u(x, s, t)</p>


								<p>This time modelling can be done and has been done in many ways, but is not

								addressed here.</p>


								<p></p>

								<hr>

								</body>

								</html>