You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
1097 lines
44 KiB
1097 lines
44 KiB
<html xmlns="http://www.w3.org/1999/xhtml">
|
|
<head>
|
|
<meta name="generator" content=
|
|
"HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 13), see www.w3.org" />
|
|
<title>
|
|
Univeral Resource Identifiers -- Axioms of Web architecture
|
|
</title>
|
|
<link href="di.css" rel="stylesheet" type="text/css" />
|
|
<meta http-equiv="Content-Type" content=
|
|
"text/html; charset=us-ascii" />
|
|
</head>
|
|
<body bgcolor="#DDFFDD" text="#000000" lang="en" xml:lang="en">
|
|
<address>
|
|
Tim Berners-Lee
|
|
<p>
|
|
Date: December 19, 1996
|
|
</p>
|
|
<p>
|
|
Status: personal view. Editing status: Italic text is
|
|
rough. Reques complete edit and possibly massaging, but
|
|
content is basically there. Words such as "axiom" and
|
|
"theorem" are used with gay abandon and the reverse of
|
|
rigour here..
|
|
</p>
|
|
</address>
|
|
<p>
|
|
<a href="Overview.html">Up to Design Issues</a>
|
|
</p>
|
|
<h3>
|
|
Universal Resource Identifiers -- Axioms of Web Architecture
|
|
</h3>
|
|
<ul>
|
|
<li>
|
|
<a href="#uri">Universal Resource Identifiers</a>
|
|
<ul>
|
|
<li>
|
|
<a href="#Universality">Universality</a>
|
|
</li>
|
|
<li>
|
|
<a href="#unique">Global uniqueness</a>
|
|
</li>
|
|
<li>
|
|
<a href="#same">Sameness</a>
|
|
</li>
|
|
<li>
|
|
<a href="#identity">Identity</a>
|
|
</li>
|
|
<li>
|
|
<a href="#canonicalization">Canonicalization - when is
|
|
a URI the saem URI?</a>
|
|
</li>
|
|
<li>
|
|
<a href="#abuse">Identity abuse</a>
|
|
</li>
|
|
<li>
|
|
<a href="#nonunique">Not a unique space, just
|
|
universal</a>
|
|
</li>
|
|
</ul>
|
|
</li>
|
|
<li>
|
|
<a href="#state">Identity, State and GET</a>
|
|
<ul>
|
|
<li>
|
|
<a href="#opaque">Opacity</a>
|
|
</li>
|
|
<li>
|
|
<a href="#Query">Query strings</a>
|
|
</li>
|
|
</ul>
|
|
</li>
|
|
<li>
|
|
<a href="#relative">Relative URIs</a>
|
|
<ul>
|
|
<li>
|
|
<a href="#matrix">Matrix spaces</a>
|
|
</li>
|
|
</ul>
|
|
</li>
|
|
<li>
|
|
<a href="Axioms.html#Properties">The properties of
|
|
different URI schemes</a>
|
|
</li>
|
|
</ul>
|
|
<hr />
|
|
<h1>
|
|
Universal Resource Identifiers
|
|
</h1>
|
|
<p>
|
|
The operation of the World Wide Web, and its interoperability
|
|
between platforms of differing hardware and software
|
|
manufacturers, depend on the specifications of protocols such
|
|
as HTTP, data formats such as HTML, and other syntaxes such
|
|
as the URL or, more generally, URI specifications. Behind
|
|
these specifications lie some important rules of behavior
|
|
which determine the foundation of the properties of the Web.
|
|
These are rules and principles upon which new designs of
|
|
programs and the behavior of people must rely. And it is that
|
|
reliance which makes the Web both an information space which
|
|
works now, and the foundation for future applications,
|
|
protocols, and extensions. The more essential of these I
|
|
refer to loosely as axioms, and the most basic of these have
|
|
to do with URI.<br />
|
|
</p>
|
|
<p>
|
|
The aim of thes article is to summarize in one place the
|
|
axioms of Web architecture: those invariant aspects of Web
|
|
design which are implied or stated in various specifications
|
|
or in some cases simply part of the folk law of how the Web
|
|
ought to be used. Especially for these latter cases, this
|
|
article is designed to tie together the Web community in a
|
|
common understanding of how we can progress, extend, and
|
|
evolve the Web protocols. <i>Terms such as "axiom", and
|
|
"theorem" are used with gay abandon rather than precision as
|
|
this not a mathematical treatise.</i><br />
|
|
</p>
|
|
<h2>
|
|
<a name="uri" id="uri">Universal Resource
|
|
Identifiers</a><br />
|
|
</h2>
|
|
<p>
|
|
The Web is a universal information space. It is a space in
|
|
the sense that things in it have an address. The "addresses",
|
|
"names", or as we call them here identifiers, are the subject
|
|
of this article. They are called <b>Universal Resource
|
|
Identifiers</b> (URIs).
|
|
</p>
|
|
<p>
|
|
An information object is "on the web" if it has a URI.
|
|
Objects which have URIs are sometimes known as "First
|
|
Class Objects" (FCOs). The Web works best when any
|
|
information object of value and identity is a first class
|
|
object. If something does not have a URI, you can't
|
|
refer to it, and the power of the Web is the less for that.
|
|
</p>
|
|
<p>
|
|
By <em>Universal</em> I mean that the web is declared to be
|
|
able to contain in principle every bit of information
|
|
accessible by networks. It was designed to be able to include
|
|
existing information systems such as FTP, and to be able
|
|
simply in the future to be extendable to include any new
|
|
information system.
|
|
</p>
|
|
<p>
|
|
The URI schemes identify things various different types of
|
|
information object, wich play different roles in the
|
|
protocols. Some identify services, connection end points, and
|
|
so on, but a fundamental underlying architectural notion is
|
|
of information objects - otherwise known as generic
|
|
<strong>documents</strong>. These can be
|
|
<strong>represented</strong> by strings of bits. An
|
|
information object conveys something - it may be art, poetry,
|
|
sensor values or mathematical equations.
|
|
</p>
|
|
<p>
|
|
The Semantic Web allows an information objects to give
|
|
information about anything - real objects, abstract concepts.
|
|
In this case, by combining the identifier of a document with
|
|
the identifier, within that document, of something it
|
|
describes, one forms an idenifier for anything. This is done
|
|
with "#" and fragment identifiers, discussed later.
|
|
</p>
|
|
<h4>
|
|
<a name="Universality" id="Universality">Axiom 0:
|
|
Universality</a> 1
|
|
</h4>
|
|
<p class="axiom">
|
|
Any resource anywhere can be given a URI
|
|
</p>
|
|
<h4>
|
|
<a name="Universality2" id="Universality2">Axiom 0a:
|
|
Universality 2</a>
|
|
</h4>
|
|
<p class="axiom">
|
|
Any resource of significance should be given a URI.
|
|
</p>
|
|
<p>
|
|
(What sorts of things can be resources? A very wide variety.
|
|
The URI concept istelf puts no limits on this. However, URIs
|
|
are divided into schemes, such as http: and telenet:, and the
|
|
specification of each scheme determines what sort of things
|
|
can be resources in that scheme. Schemes are discussed
|
|
later.)
|
|
</p>
|
|
<p>
|
|
This means that no information which has any significance and
|
|
persistence should be made available in a way that one cannot
|
|
refer to it with a URI.
|
|
</p>
|
|
<p>
|
|
In fact, we take care before extending the URIs to include
|
|
any old system, because URIs of any form must also be
|
|
understood anywhere in the world.
|
|
</p>
|
|
<p>
|
|
When you specify a URI, a Universal Resource Identifier,
|
|
(often people use the more restricted term "URL", Uniform
|
|
Resource Locator), first axiom is:
|
|
</p>
|
|
<h4>
|
|
<a name="unique" id="unique">Axiom 1: Global scope</a>
|
|
</h4>
|
|
<p class="axiom">
|
|
It doesn't matter to whom or where you specify that URI, it
|
|
will have the same meaning.
|
|
</p>
|
|
<p>
|
|
So, this means that there is no scope within which a URI must
|
|
be placed for it to hold. All you need say is that something
|
|
is "on the Web" and that is enough. Anyone can follow that
|
|
hypertext link, anyone can look up that URI. Now, the sorts
|
|
of URI that we find typically start "<code>http:</code>"
|
|
indicating that this URI points into a space of objects
|
|
accessed using the hypertext transfer protocol. But there are
|
|
many other sorts of URI, and a key to the Universality is
|
|
that this universal space of identifiers whether you call
|
|
them names addresses or locators, is universal through the
|
|
range of pre-existing protocols such as SMTP, and NNTP,
|
|
through protocols designed for the Web specifically (HTTP)
|
|
through, in principal, to new protocols yet to be invented.
|
|
So, there is a theorem, if you like, of URIs that:
|
|
</p>
|
|
<p class="axiom">
|
|
Any new space of identifiers or address space can be
|
|
represented as a subset of URI space.
|
|
</p>
|
|
<p>
|
|
You can prove this easily because there is no limit to the
|
|
length of a URI and any new name system or address system can
|
|
be incorporated simply by encoding the value of names or
|
|
addresses into an acceptable, printable string and prefacing
|
|
that string with a standard prefix for that new scheme. So,
|
|
you could replace http:, for example with ISBN: or X500:
|
|
depending on the new scheme.<br />
|
|
</p>
|
|
<p>
|
|
There is a second axiom of URIs which is difficult to
|
|
characterize exactly but accepted in some form by everyone
|
|
who uses the Web in some form and that is that:
|
|
</p>
|
|
<h4>
|
|
<a name="same" id="same">Axiom 2a: sameness</a>
|
|
</h4>
|
|
<p class="axiom">
|
|
a URI will repeatably refer to "the same" thing
|
|
</p>
|
|
<p>
|
|
The same identifier string is expected from one day to the
|
|
next to point to, in some sense, the same object. That is a
|
|
very important axiom, and it leaves open the "in some sense"
|
|
behind which is a very complicated discussion of the concept
|
|
of identity. When are two things "in some sense" the
|
|
same.<br />
|
|
</p>
|
|
<h4>
|
|
<a name="identity" href="Axioms.html#Universality" id=
|
|
"identity">Axiom 2b: identity</a>
|
|
</h4>
|
|
<p>
|
|
of URIs clears up the vagueness of 2a and is that
|
|
</p>
|
|
<p class="axiom">
|
|
the significance of identity for a given URI is determined by
|
|
the person who owns the URI, who first determined what it
|
|
points to.
|
|
</p>
|
|
<p>
|
|
We do not discuss in detail here the definition of owner
|
|
because the mechanism by which a person or agent comes to get
|
|
or create or be allocated a new URI varies from scheme to
|
|
scheme. But in every scheme in practice there is a way of
|
|
making a new URI. In many schemes, the scheme itself implies
|
|
or requires some properties of identity. The scheme, if you
|
|
like, imposes constraints within which the owner of a URI is
|
|
free to define identity.
|
|
</p>
|
|
<p>
|
|
The implication here is that we will need protocols for
|
|
exchanging any guarantees of the properties of given URIs:
|
|
they are not simply laid down in the specification of the
|
|
web. This is in tune with a general philosophical principle
|
|
of design (after Bob Scheifler and others):
|
|
</p>
|
|
<blockquote>
|
|
The technology should define <i>mechanisms</i> wherever
|
|
possible without defining <i>policy</i>.
|
|
</blockquote>
|
|
<p>
|
|
because we recognize here that many properties of URIs are
|
|
social rather than technical in origin.
|
|
</p>
|
|
<p>
|
|
Therefore, you will find pointers in hypertext which point to
|
|
documents which never change but you will also find pointers
|
|
to documents which change with time. You will find pointers
|
|
to documents which are available in more than one format. You
|
|
will find pointers to documents which look different
|
|
depending on who is asking for them. There are ways to
|
|
describe in a machine or human readable way exactly what sort
|
|
of repeatability you would expect from a URI, but the
|
|
architecture of the Web is that that is for something for the
|
|
owner of the URI to determine.
|
|
</p>
|
|
<h4>
|
|
<a name="abuse" id="abuse">Identity abuse</a>
|
|
</h4>
|
|
<p>
|
|
<i>All the same, a word of caution is appropriate about the
|
|
indiscriminate or deliberately misleading abuse of the
|
|
identity of the object refered to by a URI. A web server is
|
|
often in a position to know a lot of context about a request.
|
|
This can include for example, the person who is asking, the
|
|
document they were reading last from which they followed the
|
|
link. It is possible to use this information to
|
|
dramatically change the content of the document refered to.
|
|
This undermines the concept of identity and of
|
|
reference in general. To do that without making it
|
|
clear is misleading both to anyone who quotes the URI
|
|
of a page or who follows the link.</i>
|
|
</p>
|
|
<p>
|
|
<i>Unless it is clearly indicated on the page (or using a
|
|
future protocol) , to return differing information for the
|
|
same URI must be considered a form of deception. It
|
|
also of course messes up caches. Note the HTTP 1.1 "Vary"
|
|
header allows this indication to be passed.</i>
|
|
</p>
|
|
<h4>
|
|
<a name="canonicalization" id="canonicalization">When is a
|
|
URI "the same URI"?</a>
|
|
</h4>
|
|
<p>
|
|
Two URIs are the same if (and only if) they are the same
|
|
character for character.
|
|
</p>
|
|
<p>
|
|
Two URIs which are different may in fact be equivalent, in
|
|
that they may refer to the same thing, and give the same
|
|
result in all operations. In some cases any agent looking at
|
|
two URIs can deduce, from knowledge of the various web
|
|
standards, that they must be equivalent, in that they must
|
|
refer to the same thing. For example, HTTP URIs contain
|
|
domain names, and the Domain Name System is case-insensitive.
|
|
Therefore, while it is normal practice to use lower case for
|
|
domain names, any agent which comes across two URIs which
|
|
differ only in the case of the domain name can conclude that
|
|
they must refer to the same thing. In another case, a client
|
|
agent may use out-of-band information about a web site to
|
|
know that its URI paths are case-invariant, or that URIs
|
|
ending in "/" and "/index.html" are equivalent. It is bad
|
|
engineering practice to make new protocols require such
|
|
processing.
|
|
</p>
|
|
<p>
|
|
There are a long series of such algorithms. Which ones an
|
|
agent can apply depends on what information it has to hand,
|
|
and depend on what knowledge of which protocols has been
|
|
programmed into it. New schemes may be defined in the future,
|
|
for which different forms of canonicalization can be done.
|
|
There is, therefore, <strong>no definitive
|
|
canonicalization</strong> algorithm for URIs. Generic URI
|
|
handling code should handle URIs as case-sensitive character
|
|
strings. It is <strong>not</strong> recommended that, for
|
|
example, encryption and signature algorithms attempt to
|
|
canonicalize URIs before signature, because of the
|
|
arbitrariness of any attempt to define a canonicalization
|
|
algorithm.
|
|
</p>
|
|
<p>
|
|
The only canonicalization one could insist upon would be that
|
|
defined by the algorithms in the URI specifications. This
|
|
incldues the generation of an absolute URI from and the
|
|
hex-encoding or decoding of all non-reserved characters.
|
|
</p>
|
|
<h3>
|
|
URIs and the Test of Independent Invention
|
|
</h3>
|
|
<p>
|
|
The concept of a web as a "space" is based on these axioms of
|
|
design. As a result, the web behaves to a certain extent as a
|
|
system with state, and an important part of the work of the
|
|
system is the distribution of visible state rather than the
|
|
execution of invisible remote operations.
|
|
</p>
|
|
<h4>
|
|
<a name="nonunique" id="nonunique">Axiom 3: non unique</a>
|
|
</h4>
|
|
<p class="axiom">
|
|
URI space does not have to be the only universal space
|
|
</p>
|
|
<p>
|
|
The assertion that the space of URIs is a universal space
|
|
sometimes encounters opposition from those who feel there
|
|
should not be one universal space. These people need not
|
|
oppose the concept because it is not of a single universal
|
|
space: Indeed, the fact that URIs form universal space does
|
|
not prevent anyone else from forming their own universal
|
|
space, which of course by definition would be able to envelop
|
|
within it as a subset the universal URI space. Therefore the
|
|
web meets the "independent design" test, that if a similar
|
|
system had been concurrently and independently invented
|
|
elsewhere, in such a way that the arbitrary design decisions
|
|
were made differently, when they met later, the two systems
|
|
could be made to interoperate.
|
|
</p>
|
|
<p>
|
|
There may be in the world many universal spaces, and there
|
|
need not be any particular quarrel about one particular one
|
|
having a special status. (Of course, having very many may not
|
|
be very useful, and in the World Wide Web, the URI space
|
|
plays a special role by being the universal space chosen in
|
|
that design.)
|
|
</p>
|
|
<p>
|
|
For example, it would be possible to map all international
|
|
telephone numbers into URI space very easily, by inventing a
|
|
new URI "phone:" after which was the phone number. It would
|
|
in fact also conversely be possible to map URIs into
|
|
international phone numbers by allocating a special phone
|
|
number not used by anyone else, perhaps a special country
|
|
code for URI space, and then converting all URIs into a
|
|
decimal representation. In that case, both URIs and phone
|
|
numbers would be universal spaces. Identifiers in one space
|
|
would be consisting only of numbers, and in the other of
|
|
alphanumeric characters. One would be shorter than the other,
|
|
but there is no reason why, in principle, the two could not
|
|
co-exist, allowing you to dial any Web object from a
|
|
telephone as a telephone number, and point to any phone from
|
|
a hypertext document.<br />
|
|
</p>
|
|
<p>
|
|
So, on this last axiom rests not specifically the operation
|
|
of the web, but its acceptance as a non-domineering
|
|
technology, and therefore our trust in its future
|
|
evolvability.
|
|
</p>
|
|
<h3>
|
|
<a name="state" id="state">Identity, State and GET</a>
|
|
</h3>
|
|
<p>
|
|
From the fact that the thing referenced by the URI is in some
|
|
sense repeatably the same suggests that in a large number of
|
|
cases the result of de-referencing the URI will be exactly
|
|
the same, especially during a short period of time. This
|
|
leads to the possibility of caching information. It leads to
|
|
the whole concept of the Web as an information space rather
|
|
than a computing program. It is a very fundamental concept.
|
|
Not only do the concepts of navigation around the space
|
|
remembering "places" in the Web and other humanly visible
|
|
aspects of the nature of the Web depend on it, but also many
|
|
technical architectural properties depend on it. For example,
|
|
the implication is that the GET operation in HTTP is an
|
|
operation which is expected to repeatably return the same
|
|
result. As a result of that, anyone may know that under
|
|
certain circumstances that they may instead of repeating an
|
|
HTTP operation, use the result of a previous operation. The
|
|
operation is "idempotent". This, in turn, allows software to
|
|
use previously fetched copies of documents and it requires
|
|
that the HTTP GET operation should have no <em>side
|
|
effects</em>. For example, GET should never be used to
|
|
initiate another operation which will change state. In
|
|
general (see the HTTP 1.1 spec) the notion of side-effects is
|
|
that of any significant communication between the parties. A
|
|
user can never be held accountable to anything as a result of
|
|
doing a GET. The server may for example log the number of
|
|
requests, but the client user cannot be held responsible for
|
|
that: it does not constitute communication between the two
|
|
parties.
|
|
</p>
|
|
<p>
|
|
It is wrong to represent the user doing a GET as committing
|
|
to something or putting themselves on a mailing list, doing
|
|
any operation which effects the state of the Web or the state
|
|
of the users relationship with the information provider or
|
|
the server. To ignore this rule can be to introduce a serious
|
|
security problem in a website.
|
|
</p>
|
|
<p>
|
|
So, from this principal, we have a principal of the http
|
|
protocol that :
|
|
</p>
|
|
<h4>
|
|
<a name="get" id="get">Axiom</a>
|
|
</h4>
|
|
<p style="axiom" class="axiom">
|
|
In HTTP, GET must not have side effects.
|
|
</p>
|
|
<p>
|
|
The introduction of any other method apart from GET which has
|
|
no side effects is also incorrect, because the results of
|
|
such an operation effectively form a separate address space,
|
|
which violates the universality. A pragmatic symptom would be
|
|
that hypertext links would have to contain the method as well
|
|
as the URI in order to able to address the new space, which
|
|
people would soon want to do.
|
|
</p>
|
|
<h4>
|
|
<a name="GET2" id="GET2">Axiom</a>
|
|
</h4>
|
|
<p class="axiom">
|
|
In HTTP, anything which does not have side-effects should use
|
|
GET
|
|
</p>
|
|
<p>
|
|
This means that for people implementing systems in which
|
|
users request information and execute operations using forms,
|
|
when the form simply requests information it must result in a
|
|
GET operation. Indeed this is very much to be favored over a
|
|
post operation because the result of a GET operation has a
|
|
URI and may be leaked to, for example, may be put into a
|
|
bookmark. This violates the <a href=
|
|
"Axioms.html#Universality2">axiom of universality</a> above.
|
|
</p>
|
|
<p>
|
|
However, when the result of a form is to execute an
|
|
operation, which changes the Web or a relationship of a user
|
|
to anyone else, then the GET operation may not be used and
|
|
POST or other method either through HTTP or mail must be
|
|
used. Only by sticking to this rule can such systems
|
|
interoperate with caches and other agents which exploit the
|
|
repeatability of HTTP GET of URI dereferencing in the
|
|
future.<br />
|
|
</p>
|
|
<p>
|
|
The axiom above about URIs pointing in principle to
|
|
conceptually the same thing has a corollary which says that
|
|
URIs do not always have to point to <i>exactly</i> the same
|
|
set of bits. This means that URIs can be "generic". See the
|
|
<a href="Generic.html">discussion of generic URIs</a>.<br />
|
|
</p>
|
|
<h3>
|
|
<a name="opaque" id="opaque">The Opacity Axiom</a><br />
|
|
</h3>
|
|
<p>
|
|
The concept of an identifier referring to a resource is very
|
|
fundamental in the World Wide Web. Identifiers will refer to
|
|
resources all different sorts. Any addressable thing will
|
|
have an identifier. There are mechanisms we have just
|
|
discussed for extending the spaces of identifiers into name
|
|
spaces which have different properties. Different spaces may
|
|
address different sorts of objects, and the relationship
|
|
between the identifier and the object, such as the uniqueness
|
|
of the object and the concept of identity, may vary. A very
|
|
important axiom of the Web is that in general:
|
|
</p>
|
|
<h4>
|
|
Axiom: Opacity of URIs
|
|
</h4>
|
|
<p class="axiom">
|
|
The only thing you can use an identifier for is to refer to
|
|
an object. When you are not dereferencing, you should not
|
|
look at the contents of the URI string to gain other
|
|
information.
|
|
</p>
|
|
<p>
|
|
For the bulk of Web use URIs are passed around without anyone
|
|
looking at their internal contents, the content of the string
|
|
itself. This is known as the <b>opacity</b>. Software should
|
|
be made to treat URIs as generally as possible, to allow the
|
|
most reuse of existing or future schemes.
|
|
</p>
|
|
<p>
|
|
For example, within an HTTP identifier, even when access is
|
|
made to the object, the client machine looks at the first
|
|
part of the identifier to determine which server machine to
|
|
talk to and from then on the rest of the string is defined to
|
|
be opaque to the client. That is the client does not look
|
|
inside it, it can not deduce an information from the
|
|
characters in that identifier. It has been very tempting from
|
|
time to time for people to write software in which a client
|
|
will look at a string such as ".html" on the end of an
|
|
identifier, and come to a conclusion that it might be
|
|
hypertext markup file when dereferenced. But these thoughts
|
|
of breaking of the rule could lead to a broken architecture
|
|
in which the generality of URIs is something one can no
|
|
longer depend on.
|
|
</p>
|
|
<p>
|
|
Opacity of the URIs opens the door to new URI schemes, it
|
|
opens the door to excitingly different interpretations of
|
|
HTTP URI spaces. For example, servers can use the opaque
|
|
string to carry all kinds of parameters to spaces with new
|
|
topologies.
|
|
</p>
|
|
<p>
|
|
As a result of this axiom the many parts of metadata, that
|
|
information about the object that a client might be tempted
|
|
to infer from the actual sting value of the URI but can't,
|
|
have to be made available through the HTTP protocol. That is
|
|
the purpose of some of the headers of the HTTP protocol and
|
|
is discussed in the next section.
|
|
</p>
|
|
<p>
|
|
Another example of a reason for keeping the URI opaque is
|
|
that other address spaces for example within an HTTP servers
|
|
address space the rest of the URI can be used as a coded
|
|
representation of a name in some local space. Typically, when
|
|
that is done, when the server serves as a gateway into an
|
|
existing space, then it is extremely useful to be able to use
|
|
the string in any way consistent with the URI syntax rules to
|
|
represent coded names from the other space. The server can
|
|
encode, within the URI, complex locations in some legacy
|
|
system which is being mapped into an information space for
|
|
the first time. So, for example, names which come from names
|
|
in some sort of a database might by coincidence end up with
|
|
.html with no implication that there is a hypertext markup
|
|
language document involved, just that the particular encoding
|
|
used happened to produce that string of bits.
|
|
</p>
|
|
<h4>
|
|
<a name="Query" id="Query">Query strings</a>
|
|
</h4>
|
|
<p>
|
|
An important case is the treatment of the question mark in
|
|
HTML forms. There is a convention that infformation returned
|
|
from HTML forms is returned by encoding it and appending it
|
|
to the URI. The question mark within the URI is used to
|
|
separate the basic URI from parameters which are appended to
|
|
it to perform an operation. A typical use is for a search,
|
|
and the string following the question mark is often known as
|
|
a query string.
|
|
</p>
|
|
<p>
|
|
When a query string and fragment identifier are used, the
|
|
function evaluated on dereferencing a URL
|
|
</p>
|
|
<pre>
|
|
http://foo/bar?baz#frag
|
|
Is
|
|
select(get( "foo", "query("bar","baz")), "frag")
|
|
|
|
</pre>
|
|
<p>
|
|
where
|
|
</p>
|
|
<ul>
|
|
<li>query (resource, querystring) is evaluated by the
|
|
resource "bar"
|
|
</li>
|
|
<li>"bar" is opaque to all except the server "foo" ;
|
|
</li>
|
|
<li>"baz" is a format understood by client and by the
|
|
resource "foo/bar";
|
|
</li>
|
|
<li>get(server, restofuri) is executed by the client engine
|
|
which understands "foo" but not "bar"
|
|
</li>
|
|
<li>select(fragmentid, resource) is evaluated on the client
|
|
by the resource's handling code
|
|
</li>
|
|
</ul>
|
|
<p>
|
|
Query strings are clearly not opaque to the client. However,
|
|
they should be opaque to (for example) proxies.
|
|
</p>
|
|
<p>
|
|
Apart from searches, other operations are performed, for
|
|
example by those filling out HTML forms which are set up to
|
|
have an HTTP "GET" action. This is done in situations in
|
|
which the results of that operation of the URI are
|
|
quasi-static. In other words, the resource referred to by the
|
|
complete URI (including the query mark and the query string
|
|
after it) follows the axiom of slow change above: the result
|
|
of performing the operation is repeatable in some fashion.
|
|
</p>
|
|
<p>
|
|
It is tempting and often done to assume that the result of
|
|
such an operation will be more transient than that of a URI
|
|
with less or without a query string. To make this assumption
|
|
breaks the Opacity rule in general. Not only that, but this
|
|
is in many cases a completely wrong assumption. For example,
|
|
the query string is sometimes used to indicate parameters
|
|
such as a personalized sub-space which is being browsed.
|
|
Unfortunately, because the Opacity rule has been broken by
|
|
clients and caches which don't cache documents whose URIs
|
|
contain question marks, the question mark is sometimes been
|
|
deliberately inserted in order to defeat caches. This
|
|
creeping use of non-standard and axiom breaking conventions
|
|
could clearly be damaging to other systems which use the
|
|
question mark for other reasons for perfectly cacheable
|
|
documents.
|
|
</p>
|
|
<h2>
|
|
<a name="relative" id="relative">Hierarchies and Relative
|
|
URIs</a>
|
|
</h2>
|
|
<p>
|
|
While discussing the universality of Universal Resource
|
|
Identifiers, it is as well to discuss the place of the
|
|
Universal Syntax as this has been the source of some
|
|
misunderstanding as to the intent and advantages behind this.
|
|
The URI Syntax, now famous through its HTTP form uses slashes
|
|
to indicate a hierarchical structured name or address. Apart
|
|
from that, the strings between the slashes are opaque. There
|
|
is nothing to say that the string between a double slash and
|
|
a slash must be in all URI schemes a fully qualified domain
|
|
name; there is nothing to relate strings between single
|
|
slashes to parts of a unix file name. The reason that the
|
|
slashes have been instituted as common universal syntax for a
|
|
hierarchical boundary is that hierarchical schemes are common
|
|
and that relative naming within a hierarchical space has many
|
|
advantages.
|
|
</p>
|
|
<p>
|
|
Relative naming allows small groups of documents which are
|
|
located close within a tree to refer to each other without
|
|
being aware of their absolute position within any absolute
|
|
tree. It turns out that for scalability of the creation of
|
|
material on the Web, this is essential. This has been found
|
|
both for file names in most modern operating systems and for
|
|
HTTP URLs, and one can also reasonably assume that it will be
|
|
true for any other hierarchical scheme. Therefore, it is
|
|
important that the generic concept of a hierarchical scheme
|
|
is kept separate for future use from specific schemes which
|
|
involve possibly to be outdated forms such as fully qualified
|
|
domain names.<br />
|
|
</p>
|
|
<h4>
|
|
An example in using relative URIs
|
|
</h4>
|
|
<p>
|
|
Let us take, for example, the exercise of mapping an
|
|
international telephone number onto the URL. International
|
|
telephone numbers are hierarchical. For example, the meaning
|
|
of and the format of a telephone number depends on the
|
|
country, but there is a universal format for a telephone
|
|
number in the world which can be understood everywhere. This
|
|
format is, in fact, a plus sign indicating that one starts at
|
|
the top of the hierarchy: that this is an absolute
|
|
international telephone number. It is followed by the country
|
|
code, the area code (if any) and the telephone number.
|
|
Mapping this onto the URL syntax, the double slash would be
|
|
used to indicate that one is starting from the top of the
|
|
tree, so the number
|
|
</p>
|
|
<p>
|
|
+1 (617) 253-5708
|
|
</p>
|
|
<p>
|
|
would be written with a double slash instead of the plus and
|
|
then slashes at the other hierarchical boundaries.:
|
|
</p>
|
|
<pre>
|
|
phone://1/617/253-5708
|
|
|
|
</pre>
|
|
<p>
|
|
(The dash here is used for decoration. In practice people
|
|
like to break telephone numbers up in various ways for
|
|
readability, even though the punctuation has no hierarchical
|
|
significance.) Of course, there could be other mappings used,
|
|
but let us look at how this particular mapping using the
|
|
slashes would be used in relative URIs. Suppose we are in a
|
|
context in which that telephone number is the default.
|
|
Suppose, for example we have declared that that number is the
|
|
absolute base telephone number ("base URL") within a
|
|
conversation: typically, we are talking to somebody who lives
|
|
in the same area code.
|
|
</p>
|
|
<p>
|
|
Using the relative URL pausing rules, we can refer to another
|
|
local telephone number simple as, for example, 861-5000 with
|
|
no punctuation. This is just what we do in practice. We can
|
|
refer to a telephone number within the same country as
|
|
/800/123-4567. Although these are not quite the conventions
|
|
currently used for telephone numbers, they are just as
|
|
compact as the various conventions of putting brackets around
|
|
the area code, and would probably be parsed correctly by a
|
|
human.
|
|
</p>
|
|
<p>
|
|
To indicate an international number we simply start with a
|
|
double slash. For example,
|
|
</p>
|
|
<pre>
|
|
phone://41/22/767-6111.
|
|
|
|
</pre>
|
|
<p>
|
|
Now, suppose instead we had used another system. We had just
|
|
decided that for consistency, we would simply use the plus
|
|
sign and for example, parenthesis around an area code. This
|
|
would mean that whereas you can use the conventions of simply
|
|
omitting the area code for the local telephone number, and
|
|
you could use a plus sign to indicate an international
|
|
telephone number. If you put <code>phone:</code> in front of
|
|
it, to have it correctly parsed by a URL parser, you would
|
|
always have to use the full international form. Now, there
|
|
may be some who would prefer to always see the full
|
|
international form in telephone numbers because telephone
|
|
numbers are of fairly limited length. However, the principle
|
|
of relative names or local telephone numbers being useful is
|
|
established beyond question. There is also perhaps as much
|
|
public use of the double slash in URIs as there is of the
|
|
plus sign in international telephone numbers. (Within the
|
|
United States of America there seem to be relatively few
|
|
people who understand the significance of the plus sign or
|
|
for that matter know what their country code is!).
|
|
</p>
|
|
<p>
|
|
So, in general when looking at new naming schemes which may
|
|
have a hierarchical nature we should regard the slash and the
|
|
double slash as common syntax. It may be that we can
|
|
transition to a shorter form in which, for example, a double
|
|
slash is assumed after the colon in a fully qualified URL in
|
|
order to address the worry that the URL syntax is clumsy when
|
|
you include the scheme name prefix. <i><br /></i>
|
|
</p>
|
|
<h4>
|
|
<a name="myth1" id="myth1">Myth:</a>
|
|
</h4>
|
|
<p class="axiom">
|
|
Myth: "The // must only be used to introduce a fully
|
|
qualified domain name."
|
|
</p>
|
|
<h4>
|
|
Grandfathering hierarchies: generalizing the scheme
|
|
</h4>
|
|
<p>
|
|
It is worth noting that the syntax with the double slash can
|
|
in fact be extended for use with a triple slash if one wanted
|
|
to be able to start at any level in a much more complicated
|
|
hierarchical structure. For example, suppose international
|
|
telephone numbers were to be extended to cover a planetary
|
|
code in the future. Then the planetary code could be attached
|
|
to the front of the international code. The triple slash
|
|
could introduce the interplanetary code, and the double slash
|
|
would introduce the international code. Indeed, this is how
|
|
the double slash came to be: when hierarchical naming schemes
|
|
such as those in unix file systems was extended to a networks
|
|
file system on the Apollo domain the extra slash was
|
|
introduced. Similarly, Microsoft NT networking now uses
|
|
double backslash in exactly the same way.
|
|
</p>
|
|
<p>
|
|
RFC1630 is an information RFC I wrote about URIs in WWW
|
|
because getting consensus on the philosophy of all this in
|
|
open forum ws going to take a long time at best. It contains
|
|
an algorithm for parsing relative URIs which in fact would
|
|
pause a relative URI in an environment with any arbitrary
|
|
number of consecutive slashes. (The only problem with this
|
|
scheme is, like others which use the same delimiter for
|
|
beginning and ending strings or that one cannot represent an
|
|
empty string. This is already a problem with the file syntax
|
|
when an empty string is used for the host name resulting in
|
|
three consecutive slashes.)<i><br /></i>
|
|
</p>
|
|
<p>
|
|
To quote RFC1630:
|
|
</p>
|
|
<blockquote>
|
|
If the scheme parts are different, the whole absolute URI
|
|
must be given. Otherwise, the scheme is omitted, and:
|
|
<p>
|
|
If the partial URI starts with a non-zero number of
|
|
consecutive slashes, then everything from the context URI
|
|
up to (but not including) the first occurrence of exactly
|
|
the same number of consecutive slashes which has no greater
|
|
number of consecutive slashes anywhere to the right of it
|
|
is taken to be the same and so prepended to the partial URL
|
|
to form the full URL. Otherwise:
|
|
</p>
|
|
<p>
|
|
The last part of the path of the context URI (anything
|
|
following the rightmost slash) is removed, and the given
|
|
partial URI appended in its place, and then:
|
|
</p>
|
|
<p>
|
|
Within the result, all occurrences of "xxx/../" or "/." are
|
|
recursively removed, where xxx, ".." and "." are complete
|
|
path elements.
|
|
</p>
|
|
</blockquote>
|
|
<p>
|
|
The algorithm may not be perfect in its handling of "." and
|
|
"..", but it applies to any numbler of slashes.
|
|
</p>
|
|
<h3>
|
|
<a name="matrix" id="matrix">Matrix spaces and Semicolons</a>
|
|
</h3>
|
|
<p>
|
|
There are a lot of web sites in which documents -- often
|
|
virtual document -- vary along several dimensions. They are
|
|
naturally arranged not on a tree but on a matrix. The URI for
|
|
a map, for example, might be:<i><br /></i>
|
|
</p>
|
|
<pre>
|
|
<i> //moremaps.com/map/color;lat=50;long=20;scale=32000<br />
|
|
</i>
|
|
</pre>
|
|
<p>
|
|
(I had an idea to make special form of relative URIs for
|
|
these. See <a href="MatrixURIs.html">Matrix URIs</a> for the
|
|
idea, not a feature of the web as of 2001.)
|
|
</p>
|
|
<h2>
|
|
<a name="Properties" id="Properties">The properties of
|
|
different URI schemes</a>
|
|
</h2>
|
|
<p>
|
|
As noted above, the concept of a URI itself does not define
|
|
the particular identity properties which exist between a URI
|
|
and the resource associated with it. The <a href=
|
|
"Axioms.html#Universality">axiom above</a> leaves the owner
|
|
of the URI to define it. However, different URI schemes are
|
|
defined and implemented in different ways, and this itself
|
|
can impose restrictions on the mapping.
|
|
</p>
|
|
<p>
|
|
Some of the properties of URI to resource mappings which vary
|
|
from space to space were discussed in RFC1630. Some schemes
|
|
(such as HTTP) leave answers up to the information publisher
|
|
(URI owner).
|
|
</p>
|
|
<p>
|
|
There is a lot of flexibility and growth to be gained by
|
|
allowing any sort of URI, not one from a particular scheme,
|
|
in most circumstances. Similarly, one should not make
|
|
assumptions about the schemes involved. This is a facet of
|
|
the particular parameters about how the technology is used.
|
|
The choice of type URI in a pracical use of a language is an
|
|
important flexibility point.
|
|
</p>
|
|
<table border="1">
|
|
<caption>
|
|
Comparison of some URI schemes
|
|
</caption>
|
|
<tbody>
|
|
<tr>
|
|
<th>
|
|
Scheme prefix
|
|
</th>
|
|
<th>
|
|
Identity relationship: what does the URI correspond to?
|
|
</th>
|
|
<th>
|
|
Reuse
|
|
</th>
|
|
<th>
|
|
Persistence
|
|
</th>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
http
|
|
</td>
|
|
<td>
|
|
Geneneric document as :defined by publisher. Generic
|
|
URIs possible with content negotiation
|
|
</td>
|
|
<td>
|
|
defined by publisher
|
|
</td>
|
|
<td>
|
|
defined by publisher
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
ftp:
|
|
</td>
|
|
<td>
|
|
sequence of bits
|
|
</td>
|
|
<td>
|
|
defined by publisher
|
|
</td>
|
|
<td>
|
|
defined by publisher
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
uuid:
|
|
</td>
|
|
<td>
|
|
expectation of uniqueness has to be upheld by publisher
|
|
</td>
|
|
<td>
|
|
defined by publisher
|
|
</td>
|
|
<td>
|
|
(no dereference)
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
sha1:
|
|
</td>
|
|
<td>
|
|
sequence of bits.
|
|
</td>
|
|
<td>
|
|
mathematically extremely unlikely
|
|
</td>
|
|
<td>
|
|
(no dereference)
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
mid:
|
|
</td>
|
|
<td>
|
|
Email message. Should be 1:1 modulo recoding, and
|
|
header addition/deletion
|
|
</td>
|
|
<td>
|
|
Can happen after 2 years according to the spec, but
|
|
absolutely not recommended
|
|
</td>
|
|
<td>
|
|
(no derefernce)
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
mailto:
|
|
</td>
|
|
<td>
|
|
mailbox as used in email protocols
|
|
</td>
|
|
<td>
|
|
Socially unacceptable
|
|
</td>
|
|
<td>
|
|
(no dereference)
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
telnet:
|
|
</td>
|
|
<td>
|
|
connection endpoint for interactive login service
|
|
</td>
|
|
<td>
|
|
defined by publisher
|
|
</td>
|
|
<td>
|
|
(no dereference)
|
|
</td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<h3 id="it">
|
|
How not to do it
|
|
</h3>
|
|
<p>
|
|
Typical URI abuse by breaking this rule is occurs when a
|
|
document format provides one URI space for a "name"and one
|
|
for a "location".
|
|
</p>
|
|
<p>
|
|
<a href="uri1" urn="foo">
|
|
</p>
|
|
<p>
|
|
or for example the SGML reference to a "public identifier"
|
|
and a "system identifier".
|
|
</p>
|
|
<p>
|
|
The Web way is to have a reference to one URI. If in the same
|
|
document you want to incldue information such as other in
|
|
some ways equivalent identifiers, then you embed that in your
|
|
document as <a href="Metadata.html">metadata</a>, to be
|
|
discussed later.
|
|
</p>
|
|
<p>
|
|
That allows the exatct relationship to be expressed without
|
|
ambiguity, with much more pwer and generality, and with
|
|
consistency across applications.
|
|
</p>
|
|
<h3 id="also">
|
|
See also
|
|
</h3>
|
|
<ul>
|
|
<li>
|
|
<a href="/Provider/Style/URI.html">Cool URIs don't
|
|
change</a> - persistence in the HTTP space
|
|
</li>
|
|
<li>
|
|
<a href="AxiomsHAF">Hall of Flame</a> -- how not to do it
|
|
</li>
|
|
</ul>
|
|
<hr />
|
|
<p>
|
|
<small>$Id: Axioms.html,v 1.36 2009/03/17 17:25:35 timbl Exp
|
|
$</small>
|
|
</p>
|
|
<p>
|
|
<a href="Fragment.html">Next: Fragmement Identifiers</a>
|
|
</p>
|
|
<p>
|
|
<a href="Overview.html">Up to Design Issues</a>
|
|
</p>
|
|
</body>
|
|
</html>
|