<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 13), see www.w3.org" />
<title>
Universal Resource Identifiers -- Axioms of Web architecture
</title>
<link href="di.css" rel="stylesheet" type="text/css" />
<meta http-equiv="Content-Type" content=
"text/html; charset=us-ascii" />
</head>
<body bgcolor="#DDFFDD" text="#000000" lang="en" xml:lang="en">
<address>
Tim Berners-Lee
<p>
Date: January 1998
</p>
<p>
Status: personal view. Editing status: Spellchecked.
</p>
</address>
<p>
<a href="Overview.html">Up to Design Issues</a>
</p>
<h3>
Axioms of Web Architecture: 0
</h3>
<ul>
<li>
<a href="Model.html#Model">The Web model</a>
</li>
<li>
<a href="Model.html#Resource">Resources</a>
</li>
<li>
<a href="Model.html#Fragement">Fragment IDs</a>
</li>
<li>
<a href="Model.html">Document sets and relative
addressing</a>
</li>
<li>...
</li>
</ul>
<hr />
<h1>
<a name="Model" id="Model">The Web Model</a>
</h1>
<p>
The web is a very general concept -- one universal space of
information. The concepts it requires such as identifiers and
information resources (documents) are as general and abstract
as possible. However, there have been some design decisions
made which define some interfaces, and effectively define
modules or agents which are independent. These agents are
independent in many ways:
</p>
<ul>
<li>There is knowledge they have individually but do not
share
</li>
<li>There is knowledge their designers had individually but
did not share
</li>
</ul>
<p>
This is basic modularity. The interfaces are defined by the
data formats and protocols, and I have ranted about the
important features of the design in the linked articles in
this series. This modularity, the ability of different parts
of the system to change independently, shows up when
different specs are independent, such that you could change
one without having to change the other.
</p>
<h2>
<a name="Resource" id="Resource">The Information Resource</a>
</h2>
<p>
(Formerly, <a href="#Resource1">Resource</a>)
</p>
<p>
This is the current term for a certain unit of information in
the Web. In many cases on the current Web, thinking
"document" will do. It is something which conveys
information. The Web model is that information in the
information space is in the abstract chunked into addressable
things known as resources.
</p>
<p>
In the technical architecture, resources have identifiers,
Universal Resource Identifiers, and the properties of these
identifiers are elaborated later. In fact the concept of a
unit of information is central, not only in the technical
architecture, but in society's concepts of information, as a
document is not only the unit for reference, retrieval and
presentation (typically), but also the unit of ownership,
license to use, payment, confidentiality, endorsement, etc.
So though technically we can derive such things as compound
documents, generic documents, and resources which look
nothing like the typical notion of a "document", we have to
be able to support these social aspects of information at the
same time, so we can't mess with it too much.
</p>
<h2>
<a name="Fragement" id="Fragement">Fragment Id and "#"</a>
</h2>
<p>
In the hypertext architecture, when making a reference, such
as a hypertext link, we don't just refer to an information
resource. Well, we can, but we can also refer to a particular
part of or view of a resource. The string which, within the
document, defines the other end of the link has two parts. It
has the identifier of the document as a whole, and then
optionally it has a hash sign "#" and a string representing
the view of the object required. &nbsp;This suffix is called
a fragment identifier, even though it doesn't necessarily
represent a fragment of the document: it could
represent how the document should be viewed. The fragment
identifier only has relevance in the context of the web page
in question. This has an implication for how the software is
built. For example, an "access" module can be given just the
part of the URI without the fragment identifier. It gets the
information, and creates a software object for the hypertext
page. That object is passed the fragment identifier.
</p>
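<p>
A minimal sketch of this split, in Python, may help. The class
and function names ("HypertextPage", "access", "split_reference")
and the example.org URI are assumptions made for illustration;
only "urldefrag" is a real standard library call. The access
function never sees the fragment identifier, and the page object
never sees the URI.
</p>
<pre>
```python
from urllib.parse import urldefrag


def split_reference(uri):
    """Split a reference into (URI for the access module, fragment ID)."""
    base, fragment = urldefrag(uri)
    return base, fragment


class HypertextPage:
    """Hypothetical presentation object: it holds content, not its own URI."""

    def __init__(self, anchors):
        self.anchors = anchors  # maps a fragment ID to a place in the page

    def show(self, fragment):
        # Only this object interprets the fragment ID.
        if not fragment:
            return "top of page"
        return self.anchors.get(fragment, "top of page")


def access(uri_without_fragment):
    """Stand-in for retrieval; it is never handed the fragment ID."""
    assert "#" not in uri_without_fragment
    return HypertextPage({"Resource": "section: The Information Resource"})


base, fragment = split_reference("http://example.org/Model.html#Resource")
page = access(base)
view = page.show(fragment)
```
</pre>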
<p>
<img src="ParseHash.png" width="100%" alt=
"The URI is split off at the hash into a fragment ID and the rest"
border="0" />
</p>
<p>
In fact, analyzing the system a little more, the access
function can be broken into two steps: the underlying access,
and the creation of the object by passing two things to some
kind of object creator ("factory"): a data stream and a MIME type.
</p>
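<p>
The factory step could look roughly like the following Python
sketch. Everything here is hypothetical: the registry "FACTORIES",
the "PlainTextObject" class, and in particular the "lineno=N"
fragment form, which is invented for this example and is not
taken from any specification. The retrieval step is faked with
an in-memory byte stream.
</p>
<pre>
```python
import io


class PlainTextObject:
    """Hypothetical object for text/plain data."""

    def __init__(self, stream):
        self.lines = stream.read().decode("ascii").splitlines()

    def view(self, fragment):
        # "lineno=N" (1-based) is a fragment form made up for this sketch.
        prefix = "lineno="
        if fragment.startswith(prefix):
            return self.lines[int(fragment[len(prefix):]) - 1]
        return "\n".join(self.lines)


# The object creator ("factory") is chosen by MIME type alone.
FACTORIES = {"text/plain": PlainTextObject}


def create_object(stream, mime_type):
    return FACTORIES[mime_type](stream)


def underlying_access(uri):
    """Stand-in for retrieval: returns (data stream, MIME type)."""
    return io.BytesIO(b"first\nsecond\nthird\n"), "text/plain"


stream, mime_type = underlying_access("http://example.org/notes.txt")
obj = create_object(stream, mime_type)
selected = obj.view("lineno=2")
```
</pre>
<p>
Adding a new media type in this sketch means adding one entry to
the registry; the retrieval code is not touched.
</p>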
<h3>
Generally
</h3>
<p>
Hypertext is a specific application, but this principle works
for other applications on the Web. In fact, when we discuss
<a href="Webize">webizing</a> an application, we take some
computer language, and we take what were document-global
things, say global variables in a programming language, and
make them truly global by appending the URI of the document
and "#".
</p>
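<p>
As a small illustration, assuming a document at a made-up
example.org address that defines two local names, the
construction is plain string concatenation. The function name
"webize" is an assumption for this sketch.
</p>
<pre>
```python
def webize(document_uri, local_names):
    """Map each document-global name to a Web-global identifier."""
    return {name: document_uri + "#" + name for name in local_names}


global_names = webize("http://example.org/units.src", ["metre", "second"])
```
</pre>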
<p>
Clearly, in different applications the fragment identifier
will have a completely different function. The independence
here means that new applications (such as the Semantic Web)
can be built, just like the hypertext Web, just by introducing
new types of document.
</p>
<h2>
Independence
</h2>
<p>
The model of how the web works is that there are two separate
functions. &nbsp;The part (blue in the picture) which
accesses the document deals with its identifier, but does not
know what view will be required. &nbsp;It creates some
software object which represents and presents the resource.
That object does not need to know how it was created
(necessarily), and so does not need to know the URI it was
identified by. However, it does know how to interpret the
Fragment ID.
</p>
<p>
So we have two axioms:
</p>
<table border="1" cellpadding="2">
<tbody>
<tr>
<td>
The access machinery does not need to look at the
fragment ID.
</td>
<td></td>
</tr>
</tbody>
</table>
<table border="1" cellpadding="2">
<tbody>
<tr>
<td>
The presentation object does not need to know the URI
of the resource.
</td>
</tr>
</tbody>
</table>
<p>
The equivalent axioms&nbsp;when we are talking about
specifications amount to:
</p>
<table border="1" cellspacing="5" cellpadding="5">
<tbody>
<tr>
<td>
The specifications for access protocols are independent
of the specifications for fragment identifiers.
</td>
</tr>
</tbody>
</table>
<h3>
Why?
</h3>
<p>
For one thing, consider the special case of a link within a
document. &nbsp;In this case, the link <b>only</b> specifies
a fragment identifier. &nbsp;The object can follow the link
itself. &nbsp;It doesn't have to consult the access code in
order to figure out &nbsp;where the link goes to.
&nbsp;Because the "#" syntax is universal to all access
methods, the object can process the link internally.
&nbsp;For a static HTML file, for example, this means that
you can write an HTML file with internal links without
worrying or knowing about exactly what URIs the file will
get. &nbsp;It means you don't have to alter the file if you
choose to serve it in some new name or address space. &nbsp;If
the "#" syntax were not a universal specification for the web,
this would break: you couldn't do it. As Jim Gettys points
out, as the era of digitally signed documents comes upon us,
changing a signed document will break the signature on it. So
allowing one to make a self-consistent document with internal
links in a way independent of the namespace is even more
essential.
</p>
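<p>
The following Python sketch illustrates the point under some
loud assumptions: the bracketed link and anchor markup is
invented to keep the example short, the two addresses are made
up, and a SHA-256 digest stands in for a real digital signature.
The same bytes are served under two addresses, the object
follows its internal link without knowing either address, and
the digest is the same in both places.
</p>
<pre>
```python
import hashlib


class Page:
    """Hypothetical object holding document bytes and no URI."""

    def __init__(self, body):
        self.body = body

    def follow(self, link):
        # A link that is only a fragment ID is resolved by the object itself.
        if link.startswith("#"):
            marker = b"[anchor " + link[1:].encode("ascii") + b"]"
            return self.body.find(marker)
        raise ValueError("not an internal link; hand it to the application")

    def digest(self):
        # Stand-in for a signature over the unmodified bytes.
        return hashlib.sha256(self.body).hexdigest()


BODY = b"Intro, see [link #notes]. [anchor notes] The notes."

served = {
    "http://example.org/a.html": Page(BODY),
    "ftp://example.net/pub/b.html": Page(BODY),
}
offsets = {uri: page.follow("#notes") for uri, page in served.items()}
digests = {uri: page.digest() for uri, page in served.items()}
```
</pre>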
<h3>
Why else?
</h3>
<p>
This independence is very important for the evolution of the
Web. &nbsp;It means that people can go off and design all
kinds of new systems for naming, addressing and accessing
documents, without having to worry about what sort of
documents will be moved. &nbsp;It means that people can go
off and make new media types (MIME types), each of which can
have different concepts for views and fragments, without
having to talk to the people developing the access
technology. This has already (1998) proved incredibly
enabling to the community, as HTTP has advanced in parallel
with many other ways of accessing data, and the number of
exciting media types has grown very rapidly, and will be the
key to many new revolutions built on top of the basic Web
idea.&nbsp;
</p>
<p>
If you look at the diagram you will notice how the fragment
IDs are generated by and understood by just the one module.
&nbsp;You see how, when designing a new MIME type, one is
quite free to be creative in making new and powerful forms of
fragment ID, knowing that no other specifications will refer
to them, and nothing else will break.
</p>
<h2>
Document sets and relative addressing
</h2>
<p>
Now let us look at what happens when we follow a link.
&nbsp;For example, say a link in a hypertext page is clicked on.
&nbsp;The page has a representation of the end point of the
link. &nbsp;It hands it to the application. &nbsp;In fact,
often, there are links between pages whose URIs are very
similar and only differ in the right hand part. &nbsp;This
isn't true of all name spaces: for example, when making links
between news articles identified by their unique news ID
(news:foo), you have to specify the whole thing. However, if
you restrict publication of a set of documents to a
hierarchical name or address space, then you can arrange for
documents which are very related and have many links to be in
the same part of the tree.
</p>
<p>
In this case, the links between these documents are "relative
URIs".
</p>
<p>
What happens then is that the relative URI, which only has
the locally different part of the URI in it, is handed back
to what in the diagram I have called the "application", to be
turned into an absolute URI by being combined with the
absolute URI of the resource, which the application has
remembered.
</p>
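<p>
A rough Python sketch of that loop follows. The "Application"
class and the example.org addresses are assumptions for
illustration; "urljoin" and "urldefrag" are the standard library
calls used to combine and split the URIs. The application
remembers the absolute URI, resolves each relative reference
against it, and passes the fragment ID along without
interpreting it.
</p>
<pre>
```python
from urllib.parse import urldefrag, urljoin


class Application:
    """Hypothetical application that remembers the current absolute URI."""

    def __init__(self):
        self.current_uri = None

    def open(self, reference):
        # Combine a relative reference with the remembered absolute URI.
        if self.current_uri is not None:
            reference = urljoin(self.current_uri, reference)
        base, fragment = urldefrag(reference)
        self.current_uri = base
        # base goes to the access module, fragment goes to the object.
        return base, fragment


app = Application()
first = app.open("http://example.org/DesignIssues/Model.html")
second = app.open("Fragment.html#persistence")
```
</pre>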
<p>
Note that the application is aware of the absolute URI but
the resource object still does not have to be.
</p>
<p>
Note that the fragment id is still circulated around a loop
between the object (green), which understands it, and the
application (yellow), which handles it transparently but does
not understand or change it.
</p>
<p>
Now there was a design decision to make here. The application
could have passed to the access module both the relative URI
and the absolute URI. Then, different namespaces would have been
able to have different algorithms for resolving a base URI
and a relative URI into a new absolute URI. But the decision
was made that the relative address format should be common
across all name spaces.
</p>
<p>
<img src="Parse2.png" width="100%" alt=
"The URI is split off at the hash into a fragment ID and the rest"
border="0" />
</p>
<h3>
Why?
</h3>
<p>
Just as we considered internal links above, now consider
relative links between a bunch of documents, like the
sections of a book, which are close in the tree. &nbsp;In
practice, such document sets are moved from place to place,
from file systems into HTTP space or FTP space, and because
the relative address rules are universal, the documents do
not have to be modified every time they are moved. (Yes, if
you move half the set to one place and half to another, you
have to fix links). &nbsp;This is happening all the time.
&nbsp;People are creating and programs are generating
hypertext with relative links without knowing or caring what
absolute URI will be used to refer to the material.
</p>
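<p>
To make this concrete, here is a short Python sketch in which
one unchanged relative link is resolved against three
hypothetical locations of the same book. The paths and host
names are made up, and Python's "urljoin" is used simply as a
convenient implementation of the common relative address rules.
</p>
<pre>
```python
from urllib.parse import urljoin

# The link as written in chapter 1; it is never edited.
RELATIVE_LINK = "chapter2.html"

# Hypothetical locations of chapter 1 as the book is moved around.
bases = [
    "file:///home/author/book/chapter1.html",
    "http://example.org/book/chapter1.html",
    "ftp://example.org/pub/book/chapter1.html",
]

resolved = [urljoin(base, RELATIVE_LINK) for base in bases]
```
</pre>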
<h2>
The access scheme
</h2>
<p>
<img src="Parse3.png" width="100%" alt=
"The URI is split off at the hash into a fragment ID and the rest"
border="0" />
</p>
<p>
The so-called "access scheme" is the first part of the URI.
As we have seen above, you don't have to know anything about
it to parse relative URIs or to process the fragment
identifier of a URI. The knowledge of particular schemes is
limited to the "access" function (blue in the above diagram).
</p>
<p>
The scheme is a very important flexibility point, and should
not be abused. Anyone dereferencing a URI must have a
knowledge of the scheme it uses.
</p>
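<p>
One way to picture this, as a sketch only: the access function
holds a table keyed by scheme, and nothing outside it consults
that table. The handler functions below are placeholders that
return a string instead of doing any retrieval, and all names
and addresses are assumptions for illustration.
</p>
<pre>
```python
from urllib.parse import urlsplit


def fetch_http(uri):
    return "fetched via HTTP: " + uri  # placeholder, no network access


def fetch_ftp(uri):
    return "fetched via FTP: " + uri  # placeholder, no network access


# Knowledge of particular schemes lives only in this table.
ACCESS_HANDLERS = {"http": fetch_http, "ftp": fetch_ftp}


def access(uri):
    scheme = urlsplit(uri).scheme
    handler = ACCESS_HANDLERS.get(scheme)
    if handler is None:
        # Dereferencing needs knowledge of the scheme; without it we stop.
        raise LookupError("no access code for scheme: " + scheme)
    return handler(uri)


result = access("http://example.org/Model.html")
```
</pre>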
<p>
The access scheme defines a huge part of URI space. The
scheme defines a subspace with particular properties.
</p>
<p>
The access scheme is <i>by definition</i> the highest point
of flexibility. What does that mean? It means that if the
whole Web develops problems which we cannot solve within the
existing protocols, or if new spaces are designed which
really can't be accessed through or mapped into existing
spaces, then we can create a new space. We have faith that we
will be able to use this flexibility point in the future,
because it worked successfully for integrating the older
spaces such as Gopher and FTP spaces into the Web.
</p>
<table border="1" cellpadding="2">
<tbody>
<tr>
<td>
If you have ported a concept between environments in
the past, then there is a better hope that you can in
the future.
</td>
</tr>
</tbody>
</table>
<h3>
The danger of too many access schemes
</h3>
<p>
However, we do not do this lightly. When we introduce a new
space, it may have very different properties and we expect
that the deployment of new software will be needed to allow
access to it. Some spaces may be gatewayable into HTTP space,
and this will often provide a transition path. This is why
early browsers allowed one to declare in a configuration file
what gateways to use for what new spaces.
</p>
<p>
If we use this extension point frivolously, ironically, it
will cease to work. Suppose very many schemes are introduced.
The access scheme space itself becomes a namespace with all
the problems which current namespaces such as DNS are trying
to solve, but which are very hard problems:
</p>
<ul>
<li>Clashes in the namespace would destroy interoperability;
</li>
<li>Ownership of the space becomes commercially valuable;
</li>
<li>Democratic and fair management becomes essential and
difficult;
</li>
</ul>
<p>
Worse, though, technology will be needed to automatically
dereference the schemes themselves and download code to
handle them. Something like DNS will be needed. The top level
namespace then becomes in fact DNS, or something like it.
This, however, begs the question. What happens if later DNS
needs to be replaced? There is no top-level extension switch
left. The world is stuck with whatever form of access-scheme
name service exists.
</p>
<p>
Therefore, I conclude that access schemes should not be open
to trivial extension, and that the access scheme should only
be extended by the introduction of new standards with full
open review by the entire community.
</p>
<h3>
Alternatives to new schemes
</h3>
<p>
Whereas some schemes (like "data:") are clearly neat and new
and orthogonal to HTTP, many schemes could in fact be
integrated into HTTP, using HTTP extension mechanisms.
</p>
<p>
In fact, if HTTP is to be taken as a general computing
protocol, then use of an <a href="Extensible.html">extensible
language system</a> for the HTTP request message would allow
a huge amount of extension, covering protocols with different
functionality (exporting different interfaces).
</p>
<h3>
Evolving scheme spaces
</h3>
<p>
When considering the evolution of a space, it is important to
remember that primarily the access scheme refers to a part of
the URI space, and secondarily it refers to a protocol.
Therefore, one can in fact change the protocols used to
access resources within a scheme's namespace, without
changing the space. For example, a new DNS protocol could be
introduced which over time would replace the current one,
without changing the DNS space. This would effectively
redefine the HTTP and FTP protocols, but would not harm the
namespaces. When touch-tone dialing was introduced, the
telephone numbering system remained the same. So an indexing
system could be introduced which, when deployed, would allow
http:// space objects to be found with greater reliability or
speed than the current protocols, while maintaining the HTTP
space as being the concatenation of a DNS name and an opaque
string.
</p>
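<p>
A hedged Python sketch of this separation: the name is split
into a DNS name and an opaque string, and the lookup mechanism
is passed in as a parameter. Both lookup functions are
hypothetical placeholders that only report which mechanism
would be used; the point is that swapping one for the other
leaves the name unchanged.
</p>
<pre>
```python
from urllib.parse import urlsplit


def split_http_name(uri):
    """Treat an http URI as a DNS name plus an opaque string."""
    parts = urlsplit(uri)
    opaque = parts.path + ("?" + parts.query if parts.query else "")
    return parts.hostname, opaque


def direct_lookup(host, opaque):
    return ("direct", host, opaque)  # placeholder for the current protocol


def index_lookup(host, opaque):
    return ("index", host, opaque)  # placeholder for a later indexing system


def dereference(uri, lookup):
    host, opaque = split_http_name(uri)
    return lookup(host, opaque)


NAME = "http://www.example.org/a/b?x=1"
old_way = dereference(NAME, direct_lookup)
new_way = dereference(NAME, index_lookup)
```
</pre>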
<hr />
<h2>
Footnote
</h2>
<h4>
<a name="Resource1" id="Resource1">Resource</a>
</h4>
<p>
The word "document" in the original "Universal Document
Identifier" in the first web spec was changed to "Resource"
in the IETF discussions, because (a) the word "document"
didn't seem to cover all kinds of information resources such
as movies and sounds, and (b) actually URIs exist for
communication endpoints such as mailboxes (mailto:) and login
ports (telnet:). "Resource" was, though, later used by RDF as
a term for anything - the top class which is the superclass
of all classes. This stemmed from RDF's initial use as a
language for describing information resources on the Web,
although RDF was designed to be used to describe anything as
a general knowledge representation system. The term
"Information Resource" was adopted by the TAG for the Web
Architecture document. When people, including the author in
the article above, refer to an information resource, they
often
</p>
<h2>
Related material elsewhere in these notes
</h2>
<p>
<i>Content/Version negotiation and Fragment ID persistence:
warnings and awareness.</i> See <a href=
"Fragment.html">Fragment Identifiers</a>
</p>
<p>
<i>&nbsp;If you negotiate between MIME types which have
different fragment ID representations, you run a risk &amp;
should warn the client.</i>
</p>
<p>
To be added:
</p>
<p>
<i>Level breaking with care: optimizing in HTTPNG etc</i>
</p>
<hr />
<p>
<a href="Overview.html">Up to Design Issues</a>, On to URIs
</p>
</body>
</html>