<html xmlns="http://www.w3.org/1999/xhtml">
|
|
<head>
|
|
<meta name="generator" content=
|
|
"HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 13), see www.w3.org" />
|
|
<title>
|
|
What do HTTP URIs Identify? - Design Issues
|
|
</title>
|
|
<link rel="Stylesheet" href="di.css" type="text/css" />
|
|
<meta http-equiv="Content-Type" content=
|
|
"text/html; charset=us-ascii" />
|
|
</head>
|
|
<body bgcolor="#DDFFDD" text="#000000" lang="en" xml:lang="en">
|
|
<address>
|
|
Tim Berners-Lee<br />
|
|
Date: 2002-07-27, last change: $Date: 2007/01/15 20:05:15
|
|
$<br />
|
|
Status: personal view only. Editing status: first draft. This
|
|
was a result of my being in a minority with this opinion on
|
|
the Technical Architecture Group, and yet finding it the only
|
|
one I could accept. This is related to TAG issue
|
|
HTTPRange-14.
|
|
</address>
|
|
<p>
|
|
<a href="./">Up to Design Issues</a>
|
|
</p>
|
|
<p>
|
|
<strong>Note: (2006). This architectural question has now
|
|
been <a href=
|
|
"http://lists.w3.org/Archives/Public/www-tag/2005Jun/0039.html">
|
|
decided</a> by the W3C TAG, in a compromise which I think
|
|
works quite well, and is described in a <a href=
|
|
"HTTP-URI2">later short note</a> and a TAG finding.</strong>
|
|
</p>
|
|
<hr />
|
|
<h1>
|
|
What do HTTP URIs Identify?
|
|
</h1>
|
|
<h3>
|
|
Background Note
|
|
</h3>
|
|
<p>
|
|
This question has been addressed only vaguely in the
specifications. However, the lack of a very concise logical
definition of such things had not been a problem until
formal systems started to use them. There were no formal
systems addressing this sort of issue (as far as I know,
except for Dan Connolly's Larch work [@@]) until the
<a href="/2001/sw">Semantic Web</a> introduced languages such
as RDF, which have well-defined logical properties and are
used to describe (among other things) web operations.
|
|
</p>
|
|
<p>
|
|
The efforts of the <a href="/2001/tag">Technical Architecture
Group</a> to create an architecture document with common
terms highlighted this problem. (It demonstrates the
ambiguity of natural language that no significant problem had
been noticed over the past decade, even though the original
author of HTTP and the later co-author of HTTP 1.1, who also did
his PhD thesis on an analysis of the web, both of whom
have worked with Web protocols ever since, had
conflicting ideas of what the various terms actually mean.)
|
|
</p>
|
|
<p>
|
|
This document explains why the author finds it difficult to
|
|
work in the alternative proposed philosophies. If it
|
|
misrepresents those others' arguments, then it fails, for
|
|
which I apologize in advance and will endeavor to correct.
|
|
</p>
|
|
<h2>
|
|
1. Web Concepts as here proposed
|
|
</h2>
|
|
<p>
|
|
The WWW is a space of information objects. The URI was
originally called a UDI, and originally all URIs identified
information objects. Now, URI schemes exist which identify
more or less anything (e.g. UUIDs) or electronic mailboxes
(mailto:), but if we look purely at HTTP URIs, they define a
web of information objects. Information objects -- perhaps in
Cyc terms <a href="">ConceptualWorks</a> -- are normally
things which
|
|
</p>
|
|
<ul>
|
|
<li>Carry some sort of message, and
|
|
</li>
|
|
<li>Can be represented, to a greater or lesser authenticity,
|
|
in bits
|
|
</li>
|
|
</ul>
|
|
<p>
|
|
I want to make it clear that such things are generic (see
<a href="/DesignIssues/Generic">Generic Resources</a>) --
while they are documents, they generally are abstractions
which may have many different bit representations, as a
function of, for example:
|
|
</p>
|
|
<ul>
|
|
<li>Time -- the contents can vary with revision
|
|
</li>
|
|
<li>Content-type in which the bits are encoded
|
|
</li>
|
|
<li>Natural language in which a human-readable document is
|
|
written
|
|
</li>
|
|
<li>Machine language in which a machine-processable document
|
|
is written
|
|
</li>
|
|
<li>and a few more
|
|
</li>
|
|
</ul>
|
|
<p>
|
|
but the philosophy is that an HTTP URI may identify something
with a vagueness as to the dimensions above, but it still
must be used to refer to a unique conceptual object whose
various representations have a very large amount in common.
Formally, it is the publisher which defines what an HTTP
URI identifies, and so one should look to the publisher for a
commitment as to the exact nature of the identity along these
axes.
|
|
</p>
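<p>
As a rough illustration only, the idea of one document with many
bit representations can be sketched in Python. The class names,
the example URI, the placeholder bodies and the selection rule
are all assumptions made for this sketch; none of them comes from
an HTTP specification or library.
</p>

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Representation:
    """One concrete bit encoding of a document (hypothetical model)."""
    content_type: str
    language: str
    body: bytes


@dataclass
class Document:
    """The abstract thing that one HTTP URI identifies in this model."""
    uri: str
    representations: list = field(default_factory=list)

    def select(self, content_type, language):
        """Pick a representation along the content-type and language axes."""
        for rep in self.representations:
            if (rep.content_type, rep.language) == (content_type, language):
                return rep
        raise LookupError("no representation matches the request")


# Hypothetical example: one poem, one URI, three representations.
poem = Document(
    uri="http://example.com/poem",
    representations=[
        Representation("text/plain", "en", b"(English text of the poem)"),
        Representation("text/plain", "fr", b"(French text of the poem)"),
        Representation("text/html", "en", b"(English text, marked up)"),
    ],
)

english = poem.select("text/plain", "en")
french = poem.select("text/plain", "fr")
```

<p>
The bits returned differ from request to request, but the URI, and
the document it identifies, stay the same throughout.
</p>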
|
|
<p>
|
|
I'm going to refer to this as a <strong>document</strong>,
because it needs a term and that is the best I have to date,
but the reader should be sure to realize that this does not
mean a conventional office document; it can be, for example,
|
|
</p>
|
|
<ul>
|
|
<li>A poem
|
|
</li>
|
|
<li>An order for ball bearings
|
|
</li>
|
|
<li>A painting
|
|
</li>
|
|
<li>A movie
|
|
</li>
|
|
<li>A review of a movie
|
|
</li>
|
|
<li>A sound clip
|
|
</li>
|
|
<li>A record of the temperature of the furnace
|
|
</li>
|
|
<li>An array of a million integers, all zero
|
|
</li>
|
|
</ul>
|
|
<p>
|
|
and so on, as limited only by our imagination.
|
|
</p>
|
|
<p>
|
|
The Web works because, given an HTTP URI, one can, in a large
number of cases, get a representation of the document. For a
|
|
human readable document, the person is presented with the
|
|
information by virtue of some gadget which is given the bits
|
|
of a representation. In the case of a hypertext document, a
|
|
reference to another document is encoded such that, upon user
|
|
request, the referenced document can in turn be automatically
|
|
presented. In the case of a machine-readable document,
|
|
identifiers of concepts, being HTTP URIs, will often allow
|
|
definitive reference information about those concepts to be
|
|
pulled in to guide further actions.
|
|
</p>
|
|
<p>
|
|
The web, then, is made of documents as the internet is made
|
|
of cables and routers. The documents can be about anything,
|
|
so when we move to talk about the contents of documents we
|
|
break away from talking about information space and the whole
|
|
universe of human -- and machine -- discourse is open to us.
|
|
Web pages can compare renaissance choral works with jazz
|
|
pop hits, and discuss whether pigs have wings.
|
|
Machine-processable documents can encode information about
|
|
shoes, and ships, and sealing-wax. Until recently, the
|
|
Internet protocol standards out of which the Web is built had
|
|
little to say about such things. They were concerned only
|
|
with the human-readable side, so it was people, reading
|
|
natural language (not internet specs) who formed and
|
|
communicated the concepts at this level. Nowadays, however,
|
|
semantic web languages allow information to be expressed not
|
|
only about URIs, TCP ports and documents, but also about
|
|
arbitrary concepts - the shoes, and ships and sealing wax,
|
|
and whether pigs have wings. Simple semantic web applications
|
|
allow one to order shoes and travel on ships, and determine
|
|
that, given the data, pigs do not have wings.
|
|
</p>
|
|
<p>
|
|
For these purposes it is of course quite essential to
|
|
distinguish between something described by a document and the
|
|
document itself. Now that we -- for the first time -- have
|
|
not only internet protocols which can talk about documents but
|
|
also those which talk about real world things, we must either
|
|
distinguish or be hopelessly fuzzy.
|
|
</p>
|
|
<p>
|
|
And is this bad? Is it an inhibition to have to work our way
through documents before we can talk about whatever we desire?
|
|
I would argue not, because it is very important not to lose
|
|
track of the reasons for our taking and processing any piece
|
|
of information. The process of publishing and reading is a
|
|
real social process between social entities, not mechanical
|
|
agents. To be socially responsible, to be able to handle
|
|
trust, and so on, we must be aware of these operations. The
|
|
difference between a car and what some web page says about it
|
|
is crucial - not only when you are buying a car.
|
|
</p>
|
|
<p>
|
|
Some have opined that the abstraction of the document is
|
|
nonsense, and all that exists, when a web page describes a
|
|
car, is the car and various representations of it, the HTML,
|
|
PNG and GIF bit streams. This is however very weak in my
|
|
opinion. The various representations have much more in common
|
|
than simply the car. And the relationship to the car can be
|
|
many and varied: home page, picture, catalog entry, invoice,
|
|
remote control panel, weblog, and so on. The document itself
|
|
is an important part of society - to dismiss its existence is
to prevent us being aware of the human aspects of information
without which we are impoverished. By contrast, the
difference between different representations of the document
(GIF or PNG image for example) is very small, and the
relationship between versions of a document which changes
through time is a very strong one.
|
|
</p>
|
|
<h2>
|
|
2. Trying out the Alternatives
|
|
</h2>
|
|
<p>
|
|
The folks who disagree with the model do so for a number of
different reasons. This article, therefore, will have to
take them one by one, but the ones which come to mind are as
follows:
|
|
</p>
|
|
<ol>
|
|
<li>
|
|
<a href="#L728">Every web page (or many of them) are in
|
|
fact themselves representations of some abstract thing, and
|
|
the URI really identifies that</a> thing, not a document at
|
|
all.
|
|
</li>
|
|
<li>
|
|
<a href="#L876">There are many levels of identification
|
|
(representation as a set of bits, document, car which the
|
|
web page is about) and the URI publisher, as owner of the
|
|
URI, has the right to define it to mean whatever he or she
|
|
likes;</a>
|
|
</li>
|
|
<li>
|
|
<a href="#L883">Actually the URI has to, like in English,
|
|
identify these different things ambiguously. Machines have
|
|
to disambiguate using common sense and logic</a>
|
|
</li>
|
|
<li>
|
|
<a href="#L890">Actually the URI has to, like in English,
|
|
identify these different things ambiguously. Machines have
|
|
to disambiguate using the fact that different properties
|
|
will refer to different levels</a>.
|
|
</li>
|
|
<li>
|
|
<a href="#L897">Actually the URI has to, like in English,
|
|
identify these different things ambiguously. Machines have
|
|
to disambiguate using extra information which will be
|
|
provided in other ways along with the URI</a>
|
|
</li>
|
|
<li>
|
|
<a href="#L909">Actually the URI has to, like in English,
|
|
identify these different things ambiguously. Machines have
|
|
to disambiguate them by context: A catalog card will talk
|
|
about a document. A car catalog will talk about a car</a>.
|
|
</li>
|
|
<li>
|
|
<a href="#L920">They may have been used to identify
|
|
documents up till now, but for RDF and the Semantic Web, we
|
|
should change that and start to use them as the Dublin Core
|
|
and RDF Core groups have for abstract concepts</a>.
|
|
</li>
|
|
</ol>
|
|
<h3 id="L728">
|
|
2.1 Identify abstract things not documents
|
|
</h3>
|
|
<p>
|
|
Let's take the alternatives in order. These alternatives all
|
|
make sense. Each one, however, has problems I can't see any
|
|
way around when we consider them as a basis as
|
|
</p>
|
|
<p>
|
|
The first was,
|
|
</p>
|
|
<blockquote>
|
|
<p>
|
|
Every web page (or many of them) are in fact themselves
|
|
representations of some abstract thing, and the URI really
|
|
identifies that thing, not a document at all.
|
|
</p>
|
|
</blockquote>
|
|
<p>
|
|
Well, that wasn't the model I had when URIs were invented and
|
|
HTTP was written. However, let's see how it flies. If we
|
|
stick with the principle that a URI (or URIref) must
|
|
unambiguously identify the same thing in any context, then we
|
|
come to the conclusion that URIs can not identify the web
|
|
page. If a web page is about a car, then the URI can't be
|
|
used to refer to the web page.
|
|
</p>
|
|
<h4>
|
|
2.1.1 <a name="s2.1.1" id="s2.1.1">Same URI can identify a
|
|
web page and a car</a>
|
|
</h4>
|
|
<p>
|
|
What, a web page can't be a car? At this point a pedantic
line of reasoning suggests that we should allow web pages and
cars to conceptually overlap, so that something can be both.
This is counterintuitive, as a web page is, in common sense,
not a concrete object whereas a car is. But sure, we could
construct a mathematics in which we use the terms rather
specially and something can be at the same time a web page
and a car.
|
|
</p>
|
|
<p>
|
|
Frankly, this doesn't serve the social purpose of the
semantic web, to be able to deal with common sense concepts
and objects. A web page about a car and a car are in most
people's minds quite distinct (as I argue further below). A
philosophy in which they are identical does not allow me to
distinguish between them. It not only conflicts with reality as
I see it, but also leaves us no way to make statements
individually about the two things.
|
|
</p>
|
|
<h4>
|
|
<img alt=
|
|
"A car has a different identifier -- and very different properties."
|
|
src="diagrams/http-uri-1.png" />
|
|
</h4>
|
|
<h4>
|
|
2.1.2 <a name="identifies" id="identifies">The URI identifies
|
|
the car, not the web page</a>
|
|
</h4>
|
|
<p>
|
|
So let's fall back on the idea that the URI identifies the
|
|
<em>subject</em> of the web page, but not the web page
|
|
itself. This makes sense. We can build the semantic web on
|
|
top of that easily.
|
|
</p>
|
|
<p>
|
|
The problem with this is that there are a large number of
|
|
systems which already do use URIs to identify the document.
|
|
This is the whole metadata world. Think of a few:
|
|
</p>
|
|
<ul>
|
|
<li>The Dublin Core
|
|
</li>
|
|
<li>RSS
|
|
</li>
|
|
<li>The HTTP headers
|
|
</li>
|
|
<li>The Adobe XML system
|
|
</li>
|
|
<li>Access control systems
|
|
</li>
|
|
</ul>
|
|
<p>
|
|
(I'm sticking with the machine-processable languages as
examples because human-processable ones like HTML have a
level of ambiguity traditional in human natural language but
quite out of place in the WWW infrastructure -- or the
Semantic Web. You can argue that people say "I work for
w3.org" or "http://www.amazon.com/shrdlu?asin=314159265359
is a great book", just as they happily say "<em>Moby Dick</em>
weighs over three thousand tons", "<em>Moby Dick</em> was
finished over a century ago" and "I left <em>Moby Dick</em>
on the beach" without expecting to be misunderstood. So we
won't use human language as a guide when defining
unambiguously the question of what a URI identifies. If we
want to do that on the Semantic Web, we will say "I work for
<em>the organization whose home page is</em>
http://www.w3.org".)
|
|
</p>
|
|
<p>
|
|
Some argue that the URI which I associate with someone's home
page actually identifies that person. They argue that
conventionally people use the identifier to identify the
person. However, consider another page put together by
friends who found a photograph of the same person. A lot of
content filtering systems would collect that URI and put it
into their list. Even though the photo had many
representations which different devices could download using
content negotiation and/or CC/PP (color or black and white,
and versions of different resolutions), the URI itself would
be listed as containing nudity. The public are very aware of
different works on the web, even though they have the same
topic.
|
|
</p>
|
|
<h4>
|
|
2.1.3 <a name="Indirect" id="Indirect">Indirect
|
|
identification</a>
|
|
</h4>
|
|
<p>
|
|
You can argue that a web page <em>indirectly</em> identifies
|
|
something, of course, and I am quite happy with that. If you
|
|
identify an organization as that which has home page
|
|
http://www.w3.org, then you are not saying that
|
|
http://www.w3.org/ itself is that organization. This scenario
|
|
is very very common, just as we identify people and things by
|
|
their "unambiguous properties": books by ISBN, people by
|
|
email address, and so forth. So long as we don't think that
|
|
the person <em>is</em> an email address, we are fine. Some
|
|
people have thought that in saying "An HTTP URI can't
|
|
identify an organization" I was ruling out this indirect
|
|
identification, but not so: I am very much in favor of it.
|
|
The whole SQL world, after all, only identified things
|
|
indirectly by a key property. This causes no contradiction.
|
|
Perhaps I should say "An HTTP URI can't directly identify an
|
|
organization". But by "identify" I mean "directly identify",
|
|
and "identity" is a fairly direct word and concept, so I will
|
|
stick with it.
|
|
</p>
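<p>
Indirect identification can be sketched as a lookup by key
property, much as one would do with a database key. The triples,
the property names (<code>homePage</code>, <code>mailbox</code>,
<code>worksFor</code>) and the local labels below are invented for
this illustration and are not from any real vocabulary.
</p>

```python
# Each statement is a (subject, property, value) triple. Labels such as
# "_:org1" are local placeholders for things which have no URI of their own.
statements = [
    ("_:org1", "homePage", "http://www.w3.org/"),
    ("_:person1", "mailbox", "mailto:alice@example.org"),
    ("_:person1", "worksFor", "_:org1"),
]


def thing_with(prop, value, triples):
    """Return the single subject which has the given key property value."""
    matches = {s for (s, p, v) in triples if p == prop and v == value}
    if len(matches) != 1:
        raise LookupError("value does not pick out exactly one thing")
    return matches.pop()


# The organization is found through its home page; the URI itself
# still identifies only the page.
org = thing_with("homePage", "http://www.w3.org/", statements)
employer = thing_with("mailbox", "mailto:alice@example.org", statements)
```

<p>
Here <code>org</code> is the placeholder for the organization, not
the string http://www.w3.org/, which is the point of the
distinction between direct and indirect identification.
</p>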
|
|
<p>
|
|
Conclusion so far: the idea that a URI identifies the thing
the document is about doesn't work, because we can only use a
URI to identify one thing and we have used it, and already do
use it, to identify documents on the web.
|
|
</p>
|
|
<h4>
|
|
2.1.4 <a name="argument" id="argument">The argument for HTTP
|
|
URIs identifying a Conceptual Work</a>
|
|
</h4>
|
|
<p>
|
|
So what's wrong with the URI being taken to identify whatever
|
|
the owner says?
|
|
</p>
|
|
<p>
|
|
Let's look at what we mean by <em>identifies</em>. When we
|
|
say there is identity, that means that there is some form of
|
|
sameness that we associate with the identifier. Now, for all
|
|
the philosophical argument, we can never test the identity of
|
|
an abstract thing. What we can test is a representation which
|
|
has been returned by the server when given that URI. When we
|
|
use a URI, and get back several possible representations of
|
|
it, then what expectation do we have about those
|
|
representations?
|
|
</p>
|
|
<p>
|
|
Take the test case that I see a web page which has a
picture of a car, and I see the URI in the URL bar of the
browser. I email you the URI, saying "You see, the car is a
Toyota?". You click on the link. Your browser shows the same
URI as mine in the "URL bar" but you see a table of the car's
weight, length, height, color, and registration number. We
are confused. The web didn't work because you didn't get the
same information as me. I expected you to get the same
information, basically. That is how the Web works. That is
the expectation behind every hypertext link - that the
follower of the link should get basically the same
information as the person who made the link. I say
"basically" because I would not have cared whether you saw a
JPEG or a GIF. It probably wouldn't have mattered if you had
seen a lower resolution or even black-and-white copy of the
picture. If you are visually impaired, you may have been able
to manage with a well-written description of the picture. But
the essential information is the same, not just the
subject of the page.
|
|
</p>
|
|
<p>
|
|
So now we have put the four corners on the expectation we
|
|
have of a URI -- that all representations have essentially
|
|
the same <em>information content</em>. And what we mean by
|
|
"essentially" allows in fact some wriggle room, and in the
|
|
end it rests on a common understanding between publisher of
|
|
the information and quoter of the URI. The sameness we are
|
|
after is the sameness of information content. <em>That</em>
|
|
is what is identified by the URI. That is why we say that the
|
|
URI identifies that conceptual information content,
|
|
irrespective of its particular representation: the
|
|
<em>conceptual work</em>. Without that common understanding,
|
|
the web does not work.
|
|
</p>
|
|
<p>
|
|
Some people have said, "If we say that URIs identify people,
|
|
nothing breaks". But all the time they, day to day, rely on
|
|
sameness of the information things on the web, and use URIs
|
|
with that implicit assumption. As we formalize how the web
|
|
works, we have to make that assumption explicit.
|
|
</p>
|
|
<h3 id="L876">
|
|
2.2 Author definition
|
|
</h3>
|
|
<p>
|
|
So how can we break free of that line of reasoning? We can
|
|
try throwing away the rule that a URI identifies only one
|
|
thing.
|
|
</p>
|
|
<blockquote>
|
|
<p>
|
|
There are many levels of identification (representation as
|
|
a set of bits, document, car which the web page is about)
|
|
and the URI publisher, as owner of the URI, has the right
|
|
to define it to mean whatever he or she likes.
|
|
</p>
|
|
</blockquote>
|
|
<p>
|
|
Well, this one is tempting from the point of view that the
|
|
owner of an identifier should reign supreme when it comes to
|
|
saying what it identifies. It is quite a logically consistent
|
|
position to take. After all, isn't this the case with
|
|
<code>uuid</code>'s? And for a new scheme, this would be
|
|
interesting. How can we do it, though, with HTTP? The problem
|
|
is an engineering one: I can't in practice use a URI until I
|
|
have some definitive information from the publisher as to
|
|
what it identifies.
|
|
</p>
|
|
<h4>
2.2.1 Default
</h4>
|
|
<p>
|
|
Why can't a URI default to identifying a web page until you
|
|
know otherwise? Because the web is open and you will never
|
|
know when you might learn some other information which will
|
|
make the default incorrect. (You can't use such "closed
|
|
world" reasoning).
|
|
</p>
|
|
<h4>
2.2.2 Web operation
</h4>
|
|
<p>
|
|
Why can't a URI identify a web page until you have done some
|
|
well-defined operation -- such as HTTP HEAD or GET -- and
|
|
checked for information in that? Well, that would certainly
|
|
work logically. Suppose we define a return code or HTTP
|
|
header which means "abstract object requested". It would mean
|
|
that every web application which deals with web pages as web
|
|
pages would actually be working under an ambiguity, and RDF
|
|
processors could be programmed to look for that special
|
|
information. We can't retrofit the millions of web servers
|
|
out there, I assume.
|
|
</p>
|
|
<p>
|
|
I feel that there is a great benefit to fixing this question
|
|
at the spec level. Otherwise, what happens? I read a web
|
|
page, I like it and I am going to annotate it as being a
|
|
great one -- but first I have to find out whether the URI in my
browser is used, conceptually, by the author of the page to
represent some abstract idea. Before I recommend the
|
|
<em>Vietnam War</em> page, I have to be careful I am not
|
|
recommending the Vietnam War.
|
|
</p>
|
|
<p>
|
|
There has been no way to do this before RDF, but then
|
|
similarly no real need for it. (What, is this just a problem
|
|
with RDF? No, it will happen with any webized knowledge
|
|
representation system.) We really need to have communication
|
|
in which two people use the same URI to mean the same thing.
|
|
If there
|
|
</p>
|
|
<p>
|
|
We could fix HTTP so that it would return me some extra
|
|
semantic headers explaining the whole thing. And in the case
|
|
that the URI was deemed to be some abstract thing, I would
|
|
not have the option of recommending the web page. Too bad: it
|
|
has no URI.
|
|
</p>
|
|
<p>
|
|
The authors of document
|
|
&lt;http://www.w3.org/2000/10/rdf-tests/rdfcore/Manifest.rdf&gt;
|
|
certainly thought that they could use
|
|
"http://www.w3.org/2000/10/rdf-tests/TestSchema/NegativeParserTest"
|
|
to identify an abstract thing which is a type of software
|
|
test. Now they have a choice as to what to make the server
|
|
return for them when I ask for it. It returns 404 "doesn't
|
|
match anything we have available". It can't really, because
|
|
HTTP doesn't allow one to return a class, only a document.
|
|
And if it were to return a document, then I wouldn't be able
|
|
to refer to that document without accidentally referring to
|
|
the class of negative parser tests.
|
|
</p>
|
|
<p>
|
|
So, we could change HTTP to make this work. We could make a
|
|
new form of redirect, <em>343 Abstract Object, please see . .
|
|
.</em>, which would tell the client that the thing requested
|
|
was abstract, and would suggest a document to read about it.
|
|
This avenue of argument is still outstanding. We could take
|
|
it. It isn't the status quo, but we could make changes in
|
|
HTTP if the community felt that this was the way to go.
|
|
</p>
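<p>
A toy model of how a client might handle such a redirect is
sketched below. The status code 343 is the hypothetical one
proposed in the paragraph above, not a registered HTTP status; the
example.org URIs, the two lookup tables and both function names
are likewise assumptions made only for this sketch.
</p>

```python
# Hypothetical publisher data: which URIs are declared abstract, and which
# document describes each of them.
ABSTRACT_THINGS = {
    "http://example.org/terms/NegativeParserTest":
        "http://example.org/docs/negative-parser-test",
}

DOCUMENTS = {
    "http://example.org/docs/negative-parser-test":
        b"A description of the class of negative parser tests.",
}


def respond(uri):
    """Return (status, headers, body) for a GET on uri in this toy model."""
    if uri in ABSTRACT_THINGS:
        # 343 is the essay's proposed code, not a real HTTP status.
        return 343, {"Location": ABSTRACT_THINGS[uri]}, b""
    if uri in DOCUMENTS:
        return 200, {"Content-Type": "text/plain"}, DOCUMENTS[uri]
    return 404, {}, b""


def fetch_description(uri):
    """Follow at most one 343; report (is_abstract, description_bytes)."""
    status, headers, body = respond(uri)
    if status == 343:
        _, _, body = respond(headers["Location"])
        return True, body
    return False, body
```

<p>
In this model the abstract thing and the document about it have
separate URIs, so a client can still refer to either one without
accidentally referring to the other.
</p>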
|
|
<h3 id="L883">
|
|
2.3 Logic disambiguates
|
|
</h3>
|
|
<p>
|
|
Otherwise, we have to try another way of letting the URI mean
|
|
sometimes one thing and sometimes another. Here is another.
|
|
</p>
|
|
<blockquote>
|
|
<p>
|
|
Actually the URI has to, like in English, identify these
|
|
different things ambiguously. Machines have to disambiguate
|
|
using common sense and logic
|
|
</p>
|
|
</blockquote>
|
|
<p>
|
|
This is possible in theory. It is a mess. It fails
|
|
particularly spectacularly when a URI is used ambiguously to
|
|
refer to a web page and the thing that web page is about,
|
|
which happens to be another web page. <em>Anyone can write
|
|
anything about anything</em> is a Web motto, but here it
|
|
falls down. <em>Anyone can write anything about anything
|
|
except those things which might get confused with the
|
|
document they are writing</em>. It breaks the axiom that we
|
|
mean the same thing by a URI - in all contexts. (And RDF has
|
|
a model theory in which necessarily in any interpretation, a
|
|
symbol always denotes one thing).
|
|
</p>
|
|
<h3 id="L890">
|
|
2.4 Different Properties
|
|
</h3>
|
|
<blockquote>
|
|
<p>
|
|
Actually the URI has to, like in English, identify these
|
|
different things ambiguously. Machines have to disambiguate
|
|
using the fact that different properties will refer to
|
|
different levels.
|
|
</p>
|
|
</blockquote>
|
|
<p>
|
|
One way of getting here is to start by considering that HTTP
|
|
headers can be divided into those which refer to the
|
|
representation (or the document) and those that refer to,
|
|
say, a car or a donkey. We can look at all RDF properties and
|
|
other attributes in other languages and divide them in
|
|
such a way. So, when I say "http://example.com/albert is a
|
|
color photo", I am referring to the representation; when I
|
|
say "http://example.com/albert used to work down the mill" I
|
|
am referring to the person; when I say
|
|
"http://example.com/albert was taken on a rainy day" I am
|
|
referring to the original photograph, which is basically the
|
|
representation of Albert.
|
|
</p>
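<p>
The proposal amounts to a table from property to level, which a
processor consults before deciding what the URI means. The
property names and level names below are made up for this sketch
and follow the three Albert sentences above.
</p>

```python
# Hypothetical assignment of each property to the level it talks about.
PROPERTY_LEVEL = {
    "isColorPhoto": "representation",
    "takenOnRainyDay": "photograph",
    "workedDownTheMill": "person",
}


def interpret(uri, prop):
    """Resolve an ambiguous URI to (level, uri) using only the property."""
    try:
        return PROPERTY_LEVEL[prop], uri
    except KeyError:
        raise LookupError(f"no level known for property {prop!r}") from None


albert = "http://example.com/albert"
readings = [interpret(albert, prop) for prop in PROPERTY_LEVEL]
```

<p>
The same URI comes back with three different readings. A property
which makes sense at more than one level, such as ownership, has
no single entry in a table like this.
</p>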
|
|
<p>
|
|
This one has a problem when a web page refers to a web
|
|
page. It can still be pursued, by having different verbs for
|
|
talking about ownership of the web page and ownership of the
|
|
car. This is a classic example of the 2-level syndrome (see
|
|
also <em>Dictionaries in the Library</em>). The basic fallacy
|
|
is that you can make the system general by introducing a
|
|
second level - a new set of attributes, properties, or
|
|
whatever, which allow you to refer to the metadata of
|
|
something separately from the thing itself. These systems
|
|
either turn out to be just limited 2-level systems (like XML
|
|
and DTDs) or have to be extended to be recursive in some way
|
|
later on such that in fact the two levels become unnecessary.
|
|
</p>
|
|
<h3 id="L897">
|
|
2.5 Extra info with URI
|
|
</h3>
|
|
<blockquote>
|
|
<p>
|
|
Actually the URI has to, like in English, identify these
|
|
different things ambiguously. Machines have to disambiguate
|
|
using extra information which will be provided in other
|
|
ways along with the URI
|
|
</p>
|
|
</blockquote>
|
|
<p>
|
|
This twist now relies on sending extra information with a
|
|
URI. Effectively, the URI scheme has now failed to identify
|
|
anything by itself. Those most familiar with URIs as used by
HTML have sometimes suggested adding new attributes to the
anchor tags of HTML documents to disambiguate a reference. I
guess it would work if HTML anchors were the only uses of URIs.
By contrast, they are used in thousands of places and ways, many
|
|
of which I am unaware. The architecture, however, is not that
|
|
way: the architecture of the WWW is that a URI is a global
|
|
unambiguous identifier. Not a URI and something else.
|
|
</p>
|
|
<p>
|
|
(The various designs such as WebDAV's PROPFIND which use HTTP
methods apart from GET to retrieve information suffer from
this same problem. The information does not have a URI: it is
not on the web.)
|
|
</p>
|
|
<h3 id="L909">
|
|
2.6 Different meaning in different context
|
|
</h3>
|
|
<blockquote>
|
|
<p>
|
|
Actually the URI has to, like in English, identify these
|
|
different things ambiguously. Machines have to disambiguate
|
|
them by context: A catalog card will talk about a document.
|
|
A car catalog will talk about a car.
|
|
</p>
|
|
</blockquote>
|
|
<p>
|
|
This works in the short term, when the two contexts are
|
|
disjoint groups who do not need to communicate. It is in fact
|
|
the current state: the groups of people who use HTTP URIs to
|
|
talk about documents, and those who have just started to use
|
|
them to talk about abstract concepts haven't collided yet.
|
|
(Well, they have in my code. I need to be able to model the
|
|
metadata about an HTTP URI as that about a document, and it
|
|
being a class at the same time doesn't jive.)
|
|
</p>
|
|
<p>
|
|
It doesn't work in the long term because it breaks the axiom
|
|
that a URI must identify one thing.
|
|
</p>
|
|
<h3 id="L920">
|
|
2.7 Change it for the Semantic Web
|
|
</h3>
|
|
<blockquote>
|
|
<p>
|
|
They may have been used to identify documents up till now,
|
|
but for RDF and the Semantic Web, we should change that and
|
|
start to use them as the Dublin Core and RDF Core groups
|
|
have for abstract concepts.
|
|
</p>
|
|
</blockquote>
|
|
<p>
|
|
I think that we would have to design a new URI scheme before
|
|
we change things that much. That is tempting of course. But
|
|
then -- building a semantic web out of what we have is
|
|
tempting too. It was tempting to rehash TCP a little when
|
|
making HTTP. It wasn't practical, and we would have lost a
|
|
lot more than we would have gained. There is a lot to be said
|
|
for using common technology. We've got an infrastructure of
|
|
documents. We want to build an infrastructure of knowledge.
|
|
Let's build it using the documents. We might find that the
|
|
commonality with the web of human-readable information is a
|
|
boon.
|
|
</p>
|
|
<h3 id="L735">
|
|
2.8 Abandon any identification of abstract things
|
|
</h3>
|
|
<p>
|
|
An argument which surprised me is that yes, HTTP URIs
|
|
identify documents, but in fact the fragment identifier must
|
|
only be used to identify parts -- fragments -- of documents.
|
|
This means that RDF cannot in fact use HTTP URI schemes at
|
|
all. A completely different system would have to be put
|
|
together -- either a new set of URIs, or RDF conventions in
|
|
which the relationship to the part of a document in which
|
|
something was described became explicit. In N3 this would
look like
|
|
</p>
|
|
<p>
|
|
[ is rdf:referent of &lt;#fmyCar&gt; ] [ is rdf:referent of
&lt;#color&gt; ] [ is rdf:referent of &lt;#blue&gt; ]
|
|
</p>
|
|
<p>
|
|
Of course, languages would quickly generate special syntax
|
|
for this. Alternatively, the RDF system would be built entirely
|
|
on the understanding that we were referring always to that
|
|
denoted by a given bit of document, not the bit of document
|
|
itself. This would mean that there would be no way for the
|
|
RDF system to refer to documents themselves directly.
|
|
</p>
|
|
<p>
|
|
This is actually a consistent way of working. It would be a
|
|
change only for those people who use RDF to talk about
|
|
documents as documents. We could change.
|
|
</p>
|
|
<h2>
|
|
<a name="L409" id="L409">3. Conclusion</a>
|
|
</h2>
|
|
<p>
|
|
I didn't have this thought out a few years ago. It has only
|
|
been in actually building a relatively formal system on top
|
|
of the web infrastructure that I have had to clarify these
|
|
concepts in my own mind. I am forced to conclude that modeling
|
|
the HTTP part of the web as a web of abstract documents is
|
|
the only way to go which is practical and, by the
|
|
philosophical underpinnings of the WWW, tenable.
|
|
</p>
|
|
<p>
|
|
I apologize again if I have misunderstood or misrepresented
|
|
others' arguments in the process of this explanation of my
|
|
own position.
|
|
</p>
|
|
<p>
|
|
Tim Berners-Lee
|
|
</p>
|
|
<p>
|
|
2002-07-28Z
|
|
</p>
|
|
<hr />
|
|
<h3>
|
|
FAQ
|
|
</h3>
|
|
<p>
|
|
<em>Q: But surely, if a document is identified by a namespace
|
|
URI, then when we look up an RDF namespace with millions of
|
|
words in it we will have too long a document to be
|
|
practical!</em>
|
|
</p>
|
|
<p>
|
|
A: It is arguable, for such a situation, whether the
|
|
namespace itself is more cumbersome to manage than the
|
|
document is to deliver. You can make an analogy with
|
|
hypertext: Isn't the model of retrieving a document going to
|
|
be inefficient when the documents are huge? Answers are
|
|
twofold in each case.
|
|
</p>
|
|
<p>
|
|
Firstly, yes it is likely to be less convenient, but that is
|
|
no reason to skew what is a good engineering design for the
|
|
vast proportion of namespaces (or hypertext documents) which
|
|
are not huge.
|
|
</p>
|
|
<p>
|
|
Secondly, the HTTP protocol actually does have methods of
|
|
retrieving parts of a large document.
|
|
</p>
|
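<p>
As a minimal sketch of this second point, assuming a server that honours byte-range requests: a client can ask for only part of a large document by adding a Range header. The URI and the helper name below are hypothetical, and the request is only built, not sent.
</p>

```python
from urllib.request import Request


def partial_request(uri, first_byte, last_byte):
    """Build (but do not send) a GET request for one byte range of a document."""
    req = Request(uri)
    # HTTP byte ranges are inclusive at both ends.
    req.add_header("Range", f"bytes={first_byte}-{last_byte}")
    return req


# Hypothetical namespace document: request only its first 1024 bytes.
req = partial_request("http://example.org/big-namespace", 0, 1023)
print(req.get_header("Range"))
```

<p>
A server that supports ranges answers with 206 Partial Content; one that does not may simply return the whole document with 200.
</p>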
|
<p>
|
|
<em>Q: It seems strange that an HTTP URI should be limited to
|
|
referring to documents, but that all one has to add is this
|
|
little hash mark and suddenly you say it can be used to
|
|
identify anything.</em>
|
|
</p>
|
|
<p>
|
|
A: The hash is not a minor appendage to the URI: It is the
|
|
most significant piece of punctuation in the whole URIref.
|
|
The hash adds a whole new level of abstraction and
|
|
specification! It is true that a hypertext page and that
|
|
page scrolled to a given point seem very similar. The same
|
|
applies to a graphic chart and an object within that chart,
|
|
especially when it is displayed in the context of the
|
|
original document. So I suppose it may be a shock when the
|
|
technique is used with a semantic web language to refer to
|
|
not the document, but something which the document discusses.
|
|
That does allow it to break out of the whole concept of
|
|
documents and into -- anything. But no one promised the
|
|
Semantic Web would be boring. :-)
|
|
</p>
|
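<p>
To make the role of the hash concrete, here is a small sketch that splits a URIref into the document URI and the fragment identifier. The URIref and the function name are hypothetical examples; only the part before the hash is what an HTTP client sends to the server, while the fragment is interpreted on the client side.
</p>

```python
from urllib.parse import urldefrag


def split_uriref(uriref):
    """Split a URIref into the document URI and the fragment identifier."""
    document_uri, fragment = urldefrag(uriref)
    return document_uri, fragment


# Hypothetical URIref used only for illustration.
document_uri, fragment = split_uriref("http://example.org/cars#myCar")
print(document_uri)  # the document that would be dereferenced over HTTP
print(fragment)      # the local identifier, never sent to the server
```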
|
<p>
|
|
<em>Q: I thought you said "anything should be able to have a
|
|
URI"?</em>
|
|
</p>
|
|
<p>
|
|
Yes, and it should. There is nothing in the URI spec to say
|
|
what an individual scheme should or should not be created to
|
|
identify. A new URI scheme could for example be able to
|
|
identify anything. But here we are talking about HTTP URIs.
|
|
And remember that with semantic web languages, you can use a
|
|
URIref (very different from a URI) to identify anything, for
|
|
example with HTTP and RDF.
|
|
</p>
|
|
<p>
|
|
<em>Q: But what about CGI scripts? Surely you don't mean the
|
|
HTTP URI identifies the script?</em>
|
|
</p>
|
|
<p>
|
|
A: Of course not. When we talk about the "document"
|
|
identified by a URI it is very often a virtual document
|
|
produced by, for example, a CGI script. The URI identifies
|
|
the document on the web, with no regard to the process which
|
|
causes representations of it to be served.
|
|
</p>
|
|
<p>
|
|
<em>Q: Some HTTP URIs can be POSTed to. Can you still say
|
|
they identify documents?</em>
|
|
</p>
|
|
<p>
|
|
A: Well, some HTTP URIs can't be accessed at all, and some
|
|
access is not allowed, and yes, some URIs are not only
|
|
documents but also can be posted to. So the object is more
|
|
complex than simply a document. But that it has this extra
|
|
functionality doesn't make it any less an HTTP document
|
|
formally. Something can have extra features and still remain
|
|
in the same class of things.
|
|
</p>
|
|
<p>
|
|
<em>Q: What do you mean by "identify", anyway, in Model
|
|
Theory terms? (2003)</em>
|
|
</p>
|
|
<p>
|
|
The closest term used in Model Theory to the way I am using
|
|
<em>identify</em> is <em>denote</em>. Model theory analyses
|
|
communication and understanding by imagining a set of
|
|
<em>interpretations</em>, where an interpretation is a
|
|
mapping from a symbol to that which it denotes. Model
|
|
theorists and linguists tend to complain that one cannot talk
|
|
about the meaning of a term, as you can never know what
|
|
anyone means by anything, you can only see how they react. A
|
|
given agent may have many possible interpretations, but new
|
|
information the agent believes which mentions a symbol will
|
|
rule out interpretations which are inconsistent with the
|
|
symbol. By the process of exchange of a lot of information,
|
|
one arrives at a state in which one behaves as though other
|
|
agents have effectively the same set of interpretations.
|
|
Under these conditions, one can think of the thing
|
|
<em>identified</em> by the symbol in the community as being
|
|
the set of things denoted by the symbol in the
|
|
interpretations which agents in the community are left with.
|
|
There has been much more discussion of this process (which is
|
|
the essence of the writing of a standard and the purpose of
|
|
documents like this) in email on www-tag with Pat Hayes and
|
|
others in 2003.
|
|
</p>
|
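<p>
A toy sketch of this narrowing process, under loud assumptions: the two symbols, the two candidate things and the "is a web page" test are all invented for illustration and come from no specification. An agent starts with every possible interpretation and discards those that new information rules out.
</p>

```python
from itertools import product

# Hypothetical symbols and candidate things they might denote.
symbols = ["ex:a", "ex:b"]
things = ["web page", "car"]

# An interpretation maps every symbol to exactly one thing.
interpretations = [
    dict(zip(symbols, choice))
    for choice in product(things, repeat=len(symbols))
]


def rule_out(interps, symbol, is_consistent):
    """Keep only the interpretations consistent with new information about symbol."""
    return [i for i in interps if is_consistent(i[symbol])]


# Assumed new information: whatever "ex:a" denotes is a web page.
remaining = rule_out(interpretations, "ex:a", lambda thing: thing == "web page")

# What "ex:a" identifies for this agent: the things it still denotes.
denoted = {i["ex:a"] for i in remaining}
print(len(interpretations), len(remaining), denoted)
```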
|
<address>
|
|
The rest are from Aaron Swartz
|
|
</address>
|
|
<p>
|
|
<em>Q: Can you point to something in the spec that says HTTP
|
|
URIs must identify a document?</em>
|
|
</p>
|
|
<p>
|
|
There are many answers. I can point to things which could be
|
|
interpreted to say that. The HTTP spec defines resources as
|
|
<em>network data objects</em>. To me that "data" indicates
|
|
the information nature of the thing. It precludes, in most
|
|
people's minds, a car or the Andromeda Galaxy.
|
|
</p>
|
|
<p>
|
|
I could explain that, as I originally wrote the HTTP spec,
|
|
that was the author's intent.
|
|
</p>
|
|
<p>
|
|
But I think the fairest thing is to say that when the spec was
|
|
written it was not sufficiently clear about this particular
|
|
ambiguity, and for reasons mentioned above, this hasn't been
|
|
a problem until now.
|
|
</p>
|
|
<p>
|
|
<em>Q: Isn't it a little weird to start making pronouncements
|
|
about the entire HTTP Web when neither the spec nor the other
|
|
TAG members agree?</em>
|
|
</p>
|
|
<p>
|
|
Pronouncements about the whole Web are really important where
|
|
they are needed. In that case the TAG has a duty to make
|
|
them. And so do I. It seems to me that this assumption is one
|
|
we have been implicitly making and are now breaking, in a way
|
|
which will make the semantic web either inconsistent or much
|
|
less efficient. The TAG members do not agree on this: that is
|
|
why they asked me to write this document. It is written as a
|
|
TAG action item about TAG issue HTTPRange-14. Things get a
|
|
lot weirder than that. ;-)
|
|
</p>
|
|
<p>
|
|
<em>Q: Why do we need to use URI-refs to identify abstract
|
|
concepts in a protocol where we can get more information
|
|
about them? I thought URIs were doing just fine. If we have
|
|
to resort to UUIDs to identify things, I'll get annoyed
|
|
because I won't be able to put them in my browser.</em>
|
|
</p>
|
|
<p>
|
|
Well, there you are... you want to be able to put something
|
|
in your browser, then you must have a representation of it.
|
|
So somewhere in the picture, representations aside, is a
|
|
ConceptualWork. If the ConceptualWork is important, then it
|
|
needs a URI, in my opinion. The alternatives are attractive
|
|
when you start to look at them, but each has a different
|
|
snag, as I have tried to explain above.
|
|
</p>
|
|
<p>
|
|
<em>Q: How can you say that the Semantic Web can use the hash
|
|
mark to make a URI-ref identify anything when the URI RFC is
|
|
very clear that hash marks only work when you dereference the
|
|
document?</em>
|
|
</p>
|
|
<p>
|
|
I wouldn't say that hash marks "only work when you dereference
|
|
a document" any more than your street address "only works
|
|
when I visit you", or your date of birth "only worked when
|
|
you were born". I can use your street address -- or your date
|
|
of birth -- to help identify you. What the spec defines is a
|
|
way of using this particular URI to get some information over
|
|
the Internet. The whole web works by what someone recently
|
|
referred to as a "confusion" between name and address. It
|
|
isn't a confusion. It is a connection between two pieces of
|
|
architecture without which the web would not be. Rethink. It
|
|
is primarily a name. We have made a way of looking it up. So
|
|
you don't have to look it up for the name to "work" as an
|
|
identifier. Just as you don't go and look it up when someone
|
|
quotes the RDF namespace -- it works because the same
|
|
identifier identifies the same thing in any context. Looked
|
|
up or not. The same thing is true for foo#bar. If the
|
|
document foo is never served, one can still (if one owns it)
|
|
talk about foo#bar with authority. It is of course good
|
|
practice to serve documents.
|
|
</p>
|
|
<p>
|
|
<em>Q: Are all Semantic Web agents going to start
|
|
dereferencing every document they hear about?</em>
|
|
</p>
|
|
<p>
|
|
No, any more than you have to dereference every hypertext
|
|
link you see.
|
|
</p>
|
|
<p>
|
|
<em>Q: Isn't the Semantic Web broken if we have to start
|
|
disagreeing with major specifications like this?</em>
|
|
</p>
|
|
<p>
|
|
This philosophy is quite consistent with the HTTP spec as it
|
|
is.
|
|
</p>
|
|
<h3>
|
|
Exercises
|
|
</h3>
|
|
<p>
|
|
1) What does "<a href=
|
|
"http://www.amazon.com/exec/obidos/ASIN/0679600108/qid=1027958807/sr=2-3/ref=sr_2_3/103-4363499-9407855">http://www.amazon.com/exec/obidos/ASIN/0679600108/qid=1027958807/sr=2-3/ref=sr_2_3/103-4363499-9407855</a>"
|
|
identify?
|
|
</p>
|
|
<ol>
|
|
<li>A whale
|
|
</li>
|
|
<li>"Moby Dick or the Whale" by Herman Melville
|
|
</li>
|
|
<li>A web page on Amazon offering a book for sale
|
|
</li>
|
|
<li>A URI string
|
|
</li>
|
|
<li>All the above
|
|
</li>
|
|
</ol>
|
|
<p>
|
|
When was the thing it identified last changed?
|
|
</p>
|
|
<p>
|
|
Have you read the thing it identifies?
|
|
</p>
|
|
<p>
|
|
2) What does "<a href=
|
|
"http://www.vrc.iastate.edu/magritte.gif">http://www.vrc.iastate.edu/magritte.gif</a>"
|
|
identify?
|
|
</p>
|
|
<ol>
|
|
<li>A pipe
|
|
</li>
|
|
<li>I don't know, but whatever it is it isn't not a pipe.
|
|
</li>
|
|
<li>A contradiction
|
|
</li>
|
|
<li>
|
|
<strong>A picture by Magritte</strong>
|
|
</li>
|
|
<li>
|
|
<strong>A photograph of a picture by Magritte</strong>
|
|
</li>
|
|
<li>
|
|
<strong>A representation as a series of 341632 bits of a
|
|
photo of a painting</strong>
|
|
</li>
|
|
<li>Validly 4, 5 and 6 but not 1
|
|
</li>
|
|
</ol>
|
|
<p>
|
|
<img alt="Hint: This is not a pipe" src=
|
|
"http://www.vrc.iastate.edu/magritte.gif" />
|
|
</p>
|
|
<p>
|
|
3) What does "<a href=
|
|
"http://dm93.org/2002/03/dans-car-23423423">http://dm93.org/2002/03/dans-car-23423423"</a>
|
|
identify?
|
|
</p>
|
|
<ol>
|
|
<li>An inaccessible web page
|
|
</li>
|
|
<li>A black Toyota
|
|
</li>
|
|
</ol>
|
|
<p>
|
|
4) What does "<a href=
|
|
"http://dm93.org/y2002/myCar-232">http://dm93.org/y2002/myCar-232</a>"
|
|
identify?
|
|
</p>
|
|
<ol>
|
|
<li>A black Toyota
|
|
</li>
|
|
<li>A web page
|
|
</li>
|
|
</ol>
|
|
<p>
|
|
When was the thing identified last changed?
|
|
</p>
|
|
<p>
|
|
What does the writing on Dan's car say?
|
|
</p>
|
|
<p>
|
|
Answers: 1:3. 2:7. Note that here the web tolerates vagueness along
|
|
the axis of different representations of the same image, but
|
|
not of semantic level between the image and the pipe. 3:1;
|
|
4:2
|
|
</p>
|
|
<h3>
|
|
References
|
|
</h3>
|
|
<p>
|
|
@@@links
|
|
</p>
|
|
<ul>
|
|
<li>The huge discussion of this issue on www-tag@w3.org
|
|
</li>
|
|
<li>
|
|
<a href="http://www.textuality.com/tag/s1.1.html">Tim
|
|
Bray's text</a>
|
|
</li>
|
|
<li>RFC 1634 and points west
|
|
</li>
|
|
<li>Roy Fielding's short history of URI specifications
|
|
</li>
|
|
<li>Weaving the Web
|
|
</li>
|
|
<li>
|
|
<a href=
|
|
"http://www.cyc.com/cycdoc/vocab/info-vocab.html">Cyc's
|
|
page about Conceptual Works</a> cyc:ConceptualWork <a href=
|
|
"http://ilrt.org/discovery/chatlogs/rdfig/2002-07-31.html#T15-56-58-1">
|
|
proposed as what I mean by document by DanC</a>.
|
|
</li>
|
|
</ul>
|
|
<hr />
|
|
<p>
|
|
<a href="Overview.html">Up to Design Issues</a>
|
|
</p>
|
|
<p>
|
|
<a href="../People/Berners-Lee">Tim BL</a>
|
|
</p>
|
|
</body>
|
|
</html>
|