<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 13), see www.w3.org" />
<title>
The Evolution of a specification -- Commentary on Web
architecture
</title>
<link rel="stylesheet" href="di.css" type="text/css" />
<meta http-equiv="Content-Type" content=
"text/html; charset=us-ascii" />
</head>
<body bgcolor="#DDFFDD" text="#000000" lang="en" xml:lang="en">
<address>
Tim Berners-Lee
<p>
Date: March 1998. Last edited: $Date: 2009/08/27 21:38:07 $
</p>
<p>
Status: . Editing status: incomplete first draft. This
explains the rationale for XML namespaces and RDF schemas,
and derives requirements on them from a discussion of the
process by which we arrive at standards.
</p>
</address>
<p>
<a href="./">Up to Design Issues</a>
</p>
<h3>
Commentary
</h3>
<p>
<i>(These ideas were mentioned in a <a href=
"../Talks/1998/0415-Evolvability/slide1-1.htm">keynote on
"Evolvability"</a> at WWW7 and this text follows closely
enough for you to give yourself the talk below using those
slides. More or less. If and when we get a video from WWW7 of
the talk, maybe we'll be able to serve that up in
parallel.)</i>
</p>
<hr />
<h1>
Evolvability
</h1>
<h3>
<a name="Introduction" id="Introduction">Introduction</a>
</h3>
<p>
The World Wide Web Consortium was founded in 1994 on the
mandate to lead the <b>Evolution</b> of the Web while
maintaining its <b>Interoperability</b> as a universal space.
"Interoperability" and "Evolvability" were two goals for all
W3C technology, and whilst there was a good understanding of
what the first meant, it was difficult to define the second
in terms of technology.
</p>
<p>
Since then W3C has had first-hand experience of the tension
between these two goals, and has seen the process by which
specifications have been advanced, fragmented and later
reconverged. This has led to a desire for a technological
solution which will allow specifications to evolve with the
speed and freedom of many parallel developments, but also
such that any message, whether "standard" or not, at least
has a well defined meaning.
</p>
<p>
There have been technologies dubbed "futureproof" for years
and years, whether they are languages or backplane busses.
&nbsp;I expect you the reader to share my cynicism when
encountering any such claim. &nbsp;We must work through
exactly what we mean: what we expect to be able to do which
we could not do before, and how that will make evolution more
possible and less painful.
</p>
<h2>
<a name="Free" id="Free">Free extension</a>
</h2>
<p>
A rule explicit or implicit in all the email-like Internet
protocols has always been that if you found a mail header (or
something) which you did not understand, you should ignore
it. This obviously allows people to add all sorts of records
to things in a very free way, and so we can call it the rule
of free extension. It has the advantage of rapid prototyping
and incremental deployment, and the disadvantage of
ambiguity, confusion, and an inability to add a mandatory
feature to an existing protocol. I adopted the rule for HTML
when initially designing it - and used it myself all the
time, adding elements one by one. This is one way in which
HTML was unlike a conventional SGML application, but it
allowed the dramatic development of HTML.
</p>
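<p>
The rule of free extension can be sketched in a few lines of
Python. This is only an illustration, not any real mail
parser: the set of known header names and the sample message
are assumptions made up for the example. The point is that an
unrecognised header is dropped quietly rather than treated as
an error, which is also why no header can ever be made
mandatory under this rule.
</p>

```python
# Hypothetical set of header names this reader understands.
KNOWN_HEADERS = {"from", "to", "subject", "date"}


def parse_headers(raw_lines):
    """Split header lines into understood fields and ignored names."""
    understood, ignored = {}, []
    for line in raw_lines:
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line; skip it
        key = name.strip().lower()
        if key in KNOWN_HEADERS:
            understood[key] = value.strip()
        else:
            # Free extension: an unknown header is dropped, never an error.
            ignored.append(key)
    return understood, ignored


message = [
    "From: alice@example.org",
    "Subject: Tables",
    "X-Table-Model: vendor-2.01",  # an extension this reader has never seen
]
understood, ignored = parse_headers(message)
print(understood)
print(ignored)
```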
<h3>
<a name="cycle" id="cycle">The HTML cycle</a>
</h3>
<p>
The development of HTML between 1994 and 1998 took place in a
cycle, fuelled by the tension between the competitive urge of
companies to outdo each other and the common need for
standards for moving forward. The cycle starts simply
because the HTML standard is open and usable by anyone: this
means that any engineer, in any company or waiting for a bus
can think of new ways to extend HTML, and try them out.
</p>
<p>
The next phase is that some of these many ideas are tried out
in prototypes or products, using the free extension rule
that any unrecognized extensions will be ignored by
everything which does not understand them. The result is a
dramatic growth in features. Some of these become product
differentiators, during which time their originators are loath
to discuss the technology with the competition. Some features
die in the market and disappear from the products. Those
successful features have a fairly short lifetime as product
differentiators, as they are soon emulated in some equivalent
(though different) feature in competing products.
</p>
<p>
After this phase of the cycle, there are three or four ways
of doing the same thing, and engineers in each company are
forced to spend their time writing three or four different
versions of the same thing, and coping with the software
architectural problems which arise from the mix of different
models. This wastes program size, and confuses users. In the
case, for example, of the TABLE tag, a browser meeting one in
a document had no idea which table extension it was, so the
situation could become ambiguous. If the interpretation of
the table was important for the safe interpretation of the
document, the server would never know whether it had been
done, as an unaware client would blithely ignore it in any
case. This internal software mess resulting from having to
implement multiple models also threatens future development.
It turns the stable consistent base for future development
into something fragmented and inconsistent: it is difficult
to design new features in such an environment.
</p>
<p>
Now the marketing pressure which prevented discussions is
off, and there is a strong call for the engineers to
get around the W3C table, and iron out a common way of doing
things. As this happens, a system is designed which puts
together the best aspects of each other system, plus a few
weeks' experience, so everyone is in the end happier with the
result. The companies all go away making public promises to
implement it, even though the engineering staff will be under
pressure to add the next feature and start the next cycle. The
result is published as a common specification open to anyone
to implement. And so the cycle starts again.
</p>
<p>
This is not the way all W3C activities have worked, but it
was the particular case with HTML, and it illustrates some of
the advantages and disadvantages of the free extension
rule.
</p>
<h3>
<a name="Breaking" id="Breaking">Breaking the cycle</a>
</h3>
<p>
The HTML cycle as a method of arriving at consensus on a
document has its drawbacks. By 1998, there were reasons to
change the cycle. The work in the W3C, which had started off
in 1994 with several years' backlog of work, had more or less
caught up, and was beginning to lead, rather than trail,
developments. The work was seen less as fire fighting and
more as consolidation. By this time the spec was growing to a
size where the principle of modularity was seriously
flouted. Any new developments clearly had to be separate
modules. Already style information had been moved out into
the Cascading Style Sheets language, the programming
interface work was a separate Document Object Model activity,
and guidelines for accessibility were tackled by a separate
group.
</p>
<p>
In the future it was clear that we needed somehow to set up a
modular system which would allow one to add to HTML new
standard modules. At the same time, it was clear that with
XML available as a manageable version of SGML as a base for
anyone to define their own tag sets, there was likely to be a
deluge of application-specific and industry-specific XML
based languages. The idea of all this happening under the free
extension rule was frightening. Most applications would
simply add new tags to HTML. If we continued the process of
retrospectively roping them into a new bigger standard, the
document would grow without limit and become totally
unmanageable. The rule of free extension was no longer
appropriate.
</p>
<h1>
<a name="wdi" id="wdi">Well defined interfaces</a>
</h1>
<p>
Now let us compare this situation with the way development
occurs in the world of distributed computing, specifically
remote procedure call (RPC) and distributed object oriented
systems. In these systems, the distributed system (equivalent
to the server plus the client for the web) is viewed as a
single software system which happens to be spread over
several physical machines. [nelson - courier, etc]
</p>
<p>
The network protocols are defined automatically as a function
of the software interfaces which happen to end up being
between modules on different machines. Each interface, local
or remote, has a well documented structure, and the list of
functions (procedures, methods or whatever) and parameters
are defined in machine-processable form. As the system is
built, the compiler checks that the interfaces required by
one module are exactly provided by another module. The
interface, in each version of its development, typically has
an identifying (typically very long) unique number.
</p>
<p>
The interface defines the parameters of a remote call, and
therefore defines exactly what can occur in a message from
one module to another. There is no free extension. If the
interface is changed, and a new module made, any module on
the other side of the interface will have to be changed too,
or you can't build the system.
</p>
<p>
The great advantage of this is that when the system has been
built, you expect it to work. There is no wondering whether a
table is being displayed - if you have called the table
module, you know exactly what the module is supposed to do,
and there is no way the system could be without that module.
Given the chaos of the HTML development world, you can
imagine that many people were hankering after the well
defined interfaces of the distributed computing technology.
</p>
<p>
With well-defined interfaces, either everything works, or
nothing. This was in fact at least formally the case with
SGML documents. Each had a document type definition (DTD)
referred to at the top, which defined in principle exactly
what could and could not be in the document. PICS labels were
similar in that they are self-describing: they actually have
a URI at the top which points to a machine-readable
description of what can and can't be in that PICS label.
When you see one of these documents, as when you get an RPC
message with an interface number on it, you can check whether
you understand the interface or not. Another interesting thing
you can do, if you don't have a way of processing it, is to
look it up in some index and dynamically download the code to
process it.
</p>
<p>
The existence of the Web makes all this much smoother:
instead of inventing arbitrary names for interfaces, you can
use a real URI which can be dereferenced and return the
master definition of the interface in real time. The Web can
become a decentralised registry of interfaces (languages)
and code modules.
</p>
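<p>
As a rough sketch of what checking an interface by its URI
might look like, the Python below keeps a table of handlers
keyed by URI. Everything in it is hypothetical: the
example.org URIs, the message layout with "interface" and
"body" fields, and the handler itself are assumptions for
illustration. Fetching the definition or the code from the URI
is left out.
</p>

```python
# Hypothetical registry: interface URI -> handler function.
HANDLERS = {}


def handles(interface_uri):
    """Decorator that registers a handler under the URI naming its interface."""
    def register(func):
        HANDLERS[interface_uri] = func
        return func
    return register


@handles("http://example.org/1998/invoice")
def process_invoice(body):
    return "invoice total: " + str(body["total"])


def receive(message):
    """Check the declared interface before touching the body."""
    uri = message["interface"]
    handler = HANDLERS.get(uri)
    if handler is None:
        # No guessing: either this URI is known, or we can say exactly
        # which definition is missing (and could go and look it up).
        return "unknown interface: " + uri
    return handler(message["body"])


print(receive({"interface": "http://example.org/1998/invoice",
               "body": {"total": 42}}))
print(receive({"interface": "http://example.org/1998/tables", "body": {}}))
```

<p>
Compared with free extension, the receiver here never
silently ignores anything: an unknown interface is reported by
name.
</p>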
<p>
The need was clearly for the best of both worlds. One must be
able to freely extend a language, but do so with an extension
language which is itself well defined. If for example,
documents which were HTML 2.0 plus Netscape's version of
tables version 2.01 were identified as such, much of the
problem of ambiguity would have been resolved, but the rest
of the world left free to make their own table extensions.
This was the goal of the namespaces work in XML.
</p>
<h3>
<a name="ModularityInHTML" id="ModularityInHTML">Modularity
in HTML</a>
</h3>
<p>
To be able to use the namespaces work in the extension of
HTML, HTML has to transition from being an SGML application
(with certain constraints) to being an XML based language.
This will not only give it a certain ease of parsing, but
allow it to build on the modularity introduced by namespaces.
</p>
<p>
In fact, already in April of 1998 there was a W3C
Recommendation for "MathML", defined as an XML language and
obviously aimed at being usable in the context of an HTML
document, but for which there was no defined way to write a
combined HTML+MathML document. MathML was already waiting for
XML namespaces.
</p>
<p>
XML namespaces will allow an author (or authoring tool,
hopefully) to declare exactly what set of tags he or she is
using in a document. Later, schemas should allow a browser to
decide what to do as a fall back when finding vocabulary
which it does not understand.
</p>
<p>
It is expected that new extensions to HTML be introduced as
namespaces, possibly languages in their own right. The intent
is that the new languages, where appropriate, will be able to
use the existing work on style sheets, such as CSS, and the
existing DOM work which defines a programming interface.
</p>
<h2>
<a name="Mixing" id="Mixing">Language mixing</a>
</h2>
<p>
Language mixing is an important facility, for HTML and for the
evolution of all other Web and application technology. It
must allow, in a mixed language document, for both languages
to be well defined. A mixed language document is quite
analogous to a program which makes calls to two runtime
libraries, so it is not rocket science. It is not like an RPC
message, which in most systems is very strongly typed from a
single rigid definition. (An RPC message can be represented
as a structured document but not, in general, vice-versa.)
</p>
<p>
Language mixing is a reality. Real HTML pages are often HTML
with Javascript, or HTML plus CSS, or both. They just aren't
declared as such. In real life, many documents are made from
multiple vocabularies, only some of which one understands. I
don't understand half the information in the tax form - but I
know enough to know what applies to me. The invoice is a good
example. Many different coloured copies of the same document
used to serve as a packing list, restocking sheet, invoice,
and delivery note. Different parts of a company would
understand different bits: the financial division would check
amounts and signatures, the store would understand the part
numbers, and the sales and marketing would define the
relationship between the part numbers and prices.
</p>
<p>
No longer can the Web tolerate the laxness with which HTML and
HTTP have been extended. However, it cannot constrain itself
to a system as rigid as a classical distributed object
oriented system.
</p>
<p>
The <a href="Extensible.html">note on namespaces</a> defines
some requirements of a language framework which allows new
schemata to be developed quite independently, and mixed within
one document. This note elaborates on the sorts of things
which have to be possible when the evolution occurs.
</p>
<h3>
<a name="power" id="power">The Power of schema languages</a>
</h3>
<p>
You may notice that nowhere in the architecture do XML or RDF
specify what language the schema should be written in. This
is because much of the future power of the system will lie in
the power of the schema and related documents, so it
is important to leave that open as a path for the future. In
the short term, you can think of a schema being written in
HTML and English. Indeed, this is enough to tie the
significance of documents written in the schema to the law of
the land and make the document an effective part of serious
commercial or other social interaction. You can imagine a
schema being in a sort of SGML DTD language which tells a
computer program what constraints there are on the structure
of documents, but nothing about their meaning. This allows a
certain crude validity check to be made on a document but
little else.
</p>
<p>
Now let us imagine further power which we could put into a
schema language.
</p>
<h2>
<a name="PartialUnderstanding" id=
"PartialUnderstanding">Partial Understanding</a>
</h2>
<p>
A crucial first milestone for the system is partial
understanding. Let's use the scenario of an invoice, like the
<a href="Extensible.html#Scenario">scenario in the
"Extensible languages" note</a>. An invoice refers to two
schemata: one is a well-known invoice schema and the other a
proprietary part number schema. The requirement is that an
invoice processing program can process the invoice without
needing to understand the part description.
</p>
<p>
Somehow the program must find out that the invoice is from
its point of view just as valid as an invoice with the
details of the part description stripped out.
</p>
<h3>
<a name="Optional" id="Optional">Optional parts</a>
</h3>
<p>
One possibility is to mark the part description as "optional"
on the text. We could imagine a well-known way of doing this.
It could be done in the document itself [as usual, using an
arbitrary syntax:]
</p>
<pre>
&lt;item&gt;
&lt;partnumber&gt;8137498237&lt;/partnumber&gt;
&lt;optional&gt;
&lt;xml:using href="http://aeroco.com/1998/03/sy4" as="a"&gt;
&lt;a:partdesc&gt;
...
&lt;/a:partdesc&gt;
&lt;/xml:using&gt;
&lt;/optional&gt;
&lt;/item&gt;
</pre>
<p>
There are problems with this. One is that we are relying on
the invoice schema to define what an invoice is and isn't and
what it means. It would be nice if the designer of the
invoice could say whether the item should contain a part
description or not, or whether it is possible to add things
into the item description or not. But in general if there is
something to be said we like to allow it to be said anywhere
(like metadata). For the optionalness to be expressed
elsewhere would also save the writer of every invoice the
bother of having to say so explicitly.
</p>
<h3>
<a name="Partial" id="Partial">Partial Understanding</a>
</h3>
<p>
The other more fundamental problem is that the notion of
"optional" is subjective. We can be more precise about
"partial understanding" by saying that the invoice processing
system needs to convert the document which contains things it
doesn't understand into a document which it does completely
understand: a valid invoice. However, another agent may wish
to convert the same detailed invoice into, say, a delivery
note: in this case, quite different information would be
"optional".
</p>
<p>
To be more specific, then, we need to be able to describe a
transformation from one document to another which preserves
"validity" in some sense. A simple form of transformation is
the removal of sections, but obviously there can be all kinds
of level of transformation language ranging from the crudest
to the Turing-complete. Whatever the language, it makes a
statement that, given a document x, some f(x) can be deduced.
</p>
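<p>
The simplest transformation mentioned above, the removal of
sections, might be sketched as follows using Python's standard
xml.etree.ElementTree module. The invoice namespace URI is a
made-up placeholder, and treating "not in the invoice
namespace" as "may be removed" is an assumption of this
sketch; in the scheme discussed here that permission would
have to come from schema information rather than being
hard-wired.
</p>

```python
import xml.etree.ElementTree as ET

INVOICE_NS = "http://example.org/1998/invoice"  # hypothetical
PARTS_NS = "http://aeroco.com/1998/03/sy4"      # as in the example above


def q(ns, local):
    """Build ElementTree's '{namespace}local' qualified name."""
    return "{" + ns + "}" + local


def strip_foreign(element, understood_ns):
    """Remove, in place, every descendant outside the understood namespace."""
    for child in list(element):
        if child.tag.startswith("{" + understood_ns + "}"):
            strip_foreign(child, understood_ns)
        else:
            element.remove(child)
    return element


# Build an item holding a part number plus a foreign part description.
item = ET.Element(q(INVOICE_NS, "item"))
ET.SubElement(item, q(INVOICE_NS, "partnumber")).text = "8137498237"
ET.SubElement(item, q(PARTS_NS, "partdesc")).text = "..."

strip_foreign(item, INVOICE_NS)
remaining = [child.tag for child in item]
print(remaining)
```

<p>
Here f(x) is the item with the part description stripped out.
A delivery-note processor would apply a different f to the
same x.
</p>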
<h3>
<a name="Least" id="Least">Principle of Least Power</a>
</h3>
<p>
In practice, this suggests that one should leave the actual
choice of the transformation language as a flexibility point.
However, as with most choices of computer language, the
general "principle of least power" applies:
</p>
<table border="1" cellpadding="2">
<tbody>
<tr>
<td>
When expressing something, use the least powerful
language you can.
</td>
</tr>
</tbody>
</table>
<p>
<i>(@@justify in greater depth in footnote)</i>
</p>
<p>
While being able to express a very complex function may feel
good, the result will in general be less useful. As Lao-Tse
puts it, "<a href="Evolution.html#within">Usefulness from
what is not there</a>". From the point of view of translation
algorithms, one usefulness is for them to be reversible. In
the case in which you are trying to prove something (such as
access to a web site or financial credibility) you need to be
able to derive a document of a given form. The rules you use
are the pieces of the web of trust and you are looking for a
path through the web of trust. Clearly, one approach is to
enumerate all the things which can be deduced from a given
document, but it is faster to have an idea of which
algorithms to apply. Simple ones have input and output
patterns. A deletion rule is a very simple case:
</p>
<p align="center">
s/\(.*\)foo\(.*\)/\1\2/
</p>
<p>
This is stream editor language for "Remove "foo" from any
string leaving what was on either side". If this rule is
allowed, it means that "foo" is optional. @@@ to be continued
</p>
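<p>
For readers who want to try the deletion rule, here is one
way to spell it with Python's standard re module, with the two
sides of "foo" written as explicit capture groups. The
function name is an assumption for illustration. Because the
first group is greedy, a line containing "foo" more than once
loses only the last occurrence.
</p>

```python
import re

# Capture whatever is on either side of "foo" so it can be kept.
DELETE_FOO = re.compile(r"(.*)foo(.*)")


def apply_rule(text):
    """Return text with one "foo" removed, or unchanged if none is present."""
    return DELETE_FOO.sub(r"\1\2", text, count=1)


print(apply_rule("abcfoodef"))
print(apply_rule("abcdef"))
```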
<p>
Optional features and Partial Understanding
</p>
<ul>
<li>Goal: V1 software partially understands V2 document
</li>
<li>Optional features visible as such
</li>
<li>Example: "Mandatory" Internet Draft
</li>
<li>Example: SMIL (P.Rec. 1998/4/9)
</li>
<li>Conversion from unknown language to known language.
</li>
</ul>
<h1>
<a name="ToII" id="ToII">Test of Independent Invention</a>
</h1>
<p>
The test of independent invention is a thought experiment
which tests one aspect of the quality of a design. When you
design something, you make a number of important
architectural decisions, such as how many wheels a car has,
and that an arch will be used between the pillars of the
vault. You make other arbitrary decisions such as the color
of the car, the side of the road everyone will drive, whether
to open the egg at the big end or the little end.
</p>
<p>
Suppose it just happens that another group is designing the
same sort of thing, tackling the same problem, somewhere
else. They are quite unknown to you and you to them, but just
suppose that being just as smart as you, they make all the
same important architectural decisions. This you can expect
if you believe that these decisions make logical sense.
Imagine that they have the same philosophy: it is largely the
philosophy which we are testing. However, imagine that they
make all the arbitrary decisions differently. They complement
bit 7. They drive on the other side of the road. They
use red buoys on the starboard side, and use 575 lines per
screen on their televisions.
</p>
<p>
Now imagine that the two systems both work (locally), and
being successful, grow and grow. After a while, they meet.
Suddenly you discover each other. Suddenly, people want to
work across both systems. They want to connect two road
systems, two telephone systems, two networks, two webs. What
happens?
</p>
<p>
I tried originally to make WWW pass the test. Suppose someone
had (and it was quite likely) invented a World Wide Web
system somewhere else with the same principles. Suppose they
called it the Multi Media Mesh <sup>(tm)</sup> and based it
on Media Resource Identifiers<sup>(tm)</sup>, the MultiMedia
Transport Protocol<sup>(tm)</sup>, and a Multi Media Markup
Language<sup>(tm)</sup>. After a few years, the Web and the
Mesh meet. What is the damage?
</p>
<ul>
<li>A huge battle, involving the abandonment of projects,
conversion or loss of data?
</li>
<li>Division of the world by a border commission into two
separate communities?
</li>
<li>Smooth integration with only incremental effort?
</li>
</ul>
<p>
(see also <a href="../People/Berners-Lee/UU.html">WWW and
Unitarian Universalism</a>)
</p>
<p>
Obviously we are looking for the last option. Fortunately,
we could immediately extend URIs to include "mmtp://" and
extend MRIs to include "http:\\". We could make gateways, and
on the better browsers immediately configure them to go
through a gateway when finding a URI of the new type. The URI
space is universal: it covers all addresses of all accessible
objects. But it does not have to be the only universal space.
Universal, but not unique. We could add MMML as a MIME type.
And so on. However, if we required all Web servers to
synchronise through one and only one master lock server in
Waltdorf, we would have found the Mesh required
synchronisation through a master server in Melbourne. It would
have failed.
</p>
<p>
No system completely passes the ToII - it is always some
trouble to convert.
</p>
<h3>
<a name="real" id="real">Not just a thought experiment</a>
</h3>
<p>
As the Web becomes the basis for many, many applications to be
built on top of it, the phenomenon of independent invention
will recur again and again. We have to build technology so as
to make it easy for systems to pass the test, and so survive
real life in an evolving world.
</p>
<p>
If systems cannot pass the ToII, then we can only achieve
worldwide interoperability when one original design has
beaten the others. This can happen if we all sit
down together as a worldwide committee and do a "top-down"
design of the whole thing before we start. This works
for a new idea but not for the automation of something which,
like pharmacy or trade, has been going on for centuries and
is just being represented in the Semantic Web. For example,
the library community has had endless trouble trying to agree
on a single library card format (MARC record) worldwide.
</p>
<p>
Another way it can happen is if one system is dropped
completely, leading to a complete loss of the effort put
into it. When in the late 1980s Europe eventually abandoned
its suite of ISO protocols for networking because they just
could not interwork with the Internet, a huge amount of work
was lost. Many problems, solved in Europe but not in the US
(including network addresses of more than 32 bits) had to be
solved again on the Internet at great cost. Sweden actually
changed from driving on the left to driving on the right. All
over the world, people have changed word processor formats
again and again but only at the cost of losing access to huge
amounts of legacy information. The test of independent
invention is not just a thought experiment, it is happening
all the time.
</p>
<h1>
<a name="requirements" id="requirements">From philosophy to
requirement</a>
</h1>
<p>
So now let us get more specific about what we really need in
the underlying technology of the Semantic Web to allow
systems in the future to pass the test of independent
invention.
</p>
<h3>
<a name="smarter" id="smarter">We will be smarter</a>
</h3>
<p>
Our first assumption is that we will be smarter in the
future. This means that we will produce better systems. We
will want to move on from version 1 to version 2, from
version n to version n+1.
</p>
<p>
What happens now? A group of people use version 4 of a word
processor and share some documents. One touches a document
using a new version 5 of the same program. One of the other
people tries to load it using version 4 of the software. The
version 4 program reads the file, and finds it is a version 5
file. It declares that there is no way it can read the
file, as it was produced in the future, and there is no way it
can predict the future to know how to read a version 5 file.
A flag day occurs: everyone in the group has to upgrade
immediately - and often they never even planned to.
</p>
<p>
So the first requirement is for a version 4 program to be
able to read a version 5 file. Of course there will be some
features in version 5 that the version 4 program will not be
able to understand. But most of the time, we actually find
that what we want to achieve can be done by partial
understanding - understanding those parts of the document
which correspond to functions which exist in version 4. But
even though we know partial understanding would be
acceptable, with most systems we don't know how to do even
that.
</p>
<h3>
<a name="others" id="others">We are not the smartest</a>
</h3>
<p>
The philosophical assumption that we may not be smarter than
everyone else (a huge step for some!) leads us to realise
that others will have great ideas too, and will independently
invent the same things. It forces us to consider the test of
independent invention.
</p>
<p>
The requirement for the system to pass the ToII is for one
program which we write to be able to read somehow (partially
if not totally) data written by the program written by the
other folks. This simple operation is the key to
decentralised evolution of our technology, and to the whole
future of the Web.
</p>
<p>
So we have deduced two requirements for the system from our
simple philosophical assumptions:
</p>
<ul>
<li>We will be smarter in the future
<ul>
<li>Technology: Moving Version 1 to Version 2
</li>
</ul>
</li>
<li>We are not smarter than everyone else
<ul>
<li>Decentralized evolution
</li>
<li>Technology: Moving between parallel Version A and
Version B
</li>
</ul>
</li>
</ul>
<h3>
<a name="sofar" id="sofar">The story so far</a>
</h3>
<p>
Where are we with the requirements for evolvability so far? We
are looking for a technology which has free but well defined
extension. We want to do it by allowing documents to use
mixed vocabularies. We have already found out (from PICS work
for example) that we need to be able to know whether
extension vocabulary is mandatory or can be ignored. We want
to use the Web for any registry, rather than any central
point. The technology has to allow an application to be
able to convert the output of a future version of itself, or
the output of an equivalent program written independently,
into something it can process, just by looking up schema
information.
</p>
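<p>
One of these requirements, knowing whether an extension
vocabulary is mandatory or can be ignored, can be illustrated
with a small Python sketch. The data layout is an assumption
made for the example (a document modelled as a mapping from
vocabulary URI to a section carrying a "mandatory" flag), and
the example.org URIs are placeholders. It is not a proposal
for a concrete syntax.
</p>

```python
# Hypothetical vocabulary URIs; only this one is known to the reader.
UNDERSTOOD = {"http://example.org/1998/invoice"}


def readable_parts(document):
    """Return the sections this reader can process, or None if it must refuse.

    `document` maps each vocabulary URI to a dict with a `mandatory`
    flag and the `content` written in that vocabulary.
    """
    kept = {}
    for uri, section in document.items():
        if uri in UNDERSTOOD:
            kept[uri] = section["content"]
        elif section["mandatory"]:
            return None  # an extension we must not ignore
        # otherwise: an optional extension, safely skipped
    return kept


newer_document = {
    "http://example.org/1998/invoice": {"mandatory": True, "content": "total 42"},
    "http://example.org/1998/parts": {"mandatory": False, "content": "..."},
}
print(readable_parts(newer_document))
```

<p>
An older program reading a newer document in this way gets
partial understanding where that is safe, and a clear refusal
where it is not.
</p>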
<h2>
<a name="data" id="data">Evolution of data</a>
</h2>
<p>
Now let us look at the world of data on the Web, the <a href=
"Semantic.html">Semantic Web</a>, which I expect to
become a new force in the next few years. By "data" as
opposed to "documents", I am talking about information on the
Web in a form specifically to aid automated processing rather
than human browsing. "Data" is characterised by information
with a well defined structure, where the atomic parts have
well defined types, such as numbers and choices from finite
sets. "Data", as in a relational database, normally has well
defined meaning which has rarely been written down. When
someone creates a new database, they have to give the data
type of each column, but don't have to explain what the field
name actually means in any way. So there is a well defined
semantics but not one which can be accessed. In fact, the
only time you tell the machine anything about the semantics
is when you define which two columns of different tables are
equivalent in some way, so that they can be used for example
as the basis for joining the two databases. (That the meaning
of data is only defined relative to the meaning of other data
is of course quite normal - we don't expect machines to have
any built in understanding of what "zip code" might mean
apart from where you can read it and write it and what you
can compare it with). Notice that what happens with real
databases is that they are defined by users one day, and they
evolve. They are rarely the result of a committee sitting
down and deciding on a set of concepts to use across a
company or an industry, and then designing the data schema.
The schema is created on the fly by the user.
</p>
<p>
We can distinguish two ways in which the word "schema" has
been used:
</p>
<table border="1" cellpadding="2">
<tbody>
<tr>
<td>
Syntactic Schema: A document, real or imagined, which
constrains the structure and/or type of data. <i>(pl.:
Schemata)</i>.
</td>
</tr>
</tbody>
</table>
<table border="1" cellpadding="2">
<tbody>
<tr>
<td>
Semantic schema: A document, real or imagined, which
defines the inferences from one schema to another,
thus defining the semantics of one syntactic schema in
terms of another.
</td>
</tr>
</tbody>
</table>
<p>
I will use it for the first only. In fact, a syntactic schema
defines a class of document, and often is accompanied by
human documentation which provides some rough semantics.
</p>
<p>
There is a huge amount ("legacy" would unfairly suggest
obsolescence) of data in relational databases. A certain
amount of it is being exported onto the web as virtual
hypertext. There are many applications which allow one to
make hypertext views of different aspects of a database, so
that each server request is met by performing a database query,
and then formatting the result as a report in HTML, with
appropriate style and decoration.
</p>
<h2>
Data about data: Metadata
</h2>
<p>
Information about information is interesting in two ways.
Firstly, it is interesting because the Web society
desperately needs it to be able to manage social aspects of
information such as endorsement (PICS labels, etc), ownership
and access rights to information, privacy policies (P3P,
etc), structuring and cataloguing information and a hundred
other uses which I will not try to enumerate. This first
aspect is discussed elsewhere. (See <a href=
"http://www.w3.org/DesignIssues/Metadata.html">Metadata
architecture</a> about general treatment of metadata and
labels, and the <a href="../TandS/Overview.html">Technology
and Society domain</a> for an overview of many of the social
drivers and related projects and technology.)
</p>
<p>
The second interest in metadata is that it is data. If we are
looking for a language for putting data onto the Web, in a
machine understandable way, then metadata happens to be a
first application area. Also, because metadata is fundamental
to most data on the web, it is the focus of W3C effort, while
many other forms of data are regarded as applications rather
than core Web architecture, and so are not.
</p>
<h3>
Publishing data on the web
</h3>
<p>
Suppose for example that you run a server which provides
online stock prices. Your application which today provides
fancy web pages with a company's data in text and graphs (as
GIFs) could tomorrow produce the same page as XML data, in
tabular form, for machine access. The same page could even be
produced at the same URL in two formats using content
negotiation, or you could have a typed link between the
machine-understandable and person-understandable versions.
</p>
<p>
The XML version contains at the top (or somewhere) a pointer
to a schema document. This pointer makes the document
"self-describing". It is this pointer which is the key to any
machine "understanding" of the page. By making the schema a
first class object, in other words by giving its URL and
nothing else, we are leaving the door open to many
possibilities. Now it is time to look at the various sorts of
schema document which it could point to.
</p>
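<p>
As a rough sketch of such a self-describing page, the stock
data might look like the following. The element names, the
values and the example.org URI are all invented for
illustration, and using the namespace declaration as the
schema pointer is only one assumption about how the pointer
could be carried.
</p>
<pre>
&lt;?xml version="1.0"?&gt;
&lt;!-- The namespace URI below acts as the (hypothetical) schema pointer --&gt;
&lt;quotes xmlns="http://stocks.example.org/schema/quotes"&gt;
  &lt;quote&gt;
    &lt;symbol&gt;EXMP&lt;/symbol&gt;
    &lt;price currency="USD"&gt;12.50&lt;/price&gt;
    &lt;time&gt;1998-04-15T16:00:00Z&lt;/time&gt;
  &lt;/quote&gt;
&lt;/quotes&gt;
</pre>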
<h2>
Levels of schema language
</h2>
<p>
Computer languages can be classified into various types, with
various capabilities, and the sort we choose for the schema
document, and the information we allow in the schema, fundamentally
affect not just what the semantic web can be but, more
importantly, how it can grow.
</p>
<p>
The schema document can, broadly, be one of the following:
</p>
<ol>
<li>Notional only: imaginary, non-existent but named.
</li>
<li>Human readable
</li>
<li>Machine-understandable and defining structure
</li>
<li>Machine-understandable and also defining which parts are optional
</li>
<li>A Turing-complete recipe for conversion into other
languages
</li>
<li>A logical model of the document
</li>
</ol>
<p>
We'll go over the pros and cons of each, because none of
these should be overlooked, but some are often way better
than others.
</p>
<h3>
Schema option 1: URI only
</h3>
<ul>
<li>No supporting documentation
</li>
<li>Allows compatibility yes/no test
</li>
</ul>
<p>
This may sound like a silly trivial example, but like many
trivial examples, it is not silly. If you just name your
schema somewhere in URI space, then you have identified it.
This doesn't offer a lot of help to anyone to find any
documentation online, but one fundamental function is
possible. Anyone can check compatibility: They can compare
the schema against a list of schemata they do understand, and
return yes or no.
</p>
<p>
In fact, they can also use an index to look up information
about the schema, including information about suitable
software to download to add understanding of the document. In
fact this level is the level which many RPC systems use: the
interface is given a unique but otherwise random number which
cannot be dereferenced directly.
</p>
<p>
So this is the level of machine-understanding typical of
distributed computing systems and should not be
underestimated. There are lots of parts of URI space you can
use for this: you might own some http: space (but never
actually serve the document at that point), but if you
don't, you can always generate a URI in a mid: or cid: space
or if desperate in one of the hash spaces.
</p>
<h3>
Schema option 2: Human readable
</h3>
<p>
The next step up from just using the Schema identifier as a
document type identifier is to make that URI one which will
dereference to a human-readable document. If you're a
computer, big deal. But as well as allowing a strict
compatibility test (test for equality of the schema URI),
this also allows human beings to get involved if there is any
argument as to what a document means. This can be significant!
For example, the schema could point to a complete technical
spec which is crammed with legalese about what the document
does and does not imply and commit to. At the end of the day,
all machine-understandable descriptions of documents are all
very well, but until the day that they bootstrap themselves
into legality, they must all in the end be defined in terms
of human-readable legalese to have social effect. Human
legalese is the schema language of our society. This is level
2.
</p>
<h3>
Schema option 3: Define structure
</h3>
<p>
Now we move into the meat of the schema system when we start
to discuss schema documents which are machine readable. Now
we are starting to enable some machine understanding and
automatic processing of document types which have not been
pre-programmed by people. &Ccedil;a commence.
</p>
<p>
The next level we consider is that when your browser (agent,
whatever) dereferences the namespace URI, it finds a schema
which defines the structure of the document. This is a bit
like an SGML Document Type Definition (DTD). It allows you
to do everything which the levels 1 and 2 allowed, if it has
sufficient comments in it to allow human arguments to be
settled.
</p>
<p>
In addition, a system which has a way of defining structure
allows everyone to have one and only one parser to handle all
manner of documents. Any document coming across the threshold
can be parsed into a tree.
</p>
<p>
More than that, it allows a document to be validated against
allowed structures. If a memo contains two subject fields, it
is not valid. This is one of the principal uses of DTDs in
SGML.
</p>
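<p>
To make the memo case concrete, here is a minimal sketch
using an XML DTD written as an internal subset. The element
names and content are assumptions chosen for illustration,
not taken from any real memo format.
</p>
<pre>
&lt;?xml version="1.0"?&gt;
&lt;!DOCTYPE memo [
  &lt;!-- Exactly one of each field, in this order --&gt;
  &lt;!ELEMENT memo (to, from, subject, body)&gt;
  &lt;!ELEMENT to (#PCDATA)&gt;
  &lt;!ELEMENT from (#PCDATA)&gt;
  &lt;!ELEMENT subject (#PCDATA)&gt;
  &lt;!ELEMENT body (#PCDATA)&gt;
]&gt;
&lt;memo&gt;
  &lt;to&gt;Alice&lt;/to&gt;
  &lt;from&gt;Bob&lt;/from&gt;
  &lt;subject&gt;Schema levels&lt;/subject&gt;
  &lt;body&gt;Adding a second subject element would make this memo invalid.&lt;/body&gt;
&lt;/memo&gt;
</pre>
<p>
A validating parser given this declaration should accept the
memo as written and reject one with two subject elements,
since the content model allows only one.
</p>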
<p>
In some cases, there may be another spin-off. You can imagine
that if the schema document lists the allowed structure of
the document, and the types (and maybe names) of each
element, then this would allow an agent to construct on the
fly a graphic user interface for editing such a document.
This was the intent with PICS rating systems: at least, a
parent coming across a new rating system would be given a
human-readable description of the various parameters and
would be able to select
</p>
<h3>
Schema option 4: Structure + Optional flags
</h3>
<p>
The "optional" flag is a term I use here for a common crucial
step which can make the difference between chaos and smooth
evolution. All you need to do is to mark in the schema of a
new version of the language which elements of the language
can be ignored if you don't understand them. This simple step
allows a processor which handled the old language, given the
schema of the new language, to filter it so as to produce a
document it can legitimately understand.
</p>
<p>
Now we have a technology which has all the benefits to date,
plus it can handle that elusive <strong>version 2 to version
1 conversion</strong> problem!
</p>
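<p>
One way to picture the flag is the sketch below. The schema
notation is entirely hypothetical (it is not DTD syntax or
any existing schema language), and the memo element names are
assumptions; the only point is that the version 2 schema
marks which additions a version 1 processor may drop.
</p>
<pre>
&lt;?xml version="1.0"?&gt;
&lt;!-- Hypothetical schema for version 2 of a memo language --&gt;
&lt;schema xmlns="http://schemas.example.org/hypothetical-schema-language"&gt;
  &lt;element name="memo"&gt;
    &lt;child name="to"/&gt;
    &lt;child name="from"/&gt;
    &lt;child name="subject"/&gt;
    &lt;!-- New in version 2; "optional" here means: ignore if not understood --&gt;
    &lt;child name="priority" optional="true"/&gt;
    &lt;child name="body"/&gt;
  &lt;/element&gt;
&lt;/schema&gt;
</pre>
<p>
Under these assumptions, a version 1 processor that reads
this schema can strip any priority element from a version 2
memo and be left with a document it already knows how to
handle.
</p>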
<h3>
Schema option 5: Turing-complete language
</h3>
<p>
Always in languages there is the balance between the
declarative limited language, whose formulae can be easily
manipulated, and the powerful programming language whose
programs cannot be analyzed in general, but which have to be
left to run to see what they do. Each end of the spectrum has
its benefits. In describing a language in terms of another,
one way is to provide a black box program, say in Java or
Javascript, which will convert from one to the other.
</p>
<p>
Filters written in Turing-complete languages generally have
to be trusted, as you can't see what rules they are based on
by looking at them. But they can do weird and wonderful
things. (They can also crash and loop forever of course!).
</p>
<p>
A good language for conversion from one XML-based language to
another is XSL. It started off as a template-like system for
building one document from another (and can be very simple)
but is in fact Turing-complete.
</p>
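<p>
At the very simple end, a conversion might look like the
sketch below. It uses XSLT 1.0 syntax, the transformation
part of XSL as it was later standardized, which is an
assumption about which dialect to show. The element names
title and subject are invented: the stylesheet copies a
document unchanged except that it renames language A's title
element to language B's subject.
</p>
<pre>
&lt;?xml version="1.0"?&gt;
&lt;xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"&gt;

  &lt;!-- Default rule: copy every node and attribute through unchanged --&gt;
  &lt;xsl:template match="@*|node()"&gt;
    &lt;xsl:copy&gt;
      &lt;xsl:apply-templates select="@*|node()"/&gt;
    &lt;/xsl:copy&gt;
  &lt;/xsl:template&gt;

  &lt;!-- Rename the (hypothetical) title element to subject --&gt;
  &lt;xsl:template match="title"&gt;
    &lt;subject&gt;
      &lt;xsl:apply-templates select="@*|node()"/&gt;
    &lt;/subject&gt;
  &lt;/xsl:template&gt;

&lt;/xsl:stylesheet&gt;
</pre>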
<p>
When you do publish a program to convert language A to
language B, then anyone who trusts it has that capability. A
disadvantage is that they never know how it works. You can't
deduce things about the individual components of the
languages. You can't therefore infer much indirectly about
relationships to other languages. The only way such a filter
can be used is to get whatever you have into language A and
then put it through the filter. This might be useful. But it
isn't as fascinating as the option of blowing language A
open.
</p>
<h3>
Schema option 6: Expose logic of document
</h3>
<p>
What is fundamentally more exciting is to write down as
explicitly as possible what the new language means. Sorry, let
me take that back, in case you think that I am talking about
some absolute meaning of meaning. If you know me, I am not.
All I mean is that we write in a machine-processable logical
way the equivalences and conversions which are possible in
and out of language A from other languages. And other
languages.
</p>
<p>
A specific case of course, is when we document the
relationship between version 2 and version 1. The schema
document for version 2 could explain that all the terms are
synonyms, except for some new terms which can be converted to
nothing (i.e. are optional) and some which affect the meaning
of the document completely and so if you don't understand
them you are stuck.
</p>
<p>
In a more general case, take a language like iCalendar in RDF
(were it in RDF), which is for describing events as would be
in a personal organizer. A schema for the language might
declare equivalences between a calendar's concept of group
MEMBERship and an access control system's concept of group
membership; it might declare the concept of LOCATION to be
equivalent to the text description of a Geographical
Information Systems standard's location, and it may declare
an INDIVIDUAL to be a superset of the HR department's concept
of employee. These bits of information are the stuff of the
semantic web, as they allow inference to stretch across the
globe and conclude things which we knew as a whole but no one
person knew. This is what RDF and the Semantic Web logic
built on top of it are all about.
</p>
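<p>
As an illustration only, two of these declarations could be
written in RDF/XML roughly as follows. The sketch borrows
rdfs:subClassOf from RDF Schema and owl:equivalentProperty
from OWL, vocabularies which postdate this text, and every
example.org URI is hypothetical.
</p>
<pre>
&lt;?xml version="1.0"?&gt;
&lt;rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"&gt;

  &lt;!-- Every HR employee is an INDIVIDUAL in the calendar's sense --&gt;
  &lt;rdf:Description rdf:about="http://hr.example.org/terms#Employee"&gt;
    &lt;rdfs:subClassOf rdf:resource="http://cal.example.org/terms#Individual"/&gt;
  &lt;/rdf:Description&gt;

  &lt;!-- Calendar group membership means the same as access control membership --&gt;
  &lt;rdf:Description rdf:about="http://cal.example.org/terms#member"&gt;
    &lt;owl:equivalentProperty rdf:resource="http://acl.example.org/terms#member"/&gt;
  &lt;/rdf:Description&gt;

&lt;/rdf:RDF&gt;
</pre>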
<hr />
<p>
So, what will semantic web engines be able to do? They will
not all have the same inference abilities or algorithms. They
will share a core concept of an RDF statement - an assertion
that a given <em>resource</em> has a <em>property</em> with a
given <em>value</em>. They will use this as a common way of
exchanging data even when their inference rules are not
compatible. An agent will be able to read a document in a new
version of a language, by looking up on the web the
relationship with the old version that it can natively read.
It will be able to combine many documents into a single graph
of knowledge, and draw deductions from the combination. And
even though it might not be able to find a proof of a given
hypothesis, when faced with an elaborated proof it will be
able to check its veracity.
</p>
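<p>
For concreteness, a single such statement might be serialized
in RDF/XML as sketched here. The property name ex:readableBy
and all the example.org URIs are made up for the purpose; the
comment marks the resource, the property and the value.
</p>
<pre>
&lt;?xml version="1.0"?&gt;
&lt;rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://terms.example.org/access#"&gt;
  &lt;!-- resource: the party photos; property: ex:readableBy; value: Joe --&gt;
  &lt;rdf:Description rdf:about="http://photos.example.org/party"&gt;
    &lt;ex:readableBy rdf:resource="http://people.example.org/joe"/&gt;
  &lt;/rdf:Description&gt;
&lt;/rdf:RDF&gt;
</pre>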
<p>
At this stage (1998) we need relational database experts in
the XML and RDF groups, [2000 -- include ontology and
conceptual graph and knowledge representation experts].
</p>
<h2 id="maps">
Evolvability in the real world
</h2>
<p>
Examples abound of language mixing and evolution in the real
world which make the need for these capabilities clear. There
is a great and unused overlap in the concepts used by, for
example, personal information managers, email systems, and so
on. These capabilities would allow information to flow
between these applications.
</p>
<p>
You just have to look at the history of a standard such as
MARC record for library information to see that the tension
between agreeing on a standard (difficult and only possible
for a common subset) and allowing variations (quick but not
interoperable) would be eased by allowing language mixing. A
card could be written out in a mixture of standard and local
terms.
</p>
<p>
The real world is full of times when conventions have been
developed separately and the relationships have been deduced
afterward: hence the market for third party converters of
disk formats, scheduler files, and so on.
</p>
<h1>
<a name="Engines" id="Engines">Engines of the future</a>
</h1>
<p>
I have left open the discussion as to what inference power
and algorithms will be useful on the semantic web precisely
because it will always be an open question. When a language
is sufficiently expressive to be able to express the state of
the real world and real problems then there will be no one
query engine which will be able to solve real problems.
</p>
<p>
We can, however, guess at how systems might evolve. No one at
the beginning of the Web foresaw the search engines which
could index almost all the web, so these guesses may be very
inaccurate!
</p>
<p>
We note that logical systems provide provably good answers,
but don't scale to large problems. We see that search
engines, remarkably, do scale - but at the moment produce
very unreliable answers. Now, on a semantic web we can
imagine a combination of the two. For example, a search
engine could retrieve all the documents which reference the
terms used in the query, and then a logical system could act on
that closed finite world of information to determine a
reliable solution if one exists.
</p>
<p>
In fact I think we will see a huge market for interesting new
algorithms, each to take advantage of particular
characteristics of particular parts of the Web. New
algorithms around electronic commerce may have directly
beneficial business models, so there will be incentive for
their development.
</p>
<p>
Imagine some questions we might want to ask an engine of the
future:
</p>
<ul>
<li>Can Joe access the party photos?
</li>
<li>Who are all the people who can?
</li>
<li>Is there a green car for sale for around $15000 in
Queensland?
</li>
<li>Did someone driving a blue car send us an invoice for
over $10000?
</li>
<li>What was the average temperature in 1997 in Brisbane?
</li>
<li>Please fill in my tax form!
</li>
</ul>
<p>
All these involve bridging barriers between domains of
knowledge, but they do not involve very complex logic --
except for the tax form, that is. And who knows, perhaps in
the future the tax code will have to be presented as a
formula on the semantic web, just as it is expected now that
one make such a public human-readable document available on
the Web.
</p>
<h2 id="Conclusion">
Conclusion
</h2>
<p>
There are some requirements on the Semantic Web design which
must be upheld if the technology is to be able to evolve
smoothly. They involve both the introduction of new versions
of one language, and also the merging of two originally
independent languages. XML Namespaces and RDF are designed to
meet these requirements, but a lot more thought and careful
design will be needed before the system is complete.
</p>
<hr />
<blockquote>
<h4>
<a name="within" id="within">The Space Within</a>
</h4>
<p>
Thirty spokes share the wheel's hub;<br />
It is the center hole that makes it useful.<br />
Shape clay into a vessel;<br />
It is the space within that makes it useful.<br />
Cut doors and windows for a room;<br />
It is the holes that make it useful.<br />
Therefore profit comes from what is there;<br />
Usefulness from what is not there.
</p>
</blockquote>
<address>
Lao-Tse
</address>
<p>
(UU-STLT#600)
</p>
<p>
...
</p>
<p>
Imagine that the EU and the US independently define RDF
schemata for an invoice. Invoices are traded around Europe
with a schema pointer at the top which identifies the schema.
Indeed, the schema may be found on the web.
</p>
<hr />
<hr />
<p>
<a href="Metadata.html">Next: &nbsp;Metadata architecture</a>
</p>
<p>
<a href="Overview.html">Up to Design Issues</a>
</p>
<p>
<a href="../People/Berners-Lee">Tim BL</a>
</p>
</body>
</html>