Another abandoned server code base... this is kind of an ancestor of taskrambler.
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
 
 
 
 
 
 

860 lines
34 KiB

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 13), see www.w3.org" />
<title>
Web architecture: Metadata
</title>
<link href="di.css" rel="stylesheet" type="text/css" />
<meta http-equiv="Content-Type" content="text/html" />
</head>
<body bgcolor="#DDFFDD" text="#000000">
<address>
Tim Berners-Lee
<p>
Date started: January 6, 1997
</p>
<p>
. Status: personal view, but corresponds &nbsp;generally to
the W3C architecture for metadata.
</p>
<p>
.
</p>
<p>
Additions are at the end about consistency in
label/metaset/collection syntax and semantics.
</p>
<p>
The syntaxes used in this document are meant to illustrate
the architecture and be clear but are otherwise random.
This note was written before the more general <a href=
"Semantic.html">Semantic Web</a> note.
</p>
</address>
<p>
<a href="Overview.html">Up to Design Issues</a>
</p>
<h3>
Axioms of Web Architecture: Metadata
</h3>
<hr />
<h1>
Metadata Architecture
</h1>
<h4 id="Preface">
Preface
</h4>
<p>
<em>This document was written before the Semantic Web
Roadmap, but is an introduction to the same ideas. Both
introduce the world of machine-readable data on the web. This
document introduces the concepts in the historical sequence
at W3C, where the first driving applications of semantic web
were metadat, and the first driving metadata applications
were endorsement labels (<a href="#PICS">PICS</a>)</em>.
</p>
<h2>
Documents, Metadata, and Links<br />
</h2>
<p>
The thing which you get when you follow a link, when you
de-reference a URI, has a lot of names. Formally we call it a
<b>resource</b>. Sometimes it is referred to as a document
because many of the things currently on the Web are human
readable documents. Sometimes it is referred to as an object
when the object is something which is more machine readable
in nature or has hidden state. I will use the words document
and resource interchangeably in what follows and sometimes
may slip into using "object".
</p>
<p>
One of the characteristics of the World Wide Web is that
resources, when you retrieve them, do not stand simply by
themselves without explanation, but there is information
about the resource. Information about information is
generally known as <b>Metadata</b>. Specifically, in the web
design,
</p>
<h4>
Definition
</h4>
<table border="1" cellpadding="2">
<tbody>
<tr>
<td>
Metadata is machine understandable information about
web resources or other things
</td>
</tr>
</tbody>
</table>
<p>
The phrase "machine understandable" is key. &nbsp;We are
talking here about information which software agents can use
in order to make life easier for us, ensure we obey our
principles, the law, check that we can trust what we are
doing, and make everything work more smoothly and rapidly.
Metadata has well defined semantics and structure.
</p>
<p>
Metadata was called "Metadata" because it started life, and
is currently still chiefly, information about web resources,
so data about data. &nbsp;In the future, when the metadata
languages and engines are more developed, it should also form
a strong basis for a web of machine understandable
information about anything: about the people, things,
concepts and ideas. &nbsp;We keep this fact in our minds in
the design, even though the first step is to make a system
for information about information.
</p>
<p>
For an example of metadata, when an object is retrieved using
the HTTP protocol, the protocol allows information about its
date, its expiry date, its owner, and other arbitrary
information to be sent by the server. The world of the World
Wide Web is therefore a world of information and some of that
information is information about information. In order to
have a coherent picture of this, we need a few axioms about
metadata. The first axiom is that :
</p>
<h4>
Axiom
</h4>
<table border="1" cellpadding="2">
<tbody>
<tr>
<td>
metadata is data.
</td>
</tr>
</tbody>
</table>
<p>
That is to say, information about information is to be
counted in all respects as information. There are various
parts of this.
</p>
<p>
One is that metadata can be stored regarded as data, it can
be stored in a resource. So, one resource may contain
information about itself or about another resource. In
current practice on the World Wide Web there are three ways
in which one gets metadata. The first is the data about a
document contained within the document itself, for example in
the HEAD part of an HTML documents or within word processor
documents. The second is that during the HTTP transfer the
server transfers some metadata to the client about the object
which is being transferred. This, during an http GET, is
transferred from the server to the client and, during a PUT
or a POST, is transferred from the client to the server. One
of the things which we have to rationalize in our
architecture of the World Wide Web is who exactly is making
the statement. Whose statement, whose property is that
metadata. The third way in which metadata is found is when it
is looked up in another document. This practice has not been
very common until the PICS initiative was to define label
formats specifically for representing information about World
Wide Web resources. The PICS architecture specifically allows
for PICS labels which are resources about other resources to
be buried within the resource itself, to be retrieved as
separate resources, or to be passed over during the http
transaction. To conclude,
</p>
<table border="1" cellpadding="2">
<tbody>
<tr>
<td>
Metadata about one document can occur within the
document, or within a separate document, or it may be
transferred accompanying the document.<br />
</td>
</tr>
</tbody>
</table>
<p>
Put another way, metadata can be a first class object.
</p>
<p>
The second part of the above axiom is:
</p>
<table border="1" cellpadding="2">
<tbody>
<tr>
<td>
Metadata can describe metadata
</td>
</tr>
</tbody>
</table>
<p>
That is, metadata itself may have attributes such as
ownership and an expiry date, and so there is meta-metadata
but we don't distinguish many levels, we just say that
metadata is data and that from that it follows that it can
have other data about itself. This gives the Web a certain
consistency.
</p>
<h2>
The Form of Metadata<br />
</h2>
<p>
Metadata consists of assertions about data, and such
assertions typically, when represented in computer systems,
take the form of a name or type of assertion and a set of
parameters, just as in the natural language a sentence takes
the form of a verb and a subject, an object and various
clauses.
</p>
<h4>
<a name="independent" id="independent">Axiom</a>
</h4>
<table border="1" cellpadding="2">
<tbody>
<tr>
<td>
The architecture is of metadata represented as a set of
independent assertions.
</td>
</tr>
</tbody>
</table>
<p>
This model implies that in general, two assertions about the
same resource can stand alone and independently. When they
are grouped together in one place, the combined assertion is
simply the sum (actually the logical AND) of the independent
ones. Therefore (because AND is commutative) collections of
assertions are essentially unordered sets. This design
decision rules out for example, in simple sets of data,
assertions which are somehow cumulative or later ones
override earlier ones. Each assertion stands independently of
others.
</p>
<p>
We will see below how logical expressions are formed to
combine assertions in more varied ways, and syntactic rules
which allow the subject at least of the assertion to be made
implicit. But neither of these change the basic operation of
combining assertions in unordered AND lists.
</p>
<h3>
<a name="Attributes" id="Attributes">Attributes</a>
</h3>
<p>
Assertions about resources are often referred to as
attributes of the resource. That is, the type of assertion is
an assertion that the object, the resource in question, has a
particular named property such as it's author, and in that
case the parameter is the name or identity of the author.
Similarly, if the attribute is the document's date of expiry
then the parameter is that date.
</p>
<p>
Often, a group of assertions about the same resource occur
together, in which case the syntax generally omits the URI of
that resource as it is implicit. In these cases, when it is
clear from the context about which resource the assertion is
being made, the assertion often takes the form of a list of
attributes and values. In RFC822 format messages, such as
mail messages and HTTP messages, metadata is transferred
where the attribute name is an RFC822 header name and the
rest of the RFC822 line is the value of the attribute, such
as Date: and From: and To: information. The attribute value
pair model is that used by most activities defining the
semantics of metadata today.<br />
</p>
<p>
I use the word "assertion" to emphasize the fact that the
attribute value pair when it is transferred is a statement
made by some party. It does not simply and directly imply
that the resource at any given time has that value for the
given attribute. It must be seen as a statement by a
particular party with or without implicit or explicit
guarantees as to validity. Throughout the World Wide Web, as
trust becomes an important issue, it will be important for
software -- and people -- to keep track of and take into
account who said what in terms of data and metadata. So, our
model of data of a resource is something about which
typically we know the creator or the person responsible, and
typically the date of which the information was created,
which implies, in the case of a piece of information which
makes an assertion, the date at which the assertion was made.
</p>
<p>
An assertion
</p>
<blockquote>
(A u1, p, q...)
</blockquote>
<p>
typically has as explicit parameters,
</p>
<ul>
<li>the URI of the resource about which the assertion is made
(u1).
</li>
<li>some identifier (A) for the type of assertion being made,
such as author or date or expiry date.
</li>
<li>other parameters (p, q,...) according to the type of
assertion.
</li>
</ul>
<p>
As implicit or explicit or implicit parameters,
</p>
<ul>
<li>The party making the assertion
</li>
<li>The date/time of the assertion
</li>
<li>etc...
</li>
</ul>
<p>
We can often make an analogy with programming languages. An
assertion in metadata can be compared with a function call in
a programing language. In object oriented languages, the
object of the function has a special place among the
parameters just as the subject of an assertion does in
metadata. In object oriented languages, though, the set of
possible functions depends on the object, whereas in metadata
the set of assertion types is more or less unlimited, defined
by independent choice of vocabulary. <em>Anyone can say
anything about anything</em>.
</p>
<h3>
A space for attribute names
</h3>
<p>
It is appropriate for the Web architecture to define like
this the topology and the general concepts of links and
metadata. What about the significance of individual
relationships? Sometimes, as above, these are special,
defined in the architecture, and having an architectural
significance or a significance to the protocols. In other
cases, the significance of relationships or indeed of
attributes is part of other specifications, other design, or
other applications, and must be defined easily by third
parties. Therefore, the set of such relationship and
attributes names must be extremely easily extensible and
therefore extensible in a decentralized manner. This is why
</p>
<table border="1" cellpadding="2">
<tbody>
<tr>
<td>
the URL space is an appropriate space for the
definition of attribute names.
</td>
</tr>
</tbody>
</table>
<p>
We have already (1997) several vocabularies of attribute
names: for example, the HTML elements which can occur within
the HEAD element, or as another example, the headers in an
HTTP request which specify attributes of the object. These
are defined within the scope of particular specifications.
There is always pressure to extend these specifications in a
flexible way. HTTP header names are generally extended
arbitrarily by those doing experiments. The same can also be
true of HTML elements and extension mechanisms have been
proposed for both. If we look generically at the very wide
space of all such metadata attribute names, we find something
in which the dictionary would be so large that ad hoc
arbitrary extension would be just as chaotic as central
registration would be stifling.
</p>
<blockquote>
<b>Aside: Comparison with Entity-Relationship models</b>.
<p>
This architecture, in which the assertion identifier is
taken from (basically) URL space differs from the
"Entity-relationship" (ER) model and many similar models
like it, including most object-oriented programming
systems. In an ER model, typically every object is typed
and the type of an object defines the attributes can have,
and therefore the assertions which are being made about it.
Once a person is defined as having a name, address and
phone number, then the schema has to be altered or a new
derived type of person must be introduced before one can
make assertions about the race, color or credit card number
of a person. The scope of the attribute name is the entity
type, just as in OOP the scope of a method name is an
object type (or interface)By contrast, in the web, the
hypertext link allows statements of new forms to be made
about any object, even though (before anything other than
syntax checking) this may lead to nonsense or paradox. One
can define a property "coolness" within one's own part of
the web, and then make statements about the "coolness" of
any object on the web.
</p>
<p>
This design difference is in essence a resurfacing of the
decision to make links mondirectional, sacrificing
consistency for scalability.
</p>
<p>
An advantage of ER systems is that they allow one to work,
in the user interface for example, with a set of properties
which "should" be defined for each entity. You can define
these in the Metadata's predicate calculus by defining an
expression for a "well specified" object. ("For all
<i>X</i> such that <i>X</i> is a customer <i>X</i> is
well-specified if there exists <i>n</i> such that <i>n</i>
is the name of <i>X</i> and there exists <i>t</i> such that
<i>t</i> is the telephone number of <i>X</i> and...)
</p>
<p>
end of aside.
</p>
</blockquote>
<h3>
<a name="MetadataHeaders" id="MetadataHeaders">Metadata
("Entity") headers in HTTP</a>
</h3>
<p>
In the above it is important to realize that the HTTP headers
which contain what can be considered as metadata ("entity
headers") should be separated quite distinctly from HTTP
headers which do not. HTTP headers which contain metadata
contain information which can follow the document around. For
example, it is reasonable for a cache to pass such
information on without treatment, it is reasonable for
clients or other programs which process data to store those
headers as metadata with the document for later processing.
The content of those headers do not have to be associated
with that particular HTTP transaction. By contrast, the
RFC822 headers in HTTP which deal specifically with the
transaction or deal specifically with the TCP link between
the two application programs have a shorter scope and can
only be regarded as parameters of the HTTP method. To make
this separation clear will be to make it easier not only to
understand HTTP and how it should be processed, it will also
make it clear which pieces of HTTP can be used easily and
transparently by other protocols which may use different
methods with different parameters. The clarification of the
architecture of HTTP such that both the metadata and the
methods can be extended into other domains is an important
part of the work of the World Wide Web Consortium. The
Internet protocols SMTP and NNTP and HTTP as well as many new
and proposed protocols share much of the semantics of the
RFC822 headers. Formalizing the shared space and making it
clear that there is a single design for a particular header,
rather than four designs which are independent and happen to
look very similar, requires a general architecture, some
careful thought, and is essential for the future design of
protocols. It will allow protocol design to happen in small
groups which can take for granted the bulk of previous work
and concentrate on independent new design.
</p>
<h4>
Authorship of HTTP entity headers
</h4>
<p>
It may be possible to remove or at least encompass the
apparent anomaly of metadata transferred from an HTTP server
by creating a special link type which links the document
itself to the set of attributes which the server would give
in the HTTP headers. In other words, the server would be able
to say, "here is a document, here is some metadata about it,
and the metadata about it has the following URL". This would
allow one, for example, request a signed copy of the HTTP
headers. It would allow one to ask about the intellectual
property rights of those headers, and the authorship of those
headers.
</p>
<p>
It is important to be completely clear about the authorship
of the HTTP headers. The server should be seen as a software
agent acting on behalf of a party which is the publisher or
document author: the definer of the URI to resource identity
mapping. The webmaster is only an administrator who is
responsible for ensuing that (through an appropriately
configured server) the transactions on the wire faithfully
represent the statements and wishes of that party.
</p>
<h2>
Links<br />
</h2>
<p>
An assertion of relationship between two resources is known
as a <b>link</b>.
</p>
<p>
In this case, it is a triple
</p>
<blockquote>
(<i>A u1 u2</i>)
</blockquote>
<p>
of:
</p>
<ul>
<li>the type of assertion being made, that is, the
relationship which is being asserted,
</li>
<li>the first URI,
</li>
<li>and the second URI.
</li>
</ul>
<p>
These sorts of assertions, links, are the basis of navigation
in the World Wide Web; they can be used for building
structure within the World Wide Web and also for creating a
semantic Web which can express knowledge about the world
itself. That is to say, links may be used both for the
structure of data, in which case they are metadata, but also
they may be used as a form of data.
</p>
<p>
Links, like all metadata can be transferred in three ways.
They can be embedded in a document, which is one end of the
link, they can be transferred in an HTTP message, for example
what is called the header of the document, and they can be
stored in a third document. This latter method has not been
used widely on the World Wide Web to date.
</p>
<h2>
Goal: <a name="Self-descr" id="Self-descr">Self-describing
information</a><br />
</h2>
<p>
A critical part of the design of the whole system is the way
that the semantics of metadata or indeed of data are defined.
The semantics of metadata in our RFC822 headers in mail
messages and in http messages are defined by hand in english
in the specifications of those protocols. The PICS system
takes this to one stage further in terms of flexibility by
allowing a message to contain a pointer to the document which
defines, in human readable terms, the semantics of each
assertion made within a <a href="#PICS">PICS</a> label. In
the future we would like to move toward a state in which any
metadata or eventually any form of machine readable data
carries a reference to the specification of the semantics of
all the assertions made within it.
</p>
<p>
For example, suppose that when a link is defined between two
documents, the relationship which is being asserted is
defined in a such way that it can be looked up on the World
Wide Web (i.e. using some form of URI), and someone or some
program, which has not come across that relationship before
can follow the link and extend its understanding or
functionality to take advantage of this new form of
assertion.
</p>
<p>
In the case of PICS, one can dynamically pick up a human
readable definition of what that assertion really means. In
PICS (and in theory in SGML using DTDs), one can also pick up
a machine readable definition of what form that assertion can
take, what syntax, what types of parameters it can take. This
allows a human interface to a new PICS scheme to built on the
fly. To go one step further, one could, given a suitable
logic or knowledge representation language, pick up a machine
readable definition of the semantics of that assertion in
terms of other relationships.
</p>
<p>
The advantages of such self describing information is that it
allows development of new applications and new functionality
independently by many groups across the web. Without
self-describing information, development must wait for large
companies or standards committees to meet and agree on the
commonly agreed semantics.
</p>
<p>
Of course a pragmatic way of extending software to handle new
forms of information is to dynamically download the code to
support a software object which can handle such data for one.
Whereas this is a powerful technique, and one which will be
used increasingly, it is not sufficient. It is not sufficient
because one has to trust the implementation of the object,
and the state.
</p>
<h4>
Goal
</h4>
<table border="1" cellpadding="2">
<tbody>
<tr>
<td>
As much as possible of the syntax and semantics should
be able to be acquired by reference from a metadata
document.
</td>
</tr>
</tbody>
</table>
<h3>
Building Applications using Link Relationships
</h3>
<p>
It turns out that a very large number of applications both
built on top of the web and also built within the
infrastructure of the Web can largely be built by defining
new relationship types. Examples of these are the document
versioning problem which can be largely solved by defining
link values relating documents to previous and future
versions and to lists of versions; intellectual property
rights, distribution terms, and other labeling which can be
solved by making a link from one document to the document
containing the metadata.
</p>
<hr />
<h3>
Summary so far
</h3>
<ol>
<li>Metadata is data
</li>
<li>Metadata may refer to any resource which has a URI
</li>
<li>Metadata may be stored in any resource no matter to which
resource it refers
</li>
<li>Metadata can be regarded as a set of assertions, each
assertion being about a resource &nbsp;(A &nbsp;<i>u1</i>
&nbsp;...).
</li>
<li>Assertions which state a named relationship between two
resources are known links &nbsp;(A <i>u1 u2</i>)
</li>
<li>Assertion types (including link relationships) should be
first class objects in the sense that they should be able to
be defined in addressable resources and referred to by the
address of that resource &nbsp;A in { u }
</li>
<li>The development of new assertion types and link
relationships should be done in a consistent manner so that
these sort of assertions can be treated generically by people
and by software.
</li>
</ol>
<hr />
<p>
<i>Rough from here on down</i>
</p>
<h3 id="Label">
Label syntax: Assertions about a common subject
</h3>
<p>
When labeling information, it is often useful to make a lot
of statements about the same object. It is also useful to be
able to make the same set of &nbsp;statements about a set of
resources. For example, the assertions
</p>
<pre>
(A1 u1 a b ... )
(A2 u1 c d )
(A2 u1 a f g h )
</pre>
<p>
might be written
</p>
<pre>
(for u1
(A1 a b ... )
(A2 c d )
(A3 a f g h )
)
</pre>
<p>
Therefore in the syntax of an actual assertion the subject is
implicit. This is just the case with RFC822 headers which
implicitly refer to the following body, and with HTML "HEAD"
element contents which implicitly refer to the containing
document. &nbsp;(Though notice there is a fundamental
difference, discussed <a href=
"w:/DesignIssues/temp.html#mesages">below</a>, between a
general label and a message header because the message header
is definitive.)
</p>
<p>
So it is wise to recognise the label as case which it is wise
to specifically optimize in the syntax. <em>[In RDF this
indeed the case, that the subject is established as a
context, and then many properties are given within that
context. -2000/9]</em>
</p>
<p>
Assertions, when the subject is implicit, are known as
attribute-value pairs as discussed above. Let's use the term
"label" for a set of assertions with the subject extracted.
&nbsp;Like the label on a jam jar, it contains information
but there must be something else (in this case if its
placement on the jar) which tells you to what it applies.
&nbsp;(The PICS label in fact contained other information
too, including the subject and meta-meta-data about the
authorship of the label.)
</p>
<p>
Local definition:
</p>
<table border="1" cellpadding="2">
<tbody>
<tr>
<td>
A label is a set of assertions with a common implicit
subject. &nbsp;In this architecture it is a set of
attribute-value pairs
</td>
</tr>
</tbody>
</table>
<p>
<i>(There is a convention that you can write "Jam" on a jam
jar label. &nbsp;You don't write "Jam jar" or "Jam Jar
label". &nbsp; Even though I once saw a label on a cardboard
box with the words "Equipment shipping box label" on it!)</i>
</p>
<h3>
Authorship of Metadata
</h3>
<p>
It follows from the fact that metadata is data that here can
be metadata about it. &nbsp;Some of this metadata becomes
crucial when we consider a trust model. &nbsp;The logic we
need includes the author of metadata
</p>
<p>
p1: (A u1 . . .)
</p>
<p>
where p1 is ,in a system with low trust, the author as
stated, but in a cryptographically secure system is a
principle represented by a key.
</p>
<p>
On the web, the granularity of information is the resource.
Authorship and access control genrally use this granularity.
Therefore, typically, the trust one places in an assertion is
function the document which asserted it, and the metadata
about that document. However, when information is then
combined from many resources, one needs a language which
allows the source of the original to be recorded. Like
blockquote in HTML, this separates the data itself from the
resource, so the resource does assert the data directly but
asserts that it was asserted.
</p>
<h2>
Analysing labels
</h2>
<p>
See <a href="Labels.html">Analysing PICS labels as generic
Metadata</a>
</p>
<p>
where we look at PICS labels and try to sift out the actual
semantics of them. This is a thought experiment generating
requiremnts. The conclusions are that information such as
authorship and date information in fact form a tree of
assertions about assertions, and it is important to be clear
about the structure of that tree. The notion of a message is
brought up there too, but not followed up as it is not
germaine to the discussion at this point.
</p>
<h2>
Algebraic Manipulations
</h2>
<p>
If you can make assumptions about the properties of labels
then you can manipulate them, possibly without knowing
everything about their meaning. &nbsp;Properties such as
commutativity, transitivity and associativity would be very
useful to have easily available: perhaps in the syntax, or
failing that in the schema.
</p>
<p>
[See <a href="Semantic.html">Semantic Web roadmap</a> for
higher levels of logic]
</p>
<p>
For example, given a label saying a pair of jeans has a 32
inch waist and a price of $28, I can deduce a label which
just has the price of $28. &nbsp;But given a label which says
that the punishment for the crime is a 2 month in jail and a
fine of $3000, &nbsp;I can't deduce one that says that that
the punishment &nbsp;is 2 months in jail.
</p>
<p>
A typical use of metadata will be to provide a statement
along with its proof to be verified by another party.
&nbsp;Being able to process these things efficiently and with
limited knowledge will be crucial.
</p>
<p>
The most practical way to do this is to create a basic
commonvocabulary for the logical functions. Sometimes known
as the "RDF upper layers", these are mentioned in the
<a href="Semantic.html">note on the Semantic Web.</a>
</p>
<h4>
Ordered/Unordered
</h4>
<p>
The <a href="#independent">axiom of independence of
assertions</a> above gives us that in any set of assertions,
as assertions are independently true, specific assertions may
be removed or reordered, leaving the document just as valid
(though possibly less informative).
</p>
<p>
Examples of unordered things currently are: RFC822 message
header lines, SGML attributes. Examples of ordered things
are: HTTP header lines and SGML elements.
</p>
<p>
Do we need a form in which we can make an assertion which has
many parameters which are in fact not mutable in any way?
</p>
<h2>
Summary of Requirements
</h2>
<p>
There are ways of representing &nbsp;the above things:
&nbsp;messages, labels, specifying labels, and statements and
distinguish between them.
</p>
<p>
As much as possible of the syntax and semantics should be
able to be acquired by reference from a metadata document.
</p>
<p>
It must be possible to mix multiple vocabularies within the
same scope.
</p>
<p>
The syntax and structure should be such that as many
manipulations as possible can be done without having to know
the semantics of the vocabulary in use.
</p>
<p>
A common voabulary for basic logic and knowledge
representation functionality will be required.
</p>
<hr />
<h2>
References
</h2>
<p>
<a name="PICS" id="PICS">PICS</a> - The PICS project was a
project to define standards for interchange of endorsement
information, aimed at the content filterting problem. See the
PICS home page.
</p>
<hr />
<address>
Tim BL, &nbsp;January 1997
<p>
Last edit $Date: 2009/08/27 21:38:08 $
</p>
</address>
</body>
</html>