<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
|
|
"http://www.w3.org/TR/REC-html40/loose.dtd">
|
|
<html>
|
|
<head>
|
|
<style type="text/css">
|
|
.feedback {border: thin solid black; padding: 1ex; margin: 1em;
|
|
background: #808080; color: white}
|
|
.subtitle {text-align: center}
|
|
|
|
</style>
|
|
<title>Web Characterisation Activity - Status Report</title>
|
|
<link rel="stylesheet" href="http://www.w3.org/StyleSheets/TR/W3C-NOTE.css"
|
|
type="text/css">
|
|
</head>
|
|
<body>
|
|
|
|
<div class="head">
|
|
<P><a href="http://www.w3.org/"><img border="0" alt="W3C"
|
|
height="48" width="72" src="/Icons/w3c_home"></a></P>
|
|
|
|
<h1 class="no-num no-toc">Web Characterization:</h1>
|
|
<h2 class="no-num no-toc">From working group to activity</h2>
|
|
<h3 class="no-num no-toc">W3C Note Mar 19 1999</h3>
|
|
<dl>
|
|
<dt>This version:</dt>
|
|
<dd>
|
|
<a href="http://www.w3.org/TR/1999/NOTE-WCA-19990319">http://www.w3.org/TR/1999/NOTE-WCA-19990319</a>
|
|
</dd>
|
|
<dt>Latest version:</dt>
|
|
<dd>
|
|
<a href="http://www.w3.org/TR/NOTE-WCA">http://www.w3.org/TR/NOTE-WCA</a>
|
|
</dd>
|
|
<dt>Editors:</dt>
|
|
<dd>
|
|
Jim Pitkow &lt;<a href="mailto:pitkow@parc.xerox.com">pitkow@parc.xerox.com</a>&gt;, Xerox PARC<br>
Johan Hjelm &lt;<a href="mailto:hjelm@w3.org">hjelm@w3.org</a>&gt;, W3C/Ericsson<br>
Henrik Frystyk Nielsen &lt;<a href="mailto:frystyk@w3.org">frystyk@w3.org</a>&gt;, W3C
|
|
</dd>
|
|
</dl>
|
|
<p>
|
|
<small><a href="/Consortium/Legal/ipr-notice#Copyright">Copyright</a> ©
|
|
1998 <a href="http://www.w3.org/">W3C</a> (<a
|
|
href="http://www.lcs.mit.edu/">MIT</a>, <a
|
|
href="http://www.inria.fr/">INRIA</a>, <a
|
|
href="http://www.keio.ac.jp/">Keio</a>), All Rights Reserved. W3C <a
|
|
href="/Consortium/Legal/ipr-notice#Legal Disclaimer"> liability,</a> <a
|
|
href="/Consortium/Legal/ipr-notice#W3C Trademarks"> trademark</a>, <a
|
|
href="/Consortium/Legal/copyright-documents"> document use</a> and <a
|
|
href="/Consortium/Legal/copyright-software"> software licensing</a> rules
|
|
apply. Your interactions with this site are in accordance with our <a
|
|
href="/Consortium/Legal/privacy-statement#Public">public</a> and <a
|
|
href="/Consortium/Legal/privacy-statement#Members"> Member</a> privacy
|
|
statements.</small></p>
|
|
</div>
|
|
|
|
<h2>Status of this document</h2>
|
|
<p>
|
|
This document is a W3C Note reporting on the results of the HTTP-NG Web
|
|
Characterization Group and the structure of the Web Characterization Activity.
|
|
The work which was part of the <a href="/Protocols/HTTP-NG/Activity">W3C
|
|
HTTP-NG Activity, phase I</a>, is now continued in the <a
|
|
href="http://www.w3.org/WCA/">Web Characterization Activity</a>.</p>
|
|
<p>
|
|
Review comments on this document should be sent to &lt;<a
href="mailto:www-wca@w3.org">www-wca@w3.org</a>&gt;, which is the <a
|
|
href="http://lists.w3.org/Archives/Public/www-wca/">archived</a> email list
|
|
for the <a href="/WCA/">Web Characterization Activity</a>. Information on how
|
|
to subscribe to public W3C email lists can be found at <a
|
|
href="http://www.w3.org/Mail/Lists">the subscription request page</a>.</p>
|
|
<p>
|
|
<em>This document is a NOTE made available by the W3C for discussion only.
|
|
This indicates no endorsement of its content, nor that the Consortium has, is,
|
|
or will be allocating any resources to the issues addressed by this
|
|
NOTE.</em></p>
|
|
|
|
<h2>Table of Contents</h2>
|
|
<dl>
|
|
<dt><a href="#Abstract">Abstract</a></dt>
|
|
<dt><a href="#1">1. The HTTP-NG Web Characterization Group</a></dt>
|
|
<dd><dl>
|
|
<dt><a href="#11">1.1 Mission statement</a></dt>
|
|
<dt><a href="#112">1.2 Participants</a></dt>
|
|
<dt><a href="#12">1.3 Deliverables and Accomplishments</a></dt>
|
|
</dl>
|
|
</dd>
|
|
<dt><a href="#2">2. The Web Characterization Activity</a></dt>
|
|
<dd><dl>
|
|
<dt><a href="#21">2.1 The structure of the Activity</a></dt>
|
|
</dl>
|
|
</dd>
|
|
<dt><a href="#3">3. Example characterizations</a></dt>
|
|
<dd><dl>
|
|
<dt><a href="#4">3.1 The HTTP-NG testbed</a></dt>
|
|
</dl>
|
|
</dd>
|
|
<dt><a href="#31">4. WCG papers</a></dt>
|
|
<dt><a href="#5">5. Summary</a></dt>
|
|
</dl>
|
|
|
|
<h2><a name="Abstract">Abstract</a></h2>
|
|
<p>
|
|
This document describes the experiences and results that came out of the Web
|
|
Characterization Group as part of the W3C HTTP-NG Activity, and how that work
|
|
is now continued in the Web Characterization Activity.</p>
|
|
<p>
|
|
The HTTP-NG Working Group created a series of scenarios for the HTTP-NG
|
|
protocol design group, which were implemented in the scope of the HTTP-NG
|
|
testbed, and used to optimize its design.</p>
|
|
<p>
|
|
The WCA started in November 1998, and will bring that work model to a wider
|
|
audience.</p>
|
|
|
|
<h2><a name="1">1. Introduction</a></h2>
|
|
<p>
|
|
Web Characterization is concerned with looking at the overall patterns of Web
|
|
structure and usage by measuring such aspects as server access patterns, the
|
|
kind of data being accessed, bytes transferred, popularity of resources, etc.
|
|
By better understanding the dynamics of the Web and how it grows we believe
|
|
that W3C and the Web Community in general will be better suited to evolve the
|
|
Web and to ensure its long term interoperability and robustness.</p>
|
|
<p>
|
|
The purpose of the Activity is to define and implement a scalable mechanism
for gathering data, boiling it down, and presenting it in efficient ways to
content providers, service providers, user groups, researchers, technology
designers, and other groups.</p>
|
|
<p>
|
|
The information used to characterize the Web is strictly concerned with
|
|
general patterns of Web usage and does not focus on specific users or Web
|
|
sites. The scope of this Activity is to characterize the Web as a distributed
|
|
system and not on an individual basis.</p>
|
|
|
|
<h3><a name="11">1.1 Mission Statement</a></h3>
|
|
<p>
|
|
The HTTP-NG Web Characterization Group was chartered in August 1997 as a part
|
|
of the HTTP-NG Activity. Its intent was to create a stable and comprehensive
|
|
platform of knowledge and analysis of the Web, to enable the protocol
|
|
designers to create a relevant and well-instructed solution. Previously,
|
|
analysis of user behavior on the Web has often been based on spurious data,
|
|
gathered in an ad-hoc manner. The HTTP-NG Web Characterization Group was an
|
|
attempt at rectifying this.</p>
|
|
<p>
|
|
It was set up to fulfill four primary goals:</p>
|
|
<ol>
|
|
<li>
|
|
To respond to the questions raised by the HTTP-NG Protocol Design Group
|
|
regarding current usage of the World Wide Web.
|
|
</li>
|
|
<li>
|
|
To design and develop representative scenarios for use in the HTTP-NG testbed.
|
|
</li>
|
|
<li>
|
|
To make recommendations to the Protocol Design Group in issues concerning Web
|
|
usage and characterization methods.
|
|
</li>
|
|
<li>
|
|
To devise a system and a methodology to make characterization of the Web
|
|
easier and more reliable in the future.
|
|
</li>
|
|
</ol>
|
|
|
|
<h3><a name="112">1.2 Participants</a></h3>
|
|
<p>
|
|
The group consisted of members from Boston University's Ocean group, Harvard
College's Vino group, INRIA, Microsoft, Netscape, Virginia Tech's Network
Resource Group, and Xerox PARC's Webology group. Jim Pitkow, Xerox PARC,
chaired the group.</p>
|
|
|
|
<h3><a name="12">1.3 Deliverables and Accomplishments</a></h3>
|
|
<p>
|
|
The HTTP-NG WCG has leveraged and helped focus existing research programs,
|
|
which the group considers one of its major accomplishments.</p>
|
|
<p>
|
|
During its charter, the group has responded to the questions of the HTTP-NG
|
|
Protocol Design Group. This has been influential in the design of the HTTP-NG
|
|
protocol. It has also created the HTTP-NG testbed, which operates by using
|
|
SURGE (Scalable URL Reference Generator) from the Boston University Ocean Group. Scenario
parameters derived from observed statistical regularities in the distribution
of file sizes, reading times, and other metrics were used to simulate
client traffic in the testbed. SURGE captured some aspects of Web traffic that
were not taken into account by the traffic generators then in use.</p>
|
|
<p>
|
|
</p>
|
|
|
|
<center>
|
|
|
|
<table border="1" cellspacing="0" align="center">
|
|
<tbody>
|
|
<tr>
|
|
<th>
|
|
Status
|
|
</th>
|
|
<th>
|
|
Date accomplished
|
|
</th>
|
|
<th>
|
|
Deliverable
|
|
</th>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Done
|
|
</td>
|
|
<td>
|
|
Oct. 2-3, 1997
|
|
</td>
|
|
<td>
|
|
First face-to-face meeting
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Done
|
|
</td>
|
|
<td>
|
|
Nov. 1, 1997
|
|
</td>
|
|
<td>
|
|
Identification of classification parameters for Web categorization
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Done
|
|
</td>
|
|
<td>
|
|
Dec. 8, 1997
|
|
</td>
|
|
<td>
|
|
Plan for response to HTTP-NG Protocol Design Group questions
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Done
|
|
</td>
|
|
<td>
|
|
Dec. 31, 1997
|
|
</td>
|
|
<td>
|
|
Initial response to HTTP-NG PDG questions
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Done
|
|
</td>
|
|
<td>
|
|
Feb. 7, 1998
|
|
</td>
|
|
<td>
|
|
Final response to HTTP-NG PDG questions
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Done
|
|
</td>
|
|
<td>
|
|
March-April 1998
|
|
</td>
|
|
<td>
|
|
Trace analysis for scenario building, refined testbed software
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Done
|
|
</td>
|
|
<td>
|
|
April 24, 1998
|
|
</td>
|
|
<td>
|
|
Extended scenarios, refined testbed software
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Moved to WCA
|
|
</td>
|
|
<td>
|
|
|
|
</td>
|
|
<td>
|
|
Definition of new log file format
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Moved to WCA
|
|
</td>
|
|
<td>
|
|
|
|
</td>
|
|
<td>
|
|
Recommendations for automatic re-sampling
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Done
|
|
</td>
|
|
<td>
|
|
June 24, 1998
|
|
</td>
|
|
<td>
|
|
Project evaluation
|
|
</td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
</center>
|
|
<p>
|
|
The group has completed all the original requirements, with the exception of
the redesign of the Common Log File Format and the recommendations for
automatic re-sampling of the Web, both of which have been moved to the Web
Characterization Activity.</p>
|
|
|
|
<h2><a name="2">2. The Web Characterization Activity</a></h2>
|
|
<p>
|
|
The W3C Web Characterization Activity was started in November 1998 with a
workshop that gathered some 50 people interested in the subject. Subsequently, a
working group and an interest group have been started.</p>
|
|
<p>
|
|
The purpose of the Activity is to define and implement a scalable mechanism
for gathering data, boiling it down, and presenting it in efficient ways to
content providers, service providers, user groups, researchers, technology
designers, and other groups.</p>
|
|
<p>
|
|
The information used to characterize the Web is strictly concerned with
|
|
general patterns of Web usage and does not focus on specific users or Web
|
|
sites. The scope of this Activity is to characterize the Web as a distributed
|
|
system and not on an individual basis.</p>
|
|
<p>
|
|
The Web Characterization Group in the HTTP-NG Activity was a first phase in
this project. It was completed in August 1998, and phase 2 has begun. Its focus is
to extend the Web Characterization work and to create an active knowledge base
containing up-to-date information about the Web by broadening the scope of Web
characterization and providing information and test scenarios for the W3C
Membership and the Web community in general about the Web and its use, both
now and in the near future.</p>
|
|
<p>
|
|
An important result of WCG is the identification of the three key groups in
|
|
the characterization work and how they interact:</p>
|
|
<p style="text-align: center">
|
|
<img src="WCG-org" alt="WCA org chart"></p>
|
|
|
|
<h3>Bulk Data Providers</h3>
|
|
<p>
|
|
The Bulk Data Providers are typically server maintainers and ISPs providing
|
|
server and proxy logs but can also be backbone providers gathering information
|
|
directly from the Net or users running instrumented Web clients etc. Because
|
|
of privacy concerns and because of the sheer size of log files, it is often
|
|
preferred to have data providers running a set of characterization tools
|
|
locally so that only the boiled down data sets and profiles are released.</p>
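<p>
The idea of releasing only boiled down data can be illustrated with a minimal
sketch. It assumes the provider's logs are in Common Log Format; the function
name <code>reduce_log</code> and the profile keys are hypothetical and are not
part of any WCG tool. Only aggregate counts leave the function; client hosts
and full URLs are discarded.</p>

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "METHOD path PROTO" status bytes
CLF = re.compile(r'^\S+ \S+ \S+ \[[^\]]*\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)$')


def reduce_log(lines):
    """Return aggregate counts only; hosts and full URLs are discarded."""
    profile = Counter()
    for line in lines:
        m = CLF.match(line.strip())
        if m is None:
            profile["unparsed"] += 1
            continue
        method, path, status, size = m.groups()
        profile["requests"] += 1
        profile["method:" + method] += 1
        profile["status:" + status[0] + "xx"] += 1
        if "?" in path:
            # GET or POST request carrying appended content
            profile["with_query"] += 1
        if size.isdigit():
            profile["bytes"] += int(size)
    return dict(profile)


# Hypothetical log lines, for illustration only.
sample = [
    '10.0.0.1 - - [19/Mar/1999:10:00:00 +0000] "GET /index.html HTTP/1.0" 200 1024',
    '10.0.0.2 - - [19/Mar/1999:10:00:01 +0000] "GET /search?q=w3c HTTP/1.0" 302 0',
    '10.0.0.1 - - [19/Mar/1999:10:00:02 +0000] "POST /order HTTP/1.0" 404 512',
]
profile = reduce_log(sample)
```

<p>
A profile of this kind is small enough to exchange routinely and contains no
per-user records, which is the property the paragraph above relies on.</p>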
|
|
|
|
<h3>The W3C Characterization Working Group</h3>
|
|
<p>
|
|
The WCG develops and maintains a set of characterization tools used by the
|
|
data providers and defines the mechanism for exchanging boiled down data sets
|
|
and profiles with the data providers in order to maintain confidentiality and
|
|
trust. The collected data sets are used to develop characterization models and
|
|
to provide characterization data to the third group, the reduced data
|
|
consumers.</p>
|
|
|
|
<h3>Reduced Data Consumers</h3>
|
|
<p>
|
|
The reduced data consumers use the profiles and data sets provided by the WCG
|
|
and provide feedback and new questions to be asked. Primary data consumers are
|
|
expected to be content providers, service providers, user groups, researchers
|
|
and technology designers.</p>
|
|
|
|
<h2><a name="21">2.1 The structure of the Activity</a></h2>
|
|
<p>
|
|
The format for this Activity is to let the interaction between the reduced
|
|
data consumers and bulk data providers take place through an Interest Group,
|
|
with a new Web Characterization Working Group (WCG) functioning as the
|
|
mediator, provider of analysis tools and disseminator of characterization
|
|
information.</p>
|
|
|
|
<h4>Web Characterization Interest Group</h4>
|
|
<p>
|
|
The role of the Interest Group is to be a discussion forum for bulk data
|
|
providers and reduced data consumers, and to provide requests and feedback to
|
|
the Working Group. It is expected that the tools and dissemination mechanism
|
|
produced by the Working Group will benefit from a feedback mechanism with its
|
|
immediate users, as well as their continuous review. All work will be
|
|
discussed on the Web Characterization Activity Forum.</p>
|
|
<p>
|
|
Participation in the Interest Group is open to everybody.</p>
|
|
|
|
<h4>Web Characterization Workshop</h4>
|
|
<p>
|
|
The Activity was kicked off by the Web Characterization Workshop, November 5,
|
|
1998 in Boston, MA, with the intent of bringing together both W3C Members and
|
|
Web characterization experts. As a result of the Workshop, the Interest Group
was formed, and several organizations that wanted to participate in the Working
Group were identified.</p>
|
|
|
|
<h4>Web Characterization Working Group</h4>
|
|
<p>
|
|
The WCG is intended to work using a request/response based model similar to
|
|
the one used in the HTTP-NG Activity. Requests will be formally issued
|
|
by the Interest Group and by W3C Activities and the WCG will respond with
|
|
realistic time lines for when and how results can be made available.</p>
|
|
<p>
|
|
The WCG will start its work by formally soliciting requests for
|
|
characterization data needed by other W3C Working Groups and Activities. The
|
|
solicitation process is intended to occur at six-month intervals, enough time
|
|
for the Working Group to understand and respond to the requests of the other
|
|
W3C Groups. Requests from the Interest Group will be dealt with on a case by
|
|
case basis. All work will be discussed on the Web Characterization Activity
|
|
Forum.</p>
|
|
<p>
|
|
The working group has the following participants:</p>
|
|
<p>
|
|
</p>
|
|
|
|
<table border="1" cellpadding="0" style="text-align: center">
|
|
<tbody>
|
|
<tr>
|
|
<th>
|
|
<b>Name</b>
|
|
</th>
|
|
<th>
|
|
<b>Affiliation</b>
|
|
</th>
|
|
<th>
|
|
<b>Function in the WCA</b>
|
|
</th>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Marc Abrams
|
|
</td>
|
|
<td>
|
|
Virginia Tech
|
|
</td>
|
|
<td>
|
|
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Martin F. Arlitt
|
|
</td>
|
|
<td style="text-align: left">
|
|
HP Labs
|
|
</td>
|
|
<td>
|
|
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Paul Barford
|
|
</td>
|
|
<td>
|
|
Boston University
|
|
</td>
|
|
<td>
|
|
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Pei Cao
|
|
</td>
|
|
<td>
|
|
University of Wisconsin
|
|
</td>
|
|
<td>
|
|
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Anja Feldmann
|
|
</td>
|
|
<td>
|
|
AT&amp;T Research Labs
|
|
</td>
|
|
<td>
|
|
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Edward A. Fox
|
|
</td>
|
|
<td>
|
|
Virginia Tech
|
|
</td>
|
|
<td>
|
|
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Johan Hjelm
|
|
</td>
|
|
<td>
|
|
Ericsson/W3C
|
|
</td>
|
|
<td>
|
|
Interest Group Chair
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Balachander Krishnamurthy
|
|
</td>
|
|
<td>
|
|
AT&amp;T Research Labs
|
|
</td>
|
|
<td>
|
|
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Jim Gettys
|
|
</td>
|
|
<td>
|
|
W3C/Compaq
|
|
</td>
|
|
<td>
|
|
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Joe Meadows
|
|
</td>
|
|
<td>
|
|
Boeing
|
|
</td>
|
|
<td>
|
|
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Henrik Frystyk Nielsen
|
|
</td>
|
|
<td>
|
|
W3C
|
|
</td>
|
|
<td>
|
|
W3C Staff Contact
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Ed O'Neill
|
|
</td>
|
|
<td>
|
|
OCLC
|
|
</td>
|
|
<td>
|
|
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Jim Pitkow
|
|
</td>
|
|
<td>
|
|
Xerox PARC
|
|
</td>
|
|
<td>
|
|
Working Group Chair
|
|
</td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>
|
|
Further information about the work in progress can be found at the <a
|
|
href="http://www.w3.org/WCA/">Web Characterization Activity Home Page</a>.</p>
|
|
|
|
<h2><a name="3">3. Example Characterizations</a></h2>
|
|
<p>
|
|
The following are examples of some of the findings of the HTTP-NG WCG and
other researchers in the field of Web Characterization. This is not
meant to be either a complete listing of the findings of the HTTP-NG WCG or
a representative sample of research in the field. Rather, it contains results
that the group found provocative and representative of the types of questions
the HTTP-NG WCG found to be of interest.</p>
|
|
<ul>
|
|
<li>
|
|
<b>Question:</b> Where do all the clicks go?
|
|
</li>
|
|
<li>
|
|
<b>Answer:</b> Analysis performed independently by Alexa Internet, together with
the HTTP-NG WCG analysis of AOL trace data, indicates that only a few servers
account for the majority of all clicks issued by users. The proportions may
surprise you: 50% of the clicks go to only 1% of the WWW sites visited, and 80%
of the clicks go to only 26% of the sites!
|
|
<p>
|
|
</p>
|
|
</li>
|
|
</ul>
|
|
<p align="center">
|
|
<img src="Links" alt="Links vs. Servers"><br>
|
|
<br>
|
|
<a href="http://www.alexa.com/">Alexa</a> Internet and WCG Analysis of AOL
|
|
Data - December 1997</p>
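<p>
A concentration figure such as "50% of the clicks go to 1% of the sites" can
be computed from per-site click counts as sketched below. The counts are
invented for illustration and are not the Alexa or AOL data; the function name
<code>share_of_clicks</code> is likewise hypothetical.</p>

```python
def share_of_clicks(clicks_per_site, top_fraction):
    """Fraction of all clicks that go to the most-visited top_fraction of sites."""
    ranked = sorted(clicks_per_site, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)


# Hypothetical counts for 100 sites: one very popular site and a long tail.
counts = [500] + [20] * 10 + [3] * 89

top_1_percent = share_of_clicks(counts, 0.01)    # share taken by the top site
top_11_percent = share_of_clicks(counts, 0.11)   # share taken by the top 11 sites
```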
|
|
<ul>
|
|
<li>
|
|
<b>Question:</b> How fast is the Web growing?
|
|
</li>
|
|
<li>
|
|
<b>Answer:</b> Not as quickly as during the early years of the Web. This
calculation is a bit fuzzy, given that the data sources used different
methods to count the total number of servers and different definitions of what a server
actually is (virtual hosting, etc.). Even so, the rapid hyper-growth of the Web from
1992 to mid-1995 has slowed to roughly an order of magnitude every 30
months.
|
|
</li>
|
|
</ul>
|
|
<p>
|
|
</p>
|
|
|
|
<h2 align="center"><img src="Servers" alt="Number of Web Servers">
|
|
</h2>
|
|
<p align="center">
|
|
Source: <a href="/">W3C</a>, Mark Gray,
|
|
<a href="http://www.netcraft.co.uk/Survey/Reports/">Netcraft Server Survey</a></p>
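<p>
For readers who prefer yearly figures, the rate of one order of magnitude
every 30 months can be converted as follows. This is plain arithmetic on the
number quoted above, assuming steady exponential growth, which the underlying
data only roughly supports.</p>

```python
import math

MONTHS_PER_TENFOLD = 30  # "an order of magnitude every 30 months"


def growth_factor(months, months_per_tenfold=MONTHS_PER_TENFOLD):
    """Multiplicative growth over the given number of months."""
    return 10 ** (months / months_per_tenfold)


annual = growth_factor(12)                            # about 2.5x per year
doubling_months = MONTHS_PER_TENFOLD * math.log10(2)  # about 9 months
```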
|
|
<ul>
|
|
<li>
|
|
<b>Question:</b> How many WWW pages are there?
|
|
</li>
|
|
<li>
|
|
<b>Answer:</b> During the spring of 1998, two independent research groups (NEC
Research Institute and DEC Systems Research Center) employed the same
theoretical foundation of set intersection to answer this question, but came
up with different answers! While the devil is in the details
of the methodology employed by each study, the general procedure was to
issue a set of queries to the major search engines and determine the number of
identical pages returned by the engines. This tells us the number of pages that all
the search engines agree exist and the degree to which each search engine
contains pages the others do not. From this, NEC reports that there are <a
href="http://www.neci.nj.nec.com/homepages/lawrence/websize.html">320 million
pages</a>, while DEC reports that there are <a
href="http://www.research.digital.com/SRC/personal/Krishna_Bharat/WebArcheology/measurement.html">275
million pages</a>. Our analysis of the AOL log files points to approximately
200 to 250 million pages as of this writing.
|
|
</li>
|
|
<li>
|
|
<b>Question:</b> How often are broken links encountered? Redirections? POSTs?
GETs with appended content?
|
|
</li>
|
|
<li>
|
|
<b>Answer:</b> Primarily based on analysis of AOL data, we have found that
broken links account for between 5 and
8% of all clicks made while users surf the Web. Redirections (when a site
automatically sends your request to another location, as often occurs with
advertising and legacy servers) occur approximately 6 to 7% of the time. The HTTP
POST method occurred only 1% of the time, whereas GET requests with appended
material occurred around 10% of the time. These requests are significant in
that they are used to carry user data back to the server. This is useful for
searching, ordering, counters on pages that say how many times a page has been
visited, etc. Secure links (using SSL) account for 152,000 out of 300 million
links (0.05%).
|
|
</li>
|
|
</ul>
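<p>
The set intersection idea behind the page-count estimates above can be
sketched in its simplest form. This is not the procedure used by either the
NEC or the DEC study; it is the textbook capture-recapture estimate, which
assumes the two engines index pages independently of each other. All function
names and figures are hypothetical.</p>

```python
def estimate_total(size_a, size_b, overlap):
    """Estimate N from two index sizes and the size of their intersection."""
    if overlap == 0:
        raise ValueError("no overlap, so the estimate is undefined")
    return size_a * size_b / overlap


def overlap_fraction(sample_from_a, index_b):
    """Share of pages sampled from engine A that engine B also holds."""
    hits = sum(1 for url in sample_from_a if url in index_b)
    return hits / len(sample_from_a)


# Hypothetical figures, in millions of pages.
total = estimate_total(100, 80, 40)

# Hypothetical sample: 2 of the 4 pages drawn from A are also in B,
# so B covers roughly half of A.
index_b = {"/a", "/b", "/x"}
fraction = overlap_fraction(["/a", "/b", "/c", "/d"], index_b)
```

<p>
Because real search engines favor the same popular pages, the independence
assumption is optimistic, which is one reason why studies built on the same
foundation can still disagree.</p>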
|
|
|
|
<h3><a name="4">3.1 The HTTP-NG testbed</a></h3>
|
|
<p>
|
|
The HTTP-NG testbed was designed for the specific purpose of making reliable
|
|
and convincing claims that the performance of HTTP-NG would be comparable to
|
|
prior HTTP implementations. It was designed in close cooperation with the
|
|
HTTP-NG Protocol Design Group.</p>
|
|
<p>
|
|
An analysis of the current practice in load generation tools left the HTTP-NG
|
|
WCG concerned with the representativeness of the traffic being generated.</p>
|
|
<p>
|
|
Essentially, three types of traffic generation models exist: Stress testing,
|
|
trace replay, and statistically derived models. Many current traffic
|
|
generators follow the first model, by varying the number of requests per
|
|
second that are issued to the server. While this approach does test the
|
|
capacity of the server as measured by the number of HTTP operations per
|
|
second, it does not produce traffic patterns that have actually been
|
|
observed.</p>
|
|
<p>
|
|
The second model for traffic generation utilizes packet traces collected from
various servers and protocol analyzers. If this method had been used in the
testbed, the group would have had to acquire traces from representative
servers. Apart from determining what is representative, this also presents the
problems of deciding which servers to include and obtaining permission to use their log
file information. Each Web site would also need to be recreated, because, for example,
the file system configuration affects performance.</p>
|
|
<p>
|
|
Consequently, the group chose to model HTTP traffic statistically. The
users were segmented into three strata: corporate users, ISP users, and
educational users. To create models for the behavior of each stratum, the group
obtained full log files from America Online (major ISP), AltaVista (search
engine/mixed user group), and Boston University (educational users). From
Microsoft (corporate usage), a distribution of usage was obtained. All data
sets except for the AltaVista data were used to generate scenarios for the
testbed. The log file analysis tools used were based on the prior work of the
group members, and the personal connections of the group members were
instrumental in obtaining these data sets.</p>
|
|
<p>
|
|
The HTTP-NG testbed is designed as the diagram below shows:</p>
|
|
<p align="center">
|
|
<img alt="HTTP-NG testbed" src="http://www.w3.org/TR/NOTE-HTTP-NG-testbed/base"></p>
|
|
<p>
|
|
The HTTP-NG testbed was thus able to take both network characteristics and
|
|
user behavior into account, inserting a simulated network between the robot
|
|
simulating the client and the server. The statistical traffic generator takes
|
|
a set of parameters to create a mock server with the associated file system,
|
|
and a set of simulated clients that make statistically based requests for
|
|
files.</p>
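<p>
A statistically driven client of the kind described above might look roughly
like the following sketch. It is not SURGE and does not use the testbed's
parameters: the heavy-tailed (Pareto) distributions, the 1/rank popularity
weights, and every default value are assumptions chosen only to show the
shape of such a generator.</p>

```python
import random


def simulate_client(n_requests, n_files, seed=0,
                    size_alpha=1.2, min_size=1000, think_alpha=1.5):
    """Return a list of (time_in_seconds, file_id, size_in_bytes) requests."""
    rng = random.Random(seed)
    # Mock file system: heavy-tailed file sizes in bytes.
    sizes = [int(min_size * rng.paretovariate(size_alpha))
             for _ in range(n_files)]
    # Popularity: the file at rank r is requested with weight 1/r.
    weights = [1.0 / rank for rank in range(1, n_files + 1)]
    clock = 0.0
    trace = []
    for _ in range(n_requests):
        # Idle ("reading") time before the next request.
        clock += rng.paretovariate(think_alpha)
        file_id = rng.choices(range(n_files), weights=weights)[0]
        trace.append((round(clock, 3), file_id, sizes[file_id]))
    return trace


trace = simulate_client(n_requests=50, n_files=10)
```

<p>
Fixing the seed makes a scenario repeatable, so that two protocol
implementations can be compared against exactly the same request sequence.</p>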
|
|
<p>
|
|
The model characterizes sites as containing Web pages with embedded media and
|
|
Web pages without embedded media. Using a model that characterizes pages,
|
|
rather than just objects, makes alteration in the composition of sites easier.
|
|
This facilitates determining the effect of new technologies, like Cascading
|
|
Style Sheets (CSS).</p>
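<p>
The benefit of modeling pages rather than single objects can be made concrete
with a small sketch. The <code>Page</code> class, the
<code>add_stylesheet</code> helper, and all byte counts are hypothetical; they
only illustrate how a change in page composition translates into a change in
request count and bytes transferred.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    html_bytes: int
    embedded: list = field(default_factory=list)  # sizes of embedded objects

    def requests(self):
        """One request for the HTML plus one per embedded object."""
        return 1 + len(self.embedded)

    def total_bytes(self):
        return self.html_bytes + sum(self.embedded)


def add_stylesheet(page, css_bytes, replaced_images=0):
    """Hypothetical change: drop the first few images, add one style sheet."""
    kept = page.embedded[replaced_images:]
    return Page(page.html_bytes, kept + [css_bytes])


# Hypothetical page: 5000 bytes of HTML and three images.
page = Page(5000, [2000, 2000, 8000])
styled = add_stylesheet(page, css_bytes=1500, replaced_images=2)
```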
|
|
|
|
<h2><a name="31">4. WCG Papers</a></h2>
|
|
<p>
|
|
Throughout the year of the WCG's existence, various group members have
|
|
contributed papers, articles, and presentations to the group and the Web
|
|
characterization community. Given the limited focus of the HTTP-NG project
|
|
effort, it is not surprising that these items are focused on characterizations
|
|
and representative testbed designs.</p>
|
|
<p>
|
|
</p>
|
|
|
|
<center>
|
|
|
|
<table border="1" cellspacing="0" align="center">
|
|
<tbody>
|
|
<tr>
|
|
<th>
|
|
Author(s)
|
|
</th>
|
|
<th>
|
|
Papers, Articles, Notes
|
|
</th>
|
|
<th align="center">
|
|
Date Published
|
|
</th>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Jim Pitkow
|
|
</td>
|
|
<td>
|
|
<a href="http://www.w3.org/TR/NOTE-WCA">W3C Note: HTTP-NG WCG Status
|
|
Report</a>
|
|
</td>
|
|
<td>
|
|
July 1998
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Jim Pitkow
|
|
</td>
|
|
<td>
|
|
<a
|
|
href="http://www7.scu.edu.au/programme/fullpapers/1877/com1877.htm">Summary
|
|
of WWW Characterizations<br>
|
|
</a>Paper at WWW7
|
|
</td>
|
|
<td>
|
|
April 1998
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Huberman, Pirolli, Pitkow and Lukose
|
|
</td>
|
|
<td>
|
|
<a href="http://www.w3.org/Protocols/HTTP-NG/1998/02/1998-02-surfing-final.pdf">Strong Regularities
in World Wide Web Surfing</a> (PDF format)
|
|
</td>
|
|
<td>
|
|
April, 1998
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Barford and Crovella
|
|
</td>
|
|
<td>
|
|
<a href="http://www.cs.bu.edu/techreports/97-006-surge.ps.Z">Generating
Representative Web Workloads for Network and Server Performance
Evaluation</a> (PostScript format)
|
|
</td>
|
|
<td>
|
|
November, 1997
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Manley, Courage and Seltzer
|
|
</td>
|
|
<td>
|
|
<a href="http://www.eecs.harvard.edu/~margo/papers/hbench-web.ps">A
|
|
Self-Scaling and Self-Configuring Benchmark for Web Servers</a>
|
|
</td>
|
|
<td>
|
|
November, 1997
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Manley and Seltzer
|
|
</td>
|
|
<td>
|
|
<a href="http://www.eecs.harvard.edu/~vino/web/sits.97.html">Web Fact and
|
|
Fantasy</a>
|
|
</td>
|
|
<td>
|
|
October, 1997
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
Abdulla, Fox and Abrams
|
|
</td>
|
|
<td>
|
|
<a href="http://www.cs.vt.edu/~chitra/docs/97webnet/">Shared User Behavior on
|
|
the World Wide Web</a>
|
|
</td>
|
|
<td>
|
|
October, 1997
|
|
</td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
</center>
|
|
<p>
|
|
</p>
|
|
|
|
<h2><a name="5">5. Summary</a></h2>
|
|
<p>
|
|
The group has achieved its objectives, creating feedback for the HTTP-NG
|
|
Protocol Design Group by answering the questions this group had about the Web,
|
|
and by creating the HTTP-NG testbed, which enabled the creation of an
|
|
optimized and efficient design of the next generation of the Hypertext
|
|
Transfer Protocol. The Web characterization work is now being continued in the
|
|
Web Characterization Activity.</p>
|
|
<p>
|
|
</p>
|
|
<hr>
|
|
|
|
<address>
|
|
Jim Pitkow, Xerox PARC; Johan Hjelm, Ericsson/W3C; Henrik Frystyk Nielsen,
W3C<br>
|
|
@(#) $Id: NOTE-HTTP-NG-WCG-19990104.html,v 1.11 1999/01/04 23:06:42 frystyk
|
|
Exp $ </address>
|
|
</body>
|
|
</html>
|