<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Model Architecture for Voice Browser Systems</title>

<style type="text/css">
body {
  font-family: sans-serif;
  margin-left: 10%;
  margin-right: 5%;
  color: black;
  background-color: white;
  background-attachment: fixed;
  background-image: url(http://www.w3.org/StyleSheets/TR/WD.gif);
  background-position: top left;
  background-repeat: no-repeat;
  font-family: Tahoma, Verdana, "Myriad Web", Syntax, sans-serif;
}
.unfinished { font-style: normal; background-color: #FFFF33 }
.dtd-code {
  font-family: monospace;
  background-color: #dfdfdf; white-space: pre;
  border: #000000; border-style: solid;
  border-top-width: 1px; border-right-width: 1px;
  border-bottom-width: 1px; border-left-width: 1px;
}
p.copyright { font-size: smaller }
h2,h3 { margin-top: 1em; }
.extra { font-style: italic; color: #338033 }
code {
  color: green;
  font-family: monospace;
  font-weight: bold;
}
.example {
  border: solid green;
  border-width: 2px;
  color: green;
  font-weight: bold;
  margin-right: 5%;
  margin-left: 0;
}
.bad {
  border: solid red;
  border-width: 2px;
  margin-left: 0;
  margin-right: 5%;
  color: rgb(192, 101, 101);
}
div.navbar { text-align: center; }
div.contents {
  background-color: rgb(204,204,255);
  padding: 0.5em;
  border: none;
  margin-right: 5%;
}
table {
  margin-left: -4%;
  margin-right: 4%;
  font-family: sans-serif;
  background: white;
  border-width: 2px;
  border-color: white;
}
th { font-family: sans-serif; background: rgb(204, 204, 153) }
td { font-family: sans-serif; background: rgb(255, 255, 153) }
.tocline { list-style: none; }
</style>

<link rel="stylesheet" type="text/css" href="http://www.w3.org/StyleSheets/TR/W3C-WD.css">
</head>

<body>
<div class="head">
<p><a href="http://www.w3.org/"><img class="head" src="http://www.w3.org/Icons/WWW/w3c_home.gif" alt="W3C"></a></p>

<h1 class="notoc">Model Architecture for<br>
Voice Browser Systems</h1>

<h3 class="notoc">W3C Working Draft <i>23 December 1999</i></h3>

<dl>
<dt>This version:</dt>
<dd><a href="http://www.w3.org/TR/1999/WD-voice-architecture-19991223">http://www.w3.org/TR/1999/WD-voice-architecture-19991223</a></dd>

<dt>Latest version:</dt>
<dd><a href="http://www.w3.org/TR/voice-architecture">http://www.w3.org/TR/voice-architecture</a></dd>

<dt>Editors:</dt>
<dd>M. K. Brown, Bell Labs, Murray Hill, NJ<br>
D. A. Dahl, Unisys, Malvern, PA</dd>
</dl>

<p class="copyright"><a href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">Copyright</a> © 1999 <a href="http://www.w3.org/">W3C</a><sup>®</sup> (<a href="http://www.lcs.mit.edu/">MIT</a>, <a href="http://www.inria.fr/">INRIA</a>, <a href="http://www.keio.ac.jp/">Keio</a>), All Rights Reserved. <abbr title="World Wide Web Consortium">W3C</abbr> <a href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">liability</a>, <a href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">trademark</a>, <a href="http://www.w3.org/Consortium/Legal/copyright-documents">document use</a> and <a href="http://www.w3.org/Consortium/Legal/copyright-software">software licensing</a> rules apply.</p>

<hr>
</div>
<h2 class="notoc">Abstract</h2>

<p>The W3C Voice Browser working group aims to develop
specifications to enable access to the Web using spoken
interaction. This document is part of a set of requirements
studies for voice browsers, and provides a model architecture for
processing speech within voice browsers.</p>
<h2>Status of this document</h2>

<p>This document describes a model architecture for speech
processing in voice browsers as an aid to work on understanding
requirements. Related requirement drafts are linked from the
<a href="/TR/1999/WD-voice-intro-19991223">introduction</a>. The
requirements are being released as working drafts but are not
intended to become proposed recommendations.</p>
<p>This specification is a Working Draft of the Voice Browser working
group for review by W3C members and other interested parties. This is
the first public version of this document. It is a draft document and
may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use W3C Working Drafts as reference
material or to cite them as other than "work in progress".</p>
<p>Publication as a Working Draft does not imply endorsement by
the W3C membership, nor by the members of the Voice Browser
Working Group.</p>
<p>This document has been produced as part of the
<a href="http://www.w3.org/Voice/">W3C Voice Browser Activity</a>,
following the procedures set out for the
<a href="http://www.w3.org/Consortium/Process/">W3C Process</a>. The
authors of this document are members of the
<a href="http://www.w3.org/Voice/Group">Voice Browser Working Group</a>.
This document is for public review. Comments should be sent to
the public mailing list
&lt;<a href="mailto:www-voice@w3.org">www-voice@w3.org</a>&gt;
(<a href="http://www.w3.org/Archives/Public/www-voice/">archive</a>)
by 14th January 2000.</p>
<p>A list of current W3C Recommendations and other technical
documents can be found at
<a href="http://www.w3.org/TR">http://www.w3.org/TR</a>.</p>
<h2>0. Introduction</h2>
<p>To help clarify the scope of the charters of the several
subgroups of the W3C Voice Browser Working Group, a representative
or model architecture for a typical voice browser application has
been developed. This architecture illustrates one possible
arrangement of the main components of a typical system, and should
not be construed as a recommendation. Other proposed architectures
for spoken language systems, such as the
<a href="http://fofoca.mitre.org/index.html">DARPA Communicator
architecture</a>, are currently available and may also be
compatible with voice browsers.</p>
<p>Connections between components have been shown explicitly in
the interest of clearly indicating the flow of information among
the processes (and thereby indicating the interaction of the W3C
subgroups). Each of the currently existing subgroups
(Universal Access, Speech Synthesis, Grammar Representation,
Natural Language, and Dialog) is represented in this
architecture. New subgroups are currently being initiated and can
contribute additional elements to this architecture in future
drafts.</p>
<p>The design is intended to be agnostic with respect to client,
proxy, or server implementation of the various components,
although in practice some components will naturally fall into
client or server roles in relation to other components (indeed,
some components can be both clients to some components and
servers to others). An open-agent architecture, where component
connections are implicit, could be used for an actual
implementation of such a system, with components migrating to
client and/or server roles as necessary to fulfill their
duties. The model architecture is designed to accommodate
synchronized multi-modal input and multi-media output.</p>
<h2>1. Model Architecture</h2>
<p>The model architecture is shown in Figure 1. Solid (green)
boxes indicate system components, peripheral solid (yellow) boxes
indicate points of usage for markup language, and dotted
peripheral boxes indicate information flows.</p>
<p align="center"><img src="new_arch2-crop.gif" width="643"
height="491" alt="diagram showing model architecture for speech processing"></p>

<p align="center"><b>Figure 1. System Architecture</b></p>
<p>Two types of clients are illustrated: telephony and data
networking. The fundamental telephony client is, of course, the
telephone, either wireline or wireless. The handset telephone
requires a PSTN (Public Switched Telephone Network) interface,
which can be either tip/ring, T1, or higher level, and may include
hybrid echo cancellation to remove line echoes for ASR barge-in
over audio output. A speakerphone will also require an acoustic
echo canceller to remove room echoes. The data network interface
will require only acoustic echo cancellation if used with an open
microphone, since there is no line echo on data networks. The IP
interface is shown for illustration only; other data transport
mechanisms can be used as well.</p>
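<p>The echo canceller mentioned above is not specified by this
architecture; as an illustration only, a canceller of either kind is
commonly built around an adaptive filter such as NLMS (normalized
least mean squares). The sketch below is a minimal toy version under
that assumption; the filter length, step size, and test signal are
arbitrary illustrative choices.</p>

```python
# Minimal NLMS echo canceller sketch (illustrative only).
# The far-end signal x drives an adaptive FIR filter that estimates
# the echo; subtracting the estimate from the microphone signal d
# leaves the residual e, which approximates the near-end speech.

def nlms_echo_cancel(x, d, taps=8, mu=0.5, eps=1e-8):
    w = [0.0] * taps              # adaptive filter weights
    buf = [0.0] * taps            # most recent far-end samples
    residual = []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))    # echo estimate
        e = dn - y                                    # residual
        norm = sum(bi * bi for bi in buf) + eps
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        residual.append(e)
    return residual

# Synthetic echo path: the microphone hears a delayed, attenuated
# copy of the far-end signal (no near-end speech in this toy case).
x = [1.0, -0.5, 0.25, 0.8, -0.3, 0.6, -0.7, 0.4] * 50
d = [0.0] + [0.6 * v for v in x[:-1]]
residual = nlms_echo_cancel(x, d)
# As the filter adapts, the residual echo shrinks toward zero.
```

In a real deployment the echo path is unknown and time-varying, which
is why the filter adapts continuously rather than being fixed.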
<p>Once data has passed the client interface, it can be processed
in a similar manner. One minor difference may be speech
endpointing. For speech input from the telephony interface,
endpointing will most likely be performed either in the telephony
interface or at the front end of the ASR processor. For speech via
the IP interface, endpointing can be performed at the client as
well as at the ASR front end. The choice of where endpointing
occurs is coupled with the choice of echo cancellation.</p>
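<p>This document does not prescribe an endpointing method. For
illustration only, a common baseline wherever endpointing is placed is
a short-time energy detector; the frame length and threshold below are
assumed values, not requirements.</p>

```python
# Illustrative energy-based endpointer: a frame is speech when its
# mean squared amplitude exceeds a threshold; the endpoints are the
# first and last speech frames.

def endpoint(samples, frame_len=4, threshold=0.1):
    speech_frames = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy > threshold:
            speech_frames.append(i // frame_len)
    if not speech_frames:
        return None            # no speech detected
    return speech_frames[0], speech_frames[-1]

silence = [0.01, -0.02, 0.01, 0.0]
speech = [0.5, -0.6, 0.7, -0.4]
signal = silence * 2 + speech * 3 + silence * 2
# Frames 0-1 are silence, 2-4 are speech, 5-6 are silence again.
print(endpoint(signal))   # (2, 4)
```

Production endpointers add hangover timers and adaptive noise floors,
but the division of labor (client vs. ASR front end) is the same.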
<p>It is currently not clear how non-speech data will be handled
at the telephony interface. This can include inputs such as
pointing device input from a "smart phone," address books and
other client-resident file data, and eventually even data like
video. These smart telephone devices are now on the drawing boards
of many suppliers. Some of this traffic can be handled by WAP/WML,
but there are still open issues with regard to
multi-modality. Therefore voice markup language specifications
should provide means for extending the language features.</p>
<p>Data from the ASR/DTMF (etc.) recognizer must be in a format
compatible with the NL (Natural Language) interpreter. Typically
this would be text, but it might include non-textual components
for pointing device input, in which case pointing coordinates can
be associated with text and/or semantic tags. If the recognizer
has detected valid input while output is still being presented,
the recognizer can signal the presentation component to stop
output. Barge-in may not be desirable for certain types of
multi-media output, and should primarily be considered important
for interrupting speech output. In some cases it may also be
undesirable to interrupt speech output, such as in the processing
of commands to change speaking volume or rate.</p>
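<p>The recognizer-to-interpreter hand-off and the barge-in signal
described above might be pictured as in the sketch below. The class
names, fields, and confidence threshold are illustrative assumptions;
the architecture does not define a concrete data format.</p>

```python
# Illustrative hand-off: the recognizer emits text hypotheses with
# confidence scores, and on the first valid hypothesis it signals
# the presentation component to stop output (barge-in).

from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    confidence: float    # 0.0 .. 1.0

@dataclass
class Presentation:
    playing: bool = True
    def stop_output(self):
        self.playing = False

def deliver(hypotheses, presentation, min_confidence=0.4):
    """Forward valid hypotheses to the NL interpreter, barging in
    on the presentation component if any hypothesis is valid."""
    valid = [h for h in hypotheses if h.confidence >= min_confidence]
    if valid and presentation.playing:
        presentation.stop_output()        # barge-in signal
    return sorted(valid, key=lambda h: -h.confidence)

pres = Presentation()
hyps = [Hypothesis("call home", 0.82),
        Hypothesis("all home", 0.35),
        Hypothesis("called Rome", 0.51)]
best = deliver(hyps, pres)
# Output is stopped and the two valid hypotheses are passed on,
# best first.
```

A policy layer would additionally suppress barge-in for the
non-interruptible cases the paragraph above mentions, such as
volume or rate commands.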
<p>The recognizer can produce multiple outputs and associated
confidence scores. The NL interpreter can also produce multiple
interpretations. Interpreted NL output is coordinated with other
modes of input that may require interpretation in the current NL
context, or that may alter or augment the interpretation of the NL
input. It is the responsibility of the multi-media integration
module to produce possibly multiple coordinated joint
interpretations of the multi-modal input and present these to the
dialog manager. Context information can also be shared with the
dialog manager to further refine the interpretation, including
resolution of anaphora and implied expressions. The dialog manager
is responsible for the final selection of the best
interpretation.</p>
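<p>The n-best flow above — several recognition hypotheses, several
interpretations, and a dialog manager that picks the best joint
interpretation using context — can be sketched as follows. The
scoring scheme (product of recognizer and interpreter confidences,
with a multiplicative context bonus) is an assumption chosen for
illustration, not part of this architecture.</p>

```python
# Illustrative final selection by the dialog manager: combine the
# ASR confidence and NL confidence of each candidate, and boost
# candidates whose topic matches the current dialog context.

def best_interpretation(nbest, context_topic):
    """nbest: list of (asr_conf, interpretation, nl_conf, topic)."""
    scored = []
    for asr_conf, interp, nl_conf, topic in nbest:
        score = asr_conf * nl_conf
        if topic == context_topic:
            score *= 1.5     # context bonus (illustrative value)
        scored.append((score, interp))
    return max(scored)[1]

nbest = [
    (0.80, "play artist 'Boston'", 0.70, "music"),
    (0.70, "show flights to Boston", 0.60, "travel"),
]
# In a travel dialog, context tips the choice toward the flight
# query even though the music reading scored higher acoustically.
print(best_interpretation(nbest, "travel"))  # show flights to Boston
```

The same mechanism covers anaphora resolution in spirit: context
supplied by the dialog manager reweights otherwise ambiguous
readings.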
<p>The dialog manager is also responsible for responding to the
input statement. This responsibility can include resolving
ambiguity, issuing instructions and/or queries to the task
manager, collecting output from the task manager, forming a
natural language expression or visual presentation of the task
manager output, and coordinating recognizer context.</p>
<p>The task manager is primarily an Application Program Interface
(API), but it can also include pragmatic and application-specific
reasoning. The task manager can be an agent or proxy, can possess
state, and can communicate with other agents or proxies for
services. The primary application interface for the task manager
is expected to be web servers, but it can be other APIs as
well.</p>
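<p>The task manager's role — accepting instructions from the dialog
manager, holding state, and fetching results from a back-end service
such as a web server — might be sketched like this. The service
interface and instruction names are stand-in assumptions; no concrete
interface is defined by this document.</p>

```python
# Illustrative task manager: an API layer between the dialog
# manager and back-end services, keeping per-dialog state so that
# follow-up queries can refer to the previous result.

class TaskManager:
    def __init__(self, service):
        self.service = service        # e.g. a web-server client
        self.last_result = None       # task-manager state

    def execute(self, instruction, **params):
        result = self.service(instruction, params)
        self.last_result = result     # remembered for follow-ups
        return result

def fake_web_service(instruction, params):
    # Stand-in for an HTTP request to an application web server.
    if instruction == "weather":
        return {"city": params["city"], "forecast": "sunny"}
    raise ValueError("unknown instruction")

tm = TaskManager(fake_web_service)
answer = tm.execute("weather", city="Malvern")
# The dialog manager receives the result and can phrase a spoken
# response; the stored state supports "what about tomorrow?"-style
# follow-ups.
```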
<p>Finally, the presentation manager, or output media "renderer,"
has responsibility for formatting multi-media output in a
coordinated manner. The presentation manager should be aware of
the client device capabilities.</p>
</body>
</html>