<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Model Architecture for Voice Browser Systems</title>

<style type="text/css">
body {
  font-family: sans-serif;
  margin-left: 10%;
  margin-right: 5%;
  color: black;
  background-color: white;
  background-attachment: fixed;
  background-image: url(http://www.w3.org/StyleSheets/TR/WD.gif);
  background-position: top left;
  background-repeat: no-repeat;
  font-family: Tahoma, Verdana, "Myriad Web", Syntax, sans-serif;
}
.unfinished { font-style: normal; background-color: #FFFF33 }
.dtd-code {
  font-family: monospace;
  background-color: #dfdfdf; white-space: pre;
  border: #000000; border-style: solid;
  border-top-width: 1px; border-right-width: 1px;
  border-bottom-width: 1px; border-left-width: 1px;
}
p.copyright { font-size: smaller }
h2,h3 { margin-top: 1em; }
.extra { font-style: italic; color: #338033 }
code {
  color: green;
  font-family: monospace;
  font-weight: bold;
}
.example {
  border: solid green;
  border-width: 2px;
  color: green;
  font-weight: bold;
  margin-right: 5%;
  margin-left: 0;
}
.bad {
  border: solid red;
  border-width: 2px;
  margin-left: 0;
  margin-right: 5%;
  color: rgb(192, 101, 101);
}
div.navbar { text-align: center; }
div.contents {
  background-color: rgb(204,204,255);
  padding: 0.5em;
  border: none;
  margin-right: 5%;
}
table {
  margin-left: -4%;
  margin-right: 4%;
  font-family: sans-serif;
  background: white;
  border-width: 2px;
  border-color: white;
}
th { font-family: sans-serif; background: rgb(204, 204, 153) }
td { font-family: sans-serif; background: rgb(255, 255, 153) }
.tocline { list-style: none; }
</style>

<link rel="stylesheet" type="text/css" href="http://www.w3.org/StyleSheets/TR/W3C-WD.css">
</head>

<body>
<div class="head">
<p><a href="http://www.w3.org/"><img class="head" src="http://www.w3.org/Icons/WWW/w3c_home.gif" alt="W3C"></a></p>

<h1 class="notoc">Model Architecture for<br>
Voice Browser Systems</h1>

<h3 class="notoc">W3C Working Draft <i>23 December 1999</i></h3>

<dl>
<dt>This version:</dt>
<dd><a href="http://www.w3.org/TR/1999/WD-voice-architecture-19991223">http://www.w3.org/TR/1999/WD-voice-architecture-19991223</a></dd>

<dt>Latest version:</dt>
<dd><a href="http://www.w3.org/TR/voice-architecture">http://www.w3.org/TR/voice-architecture</a></dd>

<dt>Editors:</dt>
<dd>M. K. Brown, Bell Labs, Murray Hill, NJ<br>
D. A. Dahl, Unisys, Malvern, PA</dd>
</dl>

<p class="copyright"><a href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">Copyright</a> © 1999 <a href="http://www.w3.org/">W3C</a><sup>®</sup> (<a href="http://www.lcs.mit.edu/">MIT</a>, <a href="http://www.inria.fr/">INRIA</a>, <a href="http://www.keio.ac.jp/">Keio</a>), All Rights Reserved. <abbr title="World Wide Web Consortium">W3C</abbr> <a href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">liability</a>, <a href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">trademark</a>, <a href="http://www.w3.org/Consortium/Legal/copyright-documents">document use</a> and <a href="http://www.w3.org/Consortium/Legal/copyright-software">software licensing</a> rules apply.</p>

<hr>
</div>
<h2 class="notoc">Abstract</h2>

<p>The W3C Voice Browser working group aims to develop
specifications to enable access to the Web using spoken
interaction. This document is part of a set of requirements
studies for voice browsers, and provides a model architecture for
processing speech within voice browsers.</p>
<h2>Status of this document</h2>

<p>This document describes a model architecture for speech
processing in voice browsers as an aid to work on understanding
requirements. Related requirement drafts are linked from the
<a href="/TR/1999/WD-voice-intro-19991223">introduction</a>. The
requirements are being released as working drafts but are not
intended to become proposed recommendations.</p>
<p>This specification is a Working Draft of the Voice Browser working
group for review by W3C members and other interested parties. This is
the first public version of this document. It is a draft document and
may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use W3C Working Drafts as reference
material or to cite them as other than "work in progress".</p>
<p>Publication as a Working Draft does not imply endorsement by
the W3C membership, nor by the members of the Voice Browser
Working Group.</p>
<p>This document has been produced as part of the
<a href="http://www.w3.org/Voice/">W3C Voice Browser Activity</a>,
following the procedures set out for the
<a href="http://www.w3.org/Consortium/Process/">W3C Process</a>. The
authors of this document are members of the
<a href="http://www.w3.org/Voice/Group">Voice Browser Working Group</a>.
This document is for public review. Comments should be sent to
the public mailing list
&lt;<a href="mailto:www-voice@w3.org">www-voice@w3.org</a>&gt;
(<a href="http://www.w3.org/Archives/Public/www-voice/">archive</a>)
by 14th January 2000.</p>
<p>A list of current W3C Recommendations and other technical
documents can be found at
<a href="http://www.w3.org/TR">http://www.w3.org/TR</a>.</p>
<h2>0. Introduction</h2>
<p>To help clarify the scope of the charters of the several
subgroups of the W3C Voice Browser Working Group, a representative
or model architecture for a typical voice browser application has
been developed. This architecture illustrates one possible
arrangement of the main components of a typical system, and should
not be construed as a recommendation. Other proposed architectures
for spoken language systems, such as the
<a href="http://fofoca.mitre.org/index.html">DARPA Communicator
architecture</a>, are currently available and may also be
compatible with voice browsers.</p>
<p>Connections between components have been shown explicitly in
the interest of clearly indicating the flow of information among
the processes (and thereby indicating the interaction of the W3C
subgroups). Each of the currently existing subgroups
(Universal Access, Speech Synthesis, Grammar Representation,
Natural Language, and Dialog) is represented in this
architecture. New subgroups are currently being initiated and can
contribute additional elements to this architecture in future
drafts.</p>
<p>The design is intended to be agnostic with respect to client,
proxy, or server implementation of the various components,
although in practice some components will naturally fall into
client or server roles in relation to other components (indeed,
some components can be both clients to some components and
servers to others). An open-agent architecture, where component
connections are implicit, could be used for an actual
implementation of such a system, with components migrating to
client and/or server roles as necessary to fulfill their
duties. The model architecture is designed to accommodate
synchronized multi-modal input and multi-media output.</p>
<h2>1. Model Architecture</h2>
<p>The model architecture is shown in Figure 1. Solid (green)
boxes indicate system components, peripheral solid (yellow) boxes
indicate points of usage for markup language, and dotted
peripheral boxes indicate information flows.</p>
<p align="center"><img src="new_arch2-crop.gif" width="643"
height="491" alt="diagram showing model architecture for speech processing"></p>

<p align="center"><b>Figure 1. System Architecture</b></p>
<p>Two types of clients are illustrated: telephony and data
networking. The fundamental telephony client is, of course, the
telephone, either wireline or wireless. The handset telephone
requires a PSTN (Public Switched Telephone Network) interface,
which can be either tip/ring, T1, or higher level, and may include
hybrid echo cancellation to remove line echoes for ASR barge-in
over audio output. A speakerphone will also require an acoustic
echo canceller to remove room echoes. The data network interface
will require only acoustic echo cancellation if used with an open
microphone, since there is no line echo on data networks. The IP
interface is shown for illustration only; other data transport
mechanisms can be used as well.</p>
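<p>The echo canceller mentioned above is not specified by this
architecture; as an illustration only, a canceller of either kind is
commonly built around an adaptive filter such as NLMS (normalized
least mean squares). The sketch below is a minimal toy version under
that assumption; the filter length, step size, and test signal are
arbitrary illustrative choices.</p>

```python
# Minimal NLMS echo canceller sketch (illustrative only).
# The far-end signal x drives an adaptive FIR filter that estimates
# the echo; subtracting the estimate from the microphone signal d
# leaves the residual e, which approximates the near-end speech.

def nlms_echo_cancel(x, d, taps=8, mu=0.5, eps=1e-8):
    w = [0.0] * taps              # adaptive filter weights
    buf = [0.0] * taps            # most recent far-end samples
    residual = []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))    # echo estimate
        e = dn - y                                    # residual
        norm = sum(bi * bi for bi in buf) + eps
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        residual.append(e)
    return residual

# Synthetic echo path: the microphone hears a delayed, attenuated
# copy of the far-end signal (no near-end speech in this toy case).
x = [1.0, -0.5, 0.25, 0.8, -0.3, 0.6, -0.7, 0.4] * 50
d = [0.0] + [0.6 * v for v in x[:-1]]
residual = nlms_echo_cancel(x, d)
# As the filter adapts, the residual echo shrinks toward zero.
```

In a real deployment the echo path is unknown and time-varying, which
is why the filter adapts continuously rather than being fixed.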
<p>Once data has passed the client interface, it can be processed
in a similar manner. One minor difference may be speech
endpointing. For speech input from the telephony interface,
endpointing will most likely be performed either in the telephony
interface or at the front end of the ASR processor. For speech via
the IP interface, endpointing can be performed at the client as
well as at the ASR front end. The choice of where endpointing
occurs is coupled with the choice of echo cancellation.</p>
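<p>This document does not prescribe an endpointing method. For
illustration only, a common baseline wherever endpointing is placed is
a short-time energy detector; the frame length and threshold below are
assumed values, not requirements.</p>

```python
# Illustrative energy-based endpointer: a frame is speech when its
# mean squared amplitude exceeds a threshold; the endpoints are the
# first and last speech frames.

def endpoint(samples, frame_len=4, threshold=0.1):
    speech_frames = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy > threshold:
            speech_frames.append(i // frame_len)
    if not speech_frames:
        return None            # no speech detected
    return speech_frames[0], speech_frames[-1]

silence = [0.01, -0.02, 0.01, 0.0]
speech = [0.5, -0.6, 0.7, -0.4]
signal = silence * 2 + speech * 3 + silence * 2
# Frames 0-1 are silence, 2-4 are speech, 5-6 are silence again.
print(endpoint(signal))   # (2, 4)
```

Production endpointers add hangover timers and adaptive noise floors,
but the division of labor (client vs. ASR front end) is the same.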
<p>It is currently not clear how non-speech data will be handled
at the telephony interface. This can include inputs such as
pointing device input from a "smart phone," address books and
other client-resident file data, and eventually even data like
video. These smart telephone devices are now on the drawing boards
of many suppliers. Some of this traffic can be handled by WAP/WML,
but there are still open issues with regard to
multi-modality. Therefore voice markup language specifications
should provide means for extending the language features.</p>
<p>Data from the ASR/DTMF (etc.) recognizer must be in a format
compatible with the NL (Natural Language) interpreter. Typically
this would be text, but it might include non-textual components
for pointing device input, in which case pointing coordinates can
be associated with text and/or semantic tags. If the recognizer
has detected valid input while output is still being presented,
the recognizer can signal the presentation component to stop
output. Barge-in may not be desirable for certain types of
multi-media output, and should primarily be considered important
for interrupting speech output. In some cases it may also be
undesirable to interrupt speech output, such as in the processing
of commands to change speaking volume or rate.</p>
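<p>The recognizer-to-interpreter hand-off and the barge-in signal
described above might be pictured as in the sketch below. The class
names, fields, and confidence threshold are illustrative assumptions;
the architecture does not define a concrete data format.</p>

```python
# Illustrative hand-off: the recognizer emits text hypotheses with
# confidence scores, and on the first valid hypothesis it signals
# the presentation component to stop output (barge-in).

from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    confidence: float    # 0.0 .. 1.0

@dataclass
class Presentation:
    playing: bool = True
    def stop_output(self):
        self.playing = False

def deliver(hypotheses, presentation, min_confidence=0.4):
    """Forward valid hypotheses to the NL interpreter, barging in
    on the presentation component if any hypothesis is valid."""
    valid = [h for h in hypotheses if h.confidence >= min_confidence]
    if valid and presentation.playing:
        presentation.stop_output()        # barge-in signal
    return sorted(valid, key=lambda h: -h.confidence)

pres = Presentation()
hyps = [Hypothesis("call home", 0.82),
        Hypothesis("all home", 0.35),
        Hypothesis("called Rome", 0.51)]
best = deliver(hyps, pres)
# Output is stopped and the two valid hypotheses are passed on,
# best first.
```

A policy layer would additionally suppress barge-in for the
non-interruptible cases the paragraph above mentions, such as
volume or rate commands.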
<p>The recognizer can produce multiple outputs and associated
confidence scores. The NL interpreter can also produce multiple
interpretations. Interpreted NL output is coordinated with other
modes of input that may require interpretation in the current NL
context, or that may alter or augment the interpretation of the NL
input. It is the responsibility of the multi-media integration
module to produce possibly multiple coordinated joint
interpretations of the multi-modal input and present these to the
dialog manager. Context information can also be shared with the
dialog manager to further refine the interpretation, including
resolution of anaphora and implied expressions. The dialog manager
is responsible for the final selection of the best
interpretation.</p>
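<p>The n-best flow above — several recognition hypotheses, several
interpretations, and a dialog manager that picks the best joint
interpretation using context — can be sketched as follows. The
scoring scheme (product of recognizer and interpreter confidences,
with a multiplicative context bonus) is an assumption chosen for
illustration, not part of this architecture.</p>

```python
# Illustrative final selection by the dialog manager: combine the
# ASR confidence and NL confidence of each candidate, and boost
# candidates whose topic matches the current dialog context.

def best_interpretation(nbest, context_topic):
    """nbest: list of (asr_conf, interpretation, nl_conf, topic)."""
    scored = []
    for asr_conf, interp, nl_conf, topic in nbest:
        score = asr_conf * nl_conf
        if topic == context_topic:
            score *= 1.5     # context bonus (illustrative value)
        scored.append((score, interp))
    return max(scored)[1]

nbest = [
    (0.80, "play artist 'Boston'", 0.70, "music"),
    (0.70, "show flights to Boston", 0.60, "travel"),
]
# In a travel dialog, context tips the choice toward the flight
# query even though the music reading scored higher acoustically.
print(best_interpretation(nbest, "travel"))  # show flights to Boston
```

The same mechanism covers anaphora resolution in spirit: context
supplied by the dialog manager reweights otherwise ambiguous
readings.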
<p>The dialog manager is also responsible for responding to the
input statement. This responsibility can include resolving
ambiguity, issuing instructions and/or queries to the task
manager, collecting output from the task manager, forming a
natural language expression or visual presentation of the task
manager output, and coordinating recognizer context.</p>
<p>The task manager is primarily an Application Program Interface
(API), but it can also include pragmatic and application-specific
reasoning. The task manager can be an agent or proxy, can possess
state, and can communicate with other agents or proxies for
services. The primary application interface for the task manager
is expected to be web servers, but it can be other APIs as
well.</p>
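<p>The task manager's role — accepting instructions from the dialog
manager, holding state, and fetching results from a back-end service
such as a web server — might be sketched like this. The service
interface and instruction names are stand-in assumptions; no concrete
interface is defined by this document.</p>

```python
# Illustrative task manager: an API layer between the dialog
# manager and back-end services, keeping per-dialog state so that
# follow-up queries can refer to the previous result.

class TaskManager:
    def __init__(self, service):
        self.service = service        # e.g. a web-server client
        self.last_result = None       # task-manager state

    def execute(self, instruction, **params):
        result = self.service(instruction, params)
        self.last_result = result     # remembered for follow-ups
        return result

def fake_web_service(instruction, params):
    # Stand-in for an HTTP request to an application web server.
    if instruction == "weather":
        return {"city": params["city"], "forecast": "sunny"}
    raise ValueError("unknown instruction")

tm = TaskManager(fake_web_service)
answer = tm.execute("weather", city="Malvern")
# The dialog manager receives the result and can phrase a spoken
# response; the stored state supports "what about tomorrow?"-style
# follow-ups.
```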
<p>Finally, the presentation manager, or output media "renderer,"
has responsibility for formatting multi-media output in a
coordinated manner. The presentation manager should be aware of
the client device capabilities.</p>
</body>
</html>