<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<meta http-equiv="Content-Type"
    content="text/html; charset=iso-8859-1" />
<link rel="stylesheet" type="text/css"
    href="http://www.w3.org/StyleSheets/TR/W3C-WD.css" />
<style type="text/css">
body {
  font-family: sans-serif;
  margin-left: 10%;
  margin-right: 5%;
  color: black;
  background-color: white;
  background-attachment: fixed;
  background-image: url(http://www.w3.org/StyleSheets/TR/WD.gif);
  background-position: top left;
  background-repeat: no-repeat;
}
h1,h2,h3,h4,h5,h6 {
  margin-left: -4%;
  font-weight: normal;
  color: rgb(0, 92, 160);
}
img { color: white; border: 0; }
h1 { margin-top: 2em; clear: both; }
div.navbar,div.head { margin-bottom: 1em; }
p.copyright { font-size: 70%; }
span.term { font-style: italic; color: rgb(0, 0, 192); }

code {
  color: green;
  font-family: monospace;
  font-weight: bold;
}

code.greenmono {
  color: green;
  font-family: monospace;
  font-weight: bold;
}

.good {
  border: solid green;
  border-width: 2px;
  color: green;
  font-weight: bold;
  margin-right: 5%;
  margin-left: 0;
  margin-top: 1em;
  margin-bottom: 1em;
}

.bad {
  border: solid red;
  border-width: 2px;
  margin-left: 0;
  margin-right: 5%;
  margin-top: 1em;
  margin-bottom: 1em;
  color: rgb(192, 101, 101);
}

div.navbar { text-align: center; }
div.contents {
  background-color: rgb(204,204,255);
  padding: 0.5em;
  border: none;
  margin-right: 5%;
}
.tocline { list-style: none; }
table.exceptions { background-color: rgb(255,255,153); }

.diff-old-a {
  font-size: smaller;
  color: red;
}

.diff-old {
  color: red;
  text-decoration: line-through;
}

.diff-new {
  color: green;
  text-decoration: underline;
}
</style>

<style type="text/css">
pre.c7 {color: #3333FF}
p.c6 {color: #3333FF}
span.c5 {color: #3333FF}
p.c4 {color: #FF6600}
b.c3 {font-size: larger}
tt.c2 {font-size: larger}
span.c1 {color: #FF6600}
</style>

<title>Multimodal requirements</title>
</head>
<body>
<div class="head">
<p><a href="http://www.w3.org/"><img class="head"
src="http://www.w3.org/Icons/w3c_home" alt="W3C" /></a></p>

<h1 class="notoc">Multimodal Requirements<br />
for Voice Markup Languages</h1>

<h3 class="notoc">W3C Working Draft 10 July 2000</h3>

<dl>
<dt>This version:</dt>
<dd><a href="http://www.w3.org/TR/2000/WD-multimodal-reqs-20000710">
http://www.w3.org/TR/2000/WD-multimodal-reqs-20000710</a></dd>

<dt>Latest version:</dt>
<dd><a href="http://www.w3.org/TR/multimodal-reqs">
http://www.w3.org/TR/multimodal-reqs</a></dd>

<dt>Editors:</dt>
<dd>Marianne Hickey, Hewlett Packard</dd>
</dl>

<p class="copyright"><a
href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">
Copyright</a> ©2000 <a href="http://www.w3.org/"><abbr
title="World Wide Web Consortium">W3C</abbr></a><sup>®</sup>
(<a href="http://www.lcs.mit.edu/"><abbr
title="Massachusetts Institute of Technology">MIT</abbr></a>, <a
href="http://www.inria.fr/"><abbr lang="fr"
title="Institut National de Recherche en Informatique et Automatique">
INRIA</abbr></a>, <a href="http://www.keio.ac.jp/">Keio</a>), All
Rights Reserved. W3C <a
href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">
liability</a>, <a
href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">
trademark</a>, <a
href="http://www.w3.org/Consortium/Legal/copyright-documents-19990405">
document use</a> and <a
href="http://www.w3.org/Consortium/Legal/copyright-software-19980720">
software licensing</a> rules apply.</p>

<hr />
</div>
<h2 class="notoc">Abstract</h2>

<p>Multimodal browsers allow users to interact via a combination of
modalities, for instance, speech recognition and synthesis, displays,
keypads and pointing devices. The Voice Browser working group is
interested in adding multimodal capabilities to voice browsers. This
document sets out a prioritized list of requirements for multimodal
dialog interaction, which any proposed markup language (or extension
thereof) should address.</p>

<h2>Status of this document</h2>

<p>This specification is a Working Draft of the Voice Browser working
group for review by W3C members and other interested parties. This is
the first public version of this document. It is a draft document and
may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use W3C Working Drafts as reference
material or to cite them as other than "work in progress".</p>

<p>Publication as a Working Draft does not imply endorsement by the
W3C membership, nor by members of the Voice Browser Working
Group.</p>

<p>This document has been produced as part of the <a
href="http://www.w3.org/Voice/">W3C Voice Browser Activity</a>, but
should not be taken as evidence of consensus in the Voice Browser
Working Group. The goals of the <a
href="http://www.w3.org/Voice/Group/">Voice Browser Working Group</a>
(<a href="http://cgi.w3.org/MemberAccess/">members only</a>) are
discussed in the <a
href="http://www.w3.org/Voice/1999/voice-wg-charter.html">Voice
Browser Working Group charter</a> (<a
href="http://cgi.w3.org/MemberAccess/">members only</a>). This
document is for public review. Comments should be sent to the public
mailing list &lt;<a
href="mailto:www-voice@w3.org">www-voice@w3.org</a>&gt; (<a
href="http://lists.w3.org/Archives/Public/www-voice/">archive</a>).</p>

<p>A list of current W3C Recommendations and other technical
documents can be found at <a href="http://www.w3.org/TR/">
http://www.w3.org/TR</a>.</p>

<p class="comment">NOTE: Italicized green comments are merely that -
comments. They are for use during discussions but will be removed as
appropriate.</p>
<h3>Scope</h3>

<p>The document addresses multimodal dialog interaction. Multimodal,
as defined in this document, means one or more speech modes:</p>

<ul>
<li>speech recognition,</li>
<li>speech synthesis,</li>
<li>prerecorded speech,</li>
</ul>

<p>together with one or more of the following modes:</p>

<ul>
<li>DTMF,</li>
<li>keyboard,</li>
<li>small screen,</li>
<li>pointing device (mouse, pen),</li>
<li>other input/output modes.</li>
</ul>

<p>The focus is on multimodal dialog where there is a small screen
and keypad (e.g. a cell phone) or a small screen, keypad and pointing
device (e.g. a palm computer with a cellular connection to the Web).
This document is agnostic about where the browser(s) and speech and
language engines are running - e.g. they could be running on the
device itself, on a server, or a combination of the two.</p>

<p>The document addresses applications where both speech input and
speech output can be available. Note that this includes applications
where speech input and/or speech output may be deselected due to
environment/accessibility needs.</p>

<p>The document does not specifically address universal access, i.e.
the issue of rendering the same pages of markup to devices with
different capabilities (e.g. PC, phone or PDA). Rather, the document
addresses a markup language that allows an author to write an
application that uses spoken dialog interaction together with other
modalities (e.g. a visual interface).</p>

<h3>Interaction with Other Groups</h3>

<p>The activities of the Multimodal Requirements Subgroup will be
coordinated with the activities of other sub-groups within the W3C
Voice Browser Working Group and other related W3C working groups.
Where possible, the specification will reuse standard visual,
multimedia and aural markup languages, see <a href="#s4.1">Reuse of
standard markup requirement (4.1)</a>.</p>
<h2>1. General Requirements</h2>

<h3>1.1 Scalable across end user devices (must address)</h3>

<p>The markup language will be scalable across devices with a range
of capabilities, in order to sufficiently meet the needs of consumer
and device control applications. This includes devices capable of
supporting:</p>

<ol>
<li>audio I/O plus keypad input - e.g. a plain phone with speech plus
DTMF, or an MP3 player with speech input and output and a cellular
connection to the Web;</li>

<li>audio, keypad and small screen - e.g. WAP phones, smart phones
with displays;</li>

<li>audio, soft keyboard, small screen and pointing - e.g. palm-top
personal organizers with a cellular connection to the Web;</li>

<li>audio, keyboard, full screen and pointing - e.g. desktop PC,
information kiosk.</li>
</ol>

<p>The server must be able to get access to client capabilities and
the user's personal preferences, see <a href="#s4.1">reuse of
standard markup requirement (4.1)</a>.</p>

<h3>1.2 Easy to implement (must address)</h3>

<p>The markup language should be easy for designers to understand and
author without special tools or knowledge of vendor technology or
protocols (multimodal dialog design knowledge is still
essential).</p>

<h3>1.3 <a id="s1.3" name="s1.3">Complementary use of
modalities</a></h3>
<p>A characteristic of speech input is that it can be very efficient
- for example, in a device with a small display and keypad, speech
can bypass multiple layers of menus. A characteristic of speech
output is its serial nature, which can make it a long-winded way of
presenting information that could be quickly browsed on a
display.</p>

<p>The markup will allow an author to use the different
characteristics of the modalities in the most appropriate way for the
application.</p>

<h4>1.3.1 <a id="s1.3.1" name="s1.3.1">Output media</a> (must
address)</h4>

<p>The markup language will allow speech output to have different
content to that of simultaneous output from other media. This
requirement is related to the <a href="#s3.3">simultaneous output
requirements</a> (3.3 and 3.4).</p>

<p>In a speech plus GUI system, the author will be able to choose
different text for simultaneous verbal and visual outputs. For
example, a list of options may be presented on screen and
simultaneous speech output does not necessarily repeat them (which is
long-winded) but can summarize them or present an instruction or
warning.</p>
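<p>For illustration only, a non-normative sketch of how an author
might give simultaneous outputs different content. The element names
(<code>output</code>, <code>visual</code>, <code>speech</code>) are
hypothetical and are not part of any proposed specification:</p>

<pre class="c7">
&lt;!-- hypothetical markup: a visual list with a spoken summary --&gt;
&lt;output&gt;
  &lt;visual&gt;
    &lt;ul&gt;
      &lt;li&gt;Flight BA123, 09:00&lt;/li&gt;
      &lt;li&gt;Flight BA456, 14:30&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/visual&gt;
  &lt;speech&gt;Here are two matching flights. Please select
  one.&lt;/speech&gt;
&lt;/output&gt;
</pre>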
<h4>1.3.2 <a id="s1.3.2" name="s1.3.2">Input modalities</a> (must
address)</h4>

<p>The markup language will allow, in a given dialog state, the set
of actions that can be performed using speech input to be different
to the simultaneous actions that can be performed with other input
modalities. This requirement is related to the <a
href="#s2.3">simultaneous input requirements</a> (2.3 and 2.4).</p>

<p>Consider a speech plus GUI system, where speech and touch screen
input is available simultaneously. The application can be authored
such that, in a given dialog state, there are more actions available
via speech than via the touch screen. For example, the screen
displays a list of flights and the user can bypass the options
available on the display and say "show me later flights".</p>

<h3>1.4 Seamless synchronization of the various modalities (should
address)</h3>

<p>The markup will be designed such that an author can write
applications where the synchronization of the various modalities is
seamless from the user's point of view. That is, an action in one
modality results in a synchronous change in another. For
example:</p>

<ol>
<li>an end-user selects something using voice and the visual display
changes to match;</li>

<li>an end-user specifies focus with a mouse and enters the data with
voice - the application knows which field the user is talking to and
therefore what it might expect.</li>
</ol>

<p>See <a href="#s4.7.1">minimally required synchronization points
(4.7.1)</a> and <a href="#s4.7.2">finer grained synchronization
points (4.7.2)</a>.</p>

<p>See also <a href="#s2.2">multimodal input requirements (2.2, 2.3,
2.4)</a> and <a href="#s3.2">multimodal output requirements (3.2,
3.3, 3.4)</a>.</p>

<h3>1.5 Multilingual &amp; international rendering</h3>

<h4>1.5.1 One language per document (must address)</h4>

<p>The markup language will provide the ability to mark the language
of a document.</p>

<h4>1.5.2 Multiple languages in the same document (nice to
address)</h4>

<p>The markup language will support rendering of multi-lingual
documents - i.e. where there is a mixed-language document. For
example, English and French speech output and/or input can appear in
the same document - a spoken system response can be "John read the
book entitled 'Viva La France'."</p>
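<p>A minimal sketch of how mixed-language content might be marked up,
using the standard <code>xml:lang</code> attribute; the
<code>prompt</code> element is hypothetical:</p>

<pre class="c7">
&lt;prompt xml:lang="en"&gt;
  John read the book entitled
  &lt;span xml:lang="fr"&gt;'Viva La France'&lt;/span&gt;.
&lt;/prompt&gt;
</pre>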
<p><font color="#008000"><i>This is really a general requirement for
voice dialog, rather than a multimodal requirement. We may move this
to the dialog document.</i></font></p>

<h2>2. Input modality requirements</h2>

<h3>2.1 Audio Modality Input (must address)</h3>

<p>The markup language can specify which spoken user input is
interpreted by the voice browser.</p>

<h3>2.2 <a id="s2.2" name="s2.2">Sequential multi-modal Input</a>
(must address)</h3>

<p>The markup language specifies that speech and user input from
other modalities is to be interpreted by the browser. There is no
requirement that the input modalities are simultaneously active. In a
particular dialog state, there is only one input mode available, but
over the whole interaction more than one input mode is used. Inputs
from different modalities are interpreted separately. For example, a
browser can interpret speech input in one dialog state and keyboard
input in another.</p>

<p>The granularity is defined by things like input events.
Synchronization does not occur at any finer granularity. When the
user takes some action, only one mode of input will be available at
that time. See requirement <a href="#s4.7.1">4.7.1 - minimally
required synchronization points</a>.</p>

<p>Examples:</p>

<ol>
<li>In a bank application accessed via a phone, the browser renders
the speech "Speak your name", and the user must respond in speech and
says "Jack Jones"; the browser renders the speech "Using the keypad,
enter your PIN", and the user must enter the number via the
keypad.</li>

<li>In an insurance application accessed via a PDA, the browser
renders the speech "Please say your postcode", and the user must
reply in speech and says "BS34 8QZ"; the browser renders the speech
"I'm having trouble understanding you, please enter your postcode
using the soft keyboard." The user must respond using the soft
keyboard (i.e. not in speech).</li>
</ol>

<h3>2.3 <a id="s2.3" name="s2.3">Uncoordinated, Simultaneous,
Multi-modal Input</a> (must address)</h3>

<p>The markup language specifies that speech and user input from
other modalities is to be interpreted by the browser and that input
modalities are simultaneously active. There is no requirement that
interpretation of the input modalities is coordinated (i.e.
interpreted together). In a particular dialog state, there is more
than one input mode available but only input from one of the
modalities is interpreted (e.g. the first input - see <a
href="#s2.13">2.13 Resolve conflicting input requirement</a>). For
example, a voice browser in a desktop environment could accept either
keyboard input or spoken input in the same dialog state.</p>

<p>The granularity is defined by things like input events.
Synchronization does not occur at any finer granularity. When the
user takes some action, it can be in one of several input modes -
only one mode of input will be accepted by the browser. See
requirement <a href="#s4.7.1">4.7.1 - minimally required
synchronization points</a>.</p>

<p>Examples:</p>

<ol>
<li>In a bank application accessed via a phone, the browser renders
the speech "Enter your name", and the user says "Jack Jones" or
enters his name via the keypad; the browser renders the speech "Enter
your account number", and the user enters the number via the keypad
or speaks the account number.</li>

<li>In a music application accessed via a PDA, the user asks to hear
clips of new releases, either using speech or by selecting a button
on screen. The browser renders a list of titles on screen. The user
selects by pointing to the title with the pen or by speaking the
title of the track.</li>
</ol>
<h3>2.4 <a id="s2.4" name="s2.4">Coordinated, Simultaneous
Multi-modal Input</a> (nice to address)</h3>

<p>The markup language specifies that speech and user input from
other modalities is allowed at the same time and that interpretation
of the inputs is coordinated. In a particular dialog state, there is
more than one input mode available and input from multiple modalities
is interpreted (e.g. within a given time window). When the user takes
some action it can be composed of inputs from several modalities -
for example, a voice browser in a desktop environment could accept
keyboard input and spoken input together in the same dialog
state.</p>

<p>Examples:</p>

<ol>
<li>In a telephony environment, the user can type <em>200</em> on the
keypad and say <em>transfer to checking account</em>, and the
interpretations are coordinated so that they are understood as
<em>transfer 200 to checking account</em>.</li>

<li>In a route finding application, the user points at Bristol on a
map and says "Give me directions from London to here".</li>
</ol>

<p>See also <a href="#s2.11">2.11 Composite Meaning requirement</a>
and <a href="#s2.13">2.13 Resolve conflicting input
requirement</a>.</p>

<h3>2.5 Input modes supported (must address)</h3>

<p>The markup language will support the following input modes, in
addition to speech:</p>

<ul>
<li>DTMF</li>
<li>keyboard</li>
<li>pointing device (e.g. mouse, touchscreen)</li>
</ul>

<p>DTMF will be supported using the dialog markup specified by the
W3C Voice Browser Working Group's dialog requirements.</p>

<p>Character and pointing input will be supported using other markup
languages together with scripting (e.g. HTML with JavaScript).</p>

<p>See <a href="#s4.1">reuse standard markup requirement
(4.1)</a>.</p>

<h3>2.6 Input modes supported (nice to address)</h3>

<p>The markup language will support other input modes,
including:</p>

<ul>
<li>handwriting script</li>
<li>handwriting gesture - e.g. to delete, to insert.</li>
</ul>

<h3>2.7 Extensible to new input media types (nice to address)</h3>

<p>The model will be abstract enough that any new or exotic input
medium (e.g. gesture captured by video) could fit into it.</p>

<h3>2.8 <a id="s2.8" name="s2.8">Semantics of input generated by UI
components other than speech</a> (nice to address)</h3>

<p>The markup language should support semantic tokens that are
generated by UI components other than speech. These tokens can be
considered in a similar way to action tags and speech grammars. For
example, in a pizza application, if a topping can be selected from an
option list on the screen, the author can declare that the semantic
token 'topping' can be generated by a GUI component.</p>
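<p>A non-normative sketch of the pizza example; the
<code>token</code> attribute is invented here to show the idea of a
GUI component declaring the semantic token it generates:</p>

<pre class="c7">
&lt;!-- hypothetical: each option in this list emits the token 'topping' --&gt;
&lt;select token="topping"&gt;
  &lt;option value="ham"&gt;Ham&lt;/option&gt;
  &lt;option value="mushroom"&gt;Mushroom&lt;/option&gt;
&lt;/select&gt;
</pre>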
<h3>2.9 <a id="s2.9" name="s2.9">Modality-independent representation
of the meaning of user input</a> (nice to address)</h3>

<p>The markup language should support a modality-independent method
of representing the meaning of user input. This should be annotated
with a record of the modality type. This is related to the <a
href="#s4.3">XForms requirement (4.3)</a> and to the work on Natural
Language within the <a href="http://www.w3.org/Voice/">W3C Voice
activity</a>.</p>

<p>The markup language supports the same semantic representation of
input from different modalities. For example, in a pizza application,
if a topping can be selected from an option list on the screen or by
speaking, the same semantic token, e.g. 'topping', can be used to
represent the input.</p>

<h3>2.10 Coordinate speech grammar with grammar for other input
modalities (future revision)</h3>

<p>The markup language coordinates the grammars for modalities other
than speech with speech grammars, to avoid duplication of effort in
authoring multimodal grammars.</p>

<h3>2.11 <a id="s2.11" name="s2.11">Composite meaning</a> (nice to
address)</h3>

<p>It must be possible to combine multimodal input to form a
composite meaning. This is related to the <a href="#s2.4">
Coordinated, Simultaneous Multi-modal Input requirement (2.4)</a>.
For example, the user points at Bristol on a map and says "Give me
directions from London to here". The formal representations of the
meaning of each input need to be combined to get a composite meaning
- "Give me directions from London to Bristol". See also <a
href="#s2.8">Semantics of input generated by UI components other than
speech (2.8)</a> and <a href="#s2.9">Modality independent semantic
representation (2.9)</a>.</p>
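<p>For illustration, the route-finding example as attribute-value
frames; this frame notation is illustrative only, not a proposed
format:</p>

<pre class="c7">
&lt;meaning source="speech"&gt;action=directions origin=London
destination=?&lt;/meaning&gt;
&lt;meaning source="pointer"&gt;destination=Bristol&lt;/meaning&gt;
&lt;meaning source="composite"&gt;action=directions origin=London
destination=Bristol&lt;/meaning&gt;
</pre>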
<h3>2.12 Time window for coordinated multimodal input (nice to
address)</h3>

<p>The markup language supports specification of timing information
to determine whether input from multiple modalities should combine to
form an integrated semantic representation. See <a
href="#s2.4">coordinated multimodal input requirement (2.4)</a>. This
could, for example, take the form of a time window which is specified
in the markup, where input events from different modalities that
occur within this window are combined into one semantic entity.</p>
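<p>For illustration, a hypothetical way of expressing such a time
window in markup; the <code>modes</code> and <code>timewindow</code>
attributes are invented for this sketch:</p>

<pre class="c7">
&lt;!-- hypothetical: inputs arriving within 2 seconds of each other
     are combined into one semantic entity --&gt;
&lt;input modes="speech pen" timewindow="2s"&gt;
  &lt;!-- speech grammar and pen targets go here --&gt;
&lt;/input&gt;
</pre>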
<h3>2.13 <a id="s2.13" name="s2.13">Support for conflicting input
from different modalities</a> (must address)</h3>

<p>The markup language will support the detection of conflicting
input from several modalities. For example, in a speech + GUI
interface, there may be simultaneous but conflicting speech and mouse
inputs; the markup language should allow the conflict to be detected
so that an appropriate action can be taken. Consider a music
application: the user says "play Madonna" while entering "Elvis" in
an artist text box on screen; an application might resolve this by
asking "Did you mean Madonna or Elvis?". This is related to <a
href="#s2.3">2.3 uncoordinated simultaneous multimodal input</a> and
<a href="#s2.4">2.4 coordinated simultaneous input
requirement</a>.</p>

<h3>2.14 <a id="s2.14" name="s2.14">Context for recognizer</a> (nice
to address)</h3>

<p>The markup language should allow features of the display to
indicate a context for voice interaction. For example:</p>

<ul>
<li>the context for interpreting a spoken utterance might be
indicated by the form field that has focus on the display;</li>

<li>the speech grammar might be dependent on what is currently being
displayed (the page or just the area that's visible).</li>
</ul>

<h3>2.15 <a id="s2.15" name="s2.15">Resolve spoken reference to
display</a> (future revision)</h3>

<p>Interpretation of the input must provide enough information to the
natural language system to be able to resolve speech input that
refers to items in the visual context. For example: the screen is
displaying a list of possible flights that match a user's
requirements and the user says "I'll take the third one".</p>

<h3>2.16 Time stamping (should address)</h3>

<p>All input events will be time-stamped, in addition to the time
stamping covered by the Dialog Requirements. This includes, for
example, time-stamping speech, key press and pointing events. For
finer grained synchronization, time stamping at the start and the end
of each word within speech may be needed.</p>
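<p>A sketch of the kind of time-stamped event records this implies;
the element and attribute names are hypothetical:</p>

<pre class="c7">
&lt;event type="speech" start="12:03:04.210" end="12:03:05.890"&gt;
  show me later flights
&lt;/event&gt;
&lt;event type="pointer" start="12:03:05.120" end="12:03:05.120"&gt;
  flight-list item 3
&lt;/event&gt;
</pre>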
<h2>3. Output media requirements</h2>

<h3>3.1 Audio Media Output (must address)</h3>

<p>The markup language can specify the content rendered as spoken
output by the voice browser.</p>

<h3>3.2 <a id="s3.2" name="s3.2">Sequential multimedia output</a>
(must address)</h3>

<p>The markup language specifies that content is rendered in speech
and other media types. There is no requirement that the output media
are rendered simultaneously. For example, a browser can output speech
in one dialog state and graphics in another.</p>

<p>The granularity is defined by things like input events.
Synchronization does not occur at any finer granularity. When the
user takes some action - either spoken or by pointing, for example -
a response is rendered in one of the output media - either visual or
voice, for example. See requirement <a href="#s4.7.1">4.7.1 -
minimally required synchronization points</a>.</p>

<p>Examples:</p>

<ol>
<li>In a speech plus WML banking application, accessed via a WAP
phone, the user asks "What's my balance?". The browser renders the
account balance on the display only. The user clicks OK, and the
browser renders the response as speech only - "Would you like another
service?"...</li>

<li>In a music application accessed via a PDA, the user asks to hear
clips of new releases. The browser renders a list of titles on
screen, together with the text instruction to select a title to hear
the track. The user selects a track by speaking the number. The
browser plays the selected track - the screen does not change.</li>
</ol>

<h3>3.3 <a id="s3.3" name="s3.3">Uncoordinated, Simultaneous,
Multi-media Output</a> (must address)</h3>

<p>The markup language specifies that content is rendered in speech
and other media at the same time (i.e. in the same dialog state).
There is no requirement that the rendering of output media is
coordinated (i.e. synchronized) any further. Where appropriate,
synchronization of speech with other output media should be supported
with SMIL or a related standard.</p>

<p>The granularity of the synchronization for this requirement is
coarser than for the <a href="#s3.4">coordinated simultaneous output
requirement (3.4)</a>. The granularity is defined by things like
input events. When the user takes some action - either spoken or by
pointing, for example - something happens with the visual and the
voice channels, but there is no further synchronization at a finer
granularity than that. I.e., a browser can output speech and graphics
in one dialog state, but the two outputs are not synchronized in any
other way. See requirement <a href="#s4.7.1">4.7.1 - minimally
required synchronization points</a>.</p>

<p>Examples:</p>

<ol>
<li>In a cinema-ticket application accessed via a WAP phone, the user
asks what films are showing. The browser renders the list of films on
the screen and renders an instruction in speech - "Here are today's
films. Select one to hear a full description".</li>

<li>A browser in a smart phone environment plays a prompt "Which
service do you require?", while displaying a list of options such as
"Do you want to: (a) transfer money; (b) get account info; (c)
quit."</li>

<li>In a music application accessed via a PDA, the user asks to hear
clips of new releases. The browser renders a list of titles on
screen, and renders an instruction in speech "Here are the five
recommended new releases. Select one to hear a clip". The user
selects one by speaking the title. The browser renders the audio clip
and, at the same time, displays the price and information about the
band. When the track has finished, the user selects a button on
screen to return to the list of tracks.</li>
</ol>

<h3>3.4 <a id="s3.4" name="s3.4">Coordinated, Simultaneous
Multi-media Output</a> (nice to address)</h3>

<p>The markup language specifies that content is to be simultaneously
rendered in speech and other media and that output rendering is
further coordinated (i.e. synchronized). The granularity is defined
by things that happen within the response to a given user input - see
<a href="#s4.7.2">4.7.2 Finer grained synchronization points</a>.
Where appropriate, synchronization of speech with other output media
should be supported with SMIL or a related standard.</p>

<p>Examples:</p>

<ol>
<li>In a news application, accessed via a PDA, a browser highlights
each paragraph of text (e.g. headline) as it renders the
corresponding speech.</li>

<li>In a learn-to-read application accessed via a PC, the lips of an
animated character are synchronized with speech output, the words are
highlighted on screen as they are spoken and pictures are displayed
as the corresponding words are spoken (e.g. a cat is displayed as the
word cat is spoken).</li>

<li>In a music application accessed via a PDA, the user asks to hear
clips of new releases. The browser renders a list of titles on
screen, highlights the first and starts playing it. When the first
track has finished, the browser highlights the second title on screen
and starts playing the second track, and so on.</li>

<li>Display an image 5 seconds after a spoken prompt has
started.</li>

<li>Display an image for 5 seconds then render a speech prompt (a
SMIL sketch of this example and the previous one follows the
list).</li>
</ol>
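<p>Examples 4 and 5 can be written directly in SMIL 1.0. A minimal
sketch, assuming media files named <code>prompt.au</code> and
<code>picture.png</code>:</p>

<pre class="c7">
&lt;!-- Example 4: the image appears 5 seconds after the prompt starts --&gt;
&lt;par&gt;
  &lt;audio src="prompt.au" /&gt;
  &lt;img src="picture.png" begin="5s" /&gt;
&lt;/par&gt;

&lt;!-- Example 5: the image shows for 5 seconds, then the prompt plays --&gt;
&lt;seq&gt;
  &lt;img src="picture.png" dur="5s" /&gt;
  &lt;audio src="prompt.au" /&gt;
&lt;/seq&gt;
</pre>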
<p>See also <a href="#s3.5">Synchronization of multimedia with voice
input requirement (3.5)</a>.</p>

<h3>3.5 <a id="s3.5" name="s3.5">Synchronization of multimedia with
voice input</a> (nice to address)</h3>

<p>The markup language specifies that media output and voice input
are synchronized. The granularity is defined by: things that happen
within the response to a given user input, e.g. play a video and 30
seconds after it has started activate a speech grammar; things that
happen within a speech input, e.g. detect the start of a spoken input
and 5 seconds later play a video. Where appropriate, synchronization
of speech with other output media should be supported with SMIL or a
related standard. See <a href="#s3.4">Coordinated simultaneous
multimedia output requirement (3.4)</a> and <a href="#s4.7.2">4.7.2
Finer grained synchronization points</a>.</p>

<h3>3.6 Temporal semantics for synchronization of voice input and
output with multimedia (nice to address)</h3>

<p>The markup language will have clear temporal semantics so that it
can be integrated into the SMIL multimedia framework. Multi-media
frameworks are characterized by precise temporal synchronization of
output and input. For example, the SMIL notation is based on timing
primitives that allow the composition of complex behaviors. See <a
href="#s3.5">Synchronization of multimedia with voice input
requirement (3.5)</a> and <a href="#s3.4">3.4 coordinated
simultaneous multimodal output requirement</a>.</p>

<h3>3.7 Visual output of text (must address)</h3>

<p>The markup language will support visual output of text, using
other markup languages such as HTML or WML (see <a href="#s4.1">reuse
of standard markup requirement, 4.1</a>). For example, the following
may be presented as text on the display:</p>

<ul>
<li>Contextual/history information (e.g. display a partially filled
in form);</li>
<li>Prompts;</li>
<li>Menus;</li>
<li>Confirmation;</li>
<li>Error messages.</li>
</ul>

<p>Example 1:</p>

<ul>
<li>User says: "My name is Jack Jones",</li>
<li>System displays: "Jack Jones" in address field.</li>
</ul>

<p>Example 2:</p>

<ul>
<li>User says: "Transfer $200 from my savings account to my checking
account",</li>

<li>System displays:
<ul>
<li>Operation: transfer</li>
<li>Source account: savings account</li>
<li>Destination account: checking account</li>
<li>Amount: $200</li>
</ul>
</li>
</ul>

<h3>3.8 Media supported by other Voice Browsing Requirements (must
address)</h3>

<p>The markup language supports output defined in other W3C Voice
Browser Working Group specifications - for example, recorded audio
(Speech Synthesis Requirements). See <a href="#s4.1">reuse of
standard markup requirement (4.1)</a>.</p>

<h3>3.9 Media objects supported by SMIL (should address)</h3>

<p>The markup language supports output of the media objects supported
by SMIL (animation, audio, img, video, text, textstream), using other
markup languages (see <a href="#s4.1">reuse of standard markup
requirement, 4.1</a>).</p>

<h3>3.10 Other output media (nice to address)</h3>

<p>The markup language supports output of the following media, using
other markup languages (see <a href="#s4.1">reuse of standard markup
requirement, 4.1</a>):</p>

<ul>
<li>media types supported by CSS2</li>
<li>synthesis of audio - MIDI</li>
<li>lip-synch face synthesis</li>
</ul>

<h3>3.11 Extensible to new media (nice to address)</h3>

<p>The markup language will be extensible to support new output media
types (e.g. 3D graphics).</p>

<h3>3.12 <a id="s3.12" name="s3.12"></a>Media-independent
representation of the meaning of output (future revision)</h3>

<p>The markup language should support a media-independent method of
representing the meaning of output. E.g. the output could be
represented in a frame format and rendered in speech or on the
display by the browser. This is related to the <a href="#s4.3">XForms
requirement (4.3)</a>.</p>

<h3>3.13 <a id="s3.13" name="s3.13">Display size</a> (should
address)</h3>

<p>Visual output will be renderable on displays of different sizes.
This should be by using standard visual markup languages, e.g. HTML,
CHTML, WML, where appropriate, see <a href="#s4.1">reuse standard
markup requirement</a> (4.1).</p>

<p>This requirement applies to two kinds of visual markup:</p>

<ul>
<li>markup that can be rendered flexibly as the display size
changes</li>

<li>markup that is pre-configured for a particular display
size.</li>
</ul>

<h3>3.14 <a id="s3.14" name="s3.14">Output to more than one
window</a> (future revision)</h3>

<p>The markup language supports the identification of the display
window. This is to support applications where there is more than one
window.</p>

<h3>3.15 <a id="s3.15" name="s3.15">Time stamping</a> (should
address)</h3>

<p>All output events will be time-stamped, in addition to the time
stamping covered by the Dialog Requirements. This includes
time-stamping the start and the end of a speech event. For finer
grained synchronization, time stamping at the start and the end of
each word within speech may be needed.</p>
|
|
<h2>4. <a id="s4" name="s4">Architecture, Integration and
|
|
Synchronization points</a></h2>
|
|
|
|
<h3>4.1 <a id="s4.1" name="s4.1">Reuse standard markup
|
|
languages</a> (must address)</h3>
|
|
|
|
<p>Where possible, the specification must reuse standard visual,
|
|
multimedia and aural markup languages, including:</p>
|
|
|
|
<ul>
|
|
<li>other <a href="http://www.w3.org/Voice/">W3C Voice Browsing
|
|
working group</a> specifications for voice markup;</li>
|
|
|
|
<li>standard multimedia notations (SMIL or a related
|
|
standard);</li>
|
|
|
|
<li>standard visual markup languages e.g., HTML, CHTML, WML;</li>
|
|
|
|
<li>other relevant specifications, including ACSS;</li>
|
|
</ul>
|
|
|
|
<p>The specification should avoid unnecessary differences with
|
|
these markup languages.</p>
|
|
|
|
<p>In addition, the markup will be compatible with the W3C's work
|
|
on Client Capabilities and Personal Preferences (CC/PP).</p>
|
|
|
|
<h3>4.2 Mesh with modular architecture proposed for XHTML (nice
|
|
to address)</h3>
|
|
|
|
<p>The results of the work should mesh with the modular
|
|
architecture proposed for XHTML, where different markup modules
|
|
are expected to cohabit and inter-operate gracefully within an
|
|
overall XHTML container.</p>
|
|
|
|
<p>As part of this goal the design should be capable of
|
|
incorporating multiple visual and aural markup languages.</p>
|
|
|
|
<h3>4.3 <a id="s4.3" name="s4.3">Compatibility with W3C work on
|
|
X-Forms</a> (nice to address)</h3>
|
|
|
|
<p>The markup language should be compatible with the W3C's work
|
|
on X-Forms.</p>
|
|
|
|
<ol>
|
|
<li>Have an explicit data model for the back end (i.e. the data)
|
|
and map it to the front end.</li>
|
|
|
|
<li>Separate the data model from the presentation. The
|
|
presentation depends on the device modality.</li>
|
|
|
|
<li>Application data and logic should be modality
|
|
independent.</li>
|
|
</ol>
|
|
|
|
<p>Related to requirements: <a href="#s3.12">media independent
|
|
representation of output (3.12)</a> and <a href="#s2.11">media
|
|
independent representation of input (2.11)</a>.</p>
|
|
|
|
<h3>4.4 Detect that a given modality is available (must
|
|
address)</h3>
|
|
|
|
<p>The markup language will allow identification of the
|
|
modalities available. This will allow an author to identify that
|
|
a given modality is/is not present and as a result switch to a
|
|
different dialog. E.g. there is a visible construct that an
|
|
author can query. This can be used to provide for accessibility
|
|
requirements and for environmental factors (e.g. noise). The
|
|
availability of input and output modalities can be controlled by
|
|
the user or by the system. The extent to which the functionality
|
|
is retained when modalities are not available is the
|
|
responsibility of the author.</p>
|
|
|
|
<p>The following is a list of use cases regarding a multimodal
|
|
document that specifies speech and GUI input and output. The
|
|
document could be designed such that:</p>
|
|
|
|
<ol>
|
|
<li>when the speech input error count is high, the user can make
|
|
equivalent selections via the GUI;</li>
|
|
|
|
<li>where a user has a speech impairment, speech input can be
|
|
deselected and the user controls the application via the
|
|
GUI;</li>
|
|
|
|
<li>when the user cannot hear a verbal prompt due to a noisy
|
|
environment (detected, for example, by no response), an
|
|
equivalent prompt is displayed on the screen;</li>
|
|
|
|
<li>where a user has a hearing impairment the speech output is
|
|
deselected and equivalent prompts are displayed.</li>
|
|
</ol>
|
|
|
|
<h3>4.5 Means to act on a notification that a modality has become
|
|
available/unavailable (must address)</h3>
|
|
|
|
<p>Note that this is a requirement on the system and not on the
|
|
markup language. For example, when there is temporarily high
|
|
background noise, the application may disable speech input and
|
|
output but enable them again when the noise lessens.This is a
|
|
requirement for an event handling mechanism.</p>
|
|
|
|
<h3>4.6 Transformable documents</h3>
|
|
|
|
<h4>4.6.1 Loosely coupled documents (nice to address)</h4>
|
|
|
|
<p>The mark-up language should support loosely coupled documents,
|
|
where separate markup streams for each modality are synchronized
|
|
at well-defined points. For example, separate voice and visual
|
|
markup streams could be synchronized at the following points:
|
|
visiting a form, following a link.</p>
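<p>A non-normative sketch of loose coupling: two separate markup
streams that synchronize on entry to a form. The
<code>voice-form</code> element and <code>sync</code> attribute are
hypothetical:</p>

<pre class="c7">
&lt;!-- visual stream (HTML) --&gt;
&lt;form id="booking"&gt;
  &lt;input name="destination" /&gt;
&lt;/form&gt;

&lt;!-- voice stream: activated when the 'booking' form is visited --&gt;
&lt;voice-form sync="booking"&gt;
  &lt;prompt&gt;Where would you like to go?&lt;/prompt&gt;
&lt;/voice-form&gt;
</pre>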
<h4>4.6.2 Tightly coupled documents (nice to address)</h4>

<p>The mark-up language should support tightly coupled documents.
Tightly coupled documents have document elements for each interaction
modality interspersed in the same document. I.e. a tightly coupled
document contains sub-documents from different interaction modalities
(e.g. HTML and voice markup) and has been authored to achieve
explicit synchrony across the interaction streams.</p>

<p>Tightly coupled documents should be viewed as an optimization of
the loosely-coupled approach, and should be defined by describing a
reversible transformation from a tightly-coupled document to multiple
loosely-coupled documents. For example, a tightly coupled document
that includes HTML and voice markup sub-documents should be
transformable to a pair of documents, where one is HTML only and the
other is voice markup only - see <a href="#s4.6.3">transformation
requirement</a> (4.6.3).</p>

<h4>4.6.3 <a id="s4.6.3" name="s4.6.3">Transformation between tightly
and loosely coupled documents by standard tree transformations as
expressible in XSLT</a> (nice to address)</h4>

<p>The markup language should be designed such that tightly coupled
documents are <em>transformable</em> to documents for specific
interaction modalities by standard tree transformations as
expressible in XSLT. Conversely, tightly coupled documents should be
viewed as a simple transformation applied to the individual
sub-documents, with the transformation playing the role of tightly
coupling the sub-documents into a single document.</p>
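<p>As a non-normative sketch, the visual-only document could be
obtained with an XSLT transformation like the following, assuming the
voice sub-documents live in a hypothetical namespace bound to the
prefix <code>v</code>:</p>

<pre class="c7">
&lt;xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:v="http://example.org/hypothetical-voice-markup"&gt;

  &lt;!-- copy everything by default --&gt;
  &lt;xsl:template match="@*|node()"&gt;
    &lt;xsl:copy&gt;&lt;xsl:apply-templates select="@*|node()" /&gt;&lt;/xsl:copy&gt;
  &lt;/xsl:template&gt;

  &lt;!-- drop the voice sub-documents --&gt;
  &lt;xsl:template match="v:*" /&gt;
&lt;/xsl:stylesheet&gt;
</pre>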
<p>This requirement will ensure content re-use, keep implementation
of multimodal browsers manageable and provide for accessibility
requirements.</p>

<p>It is important to note that not all the interaction information
from the tightly coupled document may be preserved. If, for example,
you have a speech + GUI design, when you take out the GUI, the
application is not necessarily equivalently usable. It is up to the
author to decide whether the speech document has all the information
that the speech plus GUI document has. Depending on how the author
created the multimodal document, the transformation could be entirely
lossy, could degrade gracefully by preserving some information from
the GUI, or could preserve all information from the GUI. If the
author's intent is that the application should be usable in the
presence or absence of either modality, it is the author's
responsibility to design the application to achieve this.</p>

<h3>4.7 <a id="s4.7" name="s4.7">Synchronization points</a></h3>

<h4>4.7.1 <a id="s4.7.1" name="s4.7.1">Minimally required
synchronization points</a> (must address)</h4>

<p>The markup language should minimally enable synchronization across
different modalities at well-known interaction points in today's
browsers, for example, entering and exiting specific interaction
widgets:</p>

<ul>
<li>Entry to a form</li>
<li>Entry to a menu</li>
<li>Completion of a form</li>
<li>Choosing a menu item (in a voice markup language) or a link
(HTML)</li>
<li>Filling a field within a form.</li>
</ul>

<p>For example:</p>

<ul>
<li>The material displayed visually and the GUI input options can be
conditional on: the current voice dialog; the current state of the
voice dialog (e.g. the form, the menu).</li>

<li>The voice markup (i.e. the dialog/grammar/prompt) can be
conditional on: the HTML page being displayed; the text box in focus;
the option selected; the button that has been clicked.</li>
</ul>

<p>See <a href="#s3.2">multimedia output requirements (3.2, 3.3 and
3.4)</a> and <a href="#s2.2">multimodal input requirements</a> (2.2,
2.3 and 2.4).</p>
<h4>4.7.2 <a id="s4.7.2" name="s4.7.2">Finer-grained synchronization
points</a> (nice to address)</h4>

<p>The markup language should support finer-grained synchronization.
Where appropriate, synchronization of speech with other output media
should be supported with SMIL or a related standard.</p>

<p>For example:</p>

<ul>
<li>to allow a display to synchronize with events in the auditory
output stream</li>

<li>to allow voice markup (i.e. the dialog/grammar/prompt) to
synchronize with scrolling events on the display</li>

<li>to allow voice markup to synchronize with temporal events in
output media.</li>
</ul>

<p>Synchronization points include:</p>

<ul>
<li>events in the auditory output stream, e.g. start/finish voice
output events (word, line, paragraph, section)</li>

<li>fine-grained events on the display (e.g. scrolling)</li>

<li>temporal events in other output media.</li>
</ul>

<p>See <a href="#s3.4">3.4 coordinated simultaneous multimodal output
requirement</a>.</p>

<h4>4.7.3 Co-ordinate synchronization points with the DOM event model
(future study)</h4>

<ol>
<li>Synchronization points should be coordinated with the DOM event
model. I.e. one possible starting point for a list of such
synchronization points would be the event types defined by the DOM,
appropriately modified to be modality independent.</li>

<li>Event types defined for multimodal browsing should be integrated
into the DOM; as part of this effort, the Voice WG might provide
requirements as input to the next level of the DOM
specification.</li>
</ol>

<h4>4.7.4 Browser functions and synchronization points (future
study)</h4>

<p>The notion of synchronization points (or navigation sign posts) is
important; synchronization points should also be tied into a
discussion of what canonical browser functions like "back", "undo",
and "forward" mean, and what they mean to the global state of the
multimodal browser. The notion of 'back' is unclear in a voice
context.</p>

<h3>4.8 Interaction with External Components (must have)</h3>

<p>The markup language must support a generic component interface to
allow for the use of external components on the client and/or server
side. The interface provides a mechanism for transferring data
between the markup language's variables and the component. Examples
of such data are: semantic representations of user input (such as
attribute-value pairs); the URL of markup for different modalities
(e.g. the URL of an HTML page). The markup language also supports the
interaction with external components that is supported by the <a
href="http://www.w3.org/TR/1999/WD-voice-dialog-reqs-19991223/">W3C
Voice Browsing Dialog Requirements (Requirement 2.10)</a>.</p>

<p>Examples of external components are components for interaction
modalities other than speech (e.g. an HTML browser) and server
scripts. Server scripts can be used to interact with remote services,
devices or databases.</p>

<h2>Acknowledgements</h2>

<p>The following people participated in the multimodal subgroup of
the Voice Browser working group and contributed to this
document:</p>

<ul>
<li>T. V. Raman (IBM)</li>
<li>Bruce Lucas (IBM)</li>
<li>Pekka Kapanen (Nokia)</li>
<li>Peter Boda (Nokia)</li>
<li>Laurence Prevosto (EDF)</li>
<li>Marianne Hickey (HP)</li>
<li>Nils Klarlund (AT&amp;T)</li>
<li>Carolina Di Cristo (Telecom Italia)</li>
<li>Charles T. Hemphill (Conversational Computing)</li>
<li>Alan Goldschen (MITRE)</li>
<li>Andreas Kellner (Philips)</li>
<li>Markku T. Hakkinen (The Productivity Works)</li>
<li>Kuansan Wang (Microsoft)</li>
<li>David Raggett (W3C/HP)</li>
<li>Jim Colson (IBM)</li>
<li>Scott McGlashan (Pipebeach)</li>
<li>Frank Scahill (BT)</li>
</ul>
</body>
</html>