<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<meta http-equiv="Content-Type"
    content="text/html; charset=iso-8859-1" />
<link rel="stylesheet" type="text/css"
    href="http://www.w3.org/StyleSheets/TR/W3C-WD.css" />
<style type="text/css">
body {
  font-family: sans-serif;
  margin-left: 10%;
  margin-right: 5%;
  color: black;
  background-color: white;
  background-attachment: fixed;
  background-image: url(http://www.w3.org/StyleSheets/TR/WD.gif);
  background-position: top left;
  background-repeat: no-repeat;
}
h1,h2,h3,h4,h5,h6 {
  margin-left: -4%;
  font-weight: normal;
  color: rgb(0, 92, 160);
}
img { color: white; border: 0; }
h1 { margin-top: 2em; clear: both; }
div.navbar,div.head { margin-bottom: 1em; }
p.copyright { font-size: 70%; }
span.term { font-style: italic; color: rgb(0, 0, 192); }

code {
  color: green;
  font-family: monospace;
  font-weight: bold;
}

code.greenmono {
  color: green;
  font-family: monospace;
  font-weight: bold;
}

.good {
  border: solid green;
  border-width: 2px;
  color: green;
  font-weight: bold;
  margin-right: 5%;
  margin-left: 0;
  margin-top: 1em;
  margin-bottom: 1em;
}

.bad {
  border: solid red;
  border-width: 2px;
  margin-left: 0;
  margin-right: 5%;
  margin-top: 1em;
  margin-bottom: 1em;
  color: rgb(192, 101, 101);
}

div.navbar { text-align: center; }
div.contents {
  background-color: rgb(204,204,255);
  padding: 0.5em;
  border: none;
  margin-right: 5%;
}
.tocline { list-style: none; }
table.exceptions { background-color: rgb(255,255,153); }

.diff-old-a {
  font-size: smaller;
  color: red;
}

.diff-old {
  color: red;
  text-decoration: line-through;
}

.diff-new {
  color: green;
  text-decoration: underline;
}
</style>

<style type="text/css">
pre.c7 {color: #3333FF}
p.c6 {color: #3333FF}
span.c5 {color: #3333FF}
p.c4 {color: #FF6600}
b.c3 {font-size: larger}
tt.c2 {font-size: larger}
span.c1 {color: #FF6600}
</style>

<title>Multimodal requirements</title>
</head>
<body>
<div class="head">
<p><a href="http://www.w3.org/"><img class="head"
src="http://www.w3.org/Icons/w3c_home" alt="W3C" /></a></p>

<h1 class="notoc">Multimodal Requirements<br />
for Voice Markup Languages</h1>

<h3 class="notoc">W3C Working Draft 10 July 2000</h3>

<dl>
<dt>This version:</dt>
<dd><a href="http://www.w3.org/TR/2000/WD-multimodal-reqs-20000710">
http://www.w3.org/TR/2000/WD-multimodal-reqs-20000710</a></dd>

<dt>Latest version:</dt>
<dd><a href="http://www.w3.org/TR/multimodal-reqs">
http://www.w3.org/TR/multimodal-reqs</a></dd>

<dt>Editors:</dt>
<dd>Marianne Hickey, Hewlett Packard</dd>
</dl>

<p class="copyright"><a
href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">
Copyright</a> ©2000 <a href="http://www.w3.org/"><abbr
title="World Wide Web Consortium">W3C</abbr></a><sup>®</sup>
(<a href="http://www.lcs.mit.edu/"><abbr
title="Massachusetts Institute of Technology">MIT</abbr></a>, <a
href="http://www.inria.fr/"><abbr lang="fr"
title="Institut National de Recherche en Informatique et Automatique">
INRIA</abbr></a>, <a href="http://www.keio.ac.jp/">Keio</a>), All
Rights Reserved. W3C <a
href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">
liability</a>, <a
href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">
trademark</a>, <a
href="http://www.w3.org/Consortium/Legal/copyright-documents-19990405">
document use</a> and <a
href="http://www.w3.org/Consortium/Legal/copyright-software-19980720">
software licensing</a> rules apply.</p>

<hr />
</div>
<h2 class="notoc">Abstract</h2>

<p>Multimodal browsers allow users to interact via a combination of
modalities, for instance, speech recognition and synthesis, displays,
keypads and pointing devices. The Voice Browser working group is
interested in adding multimodal capabilities to voice browsers. This
document sets out a prioritized list of requirements for multimodal
dialog interaction, which any proposed markup language (or extension
thereof) should address.</p>

<h2>Status of this document</h2>

<p>This specification is a Working Draft of the Voice Browser working
group for review by W3C members and other interested parties. This is
the first public version of this document. It is a draft document and
may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use W3C Working Drafts as reference
material or to cite them as other than "work in progress".</p>

<p>Publication as a Working Draft does not imply endorsement by the
W3C membership, nor by members of the Voice Browser Working
Group.</p>

<p>This document has been produced as part of the <a
href="http://www.w3.org/Voice/">W3C Voice Browser Activity</a>, but
should not be taken as evidence of consensus in the Voice Browser
Working Group. The goals of the <a
href="http://www.w3.org/Voice/Group/">Voice Browser Working Group</a>
(<a href="http://cgi.w3.org/MemberAccess/">members only</a>) are
discussed in the <a
href="http://www.w3.org/Voice/1999/voice-wg-charter.html">Voice
Browser Working Group charter</a> (<a
href="http://cgi.w3.org/MemberAccess/">members only</a>). This
document is for public review. Comments should be sent to the public
mailing list &lt;<a
href="mailto:www-voice@w3.org">www-voice@w3.org</a>&gt; (<a
href="http://lists.w3.org/Archives/Public/www-voice/">archive</a>).</p>

<p>A list of current W3C Recommendations and other technical
documents can be found at <a href="http://www.w3.org/TR/">
http://www.w3.org/TR</a>.</p>

<p class="comment">NOTE: Italicized green comments are merely that -
comments. They are for use during discussions but will be removed as
appropriate.</p>
<h3>Scope</h3>

<p>The document addresses multimodal dialog interaction. Multimodal,
as defined in this document, means one or more speech modes:</p>

<ul>
<li>speech recognition,</li>
<li>speech synthesis,</li>
<li>prerecorded speech,</li>
</ul>

<p>together with one or more of the following modes:</p>

<ul>
<li>DTMF,</li>
<li>keyboard,</li>
<li>small screen,</li>
<li>pointing device (mouse, pen),</li>
<li>other input/output modes.</li>
</ul>

<p>The focus is on multimodal dialog where there is a small screen
and keypad (e.g. a cell phone) or a small screen, keypad and pointing
device (e.g. a palm computer with a cellular connection to the Web).
This document is agnostic about where the browser(s) and speech and
language engines are running - e.g. they could be running on the
device itself, on a server, or a combination of the two.</p>

<p>The document addresses applications where both speech input and
speech output can be available. Note that this includes applications
where speech input and/or speech output may be deselected due to
environment/accessibility needs.</p>

<p>The document does not specifically address universal access, i.e.
the issue of rendering the same pages of markup to devices with
different capabilities (e.g. PC, phone or PDA). Rather, the document
addresses a markup language that allows an author to write an
application that uses spoken dialog interaction together with other
modalities (e.g. a visual interface).</p>

<h3>Interaction with Other Groups</h3>

<p>The activities of the Multimodal Requirements Subgroup will be
coordinated with the activities of other sub-groups within the W3C
Voice Browser Working Group and other related W3C working groups.
Where possible, the specification will reuse standard visual,
multimedia and aural markup languages, see <a href="#s4.1">Reuse of
standard markup requirement (4.1)</a>.</p>
<h2>1. General Requirements</h2>

<h3>1.1 Scalable across end user devices (must address)</h3>

<p>The markup language will be scalable across devices with a range
of capabilities, in order to sufficiently meet the needs of consumer
and device control applications. This includes devices capable of
supporting:</p>

<ol>
<li>audio I/O plus keypad input - e.g. a plain phone with speech plus
DTMF, or an MP3 player with speech input and output and a cellular
connection to the Web;</li>

<li>audio, keypad and small screen - e.g. WAP phones, smart phones
with displays;</li>

<li>audio, soft keyboard, small screen and pointing - e.g. palm-top
personal organizers with a cellular connection to the Web;</li>

<li>audio, keyboard, full screen and pointing - e.g. desktop PC,
information kiosk.</li>
</ol>

<p>The server must be able to get access to client capabilities and
the user's personal preferences, see <a href="#s4.1">reuse of
standard markup requirement (4.1)</a>.</p>

<h3>1.2 Easy to implement (must address)</h3>

<p>The markup language should be easy for designers to understand and
author without special tools or knowledge of vendor technology or
protocols (multimodal dialog design knowledge is still
essential).</p>

<h3>1.3 <a id="s1.3" name="s1.3">Complementary use of
modalities</a></h3>
<p>A characteristic of speech input is that it can be very efficient
- for example, in a device with a small display and keypad, speech
can bypass multiple layers of menus. A characteristic of speech
output is its serial nature, which can make it a long-winded way of
presenting information that could be quickly browsed on a
display.</p>

<p>The markup will allow an author to use the different
characteristics of the modalities in the most appropriate way for the
application.</p>

<h4>1.3.1 <a id="s1.3.1" name="s1.3.1">Output media</a> (must
address)</h4>

<p>The markup language will allow speech output to have different
content to that of simultaneous output from other media. This
requirement is related to the <a href="#s3.3">simultaneous output
requirements</a> (3.3 and 3.4).</p>

<p>In a speech plus GUI system, the author will be able to choose
different text for simultaneous verbal and visual outputs. For
example, a list of options may be presented on screen and
simultaneous speech output does not necessarily repeat them (which is
long-winded) but can summarize them or present an instruction or
warning.</p>
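<p>For illustration only, a non-normative sketch of how an author
might give simultaneous outputs different content. The element names
(<code>output</code>, <code>visual</code>, <code>speech</code>) are
hypothetical and are not part of any proposed specification:</p>

<pre class="c7">
&lt;!-- hypothetical markup: a visual list with a spoken summary --&gt;
&lt;output&gt;
  &lt;visual&gt;
    &lt;ul&gt;
      &lt;li&gt;Flight BA123, 09:00&lt;/li&gt;
      &lt;li&gt;Flight BA456, 14:30&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/visual&gt;
  &lt;speech&gt;Here are two matching flights. Please select
  one.&lt;/speech&gt;
&lt;/output&gt;
</pre>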
<h4>1.3.2 <a id="s1.3.2" name="s1.3.2">Input modalities</a> (must
address)</h4>

<p>The markup language will allow, in a given dialog state, the set
of actions that can be performed using speech input to be different
to the simultaneous actions that can be performed with other input
modalities. This requirement is related to the <a
href="#s2.3">simultaneous input requirements</a> (2.3 and 2.4).</p>

<p>Consider a speech plus GUI system, where speech and touch screen
input is available simultaneously. The application can be authored
such that, in a given dialog state, there are more actions available
via speech than via the touch screen. For example, the screen
displays a list of flights and the user can bypass the options
available on the display and say "show me later flights".</p>

<h3>1.4 Seamless synchronization of the various modalities (should
address)</h3>

<p>The markup will be designed such that an author can write
applications where the synchronization of the various modalities is
seamless from the user's point of view. That is, an action in one
modality results in a synchronous change in another. For
example:</p>

<ol>
<li>an end-user selects something using voice and the visual display
changes to match;</li>

<li>an end-user specifies focus with a mouse and enters the data with
voice - the application knows which field the user is talking to and
therefore what it might expect.</li>
</ol>

<p>See <a href="#s4.7.1">minimally required synchronization points
(4.7.1)</a> and <a href="#s4.7.2">finer grained synchronization
points (4.7.2)</a>.</p>

<p>See also <a href="#s2.2">multimodal input requirements (2.2, 2.3,
2.4)</a> and <a href="#s3.2">multimodal output requirements (3.2,
3.3, 3.4)</a>.</p>

<h3>1.5 Multilingual &amp; international rendering</h3>

<h4>1.5.1 One language per document (must address)</h4>

<p>The markup language will provide the ability to mark the language
of a document.</p>

<h4>1.5.2 Multiple languages in the same document (nice to
address)</h4>

<p>The markup language will support rendering of multi-lingual
documents - i.e. where there is a mixed-language document. For
example, English and French speech output and/or input can appear in
the same document - a spoken system response can be "John read the
book entitled 'Viva La France'."</p>
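<p>A minimal sketch of how mixed-language content might be marked up,
using the standard <code>xml:lang</code> attribute; the
<code>prompt</code> element is hypothetical:</p>

<pre class="c7">
&lt;prompt xml:lang="en"&gt;
  John read the book entitled
  &lt;span xml:lang="fr"&gt;'Viva La France'&lt;/span&gt;.
&lt;/prompt&gt;
</pre>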
<p><font color="#008000"><i>This is really a general requirement for
voice dialog, rather than a multimodal requirement. We may move this
to the dialog document.</i></font></p>

<h2>2. Input modality requirements</h2>

<h3>2.1 Audio Modality Input (must address)</h3>

<p>The markup language can specify which spoken user input is
interpreted by the voice browser.</p>

<h3>2.2 <a id="s2.2" name="s2.2">Sequential multi-modal Input</a>
(must address)</h3>

<p>The markup language specifies that speech and user input from
other modalities is to be interpreted by the browser. There is no
requirement that the input modalities are simultaneously active. In a
particular dialog state, there is only one input mode available, but
over the whole interaction more than one input mode is used. Inputs
from different modalities are interpreted separately. For example, a
browser can interpret speech input in one dialog state and keyboard
input in another.</p>

<p>The granularity is defined by things like input events.
Synchronization does not occur at any finer granularity. When the
user takes some action, only one mode of input will be available at
that time. See requirement <a href="#s4.7.1">4.7.1 - minimally
required synchronization points</a>.</p>

<p>Examples:</p>

<ol>
<li>In a bank application accessed via a phone, the browser renders
the speech "Speak your name", and the user must respond in speech and
says "Jack Jones"; the browser renders the speech "Using the keypad,
enter your PIN", and the user must enter the number via the
keypad.</li>

<li>In an insurance application accessed via a PDA, the browser
renders the speech "Please say your postcode", and the user must
reply in speech and says "BS34 8QZ"; the browser renders the speech
"I'm having trouble understanding you, please enter your postcode
using the soft keyboard." The user must respond using the soft
keyboard (i.e. not in speech).</li>
</ol>

<h3>2.3 <a id="s2.3" name="s2.3">Uncoordinated, Simultaneous,
Multi-modal Input</a> (must address)</h3>

<p>The markup language specifies that speech and user input from
other modalities is to be interpreted by the browser and that input
modalities are simultaneously active. There is no requirement that
interpretation of the input modalities is coordinated (i.e.
interpreted together). In a particular dialog state, there is more
than one input mode available but only input from one of the
modalities is interpreted (e.g. the first input - see <a
href="#s2.13">2.13 Resolve conflicting input requirement</a>). For
example, a voice browser in a desktop environment could accept either
keyboard input or spoken input in the same dialog state.</p>

<p>The granularity is defined by things like input events.
Synchronization does not occur at any finer granularity. When the
user takes some action, it can be in one of several input modes -
only one mode of input will be accepted by the browser. See
requirement <a href="#s4.7.1">4.7.1 - minimally required
synchronization points</a>.</p>

<p>Examples:</p>

<ol>
<li>In a bank application accessed via a phone, the browser renders
the speech "Enter your name", and the user says "Jack Jones" or
enters his name via the keypad; the browser renders the speech "Enter
your account number", and the user enters the number via the keypad
or speaks the account number.</li>

<li>In a music application accessed via a PDA, the user asks to hear
clips of new releases, either using speech or by selecting a button
on screen. The browser renders a list of titles on screen. The user
selects by pointing to the title with the pen or by speaking the
title of the track.</li>
</ol>
<h3>2.4 <a id="s2.4" name="s2.4">Coordinated, Simultaneous
Multi-modal Input</a> (nice to address)</h3>

<p>The markup language specifies that speech and user input from
other modalities is allowed at the same time and that interpretation
of the inputs is coordinated. In a particular dialog state, there is
more than one input mode available and input from multiple modalities
is interpreted (e.g. within a given time window). When the user takes
some action it can be composed of inputs from several modalities -
for example, a voice browser in a desktop environment could accept
keyboard input and spoken input together in the same dialog
state.</p>

<p>Examples:</p>

<ol>
<li>In a telephony environment, the user can type <em>200</em> on the
keypad and say <em>transfer to checking account</em>, and the
interpretations are coordinated so that they are understood as
<em>transfer 200 to checking account</em>.</li>

<li>In a route finding application, the user points at Bristol on a
map and says "Give me directions from London to here".</li>
</ol>

<p>See also <a href="#s2.11">2.11 Composite Meaning requirement</a>
and <a href="#s2.13">2.13 Resolve conflicting input
requirement</a>.</p>

<h3>2.5 Input modes supported (must address)</h3>

<p>The markup language will support the following input modes, in
addition to speech:</p>

<ul>
<li>DTMF</li>
<li>keyboard</li>
<li>pointing device (e.g. mouse, touchscreen)</li>
</ul>

<p>DTMF will be supported using the dialog markup specified by the
W3C Voice Browser Working Group's dialog requirements.</p>

<p>Character and pointing input will be supported using other markup
languages together with scripting (e.g. HTML with JavaScript).</p>

<p>See <a href="#s4.1">reuse standard markup requirement
(4.1)</a>.</p>

<h3>2.6 Input modes supported (nice to address)</h3>

<p>The markup language will support other input modes,
including:</p>

<ul>
<li>handwriting script</li>
<li>handwriting gesture - e.g. to delete, to insert.</li>
</ul>

<h3>2.7 Extensible to new input media types (nice to address)</h3>

<p>The model will be abstract enough that any new or exotic input
medium (e.g. gesture captured by video) could fit into it.</p>

<h3>2.8 <a id="s2.8" name="s2.8">Semantics of input generated by UI
components other than speech</a> (nice to address)</h3>

<p>The markup language should support semantic tokens that are
generated by UI components other than speech. These tokens can be
considered in a similar way to action tags and speech grammars. For
example, in a pizza application, if a topping can be selected from an
option list on the screen, the author can declare that the semantic
token 'topping' can be generated by a GUI component.</p>
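<p>A non-normative sketch of the pizza example; the
<code>token</code> attribute is invented here to show the idea of a
GUI component declaring the semantic token it generates:</p>

<pre class="c7">
&lt;!-- hypothetical: each option in this list emits the token 'topping' --&gt;
&lt;select token="topping"&gt;
  &lt;option value="ham"&gt;Ham&lt;/option&gt;
  &lt;option value="mushroom"&gt;Mushroom&lt;/option&gt;
&lt;/select&gt;
</pre>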
<h3>2.9 <a id="s2.9" name="s2.9">Modality-independent representation
of the meaning of user input</a> (nice to address)</h3>

<p>The markup language should support a modality-independent method
of representing the meaning of user input. This should be annotated
with a record of the modality type. This is related to the <a
href="#s4.3">XForms requirement (4.3)</a> and to the work on Natural
Language within the <a href="http://www.w3.org/Voice/">W3C Voice
activity</a>.</p>

<p>The markup language supports the same semantic representation of
input from different modalities. For example, in a pizza application,
if a topping can be selected from an option list on the screen or by
speaking, the same semantic token, e.g. 'topping', can be used to
represent the input.</p>

<h3>2.10 Coordinate speech grammar with grammar for other input
modalities (future revision)</h3>

<p>The markup language coordinates the grammars for modalities other
than speech with speech grammars, to avoid duplication of effort in
authoring multimodal grammars.</p>

<h3>2.11 <a id="s2.11" name="s2.11">Composite meaning</a> (nice to
address)</h3>

<p>It must be possible to combine multimodal input to form a
composite meaning. This is related to the <a href="#s2.4">
Coordinated, Simultaneous Multi-modal Input requirement (2.4)</a>.
For example, the user points at Bristol on a map and says "Give me
directions from London to here". The formal representations of the
meaning of each input need to be combined to get a composite meaning
- "Give me directions from London to Bristol". See also <a
href="#s2.8">Semantics of input generated by UI components other than
speech (2.8)</a> and <a href="#s2.9">Modality independent semantic
representation (2.9)</a>.</p>
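<p>For illustration, the route-finding example as attribute-value
frames; this frame notation is illustrative only, not a proposed
format:</p>

<pre class="c7">
&lt;meaning source="speech"&gt;action=directions origin=London
destination=?&lt;/meaning&gt;
&lt;meaning source="pointer"&gt;destination=Bristol&lt;/meaning&gt;
&lt;meaning source="composite"&gt;action=directions origin=London
destination=Bristol&lt;/meaning&gt;
</pre>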
<h3>2.12 Time window for coordinated multimodal input (nice to
address)</h3>

<p>The markup language supports specification of timing information
to determine whether input from multiple modalities should combine to
form an integrated semantic representation. See <a
href="#s2.4">coordinated multimodal input requirement (2.4)</a>. This
could, for example, take the form of a time window which is specified
in the markup, where input events from different modalities that
occur within this window are combined into one semantic entity.</p>
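<p>For illustration, a hypothetical way of expressing such a time
window in markup; the <code>modes</code> and <code>timewindow</code>
attributes are invented for this sketch:</p>

<pre class="c7">
&lt;!-- hypothetical: inputs arriving within 2 seconds of each other
     are combined into one semantic entity --&gt;
&lt;input modes="speech pen" timewindow="2s"&gt;
  &lt;!-- speech grammar and pen targets go here --&gt;
&lt;/input&gt;
</pre>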
<h3>2.13 <a id="s2.13" name="s2.13">Support for conflicting input
from different modalities</a> (must address)</h3>

<p>The markup language will support the detection of conflicting
input from several modalities. For example, in a speech + GUI
interface, there may be simultaneous but conflicting speech and mouse
inputs; the markup language should allow the conflict to be detected
so that an appropriate action can be taken. Consider a music
application: the user says "play Madonna" while entering "Elvis" in
an artist text box on screen; an application might resolve this by
asking "Did you mean Madonna or Elvis?". This is related to <a
href="#s2.3">2.3 uncoordinated simultaneous multimodal input</a> and
<a href="#s2.4">2.4 coordinated simultaneous input
requirement</a>.</p>

<h3>2.14 <a id="s2.14" name="s2.14">Context for recognizer</a> (nice
to address)</h3>

<p>The markup language should allow features of the display to
indicate a context for voice interaction. For example:</p>

<ul>
<li>the context for interpreting a spoken utterance might be
indicated by the form field that has focus on the display;</li>

<li>the speech grammar might be dependent on what is currently being
displayed (the page or just the area that's visible).</li>
</ul>

<h3>2.15 <a id="s2.15" name="s2.15">Resolve spoken reference to
display</a> (future revision)</h3>

<p>Interpretation of the input must provide enough information to the
natural language system to be able to resolve speech input that
refers to items in the visual context. For example: the screen is
displaying a list of possible flights that match a user's
requirements and the user says "I'll take the third one".</p>

<h3>2.16 Time stamping (should address)</h3>

<p>All input events will be time-stamped, in addition to the time
stamping covered by the Dialog Requirements. This includes, for
example, time-stamping speech, key press and pointing events. For
finer grained synchronization, time stamping at the start and the end
of each word within speech may be needed.</p>
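<p>A sketch of the kind of time-stamped event records this implies;
the element and attribute names are hypothetical:</p>

<pre class="c7">
&lt;event type="speech" start="12:03:04.210" end="12:03:05.890"&gt;
  show me later flights
&lt;/event&gt;
&lt;event type="pointer" start="12:03:05.120" end="12:03:05.120"&gt;
  flight-list item 3
&lt;/event&gt;
</pre>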
<h2>3. Output media requirements</h2>

<h3>3.1 Audio Media Output (must address)</h3>

<p>The markup language can specify the content rendered as spoken
output by the voice browser.</p>

<h3>3.2 <a id="s3.2" name="s3.2">Sequential multimedia output</a>
(must address)</h3>

<p>The markup language specifies that content is rendered in speech
and other media types. There is no requirement that the output media
are rendered simultaneously. For example, a browser can output speech
in one dialog state and graphics in another.</p>

<p>The granularity is defined by things like input events.
Synchronization does not occur at any finer granularity. When the
user takes some action - either spoken or by pointing, for example -
a response is rendered in one of the output media - either visual or
voice, for example. See requirement <a href="#s4.7.1">4.7.1 -
minimally required synchronization points</a>.</p>

<p>Examples:</p>

<ol>
<li>In a speech plus WML banking application, accessed via a WAP
phone, the user asks "What's my balance?". The browser renders the
account balance on the display only. The user clicks OK, and the
browser renders the response as speech only - "Would you like another
service?"...</li>

<li>In a music application accessed via a PDA, the user asks to hear
clips of new releases. The browser renders a list of titles on
screen, together with the text instruction to select a title to hear
the track. The user selects a track by speaking the number. The
browser plays the selected track - the screen does not change.</li>
</ol>

<h3>3.3 <a id="s3.3" name="s3.3">Uncoordinated, Simultaneous,
Multi-media Output</a> (must address)</h3>

<p>The markup language specifies that content is rendered in speech
and other media at the same time (i.e. in the same dialog state).
There is no requirement that the rendering of output media is
coordinated (i.e. synchronized) any further. Where appropriate,
synchronization of speech with other output media should be supported
with SMIL or a related standard.</p>

<p>The granularity of the synchronization for this requirement is
coarser than for the <a href="#s3.4">coordinated simultaneous output
requirement (3.4)</a>. The granularity is defined by things like
input events. When the user takes some action - either spoken or by
pointing, for example - something happens with the visual and the
voice channels, but there is no further synchronization at a finer
granularity than that. I.e., a browser can output speech and graphics
in one dialog state, but the two outputs are not synchronized in any
other way. See requirement <a href="#s4.7.1">4.7.1 - minimally
required synchronization points</a>.</p>

<p>Examples:</p>

<ol>
<li>In a cinema-ticket application accessed via a WAP phone, the user
asks what films are showing. The browser renders the list of films on
the screen and renders an instruction in speech - "Here are today's
films. Select one to hear a full description".</li>

<li>A browser in a smart phone environment plays a prompt "Which
service do you require?", while displaying a list of options such as
"Do you want to: (a) transfer money; (b) get account info; (c)
quit."</li>

<li>In a music application accessed via a PDA, the user asks to hear
clips of new releases. The browser renders a list of titles on
screen, and renders an instruction in speech "Here are the five
recommended new releases. Select one to hear a clip". The user
selects one by speaking the title. The browser renders the audio clip
and, at the same time, displays the price and information about the
band. When the track has finished, the user selects a button on
screen to return to the list of tracks.</li>
</ol>

<h3>3.4 <a id="s3.4" name="s3.4">Coordinated, Simultaneous
Multi-media Output</a> (nice to address)</h3>

<p>The markup language specifies that content is to be simultaneously
rendered in speech and other media and that output rendering is
further coordinated (i.e. synchronized). The granularity is defined
by things that happen within the response to a given user input - see
<a href="#s4.7.2">4.7.2 Finer grained synchronization points</a>.
Where appropriate, synchronization of speech with other output media
should be supported with SMIL or a related standard.</p>

<p>Examples:</p>

<ol>
<li>In a news application, accessed via a PDA, a browser highlights
each paragraph of text (e.g. headline) as it renders the
corresponding speech.</li>

<li>In a learn-to-read application accessed via a PC, the lips of an
animated character are synchronized with speech output, the words are
highlighted on screen as they are spoken and pictures are displayed
as the corresponding words are spoken (e.g. a cat is displayed as the
word cat is spoken).</li>

<li>In a music application accessed via a PDA, the user asks to hear
clips of new releases. The browser renders a list of titles on
screen, highlights the first and starts playing it. When the first
track has finished, the browser highlights the second title on screen
and starts playing the second track, and so on.</li>

<li>Display an image 5 seconds after a spoken prompt has
started.</li>

<li>Display an image for 5 seconds then render a speech prompt (a
SMIL sketch of this example and the previous one follows the
list).</li>
</ol>
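<p>Examples 4 and 5 can be written directly in SMIL 1.0. A minimal
sketch, assuming media files named <code>prompt.au</code> and
<code>picture.png</code>:</p>

<pre class="c7">
&lt;!-- Example 4: the image appears 5 seconds after the prompt starts --&gt;
&lt;par&gt;
  &lt;audio src="prompt.au" /&gt;
  &lt;img src="picture.png" begin="5s" /&gt;
&lt;/par&gt;

&lt;!-- Example 5: the image shows for 5 seconds, then the prompt plays --&gt;
&lt;seq&gt;
  &lt;img src="picture.png" dur="5s" /&gt;
  &lt;audio src="prompt.au" /&gt;
&lt;/seq&gt;
</pre>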
<p>See also <a href="#s3.5">Synchronization of multimedia with voice
input requirement (3.5)</a>.</p>

<h3>3.5 <a id="s3.5" name="s3.5">Synchronization of multimedia with
voice input</a> (nice to address)</h3>

<p>The markup language specifies that media output and voice input
are synchronized. The granularity is defined by: things that happen
within the response to a given user input, e.g. play a video and 30
seconds after it has started activate a speech grammar; things that
happen within a speech input, e.g. detect the start of a spoken input
and 5 seconds later play a video. Where appropriate, synchronization
of speech with other output media should be supported with SMIL or a
related standard. See <a href="#s3.4">Coordinated simultaneous
multimedia output requirement (3.4)</a> and <a href="#s4.7.2">4.7.2
Finer grained synchronization points</a>.</p>

<h3>3.6 Temporal semantics for synchronization of voice input and
output with multimedia (nice to address)</h3>

<p>The markup language will have clear temporal semantics so that it
can be integrated into the SMIL multimedia framework. Multi-media
frameworks are characterized by precise temporal synchronization of
output and input. For example, the SMIL notation is based on timing
primitives that allow the composition of complex behaviors. See <a
href="#s3.5">Synchronization of multimedia with voice input
requirement (3.5)</a> and <a href="#s3.4">3.4 coordinated
simultaneous multimodal output requirement</a>.</p>

<h3>3.7 Visual output of text (must address)</h3>

<p>The markup language will support visual output of text, using
other markup languages such as HTML or WML (see <a href="#s4.1">reuse
of standard markup requirement, 4.1</a>). For example, the following
may be presented as text on the display:</p>

<ul>
<li>Contextual/history information (e.g. display a partially filled
in form);</li>
<li>Prompts;</li>
<li>Menus;</li>
<li>Confirmation;</li>
<li>Error messages.</li>
</ul>

<p>Example 1:</p>

<ul>
<li>User says: "My name is Jack Jones",</li>
<li>System displays: "Jack Jones" in address field.</li>
</ul>

<p>Example 2:</p>

<ul>
<li>User says: "Transfer $200 from my savings account to my checking
account",</li>

<li>System displays:
<ul>
<li>Operation: transfer</li>
<li>Source account: savings account</li>
<li>Destination account: checking account</li>
<li>Amount: $200</li>
</ul>
</li>
</ul>

<h3>3.8 Media supported by other Voice Browsing Requirements (must
address)</h3>

<p>The markup language supports output defined in other W3C Voice
Browser Working Group specifications - for example, recorded audio
(Speech Synthesis Requirements). See <a href="#s4.1">reuse of
standard markup requirement (4.1)</a>.</p>

<h3>3.9 Media objects supported by SMIL (should address)</h3>

<p>The markup language supports output of the media objects supported
by SMIL (animation, audio, img, video, text, textstream), using other
markup languages (see <a href="#s4.1">reuse of standard markup
requirement, 4.1</a>).</p>

<h3>3.10 Other output media (nice to address)</h3>

<p>The markup language supports output of the following media, using
other markup languages (see <a href="#s4.1">reuse of standard markup
requirement, 4.1</a>):</p>

<ul>
<li>media types supported by CSS2</li>
<li>synthesis of audio - MIDI</li>
<li>lip-synch face synthesis</li>
</ul>

<h3>3.11 Extensible to new media (nice to address)</h3>

<p>The markup language will be extensible to support new output media
types (e.g. 3D graphics).</p>

<h3>3.12 <a id="s3.12" name="s3.12"></a>Media-independent
representation of the meaning of output (future revision)</h3>

<p>The markup language should support a media-independent method of
representing the meaning of output. E.g. the output could be
represented in a frame format and rendered in speech or on the
display by the browser. This is related to the <a href="#s4.3">XForms
requirement (4.3)</a>.</p>

<h3>3.13 <a id="s3.13" name="s3.13">Display size</a> (should
address)</h3>

<p>Visual output will be renderable on displays of different sizes.
This should be by using standard visual markup languages, e.g. HTML,
CHTML, WML, where appropriate, see <a href="#s4.1">reuse standard
markup requirement</a> (4.1).</p>

<p>This requirement applies to two kinds of visual markup:</p>

<ul>
<li>markup that can be rendered flexibly as the display size
changes</li>

<li>markup that is pre-configured for a particular display
size.</li>
</ul>

<h3>3.14 <a id="s3.14" name="s3.14">Output to more than one
window</a> (future revision)</h3>

<p>The markup language supports the identification of the display
window. This is to support applications where there is more than one
window.</p>

<h3>3.15 <a id="s3.15" name="s3.15">Time stamping</a> (should
address)</h3>

<p>All output events will be time-stamped, in addition to the time
stamping covered by the Dialog Requirements. This includes
time-stamping the start and the end of a speech event. For finer
grained synchronization, time stamping at the start and the end of
each word within speech may be needed.</p>
|
|
<h2>4. <a id="s4" name="s4">Architecture, Integration and
|
|
Synchronization points</a></h2>
|
|
|
|
<h3>4.1 <a id="s4.1" name="s4.1">Reuse standard markup
|
|
languages</a> (must address)</h3>
|
|
|
|
<p>Where possible, the specification must reuse standard visual,
|
|
multimedia and aural markup languages, including:</p>
|
|
|
|
<ul>
|
|
<li>other <a href="http://www.w3.org/Voice/">W3C Voice Browsing
|
|
working group</a> specifications for voice markup;</li>
|
|
|
|
<li>standard multimedia notations (SMIL or a related
|
|
standard);</li>
|
|
|
|
<li>standard visual markup languages e.g., HTML, CHTML, WML;</li>
|
|
|
|
<li>other relevant specifications, including ACSS;</li>
|
|
</ul>
|
|
|
|
<p>The specification should avoid unnecessary differences with
|
|
these markup languages.</p>
|
|
|
|
<p>In addition, the markup will be compatible with the W3C's work
|
|
on Client Capabilities and Personal Preferences (CC/PP).</p>
|
|
|
|
<h3>4.2 Mesh with modular architecture proposed for XHTML (nice
|
|
to address)</h3>
|
|
|
|
<p>The results of the work should mesh with the modular
|
|
architecture proposed for XHTML, where different markup modules
|
|
are expected to cohabit and inter-operate gracefully within an
|
|
overall XHTML container.</p>
|
|
|
|
<p>As part of this goal the design should be capable of
|
|
incorporating multiple visual and aural markup languages.</p>
|
|
|
|
<h3>4.3 <a id="s4.3" name="s4.3">Compatibility with W3C work on
|
|
X-Forms</a> (nice to address)</h3>
|
|
|
|
<p>The markup language should be compatible with the W3C's work
|
|
on X-Forms.</p>
|
|
|
|
<ol>
|
|
<li>Have an explicit data model for the back end (i.e. the data)
|
|
and map it to the front end.</li>
|
|
|
|
<li>Separate the data model from the presentation. The
|
|
presentation depends on the device modality.</li>
|
|
|
|
<li>Application data and logic should be modality
|
|
independent.</li>
|
|
</ol>
|
|
|
|
<p>Related to requirements: <a href="#s3.12">media independent
|
|
representation of output (3.12)</a> and <a href="#s2.11">media
|
|
independent representation of input (2.11)</a>.</p>
|
|
|
|
<h3>4.4 Detect that a given modality is available (must
|
|
address)</h3>
|
|
|
|
<p>The markup language will allow identification of the
|
|
modalities available. This will allow an author to identify that
|
|
a given modality is/is not present and as a result switch to a
|
|
different dialog. E.g. there is a visible construct that an
|
|
author can query. This can be used to provide for accessibility
|
|
requirements and for environmental factors (e.g. noise). The
|
|
availability of input and output modalities can be controlled by
|
|
the user or by the system. The extent to which the functionality
|
|
is retained when modalities are not available is the
|
|
responsibility of the author.</p>
|
|
|
|
<p>The following is a list of use cases regarding a multimodal
|
|
document that specifies speech and GUI input and output. The
|
|
document could be designed such that:</p>
|
|
|
|
<ol>
|
|
<li>when the speech input error count is high, the user can make
|
|
equivalent selections via the GUI;</li>
|
|
|
|
<li>where a user has a speech impairment, speech input can be
|
|
deselected and the user controls the application via the
|
|
GUI;</li>
|
|
|
|
<li>when the user cannot hear a verbal prompt due to a noisy
|
|
environment (detected, for example, by no response), an
|
|
equivalent prompt is displayed on the screen;</li>
|
|
|
|
<li>where a user has a hearing impairment the speech output is
|
|
deselected and equivalent prompts are displayed.</li>
|
|
</ol>
|
|
|
|
<h3>4.5 Means to act on a notification that a modality has become
|
|
available/unavailable (must address)</h3>
|
|
|
|
<p>Note that this is a requirement on the system and not on the
|
|
markup language. For example, when there is temporarily high
|
|
background noise, the application may disable speech input and
|
|
output but enable them again when the noise lessens.This is a
|
|
requirement for an event handling mechanism.</p>
|
|
|
|
<h3>4.6 Transformable documents</h3>
|
|
|
|
<h4>4.6.1 Loosely coupled documents (nice to address)</h4>
|
|
|
|
<p>The mark-up language should support loosely coupled documents,
|
|
where separate markup streams for each modality are synchronized
|
|
at well-defined points. For example, separate voice and visual
|
|
markup streams could be synchronized at the following points:
|
|
visiting a form, following a link.</p>
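<p>A non-normative sketch of loose coupling: two separate markup
streams that synchronize on entry to a form. The
<code>voice-form</code> element and <code>sync</code> attribute are
hypothetical:</p>

<pre class="c7">
&lt;!-- visual stream (HTML) --&gt;
&lt;form id="booking"&gt;
  &lt;input name="destination" /&gt;
&lt;/form&gt;

&lt;!-- voice stream: activated when the 'booking' form is visited --&gt;
&lt;voice-form sync="booking"&gt;
  &lt;prompt&gt;Where would you like to go?&lt;/prompt&gt;
&lt;/voice-form&gt;
</pre>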
<h4>4.6.2 Tightly coupled documents (nice to address)</h4>

<p>The mark-up language should support tightly coupled documents.
Tightly coupled documents have document elements for each interaction
modality interspersed in the same document. I.e. a tightly coupled
document contains sub-documents from different interaction modalities
(e.g. HTML and voice markup) and has been authored to achieve
explicit synchrony across the interaction streams.</p>

<p>Tightly coupled documents should be viewed as an optimization of
the loosely-coupled approach, and should be defined by describing a
reversible transformation from a tightly-coupled document to multiple
loosely-coupled documents. For example, a tightly coupled document
that includes HTML and voice markup sub-documents should be
transformable to a pair of documents, where one is HTML only and the
other is voice markup only - see <a href="#s4.6.3">transformation
requirement</a> (4.6.3).</p>

<h4>4.6.3 <a id="s4.6.3" name="s4.6.3">Transformation between tightly
and loosely coupled documents by standard tree transformations as
expressible in XSLT</a> (nice to address)</h4>

<p>The markup language should be designed such that tightly coupled
documents are <em>transformable</em> to documents for specific
interaction modalities by standard tree transformations as
expressible in XSLT. Conversely, tightly coupled documents should be
viewed as a simple transformation applied to the individual
sub-documents, with the transformation playing the role of tightly
coupling the sub-documents into a single document.</p>
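<p>As a non-normative sketch, the visual-only document could be
obtained with an XSLT transformation like the following, assuming the
voice sub-documents live in a hypothetical namespace bound to the
prefix <code>v</code>:</p>

<pre class="c7">
&lt;xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:v="http://example.org/hypothetical-voice-markup"&gt;

  &lt;!-- copy everything by default --&gt;
  &lt;xsl:template match="@*|node()"&gt;
    &lt;xsl:copy&gt;&lt;xsl:apply-templates select="@*|node()" /&gt;&lt;/xsl:copy&gt;
  &lt;/xsl:template&gt;

  &lt;!-- drop the voice sub-documents --&gt;
  &lt;xsl:template match="v:*" /&gt;
&lt;/xsl:stylesheet&gt;
</pre>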
<p>This requirement will ensure content re-use, keep implementation
of multimodal browsers manageable and provide for accessibility
requirements.</p>

<p>It is important to note that not all the interaction information
from the tightly coupled document may be preserved. If, for example,
you have a speech + GUI design, when you take out the GUI, the
application is not necessarily equivalently usable. It is up to the
author to decide whether the speech document has all the information
that the speech plus GUI document has. Depending on how the author
created the multimodal document, the transformation could be entirely
lossy, could degrade gracefully by preserving some information from
the GUI, or could preserve all information from the GUI. If the
author's intent is that the application should be usable in the
presence or absence of either modality, it is the author's
responsibility to design the application to achieve this.</p>

<h3>4.7 <a id="s4.7" name="s4.7">Synchronization points</a></h3>

<h4>4.7.1 <a id="s4.7.1" name="s4.7.1">Minimally required
synchronization points</a> (must address)</h4>

<p>The markup language should minimally enable synchronization across
different modalities at well-known interaction points in today's
browsers, for example, entering and exiting specific interaction
widgets:</p>

<ul>
<li>Entry to a form</li>
<li>Entry to a menu</li>
<li>Completion of a form</li>
<li>Choosing a menu item (in a voice markup language) or a link
(HTML)</li>
<li>Filling a field within a form.</li>
</ul>

<p>For example:</p>

<ul>
<li>The material displayed visually and the GUI input options can be
conditional on: the current voice dialog; the current state of the
voice dialog (e.g. the form, the menu).</li>

<li>The voice markup (i.e. the dialog/grammar/prompt) can be
conditional on: the HTML page being displayed; the text box in focus;
the option selected; the button that has been clicked.</li>
</ul>

<p>See <a href="#s3.2">multimedia output requirements (3.2, 3.3 and
3.4)</a> and <a href="#s2.2">multimodal input requirements</a> (2.2,
2.3 and 2.4).</p>
<h4>4.7.2 <a id="s4.7.2" name="s4.7.2">Finer-grained synchronization
points</a> (nice to address)</h4>

<p>The markup language should support finer-grained synchronization.
Where appropriate, synchronization of speech with other output media
should be supported with SMIL or a related standard.</p>

<p>For example:</p>

<ul>
<li>to allow a display to synchronize with events in the auditory
output stream</li>

<li>to allow voice markup (i.e. the dialog/grammar/prompt) to
synchronize with scrolling events on the display</li>

<li>to allow voice markup to synchronize with temporal events in
output media.</li>
</ul>

<p>Synchronization points include:</p>

<ul>
<li>events in the auditory output stream, e.g. start/finish voice
output events (word, line, paragraph, section)</li>

<li>fine-grained events on the display (e.g. scrolling)</li>

<li>temporal events in other output media.</li>
</ul>

<p>See <a href="#s3.4">3.4 coordinated simultaneous multimodal output
requirement</a>.</p>

<h4>4.7.3 Co-ordinate synchronization points with the DOM event model
(future study)</h4>

<ol>
<li>Synchronization points should be coordinated with the DOM event
model. I.e. one possible starting point for a list of such
synchronization points would be the event types defined by the DOM,
appropriately modified to be modality independent.</li>

<li>Event types defined for multimodal browsing should be integrated
into the DOM; as part of this effort, the Voice WG might provide
requirements as input to the next level of the DOM
specification.</li>
</ol>

<h4>4.7.4 Browser functions and synchronization points (future
study)</h4>

<p>The notion of synchronization points (or navigation sign posts) is
important; synchronization points should also be tied into a
discussion of what canonical browser functions like "back", "undo",
and "forward" mean, and what they mean to the global state of the
multimodal browser. The notion of 'back' is unclear in a voice
context.</p>

<h3>4.8 Interaction with External Components (must have)</h3>

<p>The markup language must support a generic component interface to
allow for the use of external components on the client and/or server
side. The interface provides a mechanism for transferring data
between the markup language's variables and the component. Examples
of such data are: semantic representations of user input (such as
attribute-value pairs); the URL of markup for different modalities
(e.g. the URL of an HTML page). The markup language also supports the
interaction with external components that is supported by the <a
href="http://www.w3.org/TR/1999/WD-voice-dialog-reqs-19991223/">W3C
Voice Browsing Dialog Requirements (Requirement 2.10)</a>.</p>

<p>Examples of external components are components for interaction
modalities other than speech (e.g. an HTML browser) and server
scripts. Server scripts can be used to interact with remote services,
devices or databases.</p>

<h2>Acknowledgements</h2>

<p>The following people participated in the multimodal subgroup of
the Voice Browser working group and contributed to this
document:</p>

<ul>
<li>T. V. Raman (IBM)</li>
<li>Bruce Lucas (IBM)</li>
<li>Pekka Kapanen (Nokia)</li>
<li>Peter Boda (Nokia)</li>
<li>Laurence Prevosto (EDF)</li>
<li>Marianne Hickey (HP)</li>
<li>Nils Klarlund (AT&amp;T)</li>
<li>Carolina Di Cristo (Telecom Italia)</li>
<li>Charles T. Hemphill (Conversational Computing)</li>
<li>Alan Goldschen (MITRE)</li>
<li>Andreas Kellner (Philips)</li>
<li>Markku T. Hakkinen (The Productivity Works)</li>
<li>Kuansan Wang (Microsoft)</li>
<li>David Raggett (W3C/HP)</li>
<li>Jim Colson (IBM)</li>
<li>Scott McGlashan (Pipebeach)</li>
<li>Frank Scahill (BT)</li>
</ul>
</body>
</html>