server_playground/doc/www.w3.org/TR/2002/NOTE-mmi-use-cases-20021204/index.html


								<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"

								    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

								<html xmlns="http://www.w3.org/1999/xhtml">

								<head>

								<title>Multimodal Interaction Use Cases</title>

								<meta name="generator"

								content="HTML Tidy for Linux/x86 (vers 1st April 2002), see www.w3.org" />

								<meta http-equiv="CONTENT-TYPE"

								content="text/html; charset=iso-8859-1" />

								<style type="text/css">

								 /*<![CDATA[*/

								  body {

								    margin-left: 8%;

								    margin-right: 5%;

								    background-color: white;

								    font-family: Trebuchet, Arial, sans-serif

								  }

								  h1 { margin-left: -4%; color: rgb(0,92,160) }

								  h2 { margin-left: -4%; color: rgb(0,92,160)}

								  h3 { margin-left: 0% }

								  p.fig {text-align: center}

								  .c1 { display: none }

								  p.example { margin-left: 10% }

								  tr td { vertical-align: top }

								  //--> /*]]>*/

								</style>

								<link rel="stylesheet" type="text/css"

								href="http://www.w3.org/StyleSheets/TR/W3C-NOTE" />

								</head>

								<body>

								<div class="head">

								<p><a href="http://www.w3.org/"><img height="48" alt="W3C"

								src="http://www.w3.org/Icons/w3c_home" width="72" /></a></p>


								<h1 class="notoc" id="name">Multimodal Interaction Use Cases</h1>


								<h2 class="notoc" id="date">W3C NOTE 4 December 2002</h2>


								<dl>

								<dt>This version:</dt>


								<dd><a

								href="http://www.w3.org/TR/2002/NOTE-mmi-use-cases-20021204/">

								http://www.w3.org/TR/2002/NOTE-mmi-use-cases-20021204/</a></dd>


								<dt>Latest version:</dt>


								<dd><a

								href="http://www.w3.org/TR/mmi-use-cases/">http://www.w3.org/TR/mmi-use-cases/</a></dd>


								<dt>Previous version:</dt>


								<dd><i>this is the first publication</i></dd>


								<dt>Editors:</dt>


								<dd>Emily Candell, Dave Raggett</dd>

								</dl>


								<p class="copyright"><a

								href="http://www.w3.org/Consortium/Legal/ipr-notice-20000612#Copyright">

								Copyright</a> &copy; 2002 <a href="http://www.w3.org/"><abbr

								title="World Wide Web Consortium">W3C</abbr></a> <sup>&reg;</sup> (

								<a href="http://www.lcs.mit.edu/"><abbr

								title="Massachusetts Institute of Technology">MIT</abbr></a>, <a

								href="http://www.inria.fr/"><abbr lang="fr"

								title="Institut National de Recherche en Informatique et Automatique">

								INRIA</abbr></a>, <a href="http://www.keio.ac.jp/">Keio</a> ), All

								Rights Reserved. W3C <a

								href="http://www.w3.org/Consortium/Legal/ipr-notice-20000612#Legal_Disclaimer">

								liability</a>, <a

								href="http://www.w3.org/Consortium/Legal/ipr-notice-20000612#W3C_Trademarks">

								trademark</a>, <a

								href="http://www.w3.org/Consortium/Legal/copyright-documents-19990405">

								document use</a>, and <a

								href="http://www.w3.org/Consortium/Legal/copyright-software-19980720">

								software licensing</a> rules apply.</p>

								</div>


								<hr title="Separator from Header" />

								<h2 class="notoc" id="abstract">Abstract</h2>


								<p>The W3C <a href="http://www.w3.org/2002/mmi/">Multimodal

								Interaction Activity</a> is developing specifications as a basis

								for a new breed of Web applications in which you can interact using

								multiple modes of interaction, for instance, using speech, hand

								writing, and key presses for input, and spoken prompts, audio and

								visual displays for output. This document describes several use

								cases for multimodal interaction and presents them in terms of

								varying device capabilities and the events needed by each use case

								to couple different components of a multimodal application.</p>


								<h2 id="Status">Status of this Document</h2>


								<p><em>This section describes the status of this document at the

								time of its publication. Other documents may supersede this

								document. The latest status of this document series is maintained

								at the <abbr

								title="the World Wide Web Consortium">W3C</abbr>.</em></p>


								<p>W3C's <a href="http://www.w3.org/2002/mmi/">Multimodal

								Interaction Activity</a> is developing specifications for extending

								the Web to support multiple modes of interaction. This document

								describes several use cases as the basis for gaining a better

								understanding of the requirements for multimodal interaction, and

								the kinds of information flows needed for multimodal

								applications.</p>


								<p>This document has been produced as part of the <a

								href="http://www.w3.org/2002/mmi/">W3C Multimodal Interaction

								Activity</a>,<span class="c1"><a

								href="http://www.w3.org/2002/mmi/Activity.html"></a></span>

								following the procedures set out for the <a

								href="http://www.w3.org/Consortium/Process/">W3C Process</a>. The

								authors of this document are members of the <a

								href="http://www.w3.org/2002/mmi/Group/">Multimodal Interaction

								Working Group</a> (<a

								href="http://cgi.w3.org/MemberAccess/AccessRequest">W3C Members

								only</a>). This is a Royalty Free Working Group, as described in

								W3C's <a href="/TR/2002/NOTE-patent-practice-20020124">Current

								Patent Practice</a> NOTE. Working Group participants are required

								to provide <a href="http://www.w3.org/2002/01/mmi-ipr.html">patent

								disclosures</a>.</p>


								<p>Please send comments about this document to the public mailing

								list: <a

								href="mailto:www-multimodal@w3.org">www-multimodal@w3.org</a> (<a

								href="http://lists.w3.org/Archives/Public/www-multimodal/">public

								archives</a>). To subscribe, send an email to &lt;<a

								href="mailto:www-multimodal-request@w3.org">www-multimodal-request@w3.org</a>&gt;

								with the word <em>subscribe</em> in the subject line

								(include the word <em>unsubscribe</em> if you want to

								unsubscribe).</p>


								<p>A list of current W3C Recommendations and other technical

								documents including Working Drafts and Notes can be found at <a

								href="http://www.w3.org/TR/">http://www.w3.org/TR/</a>.</p>


								<h2 id="intro">1. Introduction</h2>


								<p>Analysis of use cases provides insight into the requirements for

								applications likely to require a multimodal infrastructure.</p>


								<p>The use cases described below were selected for analysis in

								order to highlight different requirements resulting from

								application variations in areas such as device requirements, event

								handling, network dependencies and methods of user interaction</p>


								<p>It should be noted that although the results of this analysis we

								be used as input to the Multimodal Specification being developed by

								the W3C Multimodal Interaction Working Group, there is no guarantee

								that all of these applications will be implementable using the

								language defined in the specification.</p>


								<h3 id="devices">1.1 Use Case Device Classification</h3>


								<h4 id="thin">Thin Client</h4>


								<p>A device with little processing power and capabilities that can

								be used to capture user input (microphone, touch display, stylus,

								etc) as well as non-user input such as GPS. The device may have a

								very limited capability to interpret the input, for example a small

								vocabulary speech recognition, or a character recognizer. The bulk

								of the processing occurs on the server including natural language

								processing and dialog management.</p>


								<p>An example of such a device may be a mobile phone with DSR

								capabilities and a visual browser (there could actually be thinner

								clients than this).</p>


								<h4 id="thick">Thick Client</h4>


								<p>A device with powerful processing capabilities, such that most

								of the processing can occur locally. Such a device is capable of

								input capture and interpretation. For example, the device can have

								a medium vocabulary speech recognizer, a handwriting recognizer,

								natural language processing and dialog management capabilities. The

								data itself may still be stored on the server.</p>


								<p>An example of such a device may be a recent production PDA or an

								in-car system.</p>


								<h4 id="medium">Medium Client</h4>


								<p>A device capable of input capture and some degree of

								interpretation. The processing is distributed in a client/server or

								a multidevice architecture. For example, a medium client will have

								the voice recognition capabities to handle small vocabulary command

								and control tasks but connects to a voice server for more advanced

								dialog tasks.</p>


								<h3 id="summaries">1.2 Use Case Summaries</h3>


								<h4 id="table1">Table 1: <a href="#form-filling">Form Filling for

								air travel reservation</a></h4>


								<table border="1" cellpadding="5" summary="4 column table">

								<tbody>

								<tr>

								<th>Description</th>

								<th>Device Classification</th>

								<th>Device Details</th>

								<th>Execution Model</th>

								</tr>


								<tr>

								<td>The means for a user to reserve a flight using a wireless

								personal mobile device and a combination of input and output

								modalities. The dialogue between the user and the application is

								directed through the use of a form-filling paradigm.</td>

								<td>Thin and medium clients</td>

								<td>touch-enabled display (i.e., supports pen input), voice input,

								local ASR and Distributed Speech Recognition Framework, local

								handwriting recognition, voice output, TTS, GPS, wireless

								connectivity, roaming between various networks.</td>

								<td>Client Side Execution</td>

								</tr>

								</tbody>

								</table>


								<h5 id="form-filling-details">Scenario Details</h5>


								<p>User wants to make a flight reservation with his mobile device

								while he is on the way to work. The user initiates the service via

								means of making a phone call to a multimodal service (telephone

								metaphore) or by selecting an application (portal environment

								metaphore). The details are not described here.</p>


								<p>As the user moves between networks with very different

								characteristics, the user is offered the flexibility to interact

								using the preferred and most appropriate modes for the situation.

								For example, while sitting in a train, the use of stylus and

								handwriting can achieve higher accuracy than speech (due to

								surrounding noise) and protect privacy. When the user is walking,

								the input and output modalities that more appropriate would be

								voice with some visual output. Finally, at the office the user can

								use pen and voice in a synergistic way.</p>


								<p>The dialogue between the user and the application is driven by a

								form-filling paradigm where the user provides input to fields such

								as "Travel Origin:", "Travel Destination:", "Leaving on date",

								"Returning on date". As the user selects each field in the

								application to enter information, the corresponding input

								constraints are activated to drive the recognition and

								interpretation of the user input. The capability of providing

								composite multimodal input is also examined, where input from

								multiple modalities is combined for the interpretation of the

								user's intent.</p>


								<h4 id="table2">Table 2: <a href="#driving-dir">Driving

								Directions</a></h4>


								<table border="1" cellpadding="5" summary="4 column table">

								<tbody>

								<tr>

								<th>Description</th>

								<th>Device Classification</th>

								<th>Device Details</th>

								<th>Execution Model</th>

								</tr>


								<tr>

								<td>This application provides a mechanism for a user to request and

								receive driving directions via speech and graphical input and

								output</td>

								<td>Medium Client</td>

								<td>on-board system (in a car) with a graphical display, map

								database, touch screen, voice and touch input, speech output, local

								ASR and TTS Processing and GPS.</td>

								<td>Client Side Execution</td>

								</tr>

								</tbody>

								</table>


								<h5 id="driving-direction-details">Scenario Details</h5>


								<p>User wants to go to a specific address from his current location

								and while driving wants to take a detour to a local restaurant (The

								user does not know the restaurant address nor the name). The user

								initiates service via a button on his steering wheel and interacts

								with the system via the touch screen and speech.</p>


								<h4 id="table3">Table 3: <a href="#name-dialing">Name

								Dialing</a></h4>


								<table border="1" cellpadding="5" summary="4 column table">

								<tbody>

								<tr>

								<th>Description</th>

								<th>Device Classification</th>

								<th>Device Details</th>

								<th>Execution Model</th>

								</tr>


								<tr>

								<td>

								<p>The means for users to call someone by saying their name.</p>

								</td>

								<td>

								<p>thin and fat devices</p>

								</td>

								<td>

								<p>Telephone</p>

								</td>

								<td>

								<p>The study covers several possibilities:</p>


								<ul>

								<li>whether the application runs in the device or the server</li>


								<li>whether the device supports limited local speech

								recognition</li>

								</ul>


								<p>These choices determine the kinds of events that are needed to

								coordinate the device and network based services.</p>

								</td>

								</tr>

								</tbody>

								</table>


								<h5 id="name-dialing-details">Scenario Details</h5>


								<p>Janet presses a button on her multimodal phone and says one of

								the following commands:</p>


								<ul>

								<li>Call Wendy</li>


								<li>Call Wendy on her cell phone</li>


								<li>Call Wendy at work</li>


								<li>Call Wendy Smith at Acme Research</li>

								</ul>


								<p>The application initially looks for a match in Janet's personal

								contact list and if no match is found then proceeds to look in

								other directories. Directed dialog and tapered help are used to

								narrow down the search, using aural and visual prompts. Janet is

								able to respond by pressing buttons, or tapping with a stylus, or

								by using her voice.</p>


								<p>Once a selection has been made, rules defined by Wendy are used

								to determine how the call should be handled. Janet may see a

								picture of Wendy along with a personalized message (aural and

								visual) that Wendy has left for her. Call handling may depend on

								the time of day, the location and status of the both parties, and

								the relationship between them. An "ex" might be told to never call

								again, while Janet might be told that Wendy will be free in half an

								hour after Wendy's meeting has finished. The call may be

								automatically directed to Wendy's home, office or mobile phone, or

								Janet may be invited to leave a message.</p>


								<h2 id="use-case-details">2. Use Case Details</h2>


								<h3 id="form-filling">2.1 Use-case: Form filling for air travel

								reservation</h3>


								<p>Description: The air travel reservation use case describes a

								scenario in which the user books a flight using a wireless personal

								mobile device and a combination of input and output modalities.</p>


								<p>The device has a touch-enabled display (i.e., supports pen

								input) and it is voice enabled. The use case describes a rich

								multimodal interaction model that allows the user to start a

								session while commuting on the train, continue the interaction

								while walking to his office and complete the transaction while sat

								at his office-desk. As the user moves between environments with

								very different characteristics, the user is given the opportunity

								to interact using the preferred and most appropriate modes for the

								situation. For example, while sitting in a train, the use of stylus

								and handwriting can offer higher accuracy than speech (due to

								noise) and protect privacy. When the user is walking, the input and

								output modalities more appropriate would be voice with some visual

								output. Finally, at the office the user can use pen and voice in a

								synergistic way.</p>


								<p>This example assumes the seamless transition through a variety

								of connectivity options such as high bandwidth LAN at the office

								(i.e., 802.11), lower bandwidth while walking (i.e., cellular

								network such as GPRS) and low bandwidth but in addition

								intermittent connectivity while on the train (e.g., can get

								disconnected when going through a tunnel). The scenario also takes

								advantage of network services such as location and time.</p>


								<h4 id="form-filling-actors">Actors</h4>


								<ul>

								<li>User who makes the air travel reservation</li>


								<li>Mobile device with touch-enabled display wireless network

								connectivity, handwriting recognition capability and limited voice

								recognition capability on the device.</li>


								<li>Network service with full voice dialog capabilities, connection

								to travel reservation database and location/time services.</li>

								</ul>


								<h4 id="form-filling-assumptions">Additional Assumptions</h4>


								<ul>

								<li>Data capabilities are available on the communications

								provider's network. Voice requirements are satisfied either via

								voice capabilities on the communications provider network or

								through a DSR framework that utilized the existing data

								capabilities.</li>


								<li>There are means for describing user and device profile

								information and means of exchanging this information between server

								and client.</li>

								</ul>


								<h4 id="table4">Table 4: Event Table</h4>


								<table border="1" cellpadding="5" summary="5 column table">

								<tr>

								<th>User Action</th>

								<th>Action on device</th>

								<th>Events sent from device</th>

								<th>Action on server</th>

								<th>Events sent From server</th>

								</tr>


								<tr>

								<td>

								<p>Device turned on</p>

								</td>

								<td>

								<p>Registers with network and uploads delivery context [available

								I/O modalities, bandwidth, user-specific info (e.g., home

								city)]</p>

								</td>

								<td>

								<p>register_device (delivery_context)</p>

								</td>

								<td>

								<p>Complete session initiation by registering device and delivery

								context (init_session)</p>

								</td>

								<td>

								<p>register_ack</p>

								</td>

								</tr>


								<tr>

								<td>

								<p>User picks travel app (taps with stylus or says travel)</p>

								</td>

								<td>

								<p>Client side of application is started</p>

								</td>

								<td>

								<p>app_connect (app_name)</p>

								</td>

								<td>

								<p>Loads a page that is appropriate to current profile</p>

								</td>

								<td>

								<p>app_connect_ack (start_page)</p>

								</td>

								</tr>


								<tr>

								<td colspan="5">

								<p>Application is running and ready to take input. Origin city was

								guessed from user profile or location service. User is o the train.

								Active I/O modalities are pen, display and audio output.</p>

								</td>

								</tr>


								<tr>

								<td>

								<p>User picks a field in the form to interact with the stylus</p>

								</td>

								<td>

								<p>Destination field gets highlighted</p>

								</td>

								<td>

								<p>on_focus (field_name)</p>

								</td>

								<td>

								<p>Server loads the appropriate constraints for input on this

								field. Constraints are sent to device for hwr.</p>

								</td>

								<td>

								<p>listen_ack (field_grammar)</p>

								</td>

								</tr>


								<tr>

								<td>

								<p>User starts writing. When he is finished</p>

								</td>

								<td>

								<p>Handwriting recognition performed locally with visual and audio

								presentation of result (i.e., earcon)</p>

								</td>

								<td>

								<p>&nbsp;</p>

								</td>

								<td>

								<p>&nbsp;</p>

								</td>

								<td>

								<p>&nbsp;</p>

								</td>

								</tr>


								<tr>

								<td colspan="5">

								<p>If recognition confidence is low, a different earcon is played

								and pop-up menu of top-n hypotheses is displayed.</p>

								</td>

								</tr>


								<tr>

								<td>

								<p>User approves result by moving to next field with stylus (e.g.,

								departure time)</p>

								</td>

								<td>

								<p>Result is submitted to server.</p>


								<p>&nbsp;</p>


								<p>Time field is highlighted.</p>

								</td>

								<td>

								<p>submit_partial (destination)</p>


								<p>on_focus (field_name)</p>

								</td>

								<td>

								<p>Dialog state is updated. Appropriate constraints for input on

								this field are loaded. Grammar constraints are sent to the

								device</p>

								</td>

								<td>

								<p>listen_ack (field_grammar)</p>

								</td>

								</tr>


								<tr>

								<td colspan="5">

								<p>User gets off the train and starts walking - I/O modality is

								voice only</p>

								</td>

								</tr>


								<tr>

								<td>

								<p>User explicitly switches profile via button press, or through

								non-user sensory input the profile is changed</p>

								</td>

								<td>

								<p>Profile update - only voice enabled input with voice and visual

								output</p>

								</td>

								<td>

								<p>update (delivery_context)</p>

								</td>

								<td>

								<p>Speech recognition and output module initialization.

								Synchronization of dialog state between modalities. Audio prompt

								"what time do you want to leave" is generated).</p>

								</td>

								<td>

								<p>send (autio_prompt)</p>

								</td>

								</tr>


								<tr>

								<td>

								<p>In response to audio prompt, user says "I want a flight in the

								morning".</p>

								</td>

								<td>

								<p>Audio is collected and sent it to server through data or voice

								channel</p>

								</td>

								<td>

								<p>send (audio)</p>

								</td>

								<td>

								<p>Recognizes voice and generates list of hypotheses. Corresponding

								audio prompt is created (e.g., "would you like to flight at 10 or

								11 in the morning").</p>

								</td>

								<td>

								<p>send (audio_prompt)</p>

								</td>

								</tr>


								<tr>

								<td colspan="5">

								<p>While walking, field selection is either driven by the dialog

								engine on the server, or by the user uttering simple phrases (e.g.,

								voice graffiti)</p>

								</td>

								</tr>


								<tr>

								<td>

								<p>User reaches his office.</p>

								</td>

								<td>

								<p>User explicitly switches profile via button press, or through

								non-user sensory input the profile is changed.</p>

								</td>

								<td>

								<p>Events an handlers as previously for changing the delivery

								context to accommodate interaction via voice, pen and GUI

								selection</p>

								</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								</tr>


								<tr>

								<td colspan="5">

								<p>At this point in the dialogue, it has been determined that there

								are no direct flights between origin and destination. The

								application displays available routes with in-between stops on a

								map and the user is prompted to select one.</p>

								</td>

								</tr>


								<tr>

								<td>

								<p>User says "I would like to take this one" while making a pen

								gesture (i.e., circling over the preferred route)</p>

								</td>

								<td>

								<p>Ink and audio are collected and sent to the server with time

								stamp information.</p>

								</td>

								<td>

								<p>send (audio)</p>


								<p>send (ink)</p>

								</td>

								<td>

								<p>Server receives the two inputs and integrates them into a

								semantic representation</p>


								<p>Server updates app with selection, acknowledging that input

								integration was possible.</p>

								</td>

								<td>

								<p>completeAck</p>

								</td>

								</tr>


								<tr>

								<td colspan="5">

								<p>At this point in the dialog, payment authorization needs to be

								made. User enters credit card information via voice, pen or

								keypad.</p>

								</td>

								</tr>


								<tr>

								<td>

								<p>User provides signature for authorization purposes</p>

								</td>

								<td>

								<p>Ink is collected with information about pressure and tilt.</p>

								</td>

								<td>

								<p>send (ink)</p>

								</td>

								<td>

								<p>Server verifies signature.</p>

								</td>

								<td>

								<p>DONE</p>

								</td>

								</tr>

								</table>


								<h3 id="driving-dir">2.2 Use-case: Driving Directions</h3>


								<h4 id="driving-dir-assumptions">Assumptions</h4>


								<ul>

								<li>ASR services are local for simple requests (e.g. session

								preference setup)</li>


								<li>ASR is server-based for complex requests (e.g. addresses)</li>


								<li>TTS local</li>


								<li>Execution model is hosted on the device.</li>


								<li>single language - with acknowledgement that we will ultimately

								need language selection</li>


								<li>availability (always on) - with acknowledgement that there may

								be temporary interruptions due to unexpected circumstances (e.g.

								tunnels, mountains)</li>


								<li>driver is alone [cannot get assistance]</li>


								<li>Additional applications may be available when the service is

								initiated via a service selection menu (this is beyond the scope of

								this use case analysis)</li>


								<li>Initiating recognition requires a single button press. Button

								press indicating end of speech is optional assuming with

								preconfigured timeout to stop listening (requiring the user to hold

								down a button while driving may be dangerous)</li>


								<li>At any time during the session, the user may change display

								options via the touch screen (includes zooming in and changing

								route display options). Display options may also be changed using

								speech by initiating a dialog by pressing the button on the

								steering wheel</li>

								</ul>


								<h4 id="driving-dir-actors">Actors</h4>


								<p>Primary Device:</p>


								<ul>

								<li>

								<p>on-board system (in a car) with the following capabilities:</p>


								<ul>

								<li>graphical display:

								<ul>

								<li>maps</li>


								<li>Estimated time of arrival</li>


								<li>Textual Directions</li>

								</ul>

								</li>


								<li>touch screen</li>


								<li>voice (input and output)</li>


								<li>keyboard/text input</li>


								<li>local ASR and TTS processing</li>


								<li>access to remote servers (ASR and App Server)</li>


								<li>GPS</li>

								</ul>

								</li>

								</ul>


								<p>Data sources:</p>


								<ul>

								<li>route database</li>


								<li>traffic conditions</li>


								<li>GPS data</li>


								<li>speedometer</li>


								<li>landmarks database and places of interest:

								<ul>

								<li>nearest gas station</li>


								<li>nearest restaurant of a specific type</li>

								</ul>

								</li>


								<li>User Preference Database</li>

								</ul>


								<h4 id="driving-dir-walkthru">Scenario Walkthrough (User point of

								view)</h4>


								User preferences (These may be changed on a per session basis):

								<ul>

								<li>Primary Input: Speech</li>


								<li>Secondary Input: Touch Screen</li>


								<li>Speech and Graphical Output</li>


								<li>Preferences are stored on the server to enable multiple users

								to use the same device (Preferences may be retrieved automatically

								based on speaker identification or key identification eliminating

								the need for an authentication dialog)</li>

								</ul>


								<p>User wants to go to a specific address from his current location

								and while driving wants to take a detour to a local restaurant (The

								user does not know the restaurant address nor the name)</p>


								<h4 id="table5">Table 5: Event Table</h4>


								<table border="1" cellpadding="5" summary="5 column table">

								<tr>

								<th>User Action/External Input</th>

								<th>Action on Device</th>

								<th>Event Description</th>

								<th>Event Handler</th>

								<th>Resulting Action</th>

								</tr>


								<tr>

								<td>User presses button on steering wheel</td>

								<td>Service is initiated and GPS satellite detection begins</td>

								<td>HTTP Request to app server</td>

								<td>App server returns initial page to device</td>

								<td>Welcome prompts are played. Authentication dialog is initiated

								(may be initiated via speaker identification or key

								identification).</td>

								</tr>


								<tr>

								<td>User interacts in an authentication dialog</td>

								<td>Device executes authentication dialog using local ASR

								processing</td>

								<td>HTTP Request to app server which includes user credentials</td>

								<td>App server returns initial page to device including user

								preferences</td>

								<td>User is prompted for a destination (if additional services are

								availble after authentication, assume that user selects driving

								direction application)</td>

								</tr>


								<tr>

								<td>Initial GPS Input</td>

								<td>N/A</td>

								<td>GPS_Data_In Event</td>

								<td>Device handles location information</td>

								<td>Device updates map on graphical display (assumes all maps are

								stored locally on device)</td>

								</tr>


								<tr>

								<td>User selects option to change volume of on-board unit using

								touch display.</td>

								<td>N/A</td>

								<td>Touch_screen_event (includes x, y coordinates)</td>

								<td>Touch screen detects and processes input</td>

								<td>Volume indicator changes on screen. Volume of speech output is

								changed</td>

								</tr>


								<tr>

								<td>User presses button on steering wheel</td>

								<td>Device initiates connection to ASR server</td>

								<td>Start_Listening Event</td>

								<td>ASR Server receives request and establishes connection</td>

								<td>"listening" icon appears on display (utterances prior to

								establishing the connection are buffered)</td>

								</tr>


								<tr>

								<td>User says destination address (may improve recognition accuracy

								by sending grammar constraints to server based on a local dialog

								with the user instead of allowing any address from the start)</td>

								<td>N/A</td>

								<td>N/A</td>

								<td>ASR Server processes speech and returns results to device</td>

								<td>Device processes results and plays confirmation dialog to user

								while highlighting destination and route on graphical display</td>

								</tr>


								<tr>

								<td>User confirms destination</td>

								<td>Device performs ASR Processing locally. Upon confirmation,

								destination info is sent to app server</td>

								<td>HTTP Request is sent to app server (includes current location

								and destination information)</td>

								<td>App Server processes input and returns data to device</td>

								<td>Device processes results and updates graphical display with

								route and directions highlighting next step</td>

								</tr>


								<tr>

								<td>GPS Input at regular intervals</td>

								<td>N/A</td>

								<td>GPS_Data_In Event</td>

								<td>Device processes location data and checks if location milestone

								is hit</td>

								<td>Device updates map on graphical display (assumes all maps are

								stored locally on device) and highlights current step. When

								milestone is hit, next instruction is played to user</td>

								</tr>


								<tr>

								<td>GPS Input at regular intervals (indicating driver is off

								course)</td>

								<td>N/A</td>

								<td>GPS_Data_In Event</td>

								<td>Device processes location data and determines that user is off

								course</td>

								<td>Map on graphical display is updated and textual message is

								displayed indicating that route is not correct. Prompt is played

								from the device indicating that route is being recalculated</td>

								</tr>


								<tr>

								<td>N/A</td>

								<td>Route request is sent to app server including new location

								data</td>

								<td>HTTP Request is sent to app server (includes current location

								and destination information)</td>

								<td>App Server processes input and returns data to device</td>

								<td>Device processes results and updates graphical display with

								route and directions highlighting next step</td>

								</tr>


								<tr>

								<td>Alert received on device based on traffic conditions</td>

								<td>N/A</td>

								<td>Route_Change Alert</td>

								<td>Device processes event and initiates dialog to determine if

								route should be recalculated</td>

								<td>User is informed of traffic conditions and asked whether route

								should be recalculated.</td>

								</tr>


								<tr>

								<td>User requests recalculation of route based on current traffic

								conditions</td>

								<td>Device performs ASR Processing locally. Upon confirmation,

								destination info is sent to app server</td>

								<td>HTTP Request is sent to app server (includes current location

								and destination information)</td>

								<td>App Server processes input and returns data to device</td>

								<td>Device processes results and updates graphical display with

								route and directions highlighting next step</td>

								</tr>


								<tr>

								<td>GPS Input at regular intervals</td>

								<td>N/A</td>

								<td>GPS_Data_In Event</td>

								<td>Device processes location data and checks if location milestone

								is hit</td>

								<td>Device updates map on graphical display (assumes all maps are

								stored locally on device) and highlights current step. When

								milestone is hit, next instruction is played to user</td>

								</tr>


								<tr>

								<td>User presses button on steering wheel</td>

								<td>Connection to ASR server is established</td>

								<td>Start_Listening Event</td>

								<td>ASR Server receives request and establishes connection</td>

								<td>User hears acknowledgement prompt for continuation, and

								"listening" icon appears on display</td>

								</tr>


								<tr>

								<td>User requests new destination by destination type while still

								depressing button on steering wheel (may improve recognition

								accuracy by sending grammar constraints to server based on a local

								dialog with the user)</td>

								<td>N/A</td>

								<td>N/A</td>

								<td>ASR Server processes speech and returns results to device</td>

								<td>Device processes results and plays confirmation dialog to user

								while highlighting destination and route on graphical display</td>

								</tr>


								<tr>

								<td>User confirms destination via a multiple interaction dialog to

								determine exact destination</td>

								<td>Device executes dialog based on user responses (using local ASR

								Processing) and accesses app server as needed</td>

								<td>HTTP requests to app server for dialog and data specific to

								user response</td>

								<td>App server responds with appropriate dialog</td>

								<td>User interacts in a dialog and selects destination. User is

								asked whether this is a new destination</td>

								</tr>


								<tr>

								<td>User indicates that this is a stop on the way to original

								destination</td>

								<td>Devices sends updated destination information to app

								server</td>

								<td>HTTP Request for updated directions (based on current location,

								detour destination, and ultimate destination)</td>

								<td>App Server processes input and returns data to device</td>

								<td>Device processes results and updates graphical display with new

								route and directions highlighting next step</td>

								</tr>


								<tr>

								<td>GPS Input at regular intervals</td>

								<td>N/A</td>

								<td>GPS_Data_In Event</td>

								<td>Device processes location data and checks if location milestone

								is hit</td>

								<td>Device updates map on graphical display (assumes all maps are

								stored locally on device) and highlights current step. When

								milestone is hit, next instruction is played to user</td>

								</tr>

								</table>


								<h4 id="protocols">Protocols:</h4>


								<ul>

								<li>HTTP</li>


								<li>Proprietary protocol for connection to ASR server?</li>


								<li>GPS</li>


								<li>Others</li>

								</ul>


								<h4 id="driving-dir-events">Events:</h4>


								<ul>

								<li>ASR Events</li>


								<li>Touch Screen Events</li>


								<li>GPS Updates</li>


								<li>Refresh Triggers</li>


								<li>Traffic Alerts</li>


								<li>Others???</li>

								</ul>


								<h4 id="driving-dir-synch">Synchronization Issues:</h4>


								<ul>

								<li>Spoken Directions must be synchronized with current

								location</li>


								<li>When route changes while prompts are playing, current prompts

								must be stopped and new prompts queued. This may be triggered by

								the following:

								<ul>

								<li>BSW pressed by user</li>


								<li>Screen is touched</li>


								<li>Traffic Update event is received</li>


								<li>Driver Error</li>

								</ul>

								</li>


								<li>Screen must be updated to reflect current location and route.

								This may be triggered by:

								<ul>

								<li>Refresh Event</li>


								<li>Change of destination</li>


								<li>Change of route</li>


								<li>Driver Error</li>

								</ul>

								</li>


								<li>Asynchronous events such as traffic updates need to be

								synchronized with explicit user requests including:

								<ul>

								<li>Route change requests</li>


								<li>Display/Output Preference change requests</li>

								</ul>

								</li>


								<li>Others???</li>

								</ul>


								<h4 id="driving-dir-latency">Latency Concerns</h4>


								<ul>

								<li>Unanticipated app Server delays may cause directions to be

								inaccurate</li>

								</ul>


								<h4 id="driving-dir-considerations">Scenario Considerations</h4>


								<p>Input Information:</p>


								<ul>

								<li>Starting address/location:

								<ul>

								<li>explicit street address</li>


								<li>current location obtained via GPS</li>


								<li>landmark or place of interest</li>

								</ul>

								</li>


								<li>Ending address/location:

								<ul>

								<li>explicit street address</li>


								<li>landmark or place of interest</li>

								</ul>

								</li>


								<li>Traffic Conditions</li>


								<li>General preferences:

								<ul>

								<li>highway vs. scenic route</li>


								<li>time vs. distance</li>


								<li>style of output (graphical, turn-by-turn, etc...)</li>


								<li>units of output (miles vs. kilometers)</li>

								</ul>

								</li>

								</ul>


								<p>Possible Devices:</p>


								<ul>

								<li>Phone with display</li>


								<li>Phone without display (voice only)</li>


								<li>In-dash system (GPS, ASR, TTS)</li>


								<li>PC</li>


								<li>PDA</li>


								<li>Phone (voice + data)</li>


								<li>UMTS</li>

								</ul>


								<p>Available Technologies:</p>


								<ul>

								<li>Communication (2.5G, 3G)</li>


								<li>Display (Y/N)</li>


								<li>Application run-time environment (BREW, J2ME, etc)</li>


								<li>Server access</li>

								</ul>


								<p>Data sources:</p>


								<ul>

								<li>route database</li>


								<li>traffic conditions</li>


								<li>location [GPS]</li>


								<li>speed and time of arrival [GPS, speedometer]</li>


								<li>landmarks database and places of interest:

								<ul>

								<li>nearest gas station</li>


								<li>nearest restaurant of a specific type</li>

								</ul>

								</li>


								<li>User Preference Database</li>

								</ul>


								<p>Output Mechanisms:</p>


								<ul>

								<li>graphical (map)</li>


								<li>text description</li>


								<li>voice</li>


								<li>fax</li>


								<li>dynamic updates (recalculation based on traffic information,

								driver error, etc...)</li>


								<li>single delivery of results vs. multiple/sequential delivery of

								results as needed</li>

								</ul>


								<h3 id="name-dialing">2.3 Use Case: Multimodal Name Dialling Use

								Case</h3>


								<h4 id="name-dialing-overview">Overview</h4>


								<p>The Name Dialing use case describes a scenario in which users

								can say a name into their mobile terminals and be connected to the

								named person based on the called party's availability for that

								caller.</p>


								<p>If the called user is not available, the calling user may be

								given the choice of either leaving a message on the called user's

								voicemail system or sending an email to the called user. The called

								user may provide a personalized message for the caller, including,

								for example, "Don't ever call me again!"</p>


								<p>The called user is given the opportunity of selecting which

								device the call should be routed to, e.g. work, mobile, home, or

								voice mail. This may be dependent on the time of day, the called

								user's location, and the identity of the calling user.</p>


								<p>The use case assumes a rich model of name dialling as an example

								of a premium service exploiting a range of information such as

								personal and network directories, location, presence, buddy lists

								and personalization features.</p>


								<p>The benefits of making this a multimodal interacton include the

								ability to view and listen to information about the called user,

								and to be able to use a keypad or stylus, as an alternative to

								using voice as part of the name selection process.</p>


								<h4 id="name-dialing-actors">Actors</h4>


								<ul>

								<li>

								<p>Caller &mdash; user who wishes to place a call</p>

								</li>


								<li>

								<p>Called user &mdash; user who wishes control over how incoming

								calls are handled</p>

								</li>


								<li>

								<p>Mobile display phone with a lightweight client browser, and

								optional speaker-dependent minimal speech recognition

								capabilities</p>

								</li>


								<li>

								<p>Network based directory service with speech recognition

								capabilities, this provides support for looking up names in

								personal contact lists, as well as in corporate and public

								directories</p>

								</li>


								<li>

								<p>Network based unified messaging service with provision for

								composing, transferring and playing back messages, including

								personalized messages intended for specific callers</p>

								</li>


								<li>

								<p>User profile database with presence information, buddy lists,

								and personalized call handling rules</p>

								</li>

								</ul>


								<h4 id="name-dialing-assumptions">Assumptions</h4>


								<p>The user has a device with a button that is pushed to place a

								call. The device has recording capabilities. [voice activation is

								power hungry and unreliable in noisy environments]</p>


								<p>Both voice and data capabilities are available on the

								communications provider's network (not necessarily as

								simultaneously active modes).</p>


								<p>If the phone supports speech recognition and there is a local

								copy of the personal phone contact list, then the user's spoken

								input is first recognized against the local directory for a

								possible match and if unsuccessful, the request is extended back to

								the directory provider.</p>


								<p>The directory provider has access to a messaging service and to

								user profiles and presence information. The directory provider thus

								knows the whereabouts of each registered user - on the phone, at

								work, unavailable etc.</p>


								<p>The directory provider enforces access control rules to ensure

								individual and corporate privacy. This isn't explored in this use

								case.</p>


								<p>People can be identified by personal names like "Wendy" or by

								nick names or aliases. The personal contact list provides a means

								for subscribers to define their own aliases, and to limit the scope

								of search (there are a lot of Wendy's worldwide).</p>


								<p>There is a user agent on the client device with an XHTML browser

								and optional speaker-dependent speech recognition capabilities.</p>


								<p>There is a client server relationship between the user agent on

								the device and the directory provider.</p>


								<p>The dialog could be driven from either the client device or from

								the network. This doesn't effect the user view, but does alter the

								events used to coordinate the two systems. This will be explored in

								a later section.</p>


								<p>The Name Dialing use case will be described through the

								following views:</p>


								<h4 id="name-dialing-user-view">User view</h4>


								<p>User pushes a button and says</p>


								<pre>

								  "Call Wendy Smith"

								</pre>


								<p>It is also possible to say such things as:</p>


								<pre>

								  "Call Wendy"


								  "Call Wendy Smith at work".


								  "Call Wendy at home".


								  "Call Wendy Smith on her mobile phone".

								</pre>


								<p>Multiple scenarios are possible here:</p>


								<p>If local recognition is supported, the utterance will be first

								processed by a local name dialling application. If there is no

								match, the recorded utterance is forwarded to a network based name

								dialling application.</p>


								<p>The user's personal contact list will take priority over the

								corporate and public directories. This is independent of whether

								the personal list is held locally in the device or in the

								network.</p>


								<p>The following situations can arise when the user says a

								name:</p>


								<ol>

								<li>

								<p>Single match &mdash; the caller is presented with information

								about the callee. This may include a picture taken from the

								callee's profile. The caller is asked for a confirmation before the

								call is put through.</p>

								</li>


								<li>

								<p>Multiple matches &mdash; if the number of matches is small

								(perhaps five or less), the caller is asked to choose from the

								list. This is be presented to the caller via speech and accompanied

								with a display of a list of names and pictures. The caller can

								then:</p>


								<ul>

								<li>Use a button on the phone to select a list item.</li>


								<li>Point or touch a link on the screen in the presented list.</li>


								<li>Say index number or expanded name from the presented list.</li>

								</ul>


								<p>A further alternative is to say "that one" as the system speaks

								each item in the list in sequence. This method is offered in case

								the user needs hands and eyes free operation, or the device is

								incapable of displaying the list</p>

								</li>


								<li>

								<p>Lot's of matches, for example, when the caller says a common

								name. The caller is led through a directed dialog to narrow down

								the search.</p>

								</li>


								<li>

								<p>No recognition &mdash; the recognizer wasn't able to find a

								match. The user could have failed to say anything, or there could

								have been too much noise. A tapered help mechanism is invoked.

								Callers could be asked to repeat themselves, or asked to key in the

								number or speak it digit by digit.</p>

								</li>

								</ol>


								<p>Assuming that the user successfully makes a selection:</p>


								<ul>

								<li>

								<p>The system retrieves further information on the called user such

								as the current location and local time of that user. The

								information presented may depend on the relationship between the

								called and calling users. This assumes support for a buddy list and

								presence capability. The called user may specify her availability

								for specific individuals or groups of would be callers depending on

								time of day etc.</p>

								</li>


								<li>

								<p>Two scenarios are described here:</p>


								<ol>

								<li>

								<p>The system finds that the called person is currently available.

								A picture and/or sound bite is provided to the caller. The system

								places the call and the user is connected to Wendy Smith.</p>


								<p><b>Post condition</b>: The user is in a call with the intended

								party.</p>

								</li>


								<li>

								<p>The system finds that the called person is unavailable. The

								system attempts to connects to the called user's voicemail

								system.</p>


								<p>Assuming this succeeds, the system plays the following prompt

								back to the caller: "Wendy Smith is currently unavailable. She has

								left this message for you."</p>


								<p>The message is played out. It could be a multimedia message with

								recorded sound, text, pictures and even short video clips.</p>


								<p>The system plays a prompt back - "Would you like to leave a

								message?"</p>


								<p>The user says "Yes".</p>


								<p>The user is then connected to the voicemail system and leaves a

								message for Wendy Smith.</p>


								<p>If Wendy's voicemail box is full or unavailable, the system

								offers the caller the chance of composing an email. This occupies

								the caller's storage allocation until it has been sent.</p>


								<p><b>Post condition</b>: The user has left a message for the

								intended party.</p>

								</li>

								</ol>

								</li>

								</ul>


								<p>The availability of the called user may depend on the time of

								day, whether the called user is away from her work or home

								location, and who the calling user is. For example, when travelling

								you may want to take calls on your mobile during the day. Don't you

								hate it when people call you in the middle of the night because

								they don't realize what timezone you are in! You may want to make

								an exception for close friends and family members. There may also

								be some people whom you never want to accept calls from, not even

								voice messages!</p>


								<p>When a user is notified of an incoming call, the device may

								present information on the caller including a photograph, name,

								sound bite, location and local time information, depending on the

								relationship between the caller and callee. The user then has an

								opportunity to accept the call or to divert it to voice mail.</p>


								<h4 id="name-dialing-provider-view">Directory provider View</h4>


								<ul>

								<li>

								<p>The client on the user device records the spoken input. The

								spoken input is recognized against the directory on client device.

								When this fails, the utterance is extended to the directory

								provider for recognition.</p>


								<p>If the user device doesn't support local recognition, it may

								still need to record the utterance, so that the user can start

								talking immediately without needing to wait for the connection to

								the directory provider to be completed.</p>

								</li>


								<li>

								<p>The directory provider retrieves the profile for the calling

								user. This has information on which device the user is calling

								from, the current location of the calling user etc. The calling

								user is authenticated and authorized.</p>

								</li>


								<li>

								<p>The recognizer in the provider recognizes the spoken utterance

								and returns the result. This result can either be a single entry or

								a list of possible close matches.</p>

								</li>


								<li>

								<p>The server application (hosting the directory provider) now

								controls the flow of the interaction henceforth.</p>

								</li>


								<li>

								<p>The server goes to the database and retrieves more information

								based on the recognizer result.</p>

								</li>


								<li>

								<p>The provider queries the presence of the called user, and

								personalization information (buddy list, location and presence

								information, etc.) to construct the content for the response.</p>

								</li>


								<li>

								<p>A result may be returned back to the client device in more than

								one way here:</p>


								<p>A single XHTML page is constructed with both visual picture and

								audio with the complete name of the recognized match.</p>


								<p>The feedback can include two channels such as visual for the

								picture and a separate voice channel for playing back the name of

								the user (an optimization for reduced latency).</p>

								</li>


								<li>

								<p>The server creates and transfers a composed page to the client

								device.</p>

								</li>


								<li>

								<p>Once the client receives the content from the application

								server, multiple scenarios are possible here based on the

								recognizer result. See user view for details.</p>

								</li>


								<li>

								<p>Picking a choice from a list can be done by voice, button or

								stylus. The user should be able to browse the list, and to revisit

								the list upon rejecting a confirmation of a preceding choice.</p>


								<p>Example: user says "Call the first one". This utterance is

								processed by the directory provider to select the first match.</p>

								</li>


								<li>

								<p>The directory application may need to apply a directed dialog to

								narrow the search when there are more than a few matches, or when

								recognition and tapered help needs to be offered.</p>

								</li>

								</ul>


								<h4 id="name-dialing-initiative">What is driving the dialog?</h4>


								<p>The details of the events depend on whether the dialog is being

								driven from the network or from the user device.</p>


								<p>When the device sends a spoken utterance to the server, the user

								may have spoken a name such as "Tom Smith" or spoken a command such

								as "the last one". If the directory search is being driven by the

								user device, the server's response is likely to be a short list of

								matches, or a command or error code. To support the application,

								the server would provide a suite of functions, including the means

								for the device to set the recognition context, the ability to play

								specific prompts, and to download information on named users.</p>


								<p>If the network is driving the dialog, the device sends the

								spoken utterance in the same way, but the responses are actions to

								update the display and local state. If the caller presses a button

								or uses a stylus to make a selection, this event will be sent to

								the server. The device and server could exchange low level events,

								such as a stylus tap at a given coordinate, or higher level events

								such as which name the user has selected from the list.</p>


								<h4 id="table6">Table 6: Event Table</h4>


								<table border="1" cellpadding="5" summary="5 column table">

								<tr valign="top">

								<th width="10%">

								<p>User action</p>

								</th>

								<th width="25%">

								<p>Action on device</p>

								</th>

								<th width="20%">

								<p>Events sent from device</p>

								</th>

								<th width="25%">

								<p>Action on server</p>

								</th>

								<th width="20%">

								<p>Events sent from server</p>

								</th>

								</tr>


								<tr valign="top">

								<td>

								<p>Turns on the device</p>

								</td>

								<td>

								<p>Registers with the Directory Provider through the operator in

								the NW and downloads the personal directory</p>

								</td>

								<td>

								<p>register user (userId)</p>

								</td>

								<td>

								<p>Directory Provider gets register information, updates user's

								presence and location info, loads user's personal info (buddy list,

								personal directory,...)</p>

								</td>

								<td>

								<p>acknowledgement + personal directory</p>


								<p class="comment">In practice, SyncML would be used to reduce net

								traffic</p>

								</td>

								</tr>


								<tr valign="top">

								<td>

								<p>Pushes a button to place a call</p>

								</td>

								<td>

								<p>Local reco initialized, activates the personal directory</p>

								</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								</tr>


								<tr valign="top">

								<td>&nbsp;</td>

								<td>

								<p>Displays a prompt</p>


								<p>"Please say a name"</p>

								</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								</tr>


								<tr valign="top">

								<td>

								<p>Speaks a name</p>

								</td>

								<td>

								<p>Local recognition against personal directory</p>

								</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								</tr>


								<tr>

								<td colspan="5" valign="top">

								<p>a) If grammar matches:</p>

								</td>

								</tr>


								<tr valign="top">

								<td>&nbsp;</td>

								<td>

								<p>Display the name or namelist (see following table)</p>

								</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								</tr>


								<tr valign="top">

								<td>

								<p>Confirms by pressing the call button again if 1 name is

								displayed, or selects a name on the list (see following table)</p>

								</td>

								<td>

								<p>Fetches the number from the personal directory</p>

								</td>

								<td>

								<p>call(userID, number)</p>

								</td>

								<td>

								<p>Checks the location and presence status of the called party</p>

								</td>

								<td>

								<p>call ok(picture)<br />

								OR<br />

								called party not available</p>

								</td>

								</tr>


								<tr valign="top">

								<td>&nbsp;</td>

								<td>

								<p>if call ok, displays the picture and places a call,</p>


								<p>if called party not available, displays/plays a corresponding

								prompt about leaving a message or sending an e-mail</p>

								</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								</tr>


								<tr>

								<td colspan="5" valign="top">

								<p>i) if user chooses to leave a message:</p>

								</td>

								</tr>


								<tr valign="top">

								<td>

								<p>User agrees to leave a message by pressing a suitable button</p>

								</td>

								<td>

								<p>Initializes the recording, displays a prompt to start the

								recording</p>

								</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								</tr>


								<tr valign="top">

								<td>

								<p>User speaks and ends by pressing a suitable button</p>

								</td>

								<td>

								<p>Closes the recording, sends the recording to the Directory

								Provider app</p>

								</td>

								<td>

								<p>leave message(userID, number, recording)</p>

								</td>

								<td>

								<p>Stores the message for the called party</p>

								</td>

								<td>

								<p>message ok</p>

								</td>

								</tr>


								<tr>

								<td colspan="5" valign="top">

								<p>ii) if user chooses to send an e-mail:</p>

								</td>

								</tr>


								<tr valign="top">

								<td>

								<p>User selects 'send e-mail' option by pressing a suitable

								button</p>

								</td>

								<td>

								<p>Starts an e-mail writing application</p>

								</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								</tr>


								<tr valign="top">

								<td>

								<p>Writes e-mail</p>

								</td>

								<td>

								<p>Fetches the e-mail address from the personal directory, sends

								e-mail, closes the e-mail app</p>

								</td>

								<td>

								<p>send mail(userID, mail address, text)</p>

								</td>

								<td>

								<p>Sends the e-mail to the called party</p>

								</td>

								<td>

								<p>mail ok</p>

								</td>

								</tr>


								<tr>

								<td colspan="5" valign="top">

								<p>b) if personal grammar does not match:</p>

								</td>

								</tr>


								<tr valign="top">

								<td>&nbsp;</td>

								<td>

								<p>sends the utterance to be recognized in the network</p>

								</td>

								<td>

								<p>send(userID, utterance)</p>

								</td>

								<td>

								<p>Recognition against public directory</p>

								</td>

								<td>

								<p>reco ok(namelist)</p>


								<p>OR</p>


								<p>reco nok</p>

								</td>

								</tr>


								<tr valign="top">

								<td>&nbsp;</td>

								<td>

								<p>if reco ok, displays the name or namelist (more details in

								following table), activates local reco with the index list if more

								than one name,</p>


								<p>if reco nok, display/play a message to the user</p>

								</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								</tr>


								<tr valign="top">

								<td>

								<p>Confirms by pressing the call button again if 1 name is

								displayed, or selects a name on the list (see following table)</p>

								</td>

								<td>

								<p>Selection received (perhaps spoken index recognized first)</p>

								</td>

								<td>

								<p>call(userID, number)</p>

								</td>

								<td>

								<p>Checks the location ... [continues as described above]</p>

								</td>

								<td>&nbsp;</td>

								</tr>

								</table>


								<h4 id="table7">Table 7: Interaction details of displaying and

								confirming the recognition results</h4>


								<table border="1" cellpadding="7" summary="5 column table">

								<tr valign="top">

								<th width="10%">

								<p>User action</p>

								</th>

								<th width="25%">

								<p>Action on device</p>

								</th>

								<th width="20%">

								<p>Events sent from device</p>

								</th>

								<th width="25%">

								<p>Action on server</p>

								</th>

								<th width="20%">

								<p>Events sent from server</p>

								</th>

								</tr>


								<tr>

								<td colspan="5" valign="top">

								<p>... speaker utterance has been processed by the recogniser</p>

								</td>

								</tr>


								<tr>

								<td colspan="5" valign="top">

								<p>A. Very high confidence, unique match, auto confirmation (NB! I

								would recommend letting the user confirm this explicitly; this

								would also make the application behaviour seem more consistent to

								the user since some kind of confirmation would be needed every

								time)</p>

								</td>

								</tr>


								<tr valign="top">

								<td>&nbsp;</td>

								<td>

								<p>Displays the name and shows/plays clear prompt "Calling ..."</p>

								</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								</tr>


								<tr valign="top">

								<td>&nbsp;</td>

								<td>

								<p>Fetches the number</p>

								</td>

								<td>

								<p>call(userID, number)</p>

								</td>

								<td>

								<p>Checks the location and presence status of the called party</p>

								</td>

								<td>

								<p>call ok(picture)</p>


								<p>OR</p>


								<p>called party not available</p>

								</td>

								</tr>


								<tr>

								<td colspan="5" valign="top">

								<p>B. High confidence, unique match, explicit confirmation</p>

								</td>

								</tr>


								<tr valign="top">

								<td>&nbsp;</td>

								<td>

								<p>Displays the name and picture, prompt asking "Place a call?"</p>

								</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								</tr>


								<tr valign="top">

								<td>

								<p>Confirms by pressing the call button again</p>

								</td>

								<td>

								<p>Fetches the number</p>

								</td>

								<td>

								<p>call(userID, number)</p>

								</td>

								<td>

								<p>Checks the location and presence status of the called party</p>

								</td>

								<td>

								<p>call ok(picture)</p>


								<p>OR</p>


								<p>called party not available</p>

								</td>

								</tr>


								<tr>

								<td colspan="5" valign="top">

								<p>C. High confidence with several matching entries, or medium

								confidence with either unique match or several matching entries</p>

								</td>

								</tr>


								<tr valign="top">

								<td>&nbsp;</td>

								<td>

								<p>Displays the namelist with indexes, activates index grammar on

								local reco; if multiple entries with same spelling, additional info

								should be added on the list</p>

								</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								</tr>


								<tr valign="top">

								<td>

								<p>Selects a name by speaking the index or navigating to the

								correct name with keypad and pressing the call button</p>

								</td>

								<td>

								<p>Fetches the number</p>

								</td>

								<td>

								<p>call(userID, number)</p>

								</td>

								<td>

								<p>Checks the location and presence status of the called party</p>

								</td>

								<td>

								<p>call ok(picture)</p>


								<p>OR</p>


								<p>called party not available</p>

								</td>

								</tr>


								<tr>

								<td colspan="5" valign="top">

								<p>D. Low confidence, no match from the directory/ies</p>

								</td>

								</tr>


								<tr valign="top">

								<td>&nbsp;</td>

								<td>

								<p>Prompts "Not found, please try again"</p>

								</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								</tr>


								<tr valign="top">

								<td>

								<p>User speaks the name again</p>

								</td>

								<td>

								<p>New recognition, on 2<sup>nd</sup> or 3<sup>rd</sup> 'nomatch',

								change the prompt to ~ "Sorry, no number found"</p>

								</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								</tr>

								</table>


								<h4 id="table8">Table 8: No local recognition, all recognition in

								the Network</h4>


								<table border="1" cellpadding="7" summary="">

								<tr valign="top">

								<th width="10%">

								<p>User action</p>

								</th>

								<th width="25%">

								<p>Action on device</p>

								</th>

								<th width="20%">

								<p>Events sent from device</p>

								</th>

								<th width="25%">

								<p>Action on server</p>

								</th>

								<th width="20%">

								<p>Events sent from server</p>

								</th>

								</tr>


								<tr valign="top">

								<td>

								<p>Turns on the device</p>

								</td>

								<td>

								<p>Registers with the Directory Provider through the operator in

								the NW</p>

								</td>

								<td>

								<p>register user(userID)</p>

								</td>

								<td>

								<p>Directory Provider gets register information, updates user's

								presence and location info, loads user's personal info (buddy list,

								personal directory,...)</p>

								</td>

								<td>

								<p>register ack</p>

								</td>

								</tr>


								<tr valign="top">

								<td>

								<p>Pushes a button to place a call</p>

								</td>

								<td>&nbsp;</td>

								<td>

								<p>init reco(userID)</p>

								</td>

								<td>

								<p>Activates the personal directory and public directory</p>

								</td>

								<td>

								<p>reco init ok</p>

								</td>

								</tr>


								<tr valign="top">

								<td>&nbsp;</td>

								<td>

								<p>Displays a prompt</p>


								<p>"Please say a name"</p>

								</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								<td>&nbsp;</td>

								</tr>


								<tr valign="top">

								<td>

								<p>Speaks a name</p>

								</td>

								<td>

								<p>Sends the utterance to be recognized in the network</p>

								</td>

								<td>

								<p>send(userID, utterance)</p>

								</td>

								<td>

								<p>Recognition against personal directory first, if no match there

								with confidence greater than some threshold, then against public

								directory</p>

								</td>

								<td>

								<p>reco ok(namelist)</p>


								<p>OR</p>


								<p>reco nok</p>

								</td>

								</tr>

								</table>


								<h2 id="acknowledgements">3. Acknowledgements</h2>


								<p>The following people contributed to this document:</p>


								<ul>

								<li>Paulo Baggia, Loquendo</li>


								<li>Art Barstow, Nokia</li>


								<li>Emily Candell, Comverse</li>


								<li>Debbie Dahl, Consultant and Working Group Chair</li>


								<li>Stephen Potter, Microsoft</li>


								<li>Vlad Sejnoha, Scansoft</li>


								<li>Luc Van Tichelin, Scansoft</li>


								<li>Tasos Anastasakos, Motorola</li>


								<li>Lin Chen, Voice Genie</li>


								<li>Jim Larson, Intel Architecture Lab</li>


								<li>T.V. Raman, IBM</li>


								<li>Derek Schwenke, Mitsubishi Electric</li>


								<li>Giovanni Seni, Motorola</li>


								<li>Dave Raggett, W3C/Openwave</li>


								<li>Bennett Marks, Nokia</li>


								<li>Katriina Halonen, Nokia</li>


								<li>Ramalingam Hariharan, Nokia</li>


								<li>Stephane Maes, IBM</li>


								<li>Purush Yeluripati</li>


								<li>Kuansan Wang, Microsoft</li>

								</ul>

								</body>

								</html>