<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="HTML Tidy for Linux/x86 (vers 1st April 2003), see www.w3.org"
name="generator" />
<title>W3C Multimodal Interaction Framework</title>
<style type="text/css">
/*<![CDATA[*/
body {
margin-left: 8%;
margin-right: 5%;
background-color: white;
font-family: Trebuchet, Arial, sans-serif
}
h1 { margin-left: -4%; color: rgb(0,92,160) }
h2 { margin-left: -4%; color: rgb(0,92,160) }
h3 { margin-left: 0% }
p.fig {text-align: center}
.c1 { display: none }
.old { text-decoration: line-through }
.new { font-style: italic; color: green }
.note { font-style: italic; color: red }
p.example { margin-left: 10% }
/*]]>*/
</style>
<link rel="stylesheet" type="text/css"
href="http://www.w3.org/StyleSheets/TR/W3C-NOTE" />
</head>

<body>
<div class="head">
<p><a href="http://www.w3.org/"><img height="48" alt="W3C"
src="http://www.w3.org/Icons/w3c_home" width="72" /></a></p>

<h1 class="notoc" id="name">W3C Multimodal Interaction
Framework</h1>

<h2>W3C NOTE 06 May 2003</h2>

<dl>
<dt>This version:</dt>
<dd>
<a href="http://www.w3.org/TR/2003/NOTE-mmi-framework-20030506/">http://www.w3.org/TR/2003/NOTE-mmi-framework-20030506/</a></dd>

<dt>Latest version:</dt>
<dd>
<a href="http://www.w3.org/TR/mmi-framework/">http://www.w3.org/TR/mmi-framework/</a></dd>

<dt>Previous version:</dt>
<dd>
<a href="http://www.w3.org/TR/2002/NOTE-mmi-framework-20021202/">http://www.w3.org/TR/2002/NOTE-mmi-framework-20021202/</a></dd>

<dt>Editors:</dt>
<dd>James A. Larson, Intel</dd>
<dd>T.V. Raman, IBM</dd>
<dd>Dave Raggett, W3C &amp; Canon</dd>

<dt>Contributors:</dt>
<dd>Michael Bodell, Tellme Networks</dd>
<dd>Michael Johnston, AT&amp;T</dd>
<dd>Sunil Kumar, V-Enable Inc.</dd>
<dd>Stephen Potter, Microsoft</dd>
<dd>Keith Waters, France Telecom</dd>
</dl>

<p class="copyright"><a href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">
Copyright</a> © 2003 <a href="http://www.w3.org/"><acronym
title="World Wide Web Consortium">W3C</acronym></a><sup>®</sup> (<a
href="http://www.lcs.mit.edu/"><acronym
title="Massachusetts Institute of Technology">MIT</acronym></a>, <a
href="http://www.ercim.org/"><acronym
title="European Research Consortium for Informatics and Mathematics">ERCIM</acronym></a>,
<a href="http://www.keio.ac.jp/">Keio</a>), All Rights Reserved. W3C
<a href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">liability</a>,
<a href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">trademark</a>,
<a href="http://www.w3.org/Consortium/Legal/copyright-documents">document use</a>
and <a href="http://www.w3.org/Consortium/Legal/copyright-software">software
licensing</a> rules apply.</p>

<hr title="Separator for header" />
</div>
<h2 class="notoc" id="Abstract">Abstract</h2>

<p>This document introduces the W3C Multimodal Interaction
Framework, and identifies the major components for multimodal
systems. Each component represents a set of related functions. The
framework identifies the markup languages used to describe
information required by components and for data flowing among
components. The W3C Multimodal Interaction Framework describes
input and output modes widely used today and can be extended to
include additional modes of user input and output as they become
available.</p>
<h2 id="Status">Status of this Document</h2>

<p><em>This section describes the status of this document at the
time of its publication. Other documents may supersede this
document. The latest status of this document series is maintained
at the <abbr title="World Wide Web Consortium">W3C</abbr>.</em></p>

<p>W3C's <a href="http://www.w3.org/2002/mmi/">Multimodal
Interaction Activity</a> is developing specifications for extending
the Web to support multiple modes of interaction. This document
introduces a functional framework for multimodal interaction and is
intended to provide a context for the specifications that comprise
the W3C Multimodal Interaction Framework.</p>

<p>This document has been produced as part of the
<a href="http://www.w3.org/2002/mmi/">W3C Multimodal Interaction
Activity</a>, following the procedures set out for the
<a href="http://www.w3.org/Consortium/Process/">W3C Process</a>.
The authors of this document are members of the
<a href="http://www.w3.org/2002/mmi/Group/">Multimodal Interaction
Working Group</a>
(<a href="http://cgi.w3.org/MemberAccess/AccessRequest">W3C Members
only</a>). This is a Royalty Free Working Group, as described in
W3C's <a href="/TR/2002/NOTE-patent-practice-20020124">Current
Patent Practice</a> NOTE. Working Group participants are required
to provide <a href="http://www.w3.org/2002/01/mmi-ipr.html">patent
disclosures</a>.</p>

<p>Please send comments about this document to the public mailing
list:
<a href="mailto:www-multimodal@w3.org">www-multimodal@w3.org</a>
(<a href="http://lists.w3.org/Archives/Public/www-multimodal/">public
archives</a>). To subscribe, send an email to
<a href="mailto:www-multimodal-request@w3.org">www-multimodal-request@w3.org</a>
with the word <em>subscribe</em> in the subject line
(include the word <em>unsubscribe</em> if you want to
unsubscribe).</p>

<p>A list of current W3C Recommendations and other technical
documents, including Working Drafts and Notes, can be found at
<a href="http://www.w3.org/TR/">http://www.w3.org/TR/</a>.</p>
<h2 id="intro">1. Introduction</h2>

<p>The purpose of the W3C multimodal interaction framework is to
identify and relate markup languages for multimodal interaction
systems. The framework identifies the major components for every
multimodal system. Each component represents a set of related
functions. The framework identifies the markup languages used to
describe information required by components and for data flowing
among components.</p>

<p>The W3C Multimodal Interaction Framework describes input and
output modes widely used today and can be extended to include
additional modes of user input and output as they become
available.</p>

<p><em>The multimodal interaction framework is not an
architecture</em>. The multimodal interaction framework is a level
of abstraction above an architecture. An architecture indicates how
components are allocated to hardware devices and the communication
system enabling the hardware devices to communicate with each
other. The W3C Multimodal Interaction Framework does not describe
either how components are allocated to hardware devices or how the
communication system enables the hardware devices to communicate.
See <a href="#s10">Section 10</a> for descriptions of several
example architectures consistent with the W3C multimodal
interaction framework.</p>
<h2 id="s2">2. Basic Components of the W3C Multimodal Interaction
Framework</h2>

<p>The Multimodal Interaction Framework is intended as a basis for
developing multimodal applications in terms of markup, scripting,
styling and other resources. The Framework will build upon a range
of existing W3C markup languages together with the
<a href="http://www.w3.org/DOM/">W3C Document Object Model</a>
(DOM). DOM defines interfaces whereby programs and scripts
can dynamically access and update the content, structure and style
of documents.</p>

<p>Figure 1 illustrates the basic components of the W3C multimodal
interaction framework.</p>

<p class="fig"><img src="fig1.png" width="493" height="300"
alt="I/O processors, dialog manager and application back end" /></p>
<p><em>Human user</em> — A user who enters input into the
system and observes and hears information presented by the system.
In this document, we will use the term "user" to refer to a human
user. However, an automated user may replace the human user for
testing purposes. For example, an automated "testing harness" may
replace human users for regression testing to verify that changes
to one component do not affect the user interface negatively.</p>

<p><em>Input</em> — An interactive multimodal implementation
will use multiple input modes, such as audio, speech, handwriting,
keyboarding, and other input modes. The various modes of input
will be described in <a href="#s3">Section 3</a>.</p>

<p><em>Output</em> — An interactive multimodal implementation
will use one or more modes of output, such as speech, text,
graphics, audio files, and animation. The various modes of output
will be described in <a href="#s4">Section 4</a>.</p>

<p><em>Interaction manager</em> — The interaction manager is
the logical component that coordinates data and manages execution
flow from various input and output modality component interface
objects. The input and output modality components are described
in <a href="#s5">Section 5</a>.</p>

<p>The interaction manager maintains the interaction state and
context of the application and responds to inputs from component
interface objects and changes in the system and environment. The
interaction manager then manages these changes and coordinates
input and output across component interface objects. The
interaction manager is discussed in <a href="#s6">Section
6</a>.</p>

<p>In some architectures the interaction manager may be implemented
as one single component. In other architectures the interaction
manager may be treated as a composition of lesser components.
Composition may be distributed across process and device
boundaries.</p>

<p><em>Session component</em> — The session component
(discussed in <a href="#s7">Section 7</a>) provides an interface to
the interaction manager to support state management, and temporary
and persistent sessions for multimodal applications. This is
useful in scenarios such as the following:</p>

<ul>
<li>A user is interacting with an application which runs on
multiple devices.</li>

<li>The application is session-based, e.g. a multiplayer game,
multimodal chat, or meeting room.</li>

<li>The application provides multiple modes of providing input and
receiving output.</li>

<li>The application runs on a single device and provides
multimodality by switching between modes.</li>
</ul>

<p><em>System and Environment component</em> — This component
enables the interaction manager to find out about and respond to
changes in device capabilities, user preferences and environmental
conditions. For example, which of the available modes does the user
wish to use — has the user muted audio input? The
interaction manager may be interested in the width and height of
the display, whether it supports color, and other capability and
configuration information. For more information see
<a href="#s8">Section 8</a>.</p>
<h2 id="s3">3. Input Components</h2>

<p>Figure 2 illustrates the various types of components within the
input component.</p>

<p class="fig"><img src="fig2.png" width="539" height="449"
alt="recognition, interpretation and integration of inputs" /></p>
<ul>
<li>
<p><em>Recognition component</em> — Captures natural input
from the user and translates the input into a form useful for later
processing. The recognition component may use a grammar described
by a grammar markup language (a sketch follows this list). Example
recognition components include:</p>

<ul>
<li><em>Speech</em> — Converts speech into text. The
automatic speech recognition component uses an acoustic model, a
language model, and a grammar specified using the W3C Speech
Recognition Grammar Specification or the Stochastic Language Model
(N-Gram) Specification to convert human speech into words specified
by the grammar.</li>

<li><em>Handwriting</em> — Converts handwritten symbols and
messages into text. The handwriting recognition component may use a
handwritten gesture model, a language model, and a grammar to
convert handwriting into words specified in a grammar.</li>

<li><em>Keyboarding</em> — Converts key presses into textual
characters.</li>

<li><em>Pointing device</em> — Converts button presses into
x-y positions on a two-dimensional surface.</li>
</ul>

<p>Other input recognition components may include vision, sign
language, DTMF, biometrics, tactile input, speaker verification,
handwriting identification, and other input modes yet to be
invented.</p>
</li>

<li>
<p><em>Interpretation component</em> — May further process
the results of recognition components. Each interpretation
component identifies the "meaning" or "semantics" intended by the
user. For example, many words that users utter, such as "yes,"
"affirmative," "sure," and "I agree," could be represented as
"yes."</p>
</li>

<li>
<p><em>Integration component</em> — Combines the output from
several interpretation components.</p>
</li>
</ul>
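
<p>As an illustration of the grammar markup mentioned above, the
following sketch uses the XML form of the W3C Speech Recognition
Grammar Specification to constrain a simple confirmation. The rule
name and the word list are illustrative only.</p>

<pre class="example">
&lt;!-- illustrative SRGS fragment: accepts a few affirmative phrases --&gt;
&lt;grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" root="confirm"&gt;
  &lt;rule id="confirm"&gt;
    &lt;one-of&gt;
      &lt;item&gt;yes&lt;/item&gt;
      &lt;item&gt;sure&lt;/item&gt;
      &lt;item&gt;affirmative&lt;/item&gt;
      &lt;item&gt;I agree&lt;/item&gt;
    &lt;/one-of&gt;
  &lt;/rule&gt;
&lt;/grammar&gt;
</pre>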
<p>Some or all of the functionality of this component could be
implemented as part of the recognition, interpretation, or
interaction components. For example, audio-visual speech
recognition may integrate lip movement recognition and speech
recognition as part of a lip reading component, as part of the
speech recognition component, or within a separate
integration component. As another example, the two input modes of
speaking and pointing are used in</p>

<p class="example">"put that," (point to an object), "there,"
(point to a location)</p>

<p>and may be integrated within a separate integration component or
within the interaction manager component.</p>

<p>Information generated by other system components may be
integrated with user input by the integration component. For
example, a GPS system generates the current location of the user,
or a banking application generates an overdraft notification to
prevent the user from making additional purchases.</p>

<p>The output for each interpretation component may be expressed
using EMMA, a language for representing the semantics or meaning of
data. Either the user or the system may create information
that may be routed directly to the interaction manager without
being encoded in EMMA. For example, audio is recorded for later
replay, or a sequence of keystrokes is captured during the creation
of a macro.</p>
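
<p>As a sketch of what an interpretation expressed in EMMA might
look like, consider the "yes" example from the interpretation
component above. EMMA was still under development when this Note
was written, so the element and attribute names below are
illustrative, and the application-specific result element is a
placeholder.</p>

<pre class="example">
&lt;!-- illustrative only: EMMA was a working draft at publication time --&gt;
&lt;emma:interpretation id="int1" emma:mode="speech"
    xmlns:emma="http://www.w3.org/2003/04/emma"&gt;
  &lt;!-- the user actually said "sure" --&gt;
  &lt;answer&gt;yes&lt;/answer&gt;
&lt;/emma:interpretation&gt;
</pre>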
<h2 id="s4">4. Output Components</h2>

<p>Figure 3 illustrates the components within the output
component.</p>

<p class="fig"><img src="fig3.png" width="534" height="391"
alt="Use of EMMA to drive output modes" /></p>

<ul>
<li>
<p><em>Generation component</em> — The generation component
determines which output mode or modes will be used for presenting
information from the interaction manager to the user. The
generation component may select a single output mode or it may
select complementary or supplementary modes. The "internal
representation" language used to describe the output from the
generation component is under discussion by the working group.</p>
</li>
</ul>
<p>Information from the interaction manager may be routed directly
to the appropriate rendering device without being encoded in an
internal representation. For example, recorded audio is sent
directly to the audio system.</p>

<ul>
<li>
<p><em>Styling component</em> — This component adds
information about how the information is laid out. For example,
the styling component for a display specifies how graphical objects
are positioned on a canvas, while the styling component for audio
may insert pauses and voice inflections into text which will be
rendered by a speech synthesizer. Cascading Style Sheets (CSS)
could be used to modify voice output.</p>
</li>

<li>
<p><em>Rendering component</em> — The rendering component
converts the information from the styling component into a format
that is easily understood by the user. For example, a graphics
rendering component displays a vector of points as a
curved line, and a speech synthesis system converts text into
synthesized voice.</p>
</li>
</ul>
<p>Each of the output modes has both a styling and rendering
component.</p>

<p>The voice styling component constructs text strings containing
Speech Synthesis Markup Language tags describing how the words
should be pronounced. This is converted to voice by the voice
rendering component. The voice styling component may also select
prerecorded audio files for replay by the voice rendering
component.</p>
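
<p>For instance, the voice styling component might wrap its text in
Speech Synthesis Markup Language tags along the following lines.
The prompt text is illustrative, and the element names follow the
W3C SSML drafts of this period.</p>

<pre class="example">
&lt;speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US"&gt;
  Your reservation is confirmed.
  &lt;break time="500ms"/&gt;
  &lt;emphasis&gt;Please arrive thirty minutes early.&lt;/emphasis&gt;
&lt;/speak&gt;
</pre>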
<p>The graphics styling component creates XHTML, XHTML Basic, or
<a href="http://www.w3.org/TR/SVG/">SVG</a> markup tags describing
how the graphics should be rendered. The graphics rendering
component converts the output from the graphics styling component
into graphics displayed to the user.</p>

<p>Other pairs of styling and rendering components are possible for
other output modes.
<a href="http://www.w3.org/AudioVideo/">SMIL</a> may be used for
coordinated multimedia output.</p>
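
<p>A coordinated audio-visual presentation might be expressed in
SMIL along the following lines; the media file names and the region
name are hypothetical.</p>

<pre class="example">
&lt;!-- play the prompt audio while the map is shown --&gt;
&lt;par&gt;
  &lt;audio src="prompt.wav"/&gt;
  &lt;img src="map.png" region="main" dur="5s"/&gt;
&lt;/par&gt;
</pre>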
<h2 id="s5">5. Specification of input and output components</h2>

<p>This section describes how the input and output components of
Sections 3 and 4 are specified. In brief, input and output
components of the user interface will be specified as DOM objects
that expose interfaces pertaining to that object's functionality.
This enables the modality objects to be accessed and
manipulated in the interaction management environments described in
<a href="#s6">Section 6</a>.</p>

<p>(The use of the term "object" in this section is intended in the
sense of "object" as used in the Document Object Model, and is not
intended to imply a particular class or object hierarchy.)</p>

<h3 id="s5.1">5.1 Encapsulated interfaces based on DOM</h3>

<p>User interface components make their functionality available to
interaction managers through a set of interfaces, and can be
considered as receiving values from and returning values to the
host environment. Here, values can be simple or complex types, and
components can specify the location for binding the received data,
perhaps using XPath, which is W3C's language for addressing parts
of an XML document, and was originally designed to be used by both
XSLT and XPointer. The set of interfaces will be built on DOM, and
thereby provide an object model for realizing the functionality of
a given modality.</p>

<p>The functionality of a user interface component can therefore
usefully be encapsulated in a programming-language-independent
manner into an <span>object</span> exposing the following kinds of
features:</p>
<ul>
<li>a set of <b>properties</b> (e.g. presentation parameters or
input constraints);</li>

<li>a set of <b>methods</b> (e.g. begin playback or recognition);
and</li>

<li>a set of <b>events</b> raised by the component (e.g. mouse
clicks, speech events).</li>
</ul>
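
<p>To make this concrete, a purely hypothetical XML binding for a
speech input object might surface these three kinds of features as
follows. The element and attribute names are invented for
illustration and are not taken from any W3C specification.</p>

<pre class="example">
&lt;!-- hypothetical binding: "grammar" is a property; "onresult" and
     "onerror" bind events raised by the component --&gt;
&lt;speechinput id="city" grammar="cities.grxml"
             onresult="handleCity(event)" onerror="handleError(event)"/&gt;

&lt;!-- a method might be invoked from script, for example
     document.getElementById('city').start() to begin recognition --&gt;
</pre>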
<p>The DOM defines a platform-neutral and
programming-language-neutral interface to documents, their
structure and their content. The <span>user interface
objects</span> extend this model by adding modality-specific
interfaces. In this way, <span>user interface objects</span> can
define abstract interfaces which are usable across different host
environments.</p>

<p>In multimodal applications, multiple user interface components
are controlled and coordinated individually by the interaction
manager.</p>

<p>User interface <span>objects</span> should follow certain
guidelines to integrate into the multimodal framework:</p>

<ul>
<li>adhere to the principles of encapsulation, that is, the
features of a given modality should relate only to the modality in
question;</li>

<li>adopt common or recommended interfaces where possible;</li>

<li><span>in order to ensure that the framework is sufficiently
general to accommodate both local and distributed
architectures,</span> avoid blocking calls and threading
issues;</li>

<li><span>consider what kinds of message exchange patterns are
needed, for instance, publish/subscribe, broadcast, and
specifically addressed messages. This is also an important
consideration for ensuring that the framework is neutral with
respect to local and distributed architectures.</span></li>
</ul>

<p>In general, the formalization of features into properties,
methods and events should not be taken to imply that the
manipulation of the interface can take place only in local DOM
architectures. It is the intention of this design that modality
interfaces should remain agnostic to component architectures where
possible. So the object feature definitions should be considered as
abstract indications of functionality, the uses of which will
probably differ according to architectural considerations (for
example, property setting may take different forms, and
implementation mechanisms for event dispatch and handling are not
addressed here).</p>

<h3 id="s5.2">5.2 Interface formalization</h3>

<p>Each <span>user interface object</span> will specify a set of
interfaces in terms of properties, events and methods, using a
formal interface definition language. Bindings into XML, ECMAScript
and other programming languages will also be defined.</p>

<p>In addition to formal definition of markup and DOM interfaces, a
description of the execution model of the <span>user interface
object</span> will be defined, that is, the behaviour of the
<span>object</span> when used. Further, a <span>user interface
object</span> should also describe how <span>it</span> is
controlled in different interaction management environments, for
example, those which support:</p>

<ul>
<li>limited environments without programmatic capabilities;</li>

<li>XHTML and its flavours, including scripting, DOM eventing,
XForms, etc.;</li>

<li>SMIL;</li>

<li>HTML;</li>

<li>SVG;</li>

<li>etc.</li>
</ul>

<p>As work proceeds on the definition of individual modality
interfaces, sufficient commonality of features may be found such
that it is desirable to standardize in some way those features
across different modalities. As such, the MMI group will
investigate the possibilities for establishing a set of common
interfaces that may be shared among all relevant modalities.</p>
<h2 id="s6">6. Specification of interaction management
component</h2>

<h3 id="s6.1">6.1 Host Environments for interaction management</h3>

<p>The interaction manager is a logical component. The interaction
manager is contained in the host environment that hosts interface
objects. Interface objects influence one another by interacting
with the Host Environment. A host environment provides data
management and flow control to its hosted interface objects. Some
languages that may be candidates as Host Environment languages
include <a href="http://www.w3.org/Graphics/SVG/">SVG</a>,
<a href="http://www.w3.org/MarkUp/">XHTML</a> (possibly
<a href="http://www.w3.org/MarkUp/">XHTML</a> +
<a href="http://www.w3.org/MarkUp/Forms/">XForms</a>), and
<a href="http://www.w3.org/AudioVideo/">SMIL</a>.</p>

<p>A Host Environment's hosted interface objects may range from the
simple to the complex. Authors will be able to specify the
interface object components through a mixture of markup, scripting,
style sheets, or any other resources supported by their Host
Environment's functionality. The Host Environment design makes
possible architectures where the interface objects may each have
their own thread of execution, independent of the context of the Host
Environment. The design also supports each component communicating
asynchronously with the Host Environment (however, familiarity with
synchronization primitives such as mutexes will not be required to
successfully author multimodal documents).</p>

<p>In some architectures, it is possible to have a hierarchical
composition of Host Environments, similar in spirit to Russian
nesting dolls. Different aspects of interaction management may be
handled at different levels of the hierarchy. For example,
"barge-in", where speech output is cut off on the basis of user
input, is an interaction management mechanism that may be handled by
one lower-level Environment that just hosts the basic speech input
and speech output objects, while a different higher-level Host
Environment coordinates the multimodal application. Hierarchical
interaction management also enables the delegation of complex input
tasks to lower levels of the hierarchy. As an example, a date
dialog might encapsulate the necessary interaction management logic
needed to produce appropriate tapered prompts, error handling, and
other dialog constructs to eventually collect a valid date. This
form of nesting enables the creation of hierarchical interaction
management that reflects the task hierarchy within the overall
application.</p>
<h2 id="s7">7. Session Component</h2>

<p>An important goal of the W3C Multimodal Interaction Framework is
to provide a simplified approach for authoring multimodal
applications whether on a single system/user or distributed across
multiple systems/users. The framework is architecture neutral, and
abstractly relies on passing messages between the various framework
components. The session component provides a means to simplify the
author's view of how resources are identified in terms of source
and destination of such messages. The session component is
particularly important for distributed applications involving more
than one device and/or user. It hides the details of the resource
naming schemes and protocols used and provides a high-level
interface for requesting and releasing resources taking part in the
session.</p>
<h3 id="s7.1">7.1 Functions of Session Component</h3>

<h4 id="s7.1.1">7.1.1 Session as basis of state replication and
synchronization</h4>

<p>The session component can be used for replicating state across
devices, or across processes within the same device. Consider a
graphical interface running on a handheld device coupled to a voice
interface running in the network. The user can choose to navigate
or enter data using the device keypad or using speech. When filling
out a form, this gives two ways to update a field's value. The
session provides a scope for the replication mechanism and provides
a way to keep multiple modes in sync.</p>

<h4 id="s7.1.2">7.1.2 Temporary/Persistent Sessions</h4>

<p>For certain applications the session is short-lived. In these
cases the same session may last for a single page or for several
pages as the user navigates through the application, for example
when visiting a web site. This makes it practical to retain state
information for the duration of the application. For applications
that involve persistent sessions, such as meeting rooms or multiplayer
games, there is a need for session management, and a means to
locate, join and leave such sessions.</p>

<h4 id="s7.1.3">7.1.3 Simplifying Applications</h4>

<p>In a distributed environment there are several ways to identify
a resource. The session component provides a means to query
descriptions of resources, including the type of the resource, what
properties the resource has, and what interfaces it supports.</p>
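
<p>The shape of such a query has not been standardized. A purely
hypothetical message exchange between the interaction manager and
the session component might look as follows, with all element and
attribute names invented for illustration.</p>

<pre class="example">
&lt;!-- request: describe a resource taking part in session mtg-42 --&gt;
&lt;query session="mtg-42" resource="user:alice"/&gt;

&lt;!-- response: the resource type, properties and supported interfaces --&gt;
&lt;description resource="user:alice" type="participant"&gt;
  &lt;property name="online"&gt;true&lt;/property&gt;
  &lt;interface name="messaging"/&gt;
&lt;/description&gt;
</pre>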
<h3 id="s7.2">7.2 Use Cases</h3>

<p>The following use cases provide the basis for defining the session
component:</p>

<ul>
<li><b>Mobile Devices with sequential capability</b></li>
</ul>

<p>Devices with limited capability provide a good example of the
importance of a session component. Sequential multimodality
allows the user to experience multiple modes, but only one mode at a
time. In such a scenario the user has to switch between modes to
experience multiple modes. Consider an application where the user is
filling out a form using voice as the input mode, since voice is
the preferred or easier mode for providing input. After the user has
provided the input, the application saves the form fields in a
session object and switches the mode to visual. In visual mode the
application retrieves the values from the session and uses the form
fields for further processing. An example of such an application would
be a driving directions application, where the user provides source
and destination using voice mode and then switches to
visual mode to see the directions.</p>

<ul>
<li><b>Form filling</b></li>
</ul>

<p>Form filling presents another use case for a session component,
especially when partial information is filled in using the keypad
attached to the device and partial information is filled in using
speech processed at a speech server in the network. For example,
in an airline reservation system the user can provide the date of
travel by clicking on appropriate dates in the calendar, and provide
source and destination using speech, which is processed in the
network. A session component helps in synchronizing the input
provided in either mode and provides the filled form information back
to the application.</p>

<ul>
<li><b>Meeting Rooms</b></li>
</ul>

<p>The session in this case is persistent, and users join and leave the
session during the application. A session component allows a user
to query the session environment. A session environment would
consist of the resources and the values of the attributes in the
resources. In the case of a meeting room application, the user can
query: i) who else is in the meeting room; and ii) information about a
particular member in the meeting room, e.g. contact information and
whether the member is online. The resources that the application
wants its users to share are stored, and proper interfaces are
provided to access the attributes of each resource.</p>

<ul>
<li><b>Multiple Device Applications</b></li>
</ul>

<p>For multimodal applications running across multiple devices, the
session component can play an important role in the synchronization
of state across the devices. For example, a user may be running an
application while sitting in a car, using a device attached to the
car. The user gets out of the car, goes to his office, and wants to
continue on his laptop with the application that he was running in
the car. The session component provides interfaces for saving the
state of the whole application on one device and reinstating the
whole state on another device. A few examples of such
applications could be video conferencing, online shopping, airline
reservations, etc. For example, in an airline reservation system, the
user selects the itinerary while he is still in the car. The user
then gets out of the car and buys the same ticket using his laptop in
his office.</p>
<h2 id="s8">8. System and Environment Component</h2>

<p>The <a href="http://www.w3.org/TR/mmi-reqs/">W3C Multimodal
Interaction Requirements</a> call for developers to
be able to create applications that
<a href="http://www.w3.org/TR/mmi-reqs/#Deliveryandcontext">dynamically
adapt</a> to changes in device capabilities, user preferences and
environmental conditions. The multimodal interaction framework must
allow the interaction manager to determine what information is
available, as this will be system dependent. In addition, the
framework must support stand-alone as well as distributed scenarios
involving multiple devices and multiple users (see
<a href="#s7">Section 7</a> for more details).</p>

<p>It is expected that the system and environment component will
make use of the work of the
<a href="http://www.w3.org/2001/di/">W3C Device Independence
activity</a>, in particular the
<a href="http://www.w3.org/Mobile/CCPP/">CC/PP</a> language, whose
aim is to standardize ways of expressing device features and
settings, and to describe how they are transmitted between
components. Profiles regarding multimodal-specific properties, such
as those listed below, are expected to be defined in accordance with
the <a href="http://www.w3.org/TR/CCPP-struct-vocab/">CC/PP
Structure and Vocabularies specification</a>.</p>
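
<p>A CC/PP profile is expressed in RDF/XML. The following fragment
sketches how a device might report display and audio capabilities
to the System and Environment component; the multimodal vocabulary
(the "mmi" properties) is hypothetical, since such profiles remain
to be defined.</p>

<pre class="example">
&lt;rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ccpp="http://www.w3.org/2002/11/08-ccpp-schema#"
         xmlns:mmi="http://example.org/mmi-vocabulary#"&gt;
  &lt;rdf:Description rdf:about="http://example.org/profile#Device"&gt;
    &lt;ccpp:component&gt;
      &lt;rdf:Description rdf:about="http://example.org/profile#Terminal"&gt;
        &lt;!-- the "mmi" property names are invented for this sketch --&gt;
        &lt;mmi:displayWidth&gt;176&lt;/mmi:displayWidth&gt;
        &lt;mmi:displayHeight&gt;208&lt;/mmi:displayHeight&gt;
        &lt;mmi:colorCapable&gt;yes&lt;/mmi:colorCapable&gt;
        &lt;mmi:audioInputMuted&gt;no&lt;/mmi:audioInputMuted&gt;
      &lt;/rdf:Description&gt;
    &lt;/ccpp:component&gt;
  &lt;/rdf:Description&gt;
&lt;/rdf:RDF&gt;
</pre>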
<h3 id="s8.1">8.1 Use Case Scenarios</h3>

<p>To illustrate the component's functionality, it is worth
considering the following few use case scenarios:</p>

<ul>
<li>
<p><b>Mobile</b> devices typically have limited capabilities and
resources, so that applications need to be tailored to the
specifics of the device. For example, many mobile phones have small
monochrome displays, while others have rich, fast color displays.
The following are typical characteristics of mobile devices that
can be provided to the Interaction Manager through the System and
Environment component:</p>

<ul>
<li>
<p><b>Location</b> information can be provided by an increasing
number of mobile devices. Typically this information is derived
from cell quadrant (cellular radio networks), GPS satellite data or
dead reckoning based on motion sensors. The Location
Interoperability Forum — now part of the
<a href="http://www.openmobilealliance.org/lif/">Open Mobile
Alliance</a> — has been responsible for much of the work on
this to date. Location-based services (LBS) provide time-stamped
location data of varying accuracy; in some circumstances this can
be to within a few meters. This information can be provided upon
request at sub-second intervals. Multimodal applications can use
such information to orient maps and to provide geographically
relevant information.</p>
</li>

<li>
<p><b>Signal strength</b> provides information on network
connectivity as well as the quality of service that can be
provided. As signal strength decreases, a multimodal application
could adapt accordingly. This could be as simple as switching to an
alternative low-bandwidth mode of communication.</p>
</li>

<li>
<p><b>Aural noise level</b> for mobile devices is an important
consideration because of the variety of situations where the device
can be used, for example, noise from passing vehicles, other people
talking nearby, or loud music. Speech recognition can be tailored
based on noise levels returned by the System and Environment
component.</p>
</li>

<li>
<p><b>Battery level</b> provides information on the remaining
operational time. Such a notification to the Interaction Manager is
particularly relevant to small untethered devices where power
consumption is critical.</p>
</li>
</ul>
</li>

<li>
<p><b>Automotive</b> — Multimodality is typically an on-board
capability that senses the local environment to determine what
services can be adapted to the driver's situation, for example:</p>

<ul>
<li>
<p><b>Aural noise level</b> within the car can be generated and
modified by numerous environmental factors, for example driving with
the windows down, radio volume, the AC/fan on/off or windscreen
wipers on/off. Environmental conditions of the vehicle, controlled
by the driver, can be notified via the System and Environment component
to the Interaction Manager to adapt the speech recognition.</p>
</li>

<li>
<p><b>In gear</b> notifications could provide information on the
driver's ability to use a touch screen in a multimodal application.
In addition, there are legal ramifications associated with the
driver operating devices whilst the vehicle is in motion. Therefore
the general behavior of a multimodal application may need to adapt
according to whether the vehicle is parked or "in-drive".</p>
</li>

<li>
<p><b>GPS</b> notifications are an important feature of an on-board
multimodal navigation system. The update frequency and accuracy of
updates are higher than for typical LBS mobile services (see the
Mobile scenario above).</p>
</li>
</ul>
</li>

<li>
<p><b>Desktop</b> — Multimodal applications can be tailored
to the user's preferences. These choices can be dynamic or static,
for example:</p>

<ul>
<li>
<p><b>Static user preferences</b> — the default volume
setting, the rate in words per minute for playing text to speech, or a
general preference for using speech rather than a keyboard. People
with visual impairments may opt for easy-to-see large-print text
and high-contrast color themes.</p>
</li>

<li>
<p><b>Dynamic preferences</b> — the user may suddenly mute
audio output, or switch from speech to pen input, and expect the
application to adapt accordingly. The application itself may
monitor the user's progress, and react appropriately, for
example, prompting the user to use a pen after successive failures
with speech recognition.</p>
</li>
</ul>
</li>
</ul>
<h3 id="s8.2">8.2 System and Environment Component Categories</h3>

<p>The above examples give a general indication of the
functionality that the System and Environment component offers as a
means for enabling applications to be tailored to adapt to device
capabilities, user preferences and environmental conditions.</p>

<ul>
<li>
<p><b>Environmental</b> conditions can be monitored and reported
to the Interaction Manager. One way to look at these
characteristics is to inspect interference channels:</p>

<ul>
<li>
<p><b>Auditory</b></p>

<ul>
<li>
<p><b>Environment too noisy for listening</b> — the
application should adapt to this change to provide a better
experience.</p>
</li>

<li>
<p><b>A speaker system/headphone attached?</b> A speaker system
allows the user to see the screen as well as listen at the same
time.</p>
</li>

<li>
<p><b>Car environment factors</b> — radio on/off, radio volume,
AC/fan on/off, windscreen wipers on/off, windows up/down.</p>
</li>
</ul>
</li>

<li>
<p><b>Visual</b></p>

<ul>
<li>
<p><b>Whether gesture recognition is possible.</b> The user should
be able to see the sensor for a gesture-based application.
Moreover, if the user cannot see the device, then audio becomes the
predominant mode of communication and the application should adapt
to it.</p>
</li>
</ul>
</li>

<li>
<p><b>Tactile</b></p>

<ul>
<li>
<p><b>Pen</b> — large or small, or a finger being used as a
tactile input device.</p>
</li>
</ul>
</li>
</ul>
</li>

<li>
<p><b>System</b> notifications can be derived from numerous
environmental sources, particularly within mobile and automotive
applications. Notifications from the System and Environment
component to the Interaction Manager can range from GPS location
information to the fact that the laptop has been closed. Many of
these system notifications indicate that the application should
switch to an alternative mode of operation.</p>
</li>

<li>
<p><b>User preferences</b> help with tailoring the application to
the user. These characteristics are most apparent in rich
multimodal scenarios such as the desktop, where resources are less
of an issue (large screens and fast CPUs). Preferences can be
modified to best suit user choices. Furthermore, it is possible to
dynamically adapt to the user's preferences over time.</p>
</li>
</ul>
<h2 id="s9">9. Illustrative Use Case</h2>

<p>To illustrate the component markup languages of the W3C
Multimodal Interaction Framework, consider this simple use case.
The human user points to a position on a displayed map and speaks:
"What is the name of this place?" The multimodal interaction system
responds by speaking "Lake Wobegon, Minnesota" and displays the
text "Lake Wobegon, Minnesota" on the map. The following summarizes
the actions of the relevant components of the W3C Multimodal
Interaction Framework:</p>

<p><em>Human user</em> — Points to a position on a map and
says, "What is the name of this place?"</p>

<p><em>Speech recognition component</em> — Recognizes the
words "What is the name of this place?"</p>

<p><em>Mouse recognition component</em> — Recognizes the x-y
coordinates of the position to which the user pointed on a map.</p>

<p><em>Speech interpretation component</em> — Converts the
words "What is the name of this place?" into an internal
notation.</p>

<p><em>Pointing interpretation component</em> — Converts the
x-y coordinates of the position to which the user pointed into an
internal notation.</p>

<p><em>Integration component</em> — Integrates the internal
notation for the words "What is the name of this place?" with the
internal notation for the x-y coordinates.</p>
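
<p>If the internal notation is EMMA, the integrated result might
resemble the following sketch. As noted earlier, EMMA was still
being defined, so the element names and the feature names ("action",
"location") are illustrative only.</p>

<pre class="example">
&lt;!-- speech and pointing combined into one interpretation --&gt;
&lt;emma:interpretation id="int1"
    xmlns:emma="http://www.w3.org/2003/04/emma"&gt;
  &lt;action&gt;identify-place&lt;/action&gt;
  &lt;location&gt;
    &lt;x&gt;212&lt;/x&gt;
    &lt;y&gt;147&lt;/y&gt;
  &lt;/location&gt;
&lt;/emma:interpretation&gt;
</pre>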
<p><em>Interaction manager component</em> — Stores the
internal notation in the session object. Converts the request to a
database request, and submits the request to a database management
system, which returns the value "Lake Wobegon, Minnesota". Adds
the response to the internal notation in the session object. The
interaction manager converts the response into an internal notation
and sends the response to the generation component.</p>

<p><em>Generation component</em> — Accesses the Environment
component to determine that voice and graphics modes are available.
Decides to present the result in two complementary modes, voice and
graphics. The generation component sends internal notation
representing "Lake Wobegon, Minnesota" to the voice styling
component, and sends internal notation representing the location of
Lake Wobegon, Minnesota on a map to the graphics styling
component.</p>

<p><em>Voice styling component</em> — Converts the internal
notation representing "Lake Wobegon, Minnesota" into SSML.</p>

<p><em>Graphics styling component</em> — Converts the
internal notation representing the "Lake Wobegon, Minnesota"
location on a map into HTML notation.</p>

<p><em>Voice rendering component</em> — Converts the SSML notation
into acoustic voice for the user to hear.</p>

<p><em>Graphics rendering component</em> — Converts the HTML notation
into visual graphics for the user to see.</p>
<h2 id="s10">10. Examples of Architectures Consistent with
the W3C Multimodal Interaction Framework</h2>

<p>There are many possible multimodal architectures that are
consistent with the W3C multimodal interaction framework. These
multimodal architectures have the following properties:</p>

<p>Property 1. THE MULTIMODAL ARCHITECTURE CONTAINS A SUBSET OF THE
COMPONENTS OF THE W3C MULTIMODAL INTERACTION FRAMEWORK. A
<em>multimedia architecture</em> contains two or more output modes.
A <em>multimodal architecture</em> contains two or more input
modes.</p>

<p>Property 2. COMPONENTS MAY BE PARTITIONED AND COMBINED. The
functions within a component may be partitioned into several
modules within the architecture, and the functions within two or
more components may be combined into a single module within the
architecture.</p>

<p>Property 3. THE COMPONENTS ARE ALLOCATED TO HARDWARE DEVICES. If
all components are allocated to the same hardware device, the
architecture is said to be a <em>centralized architecture</em>. For
example, a PC containing all of the selected components has a
centralized architecture. A <em>client-server architecture</em>
consists of two types of devices: several client devices containing
many of the input and output components, and the server, which
contains the remaining components. A <em>distributed
architecture</em> consists of multiple types of devices connected
by a communication system.</p>

<p>Property 4. THE COMMUNICATION SYSTEMS ARE SPECIFIED. Designers
specify the protocols for exchanging messages among hardware
devices.</p>

<p>Property 5. THE DIALOG MODEL IS SPECIFIED. Designers specify how
modules are invoked and terminated, and how they interpret input to
produce output.</p>

<p>The following examples illustrate architectures that conform to
the W3C multimodal interaction framework.</p>
<h3 id="s10.1">Example 1: Driving Example (Figure 4)</h3>

<p>In this example, the user wants to go to a specific address from
his current location, and while driving wants to take a detour to a
local restaurant (the user knows neither the restaurant's address
nor its name). The user initiates service via a button on his steering
wheel and interacts with the system via the touch screen and
speech.</p>

<p>Property 1. The driving architecture contains the components
illustrated in Figure 4: a graphical display, map database, voice
and touch input, speech output, local ASR, TTS processing, and
GPS.</p>

<p>Property 2. No components are partitioned or combined, with the
possible exception of the integration and interaction manager
components, and the generation and interaction components. There
are two possible configurations, depending upon whether the
integration component is stand-alone or combined with the
interaction manager component:</p>

<ul>
<li>
<p>Information entered by the user may be encoded into EMMA
(Extensible MultiModal Annotation Markup Language, formerly known
as the Natural Language Semantic Markup Language) and combined by
an integration component (shown within the dotted rectangle in
Figure 4) which is separate from the interaction manager.</p>
</li>

<li>
<p>Information entered by the user may be recognized and
interpreted and then routed directly to the interaction manager,
which performs its own integration of user information.</p>
</li>
</ul>

<p>There are two possible configurations, depending upon whether
the generation component is stand-alone or combined with the
interaction manager component:</p>

<ul>
<li>
<p>Information from the interaction manager may be routed to the
generation component, where multiple modes of output are generated
and the appropriate synchronization control created.</p>
</li>

<li>
<p>Information may be routed directly to the styling components
and then on to the rendering components. In this case, the
interaction manager does its own generation and
synchronization.</p>
</li>
</ul>

<p>Property 3. All components are allocated to a single client-side
hardware device onboard the car. In Figure 4, the client is
illustrated by a pink box containing all of the components.</p>

<p>Property 4. No communication system is required in this
centralized architecture.</p>

<p>Property 5. Dialog Model: The user wants to go to a specific
address from his current location, and while driving wants to take a
detour to a local restaurant (the user does not know the
restaurant name or address). The user initiates service via a
button on his steering wheel and interacts with the system via the
touch screen and speech.</p>

<p class="fig"><img src="fig4.png" width="559" height="539"
alt="Figure 4: Driving Example" /></p>
<h3 id="s10.2">Example 2: Name dialing (Figure 5)</h3>

<p>The Name dialing example enables a user to initiate a call by
saying the name of the person to be contacted. Visual and spoken
dialogs are used to narrow the selection, and to allow an exchange
of multimedia messages if the called person is unavailable. Call
handling is determined by a script provided by the called person.
The example supports the use of a combination of local and remote
speech recognition.</p>

<p>Property 1: The architecture contains a subset of the components
of the W3C Multimodal Interaction Framework.</p>

<p>Property 2: No components have been partitioned or combined, with
the possible exception of the integration component and interaction
component, and the generation component and the interaction
component (as discussed in example 1).</p>

<p>Property 3. The components in pink are allocated to the client
and the components in green are allocated to the server. Note that
the speech recognition and interpretation components are on both
client and server. The local ASR recognizes basic control commands
based upon the ETSI DES/HF-00021 standardized command and control
vocabulary, and the remote ASR recognizes names of individuals the
user wishes to dial. (The vocabulary of names is too large to
maintain on the client, so it is maintained on the server.)</p>

<p>Property 4. The communication system is
<a href="http://www.ietf.org/html.charters/sip-charter.html">SIP</a>,
the Session Initiation Protocol: a means for initiating
communication sessions involving multiple devices, and for control
signaling during such sessions.</p>

<p>Property 5. Navigational and control commands are recognized by
the ASR on the client. When the user says "call John Smith," the
ASR on the client recognizes the command "call" and transfers the
following information ("John Smith") to the server for recognition.
The application on the server then connects the user with John
Smith's telephone.</p>

<p class="fig"><img src="fig5.png" width="557" height="545"
alt="Figure 5: Name Dialing Example" /></p>
<h3 id="s10.3">Example 3: Form fill-in (Figure 6)</h3>

<p>In the Form fill-in example, the user wants to make a flight
reservation with his mobile device while he is on the way to work.
The user initiates the service by means of making a phone call to a
multimodal service (telephone metaphor) or by selecting an application
(portal environment metaphor). The dialogue between the user and
the application is driven by a form-filling paradigm where the user
provides input to fields such as "Travel Origin:", "Travel
Destination:", "Leaving on date", "Returning on date". As the user
selects each field in the application to enter information, the
corresponding input constraints are activated to drive the
recognition and interpretation of the user input.</p>

<p>Property 1: The architecture contains a subset of the components
of the W3C Multimodal Interaction Framework, including GPS and
Ink.</p>

<p>Property 2: The speech recognition component has been
partitioned into two components, one of which will be placed on the
client and the other on the server. The integration component and
interaction component, and the generation component and the
interaction component, may be combined or left separate (as
discussed in example 1).</p>

<p>Property 3. The components in pink are allocated to the client
and the components in green are allocated to the server. Speech
recognition is distributed between the client and the server, with
the feature extraction on the client and the remaining speech
recognition functions performed on the server.</p>

<p>Property 4. The communication system is
<a href="http://www.ietf.org/html.charters/sip-charter.html">SIP</a>,
the Session Initiation Protocol: a means for initiating
communication sessions involving multiple devices, and for control
signaling during such sessions.</p>

<p>Property 5. Dialog Model: The user wants to make a flight
reservation with his mobile device while he is on the way to work.
The user initiates the service by means of making a phone call to
a multimodal service (telephone metaphor) or by selecting an
application (portal environment metaphor). The dialogue between the
user and the application is driven by a form-filling paradigm where
the user provides input to fields such as "Travel Origin:", "Travel
Destination:", "Leaving on date", "Returning on date". As the user
selects each field in the application to enter information, the
corresponding input constraints are activated to drive the
recognition and interpretation of the user input. The capability of
providing composite multimodal input is also examined, where input
from multiple modalities is combined for the interpretation of the
user's intent.</p>

<p class="fig"><img src="fig6.png" width="558" height="547"
alt="Figure 6: Form Fill-in Example" /></p>
</body>
</html>