<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="HTML Tidy for Linux/x86 (vers 1st April 2003), see www.w3.org"
name="generator" />
<title>W3C Multimodal Interaction Framework</title>
<style type="text/css">
/*<![CDATA[*/
body {
margin-left: 8%;
margin-right: 5%;
background-color: white;
font-family: Trebuchet, Arial, sans-serif
}
h1 { margin-left: -4%; color: rgb(0,92,160) }
h2 { margin-left: -4%; color: rgb(0,92,160) }
h3 { margin-left: 0% }
p.fig {text-align: center}
.c1 { display: none }
.old { text-decoration: line-through }
.new { font-style: italic; color: green }
.note { font-style: italic; color: red }
p.example { margin-left: 10% }
/*]]>*/
</style>
<link rel="stylesheet" type="text/css"
href="http://www.w3.org/StyleSheets/TR/W3C-NOTE" />
</head>

<body>
<div class="head">
<p><a href="http://www.w3.org/"><img height="48" alt="W3C"
src="http://www.w3.org/Icons/w3c_home" width="72" /></a></p>

<h1 class="notoc" id="name">W3C Multimodal Interaction
Framework</h1>

<h2>W3C NOTE 06 May 2003</h2>

<dl>
<dt>This version:</dt>
<dd>
<a href="http://www.w3.org/TR/2003/NOTE-mmi-framework-20030506/">http://www.w3.org/TR/2003/NOTE-mmi-framework-20030506/</a></dd>

<dt>Latest version:</dt>
<dd>
<a href="http://www.w3.org/TR/mmi-framework/">http://www.w3.org/TR/mmi-framework/</a></dd>

<dt>Previous version:</dt>
<dd>
<a href="http://www.w3.org/TR/2002/NOTE-mmi-framework-20021202/">http://www.w3.org/TR/2002/NOTE-mmi-framework-20021202/</a></dd>

<dt>Editors:</dt>
<dd>James A. Larson, Intel</dd>
<dd>T.V. Raman, IBM</dd>
<dd>Dave Raggett, W3C &amp; Canon</dd>

<dt>Contributors:</dt>
<dd>Michael Bodell, Tellme Networks</dd>
<dd>Michael Johnston, AT&amp;T</dd>
<dd>Sunil Kumar, V-Enable Inc.</dd>
<dd>Stephen Potter, Microsoft</dd>
<dd>Keith Waters, France Telecom</dd>
</dl>

<p class="copyright"><a href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">
Copyright</a> © 2003 <a href="http://www.w3.org/"><acronym
title="World Wide Web Consortium">W3C</acronym></a><sup>®</sup> (<a
href="http://www.lcs.mit.edu/"><acronym
title="Massachusetts Institute of Technology">MIT</acronym></a>, <a
href="http://www.ercim.org/"><acronym
title="European Research Consortium for Informatics and Mathematics">ERCIM</acronym></a>,
<a href="http://www.keio.ac.jp/">Keio</a>), All Rights Reserved. W3C
<a href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">liability</a>,
<a href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">trademark</a>,
<a href="http://www.w3.org/Consortium/Legal/copyright-documents">document use</a>
and <a href="http://www.w3.org/Consortium/Legal/copyright-software">software
licensing</a> rules apply.</p>

<hr title="Separator for header" />
</div>
<h2 class="notoc" id="Abstract">Abstract</h2>

<p>This document introduces the W3C Multimodal Interaction
Framework, and identifies the major components for multimodal
systems. Each component represents a set of related functions. The
framework identifies the markup languages used to describe
information required by components and for data flowing among
components. The W3C Multimodal Interaction Framework describes
input and output modes widely used today and can be extended to
include additional modes of user input and output as they become
available.</p>
<h2 id="Status">Status of this Document</h2>

<p><em>This section describes the status of this document at the
time of its publication. Other documents may supersede this
document. The latest status of this document series is maintained
at the <abbr title="World Wide Web Consortium">W3C</abbr>.</em></p>

<p>W3C's <a href="http://www.w3.org/2002/mmi/">Multimodal
Interaction Activity</a> is developing specifications for extending
the Web to support multiple modes of interaction. This document
introduces a functional framework for multimodal interaction and is
intended to provide a context for the specifications that comprise
the W3C Multimodal Interaction Framework.</p>

<p>This document has been produced as part of the
<a href="http://www.w3.org/2002/mmi/">W3C Multimodal Interaction
Activity</a>, following the procedures set out for the
<a href="http://www.w3.org/Consortium/Process/">W3C Process</a>.
The authors of this document are members of the
<a href="http://www.w3.org/2002/mmi/Group/">Multimodal Interaction
Working Group</a>
(<a href="http://cgi.w3.org/MemberAccess/AccessRequest">W3C Members
only</a>). This is a Royalty Free Working Group, as described in
W3C's <a href="/TR/2002/NOTE-patent-practice-20020124">Current
Patent Practice</a> NOTE. Working Group participants are required
to provide <a href="http://www.w3.org/2002/01/mmi-ipr.html">patent
disclosures</a>.</p>

<p>Please send comments about this document to the public mailing
list:
<a href="mailto:www-multimodal@w3.org">www-multimodal@w3.org</a>
(<a href="http://lists.w3.org/Archives/Public/www-multimodal/">public
archives</a>). To subscribe, send an email to
<a href="mailto:www-multimodal-request@w3.org">www-multimodal-request@w3.org</a>
with the word <em>subscribe</em> in the subject line
(include the word <em>unsubscribe</em> if you want to
unsubscribe).</p>

<p>A list of current W3C Recommendations and other technical
documents, including Working Drafts and Notes, can be found at
<a href="http://www.w3.org/TR/">http://www.w3.org/TR/</a>.</p>
<h2 id="intro">1. Introduction</h2>

<p>The purpose of the W3C multimodal interaction framework is to
identify and relate markup languages for multimodal interaction
systems. The framework identifies the major components for every
multimodal system. Each component represents a set of related
functions. The framework identifies the markup languages used to
describe information required by components and for data flowing
among components.</p>

<p>The W3C Multimodal Interaction Framework describes input and
output modes widely used today and can be extended to include
additional modes of user input and output as they become
available.</p>

<p><em>The multimodal interaction framework is not an
architecture</em>. The multimodal interaction framework is a level
of abstraction above an architecture. An architecture indicates how
components are allocated to hardware devices and the communication
system enabling the hardware devices to communicate with each
other. The W3C Multimodal Interaction Framework does not describe
either how components are allocated to hardware devices or how the
communication system enables the hardware devices to communicate.
See <a href="#s10">Section 10</a> for descriptions of several
example architectures consistent with the W3C multimodal
interaction framework.</p>
<h2 id="s2">2. Basic Components of the W3C Multimodal Interaction
Framework</h2>

<p>The Multimodal Interaction Framework is intended as a basis for
developing multimodal applications in terms of markup, scripting,
styling and other resources. The Framework will build upon a range
of existing W3C markup languages together with the
<a href="http://www.w3.org/DOM/">W3C Document Object Model</a>
(DOM). DOM defines interfaces whereby programs and scripts
can dynamically access and update the content, structure and style
of documents.</p>

<p>Figure 1 illustrates the basic components of the W3C multimodal
interaction framework.</p>

<p class="fig"><img src="fig1.png" width="493" height="300"
alt="I/O processors, dialog manager and application back end" /></p>
<p><em>Human user</em> — A user who enters input into the
system and observes and hears information presented by the system.
In this document, we will use the term "user" to refer to a human
user. However, an automated user may replace the human user for
testing purposes. For example, an automated "testing harness" may
replace human users for regression testing to verify that changes
to one component do not affect the user interface negatively.</p>

<p><em>Input</em> — An interactive multimodal implementation
will use multiple input modes, such as audio, speech, handwriting,
keyboarding, and other input modes. The various modes of input
will be described in <a href="#s3">Section 3</a>.</p>

<p><em>Output</em> — An interactive multimodal implementation
will use one or more modes of output, such as speech, text,
graphics, audio files, and animation. The various modes of output
will be described in <a href="#s4">Section 4</a>.</p>

<p><em>Interaction manager</em> — The interaction manager is
the logical component that coordinates data and manages execution
flow from various input and output modality component interface
objects. The input and output modality components are described
in <a href="#s5">Section 5</a>.</p>

<p>The interaction manager maintains the interaction state and
context of the application and responds to inputs from component
interface objects and changes in the system and environment. The
interaction manager then manages these changes and coordinates
input and output across component interface objects. The
interaction manager is discussed in <a href="#s6">Section
6</a>.</p>

<p>In some architectures the interaction manager may be implemented
as one single component. In other architectures the interaction
manager may be treated as a composition of lesser components.
Composition may be distributed across process and device
boundaries.</p>

<p><em>Session component</em> — The session component
(discussed in <a href="#s7">Section 7</a>) provides an interface to
the interaction manager to support state management, and temporary
and persistent sessions for multimodal applications. This is
useful in scenarios such as the following:</p>

<ul>
<li>A user is interacting with an application which runs on
multiple devices.</li>

<li>The application is session-based, e.g. a multiplayer game,
multimodal chat, or meeting room.</li>

<li>The application provides multiple modes of providing input and
receiving output.</li>

<li>The application runs on a single device and provides
multimodality by switching between modes.</li>
</ul>

<p><em>System and Environment component</em> — This component
enables the interaction manager to find out about and respond to
changes in device capabilities, user preferences and environmental
conditions. For example, which of the available modes does the user
wish to use — has the user muted audio input? The
interaction manager may be interested in the width and height of
the display, whether it supports color, and other capability and
configuration information. For more information see
<a href="#s8">Section 8</a>.</p>
<h2 id="s3">3. Input Components</h2>

<p>Figure 2 illustrates the various types of components within the
input component.</p>

<p class="fig"><img src="fig2.png" width="539" height="449"
alt="recognition, interpretation and integration of inputs" /></p>
<ul>
<li>
<p><em>Recognition component</em> — Captures natural input
from the user and translates the input into a form useful for later
processing. The recognition component may use a grammar described
by a grammar markup language (a sketch follows this list). Example
recognition components include:</p>

<ul>
<li><em>Speech</em> — Converts speech into text. The
automatic speech recognition component uses an acoustic model, a
language model, and a grammar specified using the W3C Speech
Recognition Grammar Specification or the Stochastic Language Model
(N-Gram) Specification to convert human speech into words specified
by the grammar.</li>

<li><em>Handwriting</em> — Converts handwritten symbols and
messages into text. The handwriting recognition component may use a
handwritten gesture model, a language model, and a grammar to
convert handwriting into words specified in a grammar.</li>

<li><em>Keyboarding</em> — Converts key presses into textual
characters.</li>

<li><em>Pointing device</em> — Converts button presses into
x-y positions on a two-dimensional surface.</li>
</ul>

<p>Other input recognition components may include vision, sign
language, DTMF, biometrics, tactile input, speaker verification,
handwriting identification, and other input modes yet to be
invented.</p>
</li>

<li>
<p><em>Interpretation component</em> — May further process
the results of recognition components. Each interpretation
component identifies the "meaning" or "semantics" intended by the
user. For example, many words that users utter, such as "yes,"
"affirmative," "sure," and "I agree," could be represented as
"yes."</p>
</li>

<li>
<p><em>Integration component</em> — Combines the output from
several interpretation components.</p>
</li>
</ul>
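
<p>As an illustration of the grammar markup mentioned above, the
following sketch uses the XML form of the W3C Speech Recognition
Grammar Specification to constrain a simple confirmation. The rule
name and the word list are illustrative only.</p>

<pre class="example">
&lt;!-- illustrative SRGS fragment: accepts a few affirmative phrases --&gt;
&lt;grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" root="confirm"&gt;
  &lt;rule id="confirm"&gt;
    &lt;one-of&gt;
      &lt;item&gt;yes&lt;/item&gt;
      &lt;item&gt;sure&lt;/item&gt;
      &lt;item&gt;affirmative&lt;/item&gt;
      &lt;item&gt;I agree&lt;/item&gt;
    &lt;/one-of&gt;
  &lt;/rule&gt;
&lt;/grammar&gt;
</pre>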
<p>Some or all of the functionality of this component could be
implemented as part of the recognition, interpretation, or
interaction components. For example, audio-visual speech
recognition may integrate lip movement recognition and speech
recognition as part of a lip reading component, as part of the
speech recognition component, or within a separate
integration component. As another example, the two input modes of
speaking and pointing are used in</p>

<p class="example">"put that," (point to an object), "there,"
(point to a location)</p>

<p>and may be integrated within a separate integration component or
within the interaction manager component.</p>

<p>Information generated by other system components may be
integrated with user input by the integration component. For
example, a GPS system generates the current location of the user,
or a banking application generates an overdraft notification to
prevent the user from making additional purchases.</p>

<p>The output for each interpretation component may be expressed
using EMMA, a language for representing the semantics or meaning of
data. Either the user or the system may create information
that may be routed directly to the interaction manager without
being encoded in EMMA. For example, audio is recorded for later
replay, or a sequence of keystrokes is captured during the creation
of a macro.</p>
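
<p>As a sketch of what an interpretation expressed in EMMA might
look like, consider the "yes" example from the interpretation
component above. EMMA was still under development when this Note
was written, so the element and attribute names below are
illustrative, and the application-specific result element is a
placeholder.</p>

<pre class="example">
&lt;!-- illustrative only: EMMA was a working draft at publication time --&gt;
&lt;emma:interpretation id="int1" emma:mode="speech"
    xmlns:emma="http://www.w3.org/2003/04/emma"&gt;
  &lt;!-- the user actually said "sure" --&gt;
  &lt;answer&gt;yes&lt;/answer&gt;
&lt;/emma:interpretation&gt;
</pre>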
<h2 id="s4">4. Output Components</h2>

<p>Figure 3 illustrates the components within the output
component.</p>

<p class="fig"><img src="fig3.png" width="534" height="391"
alt="Use of EMMA to drive output modes" /></p>

<ul>
<li>
<p><em>Generation component</em> — The generation component
determines which output mode or modes will be used for presenting
information from the interaction manager to the user. The
generation component may select a single output mode or it may
select complementary or supplementary modes. The "internal
representation" language used to describe the output from the
generation component is under discussion by the working group.</p>
</li>
</ul>
<p>Information from the interaction manager may be routed directly
to the appropriate rendering device without being encoded in an
internal representation. For example, recorded audio is sent
directly to the audio system.</p>

<ul>
<li>
<p><em>Styling component</em> — This component adds
information about how the information is laid out. For example,
the styling component for a display specifies how graphical objects
are positioned on a canvas, while the styling component for audio
may insert pauses and voice inflections into text which will be
rendered by a speech synthesizer. Cascading Style Sheets (CSS)
could be used to modify voice output.</p>
</li>

<li>
<p><em>Rendering component</em> — The rendering component
converts the information from the styling component into a format
that is easily understood by the user. For example, a graphics
rendering component displays a vector of points as a
curved line, and a speech synthesis system converts text into
synthesized voice.</p>
</li>
</ul>
<p>Each of the output modes has both a styling and rendering
component.</p>

<p>The voice styling component constructs text strings containing
Speech Synthesis Markup Language tags describing how the words
should be pronounced. This is converted to voice by the voice
rendering component. The voice styling component may also select
prerecorded audio files for replay by the voice rendering
component.</p>
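
<p>For instance, the voice styling component might wrap its text in
Speech Synthesis Markup Language tags along the following lines.
The prompt text is illustrative, and the element names follow the
W3C SSML drafts of this period.</p>

<pre class="example">
&lt;speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US"&gt;
  Your reservation is confirmed.
  &lt;break time="500ms"/&gt;
  &lt;emphasis&gt;Please arrive thirty minutes early.&lt;/emphasis&gt;
&lt;/speak&gt;
</pre>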
<p>The graphics styling component creates XHTML, XHTML Basic, or
<a href="http://www.w3.org/TR/SVG/">SVG</a> markup tags describing
how the graphics should be rendered. The graphics rendering
component converts the output from the graphics styling component
into graphics displayed to the user.</p>

<p>Other pairs of styling and rendering components are possible for
other output modes.
<a href="http://www.w3.org/AudioVideo/">SMIL</a> may be used for
coordinated multimedia output.</p>
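
<p>A coordinated audio-visual presentation might be expressed in
SMIL along the following lines; the media file names and the region
name are hypothetical.</p>

<pre class="example">
&lt;!-- play the prompt audio while the map is shown --&gt;
&lt;par&gt;
  &lt;audio src="prompt.wav"/&gt;
  &lt;img src="map.png" region="main" dur="5s"/&gt;
&lt;/par&gt;
</pre>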
<h2 id="s5">5. Specification of input and output components</h2>

<p>This section describes how the input and output components of
Sections 3 and 4 are specified. In brief, input and output
components of the user interface will be specified as DOM objects
that expose interfaces pertaining to that object's functionality.
This enables the modality objects to be accessed and
manipulated in the interaction management environments described in
<a href="#s6">Section 6</a>.</p>

<p>(The use of the term "object" in this section is intended in the
sense of "object" as used in the Document Object Model, and is not
intended to imply a particular class or object hierarchy.)</p>

<h3 id="s5.1">5.1 Encapsulated interfaces based on DOM</h3>

<p>User interface components make their functionality available to
interaction managers through a set of interfaces, and can be
considered as receiving values from and returning values to the
host environment. Here, values can be simple or complex types, and
components can specify the location for binding the received data,
perhaps using XPath, which is W3C's language for addressing parts
of an XML document, and was originally designed to be used by both
XSLT and XPointer. The set of interfaces will be built on DOM, and
thereby provide an object model for realizing the functionality of
a given modality.</p>

<p>The functionality of a user interface component can therefore
usefully be encapsulated in a programming-language-independent
manner into an <span>object</span> exposing the following kinds of
features:</p>
<ul>
<li>a set of <b>properties</b> (e.g. presentation parameters or
input constraints);</li>

<li>a set of <b>methods</b> (e.g. begin playback or recognition);
and</li>

<li>a set of <b>events</b> raised by the component (e.g. mouse
clicks, speech events).</li>
</ul>
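
<p>To make this concrete, a purely hypothetical XML binding for a
speech input object might surface these three kinds of features as
follows. The element and attribute names are invented for
illustration and are not taken from any W3C specification.</p>

<pre class="example">
&lt;!-- hypothetical binding: "grammar" is a property; "onresult" and
     "onerror" bind events raised by the component --&gt;
&lt;speechinput id="city" grammar="cities.grxml"
             onresult="handleCity(event)" onerror="handleError(event)"/&gt;

&lt;!-- a method might be invoked from script, for example
     document.getElementById('city').start() to begin recognition --&gt;
</pre>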
<p>The DOM defines a platform-neutral and
programming-language-neutral interface to documents, their
structure and their content. The <span>user interface
objects</span> extend this model by adding modality-specific
interfaces. In this way, <span>user interface objects</span> can
define abstract interfaces which are usable across different host
environments.</p>

<p>In multimodal applications, multiple user interface components
are controlled and coordinated individually by the interaction
manager.</p>

<p>User interface <span>objects</span> should follow certain
guidelines to integrate into the multimodal framework:</p>

<ul>
<li>adhere to the principles of encapsulation, that is, the
features of a given modality should relate only to the modality in
question;</li>

<li>adopt common or recommended interfaces where possible;</li>

<li><span>in order to ensure that the framework is sufficiently
general to accommodate both local and distributed
architectures,</span> avoid blocking calls and threading
issues;</li>

<li><span>consider what kinds of message exchange patterns are
needed, for instance, publish/subscribe, broadcast, and
specifically addressed messages. This is also an important
consideration for ensuring that the framework is neutral with
respect to local and distributed architectures.</span></li>
</ul>

<p>In general, the formalization of features into properties,
methods and events should not be taken to imply that the
manipulation of the interface can take place only in local DOM
architectures. It is the intention of this design that modality
interfaces should remain agnostic to component architectures where
possible. So the object feature definitions should be considered as
abstract indications of functionality, the uses of which will
probably differ according to architectural considerations (for
example, property setting may take different forms, and
implementation mechanisms for event dispatch and handling are not
addressed here).</p>

<h3 id="s5.2">5.2 Interface formalization</h3>

<p>Each <span>user interface object</span> will specify a set of
interfaces in terms of properties, events and methods, using a
formal interface definition language. Bindings into XML, ECMAScript
and other programming languages will also be defined.</p>

<p>In addition to formal definition of markup and DOM interfaces, a
description of the execution model of the <span>user interface
object</span> will be defined, that is, the behaviour of the
<span>object</span> when used. Further, a <span>user interface
object</span> should also describe how <span>it</span> is
controlled in different interaction management environments, for
example, those which support:</p>

<ul>
<li>limited environments without programmatic capabilities;</li>

<li>XHTML and its flavours, including scripting, DOM eventing,
XForms, etc.;</li>

<li>SMIL;</li>

<li>HTML;</li>

<li>SVG;</li>

<li>etc.</li>
</ul>

<p>As work proceeds on the definition of individual modality
interfaces, sufficient commonality of features may be found such
that it is desirable to standardize in some way those features
across different modalities. As such, the MMI group will
investigate the possibilities for establishing a set of common
interfaces that may be shared among all relevant modalities.</p>
<h2 id="s6">6. Specification of interaction management
component</h2>

<h3 id="s6.1">6.1 Host Environments for interaction management</h3>

<p>The interaction manager is a logical component. The interaction
manager is contained in the host environment that hosts interface
objects. Interface objects influence one another by interacting
with the Host Environment. A host environment provides data
management and flow control to its hosted interface objects. Some
languages that may be candidates as Host Environment languages
include <a href="http://www.w3.org/Graphics/SVG/">SVG</a>,
<a href="http://www.w3.org/MarkUp/">XHTML</a> (possibly
<a href="http://www.w3.org/MarkUp/">XHTML</a> +
<a href="http://www.w3.org/MarkUp/Forms/">XForms</a>), and
<a href="http://www.w3.org/AudioVideo/">SMIL</a>.</p>

<p>A Host Environment's hosted interface objects may range from the
simple to the complex. Authors will be able to specify the
interface object components through a mixture of markup, scripting,
style sheets, or any other resources supported by their Host
Environment's functionality. The Host Environment design makes
possible architectures where the interface objects may each have
their own thread of execution, independent of the context of the Host
Environment. The design also supports each component communicating
asynchronously with the Host Environment (however, familiarity with
synchronization primitives such as mutexes will not be required to
successfully author multimodal documents).</p>

<p>In some architectures, it is possible to have a hierarchical
composition of Host Environments, similar in spirit to Russian
nesting dolls. Different aspects of interaction management may be
handled at different levels of the hierarchy. For example,
"barge-in", where speech output is cut off on the basis of user
input, is an interaction management mechanism that may be handled by
one lower-level Environment that just hosts the basic speech input
and speech output objects, while a different higher-level Host
Environment coordinates the multimodal application. Hierarchical
interaction management also enables the delegation of complex input
tasks to lower levels of the hierarchy. As an example, a date
dialog might encapsulate the necessary interaction management logic
needed to produce appropriate tapered prompts, error handling, and
other dialog constructs to eventually collect a valid date. This
form of nesting enables the creation of hierarchical interaction
management that reflects the task hierarchy within the overall
application.</p>
<h2 id="s7">7. Session Component</h2>

<p>An important goal of the W3C Multimodal Interaction Framework is
to provide a simplified approach for authoring multimodal
applications whether on a single system/user or distributed across
multiple systems/users. The framework is architecture neutral, and
abstractly relies on passing messages between the various framework
components. The session component provides a means to simplify the
author's view of how resources are identified in terms of source
and destination of such messages. The session component is
particularly important for distributed applications involving more
than one device and/or user. It hides the details of the resource
naming schemes and protocols used and provides a high-level
interface for requesting and releasing resources taking part in the
session.</p>
<h3 id="s7.1">7.1 Functions of Session Component</h3>

<h4 id="s7.1.1">7.1.1 Session as basis of state replication and
synchronization</h4>

<p>The session component can be used for replicating state across
devices, or across processes within the same device. Consider a
graphical interface running on a handheld device coupled to a voice
interface running in the network. The user can choose to navigate
or enter data using the device keypad or using speech. When filling
out a form, this gives two ways to update a field's value. The
session provides a scope for the replication mechanism and provides
a way to keep multiple modes in sync.</p>

<h4 id="s7.1.2">7.1.2 Temporary/Persistent Sessions</h4>

<p>For certain applications the session is short-lived. In these
cases the same session may last for a single page or for several
pages as the user navigates through the application, for example
when visiting a web site. This makes it practical to retain state
information for the duration of the application. For applications
that involve persistent sessions, such as meeting rooms or multiplayer
games, there is a need for session management, and a means to
locate, join and leave such sessions.</p>

<h4 id="s7.1.3">7.1.3 Simplifying Applications</h4>

<p>In a distributed environment there are several ways to identify
a resource. The session component provides a means to query
descriptions of resources, including the type of the resource, what
properties the resource has, and what interfaces it supports.</p>
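
<p>The shape of such a query has not been standardized. A purely
hypothetical message exchange between the interaction manager and
the session component might look as follows, with all element and
attribute names invented for illustration.</p>

<pre class="example">
&lt;!-- request: describe a resource taking part in session mtg-42 --&gt;
&lt;query session="mtg-42" resource="user:alice"/&gt;

&lt;!-- response: the resource type, properties and supported interfaces --&gt;
&lt;description resource="user:alice" type="participant"&gt;
  &lt;property name="online"&gt;true&lt;/property&gt;
  &lt;interface name="messaging"/&gt;
&lt;/description&gt;
</pre>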
<h3 id="s7.2">7.2 Use Cases</h3>

<p>The following use cases provide the basis for defining the session
component:</p>

<ul>
<li><b>Mobile Devices with sequential capability</b></li>
</ul>

<p>Devices with limited capability provide a good example of the
importance of a session component. Sequential multimodality
allows the user to experience multiple modes, but only one mode at a
time. In such a scenario the user has to switch between modes to
experience multiple modes. Consider an application where the user is
filling out a form using voice as the input mode, since voice is
the preferred or easier mode for providing input. After the user has
provided the input, the application saves the form fields in a
session object and switches the mode to visual. In visual mode the
application retrieves the values from the session and uses the form
fields for further processing. An example of such an application would
be a driving directions application, where the user provides source
and destination using voice mode and then switches to
visual mode to see the directions.</p>

<ul>
<li><b>Form filling</b></li>
</ul>

<p>Form filling presents another use case for a session component,
especially when partial information is filled in using the keypad
attached to the device and partial information is filled in using
speech processed at a speech server in the network. For example,
in an airline reservation system the user can provide the date of
travel by clicking on appropriate dates in the calendar, and provide
source and destination using speech, which is processed in the
network. A session component helps in synchronizing the input
provided in either mode and provides the filled form information back
to the application.</p>

<ul>
<li><b>Meeting Rooms</b></li>
</ul>

<p>The session in this case is persistent, and users join and leave the
session during the application. A session component allows a user
to query the session environment. A session environment would
consist of the resources and the values of the attributes in the
resources. In the case of a meeting room application, the user can
query: i) who else is in the meeting room; and ii) information about a
particular member in the meeting room, e.g. contact information and
whether the member is online. The resources that the application
wants its users to share are stored, and proper interfaces are
provided to access the attributes of each resource.</p>

<ul>
<li><b>Multiple Device Applications</b></li>
</ul>

<p>For multimodal applications running across multiple devices, the
session component can play an important role in the synchronization
of state across the devices. For example, a user may be running an
application while sitting in a car, using a device attached to the
car. The user gets out of the car, goes to his office, and wants to
continue on his laptop with the application that he was running in
the car. The session component provides interfaces for saving the
state of the whole application on one device and reinstating the
whole state on another device. A few examples of such
applications could be video conferencing, online shopping, airline
reservations, etc. For example, in an airline reservation system, the
user selects the itinerary while he is still in the car. The user
then gets out of the car and buys the same ticket using his laptop in
his office.</p>
<h2 id="s8">8. System and Environment Component</h2>

<p>The <a href="http://www.w3.org/TR/mmi-reqs/">W3C Multimodal
Interaction Requirements</a> call for developers to
be able to create applications that
<a href="http://www.w3.org/TR/mmi-reqs/#Deliveryandcontext">dynamically
adapt</a> to changes in device capabilities, user preferences and
environmental conditions. The multimodal interaction framework must
allow the interaction manager to determine what information is
available, as this will be system dependent. In addition, the
framework must support stand-alone as well as distributed scenarios
involving multiple devices and multiple users (see
<a href="#s7">Section 7</a> for more details).</p>

<p>It is expected that the system and environment component will
make use of the work of the
<a href="http://www.w3.org/2001/di/">W3C Device Independence
activity</a>, in particular the
<a href="http://www.w3.org/Mobile/CCPP/">CC/PP</a> language, whose
aim is to standardize ways of expressing device features and
settings, and to describe how they are transmitted between
components. Profiles regarding multimodal-specific properties, such
as those listed below, are expected to be defined in accordance with
the <a href="http://www.w3.org/TR/CCPP-struct-vocab/">CC/PP
Structure and Vocabularies specification</a>.</p>
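
<p>A CC/PP profile is expressed in RDF/XML. The following fragment
sketches how a device might report display and audio capabilities
to the System and Environment component; the multimodal vocabulary
(the "mmi" properties) is hypothetical, since such profiles remain
to be defined.</p>

<pre class="example">
&lt;rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ccpp="http://www.w3.org/2002/11/08-ccpp-schema#"
         xmlns:mmi="http://example.org/mmi-vocabulary#"&gt;
  &lt;rdf:Description rdf:about="http://example.org/profile#Device"&gt;
    &lt;ccpp:component&gt;
      &lt;rdf:Description rdf:about="http://example.org/profile#Terminal"&gt;
        &lt;!-- the "mmi" property names are invented for this sketch --&gt;
        &lt;mmi:displayWidth&gt;176&lt;/mmi:displayWidth&gt;
        &lt;mmi:displayHeight&gt;208&lt;/mmi:displayHeight&gt;
        &lt;mmi:colorCapable&gt;yes&lt;/mmi:colorCapable&gt;
        &lt;mmi:audioInputMuted&gt;no&lt;/mmi:audioInputMuted&gt;
      &lt;/rdf:Description&gt;
    &lt;/ccpp:component&gt;
  &lt;/rdf:Description&gt;
&lt;/rdf:RDF&gt;
</pre>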
<h3 id="s8.1">8.1 Use Case Scenarios</h3>

<p>To illustrate the component's functionality, it is worth
considering the following few use case scenarios:</p>

<ul>
<li>
<p><b>Mobile</b> devices typically have limited capabilities and
resources, so that applications need to be tailored to the
specifics of the device. For example, many mobile phones have small
monochrome displays, while others have rich, fast color displays.
The following are typical characteristics of mobile devices that
can be provided to the Interaction Manager through the System and
Environment component:</p>

<ul>
<li>
<p><b>Location</b> information can be provided by an increasing
number of mobile devices. Typically this information is derived
from cell quadrant (cellular radio networks), GPS satellite data or
dead reckoning based on motion sensors. The Location
Interoperability Forum — now part of the
<a href="http://www.openmobilealliance.org/lif/">Open Mobile
Alliance</a> — has been responsible for much of the work on
this to date. Location-based services (LBS) provide time-stamped
location data of varying accuracy; in some circumstances this can
be to within a few meters. This information can be provided upon
request at sub-second intervals. Multimodal applications can use
such information to orient maps and to provide geographically
relevant information.</p>
</li>

<li>
<p><b>Signal strength</b> provides information on network
connectivity as well as the quality of service that can be
provided. As signal strength decreases, a multimodal application
could adapt accordingly. This could be as simple as switching to an
alternative low-bandwidth mode of communication.</p>
</li>

<li>
<p><b>Aural noise level</b> for mobile devices is an important
consideration because of the variety of situations where the device
can be used, for example, noise from passing vehicles, other people
talking nearby, or loud music. Speech recognition can be tailored
based on noise levels returned by the System and Environment
component.</p>
</li>

<li>
<p><b>Battery level</b> provides information on the remaining
operational time. Such a notification to the Interaction Manager is
particularly relevant to small untethered devices where power
consumption is critical.</p>
</li>
</ul>
</li>

<li>
<p><b>Automotive</b> — Multimodality is typically an on-board
capability that senses the local environment to determine what
services can be adapted to the driver's situation, for example:</p>

<ul>
<li>
<p><b>Aural noise level</b> within the car can be generated and
modified by numerous environmental factors, for example driving with
the windows down, radio volume, the AC/fan on/off or windscreen
wipers on/off. Environmental conditions of the vehicle, controlled
by the driver, can be notified via the System and Environment component
to the Interaction Manager to adapt the speech recognition.</p>
</li>

<li>
<p><b>In gear</b> notifications could provide information on the
driver's ability to use a touch screen in a multimodal application.
In addition, there are legal ramifications associated with the
driver operating devices whilst the vehicle is in motion. Therefore
the general behavior of a multimodal application may need to adapt
according to whether the vehicle is parked or "in-drive".</p>
</li>

<li>
<p><b>GPS</b> notifications are an important feature of an on-board
multimodal navigation system. The update frequency and accuracy of
updates are higher than for typical LBS mobile services (see the
Mobile scenario above).</p>
</li>
</ul>
</li>

<li>
<p><b>Desktop</b> — Multimodal applications can be tailored
to the user's preferences. These choices can be dynamic or static,
for example:</p>

<ul>
<li>
<p><b>Static user preferences</b> — the default volume
setting, the rate in words per minute for playing text to speech, or a
general preference for using speech rather than a keyboard. People
with visual impairments may opt for easy-to-see large-print text
and high-contrast color themes.</p>
</li>

<li>
<p><b>Dynamic preferences</b> — the user may suddenly mute
audio output, or switch from speech to pen input, and expect the
application to adapt accordingly. The application itself may
monitor the user's progress, and react appropriately, for
example, prompting the user to use a pen after successive failures
with speech recognition.</p>
</li>
</ul>
</li>
</ul>
<h3 id="s8.2">8.2 System and Environment Component Categories</h3>

<p>The above examples give a general indication of the
functionality that the System and Environment component offers as a
means for enabling applications to be tailored to adapt to device
capabilities, user preferences and environmental conditions.</p>

<ul>
<li>
<p><b>Environmental</b> conditions can be monitored and reported
to the Interaction Manager. One way to look at these
characteristics is to inspect interference channels:</p>

<ul>
<li>
<p><b>Auditory</b></p>

<ul>
<li>
<p><b>Environment too noisy for listening</b> — the
application should adapt to this change to provide a better
experience.</p>
</li>

<li>
<p><b>A speaker system/headphone attached?</b> A speaker system
allows the user to see the screen as well as listen at the same
time.</p>
</li>

<li>
<p><b>Car environment factors</b> — radio on/off, radio volume,
AC/fan on/off, windscreen wipers on/off, windows up/down.</p>
</li>
</ul>
</li>

<li>
<p><b>Visual</b></p>

<ul>
<li>
<p><b>Whether gesture recognition is possible.</b> The user should
be able to see the sensor for a gesture-based application.
Moreover, if the user cannot see the device, then audio becomes the
predominant mode of communication and the application should adapt
to it.</p>
</li>
</ul>
</li>

<li>
<p><b>Tactile</b></p>

<ul>
<li>
<p><b>Pen</b> — large or small, or a finger being used as a
tactile input device.</p>
</li>
</ul>
</li>
</ul>
</li>

<li>
<p><b>System</b> notifications can be derived from numerous
environmental sources, particularly within mobile and automotive
applications. Notifications from the System and Environment
component to the Interaction Manager can range from GPS location
information to the fact that the laptop has been closed. Many of
these system notifications indicate that the application should
switch to an alternative mode of operation.</p>
</li>

<li>
<p><b>User preferences</b> help with tailoring the application to
the user. These characteristics are most apparent in rich
multimodal scenarios such as the desktop, where resources are less
of an issue (large screens and fast CPUs). Preferences can be
modified to best suit user choices. Furthermore, it is possible to
dynamically adapt to the user's preferences over time.</p>
</li>
</ul>
<h2 id="s9">9. Illustrative Use Case</h2>

<p>To illustrate the component markup languages of the W3C
Multimodal Interaction Framework, consider this simple use case.
The human user points to a position on a displayed map and speaks:
"What is the name of this place?" The multimodal interaction system
responds by speaking "Lake Wobegon, Minnesota" and displays the
text "Lake Wobegon, Minnesota" on the map. The following summarizes
the actions of the relevant components of the W3C Multimodal
Interaction Framework:</p>

<p><em>Human user</em> — Points to a position on a map and
says, "What is the name of this place?"</p>

<p><em>Speech recognition component</em> — Recognizes the
words "What is the name of this place?"</p>

<p><em>Mouse recognition component</em> — Recognizes the x-y
coordinates of the position to which the user pointed on a map.</p>

<p><em>Speech interpretation component</em> — Converts the
words "What is the name of this place?" into an internal
notation.</p>

<p><em>Pointing interpretation component</em> — Converts the
x-y coordinates of the position to which the user pointed into an
internal notation.</p>

<p><em>Integration component</em> — Integrates the internal
notation for the words "What is the name of this place?" with the
internal notation for the x-y coordinates.</p>
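
<p>If the internal notation is EMMA, the integrated result might
resemble the following sketch. As noted earlier, EMMA was still
being defined, so the element names and the feature names ("action",
"location") are illustrative only.</p>

<pre class="example">
&lt;!-- speech and pointing combined into one interpretation --&gt;
&lt;emma:interpretation id="int1"
    xmlns:emma="http://www.w3.org/2003/04/emma"&gt;
  &lt;action&gt;identify-place&lt;/action&gt;
  &lt;location&gt;
    &lt;x&gt;212&lt;/x&gt;
    &lt;y&gt;147&lt;/y&gt;
  &lt;/location&gt;
&lt;/emma:interpretation&gt;
</pre>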
<p><em>Interaction manager component</em> — Stores the
internal notation in the session object. Converts the request to a
database request, and submits the request to a database management
system, which returns the value "Lake Wobegon, Minnesota". Adds
the response to the internal notation in the session object. The
interaction manager converts the response into an internal notation
and sends the response to the generation component.</p>

<p><em>Generation component</em> — Accesses the Environment
component to determine that voice and graphics modes are available.
Decides to present the result in two complementary modes, voice and
graphics. The generation component sends internal notation
representing "Lake Wobegon, Minnesota" to the voice styling
component, and sends internal notation representing the location of
Lake Wobegon, Minnesota on a map to the graphics styling
component.</p>

<p><em>Voice styling component</em> — Converts the internal
notation representing "Lake Wobegon, Minnesota" into SSML.</p>

<p><em>Graphics styling component</em> — Converts the
internal notation representing the "Lake Wobegon, Minnesota"
location on a map into HTML notation.</p>

<p><em>Voice rendering component</em> — Converts the SSML notation
into acoustic voice for the user to hear.</p>

<p><em>Graphics rendering component</em> — Converts the HTML notation
into visual graphics for the user to see.</p>
<h2 id="s10">10. Examples of Architectures Consistent with
the W3C Multimodal Interaction Framework</h2>

<p>There are many possible multimodal architectures that are
consistent with the W3C multimodal interaction framework. These
multimodal architectures have the following properties:</p>

<p>Property 1. THE MULTIMODAL ARCHITECTURE CONTAINS A SUBSET OF THE
COMPONENTS OF THE W3C MULTIMODAL INTERACTION FRAMEWORK. A
<em>multimedia architecture</em> contains two or more output modes.
A <em>multimodal architecture</em> contains two or more input
modes.</p>

<p>Property 2. COMPONENTS MAY BE PARTITIONED AND COMBINED. The
functions within a component may be partitioned into several
modules within the architecture, and the functions within two or
more components may be combined into a single module within the
architecture.</p>

<p>Property 3. THE COMPONENTS ARE ALLOCATED TO HARDWARE DEVICES. If
all components are allocated to the same hardware device, the
architecture is said to be a <em>centralized architecture</em>. For
example, a PC containing all of the selected components has a
centralized architecture. A <em>client-server architecture</em>
consists of two types of devices: several client devices containing
many of the input and output components, and the server, which
contains the remaining components. A <em>distributed
architecture</em> consists of multiple types of devices connected
by a communication system.</p>

<p>Property 4. THE COMMUNICATION SYSTEMS ARE SPECIFIED. Designers
specify the protocols for exchanging messages among hardware
devices.</p>

<p>Property 5. THE DIALOG MODEL IS SPECIFIED. Designers specify how
modules are invoked and terminated, and how they interpret input to
produce output.</p>

<p>The following examples illustrate architectures that conform to
the W3C multimodal interaction framework.</p>
<h3 id="s10.1">Example 1: Driving Example (Figure 4)</h3>

<p>In this example, the user wants to go to a specific address from
his current location, and while driving wants to take a detour to a
local restaurant (the user knows neither the restaurant's address
nor its name). The user initiates service via a button on his steering
wheel and interacts with the system via the touch screen and
speech.</p>

<p>Property 1. The driving architecture contains the components
illustrated in Figure 4: a graphical display, map database, voice
and touch input, speech output, local ASR, TTS processing, and
GPS.</p>

<p>Property 2. No components are partitioned or combined, with the
possible exception of the integration and interaction manager
components, and the generation and interaction components. There
are two possible configurations, depending upon whether the
integration component is stand-alone or combined with the
interaction manager component:</p>

<ul>
<li>
<p>Information entered by the user may be encoded into EMMA
(Extensible MultiModal Annotation Markup Language, formerly known
as the Natural Language Semantic Markup Language) and combined by
an integration component (shown within the dotted rectangle in
Figure 4) which is separate from the interaction manager.</p>
</li>

<li>
<p>Information entered by the user may be recognized and
interpreted and then routed directly to the interaction manager,
which performs its own integration of user information.</p>
</li>
</ul>

<p>There are two possible configurations, depending upon whether
the generation component is stand-alone or combined with the
interaction manager component:</p>

<ul>
<li>
<p>Information from the interaction manager may be routed to the
generation component, where multiple modes of output are generated
and the appropriate synchronization control created.</p>
</li>

<li>
<p>Information may be routed directly to the styling components
and then on to the rendering components. In this case, the
interaction manager does its own generation and
synchronization.</p>
</li>
</ul>

<p>Property 3. All components are allocated to a single client-side
hardware device onboard the car. In Figure 4, the client is
illustrated by a pink box containing all of the components.</p>

<p>Property 4. No communication system is required in this
centralized architecture.</p>

<p>Property 5. Dialog Model: The user wants to go to a specific
address from his current location, and while driving wants to take a
detour to a local restaurant (the user does not know the
restaurant name or address). The user initiates service via a
button on his steering wheel and interacts with the system via the
touch screen and speech.</p>

<p class="fig"><img src="fig4.png" width="559" height="539"
alt="Figure 4: Driving Example" /></p>
<h3 id="s10.2">Example 2: Name dialing (Figure 5)</h3>

<p>The Name dialing example enables a user to initiate a call by
saying the name of the person to be contacted. Visual and spoken
dialogs are used to narrow the selection, and to allow an exchange
of multimedia messages if the called person is unavailable. Call
handling is determined by a script provided by the called person.
The example supports the use of a combination of local and remote
speech recognition.</p>

<p>Property 1: The architecture contains a subset of the components
of the W3C Multimodal Interaction Framework.</p>

<p>Property 2: No components have been partitioned or combined, with
the possible exception of the integration component and interaction
component, and the generation component and the interaction
component (as discussed in example 1).</p>

<p>Property 3. The components in pink are allocated to the client
and the components in green are allocated to the server. Note that
the speech recognition and interpretation components are on both
client and server. The local ASR recognizes basic control commands
based upon the ETSI DES/HF-00021 standardized command and control
vocabulary, and the remote ASR recognizes names of individuals the
user wishes to dial. (The vocabulary of names is too large to
maintain on the client, so it is maintained on the server.)</p>

<p>Property 4. The communication system is
<a href="http://www.ietf.org/html.charters/sip-charter.html">SIP</a>,
the Session Initiation Protocol: a means for initiating
communication sessions involving multiple devices, and for control
signaling during such sessions.</p>

<p>Property 5. Navigational and control commands are recognized by
the ASR on the client. When the user says "call John Smith," the
ASR on the client recognizes the command "call" and transfers the
following information ("John Smith") to the server for recognition.
The application on the server then connects the user with John
Smith's telephone.</p>

<p class="fig"><img src="fig5.png" width="557" height="545"
alt="Figure 5: Name Dialing Example" /></p>
<h3 id="s10.3">Example 3: Form fill-in (Figure 6)</h3>

<p>In the Form fill-in example, the user wants to make a flight
reservation with his mobile device while he is on the way to work.
The user initiates the service by means of making a phone call to a
multimodal service (telephone metaphor) or by selecting an application
(portal environment metaphor). The dialogue between the user and
the application is driven by a form-filling paradigm where the user
provides input to fields such as "Travel Origin:", "Travel
Destination:", "Leaving on date", "Returning on date". As the user
selects each field in the application to enter information, the
corresponding input constraints are activated to drive the
recognition and interpretation of the user input.</p>

<p>Property 1: The architecture contains a subset of the components
of the W3C Multimodal Interaction Framework, including GPS and
Ink.</p>

<p>Property 2: The speech recognition component has been
partitioned into two components, one of which will be placed on the
client and the other on the server. The integration component and
interaction component, and the generation component and the
interaction component, may be combined or left separate (as
discussed in example 1).</p>

<p>Property 3. The components in pink are allocated to the client
and the components in green are allocated to the server. Speech
recognition is distributed between the client and the server, with
the feature extraction on the client and the remaining speech
recognition functions performed on the server.</p>

<p>Property 4. The communication system is
<a href="http://www.ietf.org/html.charters/sip-charter.html">SIP</a>,
the Session Initiation Protocol: a means for initiating
communication sessions involving multiple devices, and for control
signaling during such sessions.</p>

<p>Property 5. Dialog Model: The user wants to make a flight
reservation with his mobile device while he is on the way to work.
The user initiates the service by means of making a phone call to
a multimodal service (telephone metaphor) or by selecting an
application (portal environment metaphor). The dialogue between the
user and the application is driven by a form-filling paradigm where
the user provides input to fields such as "Travel Origin:", "Travel
Destination:", "Leaving on date", "Returning on date". As the user
selects each field in the application to enter information, the
corresponding input constraints are activated to drive the
recognition and interpretation of the user input. The capability of
providing composite multimodal input is also examined, where input
from multiple modalities is combined for the interpretation of the
user's intent.</p>

<p class="fig"><img src="fig6.png" width="558" height="547"
alt="Figure 6: Form Fill-in Example" /></p>
</body>
</html>