<?xml version="1.0" encoding="utf-8"?>
|
|
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
|
|
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
|
|
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
|
|
<head>
|
|
<meta name="generator" content=
|
|
"HTML Tidy for Linux/x86 (vers 1 September 2005), see www.w3.org" />
|
|
<title>Use Cases for Possible Future EMMA Features</title>
|
|
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
|
|
<style type="text/css">
|
|
/*<![CDATA[*/
|
|
code { font-family: monospace; }
|
|
|
|
div.constraint,
|
|
div.issue,
|
|
div.note,
|
|
div.notice { margin-left: 2em; }
|
|
|
|
ol.enumar { list-style-type: decimal; }
|
|
ol.enumla { list-style-type: lower-alpha; }
|
|
ol.enumlr { list-style-type: lower-roman; }
|
|
ol.enumua { list-style-type: upper-alpha; }
|
|
ol.enumur { list-style-type: upper-roman; }
|
|
|
|
|
|
div.exampleInner pre { margin-left: 1em;
|
|
margin-top: 0em; margin-bottom: 0em}
|
|
div.exampleOuter {border: 4px double gray;
|
|
margin: 0em; padding: 0em}
|
|
div.exampleInner { background-color: #d5dee3;
|
|
border-top-width: 4px;
|
|
border-top-style: double;
|
|
border-top-color: #d3d3d3;
|
|
border-bottom-width: 4px;
|
|
border-bottom-style: double;
|
|
border-bottom-color: #d3d3d3;
|
|
padding: 4px; margin: 0em }
|
|
div.exampleWrapper { margin: 4px }
|
|
div.exampleHeader { font-weight: bold;
|
|
margin: 4px}
|
|
|
|
table {
|
|
width:80%;
|
|
border:1px solid #000;
|
|
border-collapse:collapse;
|
|
font-size:90%;
|
|
}
|
|
|
|
td,th{
|
|
border:1px solid #000;
|
|
border-collapse:collapse;
|
|
padding:5px;
|
|
}
|
|
|
|
|
|
caption{
|
|
background:#ccc;
|
|
font-size:140%;
|
|
border:1px solid #000;
|
|
border-bottom:none;
|
|
padding:5px;
|
|
text-align:center;
|
|
}
|
|
|
|
img.center {
|
|
display: block;
|
|
margin-left: auto;
|
|
margin-right: auto;
|
|
}
|
|
p.caption {
|
|
text-align: center
|
|
}
|
|
|
|
|
|
.RFC2119 {
|
|
text-transform: lowercase;
|
|
font-style: italic;
|
|
}
|
|
/*]]>*/
|
|
</style>
|
|
|
|
<style type="text/css">
|
|
/*<![CDATA[*/
|
|
p.c1 {font-weight: bold}
|
|
/*]]>*/
|
|
</style>
|
|
|
|
<link href="http://www.w3.org/StyleSheets/TR/W3C-WG-NOTE.css" type="text/css" rel="stylesheet" />
|
|
<meta content="MSHTML 6.00.6000.16762" name="GENERATOR" />
|
|
<style type="text/css">
|
|
/*<![CDATA[*/
|
|
ol.c2 {list-style-type: lower-alpha}
|
|
li.c1 {list-style: none}
|
|
/*]]>*/
|
|
</style>
|
|
</head>
|
|
<body xml:lang="en" lang="en">
|
|
<div class="head"><a href="http://www.w3.org/"><img alt="W3C" src=
|
|
"http://www.w3.org/Icons/w3c_home" width="72" height="48" /></a>
|
|
<h1 id="title">Use Cases for Possible Future EMMA Features</h1>
|
|
<h2 id="w3c-doctype">W3C Working Group Note <i>15</i> <i>December</i> <i>2009</i></h2>
|
|
|
|
<dl>
|
|
<dt>This version:</dt>
|
|
<dd><a href="http://www.w3.org/TR/2009/NOTE-emma-usecases-20091215">http://www.w3.org/TR/2009/NOTE-emma-usecases-20091215</a></dd>
|
|
<dt>Latest version:</dt>
|
|
<dd><a href="http://www.w3.org/TR/emma-usecases">http://www.w3.org/TR/emma-usecases</a></dd>
|
|
<dt>Previous version:</dt>
|
|
<dd><em>This is the first publication.</em></dd>
|
|
|
|
<dt>Editor:</dt>
|
|
<dd>Michael Johnston, AT&T</dd>
|
|
<dt>Authors:</dt>
|
|
<dd>Deborah A. Dahl, Invited Expert</dd>
|
|
<dd>Ingmar Kliche, Deutsche Telekom AG</dd>
|
|
<dd>Paolo Baggia, Loquendo</dd>
|
|
<dd>Daniel C. Burnett, Voxeo</dd>
|
|
<dd>Felix Burkhardt, Deutsche Telekom AG</dd>
|
|
<dd>Kazuyuki Ashimura, W3C</dd>
|
|
</dl>
|
|
<p class="copyright"><a href=
|
|
"http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">Copyright</a>
|
|
© 2009 <a href="http://www.w3.org/"><acronym title=
|
|
"World Wide Web Consortium">W3C</acronym></a><sup>®</sup> (<a href=
|
|
"http://www.csail.mit.edu/"><acronym title=
|
|
"Massachusetts Institute of Technology">MIT</acronym></a>, <a href=
|
|
"http://www.ercim.org/"><acronym title=
|
|
"European Research Consortium for Informatics and Mathematics">ERCIM</acronym></a>,
|
|
<a href="http://www.keio.ac.jp/">Keio</a>), All Rights Reserved.
|
|
W3C <a href=
|
|
"http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">liability</a>,
|
|
<a href=
|
|
"http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">trademark</a>
|
|
and <a href=
|
|
"http://www.w3.org/Consortium/Legal/copyright-documents">document
|
|
use</a> rules apply.</p>
|
|
</div>
|
|
<!-- end of head div -->
|
|
<hr title="Separator for header" />
|
|
<h2 id="abstract">Abstract</h2>
|
|
<p>The EMMA: Extensible MultiModal Annotation specification defines
|
|
an XML markup language for capturing and providing metadata on the
|
|
interpretation of inputs to multimodal systems. Throughout the
|
|
implementation report process and discussion since EMMA 1.0 became
|
|
a W3C Recommendation, a number of new possible use cases for the
|
|
EMMA language have emerged. These include the use of EMMA to
|
|
represent multimodal output, biometrics, emotion, sensor data,
|
|
multi-stage dialogs, and interactions with multiple users. In this
|
|
document, we describe these use cases and illustrate how the EMMA
|
|
language could be extended to support them.</p>
|
|
|
|
<h2 id="status">Status of this Document</h2>
|
|
|
|
<p><em>This section describes the status of this document at the
|
|
time of its publication. Other documents may supersede this
|
|
document. A list of current W3C publications and the latest
|
|
revision of this technical report can be found in the <a href=
|
|
"http://www.w3.org/TR/">W3C technical reports index</a> at
|
|
http://www.w3.org/TR/.</em></p>
|
|
|
|
<p>This document is a W3C Working Group Note published on 15 December
|
|
2009. This is the first publication of this document and it represents
|
|
the views of the W3C Multimodal Interaction Working Group at the time
|
|
of publication. The document may be updated as new technologies emerge
|
|
or mature. Publication as a Working Group Note does not imply
|
|
endorsement by the W3C Membership. This is a draft document and may be
|
|
updated, replaced or obsoleted by other documents at any time. It is
|
|
inappropriate to cite this document as other than work in
|
|
progress.</p>
|
|
|
|
<p>This document is one of a series produced by the
|
|
<a href="http://www.w3.org/2002/mmi/">Multimodal Interaction WorkingGroup</a>,
|
|
part of the <a href="http://www.w3.org/2002/mmi/Activity">W3C Multimodal Interaction
|
|
Activity</a>.
|
|
|
|
Since <a href="http://www.w3.org/TR/emma/">EMMA 1.0</a> became a W3C
|
|
Recommendation, a number of new possible use cases for the EMMA language have
|
|
emerged, e.g., the use of EMMA to represent multimodal output, biometrics,
|
|
emotion, sensor data, multi-stage dialogs and interactions with multiple users.
|
|
|
|
Therefore the Working Group has been working on a document capturing use cases
|
|
and issues for a series of possible extensions to EMMA.
|
|
|
|
The intention of publishing this Working Group Note is to seek feedback on the
|
|
various use cases.
|
|
</p>
|
|
|
|
<p>Comments on this document can be sent to <a href=
|
|
"mailto:www-multimodal@w3.org">www-multimodal@w3.org</a>, the
|
|
public forum for discussion of the W3C's work on Multimodal
|
|
Interaction. To subscribe, send an email to <a href=
|
|
"mailto:www-multimodal-request@w3.org">www-multimodal-request@w3.org</a>
|
|
with the word subscribe in the subject line (include the word
|
|
unsubscribe if you want to unsubscribe). The <a href=
|
|
"http://lists.w3.org/Archives/Public/www-multimodal/">archive</a>
|
|
for the list is accessible online.</p>
|
|
|
|
<p> This document was produced by a group operating under the <a href="http://www.w3.org/Consortium/Patent-Policy-20040205/">5 February 2004 W3C Patent Policy</a>. W3C maintains a <a rel="disclosure" href="http://www.w3.org/2004/01/pp-impl/34607/status">public list of any patent disclosures</a> made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains <a href="http://www.w3.org/Consortium/Patent-Policy-20040205/#def-essential">Essential Claim(s)</a> must disclose the information in accordance with <a href="http://www.w3.org/Consortium/Patent-Policy-20040205/#sec-Disclosure">section 6 of the W3C Patent Policy</a>. </p>
|
|
|
|
<h2 id="contents">Table of Contents</h2>
|
|
<ul>
|
|
<li>1. <a href="#s1">Introduction</a></li>
|
|
<li>2. <a href="#s2">EMMA use cases</a></li>
|
|
</ul>
|
|
<ul class="tocline">
|
|
<li>2.1 <a href="#s2.1">Incremental results for streaming
|
|
modalities such as haptics, ink, monologues, dictation</a></li>
|
|
<li>2.2 <a href="#s2.2">Representing biometric information</a></li>
|
|
<li>2.3 <a href="#s2.3">Representing emotion in EMMA</a></li>
|
|
<li>2.4 <a href="#s2.4">Richer semantic representations in
|
|
EMMA</a></li>
|
|
<li>2.5 <a href="#s2.5">Representing system output in EMMA</a></li>
|
|
<li class="c1">
|
|
<ul class="tocline">
|
|
<li>2.5.1 <a href="#s2.5.1">Abstracting output from specific
|
|
modalities</a></li>
|
|
<li>2.5.2 <a href="#s2.5.2">Coordination of outputs distributed
|
|
over multiple different modalities</a></li>
|
|
</ul>
|
|
</li>
|
|
<li>2.6 <a href="#s2.6">Representation of dialogs in EMMA</a></li>
|
|
<li>2.7 <a href="#s2.7">Logging, analysis, and annotation</a></li>
|
|
<li class="c1">
|
|
<ul class="tocline">
|
|
<li>2.7.1 <a href="#s2.7.1">Log analysis</a></li>
|
|
<li>2.7.2 <a href="#s2.7.2">Log annotation</a></li>
|
|
</ul>
|
|
</li>
|
|
<li>2.8 <a href="#s2.8">Multi-sentence inputs</a></li>
|
|
<li>2.9 <a href="#s2.9">Multi-participant interactions</a></li>
|
|
<li>2.10 <a href="#s2.10">Capturing sensor data such as GPS in
|
|
EMMA</a></li>
|
|
<li>2.11 <a href="#s2.11">Extending EMMA from NLU to also represent
|
|
search or database retrieval results</a></li>
|
|
<li>2.12 <a href="#s2.12">Supporting other semantic representation
|
|
forms in EMMA</a></li>
|
|
</ul>
|
|
<ul>
|
|
<li><a href="#references">General References</a></li>
|
|
</ul>
|
|
<hr title="Separator for introduction" />
|
|
<h2 id="s1">1. Introduction</h2>
|
|
<p>This document presents a set of use cases for possible new
|
|
features of the Extensible MultiModal Annotation (EMMA) markup
|
|
language. <a href="http://www.w3.org/TR/emma/">EMMA 1.0</a> was
|
|
designed primarily to be used as a data interchange format by
|
|
systems that provide semantic interpretations for a variety of
|
|
inputs, including but not necessarily limited to, speech, natural
|
|
language text, GUI and ink input. EMMA 1.0 provides a set of
|
|
elements for containing the various stages of processing of a
|
|
user's input and a set of elements and attributes for specifying
|
|
various kinds of metadata such as confidence scores and timestamps.
|
|
<a href="http://www.w3.org/TR/emma/">EMMA 1.0</a> became a W3C
|
|
Recommendation on February 10, 2009.</p>
|
|
<p>A number of possible extensions to <a href=
|
|
"http://www.w3.org/TR/emma/">EMMA 1.0</a> have been identified
|
|
through discussions with other standards organizations,
|
|
implementers of EMMA, and internal discussions within the W3C
|
|
Multimodal Interaction Working Group. This document focuses on the
|
|
following use cases:</p>
|
|
<ol>
|
|
<li>Representing incremental results for streaming modalities such
|
|
as haptics, ink, monologues, dictation, where it is desirable to
|
|
have partial results available before the full input finishes.</li>
|
|
<li>Representing biometric results such as the results of speaker
|
|
verification or speaker identification (briefly covered in EMMA
|
|
1.0).</li>
|
|
<li>Representing emotion, for example, as conveyed by intonation
|
|
patterns, facial expression, or lexical choice.</li>
|
|
<li>Richer semantic representations, for example, integrating EMMA
|
|
application semantics with ontologies.</li>
|
|
<li>Representing system output in addition to user input, including
|
|
topics such as:</li>
|
|
<li class="c1">
|
|
<ol class="c2">
|
|
<li>Isolating presentation logic from dialog/interaction
|
|
management.</li>
|
|
<li>Coordination of outputs distributed over multiple different
|
|
modalities.</li>
|
|
</ol>
|
|
</li>
|
|
<li>Support for archival functions such as logging, human
|
|
annotation of inputs, and data analysis.</li>
|
|
<li>Representing full dialogs and multi-sentence inputs in addition
|
|
to single inputs.</li>
|
|
<li>Representing multi-participant interactions.</li>
|
|
<li>Representing sensor data such as GPS input.</li>
|
|
<li>Representing the results of database queries or search.</li>
|
|
<li>Support for forms of representation of application semantics
|
|
other than XML, such as JSON.</li>
|
|
</ol>
|
|
<p>It may be possible to achieve support for some of these features
|
|
without modifying the language, through the use of the
|
|
extensibility mechanisms of <a href=
|
|
"http://www.w3.org/TR/emma/">EMMA 1.0</a>, such as the
|
|
<code><emma:info></code> element and application-specific
|
|
semantics; however, this would significantly reduce
|
|
interoperability among EMMA implementations. If features are of
|
|
general value then it would be beneficial to define standard ways
|
|
of implementing them within the EMMA language. Additionally,
|
|
extensions may be needed to support additional new kinds of input
|
|
modalities such as multi-touch and accelerometer input.</p>
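<p>For instance, the following minimal, non-normative sketch shows how
data from a modality not covered by EMMA 1.0, such as an accelerometer
reading, could be attached today through the
<code><emma:info></code> extension point. The
<code>accelerometer</code> and <code>gesture</code> elements are
purely hypothetical application-specific markup, not EMMA
vocabulary.</p>
<div class="exampleInner">
<pre>
<emma:emma
  version="1.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation id="int1"
    emma:medium="tactile"
    emma:mode="accelerometer"
    emma:verbal="false">
    <emma:info>
      <accelerometer x="0.02" y="0.91" z="0.40"/>
    </emma:info>
    <gesture>shake</gesture>
  </emma:interpretation>
</emma:emma>
</pre>
</div>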
|
|
<p>The W3C Membership and other interested parties are invited to
|
|
review this document and send comments to the Working Group's
|
|
public mailing list www-multimodal@w3.org <a href=
|
|
"http://lists.w3.org/Archives/Public/www-multimodal/">(archive)</a>
|
|
.</p>
|
|
<h2 id="s2">2. EMMA use cases</h2>
|
|
<h3 id="s2.1">2.1 Incremental results for streaming modalities such
|
|
as haptics, ink, monologues, dictation</h3>
|
|
<p>In EMMA 1.0, EMMA documents were assumed to be created for
|
|
completed inputs within a given modality. However, there are
|
|
important use cases where it would be beneficial to represent some
|
|
level of interpretation of partial results before the input is
|
|
complete. For example, in a dictation application, where inputs can
|
|
be lengthy, it is often desirable to show partial results to give
|
|
feedback to the user while they are speaking. In this case, each
|
|
new word is appended to the previous sequence of words. Another use
|
|
case would be incremental ASR, either for dictation or dialog
|
|
applications, where previous results might be replaced as more
|
|
evidence is collected. As more words are recognized and provide
|
|
more context, earlier word hypotheses may be updated. In this
|
|
scenario it may be necessary to replace the previous hypothesis
|
|
with a revised one.</p>
|
|
<p>In this section, we discuss how the EMMA standard could be
|
|
extended to support incremental or streaming results in the
|
|
processing of a single input. Some key considerations and areas for
|
|
discussion are:</p>
|
|
<ol>
|
|
<li>Do we need an identifier for a particular stream? Or is
|
|
<code>emma:source</code> sufficient? Subsequent messages (carrying
|
|
information for a particular stream) may need to have the same
|
|
identifier.</li>
|
|
<li>Do we need a sequence number to indicate order? Or are
|
|
timestamps sufficient (though optional)?</li>
|
|
<li>Do we need to mark "begin", "in progress" and "end" of a
|
|
stream? There are streams with a particular start and end, like a
|
|
dictation. Note that sensors may never explicitly end a
|
|
stream.</li>
|
|
<li>Do we always append information? Or do we also replace previous
|
|
data? A dictation application will probably append new text. But do
|
|
we consider sensor data (such as GPS position or device tilt) as
|
|
streaming or as "final" data?</li>
|
|
</ol>
|
|
<p>In the example below for dictation, we show how three new
|
|
attributes <code>emma:streamId</code>,
|
|
<code>emma:streamSeqNr</code>, and <code>emma:streamProgress</code>
|
|
could be used to annotate each result with metadata regarding its
|
|
position and status within a stream of input. In this example, the
|
|
<code>emma:streamId</code> is an identifier which can be used to
|
|
show that different <code>emma:interpretation</code> elements are
|
|
members of the same stream. The <code>emma:streamSeqNr</code>
|
|
attribute provides a numerical order to elements in the stream
|
|
while <code>emma:streamProgress</code> indicates the start of the
|
|
stream (and whether to expect more interpretations within the same
|
|
stream), and the end of the stream. This is an instance of the
|
|
'append' scenario for partial results in EMMA.</p>
|
|
<table width="120">
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">User</td>
|
|
<td>Hi Joe the meeting has moved</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example"
|
|
<emma:interpretation id="int1"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:function="transcription"
|
|
emma:confidence="0.75"
|
|
emma:tokens="Hi Joe the meeting has moved"
|
|
emma:streamId="id1"
|
|
emma:streamSeqNr="0"
|
|
emma:streamProgress="begin">
|
|
<emma:literal>
|
|
Hi Joe the meeting has moved
|
|
</emma:literal>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">User</td>
|
|
<td>to friday at four</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example"
|
|
<emma:interpretation id="int2"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:function="transcription"
|
|
emma:confidence="0.75"
|
|
emma:tokens="to friday at four"
|
|
emma:streamId="id1"
|
|
emma:streamSeqNr="1"
|
|
emma:streamProgress="end">
|
|
<emma:literal>
|
|
to friday at four
|
|
</emma:literal>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
</table>
|
|
<p>In the example below, a speech recognition hypothesis for the
|
|
whole string is updated once more words have been recognized. This
|
|
is an instance of the 'replace' scenario for partial results in
|
|
EMMA. Note that the <code>emma:streamSeqNr</code> is the same for
|
|
each interpretation in this case.</p>
|
|
<table width="120">
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">User</td>
|
|
<td>Is there a Pisa</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example"
|
|
<emma:interpretation id="int1"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:function="dialog"
|
|
emma:confidence="0.7"
|
|
emma:tokens="is there a pisa"
|
|
emma:streamId="id2"
|
|
emma:streamSeqNr="0"
|
|
emma:streamProgress="begin">
|
|
<emma:literal>
|
|
is there a pisa
|
|
</emma:literal>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">User</td>
|
|
<td>Is there a pizza restaurant</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example"
|
|
<emma:interpretation id="int2"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:function="dialog"
|
|
emma:confidence="0.9"
|
|
emma:tokens="is there a pizza restaurant"
|
|
emma:streamId="id2"
|
|
emma:streamSeqNr="0"
|
|
emma:streamProgress="end">
|
|
<emma:literal>
|
|
is there a pizza restaurant
|
|
</emma:literal>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
</table>
|
|
<p>One issue for the 'replace' case of incremental results is how
|
|
to specify that a result replaces multiple of the previously
|
|
received results. For example, a system could receive partial
|
|
results consisting of each word in turn of an utterance, and then a
|
|
final result which is the final recognition for the whole sequence
|
|
of words. One approach to this problem would be to allow
|
|
<code>emma:streamSeqNr</code> to specify a range of inputs to be
|
|
replaced. For example, if the <code>emma:streamSeqNr</code> for
|
|
each of three single-word results was 1, 2, and then 3, a final
|
|
revised result could be marked as
|
|
<code>emma:streamSeqNr="1-3"</code> indicating that it is a revised
|
|
result for those three words.</p>
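<p>Purely as a non-normative illustration of this possibility, a
revised result covering such a range might look as follows, reusing
the stream attributes proposed above; the identifiers and token
values are hypothetical:</p>
<div class="exampleInner">
<pre>
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation id="int4"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:function="transcription"
    emma:tokens="flights to boston"
    emma:streamId="id3"
    emma:streamSeqNr="1-3"
    emma:streamProgress="end">
    <emma:literal>
      flights to boston
    </emma:literal>
  </emma:interpretation>
</emma:emma>
</pre>
</div>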
|
|
<p>One issue is whether timestamps might be used to track ordering
|
|
instead of introducing new attributes. One problem is that
|
|
timestamp attributes are not required and may not always be
|
|
available. Also as shown in the example, chunks of input in a
|
|
stream may not always be in sequential order. Even with timestamps
|
|
providing an order some kind of 'begin' and 'end' flag is needed
|
|
(like <code>emma:streamProgress</code>) to indicate the
|
|
beginning and end of transmission of streamed input. Moreover,
|
|
timestamps do not provide sufficient information to detect whether
|
|
a message has been lost.</p>
|
|
<p>Another possibility to explore for representation of incremental
|
|
results would be to use an <code><emma:sequence></code>
|
|
element containing the interim results and a derived result which
|
|
contains the combination.</p>
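<p>A rough, non-normative sketch of this alternative is given below:
the interim dictation results appear within an
<code><emma:sequence></code> inside
<code><emma:derivation></code>, and the combined result refers
back to that sequence through
<code><emma:derived-from></code>. All of the elements used are
already defined in EMMA 1.0; the identifiers are hypothetical.</p>
<div class="exampleInner">
<pre>
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation id="combined1">
    <emma:literal>
      Hi Joe the meeting has moved to friday at four
    </emma:literal>
    <emma:derived-from resource="#seq1" composite="false"/>
  </emma:interpretation>
  <emma:derivation>
    <emma:sequence id="seq1">
      <emma:interpretation id="int1">
        <emma:literal>Hi Joe the meeting has moved</emma:literal>
      </emma:interpretation>
      <emma:interpretation id="int2">
        <emma:literal>to friday at four</emma:literal>
      </emma:interpretation>
    </emma:sequence>
  </emma:derivation>
</emma:emma>
</pre>
</div>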
|
|
<p>Another issue to explore is the relationship between incremental
|
|
results and the MMI lifecycle events within the <a href=
|
|
"http://www.w3.org/TR/mmi-arch/">MMI Architecture</a>.</p>
|
|
<h3 id="s2.2">2.2 Representing biometric information</h3>
|
|
<p>Biometric technologies include systems designed to identify
|
|
someone or verify a claim of identity based on their physical or
|
|
behavioral characteristics. These include speaker verification,
|
|
speaker identification, face recognition, and iris recognition,
|
|
among others. <a href="http://www.w3.org/TR/emma/">EMMA 1.0</a>
|
|
provided some capability for representing the results of biometric
|
|
analysis through values of the <code>emma:function</code> attribute
|
|
such as "verification". However, it did not discuss the specifics
|
|
of this use case in any detail. It may be worth exploring further
|
|
considerations and consequences of using EMMA to represent
|
|
biometric results. As one example, if different biometric results
|
|
are represented in EMMA, this would simplify the process of fusing
|
|
the outputs of multiple biometric technologies to obtain a more
|
|
reliable overall result. It should also make it easier to
|
|
take into account non-biometric claims of identity, such as a
|
|
statement like "this is Kazuyuki", represented in EMMA, along with
|
|
a speaker verification result based on the speaker's voice, which
|
|
would also be represented in EMMA. In the following example, we
|
|
have extended the set of values for <code>emma:function</code> to
|
|
include "identification" for an interpretation showing the results
|
|
of a biometric component that picks out an individual from a set of
|
|
possible individuals (who are they). This contrasts with
|
|
"verification" which is used for verification of a particular user
|
|
(are they who they say they are).</p>
|
|
<h4 id="biometric_example">Example</h4>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td>an image of a face</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example"
|
|
<emma:interpretation id="int1"
|
|
emma:confidence="0.75”
|
|
emma:medium="visual"
|
|
emma:mode="photograph"
|
|
emma:verbal="false"
|
|
emma:function="identification">
|
|
<person>12345</person>
|
|
<name>Mary Smith</name>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>One direction to explore further is the relationship between
|
|
work on messaging protocols for biometrics within the OASIS
|
|
Biometric Identity Assurance Services (<a href="http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=bias">BIAS</a>)
|
|
standards committee and EMMA.</p>
|
|
<h3 id="s2.3">2.3 Representing emotion in EMMA</h3>
|
|
<p>In addition to speech recognition, and other tasks such as
|
|
speaker verification and identification, another kind of
|
|
interpretation of speech that is of increasing importance is
|
|
determination of the emotional state of the speaker, based on, for
|
|
example, their prosody, lexical choice, or other features. This
|
|
information can be used, for example, to make the dialog logic of
|
|
an interactive system sensitive to the user's emotional state.
|
|
Emotion detection can also use other modalities such as vision
|
|
(facial expression, posture) and physiological sensors such as skin
|
|
conductance measurement or blood pressure. Multimodal approaches
|
|
where evidence is combined from multiple different modalities are
|
|
also of significance for emotion classification.</p>
|
|
<p>The creation of a markup language for emotion has been a recent
|
|
focus of attention in W3C. Work that originated in the W3C Emotion
|
|
Markup Language Incubator Group (<a href=
|
|
"http://www.w3.org/2005/Incubator/emotion/XGR-emotionml-20081120/">EmotionML
|
|
XG</a>) has now transitioned to the <a href=
|
|
"http://www.w3.org/2002/mmi/">W3C Multimodal Working Group</a> and
|
|
the <a href="http://www.w3.org/TR/emotionml">EmotionML</a> language
|
|
has been published as a working draft. One of the major use cases
|
|
for that effort is: "Automatic recognition of emotions from
|
|
sensors, including physiological sensors, speech recordings, facial
|
|
expressions, etc., as well as from multi-modal combinations of
|
|
sensors."</p>
|
|
<p>Given the similarities to the technologies and annotations used
|
|
for other kinds of input processing (recognition, semantic
|
|
classification) which are now captured in EMMA, it makes sense to
|
|
explore the use of EMMA for capture of emotional classification of
|
|
inputs. Just as EMMA does not standardize the application markup
|
|
for semantic results, though, it does not make sense to try to
|
|
standardize emotion markup within EMMA. One promising approach is
|
|
to combine the containers and metadata annotation of EMMA with the
|
|
<a href="http://www.w3.org/TR/emotionml">EmotionML</a> markup, as
|
|
shown in the following example.</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td width="50">expression of boredom</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example"
|
|
xmlns:emo="http://www.w3.org/2009/10/emotionml">
|
|
<emma:interpretation id="emo1"
|
|
emma:start="1241035886246"
|
|
emma:end="1241035888246"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:verbal="false"
|
|
emma:signal="http://example.com/input345.amr"
|
|
emma:media-type="audio/amr; rate:8000;"
|
|
emma:process="engine:type=emo_class&vn=1.2”>
|
|
<emo:emotion>
|
|
<emo:intensity
|
|
value="0.1"
|
|
confidence="0.8"/>
|
|
<emo:category
|
|
set="everydayEmotions"
|
|
name="boredom"
|
|
confidence="0.1"/>
|
|
</emo:emotion>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>In this example, we use the capabilities of EMMA for describing
|
|
the input signal, its temporal characteristics, modality, sampling
|
|
rate, audio codec, etc., and EmotionML is used to provide the
|
|
specific representation of the emotion. Other EMMA container
|
|
elements also have strong use cases for emotion recognition. For
|
|
example, <code><emma:one-of></code> can be used to represent
|
|
N-best lists of competing classifications of emotion. The
|
|
<code><emma:group></code> element could be used to combine a
|
|
semantic interpretation of a user input with an emotional
|
|
classification, as illustrated in the following example. Note that
|
|
all of the general properties of the signal can be specified on the
|
|
<code><emma:group></code> element.</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td width="50">spoken input "flights to boston tomorrow" to dialog
|
|
system in angry voice</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example"
|
|
xmlns:emo="http://www.w3.org/2009/10/emotionml">
|
|
<emma:group id="result1"
|
|
emma:start="1241035886246"
|
|
emma:end="1241035888246"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:verbal="false"
|
|
emma:signal="http://example.com/input345.amr"
|
|
emma:media-type="audio/amr; rate:8000;">
|
|
<emma:interpretation id="asr1"
|
|
emma:tokens="flights to boston tomorrow"
|
|
emma:confidence="0.76"
|
|
emma:process="engine:type=asr_nl&vn=5.2”>
|
|
<flight>
|
|
<dest>boston</dest>
|
|
<date>tomorrow</date>
|
|
</flight>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="emo1"
|
|
emma:process="engine:type=emo_class&vn=1.2”>
|
|
<emo:emotion>
|
|
<emo:intensity
|
|
value="0.3"
|
|
confidence="0.8"/>
|
|
<emo:category
|
|
set="everydayEmotions"
|
|
name="anger"
|
|
confidence="0.8"/>
|
|
</emo:emotion>
|
|
</emma:interpretation>
|
|
<emma:group-info>
|
|
meaning_and_emotion
|
|
</emma:group-info>
|
|
</emma:group>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>The element <code><emma:group></code> can also be used to
|
|
capture groups of emotion detection results from individual
|
|
modalities for combination by a multimodal fusion component or when
|
|
automatic recognition results are described together with manually
|
|
annotated data. This use case is inspired by <a href=
|
|
"http://www.w3.org/2005/Incubator/emotion/XGR-emotion/#AppendixUseCases">
|
|
Use case 2b (II)</a> of the Emotion Incubator Group Report. The
|
|
following example illustrates the grouping of three
|
|
interpretations, namely: a speech analysis emotion classifier, a
|
|
physiological emotion classifier measuring blood pressure, and a
|
|
human annotator viewing video, for two different media files (from
|
|
the same episode) that are synchronized via <code>emma:start</code>
|
|
and <code>emma:end</code> attributes. In this case, the
|
|
physiological reading is for a subinterval of the video and audio
|
|
recording.</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td width="50">audio, video, and physiological sensor of a test
|
|
user acting with a new design.</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example"
|
|
xmlns:emo="http://www.w3.org/2009/10/emotionml">
|
|
<emma:group id="result1">
|
|
<emma:interpretation id="speechClassification1"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:verbal="false"
|
|
emma:start="1241035884246"
|
|
emma:end="1241035887246"
|
|
emma:signal="http://example.com/video_345.mov"
|
|
emma:process="engine:type=emo_voice_classifier”>
|
|
<emo:emotion>
|
|
<emo:category
|
|
set="everydayEmotions"
|
|
name="anger"
|
|
confidence="0.8"/>
|
|
</emo:emotion>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="bloodPressure1"
|
|
emma:medium="tactile"
|
|
emma:mode="blood_pressure"
|
|
emma:verbal="false"
|
|
emma:start="1241035885300"
|
|
emma:end="1241035886900"
|
|
emma:signal="http://example.com/bp_signal_345.cvs"
|
|
emma:process="engine:type=emo_physiological_classifier”>
|
|
<emo:emotion>
|
|
<emo:category
|
|
set="everydayEmotions"
|
|
name="anger"
|
|
confidence="0.6"/>
|
|
</emo:emotion>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="humanAnnotation1"
|
|
emma:medium="visual"
|
|
emma:mode="video"
|
|
emma:verbal="false"
|
|
emma:start="1241035884246"
|
|
emma:end="1241035887246"
|
|
emma:signal="http://example.com/video_345.mov"
|
|
emma:process="human:type=labeler&id=1”>
|
|
<emo:emotion>
|
|
<emo:category
|
|
set="everydayEmotions"
|
|
name="fear"
|
|
confidence="0.6"/>
|
|
</emo:emotion>
|
|
</emma:interpretation>
|
|
<emma:group-info>
|
|
several_emotion_interpretations
|
|
</emma:group-info>
|
|
</emma:group>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>A combination of <code><emma:group></code> and
|
|
<code><emma:derivation></code> could be used to represent a
|
|
combined emotional analysis resulting from analysis of multiple
|
|
different modalities of the user's behavior. The
|
|
<code><emma:derived-from></code> and
|
|
<code><emma:derivation></code> elements can be used to
|
|
capture both the fused result and combining inputs in a single EMMA
|
|
document. In the following example, visual analysis of user
|
|
activity and analysis of their speech have been combined by a
|
|
multimodal fusion component to provide a combined multimodal
|
|
classification of the user's emotional state. The specifics of the
|
|
multimodal fusion algorithm are not relevant here, or to EMMA in
|
|
general. Note though that in this case, the multimodal fusion
|
|
appears to have compensated for uncertainty in the visual analysis
|
|
which gave two results with equal confidence, one for fear and one
|
|
for anger. The <code>emma:one-of</code> element is used to capture
|
|
the N-best list of multiple competing results from the video
|
|
classifier.</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td width="50">multimodal fusion of emotion classification of user
|
|
based on analysis of voice and video</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example"
|
|
xmlns:emo="http://www.w3.org/2009/10/emotionml">
|
|
<emma:interpretation id="multimodalClassification1"
|
|
emma:medium="acoustic,visual"
|
|
emma:mode="voice,video"
|
|
emma:verbal="false"
|
|
emma:start="1241035884246"
|
|
emma:end="1241035887246"
|
|
emma:process="engine:type=multimodal_fusion”>
|
|
<emo:emotion>
|
|
<emo:category
|
|
set="everydayEmotions"
|
|
name="anger"
|
|
confidence="0.7"/>
|
|
</emo:emotion>
|
|
<emma:derived-from ref="mmgroup1" composite="true"/>
|
|
</emma:interpretation>
|
|
<emma:derivation>
|
|
<emma:group id="mmgroup1">
|
|
<emma:interpretation id="speechClassification1"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:verbal="false"
|
|
emma:start="1241035884246"
|
|
emma:end="1241035887246"
|
|
emma:signal="http://example.com/video_345.mov"
|
|
emma:process="engine:type=emo_voice_classifier”>
|
|
<emo:emotion>
|
|
<emo:category
|
|
set="everydayEmotions"
|
|
name="anger"
|
|
confidence="0.8"/>
|
|
</emo:emotion>
|
|
</emma:interpretation>
|
|
<emma:one-of id="video_nbest"
|
|
emma:medium="visual"
|
|
emma:mode="video"
|
|
emma:verbal="false"
|
|
emma:start="1241035884246"
|
|
emma:end="1241035887246"
|
|
emma:signal="http://example.com/video_345.mov"
|
|
emma:process="engine:type=video_classifier">
|
|
<emma:interpretation id="video_result1"
|
|
<emo:emotion>
|
|
<emo:category
|
|
set="everydayEmotions"
|
|
name="anger"
|
|
confidence="0.5"/>
|
|
</emo:emotion>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="video_result2"
|
|
<emo:emotion>
|
|
<emo:category
|
|
set="everydayEmotions"
|
|
name="fear"
|
|
confidence="0.5"/>
|
|
</emo:emotion>
|
|
</emma:interpretation>
|
|
</emma:one-of>
|
|
<emma:group-info>
|
|
emotion_interpretations
|
|
</emma:group-info>
|
|
</emma:group>
|
|
</emma:derivation>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>One issue which needs to be addressed is the relationship between
|
|
EmotionML <code>confidence</code> attribute values and
|
|
<code>emma:confidence</code> values. Could the
|
|
<code>emma:confidence</code> value be used as an overall confidence
|
|
value for the emotion result, or should confidence values appear
|
|
only within the EmotionML markup since confidence is used for
|
|
different dimensions of the result? If a series of possible emotion
|
|
classifications are contained in <code>emma:one-of</code>, should
|
|
they be ordered by the EmotionML confidence values?</p>
|
|
<h3 id="s2.4">2.4 Richer semantic representations in EMMA</h3>
|
|
<p>Enriching the semantic information represented in EMMA would be
helpful for certain use cases. For example, the concepts in an EMMA
application semantics representation might include references to
concepts in an ontology such as WordNet. In the following example,
inputs to a machine translation system are annotated in the
application semantics with specific WordNet senses, which are used
to distinguish among different senses of the words. A translation
system might also make use of a sense disambiguator to represent
the probabilities of different senses of a word; for example,
"spicy" in the example has two possible WordNet senses.</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td>I love to eat Mexican food because it is spicy</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example"
|
|
xmlns="http://example.com/universal_translator">
|
|
<emma:interpretation id="spanish">
|
|
<result xml:lang="es">
|
|
Adoro alimento mejicano porque es picante.
|
|
</result>
|
|
<emma:derived-from resource="#english" composite="false"/>
|
|
</emma:interpretation>
|
|
<emma:derivation>
|
|
<emma:interpretation id="english"
|
|
emma:tokens="I love to eat Mexican food
|
|
because it is spicy">
|
|
<assertion>
|
|
<interaction
|
|
wordnet="1828736"
|
|
wordnet-desc="love, enjoy (get pleasure from)"
|
|
token="love">
|
|
<experiencer
|
|
reference="first"
|
|
token="I">
|
|
<attribute quantity="single"/>
|
|
</experiencer>
|
|
<attribute time="present"/>
|
|
<content>
|
|
<interaction wordnet="1157345"
|
|
wordnet-desc="eat (take in solid food)"
|
|
token="to eat">
|
|
<object id="obj1"
|
|
wordnet="7555863"
|
|
wordnet-desc="food, solid food (any solid
|
|
substance (as opposed to
|
|
liquid) that is used as a source
|
|
of nourishment)"
|
|
token="food">
|
|
<restriction
|
|
wordnet="3026902"
|
|
wordnet-desc="Mexican (of or relating
|
|
to Mexico or its inhabitants)"
|
|
token="Mexican"/>
|
|
</object>
|
|
</interaction>
|
|
</content>
|
|
<reason token="because">
|
|
<experiencer reference="third"
|
|
target="obj1" token="it"/>
|
|
<attribute time="present"/>
|
|
<one-of token="spicy">
|
|
<modification wordnet="2397732"
|
|
wordnet-desc="hot, spicy (producing a
|
|
burning sensation on
|
|
the taste nerves)"
|
|
confidence="0.8"/>
|
|
<modification wordnet="2398378"
|
|
wordnet-desc="piquant, savory,
|
|
savoury, spicy, zesty
|
|
(having an agreeably
|
|
pungent taste)"
|
|
confidence="0.4"/>
|
|
</one-of>
|
|
</reason>
|
|
</interaction>
|
|
</assertion>
|
|
</emma:interpretation>
|
|
</emma:derivation>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>In addition to sense disambiguation it could also be useful to
|
|
relate concepts to superordinate concepts in some ontology. For
|
|
example, it could be useful to know that O'Hare is an airport and
|
|
Chicago is a city, even though they might be used interchangeably
|
|
in an application. For instance, in an air travel application a user
|
|
might say "I want to fly to O'Hare" or "I want to fly to
|
|
Chicago".</p>
|
|
<h3 id="s2.5">2.5 Representing system output in EMMA</h3>
|
|
<p><a href="http://www.w3.org/TR/emma/">EMMA 1.0</a> was explicitly
|
|
limited in scope to representation of the interpretation of user
|
|
inputs. Most interactive systems also produce system output and one
|
|
of the major possible extensions of the EMMA language would be to
|
|
provide support for representation of the outputs made by the
|
|
system in addition to the user inputs. One advantage of having EMMA
|
|
representation for system output is that system logs can have
|
|
unified markup representation across input and output for viewing
|
|
and analyzing user/system interactions. In this section, we
|
|
consider two different use cases for addition of output
|
|
representation to EMMA.</p>
|
|
<h4 id="s2.5.1">2.5.1 Abstracting output from specific modality or
|
|
output language</h4>
|
|
<p>It is desirable for a multimodal dialog designer to be able to
|
|
isolate dialog flow (for example <a href=
|
|
"http://www.w3.org/TR/2009/WD-scxml-20091029/">SCXML</a> code) from
|
|
the details of specific utterances produced by a system. This can
|
|
be achieved by using a presentation or media planning component that
|
|
takes the abstract intent from the system and creates one or more
|
|
modality-specific presentations. In addition to isolating dialog
|
|
logic from specific modality choice, this can also make it easier to
|
|
support different technologies for the same modality. For example,
|
|
in the example below, the GUI technology is HTML, but abstracting
|
|
output would also support using a different GUI technology like
|
|
Flash, or <a href="http://www.w3.org/Graphics/SVG/">SVG</a>. If
|
|
EMMA is extended to support output, then EMMA documents could be
|
|
used for communication from the dialog manager to the presentation
|
|
planning component, and also potentially for the documents
|
|
generated by the presentation component, which could embed specific
|
|
markup such as HTML and <a href=
|
|
"http://www.w3.org/TR/speech-synthesis/">SSML</a>. Just as there
|
|
can be multiple different stages of processing of a user input,
|
|
there may be multiple stages of processing of an output, and the
|
|
mechanisms of EMMA can be used to capture and provide metadata on
|
|
these various stages of output processing.</p>
|
|
<p>Potential benefits for this approach include:</p>
|
|
<ol>
|
|
<li>Accessibility: it would be useful for an application to be able
|
|
to accommodate users who might have an assistive device or devices
|
|
without requiring special logic or even special applications.</li>
|
|
<li>Device independence: An application could separate the flow in
|
|
the IM from the details of the presentation. This might be
|
|
especially useful if there are a lot of target devices with
|
|
different types of screens, cameras, or possibilities for haptic
|
|
output.</li>
|
|
<li>Adapting to user preferences: An application could accommodate
|
|
different dynamic preferences, for example, switching to visual
|
|
presentation from speech in public places without disturbing the
|
|
application flow.</li>
|
|
</ol>
|
|
<p>In the following example, we consider the introduction of a new
|
|
EMMA element, <code><emma:presentation></code> which is the
|
|
output equivalent of the input element
|
|
<code><emma:interpretation></code>. Like
|
|
<code><emma:interpretation></code> this element can take
|
|
<code>emma:medium</code> and <code>emma:mode</code> attributes
|
|
classifying the specific modality. It could also potentially take
|
|
timestamp annotations indicating the time at which the output
|
|
should be produced. One issue is whether timestamps should be used
|
|
for the intended time of production or for the actual time of
|
|
production and how to capture both. Relative timestamps could be
|
|
used to anchor the planned time of presentation to another element
|
|
of system output. In this example we show how the
|
|
<code>emma:semantic-rep</code> attribute proposed in <a href=
|
|
"#s2.12">Section 2.12</a> could potentially be used to indicate the
|
|
markup language of the output.</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Output</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">IM (step 1)</td>
|
|
<td>semantics of "what would you like for lunch?"</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:presentation>
|
|
<question>
|
|
<topic>lunch</topic>
|
|
<experiencer>second person</experiencer>
|
|
<object>questioned</object>
|
|
</question>
|
|
</emma:presentation>
|
|
</emma:emma>
|
|
|
|
</pre>
|
|
<p>or, more simply, without natural language generation:</p>
|
|
<pre>
|
|
<emma:emma>
|
|
<emma:presentation>
|
|
<text>what would you like for lunch?</text>
|
|
</emma:presentation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">presentation manager (voice output)</td>
|
|
<td>text "what would you like for lunch?"</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:presentation
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:verbal="true"
|
|
emma:function="dialog"
|
|
emma:semantic-rep="ssml">
|
|
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
|
|
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
|
|
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
|
|
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
|
|
xml:lang="en-US">
|
|
what would you like for lunch</speak>
|
|
</emma:presentation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">presentation manager (GUI output)</td>
|
|
<td>text "what would you like for lunch?"</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:presentation
|
|
emma:medium="visual"
|
|
emma:mode="graphics"
|
|
emma:verbal="true"
|
|
emma:function="dialog"
|
|
emma:semantic-rep="html">
|
|
<html>
|
|
<body>
|
|
<p>what would you like for lunch?"</p>
|
|
<input name="" type="text">
|
|
<input type="submit" name="Submit"
|
|
value="Submit">
|
|
</body>
|
|
</html>
|
|
</emma:presentation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<h4 id="s2.5.2">2.5.2 Coordination of outputs distributed over
|
|
multiple different modalities</h4>
|
|
<p>A critical issue for effective multimodal
output is the synchronization of outputs in different output
|
|
media. For example, text to speech output or prompts may be
|
|
coordinated with graphical outputs such as highlighting of items in
|
|
an HTML table. EMMA markup could potentially be used to indicate
|
|
that elements in each medium should be coordinated in their
|
|
presentation. In the following example, a new attribute
|
|
<code>emma:sync</code> is used to indicate the relationship between
|
|
a <code><mark></code> in <a href=
|
|
"http://www.w3.org/TR/speech-synthesis/">SSML</a> and an element to
|
|
be highlighted in HTML content. The <code>emma:process</code>
|
|
attribute could be used to identify the presentation planning
|
|
component. Again <code>emma:semantic-rep</code> is used to indicate
|
|
the embedded markup language.</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Output</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">system</td>
|
|
<td width="50">Coordinated presentation of table with TTS</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:group id="gp1"
|
|
emma:medium="acoustic,visual"
|
|
emma:mode="voice,graphics"
|
|
emma:process="http://example.com/presentation_planner">
|
|
<emma:presentation id="pres1"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:verbal="true"
|
|
emma:function="dialog"
|
|
emma:semantic-rep="ssml">
|
|
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
|
|
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
|
|
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
|
|
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
|
|
xml:lang="en-US">
|
|
Item 4 <mark emma:sync="123"/> costs fifteen dollars.
|
|
</speak>
|
|
</emma:presentation>
|
|
<emma:presentation id="pres2"
|
|
emma:medium="visual"
|
|
emma:mode="graphics"
|
|
emma:verbal="true"
|
|
emma:function="dialog"
|
|
emma:semantic-rep="html"
|
|
<table xmlns="http://www.w3.org/1999/xhtml">
|
|
<tr>
|
|
<td emma:sync="123">Item 4</td>
|
|
<td>15 dollars</td>
|
|
</tr>
|
|
</table>
|
|
</emma:presentation>
|
|
</emma:group>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>One issue to be considered is the potential role of the
|
|
Synchronized Multimedia Integration Language (<a href=
|
|
"http://www.w3.org/TR/REC-smil/">SMIL</a>) for capturing multimodal
|
|
output synchronization. SMIL markup for multimedia presentation
|
|
could potentially be embedded within EMMA markup coming from an
|
|
interaction manager to a client for rendering.</p>
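<p>As a purely schematic, non-normative sketch, a presentation
planning component might embed a SMIL fragment within an
<code><emma:presentation></code> element in the same way that
SSML and HTML are embedded in the examples above. The media file
names, and the use of "smil" as a value of the proposed
<code>emma:semantic-rep</code> attribute, are hypothetical:</p>
<div class="exampleInner">
<pre>
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:presentation id="pres3"
    emma:medium="acoustic,visual"
    emma:mode="voice,graphics"
    emma:function="dialog"
    emma:semantic-rep="smil">
    <smil xmlns="http://www.w3.org/ns/SMIL">
      <body>
        <par>
          <audio src="http://example.com/prompt345.wav"/>
          <img src="http://example.com/map345.png" dur="5s"/>
        </par>
      </body>
    </smil>
  </emma:presentation>
</emma:emma>
</pre>
</div>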
|
|
<h3 id="s2.6">2.6 Representation of dialogs in EMMA</h3>
|
|
<p>The scope of <a href="http://www.w3.org/TR/emma/">EMMA 1.0</a>
|
|
was explicitly limited to representation of single turns of user
|
|
input. For logging, analysis, and training purposes it could be
|
|
useful to be able to represent multi-stage dialogs in EMMA. The
|
|
following example shows a sequence of two EMMA documents where
|
|
the first is a request from the system and the second is the user
|
|
response. A new attribute <code>emma:in-response-to</code> is used
|
|
to relate the system output to the user input. EMMA already has an
|
|
attribute <code>emma:dialog-turn</code> used to provide an
|
|
indicator of the turn of interaction.</p>
|
|
<h4 id="dialog_example">Example</h4>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">system</td>
|
|
<td width="50">where would you like to go?</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:presentation id="pres1"
|
|
emma:dialog-turn="turn1"
|
|
emma:in-response-to="initial">
|
|
<prompt>
|
|
where would you like to go?
|
|
</prompt>
|
|
</emma:presentation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td>New York</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:interpretation id="int1"
|
|
emma:dialog-turn="turn2"
|
|
emma:tokens="new york"
|
|
emma:in-response-to="pres1">
|
|
<location>
|
|
New York
|
|
</location>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>In this case, each utterance is still a single EMMA document,
|
|
and markup is being used to encode the fact that the utterances are
|
|
part of an ongoing dialog. Another possibility would be to use EMMA
|
|
markup to contain a whole dialog within a single EMMA document. For
|
|
example, a flight query dialog could be represented as follows
|
|
using <code><emma:sequence></code>:</p>
|
|
<h4 id="sequence_example">Example</h4>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td>flights to boston</td>
|
|
<td rowspan="5">
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:sequence>
|
|
<emma:interpretation id="user1"
|
|
emma:dialog-turn="turn1"
|
|
emma:in-response-to="initial">
|
|
<emma:literal>
|
|
flights to boston
|
|
</emma:literal>
|
|
</emma:interpretation>
|
|
<emma:presentation id="sys1"
|
|
emma:dialog-turn="turn2"
|
|
emma:in-response-to="user1">
|
|
<prompt>
|
|
traveling to boston,
|
|
which departure city
|
|
</prompt>
|
|
</emma:presentation>
|
|
<emma:interpretation id="user2"
|
|
emma:dialog-turn="turn3"
|
|
emma:in-response-to="sys1">
|
|
<emma:literal>
|
|
san francisco
|
|
</emma:literal>
|
|
</emma:interpretation>
|
|
<emma:presentation id="sys2"
|
|
emma:dialog-turn="turn4"
|
|
emma:in-response-to="user2">
|
|
<prompt>
|
|
departure date
|
|
</prompt>
|
|
</emma:presentation>
|
|
<emma:interpretation id="user3"
|
|
emma:dialog-turn="turn5"
|
|
emma:in-response-to="sys2">
|
|
<emma:literal>
|
|
next thursday
|
|
</emma:literal>
|
|
</emma:interpretation>
|
|
</emma:sequence>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">system</td>
|
|
<td>traveling to Boston, which departure city?</td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td>San Francisco</td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">system</td>
|
|
<td>departure date</td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td>next thursday</td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>Note that in this example with
|
|
<code><emma:sequence></code> the
|
|
<code>emma:in-response-to</code> attribute is still important since
|
|
there is no guarantee that an utterance in a dialog is a response
|
|
to the previous utterance. For example, a sequence of utterances
|
|
may all be from the user.</p>
|
|
<p>One issue that arises with the representation of whole dialogs
|
|
is that the resulting EMMA documents with full sets of metadata may
|
|
become quite large. One possible extension that could help with
|
|
this would be to allow the value of <code>emma:in-response-to</code>
|
|
to be URI valued so it can refer to another EMMA document.</p>
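<p>As a rough, non-normative sketch, the value of
<code>emma:in-response-to</code> could then be a URI (hypothetical
here) identifying the EMMA document that contains the system prompt,
rather than a local identifier:</p>
<div class="exampleInner">
<pre>
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation id="int1"
    emma:dialog-turn="turn2"
    emma:tokens="new york"
    emma:in-response-to="http://example.com/logs/dialog123.emma#pres1">
    <location>New York</location>
  </emma:interpretation>
</emma:emma>
</pre>
</div>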
|
|
<h3 id="s2.7">2.7 Logging, analysis, and annotation</h3>
|
|
<p>EMMA was initially designed to facilitate communication among
|
|
components of an interactive system. It has become clear over time
|
|
that the language can also play an important role in logging of
|
|
user/system interactions. In this section, we consider possible
|
|
advantages of EMMA for log analysis and illustrate how elements
|
|
such as <code><emma:derived-from></code> could be used to
|
|
capture and provide metadata on annotations made by human
|
|
annotators.</p>
|
|
<h3 id="s2.7.1">2.7.1 Log analysis</h3>
|
|
<p>The proposal above for representing system output in EMMA would
|
|
support after-the-fact analysis of dialogs. For example, if both
|
|
the system's and the user's utterance are represented in EMMA, it
|
|
should be much easier to examine relationships between factors such
|
|
as how the wording of prompts might affect users' responses or even
|
|
the modality that users select for their responses. It would also
|
|
be easier to study timing relationships between the system prompt
|
|
and the user's responses. For example, prompts that are confusing
|
|
might consistently elicit longer times before the user starts
|
|
speaking. This would be useful even without a presentation manager
|
|
or fission component. In the following example, it might be useful
|
|
to look into the relationship between the end of the prompt and the
|
|
start of the user's response. We use here the
|
|
<code>emma:in-response-to</code> attribute suggested in <a href=
|
|
"#s2.6">Section 2.6</a> for the representation of dialogs in
|
|
EMMA.</p>
|
|
<h4 id="log_example">Example</h4>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">system</td>
|
|
<td>where would you like to go?</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:presentation id="pres1"
|
|
emma:dialog-turn="turn1"
|
|
emma:in-response-to="initial"
|
|
emma:start="1241035886246"
|
|
emma:end="1241035888306">
|
|
<prompt>
|
|
where would you like to go?
|
|
</prompt>
|
|
</emma:presentation>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td>New York</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:interpretation id="int1"
|
|
emma:dialog-turn="turn2"
|
|
emma:in-response-to="pres1"
|
|
emma:start="1241035891246"
|
|
emma:end="1241035893000"">
|
|
<destination>
|
|
New York
|
|
</destination>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<h3 id="s2.7.2">2.7.2 Log annotation</h3>
|
|
<p>EMMA is generally used to show the recognition, semantic
|
|
interpretation etc. assigned to inputs based on <em>machine</em>
|
|
processing of the user input. Another potential use case is to
|
|
provide a mechanism for showing the interpretation assigned to an
|
|
input by a human annotator and using
|
|
<code><emma:derived-from></code> to show the relationship
|
|
between the input received and the annotation. The
|
|
<code><emma:one-of></code> element can then be used to show
|
|
multiple competing annotations for an input. The
|
|
<code><emma:group></code> element could be used to contain
|
|
multiple different kinds of annotation on a single input. One
|
|
question here is whether <code>emma:process</code> can be used for
|
|
identification of the labeller, and whether there is a need for any
|
|
additional EMMA machinery to better support this use case. In
|
|
these examples, <code><emma:literal></code> contains mixed
|
|
content with text and elements. This is in keeping with the EMMA
|
|
1.0 schema.</p>
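<p>For instance, the following minimal sketch (using only existing EMMA
elements) lists two competing annotations of the same utterance under
<code><emma:one-of></code>; the annotator names and the small
difference in tagging are illustrative:</p>
<pre>
<emma:one-of id="annotations1">
  <emma:interpretation id="annotationA"
     emma:process="annotate:type=semantic&annotator=michael">
    <emma:literal>
      flights from <src>san francisco</src> to <dest>boston</dest>
      on <date>the fourth of september</date>
    </emma:literal>
  </emma:interpretation>
  <emma:interpretation id="annotationB"
     emma:process="annotate:type=semantic&annotator=debbie">
    <emma:literal>
      flights from <src>san francisco</src> to <dest>boston</dest>
      on the <date>fourth of september</date>
    </emma:literal>
  </emma:interpretation>
</emma:one-of>
</pre>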
|
|
<p>One issue that arises concerns the meaning of an
|
|
<code>emma:confidence</code> value on an annotated interpretation.
|
|
It may be preferable to have another attribute for annotator
|
|
confidence rather than overloading the current
|
|
<code>emma:confidence</code>.</p>
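<p>A purely hypothetical sketch of such an attribute, here called
<code>emma:annotator-confidence</code> (a name not defined in EMMA
1.0), which would leave <code>emma:confidence</code> free for
recognizer confidence:</p>
<pre>
<emma:interpretation id="annotation1"
     emma:process="annotate:type=semantic&annotator=michael"
     emma:annotator-confidence="0.95">
  <emma:literal>
    flights from <src>san francisco</src> to <dest>boston</dest>
    on <date>the fourth of september</date>
  </emma:literal>
  <emma:derived-from resource="#asr1"/>
</emma:interpretation>
</pre>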
|
|
<p>Another issue concerns mixing of system results and human
|
|
annotation. Should these be grouped, or is the annotation derived
from the system's interpretation? It would also be useful to
capture the time of the annotation; the current timestamps record
the time of the input itself. Where should annotation
|
|
timestamps be recorded?</p>
|
|
<p>It would also be useful to have a way to specify open ended
|
|
information about the annotator such as their native language,
|
|
profession, experience, etc. One approach would be to have a
|
|
new attribute e.g. <code>emma:annotator</code> with a URI value
|
|
that could point to a description of the annotator.</p>
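<p>A brief sketch of this idea, assuming a hypothetical
<code>emma:annotator</code> attribute whose value is an example URI
describing the annotator:</p>
<pre>
<emma:interpretation id="annotation1"
     emma:process="annotate:type=semantic"
     emma:annotator="http://example.com/annotators/michael.xml">
  <emma:literal>
    flights from <src>san francisco</src> to <dest>boston</dest>
  </emma:literal>
  <emma:derived-from resource="#asr1"/>
</emma:interpretation>
</pre>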
|
|
<p>For very common annotations it could be useful to have, in
addition to <code>emma:tokens</code>, a dedicated element to
indicate the annotated transcription, for example,
|
|
<code>emma:annotated-tokens</code> or
|
|
<code>emma:transcription</code>.</p>
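<p>A hypothetical sketch of the second of these, an
<code>emma:transcription</code> element (not part of EMMA 1.0) holding
the human transcription alongside the recognizer's
<code>emma:tokens</code>:</p>
<pre>
<emma:interpretation id="annotation3"
     emma:tokens="flights from san francisco to boston"
     emma:process="annotate:type=transcription&annotator=debbie">
  <emma:transcription>
    flights from san francisco to uh boston please
  </emma:transcription>
</emma:interpretation>
</pre>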
|
|
<p>In the following example, we show how
|
|
<code>emma:interpretation</code> and <code>emma:derived-from</code>
|
|
could be used to capture the annotation of an input.</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td width="614"><strong>Input</strong></td>
|
|
<td width="531"><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="93">user</td>
|
|
<td>
|
|
<p>In this example the user has said:</p>
|
|
<p>"flights from boston to san francisco leaving on the fourth of
|
|
september"</p>
|
|
<p>and the semantic interpretation here is a semantic tagging of
|
|
the utterance done by a human annotator. <code>emma:process</code> is used to
provide details about the annotation.</p>
|
|
</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:interpretation id="annotation1"
|
|
emma:process="annotate:type=semantic&annotator=michael"
|
|
emma:confidence="0.90">
|
|
<emma:literal>
|
|
flights from <src>san francisco</src> to
|
|
<dest>boston</dest> on
|
|
<date>the fourth of september</date>
|
|
</emma:literal>
|
|
<emma:derived-from resource="#asr1"/>
|
|
</emma:interpretation>
|
|
<emma:derivation>
|
|
<emma:interpretation id="asr1"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:function="dialog"
|
|
emma:verbal="true"
|
|
emma:lang="en-US"
|
|
emma:start="1241690021513"
|
|
emma:end="1241690023033"
|
|
emma:media-type="audio/amr; rate=8000"
|
|
emma:process="smm:type=asr&version=watson6"
|
|
emma:confidence="0.80">
|
|
<emma:literal>
|
|
flights from san francisco
|
|
to boston on the fourth of september
|
|
</emma:literal>
|
|
</emma:interpretation>
|
|
</emma:derivation>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>Taking this example a step further,
|
|
<code><emma:group></code> could be used to group annotations
|
|
made by multiple different annotators of the same utterance:</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td width="614"><strong>Input</strong></td>
|
|
<td width="531"><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="93">user</td>
|
|
<td>
|
|
<p>In this example the user has said:</p>
|
|
<p>"flights from boston to san francisco leaving on the fourth of
|
|
september"</p>
|
|
<p>and the semantic interpretation here is a semantic tagging of
|
|
the utterance done by two different human annotators.
|
|
<code>emma:process</code> is used to provide details about the
|
|
annotation.</p>
|
|
</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:group emma:confidence="1.0">
|
|
<emma:interpretation id="annotation1"
|
|
emma:process="annotate:type=semantic&annotator=michael"
|
|
emma:confidence="0.90">
|
|
<emma:literal>
|
|
flights from <src>san francisco</src>
|
|
to <dest>boston</dest>
|
|
on <date>the fourth of september</date>
|
|
</emma:literal>
|
|
<emma:derived-from resource="#asr1"/>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="annotation2"
|
|
emma:process="annotate:type=semantic&annotator=debbie"
|
|
emma:confidence="0.90">
|
|
<emma:literal>
|
|
flights from <src>san francisco</src>
|
|
to <dest>boston</dest> on
|
|
<date>the fourth of september</date>
|
|
</emma:literal>
|
|
<emma:derived-from resource="#asr1"/>
|
|
</emma:interpretation>
|
|
<emma:group-info>semantic_annotations</emma:group-info>
|
|
</emma:group>
|
|
<emma:derivation>
|
|
<emma:interpretation id="asr1"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:function="dialog"
|
|
emma:verbal="true"
|
|
emma:lang="en-US"
|
|
emma:start="1241690021513"
|
|
emma:end="1241690023033"
|
|
emma:media-type="audio/amr; rate=8000"
|
|
emma:process="smm:type=asr&version=watson6"
|
|
emma:confidence="0.80">
|
|
<emma:literal>
|
|
flights from san francisco to boston
|
|
on the fourth of september
|
|
</emma:literal>
|
|
</emma:interpretation>
|
|
</emma:derivation>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<h3 id="s2.8">2.8 Multisentence Inputs</h3>
|
|
<p>For certain applications, it is useful to be able to represent
|
|
the semantics of multi-sentence inputs, which may be in one or more
|
|
modalities such as speech (e.g. voicemail), text (e.g. email), or
|
|
handwritten input. One application use case is for summarizing a
|
|
voicemail or email. We develop this example below.</p>
|
|
<p>There are at least two possible approaches to addressing this
|
|
use case.</p>
|
|
<ol>
|
|
<li>If there is no reason to distinguish the individual sentences
|
|
of the input or interpret them individually, the entire input could
|
|
be included as the value of the <code>emma:tokens</code> attribute
|
|
of an <code><emma:interpretation></code> or
|
|
<code><emma:one-of></code> element, where the semantics of
|
|
the input is represented as the value of an
|
|
<code><emma:interpretation></code>. Although in principle
|
|
there is no upper limit on the length of an <code>emma:tokens</code>
|
|
attribute, in practice, this approach might be cumbersome for
|
|
longer or more complicated texts.</li>
|
|
<li>If more structure is required, the interpretations of the
|
|
individual sentences in the input could be grouped as individual
|
|
<code><emma:interpretation></code> elements under an
|
|
<code><emma:sequence></code> element. A single unified
|
|
semantics representing the meaning of the entire input could then
be represented in a separate interpretation, with the sequence
referenced through <code><emma:derived-from></code> (see the
sketch after the example below).</li>
|
|
</ol>
|
|
<p>The example below illustrates the first approach.</p>
|
|
<h4 id="multisentence_example">Example</h4>
|
|
<table border="1">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td width="614"><strong>Input</strong></td>
|
|
<td width="531"><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="93">user</td>
|
|
<td>
|
|
<p>Hi Group,</p>
|
|
<p>You are all invited to lunch tomorrow at Tony's Pizza at 12:00.
|
|
Please let me know if you're planning to come so that I can make
|
|
reservations. Also let me know if you have any dietary
|
|
restrictions. Tony's Pizza is at 1234 Main Street. We will be
|
|
discussing ways of using EMMA.</p>
|
|
<p>Debbie</p>
|
|
</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:interpretation id="interp1"
|
|
emma:tokens="Hi Group, You are all invited to
|
|
lunch tomorrow at Tony's Pizza at 12:00.
|
|
Please let me know if you're planning to
|
|
come so that I can make reservations.
|
|
Also let me know if you have any dietary
|
|
restrictions. Tony's Pizza is at 1234
|
|
Main Street. We will be discussing
|
|
ways of using EMMA." >
|
|
<business-event>lunch</business-event>
|
|
<host>debbie</host>
|
|
<attendees>group</attendees>
|
|
<location>
|
|
<name>Tony's Pizza</name>
|
|
<address> 1234 Main Street</address>
|
|
</location>
|
|
<date>Tuesday, March 24</date>
|
|
<needs-rsvp>true</needs-rsvp>
|
|
<needs-restrictions>true</needs-restrictions>
|
|
<topic>ways of using EMMA</topic>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
|
|
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
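<p>For comparison, the following is a minimal sketch of the second
approach: each sentence receives its own interpretation inside an
<code><emma:sequence></code> held in the derivation, and a single
unified interpretation refers to that sequence through
<code><emma:derived-from></code>. The ids and the subset of
application elements shown are illustrative only.</p>
<pre>
<emma:emma
   version="2.0"
   xmlns:emma="http://www.w3.org/2003/04/emma"
   xmlns="http://www.example.com/example">
 <emma:interpretation id="summary1">
   <business-event>lunch</business-event>
   <location>
     <name>Tony's Pizza</name>
     <address>1234 Main Street</address>
   </location>
   <needs-rsvp>true</needs-rsvp>
   <emma:derived-from resource="#sentences1"/>
 </emma:interpretation>
 <emma:derivation>
   <emma:sequence id="sentences1">
     <emma:interpretation id="s1"
        emma:tokens="You are all invited to lunch tomorrow at Tony's Pizza at 12:00.">
       <business-event>lunch</business-event>
     </emma:interpretation>
     <emma:interpretation id="s2"
        emma:tokens="Please let me know if you're planning to come so that I can make reservations.">
       <needs-rsvp>true</needs-rsvp>
     </emma:interpretation>
     <emma:interpretation id="s3"
        emma:tokens="Tony's Pizza is at 1234 Main Street.">
       <location>
         <name>Tony's Pizza</name>
         <address>1234 Main Street</address>
       </location>
     </emma:interpretation>
   </emma:sequence>
 </emma:derivation>
</emma:emma>
</pre>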
|
|
<h3 id="s2.9">2.9 Multi-participant interactions</h3>
|
|
<p><a href="http://www.w3.org/TR/emma/">EMMA 1.0</a> primarily
|
|
focussed on the interpretation of inputs from a single user. Both
|
|
for annotation of human-human dialogs and for the emerging systems
|
|
which support dialog or multimodal interaction with multiple
|
|
participants (such as multimodal systems for meeting analysis), it
|
|
is important to support annotation of interactions involving
|
|
multiple different participants. The proposals above for capturing
|
|
dialog can play an important role. One possible further extension
|
|
would be to add specific markup for annotation of the user making a
|
|
particular contribution. In the following example, we use an
|
|
attribute <code>emma:participant</code> to identify the participant
|
|
contributing each response to the prompt.</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td width="668"><strong>Input</strong></td>
|
|
<td width="480"><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="90">system</td>
|
|
<td>Please tell me your lunch orders</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:presentation id="pres1"
|
|
emma:dialog-turn="turn1"
|
|
emma:in-response-to="initial"
|
|
emma:start="1241035886246"
|
|
emma:end="1241035888306">
|
|
<prompt>please tell me your lunch orders</prompt>
|
|
</emma:presentation>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="90">user1</td>
|
|
<td>I'll have a mushroom pizza</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:interpretation id="int1"
|
|
emma:dialog-turn="turn2"
|
|
emma:in-response-to="pres1"
|
|
emma:participant="user1"
|
|
emma:start="1241035891246"
|
|
emma:end="1241035893000"">
|
|
<pizza>
|
|
<topping>
|
|
mushroom
|
|
</topping>
|
|
</pizza>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="90">user3</td>
|
|
<td>I'll have a pepperoni pizza.</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:interpretation id="int2"
|
|
emma:dialog-turn="turn3"
|
|
emma:in-response-to="pres1"
|
|
emma:participant="user2"
|
|
emma:start="1241035896246"
|
|
emma:end="1241035899000"">
|
|
<pizza>
|
|
<topping>
|
|
pepperoni
|
|
</topping>
|
|
</pizza>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<h3 id="s2.10">2.10 Capturing sensor data such as GPS in EMMA</h3>
|
|
<p>The multimodal examples described in the <a href=
|
|
"http://www.w3.org/TR/emma/">EMMA 1.0</a> specification, include
|
|
combination of spoken input with a location specified by touch or
|
|
pen. With the increase in availability of GPS and other location
|
|
sensing technology such as cell tower triangulation in mobile
|
|
devices, it is desirable to provide a method for annotating inputs
|
|
with the device location and, in some cases, fusing the GPS
|
|
information with the spoken command in order to derive a complete
|
|
interpretation. GPS information could potentially be determined
|
|
using the <a href=
|
|
"http://www.w3.org/TR/2009/WD-geolocation-API-20090707/">Geolocation
|
|
API Specification</a> from the <a href=
|
|
"http://www.w3.org/2008/geolocation/">Geolocation working group</a>
|
|
and then encoded into an EMMA result sent to a server for
|
|
fusion.</p>
|
|
<p>One possibility using the current EMMA capabilities is to use
|
|
<code><emma:group></code> to associate GPS markup with the
|
|
semantics of a spoken command. For example, the user might say
|
|
"where is the nearest pizza place?" and the interpretation of the
|
|
spoken command is grouped with markup capturing the GPS sensor
|
|
data. This example uses the existing
|
|
<code><emma:group></code> element and extends the set of
|
|
values of <code>emma:medium</code> and <code>emma:mode</code> to
|
|
include <code>"sensor"</code> and <code>"gps"</code>
|
|
respectively.</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td width="50">where is the nearest pizza place?</td>
|
|
<td rowspan="2">
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:group>
|
|
<emma:interpretation id="speech1"
|
|
emma:tokens="where is the nearest pizza place"
|
|
emma:confidence="0.9"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:start="1241035887111"
|
|
emma:end="1241035888200"
|
|
emma:process="reco:type=asr&version=asr_eng2.4"
|
|
emma:media-type="audio/amr; rate=8000"
|
|
emma:lang="en-US">
|
|
<category>pizza</category>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="gps1"
|
|
emma:medium="sensor"
|
|
emma:mode="gps"
|
|
emma:start="1241035886246"
|
|
emma:end="1241035886246">
|
|
<lat>40.777463</lat>
|
|
<lon>-74.410500</lon>
|
|
<alt>0.2</alt>
|
|
</emma:interpretation>
|
|
<emma:group-info>geolocation</emma:group-info>
|
|
</emma:group>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">GPS</td>
|
|
<td>(GPS coordinates)</td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>Another, more abbreviated, way to incorporate sensor information
|
|
would be to have spatial correlates of the timestamps and allow for
|
|
location stamping of user inputs, e.g. <code>emma:lat</code> and
|
|
<code>emma:lon</code> attributes that could appear on EMMA
|
|
container elements to indicate the location where the input was
|
|
produced.</p>
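<p>A minimal sketch of this alternative, using hypothetical
<code>emma:lat</code> and <code>emma:lon</code> attributes that are not
defined in EMMA 1.0:</p>
<pre>
<emma:interpretation id="int1"
   emma:tokens="where is the nearest pizza place"
   emma:medium="acoustic"
   emma:mode="voice"
   emma:lat="40.777463"
   emma:lon="-74.410500">
  <category>pizza</category>
</emma:interpretation>
</pre>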
|
|
<h3 id="s2.11">2.11 Extending EMMA from NLU to also represent
|
|
search or database retrieval results</h3>
|
|
<p>In many of the use cases considered so far, EMMA is used for
|
|
representation of the results of speech recognition and then for
|
|
the results of natural language understanding, and possibly
|
|
multimodal fusion. In systems used for voice search, the next step
|
|
is often to conduct search and extract a set of records or
|
|
documents. Strictly speaking, this stage of processing is out of
|
|
scope for EMMA. It is odd though to have the mechanisms of EMMA
|
|
such as <code><emma:one-of></code> for ambiguity all the way
|
|
up to NLU or multimodal fusion, but not to have access to the same
|
|
apparatus for representation of the next stage of processing which
|
|
can often be search or database lookup. Just as we can use
|
|
<code><emma:one-of></code> and <code>emma:confidence</code>
|
|
to represent N-best recognitions or semantic interpretations,
|
|
similarly we can use them to represent a series of search results
|
|
along with their relative confidence. One issue is whether we need
|
|
some measure other than confidence for relevance ranking, or whether
the same confidence attribute can be used.</p>
|
|
<p>One issue that arises is whether it would be useful to have some
|
|
recommended or standardized element to use for query results, e.g.
|
|
<code><result></code> as in the following example. Another
|
|
issue is how to annotate information about the database and the
|
|
query that was issued. The database could be indicated as part of
|
|
the <code>emma:process</code> value as in the following example.
|
|
For web search the query URL could be annotated on the result e.g.
|
|
<code><result url="http://cnn.com"/></code>. For database
|
|
queries, the query (SQL, for example) could be annotated on the
|
|
results or on the containing <code><emma:group></code>.</p>
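<p>As a rough sketch of these options, the query could be carried in an
<code><emma:info></code> annotation on the containing element, and
each web-search result could carry its query URL. The
<code><result></code> and <code><query></code> element names
below are illustrative application markup, not defined by EMMA:</p>
<pre>
<emma:group>
  <emma:info>
    <query>select name, room, number from directory where name like 'john smith'</query>
  </emma:info>
  <emma:interpretation id="rec1" emma:confidence="0.80">
    <result url="http://directory.example.com/?q=john+smith">
      <name>John Smith</name>
      <room>dx513</room>
    </result>
  </emma:interpretation>
  <emma:interpretation id="rec2" emma:confidence="0.70">
    <result url="http://directory.example.com/?q=jon+smith">
      <name>Jon Smith</name>
      <room>dv900</room>
    </result>
  </emma:interpretation>
  <emma:group-info>database_results</emma:group-info>
</emma:group>
</pre>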
|
|
<p>The following example shows the use of EMMA to represent the
|
|
results of database retrieval from an employee directory. The user
|
|
says "John Smith". After ASR, NLU, and then database look up, the
|
|
system returns the XML here which shows the N-best lists associated
|
|
with each of these three stages of processing. Here
|
|
<code><emma:derived-from></code> is used to indicate the
|
|
relations between each of the <code><emma:one-of></code>
|
|
elements. However, if you want to see which specific ASR result a
|
|
record is derived from, you would need to put
|
|
<code><emma:derived-from></code> on the individual
|
|
elements.</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td width="50">User says "John Smith"</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:one-of id="db_results1"
|
|
emma:process="db:type=mysql&database=personel_060109.db>
|
|
<emma:interpretation id="db_nbest1"
|
|
emma:confidence="0.80" emma:tokens="john smith">
|
|
<result>
|
|
<name>John Smith</name>
|
|
<room>dx513</room>
|
|
<number>123-456-7890</number>
|
|
</result>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="db_nbest2"
|
|
emma:confidence="0.70" emma:tokens="john smith">
|
|
<result>
|
|
<name>John Smith</name>
|
|
<room>ef312</room>
|
|
<number>123-456-7891</number>
|
|
</result>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="db_nbest3"
|
|
emma:confidence="0.50" emma:tokens="jon smith">
|
|
<result>
|
|
<name>Jon Smith</name>
|
|
<room>dv900</room>
|
|
<number>123-456-7892</number>
|
|
</result>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="db_nbest4"
|
|
emma:confidence="0.40" emma:tokens="joan smithe">
|
|
<result>
|
|
<name>Joan Smithe</name>
|
|
<room>lt567</room>
|
|
<number>123-456-7893</number>
|
|
</result>
|
|
</emma:interpretation>
|
|
<emma:derived-from resource="#nlu_results1/>
|
|
</emma:one-of>
|
|
<emma:derivation>
|
|
<emma:one-of id="nlu_results1"
|
|
emma:process="smm:type=nlu&version=parser">
|
|
<emma:interpretation id="nlu_nbest1"
|
|
emma:confidence="0.99" emma:tokens="john smith">
|
|
<fn>john</fn><ln>smith</ln>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="nlu_nbest2"
|
|
emma:confidence="0.97" emma:tokens="jon smith">
|
|
<fn>jon</fn><ln>smith</ln>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="nlu_nbest3"
|
|
emma:confidence="0.93" emma:tokens="joan smithe">
|
|
<fn>joan</fn><ln>smithe</ln>
|
|
</emma:interpretation>
|
|
<emma:derived-from resource="#asr_results1/>
|
|
</emma:one-of>
|
|
<emma:one-of id="asr_results1"
|
|
emma:medium="acoustic" emma:mode="voice"
|
|
emma:function="dialog" emma:verbal="true"
|
|
emma:lang="en-US" emma:start="1241641821513"
|
|
emma:end="1241641823033"
|
|
emma:media-type="audio/amr; rate=8000"
|
|
emma:process="smm:type=asr&version=watson6">
|
|
<emma:interpretation id="asr_nbest1"
|
|
emma:confidence="1.00">
|
|
<emma:literal>john smith</emma:literal>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="asr_nbest2"
|
|
emma:confidence="0.98">
|
|
<emma:literal>jon smith</emma:literal>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="asr_nbest3"
|
|
emma:confidence="0.89" >
|
|
<emma:literal>joan smithe</emma:literal>
|
|
</emma:interpretation>
|
|
</emma:one-of>
|
|
</emma:derivation>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<h3 id="s2.12">2.12 Supporting other semantic representation forms
|
|
in EMMA</h3>
|
|
<p>In the <a href="http://www.w3.org/TR/emma/">EMMA 1.0</a>
|
|
specification, the semantic representation of an input is
|
|
represented either in XML in some application namespace or as a
|
|
literal value using <code>emma:literal</code>. In some
|
|
circumstances it could be beneficial to allow for semantic
|
|
representation in other formats such as JSON. Serializations such
|
|
as JSON could potentially be contained within
|
|
<code>emma:literal</code> using CDATA, and a new EMMA annotation
|
|
e.g. <code>emma:semantic-rep</code> used to indicate the semantic
|
|
representation language being used.</p>
|
|
<h4 id="semantic_representation_example">Example</h4>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td>semantics of spoken input</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:interpretation id="int1"
|
|
emma:confidence=".75”
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:verbal="true"
|
|
emma:function="dialog"
|
|
emma:semantic-rep="json">
|
|
<emma:literal>
|
|
<![CDATA[
|
|
{
|
|
drink: {
|
|
liquid:"coke",
|
|
drinksize:"medium"},
|
|
pizza: {
|
|
number: "3",
|
|
pizzasize: "large",
|
|
topping: [ "pepperoni", "mushrooms" ]
|
|
}
|
|
}
|
|
]]>
|
|
</emma:literal>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<h2 id="references">General References</h2>
|
|
<p>EMMA 1.0 Requirements <a href=
|
|
"http://www.w3.org/TR/EMMAreqs/">http://www.w3.org/TR/EMMAreqs/</a></p>
|
|
<p>EMMA Recommendation <a href=
|
|
"http://www.w3.org/TR/emma/">http://www.w3.org/TR/emma/</a></p>
|
|
<h2 id="acknowledgements">Acknowledgements</h2>
|
|
<p>Thanks to Jim Larson (W3C Invited Expert) for his contribution
|
|
to the section on EMMA for multimodal output.</p>
|
|
</body>
|
|
</html>
|