<?xml version="1.0" encoding="utf-8"?>
|
|
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
|
|
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
|
|
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
|
|
<head>
|
|
<meta name="generator" content=
|
|
"HTML Tidy for Linux/x86 (vers 1 September 2005), see www.w3.org" />
|
|
<title>Use Cases for Possible Future EMMA Features</title>
|
|
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
|
|
<style type="text/css">
|
|
/*<![CDATA[*/
|
|
code { font-family: monospace; }
|
|
|
|
div.constraint,
|
|
div.issue,
|
|
div.note,
|
|
div.notice { margin-left: 2em; }
|
|
|
|
ol.enumar { list-style-type: decimal; }
|
|
ol.enumla { list-style-type: lower-alpha; }
|
|
ol.enumlr { list-style-type: lower-roman; }
|
|
ol.enumua { list-style-type: upper-alpha; }
|
|
ol.enumur { list-style-type: upper-roman; }
|
|
|
|
|
|
div.exampleInner pre { margin-left: 1em;
|
|
margin-top: 0em; margin-bottom: 0em}
|
|
div.exampleOuter {border: 4px double gray;
|
|
margin: 0em; padding: 0em}
|
|
div.exampleInner { background-color: #d5dee3;
|
|
border-top-width: 4px;
|
|
border-top-style: double;
|
|
border-top-color: #d3d3d3;
|
|
border-bottom-width: 4px;
|
|
border-bottom-style: double;
|
|
border-bottom-color: #d3d3d3;
|
|
padding: 4px; margin: 0em }
|
|
div.exampleWrapper { margin: 4px }
|
|
div.exampleHeader { font-weight: bold;
|
|
margin: 4px}
|
|
|
|
table {
|
|
width:80%;
|
|
border:1px solid #000;
|
|
border-collapse:collapse;
|
|
font-size:90%;
|
|
}
|
|
|
|
td,th{
|
|
border:1px solid #000;
|
|
border-collapse:collapse;
|
|
padding:5px;
|
|
}
|
|
|
|
|
|
caption{
|
|
background:#ccc;
|
|
font-size:140%;
|
|
border:1px solid #000;
|
|
border-bottom:none;
|
|
padding:5px;
|
|
text-align:center;
|
|
}
|
|
|
|
img.center {
|
|
display: block;
|
|
margin-left: auto;
|
|
margin-right: auto;
|
|
}
|
|
p.caption {
|
|
text-align: center
|
|
}
|
|
|
|
|
|
.RFC2119 {
|
|
text-transform: lowercase;
|
|
font-style: italic;
|
|
}
|
|
/*]]>*/
|
|
</style>
|
|
|
|
<style type="text/css">
|
|
/*<![CDATA[*/
|
|
p.c1 {font-weight: bold}
|
|
/*]]>*/
|
|
</style>
|
|
|
|
<link href="http://www.w3.org/StyleSheets/TR/W3C-WG-NOTE.css" type="text/css" rel="stylesheet" />
|
|
<meta content="MSHTML 6.00.6000.16762" name="GENERATOR" />
|
|
<style type="text/css">
|
|
/*<![CDATA[*/
|
|
ol.c2 {list-style-type: lower-alpha}
|
|
li.c1 {list-style: none}
|
|
/*]]>*/
|
|
</style>
|
|
</head>
|
|
<body xml:lang="en" lang="en">
|
|
<div class="head"><a href="http://www.w3.org/"><img alt="W3C" src=
|
|
"http://www.w3.org/Icons/w3c_home" width="72" height="48" /></a>
|
|
<h1 id="title">Use Cases for Possible Future EMMA Features</h1>
|
|
<h2 id="w3c-doctype">W3C Working Group Note <i>15</i> <i>December</i> <i>2009</i></h2>
|
|
|
|
<dl>
|
|
<dt>This version:</dt>
|
|
<dd><a href="http://www.w3.org/TR/2009/NOTE-emma-usecases-20091215">http://www.w3.org/TR/2009/NOTE-emma-usecases-20091215</a></dd>
|
|
<dt>Latest version:</dt>
|
|
<dd><a href="http://www.w3.org/TR/emma-usecases">http://www.w3.org/TR/emma-usecases</a></dd>
|
|
<dt>Previous version:</dt>
|
|
<dd><em>This is the first publication.</em></dd>
|
|
|
|
<dt>Editor:</dt>
|
|
<dd>Michael Johnston, AT&T</dd>
|
|
<dt>Authors:</dt>
|
|
<dd>Deborah A. Dahl, Invited Expert</dd>
|
|
<dd>Ingmar Kliche, Deutsche Telekom AG</dd>
|
|
<dd>Paolo Baggia, Loquendo</dd>
|
|
<dd>Daniel C. Burnett, Voxeo</dd>
|
|
<dd>Felix Burkhardt, Deutsche Telekom AG</dd>
|
|
<dd>Kazuyuki Ashimura, W3C</dd>
|
|
</dl>
|
|
<p class="copyright"><a href=
|
|
"http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">Copyright</a>
|
|
© 2009 <a href="http://www.w3.org/"><acronym title=
|
|
"World Wide Web Consortium">W3C</acronym></a><sup>®</sup> (<a href=
|
|
"http://www.csail.mit.edu/"><acronym title=
|
|
"Massachusetts Institute of Technology">MIT</acronym></a>, <a href=
|
|
"http://www.ercim.org/"><acronym title=
|
|
"European Research Consortium for Informatics and Mathematics">ERCIM</acronym></a>,
|
|
<a href="http://www.keio.ac.jp/">Keio</a>), All Rights Reserved.
|
|
W3C <a href=
|
|
"http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">liability</a>,
|
|
<a href=
|
|
"http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">trademark</a>
|
|
and <a href=
|
|
"http://www.w3.org/Consortium/Legal/copyright-documents">document
|
|
use</a> rules apply.</p>
|
|
</div>
|
|
<!-- end of head div -->
|
|
<hr title="Separator for header" />
|
|
<h2 id="abstract">Abstract</h2>
|
|
<p>The EMMA: Extensible MultiModal Annotation specification defines
|
|
an XML markup language for capturing and providing metadata on the
|
|
interpretation of inputs to multimodal systems. Throughout the
|
|
implementation report process and discussion since EMMA 1.0 became
|
|
a W3C Recommendation, a number of new possible use cases for the
|
|
EMMA language have emerged. These include the use of EMMA to
|
|
represent multimodal output, biometrics, emotion, sensor data,
|
|
multi-stage dialogs, and interactions with multiple users. In this
|
|
document, we describe these use cases and illustrate how the EMMA
|
|
language could be extended to support them.</p>
|
|
|
|
<h2 id="status">Status of this Document</h2>
|
|
|
|
<p><em>This section describes the status of this document at the
|
|
time of its publication. Other documents may supersede this
|
|
document. A list of current W3C publications and the latest
|
|
revision of this technical report can be found in the <a href=
|
|
"http://www.w3.org/TR/">W3C technical reports index</a> at
|
|
http://www.w3.org/TR/.</em></p>
|
|
|
|
<p>This document is a W3C Working Group Note published on 15 December
|
|
2009. This is the first publication of this document and it represents
|
|
the views of the W3C Multimodal Interaction Working Group at the time
|
|
of publication. The document may be updated as new technologies emerge
|
|
or mature. Publication as a Working Group Note does not imply
|
|
endorsement by the W3C Membership. This is a draft document and may be
|
|
updated, replaced or obsoleted by other documents at any time. It is
|
|
inappropriate to cite this document as other than work in
|
|
progress.</p>
|
|
|
|
<p>This document is one of a series produced by the
|
|
<a href="http://www.w3.org/2002/mmi/">Multimodal Interaction WorkingGroup</a>,
|
|
part of the <a href="http://www.w3.org/2002/mmi/Activity">W3C Multimodal Interaction
|
|
Activity</a>.
|
|
|
|
Since <a href="http://www.w3.org/TR/emma/">EMMA 1.0</a> became a W3C
|
|
Recommendation, a number of new possible use cases for the EMMA language have
|
|
emerged, e.g., the use of EMMA to represent multimodal output, biometrics,
|
|
emotion, sensor data, multi-stage dialogs and interactions with multiple users.
|
|
|
|
Therefore the Working Group has been working on a document capturing use cases
|
|
and issues for a series of possible extensions to EMMA.
|
|
|
|
The intention of publishing this Working Group Note is to seek feedback on the
|
|
various use cases.
|
|
</p>
|
|
|
|
<p>Comments on this document can be sent to <a href=
|
|
"mailto:www-multimodal@w3.org">www-multimodal@w3.org</a>, the
|
|
public forum for discussion of the W3C's work on Multimodal
|
|
Interaction. To subscribe, send an email to <a href=
|
|
"mailto:www-multimodal-request@w3.org">www-multimodal-request@w3.org</a>
|
|
with the word subscribe in the subject line (include the word
|
|
unsubscribe if you want to unsubscribe). The <a href=
|
|
"http://lists.w3.org/Archives/Public/www-multimodal/">archive</a>
|
|
for the list is accessible online.</p>
|
|
|
|
<p> This document was produced by a group operating under the <a href="http://www.w3.org/Consortium/Patent-Policy-20040205/">5 February 2004 W3C Patent Policy</a>. W3C maintains a <a rel="disclosure" href="http://www.w3.org/2004/01/pp-impl/34607/status">public list of any patent disclosures</a> made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains <a href="http://www.w3.org/Consortium/Patent-Policy-20040205/#def-essential">Essential Claim(s)</a> must disclose the information in accordance with <a href="http://www.w3.org/Consortium/Patent-Policy-20040205/#sec-Disclosure">section 6 of the W3C Patent Policy</a>. </p>
|
|
|
|
<h2 id="contents">Table of Contents</h2>
|
|
<ul>
|
|
<li>1. <a href="#s1">Introduction</a></li>
|
|
<li>2. <a href="#s2">EMMA use cases</a></li>
|
|
</ul>
|
|
<ul class="tocline">
|
|
<li>2.1 <a href="#s2.1">Incremental results for streaming
|
|
modalities such as haptics, ink, monologues, dictation</a></li>
|
|
<li>2.2 <a href="#s2.2">Representing biometric information</a></li>
|
|
<li>2.3 <a href="#s2.3">Representing emotion in EMMA</a></li>
|
|
<li>2.4 <a href="#s2.4">Richer semantic representations in
|
|
EMMA</a></li>
|
|
<li>2.5 <a href="#s2.5">Representing system output in EMMA</a></li>
|
|
<li class="c1">
|
|
<ul class="tocline">
|
|
<li>2.5.1 <a href="#s2.5.1">Abstracting output from specific
|
|
modalities</a></li>
|
|
<li>2.5.2 <a href="#s2.5.2">Coordination of outputs distributed
|
|
over multiple different modalities</a></li>
|
|
</ul>
|
|
</li>
|
|
<li>2.6 <a href="#s2.6">Representation of dialogs in EMMA</a></li>
|
|
<li>2.7 <a href="#s2.7">Logging, analysis, and annotation</a></li>
|
|
<li class="c1">
|
|
<ul class="tocline">
|
|
<li>2.7.1 <a href="#s2.7.1">Log analysis</a></li>
|
|
<li>2.7.2 <a href="#s2.7.2">Log annotation</a></li>
|
|
</ul>
|
|
</li>
|
|
<li>2.8 <a href="#s2.8">Multi-sentence inputs</a></li>
|
|
<li>2.9 <a href="#s2.9">Multi-participant interactions</a></li>
|
|
<li>2.10 <a href="#s2.10">Capturing sensor data such as GPS in
|
|
EMMA</a></li>
|
|
<li>2.11 <a href="#s2.11">Extending EMMA from NLU to also represent
|
|
search or database retrieval results</a></li>
|
|
<li>2.12 <a href="#s2.12">Supporting other semantic representation
|
|
forms in EMMA</a></li>
|
|
</ul>
|
|
<ul>
|
|
<li><a href="#references">General References</a></li>
|
|
</ul>
|
|
<hr title="Separator for introduction" />
|
|
<h2 id="s1">1. Introduction</h2>
|
|
<p>This document presents a set of use cases for possible new
|
|
features of the Extensible MultiModal Annotation (EMMA) markup
|
|
language. <a href="http://www.w3.org/TR/emma/">EMMA 1.0</a> was
|
|
designed primarily to be used as a data interchange format by
|
|
systems that provide semantic interpretations for a variety of
|
|
inputs, including but not necessarily limited to, speech, natural
|
|
language text, GUI and ink input. EMMA 1.0 provides a set of
|
|
elements for containing the various stages of processing of a
|
|
user's input and a set of elements and attributes for specifying
|
|
various kinds of metadata such as confidence scores and timestamps.
|
|
<a href="http://www.w3.org/TR/emma/">EMMA 1.0</a> became a W3C
|
|
Recommendation on February 10, 2009.</p>
|
|
<p>A number of possible extensions to <a href=
|
|
"http://www.w3.org/TR/emma/">EMMA 1.0</a> have been identified
|
|
through discussions with other standards organizations,
|
|
implementers of EMMA, and internal discussions within the W3C
|
|
Multimodal Interaction Working Group. This document focuses on the
|
|
following use cases:</p>
|
|
<ol>
|
|
<li>Representing incremental results for streaming modalities such
|
|
as haptics, ink, monologues, dictation, where it is desirable to
|
|
have partial results available before the full input finishes.</li>
|
|
<li>Representing biometric results such as the results of speaker
|
|
verification or speaker identification (briefly covered in EMMA
|
|
1.0).</li>
|
|
<li>Representing emotion, for example, as conveyed by intonation
|
|
patterns, facial expression, or lexical choice.</li>
|
|
<li>Richer semantic representations, for example, integrating EMMA
|
|
application semantics with ontologies.</li>
|
|
<li>Representing system output in addition to user input, including
|
|
topics such as:</li>
|
|
<li class="c1">
|
|
<ol class="c2">
|
|
<li>Isolating presentation logic from dialog/interaction
|
|
management.</li>
|
|
<li>Coordination of outputs distributed over multiple different
|
|
modalities.</li>
|
|
</ol>
|
|
</li>
|
|
<li>Support for archival functions such as logging, human
|
|
annotation of inputs, and data analysis.</li>
|
|
<li>Representing full dialogs and multi-sentence inputs in addition
|
|
to single inputs.</li>
|
|
<li>Representing multi-participant interactions.</li>
|
|
<li>Representing sensor data such as GPS input.</li>
|
|
<li>Representing the results of database queries or search.</li>
|
|
<li>Support for forms of representation of application semantics
|
|
other than XML, such as JSON.</li>
|
|
</ol>
|
|
<p>It may be possible to achieve support for some of these features
|
|
without modifying the language, through the use of the
|
|
extensibility mechanisms of <a href=
|
|
"http://www.w3.org/TR/emma/">EMMA 1.0</a>, such as the
|
|
<code><emma:info></code> element and application-specific
|
|
semantics; however, this would significantly reduce
|
|
interoperability among EMMA implementations. If features are of
|
|
general value then it would be beneficial to define standard ways
|
|
of implementing them within the EMMA language. Additionally,
|
|
extensions may be needed to support additional new kinds of input
|
|
modalities such as multi-touch and accelerometer input.</p>
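<p>For instance, the following minimal, non-normative sketch shows how
data from a modality not covered by EMMA 1.0, such as an accelerometer
reading, could be attached today through the
<code><emma:info></code> extension point. The
<code>accelerometer</code> and <code>gesture</code> elements are
purely hypothetical application-specific markup, not EMMA
vocabulary.</p>
<div class="exampleInner">
<pre>
<emma:emma
  version="1.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation id="int1"
    emma:medium="tactile"
    emma:mode="accelerometer"
    emma:verbal="false">
    <emma:info>
      <accelerometer x="0.02" y="0.91" z="0.40"/>
    </emma:info>
    <gesture>shake</gesture>
  </emma:interpretation>
</emma:emma>
</pre>
</div>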
|
|
<p>The W3C Membership and other interested parties are invited to
|
|
review this document and send comments to the Working Group's
|
|
public mailing list www-multimodal@w3.org <a href=
|
|
"http://lists.w3.org/Archives/Public/www-multimodal/">(archive)</a>
|
|
.</p>
|
|
<h2 id="s2">2. EMMA use cases</h2>
|
|
<h3 id="s2.1">2.1 Incremental results for streaming modalities such
|
|
as haptics, ink, monologues, dictation</h3>
|
|
<p>In EMMA 1.0, EMMA documents were assumed to be created for
|
|
completed inputs within a given modality. However, there are
|
|
important use cases where it would be beneficial to represent some
|
|
level of interpretation of partial results before the input is
|
|
complete. For example, in a dictation application, where inputs can
|
|
be lengthy, it is often desirable to show partial results to give
|
|
feedback to the user while they are speaking. In this case, each
|
|
new word is appended to the previous sequence of words. Another use
|
|
case would be incremental ASR, either for dictation or dialog
|
|
applications, where previous results might be replaced as more
|
|
evidence is collected. As more words are recognized and provide
|
|
more context, earlier word hypotheses may be updated. In this
|
|
scenario it may be necessary to replace the previous hypothesis
|
|
with a revised one.</p>
|
|
<p>In this section, we discuss how the EMMA standard could be
|
|
extended to support incremental or streaming results in the
|
|
processing of a single input. Some key considerations and areas for
|
|
discussion are:</p>
|
|
<ol>
|
|
<li>Do we need an identifier for a particular stream? Or is
|
|
<code>emma:source</code> sufficient? Subsequent messages (carrying
|
|
information for a particular stream) may need to have the same
|
|
identifier.</li>
|
|
<li>Do we need a sequence number to indicate order? Or are
|
|
timestamps sufficient (though optional)?</li>
|
|
<li>Do we need to mark "begin", "in progress" and "end" of a
|
|
stream? There are streams with a particular start and end, like a
|
|
dictation. Note that sensors may never explicitly end a
|
|
stream.</li>
|
|
<li>Do we always append information? Or do we also replace previous
|
|
data? A dictation application will probably append new text. But do
|
|
we consider sensor data (such as GPS position or device tilt) as
|
|
streaming or as "final" data?</li>
|
|
</ol>
|
|
<p>In the example below for dictation, we show how three new
|
|
attributes <code>emma:streamId</code>,
|
|
<code>emma:streamSeqNr</code>, and <code>emma:streamProgress</code>
|
|
could be used to annotate each result with metadata regarding its
|
|
position and status within a stream of input. In this example, the
|
|
<code>emma:streamId</code> is an identifier which can be used to
|
|
show that different <code>emma:interpretation</code> elements are
|
|
members of the same stream. The <code>emma:streamSeqNr</code>
|
|
attribute provides a numerical order to elements in the stream
|
|
while <code>emma:streamProgress</code> indicates the start of the
|
|
stream (and whether to expect more interpretations within the same
|
|
stream), and the end of the stream. This is an instance of the
|
|
'append' scenario for partial results in EMMA.</p>
|
|
<table width="120">
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">User</td>
|
|
<td>Hi Joe the meeting has moved</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example"
|
|
<emma:interpretation id="int1"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:function="transcription"
|
|
emma:confidence="0.75"
|
|
emma:tokens="Hi Joe the meeting has moved"
|
|
emma:streamId="id1"
|
|
emma:streamSeqNr="0"
|
|
emma:streamProgress="begin">
|
|
<emma:literal>
|
|
Hi Joe the meeting has moved
|
|
</emma:literal>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">User</td>
|
|
<td>to friday at four</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example"
|
|
<emma:interpretation id="int2"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:function="transcription"
|
|
emma:confidence="0.75"
|
|
emma:tokens="to friday at four"
|
|
emma:streamId="id1"
|
|
emma:streamSeqNr="1"
|
|
emma:streamProgress="end">
|
|
<emma:literal>
|
|
to friday at four
|
|
</emma:literal>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
</table>
|
|
<p>In the example below, a speech recognition hypothesis for the
|
|
whole string is updated once more words have been recognized. This
|
|
is an instance of the 'replace' scenario for partial results in
|
|
EMMA. Note that the <code>emma:streamSeqNr</code> is the same for
|
|
each interpretation in this case.</p>
|
|
<table width="120">
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">User</td>
|
|
<td>Is there a Pisa</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example"
|
|
<emma:interpretation id="int1"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:function="dialog"
|
|
emma:confidence="0.7"
|
|
emma:tokens="is there a pisa"
|
|
emma:streamId="id2"
|
|
emma:streamSeqNr="0"
|
|
emma:streamProgress="begin">
|
|
<emma:literal>
|
|
is there a pisa
|
|
</emma:literal>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">User</td>
|
|
<td>Is there a pizza restaurant</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example"
|
|
<emma:interpretation id="int2"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:function="dialog"
|
|
emma:confidence="0.9"
|
|
emma:tokens="is there a pizza restaurant"
|
|
emma:streamId="id2"
|
|
emma:streamSeqNr="0"
|
|
emma:streamProgress="end">
|
|
<emma:literal>
|
|
is there a pizza restaurant
|
|
</emma:literal>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
</table>
|
|
<p>One issue for the 'replace' case of incremental results is how
|
|
to specify that a result replaces multiple of the previously
|
|
received results. For example, a system could receive partial
|
|
results consisting of each word in turn of an utterance, and then a
|
|
final result which is the final recognition for the whole sequence
|
|
of words. One approach to this problem would be to allow
|
|
<code>emma:streamSeqNr</code> to specify a range of inputs to be
|
|
replaced. For example, if the <code>emma:streamSeqNr</code> for
|
|
each of three single-word results was 1, 2, and then 3, a final
|
|
revised result could be marked as
|
|
<code>emma:streamSeqNr="1-3"</code> indicating that it is a revised
|
|
result for those three words.</p>
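<p>Purely as a non-normative illustration of this possibility, a
revised result covering such a range might look as follows, reusing
the stream attributes proposed above; the identifiers and token
values are hypothetical:</p>
<div class="exampleInner">
<pre>
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation id="int4"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:function="transcription"
    emma:tokens="flights to boston"
    emma:streamId="id3"
    emma:streamSeqNr="1-3"
    emma:streamProgress="end">
    <emma:literal>
      flights to boston
    </emma:literal>
  </emma:interpretation>
</emma:emma>
</pre>
</div>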
|
|
<p>One issue is whether timestamps might be used to track ordering
|
|
instead of introducing new attributes. One problem is that
|
|
timestamp attributes are not required and may not always be
|
|
available. Also as shown in the example, chunks of input in a
|
|
stream may not always be in sequential order. Even with timestamps
|
|
providing an order some kind of 'begin' and 'end' flag is needed
|
|
(like <code>emma:streamProgress</code>) to indicate the
|
|
beginning and end of transmission of streamed input. Moreover,
|
|
timestamps do not provide sufficient information to detect whether
|
|
a message has been lost.</p>
|
|
<p>Another possibility to explore for representation of incremental
|
|
results would be to use an <code><emma:sequence></code>
|
|
element containing the interim results and a derived result which
|
|
contains the combination.</p>
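<p>A rough, non-normative sketch of this alternative is given below:
the interim dictation results appear within an
<code><emma:sequence></code> inside
<code><emma:derivation></code>, and the combined result refers
back to that sequence through
<code><emma:derived-from></code>. All of the elements used are
already defined in EMMA 1.0; the identifiers are hypothetical.</p>
<div class="exampleInner">
<pre>
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation id="combined1">
    <emma:literal>
      Hi Joe the meeting has moved to friday at four
    </emma:literal>
    <emma:derived-from resource="#seq1" composite="false"/>
  </emma:interpretation>
  <emma:derivation>
    <emma:sequence id="seq1">
      <emma:interpretation id="int1">
        <emma:literal>Hi Joe the meeting has moved</emma:literal>
      </emma:interpretation>
      <emma:interpretation id="int2">
        <emma:literal>to friday at four</emma:literal>
      </emma:interpretation>
    </emma:sequence>
  </emma:derivation>
</emma:emma>
</pre>
</div>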
|
|
<p>Another issue to explore is the relationship between incremental
|
|
results and the MMI lifecycle events within the <a href=
|
|
"http://www.w3.org/TR/mmi-arch/">MMI Architecture</a>.</p>
|
|
<h3 id="s2.2">2.2 Representing biometric information</h3>
|
|
<p>Biometric technologies include systems designed to identify
|
|
someone or verify a claim of identity based on their physical or
|
|
behavioral characteristics. These include speaker verification,
|
|
speaker identification, face recognition, and iris recognition,
|
|
among others. <a href="http://www.w3.org/TR/emma/">EMMA 1.0</a>
|
|
provided some capability for representing the results of biometric
|
|
analysis through values of the <code>emma:function</code> attribute
|
|
such as "verification". However, it did not discuss the specifics
|
|
of this use case in any detail. It may be worth exploring further
|
|
considerations and consequences of using EMMA to represent
|
|
biometric results. As one example, if different biometric results
|
|
are represented in EMMA, this would simplify the process of fusing
|
|
the outputs of multiple biometric technologies to obtain a more
|
|
reliable overall result. It should also make it easier to
|
|
take into account non-biometric claims of identity, such as a
|
|
statement like "this is Kazuyuki", represented in EMMA, along with
|
|
a speaker verification result based on the speaker's voice, which
|
|
would also be represented in EMMA. In the following example, we
|
|
have extended the set of values for <code>emma:function</code> to
|
|
include "identification" for an interpretation showing the results
|
|
of a biometric component that picks out an individual from a set of
|
|
possible individuals (who are they). This contrasts with
|
|
"verification" which is used for verification of a particular user
|
|
(are they who they say they are).</p>
|
|
<h4 id="biometric_example">Example</h4>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td>an image of a face</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example"
|
|
<emma:interpretation id="int1"
|
|
emma:confidence="0.75”
|
|
emma:medium="visual"
|
|
emma:mode="photograph"
|
|
emma:verbal="false"
|
|
emma:function="identification">
|
|
<person>12345</person>
|
|
<name>Mary Smith</name>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>One direction to explore further is the relationship between
|
|
work on messaging protocols for biometrics within the OASIS
|
|
Biometric Identity Assurance Services (<a href="http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=bias">BIAS</a>)
|
|
standards committee and EMMA.</p>
|
|
<h3 id="s2.3">2.3 Representing emotion in EMMA</h3>
|
|
<p>In addition to speech recognition, and other tasks such as
|
|
speaker verification and identification, another kind of
|
|
interpretation of speech that is of increasing importance is
|
|
determination of the emotional state of the speaker, based on, for
|
|
example, their prosody, lexical choice, or other features. This
|
|
information can be used, for example, to make the dialog logic of
|
|
an interactive system sensitive to the user's emotional state.
|
|
Emotion detection can also use other modalities such as vision
|
|
(facial expression, posture) and physiological sensors such as skin
|
|
conductance measurement or blood pressure. Multimodal approaches
|
|
where evidence is combined from multiple different modalities are
|
|
also of significance for emotion classification.</p>
|
|
<p>The creation of a markup language for emotion has been a recent
|
|
focus of attention in W3C. Work that originated in the W3C Emotion
|
|
Markup Language Incubator Group (<a href=
|
|
"http://www.w3.org/2005/Incubator/emotion/XGR-emotionml-20081120/">EmotionML
|
|
XG</a>) has now transitioned to the <a href=
|
|
"http://www.w3.org/2002/mmi/">W3C Multimodal Working Group</a> and
|
|
the <a href="http://www.w3.org/TR/emotionml">EmotionML</a> language
|
|
has been published as a working draft. One of the major use cases
|
|
for that effort is: "Automatic recognition of emotions from
|
|
sensors, including physiological sensors, speech recordings, facial
|
|
expressions, etc., as well as from multi-modal combinations of
|
|
sensors."</p>
|
|
<p>Given the similarities to the technologies and annotations used
|
|
for other kinds of input processing (recognition, semantic
|
|
classification) which are now captured in EMMA, it makes sense to
|
|
explore the use of EMMA for capture of emotional classification of
|
|
inputs. Just as EMMA does not standardize the application markup
|
|
for semantic results, though, it does not make sense to try to
|
|
standardize emotion markup within EMMA. One promising approach is
|
|
to combine the containers and metadata annotation of EMMA with the
|
|
<a href="http://www.w3.org/TR/emotionml">EmotionML</a> markup, as
|
|
shown in the following example.</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td width="50">expression of boredom</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example"
|
|
xmlns:emo="http://www.w3.org/2009/10/emotionml">
|
|
<emma:interpretation id="emo1"
|
|
emma:start="1241035886246"
|
|
emma:end="1241035888246"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:verbal="false"
|
|
emma:signal="http://example.com/input345.amr"
|
|
emma:media-type="audio/amr; rate:8000;"
|
|
emma:process="engine:type=emo_class&vn=1.2”>
|
|
<emo:emotion>
|
|
<emo:intensity
|
|
value="0.1"
|
|
confidence="0.8"/>
|
|
<emo:category
|
|
set="everydayEmotions"
|
|
name="boredom"
|
|
confidence="0.1"/>
|
|
</emo:emotion>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>In this example, we use the capabilities of EMMA for describing
|
|
the input signal, its temporal characteristics, modality, sampling
|
|
rate, audio codec, etc., and EmotionML is used to provide the
|
|
specific representation of the emotion. Other EMMA container
|
|
elements also have strong use cases for emotion recognition. For
|
|
example, <code><emma:one-of></code> can be used to represent
|
|
N-best lists of competing classifications of emotion. The
|
|
<code><emma:group></code> element could be used to combine a
|
|
semantic interpretation of a user input with an emotional
|
|
classification, as illustrated in the following example. Note that
|
|
all of the general properties of the signal can be specified on the
|
|
<code><emma:group></code> element.</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td width="50">spoken input "flights to boston tomorrow" to dialog
|
|
system in angry voice</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example"
|
|
xmlns:emo="http://www.w3.org/2009/10/emotionml">
|
|
<emma:group id="result1"
|
|
emma:start="1241035886246"
|
|
emma:end="1241035888246"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:verbal="false"
|
|
emma:signal="http://example.com/input345.amr"
|
|
emma:media-type="audio/amr; rate:8000;">
|
|
<emma:interpretation id="asr1"
|
|
emma:tokens="flights to boston tomorrow"
|
|
emma:confidence="0.76"
|
|
emma:process="engine:type=asr_nl&vn=5.2”>
|
|
<flight>
|
|
<dest>boston</dest>
|
|
<date>tomorrow</date>
|
|
</flight>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="emo1"
|
|
emma:process="engine:type=emo_class&vn=1.2”>
|
|
<emo:emotion>
|
|
<emo:intensity
|
|
value="0.3"
|
|
confidence="0.8"/>
|
|
<emo:category
|
|
set="everydayEmotions"
|
|
name="anger"
|
|
confidence="0.8"/>
|
|
</emo:emotion>
|
|
</emma:interpretation>
|
|
<emma:group-info>
|
|
meaning_and_emotion
|
|
</emma:group-info>
|
|
</emma:group>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>The element <code><emma:group></code> can also be used to
|
|
capture groups of emotion detection results from individual
|
|
modalities for combination by a multimodal fusion component or when
|
|
automatic recognition results are described together with manually
|
|
annotated data. This use case is inspired by <a href=
|
|
"http://www.w3.org/2005/Incubator/emotion/XGR-emotion/#AppendixUseCases">
|
|
Use case 2b (II)</a> of the Emotion Incubator Group Report. The
|
|
following example illustrates the grouping of three
|
|
interpretations, namely: a speech analysis emotion classifier, a
|
|
physiological emotion classifier measuring blood pressure, and a
|
|
human annotator viewing video, for two different media files (from
|
|
the same episode) that are synchronized via <code>emma:start</code>
|
|
and <code>emma:end</code> attributes. In this case, the
|
|
physiological reading is for a subinterval of the video and audio
|
|
recording.</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td width="50">audio, video, and physiological sensor of a test
|
|
user acting with a new design.</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example"
|
|
xmlns:emo="http://www.w3.org/2009/10/emotionml">
|
|
<emma:group id="result1">
|
|
<emma:interpretation id="speechClassification1"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:verbal="false"
|
|
emma:start="1241035884246"
|
|
emma:end="1241035887246"
|
|
emma:signal="http://example.com/video_345.mov"
|
|
emma:process="engine:type=emo_voice_classifier”>
|
|
<emo:emotion>
|
|
<emo:category
|
|
set="everydayEmotions"
|
|
name="anger"
|
|
confidence="0.8"/>
|
|
</emo:emotion>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="bloodPressure1"
|
|
emma:medium="tactile"
|
|
emma:mode="blood_pressure"
|
|
emma:verbal="false"
|
|
emma:start="1241035885300"
|
|
emma:end="1241035886900"
|
|
emma:signal="http://example.com/bp_signal_345.cvs"
|
|
emma:process="engine:type=emo_physiological_classifier”>
|
|
<emo:emotion>
|
|
<emo:category
|
|
set="everydayEmotions"
|
|
name="anger"
|
|
confidence="0.6"/>
|
|
</emo:emotion>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="humanAnnotation1"
|
|
emma:medium="visual"
|
|
emma:mode="video"
|
|
emma:verbal="false"
|
|
emma:start="1241035884246"
|
|
emma:end="1241035887246"
|
|
emma:signal="http://example.com/video_345.mov"
|
|
emma:process="human:type=labeler&id=1”>
|
|
<emo:emotion>
|
|
<emo:category
|
|
set="everydayEmotions"
|
|
name="fear"
|
|
confidence="0.6"/>
|
|
</emo:emotion>
|
|
</emma:interpretation>
|
|
<emma:group-info>
|
|
several_emotion_interpretations
|
|
</emma:group-info>
|
|
</emma:group>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>A combination of <code><emma:group></code> and
|
|
<code><emma:derivation></code> could be used to represent a
|
|
combined emotional analysis resulting from analysis of multiple
|
|
different modalities of the user's behavior. The
|
|
<code><emma:derived-from></code> and
|
|
<code><emma:derivation></code> elements can be used to
|
|
capture both the fused result and combining inputs in a single EMMA
|
|
document. In the following example, visual analysis of user
|
|
activity and analysis of their speech have been combined by a
|
|
multimodal fusion component to provide a combined multimodal
|
|
classification of the user's emotional state. The specifics of the
|
|
multimodal fusion algorithm are not relevant here, or to EMMA in
|
|
general. Note though that in this case, the multimodal fusion
|
|
appears to have compensated for uncertainty in the visual analysis
|
|
which gave two results with equal confidence, one for fear and one
|
|
for anger. The <code>emma:one-of</code> element is used to capture
|
|
the N-best list of multiple competing results from the video
|
|
classifier.</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td width="50">multimodal fusion of emotion classification of user
|
|
based on analysis of voice and video</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example"
|
|
xmlns:emo="http://www.w3.org/2009/10/emotionml">
|
|
<emma:interpretation id="multimodalClassification1"
|
|
emma:medium="acoustic,visual"
|
|
emma:mode="voice,video"
|
|
emma:verbal="false"
|
|
emma:start="1241035884246"
|
|
emma:end="1241035887246"
|
|
emma:process="engine:type=multimodal_fusion”>
|
|
<emo:emotion>
|
|
<emo:category
|
|
set="everydayEmotions"
|
|
name="anger"
|
|
confidence="0.7"/>
|
|
</emo:emotion>
|
|
<emma:derived-from ref="mmgroup1" composite="true"/>
|
|
</emma:interpretation>
|
|
<emma:derivation>
|
|
<emma:group id="mmgroup1">
|
|
<emma:interpretation id="speechClassification1"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:verbal="false"
|
|
emma:start="1241035884246"
|
|
emma:end="1241035887246"
|
|
emma:signal="http://example.com/video_345.mov"
|
|
emma:process="engine:type=emo_voice_classifier”>
|
|
<emo:emotion>
|
|
<emo:category
|
|
set="everydayEmotions"
|
|
name="anger"
|
|
confidence="0.8"/>
|
|
</emo:emotion>
|
|
</emma:interpretation>
|
|
<emma:one-of id="video_nbest"
|
|
emma:medium="visual"
|
|
emma:mode="video"
|
|
emma:verbal="false"
|
|
emma:start="1241035884246"
|
|
emma:end="1241035887246"
|
|
emma:signal="http://example.com/video_345.mov"
|
|
emma:process="engine:type=video_classifier">
|
|
<emma:interpretation id="video_result1"
|
|
<emo:emotion>
|
|
<emo:category
|
|
set="everydayEmotions"
|
|
name="anger"
|
|
confidence="0.5"/>
|
|
</emo:emotion>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="video_result2"
|
|
<emo:emotion>
|
|
<emo:category
|
|
set="everydayEmotions"
|
|
name="fear"
|
|
confidence="0.5"/>
|
|
</emo:emotion>
|
|
</emma:interpretation>
|
|
</emma:one-of>
|
|
<emma:group-info>
|
|
emotion_interpretations
|
|
</emma:group-info>
|
|
</emma:group>
|
|
</emma:derivation>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>One issue which needs to be addressed is the relationship between
|
|
EmotionML <code>confidence</code> attribute values and
|
|
<code>emma:confidence</code> values. Could the
|
|
<code>emma:confidence</code> value be used as an overall confidence
|
|
value for the emotion result, or should confidence values appear
|
|
only within the EmotionML markup since confidence is used for
|
|
different dimensions of the result? If a series of possible emotion
|
|
classifications are contained in <code>emma:one-of</code>, should
|
|
they be ordered by the EmotionML confidence values?</p>
|
|
<h3 id="s2.4">2.4 Richer semantic representations in EMMA</h3>
|
|
<p>Enriching the semantic information represented in EMMA would be
helpful for certain use cases. For example, the concepts in an EMMA
application semantics representation might include references to
concepts in an ontology such as WordNet. In the following example,
inputs to a machine translation system are annotated in the
application semantics with specific WordNet senses, which are used
to distinguish among different senses of the words. A translation
system might also make use of a sense disambiguator to represent
the probabilities of different senses of a word; for example,
"spicy" in the example has two possible WordNet senses.</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td>I love to eat Mexican food because it is spicy</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example"
|
|
xmlns="http://example.com/universal_translator">
|
|
<emma:interpretation id="spanish">
|
|
<result xml:lang="es">
|
|
Adoro alimento mejicano porque es picante.
|
|
</result>
|
|
<emma:derived-from resource="#english" composite="false"/>
|
|
</emma:interpretation>
|
|
<emma:derivation>
|
|
<emma:interpretation id="english"
|
|
emma:tokens="I love to eat Mexican food
|
|
because it is spicy">
|
|
<assertion>
|
|
<interaction
|
|
wordnet="1828736"
|
|
wordnet-desc="love, enjoy (get pleasure from)"
|
|
token="love">
|
|
<experiencer
|
|
reference="first"
|
|
token="I">
|
|
<attribute quantity="single"/>
|
|
</experiencer>
|
|
<attribute time="present"/>
|
|
<content>
|
|
<interaction wordnet="1157345"
|
|
wordnet-desc="eat (take in solid food)"
|
|
token="to eat">
|
|
<object id="obj1"
|
|
wordnet="7555863"
|
|
wordnet-desc="food, solid food (any solid
|
|
substance (as opposed to
|
|
liquid) that is used as a source
|
|
of nourishment)"
|
|
token="food">
|
|
<restriction
|
|
wordnet="3026902"
|
|
wordnet-desc="Mexican (of or relating
|
|
to Mexico or its inhabitants)"
|
|
token="Mexican"/>
|
|
</object>
|
|
</interaction>
|
|
</content>
|
|
<reason token="because">
|
|
<experiencer reference="third"
|
|
target="obj1" token="it"/>
|
|
<attribute time="present"/>
|
|
<one-of token="spicy">
|
|
<modification wordnet="2397732"
|
|
wordnet-desc="hot, spicy (producing a
|
|
burning sensation on
|
|
the taste nerves)"
|
|
confidence="0.8"/>
|
|
<modification wordnet="2398378"
|
|
wordnet-desc="piquant, savory,
|
|
savoury, spicy, zesty
|
|
(having an agreeably
|
|
pungent taste)"
|
|
confidence="0.4"/>
|
|
</one-of>
|
|
</reason>
|
|
</interaction>
|
|
</assertion>
|
|
</emma:interpretation>
|
|
</emma:derivation>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>In addition to sense disambiguation it could also be useful to
|
|
relate concepts to superordinate concepts in some ontology. For
|
|
example, it could be useful to know that O'Hare is an airport and
|
|
Chicago is a city, even though they might be used interchangeably
|
|
in an application. For instance, in an air travel application a user
|
|
might say "I want to fly to O'Hare" or "I want to fly to
|
|
Chicago".</p>
|
|
<h3 id="s2.5">2.5 Representing system output in EMMA</h3>
|
|
<p><a href="http://www.w3.org/TR/emma/">EMMA 1.0</a> was explicitly
|
|
limited in scope to representation of the interpretation of user
|
|
inputs. Most interactive systems also produce system output and one
|
|
of the major possible extensions of the EMMA language would be to
|
|
provide support for representation of the outputs made by the
|
|
system in addition to the user inputs. One advantage of having EMMA
|
|
representation for system output is that system logs can have
|
|
unified markup representation across input and output for viewing
|
|
and analyzing user/system interactions. In this section, we
|
|
consider two different use cases for addition of output
|
|
representation to EMMA.</p>
|
|
<h4 id="s2.5.1">2.5.1 Abstracting output from specific modality or
|
|
output language</h4>
|
|
<p>It is desirable for a multimodal dialog designer to be able to
|
|
isolate dialog flow (for example <a href=
|
|
"http://www.w3.org/TR/2009/WD-scxml-20091029/">SCXML</a> code) from
|
|
the details of specific utterances produced by a system. This can
|
|
be achieved by using a presentation or media planning component that
|
|
takes the abstract intent from the system and creates one or more
|
|
modality-specific presentations. In addition to isolating dialog
|
|
logic from specific modality choice, this can also make it easier to
|
|
support different technologies for the same modality. For example,
|
|
in the example below, the GUI technology is HTML, but abstracting
|
|
output would also support using a different GUI technology like
|
|
Flash, or <a href="http://www.w3.org/Graphics/SVG/">SVG</a>. If
|
|
EMMA is extended to support output, then EMMA documents could be
|
|
used for communication from the dialog manager to the presentation
|
|
planning component, and also potentially for the documents
|
|
generated by the presentation component, which could embed specific
|
|
markup such as HTML and <a href=
|
|
"http://www.w3.org/TR/speech-synthesis/">SSML</a>. Just as there
|
|
can be multiple different stages of processing of a user input,
|
|
there may be multiple stages of processing of an output, and the
|
|
mechanisms of EMMA can be used to capture and provide metadata on
|
|
these various stages of output processing.</p>
|
|
<p>Potential benefits for this approach include:</p>
|
|
<ol>
|
|
<li>Accessibility: it would be useful for an application to be able
|
|
to accommodate users who might have an assistive device or devices
|
|
without requiring special logic or even special applications.</li>
|
|
<li>Device independence: An application could separate the flow in
|
|
the IM from the details of the presentation. This might be
|
|
especially useful if there are a lot of target devices with
|
|
different types of screens, cameras, or possibilities for haptic
|
|
output.</li>
|
|
<li>Adapting to user preferences: An application could accommodate
|
|
different dynamic preferences, for example, switching to visual
|
|
presentation from speech in public places without disturbing the
|
|
application flow.</li>
|
|
</ol>
|
|
<p>In the following example, we consider the introduction of a new
|
|
EMMA element, <code><emma:presentation></code> which is the
|
|
output equivalent of the input element
|
|
<code><emma:interpretation></code>. Like
|
|
<code><emma:interpretation></code> this element can take
|
|
<code>emma:medium</code> and <code>emma:mode</code> attributes
|
|
classifying the specific modality. It could also potentially take
|
|
timestamp annotations indicating the time at which the output
|
|
should be produced. One issue is whether timestamps should be used
|
|
for the intended time of production or for the actual time of
|
|
production and how to capture both. Relative timestamps could be
|
|
used to anchor the planned time of presentation to another element
|
|
of system output. In this example we show how the
|
|
<code>emma:semantic-rep</code> attribute proposed in <a href=
|
|
"#s2.12">Section 2.12</a> could potentially be used to indicate the
|
|
markup language of the output.</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Output</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">IM (step 1)</td>
|
|
<td>semantics of "what would you like for lunch?"</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:presentation>
|
|
<question>
|
|
<topic>lunch</topic>
|
|
<experiencer>second person</experiencer>
|
|
<object>questioned</object>
|
|
</question>
|
|
</emma:presentation>
|
|
</emma:emma>
|
|
|
|
</pre>
|
|
<p>or, more simply, without natural language generation:</p>
|
|
<pre>
|
|
<emma:emma>
|
|
<emma:presentation>
|
|
<text>what would you like for lunch?</text>
|
|
</emma:presentation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">presentation manager (voice output)</td>
|
|
<td>text "what would you like for lunch?"</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:presentation
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:verbal="true"
|
|
emma:function="dialog"
|
|
emma:semantic-rep="ssml">
|
|
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
|
|
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
|
|
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
|
|
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
|
|
xml:lang="en-US">
|
|
what would you like for lunch</speak>
|
|
</emma:presentation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">presentation manager (GUI output)</td>
|
|
<td>text "what would you like for lunch?"</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:presentation
|
|
emma:medium="visual"
|
|
emma:mode="graphics"
|
|
emma:verbal="true"
|
|
emma:function="dialog"
|
|
emma:semantic-rep="html">
|
|
<html>
|
|
<body>
|
|
<p>what would you like for lunch?"</p>
|
|
<input name="" type="text">
|
|
<input type="submit" name="Submit"
|
|
value="Submit">
|
|
</body>
|
|
</html>
|
|
</emma:presentation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<h4 id="s2.5.2">2.5.2 Coordination of outputs distributed over
|
|
multiple different modalities</h4>
|
|
<p>A critical issue for effective multimodal
output is the synchronization of outputs in different output
|
|
media. For example, text to speech output or prompts may be
|
|
coordinated with graphical outputs such as highlighting of items in
|
|
an HTML table. EMMA markup could potentially be used to indicate
|
|
that elements in each medium should be coordinated in their
|
|
presentation. In the following example, a new attribute
|
|
<code>emma:sync</code> is used to indicate the relationship between
|
|
a <code><mark></code> in <a href=
|
|
"http://www.w3.org/TR/speech-synthesis/">SSML</a> and an element to
|
|
be highlighted in HTML content. The <code>emma:process</code>
|
|
attribute could be used to identify the presentation planning
|
|
component. Again <code>emma:semantic-rep</code> is used to indicate
|
|
the embedded markup language.</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Output</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">system</td>
|
|
<td width="50">Coordinated presentation of table with TTS</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:group id="gp1"
|
|
emma:medium="acoustic,visual"
|
|
emma:mode="voice,graphics"
|
|
emma:process="http://example.com/presentation_planner">
|
|
<emma:presentation id="pres1"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:verbal="true"
|
|
emma:function="dialog"
|
|
emma:semantic-rep="ssml">
|
|
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
|
|
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
|
|
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
|
|
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
|
|
xml:lang="en-US">
|
|
Item 4 <mark emma:sync="123"/> costs fifteen dollars.
|
|
</speak>
|
|
</emma:presentation>
|
|
<emma:presentation id="pres2"
|
|
emma:medium="visual"
|
|
emma:mode="graphics"
|
|
emma:verbal="true"
|
|
emma:function="dialog"
|
|
emma:semantic-rep="html"
|
|
<table xmlns="http://www.w3.org/1999/xhtml">
|
|
<tr>
|
|
<td emma:sync="123">Item 4</td>
|
|
<td>15 dollars</td>
|
|
</tr>
|
|
</table>
|
|
</emma:presentation>
|
|
</emma:group>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>One issue to be considered is the potential role of the
|
|
Synchronized Multimedia Integration Language (<a href=
|
|
"http://www.w3.org/TR/REC-smil/">SMIL</a>) for capturing multimodal
|
|
output synchronization. SMIL markup for multimedia presentation
|
|
could potentially be embedded within EMMA markup coming from an
|
|
interaction manager to a client for rendering.</p>
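<p>As a purely schematic, non-normative sketch, a presentation
planning component might embed a SMIL fragment within an
<code><emma:presentation></code> element in the same way that
SSML and HTML are embedded in the examples above. The media file
names, and the use of "smil" as a value of the proposed
<code>emma:semantic-rep</code> attribute, are hypothetical:</p>
<div class="exampleInner">
<pre>
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:presentation id="pres3"
    emma:medium="acoustic,visual"
    emma:mode="voice,graphics"
    emma:function="dialog"
    emma:semantic-rep="smil">
    <smil xmlns="http://www.w3.org/ns/SMIL">
      <body>
        <par>
          <audio src="http://example.com/prompt345.wav"/>
          <img src="http://example.com/map345.png" dur="5s"/>
        </par>
      </body>
    </smil>
  </emma:presentation>
</emma:emma>
</pre>
</div>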
|
|
<h3 id="s2.6">2.6 Representation of dialogs in EMMA</h3>
|
|
<p>The scope of <a href="http://www.w3.org/TR/emma/">EMMA 1.0</a>
|
|
was explicitly limited to representation of single turns of user
|
|
input. For logging, analysis, and training purposes it could be
|
|
useful to be able to represent multi-stage dialogs in EMMA. The
|
|
following example shows a sequence of two EMMA documents where
|
|
the first is a request from the system and the second is the user
|
|
response. A new attribute <code>emma:in-response-to</code> is used
|
|
to relate the system output to the user input. EMMA already has an
|
|
attribute <code>emma:dialog-turn</code> used to provide an
|
|
indicator of the turn of interaction.</p>
|
|
<h4 id="dialog_example">Example</h4>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">system</td>
|
|
<td width="50">where would you like to go?</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:presentation id="pres1"
|
|
emma:dialog-turn="turn1"
|
|
emma:in-response-to="initial">
|
|
<prompt>
|
|
where would you like to go?
|
|
</prompt>
|
|
</emma:presentation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td>New York</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:interpretation id="int1"
|
|
emma:dialog-turn="turn2"
|
|
emma:tokens="new york"
|
|
emma:in-response-to="pres1">
|
|
<location>
|
|
New York
|
|
</location>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>In this case, each utterance is still a single EMMA document,
|
|
and markup is being used to encode the fact that the utterances are
|
|
part of an ongoing dialog. Another possibility would be to use EMMA
|
|
markup to contain a whole dialog within a single EMMA document. For
|
|
example, a flight query dialog could be represented as follows
|
|
using <code><emma:sequence></code>:</p>
|
|
<h4 id="sequence_example">Example</h4>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td>flights to boston</td>
|
|
<td rowspan="5">
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:sequence>
|
|
<emma:interpretation id="user1"
|
|
emma:dialog-turn="turn1"
|
|
emma:in-response-to="initial">
|
|
<emma:literal>
|
|
flights to boston
|
|
</emma:literal>
|
|
</emma:interpretation>
|
|
<emma:presentation id="sys1"
|
|
emma:dialog-turn="turn2"
|
|
emma:in-response-to="user1">
|
|
<prompt>
|
|
traveling to boston,
|
|
which departure city
|
|
</prompt>
|
|
</emma:presentation>
|
|
<emma:interpretation id="user2"
|
|
emma:dialog-turn="turn3"
|
|
emma:in-response-to="sys1">
|
|
<emma:literal>
|
|
san francisco
|
|
</emma:literal>
|
|
</emma:interpretation>
|
|
<emma:presentation id="sys2"
|
|
emma:dialog-turn="turn4"
|
|
emma:in-response-to="user2">
|
|
<prompt>
|
|
departure date
|
|
</prompt>
|
|
</emma:presentation>
|
|
<emma:interpretation id="user3"
|
|
emma:dialog-turn="turn5"
|
|
emma:in-response-to="sys2">
|
|
<emma:literal>
|
|
next thursday
|
|
</emma:literal>
|
|
</emma:interpretation>
|
|
</emma:sequence>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">system</td>
|
|
<td>traveling to Boston, which departure city?</td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td>San Francisco</td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">system</td>
|
|
<td>departure date</td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td>next thursday</td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>Note that in this example with
|
|
<code><emma:sequence></code> the
|
|
<code>emma:in-response-to</code> attribute is still important since
|
|
there is no guarantee that an utterance in a dialog is a response
|
|
to the previous utterance. For example, a sequence of utterances
|
|
may all be from the user.</p>
|
|
<p>One issue that arises with the representation of whole dialogs
|
|
is that the resulting EMMA documents with full sets of metadata may
|
|
become quite large. One possible extension that could help with
|
|
this would be to allow the value of <code>emma:in-response-to</code>
|
|
to be URI valued so it can refer to another EMMA document.</p>
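<p>As a rough, non-normative sketch, the value of
<code>emma:in-response-to</code> could then be a URI (hypothetical
here) identifying the EMMA document that contains the system prompt,
rather than a local identifier:</p>
<div class="exampleInner">
<pre>
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation id="int1"
    emma:dialog-turn="turn2"
    emma:tokens="new york"
    emma:in-response-to="http://example.com/logs/dialog123.emma#pres1">
    <location>New York</location>
  </emma:interpretation>
</emma:emma>
</pre>
</div>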
|
|
<h3 id="s2.7">2.7 Logging, analysis, and annotation</h3>
|
|
<p>EMMA was initially designed to facilitate communication among
|
|
components of an interactive system. It has become clear over time
|
|
that the language can also play an important role in logging of
|
|
user/system interactions. In this section, we consider possible
|
|
advantages of EMMA for log analysis and illustrate how elements
|
|
such as <code><emma:derived-from></code> could be used to
|
|
capture and provide metadata on annotations made by human
|
|
annotators.</p>
|
|
<h3 id="s2.7.1">2.7.1 Log analysis</h3>
|
|
<p>The proposal above for representing system output in EMMA would
|
|
support after-the-fact analysis of dialogs. For example, if both
|
|
the system's and the user's utterance are represented in EMMA, it
|
|
should be much easier to examine relationships between factors such
|
|
as how the wording of prompts might affect users' responses or even
|
|
the modality that users select for their responses. It would also
|
|
be easier to study timing relationships between the system prompt
|
|
and the user's responses. For example, prompts that are confusing
|
|
might consistently elicit longer times before the user starts
|
|
speaking. This would be useful even without a presentation manager
|
|
or fission component. In the following example, it might be useful
|
|
to look into the relationship between the end of the prompt and the
|
|
start of the user's response. We use here the
|
|
<code>emma:in-response-to</code> attribute suggested in <a href=
|
|
"#s2.6">Section 2.6</a> for the representation of dialogs in
|
|
EMMA.</p>
|
|
<h4 id="log_example">Example</h4>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">system</td>
|
|
<td>where would you like to go?</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:presentation id="pres1"
|
|
emma:dialog-turn="turn1"
|
|
emma:in-response-to="initial"
|
|
emma:start="1241035886246"
|
|
emma:end="1241035888306">
|
|
<prompt>
|
|
where would you like to go?
|
|
</prompt>
|
|
</emma:presentation>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td>New York</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:interpretation id="int1"
|
|
emma:dialog-turn="turn2"
|
|
emma:in-response-to="pres1"
|
|
emma:start="1241035891246"
|
|
emma:end="1241035893000"">
|
|
<destination>
|
|
New York
|
|
</destination>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<h3 id="s2.7.2">2.7.2 Log annotation</h3>
|
|
<p>EMMA is generally used to show the recognition, semantic
|
|
interpretation etc. assigned to inputs based on <em>machine</em>
|
|
processing of the user input. Another potential use case is to
|
|
provide a mechanism for showing the interpretation assigned to an
|
|
input by a human annotator and using
|
|
<code><emma:derived-from></code> to show the relationship
|
|
between the input received and the annotation. The
|
|
<code><emma:one-of></code> element can then be used to show
|
|
multiple competing annotations for an input. The
|
|
<code><emma:group></code> element could be used to contain
|
|
multiple different kinds of annotation on a single input. One
|
|
question here is whether <code>emma:process</code> can be used for
|
|
identification of the labeller, and whether there is a need for any
|
|
additional EMMA machinery to better support this use case. In
|
|
these examples, <code><emma:literal></code> contains mixed
|
|
content with text and elements. This is in keeping with the EMMA
|
|
1.0 schema.</p>
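<p>For instance, the following minimal sketch (using only existing EMMA
elements) lists two competing annotations of the same utterance under
<code><emma:one-of></code>; the annotator names and the small
difference in tagging are illustrative:</p>
<pre>
<emma:one-of id="annotations1">
  <emma:interpretation id="annotationA"
     emma:process="annotate:type=semantic&annotator=michael">
    <emma:literal>
      flights from <src>san francisco</src> to <dest>boston</dest>
      on <date>the fourth of september</date>
    </emma:literal>
  </emma:interpretation>
  <emma:interpretation id="annotationB"
     emma:process="annotate:type=semantic&annotator=debbie">
    <emma:literal>
      flights from <src>san francisco</src> to <dest>boston</dest>
      on the <date>fourth of september</date>
    </emma:literal>
  </emma:interpretation>
</emma:one-of>
</pre>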
|
|
<p>One issue that arises concerns the meaning of an
|
|
<code>emma:confidence</code> value on an annotated interpretation.
|
|
It may be preferable to have another attribute for annotator
|
|
confidence rather than overloading the current
|
|
<code>emma:confidence</code>.</p>
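<p>A purely hypothetical sketch of such an attribute, here called
<code>emma:annotator-confidence</code> (a name not defined in EMMA
1.0), which would leave <code>emma:confidence</code> free for
recognizer confidence:</p>
<pre>
<emma:interpretation id="annotation1"
     emma:process="annotate:type=semantic&annotator=michael"
     emma:annotator-confidence="0.95">
  <emma:literal>
    flights from <src>san francisco</src> to <dest>boston</dest>
    on <date>the fourth of september</date>
  </emma:literal>
  <emma:derived-from resource="#asr1"/>
</emma:interpretation>
</pre>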
|
|
<p>Another issue concerns mixing of system results and human
|
|
annotation. Should these be grouped, or is the annotation derived
from the system's interpretation? It would also be useful to
capture the time of the annotation; the current timestamps record
the time of the input itself. Where should annotation
|
|
timestamps be recorded?</p>
|
|
<p>It would also be useful to have a way to specify open ended
|
|
information about the annotator such as their native language,
|
|
profession, experience, etc. One approach would be to have a
|
|
new attribute e.g. <code>emma:annotator</code> with a URI value
|
|
that could point to a description of the annotator.</p>
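<p>A brief sketch of this idea, assuming a hypothetical
<code>emma:annotator</code> attribute whose value is an example URI
describing the annotator:</p>
<pre>
<emma:interpretation id="annotation1"
     emma:process="annotate:type=semantic"
     emma:annotator="http://example.com/annotators/michael.xml">
  <emma:literal>
    flights from <src>san francisco</src> to <dest>boston</dest>
  </emma:literal>
  <emma:derived-from resource="#asr1"/>
</emma:interpretation>
</pre>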
|
|
<p>For very common annotations it could be useful to have, in
addition to <code>emma:tokens</code>, a dedicated element to
indicate the annotated transcription, for example,
|
|
<code>emma:annotated-tokens</code> or
|
|
<code>emma:transcription</code>.</p>
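<p>A hypothetical sketch of the second of these, an
<code>emma:transcription</code> element (not part of EMMA 1.0) holding
the human transcription alongside the recognizer's
<code>emma:tokens</code>:</p>
<pre>
<emma:interpretation id="annotation3"
     emma:tokens="flights from san francisco to boston"
     emma:process="annotate:type=transcription&annotator=debbie">
  <emma:transcription>
    flights from san francisco to uh boston please
  </emma:transcription>
</emma:interpretation>
</pre>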
|
|
<p>In the following example, we show how
|
|
<code>emma:interpretation</code> and <code>emma:derived-from</code>
|
|
could be used to capture the annotation of an input.</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td width="614"><strong>Input</strong></td>
|
|
<td width="531"><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="93">user</td>
|
|
<td>
|
|
<p>In this example the user has said:</p>
|
|
<p>"flights from boston to san francisco leaving on the fourth of
|
|
september"</p>
|
|
<p>and the semantic interpretation here is a semantic tagging of
|
|
the utterance done by a human annotator. <code>emma:process</code> is used to
provide details about the annotation.</p>
|
|
</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:interpretation id="annotation1"
|
|
emma:process="annotate:type=semantic&annotator=michael"
|
|
emma:confidence="0.90">
|
|
<emma:literal>
|
|
flights from <src>san francisco</src> to
|
|
<dest>boston</dest> on
|
|
<date>the fourth of september</date>
|
|
</emma:literal>
|
|
<emma:derived-from resource="#asr1"/>
|
|
</emma:interpretation>
|
|
<emma:derivation>
|
|
<emma:interpretation id="asr1"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:function="dialog"
|
|
emma:verbal="true"
|
|
emma:lang="en-US"
|
|
emma:start="1241690021513"
|
|
emma:end="1241690023033"
|
|
emma:media-type="audio/amr; rate=8000"
|
|
emma:process="smm:type=asr&version=watson6"
|
|
emma:confidence="0.80">
|
|
<emma:literal>
|
|
flights from san francisco
|
|
to boston on the fourth of september
|
|
</emma:literal>
|
|
</emma:interpretation>
|
|
</emma:derivation>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>Taking this example a step further,
|
|
<code><emma:group></code> could be used to group annotations
|
|
made by multiple different annotators of the same utterance:</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td width="614"><strong>Input</strong></td>
|
|
<td width="531"><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="93">user</td>
|
|
<td>
|
|
<p>In this example the user has said:</p>
|
|
<p>"flights from boston to san francisco leaving on the fourth of
|
|
september"</p>
|
|
<p>and the semantic interpretation here is a semantic tagging of
|
|
the utterance done by two different human annotators.
|
|
<code>emma:process</code> is used to provide details about the
|
|
annotation.</p>
|
|
</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:group emma:confidence="1.0">
|
|
<emma:interpretation id="annotation1"
|
|
emma:process="annotate:type=semantic&annotator=michael"
|
|
emma:confidence="0.90">
|
|
<emma:literal>
|
|
flights from <src>san francisco</src>
|
|
to <dest>boston</dest>
|
|
on <date>the fourth of september</date>
|
|
</emma:literal>
|
|
<emma:derived-from resource="#asr1"/>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="annotation2"
|
|
emma:process="annotate:type=semantic&annotator=debbie"
|
|
emma:confidence="0.90">
|
|
<emma:literal>
|
|
flights from <src>san francisco</src>
|
|
to <dest>boston</dest> on
|
|
<date>the fourth of september</date>
|
|
</emma:literal>
|
|
<emma:derived-from resource="#asr1"/>
|
|
</emma:interpretation>
|
|
<emma:group-info>semantic_annotations</emma:group-info>
|
|
</emma:group>
|
|
<emma:derivation>
|
|
<emma:interpretation id="asr1"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:function="dialog"
|
|
emma:verbal="true"
|
|
emma:lang="en-US"
|
|
emma:start="1241690021513"
|
|
emma:end="1241690023033"
|
|
emma:media-type="audio/amr; rate=8000"
|
|
emma:process="smm:type=asr&version=watson6"
|
|
emma:confidence="0.80">
|
|
<emma:literal>
|
|
flights from san francisco to boston
|
|
on the fourth of september
|
|
</emma:literal>
|
|
</emma:interpretation>
|
|
</emma:derivation>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<h3 id="s2.8">2.8 Multisentence Inputs</h3>
|
|
<p>For certain applications, it is useful to be able to represent
|
|
the semantics of multi-sentence inputs, which may be in one or more
|
|
modalities such as speech (e.g. voicemail), text (e.g. email), or
|
|
handwritten input. One application use case is for summarizing a
|
|
voicemail or email. We develop this example below.</p>
|
|
<p>There are at least two possible approaches to addressing this
|
|
use case.</p>
|
|
<ol>
|
|
<li>If there is no reason to distinguish the individual sentences
|
|
of the input or interpret them individually, the entire input could
|
|
be included as the value of the <code>emma:tokens</code> attribute
|
|
of an <code><emma:interpretation></code> or
|
|
<code><emma:one-of></code> element, where the semantics of
|
|
the input is represented as the value of an
|
|
<code><emma:interpretation></code>. Although in principle
|
|
there is no upper limit on the length of an <code>emma:tokens</code>
|
|
attribute, in practice, this approach might be cumbersome for
|
|
longer or more complicated texts.</li>
|
|
<li>If more structure is required, the interpretations of the
|
|
individual sentences in the input could be grouped as individual
|
|
<code><emma:interpretation></code> elements under an
|
|
<code><emma:sequence></code> element. A single unified
|
|
semantics representing the meaning of the entire input could then
be represented in a separate interpretation, with the sequence
referenced through <code><emma:derived-from></code> (see the
sketch after the example below).</li>
|
|
</ol>
|
|
<p>The example below illustrates the first approach.</p>
|
|
<h4 id="multisentence_example">Example</h4>
|
|
<table border="1">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td width="614"><strong>Input</strong></td>
|
|
<td width="531"><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="93">user</td>
|
|
<td>
|
|
<p>Hi Group,</p>
|
|
<p>You are all invited to lunch tomorrow at Tony's Pizza at 12:00.
|
|
Please let me know if you're planning to come so that I can make
|
|
reservations. Also let me know if you have any dietary
|
|
restrictions. Tony's Pizza is at 1234 Main Street. We will be
|
|
discussing ways of using EMMA.</p>
|
|
<p>Debbie</p>
|
|
</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:interpretation id="interp1"
|
|
emma:tokens="Hi Group, You are all invited to
|
|
lunch tomorrow at Tony's Pizza at 12:00.
|
|
Please let me know if you're planning to
|
|
come so that I can make reservations.
|
|
Also let me know if you have any dietary
|
|
restrictions. Tony's Pizza is at 1234
|
|
Main Street. We will be discussing
|
|
ways of using EMMA." >
|
|
<business-event>lunch</business-event>
|
|
<host>debbie</host>
|
|
<attendees>group</attendees>
|
|
<location>
|
|
<name>Tony's Pizza</name>
|
|
<address> 1234 Main Street</address>
|
|
</location>
|
|
<date>Tuesday, March 24</date>
|
|
<needs-rsvp>true</needs-rsvp>
|
|
<needs-restrictions>true</needs-restrictions>
|
|
<topic>ways of using EMMA</topic>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
|
|
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
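<p>For comparison, the following is a minimal sketch of the second
approach: each sentence receives its own interpretation inside an
<code><emma:sequence></code> held in the derivation, and a single
unified interpretation refers to that sequence through
<code><emma:derived-from></code>. The ids and the subset of
application elements shown are illustrative only.</p>
<pre>
<emma:emma
   version="2.0"
   xmlns:emma="http://www.w3.org/2003/04/emma"
   xmlns="http://www.example.com/example">
 <emma:interpretation id="summary1">
   <business-event>lunch</business-event>
   <location>
     <name>Tony's Pizza</name>
     <address>1234 Main Street</address>
   </location>
   <needs-rsvp>true</needs-rsvp>
   <emma:derived-from resource="#sentences1"/>
 </emma:interpretation>
 <emma:derivation>
   <emma:sequence id="sentences1">
     <emma:interpretation id="s1"
        emma:tokens="You are all invited to lunch tomorrow at Tony's Pizza at 12:00.">
       <business-event>lunch</business-event>
     </emma:interpretation>
     <emma:interpretation id="s2"
        emma:tokens="Please let me know if you're planning to come so that I can make reservations.">
       <needs-rsvp>true</needs-rsvp>
     </emma:interpretation>
     <emma:interpretation id="s3"
        emma:tokens="Tony's Pizza is at 1234 Main Street.">
       <location>
         <name>Tony's Pizza</name>
         <address>1234 Main Street</address>
       </location>
     </emma:interpretation>
   </emma:sequence>
 </emma:derivation>
</emma:emma>
</pre>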
|
|
<h3 id="s2.9">2.9 Multi-participant interactions</h3>
|
|
<p><a href="http://www.w3.org/TR/emma/">EMMA 1.0</a> primarily
|
|
focussed on the interpretation of inputs from a single user. Both
|
|
for annotation of human-human dialogs and for the emerging systems
|
|
which support dialog or multimodal interaction with multiple
|
|
participants (such as multimodal systems for meeting analysis), it
|
|
is important to support annotation of interactions involving
|
|
multiple different participants. The proposals above for capturing
|
|
dialog can play an important role. One possible further extension
|
|
would be to add specific markup for annotation of the user making a
|
|
particular contribution. In the following example, we use an
|
|
attribute <code>emma:participant</code> to identify the participant
|
|
contributing each response to the prompt.</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td width="668"><strong>Input</strong></td>
|
|
<td width="480"><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="90">system</td>
|
|
<td>Please tell me your lunch orders</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:presentation id="pres1"
|
|
emma:dialog-turn="turn1"
|
|
emma:in-response-to="initial"
|
|
emma:start="1241035886246"
|
|
emma:end="1241035888306">
|
|
<prompt>please tell me your lunch orders</prompt>
|
|
</emma:presentation>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="90">user1</td>
|
|
<td>I'll have a mushroom pizza</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:interpretation id="int1"
|
|
emma:dialog-turn="turn2"
|
|
emma:in-response-to="pres1"
|
|
emma:participant="user1"
|
|
emma:start="1241035891246"
|
|
emma:end="1241035893000"">
|
|
<pizza>
|
|
<topping>
|
|
mushroom
|
|
</topping>
|
|
</pizza>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="90">user3</td>
|
|
<td>I'll have a pepperoni pizza.</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:interpretation id="int2"
|
|
emma:dialog-turn="turn3"
|
|
emma:in-response-to="pres1"
|
|
emma:participant="user2"
|
|
emma:start="1241035896246"
|
|
emma:end="1241035899000"">
|
|
<pizza>
|
|
<topping>
|
|
pepperoni
|
|
</topping>
|
|
</pizza>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<h3 id="s2.10">2.10 Capturing sensor data such as GPS in EMMA</h3>
|
|
<p>The multimodal examples described in the <a href=
|
|
"http://www.w3.org/TR/emma/">EMMA 1.0</a> specification, include
|
|
combination of spoken input with a location specified by touch or
|
|
pen. With the increase in availability of GPS and other location
|
|
sensing technology such as cell tower triangulation in mobile
|
|
devices, it is desirable to provide a method for annotating inputs
|
|
with the device location and, in some cases, fusing the GPS
|
|
information with the spoken command in order to derive a complete
|
|
interpretation. GPS information could potentially be determined
|
|
using the <a href=
|
|
"http://www.w3.org/TR/2009/WD-geolocation-API-20090707/">Geolocation
|
|
API Specification</a> from the <a href=
|
|
"http://www.w3.org/2008/geolocation/">Geolocation working group</a>
|
|
and then encoded into an EMMA result sent to a server for
|
|
fusion.</p>
|
|
<p>One possibility using the current EMMA capabilities is to use
|
|
<code><emma:group></code> to associate GPS markup with the
|
|
semantics of a spoken command. For example, the user might say
|
|
"where is the nearest pizza place?" and the interpretation of the
|
|
spoken command is grouped with markup capturing the GPS sensor
|
|
data. This example uses the existing
|
|
<code><emma:group></code> element and extends the set of
|
|
values of <code>emma:medium</code> and <code>emma:mode</code> to
|
|
include <code>"sensor"</code> and <code>"gps"</code>
|
|
respectively.</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td width="50">where is the nearest pizza place?</td>
|
|
<td rowspan="2">
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:group>
|
|
<emma:interpretation id="speech1"
|
|
emma:tokens="where is the nearest pizza place"
|
|
emma:confidence="0.9"
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:start="1241035887111"
|
|
emma:end="1241035888200"
|
|
emma:process="reco:type=asr&version=asr_eng2.4"
|
|
emma:media-type="audio/amr; rate=8000"
|
|
emma:lang="en-US">
|
|
<category>pizza</category>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="gps1"
|
|
emma:medium="sensor"
|
|
emma:mode="gps"
|
|
emma:start="1241035886246"
|
|
emma:end="1241035886246">
|
|
<lat>40.777463</lat>
|
|
<lon>-74.410500</lon>
|
|
<alt>0.2</alt>
|
|
</emma:interpretation>
|
|
<emma:group-info>geolocation</emma:group-info>
|
|
</emma:group>
|
|
</emma:emma>
|
|
|
|
</pre></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">GPS</td>
|
|
<td>(GPS coordinates)</td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<p>Another, more abbreviated, way to incorporate sensor information
|
|
would be to have spatial correlates of the timestamps and allow for
|
|
location stamping of user inputs, e.g. <code>emma:lat</code> and
|
|
<code>emma:lon</code> attributes that could appear on EMMA
|
|
container elements to indicate the location where the input was
|
|
produced.</p>
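<p>A minimal sketch of this alternative, using hypothetical
<code>emma:lat</code> and <code>emma:lon</code> attributes that are not
defined in EMMA 1.0:</p>
<pre>
<emma:interpretation id="int1"
   emma:tokens="where is the nearest pizza place"
   emma:medium="acoustic"
   emma:mode="voice"
   emma:lat="40.777463"
   emma:lon="-74.410500">
  <category>pizza</category>
</emma:interpretation>
</pre>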
|
|
<h3 id="s2.11">2.11 Extending EMMA from NLU to also represent
|
|
search or database retrieval results</h3>
|
|
<p>In many of the use cases considered so far, EMMA is used for
|
|
representation of the results of speech recognition and then for
|
|
the results of natural language understanding, and possibly
|
|
multimodal fusion. In systems used for voice search, the next step
|
|
is often to conduct search and extract a set of records or
|
|
documents. Strictly speaking, this stage of processing is out of
|
|
scope for EMMA. It is odd though to have the mechanisms of EMMA
|
|
such as <code><emma:one-of></code> for ambiguity all the way
|
|
up to NLU or multimodal fusion, but not to have access to the same
|
|
apparatus for representation of the next stage of processing which
|
|
can often be search or database lookup. Just as we can use
|
|
<code><emma:one-of></code> and <code>emma:confidence</code>
|
|
to represent N-best recognitions or semantic interpretations,
|
|
similarly we can use them to represent a series of search results
|
|
along with their relative confidence. One issue is whether we need
|
|
some measure other than confidence for relevance ranking, or whether
the same confidence attribute can be used.</p>
|
|
<p>One issue that arises is whether it would be useful to have some
|
|
recommended or standardized element to use for query results, e.g.
|
|
<code><result></code> as in the following example. Another
|
|
issue is how to annotate information about the database and the
|
|
query that was issued. The database could be indicated as part of
|
|
the <code>emma:process</code> value as in the following example.
|
|
For web search the query URL could be annotated on the result e.g.
|
|
<code><result url="http://cnn.com"/></code>. For database
|
|
queries, the query (SQL, for example) could be annotated on the
|
|
results or on the containing <code><emma:group></code>.</p>
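<p>As a rough sketch of these options, the query could be carried in an
<code><emma:info></code> annotation on the containing element, and
each web-search result could carry its query URL. The
<code><result></code> and <code><query></code> element names
below are illustrative application markup, not defined by EMMA:</p>
<pre>
<emma:group>
  <emma:info>
    <query>select name, room, number from directory where name like 'john smith'</query>
  </emma:info>
  <emma:interpretation id="rec1" emma:confidence="0.80">
    <result url="http://directory.example.com/?q=john+smith">
      <name>John Smith</name>
      <room>dx513</room>
    </result>
  </emma:interpretation>
  <emma:interpretation id="rec2" emma:confidence="0.70">
    <result url="http://directory.example.com/?q=jon+smith">
      <name>Jon Smith</name>
      <room>dv900</room>
    </result>
  </emma:interpretation>
  <emma:group-info>database_results</emma:group-info>
</emma:group>
</pre>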
|
|
<p>The following example shows the use of EMMA to represent the
|
|
results of database retrieval from an employee directory. The user
|
|
says "John Smith". After ASR, NLU, and then database look up, the
|
|
system returns the XML here which shows the N-best lists associated
|
|
with each of these three stages of processing. Here
|
|
<code><emma:derived-from></code> is used to indicate the
|
|
relations between each of the <code><emma:one-of></code>
|
|
elements. However, if you want to see which specific ASR result a
|
|
record is derived from, you would need to put
|
|
<code><emma:derived-from></code> on the individual
|
|
elements.</p>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td width="50">User says "John Smith"</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:one-of id="db_results1"
|
|
emma:process="db:type=mysql&database=personel_060109.db>
|
|
<emma:interpretation id="db_nbest1"
|
|
emma:confidence="0.80" emma:tokens="john smith">
|
|
<result>
|
|
<name>John Smith</name>
|
|
<room>dx513</room>
|
|
<number>123-456-7890</number>
|
|
</result>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="db_nbest2"
|
|
emma:confidence="0.70" emma:tokens="john smith">
|
|
<result>
|
|
<name>John Smith</name>
|
|
<room>ef312</room>
|
|
<number>123-456-7891</number>
|
|
</result>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="db_nbest3"
|
|
emma:confidence="0.50" emma:tokens="jon smith">
|
|
<result>
|
|
<name>Jon Smith</name>
|
|
<room>dv900</room>
|
|
<number>123-456-7892</number>
|
|
</result>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="db_nbest4"
|
|
emma:confidence="0.40" emma:tokens="joan smithe">
|
|
<result>
|
|
<name>Joan Smithe</name>
|
|
<room>lt567</room>
|
|
<number>123-456-7893</number>
|
|
</result>
|
|
</emma:interpretation>
|
|
<emma:derived-from resource="#nlu_results1/>
|
|
</emma:one-of>
|
|
<emma:derivation>
|
|
<emma:one-of id="nlu_results1"
|
|
emma:process="smm:type=nlu&version=parser">
|
|
<emma:interpretation id="nlu_nbest1"
|
|
emma:confidence="0.99" emma:tokens="john smith">
|
|
<fn>john</fn><ln>smith</ln>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="nlu_nbest2"
|
|
emma:confidence="0.97" emma:tokens="jon smith">
|
|
<fn>jon</fn><ln>smith</ln>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="nlu_nbest3"
|
|
emma:confidence="0.93" emma:tokens="joan smithe">
|
|
<fn>joan</fn><ln>smithe</ln>
|
|
</emma:interpretation>
|
|
<emma:derived-from resource="#asr_results1/>
|
|
</emma:one-of>
|
|
<emma:one-of id="asr_results1"
|
|
emma:medium="acoustic" emma:mode="voice"
|
|
emma:function="dialog" emma:verbal="true"
|
|
emma:lang="en-US" emma:start="1241641821513"
|
|
emma:end="1241641823033"
|
|
emma:media-type="audio/amr; rate=8000"
|
|
emma:process="smm:type=asr&version=watson6">
|
|
<emma:interpretation id="asr_nbest1"
|
|
emma:confidence="1.00">
|
|
<emma:literal>john smith</emma:literal>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="asr_nbest2"
|
|
emma:confidence="0.98">
|
|
<emma:literal>jon smith</emma:literal>
|
|
</emma:interpretation>
|
|
<emma:interpretation id="asr_nbest3"
|
|
emma:confidence="0.89" >
|
|
<emma:literal>joan smithe</emma:literal>
|
|
</emma:interpretation>
|
|
</emma:one-of>
|
|
</emma:derivation>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<h3 id="s2.12">2.12 Supporting other semantic representation forms
|
|
in EMMA</h3>
|
|
<p>In the <a href="http://www.w3.org/TR/emma/">EMMA 1.0</a>
|
|
specification, the semantic representation of an input is
|
|
represented either in XML in some application namespace or as a
|
|
literal value using <code>emma:literal</code>. In some
|
|
circumstances it could be beneficial to allow for semantic
|
|
representation in other formats such as JSON. Serializations such
|
|
as JSON could potentially be contained within
|
|
<code>emma:literal</code> using CDATA, and a new EMMA annotation
|
|
e.g. <code>emma:semantic-rep</code> used to indicate the semantic
|
|
representation language being used.</p>
|
|
<h4 id="semantic_representation_example">Example</h4>
|
|
<table width="120">
|
|
<tbody>
|
|
<tr>
|
|
<td><strong>Participant</strong></td>
|
|
<td><strong>Input</strong></td>
|
|
<td><strong>EMMA</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="50">user</td>
|
|
<td>semantics of spoken input</td>
|
|
<td>
|
|
<pre>
|
|
<emma:emma
|
|
version="2.0"
|
|
xmlns:emma="http://www.w3.org/2003/04/emma"
|
|
xmlns="http://www.example.com/example">
|
|
<emma:interpretation id="int1"
|
|
emma:confidence=".75”
|
|
emma:medium="acoustic"
|
|
emma:mode="voice"
|
|
emma:verbal="true"
|
|
emma:function="dialog"
|
|
emma:semantic-rep="json">
|
|
<emma:literal>
|
|
<![CDATA[
|
|
{
|
|
drink: {
|
|
liquid:"coke",
|
|
drinksize:"medium"},
|
|
pizza: {
|
|
number: "3",
|
|
pizzasize: "large",
|
|
topping: [ "pepperoni", "mushrooms" ]
|
|
}
|
|
}
|
|
]]>
|
|
</emma:literal>
|
|
</emma:interpretation>
|
|
</emma:emma>
|
|
</pre></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
<h2 id="references">General References</h2>
|
|
<p>EMMA 1.0 Requirements <a href=
|
|
"http://www.w3.org/TR/EMMAreqs/">http://www.w3.org/TR/EMMAreqs/</a></p>
|
|
<p>EMMA Recommendation <a href=
|
|
"http://www.w3.org/TR/emma/">http://www.w3.org/TR/emma/</a></p>
|
|
<h2 id="acknowledgements">Acknowledgements</h2>
|
|
<p>Thanks to Jim Larson (W3C Invited Expert) for his contribution
|
|
to the section on EMMA for multimodal output.</p>
|
|
</body>
|
|
</html>
|