Extending the Web to support multiple modes of interaction.
News
- 20 September 2011: Ink Markup Language (InkML) is a W3C Recommendation.
- 6 September 2011: The second Last Call Working Draft of Multimodal Architecture and Interfaces is published.
The main normative change from the previous draft is the removal of the 'immediate' field from the following life-cycle events: CancelRequest and PauseRequest.
- 10 May 2011: Ink Markup Language (InkML) is a W3C Proposed Recommendation.
- 8 April 2011: The Last Call Working Draft of Emotion Markup Language (EmotionML) 1.0 is published. The main change from the previous draft is the inclusion of a mechanism for defining emotion vocabularies. A precise list of changes from the previous draft is also available for comparison purposes.
Also, Vocabularies for EmotionML is published as a First Public Working Draft. This document is a public collection of emotion vocabularies that can be used with EmotionML to represent emotions and related states. It was originally part of an earlier draft of the EmotionML specification, but was moved out so that the list of vocabularies can easily be updated, extended and corrected as required.
- 1 March 2011: Best practices for creating MMI Modality Components is published as a Working Group Note.
- 25 January 2011: The Last Call Working Draft of Multimodal Architecture and Interfaces is published.
The main change from the previous draft is a tightening of the language to make the requirements more precise. A diff-marked version is also available for comparison purposes.
- 11 January 2011: Ink Markup Language (InkML) is a W3C Candidate Recommendation (see also the group's Implementation Report Plan).
- 5-6 October 2010: The EmotionML Workshop was held in Paris, France, hosted by Telecom ParisTech. The summary and detailed minutes are available online. Participants from 12 organizations discussed use cases of possible emotion-ready applications and clarified several key requirements for the current EmotionML to make the specification even more useful.
- 21 September 2010: The seventh Working Draft of Multimodal Architecture and Interfaces is published.
The main changes from the previous draft are (1) the inclusion of state charts for modality components, (2) the addition of a 'confidential' field to life-cycle events and (3) the removal of the 'media' field from life-cycle events. A diff-marked version is also available for comparison purposes.
- 29 July 2010: The second Working Draft of Emotion Markup Language (EmotionML) 1.0 is published. A diff-marked version is also available for comparison purposes.
Please send your comments to the Multimodal Interaction public mailing list (<www-multimodal@w3.org>).
- 18-19 June 2010: The workshop on Conversational Applications was held in Somerset, NJ (USA), hosted by Openstream. The summary and detailed minutes are available online. Participants from 12 organizations focused discussion on use cases for possible conversational applications and clarified limitations of the current W3C language model in order to develop a more comprehensive one.
- 10 February 2009: EMMA: Extensible MultiModal Annotation markup language is a W3C Recommendation. (press release)
The Multimodal Interaction Activity seeks to extend the Web to allow users to dynamically select the most appropriate mode of interaction for their current needs, including any disabilities, whilst enabling developers to provide an effective user interface for whichever modes the user selects. Depending upon the device, users will be able to provide input via speech, handwriting, and keystrokes, with output presented via displays, pre-recorded and synthetic speech, audio, and tactile mechanisms such as mobile phone vibrators and Braille strips.
Multimodal interaction offers significant ease-of-use benefits over uni-modal interaction, for instance when hands-free operation is needed, for mobile devices with limited keypads, and for controlling other devices when a traditional desktop computer is unavailable to host the application user interface. This interest is being driven by advances in embedded and network-based speech processing, which are creating opportunities both for integrated multimodal Web browsers and for solutions that separate the handling of visual and aural modalities, for example by coupling a local XHTML user agent with a remote VoiceXML user agent.
The Multimodal Interaction Working Group (member only link) should be of interest to a range of organizations in different industry sectors.
The Multimodal Interaction Working Group was launched in 2002 following a joint workshop between the W3C and the WAP Forum. The Working Group's initial focus was on use cases and requirements. This led to the publication of the W3C Multimodal Interaction Framework, and in turn to work on extensible multi-modal annotations (EMMA), and InkML, an XML language for ink traces. The Working Group has also worked on integration of composite multimodal input; dynamic adaptation to device configurations, user preferences and environmental conditions (now transferred to the Device Independence Activity); modality component interfaces; and a study of current approaches to interaction management. The Working Group has now been re-chartered through 31 January 2009 under the terms of the W3C Patent Policy (5 February 2004 Version). To promote the widest adoption of Web standards, W3C seeks to issue Recommendations that can be implemented, according to this policy, on a Royalty-Free basis. The Working Group is chaired by Deborah Dahl. The W3C Team Contact is Kazuyuki Ashimura.
We are very interested in your comments and suggestions. If you have implemented multimodal interfaces, please share your experiences with us, as we are particularly interested in reports on implementations and their usability for both end-users and application developers. We welcome comments on any of our published documents. If you have a proposal for a multimodal authoring language, please let us know. To subscribe to the discussion list, send an email to www-multimodal-request@w3.org with the word subscribe in the subject header. Previous discussion can be found in the public archive. To unsubscribe, send an email to www-multimodal-request@w3.org with the word unsubscribe in the subject header.
If your organization is already a member of W3C, ask your W3C Advisory Committee Representative (member only link) to fill out the online registration form to confirm that your organization is prepared to commit the time and expense involved in participating in the group. You will be expected to attend all Working Group meetings (about 3 or 4 times a year) and to respond in a timely fashion to email requests. Further details about joining are available on the Working Group (member only link) page. Requirements for patent disclosures, as well as terms and conditions for licensing essential IPR, are given in the W3C Patent Policy.
More information about the W3C is available, as is information about joining W3C.
W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent.
| Specification | FPWD | LC | CR | PR | Rec |
|---|---|---|---|---|---|
| Multimodal Architecture and Interfaces | Completed (2nd through 7th WDs also published) | Completed | TBD | TBD | TBD |
| EMMA 2.0 | 4Q 2009 | January 2011 | TBD | TBD | TBD |
| EMMA | Completed | Completed | Completed | Completed | Completed (10 Feb. 2009) |
| InkML | Completed | Completed (1st and 2nd LC) | Completed | April 2010 | June 2010 |
| EmotionML | Completed (1st and 2nd WD) | Completed | June 2011 | TBD | TBD |
| Ink Modality Component Definition | Completed (as a WG Note) | - | - | - | - |
| Voice Modality Component Definition | December 2009 (as a WG Note) | - | - | - | - |
This is intended to give you a brief summary of each of the major work items under development by the Multimodal Interaction Working Group. The suite of specifications is known as the W3C Multimodal Interaction Framework.
The following indicates current work items. Additional work is expected on topics described in section 4 of the charter, including multimodal authoring, modality component interfaces, composite multimodal input, and coordinated multimodal output.
A loosely coupled architecture for the Multimodal Interaction Framework that focuses on providing a general means for components to communicate with each other, plus basic infrastructure for application control and platform services. Work is continuing on how the architecture can be realized in terms of well-defined component interfaces and eventing models.
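To make the eventing model more concrete, the sketch below shows roughly how a StartRequest life-cycle event from an Interaction Manager to a modality component might be serialized in XML. This is an illustration only, not text from the specification: the element and attribute spellings, and the source, target and context values, are approximations and should be checked against the current draft.

```xml
<mmi:mmi xmlns:mmi="http://www.w3.org/2008/04/mmi-arch" version="1.0">
  <!-- Illustrative sketch: the Interaction Manager (source) asks a modality
       component (target) to start running the referenced content within an
       existing interaction context; identifiers here are made up. -->
  <mmi:startRequest source="IM-1" target="voiceMC-1"
      context="ctx-1" requestID="req-1">
    <mmi:contentURL href="http://example.com/dialog.vxml"/>
  </mmi:startRequest>
</mmi:mmi>
```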
EMMA has been developed as a data exchange format for the interface between input processors and interaction management systems. It defines the means for recognizers to annotate application-specific data with information such as confidence scores, time stamps, input mode (e.g. keystrokes, speech or pen), alternative recognition hypotheses, and partial recognition results. EMMA is a target data format for the semantic interpretation specification being developed in the Voice Browser Activity, which describes annotations to speech grammars for extracting application-specific data as a result of speech recognition. EMMA supersedes the earlier work on the Natural Language Semantics Markup Language in the Voice Browser Activity.
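As a minimal sketch, an EMMA 1.0 document annotating a spoken travel request might look like the following. The application payload (the origin and destination elements and their namespace) is invented for illustration; the emma:* annotations are the kind of metadata the specification defines.

```xml
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns="http://www.example.com/travel">
  <!-- A single interpretation of the user's spoken input, annotated with
       medium, mode, confidence, timestamps and the recognized tokens. -->
  <emma:interpretation id="int1"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:confidence="0.75"
      emma:start="1087995961542"
      emma:end="1087995963542"
      emma:tokens="flights from boston to denver">
    <origin>Boston</origin>
    <destination>Denver</destination>
  </emma:interpretation>
</emma:emma>
```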
Since EMMA 1.0 became a W3C Recommendation, a number of new possible use cases for the EMMA language have emerged. These include the use of EMMA to represent multimodal output, biometrics, emotion, sensor data, multi-stage dialogs, and interactions with multiple users. The Working Group has therefore decided to work on a document capturing use cases and issues for a series of possible extensions to EMMA, and has published a Working Group Note to seek feedback on the various use cases.
This work item sets out to define an XML data exchange format for ink entered with an electronic pen or stylus as part of a multimodal system. This will enable the capture and server-side processing of handwriting, gestures, drawings, and specific notations for mathematics, music, chemistry and other fields, as well as supporting further research on this processing. The Ink subgroup maintains a separate public page devoted to W3C's work on pen and stylus input.
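For illustration, a minimal InkML document might look like the sketch below: a single pen trace recorded as a comma-separated sequence of X Y sample points. Real applications would typically also declare trace formats, channels and capture-device metadata.

```xml
<ink xmlns="http://www.w3.org/2003/InkML">
  <!-- One pen trace: successive X Y coordinates sampled along the stroke. -->
  <trace>
    10 0, 9 14, 8 28, 7 42, 6 56, 6 70, 8 84, 8 98, 8 112, 9 126
  </trace>
</ink>
```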
EmotionML will provide representations of emotions and related states for technological applications. As the web is becoming ubiquitous, interactive, and multimodal, technology needs to deal increasingly with human factors, including emotions. The language is conceived as a "plug-in" language suitable for use in three different areas: (1) manual annotation of data; (2) automatic recognition of emotion-related states from user behavior; and (3) generation of emotion-related system behavior.
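As a rough sketch, a minimal EmotionML 1.0 annotation might look like the example below, which labels some data with the "happiness" category from the "big6" vocabulary in the Vocabularies for EmotionML draft; the vocabulary URI and attribute names should be checked against the current documents.

```xml
<emotionml version="1.0"
    xmlns="http://www.w3.org/2009/10/emotionml"
    category-set="http://www.w3.org/TR/emotion-voc/xml#big6">
  <!-- One annotated emotion: the "happiness" category with a confidence score. -->
  <emotion>
    <category name="happiness" confidence="0.8"/>
  </emotion>
</emotionml>
```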
For more details on other organizations see the Multimodal Interaction Charter.