<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<title>Introduction and Overview of W3C Speech Interface
Framework</title>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type" />
<meta content="Microsoft FrontPage 4.0" name="GENERATOR" />
<style type="text/css">
body {
  margin-left: 10%;
  margin-right: 5%;
  color: black;
  background-color: white;
  background-attachment: fixed;
  background-image: url(http://www.w3.org/StyleSheets/TR/WD);
  background-position: top left;
  background-repeat: no-repeat;
  font-family: Tahoma, Verdana, "Myriad Web", Syntax, sans-serif;
}
.unfinished { font-style: normal; background-color: #FFFF33 }
.dtd-code { font-family: monospace;
  background-color: #dfdfdf; white-space: pre;
  border: #000000; border-style: solid;
  border-top-width: 1px; border-right-width: 1px;
  border-bottom-width: 1px; border-left-width: 1px; }
p.copyright { font-size: smaller }
h2,h3 { margin-top: 1em; }
ul.toc li { list-style: none }
ul.toc a { text-decoration: none }
code {
  color: green;
  font-family: monospace;
  font-weight: bold;
}
.example {
  border: solid green;
  border-width: 2px;
  color: green;
  font-weight: bold;
  margin-right: 5%;
  margin-left: 0;
}
.bad {
  border: solid red;
  border-width: 2px;
  margin-left: 0;
  margin-right: 5%;
  color: rgb(192, 101, 101);
}
div.navbar { text-align: center; }
div.contents {
  background-color: rgb(204,204,255);
  padding: 0.5em;
  border: none;
  margin-right: 5%;
}
table {
  margin-left: 0;
  margin-right: 0;
  font-family: sans-serif;
  background: white;
  border-width: 2px;
  border-color: white;
}
th { font-family: sans-serif; background: rgb(204, 204, 153) }
td { font-family: sans-serif; background: rgb(255, 255, 153) }
.tocline { list-style: none; }
</style>

<link rel="stylesheet" type="text/css"
href="http://www.w3.org/StyleSheets/TR/W3C-WD" />
</head>
<body>
<div class="head">
<p><a href="http://www.w3.org/"><img class="head"
src="http://www.w3.org/Icons/WWW/w3c_home" alt="W3C" width="72" height="48" /></a></p>

<h1 class="head">Introduction and Overview of W3C Speech
Interface Framework</h1>

<h2 class="notoc">W3C Working Draft 4 December 2000</h2>

<dl>
<dt>This version:</dt>

<dd><a
href="http://www.w3.org/TR/2000/WD-voice-intro-20001204/">
http://www.w3.org/TR/2000/WD-voice-intro-20001204</a></dd>

<dt>Latest version:</dt>

<dd><a
href="http://www.w3.org/TR/voice-intro">http://www.w3.org/TR/voice-intro</a></dd>

<dt>Previous version:</dt>

<dd><a
href="http://www.w3.org/TR/1999/WD-voice-intro-19991223">http://www.w3.org/TR/1999/WD-voice-intro-19991223</a></dd>

<dt>Editor:</dt>

<dd>Jim A. Larson, Intel Architecture Labs</dd>
</dl>

<p class="copyright"><a
href="http://www.w3.org/Consortium/Legal/ipr-notice-20000612#Copyright">
Copyright</a> ©2000 <a href="http://www.w3.org/"><abbr title="World
Wide Web Consortium">W3C</abbr></a><sup>®</sup> (<a
href="http://www.lcs.mit.edu/"><abbr title="Massachusetts Institute of
Technology">MIT</abbr></a>, <a href="http://www.inria.fr/"><abbr lang="fr"
title="Institut National de Recherche en Informatique et
Automatique">INRIA</abbr></a>, <a href="http://www.keio.ac.jp/">Keio</a>),
All Rights Reserved. W3C <a
href="http://www.w3.org/Consortium/Legal/ipr-notice-20000612#Legal_Disclaimer">liability</a>,
<a
href="http://www.w3.org/Consortium/Legal/ipr-notice-20000612#W3C_Trademarks">trademark</a>,
<a
href="http://www.w3.org/Consortium/Legal/copyright-documents-19990405">document
use</a> and <a
href="http://www.w3.org/Consortium/Legal/copyright-software-19980720">software
licensing</a> rules apply.</p>

<hr />
</div>
<h2 class="notoc"><a id="abstract"
|
|
name="abstract">Abstract</a></h2>
|
|
|
|
<p>The World Wide Web Consortium's Voice Browser Working Group is
|
|
defining several markup languages for applications supporting
|
|
speech input and output. These markup languages will enable
|
|
speech applications across a range of hardware and software
|
|
platforms. Specifically, the Working Group is designing markup
|
|
languages for dialog, speech recognition grammar, speech
|
|
synthesis, natural language semantics, and a collection of
|
|
reusable dialog components. These markup languages make up the
|
|
W3C Speech Interface Framework. The speech community is invited
|
|
to review and comment on the working draft requirement and
|
|
specification documents.</p>
|
|
|
|
<h2><a id="status" name="status">Status of This Document</a></h2>
|
|
|
|
<p>This document describes a model architecture for speech
|
|
processing in voice browsers. It also briefly describes markup
|
|
languages for dialog, speech recognition grammar, speech
|
|
synthesis, natural language semantics, and a collection of
|
|
reusable dialog components. This document is being released as a
|
|
working draft, but is not intended to become a proposed
|
|
recommendation.</p>
|
|
|
|
<p>This specification is a Working Draft of the Voice Browser
|
|
working group for review by W3C members and other interested
|
|
parties. It is a draft document and may be updated, replaced, or
|
|
obsoleted by other documents at any time. It is inappropriate to
|
|
use W3C Working Drafts as reference material or to cite them as
|
|
other than "work in progress".</p>
|
|
|
|
<p>Publication as a Working Draft does not imply endorsement by
|
|
the W3C membership, nor of members of the Voice Browser working
|
|
groups. This is still a draft document and may be updated,
|
|
replaced or obsoleted by other documents at any time. It is
|
|
inappropriate to cite W3C Working Drafts as other than "work in
|
|
progress."</p>
|
|
|
|
<p>This document has been produced as part of the <a
|
|
href="http://www.w3.org/Voice/">W3C Voice Browser Activity</a>,
|
|
following the procedures set out for the <a
|
|
href="http://www.w3.org/Consortium/Process/">W3C Process</a>. The
|
|
authors of this document are members of the <a
|
|
href="http://www.w3.org/Voice/Group">Voice Browser Working
|
|
Group</a>. This document is for public review. Comments should be
|
|
sent to the public mailing list <<a
|
|
href="mailto:www-voice@w3.org">www-voice@w3.org</a>> (<a
|
|
href="http://www.w3.org/Archives/Public/www-voice/">archive</a>).</p>
|
|
|
|
<p>A list of current W3C Recommendations and other technical
|
|
documents can be found at <a
|
|
href="http://www.w3.org/TR">http://www.w3.org/TR</a>.</p>
|
|
|
|
<h2>1. <a id="group" name="group">Voice Browser Working
|
|
Group</a></h2>
|
|
|
|
<p>The Voice Browser Working Group was <a
|
|
href="http://www.w3.org/Voice/1999/voice-wg-charter.html">chartered</a>
|
|
by the World Wide Web Consortium (W3C) within the User Interface
|
|
Activity in May 1999 to prepare and review markup languages that
|
|
enable voice browsers. Members meet weekly via telephone and
|
|
quarterly in face-to-face meetings.</p>
|
|
|
|
<p>The <a href="http://www.w3.org/Voice/">W3C Voice Browser
|
|
Working Group</a> is open to any member of the W3C Consortium.
|
|
The Voice Browser Working Group has also invited experts whose
|
|
affiliations are not members of the W3C Consortium. The four
|
|
founding members of the VoiceXML Forum, as well as telelphony
|
|
applications venders, speech recognition and text to speech
|
|
engine venders, web portals, hardware venders, software venders,
|
|
telcos and appliance manufactures have representatives who
|
|
participate in the Voice Browser Working Group. Current members
|
|
include AskJeves, AT&T, Avaya, BT, Canon, Cisco, France
|
|
Telecon, General Magic, Hitachi, HP, IBM, isSound, Intel, Locus
|
|
Dialogue, Lucent, Microsoft, Mitre, Motorola, Nokia, Nortel,
|
|
Nuance, Phillips, PipeBeach, Speech Works, Sun, Telecon Italia,
|
|
TellMe.com, and Unisys, in addition to several invited
|
|
experts.</p>
|
|
|
|
<h2 class="notoc">Table of Contents</h2>
|
|
|
|
<ul class="toc">
|
|
<li><a href="#abstract">Abstract</a></li>
|
|
|
|
<li><a href="#status">Status of this Document</a></li>
|
|
|
|
<li>1. <a href="#group">The Voice Browser Working Group</a></li>
|
|
|
|
<li>2. <a href="#browsers">Voice Browsers</a></li>
|
|
|
|
<li>3. <a href="#benefits">Voice Browser Benefits</a></li>
|
|
|
|
<li>4. <a href="#spif">W3C Speech Interface Framework</a></li>
|
|
|
|
<li>5. <a href="#other">Other Uses for Markup Languages</a></li>
|
|
|
|
<li>6. <a href="#specs">Individual Markup Languages Overview</a>
|
|
|
|
<ul>
|
|
<li>6.1. <a href="#gram">Speech Recognition Grammar
|
|
Specification</a></li>
|
|
|
|
<li>6.2. <a href="#synth">Speech Synthesis</a></li>
|
|
|
|
<li>6.3. <a href="#dialog">Dialog</a></li>
|
|
|
|
<li>6.4. <a href="#nl">Natural Language Semantics</a></li>
|
|
|
|
<li>6.5 <a href="#reuse">Reusable Dialog Components</a></li>
|
|
</ul>
|
|
</li>
|
|
|
|
<li>7. <a href="#examples">Example Markup Language Use</a></li>
|
|
|
|
<li>8. <a href="#submissions">Submissions</a></li>
|
|
|
|
<li>9. <a href="#reading">Further Reading Material</a></li>
|
|
|
|
<li>10. <a href="#summary">Summary</a></li>
|
|
</ul>
|
|
|
|
<h2>2. <a id="browsers" name="browsers">Voice Browsers</a></h2>
|
|
|
|
<p>A <em>voice browser</em> is a device (hardware and software)
|
|
that interprets voice markup languages to generate voice output,
|
|
interpret voice input, and possibly accept and produce other
|
|
modalities of input and output.</p>
|
|
|
|
<p>Currently the major deployment of voice browsers enable users
|
|
to speak and listen using a telephone or cell phone to access
|
|
information available on the World Wide Web. These voice browsers
|
|
accept DTMF and spoken words as input, and produce synthesized
|
|
speech or replay prerecorded speech as output. The voice markup
|
|
languages interpreted by voice browsers are also frequently
|
|
available on the World Wide Web. However, many other deployments
|
|
of voice browsers are possible.</p>
|
|
|
|
<p>Hardware devices may include telephones or cell phones,
|
|
hand-held computers, palm-sized computers, laptop PCs, and
|
|
desktop PCs. Voice browser hardware processors may be embedded
|
|
into appliances such as TVs, radios, VCRs, remote controls,
|
|
ovens, refrigerators, coffeepots, doorbells, and practically any
|
|
other electronic or electrical device.</p>
|
|
|
|
<p>Possible software applications include:</p>
|
|
|
|
<ul>
|
|
<li>Accessing business information, including the corporate
|
|
"front desk" asking callers who or what they want, automated
|
|
telephone ordering services, support desks, order tracking,
|
|
airline arrival and departure information, cinema and theater
|
|
booking services, and home banking services</li>
|
|
|
|
<li>Accessing public information, including community information
|
|
such as weather, traffic conditions, school closures, directions
|
|
and events; local, national and international news; national and
|
|
international stock market information; and business and
|
|
e-commerce transactions</li>
|
|
|
|
<li>Accessing personal information, including calendars, address
|
|
and telephone lists, to-do lists, shopping lists, and calorie
|
|
counters</li>
|
|
|
|
<li>Assisting the user to communicate with other people sending
|
|
and receiving voice-mail messages</li>
|
|
</ul>
|
|
|
|
<p>Our definition of a voice browser does not support a voice
|
|
interface to HTML pages. A voice browser processes scripts
|
|
written using voice markup languages. HTML is not among the
|
|
languages which can be interpreted by a voice browser. Some
|
|
venders are creating voice-enabled HTML browsers that produce
|
|
voice instead of displaying text on a screen display. A
|
|
voice-enabled HTML browser must determine the sequence of text to
|
|
present to the user as voice, and possibly how to verbally
|
|
present non-text data such as tables, illustrations, and
|
|
animations. A voice browser, on the other hand, interprets a
|
|
script which specifies exactly what to verbally present to the
|
|
user as well as when to present each piece of information</p>
|
|
|
|
<h2>3. <a id="benefits" name="benefits">Voice Browser
|
|
Benefits</a></h2>
|
|
|
|
<p>Voice is a <em>very natural</em> user interface because it
|
|
enables the user to speak and listen using skills learned during
|
|
childhood. Currently users speak and listen to telephones and
|
|
cell phones with no display to interact with voice browsers. Some
|
|
voice browsers may have small screens, such as those found on
|
|
cell phones and palm computers. In the future, voice browsers may
|
|
also support other modes and media such as pen, video, and sensor
|
|
input and graphics animation and actuator controls as output. For
|
|
example, voice and pen input would be appropriate for Asian users
|
|
whose spoken language does not lend itself to entry with
|
|
traditional QWERTY keyboards.</p>
|
|
|
|
<p>Some voice browsers are <em>portable</em>. They can be used
|
|
anywhere—at home, at work, and on the road. Information
|
|
will be <em>available</em> to a greater audience, especially to
|
|
people who have access to handsets, either telephones or cell
|
|
phones, but not to networked computers.</p>
|
|
|
|
<p>Voice browsers present a <em>pragmatic</em> interface for
|
|
functionally blind users or users needing Web access while
|
|
keeping their hands and eyes free for other things. Voice
|
|
browsers present an invisible user interface to the user, while
|
|
freeing workspace previously occupied by keyboards and mice.</p>
|
|
|
|
<h2>4. <a id="spif" name="spif">W3C Speech Interface
|
|
Framework</a></h2>
|
|
|
|
<p>The Voice Browser Working group has defined the <i>W3C Speech
|
|
Interface Framework</i>, shown in Figure 1. The white boxes
|
|
represent typical components of a speech-enabled web application.
|
|
The black arrows represent data flowing among these components.
|
|
The blue ovals indicate data specified using markup languages
|
|
used to guide components to accomplish their respective tasks. To
|
|
review the latest requirement and specification documents for
|
|
each of the markup languages, see the section entitled
|
|
Requirements and Language specification Documents on our <a
|
|
href="http://www.w3.org/Voice/">W3C Voice Browser home web
|
|
site</a>.</p>
|
|
|
|
<p align="center"><img src="voice-intro-fig1.gif" width="559"
|
|
height="392"
|
|
alt="block diagram for speech interface framework" /></p>
|
|
|
|
<p>Components of the W3C Speech Interface Framework include the
|
|
following:</p>
|
|
|
|
<p><i>Automatic Speech Recognizer (ASR)</i>—accepts speech
from the user and produces text. The ASR uses a grammar to
recognize words from the user's speech. Some ASRs use grammars
specified by a developer using the <b>Speech Grammar Markup
Language</b>. Other ASRs use statistical grammars generated from
large corpora of speech data. These grammars are represented
using the <b>N-gram Stochastic Grammar Markup Language.</b></p>

<p><i>DTMF Tone Recognizer</i>—accepts touch-tones produced
by a telephone when the user presses the keys on the telephone's
keypad. Telephone users may use touch-tones to enter digits or
make menu selections.</p>

<p><i>Language Understanding Component</i>—extracts
semantics from a text string by using a prespecified grammar. The
text string may be produced by an ASR or be entered directly by a
user via a keyboard. The Language Understanding Component may
also use grammars specified using the <b>Speech Grammar Markup
Language</b> or the <b>N-gram Stochastic Grammar Markup
Language.</b> The output of the Language Understanding Component
is expressed using the <b>Natural Language Semantics Markup
Language.</b></p>

<p><i>Context Interpreter</i>—enhances the semantics from
the Language Understanding Component by obtaining context
information from a dialog history (not shown in Figure 1). For
example, the Context Interpreter may replace a pronoun by the
noun to which the pronoun referred. The input and output of the
Context Interpreter are expressed using the <b>Natural Language
Semantics Markup Language.</b></p>
<p><i>Dialog Manager</i>—prompts the user for input, makes
sense of the input, and determines what to do next according to
instructions in a dialog script specified using VoiceXML 2.0,
which is modeled after VoiceXML 1.0. Depending upon the input
received, the Dialog Manager may invoke application services,
download another dialog script from the web, or cause information
to be presented to the user. The Dialog Manager accepts input
specified using the <b>Natural Language Semantics Markup
Language.</b> Dialog scripts may refer to <b>Reusable Dialog
Components</b>, portions of another dialog script which can be
reused across multiple applications.</p>

<p><i>Media Planner</i>—determines whether output from the
Dialog Manager should be presented to the user as synthetic
speech or prerecorded audio.</p>

<p><i>Recorded Audio Player</i>—replays prerecorded audio
files to the user, either in conjunction with, or in place of,
synthesized voices.</p>

<p><i>Language Generator</i>—accepts text from the Media
Planner and prepares it for presentation to the user as spoken
voice via a text-to-speech synthesizer (TTS). The text may
contain markup tags expressed using the <b>Speech Synthesis
Markup Language</b>, which provides hints and suggestions for how
acoustic sounds should be produced. These tags may be produced
automatically by the Language Generator or manually inserted by a
developer.</p>

<p><i>Text-to-Speech Synthesizer (TTS)</i>—accepts text
from the Language Generator and produces acoustic signals which
the user hears as a human-like voice, according to hints
specified using the <b>Speech Synthesis Markup Language</b>.</p>

<p>The components of any specific voice browser may differ
significantly from the components shown in Figure 1. For example,
the Context Interpretation, Language Generation and Media
Planning components may be incorporated into the Dialog Manager,
or the tone recognizer may be incorporated into the Context
Interpreter. However, most voice browser implementations will
still be able to make use of the various markup languages defined
in the W3C Speech Interface Framework.</p>

<p>The Voice Browser Working Group is not defining the components
in the W3C Speech Interface Framework. It is defining markup
languages for representing data in each of the blue ovals in
Figure 1. Specifically, the Voice Browser Working Group is
defining the following markup languages:</p>
<ul>
<li>
<p>Speech Recognition Grammar Specification</p>
</li>

<li>
<p>N-gram Grammar Markup Language</p>
</li>

<li>
<p>Speech Synthesis Markup Language</p>
</li>

<li>
<p>Dialog Markup Language</p>
</li>
</ul>
<p>The Voice Browser Working Group is also defining packaged
dialogs which we call <b>Reusable Components</b>. As their name
suggests, reusable components can be reused in other dialog
scripts, decreasing the implementation effort and increasing user
interface consistency. The Working Group may also define a
collection of reusable components for common tasks, such as
soliciting the user's credit card number and expiration date,
soliciting the user's address, etc.</p>

<p>Just as HTML formats data for screen-based interactions over
the Internet, an XML-based language is needed to format data for
voice-based interactions over the Internet. All markup languages
recommended by the Working Group will be XML-based, so XML
language processors can process any of the W3C Speech Interface
Framework markup languages.</p>
<h2>5. <a id="other" name="other">Other Uses of the Markup
|
|
Languages</a></h2>
|
|
|
|
<p>Figure 2 illustrates the W3C Speech Interface Framework
|
|
extended to support multiple modes of input and output. It is
|
|
anticipated that another working group will be formed to specify
|
|
the <b>Multimodal Dialog Language</b>, an extension of the Dialog
|
|
Language. We anticipate that another Working Group will be
|
|
established to take over our current work in defining the
|
|
Multimodal Dialog Language.</p>
|
|
|
|
<p align="center"><img src="voice-intro-fig2.gif" width="556"
|
|
height="402"
|
|
alt="block diagram for multimodal interface framework" /></p>
|
|
|
|
<p>Markup languages also may be used in applications not usually
|
|
associated with voice browsers. The following applications also
|
|
may benefit from the use of voice browser markup languages:</p>
|
|
|
|
<ul>
<li><em>Text-based Information Storage and
Retrieval</em>—Accepts text from a keyboard and presents
the text on a display. It uses neither ASR nor TTS, but makes
heavy use of the language understanding module and the Natural
Language Semantics Markup Language.</li>

<li><em>Robot Command and Control</em>—Users speak commands
that control a mechanical robot. This application may use both
the Speech Recognition Grammar Specification and dialog markup
languages.</li>

<li><em>Medical Transcription</em>—A complex, specialized
speech recognition grammar is used to extract medical information
from text produced by the ASR. A human editor corrects the
resulting text before printing.</li>

<li><em>Newsreader</em>—A language generator produces
marked-up text for presenting voice to the user. This application
uses a special language generator to mark up text from news wire
services for verbal presentation.</li>
</ul>
<h2>6. <a id="specs" name="specs">Individual Markup Language
|
|
Overviews</a></h2>
|
|
|
|
<p>To review the latest requirement and specification documents
|
|
for each of the following languages, see the section titled
|
|
Requirements and Language specification Documents on our <a
|
|
href="http://www.w3.org/Voice/">W3C Voice Browser home web
|
|
site</a></p>
|
|
|
|
<h3><a id="gram" name="gram">6.1. Speech Recognition Grammar
|
|
Specification</a></h3>
|
|
|
|
<p>The Speech Recognition Grammar Specification supports the
|
|
definition of Context-Free Grammars (CFG) and, by subsumption,
|
|
Finite-State Grammars (FSG). The specification defines an XML
|
|
Grammar Markup Language, and an optional Augmented Backus-Naur
|
|
Format (ABNF) Markup Language. Automatic transformations between
|
|
the two formats is possible, for example, by XSLT to convert the
|
|
XML format to ABNF. We anticipate that development tools will be
|
|
constructed that provide the familiar ABNF format to developers,
|
|
and enable XML software to manipulate the XML grammar format. The
|
|
ABNF and XML languages are modeled after Sun's <a
|
|
href="http://www.w3.org/Submission/2000/06/">JSpeech Grammar
|
|
Format</a>. Some of the interesting features of the draft
|
|
specification:</p>
|
|
|
|
<ul>
<li>
<p>Ability to cross-reference grammars by URI and to use this
ability to define libraries of useful grammars.</p>
</li>

<li>
<p>Internationalized.</p>
</li>

<li>
<p>Semantic tagging mechanism for interpretation of spoken input
(under development).</p>
</li>

<li>
<p>Applicable to non-speech input modalities, e.g. DTMF input or
parsing and interpretation of typed input.</p>
</li>
</ul>
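
<p>As a brief sketch of the two formats, the fragments below show
one small grammar written both ways. The ABNF form follows the
JSpeech Grammar Format style on which the language is modeled;
the XML element names are illustrative placeholders only, since
the draft specification may adopt different names.</p>

<pre>
#JSGF V1.0;
grammar cities;
// ABNF (JSGF-style) form: the cities a caller may name,
// with an optional "D C" after Washington
public <city> = New York | Boston | Washington [ D C ];
</pre>

<pre>
<!-- A hypothetical XML rendering of the same grammar -->
<grammar name="cities">
  <rule name="city" scope="public">
    <choice>
      <item> New York </item>
      <item> Boston </item>
      <item> Washington <optional> D C </optional> </item>
    </choice>
  </rule>
</grammar>
</pre>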

<p>A complementary speech recognition grammar language
specification is defined for N-gram language models.</p>

<p>Terms used in the Speech Grammar Markup Language requirements
and specification documents include:</p>
<table border="1" cellpadding="6" cellspacing="1" width="85%"
|
|
summary="term in first column, explanation in second">
|
|
<tbody>
|
|
<tr>
|
|
<th width="24%">CFG</th>
|
|
<td width="76%">Context-Free Grammar. A formal computer science
|
|
term for a language that permits embedded recursion.</td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<th width="24%">BNF</th>
|
|
<td width="76%">Backus-Naur Format. A language used widely in
|
|
computer science for textural representations of CFGs.</td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<th width="24%">ABNF</th>
|
|
<td width="76%">Augmented Backus-Naur Format. The language
|
|
defined in the grammar specification that extends a conventional
|
|
BNF representation with regular grammar capabilities, syntax for
|
|
cross-referencing between grammars and other useful syntactic
|
|
features</td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<th width="24%">Grammar</th>
|
|
<td width="76%">The representation of constraints defining the
|
|
set of allowable sentences in a language. E.g. a grammar for
|
|
describing a set of sentences for ordering a pizza.</td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<th width="24%">Language</th>
|
|
<td width="76%">A formal computer science term for the collection
|
|
of set of sentences associated with a particular domain. Language
|
|
may refer to natural or program language.</td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
|
|
<h3><a id="synth" name="synth">6.2. Speech Synthesis</a></h3>
|
|
|
|
<p>A text document may be produced automatically, authored by
|
|
people, or a combination of both. The Speech Synthesis Markup
|
|
Language supports high-level specifications, including the
|
|
selection of voice characteristics (name, gender, and age) and
|
|
the speed, volume, and emphasis of individual words. The language
|
|
also may describe how to pronounce acronyms, such as "Nasa" for
|
|
NASA, or spelled, such as "N, double A, C, P," for NAACP. At a
|
|
lower level, designers may specify prosodic control, which
|
|
includes pitch, timing, pausing, and speaking rate. The Speech
|
|
Synthesis Markup Language is modeled on Sun's <a
|
|
href="http://java.sun.com/products/java-media/speech/forDevelopers/JSML/index.html">
|
|
<b>Java Speech Markup Language</b></a>.</p>
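
<p>The following hypothetical fragment, written in the spirit of
the Java Speech Markup Language, suggests what such markup might
look like. The element and attribute names here are illustrative
only, since the Speech Synthesis Markup Language specification is
still being drafted.</p>

<pre>
<!-- Select voice characteristics: gender and age -->
<voice gender="female" age="30">
  Welcome to Ajax Travel.
  <!-- Spell the acronym out letter by letter -->
  Your <sayas class="literal">NAACP</sayas> member discount applies.
  <!-- Lower-level prosodic control: slower rate, higher pitch -->
  <prosody rate="slow" pitch="high">
    Please hold while we check today's fares.
  </prosody>
</voice>
</pre>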

<p>There is some variance in the use of terminology in the speech
synthesis community. The following definitions establish a common
understanding.</p>

<table border="1" cellpadding="6" cellspacing="1" width="85%"
summary="term in first column, explanation in second">
<tbody>
<tr>
<th width="24%">Prosody</th>
<td width="76%">Features of speech such as pitch, pitch range,
speaking rate and volume.</td>
</tr>

<tr>
<th width="24%">Speech Synthesis</th>
<td width="76%">The process of automatic generation of speech
output from data input which may include plain text, formatted
text or binary objects.</td>
</tr>

<tr>
<th width="24%">Text-To-Speech</th>
<td width="76%">The process of automatic generation of speech
output from text or annotated text input.</td>
</tr>
</tbody>
</table>
<h3><a id="dialog" name="dialog">6.3. VoiceXML 2.0</a></h3>
|
|
|
|
<p>VoiceXML 2.0 Markup supports four I/O modes: speech
|
|
recognition and DTMF as input with synthesized speech and
|
|
prerecorded speech as output. VoiceXML 2.0 supports
|
|
system-directed speech dialogs where the system prompts the user
|
|
for responses, makes sense of the input, and determines what to
|
|
do next. VoiceXML 2.0 also supports mixed initiative speech
|
|
dialogs. In addition, VoiceXML also supports task switching and
|
|
the handling of events, such as recognition errors, incomplete
|
|
information entered by the user, timeouts, barge-in, and
|
|
developer-defined events. Barge-in allows users to speak while
|
|
the browser is speaking. VoiceXML 2.0 is modeled after <a
|
|
href="http://www.w3.org/Submission/2000/04/">VoiceXML 1.0</a>
|
|
designed by the <a href="http://www.voicexml.org/">VoiceXML
|
|
Forum</a>, whose founding members are AT&T, IBM, Lucent, and
|
|
Motorola.</p>
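
<p>As a brief illustration of event handling, the following
fragment uses VoiceXML 1.0 syntax, on which VoiceXML 2.0 is
modeled. The form, field, and grammar names are invented for
illustration.</p>

<pre>
<form id="destination">
  <field name="city">
    <prompt>Which city do you want to fly to?</prompt>
    <grammar src="cities.gram"/>
    <!-- Event handler: the utterance did not match the grammar -->
    <nomatch>
      Sorry, I did not understand. Please say a city name.
    </nomatch>
    <!-- Event handler: the user said nothing before the timeout -->
    <noinput>
      Are you still there? Please say a city name.
    </noinput>
  </field>
</form>
</pre>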

<p>Terms used in the Dialog Markup Language requirements and
specification documents include:</p>

<table border="1" cellpadding="6" cellspacing="1" width="85%"
summary="term in first column, explanation in second">
<tbody>
<tr>
<th>Dialog Markup Language</th>
<td>A language in which voice dialog behavior is specified. The
language may include reference to scripting elements which can
also determine dialog behavior.</td>
</tr>

<tr>
<th>Voice Browser</th>
<td>A software device which interprets a voice markup language
and generates a dialog with voice output and possibly other
output modalities and/or voice input and possibly other
modalities.</td>
</tr>

<tr>
<th>Dialog</th>
<td>A model of interactive behavior underlying the interpretation
of the markup language. The model consists of states, variables,
events, event handlers, inputs and outputs.</td>
</tr>

<tr>
<th>Utterance</th>
<td>Used in this document generally to refer to a meaningful user
input in any modality supported by the platform, not limited to
spoken inputs; for example, speech, DTMF, pointing, handwriting,
text and OCR.</td>
</tr>

<tr>
<th>Mixed initiative dialog</th>
<td>A type of dialog in which either the system or the user can
take the initiative at any point in the dialog by failing to
respond directly to the previous utterance. For example, the user
can make corrections, volunteer additional information, etc.
Systems support mixed initiative dialog to various degrees.
Compare to "directed dialog."</td>
</tr>

<tr>
<th>Directed dialog</th>
<td>Also referred to as "system initiative" or "system led." A
type of dialog in which the user is permitted only direct literal
responses to the system's prompts.</td>
</tr>

<tr>
<th>State</th>
<td>The basic interactional unit defined in the markup language.
A state can specify variables, event handlers, outputs and
inputs. A state may describe output content to be presented to
the user, input which the user can enter, and event handlers
describing, for example, which variables to bind and which state
to transition to when an event occurs.</td>
</tr>

<tr>
<th>Events</th>
<td>Generated when a state is executed by the voice browser; for
example, when outputs or inputs in a state are rendered or
interpreted. Events are typed and may include information; for
example, an input event generated when an utterance is recognized
may include the string recognized, an interpretation, a
confidence score, and so on.</td>
</tr>

<tr>
<th>Event Handlers</th>
<td>Specified in the voice markup language; they describe how
events generated by the voice browser are to be handled.
Interpretation of events may bind variables, or map the current
state into another state (possibly itself).</td>
</tr>

<tr>
<th>Output</th>
<td>Content specified in an element of the markup language for
presentation to the user. The content is rendered by the voice
browser; for example, audio files or text rendered by a TTS.
Output can also contain parameters for the output device; for
example, volume of audio file playback, language for TTS, etc.
Events are generated when, for example, the audio file has been
played.</td>
</tr>

<tr>
<th>Input</th>
<td>Content (and its interpretation) specified in an element of
the markup language which can be given as input by a user; for
example, a grammar for DTMF and speech input. Events are
generated by the voice browser when, for example, the user has
spoken an utterance, and variables may be bound to information
contained in the event. Input can also specify parameters for the
input device; for example, timeout parameters, etc.</td>
</tr>
</tbody>
</table>
<h3><a id="nl" name="nl">6.4. Natural Language Semantics</a></h3>
|
|
|
|
<p>The Natural Language Semantics Markup Language supports XML
|
|
semantic representations. For application-specific information,
|
|
it is based on the W3C <a
|
|
href="http://www.w3.org/TR/2000/WD-xforms-datamodel-20000406/">XForms.</a>
|
|
The Natural Language Semantics Markup Language also includes
|
|
application-independent elements defined by the W3C Voice Browser
|
|
group. This application-independent information includes
|
|
confidences, the grammar matched by the interpretation, speech
|
|
recognizer input, and timestamps. The Natural Language Semantics
|
|
Markup Language combines elements from the XForms, natural
|
|
language semantics, and application-specific namespaces. For
|
|
example, the text, "I want to fly from New York to Boston, and,
|
|
then, to Washington, DC", could be represented as:</p>
|
|
|
|
<pre>
|
|
<result xmlns:xf="http://www.w3.org/2000/xforms"
|
|
x-model="http://flight-model"
|
|
grammar="http://flight-grammar">
|
|
<interpretation confidence=100>
|
|
<xf:instance>
|
|
<flight:trip>
|
|
<leg1>
|
|
<from>New York</from>
|
|
<to>Boston</to>
|
|
</leg1>
|
|
<leg2>
|
|
<from>Boston</from>
|
|
<to>DC</to>
|
|
</leg2>
|
|
</flight:trip>
|
|
</xf:instance>
|
|
<input mode="speech">
|
|
I want to fly from New York to Boston, and,
|
|
then, to Washington, DC
|
|
</input>
|
|
</interpretation>
|
|
</result>
|
|
</pre>

<p>Terms used in the Natural Language Semantics Markup Language
requirements and specification documents include:</p>

<table border="1" cellpadding="6" cellspacing="1" width="85%"
summary="term in first column, explanation in second">
<tbody>
<tr>
<th width="23%">Natural language interpreter</th>
<td width="77%">A device which produces a representation of the
meaning of a natural language expression.</td>
</tr>

<tr>
<th width="23%">Natural language expression</th>
<td width="77%">An unformatted spoken or written utterance in a
human language such as English, French, Japanese, etc.</td>
</tr>
</tbody>
</table>
<h3><a id="reuse" name="reuse">6.5 Reusable Dialog
|
|
Components</a></h3>
|
|
|
|
<p>Reusable Dialog Components are dialog components (chunks of
|
|
dialog script or platform-specific objects that pose frequently
|
|
asked questions in dialog scripts, and can be invoked from any
|
|
dialog script) that are reusable (can be used multiple times
|
|
within an application or used by multiple applications) and that
|
|
meet specific interface (configuration parameter and return value
|
|
format) requirements. The purpose of reusable components is to
|
|
reduce the effort to implement a dialog by reusing encapsulations
|
|
of common dialog tasks, and to promote consistency across
|
|
applications. The W3C Voice Browser Working Group is defining the
|
|
interface for Reusable Dialog Components. Future specifications
|
|
will define standard reusable dialog components for designated
|
|
tasks that are portable across platforms.</p>
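
<p>One way a dialog script might invoke such a component is
sketched below using the VoiceXML 1.0 <subdialog> element;
the component URI, configuration parameter, and return value
field are all invented for illustration.</p>

<pre>
<form id="booking">
  <!-- Invoke a hypothetical reusable date-collection component, -->
  <!-- passing a configuration parameter -->
  <subdialog name="travelDate" src="http://example.com/get-date.vxml">
    <param name="prompt" value="When do you want to travel?"/>
    <filled>
      <!-- The component returns its result in travelDate.date -->
      <prompt>Booking a flight for <value expr="travelDate.date"/>.</prompt>
    </filled>
  </subdialog>
</form>
</pre>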

<h2>7. <a id="examples" name="examples">Example of Markup
Language Use</a></h2>

<p>The following speech dialog fragment illustrates the use of
the speech synthesis, Speech Recognition Grammar Specification,
and speech dialog markup languages:</p>

<pre>
<menu>
  <!-- This is an example of a menu which presents the user -->
  <!-- with a prompt and listens for the user to utter a choice -->
  <prompt>
    <!-- This text is presented to the user as synthetic speech -->
    <!-- The emphasis element adds emphasis to its content -->
    Welcome to Ajax Travel. Do you want to fly to
    <emphasis>New York, Boston</emphasis> or
    <emphasis>Washington DC</emphasis>?
  </prompt>
  <!-- When the user speaks an utterance that matches the grammar -->
  <!-- control is transferred to the "next" VoiceXML document -->
  <choice next="http://www.NY...">
    <!-- The <grammar> element indicates the words which -->
    <!-- the user may utter to select this choice -->
    <grammar>
      <choice>
        <item> New York </item>
        <item> The Big Apple </item>
      </choice>
    </grammar>
  </choice>
  <choice next="http://www.Boston...">
    <grammar>
      <choice>
        <item> Boston </item>
        <item> Beantown </item>
      </choice>
    </grammar>
  </choice>
  <choice next="http://www.Wash....">
    <grammar>
      <choice>
        <item> Washington D.C. </item>
        <item> Washington </item>
        <item> The U.S. Capital </item>
      </choice>
    </grammar>
  </choice>
</menu>
</pre>
<p>In the example above, the Dialog Markup Language describes a
voice menu which contains a prompt to be presented to the user.
The user may respond by saying any of several choices. When the
user's speech matches a particular grammar, control is
transferred to the dialog fragment at the "next" location.</p>

<p>The Speech Synthesis Markup Language describes how text is
rendered to the user. The Speech Synthesis Markup Language
includes the <emphasis> element. When rendered to the user,
the city names will be emphasized, and the end of the sentence
will rise in pitch to indicate a question.</p>

<p>The Speech Recognition Grammar Specification describes the
words that the user must say when making a choice. The
<grammar> element is shown within the <choice>
element. The language understanding module will recognize "New
York" or "The Big Apple" to mean New York, "Boston" or "Beantown"
to mean Boston, and "Washington, D.C.," "Washington," or "The
U.S. Capital" to mean Washington.</p>

<p>An example user-computer dialog resulting from interpreting
the above dialog script is:</p>

<pre>
Computer: <i>Welcome to Ajax Travel. Do you want to fly
to New York, Boston, or Washington DC?</i>

User: Beantown

Computer: <i>(transfers to dialog script associated with Boston)</i>
</pre>
<h2>8. <a id="submissions"
|
|
name="submissions">Submissions</a></h2>
|
|
|
|
<p>W3C has acknowledged the <a
|
|
href="http://www.w3.org/Submission/2000/06/">JSGF and JSML
|
|
submission</a> from the <a href="http://www.sun.com/">Sun
|
|
Microsystems</a>. The W3C Voice Browser Working Group plans to
|
|
develop specifications for its Speech Synthesis Markup Language
|
|
and Speech Grammar Specification using JSGF and JSML as a
|
|
model.</p>
|
|
|
|
<p>W3C has acknowledged the <a
|
|
href="http://www.w3.org/Submission/2000/04/">VoiceXML 1.0
|
|
submission</a> from the <a
|
|
href="http://www.voicexml.org/">VoiceXML Forum</a>. The W3C <a
|
|
href="http://www.w3.org/Voice/Group/">Voice Browser Working
|
|
Group</a> plans to adopt VoiceXML 1.0 as the basis for developing
|
|
a Dialog Markup Language for interactive voice response
|
|
applications. See <a
|
|
href="http://www.zdnet.com/eweek/stories/general/0,11011,2574350,00.html">
|
|
ZDNet's article</a> covering the announcement</p>
|
|
|
|
<h2>9. <a id="reading" name="reading">Further Reading
|
|
Material</a></h2>
|
|
|
|
<p>The following resources are related to the efforts of the
|
|
Voice Browser working group.</p>
|
|
|
|
<dl>
|
|
<dt><a href="http://www.w3.org/TR/REC-CSS2/aural.html">Aural
|
|
CSS</a></dt>
|
|
|
|
<dd>The aural rendering of a document, already commonly used by
|
|
the blind and print-impaired communities, combines speech
|
|
synthesis and "auditory icons." Often such aural presentation
|
|
occurs by converting the document to plain text and feeding this
|
|
to a screen reader -- software or hardware that simply reads all
|
|
the characters on the screen. This results in less effective
|
|
presentation than would be the case if the document structure
|
|
were retained. Style sheet properties for aural presentation may
|
|
be used together with visual properties (mixed media) or as an
|
|
aural alternative to visual presentation.</dd>
|
|
|
|
<dt><br />
|
|
<a href="http://www.etsi.org/">The European Telecommunications
|
|
Standards Institute (ETSI)</a></dt>
|
|
|
|
<dd>The European Telecommunications Standards Institute (ETSI)
|
|
ETSI is a non-profit organization whose mission is "to determine
|
|
and produce the telecommunications standards that will be used
|
|
for decades to come". ETSI's work is complementary to W3C's. The
|
|
ETSI STQ Aurora DSR Working Group standardizes algorithms for
|
|
Distributed Speech Recognition (DSR). The idea is to preprocess
|
|
speech signals before transmission to a server connected to a
|
|
speech recognition engine. Navigate to http://www.etsi.org/stq/
|
|
for more details.</dd>
|
|
|
|
<dt><br />
|
|
<a
|
|
href="http://www.java.sun.com/products/java-media/speech/forDevelopers/JSGF/index.html">
|
|
Java Speech Grammar Format</a></dt>
|
|
|
|
<dd>The Java™ Speech Grammar Format is used for defining
|
|
context free grammars for speech recognition. JSGF adopts the
|
|
style and conventions of the Java programming language in
|
|
addition to use of traditional grammar notations.<br />
|
|
</dd>
|
|
|
|
<dt><a href="http://www.microsoft.com/IIT/">Microsoft Speech
|
|
Site</a></dt>
|
|
|
|
<dd class="c5">This site describes the Microsoft speech API, and
|
|
contains a recognizer and synthesizer that can be
|
|
downloaded.</dd>
|
|
|
|
<dt><br />
|
|
<a href="http://www.w3.org/TR/NOTE-voice">NOTE-voice</a></dt>
|
|
|
|
<dd>This note describes features needed for effective interaction
|
|
with Web browsers that are based upon voice input and output.
|
|
Some extensions are proposed to HTML 4.0 and CSS2 to support
|
|
voice browsing, and some work is proposed in the area of speech
|
|
recognition and synthesis to make voice browsers more
|
|
effective.</dd>
|
|
|
|
<dt><br />
|
|
<a
|
|
href="http://www.bell-labs.com/project/tts/sable.html">SABLE</a></dt>
|
|
|
|
<dd>SABLE is a markup language for controlling text to speech
|
|
engines. It has evolved out of work on combining three existing
|
|
text to speech languages: SSML, STML and JSML.</dd>
|
|
|
|
<dt><br />
|
|
<a href="http://www.alphaworks.ibm.com/tech">SpeechML</a></dt>
|
|
|
|
<dd><i>(IBM's server precludes a simple URL for this, but you can
|
|
reach the SpeechML site by following the link for Speech
|
|
Recognition in the left frame)</i> SpeechML plays a similar role
|
|
to VoxML, defining a markup language written in XML for IVR
|
|
systems. SpeechML features close integration with Java.</dd>
|
|
|
|
<dt><br />
|
|
<a href="http://www.w3.org/Voice/TalkML">TalkML</a></dt>
|
|
|
|
<dd>This is an experimental markup language from HP Labs, written
|
|
in XML, and aimed at describing spoken dialogs in terms of
|
|
prompts, speech grammars and production rules for acting on
|
|
responses. It is being used to explore ideas for object-oriented
|
|
dialog structures, and for next generation aural style
|
|
sheets.</dd>
|
|
|
|
<dt><br />
|
|
<a href="http://www.w3.org/Voice/WWW8/slide1.html">Voice Browsers
|
|
and Style Sheets</a></dt>
|
|
|
|
<dd>Presentation by Dave Raggett on May 13th 1999 as part of the
|
|
Style stack of Developer's Day in <a
|
|
href="http://www8.org/">WWW8</a>. The presentation makes
|
|
suggestions for extensions to <a
|
|
href="http://www.w3.org/TR/REC-CSS2/aural.html">ACSS</a>.</dd>
|
|
|
|
<dt><br />
|
|
<a href="http://www.vxml.org/">VoiceXML site</a></dt>
|
|
|
|
<dd>The VoiceXML Forum formed by AT&, IBM, Lucent and
|
|
Motorola to pool their experience. The Forum has published an
|
|
early version of the VoiceXML specification. This builds on
|
|
earlier work on PML, VoxML and SpeechML.</dd>
|
|
</dl>
|
|
|
|
<h2>10. <a id="summary" name="summary">Summary</a></h2>
|
|
|
|
<p>The W3C Voice Browser Working Group is defining markup
|
|
languages for speech recognition grammars, speech dialog, natural
|
|
language semantics, multimodal dialogs, and speech synthesis, as
|
|
well as a collection of reusable dialog components. In addition
|
|
to voice browsers, these languages can also support a wide range
|
|
of applications including information storage and retrieval,
|
|
robot command and control, medical transcription, and newsreader
|
|
applications. The speech community is invited to review and
|
|
comment on working draft requirement and specification
|
|
documents.</p>
|
|
</body>
|
|
</html>
|
|
|