<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<title>Introduction and Overview of W3C Speech Interface
Framework</title>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type" />
<meta content="Microsoft FrontPage 4.0" name="GENERATOR" />
<style type="text/css">
body {
  margin-left: 10%;
  margin-right: 5%;
  color: black;
  background-color: white;
  background-attachment: fixed;
  background-image: url(http://www.w3.org/StyleSheets/TR/WD);
  background-position: top left;
  background-repeat: no-repeat;
  font-family: Tahoma, Verdana, "Myriad Web", Syntax, sans-serif;
}
.unfinished { font-style: normal; background-color: #FFFF33 }
.dtd-code { font-family: monospace;
  background-color: #dfdfdf; white-space: pre;
  border: #000000; border-style: solid;
  border-top-width: 1px; border-right-width: 1px;
  border-bottom-width: 1px; border-left-width: 1px; }
p.copyright { font-size: smaller }
h2,h3 { margin-top: 1em; }
ul.toc li { list-style: none }
ul.toc a { text-decoration: none }
code {
  color: green;
  font-family: monospace;
  font-weight: bold;
}
.example {
  border: solid green;
  border-width: 2px;
  color: green;
  font-weight: bold;
  margin-right: 5%;
  margin-left: 0;
}
.bad {
  border: solid red;
  border-width: 2px;
  margin-left: 0;
  margin-right: 5%;
  color: rgb(192, 101, 101);
}
div.navbar { text-align: center; }
div.contents {
  background-color: rgb(204,204,255);
  padding: 0.5em;
  border: none;
  margin-right: 5%;
}
table {
  margin-left: 0;
  margin-right: 0;
  font-family: sans-serif;
  background: white;
  border-width: 2px;
  border-color: white;
}
th { font-family: sans-serif; background: rgb(204, 204, 153) }
td { font-family: sans-serif; background: rgb(255, 255, 153) }
.tocline { list-style: none; }
</style>

<link rel="stylesheet" type="text/css"
href="http://www.w3.org/StyleSheets/TR/W3C-WD" />
</head>
<body>
<div class="head">
<p><a href="http://www.w3.org/"><img class="head"
src="http://www.w3.org/Icons/WWW/w3c_home" alt="W3C" width="72" height="48" /></a></p>

<h1 class="head">Introduction and Overview of W3C Speech
Interface Framework</h1>

<h2 class="notoc">W3C Working Draft 4 December 2000</h2>

<dl>
<dt>This version:</dt>

<dd><a
href="http://www.w3.org/TR/2000/WD-voice-intro-20001204/">
http://www.w3.org/TR/2000/WD-voice-intro-20001204</a></dd>

<dt>Latest version:</dt>

<dd><a
href="http://www.w3.org/TR/voice-intro">http://www.w3.org/TR/voice-intro</a></dd>

<dt>Previous version:</dt>

<dd><a
href="http://www.w3.org/TR/1999/WD-voice-intro-19991223">http://www.w3.org/TR/1999/WD-voice-intro-19991223</a></dd>

<dt>Editor:</dt>

<dd>Jim A. Larson, Intel Architecture Labs</dd>
</dl>

<p class="copyright"><a
href="http://www.w3.org/Consortium/Legal/ipr-notice-20000612#Copyright">
Copyright</a> ©2000 <a href="http://www.w3.org/"><abbr title="World
Wide Web Consortium">W3C</abbr></a><sup>®</sup> (<a
href="http://www.lcs.mit.edu/"><abbr title="Massachusetts Institute of
Technology">MIT</abbr></a>, <a href="http://www.inria.fr/"><abbr lang="fr"
title="Institut National de Recherche en Informatique et
Automatique">INRIA</abbr></a>, <a href="http://www.keio.ac.jp/">Keio</a>),
All Rights Reserved. W3C <a
href="http://www.w3.org/Consortium/Legal/ipr-notice-20000612#Legal_Disclaimer">liability</a>,
<a
href="http://www.w3.org/Consortium/Legal/ipr-notice-20000612#W3C_Trademarks">trademark</a>,
<a
href="http://www.w3.org/Consortium/Legal/copyright-documents-19990405">document
use</a> and <a
href="http://www.w3.org/Consortium/Legal/copyright-software-19980720">software
licensing</a> rules apply.</p>

<hr />
</div>
<h2 class="notoc"><a id="abstract"
|
|
name="abstract">Abstract</a></h2>
|
|
|
|
<p>The World Wide Web Consortium's Voice Browser Working Group is
|
|
defining several markup languages for applications supporting
|
|
speech input and output. These markup languages will enable
|
|
speech applications across a range of hardware and software
|
|
platforms. Specifically, the Working Group is designing markup
|
|
languages for dialog, speech recognition grammar, speech
|
|
synthesis, natural language semantics, and a collection of
|
|
reusable dialog components. These markup languages make up the
|
|
W3C Speech Interface Framework. The speech community is invited
|
|
to review and comment on the working draft requirement and
|
|
specification documents.</p>
|
|
|
|
<h2><a id="status" name="status">Status of This Document</a></h2>
|
|
|
|
<p>This document describes a model architecture for speech
|
|
processing in voice browsers. It also briefly describes markup
|
|
languages for dialog, speech recognition grammar, speech
|
|
synthesis, natural language semantics, and a collection of
|
|
reusable dialog components. This document is being released as a
|
|
working draft, but is not intended to become a proposed
|
|
recommendation.</p>
|
|
|
|
<p>This specification is a Working Draft of the Voice Browser
|
|
working group for review by W3C members and other interested
|
|
parties. It is a draft document and may be updated, replaced, or
|
|
obsoleted by other documents at any time. It is inappropriate to
|
|
use W3C Working Drafts as reference material or to cite them as
|
|
other than "work in progress".</p>
|
|
|
|
<p>Publication as a Working Draft does not imply endorsement by
|
|
the W3C membership, nor of members of the Voice Browser working
|
|
groups. This is still a draft document and may be updated,
|
|
replaced or obsoleted by other documents at any time. It is
|
|
inappropriate to cite W3C Working Drafts as other than "work in
|
|
progress."</p>
|
|
|
|
<p>This document has been produced as part of the <a
|
|
href="http://www.w3.org/Voice/">W3C Voice Browser Activity</a>,
|
|
following the procedures set out for the <a
|
|
href="http://www.w3.org/Consortium/Process/">W3C Process</a>. The
|
|
authors of this document are members of the <a
|
|
href="http://www.w3.org/Voice/Group">Voice Browser Working
|
|
Group</a>. This document is for public review. Comments should be
|
|
sent to the public mailing list <<a
|
|
href="mailto:www-voice@w3.org">www-voice@w3.org</a>> (<a
|
|
href="http://www.w3.org/Archives/Public/www-voice/">archive</a>).</p>
|
|
|
|
<p>A list of current W3C Recommendations and other technical
|
|
documents can be found at <a
|
|
href="http://www.w3.org/TR">http://www.w3.org/TR</a>.</p>
|
|
|
|
<h2>1. <a id="group" name="group">Voice Browser Working
|
|
Group</a></h2>
|
|
|
|
<p>The Voice Browser Working Group was <a
|
|
href="http://www.w3.org/Voice/1999/voice-wg-charter.html">chartered</a>
|
|
by the World Wide Web Consortium (W3C) within the User Interface
|
|
Activity in May 1999 to prepare and review markup languages that
|
|
enable voice browsers. Members meet weekly via telephone and
|
|
quarterly in face-to-face meetings.</p>
|
|
|
|
<p>The <a href="http://www.w3.org/Voice/">W3C Voice Browser
|
|
Working Group</a> is open to any member of the W3C Consortium.
|
|
The Voice Browser Working Group has also invited experts whose
|
|
affiliations are not members of the W3C Consortium. The four
|
|
founding members of the VoiceXML Forum, as well as telelphony
|
|
applications venders, speech recognition and text to speech
|
|
engine venders, web portals, hardware venders, software venders,
|
|
telcos and appliance manufactures have representatives who
|
|
participate in the Voice Browser Working Group. Current members
|
|
include AskJeves, AT&T, Avaya, BT, Canon, Cisco, France
|
|
Telecon, General Magic, Hitachi, HP, IBM, isSound, Intel, Locus
|
|
Dialogue, Lucent, Microsoft, Mitre, Motorola, Nokia, Nortel,
|
|
Nuance, Phillips, PipeBeach, Speech Works, Sun, Telecon Italia,
|
|
TellMe.com, and Unisys, in addition to several invited
|
|
experts.</p>
|
|
|
|
<h2 class="notoc">Table of Contents</h2>
|
|
|
|
<ul class="toc">
|
|
<li><a href="#abstract">Abstract</a></li>
|
|
|
|
<li><a href="#status">Status of this Document</a></li>
|
|
|
|
<li>1. <a href="#group">The Voice Browser Working Group</a></li>
|
|
|
|
<li>2. <a href="#browsers">Voice Browsers</a></li>
|
|
|
|
<li>3. <a href="#benefits">Voice Browser Benefits</a></li>
|
|
|
|
<li>4. <a href="#spif">W3C Speech Interface Framework</a></li>
|
|
|
|
<li>5. <a href="#other">Other Uses for Markup Languages</a></li>
|
|
|
|
<li>6. <a href="#specs">Individual Markup Languages Overview</a>
|
|
|
|
<ul>
|
|
<li>6.1. <a href="#gram">Speech Recognition Grammar
|
|
Specification</a></li>
|
|
|
|
<li>6.2. <a href="#synth">Speech Synthesis</a></li>
|
|
|
|
<li>6.3. <a href="#dialog">Dialog</a></li>
|
|
|
|
<li>6.4. <a href="#nl">Natural Language Semantics</a></li>
|
|
|
|
<li>6.5 <a href="#reuse">Reusable Dialog Components</a></li>
|
|
</ul>
|
|
</li>
|
|
|
|
<li>7. <a href="#examples">Example Markup Language Use</a></li>
|
|
|
|
<li>8. <a href="#submissions">Submissions</a></li>
|
|
|
|
<li>9. <a href="#reading">Further Reading Material</a></li>
|
|
|
|
<li>10. <a href="#summary">Summary</a></li>
|
|
</ul>
|
|
|
|
<h2>2. <a id="browsers" name="browsers">Voice Browsers</a></h2>
|
|
|
|
<p>A <em>voice browser</em> is a device (hardware and software)
|
|
that interprets voice markup languages to generate voice output,
|
|
interpret voice input, and possibly accept and produce other
|
|
modalities of input and output.</p>
|
|
|
|
<p>Currently the major deployment of voice browsers enable users
|
|
to speak and listen using a telephone or cell phone to access
|
|
information available on the World Wide Web. These voice browsers
|
|
accept DTMF and spoken words as input, and produce synthesized
|
|
speech or replay prerecorded speech as output. The voice markup
|
|
languages interpreted by voice browsers are also frequently
|
|
available on the World Wide Web. However, many other deployments
|
|
of voice browsers are possible.</p>
|
|
|
|
<p>Hardware devices may include telephones or cell phones,
|
|
hand-held computers, palm-sized computers, laptop PCs, and
|
|
desktop PCs. Voice browser hardware processors may be embedded
|
|
into appliances such as TVs, radios, VCRs, remote controls,
|
|
ovens, refrigerators, coffeepots, doorbells, and practically any
|
|
other electronic or electrical device.</p>
|
|
|
|
<p>Possible software applications include:</p>
|
|
|
|
<ul>
|
|
<li>Accessing business information, including the corporate
|
|
"front desk" asking callers who or what they want, automated
|
|
telephone ordering services, support desks, order tracking,
|
|
airline arrival and departure information, cinema and theater
|
|
booking services, and home banking services</li>
|
|
|
|
<li>Accessing public information, including community information
|
|
such as weather, traffic conditions, school closures, directions
|
|
and events; local, national and international news; national and
|
|
international stock market information; and business and
|
|
e-commerce transactions</li>
|
|
|
|
<li>Accessing personal information, including calendars, address
|
|
and telephone lists, to-do lists, shopping lists, and calorie
|
|
counters</li>
|
|
|
|
<li>Assisting the user to communicate with other people sending
|
|
and receiving voice-mail messages</li>
|
|
</ul>
|
|
|
|
<p>Our definition of a voice browser does not support a voice
|
|
interface to HTML pages. A voice browser processes scripts
|
|
written using voice markup languages. HTML is not among the
|
|
languages which can be interpreted by a voice browser. Some
|
|
venders are creating voice-enabled HTML browsers that produce
|
|
voice instead of displaying text on a screen display. A
|
|
voice-enabled HTML browser must determine the sequence of text to
|
|
present to the user as voice, and possibly how to verbally
|
|
present non-text data such as tables, illustrations, and
|
|
animations. A voice browser, on the other hand, interprets a
|
|
script which specifies exactly what to verbally present to the
|
|
user as well as when to present each piece of information</p>
|
|
|
|
<h2>3. <a id="benefits" name="benefits">Voice Browser
|
|
Benefits</a></h2>
|
|
|
|
<p>Voice is a <em>very natural</em> user interface because it
|
|
enables the user to speak and listen using skills learned during
|
|
childhood. Currently users speak and listen to telephones and
|
|
cell phones with no display to interact with voice browsers. Some
|
|
voice browsers may have small screens, such as those found on
|
|
cell phones and palm computers. In the future, voice browsers may
|
|
also support other modes and media such as pen, video, and sensor
|
|
input and graphics animation and actuator controls as output. For
|
|
example, voice and pen input would be appropriate for Asian users
|
|
whose spoken language does not lend itself to entry with
|
|
traditional QWERTY keyboards.</p>
|
|
|
|
<p>Some voice browsers are <em>portable</em>. They can be used
|
|
anywhere—at home, at work, and on the road. Information
|
|
will be <em>available</em> to a greater audience, especially to
|
|
people who have access to handsets, either telephones or cell
|
|
phones, but not to networked computers.</p>
|
|
|
|
<p>Voice browsers present a <em>pragmatic</em> interface for
|
|
functionally blind users or users needing Web access while
|
|
keeping their hands and eyes free for other things. Voice
|
|
browsers present an invisible user interface to the user, while
|
|
freeing workspace previously occupied by keyboards and mice.</p>
|
|
|
|
<h2>4. <a id="spif" name="spif">W3C Speech Interface
|
|
Framework</a></h2>
|
|
|
|
<p>The Voice Browser Working group has defined the <i>W3C Speech
|
|
Interface Framework</i>, shown in Figure 1. The white boxes
|
|
represent typical components of a speech-enabled web application.
|
|
The black arrows represent data flowing among these components.
|
|
The blue ovals indicate data specified using markup languages
|
|
used to guide components to accomplish their respective tasks. To
|
|
review the latest requirement and specification documents for
|
|
each of the markup languages, see the section entitled
|
|
Requirements and Language specification Documents on our <a
|
|
href="http://www.w3.org/Voice/">W3C Voice Browser home web
|
|
site</a>.</p>
|
|
|
|
<p align="center"><img src="voice-intro-fig1.gif" width="559"
|
|
height="392"
|
|
alt="block diagram for speech interface framework" /></p>
|
|
|
|
<p>Components of the W3C Speech Interface Framework include the
|
|
following:</p>
|
|
|
|
<p><i>Automatic Speech Recognizer (ASR)</i>—accepts speech
from the user and produces text. The ASR uses a grammar to
recognize words from the user's speech. Some ASRs use grammars
specified by a developer using the <b>Speech Grammar Markup
Language</b>. Other ASRs use statistical grammars generated from
large corpora of speech data. These grammars are represented
using the <b>N-gram Stochastic Grammar Markup Language.</b></p>

<p><i>DTMF Tone Recognizer</i>—accepts touch-tones produced
by a telephone when the user presses the keys on the telephone's
keypad. Telephone users may use touch-tones to enter digits or
make menu selections.</p>

<p><i>Language Understanding Component</i>—extracts
semantics from a text string by using a prespecified grammar. The
text string may be produced by an ASR or be entered directly by a
user via a keyboard. The Language Understanding Component may
also use grammars specified using the <b>Speech Grammar Markup
Language</b> or the <b>N-gram Stochastic Grammar Markup
Language.</b> The output of the Language Understanding Component
is expressed using the <b>Natural Language Semantics Markup
Language.</b></p>

<p><i>Context Interpreter</i>—enhances the semantics from
the Language Understanding Component by obtaining context
information from a dialog history (not shown in Figure 1). For
example, the Context Interpreter may replace a pronoun by the
noun to which the pronoun referred. The input and output of the
Context Interpreter are expressed using the <b>Natural Language
Semantics Markup Language.</b></p>
<p><i>Dialog Manager</i>—prompts the user for input, makes
sense of the input, and determines what to do next according to
instructions in a dialog script specified using VoiceXML 2.0,
which is modeled after VoiceXML 1.0. Depending upon the input
received, the Dialog Manager may invoke application services,
download another dialog script from the web, or cause information
to be presented to the user. The Dialog Manager accepts input
specified using the <b>Natural Language Semantics Markup
Language.</b> Dialog scripts may refer to <b>Reusable Dialog
Components</b>, portions of another dialog script which can be
reused across multiple applications.</p>

<p><i>Media Planner</i>—determines whether output from the
Dialog Manager should be presented to the user as synthetic
speech or prerecorded audio.</p>

<p><i>Recorded Audio Player</i>—replays prerecorded audio
files to the user, either in conjunction with, or in place of,
synthesized voices.</p>

<p><i>Language Generator</i>—accepts text from the Media
Planner and prepares it for presentation to the user as spoken
voice via a text-to-speech synthesizer (TTS). The text may
contain markup tags expressed using the <b>Speech Synthesis
Markup Language</b>, which provides hints and suggestions for how
acoustic sounds should be produced. These tags may be produced
automatically by the Language Generator or manually inserted by a
developer.</p>

<p><i>Text-to-Speech Synthesizer (TTS)</i>—accepts text
from the Language Generator and produces acoustic signals which
the user hears as a human-like voice, according to hints
specified using the <b>Speech Synthesis Markup Language</b>.</p>

<p>The components of any specific voice browser may differ
significantly from the components shown in Figure 1. For example,
the Context Interpretation, Language Generation and Media
Planning components may be incorporated into the Dialog Manager,
or the tone recognizer may be incorporated into the Context
Interpreter. However, most voice browser implementations will
still be able to make use of the various markup languages defined
in the W3C Speech Interface Framework.</p>

<p>The Voice Browser Working Group is not defining the components
in the W3C Speech Interface Framework. It is defining markup
languages for representing data in each of the blue ovals in
Figure 1. Specifically, the Voice Browser Working Group is
defining the following markup languages:</p>
<ul>
<li>
<p>Speech Recognition Grammar Specification</p>
</li>

<li>
<p>N-gram Grammar Markup Language</p>
</li>

<li>
<p>Speech Synthesis Markup Language</p>
</li>

<li>
<p>Dialog Markup Language</p>
</li>
</ul>
<p>The Voice Browser Working Group is also defining packaged
dialogs which we call <b>Reusable Components</b>. As their name
suggests, reusable components can be reused in other dialog
scripts, decreasing the implementation effort and increasing user
interface consistency. The Working Group may also define a
collection of reusable components for common tasks, such as
soliciting the user's credit card number and expiration date,
soliciting the user's address, etc.</p>

<p>Just as HTML formats data for screen-based interactions over
the Internet, an XML-based language is needed to format data for
voice-based interactions over the Internet. All markup languages
recommended by the Working Group will be XML-based, so XML
language processors can process any of the W3C Speech Interface
Framework markup languages.</p>
<h2>5. <a id="other" name="other">Other Uses of the Markup
|
|
Languages</a></h2>
|
|
|
|
<p>Figure 2 illustrates the W3C Speech Interface Framework
|
|
extended to support multiple modes of input and output. It is
|
|
anticipated that another working group will be formed to specify
|
|
the <b>Multimodal Dialog Language</b>, an extension of the Dialog
|
|
Language. We anticipate that another Working Group will be
|
|
established to take over our current work in defining the
|
|
Multimodal Dialog Language.</p>
|
|
|
|
<p align="center"><img src="voice-intro-fig2.gif" width="556"
|
|
height="402"
|
|
alt="block diagram for multimodal interface framework" /></p>
|
|
|
|
<p>Markup languages also may be used in applications not usually
|
|
associated with voice browsers. The following applications also
|
|
may benefit from the use of voice browser markup languages:</p>
|
|
|
|
<ul>
<li><em>Text-based Information Storage and
Retrieval</em>—Accepts text from a keyboard and presents
the text on a display. It uses neither ASR nor TTS, but makes
heavy use of the language understanding module and the Natural
Language Semantics Markup Language.</li>

<li><em>Robot Command and Control</em>—Users speak commands
that control a mechanical robot. This application may use both
the Speech Recognition Grammar Specification and dialog markup
languages.</li>

<li><em>Medical Transcription</em>—A complex, specialized
speech recognition grammar is used to extract medical information
from text produced by the ASR. A human editor corrects the
resulting text before printing.</li>

<li><em>Newsreader</em>—A language generator produces
marked-up text for presenting voice to the user. This application
uses a special language generator to mark up text from news wire
services for verbal presentation.</li>
</ul>
<h2>6. <a id="specs" name="specs">Individual Markup Language
|
|
Overviews</a></h2>
|
|
|
|
<p>To review the latest requirement and specification documents
|
|
for each of the following languages, see the section titled
|
|
Requirements and Language specification Documents on our <a
|
|
href="http://www.w3.org/Voice/">W3C Voice Browser home web
|
|
site</a></p>
|
|
|
|
<h3><a id="gram" name="gram">6.1. Speech Recognition Grammar
|
|
Specification</a></h3>
|
|
|
|
<p>The Speech Recognition Grammar Specification supports the
|
|
definition of Context-Free Grammars (CFG) and, by subsumption,
|
|
Finite-State Grammars (FSG). The specification defines an XML
|
|
Grammar Markup Language, and an optional Augmented Backus-Naur
|
|
Format (ABNF) Markup Language. Automatic transformations between
|
|
the two formats is possible, for example, by XSLT to convert the
|
|
XML format to ABNF. We anticipate that development tools will be
|
|
constructed that provide the familiar ABNF format to developers,
|
|
and enable XML software to manipulate the XML grammar format. The
|
|
ABNF and XML languages are modeled after Sun's <a
|
|
href="http://www.w3.org/Submission/2000/06/">JSpeech Grammar
|
|
Format</a>. Some of the interesting features of the draft
|
|
specification:</p>
|
|
|
|
<ul>
<li>
<p>Ability to cross-reference grammars by URI and to use this
ability to define libraries of useful grammars.</p>
</li>

<li>
<p>Internationalized.</p>
</li>

<li>
<p>Semantic tagging mechanism for interpretation of spoken input
(under development).</p>
</li>

<li>
<p>Applicable to non-speech input modalities, e.g. DTMF input or
parsing and interpretation of typed input.</p>
</li>
</ul>
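
<p>As a brief sketch of the two formats, the fragments below show
one small grammar written both ways. The ABNF form follows the
JSpeech Grammar Format style on which the language is modeled;
the XML element names are illustrative placeholders only, since
the draft specification may adopt different names.</p>

<pre>
#JSGF V1.0;
grammar cities;
// ABNF (JSGF-style) form: the cities a caller may name,
// with an optional "D C" after Washington
public <city> = New York | Boston | Washington [ D C ];
</pre>

<pre>
<!-- A hypothetical XML rendering of the same grammar -->
<grammar name="cities">
  <rule name="city" scope="public">
    <choice>
      <item> New York </item>
      <item> Boston </item>
      <item> Washington <optional> D C </optional> </item>
    </choice>
  </rule>
</grammar>
</pre>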

<p>A complementary speech recognition grammar language
specification is defined for N-gram language models.</p>

<p>Terms used in the Speech Grammar Markup Language requirements
and specification documents include:</p>
<table border="1" cellpadding="6" cellspacing="1" width="85%"
|
|
summary="term in first column, explanation in second">
|
|
<tbody>
|
|
<tr>
|
|
<th width="24%">CFG</th>
|
|
<td width="76%">Context-Free Grammar. A formal computer science
|
|
term for a language that permits embedded recursion.</td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<th width="24%">BNF</th>
|
|
<td width="76%">Backus-Naur Format. A language used widely in
|
|
computer science for textural representations of CFGs.</td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<th width="24%">ABNF</th>
|
|
<td width="76%">Augmented Backus-Naur Format. The language
|
|
defined in the grammar specification that extends a conventional
|
|
BNF representation with regular grammar capabilities, syntax for
|
|
cross-referencing between grammars and other useful syntactic
|
|
features</td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<th width="24%">Grammar</th>
|
|
<td width="76%">The representation of constraints defining the
|
|
set of allowable sentences in a language. E.g. a grammar for
|
|
describing a set of sentences for ordering a pizza.</td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<th width="24%">Language</th>
|
|
<td width="76%">A formal computer science term for the collection
|
|
of set of sentences associated with a particular domain. Language
|
|
may refer to natural or program language.</td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
|
|
<h3><a id="synth" name="synth">6.2. Speech Synthesis</a></h3>
|
|
|
|
<p>A text document may be produced automatically, authored by
|
|
people, or a combination of both. The Speech Synthesis Markup
|
|
Language supports high-level specifications, including the
|
|
selection of voice characteristics (name, gender, and age) and
|
|
the speed, volume, and emphasis of individual words. The language
|
|
also may describe how to pronounce acronyms, such as "Nasa" for
|
|
NASA, or spelled, such as "N, double A, C, P," for NAACP. At a
|
|
lower level, designers may specify prosodic control, which
|
|
includes pitch, timing, pausing, and speaking rate. The Speech
|
|
Synthesis Markup Language is modeled on Sun's <a
|
|
href="http://java.sun.com/products/java-media/speech/forDevelopers/JSML/index.html">
|
|
<b>Java Speech Markup Language</b></a>.</p>
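
<p>The following hypothetical fragment, written in the spirit of
the Java Speech Markup Language, suggests what such markup might
look like. The element and attribute names here are illustrative
only, since the Speech Synthesis Markup Language specification is
still being drafted.</p>

<pre>
<!-- Select voice characteristics: gender and age -->
<voice gender="female" age="30">
  Welcome to Ajax Travel.
  <!-- Spell the acronym out letter by letter -->
  Your <sayas class="literal">NAACP</sayas> member discount applies.
  <!-- Lower-level prosodic control: slower rate, higher pitch -->
  <prosody rate="slow" pitch="high">
    Please hold while we check today's fares.
  </prosody>
</voice>
</pre>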

<p>There is some variance in the use of terminology in the speech
synthesis community. The following definitions establish a common
understanding.</p>

<table border="1" cellpadding="6" cellspacing="1" width="85%"
summary="term in first column, explanation in second">
<tbody>
<tr>
<th width="24%">Prosody</th>
<td width="76%">Features of speech such as pitch, pitch range,
speaking rate and volume.</td>
</tr>

<tr>
<th width="24%">Speech Synthesis</th>
<td width="76%">The process of automatic generation of speech
output from data input which may include plain text, formatted
text or binary objects.</td>
</tr>

<tr>
<th width="24%">Text-To-Speech</th>
<td width="76%">The process of automatic generation of speech
output from text or annotated text input.</td>
</tr>
</tbody>
</table>
<h3><a id="dialog" name="dialog">6.3. VoiceXML 2.0</a></h3>
|
|
|
|
<p>VoiceXML 2.0 Markup supports four I/O modes: speech
|
|
recognition and DTMF as input with synthesized speech and
|
|
prerecorded speech as output. VoiceXML 2.0 supports
|
|
system-directed speech dialogs where the system prompts the user
|
|
for responses, makes sense of the input, and determines what to
|
|
do next. VoiceXML 2.0 also supports mixed initiative speech
|
|
dialogs. In addition, VoiceXML also supports task switching and
|
|
the handling of events, such as recognition errors, incomplete
|
|
information entered by the user, timeouts, barge-in, and
|
|
developer-defined events. Barge-in allows users to speak while
|
|
the browser is speaking. VoiceXML 2.0 is modeled after <a
|
|
href="http://www.w3.org/Submission/2000/04/">VoiceXML 1.0</a>
|
|
designed by the <a href="http://www.voicexml.org/">VoiceXML
|
|
Forum</a>, whose founding members are AT&T, IBM, Lucent, and
|
|
Motorola.</p>
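
<p>As a brief illustration of event handling, the following
fragment uses VoiceXML 1.0 syntax, on which VoiceXML 2.0 is
modeled. The form, field, and grammar names are invented for
illustration.</p>

<pre>
<form id="destination">
  <field name="city">
    <prompt>Which city do you want to fly to?</prompt>
    <grammar src="cities.gram"/>
    <!-- Event handler: the utterance did not match the grammar -->
    <nomatch>
      Sorry, I did not understand. Please say a city name.
    </nomatch>
    <!-- Event handler: the user said nothing before the timeout -->
    <noinput>
      Are you still there? Please say a city name.
    </noinput>
  </field>
</form>
</pre>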

<p>Terms used in the Dialog Markup Language requirements and
specification documents include:</p>

<table border="1" cellpadding="6" cellspacing="1" width="85%"
summary="term in first column, explanation in second">
<tbody>
<tr>
<th>Dialog Markup Language</th>
<td>A language in which voice dialog behavior is specified. The
language may include reference to scripting elements which can
also determine dialog behavior.</td>
</tr>

<tr>
<th>Voice Browser</th>
<td>A software device which interprets a voice markup language
and generates a dialog with voice output and possibly other
output modalities and/or voice input and possibly other
modalities.</td>
</tr>

<tr>
<th>Dialog</th>
<td>A model of interactive behavior underlying the interpretation
of the markup language. The model consists of states, variables,
events, event handlers, inputs and outputs.</td>
</tr>

<tr>
<th>Utterance</th>
<td>Used in this document generally to refer to a meaningful user
input in any modality supported by the platform, not limited to
spoken inputs; for example, speech, DTMF, pointing, handwriting,
text and OCR.</td>
</tr>

<tr>
<th>Mixed initiative dialog</th>
<td>A type of dialog in which either the system or the user can
take the initiative at any point in the dialog by failing to
respond directly to the previous utterance. For example, the user
can make corrections, volunteer additional information, etc.
Systems support mixed initiative dialog to various degrees.
Compare to "directed dialog."</td>
</tr>

<tr>
<th>Directed dialog</th>
<td>Also referred to as "system initiative" or "system led." A
type of dialog in which the user is permitted only direct literal
responses to the system's prompts.</td>
</tr>

<tr>
<th>State</th>
<td>The basic interactional unit defined in the markup language.
A state can specify variables, event handlers, outputs and
inputs. A state may describe output content to be presented to
the user, input which the user can enter, and event handlers
describing, for example, which variables to bind and which state
to transition to when an event occurs.</td>
</tr>

<tr>
<th>Events</th>
<td>Generated when a state is executed by the voice browser; for
example, when outputs or inputs in a state are rendered or
interpreted. Events are typed and may include information; for
example, an input event generated when an utterance is recognized
may include the string recognized, an interpretation, a
confidence score, and so on.</td>
</tr>

<tr>
<th>Event Handlers</th>
<td>Specified in the voice markup language; they describe how
events generated by the voice browser are to be handled.
Interpretation of events may bind variables, or map the current
state into another state (possibly itself).</td>
</tr>

<tr>
<th>Output</th>
<td>Content specified in an element of the markup language for
presentation to the user. The content is rendered by the voice
browser; for example, audio files or text rendered by a TTS.
Output can also contain parameters for the output device; for
example, volume of audio file playback, language for TTS, etc.
Events are generated when, for example, the audio file has been
played.</td>
</tr>

<tr>
<th>Input</th>
<td>Content (and its interpretation) specified in an element of
the markup language which can be given as input by a user; for
example, a grammar for DTMF and speech input. Events are
generated by the voice browser when, for example, the user has
spoken an utterance, and variables may be bound to information
contained in the event. Input can also specify parameters for the
input device; for example, timeout parameters, etc.</td>
</tr>
</tbody>
</table>
<h3><a id="nl" name="nl">6.4. Natural Language Semantics</a></h3>
|
|
|
|
<p>The Natural Language Semantics Markup Language supports XML
|
|
semantic representations. For application-specific information,
|
|
it is based on the W3C <a
|
|
href="http://www.w3.org/TR/2000/WD-xforms-datamodel-20000406/">XForms.</a>
|
|
The Natural Language Semantics Markup Language also includes
|
|
application-independent elements defined by the W3C Voice Browser
|
|
group. This application-independent information includes
|
|
confidences, the grammar matched by the interpretation, speech
|
|
recognizer input, and timestamps. The Natural Language Semantics
|
|
Markup Language combines elements from the XForms, natural
|
|
language semantics, and application-specific namespaces. For
|
|
example, the text, "I want to fly from New York to Boston, and,
|
|
then, to Washington, DC", could be represented as:</p>
|
|
|
|
<pre>
|
|
<result xmlns:xf="http://www.w3.org/2000/xforms"
|
|
x-model="http://flight-model"
|
|
grammar="http://flight-grammar">
|
|
<interpretation confidence=100>
|
|
<xf:instance>
|
|
<flight:trip>
|
|
<leg1>
|
|
<from>New York</from>
|
|
<to>Boston</to>
|
|
</leg1>
|
|
<leg2>
|
|
<from>Boston</from>
|
|
<to>DC</to>
|
|
</leg2>
|
|
</flight:trip>
|
|
</xf:instance>
|
|
<input mode="speech">
|
|
I want to fly from New York to Boston, and,
|
|
then, to Washington, DC
|
|
</input>
|
|
</interpretation>
|
|
</result>
|
|
</pre>

<p>Terms used in the Natural Language Semantics Markup Language
requirements and specification documents include:</p>

<table border="1" cellpadding="6" cellspacing="1" width="85%"
summary="term in first column, explanation in second">
<tbody>
<tr>
<th width="23%">Natural language interpreter</th>
<td width="77%">A device which produces a representation of the
meaning of a natural language expression.</td>
</tr>

<tr>
<th width="23%">Natural language expression</th>
<td width="77%">An unformatted spoken or written utterance in a
human language such as English, French, Japanese, etc.</td>
</tr>
</tbody>
</table>
<h3><a id="reuse" name="reuse">6.5 Reusable Dialog
|
|
Components</a></h3>
|
|
|
|
<p>Reusable Dialog Components are dialog components (chunks of
|
|
dialog script or platform-specific objects that pose frequently
|
|
asked questions in dialog scripts, and can be invoked from any
|
|
dialog script) that are reusable (can be used multiple times
|
|
within an application or used by multiple applications) and that
|
|
meet specific interface (configuration parameter and return value
|
|
format) requirements. The purpose of reusable components is to
|
|
reduce the effort to implement a dialog by reusing encapsulations
|
|
of common dialog tasks, and to promote consistency across
|
|
applications. The W3C Voice Browser Working Group is defining the
|
|
interface for Reusable Dialog Components. Future specifications
|
|
will define standard reusable dialog components for designated
|
|
tasks that are portable across platforms.</p>
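
<p>One way a dialog script might invoke such a component is
sketched below using the VoiceXML 1.0 <subdialog> element;
the component URI, configuration parameter, and return value
field are all invented for illustration.</p>

<pre>
<form id="booking">
  <!-- Invoke a hypothetical reusable date-collection component, -->
  <!-- passing a configuration parameter -->
  <subdialog name="travelDate" src="http://example.com/get-date.vxml">
    <param name="prompt" value="When do you want to travel?"/>
    <filled>
      <!-- The component returns its result in travelDate.date -->
      <prompt>Booking a flight for <value expr="travelDate.date"/>.</prompt>
    </filled>
  </subdialog>
</form>
</pre>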

<h2>7. <a id="examples" name="examples">Example of Markup
Language Use</a></h2>

<p>The following speech dialog fragment illustrates the use of
the speech synthesis, Speech Recognition Grammar Specification,
and speech dialog markup languages:</p>

<pre>
<menu>
  <!-- This is an example of a menu which presents the user -->
  <!-- with a prompt and listens for the user to utter a choice -->
  <prompt>
    <!-- This text is presented to the user as synthetic speech -->
    <!-- The emphasis element adds emphasis to its content -->
    Welcome to Ajax Travel. Do you want to fly to
    <emphasis>New York, Boston</emphasis> or
    <emphasis>Washington DC</emphasis>?
  </prompt>
  <!-- When the user speaks an utterance that matches the grammar -->
  <!-- control is transferred to the "next" VoiceXML document -->
  <choice next="http://www.NY...">
    <!-- The <grammar> element indicates the words which -->
    <!-- the user may utter to select this choice -->
    <grammar>
      <choice>
        <item> New York </item>
        <item> The Big Apple </item>
      </choice>
    </grammar>
  </choice>
  <choice next="http://www.Boston...">
    <grammar>
      <choice>
        <item> Boston </item>
        <item> Beantown </item>
      </choice>
    </grammar>
  </choice>
  <choice next="http://www.Wash....">
    <grammar>
      <choice>
        <item> Washington D.C. </item>
        <item> Washington </item>
        <item> The U.S. Capital </item>
      </choice>
    </grammar>
  </choice>
</menu>
</pre>
<p>In the example above, the Dialog Markup Language describes a
voice menu which contains a prompt to be presented to the user.
The user may respond by saying any of several choices. When the
user's speech matches a particular grammar, control is
transferred to the dialog fragment at the "next" location.</p>

<p>The Speech Synthesis Markup Language describes how text is
rendered to the user. The Speech Synthesis Markup Language
includes the <emphasis> element. When rendered to the user,
the city names will be emphasized, and the end of the sentence
will rise in pitch to indicate a question.</p>

<p>The Speech Recognition Grammar Specification describes the
words that the user must say when making a choice. The
<grammar> element is shown within the <choice>
element. The language understanding module will recognize "New
York" or "The Big Apple" to mean New York, "Boston" or "Beantown"
to mean Boston, and "Washington, D.C.," "Washington," or "The
U.S. Capital" to mean Washington.</p>

<p>An example user-computer dialog resulting from interpreting
the above dialog script is:</p>

<pre>
Computer: <i>Welcome to Ajax Travel. Do you want to fly
to New York, Boston, or Washington DC?</i>

User: Beantown

Computer: <i>(transfers to dialog script associated with Boston)</i>
</pre>
<h2>8. <a id="submissions"
|
|
name="submissions">Submissions</a></h2>
|
|
|
|
<p>W3C has acknowledged the <a
|
|
href="http://www.w3.org/Submission/2000/06/">JSGF and JSML
|
|
submission</a> from the <a href="http://www.sun.com/">Sun
|
|
Microsystems</a>. The W3C Voice Browser Working Group plans to
|
|
develop specifications for its Speech Synthesis Markup Language
|
|
and Speech Grammar Specification using JSGF and JSML as a
|
|
model.</p>
|
|
|
|
<p>W3C has acknowledged the <a
|
|
href="http://www.w3.org/Submission/2000/04/">VoiceXML 1.0
|
|
submission</a> from the <a
|
|
href="http://www.voicexml.org/">VoiceXML Forum</a>. The W3C <a
|
|
href="http://www.w3.org/Voice/Group/">Voice Browser Working
|
|
Group</a> plans to adopt VoiceXML 1.0 as the basis for developing
|
|
a Dialog Markup Language for interactive voice response
|
|
applications. See <a
|
|
href="http://www.zdnet.com/eweek/stories/general/0,11011,2574350,00.html">
|
|
ZDNet's article</a> covering the announcement</p>
|
|
|
|
<h2>9. <a id="reading" name="reading">Further Reading
|
|
Material</a></h2>
|
|
|
|
<p>The following resources are related to the efforts of the
|
|
Voice Browser working group.</p>
|
|
|
|
<dl>
|
|
<dt><a href="http://www.w3.org/TR/REC-CSS2/aural.html">Aural
|
|
CSS</a></dt>
|
|
|
|
<dd>The aural rendering of a document, already commonly used by
|
|
the blind and print-impaired communities, combines speech
|
|
synthesis and "auditory icons." Often such aural presentation
|
|
occurs by converting the document to plain text and feeding this
|
|
to a screen reader -- software or hardware that simply reads all
|
|
the characters on the screen. This results in less effective
|
|
presentation than would be the case if the document structure
|
|
were retained. Style sheet properties for aural presentation may
|
|
be used together with visual properties (mixed media) or as an
|
|
aural alternative to visual presentation.</dd>
|
|
|
|
<dt><br />
|
|
<a href="http://www.etsi.org/">The European Telecommunications
|
|
Standards Institute (ETSI)</a></dt>
|
|
|
|
<dd>The European Telecommunications Standards Institute (ETSI)
|
|
ETSI is a non-profit organization whose mission is "to determine
|
|
and produce the telecommunications standards that will be used
|
|
for decades to come". ETSI's work is complementary to W3C's. The
|
|
ETSI STQ Aurora DSR Working Group standardizes algorithms for
|
|
Distributed Speech Recognition (DSR). The idea is to preprocess
|
|
speech signals before transmission to a server connected to a
|
|
speech recognition engine. Navigate to http://www.etsi.org/stq/
|
|
for more details.</dd>
|
|
|
|
<dt><br />
|
|
<a
|
|
href="http://www.java.sun.com/products/java-media/speech/forDevelopers/JSGF/index.html">
|
|
Java Speech Grammar Format</a></dt>
|
|
|
|
<dd>The Java™ Speech Grammar Format is used for defining
|
|
context free grammars for speech recognition. JSGF adopts the
|
|
style and conventions of the Java programming language in
|
|
addition to use of traditional grammar notations.<br />
|
|
</dd>
|
|
|
|
<dt><a href="http://www.microsoft.com/IIT/">Microsoft Speech
|
|
Site</a></dt>
|
|
|
|
<dd class="c5">This site describes the Microsoft speech API, and
|
|
contains a recognizer and synthesizer that can be
|
|
downloaded.</dd>
|
|
|
|
<dt><br />
|
|
<a href="http://www.w3.org/TR/NOTE-voice">NOTE-voice</a></dt>
|
|
|
|
<dd>This note describes features needed for effective interaction
|
|
with Web browsers that are based upon voice input and output.
|
|
Some extensions are proposed to HTML 4.0 and CSS2 to support
|
|
voice browsing, and some work is proposed in the area of speech
|
|
recognition and synthesis to make voice browsers more
|
|
effective.</dd>
|
|
|
|
<dt><br />
|
|
<a
|
|
href="http://www.bell-labs.com/project/tts/sable.html">SABLE</a></dt>
|
|
|
|
<dd>SABLE is a markup language for controlling text to speech
|
|
engines. It has evolved out of work on combining three existing
|
|
text to speech languages: SSML, STML and JSML.</dd>
|
|
|
|
<dt><br />
|
|
<a href="http://www.alphaworks.ibm.com/tech">SpeechML</a></dt>
|
|
|
|
<dd><i>(IBM's server precludes a simple URL for this, but you can
|
|
reach the SpeechML site by following the link for Speech
|
|
Recognition in the left frame)</i> SpeechML plays a similar role
|
|
to VoxML, defining a markup language written in XML for IVR
|
|
systems. SpeechML features close integration with Java.</dd>
|
|
|
|
<dt><br />
|
|
<a href="http://www.w3.org/Voice/TalkML">TalkML</a></dt>
|
|
|
|
<dd>This is an experimental markup language from HP Labs, written
|
|
in XML, and aimed at describing spoken dialogs in terms of
|
|
prompts, speech grammars and production rules for acting on
|
|
responses. It is being used to explore ideas for object-oriented
|
|
dialog structures, and for next generation aural style
|
|
sheets.</dd>
|
|
|
|
<dt><br />
|
|
<a href="http://www.w3.org/Voice/WWW8/slide1.html">Voice Browsers
|
|
and Style Sheets</a></dt>
|
|
|
|
<dd>Presentation by Dave Raggett on May 13th 1999 as part of the
|
|
Style stack of Developer's Day in <a
|
|
href="http://www8.org/">WWW8</a>. The presentation makes
|
|
suggestions for extensions to <a
|
|
href="http://www.w3.org/TR/REC-CSS2/aural.html">ACSS</a>.</dd>
|
|
|
|
<dt><br />
|
|
<a href="http://www.vxml.org/">VoiceXML site</a></dt>
|
|
|
|
<dd>The VoiceXML Forum formed by AT&, IBM, Lucent and
|
|
Motorola to pool their experience. The Forum has published an
|
|
early version of the VoiceXML specification. This builds on
|
|
earlier work on PML, VoxML and SpeechML.</dd>
|
|
</dl>
|
|
|
|
<h2>10. <a id="summary" name="summary">Summary</a></h2>
|
|
|
|
<p>The W3C Voice Browser Working Group is defining markup
|
|
languages for speech recognition grammars, speech dialog, natural
|
|
language semantics, multimodal dialogs, and speech synthesis, as
|
|
well as a collection of reusable dialog components. In addition
|
|
to voice browsers, these languages can also support a wide range
|
|
of applications including information storage and retrieval,
|
|
robot command and control, medical transcription, and newsreader
|
|
applications. The speech community is invited to review and
|
|
comment on working draft requirement and specification
|
|
documents.</p>
|
|
</body>
|
|
</html>
|
|
|