<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Speech and the Future</title>
<style type="text/css">
.soundbyte {text-align: center}
.new {color: #FF0000; background-color: #FFFF00}</style>
<link rel="stylesheet" type="text/css" title="W3C Talk"
href="../../../Tools/w3ctalk-summary.css" />
<link href="em.css" rel="stylesheet" type="text/css" />
<link xmlns:xlink="http://www.w3.org/1999/xlink"
href="../../../People/Berners-Lee/general.css" rel="stylesheet"
type="text/css" />
</head>
<body xml:lang="en" lang="en">
<h1>Speech and the Future</h1>

<p><code>http://www.w3.org/2004/Talks/0914-tbl-speech/text</code></p>

<p><a href="http://www.w3.org/People/Berners-Lee/">Tim Berners-Lee</a></p>

<p>Director, World Wide Web Consortium</p>

<p>SpeechTek New York</p>

<p>2004-09-14</p>

<h3 id="Introducti">Introduction</h3>
<p>Good morning, welcome, and thank you for inviting me to speak today. I'm
going to use speech today, but without much technology. I won't be using
slides; you'll just have an audio channel. So even though I'm not an expert
on speech technology -- you all probably know more about it than I do -- I am
putting my faith in speech itself as a medium for the next few minutes.</p>
<p>So, as I'm not a researcher at the forefront of speech technology, I'm not
going to be telling you about the latest and greatest advances. Instead I
come to you, I suppose, with four different roles. One, as someone who spent
a lot of effort getting one new technology, the Web, from idea into general
deployment, I'm interested in how we as a technical community get from where
we are now to where we'd like to be. Two, as director of the World Wide Web
Consortium, I try to get an overall view of where the new waves of Web
technology are heading, and hopefully how they will fit together.</p>
<p>With my third hat on I'm a researcher at MIT's Computer Science and
Artificial Intelligence Laboratory (CSAIL). MIT, along with the ERCIM
organization in Europe and Keio University in Japan, plays host to the
Consortium, and I get an office in the really nifty new CSAIL building, the
Stata Center. I like it for lots of reasons, one of which is the people you
get to talk to. I have chatted with some of my colleagues who are actually
engaged in leading-edge research about the future.</p>
<p>And fourth, I come as a random user who is going to be affected by this
technology and who wants it to work well. It is perhaps the role I'm most
comfortable in, because I can talk about what I would like. I don't normally
try to predict the future - that's too hard - but talking about what we would
like to see is the first step to getting it, so I do that a lot.</p>
<p>When you step back and look at what's happening, one thing becomes
clearer and clearer -- that things are very interconnected. If you are a fan
of Douglas Adams and/or Ted Nelson, you'll know that all things are
hopelessly intertwingled, and in the new technologies that is certainly the
case. So I'm going to discuss speech first and then some of the things it
connects with.</p>
<h3><a name="Language" id="Language">Language</a></h3>

<p>Speech is a form of language. Language is what it's all about, in fact.
Languages of different sorts. Human languages and computer languages. This
conference is, in a way, about the difference between them.</p>
<p>Let's think about natural language first. Human language is an amazing
thing. Anyone who is a technologist has to be constantly in awe of the human
being. When you look at the brain and what it is capable of, and you look at
what people are capable of (especially when they actually put their brains
into use), it is pretty impressive. And in fact I'm most impressed by what
people can do when they get together. When you think about that, when you
look at how people communicate, you find this phenomenon of Natural
Language -- this crazy, evolving way words and symbols splash between
different people. No one can really pin down what any word means, and so
many of the utterances don't even parse grammatically, yet the end effect is
a medium of great power. And of course among the challenges for speech
technology is that Natural Language varies from place to place and person to
person, and, particularly, evolves all the time. That is speech.</p>
<h3>..Tek</h3>

<p>Now what is technology? Computer technology is mostly made up of
languages, different sorts of language. HTML, URIs and HTTP make the web
work; all the technology which we develop at the World Wide Web Consortium,
not to mention speech technology, involves sets of languages of a different
kind: computer languages.</p>
<p>I wrote the original Web code in 1990, along with the first simple specs
of URLs (then UDIs), HTML and HTTP. By 1993 the Web was exploding very
rapidly, and the Information Technology sector had got wind of it and was
planning how to best use this huge new opportunity. Now, people realized
that the reason the Web was spreading so fast was that there was no central
control and no royalty fee. Anyone could start playing with it -- browsing,
running a server, writing software -- without commitment, without ending up
in the control of or owing money to any central company. And they knew that
it all worked because HTML, URIs and HTTP were common standards. Now I'd
written those specs originally and they worked OK, but there was a huge
number of things which we all wanted to do which were even more exciting.
So there was a need for a place for people, companies and organizations to
come together and build a new, evolving set of standards. And still it was
important to keep that openness.</p>
<h3>W3C</h3>

<p>The answer was the World Wide Web Consortium, W3C, and all you have to do
to join is go to the web site and fill in some forms, pay some money to keep
it going, and find some people who can be involved in developing or steering
new technology. You'll need engineers, because we build things here, and
you'll need communicators, because you need to let the community know what
your needs are, and you need to make sure your company understands what's
happening in W3C and how it will affect them at every level. The Consortium
has around 350 members, and we work in a lot of interconnected areas, from
things like HTML and graphics, mobile systems and privacy, to program
integration which we call Web Services and data integration which we call
Semantic Web -- too many things to name: go to the web site w3.org and just
look at the list of areas in which Web technology is evolving. Speech
technology -- recognition and synthesis -- is one of these areas.</p>
<p>So the business we're in is making open common infrastructure which will
form the base of a new wave of technology, new markets, and whole new types
of business in the future. We all are, or should be, in that business, and
whether we do it well will determine how big a pie the companies here will
be sharing in the future.</p>
<p>Computer languages, by contrast, are hard, unbending languages with
well-defined grammars. Yes, the technical terms in something like VoiceXML
are defined in English, typically, which is a natural language -- but
English which has been iterated over so much that effectively, for practical
purposes, the technical term -- each tag in VoiceXML, say -- becomes
different from a word. While the meaning of English words flows with time,
the technical term is an anchor point. The meanings of the terms have been
defined by working groups, labored over, established as a consensus and
described beyond all reasonable possibility of practical ambiguity in
documents we call standards -- or at W3C, Recommendations.</p>
<p>Last Tuesday, we added a new one to that set. After many months of hard
work by the <a href="http://www.w3.org/Voice/">Voice Browser Working
Group</a>, the <a
href="http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/">Speech
Synthesis Markup Language</a>, SSML, became a W3C Recommendation. So now two
machines can exchange bits in SSML, and by that can communicate how to
synthesize speech. Now speech synthesis systems can be built out of
components from different manufacturers because there is a standard bus by
which they can be connected. Now you can invest in speech synthesis, in SSML
data, and in your own in-house applications which produce SSML, knowing that
the data will retain its value and that it won't commit you to a single
technology supplier. This is the sort of thing which builds a market. It
joins the <a
href="http://www.w3.org/TR/2004/REC-voicexml20-20040316/">VoiceXML 2.0</a>
spec and the <a
href="http://www.w3.org/TR/2004/REC-speech-grammar-20040316/">Speech
Recognition Grammar Specification</a>, which became Recommendations in
March. Coming up, we have Semantic Interpretation ML and Call Control ML
from the Voice Browser working group, and from the MultiModal Working Group,
InkML for pen-written information and the Extended MultiModal Annotation
language. So a lot is happening, and it is an exciting time.</p>
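<p>To give a flavour of the bits two machines might exchange, here is a
minimal, hypothetical SSML 1.0 document -- a sketch only; the prompt text,
say-as values and pause length are illustrative, not from any real
application:</p>

<pre>
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US"&gt;
  &lt;p&gt;
    Your appointment is on
    &lt;say-as interpret-as="date" format="mdy"&gt;9/14/2004&lt;/say-as&gt;.
    &lt;break time="300ms"/&gt;
    &lt;emphasis level="strong"&gt;Please have your model number ready.&lt;/emphasis&gt;
  &lt;/p&gt;
&lt;/speak&gt;
</pre>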
<p>I know and you know that the standards picture in this area isn't all
that rosy. In the area of integration with HTML, the fact that SALT and
HTML+Voice are competing and are not being developed openly in common is one
of the major concerns which I hear from all sides (except perhaps from those
who are betting on taking control of part of the space by controlling a
proprietary specification!)</p>
<p>This sort of tension is the rule for standards. There is always much to
be gained by a company that can take control of a space using proprietary
languages, and then change them slightly every year. There is always a lot
to be gained by all, in terms of a larger market, by having open standards.
I note that in yesterday's announcement by IBM that some of its speech
software will be going open source, Steven Mills says he wants to "spur the
industry around open standards". He talks about the need to "get the
ecosystem going. If that happens, it will bring more business to IBM". In
fact, among the many areas of W3C work, speech has had standards but little
open source support. It will be interesting to see how the IBM contribution
affects the take-off of the whole area.</p>
<p>All I'll say about the SALT/HTML+Voice situation now is that a conference
like this is a good time to think strategically, to weigh the importance of
a solid common foundation for a potentially huge new market area against the
short-term benefits there might be from developing your own standards, if
you are a supplier, or from purchasing non-standard technology, if you are a
user.</p>
<p>The infrastructure for the connected technology is made up from such
standards, and these standards are written in computer languages, and those
are very different from natural language. The difference between natural
language and computer languages is the chasm which speech technology is
starting to bridge. Speech technology takes on the really difficult task of
making computers communicate with people using human speech, trying to enter
the world of fuzziness and ambiguity. It is really difficult, because
understanding speech is something which human brains can only just do -- in
fact you and I learn to talk just slowly enough and just plainly enough to
be understood just well enough by a person. When we are understood very
reliably, we tend to speed up or make new shortcuts. So the computer is
chasing the human brain, and that is a challenge at the moment.</p>
<p>I'd end this comparison of the two types of language by noting that
computer languages do also evolve, though in a different way from natural
languages. One of the design goals of the semantic web for data integration
is to allow evolution of data systems, so that new terms can be introduced
which are related to but different from the old terms, and to get the
maximum interoperability between old and new data and old and new systems.
This is one of the uses of the web ontology language, OWL.</p>
<h3>Speech dialog</h3>

<p>So that you know where I am as a user, my last conversation with a
machine was with a home appliance repair center, and it went something like
this:</p>
<blockquote>
<dl>
<dt>It</dt>
<dd>What would you like to do? You can make, change or cancel an
appointment, order a part ...</dd>
<dt>Me</dt>
<dd>[interrupting] Make an appointment.</dd>
<dt>It</dt>
<dd>You want to make an appointment, right?</dd>
<dt>Me</dt>
<dd>Right.</dd>
<dt>It</dt>
<dd>(pause) I'm sorry. Please say "yes" or "no".</dd>
<dt>Me</dt>
<dd>Yes.</dd>
<dt>It</dt>
<dd>OK, what sort of a product needs the service? For example, say
"refrigerator", or "furnace".</dd>
<dt>Me</dt>
<dd>Washer.</dd>
<dt>It</dt>
<dd>OK, so you want to make an appointment to service a washer,
right?</dd>
<dt>Me</dt>
<dd>Yes.</dd>
<dt>It</dt>
<dd>I'm sorry, I didn't get that.</dd>
<dt>Me</dt>
<dd>Yes!</dd>
<dt>It</dt>
<dd>Please say yes or no. You want to make an appointment to service a
washer, right?</dd>
<dt>Me</dt>
<dd>Yes!!</dd>
<dt>It</dt>
<dd>I'm sorry. Thank you for calling ____ Customer Service. Have a nice
day.</dd>
</dl>
</blockquote>

<p>The good news is that I called back, learned to say <em>yeup</em>, and
got through. (The bad news is my washer still isn't working!)</p>
<p>(It beat a comparable experience I had with DTMF tones trying to trace an
order for some computer equipment. I called the 1-800 number, went through a
DTMF tree -- if you want to do this press 1, ... and so on ... if you want
to track an order press 9 (9), if it was for a computer press 1 (1), if you
want to talk to somebody about it press 1 (1) -- and talked to somebody
about the problem for 25 minutes, after which she decided to transfer me to
someone else. Thoughtfully, she gave me a number to call if I was
disconnected. Inevitably, I got disconnected almost immediately. I realized
the number she had given me was just the same 1-800 number, so I hit redial.
The redial didn't seem to send enough digits, so I had to hang up and dial
again. I found my way painfully through the tree to the place I should have
been, and talked for another 40 minutes about how to convert my order from
something they could not deliver to something that they could deliver. And
by the end of the process, when I was almost exhausted, and just giving the
last element of personal information so they could credit-check the new
order, my wife came in: "Tim, the police are here", and sure enough in came
the local police. They'd had a 911 call, and hadn't been able to call back
the line, and so presumed it must be an emergency. Yes, when I had hit
<em>redial</em>, my phone had forgotten the 1-800 number, but remembered the
DTMF tones from the phone tree: 9-1-1. An interesting system design
flaw.)</p>
<h3>Speech: long way to go</h3>

<p>Now, I've talked to a few people before coming here to give this talk.
I've chatted with people like Hewlett-Packard's Scott McGlashan, who is very
involved in speech at W3C, and I've also talked to researchers like
Stephanie Seneff and Victor Zue at the Spoken Language Systems (SLS)
research group at MIT's Computer Science and Artificial Intelligence
Laboratory, CSAIL, just along the corridor from my office.</p>
<p>And when I talked to these people, a few things emerged clearly. One is
that speech technology itself has a very long way to go. Another is that the
most important thing may turn out to be not the speech technology itself,
but the way in which speech technology connects to all the other
technologies. I'll go into both those points.</p>
<p>Yes, what we have today is exciting, but it is also very much simpler than
the sorts of things we would really like to be able to do.</p>
<p>Don't get me wrong. VoiceXML and SSML and company are great, and you
should be using them. I much prefer to be able to use English on the phone
to a call center than to have to type in touch-tones. However, I notice that
the form of communication I'm involved in cannot be called a conversation.
It is more of an interrogation. The data I am giving has the rigidity of a
form to be filled in, with the extra constraint that I have to go through it
in the order defined by the speech dialog. Now, I know that VoiceXML has
facilities for me to interrupt, and to jump out from one dialog into
another, but the mode in general still tends to be one of a set of scripts
to which I must conform. This is no wonder. The job is to get data from a
person. Data is computer-language stuff, not natural language stuff. The way
we make machine data from human thoughts has for years been to make the
person talk in a structured, computer-like way. It's not just speech:
"wizards" which help you install things on your computer are similar:
straitjackets which make you think in the computer's way, with the
computer's terms, in the computer's order.</p>
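<p>As a rough sketch of that form-filling style in VoiceXML 2.0 (the field
names, prompts and grammar URI here are invented for illustration, not taken
from any deployed service), a dialog is essentially a form whose fields the
caller is walked through in document order:</p>

<pre>
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"&gt;
  &lt;form id="appointment"&gt;
    &lt;field name="product"&gt;
      &lt;prompt&gt;What sort of product needs the service?&lt;/prompt&gt;
      &lt;grammar src="products.grxml" type="application/srgs+xml"/&gt;
    &lt;/field&gt;
    &lt;field name="confirmed" type="boolean"&gt;
      &lt;prompt&gt;You want to service a &lt;value expr="product"/&gt;, right?&lt;/prompt&gt;
    &lt;/field&gt;
    &lt;filled&gt;
      &lt;submit next="book-appointment.cgi" namelist="product confirmed"/&gt;
    &lt;/filled&gt;
  &lt;/form&gt;
&lt;/vxml&gt;
</pre>

<p>Each field is one slot to be filled, which is exactly where the
interrogation feel comes from.</p>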
<h3>Context feedback</h3>

<p>The systems in research right now, like the SLS group's Jupiter system,
which you can ask about the weather, and its Mercury system, which can
arrange a trip for you, are much more sophisticated. They keep track of the
context, of which times and places a user is thinking of. They seem to be
happy both gently leading the caller with questions and being interrogated
themselves.</p>
<p>Here is one example recorded with a random untrained caller who had been
given an access code. The things to watch for include the machine keeping
track of context, and falling back when one tack fails.</p>
<p>[<a href="mercury.wav">speech audio example of Mercury</a>]</p>
<p>Now I understand that when a machine tries to understand a fragment of
speech, or a partly formed sentence, or mumbled words, the actual decision
it makes about which word must have been said is affected by the context of
the conversation. This is normal also for people: it is just impossible to
extract the information from the noise without some clues. A person
sometimes misunderstands a word if he or she is thinking about the wrong
subject -- sometimes with amusing consequences, like a Freudian slip in
reverse. So this means that speech systems become complex many-layered
things in which the higher layers of abstraction feed down information about
what the person is likely to be saying. (Or maybe what we would like them to
be saying?) So I understand that this is the way speech recognition has to
work. But this architecture prevents the speech system from being separated
into two layers, a layer of speech to text and a layer of natural language
processing. It means that the simple speech architecture, in which
understanding is a one-way street from audio to syllables to words to
sentences to semantics to actions, breaks down.</p>
<p><img src="http://www.w3.org/TR/voice-intro/voice-intro-fig1.gif"
width="559" height="392" alt="block diagram for speech interface framework"
/></p>

<p><em>Figure 1. Speech Interface Framework, from <a
href="http://www.w3.org/TR/voice-intro/">Introduction to and Overview of W3C
Speech Interface Framework</a>. The one-way flow in the top half ignores
context information sent back to ASR, which complicates the
architecture.</em></p>
<p>One of the interesting parts of context feedback is when it is taken all
the way back to the user by an avatar. Human understanding of speech is very
much a two-way street: not only does a person ask questions of
clarification, as good speech dialog systems do today; a human also gives
low-level feedback, with the wrinkling of the forehead or the inclining or
nodding of the head, to indicate how well the understanding process is
going.</p>
<p>What are the effects of having this context feedback in the architecture?
One effect is that when a call is passed to a subsystem which deals with a
particular aspect of the transaction, or for that matter to a human being,
it is useful to pass the whole context. Instead of "Please get this person's
car plate", it is more like "Please take the car plate of a southern male,
who likes to spell out letters in the international radio alphabet, and is
involved in trying to pay his car tax on this vehicle for 2005, and is still
fairly patient. The plate numbers of two cars he has registered before are
in this window, and it's probably the top one."</p>
<p>Well, because a speech system is not an island.</p>

<p>In fact, these systems also have keyboards. They also have pens.</p>
<h2>Multimodal</h2>

<p>The big drive toward speech at the moment, it seems, is the cellphone
market. On mobile phones, speech is the dominant mode of communication.
While they have buttons and screens, they are rather small, and also people
tend to use phones when it would be even more dangerous to be looking at the
screen and using the buttons. However, a phone is in fact a device which
supports a lot more than voice: you can type, it has a camera, and it has a
screen. Meanwhile, the boundary between the concepts of "phone" and
"computer" is being pushed and challenged all the time by new forms of PDA.
The BlackBerry and the Sidekick are somewhere between computer and phone.
The PDA market is playing with all kinds of shapes. Computer LCDs are
getting large enough to make a separate TV screen redundant -- and they can
be easier to use and program, and accept many more formats, than typical DVD
players. PCs are coming out which look more like TVs. France Telecom now <a
href="http://www.rd.francetelecom.com/en/technologies/ddm200311/techfiche4.php">proposes</a>
TV over an ADSL (originally phone, now Internet) line. The television would
be delivered by IP. The Internet model is indeed that everything runs over
IP, and IP runs over everything. The result is that a platform which
embraces IP becomes open to the very rapid spread of new technologies. This
is very powerful. On my phone, for example, I didn't have an MP3 player --
so I downloaded a shareware one written by someone in Romania.</p>
<p>So in the future, we can expect phones, like TVs, to become
indistinguishable from small personal computers, and for there to be a very
wide range of different combinations of devices to suit all tastes and
situations.</p>
<h3>Device Independence</h3>

<p>In fact the ability to view the same information on different devices was
one of the earliest design principles of the web: Device Independence.
Whereas the initial cellphone architectures, such as the first WAP, tended
to be vertical stacks, and tended to give the phone carrier and the phone
supplier a monopoly channel of communication, the web architecture is that
any device should be able to access any resource. The first gives great
control and short-term profits to a single company; the second creates a
whole new world. This layering is essential to the independent strong
markets for devices, for communication and for content.</p>
<p>From the beginning, this device independence was a high priority -- you
may remember early web sites would foolishly announce that they were only
viewable by those with 800x600 pixel screens. The right thing to do was to
achieve device independence by separating the actual content of the data
from the form in which it happened to be presented. On screens this is done
with style sheets. Style sheets allow information to be authored once and
presented appropriately whatever size screen you have. Web sites which use
style sheets in this way would find that they were more accessible to people
using the new devices. Also, they would find that they were more accessible
to people with disabilities. W3C has a series of guidelines on how to make
your web site as accessible as possible to people who for one reason or
another don't use eyes or ears or hands in the same way that you might to
access your web site. So the principle of separation of form and content,
and that of device independence, are very important for the new world in
which we have such a diversity of gadgets.</p>
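<p>As a small sketch of that separation (the selectors and values here are
purely illustrative), one document can carry different presentation rules
for different kinds of device:</p>

<pre>
/* One document, several presentations: illustrative rules only. */
@media screen {
  body    { font-family: sans-serif; max-width: 40em; }
}
@media handheld {
  body    { font-size: small; }
  .navbar { display: none; }  /* drop decoration on tiny screens */
}
</pre>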
<p>However, this only allows for differences in size of screen. Yes, a blind
person can have a screen reader read a window - but that isn't a good speech
interface.</p>
<h3>GUI vs Conversation</h3>

<p>There is a much more fundamental difference between a conversational
interface and a window-based one. It was actually the conversational one
which came first for computers. For years, the standard way to communicate
with a computer was to type at a command prompt, for the computer to respond
in text and wait for you to type again. As you typed, you could list and
change the contents of various directories and files on your system. You'd
only see one at a time, and you'd build a mental image of the whole system
as a result of the conversation.</p>
<p>When the Xerox PARC machines and the Apple Lisa came out with a screen of
"folders", what was revolutionary was that you could see the state of the
things you were manipulating. The shared context - the nested structure of
folders and files, or the document you are editing - was displayed by the
computer and seen at each point by the user, so they were a shared
information space with a mutually agreed state. This was so much more
relaxing to use because you didn't have to remember where everything was:
you could see it at each point. That "wysiwyg" feature is something which
became essential for any usable computer system. (In fact I was amazed in
1990 that people would edit HTML in the raw source without wysiwyg
editors.)</p>
<p>Now, with speech, we are in the conversational model again. There is no
shared display of where we are. The person has to remember what it is that
the computer is thinking. The computer has to remember what it thought the
person was thinking. The work at SLS and the clip we heard seem to deal with
the conversational system quite effectively. So what's the problem?</p>
<p>The challenge in fact is that people won't be choosing one mode of
communication; they will be using them all at once. As we've seen, a device
will have many modes, and we have many devices. Already my laptop and phone
are becoming more aware of each other, and starting to use each other -- but
only a little. They are connected by Bluetooth - but why can't I use the
camera on my phone for a video chat on my PC? Why can't I use my PC's email
as a voicemail server and check my email from my phone while I drive in,
just as I check my voicemail? To get the most out of computer-human
communications, the system will use everything at once. If I call about the
weather and a screen is nearby, a map should come up. If I want to zoom in
on the map, I can say "Zoom in on Cambridge", or I can point at the map, or
I can use a gesture with a pen on the surface -- or I can type "Cambridge",
I can use the direction keys, or click with a mouse. Suddenly the pure
conversational model, which we can do quite well, is broken, and so is the
pure wysiwyg model. Impinging on the computer are spoken and typed words,
commands, gestures, handwriting, and so on. These may refer to things
discussed in the past, or to things being displayed. The context is partly
visible, partly not. The vocabulary is partly well-known clickstream, partly
English which we are learning to handle, and partly gestures for which we
really don't have a vocabulary, let alone a grammar. The speech recognition
system will be biasing its understanding of words as a function of where the
user's hands are, and what his stance is.</p>
<p>System integration is typically the hairiest part of a software
engineering project: gluing it all together. To glue together a multimedia
system which can deal with all the modes of communication at once will need
some kind of framework in which the very different types of system can
exchange state. Some of the state is hard (the time of the departing plane
-- well, the flight number at least!), some soft and fuzzy (the sort of time
the user was thinking of leaving, the fact that we are talking travel rather
than accommodation at the moment). So speech technology will not be in a
vacuum. It will not only have to make great strides to work at all -- it
will have to integrate in real time with a host of other very different
technologies.</p>
<h3>Back end</h3>

<p>I understand that there are a number of people here involved in call
center phone tree systems. I will not hold you personally responsible for
all the time I spend with these systems -- in fact, I know that speech
technology will actually shorten the amount of time I spend on the phone. I
won't even demand you fix my washing machine.</p>
<p>But while we are here, let me give you one peeve. I speak, I suspect, for
millions when I say this. I am prepared to type in my account number, or
even sometimes my social security number. I am happy, probably happier, to
speak it carefully out loud. However, once I have told your company what my
account number is, I never ever on the same call want to have to tell you
again. This may seem peevish, but sometimes the user experience has been
optimized within a small single region and yet, as a whole, on the large
scale, is a complete mess. Sometimes it is little things. Track who I am as
you pass me between departments. Don't authenticate me with credit card
number and zipcode before telling me your office is closed at weekends. Try
to keep your left hand aware of what the right hand is doing.</p>
<p>Actually, I know that this is a difficult problem. When I applied to have
my green card extended, I first filed the application electronically, then I
went to the office to be photographed and fingerprinted again, and I noticed
that not only did each of the three people I talked to type in my
application number, but they also typed in all my personal details. Why?
Because they were different systems. When I talk to CIOs across pretty much
any industry, I keep hearing the same problem - the stovepipe problem.
Different parts of the company, the organization, the agency, have related
data in different systems. You can't integrate them all, but you need to be
able to connect them. The problem is one of integrating data between systems
which have been designed quite independently in the past, and are maintained
by different groups which don't necessarily trust or understand each other.
I mention this because this is the problem the semantic web addresses. The
semantic web standards, RDF and OWL, also W3C Recommendations, are all about
describing your data, exporting it into a common format, and then explaining
to the world of machines how the different datasets are actually
interconnected in what they are about, even if they were not physically
interconnected. The Semantic Web, when you take it from an enterprise tool
to a global system, actually becomes a really powerful global system, a sort
of global interconnection bus for data. Why do I talk about this? Because
the semantic web is something people are trying to understand nowadays.
Because it provides a unified view of the data side of your organization, it
is important when we think about how speech ties in with the rest of the
system. And that tying in is very important.</p>
<h3>Semantic Web explanation</h3>

<p>When you use speech grammars and VoiceXML, you are describing possible
speech conversations. When you use XML schema, you are describing documents.
RDF is different. When you use RDF and OWL, you are talking about real
things. Not a conversation about a car, or a car licence plate renewal form,
but a car.</p>
<p>The fact that a form has one value for a plate number will pass with the
form. The fact that a car has one unique plate number is very useful to know
- it constrains the form, and the speech grammars. It allows a machine to
know that two cars in different databases are the same car.</p>
<p>Because this information is about real things, it is much more reusable.
Speech apps will be replaced. Application forms will be revised, much more
often than a car changes its nature. The general properties of a car, or a
product of your company, of real things, change rarely. They are useful to
many applications. This background information is called the
<em>ontology</em>, and OWL is the language it is written in.</p>
<p>And data written in RDF labels fields not just with tag names, but with
URIs. This means that each concept can be allocated its own identifier
without clashing with someone else's. It also means that when you get some
semantic web data, anyone or anything can go look up the terms on the web,
and get information about them. Car is a subclass of vehicle.</p>
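<p>As a small sketch (the example.org namespace and property name are of
course made up for illustration), those background statements might be
written like this in RDF/XML and OWL:</p>

<pre>
&lt;rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"&gt;

  &lt;!-- "Car is a subclass of vehicle", stated where any machine can look it up --&gt;
  &lt;owl:Class rdf:about="http://example.org/vehicles#Car"&gt;
    &lt;rdfs:subClassOf rdf:resource="http://example.org/vehicles#Vehicle"/&gt;
  &lt;/owl:Class&gt;

  &lt;!-- Each car has at most one plate number --&gt;
  &lt;owl:FunctionalProperty rdf:about="http://example.org/vehicles#plateNumber"&gt;
    &lt;rdfs:domain rdf:resource="http://example.org/vehicles#Car"/&gt;
  &lt;/owl:FunctionalProperty&gt;
&lt;/rdf:RDF&gt;
</pre>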
<p>It is no use having a wonderful conversation with a computer about the
sort of vacation you would like to have, if at the end of the day you don't
have a very well-defined dataset with precise details of the flights,
hotels, cars and shows which that would involve. Data which can be treated
and understood by all the different programs which will be involved in
bringing that vacation into existence. There is a working draft, <a
href="http://www.w3.org/TR/semantic-interpretation/">Semantic Interpretation
for Speech Recognition</a>, which is in this area, although it does not
ground the data in the semantic web.</p>
<h3>Closing the loop</h3>

<p>At the moment speech technology is concentrated in business-to-consumer
(B2C) applications, where it seems the only job is to get the data to the
back-end. But I'd like to raise the bar higher. When I as a consumer have
finished a conversation and committed to buying something, I'd like my own
computer to get a document it can process with all the details. My computer
ought to be able to connect it with the credit card transaction, and tax
forms, expense returns and so on. This means we need a common standard for
the data. The semantic web technology gives us RDF as a base language for
this, and I hope that each industry will convert or develop the terms which
are useful for describing products in their own area.</p>
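<p>Very roughly, such a document might look like the following -- the
vocabulary here is entirely made up for illustration; the point is only that
the fields are labelled with URIs a program can look up:</p>

<pre>
&lt;rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:trip="http://example.org/travel-terms#"&gt;
  &lt;trip:FlightBooking rdf:about="http://example.org/bookings/12345"&gt;
    &lt;trip:flightNumber&gt;XX123&lt;/trip:flightNumber&gt;
    &lt;trip:departureDate&gt;2004-09-21&lt;/trip:departureDate&gt;
    &lt;trip:totalPrice rdf:datatype="http://www.w3.org/2001/XMLSchema#decimal"&gt;310.00&lt;/trip:totalPrice&gt;
  &lt;/trip:FlightBooking&gt;
&lt;/rdf:RDF&gt;
</pre>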
<p>In fact, the development of ontologies could be a great help in
developing speech applications. The ontology is the modeling of the real
objects in question -- rental cars, flights and so on, and their properties
-- number of seats, departure times and so on. This structure is the base of
understanding of speech about these subjects. It needs a lot of added
information about the colloquial ways of talking about such things. So far
I've discussed the run-time problem -- how a computer can interact with a
person. But in fact limiting factors can also be the problems designers have
creating all the dialogs and scripts and so on which it takes to put
together a new application. In fact the amount of effort which goes into a
good speech system is very great. So technology which makes things easier
for application designers can also be a gating factor on deployment.</p>
<h3>Conclusion</h3>

<p>The picture I end up with when I try to think of the speech system of the
future is a web. Well, maybe I think of everything as a web. In this case, I
think of a web of concepts, connected to words and phrases, connected to
pronunciation, connected to phrases and dialog fragments. I see also icons
and style sheets for physical display, and I see the sensors that the
computer has trained on the person connected to layers of recognition
systems which, while feeding data from the person, are immersed in a reverse
stream of context which directs them as to what they should be looking
for.</p>
<p>Speech communication by computers has always been one of those things
which turned out to be more difficult than it seemed at first -- through
five decades.</p>
<p>It happens that as I was tidying the house the other day I came across a
bunch of Isaac Asimov books, and got distracted by a couple of stories from
<em>Earth is Room Enough</em>. In most Asimov stories, computers either
communicate very obscurely using teletypes, or they have flawless speech. He
obviously thought that speech would happen, but I haven't found any stories
about the transition time we are in now. The short story <em>Someday</em> is
one of the ones set in the post-speech era. At one point the young Paul is
telling his friend Niccolo how he discovered all kinds of ancient computers
-- and these squiggly things (characters) which people had to use to
communicate with them.</p>
<blockquote>
<p>"Each different squiggle stood for a different number. For 'one', you
made a kind of mark, for 'two' you make another kind of mark, for 'three'
another one and so on."</p>

<p>"What for?"</p>

<p>"So you could compute"</p>

<p>"What <em>for?</em> You just tell the computer---"</p>

<p>"Jiminy", cried Paul, his face twisting in anger, "can't you get it
through your head? These slide rules and things didn't <em>talk</em>."</p>
</blockquote>
<p>So Asimov certainly imagined we'd get computers chatting seamlessly, and
the goal seems, while a long way off now, attainable in the long run.
Meanwhile, we have sound technology for voice dialogs which has developed
past prototypes to the level of standards. The important thing for users is
to realize what is possible and what isn't, as it is easy to expect the
world and be disappointed, but also a mistake not to realize that here is a
very usable technology which will save a lot of time and money. And please
remember that when you think about saving time, it's not just your call
center staff time, it is the user's time. It may not show up directly on
your spreadsheet, but it will show up indirectly if frustration levels cause
users to switch. So use this conference to find out what's happening, and
remember to check about standards conformance.</p>
<p>In the future, integration of speech with other media, and with the
semantic web for the data, will be a major challenge, but it will be
necessary before the technology can be used to its utmost.</p>
<hr />
</body>
</html>