Speech Application Programming Interface

The Speech Application Programming Interface or SAPI is an API developed by Microsoft to allow the use of speech recognition and speech synthesis within Windows applications. To date, a number of versions of the API have been released, which have shipped either as part of a Speech SDK, or as part of the Windows OS itself. Applications that use SAPI include Microsoft Office, Microsoft Agent and Microsoft Speech Server.

In general all versions of the API have been designed such that a software developer can write an application to perform speech recognition and synthesis by using a standard set of interfaces, accessible from a variety of programming languages. In addition, it is possible for a third-party company to produce its own Speech Recognition and Text-To-Speech engines or adapt existing engines to work with SAPI. In principle, as long as these engines conform to the defined interfaces they can be used instead of the Microsoft-supplied engines.

In general the Speech API is a freely-redistributable component which can be shipped with any Windows application that wishes to use speech technology. Many versions (although not all) of the speech recognition and synthesis engines are also freely redistributable.

There have been two main 'families' of the Microsoft Speech API. SAPI versions 1 through 4 are all similar to each other, with extra features in each newer version. SAPI 5 however was a completely new interface, released in 2000. Since then several sub-versions of this API have been released.

Basic architecture

Broadly the Speech API can be viewed as an interface or piece of middleware which sits between applications and speech engines (recognition and synthesis). In SAPI versions 1 to 4, applications could directly communicate with engines. The API included an abstract interface definition which applications and engines conformed to. Applications could also use simplified higher-level objects rather than directly call methods on the engines.

In SAPI 5 however, applications and engines do not directly communicate with each other. Instead, each talks to a runtime component (sapi.dll). There is an API implemented by this component which applications use, and another set of interfaces for engines.

Typically in SAPI 5 applications issue calls through the API (for example to load a recognition grammar; start recognition; or provide text to be synthesized). The sapi.dll runtime component interprets these commands and processes them, where necessary calling on the engine through the engine interfaces (for example, the loading of a grammar from a file is done in the runtime, but then the grammar data is passed to the recognition engine to actually use in recognition). The recognition and synthesis engines also generate events while processing (for example, to indicate an utterance has been recognized or to indicate word boundaries in the synthesized speech). These pass in the reverse direction, from the engines, through the runtime dll, and on to an event sink in the application.
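The call and event flow described above can be illustrated with a small toy model. This is only a sketch of the layering idea, not the SAPI interfaces themselves: the class names ToyEngine and ToyRuntime, and the event names word_boundary and end_of_stream, are hypothetical and chosen for illustration.

```python
from typing import Callable, List, Tuple

Event = Tuple[str, str]  # (event kind, event data); hypothetical shape


class ToyEngine:
    """Stand-in for a synthesis engine. It never sees the application."""

    def synthesize(self, text: str, emit: Callable[[str, str], None]) -> None:
        # Raise one event per word while "processing", then a final event.
        for word in text.split():
            emit("word_boundary", word)
        emit("end_of_stream", "")


class ToyRuntime:
    """Stand-in for the runtime layer that sits between the two sides."""

    def __init__(self, engine: ToyEngine) -> None:
        self._engine = engine
        self._sinks: List[Callable[[str, str], None]] = []

    def add_event_sink(self, sink: Callable[[str, str], None]) -> None:
        self._sinks.append(sink)

    def speak(self, text: str) -> None:
        # The runtime handles the application's call and only then
        # passes work to the engine through the engine-side interface.
        self._engine.synthesize(text.strip(), self._dispatch)

    def _dispatch(self, kind: str, data: str) -> None:
        # Events travel in the reverse direction: engine -> runtime -> sinks.
        for sink in self._sinks:
            sink(kind, data)


received: List[Event] = []
runtime = ToyRuntime(ToyEngine())
runtime.add_event_sink(lambda kind, data: received.append((kind, data)))
runtime.speak("hello speech world")
```

In this sketch the application holds a reference only to the runtime object, so swapping in a different engine would not require any change to application code, which is the property the SAPI 5 design aims for.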

In addition to the actual API definition and runtime dll, other components are shipped with all versions of SAPI to make a complete Speech Software Development Kit. The following components are among those included in most versions of the Speech SDK:

API definition files - in MIDL and as C or C++ header files.
Runtime components - e.g. sapi.dll.
Control Panel applet - to select and configure default speech recognizer and synthesizer.
Text-To-Speech engines in multiple languages.
Speech Recognition engines in multiple languages.
Redistributable components to allow developers to package the engines and runtime with their application code to produce a single installable application.
Sample application code.
Sample engines - implementations of the necessary engine interfaces but with no true speech processing which could be used as a sample for those porting an engine to SAPI.
Documentation.

SAPI 1

The first version of SAPI was released in 1995, and was supported on Windows 95 and Windows NT 3.51. This version included low-level Direct Speech Recognition and Direct Text To Speech APIs which applications could use to directly control engines, as well as simplified 'higher-level' Voice Command and Voice Talk APIs.

SAPI 3

SAPI 3.0 was released in 1997. It added limited support for dictation speech recognition (discrete speech, not continuous), and additional sample applications and audio sources.

SAPI 4

SAPI 4.0 was released in 1998. This version of SAPI included the core COM API, together with C++ wrapper classes to make programming from C++ easier, and ActiveX controls to allow drag-and-drop Visual Basic development. This was shipped as part of an SDK that included recognition and synthesis engines. It also shipped (with synthesis engines only) in Windows 2000.

The main components of the SAPI 4 API (which were all available in C++, COM, and ActiveX flavors) were:

Voice Command - high-level objects for command & control speech recognition
Voice Dictation - high-level objects for continuous dictation speech recognition
Voice Talk - high-level objects for speech synthesis
Voice Telephony - objects for writing telephone speech applications
Direct Speech Recognition - objects for direct control of recognition engine
Direct Text To Speech - objects for direct control of synthesis engine
Audio objects - for reading to and from an audio device or file

SAPI 5 API family

The Speech SDK version 5.0, incorporating the SAPI 5.0 runtime, was released in 2000. This was a complete redesign from previous versions, and neither engines nor applications which used older versions of SAPI could use the new version without considerable modification.

The design of the new API included the concept of strictly separating the application and engine so all calls were routed through the runtime sapi.dll. This change was intended to make the API more 'engine-independent', preventing applications from inadvertently depending on features of a specific engine. In addition this change was aimed at making it much easier to incorporate speech technology into an application by moving some management and initialization code into the runtime.

The new API was initially a pure COM API and could be used easily only from C/C++. Support for VB and scripting languages was added later. Operating systems from Windows 98 and NT 4.0 upwards were supported.

Major features of the API include:

Shared Recognizer. For desktop speech recognition applications, a recognizer object can be used that runs in a separate process (sapisvr.exe). All applications using the shared recognizer communicate with this single instance. This allows sharing of resources, removes contention for the microphone and allows for a global UI for control of all speech applications.
In-proc recognizer. For applications that require explicit control of the recognition process the in-proc recognizer object can be used instead of the shared one.
Grammar objects. Speech grammars are used to specify the words that the recognizer is listening for. SAPI 5 defines an XML markup for specifying a grammar, as well as mechanisms to create them dynamically in code. Methods also exist for instructing the recognizer to load a built-in dictation language model.
Voice object. This performs speech synthesis, producing an audio stream from text. A markup language (similar to XML, but not strictly XML) can be used for controlling the synthesis process.
Audio interfaces. The runtime includes objects for performing speech input from the microphone or speech output to speakers (or any sound device); as well as to and from wave files. It is also possible to write a custom audio object to stream audio to or from a non-standard location.
User lexicon object. This allows custom words and pronunciations to be added by a user or application. These are added to the recognition or synthesis engine's built-in lexicons.
Object tokens. This is a concept allowing recognition and TTS engines, audio objects, lexicons and other categories of object to be registered, enumerated and instantiated in a common way.

SAPI 5.0

This version shipped in late 2000 as part of the Speech SDK version 5.0, together with version 5.0 recognition and synthesis engines. The recognition engines supported continuous dictation and command & control and were released in U.S. English, Japanese and Simplified Chinese versions. In the U.S. English system, special acoustic models were available for children's speech and telephony speech. The synthesis engine was available in English and Chinese. This version of the API and recognition engines also shipped in Microsoft Office XP in 2001.

SAPI 5.1

This version shipped in late 2001 as part of the Speech SDK version 5.1. Automation-compliant interfaces were added to the API to allow use from Visual Basic, scripting languages such as JScript, and managed code. This version of the API and TTS engines was shipped in Windows XP. Windows XP Tablet PC Edition and Office 2003 also include this version, but with a substantially improved version 6 recognition engine and Traditional Chinese.
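As an illustration of what the automation-compliant interfaces make possible, the sketch below drives the SAPI voice object from Python through COM automation. It assumes a Windows machine with a SAPI 5 voice installed and the third-party pywin32 package (which provides win32com); neither is mentioned in the text above. The helper name speak is hypothetical. On systems where these assumptions do not hold, the function simply reports failure.

```python
def speak(text: str) -> bool:
    """Speak `text` via the SAPI automation voice object.

    Returns True if the call succeeded, False if SAPI automation
    is not available (for example, on a non-Windows system).
    """
    try:
        # Assumption: pywin32 is installed. It is not part of SAPI.
        import win32com.client
    except ImportError:
        return False
    try:
        # "SAPI.SpVoice" is the ProgID of the automation voice object.
        voice = win32com.client.Dispatch("SAPI.SpVoice")
        voice.Speak(text)
    except Exception:
        # No voice installed, COM failure, and so on.
        return False
    return True


if __name__ == "__main__":
    if speak("Hello from the Speech API"):
        print("spoken")
    else:
        print("SAPI automation not available")
```

The same object can be created from any automation-capable language, which is the point of the 5.1 additions; Python is used here only as one such client.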

SAPI 5.2

This was a special version of the API for use only in the Microsoft Speech Server, which shipped in 2004. It added support for SRGS and SSML mark-up languages, as well as additional server features and performance improvements. The Speech Server also shipped with the version 6 desktop recognition engine and the version 7 server recognition engine.

SAPI 5.3

This is the version of the API that ships in Windows Vista together with new recognition and synthesis engines. As Windows Speech Recognition is now integrated into the operating system, the Speech SDK and APIs are a part of the Windows SDK. SAPI 5.3 includes the following new features:

Support for W3C XML speech grammars for recognition and synthesis. The Speech Synthesis Markup Language (SSML) version 1.0 provides the ability to mark up voice characteristics, speed, volume, pitch, emphasis, and pronunciation.
The Speech Recognition Grammar Specification (SRGS) supports the definition of context-free grammars, with two limitations:
- It does not support the use of SRGS to specify dual-tone multi-frequency (DTMF, touch-tone) grammars.
- It does not support Augmented Backus–Naur form (ABNF).
Support for semantic interpretation script within grammars. SAPI 5.3 enables an SRGS grammar to be annotated with JavaScript for semantic interpretation to supplement the recognized text.
User-Specified shortcuts in lexicons, which is the ability to add a string to the lexicon and associate it with a shortcut word. When dictating, the user can say the shortcut word and the recognizer will return the expanded string.
Additional functionality and ease-of-programming provided by new types.
Performance improvements, improved reliability and security.
Version 8 of the speech recognition engine ("Microsoft Speech Recognizer")
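To make the two W3C formats mentioned in this list more concrete, the following sketch embeds a minimal SRGS XML grammar and a minimal SSML 1.0 document in Python strings and parses them with the standard library. The grammar and text content are invented for illustration, and the helper names phrases and prosody_settings are hypothetical. The code only checks that the markup is well-formed and extracts a few values; it does not load anything into a SAPI recognizer or voice.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"
SSML_NS = "http://www.w3.org/2001/10/synthesis"

# A context-free grammar in SRGS XML form: one rule listing three words.
SRGS_GRAMMAR = """<grammar version="1.0" xml:lang="en-US" root="color"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="color">
    <one-of>
      <item>red</item>
      <item>green</item>
      <item>blue</item>
    </one-of>
  </rule>
</grammar>"""

# SSML marking up emphasis plus speed, volume and pitch.
SSML_TEXT = """<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  This is <emphasis>important</emphasis>.
  <prosody rate="slow" volume="loud" pitch="high">Spoken slowly.</prosody>
</speak>"""


def phrases(grammar_xml: str) -> list:
    """Return the text of every <item> element in an SRGS grammar."""
    root = ET.fromstring(grammar_xml)
    return [item.text for item in root.iter(f"{{{SRGS_NS}}}item")]


def prosody_settings(ssml_xml: str) -> dict:
    """Return the attributes of the first top-level <prosody> element."""
    root = ET.fromstring(ssml_xml)
    node = root.find(f"{{{SSML_NS}}}prosody")
    return dict(node.attrib) if node is not None else {}
```

Calling phrases(SRGS_GRAMMAR) yields the words the grammar would let a recognizer listen for, and prosody_settings(SSML_TEXT) yields the speed, volume and pitch settings that a synthesizer would be asked to apply.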

SAPI 5 Voices

Microsoft Sam (Speech Articulation Module) is a commonly-shipped SAPI 5 voice. In addition, Microsoft Office XP and Office 2003 installed L&H Michael and Michelle voices. The SAPI 5.1 SDK installs two more voices, Mike and Mary. Windows Vista includes Microsoft Anna, which replaces Microsoft Sam. Anna is designed to sound more natural and offer greater intelligibility. The Chinese version of Windows Vista and later Windows client versions also include a female voice named Microsoft Lili. Microsoft Anna is also installed on Windows XP by Microsoft Streets & Trips 2006 and later versions.

Managed code Speech API

A managed code API ships as part of the .NET Framework 3.0. It has similar functionality to SAPI 5 but is more suitable to be used by managed code applications. The new API is available on Windows XP, Windows Server 2003, Windows Vista, and Windows Server 2008.

The existing SAPI 5 API can also be used from managed code to a limited extent by creating COM Interop code (helper code designed to assist in accessing COM interfaces and classes). This works well in some scenarios; however, the new API should provide a more seamless experience, equivalent to using any other managed code library.

Speech functionality in Windows Vista

Windows Vista includes a number of new speech-related features including:

Speech control of the full Windows GUI and applications
New tutorial, microphone wizard, and UI for controlling speech recognition
New version of the Speech API runtime: SAPI 5.3
Built-in updated Speech Recognition engine (Version 8)
New Speech Synthesis engine and SAPI voice Microsoft Anna
Managed code speech API (codenamed SpeechFX)
Speech recognition support for 8 languages at release time: U.S. English, U.K. English, traditional Chinese, simplified Chinese, Japanese, German, French and Spanish, with more languages to be released later.
Microsoft Agent most notably, and all other Microsoft speech applications, use SAPI 5.

SAPI 5

Microsoft Windows 7
Microsoft Windows Vista
Microsoft Windows 2003
Microsoft Windows XP
Microsoft Windows 2000

SAPI 4

Microsoft Windows Millennium Edition
Microsoft Windows 98
Microsoft Windows NT 4.0, Service Pack 6a, in English, Japanese and Simplified Chinese.

Major applications using SAPI

Microsoft Windows XP Tablet PC Edition includes SAPI 5.1 and speech recognition engines 6.1 for English, Japanese, and Chinese (simplified and traditional)
Windows Speech Recognition in Windows Vista
Microsoft Narrator in Windows 2000 and later Windows operating systems
Microsoft Office XP and Office 2003
Microsoft Excel 2002, Microsoft Excel 2003, and Microsoft Excel 2007 for speaking spreadsheet data
Microsoft Voice Command for Windows Pocket PC and Windows Mobile
Microsoft Plus! Voice Command for Windows Media Player
Dragon NaturallySpeaking, general-purpose speech recognition application
Adobe Reader uses voice output to read document content
CoolSpeech, text-to-speech application that reads text aloud from a variety of sources
Window-Eyes screen reader
JAWS screen reader
NVDA open-source screen reader http://www.nvda-project.org/
Klango player, client software for the Klango.net social network http://Klango.net
Multi Crew Experience, voice control for Microsoft Flight Simulator http://www.multicrewxp.com

Libraries using SAPI output

FastFormat, via its speech_sink
Pantheios, via its be.speech back-end

External links

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
