Speech synthesis



 
 








Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech.

Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output.

The quality of a speech synthesizer is judged by its similarity to the human voice, and by its ability to be understood. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer. Many computer operating systems have included speech synthesizers since the early 1980s.

Overview of text processing


A text-to-speech system (or "engine") is composed of two parts: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end, often referred to as the synthesizer, then converts the symbolic linguistic representation into sound.
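
The two front-end tasks described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tiny abbreviation table, the number reader (which handles 0 to 99 and falls back to reading digits one by one), and the two-entry pronunciation lexicon with ARPAbet-style symbols. A real front-end would rely on much larger resources and on context, for example to decide whether "Dr." means "doctor" or "drive".

```python
# Hypothetical lookup tables, far smaller than anything a real system would use.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "no.": "number"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
# Illustrative pronunciation lexicon (ARPAbet-style symbols, assumed).
LEXICON = {"doctor": ["D", "AA", "K", "T", "ER"], "two": ["T", "UW"]}


def number_to_words(n):
    """Spell out a non-negative integer; values of 100 or more are read digit by digit."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]
    return " ".join(ONES[int(digit)] for digit in str(n))


def normalize(text):
    """Text normalization: expand abbreviations and numbers into written-out words."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
            continue
        token = token.strip(".,!?;:")
        if token.isdigit():
            words.extend(number_to_words(int(token)).split())
        elif token:
            words.append(token)
    return words


def front_end(text):
    """Pair each normalized word with a transcription, or None if it is not in the lexicon."""
    return [(word, LEXICON.get(word)) for word in normalize(text)]


print(normalize("Dr. Smith lives at 42 Elm St."))
```

Words missing from the lexicon come back with None here; a fuller system would fall back to letter-to-sound rules at that point, and would also mark prosodic units, which this sketch leaves out.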

History


Long before electronic signal processing was invented, there were those who tried to build machines to create human speech. Some early legends of the existence of "speaking heads" involved Gerbert of Aurillac (d. 1003 AD), Albertus Magnus (1198–1280), and Roger Bacon (1214–1294).

In 1779, the Danish scientist Christian Kratzenstein, working at the Russian Academy of Sciences, built models of the human vocal tract that could produce the five long vowel sounds (in International Phonetic Alphabet notation, they are [aː], [eː], [iː], [oː] and [uː]). This was followed by the bellows-operated "acoustic-mechanical speech machine" by Wolfgang von Kempelen of Vienna, Austria, described in a 1791 paper. This machine added models of the tongue and lips, enabling it to produce consonants as well as vowels. In 1837, Charles Wheatstone produced a "speaking machine" based on von Kempelen's design, and in 1857, M. Faber built the "Euphonia". Wheatstone's design was resurrected in 1923 by Paget.

In the 1930s, Bell Labs developed the VOCODER, a keyboard-operated electronic speech analyzer and synthesizer that was said to be clearly intelligible. Homer Dudley refined this device into the VODER, which he exhibited at the 1939 New York World's Fair.

The Pattern playback was built by Dr. Franklin S. Cooper and his colleagues at Haskins Laboratories in the late 1940s and completed in 1950. There were several different versions of this hardware device, but only one currently survives. The machine converts pictures of the acoustic patterns of speech in the form of a spectrogram back into sound. Using this device, Alvin Liberman and colleagues were able to discover acoustic cues for the perception of phonetic segments (consonants and vowels).

Early electronic speech synthesizers sounded robotic and were often barely intelligible. However, the quality of synthesized speech has steadily improved, and output from contemporary speech synthesis systems is sometimes indistinguishable from actual human speech.

Electronic devices


The first computer-based speech synthesis systems were created in the late 1950s, and the first complete text-to-speech system was completed in 1968. In 1961, physicist John Larry Kelly, Jr. and colleague Louis Gerstman used an IBM 704 computer to synthesize speech, an event among the most prominent in the history of Bell Labs. Kelly's voice recorder synthesizer (vocoder) recreated the song "Daisy Bell", with musical accompaniment from Max Mathews. Coincidentally, Arthur C. Clarke was visiting his friend and colleague John Pierce at the Bell Labs Murray Hill facility. Clarke was so impressed by the demonstration that he used it in the climactic scene of his screenplay for his novel 2001: A Space Odyssey, where the HAL 9000 computer sings the same song as it is being put to sleep by astronaut Dave Bowman. Despite the success of purely electronic speech synthesis, research is still being conducted into mechanical speech synthesizers.

Synthesizer technologies


The most important qualities of a speech synthesis system are naturalness and intelligibility. Naturalness describes how closely the output sounds like human speech, while intelligibility is the ease with which the output is understood. The ideal speech synthesizer is both natural and intelligible. Speech synthesis systems usually try to maximize both characteristics.

The two primary technologies for generating synthetic speech waveforms are concatenative synthesis and formant synthesis. Each technology has strengths and weaknesses, and the intended uses of a synthesis system will typically determine which approach is used.

Concatenative synthesis


Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. There are three main sub-types of concatenative synthesis.

Unit selection synthesis

Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. Typically, the division into segments is done using a specially modified speech recognizer set to a "forced alignment" mode with some manual correction afterward, using visual representations such as the waveform and spectrogram. An index of the units in the speech database is then created based on the segmentation and acoustic parameters like the fundamental frequency (pitch), duration, position in the syllable, and neighboring phones. At runtime, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically achieved using a specially weighted decision tree.
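
To make "determining the best chain of candidate units" concrete, here is a minimal sketch. It does not use the weighted decision tree mentioned above; it swaps in a different, commonly described formulation in which each candidate gets a target cost (how well it matches the requested pitch and duration) and each adjacent pair gets a join cost, and dynamic programming finds the cheapest chain. The `Unit` fields, both cost functions, and their weights are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded segment in a hypothetical speech database."""
    phone: str
    pitch_hz: float
    duration_ms: float


def target_cost(unit, want_pitch, want_duration):
    # Relative mismatch between a candidate and the requested target (assumed weighting).
    return (abs(unit.pitch_hz - want_pitch) / want_pitch
            + abs(unit.duration_ms - want_duration) / want_duration)


def join_cost(left, right):
    # Penalize pitch jumps at the concatenation point (assumed weighting).
    return abs(left.pitch_hz - right.pitch_hz) / 100.0


def select_units(targets, database):
    """Return the cheapest chain of units for targets given as (phone, pitch, duration)."""
    if not targets:
        return []
    layers = []
    for phone, pitch, duration in targets:
        candidates = [u for u in database if u.phone == phone]
        if not candidates:
            raise KeyError(f"no recorded unit for phone {phone!r}")
        layers.append([(u, target_cost(u, pitch, duration)) for u in candidates])

    # costs[j]: cheapest total cost of any chain ending in candidate j of the current layer.
    costs = [tc for _, tc in layers[0]]
    backpointers = []
    for prev, cur in zip(layers, layers[1:]):
        new_costs, pointers = [], []
        for unit, tc in cur:
            best = min(range(len(prev)),
                       key=lambda j: costs[j] + join_cost(prev[j][0], unit))
            new_costs.append(costs[best] + join_cost(prev[best][0], unit) + tc)
            pointers.append(best)
        costs = new_costs
        backpointers.append(pointers)

    # Trace the best chain backwards from the cheapest final candidate.
    j = min(range(len(costs)), key=costs.__getitem__)
    chain = [layers[-1][j][0]]
    for i in range(len(backpointers) - 1, -1, -1):
        j = backpointers[i][j]
        chain.append(layers[i][j][0])
    chain.reverse()
    return chain
```

For example, with a database holding two recordings each of "h" and "ay" at low and high pitch, asking for both phones at 120 Hz picks the two low-pitched units, since they match the target and join with only a small pitch jump.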

Unit selection provides the greatest naturalness, because it applies only a small amount of digital signal processing (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into the gigabytes of recorded data, representing dozens of hours of speech. Also, unit selection algorithms have been known to select segments from a place that results in less than ideal synthesis (e.g. minor words become unclear) even when a better choice exists in the database.
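
One simple form of smoothing at the point of concatenation is a short linear crossfade between the end of one segment and the start of the next. The sketch below is only one possible approach, assumed here for illustration; the text does not say which smoothing method real systems use. Segments are plain lists of samples and `overlap` is the number of samples blended.

```python
def crossfade_concat(left, right, overlap):
    """Join two sample lists, linearly blending `overlap` samples across the seam."""
    overlap = max(0, min(overlap, len(left), len(right)))
    if overlap == 0:
        return list(left) + list(right)
    start = len(left) - overlap
    blended = []
    for i in range(overlap):
        weight = (i + 1) / (overlap + 1)  # rises from near 0 to near 1 across the seam
        blended.append((1.0 - weight) * left[start + i] + weight * right[i])
    return list(left[:start]) + blended + list(right[overlap:])
```

The result is shorter than the two inputs combined by `overlap` samples, because the blended region replaces the tail of the first segment and the head of the second.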

Diphone synthesis

Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a language. The number of diphones depends on the phonotactics of the language: for example, Spanish has about 800 diphones, and German about 2500. In diphone synthesis, only one example of each diphone is contained in the speech database. At runtime, the target prosody of a sentence is superimposed on these minimal units by means of digital signal processing techniques such as linear predictive coding, PSOLA or MBROLA. The quality of the resulting speech is generally worse than that of unit-selection systems, but more natural-sounding than the output of formant synthesizers. Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages of either approach other than small size. As such, its use in commercial applications is declining, although it continues to be used in research because there are a number of freely available software implementations.
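
As a small illustration of how a phone sequence maps onto diphone units, the sketch below pairs each phone with its neighbor and pads the utterance with a silence symbol at both ends. The `_` boundary marker, the `a-b` naming scheme, and the helper names are assumptions made for this example, not conventions taken from any particular synthesizer.

```python
def to_diphones(phones, boundary="_"):
    """List the diphones (adjacent phone pairs) needed to speak a phone sequence."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def missing_diphones(phones, inventory):
    """Report which required diphones are absent from a recorded inventory."""
    return [d for d in to_diphones(phones) if d not in inventory]


print(to_diphones(["h", "e", "l", "o"]))
```

A sequence of n phones needs n + 1 diphones with this padding. A language with N phones has at most N squared phone-to-phone pairs, and phonotactic restrictions rule many of them out, which is why the inventory size varies from language to language.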

Domain-specific synthesis

Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, like transit schedule announcements or weather reports. The technology is very simple to implement, and has been in commercial use for a long time, in devices like talking clocks and calculators. The level of naturalness of these systems can be very high because the variety of sentence types is limited, and they closely match the prosody and intonation of the original recordings.

Because these systems are limited by the words and phrases in their databases, they are not general-purpose and can only synthesize the combinations of words and phrases with which they have been preprogrammed. The blending of words within naturally spoken language, however, can still cause problems unless the many variations are taken into account. For example, in non-rhotic dialects of English the "r" in words like "clear" is usually only pronounced when the following word has a vowel as its first letter (e.g. in "clear out" the "r" is pronounced). Likewise in French, many final consonants are no longer silent if followed by a word that begins with a vowel, an effect called liaison. This alternation cannot be reproduced by a simple word-concatenation system, which would require additional complexity to be context-sensitive.
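
The extra complexity can be as small as storing more than one recording of a word and choosing between them by looking at the next word. The sketch below follows the simplified rule stated above (a following word whose first letter is a vowel); the file names, the `+r` key convention, and the recording table are hypothetical.

```python
VOWEL_LETTERS = set("aeiou")

# Hypothetical recordings; "clear+r" is a second take with the final "r" pronounced.
RECORDINGS = {
    "clear": "clear.wav",
    "clear+r": "clear_linking_r.wav",
    "out": "out.wav",
    "skies": "skies.wav",
}


def pick_recordings(words):
    """Choose one recording per word, using the linking-r variant before a vowel letter."""
    files = []
    for i, word in enumerate(words):
        next_word = words[i + 1] if i + 1 < len(words) else ""
        key = word
        if word + "+r" in RECORDINGS and next_word[:1] in VOWEL_LETTERS:
            key = word + "+r"
        files.append(RECORDINGS[key])
    return files


print(pick_recordings(["clear", "out"]))
print(pick_recordings(["clear", "skies"]))
```

A word with no recording raises a KeyError here, which mirrors the limitation described above: the system can only say what has been recorded for it.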

Formant synthesis


Formant synthesis does not use human speech samples at runtime. Instead, the synthesized speech output is created using an acoustic model. Parameters such as fundamental frequency, voicing, and noise levels are varied over time to create a waveform of artificial speech. This method is sometimes called rules-based synthesis; however, many concatenative systems also have rules-based components.
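
A minimal sketch of the idea, under several assumptions: the voiced source is a plain impulse train at the fundamental frequency, each formant is modeled by a second-order digital resonator, the resonators are applied in series, and the parameters are held constant rather than varied over time. The formant frequencies and bandwidths in the usage line are rough illustrative values for an "ah"-like vowel, not measurements.

```python
import math


def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2] for one formant resonance."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # gives the filter unit gain at 0 Hz
    return a, b, c


def resonate(signal, freq_hz, bandwidth_hz, sample_rate):
    """Pass a list of samples through one second-order resonator."""
    a, b, c = resonator_coeffs(freq_hz, bandwidth_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(f0_hz, formants, duration_s=0.3, sample_rate=16000):
    """Impulse-train source at f0_hz, filtered by (frequency, bandwidth) formant pairs."""
    n = int(duration_s * sample_rate)
    period = int(round(sample_rate / f0_hz))
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq_hz, bandwidth_hz in formants:
        signal = resonate(signal, freq_hz, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in signal) or 1.0
    return [s / peak for s in signal]  # normalized to the range -1..1


samples = synthesize_vowel(120.0, [(700.0, 130.0), (1220.0, 70.0), (2600.0, 160.0)])
```

Changing `f0_hz` changes the perceived pitch while the formant list determines the vowel quality, which is the separation of parameters that the paragraph above describes. No recorded speech is involved at any point.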

Many systems based on formant synthesis technology generate artificial, robotic-sounding speech that would never be mistaken for human speech. However, maximum naturalness is not always the goal of a speech synthesis system, and formant synthesis systems have advantages over concatenative systems. Formant-synthesized speech can be reliably intelligible, even at very high speeds, avoiding the acoustic glitches that commonly plague concatenative systems. High-speed synthesized speech is used by the visually impaired to quickly navigate computers using a screen reader. Formant synthesizers are usually smaller programs than concatenative systems because they do not have a database of speech samples. They can therefore be used in embedded systems, where memory and microprocessor power are especially limited. Because formant-based systems have complete control of all aspects of the output speech, a wide variety of prosodies and intonations can be output, conveying not just questions and statements, but a variety of emotions and tones of voice.

Examples of non-real-time but highly accurate intonation control in formant synthesis include the work done in the late 1970s for the Texas Instruments toy Speak & Spell, and in the early 1980s for Sega arcade machines. Creating proper intonation for these projects was painstaking, and the results have yet to be matched by real-time text-to-speech interfaces.

Articulatory synthesis


Articulatory synthesis refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there. The first articulatory synthesizer regularly used for laboratory experiments was developed at Haskins Laboratories in the mid-1970s by Philip Rubin, Tom Baer, and Paul Mermelstein. This synthesizer, known as ASY, was based on vocal tract models developed at Bell Laboratories in the 1960s and 1970s by Paul Mermelstein, Cecil Coker, and colleagues.

Until recently, articulatory synthesis models have not been incorporated into commercial speech synthesis systems. A notable exception is the NeXT-based system originally developed and marketed by Trillium Sound Research, a spin-off company of the University of Calgary, where much of the original research was conducted. Following the demise of the various incarnations of NeXT (started by Steve Jobs in the late 1980s and merged with Apple Computer in 1997), the Trillium software was published under the GNU General Public License, with work continuing as gnuspeech. The system, first marketed in 1994, provides full articulatory-based text-to-speech conversion using a waveguide or transmission-line analog of the human oral and nasal tracts controlled by Carré's "distinctive region model".

HMM-based synthesis


HMM-based synthesis is a synthesis method based on hidden Markov models. In this system, the frequency spectrum (vocal tract), fundamental frequency (vocal source), and duration (prosody) of speech are modeled simultaneously by HMMs. Speech waveforms are generated from HMMs themselves based on the maximum likelihood criterion.

Sinewave synthesis

Sinewave synthesis is a technique for synthesizing speech by replacing the formants (main bands of energy) with pure tone whistles.
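
A minimal sketch of this idea follows, assuming each formant is described by a start frequency, an end frequency, and an amplitude, with the frequency gliding linearly between the two. The function name, the track format, and the example values are hypothetical. Real sinewave speech follows formant tracks measured from a recorded utterance rather than straight-line glides.

```python
import math


def sinewave_speech(formant_tracks, duration_s, sample_rate=16000):
    """Sum one pure tone per formant track.

    `formant_tracks` is a list of (start_hz, end_hz, amplitude) tuples.
    Each tone's frequency moves linearly from start_hz to end_hz.
    """
    n = int(round(duration_s * sample_rate))
    total_amplitude = sum(amp for _, _, amp in formant_tracks) or 1.0
    phases = [0.0] * len(formant_tracks)
    samples = []
    for i in range(n):
        progress = i / max(n - 1, 1)
        value = 0.0
        for k, (start_hz, end_hz, amplitude) in enumerate(formant_tracks):
            freq_hz = start_hz + (end_hz - start_hz) * progress
            # Accumulate phase so the tone stays continuous while its frequency changes.
            phases[k] += 2.0 * math.pi * freq_hz / sample_rate
            value += amplitude * math.sin(phases[k])
        samples.append(value / total_amplitude)
    return samples


# Illustrative glide, loosely resembling a vowel transition (values are made up).
tracks = [(700.0, 300.0, 1.0), (1200.0, 2200.0, 0.6), (2500.0, 3000.0, 0.3)]
whistles = sinewave_speech(tracks, duration_s=0.4)
```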

Challenges


Text normalization challenges

The process of normalizing text is rarely straightforward. Texts are full of heteronyms, numbers, and abbreviations that all require expansion into a phonetic representation. There are many spellings in English which are pronounced differently based on context. For example, "My latest project is to learn how to better project my voice" contains two pronunciations of "project".

Most text-to-speech (TTS) systems do not generate semantic representations of their input texts, as processes for doing so are not reliable, well understood, or computationally effective. As a result, various heuristic techniques are used to guess the proper way to disambiguate homographs, like examining neighboring words and using statistics about frequency of occurrence.
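
The neighboring-word heuristic can be sketched as follows. This is a toy example under stated assumptions: the one-entry homograph table, the informal respellings, and the list of cue words are all invented for illustration, and a practical system would combine many such cues with frequency statistics.

```python
# Hypothetical respellings; a real system would use a phonetic alphabet.
HOMOGRAPHS = {
    "project": {"noun": "PROJ-ekt", "verb": "pruh-JEKT"},
}

# Assumed cue words: a homograph shortly after one of these is guessed to be a verb.
VERB_CUES = {"to", "will", "can", "must", "should"}


def guess_pronunciations(sentence):
    """Return (word, guessed part of speech, pronunciation) for each known homograph."""
    words = [w.strip('.,;:!?"').lower() for w in sentence.split()]
    results = []
    for i, word in enumerate(words):
        if word in HOMOGRAPHS:
            window = words[max(0, i - 2):i]  # the two preceding words
            pos = "verb" if any(w in VERB_CUES for w in window) else "noun"
            results.append((word, pos, HOMOGRAPHS[word][pos]))
    return results


guesses = guess_pronunciations(
    "My latest project is to learn how to better project my voice"
)
# guesses == [("project", "noun", "PROJ-ekt"), ("project", "verb", "pruh-JEKT")]
```

The two-word window is what lets the sketch see "to" across the intervening adverb in "to better project". The same rule is easy to fool, which is why such heuristics only guess.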

Deciding how to convert numbers is another problem that TTS systems have to address. It is a simple programming challenge to convert a number into words, like "1325" becoming "one thousand three hundred twenty-five." However, numbers occur in many different contexts; when it is a year or perhaps part of an address, "1325" should likely be read as "thirteen twenty-five", or, when part of a social security number, as "one three two five". A TTS system can often infer how to expand a number based on surrounding words, numbers, and punctuation, and sometimes the system provides a way to specify the context if it is ambiguous.
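
The three readings of "1325" can be sketched as below. The context labels ("cardinal", "year", "digits") and the function names are assumptions made for this example. How a system decides which context applies is the hard part, and it is not shown here.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def cardinal(n):
    """Spell out 0..9999 as a plain quantity."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(f"{ONES[thousands]} thousand")
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)


def as_year(n):
    """Read a four-digit number in pairs, as years and street numbers often are.

    Simplification: round years such as 2000 have conventional readings
    that this sketch does not handle.
    """
    high, low = divmod(n, 100)
    if low == 0:
        return f"{two_digits(high)} hundred"
    if low < 10:
        return f"{two_digits(high)} oh {ONES[low]}"
    return f"{two_digits(high)} {two_digits(low)}"


def expand_number(token, context="cardinal"):
    """Expand a digit string according to an externally supplied context label."""
    if context == "digits":
        return " ".join(ONES[int(d)] for d in token)
    if context == "year":
        return as_year(int(token))
    return cardinal(int(token))


readings = {ctx: expand_number("1325", ctx) for ctx in ("cardinal", "year", "digits")}
```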

Similarly, abbreviations can be ambiguous. For example, the abbreviation "in" for "inches" must be differentiated from the word "in", and the address "12 St John St." uses the same abbreviation for both "Saint" and "Street". TTS systems with intelligent front ends can make educated guesses about ambiguous abbreviations, while others provide the same result in all cases, resulting in nonsensical (and sometimes comical) outputs.
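
One possible educated guess for the "St" example is to look at the following word: if it is capitalized, read "Saint", otherwise read "Street". The sketch below implements only that single assumed rule, with a hypothetical function name, and it drops the abbreviation's trailing period for simplicity.

```python
def expand_st(text):
    """Expand "St" / "St." using the capitalization of the next token."""
    tokens = text.split()
    out = []
    for i, token in enumerate(tokens):
        if token.rstrip(".") == "St":
            following = tokens[i + 1] if i + 1 < len(tokens) else ""
            # A capitalized word after "St" suggests a saint's name.
            out.append("Saint" if following[:1].isupper() else "Street")
        else:
            out.append(token)
    return " ".join(out)


expanded = expand_st("12 St John St.")
# expanded == "12 Saint John Street"
```

The rule fails as soon as a street abbreviation is followed by a capitalized word, as in "Main St. Paul lives nearby", which is the kind of nonsensical output described above.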

Text-to-phone challenges


Speech synthesis systems use two basic approaches to determine the pronunciation of a word based on its spelling, a process which is often called text-to-phoneme or grapheme-to-phoneme conversion (phoneme is the term used by linguists to describe distinctive sounds in a language). The simplest approach to text-to-phoneme conversion is the dictionary-based approach, where a large dictionary containing all the words of a language and their correct pronunciations is stored by the program. Determining the correct pronunciation of each word is a matter of looking up each word in the dictionary and replacing the spelling with the pronunciation specified in the dictionary. The other approach is rule-based, in which pronunciation rules are applied to words to determine their pronunciations based on their spellings. This is similar to the "sounding out", or synthetic phonics, approach to learning reading.

Each approach has advantages and drawbacks. The dictionary-based approach is quick and accurate, but completely fails if it is given a word which is not in its dictionary. As dictionary size grows, so too do the memory space requirements of the synthesis system. On the other hand, the rule-based approach works on any input, but the complexity of the rules grows substantially as the system takes into account irregular spellings or pronunciations. (Consider that the word "of" is very common in English, yet is the only word in which the letter "f" is pronounced [v].) As a result, nearly all speech synthesis systems use a combination of these approaches.
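
A combined approach might look like the sketch below: try the dictionary first, and fall back to letter-to-sound rules for unknown words. Everything here is illustrative. The two-entry lexicon, the one-sound-per-letter rule table, and the ARPAbet-style phoneme symbols are assumptions for the example, and real rule sets are context-dependent and far larger.

```python
# Tiny exception dictionary (hypothetical entries, ARPAbet-style symbols).
LEXICON = {
    "of": ["AH", "V"],
    "the": ["DH", "AH"],
}

# Deliberately naive rules: each letter maps to a fixed sound sequence.
LETTER_TO_SOUND = {
    "a": ["AE"], "b": ["B"], "c": ["K"], "d": ["D"], "e": ["EH"], "f": ["F"],
    "g": ["G"], "h": ["HH"], "i": ["IH"], "j": ["JH"], "k": ["K"], "l": ["L"],
    "m": ["M"], "n": ["N"], "o": ["AA"], "p": ["P"], "q": ["K"], "r": ["R"],
    "s": ["S"], "t": ["T"], "u": ["AH"], "v": ["V"], "w": ["W"],
    "x": ["K", "S"], "y": ["Y"], "z": ["Z"],
}


def pronounce(word):
    """Return (phonemes, method), where method is "dictionary" or "rules"."""
    key = word.lower()
    if key in LEXICON:
        return list(LEXICON[key]), "dictionary"
    phonemes = []
    for letter in key:
        phonemes.extend(LETTER_TO_SOUND.get(letter, []))
    return phonemes, "rules"


known = pronounce("of")     # found in the dictionary, so "f" becomes V
unknown = pronounce("fit")  # not in the dictionary, so the rules apply
```

Without the dictionary entry, the rules alone would give "of" an F sound, which shows why irregular words are usually handled by lookup.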

Some languages, like Spanish, have a very regular writing system, and the prediction of the pronunciation of words based on their spellings is quite successful. Speech synthesis systems for such languages often use the rule-based method extensively, resorting to dictionaries only for those few words, like foreign names and borrowings, whose pronunciations are not obvious from their spellings. On the other hand, speech synthesis systems for languages like English, which have extremely irregular spelling systems, are more likely to rely on dictionaries, and to use rule-based methods only for unusual words, or words that aren't in their dictionaries.

Evaluation challenges

The consistent evaluation of speech synthesis systems may be difficult because of a lack of universally agreed objective evaluation criteria. Different organizations often use different speech data. The quality of speech synthesis systems also depends to a large degree on the quality of the production technique (which may involve analogue or digital recording) and on the facilities used to replay the speech. Evaluating speech synthesis systems has therefore often been compromised by differences between production techniques and replay facilities.

Recently, however, some researchers have started to evaluate speech synthesis systems using a common speech dataset.

Prosodics and emotional content

A recent study in the journal "Speech Communication" by Amy Drahota and colleagues at the University of Portsmouth, UK, reported that listeners to voice recordings could determine, at better than chance levels, whether or not the speaker was smiling. It was suggested that identification of the vocal features which signal emotional content may be used to help make synthesized speech sound more natural.

Dedicated hardware

  • Votrax
    • SC-01A (analog formant)
    • SC-02 / SSI-263 / "Arctic 263"
  • General Instrument SP0256-AL2 (CTS256A-AL2, MEA8000)
  • Magnevation SpeakJet (www.speechchips.com TTS256)
  • Savage Innovations SoundGin
  • National Semiconductor DT1050 Digitalker (Mozer)
  • Silicon Systems SSI 263 (analog formant)
  • Texas Instruments
    • TMS5110A (LPC)
    • TMS5200
  • Oki Semiconductor
    • MSM5205
    • MSM5218RS (ADPCM)
  • Toshiba T6721A
  • Philips PCF8200


Computer operating systems or outlets with speech synthesis


Apple


The first speech system integrated into an operating system was Apple Computer's MacInTalk in 1984. Since the 1980s, Macintosh computers have offered text-to-speech capabilities through the MacInTalk software. In the early 1990s Apple expanded its capabilities, offering system-wide text-to-speech support. With the introduction of faster PowerPC-based computers, Apple included higher-quality voice sampling. Apple also introduced speech recognition into its systems, which provided a fluid command set. More recently, Apple has added sample-based voices. Starting as a curiosity, the speech system of the Apple Macintosh has evolved into a cutting-edge, fully supported program, PlainTalk, for people with vision problems. VoiceOver was included in Mac OS X Tiger and more recently Mac OS X Leopard. The voice shipping with Mac OS X 10.5 ("Leopard") is called "Alex" and features the taking of realistic-sounding breaths between sentences, as well as improved clarity at high read rates.

AmigaOS


The second operating system with advanced speech synthesis capabilities was AmigaOS, introduced in 1985. The voice synthesis was licensed by Commodore International from a third-party software house (Don't Ask Software, now Softvoice, Inc.) and it featured a complete system of voice emulation, with both male and female voices and "stress" indicator markers, made possible by advanced features of the Amiga hardware audio chipset. It was divided into a narrator device and a translator library. Amiga Speak Handler featured a text-to-speech translator. AmigaOS considered speech synthesis a virtual hardware device, so the user could even redirect console output to it. Some Amiga programs, such as word processors, made extensive use of the speech system.

Microsoft Windows


Modern Windows systems use SAPI4- and SAPI5-based speech systems that include a speech recognition engine (SRE). SAPI 4.0 was available on Microsoft-based operating systems as a third-party add-on for systems like Windows 95 and Windows 98. Windows 2000 added a speech synthesis program called Narrator, directly available to users. All Windows-compatible programs could make use of speech synthesis features, available through menus once installed on the system. Microsoft Speech Server is a complete package for voice synthesis and recognition, for commercial applications such as call centers.

Internet

Currently, there are a number of applications, plugins and gadgets that can read messages directly from an e-mail client and web pages from a web browser. Some specialized software can narrate RSS feeds. On one hand, online RSS narrators simplify information delivery by allowing users to listen to their favourite news sources and to convert them to podcasts. On the other hand, online RSS readers are available on almost any PC connected to the Internet. Users can download generated audio files to portable devices, e.g. with the help of a podcast receiver, and listen to them while walking, jogging or commuting to work.

A growing field in Internet-based TTS is web-based assistive technology, e.g. 'Talklets' from a UK company, which uses a software as a service (SaaS) model for TTS. The SaaS model for web-based TTS removes the need for a software download by individual users. It can deliver TTS functionality to anyone (for reasons of accessibility, convenience, entertainment or information) with access to a web browser. The speed of response, as with all SaaS implementations, depends on the user's individual Internet connection; however, the 'access anywhere' nature of SaaS TTS is a key benefit of this approach.

Others

  • Some models of Texas Instruments home computers produced in 1979 and 1981 (Texas Instruments TI-99/4 and TI-99/4A) were capable of text-to-phoneme synthesis or reciting complete words and phrases (text-to-dictionary), using a very popular Speech Synthesizer peripheral. TI used a proprietary codec to embed complete spoken phrases into applications, primarily video games.
  • IBM's OS/2 Warp 4 included VoiceType, a precursor to IBM ViaVoice.
  • Various systems operate on free and open source software platforms, including GNU/Linux. They include open-source programs such as the Festival Speech Synthesis System, which uses diphone-based synthesis (and can use a limited number of MBROLA voices), and gnuspeech from the Free Software Foundation, which uses articulatory synthesis. Other commercial vendor software also runs on GNU/Linux.
  • Several commercial companies are also developing speech synthesis systems (this list is reporting them just for the sake of information, not endorsing any specific product): AT&T, Cepstral, DECtalk, IBM ViaVoice, IVONA TTS, and Nuance Communications, among others.
  • Companies which developed speech synthesis systems but which are no longer in this business include BeST Speech (bought by L&H), Eloquent Technology (bought by SpeechWorks), Lernout & Hauspie (bought by Nuance), SpeechWorks (bought by Nuance), and Rhetorical Systems (bought by Nuance).


Speech synthesis markup languages


A number of markup languages have been established for the rendition of text as speech in an XML-compliant format. The most recent is Speech Synthesis Markup Language (SSML), which became a W3C recommendation in 2004. Older speech synthesis markup languages include Java Speech Markup Language (JSML) and SABLE. Although each of these was proposed as a standard, none of them has been widely adopted.
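
As a rough illustration of what such markup looks like, the sketch below builds a small SSML document using Python's standard xml.etree.ElementTree module. The function name build_ssml and its parameters are hypothetical, chosen only for this example; the speak, break and emphasis elements and the namespace URI are those of the SSML 1.0 recommendation. How a given synthesizer renders the pause and the emphasis depends on the engine.

```python
import xml.etree.ElementTree as ET

# Namespace URIs: the first is defined by SSML 1.0, the second by XML itself.
SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(sentence, emphasized, pause_ms=500, lang="en-US"):
    """Return an SSML string: a sentence, a pause, then an emphasized phrase.

    Hypothetical helper for illustration; not part of any TTS library.
    """
    # Serialize SSML elements without a prefix (as the default namespace).
    ET.register_namespace("", SSML_NS)

    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang},
    )
    speak.text = sentence + " "

    # <break time="..."/> asks the synthesizer to insert a pause.
    ET.SubElement(speak, f"{{{SSML_NS}}}break", {"time": f"{pause_ms}ms"})

    # <emphasis level="strong"> asks for the phrase to be stressed.
    emphasis = ET.SubElement(speak, f"{{{SSML_NS}}}emphasis", {"level": "strong"})
    emphasis.text = emphasized

    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("Your flight departs at noon.", "Gate 12", pause_ms=300))
```

The resulting string could then be passed to any engine that accepts SSML input. Plain text handed to a text-to-speech system carries no such hints, which is the gap these markup languages are meant to fill.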

Speech synthesis markup languages are distinguished from dialogue markup languages. VoiceXML, for example, includes tags related to speech recognition, dialogue management and touchtone dialing, in addition to text-to-speech markup.

Applications


Accessibility


Speech synthesis has long been a vital assistive technology tool, and its application in this area is significant and widespread. It allows environmental barriers to be removed for people with a wide range of disabilities. The longest-standing application has been in screen readers for people with visual impairment, but text-to-speech systems are now commonly used by people with dyslexia and other reading difficulties, as well as by pre-literate children. They are also frequently employed to aid those with severe speech impairment, usually through a dedicated voice output communication aid.

News service


Sites such as Ananova have used speech synthesis to convert written news to audio content, which can be used for mobile applications.

Entertainment

Speech synthesis techniques are also used in entertainment productions such as games and anime. In 2007, Animo Limited announced the development of a software application package based on its speech synthesis software FineSpeech, explicitly geared towards customers in the entertainment industries and able to generate narration and lines of dialogue according to user specifications. The application reached maturity in 2008, when NEC Biglobe announced a web service that allows users to create phrases from the voices of Code Geass: Lelouch of the Rebellion R2 characters.

Software such as Vocaloid can generate singing voices from lyrics and melody. This is also the aim of the Singing Computer project, which uses the GPL software LilyPond and Festival to help blind people check their lyric input.

Specific programs


See also

  • Articulatory synthesis
  • Chinese speech synthesis
  • Computing
  • Language
  • Natural language processing
  • OpenDocument
  • Paperless office
  • Screen readers, comparison
  • Sinewave synthesis
  • Speech processing
  • Speech recognition




External links