Chinese speech synthesis
Chinese speech synthesis is the application of speech synthesis to the Chinese language (usually Standard Chinese). It poses additional difficulties because Chinese characters frequently have different pronunciations in different contexts, because the complex prosody is essential to conveying the meaning of words, and because native speakers sometimes disagree about the correct pronunciation of certain phonemes.

Corpus-based

Anhui USTC iFlyTek Co., Ltd (iFlyTek) published a W3C paper in which they adapted Speech Synthesis Markup Language (SSML) to produce a markup language called Chinese Speech Synthesis Markup Language (CSSML), which can include additional markup to clarify the pronunciation of characters and to add some prosody information. Their synthesiser takes a "corpus-based" approach, which means it can sound very natural in most cases but can err on unusual phrases that cannot be matched with the corpus. iFlyTek does not disclose the amount of data involved, but it can be gauged from the commercial products that license iFlyTek's technology; for example, Bider's SpeechPlus is a 1.3 gigabyte download, 1.2 gigabytes of which is the highly compressed data for a single Chinese voice. iFlyTek's synthesiser can also synthesise mixed Chinese and English text with the same voice (e.g. Chinese sentences containing some English words); they claim their English synthesis to be "average".

The iFlyTek corpus appears to be heavily dependent on Chinese characters, and it is not possible to synthesize from pinyin alone. It is sometimes possible by means of CSSML to add pinyin to the characters to disambiguate between multiple possible pronunciations, but this does not always work.
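
Annotating a character with pinyin can be illustrated with the phoneme element of standard SSML. The sketch below is not CSSML, whose exact syntax is not given here. The "x-pinyin" alphabet name and the helper annotate are assumptions made for illustration only; a real synthesizer documents which alphabets it accepts, if any.

```python
from xml.sax.saxutils import escape, quoteattr

# Namespace of W3C SSML; the alphabet name below is hypothetical.
SSML_NS = "http://www.w3.org/2001/10/synthesis"
ALPHABET = "x-pinyin"  # assumption: vendor-specific alphabets start with "x-"

def annotate(text, overrides):
    """Wrap selected characters in SSML <phoneme> elements.

    `overrides` maps a character index in `text` to the pinyin
    (with tone number) that the character should be read with.
    """
    parts = []
    for i, ch in enumerate(text):
        if i in overrides:
            parts.append('<phoneme alphabet="%s" ph=%s>%s</phoneme>'
                         % (ALPHABET, quoteattr(overrides[i]), escape(ch)))
        else:
            parts.append(escape(ch))
    return ('<speak version="1.0" xmlns="%s" xml:lang="zh-CN">%s</speak>'
            % (SSML_NS, "".join(parts)))

# The second character of this word for "bank" is read "hang2" here,
# although the same character is read "xing2" in other words.
ssml = annotate("银行", {1: "hang2"})
print(ssml)
```

Only the characters that need disambiguation are wrapped; the rest of the text is left for the synthesizer's own character-to-sound rules.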

A corpus-based approach is also taken by Tsinghua University's SinoSonic, with the Harbin voice data taking 800 megabytes. As of 2007, and still as of 2011, the download link for SinoSonic had not been activated, which raises the question of whether the product is vapourware.

Concatenation (KeyTip)

A less complex approach is taken by cjkware.com's KeyTip Putonghua Reader, which contains 120 megabytes of sound recordings (GSM-compressed to 40 megabytes in the evaluation version), comprising 10,000 multi-syllable dictionary words plus single-syllable recordings in 6 different prosodies (4 tones, neutral tone, and an extra third-tone recording for use at the end of a phrase). These recordings can be concatenated in any desired combination, but the joins sound forced (as is usual for simple concatenation-based speech synthesis) and this can severely affect prosody; the synthesizer is also inflexible in terms of speed and expression. However, because this synthesizer does not rely on a corpus, there is no noticeable degradation in performance when it is given more unusual or awkward phrases.
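
The selection of recordings described above can be sketched as follows. This is a minimal illustration, not KeyTip's actual code: the file names, the tiny word inventory, and the "3f" label for the phrase-final third tone are all hypothetical, and tone sandhi is ignored.

```python
# Hypothetical inventory: a few multi-syllable word recordings. Every
# single syllable is assumed to exist in six variants: tones 1-4,
# neutral ("5"), and a phrase-final third tone (labelled "3f" here).
WORD_RECORDINGS = {
    ("ni3", "hao3"): "word_nihao.wav",
    ("zhong1", "guo2"): "word_zhongguo.wav",
}
MAX_WORD_LEN = max(len(key) for key in WORD_RECORDINGS)

def plan_units(syllables):
    """Return the recordings to play back to back for one phrase."""
    units = []
    i = 0
    while i < len(syllables):
        # Prefer the longest multi-syllable word recording that matches.
        for n in range(min(MAX_WORD_LEN, len(syllables) - i), 1, -1):
            key = tuple(syllables[i:i + n])
            if key in WORD_RECORDINGS:
                units.append(WORD_RECORDINGS[key])
                i += n
                break
        else:
            # No word matched: fall back to a single-syllable recording.
            syl = syllables[i]
            if syl.endswith("3") and i == len(syllables) - 1:
                syl = syl[:-1] + "3f"  # special phrase-final third tone
            units.append("syl_%s.wav" % syl)
            i += 1
    return units

print(plan_units(["ni3", "hao3", "zhong1", "guo2", "hen3", "mei3"]))
```

Because every syllable has a fallback recording, any input can be covered, which is consistent with the observation that unusual phrases do not degrade the output; the cost is that each join is a hard cut between independent recordings.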

eSpeak

The lightweight open-source speech project eSpeak, which uses its own formant synthesis method, has started experimenting with Chinese synthesis. It was used by Google Translate from May 2010 until December 2010.

Ekho

Ekho is another open-source TTS system, which simply concatenates sampled syllables. It currently supports Cantonese, Mandarin, and Korean. Some of the Mandarin syllables have been pitch-normalised in Praat. A modified version of these is used in Gradint's "synthesis from partials".
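
At the waveform level, concatenating sampled syllables amounts to appending the audio frames of one clip to the next. The sketch below is not Ekho's code; it uses Python's standard wave module, and the sine tones are stand-ins for recorded syllables, which are assumed to share one sample rate and format.

```python
import io
import math
import struct
import wave

RATE = 16000  # samples per second, assumed identical for every clip

def tone(freq, seconds=0.2):
    """Stand-in for a recorded syllable: 16-bit mono PCM sine wave."""
    n = int(RATE * seconds)
    return b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * freq * t / RATE)))
        for t in range(n)
    )

def concatenate(clips):
    """Join raw PCM clips into one in-memory WAV file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        out.setnchannels(1)   # mono
        out.setsampwidth(2)   # 16-bit samples
        out.setframerate(RATE)
        for clip in clips:
            out.writeframes(clip)  # no smoothing at the joins
    return buf.getvalue()

audio = concatenate([tone(220.0), tone(330.0)])
print(len(audio), "bytes of WAV data")
```

Nothing is done to match pitch or amplitude across the boundary between clips, which is one reason normalising the pitch of the recordings beforehand can help.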

Online demos and Bell Labs

There is an online interactive demonstration for NeoSpeech speech synthesis, but it is not possible to customize the Chinese pronunciation by entering pinyin. iFlyTek has two demos available online.

Bell Labs has an online Mandarin text-to-speech demo dated 1997, but it is now non-functional (the server to which the query is submitted does not exist in the DNS) and the contact email is no longer valid. However, their approach was described in the monograph "Multilingual Text-to-Speech Synthesis: The Bell Labs Approach" (Springer, October 31, 1997, ISBN 978-0792380276), and the former employee who was responsible for the project, Chilin Shih (who now works at the University of Illinois), has some notes about her methods on her website.

Non-Windows systems

The proprietary Chinese speech synthesis systems mentioned above (apart from the online demos) are available only for Windows. However, the spaced-interval repetition language-practice program Gradint includes code and instructions for using KeyTip and SpeechPlus data on other operating systems, by reading the data directly or by using the Wine compatibility layer.

There are some reports that SAPI 5-based speech synthesizers can be run on recent versions of Wine.

Mac OS had Chinese speech synthesizers available up to version 9. They were removed in Mac OS X. From release 10.5 (Leopard), the built-in VoiceOver application claims to support third-party Chinese voices, but no Chinese voice is built into the operating system and Apple does not provide any links to actual Mac OS X Chinese voice products. In 10.7 (Lion), voice packs are downloaded automatically as needed when selected in the Speech settings in System Preferences.

Notable approaches not yet taken

As of 2007, it appears that there have been no projects to synthesize Chinese by simulating the human vocal tract, as Gnuspeech is doing for English. Chinese is also not one of the languages being synthesized in the multilingual MBROLA project.

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 