An acoustic model is created by taking audio recordings of speech and their text transcriptions, and using software to create statistical representations of the sounds that make up each word. It is used by a speech recognition engine to recognize speech.
Background
Speech recognition engines require two types of files to recognize speech. They require an acoustic model, which is created by taking audio recordings of speech and their transcriptions (taken from a speech corpus) and 'compiling' them into statistical representations of the sounds that make up each word (through a process called 'training'). They also require a language model or grammar file. A language model is a file containing the probabilities of sequences of words. A grammar is a much smaller file containing sets of predefined combinations of words. Language models are used for dictation applications, whereas grammars are used in desktop command and control or telephony interactive voice response (IVR) type applications.
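As a rough illustration of the difference between the two, the sketch below builds a toy bigram language model (probabilities of word pairs estimated from counts) and a toy grammar (a fixed set of allowed phrases). The sample sentences, the function names and the in-memory data structures are assumptions made for this example; real engines use their own file formats and far larger data.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Estimate P(next_word | word) from raw pair counts (no smoothing)."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            pair_counts[prev][nxt] += 1

    model = {}
    for prev, counts in pair_counts.items():
        total = sum(counts.values())
        model[prev] = {nxt: n / total for nxt, n in counts.items()}
    return model


def in_grammar(phrase, grammar):
    """A grammar here is just a set of predefined word combinations."""
    return phrase.lower() in grammar


# Hypothetical training text for a dictation-style language model.
corpus = ["open the file", "open the window", "close the file"]
bigram_model = train_bigram_model(corpus)

# Hypothetical command-and-control grammar: only these phrases are accepted.
command_grammar = {"open file", "close file", "save file"}
```

The language model can score any word sequence it has statistics for, while the grammar simply accepts or rejects a phrase, which is why it can be so much smaller.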
Speech Audio Characteristics
Audio can be encoded at different sampling rates (samples per second; the most common are 8 kHz, 16 kHz, 32 kHz, 44.1 kHz, 48 kHz and 96 kHz) and with different numbers of bits per sample (most commonly 8, 16 or 32 bits). Speech recognition engines work best if the acoustic model they use was trained with speech audio recorded at the same sampling rate and bits per sample as the speech being recognized.
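A minimal sketch of such a check, using Python's standard `wave` module to read the sampling rate and bits per sample from a WAV file and compare them with the format an acoustic model was trained on. The `AudioFormat` type, the helper names and the in-memory test file are assumptions made for illustration, not part of any recognition engine.

```python
import io
import wave
from typing import NamedTuple


class AudioFormat(NamedTuple):
    sample_rate_hz: int
    bits_per_sample: int


def read_wav_format(source):
    """Return the format of a WAV file (path or binary file-like object)."""
    with wave.open(source, "rb") as wav:
        # getsampwidth() is in bytes per sample, so convert to bits.
        return AudioFormat(wav.getframerate(), wav.getsampwidth() * 8)


def matches_model(audio, model):
    """True if the audio has the same sampling rate and bits per sample."""
    return audio == model


def make_silent_wav(sample_rate_hz, bits_per_sample, n_frames=160):
    """Build a small mono WAV file in memory (hypothetical test helper)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(bits_per_sample // 8)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00" * (n_frames * bits_per_sample // 8))
    buffer.seek(0)
    return buffer


# Assumed training format of the acoustic model in use.
model_format = AudioFormat(sample_rate_hz=16000, bits_per_sample=16)
recording_format = read_wav_format(make_silent_wav(8000, 16))
print(matches_model(recording_format, model_format))
```

Here the recording is 8 kHz while the model is assumed to be trained at 16 kHz, so the check reports a mismatch.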
Telephony-based Speech Recognition
The limiting factor for telephony-based speech recognition is the bandwidth at which speech can be transmitted. For example, a standard land-line telephone has a bandwidth of only 64 kbit/s, at a sampling rate of 8 kHz and 8 bits per sample (8000 samples per second × 8 bits per sample = 64,000 bit/s). Therefore, telephony-based speech recognition needs acoustic models trained with 8 kHz/8-bit speech audio files.
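The arithmetic above generalizes to any uncompressed audio format. The following sketch assumes a hypothetical helper, `pcm_bit_rate`, and a single (mono) channel by default:

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Bits per second needed for uncompressed audio."""
    return sample_rate_hz * bits_per_sample * channels


# Land-line telephony: 8 kHz, 8 bits per sample.
telephone_bit_rate = pcm_bit_rate(8000, 8)    # 64000 bit/s

# For comparison, 16 kHz audio at 16 bits per sample.
wideband_bit_rate = pcm_bit_rate(16000, 16)   # 256000 bit/s
```

Doubling both the sampling rate and the bits per sample quadruples the amount of data per second, which is why higher-quality audio needs more transmission capacity.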
In the case of Voice over IP, the codec determines the sampling rate and bits per sample of speech transmission. If a codec with a higher sampling rate or more bits per sample is used for speech transmission (to improve the sound quality), then the acoustic model must be trained with audio data that matches that sampling rate and bits per sample.
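One way to picture this requirement is a lookup from the codec in use to an acoustic model trained at the matching sampling rate. In the sketch below, the 8 kHz figure for G.711 and the 16 kHz figure for G.722 are the nominal sampling rates of those two codecs, chosen here only as examples; the model names and the `model_for_codec` function are hypothetical and do not correspond to any particular engine.

```python
# Nominal sampling rates of two common telephony codecs (example entries).
CODEC_SAMPLE_RATE_HZ = {
    "G.711": 8000,
    "G.722": 16000,
}

# Hypothetical acoustic models, keyed by the sampling rate they were trained at.
ACOUSTIC_MODELS = {
    8000: "model-8khz",
    16000: "model-16khz",
}


def model_for_codec(codec):
    """Pick the acoustic model whose training sampling rate matches the codec."""
    if codec not in CODEC_SAMPLE_RATE_HZ:
        raise ValueError(f"unknown codec: {codec}")
    rate = CODEC_SAMPLE_RATE_HZ[codec]
    if rate not in ACOUSTIC_MODELS:
        raise LookupError(f"no acoustic model trained at {rate} Hz")
    return ACOUSTIC_MODELS[rate]
```

A call such as `model_for_codec("G.722")` would select the 16 kHz model, while an unlisted codec raises an error rather than silently falling back to a mismatched model.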
Desktop-based Speech Recognition
For speech recognition on a standard desktop PC, the limiting factor is the sound card. Most sound cards today can record at sampling rates of 16 kHz to 48 kHz, with bit depths of 8 to 16 bits per sample, and play back at up to 96 kHz.
As a general rule, a speech recognition engine works better with acoustic models trained with speech audio recorded at higher sampling rates and more bits per sample. However, audio with too high a sampling rate or too many bits per sample can slow the recognition engine down, so a compromise is needed. For desktop speech recognition, the current standard is therefore acoustic models trained with speech audio recorded at a sampling rate of 16 kHz and 16 bits per sample.
External links
- Acoustic models (last modified: March 19, 2008) from CMU Sphinx
- Japanese acoustic models for use with Julius
- Open source acoustic models at VoxForge
- HTK WSJ acoustic models for HTK
- Sphinx WSJ acoustic models for CMU Sphinx