Acoustic Model

An acoustic model is created by taking audio recordings of speech and their text transcriptions, and using software to build statistical representations of the sounds that make up each word. A speech recognition engine uses it to recognize speech.

Background


Speech recognition engines require two types of files to recognize speech. The first is an acoustic model, which is created by taking audio recordings of speech and their transcriptions (taken from a speech corpus) and 'compiling' them into statistical representations of the sounds that make up each word, through a process called 'training'. The second is a language model or a grammar file. A language model is a file containing the probabilities of sequences of words. A grammar is a much smaller file containing sets of predefined combinations of words. Language models are used for dictation applications, whereas grammars are used in desktop command-and-control or telephony interactive voice response (IVR) applications.
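The difference between a language model and a grammar can be sketched in a few lines of Python. This is only a toy illustration: the three training sentences, the function names and the bigram (word-pair) form of the model are assumptions made for the example, and real engines use their own file formats and smoothing.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Estimate P(next word | word) from raw pair counts (no smoothing)."""
    pair_counts = Counter()
    first_word_counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for first, second in zip(words, words[1:]):
            pair_counts[(first, second)] += 1
            first_word_counts[first] += 1
    return {
        pair: count / first_word_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


# Toy training text, invented for illustration only.
corpus = ["open the file", "open the window", "close the file"]
language_model = train_bigram_model(corpus)

# A grammar simply enumerates the word combinations that are allowed.
grammar = {"open file", "close file", "open window"}


def grammar_accepts(utterance):
    """True only if the utterance is one of the predefined combinations."""
    return utterance.lower() in grammar
```

Here the language model assigns a probability to every word pair it has seen, which suits open-ended dictation, while the grammar accepts or rejects an utterance outright, which suits a small fixed set of commands.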

Speech Audio Characteristics


Audio can be encoded at different sampling rates (samples per second; the most common are 8 kHz, 16 kHz, 32 kHz, 44.1 kHz, 48 kHz and 96 kHz) and with different numbers of bits per sample (the most common are 8, 16 or 32 bits). Speech recognition engines work best if the acoustic model they use was trained on speech audio recorded at the same sampling rate and bits per sample as the speech being recognized.
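One way to check a recording against the format an acoustic model was trained on is to read the file header. The sketch below uses the `wave` module from the Python standard library and assumes the audio is an uncompressed PCM WAV file; the function names and the idea of passing the model's format as two numbers are assumptions made for the example, not part of any particular engine.

```python
import wave


def audio_format(path):
    """Return (sampling rate in Hz, bits per sample) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        # getsampwidth() reports bytes per sample, so convert to bits.
        return wav.getframerate(), wav.getsampwidth() * 8


def matches_model(path, model_rate_hz, model_bits):
    """True if the recording uses the format the model was trained on."""
    return audio_format(path) == (model_rate_hz, model_bits)
```

For instance, `matches_model("utterance.wav", 16000, 16)` (with a hypothetical file name) would report whether the recording is 16 kHz/16-bit audio.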

Telephony-based Speech Recognition


The limiting factor for telephony-based speech recognition is the bandwidth at which speech can be transmitted. For example, a standard land-line telephone has a bandwidth of only 64 kbit/s, at a sampling rate of 8 kHz and 8 bits per sample (8000 samples per second × 8 bits per sample = 64,000 bit/s). Therefore, telephony-based speech recognition needs acoustic models trained on 8 kHz/8-bit speech audio files.
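The bit-rate arithmetic above generalizes to any uncompressed audio format. The following is a minimal sketch; the function name and the optional `channels` parameter are assumptions added for the example.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Raw bit rate in bit/s of uncompressed audio."""
    return sample_rate_hz * bits_per_sample * channels


# Land-line telephone audio: 8 kHz sampling rate, 8 bits per sample.
telephone_bit_rate = pcm_bit_rate(8000, 8)
print(telephone_bit_rate)  # 64000

# 16 kHz/16-bit audio carries four times as much data per second.
wideband_bit_rate = pcm_bit_rate(16000, 16)
print(wideband_bit_rate)  # 256000
```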

In the case of Voice over IP, the codec determines the sampling rate and bits per sample of the speech transmission. If you use a codec with a higher sampling rate or more bits per sample (to improve the sound quality), then your acoustic model must be trained with audio data that matches that sampling rate and bits per sample.

Desktop-based Speech Recognition


For speech recognition on a standard desktop PC, the limiting factor is the sound card. Most sound cards today can record at sampling rates between 16 kHz and 48 kHz, with bit depths of 8 to 16 bits per sample, and play back at up to 96 kHz.

As a general rule, a speech recognition engine works better with acoustic models trained on speech audio recorded at higher sampling rates and more bits per sample. However, audio with too high a sampling rate or too many bits per sample can slow the recognition engine down, so a compromise is needed. For desktop speech recognition, the current standard is acoustic models trained on speech audio recorded at a sampling rate of 16 kHz with 16 bits per sample.

External links

  • Acoustic models (last modified: March 19, 2008) from CMU Sphinx

  • Japanese acoustic models for use with Julius

  • Open source acoustic models at VoxForge

  • HTK WSJ acoustic models for HTK

  • Sphinx WSJ acoustic models for CMU Sphinx