Speaker recognition
Speaker recognition is the computing task of validating a user's claimed identity using characteristics extracted from their voice.

There is a difference between speaker recognition (recognizing who is speaking) and speech recognition (recognizing what is being said). These two terms are frequently confused, as is voice recognition. Voice recognition is a combination of the two: it uses learned aspects of a speaker's voice to determine what is being said. Such a system cannot recognize speech from random speakers very accurately, but it can reach high accuracy for individual voices for which it has been trained. In addition, there is a difference between the act of authentication (commonly referred to as speaker verification or speaker authentication) and identification. Finally, there is a difference between speaker recognition (recognizing who is speaking) and speaker diarisation (recognizing when the same speaker is speaking).

Speaker recognition has a history dating back some four decades and uses the acoustic features of speech that have been found to differ between individuals. These acoustic patterns reflect both anatomy (e.g., size and shape of the throat and mouth) and learned behavioral patterns (e.g., voice pitch, speaking style). Speaker verification has earned speaker recognition its classification as a "behavioral biometric".

Verification versus identification

There are two major applications of speaker recognition technologies and methodologies. If the speaker claims to be of a certain identity and the voice is used to verify this claim, this is called verification or authentication. On the other hand, identification is the task of determining an unknown speaker's identity. In a sense, speaker verification is a 1:1 match where one speaker's voice is matched to one template (also called a "voice print" or "voice model"), whereas speaker identification is a 1:N match where the voice is compared against N templates.
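The 1:1 versus 1:N distinction can be illustrated with a minimal sketch. It assumes each voice print is a short list of numeric features and uses cosine similarity as the matching score; the feature values, the names `verify` and `identify`, and the 0.9 threshold are all hypothetical and not taken from any particular system.

```python
import math

def cosine_similarity(a, b):
    """Similarity between two feature vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(utterance, claimed_template, threshold=0.9):
    """1:1 match: compare the utterance against the single claimed template."""
    return cosine_similarity(utterance, claimed_template) >= threshold

def identify(utterance, templates):
    """1:N match: compare against all N templates and return the best (name, score)."""
    scored = ((name, cosine_similarity(utterance, t)) for name, t in templates.items())
    return max(scored, key=lambda pair: pair[1])

# Toy voice prints (assumed values, for illustration only).
templates = {"alice": [1.0, 0.0, 0.2], "bob": [0.0, 1.0, 0.3]}
utterance = [0.9, 0.1, 0.2]

accepted_as_alice = verify(utterance, templates["alice"])
accepted_as_bob = verify(utterance, templates["bob"])
best_name, best_score = identify(utterance, templates)
```

Verification needs one comparison regardless of how many speakers are enrolled, while identification scales with the number of templates.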

From a security perspective, identification is different from verification. For example, presenting your passport at border control is a verification process: the agent compares your face to the picture in the document. Conversely, a police officer comparing a sketch of an assailant against a database of previously documented criminals to find the closest match(es) is an identification process.

Speaker verification is usually employed as a "gatekeeper" in order to provide access to a secure system (e.g., telephone banking). These systems operate with the user's knowledge and typically require their cooperation. Speaker identification systems can also be implemented covertly without the user's knowledge to identify talkers in a discussion, alert automated systems of speaker changes, check if a user is already enrolled in a system, etc.

In forensic applications, it is common to first perform a speaker identification process to create a list of "best matches" and then perform a series of verification processes to determine a conclusive match.
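The two-stage forensic workflow can be sketched as a shortlist step followed by a stricter per-candidate check. This is only an illustration under assumed conditions: voice prints are toy feature vectors, Euclidean distance stands in for a real scoring method, and the names `shortlist` and `confirm` and the distance limit are hypothetical.

```python
import math

def shortlist(sample, database, k=2):
    """Identification stage: rank all entries and keep the k closest ("best matches")."""
    ranked = sorted(database, key=lambda name: math.dist(sample, database[name]))
    return ranked[:k]

def confirm(sample, database, candidates, max_distance):
    """Verification stage: keep only candidates that pass a strict 1:1 check."""
    return [name for name in candidates
            if math.dist(sample, database[name]) <= max_distance]

# Toy database of voice prints (assumed values).
database = {
    "suspect_a": [0.2, 0.8, 0.5],
    "suspect_b": [0.3, 0.7, 0.5],
    "suspect_c": [0.9, 0.1, 0.4],
}
sample = [0.22, 0.79, 0.52]

candidates = shortlist(sample, database, k=2)
matches = confirm(sample, database, candidates, max_distance=0.05)
```

In practice the verification stage would use a more careful comparison than the ranking stage; here both use the same distance for brevity.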

Variants of speaker recognition

Each speaker recognition system has two phases: enrollment and verification. During enrollment, the speaker's voice is recorded and typically a number of features are extracted to form a voice print, template, or model. In the verification phase, a speech sample or "utterance" is compared against a previously created voice print. For identification systems, the utterance is compared against multiple voice prints in order to determine the best match(es), while verification systems compare an utterance against a single voice print. Because only one comparison is involved, verification is faster than identification.
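The two phases can be sketched as follows, assuming that feature extraction has already turned each recording into a fixed-length list of numbers. Averaging the enrollment recordings is one simple way to form a template, not the method of any specific product; the function names, the scoring rule and the threshold are hypothetical.

```python
from statistics import fmean

def enroll(recordings):
    """Enrollment: average each feature across recordings to form a voice print."""
    return [fmean(values) for values in zip(*recordings)]

def match_score(utterance, voice_print):
    """Negative mean squared difference; values closer to 0 mean a closer match."""
    return -fmean((u - v) ** 2 for u, v in zip(utterance, voice_print))

def verify(utterance, voice_print, threshold=-0.05):
    """Verification: accept if the utterance is close enough to the voice print."""
    return match_score(utterance, voice_print) >= threshold

# Three enrollment recordings of the same speaker (assumed feature values).
recordings = [[1.0, 2.0, 3.0], [1.2, 1.8, 3.0], [0.8, 2.2, 3.0]]
voice_print = enroll(recordings)

genuine = verify([1.1, 2.1, 2.9], voice_print)
impostor = verify([2.5, 0.5, 1.0], voice_print)
```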

Speaker recognition systems fall into two categories: text-dependent and text-independent.

Text-Dependent:

If the text must be the same for enrollment and verification, this is called text-dependent recognition. In a text-dependent system, prompts can either be common across all speakers (e.g., a common pass phrase) or unique. In addition, the use of shared secrets (e.g., passwords and PINs) or knowledge-based information can be employed in order to create a multi-factor authentication scenario.
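A multi-factor decision of this kind might be combined as in the sketch below. It assumes that an upstream component has already produced a voice match score and a transcript of what was spoken; the function name `authenticate`, its parameters and the 0.8 threshold are hypothetical.

```python
import hmac

def authenticate(voice_score, spoken_text, pin, *,
                 expected_phrase, expected_pin, threshold=0.8):
    """Accept only if the voice, the pass phrase and the PIN all check out."""
    voice_ok = voice_score >= threshold                                # something you are
    phrase_ok = spoken_text.strip().lower() == expected_phrase.lower()  # text-dependent prompt
    pin_ok = hmac.compare_digest(pin, expected_pin)                     # shared secret
    return voice_ok and phrase_ok and pin_ok

all_factors = authenticate(0.93, "My voice is my password", "4821",
                           expected_phrase="my voice is my password",
                           expected_pin="4821")
wrong_pin = authenticate(0.93, "My voice is my password", "0000",
                         expected_phrase="my voice is my password",
                         expected_pin="4821")
weak_voice = authenticate(0.40, "My voice is my password", "4821",
                          expected_phrase="my voice is my password",
                          expected_pin="4821")
```

`hmac.compare_digest` is used for the PIN so that the comparison time does not depend on how many leading characters match.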

Text-Independent:

Text-independent systems are most often used for speaker identification as they require very little if any cooperation by the speaker. In this case the text during enrollment and test is different. In fact, the enrollment may happen without the user's knowledge, as is the case for many forensic applications. As text-independent technologies do not compare what was said at enrollment and verification, verification applications tend to also employ speech recognition to determine what the user is saying at the point of authentication.

Technology

The various technologies used to process and store voice prints include frequency estimation, hidden Markov models, Gaussian mixture models, pattern matching algorithms, neural networks, matrix representation, vector quantization and decision trees. Some systems also use "anti-speaker" techniques, such as cohort models and world models.
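One common way to use a world model is to score an utterance against both the claimed speaker's model and a model of speakers in general, and to accept only when the speaker model explains the audio better. The sketch below assumes a single diagonal-covariance Gaussian per model as a simplified stand-in for a full Gaussian mixture model; all parameter values and function names are hypothetical.

```python
import math

def avg_log_likelihood(frames, mean, var):
    """Average per-frame log-likelihood under a diagonal Gaussian."""
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)

def log_likelihood_ratio(frames, speaker_model, world_model):
    """Positive values favor the claimed speaker over the world model."""
    return (avg_log_likelihood(frames, *speaker_model)
            - avg_log_likelihood(frames, *world_model))

# Each model is (mean, variance) per feature dimension (assumed values).
speaker_model = ([1.0, 2.0], [0.5, 0.5])
world_model = ([0.0, 0.0], [2.0, 2.0])

genuine_frames = [[1.1, 1.9], [0.9, 2.2]]
impostor_frames = [[0.1, -0.2], [-0.3, 0.1]]

genuine_llr = log_likelihood_ratio(genuine_frames, speaker_model, world_model)
impostor_llr = log_likelihood_ratio(impostor_frames, speaker_model, world_model)
```

A cohort model works along similar lines, except that the comparison is made against models of a selected set of other speakers rather than one general model.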

Ambient noise levels can impede collection of both the initial and subsequent voice samples. Noise reduction algorithms can be employed to improve accuracy, but incorrect application can have the opposite effect. Performance degradation can result from changes in behavioral attributes of the voice and from enrollment using one telephone and verification on another telephone ("cross channel"). Integration with two-factor authentication products is expected to increase. Voice changes due to ageing may impact system performance over time. Some systems adapt the speaker models after each successful verification to capture such long-term changes in the voice, though there is debate regarding the overall security impact imposed by automated adaptation.
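Adaptation after a successful verification could look like the sketch below, which nudges the stored voice print toward each accepted utterance with an exponential moving average. This is one possible update rule, assumed here for illustration; the function name `adapt` and the `rate` parameter are hypothetical.

```python
def adapt(voice_print, utterance, accepted, rate=0.1):
    """Move the voice print toward the utterance, but only after an accepted verification."""
    if not accepted:
        return list(voice_print)  # rejected attempts never change the model
    return [(1 - rate) * v + rate * u for v, u in zip(voice_print, utterance)]

voice_print = [1.0, 2.0]
after_accept = adapt(voice_print, [2.0, 2.0], accepted=True, rate=0.5)
after_reject = adapt(voice_print, [2.0, 2.0], accepted=False, rate=0.5)
```

The security concern mentioned above is visible here: if an impostor is ever wrongly accepted, the model drifts toward the impostor's voice, so a small `rate` limits how far a single false acceptance can move it.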

Capture of the biometric is seen as non-invasive. The technology traditionally uses existing microphones and voice transmission technology, allowing recognition over long distances via ordinary telephones (wired or wireless).

Digitally recorded and analogue recorded voice identification both use electronic measurements as well as critical listening skills, which a forensic expert must apply for the identification to be accurate.

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 