Speaker diarisation
Speaker diarisation is the process of partitioning an input audio stream into homogeneous segments according to speaker identity. It can enhance the readability of an automatic speech transcription by structuring the audio stream into speaker turns and, when used together with speaker recognition systems, by providing the speaker’s true identity. It is used to answer the question "who spoke when?"

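As an illustration of what a "who spoke when?" answer can look like, the following minimal sketch represents diarisation output as (start, end, speaker) triples and merges consecutive segments from the same speaker into speaker turns. The function name `merge_into_turns`, the `max_gap` parameter and the labels `spk1` and `spk2` are assumptions made for this example; they do not come from any particular diarisation system.

```python
from typing import List, Tuple

# (start time in seconds, end time in seconds, speaker label)
Segment = Tuple[float, float, str]


def merge_into_turns(segments: List[Segment], max_gap: float = 0.5) -> List[Segment]:
    """Merge consecutive same-speaker segments into speaker turns.

    Two segments are merged when they carry the same label and the pause
    between them is at most max_gap seconds (a hypothetical tolerance).
    """
    turns: List[Segment] = []
    for start, end, speaker in sorted(segments):
        if turns and turns[-1][2] == speaker and start - turns[-1][1] <= max_gap:
            prev_start, prev_end, _ = turns[-1]
            turns[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            turns.append((start, end, speaker))
    return turns


# Hypothetical diarisation output for a nine-second recording.
segments = [(0.0, 2.1, "spk1"), (2.2, 4.0, "spk1"), (4.0, 7.5, "spk2"), (7.6, 9.0, "spk1")]
for start, end, speaker in merge_into_turns(segments):
    print(f"{start:6.1f} {end:6.1f} {speaker}")
```

The labels here are anonymous. Mapping them to real names is the task of a speaker recognition system, as noted above.
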
Speaker diarisation is a combination of speaker segmentation and speaker clustering. The first aims at finding speaker change points in an audio stream. The second aims at grouping together speech segments on the basis of speaker characteristics.
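
The segmentation step can be sketched with a toy change-point detector. The example below assumes each audio frame has already been reduced to a single number (real systems use multi-dimensional acoustic features) and flags a change wherever the mean of the next few frames differs sharply from the mean of the previous few. The name `find_change_points` and the parameters `window` and `threshold` are illustrative assumptions, and this mean-difference test is a simplification, not the criterion used by any specific system.

```python
from statistics import mean
from typing import Dict, List


def find_change_points(features: List[float], window: int = 3,
                       threshold: float = 1.0) -> List[int]:
    """Return frame indices that look like speaker change points.

    For each candidate frame i, compare the mean of the `window` frames
    before i with the mean of the `window` frames starting at i. A frame
    is reported when that difference exceeds `threshold` and is a local
    maximum among its neighbouring candidates.
    """
    scores: Dict[int, float] = {}
    for i in range(window, len(features) - window + 1):
        left = mean(features[i - window:i])
        right = mean(features[i:i + window])
        scores[i] = abs(right - left)

    changes: List[int] = []
    for i, score in scores.items():
        is_peak = score >= scores.get(i - 1, 0.0) and score > scores.get(i + 1, 0.0)
        if score > threshold and is_peak:
            changes.append(i)
    return changes


# Toy one-dimensional "features": speaker A, then speaker B, then A again.
features = [0.0] * 6 + [5.0] * 6 + [0.0] * 6
print(find_change_points(features))
```

On this toy input the detector reports frames 6 and 12, which yields three segments. Deciding that the first and third segments belong to the same speaker is left to the clustering step.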

With the increasing number of broadcasts, meeting recordings and voice mails collected every year, speaker diarisation has received much attention from the speech community, as shown by the evaluations devoted to it under the auspices of the National Institute of Standards and Technology (NIST) for telephone speech, broadcast news and meetings.

Main Types of Diarisation System

In speaker diarisation, one of the most popular methods is to use a Gaussian mixture model (GMM) to model each speaker and to assign the corresponding frames to each speaker with the help of a hidden Markov model (HMM). There are two main kinds of clustering scenario. The first, by far the most popular, is called bottom-up. The algorithm starts by splitting the full audio content into a succession of clusters and progressively tries to merge the redundant clusters in order to reach a situation where each cluster corresponds to a real speaker. The second clustering strategy is called top-down. It starts with one single cluster for all the audio data and tries to split it iteratively until the number of clusters equals the number of speakers.
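
The bottom-up strategy described above can be sketched as a small agglomerative clustering loop. This is a deliberately simplified illustration: each segment is summarised by one number and clusters are compared by the distance between their centroids, whereas the systems described here model each cluster with a GMM. The function name `bottom_up_cluster` and the `stop_distance` stopping rule are assumptions made for this example.

```python
from typing import List, Optional, Tuple


def bottom_up_cluster(segment_means: List[float], stop_distance: float) -> List[List[int]]:
    """Group segment indices by repeatedly merging the two closest clusters.

    Starts with one cluster per segment and stops when the closest pair of
    clusters is farther apart than `stop_distance` (a hypothetical stopping
    rule standing in for a model-based criterion).
    """
    clusters: List[List[int]] = [[i] for i in range(len(segment_means))]

    def centroid(cluster: List[int]) -> float:
        return sum(segment_means[i] for i in cluster) / len(cluster)

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest centroid distance.
        best: Optional[Tuple[float, int, int]] = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = abs(centroid(clusters[a]) - centroid(clusters[b]))
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        if dist > stop_distance:
            break  # remaining clusters are assumed to be distinct speakers
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Five segments whose summary values suggest two speakers.
segment_means = [0.1, 5.0, 0.0, 5.2, 0.3]
print(bottom_up_cluster(segment_means, stop_distance=1.0))
```

A top-down variant would run in the opposite direction: begin with every segment in a single cluster and split it repeatedly until a stopping rule is met.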

Open Source Speaker Diarisation Software

There are some open source initiatives for speaker diarisation:


  • SHoUT: a software package developed at the University of Twente to aid speech recognition research. SHoUT is a Dutch acronym for "Speech Recognition Research at the University of Twente". http://wwwhome.cs.utwente.nl/~huijbreg/shout/

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 