Soundex
Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms, as it is a standard feature of MS SQL Server and Oracle, and is often used (incorrectly) as a synonym for "phonetic algorithm". Improvements to Soundex are the basis for many modern phonetic algorithms.

History

Soundex was developed by Robert C. Russell and Margaret K. Odell and patented in 1918 and 1922. A variation called American Soundex was used in the 1930s for a retrospective analysis of the US censuses from 1890 through 1920. The Soundex code came to prominence in the 1960s when it was the subject of several articles in the Communications and Journal of the Association for Computing Machinery, and especially when described in Donald Knuth's The Art of Computer Programming.

The National Archives and Records Administration (NARA) maintains the current rule set for the official implementation of Soundex used by the U.S. Government. These encoding rules are available from NARA, upon request, in the form of General Information Leaflet 55, "Using the Census Soundex".

Rules

American Soundex differs from the original algorithm; its rules are as follows.

The Soundex code for a name consists of a letter followed by three numerical digits: the letter is the first letter of the name, and the digits encode the remaining consonants. Similar-sounding consonants share the same digit so, for example, the labial consonants B, F, P, and V are each encoded as the number 1.

The correct value can be found as follows:
  1. Retain the first letter of the name and drop all other occurrences of a, e, h, i, o, u, w, y.
  2. Replace consonants with digits as follows (after the first letter):
    • b, f, p, v => 1
    • c, g, j, k, q, s, x, z => 2
    • d, t => 3
    • l => 4
    • m, n => 5
    • r => 6
  3. Two adjacent letters with the same number (in the original name, before step 1) are coded as a single number. Letters with the same number separated by h or w are also coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
  4. Continue until you have one letter and three numbers. If you run out of letters, fill in 0s until there are three numbers.


Using this algorithm, both "Robert" and "Rupert" return the same string "R163", while "Rubin" yields "R150". "Ashcraft" and "Ashcroft" both yield "A261" and not "A226": the letters 's' and 'c' in "Ashcraft" receive a single 2 rather than 22, even though an 'h' lies between them and they are not the same letter.
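
The rules can be sketched in a few lines of Python. This is a minimal illustration, not the official NARA implementation. It assumes that only h and w are transparent when comparing neighbouring digits, that a vowel (or y) between two letters of the same group lets both be coded, that the first letter takes part in the duplicate check, and that non-letters are ignored. The names `soundex` and `_CODES` are chosen for this example only.

```python
# Letter groups from the rules above; letters not listed here
# (a, e, i, o, u, y, h, w) are never turned into a digit.
_CODES = {}
for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in group:
        _CODES[ch] = digit


def soundex(name: str) -> str:
    """Return a letter plus three digits, or "" if the name has no a-z letters."""
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    first = letters[0]
    code = first.upper()
    prev = _CODES.get(first, "")  # digit of the previous letter, "" if none
    for ch in letters[1:]:
        if ch in "hw":
            continue  # assumption: h and w do not separate equal digits
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel sets prev to "", so a repeat after it is coded
    return (code + "000")[:4]  # pad with zeros, then keep letter + 3 digits


for name in ("Robert", "Rupert", "Rubin", "Ashcraft", "Ashcroft"):
    print(name, soundex(name))
```

Run as written, the loop prints R163 for "Robert" and "Rupert", R150 for "Rubin", and A261 for both "Ashcraft" and "Ashcroft", matching the examples above.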

Soundex variants

A similar algorithm called "Reverse Soundex" prefixes the last letter of the name instead of the first.

The NYSIIS algorithm was introduced by the New York State Identification and Intelligence System in 1970 as an improvement to the Soundex algorithm. NYSIIS handles some multi-character n-grams and maintains relative vowel positioning, whereas Soundex does not.

Daitch–Mokotoff Soundex (D–M Soundex) was developed in 1985 by genealogist Gary Mokotoff and later improved by genealogist Randy Daitch because of problems they encountered while trying to apply the Russell Soundex to Jews with Germanic or Slavic surnames (such as Moskowitz vs. Moskovitz or Levine vs. Lewin). D–M Soundex is sometimes referred to as "Jewish Soundex" or "Eastern European Soundex", although the authors discourage the use of these nicknames. The D–M Soundex algorithm can return as many as 32 individual phonetic encodings for a single name. Results of D–M Soundex are returned in an all-numeric format between 100000 and 999999. The algorithm is much more complex than Russell Soundex.

As a response to deficiencies in the Soundex algorithm, Lawrence Philips developed the Metaphone algorithm in 1990 for the same purpose. Philips developed an improvement to Metaphone in 2000, which he called Double Metaphone. Double Metaphone includes a much larger encoding rule set than its predecessor, handles a subset of non-Latin characters, and returns a primary and a secondary encoding to account for different pronunciations of a single word in English.

See also

  • Phonetic algorithm
  • Metaphone
  • New York State Identification and Intelligence System
  • Match Rating Approach

External links


Programming algorithms for soundex


  • Soundex in PostgreSQL
  • Soundex Tcl package from the tcllib library

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 