Metaphone
Encyclopedia
Metaphone is a phonetic algorithm
Phonetic algorithm
A phonetic algorithm is an algorithm for indexing of words by their pronunciation. Most phonetic algorithms were developed for use with the English language; consequently, applying the rules to words in other languages might not give a meaningful result....

, an algorithm published in 1990 for indexing words by their English pronunciation. It fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding, which does a better job of matching words and names which sound similar. As with Soundex, similar sounding words should share the same keys. Metaphone is available as a built-in operator in a number of systems, including later versions of PHP
PHP
PHP is a general-purpose server-side scripting language originally designed for web development to produce dynamic web pages. For this purpose, PHP code is embedded into the HTML source document and interpreted by a web server with a PHP processor module, which generates the web page document...

.

The original author later produced a new version of the algorithm, which he named Double Metaphone. Contrary to the original algorithm whose application is limited to English only, this version takes into account spelling peculiarities of a number of other languages. In 2009 Lawrence Philips released a third version, called Metaphone 3, which achieves an accuracy of approximately 99% for English words, non-English words familiar to Americans, and first names and family names commonly found in the United States, having been developed according to modern engineering standards against a test harness of prepared correct encodings.

Procedure

Metaphone codes use the 16 consonant
Consonant
In articulatory phonetics, a consonant is a speech sound that is articulated with complete or partial closure of the vocal tract. Examples are , pronounced with the lips; , pronounced with the front of the tongue; , pronounced with the back of the tongue; , pronounced in the throat; and ,...

 symbols 0BFHJKLMNPRSTWXY. The '0' represents "th
Th (digraph)
Th is a digraph in the Roman alphabet. It is the most common digraph in order of frequency in the English language.-Cluster /t.h/:The most literal use of ⟨th⟩ is to represent a consonant cluster of /t/ and /h/ as in English knighthood...

" (as an ASCII
ASCII
The American Standard Code for Information Interchange is a character-encoding scheme based on the ordering of the English alphabet. ASCII codes represent text in computers, communications equipment, and other devices that use text...

 approximation of Θ
Theta
Theta is the eighth letter of the Greek alphabet, derived from the Phoenician letter Teth...

), 'X' represents "sh
Sh (digraph)
Sh is a digraph of the Latin alphabet, a combination of S and H.-English:In English, sh usually represents . The exception is in compound words, where the s and h are not a digraph, but pronounced separately, e.g. hogshead is hogs-head , not *hog-shead...

" or "ch
Ch (digraph)
Ch is a digraph in the Roman alphabet and Uyghur. It is treated as a letter of its own in Chamorro, Czech, Slovak, Polish, Igbo, Quechua, Guarani, Welsh, Cornish, Breton and Belarusian Łacinka alphabets. In Vietnamese, it also used to be considered a letter for collation purposes but this is no...

", and the others represent their usual English pronunciations. The vowels AEIOU are also used, but only at the beginning of the code.
  1. Drop duplicate adjacent letters, except for C.
  2. If the word begins with 'KN', 'GN', 'PN', 'AE', 'WR', drop the first letter.
  3. Drop 'B' if after 'M' and if it is at the end of the word.
  4. 'C' transforms to 'X' if followed by 'IA' or 'H' (unless in latter case, it is part of '-SCH-', in which case it transforms to 'K'). 'C' transforms to 'S' if followed by 'I', 'E', or 'Y'. Otherwise, 'C' transforms to 'K'.
  5. 'D' transforms to 'J' if followed by 'GE', 'GY', or 'GI'. Otherwise, 'D' transforms to 'T'.
  6. Drop 'G' if followed by 'H' and 'H' is not at the end or before a vowel. Drop 'G' if followed by 'N' or 'NED' and is at the end.
  7. 'G' transforms to 'J' if before 'I', 'E', or 'Y', and it is not in 'GG'. Otherwise, 'G' transforms to 'K'.
  8. Drop 'H' if after vowel and not before a vowel.
  9. 'CK' transforms to 'K'.
  10. 'PH' transforms to 'F'.
  11. 'Q' transforms to 'K'.
  12. 'S' transforms to 'X' if followed by 'H', 'IO', or 'IA'.
  13. 'T' transforms to 'X' if followed by 'IA' or 'IO'. 'TH' transforms to '0'. Drop 'T' if followed by 'CH'.
  14. 'V' transforms to 'F'.
  15. 'WH' transforms to 'W' if at the beginning. Drop 'W' if not followed by a vowel.
  16. 'X' transforms to 'S' if at the beginning. Otherwise, 'X' transforms to 'KS'.
  17. Drop 'Y' if not followed by a vowel.
  18. 'Z' transforms to 'S'.
  19. Drop all vowels unless it is the beginning.

Double Metaphone

The Double Metaphone phonetic encoding algorithm is the second generation of this algorithm. Its implementation was described in the June 2000 issue of C/C++ Users Journal
C/C++ Users Journal
C/C++ Users Journal was a computer magazine published by CMP Media LLC in the United States. The magazine concentrated on the C++ programming language and was one of the last printed magazines to cover the topic....

.

It is called "Double" because it can return both a primary and a secondary code for a string; this accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry. For example, encoding the name "Smith" yields a primary code of SM0 and a secondary code of XMT, while the name "Schmidt" yields a primary code of XMT and a secondary code of SMT--both have XMT in common.

Double Metaphone tries to account for myriad irregularities in English
English language
English is a West Germanic language that arose in the Anglo-Saxon kingdoms of England and spread into what was to become south-east Scotland under the influence of the Anglian medieval kingdom of Northumbria...

 of Slavic
Slavic languages
The Slavic languages , a group of closely related languages of the Slavic peoples and a subgroup of Indo-European languages, have speakers in most of Eastern Europe, in much of the Balkans, in parts of Central Europe, and in the northern part of Asia.-Branches:Scholars traditionally divide Slavic...

, Germanic
Germanic languages
The Germanic languages constitute a sub-branch of the Indo-European language family. The common ancestor of all of the languages in this branch is called Proto-Germanic , which was spoken in approximately the mid-1st millennium BC in Iron Age northern Europe...

, Celtic
Celtic languages
The Celtic languages are descended from Proto-Celtic, or "Common Celtic"; a branch of the greater Indo-European language family...

, Greek
Greek language
Greek is an independent branch of the Indo-European family of languages. Native to the southern Balkans, it has the longest documented history of any Indo-European language, spanning 34 centuries of written records. Its writing system has been the Greek alphabet for the majority of its history;...

, French
French language
French is a Romance language spoken as a first language in France, the Romandy region in Switzerland, Wallonia and Brussels in Belgium, Monaco, the regions of Quebec and Acadia in Canada, and by various communities elsewhere. Second-language speakers of French are distributed throughout many parts...

, Italian
Italian language
Italian is a Romance language spoken mainly in Europe: Italy, Switzerland, San Marino, Vatican City, by minorities in Malta, Monaco, Croatia, Slovenia, France, Libya, Eritrea, and Somalia, and by immigrant communities in the Americas and Australia...

, Spanish
Spanish language
Spanish , also known as Castilian , is a Romance language in the Ibero-Romance group that evolved from several languages and dialects in central-northern Iberia around the 9th century and gradually spread with the expansion of the Kingdom of Castile into central and southern Iberia during the...

, Chinese
Chinese language
The Chinese language is a language or language family consisting of varieties which are mutually intelligible to varying degrees. Originally the indigenous languages spoken by the Han Chinese in China, it forms one of the branches of Sino-Tibetan family of languages...

, and other origin. Thus it uses a much more complex ruleset for coding than its predecessor; for example, it tests for approximately 100 different contexts of the use of the letter C alone.

Metaphone 3

Developed by the same author, this algorithm aims at further improving the technique of phonetic encoding of words in the English language, non-English words familiar to Americans, and first names and family names commonly found in the United States. The first version of Metaphone to be developed according to strict engineering standards and tested against a substantial target encoding set, it improves the accuracy of the algorithm from the approximately 89% of Double Metaphone to over 99%. The ability to encode Metaphone keys taking non-initial vowels into account, as well as encoding voiced and unvoiced consonants differently, has been added. This allows the result set to be more closely focused if the developer finds that the search results include too many words that don't resemble the search term closely enough. Development for other language versions has been announced. Metaphone 3 is sold as source code in C++, Java and C# for 40 USD each.

See also

  • Soundex
    Soundex
    Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless...

  • New York State Identification and Intelligence System
    New York State Identification and Intelligence System
    The New York State Identification and Intelligence System Phonetic Code, commonly known as NYSIIS, is a phonetic algorithm devised in 1970 as part of the New York State Identification and Intelligence System...

  • Match Rating Approach
    Match Rating Approach
    A phonetic algorithm developed by Western Airlines in 1977 for the indexation and comparison of homophonous names.The algorithm itself has a simple set of encoding rules but a more lengthy set of comparison rules....


External links


Metaphone Implementations

  • Metaphone implementation in T-SQL
  • Soundex, Metaphone, and Double Metaphone implementation in Java
    Java (programming language)
    Java is a programming language originally developed by James Gosling at Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java platform. The language derives much of its syntax from C and C++ but has a simpler object model and fewer low-level facilities...

  • Soundex, Metaphone, Caverphone implementation in Python
    Python (programming language)
    Python is a general-purpose, high-level programming language whose design philosophy emphasizes code readability. Python claims to "[combine] remarkable power with very clear syntax", and its standard library is large and comprehensive...

  • Text::Metaphone Perl
    Perl
    Perl is a high-level, general-purpose, interpreted, dynamic programming language. Perl was originally developed by Larry Wall in 1987 as a general-purpose Unix scripting language to make report processing easier. Since then, it has undergone many changes and revisions and become widely popular...

     module from CPAN
    CPAN
    CPAN, the Comprehensive Perl Archive Network, is an archive of nearly 100,000 modules of software written in Perl, as well as documentation for it. It has a presence on the World Wide Web at and is mirrored worldwide at more than 200 locations...

  • Text::DoubleMetaphone Perl
    Perl
    Perl is a high-level, general-purpose, interpreted, dynamic programming language. Perl was originally developed by Larry Wall in 1987 as a general-purpose Unix scripting language to make report processing easier. Since then, it has undergone many changes and revisions and become widely popular...

     module from CPAN
    CPAN
    CPAN, the Comprehensive Perl Archive Network, is an archive of nearly 100,000 modules of software written in Perl, as well as documentation for it. It has a presence on the World Wide Web at and is mirrored worldwide at more than 200 locations...

  • OCaml implementation of Double Metaphone
  • PHP
    PHP
    PHP is a general-purpose server-side scripting language originally designed for web development to produce dynamic web pages. For this purpose, PHP code is embedded into the HTML source document and interpreted by a web server with a PHP processor module, which generates the web page document...

     implementation by Stephen Woodbridge
  • Ruby implementation included in http://english.rubyforge.org
  • Ruby implementation included in http://rubyforge.org/projects/text/
  • 4GL implementation by Robert Minter
  • CodeProject's article about double metaphone implementations
  • FileMaker Pro custom function, requiring FileMaker Pro Advanced to implement
  • Spanish Metaphone in PHP (First post), from a comment in the PHP Metaphone Manual Page
  • Brazilian Portuguese in C Metaphone for Brazilian Portuguese, in C with PHP and PostgreSQL port.
  • natural - javascript (nodejs) natural language toolkit

Double Metaphone Implementations

  • C++
    C++
    C++ is a statically typed, free-form, multi-paradigm, compiled, general-purpose programming language. It is regarded as an intermediate-level language, as it comprises a combination of both high-level and low-level language features. It was developed by Bjarne Stroustrup starting in 1979 at Bell...

     see: http://web.archive.org/web/20080101012741/http://www.cuj.com/documents/s=8038/cuj0006philips/
  • C# see: http://www.codeproject.com/KB/recipes/dmetaphone5.aspx
  • Perl
    Perl
    Perl is a high-level, general-purpose, interpreted, dynamic programming language. Perl was originally developed by Larry Wall in 1987 as a general-purpose Unix scripting language to make report processing easier. Since then, it has undergone many changes and revisions and become widely popular...

     see: http://search.cpan.org/dist/Text-DoubleMetaphone/
  • PHP
    PHP
    PHP is a general-purpose server-side scripting language originally designed for web development to produce dynamic web pages. For this purpose, PHP code is embedded into the HTML source document and interpreted by a web server with a PHP processor module, which generates the web page document...

     see: http://swoodbridge.com/DoubleMetaPhone/ and native, in C: http://pecl.php.net/package/doublemetaphone
  • Java
    Java (programming language)
    Java is a programming language originally developed by James Gosling at Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java platform. The language derives much of its syntax from C and C++ but has a simpler object model and fewer low-level facilities...

     see: http://commons.apache.org/codec/userguide.html
  • Ruby
    Ruby (programming language)
    Ruby is a dynamic, reflective, general-purpose object-oriented programming language that combines syntax inspired by Perl with Smalltalk-like features. Ruby originated in Japan during the mid-1990s and was first developed and designed by Yukihiro "Matz" Matsumoto...

     see: http://english.rubyforge.org/ and http://rubyforge.org/projects/text/
  • SQL
    SQL
    SQL is a programming language designed for managing data in relational database management systems ....

    :
    • MySQL
      MySQL
      MySQL officially, but also commonly "My Sequel") is a relational database management system that runs as a server providing multi-user access to a number of databases. It is named after developer Michael Widenius' daughter, My...

       see: see: http://www.atomodo.com/code/double-metaphone
    • PostgreSQL
      PostgreSQL
      PostgreSQL, often simply Postgres, is an object-relational database management system available for many platforms including Linux, FreeBSD, Solaris, MS Windows and Mac OS X. It is released under the PostgreSQL License, which is an MIT-style license, and is thus free and open source software...

       see: http://www.postgresql.org/docs/current/static/fuzzystrmatch.html
    • Transact-SQL
      Transact-SQL
      Transact-SQL is Microsoft's and Sybase's proprietary extension to SQL. SQL, often expanded to Structured Query Language, is a standardized computer language that was originally developed by IBM for querying, altering and defining relational databases, using declarative statements...

       see: http://www.sqlmag.com/Articles/ArticleID/26094/pg/1/1.html (full-text access requires subscription)
  • Python
    Python (programming language)
    Python is a general-purpose, high-level programming language whose design philosophy emphasizes code readability. Python claims to "[combine] remarkable power with very clear syntax", and its standard library is large and comprehensive...

     see: http://www.atomodo.com/code/double-metaphone
  • Smalltalk
    Smalltalk
    Smalltalk is an object-oriented, dynamically typed, reflective programming language. Smalltalk was created as the language to underpin the "new world" of computing exemplified by "human–computer symbiosis." It was designed and created in part for educational use, more so for constructionist...

    , Squeak
    Squeak
    The Squeak programming language is a Smalltalk implementation. It is object-oriented, class-based and reflective.It was derived directly from Smalltalk-80 by a group at Apple Computer that included some of the original Smalltalk-80 developers...

    , also with SoundEx, see: http://www.squeaksource.com/SoundsLike.html
  • Visual Basic
    Visual Basic
    Visual Basic is the third-generation event-driven programming language and integrated development environment from Microsoft for its COM programming model...

     see: http://www.snakelegs.org/2008/01/18/double-metaphone-visual-basic-implementation/
    • Visual Basic for Applications
      Visual Basic for Applications
      Visual Basic for Applications is an implementation of Microsoft's event-driven programming language Visual Basic 6 and its associated integrated development environment , which are built into most Microsoft Office applications...

      see: http://bytes.com/topic/access/answers/192513-metaphone-source-code/
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK