Spell checker


 
 

In computing, a spell checker is an application program that flags words in a document that may not be spelled correctly. Spell checkers may be stand-alone, capable of operating on a block of text, or part of a larger application, such as a word processor, email client, electronic dictionary, or search engine.

Operation

Simple spell checkers operate on individual words by comparing each of them against the contents of a dictionary, possibly performing stemming on the word. If the word is not found, it is considered to be an error, and an attempt may be made to suggest a word that was likely to have been intended. One such suggestion algorithm is to list those words in the dictionary having a small Levenshtein distance from the original word.
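
The suggestion step can be sketched as follows. This is a minimal illustration, not the method of any particular product: the function names `levenshtein` and `suggest`, the `max_distance` threshold of 2, and the brute-force scan over the whole dictionary are all assumptions made for the example.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # previous[j] is the distance between the prefix of a processed
    # so far (minus its last character) and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def suggest(word, dictionary, max_distance=2):
    """Return dictionary words within max_distance edits, closest first."""
    scored = sorted((levenshtein(word, w), w) for w in dictionary)
    return [w for distance, w in scored if distance <= max_distance]
```

Comparing the input against every dictionary entry is slow for large word lists, so a practical checker would probably restrict the candidates first.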

When a word that is not in the dictionary is encountered, most spell checkers provide an option to add that word to a list of known exceptions that should not be flagged.

Design

A spell checker customarily consists of two parts:
  1. A set of routines for scanning text and extracting words, and
  2. An algorithm for comparing the extracted words against a known list of correctly spelled words (i.e., the dictionary).
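
The two parts listed above can be sketched in a few lines. The regular expression, the function names `extract_words` and `find_unknown`, and the choice to fold case before lookup are assumptions for illustration only.

```python
import re

# A word is a run of ASCII letters, optionally with internal apostrophes.
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def extract_words(text):
    """Part 1: scan the text and yield (offset, word) pairs."""
    for match in WORD_PATTERN.finditer(text):
        yield match.start(), match.group()


def find_unknown(text, word_list):
    """Part 2: compare each extracted word against the word list
    and return the (offset, word) pairs that are not in it."""
    return [(offset, word)
            for offset, word in extract_words(text)
            if word.lower() not in word_list]
```

Keeping the offset alongside each word lets a user interface highlight the flagged position in the original text.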


The scanning routines sometimes include language-dependent algorithms for handling morphology. Even for a lightly inflected language like English, word extraction routines will need to handle such phenomena as contractions and possessives. It is unclear whether morphological analysis provides a significant benefit.
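
As a rough sketch of what handling contractions and possessives might involve in English, the lookup below accepts a small hard-coded set of contractions and strips a possessive ending before consulting the word list. The `CONTRACTIONS` set and the function name `is_known` are hypothetical, and the rules are far from complete.

```python
# Illustrative, deliberately tiny list; a real checker would need far more.
CONTRACTIONS = {"don't", "isn't", "it's", "can't", "won't"}


def is_known(token, word_list):
    """Return True if token is a listed word, a listed contraction,
    or the possessive form of a listed word."""
    word = token.lower()
    if word in word_list or word in CONTRACTIONS:
        return True
    if word.endswith("'s"):    # singular possessive: dog's -> dog
        return word[:-2] in word_list
    if word.endswith("s'"):    # plural possessive: dogs' -> dogs
        return word[:-1] in word_list
    return False
```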


The word list might contain just a list of words, or it might also contain additional information, such as hyphenation points or lexical and grammatical attributes.

As an adjunct to these two components, the program's user interface will allow users to approve replacements and modify the program's operation.

An exception to the above paradigm is the class of spell checkers that rely solely on statistical information, for instance using n-grams. This approach usually requires a lot of effort to obtain sufficient statistical information and may require a lot more runtime storage. These methods are not currently in general use. In some cases spell checkers use a fixed list of misspellings and suggestions for those misspellings; this less flexible approach is often used in paper-based correction methods, such as the see also entries of encyclopedias.
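
One way a purely statistical checker could work is to count character n-grams in a corpus of trusted text and flag any word containing an n-gram that was never (or rarely) seen. The sketch below assumes character trigrams, the boundary markers `^` and `$`, and the hypothetical function names shown; it is an illustration of the general idea, not a description of a specific system.

```python
from collections import Counter


def char_ngrams(word, n=3):
    """Character n-grams of a word, with ^ and $ marking its boundaries."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(corpus_words, n=3):
    """Count every character n-gram that occurs in the corpus."""
    counts = Counter()
    for word in corpus_words:
        counts.update(char_ngrams(word, n))
    return counts


def looks_misspelled(word, counts, n=3, min_count=1):
    """Flag a word if any of its n-grams occurs fewer than min_count times."""
    return any(counts[gram] < min_count for gram in char_ngrams(word, n))
```

With so little training data almost everything would be flagged, which reflects the point above about the effort needed to gather sufficient statistics. Note also that such a checker accepts any word built from familiar n-grams, whether or not the word itself appeared in the corpus.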

History

The first spell checkers were widely available on mainframe computers in the late 1970s. A group of six linguists from Georgetown University developed the first spell-check system for the IBM corporation. The first spell checkers for personal computers appeared for CP/M and TRS-80 computers in 1980, followed by packages for the IBM PC after it was introduced in 1981. Developers such as Maria Mariani, Soft-Art, Microlytics, Proximity, Circle Noetics, and Reference Software rushed OEM packages or end-user products into the rapidly expanding software market, primarily for the PC but also for Apple Macintosh, VAX, and Unix. On the PCs, these spell checkers were standalone programs, many of which could be run in TSR mode from within word-processing packages on PCs with sufficient memory.

However, the market for standalone packages was short-lived, as by the mid-1980s developers of popular word-processing packages like WordStar and WordPerfect had incorporated spell checkers in their packages, mostly licensed from the above companies, who quickly expanded support from just English to European and eventually even Asian languages. However, this required increasing sophistication in the morphology routines of the software, particularly with regard to heavily inflected languages like Hungarian and Finnish. Although the size of the word-processing market in a country like Iceland might not have justified the investment of implementing a spell checker, companies like WordPerfect nonetheless strove to localize their software for as many national markets as possible as part of their global marketing strategy.

Recently, spell checking has moved beyond word processors: Firefox 2.0, a web browser, has spell check support for user-written content, such as when editing Wikitext or writing on many webmail sites, blogs, and social networking websites. The web browsers Konqueror and Opera, the email client KMail, and the instant messaging client Pidgin also offer spell checking support, transparently using GNU Aspell as their engine.
Mac OS X now has spell check in virtually all bundled applications, and many third-party applications take advantage of this as well. Safari, Mail, iChat, and more all have spell check capability.

Functionality

The first spell checkers were "verifiers" instead of "correctors." They offered no suggestions for incorrectly spelled words. This was helpful for typos but it was not so helpful for logical or phonetic errors. The challenge the developers faced was the difficulty in offering useful suggestions for misspelled words. This requires reducing words to a skeletal form and applying pattern-matching algorithms.
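
The source does not say what the "skeletal form" looked like. One plausible reading, assumed here purely for illustration, is a key that keeps the first letter, drops later vowels, and collapses repeated letters, so that a misspelling and its intended word often share the same key. The names `skeleton`, `build_index`, and `suggest_by_skeleton` are hypothetical.

```python
from collections import defaultdict

VOWELS = set("aeiou")


def skeleton(word):
    """Reduce a word to a rough consonant skeleton: keep the first
    letter, drop later vowels, and skip a letter equal to the last kept one."""
    kept = []
    for position, letter in enumerate(word.lower()):
        if position > 0 and letter in VOWELS:
            continue
        if kept and kept[-1] == letter:
            continue
        kept.append(letter)
    return "".join(kept)


def build_index(word_list):
    """Map each skeleton to the sorted dictionary words that share it."""
    index = defaultdict(list)
    for word in sorted(word_list):
        index[skeleton(word)].append(word)
    return dict(index)


def suggest_by_skeleton(word, index):
    """Suggest dictionary words whose skeleton matches the input's."""
    return index.get(skeleton(word), [])
```

A key like this catches vowel confusions and doubled-letter mistakes but misses transpositions, so it would likely be combined with other matching techniques.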

It might seem logical that where spell-checking dictionaries are concerned, "the bigger, the better," so that correct words are not marked as incorrect. In practice, however, an optimal size for English appears to be around 90,000 entries. If there are more than this, incorrectly spelled words may be skipped because they are mistaken for others. For example, a linguist might determine on the basis of corpus linguistics that the word baht is more frequently a misspelling of bath or bat than a reference to the Thai currency. Hence, it would typically be more useful if a few people who write about Thai currency were slightly inconvenienced than if the spelling errors of the many more people who discuss baths were overlooked.

The first MS-DOS spell checkers were mostly used in proofing mode from within word processing packages. After preparing a document, a user scanned the text looking for misspellings. Later, however, batch processing was offered in such packages as Oracle's short-lived CoAuthor. This allowed a user to view the results after a document was processed and only correct the words that he or she knew to be wrong. When memory and processing power became abundant, spell checking was performed in the background in an interactive way, as in Sector Software's Spellbound program, released in 1987, and in Microsoft Word since Word 95.

In recent years, spell checkers have become increasingly sophisticated; some are now capable of recognizing simple grammatical errors. However, even at their best, they rarely catch all the errors in a text (such as homonym errors) and will flag neologisms and foreign words as misspellings.

Spell-checking other languages

English is unusual in that most words used in formal writing have a single spelling that can be found in a typical dictionary, with the exception of some jargon and modified words. In many languages, however, words are routinely combined in new ways. In German, compound nouns are frequently coined from other existing nouns. Some scripts do not clearly separate one word from another, requiring word-splitting algorithms. Each of these presents unique challenges to non-English language spell checkers.
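
For compounding languages, a checker cannot list every possible word, so one option is to accept a word if it can be split into known parts. The sketch below assumes a lowercase lexicon, an optional linking "s" between parts (as in German Arbeit + s + Zeit), and the hypothetical function name `is_valid_compound`; real German compounding has more linking elements and restrictions than this handles.

```python
from functools import lru_cache


def is_valid_compound(word, lexicon, linkers=("", "s")):
    """Return True if word is in the lexicon or can be segmented into
    lexicon entries, optionally joined by one of the linking elements."""
    word = word.lower()

    @lru_cache(maxsize=None)
    def can_split(start):
        # Try every lexicon entry that begins at position start.
        for end in range(start + 1, len(word) + 1):
            if word[start:end] not in lexicon:
                continue
            if end == len(word):
                return True
            # A linker is only allowed between two parts, never at the end.
            for link in linkers:
                next_start = end + len(link)
                if (next_start < len(word)
                        and word.startswith(link, end)
                        and can_split(next_start)):
                    return True
        return False

    return can_split(0)
```

Accepting every decomposable string also lets some misspellings through, which mirrors the dictionary-size trade-off discussed earlier.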

Context-sensitive spell checkers

Recently, research has focused on developing algorithms that are capable of recognizing a misspelled word, even if the word itself is in the vocabulary, based on the context of the surrounding words. Not only does this allow such real-word errors to be caught, but it also mitigates the detrimental effect of enlarging dictionaries, allowing more words to be recognized. The most common examples of errors caught by such a system are homophone errors, such as those in the following sentence:
Their coming too sea if its reel.


The most successful algorithm to date is Andrew Golding and Dan Roth's "Winnow-based spelling correction algorithm", published in 1999, which is able to recognize about 96% of context-sensitive spelling errors, in addition to ordinary non-word spelling errors. A context-sensitive spell checker appears in Microsoft Office 2007.
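
The general idea of using context can be illustrated with a much simpler technique than the Winnow-based algorithm mentioned above; the sketch below is not that algorithm. It assumes hand-written confusion sets, co-occurrence counts gathered from a few example sentences, and the hypothetical function names shown. A word is flagged when another member of its confusion set has been seen more often next to the surrounding words.

```python
from collections import Counter


def context_of(tokens, index, window):
    """Tokens within `window` positions of tokens[index], excluding itself."""
    low = max(0, index - window)
    high = min(len(tokens), index + window + 1)
    return [tokens[j] for j in range(low, high) if j != index]


def train(sentences, window=2):
    """Count how often each word appears near each neighbouring word."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            for neighbour in context_of(tokens, i, window):
                counts[(token, neighbour)] += 1
    return counts


def check(sentence, counts, confusion_sets, window=2):
    """Return (position, word, better_candidate) for each word whose
    confusion-set rival fits the surrounding context better."""
    tokens = sentence.lower().split()
    flagged = []
    for i, token in enumerate(tokens):
        context = context_of(tokens, i, window)

        def score(candidate):
            return sum(counts[(candidate, c)] for c in context)

        for confusion_set in confusion_sets:
            if token in confusion_set:
                best = max(sorted(confusion_set), key=score)
                if best != token and score(best) > score(token):
                    flagged.append((i, token, best))
    return flagged
```

A usable system would need a large training corpus and a more robust classifier, which is the problem the published research addresses.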

See also

  • Nearest neighbor (pattern recognition)
  • Record linkage problem
  • Spelling suggestion
  • Grammar checker


External links