Spell checker
In computing, a spell checker (or spell check) is an application program that flags words in a document that may not be spelled correctly. Spell checkers may be stand-alone, capable of operating on a block of text, or part of a larger application, such as a word processor, email client, electronic dictionary, or search engine.

Design

A spell checker customarily consists of two parts:
  1. A set of routines for scanning text and extracting words, and
  2. An algorithm for comparing the extracted words against a known list of correctly spelled words (i.e., the dictionary).


The scanning routines sometimes include language-dependent algorithms for handling morphology. Even for a lightly inflected language like English, word extraction routines will need to handle such phenomena as contractions and possessives. It is unclear whether morphological analysis provides a significant benefit for English, though its benefits for highly synthetic languages such as German, Hungarian or Turkish are clear.
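
The two parts described above can be sketched in a few lines of Python. This is only a minimal illustration, not the design of any particular product: the tiny `WORD_LIST`, the regular expression, and the function names are all assumptions made for the example, and the possessive handling is deliberately naive.

```python
import re

# Hypothetical word list; a real checker would load many thousands of entries.
WORD_LIST = {"the", "dog", "is", "not", "here", "it", "don't", "cat", "barks"}

# A word is a run of letters, optionally with internal apostrophes (don't, dog's).
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def extract_words(text):
    """Part 1: scan the text and return the words found, lowercased."""
    return [m.group(0).lower() for m in WORD_RE.finditer(text)]


def strip_possessive(word):
    """Reduce a possessive such as "dog's" to its base form "dog"."""
    return word[:-2] if word.endswith("'s") else word


def flag_unknown(text, word_list=WORD_LIST):
    """Part 2: return the extracted words that are not in the word list."""
    flagged = []
    for word in extract_words(text):
        if word in word_list or strip_possessive(word) in word_list:
            continue
        flagged.append(word)
    return flagged
```

Here `flag_unknown("The dgo barks")` flags only `dgo`, while `dog's` is accepted because its base form is in the list. A side effect of this shortcut is that the contraction `it's` is accepted the same way, which is one reason real extraction routines need more care.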

The word list might contain just a list of words, or it might also contain additional information, such as hyphenation points or lexical and grammatical attributes.

As an adjunct to these two components, the program's user interface will allow users to approve replacements and modify the program's operation.

One exception to the above paradigm is spell checkers that rely solely on statistical information, for instance using n-grams. This approach usually requires a lot of effort to obtain sufficient statistical information and may require a lot more runtime storage. These methods are not currently in general use. In some cases spell checkers use a fixed list of misspellings and suggestions for those misspellings; this less flexible approach is often used in paper-based correction methods, such as the see also entries of encyclopedias.
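
As a rough illustration of the statistical approach, the sketch below counts character trigrams in a list of correctly spelled words and flags any word containing a trigram that was never seen. The function names, the boundary markers `^` and `$`, and the `min_count` threshold are assumptions made for this example; it does not reproduce any specific published system.

```python
from collections import Counter


def char_ngrams(word, n=3):
    """Return the character n-grams of word, padded with boundary markers."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(corpus_words, n=3):
    """Count every character n-gram seen in a list of correctly spelled words."""
    counts = Counter()
    for word in corpus_words:
        counts.update(char_ngrams(word, n))
    return counts


def looks_misspelled(word, counts, n=3, min_count=1):
    """Flag a word if any of its n-grams occurs fewer than min_count times."""
    return any(counts[gram] < min_count for gram in char_ngrams(word, n))
```

Trained on a handful of words such as "the", "then" and "there", this flags "teh" because the trigram `^te` never occurs. It would also flag many valid words it has not seen enough evidence for, which hints at why such methods need a large amount of training text.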

History

Research extends back to 1957, including spelling checkers for bitmap images of cursive writing and special applications to find records in databases in spite of incorrect entries. In 1961, Les Earnest, who headed the research on this budding technology, saw it necessary to include the first spell checker that accessed a list of 10,000 acceptable words. Ralph Gorin, a graduate student under Earnest at the time, created the first true spelling checker program written as an applications program (rather than research) for general English text: SPELL for the DEC PDP-10 at Stanford University's Artificial Intelligence Laboratory, in February 1971. Gorin wrote SPELL in assembly language, for faster action; he made the first spelling corrector by searching the word list for plausible correct spellings that differ by a single letter or adjacent letter transpositions and presenting them to the user. Gorin made SPELL publicly accessible, as was done with most SAIL (Stanford Artificial Intelligence Laboratory) programs, and it soon spread around the world via the new ARPAnet, about ten years before personal computers came into general use. SPELL, its algorithms and data structures inspired the Unix ispell program.

The first spell checkers were widely available on mainframe computers in the late 1970s. A group of six linguists from Georgetown University developed the first spell-check system for IBM.

The company Software Concepts, Inc., founded by William J. Tobin in 1978, developed one of the first patented computer software programs in the United States for spelling verification. The program was used by most major word-processing and photo-typesetting systems, including Lanier, Philips, and Xerox, among many others. The patent the company was issued in 1980 for the Spell-Checking program was one of the first software patents issued in the United States, Canada, and Europe.

The first spell checkers for personal computers appeared for CP/M and TRS-80 computers in 1980, followed by packages for the IBM PC after it was introduced in 1981. Developers such as Maria Mariani, Soft-Art, Microlytics, Proximity, Circle Noetics, and Reference Software rushed OEM packages or end-user products into the rapidly expanding software market, primarily for the PC but also for Apple Macintosh, VAX, and Unix. On the PCs, these spell checkers were standalone programs, many of which could be run in TSR mode from within word-processing packages on PCs with sufficient memory.

However, the market for standalone packages was short-lived, as by the mid-1980s developers of popular word-processing packages like WordStar and WordPerfect had incorporated spell checkers in their packages, mostly licensed from the above companies, who quickly expanded support from just English to European and eventually even Asian languages. This required increasing sophistication in the morphology routines of the software, particularly with regard to heavily agglutinative languages like Hungarian and Finnish. Although the size of the word-processing market in a country like Iceland might not have justified the investment of implementing a spell checker, companies like WordPerfect nonetheless strove to localize their software for as many national markets as possible as part of their global marketing strategy.

Recently, spell checking has moved beyond word processors: Firefox 2.0, a web browser, has spell check support for user-written content, such as when editing Wikitext or writing on many webmail sites, blogs, and social networking websites. The web browsers Google Chrome, Konqueror, and Opera, the email client KMail and the instant messaging client Pidgin also offer spell checking support, transparently using GNU Aspell as their engine. Mac OS X now has spell check systemwide, extending the service to virtually all bundled and third-party applications.

Functionality

The first spell checkers were "verifiers" instead of "correctors." They offered no suggestions for incorrectly spelled words. This was helpful for typos but it was not so helpful for logical or phonetic errors. The challenge the developers faced was the difficulty in offering useful suggestions for misspelled words. This requires reducing words to a skeletal form and applying pattern-matching algorithms.
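
One way to picture "reducing words to a skeletal form" is sketched below. The skeleton used here (keep the first letter, drop later vowels, and skip any letter equal to the last one kept) is a hypothetical choice made for illustration, not the scheme of any particular spell checker; the function names are likewise assumptions.

```python
VOWELS = set("aeiou")


def skeleton(word):
    """Keep the first letter, drop later vowels, skip repeats of the last kept letter."""
    word = word.lower()
    if not word:
        return ""
    key = [word[0]]
    for ch in word[1:]:
        if ch in VOWELS or ch == key[-1]:
            continue
        key.append(ch)
    return "".join(key)


def build_index(word_list):
    """Group dictionary words by skeleton so lookups are a single dict access."""
    index = {}
    for word in word_list:
        index.setdefault(skeleton(word), []).append(word)
    return index


def suggest_by_skeleton(misspelling, index):
    """Return dictionary words sharing the misspelling's skeleton."""
    return index.get(skeleton(misspelling), [])
```

Under this scheme "neccesary" and "necessary" share one skeleton, so the misspelling leads straight to the correct word. Unrelated words can collide as well ("separate" and "spirit" do here), so a practical corrector would still need to rank the candidates.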

It might seem logical that where spell-checking dictionaries are concerned, "the bigger, the better," so that correct words are not marked as incorrect. In practice, however, an optimal size for English appears to be around 90,000 entries. If there are more than this, incorrectly spelled words may be skipped because they are mistaken for others. For example, a linguist might determine on the basis of corpus linguistics that the word baht is more frequently a misspelling of bath or bat than a reference to the Thai currency. Hence, it would typically be more useful if a few people who write about Thai currency were slightly inconvenienced, than if the spelling errors of the many more people who discuss baths were overlooked.

The first MS-DOS spell checkers were mostly used in proofing mode from within word processing packages. After preparing a document, a user scanned the text looking for misspellings. Later, however, batch processing was offered in such packages as Oracle's short-lived CoAuthor. This allowed a user to view the results after a document was processed and correct only the words that they knew to be wrong. When memory and processing power became abundant, spell checking was performed in the background in an interactive way, as in Sector Software's Spellbound program, released in 1987, and in Microsoft Word since Word 95.

In recent years, spell checkers have become increasingly sophisticated; some are now capable of recognizing simple grammatical errors. However, even at their best, they rarely catch all the errors in a text (such as homophone errors) and will flag neologisms and foreign words as misspellings.

Spell-checking non-English languages

English is unusual in that most words used in formal writing have a single spelling that can be found in a typical dictionary, with the exception of some jargon and modified words. In many languages, however, words are frequently combined in new ways. In German, for example, compound nouns are frequently coined from existing nouns. Some scripts do not clearly separate one word from another, requiring word-splitting algorithms. Each of these presents unique challenges to non-English language spell checkers.
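
To make the compounding problem concrete, the sketch below accepts a word if it is in the word list or can be split entirely into listed words. It is a simplified assumption-laden example: the three-word lexicon, the `min_part` length limit, and the function name are invented for illustration, and it ignores the linking elements and inflection that real German compounds often involve.

```python
from functools import lru_cache


def make_compound_checker(lexicon, min_part=3):
    """Build a checker that accepts listed words and compounds made of them."""
    lexicon = frozenset(word.lower() for word in lexicon)

    @lru_cache(maxsize=None)
    def is_valid(word):
        if word in lexicon:
            return True
        # Try every split point that leaves both halves at least min_part long.
        for i in range(min_part, len(word) - min_part + 1):
            if word[:i] in lexicon and is_valid(word[i:]):
                return True
        return False

    return lambda word: is_valid(word.lower())
```

With a lexicon of "Haus", "Tür" and "Schlüssel", the checker accepts "Haustürschlüssel" even though that compound is not listed, and rejects "Haustürschlüsel" because its final part is not a known word.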

Context-sensitive spell checkers

Recently, research has focused on developing algorithms which are capable of recognizing a misspelled word, even if the word itself is in the vocabulary, based on the context of the surrounding words. Not only does this allow words such as those in the spell-checker poem discussed under Criticism to be caught, but it also mitigates the detrimental effect of enlarging dictionaries, allowing more words to be recognized. The most common errors caught by such a system are homophone errors, as in the following sentence, in which "Their", "too", "sea", "its" and "reel" are all correctly spelled words used in the wrong place:
Their coming too sea if its reel.
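
A toy version of context-sensitive checking is sketched below. It is not the Winnow-based method or any production algorithm: it simply compares each word with the other members of a hand-written confusion set and keeps whichever fits its neighbours best according to word-pair counts. The confusion sets, the counts in `BIGRAM_COUNTS`, and all function names are hypothetical values invented for this example.

```python
import re

# Hypothetical sets of words that are easily confused with one another.
CONFUSION_SETS = [
    {"their", "there", "they're"},
    {"to", "too", "two"},
    {"sea", "see"},
    {"its", "it's"},
    {"reel", "real"},
]

# Hypothetical counts of adjacent word pairs, as if gathered from a corpus.
BIGRAM_COUNTS = {
    ("they're", "coming"): 50, ("their", "coming"): 1,
    ("coming", "to"): 80, ("coming", "too"): 2,
    ("to", "see"): 90, ("to", "sea"): 3,
    ("see", "if"): 40, ("sea", "if"): 1,
    ("if", "it's"): 60, ("if", "its"): 5,
    ("it's", "real"): 30, ("its", "real"): 2, ("its", "reel"): 1,
}


def alternatives(word):
    """Return the other members of the word's confusion set, if any."""
    for group in CONFUSION_SETS:
        if word in group:
            return group - {word}
    return set()


def context_score(prev_word, word, next_word):
    """Score a word by how often it was seen next to its two neighbours."""
    return (BIGRAM_COUNTS.get((prev_word, word), 0)
            + BIGRAM_COUNTS.get((word, next_word), 0))


def correct_in_context(text):
    """Replace each word by a confusable alternative that fits the context better."""
    tokens = re.findall(r"[a-z']+", text.lower())
    result = []
    for i, token in enumerate(tokens):
        prev_word = tokens[i - 1] if i > 0 else None
        next_word = tokens[i + 1] if i + 1 < len(tokens) else None
        best, best_score = token, context_score(prev_word, token, next_word)
        for alt in sorted(alternatives(token)):
            score = context_score(prev_word, alt, next_word)
            if score > best_score:  # keep the original word on ties
                best, best_score = alt, score
        result.append(best)
    return result
```

With these invented counts, the example sentence is rewritten word by word as "they're coming to see if it's real". The result depends entirely on the quality of the counts, which is why practical systems are trained on large corpora.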


The most successful algorithm to date is Andrew Golding and Dan Roth's "Winnow-based spelling correction algorithm", published in 1999, which is able to recognize about 96% of context-sensitive spelling errors, in addition to ordinary non-word spelling errors. Context-sensitive spell checkers appear in Microsoft Office 2007, Google Wave, Ginger Software, and the Ghotit Dyslexia Software context spell checker, which is tuned for people with dyslexia.

Criticism

Some critics of technology and computers have attempted to link spell checkers to a trend of skill losses in writing, reading, and speaking. They claim that the convenience of computers has led people to become lazy, often not proofreading written work past a simple pass by a spell checker. Supporters claim that these changes may actually be beneficial to society, by making writing and learning new languages more accessible to the general public. They claim that the skills lost by the invention of automated spell checkers are being replaced by better skills, such as faster and more efficient research skills. Other supporters of technology point to the fact that these skills are not being lost to people who require and make use of them regularly, such as authors, critics, and language professionals.

An example of the problem of relying completely on spell checkers is the Spell-checker Poem, "Candidate for a Pullet Surprise". It was originally composed by Dr. Jerrold H. Zar in 1991, assisted by Mark Eckman, with an original length of 225 words, 123 of them used incorrectly. According to most spell checkers, the poem is valid, although most people would be able to tell at a glance that most of its words are used incorrectly. As a result, spell checkers are sometimes derided as spilling chuckers or similar, slightly misspelled names.

Not all of the critics are opponents of technological progress, however. An article based on research by Galletta et al. reports that higher verbal skills are needed for the highest performance when using a spell checker. The theory suggested that only writers with higher verbal skills could recognize and ignore false positives or incorrect suggestions. However, it was found that those with the higher skills lost their unaided performance advantage in multiple categories of errors, performing as poorly as writers with lower verbal skills once the spell checkers were turned on. The conclusion points to some evidence of a loss of skill.

See also

  • Cupertino effect
  • Grammar checker
  • Record linkage problem
  • Spelling checking programs
    Simple spell-checker (21 lines), with explanation and references.
  • Spelling suggestion
  • Approximate string matching


External links

  • PSU.edu, Computer Programs for Detecting and Correcting Spelling Errors
  • Norvig.com, "How to Write a Spelling Corrector", by Peter Norvig
  • BBK.ac.uk, "Spellchecking by computer", by Roger Mitton
  • CBSNews.com, Spell-Check Crutch Curtails Correctness, by Lloyd de Vries
  • NIU.edu, Candidate for a Pullet Surprise - Complete corrected poem
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.