All Topics  
Stemming

 

   Email Print
   Bookmark   Link






 

Stemming



 
 
Stemming is the process for reducing inflected (or sometimes derived) words to their stem
Word stem

In linguistics, a stem is the part of a word that is common to all its inflection variants. Stems are often root , e.g. atomic, its root is atom, but its stem is atom?ic....
, base or root
Root (linguistics)

The root is the primary lexicology unit of a word, which carries the most significant aspects of semantics content and cannot be reduced into smaller constituents....
 form – generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. The algorithm
Algorithm

In mathematics, computing, linguistics and related subjects, an algorithm is a sequence of finite instructions, often used for calculation and data processing....
 has been a long-standing problem in computer science
Computer science

Computer science is the study of the theoretical foundations of information and computation, and of practical techniques for their implementation and application in computer systems....
; the first paper on the subject was published in 1968.






Discussion
Ask a question about 'Stemming'
Start a new discussion about 'Stemming'
Answer questions from other users
Full Discussion Forum



Encyclopedia


Stemming is the process for reducing inflected (or sometimes derived) words to their stem
Word stem

In linguistics, a stem is the part of a word that is common to all its inflection variants. Stems are often root , e.g. atomic, its root is atom, but its stem is atom?ic....
, base or root
Root (linguistics)

The root is the primary lexicology unit of a word, which carries the most significant aspects of semantics content and cannot be reduced into smaller constituents....
 form – generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. The algorithm
Algorithm

In mathematics, computing, linguistics and related subjects, an algorithm is a sequence of finite instructions, often used for calculation and data processing....
 has been a long-standing problem in computer science
Computer science

Computer science is the study of the theoretical foundations of information and computation, and of practical techniques for their implementation and application in computer systems....
; the first paper on the subject was published in 1968. The process of stemming, often called conflation
Conflation

Conflation occurs when the identities of two or more individuals, concepts, or places, sharing some characteristics of one another, become confused until there seems to be only a single identity ? the differences appear to become lost....
, is useful in search engine
Search engine

A search engine is an information retrieval designed to help find information stored on a computer system. The search results are usually presented in a list and are commonly called hits....
s for query expansion
Query expansion

Query expansion is the process of reformulating a seed query to improve retrieval performance in information retrieval operations.In the context of web search engines, query expansion involves evaluating a user's input and expanding the search query to match additional documents....
 or indexing
Index (search engine)

Search engine index collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics and computer science....
 and other natural language processing
Natural language processing

Natural language processing is a field of computer science concerned with the interactions between computers and human languages. Natural language generation systems convert information from computer databases into readable human language....
 problems.

Stemming programs are commonly referred to as stemming algorithms or stemmers.

Examples

A stemmer for English
English language

English is a West Germanic language that originated in Anglo-Saxon England and has lingua franca status in many parts of the world as a result of the military, economic, scientific, political and cultural influence of the British Empire in the 18th, 19th and early 20th centuries and that of the United States from the mid 20th century onwa...
, for example, should identify the string
String literal

A string literal is the representation of a String value within the source code of a computer program. There are numerous alternate notations for specifying string literals, and the exact notation depends on the individual programming language in question....
 "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word, "fish".

History

The first ever published stemmer was written by Julie Beth Lovins in 1968. This paper was remarkable for its early date and had great influence on later work in this area.

A later stemmer was written by Martin Porter
Martin Porter

Dr. Martin F. Porter is the inventor of the Porter Stemming and the Snowball programming language programming framework.The Muscat search engine comes from research performed by Dr Porter at Cambridge University , and was commercialized in 1984 by Cambridge CD Publishing; it was subsequently sold to MAID which became the Dialog ....
 and was published in the July 1980 issue of the journal Program. This stemmer was very widely used and became the de-facto standard algorithm used for English stemming. Dr. Porter received the Tony Kent Strix award
Tony Kent Strix award

The Strix award is an annual award for outstanding contributions to the field of information retrieval.The award has been presented since 1998 in memory of Dr Tony Kent, a past Fellow of the Institute of Information Scientists , who died in 1997....
 in 2000 for his work on stemming and information retrieval.

Many implementations of the Porter stemming algorithm were written and freely distributed; however, many of these implementations contained subtle flaws. As a result, these stemmers did not match their potential. To eliminate this source of error, Martin Porter released an official of the algorithm around the year 2000. He extended this work over the next few years by building Snowball
Snowball programming language

Snowball is a small string-handling programming language whose name was chosen as a tribute to the SNOBOL programming language, with which it shares the concept of string patterns delivering signals that are used to control the flow of the program....
, a framework for writing stemming algorithms, and implemented an improved English stemmer together with stemmers for several other languages.

Algorithms

There are several types of stemming algorithms which differ in respect to performance and accuracy and how certain stemming obstacles are overcome.

Brute Force Algorithms


Brute force
Brute-force search

In computer science, brute-force search or exhaustive search, also known as generate and test, is a trivial but very general problem-solving technique that consists of systematically enumerating all possible candidates for the solution and checking whether each candidate satisfies the problem's statement....
 stemmers employ a lookup table which contains relations between root forms and inflected forms. To stem a word, the table is queried to find a matching inflection. If a matching inflection is found, the associated root form is returned.

Brute force approaches are criticized for their general lack of elegance in that no algorithm is applied that would more quickly converge on a solution. In other words, there are more operations performed during the search than should be necessary. Brute force searches consume immense amounts of storage to host the list of relations (relative to the task). The algorithm is only accurate to the extent that the inflected form already exists in the table. Given the number of words in a given language, like English, it is unrealistic to expect that all word forms can be captured and manually recorded by human action alone. Manual training of the algorithm is overly time-intensive and the ratio between the effort and the increase in accuracy is marginal at best.

Brute force algorithms do overcome some of the challenges faced by the other approaches. Not all inflected word forms in a given language "follow the rules" appropriately. While "running" might be easy to stem to "run" in a suffix stripping approach, the alternate inflection, "ran", is not. Suffix stripping algorithms are somewhat powerless to overcome this problem, short of increasing the number and complexity of the rules, but brute force algorithms only require storing a single extra relation between "run" and "ran". While that is true, this assumes someone bothered to store the relation in the first place, and one of the major problems of improving brute force algorithms is the coverage of the language.

Brute force algorithms are initially very difficult to design given the immense amount of relations that must be initially stored to produce an acceptable level of accuracy (the number can span well into the millions). However, brute force algorithms are easy to improve in that decreasing the stemming error is only a matter of adding more relations to the table. Someone with only a minor experience in linguistics is capable of improving the algorithm, unlike the suffix stripping approaches which require a solid background in linguistics.

For technical accuracy, some programs may use suffix stripping to generate the lookup table given a text corpus, and then only consult the lookup table when stemming. This is not regarded as a brute force approach, although a lookup table is involved.

Suffix Stripping Algorithms


Suffix stripping algorithms do not rely on a lookup table that consists of inflected forms and root form relations. Instead, a typically smaller list of "rules" are stored which provide a path for the algorithm, given an input word form, to find its root form. Some examples of the rules include:

  • if the word ends in 'ed', remove the 'ed'
  • if the word ends in 'ing', remove the 'ing'
  • if the word ends in 'ly', remove the 'ly'


Suffix stripping approaches enjoy the benefit of being much simpler to maintain than brute force algorithms, assuming the maintainer is sufficiently knowledgeable in the challenges of linguistics and morphology and encoding suffix stripping rules. Suffix stripping algorithms are sometimes regarded as crude given the poor performance when dealing with exceptional relations (like 'ran' and 'run'). The solutions produced by suffix stripping algorithms are limited to those lexical categories
Lexical category

In grammar, a lexical category is a linguistic category of words , which is generally defined by the syntactic or morphology behaviour of the lexical item in question....
 which have well known suffixes with few exceptions. This, however, is a problem, as not all parts of speech have such a well formulated set of rules. Lemmatisation attempts to improve upon this challenge.

Lemmatisation Algorithms


A more complex approach to the problem of determining a stem of a word is lemmatisation
Lemmatisation

Lemmatisation is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.In computing, lemmatisation is the algorithmic process of determining the lemma for a given word....
. This process involves first determining the part of speech of a word, and applying different normalization rules for each part of speech. The part of speech is first detected prior to attempting to find the root since for some languages, the stemming rules change depending on a word's part of speech.

This approach is highly conditional upon obtaining the correct lexical category (part of speech). While there is overlap between the normalization rules for certain categories, identifying the wrong category or being unable to produce the right category limits the added benefit of this approach over suffix stripping algorithms. The basic idea is that, if we are able to grasp more information about the word to be stemmed, then we are able to more accurately apply normalization rules (which are, more or less, suffix stripping rules).

Stochastic Algorithms


Stochastic algorithms involve using probability to identify the root form of a word. Stochastic algorithms are trained (they "learn") on a table of root form to inflected form relations to develop a probabilistic model. This model is typically expressed in the form of complex linguistic rules, similar in nature to those in suffix stripping or lemmatisation. Stemming is performed by inputting an inflected form to the trained model and having the model produce the root form according to its internal ruleset, which again is similar to suffix stripping and lemmatisation, except that the decisions involved in applying the most appropriate rule, or whether or not to stem the word and just return the same word, or whether to apply two different rules sequentially, are applied on the grounds that the output word will have the highest probability of being correct (which is to say, the smallest probability of being incorrect, which is how it is typically measured).

Some lemmatisation algorithms are stochastic in that, given a word which may belong to multiple parts of speech, a probability is assigned to each possible part. This may take into account the surrounding words, called the context, or not. Context-free grammars do not take into account any additional information. In either case, after assigning the probabilities to each possible part of speech, the most likely part of speech is chosen, and from there the appropriate normalization rules are applied to the input word to produce the normalized (root) form.

Hybrid Approaches


Hybrid approaches use two or more of the approaches described above in unison. A simple example is a suffix tree algorithm which first consults a lookup table using brute force. However, instead of trying to store the entire set of relations between words in a given language, the lookup table is kept small and is only used to store a minute amount of "frequent exceptions" like "ran => run". If the word is not in the exception list, apply suffix stripping or lemmatisation and output the result.

Affix Stemmers

In linguistics
Linguistics

Linguistics is the science study of natural language. Linguistics encompasses a number of sub-fields. An important topical division is between the study of language structure and the study of Meaning ....
, the term affix
Affix

An affix is a morpheme that is attached to a word stem to form a new word. Affixes may be derivation , like English -ness and pre-, or inflectional, like English plural -s and past tense -ed....
 refers to either a prefix
Prefix

A prefix is an affix which is placed before the stem of a word. The word "prefix" is itself made up of the stem fix , and the prefix pre- , both of which are derived from Latin root s....
 or a suffix
Suffix

In grammar, a suffix is an affix which is placed after the stem of a word. Common examples are case endings, which indicate the grammatical case of nouns or adjectives, and verb endings, which form the grammatical conjugation of verbs....
. In addition to dealing with suffixes, several approaches also attempt to remove common prefixes. For example, given the word indefinitely, identify that the leading "in" is a prefix that can be removed. Many of the same approaches mentioned earlier apply, but go by the name affix stripping.

Matching Algorithms

Such algorithms use a stem database (for example a set of documents that contain stem words). These stems, as mentioned above, are not necessarily valid words themselves (but rather common sub-strings, as the "brows" in "browse" and in "browsing"). In order to stem a word the algorithm tries to match it with stems from the database, applying various constraints, such as on the relative length of the candidate stem within the word (so that, for example, the short prefix "be", which is the stem of such words as "be", "been" and "being", would not be considered as the stem of the word "beside").

Language Challenges

While much of the early academic work in this area was focused on the English language (with significant use of the Porter Stemmer algorithm), many other languages have been investigated.

Hebrew and Arabic are still considered difficult research languages for stemming. English stemmers are fairly trivial (with only occasional problems, such as "dries" being the third-person singular present form of the verb "dry", "axes" being the plural of "axe" as well as "axis"); but stemmers become harder to design as the morphology, orthography, and character encoding of the target language becomes more complex. For example, an Italian stemmer is more complex than an English one (because of more possible verb inflections), a Russian one is more complex (more possible noun declensions), a Hebrew one is even more complex (due to non-catenative morphology and a writing system without vowels), and so on..

Multilingual Stemming

Multilingual stemming applies morphological rules of two or more languages simultaneously instead of rules for only a single language when interpreting a search query. Commercial systems using multilingual stemming exist.

Error metrics

There are two error measurements in stemming algorithms, overstemming and understemming. Overstemming is an error where two separate inflected words are stemmed to the same root, but should not have been - a false positive. Understemming is an error where two separate inflected words should be stemmed to the same root, but are not - a false negative. Stemming algorithms attempt to minimize each type of error, although reducing one type can lead to increasing the other.

Applications


Information Retrieval

Stemmers are common elements in query systems
Information retrieval

Information retrieval is the science of searching for documents, for information within documents and for Metadata about documents, as well as that of searching relational databases and the World Wide Web....
 such as Web
World Wide Web

The World Wide Web is a very large set of interlinked hypertext documents accessed via the Internet. With a Web browser, one can view Web pages that may contain writing, s, videos, and other multimedia and navigate between them using hyperlinks....
 search engine
Search engine

A search engine is an information retrieval designed to help find information stored on a computer system. The search results are usually presented in a list and are commonly called hits....
s, since a user who runs a query on "daffodils" would probably also be interested in documents that contain the word "daffodil" (without the s). The effectiveness of stemming for English query systems were soon found to be rather limited, however, and this has led early Information retrieval
Information retrieval

Information retrieval is the science of searching for documents, for information within documents and for Metadata about documents, as well as that of searching relational databases and the World Wide Web....
 researchers to deem stemming irrelevant in general. An alternative approach, based on searching for n-gram
N-gram

N-gram models are a type of probabilistic model for predicting the next item in a sequence. n-grams are used in various areas of statistical natural language processing and genetic sequence analysis....
s rather than stems, may be used instead. Also, recent research has shown greater benefits for retrieval in other languages.

Usage in commercial products

Many commercial companies have been using stemming since at least the 1980's and have produced algorithmic and lexical stemmers in many languages.

The Snowball stemmers have been compared with commercial lexical stemmers with varying results.

Google search
Google search

Google search is a Web search engine owned by Google, and is the most used search engine on the World Wide Web. Google receives several hundred million queries each day through its various services....
 adopted word stemming in 2003. Previously a search for "fish" would not have returned "fishing". Other software search algorithms vary in their use of word stemming. Programs that simply search for substrings obviously will find "fish" in "fishing" but when searching for "fishes" will not find occurrences of the word "fish".

See also

  • Root (linguistics)
    Root (linguistics)

    The root is the primary lexicology unit of a word, which carries the most significant aspects of semantics content and cannot be reduced into smaller constituents....
     - linguistic definition of the term "root"
  • Stem (linguistics) - linguistic definition of the term "stem"
  • Morphology (linguistics)
    Morphology (linguistics)

    Morphology is the identification, analysis and description of structure of words . While words are generally accepted as being the smallest units of syntax, it is clear that in most languages, words can be related to other words by rules....
  • Lemma (linguistics)
    Lemma (linguistics)

    In linguistics a lemma has two distinct interpretations:# morphology / lexicography: the canonical form or citation form of a set of forms ; e.g....
     - linguistic definition
  • Lemmatization
  • Lexeme
    Lexeme

    A lexeme is an abstract Unit of Morphology Semantic analysis in linguistics, that roughly corresponds to a set of forms taken by a single word....
  • Inflection
    Inflection

    In grammar, inflection or inflexion is the way language handles grammatical relations and relational categories such as grammatical tense, grammatical mood, grammatical voice, grammatical aspect, grammatical person, grammatical number, grammatical gender, grammatical case....
  • Derivation
    Derivation (linguistics)

    In linguistics, derivation is "Used to form new words, as with happi-ness and un-happy from happy, or determination from determine....
     - stemming is a form of reverse derivation
  • Natural language processing
    Natural language processing

    Natural language processing is a field of computer science concerned with the interactions between computers and human languages. Natural language generation systems convert information from computer databases into readable human language....
     - stemming is generally regarded as a form of NLP
  • Text mining
    Text mining

    Text mining, sometimes alternately referred to as text data mining, roughly equivalent to text analytics, refers generally to the process of deriving high-quality information from text....
     - stemming algorithms play a major role in commercial NLP software
  • Computational linguistics
    Computational linguistics

    Computational linguistics is an interdisciplinary field dealing with the Statistics and/or rule-based modeling of natural language from a computational perspective....


Further reading


External links

  • - open source IR framework, includes Porter stemmer implementation (PostgreSQL, Java API)
  • - free stemming algorithms for many languages, includes source code, including stemmers for five romance languages
  • - Ruby extension to Snowball API
  • - PHP extension to the Snowball API
  • - stemming library in C++ released under BSD
  • - with source code in a couple of languages
  • - including source code in several languages
  • - Lancaster University, UK
  • - University of East Anglia, UK
  • - A Java stemming toolkit for the Portuguese language