Lesk algorithm
The Lesk algorithm is a classical algorithm for word sense disambiguation introduced by Michael E. Lesk in 1986.

Overview

The Lesk algorithm is based on the assumption that words in a given neighbourhood will tend to share a common topic. A simplified version of the Lesk algorithm is to compare the dictionary definition of an ambiguous word with the terms contained in its neighbourhood. Versions have been adapted to use WordNet. The procedure is as follows:
  1. for every sense of the word being disambiguated, count the number of words that occur both in the neighbourhood of that word and in the dictionary definition of that sense
  2. choose the sense for which this count is highest


A frequently used example illustrating this algorithm is for the context "pine cone". The following dictionary definitions are used:
PINE
1. kinds of evergreen tree with needle-shaped leaves
2. waste away through sorrow or illness

CONE
1. solid body which narrows to a point
2. something of this shape whether solid or hollow
3. fruit of certain evergreen trees
As can be seen, the best intersection is Pine#1 ⋂ Cone#3 = 2.
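
The overlap count above can be reproduced with a few lines of code. The sketch below is purely illustrative and is not Lesk's original implementation: the toy dictionary, the small stop-word list, and the crude plural-stripping normalisation are assumptions made so that "tree" and "trees" count as the same word.

from itertools import product

DICTIONARY = {
    "pine": [
        "kinds of evergreen tree with needle-shaped leaves",
        "waste away through sorrow or illness",
    ],
    "cone": [
        "solid body which narrows to a point",
        "something of this shape whether solid or hollow",
        "fruit of certain evergreen trees",
    ],
}

STOP_WORDS = {"of", "or", "with", "to", "a", "this", "which", "through"}

def gloss_words(gloss):
    # Lowercase, drop stop words, and strip a trailing "s" as a crude way
    # of letting "tree" and "trees" match.
    words = set()
    for w in gloss.lower().split():
        if w not in STOP_WORDS:
            words.add(w.rstrip("s") if len(w) > 3 else w)
    return words

def best_sense_pair(word1, word2):
    # Try every pair of senses and keep the pair whose glosses share the most words.
    best_pair, best_overlap = None, -1
    for (i, g1), (j, g2) in product(enumerate(DICTIONARY[word1], 1),
                                    enumerate(DICTIONARY[word2], 1)):
        overlap = len(gloss_words(g1) & gloss_words(g2))
        if overlap > best_overlap:
            best_pair, best_overlap = (i, j), overlap
    return best_pair, best_overlap

print(best_sense_pair("pine", "cone"))  # ((1, 3), 2): Pine#1 and Cone#3 share "evergreen" and "tree"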

Simplified Lesk algorithm

In the simplified Lesk algorithm, the meaning of each word in a given context is determined individually by locating the sense whose dictionary definition overlaps most with that context. Rather than simultaneously determining the meanings of all words in a context, this approach tackles each word independently of the meanings of the other words occurring in the same context.

"A comparative evaluation performed by Vasileseu et al. (2004) has shown that the simplified Lesk algorithm can significantly outperform the original definition of the algorithm, both in terms of precision and efficiency. By evaluating the disambiguation algorithms on the Senseval-2 English all words data, they measure a 58% precision using the simplified Lesk algorithm compared to the only 42% under the original algorithm.

Note: Vasileseu et al. implementation considers a back-off strategy for words not covered by the algorithm, consisting of the most frequent sense defined in WordNet. This means that words for which all their possible meanings lead to zero overlap with current context or with other word definitions are by default assigned sense number one in WordNet."

Simplified Lesk Algorithm (Kilgarriff and Rosenzweig, 2000)
function SIMPLIFIED LESK(word, sentence) returns best sense of word
  best-sense <- most frequent sense for word
  max-overlap <- 0
  context <- set of words in sentence
  for each sense in senses of word do
    signature <- set of words in the gloss and examples of sense
    overlap <- COMPUTEOVERLAP(signature, context)
    if overlap > max-overlap then
      max-overlap <- overlap
      best-sense <- sense
  end
  return (best-sense)


The COMPUTEOVERLAP function returns the number of words in common between two sets, ignoring function words or other words on a stop list. The original Lesk algorithm defines the context in a more complex way.
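
For concreteness, here is a minimal runnable Python sketch of the pseudocode above. It assumes NLTK's WordNet interface as the sense inventory and a small illustrative stop-word list; neither choice is prescribed by the original pseudocode, and the initialisation to the most frequent sense (the first synset in WordNet's ordering) mirrors the back-off strategy described by Vasilescu et al.

from nltk.corpus import wordnet as wn   # requires nltk and the 'wordnet' corpus

STOP_WORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "and", "or", "is", "are"}

def compute_overlap(signature, context):
    # Number of words the two sets share, ignoring words on the stop list.
    return len((signature & context) - STOP_WORDS)

def simplified_lesk(word, sentence):
    # Return the WordNet synset of `word` whose gloss and examples
    # overlap most with the words of `sentence`.
    context = set(sentence.lower().split())
    senses = wn.synsets(word)
    if not senses:
        return None
    best_sense = senses[0]      # back-off: most frequent sense (WordNet order)
    max_overlap = 0
    for sense in senses:
        signature = set(sense.definition().lower().split())
        for example in sense.examples():
            signature |= set(example.lower().split())
        overlap = compute_overlap(signature, context)
        if overlap > max_overlap:
            max_overlap = overlap
            best_sense = sense
    return best_sense

print(simplified_lesk("bank", "he sat on the bank of the river and watched the water"))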

Criticisms and other Lesk-based methods

Unfortunately, Lesk’s approach is very sensitive to the exact wording of definitions, so the absence of a certain word can radically change the results. Further, the algorithm determines overlaps only among the glosses of the senses being considered. This is a significant limitation in that dictionary glosses tend to be fairly short and do not provide sufficient vocabulary to relate fine-grained sense distinctions.

Since then, many works have appeared offering different modifications of this algorithm. These works draw on additional resources for analysis (thesauri, synonym dictionaries, or morphological and syntactic models): for instance, they may use information such as synonyms, derivatives, or words from the definitions of the words that appear in definitions.
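
As an illustration of how such extensions enrich the comparison, the hypothetical helper below builds a larger signature for a WordNet sense (using the same NLTK interface as the sketch above) by adding its synonym lemmas and the glosses of directly related synsets; the particular choice of relations is an assumption for illustration, not a description of any one published method.

def extended_signature(sense):
    # Start from the gloss of the sense itself...
    words = set(sense.definition().lower().split())
    # ...add its synonym lemmas...
    for lemma in sense.lemmas():
        words.add(lemma.name().lower().replace("_", " "))
    # ...and the glosses of directly related synsets (hypernyms and hyponyms).
    for related in sense.hypernyms() + sense.hyponyms():
        words |= set(related.definition().lower().split())
    return words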

There are a lot of studies concerning Lesk and its extensions:
  • Kwong, 2001;
  • Nastase and Szpakowicz, 2001;
  • Wilks and Stevenson, 1998, 1999;
  • Mahesh et al., 1997;
  • Cowie et al., 1992;
  • Yarowsky, 1992;
  • Pook and Catlett, 1988;
  • Kilgarriff & Rosensweig, 2000,
  • Alexander Gelbukh, Grigori Sidorov, 2004.

Accuracy

The original method achieved 50–70% accuracy (depending on the word) on Pride and Prejudice and selected papers of the Associated Press.