METEOR



 
 
METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a metric for the evaluation of machine translation output. The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. It also has several features that are not found in other metrics, such as stemming and synonymy matching, along with the standard exact word matching. The metric was designed to fix some of the problems found in the more popular BLEU metric, and also to produce good correlation with human judgement at the sentence or segment level. This differs from the BLEU metric in that BLEU seeks correlation at the corpus level. Results have been presented which give correlation of up to 0.964 with human judgement at the corpus level, compared to BLEU's achievement of 0.817 on the same data set. At the sentence level, the maximum correlation with human judgement achieved was 0.403.

Algorithm


As with BLEU, the basic unit of evaluation is the sentence. The algorithm first creates an alignment (see illustrations) between two sentences: the candidate translation string and the reference translation string. The alignment is a set of mappings between unigrams. A mapping can be thought of as a line between a unigram in one string and a unigram in the other string. The constraints are as follows: every unigram in the candidate translation must map to zero or one unigram in the reference translation, and vice versa. In any alignment, a unigram in one string cannot map to more than one unigram in the other string.
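The mapping constraint can be illustrated with a short Python sketch. It assumes an alignment is stored as a set of (candidate index, reference index) pairs; that representation and the name is_valid_alignment are illustrative choices, not taken from the METEOR implementation.

```python
from typing import Iterable, Set, Tuple

# ASSUMPTION: a mapping is a pair (candidate index, reference index).
Mapping = Tuple[int, int]


def is_valid_alignment(mappings: Iterable[Mapping]) -> bool:
    """Return True if no unigram position takes part in more than one mapping."""
    seen_candidate: Set[int] = set()
    seen_reference: Set[int] = set()
    for cand_idx, ref_idx in mappings:
        if cand_idx in seen_candidate or ref_idx in seen_reference:
            return False
        seen_candidate.add(cand_idx)
        seen_reference.add(ref_idx)
    return True


candidate = "the cat was sat on the mat".split()
reference = "the cat sat on the mat".split()

# "was" (candidate index 2) is left unmapped, which the constraints allow.
alignment = {(0, 0), (1, 1), (3, 2), (4, 3), (5, 4), (6, 5)}

print(is_valid_alignment(alignment))             # True
print(is_valid_alignment(alignment | {(2, 2)}))  # False: reference index 2 would be used twice
```

Unmapped unigrams simply do not appear in the set, which corresponds to the "zero or one" part of the constraint.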

An alignment is created incrementally through a series of stages, which are controlled by modules. A module is simply a matching algorithm; for example, the "wn_synonymy" module maps synonyms using WordNet, while the "exact" module matches exact words. Examples are given as follows:
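One way to picture a module in code is as a predicate over a candidate word and a reference word. The Python sketch below is only an illustration under that assumption: exact_module mirrors the "exact" module described above, while toy_synonym_module uses a small hand-written synonym table as a stand-in for a WordNet lookup. All names and the table contents are hypothetical.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

# ASSUMPTION: a module is modelled as a predicate over (candidate word, reference word).
Module = Callable[[str, str], bool]


def exact_module(candidate_word: str, reference_word: str) -> bool:
    """Match only identical surface forms."""
    return candidate_word == reference_word


# ASSUMPTION: a tiny hand-written table standing in for WordNet synonym sets.
TOY_SYNONYM_GROUPS: Tuple[FrozenSet[str], ...] = (
    frozenset({"cat", "feline"}),
    frozenset({"mat", "rug"}),
)


def toy_synonym_module(candidate_word: str, reference_word: str) -> bool:
    """Match two words if some toy synonym group contains both of them."""
    return any(
        candidate_word in group and reference_word in group
        for group in TOY_SYNONYM_GROUPS
    )


def possible_mappings(
    candidate: Sequence[str], reference: Sequence[str], module: Module
) -> List[Tuple[int, int]]:
    """List every (candidate index, reference index) pair the module accepts."""
    return [
        (i, j)
        for i, cand_word in enumerate(candidate)
        for j, ref_word in enumerate(reference)
        if module(cand_word, ref_word)
    ]


candidate = "the feline sat".split()
reference = "the cat sat".split()

print(possible_mappings(candidate, reference, exact_module))        # [(0, 0), (2, 2)]
print(possible_mappings(candidate, reference, toy_synonym_module))  # [(1, 1)]
```

The list returned by possible_mappings is only the raw set of pairs a module accepts; it is not yet an alignment, because a word may appear in several pairs.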

Each stage is split up into two phases. In the first phase, all possible unigram mappings are collected for the module being used in this stage. In the second phase, the largest subset of these mappings is selected to produce an alignment as defined above. If there are two alignments with the same number of mappings, the alignment with the fewest crosses is chosen, that is, the one with the fewest intersections of two mappings. From the two alignments shown, alignment (a) would be selected at this point. Stages are run consecutively, and each stage only adds to the alignment those unigrams which have not been matched in previous stages.

Once the final alignment is computed, the score is computed as follows. Unigram precision $P$ is calculated as:

$$P = \frac{m}{w_t}$$

where $m$ is the number of unigrams in the candidate translation that are also found in the reference translation, and $w_t$ is the number of unigrams in the candidate translation. Unigram recall $R$ is computed as:

$$R = \frac{m}{w_r}$$

where $m$ is as above, and $w_r$ is the number of unigrams in the reference translation. Precision and recall are combined using the harmonic mean in the following fashion, with recall weighted 9 times more than precision:

$$F_{mean} = \frac{10PR}{R + 9P}$$
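As a worked illustration of the precision, recall and weighted harmonic mean described above, the following Python sketch computes them from raw counts. The function names are illustrative, and returning zero for empty inputs is an assumption made here so that the functions never divide by zero.

```python
def unigram_precision(matches: int, candidate_length: int) -> float:
    """Matched unigrams divided by the number of unigrams in the candidate."""
    return matches / candidate_length if candidate_length else 0.0


def unigram_recall(matches: int, reference_length: int) -> float:
    """Matched unigrams divided by the number of unigrams in the reference."""
    return matches / reference_length if reference_length else 0.0


def f_mean(precision: float, recall: float) -> float:
    """Harmonic mean with recall weighted 9 times more than precision."""
    if precision == 0.0 or recall == 0.0:
        return 0.0  # ASSUMPTION: define the mean as 0 when nothing matched
    return 10 * precision * recall / (recall + 9 * precision)


# Counts for candidate "the cat was sat on the mat" against
# reference "the cat sat on the mat": 6 matches, 7 candidate words, 6 reference words.
p = unigram_precision(6, 7)
r = unigram_recall(6, 6)
print(f"{p:.4f} {r:.4f} {f_mean(p, r):.4f}")  # 0.8571 1.0000 0.9836
```

Because recall carries the larger weight, the extra word "was" lowers precision to 0.8571 but the combined value only drops to 0.9836.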

The measures that have been introduced so far only account for congruity with respect to single words but not with respect to larger segments that appear in both the reference and the candidate sentence. In order to take these into account, longer n-gram matches are used to compute a penalty for the alignment. The more mappings there are that are not adjacent in the reference and the candidate sentence, the higher the penalty will be.

In order to compute this penalty, unigrams are grouped into the fewest possible chunks, where a chunk is defined as a set of unigrams that are adjacent in the hypothesis and in the reference. The longer the adjacent mappings between the candidate and the reference, the fewer chunks there are. A translation that is identical to the reference will give just one chunk. The penalty $p$ is computed as follows:

$$p = 0.5 \left(\frac{c}{u_m}\right)^3$$

where $c$ is the number of chunks, and $u_m$ is the number of unigrams that have been mapped. The final score $M$ for a segment is calculated as below. The penalty has the effect of reducing the $F_{mean}$ by up to 50% if there are no bigram or longer matches.

$$M = F_{mean}(1 - p)$$
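The chunk count, penalty and final score can be sketched in Python as follows. The sketch assumes the alignment is already known and is given as (candidate index, reference index) pairs; it does not perform the alignment itself, and the function names are illustrative only.

```python
from typing import Iterable, Tuple


def count_chunks(alignment: Iterable[Tuple[int, int]]) -> int:
    """Count maximal runs of mappings that are adjacent in both sentences."""
    ordered = sorted(alignment)  # order mappings by candidate position
    chunks = 0
    previous = None
    for cand_idx, ref_idx in ordered:
        # A new chunk starts unless this mapping directly continues the previous one.
        if previous is None or (cand_idx, ref_idx) != (previous[0] + 1, previous[1] + 1):
            chunks += 1
        previous = (cand_idx, ref_idx)
    return chunks


def penalty(chunks: int, matches: int) -> float:
    """0.5 times the cube of the fragmentation (chunks / mapped unigrams)."""
    if matches == 0:
        return 0.0  # ASSUMPTION: no mapped unigrams means nothing to penalise
    return 0.5 * (chunks / matches) ** 3


def segment_score(f_mean: float, chunks: int, matches: int) -> float:
    """Final segment score: F-mean reduced by the fragmentation penalty."""
    return f_mean * (1 - penalty(chunks, matches))


# Candidate "the cat was sat on the mat" vs. reference "the cat sat on the mat":
# "the cat" forms one chunk and "sat on the mat" forms another.
alignment = [(0, 0), (1, 1), (3, 2), (4, 3), (5, 4), (6, 5)]
chunks = count_chunks(alignment)
matches = len(alignment)
f = 10 * (6 / 7) * 1.0 / (1.0 + 9 * (6 / 7))  # F-mean for P = 6/7, R = 1

print(chunks)                                     # 2
print(f"{penalty(chunks, matches):.4f}")          # 0.0185
print(f"{segment_score(f, chunks, matches):.4f}") # 0.9654
```

For a fixed alignment, counting maximal adjacent runs gives the fewest possible chunks, since any finer grouping would split a run that could have stayed together.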

To calculate a score over a whole corpus, or collection of segments, the aggregate values for $P$, $R$ and $p$ are taken and then combined using the same formula. The algorithm also works for comparing a candidate translation against more than one reference translation. In this case the algorithm compares the candidate against each of the references and selects the highest score.
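The multi-reference and corpus-level cases might be sketched in Python as below. The SegmentStats container and the function names are hypothetical. In particular, the text does not spell out how the aggregate values are formed, so corpus_score assumes that the underlying counts are summed over segments before the same formula is applied.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SegmentStats:
    """Counts for one candidate/reference pair (hypothetical container)."""
    matches: int
    chunks: int
    candidate_length: int
    reference_length: int


def score_from_stats(stats: SegmentStats) -> float:
    """Apply the precision, recall, F-mean and penalty formulas to raw counts."""
    if stats.matches == 0:
        return 0.0
    precision = stats.matches / stats.candidate_length
    recall = stats.matches / stats.reference_length
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    penalty = 0.5 * (stats.chunks / stats.matches) ** 3
    return f_mean * (1 - penalty)


def best_over_references(per_reference: Sequence[SegmentStats]) -> float:
    """Score the candidate against each reference and keep the highest score."""
    return max(score_from_stats(stats) for stats in per_reference)


def corpus_score(segments: Sequence[SegmentStats]) -> float:
    """ASSUMPTION: aggregate by summing counts, then reuse the segment formula."""
    total = SegmentStats(
        matches=sum(s.matches for s in segments),
        chunks=sum(s.chunks for s in segments),
        candidate_length=sum(s.candidate_length for s in segments),
        reference_length=sum(s.reference_length for s in segments),
    )
    return score_from_stats(total)


# Candidate "the cat sat on the mat" against two references:
# "the cat sat on the mat" (identical) and the made-up "a cat sat on a mat".
identical = SegmentStats(matches=6, chunks=1, candidate_length=6, reference_length=6)
partial = SegmentStats(matches=4, chunks=2, candidate_length=6, reference_length=6)

print(f"{best_over_references([identical, partial]):.4f}")  # 0.9977
```

Taking the maximum means that one close reference is enough for a high segment score, however far the candidate is from the other references.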

Examples


Reference the cat sat on the mat
Hypothesis on the mat sat the cat


Score: 0.5000 = Fmean: 1.0000 * (1 - Penalty: 0.5000)
Fmean: 1.0000 = 10 * Precision: 1.0000 * Recall: 1.0000 / Recall: 1.0000 + 9 * Precision: 1.0000
Penalty: 0.5000 = 0.5 * (Fragmentation: 1.0000 ^3)
Fragmentation: 1.0000 = Chunks: 6.0000 / Matches: 6.0000


Reference the cat sat on the mat
Hypothesis the cat sat on the mat


Score: 0.9977 = Fmean: 1.0000 * (1 - Penalty: 0.0023)
Fmean: 1.0000 = 10 * Precision: 1.0000 * Recall: 1.0000 / Recall: 1.0000 + 9 * Precision: 1.0000
Penalty: 0.0023 = 0.5 * (Fragmentation: 0.1667 ^3)
Fragmentation: 0.1667 = Chunks: 1.0000 / Matches: 6.0000


Reference the cat sat on the mat
Hypothesis the cat was sat on the mat


Score: 0.9654 = Fmean: 0.9836 * (1 - Penalty: 0.0185)
Fmean: 0.9836 = 10 * Precision: 0.8571 * Recall: 1.0000 / Recall: 1.0000 + 9 * Precision: 0.8571
Penalty: 0.0185 = 0.5 * (Fragmentation: 0.3333 ^3)
Fragmentation: 0.3333 = Chunks: 2.0000 / Matches: 6.0000

