Brill Tagger
The Brill tagger is a method for part-of-speech tagging. It was described by Eric Brill in his 1993 PhD thesis http://ucrel.lancs.ac.uk/acl/H/H92/H92-1022.pdf. It can be summarized as an "error-driven transformation-based tagger". It is
  • error-driven in the sense that it relies on supervised learning
  • transformation-based in the sense that a tag is assigned to each word and then changed using a set of predefined rules. If the word is known, the tagger first assigns its most frequent tag; if the word is unknown, it naively assigns the tag "noun". Applying these rules repeatedly to change incorrect tags yields fairly high accuracy.
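
The initial tagging step described above can be sketched in a few lines of Python. This is a minimal illustration, not Brill's code: the function names, the toy corpus and the Penn Treebank-style tag names ("NN" for noun, and so on) are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_corpus):
    """Map each known word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def initial_tags(words, lexicon, default_tag="NN"):
    """Known words get their most frequent tag; unknown words default to noun."""
    return [lexicon.get(word, default_tag) for word in words]

# Toy training data (hypothetical): "can" is seen twice as MD and once as NN.
corpus = [("the", "DT"), ("can", "MD"), ("can", "MD"), ("can", "NN"), ("rusts", "VBZ")]
lexicon = train_most_frequent_tag(corpus)

# "quickly" is not in the lexicon, so it naively receives the noun tag.
print(initial_tags(["the", "can", "rusts", "quickly"], lexicon))
# → ['DT', 'MD', 'VBZ', 'NN']
```

The deliberately wrong tag on "quickly" is the kind of error that the transformation rules are later expected to correct.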

Algorithm

The algorithm goes as follows:
  • Initialisation:
    • Known words (in vocabulary): assign the most frequent tag associated with the word form
    • Unknown words (out of vocabulary):
      • Proper noun if capitalised, simple noun otherwise (1992)
      • Tag guessed with lexical rules, which are learned on the same basis as contextual rules (1994)
  • Learning Phase
    • Iteratively compute the error score of each candidate rule (the difference between the number of errors before and after applying the rule).
    • Select the rule with the highest score.
    • Add it to the rule set and apply it to the text.
    • Repeat until no rule has a score above a given threshold. If the threshold is zero, which can lead to over-fitting, this means repeating until applying new rules leaves the text unchanged; that state is then taken as the final tagging.
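
The learning phase can be sketched as a greedy loop. The version below is a simplified illustration under stated assumptions, not the original implementation: it uses a single hypothetical rule template ("change tag A to tag B if the preceding tag is C"), proposes candidates only at positions that are currently wrong, and evaluates rule conditions on the tags as they stood before the pass. All names are invented for the example.

```python
def apply_rule(rule, tags):
    """Apply one rule; conditions are checked against the tags before this pass."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out

def count_errors(tags, gold):
    """Number of positions where the current tag differs from the reference tag."""
    return sum(t != g for t, g in zip(tags, gold))

def candidate_rules(tags, gold):
    """Propose one rule for every position whose current tag is wrong."""
    return {(tags[i], gold[i], tags[i - 1])
            for i in range(1, len(tags)) if tags[i] != gold[i]}

def learn_rules(tags, gold, threshold=0):
    """Greedily add the best-scoring rule until none scores above `threshold`.

    Assumes threshold >= 0, so every accepted rule removes at least one error
    and the loop is guaranteed to terminate.
    """
    learned = []
    while True:
        errors_before = count_errors(tags, gold)
        scored = [(errors_before - count_errors(apply_rule(rule, tags), gold), rule)
                  for rule in candidate_rules(tags, gold)]
        if not scored:
            break
        best_score, best_rule = max(scored)
        if best_score <= threshold:
            break
        learned.append(best_rule)
        tags = apply_rule(best_rule, tags)
    return learned, tags

# Hypothetical sentence "they want to race the race": the initial tagger
# gives "race" its most frequent tag NN in both positions.
current = ["PRP", "VBP", "TO", "NN", "DT", "NN"]
gold = ["PRP", "VBP", "TO", "VB", "DT", "NN"]
rules, final = learn_rules(current, gold)
print(rules)  # → [('NN', 'VB', 'TO')]
```

In this toy run the only learned rule changes NN to VB after TO, which fixes the first "race" while leaving the second one (preceded by DT) untouched.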

Rules

Lexical rules are used for the initialisation, and contextual rules are used to correct the tags.
  • Lexical rules: word → tag IF Condition (example: identification of suffixes like "-tion")
  • Contextual rules: tag1 → tag2 IF Condition (example: "preceding/following tag is X", "preceding/following word is w")
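
One possible way to represent the two rule types in code is shown below. This is only a sketch: the class names, the condition signatures and the specific rules (including the "-ly" suffix rule and the tag names) are assumptions chosen for illustration, not part of Brill's tagger.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class LexicalRule:
    """word → tag IF condition(word); the condition sees only the word itself."""
    tag: str
    condition: Callable[[str], bool]

@dataclass
class ContextualRule:
    """tag1 → tag2 IF condition(words, tags, i); the condition sees the context."""
    from_tag: str
    to_tag: str
    condition: Callable[[List[str], List[str], int], bool]

def guess_tag(word, lexical_rules, default="NN"):
    """Return the tag of the first lexical rule that fires, else the default."""
    for rule in lexical_rules:
        if rule.condition(word):
            return rule.tag
    return default

def apply_contextual(rule, words, tags):
    """Rewrite from_tag to to_tag wherever the rule's condition holds."""
    snapshot = list(tags)  # conditions are evaluated on the tags before this pass
    return [rule.to_tag if tag == rule.from_tag and rule.condition(words, snapshot, i)
            else tag
            for i, tag in enumerate(snapshot)]

# Hypothetical lexical rules based on suffixes.
lexical_rules = [
    LexicalRule("NN", lambda w: w.endswith("tion")),
    LexicalRule("RB", lambda w: w.endswith("ly")),
]

# Hypothetical contextual rule: NN → VB IF the preceding tag is TO.
after_to = ContextualRule("NN", "VB", lambda words, tags, i: i > 0 and tags[i - 1] == "TO")

print(guess_tag("quickly", lexical_rules))                       # → RB
print(apply_contextual(after_to, ["to", "run"], ["TO", "NN"]))   # → ['TO', 'VB']
```

Passing both the words and the tags to the contextual condition lets the same class cover "preceding/following tag is X" as well as "preceding/following word is w".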

Code

Brill's code pages at Johns Hopkins University are no longer on the web. A mirror of the latest version of the Brill tagger is available at Plymouth Tech: http://www.tech.plym.ac.uk/soc/staff/guidbugm/software/RULE_BASED_TAGGER_V.1.14.tar.Z

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 