Statistical semantics
Statistical semantics is the study of "how the statistical patterns of human word usage can be used to figure out what people mean, at least to a level sufficient for information access" (Furnas, 2006). How can we figure out what words mean, simply by looking at patterns of words in huge collections of text? What are the limits to this approach to understanding words?

History

The term Statistical Semantics was first used by Warren Weaver (1955) in his well-known paper on machine translation. He argued that word sense disambiguation for machine translation should be based on the co-occurrence frequency of the context words near a given target word. The underlying assumption that "a word is characterized by the company it keeps" was advocated by J. R. Firth (1957). This assumption is known in linguistics as the Distributional Hypothesis. Delavenay (1960) defined Statistical Semantics as "Statistical study of meanings of words and their frequency and order of recurrence." Furnas et al. (1983) is frequently cited as a foundational contribution to Statistical Semantics. An early success in the field was Latent Semantic Analysis.
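
To make the co-occurrence idea concrete, the minimal sketch below (not part of the original article; the toy corpus, window size, and word choices are illustrative assumptions) builds a raw co-occurrence vector for each word and compares the vectors with cosine similarity, so that words appearing in similar contexts come out as more similar:

    from collections import Counter, defaultdict
    import math

    # Toy corpus standing in for a large text collection (illustrative only).
    corpus = [
        "the cat sat on the mat",
        "the dog sat on the rug",
        "the cat chased the dog",
    ]

    WINDOW = 2  # context = words within two positions of the target word
    cooccurrence = defaultdict(Counter)

    for sentence in corpus:
        tokens = sentence.split()
        for i, target in enumerate(tokens):
            for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
                if j != i:
                    cooccurrence[target][tokens[j]] += 1

    def cosine(u, v):
        """Cosine similarity between two sparse count vectors."""
        dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
        norm_u = math.sqrt(sum(c * c for c in u.values()))
        norm_v = math.sqrt(sum(c * c for c in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    # "cat" and "dog" keep similar company in this corpus, so their vectors
    # come out as more alike than, say, "cat" and "chased".
    print(cosine(cooccurrence["cat"], cooccurrence["dog"]))
    print(cosine(cooccurrence["cat"], cooccurrence["chased"]))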

Applications of statistical semantics

Research in Statistical Semantics has resulted in a wide variety of algorithms that use the Distributional Hypothesis to discover many aspects of semantics by applying statistical techniques to large corpora:
  • Measuring the similarity in word meanings (Lund et al., 1995; Landauer and Dumais, 1997; McDonald and Ramscar, 2001; Terra and Clarke, 2003); a sketch in the spirit of Latent Semantic Analysis follows this list

  • Measuring the similarity in word relations (Turney, 2006)

  • Modeling similarity-based generalization (Yarlett, 2008)

  • Discovering words with a given relation (Hearst, 1992)

  • Classifying relations between words (Turney and Littman, 2005)

  • Extracting keywords from documents (Frank et al., 1999; Turney, 2000)

  • Measuring the cohesiveness of text (Turney, 2003)

  • Discovering the different senses of words (Pantel and Lin, 2002)

  • Distinguishing the different senses of words (Turney, 2004)

  • Modeling subcognitive aspects of words (Turney, 2001)

  • Distinguishing praise from criticism (Turney and Littman, 2003)
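
For the first item in the list above, the following minimal sketch in the spirit of Latent Semantic Analysis (not from the original article; the toy term-document counts and the number of retained dimensions are illustrative assumptions) factors a term-document count matrix with a truncated singular value decomposition and compares terms in the resulting low-dimensional space:

    import numpy as np

    # Toy term-document count matrix: rows are terms, columns are documents
    # (illustrative only; real applications use large corpora).
    terms = ["cat", "dog", "pet", "stock", "market"]
    counts = np.array([
        [2, 1, 0, 0],   # cat
        [1, 2, 0, 0],   # dog
        [1, 1, 1, 0],   # pet
        [0, 0, 2, 1],   # stock
        [0, 0, 1, 2],   # market
    ], dtype=float)

    # Truncated SVD: keep the k largest singular values and their vectors.
    k = 2
    U, s, Vt = np.linalg.svd(counts, full_matrices=False)
    term_vectors = U[:, :k] * s[:k]  # term coordinates in the latent space

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Terms that occur in similar documents end up close together in the
    # latent space, even when they never co-occur in the same document.
    cat, dog, market = (terms.index(t) for t in ("cat", "dog", "market"))
    print(cosine(term_vectors[cat], term_vectors[dog]))      # relatively high
    print(cosine(term_vectors[cat], term_vectors[market]))   # relatively low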

Related fields

Statistical Semantics focuses on the meanings of common words and the relations between common words, unlike text mining, which tends to focus on whole documents, document collections, or named entities (names of people, places, and organizations). Statistical Semantics is a subfield of computational semantics, which is in turn a subfield of computational linguistics and natural language processing.

Many of the applications of Statistical Semantics (listed above) can also be addressed by lexicon-based algorithms instead of the corpus-based algorithms of Statistical Semantics. One advantage of corpus-based algorithms is that they are typically less labour-intensive than lexicon-based algorithms; another is that they are usually easier to adapt to new languages. However, the best performance on an application is often achieved by combining the two approaches (Turney et al., 2003).

See also

  • Latent semantic analysis
  • Latent semantic indexing
  • Text mining
  • Information retrieval
  • Natural language processing
  • Computational linguistics
  • Web mining
  • Semantic similarity
  • Co-occurrence
  • Text corpus
  • Semantic analytics
