Tf–idf
Encyclopedia
The tf–idf weight is a weight often used in information retrieval
Information retrieval
Information retrieval is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web...

 and text mining
Text mining
Text mining, sometimes alternately referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as...

. This weight is a statistical measure used to evaluate how important a word is to a document
Document
The term document has multiple meanings in ordinary language and in scholarship. WordNet 3.1. lists four meanings :* document, written document, papers...

 in a collection or corpus
Text corpus
In linguistics, a corpus or text corpus is a large and structured set of texts...

. The importance increases proportionally
Proportionality (mathematics)
In mathematics, two variable quantities are proportional if one of them is always the product of the other and a constant quantity, called the coefficient of proportionality or proportionality constant. In other words, are proportional if the ratio \tfrac yx is constant. We also say that one...

 to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf–idf weighting scheme are often used by search engine
Search engine
A search engine is an information retrieval system designed to help find information stored on a computer system. The search results are usually presented in a list and are commonly called hits. Search engines help to minimize the time required to find information and the amount of information...

s as a central tool in scoring and ranking a document's relevance
Relevance (information retrieval)
In information science and information retrieval, relevance denotes how well a retrieved document or set of documents meets the information need of the user.-Types:...

 given a user query
Information retrieval
Information retrieval is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web...

. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization
Automatic summarization
Automatic summarization is the creation of a shortened version of a text by a computer program. The product of this procedure still contains the most important points of the original text....

 and classification.

One of the simplest ranking function
Ranking function
In information retrieval, a ranking function is a function used by search engines to rank matching documents according to their relevance to a given search query....

s is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.

Motivation

Suppose we have a set of English text documents and wish to determine which document is most relevant to the query "the brown cow." A simple way to start out is by eliminating documents that do not contain all three words "the," "brown," and "cow," but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document and sum them all together; the number of times a term occurs in a document is called its term frequency. However, because the term "the" is so common, this will tend to incorrectly emphasize documents which happen to use the word "the" more frequently, without giving enough weight to the more meaningful terms "brown" and "cow". Also the term "the" is not a good keyword to distinguish relevant and non-relevant documents and terms. On the contrary, the words "brown" and "cow" that occur rarely are good keywords to distinguish relevant documents from the non-relevant documents. Hence an inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely.

Mathematical details

The term count in the given document is simply the number of times a given term appears in that document. This count is usually normalized to prevent a bias towards longer documents (which may have a higher term count regardless of the actual importance of that term in the document) to give a measure of the importance of the term within the particular document . Thus we have the term frequency , defined in the simplest case as the occurrence count of a term in a document. (Many variants have been suggested; see e.g. Manning, Raghavan and Schütze, p. 118.)

The inverse document frequency is a measure of the general importance of the term (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm
Logarithm
The logarithm of a number is the exponent by which another fixed value, the base, has to be raised to produce that number. For example, the logarithm of 1000 to base 10 is 3, because 1000 is 10 to the power 3: More generally, if x = by, then y is the logarithm of x to base b, and is written...

 of that quotient
Quotient
In mathematics, a quotient is the result of division. For example, when dividing 6 by 3, the quotient is 2, while 6 is called the dividend, and 3 the divisor. The quotient further is expressed as the number of times the divisor divides into the dividend e.g. The quotient of 6 and 2 is also 3.A...

).


with
  • : cardinality of D, or the total number of documents in the corpus
  • : number of documents where the term appears (i.e., ). If the term is not in the corpus, this will lead to a division-by-zero. It is therefore common to adjust the formula to .


Then


A high weight in tf–idf is reached by a high term frequency
Frequency (statistics)
In statistics the frequency of an event i is the number ni of times the event occurred in the experiment or the study. These frequencies are often graphically represented in histograms....

 (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. The tf–idf value for a term will be greater than zero if and only if the ratio inside the idf's log function is greater than 1. Depending on whether a 1 is added to the denominator, a term in all documents will have either a zero or negative idf, and if the 1 is added to the denominator a term that occurs in all but one document will have an idf equal to zero.

Various (mathematical) forms of the tf–idf term weight can be derived from a probabilistic retrieval model that mimicks human relevance decision making.

Example

Consider a document containing 100 words wherein the word cow appears 3 times. Following the previously defined formulas, the term frequency (TF) for cow is then (3 / 100) = 0.03. Now, assume we have 10 million documents and cow appears in one thousand of these. Then, the inverse document frequency is calculated as log(10 000 000 / 1 000) = 4. The tf–idf score is the product of these quantities: 0.03 × 4 = 0.12.

See also

  • Okapi BM25
    Okapi BM25
    In information retrieval, Okapi BM25 is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and...

  • Noun phrase
    Noun phrase
    In grammar, a noun phrase, nominal phrase, or nominal group is a phrase based on a noun, pronoun, or other noun-like word optionally accompanied by modifiers such as adjectives....

  • Word count
    Word count
    The word count is the number of words in a document or passage of text. Word counting may be needed when a text is required to stay within certain numbers of words. This may particularly be the case in academia, legal proceedings, journalism and advertising. Word count is commonly used by...

  • Kullback-Leibler divergence
  • Mutual Information
    Mutual information
    In probability theory and information theory, the mutual information of two random variables is a quantity that measures the mutual dependence of the two random variables...

  • Latent semantic analysis
    Latent semantic analysis
    Latent semantic analysis is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close...

  • Latent semantic indexing
    Latent semantic indexing
    Latent Semantic Indexing is an indexing and retrieval method that uses a mathematical technique called Singular value decomposition to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words...

  • Latent Dirichlet allocation
    Latent Dirichlet allocation
    In statistics, latent Dirichlet allocation is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar...


External links

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK