Binary Independence Model
The Binary Independence Model (BIM) is a probabilistic information retrieval technique that makes simplifying assumptions to render the estimation of document/query similarity probabilities feasible.

Definitions

The Binary Independence Assumption is that documents are binary vectors. That is, only the presence or absence of terms in a document is recorded. Terms are assumed to be independently distributed in the set of relevant documents and independently distributed in the set of irrelevant documents.
The representation is an ordered set of Boolean variables. That is, the representation of a document or query is a vector with one Boolean element for each term under consideration. More specifically, a document is represented by a vector d = (x_1, ..., x_m) where x_t = 1 if term t is present in the document d and x_t = 0 if it is not. Many documents can have the same vector representation with this simplification. Queries are represented in a similar way.
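A minimal sketch in Python of this Boolean representation (the vocabulary, texts, and helper name below are invented for illustration, not taken from the article):

    # Build a binary term-presence vector over a fixed vocabulary.
    vocabulary = ["retrieval", "model", "probability", "ranking"]

    def to_binary_vector(text, vocabulary):
        # 1 if the term occurs in the document, 0 otherwise;
        # term frequency and word order are discarded.
        tokens = set(text.lower().split())
        return [1 if term in tokens else 0 for term in vocabulary]

    print(to_binary_vector("a probabilistic model for ranking in retrieval", vocabulary))
    # -> [1, 1, 0, 1]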
"Independence" signifies that terms in the document are considered independently from each other and no association between terms is modeled. This assumption is very limiting, but it has been shown that it gives good enough results for many situations. This independence is the "naive" assumption of a Naive Bayes classifier
Naive Bayes classifier
A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong independence assumptions...

, where properties that imply each other are nonetheless treated as independent for the sake of simplicity. This assumption allows the representation to be treated as an instance of a Vector space model
Vector space model
Vector space model is an algebraic model for representing text documents as vectors of identifiers, such as, for example, index terms. It is used in information filtering, information retrieval, indexing and relevancy rankings...

 by considering each term as a value of 0 or 1 along a dimension orthogonal to the dimensions used for the other terms.

The probability P(R|d,q) that a document is relevant derives from the probability of relevance of the term vector of that document, P(R|x,q). Applying Bayes' rule gives:

P(R=1|x,q) = P(x|R=1,q) P(R=1|q) / P(x|q)
P(R=0|x,q) = P(x|R=0,q) P(R=0|q) / P(x|q)

where P(x|R=1,q) and P(x|R=0,q) are the probabilities that a relevant or a nonrelevant document, respectively, has the representation x.
The exact probabilities cannot be known beforehand, so estimates derived from statistics about the collection of documents must be used.

P(R=1|q) and P(R=0|q) indicate the prior probability of retrieving a relevant or nonrelevant document, respectively, for a query q. If, for instance, we knew the percentage of relevant documents in the collection, we could use it to estimate these probabilities.
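As a hedged sketch, the Bayes-rule posterior above can be computed directly once such estimates are available (all probabilities, counts, and function names below are invented for illustration):

    # Posterior P(R=1|x,q) under the term-independence assumption.
    # p_rel[t]    estimates P(x_t=1 | R=1, q)
    # p_nonrel[t] estimates P(x_t=1 | R=0, q)
    # prior_rel   estimates P(R=1|q), e.g. the fraction of relevant documents.
    def posterior_relevance(x, p_rel, p_nonrel, prior_rel):
        like_rel = like_nonrel = 1.0
        for x_t, p_t, q_t in zip(x, p_rel, p_nonrel):
            # independence: per-term probabilities simply multiply
            like_rel *= p_t if x_t else (1.0 - p_t)
            like_nonrel *= q_t if x_t else (1.0 - q_t)
        evidence = like_rel * prior_rel + like_nonrel * (1.0 - prior_rel)  # P(x|q)
        return like_rel * prior_rel / evidence

    # If 5% of the collection is relevant, the priors are 0.05 and 0.95:
    print(posterior_relevance([1, 0, 1], [0.8, 0.4, 0.6], [0.3, 0.5, 0.2], 0.05))
    # -> ~0.336; P(R=0|x,q) is simply the complement, 1 - 0.336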
Since a document is either relevant or nonrelevant to a query, we have:

P(R=1|x,q) + P(R=0|x,q) = 1

Query Term Weighting

Given a binary query and the dot product as the similarity function
between a document and a query, the problem is to assign weights to the
terms in the query such that the retrieval effectiveness will be high. Let p_i and q_i be the probabilities that a relevant document and an irrelevant document, respectively, contain the i-th term. Yu and Salton, who first introduced the BIM, proposed that the weight of the i-th term be an increasing function of Y_i = p_i (1 - q_i) / ((1 - p_i) q_i). Thus, if Y_i is higher than Y_j, the weight of term i will be higher than that of term j. Yu and Salton showed that such a weight assignment to query terms yields better retrieval effectiveness than if query terms are equally weighted. Robertson and Spärck Jones later showed that if the i-th term is assigned the weight log Y_i, then optimal retrieval effectiveness is obtained under the Binary Independence Assumption.
The Binary Independence Model was first introduced by Yu and Salton; the name "Binary Independence Model" was coined later by Robertson and Spärck Jones.
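A minimal sketch of this log-odds weighting, assuming p_i and q_i are estimated from relevance-judged document counts (the add-0.5 smoothing is a common convention, not from this article; the counts are invented):

    import math

    # Log-odds term weight: log [ p_i (1 - q_i) / ((1 - p_i) q_i) ], i.e. log Y_i.
    def log_odds_weight(p_i, q_i):
        return math.log(p_i * (1.0 - q_i) / ((1.0 - p_i) * q_i))

    # Estimate p_i and q_i from document counts, with add-0.5 smoothing
    # to avoid zero or undefined weights.
    def estimated_weight(rel_with_term, rel_total, nonrel_with_term, nonrel_total):
        p_i = (rel_with_term + 0.5) / (rel_total + 1.0)
        q_i = (nonrel_with_term + 0.5) / (nonrel_total + 1.0)
        return log_odds_weight(p_i, q_i)

    # A term occurring in 8 of 10 relevant documents but only 100 of 1000
    # nonrelevant ones receives a large positive weight:
    print(estimated_weight(8, 10, 100, 1000))  # -> ~3.42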

Recent work

Wu et al. proposed a new probabilistic retrieval model that weights terms according to their contexts in documents. The term weighting function they used is similar to those of the language model and the binary independence model.

Roelleke et al. investigate probabilistic relational implementations of the BIM, using probabilistic relational algebra to integrate information retrieval into databases.
This work arose from the interest shown by Surajit et al. in applying probabilistic models from information retrieval to structured data, in order to investigate the problem of ranking the answers to a database query when many tuples are returned.