Nearest neighbor search
Encyclopedia
Nearest neighbor search also known as proximity search, similarity search or closest point search, is an optimization problem for finding closest points in metric space
Metric space
In mathematics, a metric space is a set where a notion of distance between elements of the set is defined.The metric space which most closely corresponds to our intuitive understanding of space is the 3-dimensional Euclidean space...

s. The problem is: given a set S of points in a metric space M and a query point q ∈ M, find the closest point in S to q. In many cases, M is taken to be d-dimensional Euclidean space
Euclidean space
In mathematics, Euclidean space is the Euclidean plane and three-dimensional space of Euclidean geometry, as well as the generalizations of these notions to higher dimensions...

 and distance is measured by Euclidean distance
Euclidean distance
In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" distance between two points that one would measure with a ruler, and is given by the Pythagorean formula. By using this formula as distance, Euclidean space becomes a metric space...

 or Manhattan distance
Taxicab geometry
Taxicab geometry, considered by Hermann Minkowski in the 19th century, is a form of geometry in which the usual distance function or metric of Euclidean geometry is replaced by a new metric in which the distance between two points is the sum of the absolute differences of their coordinates...

.

Donald Knuth
Donald Knuth
Donald Ervin Knuth is a computer scientist and Professor Emeritus at Stanford University.He is the author of the seminal multi-volume work The Art of Computer Programming. Knuth has been called the "father" of the analysis of algorithms...

 in vol. 3 of The Art of Computer Programming
The Art of Computer Programming
The Art of Computer Programming is a comprehensive monograph written by Donald Knuth that covers many kinds of programming algorithms and their analysis....

(1973) called it the post-office problem, referring to an application of assigning to a residence the nearest post office.

Applications

The nearest neighbor search problem arises in numerous fields of application, including:
  • Pattern recognition
    Pattern recognition
    In machine learning, pattern recognition is the assignment of some sort of output value to a given input value , according to some specific algorithm. An example of pattern recognition is classification, which attempts to assign each input value to one of a given set of classes...

     - in particular for optical character recognition
    Optical character recognition
    Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. It is widely used to convert books and documents into electronic files, to computerize a record-keeping...

  • Statistical classification- see k-nearest neighbor algorithm
    K-nearest neighbor algorithm
    In pattern recognition, the k-nearest neighbor algorithm is a method for classifying objects based on closest training examples in the feature space. k-NN is a type of instance-based learning, or lazy learning where the function is only approximated locally and all computation is deferred until...

  • Computer vision
    Computer vision
    Computer vision is a field that includes methods for acquiring, processing, analysing, and understanding images and, in general, high-dimensional data from the real world in order to produce numerical or symbolic information, e.g., in the forms of decisions...

  • Database
    Database
    A database is an organized collection of data for one or more purposes, usually in digital form. The data are typically organized to model relevant aspects of reality , in a way that supports processes requiring this information...

    s - e.g. content-based image retrieval
    Content-based image retrieval
    Content-based image retrieval , also known as query by image content and content-based visual information retrieval is the application of computer vision techniques to the image retrieval problem, that is, the problem of searching for digital images in large databases....

  • Coding theory
    Coding theory
    Coding theory is the study of the properties of codes and their fitness for a specific application. Codes are used for data compression, cryptography, error-correction and more recently also for network coding...

     - see maximum likelihood decoding
    Decoding methods
    In communication theory and coding theory, decoding is the process of translating received messages into codewords of a given code. There have been many common methods of mapping messages to codewords...

  • Data compression
    Data compression
    In computer science and information theory, data compression, source coding or bit-rate reduction is the process of encoding information using fewer bits than the original representation would use....

     - see MPEG-2
    MPEG-2
    MPEG-2 is a standard for "the generic coding of moving pictures and associated audio information". It describes a combination of lossy video compression and lossy audio data compression methods which permit storage and transmission of movies using currently available storage media and transmission...

     standard
  • Recommendation systems
  • Internet marketing
    Internet marketing
    Internet marketing, also known as digital marketing, web marketing, online marketing, search marketing or e-marketing, is referred to as the marketing of products or services over the Internet...

     - see contextual advertising
    Contextual advertising
    Contextual advertising is a form of targeted advertising for advertisements appearing on websites or other media, such as content displayed in mobile browsers...

     and behavioral targeting
    Behavioral targeting
    Behavioral targeting is a technique used by online publishers and advertisers to increase the effectiveness of their campaigns.Behavioral targeting uses information collected on an individual's web-browsing behavior, such as the pages they have visited or the searches they have made, to select...

  • DNA sequencing
    DNA sequencing
    DNA sequencing includes several methods and technologies that are used for determining the order of the nucleotide bases—adenine, guanine, cytosine, and thymine—in a molecule of DNA....

  • Spell checking - suggesting correct spelling
  • Plagiarism detection
    Plagiarism detection
    Plagiarism detection is the process of locating instances of plagiarism within a work or document. The widespread use of computers and the advent of the Internet has made it easier to plagiarize the work of others. Most cases of plagiarism are found in academia, where documents are typically essays...

  • Contact searching algorithms in FEA
  • Similarity score
    Similarity score
    In Sabermetrics and APBRmetrics, similarity scores are a method of comparing baseball and basketball players to other players, with the intent of discovering who the most similar historical players are to a certain player....

    s for predicting career paths of professional athletes.
  • Cluster analysis - assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense, usually based on Euclidean distance
    Euclidean distance
    In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" distance between two points that one would measure with a ruler, and is given by the Pythagorean formula. By using this formula as distance, Euclidean space becomes a metric space...


Methods

Various solutions to the NNS problem have been proposed. The quality and usefulness of the algorithms are determined by the time complexity of queries as well as the space complexity of any search data structures that must be maintained. The informal observation usually referred to as the curse of dimensionality
Curse of dimensionality
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing high-dimensional spaces that do not occur in low-dimensional settings such as the physical space commonly modeled with just three dimensions.There are multiple phenomena referred to by this name in...

 states that there is no general-purpose exact solution for NNS in high-dimensional Euclidean space using polynomial preprocessing and polylogarithmic search time.

Linear search

The simplest solution to the NNS problem is to compute the distance from the query point to every other point in the database, keeping track of the "best so far". This algorithm, sometimes referred to as the naive approach, has a running time
Running Time
Running Time may refer to:* Running Time * see Analysis of algorithms...

 of O(Nd) where N is the cardinality of S and d is the dimensionality of M. There are no search data structures to maintain, so linear search has no space complexity beyond the storage of the database. Surprisingly, naive search outperforms space partitioning approaches on higher dimensional spaces.

Space partitioning

Since the 1970s, branch and bound
Branch and bound
Branch and bound is a general algorithm for finding optimal solutions of various optimization problems, especially in discrete and combinatorial optimization...

 methodology has been applied to the problem. In the case of Euclidean space this approach is known as spatial index or spatial access methods. Several space-partitioning
Space partitioning
In mathematics, space partitioning is the process of dividing a space into two or more disjoint subsets . In other words, space partitioning divides a space into non-overlapping regions...

 methods have been developed for solving the NNS problem. Perhaps the simplest is the k-d tree, which iteratively bisects the search space into two regions containing half of the points of the parent region. Queries are performed via traversal of the tree from the root to a leaf by evaluating the query point at each split. Depending on the distance specified in the query, neighboring branches that might contain hits may also need to be evaluated. For constant dimension query time, average complexity is O(log N) in the case of randomly distributed points, worst case complexity analyses have been performed.
Alternatively the R-tree
R-tree
R-trees are tree data structures used for spatial access methods, i.e., for indexing multi-dimensional information such as geographical coordinates, rectangles or polygons. The R-tree was proposed by Antonin Guttman in 1984 and has found significant use in both research and real-world applications...

 data structure was designed to support nearest neighbor search in dynamic context, as it has efficient algorithms for insertions and deletions.

In case of general metric space branch and bound approach is known under the name of metric trees. Particular examples include vp-tree
Vp-tree
A vantage-point tree, or vp-tree is a BSP tree that segregates data in a metric space by choosing a position in the space and dividing the data points into two partitions: those that are nearer to the vantage point than a threshold, and those that are not...

 and BK-tree.

Using a set of points taken from a 3-dimensional space and put into a BSP tree, and given a query point taken from the same space, a possible solution to the problem of finding the nearest point-cloud point to the query point is given in the following description of an algorithm. (Strictly speaking, no such point may exist, because it may not be unique. But in practice, usually we only care about finding any one of the subset of all point-cloud points that exist at the shortest distance to a given query point.) The idea is, for each branching of the tree, guess that the closest point in the cloud resides in the half-space containing the query point. This may not be the case, but it is a good heuristic. After having recursively gone through all the trouble of solving the problem for the guessed half-space, now compare the distance returned by this result with the shortest distance from the query point to the partitioning plane. This latter distance is that between the query point and the closest possible point that could exist in the half-space not searched. If this distance is greater than that returned in the earlier result, then clearly there is no need to search the other half-space. If there is such a need, then you must go through the trouble of solving the problem for the other half space, and then compare its result to the former result, and then return the proper result. The performance of this algorithm is nearer to logarithmic time than linear time when the query point is near the cloud, because as the distance between the query point and the closest point-cloud point nears zero, the algorithm needs only perform a look-up using the query point as a key to get the correct result.

Locality sensitive hashing

Locality sensitive hashing
Locality sensitive hashing
Locality-sensitive hashing is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability .-Definition:An LSH family \mathcal F is defined fora...

 (LSH) is a technique for grouping points in space into 'buckets' based on some distance metric operating on the points. Points that are close to each other under the chosen metric are mapped to the same bucket with high probability.

Nearest neighbor search in spaces with small intrinsic dimension

The cover tree
Cover tree
The cover tree is a type of data structure in computer science that is specifically designed to facilitate the speed-up of a nearest neighbor search...

 has a theoretical bound that is based on the dataset's doubling constant. The bound on search time is O(c12 log n) where c is the expansion constant of the dataset.

Vector Approximation Files

In high dimensional spaces tree indexing structures become useless because an increasing percentage of the nodes need to be examined anyway. To speed up linear search, a compressed version of the feature vectors stored in RAM is used to prefilter the datasets in a first run. The final candidates are determined in a second stage using the uncompressed data from the disk for distance calculation.

Compression/Clustering Based Search

The VA-File approach is a special case of a compression based search, where each feature component is compressed uniformly and independently. The optimal compression technique in multidimensional spaces is Vector Quantization (VQ), implemented through clustering. The database is clustered and the most "promising" clusters are retrieved. Huge-gains over VA-File, tree-based indexes and sequential scan have been observed. Also note the parallels between clustering and LSH.

Variants

There are numerous variants of the NNS problem and the two most well-known are the k-nearest neighbor search
K-nearest neighbor algorithm
In pattern recognition, the k-nearest neighbor algorithm is a method for classifying objects based on closest training examples in the feature space. k-NN is a type of instance-based learning, or lazy learning where the function is only approximated locally and all computation is deferred until...

 and the ε-approximate nearest neighbor search.

k-nearest neighbor

k-nearest neighbor search
K-nearest neighbor algorithm
In pattern recognition, the k-nearest neighbor algorithm is a method for classifying objects based on closest training examples in the feature space. k-NN is a type of instance-based learning, or lazy learning where the function is only approximated locally and all computation is deferred until...

 identifies the top k nearest neighbors to the query. This technique is commonly used in predictive analytics to estimate or classify a point based on the consensus of its neighbors. k-nearest neighbor graphs are graphs in which every point is connected to its k nearest neighbors.

Approximate nearest neighbor

In some applications it may be acceptable to retrieve a "good guess" of the nearest neighbor. In those cases, we can use an algorithm which doesn't guarantee to return the actual nearest neighbor in every case, in return for improved speed or memory savings. Often such an algorithm will find the nearest neighbor in a majority of cases, but this depends strongly on the dataset being queried.

Algorithms that support the approximate nearest neighbor search include locality-sensitive hashing, best bin first
Best Bin First
Best bin first is a search algorithm that is designed to efficiently find an approximate solution to the nearest neighbor search problem in very-high-dimensional spaces. The algorithm is based on a variant of the kd-tree search algorithm which makes indexing higher dimensional spaces possible...

 and balanced box-decomposition tree based search.

ε-approximate nearest neighbor search is becoming an increasingly popular tool for fighting the curse of dimensionality
Curse of dimensionality
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing high-dimensional spaces that do not occur in low-dimensional settings such as the physical space commonly modeled with just three dimensions.There are multiple phenomena referred to by this name in...

.

Nearest neighbor distance ratio

Nearest neighbor distance ratio do not apply the threshold on the direct distance from the original point to the challenger neighbor but on a ratio of it depending on the distance to the previous neighbor. It is used in CBIR
Content-based image retrieval
Content-based image retrieval , also known as query by image content and content-based visual information retrieval is the application of computer vision techniques to the image retrieval problem, that is, the problem of searching for digital images in large databases....

 to retrieve pictures through a "query by example" using the similarity between local features. More generally it is involved in several matching
Matching
In the mathematical discipline of graph theory, a matching or independent edge set in a graph is a set of edges without common vertices. It may also be an entire graph consisting of edges without common vertices.- Definition :...

 problems.

All nearest neighbors

For some applications (e.g. entropy estimation
Entropy estimation
Estimating the differential entropy of a system or process, given some observations, is useful in various science/engineering applications, such as Independent Component Analysis, image analysis, genetic analysis, speech recognition, manifold learning, and time delay estimation...

), we may have N data-points and wish to know which is the nearest neighbor for every one of those N points. This could of course be achieved by running a nearest-neighbor search once for every point, but an improved strategy would be an algorithm that exploits the information redundancy between these N queries to produce a more efficient search. As a simple example: when we find the distance from point X to point Y, that also tells us the distance from point Y to point X, so the same calculation can be reused in two different queries.

Given a fixed dimension, a semi-definite positive norm (thereby including every Lp norm
Lp space
In mathematics, the Lp spaces are function spaces defined using a natural generalization of the p-norm for finite-dimensional vector spaces...

), and n points in this space, the m nearest neighbours of every point can be found in O(mn log n) time.

See also

  • Ball tree
    Ball tree
    A ball tree is a data structure designed to enable fast nearest neighbor searching in high-dimensional spaces, by covering the points in a dataset with a system of balls arranged in a tree.-External links:...

  • Cluster analysis
  • Content-based image retrieval
    Content-based image retrieval
    Content-based image retrieval , also known as query by image content and content-based visual information retrieval is the application of computer vision techniques to the image retrieval problem, that is, the problem of searching for digital images in large databases....

  • Curse of dimensionality
    Curse of dimensionality
    The curse of dimensionality refers to various phenomena that arise when analyzing and organizing high-dimensional spaces that do not occur in low-dimensional settings such as the physical space commonly modeled with just three dimensions.There are multiple phenomena referred to by this name in...

  • Digital signal processing
    Digital signal processing
    Digital signal processing is concerned with the representation of discrete time signals by a sequence of numbers or symbols and the processing of these signals. Digital signal processing and analog signal processing are subfields of signal processing...

  • Dimension reduction

  • Fixed-radius near neighbors
    Fixed-radius near neighbors
    In computational geometry, the fixed-radius near neighbor problem is a variant of the nearest neighbor search problem. In the fixed-radius near neighbor problem, one is given as input a set of points in d-dimensional Euclidean space and a fixed distance Δ...

  • Fourier Analysis
  • k-nearest neighbor algorithm
    K-nearest neighbor algorithm
    In pattern recognition, the k-nearest neighbor algorithm is a method for classifying objects based on closest training examples in the feature space. k-NN is a type of instance-based learning, or lazy learning where the function is only approximated locally and all computation is deferred until...

  • Linear least squares
    Linear least squares
    In statistics and mathematics, linear least squares is an approach to fitting a mathematical or statistical model to data in cases where the idealized value provided by the model for any data point is expressed linearly in terms of the unknown parameters of the model...

  • Locality sensitive hashing
    Locality sensitive hashing
    Locality-sensitive hashing is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability .-Definition:An LSH family \mathcal F is defined fora...

  • Multidimensional analysis
    Multidimensional analysis
    In statistics, econometrics, and related fields, multidimensional analysis is a data analysis process that groups data into two or more categories: data dimensions and measurements. For example, a data set consisting of the number of wins for a single football team at each of several years is a...


  • Nearest-neighbor interpolation
  • Principal Component Analysis
  • Singular value decomposition
    Singular value decomposition
    In linear algebra, the singular value decomposition is a factorization of a real or complex matrix, with many useful applications in signal processing and statistics....

  • Time series
    Time series
    In statistics, signal processing, econometrics and mathematical finance, a time series is a sequence of data points, measured typically at successive times spaced at uniform time intervals. Examples of time series are the daily closing value of the Dow Jones index or the annual flow volume of the...

  • Voronoi diagram
    Voronoi diagram
    In mathematics, a Voronoi diagram is a special kind of decomposition of a given space, e.g., a metric space, determined by distances to a specified family of objects in the space...

  • Wavelet
    Wavelet
    A wavelet is a wave-like oscillation with an amplitude that starts out at zero, increases, and then decreases back to zero. It can typically be visualized as a "brief oscillation" like one might see recorded by a seismograph or heart monitor. Generally, wavelets are purposefully crafted to have...



External links

  • Nearest Neighbors and Similarity Search - a website dedicated to educational materials, software, literature, researchers, open problems and events related to NN searching. Maintained by Yury Lifshits.
  • Similarity Search Wiki a collection of links, people, ideas, keywords, papers, slides, code and data sets on nearest neighbours.
  • Metric Spaces Library - An open-source C-based library for metric space indexing by Karina Figueroa, Gonzalo Navarro, Edgar Chávez.
  • ANN - A Library for Approximate Nearest Neighbor Searching by David M. Mount and Sunil Arya.
  • FLANN - Fast Approximate Nearest Neighbor search library by Marius Muja and David G. Lowe
  • STANN - A Simple Threaded Approximate Nearest Neighbor Search Library in C++ by Michael Connor and Piyush Kumar.
  • MESSIF - Metric Similarity Search Implementation Framework by Michal Batko and David Novak.
  • OBSearch - Similarity Search engine for Java (GPL). Implementation by Arnoldo Muller, developed during Google Summer of Code 2007.
  • KNNLSB - K Nearest Neighbors Linear Scan Baseline (distributed, LGPL). Implementation by Georges Quénot (LIG-CNRS).
  • NearTree - An API for finding nearest neighbors among points in spaces of arbitrary dimensions by Lawrence C. Andrews and Herbert J. Bernstein.
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK