R-tree

# R-tree

Overview
R-trees are tree data structures used for spatial access methods, i.e., for indexing multi-dimensional information such as geographical coordinates
Geographic coordinate system
A geographic coordinate system is a coordinate system that enables every location on the Earth to be specified by a set of numbers. The coordinates are often chosen such that one of the numbers represent vertical position, and two or three of the numbers represent horizontal position...

, rectangle
Rectangle
In Euclidean plane geometry, a rectangle is any quadrilateral with four right angles. The term "oblong" is occasionally used to refer to a non-square rectangle...

s or polygon
Polygon
In geometry a polygon is a flat shape consisting of straight lines that are joined to form a closed chain orcircuit.A polygon is traditionally a plane figure that is bounded by a closed path, composed of a finite sequence of straight line segments...

s. The R-tree was proposed by Antonin Guttman in 1984 and has found significant use in both research and real-world applications. A common real-world usage for an R-tree might be to store spatial objects such as restaurant locations or the polygons that typical maps are made of: streets, buildings, outlines of lakes, coastlines, etc. and then find answers quickly to queries such as "Find all museums within 2 km of my current location", "retrieve all road segments within 2 km of my location" (to display them in a navigation system
A navigation system is a system that aids is navigation. Navigation systems may be entirely on board a vehicle or vessel, or they may be located elsewhere and communicate via radio or other signals with a vehicle or vessel, or they may use a combination of these methods.Navigation systems may be...

) or "find the nearest gas station" (although not taking roads into account).
Discussion

Encyclopedia
R-trees are tree data structures used for spatial access methods, i.e., for indexing multi-dimensional information such as geographical coordinates
Geographic coordinate system
A geographic coordinate system is a coordinate system that enables every location on the Earth to be specified by a set of numbers. The coordinates are often chosen such that one of the numbers represent vertical position, and two or three of the numbers represent horizontal position...

, rectangle
Rectangle
In Euclidean plane geometry, a rectangle is any quadrilateral with four right angles. The term "oblong" is occasionally used to refer to a non-square rectangle...

s or polygon
Polygon
In geometry a polygon is a flat shape consisting of straight lines that are joined to form a closed chain orcircuit.A polygon is traditionally a plane figure that is bounded by a closed path, composed of a finite sequence of straight line segments...

s. The R-tree was proposed by Antonin Guttman in 1984 and has found significant use in both research and real-world applications. A common real-world usage for an R-tree might be to store spatial objects such as restaurant locations or the polygons that typical maps are made of: streets, buildings, outlines of lakes, coastlines, etc. and then find answers quickly to queries such as "Find all museums within 2 km of my current location", "retrieve all road segments within 2 km of my location" (to display them in a navigation system
A navigation system is a system that aids is navigation. Navigation systems may be entirely on board a vehicle or vessel, or they may be located elsewhere and communicate via radio or other signals with a vehicle or vessel, or they may use a combination of these methods.Navigation systems may be...

) or "find the nearest gas station" (although not taking roads into account).

## R-tree idea

The key idea of the data structure is to group nearby objects and represent them with their minimum bounding rectangle
Minimum bounding rectangle
The minimum bounding rectangle , also known as bounding box or envelope, is an expression of the maximum extents of a 2-dimensional object within its 2-D coordinate system, in other words min, max, min, max...

in the next higher level of the tree; the "R" in R-tree is for rectangle. Since all objects lie within this bounding rectangle, a query that does not intersect the bounding rectangle can also not intersect any of the contained objects. At the leaf level, each rectangle describes a single object; at higher levels the aggregation of an increasing number of objects. This can also be seen as an increasingly coarse approximation of the data set.

Similar to the B-tree
B-tree
In computer science, a B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree is a generalization of a binary search tree in that a node can have more than two children...

, the R-tree is also a balanced search tree (so all leaf nodes are at the same height), organizes the data in pages, and is designed for storage on disk (as used in database
Database
A database is an organized collection of data for one or more purposes, usually in digital form. The data are typically organized to model relevant aspects of reality , in a way that supports processes requiring this information...

s). Each page can contain a maximum number of entries, often denoted as . It also guarantees a minimum fill (exept for the root node), however best performance has been experienced with a minimum fill of of the maxium number of entries (B-trees guarantee page fill, and B*-trees even ). The reason for this is the more complex balancing required for spatial data as opposed to linear data stored in B-trees.

As with most trees, the searching algorithms (e.g., intersection
Intersection (set theory)
In mathematics, the intersection of two sets A and B is the set that contains all elements of A that also belong to B , but no other elements....

, containment, nearest neighbor search
Nearest neighbor search
Nearest neighbor search , also known as proximity search, similarity search or closest point search, is an optimization problem for finding closest points in metric spaces. The problem is: given a set S of points in a metric space M and a query point q ∈ M, find the closest point in S to q...

) are rather simple. The key idea is to use the bounding boxes to decide whether or not to search inside a subtree. In this way, most of the nodes in the tree are never read during a search. Like B-trees, this makes R-trees suitable for large data sets and database
Database
A database is an organized collection of data for one or more purposes, usually in digital form. The data are typically organized to model relevant aspects of reality , in a way that supports processes requiring this information...

s, where nodes can be paged to memory when needed, and the whole tree cannot be kept in main memory.

The key difficulty of R-trees is to build an efficient tree that on one hand is balanced (so the leaf nodes are at the same height) on the other hand the rectangles do not cover too much empty space and do not overlap too much (so that during search, fewer subtrees need to be processed). For example, the original idea for inserting elements to obtain an efficient tree is to always insert into the subtree that requires least enlargement of its bounding box. Once that page is full, the data is split into two sets that should cover the minimal area each. Most of the research and improvements for R-trees aims at improving the way the tree is built and can be grouped into two objectives: building an efficient tree from scratch (known as bulk-loading) and performing changes on an existing tree (insertion and deletion).

R-trees do not historically guarantee good worst-case performance, but generally perform well with real-world data. While more of theoretical interest, the (bulk-loaded) Priority R-Tree
Priority R-tree
The Priority R-tree is a alternative to the spatial tree R-tree. It was first proposed by Arge, De Berg, Haverkort and Yi, K. in an article from 2004...

variant of the R-tree is also worst-case optimal, but due to the increased complexity, has not received much attention in practical applications so far.

When data is organized in an R-Tree, the k nearest neighbors (for any Lp-Norm
Lp space
In mathematics, the Lp spaces are function spaces defined using a natural generalization of the p-norm for finite-dimensional vector spaces...

) of all points can efficiently be computed using a spatial join. This is beneficial for many algorithms based on the k nearest neighbors, for example the Local Outlier Factor
Local Outlier Factor
Local outlier factor is an anomaly detection algorithm presented as "LOF: Identifying Density-based Local Outliers" by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and Jörg Sander...

. DeLi-Clu, Density-Link-Clustering is a cluster analysis algorithm that uses the R-tree structure for a similar kind of spatial join to efficiently compute an OPTICS
OPTICS algorithm
OPTICS is an algorithm for finding density-based clusters in spatial data. It was presented by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander....

clustering.

### Data layout

Data in R-trees is organized in pages, that can have a variable number of entries (up to some pre-defined maximum, and usually above a minimum fill). Each entry within a non-leaf node stores two pieces of data: a way of identifying a child node, and the bounding box of all entries within this child node. Leaf nodes store the data required for each child, often a point or bounding box representing the child and an external identifier for the child. For point data, the leaf entries can be just the points themselves. For polygon data (that often requires the storage of large polygons) the common setup is to store only the MBR (minimum bounding rectangle) of the polygon along with a unique identifier in the tree.

### Search

The input is a search rectangle (Query box). Searching is quite similar to searching in a B+tree. The search starts from the root node of the tree. Every internal node contains a set of rectangles and pointers to the corresponding child node and every leaf node contains the rectangles of spatial objects (the pointer to some spatial object can be there). For every rectangle in a node, it has to be decided if it overlaps the search rectangle or not. If yes, the corresponding child node has to be searched also. Searching is done like this in a recursive manner until all overlapping nodes have been traversed. When a leaf node is reached, the contained bounding boxes (rectangles) are tested against the search rectangle and their objects (if there are any) are put into the result set if they lie within the search rectangle.

### Insertion

To insert an object, the tree is traversed recursively from the root node. At each step, all rectangles in the current directory node are examined, and a candidate is chosen using a heuristic such as choosing the rectangle which requires least enlargement. The search then descends into this page, until reaching a leaf node. If the leaf node is full, it must be split before the insertion is made. Again, since an exhaustive search is too expensive, an heuristic is employed to split the node into two. Adding the newly created node to the previous level, this level can again overflow, and these overflows can propagate up to the root node; when this node also overflows, a new root node is created and the tree has increased in height.

#### Choosing the insertion subtree

At each level, the algorithm needs to decide in which subtree to insert the new data object. When a data object is fully contained in a single rectangle, the choice is obvious, but when there are multiple options or rectangles need enlargement, the choice can have a significant impact on the performance of the tree.

In the classic R-tree, objects are inserted into the subtree that needs least enlargement. In the more advanced R*-tree, a mixed heuristic is employed: at leaf level, it tries to minimize the overlap (in case of ties, prefer least enlargement and then least area); at the higher levels, it behaves similar to the R-tree, but on ties again preferring the subtree with smaller area. The decreased overlap of rectangles in the R*-tree is one of the key benefits over the traditional R-Tree (but this is also a consequence of the other heuristics used, not only the subtree choosing).

#### Splitting an overflowing node

Since redistributing all objects of a node into two nodes has an exponential number of options, an heuristic needs to be employed to find the best split. In the classic R-Tree, Guttman proposed two such heuristics, called QuadraticSplit and LinearSplit. In quadratic split, the algorithm searches the pair of rectangles that is the worst combination to have in the same node, and puts them as initial objects into the two new groups. It then searches the entry which has the strongest preference for one of the groups (in terms of area increase) and assigns the object to this group until all objects are assigned (satisfying the minimum fill).

There are other splitting strategies such as Greene's Split, the R*-tree splitting heuristic (which again tries to minimize overlap, but also prefers quadratic pages) or the linear split algorithm proposed by Ang and Tan (which however can produce very unregular rectangles, which are less performant for many real world range and window queries). In addition to having a more advanced splitting heuristic, the R*-tree also tries to avoid splitting a node by reinserting some of the node members, which is similar to the way a B-tree
B-tree
In computer science, a B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree is a generalization of a binary search tree in that a node can have more than two children...

balances overflowing nodes. This was shown to also reduce overlap and thus tree performance.

Finally, the X-tree
X-tree
In computer science, an X-tree is an index tree structure based on the R-tree used for storing data in many dimensions. It differs from R-trees, R+-trees and R*-trees because it emphasizes prevention of overlap in the bounding boxes, which increasingly becomes a problem in high dimensions...

can be seen as a R*-tree variant that can also decide to not split a node, but construct a so-called super-node containing all the extra entries, when it doesn't find a good split (in particular for high-dimensional data).

### Deletion

Deleting an entry from a page may require updating the bounding rectangles of parent pages. However, when a page is underfull, it will not be balanced with its neighbors. Instead, the page will be dissolved and all its children (which may be subtrees, not only leaf objects) will be reinserted. If during this process the root node has a single element, the tree height can decrease.

• Sort-Tile-Recursive (STR)
• Packed Hilbert R-Tree
Hilbert R-tree
Hilbert R-tree, an R-tree variant, is an index for multidimensional objects like lines, regions, 3-D objects, or high dimensional feature-based parametric objects. It can be thought of as an extension to B+-tree for multidimensional objects....

- Uses the Hilbert value of the center of a rectangle to sort the leaf nodes and recursively builds the tree.
• Nearest-X - Rectangles are sorted on the x-coordinate and nodes are created.
• Priority R-Tree
Priority R-tree
The Priority R-tree is a alternative to the spatial tree R-tree. It was first proposed by Arge, De Berg, Haverkort and Yi, K. in an article from 2004...

• Segment tree
Segment tree
In computer science, a segment tree is a tree data structure for storing intervals, or segments. It allows querying which of the stored segments contain a given point. It is, in principle, a static structure; that is, its content cannot be modified once the structure is built...

• Interval tree
Interval tree
In computer science, an interval tree is an ordered tree data structure to hold intervals. Specifically, it allows one to efficiently find all intervals that overlap with any given interval or point. It is often used for windowing queries, for example, to find all roads on a computerized map inside...

- A degenerate R-Tree for 1 dimension (usually time).
• Bounding volume hierarchy
Bounding volume hierarchy
A bounding volume hierarchy is a tree structure on a set of geometric objects. All geometric objects are wrapped in bounding volumes that form the leaf nodes of the tree. These nodes are then grouped as small sets and enclosed within larger bounding volumes...

• Spatial index
• GiST
GiST
In computing, GiST or Generalized Search Tree, is a data structure and API that can be used to build a variety of disk-based search trees. GiST is a generalization of the B+ tree, providing a concurrent and recoverable height-balanced search tree infrastructure without making any assumptions about...