Compressed data structure - AbsoluteAstronomy.com

The term compressed data structure arises in the computer science

Computer science

Computer science or computing science is the study of the theoretical foundations of information and computation and of practical techniques for their implementation and application in computer systems...

subfields of algorithms, data structures, and theoretical computer science

Theoretical computer science

Theoretical computer science is a division or subset of general computer science and mathematics which focuses on more abstract or mathematical aspects of computing....

. It refers to a data structure whose operations are roughly as fast as those of a conventional data structure for the problem, but whose size can be substantially smaller. The size of the compressed data structure is typically highly dependent upon the entropy of the data being represented.

Important examples of compressed data structures include the compressed suffix array

Compressed suffix array

The Compressed Suffix Array is a compressed data structure for pattern matching. Given a text T of n characters from an alphabet Σ, the compressed suffix array support searching for arbitrary patterns in T...

and the FM-index

FM-index

The FM-index is type of Substring index based on the Burrows-Wheeler transform,with some similarities to the Suffix Array.It is named after the creators of the algorithm, Paolo Ferragina and Giovanni Manzini, who describe it as an opportunistic data structure as it allows compression of the input...

, both of which can represent an arbitrary text of characters T for pattern matching

Pattern matching

In computer science, pattern matching is the act of checking some sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact. The patterns generally have the form of either sequences or tree structures...

. Given any input pattern P, they support the operation of finding if and where P appears in T. The search time is proportional to the sum of the length of pattern P, a very slow-growing function of the length of the text T, and the number of reported matches. The space they occupy is roughly equal to the size of the text T in entropy-compressed form, such as that obtained by Prediction by Partial Matching or gzip

Gzip

Gzip is any of several software applications used for file compression and decompression. The term usually refers to the GNU Project's implementation, "gzip" standing for GNU zip. It is based on the DEFLATE algorithm, which is a combination of Lempel-Ziv and Huffman coding...

. Moreover, both data structures are self-indexing, in that they can reconstruct the text T in a random access manner, and thus the underlying text T can be discarded. In other words, they simultaneously provide a compressed and quickly searchable representation of the text T. They represent a substantial space improvement over the conventional suffix tree

Suffix tree

In computer science, a suffix tree is a data structure that presents the suffixes of a given string in a way that allows for a particularly fast implementation of many important string operations.The suffix tree for a string S is a tree whose edges are labeled with strings, such that each suffix...

and suffix array

Suffix array

In computer science, a suffix array is an array of integers giving the starting positions of suffixes of a string in lexicographical order.-Details:Consider the string...

, which occupy many times more space than the size of T. They also support searching for arbitrary patterns, as opposed to the inverted index

Inverted index

In computer science, an inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents...

, which can support only word-based searches. In addition, inverted indexes do not have the self-indexing feature.

An important related notion is that of a succinct data structure

Succinct data structure

In computer science, a succinct data structure is data structure which uses an amount of space that is "close" to the information-theoretic lower bound, but still allows for efficient query operations. The concept was originally introduced by Jacobson to encode bit vectors, trees, and planar...

, which uses space roughly equal to the information-theoretic minimum, which is a worst-case notion of the space needed to represent the data. In contrast, the size of a compressed data structure depends upon the particular data being represented. When the data are compressible, as is often the case in practice for natural language text, the compressed data structure can occupy substantially less space than the information-theoretic minimum.

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.