Bitmap Index
Encyclopedia
A bitmap index is a special kind of database index
Index (database)
A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space...

 that uses bitmaps.

Bitmap indexes have traditionally been considered to work well for data such as gender, which has a small number of distinct values, for example male and female, but many occurrences of those values. This would happen if, for example, you had gender data for each resident in a city. Bitmap indexes have a significant space and performance advantage over other structures for such data. Some researchers argue that Bitmap indexes are also useful for unique valued data which is not updated frequently. Bitmap indexes use bit arrays (commonly called bitmaps) and answer queries by performing bitwise logical operation
Bitwise operation
A bitwise operation operates on one or more bit patterns or binary numerals at the level of their individual bits. This is used directly at the digital hardware level as well as in microcode, machine code and certain kinds of high level languages...

s on these bitmaps.

Bitmap indexes are also useful in data warehousing applications for joining a large fact table
Fact table
In data warehousing, a fact table consists of the measurements, metrics or facts of a business process. It is often located at the centre of a star schema or a snowflake schema, surrounded by dimension tables....

 to smaller dimension table
Dimension table
In data warehousing, a dimension table is one of the set of companion tables to a fact table.The fact table contains business facts or measures and foreign keys which refer to candidate keys in the dimension tables....

s such as those arranged in a star schema
Star schema
In computing, the star schema is the simplest style of data warehouse schema. The star schema consists of one or more fact tables referencing any number of dimension tables...

.

Example

Continuing the gender example, a bitmap index may be logically viewed as follows:
Identifier Gender Bitmaps
F M
1 Female 1 0
2 Male 0 1
3 Male 0 1
4 Unspecified 0 0
5 Female 1 0


On the left, identifier
Identifier
An identifier is a name that identifies either a unique object or a unique class of objects, where the "object" or class may be an idea, physical [countable] object , or physical [noncountable] substance...

 refers to the unique number assigned to each customer, gender is the data to be indexed, the content of the bitmap index is shown as two columns under the heading bitmaps. Each column in the above illustration is a bitmap in the bitmap index. In this case, there are two such bitmaps, one for gender Female and one for gender Male. It is easy to see that each bit in bitmap M shows whether a particular row refers to a male. This is the simplest form of bitmap index. Most columns will have more distinct values. For example, the sales amount is likely to have a much larger number of distinct values. Variations on the bitmap index can effectively index this data as well. We briefly review three such variations.

Note: many of the references cited here are reviewed at. For those who might be interested in experimenting with some of the ideas mentioned here, many of them are implemented in open source software such as FastBit, the Lemur Bitmap Index C++ Library,, the Apache Hive
Apache Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix...

 Data Warehouse system and LucidDB
LucidDB
LucidDB is an Open Source database purpose-built to power Data Warehouses, OLAP servers and Business Intelligence systems.LucidDB is the first and only open-source RDBMS purpose-built entirely for data warehousing and business intelligence...

.

Compression

Software can compress
Data compression
In computer science and information theory, data compression, source coding or bit-rate reduction is the process of encoding information using fewer bits than the original representation would use....

 each bitmap in a bitmap index to save space. There has been considerable amount of work on this subject.
Bitmap compression algorithms typically employ run-length encoding
Run-length encoding
Run-length encoding is a very simple form of data compression in which runs of data are stored as a single data value and count, rather than as the original run...

, such as the Byte-aligned Bitmap Code, the Word-Aligned Hybrid code, the Position List Word Aligned Hybrid, the Compressed Adaptive Index (COMPAX) and the COmpressed 'N' Composable Integer SEt. These compression methods require very little effort to compress and decompress. More importantly, bitmaps compressed with BBC, WAH, COMPAX, PLWAH and CONCISE can directly participate in bitwise operation
Bitwise operation
A bitwise operation operates on one or more bit patterns or binary numerals at the level of their individual bits. This is used directly at the digital hardware level as well as in microcode, machine code and certain kinds of high level languages...

s without decompression. This gives them considerable advantages over generic compression techniques such as LZ77. BBC compression and its derivatives are used in a commercial database management system
Database management system
A database management system is a software package with computer programs that control the creation, maintenance, and use of a database. It allows organizations to conveniently develop databases for various applications by database administrators and other specialists. A database is an integrated...

. BBC is effective in both reducing index sizes and maintaining query performance. BBC encodes the bitmaps in bytes, while WAH encodes in words, better matching current CPUs. "On both synthetic data and real application data, the new word aligned schemes use only 50% more space, but perform logical operations on compressed data 12 times faster than BBC." PLWAH bitmaps were reported to take 50% of the storage space consumed by WAH bitmaps and offer up to 20% faster performance on logical operations. Similar considerations can be done for CONCISE.

The performance of schemes such as BBC, WAH, PLWAH, COMPAX and CONCISE is dependent on the order of the rows. A simple lexicographical sort can divide the index size by 9 and make indexes several times faster. The larger the table, the more important it is to sort the rows. Reshuffling techniques have also been proposed to achieve the same results of sorting when indexing streaming data.

Encoding

Basic bitmap indexes use one bitmap for each distinct value. It is possible to reduce the number of bitmaps used by using a different encoding method. For example, it is possible to encode C distinct values using log(C) bitmaps with binary encoding.

This reduces the number of bitmaps, further saving space, but to answer any query, most of the bitmaps have to be accessed. This makes it potentially not as effective as scanning a vertical projection of the base data, also known as a materialized view
Materialized view
A materialized view is a database object that contains the results of a query. They are local copies of data located remotely, or are used to create summary tables based on aggregations of a table's data. Materialized views, which store data based on remote tables, are also known as snapshots...

 or projection index. Finding the optimal encoding method that balances (arbitrary) query performance, index size and index maintenance remains a challenge.

Without considering compression, Chan and Ioannidis analyzed a class of multi-component encoding methods and came to the conclusion that two-component encoding sits at the kink of the performance vs. index size curve and therefore represents the best trade-off between index size and query performance.

Binning

For high-cardinality columns, it is useful to bin the values, where each bin covers multiple values and build the bitmaps to represent the values in each bin. This approach reduces the number of bitmaps used regardless of encoding method. However, binned indexes can only answer some queries without examining the base data. For example, if a bin covers the range from 0.1 to 0.2, then when the user asks for all values less than 0.15, all rows that fall in the bin are possible hits and have to be checked to verify whether they are actually less than 0.15. The process of checking the base data is known as the candidate check. In most cases, the time used by the candidate check is significantly longer than the time needed to work with the bitmap index. Therefore, binned indexes exhibit irregular performance. They can be very fast for some queries, but much slower if the query does not exactly match a bin.

History

The concept of bitmap index was first introduced by Professor Israel Spiegler and Rafi Maayan in their research "Storage and Retrieval Considerations of Binary Data Bases", published in 1985. The first commercial database product to implement a bitmap index was Computer Corporation of America's Model 204
Model 204
Model 204 is a Database management system for IBM and compatible mainframes, which was first deployed in 1972. It incorporates a programming language and an environment for application development. It can deal with very large databases and very high transaction loads.Model 204 relies on its own...

. Patrick O'Neil
Patrick O'Neil
Patrick Eugene O'Neil is an American computer scientist, an expert on databases, and a professor of computer science at the University of Massachusetts Boston....

 implemented the bitmap index around 1987. This implementation is a hybrid between the basic bitmap index (without compression) and the list of Row Identifiers (RID-list). Overall, the index is organized as a B+tree. When the column cardinality is low, each leaf node of the B-tree would contain long list of RIDs. In this case, it requires less space to represent the RID-lists as bitmaps. Since each bitmap represents one distinct value, this is the basic bitmap index. As the column cardinality increases, each bitmap becomes sparse and it may take more disk space to store the bitmaps than to store the same content as RID-lists. In this case, it switches to use the RID-lists, which makes it a B+tree index.

In-memory bitmaps

One of the strongest reasons for using bitmap indexes is that the intermediate results produced from them are also bitmaps and can be efficiently reused in further operations to answer more complex queries. Many programming languages support this as a bit array data structure. For example, Java has the BitSet class.

Some database systems that do not offer persistent bitmap indexes use bitmaps internally to speed up query processing. For example, PostgreSQL
PostgreSQL
PostgreSQL, often simply Postgres, is an object-relational database management system available for many platforms including Linux, FreeBSD, Solaris, MS Windows and Mac OS X. It is released under the PostgreSQL License, which is an MIT-style license, and is thus free and open source software...

 versions 8.1 and later implement a "bitmap index scan" optimization to speed up arbitrarily complex logical operations between available indexes on a single table.

For tables with many columns, the total number of distinct indexes to satisfy all possible queries (with equality filtering conditions on either of the fields) grows very fast, being defined by this formula:

.
A bitmap index scan combines expressions on different indexes, thus requiring only one index per column to support all possible queries on a table.

Applying this access strategy to B-tree indexes can also combine range queries on multiple columns. In this approach, a temporary in-memory bitmap is created with one bit
Bit
A bit is the basic unit of information in computing and telecommunications; it is the amount of information stored by a digital device or other physical system that exists in one of two possible distinct states...

 for each row in the table (1 MiB
MIB
MIB may refer to any of several concepts:* Master of International Business, a postgraduate business degree* Melayu Islam Beraja, the adopted national philosophy of Brunei* Motion induced blindness, a visual illusion in peripheral vision...

 can thus store over 8 million entries). Next, the results from each index are combined into the bitmap using bitwise operation
Bitwise operation
A bitwise operation operates on one or more bit patterns or binary numerals at the level of their individual bits. This is used directly at the digital hardware level as well as in microcode, machine code and certain kinds of high level languages...

s. After all conditions are evaluated, the bitmap contains a "1" for rows that matched the expression. Finally, the bitmap is traversed and matching rows are retrieved. In addition to efficiently combining indexes, this also improves locality of reference
Locality of reference
In computer science, locality of reference, also known as the principle of locality, is the phenomenon of the same value or related storage locations being frequently accessed. There are two basic types of reference locality. Temporal locality refers to the reuse of specific data and/or resources...

of table accesses, because all rows are fetched sequentially from the main table. The internal bitmap is discarded after the query. If there are too many rows in the table to use 1 bit per row, a "lossy" bitmap is created instead, with a single bit per disk page. In this case, the bitmap is just used to determine which pages to fetch; the filter criteria are then applied to all rows in matching pages.
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK