Apache Hive is a
data warehouseIn computing, a data warehouse is a database used for reporting and analysis. The data stored in the warehouse is uploaded from the operational systems. The data may pass through an operational data store for additional operations before it is used in the DW for reporting.A data warehouse...
infrastructure built on top of
HadoopApache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data...
for providing data summarization, query, and analysis. While initially developed by
FacebookFacebook is a social networking service and website launched in February 2004, operated and privately owned by Facebook, Inc. , Facebook has more than 800 million active users. Users must register before using the site, after which they may create a personal profile, add other users as...
, Apache Hive is now used and developed by other companies such as
NetflixNetflix, Inc., is an American provider of on-demand internet streaming media in the United States, Canada, and Latin America and flat rate DVD-by-mail in the United States. The company was established in 1997 and is headquartered in Los Gatos, California...
. Hive is also included in
Amazon Elastic MapReduce on
Amazon Web ServicesAmazon Web Services is a collection of remote computing services that together make up a cloud computing platform, offered over the Internet by Amazon.com...
.
Features
Apache Hive supports analysis of large datasets stored in Hadoop compatible file systems such as
Amazon S3Amazon S3 is an online storage web service offered by Amazon Web Services. Amazon S3 provides storage through web services interfaces...
filesystem. It provides an
SQLSQL is a programming language designed for managing data in relational database management systems ....
-like language called HiveQL while maintaining full support for map/reduce. To accelerate queries, it provides indexes, including
bitmap indexA bitmap index is a special kind of database index that uses bitmaps.Bitmap indexes have traditionally been considered to work well for data such as gender, which has a small number of distinct values, for example male and female, but many occurrences of those values. This would happen if, for...
es.
By default, Hive stores metadata in an embedded
Apache DerbyApache Derby is a relational database management system developed by the Apache Software Foundation that can be embedded in Java programs and used for online transaction processing. It has a 2 MB disk-space footprint.Apache Derby is developed as an open source project under the Apache 2.0 license...
database, and other client/server databases like MySQL can optionally be used.
Currently, there are three file formats supported in Hive, which are TEXTFILE, SEQUENCEFILE and
RCFILEBig data refers to fast growing and huge data sets that cannot be easily handled by traditional databases, including parallel databases. Big data sets are stored, managed and analyzed in large and scalable distributed systems, where data processing model is based on the MapReduce framework...
.
HiveQL
While based on SQL, HiveQL does not strictly follow the full
SQL-92SQL-92 was the third revision of the SQL database query language. Unlike SQL-89, it was a major revision of the standard. For all but a few minor incompatibilities, the SQL-89 standard is forwards-compatible with SQL-92....
standard. HiveQL offers extensions not in SQL, including
multitable inserts and
create table as select, but only offers basic support for
indexesA database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space...
. Also, HiveQL lacks support for
transactionsA transaction comprises a unit of work performed within a database management system against a database, and treated in a coherent and reliable way independent of other transactions...
and
materialized viewA materialized view is a database object that contains the results of a query. They are local copies of data located remotely, or are used to create summary tables based on aggregations of a table's data. Materialized views, which store data based on remote tables, are also known as snapshots...
s, and only limited subquery support.
Internally, a
compilerA compiler is a computer program that transforms source code written in a programming language into another computer language...
translates HiveQL statement into a
directed acyclic graphIn mathematics and computer science, a directed acyclic graph , is a directed graph with no directed cycles. That is, it is formed by a collection of vertices and directed edges, each edge connecting one vertex to another, such that there is no way to start at some vertex v and follow a sequence of...
of
MapReduceMapReduce is a software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers. Parts of the framework are patented in some countries....
jobs, which are submitted to Hadoop for execution.
External links