Skip to content
Published on

How to Integrate HBase with Hive

Authors
  • Name
    Twitter

Overview

Learn how to query HBase data using Hive.

Preparation

There is an existing Hive table called u_data. u_data is part of the dataset provided by MovieLens, and you can find how to create this data at https://www.youngju.dev/blog/202211/hive_query_exampl.

The data schema is as follows.

hive> describe u_data;
OK
userid              	int
movieid             	int
rating              	int
unixtime            	timestamp
Time taken: 0.442 seconds, Fetched: 4 row(s)

How to Create a Hive-HBase Table!

Creating a Table

hive> CREATE TABLE hbase_u_data(key int, value int)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val");
OK
Time taken: 2.292 seconds

As shown below, you can see that the table has been created in Hive.

hive> show tables;
OK
hbase_u_data

Furthermore, you can also confirm that the table has been created in HBase.

hbase:001:0> list
TABLE
hbase_u_data
hbase:002:0> desc 'hbase_u_data';
Table hbase_u_data is ENABLED
hbase_u_data, {TABLE_ATTRIBUTES => {METADATA => {'hbase.store.file-tracker.impl' => 'DEFAULT'}}}
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_V
ERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', COMPRESSION => 'NONE', BLOCKCACH
E => 'true', BLOCKSIZE => '65536 B (64KB)', METADATA => {'EVICT_BLOCKS_ON_CLOSE' => 'false', 'NEW_VERSION_BEHAVIOR' =>
 'false', 'CACHE_DATA_ON_WRITE' => 'false', 'CACHE_INDEX_ON_WRITE' => 'false', 'CACHE_BLOOMS_ON_WRITE' => 'false', 'PR
EFETCH_BLOCKS_ON_OPEN' => 'false', 'CACHE_DATA_IN_L1' => 'false'}}

1 row(s)
Quota is disabled
Took 0.0953 seconds

Inserting Data

Insert data from the Hive table into the HBase table.

hive> INSERT OVERWRITE TABLE hbase_u_data SELECT userid, rating FROM u_data;

After inserting data from Hive to HBase, the number of rows may differ because Hive supports duplicate keys while HBase only supports unique keys.

hive> select count(*) from hbase_u_data;
OK
943
Time taken: 18.137 seconds, Fetched: 1 row(s)

Converting an HBase Table to a Hive-HBase Table

You can also convert an existing usertable in HBase into a Hive-HBase table that can be used from Hive.

Below is the usertable that exists only in HBase.

hbase:004:0> desc "usertable";
Table usertable is ENABLED
usertable, {TABLE_ATTRIBUTES => {METADATA => {'hbase.store.file-tracker.impl' => 'DEFAULT'}}}
COLUMN FAMILIES DESCRIPTION
{NAME => 'family', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BL
OCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE
=> '65536 B (64KB)', REPLICATION_SCOPE => '0'}

Convert HBase table to Hive-HBase table

Enter the currently existing HBase table name as the value for "hbase.table.name".

hive> CREATE EXTERNAL TABLE hbase_usertable(key string, value string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = "family:field9")
    > TBLPROPERTIES("hbase.table.name" = "usertable")
Time taken: 1.651 seconds

You can confirm that a table named hbase_usertable has been created in Hive.

hive> desc hbase_usertable;
OK
key                 	string
value               	string

Check the number of rows to verify that the row count of the original HBase table matches the Hive-HBase table's row count to confirm it was created correctly.

hive> select count(distinct key) from hbase_usertable;
Total MapReduce CPU Time Spent: 4 minutes 42 seconds 330 msec
OK
1000

After this, if you insert data into this table from either Hive or HBase, the data can be queried from both Hive and HBase.

For more diverse use cases, refer to https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration.

Reference