Skip to content

Split View: HBase와 Hive 연동 방법

|

HBase와 Hive 연동 방법

Overview

HBase의 데이터를 Hive로 조회하는 방법을 알아본다.

Prepairation

기존, Hive Table에는 u_data 라는 테이블이 존재한다. u_data는 movielens에서 제공하는 데이터 셋 중 일부이고, 이 데이터를 생성하는 법은 https://www.youngju.dev/blog/202211/hive_query_exampl 이곳을 참조하면 된다.

데이터의 schema는 아래와 같다.

hive> describe u_data;
OK
userid              	int
movieid             	int
rating              	int
unixtime            	timestamp
Time taken: 0.442 seconds, Fetched: 4 row(s)

Hive-HBase 테이블 생성 방법!

테이블 생성

hive> CREATE TABLE hbase_u_data(key int, value int)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val");
OK
Time taken: 2.292 seconds

아래처럼 hive에서 table이 생성되었음을 알 수 있다.

hive> show tables;
OK
hbase_u_data

뿐만 아니라 hbase 에도 해당 테이블이 생성되었음을 확인할 수 있다.

hbase:001:0> list
TABLE
hbase_u_data
hbase:002:0> desc 'hbase_u_data';
Table hbase_u_data is ENABLED
hbase_u_data, {TABLE_ATTRIBUTES => {METADATA => {'hbase.store.file-tracker.impl' => 'DEFAULT'}}}
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_V
ERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', COMPRESSION => 'NONE', BLOCKCACH
E => 'true', BLOCKSIZE => '65536 B (64KB)', METADATA => {'EVICT_BLOCKS_ON_CLOSE' => 'false', 'NEW_VERSION_BEHAVIOR' =>
 'false', 'CACHE_DATA_ON_WRITE' => 'false', 'CACHE_INDEX_ON_WRITE' => 'false', 'CACHE_BLOOMS_ON_WRITE' => 'false', 'PR
EFETCH_BLOCKS_ON_OPEN' => 'false', 'CACHE_DATA_IN_L1' => 'false'}}

1 row(s)
Quota is disabled
Took 0.0953 seconds

데이터 insert

hive table의 데이터를 hbase table로 insert 한다.

hive> INSERT OVERWRITE TABLE hbase_u_data SELECT userid, rating FROM u_data;

data를 hive에서 hbase로 넣고난 뒤, row의 갯수가 달라질 수 있는데 그 이유는 hive의 경우 duplicate 키를 제공하지만, hbase는 unique 키만을 제공하기 때문이다.

hive> select count(*) from hbase_u_data;
OK
943
Time taken: 18.137 seconds, Fetched: 1 row(s)

HBase Table을 Hive-HBase 테이블로 변경하기

기존에 HBase에 존재하는 usertable 을 Hive에서도 사용할 수 있는 Hive-HBase 테이블로 변경할 수도 있다.

아래는 HBase 상에만 존재하는 usertable 이다.

hbase:004:0> desc "usertable";
Table usertable is ENABLED
usertable, {TABLE_ATTRIBUTES => {METADATA => {'hbase.store.file-tracker.impl' => 'DEFAULT'}}}
COLUMN FAMILIES DESCRIPTION
{NAME => 'family', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BL
OCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE
=> '65536 B (64KB)', REPLICATION_SCOPE => '0'}

Convert HBase table to Hive-HBase table

"hbase.table.name"의 값으로는 HBase에 현재 존재하는 table 명을 입력한다.

hive> CREATE EXTERNAL TABLE hbase_usertable(key string, value string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = "family:field9")
    > TBLPROPERTIES("hbase.table.name" = "usertable")
Time taken: 1.651 seconds

hive에 hbase_usertable 라는 이름의 table이 생성된 것을 확인할 수 있다.

hive> desc hbase_usertable;
OK
key                 	string
value               	string

row의 수를 확인해 HBase의 기존 테이블의 row 수와 Hive-HBase table 의 row수가 동일한지 체크하여 잘 생성되었는지를 확인한다.

hive> select count(distinct key) from hbase_usertable;
Total MapReduce CPU Time Spent: 4 minutes 42 seconds 330 msec
OK
1000

이후에는 이 테이블에 hive에서 insert를 하거나, hbase에서 insert 할 경우 해당 데이터를 hive와 hbase 모두에서 조회가 가능하다.

더 다양한 usercase의 경우 https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration를 참고한다.

Reference

How to Integrate HBase with Hive

Overview

Learn how to query HBase data using Hive.

Preparation

There is an existing Hive table called u_data. u_data is part of the dataset provided by MovieLens, and you can find how to create this data at https://www.youngju.dev/blog/202211/hive_query_exampl.

The data schema is as follows.

hive> describe u_data;
OK
userid              	int
movieid             	int
rating              	int
unixtime            	timestamp
Time taken: 0.442 seconds, Fetched: 4 row(s)

How to Create a Hive-HBase Table!

Creating a Table

hive> CREATE TABLE hbase_u_data(key int, value int)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val");
OK
Time taken: 2.292 seconds

As shown below, you can see that the table has been created in Hive.

hive> show tables;
OK
hbase_u_data

Furthermore, you can also confirm that the table has been created in HBase.

hbase:001:0> list
TABLE
hbase_u_data
hbase:002:0> desc 'hbase_u_data';
Table hbase_u_data is ENABLED
hbase_u_data, {TABLE_ATTRIBUTES => {METADATA => {'hbase.store.file-tracker.impl' => 'DEFAULT'}}}
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_V
ERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', COMPRESSION => 'NONE', BLOCKCACH
E => 'true', BLOCKSIZE => '65536 B (64KB)', METADATA => {'EVICT_BLOCKS_ON_CLOSE' => 'false', 'NEW_VERSION_BEHAVIOR' =>
 'false', 'CACHE_DATA_ON_WRITE' => 'false', 'CACHE_INDEX_ON_WRITE' => 'false', 'CACHE_BLOOMS_ON_WRITE' => 'false', 'PR
EFETCH_BLOCKS_ON_OPEN' => 'false', 'CACHE_DATA_IN_L1' => 'false'}}

1 row(s)
Quota is disabled
Took 0.0953 seconds

Inserting Data

Insert data from the Hive table into the HBase table.

hive> INSERT OVERWRITE TABLE hbase_u_data SELECT userid, rating FROM u_data;

After inserting data from Hive to HBase, the number of rows may differ because Hive supports duplicate keys while HBase only supports unique keys.

hive> select count(*) from hbase_u_data;
OK
943
Time taken: 18.137 seconds, Fetched: 1 row(s)

Converting an HBase Table to a Hive-HBase Table

You can also convert an existing usertable in HBase into a Hive-HBase table that can be used from Hive.

Below is the usertable that exists only in HBase.

hbase:004:0> desc "usertable";
Table usertable is ENABLED
usertable, {TABLE_ATTRIBUTES => {METADATA => {'hbase.store.file-tracker.impl' => 'DEFAULT'}}}
COLUMN FAMILIES DESCRIPTION
{NAME => 'family', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BL
OCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE
=> '65536 B (64KB)', REPLICATION_SCOPE => '0'}

Convert HBase table to Hive-HBase table

Enter the currently existing HBase table name as the value for "hbase.table.name".

hive> CREATE EXTERNAL TABLE hbase_usertable(key string, value string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = "family:field9")
    > TBLPROPERTIES("hbase.table.name" = "usertable")
Time taken: 1.651 seconds

You can confirm that a table named hbase_usertable has been created in Hive.

hive> desc hbase_usertable;
OK
key                 	string
value               	string

Check the number of rows to verify that the row count of the original HBase table matches the Hive-HBase table's row count to confirm it was created correctly.

hive> select count(distinct key) from hbase_usertable;
Total MapReduce CPU Time Spent: 4 minutes 42 seconds 330 msec
OK
1000

After this, if you insert data into this table from either Hive or HBase, the data can be queried from both Hive and HBase.

For more diverse use cases, refer to https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration.

Reference

Quiz

Q1: What is the main topic covered in "How to Integrate HBase with Hive"? Learn how to integrate HBase with Hive.

Q2: What is Creating a Table? As shown below, you can see that the table has been created in Hive. Furthermore, you can also confirm that the table has been created in HBase.

Q3: Explain the core concept of Inserting Data. Insert data from the Hive table into the HBase table. After inserting data from Hive to HBase, the number of rows may differ because Hive supports duplicate keys while HBase only supports unique keys.

Q4: What are the key aspects of Convert HBase table to Hive-HBase table? Enter the currently existing HBase table name as the value for "hbase.table.name". You can confirm that a table named hbase_usertable has been created in Hive.