Overview

HBase divides a table into units called regions so that it can be processed in a distributed way. A table starts with a single region, and when a region grows past a configured size it is automatically split into two regions. This splitting process is costly, however. If you expect a large amount of initial data, pre-splitting the regions when you create the table can reduce the load on the cluster.

So, what criteria should you use to split a table? The answer likely depends on your row-key design. If you know in advance that the rowkey range will be from 1xxxxxx to 9xxxxxx, it would be best to pre-split into regions with prefixes 1 through 9.
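As a minimal sketch of the 1xxxxxx to 9xxxxxx case, the split points can be built as a plain Ruby array. A split point is the first key of the next region, so nine regions need only eight points. The table name my-table and the column family cf in the comment are hypothetical.

```ruby
# Rowkeys start with a digit 1-9, so split on the prefixes 2 through 9.
# Keys beginning with 1 land in the first region, keys beginning with 9 in the last.
split_points = ('2'..'9').to_a

# In the hbase shell the array could then be passed to create, for example:
#   create 'my-table', 'cf', SPLITS => split_points
puts split_points.inspect
```

Eight split points give nine regions, one per leading digit.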

Below are three pre-splitting methods that cover the most common rowkey patterns.

PreSplit Methods

HexStringSplit

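If you select HexStringSplit as the split algorithm when creating a table, as in the command below, the regions are split on evenly spaced hexadecimal string boundaries (characters 0-9 and a-f). This suits rowkeys that are themselves hex strings, such as hashed keys.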

create 'user-table', 'cf', {NUMREGIONS => 10, SPLITALGO => 'HexStringSplit'}
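As a rough illustration of where such boundaries fall, the sketch below divides the 8-character hex keyspace 00000000 to ffffffff into equal ranges in plain Ruby. The function name hex_split_points is hypothetical, and the arithmetic is an assumption about how evenly spaced points can be derived, so the exact boundaries HBase reports may differ.

```ruby
# Illustrative only: evenly spaced 8-character hex boundaries.
def hex_split_points(num_regions)
  range = 0xFFFFFFFF + 1          # size of the 32-bit hex keyspace
  step  = range / num_regions     # integer width of each region
  # num_regions regions need num_regions - 1 boundaries
  (1...num_regions).map { |i| format('%08x', step * i) }
end

points = hex_split_points(10)
puts points.length   # 9 boundaries for 10 regions
puts points.first    # 19999999
```

A rowkey such as an MD5 hex digest sorts into one of these ranges by its leading characters, which is why this algorithm pairs well with hashed keys.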

UniformSplit

The UniformSplit option is also available. It divides the raw byte keyspace into evenly sized ranges, which suits rowkeys that are effectively random bytes. The table is created with the command below.

create 'user-table2', 'cf', {NUMREGIONS => 10, SPLITALGO => 'UniformSplit'}
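To give a feel for what byte-based boundaries look like, here is a sketch that spaces points evenly across an 8-byte keyspace and prints them in \xNN notation. The function name uniform_split_points and the 8-byte width are assumptions made for illustration; this is not HBase's own code, and the boundaries it prints are not guaranteed to match what HBase generates.

```ruby
# Illustrative only: evenly spaced boundaries over a fixed-width byte keyspace.
def uniform_split_points(num_regions, width = 8)
  max = 256**width                       # number of distinct keys of this width
  (1...num_regions).map do |i|
    value = max * i / num_regions        # i-th boundary as an integer
    # Extract the bytes from most significant to least significant.
    bytes = width.times.map { |k| (value >> (8 * (width - 1 - k))) & 0xFF }
    bytes.map { |b| format('\\x%02X', b) }.join
  end
end

puts uniform_split_points(10)
```

Unlike the hex variant, these boundaries are arbitrary bytes rather than printable characters, so they only distribute load well when the rowkeys themselves span the whole byte range.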

Custom Split

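Instead of using the split algorithms that ship with HBase, you can pass your own list of split points to pre-split the regions. For example, if every rowkey is the prefix user followed by a four-digit number, you can generate the split points in the hbase shell as follows. Because the shell is Ruby, #{...} inside a double-quoted string is interpolation, so the command produces keys such as user1899 with no literal #. If your rowkeys do contain a #, write "user##{...}" instead.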

n_splits = 10
create 'usertable', 'family', {SPLITS => (1..n_splits).map {|i| "user#{1000+i*(9999-1000)/n_splits}"}}
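To check which keys that expression produces before creating the table, the same map can be run in plain Ruby. This is only a sketch for inspection: the variable names split_keys and hash_keys are hypothetical, and the second variant assumes rowkeys that contain a literal # after the prefix.

```ruby
n_splits = 10

# Same expression as in the create command; "#{...}" is Ruby string interpolation.
split_keys = (1..n_splits).map { |i| "user#{1000 + i * (9999 - 1000) / n_splits}" }
puts split_keys.first        # user1899
puts split_keys.last         # user9999
puts split_keys.length + 1   # 11: n split points produce n + 1 regions

# Variant for rowkeys with a literal '#': the first '#' is kept as-is,
# the second one starts the interpolation.
hash_keys = (1..n_splits).map { |i| "user##{1000 + i * (9999 - 1000) / n_splits}" }
puts hash_keys.first         # user#1899
```

Because the last split point is user9999, the final region holds almost nothing. Using (1...n_splits) instead would give nine points and ten regions.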

Reference

https://hbase.apache.org/book.html

This concludes the post on three methods for presplitting HBase tables.