Background
This post compares how long it takes to iterate over every row in an HBase table two ways: a plain full scan versus a MapReduce job.
HBase Full Scan vs MapReduce
Full Scan
The Scan operation built into HBase, used without any filters or row-range constraints, iterates over every row in a table. The HBase shell exposes this through its count command; counting roughly 5 million rows this way took about 235 seconds, close to 4 minutes.
hbase:002:0> count 'usertable'
5119700 row(s)
Took 235.7437 seconds
=> 5119700
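Under the hood, the shell's count is essentially a single-threaded client-side scan. A minimal Java sketch of the same operation is below; this is an illustration, not the shell's actual implementation, and it assumes a reachable cluster whose settings come from the hbase-site.xml on the classpath (the table name is taken from the example above):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class SingleThreadCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("usertable"));
             // An unconstrained Scan visits every row in the table.
             ResultScanner scanner = table.getScanner(new Scan())) {
            long rows = 0;
            for (Result r : scanner) {
                rows++; // one result per row; a single client thread does all the work
            }
            System.out.println("rows = " + rows);
        }
    }
}
```

Every row is pulled through one client connection, which is why the scan time grows linearly with the table size no matter how many RegionServers hold the data.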
MapReduce Row Count
I wrote a row count application by following the HBase Row Count MapReduce Development Guide; running it took about 104 seconds, under 2 minutes.
root@latte01:~# hadoop jar hbase-mapreduce-test.jar RowCounterJob
23/06/11 00:02:36 INFO mapreduce.Job: Job job_1686391929383_0004 running in uber mode : false
23/06/11 00:02:36 INFO mapreduce.Job: map 0% reduce 0%
...
23/06/11 00:04:17 INFO mapreduce.Job: map 94% reduce 0%
23/06/11 00:04:19 INFO mapreduce.Job: map 100% reduce 0%
23/06/11 00:04:20 INFO mapreduce.Job: Job job_1686391929383_0004 completed successfully
23/06/11 00:04:20 INFO mapreduce.Job: Counters: 44
RowCounterMapper$Counters
ROWS=5119700
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
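The RowCounterJob in the log above might look roughly like the following sketch. This is not the post's actual jar; it is a reconstruction of the standard HBase TableMapper pattern, with the counter named to match the ROWS counter visible in the job output. The caching values are illustrative tuning choices, not taken from the post:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class RowCounterJob {

    static class RowCounterMapper
            extends TableMapper<ImmutableBytesWritable, Result> {
        // Shows up in the job output as RowCounterMapper$Counters / ROWS.
        public enum Counters { ROWS }

        @Override
        protected void map(ImmutableBytesWritable row, Result value,
                           Context context) throws IOException, InterruptedException {
            // Each mapper scans one table region; counting is just a counter bump.
            context.getCounter(Counters.ROWS).increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "RowCounterJob");
        job.setJarByClass(RowCounterJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);       // batch rows per RPC to cut round trips
        scan.setCacheBlocks(false); // full scans shouldn't churn the block cache

        // One map task is created per table region; no reducer is needed.
        TableMapReduceUtil.initTableMapperJob(
                "usertable", scan, RowCounterMapper.class, null, null, job);
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because the job emits nothing and only increments a counter, the final ROWS value is read straight from the job's counters, which matches the ROWS=5119700 line in the log.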
In theory, with 3 Node Managers processing splits in parallel, the job should run about 3 times faster. In practice it was about 2.25 times faster, because splitting the table into tasks and merging their results adds overhead. As more worker nodes become available for parallel processing, the job will speed up further. A MapReduce full scan is incomparably faster than scanning with a single thread, so if you run batch jobs against HBase that need to finish quickly, you should consider running them as MapReduce jobs.
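The speedup figure follows directly from the two measurements: 235.74 seconds for the single-threaded scan and the roughly 104 seconds between the 00:02:36 and 00:04:20 log timestamps for the MapReduce job. A quick check of the arithmetic:

```java
public class SpeedupCalc {
    public static void main(String[] args) {
        double fullScanSec = 235.7437;   // from the hbase shell count output
        double mapReduceSec = 104.0;     // approximate, from the job log timestamps
        double speedup = fullScanSec / mapReduceSec;
        // Ideal speedup with 3 parallel Node Managers would be 3x;
        // task-split and merge overhead eats the difference.
        System.out.printf("Observed speedup: %.2fx (ideal: 3x)%n", speedup);
    }
}
```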