Background
This post compares how long it takes to iterate over every row in an HBase table two ways: a plain full scan versus a MapReduce job.
HBase Full Scan vs MapReduce
Full Scan
The Scan operation built into HBase, used without any filters or row-range constraints, iterates over every row in a table. The HBase shell exposes this through its count command; counting roughly 5 million rows this way took about 235 seconds, close to 4 minutes.
hbase:002:0> count 'usertable'
5119700 row(s)
Took 235.7437 seconds
=> 5119700
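Under the hood, the shell's count is essentially a single-threaded client-side scan. A minimal Java sketch of the same operation is below; this is an illustration, not the shell's actual implementation, and it assumes a reachable cluster whose settings come from the hbase-site.xml on the classpath (the table name is taken from the example above):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class SingleThreadCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("usertable"));
             // An unconstrained Scan visits every row in the table.
             ResultScanner scanner = table.getScanner(new Scan())) {
            long rows = 0;
            for (Result r : scanner) {
                rows++; // one result per row; a single client thread does all the work
            }
            System.out.println("rows = " + rows);
        }
    }
}
```

Every row is pulled through one client connection, which is why the scan time grows linearly with the table size no matter how many RegionServers hold the data.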
MapReduce Row Count
I wrote a row count application by following the HBase Row Count MapReduce Development Guide; running it took about 104 seconds, under 2 minutes.
root@latte01:~# hadoop jar hbase-mapreduce-test.jar RowCounterJob
23/06/11 00:02:36 INFO mapreduce.Job: Job job_1686391929383_0004 running in uber mode : false
23/06/11 00:02:36 INFO mapreduce.Job: map 0% reduce 0%
...
23/06/11 00:04:17 INFO mapreduce.Job: map 94% reduce 0%
23/06/11 00:04:19 INFO mapreduce.Job: map 100% reduce 0%
23/06/11 00:04:20 INFO mapreduce.Job: Job job_1686391929383_0004 completed successfully
23/06/11 00:04:20 INFO mapreduce.Job: Counters: 44
RowCounterMapper$Counters
ROWS=5119700
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
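The RowCounterJob in the log above might look roughly like the following sketch. This is not the post's actual jar; it is a reconstruction of the standard HBase TableMapper pattern, with the counter named to match the ROWS counter visible in the job output. The caching values are illustrative tuning choices, not taken from the post:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class RowCounterJob {

    static class RowCounterMapper
            extends TableMapper<ImmutableBytesWritable, Result> {
        // Shows up in the job output as RowCounterMapper$Counters / ROWS.
        public enum Counters { ROWS }

        @Override
        protected void map(ImmutableBytesWritable row, Result value,
                           Context context) throws IOException, InterruptedException {
            // Each mapper scans one table region; counting is just a counter bump.
            context.getCounter(Counters.ROWS).increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "RowCounterJob");
        job.setJarByClass(RowCounterJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);       // batch rows per RPC to cut round trips
        scan.setCacheBlocks(false); // full scans shouldn't churn the block cache

        // One map task is created per table region; no reducer is needed.
        TableMapReduceUtil.initTableMapperJob(
                "usertable", scan, RowCounterMapper.class, null, null, job);
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because the job emits nothing and only increments a counter, the final ROWS value is read straight from the job's counters, which matches the ROWS=5119700 line in the log.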
In theory, with 3 Node Managers processing splits in parallel, the job should run about 3 times faster. In practice it was about 2.25 times faster, because splitting the table into tasks and merging their results adds overhead. As more worker nodes become available for parallel processing, the job will speed up further. A MapReduce full scan is incomparably faster than scanning with a single thread, so if you run batch jobs against HBase that need to finish quickly, you should consider running them as MapReduce jobs.
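The speedup figure follows directly from the two measurements: 235.74 seconds for the single-threaded scan and the roughly 104 seconds between the 00:02:36 and 00:04:20 log timestamps for the MapReduce job. A quick check of the arithmetic:

```java
public class SpeedupCalc {
    public static void main(String[] args) {
        double fullScanSec = 235.7437;   // from the hbase shell count output
        double mapReduceSec = 104.0;     // approximate, from the job log timestamps
        double speedup = fullScanSec / mapReduceSec;
        // Ideal speedup with 3 parallel Node Managers would be 3x;
        // task-split and merge overhead eats the difference.
        System.out.printf("Observed speedup: %.2fx (ideal: 3x)%n", speedup);
    }
}
```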