Background

This post compares how long it takes to iterate over all rows in an HBase table using a plain full scan versus a MapReduce job.

HBase Full Scan vs MapReduce

Full Scan

An HBase Scan with no constraints iterates over every row in a table. The HBase shell provides a `count` command that performs such a scan; counting approximately 5 million rows with it took about `235 seconds`, roughly `4 minutes`.

hbase:002:0> count 'usertable'

5119700 row(s)

Took 235.7437 seconds

=> 5119700

MapReduce Row Count

I wrote a row count application based on the [HBase Row Count MapReduce Development Guide](https://www.youngju.dev/blog/202306/how_to_develop_hbase_mapreduce). Running it took about `104 seconds`, less than `2 minutes`.

root@latte01:~# hadoop jar hbase-mapreduce-test.jar RowCounterJob

23/06/11 00:02:36 INFO mapreduce.Job: Job job_1686391929383_0004 running in uber mode : false

23/06/11 00:02:36 INFO mapreduce.Job: map 0% reduce 0%

...

23/06/11 00:04:17 INFO mapreduce.Job: map 94% reduce 0%

23/06/11 00:04:19 INFO mapreduce.Job: map 100% reduce 0%

23/06/11 00:04:20 INFO mapreduce.Job: Job job_1686391929383_0004 completed successfully

23/06/11 00:04:20 INFO mapreduce.Job: Counters: 44

RowCounterMapper$Counters

ROWS=5119700

File Input Format Counters

Bytes Read=0

File Output Format Counters

Bytes Written=0
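The 104-second figure can be checked against the log timestamps. The following is a minimal sketch, assuming the timestamps are in `yy/MM/dd HH:mm:ss` form; the helper names `log_time` and `elapsed_seconds` are made up for this illustration. Note that it measures from the first progress line to the completion line, so it does not include job submission time.

```python
from datetime import datetime

# Assumed timestamp layout of the job log lines: yy/MM/dd HH:mm:ss.
LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"


def log_time(line: str) -> datetime:
    """Parse the leading timestamp of a job log line."""
    # The timestamp is the first 17 characters, e.g. "23/06/11 00:02:36".
    return datetime.strptime(line[:17], LOG_TIME_FORMAT)


def elapsed_seconds(first_line: str, last_line: str) -> float:
    """Seconds between the timestamps of two log lines."""
    return (log_time(last_line) - log_time(first_line)).total_seconds()


# Two lines copied from the log output above.
first = "23/06/11 00:02:36 INFO mapreduce.Job: map 0% reduce 0%"
last = "23/06/11 00:04:20 INFO mapreduce.Job: Job job_1686391929383_0004 completed successfully"

print(elapsed_seconds(first, last))  # 104.0
```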

In theory, three NodeManagers processing the work in parallel should finish about 3 times faster. In practice, the job was about `2.3 times` faster (235.7 seconds versus 104 seconds), most likely because of the overhead of starting the job, splitting the work, and merging the results. Adding worker nodes should shorten the run further, although by default the table input format creates one map task per region, so the gain is bounded by the number of regions in the table. For full scans over large tables, MapReduce can be considerably faster than scanning from a single thread. So if you need to run batch jobs against HBase and they must finish quickly, consider running them with MapReduce.
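The comparison can be reproduced from the two measurements. This is a small sketch using only the numbers reported in this post; the function names are hypothetical, and "efficiency" here simply means the measured speedup divided by the number of workers.

```python
def speedup(baseline_seconds: float, parallel_seconds: float) -> float:
    """How many times faster the parallel run is than the baseline."""
    return baseline_seconds / parallel_seconds


def parallel_efficiency(speedup_factor: float, workers: int) -> float:
    """Fraction of the ideal (linear) speedup that was achieved."""
    return speedup_factor / workers


def rows_per_second(rows: int, seconds: float) -> float:
    """Average scan throughput."""
    return rows / seconds


# Measurements taken from the shell output and the job log in this post.
ROWS = 5_119_700
SCAN_SECONDS = 235.7437      # HBase shell `count`
MAPREDUCE_SECONDS = 104.0    # MapReduce row count job
WORKERS = 3                  # number of NodeManagers

s = speedup(SCAN_SECONDS, MAPREDUCE_SECONDS)
e = parallel_efficiency(s, WORKERS)

print(f"speedup: {s:.2f}x")
print(f"efficiency: {e:.0%}")
print(f"scan: {rows_per_second(ROWS, SCAN_SECONDS):,.0f} rows/s")
print(f"mapreduce: {rows_per_second(ROWS, MAPREDUCE_SECONDS):,.0f} rows/s")
```

The speedup comes out a little under 2.3, which is roughly three quarters of the ideal 3x for three workers.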
