Performance Comparison of MapReduce vs HBase Scan
Background
This post compares how long it takes to traverse every row of a table with an HBase full scan versus a MapReduce job.
HBase Full Scan vs MapReduce
Full Scan
Running HBase's built-in Scan operation with no constraints iterates over every row in a table. The HBase shell provides a count command that performs such a scan. Counting the roughly 5.12 million rows in usertable this way took about 235 seconds, just under 4 minutes.
hbase:002:0> count 'usertable'
5119700 row(s)
Took 235.7437 seconds
=> 5119700
MapReduce Row Count
I built a row count application by following the HBase Row Count MapReduce Development Guide. Running it took about 104 seconds, under 2 minutes.
root@latte01:~# hadoop jar hbase-mapreduce-test.jar RowCounterJob
23/06/11 00:02:36 INFO mapreduce.Job: Job job_1686391929383_0004 running in uber mode : false
23/06/11 00:02:36 INFO mapreduce.Job: map 0% reduce 0%
...
23/06/11 00:04:17 INFO mapreduce.Job: map 94% reduce 0%
23/06/11 00:04:19 INFO mapreduce.Job: map 100% reduce 0%
23/06/11 00:04:20 INFO mapreduce.Job: Job job_1686391929383_0004 completed successfully
23/06/11 00:04:20 INFO mapreduce.Job: Counters: 44
RowCounterMapper$Counters
ROWS=5119700
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
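The 104-second figure can be read off the log timestamps. The following is a minimal sketch of that arithmetic, assuming the log prefix uses the yy/MM/dd HH:mm:ss layout seen in the lines above. The helper names log_timestamp and elapsed_seconds are made up for illustration. Note that it measures from the first log line shown, not from job submission.

```python
from datetime import datetime

# Assumed layout of the log prefix, matching the lines shown above.
LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"


def log_timestamp(line: str) -> datetime:
    """Parse the leading date and time fields of a job log line."""
    date_part, time_part = line.split()[:2]
    return datetime.strptime(f"{date_part} {time_part}", LOG_TIME_FORMAT)


def elapsed_seconds(first_line: str, last_line: str) -> float:
    """Seconds between the timestamps of two log lines."""
    delta = log_timestamp(last_line) - log_timestamp(first_line)
    return delta.total_seconds()


start = "23/06/11 00:02:36 INFO mapreduce.Job: map 0% reduce 0%"
end = ("23/06/11 00:04:20 INFO mapreduce.Job: "
       "Job job_1686391929383_0004 completed successfully")

print(elapsed_seconds(start, end))  # 104.0
```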
In theory, three NodeManagers working in parallel should make the job about 3 times faster. In practice, splitting the work into tasks and merging the results adds overhead, so the measured speedup was about 2.25 times (235 seconds versus 104 seconds). Adding more worker nodes for parallel processing should speed up the job further, so for full scans MapReduce can be far faster than scanning with a single thread. If you need to run a batch job over HBase data and it has to finish quickly, MapReduce is worth considering.
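The comparison can be restated numerically. The sketch below uses only the figures reported above (5,119,700 rows, 235.7437 seconds for the shell count, 104 seconds for the job, 3 NodeManagers). The function names are illustrative. The implied overhead rests on a strong simplifying assumption: that the job's total scan work equals the single-threaded scan time and divides evenly across workers. Treat it as a rough estimate, not a measurement.

```python
ROWS = 5_119_700
SCAN_SECONDS = 235.7437   # from the shell count output
MR_SECONDS = 104.0        # from the job log timestamps
WORKERS = 3               # NodeManagers in the test cluster


def speedup(baseline_seconds: float, parallel_seconds: float) -> float:
    """How many times faster the parallel run was."""
    return baseline_seconds / parallel_seconds


def efficiency(speedup_factor: float, workers: int) -> float:
    """Fraction of the ideal linear speedup that was achieved."""
    return speedup_factor / workers


def throughput(rows: int, seconds: float) -> float:
    """Rows processed per second."""
    return rows / seconds


s = speedup(SCAN_SECONDS, MR_SECONDS)
e = efficiency(s, WORKERS)

# Assumption: ideal parallel time is the scan time divided evenly by workers;
# whatever remains is attributed to job setup, task splitting, and merging.
implied_overhead = MR_SECONDS - SCAN_SECONDS / WORKERS

print(f"speedup: {s:.2f}x")
print(f"parallel efficiency: {e:.0%}")
print(f"shell count: {throughput(ROWS, SCAN_SECONDS):,.0f} rows/s")
print(f"MapReduce:   {throughput(ROWS, MR_SECONDS):,.0f} rows/s")
print(f"implied overhead: {implied_overhead:.1f} s")
```

With the unrounded scan time the speedup comes out slightly above the rounded figure quoted in the text, and the job reaches roughly three quarters of the ideal 3x.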