Performance Comparison of MapReduce vs HBase Scan

Background

This post compares how long it takes to visit every row of a table with an HBase full scan versus a MapReduce job.

HBase Full Scan vs MapReduce

Full Scan

The Scan operation built into HBase visits every row of a table when it is used without any constraints. The HBase shell has an operation called count that walks the table this way. Running it over roughly 5.1 million rows took about 235 seconds, or nearly 4 minutes.

hbase:002:0> count 'usertable'
5119700 row(s)
Took 235.7437 seconds
=> 5119700

MapReduce Row Count

I built a row count application by following the HBase Row Count MapReduce Development Guide. Running it took about 104 seconds, less than 2 minutes.

root@latte01:~# hadoop jar hbase-mapreduce-test.jar RowCounterJob
23/06/11 00:02:36 INFO mapreduce.Job: Job job_1686391929383_0004 running in uber mode : false
23/06/11 00:02:36 INFO mapreduce.Job:  map 0% reduce 0%
...
23/06/11 00:04:17 INFO mapreduce.Job:  map 94% reduce 0%
23/06/11 00:04:19 INFO mapreduce.Job:  map 100% reduce 0%
23/06/11 00:04:20 INFO mapreduce.Job: Job job_1686391929383_0004 completed successfully
23/06/11 00:04:20 INFO mapreduce.Job: Counters: 44
	RowCounterMapper$Counters
		ROWS=5119700
	File Input Format Counters
		Bytes Read=0
	File Output Format Counters
		Bytes Written=0
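
The 104-second figure can be read off the log timestamps. The sketch below parses the first and last progress lines shown above and also derives a rough throughput. It assumes the log prefix follows the `yy/MM/dd HH:mm:ss` pattern, and it measures only from the first progress line, so any time spent submitting the job before that line is not included.

```python
from datetime import datetime

# Assumed timestamp layout of the Hadoop client log prefix, e.g. "23/06/11 00:02:36".
LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"

# First and last progress lines copied from the job log above.
first_line = "23/06/11 00:02:36 INFO mapreduce.Job:  map 0% reduce 0%"
last_line = "23/06/11 00:04:20 INFO mapreduce.Job: Job job_1686391929383_0004 completed successfully"


def log_timestamp(line: str) -> datetime:
    """Parse the leading date and time tokens of a log line."""
    date_part, time_part = line.split()[:2]
    return datetime.strptime(f"{date_part} {time_part}", LOG_TIME_FORMAT)


elapsed_seconds = (log_timestamp(last_line) - log_timestamp(first_line)).total_seconds()

rows = 5119700  # ROWS counter reported by the job
rows_per_second = rows / elapsed_seconds

print(f"elapsed: {elapsed_seconds:.0f} s")  # elapsed: 104 s
print(f"throughput: {rows_per_second:,.0f} rows/s")
```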

Theoretically, because 3 NodeManagers process the work in parallel, the job should be about 3 times faster. In practice, splitting the work into tasks and merging the results adds overhead, so the measured speedup was about 2.3 times (235.7 seconds versus 104 seconds). Adding worker nodes for parallel processing should shorten the run further, so for full scans MapReduce can be expected to pull well ahead of scanning with a single thread. If you need to run a batch job over HBase and it has to finish quickly, MapReduce is worth considering.
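
The comparison comes down to a little arithmetic, sketched here with the two measured times. The speedup and parallel efficiency follow directly from the measurements. The `fixed_overhead` value, by contrast, rests on an assumed toy model in which run time is the scan time divided by the number of workers plus a constant cost. That model is an illustration only and was not validated on the cluster.

```python
scan_seconds = 235.7437    # "Took 235.7437 seconds" from the HBase shell count
mapreduce_seconds = 104.0  # elapsed time between the first and last job log lines
workers = 3                # NodeManagers in the test cluster

# How many times faster the MapReduce job was than the single scan.
speedup = scan_seconds / mapreduce_seconds

# Fraction of the ideal 3x speedup that was actually achieved.
efficiency = speedup / workers

# Assumed toy model (not from the original post):
#   mapreduce_seconds = scan_seconds / workers + fixed_overhead
# Solving for the constant gives the time not explained by ideal parallelism.
fixed_overhead = mapreduce_seconds - scan_seconds / workers

print(f"speedup: {speedup:.2f}x")        # speedup: 2.27x
print(f"efficiency: {efficiency:.0%}")   # efficiency: 76%
print(f"implied overhead: {fixed_overhead:.1f} s")
```

Under that assumption, roughly a quarter of the MapReduce run time goes to work that does not shrink as workers are added, which matches the observation that the gain falls short of 3 times.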
