
How to Install Spark on YARN

Overview

Spark performs data I/O in memory rather than on disk, so intermediate results do not need to be written to HDFS at every step. This saves a large amount of I/O time, and Spark advertises itself as up to 100 times faster than MapReduce.

(figure: spark3.3.1-cluster-overview)

Spark supports the following four cluster manager types:

(figure: spark_on_YARN_manager_type)

Among these, I will install Spark using the cluster's existing Hadoop YARN as the cluster manager.

Install

To run Spark in Standalone mode, Spark must be installed on all nodes. However, with Spark on YARN, you only need to install it on a single client node. The installed Hadoop version is 3.3.4, and the Spark version is 3.3.1.

Download

I installed Spark on the node where the NameNode is running.

wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
tar -zxvf spark-3.3.1-bin-hadoop3.tgz

Then, copy the extracted directory to an appropriate location. In my case, I placed it at /usr/local/spark.

cp -R spark-3.3.1-bin-hadoop3 /usr/local/spark

Configuration Settings

~/.bashrc

Add the following content.

export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin

To reload environment variables, run source ~/.bashrc.

The downloaded distribution ships its configuration files under $SPARK_HOME/conf as .template files. Copy each one you need, as shown below, and then edit the copy.

spark-env.sh

root@ubuntu01:/usr/local/spark/conf# ls
fairscheduler.xml.template  log4j2.properties.template  metrics.properties.template  spark-defaults.conf.template  spark-env.sh.template  workers.template
root@ubuntu01:/usr/local/spark/conf# cp spark-env.sh.template spark-env.sh

Add the following content.

spark-env.sh
# Options read in any cluster manager using HDFS
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
# Options read in YARN client/cluster mode
# - YARN_CONF_DIR, to point Spark towards YARN configuration files when you use YARN
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop

spark-defaults.conf

spark.history.fs.logDirectory   hdfs:///sparklog
spark.eventLog.dir               hdfs:///sparklog
spark.eventLog.enabled           true
spark.history.provider           org.apache.spark.deploy.history.FsHistoryProvider

Starting Spark History Server

Before starting the server, make sure the event log directory configured above exists on HDFS (for example, hdfs dfs -mkdir /sparklog); the history server fails to start if spark.history.fs.logDirectory does not exist.

/usr/local/spark/sbin# ./start-history-server.sh

Verify that the history server is running on port 18080 as shown below.

(figure: spark_on_yarn_history_server)
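Besides the web UI, the history server exposes a REST API on the same port. The helper below is a minimal sketch (the host and default port 18080 are assumptions matching a default setup); the /api/v1/applications endpoint lists completed applications as JSON.

```python
import json
from urllib.request import urlopen

# Build the history server's REST endpoint for listing applications.
# 18080 is the default spark.history.ui.port.
def history_api_url(host: str = "localhost", port: int = 18080) -> str:
    return f"http://{host}:{port}/api/v1/applications"

# Fetch and decode the application list (requires a running history server).
def list_applications(host: str = "localhost", port: int = 18080):
    with urlopen(history_api_url(host, port)) as resp:
        return json.load(resp)

print(history_api_url())  # http://localhost:18080/api/v1/applications
```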

Running a PySpark Example

Run the pi-estimation example below to verify that Spark executes correctly on YARN.

spark-submit
root@ubuntu01:/usr/local/spark/examples/src/main/python# spark-submit --master yarn --deploy-mode cluster pi.py
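For reference, pi.py estimates pi by Monte Carlo sampling: it draws random points in a square and counts how many fall inside the inscribed circle. A standalone, non-Spark sketch of the same idea (Spark's version parallelizes the sampling across executors):

```python
import random

# Monte Carlo estimate of pi: the fraction of uniform points in [-1, 1]^2
# that land inside the unit circle approaches pi/4 as n grows.
def estimate_pi(n: int, seed: int = 42) -> float:
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            inside += 1
    return 4.0 * inside / n

print(estimate_pi(100_000))  # approximately 3.14
```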

If the execution result appears as shown below, the installation was successful.

22/11/26 03:09:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/11/26 03:09:50 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at ubuntu01/192.168.219.101:8040
22/11/26 03:09:50 INFO Configuration: resource-types.xml not found
22/11/26 03:09:50 INFO ResourceUtils: Unable to find 'resource-types.xml'.
22/11/26 03:09:50 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
22/11/26 03:09:50 INFO Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
22/11/26 03:09:50 INFO Client: Setting up container launch context for our AM
22/11/26 03:09:50 INFO Client: Setting up the launch environment for our AM container
22/11/26 03:09:50 INFO Client: Preparing resources for our AM container
22/11/26 03:09:50 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
22/11/26 03:09:51 INFO Client: Uploading resource file:/tmp/spark-817663b0-763a-4276-9323-06d7673fef50/__spark_libs__4960795851374146516.zip -> hdfs://ubuntu01:9000/user/root/.sparkStaging/application_1669137917003_0010/__spark_libs__4960795851374146516.zip
22/11/26 03:09:54 INFO Client: Uploading resource file:/usr/local/spark/examples/src/main/python/pi.py -> hdfs://ubuntu01:9000/user/root/.sparkStaging/application_1669137917003_0010/pi.py
22/11/26 03:09:54 INFO Client: Uploading resource file:/usr/local/spark/python/lib/pyspark.zip -> hdfs://ubuntu01:9000/user/root/.sparkStaging/application_1669137917003_0010/pyspark.zip
22/11/26 03:09:54 INFO Client: Uploading resource file:/usr/local/spark/python/lib/py4j-0.10.9.5-src.zip -> hdfs://ubuntu01:9000/user/root/.sparkStaging/application_1669137917003_0010/py4j-0.10.9.5-src.zip
22/11/26 03:09:54 INFO Client: Uploading resource file:/tmp/spark-817663b0-763a-4276-9323-06d7673fef50/__spark_conf__2594025362965367855.zip -> hdfs://ubuntu01:9000/user/root/.sparkStaging/application_1669137917003_0010/__spark_conf__.zip
22/11/26 03:09:54 INFO SecurityManager: Changing view acls to: root
22/11/26 03:09:54 INFO SecurityManager: Changing modify acls to: root
22/11/26 03:09:54 INFO SecurityManager: Changing view acls groups to:
22/11/26 03:09:54 INFO SecurityManager: Changing modify acls groups to:
22/11/26 03:09:54 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
22/11/26 03:09:54 INFO Client: Submitting application application_1669137917003_0010 to ResourceManager
22/11/26 03:09:54 INFO YarnClientImpl: Submitted application application_1669137917003_0010
22/11/26 03:09:55 INFO Client: Application report for application_1669137917003_0010 (state: ACCEPTED)
22/11/26 03:09:55 INFO Client:
	 client token: N/A
	 diagnostics: AM container is launched, waiting for AM container to Register with RM
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1669432194364
	 final status: UNDEFINED
	 tracking URL: http://ubuntu01:8088/proxy/application_1669137917003_0010/
	 user: root
22/11/26 03:09:56 INFO Client: Application report for application_1669137917003_0010 (state: ACCEPTED)
22/11/26 03:09:57 INFO Client: Application report for application_1669137917003_0010 (state: ACCEPTED)
22/11/26 03:09:58 INFO Client: Application report for application_1669137917003_0010 (state: ACCEPTED)
22/11/26 03:09:59 INFO Client: Application report for application_1669137917003_0010 (state: ACCEPTED)
22/11/26 03:10:00 INFO Client: Application report for application_1669137917003_0010 (state: RUNNING)
22/11/26 03:10:00 INFO Client:
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: ubuntu03
	 ApplicationMaster RPC port: 43259
	 queue: default
	 start time: 1669432194364
	 final status: UNDEFINED
	 tracking URL: http://ubuntu01:8088/proxy/application_1669137917003_0010/
	 user: root
22/11/26 03:10:01 INFO Client: Application report for application_1669137917003_0010 (state: RUNNING)
22/11/26 03:10:02 INFO Client: Application report for application_1669137917003_0010 (state: RUNNING)
22/11/26 03:10:03 INFO Client: Application report for application_1669137917003_0010 (state: RUNNING)
22/11/26 03:10:04 INFO Client: Application report for application_1669137917003_0010 (state: RUNNING)
22/11/26 03:10:05 INFO Client: Application report for application_1669137917003_0010 (state: RUNNING)
22/11/26 03:10:06 INFO Client: Application report for application_1669137917003_0010 (state: RUNNING)
22/11/26 03:10:07 INFO Client: Application report for application_1669137917003_0010 (state: RUNNING)
22/11/26 03:10:08 INFO Client: Application report for application_1669137917003_0010 (state: FINISHED)
22/11/26 03:10:08 INFO Client:
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: ubuntu03
	 ApplicationMaster RPC port: 43259
	 queue: default
	 start time: 1669432194364
	 final status: SUCCEEDED
	 tracking URL: http://ubuntu01:8088/proxy/application_1669137917003_0010/
	 user: root
22/11/26 03:10:08 INFO ShutdownHookManager: Shutdown hook called
22/11/26 03:10:08 INFO ShutdownHookManager: Deleting directory /tmp/spark-467194a9-c6c1-4708-88aa-f66d15229dd6
22/11/26 03:10:08 INFO ShutdownHookManager: Deleting directory /tmp/spark-817663b0-763a-4276-9323-06d7673fef50
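The "Will allocate AM container, with 1408 MB memory including 384 MB overhead" line in the log above follows Spark's container sizing rule: requested memory plus an overhead of 10% with a 384 MB floor. A sketch of that arithmetic, assuming the default 1 GiB AM/driver memory:

```python
# Spark-on-YARN container sizing: overhead is max(10% of the requested
# memory, 384 MB), added on top of the request. With the default 1024 MB
# this yields the 1408 MB seen in the log.
MIN_OVERHEAD_MB = 384
OVERHEAD_FACTOR = 0.10

def container_mb(requested_mb: int) -> int:
    overhead = max(int(requested_mb * OVERHEAD_FACTOR), MIN_OVERHEAD_MB)
    return requested_mb + overhead

print(container_mb(1024))  # 1024 + 384 = 1408
```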

As shown below, you can verify that the job was successfully submitted both in the Spark history server and the Resource Manager Web UI.

(figure: spark_on_yarn_resource_manager)

Comparing Hive on MapReduce with Spark SQL

Hive executes queries with MapReduce by default, but it can also use Spark as its execution engine; see https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started. Recently, however, rather than a Hive -> Spark -> YARN stack, it seems more common to skip Hive entirely and run Spark SQL directly on YARN.

I will run the query select movieid, avg(rating) as avg_rating from u_data group by movieid sort by avg_rating DESC; on the u_data table from the MovieLens dataset in Hive to measure how much faster Spark (which runs in memory) is compared to MapReduce (which uses HDFS).
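The query groups ratings by movie and averages them, sorted descending. As a plain-Python sketch of the same computation on a toy stand-in for the u_data table:

```python
from collections import defaultdict

# Toy rows standing in for u_data: (movieid, rating) pairs.
rows = [
    (1, 5), (1, 3),
    (2, 4), (2, 4),
    (3, 2),
]

# group by movieid, accumulating sum and count
sums = defaultdict(lambda: [0, 0])  # movieid -> [sum, count]
for movieid, rating in rows:
    sums[movieid][0] += rating
    sums[movieid][1] += 1

# avg(rating) per movieid, sorted by average descending
avg_rating = sorted(
    ((m, s / c) for m, (s, c) in sums.items()),
    key=lambda t: t[1],
    reverse=True,
)
print(avg_rating)  # [(1, 4.0), (2, 4.0), (3, 2.0)]
```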

Hive (MapReduce Engine)

Hive took 33.699 seconds.

hive> select movieid, avg(rating) as avg_rating from u_data group by movieid sort by avg_rating DESC;
Time taken: 33.699 seconds, Fetched: 1682 row(s)

Spark SQL

Preparation

To let Spark SQL read the existing Hive metastore (which stores its metadata in MySQL here), you need the mysql-connector, just as when installing Hive. Refer to the link for mysql-connector installation instructions.

Then, place the mysql-connector-java-*.jar file under $SPARK_HOME/jars/.
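As an alternative to copying the jar into $SPARK_HOME/jars/ (this is an assumption, not the setup used above), the connector can also be registered per deployment through the spark.jars property in spark-defaults.conf; the path below is a hypothetical example:

```
spark.jars    /path/to/mysql-connector-java.jar
```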

Running pyspark

Run $SPARK_HOME/bin/pyspark and enter the following code.

from pyspark.sql import SparkSession
import time

start = time.time()

# enableHiveSupport() is required so that spark.sql() can see the
# Hive table u_data through the metastore.
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("select movieid, avg(rating) as avg_rating from u_data group by movieid sort by avg_rating DESC").show()

end = time.time()
print(f"{end - start:.5f} sec")

The execution took only about 6 seconds, which is remarkable: compared to the 33 seconds Hive needed through MapReduce, this is a dramatic speedup.

6.07104 sec

Reference

spark-submit --master yarn --deploy-mode cluster --num-executors 4 wordcount.py  hdfs:///tmp/input/sample.txt
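The wordcount.py referenced above is Spark's classic word-count example. A non-Spark sketch of the same computation on a string (Spark's version distributes the splitting and counting across executors):

```python
from collections import Counter

# Count word occurrences in a text, as wordcount does for an HDFS file.
def word_count(text: str) -> Counter:
    return Counter(text.split())

counts = word_count("to be or not to be")
print(counts["to"], counts["or"])  # 2 1
```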
