Building a Secured (Kerberized) Hadoop Cluster

Overview

When you install Hadoop, security is not applied by default. To apply security to Hadoop, you need to use an authentication system called Kerberos, and setting it up can be quite challenging, so I decided to document the process.

This guide is based on the official Hadoop documentation for setting up a secured cluster.

Building a Hadoop Cluster

Before applying Kerberos security, it is assumed that a Hadoop cluster is already set up. It is also assumed that there are 2 NameNodes and 3 JournalNodes for High Availability (H/A). ZooKeeper is required for this setup.

Creating Linux Users

A hadoop group and an hdfs account must exist on Linux. This is because, to apply Kerberos, you need to start the NameNode and DataNode under that account.
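As a minimal sketch (run as root; the group and user names follow the text above):

```shell
# Create the hadoop group and the hdfs service account (idempotent)
getent group hadoop >/dev/null || groupadd hadoop
id hdfs >/dev/null 2>&1 || useradd -g hadoop -m hdfs

# Verify the account and its primary group
id hdfs
```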

Creating Directories on Linux

The NameNode data storage location /dfs/nn, the JournalNode storage location /dfs/jn, and the DataNode storage location /dfs/dn must be created in advance, and appropriate permissions must be granted.
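A sketch of the directory setup, assuming the hdfs:hadoop account from the previous section (the guard lines create it if missing). Mode 700 on the DataNode directory matches the Hadoop default dfs.datanode.data.dir.perm:

```shell
# Ensure the hdfs:hadoop account exists (see the previous section)
getent group hadoop >/dev/null || groupadd hadoop
id hdfs >/dev/null 2>&1 || useradd -g hadoop hdfs

# Create the NameNode, JournalNode, and DataNode storage directories
mkdir -p /dfs/nn /dfs/jn /dfs/dn
chown -R hdfs:hadoop /dfs

# DataNode data dirs must not be group/world accessible
chmod 700 /dfs/nn /dfs/jn /dfs/dn
```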

Creating Kerberos Principals

Create hdfs/{fqdn}@{realm} and HTTP/{fqdn}@{realm} principals on the Kerberos Server, and download the keytab that allows login with those principals. To log in using this keytab, the /etc/krb5.conf file must be properly configured.
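With an MIT Kerberos KDC, the principals and keytab can be created along these lines (the hostname hadoop1.mysite.com and realm CHAOS.ORDER.COM are placeholders taken from this guide's configuration; repeat for every Hadoop host):

```shell
# On the KDC: create the service principals for one Hadoop host
kadmin.local -q "addprinc -randkey hdfs/hadoop1.mysite.com@CHAOS.ORDER.COM"
kadmin.local -q "addprinc -randkey HTTP/hadoop1.mysite.com@CHAOS.ORDER.COM"

# Export both principals into one keytab, then copy it to /etc/hdfs.keytab on the node
kadmin.local -q "ktadd -k /etc/hdfs.keytab hdfs/hadoop1.mysite.com@CHAOS.ORDER.COM HTTP/hadoop1.mysite.com@CHAOS.ORDER.COM"

# On the Hadoop node: confirm the keytab can obtain a ticket
kinit -kt /etc/hdfs.keytab hdfs/hadoop1.mysite.com@CHAOS.ORDER.COM
klist
```

Note that ktadd invalidates any previously exported keytab for those principals, and the keytab file should be protected, e.g. chown hdfs /etc/hdfs.keytab and chmod 400 /etc/hdfs.keytab.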

Installing jsvc

In a secured environment, the DataNode is the trickiest daemon to start: it binds to privileged ports (below 1024) to authenticate the data transfer protocol, so it must be launched through jsvc.

yum install jsvc
which jsvc

Modifying Configuration

The configurations below must be identical across all servers.

hdfs-site.xml
<configuration>
	<property>
		<name>dfs.nameservices</name>
		<value>mycluster</value>
	</property>
	<property>
		<name>dfs.ha.namenodes.mycluster</name>
		<value>nn1,nn2</value>
	</property>
	<property>
		<name>dfs.namenode.rpc-address.mycluster.nn1</name>
		<value>hadoop1.mysite.com:8020</value>
	</property>
	<property>
		<name>dfs.namenode.rpc-address.mycluster.nn2</name>
		<value>hadoop2.mysite.com:8020</value>
	</property>
	<property>
		<name>dfs.namenode.http-address.mycluster.nn1</name>
		<value>hadoop1.mysite.com:9870</value>
	</property>
	<property>
		<name>dfs.namenode.http-address.mycluster.nn2</name>
		<value>hadoop2.mysite.com:9870</value>
	</property>
	<property>
		<name>dfs.namenode.shared.edits.dir</name>
		<value>qjournal://hadoop1.mysite.com:8485;hadoop2.mysite.com:8485;hadoop3.mysite.com:8485/mycluster</value>
	</property>
	<property>
		<name>dfs.client.failover.proxy.provider.mycluster</name>
		<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
	</property>
	<property>
		<name>dfs.ha.automatic-failover.enabled</name>
		<value>true</value>
	</property>
	<property>
		<name>ha.zookeeper.quorum</name>
		<value>hadoop1.mysite.com:2181,hadoop2.mysite.com:2181,hadoop3.mysite.com:2181</value>
	</property>
	<property>
		<name>dfs.ha.fencing.methods</name>
		<value>shell(/bin/true)</value>
	</property>
	<property>
		<name>dfs.ha.fencing.ssh.private-key-files</name>
		<value>/root/.ssh/id_rsa</value>
	</property>
	<property>
		<name>dfs.namenode.name.dir</name>
		<value>/dfs/nn</value>
	</property>
	<property>
		<name>dfs.datanode.data.dir</name>
		<value>/dfs/dn</value>
	</property>
	<property>
		<name>dfs.blocksize</name>
		<value>134217728</value>
	</property>
	<property>
		<name>dfs.journalnode.edits.dir</name>
		<value>/dfs/jn</value>
	</property>
	<!-- JournalNode -->
	<property>
		<name>dfs.journalnode.keytab.file</name>
		<value>/etc/hdfs.keytab</value>
	</property>
	<property>
		<name>dfs.journalnode.kerberos.principal</name>
		<value>hdfs/_HOST@CHAOS.ORDER.COM</value>
	</property>
	<property>
		<name>dfs.journalnode.kerberos.internal.spnego.principal</name>
		<value>HTTP/_HOST@CHAOS.ORDER.COM</value>
	</property>
	<!-- NameNode -->
	<property>
		<name>dfs.namenode.keytab.file</name>
		<value>/etc/hdfs.keytab</value>
	</property>
	<property>
		<name>dfs.namenode.kerberos.principal</name>
		<value>hdfs/_HOST@CHAOS.ORDER.COM</value>
	</property>
	<property>
		<name>dfs.namenode.kerberos.internal.spnego.principal</name>
		<value>${dfs.web.authentication.kerberos.principal}</value>
	</property>
	<!-- DataNode -->
	<property>
		<name>dfs.datanode.keytab.file</name>
		<value>/etc/hdfs.keytab</value>
	</property>
	<property>
		<name>dfs.datanode.kerberos.principal</name>
		<value>hdfs/_HOST@CHAOS.ORDER.COM</value>
	</property>
	<property>
		<name>dfs.datanode.address</name>
		<value>0.0.0.0:1004</value>
	</property>
	<property>
		<name>dfs.datanode.http.address</name>
		<value>0.0.0.0:1006</value>
	</property>
	<!-- Web -->
	<property>
		<name>dfs.web.authentication.kerberos.keytab</name>
		<value>/etc/hdfs.keytab</value>
	</property>
	<property>
		<name>dfs.web.authentication.kerberos.principal</name>
		<value>HTTP/_HOST@CHAOS.ORDER.COM</value>
	</property>
	<property>
		<name>dfs.block.access.token.enable</name>
		<value>true</value>
	</property>
	<property>
		<name>dfs.namenode.secondary.http-address</name>
		<value>0.0.0.0:50090</value>
	</property>
	<property>
		<name>dfs.namenode.secondary.https-address</name>
		<value>0.0.0.0:50091</value>
	</property>
</configuration>

core-site.xml
<configuration>
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://mycluster</value>
	</property>
	<property>
		<name>ha.zookeeper.quorum</name>
		<value>hadoop1.mysite.com:2181,hadoop2.mysite.com:2181,hadoop3.mysite.com:2181</value>
	</property>
	<property>
		<name>hadoop.rpc.protection</name>
		<value>authentication</value>
	</property>
	<property>
		<name>hadoop.security.authentication</name>
		<value>kerberos</value>
	</property>
	<property>
		<name>hadoop.security.authorization</name>
		<value>true</value>
	</property>
</configuration>

In the file below, uncomment the JSVC_HOME and HADOOP_SECURE_DN_USER lines.

Set JSVC_HOME to the path printed by which jsvc, and set HADOOP_SECURE_DN_USER to hdfs.

hadoop-env.sh
# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol.  Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
export JSVC_HOME=/usr/bin
# On secure datanodes, user to run the datanode as after dropping privileges.
# This **MUST** be uncommented to enable secure HDFS if using privileged ports
# to provide authentication of data transfer protocol.  This **MUST NOT** be
# defined if SASL is configured for authentication of data transfer protocol
# using non-privileged ports.

export HADOOP_SECURE_DN_USER=hdfs

Starting JournalNode, NameNode, and DataNode

If this is the initial cluster setup, the following steps are required.

  1. Start ZooKeepers: zookeeper/bin/zkServer.sh start (on all 3 nodes).
  2. Format ZKFC: hdfs zkfc -formatZK (run only once).
  3. Start JournalNodes: hdfs journalnode (on all 3 nodes).
  4. Format active NameNode: hdfs namenode -format (run only once on the active NameNode)
  5. Start active NameNode: hdfs namenode (run on the active NameNode)
  6. Format standby NameNode: hdfs namenode -bootstrapStandby (run only once on the standby NameNode)
  7. Start standby NameNode: hdfs namenode (run on the standby NameNode)
  8. Start ZKFC: hdfs zkfc (run on the nodes where the active and standby NameNodes reside)
  9. Start DataNodes: hdfs datanode (run on the data nodes)

Important note: JournalNode, NameNode, and ZKFC should be started with the hdfs account, while DataNode and ZooKeeper should be started with the root account.
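Combining the startup order with the account requirements above, a per-host sketch might look like this, using the Hadoop 3 --daemon helper to run the services in the background (hostnames and layout are the assumptions used throughout this guide; the format steps run only on first setup):

```shell
# On all three ZooKeeper/JournalNode hosts (ZooKeeper as root, JournalNode as hdfs)
zookeeper/bin/zkServer.sh start
sudo -u hdfs hdfs --daemon start journalnode

# On the active NameNode host, as hdfs
sudo -u hdfs hdfs zkfc -formatZK
sudo -u hdfs hdfs namenode -format
sudo -u hdfs hdfs --daemon start namenode
sudo -u hdfs hdfs --daemon start zkfc

# On the standby NameNode host, as hdfs
sudo -u hdfs hdfs namenode -bootstrapStandby
sudo -u hdfs hdfs --daemon start namenode
sudo -u hdfs hdfs --daemon start zkfc

# On each DataNode host, as root (jsvc drops privileges to HADOOP_SECURE_DN_USER)
hdfs --daemon start datanode
```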

If the installation is successful

You can access the NameNode Web UI at http://namenode_ip_address:9870. The Security section should be displayed as "on", as shown below.

[Screenshot: Hadoop NameNode Web UI showing Security: on]
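Beyond the Web UI, you can verify from the command line that Kerberos is actually enforced; without a ticket every HDFS operation should now be rejected (hostname and principal are the assumptions from earlier sections):

```shell
# Without a ticket, this should fail with a GSS/"no valid credentials" error
kdestroy
hdfs dfs -ls /

# With a ticket from the keytab, the same command should succeed
kinit -kt /etc/hdfs.keytab hdfs/hadoop1.mysite.com@CHAOS.ORDER.COM
hdfs dfs -ls /

# SPNEGO check against the NameNode Web UI
curl --negotiate -u : "http://hadoop1.mysite.com:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus"
```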
