Overview
Following the Hadoop build guide, I will share how to install Hadoop in cluster (fully distributed) mode.
Extracting the Binary File
Find the desired version of Hadoop at https://hadoop.apache.org/releases.html. Instead of using a Hadoop release version, I installed using a binary built from the source of the latest development branch (trunk), Hadoop 3.4. The commands below show the equivalent steps using the 3.3.4 release tarball as an example. Reference: How to build Hadoop 3.4
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
tar -zxvf hadoop-3.3.4.tar.gz
sudo cp -r hadoop-3.3.4 /usr/local/hadoop
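The downloaded tarball can also be verified against the SHA-512 checksum Apache publishes alongside each release (the checksum URL below assumes the standard Apache download layout):

```shell
# Download the SHA-512 checksum published next to the release tarball
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz.sha512
# Compare against the local file; prints "hadoop-3.3.4.tar.gz: OK" on success
sha512sum -c hadoop-3.3.4.tar.gz.sha512
```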
Setting Environment Variables: ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_YARN_USER=${HADOOP_YARN_USER:-yarn}
export ZOOKEEPER_HOME=/usr/local/zookeeper
export HBASE_HOME=/usr/local/hbase
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
export HIVE_HOME=/usr/local/hive
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$ZOOKEEPER_HOME/bin:$HBASE_HOME/bin:$JAVA_HOME/bin:$HIVE_HOME/bin:$SPARK_HOME/bin
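After editing ~/.bashrc, reload it in the current session and confirm the binaries resolve before moving on:

```shell
# Reload the shell configuration in the current session
source ~/.bashrc
# Confirm Hadoop is on the PATH and reports the expected version
hadoop version
# Confirm the Java runtime matches JAVA_HOME
java -version
```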
Configuration Changes
HDFS Configuration
Add the following to core-site.xml.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://ubuntu01:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>
</configuration>
Add the following to hdfs-site.xml.
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/dfs/nn</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/dfs/dn</value>
</property>
</configuration>
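The name and data directories referenced above must exist before the namenode is formatted and the daemons are started. A minimal sketch, assuming the daemons run as root as in this guide:

```shell
# On the master node: directory for namenode metadata (dfs.namenode.name.dir)
sudo mkdir -p /dfs/nn
# On each worker node: directory for datanode blocks (dfs.datanode.data.dir)
sudo mkdir -p /dfs/dn
# Make sure the user running the Hadoop daemons owns the directories
sudo chown -R root:root /dfs
```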
Add the following content to hadoop-env.sh.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HDFS_NAMENODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export HDFS_DATANODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
Register the worker hostnames in $HADOOP_CONF_DIR/workers.
ubuntu02
ubuntu03
ubuntu04
ubuntu05
ubuntu06
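Once the configuration files are edited on the master, they need to reach every worker. A sketch using scp, assuming the same /usr/local/hadoop layout on every node and passwordless SSH (set up in the Key Distribution step):

```shell
# Copy the configured Hadoop conf directory to each worker node
for host in ubuntu02 ubuntu03 ubuntu04 ubuntu05 ubuntu06; do
  scp -r /usr/local/hadoop/etc/hadoop "root@${host}:/usr/local/hadoop/etc/"
done
```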
YARN Configuration
Add the following to yarn-site.xml.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>ubuntu01:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>ubuntu01:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>ubuntu01:8040</value>
</property>
</configuration>
Key Distribution
Run the following command on the master node and press Enter at each prompt to accept the defaults.
ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa
Your public key has been saved in /root/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:D7CPU5pl7TUrMINhMzsO/ZVEXOE2YI73M/HCcDruhAQ root@yxxxxx
The key's randomart image is:
+---[RSA 3072]----+
| .+.o. |
| =.o |
| E . = * |
Copy the generated RSA public key under ~/.ssh/ to ~/.ssh/authorized_keys on all nodes (including the master node).
cat ~/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAADSxxxxxxxx...mmn8= root@yxxxxxx
Create the ~/.ssh folder on all nodes. (For the master node, the ~/.ssh folder should already have been created during the keygen process.) Open an editor and add the master's public key copied above to ~/.ssh/authorized_keys.
mkdir -p ~/.ssh
vim ~/.ssh/authorized_keys
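After distributing the key, it is worth confirming that passwordless login works to every node before starting the cluster. The chmod values are the standard permissions sshd requires on ~/.ssh; the hostnames come from the workers list above:

```shell
# authorized_keys must not be group/world writable or sshd ignores it
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
# Each command should print the remote hostname without a password prompt
for host in ubuntu02 ubuntu03 ubuntu04 ubuntu05 ubuntu06; do
  ssh -o BatchMode=yes "root@${host}" hostname
done
```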
Namenode Format
Execute the following command on the master node (ubuntu01).
hdfs namenode -format
Starting the Hadoop Cluster
Running the following command will start the namenode, secondary namenode, and resource manager on the master node ubuntu01, and the datanode and node manager on the worker nodes ubuntu02 through ubuntu06.
start-all.sh
Check the running Hadoop processes with the jps command.
For the master node, you should see something like this:
1693816 ResourceManager
1702488 SecondaryNameNode
1703408 Jps
1701958 NameNode
For worker nodes, you should see something like this:
1703882 DataNode
1704983 Jps
1704373 NodeManager
Checking the Web UI
Access the Namenode Web UI at http://ubuntu01:9870 to verify that the namenode and datanodes are running properly.
Access the Resource Manager Web UI at http://ubuntu01:8088 to check the status of the resource manager and node managers.
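As a final check, a small job can be submitted to verify HDFS and YARN end to end. A sketch using the examples jar shipped with Hadoop (adjust the jar version to match your installed release):

```shell
# Write and read back a file on HDFS
hdfs dfs -mkdir -p /tmp/smoke
echo "hello hadoop" | hdfs dfs -put - /tmp/smoke/hello.txt
hdfs dfs -cat /tmp/smoke/hello.txt
# Run the bundled pi estimator on YARN
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar pi 2 10
```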