
Hadoop Ecosystem Practical Guide: HDFS, MapReduce, and YARN Core Concepts


Introduction

Since its debut in 2006, Hadoop has established itself as the standard for big data processing. Although next-generation frameworks like Spark and Flink have emerged, HDFS and YARN remain the core foundation of data infrastructure. As of 2026, many enterprises still operate Hadoop-based data lakes, and in cloud environments, object stores such as S3 and ADLS are widely accessed through Hadoop-compatible file system connectors.

This article provides an architecture-level understanding of Hadoop's core components and covers the configuration and tuning points needed for real-world operations.

1. Hadoop Ecosystem Overview

Core Components

┌─────────────────────────────────────────────────────────────┐
│                      Hadoop Ecosystem                       │
├─────────────────────────────────────────────────────────────┤
│  Application / Processing Layer                             │
│  [Hive]  [Pig]  [Spark]  [HBase]  [Presto]  [Flink]         │
├─────────────────────────────────────────────────────────────┤
│  Resource Management Layer                                  │
│  [YARN]                                                     │
├─────────────────────────────────────────────────────────────┤
│  Distributed Storage Layer                                  │
│  [HDFS]                                                     │
└─────────────────────────────────────────────────────────────┘
| Component | Role | Version (as of 2026) |
|-----------|------|----------------------|
| HDFS | Distributed file system | Hadoop 3.4.x |
| YARN | Resource management and scheduling | Hadoop 3.4.x |
| MapReduce | Batch processing framework | Hadoop 3.4.x |
| Hive | SQL-on-Hadoop | Hive 4.x |
| HBase | NoSQL database | HBase 2.6.x |
| Spark | Unified analytics engine | Spark 3.5.x / 4.x |
| ZooKeeper | Distributed coordination | ZooKeeper 3.9.x |

Hadoop's Position in 2026

  • HDFS: Still the core storage for large-scale data lakes. Coexists with S3/ADLS
  • YARN: Used as the resource manager for various frameworks including Spark and Flink
  • MapReduce: Usage has declined for new development, but still necessary for legacy system maintenance
  • Trends: Hadoop on Kubernetes, Ozone (next-gen storage), integration with Iceberg/Delta Lake

2. HDFS Architecture

Core Components

                    ┌──────────────┐
                    │   NameNode   │
                    │   (Master)   │
                    │ - Metadata   │
                    │ - Block map  │
                    └──────┬───────┘
              ┌────────────┼────────────┐
              │            │            │
      ┌───────▼──────┐ ┌──▼──────────┐ ┌▼─────────────┐
      │  DataNode 1  │ │  DataNode 2 │ │  DataNode 3  │
      │  Block A     │ │  Block A    │ │  Block B     │
      │  Block C     │ │  Block B    │ │  Block A     │
      └──────────────┘ └─────────────┘ └──────────────┘
                    Replication Factor = 3

NameNode

The NameNode is the master node that manages HDFS metadata.

# Information managed by NameNode
# 1. File/directory tree structure (Namespace)
# 2. Block list for each file
# 3. Which DataNode holds each block (Block Mapping)
# 4. File permissions, modification times, etc.

# NameNode memory usage estimate
# 1 file ≈ 150 bytes
# 1 block ≈ 150 bytes
# 100 million files → approximately 30GB heap memory needed
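The rule of thumb above is easy to turn into a quick sizing sketch. The ~150 bytes per metadata object is an approximation, not an exact figure, and `avg_blocks_per_file` is an assumption you should adjust for your workload:

```python
# Rough NameNode heap estimator based on the ~150 bytes/object rule of thumb.
BYTES_PER_OBJECT = 150  # approximate metadata cost per file or block entry

def namenode_heap_gb(num_files: int, avg_blocks_per_file: float = 1.0) -> float:
    """Estimate NameNode heap in GB: file entries plus block entries."""
    objects = num_files * (1 + avg_blocks_per_file)
    return objects * BYTES_PER_OBJECT / 1024**3

# 100 million files with ~1 block each -> roughly 28 GB
# (the article's figure of ~30 GB includes extra headroom)
print(round(namenode_heap_gb(100_000_000), 1))
```

This is also why HDFS suffers from the "small files problem": a million tiny files cost the NameNode as much memory as a million huge ones.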

DataNode

# DataNode core behavior
# 1. Stores blocks (default 128MB)
# 2. Serves block read/write to clients
# 3. Sends heartbeats to NameNode (every 3 seconds)
# 4. Sends block reports (every 6 hours)
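With the default 128MB block size, the number of blocks a file occupies follows directly from its size; a minimal sketch (the last block may be smaller than 128MB):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # default dfs.blocksize (128 MB)

def num_blocks(file_size_bytes: int) -> int:
    """Number of HDFS blocks a file occupies; the last block may be partial."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

print(num_blocks(1 * 1024**3))    # 1 GB file   -> 8 blocks
print(num_blocks(200 * 1024**2))  # 200 MB file -> 2 blocks (128 MB + 72 MB)
```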

HDFS HA (High Availability)

<!-- hdfs-site.xml - HA configuration -->
<configuration>
    <property>
        <name>dfs.nameservices</name>
        <value>mycluster</value>
    </property>
    <property>
        <name>dfs.ha.namenodes.mycluster</name>
        <value>nn1,nn2</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.mycluster.nn1</name>
        <value>namenode1:8020</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.mycluster.nn2</name>
        <value>namenode2:8020</value>
    </property>

    <!-- JournalNode configuration (minimum 3) -->
    <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
    </property>

    <!-- Automatic failover -->
    <property>
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.ha.fencing.methods</name>
        <value>sshfence</value>
    </property>
</configuration>

Block Replication Strategy

# Default Rack-Aware replication strategy (Replication Factor = 3)
#
# 1st replica: Same node as client (or same rack)
# 2nd replica: Node in a different rack
# 3rd replica: Different node in the same rack as the 2nd
#
# ┌──────── Rack 1 ────────┐  ┌──────── Rack 2 ────────┐
# │ [DataNode1: Block A]   │  │ [DataNode3: Block A]   │
# │ [DataNode2]            │  │ [DataNode4: Block A]   │
# └────────────────────────┘  └────────────────────────┘

3. MapReduce Principles and Execution Flow

MapReduce Programming Model

Input → Split → Map → Shuffle & Sort → Reduce → Output

[File Split]
  Split 1 → Mapper 1 → (key, value) pairs ─┐
  Split 2 → Mapper 2 → (key, value) pairs ─┼── Shuffle & Sort
  Split 3 → Mapper 3 → (key, value) pairs ─┘        │
                                                    ├→ Reducer 1 → Output Part 1
                                                    └→ Reducer 2 → Output Part 2

WordCount Example (Java)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.StringTokenizer;

public class WordCount {

    // Mapper: Splits each line into words and outputs (word, 1)
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken().toLowerCase());
                context.write(word, one);
            }
        }
    }

    // Reducer: Sums values for the same key
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // Local aggregation to reduce network traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Job Execution Flow

1. Client submits Job
   └→ Uploads JAR, configuration, InputSplit info to HDFS

2. ResourceManager allocates ApplicationMaster container
   └→ NodeManager starts AM

3. ApplicationMaster requests Map tasks based on InputSplit count
   └→ ResourceManager allocates containers

4. Map tasks execute
   └→ Read InputSplit → call map() → store intermediate results on local disk

5. Shuffle & Sort
   └→ Partition Map output → transfer to Reducers → sort by key

6. Reduce tasks execute
   └→ Call reduce() → store final results on HDFS

7. ApplicationMaster notifies ResourceManager of Job completion
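The flow above can be simulated in miniature with plain Python, which makes the role of the shuffle phase concrete: it groups all map output by key before any reducer runs. This is an illustration of the programming model, not the Hadoop API:

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn):
    """Minimal in-memory simulation of Map -> Shuffle & Sort -> Reduce."""
    # Map phase: each split independently produces (key, value) pairs
    intermediate = []
    for split in splits:
        intermediate.extend(map_fn(split))
    # Shuffle & Sort: group values by key, process keys in sorted order
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)
    # Reduce phase: one call per distinct key
    return {key: reduce_fn(key, values) for key, values in sorted(grouped.items())}

# WordCount over two "splits"
splits = ["hello hadoop", "hello yarn"]
word_map = lambda line: [(w, 1) for w in line.split()]
sum_reduce = lambda key, values: sum(values)
print(run_mapreduce(splits, word_map, sum_reduce))
# {'hadoop': 1, 'hello': 2, 'yarn': 1}
```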

4. YARN Resource Management

YARN Architecture

┌─────────────────────────────────────────────────────┐
│                   ResourceManager                   │
│  ┌──────────────┐  ┌────────────────────────┐       │
│  │  Scheduler   │  │  ApplicationManager    │       │
│  │  (Resource   │  │  (App management)      │       │
│  │   allocation)│  │                        │       │
│  └──────────────┘  └────────────────────────┘       │
└──────────────────────┬──────────────────────────────┘
        ┌──────────────┼──────────────┐
        │              │              │
┌───────▼──────┐ ┌─────▼────────┐ ┌──▼────────────┐
│ NodeManager 1│ │ NodeManager 2│ │ NodeManager 3 │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐  │
│ │Container │ │ │ │Container │ │ │ │Container │  │
│ │   (AM)   │ │ │ │(Map Task)│ │ │ │ (Reduce) │  │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘  │
│ ┌──────────┐ │ │ ┌──────────┐ │ │               │
│ │Container │ │ │ │Container │ │ │               │
│ │(Map Task)│ │ │ │(Map Task)│ │ │               │
│ └──────────┘ │ │ └──────────┘ │ │               │
└──────────────┘ └──────────────┘ └───────────────┘

YARN Scheduler Configuration

<!-- yarn-site.xml -->
<configuration>
    <!-- Scheduler type: Capacity, Fair, FIFO -->
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    </property>

    <!-- NodeManager resource configuration -->
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>65536</value>  <!-- 64GB -->
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>16</value>
    </property>

    <!-- Container min/max memory -->
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>1024</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>32768</value>
    </property>

    <!-- Container min/max vCores -->
    <property>
        <name>yarn.scheduler.minimum-allocation-vcores</name>
        <value>1</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>8</value>
    </property>
</configuration>
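Two consequences of the settings above are worth internalizing: YARN normalizes container requests up to a multiple of the minimum allocation, and a node's container count is bounded by whichever of memory or vcores runs out first. A sketch under the configuration shown (64GB / 16 vcores per node, 1GB minimum, 32GB maximum):

```python
import math

def normalize_request(mem_mb: int, min_alloc_mb: int = 1024,
                      max_alloc_mb: int = 32768) -> int:
    """YARN rounds a container request up to a multiple of the minimum
    allocation and caps it at the maximum allocation."""
    return min(max_alloc_mb, math.ceil(mem_mb / min_alloc_mb) * min_alloc_mb)

def containers_per_node(node_mem_mb: int, node_vcores: int,
                        container_mem_mb: int, container_vcores: int) -> int:
    """How many identical containers fit on one NodeManager."""
    return min(node_mem_mb // container_mem_mb, node_vcores // container_vcores)

print(normalize_request(1500))                  # -> 2048 (rounded up to 2 x 1024)
print(containers_per_node(65536, 16, 2048, 1))  # -> 16 (vcore-bound, not memory-bound)
```

In this example the node has memory for 32 of the 2GB containers but only 16 vcores, so vcores are the limiting resource.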

Capacity Scheduler Queue Configuration

<!-- capacity-scheduler.xml -->
<configuration>
    <property>
        <name>yarn.scheduler.capacity.root.queues</name>
        <value>default,production,development</value>
    </property>

    <!-- Queue capacity (total must equal 100%) -->
    <property>
        <name>yarn.scheduler.capacity.root.default.capacity</name>
        <value>20</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.production.capacity</name>
        <value>60</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.development.capacity</name>
        <value>20</value>
    </property>

    <!-- Maximum capacity (elasticity) -->
    <property>
        <name>yarn.scheduler.capacity.root.production.maximum-capacity</name>
        <value>80</value>
    </property>

    <!-- Per-queue user limits -->
    <property>
        <name>yarn.scheduler.capacity.root.production.user-limit-factor</name>
        <value>2</value>
    </property>
</configuration>
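A common misconfiguration is sibling queue capacities that do not sum to 100%, which the ResourceManager rejects at startup. A minimal sanity check, useful in a deployment pipeline before pushing capacity-scheduler.xml:

```python
def validate_queue_capacities(capacities: dict) -> None:
    """Sibling queue capacities under one parent must sum to 100%."""
    total = sum(capacities.values())
    if abs(total - 100.0) > 1e-6:
        raise ValueError(f"queue capacities sum to {total}, expected 100")

# Matches the configuration above: 20 + 60 + 20 = 100, so this passes
validate_queue_capacities({"default": 20, "production": 60, "development": 20})
```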

5. Cluster Installation

Pseudo-Distributed Mode (for Development/Testing)

# 1. Install Java
sudo apt install -y openjdk-11-jdk
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

# 2. Download Hadoop
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz
tar -xzf hadoop-3.4.1.tar.gz
sudo mv hadoop-3.4.1 /opt/hadoop
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

# 3. SSH key setup (localhost)
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
<!-- core-site.xml -->
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop/data/tmp</value>
    </property>
</configuration>
<!-- hdfs-site.xml -->
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>  <!-- Set to 1 for pseudo-distributed mode -->
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/hadoop/data/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/hadoop/data/datanode</value>
    </property>
</configuration>
# 4. Format NameNode and start
hdfs namenode -format
start-dfs.sh
start-yarn.sh

# 5. Verify
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/$USER
jps  # Verify NameNode, DataNode, ResourceManager, NodeManager

Fully Distributed Mode (Production)

# Workers file configuration (/opt/hadoop/etc/hadoop/workers)
datanode1
datanode2
datanode3
datanode4
datanode5
<!-- core-site.xml (HA mode) -->
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://mycluster</value>
    </property>
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>zk1:2181,zk2:2181,zk3:2181</value>
    </property>
</configuration>

6. Essential HDFS Commands

File System Operations

# List files/directories
hdfs dfs -ls /user/hadoop/
hdfs dfs -ls -R /user/hadoop/   # Recursive listing

# Upload files
hdfs dfs -put localfile.txt /user/hadoop/
hdfs dfs -copyFromLocal data.csv /user/hadoop/input/
hdfs dfs -moveFromLocal temp.log /user/hadoop/logs/

# Download files
hdfs dfs -get /user/hadoop/output/part-r-00000 ./result.txt
hdfs dfs -copyToLocal /user/hadoop/data.csv ./

# View file contents
hdfs dfs -cat /user/hadoop/input/data.csv
hdfs dfs -head /user/hadoop/input/data.csv   # First 1KB
hdfs dfs -tail /user/hadoop/input/data.csv   # Last 1KB
hdfs dfs -text /user/hadoop/output/part-r-00000  # Read compressed files too

# Create/delete directories
hdfs dfs -mkdir -p /user/hadoop/input/2026/03
hdfs dfs -rm /user/hadoop/temp.txt
hdfs dfs -rm -r /user/hadoop/output/      # Delete directory
hdfs dfs -rm -skipTrash /user/hadoop/old/  # Skip trash

# Permission management
hdfs dfs -chmod 755 /user/hadoop/scripts/
hdfs dfs -chown hadoop:hadoop /user/hadoop/
hdfs dfs -chgrp analytics /data/shared/

# Disk usage
hdfs dfs -du -h /user/hadoop/        # Usage per directory
hdfs dfs -df -h                       # Total capacity
hdfs dfs -count -h /user/hadoop/      # File/directory count

Administrative Commands

# Cluster status
hdfs dfsadmin -report                    # Full cluster report
hdfs dfsadmin -printTopology             # Rack topology

# Safe Mode
hdfs dfsadmin -safemode get              # Check Safe Mode status
hdfs dfsadmin -safemode leave            # Leave Safe Mode

# Block management
hdfs fsck / -files -blocks -locations    # File system integrity check
hdfs fsck /user/hadoop/ -files           # Check specific path

# Snapshots
hdfs dfsadmin -allowSnapshot /user/hadoop/important
hdfs dfs -createSnapshot /user/hadoop/important snap_20260308
hdfs dfs -deleteSnapshot /user/hadoop/important snap_20260308

# Quota management
hdfs dfsadmin -setSpaceQuota 100G /user/hadoop/project
hdfs dfsadmin -clrSpaceQuota /user/hadoop/project

7. Performance Tuning

HDFS Tuning

<!-- hdfs-site.xml -->
<configuration>
    <!-- Block size (increase to 256MB if large files are common) -->
    <property>
        <name>dfs.blocksize</name>
        <value>268435456</value>  <!-- 256MB -->
    </property>

    <!-- DataNode concurrent transfer threads (default 4096) -->
    <property>
        <name>dfs.datanode.max.transfer.threads</name>
        <value>8192</value>
    </property>

    <!-- NameNode handler threads (default 10) -->
    <property>
        <name>dfs.namenode.handler.count</name>
        <value>100</value>  <!-- ln(cluster node count) * 20 -->
    </property>

    <!-- Short-circuit local read -->
    <property>
        <name>dfs.client.read.shortcircuit</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.domain.socket.path</name>
        <value>/var/lib/hadoop-hdfs/dn_socket</value>
    </property>
</configuration>
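The handler-count formula in the comment above (20 × ln of the cluster node count, a widely cited rule of thumb rather than an official limit) is simple to compute per cluster size:

```python
import math

def recommended_handler_count(num_datanodes: int) -> int:
    """Rule of thumb: dfs.namenode.handler.count ~ 20 * ln(cluster size),
    never below the default of 10."""
    return max(10, int(math.log(num_datanodes) * 20))

for n in (20, 50, 200):
    print(n, recommended_handler_count(n))  # 20 -> 59, 50 -> 78, 200 -> 105
```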

MapReduce Tuning

<!-- mapred-site.xml -->
<configuration>
    <!-- Map task memory -->
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>2048</value>
    </property>
    <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx1638m</value>  <!-- 80% of memory.mb -->
    </property>

    <!-- Reduce task memory -->
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>4096</value>
    </property>
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx3276m</value>
    </property>

    <!-- Shuffle buffer -->
    <property>
        <name>mapreduce.task.io.sort.mb</name>
        <value>512</value>  <!-- Map output sort buffer -->
    </property>
    <property>
        <name>mapreduce.reduce.shuffle.input.buffer.percent</name>
        <value>0.70</value>
    </property>

    <!-- Compression -->
    <property>
        <name>mapreduce.map.output.compress</name>
        <value>true</value>
    </property>
    <property>
        <name>mapreduce.map.output.compress.codec</name>
        <value>org.apache.hadoop.io.compress.SnappyCodec</value>
    </property>
</configuration>
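The `-Xmx` values above follow the usual 80%-of-container rule, leaving headroom for JVM overhead (metaspace, thread stacks, direct buffers) so the container is not killed for exceeding its memory limit. A small helper to keep the two settings consistent:

```python
def java_opts_for_container(container_mb: int, heap_fraction: float = 0.8) -> str:
    """Derive -Xmx from the container size, leaving ~20% for non-heap memory."""
    return f"-Xmx{int(container_mb * heap_fraction)}m"

print(java_opts_for_container(2048))  # -Xmx1638m (matches the map setting above)
print(java_opts_for_container(4096))  # -Xmx3276m (matches the reduce setting above)
```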

Tuning Summary Table

| Item | Default | Recommended | Impact |
|------|---------|-------------|--------|
| dfs.blocksize | 128MB | 128~256MB | Large files → 256MB |
| dfs.replication | 3 | 2~3 | Storage space vs reliability |
| dfs.namenode.handler.count | 10 | 20~200 | NameNode RPC throughput |
| mapreduce.map.memory.mb | 1024 | 1024~4096 | Map task memory |
| mapreduce.reduce.memory.mb | 1024 | 2048~8192 | Reduce task memory |
| mapreduce.task.io.sort.mb | 100 | 256~512 | Map sort performance |
| yarn.nodemanager.resource.memory-mb | 8192 | 80% of physical memory | Total allocatable node memory |

8. Spark vs MapReduce Comparison

| Item | MapReduce | Spark |
|------|-----------|-------|
| Processing Method | Disk-based batch | In-memory batch/streaming |
| Speed | Baseline (1x) | 10~100x faster |
| Iterative Processing | Disk I/O every time | Cached in memory |
| Programming Model | Map → Reduce | RDD, DataFrame, SQL |
| Languages | Java | Scala, Python, Java, R, SQL |
| Real-time Processing | Not supported | Structured Streaming |
| Memory Requirements | Low | High |
| Fault Recovery | Recompute from disk | Recompute via Lineage |
| Suitable Workloads | Simple ETL, log processing | ML, graph, iterative, interactive analysis |
# Spark WordCount (for comparison)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
text_file = spark.sparkContext.textFile("hdfs:///user/hadoop/input/")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs:///user/hadoop/output/spark-wordcount")

9. Monitoring

Key Monitoring Metrics

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| NameNode Heap Usage | NN memory usage | > 80% |
| Under-Replicated Blocks | Count of under-replicated blocks | > 0 (sustained) |
| Dead DataNodes | Failed DN count | > 0 |
| HDFS Capacity Used | Disk utilization | > 80% |
| YARN Memory Used | Memory utilization | > 85% |
| Pending Containers | Waiting containers | > 100 (sustained) |
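NameNode heap can be read from its HTTP /jmx endpoint (port 9870). A hedged sketch that parses a sample response shaped like that endpoint's output and applies the 80% threshold from the table; the bean name is an assumption to verify against your cluster (e.g. `http://namenode:9870/jmx?qry=Hadoop:service=NameNode,name=JvmMetrics`):

```python
import json

# Sample payload shaped like a NameNode /jmx response; the bean name and
# field names (MemHeapUsedM / MemHeapMaxM, in MB) are assumptions to verify.
sample = json.loads("""
{"beans": [{"name": "Hadoop:service=NameNode,name=JvmMetrics",
            "MemHeapUsedM": 26400.0, "MemHeapMaxM": 31744.0}]}
""")

def heap_alert(jmx: dict, threshold: float = 0.80) -> bool:
    """True if NameNode heap utilization exceeds the alert threshold."""
    bean = next(b for b in jmx["beans"] if b["name"].endswith("JvmMetrics"))
    return bean["MemHeapUsedM"] / bean["MemHeapMaxM"] > threshold

print(heap_alert(sample))  # 26400 / 31744 ~ 0.83 -> True
```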

Web UI Ports

# NameNode UI: http://namenode:9870
# ResourceManager UI: http://resourcemanager:8088
# DataNode UI: http://datanode:9864
# NodeManager UI: http://nodemanager:8042
# MapReduce History: http://historyserver:19888

Prometheus + Grafana Integration

# prometheus.yml
scrape_configs:
  - job_name: 'hadoop-namenode'
    metrics_path: '/jmx'
    params:
      format: ['prometheus']
    static_configs:
      - targets: ['namenode:9870']

  - job_name: 'hadoop-datanode'
    metrics_path: '/jmx'
    params:
      format: ['prometheus']
    static_configs:
      - targets: ['datanode1:9864', 'datanode2:9864', 'datanode3:9864']

  - job_name: 'hadoop-resourcemanager'
    metrics_path: '/jmx'
    params:
      format: ['prometheus']
    static_configs:
      - targets: ['resourcemanager:8088']

10. Troubleshooting

NameNode Failure

# When NameNode won't leave Safe Mode
hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode leave

# When NameNode EditLog is corrupted
hdfs namenode -recover

# NameNode OOM
# → Increase heap size in hadoop-env.sh
# export HDFS_NAMENODE_OPTS="-Xmx32g -Xms32g"

DataNode Failure

# Check DataNode status
hdfs dfsadmin -report | grep -A5 "Dead datanodes"

# Verify blocks on a specific DataNode
hdfs fsck / -files -blocks -locations | grep "datanode1"

# DataNode disk full
# → Add new disk to dfs.datanode.data.dir
# → Restart DataNode

YARN Application Failure

# List running applications
yarn application -list

# Check application logs
yarn logs -applicationId application_1709875200_0001

# Force kill application
yarn application -kill application_1709875200_0001

# Check container logs
yarn logs -applicationId application_1709875200_0001 -containerId container_1709875200_0001_01_000001

11. Operations Checklists

Initial Build Checklist

  • Verify Java version (JDK 11 or above recommended)
  • Configure SSH key-based authentication (all nodes)
  • Verify NTP synchronization (all nodes)
  • OS tuning (ulimit, swappiness, disable THP)
  • JBOD disk configuration (do not use RAID)
  • HDFS HA configuration (3+ JournalNodes)
  • YARN HA configuration (2 ResourceManagers)
  • Verify network bandwidth (10Gbps+ recommended)

Daily Operations Checklist

  • Monitor NameNode heap usage
  • Check under-replicated blocks
  • Check for dead DataNodes
  • Check HDFS capacity utilization
  • Check YARN queue resource utilization
  • Check log disk usage
  • Check for long-running jobs

Regular Inspection Checklist

  • Run hdfs fsck / to verify file system integrity
  • Back up NameNode (fsimage + edits)
  • Plan security patch application
  • Plan disk replacements (check SMART)
  • Cluster balancing (hdfs balancer)
  • Log rotation and cleanup

Conclusion

Hadoop remains an important part of the big data ecosystem as foundational infrastructure. HDFS continues to serve as a large-scale data store, and YARN as the resource manager for various processing frameworks.

Key Takeaways:

  1. HDFS ensures reliability through block-level distributed storage + replication
  2. MapReduce is simple but slower than Spark due to heavy disk I/O
  3. YARN is a general-purpose resource manager for diverse frameworks
  4. Performance tuning focuses on block size, memory, and thread count adjustments
  5. Monitoring should focus on NameNode heap, under-replicated blocks, and YARN queues