- Introduction
- 1. Hadoop Ecosystem Overview
- 2. HDFS Architecture
- 3. MapReduce Principles and Execution Flow
- 4. YARN Resource Management
- 5. Cluster Installation
- 6. Essential HDFS Commands
- 7. Performance Tuning
- 8. Spark vs MapReduce Comparison
- 9. Monitoring
- 10. Troubleshooting
- 11. Operations Checklists
- Conclusion
Introduction
Since its debut in 2006, Hadoop has established itself as the standard for big data processing. Although next-generation frameworks like Spark and Flink have emerged, HDFS and YARN remain the core foundation of data infrastructure. As of 2026, many enterprises still operate Hadoop-based data lakes, and even in cloud environments, HDFS-compatible storage (S3, ADLS) is widely used.
This article provides an architecture-level understanding of Hadoop's core components and covers the configuration and tuning points needed for real-world operations.
1. Hadoop Ecosystem Overview
Core Components
┌─────────────────────────────────────────────────────────────┐
│ Hadoop Ecosystem │
├─────────────────────────────────────────────────────────────┤
│ [Hive] [Pig] [Spark] [HBase] [Presto] [Flink] │
│ Application / Processing Layer │
├─────────────────────────────────────────────────────────────┤
│ YARN │
│ Resource Management Layer │
├─────────────────────────────────────────────────────────────┤
│ HDFS │
│ Distributed Storage Layer │
└─────────────────────────────────────────────────────────────┘
| Component | Role | Version (as of 2026) |
|---|---|---|
| HDFS | Distributed file system | Hadoop 3.4.x |
| YARN | Resource management and scheduling | Hadoop 3.4.x |
| MapReduce | Batch processing framework | Hadoop 3.4.x |
| Hive | SQL-on-Hadoop | Hive 4.x |
| HBase | NoSQL database | HBase 2.6.x |
| Spark | Unified analytics engine | Spark 3.5.x / 4.x |
| ZooKeeper | Distributed coordination | ZooKeeper 3.9.x |
Hadoop's Position in 2026
- HDFS: Still the core storage for large-scale data lakes. Coexists with S3/ADLS
- YARN: Used as the resource manager for various frameworks including Spark and Flink
- MapReduce: Usage has declined for new development, but still necessary for legacy system maintenance
- Trends: Hadoop on Kubernetes, Ozone (next-gen storage), integration with Iceberg/Delta Lake
2. HDFS Architecture
Core Components
┌──────────────┐
│ NameNode │
│ (Master) │
│ - Metadata │
│ - Block map │
└──────┬───────┘
│
┌────────────┼────────────┐
│ │ │
┌───────▼──────┐ ┌──▼──────────┐ ┌▼─────────────┐
│ DataNode 1 │ │ DataNode 2 │ │ DataNode 3 │
│ Block A │ │ Block A │ │ Block B │
│ Block C │ │ Block B │ │ Block A │
└──────────────┘ └─────────────┘ └──────────────┘
Replication Factor = 3
NameNode
The NameNode is the master node that manages HDFS metadata.
# Information managed by NameNode
# 1. File/directory tree structure (Namespace)
# 2. Block list for each file
# 3. Which DataNode holds each block (Block Mapping)
# 4. File permissions, modification times, etc.
# NameNode memory usage estimate
# 1 file ≈ 150 bytes
# 1 block ≈ 150 bytes
# 100 million files → approximately 30GB heap memory needed
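The estimate above can be sketched as a quick back-of-the-envelope calculation. The 150-byte-per-object figure is a common rule of thumb, not a measured value:

```python
# Rough NameNode heap estimate: each file entry and each block entry
# costs roughly 150 bytes of NameNode heap (rule of thumb, not measured).
BYTES_PER_OBJECT = 150

def namenode_heap_gb(num_files: int, blocks_per_file: float = 1.0) -> float:
    """Estimate NameNode heap in GB for a given file count."""
    objects = num_files + num_files * blocks_per_file  # file entries + block entries
    return objects * BYTES_PER_OBJECT / 1024**3

# 100 million files with ~1 block each -> ~28 GB, in line with the ~30GB above
print(f"{namenode_heap_gb(100_000_000):.1f} GB")  # 27.9 GB
```

This also shows why many small files are a NameNode problem: heap scales with object count, not data volume.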
DataNode
# DataNode core behavior
# 1. Stores blocks (default 128MB)
# 2. Serves block read/write to clients
# 3. Sends heartbeats to NameNode (every 3 seconds)
# 4. Sends block reports (every 6 hours)
HDFS HA (High Availability)
<!-- hdfs-site.xml - HA configuration -->
<configuration>
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>namenode1:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>namenode2:8020</value>
</property>
<!-- JournalNode configuration (minimum 3) -->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
</property>
<!-- Automatic failover -->
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
</configuration>
Block Replication Strategy
# Default Rack-Aware replication strategy (Replication Factor = 3)
#
# 1st replica: Same node as client (or same rack)
# 2nd replica: Node in a different rack
# 3rd replica: Different node in the same rack as the 2nd
#
# ┌──────── Rack 1 ────────┐ ┌──────── Rack 2 ────────┐
# │ [DataNode1: Block A] │ │ [DataNode3: Block A] │
# │ [DataNode2] │ │ [DataNode4: Block A] │
# └────────────────────────┘ └────────────────────────┘
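The placement rules above can be sketched in a few lines. This is a simplified illustration of the default policy, not the actual `BlockPlacementPolicyDefault` implementation (which also weighs node load and available space):

```python
import random

def place_replicas(racks: dict, client_node: str) -> list:
    """Sketch of HDFS default rack-aware placement (replication factor 3).

    1st replica: the writer's node; 2nd: a node on a different rack;
    3rd: another node on the 2nd replica's rack.
    """
    node_rack = {n: r for r, nodes in racks.items() for n in nodes}
    first = client_node
    remote_racks = [r for r in racks if r != node_rack[first]]
    second = random.choice(racks[random.choice(remote_racks)])
    third = random.choice([n for n in racks[node_rack[second]] if n != second])
    return [first, second, third]

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas(racks, "dn1"))  # e.g. ['dn1', 'dn3', 'dn4']
```

Two replicas end up on one remote rack and one stays local: a rack failure never loses all copies, while two of the three writes avoid crossing racks.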
3. MapReduce Principles and Execution Flow
MapReduce Programming Model
Input → Split → Map → Shuffle & Sort → Reduce → Output
[File Split]
Split 1 → Mapper 1 → (key, value) pairs ─┐
Split 2 → Mapper 2 → (key, value) pairs ──┤── Shuffle & Sort
Split 3 → Mapper 3 → (key, value) pairs ─┘ │
├→ Reducer 1 → Output Part 1
└→ Reducer 2 → Output Part 2
WordCount Example (Java)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
import java.util.StringTokenizer;
public class WordCount {
// Mapper: Splits each line into words and outputs (word, 1)
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken().toLowerCase());
context.write(word, one);
}
}
}
// Reducer: Sums values for the same key
public static class IntSumReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class); // Local aggregation to reduce network traffic
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Job Execution Flow
1. Client submits Job
└→ Uploads JAR, configuration, InputSplit info to HDFS
2. ResourceManager allocates ApplicationMaster container
└→ NodeManager starts AM
3. ApplicationMaster requests Map tasks based on InputSplit count
└→ ResourceManager allocates containers
4. Map tasks execute
└→ Read InputSplit → call map() → store intermediate results on local disk
5. Shuffle & Sort
└→ Partition Map output → transfer to Reducers → sort by key
6. Reduce tasks execute
└→ Call reduce() → store final results on HDFS
7. ApplicationMaster notifies ResourceManager of Job completion
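The Map → Shuffle & Sort → Reduce flow of steps 4-6 can be simulated locally in a few lines of Python. This is a single-process sketch of the WordCount job above, not a distributed run:

```python
from itertools import groupby

def run_wordcount(lines):
    """Simulate the Map -> Shuffle & Sort -> Reduce flow in one process."""
    # Map: emit (word, 1) for every token
    mapped = [(w.lower(), 1) for line in lines for w in line.split()]
    # Shuffle & Sort: bring identical keys together (a sort stands in for
    # partitioning + network transfer + merge)
    mapped.sort(key=lambda kv: kv[0])
    # Reduce: sum values per key
    return {k: sum(v for _, v in group)
            for k, group in groupby(mapped, key=lambda kv: kv[0])}

print(run_wordcount(["Hello Hadoop", "hello world"]))
# {'hadoop': 1, 'hello': 2, 'world': 1}
```

In a real cluster the sort happens twice (map-side spill sort, reduce-side merge) and the grouped data moves over the network between steps 4 and 6.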
4. YARN Resource Management
YARN Architecture
┌─────────────────────────────────────────────────────┐
│ ResourceManager │
│ ┌──────────────┐ ┌────────────────────────┐ │
│ │ Scheduler │ │ ApplicationManager │ │
│ │ (Resource │ │ (App management) │ │
│ │ allocation)│ │ │ │
│ └──────────────┘ └────────────────────────┘ │
└──────────────────────┬──────────────────────────────┘
│
┌──────────────┼──────────────┐
│ │ │
┌───────▼──────┐ ┌─────▼────────┐ ┌──▼────────────┐
│ NodeManager 1│ │ NodeManager 2│ │ NodeManager 3 │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │Container │ │ │ │Container │ │ │ │Container │ │
│ │(AM) │ │ │ │(Map Task)│ │ │ │(Reduce) │ │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ │
│ │Container │ │ │ │Container │ │ │ │
│ │(Map Task)│ │ │ │(Map Task)│ │ │ │
│ └──────────┘ │ │ └──────────┘ │ │ │
└──────────────┘ └──────────────┘ └───────────────┘
YARN Scheduler Configuration
<!-- yarn-site.xml -->
<configuration>
<!-- Scheduler type: Capacity, Fair, FIFO -->
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
<!-- NodeManager resource configuration -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>65536</value> <!-- 64GB -->
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>16</value>
</property>
<!-- Container min/max memory -->
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>32768</value>
</property>
<!-- Container min/max vCores -->
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>8</value>
</property>
</configuration>
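One consequence of the min/max settings above: YARN normalizes each container request, rounding it up and capping it at the configured bounds. A minimal sketch under the assumption that the rounding increment equals the minimum allocation (the default behavior):

```python
import math

def normalize_request(requested_mb: int, min_mb: int = 1024, max_mb: int = 32768) -> int:
    """Sketch of YARN container memory normalization: round the request up
    to a multiple of the minimum allocation, then cap at the maximum."""
    rounded = math.ceil(requested_mb / min_mb) * min_mb
    return min(max(rounded, min_mb), max_mb)

print(normalize_request(3000))   # 3072 (rounded up to the next 1024MB step)
print(normalize_request(50000))  # 32768 (capped at the maximum)
```

So a task asking for 3000MB actually occupies 3072MB of queue capacity; sizing requests to multiples of the minimum avoids wasting the difference.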
Capacity Scheduler Queue Configuration
<!-- capacity-scheduler.xml -->
<configuration>
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>default,production,development</value>
</property>
<!-- Queue capacity (total must equal 100%) -->
<property>
<name>yarn.scheduler.capacity.root.default.capacity</name>
<value>20</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.production.capacity</name>
<value>60</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.development.capacity</name>
<value>20</value>
</property>
<!-- Maximum capacity (elasticity) -->
<property>
<name>yarn.scheduler.capacity.root.production.maximum-capacity</name>
<value>80</value>
</property>
<!-- Per-queue user limits -->
<property>
<name>yarn.scheduler.capacity.root.production.user-limit-factor</name>
<value>2</value>
</property>
</configuration>
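The interplay of capacity and maximum-capacity above can be made concrete with a quick calculation, using the hypothetical 10-node cluster sized per the yarn-site.xml example (64GB per NodeManager):

```python
def queue_capacity_mb(cluster_mb: int, capacity_pct: float, max_capacity_pct: float):
    """Guaranteed vs. elastic maximum capacity for a Capacity Scheduler queue."""
    guaranteed = cluster_mb * capacity_pct / 100     # always reserved for the queue
    elastic_max = cluster_mb * max_capacity_pct / 100  # usable when others are idle
    return guaranteed, elastic_max

# production queue: 60% guaranteed, elastic up to 80%, on 10 nodes x 64GB
g, m = queue_capacity_mb(10 * 65536, 60, 80)
print(f"guaranteed={g/1024:.0f}GB, max={m/1024:.0f}GB")  # guaranteed=384GB, max=512GB
```

The gap between 384GB and 512GB is what elasticity buys: production can borrow idle capacity from default/development, but is guaranteed only its 60% share under contention.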
5. Cluster Installation
Pseudo-Distributed Mode (for Development/Testing)
# 1. Install Java
sudo apt install -y openjdk-11-jdk
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
# 2. Download Hadoop
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz
tar -xzf hadoop-3.4.1.tar.gz
sudo mv hadoop-3.4.1 /opt/hadoop
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# 3. SSH key setup (localhost)
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
<!-- core-site.xml -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop/data/tmp</value>
</property>
</configuration>
<!-- hdfs-site.xml -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value> <!-- Set to 1 for pseudo-distributed mode -->
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hadoop/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hadoop/data/datanode</value>
</property>
</configuration>
# 4. Format NameNode and start
hdfs namenode -format
start-dfs.sh
start-yarn.sh
# 5. Verify
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/$USER
jps # Verify NameNode, DataNode, ResourceManager, NodeManager
Fully Distributed Mode (Production)
# Workers file configuration (/opt/hadoop/etc/hadoop/workers)
datanode1
datanode2
datanode3
datanode4
datanode5
<!-- core-site.xml (HA mode) -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
</configuration>
6. Essential HDFS Commands
File System Operations
# List files/directories
hdfs dfs -ls /user/hadoop/
hdfs dfs -ls -R /user/hadoop/ # Recursive listing
# Upload files
hdfs dfs -put localfile.txt /user/hadoop/
hdfs dfs -copyFromLocal data.csv /user/hadoop/input/
hdfs dfs -moveFromLocal temp.log /user/hadoop/logs/
# Download files
hdfs dfs -get /user/hadoop/output/part-r-00000 ./result.txt
hdfs dfs -copyToLocal /user/hadoop/data.csv ./
# View file contents
hdfs dfs -cat /user/hadoop/input/data.csv
hdfs dfs -head /user/hadoop/input/data.csv # First 1KB
hdfs dfs -tail /user/hadoop/input/data.csv # Last 1KB
hdfs dfs -text /user/hadoop/output/part-r-00000 # Read compressed files too
# Create/delete directories
hdfs dfs -mkdir -p /user/hadoop/input/2026/03
hdfs dfs -rm /user/hadoop/temp.txt
hdfs dfs -rm -r /user/hadoop/output/ # Delete directory
hdfs dfs -rm -skipTrash /user/hadoop/old/ # Skip trash
# Permission management
hdfs dfs -chmod 755 /user/hadoop/scripts/
hdfs dfs -chown hadoop:hadoop /user/hadoop/
hdfs dfs -chgrp analytics /data/shared/
# Disk usage
hdfs dfs -du -h /user/hadoop/ # Usage per directory
hdfs dfs -df -h # Total capacity
hdfs dfs -count -h /user/hadoop/ # File/directory count
Administrative Commands
# Cluster status
hdfs dfsadmin -report # Full cluster report
hdfs dfsadmin -printTopology # Rack topology
# Safe Mode
hdfs dfsadmin -safemode get # Check Safe Mode status
hdfs dfsadmin -safemode leave # Leave Safe Mode
# Block management
hdfs fsck / -files -blocks -locations # File system integrity check
hdfs fsck /user/hadoop/ -files # Check specific path
# Snapshots
hdfs dfsadmin -allowSnapshot /user/hadoop/important
hdfs dfs -createSnapshot /user/hadoop/important snap_20260308
hdfs dfs -deleteSnapshot /user/hadoop/important snap_20260308
# Quota management
hdfs dfsadmin -setSpaceQuota 100G /user/hadoop/project
hdfs dfsadmin -clrSpaceQuota /user/hadoop/project
7. Performance Tuning
HDFS Tuning
<!-- hdfs-site.xml -->
<configuration>
<!-- Block size (increase to 256MB if large files are common) -->
<property>
<name>dfs.blocksize</name>
<value>268435456</value> <!-- 256MB -->
</property>
<!-- DataNode concurrent transfer threads (default 4096) -->
<property>
<name>dfs.datanode.max.transfer.threads</name>
<value>8192</value>
</property>
<!-- NameNode handler threads (default 10) -->
<property>
<name>dfs.namenode.handler.count</name>
<value>100</value> <!-- ln(cluster node count) * 20 -->
</property>
<!-- Short-circuit local read -->
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>dfs.domain.socket.path</name>
<value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
</configuration>
MapReduce Tuning
<!-- mapred-site.xml -->
<configuration>
<!-- Map task memory -->
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1638m</value> <!-- 80% of memory.mb -->
</property>
<!-- Reduce task memory -->
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx3276m</value>
</property>
<!-- Shuffle buffer -->
<property>
<name>mapreduce.task.io.sort.mb</name>
<value>512</value> <!-- Map output sort buffer -->
</property>
<property>
<name>mapreduce.reduce.shuffle.input.buffer.percent</name>
<value>0.70</value>
</property>
<!-- Compression -->
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
</configuration>
Tuning Summary Table
| Item | Default | Recommended | Impact |
|---|---|---|---|
| dfs.blocksize | 128MB | 128~256MB | Large files → 256MB |
| dfs.replication | 3 | 2~3 | Storage space vs reliability |
| dfs.namenode.handler.count | 10 | 20~200 | NameNode RPC throughput |
| mapreduce.map.memory.mb | 1024 | 1024~4096 | Map task memory |
| mapreduce.reduce.memory.mb | 1024 | 2048~8192 | Reduce task memory |
| mapreduce.task.io.sort.mb | 100 | 256~512 | Map sort performance |
| yarn.nodemanager.resource.memory-mb | 8192 | 80% of physical memory | Total allocatable node memory |
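These settings interact: node resources divided by per-task memory determine how many containers run concurrently, and the 80% java.opts rule from mapred-site.xml follows from container memory. A small sizing sketch using the example values above:

```python
def containers_per_node(node_mb: int, node_vcores: int,
                        task_mb: int, task_vcores: int = 1) -> int:
    """How many containers fit on one NodeManager: whichever resource
    (memory or vcores) runs out first is the limit."""
    return min(node_mb // task_mb, node_vcores // task_vcores)

def heap_opts_mb(container_mb: int, ratio: float = 0.8) -> int:
    """-Xmx for a container, leaving ~20% headroom for non-heap overhead."""
    return int(container_mb * ratio)

# 64GB / 16-vcore node with 2GB map tasks: memory allows 32, vcores cap it at 16
print(containers_per_node(65536, 16, 2048))  # 16
print(heap_opts_mb(2048))                    # 1638 -> matches -Xmx1638m above
```

If memory allows far more containers than vcores (or vice versa), one resource sits idle; balancing task_mb against node_vcores recovers that capacity.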
8. Spark vs MapReduce Comparison
| Item | MapReduce | Spark |
|---|---|---|
| Processing Method | Disk-based batch | In-memory batch/streaming |
| Speed | Baseline (1x) | 10~100x faster |
| Iterative Processing | Disk I/O every time | Cached in memory |
| Programming Model | Map → Reduce | RDD, DataFrame, SQL |
| Languages | Java (others via Hadoop Streaming) | Scala, Python, Java, R, SQL |
| Real-time Processing | Not supported | Structured Streaming |
| Memory Requirements | Low | High |
| Fault Recovery | Recompute from disk | Recompute via Lineage |
| Suitable Workloads | Simple ETL, log processing | ML, graph, iterative, interactive analysis |
# Spark WordCount (for comparison)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("WordCount").getOrCreate()
text_file = spark.sparkContext.textFile("hdfs:///user/hadoop/input/")
counts = text_file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs:///user/hadoop/output/spark-wordcount")
9. Monitoring
Key Monitoring Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| NameNode Heap Usage | NN memory usage | > 80% |
| Under-Replicated Blocks | Count of under-replicated blocks | > 0 (sustained) |
| Dead DataNodes | Failed DN count | > 0 |
| HDFS Capacity Used | Disk utilization | > 80% |
| YARN Memory Used | Memory utilization | > 85% |
| Pending Containers | Waiting containers | > 100 (sustained) |
Web UI Ports
# NameNode UI: http://namenode:9870
# ResourceManager UI: http://resourcemanager:8088
# DataNode UI: http://datanode:9864
# NodeManager UI: http://nodemanager:8042
# MapReduce History: http://historyserver:19888
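The same ports expose metrics as JSON via the `/jmx` endpoint, which makes ad-hoc health checks easy to script. A minimal sketch; the bean and attribute names follow the common `FSNamesystem` layout but should be verified against your cluster's actual `/jmx` output:

```python
import json
from urllib.request import urlopen  # for a live cluster; unused in the offline sample

def jmx_metric(beans_json: str, bean_substr: str, attr: str):
    """Pull one attribute out of a Hadoop /jmx response (JSON)."""
    for bean in json.loads(beans_json)["beans"]:
        if bean_substr in bean.get("name", ""):
            return bean.get(attr)
    return None

# In a live cluster the JSON would come from the NameNode UI port, e.g.:
#   beans_json = urlopen("http://namenode:9870/jmx").read()
# Here, a minimal sample response for illustration:
sample = '{"beans":[{"name":"Hadoop:service=NameNode,name=FSNamesystem","UnderReplicatedBlocks":0}]}'
print(jmx_metric(sample, "FSNamesystem", "UnderReplicatedBlocks"))  # 0
```

A cron job asserting this value stays at 0 is a cheap complement to the Prometheus setup below.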
Prometheus + Grafana Integration
# prometheus.yml
scrape_configs:
- job_name: 'hadoop-namenode'
metrics_path: '/jmx'
params:
format: ['prometheus']
static_configs:
- targets: ['namenode:9870']
- job_name: 'hadoop-datanode'
metrics_path: '/jmx'
params:
format: ['prometheus']
static_configs:
- targets: ['datanode1:9864', 'datanode2:9864', 'datanode3:9864']
- job_name: 'hadoop-resourcemanager'
metrics_path: '/jmx'
params:
format: ['prometheus']
static_configs:
- targets: ['resourcemanager:8088']
10. Troubleshooting
NameNode Failure
# When NameNode won't leave Safe Mode
hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode leave
# When NameNode EditLog is corrupted
hdfs namenode -recover
# NameNode OOM
# → Increase heap size in hadoop-env.sh
# export HDFS_NAMENODE_OPTS="-Xmx32g -Xms32g"
DataNode Failure
# Check DataNode status
hdfs dfsadmin -report | grep -A5 "Dead datanodes"
# Verify blocks on a specific DataNode
hdfs fsck / -files -blocks -locations | grep "datanode1"
# DataNode disk full
# → Add new disk to dfs.datanode.data.dir
# → Restart DataNode
YARN Application Failure
# List running applications
yarn application -list
# Check application logs
yarn logs -applicationId application_1709875200_0001
# Force kill application
yarn application -kill application_1709875200_0001
# Check container logs
yarn logs -applicationId application_1709875200_0001 -containerId container_1709875200_0001_01_000001
11. Operations Checklists
Initial Build Checklist
- Verify Java version (JDK 11 or above recommended)
- Configure SSH key-based authentication (all nodes)
- Verify NTP synchronization (all nodes)
- OS tuning (ulimit, swappiness, disable THP)
- JBOD disk configuration (do not use RAID)
- HDFS HA configuration (3+ JournalNodes)
- YARN HA configuration (2 ResourceManagers)
- Verify network bandwidth (10Gbps+ recommended)
Daily Operations Checklist
- Monitor NameNode heap usage
- Check under-replicated blocks
- Check for dead DataNodes
- Check HDFS capacity utilization
- Check YARN queue resource utilization
- Check log disk usage
- Check for long-running jobs
Regular Inspection Checklist
- Run `hdfs fsck /` to verify file system integrity
- Back up NameNode metadata (fsimage + edits)
- Plan security patch application
- Plan disk replacements (check SMART)
- Run cluster balancing (`hdfs balancer`)
- Log rotation and cleanup
Conclusion
Hadoop remains an important part of the big data ecosystem as foundational infrastructure. HDFS continues to serve as a large-scale data store, and YARN as the resource manager for various processing frameworks.
Key Takeaways:
- HDFS ensures reliability through block-level distributed storage + replication
- MapReduce is simple but slower than Spark due to heavy disk I/O
- YARN is a general-purpose resource manager for diverse frameworks
- Performance tuning focuses on block size, memory, and thread count adjustments
- Monitoring should focus on NameNode heap, under-replicated blocks, and YARN queues