- Introduction
- 1. Hadoop Ecosystem Overview
- 2. HDFS Architecture
- 3. MapReduce Principles and Execution Flow
- 4. YARN Resource Management
- 5. Cluster Installation
- 6. Essential HDFS Commands
- 7. Performance Tuning
- 8. Spark vs MapReduce Comparison
- 9. Monitoring
- 10. Troubleshooting
- 11. Operations Checklists
- Conclusion
Introduction
Since its debut in 2006, Hadoop has established itself as the standard for big data processing. Although next-generation frameworks like Spark and Flink have emerged, HDFS and YARN remain the core foundation of data infrastructure. As of 2026, many enterprises still operate Hadoop-based data lakes, and even in cloud environments, HDFS-compatible storage (S3, ADLS) is widely used.
This article provides an architecture-level understanding of Hadoop's core components and covers the configuration and tuning points needed for real-world operations.
1. Hadoop Ecosystem Overview
Core Components
┌─────────────────────────────────────────────────────────────┐
│ Hadoop Ecosystem │
├─────────────────────────────────────────────────────────────┤
│ [Hive] [Pig] [Spark] [HBase] [Presto] [Flink] │
│ Application / Processing Layer │
├─────────────────────────────────────────────────────────────┤
│ YARN │
│ Resource Management Layer │
├─────────────────────────────────────────────────────────────┤
│ HDFS │
│ Distributed Storage Layer │
└─────────────────────────────────────────────────────────────┘
| Component | Role | Version (as of 2026) |
|---|---|---|
| HDFS | Distributed file system | Hadoop 3.4.x |
| YARN | Resource management and scheduling | Hadoop 3.4.x |
| MapReduce | Batch processing framework | Hadoop 3.4.x |
| Hive | SQL-on-Hadoop | Hive 4.x |
| HBase | NoSQL database | HBase 2.6.x |
| Spark | Unified analytics engine | Spark 3.5.x / 4.x |
| ZooKeeper | Distributed coordination | ZooKeeper 3.9.x |
Hadoop's Position in 2026
- HDFS: Still the core storage for large-scale data lakes. Coexists with S3/ADLS
- YARN: Used as the resource manager for various frameworks including Spark and Flink
- MapReduce: Usage has declined for new development, but still necessary for legacy system maintenance
- Trends: Hadoop on Kubernetes, Ozone (next-gen storage), integration with Iceberg/Delta Lake
2. HDFS Architecture
Core Components
┌──────────────┐
│ NameNode │
│ (Master) │
│ - Metadata │
│ - Block map │
└──────┬───────┘
│
┌────────────┼────────────┐
│ │ │
┌───────▼──────┐ ┌──▼──────────┐ ┌▼─────────────┐
│ DataNode 1 │ │ DataNode 2 │ │ DataNode 3 │
│ Block A │ │ Block A │ │ Block B │
│ Block C │ │ Block B │ │ Block A │
└──────────────┘ └─────────────┘ └──────────────┘
Replication Factor = 3
NameNode
The NameNode is the master node that manages HDFS metadata.
# Information managed by NameNode
# 1. File/directory tree structure (Namespace)
# 2. Block list for each file
# 3. Which DataNode holds each block (Block Mapping)
# 4. File permissions, modification times, etc.
# NameNode memory usage estimate
# 1 file ≈ 150 bytes
# 1 block ≈ 150 bytes
# 100 million files → approximately 30GB heap memory needed
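The estimate above can be sketched as a quick back-of-the-envelope calculation. The 150-byte-per-object figure is a common rule of thumb, not a measured value:

```python
# Rough NameNode heap estimate: each file entry and each block entry
# costs roughly 150 bytes of NameNode heap (rule of thumb, not measured).
BYTES_PER_OBJECT = 150

def namenode_heap_gb(num_files: int, blocks_per_file: float = 1.0) -> float:
    """Estimate NameNode heap in GB for a given file count."""
    objects = num_files + num_files * blocks_per_file  # file entries + block entries
    return objects * BYTES_PER_OBJECT / 1024**3

# 100 million files with ~1 block each -> ~28 GB, in line with the ~30GB above
print(f"{namenode_heap_gb(100_000_000):.1f} GB")  # 27.9 GB
```

This also shows why many small files are a NameNode problem: heap scales with object count, not data volume.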
DataNode
# DataNode core behavior
# 1. Stores blocks (default 128MB)
# 2. Serves block read/write to clients
# 3. Sends heartbeats to NameNode (every 3 seconds)
# 4. Sends block reports (every 6 hours)
HDFS HA (High Availability)
<!-- hdfs-site.xml - HA configuration -->
<configuration>
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>namenode1:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>namenode2:8020</value>
</property>
<!-- JournalNode configuration (minimum 3) -->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
</property>
<!-- Automatic failover -->
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
</configuration>
Block Replication Strategy
# Default Rack-Aware replication strategy (Replication Factor = 3)
#
# 1st replica: Same node as client (or same rack)
# 2nd replica: Node in a different rack
# 3rd replica: Different node in the same rack as the 2nd
#
# ┌──────── Rack 1 ────────┐ ┌──────── Rack 2 ────────┐
# │ [DataNode1: Block A] │ │ [DataNode3: Block A] │
# │ [DataNode2] │ │ [DataNode4: Block A] │
# └────────────────────────┘ └────────────────────────┘
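The placement rules above can be sketched in a few lines. This is a simplified illustration of the default policy, not the actual `BlockPlacementPolicyDefault` implementation (which also weighs node load and available space):

```python
import random

def place_replicas(racks: dict, client_node: str) -> list:
    """Sketch of HDFS default rack-aware placement (replication factor 3).

    1st replica: the writer's node; 2nd: a node on a different rack;
    3rd: another node on the 2nd replica's rack.
    """
    node_rack = {n: r for r, nodes in racks.items() for n in nodes}
    first = client_node
    remote_racks = [r for r in racks if r != node_rack[first]]
    second = random.choice(racks[random.choice(remote_racks)])
    third = random.choice([n for n in racks[node_rack[second]] if n != second])
    return [first, second, third]

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas(racks, "dn1"))  # e.g. ['dn1', 'dn3', 'dn4']
```

Two replicas end up on one remote rack and one stays local: a rack failure never loses all copies, while two of the three writes avoid crossing racks.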
3. MapReduce Principles and Execution Flow
MapReduce Programming Model
Input → Split → Map → Shuffle & Sort → Reduce → Output
[File Split]
Split 1 → Mapper 1 → (key, value) pairs ─┐
Split 2 → Mapper 2 → (key, value) pairs ──┤── Shuffle & Sort
Split 3 → Mapper 3 → (key, value) pairs ─┘ │
├→ Reducer 1 → Output Part 1
└→ Reducer 2 → Output Part 2
WordCount Example (Java)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
import java.util.StringTokenizer;
public class WordCount {
// Mapper: Splits each line into words and outputs (word, 1)
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken().toLowerCase());
context.write(word, one);
}
}
}
// Reducer: Sums values for the same key
public static class IntSumReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class); // Local aggregation to reduce network traffic
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Job Execution Flow
1. Client submits Job
└→ Uploads JAR, configuration, InputSplit info to HDFS
2. ResourceManager allocates ApplicationMaster container
└→ NodeManager starts AM
3. ApplicationMaster requests Map tasks based on InputSplit count
└→ ResourceManager allocates containers
4. Map tasks execute
└→ Read InputSplit → call map() → store intermediate results on local disk
5. Shuffle & Sort
└→ Partition Map output → transfer to Reducers → sort by key
6. Reduce tasks execute
└→ Call reduce() → store final results on HDFS
7. ApplicationMaster notifies ResourceManager of Job completion
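The Map → Shuffle & Sort → Reduce flow of steps 4-6 can be simulated locally in a few lines of Python. This is a single-process sketch of the WordCount job above, not a distributed run:

```python
from itertools import groupby

def run_wordcount(lines):
    """Simulate the Map -> Shuffle & Sort -> Reduce flow in one process."""
    # Map: emit (word, 1) for every token
    mapped = [(w.lower(), 1) for line in lines for w in line.split()]
    # Shuffle & Sort: bring identical keys together (a sort stands in for
    # partitioning + network transfer + merge)
    mapped.sort(key=lambda kv: kv[0])
    # Reduce: sum values per key
    return {k: sum(v for _, v in group)
            for k, group in groupby(mapped, key=lambda kv: kv[0])}

print(run_wordcount(["Hello Hadoop", "hello world"]))
# {'hadoop': 1, 'hello': 2, 'world': 1}
```

In a real cluster the sort happens twice (map-side spill sort, reduce-side merge) and the grouped data moves over the network between steps 4 and 6.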
4. YARN Resource Management
YARN Architecture
┌─────────────────────────────────────────────────────┐
│ ResourceManager │
│ ┌──────────────┐ ┌────────────────────────┐ │
│ │ Scheduler │ │ ApplicationManager │ │
│ │ (Resource │ │ (App management) │ │
│ │ allocation)│ │ │ │
│ └──────────────┘ └────────────────────────┘ │
└──────────────────────┬──────────────────────────────┘
│
┌──────────────┼──────────────┐
│ │ │
┌───────▼──────┐ ┌─────▼────────┐ ┌──▼────────────┐
│ NodeManager 1│ │ NodeManager 2│ │ NodeManager 3 │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │Container │ │ │ │Container │ │ │ │Container │ │
│ │(AM) │ │ │ │(Map Task)│ │ │ │(Reduce) │ │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ │
│ │Container │ │ │ │Container │ │ │ │
│ │(Map Task)│ │ │ │(Map Task)│ │ │ │
│ └──────────┘ │ │ └──────────┘ │ │ │
└──────────────┘ └──────────────┘ └───────────────┘
YARN Scheduler Configuration
<!-- yarn-site.xml -->
<configuration>
<!-- Scheduler type: Capacity, Fair, FIFO -->
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
<!-- NodeManager resource configuration -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>65536</value> <!-- 64GB -->
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>16</value>
</property>
<!-- Container min/max memory -->
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>32768</value>
</property>
<!-- Container min/max vCores -->
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>8</value>
</property>
</configuration>
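One consequence of the min/max settings above: YARN normalizes each container request, rounding it up and capping it at the configured bounds. A minimal sketch under the assumption that the rounding increment equals the minimum allocation (the default behavior):

```python
import math

def normalize_request(requested_mb: int, min_mb: int = 1024, max_mb: int = 32768) -> int:
    """Sketch of YARN container memory normalization: round the request up
    to a multiple of the minimum allocation, then cap at the maximum."""
    rounded = math.ceil(requested_mb / min_mb) * min_mb
    return min(max(rounded, min_mb), max_mb)

print(normalize_request(3000))   # 3072 (rounded up to the next 1024MB step)
print(normalize_request(50000))  # 32768 (capped at the maximum)
```

So a task asking for 3000MB actually occupies 3072MB of queue capacity; sizing requests to multiples of the minimum avoids wasting the difference.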
Capacity Scheduler Queue Configuration
<!-- capacity-scheduler.xml -->
<configuration>
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>default,production,development</value>
</property>
<!-- Queue capacity (total must equal 100%) -->
<property>
<name>yarn.scheduler.capacity.root.default.capacity</name>
<value>20</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.production.capacity</name>
<value>60</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.development.capacity</name>
<value>20</value>
</property>
<!-- Maximum capacity (elasticity) -->
<property>
<name>yarn.scheduler.capacity.root.production.maximum-capacity</name>
<value>80</value>
</property>
<!-- Per-queue user limits -->
<property>
<name>yarn.scheduler.capacity.root.production.user-limit-factor</name>
<value>2</value>
</property>
</configuration>
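The interplay of capacity and maximum-capacity above can be made concrete with a quick calculation, using the hypothetical 10-node cluster sized per the yarn-site.xml example (64GB per NodeManager):

```python
def queue_capacity_mb(cluster_mb: int, capacity_pct: float, max_capacity_pct: float):
    """Guaranteed vs. elastic maximum capacity for a Capacity Scheduler queue."""
    guaranteed = cluster_mb * capacity_pct / 100     # always reserved for the queue
    elastic_max = cluster_mb * max_capacity_pct / 100  # usable when others are idle
    return guaranteed, elastic_max

# production queue: 60% guaranteed, elastic up to 80%, on 10 nodes x 64GB
g, m = queue_capacity_mb(10 * 65536, 60, 80)
print(f"guaranteed={g/1024:.0f}GB, max={m/1024:.0f}GB")  # guaranteed=384GB, max=512GB
```

The gap between 384GB and 512GB is what elasticity buys: production can borrow idle capacity from default/development, but is guaranteed only its 60% share under contention.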
5. Cluster Installation
Pseudo-Distributed Mode (for Development/Testing)
# 1. Install Java
sudo apt install -y openjdk-11-jdk
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
# 2. Download Hadoop
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz
tar -xzf hadoop-3.4.1.tar.gz
sudo mv hadoop-3.4.1 /opt/hadoop
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# 3. SSH key setup (localhost)
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
<!-- core-site.xml -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop/data/tmp</value>
</property>
</configuration>
<!-- hdfs-site.xml -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value> <!-- Set to 1 for pseudo-distributed mode -->
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hadoop/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hadoop/data/datanode</value>
</property>
</configuration>
# 4. Format NameNode and start
hdfs namenode -format
start-dfs.sh
start-yarn.sh
# 5. Verify
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/$USER
jps # Verify NameNode, DataNode, ResourceManager, NodeManager
Fully Distributed Mode (Production)
# Workers file configuration (/opt/hadoop/etc/hadoop/workers)
datanode1
datanode2
datanode3
datanode4
datanode5
<!-- core-site.xml (HA mode) -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
</configuration>
6. Essential HDFS Commands
File System Operations
# List files/directories
hdfs dfs -ls /user/hadoop/
hdfs dfs -ls -R /user/hadoop/ # Recursive listing
# Upload files
hdfs dfs -put localfile.txt /user/hadoop/
hdfs dfs -copyFromLocal data.csv /user/hadoop/input/
hdfs dfs -moveFromLocal temp.log /user/hadoop/logs/
# Download files
hdfs dfs -get /user/hadoop/output/part-r-00000 ./result.txt
hdfs dfs -copyToLocal /user/hadoop/data.csv ./
# View file contents
hdfs dfs -cat /user/hadoop/input/data.csv
hdfs dfs -head /user/hadoop/input/data.csv # First 1KB
hdfs dfs -tail /user/hadoop/input/data.csv # Last 1KB
hdfs dfs -text /user/hadoop/output/part-r-00000 # Read compressed files too
# Create/delete directories
hdfs dfs -mkdir -p /user/hadoop/input/2026/03
hdfs dfs -rm /user/hadoop/temp.txt
hdfs dfs -rm -r /user/hadoop/output/ # Delete directory
hdfs dfs -rm -skipTrash /user/hadoop/old/ # Skip trash
# Permission management
hdfs dfs -chmod 755 /user/hadoop/scripts/
hdfs dfs -chown hadoop:hadoop /user/hadoop/
hdfs dfs -chgrp analytics /data/shared/
# Disk usage
hdfs dfs -du -h /user/hadoop/ # Usage per directory
hdfs dfs -df -h # Total capacity
hdfs dfs -count -h /user/hadoop/ # File/directory count
Administrative Commands
# Cluster status
hdfs dfsadmin -report # Full cluster report
hdfs dfsadmin -printTopology # Rack topology
# Safe Mode
hdfs dfsadmin -safemode get # Check Safe Mode status
hdfs dfsadmin -safemode leave # Leave Safe Mode
# Block management
hdfs fsck / -files -blocks -locations # File system integrity check
hdfs fsck /user/hadoop/ -files # Check specific path
# Snapshots
hdfs dfsadmin -allowSnapshot /user/hadoop/important
hdfs dfs -createSnapshot /user/hadoop/important snap_20260308
hdfs dfs -deleteSnapshot /user/hadoop/important snap_20260308
# Quota management
hdfs dfsadmin -setSpaceQuota 100G /user/hadoop/project
hdfs dfsadmin -clrSpaceQuota /user/hadoop/project
7. Performance Tuning
HDFS Tuning
<!-- hdfs-site.xml -->
<configuration>
<!-- Block size (increase to 256MB if large files are common) -->
<property>
<name>dfs.blocksize</name>
<value>268435456</value> <!-- 256MB -->
</property>
<!-- DataNode concurrent transfer threads (default 4096) -->
<property>
<name>dfs.datanode.max.transfer.threads</name>
<value>8192</value>
</property>
<!-- NameNode handler threads (default 10) -->
<property>
<name>dfs.namenode.handler.count</name>
<value>100</value> <!-- ln(cluster node count) * 20 -->
</property>
<!-- Short-circuit local read -->
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>dfs.domain.socket.path</name>
<value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
</configuration>
MapReduce Tuning
<!-- mapred-site.xml -->
<configuration>
<!-- Map task memory -->
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1638m</value> <!-- 80% of memory.mb -->
</property>
<!-- Reduce task memory -->
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx3276m</value>
</property>
<!-- Shuffle buffer -->
<property>
<name>mapreduce.task.io.sort.mb</name>
<value>512</value> <!-- Map output sort buffer -->
</property>
<property>
<name>mapreduce.reduce.shuffle.input.buffer.percent</name>
<value>0.70</value>
</property>
<!-- Compression -->
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
</configuration>
Tuning Summary Table
| Item | Default | Recommended | Impact |
|---|---|---|---|
| dfs.blocksize | 128MB | 128~256MB | Large files → 256MB |
| dfs.replication | 3 | 2~3 | Storage space vs reliability |
| dfs.namenode.handler.count | 10 | 20~200 | NameNode RPC throughput |
| mapreduce.map.memory.mb | 1024 | 1024~4096 | Map task memory |
| mapreduce.reduce.memory.mb | 1024 | 2048~8192 | Reduce task memory |
| mapreduce.task.io.sort.mb | 100 | 256~512 | Map sort performance |
| yarn.nodemanager.resource.memory-mb | 8192 | 80% of physical memory | Total allocatable node memory |
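These settings interact: node resources divided by per-task memory determine how many containers run concurrently, and the 80% java.opts rule from mapred-site.xml follows from container memory. A small sizing sketch using the example values above:

```python
def containers_per_node(node_mb: int, node_vcores: int,
                        task_mb: int, task_vcores: int = 1) -> int:
    """How many containers fit on one NodeManager: whichever resource
    (memory or vcores) runs out first is the limit."""
    return min(node_mb // task_mb, node_vcores // task_vcores)

def heap_opts_mb(container_mb: int, ratio: float = 0.8) -> int:
    """-Xmx for a container, leaving ~20% headroom for non-heap overhead."""
    return int(container_mb * ratio)

# 64GB / 16-vcore node with 2GB map tasks: memory allows 32, vcores cap it at 16
print(containers_per_node(65536, 16, 2048))  # 16
print(heap_opts_mb(2048))                    # 1638 -> matches -Xmx1638m above
```

If memory allows far more containers than vcores (or vice versa), one resource sits idle; balancing task_mb against node_vcores recovers that capacity.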
8. Spark vs MapReduce Comparison
| Item | MapReduce | Spark |
|---|---|---|
| Processing Method | Disk-based batch | In-memory batch/streaming |
| Speed | Baseline (1x) | 10~100x faster |
| Iterative Processing | Disk I/O every time | Cached in memory |
| Programming Model | Map → Reduce | RDD, DataFrame, SQL |
| Languages | Java (others via Hadoop Streaming) | Scala, Python, Java, R, SQL |
| Real-time Processing | Not supported | Structured Streaming |
| Memory Requirements | Low | High |
| Fault Recovery | Recompute from disk | Recompute via Lineage |
| Suitable Workloads | Simple ETL, log processing | ML, graph, iterative, interactive analysis |
# Spark WordCount (for comparison)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("WordCount").getOrCreate()
text_file = spark.sparkContext.textFile("hdfs:///user/hadoop/input/")
counts = text_file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs:///user/hadoop/output/spark-wordcount")
9. Monitoring
Key Monitoring Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| NameNode Heap Usage | NN memory usage | > 80% |
| Under-Replicated Blocks | Count of under-replicated blocks | > 0 (sustained) |
| Dead DataNodes | Failed DN count | > 0 |
| HDFS Capacity Used | Disk utilization | > 80% |
| YARN Memory Used | Memory utilization | > 85% |
| Pending Containers | Waiting containers | > 100 (sustained) |
Web UI Ports
# NameNode UI: http://namenode:9870
# ResourceManager UI: http://resourcemanager:8088
# DataNode UI: http://datanode:9864
# NodeManager UI: http://nodemanager:8042
# MapReduce History: http://historyserver:19888
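The same ports expose metrics as JSON via the `/jmx` endpoint, which makes ad-hoc health checks easy to script. A minimal sketch; the bean and attribute names follow the common `FSNamesystem` layout but should be verified against your cluster's actual `/jmx` output:

```python
import json
from urllib.request import urlopen  # for a live cluster; unused in the offline sample

def jmx_metric(beans_json: str, bean_substr: str, attr: str):
    """Pull one attribute out of a Hadoop /jmx response (JSON)."""
    for bean in json.loads(beans_json)["beans"]:
        if bean_substr in bean.get("name", ""):
            return bean.get(attr)
    return None

# In a live cluster the JSON would come from the NameNode UI port, e.g.:
#   beans_json = urlopen("http://namenode:9870/jmx").read()
# Here, a minimal sample response for illustration:
sample = '{"beans":[{"name":"Hadoop:service=NameNode,name=FSNamesystem","UnderReplicatedBlocks":0}]}'
print(jmx_metric(sample, "FSNamesystem", "UnderReplicatedBlocks"))  # 0
```

A cron job asserting this value stays at 0 is a cheap complement to the Prometheus setup below.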
Prometheus + Grafana Integration
# prometheus.yml
scrape_configs:
- job_name: 'hadoop-namenode'
metrics_path: '/jmx'
params:
format: ['prometheus']
static_configs:
- targets: ['namenode:9870']
- job_name: 'hadoop-datanode'
metrics_path: '/jmx'
params:
format: ['prometheus']
static_configs:
- targets: ['datanode1:9864', 'datanode2:9864', 'datanode3:9864']
- job_name: 'hadoop-resourcemanager'
metrics_path: '/jmx'
params:
format: ['prometheus']
static_configs:
- targets: ['resourcemanager:8088']
10. Troubleshooting
NameNode Failure
# When NameNode won't leave Safe Mode
hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode leave
# When NameNode EditLog is corrupted
hdfs namenode -recover
# NameNode OOM
# → Increase heap size in hadoop-env.sh
# export HDFS_NAMENODE_OPTS="-Xmx32g -Xms32g"
DataNode Failure
# Check DataNode status
hdfs dfsadmin -report | grep -A5 "Dead datanodes"
# Verify blocks on a specific DataNode
hdfs fsck / -files -blocks -locations | grep "datanode1"
# DataNode disk full
# → Add new disk to dfs.datanode.data.dir
# → Restart DataNode
YARN Application Failure
# List running applications
yarn application -list
# Check application logs
yarn logs -applicationId application_1709875200_0001
# Force kill application
yarn application -kill application_1709875200_0001
# Check container logs
yarn logs -applicationId application_1709875200_0001 -containerId container_1709875200_0001_01_000001
11. Operations Checklists
Initial Build Checklist
- Verify Java version (JDK 11 or above recommended)
- Configure SSH key-based authentication (all nodes)
- Verify NTP synchronization (all nodes)
- OS tuning (ulimit, swappiness, disable THP)
- JBOD disk configuration (do not use RAID)
- HDFS HA configuration (3+ JournalNodes)
- YARN HA configuration (2 ResourceManagers)
- Verify network bandwidth (10Gbps+ recommended)
Daily Operations Checklist
- Monitor NameNode heap usage
- Check under-replicated blocks
- Check for dead DataNodes
- Check HDFS capacity utilization
- Check YARN queue resource utilization
- Check log disk usage
- Check for long-running jobs
Regular Inspection Checklist
- Run `hdfs fsck /` to verify file system integrity
- Back up NameNode metadata (fsimage + edits)
- Plan security patch application
- Plan disk replacements (check SMART)
- Run cluster balancing (`hdfs balancer`)
- Log rotation and cleanup
Conclusion
Hadoop remains an important part of the big data ecosystem as foundational infrastructure. HDFS continues to serve as a large-scale data store, and YARN as the resource manager for various processing frameworks.
Key Takeaways:
- HDFS ensures reliability through block-level distributed storage + replication
- MapReduce is simple but slower than Spark due to heavy disk I/O
- YARN is a general-purpose resource manager for diverse frameworks
- Performance tuning focuses on block size, memory, and thread count adjustments
- Monitoring should focus on NameNode heap, under-replicated blocks, and YARN queues