- Published on
Revisiting Hadoop Architecture — HDFS, YARN, and Beyond
- Authors

- Name
- Youngju Kim
- @fjvbn20031
- Introduction
- The Problem Hadoop Set Out to Solve
- The Overall Shape of Hadoop
- HDFS — The Distributed File System
- YARN — The Resource Management Layer
- MapReduce — The Original Processing Model
- The Hadoop Ecosystem
- The Object-Storage, Cloud, and Lakehouse Era
- Practical and Operational Notes
- Common Misconceptions and Pitfalls
- Closing Thoughts
- References
Introduction
For a long stretch, the word "big data" was practically synonymous with "Hadoop." In the early-to-mid 2010s, countless companies built on-premises Hadoop clusters of dozens to hundreds of servers, and almost every data engineering job posting listed Hadoop experience as a requirement.
But entering the 2020s, the mood changed. Cloud object storage (S3, GCS, Azure Blob) became the de facto standard data store, Spark became the mainstream compute engine, and "lakehouse" architectures built on table formats such as Iceberg, Delta, and Hudi emerged. As a result, some people now say "Hadoop is dead."
This post is not about settling that debate. Instead, its goal is to calmly revisit what problem Hadoop was designed to solve and how, and to sort out which of those design decisions remain valid today and which have aged out. Once you truly understand Hadoop, nearly every concept in the modern data stack built on top of it (distributed storage, data locality, resource scheduling, shuffle) comes into sharper focus.
Here is what we will cover.
- The background behind Hadoop and its core design philosophy
- The structure of HDFS: NameNode, DataNode, blocks, replication
- Tracing the HDFS read/write paths with diagrams
- The structure of YARN: ResourceManager, NodeManager, ApplicationMaster
- How MapReduce works and the shuffle flow
- The ecosystem: Hive, Spark, HBase, and more
- How Hadoop position has shifted in the object-storage / cloud / lakehouse era
- Operational considerations and common pitfalls in practice
Note: The technical descriptions here follow the official Apache Hadoop documentation (hadoop.apache.org). Details can vary by version, so when operating in production, consult the docs for the distribution and version you actually use.
The Problem Hadoop Set Out to Solve
Hadoop traces back to two papers Google published in 2003 and 2004: the GFS (Google File System) paper and the MapReduce paper. At the time, Google had to process enormous amounts of data to crawl and index the entire web, and it chose to bundle thousands of cheap commodity servers rather than a handful of expensive high-end machines.
This approach rested on two big premises.
- Servers fail all the time. When you run thousands of them, a few dying every day is normal. Failure is therefore the rule, not the exception, and the system must be designed to be fault-tolerant.
- When data is huge, it is far cheaper to move computation to where the data lives than to move the data to where the computation is. This is the concept of "data locality."
Hadoop is essentially an open-source implementation of exactly these two philosophies. HDFS corresponds to GFS, and Hadoop MapReduce corresponds to Google MapReduce. It began as part of Doug Cutting search-engine project Nutch and later became a top-level Apache project.
The core design goals can be summarized as follows.
┌──────────────────────────────────────────────────────────────┐
│ Hadoop Core Design Goals │
├──────────────────────────────────────────────────────────────┤
│ 1. Scalability │
│ Scale out — add nodes to grow capacity/throughput. │
│ │
│ 2. Fault Tolerance │
│ Failure is normal — protect data/jobs via replication+retry.│
│ │
│ 3. Data Locality │
│ Move compute near the data — minimize network transfer. │
│ │
│ 4. Commodity Hardware │
│ Many cheap servers — no reliance on a single costly box. │
│ │
│ 5. High Throughput (not Low Latency) │
│ Optimized for large batch — not real-time responses. │
└──────────────────────────────────────────────────────────────┘
The last item is especially important. Hadoop is fundamentally optimized for "reading large data sequentially in one sweep and processing it in batch." It was never well suited to handling millions of tiny files or serving single-record lookups at low latency. This trait reappears later when we discuss pitfalls.
The Overall Shape of Hadoop
It helps to understand Hadoop in three layers.
┌───────────────────────────────────────────────────────────────┐
│ Processing Layer │
│ MapReduce · Spark · Tez · Hive · Pig · Flink ... │
├───────────────────────────────────────────────────────────────┤
│ Resource Management Layer │
│ YARN — ResourceManager / NodeManager / ApplicationMaster │
├───────────────────────────────────────────────────────────────┤
│ Storage Layer │
│ HDFS — NameNode / DataNode / Block / Replication │
└───────────────────────────────────────────────────────────────┘
- The storage layer is handled by HDFS. It stores data in a distributed way and protects it through replication.
- The resource management layer is handled by YARN. It divides resources such as CPU and memory among multiple applications.
- The processing layer hosts various engines. Initially there was only MapReduce, but after YARN arrived, engines like Spark and Tez began sharing the same cluster.
This separation of the three layers became explicit when YARN was introduced in Hadoop 2.x. In Hadoop 1.x, MapReduce handled everything except storage (scheduling plus execution), which made it hard to use any processing approach other than MapReduce. By splitting "resource management" out from the "processing engine," YARN turned Hadoop from a mere MapReduce platform into something more like a general-purpose distributed processing operating system.
Now let us look at each layer in depth.
HDFS — The Distributed File System
Basic Concepts: Blocks and Replication
The most fundamental idea in HDFS is to "split a large file into fixed-size blocks and store them distributed across many nodes." The default block size was 64MB for a long time and is commonly 128MB in recent distributions (configured via dfs.blocksize).
Compared with the blocks of an ordinary local file system (usually around 4KB), this is enormous, and the reason is clear. If blocks are small, the number of metadata entries the NameNode must manage explodes, and disk seek cost takes up a larger share of total transfer time. With large blocks, you can seek once and then read sequentially for a long time, which improves throughput.
File: bigfile.log (380 MB), block size 128 MB
bigfile.log
┌──────────────┬──────────────┬──────────────┐
│ Block A │ Block B │ Block C │
│ 128 MB │ 128 MB │ 124 MB │
└──────────────┴──────────────┴──────────────┘
Each block is replicated to several DataNodes per the
replication factor (default 3).
Block A ──▶ DataNode 1, DataNode 3, DataNode 5
Block B ──▶ DataNode 2, DataNode 3, DataNode 6
Block C ──▶ DataNode 1, DataNode 4, DataNode 6
A replication factor of 3 means a copy of the same block exists on three different DataNodes. So even if one or two DataNodes die, the data is not lost. This "fault tolerance through replication" is the heart of HDFS.
NameNode and DataNode
HDFS uses a master-slave structure.
┌─────────────────────────────────────────────────────────────┐
│ NameNode (master) │
│ - Manages the file system namespace (directory/file tree) │
│ - Records which blocks make up each file │
│ - Keeps the mapping of which block lives on which DataNode │
│ - Metadata resides in memory (for fast lookups) │
└─────────────────────────────────────────────────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│DataNode1│ │DataNode2│ │DataNode3│ │DataNode4│ (slaves)
│store blk│ │store blk│ │store blk│ │store blk│
└─────────┘ └─────────┘ └─────────┘ └─────────┘
- Store actual block data on local disk
- Periodically send heartbeats and block reports to NameNode
- The NameNode manages only metadata. It does not hold the actual data bytes. Instead it knows everything about "which file consists of which blocks, and which DataNode each block lives on." This information resides in memory for fast lookups.
- DataNodes store the actual block data on their own local disks. They periodically send the NameNode a heartbeat (an "I am alive" signal) and a block report (the list of blocks they hold).
The fact that the NameNode keeps metadata in memory has an important implication. As the number of files and blocks grows, NameNode memory usage grows with it, so a workload of "millions of tiny files" puts heavy pressure on the NameNode. This is the notorious "small files problem."
NameNode Durability: FsImage and EditLog
NameNode metadata lives in memory but must be recoverable on restart. Two on-disk structures make this possible.
┌──────────── NameNode Metadata Persistence ────────────────┐
│ │
│ FsImage : A full snapshot of the file system at a point │
│ EditLog : A log of every change since that FsImage │
│ │
│ Recovery procedure on restart: │
│ 1) Load FsImage (the baseline snapshot) │
│ 2) Replay EditLog (apply changes since then) │
│ 3) Reconstruct the latest in-memory metadata │
│ │
└────────────────────────────────────────────────────────────┘
If the EditLog grows without bound, restart becomes slow, so checkpointing — periodically merging FsImage and EditLog into a new FsImage — is required. This was historically handled by the Secondary NameNode, and in HA (high availability) setups it is handled by the Standby NameNode.
Let us clear up a common misconception here. The Secondary NameNode is not a backup (replacement) for the NameNode. Despite its name, it cannot perform failover; it merely assists with checkpointing. True high availability is achieved with the HA setup described below.
NameNode High Availability (HA)
In a single-NameNode setup, the NameNode is a single point of failure (SPOF). If the NameNode dies, the entire cluster metadata becomes inaccessible, so virtually all work halts. To address this, HDFS offers an HA setup.
┌──────────────┐ shared EditLog ┌──────────────┐
│ Active │◀────────────────▶│ Standby │
│ NameNode │ (JournalNodes) │ NameNode │
└──────┬───────┘ └──────┬───────┘
│ │
│ DataNodes send heartbeats and │
│ block reports to both NameNodes│
▼ ▼
┌───────────────────────────────────────────────┐
│ DataNode 1 · DataNode 2 · DataNode 3 · ... │
└───────────────────────────────────────────────┘
Active dies → ZKFC + ZooKeeper detect → Standby promoted to Active
- The Active NameNode handles all client requests.
- The Standby NameNode continuously replays the EditLog shared through the JournalNodes to stay in sync with the Active.
- The ZKFC (ZooKeeper Failover Controller) and ZooKeeper monitor the Active liveness and automatically promote the Standby to the new Active (failover) when a failure occurs.
In this process, a fencing mechanism is crucial to prevent "split-brain," where both NameNodes believe they are Active at the same time. Misconfiguration can lead to data corruption, so care is required in operations.
The HDFS Write Path
Let us trace what actually happens when a client writes a file to HDFS.
[HDFS Write Path]
Client
│ 1. create() request — "make a file at this path"
▼
NameNode
│ - Check permissions/existence, create file entry in namespace
│ 2. Return list of DataNodes for first block (replication 3 → 3)
│ e.g.: DN1 ──▶ DN2 ──▶ DN3 (pipeline)
▼
Client
│ 3. Send data as packets to DN1
▼
┌─────────┐ replication pipeline ┌─────────┐ ┌─────────┐
│ DN1 │ ─── forward packets ──▶│ DN2 │ ──────▶ │ DN3 │
│ disk ↓ │ │ disk ↓ │ │ disk ↓ │
└─────────┘ └─────────┘ └─────────┘
│ │
│ 4. ack flows backward: DN3 ──▶ DN2 ──▶ DN1 ──▶ Client │
▼
Client
│ 5. When the block is full, repeat 2-4 for the next block
│ 6. close() — notify NameNode that the write is complete
▼
NameNode (finalize metadata)
The key point is that replication happens via a "replication pipeline." The client sends data only to the first DataNode, which streams the packets to the second, and the second to the third. This way the client upload bandwidth is used only once while three replicas are created.
Another important piece is the replica placement policy. The default policy is roughly as follows.
[Default Replica Placement (replication = 3, rack-aware)]
Replica 1 : the node where the client runs (or a random node on the same rack)
Replica 2 : a node on a different rack than replica 1
Replica 3 : another node on the same rack as replica 2
Purpose:
- 2 copies on the same rack → fast write over the rack-local network
- 1 copy on a different rack → data survives a whole-rack failure
This "rack awareness" policy is a compromise between write performance (the network is fast within a rack) and durability (data survives even if an entire rack dies).
The HDFS Read Path
Reading is relatively simple.
[HDFS Read Path]
Client
│ 1. open() — "I want to read this file"
▼
NameNode
│ 2. Return the file block list + DataNode locations per block
│ (sorted by closeness to the client)
▼
Client
│ 3. For each block, read directly from the nearest DataNode
▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ DN (A) │ │ DN (B) │ │ DN (C) │ ← pick the nearest node per block
└─────────┘ └─────────┘ └─────────┘
│
│ 4. If a DataNode does not respond
│ → retry against another DataNode holding a replica
▼
Client (stitch blocks together into the full file)
Here the NameNode only provides location information ("read from here"); the actual data is exchanged directly between the client and the DataNodes. In other words, data traffic does not pass through the NameNode. This is the secret to how the NameNode can focus on metadata while the cluster maintains high overall throughput.
Also, when the NameNode sorts locations, it prioritizes "closeness to the client" (same node → same rack → different rack), so reads happen locally or within the same rack when possible. This is data locality on the read path.
YARN — The Resource Management Layer
Why YARN Was Needed
In Hadoop 1.x, MapReduce did two jobs at once. One was managing cluster resources (slots) and scheduling work (the JobTracker), and the other was running the actual map/reduce tasks. This structure had problems.
- All the load piled onto the JobTracker, limiting scalability (a bottleneck around 4,000 nodes).
- The cluster could run only MapReduce jobs. Even if you wanted to process data differently, like with Spark, it was hard to use the same cluster.
- Resources were statically split into "map slots" and "reduce slots," which was very inefficient.
YARN (Yet Another Resource Negotiator) separated this "resource management" from "job execution." As a result, multiple processing engines such as MapReduce, Spark, Tez, and Flink can now share resources and coexist on the same cluster.
YARN Components
┌──────────────────────────────────────────────────────────────┐
│ ResourceManager (RM, master) │
│ - Scheduler: allocates cluster resources to applications │
│ - ApplicationsManager: manages AM start/restart │
└───────────────┬──────────────────────────────────────────────┘
│ resource allocation / status reports
┌────────────┼────────────────────────┬───────────────┐
▼ ▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ Node- │ │ Node- │ │ Node- │ │ Node- │
│Manager │ │Manager │ ... │Manager │ │Manager │
│ (NM) │ │ (NM) │ │ (NM) │ │ (NM) │
└───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘
│ │ │ │
container container container container
┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
│ App │ │ Task │ │ Task │ │ Task │
│Master│ │ │ │ │ │ │
└──────┘ └──────┘ └──────┘ └──────┘
- The ResourceManager (RM) is the master of the entire cluster. Internally it splits into the Scheduler (resource allocation) and the ApplicationsManager (managing the lifecycle of ApplicationMasters).
- A NodeManager (NM) runs one per node, manages that node resources (CPU/memory), and launches and monitors containers. Much like the NameNode-DataNode relationship, RM-NM is also master-slave.
- The ApplicationMaster (AM) is a "dedicated manager for that job," one per application (job). It requests the resources the job needs from the RM and runs and monitors the actual tasks in the allocated containers.
- A Container is a bundle of resources (e.g., 2GB memory + 1 vCore). The actual work (tasks) runs inside containers.
The existence of the ApplicationMaster is the heart of the YARN design. The RM, which manages the whole cluster, does not get involved in the details of each individual job (task progress, retries, etc.). That work is handled by the per-job AM. By distributing responsibility this way, YARN reduces the RM load and gains scalability.
YARN Scheduling Sequence
Let us walk through how resources are negotiated, in order, when a client submits a job.
[YARN Job Submission and Execution Sequence]
Client RM NM(multiple nodes) AM
│ │ │ │
│ 1. submitApplication │ │
│───────────────────▶│ │ │
│ │ 2. request AM container │
│ │──────────────────▶│ │
│ │ │ 3. launch AM │
│ │ │ container │
│ │ │──────────────▶│
│ │ 4. AM registers │
│ │◀──────────────────────────────────│
│ │ 5. request containers for tasks │
│ │◀──────────────────────────────────│
│ │ 6. container allocation response │
│ │──────────────────────────────────▶│
│ │ │ 7. AM tells NM │
│ │ │ to run tasks │
│ │ │◀──────────────│
│ │ │ 8. run task │
│ │ │ (in container)│
│ 9. query progress │ │ │
│◀───────────────────┼───────────────────┼───────────────│
│ │ 10. on completion AM unregisters, frees res │
│ │◀──────────────────────────────────│
▼ ▼ ▼ ▼
In words, the flow is as follows.
- The client submits an application to the RM.
- The RM decides which NM to launch the AM on and instructs that NM to start an AM container.
- Once the AM starts, it registers itself with the RM and requests containers for the tasks the job needs.
- The RM Scheduler examines the resource situation and allocates containers.
- The AM instructs the NMs hosting the allocated containers to run the actual tasks.
- Tasks run inside the containers, and the AM tracks progress.
- When the job finishes, the AM unregisters and returns the resources.
Types of Schedulers
The YARN Scheduler comes in several flavors depending on policy. The three representative ones are below.
| Scheduler | Core behavior | Good for |
|---|---|---|
| FIFO Scheduler | Process in submission order | Simple tests, single user |
| Capacity Scheduler | Guarantee capacity per queue in advance | When multiple teams need guaranteed minimum resources |
| Fair Scheduler | Distribute resources fairly among running jobs | When you want a variety of jobs to share resources evenly |
Large multi-tenant clusters typically use the Capacity Scheduler or the Fair Scheduler. They split queues per team/project and set a guaranteed minimum capacity and a maximum usage limit per queue to control resource contention.
MapReduce — The Original Processing Model
Map and Reduce
MapReduce processes data with two stages of functional operations.
- Map: takes input and produces intermediate (key, value) pairs.
- Reduce: gathers values with the same key and aggregates/summarizes them.
Let us look at the classic word count example.
// Mapper: takes a line and emits (word, 1) for each word
public class WordCountMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable ONE = new IntWritable(1);
private final Text word = new Text();
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
for (String token : line.split("\\s+")) {
if (!token.isEmpty()) {
word.set(token);
context.write(word, ONE);
}
}
}
}
// Reducer: sums up all the 1s for the same word
public class WordCountReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
private final IntWritable result = new IntWritable();
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable v : values) {
sum += v.get();
}
result.set(sum);
context.write(key, result);
}
}
// Driver: job configuration and submission
public class WordCountDriver {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCountDriver.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setCombinerClass(WordCountReducer.class); // partial aggregation on map side
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Shuffle and Sort
The heaviest stage in MapReduce, and the one most often a performance bottleneck, is the shuffle between Map and Reduce. It is the process of redistributing Map output by key so that the same key ends up at the same Reducer.
[Full MapReduce Data Flow (including shuffle)]
Input (HDFS blocks)
┌────────┐ ┌────────┐ ┌────────┐
│ Split1 │ │ Split2 │ │ Split3 │
└───┬────┘ └───┬────┘ └───┬────┘
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐
│ Map 1 │ │ Map 2 │ │ Map 3 │ ← each Map emits (key,value)
└───┬────┘ └───┬────┘ └───┬────┘
│ │ │
│ [partition + sort + (optional) combiner]
│ │ │
└────┬─────┴─────┬────┘
│ S H U F F L E │ ← redistribute by key over network
┌────┴─────┐ ┌────┴─────┐
▼ ▼ ▼ ▼
┌────────────────┐ ┌────────────────┐
│ Reduce 1 │ │ Reduce 2 │ ← same key → same Reducer
│ (keys A,B) │ │ (keys C,D) │
└───────┬────────┘ └───────┬────────┘
▼ ▼
┌────────┐ ┌────────┐
│Output 1│ │Output 2│ ← write results to HDFS
└────────┘ └────────┘
Breaking down what happens during the shuffle in more detail.
[Map-side → Reduce-side shuffle detail]
Map side:
1. map() output accumulates in an in-memory buffer
2. When the buffer fills, spill to disk — sorting per partition
3. Merge multiple spill files into one
4. (If a combiner is set) partially aggregate on the map side to shrink data
Reduce side:
5. Fetch each map partition data over the network
6. Merge and sort the fetched data by key
7. Call reduce() — process all values for the same key at once
The shuffle incurs both disk I/O and network I/O, so it is expensive. That is why it is important to use a combiner to pre-aggregate on the map side and reduce the volume of data. Note, however, that a combiner can only be safely applied to operations that are commutative and associative (e.g., sum, count). For operations like average where this does not hold, you can get incorrect results, so be careful.
Limitations of MapReduce
MapReduce is robust but slow. The biggest reason is that it writes the intermediate result of each stage to disk (HDFS). If you build a complex multi-stage pipeline with MapReduce, the cost of writing to and re-reading from disk at every stage piles up. For iterative machine learning algorithms, this cost was crippling.
This limitation is exactly the backdrop against which Spark appeared and quickly became mainstream.
The Hadoop Ecosystem
Hadoop is not a single product but a vast ecosystem. A variety of tools stack on top of HDFS and YARN.
┌──────────────────────────────────────────────────────────────┐
│ SQL / query: Hive, Impala, Presto/Trino │
│ scripting: Pig │
│ engines: MapReduce, Spark, Tez, Flink │
│ NoSQL: HBase (distributed key-value/column store on HDFS) │
│ ingestion: Sqoop(RDB-HDFS), Flume/Kafka(stream) │
│ coordination: ZooKeeper (distributed coordination) │
│ workflow: Oozie, Airflow │
├──────────────────────────────────────────────────────────────┤
│ YARN (resource management) │
├──────────────────────────────────────────────────────────────┤
│ HDFS (distributed storage) │
└──────────────────────────────────────────────────────────────┘
Let us highlight the three most frequently mentioned.
Hive — SQL on Top of HDFS
Hive is a data warehousing tool that lets you query data stored in HDFS with SQL. When a user writes SQL (HiveQL, to be precise), Hive internally translates it into MapReduce (or Tez/Spark) jobs and runs them.
-- External table definition: treat data at an HDFS path like a table
CREATE EXTERNAL TABLE access_log (
ip STRING,
ts STRING,
url STRING,
status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/access_log/';
-- Aggregate by status code
SELECT status, COUNT(*) AS cnt
FROM access_log
GROUP BY status
ORDER BY cnt DESC;
The arrival of Hive was significant. Countless analysts and data users who could not write MapReduce in Java could now access Hadoop data with familiar SQL. Moreover, the Hive Metastore (a store that manages table schema/location metadata) was later shared as a de facto standard by other engines such as Spark and Presto/Trino, becoming the effective standard data catalog.
Spark — The In-Memory Processing Engine
Spark emerged to overcome the disk-centric limitations of MapReduce. The core idea is to keep intermediate results in memory when possible and to build DAG (directed acyclic graph) execution plans over an abstraction called RDD (later DataFrame/Dataset).
| Aspect | MapReduce | Spark |
|---|---|---|
| Intermediate data | Written to disk each stage | Kept in memory when possible |
| Programming model | Forced Map/Reduce two stages | Flexible DAG-based operations |
| Iterative work | Very inefficient | Well suited (ML, etc.) |
| Latency | High | Relatively low |
| Good for | Simple large-scale batch | Complex pipelines, ML, interactive |
The important point is that Spark replaces a part of Hadoop (MapReduce) rather than Hadoop as a whole. Spark can run on YARN and use HDFS as a data source. In other words, Spark is the new powerhouse of the processing layer, not a wholesale replacement for the storage/resource-management layers (though in the cloud this picture changes again, as we will see).
HBase — NoSQL on Top of HDFS
HDFS is optimized for reading and writing large files sequentially, so it is poorly suited to single-record random reads/writes. HBase fills this gap as a distributed column-oriented NoSQL database modeled after Google Bigtable paper. It sits on top of HDFS but enables random access and low-latency single-record lookups.
[Where HBase sits]
Application (needs low-latency single-record lookups)
│
▼
┌──────────┐
│ HBase │ ← random read/write, row-level access
└────┬─────┘
▼
┌──────────┐
│ HDFS │ ← persist actual data files (HFile)
└──────────┘
The Object-Storage, Cloud, and Lakehouse Era
That covers "classic Hadoop." Now it is time to talk about the changes of the 2020s. To state the conclusion first: Hadoop core ideas survived, but the role of its implementation (especially HDFS) shrank dramatically in the cloud.
The Biggest Change: Separating Storage and Compute
The core premise of classic Hadoop was "data locality" — sending computation to the node where the data lives to save the network. For this, storage (HDFS) and compute (MapReduce/YARN) had to sit together on the same physical node (co-location).
But in the cloud, the story changes.
- Networks became very fast. Intra-data-center bandwidth is large enough that the "cost of moving data" is no longer as high as it once was.
- Object storage (S3, etc.) provides effectively infinite, cheap, and highly durable storage. The provider takes responsibility for availability/durability.
- Separating storage and compute lets you scale each independently. You keep accumulating data, but spin compute up and down only when you need it.
[Classic Hadoop] [Cloud / Lakehouse]
┌──────────────┐ compute (only when needed)
│ node = storage │ ┌──────┐ ┌──────┐ ┌──────┐
│ + compute │ (co-location) │Spark │ │Trino │ │Flink │
│ HDFS + YARN │ └───┬──┘ └───┬──┘ └───┬──┘
└──────────────┘ │ │ │
data-locality centric └────────┼────────┘
storage+compute coupled ▼
┌────────────────┐
│ object storage │
│ S3 / GCS / ... │ (separated)
└────────────────┘
storage/compute scale independently
This separation is the core change in cloud data architecture. And in this layout, HDFS place is increasingly replaced by object storage.
HDFS vs Object Storage
| Aspect | HDFS | Object Storage (S3, etc.) |
|---|---|---|
| Operated by | You (NameNode/DataNode) | Cloud provider managed |
| Scaling | Add nodes, NameNode memory limit | Effectively unlimited |
| Cost model | Fixed server/disk cost | Usage-based (storage + requests) |
| Consistency | Strong consistency | Strong consistency (modern S3) |
| Small files | NameNode burden (small files problem) | Relatively free |
| Data locality | Yes (same node/rack) | No (over the network) |
| rename/directory | Fast metadata op | Can be costly or non-atomic |
| Best fit | On-prem, fixed workloads | Cloud, elastic workloads |
Object storage is not a cure-all. The most famous difference is that a directory rename is a fast metadata operation in HDFS, whereas in object storage it is effectively a copy-plus-delete, which can be slow and non-atomic. Because many data pipelines rely on the pattern of "write to a temporary directory and commit with a final rename," this difference is part of the backdrop for the rise of table formats.
Table Formats and the Lakehouse
Object storage is just a "bag of files" and knows nothing of the concept of a table. It does not give you transactions, schema changes, time travel, or concurrent writes. To fill this gap, "open table formats" such as Apache Iceberg, Delta Lake, and Apache Hudi appeared.
They abstract the files on object storage into "tables" and, through a metadata layer, provide ACID transactions, schema evolution, snapshot-based time travel, and more. The result is the "lakehouse" architecture, which combines the flexibility of a data lake (store any data cheaply) with the reliability of a data warehouse (ACID, schema).
[Lakehouse Layer Structure]
┌──────────────────────────────────────────────┐
│ query/process: Spark, Trino, Flink, Dremio │
├──────────────────────────────────────────────┤
│ table format: Iceberg / Delta / Hudi │
│ - ACID transactions - schema evolution - time travel │
├──────────────────────────────────────────────┤
│ file format: Parquet / ORC (columnar) │
├──────────────────────────────────────────────┤
│ storage: object storage (S3/GCS) or HDFS │
└──────────────────────────────────────────────┘
What is interesting is that several pieces of this stack originated in the Hadoop ecosystem. The columnar file formats Parquet and ORC were both born in the Hadoop ecosystem, and table-format metadata management started as an attempt to overcome the limitations of the Hive Metastore. In other words, the lakehouse is not a break from Hadoop but closer to a redesign of the problems Hadoop was solving, adapted to the cloud.
The Hadoop Era vs the Lakehouse Era
| Perspective | Hadoop era (2010s) | Lakehouse era (2020s) |
|---|---|---|
| Storage | HDFS (self-operated) | Object storage (managed) |
| Storage-compute | Coupled (co-location) | Separated |
| Main engine | MapReduce | Spark, Trino, Flink |
| Table/transaction | Hive Metastore | Iceberg/Delta/Hudi |
| Resource management | YARN | Kubernetes also rising |
| Scaling unit | Node (storage+compute together) | Storage/compute independent |
| Operational burden | High (run your own cluster) | Relatively low (managed) |
It is worth noting that YARN place is wobbling here too. In cloud-native environments, Kubernetes increasingly handles resource scheduling. Spark too can run not only on YARN but directly on Kubernetes.
So Is Hadoop Dead?
"Hadoop is dead" is only half true. More precisely:
- New builds of on-premises clusters running HDFS directly have clearly declined. If you are in the cloud, object storage is often the more sensible choice.
- MapReduce is virtually unused in new development. It has been replaced by Spark and others.
- But the concepts Hadoop established (distributed storage, replication, data locality, resource scheduling, shuffle, columnar formats, the metastore) live on as the foundation of the modern data stack.
- Also, for reasons of data sovereignty/regulation, cost, and security, quite a few organizations still run large on-premises Hadoop clusters. In such environments, Hadoop knowledge remains practically important.
In short, Hadoop has passed its peak as a "specific product," but as a "set of ideas and vocabulary" it remains the common language of modern data engineering.
Practical and Operational Notes
The Small Files Problem
Because of the NameNode memory constraint mentioned earlier, you must beware of accumulating many small files in HDFS. The NameNode consumes a fixed amount of memory per file/directory/block, so 100 million 1KB files impose a metadata burden similar to 100 million 1GB files while being far less efficient to process.
Typical mitigations include:
- Periodically running compaction jobs that merge small files into larger ones
- Bundling them with container formats such as HAR (Hadoop Archive) or SequenceFile
- Increasing batch size at the ingestion stage so fewer small files are created in the first place
Choosing a Data Format
Even for the same data, performance and cost vary greatly depending on which file format you store it in.
| Format | Characteristics | Good for |
|---|---|---|
| Text (CSV/JSON) | Human-readable, inefficient | Temporary/debugging, external integration |
| Avro | Row-oriented, great schema evolution | Ingestion/streaming, full-row reads |
| Parquet | Columnar, good compression/scan | Analytical queries (read only some columns) |
| ORC | Columnar, pairs well with Hive | Hive-based analytics |
For analytical workloads, columnar formats (Parquet/ORC) are almost always advantageous. They read only the columns a query needs, compress well column by column, and reduce unnecessary data scans through predicate pushdown.
Replication Factor and Cost
A replication factor of 3 is safe but triples storage cost. For very large cold data (old data that is rarely read), you can consider HDFS Erasure Coding. Erasure coding uses parity instead of replication, achieving similar durability with about a 1.5x storage overhead. However, it has encoding/decoding compute cost and recovery can be slow, so it suits cold data rather than frequently read hot data.
Resource Tuning
If jobs are slow or keep dying in a YARN environment, check the following.
[Common Checkpoints]
- Is the container memory ( yarn.scheduler.maximum-allocation-mb )
enough for the task? Is the container being killed by OutOfMemory?
- Is the number of map/reduce tasks appropriate (too few → poor
parallelism, too many → scheduling overhead)?
- Data skew: do values pile onto a particular key so one Reducer
runs long?
- Is the shuffle volume excessive (can a combiner/partitioner reduce it)?
- Are queue settings appropriate (is one queue hogging resources)?
Data skew in particular is the most common and trickiest performance problem in distributed processing. When the key distribution leans to one side, only some tasks get overloaded and the whole job ends up waiting on them. Adding a salt key or adjusting the partitioner to spread the distribution evenly is the fix.
Common Misconceptions and Pitfalls
Finally, let us sort out the misconceptions you frequently see around Hadoop.
-
The "Hadoop = fast" misconception. Hadoop is not fast; it is designed to handle large data with good throughput. If your goal is low-latency single-record responses, Hadoop is the wrong tool.
-
The "Secondary NameNode is a backup" misconception. The Secondary NameNode assists with checkpointing; it does not perform failover. True high availability comes from the HA (Active/Standby) setup.
-
The "Spark replaces Hadoop entirely" misconception. Spark only replaces the processing engine (MapReduce); it does not wholesale replace storage (HDFS) or resource management (YARN). That said, in the cloud, the combination of object storage and Kubernetes can take over both of those roles.
-
The "moving to object storage is always better" misconception. It differs from HDFS in rename cost, consistency model, and metadata operation characteristics, so unless you design with a table format (Iceberg/Delta/Hudi), you can actually fall into traps.
-
The "many small files are fine" misconception. They hurt both NameNode memory and processing efficiency. Compaction must be part of your operational routine.
-
The "replication 3 is always right" misconception. It is reasonable for hot data, but for large cold data that is rarely read, erasure coding can be more economical.
Closing Thoughts
Hadoop was an answer to one big question: "how do you safely and efficiently process enormous data on thousands of cheap servers where failure is routine?" The concrete implementation of that answer (HDFS, MapReduce) has largely ceded ground to other tools with the rise of the cloud, object storage, and the lakehouse.
But the concepts that answer established are still alive. The idea of splitting data to distribute it and protecting it with replication, the idea of sending computation near the data, the idea of separating resource management from execution, the idea of redistributing keys via shuffle — these underpin nearly every distributed data system we use today.
So to the question "should I learn Hadoop?" I would answer this way. There is less occasion to newly adopt Hadoop as a specific product, but understanding the way of thinking Hadoop taught us is still a great asset for the modern data engineer. Whether lakehouse or cloud-native stack, at its roots lie the problems Hadoop bumped into and solved first.
References
- Apache Hadoop official site: https://hadoop.apache.org/
- HDFS Architecture (HDFS Design): https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
- Apache Hadoop YARN: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html
- MapReduce Tutorial: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
- HDFS High Availability: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
- Apache Hive: https://hive.apache.org/
- Apache Spark Documentation: https://spark.apache.org/docs/latest/
- Apache HBase Reference Guide: https://hbase.apache.org/book.html
- Apache Iceberg: https://iceberg.apache.org/
- Apache Parquet: https://parquet.apache.org/