Semiconductor Deep Dive -- Complete Guide to CPU, GPU, RAM, ASIC, and CUDA Architecture
Introduction
Inside the smartphones, laptops, and cloud servers we use every day, billions of transistors are ceaselessly processing 0s and 1s. But can you explain exactly how CPU, GPU, and RAM work? Why has the GPU become essential in the AI era, and why are dedicated chips like ASICs and TPUs emerging?
In this article, we start from the physical principles of semiconductors and cover CPU pipelines, RAM cell structures, GPU parallel processing, CUDA programming, and the latest AI semiconductor trends -- all in depth.
1. What is a Semiconductor?
Conductors, Insulators, and Semiconductors
Materials can be classified into three categories based on electrical conductivity.
| Classification | Conductivity | Representative Materials | Resistivity (Ω·m) |
|---|---|---|---|
| Conductor | Very high | Copper, gold, aluminum | Below 10^-8 |
| Semiconductor | Controllable | Silicon, germanium | 10^-5 to 10^6 |
| Insulator | Very low | Glass, rubber, ceramics | Above 10^10 |
The key property of semiconductors is that their conductivity can be controlled. By injecting impurities (doping) into pure silicon, N-type semiconductors (rich in electrons) or P-type semiconductors (rich in holes) are created.
How Transistors Work
A transistor is an electronic switch. When voltage is applied to the gate, current flows from source to drain; when there is no voltage, current is blocked.
MOSFET (NMOS) Transistor Structure

                 Gate (G)
                    |
              +-----------+
              |   Oxide   |
              |  (SiO2)   |
      +-------+-----------+-------+
      |  N+   |           |  N+   |
      | Source|  channel  | Drain |
      +-------+-----------+-------+
      |        P-substrate        |
      +---------------------------+

S = Source, D = Drain, G = Gate
A positive gate voltage attracts electrons under the oxide, forming an N-type channel that lets current flow between source and drain.
When billions of these simple ON/OFF switches are combined, complex logical operations become possible.
- NOT gate (CMOS inverter): 2 transistors -- inverts input
- NAND gate: 4 transistors (CMOS) -- a universal gate from which all other logic can be built
- Adder: Combination of multiple gates -- performs addition
- ALU: Combination of adders and logic gates -- the core arithmetic unit of the CPU
Moore's Law
Predicted by Gordon Moore in 1965, this observes that the number of transistors on a chip approximately doubles every two years.
Moore's Law Transistor Count Progression
Year | Transistors | Representative Processor
---------|-----------------|------------------
1971 | 2,300 | Intel 4004
1985 | 275,000 | Intel 386
1999 | 9,500,000 | Pentium III
2006 | 291,000,000 | Core 2 Duo
2015 | 1,750,000,000 | Skylake
2022     | 114,000,000,000 | Apple M1 Ultra
2024     | 208,000,000,000 | NVIDIA B200
As we approach physical limits in the 2020s, performance improvements are being pursued through 3D stacking, chiplet technology, and new transistor structures (GAA) rather than simple miniaturization.
2. CPU Architecture Deep Dive
The CPU (Central Processing Unit) is the brain of the computer. It is optimized for executing a single stream of instructions as quickly as possible.
Core Components of a CPU
+-------------------------------------------------------+
| CPU |
| +----------+ +-----------+ +--------------------+ |
| | Control | | ALU | | Registers | |
| | Unit | | (Arithmetic| | +----+ +----+ | |
| | | | Logic | | | R0 | | R1 | | |
| | Instruction| | Unit) | | +----+ +----+ | |
| | decode / | | | | +----+ +----+ | |
| | execution| | +-------+ | | | R2 | | R3 | | |
| | flow | | | Adder | | | +----+ +----+ | |
| | control | | +-------+ | | +----+ +----------+| |
| +----------+ | | Logic | | | | PC | | SP || |
| | +-------+ | | +----+ +----------+| |
| +-----------+ +--------------------+ |
| +---------------------------------------------------+ |
| | Cache Memory | |
| | L1: 64KB | L2: 512KB | L3: 32MB | |
| +---------------------------------------------------+ |
+-------------------------------------------------------+
ALU (Arithmetic Logic Unit): Performs actual operations such as addition, subtraction, AND, OR, and XOR.
Control Unit: Decodes instructions and sends control signals to each component.
Registers: The fastest storage inside the CPU. Includes general-purpose registers, the program counter (PC), and the stack pointer (SP).
Cache Memory: Bridges the speed gap between the CPU and main memory. L1 is the fastest and smallest; L3 is the largest and relatively slower.
Pipeline
A technique that divides CPU instruction execution into multiple stages and processes them simultaneously, similar to a factory assembly line.
5-Stage Pipeline (Classic RISC Pipeline)
Time -->      1    2    3    4    5    6    7    8    9

            +----+----+----+----+----+
Instr 1:    | IF | ID | EX | MEM| WB |
            +----+----+----+----+----+
                 +----+----+----+----+----+
Instr 2:         | IF | ID | EX | MEM| WB |
                 +----+----+----+----+----+
                      +----+----+----+----+----+
Instr 3:              | IF | ID | EX | MEM| WB |
                      +----+----+----+----+----+
                           +----+----+----+----+----+
Instr 4:                   | IF | ID | EX | MEM| WB |
                           +----+----+----+----+----+
                                +----+----+----+----+----+
Instr 5:                        | IF | ID | EX | MEM| WB |
                                +----+----+----+----+----+
IF = Instruction Fetch
ID = Instruction Decode
EX = Execute
MEM = Memory Access
WB = Write Back
Without pipelining, one instruction must complete entirely before the next can begin. With a 5-stage pipeline, throughput theoretically improves by up to 5x; in practice, hazards keep the real speedup somewhat lower.
Pipeline Hazards
There are situations where the pipeline stalls.
Data Hazard: When a previous instruction's result is needed but not yet ready
ADD R1, R2, R3 -- R1 = R2 + R3 (result written at WB)
SUB R4, R1, R5 -- R4 = R1 - R5 (needs R1 but not available yet!)
Solution: Forwarding -- pass EX stage result directly to next instruction
Control Hazard: When the next execution location is unknown due to a branch instruction
Branch Prediction is used to solve this.
Branch Prediction
Branch Prediction Flow
+--------------------+
| Branch instruction |
| detected |
+--------+-----------+
|
+--------v-----------+
| Branch History |
| Table (BHT) lookup |
+--------+-----------+
|
+-------+-------+
| |
+-----v-----+ +----v------+
| Not taken | | Taken |
| predicted | | predicted |
+-----+-----+ +----+------+
| |
+-----v-----+ +----v------+
| Execute | | Execute |
| next instr | | branch |
| | | target |
+-----+-----+ +----+------+
| |
+-------+-------+
|
+--------v---------+
| Verify actual |
| result |
+--------+---------+
|
+-------+-------+
| |
+-----v-----+ +----v------+
| Prediction | | Prediction|
| hit | | miss |
| (continue) | | (pipeline |
| | | flush) |
+------------+ +-----------+
Modern CPUs achieve over 90% branch prediction accuracy. The latest Intel and AMD processors use advanced branch predictors such as TAGE (Tagged Geometric History Length).
Out-of-Order Execution
A technique that executes ready instructions first, regardless of their original order.
Original order: Out-of-order:
1. LOAD R1, [addr] 1. LOAD R1, [addr] -- Cache miss! Wait
2. ADD R2, R1, 5 3. MUL R4, R5, R6 -- No R1 needed, run first
3. MUL R4, R5, R6 4. ADD R7, R8, R9 -- No R1 needed, run first
4. ADD R7, R8, R9 2. ADD R2, R1, 5 -- R1 loaded, now execute
The Reorder Buffer (ROB) commits execution results in original order to ensure program correctness.
3. Instruction Set Architecture (ISA)
The ISA is the contract between hardware and software. It defines the specification of instructions that the CPU can understand.
CISC vs RISC
CISC (Complex Instruction Set Computer) RISC (Reduced Instruction Set Computer)
+-----------------------------------+ +-----------------------------------+
| - Complex, diverse instructions | | - Simple, fixed-length instrs |
| - One instruction does much work | | - One instruction does one thing |
| - Variable-length instructions | | - Fixed-length (usually 32-bit) |
| - Direct memory operations | | - Load/Store architecture |
| - Complex decoding | | - Simple decoding |
| - Rep: x86, x86-64 | | - Rep: ARM, RISC-V, MIPS |
+-----------------------------------+ +-----------------------------------+
x86 vs ARM vs RISC-V Comparison
| Feature | x86-64 | ARM (AArch64) | RISC-V |
|---|---|---|---|
| Philosophy | CISC | RISC | RISC |
| Instruction length | Variable (1-15 bytes) | Fixed (4 bytes) | Fixed 4 bytes (2 with C extension) |
| General registers | 16 | 31 | 32 |
| License | Intel/AMD proprietary | ARM license required | Open source (free) |
| Power efficiency | Average | High | High |
| Primary use | Desktop, server | Mobile, Apple Silicon | IoT, academia, new chips |
| Representative | Ryzen, Xeon | Apple M4, Snapdragon | SiFive, XUANTIE |
Secret of modern x86 processors: Externally they accept CISC (x86) instructions, but internally they decompose each one into simple, RISC-like micro-ops for execution. At its core, a modern x86 chip is essentially a RISC engine.
The Rise of RISC-V
RISC-V is an open-source ISA that originated at UC Berkeley. Anyone can freely design RISC-V based chips at no cost.
Its modular design is distinctive, allowing selective extensions on top of the base integer instruction set (I).
RISC-V Extension System
Base: RV32I / RV64I (Integer instructions)
Extensions:
M -- Multiplication/Division
A -- Atomic operations
F -- Single-precision floating point
D -- Double-precision floating point
C -- Compressed instructions (16-bit)
V -- Vector operations
Common combination: RV64IMAFDC (= RV64GC)
4. How RAM Works
RAM (Random Access Memory) temporarily stores data that the CPU is actively working on. It is volatile memory -- data is lost when power is off.
SRAM vs DRAM
SRAM Cell Structure (6-Transistor)

          VDD               VDD
           |                 |
         [ P1 ]            [ P2 ]
           |                 |
     Q ----+----.       .----+---- Q_bar
           |     \     /     |
           |      \   /      |
           |        X        <-- gates cross-coupled
           |      /   \      |
           |     /     \     |
         [ N1 ]           [ N2 ]
           |                 |
          GND               GND

     BL --[ N3 ]-- Q      Q_bar --[ N4 ]-- BL_bar
             |                       |
             WL (word line)          WL

Two cross-coupled inverters (P1/N1 and P2/N2) latch the bit as long as power is applied; access transistors N3 and N4 connect the cell to the bit lines when the word line is asserted.
6 transistors store 1 bit
No refresh needed, very fast
Large and expensive -- used for cache memory
DRAM Cell Structure (1-Transistor, 1-Capacitor)
Word Line (row select)
|
+-+-+
| |
| T |----+---- Bit Line (data I/O)
| | |
+-+-+ |
|
+--+--+
| |
| Cap | Capacitor = 1-bit storage
| | Charged = 1, Discharged = 0
+--+--+
|
GND
1 transistor + 1 capacitor stores 1 bit
Simple structure enables high density
Capacitor charge leaks, requiring periodic refresh
| Feature | SRAM | DRAM |
|---|---|---|
| Cell structure | 6 transistors | 1 transistor + 1 capacitor |
| Speed | Very fast (1-2ns) | Relatively slow (10ns) |
| Refresh | Not needed | Required every 64ms |
| Density | Low | High |
| Cost | Expensive | Inexpensive |
| Use | CPU cache (L1/L2/L3) | Main memory (DDR) |
DRAM Refresh
DRAM capacitors naturally leak charge over time. Data must be periodically read and rewritten to prevent loss. This is called refresh.
DRAM Refresh Process
Time -->
Charge state: ██████████████▓▓▓▓▓░░░░░░░ <-- charge leakage
|
Refresh performed
|
Charge state: ██████████████████████████ <-- charge restored
All rows must be refreshed once within 64ms
8Gb DRAM has about 131,072 rows --> ~488ns per row
During refresh, the row being refreshed cannot be accessed, slightly reducing performance. Various refresh scheduling techniques are used to minimize this.
DDR5 Memory
DDR (Double Data Rate) transfers data on both the rising and falling edges of the clock.
DDR Generation Comparison
Gen | Data Rate | Voltage | Prefetch | Channel Structure
--------|-----------------|---------|---------|------------------
DDR4 | 3200 MT/s | 1.2V | 8n | 1 channel/DIMM
DDR5 | 4800~8800 MT/s | 1.1V | 16n | 2 channels/DIMM
LPDDR5X | ~8533 MT/s | 1.05V | 16n | Mobile optimized
Key improvements in DDR5:
- Dual channel: A single DIMM has two independent 32-bit channels
- On-die ECC: Automatic error correction inside the memory chip
- More banks: 16-32 banks for improved simultaneous access
- Power management: PMIC (Power Management IC) mounted on the DIMM for stable voltage supply
5. GPU Architecture
The GPU (Graphics Processing Unit) was originally designed for graphics rendering, but its massive parallel processing capability has made it the core hardware for AI computation.
CPU vs GPU Design Philosophy
CPU Architecture (Optimized for serial processing)
+--------------------------------------------------------+
| +===========+ +===========+ +===========+ |
| | Core 0 | | Core 1 | | Core 2 | ... |
| | (complex | | (complex | | (complex | |
| | control, | | control, | | control, | |
| | large | | large | | large | |
| | cache, | | cache, | | cache, | |
| | branch | | branch | | branch | |
| | predict) | | predict) | | predict) | |
| +===========+ +===========+ +===========+ |
| Cores: 8-64 / Each core is powerful |
+--------------------------------------------------------+
GPU Architecture (Optimized for parallel processing)
+--------------------------------------------------------+
| +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ |
| |SM| |SM| |SM| |SM| |SM| |SM| |SM| |SM| |SM| |SM| ... |
| +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ |
| (SM = Streaming Multiprocessor) |
| Each SM has hundreds of small, simple cores |
| Total cores: thousands to tens of thousands / simple |
+--------------------------------------------------------+
The CPU processes one complex task quickly with a few powerful cores. The GPU processes thousands of simple tasks simultaneously with thousands of small cores.
Why GPU is Suitable for AI
The core operation of AI training is matrix multiplication. For example, one layer of a neural network performs:
Y = W * X + B
W: Weight matrix (4096 x 4096)
X: Input vector (4096 x 1)
B: Bias vector (4096 x 1)
Multiplications needed: 4096 x 4096 = 16,777,216
This repeats across hundreds of layers --> billions of operations
CPU: 8 cores processing sequentially --> slow
GPU: 16,384 CUDA cores processing in parallel --> fast
Tensor Core: Dedicated matrix operation hardware --> even faster
6. CUDA Programming Basics
CUDA (Compute Unified Device Architecture) is NVIDIA's GPU programming platform. It allows direct use of the GPU's parallel processing capabilities from C/C++ code.
Grid, Block, Thread Hierarchy
CUDA Thread Hierarchy
+--------------------------------------------------+
| Grid |
| +------------+ +------------+ +------------+ |
| | Block(0,0) | | Block(1,0) | | Block(2,0) | |
| | +--+--+ | | +--+--+ | | +--+--+ | |
| | |T0|T1| | | |T0|T1| | | |T0|T1| | |
| | +--+--+ | | +--+--+ | | +--+--+ | |
| | |T2|T3| | | |T2|T3| | | |T2|T3| | |
| | +--+--+ | | +--+--+ | | +--+--+ | |
| +------------+ +------------+ +------------+ |
+--------------------------------------------------+
Grid = Collection of Blocks (1D, 2D, 3D)
Block = Collection of Threads (max 1024 threads)
Thread = Smallest unit of execution
Warp = Bundle of 32 Threads (actual GPU scheduling unit)
Simple CUDA Example -- Vector Addition
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Kernel function executed on the GPU
__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
    // Calculate global thread index
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Only operate within array bounds
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    int N = 1000000;
    size_t size = N * sizeof(float);

    // Allocate host (CPU) memory
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);

    // Initialize data
    for (int i = 0; i < N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Allocate device (GPU) memory
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Copy data host --> device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Kernel launch configuration: round up so every element is covered
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    // Launch kernel
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result device --> host (this also waits for the kernel to finish)
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Verify result
    printf("C[0] = %f (expected 3.0)\n", h_C[0]);

    // Free memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
Key CUDA Performance Optimization Points
1. Memory Coalescing: Design so adjacent threads access adjacent memory addresses.
// Good pattern: coalesced access
// Thread 0 accesses data[0], thread 1 accesses data[1], ...
float val = data[threadIdx.x];
// Bad pattern: non-coalesced access
// Thread 0 accesses data[0], thread 1 accesses data[128], ...
float val = data[threadIdx.x * 128];
2. Shared Memory Usage: Bring frequently accessed data from Global Memory to Shared Memory for reuse.
__shared__ float tile[256];
// Read once from Global Memory
tile[threadIdx.x] = globalData[idx];
__syncthreads();
// Access quickly multiple times from Shared Memory
// (the last thread in the block would need a bounds check for idx + 1)
float result = tile[threadIdx.x] + tile[threadIdx.x + 1];
3. Minimize Warp Divergence: When threads within the same Warp take different branches, execution becomes serial.
// Bad pattern: Half of Warp goes if, half goes else
if (threadIdx.x % 2 == 0) {
doSomething();
} else {
doSomethingElse();
}
// Good pattern: Branch at Warp boundaries
if (threadIdx.x / 32 == 0) {
doSomething(); // Entire Warp 0 takes the same path
} else {
doSomethingElse(); // Entire Warp 1 takes the same path
}
7. ASIC vs FPGA
Beyond general-purpose chips (CPU/GPU), there is hardware optimized for specific purposes.
ASIC (Application-Specific Integrated Circuit)
An ASIC is a custom semiconductor. Designed to perform only specific functions, it is much faster and more power-efficient than CPUs or GPUs for that particular task. However, once manufactured, its purpose cannot be changed.
FPGA (Field-Programmable Gate Array)
An FPGA allows its internal circuits to be reconfigured after shipping. It sits between the performance of ASICs and the flexibility of general-purpose processors.
Generality vs Performance/Efficiency Spectrum
High generality High performance/efficiency
<--------------------------------------------->
CPU GPU FPGA ASIC
| | | |
All Parallel Reconfig- Dedicated
tasks processing urable function
possible specialized logic (unchangeable)
Dev cost: Low <---------------------------> High
Unit cost: High <--------------------------> Low (mass production)
Power efficiency: Low <--------------------> High
Google TPU
Google's TPU (Tensor Processing Unit) is an AI-specific ASIC. It uses a Systolic Array architecture optimized for matrix multiplication and tensor operations.
Bitcoin Mining ASIC
Bitcoin mining involves repeatedly executing the SHA-256 hash function. Initially CPUs were used, then GPUs, and finally ASICs.
Bitcoin Mining Hardware Efficiency Comparison
Hardware        | Hash Rate      | Power    | Efficiency
----------------|----------------|----------|------------------
CPU (i7)        | ~25 MH/s       | 150W     | ~0.17 MH/W
GPU (RTX 4090)  | ~1,500 MH/s    | 350W     | ~4.3 MH/W
ASIC (S21)      | ~200 TH/s      | 3,500W   | ~57 GH/W
Per watt, the ASIC is on the order of 300,000x more efficient at SHA-256 than the CPU, which is why CPU and GPU mining are no longer viable.
8. Semiconductor Manufacturing Process
What Nanometers (nm) Mean
In semiconductor manufacturing, numbers like "5nm" and "3nm" are closer to marketing names. In the past, they directly indicated transistor gate length, but currently they do not directly correspond to actual physical dimensions.
The practically important metric is transistor density (MTr/mm2).
Transistor Density by Process Node (2026 Basis)
Process Node | Density (MTr/mm2) | Major Manufacturers
--------------|-------------------|--------------------
7nm | ~90 | TSMC, Samsung
5nm | ~170 | TSMC, Samsung
4nm | ~200 | TSMC
3nm (N3E) | ~290 | TSMC
3nm (GAA) | ~330 | Samsung
2nm | ~490 (projected) | TSMC, Samsung, Intel
EUV Lithography
Lithography is the core process of drawing circuit patterns onto silicon wafers.
EUV Lithography Principle
EUV Light Source (13.5nm wavelength)
|
v
+----------+
| Multilayer|
| mirror |
| system |
| (6-8) |
+----+-----+
|
v
+----------+
| Mask | <-- Plate with circuit patterns engraved
| (Reticle) |
+----+-----+
|
v
+----------+
| Reduction | <-- Reduces pattern 4x
| lens |
+----+-----+
|
v
+----------+
| Photo- | <-- Chemical sensitive to light
| resist |
+----------+
| Silicon |
| Wafer |
+----------+
Previous ArF source: 193nm --> Multiple patterning (SAQP) required
EUV source: 13.5nm --> Fine patterns possible in a single exposure
ASML is the only company in the world that produces EUV equipment. A single EUV machine costs over 200 million dollars and weighs 180 tons.
Samsung / TSMC / Intel Comparison (2026 Basis)
| Feature | TSMC | Samsung | Intel |
|---|---|---|---|
| Leading process | N2 (2nm) | SF2 (2nm GAA) | Intel 18A |
| Transistor structure | GAA (Nanosheet) | GAA (Nanosheet) | RibbonFET (GAA) |
| Backside power | Applied from N2P | Applied from SF2 | PowerVia (18A) |
| Key customers | Apple, NVIDIA, AMD | Qualcomm, in-house | In-house + foundry |
| Foundry share | ~60% | ~12% | Expanding |
GAA (Gate-All-Around) Transistor
A next-generation transistor structure that overcomes the limitations of the existing FinFET design.
Transistor Structure Evolution
 Planar FET           FinFET              GAA (Nanosheet)
 (~28nm)              (16nm - 5nm)        (3nm and beyond)

    Gate                Gate                 Gate
     |                 __|__                __|__
 ----+----            |     |              | === |   <-- stacked
  channel          ---| Fin |---        ---| === |---  nanosheet
    Bulk              |_____|              |_===_|     channels

 Gate contact:        Gate contact:       Gate contact:
 1 side (top)         3 sides             4 sides (fully wrapped)
Channel control improvement --> Leakage current reduction --> Power efficiency improvement
9. AI Semiconductor Trends
HBM (High Bandwidth Memory)
As AI models grow larger, memory bandwidth has become the biggest bottleneck. HBM stacks DRAM dies vertically to provide ultra-high bandwidth.
HBM Structure
+-----------+
| DRAM Die | <-- 4 layers (HBM2)
+-----------+ 8 layers (HBM2e)
| DRAM Die | 12 layers (HBM3e)
+-----------+ 16 layers (HBM4, planned)
| DRAM Die |
+-----------+
| DRAM Die |
+-----------+
| Base Die |
+-----+-----+
|
TSV (Through-Silicon Via)
Thousands of through-silicon electrodes
|
+-----+-----+
| GPU / SoC |
+------------+
Bandwidth by Generation:
DDR5:  ~51 GB/s (one 64-bit DIMM at 6400 MT/s)
HBM2e: 460 GB/s
HBM3: 819 GB/s
HBM3e: 1,218 GB/s
HBM4: ~2,048 GB/s (projected)
SK Hynix leads the HBM market, with Samsung and Micron in pursuit. The NVIDIA H100 uses HBM3, the H200 uses HBM3e, and the B200 uses 12-stack HBM3e.
NPU (Neural Processing Unit)
NPUs are AI-dedicated processors embedded in mobile/edge devices, used for photo enhancement, speech recognition, and on-device AI.
| Product | NPU Performance | Primary Use |
|---|---|---|
| Apple Neural Engine (M4) | 38 TOPS | Image/video processing, Siri |
| Qualcomm Hexagon (SD 8 Gen 4) | 75 TOPS | On-device LLM, camera |
| Samsung Exynos NPU | 34.7 TOPS | Photo enhancement, translation |
| Intel NPU (Lunar Lake) | 48 TOPS | Windows Copilot+ |
TOPS (Tera Operations Per Second): A unit indicating 1 trillion operations per second.
10. Comprehensive Comparison and Practical Guide
Optimal Hardware by AI Workload
Workload | Recommended Hardware | Reason
-----------------------------|---------------------|---------------------------
LLM Training (10B+ params) | NVIDIA H100/B200 | HBM + Tensor Core + NVLink
LLM Inference (serving) | NVIDIA L40S, TPU | Cost efficiency, high throughput
Image Generation (Diffusion) | RTX 4090/5090 | 24GB VRAM, value
On-device AI | NPU (Apple, QC) | Low power, always-on
Bitcoin Mining | Dedicated ASIC | SHA-256 specialized, max efficiency
Network Packet Processing | FPGA (Xilinx) | Low latency, reconfigurable
Memory Hierarchy Access Time Comparison
Memory Hierarchy Pyramid
+-------+
| Reg | ~0.3ns (~1 cycle)
+---+---+
|
+----+----+
| L1 Cache| ~1ns (~4 cycles)
+----+----+
|
+-----+-----+
| L2 Cache | ~3ns (~12 cycles)
+-----+-----+
|
+------+------+
| L3 Cache | ~10ns (~40 cycles)
+------+------+
|
+--------+--------+
| Main Memory | ~100ns (~400 cycles)
| (DRAM) |
+--------+--------+
|
+-----------+-----------+
| SSD | ~100,000ns (100us)
+-----------+-----------+
|
+-----------+-----------+
| HDD | ~10,000,000ns (10ms)
+------------------------+
Speed: Faster toward the top
Capacity: Larger toward the bottom
Cost: More expensive toward the top
Register access is approximately 33 million times faster than HDD access. This gap is why cache memory and memory hierarchy are decisive for computer performance.
Conclusion
Semiconductors are the foundation of modern technological civilization. From a single transistor's ON/OFF, through the CPU pipeline's sophisticated instruction processing, RAM capacitor charge/discharge, GPU's thousands of parallel cores, and ASIC's extreme efficiency -- everything is connected as one grand system.
With the AI era, the importance of semiconductors is only growing. HBM is resolving memory bandwidth bottlenecks, CXL is enabling cross-device memory sharing, and sub-2nm processes are pushing the limits of transistor density.
For software developers, understanding hardware operation principles is a tremendous help in writing better code. Designing data structures considering cache locality, writing CUDA code considering GPU memory coalescing, and choosing the right hardware for your workload -- the foundation for all these decisions is understanding semiconductors.
References
- Patterson, D. A., Hennessy, J. L. -- Computer Organization and Design (RISC-V Edition)
- NVIDIA CUDA Programming Guide
- IEEE International Solid-State Circuits Conference (ISSCC) Proceedings
- TSMC Technology Symposium 2025
- SK hynix HBM Technical Brief
Quiz: Test Your Semiconductor Knowledge
Q1. What components does a DRAM cell use to store 1 bit?
A: 1 transistor and 1 capacitor store 1 bit. The charged state of the capacitor represents 1, and the discharged state represents 0.
Q2. Why is SRAM faster than DRAM?
A: SRAM uses a flip-flop circuit composed of 6 transistors, maintaining data stably without refresh. Without the need for capacitor charge/discharge, access speed is very fast at 1-2ns.
Q3. What is a Warp in CUDA?
A: The smallest GPU scheduling unit, consisting of 32 threads bundled together. All threads within the same Warp execute the same instruction simultaneously (SIMT model).
Q4. Why are ASICs more efficient than GPUs for specific tasks?
A: ASICs are designed with circuits that perform only specific operations, lacking unnecessary general-purpose logic (branch prediction, cache management, etc.). All transistors are dedicated to the target operation, resulting in much higher processing efficiency per watt.
Q5. What is the biggest advantage of Apple Silicon's unified memory architecture?
A: CPU, GPU, and NPU can directly access the same physical memory, eliminating the data copy process (from CPU RAM to GPU VRAM). This reduces power consumption, decreases latency, and allows all processors to flexibly share limited memory.