Semiconductor Deep Dive -- Complete Guide to CPU, GPU, RAM, ASIC, and CUDA Architecture

Introduction

Inside the smartphones, laptops, and cloud servers we use every day, billions of transistors are ceaselessly processing 0s and 1s. But can you explain exactly how CPU, GPU, and RAM work? Why has the GPU become essential in the AI era, and why are dedicated chips like ASICs and TPUs emerging?

In this article, we start from the physical principles of semiconductors and cover CPU pipelines, RAM cell structures, GPU parallel processing, CUDA programming, and the latest AI semiconductor trends -- all in depth.


1. What is a Semiconductor?

Conductors, Insulators, and Semiconductors

Materials can be classified into three categories based on electrical conductivity.

Classification | Conductivity | Representative Materials | Resistivity Range (Ω·m)
---------------|--------------|--------------------------|------------------------
Conductor      | Very high    | Copper, gold, aluminum   | Below 10^-8
Semiconductor  | Controllable | Silicon, germanium       | 10^-5 to 10^6
Insulator      | Very low     | Glass, rubber, ceramics  | Above 10^10

The key property of semiconductors is that their conductivity can be controlled. By injecting impurities (doping) into pure silicon, N-type semiconductors (rich in electrons) or P-type semiconductors (rich in holes) are created.

How Transistors Work

A transistor is an electronic switch. In an n-channel MOSFET, applying a voltage to the gate forms a conductive channel so current can flow from source to drain; remove the voltage and the current is blocked.

  MOSFET Transistor Structure (n-channel)

              Gate (G)
                 |
            +---------+
            |  Oxide  |
            |  (SiO2) |
   +--------+---------+--------+
   |   N+   |         |   N+   |
   | Source | channel |  Drain |
   +--------+---------+--------+
   |      P-type substrate     |
   +---------------------------+

When billions of these simple ON/OFF switches are combined, complex logical operations become possible.

  • NOT gate: 2 transistors (CMOS inverter) -- inverts input
  • NAND gate: 4 transistors -- the universal building block from which all other logic gates can be made
  • Adder: Combination of multiple gates -- performs addition
  • ALU: Combination of adders and logic gates -- the core arithmetic unit of the CPU

Moore's Law

First observed by Gordon Moore in 1965, this states that the number of transistors on a chip doubles approximately every two years.

Moore's Law Transistor Count Progression

Year     | Transistors      | Representative Processor
---------|-----------------|------------------
1971     | 2,300           | Intel 4004
1985     | 275,000         | Intel 386
1999     | 9,500,000       | Pentium III
2006     | 291,000,000     | Core 2 Duo
2015     | 1,750,000,000   | Skylake
2022     | 114,000,000,000 | Apple M1 Ultra
2024     | 208,000,000,000 | NVIDIA B200

As physical limits approach in the 2020s, performance improvements are being pursued through 3D stacking, chiplets, and new transistor structures such as GAA rather than simple miniaturization.


2. CPU Architecture Deep Dive

The CPU (Central Processing Unit) is the brain of the computer, optimized for executing a sequential stream of instructions as quickly as possible.

Core Components of a CPU

+--------------------------------------------------------+
|                          CPU                           |
|  +-----------+  +------------+  +-------------------+  |
|  | Control   |  |    ALU     |  |     Registers     |  |
|  | Unit      |  | (Arithmetic|  |  +----+  +----+   |  |
|  |           |  |  Logic     |  |  | R0 |  | R1 |   |  |
|  | Instr.    |  |  Unit)     |  |  +----+  +----+   |  |
|  | decode /  |  |            |  |  +----+  +----+   |  |
|  | execution |  |  +-------+ |  |  | R2 |  | R3 |   |  |
|  | flow      |  |  | Adder | |  |  +----+  +----+   |  |
|  | control   |  |  +-------+ |  |  +----+  +----+   |  |
|  +-----------+  |  | Logic | |  |  | PC |  | SP |   |  |
|                 |  +-------+ |  |  +----+  +----+   |  |
|                 +------------+  +-------------------+  |
|  +--------------------------------------------------+  |
|  |                  Cache Memory                    |  |
|  |   L1: 64KB    |   L2: 512KB    |   L3: 32MB      |  |
|  +--------------------------------------------------+  |
+--------------------------------------------------------+

ALU (Arithmetic Logic Unit): Performs actual operations such as addition, subtraction, AND, OR, and XOR.

Control Unit: Decodes instructions and sends control signals to each component.

Registers: The fastest storage inside the CPU. Includes general-purpose registers, the program counter (PC), and the stack pointer (SP).

Cache Memory: Bridges the speed gap between the CPU and main memory. L1 is the fastest and smallest; L3 is the largest and relatively slower.

Pipeline

A technique that divides CPU instruction execution into multiple stages and processes them simultaneously, similar to a factory assembly line.

5-Stage Pipeline (Classic RISC Pipeline)

Time -->   1    2    3    4    5    6    7    8    9
         +----+----+----+----+----+
Instr 1: | IF | ID | EX | MEM| WB |
         +----+----+----+----+----+
              +----+----+----+----+----+
Instr 2:      | IF | ID | EX | MEM| WB |
              +----+----+----+----+----+
                   +----+----+----+----+----+
Instr 3:           | IF | ID | EX | MEM| WB |
                   +----+----+----+----+----+
                        +----+----+----+----+----+
Instr 4:                | IF | ID | EX | MEM| WB |
                        +----+----+----+----+----+
                             +----+----+----+----+----+
Instr 5:                     | IF | ID | EX | MEM| WB |
                             +----+----+----+----+----+

IF  = Instruction Fetch
ID  = Instruction Decode
EX  = Execute
MEM = Memory Access
WB  = Write Back

Without pipelining, one instruction must complete entirely before the next can begin. With a 5-stage pipeline, throughput theoretically improves by 5x.

Pipeline Hazards

There are situations where the pipeline stalls.

Data Hazard: When a previous instruction's result is needed but not yet ready

ADD R1, R2, R3    -- R1 = R2 + R3  (result written at WB)
SUB R4, R1, R5    -- R4 = R1 - R5  (needs R1 but not available yet!)

Solution: Forwarding -- pass EX stage result directly to next instruction

Control Hazard: When the next execution location is unknown due to a branch instruction

Branch Prediction is used to solve this.

Branch Prediction

Branch Prediction Flow

         +--------------------+
         | Branch instruction  |
         |     detected       |
         +--------+-----------+
                  |
         +--------v-----------+
         | Branch History      |
         | Table (BHT) lookup  |
         +--------+-----------+
                  |
          +-------+-------+
          |               |
    +-----v-----+   +----v------+
    | Not taken  |   | Taken     |
    | predicted  |   | predicted |
    +-----+-----+   +----+------+
          |               |
    +-----v-----+   +----v------+
    | Execute    |   | Execute   |
    | next instr |   | branch    |
    |            |   | target    |
    +-----+-----+   +----+------+
          |               |
          +-------+-------+
                  |
         +--------v---------+
         | Verify actual     |
         | result            |
         +--------+---------+
                  |
          +-------+-------+
          |               |
    +-----v-----+   +----v------+
    | Prediction |   | Prediction|
    | hit        |   | miss      |
    | (continue) |   | (pipeline |
    |            |   |  flush)   |
    +------------+   +-----------+

Modern CPUs achieve over 90% branch prediction accuracy. The latest Intel and AMD processors use advanced branch predictors such as TAGE (Tagged Geometric History Length).

Out-of-Order Execution

A technique that executes ready instructions first, regardless of their original order.

Original order:         Out-of-order:
1. LOAD R1, [addr]      1. LOAD R1, [addr]    -- Cache miss! Wait
2. ADD R2, R1, 5        3. MUL R4, R5, R6     -- No R1 needed, run first
3. MUL R4, R5, R6       4. ADD R7, R8, R9     -- No R1 needed, run first
4. ADD R7, R8, R9       2. ADD R2, R1, 5      -- R1 loaded, now execute

The Reorder Buffer (ROB) commits execution results in original order to ensure program correctness.


3. Instruction Set Architecture (ISA)

The ISA is the contract between hardware and software. It defines the specification of instructions that the CPU can understand.

CISC vs RISC

CISC (Complex Instruction Set Computer)     RISC (Reduced Instruction Set Computer)
+-----------------------------------+       +-----------------------------------+
| - Complex, diverse instructions   |       | - Simple, fixed-length instrs     |
| - One instruction does much work  |       | - One instruction does one thing  |
| - Variable-length instructions    |       | - Fixed-length (usually 32-bit)   |
| - Direct memory operations        |       | - Load/Store architecture         |
| - Complex decoding                |       | - Simple decoding                 |
| - Rep: x86, x86-64               |       | - Rep: ARM, RISC-V, MIPS         |
+-----------------------------------+       +-----------------------------------+

x86 vs ARM vs RISC-V Comparison

Feature            | x86-64                | ARM (AArch64)         | RISC-V
-------------------|-----------------------|-----------------------|-------------------------
Philosophy         | CISC                  | RISC                  | RISC
Instruction length | Variable (1-15 bytes) | Fixed (4 bytes)       | Variable (2/4 bytes)
General registers  | 16                    | 31                    | 32
License            | Intel/AMD proprietary | ARM license required  | Open source (free)
Power efficiency   | Average               | High                  | High
Primary use        | Desktop, server       | Mobile, Apple Silicon | IoT, academia, new chips
Representative     | Ryzen, Xeon           | Apple M4, Snapdragon  | SiFive, XUANTIE

A secret of modern x86 processors: externally they accept CISC (x86) instructions, but internally they decompose each one into RISC-like micro-ops for execution. Behind the decoder, an x86 core is essentially a RISC engine.

The Rise of RISC-V

RISC-V is an open-source ISA that originated at UC Berkeley. Anyone can freely design RISC-V based chips at no cost.

Its modular design is distinctive, allowing selective extensions on top of the base integer instruction set (I).

RISC-V Extension System

Base: RV32I / RV64I (Integer instructions)

Extensions:
  M -- Multiplication/Division
  A -- Atomic operations
  F -- Single-precision floating point
  D -- Double-precision floating point
  C -- Compressed instructions (16-bit)
  V -- Vector operations

Common combination: RV64IMAFDC (= RV64GC)

4. How RAM Works

RAM (Random Access Memory) temporarily stores data that the CPU is actively working on. It is volatile memory -- data is lost when power is off.

SRAM vs DRAM

SRAM Cell Structure (6-Transistor)

          VDD
           |
     +-----+-----+
     |            |
   +-+-+        +-+-+
   |   |        |   |
   | P |        | P |
   |   |        |   |
   +-+-+        +-+-+
     |            |
     +-----+-----+------ Q (stored value)
     |     |      |
   +-+-+ +-+-+  +-+-+
   |   | |   |  |   |
   | N | | N |  | N |
   |   | |   |  |   |
   +-+-+ +-+-+  +-+-+
     |     |      |
    GND   BL     BL_bar

  • 6 transistors store 1 bit
  • No refresh needed, very fast
  • Large and expensive -- used for cache memory

DRAM Cell Structure (1-Transistor, 1-Capacitor)

    Word Line (row select)
        |
      +-+-+
      |   |
      | T |----+---- Bit Line (data I/O)
      |   |    |
      +-+-+    |
               |
            +--+--+
            |     |
            | Cap |  Capacitor = 1-bit storage
            |     |  Charged = 1, Discharged = 0
            +--+--+
               |
              GND

  • 1 transistor + 1 capacitor stores 1 bit
  • Simple structure enables high density
  • Capacitor charge leaks, requiring periodic refresh

Feature        | SRAM                 | DRAM
---------------|----------------------|---------------------------
Cell structure | 6 transistors        | 1 transistor + 1 capacitor
Speed          | Very fast (1-2ns)    | Relatively slow (10ns)
Refresh        | Not needed           | Required every 64ms
Density        | Low                  | High
Cost           | Expensive            | Inexpensive
Use            | CPU cache (L1/L2/L3) | Main memory (DDR)

DRAM Refresh

DRAM capacitors naturally leak charge over time. Data must be periodically read and rewritten to prevent loss. This is called refresh.

DRAM Refresh Process

Time -->
Charge state:  ██████████████▓▓▓▓▓░░░░░░░  <-- charge leakage
                                    |
                              Refresh performed
                                    |
Charge state:  ██████████████████████████  <-- charge restored

All rows must be refreshed once within 64ms
8Gb DRAM has about 131,072 rows --> ~488ns per row

During refresh, the row being refreshed cannot be accessed, slightly reducing performance. Various refresh scheduling techniques are used to minimize this.

DDR5 Memory

DDR (Double Data Rate) transfers data on both the rising and falling edges of the clock.

DDR Generation Comparison

Gen     | Data Rate        | Voltage | Prefetch | Channel Structure
--------|-----------------|---------|---------|------------------
DDR4    | 3200 MT/s       | 1.2V    | 8n      | 1 channel/DIMM
DDR5    | 4800~8800 MT/s  | 1.1V    | 16n     | 2 channels/DIMM
LPDDR5X | ~8533 MT/s      | 1.05V   | 16n     | Mobile optimized

Key improvements in DDR5:

  • Dual channel: A single DIMM has two independent 32-bit channels
  • On-die ECC: Automatic error correction inside the memory chip
  • More banks: 16-32 banks for improved simultaneous access
  • Power management: PMIC (Power Management IC) mounted on the DIMM for stable voltage supply

5. GPU Architecture

The GPU (Graphics Processing Unit) was originally designed for graphics rendering, but its massive parallel processing capability has made it the core hardware for AI computation.

CPU vs GPU Design Philosophy

CPU Architecture (Optimized for serial processing)
+--------------------------------------------------------+
|  +===========+  +===========+  +===========+           |
|  |  Core 0   |  |  Core 1   |  |  Core 2   |  ...      |
|  | (complex  |  | (complex  |  | (complex  |           |
|  |  control, |  |  control, |  |  control, |           |
|  |  large    |  |  large    |  |  large    |           |
|  |  cache,   |  |  cache,   |  |  cache,   |           |
|  |  branch   |  |  branch   |  |  branch   |           |
|  |  predict) |  |  predict) |  |  predict) |           |
|  +===========+  +===========+  +===========+           |
|  Cores: 8-64 / Each core is powerful                    |
+--------------------------------------------------------+

GPU Architecture (Optimized for parallel processing)
+--------------------------------------------------------+
| +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+     |
| |SM| |SM| |SM| |SM| |SM| |SM| |SM| |SM| |SM| |SM| ... |
| +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+     |
| (SM = Streaming Multiprocessor)                         |
| Each SM has hundreds of small, simple cores              |
| Total cores: thousands to tens of thousands / simple     |
+--------------------------------------------------------+

The CPU processes one complex task quickly with a few powerful cores. The GPU processes thousands of simple tasks simultaneously with thousands of small cores.

Why GPU is Suitable for AI

The core operation of AI training is matrix multiplication. For example, one layer of a neural network performs:

Y = W * X + B

W: Weight matrix (4096 x 4096)
X: Input vector (4096 x 1)
B: Bias vector (4096 x 1)

Multiplications needed: 4096 x 4096 = 16,777,216
This repeats across hundreds of layers --> billions of operations

CPU: 8 cores processing sequentially --> slow
GPU: 16,384 CUDA cores processing in parallel --> fast
Tensor Core: Dedicated matrix operation hardware --> even faster

6. CUDA Programming Basics

CUDA (Compute Unified Device Architecture) is NVIDIA's GPU programming platform. It allows direct use of the GPU's parallel processing capabilities from C/C++ code.

Grid, Block, Thread Hierarchy

CUDA Thread Hierarchy

+--------------------------------------------------+
|                    Grid                           |
|  +------------+  +------------+  +------------+  |
|  | Block(0,0) |  | Block(1,0) |  | Block(2,0) |  |
|  |  +--+--+   |  |  +--+--+   |  |  +--+--+   |  |
|  |  |T0|T1|   |  |  |T0|T1|   |  |  |T0|T1|   |  |
|  |  +--+--+   |  |  +--+--+   |  |  +--+--+   |  |
|  |  |T2|T3|   |  |  |T2|T3|   |  |  |T2|T3|   |  |
|  |  +--+--+   |  |  +--+--+   |  |  +--+--+   |  |
|  +------------+  +------------+  +------------+  |
+--------------------------------------------------+

Grid  = Collection of Blocks (1D, 2D, 3D)
Block = Collection of Threads (max 1024 threads)
Thread = Smallest unit of execution
Warp  = Bundle of 32 Threads (actual GPU scheduling unit)

Simple CUDA Example -- Vector Addition

#include <stdio.h>

// Kernel function executed on GPU
__global__ void vectorAdd(float *A, float *B, float *C, int N) {
    // Calculate global thread index
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Only operate within array bounds
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    int N = 1000000;
    size_t size = N * sizeof(float);

    // Allocate host (CPU) memory
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);

    // Initialize data
    for (int i = 0; i < N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Allocate device (GPU) memory
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Copy data host --> device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Kernel launch configuration
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    // Launch kernel
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result device --> host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Verify result
    printf("C[0] = %f (expected 3.0)\n", h_C[0]);

    // Free memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);

    return 0;
}

Key CUDA Performance Optimization Points

1. Memory Coalescing: Design so adjacent threads access adjacent memory addresses.

// Good pattern: coalesced access
// Thread 0 accesses data[0], thread 1 accesses data[1], ...
float val = data[threadIdx.x];

// Bad pattern: non-coalesced access
// Thread 0 accesses data[0], thread 1 accesses data[128], ...
float val = data[threadIdx.x * 128];

2. Shared Memory Usage: Bring frequently accessed data from Global Memory to Shared Memory for reuse.

__shared__ float tile[256];

// Read once from Global Memory
tile[threadIdx.x] = globalData[idx];
__syncthreads();

// Access quickly multiple times from Shared Memory
float result = tile[threadIdx.x] + tile[threadIdx.x + 1];

3. Minimize Warp Divergence: When threads within the same Warp take different branches, execution becomes serial.

// Bad pattern: Half of Warp goes if, half goes else
if (threadIdx.x % 2 == 0) {
    doSomething();
} else {
    doSomethingElse();
}

// Good pattern: Branch at Warp boundaries
if (threadIdx.x / 32 == 0) {
    doSomething();      // Entire Warp 0 takes the same path
} else {
    doSomethingElse();  // Entire Warp 1 takes the same path
}

7. ASIC vs FPGA

Beyond general-purpose chips (CPU/GPU), there is hardware optimized for specific purposes.

ASIC (Application-Specific Integrated Circuit)

An ASIC is a custom semiconductor. Designed to perform only specific functions, it is much faster and more power-efficient than CPUs or GPUs for that particular task. However, once manufactured, its purpose cannot be changed.

FPGA (Field-Programmable Gate Array)

An FPGA allows its internal circuits to be reconfigured after shipping. It sits between the performance of ASICs and the flexibility of general-purpose processors.

Generality vs Performance/Efficiency Spectrum

  High generality                         High performance/efficiency
  <---------------------------------------------------------->
  CPU             GPU              FPGA             ASIC
  |               |                |                |
  All tasks       Parallel         Reconfigurable   Dedicated
  possible        processing       logic            function
                  specialist                        (unchangeable)

  Dev cost:          Low  <------------------------------>  High
  Unit cost:         High <------------------------------>  Low (mass production)
  Power efficiency:  Low  <------------------------------>  High

Google TPU

Google's TPU (Tensor Processing Unit) is an AI-specific ASIC. It uses a Systolic Array architecture optimized for matrix multiplication and tensor operations.

Bitcoin Mining ASIC

Bitcoin mining involves repeatedly executing the SHA-256 hash function. Initially CPUs were used, then GPUs, and finally ASICs.

Bitcoin Mining Hardware Efficiency Comparison

Hardware        | Hash Rate   | Power   | Efficiency
----------------|-------------|---------|---------------
CPU (i7)        | ~25 MH/s    | 150W    | ~0.17 MH/W
GPU (RTX 4090)  | ~1,500 MH/s | 350W    | ~4.3 MH/W
ASIC (S21)      | ~200 TH/s   | 3,500W  | ~57,000 MH/W

The ASIC is roughly 335,000x more efficient per watt than the CPU

8. Semiconductor Manufacturing Process

What Nanometers (nm) Mean

In semiconductor manufacturing, numbers like "5nm" and "3nm" are closer to marketing names. In the past, they directly indicated transistor gate length, but currently they do not directly correspond to actual physical dimensions.

The practically important metric is transistor density (MTr/mm2).

Transistor Density by Process Node (2026 Basis)

Process Node  | Density (MTr/mm2) | Major Manufacturers
--------------|-------------------|--------------------
7nm           | ~90               | TSMC, Samsung
5nm           | ~170              | TSMC, Samsung
4nm           | ~200              | TSMC
3nm (N3E)     | ~290              | TSMC
3nm (GAA)     | ~330              | Samsung
2nm           | ~490 (projected)  | TSMC, Samsung, Intel

EUV Lithography

Lithography is the core process of drawing circuit patterns onto silicon wafers.

EUV Lithography Principle

    EUV Light Source (13.5nm wavelength)
         |
         v
    +----------+
    | Multilayer|
    | mirror    |
    | system    |
    | (6-8)     |
    +----+-----+
         |
         v
    +----------+
    |  Mask     |  <-- Plate with circuit patterns engraved
    | (Reticle) |
    +----+-----+
         |
         v
    +----------+
    | Reduction |  <-- Reduces pattern 4x
    | lens      |
    +----+-----+
         |
         v
    +----------+
    | Photo-    |  <-- Chemical sensitive to light
    | resist    |
    +----------+
    | Silicon   |
    | Wafer     |
    +----------+

Previous ArF source: 193nm --> Multiple patterning (SAQP) required
EUV source: 13.5nm --> Fine patterns possible in a single exposure

ASML is the only company in the world that produces EUV equipment. A single EUV machine costs over 200 million dollars and weighs 180 tons.

Samsung / TSMC / Intel Comparison (2026 Basis)

Feature              | TSMC               | Samsung             | Intel
---------------------|--------------------|---------------------|-------------------
Leading process      | N2 (2nm)           | SF2 (2nm GAA)       | Intel 18A
Transistor structure | GAA (Nanosheet)    | GAA (Nanosheet)     | RibbonFET (GAA)
Backside power       | Applied from N2P   | Applied from SF2    | PowerVia (18A)
Key customers        | Apple, NVIDIA, AMD | Qualcomm, in-house  | In-house + foundry
Foundry share        | ~60%               | ~12%                | Expanding

GAA (Gate-All-Around) Transistor

A next-generation transistor structure that overcomes the limitations of the existing FinFET design.

Transistor Structure Evolution

Planar FET        FinFET           GAA (Nanosheet)
(~28nm)           (~16nm~5nm)      (~3nm~)

  Gate              Gate             Gate
   |              __|__            __|__|__
   |             |     |          |  ____  |
---+---       ---| Fin |---    ---|_|    |_|---
  Bulk           |     |          |  ----  |
                 |_____|          |_|    |_|
                                  |  ----  |
                                  |_|    |_|
                                  |________|

Gate contact:     Gate contact:    Gate contact:
1 side (top)      3 sides          4 sides (fully wrapped)

Channel control improvement --> Leakage current reduction --> Power efficiency improvement

9. AI Semiconductor Trends

HBM (High Bandwidth Memory)

As AI models grow larger, memory bandwidth has become the biggest bottleneck. HBM stacks DRAM dies vertically to provide ultra-high bandwidth.

HBM Structure

        +-----------+
        |  DRAM Die |  <-- 4 layers (HBM2)
        +-----------+       8 layers (HBM2e)
        |  DRAM Die |       12 layers (HBM3e)
        +-----------+       16 layers (HBM4, planned)
        |  DRAM Die |
        +-----------+
        |  DRAM Die |
        +-----------+
        | Base Die  |
        +-----+-----+
              |
         TSV (Through-Silicon Via)
         Thousands of through-silicon electrodes
              |
        +-----+-----+
        | GPU / SoC  |
        +------------+

Bandwidth by Generation:
  DDR5:    ~51 GB/s (single channel)
  HBM2e:   460 GB/s
  HBM3:    819 GB/s
  HBM3e:  1,218 GB/s
  HBM4:   ~2,048 GB/s (projected)

SK Hynix leads the HBM market, with Samsung and Micron in pursuit. The NVIDIA H100 uses HBM3, the H200 uses HBM3e, and the B200 uses 12-stack HBM3e.

NPU (Neural Processing Unit)

NPUs are AI-dedicated processors embedded in mobile/edge devices, used for photo enhancement, speech recognition, and on-device AI.

Product                       | NPU Performance | Primary Use
------------------------------|-----------------|-------------------------------
Apple Neural Engine (M4)      | 38 TOPS         | Image/video processing, Siri
Qualcomm Hexagon (SD 8 Gen 4) | 75 TOPS         | On-device LLM, camera
Samsung Exynos NPU            | 34.7 TOPS       | Photo enhancement, translation
Intel NPU (Lunar Lake)        | 48 TOPS         | Windows Copilot+

TOPS (Tera Operations Per Second): A unit indicating 1 trillion operations per second.


10. Comprehensive Comparison and Practical Guide

Optimal Hardware by AI Workload

Workload                     | Recommended Hardware | Reason
-----------------------------|---------------------|---------------------------
LLM Training (10B+ params)  | NVIDIA H100/B200    | HBM + Tensor Core + NVLink
LLM Inference (serving)      | NVIDIA L40S, TPU    | Cost efficiency, high throughput
Image Generation (Diffusion) | RTX 4090/5090       | 24GB VRAM, value
On-device AI                 | NPU (Apple, QC)     | Low power, always-on
Bitcoin Mining               | Dedicated ASIC      | SHA-256 specialized, max efficiency
Network Packet Processing    | FPGA (Xilinx)       | Low latency, reconfigurable

Memory Hierarchy Access Time Comparison

Memory Hierarchy Pyramid

              +-------+
              |  Reg  |  ~0.3ns    (~1 cycle)
              +---+---+
                  |
             +----+----+
             | L1 Cache|  ~1ns     (~4 cycles)
             +----+----+
                  |
           +-----+-----+
           | L2 Cache   |  ~3ns     (~12 cycles)
           +-----+-----+
                 |
          +------+------+
          |  L3 Cache   |  ~10ns    (~40 cycles)
          +------+------+
                 |
        +--------+--------+
        |   Main Memory   |  ~100ns   (~400 cycles)
        |     (DRAM)      |
        +--------+--------+
                 |
     +-----------+-----------+
     |         SSD           |  ~100,000ns  (100us)
     +-----------+-----------+
                 |
     +-----------+-----------+
     |         HDD           |  ~10,000,000ns  (10ms)
     +------------------------+

Speed: Faster toward the top
Capacity: Larger toward the bottom
Cost: More expensive toward the top

Register access is approximately 33 million times faster than HDD access. This gap is why cache memory and memory hierarchy are decisive for computer performance.


Conclusion

Semiconductors are the foundation of modern technological civilization. From a single transistor's ON/OFF, through the CPU pipeline's sophisticated instruction processing, RAM capacitor charge/discharge, GPU's thousands of parallel cores, and ASIC's extreme efficiency -- everything is connected as one grand system.

With the AI era, the importance of semiconductors is only growing. HBM is resolving memory bandwidth bottlenecks, CXL is enabling cross-device memory sharing, and sub-2nm processes are pushing the limits of transistor density.

For software developers, understanding hardware operation principles is a tremendous help in writing better code. Designing data structures considering cache locality, writing CUDA code considering GPU memory coalescing, and choosing the right hardware for your workload -- the foundation for all these decisions is understanding semiconductors.


References

  • Patterson, D. A., Hennessy, J. L. -- Computer Organization and Design (RISC-V Edition)
  • NVIDIA CUDA Programming Guide
  • IEEE International Solid-State Circuits Conference (ISSCC) Proceedings
  • TSMC Technology Symposium 2025
  • SK hynix HBM Technical Brief

Quiz: Test Your Semiconductor Knowledge

Q1. What components does a DRAM cell use to store 1 bit?

A: 1 transistor and 1 capacitor store 1 bit. The charged state of the capacitor represents 1, and the discharged state represents 0.

Q2. Why is SRAM faster than DRAM?

A: An SRAM cell is a latch built from two cross-coupled inverters plus two access transistors (6 transistors total), so it holds data stably without refresh. With no capacitor to charge and discharge, access is very fast at 1-2ns.

Q3. What is a Warp in CUDA?

A: The smallest GPU scheduling unit, consisting of 32 threads bundled together. All threads within the same Warp execute the same instruction simultaneously (SIMT model).

Q4. Why are ASICs more efficient than GPUs for specific tasks?

A: ASICs are designed with circuits that perform only specific operations, lacking unnecessary general-purpose logic (branch prediction, cache management, etc.). All transistors are dedicated to the target operation, resulting in much higher processing efficiency per watt.

Q5. What is the biggest advantage of Apple Silicon's unified memory architecture?

A: CPU, GPU, and NPU can directly access the same physical memory, eliminating the data copy process (from CPU RAM to GPU VRAM). This reduces power consumption, decreases latency, and allows all processors to flexibly share limited memory.