
Digital Logic Design Complete Guide: From Boolean Algebra to CPU Design

1. Introduction to Digital Systems

Analog vs Digital

Analog signals carry continuous values while digital signals use only discrete states — 0 and 1. Modern electronics overwhelmingly favors digital systems for several key reasons:

  • Noise immunity: Only two levels to distinguish, making the system highly resistant to electrical noise.
  • Perfect reproduction: Copying digital data introduces no signal degradation.
  • Logical processing: Boolean algebra provides a rigorous mathematical foundation.
  • Storage: Flip-flops and memory cells store state reliably and indefinitely.

Number System Conversions

Digital systems regularly use binary (base-2), octal (base-8), and hexadecimal (base-16) representations.

Decimal to Binary — repeated division by 2:
  45 ÷ 2 = 22 remainder 1
  22 ÷ 2 = 11 remainder 0
  11 ÷ 2 =  5 remainder 1
   5 ÷ 2 =  2 remainder 1
   2 ÷ 2 =  1 remainder 0
   1 ÷ 2 =  0 remainder 1
  Result: 101101 (binary)
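The repeated-division procedure above translates directly into a short Python sketch (illustrative only; the function name is ours):

```python
def dec_to_bin(n):
    """Decimal to binary by repeated division by 2; remainders are
    collected LSB-first, so the string is reversed at the end."""
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        bits.append(str(n % 2))   # remainder = next bit (LSB first)
        n //= 2                   # quotient feeds the next division
    return "".join(reversed(bits))

print(dec_to_bin(45))   # → 101101, matching the worked example
```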
Decimal  Binary  Octal  Hex
   0      0000     0     0
   5      0101     5     5
  10      1010    12     A
  15      1111    17     F
  16     10000    20    10

Signed Number Representation

Two's complement is the universal standard for representing negative integers in hardware.

+5 in binary:        0000 0101
Computing -5:
  One's complement:  1111 1010
  Add 1:             1111 1011  → -5 in two's complement

Range for an 8-bit two's complement number: -128 to +127.
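The invert-and-add-one procedure can be sketched in Python using bit masking (function names are ours, for illustration):

```python
def to_twos(value, bits=8):
    """Encode a signed integer as an n-bit two's complement bit string."""
    assert -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1
    return format(value & ((1 << bits) - 1), "0{}b".format(bits))

def from_twos(s):
    """Decode an n-bit two's complement bit string to a signed integer."""
    raw, bits = int(s, 2), len(s)
    return raw - (1 << bits) if raw >= (1 << (bits - 1)) else raw

print(to_twos(-5))            # → 11111011, matching the worked example
print(from_twos("10000000"))  # → -128, the most negative 8-bit value
```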

BCD and Gray Code

  • BCD (Binary Coded Decimal): Each decimal digit is encoded as 4 bits. Example: 29 = 0010 1001.
  • Gray code: Adjacent codewords differ by exactly one bit, minimizing transition errors in mechanical encoders and analog-to-digital converters.
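The binary/Gray transforms are each a couple of XORs; a minimal Python sketch:

```python
def binary_to_gray(n):
    """Standard binary-to-Gray transform: XOR each bit with its left neighbor."""
    return n ^ (n >> 1)

def gray_to_binary(g):
    """Inverse transform: cascade XORs of successively shorter right shifts."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

# Adjacent codewords differ in exactly one bit:
print([format(binary_to_gray(i), "03b") for i in range(4)])
# → ['000', '001', '011', '010']
```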

2. Boolean Algebra

Fundamental Operations

Three primitive operations form the foundation of Boolean algebra:

  • AND (logical product): A · B — output is 1 only when both inputs are 1.
  • OR (logical sum): A + B — output is 1 when at least one input is 1.
  • NOT (complement): A' — inverts the input.

Boolean Laws

Commutative:   A + B = B + A,      A · B = B · A
Associative:   A + (B+C) = (A+B) + C
Distributive:  A · (B+C) = AB + AC
Identity:      A + 0 = A,  A · 1 = A
Complement:    A + A' = 1, A · A' = 0
Idempotent:    A + A = A,  A · A = A

De Morgan's Theorems — essential for NAND/NOR-based implementations:

(A · B)' = A' + B'
(A + B)' = A' · B'
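Both theorems can be checked exhaustively over the four input pairs, using 0/1 integers with 1 - x standing in for the complement:

```python
# Exhaustive truth-table check of De Morgan's theorems.
for A in (0, 1):
    for B in (0, 1):
        assert 1 - (A & B) == (1 - A) | (1 - B)   # (A·B)' = A' + B'
        assert 1 - (A | B) == (1 - A) & (1 - B)   # (A+B)' = A' · B'
print("both theorems hold on all four input pairs")
```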

Canonical Forms

  • SOP (Sum of Products): Sum of minterms. Example: F = AB' + A'B + AB
  • POS (Product of Sums): Product of maxterms. Example: F = (A+B)(A'+C)

Any Boolean function can be expressed in either canonical form, making these forms the starting point for logic minimization.


3. Logic Gates

Truth Tables for All Basic Gates

AND Gate:            OR Gate:             NOT Gate:
A  B  A·B            A  B  A+B            A  A'
0  0   0             0  0   0             0   1
0  1   0             0  1   1             1   0
1  0   0             1  0   1
1  1   1             1  1   1

NAND Gate:           NOR Gate:
A  B  (A·B)'         A  B  (A+B)'
0  0    1            0  0    1
0  1    1            0  1    0
1  0    1            1  0    0
1  1    0            1  1    0

XOR Gate:            XNOR Gate:
A  B  A⊕B            A  B  (A⊕B)'
0  0   0             0  0    1
0  1   1             0  1    0
1  0   1             1  0    0
1  1   0             1  1    1

Universal Gates (NAND and NOR)

NAND alone can implement any Boolean function:

NOT A   = A NAND A
A AND B = (A NAND B) NAND (A NAND B)
A OR B  = (A NAND A) NAND (B NAND B)

This universality is commercially significant — an entire chip family can be fabricated from a single cell type, simplifying the manufacturing process. NOR gates are equally universal.
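The three NAND constructions above can be verified directly in Python (a sketch, with our own helper names):

```python
def nand(a, b):
    """The single primitive: output is 0 only when both inputs are 1."""
    return 1 - (a & b)

def not_(a):    return nand(a, a)
def and_(a, b): return nand(nand(a, b), nand(a, b))
def or_(a, b):  return nand(nand(a, a), nand(b, b))

# Exhaustive check against Python's own Boolean operators:
for a in (0, 1):
    for b in (0, 1):
        assert and_(a, b) == (a & b)
        assert or_(a, b) == (a | b)
    assert not_(a) == 1 - a
print("NOT, AND, OR all realized from NAND alone")
```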

CMOS Implementation Overview

Modern digital ICs use CMOS (Complementary Metal-Oxide-Semiconductor) technology:

  • NMOS transistor: Turns ON when gate is HIGH (pull-down network).
  • PMOS transistor: Turns ON when gate is LOW (pull-up network).
  • CMOS inverter: series PMOS + NMOS between VDD and GND.
  • Static power consumption is nearly zero, which explains CMOS dominance in low-power design.

4. Karnaugh Maps

The Karnaugh map (K-map) provides a visual, systematic method for minimizing Boolean expressions by identifying groups of adjacent 1s.

4-Variable K-map Example

          CD
AB      00  01  11  10
00  [    1   0   1   1  ]
01  [    1   1   1   0  ]
11  [    0   1   1   0  ]
10  [    0   0   1   1  ]

Grouping Rules

  1. Group sizes must be powers of 2: 1, 2, 4, 8, or 16.
  2. Groups must be rectangular (including squares).
  3. The map wraps around on all edges (toroidal topology).
  4. Use the fewest, largest possible groups.
  5. Don't-care conditions (X) may be included in groups to enlarge them.

Prime Implicants

A prime implicant is a group that cannot be further expanded. An essential prime implicant covers at least one minterm not covered by any other prime implicant — it must appear in the minimal SOP expression.


5. Combinational Logic Circuits

A combinational circuit's output depends solely on current inputs — there is no internal state or memory.

Half Adder and Full Adder

Half Adder:
  Sum   S = A ⊕ B
  Carry C = A · B

Full Adder:
  Sum   S    = A ⊕ B ⊕ Cin
  Carry Cout = A·B + Cin·(A ⊕ B)

Cascading full adders creates a Ripple Carry Adder — simple but slow because carry propagates serially. The Carry Lookahead Adder (CLA) computes carries in parallel for high-speed arithmetic.
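A behavioral sketch of the ripple structure in Python (LSB-first bit lists; names are ours):

```python
def full_adder(a, b, cin):
    """One stage: S = A xor B xor Cin, Cout = AB + Cin(A xor B)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(a_bits, b_bits):
    """Chain full adders; each stage's carry-out feeds the next stage's
    carry-in, which is exactly why the ripple adder is slow."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry

# 5 + 3 = 8 with 4-bit operands, LSB first:
print(ripple_carry_add([1, 0, 1, 0], [1, 1, 0, 0]))   # → ([0, 0, 0, 1], 0)
```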

Multiplexer (MUX)

A 2^n-to-1 MUX selects one of 2^n data inputs using n select lines.

4-to-1 MUX (select lines S1, S0):
  S1=0, S0=0 → Y = I0
  S1=0, S0=1 → Y = I1
  S1=1, S0=0 → Y = I2
  S1=1, S0=1 → Y = I3

A 2^n-to-1 MUX can implement any n-variable Boolean function directly.
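The trick is to wire the function's truth-table outputs to the data inputs, with the variables driving the select lines. A Python sketch:

```python
def mux(data, select):
    """2^n-to-1 MUX: n select bits (MSB first) index one of 2^n data inputs."""
    index = 0
    for bit in select:
        index = (index << 1) | bit
    return data[index]

# Implement F(A, B) = A XOR B by loading its truth table as data inputs:
xor_inputs = [0, 1, 1, 0]        # rows for (A,B) = 00, 01, 10, 11
print(mux(xor_inputs, [1, 0]))   # → 1  (A=1, B=0)
```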

Encoders and Decoders

  • Encoder: Converts 2^n input lines to an n-bit binary code (e.g., priority encoder handles multiple simultaneous inputs).
  • Decoder: Converts an n-bit code to up to 2^n output lines (e.g., 3-to-8 decoder asserts one of eight outputs).

A decoder combined with OR gates directly implements any sum-of-minterms (SOP) expression.


6. Sequential Logic Circuits

Sequential circuits depend on both current inputs and internal state (memory). They are the building blocks of registers, counters, and processors.

Latches

The SR latch is the simplest storage element, built from cross-coupled NAND or NOR gates.

SR Latch truth table:
S  R  Q(next)
0  0  Q (no change)
1  0  1 (Set)
0  1  0 (Reset)
1  1  Undefined (forbidden)

The D latch eliminates the forbidden state by tying S = D and R = D'. When Enable is HIGH, Q tracks D; when Enable is LOW, Q retains its value.

Flip-Flops

Flip-flops are edge-triggered — state changes only on the rising or falling clock edge, not during the entire clock period. This strict timing discipline is essential for synchronous design.

D flip-flop:   Q(next) = D

JK flip-flop:
  J=0, K=0 → Q (hold)
  J=1, K=0 → Q = 1 (Set)
  J=0, K=1 → Q = 0 (Reset)
  J=1, K=1 → Q' (Toggle)

T flip-flop:
  T=0 → Q (hold)
  T=1 → Q' (Toggle)

The JK flip-flop has no undefined input combination. The T flip-flop is the natural choice for counter design.


7. Registers and Counters

Shift Registers

Shift registers chain multiple flip-flops so data moves (shifts) through them with each clock cycle.

  • SISO (Serial In, Serial Out): Data enters and leaves serially.
  • SIPO (Serial In, Parallel Out): Serial data is loaded, then all bits are read simultaneously.
  • PISO (Parallel In, Serial Out): Parallel data is loaded, then transmitted bit by bit.
  • PIPO (Parallel In, Parallel Out): Complete parallel load and parallel read.

Applications: serial-to-parallel conversion, CRC generation, pseudo-random number generators (LFSR), and fast multiply/divide by powers of 2.
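As one concrete application, a 4-bit LFSR with feedback taps chosen from the primitive polynomial x^4 + x^3 + 1 visits all 15 nonzero states before repeating (a Python sketch of the shift-and-feedback step):

```python
def lfsr_step(state):
    """One shift of a 4-bit Fibonacci LFSR, taps at bits 3 and 2
    (polynomial x^4 + x^3 + 1, maximal-length)."""
    fb = ((state >> 3) ^ (state >> 2)) & 1   # XOR the tapped bits
    return ((state << 1) | fb) & 0xF          # shift left, feed back into bit 0

state, seen = 0b0001, set()
while state not in seen:
    seen.add(state)
    state = lfsr_step(state)
print(len(seen))   # → 15: every nonzero 4-bit state appears once
```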

Counters

3-bit Asynchronous (Ripple) Counter:
  Three T flip-flops chained in series.
  0 → 1 → 2 → 3 → 4 → 5 → 6 → 7 → 0 → ...
  Drawback: ripple delay accumulates with each stage.
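The ripple behavior can be modeled in Python by letting each T stage toggle when the previous stage's output falls from 1 to 0 (a behavioral sketch, ignoring the per-stage delay it describes):

```python
def ripple_step(q):
    """One clock of a 3-bit ripple counter built from T flip-flops.
    q = [Q0, Q1, Q2]; each stage is clocked by the falling edge of the
    previous stage's output, so toggles propagate stage by stage."""
    q = q[:]
    q[0] ^= 1                  # stage 0 toggles on every clock
    if q[0] == 0:              # Q0 fell 1 → 0: clock stage 1
        q[1] ^= 1
        if q[1] == 0:          # Q1 fell 1 → 0: clock stage 2
            q[2] ^= 1
    return q

q, values = [0, 0, 0], []
for _ in range(9):
    values.append(q[2] * 4 + q[1] * 2 + q[0])
    q = ripple_step(q)
print(values)   # → [0, 1, 2, 3, 4, 5, 6, 7, 0]
```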

3-bit Synchronous Counter:
  All flip-flops share a common clock.
  All outputs change simultaneously → no ripple delay.

A Modulo-N counter cycles through 0 to N-1. Example: a BCD counter (modulo-10) cycles 0 through 9 and is widely used in digital clocks.


8. Finite State Machines (FSM)

An FSM is the abstract model underlying all sequential logic systems. It consists of a finite set of states, transition conditions, and outputs.

Moore vs Mealy Machines

  • Moore machine: Output depends only on the current state. Output labels appear inside state bubbles.
  • Mealy machine: Output depends on current state AND inputs. Output labels appear on transitions.

Mealy machines generally require fewer states than equivalent Moore machines.

Vending Machine FSM Example

States: S0 (0 cents), S1 (10 cents), S2 (20 cents), S3 (dispense)
Inputs: D (dime = 10c), Q (quarter = 25c)

Transitions:
  S0 --D--> S1
  S0 --Q--> S2 (with 5c change output)
  S1 --D--> S2
  S1 --Q--> S3 (dispense, 5c change)
  S2 --D--> S3 (dispense)
  S2 --Q--> S3 (dispense, 15c change)
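The transition list above maps directly onto a dictionary keyed by (state, input), the usual way to prototype a Mealy machine in software (state names and output strings mirror the example; None means no output):

```python
TRANSITIONS = {
    ("S0", "D"): ("S1", None),
    ("S0", "Q"): ("S2", "5c change"),
    ("S1", "D"): ("S2", None),
    ("S1", "Q"): ("S3", "dispense, 5c change"),
    ("S2", "D"): ("S3", "dispense"),
    ("S2", "Q"): ("S3", "dispense, 15c change"),
}

def run(coins, state="S0"):
    """Feed a coin sequence through the machine, collecting Mealy outputs."""
    outputs = []
    for coin in coins:
        state, out = TRANSITIONS[(state, coin)]
        if out:
            outputs.append(out)
    return state, outputs

print(run(["D", "D", "D"]))   # → ('S3', ['dispense'])
```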

FSM Design Procedure

  1. Analyze the problem and define all states.
  2. Draw the state diagram.
  3. Construct the state table (next-state and output).
  4. Assign binary codes to states (state encoding).
  5. Derive next-state and output Boolean equations.
  6. Minimize with Karnaugh maps.
  7. Implement with flip-flops and combinational gates.

9. Verilog HDL Fundamentals

Verilog HDL (Hardware Description Language) allows designers to describe digital circuits textually, enabling simulation and automated synthesis to physical gates.

Module Structure

module module_name (
    input  wire        clk,
    input  wire        reset,
    input  wire [3:0]  data_in,
    output reg  [3:0]  data_out
);
    wire [3:0] temp;

    // Combinational logic — assign
    assign temp = data_in & 4'b1111;

    // Sequential logic — always
    always @(posedge clk or posedge reset) begin
        if (reset)
            data_out <= 4'b0000;
        else
            data_out <= temp;
    end
endmodule

D Flip-Flop

module d_ff (
    input  wire clk,
    input  wire reset,
    input  wire d,
    output reg  q
);
    always @(posedge clk or posedge reset) begin
        if (reset)
            q <= 1'b0;
        else
            q <= d;
    end
endmodule

4-bit Synchronous Counter

module counter_4bit (
    input  wire       clk,
    input  wire       reset,
    output reg  [3:0] count
);
    always @(posedge clk or posedge reset) begin
        if (reset)
            count <= 4'b0000;
        else
            count <= count + 1'b1;
    end
endmodule

Testbench

module tb_counter;
    reg        clk, reset;
    wire [3:0] count;

    counter_4bit uut (
        .clk(clk),
        .reset(reset),
        .count(count)
    );

    // 10 ns clock period
    initial clk = 0;
    always #5 clk = ~clk;

    initial begin
        reset = 1;
        #20 reset = 0;
        #200 $finish;
    end

    initial begin
        $monitor("t=%0t reset=%b count=%b", $time, reset, count);
    end
endmodule

Verilog Data Types Summary

Type       Description                             Typical Use
wire       Combinational net, holds driven value   assign statements, port connections
reg        Can store a value between updates       Inside always blocks
integer    32-bit signed integer                   Loop variables, simulation
parameter  Named constant                          Parameterizing module widths

10. FPGA Fundamentals

FPGA vs ASIC vs Microprocessor

Attribute                FPGA                          ASIC             MCU / CPU
Reconfigurable           Yes                           No               Via software
Peak performance         Medium–High                   Highest          General purpose
NRE cost                 None                          Very high        Low
Unit cost (high volume)  High                          Very low         Low
Parallelism              Excellent                     Excellent        Limited
Best for                 Prototyping, moderate volume  Mass production  Control tasks

FPGA Internal Architecture

  • CLB (Configurable Logic Block): Contains LUTs, flip-flops, and MUXes.
  • LUT (Look-Up Table): An n-input LUT implements any n-variable Boolean function by storing the truth table directly in SRAM.
  • IOB (I/O Block): Interface between the FPGA fabric and external pins; supports multiple voltage standards and I/O protocols.
  • Block RAM (BRAM): Dedicated on-chip memory blocks for storing data arrays.
  • DSP Blocks: Hardened fast multiply-accumulate units optimized for signal processing.
  • PLL / Clock Management: On-chip phase-locked loops for frequency synthesis and clock distribution.
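The LUT idea is simple enough to model in a few lines of Python: store the full truth table in a list (standing in for the SRAM cells) and evaluate by a single indexed read.

```python
class LUT:
    """n-input look-up table: any n-variable Boolean function becomes
    a 2^n-entry table; evaluation is one indexed read."""
    def __init__(self, truth_table):
        self.table = truth_table

    def __call__(self, *inputs):
        index = 0
        for bit in inputs:          # MSB-first input ordering
            index = (index << 1) | bit
        return self.table[index]

# "Program" a 3-input LUT as the full-adder sum S = A xor B xor Cin:
sum_lut = LUT([0, 1, 1, 0, 1, 0, 0, 1])
print(sum_lut(1, 0, 1))   # → 0  (1 xor 0 xor 1)
```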

FPGA Design Flow

1. Design Entry — write HDL (Verilog/VHDL) or use a schematic editor
2. Synthesis — convert HDL to a technology-independent gate-level netlist
3. Implementation
   ├── Technology Mapping — map netlist to FPGA primitives (LUTs, FFs)
   ├── Placement (Place) — assign primitives to physical FPGA resources
   └── Routing (Route) — connect placed elements with on-chip wires
4. Timing Analysis — verify all setup/hold constraints are met
5. Bitstream Generation — produce the programming file
6. Device Programming — download bitstream to FPGA via JTAG

Leading FPGA Vendors

  • AMD-Xilinx: Spartan (cost-optimized), Artix, Kintex, Virtex (high-performance), Zynq (ARM + FPGA SoC).
  • Intel (Altera): Cyclone (low cost), Arria (mid-range), Stratix (high-end).
  • Lattice Semiconductor: iCE40 (ultra-low power, open-source toolchain available).

11. Basic CPU Design

ALU (Arithmetic Logic Unit)

The ALU is the computational heart of any processor.

ALU Inputs:
  A, B      — n-bit operands
  OpCode    — selects the operation

ALU Outputs:
  Result    — n-bit result
  Zero      — asserted when Result = 0
  Carry     — asserted on unsigned overflow
  Overflow  — asserted on signed overflow
  Negative  — asserted when Result < 0

Example operations:
  000 → A + B (addition)
  001 → A - B (subtraction)
  010 → A AND B
  011 → A OR B
  100 → A XOR B
  101 → NOT A
  110 → A << 1 (logical left shift)
  111 → A >> 1 (logical right shift)
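A behavioral Python model of this 8-operation ALU (the carry flag is simplified here to unsigned overflow/borrow; real hardware derives the flags from the adder):

```python
def alu(op, a, b, bits=8):
    """Behavioral ALU sketch: select an operation by 3-bit opcode,
    mask the result to n bits, and derive Zero and Carry flags."""
    mask = (1 << bits) - 1
    raw = {
        0b000: a + b,
        0b001: a - b,
        0b010: a & b,
        0b011: a | b,
        0b100: a ^ b,
        0b101: ~a,
        0b110: a << 1,
        0b111: a >> 1,
    }[op]
    result = raw & mask
    zero = int(result == 0)
    carry = int(raw > mask or raw < 0)   # unsigned overflow or borrow
    return result, zero, carry

print(alu(0b000, 200, 100))   # → (44, 0, 1): 300 wraps in 8 bits, carry set
```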

Register File

A register file provides fast, on-chip storage for operands and results. A typical 32-register file has two read ports and one write port.

module register_file (
    input  wire        clk,
    input  wire        we,
    input  wire [4:0]  rs1, rs2, rd,
    input  wire [31:0] write_data,
    output wire [31:0] read_data1,
    output wire [31:0] read_data2
);
    reg [31:0] regs [0:31];

    assign read_data1 = regs[rs1];
    assign read_data2 = regs[rs2];

    always @(posedge clk) begin
        if (we && rd != 0)
            regs[rd] <= write_data;
    end
endmodule

Simple CPU Datapath

IF  (Instruction Fetch):  read instruction from memory at PC
ID  (Instruction Decode): read registers, generate control signals
EX  (Execute):            ALU performs computation
MEM (Memory Access):      load/store data memory
WB  (Write Back):         write result to register file

Pipelining

A 5-stage pipeline overlaps execution of multiple instructions to increase throughput:

Cycle:      1   2   3   4   5   6   7   8   9
Instr 1:   IF  ID  EX  ME  WB
Instr 2:       IF  ID  EX  ME  WB
Instr 3:           IF  ID  EX  ME  WB
Instr 4:               IF  ID  EX  ME  WB
Instr 5:                   IF  ID  EX  ME  WB
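The diagram's cycle count follows from a simple formula: after the pipeline fills, one instruction completes per cycle.

```python
def pipeline_cycles(n_instr, n_stages=5):
    """Ideal (hazard-free) pipeline: the first instruction needs n_stages
    cycles; each later instruction completes one cycle after the previous."""
    return n_stages + (n_instr - 1)

print(pipeline_cycles(5))   # → 9, as in the diagram (vs 5 * 5 = 25 unpipelined)
```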

Pipeline hazards disrupt ideal throughput:

  • Structural: Two instructions need the same hardware resource simultaneously.
  • Data (most commonly RAW; WAR and WAW arise in out-of-order machines): An instruction reads a value not yet written back by a previous instruction. Resolved with forwarding, stalls, or compiler scheduling.
  • Control: Branch outcome unknown until late in the pipeline. Resolved with branch prediction, delayed branching, or speculative execution.

12. Connecting to AI Hardware

GPU Digital Logic Architecture

A modern GPU contains thousands of small ALU cores (CUDA cores, SIMD lanes) operating in parallel. This massive parallelism maps perfectly onto the matrix multiplications that dominate deep learning workloads.

  • NVIDIA Tensor Cores: Perform mixed-precision (FP16 × FP16 accumulate to FP32) matrix operations in a single clock cycle.
  • SIMT execution model: Thousands of threads execute the same instruction on different data simultaneously.

TPU and NPU

  • TPU (Tensor Processing Unit): Google's custom ASIC built around a systolic array of MAC (Multiply-Accumulate) units. Designed exclusively for neural network inference and training.
  • NPU (Neural Processing Unit): Edge-device AI accelerators. Examples: Apple Neural Engine, Qualcomm Hexagon DSP, Samsung Exynos NPU.

FPGA-Based AI Acceleration

FPGAs occupy a useful niche in AI acceleration:

  • Low latency: Suitable for real-time inference (autonomous driving, industrial vision).
  • Flexibility: Reconfigure the bitstream when the model changes without replacing hardware.
  • Custom precision: Implement 4-bit or 8-bit arithmetic to balance accuracy and resource usage.
  • Real-world example: Microsoft Project Brainwave deploys FPGAs in Azure data centers for DNN inference.

Neuromorphic Computing

Neuromorphic chips model the brain's neuron–synapse architecture directly in silicon.

  • Intel Loihi 2: Implements Spiking Neural Networks (SNNs) with on-chip learning.
  • IBM TrueNorth: 1 million neurons and 256 million synapses on a single chip with milliwatt power consumption.
  • Event-driven, asynchronous computation enables ultra-low power operation — a compelling advantage for always-on edge AI.

13. Learning Roadmap

[Foundations]
  Boolean Algebra → Logic Gates → Karnaugh Maps
[Combinational Circuits]
  Adders, MUX, Decoder, Encoder
[Sequential Circuits]
  Latches → Flip-Flops → Registers → Counters
[System-Level Design]
  FSM → Datapath → Control Unit
[HDL and FPGA]
  Verilog → Simulation → FPGA Implementation
[Processor Architecture]
  ALU → Pipelining → Memory Hierarchy
[AI Hardware]
  GPU Architecture → TPU/NPU → FPGA Acceleration

Quiz

Q1. Why is NAND called a universal gate?

Answer: Because NAND alone can implement NOT, AND, and OR, making it possible to build any Boolean function from NAND gates only.

Explanation:

  • NOT A = A NAND A
  • A AND B = (A NAND B) NAND (A NAND B)
  • A OR B = (A NAND A) NAND (B NAND B)

Since NOT, AND, and OR form a functionally complete set, any logic circuit can be realized with NAND gates exclusively. NOR gates share the same property.

Q2. What is the most negative value representable in 8-bit two's complement, and why is the range asymmetric?

Answer: -128 (binary: 1000 0000). The range -128 to +127 is asymmetric because the pattern 0000 0000 represents zero, leaving only 127 codes for positive values while all 128 patterns with MSB = 1 represent negatives.

Explanation: With 8 bits there are 256 possible patterns. One pattern (0000 0000) represents zero, leaving 127 patterns for positive values. Negating 1000 0000 (invert, then add 1) yields 1000 0000 again, because +128 would need 9 bits; the pattern therefore maps exclusively to -128. This asymmetry is inherent to two's complement representation.

Q3. What is the key difference between a D latch and a D flip-flop?

Answer: A D latch is level-triggered — it passes D to Q continuously while Enable is HIGH. A D flip-flop is edge-triggered — it samples D only at the clock edge and holds the value until the next edge.

Explanation: Level-triggered behavior makes latches difficult to use in synchronous designs because the output can change multiple times while the enable is asserted. Edge-triggered flip-flops provide strict single-update-per-cycle timing, which is the foundation of clocked synchronous design methodology.

Q4. What is a Don't-care condition in a Karnaugh map and when is it used?

Answer: A Don't-care (X) is an input combination that either cannot occur or whose output value is irrelevant to the system. The designer may assign it 0 or 1 to maximize grouping and minimize the final expression.

Explanation: For example, in a circuit that processes BCD digits (0–9), input combinations 1010 through 1111 (10–15) never appear during normal operation. Marking these six minterms as Don't-cares allows larger groups in the K-map, reducing the number of gates required.

Q5. What is a data hazard in a pipelined CPU, and what are the main techniques to handle it?

Answer: A data hazard occurs when an instruction needs a value that has not yet been written back by an earlier instruction still in the pipeline (a Read-After-Write dependency).

Resolution techniques:

  • Forwarding (Bypassing): Wire the ALU output directly to the next instruction's ALU input, skipping the register file write-back. Resolves most RAW hazards without stalling.
  • Stalling (Pipeline bubble): Insert no-operation cycles until the dependent value is available. Simple but reduces throughput.
  • Compiler scheduling: Reorder independent instructions to fill the slots between a producer and its consumer, avoiding stalls without hardware forwarding.

References