Concurrency & Parallelism Deep Dive: Multithreading, Async, and Event Loops Explained

Introduction: Why Is Concurrency So Hard?

Concurrency bugs have real-world consequences: large-scale outages at AWS and Cloudflare have both been traced back to race conditions. Without a solid mental model, concurrent programming produces bugs that are nearly impossible to reproduce or debug.

Rob Pike (co-creator of Go) made the distinction clear:

"Concurrency is about dealing with lots of things at once.
 Parallelism is about doing lots of things at once.
 Concurrency is about structure.
 Parallelism is about execution."
Rob Pike
Concurrency:                    Parallelism:
One person juggling two tasks   Two people each doing one task

   Task A ━━━━━━╸               Task A ━━━━━━━━━━━━
   Task B ━━━━━━╸               Task B ━━━━━━━━━━━━
   Task A ━━━━━━╸
   CPU Core 1 ━━━━━━━━          CPU Core 1 ━━━━━━━━
                                CPU Core 2 ━━━━━━━━
   (time-slicing)               (physically simultaneous)

This guide systematically covers the theoretical foundations of concurrency and parallelism, language-specific implementations, practical patterns, and common bugs with their solutions.


1. Threading Models

1.1 1:1 Model (Kernel Threads)

Threads managed directly by the OS. Each user thread maps to one kernel thread.

User threads:    T1     T2     T3     T4
                 │      │      │      │
Kernel threads: KT1    KT2    KT3    KT4
                 │      │      │      │
CPU cores:     Core1  Core2  Core1  Core2

Languages: Java (traditional), C/C++ (pthread), Rust

1.2 N:1 Model (Green Threads)

Runtime maps multiple user threads to a single OS thread.

User threads:   T1  T2  T3  T4  T5  T6
                 └───┴───┴──┬──┴───┴──┘
                            │
Kernel thread:             KT1
CPU core:                 Core1

Pros: lightweight, fast context switching. Cons: cannot use multiple cores, and one blocking syscall blocks every thread. Used by: early Java (green threads), early Ruby.

1.3 M:N Model (Hybrid)

Maps M user threads to N OS threads, combining the lightness of user threads with true multi-core execution:

User threads:   G1  G2  G3  G4  G5  G6  G7  G8
                └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘
                   │       │       │       │
Kernel threads:  KT1     KT2     KT3     KT4
                   │       │       │       │
CPU cores:       Core1   Core2   Core3   Core4

Languages: Go (goroutines), Java 21+ (Virtual Threads), Erlang (processes)

1.4 Threading Model Comparison

┌─────────────────┬──────────────┬──────────────┬──────────────┐
│ Category        │ 1:1          │ N:1          │ M:N          │
├─────────────────┼──────────────┼──────────────┼──────────────┤
│ Thread cost     │ ~1MB stack   │ ~KB stack    │ ~KB stack    │
│ Creation speed  │ Slow (~ms)   │ Fast (~us)   │ Fast (~us)   │
│ Context switch  │ Expensive    │ Cheap (user) │ Cheap (user) │
│ Multi-core      │ Yes          │ No           │ Yes          │
│ I/O blocking    │ Thread only  │ All blocked  │ Thread only  │
│ Max threads     │ ~thousands   │ ~millions    │ ~millions    │
│ Complexity      │ Low          │ Medium       │ High         │
│ Examples        │ Java, C++,   │ Early Ruby   │ Go, Erlang,  │
│                 │ Rust         │              │ Java 21+     │
└─────────────────┴──────────────┴──────────────┴──────────────┘

2. Language-by-Language Concurrency

2.1 Go — Goroutines and Channels

package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    results := make(chan string, 10)

    urls := []string{
        "https://api.example.com/users",
        "https://api.example.com/orders",
        "https://api.example.com/products",
    }

    for _, url := range urls {
        wg.Add(1)
        go func(u string) {
            defer wg.Done()
            results <- fmt.Sprintf("Fetched: %s", u)
        }(url)
    }

    go func() {
        wg.Wait()
        close(results)
    }()

    for result := range results {
        fmt.Println(result)
    }
}

Characteristics: CSP (Communicating Sequential Processes) based. "Don't communicate by sharing memory; share memory by communicating."

2.2 Java 21+ — Virtual Threads

import java.util.concurrent.*;
import java.util.List;

public class VirtualThreadExample {
    public static void main(String[] args) throws Exception {
        // Million concurrent tasks with Virtual Threads
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<String>> futures = List.of(
                executor.submit(() -> fetchURL("https://api.example.com/users")),
                executor.submit(() -> fetchURL("https://api.example.com/orders")),
                executor.submit(() -> fetchURL("https://api.example.com/products"))
            );

            for (var future : futures) {
                System.out.println(future.get());
            }
        }

        // Structured Concurrency (Preview)
        try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
            var user = scope.fork(() -> fetchUser(1));
            var order = scope.fork(() -> fetchOrder(1));

            scope.join();
            scope.throwIfFailed();

            System.out.println(user.get() + ", " + order.get());
        }
    }

    // stub helpers so the example compiles
    static String fetchURL(String url) { return "Fetched: " + url; }
    static String fetchUser(int id)    { return "User " + id; }
    static String fetchOrder(int id)   { return "Order " + id; }
}

Characteristics: Virtual threads added via Project Loom. Millions of concurrent tasks with minimal code changes.

2.3 Python — asyncio and the GIL

import asyncio
import aiohttp

async def fetch_url(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = [
        "https://api.example.com/users",
        "https://api.example.com/orders",
        "https://api.example.com/products",
    ]

    async with aiohttp.ClientSession() as session:
        tasks = [fetch_url(session, url) for url in urls]
        results = await asyncio.gather(*tasks)

        for url, result in zip(urls, results):
            print(f"Fetched {url}: {len(result)} bytes")

asyncio.run(main())

Characteristics: The GIL (Global Interpreter Lock) prevents true parallelism for CPU-bound work. asyncio is effective for I/O-bound tasks. Use multiprocessing for CPU-bound work.

2.4 Rust — tokio Async Runtime

use tokio;
use reqwest;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let urls = vec![
        "https://api.example.com/users",
        "https://api.example.com/orders",
        "https://api.example.com/products",
    ];

    let mut handles = vec![];

    for url in urls {
        let handle = tokio::spawn(async move {
            let resp = reqwest::get(url).await?;
            let body = resp.text().await?;
            Ok::<String, reqwest::Error>(body)
        });
        handles.push(handle);
    }

    for handle in handles {
        match handle.await? {
            Ok(body) => println!("Fetched: {} bytes", body.len()),
            Err(e) => eprintln!("Error: {}", e),
        }
    }

    Ok(())
}

Characteristics: Ownership system prevents data races at compile time. Send/Sync traits guarantee thread safety.

2.5 JavaScript — Event Loop

async function fetchAll() {
    const urls = [
        'https://api.example.com/users',
        'https://api.example.com/orders',
        'https://api.example.com/products',
    ];

    // Concurrent execution with Promise.all
    const results = await Promise.all(
        urls.map(url => fetch(url).then(r => r.json()))
    );

    // Promise.allSettled allows failures
    const settled = await Promise.allSettled(
        urls.map(url => fetch(url).then(r => r.json()))
    );

    settled.forEach((result, i) => {
        if (result.status === 'fulfilled') {
            console.log(`Success: ${urls[i]}`);
        } else {
            console.log(`Failed: ${urls[i]} - ${result.reason}`);
        }
    });
}

Characteristics: Single thread + event loop. CPU-bound work requires Worker Threads. Optimized for I/O.

2.6 Erlang/Elixir — Actor Model

%% Erlang processes (lightweight actors)
-module(counter).
-export([start/0, increment/1, get/1]).

start() ->
    spawn(fun() -> loop(0) end).

loop(Count) ->
    receive
        {increment, From} ->
            From ! {ok, Count + 1},
            loop(Count + 1);
        {get, From} ->
            From ! {value, Count},
            loop(Count);
        stop ->
            ok
    end.

increment(Pid) ->
    Pid ! {increment, self()},
    receive {ok, NewCount} -> NewCount end.

get(Pid) ->
    Pid ! {get, self()},
    receive {value, Count} -> Count end.

Characteristics: About 300 bytes per process. Millions of concurrent processes. "Let it crash" philosophy for fault tolerance.

2.7 Language Comparison Table

┌─────────────┬─────────────┬────────────────┬──────────────┬────────────────┐
│ Language    │ Model       │ Unit           │ Scheduling   │ Key Feature    │
├─────────────┼─────────────┼────────────────┼──────────────┼────────────────┤
│ Go          │ CSP (M:N)   │ Goroutine      │ Runtime      │ Channels       │
│ Java 21+    │ M:N         │ Virtual Thread │ JVM          │ Backward compat│
│ Python      │ Event loop  │ Coroutine      │ asyncio      │ GIL limit      │
│ Rust        │ async/await │ Future/Task    │ tokio etc.   │ Zero-cost      │
│ JavaScript  │ Event loop  │ Promise        │ libuv/V8     │ Single thread  │
│ Erlang      │ Actor (M:N) │ Process        │ BEAM VM      │ Fault tolerant │
│ Kotlin      │ Coroutines  │ Coroutine      │ Dispatcher   │ Structured     │
└─────────────┴─────────────┴────────────────┴──────────────┴────────────────┘

3. Event Loop Deep Dive

3.1 Node.js Event Loop Phases

┌───────────────────────────┐
│         timers            │  ← setTimeout, setInterval
│   (execute timer cbs)     │
├───────────────────────────┤
│     pending callbacks     │  ← system error callbacks
│   (deferred I/O cbs)      │
├───────────────────────────┤
│       idle, prepare       │  ← internal use
├───────────────────────────┤
│          poll             │  ← I/O callbacks (most)
│   (wait for new I/O)      │
├───────────────────────────┤
│          check            │  ← setImmediate callbacks
├───────────────────────────┤
│     close callbacks       │  ← socket.on('close')
└───────────────────────────┘
   Between each phase: microtask queue is drained
   (process.nextTick first, then Promise.then / queueMicrotask)

3.2 Microtasks vs Macrotasks

console.log('1: sync');

setTimeout(() => console.log('2: macro (setTimeout)'), 0);

Promise.resolve().then(() => console.log('3: micro (Promise)'));

queueMicrotask(() => console.log('4: micro (queueMicrotask)'));

console.log('5: sync');

// Output order:
// 1: sync
// 5: sync
// 3: micro (Promise)
// 4: micro (queueMicrotask)
// 2: macro (setTimeout)

Rule: Synchronous code first, then all microtasks, then one macrotask, repeat.

3.3 libuv Architecture

┌────────────────────────────────────────────┐
│                  Node.js                   │
│  ┌──────────────────────────────────┐      │
│  │     JavaScript (V8 Engine)       │      │
│  │  ┌───────────┐  ┌────────────┐   │      │
│  │  │  Async    │  │ Event      │   │      │
│  │  │  APIs     │  │ Queue      │   │      │
│  │  └─────┬─────┘  └──────┬─────┘   │      │
│  └────────┼────────────────┼────────┘      │
│           │   Event Loop   │               │
│  ┌────────▼────────────────▼─────────┐     │
│  │              libuv                │     │
│  │  ┌──────────┐  ┌──────────────┐   │     │
│  │  │ Thread   │  │ epoll/kqueue │   │     │
│  │  │ Pool     │  │ /IOCP        │   │     │
│  │  │ (4 thds) │  │ (non-block)  │   │     │
│  │  └──────────┘  └──────────────┘   │     │
│  └───────────────────────────────────┘     │
│       │                  │                 │
│   File I/O           Network I/O           │
│   DNS lookup         TCP/UDP               │
│   Crypto             HTTP                  │
└────────────────────────────────────────────┘

File I/O and DNS are handled via the thread pool, while network I/O uses the OS's async mechanisms (epoll, kqueue, IOCP).


4. Coroutines and Structured Concurrency

4.1 Python async/await

import asyncio

async def process_item(item: str) -> str:
    await asyncio.sleep(1)  # I/O simulation
    return f"Processed: {item}"

async def main():
    items = ["a", "b", "c", "d", "e"]

    # Concurrent execution
    tasks = [process_item(item) for item in items]
    results = await asyncio.gather(*tasks)
    print(results)  # All complete in 1 second

    # Timeout
    try:
        result = await asyncio.wait_for(
            process_item("slow"),
            timeout=0.5
        )
    except asyncio.TimeoutError:
        print("Timeout!")

    # Semaphore to limit concurrency
    semaphore = asyncio.Semaphore(3)

    async def limited_process(item):
        async with semaphore:
            return await process_item(item)

    results = await asyncio.gather(*[limited_process(i) for i in items])

asyncio.run(main())

4.2 Kotlin Coroutines and Structured Concurrency

import kotlinx.coroutines.*

data class User(val id: Int, val name: String)

suspend fun fetchUser(id: Int): User {
    delay(1000)
    return User(id, "User $id")
}

suspend fun fetchOrders(id: Int): List<String> {
    delay(1000)
    return listOf("Order for user $id")
}

fun main() = runBlocking {
    // Structured concurrency: parent cancel propagates to children
    coroutineScope {
        val user = async { fetchUser(1) }
        val orders = async { fetchOrders(1) }

        println("User: ${user.await()}")
        println("Orders: ${orders.await()}")
    } // Waits for all child coroutines

    // SupervisorScope: child failure doesn't affect siblings
    supervisorScope {
        val job1 = launch {
            delay(100)
            throw RuntimeException("Failed!")
        }
        val job2 = launch {
            delay(200)
            println("Job2 completed") // Still runs
        }
    }
}

4.3 Structured Concurrency Principles

Traditional concurrency:        Structured concurrency:
┌─────────────┐                ┌─────────────┐
│   Parent    │                │   Parent    │
│  ┌───┐ ┌───┐│                │  ┌───┐ ┌───┐│
│  │ A │ │ B ││                │  │ A │ │ B ││
│  └─┬─┘ └─┬─┘│                │  └─┬─┘ └─┬─┘│
└────┼─────┼──┘                └────┼─────┼──┘
     │     │ ← Who manages?         │     │ ← Parent manages
     ▼     ▼   Leak possible!       ▼     ▼   Auto cleanup!

Rules:
1. All concurrent work starts within a scope
2. Scope waits for all child tasks to complete
3. Parent cancellation cancels all children
4. Child errors propagate to parent

5. Actor Model

5.1 Concept

In the actor model, each actor:

  • Has its own state (not shared with other actors)
  • Sends and receives messages asynchronously
  • Can create new actors, change state, or send messages when processing a message
┌──────────┐     message     ┌──────────┐
│ Actor A  │ ──────────────→ │ Actor B  │
│ State: x │     message     │ State: y │
│ Mailbox  │ ←────────────── │ Mailbox  │
└──────────┘                 └──────────┘
      │ message
      ▼
┌──────────┐
│ Actor C  │
│ State: z │
│ Mailbox  │
└──────────┘

Key: No shared state, no locks, only message passing

5.2 Erlang/OTP Supervision Trees

              ┌──────────────┐
              │  Supervisor  │
              │ (one_for_one)│
              └──────┬───────┘
                ┌────┼────┐
                ▼    ▼    ▼
             ┌─────┐ ┌─────┐ ┌─────┐
             │ W1  │ │ W2  │ │ W3  │
             └─────┘ └─────┘ └─────┘

When W2 crashes:
1. Supervisor detects it
2. Only W2 restarts (one_for_one strategy)
3. W1, W3 are unaffected
4. If restarted 5+ times in 5 minutes, Supervisor also terminates

5.3 When to Use the Actor Model

Good fit:
- Managing millions of independent states (chat rooms, IoT devices)
- Distributed systems (inter-node communication)
- Fault-tolerance required (telecom, finance)
- Real-time systems (game servers, trading)

Poor fit:
- Simple CRUD applications
- Strong consistency required
- Transaction processing is key
- Order guarantees critical in pipelines

6. Synchronization Primitives

6.1 Mutex

// Go - Mutex
type SafeMap struct {
    mu   sync.Mutex
    data map[string]int
}

func (m *SafeMap) Set(key string, val int) {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.data[key] = val
}

func (m *SafeMap) Get(key string) (int, bool) {
    m.mu.Lock()
    defer m.mu.Unlock()
    val, ok := m.data[key]
    return val, ok
}

// Java - ReentrantLock
import java.util.concurrent.locks.ReentrantLock;

public class SafeCounter {
    private final ReentrantLock lock = new ReentrantLock();
    private int count = 0;

    public void increment() {
        lock.lock();
        try {
            count++;
        } finally {
            lock.unlock();
        }
    }
}

6.2 RWLock (Read-Write Lock)

type Cache struct {
    mu   sync.RWMutex
    data map[string]string
}

// Multiple goroutines can read simultaneously
func (c *Cache) Get(key string) (string, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()
    val, ok := c.data[key]
    return val, ok
}

// Write requires exclusive lock
func (c *Cache) Set(key, val string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.data[key] = val
}

6.3 Semaphore

import asyncio

async def worker(name, semaphore):
    async with semaphore:
        print(f"{name} acquired semaphore")
        await asyncio.sleep(1)
        print(f"{name} released semaphore")

async def main():
    semaphore = asyncio.Semaphore(3)  # max 3 concurrent
    tasks = [worker(f"Worker-{i}", semaphore) for i in range(10)]
    await asyncio.gather(*tasks)

asyncio.run(main())

6.4 Condition Variable

type BoundedQueue struct {
    mu       sync.Mutex
    notEmpty *sync.Cond
    notFull  *sync.Cond
    items    []interface{}
    maxSize  int
}

func NewBoundedQueue(maxSize int) *BoundedQueue {
    q := &BoundedQueue{maxSize: maxSize}
    q.notEmpty = sync.NewCond(&q.mu)
    q.notFull = sync.NewCond(&q.mu)
    return q
}

func (q *BoundedQueue) Put(item interface{}) {
    q.mu.Lock()
    defer q.mu.Unlock()

    for len(q.items) == q.maxSize {
        q.notFull.Wait() // wait until space available
    }
    q.items = append(q.items, item)
    q.notEmpty.Signal() // wake consumer
}

func (q *BoundedQueue) Take() interface{} {
    q.mu.Lock()
    defer q.mu.Unlock()

    for len(q.items) == 0 {
        q.notEmpty.Wait() // wait until item available
    }
    item := q.items[0]
    q.items = q.items[1:]
    q.notFull.Signal() // wake producer
    return item
}

7. Lock-Free Programming

7.1 CAS (Compare-And-Swap)

import "sync/atomic"

type LockFreeCounter struct {
    value int64
}

func (c *LockFreeCounter) Increment() int64 {
    for {
        old := atomic.LoadInt64(&c.value)
        new := old + 1
        if atomic.CompareAndSwapInt64(&c.value, old, new) {
            return new
        }
        // CAS failed, retry (another goroutine changed it first)
    }
}

func (c *LockFreeCounter) Get() int64 {
    return atomic.LoadInt64(&c.value)
}

7.2 The ABA Problem

Thread A: reads value A
                          Thread B: changes A to B
                          Thread B: changes B back to A
Thread A: CAS(A -> C) succeeds! (but value changed in between)

Solutions:
- Add version numbers (A,v1 -> B,v2 -> A,v3)
- Tagged pointers (AtomicStampedReference in Java)
- Hazard Pointers

7.3 When NOT to Use Lock-Free

Lock-free is good for:
- Very high contention scenarios
- Real-time systems (avoiding priority inversion)
- Simple data structures (counters, stacks, queues)

Avoid lock-free when (most cases):
- Regular applications (Mutex is sufficient)
- Complex data structures (trees, graphs)
- Correctness verification is difficult
- Team is not experienced with lock-free

"If a Mutex works, use a Mutex."

8. Common Concurrency Bugs

8.1 Race Condition

// Bug: multiple goroutines modifying counter simultaneously
var counter int

func buggyIncrement() {
    for i := 0; i < 1000; i++ {
        go func() {
            counter++ // read-modify-write is not atomic
        }()
    }
}

// Fix: use atomic or mutex
var safeCounter int64

func safeIncrement() {
    for i := 0; i < 1000; i++ {
        go func() {
            atomic.AddInt64(&safeCounter, 1)
        }()
    }
}

8.2 Deadlock

Four necessary conditions (all must hold for deadlock):
1. Mutual Exclusion: resource used by only one at a time
2. Hold and Wait: holding resource while waiting for another
3. No Preemption: cannot forcibly take held resource
4. Circular Wait: A waits for B, B waits for C, C waits for A

// Deadlock example
var mu1, mu2 sync.Mutex

func goroutine1() {
    mu1.Lock()
    defer mu1.Unlock()
    time.Sleep(time.Millisecond)
    mu2.Lock()       // waiting for mu2 (held by goroutine2)
    defer mu2.Unlock()
}

func goroutine2() {
    mu2.Lock()
    defer mu2.Unlock()
    time.Sleep(time.Millisecond)
    mu1.Lock()       // waiting for mu1 (held by goroutine1)
    defer mu1.Unlock()
}

// Fix: always acquire locks in the same order
func fixed1() {
    mu1.Lock()
    defer mu1.Unlock()
    mu2.Lock()
    defer mu2.Unlock()
}

func fixed2() {
    mu1.Lock()       // mu1 first (same order)
    defer mu1.Unlock()
    mu2.Lock()
    defer mu2.Unlock()
}

8.3 Livelock

A and B meet in a narrow corridor:
A: steps left  <-> B: steps right (same direction!)
A: steps right <-> B: steps left  (same direction again!)
... (infinite loop, nobody progresses)

vs Deadlock: doing nothing
vs Livelock: constantly moving but no progress

8.4 Starvation

High-priority threads continuously hold resources, preventing low-priority threads from ever executing.

8.5 Priority Inversion

1997 Mars Pathfinder incident:

Low-priority task holds mutex
   |
Medium-priority task preempts Low (unrelated to mutex)
   |
High-priority task waits for mutex... forever!
(Low needs to run to release mutex, but Medium keeps preempting)

Solution: Priority Inheritance Protocol
Low temporarily inherits High's priority while holding the mutex

8.6 Detection Tools

# Go: Race Detector
go test -race ./...
go run -race main.go

# C/C++: ThreadSanitizer
clang -fsanitize=thread -g program.c && ./a.out

# Rust: Compiler detects at compile time!
# Send/Sync trait violations = compile error

# Python: asyncio debug mode
PYTHONASYNCIODEBUG=1 python app.py

# General: Helgrind (Valgrind tool)
valgrind --tool=helgrind ./program

9. Practical Patterns

9.1 Producer-Consumer

func producerConsumer() {
    ch := make(chan int, 100) // buffered channel

    // Producers
    var prodWg sync.WaitGroup
    for i := 0; i < 3; i++ {
        prodWg.Add(1)
        go func(id int) {
            defer prodWg.Done()
            for j := 0; j < 10; j++ {
                ch <- id*100 + j
            }
        }(i)
    }

    go func() {
        prodWg.Wait()
        close(ch)
    }()

    // Consumers
    var consWg sync.WaitGroup
    for i := 0; i < 2; i++ {
        consWg.Add(1)
        go func(id int) {
            defer consWg.Done()
            for item := range ch {
                fmt.Printf("Consumer %d: %d\n", id, item)
            }
        }(i)
    }

    consWg.Wait()
}

9.2 Thread Pool / Worker Pool

import asyncio
from asyncio import Queue

async def process(item) -> None:
    await asyncio.sleep(0.01)  # stand-in for real work

async def worker(name: str, queue: Queue):
    while True:
        item = await queue.get()
        try:
            await process(item)
        finally:
            queue.task_done()

async def main():
    queue = Queue(maxsize=100)

    # Start 5 workers
    workers = [
        asyncio.create_task(worker(f"worker-{i}", queue))
        for i in range(5)
    ]

    # Add tasks
    for item in range(50):
        await queue.put(item)

    await queue.join()

    for w in workers:
        w.cancel()

asyncio.run(main())

9.3 Work Stealing

Traditional thread pool:        Work Stealing:
┌─────────────┐                ┌─────────────┐
│ Shared queue│                │ Worker 1    │
│ [T1|T2|T3]  │                │ local: [T1] │
│             │                ├─────────────┤
│ All workers │                │ Worker 2    │
│ contend!    │                │ local: [T2] │ ← steals from W1 when empty
│             │                ├─────────────┤
└─────────────┘                │ Worker 3    │
                               │ local: [T3] │
                               └─────────────┘

Benefits: Better cache locality, reduced contention
Used by: Java ForkJoinPool, Go runtime, Tokio (Rust)

10. Performance

10.1 Amdahl's Law

Speedup = 1 / ((1 - P) + P/N)

P = fraction that can be parallelized
N = number of processors

Example: 95% parallelizable program on 64 cores
Speedup = 1 / (0.05 + 0.95/64) = 1 / 0.0648 = 15.4x

Key insight: No matter how many cores, the serial portion (5%)
is the bottleneck! Maximum theoretical speedup = 1/0.05 = 20x

10.2 Gustafson's Law

Speedup = N - alpha * (N - 1)

alpha = fraction of serial work
N = number of processors

Key difference:
- Amdahl: "Fixed problem size, add more cores?"
- Gustafson: "Fixed cores, grow the problem?"

In reality, as data grows, the parallel portion also grows.
Gustafson is more optimistic and realistic.

10.3 Context Switch Cost

Context Switch Cost Comparison:
┌───────────────────────────┬──────────────┐
│ Switch Type               │ Approx Cost  │
├───────────────────────────┼──────────────┤
│ Function call             │ ~1-5 ns      │
│ Goroutine switch          │ ~100-200 ns  │
│ Coroutine switch (Python) │ ~1-5 us      │
│ Thread switch (same proc) │ ~1-10 us     │
│ Process switch            │ ~10-50 us    │
│ VM switch                 │ ~100+ us     │
└───────────────────────────┴──────────────┘

10.4 False Sharing (Cache Line Problem)

Problem: Different variables on the same cache line

CPU Core 1          CPU Core 2
    |                   |
+---v-------------------v---+
| Cache Line (64 bytes)     |
| [counter_A] [counter_B]   |  <- Different variables but
|                           |     same cache line!
+---------------------------+

Core 1 modifies counter_A -> invalidates cache line -> Core 2 must reload
Core 2 modifies counter_B -> invalidates cache line -> Core 1 must reload

Fix: padding to separate cache lines.

// Preventing False Sharing in Go
type PaddedCounter struct {
    value int64
    _pad  [56]byte // 64-byte cache line padding
}

type Counters struct {
    c1 PaddedCounter
    c2 PaddedCounter
}

11. Interview Questions (15)

  1. What is the difference between concurrency and parallelism? Concurrency is about structuring tasks to alternate. Parallelism is physically executing simultaneously. Concurrency is possible on a single core; parallelism is not.

  2. What is the difference between a process and a thread? Processes have independent memory spaces; threads share memory within a process. Threads are cheaper to create/switch but require synchronization.

  3. What are the 4 necessary conditions for deadlock and how to solve each? Mutual exclusion, hold and wait, no preemption, circular wait. Ordered lock acquisition (preventing circular wait) is most practical.

  4. Why are Go goroutines lighter than OS threads? About 2KB stack (OS threads 1MB+), M:N scheduler maps millions of goroutines to few OS threads, user-space context switching.

  5. What is Python's GIL and why does it exist? Global Interpreter Lock. CPython's reference counting memory management is not thread-safe. Use multiprocessing for CPU-bound parallel work.

  6. Pros and cons of event loops vs multithreading? Event loop: simpler model, no context switch overhead, poor for CPU-bound. Multithreading: maximizes CPU utilization, complex synchronization, deadlock risk.

  7. How is async/await different from threads? async/await is cooperative multitasking, switching only at await points. Threads are preemptive, OS can switch at any time.

  8. Pros and cons of the actor model? Pros: no shared state, no deadlocks, natural for distributed systems. Cons: message ordering hard, debugging complex, performance overhead.

  9. What is CAS? When to use instead of Mutex? Compare-And-Swap is an atomic compare-exchange operation. Faster than Mutex for simple counters or pointer updates, but unsuitable for complex logic.

  10. What is false sharing and how to fix it? Performance degradation when different cores modify different variables on the same cache line. Fix by padding variables onto separate cache lines.

  11. What does Amdahl's Law tell us? The serial portion determines the upper bound on speedup. Reducing the serial fraction may matter more than optimizing parallel work.

  12. Difference between Java Virtual Threads and Go goroutines? Both are M:N models, but Go is built around channel-based CSP while Java keeps the existing Thread API for backward compatibility. Go shipped goroutines first; Java 21 made virtual threads official.

  13. What is Structured Concurrency? Binding concurrent tasks to a scope for lifecycle management. Parent cancel propagates to children, parent waits for all children. Prevents goroutine/thread leaks.

  14. Explain Erlang's "Let it crash" philosophy. Rather than defensively handling errors, quickly terminate the process and let the supervisor restart it. Eliminates complex error recovery code.

  15. What is the right thread pool size? CPU-bound: number of CPU cores. I/O-bound: CPU cores * (1 + wait_time/compute_time). Can estimate using Little's Law (L = lambda * W).


12. Practice Quiz

Q1: What is the output order of this JavaScript code?
console.log('A');
setTimeout(() => console.log('B'), 0);
Promise.resolve().then(() => console.log('C'));
console.log('D');

Answer: A, D, C, B. Synchronous code (A, D) first, then microtasks (Promise: C), then macrotasks (setTimeout: B).

Q2: Why does this Go code deadlock?
func main() {
    ch := make(chan int)
    ch <- 42
    fmt.Println(<-ch)
}

Answer: An unbuffered channel requires both sender and receiver to be present simultaneously. The main goroutine blocks at ch <- 42 waiting for a receiver and can never reach <-ch.

Q3: What is the difference between asyncio.gather and asyncio.wait in Python?

Answer: gather returns all results in order and raises an exception if one fails (configurable with return_exceptions=True). wait returns sets of completed/pending tasks with FIRST_COMPLETED, FIRST_EXCEPTION, and ALL_COMPLETED options. wait is more flexible.

Q4: According to Amdahl's Law, what is the maximum speedup for a program that is 90% parallelizable?

Answer: Even with infinite cores, maximum speedup = 1/(1-0.9) = 1/0.1 = 10x. The 10% serial portion becomes the bottleneck. With 64 cores: 1/(0.1 + 0.9/64) = approximately 8.8x.

Q5: How does Rust prevent data races at compile time?

Answer: Rust's ownership system and Send/Sync traits are the key. Send means ownership can be transferred between threads; Sync means references can be shared between threads. Only one mutable reference can exist at a time (borrow checker), which prevents data races at compile time entirely.


References

  1. Rob Pike - Concurrency Is Not Parallelism (Talk)
  2. The Art of Multiprocessor Programming (Herlihy & Shavit)
  3. Java Concurrency in Practice (Goetz et al.)
  4. JEP 444: Virtual Threads
  5. Node.js Event Loop Documentation
  6. Python asyncio Documentation
  7. Tokio Tutorial (Rust)
  8. Erlang/OTP Design Principles
  9. Go Concurrency Patterns
  10. Kotlin Coroutines Guide
  11. Lock-Free Programming (Preshing)
  12. Amdahl's Law (Wikipedia)
  13. LMAX Disruptor (Lock-Free Queue)
  14. Structured Concurrency (Wikipedia)
  15. Thread Sanitizer (Google)