The Complete Actor Model Guide 2025: Erlang OTP, Akka, Orleans, Elixir — The Real Way to Handle Millions of Concurrent Connections
- Author: Youngju Kim (@fjvbn20031)
Introduction: The Miracle of 2 Million Connections
The WhatsApp Legend
In 2012, a WhatsApp engineer published a benchmark:
"2,000,000 TCP connections on a single server with FreeBSD + Erlang."
It became legend. C10K (10,000 concurrent connections) was considered hard at the time — but 2 million? How was this possible?
The answer was Erlang's Actor Model. Erlang, created in 1986 for Ericsson's telephone switches, was designed from the ground up for concurrency, distribution, and fault tolerance.
What is the Actor Model?
Proposed in 1973 by Carl Hewitt, the Actor Model is fundamentally different from traditional programming:
"Everything is an actor. An actor receives messages, sends messages, and creates new actors."
- No shared memory: All communication happens through messages.
- No locks: State lives inside actors only.
- Asynchronous: Messages are delivered asynchronously.
- Fault isolation: One actor's failure doesn't propagate to others.
From this simple model, remarkable systems are built.
What This Article Covers
- The mathematical definition of the Actor Model.
- Erlang/OTP: The original and the reference.
- Akka: The JVM's actor implementation.
- Elixir: Erlang's modern syntax.
- Orleans: Microsoft's .NET Actor.
- The Let it crash philosophy.
- Supervision tree patterns.
- Distributed actor systems.
Why This Knowledge Matters
- Concurrency: Handling millions of connections. Essential for modern backends.
- Fault tolerance: "Let it crash" instead of defensive recovery.
- Distributed systems: Same code for local and remote actors.
- Elixir: Modern frameworks like Phoenix LiveView.
- WhatsApp, Discord, Riot: The choice of major companies.
1. The Theory of the Actor Model
Three Fundamental Operations
Carl Hewitt's original definition:
- Create: Create a new actor.
- Send: Send a message to another actor.
- Become: Determine behavior for the next message (state change).
That's all. These three operations can express any computation.
Components of an Actor
Each Actor has:
- Address (PID): Unique identifier.
- Mailbox: Queue of received messages.
- Behavior: A function defining how to handle messages.
- State: Internal state (inaccessible from outside).
Actor A
┌─────────────┐
│   Mailbox   │ ← msg3 ← msg2 ← msg1
├─────────────┤
│  Behavior   │ (function: msg → action)
├─────────────┤
│    State    │ (internal data)
└─────────────┘
Message Processing Model
An actor processes one message at a time:
while True:
    msg = mailbox.pop()     # next message
    action = behavior(msg)  # handle with current behavior
    apply(action)           # state change, message sending, etc.
Key point: The inside of a single actor runs single-threaded. So the state inside an actor is free from concurrent modification concerns. No locks needed.
No Shared Memory
Communication between actors happens only via messages:
// wrong approach
actor_B.state.counter += 1 // illegal!
// correct approach
actor_B.send(IncrementMessage())
This constraint makes all concurrency problems disappear. No race conditions, no deadlocks, no locks. Instead, the complexity shifts to message passing.
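The "mailbox plus single-threaded processing" idea can be sketched in plain Python (the class and message names here are illustrative, not any real actor library):

```python
import queue
import threading

class CounterActor:
    """Minimal actor sketch: private state, a mailbox, one message at a time."""

    def __init__(self):
        self._count = 0                  # private state, never touched from outside
        self._mailbox = queue.Queue()    # incoming messages
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, msg):
        """Sending a message is the only way to interact with the actor."""
        self._mailbox.put(msg)

    def _run(self):
        while True:
            msg = self._mailbox.get()    # process messages strictly one at a time
            if msg[0] == "increment":
                self._count += msg[1]    # safe: only this thread mutates the state
            elif msg[0] == "get":
                msg[1].put(self._count)  # reply via a queue supplied by the sender

actor = CounterActor()
actor.send(("increment", 5))
reply = queue.Queue()
actor.send(("get", reply))
print(reply.get())  # 5
```

Because the single `_run` loop is the only code that touches `_count`, no lock is needed even though `send` is called from other threads.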
Asynchronous Message Passing
Message sending is asynchronous:
actor_A.send(actor_B, msg)
// returns immediately. B will actually process it later.
This means:
- Sender is not blocked: The sender keeps doing other work.
- Backpressure: The mailbox can overflow (implementation-dependent).
- Ordering guarantees: Most implementations guarantee "messages from one A to one B arrive in order."
- Delivery guarantees: Local is at-most-once; remote depends on implementation.
Location Transparency
An actor's PID doesn't distinguish local from remote:
% same code
Pid ! Message
Whether Pid is on the same node or a different one, the code is identical. This enables transparent distribution. Erlang was the first to implement this principle.
Actor vs Thread
| Item | Thread | Actor |
|---|---|---|
| Shared memory | Yes | No |
| Synchronization | Locks, semaphores | Messages |
| Weight | Heavy (several MB) | Light (a few KB) |
| Count | Thousands | Millions |
| Error propagation | Crashes the whole process | Isolated per actor |
| Distribution | Difficult | Built-in |
| Debugging | Hard | Relatively easy |
2. Erlang: The Power of the Original
History
- 1986: Built for Ericsson's telephone switches.
- 1996: OTP (Open Telecom Platform) library released.
- 1998: Open-sourced.
- 2010s: Large-scale use at WhatsApp, Ericsson, Heroku, etc.
Language Philosophy
Erlang is an intentionally constrained language:
- Immutable: Variables are assigned only once.
- Pattern matching: The primary control flow.
- Functional: Minimal side effects.
- Dynamic typing: Runtime types.
- Single assignment: Prolog influence.
These constraints make concurrency safe.
Lightweight Processes
Erlang's actors are called "processes," but they aren't OS processes — they're scheduling units inside the VM:
- Size: Starts at roughly 300 bytes.
- Spawn cost: Under 1 μs.
- Context switch: Faster than OS threads.
- Concurrent process count: Millions are possible.
% Spawn 1 million processes
lists:foreach(
    fun(_) -> spawn(fun() -> loop() end) end,
    lists:seq(1, 1000000)
).
% Completes in seconds.
Impossible with OS threads. Erlang's lightness is the secret.
Send and Receive
% Spawn an actor (process)
Pid = spawn(fun() -> counter(0) end).

% Send a message
Pid ! {increment, 5}.
Pid ! {get, self()}.

% Receive messages
counter(N) ->
    receive
        {increment, Delta} ->
            counter(N + Delta);
        {get, From} ->
            From ! {count, N},
            counter(N);
        stop ->
            ok
    end.
! is send, receive is blocking message receive, and self() is your own PID.
Pattern Matching in Receive
receive scans the mailbox and retrieves the first message matching a pattern:
receive
    {urgent, Msg} -> handle_urgent(Msg);
    {normal, Msg} -> handle_normal(Msg)
after
    5000 -> timeout()
end.
- Selective matching (pick out specific messages to handle).
- Timeout support.
- Only matches — does not reorder the mailbox.
The Selective Receive Trap
Erlang scans the mailbox from the front. Unmatched messages remain in the pending queue:
receive
    {critical, X} -> ...
end.
If there's no {critical, _} in the mailbox, it waits forever, and other messages pile up, causing a memory explosion. This is Erlang's famous pitfall. Fix: always include a catch-all or an after timeout.
Erlang/OTP
OTP (Open Telecom Platform) is Erlang's standard library plus a collection of design patterns:
- gen_server: Generic server behavior.
- gen_statem: State machines.
- gen_event: Event handlers.
- supervisor: Supervision trees.
- application: Application structure.
OTP is the reusable design patterns of the actor model. When you build something in Erlang, you almost always build it on top of OTP.
gen_server Example
-module(counter).
-behaviour(gen_server).
-export([start_link/0, increment/1, get/0]).
-export([init/1, handle_call/3, handle_cast/2]).
start_link() -> gen_server:start_link({local, ?MODULE}, ?MODULE, 0, []).
increment(N) -> gen_server:cast(?MODULE, {increment, N}).
get() -> gen_server:call(?MODULE, get).
init(State) -> {ok, State}.
handle_cast({increment, N}, State) -> {noreply, State + N}.
handle_call(get, _From, State) -> {reply, State, State}.
Structured, familiar pattern. Easy to write and maintain.
3. Let It Crash: A Revolutionary Philosophy
Traditional Error Handling
In most languages:
try:
    result = risky_operation()
except Exception as e:
    log_error(e)
    # try to recover, or return failure
    return None
- Predict every possible error.
- Defensive programming.
- Problem: Unexpected errors still crash things.
Erlang's Approach
"Let it crash."
- Don't try to anticipate errors.
- If an actor fails, just let it die.
- The supervisor restarts it.
Your code focuses on the happy path, and failure is managed externally (by the supervisor).
Why Does This Work?
Traditional assumption: "Errors are exceptional." Erlang's assumption: "Errors are normal."
Networks disconnect, disks fill up, memory runs out. This is an everyday reality to be expected. But trying to handle every error case makes code complex.
Erlang's solution: restart on failure. After restart, you start from a clean state. This resolves most problems ("have you tried turning it off and on again?").
Supervision Tree
A Supervisor is a special actor that watches its child actors:
         Main Supervisor
        /       |       \
  Worker     Worker    Sub-Supervisor
                       /      |      \
                  Worker   Worker   Worker
Each supervisor has a restart strategy:
- one_for_one: Restart only the failed child.
- one_for_all: If one fails, restart all children.
- rest_for_one: Restart the failed child and all children after it.
- simple_one_for_one: Dynamically spawned children.
Restart Limits
"Infinite restart" is dangerous (wastes CPU if the same error repeats). Supervisors have restart limits:
- max_restarts: Max restarts per time window.
- max_seconds: Measurement period.
If exceeded, the supervisor itself dies, propagating to the parent supervisor. This is the hierarchical propagation of errors.
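The restart-limit mechanic can be sketched in Python as a sliding window (the class and parameter names are illustrative; this is not the OTP API, which expresses the same idea with `intensity` and `period`):

```python
import time

class Supervisor:
    """Sketch of one_for_one restart with a max_restarts / max_seconds limit."""

    def __init__(self, child_factory, max_restarts=3, max_seconds=60):
        self.child_factory = child_factory
        self.max_restarts = max_restarts
        self.max_seconds = max_seconds
        self.restart_times = []          # timestamps of recent restarts
        self.child = child_factory()     # start the supervised child

    def on_child_exit(self, reason):
        now = time.monotonic()
        # keep only restarts inside the sliding time window
        self.restart_times = [t for t in self.restart_times
                              if now - t < self.max_seconds]
        if len(self.restart_times) >= self.max_restarts:
            # limit exceeded: give up and escalate to our own supervisor
            raise RuntimeError(f"restart limit exceeded: {reason}")
        self.restart_times.append(now)
        self.child = self.child_factory()  # clean state: a brand-new child

sup = Supervisor(child_factory=dict, max_restarts=2, max_seconds=60)
sup.on_child_exit("crash 1")
sup.on_child_exit("crash 2")
# a third exit inside the window would raise, propagating failure upward
```

The `raise` at the limit is the "supervisor itself dies" step: the exception travels up to whoever supervises this supervisor.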
Example: WhatsApp Structure
Application Supervisor
├── Cluster Supervisor
│ ├── Node Manager
│ └── Distribution Manager
├── Connection Supervisor
│ ├── Connection 1
│ ├── Connection 2
│ └── ... (millions)
├── Message Supervisor
│ ├── Message Router
│ └── Storage Manager
└── Monitoring Supervisor
├── Metrics Collector
└── Health Checker
Failures are isolated at each level. An error in one connection doesn't affect another. Trouble in one subsystem doesn't bring down the whole.
Lesson: Software That Accepts Failure
Let it crash is a paradigm shift in software design:
- Traditional: "My code must be perfect."
- Erlang: "My code will fail. The system keeps running anyway."
This fits the nature of distributed systems. In distributed systems, nothing is 100% trustworthy. Networks, hardware, software — all fail from time to time. Accept it, and design for it.
4. Akka: Actors on the JVM
Motivation
Started in 2009 by Jonas Bonér, who wanted to bring Erlang-style actors to the JVM, integrated with Scala.
In 2012, Typesafe (now Lightbend) began commercial support. Akka became synonymous with actors in the Java/Scala ecosystem.
Akka vs Erlang Differences
Language differences:
- Erlang: Dynamically typed, immutable, functional.
- Akka: Uses the JVM language's type system as-is.
Actor weight:
- Erlang: 200-300 bytes per process, perfect isolation.
- Akka: A few KB per Akka actor, JVM thread pool based.
Selective receive:
- Erlang: Native support.
- Akka: Can be approximated with Stash.
Culture:
- Erlang: "The language is the philosophy." The entire language is actor-centric.
- Akka: "A library." Can be mixed with JVM code.
Basic Akka Example (Scala)
import akka.actor.{Actor, ActorSystem, Props}

class Counter extends Actor {
  var count = 0
  def receive = {
    case Increment(n) => count += n
    case Get          => sender() ! count
  }
}

val system = ActorSystem("MySystem")
val counter = system.actorOf(Props[Counter], "counter")

counter ! Increment(5)
counter ! Get // response goes to sender
- Props: Actor creation config.
- receive: Message handler function.
- sender(): Ref of the current message's sender.
Typed Actor (Akka 2.6+)
Early Akka had Any-typed messages (Erlang style), giving up type safety. From Akka 2.6, Typed actors are the default:
sealed trait CounterMessage
case class Increment(n: Int) extends CounterMessage
case class Get(replyTo: ActorRef[Int]) extends CounterMessage

val counterBehavior: Behavior[CounterMessage] = Behaviors.setup { context =>
  var count = 0
  Behaviors.receiveMessage {
    case Increment(n) =>
      count += n
      Behaviors.same
    case Get(replyTo) =>
      replyTo ! count
      Behaviors.same
  }
}
Compile-time message type checking. Leveraging the JVM's strengths.
Akka Cluster
Akka Cluster ties multiple nodes into a single actor system:
- Gossip protocol: Node membership.
- Distributed actors: Used like local ones.
- Cluster Sharding: Automatic state distribution.
- Cluster Singleton: A single actor across the cluster.
- Distributed Data: CRDT-based shared state.
Cluster Sharding example:
val shardRegion = ClusterSharding(system).start(
  typeName = "User",
  entityProps = Props[UserActor],
  settings = ClusterShardingSettings(system),
  extractEntityId = extractEntityId,
  extractShardId = extractShardId
)

// Send a message with userId=42
shardRegion ! Envelope(42, SomeMessage)
// Automatically routes to whichever node holds the actor with ID 42
This automatically distributes millions of actors across dozens of nodes. Rebalances when nodes are added or removed.
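The core routing idea behind sharding is hash-based and deterministic: every node computes the same shard for a given entity ID, so any node can route a message without coordination. A Python sketch (the function names, shard count, and node-assignment rule are illustrative; real Akka sharding rebalances shards on membership change rather than using a plain modulo):

```python
import hashlib

NUM_SHARDS = 100  # fixed shard count, chosen up front

def shard_id(entity_id):
    """Map an entity to a stable shard, independent of how many nodes exist."""
    digest = hashlib.sha256(str(entity_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def node_for(shard, nodes):
    """Assign shards to nodes; a real implementation rebalances on membership change."""
    return nodes[shard % len(nodes)]

nodes = ["node-a", "node-b", "node-c"]
print(node_for(shard_id(42), nodes))  # every node computes the same answer
```

Because `shard_id` depends only on the entity ID, a message for user 42 always lands on the node currently owning that shard, which is exactly what lets the actor itself stay location-transparent.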
Akka Persistence
Built-in Event Sourcing. Actor state changes are stored as an event log:
val behavior: Behavior[Command] = EventSourcedBehavior[Command, Event, State](
  persistenceId = PersistenceId("counter", "1"),
  emptyState = State(0),
  commandHandler = (state, cmd) => cmd match {
    case Increment(n) => Effect.persist(Incremented(n))
  },
  eventHandler = (state, evt) => evt match {
    case Incremented(n) => state.copy(count = state.count + n)
  }
)
When the actor restarts, the event log is replayed to recover state. Powerful for failure recovery in distributed systems.
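The replay step is just a left fold over the event log. A minimal Python sketch of the same counter (the `Incremented` event and `recover` helper mirror the Scala example above; they are not a real persistence API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Incremented:
    n: int

def apply_event(state, event):
    """Pure event handler: old state + event -> new state (no side effects)."""
    return state + event.n

def recover(event_log):
    """On restart, replay the persisted events to rebuild the actor's state."""
    state = 0  # the emptyState
    for event in event_log:
        state = apply_event(state, event)
    return state

log = [Incremented(5), Incremented(3), Incremented(2)]
print(recover(log))  # 10
```

Because `apply_event` is pure, replaying the same log always yields the same state, which is what makes crash recovery deterministic.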
Akka Pitfalls
- "Child creation via Props": More complex than Erlang.
- Legacy untyped actors: Many tutorials are still outdated.
- JVM tuning: GC, heap, thread pool.
- License change: Switched to BSL (Business Source License) in 2022. Large-scale commercial use is paid.
After the licensing change, Pekko (an Apache fork of Akka 2.6) emerged. It is API-compatible with Akka 2.6 and fully open source.
5. Elixir: The Modern Face of Erlang
Birth
- 2012: Started by Ruby developer José Valim.
- Runs on the Erlang VM (BEAM).
- Ruby/Python-style syntax + Erlang's power.
Why Elixir?
Erlang is powerful but has alien syntax (Prolog influence). Many developers hit the barrier to entry. Elixir offers:
- Familiar syntax: Friendly to Ruby/Python users.
- Modern tooling: Mix (build), ExUnit (test), Hex (package manager).
- 100% Erlang compatible: All OTP, all libraries usable.
- Metaprogramming: Macros to build DSLs.
Example: Counter
defmodule Counter do
  use GenServer

  # Client API
  def start_link(initial \\ 0) do
    GenServer.start_link(__MODULE__, initial, name: __MODULE__)
  end

  def increment(n) do
    GenServer.cast(__MODULE__, {:increment, n})
  end

  def get do
    GenServer.call(__MODULE__, :get)
  end

  # Server callbacks
  def init(initial), do: {:ok, initial}

  def handle_cast({:increment, n}, state) do
    {:noreply, state + n}
  end

  def handle_call(:get, _from, state) do
    {:reply, state, state}
  end
end
Same structure as Erlang's gen_server, but much easier to read.
Phoenix and LiveView
Elixir's killer app: Phoenix framework. Rails-level web development productivity + Erlang's concurrency.
Phoenix LiveView: Server rendering plus real-time updates without JavaScript. Internally implemented with WebSockets and actors.
Performance: A single Phoenix server can handle millions of concurrent connections (WhatsApp-level).
Discord's Usage
Discord uses Elixir + Erlang at scale:
- Message routing: Elixir clusters.
- Concurrent voice chat: Millions of connections.
- 5 million+ concurrent users: Disclosed in a 2017 engineering blog post.
This answers "Isn't Erlang a dead language?" Discord, WhatsApp, Riot (League of Legends chat), and other services with millions of users run on BEAM.
6. Orleans: Microsoft's Virtual Actors
History
- 2010: Started at Microsoft Research.
- Project Orleans goal: "Cloud-scale .NET actors."
- Halo 4+: Player matchmaking, stats, leaderboards.
Virtual Actors
Orleans' innovation: the virtual actor. The runtime manages actor lifecycles automatically:
Traditional actor:
- Developer explicitly creates/destroys.
- PID management required.
Virtual actor:
- Actor is assumed to always exist.
- Auto-activates when receiving a message.
- Auto-deactivates when idle.
- Auto-restored when restarting.
public interface ICounterGrain : IGrainWithIntegerKey
{
    Task Increment(int n);
    Task<int> Get();
}

public class CounterGrain : Grain, ICounterGrain
{
    private int count = 0;

    public Task Increment(int n)
    {
        count += n;
        return Task.CompletedTask;
    }

    public Task<int> Get() => Task.FromResult(count);
}
Usage:
var counter = grainFactory.GetGrain<ICounterGrain>(42);
await counter.Increment(5);
int value = await counter.Get();
When the counter grain with ID 42 is called for the first time, it's auto-created; after prolonged idleness, it's auto-deactivated. Developers don't worry about lifecycle.
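The activate-on-first-use lifecycle can be sketched as a registry in Python (class and method names are illustrative; note that Orleans persists grain state via storage providers before deactivation, which this sketch deliberately omits):

```python
class CounterGrain:
    def __init__(self, key):
        self.key = key
        self.count = 0

class GrainRegistry:
    """Sketch of the virtual-actor idea: grains 'always exist' logically,
    are materialized on first use, and can be dropped when idle."""

    def __init__(self):
        self._active = {}  # key -> live activation

    def get_grain(self, key):
        # auto-activate on first call, reuse the activation afterwards
        if key not in self._active:
            self._active[key] = CounterGrain(key)
        return self._active[key]

    def deactivate_idle(self, key):
        # the runtime would do this automatically after a period of inactivity;
        # a real runtime persists state first, this sketch just drops it
        self._active.pop(key, None)

registry = GrainRegistry()
registry.get_grain(42).count += 5
print(registry.get_grain(42).count)  # 5: both calls hit the same activation
```

The caller never creates or destroys anything explicitly; identity (`42`) is enough, which is exactly the developer experience Orleans sells.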
Advantages
- Simplicity: No create/destroy code.
- Automatic failover: On node failure, recreated on another node.
- State persistence: Grain state is auto-saved via storage providers.
- .NET native: Leverages C# type system.
Disadvantages
- Runtime overhead: The cost of automation.
- Error model: A different approach from let it crash.
- Learning curve: The virtual actor concept feels unfamiliar at first.
Where It's Used
- Halo series game services.
- Parts of Azure internals.
- Financial trading (FIS and other partners).
- IoT: A grain per device.
The Only .NET Actor System?
Orleans is the most famous actor framework in the .NET ecosystem, but not the only one. There's also Akka.NET (Akka's .NET port), Proto.Actor, etc. However, Microsoft's official backing plus Azure integration makes Orleans the mainstream choice.
7. Other Actor Systems
Dapr Actors
Microsoft's Dapr (Distributed Application Runtime) is a sidecar-based distributed runtime. Dapr Actors are language-neutral virtual actors:
# Python
from dapr.actor import Actor

class CounterActor(Actor):
    async def increment(self, n):
        state = await self._state_manager.get_state("count")
        await self._state_manager.set_state("count", state + n)
HTTP/gRPC based, so any language can use it. Kubernetes friendly.
Akka.NET
Akka's .NET port. Used from F#/C#. An extension of the Akka philosophy, along with Pekko.
Pykka
A simple actor library for Python. Thread based.
Ray
Python's distributed computing framework. Includes an actor abstraction:
import ray

@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self, n):
        self.count += n

    def get(self):
        return self.count

counter = Counter.remote()
counter.increment.remote(5)
count = ray.get(counter.get.remote())
Popular for ML workloads. Integrates with Ray Tune, Ray Serve, etc.
CAF (C++ Actor Framework)
High-performance actor framework for C++. Used in game engines and HFT.
8. The Secret of Millions of Concurrent Connections
Erlang's Secret
Why can Erlang handle millions of concurrent connections?
1. Lightweight processes:
- Start size ~300 bytes.
- Spawn cost ~1 μs.
- 1000x lighter than a JVM thread.
2. Preemptive scheduling:
- BEAM VM's preemptive scheduling.
- "Reduction" count after function calls.
- Forcibly preempted if a process holds the CPU for "too long."
3. Independent heap:
- Each process has its own heap.
- Per-process GC: GCing one process doesn't affect others.
- No "stop the world."
4. Share-nothing:
- No shared memory, so no lock contention.
- Messages are copied when sent (not local pointers).
5. Simple scheduler:
- One scheduler thread per CPU.
- Each thread has its own run queue.
- Work stealing balances the load.
Benchmark
The Phoenix team's 2015 blog post, "The Road to 2 Million Websocket Connections in Phoenix":
- Test: Phoenix Channels with WebSocket.
- Goal: 2 million connections.
- Result: Achieved on a single server.
- Memory: ~40 GB (20 KB per connection).
- CPU: Almost idle.
This is on a single server. Very hard with Node.js or Java.
Comparison: Go goroutines
Go's goroutines are famous for lightweight concurrency:
- Size: Starts at 2 KB (grows dynamically).
- Count: Millions possible.
- Performance: Similar to Erlang or slightly better.
But Go is not an actor model. Shared memory + channels. Error handling is also traditional (recover/defer).
Erlang's strength is its fault tolerance patterns and distribution transparency. In raw performance, it's roughly even with Go.
Practical Design: State Management
With millions of actors, state management matters:
- In-memory: Each actor holds its own state. Fastest.
- Persistent: Event sourcing stores an event log.
- Distributed: CRDTs or a separate DB.
WhatsApp mainly used in-memory. Message routing is nearly stateless, so it's feasible. For more complex state, combine with Cassandra/DB.
9. Practical Patterns
Pattern 1: Request-Reply
The most basic pattern:
# synchronous call
reply = GenServer.call(server, :request)
# asynchronous cast (no reply)
GenServer.cast(server, :fire_and_forget)
call has an internal timeout (default 5 seconds). No reply → exception.
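The call/cast distinction can be sketched in Python: a `call` carries a private reply queue and the client waits on it with a timeout, a `cast` carries no reply channel at all (the server loop and message tags here are illustrative, not the GenServer protocol itself):

```python
import queue
import threading

def server(mailbox):
    """Server loop: 'call' messages carry a private reply queue, 'cast' don't."""
    state = 0
    while True:
        tag, payload, reply_to = mailbox.get()
        if tag == "cast":
            state += payload         # fire-and-forget: no reply
        elif tag == "call":
            reply_to.put(state)      # reply goes to the caller's private queue

mailbox = queue.Queue()
threading.Thread(target=server, args=(mailbox,), daemon=True).start()

mailbox.put(("cast", 5, None))       # async, returns immediately

reply_to = queue.Queue()
mailbox.put(("call", None, reply_to))
try:
    print(reply_to.get(timeout=5.0))  # like GenServer.call's 5-second default
except queue.Empty:
    raise TimeoutError("no reply within 5s")
```

The timeout on the client side is the key detail: the server is never blocked waiting, only the caller, and only for a bounded time.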
Pattern 2: Pub-Sub
# Subscriber
Phoenix.PubSub.subscribe(MyApp.PubSub, "chat:lobby")
# Publisher
Phoenix.PubSub.broadcast(MyApp.PubSub, "chat:lobby", {:new_message, msg})
# Subscriber receives
def handle_info({:new_message, msg}, state), do: ...
Naturally implemented in actor systems.
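The underlying mechanism is a topic-to-mailboxes map, with broadcast delivering a copy to each subscriber. A Python sketch (class and topic names are illustrative, not the Phoenix.PubSub API):

```python
from collections import defaultdict

class PubSub:
    """Topic -> subscriber mailboxes; broadcast delivers to every subscriber."""

    def __init__(self):
        self._topics = defaultdict(list)

    def subscribe(self, topic, mailbox):
        self._topics[topic].append(mailbox)

    def broadcast(self, topic, message):
        for mailbox in self._topics[topic]:
            mailbox.append(message)   # each subscriber gets its own copy

bus = PubSub()
alice, bob = [], []
bus.subscribe("chat:lobby", alice)
bus.subscribe("chat:lobby", bob)
bus.broadcast("chat:lobby", ("new_message", "hi"))
print(alice, bob)  # both mailboxes now hold the message
```

In a real actor system each "mailbox" is an actor's queue and delivery is a `send`, so the publisher never knows who or how many the subscribers are.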
Pattern 3: Worker Pool
Distribute work across multiple worker actors:
# using poolboy
{:ok, pool_pid} = :poolboy.start_link(
[worker_module: MyWorker, size: 10],
init_args
)
worker = :poolboy.checkout(pool_pid)
result = MyWorker.process(worker, task)
:poolboy.checkin(pool_pid, worker)
Pattern 4: State Machine
State machines with gen_statem:
defmodule TrafficLight do
  use GenStateMachine

  # Start in :red and schedule the first transition
  def init(_), do: {:ok, :red, %{}, [{:state_timeout, 3000, :tick}]}

  def handle_event(:state_timeout, :tick, :red, data) do
    {:next_state, :green, data, [{:state_timeout, 3000, :tick}]}
  end

  def handle_event(:state_timeout, :tick, :green, data) do
    {:next_state, :yellow, data, [{:state_timeout, 1000, :tick}]}
  end

  def handle_event(:state_timeout, :tick, :yellow, data) do
    {:next_state, :red, data, [{:state_timeout, 3000, :tick}]}
  end
end
State-based logic is explicit.
Pattern 5: Circuit Breaker
Easy to implement as an actor:
defmodule CircuitBreaker do
  use GenServer

  def call(name, func) do
    case GenServer.call(name, :state) do
      :closed -> try_call(name, func)
      :open -> {:error, :circuit_open}
      :half_open -> try_call(name, func)
    end
  end

  # ... handle failures and state transitions
end
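For the full closed/open/half-open transition logic, here is a self-contained Python sketch (class name, thresholds, and the probe rule are illustrative choices, not a standard implementation):

```python
import time

class CircuitBreaker:
    """closed -> open after N consecutive failures; half-open after a cooldown."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None            # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return ("error", "circuit_open")  # fail fast, spare the backend
            self.opened_at = None                 # half-open: allow one probe
        try:
            result = func()
        except Exception as exc:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return ("error", exc)
        self.failures = 0                          # success closes the circuit
        return ("ok", result)

cb = CircuitBreaker(max_failures=2, reset_after=30.0)
def boom(): raise RuntimeError("backend down")
cb.call(boom); cb.call(boom)           # two failures trip the breaker
print(cb.call(lambda: "fine"))         # ('error', 'circuit_open'): fails fast
```

Wrapping this state in an actor (as the Elixir sketch above does) gives you the same logic without the need to make the counters thread-safe.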
10. Limitations and Pitfalls
Pitfall 1: Actors Aren't a Silver Bullet
Bad for compute-intensive work:
- Message passing overhead.
- A single actor is single-threaded.
- Doesn't fit parallel hardware like GPUs.
Fix: Do computation in other languages/systems, use actors for coordination.
Pitfall 2: Mailbox Explosion
Fast producer + slow consumer:
Producer: sends 1M messages/sec
Consumer: processes 10K messages/sec
→ 990K per second pile up in the mailbox
→ memory explosion, OOM
Fix:
- Backpressure: Consumer signals "slow down."
- Sampling: Drop when overflowing.
- Bounded mailbox: Configurable in Akka. Erlang is unbounded by default.
- Flow control: GenStage, Broadway (Elixir).
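The bounded-mailbox and load-shedding options differ only in what the producer does when the queue is full. A Python sketch using a bounded queue (function names and the size of 1000 are illustrative):

```python
import queue

mailbox = queue.Queue(maxsize=1000)   # bounded mailbox: at most 1000 pending messages

def send_with_backpressure(msg):
    """Producer blocks when the mailbox is full; the consumer's pace throttles it."""
    mailbox.put(msg)                  # blocks until the consumer catches up

def send_or_drop(msg):
    """Load-shedding variant: drop the message instead of blocking when full."""
    try:
        mailbox.put_nowait(msg)
        return True
    except queue.Full:
        return False                  # overflow: message sampled away
```

With an unbounded mailbox (Erlang's default) neither of these happens automatically, which is why flow-control layers like GenStage exist.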
Pitfall 3: Selective Receive Performance
Erlang's selective receive scans the mailbox:
receive
    {specific, Msg} -> handle(Msg)
end
If the mailbox holds 100K messages and the wanted pattern is near the back, you scan 100K entries. Slow.
Fix:
- Process all messages in order instead of using selective receive.
- spawn a separate actor to handle the specific message type.
Pitfall 4: Message Ordering in Distributed Environments
Locally, "A → B message ordering" is guaranteed. In distributed settings:
- Messages go over the network.
- Retries and reorderings happen.
- Eventual consistency.
Fix:
- Implement causal ordering explicitly if needed.
- Idempotent design.
- Event sourcing.
Pitfall 5: The Temptation of Global State
Against the actor philosophy, but sometimes for convenience:
# :ets tables can be shared outside processes
:ets.new(:shared, [:public, :named_table])
Such shared state is convenient but introduces lock contention and debugging pain. Use only when truly necessary.
Pitfall 6: Remote Call Timeouts
GenServer.call(remote_pid, :get, 5000)
5-second timeout. If the timeout is due to a network issue:
- You don't know whether it actually completed.
- For at-least-once, you need retries.
- But retries enable duplicates.
Fix: Idempotent design. Unique request IDs.
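Idempotency via unique request IDs can be sketched in a few lines of Python (the service, method, and field names are illustrative):

```python
import uuid

class PaymentService:
    """Dedup by request ID: a retry after a timeout cannot double-apply."""

    def __init__(self):
        self.processed = {}   # request_id -> previously computed result

    def charge(self, request_id, amount):
        if request_id in self.processed:
            return self.processed[request_id]   # replay the old answer, no side effect
        result = f"charged {amount}"            # the actual side effect runs once
        self.processed[request_id] = result
        return result

svc = PaymentService()
req = str(uuid.uuid4())          # the sender picks one ID per logical request
first = svc.charge(req, 100)
retry = svc.charge(req, 100)     # timeout-driven retry: same ID, same effect
print(first == retry)  # True
```

The caller generates the ID before the first attempt, so even if the reply is lost and the call is retried, the server can tell "new request" from "duplicate".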
11. Actor Model vs Other Concurrency Models
vs Thread + Lock
- Thread: Shared memory, locks, complex.
- Actor: Messages, no locks, simple.
Actors are safer and simpler in most cases. But for compute-intensive work, threads are still faster.
vs CSP (Go channels)
CSP (Communicating Sequential Processes): Go's goroutines + channels. Similar to actors, but:
- Actor: Messages go to an actor's mailbox.
- CSP: Messages go to a channel (not tied to an actor).
Go's channels are more flexible but lack actors' "ownership" model, which can complicate reasoning.
vs Reactive Streams
Reactive Streams (RxJS, Reactor): Async data streams + backpressure.
Actors are "entity"-centric, Reactive Streams are "data flow"-centric. They complement each other and are often combined in practice.
vs Software Transactional Memory (STM)
STM (Clojure, Haskell): Memory accesses as transactions.
Actors don't need STM (no shared memory). Both are alternatives to lock-based concurrency, but with different approaches.
vs Event-driven (Node.js)
Node.js is a single-threaded event loop. One thing at a time.
Actors allow many actors to run in parallel. Each actor is sequential internally.
Node is simple and easy to understand but limited in CPU utilization. Actors add complexity but use multi-core freely.
12. Learning and Practice
Where to Start
Recommend Elixir + Phoenix:
- Familiar Ruby/Python-like syntax.
- Immediate access to all of Erlang OTP's power.
- Phoenix for real projects.
- "Programming Elixir" (Dave Thomas) book.
- Phoenix LiveView for real-time web.
Erlang recommendation:
- "Learn You Some Erlang" (free online).
- "Programming Erlang" (Joe Armstrong).
- OTP fundamentals without worrying about outdated tutorials.
Akka recommendation:
- Natural choice on the JVM.
- But mind the licensing issue (consider Pekko).
- "Reactive Messaging Patterns with Akka" (rich material).
Orleans recommendation:
- .NET environments.
- Official Microsoft support.
- Azure integration.
Practical Project Ideas
- Chat application: WebSocket + actors. Phoenix LiveView works well.
- Game server: An actor per player.
- Stock ticker: An actor per symbol.
- IoT device manager: An actor per device.
- Task scheduler: Worker pools.
- Metrics collector: Collect from many sources, aggregate.
Start small and feel the power of actors. You'll see how much simpler and safer it is than the shared-memory approach.
Real-World Cases
- WhatsApp: 2M+ connections per server.
- Discord: 5M+ concurrent users.
- Riot Games: League of Legends chat.
- Heroku: Routing layer.
- Klarna: Financial trading.
- Pinterest: Notification system.
- Bleacher Report: Real-time feeds.
All use Erlang or Elixir. Not a "niche language."
Review Quiz
Q1. What are the three fundamental operations of the Actor Model, and why can these alone express "all computation"?
A. Carl Hewitt's original definition:
- Create: Create a new actor.
- Send: Send a message to another actor.
- Become: Determine behavior for the next message (state change).
Why is this enough:
"Become" is the key. An actor can change its own behavior before receiving the next message. This is equivalent to a state machine.
Example: Counter
behavior_0 = on increment, become behavior_1
behavior_1 = on increment, become behavior_2
...
behavior_N = on get, reply N to sender
"A counter with state N" equals "an actor with a specific behavior for the next message." Stateful computation is expressible with actors.
Turing-completeness:
- Create enables recursion (self-replicating actors).
- Send enables I/O.
- Become enables state.
- Their combination expresses all λ-calculus operations.
Hence the actor model is Turing-complete. Any computation can be expressed.
Practical meaning: The simplicity of the actor model makes concurrency manageable. Traditional locks, semaphores, and monitors exist to solve shared-memory problems. Actors forbid shared memory itself, eliminating the problem.
Instead of "how do we solve a complex problem," it's "let's remove the problem entirely." That's Hewitt's philosophical contribution.
Q2. Why is Erlang's "Let it crash" philosophy better than traditional error handling?
A. Traditional error handling tries to "predict and handle errors":
try:
    result = db.query(...)
except ConnectionError:
    retry(3)
except Timeout:
    return cached_value()
except DataCorruption:
    log_and_skip()
except UnknownError:  # what if unexpected?
    raise  # process dies
Problems:
- Can't predict all errors: 30% of real errors are "unexpected."
- Code complexity explosion: 20 lines of happy path, 200 lines of error handling.
- Loss of consistency: After errors, state is half-done. Partially completed transactions.
- Cascading failure: One error propagates elsewhere.
- Hard to reproduce: Specific error sequences can't be tested.
Let it crash approach:
handle_request(Req) ->
    Result = do_work(Req),  % no error checks
    reply(Result).
If an error occurs, the process dies. The supervisor restarts it. The new process has a clean state.
Benefits of this approach:
- Code is simple: Focus on the happy path.
- Consistency: A dead process leaves no state behind (isolated memory).
- Natural recovery: Most errors are "transient." Restarts resolve them.
- Visibility: Restart logs reveal problem patterns.
- Fault tolerance by default: Every actor is automatically "resilient."
"Why is restart the solution":
Commonly cited rules of thumb (illustrative, not rigorous data):
- ~95% of network failures are transient (recover in seconds).
- ~90% of disk write failures resolve on retry.
- ~80% of external service failures recover after a few seconds.
"Just restart" works surprisingly well in practice: the old support-desk line "have you tried turning it off and on again?" captures a real engineering truth.
Preconditions:
For let it crash to work:
- Isolated state: One actor's state must not affect another.
- Supervisor hierarchy: Automatic restart mechanism.
- Restart limits: Prevent infinite restarts.
- Idempotent design: Same result after restart.
- Event sourcing (optional): State recovery after restart.
Erlang supports all of these at the language level. That's why let it crash works as an engineering principle, not just philosophy.
Attempts in other languages:
- Kubernetes: Container-level let it crash. Pod fails → restart.
- Erlang's ideal evolved into cloud-native. Kubernetes' "self-healing" concept is exactly this.
Lesson: Giving up perfection yields robustness. Trying to prevent every error breeds complexity, and complexity breeds new bugs. Accepting failure and designing recovery mechanisms is more practical.
Q3. Why are Erlang processes 1000x lighter than OS threads?
A. Several factors combine to achieve extreme lightness:
1. Stack size:
- OS thread: Default stack 2-8 MB (configurable but rarely changed).
- Erlang process: ~233 words at start (~1.8 KB). Grows dynamically if needed.
1000 OS threads = 8 GB memory. 1000 Erlang processes = 2 MB. 4000x difference.
2. Context switch:
- OS thread: Kernel call required. TLB flush, register save/restore. A few μs.
- Erlang process: Inside the BEAM VM. No kernel call. Tens of ns.
3. Scheduling unit:
- OS: Kernel schedules. Overhead surges with thousands of threads.
- BEAM: User-space scheduler. Supports millions of processes.
4. Independent Heap: Each Erlang process has its own heap:
- Per-process GC: GCing one process doesn't affect others.
- Small heap: Most processes carry little data. Fast GC.
- No stop-the-world: None of the JVM's nightmare.
5. Share-nothing:
- No locks (no shared state to access).
- No cache contention.
6. Reduction-based scheduling:
- BEAM counts function calls as "reductions."
- Preempt every 2000 reductions.
- Fair and very fast scheduling.
7. Pure Erlang except NIFs:
- No FFI means no external complexity.
- Everything is managed inside the VM.
Real measurements:
- WhatsApp: 2,000,000 connections on a single server.
- Phoenix: 2 million WebSocket benchmark.
- 1 million actor spawn: A few seconds.
Compared to the JVM:
- JVM thread: OS thread based. Similarly heavy.
- JVM + Akka actor: Multiple actors share a single JVM thread. Mid-tier lightness.
- Goroutine (Go): 2 KB start. Similar lightness.
When this lightness matters:
- Many connections: WebSocket, chat, game servers.
- Many independent tasks: IoT device management.
- Actor-based modeling: Each entity as an actor.
- Parallel work: Tens of thousands of independent computations.
Why other VMs can't do this:
JVM, V8, etc. chose OS-thread-based models. They had to be compatible with:
- Existing libraries (OS API calls).
- FFI with other languages.
- Complex memory models.
Erlang could start from a clean slate and build lightweight processes into the language level. That's Erlang's unique historical legacy.
Lesson: Constraints are sometimes liberation. Erlang's "inability to do many things" (shared memory, direct system calls, etc.) made "other things possible." Lightweight processes aren't the price of freedom but the result of constraints.
Q4. How does the Supervision Tree provide fault tolerance for distributed systems?
A. The supervision tree is a mechanism that hierarchically isolates and recovers from failure.
Basic structure:
Application
└── Main Supervisor
    ├── Child
    ├── Child
    └── Sub-Supervisor
        ├── Child
        ├── Child
        └── Child
Each supervisor:
- Maintains a list of child actors.
- Detects child failures.
- Restarts according to the configured strategy.
Restart Strategies:
one_for_one: Restart only the failed child.
[A, B, C] and B fails
→ restart only B
→ [A, B', C]
Use: When children are independent.
one_for_all: On one failure, restart all.
[A, B, C] and B fails
→ restart all
→ [A', B', C']
Use: When children depend on shared state.
rest_for_one: Restart the failed one and all after it.
[A, B, C, D] and B fails
→ restart B, C, D
→ [A, B', C', D']
Use: When there's sequential dependency.
simple_one_for_one: Dynamically spawned children.
- Children created from a predefined template.
- Each child independent.
- Use: Thousands of similar actors (e.g., one per connection).
Restart limits:
{intensity, 10}, % max 10 restarts per period
{period, 60} % within 60 seconds
If exceeded:
- The supervisor itself dies.
- Failure propagates to the parent supervisor.
- Cascading upward into larger restarts.
The power of hierarchical recovery:
- Small failure → small restart (child actor).
- Medium failure → medium restart (entire sub-supervisor).
- Large failure → higher-level restart (whole subsystem).
- Catastrophic failure → full application restart (worst case).
Because isolation happens at each level, small errors are handled in small scopes.
Example: Chat application
ChatApp Supervisor
├── AuthService Supervisor
│ ├── User1 Session
│ ├── User2 Session
│ └── UserN Session ← one session failure doesn't affect others
├── MessageRouter Supervisor
│ ├── RoomA Router ← RoomA failure = affects RoomA only
│ ├── RoomB Router
│ └── RoomC Router
└── Storage Supervisor
├── Primary DB
├── Replica DB
└── Cache
- User1's session crashes due to a bug → User1 reconnects. Others unaffected.
- RoomA crashes → RoomA messages temporarily fail. Other rooms normal.
- Storage-wide outage → may need application restart.
Lesson: Let it crash + Supervisor = Fault Tolerance:
- Individual actor: unafraid to die.
- Supervisor: an always-ready recovery mechanism.
- Hierarchy: limits the blast radius of a failure.
- Restart limits: prevent infinite restart loops.
This combination creates actually working fault-tolerant systems. Ericsson's claim that Erlang systems achieved 9 nines (99.9999999%) availability (AXD301 switch) is proof of this philosophy.
Modern applications:
- Kubernetes Pod restart: Simple supervisor concept.
- Docker restart policy: Container level.
- Systemd: Linux service supervisor.
- PM2 (Node.js): Process supervisor.
The cloud-native world is rediscovering what Erlang solved in the 1980s. The idea hasn't changed.
Practical design principles:
- Shallow hierarchy: 3-4 levels is the sweet spot. Deeper trees become hard to reason about.
- Distinguish transient vs permanent failures: Permanent failures aren't solved by restart.
- Simple initialization: Supervisor restarts should be fast.
- State recovery: Important state via event sourcing or DB.
- Metric collection: Monitor restart frequency → anomaly detection.
Supervision tree isn't a way to build perfect systems but a way for imperfect systems to behave robustly. This is one of Erlang's greatest gifts to us.
Q5. What's the fundamental difference between the Actor Model and Go's goroutine+channel?
A. Both are "lightweight concurrency + message-based communication," but their philosophies differ.
Actor Model:
- Entity-centric: An actor represents "something" (user, order, device).
- Actor owns messages: Messages go to a specific actor's mailbox.
- Addressing: actorRef.send(msg) — explicit about who receives.
- No sharing: No external access to an actor's internal state.
CSP (Go channels):
- Process-centric: A goroutine represents "work."
- Channel owns messages: Messages go to a channel; the receiver is whoever reads from it.
- Indirect communication: ch <- msg — send to a channel, for any reader to pick up.
- Sharing allowed: Go also permits shared memory (sync.Mutex and friends).
Philosophy comparison:
Erlang: "Do not share."
Actor ! Message % to a specific actor
Go: "Do not communicate by sharing memory; share memory by communicating."
ch <- message // through a channel
Concrete differences:
1. Addressing model
- Actor: Each actor has a PID/ref. "Message to User #42."
- Go: Channel is the "pipe." Who writes and who reads is decided at runtime.
2. Lifecycle
- Actor: Explicit creation/termination. Managed by supervisor.
- Goroutine: Started with go f(); destroyed automatically when its function returns. No supervisor.
3. Error handling
- Actor: Let it crash + supervision.
- Go: recover() + defer; otherwise the whole process dies when a goroutine panics.
4. Distribution
- Actor: Location transparent (Erlang, Akka).
- Go: Local only. Distribution is separate (gRPC, etc.).
5. State management
- Actor: Encapsulated internal state.
- Go: Can use shared state (sync.Mutex, etc.). Mixed model.
6. Type safety
- Erlang: Dynamic typing.
- Akka: Typed actors (Akka 2.6+).
- Go: Static typing. Channels are typed too (chan int).
7. Communication patterns
Actor - Request-Reply:
response = GenServer.call(user_server, :get_profile)
- Call to a specific actor.
- Response comes back to the caller.
Go - Fan-out:
results := make(chan Result, len(jobs)) // buffered: no worker blocks on send
for _, job := range jobs {
	go func(j Job) {
		results <- process(j) // fan-out: one goroutine per job
	}(job)
}
// fan-in: receive len(jobs) values from results, in completion order
- Collect results from multiple goroutines through channels.
- "Who processes" is automatic.
Which is better:
Prefer actors:
- Entity-based domains: Each user, order, session as an actor.
- Fault tolerance matters: Thanks to supervision.
- Distribution: Location transparency.
- Long-lived state: State of long-lived actors.
Prefer Go channels:
- Pipeline processing: Data flow.
- Simple concurrency: Many short tasks.
- Fan-in/Fan-out: Multiple producers/consumers.
- Quick implementation: Simpler cases have shorter code.
In practice, a mix:
Many systems combine both patterns:
- Phoenix (Elixir): Actors by default, but pipelines inside.
- Go systems: Channels by default, with actor-style structs when needed.
Academic relationship:
- Actor model: Hewitt, 1973.
- CSP: Hoare, 1978.
- Same era, independent development.
- Both emphasize messages over shared memory.
- Difference: Actors are identity-centric, CSP is process-centric.
One-line summary:
- Actor: "Send this request (message) to this entity (actor)."
- CSP: "Put this data (msg) into this pipe (channel). Whoever receives it."
Both perspectives are valid with their own strengths. If the problem is entity management, actors are natural; if data flow, channels are. Ideally, modern systems understand both concepts and use them appropriately.
Closing: Messages Are Everything
Core Summary
- Actor model: Create, Send, Become — three operations.
- No sharing: Communicate only through messages. No locks.
- Erlang/OTP: The original. Millions of lightweight processes.
- Let it crash: Accept failure and restart.
- Supervision tree: Hierarchical fault tolerance.
- Akka: Actors on the JVM. Led by Lightbend.
- Elixir: The modern face of Erlang. Phoenix + LiveView.
- Orleans: .NET's virtual actors.
- Location transparency: Same code for local and remote.
When to Choose the Actor Model
Choose when:
- Thousands to millions of concurrent connections.
- Distributed systems.
- Fault tolerance matters.
- Entity-based domains (user, game player, IoT device).
- Stateful services.
Avoid when:
- Simple CRUD APIs.
- Compute-intensive tasks.
- Small systems.
- No team experience.
Final Lesson
Erlang started in 1986 and has evolved for nearly 40 years. Countless concurrency models have come and gone in that time. Erlang is still in active use.
Why? Joe Armstrong (Erlang co-creator, passed in 2019) said:
"If your system must handle 100,000 connections, you may be using the wrong language."
WhatsApp was acquired by Facebook in 2014 for $19 billion. At the time it had about 55 employees — only around 32 of them engineers — serving 450 million users. Erlang's power made that ratio possible.
The actor model is a paradigm shift. From "how do we synchronize shared memory" to "how do we collaborate through messages." It's a hard transition, but once accepted, many problems become simple.
Next time questions like "how do we handle many concurrent users?" or "how do we build reliable distributed systems?" come up, remember the actor model. This 1973 idea is still among the best answers.
References
- Erlang/OTP Official Documentation
- Learn You Some Erlang for Great Good! (Fred Hébert, free online)
- Programming Erlang (Joe Armstrong)
- Programming Elixir (Dave Thomas)
- Designing for Scalability with Erlang/OTP (Francesco Cesarini & Steve Vinoski)
- Akka Documentation
- Reactive Messaging Patterns with Akka
- Orleans Documentation
- Phoenix Framework
- Phoenix LiveView
- The Road to 2 Million Websocket Connections in Phoenix
- How Discord Scaled Elixir to 5M Concurrent Users
- Hewitt: Actor Model of Computation (1973 original)
- Joe Armstrong: Why Erlang Is Safe