- Published on
MariaDB Architecture — From Storage Engines to Galera
- Authors

- Name
- Youngju Kim
- @fjvbn20031
- Introduction
- Overall Server Architecture
- Storage Engines in Depth
- Replication
- Galera Cluster (Synchronous Multi-Master)
- Partitioning
- The Relationship Between MariaDB and MySQL, and Their Differences
- Operations and Tuning
- Pitfalls and Common Mistakes
- Closing
- References
Introduction
MariaDB is an open-source relational database that forked from MySQL. In the beginning it shared almost the same code as MySQL, but over time it has grown its own storage engines, replication methods, and clustering technology. So if you only think of it as "something compatible with MySQL," you will struggle to handle the many choices and behavioral differences you encounter in a real production environment.
In this article we follow the MariaDB architecture from top to bottom. First we sketch the full flow: a client request arrives, the SQL is parsed, the optimizer builds an execution plan, and we finally reach the storage engine that actually reads and writes data. Then we compare storage engines such as InnoDB, Aria, ColumnStore, and MyRocks, and look at how data is spread across multiple nodes — from asynchronous replication through semi-synchronous, GTID, and the Galera synchronous multi-master cluster. Finally we cover partitioning and the pitfalls you often meet in operations.
Because defaults and supported features can differ by version, for specific numbers or capabilities we recommend always checking the official documentation for the version you are running. This article focuses on the big picture and the concepts.
Overall Server Architecture
It is fastest to understand the MariaDB server as two layers. The upper part is the SQL layer (server layer) that interprets SQL and builds execution plans, and the lower part is the storage engine layer that actually stores data on disk and reads it back. Separating these two is the most important design characteristic of the MySQL family.
Client (app / mysql CLI / connector)
│
▼
┌───────────────────────────────────────────────────┐
│ Connection Layer │
│ ┌─────────────┐ ┌──────────────┐ ┌───────────┐ │
│ │ Thread Pool │ │ Auth │ │ Session/ │ │
│ │ │ │ │ │ privileges│ │
│ └─────────────┘ └──────────────┘ └───────────┘ │
└───────────────────────────┬───────────────────────┘
│
▼
┌───────────────────────────────────────────────────┐
│ SQL Layer │
│ ┌────────┐ ┌────────┐ ┌──────────┐ ┌───────┐ │
│ │ Parser │──▶│Rewrite │──▶│Optimizer │─▶│ Plan │ │
│ └────────┘ └────────┘ └──────────┘ └───┬───┘ │
│ │ query cache (if any) / privilege chk │ │
│ ▼ ▼ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Executor │ │
│ └───────────────────────┬──────────────────────┘ │
└──────────────────────────┼─────────────────────────┘
│ Storage Engine API (handler)
▼
┌───────────────────────────────────────────────────┐
│ Storage Engine Layer │
│ ┌────────┐ ┌──────┐ ┌────────────┐ ┌────────┐ │
│ │ InnoDB │ │ Aria │ │ ColumnStore│ │MyRocks │ │
│ └────────┘ └──────┘ └────────────┘ └────────┘ │
└───────────────────────────────────────────────────┘
│
▼
data files / logs / indexes
Let us walk through what each stage does.
Connection Layer
When a client connects, the server first accepts the connection and handles authentication. It checks the user name, password, source host, and privileges, and prepares the session-level variables and context.
MariaDB supports both the traditional model of allocating a thread per connection and the thread pool model. In environments with many connections and frequent short queries, a thread pool reduces context-switching cost and memory usage. If your workload must handle thousands of concurrent connections, the thread pool is worth considering.
SQL Layer (Parser · Optimizer · Executor)
Once the connection is established, the incoming SQL string goes through the following stages.
- Parser: splits the SQL string into tokens, checks the grammar, and produces an internal syntax tree. Syntax errors are caught here.
- Preprocess / Rewrite: expands views and turns some subqueries into joins, shaping the query into a form the optimizer can handle well.
- Optimizer: in a cost-based manner, picks the plan estimated to be the cheapest among several execution methods. It decides which index to use, the join order, and whether a temporary table or a sort is needed.
- Executor: follows the chosen plan, calling the storage engine API to actually read and write rows.
The optimizer relies on table statistics (row counts, index selectivity, and so on). If statistics are stale or inaccurate, it may pick a poor plan, so keeping statistics fresh matters in operations.
Storage Engine Layer
The executor does not touch data directly. Instead, through a consistent interface called the handler API, it sends requests such as "give me the next row matching this condition" or "write this row" to the storage engine. Thanks to this, the same SQL can have completely different transaction support, locking behavior, and on-disk layout depending on which engine the table uses.
This separation lets you mix different engines per table within one database. For example, you can keep a transaction-critical orders table on InnoDB and a bulk-aggregation analytics table on ColumnStore.
Storage Engines in Depth
Storage engines are the heart of the MariaDB architecture. Each engine is optimized for a different workload.
InnoDB
InnoDB is the default choice for most OLTP (online transaction processing) workloads. It has the following characteristics.
- ACID transactions: supports commit, rollback, and isolation levels.
- Row-level locking: in a highly concurrent environment, it locks only the necessary rows rather than the whole table.
- MVCC (multi-version concurrency control): so that reads and writes block each other less, reads see a consistent snapshot at a point in time.
- Clustered index: data is physically stored sorted by primary key order. As a result, primary key design has a large impact on performance.
- Crash recovery: through the redo log, it restores consistency even after an abnormal shutdown.
A simplified view of InnoDB's internals looks like this.
Write path (simplified)
COMMIT ──▶ write redo log (WAL) ──▶ modify page in buffer pool
│ │
│ (to disk first) │ (dirty page in memory)
▼ ▼
log file buffer pool (memory)
│
│ checkpoint / background flush
▼
data file (.ibd)
The key concept here is WAL (Write-Ahead Logging). Instead of writing data pages to disk immediately, the changes are first recorded sequentially in the redo log. Sequential writes are much faster than random writes, and if the server suddenly dies, recovery can replay the redo log.
The buffer pool is the memory area where InnoDB caches data and index pages. When the working set fits in the buffer pool, disk I/O drops dramatically, so buffer pool size is one of the most important items in InnoDB performance tuning.
Aria
Aria was developed as a successor to MyISAM. It is safer than MyISAM in that it supports crash recovery. Inside MariaDB it is sometimes used for system tables or temporary tables. However, it does not provide InnoDB-level full transactions and row locks, so for ordinary OLTP tables with many concurrent writes, InnoDB is a better fit.
ColumnStore
ColumnStore is a columnar engine for analytical (OLAP) workloads. Unlike the traditional row-oriented storage model, it stores values of the same column together.
Row-oriented (InnoDB) Columnar (ColumnStore)
[id|name|amount|date] id : 1,2,3,4,...
[ 1|kim | 100 |...] name : kim,lee,park,...
[ 2|lee | 250 |...] amount : 100,250,90,...
[ 3|park| 90 |...] date : ...
reads an entire row reads only chosen columns
Aggregation queries (for example, the sum or average of a particular column) only need to read the necessary columns, so I/O drops dramatically, and because values of the same column sit together, compression ratios are high. On the other hand, it is not suited to OLTP patterns that frequently update one record at a time. It is a good fit for analytics systems that scan and aggregate large volumes of data.
MyRocks
MyRocks is an LSM tree (Log-Structured Merge-tree) engine based on RocksDB. Its write handling is fundamentally different from the B-tree-based InnoDB.
- Low write amplification: the LSM structure batches writes sequentially, so for certain workloads the amount actually written to disk is reduced.
- High compression and space efficiency: it tends to store the same data in less space, which is advantageous for large-scale environments where storage cost matters.
- Read-side trade-off: by the nature of the LSM structure, the read path may traverse several levels, so read cost varies by workload.
It is considered as an alternative to InnoDB in environments with very heavy writes where storage space must be saved.
Storage Engine Selection Criteria (Comparison Table)
| Item | InnoDB | Aria | ColumnStore | MyRocks |
|---|---|---|---|---|
| Main use | General OLTP | System/temp tables | Analytical OLAP | Write-heavy large data |
| Transactions (ACID) | Full support | Limited | Analytics-focused | Supported |
| Lock granularity | Row | Mostly table | Analytics-focused | Row |
| Storage structure | B-tree | B-tree | Columnar | LSM tree |
| Compression | Optional | Limited | Strong | Strong |
| Strength | Balance, concurrency | Lightweight | Aggregation scans | Writes, space efficiency |
| Weakness | Space usage | Concurrent writes | Single-row updates | Read trade-off |
The starting point of the choice is almost always InnoDB, because it strikes a good balance of transactions, concurrency, and stability. It is safest to consider adopting ColumnStore and MyRocks only when you are confident that "this workload hits InnoDB's weak point precisely." The list of supported engines and whether they are enabled by default can differ by version and build, so please confirm with the official documentation.
You can check the installed engines as follows.
-- Check available storage engines and their support status
SHOW ENGINES;
-- Check which engine a particular table uses
SELECT table_name, engine
FROM information_schema.tables
WHERE table_schema = 'mydb';
-- Specify the engine when creating a table
CREATE TABLE orders (
id BIGINT NOT NULL AUTO_INCREMENT,
user_id BIGINT NOT NULL,
amount DECIMAL(12,2) NOT NULL,
created DATETIME NOT NULL,
PRIMARY KEY (id),
KEY idx_user (user_id)
) ENGINE=InnoDB;
Replication
A single server has limits in availability and read scaling. Replication is the technology that delivers changes from one server (primary/master) to other servers (replica/slave) to keep the same data.
At the center of MariaDB replication is the binary log (binlog). The primary records every operation that changes data into the binlog, and the replica receives this log and re-applies it to its own data.
┌────────────── Primary ──────────────┐
│ transaction ──▶ storage engine │
│ │ │
│ └──▶ write Binary Log (binlog) │
└──────────────────┬───────────────────┘
│ send binlog events
▼
┌────────────── Replica ───────────────┐
│ IO thread ──▶ store in Relay Log │
│ │ │
│ SQL/worker thread ─▶ apply to data │
└──────────────────────────────────────┘
Looking at the replica side in a little more detail, the IO thread receives binlog events from the primary and stores them in the local relay log, and the SQL thread (or parallel workers) reads the relay log and applies it to the actual data.
Asynchronous Replication
The default replication method is asynchronous. When the primary commits a transaction, it immediately responds success to the client and does not wait to see whether the replica has received and applied the change.
- Advantage: the primary's write latency is not affected by the replica, so it is fast.
- Disadvantage: if the primary dies right after committing, transactions that have not yet been transferred to the replica can be lost (possibility of data loss). Also, replication lag, where the replica falls behind the primary, can occur.
Semi-synchronous Replication
Semi-synchronous replication is a compromise between asynchronous and synchronous. Before the primary confirms the commit to the client, it waits for an acknowledgment (ack) that at least one replica has received the event.
The important point here is that this is not a guarantee that the replica has applied the event, but a guarantee that it has received it and safely written it to the relay log. Even so, it reduces the situation where only the primary holds the data, so the risk of data loss is lower than asynchronous. In return, commit latency increases by the amount of time waiting for the replica's ack.
GTID (Global Transaction ID)
Traditional replication tracked progress by "the binlog file name and the position (offset) within it." This made it tricky to align positions precisely during failover (promoting another server when a failure occurs).
GTID attaches a cluster-wide unique identifier to every transaction. Thanks to this, the replica can know "up to which transaction it has applied" by ID rather than by position, and even if the primary changes, it can clearly pick up the missing transactions. Failover and topology changes become much easier.
An example of checking replication status and starting replication based on GTID looks like this (syntax may differ by version).
-- Check replication status on the replica
SHOW REPLICA STATUS\G
-- Example of GTID-based replication setup
CHANGE MASTER TO
MASTER_HOST='primary.example.internal',
MASTER_USER='repl',
MASTER_PASSWORD='secret',
MASTER_USE_GTID=slave_pos;
START REPLICA;
Galera Cluster (Synchronous Multi-Master)
The replication so far has been a structure where a single primary takes writes and the rest follow. The Galera cluster is different. It is a synchronous multi-master approach in which every node can take both reads and writes.
application / load balancer
/ | \
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Node A │◀─▶│ Node B │◀─▶│ Node C │
│ (R/W) │ │ (R/W) │ │ (R/W) │
└──────────┘ └──────────┘ └──────────┘
▲ ▲ ▲
└───────── wsrep group comm ───┘
(on commit, propagate/certify the write-set on all nodes)
The heart of Galera is the certification-based replication that happens at commit time. Simplified, the flow is as follows.
Galera commit flow (simplified)
1) Client runs a transaction on Node A
│
▼
2) Just before commit, bundle the changes into a write-set
(changed rows and their key info)
│
▼
3) Broadcast the write-set to all nodes
│
▼
4) Each node performs certification in the same order
- Does it conflict with a concurrent transaction on another node?
│
├── no conflict ──▶ all nodes commit (consistent state)
│
└── conflict ─────▶ reject the later transaction (rollback)
→ deadlock-like error to the application
It is important to understand precisely what "synchronous" means here. Galera propagates the write-set to all nodes at commit time and passes certification, so if the commit succeeds, the transaction is guaranteed to be reflected across the whole cluster. However, actually applying it to each node's disk can lag slightly, which is why it is commonly called virtually synchronous.
Galera's Advantages and Caveats
- Data consistency: a transaction that passes certification is reflected on all nodes, so there is almost no risk of the data loss seen in asynchronous replication.
- High availability: even if one node dies, the remaining nodes keep taking reads and writes.
- Read scaling: you can read from any node.
However, watch out for the following.
- Write conflicts and certification failures: if the same row is modified concurrently on multiple nodes, a conflict occurs at the certification stage, and the later transaction is rolled back. For this reason, rather than freely distributing writes across multiple masters, a common pattern is to funnel writes to a single node.
- Commit latency: because it must communicate with all nodes, if the network is slow or nodes are far apart, commits become slow. You must be careful in geographically distributed environments.
- Quorum and split-brain: to prevent data from diverging when the network is partitioned, only the side that forms a majority (quorum) operates. For this reason, it is common to keep the node count odd (for example, 3).
- DDL and large transactions: schema changes and very large transactions can affect the entire cluster and are tricky to handle.
Galera Configuration Example
Galera operates with wsrep-related settings (parameter names and recommended values may differ by version).
[galera]
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_name=my_cluster
wsrep_cluster_address=gcomm://10.0.0.1,10.0.0.2,10.0.0.3
wsrep_node_address=10.0.0.1
wsrep_sst_method=mariabackup
# Galera assumes InnoDB and a row-based replication format
default_storage_engine=InnoDB
binlog_format=ROW
innodb_autoinc_lock_mode=2
When a new node joins the cluster, it must receive data from an existing node and synchronize. This is called an SST (State Snapshot Transfer), which copies the entire dataset wholesale. There is also an IST (Incremental State Transfer) that quickly catches up only the changes. In large clusters, SST imposes significant time and load, so it must be considered in operations.
Replication Methods Comparison Table
| Item | Asynchronous | Semi-synchronous | Galera |
|---|---|---|---|
| Write node | Single primary | Single primary | Any node possible |
| Commit response timing | Immediate | After replica receive ack | After cluster certification |
| Data loss risk | Yes | Low | Very low |
| Commit latency | Low | Medium | Depends on network |
| Replication lag possibility | Yes | Yes | Very small |
| Typical use | Read scaling, backup | Loss-sensitive OLTP | HA, consistency-focused |
Partitioning
When a table becomes very large, partitioning — logically dividing one huge table into several pieces — is useful. To the application it still looks like a single table, but internally the data is split across several partitions according to the rule you define.
One logical table (sales)
┌──────────────────────────────────────────────┐
│ PARTITION BY RANGE (YEAR(sale_date)) │
└──────────────────────────────────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ p2023 │ │ p2024 │ │ p2025 │ │ pmax │
│ < 2024 │ │ < 2025 │ │ < 2026 │ │ rest │
└────────┘ └────────┘ └────────┘ └────────┘
WHERE sale_date >= '2025-01-01'
──▶ p2023, p2024 excluded from scan (partition pruning)
The biggest benefit of partitioning is partition pruning. When the partition key appears in the query condition, the optimizer skips irrelevant partitions entirely, reducing the scan range. Also, when wiping out old data wholesale, you can remove it quickly per partition, which is useful for managing large time-series data.
An example of range partitioning looks like this.
CREATE TABLE sales (
id BIGINT NOT NULL AUTO_INCREMENT,
sale_date DATE NOT NULL,
amount DECIMAL(12,2) NOT NULL,
PRIMARY KEY (id, sale_date)
)
ENGINE=InnoDB
PARTITION BY RANGE (YEAR(sale_date)) (
PARTITION p2023 VALUES LESS THAN (2024),
PARTITION p2024 VALUES LESS THAN (2025),
PARTITION p2025 VALUES LESS THAN (2026),
PARTITION pmax VALUES LESS THAN MAXVALUE
);
A point to be careful about is that partitioning is not a silver bullet. If the partition key is not in the query condition, all partitions must be scanned, which can actually be slower. Also, there is a constraint that the partition key must be included in every unique key, including the primary key, which affects key design.
The Relationship Between MariaDB and MySQL, and Their Differences
Because MariaDB started as a fork of MySQL, much of the SQL syntax and the client protocol are compatible. However, as the two projects have evolved independently, the differences have grown.
| Item | MariaDB | MySQL |
|---|---|---|
| Origin | Forked from MySQL | The original |
| License/governance | Community-centered foundation | Commercial vendor-led |
| Storage engine diversity | Diverse: Aria, ColumnStore, etc. | InnoDB-centered |
| Synchronous cluster | Galera integrated by default | Separate group replication solution |
| JSON handling | Function-oriented approach | Emphasizes a native type |
| Some new features | Adds its own features | Adds its own features |
The point to remember in practice is that as versions advance, compatibility between the two products is not 100%. Basic CRUD and standard SQL are largely compatible, but there can be differences in the replication protocol, GTID format, JSON handling, some system variables, and new functions. Therefore, if you plan a migration from one to the other, do not assume a simple dump/restore is the end of it; you must go through sufficient compatibility verification. It is safest to compare the specific differences between the versions you are using through the official documentation.
Operations and Tuning
Once you understand the architecture, it is good to keep a list of the items you frequently touch in operations.
Commonly Seen Configuration Items
[mysqld]
# InnoDB buffer pool: the more of the working set in memory, the less disk I/O
# (on a dedicated DB server, allocating a large portion of physical memory is common)
innodb_buffer_pool_size = 8G
# Redo log related: too small and checkpoints become frequent
# (recommended values and parameter names may differ by version)
innodb_log_file_size = 1G
# Log flush policy on commit (a durability vs. performance trade-off)
# 1: safest (flush every commit), 0/2: faster but higher loss risk
innodb_flush_log_at_trx_commit = 1
# Replication format: ROW is recommended for Galera and stable replication
binlog_format = ROW
# In high-connection environments, consider the thread pool
thread_handling = pool-of-threads
The right level for each value depends on the workload, hardware, and version. The values above are only a starting point; in real operations you must adjust them while watching monitoring metrics.
Monitoring Points
- Buffer pool hit ratio: a high rate of falling through to disk is a sign that the buffer pool is small or the working set is too large.
- Replication lag: periodically check how far the replica has fallen behind the primary.
- Slow query log: the starting point for finding queries with poor execution plans.
- Lock waits and deadlocks: catch concurrency problems early.
- Galera status variables: watch cluster size, flow control frequency, certification failure rate, and so on.
Check execution plans with EXPLAIN.
-- Check the execution plan chosen by the optimizer
EXPLAIN
SELECT u.id, u.name, COUNT(o.id) AS cnt
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.created >= '2026-01-01'
GROUP BY u.id, u.name;
-- When you also want the actual execution statistics (where supported)
ANALYZE
SELECT * FROM orders WHERE user_id = 42;
Pitfalls and Common Mistakes
Here are the pitfalls repeatedly encountered over long operations.
-
Setting the buffer pool too small: it is common to blame queries when disk I/O is the bottleneck. Please first check the relationship between the working set and the buffer pool size.
-
Mistaking asynchronous replication for synchronous: with asynchronous replication, the last few transactions can be lost when the primary fails. The assumption "there is replication, so there is no loss" is dangerous. If you are sensitive to loss, consider semi-synchronous or Galera.
-
Recklessly distributing writes across multiple masters in Galera: modifying the same row concurrently on several nodes causes frequent certification conflicts. Usually you funnel writes to one node and use the rest for reads.
-
Queries where partition pruning does not work: if the partition key is not in the WHERE condition, all partitions are scanned. If you adopted partitioning but things became slower, check with EXPLAIN whether pruning is working.
-
Bad plans from stale statistics: if the data distribution has changed greatly but the statistics are old, the optimizer may pick the wrong index.
-
Choosing an engine that does not match the workload: beware of mismatches like using only InnoDB for analytical aggregation and complaining it is slow, or using ColumnStore where single-row updates are frequent.
-
Running heavy DDL: schema changes on a large table can cause locks and load. Check the online DDL options and their constraints in the documentation for your version, and handle them even more carefully in Galera.
-
Overconfidence in MariaDB/MySQL compatibility: do not assume a migration between two products of different versions can end with a simple dump/restore. There can be differences in replication, GTID, and some functions.
Closing
The MariaDB architecture starts from the separation of "the layer that interprets SQL" and "the storage engine layer that stores data." Thanks to this separation, with the same SQL you can pick engines with completely different characters — InnoDB, Aria, ColumnStore, MyRocks — to match the workload.
The methods of spreading data across multiple nodes — from asynchronous replication through semi-synchronous, GTID, and Galera synchronous multi-master — offer different balance points between consistency and performance. There is no single "correct answer"; the choice changes according to your data-loss tolerance, latency requirements, and operational complexity.
Partitioning is a powerful tool for handling large tables, but remember that partition key design and pruning are the key. And while MariaDB and MySQL share the same roots, their differences keep growing, so it is safest to make a habit of always confirming version-specific behavior with the official documentation.
Finally, to emphasize: the specific parameters, syntax, and defaults in this article may change by version. Use the big picture to set your direction, but before applying anything in practice, be sure to check the official documentation for the version you are running.
References
- MariaDB Knowledge Base: https://mariadb.com/kb/en/
- Storage Engines: https://mariadb.com/kb/en/storage-engines/
- InnoDB: https://mariadb.com/kb/en/innodb/
- Galera Cluster: https://mariadb.com/kb/en/galera-cluster/
- Replication: https://mariadb.com/kb/en/replication/
- Partitioning Tables: https://mariadb.com/kb/en/partitioning-tables/
- Global Transaction ID: https://mariadb.com/kb/en/gtid/
- Galera Cluster Documentation: https://galeracluster.com/library/documentation/
- MySQL Reference Manual: https://dev.mysql.com/doc/