Three Durability Modes, One WAL: Configurable Guarantees for Different Workloads

💡Typhon is an embedded, persistent, ACID database engine written in .NET that speaks the native language of game servers and real-time simulations: entities, components, and systems.
It delivers full transactional safety with MVCC snapshot isolation at sub-microsecond latency, powered by cache-line-aware storage, zero-copy access, and configurable durability.

Series: A Database That Thinks Like a Game Engine

  1. Why I’m Building a Database Engine in C#
  2. What Game Engines Know About Data That Databases Forgot
  3. Microsecond Latency in a Managed Language
  4. Deadlock-Free by Construction
  5. Three Durability Modes, One WAL (this post)
  6. MVCC at Microsecond Scale (coming soon)


Most databases pick one durability strategy at boot time. Typhon lets each commit choose. The surprising part isn’t the user-facing API; it’s that all three modes share the same WAL writer thread, the same ring buffer, and the same I/O path. (The WAL is the Write-Ahead Log: every commit appends a record there, and the commit is durable once that record reaches stable media.) The only thing that differs is whether the producer waits.

Ten thousand NPC position updates per simulation tick at ~1-2µs each, all Deferred. One legendary item drop on the same transaction code path, escalated to Immediate, paying ~15-85µs for a guaranteed FUA (Force Unit Access — don’t ack the write until it’s on stable media) write to disk. Same engine, same WAL, one extra argument at Commit(). This post is about why that’s possible and what it cost to keep it that way.

The other classical knob

Post #4 covered Typhon’s first big architectural bet — eliminating deadlocks at the design level instead of detecting them at runtime. This one is about the other classical database knob: durability. And like deadlocks, the interesting decision is upstream of the implementation.

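Three workloads sit on the same engine in a real game server: simulation ticks and batch imports, which want the cheapest possible commit; general request handling, which can tolerate a few milliseconds of data-at-risk; and trades, account writes, and legendary drops, which can tolerate none.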

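A single global durability setting forces all three to the most conservative option. At ~15-85µs per FUA write, ten thousand position updates committed one at a time would spend roughly 150-850ms waiting on disk, against a 60Hz tick budget of about 16.7ms. That is a fight you’ve already lost.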

The three modes

The decision lives on the Unit of Work (UoW) at creation time, with a per-transaction override for escalation only. A UoW sits one level above a transaction: it groups one or more transactions and owns the durability contract they share. Transactions still commit atomic state changes; the UoW decides when — and whether — those commits reach disk. The user-facing enum is exactly what it looks like:

public enum DurabilityMode : byte
{
    /// <summary>WAL records buffered. Durable only after explicit Flush()/FlushAsync().
    /// Commit latency: ~1-2µs. Data-at-risk: until Flush().</summary>
    Deferred = 0,

    /// <summary>WAL writer auto-flushes every N ms (default 5ms).
    /// Commit latency: ~1-2µs. Data-at-risk: ≤ GroupCommitInterval.</summary>
    GroupCommit = 1,

    /// <summary>FUA on every tx.Commit(). Blocks until WAL record is on stable media.
    /// Commit latency: ~15-85µs. Data-at-risk: zero.</summary>
    Immediate = 2,
}

public enum DurabilityOverride : byte
{
    Default   = 0,  // Use the UoW's DurabilityMode
    Immediate = 1,  // Force FUA for this specific commit (escalation only)
}

The override can only escalate. A Deferred UoW can promote one transaction to Immediate; an Immediate UoW cannot weaken anything. This is a deliberate constraint, not an oversight: DurabilityOverride has no member weaker than Default, so the API gives you no way to make a transaction less durable than the UoW’s contract, by accident or otherwise.
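One way to picture the escalation-only rule is as a pure function from the UoW's mode and the per-commit override to the mode that actually applies. This is a minimal sketch, not Typhon's code: `DurabilityResolver` and `Resolve` are hypothetical names, and the two enums are repeated (without their doc comments) only so the snippet compiles on its own.

```csharp
using System;

public enum DurabilityMode : byte { Deferred = 0, GroupCommit = 1, Immediate = 2 }
public enum DurabilityOverride : byte { Default = 0, Immediate = 1 }

// Hypothetical helper for illustration; not part of Typhon's API.
public static class DurabilityResolver
{
    public static DurabilityMode Resolve(DurabilityMode uowMode, DurabilityOverride commitOverride)
    {
        // The override enum has no member weaker than Default, so the result
        // is either the UoW's own mode or Immediate, never something weaker.
        return commitOverride == DurabilityOverride.Immediate
            ? DurabilityMode.Immediate
            : uowMode;
    }
}
```

Because the weakest override is `Default`, there is no input for which `Resolve` returns a mode weaker than `uowMode`. The guarantee comes from the shape of the enum rather than from a runtime check.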

| Mode | Commit latency | Data-at-risk window | Use case |
|---|---|---|---|
| Deferred | ~1-2µs | Until explicit Flush() | Game ticks, batch imports, simulation steps |
| GroupCommit | ~1-2µs amortized | ≤ 5ms (configurable) | General server load, request handlers |
| Immediate | ~15-85µs | Zero | Trades, account writes, legendary drops |

One writer thread, three signaling patterns

Here’s the part that surprised me when I came back to the design six months in: I did not need three I/O paths. Or three threads. Or three buffers. The shared infrastructure looks like this:

Figure: Three durability modes converging on one WAL writer. Deferred, GroupCommit, and Immediate producers all publish to the same MPSC (multi-producer, single-consumer) ring buffer. Only Immediate also signals the writer and waits for DurableLsn to advance past its LSN. The GroupCommit timer wakes the writer on a 5ms ceiling. One writer thread, one segment file, one FUA write per drained batch.

The same picture broken into phases:

| # | Producer thread (tx.Commit()) | What this means for you | WAL writer thread (single, dedicated) | What this means for you |
|---|---|---|---|---|
| 1 | TryClaim(): CAS (Compare-And-Swap) slot allocation | Your commit atomically reserves a slot in the WAL ring buffer. No lock contention with other transactions claiming slots in parallel. | (idle, or finishing a previous drain) | The writer thread runs independently. Your producer never waits on it to claim a slot. |
| 2 | Publish(): release-store the frame header | The record is now visible to the writer. Your commit has an LSN (Log Sequence Number), its position in the durability timeline. | TryDrain(): contiguous batch of published frames | The writer harvests every published slot in one pass. This is the structural reason GroupCommit is amortized: N producers, one drain, one FUA cost. |
| 3 | Mode-specific: return now, or WaitForDurable(lsn) | Deferred / GroupCommit: Commit() returns in ~1-2µs and durability lands asynchronously. Immediate: Commit() returns only once your LSN is on disk (~15-85µs). | WriteAligned() → FUA write → Interlocked.Exchange(DurableLsn) → _durabilityEvent.Set() | One physical FUA write per batch (~15-85µs, paid once). DurableLsn advances; any Immediate waiter whose LSN ≤ DurableLsn wakes and returns from tx.Commit(). |
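The phases above can be modeled in a few lines. The following is a toy sketch under loud assumptions: `ToyWal`, `Publish`, and `DrainOnce` are hypothetical names, a `Queue<long>` behind a lock stands in for the lock-free MPSC ring buffer, and the FUA write is simulated by a counter. It only illustrates why N published records cost one flush.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Toy model of the shared WAL path; every name here is hypothetical.
public sealed class ToyWal
{
    private readonly object _gate = new object();
    private readonly Queue<long> _published = new Queue<long>(); // stand-in for the ring buffer
    private long _nextLsn;     // last LSN handed out
    private long _durableLsn;  // highest LSN covered by a completed flush

    public long DurableLsn => Interlocked.Read(ref _durableLsn);

    // Only the single writer thread touches this counter.
    public int FlushCount { get; private set; }

    // Producer side: claim an LSN and publish the record in one step.
    // A lock replaces the CAS claim and release-store publish for brevity.
    public long Publish()
    {
        lock (_gate)
        {
            long lsn = ++_nextLsn;
            _published.Enqueue(lsn);
            return lsn;
        }
    }

    // Writer side: take every published record, pay for one flush,
    // then advance DurableLsn. Returns the number of records drained.
    public int DrainOnce()
    {
        long last = 0;
        int count;
        lock (_gate)
        {
            count = _published.Count;
            if (count == 0) return 0;
            while (_published.Count > 0) last = _published.Dequeue();
        }
        FlushCount++;                                // one simulated FUA write for the whole batch
        Interlocked.Exchange(ref _durableLsn, last); // LSNs are issued in order, so 'last' is the max
        return count;
    }
}
```

Three calls to `Publish()` followed by one `DrainOnce()` leave `FlushCount` at 1 and `DurableLsn` at the third LSN. That is the amortization described in phase 2, reduced to its bookkeeping.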

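What changes between the modes is only the producer side. A Deferred commit publishes its record and returns. A GroupCommit commit also publishes and returns, relying on the writer’s timer (5ms by default) to flush. An Immediate commit publishes, signals the writer, and calls WaitForDurable(lsn).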

That last one is where the elegance lives. It’s not a separate code path — it’s the same path, with one extra fast-path check:

public void WaitForDurable(long lsn, ref WaitContext ctx)
{
    // Fast path: already durable, returns inline.
    if (Interlocked.Read(ref _durableLsn) >= lsn)
    {
        return;
    }
    WaitForDurableSlow(lsn, ref ctx);
}

If the WAL writer has already drained past your LSN by the time you call this — say, because someone else’s Immediate commit just batched yours along with it — you pay one atomic read and a return. No event wait, no syscall, no context switch. Immediate mode is GroupCommit’s batching benefit, available to the one transaction that needs it now.
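The post does not show the slow path, so here is one way a fast-path-then-block gate could look. This is a sketch with a swapped-in mechanism: it uses `Monitor.Wait` and `Monitor.PulseAll` instead of the event Typhon's writer sets, it drops the `WaitContext` parameter, and `DurableLsnGate` and `Advance` are hypothetical names.

```csharp
using System;
using System.Threading;

// Hypothetical gate pairing a lock-free fast path with a blocking slow path.
public sealed class DurableLsnGate
{
    private readonly object _gate = new object();
    private long _durableLsn;

    // Producer side (Immediate commits).
    public void WaitForDurable(long lsn)
    {
        // Fast path: one atomic read, no lock, no kernel transition.
        if (Interlocked.Read(ref _durableLsn) >= lsn) return;

        // Slow path: block until the writer advances past our LSN.
        lock (_gate)
        {
            // Re-check under the lock so a wake-up between the fast-path
            // read and the wait cannot be missed.
            while (_durableLsn < lsn) Monitor.Wait(_gate);
        }
    }

    // Writer side: called after each completed flush.
    public void Advance(long newDurableLsn)
    {
        lock (_gate)
        {
            if (newDurableLsn > _durableLsn)
                Interlocked.Exchange(ref _durableLsn, newDurableLsn);
            Monitor.PulseAll(_gate); // wake every waiter; each re-checks its own LSN
        }
    }
}
```

A waiter whose LSN was already covered by an earlier batch never takes the lock. Waking all waiters and letting each re-check its LSN keeps the writer side simple, at the cost of some spurious wake-ups.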

That’s the teaching moment I want this post to leave you with: per-transaction durability is not three implementations. It’s one implementation with three producer-side policies, and the FUA cost is a property of the I/O path, not the API surface.

Per-UoW, not per-engine — why

The API-shape decision is recorded in its own ADR (Architecture Decision Record). I considered four alternatives before landing on per-UoW:

| Alternative | Why I rejected it |
|---|---|
| Per-database (boot-time) | Too coarse. Game ticks and trades on the same DB need different modes within the same process. |
| Per-transaction | Can’t batch: GroupCommit is inherently multi-transaction. The UoW is the natural batching boundary. |
| Two modes (Sync / Async) | Misses GroupCommit’s sweet spot. The whole point is amortized FUA, which a binary doesn’t capture. |
| Caller-managed flush only | Error-prone. Developers forget to flush. GroupCommit automates the common case correctly. |

The honest version: I tried “per-database” first because it was easiest to wire up, and immediately hit the simulation-vs-trade problem on the first benchmark. Two production game-server workloads on the same engine, one wanting ~1µs commits and one wanting zero data loss. The mode has to follow the workload, not the storage.

Numbers that matter

The latency table above is the headline. The throughput table is where it gets interesting:

| Mode | Single-thread durable tx/s | Multi-thread durable tx/s |
|---|---|---|
| Deferred | N/A (batch-durable) | N/A |
| GroupCommit (5ms interval) | ~200K+ amortized | Millions (shared flush) |
| Immediate | ~12K-65K | ~12K-65K (FUA-limited) |

Single-thread Immediate is capped by the NVMe FUA round-trip: at ~15-85µs per write, one commit at a time works out to roughly 12K-65K durable commits per second, and there’s no software trick to escape that. Multi-thread Immediate does not scale linearly with thread count either, because every commit is racing the same writer through the same I/O path. GroupCommit, on the other hand, scales nearly with thread count because the FUA cost is paid once per drain cycle no matter how many producers contributed to the batch.

That’s not a flaw in Immediate. It’s the physics of the storage device. The point of having three modes is that you only pay that cost where you actually need it.
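The ceiling in the Immediate row follows from simple arithmetic, sketched here as a helper. `FuaMath` and its parameters are hypothetical names, and the model deliberately ignores everything except the FUA round-trip (no timer interval, no CPU cost), so it gives an upper bound and not a benchmark.

```csharp
using System;

// Back-of-the-envelope model; hypothetical helper, not Typhon code.
public static class FuaMath
{
    // Upper bound on durable commits per second when each flush costs
    // fuaMicros microseconds and covers commitsPerFlush commits.
    public static double DurableCommitsPerSecond(double fuaMicros, int commitsPerFlush = 1)
        => commitsPerFlush * 1_000_000.0 / fuaMicros;
}
```

With one commit per flush, 15µs gives about 66.7K commits per second and 85µs gives about 11.8K, which brackets the ~12K-65K figure in the table. Raising `commitsPerFlush` scales the bound linearly, which is the whole argument for batching: the per-flush cost stays the same while the number of commits it covers grows.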

What I got wrong

The first group-commit timer was 1ms. Under low write load the WAL was doing constant small FUA writes — worst-case for SSD wear and tail latency. Tuning to 5ms with a “wake on N records OR T milliseconds” trigger fixed it: the writer sleeps on WaitForData(intervalMs) and gets pulled out early when a producer signals (Immediate commits, explicit Flush(), or back-pressure). Idle periods cost nothing; busy periods batch naturally.
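The "N records OR T milliseconds" trigger can be approximated with a single auto-reset event. This is a sketch under assumptions: the post names `WaitForData(intervalMs)` but not its signature, so the `bool` return, the `WriterWakeup` class, `Signal`, and `OnRecordPublished` are all hypothetical.

```csharp
using System;
using System.Threading;

// Hypothetical wake-up helper for a single writer thread.
public sealed class WriterWakeup : IDisposable
{
    private readonly AutoResetEvent _dataSignal = new AutoResetEvent(false);
    private readonly int _recordThreshold;
    private int _pending; // approximate count of records published since the last wake-up

    public WriterWakeup(int recordThreshold) => _recordThreshold = recordThreshold;

    // Explicit wake-up: Immediate commit, Flush(), or back-pressure.
    public void Signal() => _dataSignal.Set();

    // Called by producers after publishing; wakes the writer once N records are pending.
    public void OnRecordPublished()
    {
        if (Interlocked.Increment(ref _pending) >= _recordThreshold) _dataSignal.Set();
    }

    // Writer side: returns true when woken early, false when the interval elapsed.
    public bool WaitForData(int intervalMs)
    {
        bool signaled = _dataSignal.WaitOne(intervalMs);
        // The writer drains everything after waking, so the count starts over.
        Interlocked.Exchange(ref _pending, 0);
        return signaled;
    }

    public void Dispose() => _dataSignal.Dispose();
}
```

The writer loop then becomes "wait, drain, flush, repeat". The pending count is only approximate, since a record published between the wake-up and the reset is drained without being counted, but the worst outcome is an early or late wake-up, never a lost record.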

The first override design allowed downgrade: tx.Commit(DurabilityOverride.Deferred) from an Immediate UoW. The use case was “this single read-mostly transaction doesn’t really need FUA.” The use case was wrong: the UoW’s contract is the durability floor, not the default. Downgrading a single commit means the application has accidentally created a hole in a contract it thinks it has. Now overrides can only escalate, and the type system enforces it.

Deferred mode is a contract, not a latency number. Early users assumed Commit() always meant “on disk.” It doesn’t. Deferred mode says: your data is not durable until you call Flush() or close cleanly. For game servers that’s fine; their tick loop already has clear boundaries. But the documentation now leads with the contract, not the µs number. The latency is a consequence; the contract is what you signed up for.
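The contract is easier to see in a toy model than in a latency table. The sketch below is not Typhon's API: `ToyDeferredLog`, `Crash`, and `AtRisk` are hypothetical names, and two in-memory lists stand in for the WAL buffer and stable media.

```csharp
using System;
using System.Collections.Generic;

// Toy model of the Deferred contract; every name here is hypothetical.
public sealed class ToyDeferredLog
{
    private readonly List<string> _buffered = new List<string>(); // committed, not yet durable
    private readonly List<string> _durable = new List<string>();  // survived a Flush()

    public IReadOnlyList<string> Durable => _durable;
    public int AtRisk => _buffered.Count;

    // Commit returns immediately; the record is accepted but not durable.
    public void Commit(string record) => _buffered.Add(record);

    // Flush is the only point where buffered records become durable.
    public void Flush()
    {
        _durable.AddRange(_buffered);
        _buffered.Clear();
    }

    // Simulated crash: everything committed since the last Flush() is gone.
    public void Crash() => _buffered.Clear();
}
```

In a tick loop, the natural place for `Flush()` is the tick boundary: commit every update for the tick, flush once, and accept that a crash mid-tick loses at most that tick's work.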

What’s next

Post #6 in the series: Building a Page Cache That Doesn’t Count: Epoch-Based Memory Management. The durability story above assumes the commit path is fast, but that’s only half the story — the read path has to be just as cheap, and the trick there is replacing per-page reference counting with epoch-based protection: two atomic operations per transaction instead of two per page. The mechanism is elegant enough that it deserves its own post.

Follow the GitHub repo for source and benchmarks, or subscribe via RSS.