Deadlock-Free by Construction: How Typhon Eliminates Deadlocks Instead of Detecting Them

💡 Typhon is an embedded, persistent, ACID database engine written in .NET that speaks the native language of game servers and real-time simulations: entities, components, and systems.
It delivers full transactional safety with MVCC snapshot isolation at sub-microsecond latency, powered by cache-line-aware storage, zero-copy access, and configurable durability.

Series: A Database That Thinks Like a Game Engine

  1. Why I’m Building a Database Engine in C#
  2. What Game Engines Know About Data That Databases Forgot
  3. Microsecond Latency in a Managed Language
  4. Deadlock-Free by Construction (this post)
  5. MVCC at Microsecond Scale (coming soon)

GitHub repo  •  Subscribe via RSS

Deadlocks are usually treated as a runtime problem. We treat them as a design bug.

That sounds like a slogan. It isn’t. It’s the actual reasoning behind three architectural decisions that, taken together, make a lock-dependency cycle impossible in Typhon — not unlikely, not rare, impossible. The engine ships without a deadlock detector. From the project’s own concurrency overview:

Deadlock detection is explicitly not implemented — it would add overhead for a scenario that cannot occur in the current architecture.

This post is the how: how three structural decisions remove three classes of edges from the lock-dependency graph, and how that elimination cascades into “no cycle is possible.” But it’s also the why: why the constraint was set at project inception, before any code existed to deadlock, and what it cost.

The upfront bet

I didn’t compare deadlock detection schemes before starting Typhon. I’d seen them in production at previous engines, and the pattern was always the same: a separate background scanner, a wait-for graph, victim selection heuristics, transaction abort and retry. A lot of code. A lot of edge cases. None of it bulletproof for the user, who still sees occasional one-second pauses or unexplained transaction failures under load.

So I made an upfront call, recorded as ADR-003, the project’s first concurrency decision, dated 2024-01 (project inception):

Optimistic locking: No locks during execution; conflict detection only at commit.

(ADRs — Architecture Decision Records — are short documents capturing one design choice with its context and rationale. They’re a paper trail for why a thing was built a particular way, not just what it does. Typhon has 49 of them so far. They live in the project’s internal documentation, not in the public repo.)

That’s the bet. No locks across data, whatever the architectural cost. Not because I had proof prevention would be faster — I didn’t run those benchmarks — but because the implementation cost of detection is real, the result is never bulletproof, and trading an architectural cost up front for never paying a runtime cost later is the trade I wanted.

The three “pillars” that follow aren’t a survey of alternatives I considered. They’re what the architecture had to become once the constraint was set. MVCC was the obvious starting point. Optimistic Lock Coupling for indexes followed because traditional B+Tree latch coupling violated the constraint at the index level. The “no cross-table latching” rule emerged because anything else reintroduced the cycles I was trying to eliminate.

It’s constraint-driven design, not survey-driven. And it’s why this post claims a property — deadlock-free by construction — instead of a benchmark.

What a deadlock actually is

Briefly, because the rest of the post needs the picture.

Two transactions, T1 and T2. T1 holds a lock on row A and asks for a lock on row B. T2 holds B and asks for A. Neither can proceed. Each is waiting for the other; the wait will never end. That’s a cycle in the lock-dependency graph — the directed graph whose nodes are transactions and whose edges are “is waiting for.” A deadlock is a cycle in that graph. Detection-based databases scan for cycles and break them by aborting one transaction. Prevention-based databases make cycles impossible to form.
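
To see the shape concretely, here is the textbook pattern as a minimal C# sketch. This is illustrative, not Typhon code: two threads take the same two locks in opposite order, and the sleeps force the fatal interleaving.

// Illustrative only: the classic lock-ordering deadlock.
using System.Threading;

object a = new(), b = new();

var t1 = new Thread(() => { lock (a) { Thread.Sleep(50); lock (b) { } } });
var t2 = new Thread(() => { lock (b) { Thread.Sleep(50); lock (a) { } } });

t1.Start(); t2.Start();
// T1 now holds `a` and waits for `b`; T2 holds `b` and waits for `a`.
// That is the cycle: both Join calls below hang forever.
t1.Join(); t2.Join();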

The three sections that follow each remove one class of edges from that graph. With every class removed, no cycle is possible.

Pillar 1: MVCC eliminates inter-transaction data locks

The textbook deadlock — T1 locks row A, T2 locks row B, both want the other — requires row-level locking between transactions. Typhon doesn’t do that.

Reads are snapshot-consistent: every transaction is frozen at the global tick value when it began. A reader sees a stable view of the database for its entire lifetime. It never asks for a lock, because there’s nothing to lock against — the snapshot is already immutable.

Writes don’t lock existing rows either. They create new revisions, with the previous revision left intact for any transaction whose snapshot still references it. Two writers updating the same component don’t fight over a lock; they each append a new revision to the chain. Conflict detection happens at commit time, as a single CAS operation: when the writer tries to install its new revision as the current one, the engine checks that the version it built on is still current. If not, the writer aborts and retries.
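
Here is a minimal sketch of that commit-time check, assuming a per-component revision chain with a current-head pointer. The Revision type and TryCommit are illustrative names, not Typhon’s actual API.

using System.Threading;

// Illustrative, not Typhon's API: a new revision is appended to the chain,
// and a single CAS decides whether it becomes the current one.
sealed class Revision
{
    public long Tick;              // global tick at which the revision was created
    public Revision? Previous;     // older revision, still visible to older snapshots
}

static class CommitPath
{
    public static bool TryCommit(ref Revision head, Revision baseline, Revision candidate)
    {
        candidate.Previous = baseline;   // append, never overwrite
        // Succeeds only if no other writer installed a revision since we read `baseline`.
        // On failure the writer aborts locally and retries against the refreshed head.
        return Interlocked.CompareExchange(ref head, candidate, baseline) == baseline;
    }
}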

This removes the entire edge class of “data locks held across transactions.” There are no row locks, no read locks, no write locks on data. The wait-for graph at the transaction level has no edges to form a cycle from.

This isn’t free. Two writers updating the same component will conflict at commit, and one of them will retry. For game-server workloads where most components are written by exactly one system, conflicts are rare. For general OLTP workloads with high write contention, the trade shifts: fewer deadlocks, more aborts. A different curve.

Pillar 2: Optimistic Lock Coupling for index structures

Even without row locks, an index structure (B+Tree, R-Tree) is shared mutable state. Traditional databases serialize access through latch coupling: a reader holds a latch on the parent node while acquiring one on the child, releases the parent, advances. It’s a chain of overlapping latches walking down the tree.

That pattern can deadlock. Reader R has the parent latched and wants the child; concurrent writer W has the child latched and walks back up to fix the parent. Two threads, two index latches, mutual wait.

Typhon uses Optimistic Lock Coupling (Leis et al., 2019) instead. Readers don’t latch at all. Each B+Tree node carries a 32-bit version word. The reader reads the version, traverses, then re-reads the version at the end; if it changed, the traversal data may have been mutated mid-flight, so the reader restarts.

// From Typhon.Engine/Data/Index/OlcLatch.cs
// Bit 0: write-lock flag, bit 1: obsolete flag, bits 2-31: version counter.
private int _version;

public int ReadVersion()
{
    int v = Volatile.Read(ref _version);  // volatile read: never a stale or reordered value
    return (v & 0b11) == 0 ? v : 0;       // locked (bit 0) or obsolete (bit 1) -> restart
}

public bool TryWriteLock()
{
    int v = Volatile.Read(ref _version);
    if ((v & 0b1) != 0) return false;     // already write-locked
    return Interlocked.CompareExchange(ref _version, v | 0b1, v) == v;
}

public void WriteUnlock()
{
    int v = _version;  // stable: we hold the write lock
    Volatile.Write(ref _version, ((v >> 2) + 1) << 2 | (v & 0b10));  // version++, keep obsolete, clear lock
}

Bit 0 is the write-lock flag; bits 2–31 are a monotonic version counter. ReadVersion returns 0 if the node is locked or obsolete — the caller treats that as “restart.” TryWriteLock is a single CAS. WriteUnlock increments the version atomically with releasing the lock.

Writers latch only the modified nodes, and they acquire from root to leaf, in strict order. No reader ever blocks a writer. No writer ever holds a parent latch while waiting on a child. The same pattern is reused by the spatial R-Tree — same OlcLatch, same protocol — so this single mechanism covers both index families.
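
To make the reader side concrete, here is a hedged sketch of the read-validate-restart pattern on a single node, assuming the OlcLatch class shown above. The Node shape is an illustration, not Typhon’s B+Tree.

using System.Threading;

// Illustrative only: an optimistic read never takes a latch and never blocks.
sealed class Node
{
    public OlcLatch Latch = new();
    public long Payload;   // stands in for whatever a traversal actually reads
}

static class OptimisticRead
{
    public static long ReadPayload(Node node)
    {
        while (true)
        {
            int v = node.Latch.ReadVersion();   // 0 means locked or obsolete
            if (v == 0) { Thread.SpinWait(16); continue; }

            long payload = node.Payload;        // latch-free read of node data

            if (node.Latch.ReadVersion() == v)  // re-validate
                return payload;                 // version unchanged: the read was consistent
            // a writer intervened mid-flight: restart the read
        }
    }
}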

This removes the edge class of “index-level latch cycles.”

Pillar 3: No cross-table latch holding

Two edge classes are gone. The third is the most boring and the most important: at any given moment, a thread never holds a latch in more than one table.

Each ComponentTable in Typhon has independent indexes, independent revision chains, independent page allocations. A transaction’s commit path processes one table at a time. When the commit moves from table A to table B, all of A’s latches are released first.
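
As a sketch of that discipline (interface and method names are illustrative; this shows the shape of the rule, not Typhon’s actual commit code):

using System.Collections.Generic;

// Illustrative: latches are scoped to one table at a time, so a thread is
// never inside table A's latches while waiting on table B's.
interface IComponentTable
{
    void AcquireLatches(Transaction tx);
    void ApplyWrites(Transaction tx);
    void ReleaseLatches(Transaction tx);
}

sealed class Transaction
{
    public List<IComponentTable> TouchedTables { get; } = new();
}

static class CommitLoop
{
    public static void Commit(Transaction tx)
    {
        foreach (var table in tx.TouchedTables)
        {
            table.AcquireLatches(tx);
            try { table.ApplyWrites(tx); }
            finally { table.ReleaseLatches(tx); }  // fully released before the next table
        }
    }
}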

The only resource that would be shared across tables is the page cache. Latches there could form cycles across the entire engine. So the page cache doesn’t use latches. That refactor is recorded as ADR-033, dated 2026-02-12:

Replace per-page reference counting with epoch-based protection. Each transaction enters an epoch scope that pins the current global epoch; pages accessed within the scope are stamped with that epoch; eviction defers any page whose epoch is still active.

The previous approach was reference counting: every page access incremented a counter, every release decremented it. A transaction touching 100 pages paid for 200 atomic operations, and atomics aren’t free: each one stalls the CPU pipeline waiting for cache-line coherence. Epochs collapse that into two operations regardless of how many pages the transaction touches: one to enter the scope, one to exit.
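
A minimal sketch of the scheme under those assumptions; the names and the eviction predicate are illustrative, not the actual page-cache code.

using System.Collections.Concurrent;
using System.Linq;
using System.Threading;

static class EpochProtection
{
    static long _globalEpoch;                                  // advanced by the engine
    static readonly ConcurrentDictionary<int, long> _active = new();

    // Enter: one operation pins the current epoch for the transaction,
    // no matter how many pages it goes on to touch.
    public static void EnterScope(int txId) =>
        _active[txId] = Volatile.Read(ref _globalEpoch);

    // Exit: one operation releases the pin.
    public static void ExitScope(int txId) => _active.TryRemove(txId, out _);

    // Pages are stamped with the epoch in which they were accessed; eviction
    // is deferred while any active scope could still see that epoch.
    public static bool CanEvict(long pageEpoch) =>
        _active.IsEmpty || pageEpoch < _active.Values.Min();
}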

But the deadlock-freedom payoff isn’t the cost reduction. It’s that the page cache never holds a lock anyone else could wait on. No latch, no waiter queue, no edge in the lock-dependency graph at all.

This removes the last edge class — cross-structure cycles. With all three classes gone, there is no graph in which a cycle can form.

This pillar is the one I worry about most, and the one most likely to break in the future. It’s enforced by convention, not by the type system. Future features — cross-table indexes, parallel query execution holding read latches across multiple tables, foreign-key constraints — would each require extending the lock-hierarchy discipline. The concurrency overview explicitly lists those scenarios as known risks. I’ll have to introduce explicit lock ordering when I get there.

What the bet costs

Prevention isn’t free; it just shifts the cost.

What’s eliminated                         What remains
Deadlocks (cycles in the lock graph)      Aborts at commit (local retry)
Detection runtime overhead                OLC restarts under index contention
Wait-for-graph data structures            Livelock under heavy contention (a different problem)

A writer that loses the commit-time CAS doesn’t trigger a global abort — it retries from where it was, against a refreshed baseline. An OLC reader that sees a version change doesn’t block a writer — it restarts the traversal. These are local costs. A Postgres deadlock victim aborts the entire transaction; a Typhon OLC restart is one tree traversal.

Livelock — repeated retries that never converge — is a different beast. It can’t deadlock, but it can starve. Typhon’s AdaptiveWaiter handles this with a spin-then-yield progression: 65,536 tight spin iterations first (most contention resolves there), then exponentially halving spin counts interleaved with Thread.Sleep(100µs). The 100µs sleep is below the OS scheduler quantum, so wake latency stays sub-millisecond. It bounds livelock probability without trading away the latency targets.
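
A hedged sketch of that progression; the constants mirror the description above, and the sub-millisecond sleep is the intent rather than a guarantee, since actual Thread.Sleep resolution is OS-dependent.

using System;
using System.Threading;

static class SpinThenYield
{
    const int InitialSpins = 65_536;                 // tight spins before any sleep

    // Spin until `done` reports progress; after the initial burst, halve the
    // spin budget each round and interleave short sleeps.
    public static void WaitUntil(Func<bool> done)
    {
        for (int i = 0; i < InitialSpins; i++)       // most contention resolves here
            if (done()) return;

        var pause = TimeSpan.FromMicroseconds(100);  // .NET 7+; requested 100µs, OS decides resolution
        for (int budget = InitialSpins / 2; ; budget = Math.Max(1, budget / 2))
        {
            Thread.Sleep(pause);
            for (int i = 0; i < budget; i++)
                if (done()) return;
        }
    }
}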

So: deadlocks gone, aborts and restarts kept, livelock bounded by a spin policy.

What others do

I didn’t survey these in depth before committing to prevention — the upfront “no locks” constraint was made on principle. But for the reader’s context, here’s the landscape Typhon sidesteps.

PostgreSQL
  Strategy: wait-for graph, triggered after deadlock_timeout (1s default)
  Cost model: detection deferred until a lock wait exceeds ~1s; the cycle scan is expensive but rare

MySQL InnoDB
  Strategy: wait-for graph plus victim selection (the transaction that modified the fewest rows is rolled back)
  Cost model: detection can be disabled on high-concurrency systems in favor of innodb_lock_wait_timeout

CockroachDB
  Strategy: per-node in-memory lock tables plus Raft-replicated write intents
  Cost model: detection is near-instantaneous; cost shifted to lock-table maintenance

Typhon
  Strategy: prevention by structure (the three pillars above)
  Cost model: no detection runtime cost; cost shifted to OLC restarts and commit-time aborts

These are all sound engineering choices for their workloads. Postgres’ deferred detection is rare-event optimization. InnoDB’s “smaller transaction loses” is a pragmatic heuristic for the OLTP shapes it’s tuned for. CockroachDB’s instantaneous detection genuinely solves the latency problem detection has elsewhere. None of these are wrong. They’re answering a different question: given that we accept locks, how do we manage their cycles?

Typhon answers a different question: given that we don’t accept locks, what does the rest of the architecture have to look like? That’s why the comparison isn’t “Typhon is faster” — it’s “Typhon paid the cost in a different layer.” Each row above describes where the cost lives, not who’s faster.

A footnote: TigerBeetle reaches the same end via a different upfront constraint — single-writer serializable execution. No concurrent transactions, no deadlocks. Different category, same conclusion: detection is the wrong layer to solve this.

What I’d flag for a reviewer

Three honest acknowledgments.

Pillar 3 is enforced by convention, not by the type system. The compiler won’t catch a future PR that holds latches across two ComponentTables. The discipline lives in code review and architectural awareness, not in mechanically checked invariants. To compensate, I’ve set up a list of explicit design rules that Claude Code enforces during design, development, and code review. Pillar 3’s “no cross-table latching” invariant is on that list; any code that would violate it gets flagged before it reaches the diff.

OLC restart cost is bounded but not zero. Under heavy write contention on a hot B+Tree leaf, optimistic readers can restart a few times before getting a clean version. The restart is one traversal, not a transaction abort, but it’s not free either.

The “deadlock-free” claim assumes the current feature set. Cross-table indexes, parallel queries holding read latches across tables, and foreign-key constraints are all listed as future scenarios that would require extending the discipline. The structural argument holds for what ships today; future features will need to maintain it consciously.

What’s next

The next post drills into Pillar 1 — how Typhon’s MVCC works without cloning rows. The big trick is per-component revision chains instead of per-row tuple versioning: an entity with eight components that updates one creates a single new revision, not eight. The visibility check is a single comparison against the transaction’s snapshot tick. And the EnabledBits exception dictionary pattern — zero-overhead fast path, dictionary slow path — is the prettiest piece of code in the engine.