MVCC at Microsecond Scale: Snapshot Isolation Without Cloning Rows

💡Typhon is an embedded, persistent, ACID database engine written in .NET that speaks the native language of game servers and real-time simulations: entities, components, and systems.
It delivers full transactional safety with MVCC snapshot isolation at sub-microsecond latency, powered by cache-line-aware storage, zero-copy access, and configurable durability.

Series: A Database That Thinks Like a Game Engine

  1. Why I’m Building a Database Engine in C#
  2. What Game Engines Know About Data That Databases Forgot
  3. Microsecond Latency in a Managed Language
  4. Deadlock-Free by Construction
  5. Three Durability Modes, One WAL
  6. Building a Page Cache That Doesn’t Count
  7. MVCC at Microsecond Scale (this post)
  8. The 8-Byte Lock (coming soon)

Octocat GitHub repo  •  :mailbox_with_mail: Subscribe via RSS

An entity has a position, a velocity, a health bar, a level, a name, a team, a score, and a bag of status flags — eight components. A hit lands and its health drops by three. In a textbook MVCC database, that one-field write produces a brand-new copy of all eight. The position it didn’t move, the name it didn’t change, the flags it didn’t touch — copied, re-indexed, and left behind as garbage for a vacuum process to sweep up later.

Typhon writes one revision. Just the health. The other seven components don’t know anything happened. This post is about the data structure that makes that possible, the one-byte rule that decides what each transaction is allowed to see, and why a read stays fast and nearly flat whether the value changed once or fifty times.

🛡️ Why I gave Typhon one isolation level, not five. Every SQL engine hands you a dial — SQL Server has five levels, from read uncommitted to serializable, each leaking different anomalies (dirty reads, phantoms, and friends). Picking wrong is a bug that surfaces only under production load. That dial exists because in a lock-based engine, stronger isolation means more locking means less concurrency — so the weak levels are escape hatches for when locks hurt. MVCC deletes the tradeoff: readers never lock, writers append instead of overwriting, so the strong guarantee is also the fast one. Nothing to buy back, so there’s no dial — one model, always on: snapshot isolation, the database frozen the instant your transaction began. When two transactions collide on the same component, it isn’t silently dropped — it’s reconciled at commit (see Trade-offs). Simple to reason about, hard to misuse.

📍 Where we are

Two earlier posts set this one up. Deadlock-Free by Construction argued that Typhon has no deadlock detector because it holds no data locks — and the first of the three pillars holding that claim up was MVCC: readers see an immutable snapshot, writers append new revisions, nobody waits on anybody. That post promised a drill-down. This is it.

Building a Page Cache That Doesn’t Count is the other half of the foundation. Those epoch-stamped 8 KB pages are exactly where revision chains live, and the trick that retires an old revision once no transaction can still see it is the same grace-period reasoning that retires an evictable page. If you read that post, you already understand Typhon’s garbage collector — you just haven’t seen it applied to versions yet.

The contract MVCC has to honour is snapshot isolation: a transaction picks a point in time when it starts and sees the whole database frozen at that instant for its entire lifetime — no dirty reads, no phantoms, later commits invisible. The question every MVCC implementation answers differently is: what do you physically store so that many transactions can each see a different “now” of the same data at the same time?

🧬 The row-cloning tax

The classic answer, the one PostgreSQL uses, is tuple versioning. A row is a tuple. Update any column and you write a new version of the entire tuple, chained to the old one by a pointer, each stamped with the transaction IDs that bound its visibility. Readers walk the version chain and pick the tuple their snapshot is allowed to see.

It’s a proven design, and for a row of three ints it’s perfectly fine. The tax shows up when rows are wide and updates are narrow — which is precisely the shape of an entity-component workload. A simulation tick nudges one field on thousands of entities: health here, position there. Every one of those single-field writes clones a whole wide row. You pay to copy columns that didn’t change, you pay to re-point the secondary indexes at the new tuple, and you pay a third time when a background vacuum comes through to reclaim the corpses. The write amplification is proportional to how wide the row is, when it should be proportional to how much you changed.

Top: row-versioning MVCC — updating one field (Health) on an eight-column row writes a whole new tuple with all eight columns copied and every index re-pointed. Bottom: per-component revision chains — updating Health appends one revision to the Health chain while the other seven component chains are never touched.

⛓️ Per-component revision chains

Typhon never has a “row” to clone, because it doesn’t store rows. It stores components in separate columnar tables — that’s the ECS design an earlier post was about. MVCC then attaches to the natural unit: each component of each entity owns its own revision chain. Update health, and you append one entry to that entity’s health chain. The position chain, the name chain, the other five — untouched, because from their point of view nothing changed. An eight-component entity that updates one component creates one revision, not eight. That’s the whole headline, and it falls straight out of storing columns instead of rows.

A chain entry is deliberately tiny — 12 bytes, because layout density decides how many entries fit in a cache-line-sized chunk:

[StructLayout(LayoutKind.Sequential, Pack = 2)]
internal struct CompRevStorageElement
{
    public int    ComponentChunkId;   // 0 = tombstone (a delete)
    private uint   _packedTickHigh;    // TSN bits 16..47
    private ushort _packedTickLow;     // TSN bits  0..15
    private ushort _packedUowId;       // UowId in bits 0..14, IsolationFlag in bit 15
}

Four fields, no waste. ComponentChunkId points at the actual component payload elsewhere in the page cache (0 is the sentinel for a tombstone — a delete). The Transaction Sequence Number — the global clock every snapshot reads from — is a 48-bit value split across a uint and a ushort; 48 bits is 280 trillion transactions, which is enough. The last ushort does double duty: 15 bits of UowId (which unit of work wrote this) and, in its top bit, the single most important flag in the whole system — IsolationFlag.

Why 12 bytes and not 16? Because the chain lives in 64-byte chunks, and 64 bytes is one cache line. The root chunk spends 28 bytes on a header — a small reader-writer lock, chain bookkeeping, the entity key — leaving 36 bytes, which is exactly three 12-byte entries. Overflow chunks carry only a 4-byte “next chunk” pointer, leaving 60 bytes: exactly five entries. One byte more per entry and that arithmetic turns wasteful — you’d fit two per root chunk and burn the rest as padding. The struct is sized backward from the cache line.

👁️ Visibility: two facts, one byte

Given a chain, a read has to answer one question: of all these versions, which one is this transaction allowed to see? The answer is two facts about each entry, and the second is that top bit.

💡 What’s a TSN? The Transaction Sequence Number is Typhon’s logical clock — one global counter that only ever ticks up. Every transaction draws a number from it at the moment it starts, and that number does double duty: it’s the transaction’s snapshot (the single “now” it will see for its entire life) and it’s the stamp burned onto any revision the transaction writes. Because the counter is monotonic, comparing two TSNs is just asking which happened first — and that comparison is the whole basis for deciding what a snapshot is allowed to see.

The first fact is chronology. Every entry carries the TSN of the transaction that wrote it; every reader carries the TSN it froze at when it started. An entry is a candidate only if entry.TSN <= reader.TSN — you can see the past, never the future.

The second fact is committedness, and it’s the IsolationFlag. When a writer appends an entry, it sets the flag to true: written, but not yet committed. That entry is invisible to every other reader regardless of its TSN, until the writer commits and clears the flag. This is the piece pure chronology can’t express. A reader whose snapshot sits at TSN 190 is chronologically entitled to see an entry stamped 180 — but if that entry is still mid-flight with its flag up, the reader must skip it and fall back to the last committed version underneath.

A revision chain with three entries — committed at TSN 42, committed at TSN 150, and uncommitted at TSN 180 with IsolationFlag set. Three concurrent transactions resolve differently: Reader A at snapshot 100 sees TSN 42 (150 and 180 are in the future); Reader C at snapshot 190 sees TSN 150 (180 is at or before 190 but uncommitted, so skipped); the Writer at TSN 180 sees its own uncommitted write via a transaction-local cache, not the chain.

So visibility is committed AND TSN <= snapshot, and the whole outcome of a walk compresses into a one-byte enum — the entire MVCC contract distilled to four states:

public enum RevisionReadStatus : byte
{
    Success           = 0,  // a visible revision was found — here's its payload chunk
    NotFound          = 1,  // this entity never had a chain for this component
    SnapshotInvisible = 2,  // the chain exists, but nothing is visible at our snapshot
    Deleted           = 3,  // the visible revision is a tombstone
}

That Deleted state is quietly powerful. A tombstone is a real revision with a real TSN — it just points at chunk 0. A transaction whose snapshot predates the delete walks right past the tombstone and reads the entity’s pre-delete content, which is still sitting on the chain. You get time-travel reads of deleted entities for free, for as long as the old versions survive — and how long they survive is the garbage-collection story below.

One honest caveat on the walk itself. You might expect it to stop at the first entry past the snapshot, binary-search style. It can’t — the chain isn’t sorted by TSN, and the reason is concurrency. The walk resolves “latest committed version” by keeping the last visible entry it passes, so entries have to sit in commit order. But commit order and TSN order slightly diverge: a transaction created earlier (a lower TSN) can commit later than one created after it, and when it does, its revision is moved to the tail of the chain — landing physically after a higher-TSN entry that already committed. So a later slot can legitimately hold an older TSN, and the walk has to scan the whole live chain rather than break early. It stays cheap only because that chain is short — the grace-period cleanup keeps it to a handful of entries.

✍️ The write path, and a decision I’d defend

A write appends. It takes the chain’s lock in exclusive mode, grows the chain by one overflow chunk if the current one is full, allocates a payload chunk for the new component value (or writes 0 for a delete), and stamps a fresh entry: TSN, UowId, ComponentChunkId, IsolationFlag = true. Then it releases the lock. The writer can now see its own change — not by walking the chain, which would skip its own flagged entry, but through a small transaction-local cache that remembers “the chunk I just wrote.” Every other reader still lands on the previous committed version. At commit, the only thing the revision layer does is flip that one bit from true to false. The entry becomes visible atomically, to everyone, in a single flag write.

💡 And the UowId? If the TSN says when a revision was written, the Unit-of-Work ID says who wrote it — which unit of work produced it. It deliberately stays off the read path: WalkChain decides visibility from the TSN and the IsolationFlag alone and never consults a UowId. Its work happens to the side — it lets a concurrent cleanup pass know whose in-flight chunks it must not reclaim, lets commit-time conflict detection name the other writer, and gives crash recovery the bookkeeping it needs to rebuild only the revisions that were durably committed. (That last one is settled before the engine serves a single read — recovery replays the WAL and scrubs anything a crash left half-written, so a live reader can never trip over it.) Temporal order is the TSN’s job; ownership — and a clean slate after a crash — is the UowId’s.

Here’s the decision worth pausing on, because the obvious alternative is wrong in an instructive way. That UowId — the identity of the writing unit of work — is stamped when the entry is written, not in a loop at commit time. The tidy-sounding version (“commit walks everything the transaction touched and stamps ownership then”) was rejected, for reasons that only surface once you look at what else is happening concurrently:

So commit stays a single flag flip instead of a chain walk over hundreds of touched components. The cost of that simplicity is one field written a few microseconds earlier than strictly necessary. Cheap.

📊 The numbers

So what does versioning actually cost? These are measured per-operation numbers — Ryzen 9 7950X, .NET 10, Release, steady-state with a hot cache — for the identical component laid out three ways:

Per-entity operation Versioned (MVCC) SingleVersion Transient
Read ~80 ns ~15 ns ~15 ns
Write ~250 ns ~40 ns ~40 ns

Two things jump out. First, SingleVersion and Transient cost the same — both are pointer arithmetic into a pinned page and a cache-line access; whether that page is memory-mapped or heap-backed is invisible once it’s in L1. Second, the cliff is MVCC: a Versioned read runs ~5× a raw read, a Versioned write ~6×. The lever is that the mode is chosen per component — a hot, history-less field like a cached AI cost or an animation tween can opt out of versioning and run at ~15 ns.

That 6× on writes is the whole thesis showing up in a profiler. A SingleVersion write is a store into a slot. A Versioned write is copy-on-write — allocate a fresh content chunk, copy the current value into it, append a revision entry pointing at it, take the chain lock, and stamp a TSN and UowId. (The WAL record and the eventual GC of the version it replaced are real costs too, but they’re paid at commit and in the background — not inside that 250 ns.) It’s the same building blocks as the SingleVersion path, just a lot more of them. That isn’t a missed optimization; it’s what snapshot isolation physically is. You pay ~80 ns to read a value nobody can tear out from under you, and ~250 ns to write one without ever blocking a reader.

What the per-op numbers don’t show is chain depth. A read walks the live chain with no early exit, so its cost grows with the number of live revisions it has to pass. In steady state that’s a non-issue — grace-period cleanup keeps the live chain to a handful of entries, so you stay near that ~80 ns. It only bites if you let a chain grow: aim a burst of concurrent writers at one hot entity and its chain lengthens until they drain, and the walk lengthens with it. Bounded contention plus cleanup keeps you in the flat regime.

That cleanup is the page-cache grace period again, wearing a different hat. A revision is dead once no live transaction’s snapshot could still select it — i.e. once its TSN falls below the oldest active transaction’s TSN. When that threshold advances, cleanup compacts the chain: it drops the dead entries, keeps one committed “sentinel” so a reader sitting exactly at the boundary still has its baseline, and never touches an entry whose IsolationFlag is still up. No separate vacuum daemon, no scheduled sweep — reclamation is driven entirely by the oldest reader finishing, exactly as page eviction waits for the oldest epoch to retire.

💎 The prettiest code in the engine

There’s a sibling problem that a full revision chain would be absurd overkill for. Every entity has an EnabledBits mask — 16 bits marking which of its components are logically “on” (present in storage but toggled out of queries). Toggling a component is a common, cheap operation. But it still has to obey snapshot isolation: a transaction that started before you disabled a component must keep seeing it enabled.

Sixteen bits do not deserve a chain of 64-byte chunks. So the fast path is the entire mechanism most of the time:

public ushort ResolveEnabledBits(long entityKey, ushort inlineBits, long txTsn)
{
    if (_overrideCount == 0) return inlineBits;   // fast path: no toggles in flight, done
    if (!_overrides.TryGetValue(entityKey, out var history)) return inlineBits;
    return history.ResolveAt(txTsn, inlineBits);
}

The current bits live inline on the entity record. The moment a toggle commits while an older transaction is still running, the pre-change bits get parked in an override dictionary keyed by entity, tagged with the TSN of the change. A reader only consults that dictionary when _overrideCount says overrides exist at all — and then ResolveAt walks the tiny per-entity history newest-to-oldest and hands back the bits that were current at the reader’s snapshot. When the last transaction old enough to care finishes, the pruner drops the entry and _overrideCount falls back toward zero. The dictionary is empty in the overwhelmingly common case, so the whole apparatus costs one integer comparison. It’s MVCC without a chain — versioning made proportional to how expensive the thing being versioned actually is. That principle, don’t pay chain overhead for something that fits in a register, is the one I’d take to any system.

⚖️ Trade-offs

Two costs, stated plainly.

Write-write conflicts are resolved, not rejected. Two transactions writing the same component never block during execution — Typhon holds no data locks while one runs — and, contrary to what “optimistic” usually implies, they don’t abort or retry either. The collision is resolved when the second commits. Most components have a single writer, so conflicts are rare; when one fires, the default is last-writer-wins — fine for a position, a footgun for an accumulator like a gold balance, where a blind overwrite discards real work. So Commit takes an optional conflict handler: on a conflict the engine hands you the read, committed, and committing values — once per entity, atomically under that entity’s chain lock — and you decide: keep yours, take theirs, revert, or merge (rebase the delta, clamp, sum). The real cost isn’t aborts or retries; it’s having to think about the merge — the default drops writes silently, and the handler runs in a tight context (fast, no I/O, no re-entering the engine) with no built-in roll-back.

Time travel has rent. Those readable-after-delete tombstones, and the old versions they shade, occupy pages until the oldest transaction that might want them finishes. A single long-lived reader holds the cleanup threshold down and lets chains and tombstones accumulate behind it — the versioning equivalent of the working-set pressure the page-cache post described. Bounded transactions keep it healthy; a forgotten long one is how you get bloat.

🎯 The takeaway

The reusable idea isn’t “chains” or “snapshots” — every MVCC system has those. It’s match the versioned unit to the thing that actually changes. Row-versioning copies the whole row because the row is the unit it stores. Store columns instead and the unit shrinks to the component, so a write costs what it changed and nothing more — and a 16-bit mask shrinks further still, to a dictionary that’s empty until two snapshots actually disagree.

Whatever you’re versioning — rows, documents, config, undo history — ask the question the page cache asked before it reached for a counter: what is the smallest unit I actually have to keep a history of? Version that, and only that.

⏭️ What’s next

Every revision chain in this post was guarded by a lock I waved at as “a small reader-writer lock in the header.” That lock is four bytes. It packs a full shared/exclusive lock — reader count, writer bit, waiter tracking, an owning thread id — into a single 32-bit word, and its bigger sibling does the same in 64 bits with room for three access modes. Post #8: The 8-Byte Lock — packing a reader-writer lock into a handful of bytes, and why the thread id inside it is exactly 16 bits wide.

Follow the GitHub repo for source and benchmarks, or subscribe via RSS.