Building a Page Cache That Doesn't Count: Epoch-Based Memory Management

💡Typhon is an embedded, persistent, ACID database engine written in .NET that speaks the native language of game servers and real-time simulations: entities, components, and systems.
It delivers full transactional safety with MVCC snapshot isolation at sub-microsecond latency, powered by cache-line-aware storage, zero-copy access, and configurable durability.

Series: A Database That Thinks Like a Game Engine

  1. Why I’m Building a Database Engine in C#
  2. What Game Engines Know About Data That Databases Forgot
  3. Microsecond Latency in a Managed Language
  4. Deadlock-Free by Construction
  5. Three Durability Modes, One WAL
  6. Building a Page Cache That Doesn’t Count (this post)
  7. MVCC at Microsecond Scale (coming soon)

Octocat GitHub repo  •  :mailbox_with_mail: Subscribe via RSS

A transaction that reads a thousand pages should pay for reading a thousand pages. Mine used to pay for two thousand atomic operations on top — one to pin each page so the cache wouldn’t yank it out from under a live pointer, one to unpin it afterward — before a single byte of useful work happened.

That bookkeeping was pure tax. Worse, it was contended tax. The page cache in Typhon today touches those same thousand pages and pays for exactly two operations, flat, regardless of the count. This post is about how the counter went away — and the one corner where I had to bring a smaller one back.

📍 Where we are

A quick recap for anyone joining mid-series. Typhon stores everything — components, indexes, the lot — in 8 KB pages. Those pages are cached in a single byte[] that is pinned with a GCHandle so the GC can never move it. Pinning is what makes the rest of the engine possible: once the backing array is fixed in memory, the hot paths hand out raw pointers and ref T interior references straight into the cache slot. No copy, no marshaling, no managed wrapper. A read of a component is a pointer add and a dereference.

That zero-copy guarantee is also the whole problem. The cache is finite — 256 pages by default — so when it fills, a clock-sweep eviction picks a victim slot and reuses it for a different file page. If a reader is still holding a ref Position into that slot when the sweep recycles it, the reader is now looking at someone else’s data. A use-after-free, in a managed language, with no exception to tell you it happened — just a silently wrong number.

So the cache needs a lifetime rule: a slot must not be evicted while anyone still holds a pointer into it. The interesting part is not that rule. Every cache has it. The interesting part is what you pay to enforce it.

🧾 The obvious answer, and its bill

The textbook enforcement is reference counting. Every page carries a counter. You take a reference (++) before you read, you drop it (--) when you’re done, and eviction skips any page whose count is above zero. Simple, correct, and the first thing I shipped — Typhon’s early page cache had exactly this, a per-page ConcurrentSharedCounter.

It has three costs, and they compound:

  1. It’s 2N. A transaction that touches N pages does N increments and N decrements. The work scales with the data, which is the opposite of what you want from bookkeeping.
  2. The decrement is mandatory and paired. Every acquire needs its matching release on every exit path, including exceptions. Miss one and the page is pinned forever — a slow leak that strands cache slots until the engine wedges. This is the half that generates bugs.
  3. The counter is shared, mutable, cross-core state. That ++/-- is an Interlocked operation — an atomic read-modify-write. When two worker threads touch the same hot page — a B+Tree root, say — its counter’s cache line ping-pongs between their cores. The atomic instruction isn’t the expensive part; the coherency traffic is. A contended counter can cost an order of magnitude more than an uncontended one, and the hottest pages are exactly the ones every thread wants.
  4. It clutters everything it touches. A mandatory paired release means scaffolding at every call site — hundreds of using blocks whose entire reason to exist is a Dispose that decrements a counter. And the moment that overhead shows up in a profile, you start bending the code to dodge it: paths that steal an already-held reference instead of taking a fresh one, so an increment here cancels a decrement there. Each of those tricks is correct only under assumptions the next reader can’t see from the call site. The counter didn’t just cost cycles — it spread through the codebase and made the whole engine harder to reason about. That cost is invisible in a microbenchmark and very visible six months later.

Reference counting answers a question the cache never actually asked: “exactly how many people are looking at this page right now?” Eviction doesn’t need that number. It needs a cheaper, weaker fact: “is it safe to reclaim this slot yet?” Those are different questions, and the second one has a much cheaper answer.

⏱️ The reframe: stop counting, start timing

The technique is epoch-based reclamation (EBR) — the same family of idea behind RCU in the Linux kernel. Instead of tracking who holds what, you track time, coarsely, and you only reclaim memory once enough time has passed that nobody could still be holding the old version.

💡 The intuition. Think of a museum that wants to renovate a gallery — but only once every visitor who might still be wandering through it has gone home. The expensive way is to tag each visitor and constantly track which gallery they’re standing in; that’s reference counting. The cheap way: stamp every visitor at the entrance with the hour they arrived, and have each gallery note the last hour anyone walked through it. Now the renovation crew checks a single number — the arrival time of the earliest visitor still in the building. Any gallery whose last visit predates that time is guaranteed empty of anyone who could still be inside, so the crew can move in. No per-visitor ledger, just one question: who’s the oldest person still here? Epoch-based reclamation is that museum — the visitor’s arrival stamp is the epoch a thread pins on entry, the gallery’s last-walked hour is the page’s stamp, and “the earliest visitor still inside” is the threshold eviction compares against.

“Time” here is a single global counter — the epoch — that ticks forward. The protocol has three moves:

Eviction then asks one question of a page: is its stamp older than the oldest still-active thread? If every active thread started after this page was last touched, no one can be holding a pointer into it, and the slot is free to recycle. That comparison is the entire safety check.

Side-by-side comparison: per-page reference counting requires 2N contended atomic operations on shared per-page counters that bounce between CPU cores; epoch-based protection requires one thread-local pin on entry plus one global epoch advance on exit — two operations total, flat, regardless of how many pages the transaction touches, with page stamps that are fire-and-forget and need no matching release

The asymmetry is the whole win. The expensive, bug-prone, contended half of reference counting — the mandatory paired release on every page — disappears entirely. Releasing is now a single global tick that retires all of a transaction’s pages at once. What’s left on the acquire side is a stamp that is uncontended and usually a no-op.

⚙️ How it actually works

Three types carry the design (src/Typhon.Engine/Foundation/Concurrency/internals/):

A transaction wraps its whole lifetime in one guard:

// Managed by Transaction — one scope per unit of work
using var guard = EpochGuard.Enter(epochManager);

// Every page access inside this scope is protected, no per-page bookkeeping
pmmf.RequestPageEpoch(filePageIndex, epochManager.GlobalEpoch, out int memPageIndex);
byte* addr = pmmf.GetMemPageAddress(memPageIndex);
// ... read/write through raw pointers, hold ref T across many pages ...

Entering publishes the epoch into the thread’s own slot. The hot path is a single write to thread-local memory — no atomic, no contention:

// EpochThreadRegistry.PinCurrentThread — outermost scope pins, nesting is free
var depth = _slots[slot].Depth;
_slots[slot].Depth = depth + 1;
if (depth == 0)
    _slots[slot].PinnedEpoch = epoch;   // publish: "active, started at this epoch"
return depth;

Stamping a page is an atomic-max — never let a page’s stamp go backward — that skips the atomic write entirely in the common case where the page is already current for this epoch:

// PagedMMF.RequestPageEpoch — tag the page, never moving the stamp backward
long existing;
do
{
    existing = pi.AccessEpoch;
    if (currentEpoch <= existing)
        break;                  // already protected this epoch — no write at all
} while (Interlocked.CompareExchange(ref pi.AccessEpoch, currentEpoch, existing) != existing);

That Interlocked.CompareExchange is a compare-and-swap (CAS): it stores the new epoch only if the field still holds the value we just read, so two threads stamping the same page can’t clobber each other. The first touch of a page in an epoch does one such CAS; every subsequent touch breaks out at the currentEpoch <= existing guard — a plain read and a compare, no atomic at all. And — this is the point — there is no release. Nothing to pair, nothing to forget, nothing to run in a finally.

Exiting the outermost scope is where the single shared atomic lives — the global epoch tick:

// EpochManager.ExitScope — outermost exit retires the whole transaction's pages
internal void ExitScope(int expectedDepth)
{
    if (_registry.UnpinCurrentThread(expectedDepth))
    {
        var newEpoch = Interlocked.Increment(ref _globalEpoch);
        TyphonEvent.EmitConcurrencyEpochAdvance((uint)newEpoch);
    }
}

And eviction is the comparison the whole scheme exists to make cheap. MinActiveEpoch is the smallest epoch any thread is still pinned to; a page survives if its stamp is at least that:

// PagedMMF.TryAcquire — the eviction guard, minus the dirty/in-flight checks
if (info.AccessEpoch >= minActiveEpoch)
    return false;   // a transaction old enough to hold a pointer is still live — skip

Walk the invariant once and it clicks. A thread that pinned epoch E contributes E to MinActiveEpoch, so MinActiveEpoch ≤ E. Any page it stamped was stamped at the global epoch at access time, which is ≥ E. So while that thread is live, every page it ever touched satisfies AccessEpoch ≥ E ≥ MinActiveEpoch and cannot be evicted. The moment it exits — and no older thread remains — MinActiveEpoch climbs past those stamps and they all become reclaimable together. This is a grace period, exactly as in RCU: reclamation waits until every reader that could have seen the old state has gone away.

📊 The numbers

Measured with BenchmarkDotNet on a Ryzen 9 7950X (Zen 4), .NET 10, RELEASE:

Operation Cost Allocations
EpochGuard enter + exit (incl. global tick) 3.5 ns 0
Three-level nested enter/exit 8.8 ns 0
MinActiveEpoch (eviction check, no threads pinned) 0.8 ns 0
MinActiveEpoch while a thread is pinned 6.6 ns 0
Page-cache hit (RequestPageEpoch, slot resident) 6.5 ns 0

The number that matters is the first one: 3.5 ns for a transaction’s entire eviction-protection obligation, and it does not move when the page count goes up. A transaction touching one page and a transaction touching ten thousand pay the same 3.5 ns of protection overhead. The per-page stamps on top are uncontended and frequently free; the contended, paired, O(N) release is gone.

Put that against the model it replaced: a thousand-page transaction was a thousand Interlocked increments and a thousand decrements, on a thousand separate counters, several of them hot enough to bounce across cores on every touch. Trading 2,000 contended atomics for one global tick is the kind of win you only get by answering an easier question.

⚖️ Trade-offs, and the one place pure epochs weren’t enough

Epoch protection is conservative — that’s the source of both its speed and its cost. It doesn’t know which pages a transaction is done with; it only knows the transaction is still running. So a transaction pins every page it has ever touched for its entire lifetime, whether it still needs them or not.

For short transactions that’s exactly right and basically free. For a long one it bites: if a single transaction’s working set grows past the cache size, every page it has stamped is epoch-protected, so eviction has no legal victim. The cache can’t make room, and AllocateMemoryPage runs into a backpressure wall and eventually times out. Reference counting wouldn’t have this problem — it could evict a page the long transaction had touched and released. Epochs can’t, because there is no “released” until the scope exits. That working-set-fits-in-cache constraint is a real, documented limit of the design, not a bug I haven’t gotten to.

The mitigation is to let a long-lived thread refresh its epoch mid-flight — advance its own pin without unpinning — so pages it touched earlier age out and become evictable while it keeps running:

// EpochManager.RefreshScope — stay continuously pinned, but move forward in time
var newEpoch = Interlocked.Increment(ref _globalEpoch);
_registry.RefreshPinnedEpoch(newEpoch);   // no unpinned window; MinActiveEpoch never gaps

Which is where the pure model sprang a leak. The whole premise is that a page is safe as long as its epoch is live. But a caller can hold a raw ref T into a page across a refresh — and a refresh deliberately ages that page out, marking it evictable while the pointer is still on the stack. The coarse epoch clock can’t see a pointer that outlives the epoch that created it.

So pure EBR wasn’t quite enough, and I had to bring back a small, targeted counter: SlotRefCount. It tracks the much narrower fact of “a live accessor slot is pointing at this page right now,” and it gates eviction alongside the epoch check. The full guard is the conjunction:

// Evict only when nobody can be looking — by epoch OR by raw pointer
(DirtyCounter == 0) && (AccessEpoch < MinActiveEpoch) && (SlotRefCount == 0)

It would be tidier to say I deleted reference counting outright. The honest version is that I moved it: epochs handle the common case — thousands of pages, retired in one tick — and a tiny refcount covers the narrow case epochs structurally can’t see, the raw pointer that escapes its epoch. The 2N tax is gone from the hot path; what remains is a counter that almost never moves.

🧰 Two details worth stealing

Cache-line-pad the per-thread slots. The registry’s whole reason to exist is that threads pin and unpin without fighting each other. Pack two threads’ epoch fields into one cache line and you’ve reintroduced exactly the false-sharing the design set out to kill. Each slot is its own cache line:

// EpochThreadRegistry — one cache line per thread, no false sharing on pin/unpin
[StructLayout(LayoutKind.Explicit, Size = 64)]
internal struct PaddedEpochSlot
{
    [FieldOffset(0)]  public long PinnedEpoch;  // hot: written every enter/exit
    [FieldOffset(8)]  public int  Depth;        // hot: nesting depth
    [FieldOffset(12)] public int  SlotState;    // warm: CAS on claim/free
    // bytes 16–63: padding to fill the line
}

A pinned epoch is a liability if its thread dies. A thread that pins an epoch and then crashes or exits without unpinning would hold MinActiveEpoch down forever — freezing eviction across the whole cache. Two safeguards catch it: each slot is owned by a CriticalFinalizerObject that releases it when the thread is collected, and the MinActiveEpoch scan does a Thread.IsAlive liveness check, reclaiming any slot whose owner has died. The grace period is robust to threads that never come back.

🎯 The takeaway

Reference counting is the reflex answer to “don’t free this while someone’s using it,” and for a lot of code it’s the right one. But it answers a harder question than most callers ask. When you don’t need who or how many — only is it safe yet — epoch-based reclamation turns O(N) contended bookkeeping into O(1): publish a number on the way in, tick a number on the way out, and let a grace period retire everything at once. The cost is that protection is coarse, so it fits workloads whose units of work are bounded — which, for a database built for game ticks and simulation frames, they are by construction.

That’s the pattern worth keeping even if you never touch Typhon: before you reach for a counter, check whether you actually need the count, or just the guarantee.

⏭️ What’s next

Post #7: MVCC at Microsecond Scale: Snapshot Isolation Without Cloning Rows. Those epoch-stamped pages are where revision chains live — every update appends a new version instead of overwriting, readers see a frozen snapshot, and old revisions get retired by the same kind of grace-period reasoning you just read. Next post pulls that thread.

Follow the GitHub repo for source and benchmarks, or subscribe via RSS.