Nockawa’s Blog

A Database You Can See

2026-06-08T00:00:00+00:00

💡Typhon is an embedded, persistent, ACID database engine written in .NET that speaks the native language of game servers and real-time simulations: entities, components, and systems.
It delivers full transactional safety with MVCC snapshot isolation at sub-microsecond latency, powered by cache-line-aware storage, zero-copy access, and configurable durability.

Series: The Typhon Workbench — the tools that make the engine usable

A Database You Can See (this post)

You Can’t Optimize What You Can’t See (coming soon)

Querying by Hand (coming soon)

A lighter, hands-on companion to the engine deep-dive series, A Database That Thinks Like a Game Engine.

GitHub repo • :mailbox_with_mail: Subscribe via RSS

I spent long time making Typhon fast. Sub-microsecond commits, MVCC snapshot isolation, cache-line-aware storage — the kind of numbers that make a systems programmer lean in. Then I opened a .typhon file to debug something, and realized I was staring at a black box. I had built an engine I couldn’t see into.

That’s the quiet trap of infrastructure software: the better the engine, the more invisible it is. A database that does its job disappears — right up until the moment you need to know what your schema actually looks like in memory, which systems touch a component, or whether a query does what you think. At that moment, raw speed is worth nothing without a way to look inside.

So this post starts a new, lighter track in the series — about the Typhon Workbench, the tool that makes the engine usable. The thesis is simple and, I think, under-appreciated:

🎯 Great technology is not enough. You need tools that let people make the most of it.

So, concretely: the Workbench is the tool you keep open to understand a Typhon system — its data, its schema, and its behavior — whether that system is a file on disk, a captured trace, or a live engine you’re attached to. It’s built for both developers and ops. Developers use it to understand structure and chase performance: how an archetype is laid out, what’s actually stored on an entity, where the time goes.

Ops and reliability engineers use it to watch a live system — tick rate, jitter, overload, queue depth — and to freeze the feed or capture a window when something spikes. Either way, it answers what code alone can’t: What does my schema really look like? What’s in this archetype right now? Which systems touch this component? Why did that tick stall? Is the engine healthy? — the things you’d otherwise chase with Console.WriteLine, log scraping, and a lot of guessing. This first post stays on the data-and-schema side; later in the track we reach the profiling and live views.

A quick note on what it’s built from, for the curious: the Workbench is a small full-stack app that runs entirely on your machine. The backend is ASP.NET Core (Kestrel) speaking to the engine; the frontend is React 19 + TypeScript built with Vite, styled with Tailwind CSS and shadcn/ui (Radix primitives), using dockview for the draggable panel layout, Zustand and TanStack Query for state, and cmdk behind the command palette. Nothing leaves localhost.

See it first

Before the words, here’s the two-minute tour:

▶ Watch "Typhon Workbench intro" on YouTube — if the embed doesn't load in your viewer

DataGrip meets a flight recorder

The Workbench is a local developer tool. You point it at a Typhon database — a .typhon file on disk — and it opens a window into everything inside: the schema, the data, and the way both sit in memory and on disk. It can also attach to a live engine or replay a captured trace, but this post is about the simplest, most common case: you have a database file, and you want to understand it.

If you’ve used JetBrains DataGrip, DBeaver, or MongoDB Compass, the shape is familiar — a navigator down one side, an object inspector, a data grid, drill-downs. That’s deliberate. I didn’t want to invent a new mental model for “exploring a database”; the last thirty years of database tooling already converged on one that works, and the Workbench borrows it wholesale. What it adds is the part that’s specific to Typhon: it speaks entities, archetypes, and components natively, and it can show you things a row-store tool never could — cache-line layouts, on-disk fragmentation, the cost of a query before you run it. (The other half of the name — the flight recorder, for profiling a running engine — is a story for a later post in this track. Here we stay with the database.)

There’s no install ceremony — open it, point it at a file, and it’s out of your way in seconds.

From a file to a model you can walk

Open a .typhon file and you don’t get a hex dump or a list of opaque page numbers. You get a model you can walk, top to bottom.

It starts with the Schema Explorer — a tree of your archetypes. As the engine series describes, Typhon stores data the way a game engine does: an archetype is a set of entities that share the same component composition, which makes it the rough equivalent of a table. The Schema Explorer lists them all, with the numbers that matter at a glance — how many entities each holds, which components it carries, how full its storage is. It’s fuzzy-searchable, so on a schema with a hundred archetypes you type three letters and you’re there.

From an archetype you drill into a Component, and this is where it stops looking like a generic table browser and starts looking like something built for this engine. A component isn’t just a column; it has tabs for its fields, the archetypes that use it, the systems that read and write it, its storage mode. (One of those tabs — the memory layout — gets a whole post to itself next time. It earns it.)

Three things from the video are worth calling out, because they’re the difference between a tool you tolerate and one you live in:

Density is a setting. I stare at this thing all day, so it ships compact by default — more rows, less chrome — but you can loosen it when you’d rather have room to breathe.
Dark and light both exist and both work. Not a switch bolted on at the end; the whole UI is built on theme variables, so nothing breaks when you flip it.
The command palette is the spine. One shortcut gives you a single search box that reaches everything — every view, and every object in your database. Type a component’s name and jump straight to it; prefix your search to scope it (one prefix runs actions, another finds an object in the current session, another jumps to a moment in a trace). Anything you can reach with the mouse, you can reach from the keyboard. That isn’t a power-user garnish — it’s how discoverability works. Someone who doesn’t know where a feature lives can find it by typing what they want.

One click, every view

Here’s the part that took the most work and is the easiest to miss: in the Workbench, selection is global.

Select a component — anywhere, in any panel — and the whole tool reorients around it. The inspector shows its details. The archetypes that use it light up. The systems that read it and the systems that write it appear together, so you can see the blast radius of a change before you make it: if I touch this field, here’s everything that cares. You made one gesture; several panels answered.

That sounds obvious until you remember how it usually goes. In most toolchains every view is an island. You find an ID in one window, copy it, paste it into another, lose your place, open a third. The friction is so normal you stop noticing it — you just accept that “investigating” means juggling. The Workbench’s bet is that you shouldn’t have to: one selection, shared across every panel, reversible with a back button that always takes you home.

The honest part: none of the individual panels were the hard problem. Most of them existed already. The hard problem was the wiring — a single shared notion of “what is selected” that every panel both listens to and can drive. That’s the unglamorous work that turns a folder full of capable views into something that feels like one product. It’s invisible when it works, which is exactly why it’s worth pointing at.

Real data, decoded

Schema is half the story. The other half is what’s actually in there.

The Data Browser pages through the real entities in an archetype and shows their component values decoded — not raw bytes, not a blob you have to interpret, but Position { X = 124.5, Y = 88.1 }, Health = 98, the actual values your systems are reading. Pick the columns you care about, page through, click a single entity to inspect it on its own. And it’s strictly read-only: looking never changes what’s there, so you can explore a production capture without a second thought.

It closes the loop the rest of the tool sets up. You walked the schema from archetype to component; now you see the data sitting in that shape. From a row you can select a component value and — because selection is global — bounce straight back to the schema side to ask “wait, how is this field actually laid out?” The investigation flows in both directions, which is the whole point.

The lesson: tools need design too

Let me end on the thing this whole post is really about.

Typhon’s engine solved a genuinely hard problem: ACID transactions at microsecond latency, on a data model borrowed from game engines, with none of the usual compromises. That took two years and most of my stubbornness. But an engine is a capability, not an experience. The moment a real developer sits down — to debug a schema, to understand why something is slow, to check what’s actually in the database — that capability is only as good as their ability to reach it.

The Workbench is the other half of that work, and it needed its own kind of design discipline — none of it about raw performance:

One action, not three. Any move you’d want to make is a single gesture from where you already are.
No dead buttons. A control that renders is a control that works. Nothing that looks clickable does nothing — broken affordances erode trust faster than missing features ever do.
Selection is global and reversible. One click drives everything; the back button always brings you home.
Speed without orientation is just a faster way to get lost. Fast and oriented, or it doesn’t count.

That’s the under-appreciated half of building good technology: the tools aren’t a cherry on top, they’re the path the value travels to reach a human being. You can have the fastest engine in the world, and if nobody can see inside it, you’ve built a very elegant black box. I’d rather build one you can see.

What’s next

Next in the Workbench track: You Can’t Optimize What You Can’t See — where the abstract gets physical. The component layout grid shows you cache lines, field padding, and alignment as something you can actually look at; the File Map draws your entire database on disk, fragmentation and all. Meanwhile the engine track continues its deep dives.

Follow the GitHub repo for source and benchmarks, or subscribe via RSS.

Three Durability Modes, One WAL: Configurable Guarantees for Different Workloads

2026-05-20T00:00:00+00:00

💡Typhon is an embedded, persistent, ACID database engine written in .NET that speaks the native language of game servers and real-time simulations: entities, components, and systems.
It delivers full transactional safety with MVCC snapshot isolation at sub-microsecond latency, powered by cache-line-aware storage, zero-copy access, and configurable durability.

Series: A Database That Thinks Like a Game Engine

Why I’m Building a Database Engine in C#

What Game Engines Know About Data That Databases Forgot

Microsecond Latency in a Managed Language

Deadlock-Free by Construction

Three Durability Modes, One WAL (this post)

MVCC at Microsecond Scale (coming soon)

GitHub repo • :mailbox_with_mail: Subscribe via RSS

Most databases pick one durability strategy at boot time. Typhon picks one per commit — and the surprising part isn’t the user-facing API, it’s that all three modes share the same WAL (Write-Ahead Log — every commit appends a record here before being durable) writer thread, the same ring buffer, the same I/O path. The only thing that differs is whether the producer waits.

Ten thousand NPC position updates per simulation tick at ~1-2µs each, all Deferred. One legendary item drop on the same transaction code path, escalated to Immediate, paying ~15-85µs for a guaranteed FUA (Force Unit Access — don’t ack the write until it’s on stable media) write to disk. Same engine, same WAL, one extra argument at Commit(). This post is about why that’s possible and what it cost to keep it that way.

The other classical knob

Post #4 covered Typhon’s first big architectural bet — eliminating deadlocks at the design level instead of detecting them at runtime. This one is about the other classical database knob: durability. And like deadlocks, the interesting decision is upstream of the implementation.

Three workloads sit on the same engine in a real game server:

Simulation tick: physics, AI, position updates. Hundreds to thousands of writes per frame. Losing the last tick on crash is fine; the simulation will recompute it.
Player actions: combat events, item pickups, dialogue state. Sub-5ms data loss window is acceptable. Throughput matters more than per-commit durability.
Money and rare events: currency transfers, legendary drops, account creation. Zero tolerance for loss. Players will dispute every missing transaction.

A single global durability setting forces all three to the most conservative option — and ~15-85µs per FUA write turns a 60Hz tick budget (16ms) into a fight you’ve already lost.

The three modes

The decision lives on the Unit of Work (UoW) at creation time, with a per-transaction override for escalation only. A UoW sits one level above a transaction: it groups one or more transactions and owns the durability contract they share. Transactions still commit atomic state changes; the UoW decides when — and whether — those commits reach disk. The user-facing enum is exactly what it looks like:

public enum DurabilityMode : byte
{
    /// WAL records buffered. Durable only after explicit Flush()/FlushAsync().
    /// Commit latency: ~1-2µs. Data-at-risk: until Flush().
    Deferred = 0,

    /// WAL writer auto-flushes every N ms (default 5ms).
    /// Commit latency: ~1-2µs. Data-at-risk: ≤ GroupCommitInterval.
    GroupCommit = 1,

    /// FUA on every tx.Commit(). Blocks until WAL record is on stable media.
    /// Commit latency: ~15-85µs. Data-at-risk: zero.
    Immediate = 2,
}

public enum DurabilityOverride : byte
{
    Default   = 0,  // Use the UoW's DurabilityMode
    Immediate = 1,  // Force FUA for this specific commit (escalation only)
}

The override can only escalate. A Deferred UoW can promote one transaction to Immediate; an Immediate UoW cannot weaken anything. This is a deliberate constraint, not an oversight — it makes data-loss bugs impossible by API shape. You can never accidentally make a transaction less durable than the UoW’s contract.

Mode	Commit latency	Data-at-risk window	Use case
`Deferred`	~1-2µs	Until explicit `Flush()`	Game ticks, batch imports, simulation steps
`GroupCommit`	~1-2µs amortized	≤ 5ms (configurable)	General server load, request handlers
`Immediate`	~15-85µs	Zero	Trades, account writes, legendary drops

One writer thread, three signaling patterns

Here’s the part that surprised me when I came back to the design six months in: I did not need three I/O paths. Or three threads. Or three buffers. The shared infrastructure looks like this:

The same picture broken into phases:

#	Producer thread (`tx.Commit()`)	What this means for you	WAL Writer thread (single, dedicated)	What this means for you
1	`TryClaim()` — CAS (Compare-And-Swap) slot allocation	Your commit atomically reserves a slot in the WAL ring buffer. No lock contention with other transactions claiming slots in parallel.	(idle, or finishing a previous drain)	The writer thread runs independently. Your producer never waits on it to claim a slot.
2	`Publish()` — release-store the frame header	The record is now visible to the writer. Your commit has an LSN (Log Sequence Number) — its position in the durability timeline.	`TryDrain()` — contiguous batch of published frames	The writer harvests every published slot in one pass. This is the structural reason `GroupCommit` is amortized: N producers, one drain, one FUA cost.
3	Mode-specific: return now, or `WaitForDurable(lsn)`	Deferred / GroupCommit: `Commit()` returns in ~1-2µs and durability lands asynchronously. Immediate: `Commit()` returns only once your LSN is on disk (~15-85µs).	`WriteAligned()` → FUA write → `Interlocked.Exchange(DurableLsn)` → `_durabilityEvent.Set()`	One physical FUA write per batch (~15-85µs, paid once). `DurableLsn` advances; any Immediate waiter whose LSN ≤ `DurableLsn` wakes and returns from `tx.Commit()`.

What changes between the modes is only the producer side:

Deferred — Publish and return. No signal, no wait. The writer may be asleep; it will wake on the next group-commit timer or explicit Flush().
GroupCommit — Publish and return. The writer is already in a WaitForData(GroupCommitIntervalMs) loop; the next tick of that timer (≤5ms) drains the batch.
Immediate — Publish, signal the writer, then WaitForDurable(lsn) until DurableLsn advances past the producer’s LSN.

That last one is where the elegance lives. It’s not a separate code path — it’s the same path, with one extra fast-path check:

public void WaitForDurable(long lsn, ref WaitContext ctx)
{
    // Fast path: already durable, returns inline.
    if (Interlocked.Read(ref _durableLsn) >= lsn)
    {
        return;
    }
    WaitForDurableSlow(lsn, ref ctx);
}

If the WAL writer has already drained past your LSN by the time you call this — say, because someone else’s Immediate commit just batched yours along with it — you pay one atomic read and a return. No event wait, no syscall, no context switch. Immediate mode is GroupCommit’s batching benefit, available to the one transaction that needs it now.

That’s the teaching moment I want this post to leave you with: per-transaction durability is not three implementations. It’s one implementation with three producer-side policies, and the FUA cost is a property of the I/O path, not the API surface.

Per-UoW, not per-engine — why

The API-shape decision is recorded in its own ADR (Architecture Decision Record). I considered four alternatives before landing on per-UoW:

Alternative	Why I rejected it
Per-database (boot-time)	Too coarse. Game ticks and trades on the same DB need different modes within the same process.
Per-transaction	Can’t batch — `GroupCommit` is inherently multi-transaction. The UoW is the natural batching boundary.
Two modes (Sync / Async)	Misses `GroupCommit`’s sweet spot. The whole point is amortized FUA, which a binary doesn’t capture.
Caller-managed flush only	Error-prone. Developers forget to flush. `GroupCommit` automates the common case correctly.

The honest version: I tried “per-database” first because it was easiest to wire up, and immediately hit the simulation-vs-trade problem on the first benchmark. Two production game-server workloads on the same engine, one wanting ~1µs commits and one wanting zero data loss. The mode has to follow the workload, not the storage.

Numbers that matter

The latency table above is the headline. The throughput table is where it gets interesting:

Mode	Single-thread durable tx/s	Multi-thread durable tx/s
`Deferred`	N/A (batch-durable)	N/A
`GroupCommit` (5ms interval)	~200K+ amortized	Millions (shared flush)
`Immediate`	~12K-65K	~12K-65K (FUA-limited)

Single-thread Immediate is capped by NVMe FUA round-trip — there’s no software trick to escape that. But multi-thread Immediate does not scale linearly past one thread, because every commit is racing the same writer through the same I/O. GroupCommit, on the other hand, scales nearly with thread count because the FUA cost is paid once per drain cycle no matter how many producers contributed to the batch.

That’s not a flaw in Immediate. It’s the physics of the storage device. The point of having three modes is that you only pay that cost where you actually need it.

What I got wrong

The first group-commit timer was 1ms. Under low write load the WAL was doing constant small FUA writes — worst-case for SSD wear and tail latency. Tuning to 5ms with a “wake on N records OR T milliseconds” trigger fixed it: the writer sleeps on WaitForData(intervalMs) and gets pulled out early when a producer signals (Immediate commits, explicit Flush(), or back-pressure). Idle periods cost nothing; busy periods batch naturally.

The first override design allowed downgrade. tx.Commit(DurabilityOverride.Deferred) from an Immediate UoW. The use case was “this single read-mostly transaction doesn’t really need FUA.” The use case was wrong: the UoW’s contract is the durability floor, not the default. Downgrading a single commit means the application has accidentally created a hole in a contract it thinks it has. Now overrides can only escalate, and the type system enforces it.

Deferred mode is a contract, not a latency number. Early users assumed Commit() always meant “on disk.” It doesn’t. Deferred mode says: your data is not durable until you call Flush() or close cleanly. For game servers that’s fine; their tick loop already has clear boundaries. But the documentation now leads with the contract, not the µs number. The latency is a consequence; the contract is what you signed up for.

What’s next

Post #6 in the series: Building a Page Cache That Doesn’t Count: Epoch-Based Memory Management. The durability story above assumes the commit path is fast, but that’s only half the story — the read path has to be just as cheap, and the trick there is replacing per-page reference counting with epoch-based protection: two atomic operations per transaction instead of two per page. The mechanism is elegant enough that it deserves its own post.

Follow the GitHub repo for source and benchmarks, or subscribe via RSS.

Deadlock-Free by Construction: How Typhon Eliminates Deadlocks Instead of Detecting Them

2026-04-27T00:00:00+00:00

💡Typhon is an embedded, persistent, ACID database engine written in .NET that speaks the native language of game servers and real-time simulations: entities, components, and systems.
It delivers full transactional safety with MVCC snapshot isolation at sub-microsecond latency, powered by cache-line-aware storage, zero-copy access, and configurable durability.

Series: A Database That Thinks Like a Game Engine

Why I’m Building a Database Engine in C#

What Game Engines Know About Data That Databases Forgot

Microsecond Latency in a Managed Language

Deadlock-Free by Construction (this post)

Three Durability Modes, One WAL

Epoch-Based Page Cache (coming soon)

GitHub repo • :mailbox_with_mail: Subscribe via RSS

Deadlocks are usually treated as a runtime problem. We treat them as a design bug.

That sounds like a slogan. It isn’t. It’s the actual reasoning behind three architectural decisions that, taken together, make a lock-dependency cycle impossible in Typhon — not unlikely, not rare, impossible. The engine ships without a deadlock detector. From the project’s own concurrency overview:

Deadlock detection is explicitly not implemented — it would add overhead for a scenario that cannot occur in the current architecture.

This post is the how: how three structural decisions remove three classes of edges from the lock-dependency graph, and how that elimination cascades into “no cycle is possible.” But it’s also the why: why the constraint was set at project inception, before any code existed to deadlock, and what it cost.

The upfront bet

I didn’t compare deadlock detection schemes before starting Typhon. I’d seen them in production at previous engines, and the pattern was always the same: a separate background scanner, a wait-for graph, victim selection heuristics, transaction abort and retry. A lot of code. A lot of edge cases. None of it bulletproof for the user, who still sees occasional one-second pauses or unexplained transaction failures under load.

So I made an upfront call, recorded as ADR-003, the project’s first concurrency decision, dated 2024-01 (project inception):

Optimistic locking: No locks during execution; conflict detection only at commit.

(ADRs — Architecture Decision Records — are short documents capturing one design choice with its context and rationale. They’re a paper trail for why a thing was built a particular way, not just what it does. Typhon has 49 of them so far. They live in the project’s internal documentation, not in the public repo.)

That’s the bet. No locks across data, whatever the architectural cost. Not because I had proof prevention would be faster — I didn’t run those benchmarks — but because the implementation cost of detection is real, the result is never bulletproof, and trading an architectural cost up front for never paying a runtime cost later is the trade I wanted.

The three “pillars” that follow aren’t a survey of alternatives I considered. They’re what the architecture had to become once the constraint was set. MVCC was the obvious starting point. Optimistic Lock Coupling for indexes followed because traditional B+Tree latch coupling violated the constraint at the index level. The “no cross-table latching” rule emerged because anything else reintroduced the cycles I was trying to eliminate.

It’s constraint-driven design, not survey-driven. And it’s why this post claims a property — deadlock-free by construction — instead of a benchmark.

What a deadlock actually is

Briefly, because the rest of the post needs the picture.

Two transactions, T1 and T2. T1 holds a lock on row A and asks for a lock on row B. T2 holds B and asks for A. Neither can proceed. Each is waiting for the other; the wait will never end. That’s a cycle in the lock-dependency graph — the directed graph whose nodes are transactions and whose edges are “is waiting for.” A deadlock is a cycle in that graph. Detection-based databases scan for cycles and break them by aborting one transaction. Prevention-based databases make cycles impossible to form.

The three sections that follow each remove one class of edges from that graph. With every class removed, no cycle is possible.

Pillar 1: MVCC eliminates inter-transaction data locks

The textbook deadlock — T1 locks row A, T2 locks row B, both want the other — requires row-level locking between transactions. Typhon doesn’t do that.

Reads are snapshot-consistent: every transaction is frozen at the global tick value when it began. A reader sees a stable view of the database for its entire lifetime. It never asks for a lock, because there’s nothing to lock against — the snapshot is already immutable.

Writes don’t lock existing rows either. They create new revisions, with the previous revision left intact for any transaction whose snapshot still references it. Two writers updating the same component don’t fight over a lock; they each append a new revision to the chain. Conflict detection happens at commit time, as a single CAS operation: when the writer tries to install its new revision as the current one, the engine checks that the version it built on is still current. If not, the writer aborts and retries.

This removes the entire edge class of “data locks held across transactions.” There are no row locks, no read locks, no write locks on data. The wait-for graph at the transaction level has no edges to form a cycle from.

The cost isn’t free. Two writers updating the same component will conflict at commit, and one of them will retry. For game-server workloads where most components are written by exactly one system, conflicts are rare. For general OLTP workloads with high write contention, the cost would shift the trade — fewer deadlocks, more aborts. Different curve.

Pillar 2: Optimistic Lock Coupling for index structures

Even without row locks, an index structure (B+Tree, R-Tree) is shared mutable state. Traditional databases serialize access through latch coupling: a reader holds a latch on the parent node while acquiring one on the child, releases the parent, advances. It’s a chain of overlapping latches walking down the tree.

That pattern can deadlock. Reader R has the parent latched and wants the child; concurrent writer W has the child latched and walks back up to fix the parent. Two threads, two index latches, mutual wait.

Typhon uses Optimistic Lock Coupling (Leis et al., 2019) instead. Readers don’t latch at all. Each B+Tree node carries a 32-bit version counter. The reader reads the version, traverses, then re-reads the version at the end — if it changed, the traversal data may have been mutated mid-flight, so the reader restarts.

// From Typhon.Engine/Data/Index/OlcLatch.cs
public int ReadVersion()
{
    int v = _version;
    return (v & 0b11) == 0 ? v : 0;  // locked (bit 0) or obsolete (bit 1) -> restart
}

public bool TryWriteLock()
{
    int v = _version;
    if ((v & 0b1) != 0) return false;
    return Interlocked.CompareExchange(ref _version, v | 0b1, v) == v;
}

public void WriteUnlock()
{
    int v = _version;
    _version = ((v >> 2) + 1) << 2 | (v & 0b10);  // version++, keep obsolete, clear lock
}

Bit 0 is the write-lock flag; bits 2–31 are a monotonic version counter. ReadVersion returns 0 if the node is locked or obsolete — the caller treats that as “restart.” TryWriteLock is a single CAS. WriteUnlock increments the version atomically with releasing the lock.

Writers latch only the modified nodes, and they acquire from root to leaf, in strict order. No reader ever blocks a writer. No writer ever holds a parent latch while waiting on a child. The same pattern is reused by the spatial R-Tree — same OlcLatch, same protocol — so this single mechanism covers both index families.

This removes the edge class of “index-level latch cycles.”

Pillar 3: No cross-table latch holding

Two edge classes are gone. The third is the most boring and the most important: at any given moment, a thread never holds a latch in more than one table.

Each ComponentTable in Typhon has independent indexes, independent revision chains, independent page allocations. A transaction’s commit path processes one table at a time. When the commit moves from table A to table B, all of A’s latches are released first.

The only resource that would be shared across tables is the page cache. Latches there could form cycles across the entire engine. So the page cache doesn’t use latches. That refactor is recorded as ADR-033, dated 2026-02-12:

Replace per-page reference counting with epoch-based protection. Each transaction enters an epoch scope that pins the current global epoch; pages accessed within the scope are stamped with that epoch; eviction defers any page whose epoch is still active.

The previous approach was reference counting: every page access incremented a counter, every release decremented it. A transaction touching 100 pages paid for 200 atomic operations — and atomics aren’t free, each one stalls the CPU pipeline waiting for cache-line coherence. Epochs collapse that into two operations regardless of how many pages the transaction touches: one to enter the scope, one to exit.

But the deadlock-freedom payoff isn’t the cost reduction. It’s that the page cache never holds a lock anyone else could wait on. No latch, no waiter queue, no edge in the lock-dependency graph at all.

This removes the last edge class — cross-structure cycles. With all three classes gone, there is no graph in which a cycle can form.

This pillar is the one I worry about most, and the one most likely to break in the future. It’s enforced by convention, not by the type system. Future features — cross-table indexes, parallel query execution holding read latches across multiple tables, foreign-key constraints — would each require extending the lock-hierarchy discipline. The concurrency overview explicitly lists those scenarios as known risks. I’ll have to introduce explicit lock ordering when I get there.

What the bet costs

Prevention isn’t free; it just shifts the cost.

What’s eliminated	What remains
Deadlocks (cycles in the lock graph)	Aborts at commit — local retry
Detection runtime overhead	OLC restarts under index contention
Wait-for-graph data structures	Livelock under heavy contention (different problem)

A writer that loses the commit-time CAS doesn’t trigger a global abort — it retries from where it was, against a refreshed baseline. An OLC reader that sees a version change doesn’t block a writer — it restarts the traversal. These are local costs. A Postgres deadlock victim aborts the entire transaction; a Typhon OLC restart is one tree traversal.

Livelock — repeated retries that never converge — is a different beast. It can’t deadlock, but it can starve. Typhon’s AdaptiveWaiter handles this with a spin-then-yield progression: 65,536 tight spin iterations first (most contention resolves there), then exponentially halving spin counts interleaved with Thread.Sleep(100µs). The 100µs sleep is below the OS scheduler quantum, so wake latency stays sub-millisecond. It bounds livelock probability without trading away the latency targets.

So: deadlocks gone, aborts and restarts kept, livelock bounded by a spin policy.

What others do

I didn’t survey these in depth before committing to prevention — the upfront “no locks” constraint was made on principle. But for the reader’s context, here’s the landscape Typhon sidesteps.

System	Strategy	Cost model
PostgreSQL	Wait-for graph, triggered after `deadlock_timeout` (1s default)	Detection deferred to ≥1s lock wait; cycle scan is expensive but rare
MySQL InnoDB	Wait-for graph + victim selection (smallest tx by row modifications wins)	Detection can be disabled on high-concurrency systems in favor of `innodb_lock_wait_timeout`
CockroachDB	Per-node in-memory lock tables + Raft-replicated write intents	Detection is near-instantaneous; cost shifted to lock-table maintenance
Typhon	Prevention by structure (three pillars above)	No detection runtime cost; cost shifted to OLC restarts and commit-time aborts

These are all sound engineering choices for their workloads. Postgres’ deferred detection is rare-event optimization. InnoDB’s “smaller transaction wins” is a pragmatic heuristic for the OLTP shapes it’s tuned for. CockroachDB’s instantaneous detection genuinely solves the latency problem detection has elsewhere. None of these are wrong. They’re answering a different question: given that we accept locks, how do we manage their cycles?

Typhon answers a different question: given that we don’t accept locks, what does the rest of the architecture have to look like? That’s why the comparison isn’t “Typhon is faster” — it’s “Typhon paid the cost in a different layer.” Each row above describes where the cost lives, not who’s faster.

A footnote: TigerBeetle reaches the same end via a different upfront constraint — single-writer serializable execution. No concurrent transactions, no deadlocks. Different category, same conclusion: detection is the wrong layer to solve this.

What I’d flag for a reviewer

Three honest acknowledgments.

Pillar 3 is enforced by convention, not by the type system. The compiler won’t catch a future PR that holds latches across two ComponentTables. The discipline lives in code review and architectural awareness, not in mechanically-checked invariants. To compensate, I’ve set up a list of explicit design rules that Claude Code enforces during design, development, and code review. Pillar 3’s “no cross-table latching” invariant is on that list; any code that would violate it gets flagged before it reaches the diff.

OLC restart cost is bounded but not zero. Under heavy write contention on a hot B+Tree leaf, optimistic readers can restart a few times before getting a clean version. The restart is one traversal, not a transaction abort, but it’s not free either.

The “deadlock-free” claim assumes the current feature set. Cross-table indexes, parallel queries holding read latches across tables, and foreign-key constraints are all listed as future scenarios that would require extending the discipline. The structural argument holds for what ships today; future features will need to maintain it consciously.

What’s next

The next post drills into Pillar 1 — how Typhon’s MVCC works without cloning rows. The big trick is per-component revision chains instead of per-row tuple versioning: an entity with eight components that updates one creates a single new revision, not eight. The visibility check is a single comparison against the transaction’s snapshot tick. And the EnabledBits exception dictionary pattern — zero-overhead fast path, dictionary slow path — is the prettiest piece of code in the engine.

Microsecond Latency in a Managed Language: The Performance Philosophy Behind Typhon

2026-04-12T00:00:00+00:00

💡Typhon is an embedded, persistent, ACID database engine written in .NET that speaks the native language of game servers and real-time simulations: entities, components, and systems.
It delivers full transactional safety with MVCC snapshot isolation at sub-microsecond latency, powered by cache-line-aware storage, zero-copy access, and configurable durability.

Series: A Database That Thinks Like a Game Engine

Why I’m Building a Database Engine in C#

What Game Engines Know About Data That Databases Forgot

Microsecond Latency in a Managed Language (this post)

Deadlock-Free by Construction

Three Durability Modes, One WAL

Epoch-Based Page Cache (coming soon)

GitHub repo • :mailbox_with_mail: Subscribe via RSS

The first two posts in this series covered the why and the what. Why C# for a database engine. What happens when you combine ECS storage with database guarantees.

This post is the how. Specifically: the five design principles that guide every performance decision in Typhon. Not a bag of tricks — a philosophy. Individual optimizations come and go as the engine evolves, but these principles are stable. They’re what let a managed language deliver sub-microsecond transaction latency.

When your tick budget is 16 milliseconds and you have 100,000 entities to process, every nanosecond of per-entity cost matters. And most of that cost comes from decisions made at design time, not runtime.

Principle 1: Control Memory Layout

Performance starts at the struct definition, not the algorithm. If your data layout causes cache misses, no algorithm can save you.

The most dramatic example: Typhon recently moved from per-entity hash-table lookups to cluster-based Structure of Arrays (SoA) storage. Same data, same queries, different memory layout. Measured on a Ryzen 9 7950X:

Path	ns / entity	vs baseline
Standard EntityAccessor	139 ns	1.0x
ArchetypeAccessor (cached)	94 ns	1.5x
Cluster iteration	2.5 ns	55x

That’s a 55x improvement from changing memory layout alone. The reason: clusters pack N entities (8 to 64, auto-computed per archetype) in contiguous SoA memory. All positions together, all health values together. Every cache line the CPU loads is 100% useful data. For 100K entities, the working set dropped from scattered L3/DRAM access to ~2.5 MB that fits entirely in L2 cache — and L2 is 3x faster than L3 on Zen 4.

The cluster size isn’t a magic constant. An auto-tuning algorithm evaluates every N from 8 to 64 and picks the one that maximizes entities per 8 KB page for a given archetype’s component schema. Non-power-of-2 sizes often pack better: N=14 can yield 28 entities per page vs N=16 yielding only 16. The capacity is derived from the data, not from convention.

False sharing is the other side of layout control. When multiple threads write to adjacent fields, the CPU bounces the shared cache line between cores — a 40-60 cycle penalty per bounce. Typhon wraps mutable per-thread state in 64-byte padded structs. The WAL commit buffer goes further: explicit padding fields isolating the producer’s _tailPosition and the consumer’s _drainPosition onto separate cache lines. Seven unused long fields between them, suppressed with #pragma warning, because the correct layout matters more than the linter’s opinion.

The same hardware awareness drives B+Tree node sizing:

[StructLayout(LayoutKind.Sequential, Pack = 4)]
unsafe public struct Index32Chunk
{
    // 256 bytes — fills four cache lines. Adjacent Line Prefetcher (ALP) on
    // Zen 4+/recent Intel automatically fetches paired 64-byte lines within
    // 128-byte regions, so two ALP triggers cover the full node.

    public const int Capacity = 29;

    public int Control;
    public int OlcVersion;       // bit 0 = locked, bit 1 = obsolete, bits 2-31 = version
    public int PrevChunk;
    public int NextChunk;
    public int LeftValue;
    public int HighKey;          // B-link upper bound
    public fixed int Values[Capacity];  // 29 × 4 = 116 bytes
    public fixed int Keys[Capacity];    // 29 × 4 = 116 bytes
}

This struct is exactly 256 bytes because of the CPU’s prefetcher. The Adjacent Line Prefetcher on modern x86 fetches paired 64-byte lines within 128-byte aligned regions — so two ALP triggers cover the full node. A 256-byte node costs effectively the same as a 128-byte node in terms of memory access, but holds nearly twice the keys.

The capacity of 29 keys isn’t a round number because it isn’t derived from the algorithm. It’s derived from the hardware: 256 bytes of budget minus 24 bytes of header, divided across Keys and Values arrays. Typhon has three B+Tree variants — 16-bit, 32-bit, and 64-bit keys — and all three hit exactly 256 bytes with different capacities (38, 29, and 19 keys respectively). Post #1 mentioned 128-byte nodes. We’ve since moved to 256 bytes after measuring ALP behavior on Zen 4 — capacity went up, lookup latency stayed flat.

Principle 2: Eliminate Allocations on Hot Paths

In .NET, every allocation is a future GC event. On hot paths, the cost isn’t the allocation itself (~5 ns) — it’s the Gen0/Gen1 collection later that pauses unrelated threads. The discipline is simple: allocate nothing in steady state.

ref struct is the primary weapon. A ref struct lives on the stack, dies when the scope ends, and the GC never knows it existed. Post #1 showed EntityRef (96 bytes, inline component cache). But ref structs are a systematic discipline in Typhon, not a one-off optimization:

OlcLatch: wraps a single ref int — the B+Tree node’s version field. The entire optimistic lock coupling protocol (read version, validate, try-write-lock) in a struct that’s basically a typed pointer. Allocated millions of times per second during tree traversal, at zero GC cost.
EpochGuard: RAII scope for epoch-based page protection. Enter and exit in 3.3 ns. Because it’s a ref struct, it can’t be boxed, captured in a closure, or passed to async code — exactly the constraints you want for a scope guard.
WalClaim: a Write-Ahead Log buffer claim containing a Span that points directly into native WAL memory. Can’t escape to the heap by construction — the Span field makes it a ref struct automatically.
PointInTimeAccessor: a reusable snapshot attached to parallel workers. One per worker, stored in a flat array indexed by worker ID. Zero per-entity dictionary overhead — no Dictionary on the hot path.

For short-lived buffers, stackalloc with a threshold pattern: stack-allocate when the array is small (under 64 elements), fall back to the heap otherwise. Most arrays stay small, so they never touch the allocator.

For larger long-lived buffers, the Pinned Object Heap: GC.AllocateArray(capacity, pinned: true). Pre-zeroed by the OS, never compacted by the GC, stable pointer for direct access. Typhon’s HashMap uses this for its entire entry array.

For medium reusable buffers, ArrayPool.Shared. FPI compression rents 9 KB buffers, returns them in a finally block. Query execution rents stream arrays sized for the common case (8 slots), doubles if needed.

Four strategies — ref struct for scoped access, stackalloc for small temporaries, POH for large long-lived buffers, ArrayPool for medium reusable buffers. The result: zero hot-path allocations in steady state.

Principle 3: Reduce Memory Indirections

Every pointer chase is a potential cache miss. An L3 hit costs ~100 cycles, a DRAM miss costs ~200+. The goal: minimize the number of hops from “I want this data” to “here’s the data.”

Post #1 showed the flagship example — the SIMD chunk accessor with its 3-tier lookup (MRU check, Vector256 search, clock-hand eviction). Each tier reduces indirection compared to the next.

Epoch-based page protection eliminates another class of indirection. The traditional approach: atomic increment on page access, atomic decrement on release. For N page accesses in a transaction, that’s 2N atomic operations — each one a potential cache-line bounce. Typhon uses epoch-based protection instead: one stamp when entering a transaction scope, one clear when exiting. Pages accessed within an active epoch can’t be evicted. Cost: 2 operations per transaction, regardless of how many pages are touched.

Zone maps eliminate entire clusters of indirection. Each indexed field maintains per-cluster min/max bounds. A range query like WHERE Level >= 50 checks two integers per cluster — if the cluster’s maximum is below 50, skip every entity in it without loading a single component byte. The impact at different selectivities, measured on 100K entities:

Selectivity	Without zone maps	With zone maps	Speedup
100%	13.4 ms	1.3 ms	10x
50%	13.4 ms	0.65 ms	21x
10%	13.4 ms	0.16 ms	84x
1%	13.4 ms	0.05 ms	268x

The float ordering trick makes this work for non-integer types: an IEEE 754 sign-flip converts floats to a representation where integer comparison order equals numeric order, enabling the same two-comparison interval overlap check regardless of field type.

At the other end of the scale, division elimination saves cycles on every single chunk lookup:

// Field: precomputed at segment creation
// Replaces expensive division (~20-80 cycles) with multiply+shift (~3-4 cycles)
private readonly ulong _divMagic;

// Constructor: compute magic multiplier once
_divMagic = (0x1_0000_0000UL + (uint)_otherChunkCount - 1) / (uint)_otherChunkCount;

// Hot path: every chunk lookup uses this instead of idiv
var pageIndex = (int)((adjusted * _divMagic) >> 32);
var offset = (int)(adjusted - (uint)(pageIndex * _otherChunkCount));

Integer division (idiv on x64) is notoriously slow — 20 to 80 cycles depending on operand size. The magic multiplier replaces it with a multiply and a shift: 3-4 cycles. The precomputation happens once when a segment is created; the benefit repeats on every one of the millions of chunk lookups that follow. Six lines of math, 20x speedup on a hot path. This is a classic systems programming trick that most managed-language developers have never needed — but when your per-entity budget is 2.5 nanoseconds, you need it.

Principle 4: Let the JIT Help

The JIT compiler is your optimization partner, not your enemy. Write code in patterns it can optimize, and it does work for you that you’d have to do manually in C or Rust.

Constrained generics give you monomorphization. When you write where TMask : struct, IArchetypeMask, the JIT generates a separate native code path for each concrete type. ArchetypeMask256 (four ulong fields, bitwise operations) gets fully inlined — no vtable, no virtual dispatch. This is the same optimization Rust gets from generics, but opt-in through the struct constraint.

sealed enables devirtualization. DirtyBitmap and ArchetypeClusterInfo are both on hot paths and both sealed. The JIT knows no subclass can exist, so it converts virtual calls to direct calls and can inline them.

[AggressiveInlining] eliminates call overhead on micro-operations. B+Tree binary search, transaction state validation, every lock acquire/release — the overhead of a method call (save registers, set up stack frame, restore) is 2-5 ns. On a path called millions of times, that compounds.

SoA layout enables auto-vectorization. When a cluster is fully occupied (all N slots in use), the iteration loop becomes a simple sequential walk over contiguous SoA arrays with no branches. The JIT can auto-vectorize this on AVX2 — processing 8 floats per SIMD instruction. The SoA layout isn’t just about cache locality; it’s about giving the JIT a pattern it can vectorize.

But the most surprising JIT trick is dead-code elimination through static readonly fields:

// TelemetryConfig.cs — field declarations
/// 
/// static readonly fields allow the JIT to eliminate disabled telemetry code paths
/// entirely. When a readonly field is false, the JIT treats guarded blocks as dead
/// code and removes them completely in Tier 1 compilation.
/// 
public static readonly bool Enabled;
public static readonly bool EcsEnabled;
public static readonly bool EcsActive;    // Combined: Enabled && EcsEnabled

// Static constructor — computed once at startup
static TelemetryConfig()
{
    var section = config.GetSection("Typhon:Telemetry");
    Enabled = section.GetValue("Enabled", false);
    EcsEnabled = ecsSection.GetValue("Enabled", false);
    EcsActive = Enabled && EcsEnabled;
}

// EcsQuery.cs — usage on hot path
if (TelemetryConfig.EcsActive)
{
    activity = TyphonActivitySource.StartActivity("ECS.Query.Execute");
    activity?.SetTag(TyphonSpanAttributes.EcsArchetype, typeof(TArchetype).Name);
}

When EcsActive is false, the JIT doesn’t just short-circuit the branch — it eliminates the entire if block from the generated native code. No branch instruction, no condition check, zero cost. The static readonly field, initialized in a static constructor, is treated as a constant after Tier 1 JIT compilation. The dead branch and everything inside it vanish.

This gives you zero-cost observability. Full OpenTelemetry tracing when enabled; literally nothing — not even a branch — when disabled. Most C# developers don’t know the JIT does this. It’s worth structuring your telemetry and feature flags around this pattern.

Principle 5: Design for the Hardware

The CPU manual is a requirements document. Cache-line size, SIMD register width, TLB coverage, memory bandwidth — these aren’t abstract numbers. They drive struct sizing, batch sizes, and allocation strategy.

Cache-line size (64 bytes on x86, 128 bytes on Apple Silicon) drives CacheLinePaddedInt sizing, B+Tree node alignment, and SoA array alignment. The ViewDeltaRingBuffer aligns each sub-buffer to 64-byte boundaries so that the hardware prefetcher doesn’t waste bandwidth loading adjacent unrelated data.

SIMD width determines batch sizes. Typhon’s SimdPredicateEvaluator uses three-tier CPU dispatch for filtering entities by field values: AVX-512 processes 16 integer comparisons per instruction, AVX2 processes 8, with a scalar fallback for older hardware. The AVX-512 path uses a workaround — .NET doesn’t expose 512-bit gather intrinsics, so it performs two 256-bit AVX2 gathers and combines them into a Vector512 for the comparison step. The JIT emits a native vpcmpd instruction for the 16-wide comparison. On Zen 4 (which double-pumps 512-bit operations), throughput matches two AVX2 iterations but with half the loop overhead.

Software prefetch hides memory latency where it matters most. During HashMap resize, speculative prefetch computes the future entry’s position in the resized table and issues Sse.Prefetch0 to start loading that cache line while the current entry is being processed. The JIT translates this to a prefetcht0 instruction — essentially free to issue, and it hides 100+ cycles of latency per entry.

BMI2 instructions accelerate spatial indexing. Morton key encoding (Z-order curves) uses Bmi2.ParallelBitDeposit to interleave X/Y coordinates in ~1 cycle. The scalar fallback costs ~10 cycles. Morton ordering places spatially adjacent grid cells at nearby array indices, improving cache locality during neighbor queries.

TLB coverage constrains working set design. Without 2 MB huge pages, x86 L2 TLB covers only 8-12 MB. Every access beyond that risks a 15-20 ns page walk penalty on top of the data access itself. Typhon’s cluster storage keeps 100K entities in ~2.5 MB — comfortably within L2 TLB coverage even without huge pages. For larger datasets, the page cache’s 8 KB pages and sequential access patterns keep the hardware prefetcher effective.

Memory bandwidth (~50 GB/s on Zen 4) is the ceiling for bulk scans. If your SoA component scan isn’t approaching this number, something is leaving performance on the table — unnecessary indirection, poor alignment, or branches that defeat the prefetcher.

All measurements in this post were taken on an AMD Ryzen 9 7950X with .NET 10, BenchmarkDotNet, release configuration.

The Numbers

Individual principles are nice. What matters is how they compound. Here’s what the engine actually delivers:

Operation	Latency
Cluster iteration (per entity)	2.5 ns
CRUD lifecycle (spawn, read, update, destroy, commit)	2.95 μs
Transaction create-read-commit (100 entities)	3.6 μs
B+Tree point lookup (10K entries)	191 ns
Component read (1 MVCC version)	703 ns
Component read (50 MVCC versions)	720 ns
Uncontended RW lock acquire	7.5 ns
Page cache hit	5.5 ns
Chunk accessor MRU hit	1.1 ns
Epoch enter/exit	3.3 ns
Cascade delete 10K entities	7.6 μs

The version invariance number deserves a callout: reading a component with 50 MVCC revisions costs the same as reading one with a single revision. 703 ns vs 720 ns — within measurement noise. The revision chain design works.

These principles also scale to parallel execution:

Workers	Tick time	Speedup	Efficiency
1	~37 ms	1.0x	100%
2	~18 ms	2.1x	104%
4	~10 ms	3.8x	95%
8	~5.3 ms	7.1x	89%

89% parallel efficiency on 8 workers. The 16-worker result (6.7x, 42% efficiency) hits the L3 cache / CCD boundary on the 7950X — a hardware wall, not a software one.

To put these numbers in perspective, here’s the concurrency cost hierarchy that drives Typhon’s design decisions:

Level	Cost	Example
0: Thread-local	~2 ns	TLS counter, local variable
1: Uncontended atomic	5-10 ns	AccessControl read latch
2: Contended atomic	20-140 ns	Multiple writers, same lock
3: System call	500-1000 ns	Timestamp via syscall
4: Context switch	~10,000 ns	Blocking lock, futex wait
5: Oversubscription	100,000+ ns	More threads than cores

Each level is roughly 10x more expensive than the previous one. Typhon’s AdaptiveWaiter (spin → yield → sleep progression) keeps most contention at Level 2, avoiding the 100x jump to Level 4. The cache-line padding from Principle 1 keeps parallel workers from bouncing each other between Level 1 and Level 2. Every design decision maps to staying as low in this hierarchy as possible.

Trade-offs

Unsafe is unsafe. These techniques require unsafe code — pointer arithmetic, raw memory access, manual layout control. One bug can corrupt the page cache. Roslyn analyzers catch some classes of errors at compile time, but not all. The safety net has holes.

Complexity budget. Magic multipliers, SIMD evaluators, epoch-based protection, zone maps — each one is simple in isolation. The combination creates a codebase that demands systems-level understanding to navigate. There’s no shortcut around understanding the hardware.

Not all of this transfers. Most .NET applications don’t need microsecond latency. Using CacheLinePaddedInt in a web API is premature optimization. These techniques are for when you’ve measured, profiled, and confirmed that memory access patterns are your bottleneck — not before.

What’s Next

The next post dives into concurrency: “Deadlock-Free by Construction: How Typhon Eliminates Deadlocks Instead of Detecting Them.” Most databases treat deadlocks as a runtime problem — detect the cycle, abort a transaction, retry. Typhon makes deadlocks structurally impossible through a three-pillar mathematical argument. No detection, no timeouts, no retries.

What Game Engines Know About Data That Databases Forgot

2026-04-05T00:00:00+00:00

💡Typhon is an embedded, persistent, ACID database engine written in .NET that speaks the native language of game servers and real-time simulations: entities, components, and systems.
It delivers full transactional safety with MVCC snapshot isolation at sub-microsecond latency, powered by cache-line-aware storage, zero-copy access, and configurable durability.

Series: A Database That Thinks Like a Game Engine

Why I’m Building a Database Engine in C#

What Game Engines Know About Data That Databases Forgot (this post)

Microsecond Latency in a Managed Language

Deadlock-Free by Construction

Three Durability Modes, One WAL

Epoch-Based Page Cache (coming soon)

GitHub repo • :mailbox_with_mail: Subscribe via RSS

Game servers sit at an uncomfortable intersection. They need the raw throughput of a game engine — tens of thousands of entities updated every tick. But they also need what databases provide: transactions that don’t corrupt state, queries that don’t scan everything, and durability that survives crashes.

Today, game server teams pick one side and hack around the other. An Entity-Component-System framework for speed, with manual serialization to a database for persistence. Or a database for safety, with an impedance mismatch every time they touch game state.

Typhon draws from both traditions. It’s a database engine that stores data the way game engines do — and provides the guarantees that game servers need. Here’s why those two worlds aren’t as far apart as they look.

Two Fields, One Problem

ECS architecture evolved in game engines. Relational databases evolved in enterprise software. They never talked to each other. But look at what they built:

ECS Concept	Database Concept	Shared Principle
Archetype	Table	Homogeneous, fixed-schema storage
Component	Column	Typed, blittable, bulk-iterable data
Entity	Row	Identity with dynamic composition
System	Query	Process all records matching a signature
Frame Budget (16ms)	Latency SLA	Hard real-time deadline

An ECS “archetype” is a table. A “component” is a column. A “system” is a query. The vocabulary is different, the underlying structure is the same. Two fields, separated by decades and industry boundaries, converged on structurally identical solutions because they were solving the same fundamental problem: managing structured data under performance constraints.

This convergence is why a synthesis is possible at all. It’s not an accident — it’s driven by the same physics. Data must be laid out for the CPU cache. Access patterns must be predictable. Latency budgets are real.

What We Learned From Game Engines

ECS taught the database world something important about how data should be stored. Three lessons Typhon draws directly from game engine architecture:

Cache locality by default. In a traditional row store, reading all player positions means loading entire rows — names, inventories, health, everything. Most of those bytes are wasted. In ECS, components are stored per type: all positions contiguous, all health values contiguous. Reading 10,000 positions is a linear memory scan where every byte is useful.

This matters more than most developers realize. An L1 cache hit costs roughly 1 nanosecond. A DRAM miss costs 60-70 ns — a 65x penalty. When your database layout forces cache misses, no amount of algorithmic cleverness can save you.

Zero-copy is the default, not the optimization. In a traditional database, reading a record means deserializing from a storage page into a language-level object. In ECS, a component is already in memory in its final layout — you just hand back a pointer. Typhon preserves this: components are blittable unmanaged structs read directly from pinned memory pages. No serialization, no managed heap allocation, no GC involvement.

Entity as pure identity. In ECS, an entity is just an ID — a 64-bit number with no inherent structure. All data lives externally in component tables. This is the opposite of ORM thinking where the object is the entity. Typhon inherits this: EntityId is a lightweight value type, all state lives in typed component storage. This separation is what makes the rest of the architecture possible — per-component versioning, per-component storage modes, independent indexes per component type.

What We Learned From Databases

Traditional databases solved problems that ECS never had to face. Four capabilities Typhon draws from database architecture:

ACID transactions with per-component MVCC. Game engines typically have no isolation. Two systems modifying the same entity in the same tick is a race condition — and in a single-process game, you control the execution order so you can manage it. On a game server with concurrent player sessions, you can’t.

Databases solved this decades ago with MVCC: snapshot isolation where readers never block writers, with conflict detection at commit time. Typhon brings this in — but with a twist. Traditional databases version entire rows. Typhon versions each component independently. An entity’s PositionComponent and InventoryComponent each maintain their own revision chain: a circular buffer of 12-byte revision entries, each stamped with a 48-bit transaction sequence number.

// Simplified: finding the visible revision for a snapshot
foreach (var rev in WalkRevisions(entityId))
{
    if (rev.IsolationFlag && rev.TSN != myTransactionTSN)
        continue;  // Skip uncommitted revisions from other transactions

    if (rev.TSN <= snapshotTSN)
        return rev; // Most recent revision visible to our snapshot
}

This means a transaction reading a player’s position sees a consistent frozen point-in-time across all component types simultaneously — without locking any of them. Writers never block readers. And because revisions are per-component rather than per-entity, updating a player’s position doesn’t create a new version of their inventory. Less data copied, less garbage to collect.

Indexed selective access. This is the big one. ECS systems iterate everything matching a component signature every tick. That works brilliantly for particle simulations where every particle needs updating. But game servers often don’t need all of them:

Scenario	Total Entities	Processed Per Tick	Useful Work
Battle royale (per-client relevancy)	50,000 actors	500–2,000	1–4%
MMO area of interest	100,000	200–1,000	0.2–1%
Physics (awake bodies only)	All rigidbodies	Awake subset	5–20%

When you’re processing 1–4% of your entities, scanning everything is doing 25–100x more work than necessary. ECS frameworks recognized this — Unity DOTS added enableable components, Flecs added group_by, Unreal MassEntity added LOD tiers. These are all clever workarounds for the same underlying issue: ECS was designed for bulk iteration, not selective access.

Databases solved this with indexes. B+Trees for value-based lookups, spatial trees for area-of-interest queries, selectivity estimation to decide when to scan versus when to seek. Typhon brings these into the component storage model — not as bolted-on workarounds, but as first-class citizens.

Spatial partitioning. For spatial access patterns specifically — the #1 selective access need in game servers — Typhon integrates a two-layer spatial index directly into the component storage:

Layer 1: Sparse hash map — maps coarse grid cells to entity counts. O(1) rejection of empty regions before the tree is even touched.
Layer 2: Page-backed R-Tree — AABB, radius, ray, frustum, and kNN queries. Same OLC-latched, SOA node architecture as the B+Trees.

Both layers run inside the same transactional model as everything else. No external spatial hash bolted on alongside your ECS. No cache locality destroyed by chasing pointers into a separate data structure.

Durability. A game client can afford to lose state on crash — reload the level. A game server cannot. Player inventories, economy state, progression data — all must survive process restarts and crashes. WAL-based crash recovery, checkpointing, configurable fsync — these are database fundamentals that game servers need but ECS frameworks never provided.

Query planning. When you have both indexes and sequential storage, someone needs to decide which access path to use. Databases have decades of work on cost-based query optimization — selectivity estimation, histogram statistics, index selection. Typhon brings a query planner into the ECS world: given a predicate on a component field, it automatically chooses full scan or B+Tree seek based on estimated selectivity.

Purpose-Built for Game Servers

Typhon doesn’t glue ECS and database concepts together with duct tape. It synthesizes them into a single model designed for game server workloads.

A component in Typhon is simultaneously an ECS component and a database schema:

[Component]
public struct PlayerComponent
{
    [Field]
    public String64 Name;

    [Field]
    [Index]                    // B+Tree for fast lookups
    public int AccountId;

    [Field]
    public float Experience;
}

Blittable, unmanaged, fixed-size, stored contiguously per type — that’s the ECS side. Typed fields with automatic B+Tree indexes on marked fields — that’s the database side. One declaration, both worlds.

The query API makes the synthesis concrete:

var topPlayers = db.Query<Player>()
    .Where(p => p.Level >= 50)
    .OrderByDescending(p => p.Level)
    .Take(10)
    .ExecuteOrdered(tx);

ECS-style typed component access. Database-style predicate filtering with automatic index selection. Inside a transaction with snapshot isolation. The query planner chooses scan vs B+Tree based on selectivity — the developer doesn’t have to.

And because game servers have different durability needs for different operations, Typhon lets you choose per unit of work:

// Position ticks: game-engine speed, batched durability
using var uow = dbe.CreateUnitOfWork(DurabilityMode.Deferred);

// Legendary item drop: database safety, immediate fsync
using var uow = dbe.CreateUnitOfWork(DurabilityMode.Immediate);

Same engine, same API. Deferred mode gives game-engine-class commit latency for position updates that can be re-simulated on crash. Immediate mode gives database-class guarantees for a transaction that grants a rare item worth real money. The game server decides per operation — not globally.

Storage Modes: Not All Data Is Equal

A game server doesn’t treat all data the same. Player positions change 60 times per second and can be re-simulated on crash. Inventory mutations are rare but must never be lost. AI runtime state — current targets, threat scores, pathfinding waypoints — is recomputed every tick and worthless after a restart.

Traditional databases treat all data identically. Traditional ECS keeps everything in memory with no durability distinction. Typhon lets you choose per component type:

Mode	MVCC History	Persisted	Change Tracking	Best For
Versioned	Full revision chains	Yes (WAL + checkpoint)	Via MVCC	Inventory, economy, progression
SingleVersion	Current state only	Yes (WAL + checkpoint)	DirtyBitmap	Positions, health, frequently-updated state
Transient	Current state only	No	DirtyBitmap	AI blackboard, threat scores, pathfinding scratch

SingleVersion components skip the revision chain overhead entirely — no circular buffer, no per-write allocation. They track changes through a DirtyBitmap instead: one bit per entity, flipped on write, scanned on tick fence. This is how game engines track what changed, and it’s the right model for data that updates every tick.

Versioned components get full MVCC with snapshot isolation — readers see consistent historical state, writers don’t block readers, conflicts are detected at commit time. This is how databases protect critical data, and it’s the right model for things that must never be corrupted.

Transient components never touch disk at all — no WAL, no checkpoint, no recovery. Pure in-memory storage with the same query and indexing API as everything else. AI blackboard data that’s recomputed every tick has no business paying persistence overhead.

The same engine, the same transaction API, but the storage layer does exactly what each component type needs. This is what “purpose-built for game servers” means in practice.

Views: The Bridge Between ECS Systems and Database Queries

In ECS, a “system” runs every tick, processing all matching entities. In a database, a “materialized view” maintains a cached result set and refreshes it incrementally. Typhon’s Views are both:

using var view = db.Query<ItemData>()
    .Where(i => i.Rarity >= 3)
    .ToView();

// Game loop
while (running)
{
    using var tx = dbe.CreateQuickTransaction();
    view.Refresh(tx);  // Microsecond incremental refresh

    // React to changes — like an ECS system, but only for what changed
    var delta = view.GetDelta();
    foreach (var pk in delta.Added)   SpawnVisual(pk);
    foreach (var pk in delta.Removed) DespawnVisual(pk);
    foreach (var pk in delta.Modified) UpdateVisual(pk);
    view.ClearDelta();
}

The initial ToView() runs a full query. After that, Refresh() drains a lock-free ring buffer of changes pushed by the commit path — only entities whose indexed fields actually changed are re-evaluated. If 100,000 entities match your view but only 12 changed since last refresh, you do 12 evaluations, not 100,000.

This is the iterate-everything problem solved from the database side: don’t re-scan, track deltas.

Trade-offs

Specializing for game servers means giving things up.

Blittable components only. No string, no object references, no variable-length arrays inside components. Text uses fixed-size types like String64. This is the price of zero-copy reads and cache-friendly storage — and it’s a constraint game developers are already familiar with from ECS frameworks.

Entity-centric relationships, not SQL JOINs. Typhon supports navigation links, 1:N and N:M relationships — but they follow entity references, closer to a graph database than a traditional SQL one. This matches how game servers naturally think about data (an entity has components, a guild contains members), but if your mental model is SELECT ... FROM a JOIN b ON a.x = b.y, it’s a different paradigm.

Schema in code, not SQL. Components are C# structs with attributes, not DDL statements. Natural for game developers, unfamiliar territory for database administrators. If your team thinks in SQL, this is a paradigm shift.

What’s Next

In the next post, I’ll go deeper into the performance philosophy that makes all of this actually fast — data-oriented design, cache-line awareness, and zero-allocation hot paths. The principles that let a managed language hit microsecond-latency transactions.

Why I’m Building a Database Engine in C#

2026-03-28T00:00:00+00:00

💡Typhon is an embedded, persistent, ACID database engine written in .NET that speaks the native language of game servers and real-time simulations: entities, components, and systems.
It delivers full transactional safety with MVCC snapshot isolation at sub-microsecond latency, powered by cache-line-aware storage, zero-copy access, and configurable durability.

Series: A Database That Thinks Like a Game Engine

Why I’m Building a Database Engine in C# (this post)

What Game Engines Know About Data That Databases Forgot

Microsecond Latency in a Managed Language

Deadlock-Free by Construction

Three Durability Modes, One WAL

Epoch-Based Page Cache (coming soon)

GitHub repo • :mailbox_with_mail: Subscribe via RSS

When I tell people I’m building an ACID database engine in C#, the first reaction is always the same: “But what about GC pauses?”

It’s a fair question. Nobody builds high-performance database engines in .NET. The assumption is that you need C, C++, or Rust for this class of software — that managed languages are fundamentally disqualified from the microsecond-latency club.

After 30 years of building real-time 3D engines and systems software, I chose C# anyway. The project is called Typhon: an embedded ACID database engine targeting 1–2 microsecond transaction commits. And the reasons behind that choice might change how you think about what C# can do.

The Case Against C# (Let’s Steel-Man It)

Before I make my case, let me honestly lay out every argument against choosing C# for this. These are real concerns, not strawmen.

The GC is non-deterministic. It can pause all your threads whenever it wants. For a database engine that promises microsecond latency, a 10ms Gen2 collection is catastrophic — that’s 10,000x your latency budget.

You don’t control memory layout. The managed heap decides where objects live. The GC can move them around during compaction. You can’t guarantee that your B+Tree nodes sit on cache-line boundaries, or that your page cache buffer won’t get relocated mid-transaction.

JIT warmup is real. The first call to any method pays the compilation cost. In a database engine, the first transaction after startup shouldn’t be 100x slower than the steady state.

Virtual dispatch and bounds checking add overhead. Every array access has a hidden bounds check. Every interface call goes through a vtable. In a hot loop processing millions of entities, these nanoseconds compound.

These are all legitimate problems. I won’t pretend they aren’t. But here’s what most people miss: modern C# has answers for every single one of them.

What Most People Don’t Know About C#

The C# that most developers know — classes, garbage collection, LINQ — is only half the language. There’s a whole other side that the .NET runtime team has been quietly building for a decade, and it looks nothing like what you’d expect.

unsafe gives you C-level control. Raw pointers, pointer arithmetic, stackalloc for stack buffers, fixed-size arrays — the JIT generates the same mov/cmp/jne instructions you’d get from C. Not “close to C.” The same instructions.

GCHandle.Alloc(Pinned) makes the GC irrelevant where it matters. You can pin byte arrays so the GC never moves them. Typhon’s entire page cache is pinned memory — the GC doesn’t touch it, doesn’t scan it, doesn’t move it. It’s just raw bytes at a fixed address, exactly like malloc in C.

ref struct eliminates heap allocations on hot paths. A ref struct can never escape to the heap. It lives on the stack, dies when the scope ends, and the GC never knows it existed. Typhon’s entity accessor (EntityRef) is a 96-byte ref struct — zero allocation, zero GC pressure.

Constrained generics give you true monomorphization. When you write where T : unmanaged, the JIT generates a separate native code path for each type parameter. sizeof(T) becomes a constant. Dead branches get eliminated. It’s the same optimization Rust gets from generics — not a runtime dispatch, but compile-time specialization.

Hardware intrinsics are first-class. System.Runtime.Intrinsics gives you Vector256, Sse42.Crc32, BitOperations.TrailingZeroCount — the same SIMD instructions available in C/C++, with the same performance, and runtime feature detection so you can fall back gracefully.

[StructLayout(Explicit)] gives you exact memory layout. Field offsets, padding, size — you control every byte. Cache-line alignment, false-sharing prevention, bit-packing — it’s all there.

This isn’t “C# trying to be C.” It’s C# providing a genuine systems programming layer on top of a best-in-class managed ecosystem.

What Typhon Actually Looks Like

Theory is nice, now let’s look at real code.

Hardware-accelerated WAL checksums

Every page written to the Write-Ahead Log needs a CRC32C checksum. Here’s what that looks like in C# — calling CPU instructions by name:

private static uint ComputePartial(uint crc, ReadOnlySpan<byte> data)
{
    if (Sse42.X64.IsSupported)   return ComputeSse42X64(crc, data);
    if (Sse42.IsSupported)       return ComputeSse42X32(crc, data);
    if (ArmCrc32.Arm64.IsSupported) return ComputeArm64(crc, data);
    return ComputeSoftware(crc, data);
}

private static uint ComputeSse42X64(uint crc, ReadOnlySpan<byte> data)
{
    ulong crc64 = crc;
    ref byte ptr = ref MemoryMarshal.GetReference(data);
    int offset = 0;
    int aligned = data.Length & ~7;

    while (offset < aligned)
    {
        crc64 = Sse42.X64.Crc32(crc64, Unsafe.ReadUnaligned<ulong>(ref Unsafe.Add(ref ptr, offset)));
        offset += 8;
    }

    uint crc32 = (uint)crc64;
    while (offset < data.Length)
    {
        crc32 = Sse42.Crc32(crc32, Unsafe.Add(ref ptr, offset));
        offset++;
    }
    return crc32;
}

Sse42.X64.Crc32() compiles to a single x86 crc32 instruction. The runtime detects the CPU capabilities, the JIT eliminates the dead branches, and what executes is the same code a C programmer would write — but with automatic fallback on platforms without SSE4.2. Result: ~1.3 µs per 8 KB page.

The SIMD chunk accessor

This is Typhon’s page cache hot path — a 16-slot cache that finds your data in one of three tiers:

// === ULTRA FAST PATH: MRU check ===
var mru = _mruSlot;
if (_pageIndices[mru] == pageIndex)
{
    var headerOffset = pageIndex == 0 ? _rootHeaderOffset : _otherHeaderOffset;
    return (byte*)_baseAddresses[mru] + headerOffset + offset * _stride;
}

// === FAST PATH: SIMD search through all 16 cached slots ===
fixed (int* indices = _pageIndices)
{
    var target = Vector256.Create(pageIndex);

    var v0 = Vector256.Load(indices);
    var mask0 = Vector256.Equals(v0, target).ExtractMostSignificantBits();
    if (mask0 != 0)
    {
        var slot = BitOperations.TrailingZeroCount(mask0);
        return GetFromSlot(slot, pageIndex, offset, dirty);
    }

    var v1 = Vector256.Load(indices + 8);
    var mask1 = Vector256.Equals(v1, target).ExtractMostSignificantBits();
    if (mask1 != 0)
    {
        var slot = 8 + BitOperations.TrailingZeroCount(mask1);
        return GetFromSlot(slot, pageIndex, offset, dirty);
    }
}

The _pageIndices array is a fixed int[16] — 64 bytes, one cache line, packed for SIMD. One Vector256.Equals compares 8 page indices in a single instruction. The MRU fast path handles the common case (repeated access to the same page) with a single branch — branch predictor friendly, near-zero cost.

Zero-copy entity reads

EntityRef is a ref struct — stack-only, 96 bytes, with an inline fixed array caching component locations:

public unsafe ref struct EntityRef
{
    internal readonly EntityId _id;
    internal readonly ArchetypeMetadata _archetype;
    internal readonly ArchetypeEngineState _engineState;
    internal readonly Transaction _tx;
    internal ushort _enabledBits;
    internal readonly bool _writable;
    private fixed int _locations[16];  // inline component chunk IDs

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public ref readonly T Read<T>(Comp<T> comp) where T : unmanaged
    {
        byte slot = _archetype.GetSlot(comp._componentTypeId);
        int chunkId = _locations[slot];
        var table = _engineState.SlotToComponentTable[slot];
        return ref _tx.ReadEcsComponentData<T>(table, chunkId);
    }
}

That Read call goes from method call → slot lookup → chunk ID → page cache → pointer arithmetic → ref readonly T pointing directly into a pinned memory page. Zero copies. Zero allocations. Zero GC involvement. The where T : unmanaged constraint means the JIT knows the exact layout — it compiles to pointer arithmetic, nothing more.

JIT-specialized hash functions

Even the hash functions exploit the JIT. Since sizeof(TKey) is a compile-time constant for constrained generics, the dead branches vanish:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static uint ComputeHash<TKey>(TKey key) where TKey : unmanaged
{
    if (sizeof(TKey) == 4) return FastHash32(Unsafe.As<TKey, uint>(ref key));
    if (sizeof(TKey) == 8) return XxHash32_8Bytes(Unsafe.As<TKey, long>(ref key));
    return XxHash32_Bytes((byte*)Unsafe.AsPointer(ref key), sizeof(TKey));
}

When you call ComputeHash(42), the JIT generates just the 4-byte path. The other two branches are completely eliminated. This is real monomorphization, not runtime dispatch.

The Productivity Argument

A database engine is more than its hot path. Around the core engine sits a large shell of infrastructure: configuration management, structured logging, telemetry, dependency injection, testing, benchmarking.

In C or Rust, you’d build much of this yourself or stitch together crates/libraries with varying quality. In .NET, this is production-grade and free: ILogger and OpenTelemetry for observability, BenchmarkDotNet for rigorous micro-benchmarks, NUnit for testing, IConfiguration for settings. All well-documented, all interoperable, all maintained by Microsoft or battle-tested OSS communities.

For a solo developer building a database engine, this is a genuine competitive advantage. I spend my time on concurrency primitives and page cache eviction, not on reinventing a logging framework.

It’s the Memory Layout, Not the Language

Here’s the insight that years of real-time 3D engines taught me: the bottleneck in a database engine is memory access patterns, not instruction throughput.

A cache miss to DRAM on a Ryzen 7950X costs 61–73 nanoseconds. That’s ~250 CPU cycles doing nothing, waiting for data. A CAS operation hitting L1 costs 1.4 nanoseconds. The ratio is 50:1.

No amount of “zero-cost abstractions” in your language can save you if your data structures cause cache misses. Conversely, if your data layout is cache-friendly — contiguous, aligned, predictable access patterns — the language barely matters. C# with unsafe generates identical machine code to C on hot paths. The JIT is that good.

What matters is:

Cache-line awareness: Typhon’s B+Tree nodes are 128 bytes — two cache lines. The stride prefetcher on Zen4 covers the second line automatically. This alone cut insert latency by 53% and lookup latency by 30% versus 64-byte nodes.
Data-oriented design: Structure of Arrays over Array of Structures. SIMD-friendly layouts. Blittable types only.
Minimizing indirections: Every pointer chase is a potential cache miss. The SIMD chunk accessor’s MRU hit avoids the chase entirely.

The language you write in matters far less than the memory layout you design.

The Numbers

All measurements on a Ryzen 9 7950X, .NET 10.0, BenchmarkDotNet, release configuration.

Operation	Latency	Throughput
CRUD lifecycle MVCC (spawn, read, update, destroy, commit)	1.2 µs	830K ops/sec
90 reads/10 updates workload (100 ops per tx, MVCC)	22 µs	~4.5M entity-ops/sec
B+Tree lookup (hit)	267 ns	3.7M ops/sec
B+Tree sequential scan (per key)	2.1 ns	479M keys/sec
Uncontended lock acquire	7.8 ns	128M ops/sec
Page cache hit	5.3 ns	—

Context: an uncontended CAS on Zen4 costs 1.4 ns. A DRAM round-trip costs 61–73 ns. Typhon’s lock acquire (7.8 ns) is about 5 CAS operations — tight, considering it handles shared/exclusive arbitration with waiter tracking. The 267 ns B+Tree lookup implies 6–7 memory accesses, which matches a tree traversal through L2/L3 cache.

These are early alpha numbers. There’s room to improve. But they validate the core thesis: C# is not the bottleneck.

Trade-offs

No choice is without cost. Here’s what I’d tell someone considering the same path.

Memory safety is on you. In unsafe blocks, you can corrupt memory, dereference bad pointers, overflow buffers — the compiler won’t save you. Span is a slightly slower but totally safe alternative.

The GC hasn’t been a problem — but it could be. By pinning the page cache and using ref struct on hot paths, Gen2 collections are rare and cheap. But I won’t pretend this is guaranteed. A workload that allocates heavily in managed code between transactions could still see pauses. The answer is discipline: don’t allocate on hot paths. The language lets you — it just doesn’t force you.

“But Rust would give you compile-time safety.” True — the borrow checker catches ownership and lifetime bugs that unsafe C# can’t. But C# has a trick Rust doesn’t: Roslyn analyzers. I wrote a custom analyzer suite (TYPHON001–007) that enforces domain-specific safety rules as compiler errors:

[NoCopy] attribute + analyzer: performance-critical structs like ChunkAccessor cannot be passed by value — the compiler errors if you forget ref. This is the same guarantee Rust’s borrow checker gives for move semantics, but scoped to the types that actually matter.
Ownership tracking: if you create a ChunkAccessor or Transaction and don’t dispose it, that’s a compiler error — not a runtime leak. The analyzer tracks ownership transfers through assignments, returns, and ref/out parameters, [return: TransfersOwnership] on a method helps to express ownership transfer for the analyzer to act accordingly.
Disposal completeness: if your type holds a critical disposable field and your Dispose() method misses it or has an early return that skips it — compiler error.

// This is a compile-time error in Typhon — TYPHON001
void Process(ChunkAccessor accessor) { ... }  // ✗ Error: must be passed by ref

void Process(ref ChunkAccessor accessor) { ... }  // ✓ OK

You don’t get Rust’s safety for free in C#. But you can build the exact subset you need as compiler errors, tailored to your domain. And unlike Rust’s borrow checker, these rules carry domain context in the diagnostics: “causes page cache deadlock” is more actionable than “value moved here.”

Rust’s ecosystem for the surrounding infrastructure (logging, DI, configuration, testing) is also less mature than .NET’s, and as a solo developer, my velocity matters. I chose the language where I ship faster.

JIT warmup is real but manageable. The first few transactions after cold start are slower. For an embedded engine (no separate server process), this is acceptable — the host application typically has its own warmup. For a server database, you’d want tiered compilation or AOT.

What’s Next

In the next post, I’ll explain why an ACID database engine borrows its storage architecture from game engines — specifically the Entity-Component-System pattern. Game engines and databases are solving the same fundamental problem: managing structured data with extreme performance constraints. They just evolved completely different solutions.

Introduction of working with struct

2018-04-04T18:24:33+00:00

Introduction

Before C# 7.2 and .net core 2.1, you could improve .net performance with a good dose of conscious effort and relying on code that would not necessarily be nice to look at (and certainly maintainable). Microsoft made several improvements to make sure that you could design & write faster code, not at the sake of the good practices.

Struct, struct and more struct!

It is important to get rid of this reflex of choosing the class keyword every time you design a new type.

Question yourself about if object-oriented programming is really necessary or if you should use another paradigm that would be more data driven.

Using struct has game changing advantages: You don’t directly allocate on the heap, so you’re not using the GC.

You can design a memory friendly layout for your type, avoiding many memory indirection that would increase the chances of cache miss!

Before C# 7.2, relying on struct were not necessarily a performance win, the reason was that each time you passed/return a struct based object: a copy would be made, on the stack, but still a copy is a copy: it takes time!

It is now possible to pass/return struct based objects using reference to the initial object: avoiding an unnecessary and costly copy.

Two know languages keywords ref and in enable many new patterns to speed things up!

Relying on struct will also enable a linear memory layout for your data, making things way more CPU cache friendly.

Let’s take an example:

public class A
{
    public int val1;
    public int val2;
}
 
public class B
{
    public float f1;
    public float f2;
 
    public A a1 = new A();   // Point to another object: another memory location
    public A a2 = new A();   // Same here
}

// Allocate an array of 256 pointers to 256 distinct instances of B
var data = new B[256];

data is one object allocated on the heap (GC), it references 256 instances of B, each also allocated on the heap. Each instance of B references two instances of A, also on the heap.

So we have a total of 1 + 256 + 2*256 objects allocate on the heap: 769 objects, each located somewhere in the memory, that will be eventually garbage collected when no longer needed.

Things to note:

You stress your GC. It could be fine if the life time of all these objects is big, close to static. But if you’re doing some high frequency code and you allocate data hundred, thousand of time per second: it will have an impact on performances.
Let’s pretend you want to access all fields (direct and indirect) for data[0] and data[1]. You will have to fetch 7 separate memory locations (the data array, data[0], data[0].a1, data[0].a2, data[1], data[1].a1, data[1].a2).

Let’s make the following changes: we no longer use class, but struct instead.

public struct A
{
    public int val1;
    public int val2;
}
 
public struct B         // Size of the type: 24 bytes
{
    public float f1;    // Offset 0
    public float f2;    // Offset 4
 
    public A a1;        // Offset 8
    public A a2;        // Offset 16
}

// One single memory block of 256 * 24 bytes
var data = new B[256];

Ok, this is a naive explanation, internally .net will make things a bit different, but you get the point:

We now have 1 object allocated on the heap (data), which allocates a continuous memory surface block to sequentially store all instances of B.
B no longer reference other objects: the a1 and a2 fields are part of B, not referenced by B.

A foreach on the class version with access to all the fields would have lead to deal with 769 distinct memory locations, with a CPU that would have hard time to prefetch to reduce the time to access the data.

A foreach on the struct version with access to all the fields would be as fast as it could be: there’s one memory block, the CPU understand pretty quickly that we’re sequentially accessing the data, so the prefetch and cache loads are very efficient, because everything was design for this!

Benchmark of class versus struct

I’ve created a small project in order to demonstrate what was explained above, you can go grab it and play with it or just keep on reading.

There are two implementations of a simple program which has to deal with a Financial Stock, containing a list of Trades, each Trade also contains a list of Tickets.

Diagram of the class version

Diagram of the struct version

(don’t mind about the TradeType enum, it’s not important here)

The program

The program file is fairly simple:

It creates one Stock.
Generates 1000 Trades to buy or sell some quantity of this Stock.
Each Trade resulted in one to many Tickets with a given quantity for a given price. The sum of all the ticket’s quantity match the requested one for the Trade.

The program creates the class version and the struct one.

We are going to bench an operation that will compute the average buy price and average sell price for all the Tickets.

So basically:

We parse all the tickets of all the trades
Multiply their price with their quantity
Divide the total buy price by the total buy quantity, same for sell.

In other words we parse the whole tree hierarchy of instances and perform basic computation on it.

Here is a result of the benchmark comparing the class version against the struct (using BenchmarkDotNet)

Few facts:

The struct version is 4 times faster than the class one, take a look at the Scaled column, the class version is the baseline, so its value is 1.00, the struct version is running at 0.24 time compared to the baseline.
There’s no Garbage Collection on the struct version, for a pretty obvious reason.
The class version has some Garbage Collection and extra allocated memory.

Let’s be clear: this benchmark is not including the construction of the objects, this is done in a setup phase that is not benchmarked. Here, we are only profiling the computation of the average prices.

So why is this 4 times faster considering the fact we’re not creating objects, only parsing them? Well the reason is the one explained in the first post of the series: struct are more memory friendly.

Let’s explain a bit

Memory layout for the `struct` version

In the diagram above, each color represent a memory location.

What is important to understand is:

All Trades (Tr1…Tr6) objects are stored in an array (stored, not referenced!), so they are in a contiguous memory zone. A For/Loop on them will be pretty efficient as the CPU will quickly fetch the Trade n+1 while we’re processing the Trade n.
Same thing for the Tickets, but only for the ones that are owned by the same Trade: each Trade has an array containing the Tickets it owns.

In our case there’s 1000 Trades objects that are in a contiguous memory location: this is very memory friendly!

In the program, on average there are 5 Tickets per Trade, it is also apparently enough to be memory friendly.

We could have pushed things further and store all Tickets for all Trades in the same array, but things would be a bit more difficult, let’s keep it simple for now.

Memory layout for the `class` version

Well, no need of color this time, each object is stored in a distinct memory location, determined by heap manager of the .net CLR.

What is important to understand here is:

You have no control/guarantee of where the objects are stored compared to the others, which is not good when you care about performances.
Each object has a distinct lifetime, which is good, but it comes with a price.

A design choice to make

Again, there are no silver bullet: to gain something you have to give up something else in return.

In our case this more about a design decision to make:

You could easily have everything stored as objects in the heap, this is easy and it’s very “C#”, but performances will be what they are: average for .net.
You can decide from the start how your objects will be stored, to improve performances, at the expense of some programming flexibility/simplicity.

There’s a saying out there which warns every programmer:

“Early optimization is the root of all evil.”

This is a simplified version of a quote from the great Donald Knuth:

“The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.”

Early optimization is not the root of all evil, most of the time, it will be for sure. Optimizing something that won’t worth it is one of the biggest mistakes we all did (and still do, because, you know, it’s fun, it’s challenging!).

However, there are some profound design choices that have to be made from the start, because after, it will be too late!

Ok, that’s all for this post, in the next one we will take a closer look at the code, how to design and program things in order to achieve better performances!

UPDATE #1 on April the 5th

As Marko Lahma pointed out in the comment, the class/struct benchmark is not a fair one, I rely on foreach for classes, because, well, daily habits. This is what generated the 2040 B of Allocation and the Gen 0 GC. The speed difference was bigger than I expected, but mainly because the test is doing pretty much nothing in the nested for/loops (and the GC surely impacts overall performances).

Here’s the result.

Method	Mean	Error	StdDev	Scaled	Gen 0	Allocated
ComputeAveragePriceOnClass	4,083.4 ns	13.766 ns	12.877 ns	1.00	0.4807	2040 B
ComputeAveragePriceOnClassNoEnumerator	2,807.9 ns	27.279 ns	25.517 ns	0.69	–	0 B
ComputeAveragePriceOnStruct	850.5 ns	4.484 ns	3.500 ns	0.21	–	0 B

Working with struct, a closer look

2018-04-03T08:25:34+00:00

Introduction

It’s time to take a closer look at the code and find out the core mechanics of working with struct in C# 7.2 and .net core 2.1.

First, we will make a quick recap of the new ref and in keywords.

Then, we will take a look at a class that will be used to store and retrieve easily the objects and see how we use it to manipulate the objects.

Finally we will see some pitfalls to avoid.

Quick recap of the `ref` and `in` keywords

For those who are not familiar with the new feature of C# 7.2, let’s make a quick recap of the ref and in keywords. You can also find the full documentation about this here.

Say you have this class:

public partial struct Vector3
{
    public double X;
    public double Y;
    public double Z;
}

Now you want to code a method that makes an addition of two vectors:

public partial struct Vector3
{
    static Vector3 Add(Vector3 a, Vector3 b)
    {
        return new Vector3
        {
            X = a.X + b.X,
            Y = a.Y + b.Y,
            Z = a.Z + b.Z
        };
    }
}

In this implementation we have two similar problems:

When you will call the Add() method, the a and b objects you will pass will be copied: that’s the basic behavior of value types. You may argue that it’s not a big deal for such a small type considering a 64-bits pointer will be 2/3 of it, but that’s not the point right now.
You will also have to create a new instance that will store the result and return it to the caller. This instance will also be duplicated at call site.

So we are likely dealing with 4 allocations, for a simple addition, these allocations will be made on the stack rather than the heap because we’re dealing with value types, but nevertheless: it’s not the fastest way.

Now let’s take a look at a different implementation of the Add() method.

public partial struct Vector3
{
    static void AddByRef(ref Vector3 a, ref Vector3 b, out Vector3 res)
    {
        res.X = a.X + b.X;
        res.Y = a.Y + b.Y;
        res.Z = a.Z + b.Z;
    }
}

Small changes, but big time differences:

Adding the ref keyword no longer copies the objects passed during the method call but passes a reference to them.
Using the out keyword that already exists for quite some time will avoid a new allocation, by storing the result directly in the destination object.

We got rid of these 4 allocations, fairly easy. The arguable trade-off here is not returning the result but using an out parameters, which is less convenient to use, but again, fast.

This implementation is not quite good yet, the ref keyword allows the AddByRef() method to modify the content of a and b (remember, they are references now), which is not appropriate in our case. This is why we should rely on the new in keyword instead, which passes a read-only reference of the object.

The correct implementation should be:

public partial struct Vector3
{
    static void AddByRef(in Vector3 a, in Vector3 b, out Vector3 res)
    {
        res.X = a.X + b.X;
        res.Y = a.Y + b.Y;
        res.Z = a.Z + b.Z;
    }
}

This is not the place of in-depth explanation about how the in keyword behaves, but be aware that you may not always get a performance improvement because of the so-called defensive copy mechanism.

The `RefArray` class

I’ve quickly developed a small class RefArray that wraps an array and allow access using the new ref keyword.

Here the implementation:

public class RefArray<T> where T : struct
{
    public RefArray (int initialSize = 16)
    {
        Count = 0;
        _data = new T[initialSize];
        _dataLength = initialSize;
    }

    public int Count { get ; private set; }

    public int Add(ref T data)
    {
        // Check grow
        CheckGrow();

        _data[Count] = data;

        return Count++;
    }

    public ref T this[int index]
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get
        {
            if (index < 0 || index >= _dataLength)
            {
                throw new IndexOutOfRangeException();
            }

            return ref _data[index];
        }
    }

    private void CheckGrow()
    {
        if (Count == _dataLength)
        {
            var newLength = (int)(_data.Length * 1.5f);
            var newArray = new T[newLength];
            _data.CopyTo(newArray, 0);
            _data = newArray;
            _dataLength = newLength;
        }
    }

    private T[] _data;
    private int _dataLength;
}

The code is fairly easy, internally it’s an array of T and you have two methods to interact with the array:

public int Add(ref T data) to add an item to the array.
public ref T this[int index] to retrieve a reference to the item (to access or modify it).

Let’s take a closer look at the array accessor implementation

public ref T this[int index]
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    get
    {
        if (index < 0 || index >= _dataLength)
        {
            throw new IndexOutOfRangeException();
        }

        return ref _data[index];
    }
}

You may notice something that wouldn’t be obvious to understand at first: there’s only one get method, no set!

The reason is simple, as a reference to the object is returned you don’t need a setter, you will modify the object directly.

RefArray is the class that I used in the benchmark of the post #2 of this series, you could elaborate something more feature complete, but it serves the primary purpose.

A concrete example of using the `Array` class

The struct version of the Stock type is using the Array class to store all the trades the stock owns.

public struct StockStruct
{
    public StockStruct(string name)
    {
        Name = name;
        _tradeArray = new RefArray<TradeStruct>();
    }

    public string Name { get;  }

    private readonly RefArray<TradeStruct> _tradeArray;
    
    public int TradeCount => _tradeArray.Count;

    public ref TradeStruct GetTrade(int index)
    {
        return ref _tradeArray[index];
    }

    public int AddTrade(ref TradeStruct trade)
    {
        return _tradeArray.Add(ref trade);
    }
}

As you can see the code is fairly simple, the Array class is encapsulated and we make sure we use the ref keyword to add/get Trades.

It worth to mention that Array is a class, so it’s stored in the heap, which is what we want, what matter is that all struct objects are stored sequentially in the the private T[] _data; field, which is what we’re looking for to speed things up.

The `ref readonly` pitfall

When passing or returning a reference to an object, you have the possibility to specify the readonly keyword to make sure the callee won’t be able to modify the given instance.

This will work fine with struct containing mainly public fields, as we demonstrated in the Vector3 struct. In this case the compiler can check that any attempt to modify any field and return an error during compilation.

If your struct is using properties, things get trickier, internally a property is a method, so you’re able to modify the content of a given instance even if you’re accessing a property through its getter.

As of today, the compiler reacts in such case with a pretty brutal approach: a copy of your readonly object is made and returned to the callee in order to make sure the initial object won’t be modified.

As we can see below, the benchmark run with a readonly version of the structure access is slower than both ref struct or struct copy !

General rules

Use a struct to store plain, publicly exposed data, if you get into a more complex type with properties, then working with struct may not be the best fit for you.
If you’re willing to expose read-only object, consider the readonly struct keywords during the declaration of a new type, it will be considered as immutable, then the compiler will stay away from defensive copies, ensuring you the best performances.
Profile! The theory is what it is: theory. I won’t replace the reality of a profiled piece of code! Sometime the copy of a struct will be faster than a reference, especially for small objects.

Immutability in C# is not managed as well as in C++, for instance, if you consider advanced scenarios with struct and readonly or in keywords, I strongly encourage you to thoroughly read the official documentation about ref/in/readonly struct keywords.

How to optimize .net development using .net Core 2.1 and C# 7.2

2018-04-02T15:29:45+00:00

Forewords

This is the first blog post of a series about understanding how to improve performances developing with .net core and C# 7.2.

Some parts are purely theory, so it’s not about .net or C# 7.2, but it’s mostly the first post of the series, for the reader to understand the basics of CPU and Memory.

This series is intended for any kind of readers, especially the ones who a not familiar with the topic and are willing to understand the basics.

For the experts on the matter, you may find these posts are lacking depth, but it’s on purpose: the goal is not to thoroughly explain everything, it would be too big and ending up confusing most readers, but instead explaining what matters, why it matters and how to deal with it.

Understanding the memory.
The benefits of working with struct.
Working with Data Stores.
Working with Memory and Span.

If you have remarks, typo corrections, or simply read posts still in progress, you can check my dedicated GitHub repo.

Introduction

It’s not a secret that Microsoft decided to focus on improving performances for the 2.1 release of .net core.

The main driver is to improve asp.net core but it doesn’t mean the new features only target the web server. Most of the time when it’s about performances you have to dig to the lowest layer in order to bring game changers and this time was no exception.

What is interesting, from my point of view, is that we’re starting to see some features that bring us closer to low level/high performance language such as C++.

The goal of this post series is to :

Explain what matters when we’re dealing with optimization.
How you can use the new features (and also some of the old ones) to improve the code speed while keeping things clean and well designed.

C# is about writing clean code to achieve high maintainability and meet good programming practices/standards. Writing optimized code often drives you away from these principles, finding the right balance is definitely a key aspect for the programmer.

Why .net is slower than C++ ?

Well, there are many reasons and I won’t detail all of them, mostly because I couldn’t, but there’s some of them we can focus on:

Seamless control over the lifetime of object through the use of Garbage Collection. It scares people who are performance/real-time driven.
No direct access to memory, through pointers and boundless checks (we are not considering unsafe .net, of course).
It is easy to not pay attention to the layout of the data.
A lot of implicit memory copy. Things are easy to develop, but under the hood you don’t realize all the memory bandwidth that is consumed.
A JIT that doesn’t generate code as efficient as a pre-compiled language.

C# is a pretty high level programming language, it’s pretty easy/safe to use, that’s why you have things like the bullets #1 to #4 above. On the other hand it’s also easy to not being aware of what matters to optimize things up.

Let’s not focus on the #5, because there’s few things we can do about it. If we take a close look at #1 to #4 we will see there’s a common theme: memory!

Is memory important? Yes, you bet!

A bit of talk about memory

CPUs are getting more and more powerful the years passing by, but we don’t see the same trend going on for memory, see below:

Computer Architecture: A Quantitative Approach by John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau

It means that in order to keep the CPU busy, we have to develop our code & data in a memory friendly way, because accessing data directly to memory will cost more than you may think!

There’s a very good analogy that you can read here that basically gives you crucial information.

Let’s summarize it.

Today, most of the CPU instructions that don’t involve memory access or very complex computation will take one cycle to execute, you have a 4Ghz CPU so it’s 4 billions instructions per second per logical core (so 32 billions for a hyper-threaded quad cores).

Let’s scale things to understand their impact better:

Access type	Real duration	Scaled duration
One CPU Cycle	0.4ns	1 second
Cache L1 Access	0.9ns	2 seconds
Cache L2 Access	2.8ns	7 seconds
Cache L3 Access	28ns	1 minute
Main memory Access	~100ns	4 minutes

Compared to one CPU cycle:

L1 access is 2x slower
L2 access is 7x
L3 access is 60x
Memory access is 250x slower!

So yes, you can understand that the more you are memory friendly (we’ll explain roughly what it implies) the better you’ll have chances to hit the CPU cache, getting you significant performance boost!

Put it differently, compared to main memory access:

A L1 access is 125 times faster.
A L2 access is 38 times faster.
A L3 is 3.5 times faster.

So worrying about the JIT not being fast enough may not be the main reason, you can leverage things yourself by being aware of what the CPU needs to execute as fast as possible.

Being memory friendly, means being cache friendly!

There are a lot of good, in-depth articles/posts out there explaining why the CPU cache is important and how to work with it. This topic can get really complex very quickly, here, again, we will try to keep things simple.

Few explanations/remarks:

Level 1 cache has dedicated cache for Data and Instructions (running assembly code), this is important because we don’t want one to compete against the other.
4 x 32KBytes, here the ‘4 x’ means we’ve a dedicated cache for each Core of the CPU: that’s right L1/L2 have dedicated caches for each CPU Core. ’32KBytes’ is the size of each for one CPU Core.
8-way is about ‘associativity’, which is a rather complex topic. Follow the link is you’re curious and brave!
Data in a CPU Cache are organized by ‘Line’ (or Block), which are now most of the time 64 bytes wide. It means that whatever you do, when a data is loaded in the cache, it will fill a whole Line of 64 bytes and the starting address will also be a multiple of 64 bytes (hence the importance of allocating memory with a starting address being a multiple of 64 bytes).
The CPU likes to prefetch data. Prefetch means that it will read ahead data that follows the one you’re accessing, hoping that you will access your data sequentially. Which is why it is a good thing to pack the data you often access at the same time in the same memory zone.

More about how a CPU cache works.

Enough of the theory, how could we make things faster in .net?

Minimizing the usage of the Garbage Collection

Yes, the GC is a very nice and handy feature, but as each feature, it’s not a silver bullet, it’s not something you have to rely on 100% of the time, and definitely not in .net! The GC is only used when you involve class based types, struct ones are not. So yes, there’re ways to minimize pressure on the GC and you should know about them!

Direct/fast memory access, avoiding copies

It’s easy to copy data, to isolate it for the sake of a good design (or easy and well readable code), it may not harm when the size is small and the frequency of the operation is low, but when one of these two factor increase, things amplify and performances are dropping.

One of the best example is the String class, it’s allocated on the heap and it’s immutable, which means all methods that change the string will return a new object! It’s a lot of memory traffic and the developer is most of the time not aware of this.

Luckily for us we have new weapons to improve things on this area.

Designing the data in a more memory friendly way

C# is a high-level language, we don’t pay attention to how we define the data in the types we design and it’s a big mistake when we want things to be driven by performances. Again, this is more about convenience, because the language don’t prevent you to improve things: you just don’t know/care to do it.

To be followed !

This was just the first post of the series and we talked mostly about theory, it was important to lay these foundations for the posts to come.

Starting the next post we’ll start talking concrete stuffs with examples.

Microservice or not microservice…

2018-01-28T09:44:21+00:00

It is always a good thing to benefit from the point of view of others, things are never either black or white and to find your way in this grey area that you’ll have to define is certainly not easy.
I really enjoyed reading this article from @dwmkerr because it highlight many good points, the article’s title was carefully chosen to generate some “hype”, it certainly delivered.

But I have to say that after this, I saw a wave of negative opinions (even if many people defended the concept by writing comments on Dave’s blog) toward Microservice and I thought it could be useful for me to share my experience on the matter.

The trigger for me to get back at blogging after so many years is this twitter post from Katrina Novakovic, which basically summarized Dave key arguments:

Complexity for developers, operators and devops
Requires expertise
Poorly defined boundaries of real world systems
Complexities of state and communication often ignored
Versioning can be hard
Monoliths in disguise
Distributed Transactions

Looking this way, I think it’d be harder to find more extreme than this point of view. Summarized this way make Microservice scary for sure!

Few facts

Silver bullet don’t exist, in the real world and also in the programming/architecture world. Microservices won’t be the solution to everything! Every time there’s a hyper on something, some people become “expert” at it and then try to push the concept to solve everything. It leads the less experienced people to believe than a given pattern is the ultimate one, a mere dream…Always ending the same way.
Doing complex stuffs is easy while achieving simplicity is very hard. One of my all-time favorite quote comes from Leonardo da Vinci:

“Simplicity is the ultimate sophistication” It became one of my mantra in my daily work, because when programming/architecture is concerned, something easy is really hard to achieve and when you create something complex you have to realize something: you’ve made it wrong.
Of course, we are not perfect, we will always make mistakes we won’t have the time to correct, but just acknowledging this is very important to improve yourself for the next opportunity you will have. Otherwise you will go deeper in the complexity, even embracing it, because you will feel “superior” compared to others “mortal people” that won’t be able to catch what you’ve done.
If you directly transitioned from a monolith architecture to a Microservice one : you will suffer! We have here almost two extremes, would you think it would be that easy to switch? Hell no!

Unfortunately a lot of people make this mistake, most of the time they don’t have the choice: they are young pro and certainly inherit from an old monolithic architecture and when they finally get the chance to start from a clean slate, they go toward the extreme opposite, not realizing what will be the consequences which will lead them to a very complex solution. For this reason…
Shifting to Microservice is hard, Rome wasn’t built in a day, neither your experience on the matter, although reading things may be able to save you from making some mistakes.

Things to realize

A Microservice architecture requires a lot of tooling and best practices to operate it correctly, if you’re not experienced nor ready in all of them, you will suffer:

A Continuous Delivery Chain (CDC) is required, if you don’t commit/build/test/push package in an automated and versioned fashion you will certainly fail.
One of the key principle to respect in architecture is “low coupling of components“.

You know, the thing that didn’t exist in your monolithic architecture which made you wanna die many times due to the famous butterfly effect: you touch one little thing at a given place and bam, you have regressions on other parts you never thought they were related.
If you don’t have experience designing and writing a low coupled architecture and code, then you will certainly fail.
I’m not a pattern freak, someone who live by the dogma at all cost, but my advice is: try to learn Domain Driven Design, if you don’t get it, it will be hard to switch to Microservice. The very small grained approach of Microservice will requires most of the same constraints, so it will certainly be a good starting point to familiarize yourself with DDD.
Having a Continuous Delivery Chain is a start, but it won’t be enough, having an orchestrator to automatically deploy, components to monitor the health and load of each service instance is almost mandatory.
Otherwise the operating cost (and complexity) will be certainly unbearable. As Dave Kerr mentioned in his article, there is a correlation between Microservice and DevOps. I would say more precisely: you can’t do Microservice if you’re not already good at DevOps.

My point of view, more detailed

Here is what I have to say on each of the main points of Dave’s article:

Complexity for developers, operators and #devops: nothing is complex when you master them. I stated above, you have to be good at DevOps principles if you want to have a chance. You won’t build a Microservice architecture in one day.
Requires expertise: well another open door to blast… Everything requires expertise, even a Monolithic architecture, it’s just that some things are easier to gain expertise at than others.
Poorly defined boundaries of real world systems: this is not relevant to Microservice, look at DDD first…
Complexities of state and communication often ignored: this one deserve its own explanation below.
Versioning can be hard: because versioning can be easy? More from this below.
Monoliths in disguise: Your Microservice architecture certainly doesn’t sound like mine, but I believe that failing at designing low coupled services, lack of knowledge at DDD and Versioning issues lead you to an architecture of many services that end up be tied up ones to others, then forcing you an update of most of your architecture every time you make a minor upgrade. Microservice architecture didn’t fail you, you failed it.
Distributed Transactions: more below…

Complexities of state and communication

For me, most of your Microservice architecture has to be stateless, easier said than done I agree, but you have to employ at least two layers in your architecture:

The top level one, which answers the request from your client, whatever it’s a rich desktop, web or third-party service, here you will take care of things like: authentication, authorization, session state management (using Redis, NCache, Geode or whatever suits you) and security based cross-cuttings. This layer will be the visible surface of your architecture, you certainly don’t have to expose the whole architecture to the rest of the world (yes, your rich desktop, web client are part of “another world” for the sake of achieving low-coupling.)
The rest of your architecture will be stateless services where you can communicate freely, without fear of security issues and always carrying the minimal state (which will always be a subset of the actual client’s state) to perform the operation.

Versioning can be hard

Yes, it’s always is… That’s why we have ALM, DevOps and tones of disciplines I won’t mention (everyone will be his/her favorite) to deal with this. But whatever you’re doing you have to realize/accept few things:

Backward compatibility is a must, it must be maintained at the Service’s Interface declaration at run-time level. If you’re from the .net world, this post will be helpful. If you fail to do so, yes, you will end up with a “monoliths in disguise”, but that’s the same of everything else: if you need to recompile the client when you make an upgrade at an interface it uses: you failed. You don’t need microservice to fail that, just two DLL talking are enough.
So please respect the SemVer principles and don’t be afraid of doing a brand new interface as a new Major Version when you will break backward compatibility. It should not harm your Microservice architecture if you have a CDC and everything that monitors, handles scale up and down through automated deployment (and cleanup). If nobody uses your V1 of a given service anymore, it will end up in your architecture with a couple of instances, running for almost nothing, being decommissioned eventually by a human or a machine.
Low-coupling, DDD, again, will be essential to success.

Distributed Transactions

I don’t see why distributed transaction are a requirement doing Microservices, I never used that and always find my way of doing things.

One of the key fundamental of Microservice is a very fine grained solution, a Service operation shouldn’t last forever executing, hence, when you really require transaction through many nested call, if you rely on synchronous call with each operation declaring its own transaction, then reporting as it should the success or failure of its execution, then the caller can commit or rollback its own transaction: no big deal.

If you start async everywhere, even when it’s not needed, ok, things are going to be tougher…

Conclusion

For me, server-side architecture and development is definitely challenging, you can do a very simple (and working at some extend) architecture that will be for on-premise solutions, but when you’ll go for SaaS you will definitely stumble upon things like: high availability, scaling, mutualized architecture, using PaaS solutions and it will be a whole different world.

Whatever the architecture you employ, if you don’t do it the right way, you will fail because the result will be overly complex. I agree that Microservice is a complex architecture so far, so jumping to this should be done with extreme caution. Is it a silver bullet? Nope, it won’t be the solution to everything, big companies rely on it because they have the skills, they can afford it and above all: they need it.

That being said I don’t rule out this architecture for smaller companies, it will be certainly harder for them, but there are more and more solutions that will assist you (read this for instance).

I always welcome all the point of views because, even if the tone of this post doesn’t suggest it at all, I will always be open minded and learn from others. The hype around new techs in our industry is something that bring us as many good things as bad ones. One day a given tech is the solution to everything, the day after it will be crucified by everyone, for the sake of a new tech, most of the time.

Microservice are being crucified and we’re entering the era of serverless architecture: the true, first, and only one silver bullet that will ease all our pain!

Does it look like something we’ve already saw before? Hum… 🙂

Nockawa’s Blog

A Database You Can See

See it first

DataGrip meets a flight recorder

From a file to a model you can walk

One click, every view

Real data, decoded

The lesson: tools need design too

What’s next

Three Durability Modes, One WAL: Configurable Guarantees for Different Workloads

The other classical knob

The three modes

One writer thread, three signaling patterns

Per-UoW, not per-engine — why

Numbers that matter

What I got wrong

What’s next

Deadlock-Free by Construction: How Typhon Eliminates Deadlocks Instead of Detecting Them

The upfront bet

What a deadlock actually is

Pillar 1: MVCC eliminates inter-transaction data locks

Pillar 2: Optimistic Lock Coupling for index structures

Pillar 3: No cross-table latch holding

What the bet costs

What others do

What I’d flag for a reviewer

What’s next

Microsecond Latency in a Managed Language: The Performance Philosophy Behind Typhon

Principle 1: Control Memory Layout

Principle 2: Eliminate Allocations on Hot Paths

Principle 3: Reduce Memory Indirections

Principle 4: Let the JIT Help

Principle 5: Design for the Hardware

The Numbers

Trade-offs

What’s Next

What Game Engines Know About Data That Databases Forgot

Two Fields, One Problem

What We Learned From Game Engines

What We Learned From Databases

Purpose-Built for Game Servers

Storage Modes: Not All Data Is Equal

Views: The Bridge Between ECS Systems and Database Queries

Trade-offs

What’s Next

Why I’m Building a Database Engine in C#

The Case Against C# (Let’s Steel-Man It)

What Most People Don’t Know About C#

What Typhon Actually Looks Like

Hardware-accelerated WAL checksums

The SIMD chunk accessor

Zero-copy entity reads

JIT-specialized hash functions

The Productivity Argument

It’s the Memory Layout, Not the Language

The Numbers

Trade-offs

What’s Next

Introduction of working with struct

Introduction

Struct, struct and more struct!

Benchmark of class versus struct

Diagram of the class version

Diagram of the struct version

The program

Let’s explain a bit

Memory layout for the struct version

Memory layout for the class version

A design choice to make

UPDATE #1 on April the 5th

Working with struct, a closer look

Introduction

Quick recap of the ref and in keywords

The RefArray class

A concrete example of using the Array class

The ref readonly pitfall

General rules

How to optimize .net development using .net Core 2.1 and C# 7.2

Forewords

Introduction

Memory layout for the `struct` version

Memory layout for the `class` version

Quick recap of the `ref` and `in` keywords

The `RefArray` class

A concrete example of using the `Array` class

The `ref readonly` pitfall